Multiclass Sparse Discriminant Analysis
Qing Mai∗, Yi Yang†, Hui Zou‡

First version: May 30, 2014. This version: April 17, 2015.
Abstract
In recent years many sparse linear discriminant analysis methods have been proposed for high-dimensional classification and variable selection. However, most of these proposals focus on binary classification and they are not directly applicable to multiclass classification problems. There are two sparse discriminant analysis methods that can handle multiclass classification problems, but their theoretical justifications remain unknown. In this paper, we propose a new multiclass sparse discriminant analysis method that estimates all discriminant directions simultaneously. We show that when applied to the binary case our proposal yields a classification direction that is equivalent to those of two successful binary sparse LDA methods in the literature. An efficient algorithm is developed for computing our method with high-dimensional data. Variable selection consistency and rates of convergence are established under the ultrahigh dimensionality setting. We further demonstrate the superior performance of our proposal over the existing methods on simulated and real data.

Keywords: Discriminant analysis; High dimensional data; Variable selection; Multiclass classification; Rates of convergence.

∗ Department of Statistics, Florida State University ([email protected])
† School of Statistics, University of Minnesota ([email protected]). Mai and Yang are joint first authors.
‡ Corresponding author, School of Statistics, University of Minnesota ([email protected])
Introduction

In multiclass classification we have a pair of random variables (Y, X), where X ∈ R^p and Y ∈ {1, . . . , K}. We need to predict Y based on X. Define π_k = Pr(Y = k). The linear discriminant analysis model states that

X | (Y = k) ∼ N(µ_k, Σ), k ∈ {1, 2, . . . , K}. (1)

Under (1), the Bayes rule can be explicitly derived as

Ŷ = arg max_k {(X − µ_k/2)^T β_k + log π_k}, (2)

where β_k = Σ^{-1} µ_k for k = 1, . . . , K. Linear discriminant analysis has been observed to perform very well on many low-dimensional datasets (Michie et al. 1994, Hand 2006). However, it may not be suitable for high-dimensional datasets for at least two reasons. First, it is obvious that linear discriminant analysis cannot be applied if the dimension p exceeds the sample size n, because the sample covariance matrix will be singular. Second, Fan & Fan (2008) showed that even if the true covariance matrix is an identity matrix and we know this fact, a classifier involving all the predictors will be no better than random guessing.

In recent years, many high-dimensional generalizations of linear discriminant analysis have been proposed (Tibshirani et al. 2002, Trendafilov & Jolliffe 2007, Clemmensen et al. 2011, Fan & Fan 2008, Wu et al. 2008, Shao et al. 2011, Cai & Liu 2011, Witten & Tibshirani 2011, Mai et al. 2012, Fan et al. 2012). In the binary case, the discriminant direction is β = Σ^{-1}(µ_2 − µ_1). One can seek sparse estimates of β to generalize linear discriminant analysis to high-dimensional classification. Indeed, this is the common feature of three popular sparse discriminant analysis methods: the linear programming discriminant (Cai & Liu 2011), the regularized optimal affine discriminant (Fan et al. 2012) and the direct sparse discriminant analysis (Mai et al. 2012). The linear programming discriminant finds a sparse estimate by the Dantzig selector (Candes & Tao 2007); the regularized optimal affine discriminant adds the lasso penalty (Tibshirani 1996) to Fisher's discriminant analysis; and the direct sparse discriminant analysis derives the sparse discriminant direction via a penalized least squares formulation. The three methods can detect the important predictors and consistently estimate the classification rule with overwhelming probability in the presence of ultrahigh dimensionality. However, they are explicitly designed for binary classification and do not handle the multiclass case naturally.

Two popular multiclass sparse discriminant analysis proposals are the ℓ_1 penalized Fisher's discriminant (Witten & Tibshirani 2011) and the sparse optimal scoring (Clemmensen et al. 2011). However, these two methods do not have theoretical justifications. It is generally unknown whether they can select the true variables with high probability, how close their estimated discriminant directions are to the true directions, and whether the final classifier performs similarly to the Bayes rule.

Therefore, it is desirable to have a new multiclass sparse discriminant analysis method that is conceptually intuitive, computationally efficient and theoretically sound. To this end, we propose a new sparse discriminant method for high-dimensional multiclass problems. We show that our proposal not only has competitive empirical performance but also enjoys strong theoretical properties under ultrahigh dimensionality. In Section 2 we introduce the details of our proposal after briefly reviewing the two existing proposals. We also develop an efficient algorithm for our method. Theoretical results are given in Section 3. In Section 4 we use simulations and a real data example to demonstrate the superior performance of our method over sparse optimal scoring (Clemmensen et al. 2011) and ℓ_1 penalized Fisher's discriminant (Witten & Tibshirani 2011). Technical proofs are in an Appendix.
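For concreteness, the following minimal numerical sketch (ours, not part of the original paper; all settings are illustrative) simulates data from model (1) and applies the Bayes rule (2):

    import numpy as np

    rng = np.random.default_rng(0)
    p, K, n = 10, 3, 300
    mu = rng.normal(size=(K, p))                  # class means mu_k
    Sigma = 0.5 * np.eye(p) + 0.5                 # a common covariance matrix, CS(0.5)
    pi = np.full(K, 1.0 / K)                      # class probabilities pi_k

    # draw (Y, X) from model (1)
    Y = rng.integers(0, K, size=n)
    X = mu[Y] + rng.multivariate_normal(np.zeros(p), Sigma, size=n)

    # Bayes rule (2): argmax_k (X - mu_k/2)^T beta_k + log pi_k, with beta_k = Sigma^{-1} mu_k
    beta = np.linalg.solve(Sigma, mu.T).T         # rows are beta_k
    scores = X @ beta.T - 0.5 * np.sum(mu * beta, axis=1) + np.log(pi)
    Y_hat = scores.argmax(axis=1)
    print("accuracy of the Bayes rule on simulated data:", np.mean(Y_hat == Y))

When p exceeds n, the analogous plug-in rule cannot even be formed because the pooled sample covariance matrix is singular, which is precisely the difficulty addressed in the rest of the paper.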
The Bayes rule under a linear discriminant analysis model is Ŷ = arg max_k {(X − µ_k/2)^T β_k + log π_k}, where β_k = Σ^{-1} µ_k for k = 1, . . . , K. Let θ_k^Bayes = β_k − β_1 for k = 1, . . . , K. Then the Bayes rule can be written as

Ŷ = arg max_k {(θ_k^Bayes)^T (X − (µ_1 + µ_k)/2) + log(π_k/π_1)}. (3)

We refer to the directions θ^Bayes = (θ_2^Bayes, . . . , θ_K^Bayes) ∈ R^{p×(K−1)} as the discriminant directions.

We briefly review two existing multiclass sparse discriminant methods: the sparse optimal scoring (Clemmensen et al. 2011) and the ℓ_1 penalized Fisher's discriminant (Witten & Tibshirani 2011). Instead of estimating θ^Bayes directly, these two methods estimate a set of directions η = (η_1, . . . , η_{K−1}) ∈ R^{p×(K−1)} such that η spans the same linear subspace as θ^Bayes, and hence linear discriminant analysis on X^T η will be equivalent to (3) on the population level. More specifically, these two methods look for estimates of η = (η_1, . . . , η_{K−1}) in Fisher's discriminant analysis:

η_k = arg max_{η_k} η_k^T Σ_b η_k, s.t. η_k^T Σ η_k = 1, η_k^T Σ η_l = 0 for l < k, (4)

where Σ_b = K^{-1} Σ_{k=1}^K (µ_k − µ̄)(µ_k − µ̄)^T with µ̄ = K^{-1} Σ_k µ_k.

With a little abuse of terminology, we refer to η as discriminant directions as well. To find η, define Y^{dm} as an n × K matrix of dummy variables with Y^{dm}_{ik} = 1(Y_i = k).

In addition to the discriminant directions η_k, sparse optimal scoring creates K − 1 vectors of scores α_1, . . . , α_{K−1} ∈ R^K. Then, for k = 1, . . . , K − 1, sparse optimal scoring estimates η_k sequentially. In each step, sparse optimal scoring finds α̂_k and η̂_k^{SOS}. Suppose the first k − 1 score vectors α̂_l, l < k, and discriminant directions η̂_l^{SOS}, l < k, are available. Then sparse optimal scoring finds (α̂_k, η̂_k^{SOS}) by solving the following problem:

(α̂_k, η̂_k^{SOS}) = arg min_{α_k, η_k} ‖Y^{dm} α_k − X̃ η_k‖^2 + λ ‖η_k‖_1 (5)
s.t. n^{-1} α_k^T (Y^{dm})^T Y^{dm} α_k = 1, α_k^T (Y^{dm})^T Y^{dm} α̂_l = 0 for any l < k,

where X̃ is the centered data matrix and λ is a tuning parameter. Sparse optimal scoring is closely related to (4), because when the dimension is low, the unpenalized version of (5) gives the same directions (up to a scalar) as (4) with the parameters Σ_b and Σ substituted by their sample estimates. Therefore, with the ℓ_1 penalty, sparse optimal scoring gives sparse approximations to η. Note that the constraint α_k^T (Y^{dm})^T Y^{dm} α̂_l = 0, l < k, indicates that (α̂_k, η̂_k^{SOS}) depends on the knowledge of (α̂_l, η̂_l^{SOS}), l < k. This is why we say that sparse optimal scoring adopts a sequential approach to estimating the discriminant directions.

The ℓ_1 penalized Fisher's discriminant analysis estimates η_k by

η̂_k = arg max_{η_k} {η_k^T Σ̂_b^k η_k − λ_k Σ_j |σ̂_j η_{kj}|}, s.t. η_k^T Σ̃ η_k ≤ 1, for k = 1, . . . , K − 1,
where λ_k are tuning parameters, σ̂_j is the (j, j)th element of the sample estimate of Σ, Σ̃ is a positive definite estimate of Σ,

Σ̂_b^k = X^T Y^{dm} ((Y^{dm})^T Y^{dm})^{-1/2} Ω_k ((Y^{dm})^T Y^{dm})^{-1/2} (Y^{dm})^T X, (6)

and Ω_k is the identity matrix if k = 1 and otherwise an orthogonal projection matrix whose column space is orthogonal to ((Y^{dm})^T Y^{dm})^{-1/2} (Y^{dm})^T X η̂_l for all l < k. Again, if the dimension is low, the unpenalized version of (6) is equivalent to (4) with the parameters replaced by their sample estimates. Since Ω_k relies on η̂_l for all l < k, the ℓ_1 penalized Fisher's discriminant analysis also finds the discriminant directions sequentially.

Good empirical results have been reported in support of the ℓ_1 penalized Fisher's discriminant analysis and the sparse optimal scoring. However, it is unknown whether either of these two classifiers is consistent when more than two classes are present. Moreover, both sparse optimal scoring and ℓ_1 penalized Fisher's discriminant analysis estimate the discriminant directions sequentially. We believe a better multiclass sparse discriminant analysis algorithm should be able to estimate all discriminant directions simultaneously, just like the classical linear discriminant analysis. We aim to develop a new computationally efficient multiclass sparse discriminant analysis method that enjoys strong theoretical properties under ultrahigh dimensionality. Such a method can be viewed as a natural multiclass counterpart of the three binary sparse discriminant methods in Mai et al. (2012), Cai & Liu (2011) and Fan et al. (2012).

To motivate our method, we first discuss the implication of sparsity in the multiclass problem. Note that, by (3), the contribution of the jth variable (X_j) vanishes if and only if

θ_{2j}^Bayes = · · · = θ_{Kj}^Bayes = 0. (7)

Let D = {j : condition (7) does not hold}. Note that whether an index j belongs to D depends on θ_{kj} for all k. This is because θ_{kj}^Bayes, k = 2, . . . , K, are related to each other, as they are coefficients for the same predictor. In other words, θ_{kj}^Bayes, k = 2, . . . , K, are naturally grouped according to j. The sparsity assumption then states that |D| ≪ p, which we refer to as the common sparsity structure.

Our proposal begins with a convex optimization formulation of the Bayes rule of the multiclass linear discriminant analysis model. Recall that θ_k^Bayes = Σ^{-1}(µ_k − µ_1) for k = 2, . . . , K. On the population level, we have

(θ_2^Bayes, . . . , θ_K^Bayes) = arg min_{θ_2,...,θ_K} Σ_{k=2}^K {(1/2) θ_k^T Σ θ_k − (µ_k − µ_1)^T θ_k}. (8)

In the classical low-dimension-large-sample-size setting, we can estimate (θ_2^Bayes, . . . , θ_K^Bayes) via an empirical version of (8):

(θ̂_2, . . . , θ̂_K) = arg min_{θ_2,...,θ_K} Σ_{k=2}^K {(1/2) θ_k^T Σ̂ θ_k − (µ̂_k − µ̂_1)^T θ_k}, (9)

where Σ̂ = (n − K)^{-1} Σ_{k=1}^K Σ_{Y_i=k} (X_i − µ̂_k)(X_i − µ̂_k)^T, µ̂_k = n_k^{-1} Σ_{Y_i=k} X_i and n_k is the sample size within Class k. The solution to (9) gives us the classical multiclass linear discriminant classifier.

For presentation purposes, write θ_{·j} = (θ_{2j}, . . . , θ_{Kj})^T and define ‖θ_{·j}‖ = (Σ_{k=2}^K θ_{kj}^2)^{1/2}. For the high-dimensional case, we propose the following penalized formulation for multiclass sparse discriminant analysis:

(θ̂_2, . . . , θ̂_K) = arg min_{θ_2,...,θ_K} Σ_{k=2}^K {(1/2) θ_k^T Σ̂ θ_k − (µ̂_k − µ̂_1)^T θ_k} + λ Σ_{j=1}^p ‖θ_{·j}‖, (10)

where λ is a tuning parameter. It is clear that (10) is based on (9). In (10) we have used the group lasso (Yuan & Lin 2006) to encourage the common sparsity structure. Let D̂ = {j : θ̂_{kj} ≠ 0 for some k}, which denotes the set of selected variables for the multiclass classification problem. We will show later that, with high probability, D̂ equals D. One can also use a group version of a nonconvex penalty (Fan & Li 2001) or an adaptive group lasso penalty (Bach 2008) to replace the group lasso penalty in (10). To fix the main idea, we do not pursue this direction here.
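The sample quantities Σ̂ and δ̂_k = µ̂_k − µ̂_1 that enter (9) and (10) are straightforward to compute; the following short numpy sketch (our own illustration, with labels coded 1, . . . , K) is one way to do it:

    import numpy as np

    def lda_sample_estimates(X, Y, K):
        """X: (n, p) data matrix; Y: (n,) labels coded 1..K."""
        n, p = X.shape
        mu_hat = np.vstack([X[Y == k].mean(axis=0) for k in range(1, K + 1)])
        Sigma_hat = np.zeros((p, p))
        for k in range(1, K + 1):
            R = X[Y == k] - mu_hat[k - 1]         # within-class residuals
            Sigma_hat += R.T @ R
        Sigma_hat /= (n - K)                      # pooled within-class covariance, as in (9)
        delta_hat = mu_hat[1:] - mu_hat[0]        # rows are delta_hat_k = mu_hat_k - mu_hat_1, k = 2..K
        return Sigma_hat, delta_hat, mu_hat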
After obtaining θ̂_k, k = 2, . . . , K, we fit the classical multiclass linear discriminant analysis on (X^T θ̂_2, . . . , X^T θ̂_K), as in sparse optimal scoring and ℓ_1 penalized Fisher's discriminant analysis. We repeat the procedure for a sequence of λ values and pick the one with the smallest cross-validation error rate.

We would like to remark that our proposal is derived from a different angle than sparse optimal scoring and ℓ_1 penalized Fisher's discriminant analysis. Both sparse optimal scoring and ℓ_1 penalized Fisher's discriminant analysis penalize a formulation related to Fisher's discriminant analysis in (4), while our method directly estimates the Bayes rule. This different angle leads to considerable convenience in both computation and theoretical studies. Yet we can easily recover the directions defined by Fisher's discriminant analysis after applying our method. See Section A.1 for details.

Connections with existing binary sparse LDA methods

Although our proposal is primarily motivated by the multiclass classification problem, it can be directly applied to the binary classification problem as well, by simply letting K = 2 in the formulation (10). It turns out that the binary special case of our proposal has very intimate connections with some proven successful binary sparse LDA methods in the literature. We elaborate on this point in what follows.

When K = 2, (10) reduces to

θ̂^MSDA(λ) = arg min_θ (1/2) θ^T Σ̂ θ − (µ̂_2 − µ̂_1)^T θ + λ ‖θ‖_1. (11)

Considering the Dantzig selector formulation of (11), we have the following constrained ℓ_1 minimization estimator:

θ̂ = arg min_θ ‖θ‖_1 s.t. ‖Σ̂ θ − (µ̂_2 − µ̂_1)‖_∞ ≤ λ. (12)

The above estimator is exactly the linear programming discriminant (LPD) of Cai & Liu (2011). Moreover, we compare (11) with another two well-known sparse discriminant analysis proposals for binary classification: the regularized optimal affine discriminant (ROAD) (Fan et al. 2012) and the direct sparse discriminant analysis (DSDA) (Mai et al. 2012). Denote the estimates of the discriminant directions given by ROAD and DSDA as θ̂^ROAD and θ̂^DSDA, respectively. Then we have

θ̂^ROAD(λ) = arg min_θ (1/2) θ^T Σ̂ θ + λ ‖θ‖_1 s.t. θ^T(µ̂_2 − µ̂_1) = 1, (13)

θ̂^DSDA(λ) = arg min_θ Σ_i (Y_i − θ_0 − X_i^T θ)^2 + λ ‖θ‖_1, (14)

where in (14) the class labels Y_i are numerically coded and the minimization is also over the intercept θ_0. We derive the following proposition to reveal the connections between our proposal (K = 2) and ROAD and DSDA.
Proposition 1. Define c_1(λ) = θ̂^MSDA(λ)^T(µ̂_2 − µ̂_1), c_2(λ) = θ̂^DSDA(λ)^T(µ̂_2 − µ̂_1) and a = n|c_2(λ)|/|c_1(λ)|. Then we have

θ̂^MSDA(λ) = c_1(λ) θ̂^ROAD(λ/|c_1(λ)|), (15)

θ̂^MSDA(λ) = {c_1(λ)/c_2(aλ)} θ̂^DSDA(aλ). (16)

Proposition 1 shows that the classification direction given by our proposal is identical to a classification direction by ROAD and a classification direction by DSDA. Consequently, our proposal (K = 2) has the same solution path as ROAD and DSDA.

Besides their solid theoretical foundations, LPD, ROAD and DSDA all enjoy computational efficiency. In particular, DSDA's computational complexity is the same as fitting a lasso linear regression model. In this section we show that our proposal for the multiclass problem can be solved by a very efficient algorithm. In light of this and Proposition 1, our proposal can be regarded as the natural multiclass generalization of these successful binary sparse LDA methods.

We now present the efficient algorithm for solving (10). For convenience, write δ̂_k = µ̂_k − µ̂_1. Our algorithm is based on the following lemma.

Lemma 1.
Given {θ_{·j′}, j′ ≠ j}, the solution of θ_{·j} to (10) is given by

arg min_{θ_{·j}} Σ_{k=2}^K (1/2)(θ_{kj} − θ̃_{kj})^2 + (λ/σ̂_{jj}) ‖θ_{·j}‖, (17)

where θ̃_{kj} = (δ̂_{kj} − Σ_{l≠j} σ̂_{lj} θ_{kl})/σ̂_{jj}. Let θ̃_{·j} = (θ̃_{2j}, . . . , θ̃_{Kj})^T and ‖θ̃_{·j}‖ = (Σ_{k=2}^K θ̃_{kj}^2)^{1/2}. The solution to (17) is given by

θ̂_{·j} = θ̃_{·j} (1 − λ/(σ̂_{jj} ‖θ̃_{·j}‖))_+. (18)

Based on Lemma 1 we use the following blockwise-descent algorithm to implement our multiclass sparse discriminant analysis.

Algorithm 1 (Multiclass sparse discriminant analysis for a given penalization parameter).
1. Compute Σ̂ and δ̂_k, k = 2, . . . , K;

2. Initialize θ̂_k^{(0)} and compute θ̃_k^{(0)} accordingly;

3. For m = 1, 2, . . ., do the following loop until convergence: for j = 1, . . . , p,
(a) compute θ̂_{·j}^{(m)} = θ̃_{·j}^{(m−1)} (1 − λ/(σ̂_{jj} ‖θ̃_{·j}^{(m−1)}‖))_+;
(b) update θ̃_{kj} = (δ̂_{kj} − Σ_{l≠j} σ̂_{lj} θ̂_{kl}^{(m)})/σ̂_{jj}.
4. Let θ̂_k be the solution at convergence. The output classifier is the usual linear discriminant classifier on (X^T θ̂_2, . . . , X^T θ̂_K).

We have implemented our method in an R package msda, which is available on CRAN. Our package also handles the version of (10) with an adaptive group lasso penalty, because both Lemma 1 and Algorithm 1 can be easily generalized to handle the adaptive group lasso penalty.
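To make the updates concrete, here is a rough numpy sketch of Algorithm 1 (the groupwise soft-thresholding step (18) inside a blockwise coordinate descent loop). It is our own illustration under the notation above; the variable names, convergence rule and tolerances are our choices, and the CRAN package msda is the reference implementation:

    import numpy as np

    def msda_fit(Sigma_hat, delta_hat, lam, max_iter=200, tol=1e-6):
        """Sigma_hat: (p, p); delta_hat: (K-1, p), rows delta_hat_k; lam: penalty level."""
        Km1, p = delta_hat.shape
        theta = np.zeros((Km1, p))                # theta_k, k = 2..K, stored as rows
        diag = np.diag(Sigma_hat)
        for _ in range(max_iter):
            max_change = 0.0
            for j in range(p):
                # theta_tilde_kj = (delta_hat_kj - sum_{l != j} sigma_hat_lj * theta_kl) / sigma_hat_jj
                r = delta_hat[:, j] - theta @ Sigma_hat[:, j] + theta[:, j] * diag[j]
                theta_tilde = r / diag[j]
                norm = np.linalg.norm(theta_tilde)
                # groupwise soft-thresholding, step 3(a) / equation (18)
                shrink = 0.0 if norm == 0 else max(0.0, 1.0 - lam / (diag[j] * norm))
                new = theta_tilde * shrink
                max_change = max(max_change, float(np.max(np.abs(new - theta[:, j]))))
                theta[:, j] = new
            if max_change < tol:
                break
        return theta                              # rows are the fitted directions theta_hat_k

As a usage sketch, one can pass the Sigma_hat and delta_hat from the earlier estimation snippet, run msda_fit over a grid of lam values, and then apply ordinary linear discriminant analysis to the projected data X @ theta.T, mirroring the procedure described above.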
Theory
In this section we study the theoretical properties of our proposal under the setting where p can be much larger than n. Under regularity conditions we show that our method can consistently select the true subset of variables and at the same time consistently estimate the Bayes rule.

We begin with some useful notation. For a vector α, ‖α‖_∞ = max_j |α_j| and ‖α‖_1 = Σ_j |α_j|, while, for a matrix Ω ∈ R^{m×n}, ‖Ω‖_∞ = max_i Σ_j |ω_{ij}| and ‖Ω‖_1 = max_j Σ_i |ω_{ij}|. Define φ = max{‖Σ_{D^c,D}‖_∞, ‖Σ_{D,D}^{-1}‖_∞}, Δ = max_k{‖µ_k‖_∞, ‖θ_k‖_1}, θ_min = min_{(k,j): θ_{kj}≠0} |θ_{kj}|, θ_max = max_{(k,j)} |θ_{kj}|, and η* = ‖Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞. Let d be the cardinality of D.

Define t_D ∈ R^{d×(K−1)} as the subgradient of the group lasso penalty at the true θ_D, and assume the following condition:

(C0) max_{j∈D^c} {Σ_{k=2}^K (Σ_{j,D} Σ_{D,D}^{-1} t_{k,D})^2}^{1/2} = κ < 1.

Condition (C0) is required to guarantee the selection consistency. A condition similar to (C0) has been used to study the group lasso penalized regression model (Bach 2008).

We further let φ, Δ, η*, κ be fixed and assume the following regularity conditions:

(C1) There exist c_1, C_1 such that c_1/K ≤ π_k ≤ C_1/K for k = 1, . . . , K, and θ_max/θ_min < C_2;

(C2) n, p → ∞ and d^2 log(pd)/n → 0;

(C3) θ_min ≫ {d^2 log(pd)/n}^{1/2};

(C4) min_{k≠k′} {(θ_k − θ_{k′})^T Σ (θ_k − θ_{k′})}^{1/2} is bounded away from 0.

Condition (C1) guarantees that we will have a decent sample size for each class. Condition (C2) requires that p cannot grow too fast with respect to n. This condition is very mild, because it allows p to grow at a nonpolynomial rate of n: in particular, if d = O(n^{1/2−α}), then condition (C2) is satisfied as long as log p = o(n^{2α}). Condition (C3) requires that the nonzero coefficients are not too small, which is a common assumption in the literature; note that the lower bound it imposes on θ_min tends to 0, so θ_min itself is allowed to vanish asymptotically. Condition (C4) is required so that all the classes can be separated from each other; if condition (C4) is violated, even the Bayes rule cannot work well.

In the following theorems, we let C denote a generic positive constant that can vary from place to place.

Theorem 1.
1. Under conditions (C0)–(C1), there exists a generic constant M such that, if λ < min{θ_min/φ, M(1 − κ)}, then, for any ε < min{φ^{-1}, λ/(φΔ)}, with probability greater than

1 − Cpd exp(−Cnε^2/(Kd^2)) − CK exp(−Cn/K^2) − Cp(K − 1) exp(−Cnε^2/K), (19)

we have D̂ = D and ‖θ̂_k − θ_k^Bayes‖_∞ ≤ φλ for k = 2, . . . , K.

2. If we further assume conditions (C2)–(C3), then for any λ with {d^2 log(pd)/n}^{1/2} ≪ λ ≪ θ_min, with probability tending to 1 we have D̂ = D and ‖θ̂_k − θ_k^Bayes‖_∞ ≤ φλ for k = 2, . . . , K.

Next, we show that our proposal is a consistent estimator of the Bayes rule in terms of the misclassification error rate. Define R_n = Pr(Ŷ(θ̂_k, π̂_k, k = 1, . . . , K) ≠ Y | observed data), where Ŷ(θ̂_k, π̂_k, k = 1, . . . , K) is the prediction by our method, and define R as the Bayes error. Then we have the following conclusions.

Theorem 2.
1. Under conditions (C0)–(C1), there exists a generic constant M such that, if λ < min{θ_min/φ, M(1 − κ)}, then, for any ε < min{φ^{-1}, λ/(φΔ)}, with probability greater than

1 − Cpd exp(−Cnε^2/(Kd^2)) − CK exp(−Cn/K^2) − Cp(K − 1) exp(−Cnε^2/K), (20)

we have

|R_n − R| ≤ M_2 λ^{1/2}, (21)

for some generic constant M_2.

2. Under conditions (C0)–(C4), if λ → 0, then with probability tending to 1 we have R_n → R.

Remark 2. Based on our proof we can further derive the asymptotic results by letting K (the number of classes) diverge with n to infinity. We would only need more cumbersome notation and bounds, but the analysis remains pretty much the same. To show a clearer picture of the theory, we have focused on the fixed-K case.

We demonstrate our proposal by simulation. For comparison, we include the sparse optimal scoring and ℓ_1 penalized Fisher's discriminant analysis in the simulation study. Six simulation models are considered, where the dimension is p = 800 and the training set has sample size n = 75K, with K the number of classes in each model. We generate a validation set of size n to select the tuning parameters and a testing set of size 1000 for each method. Recall that β_k = Σ^{-1}µ_k. We specify β_k and Σ as in the following models and then let µ_k = Σβ_k. For simplicity, we say that a matrix Σ has the AR(ρ) structure if σ_{jk} = ρ^{|j−k|} for j, k = 1, . . . , p; on the other hand, Σ has the CS(ρ) structure if σ_{jk} = ρ for any j ≠ k and σ_{jj} = 1 for j = 1, . . . , p. (A data-generation sketch for designs of this type is given after Table 1.)

Model 1: K = 4, β_{jk} = 1. for j = 2k − 1, 2k, k = 1, . . . , K, and β_{jk} = 0 otherwise. The covariance matrix Σ has the AR(.) structure.

Model 2: K = 6, β_{jk} = 2. for j = 2k − 1, 2k, k = 1, . . . , K, and β_{jk} = 0 otherwise. The covariance matrix Σ = I ⊗ Ω, where Ω has the CS(.) structure.

Model 3: K = 4, β_{jk} = k + u_{jk} for j = 1, . . . , K, where u_{jk} follows the uniform distribution over a symmetric interval around 0; β_{jk} = 0 otherwise. The covariance matrix Σ has the CS(.) structure.

Model 4: K = 4, β_{jk} = k + u_{jk} for the first few coordinates j, where u_{jk} follows the uniform distribution over a symmetric interval around 0; β_{jk} = 0 otherwise. The covariance matrix Σ has the CS(.) structure.

Model 5: K = 4. The nonzero coefficients are arranged in runs of consecutive coordinates: one run of entries equal to 1., one run equal to −., another run equal to 1., and a final stretch in which the entries alternate between −. and 1.; β_{jk} = 0 otherwise. The covariance matrix Σ has the AR(.) structure.

Model 6: K = 4. The nonzero coefficients follow the same type of pattern as in Model 5 (runs of equal entries followed by an alternating stretch), placed at different positions; β_{jk} = 0 otherwise. The covariance matrix Σ has the AR(.) structure.

The error rates of these methods are listed in Table 1. To compare variable selection performance, we report the number of correctly selected variables (C) and the number of incorrectly selected variables (IC) for each method. We want to highlight two observations from Table 1. First, our method is the best across all six models. Second, our method is a very good approximation of the Bayes rule in terms of both sparsity and misclassification error rate. Although our method tends to select a few more variables besides the true ones, this can be improved by using the adaptive group lasso penalty (Bach 2008). Because the other two methods do not use the adaptive lasso penalty, we do not include the results of our method with the adaptive group lasso penalty, for a fair comparison.

Model  Measure    Bayes         Our           Witten        Clemmensen
1      Error(%)   11.0 (0.06)   12.4 (0.07)   15.5 (0.07)   13 (0.06)
1      C          8             8 (0)         8 (0)         8 (0)
1      IC         0             10 (0.6)      126 (4.9)     5 (0.4)
2      Error(%)   13.3 (0.05)   15.2 (0.07)   31.7 (0.20)   17 (0.08)
2      C          12            12 (0)        12 (0)        12 (0)
2      IC         0             15 (0.7)      19.5 (1.5)    16 (0.3)
3      Error(%)   8.8 (0.06)    9.4 (0.09)    14.1 (0.06)   12.7 (0.08)
3      C          4             4 (0)         4 (0)         4 (0)
3      IC         0             3 (0.4)       796 (0)       30 (0.2)
4      Error(%)   5.3 (0.06)    5.7 (0.08)    7 (0.05)      7.6 (0.07)
4      C          4             4 (0)         4 (0)         4 (0)
4      IC         0             4 (0.5)       796 (0)       30 (2.2)
5      Error(%)   8.3 (0.05)    9.5 (0.07)    17.9 (0.14)   13.6 (0.09)
5      C          8             8 (0)         8 (0)         8 (0)
5      IC         0             6 (0.9)       97 (2.8)      4 (0.5)
6      Error(%)   14.2 (0.06)   17.4 (0.08)   23.4 (0.09)   24.8 (0.09)
6      C          8             8 (0.0)       8 (0)         6 (0.1)
6      IC         0             0 (0)         4 (0.5)       3 (0.3)

Table 1: Simulation results for Models 1–6. The two competing methods are denoted by the first author of the original papers: Witten's method is the ℓ_1 penalized Fisher's discriminant analysis, and Clemmensen's method is the sparse optimal scoring method. The reported numbers are medians based on 500 replicates, with standard errors in parentheses. C is the number of correctly selected variables and IC is the number of incorrectly selected variables.
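For readers who wish to replicate designs of this type, the following sketch generates data in the spirit of Model 1 (AR covariance, two nonzero rows of β per class, µ_k = Σβ_k). The particular values of ρ, the signal size b, n and p below are placeholders chosen by us for illustration; they are not the paper's exact settings.

    import numpy as np

    def make_ar_model_data(n_per_class=75, p=800, K=4, rho=0.5, b=1.0, seed=0):
        rng = np.random.default_rng(seed)
        Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(rho)
        beta = np.zeros((K, p))
        for k in range(1, K + 1):
            beta[k - 1, [2 * k - 2, 2 * k - 1]] = b        # j = 2k-1, 2k (0-based indices here)
        mu = beta @ Sigma                                   # mu_k = Sigma beta_k (Sigma is symmetric)
        L = np.linalg.cholesky(Sigma)
        Y = np.repeat(np.arange(1, K + 1), n_per_class)
        X = mu[Y - 1] + rng.standard_normal((Y.size, p)) @ L.T
        return X, Y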
We further demonstrate the application of our method on the IBD dataset (Burczynski et al. 2006). This dataset contains 22283 gene expression levels from 127 people, who are either normal, have Crohn's disease, or have ulcerative colitis. The dataset can be downloaded from the Gene Expression Omnibus with accession number GDS1615. We randomly split the dataset with a 2:1 ratio in a balanced manner to form the training set and the testing set.

It is known that marginal t-test screening (Fan & Fan 2008) can greatly speed up the computation for linear discriminant analysis in binary problems. For a multiclass problem the natural generalization of t-test screening is F-test screening. The F-test statistic for each X_j is defined as

f_j = {Σ_{k=1}^K n_k (µ̂_{kj} − µ̄̂_j)^2/(K − 1)} / {Σ_{i=1}^n (X_{ij} − µ̂_{Y_i,j})^2/(n − K)},

where µ̄̂_j is the sample grand mean of X_j and n_k is the within-class sample size. F-test screening keeps only the predictors whose F-test statistics are among the d_n largest (a small sketch of this step is given after Table 2). As recommended by many researchers (Fan & Fan 2008, Fan & Song 2010, Mai & Zou 2013a), d_n can be taken equal to the sample size if we believe that the number of truly important variables is much smaller than the sample size. Therefore, we let d_n = 127 for the current dataset.

We estimate the rules given by sparse optimal scoring, ℓ_1 penalized Fisher's discriminant analysis and our proposal on the training set. The tuning parameters are chosen by 5-fold cross-validation. Then we evaluate the classification errors on the testing set. The results based on 100 replicates are listed in Table 2. Our proposal achieves the highest accuracy with the sparsest classification rule. This again shows that our method is a very competitive classifier.

                   Our            Witten         Clemmensen
Error(%)           7.32 (0.972)   21.95 (1.10)   9.76 (0.622)
Fitted model size  25 (0.7)       127 (0)        27 (0.5)

Table 2: Classification and variable selection results on the real dataset. The two competing methods are denoted by the first author of the original papers: Witten's method is the ℓ_1 penalized Fisher's discriminant analysis, and Clemmensen's method is the sparse optimal scoring method. All numbers are medians based on 100 random splits, with standard errors in parentheses.
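As an illustration of the screening step used above, the following small numpy function (our own sketch; the function name and interface are ours) computes the F statistics and keeps the d_n predictors with the largest values:

    import numpy as np

    def f_screen(X, Y, d_n):
        """X: (n, p); Y: (n,) class labels; returns the indices of the d_n retained predictors."""
        n, p = X.shape
        classes = np.unique(Y)
        K = classes.size
        grand = X.mean(axis=0)
        between = np.zeros(p)
        within = np.zeros(p)
        for k in classes:
            Xk = X[Y == k]
            mk = Xk.mean(axis=0)
            between += Xk.shape[0] * (mk - grand) ** 2
            within += ((Xk - mk) ** 2).sum(axis=0)
        f = (between / (K - 1)) / (within / (n - K))
        return np.argsort(f)[::-1][:d_n]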
Summary

In this paper we have proposed a new formulation for deriving sparse multiclass discriminant classifiers. We have shown that our proposal has a solid theoretical foundation and can be solved by a very efficient computational algorithm. Our proposal gives a unified treatment of the multiclass and binary classification problems. We have shown that the solution path of the binary version of our proposal is equivalent to those of ROAD and DSDA. Moreover, LPD is identical to the Dantzig selector formulation of our proposal in the binary case. In light of this evidence, our proposal can be regarded as the natural multiclass generalization of those proven successful binary sparse LDA methods.
Appendices
A.1 Connections with Fisher’s discriminant analysis
For simplicity, in this subsection we denote by η the discriminant directions defined by Fisher's discriminant analysis in (4), and by θ the discriminant directions defined by the Bayes rule. Our method gives a sparse estimate of θ. In this section, we discuss the connection between θ and η, and hence the connection between our method and Fisher's discriminant analysis. We first comment on the advantages of directly estimating θ rather than estimating η. Then we discuss how to estimate η once θ̂ is available.

There are two advantages of estimating θ rather than η. Firstly, estimating θ allows for simultaneous estimation of all the discriminant directions. Note that (4) requires that η_k^T Σ η_l = 0 for any l < k. This requirement almost necessarily leads to a sequential optimization problem, which is indeed the case for sparse optimal scoring and ℓ_1 penalized Fisher's discriminant analysis. In our proposal, the discriminant direction θ_k is determined by the covariance matrix and the mean vector µ_k within Class k, but is not related to θ_l for any l ≠ k. Hence, our proposal can simultaneously estimate all the directions by solving a convex problem. Secondly, it is easier to study the theoretical properties if we focus on θ. On the population level, θ can be written out in explicit form, and hence it is easy to quantify the difference between θ and θ̂ in the theoretical studies. Since η does not have a closed-form solution even when we know all the parameters, it is relatively harder to study its theoretical properties.

Moreover, if one is specifically interested in the discriminant directions η, it is very easy to obtain a sparse estimate of them once we have a sparse estimate of θ. For convenience, for any positive integer m, denote by 0_m an m-dimensional vector with all entries being 0, by 1_m an m-dimensional vector with all entries being 1, and by I_m the m × m identity matrix. The following lemma provides an approach to estimating η once θ̂ is available. The proof is relegated to Section A.2.

Lemma 2.
The discriminant directions η contain all the right eigenvectors of θ̃ Π δ^T corresponding to positive eigenvalues, where θ̃ = (0_p, θ), Π = I_K − 1_K 1_K^T/K, and δ = (µ_1 − µ̄, . . . , µ_K − µ̄) with µ̄ = Σ_{k=1}^K π_k µ_k.

Therefore, once we have obtained a sparse estimate of θ, we can estimate η as follows. Without loss of generality write θ̂ = (θ̂_{D̂}^T, 0^T)^T, where D̂ = {j : θ̂_{·j} ≠ 0}, and let the augmented matrix (0_p, θ̂) play the role of θ̃. On the other hand, set δ̂ = (µ̂_1 − µ̄̂, . . . , µ̂_K − µ̄̂), where the µ̂_k are sample estimates and µ̄̂ = Σ_{k=1}^K π̂_k µ̂_k. It follows that the right eigenvectors of (0_p, θ̂) Π δ̂^T with positive eigenvalues are supported on D̂ and are determined by the d̂ × d̂ block formed by restricting (0_p, θ̂) and δ̂ to the rows in D̂. Consequently, we can perform an eigen-decomposition of this restricted block to obtain η̂_{D̂}. Because D̂ is a small subset of the original variables, this decomposition is computationally efficient. Then η̂ is (η̂_{D̂}^T, 0^T)^T.
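A rough numerical sketch of this recovery step (our illustration; the function name, data layout and tolerance are our own choices, not part of the paper) is:

    import numpy as np

    def recover_eta(theta_hat, mu_hat, pi_hat, tol=1e-8):
        """theta_hat: (K-1, p), rows theta_hat_k for k = 2..K; mu_hat: (K, p); pi_hat: (K,)."""
        Km1, p = theta_hat.shape
        K = Km1 + 1
        D = np.where(np.any(theta_hat != 0, axis=0))[0]        # selected variables D_hat
        theta_tilde = np.vstack([np.zeros(p), theta_hat])       # the augmented (0_p, theta_hat), K x p
        Pi = np.eye(K) - np.ones((K, K)) / K
        mu_bar = pi_hat @ mu_hat
        delta_hat = mu_hat - mu_bar                             # rows are mu_hat_k - mu_bar
        M = theta_tilde[:, D].T @ Pi @ delta_hat[:, D]          # restricted |D_hat| x |D_hat| block
        vals, vecs = np.linalg.eig(M)
        keep = np.real(vals) > tol
        eta = np.zeros((p, int(keep.sum())))
        eta[D] = np.real(vecs[:, keep])                         # right eigenvectors, positive eigenvalues
        return eta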
A.2 Technical Proofs

Proof of Proposition 1.
We first show (15). For a vector θ ∈ R^p, define

L_MSDA(θ, λ) = (1/2) θ^T Σ̂ θ − (µ̂_2 − µ̂_1)^T θ + λ ‖θ‖_1, (22)

L_ROAD(θ, λ) = (1/2) θ^T Σ̂ θ + λ ‖θ‖_1. (23)

Set θ̃ = c_1(λ)^{-1} θ̂^MSDA(λ). Since θ̃^T(µ̂_2 − µ̂_1) = 1, it suffices to check that, for any θ̃′ such that (θ̃′)^T(µ̂_2 − µ̂_1) = 1, we have L_ROAD(θ̃, λ/|c_1(λ)|) ≤ L_ROAD(θ̃′, λ/|c_1(λ)|). Now for any such θ̃′,

L_MSDA(c_1(λ) θ̃′, λ) = c_1(λ)^2 L_ROAD(θ̃′, λ/|c_1(λ)|) − c_1(λ). (24)

Similarly,

L_MSDA(c_1(λ) θ̃, λ) = c_1(λ)^2 L_ROAD(θ̃, λ/|c_1(λ)|) − c_1(λ). (25)

Since L_MSDA(c_1(λ) θ̃, λ) ≤ L_MSDA(c_1(λ) θ̃′, λ), we have (15).

On the other hand, by Theorem 1 in Mai & Zou (2013b), we have

θ̂^DSDA(λ) = c_2(λ) θ̂^ROAD(λ/(n|c_2(λ)|)). (26)

Therefore,

θ̂^ROAD(λ/|c_1(λ)|) = θ̂^ROAD({n|c_2(λ)| λ/|c_1(λ)|}/{n|c_2(λ)|}) (27)
= {c_2(n|c_2(λ)| λ/|c_1(λ)|)}^{-1} θ̂^DSDA(n|c_2(λ)| λ/|c_1(λ)|) (28)
= {c_2(aλ)}^{-1} θ̂^DSDA(aλ). (29)

Combining (29) with (15), we obtain (16).

Proof of Lemma 1. We start by simplifying the first part of our objective function, (1/2) θ_k^T Σ̂ θ_k − (µ̂_k − µ̂_1)^T θ_k. First, note that

(1/2) θ_k^T Σ̂ θ_k = (1/2) Σ_{l,m=1}^p θ_{kl} θ_{km} σ̂_{lm}
= (1/2) θ_{kj}^2 σ̂_{jj} + (1/2) Σ_{l≠j} θ_{kl} θ_{kj} σ̂_{lj} + (1/2) Σ_{m≠j} θ_{kj} θ_{km} σ̂_{jm} + (1/2) Σ_{l≠j, m≠j} θ_{kl} θ_{km} σ̂_{lm}. (30)

Because σ̂_{lj} = σ̂_{jl}, we have Σ_{l≠j} θ_{kl} θ_{kj} σ̂_{lj} = Σ_{m≠j} θ_{kj} θ_{km} σ̂_{jm}. It follows that

(1/2) θ_k^T Σ̂ θ_k = (1/2) θ_{kj}^2 σ̂_{jj} + Σ_{l≠j} θ_{kj} θ_{kl} σ̂_{lj} + (1/2) Σ_{l≠j, m≠j} θ_{kl} θ_{km} σ̂_{lm}. (33)

Then recall that δ̂_k = µ̂_k − µ̂_1. We have

(µ̂_k − µ̂_1)^T θ_k = Σ_{l=1}^p δ̂_{kl} θ_{kl} = δ̂_{kj} θ_{kj} + Σ_{l≠j} δ̂_{kl} θ_{kl}. (34)

Combining (33) and (34), we have

(1/2) θ_k^T Σ̂ θ_k − (µ̂_k − µ̂_1)^T θ_k = (1/2) θ_{kj}^2 σ̂_{jj} + (Σ_{l≠j} σ̂_{lj} θ_{kl} − δ̂_{kj}) θ_{kj} + (1/2) Σ_{l≠j, m≠j} θ_{kl} θ_{km} σ̂_{lm} − Σ_{l≠j} δ̂_{kl} θ_{kl}. (37)

Note that the last two terms do not involve θ_{·j}. Therefore, given {θ_{·j′}, j′ ≠ j}, the solution of θ_{·j} is

arg min_{θ_{2j},...,θ_{Kj}} Σ_{k=2}^K {(1/2) θ_{kj}^2 σ̂_{jj} + (Σ_{l≠j} σ̂_{lj} θ_{kl} − δ̂_{kj}) θ_{kj}} + λ ‖θ_{·j}‖,

which is equivalent to (17). It is easy to obtain (18) from (17) (Yuan & Lin 2006).

In what follows we use C to denote a generic constant for convenience. Now we define an oracle "estimator" that relies on the knowledge of D, for a given tuning parameter λ:

θ̂_D^oracle = arg min_{θ_{2,D},...,θ_{K,D}} Σ_{k=2}^K {(1/2) θ_{k,D}^T Σ̂_{D,D} θ_{k,D} − (µ̂_{k,D} − µ̂_{1,D})^T θ_{k,D}} + λ Σ_{j∈D} ‖θ_{·j}‖. (38)

The proof of Theorem 1 is based on a series of technical lemmas.

Lemma 3.
Define θ̂_D^oracle(λ) as in (38). Then θ̂_k = ((θ̂_{k,D}^oracle)^T, 0^T)^T, k = 2, . . . , K, is the solution to (10) if

max_{j∈D^c} [Σ_{k=2}^K {(Σ̂_{D^c,D} θ̂_{k,D}^{(oracle)})_j − (µ̂_{kj} − µ̂_{1j})}^2]^{1/2} < λ. (39)

Proof of Lemma 3.
The proof is completed by checking that θ̂_k = ((θ̂_{k,D}^oracle(λ))^T, 0^T)^T satisfies the KKT condition of (10).

Lemma 4.
For each k, Σ_{D^c,D} Σ_{D,D}^{-1} (µ_{k,D} − µ_{1,D}) = µ_{k,D^c} − µ_{1,D^c}.

Proof of Lemma 4.
For each k, we have θ_{k,D^c} = 0. By definition, θ_{k,D^c} = (Σ^{-1}(µ_k − µ_1))_{D^c}. Then, by blockwise matrix inversion,

θ_{k,D^c} = −(Σ_{D^c,D^c} − Σ_{D^c,D} Σ_{D,D}^{-1} Σ_{D,D^c})^{-1} {Σ_{D^c,D} Σ_{D,D}^{-1} (µ_{k,D} − µ_{1,D}) − (µ_{k,D^c} − µ_{1,D^c})},

and the conclusion follows.

Proposition 2.
There exists a constant ε_0 such that, for any ε ≤ ε_0, we have

pr{|(µ̂_{kj} − µ̂_{1j}) − (µ_{kj} − µ_{1j})| ≥ ε} ≤ C exp(−Cnε^2/K) + C exp(−Cn/K^2), k = 2, . . . , K, j = 1, . . . , p; (40)

pr(|σ̂_{ij} − σ_{ij}| ≥ ε) ≤ C exp(−Cnε^2/K) + 2 exp(−Cn/K^2), i, j = 1, . . . , p. (41)

Proof of Proposition 2. We first show (40). Note that, by the Chernoff bound,

pr(|µ̂_{kj} − µ_{kj}| ≥ ε) ≤ E{pr(|µ̂_{kj} − µ_{kj}| ≥ ε | Y)} ≤ E{C exp(−C n_k ε^2)} ≤ C exp(−Cnε^2/K) + 2 exp(−Cn/K^2).

A similar inequality holds for µ̂_{1j}, and (40) follows. For (41), note that

σ̂_{ij} = (n − K)^{-1} Σ_{k=1}^K Σ_{Y_m=k} (X_{mi} − µ̂_{ki})(X_{mj} − µ̂_{kj})
= (n − K)^{-1} Σ_{k=1}^K Σ_{Y_m=k} (X_{mi} − µ_{ki})(X_{mj} − µ_{kj}) − (n − K)^{-1} Σ_{k=1}^K n_k (µ̂_{ki} − µ_{ki})(µ̂_{kj} − µ_{kj})
= σ̂_{ij}^{(0)} − (n − K)^{-1} Σ_{k=1}^K n_k (µ̂_{ki} − µ_{ki})(µ̂_{kj} − µ_{kj}).

Now, by the Chernoff bound, pr(|σ̂_{ij}^{(0)} − σ_{ij}| ≥ ε) ≤ C exp(−Cnε^2). Combining this fact with (40), we have the desired result.

Now we consider two events depending on a small ε > 0:

A(ε) = {|σ̂_{ij} − σ_{ij}| < ε/d for any i = 1, . . . , p and j ∈ D},
B(ε) = {|(µ̂_{kj} − µ̂_{1j}) − (µ_{kj} − µ_{1j})| < ε for any k and j}.

By simple union bounds, we can derive Lemma 4 and Lemma 5.
Lemma 5.
There exists a constant ε_0 such that, for any ε ≤ ε_0, we have

1. pr(A(ε)) ≥ 1 − Cpd exp(−Cnε^2/(Kd^2)) − CK exp(−Cn/K^2);

2. pr(B(ε)) ≥ 1 − Cp(K − 1) exp(−Cnε^2/K) − CK exp(−Cn/K^2);

3. pr(A(ε) ∩ B(ε)) ≥ 1 − γ(ε), where γ(ε) = Cpd exp(−Cnε^2/(Kd^2)) + Cp(K − 1) exp(−Cnε^2/K) + 2CK exp(−Cn/K^2).

Lemma 6.
Assume that both A(ε) and B(ε) have occurred. We have the following conclusions:

‖Σ̂_{D,D} − Σ_{D,D}‖_∞ < ε; ‖Σ̂_{D^c,D} − Σ_{D^c,D}‖_∞ < ε; ‖(µ̂_k − µ̂_1) − (µ_k − µ_1)‖_∞ < ε; ‖(µ̂_{k,D} − µ̂_{1,D}) − (µ_{k,D} − µ_{1,D})‖_∞ < ε.

Lemma 7.
If both A(ε) and B(ε) have occurred for ε < φ^{-1}, we have

‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞ < εφ^2 (1 − φε)^{-1}, ‖Σ̂_{D^c,D} Σ̂_{D,D}^{-1} − Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞ < Cφ^2 ε/(1 − φε).

Proof of Lemma 7.
Let η_1 = ‖Σ̂_{D,D} − Σ_{D,D}‖_∞, η_2 = ‖Σ̂_{D^c,D} − Σ_{D^c,D}‖_∞ and η_3 = ‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞. First we have

η_3 ≤ ‖Σ̂_{D,D}^{-1}‖_∞ × ‖Σ̂_{D,D} − Σ_{D,D}‖_∞ × ‖Σ_{D,D}^{-1}‖_∞ ≤ (φ + η_3) φ η_1.

On the other hand,

‖Σ̂_{D^c,D} Σ̂_{D,D}^{-1} − Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞
≤ ‖Σ̂_{D^c,D} − Σ_{D^c,D}‖_∞ × ‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞ + ‖Σ̂_{D^c,D} − Σ_{D^c,D}‖_∞ × ‖Σ_{D,D}^{-1}‖_∞ + ‖Σ_{D^c,D}‖_∞ × ‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞
≤ η_2 η_3 + η_2 φ + φ η_3.

Since φη_1 < 1, we have η_3 ≤ φ^2 η_1 (1 − φη_1)^{-1}, and hence ‖Σ̂_{D^c,D} Σ̂_{D,D}^{-1} − Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞ < Cφ^2 ε/(1 − φε).

Lemma 8.
Define

θ̂_{k,D} = Σ̂_{D,D}^{-1} (µ̂_{k,D} − µ̂_{1,D}). (42)

Then ‖θ̂_{k,D} − θ_{k,D}‖_∞ ≤ φε(1 + φΔ)/(1 − φε).

Proof of Lemma 8.
By definition, we have

‖Σ̂_{D,D}^{-1}(µ̂_{k,D} − µ̂_{1,D}) − Σ_{D,D}^{-1}(µ_{k,D} − µ_{1,D})‖_∞
≤ ‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞ ‖(µ̂_{k,D} − µ̂_{1,D}) − (µ_{k,D} − µ_{1,D})‖_∞ + ‖Σ_{D,D}^{-1}‖_∞ ‖(µ̂_{k,D} − µ̂_{1,D}) − (µ_{k,D} − µ_{1,D})‖_∞ + ‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞ ‖µ_{k,D} − µ_{1,D}‖_∞
≤ φε(1 + φΔ)/(1 − φε).

Lemma 9. If A(ε) and B(ε) have occurred for ε < min{φ^{-1}, λ/(φΔ)}, then for all k,

‖θ̂_{k,D}^{(oracle)}(λ) − θ_{k,D}‖_∞ ≤ φλ.

Proof of Lemma 9.
Observe that θ̂_{k,D}^oracle = Σ̂_{D,D}^{-1}(µ̂_{k,D} − µ̂_{1,D}) − λ Σ̂_{D,D}^{-1} t̂_{k,D}. Therefore,

‖θ̂_{k,D}^oracle − θ_{k,D}‖_∞ ≤ ‖θ̂_{k,D} − θ_{k,D}‖_∞ + λ ‖Σ̂_{D,D}^{-1} − Σ_{D,D}^{-1}‖_∞ ‖t̂_{k,D}‖_∞ + λ ‖Σ_{D,D}^{-1}‖_∞ ‖t̂_{k,D}‖_∞,

where θ̂_{k,D} is defined as in (42). Now ‖t̂_{k,D}‖_∞ ≤ 1, and we have

‖θ̂_{k,D}^oracle − θ_{k,D}‖_∞ ≤ {φε(1 + φΔ) + λφ}/(1 − φε) < φλ.

Lemma 10. For a set of real numbers {a_1, . . . , a_N}, if Σ_{i=1}^N a_i^2 ≤ κ^2 < 1, then Σ_{i=1}^N (a_i + b)^2 < 1 as long as 0 ≤ b < (1 − κ)/√N.

Proof. By the Cauchy–Schwarz inequality, we have

Σ_{i=1}^N (a_i + b)^2 = Σ_{i=1}^N a_i^2 + 2b Σ_{i=1}^N a_i + N b^2 (43)
≤ Σ_{i=1}^N a_i^2 + 2 {(Σ_{i=1}^N a_i^2) N}^{1/2} b + N b^2 (44)
≤ κ^2 + 2κ√N b + N b^2 = (κ + √N b)^2, (45)

which is less than 1 when b < (1 − κ)/√N.

We are now ready to complete the proof of Theorem 1.

Proof of Theorem 1.
We first consider the first conclusion. For any λ < θ_min/φ and ε < min{φ^{-1}, λ/(φΔ)}, consider the event A(ε) ∩ B(ε). By Lemmas 3, 5 and 9, it suffices to verify (39).

For any j ∈ D^c and any k, by Lemma 4 we have

|(Σ̂_{D^c,D} θ̂_{k,D}^{(oracle)})_j − (µ̂_{kj} − µ̂_{1j})|
≤ |(Σ̂_{D^c,D} θ̂_{k,D}^{(oracle)})_j − (Σ_{D^c,D} θ_{k,D})_j| + |(µ̂_{kj} − µ̂_{1j}) − (µ_{kj} − µ_{1j})|
≤ |(Σ̂_{D^c,D} θ̂_{k,D}^{(oracle)})_j − (Σ_{D^c,D} θ_{k,D})_j| + ε
≤ |(Σ̂_{D^c,D} θ̂_{k,D})_j − (Σ_{D^c,D} θ_{k,D})_j| + ε + λ |(Σ̂_{D^c,D} Σ̂_{D,D}^{-1} t̂_{k,D})_j|,

where θ̂_{k,D} is as in (42). For the first two terms,

|(Σ̂_{D^c,D} θ̂_{k,D})_j − (Σ_{D^c,D} θ_{k,D})_j| + ε
≤ ‖(Σ̂_{D^c,D})_j − (Σ_{D^c,D})_j‖_1 ‖θ̂_{k,D} − θ_{k,D}‖_∞ + ‖θ_{k,D}‖_∞ ‖(Σ̂_{D^c,D})_j − (Σ_{D^c,D})_j‖_1 + ‖(Σ_{D^c,D})_j‖_1 ‖θ̂_{k,D} − θ_{k,D}‖_∞ + ε ≤ Cε. (46)

For the last term,

|(Σ̂_{D^c,D} Σ̂_{D,D}^{-1} t̂_{k,D})_j − (Σ_{D^c,D} Σ_{D,D}^{-1} t_{k,D})_j|
≤ ‖Σ̂_{D^c,D} Σ̂_{D,D}^{-1} − Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞ ‖t̂_{k,D} − t_{k,D}‖_∞ + ‖Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞ ‖t̂_{k,D} − t_{k,D}‖_∞ + ‖Σ̂_{D^c,D} Σ̂_{D,D}^{-1} − Σ_{D^c,D} Σ_{D,D}^{-1}‖_∞ |(t_{k,D})_j|,

and

|t̂_{kj} − t_{kj}| = |θ̂_{kj} ‖θ_{·j}‖ − θ_{kj} ‖θ̂_{·j}‖| / (‖θ_{·j}‖ ‖θ̂_{·j}‖) ≤ |θ̂_{kj} − θ_{kj}|/‖θ_{·j}‖ + θ_max ‖θ_{·j} − θ̂_{·j}‖/(‖θ_{·j}‖ ‖θ̂_{·j}‖) ≤ Cφ θ_min^{-1} √(K − 1) λ.

Therefore,

λ |(Σ̂_{D^c,D} Σ̂_{D,D}^{-1} t̂_{k,D})_j| ≤ λ |(Σ_{D^c,D} Σ_{D,D}^{-1} t_{k,D})_j| + λ {Cφε/(1 − φε) + η* Cφλ θ_min^{-1} √(K − 1)} (47)
≤ λ |(Σ_{D^c,D} Σ_{D,D}^{-1} t_{k,D})_j| + Cλ. (48)

Under condition (C0), it follows from (46) and (48) that

|(Σ̂_{D^c,D} θ̂_{k,D}^{(oracle)})_j − (µ̂_{kj} − µ̂_{1j})| ≤ λ |(Σ_{D^c,D} Σ_{D,D}^{-1} t_{k,D})_j| + Cλ. (49)

Combining condition (C0) with Lemma 10, we have that there exists a generic constant M > 0 such that, when λ < M(1 − κ), (39) is true. Therefore, the first conclusion is true.

Under conditions (C2)–(C4), the second conclusion directly follows from the first conclusion.

Proof of Theorem 2.
We first show the first conclusion. Define Ŷ(θ_2, . . . , θ_K) as the prediction by the Bayes rule and Ŷ(θ̂_2, . . . , θ̂_K) as the prediction by the estimated classification rule. Also define l_k = (X − µ_k/2)^T θ_k + log π_k and l̂_k = (X − µ̂_k/2)^T θ̂_k + log π̂_k.

Define C(ε) = {|π̂_k − π_k| ≤ min{min_k π_k/2, ε} for all k}. By the Bernstein inequality we have Pr(C(ε)) ≥ 1 − C exp(−Cnε^2).

Assume that the event A(ε) ∩ B(ε) ∩ C(ε) has happened. By Lemma 5, we have

Pr(A(ε) ∩ B(ε) ∩ C(ε)) ≥ 1 − Cpd exp(−Cnε^2/(Kd^2)) − CK exp(−Cn/K^2) − Cp(K − 1) exp(−Cnε^2/K). (50)

For any ε_0 > 0,

R_n − R ≤ Pr(Ŷ(θ_2, . . . , θ_K) ≠ Ŷ(θ̂_2, . . . , θ̂_K))
≤ 1 − Pr(|l̂_k − l_k| < ε_0/2 and |l_k − l_{k′}| > ε_0 for any k, k′)
≤ Pr(|l̂_k − l_k| ≥ ε_0/2 for some k) + Pr(|l_k − l_{k′}| ≤ ε_0 for some k ≠ k′).

Now, for X in each class, l_k − l_{k′} is normal with variance (θ_k − θ_{k′})^T Σ (θ_k − θ_{k′}). Therefore,

Pr(|l_k − l_{k′}| ≤ ε_0 for some k ≠ k′) ≤ Σ_{k″} Pr(|l_k − l_{k′}| ≤ ε_0 | Y = k″) π_{k″} ≤ Σ_{k,k′,k″} π_{k″} Cε_0/{(θ_k − θ_{k′})^T Σ (θ_k − θ_{k′})}^{1/2} ≤ CK^2 ε_0.

On the other hand, conditional on the training data, within class k′ the difference l̂_k − l_k is normal with mean u(k, k′) = µ_{k′}^T(θ̂_k − θ_k) + (1/2)(µ_k^T θ_k − µ̂_k^T θ̂_k) + log π̂_k − log π_k and variance (θ̂_k − θ_k)^T Σ (θ̂_k − θ_k). By Markov's inequality, we have

Pr(|l̂_k − l_k| ≥ ε_0/2 for some k) = Σ_{k′} π_{k′} Pr(|l̂_k − l_k| ≥ ε_0/2 | Y = k′) ≤ C E{max_k (θ̂_k − θ_k)^T Σ (θ̂_k − θ_k)}/(ε_0/2 − u(k, k′))^2.

Moreover, under the event A(ε) ∩ B(ε) ∩ C(ε),

max_k (θ̂_k − θ_k)^T Σ (θ̂_k − θ_k) ≤ Cλ^2,
|u(k, k′)| ≤ |µ_{k′}^T(θ̂_k − θ_k)| + (1/2)|µ_k^T(θ̂_k − θ_k)| + (1/2)|(µ_k − µ̂_k)^T θ̂_k| + |log π̂_k − log π_k| ≤ C_1 λ. (51)

Hence, pick ε_0 = M_1 λ^{1/2} such that ε_0 ≥ 4C_1 λ, for C_1 in (51). Then Pr(|l̂_k − l_k| ≥ ε_0/2 for some k) ≤ Cλ. It follows that |R_n − R| ≤ M_2 λ^{1/2} for some positive constant M_2.

Under conditions (C2)–(C4), the second conclusion is a direct consequence of the first conclusion.

We need the result in the following proposition to show Lemma 2. A slightly different version of the proposition has been presented in Fukunaga (1990, pages 446–450), but we include the proof here for completeness.

Proposition 3.
The solution to (4) consists of all the right eigenvectors of Σ^{-1} Σ_b corresponding to positive eigenvalues.

Proof. For any η_k, set u_k = Σ^{1/2} η_k. It follows that solving (4) is equivalent to finding

(u_1^*, . . . , u_{K−1}^*) = arg max_{u_k} u_k^T Σ^{-1/2} δ δ^T Σ^{-1/2} u_k, s.t. u_k^T u_k = 1 and u_k^T u_l = 0 for any l < k, (52)

and then setting η_k = Σ^{-1/2} u_k^*. It is easy to see that u_1^*, . . . , u_{K−1}^* are the eigenvectors corresponding to positive eigenvalues of Σ^{-1/2} δ δ^T Σ^{-1/2}. By Proposition 4, with A = Σ^{-1/2} δ δ^T and B = Σ^{-1/2}, we have that η consists of all the eigenvectors of Σ^{-1} δ δ^T corresponding to positive eigenvalues.

Proposition 4. (Mardia et al. (1979), page 468, Theorem A.6.2) For two matrices A and B, if x is a non-trivial eigenvector of AB for a nonzero eigenvalue, then y = Bx is a non-trivial eigenvector of BA.

Proof of Lemma 2. Set δ̃ = (0_p, µ_2 − µ_1, . . . , µ_K − µ_1) and δ = (µ_1 − µ̄, . . . , µ_K − µ̄). Note that δ̃ 1_K = Σ_{k=2}^K µ_k − (K − 1)µ_1 = K(µ̄ − µ_1). Therefore, δ = δ̃ − K^{-1} δ̃ 1_K 1_K^T = δ̃ (I_K − 1_K 1_K^T/K) = δ̃ Π. Then, since θ̃ = Σ^{-1} δ̃, we have θ̃ Π = Σ^{-1} δ and θ̃ Π δ^T = Σ^{-1} δ δ^T. By Proposition 3, we have the desired conclusion.

References
Bach, F. R. (2008), 'Consistency of the group lasso and multiple kernel learning', Journal of Machine Learning Research, 1179–1225.

Burczynski, M. E., Peterson, R. L., Twine, N. C., Zuberek, K. A., Brodeur, B. J., Casciotti, L., Maganti, V., Reddy, P. S., Strahs, A., Immermann, F., Spinelli, W., Schwertschlag, U., Slager, A. M., Cotreau, M. M. & Dorner, A. J. (2006), 'Molecular classification of Crohn's disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells', Journal of Molecular Diagnostics, 51–61.

Cai, T. & Liu, W. (2011), 'A direct estimation approach to sparse linear discriminant analysis', J. Am. Statist. Assoc., 1566–1577.

Candes, E. & Tao, T. (2007), 'The Dantzig selector: Statistical estimation when p is much larger than n', Ann. Statist., 2313–2351.

Clemmensen, L., Hastie, T., Witten, D. & Ersbøll, B. (2011), 'Sparse discriminant analysis', Technometrics, 406–413.

Fan, J. & Fan, Y. (2008), 'High dimensional classification using features annealed independence rules', Ann. Statist., 2605–2637.

Fan, J., Feng, Y. & Tong, X. (2012), 'A ROAD to classification in high dimensional space', J. R. Statist. Soc. B, 745–771.

Fan, J. & Li, R. (2001), 'Variable selection via nonconcave penalized likelihood and its oracle properties', J. Am. Statist. Assoc., 1348–1360.

Fan, J. & Song, R. (2010), 'Sure independence screening in generalized linear models with NP-dimensionality', Ann. Statist. (6), 3567–3604.

Fukunaga, K. (1990), Introduction to Statistical Pattern Recognition, 2nd edn, Academic Press Professional, Inc.

Hand, D. J. (2006), 'Classifier technology and the illusion of progress', Statistical Science, 1–14.

Mai, Q. & Zou, H. (2013a), 'The Kolmogorov filter for variable screening in high-dimensional binary classification', Biometrika, 229–234.

Mai, Q. & Zou, H. (2013b), 'A note on the connection and equivalence of three sparse linear discriminant analysis methods', Technometrics, 243–246.

Mai, Q., Zou, H. & Yuan, M. (2012), 'A direct approach to sparse discriminant analysis in ultra-high dimensions', Biometrika, 29–42.

Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979), Multivariate Analysis, Academic Press.

Michie, D., Spiegelhalter, D. & Taylor, C. (1994), Machine Learning, Neural and Statistical Classification, first edn, Ellis Horwood.

Shao, J., Wang, Y., Deng, X. & Wang, S. (2011), 'Sparse linear discriminant analysis with high dimensional data', Ann. Statist.

Tibshirani, R. (1996), 'Regression shrinkage and selection via the lasso', J. R. Statist. Soc. B, 267–288.

Tibshirani, R., Hastie, T., Narasimhan, B. & Chu, G. (2002), 'Diagnosis of multiple cancer types by shrunken centroids of gene expression', Proc. Nat. Acad. Sci., 6567–6572.

Trendafilov, N. T. & Jolliffe, I. T. (2007), 'DALASS: Variable selection in discriminant analysis via the lasso', Computational Statistics and Data Analysis, 3718–3736.

Witten, D. & Tibshirani, R. (2011), 'Penalized classification using Fisher's linear discriminant', J. R. Statist. Soc. B, 753–772.

Wu, M., Zhang, L., Wang, Z., Christiani, D. & Lin, X. (2008), 'Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection', Bioinformatics, 1145–1151.

Yuan, M. & Lin, Y. (2006), 'Model selection and estimation in regression with grouped variables', J. R. Statist. Soc. B 68.