A Generative Approach to Joint Modeling of Quantitative and Qualitative Responses
aa r X i v : . [ s t a t . M E ] F e b A Generative Approach to Joint Modeling ofQuantitative and Qualitative Responses
Xiaoning Kang a , Lulu Kang b , Wei Chen c and Xinwei Deng d a International Business College and Institute of Supply Chain Analytics,Dongbei University of Finance and Economics, Dalian, China b Department of Applied Mathematics, Illinois Institute of Technology, Chicago, USA c Department of Mechanical, Materials & Aerospace Engineering,Illinois Institute of Technology, Chicago, USA d Department of Statistics, Virginia Tech, Blacksburg, USA
Abstract
In many scientific areas, data with quantitative and qualitative (QQ) responses arecommonly encountered with a large number of predictors. By exploring the associa-tion between QQ responses, existing approaches often consider a joint model of QQresponses given the predictor variables. However, the dependency among predictivevariables also provides useful information for modeling QQ responses. In this work, wepropose a generative approach to model the joint distribution of the QQ responses andpredictors. The proposed generative model provides efficient parameter estimation un-der a penalized likelihood framework. It achieves accurate classification for qualitativeresponse and accurate prediction for quantitative response with efficient computation.Because of the generative approach framework, the asymptotic optimality of classifica-tion and prediction of the proposed method can be established under some regularityconditions. The performance of the proposed method is examined through simulationsand real case studies in material science and genetics.
Keywords:
Generative modeling; graphical lasso; mixed outcomes; regularization. Address for correspondence: Xinwei Deng, Associate Professor, Department of Statistics, Virginia Tech,Blacksburg, VA, 24061 (E-mail: [email protected]). Introduction
Analyzing data with heterogeneous types of responses has been an important topic withbroad applications. Such heterogeneous data often involve both quantitative and quali-tative (QQ) responses. For example, Klein et al. (2019) described a human health studyexamining the risk factors of adverse birth outcomes, which contains a qualitative response“presence/absence of low birth weight” and a quantitative response “gestational age”. Inmaterial science, the properties of a material are often characterized by QQ measures. Asshown in the case study of Section 5 on Heusler compounds, two metrics, the mixing enthalpy (quantitative) and the global stability based on hull energy (qualitative) are used to deter-mine the thermodynamic stability of a full Heusler compound. The QQ responses is a specialcase of “mixed outcomes” in the literature. In this paper, we focus on two types of mixedresponses: the quantitative continuous response and the multi-class qualitative response.In the literature of mixed outcomes, particularly on QQ responses, it has been knownthat overlooking the relationship between the QQ responses is inappropriate. Researches onthe joint model of the QQ responses include early ones such as Fitzmaurice and Laird (1995);Moustaki and Knott (2000); Dunson (2000); Gueorguieva and Agresti (2001); Dunson (2003)and recent ones such as Deng and Jin (2015); K¨ur¨um et al. (2016); Kang et al. (2018);Amini et al. (2018); Klein et al. (2019). These works mostly consider a joint regressionmodel conditioned on the predictor variables. Based on our best knowledge, we can groupthem into two categories.The first group of methods considers a conditional regression model, in which one typeof the QQ responses is treated as the response of the model, and the other type is treated asthe regressor. For instance, Fitzmaurice and Laird (1997) introduced a marginal regressionmodel of quantitative response conditioned on the qualitative response. Song et al. (2009)used Gaussian copulas to integrate separate one-dimensional generalized linear models into ajoint regression model for mixed outcomes. Lin et al. (2010) developed a conditional mixed-effects model to analyze clustered data containing QQ responses. These methods are suitablefor data with a small number of predictor variables. To handle the high-dimensional inputvariables, Deng and Jin (2015) proposed a conditional model that encourages model sparsity2hrough a constrained likelihood estimation. However, inferences and asymptotic propertiesof their method have not been explored due to the complicated constrained likelihood esti-mation. Kang et al. (2018) considered a Bayesian estimation for the conditional model ofDeng and Jin (2015) to obtain proper inferences of model parameters. Nevertheless, theirwork is not designated for studying the asymptotic properties of the proposed estimator.More related works can be found in Chen et al. (2014); Yang et al. (2014); Guglielmi et al.(2018), among others.The second group of methods considers a continuous latent variable for the qualita-tive response, and then jointly models the latent variable and the quantitative response(Sammel et al., 1997; Dunson and Herring, 2005; Bello et al., 2012). For example, Gueorguieva and Agresti(2001) studied a probit model with a latent variable and developed a Monte Carlo expectation-conditional maximization algorithm for parameter estimation. Klein et al. 
(2019) introducedthe idea of latent variable into the framework of copula regressions, constructing a latentcontinuous representation of binary regression models. However, the use of latent variablesoften involves considerable computation in the parameter estimation. It also makes the in-vestigation of theoretical properties difficult. Moreover, most of these works focus on thebinary qualitative response and their model assumptions may not be easily extended to themulti-class qualitative response cases.In this work, we propose a novel approach to jointly model the QQ responses based on the generative approach . The proposed generative model considers the joint distribution of thehigh-dimensional input variables, the quantitative responses, and the multi-class qualitativeresponse. It is a very unique and different perspective from the existing literature and alsobrings advantages in both theoretical and computational aspects. The proposed methodcan accommodate multi-class qualitative response and multivariate quantitative responseswith attractive theoretical properties. We call the proposed method
GAQQ , a GenerativeApproach for QQ responses.The key contributions of this work are summarized as follows. First, based on the gen-erative model framework, we are able to establish the asymptotic properties of the proposedestimators with respect to both the classification accuracy of the qualitative response andthe prediction accuracy of the quantitative response under some regularity conditions. Such3onditions are commonly used in the regularized estimation framework (Shao et al., 2011;Zhao and Yu, 2006). The classification of the qualitative response enjoys the asymptoticoptimality of the resulting linear discriminate classification rule. The mean squared error(MSE) of prediction for the quantitative response is as good as the optimal prediction underthe Bayes risk. Second, an efficient procedure for parameter estimation is developed viathe regularized log-likelihood function of the joint distribution of input variables and QQresponses. Specifically, we impose regularization on both the mean differences and the co-variance matrix from the joint distribution to achieve sparsity for high-dimensional predictorvariables. Third, the use of the generative approach leads to an effective prediction proce-dure by inferring the conditional distribution of QQ responses conditioned on the predictorvariables. That is, the quantitative response is predicted through the property of conditionalmultivariate normal distribution, and the linear discriminant analysis (LDA) is employed forclassification of the qualitative response. Fourth, the proposed generative model allows theparameters related to QQ responses to be mutually learned from each other, which is dif-ferent from existing methods in which only modeling one type of QQ responses attempts tobenefit from the information of the other type of QQ responses.The remainder of this paper is organized as follows. Section 2 details the proposedmethod. The main theoretical results are presented in Section 3. Simulation and real dataanalysis are conducted in Sections 4 and 5, respectively. Section 6 concludes this work withsome discussion. Technical proofs are in the Appendix.
Suppose that the variables of interest are denoted by ( X , y, Z ) where X = ( X , . . . , X p − ) ′ is a ( p −
1) dimensional vector of predictor variables, y is a quantitative response variableand Z ∈ { , } is a qualitative response variable. From a generative modeling perspective,we consider the data generation mechanism as p ( X , y, Z ) = p ( X , y | Z ) p ( Z ), indicating thatdata are from two classes G and G under ( X , y ) | Z . Assume that W = ( X ′ , y ) ′ follows4ultivariate norm distributions with different means for two classes, but sharing the samecovariance matrix as follows G : W | Z = 1 ∼ N ( µ , Σ ) , G : W | Z = 2 ∼ N ( µ , Σ ) . (1)Denote the observed data w , · · · , w n , w n +1 , · · · , w n + n with the first n observationsfrom G and the rest n observations from G , where w i = ( x ′ i , y i ) ′ , i = 1 , , . . . , n + n . Let n = n + n . The log-likelihood function can be written as L ( µ , µ , Σ ) = n ln | C | − X k =1 X i ∈ G k ( w i − µ k ) ′ C ( w i − µ k ) , (2)up to some constant, where C = Σ − is the inverse covariance matrix. Let π and π be theprior probability of w belonging to classes G and G , respectively. Hence, the LDA assignsa new observation w to G ifln Pr( G | W = w )Pr( G | W = w ) = ln π π −
12 ( µ + µ ) ′ Cδ + w ′ Cδ ≥ , (3)where δ = µ − µ . Otherwise, w is classified to G . The estimates of π and π arethe empirical proportions of data from each class. The parameters µ , µ and C can beestimated by maximizing the log-likelihood function of (2).For high-dimensional data when p ≥ n , the regularization is often needed to ensure theproper estimation of inverse covariance matrix C and mean difference δ . We thus proposeto penalize C = ( c ij ) ≤ i,j ≤ p and δ simultaneously, resulting in the following optimizationproblem min ( µ , µ , C ) − n ln | C | + X k =1 X i ∈ G k ( w i − µ k ) ′ C ( w i − µ k ) + λ || C || + 12 λ | µ − µ | , (4)where || C || = P i = j | c ij | , and | α | = P i | α i | with α i being the i th entry of vector α . Here λ ≥ λ ≥ C and δ at the same time. Note that similarspirits of regularizing both C and δ are used in several works on the LDA (Shao et al., 2011;Cai and Liu, 2012).To estimate the parameters, we develop an iterative procedure to solve the sub-optimizationproblem with respect to C and δ respectively. Define δ = ( µ − µ ) / γ =5 µ + µ ) /
2, then accordingly we have µ = δ + γ and µ = γ − δ . As a result, theoptimization problem (4) is re-written asmin ( δ , γ , C ) − n ln | C | + X i ∈ G ( w i − δ − γ ) ′ C ( w i − δ − γ )+ X i ∈ G ( w i + δ − γ ) ′ C ( w i + δ − γ ) + λ || C || + λ | δ | . (5)It is thus easy to obtain the maximum likelihood estimate of γ from (5) asˆ γ = ¯ w + n − n n δ , (6)where ¯ w = n P ni =1 w i is the overall mean. Then plugging ˆ γ back into (5) yields( ˆ δ , ˆ C ) = arg min δ , C − n ln | C | + X i ∈ G ( w i − n n δ − ¯ w ) ′ C ( w i − n n δ − ¯ w )+ X i ∈ G ( w i + 2 n n δ − ¯ w ) ′ C ( w i + 2 n n δ − ¯ w ) + λ || C || + λ | δ | . (7)In this manner, solving the optimization problem (4) is equivalent to solving the optimiza-tion problem (7). Next, we show that (7) can be decomposed as a graphical lasso model(Glasso) (Yuan and Lin, 2007; Deng and Yuan, 2009) in terms of C and a Lasso regres-sion (Tibshirani, 1996) in terms of δ with the other parameter fixed, such that these twoparameters can be estimated iteratively. To be more precise, for a given value of δ , theminimization problem (7) with respect to C ismin C − n ln | C | + tr( C ˜ S ) + λ || C || , (8)where ˜ S = P i ∈ G ( w i − n n δ − ¯ w )( w i − n n δ − ¯ w ) ′ + P i ∈ G ( w i + n n δ − ¯ w )( w i + n n δ − ¯ w ) ′ . Ithas the same form as the graphical lasso, which has been extensively studied in literature byYuan and Lin (2007); Friedman et al. (2008); Lam and Fan (2009); Raskutti et al. (2008);Liu et al. (2020), and many others. On the other hand, when the inverse covariance matrix C is fixed, the minimization problem (7) regarding δ becomesmin δ X i ∈ G ( w i − n n δ − ¯ w ) ′ C ( w i − n n δ − ¯ w )+ X i ∈ G ( w i + 2 n n δ − ¯ w ) ′ C ( w i + 2 n n δ − ¯ w ) + λ | δ | , (9)6hich is equivalent to min δ ( ˜ y − C / δ ) ′ ( ˜ y − C / δ ) + λ | δ | , (10)where ˜ y = n n C / ( n P i ∈ G w i − n P i ∈ G w i ). A detailed derivation of (10) from (9) isprovided in the Appendix. We solve the minimization problem (10) by the Lasso technique.Consequently, solving the complicated optimization problem (7) is decomposed to the simpletasks of iteratively solving a Glasso estimate for C and a Lasso estimate for δ until both ofthem are converged. We summarize the above estimation procedure for the proposed modelin Algorithm 1. Algorithm 1 (Estimation Procedure) .Step 0 : Set an initial value of δ . Step 1 : Given δ = ˆ δ ,t , solve C in (8) by the Glasso technique. Step 2 : Given C = ˆ C t , solve δ in (10) by the Lasso technique. Step 3 : Repeat
Step 1 and till both ˆ C t and ˆ δ ,t converge. Here ˆ C t and ˆ δ ,t represent the estimates of C and δ in the t th iteration. The convergencecriteria are || ˆ C t − ˆ C t − || F < τ and || ˆ δ ,t − ˆ δ ,t − || < τ , where τ and τ are two pre-selectedsmall quantities, || · || F stands for the Frobenius norm, and k α k = P i α i with α i being the i th entry of vector α . We set the initial value of δ as ( ¯ w − ¯ w ) /
2, where ¯ w k is the samplemean for the k th class. With value of ˆ δ , the estimate ˆ γ is calculated by Equation (6), andthen we have ˆ µ = ˆ δ + ˆ γ and ˆ µ = ˆ γ − ˆ δ . Therefore, Algorithm 1 provides the estimatesof three parameters µ , µ and C in the classification rule (3).Note that there are two tuning parameters λ and λ in the optimization problem (7).To choose their optimal values, we minimize a BIC-type criterion proposed by Wang et al.(2007) as BIC( λ , λ ) = − n ln | ˆ C | + tr( ˆ C ˜ S ) + ( v ( ˆ δ ) + v ( ˆ C ) + 1) ln( n ) , where v ( ˆ δ ) and v ( ˆ C ) stand for the number of nonzero entries in the estimates ˆ δ and ˆ C ,respectively. This criterion enjoys consistency properties and has been commonly used inliterature (Zou and Zhang, 2009; Lv and Fan, 2009; Armagan et al., 2013).7 .2 Model Prediction In this section, we demonstrate how to conduct model prediction by the proposed method.For convenience, let us write µ = µ X µ y , µ = µ X µ y , and C = C X , C Xy C ′ Xy , c y , Σ = Σ X , Σ Xy Σ ′ Xy , σ y . where µ X and µ X are p − X for twoclasses, and Σ X is the ( p − × ( p −
1) covariance matrix of X . The estimates ˆ µ , ˆ µ , ˆ C , ˆ Σ canbe partitioned accordingly. From model assumption (1) as well as the property of multivariatenormal distribution, we have y | X = x , Z = 1 ∼ N (cid:0) µ y + Σ ′ Xy Σ − X ( x − µ X ) , σ y − Σ ′ Xy Σ − X Σ Xy (cid:1) ,and y | X = x , Z = 2 ∼ N (cid:0) µ y + Σ ′ Xy Σ − X ( x − µ X ) , σ y − Σ ′ Xy Σ − X Σ Xy (cid:1) . Therefore, theprediction for the quantitative variable y from a new observation x isˆ y = ˆ µ y + ˆ Σ ′ Xy ˆ Σ − X ( x − ˆ µ X ) , if ˆ Z = 1ˆ µ y + ˆ Σ ′ Xy ˆ Σ − X ( x − ˆ µ X ) , if ˆ Z = 2 . (11)Note that ˆ Σ ′ Xy ˆ Σ − X = − c y ˆ C ′ Xy where ˆ c y is a scalar, implying that the sparsity of ˆ C Xy willlead to the sparse model for the prediction of y .On the other hand, the prediction for the qualitative variable Z by the proposed modelis naturally based on the estimated LDA classification rule of (3) asln Pr( G | W = ( x ′ , ˆ y ) ′ )Pr( G | W = ( x ′ , ˆ y ) ′ ) = ln ˆ π ˆ π −
12 ( ˆ µ + ˆ µ ) ′ ˆ C ˆ δ + ( x ′ , ˆ y ) ˆ C ˆ δ . (12)From Equations (11) and (12), however, we note that the prediction of one response variabledepends on the information of the other. To address this issue, we propose to calculate twocandidate values of y for a new observation x by Equation (11) for two different classes,denoted by ˆ y and ˆ y . Then the conditional probability densities p ( W = ( x ′ , ˆ y ) ′ | G ) and p ( W = ( x ′ , ˆ y ) ′ | G ) can be estimated via the density functions of N ( ˆ µ , ˆ Σ ) and N ( ˆ µ , ˆ Σ ).Denote such two values as ˆ p and ˆ p . The prediction of y at this new observation is thenobtained as ˆ y k corresponding to the larger value of ˆ π k ˆ p k , k = 1 ,
2. To express it clearly, wedescribe the above steps of the model prediction in Algorithm 2 for a new observation x . Algorithm 2 (Prediction Procedure) . tep 1 : For k = 1 , , ˆ y k = ˆ µ ky + ˆ Σ ′ Xy ˆ Σ − X ( x − ˆ µ kX ) , and consequently obtain theprobability densities ˆ p k by plugging ( x ′ , ˆ y k ) ′ into the density functions of N ( ˆ µ k , ˆ Σ ) . Step 2a : If ˆ π ˆ p > ˆ π ˆ p , let ˆ y = ˆ y ; otherwise let ˆ y = ˆ y . Step 2b : Apply the LDA classification rule (12) to predict Z by w = ( x ′ , ˆ y ) ′ . It is seen that in Algorithm 2, we obtain the prediction of y first, and then predict Z with the value of ˆ y . One would argue that it is not a unique way of making predictions onQQ responses, as we may also predict Z first and then variable y . The following propositionprovides an interesting insight into this issue. Proposition 1.
For the prediction of variable Z by the proposed model, the class label k obtained from Step 2b of Algorithm 2 maximizes ˆ π k ˆ p k . Proposition 1 implies that we can predict the response variable Z by simply comparingvalues of ˆ π k ˆ p k instead of employing LDA. Therefore, the order of which response variable tobe predicted first is not a concern. Actually, the Step 2a and
Step 2b are equivalent tothe following
Step 2 as Step 2 : If ˆ π ˆ p > ˆ π ˆ p , let ˆ y = ˆ y and ˆ Z = 1; otherwise let ˆ y = ˆ y and ˆ Z = 2. The proposed generative modeling approach also has the advantage to enable the GAQQto deal with the qualitative response with multiple classes, i.e., the qualitative variable Z ∈ { , , . . . , K } . In such cases, the GAQQ method is extended and expressed as G k : W | Z = k ∼ N ( µ k , Σ ) , k = 1 , , . . . , K . Based on a baseline class G , we regularize on thedifference between means through µ k − µ for k = 2 , , . . . , K . The objective function isthus formulated asmin ( µ ,..., µ K , C ) − n ln | C | + K X k =1 X i ∈ G k ( w i − µ k ) ′ C ( w i − µ k ) + λ || C || + λ K X k =2 | µ k − µ | . (13)The subsequent derivation follows similar steps as that described in Section 2.1. With alittle abuse of notation, let K δ k = µ k − µ for k = 1 , , . . . , K and K γ = P Kk =1 µ k , then wehave µ k = γ − P Kg =2 δ g + K δ k , k = 1 , , . . . , K . As a result, the optimization problem (13)9an be re-written asmin ( δ ,..., δ K , γ , C ) − n ln | C | + K X k =1 X i ∈ G k ( w i − γ + K X g =2 δ g − K δ k ) ′ C ( w i − γ + K X g =2 δ g − K δ k )+ λ || C || + λ K X k =2 | δ k | . (14)Let n k represent the number of observations belonging to class G k . The maximum likelihoodestimator of γ from (14) is ˆ γ = ¯ w + P Kg =2 δ g − Kn P Kg =2 n g δ g . Consequently, the optimizationproblem (14) becomesmin ( δ ,..., δ K , C ) − n ln | C | + K X k =1 X i ∈ G k ( w i − ¯ w + Kn K X g =2 n g δ g − K δ k ) ′ C ( w i − ¯ w + Kn K X g =2 n g δ g − K δ k ) + λ || C || + λ K X k =2 | δ k | . (15)Let ˜ S = P Kk =1 P i ∈ G k ( w i − ¯ w + Kn P Kg =2 n g δ g − K δ k )( w i − ¯ w + Kn P Kg =2 n g δ g − K δ k ) ′ , thenthe formula (15) can be decomposed as one Glasso problemmin C − n ln | C | + tr( C ˜ S ) + λ || C || , and min ( δ ,..., δ K ) K X k =1 X i ∈ G k ( w i − ¯ w + Kn K X g =2 n g δ g − K δ k ) ′ C ( w i − ¯ w + Kn K X g =2 n g δ g − K δ k )+ λ K X k =2 | δ k | . (16)The optimization problem (16) is equivalent to the following K − δ k ( ˜ y − C / δ k ) ′ ( ˜ y − C / δ k ) + λ | δ k | , k = 2 , , . . . , K, (17)where ˜ y = Knn k C / " ( n − n k ) P i ∈ G k w i − n k P i/ ∈ G k w i + Kn k K P g =2 ,g = k n g δ g . The detailed deriva-tion from (16) to (17) is provided in the Appendix. Therefore, the parameters δ k and C can be solved iteratively until convergence following the spirit of Algorithm 1. The optimal10alues of tuning parameters are chosen by the BIC-type criterion extended for the multi-classproblem as BIC( λ , λ ) = − n ln | ˆ C | + tr( ˆ C ˜ S ) + ( v ( ˆ δ ) + v ( ˆ C ) + K −
1) ln( n ) , where v ( ˆ δ ) represents the number of nonzero entries in all the estimates ˆ δ k .For a new observation x , the quantitative response y is predicted, similarly as in Algo-rithm 2, to be ˆ y k = ˆ µ ky + ˆ Σ ′ Xy ˆ Σ − X ( x − ˆ µ kX ), where k maximizes ˆ π k ˆ p k with ˆ p k = p ( W =( x ′ , ˆ y k ) ′ | G k ), computed by plugging ( x ′ , ˆ y k ) ′ into the density functions of N ( ˆ µ k , ˆ Σ ). Theclass label is estimated as ˆ Z = arg max k ˆ π k ˆ p k , or equivalently by the LDA rule asˆ Z = arg max k ln ˆ π k ˆ π + K (( x ′ , ˆ y k ) ′ − ˆ µ + ˆ µ k ′ ˆ C ˆ δ k . In this section, we will investigate the asymptotic optimality of the classification rule bythe proposed GAQQ method in Theorem 1 to Theorem 3. The asymptotic consistencyproperties of the prediction of y by the GAQQ method are established in Theorem 4. Forthe proposed classification rule, we first establish the theoretical results for the multi-classproblem and then provide a thorough discussion of the two-class case. We use the samedefinition of asymptotic optimality for a classification rule as defined in Shao et al. (2011).Denote by R Bayes and R P ROP ( T ) the Bayes error and the conditional misclassification rateof the proposed rule, where T denotes the training samples. The asymptotic optimality fora classification rule is defined as follows. Definition 1.
Let T be a classification rule with conditional misclassification rate R T ( T ) ,given the training samples T .(1) T is asymptotically optimal if R T ( T ) /R Bayes P → .(2) T is asymptotically sub-optimal if R T ( T ) − R Bayes P → . Note that if lim n →∞ R Bayes >
0, then the asymptotically sub-optimality is the same as theasymptotically optimality. To facilitate the construction of theoretical results, we need tointroduce some notation and make assumptions on the true model. Define the true values11f µ k , Σ , C and δ k as µ k , Σ , C and δ k = K ( µ k − µ ) = (( δ kX ) ′ , δ ky ) ′ , where δ kX is a p − X between classes G and G k . Denote the true inverse covariance matrix of variable X by C X . Also define∆ k = p ( δ kX ) ′ C X δ kX and ∆ = max { ∆ k } Kk =1 . Denote S δ k = { j ; ( δ k ) j = 0 } , which is the setcontaining location indices of the nonzero entries in δ k . Let ˜ s k be the cardinalities of set S δ k . Define s k = ˜ s k if δ ky = 0; otherwise s k = ˜ s k −
1. That is, s k is the number of nonzeroentries of δ kX . Additionally, we use the same sparsity measure on Σ = ( σ ij ) ≤ i,j ≤ p as inBickel and Levina (2008), which is S h ; p = max i ≤ p P pj =1 | σ ij | h where 0 ≤ h < isdefined to be 0. Hence firstly, S p equals the maximum of the numbers of nonzero entries ineach row of the matrix Σ . In this case, a smaller value of S p compared with p implies asparse structure in matrix Σ . Secondly, if S h ; p is smaller than p for 0 < h <
1, it indicatesthat many entries of matrix Σ are very small. Moreover, we assume the following regularityconditions. • (C1) There exists a constant θ such that 0 < θ − < λ min ( C ) ≤ λ max ( C ) < θ < ∞ ,where λ min ( C ) and λ max ( C ) are the minimum and maximum eigenvalues of matrix C . • (C2) λ = O ( p log p/n ), λ = O ( p log p/n ). • (C3) Restricted eigenvalue condition: for some constant ϕ k >
0, assume C satisfies n k ( C ) / δ k k ≥ ϕ k k δ k k for all subsets J ⊆ { , . . . , p } such that the cardinality of J equals ˜ s k , and | ( δ k ) J c | ≤ | ( δ k ) J | . Here ( δ k ) J = (( δ k ) j · I { j ∈ J } ) ≤ j ≤ p , and J c represents the complement set of J . • (C4) Irrepresentable condition: without loss of generality, write δ k = (( δ k ) ′ S δ , ( δ k ) ′ S cδ ) ′ ,and correspondingly let C = Ψ , Ψ Ψ , Ψ , where Ψ is an ˜ s k × ˜ s k matrix. Thenthere exists a positive constant vector ζ such that | Ψ Ψ − sign(( δ k ) ′ S δ ) | ≤ − ζ , where is a p − ˜ s k dimensional unit vector, and the inequality holds element-wise. • (C5) There exist 0 ≤ c < c ≤ M >
0, such that n − c min ≤ i ≤ ˜ s k | ( δ k ) i | ≥ M ,˜ s k = O ( n c ). λ = o ( n c − c ), p = o ( λ /n ).12 (C6) There exists a constant c > δ kX − δ lX ) ′ C X ( δ kX − δ lX ) > c > k = l . • (C7) There exists a constant c such that c − ≤ Kπ k ≤ c , k = 1 , , . . . , K .By conditions (C1) and (C2), Rothman et al. (2008) and Lam and Fan (2009) derivedthe convergence rate of Glasso estimate. We thus have k ˆ C X − C X k = O p ( d n ) , (18)where d n = S h ; p ( log pn ) (1 − h ) / , and k A k is the matrix spectral norm defined as the squaredroot of the maximum eigenvalue of matrix A ′ A . The conditions (C2) and (C3) are used inB¨uhlmann and Van De Geer (2011) to study the theoretical property of Lasso estimate, andwe have k ˆ δ kX − δ kX k = O p ( b ( n ) k ) , (19)where b ( n ) k = q ˜ s k log pnϕ k . Under conditions (C4) and (C5), Zhao and Yu (2006) showed thatthe Lasso estimate is model selection consistency, which will be used for investigating δ k in(17). Condition (C6) requires that all the classes should be separated from each other. Alsonote that condition (C6) is equivalent to that ∆ k is bounded away from 0. The condition(C7) guarantees a balanced sample size for each class, which is commonly used in literatureto bound the term log π k π in the LDA rule for establishing the properties of classificationrules. Based on the above results, we present the following theories on the consistency ofthe classification rule by the proposed method. Theorem 1.
Assume that conditions (C1) - (C7) hold, and ξ n ; k = max { d n , b ( n ) k ∆ k , p s k S h ; p √ n ∆ k for any k } → . Then the proposed rule for the multi-class problem is asymptotically sub-optimal if either oneof the following two conditions is satisfied(1) ∆ = max { ∆ k } Kk =1 is bounded;(2) if ∆ → ∞ , then there exists a constant α ∈ (0 , / such that ∆ ξ − αn ; k → . Theorem 1 establishes the sub-optimality property of the proposed classification rule forthe multi-class problem. In the case of two-class problem, the Bayes error can be expressed13n a closed form of R Bayes = Φ( − ∆ /
2) when the data are from normal distribution, whereΦ represents the cumulative distribution function of N (0 , = p ( δ X ) ′ C X δ X = p ( µ X − µ X ) ′ C X ( µ X − µ X ). Accordingly, in Theorems 2 and 3, we can compute theconvergence rate of the proposed rule for the two-class problem, and subsequently investigateits properties. Theorem 2.
Assume that conditions (C1) - (C7) hold with K = 2 , and ξ n = max { d n , b ( n )2 ∆ , p s S h ; p √ n ∆ } → . Then we have R P ROP ( T ) = Φ( − ∆ [1 + O p ( ξ n )]) . Moreover, we establish the following properties.
Theorem 3.
Assuming all the conditions in Theorem 2 are satisfied, we have(1) if ∆ is bounded, then the proposed rule is asymptotically optimal and R PROP ( T ) R Bayes − O p ( ξ n ) ;(2) if ∆ → ∞ , then the proposed rule is asymptotically sub-optimal;(3) if ∆ → ∞ and ξ n ∆ → , then the proposed rule is asymptotically optimal. Theorem 2 provides the convergence rate of the proposed classification rule for the two-class problem with respect to ξ n . Base on such a result, Theorem 3 demonstrates that theproperty of the proposed classification rule (optimality or sub-optimality) depends on thescenarios of the true model’s ∆ . Specifically, (1) when ∆ is bounded, i.e. lim n →∞ R Bayes > R P ROP ( T ) converges in probability to the same limit as R Bayes . (2) When ∆ → ∞ , i.e. R Bayes →
0, then R P ROP ( T ) P →
0; in this case, if we further have ξ n ∆ →
0, then R P ROP ( T )and R Bayes have the same convergence rate.Next, we derive the consistency property for the proposed estimate of y . Denote by ˆ y P thepredicted value of y obtained from the proposed model. Define ˆ y B to be the predicted value of y for x when all parameters are known. Specifically, first obtain the class label k via the BayesLDA rule, then ˆ y B = y k = µ ky + Σ ′ Xy Σ − X ( x − µ kX ). Hence, the mean squared errors ( M SE )of estimates ˆ y B and ˆ y P are M SE
Bayes = E [(ˆ y B − y ) |T ] and M SE
P ROP = E [(ˆ y P − y ) |T ].Now we establish the theoretical results of ˆ y P in Theorem 4.14 heorem 4. Assume that conditions (C1) - (C7) hold and conditions in Theorem 1 aresatisfied. Then we have
M SE
P ROP − M SE
Bayes P → , for the multi-class qualitative response. This result compares the
M SE of the proposed estimate of y with that from the optimalBayes rule (under all parameters known). Since the classification errors from a classificationrule might be larger than 0, the M SE of ˆ y may not converge to 0 even though the sample size n is sufficiently large. Here we adopt the M SE
Bayes as a reasonable performance benchmarkto evaluate the property of the proposed model with respect to y . Theorem 4 states thatthe difference of M SE between the proposed model and the Bayes method converges to 0in probability.
In this section, we evaluate the performance of the proposed GAQQ method for a binaryresponse Z under different inverse covariance matrices C and mean differences δ . Theproposed GAQQ model is compared with several benchmark methods, denoted as GLDA,CL, and ENET, which use the predictor variables X to predict Z and y . The GLDA employsthe LDA classification rule for Z using the generalized inverse of the sample covariance matrixof X when p > n . The CL method applies the LPD technique introduced by Cai and Liu(2012) to predict the response variable Z based on X . With their estimated class label of Z , the GLDA and CL predict y by Equation (11). The ENET method uses the elastic-netlogistic model (Zou and Hastie, 2005) on predictor variables X to fit the qualitative response Z and hence predicts Z for the testing data. For the quantitative response y , the ENETseparately fits two elastic-net linear regressions for two classes using training data and thenpredicts y in the testing data based on its estimated Z . The tuning parameters of the CLand ENET methods are chosen by cross-validation.15egarding the inverse covariance matrix C , we consider the following five structures in thesimulation, which are commonly used in the literature (Yuan and Lin, 2007; Kang and Deng,2020). • Model 1. C = I . c ij = 1 if i = j and 0 otherwise; • Model 2. C = AR(0.6). The conditional covariance between any two random variablesis fixed to be 0 . | i − j | , 1 ≤ i, j ≤ p . • Model 3. C is generated by randomly permuting rows and corresponding columns ofthe matrix C . • Model 4. C = CS(0 . I , where CS(0.6) represents a 5 × indicates a matrix with all entries 0. • Model 5. C = Θ + α I , where the diagonal entries of Θ are zeros and Θ ij = Θ ji = b ∗ U nif ( − ,
1) for i = j , where b is from the Bernoulli distribution with probability0.15 equal 1. Each off-diagonal entry of Θ is generated independently. The value of α is gradually increased to make sure that C is positive definite.Model 1 is the simplest sparse matrix indicating that variables are independent of eachother. Model 4 is a sparse matrix indicating that only the first 5 variables are correlated.This matrix has more sparsity as the dimensionality increases. Models 2 and 3 are relativelydense matrices, and they also become more sparse when the dimensionality increases. All ofthese four matrices have sparse structures to some extent, while Model 5 is a general sparsematrix with no structure, which is similarly used in Bien and Tibshirani (2011).For the mean difference δ , we consider two different levels of sparsity. The µ is thevector with all elements zeros. Then generate µ such that (S1): 25% of the elements in µ are zeros; (S2): 75% of the elements in µ are zeros. The positions of zeros in µ are randomly distributed with its nonzero values independently generated from uniformdistribution U nif (0 , p ∈ { , , } , and generate n = 30 observationsfrom N ( µ , C − ) as well as n = 30 observations from N ( µ , C − ) as the training set.The same procedure is employed to generate the testing data, which is used to evaluate16he prediction performance of y and Z for different compared methods. We consider theroot mean squared prediction error RMSPE = q n P ni =1 ( y i − ˆ y i ) to measure the predictionaccuracy for the quantitative response y , where ˆ y i represents the predicted value. Theprediction performance of the qualitative response Z is measured by the misclassificationerror ME = n P ni =1 I ( z i = ˆ z i ), where ˆ z i is the predicted value of z i and I ( · ) is an indicatorfunction.Tables 1 and 2 report the averaged MEs in percentage and averaged RMPSE, as well astheir corresponding standard errors in parenthesis for each approach over 100 replications. Itcan be seen from Table 1 that the proposed method generally outperforms other approacheswith respect to MEs. Such an advantage becomes more significant as the underlying modelsare more sparse. Specifically, in the scenario of S1 = 25% and p = 40, the proposed GAQQmethod does not perform as well as others, since the underlying models in this scenario arethe least sparse, especially for the dense models 2 and 3. In contrast, the proposed methodproduces relatively better comparison results in the scenario of S2 = 75% and p = 40, wherethe true mean difference is more sparse. Furthermore, this advantage of the proposed methodis well evidenced in the scenario of p = 80, and even more notable when p = 200 with itssubstantially lower MEs than other methods.From Table 2, we observe that the proposed method generally gives superior performanceover other compared approaches for each scenario in predicting the quantitative response y .The possible explanations are in two folds. First, the proposed GAQQ method provides anaccurate classification of the qualitative response Z . Second, the proposed GAQQ has aproper estimation of C by the regularization that is used in the prediction of quantitativeresponse y according to (11), resulting in an improvement of the prediction accuracy. It isalso seen that the CL and GLDA methods are comparable in some cases, possibly becauseboth of them use the generalized inverse of the sample covariance of X for ˆ Σ − X in theprediction of quantitative response y in (11). 
But the CL method is generally better sinceit has more accurate classification results than the GLDA in Table 1.17able 1: Averages and standard errors (in parenthesis) of misclassification errors (MEs) inpercentage for methods in comparison.Model 1 Model 2 Model 3 Model 4 Model 5 p = 40 ENET 25.3(0.51) 25.4(0.44) 25.0(0.49) 24.8(0.46) 25.6(0.45)S1 GLDA 9.28(0.42) 10.2(0.75) 9.40(0.66) 5.28(0.33) 9.13(0.33)CL 1.68(0.21) 10.9(1.04) 8.75(1.00) 4.58(0.51) 1.20(0.22)GAQQ 2.92(0.26) 9.93(0.41) 17.7(0.50) 2.33(0.24) 2.67(0.22)ENET 25.3(0.45) 26.1(0.47) 24.3(0.42) 26.2(0.42) 25.1(0.41)S2 GLDA 20.0(0.66) 9.78(0.41) 13.1(0.49) 22.8(0.63) 18.8(0.58)CL 6.92(0.39) 4.90(0.33) 8.63(0.50) 8.65(0.38) 8.48(0.40)GAQQ 5.32(0.32) 4.52(0.28) 7.10(0.28) 6.28(0.30) 5.88(0.29) p = 80 ENET 25.0(0.51) 24.4(0.44) 24.8(0.44) 23.9(0.48) 25.8(0.54)S1 GLDA 7.88(0.43) 11.2(0.49) 12.6(0.52) 8.03(0.38) 11.3(0.49)CL 6.37(1.38) 10.4(0.66) 11.2(0.63) 5.63(1.48) 9.38(1.57)GAQQ 0.10(0.04) 2.97(0.22) 2.22(0.19) 0.07(0.03) 3.53(0.22)ENET 24.8(0.42) 25.3(0.39) 25.2(0.43) 24.3(0.43) 23.9(0.52)S2 GLDA 19.5(0.55) 26.7(0.68) 24.8(0.76) 16.1(0.63) 20.9(0.61)CL 4.67(0.37) 20.5(0.77) 18.2(1.01) 2.65(0.30) 14.5(0.92)GAQQ 1.08(0.13) 10.6(0.37) 5.42(0.30) 0.57(0.11) 3.33(0.22) p = 200 ENET 24.3(0.41) 24.9(0.52) 24.4(0.42) 25.2(0.43) 24.6(0.43)S1 GLDA 2.25(0.20) 14.1(0.52) 13.6(0.42) 3.13(0.22) 7.23(0.36)CL 2.08(0.20) 2.88(0.18) 2.72(0.17) 2.12(0.21) 2.26(0.15)GAQQ 0.22(0.06) 0.47(0.09) 0.20(0.05) 0.23(0.07) 0.05(0.03)ENET 25.3(0.40) 25.5(0.40) 24.6(0.47) 25.7(0.49) 25.3(0.51)S2 GLDA 9.73(0.40) 20.5(0.55) 24.3(0.57) 9.08(0.40) 15.0(0.50)CL 1.46(0.14) 2.96(0.16) 2.29(0.11) 2.06(0.18) 2.38(0.16)GAQQ 0.01(0.00) 1.10(0.16) 1.55(0.16) 0.02(0.01) 0.17(0.05)18able 2: Averages and standard errors (in parenthesis) of root mean squared predictionerrors (RMSPE) for methods in comparison.Model 1 Model 2 Model 3 Model 4 Model 5 p = 40 ENET 1.18(0.01) 1.62(0.02) 1.84(0.02) 1.91(0.01) 1.73(0.02)S1 GLDA 1.82(0.03) 2.00(0.04) 2.03(0.04) 1.82(0.03) 1.82(0.03)CL 1.79(0.03) 1.97(0.06) 2.03(0.08) 1.65(0.03) 1.74(0.03)GAQQ 1.07(0.01) 1.21(0.01) 1.49(0.01) 1.02(0.01) 1.17(0.02)ENET 1.59(0.01) 1.20(0.01) 1.42(0.02) 1.22(0.01) 1.12(0.01)S2 GLDA 1.93(0.03) 1.96(0.03) 1.90(0.03) 1.85(0.03) 1.77(0.02)CL 1.82(0.03) 1.78(0.03) 1.70(0.03) 1.67(0.03) 1.58(0.03)GAQQ 1.07(0.01) 1.14(0.01) 1.37(0.01) 0.98(0.01) 1.09(0.01) p = 80 ENET 1.58(0.01) 1.77(0.02) 1.68(0.02) 1.37(0.01) 1.75(0.01)S1 GLDA 2.01(0.03) 2.62(0.04) 2.38(0.04) 2.08(0.03) 1.92(0.03)CL 2.02(0.04) 2.63(0.07) 2.31(0.05) 1.88(0.03) 1.72(0.03)GAQQ 1.11(0.01) 1.26(0.01) 1.54(0.01) 1.11(0.01) 1.31(0.01)ENET 1.08(0.01) 1.31(0.01) 1.44(0.02) 1.61(0.01) 1.11(0.01)S2 GLDA 1.96(0.03) 2.56(0.05) 2.27(0.04) 2.04(0.03) 2.36(0.04)CL 1.76(0.03) 2.38(0.05) 2.10(0.04) 1.85(0.03) 2.20(0.04)GAQQ 1.02(0.01) 1.10(0.01) 1.39(0.01) 0.99(0.01) 1.11(0.01) p = 200 ENET 1.66(0.02) 1.24(0.01) 1.68(0.02) 1.07(0.01) 1.61(0.02)S1 GLDA 1.24(0.01) 1.58(0.02) 1.62(0.02) 1.21(0.01) 1.40(0.02)CL 1.27(0.03) 1.60(0.02) 1.65(0.02) 1.28(0.02) 1.36(0.03)GAQQ 1.08(0.01) 1.27(0.01) 1.44(0.02) 1.06(0.01) 1.15(0.01)ENET 1.03(0.01) 1.65(0.01) 1.62(0.01) 1.22(0.01) 1.17(0.01)S2 GLDA 1.19(0.01) 1.67(0.02) 1.76(0.02) 1.22(0.01) 1.32(0.01)CL 1.19(0.01) 1.55(0.02) 1.74(0.02) 1.23(0.01) 1.33(0.01)GAQQ 1.01(0.01) 1.25(0.01) 1.43(0.01) 1.01(0.01) 1.15(0.01)19 .2 Multi-class Settings of the Qualitative Response Now, we examine the performance of the proposed GAQQ method for multi-class settingsof the qualitative response. 
We consider p = 200 and K = 4 classes of qualitative response Z with training sizes n = n = n = n = 30 for Models 1 - 5 of inverse covariance matrix C . Let µ kj represent the j th entry of the mean value µ k . Generate µ kj = 0 . ∗ k + u kj for j = 2 k − , k, k + 1 , . . . , k + 6, otherwise µ kj = 0, where u kj is from U nif ( − , N ( µ k , C − ), and the testing data follow the same generationprocedure. We compare the proposed method with the GLDA, as well as the estimatorsproposed by Witten and Tibshirani (2011) (WT) and Clemmensen et al. (2011) (CHWE),where the latter two methods are designed for multi-class problems. We use the WT andCHWE models to first predict the class label Z for the testing data, and then the response y is estimated, by the multivariate normal property, as ˆ µ ky + ˆ Σ ′ Xy ˆ Σ − X ( x − ˆ µ kX ) if theirestimates ˆ Z = k . The results of performance measures, ME and RMSPE are summarizedin Table 3 based on 100 replications. One can see that the GAQQ method performs betterthan the GLDA as well as the WT method, and is comparable with the CHWE in termsof the MEs. Besides, the GAQQ method gives the best performance among the comparedapproaches with significantly lower values of RMSPE. In this section, we apply the proposed GAQQ method to two real-data case studies. Thefirst one is from the study of Heusler compounds in material science and the second one isfrom the study of molecular diagnostics of Ulcerative colitis and Crohn’s disease. Althoughfrom different fields, both problems contain QQ responses with high-dimensional predictors,and the proposed GAQQ method appears to have much better performance in terms ofprediction accuracy compared with other methods.The case study on material sciences is regarding the Heusler compounds, which are a largefamily of intermetallics with more than 1000 known members. Many Heusler compoundshave shown exotic properties, such as superconductivity and topological band structures,which have promising applications for quantum computing. Understanding the thermody-20able 3: Averages and standard errors (in parenthesis) of MEs in percentage and RMSPEfor methods in comparison for multi-class settings of p = 200. Model 1 Model 2 Model 3 Model 4 Model 5MEGLDA 40.58 (0.46) 59.15 (0.56) 55.56 (0.51) 39.40 (0.48) 48.28 (0.55)WT 17.26 (0.35) 43.73 (0.59) 43.01 (0.47) 17.90 (0.38) 33.17 (0.67)CHWE 14.70 (0.32) 25.14 (0.40) 32.16 (0.50) 16.66 (0.44) 21.31 (0.44)GAQQ 14.02 (0.32) 25.36 (0.53) 33.34 (0.50) 17.11 (0.42) 20.99 (0.49)RMSPEGLDA 1.64 (0.01) 2.09 (0.02) 2.02 (0.02) 1.66 (0.01) 1.71 (0.02)WT 1.56 (0.01) 2.05 (0.02) 1.94 (0.02) 1.57 (0.01) 1.63 (0.02)CHWE 1.56 (0.01) 2.01 (0.02) 1.92 (0.02) 1.55 (0.01) 1.61 (0.02)GAQQ 0.99 (0.01) 1.11 (0.01) 1.27 (0.01) 1.01 (0.01) 1.39 (0.02) namic stability of Heusler compounds lays the foundation for exploiting the large chemicalspace to discover and design new functional Heusler materials (Liu et al., 2016). To deter-mine the thermodynamic stability of Heusler compounds, there are two key metrics: themixing enthalpy (quantitative response) and the global stability based on hull energy (bi-nary qualitative response). The comprehensive database of 180628 full Heusler structureswas built by collecting the relevant structural and energetic data from the Materials Project(Jain et al., 2013), OQMD (Saal et al., 2013), and AFLOW (Curtarolo et al., 2012). 
Thesedata were calculated using first-principles methods based on density functional theory, andit was extremely computationally expensive (taking hours) to generate one entry of the data.Therefore, a statistical model that can accurately predict the thermodynamic stability forany elemental and compound features is a useful surrogate of the first-principle computationmodels.Since there is an intrinsic relationship between two QQ responses, the proposed GAQQmethod is suitable to improve the prediction accuracy by jointly fitting them together. Todemonstrate the GAQQ method in the scenario when the number of predictors is largerelative to the size of the data, we randomly choose 150 samples from each class of the21inary response. We delete the predictor variables whose standard deviations are less than1 . e − , resulting in 157 predictors of elemental and compound features. To examine theprediction performance of the GAQQ method and other comparison methods, we randomlydivide data into a training set with a size of 200 and a testing set with a size of 100. Table4 reports the prediction performance results based on 50 random splits of the Heusler data.From the results, it is seen that the proposed GAQQ performs much better than othermethods in comparison, with the smallest values for the misclassification error (ME) and theroot mean squared prediction error (RMSPE).Table 4: The MEs in percentage and RMSPE of Heusler and gene expression data. Heusler DataMethods GLDA ENET CL GAQQME 27.27 (1.828) 11.87 (0.332) 16.20 (0.688) 10.49 (0.363)RMSPE 1.797 (0.445) 0.317 (0.083) 1.046 (0.053) 0.142 (0.002)IBD Gene DataMethods GLDA WT CHWE GAQQME 21.90 (0.800) 24.80 (0.583) 18.10 (0.555) 15.77 (0.584)RMSPE 0.743 (0.014) 0.751 (0.014) 0.746 (0.014) 0.661 (0.011)
The second data for the case study considers the multi-class settings of the qualitativeresponse. The IBD gene data (Burczynski et al., 2006) are gene expressions on Ulcerativecolitis (UC) and Crohn’s disease (CD), two of which are common inflammatory bowel diseases(IBD) producing intestinal inflammation and tissue damage. The IBD data set was collectedat North American and European clinical sites from blood samples of 42 healthy individuals,59 CD patients, and 26 UC patients with 22,283 genes. An exploratory analysis, similarlyconducted as in Shao et al. (2011), is performed as variable screening by one-way ANOVAwith three levels (healthy individuals, CD patients, and UC patients). We choose the top 101significant gene variables to form the data for methods comparison. To create a quantitativeresponse, one gene variable is randomly chosen as the quantitative response from the 101significant variables. The data set is then randomly partitioned into a training set with 67samples and testing data with the rest 60 samples. Table 4 presents the comparison results22y the GLDA, WT, CHWE, and proposed GAQQ methods based on 50 random splits ofthe data. We observe that the proposed GAQQ method performs substantially well withrelatively lower values of ME and RMSPE, as well as their corresponding standard errors inthe parenthesis. Such empirical results demonstrate that the proposed GAQQ method canachieve accurate predictions for both QQ responses in high-dimensional data.
In this work, we propose a generative modeling approach to jointly model the data withQQ responses, which is a new perspective different from existing methods in the literature.By fully exploring the joint distribution of the QQ responses and predictor variables, theproposed method enables efficient parameter estimation, accurate prediction, and lays agood foundation for investigating the asymptotic properties. The proposed model can benaturally extended to the situation for multiple quantitative responses.One further research direction is to accommodate a more flexible structure on the jointdistribution of QQ responses and predictor variables. For example, one can extend theLDA for the classification of the qualitative response to the quadratic discriminant analysis(QDA). The QDA is more flexible with different covariance structures in each class, but itsestimation for high-dimensional data would encounter more difficulty due to a large numberof parameters. Besides, the derivation of its asymptotic properties is much more technicallycomplicated (Li and Shao, 2015). Another research direction is to apply the generativemodeling approach for the data with semi-continuous responses (Wang et al., 2020), or theordinal and quantitative responses. One may employ the ordinal regression for the ordinalresponse, and then derive its joint likelihood function with appropriate regularization.
References
Klein N, Kneib T, Marra G, Radice R, Rokicki S, McGovern ME. Mixed binary-continuouscopula regression models with application to adverse birth outcomes. Statistics in Medicine2019;38(3):413–436. 23itzmaurice GM, Laird NM. Regression models for a bivariate discrete and continuousoutcome with clustering. Journal of the American statistical Association 1995;90(431):845–852.Moustaki I, Knott M. Generalized latent trait models. Psychometrika 2000;65(3):391–411.Dunson DB. Bayesian latent variable models for clustered mixed outcomes. Journal of theRoyal Statistical Society: Series B (Statistical Methodology) 2000;62(2):355–366.Gueorguieva RV, Agresti A. A correlated probit model for joint modeling of clusteredbinary and continuous responses. Journal of the American Statistical Association2001;96(455):1102–1112.Dunson DB. Dynamic latent trait models for multidimensional longitudinal data. Journalof the American Statistical Association 2003;98(463):555–563.Deng X, Jin R. QQ models: Joint modeling for quantitative and qualitative quality responsesin manufacturing systems. Technometrics 2015;57(3):320–331.K¨ur¨um E, Li R, Shiffman S, Yao W. Time-varying coefficient models for joint modelingbinary and continuous outcomes in longitudinal data. Statistica Sinica 2016;26(3):979–1000.Kang L, Kang X, Deng X, Jin R. A Bayesian hierarchical model for quantitative andqualitative responses. Journal of Quality Technology 2018;50(3):290–308.Amini P, Verbeke G, Zayeri F, Mahjub H, Maroufizadeh S, Moghimbeigi A. Longitudinaljoint modelling of binary and continuous outcomes: A comparison of bridge and normaldistributions. Epidemiology, Biostatistics and Public Health 2018;15(1).Fitzmaurice GM, Laird NM. Regression models for mixed discrete and continuous responseswith potentially missing values. Biometrics 1997;53(1):110–122.Song PXK, Li M, Yuan Y. Joint regression analysis of correlated data using Gaussiancopulas. Biometrics 2009;65(1):60–68. 24in L, Bandyopadhyay D, Lipsitz SR, Sinha D. Association models for clustered data withbinary and continuous responses. Biometrics 2010;66(1):287–293.Chen S, Witten DM, Shojaie A. Selection and estimation for mixed graphical models.Biometrika 2014;102(1):47–64.Yang E, Baker Y, Ravikumar P, Allen G, Liu Z. Mixed graphical models via exponentialfamilies. Proceedings of the Seventeenth International Conference on Artificial Intelligenceand Statistics 2014;33:1042–1050.Guglielmi A, Ieva F, Paganoni AM, Quintana FA. A semiparametric Bayesian joint model formultiple mixed-type outcomes: an application to acute myocardial infarction. Advancesin Data Analysis and Classification 2018;12(2):399–423.Sammel MD, Ryan LM, Legler JM. Latent variable models for mixed eiscrete and continuousoutcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology)1997;59(3):667–678.Dunson DB, Herring AH. Bayesian latent variable models for mixed discrete outcomes.Biostatistics 2005;6(1):11–25.Bello NM, Steibel JP, Tempelman RJ. Hierarchical Bayesian modeling of heterogeneousclusterand subject-level associations between continuous and binary outcomes in dairyproduction. Biometrical Journal 2012;54(2):230–248.Shao J, Wang Y, Deng X, Wang S. Sparse linear discriminant analysis by thresholding forhigh dimensional data. Annals of Statistics 2011;39(2):1241–1265.Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine LearningResearch 2006;7(12):2541–2563.Cai T, Liu W. A direct estimation approach to sparse linear discriminant analysis. 
Journalof the American Statistical Association 2012;106:1566–1577.Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika2007;94(1):19–35. 25eng X, Yuan M. Large Gaussian covariance matrix estimation with Markov structures.Journal of Computational and Graphical Statistics 2009;18(3):640–657.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) 1996;58(1):267–288.Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphicallasso. Biostatistics 2008;9(3):432–441.Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation.Annals of Statistics 2009;37(6B):4254–4278.Raskutti G, Yu B, Wainwright MJ, Ravikumar P. Model Selection in Gaussian Graphi-cal Models: High-Dimensional Consistency of l1-regularized MLE. Advances in NeuralInformation Processing Systems 2008;21:1329–1336.Liu Y, Ren Z, et al. Minimax estimation of large precision matrices with bandable Choleskyfactor. Annals of Statistics 2020;48(4):2428–2454.Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolutedeviation method. Biometrika 2007;94(3):553–568.Zou H, Zhang H. On the adaptive elastic-net with a diverging number of parameters. Annalsof Statistics 2009;37(4):1733–1751.Lv J, Fan Y. A unified approach to model selection and sparse recovery using regularizedleast squares. Annals of Statistics 2009;37(6A):3498–3528.Armagan A, Dunson DB, Lee J. Generalized double pareto shrinkage. Statistica Sinica2013;23(1):119–143.Bickel PJ, Levina E. Covariance regularization by thresholding. Annals of Statistics2008;36(6):2577–2604.Rothman AJ, Bickel PJ, Levina E, Zhu J, et al. Sparse permutation invariant covarianceestimation. Electronic Journal of Statistics 2008;2:494–515.26¨uhlmann P, Van De Geer S. Statistics for High-Dimensional Data. Verlag Berlin Heidelberg:Springer; 2011.Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of theRoyal Statistical Society: Series B (Statistical Methodology) 2005;67(2):301–320.Kang X, Deng X. On variable ordination of Cholesky-based estimation for a sparse covariancematrix. Canadian Journal of Statistics 2020;in press.Bien J, Tibshirani RJ. Sparse estimation of a covariance matrix. Biometrika 2011;98(4):807–820.Witten DM, Tibshirani R. Penalized classification using Fisher’s linear discriminant. Journalof the Royal Statistical Society: Series B (Statistical Methodology) 2011;73(5):753–772.Clemmensen L, Hastie T, Witten D, Ersbøll B. Sparse Discriminant Analysis. Technometrics2011;53(4):406–413.Liu Z, Yang L, Wu SC, Shekhar C, Jiang J, Yang H, et al. Observation of unusual topologicalsurface states in half-Heusler compounds LnPtBi (Ln=Lu,Y). Nature Communications2016;7(1):1–7.Jain A, Ong SP, Hautier G, Chen W, Richards WD, Dacek S, et al. Commentary: TheMaterials Project: A materials genome approach to accelerating materials innovation.Apl Materials 2013;1(1):011002.Saal JE, Kirklin S, Aykol M, Meredig B, Wolverton C. Materials design and discoverywith high-throughput density functional theory: the open quantum materials database(OQMD). Jom 2013;65(11):1501–1509.Curtarolo S, Setyawan W, Hart GL, Jahnatek M, Chepulskii RV, Taylor RH, et al. AFLOW:an automatic framework for high-throughput materials discovery. Computational Materi-als Science 2012;58:218–226. 
27urczynski ME, Peterson RL, Twine NC, Zuberek KA, Brodeur BJ, Casciotti L, et al.Molecular classification of Crohn’s disease and ulcerative colitis patients using transcrip-tional profiles in peripheral blood mononuclear cells. The Journal of Molecular Diagnostics2006;8(1):51–61.Li Q, Shao J. Sparse quadratic discriminant analysis for high dimensional data. StatisticaSinica 2015;25:457–473.Wang X, Feng X, Song X. Joint analysis of semicontinuous data with latent variables.Computational Statistics and Data Analysis 2020;p. 107005.28 ppendix
Derivation from (9) to (10). Let C denote a generic constant thereafter. X i ∈ G ( w i − n n δ − ¯ w ) ′ C ( w i − n n δ − ¯ w )+ X i ∈ G ( w i + 2 n n δ − ¯ w ) ′ C ( w i + 2 n n δ − ¯ w ) + λ | δ | = X i ∈ G ( C / w i − n n C / δ − C / ¯ w ) ′ ( C / w i − n n C / δ − C / ¯ w )+ X i ∈ G ( C / w i + 2 n n C / δ − C / ¯ w ) ′ ( C / w i + 2 n n C / δ − C / ¯ w ) + λ | δ | = X i ∈ G ( −
2( 2 n n C / δ ) ′ ( C / w i − C / ¯ w ) + 4 n n δ ′ Cδ )+ X i ∈ G (2( 2 n n C / δ ) ′ ( C / w i − C / ¯ x ) + 4 n n δ ′ Cδ ) + λ | δ | + C = − n n δ ′ C X i ∈ G w i + 4 n n δ ′ C ( n ¯ w ) + 4 n n n δ ′ Cδ + 4 n n δ ′ C X i ∈ G w i − n n δ ′ C ( n ¯ w ) + 4 n n n δ ′ Cδ + λ | δ | + C = 4 n n n δ ′ Cδ + 4 n n δ ′ C ( n ¯ w ) − δ ′ C X i ∈ G w i − n n δ ′ C ( n ¯ w ) + 4 δ ′ C ( n ¯ w ) + λ | δ | + C = 4 n n n δ ′ Cδ − δ ′ C ( X i ∈ G w i − n ¯ w ) + λ | δ | + C = 4 n n n ( ˜ y − C / δ ) ′ ( ˜ y − C / δ ) + λ | δ | + C, where ˜ y = n n n C / ( P i ∈ G w i − n ¯ w ) = n n C / ( n P i ∈ G w i − n P i ∈ G w i ). (cid:3) erivation from (16) to (17) . For δ j , j = 2 , , . . . , K , K X k =1 X i ∈ G k ( w i − ¯ w + Kn K X g =2 n g δ g − K δ k ) ′ C ( w i − ¯ w + Kn K X g =2 n g δ g − K δ k ) + λ | δ j | = K X k =1 ,k = j X i ∈ G k " C / ( w i − ¯ w + Kn K X g =2 ,g = j n g δ g − K δ k ) + C / Kn n j δ j ′ " C / ( w i − ¯ w + Kn K X g =2 ,g = j n g δ g − K δ k ) + C / Kn n j δ j + X i ∈ G j " C / ( w i − ¯ w + Kn K X g =2 ,g = j n g δ g ) + C / ( Kn n j δ j − K δ j ) ′ " C / ( w i − ¯ w + Kn K X g =2 ,g = j n g δ g ) + C / ( Kn n j δ j − K δ j ) + λ | δ j | = K X k =1 ,k = j X i ∈ G k " Kn j n ( C / δ j ) ′ ( C / w i − C / ¯ w + Kn C / K X g =2 ,g = j n g δ g − K C / δ k ) + K n j n δ ′ j Cδ j + X i ∈ G j " K ( n j n − C / δ j ) ′ ( C / w i − C / ¯ w + Kn C / K X g =2 ,g = j n g δ g ) + K ( n j n − δ ′ j Cδ j + λ | δ j | + C = Kn j n K X k =1 ,k = j " δ ′ j C X i ∈ G k w i − n k δ ′ j C ¯ w + 2 n k Kn δ ′ j C K X g =2 ,g = j n g δ g − n k K δ ′ j Cδ k + Kn j n k n δ ′ j Cδ j + Kn j ( n j n − δ ′ j C ( 1 n j X i ∈ G j w i ) − δ ′ j C ¯ w + 2 Kn δ ′ j C K X g =2 ,g = j n g δ g + K ( n j n − δ ′ j Cδ j + λ | δ j | + C = K n j ( n − n j ) n δ ′ j Cδ j − Kn δ ′ j C { K X k =1 ,k = j ( − n j X i ∈ G k w i + n j n k ¯ w − Kn j n k n K X g =2 ,g = j n g δ g + Kn j n k δ k ) − ( n j − n ) X i ∈ G j w i + n j ( n j − n ) ¯ w − Kn j ( n j n − K X g =2 ,g = j n g δ g } + λ | δ j | + C , K n j ( n − n j ) n δ ′ j Cδ j − Kn δ ′ j C M + λ | δ j | + C, (20)30here M = − n j X i/ ∈ G j w i + n j ( n − n j ) ¯ w − Kn j ( n − n j ) n K X g =2 ,g = j n g δ g + Kn j K X g =2 ,g = j n g δ g − ( n j − n ) X i ∈ G j w i + n j ( n j − n ) ¯ w − Kn j ( n j n − K X g =2 ,g = j n g δ g =( n − n j ) X i ∈ G j w i − n j X i/ ∈ G j w i + Kn j K X g =2 ,g = j n g δ g . Let ˜ y = Kn j ( n − n j ) C / M = Kn j ( n − n j ) C / " ( n − n j ) P i ∈ G j w i − n j P i/ ∈ G j w i + Kn j K P g =2 ,g = j n g δ g .Hence, formula (20) is equal to K n j ( n − n j ) n (˜ y − C / δ j ) ′ (˜ y − C / δ j ) + λ | δ j | + C. Lemma 1.
Suppose a random vector ( x ′ , y ′ ) ′ ∼ N ( µ , Σ ) , where x and y are multivariatevariables. For a given value of x , then y = µ Y + Σ ′ XY Σ − X ( x − µ X ) maximizes exp {− [( x ′ , y ′ ) − µ ′ ] Σ − [( x ′ , y ′ ) ′ − µ ] } , where µ = µ X µ Y and Σ = Σ X , Σ XY Σ ′ XY , Σ Y .Proof. We need to search for y to minimize [( x ′ , y ′ ) − µ ′ ] Ω [( x ′ , y ′ ) ′ − µ ], where Ω = Σ − = Ω X , Ω XY Ω ′ XY , Ω Y . That is, we minimize L ( y ) = ( x ′ , y ′ ) Ω ( x ′ , y ′ ) ′ − µ ′ Ω ( x ′ , y ′ ) ′ = ( x ′ , y ′ ) Ω X , Ω XY Ω ′ XY , Ω Y xy − µ ′ X , µ ′ Y ) Ω X , Ω XY Ω ′ XY , Ω Y xy = 2 x ′ Ω XY y + y ′ Ω Y y − µ ′ X Ω XY + µ ′ Y Ω Y ) y + C, where C is a constant not depending on y . Taking derivative of L ( y ) and setting to zeroyields ∂L ( y ) ∂ y = 2 Ω ′ XY x + 2 Ω Y y − Ω ′ XY µ X + Ω Y µ Y ) = 0 y = µ Y − Ω − Y Ω ′ XY ( x − µ X ) . This, together with a property of block matrix that Ω ′ XY = − Ω Y Σ ′ XY Σ − X , completes theproof. 31or a new observation x , let y = µ y + Σ ′ Xy Σ − X ( x − µ X ) and y = µ y + Σ ′ Xy Σ − X ( x − µ X ). Denote by p = p ( W = ( x ′ , y ) ′ | G ) and p = p ( W = ( x ′ , y ) ′ | G ). Now we proveProposition 1. Proof.
Proof of Proposition 1. Without loss of generality, suppose $\pi_1 p_1 > \pi_2 p_2$; we show that the LDA classification rule then assigns $(x', y_1)'$ to $G_1$. To achieve this, we only need to prove that
\[
p_2 \ge p(W = (x', y)' \mid G_2) \tag{21}
\]
for any value of $y$; that is, we need to prove that $W = (x', y_2)'$ maximizes the density function of $N(\mu_2, \Sigma)$, which is exactly the conclusion of Lemma 1. As a result, taking $y = y_1$ in (21),
\[
\pi_1 p_1 = \pi_1\, p(W = (x', y_1)' \mid G_1) > \pi_2 p_2 \ge \pi_2\, p(W = (x', y_1)' \mid G_2).
\]
Hence,
\[
p(x \in G_1 \mid W = (x', y_1)') = \frac{\pi_1 p_1}{p(W = (x', y_1)')}
> \frac{\pi_2\, p(W = (x', y_1)' \mid x \in G_2)}{p(W = (x', y_1)')}
= p(x \in G_2 \mid W = (x', y_1)'),
\]
implying that the LDA rule assigns $(x', y_1)'$ to $G_1$. $\Box$
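Lemma 1 is the key fact behind Proposition 1: for a jointly Gaussian vector, fixing $x$ and sliding $y$, the joint density peaks at the conditional mean. The following numerical check (illustrative only; the mean and covariance below are arbitrary and not from the paper) evaluates the joint density on a grid of $y$ values and compares the maximizer with the formula in Lemma 1.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
# joint distribution of (x, y) with x in R^3 and scalar y
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + np.eye(4)            # positive-definite covariance
mu = np.array([1.0, -0.5, 0.3, 2.0])   # (mu_X', mu_Y)'
Sigma_X, Sigma_XY = Sigma[:3, :3], Sigma[:3, 3]

x = np.array([0.2, 0.7, -1.1])
y0 = mu[3] + Sigma_XY @ np.linalg.solve(Sigma_X, x - mu[:3])   # conditional mean from Lemma 1

dist = multivariate_normal(mean=mu, cov=Sigma)
grid = np.linspace(y0 - 3, y0 + 3, 2001)
dens = [dist.pdf(np.r_[x, y]) for y in grid]
print(grid[int(np.argmax(dens))], y0)   # the two values agree up to the grid resolution
```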
Proposition 2. For an observation $x$, let $y_1 = \mu_{1y} + \Sigma_{Xy}'\Sigma_X^{-1}(x - \mu_{1X})$ and $y_2 = \mu_{2y} + \Sigma_{Xy}'\Sigma_X^{-1}(x - \mu_{2X})$. Denote $p_1 = p(W = (x', y_1)' \mid G_1)$ and $p_2 = p(W = (x', y_2)' \mid G_2)$. Then $p(x \in G_1 \mid X = x) > p(x \in G_2 \mid X = x)$ is equivalent to $\pi_1 p_1 > \pi_2 p_2$.

Proof. The inequality $p(x \in G_1 \mid X = x) > p(x \in G_2 \mid X = x)$ holds if and only if $\pi_1 p(X = x \mid x \in G_1) > \pi_2 p(X = x \mid x \in G_2)$, that is,
\[
\pi_1\exp\big\{-\tfrac12(x-\mu_{1X})'\Sigma_X^{-1}(x-\mu_{1X})\big\}
> \pi_2\exp\big\{-\tfrac12(x-\mu_{2X})'\Sigma_X^{-1}(x-\mu_{2X})\big\},
\]
or equivalently
\[
\ln\pi_1 - \tfrac12(x-\mu_{1X})'\Sigma_X^{-1}(x-\mu_{1X}) > \ln\pi_2 - \tfrac12(x-\mu_{2X})'\Sigma_X^{-1}(x-\mu_{2X}). \tag{22}
\]
On the other hand, $\pi_1 p_1 > \pi_2 p_2$ yields
\[
\ln\pi_1 - \tfrac12\begin{pmatrix}x-\mu_{1X}\\ y_1-\mu_{1y}\end{pmatrix}'\begin{pmatrix}\Sigma_X & \Sigma_{Xy}\\ \Sigma_{Xy}' & \sigma_y^2\end{pmatrix}^{-1}\begin{pmatrix}x-\mu_{1X}\\ y_1-\mu_{1y}\end{pmatrix}
> \ln\pi_2 - \tfrac12\begin{pmatrix}x-\mu_{2X}\\ y_2-\mu_{2y}\end{pmatrix}'\begin{pmatrix}\Sigma_X & \Sigma_{Xy}\\ \Sigma_{Xy}' & \sigma_y^2\end{pmatrix}^{-1}\begin{pmatrix}x-\mu_{2X}\\ y_2-\mu_{2y}\end{pmatrix}. \tag{23}
\]
Now we prove that Equations (22) and (23) are equivalent. Writing $\mathrm{Var}(y\mid X) = \sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy}$, the block-matrix inverse is
\[
\begin{pmatrix}\Sigma_X & \Sigma_{Xy}\\ \Sigma_{Xy}' & \sigma_y^2\end{pmatrix}^{-1}
=\begin{pmatrix}\Sigma_X^{-1} + \dfrac{\Sigma_X^{-1}\Sigma_{Xy}\Sigma_{Xy}'\Sigma_X^{-1}}{\mathrm{Var}(y\mid X)} & -\dfrac{\Sigma_X^{-1}\Sigma_{Xy}}{\mathrm{Var}(y\mid X)}\\[2mm]
-\dfrac{\Sigma_{Xy}'\Sigma_X^{-1}}{\mathrm{Var}(y\mid X)} & \dfrac{1}{\mathrm{Var}(y\mid X)}\end{pmatrix},
\]
so the left side of (23) equals
\begin{align*}
&\ln\pi_1 - \tfrac12\Big\{(x-\mu_{1X})'\Sigma_X^{-1}(x-\mu_{1X})
+ \frac{(x-\mu_{1X})'\Sigma_X^{-1}\Sigma_{Xy}\Sigma_{Xy}'\Sigma_X^{-1}(x-\mu_{1X})}{\mathrm{Var}(y\mid X)}\\
&\qquad\qquad - \frac{2(y_1-\mu_{1y})\,\Sigma_{Xy}'\Sigma_X^{-1}(x-\mu_{1X})}{\mathrm{Var}(y\mid X)} + \frac{(y_1-\mu_{1y})^2}{\mathrm{Var}(y\mid X)}\Big\}\\
&= \ln\pi_1 - \tfrac12(x-\mu_{1X})'\Sigma_X^{-1}(x-\mu_{1X}),
\end{align*}
where the last equality applies $y_1 - \mu_{1y} = \Sigma_{Xy}'\Sigma_X^{-1}(x-\mu_{1X})$, so that the three extra terms cancel. Similarly, the right side of (23) equals $\ln\pi_2 - \tfrac12(x-\mu_{2X})'\Sigma_X^{-1}(x-\mu_{2X})$. This completes the proof. $\Box$

The inequality $p(x \in G_1 \mid X = x) > p(x \in G_2 \mid X = x)$ in Proposition 2 indicates that the LDA rule assigns $x$ to $G_1$. Therefore, Proposition 2 implies that Step 2b of Algorithm 2 is equivalent to applying the LDA classification rule directly to $x$ instead of to $w = (x', \hat y)'$. This fact enables us to establish the consistency properties of the proposed classification rule in terms of the variable $X$ rather than $W = (X', y)'$. Before the proof of Theorem 1, we present Lemmas 2--4.
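The equivalence just noted can also be seen numerically. The sketch below is purely illustrative (the means, covariance, and priors are made up): it computes the LDA score once on $w = (x', \hat y_k)'$, with $\hat y_k$ imputed by the class-$k$ conditional mean as in Step 2b of Algorithm 2, and once on $x$ alone. The two scores differ by a constant that is common to both classes, so they produce the same classification.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
A = rng.standard_normal((p + 1, p + 1))
Sigma = A @ A.T + np.eye(p + 1)                    # shared covariance of W = (X', y)'
mu1 = np.zeros(p + 1)
mu2 = np.r_[np.ones(p), 0.5]
pi1, pi2 = 0.4, 0.6
Sigma_X, Sigma_Xy = Sigma[:p, :p], Sigma[:p, p]

def log_gauss(w, mu, S):
    d = w - mu
    return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.linalg.slogdet(S)[1]

x = rng.standard_normal(p)                          # a new observation of X
for k, (mu, pi) in enumerate([(mu1, pi1), (mu2, pi2)], start=1):
    y_k = mu[p] + Sigma_Xy @ np.linalg.solve(Sigma_X, x - mu[:p])  # imputed y for class k
    score_w = np.log(pi) + log_gauss(np.r_[x, y_k], mu, Sigma)     # LDA score on (x', y_k)'
    score_x = np.log(pi) + log_gauss(x, mu[:p], Sigma_X)           # LDA score on x alone
    print(k, round(score_w, 4), round(score_x, 4))
# the two scores differ by the same constant for both classes (a log-determinant term
# that does not depend on k), so they rank the classes identically
```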
Lemma 2. For any $k = 2, \ldots, K$, we have
\[
(\hat\delta_{kX}'\hat C_X - (\delta_{kX})'C_X)\,\Sigma_X\,(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})
= \Delta_k^2\Big[O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big]
\]
for the multi-class problem.

Proof. Decompose
\[
(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})'\,\Sigma_X\,(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})
= \hat\delta_{kX}'\hat C_X\Sigma_X\hat C_X\hat\delta_{kX} - 2\hat\delta_{kX}'\hat C_X\delta_{kX} + (\delta_{kX})'C_X\delta_{kX}. \tag{24}
\]
On one hand, by the result (18) we have
\[
\hat\delta_{kX}'\hat C_X\Sigma_X\hat C_X\hat\delta_{kX} = \hat\delta_{kX}'\hat C_X\hat\delta_{kX}\,[1 + O_p(d_n)] = \hat\delta_{kX}'C_X\hat\delta_{kX}\,[1 + O_p(d_n)].
\]
Since $\{E[(\delta_{kX})'C_X(\hat\delta_{kX} - \delta_{kX})]\}^2 \le \Delta_k^2\, E[(\hat\delta_{kX} - \delta_{kX})'C_X(\hat\delta_{kX} - \delta_{kX})]$ and by Equation (19), we obtain
\begin{align*}
\hat\delta_{kX}'C_X\hat\delta_{kX}
&= (\delta_{kX})'C_X\delta_{kX} + 2(\delta_{kX})'C_X(\hat\delta_{kX} - \delta_{kX}) + (\hat\delta_{kX} - \delta_{kX})'C_X(\hat\delta_{kX} - \delta_{kX})\\
&= \Delta_k^2 + O_p(b_k^{(n)}\Delta_k) + O_p\big((b_k^{(n)})^2\big)
= \Delta_k^2\Big[1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p\Big(\frac{(b_k^{(n)})^2}{\Delta_k^2}\Big)\Big]
= \Delta_k^2\Big[1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big)\Big].
\end{align*}
As a result,
\[
\hat\delta_{kX}'\hat C_X\hat\delta_{kX} = \hat\delta_{kX}'C_X\hat\delta_{kX}\,[1 + O_p(d_n)] = \Delta_k^2\Big[1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big]. \tag{25}
\]
On the other hand, since $\|\delta_{kX}\|^2 = O(\Delta_k^2)$, we have
\[
(\delta_{kX})'\hat C_X\delta_{kX} = (\delta_{kX})'(\hat C_X - C_X)\delta_{kX} + (\delta_{kX})'C_X\delta_{kX}
= O_p(\Delta_k^2 d_n) + \Delta_k^2 = \Delta_k^2[1 + O_p(d_n)]. \tag{26}
\]
Consequently,
\[
\hat\delta_{kX}'\hat C_X\delta_{kX}
= \Delta_k^2\sqrt{1 + O_p(d_n)}\sqrt{1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)}
= \Delta_k^2\sqrt{1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)}. \tag{27}
\]
Combining Equations (24), (25) and (27) yields
\begin{align*}
&(\hat\delta_{kX}'\hat C_X - (\delta_{kX})'C_X)\,\Sigma_X\,(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})\\
&\quad= \Delta_k^2\Big[1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big]
- 2\Delta_k^2\sqrt{1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)} + \Delta_k^2
= \Delta_k^2\Big[O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big],
\end{align*}
where the last equality uses the Taylor expansion $\sqrt{1+x} = 1 + \tfrac{x}{2} + o(x)$. $\Box$

Write $\mu_k = ((\mu_{kX})', \mu_{ky})'$, where $\mu_{kX}$ is the true mean of the variable $X$ for class $G_k$; correspondingly, write $\hat\mu_k = (\hat\mu_{kX}', \hat\mu_{ky})'$. Let $a_n \asymp b_n$ denote that the two sequences $a_n$ and $b_n$ are of the same order. Now we state Lemma 3.

Lemma 3.
Let $q_k^{(n)}$ be the number of nonzero entries of the estimate $\hat\delta_{kX}$. For $k = 2, \ldots, K$, we have
\[
\hat\delta_{kX}'\hat C_X(\hat\mu_{1X} - \mu_{1X}) \asymp \hat\delta_{kX}'\hat C_X(\hat\mu_{kX} - \mu_{kX})
= O_p\Big(\sqrt{\frac{q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}
- O_p\Big(\sqrt{\frac{S_{h;p}\,q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}.
\]
Proof. Without loss of generality, assume that $\hat\delta_{kX} = (\hat\delta_{k,1}', \mathbf{0}')'$, where $\hat\delta_{k,1}$ is a $q_k^{(n)}$-dimensional vector containing all the nonzero entries of $\hat\delta_{kX}$. Note that $\lim_{n\to\infty} q_k^{(n)} = s_k$. Conformally, write
\[
\Sigma_X = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{12}' & \Sigma_{22}\end{pmatrix},\quad
\hat\Sigma_X = \begin{pmatrix}\hat\Sigma_{11} & \hat\Sigma_{12}\\ \hat\Sigma_{12}' & \hat\Sigma_{22}\end{pmatrix},\quad
C_X = \begin{pmatrix}C_{11} & C_{12}\\ C_{12}' & C_{22}\end{pmatrix},\quad
\hat C_X = \begin{pmatrix}\hat C_{11} & \hat C_{12}\\ \hat C_{12}' & \hat C_{22}\end{pmatrix},
\]
where $\Sigma_{11}$, $\hat\Sigma_{11}$, $C_{11}$ and $\hat C_{11}$ are $q_k^{(n)} \times q_k^{(n)}$ matrices. Let $\hat\mu_{1X} - \mu_{1X} = (\eta_1', \eta_2')'$ with $\eta_1$ a $q_k^{(n)}$-dimensional vector. Hence,
\[
\hat\delta_{kX}'\hat C_X(\hat\mu_{1X} - \mu_{1X}) = \hat\delta_{k,1}'\hat C_{11}\eta_1 + \hat\delta_{k,1}'\hat C_{12}\eta_2
= \hat\delta_{k,1}'\hat C_{11}\eta_1 - \hat\delta_{k,1}'\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat C_{22}\eta_2 .
\]
On one hand,
\[
(\hat\delta_{k,1}'\hat C_{11}\eta_1)^2 \le (\hat\delta_{k,1}'\hat C_{11}\hat\delta_{k,1})(\eta_1'\hat C_{11}\eta_1)
= (\hat\delta_{kX}'\hat C_X\hat\delta_{kX})(\eta_1'\hat C_{11}\eta_1)
= O_p\Big(\frac{q_k^{(n)}}{n}\Big)(\hat\delta_{kX}'\hat C_X\hat\delta_{kX}).
\]
On the other hand,
\begin{align*}
(\hat\delta_{k,1}'\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat C_{22}\eta_2)^2
&\le (\hat\delta_{k,1}'\hat\Sigma_{11}^{-1}\hat\delta_{k,1})(\eta_2'\hat C_{22}\hat\Sigma_{12}'\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat C_{22}\eta_2)
\le (\hat\delta_{k,1}'\hat C_{11}\hat\delta_{k,1})(\eta_2'\hat C_{22}\hat\Sigma_{12}'\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat C_{22}\eta_2)\\
&= (\hat\delta_{kX}'\hat C_X\hat\delta_{kX})(\eta_2'\hat C_{22}\hat\Sigma_{12}'\hat\Sigma_{11}^{-1}\hat\Sigma_{12}\hat C_{22}\eta_2)
= (\hat\delta_{kX}'\hat C_X\hat\delta_{kX})\big(\eta_2' C_{22}(\Sigma_{12})'(\Sigma_{11})^{-1}\Sigma_{12}C_{22}\eta_2\,[1 + O_p(d_n)]\big)\\
&\triangleq (\hat\delta_{kX}'\hat C_X\hat\delta_{kX})\,\omega_n[1 + O_p(d_n)],
\qquad \omega_n = \eta_2' C_{22}(\Sigma_{12})'(\Sigma_{11})^{-1}\Sigma_{12}C_{22}\eta_2 .
\end{align*}
Hence, we have
\[
\hat\delta_{kX}'\hat C_X(\hat\mu_{1X} - \mu_{1X}) = O_p\Big(\sqrt{\frac{q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}
- \sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}\,\sqrt{\omega_n[1 + O_p(d_n)]}.
\]
Under condition (C1),
\[
E(\omega_n) \le \theta\, E\big(\eta_2' C_{22}(\Sigma_{12})'\Sigma_{12}C_{22}\eta_2\big)
= \frac{\theta}{n}\,\mathrm{tr}\big[\Sigma_{22}C_{22}(\Sigma_{12})'\Sigma_{12}C_{22}\big]
\le \frac{\theta^3}{n}\,\mathrm{tr}\big[\Sigma_{12}(\Sigma_{12})'\big].
\]
Recall that $\Sigma_X = (\sigma_{ij})_{1\le i,j\le p}$; then
\[
E(\omega_n) \le \frac{\theta^3}{n}\sum_{i=1}^{q_k^{(n)}}\sum_{j=q_k^{(n)}+1}^{p}(\sigma_{ij})^2
\le \frac{\theta^3}{n}\, q_k^{(n)}\max_i\sum_{j=q_k^{(n)}+1}^{p}(\sigma_{ij})^2
\le \frac{\theta^{5-h}}{n}\, q_k^{(n)}\max_{j\le p}\sum_{i=1}^{p}|\sigma_{ij}|^h
= O\Big(\frac{S_{h;p}\, q_k^{(n)}}{n}\Big).
\]
Consequently,
\[
\hat\delta_{kX}'\hat C_X(\hat\mu_{1X} - \mu_{1X}) = O_p\Big(\sqrt{\frac{q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}
- O_p\Big(\sqrt{\frac{S_{h;p}\, q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}.
\]
Similarly, we have
\[
\hat\delta_{kX}'\hat C_X(\hat\mu_{kX} - \mu_{kX}) = O_p\Big(\sqrt{\frac{q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}
- O_p\Big(\sqrt{\frac{S_{h;p}\, q_k^{(n)}}{n}}\Big)\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}. \qquad\Box
\]

Lemma 4.
For $t = 1, 2, \ldots, K$ and $k = 2, \ldots, K$, we have
\[
(\mu_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12(\hat\mu_{1X} + \hat\mu_{kX})'\hat C_X\hat\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX}
= \Delta^2\Big[O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n) + O_p\Big(\frac{\sqrt{S_{h;p}\,q_k^{(n)}}}{\sqrt n\,\Delta_k}\Big)\Big]
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})
\]
for the multi-class problem.

Proof. It is not difficult to derive
\begin{align*}
&(\mu_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12(\hat\mu_{1X} + \hat\mu_{kX})'\hat C_X\hat\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX}\\
&\quad= (\mu_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12[(\hat\mu_{1X} + \hat\mu_{kX}) - (\mu_{1X} + \mu_{kX})]'\hat C_X\hat\delta_{kX}
- \tfrac12(\mu_{1X} + \mu_{kX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})\\
&\quad= \Big[\frac{(\mu_{tX} - \mu_{1X}) + (\mu_{tX} - \mu_{kX})}{2}\Big]'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})
- \tfrac12[(\hat\mu_{1X} - \mu_{1X}) + (\hat\mu_{kX} - \mu_{kX})]'\hat C_X\hat\delta_{kX},
\end{align*}
where, by the definition of $\delta_{tX}$ and $\delta_{kX}$, the first term is a linear combination of $(\delta_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})$ and $(\delta_{kX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})$ with bounded coefficients. Because $(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX}) = \Delta_k^2 + \Delta_t^2 - 2(\delta_{tX})'C_X\delta_{kX}$, we have
\[
-(\delta_{tX})'C_X\delta_{kX} = \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX}) - \tfrac12(\Delta_k^2 + \Delta_t^2)
\le \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX}) - \Delta_t\Delta_k .
\]
Consequently, applying the Cauchy--Schwarz inequality together with Equations (25) and (26), we obtain
\begin{align*}
(\delta_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})
&\le \Delta_t\sqrt{1 + O_p(d_n)}\;\Delta_k\sqrt{1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)}
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX}) - \Delta_t\Delta_k\\
&\le \Delta_t\Delta_k\Big(1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big) - \Delta_t\Delta_k
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})\\
&= \Delta_t\Delta_k\Big(O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big)
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})\\
&\le \Delta^2\Big(O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big)
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX}).
\end{align*}
Similarly,
\[
(\delta_{kX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) = \Delta_k^2\Big(O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big)
\le \Delta^2\Big(O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big).
\]
As a result, according to Lemma 3, we have
\begin{align*}
&(\mu_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12(\hat\mu_{1X} + \hat\mu_{kX})'\hat C_X\hat\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX}\\
&\quad\le \Delta^2\Big(O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big)
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})
+ \Big[O_p\Big(\sqrt{\tfrac{S_{h;p}q_k^{(n)}}{n}}\Big) - O_p\Big(\sqrt{\tfrac{q_k^{(n)}}{n}}\Big)\Big]\sqrt{\hat\delta_{kX}'\hat C_X\hat\delta_{kX}}\\
&\quad\le \Delta^2\Big(O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)\Big)
+ \Delta_k\,O_p\Big(\sqrt{\tfrac{S_{h;p}q_k^{(n)}}{n}}\Big)\sqrt{1 + O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n)}
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})\\
&\quad\le \Delta^2\Big[O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n) + O_p\Big(\frac{\sqrt{S_{h;p}q_k^{(n)}}}{\sqrt n\,\Delta_k}\Big)\Big]
+ \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX}). \qquad\Box
\end{align*}

From Lemma 4, note that when $t = k$ we have
\[
(\mu_{kX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12(\hat\mu_{1X} + \hat\mu_{kX})'\hat C_X\hat\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX}
= \Delta^2\Big[O_p\Big(\frac{b_k^{(n)}}{\Delta_k}\Big) + O_p(d_n) + O_p\Big(\frac{\sqrt{S_{h;p}q_k^{(n)}}}{\sqrt n\,\Delta_k}\Big)\Big].
\]
With Lemmas 2--4, we are ready to complete the proof of Theorem 1.
Proof of Theorem 1. Let $\hat Z_{PROP}$ and $\hat Z_{Bayes}$ denote the predicted class labels obtained by the proposed model and by the Bayes rule, respectively. For simplicity, we assume $\pi_1 = \pi_2 = \cdots = \pi_K$ instead of condition (C7) in the proofs of Theorems 1--3; this has no influence on the theoretical results, since condition (C7) is only used to bound the term $\log\frac{\pi_k}{\pi_1}$ in the LDA rule. Define
\[
\vartheta_k = \Big(x - \frac{\mu_{1X} + \mu_{kX}}{2}\Big)'C_X\delta_{kX}
\qquad\text{and}\qquad
\hat\vartheta_k = \Big(x - \frac{\hat\mu_{1X} + \hat\mu_{kX}}{2}\Big)'\hat C_X\hat\delta_{kX}
\]
for a new sample $x$. Then for any $\epsilon > 0$,
\begin{align*}
R_{PROP}(T) - R_{Bayes} &\le \Pr(\hat Z_{PROP} \neq \hat Z_{Bayes})\\
&\le 1 - \Pr\big(|\hat\vartheta_k - \vartheta_k| < \epsilon \text{ and } |\vartheta_k - \vartheta_l| > 2\epsilon \text{ for all } k, l\big)\\
&\le \Pr\big(|\hat\vartheta_k - \vartheta_k| \ge \epsilon \text{ for some } k\big) + \Pr\big(|\vartheta_k - \vartheta_l| \le 2\epsilon \text{ for some } k, l\big).
\end{align*}
Firstly, we bound the probability $\Pr(|\vartheta_k - \vartheta_l| \le 2\epsilon \text{ for some } k, l)$. Since
\[
\vartheta_k - \vartheta_l = x'C_X(\delta_{kX} - \delta_{lX}) - \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{lX})'C_X\delta_{lX},
\]
the variance of $\vartheta_k - \vartheta_l$ is $(\delta_{kX} - \delta_{lX})'C_X(\delta_{kX} - \delta_{lX})$. Hence,
\[
\Pr(|\vartheta_k - \vartheta_l| \le 2\epsilon \text{ for some } k, l)
= \sum_{t=1}^{K}\Pr(|\vartheta_k - \vartheta_l| \le 2\epsilon \text{ for some } k, l \mid Z = t)\,\pi_t
\le \sum_{k,l,t}\pi_t\,\frac{C\epsilon}{\sqrt{(\delta_{kX} - \delta_{lX})'C_X(\delta_{kX} - \delta_{lX})}}
\le CK^2\epsilon,
\]
where the last inequality is obtained by condition (C6). Secondly, we bound the term $\Pr(|\hat\vartheta_k - \vartheta_k| \ge \epsilon \text{ for some } k)$. As
\[
(\hat\vartheta_k - \vartheta_k \mid Z = t) = x'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12(\hat\mu_{1X} + \hat\mu_{kX})'\hat C_X\hat\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX},
\]
the conditional difference $(\hat\vartheta_k - \vartheta_k \mid Z = t)$ follows the normal distribution $N(\mu_\vartheta^{(t)}, \sigma_\vartheta^2)$ with
\[
\mu_\vartheta^{(t)} = (\mu_{tX})'(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}) - \tfrac12(\hat\mu_{1X} + \hat\mu_{kX})'\hat C_X\hat\delta_{kX} + \tfrac12(\mu_{1X} + \mu_{kX})'C_X\delta_{kX}
\]
and
\[
\sigma_\vartheta^2 = (\hat\delta_{kX}'\hat C_X - (\delta_{kX})'C_X)\,\Sigma_X\,(\hat C_X\hat\delta_{kX} - C_X\delta_{kX}).
\]
By Markov's inequality, together with Lemmas 2 and 4, we have
\begin{align*}
\Pr(|\hat\vartheta_k - \vartheta_k| \ge \epsilon \text{ for some } k)
&= \sum_{t\neq k}\pi_t\Pr(|\hat\vartheta_k - \vartheta_k| \ge \epsilon \mid Z = t) + \pi_k\Pr(|\hat\vartheta_k - \vartheta_k| \ge \epsilon \mid Z = k)\\
&\le C\max_k\Bigg\{\frac{(\hat\delta_{kX}'\hat C_X - (\delta_{kX})'C_X)\Sigma_X(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})}{\big(\epsilon - \mu_\vartheta^{(t\neq k)}\big)^2}
+ \frac{(\hat\delta_{kX}'\hat C_X - (\delta_{kX})'C_X)\Sigma_X(\hat C_X\hat\delta_{kX} - C_X\delta_{kX})}{\big(\epsilon - \mu_\vartheta^{(k)}\big)^2}\Bigg\}\\
&\le C\max_k\Bigg\{\frac{\Delta_k^2\big[O_p\big(\tfrac{b_k^{(n)}}{\Delta_k}\big) + O_p(d_n)\big]}
{\Big[\epsilon - \Delta^2\big(O_p\big(\tfrac{b_k^{(n)}}{\Delta_k}\big) + O_p(d_n) + O_p\big(\tfrac{\sqrt{S_{h;p}q_k^{(n)}}}{\sqrt n\,\Delta_k}\big)\big) - \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})\Big]^2}\\
&\qquad\qquad + \frac{\Delta_k^2\big[O_p\big(\tfrac{b_k^{(n)}}{\Delta_k}\big) + O_p(d_n)\big]}
{\Big[\epsilon - \Delta^2\big(O_p\big(\tfrac{b_k^{(n)}}{\Delta_k}\big) + O_p(d_n) + O_p\big(\tfrac{\sqrt{S_{h;p}q_k^{(n)}}}{\sqrt n\,\Delta_k}\big)\big)\Big]^2}\Bigg\}\\
&\le \frac{\Delta^2 O_p(\xi_{n;k})}{\big[\epsilon - \Delta^2 O_p(\xi_{n;k}) - \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})\big]^2}
+ \frac{\Delta^2 O_p(\xi_{n;k})}{\big[\epsilon - \Delta^2 O_p(\xi_{n;k})\big]^2}. \tag{28}
\end{align*}
By condition (C6), the first term of (28) converges to $0$ in probability. Pick $\epsilon = C\xi_{n;k}^{\alpha}$, where $0 < \alpha < 1/2$; then
\[
\Pr(|\hat\vartheta_k - \vartheta_k| \ge \epsilon \text{ for some } k)
\le \frac{\Delta^2 O_p(\xi_{n;k})}{\big[\epsilon - \Delta^2 O_p(\xi_{n;k}) - \tfrac12(\delta_{kX} - \delta_{tX})'C_X(\delta_{kX} - \delta_{tX})\big]^2}
+ \frac{\Delta^2 O_p(\xi_{n;k}^{1-2\alpha})}{\big[C - \Delta^2 O_p(\xi_{n;k}^{1-\alpha})\big]^2}
\;\xrightarrow{P}\; 0. \qquad\Box
\]
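Theorem 1 states that the conditional misclassification rate of the estimated rule approaches the Bayes rate. The following small simulation illustrates this behavior; it is not the proposed estimator (it uses plain sample means and a pooled sample covariance instead of the penalized estimates developed in the paper), and the Gaussian setting below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 5
mu1 = np.zeros(p)
mu2 = np.r_[1.2, 0.8, np.zeros(p - 2)]
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1)-type covariance
L = np.linalg.cholesky(Sigma)

def draw(n, mu):
    return mu + rng.standard_normal((n, p)) @ L.T

def lda_rule(X, m1, m2, S):
    # assign to class 2 when the LDA score is positive (equal priors assumed)
    d = np.linalg.solve(S, m2 - m1)
    return (X - (m1 + m2) / 2) @ d > 0

# large test set for approximating the conditional misclassification rate
X_test = np.vstack([draw(100000, mu1), draw(100000, mu2)])
z_test = np.r_[np.zeros(100000), np.ones(100000)]
bayes_err = np.mean(lda_rule(X_test, mu1, mu2, Sigma) != z_test)

for n in [25, 100, 400, 1600]:
    X1, X2 = draw(n, mu1), draw(n, mu2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = np.cov(np.vstack([X1 - m1, X2 - m2]), rowvar=False)            # pooled covariance estimate
    plug_err = np.mean(lda_rule(X_test, m1, m2, S) != z_test)
    print(n, round(plug_err, 4), round(bayes_err, 4))
# the plug-in (estimated) rule's error approaches the Bayes error as n grows
```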
Proof of Theorem 2. The conditional misclassification rate is
\[
R_{PROP}(T) = \frac12\sum_{k=1}^{2}\Phi\Bigg(\frac{(-1)^k\,\hat\delta_X'\hat C_X(\mu_{kX} - \hat\mu_{kX}) - \tfrac12\hat\delta_X'\hat C_X(\hat\mu_{2X} - \hat\mu_{1X})}{\sqrt{\hat\delta_X'\hat C_X\Sigma_X\hat C_X\hat\delta_X}}\Bigg)
= \frac12\sum_{k=1}^{2}\Phi\Bigg(\frac{(-1)^k\,\hat\delta_X'\hat C_X(\mu_{kX} - \hat\mu_{kX}) - \tfrac12\hat\delta_X'\hat C_X\hat\delta_X}{\sqrt{\hat\delta_X'\hat C_X\Sigma_X\hat C_X\hat\delta_X}}\Bigg).
\]
By the result (18), we have
\[
\hat\delta_X'\hat C_X\Sigma_X\hat C_X\hat\delta_X = \hat\delta_X'\hat C_X\hat\delta_X[1 + O_p(d_n)] = \hat\delta_X'C_X\hat\delta_X[1 + O_p(d_n)].
\]
From the result (19), together with $\{E[(\delta_X)'C_X(\hat\delta_X - \delta_X)]\}^2 \le \Delta_2^2\, E[(\hat\delta_X - \delta_X)'C_X(\hat\delta_X - \delta_X)]$, it is easy to derive
\begin{align*}
\hat\delta_X'C_X\hat\delta_X
&= (\delta_X)'C_X\delta_X + 2(\delta_X)'C_X(\hat\delta_X - \delta_X) + (\hat\delta_X - \delta_X)'C_X(\hat\delta_X - \delta_X)\\
&= \Delta_2^2 + O_p(b_2^{(n)}\Delta_2) + O_p\big((b_2^{(n)})^2\big)
= \Delta_2^2\Big[1 + O_p\Big(\frac{b_2^{(n)}}{\Delta_2}\Big) + O_p\Big(\frac{(b_2^{(n)})^2}{\Delta_2^2}\Big)\Big]
= \Delta_2^2\Big[1 + O_p\Big(\frac{b_2^{(n)}}{\Delta_2}\Big)\Big].
\end{align*}
Hence we have
\[
\hat\delta_X'\hat C_X\hat\delta_X = \hat\delta_X'C_X\hat\delta_X[1 + O_p(d_n)] = \Delta_2^2\Big[1 + O_p\Big(\frac{b_2^{(n)}}{\Delta_2}\Big) + O_p(d_n)\Big].
\]
Therefore, by Lemma 3, we obtain
\begin{align*}
\frac{\hat\delta_X'\hat C_X(\hat\mu_{1X} - \mu_{1X}) - \tfrac12\hat\delta_X'\hat C_X\hat\delta_X}{\sqrt{\hat\delta_X'\hat C_X\Sigma_X\hat C_X\hat\delta_X}}
&= \Big[O_p\Big(\sqrt{\tfrac{q_2^{(n)}}{n}}\Big) + O_p\Big(\sqrt{\tfrac{S_{h;p}q_2^{(n)}}{n}}\Big) - \tfrac12\sqrt{\hat\delta_X'\hat C_X\hat\delta_X}\Big]\Big/\sqrt{1 + O_p(d_n)}\\
&= -\tfrac12\Delta_2\sqrt{1 + O_p\Big(\tfrac{b_2^{(n)}}{\Delta_2}\Big) + O_p(d_n)}\Big/\sqrt{1 + O_p(d_n)}
+ O_p\Big(\sqrt{\tfrac{q_2^{(n)}}{n}}\Big) + O_p\Big(\sqrt{\tfrac{S_{h;p}q_2^{(n)}}{n}}\Big)\\
&= -\tfrac12\Delta_2\Big[1 + O_p\Big(\tfrac{b_2^{(n)}}{\Delta_2}\Big) + O_p(d_n) + O_p\Big(\tfrac{\sqrt{S_{h;p}q_2^{(n)}}}{\sqrt n\,\Delta_2}\Big)\Big]
= -\tfrac12\Delta_2[1 + O_p(\xi_n)].
\end{align*}
Similarly, we have
\[
\frac{\hat\delta_X'\hat C_X(\mu_{2X} - \hat\mu_{2X}) - \tfrac12\hat\delta_X'\hat C_X\hat\delta_X}{\sqrt{\hat\delta_X'\hat C_X\Sigma_X\hat C_X\hat\delta_X}} = -\tfrac12\Delta_2[1 + O_p(\xi_n)],
\]
which proves the theorem. $\Box$

To establish the theoretical results in Theorem 3, we need a lemma from Shao et al. (2011). We state it here for completeness, and then prove Theorem 3.

Lemma 5.
Let $a_n^{(1)}$ and $a_n^{(2)}$ be two sequences of positive numbers such that $a_n^{(1)} \to \infty$ and $a_n^{(2)} \to 0$ as $n \to \infty$. If $\lim_{n\to\infty} a_n^{(1)}a_n^{(2)} = \rho$, where $\rho$ may be $0$, positive, or $\infty$, then
\[
\lim_{n\to\infty}\frac{\Phi\big(-\sqrt{a_n^{(1)}(1 - a_n^{(2)})}\big)}{\Phi\big(-\sqrt{a_n^{(1)}}\big)} = e^{\rho/2}.
\]
Proof. See the proof of Lemma 1 in Shao et al. (2011). $\Box$
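Lemma 5 can also be checked numerically. The sketch below is illustrative only: it fixes $a_n^{(1)}a_n^{(2)} = \rho = 2$ along the sequence and evaluates the ratio using scipy's log-scale normal CDF to avoid underflow for large arguments.

```python
import numpy as np
from scipy.stats import norm

rho = 2.0
for a1 in [10.0, 100.0, 1000.0, 10000.0]:
    a2 = rho / a1                                    # chosen so that a1 * a2 = rho for every n
    log_ratio = norm.logcdf(-np.sqrt(a1 * (1 - a2))) - norm.logcdf(-np.sqrt(a1))
    print(a1, round(np.exp(log_ratio), 4), round(np.exp(rho / 2), 4))
# the ratio converges to e^{rho/2} (about 2.7183) as a1 grows
```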
Proof of Theorem 3. (1) Let $\phi$ be the density function of $N(0, 1)$. Then
\[
R_{PROP}(T) - R_{Bayes} = \Phi\Big(-\tfrac{\Delta_2}{2}[1 + O_p(\xi_n)]\Big) - \Phi\Big(-\tfrac{\Delta_2}{2}\Big)
= -\phi(\tau_n)\,\tfrac{\Delta_2}{2}\,O_p(\xi_n),
\]
where $\tau_n$ lies between $-\tfrac{\Delta_2}{2}$ and $-\tfrac{\Delta_2}{2}[1 + O_p(\xi_n)]$. Since $\Delta_2$ is bounded, $R_{Bayes}$ is bounded away from $0$. Hence,
\[
\frac{R_{PROP}(T)}{R_{Bayes}} - 1 = -\frac{\Delta_2\,\phi(\tau_n)}{2R_{Bayes}}\,O_p(\xi_n) = O_p(\xi_n).
\]
(2) When $\Delta_2 \to \infty$, we have $R_{PROP}(T) \xrightarrow{P} 0$. This, together with $\lim_{\Delta_2\to\infty} R_{Bayes} = 0$, proves (2).

(3) The conditions $\Delta_2 \to \infty$ and $\lim_{n\to\infty}\xi_n\Delta_2^2 = 0$, together with Lemma 5, prove that $R_{PROP}(T)/R_{Bayes} \xrightarrow{P} 1$. $\Box$
Proof of Theorem 4. Define $r_{ik} = \Pr(\hat Z = i \mid Z = k)$ for $i, k = 1, 2, \ldots, K$. Let $R$ be the misclassification error of a classifier; it is then calculated via
\[
R = \sum_{k=1}^{K}\Pr(Z = k)\Pr(\hat Z \neq k \mid Z = k) = \sum_{k=1}^{K}\pi_k\Big(\sum_{i\neq k}^{K} r_{ik}\Big). \tag{29}
\]
Now we derive an upper bound for $(\hat y - y)^2$. Since it is random, we focus on its average, i.e.,
\[
E[(\hat y - y)^2 \mid x, T] = E_y E_{\hat y\mid T}\big[(\hat y - y)^2 \mid x, T\big]. \tag{30}
\]
To simplify the notation, we omit $x$ and $T$ to the right of the conditioning sign and write it as $E_y E_{\hat y\mid T}[(\hat y - y)^2]$. Then Equation (30) becomes
\[
E[(\hat y - y)^2] = E_y E_{\hat y\mid T}\big[(\hat y - y)^2\big]
= E_Z\big(E_{y\mid Z}E_{\hat y\mid T}\big[(\hat y - y)^2\big] \mid Z\big)
= \sum_{k=1}^{K}\pi_k\, E_{y\mid Z = k}\Big(\sum_{i=1}^{K} r_{ik}(\hat y_i - y)^2 \,\Big|\, Z = k\Big).
\]
Next, we derive
\begin{align*}
E_{y\mid Z=1}\big[a(\hat y_1 - y)^2 \mid Z = 1\big]
&= E_{y\mid Z=1}\big[a(\hat y_1 - E(y\mid Z=1) + E(y\mid Z=1) - y)^2 \mid Z = 1\big]\\
&= E_{y\mid Z=1}\big[a(\hat y_1 - E(y\mid Z=1))^2 \mid Z = 1\big] + E_{y\mid Z=1}\big[a(y - E(y\mid Z=1))^2 \mid Z = 1\big]\\
&= a\big(\hat y_1 - E(y\mid Z=1)\big)^2 + a\,\mathrm{Var}(y\mid Z=1).
\end{align*}
Similarly, we have $E_{y\mid Z=k}[c(\hat y_i - y)^2 \mid Z = k] = c(\hat y_i - E(y\mid Z=k))^2 + c\,\mathrm{Var}(y\mid Z=k)$ for $c > 0$ and $i, k = 1, 2, \ldots, K$. As a result, Equation (30) is decomposed as
\begin{align*}
E[(\hat y - y)^2 \mid x, T]
&= \sum_{k=1}^{K}\pi_k\Big[\sum_{i=1}^{K} r_{ik}\big(\hat y_i - E(y\mid Z=k)\big)^2 + \sum_{i=1}^{K} r_{ik}\mathrm{Var}(y\mid Z=k)\Big]\\
&= \sum_{k=1}^{K}\sum_{i=1}^{K}\pi_k r_{ik}\big(\hat y_i - E(y\mid Z=k)\big)^2 + (\sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy})\sum_{k=1}^{K}\sum_{i=1}^{K}\pi_k r_{ik}\\
&= \sum_{k=1}^{K}\sum_{i=1}^{K}\pi_k r_{ik}\big(\hat y_i - E(y\mid Z=k)\big)^2 + (\sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy}),
\end{align*}
where the second equality applies $\mathrm{Var}(y\mid Z=k) = \sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy}$, and the third equality uses the fact $\sum_{k=1}^{K}\sum_{i=1}^{K}\pi_k r_{ik} = 1$ based on the definition of $r_{ik}$. Now we tackle each term of the double sum:
\[
\big[\hat y_k - E(y\mid Z=k)\big]^2
= \Big[(\hat\mu_{ky} - \mu_{ky}) + \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1} - \Sigma_{Xy}'\Sigma_X^{-1}\big)x
- \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1}\hat\mu_{kX} - \Sigma_{Xy}'\Sigma_X^{-1}\mu_{kX}\big)\Big]^2
\]
and, for $k \neq k'$,
\begin{align*}
\big[\hat y_k - E(y\mid Z=k')\big]^2
&= \Big[(\hat\mu_{ky} - \mu_{k'y}) + \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1} - \Sigma_{Xy}'\Sigma_X^{-1}\big)x
- \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1}\hat\mu_{kX} - \Sigma_{Xy}'\Sigma_X^{-1}\mu_{k'X}\big)\Big]^2\\
&= \Big[(\hat\mu_{ky} - \mu_{ky}) + \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1} - \Sigma_{Xy}'\Sigma_X^{-1}\big)x
- \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1}\hat\mu_{kX} - \Sigma_{Xy}'\Sigma_X^{-1}\mu_{kX}\big)\\
&\qquad\quad + (\mu_{ky} - \mu_{k'y}) - \big(\Sigma_{Xy}'\Sigma_X^{-1}\mu_{kX} - \Sigma_{Xy}'\Sigma_X^{-1}\mu_{k'X}\big)\Big]^2 .
\end{align*}
For $k = 1, 2, \ldots, K$ and $k \neq k'$, define the following terms:
\begin{align*}
b_{ky} &= \hat\mu_{ky} - \mu_{ky},\\
D_{kk'} &= E(y\mid Z=k) - E(y\mid Z=k') = (\mu_{ky} - \mu_{k'y}) - \big(\Sigma_{Xy}'\Sigma_X^{-1}\mu_{kX} - \Sigma_{Xy}'\Sigma_X^{-1}\mu_{k'X}\big),\\
h &= \big(\hat\Sigma_{Xy}'\hat\Sigma_X^{-1} - \Sigma_{Xy}'\Sigma_X^{-1}\big)',\\
d_k &= \hat\Sigma_{Xy}'\hat\Sigma_X^{-1}\hat\mu_{kX} - \Sigma_{Xy}'\Sigma_X^{-1}\mu_{kX}.
\end{align*}
Therefore, we obtain
\begin{align*}
E[(\hat y - y)^2 \mid x, T]
&= \sum_{i=1}^{K}\pi_i r_{ii}(b_{iy} + h'x - d_i)^2
+ \sum_{k=1}^{K}\sum_{i\neq k}^{K}\pi_k r_{ik}(b_{iy} + h'x - d_i + D_{ik})^2 + (\sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy})\\
&= \sum_{i=1}^{K}\pi_i r_{ii}(b_{iy} + h'x - d_i)^2 + \sum_{k=1}^{K}\sum_{i\neq k}^{K}\pi_k r_{ik}(b_{iy} + h'x - d_i)^2
+ \sum_{k=1}^{K}\sum_{i\neq k}^{K}\pi_k r_{ik}D_{ik}^2\\
&\qquad + 2\sum_{k=1}^{K}\sum_{i\neq k}^{K}\pi_k r_{ik}(b_{iy} + h'x - d_i)D_{ik} + (\sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy})\\
&= M + \sum_{k=1}^{K}\sum_{i\neq k}^{K}\pi_k r_{ik}D_{ik}^2 + (\sigma_y^2 - \Sigma_{Xy}'\Sigma_X^{-1}\Sigma_{Xy}),
\end{align*}
where
\[
M = \sum_{k=1}^{K}\sum_{i=1}^{K}\pi_k r_{ik}(b_{iy} + h'x - d_i)^2
+ 2\sum_{k=1}^{K}\sum_{i\neq k}^{K}\pi_k r_{ik}(b_{iy} + h'x - d_i)D_{ik}.
\]
Now, if the classification of $Z$ is based on the known distribution, the misclassification rate $R$ is $R_{Bayes}$. For $i, k = 1, 2, \ldots, K$, let $r_{ik}^{B} = \Pr(\hat Z = i \mid Z = k)$ represent the corresponding $r_{ik}$ with $\hat Z$ obtained from the Bayes rule. Similarly, let $r_{ik}^{P}$ represent the corresponding $r_{ik}$ with $\hat Z$ obtained from the proposed model. Denote by $M_{PROP}$ the corresponding value of $M$ computed from the proposed model, and note that the value of $M$ computed from the Bayes rule is equal to $0$. Then we have
\[
E[(\hat y_P - y)^2 \mid x, T] - E[(\hat y_B - y)^2 \mid x, T]
= M_{PROP} + \sum_{k=1}^{K}\sum_{i\neq k}^{K}\big(\pi_k r_{ik}^{P} - \pi_k r_{ik}^{B}\big)D_{ik}^2
\le M_{PROP} + \big[R_{PROP}(T) - R_{Bayes}\big]\,D_{\max}^2,
\]
where $D_{\max} = \max\{D_{kk'}\}$, and the last inequality uses Equation (29). By the conditions in Theorem 1, $E_x(M_{PROP}) \xrightarrow{P} 0$ as $n \to \infty$. Consequently, we have
\begin{align*}
MSE_{PROP} - MSE_{Bayes}
&= E[(\hat y_P - y)^2 \mid T] - E[(\hat y_B - y)^2 \mid T]
= E_x E[(\hat y_P - y)^2 \mid x, T] - E_x E[(\hat y_B - y)^2 \mid x, T]\\
&\le E_x(M_{PROP}) + \big[R_{PROP}(T) - R_{Bayes}\big]\,D_{\max}^2
\;\xrightarrow{P}\; 0. \qquad\Box
\end{align*}
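Theorem 4 says that, once the classification error of the estimated rule approaches the Bayes error, the excess prediction mean squared error for $y$ vanishes as well. The following simulation sketch illustrates this message in a small two-class Gaussian model. It again uses plain sample estimates rather than the paper's penalized estimators, and all numbers below are made up: the prediction step classifies by LDA on $x$ and then predicts $y$ by the class-specific conditional mean.

```python
import numpy as np

rng = np.random.default_rng(4)
p, K = 4, 2
mu_X = [np.zeros(p), np.r_[1.5, 1.0, np.zeros(p - 2)]]    # class means of X
mu_y = [0.0, 2.0]                                          # class means of y
Sigma_X = 0.6 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Sigma_Xy = np.r_[0.5, 0.3, 0.1, 0.0]
sigma2_y = 2.0
beta = np.linalg.solve(Sigma_X, Sigma_Xy)                  # regression coefficients of y on X
var_y_given_X = sigma2_y - Sigma_Xy @ beta
L = np.linalg.cholesky(Sigma_X)

def simulate(n_each):
    X, y, z = [], [], []
    for k in range(K):
        Xk = mu_X[k] + rng.standard_normal((n_each, p)) @ L.T
        yk = mu_y[k] + (Xk - mu_X[k]) @ beta + np.sqrt(var_y_given_X) * rng.standard_normal(n_each)
        X.append(Xk); y.append(yk); z.append(np.full(n_each, k))
    return np.vstack(X), np.concatenate(y), np.concatenate(z)

def predict(X, means_X, means_y, S_X, S_Xy):
    # classify by LDA on X (equal priors), then predict y by the class-wise conditional mean
    b = np.linalg.solve(S_X, S_Xy)
    d = np.linalg.solve(S_X, means_X[1] - means_X[0])
    z_hat = ((X - (means_X[0] + means_X[1]) / 2) @ d > 0).astype(int)
    mX = np.asarray(means_X)[z_hat]
    my = np.asarray(means_y)[z_hat]
    return my + (X - mX) @ b

X_test, y_test, _ = simulate(50000)
mse_bayes = np.mean((predict(X_test, mu_X, mu_y, Sigma_X, Sigma_Xy) - y_test) ** 2)
for n in [50, 200, 800]:
    X_tr, y_tr, z_tr = simulate(n)
    m_X = [X_tr[z_tr == k].mean(axis=0) for k in range(K)]
    m_y = [y_tr[z_tr == k].mean() for k in range(K)]
    resid_X = np.vstack([X_tr[z_tr == k] - m_X[k] for k in range(K)])
    resid_y = np.concatenate([y_tr[z_tr == k] - m_y[k] for k in range(K)])
    S_X = resid_X.T @ resid_X / (len(y_tr) - K)            # pooled covariance of X
    S_Xy = resid_X.T @ resid_y / (len(y_tr) - K)           # pooled covariance between X and y
    mse_prop = np.mean((predict(X_test, m_X, m_y, S_X, S_Xy) - y_test) ** 2)
    print(n, round(mse_prop, 3), round(mse_bayes, 3))
# the MSE of the estimated rule approaches the MSE obtained with the true parameters
```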