Improved Design of Quadratic Discriminant Analysis Classifier in Imbalanced Settings
Amine Bejaoui
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Khalil Elkhalil
Division of Electrical Engineering, Duke University, Durham, NC 27708, USA

Abla Kammoun
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Mohamed-Slim Alouini
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Tareq Al-Naffouri
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
Abstract
The use of quadratic discriminant analysis (QDA) or its regularized version (R-QDA) for classification is often not recommended, due to its well-acknowledged high sensitivity to the estimation noise of the covariance matrix. This becomes all the more the case in imbalanced settings, in which the training data for each class are disproportionate and for which it has been found that R-QDA becomes equivalent to the classifier that assigns all observations to the same class. In this paper, we propose an improved R-QDA that is based on the use of two regularization parameters and a modified bias, properly chosen to avoid inappropriate behaviors of R-QDA in imbalanced settings and to ensure the best possible classification performance. The design of the proposed classifier builds on a random matrix theory based analysis of its performance when the number of samples and that of features grow large simultaneously. The performance of the proposed classifier is assessed on both real and synthetic data sets and is shown to be much better than what one would expect from a traditional R-QDA.
Keywords: Statistics, Machine Learning, QDA, Random Matrix Theory

1. Introduction

Discriminant analysis encompasses a wide variety of techniques used for classification purposes. These techniques, commonly recognized among the class of model-based methods in the field of machine learning (Devijver and Kittler, 1982), rely on the assumption of a parametric model in which the outcome is described by a set of explanatory variables that follow a certain distribution. Among them, we particularly distinguish linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) as the most representative. LDA is often connected or confused with Fisher discriminant analysis (FDA) (Fisher, 1936), a method that projects the data onto a subspace and that turns out to coincide with LDA when the target subspace has two dimensions. Both LDA and QDA are obtained by maximizing the posterior probability under the assumption that the observations follow a normal distribution, with the single difference that LDA assumes common covariances across classes while QDA assumes the most general situation, with classes possessing different means and covariances. If the data perfectly follow the normal distributions and the statistics are perfectly known, QDA turns out to be the optimal classifier that achieves the lowest possible classification error rate (Hastie et al., 2009). It coincides with LDA when the covariances are equal but outperforms it when they are different. However, in practical scenarios, the use of LDA, and to a larger extent QDA, has not always been shown to yield the expected performance. This is because the mean and covariance of each class, which are in general unknown, are estimated based on available training data with perfectly known classes. The obtained estimates are then used as plug-in estimators in the classification rules associated with LDA and QDA. The estimation error of the class statistics causes a provable degradation of the performance, which reaches very high levels when the number of samples is comparable to, or less than, the number of features. In this latter situation, QDA and LDA, which rely on computing the inverse of the covariance matrix, cannot even be used. To overcome this issue, one technique consists in using a regularized estimate of the covariance matrix as a plug-in estimator, giving the names regularized LDA (R-LDA) and regularized QDA (R-QDA) to the associated classifiers. However, this solution does not allow for a significant reduction of the estimation noise. The situation is even worse for R-QDA, since the number of samples used to estimate the covariance matrix of each class is lower than that of LDA. Moreover, in imbalanced settings, the estimation quality of the covariance matrix associated with each class is not the same, one class possessing more samples than the other. These are probably the reasons why LDA provides in many scenarios better performance than QDA, although it might wrongly consider the covariances across classes to be equal.

A question of major theoretical and practical interest is to investigate to which extent the estimation noise of the covariance matrix impacts the performance of R-LDA and R-QDA.
In this respect, the study of LDA, and subsequently that of R-LDA, has received particular attention, dating back to the early works of Raudys (Raudys, 1967), before being investigated again using recent advances in random matrix theory in a series of works (Zollanvari and Dougherty, 2015; Wang and Jiang, 2018). However, the theoretical analysis of QDA and R-QDA is scarcer and very often limited to specific situations in which the number of samples is higher than the dimension of the statistics (McFarland and Richards, 2002), or to specific structures of the covariance matrices (Cheng, 2004; Li and Shao, 2015; Jiang et al., 2015). It was only recently that our work in (Elkhalil et al., 2017) considered the analysis of R-QDA for general structures of the covariance matrices and identified the necessary asymptotic conditions under which QDA does not behave trivially by always returning the same class. Among these conditions is the assumption that the training data are balanced across classes. Indeed, as will be discussed later in this work, in imbalanced settings the difference in estimation quality of the covariance matrices across classes makes the classification rule of R-QDA keep asymptotically the same sign, irrespective of the class of the testing observation. As a result, the use of the traditional discrimination rule of R-QDA is equivalent to assigning all observations to the same class. This lies behind the main motivation of the present work. Based on a careful investigation of the asymptotic behavior of R-QDA under imbalanced settings in binary classification problems, we propose a modified classification rule for R-QDA that copes with cases in which the proportions of training data from both classes are not equal. The new classification rule is based on using two different regularization parameters instead of a common regularization parameter, as well as an optimized bias properly chosen to minimize the misclassification error rate. Interestingly, we show that the proposed classifier not only outperforms R-LDA and R-QDA but also other state-of-the-art classification methods, opening promising avenues for the use of the proposed classifier in practical scenarios.

The rest of the paper is organized as follows. In Section 2, we provide an overview of the quadratic discriminant classifier and identify the issues related to the use of this classifier in imbalanced settings. In Section 3, we propose an improved version of the R-QDA classifier that overcomes all these problems, and we design a consistent estimator of the misclassification error rate that can be used to properly choose the parameters of the proposed R-QDA and that constitutes a valuable alternative to the traditional cross-validation approach. Finally, Section 4 presents the results of a set of numerical simulations on both synthetic and real data that confirm our theoretical findings.

Notations
Scalars, vectors and matrices are respectively denoted by non-boldface, boldface lowercase and boldface uppercase characters. $\mathbf{0}_{p\times n}$ and $\mathbf{1}_{p\times n}$ are respectively the matrices of zeros and ones of size $p \times n$, and $\mathbf{I}_p$ denotes the $p \times p$ identity matrix. The notation $\|\cdot\|$ stands for the Euclidean norm for vectors and the spectral norm for matrices. $(\cdot)^T$, $\mathrm{Tr}[\cdot]$ and $|\cdot|$ stand for the transpose, the trace and the determinant of a matrix, respectively. For two functions $f$ and $g$, we say that $f = O(g)$ if there exists $0 < M < \infty$ such that $|f| \leq Mg$. Moreover, for a random variable $X$, $X = O_p(1)$ refers to a variable that is bounded in probability. We also say that $f = \Theta(g)$ if there exist $0 < C_1 < C_2 < \infty$ such that $C_1 g \leq |f| \leq C_2 g$. Moreover, we denote by $\xrightarrow{p}$ and $\xrightarrow{a.s.}$ the convergence in probability and the almost sure convergence of random variables. Finally, $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, i.e. $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}e^{-t^2/2}\,dt$.
2. Regularized quadratic discriminant analysis
The asymptotic analysis carried out in (Elkhalil et al., 2017) has made it clear that if R-QDA is designed based on imbalanced training samples, it asymptotically assigns all testing observations to the same class. Such a behavior has led the authors in (Elkhalil et al., 2017) to consider the analysis of R-QDA only under a balanced training sample. Interestingly, understanding this behavior can be achieved through simple arguments based on a close examination of the mean and variance of the classification rule associated with R-QDA. These arguments do not necessitate random matrix theory results; we thus find it important to present them at the outset, in order to pave the way towards our improved classifier. But prior to that, let us first review the traditional R-QDA for binary classification.

For ease of presentation, we focus on binary classification problems where we have two distinct classes. We assume that the data follow a Gaussian mixture model, such that observations in class $\mathcal{C}_i$, $i \in \{0,1\}$, are drawn from a multivariate Gaussian distribution with mean $\boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_i$. More formally, we assume that
$$\mathbf{x} \in \mathcal{C}_i \Leftrightarrow \mathbf{x} = \boldsymbol{\mu}_i + \boldsymbol{\Sigma}_i^{1/2}\mathbf{z}, \quad \text{with } \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p) \tag{1}$$
Let $\pi_i$, $i = 0,1$, denote the prior probability that $\mathbf{x}$ belongs to class $\mathcal{C}_i$. The classification rule associated with the QDA classifier is given by
$$W^{QDA}(\mathbf{x}) = -\frac{1}{2}\log\frac{|\boldsymbol{\Sigma}_0|}{|\boldsymbol{\Sigma}_1|} - \frac{1}{2}\mathbf{x}^T\left(\boldsymbol{\Sigma}_0^{-1} - \boldsymbol{\Sigma}_1^{-1}\right)\mathbf{x} + \mathbf{x}^T\left(\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 - \boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1\right) - \frac{1}{2}\boldsymbol{\mu}_0^T\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 - \log\frac{\pi_1}{\pi_0} \tag{2}$$
which is used to classify observations based on the following rule:
$$\begin{cases} \mathbf{x} \in \mathcal{C}_0, & \text{if } W^{QDA}(\mathbf{x}) > 0 \\ \mathbf{x} \in \mathcal{C}_1, & \text{otherwise.} \end{cases} \tag{3}$$
As seen from (2), the classification rule of QDA involves the true parameters of the Gaussian distribution, namely the means and covariances associated with each class. In practice, these parameters are not known. One approach to solve this issue is to estimate them using the available training data. The obtained estimates are then used as plug-in estimators in (2). In particular, consider the case in which $n_i$, $i \in \{0,1\}$, training observations are available for each class $\mathcal{C}_i$, and denote by $\mathcal{T}_0 = \{\mathbf{x}_l \in \mathcal{C}_0\}_{l=1}^{n_0}$ and $\mathcal{T}_1 = \{\mathbf{x}_l \in \mathcal{C}_1\}_{l=n_0+1}^{n_0+n_1=n}$ their respective samples. The sample estimates of the mean and covariance of each class are then given by:
$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i}\sum_{l \in \mathcal{T}_i}\mathbf{x}_l, \quad i \in \{0,1\}$$
$$\hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i - 1}\sum_{l \in \mathcal{T}_i}\left(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i\right)\left(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i\right)^T, \quad i \in \{0,1\}$$
In case the number of samples $n_0$ or $n_1$ is less than the number of features, the use of the sample covariance matrix as a plug-in estimator is not permitted, since its inverse is not defined. A popular approach to circumvent this issue is to consider a regularized estimator of the inverse of the covariance matrix, given by
$$\mathbf{H}_i(\gamma) = \left(\mathbf{I}_p + \gamma\hat{\boldsymbol{\Sigma}}_i\right)^{-1}, \quad i \in \{0,1\} \tag{4}$$
where $\gamma$ is a regularization parameter, which serves to shrink the sample covariance matrix towards the identity. Replacing $\boldsymbol{\Sigma}_i^{-1}$ by $\mathbf{H}_i(\gamma)$ yields the following classification rule for the traditional R-QDA:
$$\widehat{W}^{R\text{-}QDA}(\mathbf{x}) = \frac{1}{2}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2}\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right)^T\mathbf{H}_0(\gamma)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right) + \frac{1}{2}\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right)^T\mathbf{H}_1(\gamma)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right) - \log\frac{\pi_1}{\pi_0} \tag{5}$$
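For illustration, the following minimal NumPy sketch builds the plug-in quantities of (4) and evaluates the score (5); the function names (fit_rqda, rqda_score) are ours and the snippet is an illustrative sketch, not the authors' reference implementation.

```python
# Minimal sketch of the plug-in R-QDA rule (5); names are illustrative only.
import numpy as np

def fit_rqda(X0, X1, gamma):
    """Return (mu_hat_i, H_i(gamma)) for each class; X_i is an (n_i, p) matrix."""
    params = []
    for X in (X0, X1):
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)      # divides by n_i - 1, as in the text
        H = np.linalg.inv(np.eye(X.shape[1]) + gamma * Sigma)
        params.append((mu, H))
    return params

def rqda_score(x, params, pi0=0.5, pi1=0.5):
    """Score (5): x is assigned to class C_0 when the returned value is positive."""
    (mu0, H0), (mu1, H1) = params
    log_det = 0.5 * (np.linalg.slogdet(H0)[1] - np.linalg.slogdet(H1)[1])
    q0 = 0.5 * (x - mu0) @ H0 @ (x - mu0)
    q1 = 0.5 * (x - mu1) @ H1 @ (x - mu1)
    return log_det - q0 + q1 - np.log(pi1 / pi0)
```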
The classifier R-QDA wrongly classifies an observation $\mathbf{x}$ if $\widehat{W}^{R\text{-}QDA}(\mathbf{x}) < 0$ while $\mathbf{x} \in \mathcal{C}_0$, or if $\widehat{W}^{R\text{-}QDA}(\mathbf{x}) > 0$ while $\mathbf{x} \in \mathcal{C}_1$. Conditioning on the training samples $\mathcal{T}_i$, $i \in \{0,1\}$, the classification error associated with class $\mathcal{C}_i$ is thus given by
$$\epsilon_i^{R\text{-}QDA} = \mathbb{P}\left[(-1)^i\,\widehat{W}^{R\text{-}QDA}(\mathbf{x}) < 0 \,\middle|\, \mathbf{x} \in \mathcal{C}_i, \mathcal{T}_0, \mathcal{T}_1\right] \tag{6}$$
which gives the following expression for the total misclassification error probability:
$$\epsilon^{R\text{-}QDA} = \pi_0\,\epsilon_0^{R\text{-}QDA} + \pi_1\,\epsilon_1^{R\text{-}QDA}. \tag{7}$$

In this section, we unveil several issues related to the use of the classification rule (5) of R-QDA in high dimensional settings. First, we shall recall that for a classification rule to be able to discriminate observations, it is essential that it presents a non-negligible difference in distributional behavior when the testing observation changes from one class to the other. Clearly, if it behaves similarly for all testing observations, it is not possible for it to distinguish between observations from different classes. This change in behavior should be reflected by a notable difference in the expected values of the classification rule when the testing observations belong to class $\mathcal{C}_0$ or $\mathcal{C}_1$, which by reference to (3) need to be preferably of opposite signs. Consider the normalized classification rule of the traditional R-QDA, $\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}(\mathbf{x})$, and denote by $S_i$ and $V_i$ its expected value and its variance taken over the distribution of the testing observation $\mathbf{x}$ when it belongs to class $\mathcal{C}_i$. Letting $\boldsymbol{\mu} = \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1$, $S_i$ and $V_i$ are thus given by:
$$S_i = \frac{1}{2\sqrt{p}}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) + \frac{1}{2\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1) - \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma)\right] + \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma)\right] - \frac{1}{\sqrt{p}}\log\frac{\pi_1}{\pi_0} \tag{8}$$
$$V_i = \frac{1}{2p}\mathrm{Tr}\left[(\mathbf{H}_0(\gamma) - \mathbf{H}_1(\gamma))\boldsymbol{\Sigma}_i(\mathbf{H}_0(\gamma) - \mathbf{H}_1(\gamma))\boldsymbol{\Sigma}_i\right] + \frac{1}{p}\left((\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma) - (\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma)\right)\boldsymbol{\Sigma}_i\left(\mathbf{H}_0(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) - \mathbf{H}_1(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)\right) \tag{9}$$
At this point, we shall recall that $S_i$ and $V_i$ are still random, since they depend on the training data, which are assumed to be drawn independently from the distribution associated with each class. At first sight, the expressions of $S_i$ and $V_i$ are complicated, and it does not seem that much information can be drawn from them. To gain insight into their behavior in high dimensional settings, we consider the regime in which $n_0$, $n_1$ and $p$ are large and commensurable, with $n_0$ and $n_1$ not asymptotically equal ($\frac{n_0}{n_1} \to \ell \neq 1$), and assume additionally that the spectral norms of $\boldsymbol{\Sigma}_i$, $i \in \{0,1\}$, do not grow with $p$, while $\|\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\|^2$ scales at most like $O(\sqrt{p})$. Under these assumptions, it is easy to see that $S_i$ and $V_i$ satisfy:
$$S_i = \underbrace{\frac{1}{2\sqrt{p}}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma)\right] + \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma)\right]}_{O_p(\sqrt{p})} + O_p(1) \tag{11}$$
$$V_i = O_p(1), \tag{12}$$
where we recall that $X = O_p(p^{\alpha})$ means that $p^{-\alpha}X$ is bounded in probability (see van der Vaart (1998b) for the formal definition of the notation $O_p(\cdot)$). Several important remarks are in order regarding (11). First, we note that the prior probabilities $\pi_0$ and $\pi_1$ do not asymptotically play any role in the classification, since the term $\frac{1}{\sqrt{p}}\log\frac{\pi_1}{\pi_0}$ tends to zero. Hence, the information regarding the prior probabilities is asymptotically lost in the quantities $S_i$, $i = 0,1$. Second, one can easily see that if the distance between the covariances is such that $\frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_0\mathbf{H}_0] - \frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_1\mathbf{H}_0] = O_p(1)$ and $\frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_0\mathbf{H}_1] - \frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_1\mathbf{H}_1] = O_p(1)$, which occurs for instance when $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ has rank at most $O(\sqrt{p})$ (Elkhalil et al., 2017), the quantities $S_i$, $i = 0,1$, satisfy:
$$S_i = \frac{1}{2\sqrt{p}}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{H}_0(\gamma)\right] + \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{H}_1(\gamma)\right] + O_p(1), \quad i \in \{0,1\}. \tag{13}$$
From (13), it appears that the highest order of $S_i$ is $O_p(\sqrt{p})$ but is non-informative, being the same for all testing observations regardless of the class to which they belong. Moreover, as the variance is $O_p(1)$, from Chebyshev's inequality (applied conditionally on the training samples) one can deduce that R-QDA keeps the same sign for the majority of the testing observations, irrespective of their corresponding classes.

To visually illustrate this result, we display in Figure 1a and Figure 1b the histograms of the QDA statistic in (2) and of the R-QDA statistic in (5) when applied to testing observations from both classes. As can be seen, using the true statistics, QDA presents a clear change in distribution that should visually allow distinction between both classes. However, when using R-QDA, there is an important overlap between the histograms associated with both classes, with all realizations presenting the same sign. By reference to the decision rule in (3), this leads R-QDA to assign all observations to the same class.

Figure 1: Histograms of the classification rule for (a) R-QDA based on the regularized covariance estimate with $\gamma = 10$, and (b) the QDA classifier based on the true statistics. We consider $p = 1000$ features with unbalanced training sizes $n_0 = 500$, $n_1 = 1000$, $\boldsymbol{\Sigma}_0 = 10\,\mathbf{I}_p$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_0$, $\boldsymbol{\mu}_0 = \mathbf{1}_{p\times 1}$ and $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \frac{1}{\sqrt{p}}\mathbf{1}_{p\times 1}$. The testing set is of size 5000 and 10000 samples for the first and second class, respectively.

The reason why the same behavior is not encountered when the same number of training samples is used for both classes lies in the fact that, under this setting, the sample covariance matrices of both classes are computed based on the same number of training samples. The scores associated with the two classes are thus comparable, and as such their difference, which forms the R-QDA statistic, cancels out the non-informative estimation-induced noise and asymptotically keeps the information relevant for classification. On the contrary, when both classes do not have the same number of training samples, the scores associated with each class contain estimation-induced noises that are not of the same level. The R-QDA statistic resulting from the difference between these scores will thus essentially be, at its highest order, a non-informative quantity caused by this difference in estimation quality of the covariance matrices. As shall be shown next, the use of RMT tools theoretically confirms this intuition and, most importantly, allows us to propose an RMT-improved QDA classifier, outfitted with two regularization parameters as well as a modified bias, carefully chosen so as to minimize the misclassification error rate. More formally, the classification rule associated with the proposed classifier is given by:
$$\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x}) = -\theta\sqrt{p} - \left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right)^T\mathbf{H}_0(\gamma_0)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right) + \left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right)^T\mathbf{H}_1(\gamma_1)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right) \tag{14}$$
where 1) $\gamma_0$ and $\gamma_1$ are two regularization parameters weighting the sample covariance matrix of each class, carefully devised so that the expected value $\mathbb{E}_{\mathbf{x}}\left[\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x})\right]$, when $\mathbf{x} \in \mathcal{C}_0$ or $\mathcal{C}_1$, is $O_p(1)$ and reflects the class under consideration, and 2) $\theta$ is a bias term that will be set to the value that minimizes the asymptotic classification error rate.
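For concreteness, a short sketch of the improved score (14) follows, assuming $\mathbf{H}_0$ and $\mathbf{H}_1$ have already been formed with their class-specific regularizers $\gamma_0$, $\gamma_1$ and that the bias $\theta$ is given (its choice is the subject of Section 3); as before, the code is ours and purely illustrative.

```python
# Sketch of the improved rule (14): the two regularizers enter through H0, H1,
# and the log-prior term of (5) is replaced by the bias theta.
import numpy as np

def improved_rqda_score(x, mu0, mu1, H0, H1, theta):
    """Positive score -> assign x to C_0; H_i = (I + gamma_i * Sigma_hat_i)^{-1}."""
    p = x.shape[0]
    q0 = (x - mu0) @ H0 @ (x - mu0)
    q1 = (x - mu1) @ H1 @ (x - mu1)
    return -theta * np.sqrt(p) - q0 + q1
```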
3. Design of the improved R-QDA classifier
In this section, we propose an improved design of the R-QDA classifier that fixes the aforementioned issues met in imbalanced settings. The design is based on an asymptotic analysis of the statistic in (14) under the following asymptotic regime.
Assumption 1 (Data scaling). $\frac{p}{n} \to c \in (0, \infty)$ and $\frac{n_0}{n_1} \to \ell$.

Assumption 2 (Mean scaling). $\|\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\|^2 = O(\sqrt{p})$.

Assumption 3 (Covariance scaling). $\|\boldsymbol{\Sigma}_i\| = \Theta(1)$, $i = 0,1$.

Assumption 4. The matrix $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ has exactly $\Theta(\sqrt{p})$ eigenvalues of order $\Theta(1)$. The remaining eigenvalues are $O\left(\frac{1}{\sqrt{p}}\right)$.

Assumptions 1 and 3 are standard and are often used to describe a growth regime in which the number of features scales comparably with that of the samples and the spectral norms of both covariance matrices remain bounded. Note, however, that Assumption 1 is more general than the one considered in the work of (Elkhalil et al., 2017), as it accounts for imbalanced settings in which $\frac{n_0}{n_1} \to \ell \neq 1$. Assumption 2 provides the minimal scaling of the distance between the mean vectors so that they can be used to discriminate between both classes (Couillet et al., 2018). Finally, Assumption 4, introduced in (Elkhalil et al., 2017), specifies a difference between the covariances that suffices on its own (regardless of the condition on the mean vectors) to inform on the class of the testing observation.

Under the asymptotic regime specified by Assumptions 1-4, and along the same lines as in (Elkhalil et al., 2017), we analyze the classification error rate of the proposed classifier based on the classification rule (14).
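To experiment with this regime, one can draw synthetic classes satisfying Assumptions 1-4; the following hedged sketch mirrors the construction later used for Figure 2 ($\boldsymbol{\Sigma}_1$ a rank-$\sqrt{p}$ perturbation of $\boldsymbol{\Sigma}_0$), with constants that are ours and purely illustrative.

```python
# Hedged generator of a synthetic pair of classes satisfying Assumptions 1-4;
# the constants (4, 3, the sqrt(p) rank) mirror the Figure 2 setup and are
# otherwise illustrative.
import numpy as np

def make_classes(p=400, n0=200, n1=400, seed=0):
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(p)
    mu1 = mu0 + np.ones(p) / np.sqrt(p)        # ||mu0 - mu1||^2 = 1 = O(sqrt(p))
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # Haar-like orthogonal matrix
    d = np.zeros(p)
    d[: int(np.sqrt(p))] = 1.0                 # Theta(sqrt(p)) eigenvalues of order 1
    Sigma0 = 4.0 * np.eye(p)
    Sigma1 = Sigma0 + 3.0 * (Q * d) @ Q.T      # rank-sqrt(p) perturbation (Assumption 4)
    X0 = rng.multivariate_normal(mu0, Sigma0, size=n0)
    X1 = rng.multivariate_normal(mu1, Sigma1, size=n1)
    return X0, X1
```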
Before presenting the corresponding result, we first introduce the following notation, which defines deterministic objects that naturally appear when using random matrix theory results. For $i = 0,1$, let $\delta_i$ be the unique positive solution to the following equation:
$$\delta_i = \frac{1}{n_i}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\left(\mathbf{I}_p + \frac{\gamma_i}{1 + \gamma_i\delta_i}\boldsymbol{\Sigma}_i\right)^{-1}\right] \tag{15}$$
The existence and uniqueness of $\delta_i$ follow from standard results in random matrix theory (Hachem et al., 2008). For $i = 0,1$, we also define the matrices $\mathbf{T}_i$ as:
$$\mathbf{T}_i = \left(\mathbf{I}_p + \frac{\gamma_i}{1 + \gamma_i\delta_i}\boldsymbol{\Sigma}_i\right)^{-1} \tag{16}$$
and the scalars $\phi_i$ and $\tilde{\phi}_i$ as:
$$\phi_i = \frac{1}{n_i}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i^2\mathbf{T}_i^2\right], \qquad \tilde{\phi}_i = \frac{1}{(1 + \gamma_i\delta_i)^2} \tag{17}$$
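Numerically, the fixed point (15) can be solved by direct iteration; the following sketch (ours, usable only when $\boldsymbol{\Sigma}_i$ is known, e.g. in synthetic experiments) illustrates this, with an arbitrary tolerance and initialization.

```python
# Fixed-point iteration for delta_i in (15); tolerance and starting point are ours.
import numpy as np

def solve_delta(Sigma, n, gamma, tol=1e-10, max_iter=10_000):
    p = Sigma.shape[0]
    delta = p / n                           # any positive starting point works
    for _ in range(max_iter):
        T = np.linalg.inv(np.eye(p) + gamma / (1.0 + gamma * delta) * Sigma)
        new_delta = np.trace(Sigma @ T) / n
        if abs(new_delta - delta) < tol:
            break
        delta = new_delta
    return delta
```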
With these notations at hand, we are now in a position to state the first asymptotic result.

Theorem 1 Under Assumptions 1-4, and assuming that the regularization parameters $\gamma_0$ and $\gamma_1$ are $\Theta(1)$, for $i \in \{0,1\}$ the classification error rate associated with class $\mathcal{C}_i$, defined as $\epsilon_i^{imp} = \mathbb{P}\left[(-1)^i\,\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x}) < 0 \,\middle|\, \mathbf{x} \in \mathcal{C}_i, \mathcal{T}_0, \mathcal{T}_1\right]$, satisfies:
$$\epsilon_i^{imp} - \Phi\left(\frac{(-1)^i\left(\xi_i - b_i\right)}{\sqrt{B_i + 4r_i}}\right) \xrightarrow{p} 0 \tag{18}$$
where
$$\xi_i \triangleq \frac{(-1)^{i+1}}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} + \theta, \quad \text{with } \boldsymbol{\mu} = \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1 \tag{19}$$
$$b_i = \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right] \tag{20}$$
$$B_i = \frac{n_i}{p}\,\frac{\phi_i}{1 - \gamma_i^2\phi_i\tilde{\phi}_i} + \frac{1}{p}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i^2\mathbf{T}_{1-i}^2\right] + \frac{n_{1-i}}{p}\,\frac{\gamma_{1-i}^2\tilde{\phi}_{1-i}}{1 - \gamma_{1-i}^2\phi_{1-i}\tilde{\phi}_{1-i}}\left(\frac{1}{n_{1-i}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\boldsymbol{\Sigma}_{1-i}\mathbf{T}_{1-i}^2\right]\right)^2 - \frac{2}{p}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_0\boldsymbol{\Sigma}_i\mathbf{T}_1\right] \tag{21}$$
$$r_i = \frac{1}{p}\,\frac{\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\boldsymbol{\mu}}{1 - \gamma_{1-i}^2\phi_{1-i}\tilde{\phi}_{1-i}} \tag{22}$$
Proof We only provide a sketch of the proof, since it follows along the same lines as in (Elkhalil et al., 2017). To begin with, note that $\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x})$ is a quadratic form in the testing observation $\mathbf{x}$, which, when it belongs to class $\mathcal{C}_i$, $i \in \{0,1\}$, is assumed to follow a Gaussian distribution with mean $\boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_i$. By Lyapunov's central limit theorem (Billingsley, 1995), when $\mathbf{x}$ is in $\mathcal{C}_i$, $\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x})$ satisfies:
$$\frac{1}{\sqrt{\tilde{V}_i}}\left(\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x}) - \tilde{S}_i\right) \xrightarrow{d} \mathcal{N}(0, 1) \tag{23}$$
where $\tilde{S}_i$ and $\tilde{V}_i$ are given by:
$$\tilde{S}_i = -\theta - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma_0)\right] + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma_1)\right] - \frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma_0)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) + \frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma_1)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1) \tag{24}$$
$$\tilde{V}_i = \frac{2}{p}\mathrm{Tr}\left[(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i\right] + \frac{4}{p}\left((\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma_0) - (\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma_1)\right)\boldsymbol{\Sigma}_i\left(\mathbf{H}_0(\gamma_0)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) - \mathbf{H}_1(\gamma_1)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)\right) \tag{25}$$
From (23), and using Lemma 2.11 in van der Vaart (1998a), we can easily see that for $i \in \{0,1\}$:
$$\epsilon_i^{imp} - \Phi\left((-1)^i\frac{-\tilde{S}_i}{\sqrt{\tilde{V}_i}}\right) \to 0. \tag{26}$$
It then follows, using standard results on bilinear forms of large random matrices (Hachem et al., 2013), that:
$$\left(\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma_1)\right] - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma_0)\right]\right) - b_i \xrightarrow{a.s.} 0 \tag{27}$$
Moreover,
$$\frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_i)^T\mathbf{H}_i(\gamma_i)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_i) \xrightarrow{a.s.} 0 \tag{28}$$
$$\frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_{1-i})^T\mathbf{H}_{1-i}(\gamma_{1-i})(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_{1-i}) - \frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} \xrightarrow{a.s.} 0 \tag{29}$$
so that
$$-\tilde{S}_i - \left(\xi_i - b_i\right) \xrightarrow{a.s.} 0. \tag{30}$$
The variance can be treated similarly, and based on the results in (Elkhalil et al., 2017) one can prove that:
$$\frac{2}{p}\mathrm{Tr}\left[(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i\right] - B_i \xrightarrow{a.s.} 0$$
$$\frac{4}{p}\left((\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma_0) - (\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma_1)\right)\boldsymbol{\Sigma}_i\left(\mathbf{H}_0(\gamma_0)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) - \mathbf{H}_1(\gamma_1)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)\right) - 4r_i \xrightarrow{a.s.} 0.$$
Hence,
$$\tilde{V}_i - \left(B_i + 4r_i\right) \xrightarrow{a.s.} 0. \tag{31}$$
Replacing $-\tilde{S}_i$ and $\tilde{V}_i$ in (26) by their deterministic equivalents in (30) and (31), we obtain the desired convergence (18).
Remark: Under Assumption 4, it can be shown that $B_i$ can asymptotically be simplified to
$$B_i \triangleq \frac{n_i}{p}\,\frac{\gamma_i^2\phi_i\tilde{\phi}_i}{1 - \gamma_i^2\phi_i\tilde{\phi}_i} + \Theta\left(\frac{1}{\sqrt{p}}\right)$$
and that $B_0 = B_1 + \Theta\left(\frac{1}{\sqrt{p}}\right)$. Moreover, the term $r_i$ is $O\left(\frac{1}{\sqrt{p}}\right)$ and as such converges to zero as $p$ and $n$ grow to infinity. However, in our simulations, we chose to work with the non-simplified expressions of $B_i$ and to keep the term $r_i$, since we observed that in doing so a better accuracy is obtained in finite-dimensional simulations.

The result of Theorem 1 provides guidelines on how to choose $\gamma_0$ and $\gamma_1$ and the optimal bias $\theta$. As discussed before, the design should require the mean of the classification rule to be $\Theta(1)$ and to reflect the class under consideration. This mean is represented in the asymptotic expression of the classification error rate by the quantity $\xi_i - b_i$, which is $\Theta(\sqrt{p})$ for arbitrary $\gamma_0$ and $\gamma_1$, as $b_i = \Theta(\sqrt{p})$ and $\xi_i = \Theta(1)$. Moreover, the class of the testing observation is not reflected in $b_i$, since under Assumptions 3-4, in case $b_i = \Theta(\sqrt{p})$, $b_i = \frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_0(\mathbf{T}_1 - \mathbf{T}_0)] + O(1)$, and as such, up to a quantity of order $O(1)$, $b_0$ and $b_1$ are equal. To solve this issue, we need to design $\gamma_0$ and $\gamma_1$ such that for $i \in \{0,1\}$, $b_i$ is $\Theta(1)$, or equivalently,
$$\frac{1}{p}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right] = \Theta\left(\frac{1}{\sqrt{p}}\right) \tag{32}$$
so that $b_0$ becomes different from $b_1$ at its highest order. To this end, we prove that it suffices to select the regularization parameter associated with the class with the largest number of samples as follows.
Theorem 2 Under Assumptions 1-4, and assuming that $n_1 > n_0$, if
$$\gamma_1 = \gamma_0 - \left(\frac{1}{n_0} - \frac{1}{n_1}\right)\gamma_0^2\,\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right], \tag{33}$$
where $\gamma_0$ is fixed to a given constant, then $b_i = \Theta(1)$.
Proof See Appendix A.

It is worth mentioning that in the balanced case, plugging $n_0 = n_1$ into (33) yields $\gamma_1 = \gamma_0$. This shows that when the same number of training samples is used across classes, it is not necessary to regularize the two sample covariance matrices with different regularization parameters.

Now that this choice of the regularization parameters has been set, it remains to select the optimal bias $\theta$. It can be chosen as the value that minimizes the asymptotic classification error rate, given by:
$$\epsilon = \pi_0\Phi\left(\frac{\xi_0 - b_0}{\sqrt{B_0}}\right) + \pi_1\Phi\left(-\frac{\xi_1 - b_1}{\sqrt{B_0}}\right)$$
where we have used the fact that $B_0 = B_1$ up to vanishing terms.
Theorem 3 The optimal bias minimizing the asymptotic classification error rate is given by:
$$\theta^\star = \frac{\beta_1 - \beta_0}{2} - \frac{\alpha^2}{\beta_0 + \beta_1}\log\frac{\pi_1}{\pi_0} \tag{34}$$
where
$$\beta_0 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_1\boldsymbol{\mu} - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\beta_1 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_0\boldsymbol{\mu} + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\alpha = \sqrt{B_0}$$
Proof See Appendix B.

Before proceeding further, it is important to note that, thanks to the careful choice of the regularization parameters $\gamma_0$ and $\gamma_1$ provided in Theorem 2, the term $\frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_i(\mathbf{T}_1 - \mathbf{T}_0)]$ is $\Theta(1)$ for $i \in \{0,1\}$. Additionally, it can easily be shown that the term $\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_i\boldsymbol{\mu}$ is of order $\Theta(1)$. As a result, both $\beta_0$ and $\beta_1$ are $\Theta(1)$.

On another note, it is worth mentioning that even in the case of balanced classes $n_0 = n_1$, characterized by $\gamma_0 = \gamma_1$ as proved in Theorem 2, the optimal bias is different from the one traditionally used in R-QDA. As such, the proposed design improves on the traditional R-QDA studied in (Elkhalil et al., 2017) even in the balanced case, as it optimally adapts the bias term to the situation in which the covariance matrices are not known.

Theorems 2 and 3 can be used to obtain an optimized design of the proposed R-QDA classifier. As can be seen, the improved classifier employs only one free regularization parameter, associated with the class that presents the smallest number of training samples. Assume $\mathcal{C}_0$ is such a class. The regularization parameter associated with the other class cannot be arbitrarily chosen and should be set as in (33), while the bias is selected according to (34). However, pursuing this design is not possible in practice, due to the dependence of (33) and (34) on the true covariance matrices. To solve this issue, we propose in the following theorem consistent estimators for the quantities arising in (33) and (34) that depend only on the training samples.
Theorem 4 Assume $n_1 > n_0$ and let $\gamma_0$ be the regularization parameter associated with class $\mathcal{C}_0$. Let $\hat{\delta}_0$ be given by:
$$\hat{\delta}_0 = \frac{1}{\gamma_0}\cdot\frac{\frac{p}{n_0} - \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}{1 - \frac{p}{n_0} + \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}$$
and define $\hat{\gamma}_1$ as:
$$\hat{\gamma}_1 = \gamma_0 - \gamma_0^2\left(1 - \frac{n_0}{n_1}\right)\hat{\delta}_0 \tag{35}$$
Then $\hat{\gamma}_1 - \gamma_1 \xrightarrow{a.s.} 0$, where $\gamma_1$ is given in (33). Define $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\alpha}$ as:
$$\hat{\beta}_0 = -\frac{1}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\hat{\gamma}_1)(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1) - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] + \frac{n_0}{\sqrt{p}}\hat{\delta}_0$$
$$\hat{\beta}_1 = -\frac{1}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_0(\gamma_0)(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1) - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_1\mathbf{H}_0(\gamma_0)\right] + \frac{n_1}{\sqrt{p}}\hat{\delta}_1 \tag{36}$$
$$\hat{\alpha} = \sqrt{\hat{B}}$$
where $\hat{\delta}_1$ is defined in the same way as $\hat{\delta}_0$, with $(n_0, \gamma_0, \mathbf{H}_0)$ replaced by $(n_1, \hat{\gamma}_1, \mathbf{H}_1)$, and where $\hat{B}$ writes as:
$$\hat{B} = \frac{(1 + \gamma_0\hat{\delta}_0)^2}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_0(\gamma_0)\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_0(\gamma_0)\right] - \frac{n_0}{p}\hat{\delta}_0^2(1 + \gamma_0\hat{\delta}_0)^2 + \frac{1}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] - \frac{n_1}{p}\left(\frac{1}{n_1}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right]\right)^2 - \frac{2(1 + \gamma_0\hat{\delta}_0)}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_0(\gamma_0)\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] + \frac{2\hat{\delta}_0(1 + \gamma_0\hat{\delta}_0)}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] \tag{37}$$
Let $\hat{\theta}^\star$ be given by:
$$\hat{\theta}^\star = \frac{\hat{\beta}_1 - \hat{\beta}_0}{2} - \frac{\hat{\alpha}^2}{\hat{\beta}_0 + \hat{\beta}_1}\log\frac{\pi_1}{\pi_0} \tag{38}$$
Then $\hat{\theta}^\star - \theta^\star \xrightarrow{a.s.} 0$, where $\theta^\star$ is given in (34).

Proof See Appendix C.

It is worth mentioning that, unlike $\gamma_1$, $\hat{\gamma}_1$ is random. It does not satisfy (33) with equality, but it ensures (32) almost surely. Its use as a replacement for $\gamma_1$ leads asymptotically to the same results as the improved classifier using $\gamma_1$. For the reader's convenience, we provide hereafter the algorithm describing the proposed improved QDA classifier:
Algorithm 1: Improved design of the R-QDA classifier.
Input: Assuming $n_1 \geq n_0$, a regularization parameter $\gamma_0$ associated with class $\mathcal{C}_0$, and the training samples $\mathcal{T}_0 = \{\mathbf{x}_l\}_{l=1}^{n_0}$ of $\mathcal{C}_0$ and $\mathcal{T}_1 = \{\mathbf{x}_l\}_{l=n_0+1}^{n_0+n_1=n}$ of $\mathcal{C}_1$.
Output: Estimates of the parameters $\gamma_1$ and $\theta^\star$ to be plugged into (14).
1. Compute $\hat{\gamma}_1$ as in (35).
2. Compute $\hat{\theta}^\star$ as in (38).
3. Return $\hat{\theta}^\star$ and $\hat{\gamma}_1$, to be plugged into the classification rule (14).

The improved design described in Algorithm 1 depends on the regularization parameter $\gamma_0$ associated with the class with the smallest number of training samples. One possible way to adjust this parameter is to resort to a traditional cross-validation approach, which consists in estimating, based on a set of testing data, the classification error rate for each candidate value of the regularization parameter $\gamma_0$. Such an approach presents several drawbacks: it is computationally expensive and far from optimal, as it can only test a few values of $\gamma_0$. As an alternative, we propose to build a consistent estimator of the classification error rate based on results from random matrix theory, which can then assist in the setting of the regularization parameter $\gamma_0$.
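A sketch of steps 1 and 2 of Algorithm 1, computed from the training data alone, is given below; it covers $\hat{\delta}_0$ and the estimator (35), while $\hat{\theta}^\star$ additionally requires $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\alpha}$ from (36)-(37) and is omitted for brevity. The function names are ours.

```python
# Sketch of steps 1-2 of Algorithm 1; only delta_hat and the estimator (35)
# are shown, the bias estimator (38) needs (36)-(37) on top of these.
import numpy as np

def delta_hat(X, gamma):
    """Consistent estimator of delta for one class, built from H(gamma) alone."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    H = np.linalg.inv(np.eye(p) + gamma * Sigma)
    t = np.trace(H) / n
    return (p / n - t) / (gamma * (1.0 - p / n + t))

def gamma1_hat(X0, n1, gamma0):
    """Estimator (35) of the majority-class regularizer (assumes n1 >= n0)."""
    n0 = X0.shape[0]
    return gamma0 - gamma0**2 * (1.0 - n0 / n1) * delta_hat(X0, gamma0)
```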
This is the objective of the following theorem.

Theorem 5 Under Assumptions 1-4, a consistent estimator of the misclassification error rate associated with class $\mathcal{C}_i$ is given by:
$$\hat{\epsilon}_i = \Phi\left(\frac{(-1)^i\left(\hat{\xi}_i - \hat{b}_i\right)}{\sqrt{\hat{B} + 4\hat{r}_i}}\right)$$
where $\hat{B}$ is given in (37), $\gamma_1$ is set to $\hat{\gamma}_1$, and
$$\hat{\xi}_i = \hat{\theta}^\star + \frac{(-1)^{i+1}}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_{1-i}(\gamma_{1-i})(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1), \quad i \in \{0,1\}$$
$$\hat{\delta}_i = \frac{1}{\gamma_i}\cdot\frac{\frac{p}{n_i} - \frac{1}{n_i}\mathrm{Tr}\left[\mathbf{H}_i(\gamma_i)\right]}{1 - \frac{p}{n_i} + \frac{1}{n_i}\mathrm{Tr}\left[\mathbf{H}_i(\gamma_i)\right]}, \quad i \in \{0,1\}$$
$$\hat{b}_i = \frac{(-1)^i}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_i\mathbf{H}_{1-i}(\gamma_{1-i})\right] + \frac{(-1)^{i+1}n_i}{\sqrt{p}}\hat{\delta}_i, \quad i \in \{0,1\}$$
$$\hat{r}_i = \frac{1}{p}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_{1-i}(\gamma_{1-i})\hat{\boldsymbol{\Sigma}}_i\mathbf{H}_{1-i}(\gamma_{1-i})(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1), \quad i \in \{0,1\}$$
in the sense that:
$$\hat{\epsilon}_i - \epsilon_i^{imp} \xrightarrow{a.s.} 0$$
Proof The proof is based on employing the consistent estimators provided in (Elkhalil et al., 2017) and is thus omitted.
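In practice, Theorem 5 can be used as follows: sweep a grid of candidate $\gamma_0$ values and retain the one minimizing the estimated total error $\pi_0\hat{\epsilon}_0 + \pi_1\hat{\epsilon}_1$. A hedged sketch follows, where `estimate_errors` is a hypothetical placeholder for an implementation of the Theorem 5 formulas, passed in as a callable.

```python
# Hedged sketch of gamma_0 selection via Theorem 5; `estimate_errors` is a
# hypothetical callable returning (eps_hat_0, eps_hat_1) for a given gamma_0.
import numpy as np

def tune_gamma0(estimate_errors, X0, X1, pi0, pi1, grid=None):
    grid = np.logspace(-2, 2, 50) if grid is None else grid
    best_gamma, best_err = grid[0], np.inf
    for g in grid:
        e0, e1 = estimate_errors(X0, X1, g)   # Theorem 5 estimators at gamma_0 = g
        err = pi0 * e0 + pi1 * e1
        if err < best_err:
            best_gamma, best_err = g, err
    return best_gamma
```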
4. Numerical results
In this section, we assess the performance of our improved R-QDA classifier and compare it with the standard R-QDA classifier in the case of imbalanced training data. To this end, we start by generating, for both classes, synthetic data that comply with the different assumptions used throughout this work, for the sake of validating our theoretical results. In Figure 2 and Figure 3, we plot the classification error rates of the improved classifier and of the traditional R-QDA classifier with respect to the regularization parameter $\gamma_0$ and the feature dimension $p$, respectively. As can be seen, the standard R-QDA has a classification error rate that converges to the prior of the dominant class, which reveals that, as expected, it tends to assign all observations to the same class, which in this case coincides with the class that presents the highest number of training samples. On the contrary, the proposed R-QDA classifier presents much better performance, making it more suitable to cope with imbalanced settings. We finally note that the consistent estimator based on the results of Theorem 5 is accurate, and as such can be used to properly adjust the regularization parameter $\gamma_0$.

Figure 2: Average misclassification error rate versus the regularization parameter $\gamma_0$, for the improved classifier (opt QDA), the standard one (std QDA), and the consistent estimator of Theorem 5 (g est). We consider $p = 1000$ features with unbalanced training sizes $n_1 = 2n_0$, $\boldsymbol{\Sigma}_0 = 4\mathbf{I}_p$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_0 + 3\mathbf{Q}_p\mathbf{D}_p\mathbf{Q}_p^T$ with $\mathbf{Q}_p$ orthogonal, $\mathbf{D}_p = \mathrm{diag}\left(\mathbf{1}_{\sqrt{p}}, \mathbf{0}_{p-\sqrt{p}}\right)$, and $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \frac{1}{\sqrt{p}}\mathbf{1}_{p\times 1}$.

We then test the performance of the proposed R-QDA classifier on the public USPS dataset of handwritten digits (Lecun et al., 1998) and on the EEG dataset. The USPS dataset is composed of 42000 labeled digit images, each image having $p = 784$ features represented by $28 \times 28$ pixels. The EEG dataset is composed of 5 classes that contain 4097 observations, each observation having $p = 178$ features. We consider the classification of two classes from each dataset, composed of $n_0$ and $n_1$ samples. Based on the results of Theorem 5, we tune the regularization factor $\gamma_0$ to the value that minimizes the consistent estimate of the misclassification error rate. The values of $\hat{\gamma}_1$ and $\hat{\theta}^\star$ are then computed based on (35) and (38). Figure 4 and Figure 5 compare the performance of the proposed classifier with other state-of-the-art classification algorithms using cross-validation, for different ratios $\frac{n_0}{n_1}$. As seen, our classifier, termed RQDA$_{imp}$ in the figures, not only outperforms the standard QDA but also the other existing classification algorithms. Moreover, it is worth mentioning that the standard R-QDA is the classifier that presents the lowest performance in imbalanced settings, corresponding to $\frac{n_0}{n_1} < 1$. This suggests that the use of different regularization parameters across classes in the QDA classification rule, along with an adequate tuning of the bias, makes the R-QDA classifier more robust to the estimation noise of the covariance matrices in imbalanced settings.
Figure 3: Average misclassification error rate versus the dimension $p$. We consider $\gamma_0 = 1$ with unbalanced training sizes $n_1 = 2n_0$, $\boldsymbol{\Sigma}_0 = 4\mathbf{I}_p$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_0 + 3\mathbf{Q}_p\mathbf{D}_p\mathbf{Q}_p^T$ with $\mathbf{Q}_p$ orthogonal, $\mathbf{D}_p = \mathrm{diag}\left(\mathbf{1}_{\sqrt{p}}, \mathbf{0}_{p-\sqrt{p}}\right)$, and $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \frac{1}{\sqrt{p}}\mathbf{1}_{p\times 1}$.
Figure 4: Comparison of the accuracy of our improved R-QDA classifier (RQDA$_{imp}$) with other machine learning algorithms (QDA, LR, SVM, DT, RF, KNN, NN) on the EEG dataset, as a function of $\frac{n_0}{n_1}$.
Figure 5: Comparison of the accuracy of our improved R-QDA classifier (RQDA$_{imp}$) with other machine learning algorithms (QDA, LR, SVM, DT, RF, KNN) on the USPS dataset, as a function of $\frac{n_0}{n_1}$.
5. Conclusion
The traditional view holds that the use of R-QDA leads in general to lower classification performance than many other existing classification methods, even though it derives from the maximum likelihood principle under a general Gaussian mixture model. In this work, we establish that this loss in performance can be attributed to an induced estimation noise of high order that hides the information useful for classification, leading the R-QDA score to behave similarly for all testing observations. Based on this analysis, we propose a modified design of R-QDA that discriminates efficiently between classes. Our amendment of the R-QDA classifier is based on using one regularization parameter per class, as well as a carefully designed bias that minimizes an asymptotic approximation of the classification error rate. We confirm the efficacy of the proposed classifier through a set of numerical results, which show that our proposed classifier outperforms not only the standard R-QDA but also other state-of-the-art existing algorithms. Going further, we believe that this work shows that, contrary to common belief, there is still room for improvement of very basic classification methods through a careful study of their behavior using advanced statistical tools.
References
P. Billingsley. Probability and Measure. Wiley, 3rd edition, 1995.

Yu Cheng. Asymptotic probabilities of misclassification of two discriminant functions in cases of high dimensional data. Statistics & Probability Letters, 67:9-17, 2004. doi: 10.1016/j.spl.2003.12.001.

R. Couillet, Z. Liao, and X. Mai. Classification asymptotics in the random matrix regime. In EUSIPCO, 2018.

P. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.

Khalil Elkhalil, Abla Kammoun, Romain Couillet, Tareq Y. Al-Naffouri, and Mohamed-Slim Alouini. A large dimensional study of regularized discriminant analysis classifiers. arXiv:1711.00382, 2017. URL https://arxiv.org/abs/1711.00382.

Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.

W. Hachem, O. Khorunzhiy, P. Loubaton, J. Najim, and L. Pastur. A new approach for mutual information analysis of large dimensional multi-antenna channels. IEEE Transactions on Information Theory, 54(9):3987-4004, September 2008.

Walid Hachem, Philippe Loubaton, Jamal Najim, and Pascal Vallet. On bilinear forms based on the resolvent of large random matrices. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 49(1):36-63, 2013. doi: 10.1214/11-AIHP450. URL https://doi.org/10.1214/11-AIHP450.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.

Binyan Jiang, Xiangyu Wang, and Chenlei Leng. QUDA: A direct approach for sparse quadratic discriminant analysis. Journal of Machine Learning Research, 19, 2015.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. doi: 10.1109/5.726791.

Quefeng Li and Jun Shao. Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, 25, 2015. doi: 10.5705/ss.2013.150.

H. Richard McFarland and Donald St. P. Richards. Exact misclassification probabilities for plug-in normal quadratic discriminant functions. Journal of Multivariate Analysis, 82:299-330, 2002.

S. Raudys. On determining training sample size of a linear classifier. Computing Systems, 28:79-87, 1967. In Russian.

A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998a.

Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998b.

C. Wang and B. Jiang. On the dimension effect of regularized linear discriminant analysis. Electronic Journal of Statistics, 12:2709-2742, 2018.

A. Zollanvari and E. R. Dougherty. Generalized consistent error estimator of linear discriminant analysis. IEEE Transactions on Signal Processing, 63(11):2804-2814, June 2015.
As discussed in the paper, the design of the regularization parameters $\gamma_0$ and $\gamma_1$ should ensure that:
$$\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right] = \Theta(1) \tag{39}$$
where $\mathbf{T}_i = \left(\mathbf{I}_p + \tilde{\gamma}_i\boldsymbol{\Sigma}_i\right)^{-1}$, with $\tilde{\gamma}_i = \frac{\gamma_i}{1 + \gamma_i\delta_i}$. Using the relation $\mathbf{A}^{-1} - \mathbf{B}^{-1} = \mathbf{A}^{-1}(\mathbf{B} - \mathbf{A})\mathbf{B}^{-1}$, valid for any two invertible square matrices $\mathbf{A}$ and $\mathbf{B}$, (39) boils down to:
$$\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_1\left(\tilde{\gamma}_0\boldsymbol{\Sigma}_0 - \tilde{\gamma}_1\boldsymbol{\Sigma}_1\right)\mathbf{T}_0\right] = \Theta(1)$$
or equivalently:
$$\frac{\tilde{\gamma}_0}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_1\left(\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1\right)\mathbf{T}_0\right] + \frac{\tilde{\gamma}_0 - \tilde{\gamma}_1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_1\boldsymbol{\Sigma}_1\mathbf{T}_0\right] = \Theta(1)$$
Using Assumption 4, it can readily be seen that the first term $\frac{\tilde{\gamma}_0}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_i\mathbf{T}_1(\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1)\mathbf{T}_0] = \Theta(1)$. To satisfy (39), we thus only need to design $\gamma_0$ and $\gamma_1$ such that:
$$\tilde{\gamma}_0 - \tilde{\gamma}_1 = \Theta\left(\frac{1}{\sqrt{p}}\right)$$
or equivalently:
$$\frac{\gamma_0}{1 + \frac{\gamma_0}{n_0}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right]} - \frac{\gamma_1}{1 + \frac{\gamma_1}{n_1}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\mathbf{T}_1\right]} = \Theta\left(\frac{1}{\sqrt{p}}\right)$$
Under Assumption 4,
$$\frac{1}{n_1}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\mathbf{T}_1\right] = \frac{1}{n_1}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right] + O\left(\frac{1}{\sqrt{p}}\right)$$
which proves that, choosing $\gamma_1$ as:
$$\gamma_1 = \gamma_0 - \left(\frac{1}{n_0} - \frac{1}{n_1}\right)\gamma_0^2\,\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right]$$
the condition (39) becomes satisfied.
The choice of the regularization parameters $\gamma_0$ and $\gamma_1$ ensures that:
$$B_0 = B_1 + O\left(\frac{1}{\sqrt{p}}\right)$$
As a result, the asymptotic equivalents of the classification error rates of both classes defined in (18) can, for $i \in \{0,1\}$, be reduced to:
$$\epsilon_i^{imp} - \Phi\left(\frac{(-1)^i\left(\xi_i - b_i\right)}{\sqrt{B_0}}\right) \xrightarrow{p} 0$$
Then, the total classification error can be written as:
$$\epsilon = \pi_0\Phi\left(\frac{\beta_0 + \theta}{\alpha}\right) + \pi_1\Phi\left(\frac{\beta_1 - \theta}{\alpha}\right)$$
where
$$\beta_0 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_1\boldsymbol{\mu} - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\beta_1 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_0\boldsymbol{\mu} + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\alpha = \sqrt{B_0}$$
Taking the derivative of this expression with respect to $\theta$ and setting it to zero, the optimal bias $\theta^\star$ should satisfy:
$$\frac{\pi_0}{\pi_1}\,e^{\frac{(\beta_1 - \theta^\star)^2}{2\alpha^2} - \frac{(\beta_0 + \theta^\star)^2}{2\alpha^2}} = 1$$
Applying the logarithm on both sides, we obtain:
$$\log\frac{\pi_0}{\pi_1} + \frac{(\beta_1 - \theta^\star)^2}{2\alpha^2} - \frac{(\beta_0 + \theta^\star)^2}{2\alpha^2} = 0$$
thus leading to
$$\theta^\star = \frac{\beta_1 - \beta_0}{2} - \frac{\alpha^2}{\beta_0 + \beta_1}\log\frac{\pi_1}{\pi_0}$$
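The stationarity computation above can be double-checked symbolically; the following sketch (ours, using sympy) verifies that $\theta^\star$ of (34) zeroes the derivative of the two-term objective.

```python
# Symbolic check that theta* of (34) is a stationary point of
# pi0*Phi((beta0+theta)/alpha) + pi1*Phi((beta1-theta)/alpha).
import sympy as sp

theta = sp.Symbol('theta', real=True)
b0, b1, a, p0, p1 = sp.symbols('beta0 beta1 alpha pi0 pi1', positive=True)
Phi = lambda z: (1 + sp.erf(z / sp.sqrt(2))) / 2          # standard normal CDF
eps = p0 * Phi((b0 + theta) / a) + p1 * Phi((b1 - theta) / a)
theta_star = (b1 - b0) / 2 - a**2 / (b0 + b1) * sp.log(p1 / p0)
deriv = sp.diff(eps, theta).subs(theta, theta_star)
print(sp.simplify(deriv))                                  # expected output: 0
```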
In Theorem 4, we provide a consistent estimator of the regularization parameter $\gamma_1$ that satisfies (32) with high probability, as well as a consistent estimator of the optimal bias $\theta^\star$.

C.1 Consistent estimator of $\gamma_1$

We start by proving that $\gamma_1 - \hat{\gamma}_1 \xrightarrow{a.s.} 0$. To this end, we need a consistent estimator of $\left(\frac{1}{n_0} - \frac{1}{n_1}\right)\mathrm{Tr}[\boldsymbol{\Sigma}_0\mathbf{T}_0]$. We start by noticing that:
$$\left(\frac{1}{n_0} - \frac{1}{n_1}\right)\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right] = \left(1 - \frac{n_0}{n_1}\right)\delta_0$$
A consistent estimator of $\delta_0$ has been provided in Elkhalil et al. (2017) and is given by:
$$\hat{\delta}_0 = \frac{1}{\gamma_0}\cdot\frac{\frac{p}{n_0} - \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}{1 - \frac{p}{n_0} + \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}$$
and as such, a consistent estimator of $\gamma_1$ in (33) is given by:
$$\hat{\gamma}_1 = \gamma_0 - \gamma_0^2\left(1 - \frac{n_0}{n_1}\right)\hat{\delta}_0$$
Note that the replacement of $\gamma_1$ by $\hat{\gamma}_1$ still ensures condition (39), since from standard results of random matrix theory $\hat{\delta}_0 - \delta_0 = O\left(\frac{1}{\sqrt{p}}\right)$ with high probability.

C.2 Consistent estimator of $\theta^\star$

Recall that
$$\theta^\star = \frac{\beta_1 - \beta_0}{2} - \frac{\alpha^2}{\beta_0 + \beta_1}\log\frac{\pi_1}{\pi_0}$$
To provide a consistent estimator of $\theta^\star$, it is thus required to provide those of $\beta_0$, $\beta_1$ and $\alpha$. Since $\alpha = \sqrt{B_0}$ and $\hat{B} - B_0 \xrightarrow{a.s.} 0$, we have $\hat{\alpha} - \alpha \xrightarrow{a.s.} 0$. As for $\beta_i$, $i = 0,1$, it can be written as:
$$\beta_i = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_i\right] - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\right] = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\right] + \frac{n_i}{\sqrt{p}}\delta_i$$
Due to the independence of $\hat{\boldsymbol{\Sigma}}_i$ and $\mathbf{H}_{1-i}$, and of $\hat{\boldsymbol{\mu}}_0$ and $\hat{\boldsymbol{\mu}}_1$ from $\mathbf{H}_{1-i}$, $i = 0,1$, we have:
$$\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_i\mathbf{H}_{1-i}\right] - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\right] \xrightarrow{a.s.} 0$$
$$\frac{1}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_{1-i}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1) - \frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} \xrightarrow{a.s.} 0$$
Combining these convergences with the consistent estimator $n_i\hat{\delta}_i$ of $\mathrm{Tr}[\boldsymbol{\Sigma}_i\mathbf{T}_i]$ yields the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ in (36), and consequently the consistency of $\hat{\theta}^\star$.