Improved Design of Quadratic Discriminant Analysis Classifier in Imbalanced Settings
Amine Bejaoui
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Khalil Elkhalil
Division of Electrical Engineering, Duke University, Durham, NC 27708, USA

Abla Kammoun
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Mohamed-Slim Alouini
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

Tareq Al-Naffouri
Department of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
Abstract
The use of quadratic discriminant analysis (QDA) or its regularized version (R-QDA) for classification is often not recommended, due to its well-acknowledged high sensitivity to the estimation noise of the covariance matrix. This becomes all the more the case in imbalanced settings, in which the training data for each class are disproportionate and for which it has been found that R-QDA becomes equivalent to the classifier that assigns all observations to the same class. In this paper, we propose an improved R-QDA that is based on the use of two regularization parameters and a modified bias, properly chosen to avoid inappropriate behaviors of R-QDA in imbalanced settings and to ensure the best possible classification performance. The design of the proposed classifier builds on a random matrix theory based analysis of its performance when the number of samples and that of features grow large simultaneously. The performance of the proposed classifier is assessed on both real and synthetic data sets and is shown to be much better than what one would expect from a traditional R-QDA.
Keywords: Statistics, Machine Learning, QDA, Random Matrix Theory

1. Introduction

Discriminant analysis encompasses a wide variety of techniques used for classification purposes. These techniques, commonly recognized among the class of model-based methods in the field of machine learning (Devijver and Kittler, 1982), rely on the assumption of a parametric model in which the outcome is described by a set of explanatory variables that follow a certain distribution. Among them, we particularly distinguish linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) as the most representative. LDA is often connected or confused with Fisher discriminant analysis (FDA) (Fisher, 1936), a method that projects the data onto a subspace and that turns out to coincide with LDA when the target subspace has two dimensions. Both LDA and QDA are obtained by maximizing the posterior probability under the assumption that the observations follow a normal distribution, with the single difference that LDA assumes common covariances across classes while QDA assumes the most general situation, with classes possessing different means and covariances. If the data perfectly follow the normal distributions and the statistics are perfectly known, QDA turns out to be the optimal classifier that achieves the lowest possible classification error rate (Hastie et al., 2009). It coincides with LDA when the covariances are equal but outperforms it when they are different. However, in practical scenarios, the use of LDA, and to a larger extent QDA, has not always been shown to yield the expected performance. This is because the mean and covariance of each class, which are in general unknown, are estimated based on available training data with perfectly known classes. The obtained estimates are then used as plug-in estimators in the classification rules associated with LDA and QDA. The estimation error of the class statistics causes a provable degradation of the performance, which reaches very high levels when the number of samples is comparable to, or less than, the number of features. In this latter situation, QDA and LDA, which rely on computing the inverse of the covariance matrix, cannot even be used. To overcome this issue, one technique consists in using a regularized estimate of the covariance matrix as a plug-in estimator, giving the names regularized LDA (R-LDA) and regularized QDA (R-QDA) to the associated classifiers. However, this solution does not allow for a significant reduction of the estimation noise. The situation is even worse for R-QDA, since the number of samples used to estimate the covariance matrix of each class is lower than that of LDA. Moreover, in imbalanced settings, the estimation quality of the covariance matrix associated with each class is not the same, one class possessing more samples than the other. These are probably the reasons why LDA provides in many scenarios better performance than QDA, although it might wrongly consider the covariances across classes to be equal.

A question of major theoretical and practical interest is to investigate to which extent the estimation noise of the covariance matrix impacts the performance of R-LDA and R-QDA.
In this respect, the study of LDA, and subsequently that of R-LDA, has received particular attention, dating back to the early works of Raudys (Raudys, 1967), before being investigated again using recent advances in random matrix theory in a series of works (Zollanvari and Dougherty, 2015; Wang and Jiang, 2018). However, the theoretical analysis of QDA and R-QDA is scarcer and very often limited to specific situations in which the number of samples is higher than the dimension of the statistics (McFarland and Richards, 2002), or to specific structures of the covariance matrices (Cheng, 2004; Li and Shao, 2015; Jiang et al., 2015). It was only recently that our work in (Elkhalil et al., 2017) considered the analysis of R-QDA for general structures of the covariance matrices and identified the necessary asymptotic conditions under which QDA does not behave trivially by always returning the same class. Among these conditions is the assumption that the training data are balanced across classes. Indeed, as will be discussed later in this work, in imbalanced settings the difference in estimation quality of the covariance matrices across classes makes the classification rule of R-QDA keep asymptotically the same sign, irrespective of the class of the testing observation. As a result, the use of the traditional discrimination rule of R-QDA is equivalent to assigning all observations to the same class. This lies behind the main motivation of the present work. Based on a careful investigation of the asymptotic behavior of R-QDA under imbalanced settings in binary classification problems, we propose a modified classification rule for R-QDA that copes with cases in which the proportions of training data from both classes are not equal. The new classification rule is based on using two different regularization parameters instead of a common regularization parameter, as well as an optimized bias properly chosen to minimize the misclassification error rate. Interestingly, we show that the proposed classifier not only outperforms R-LDA and R-QDA but also other state-of-the-art classification methods, opening promising avenues for the use of the proposed classifier in practical scenarios.

The rest of the paper is organized as follows. In Section 2, we provide an overview of the quadratic discriminant classifier and identify the issues related to the use of this classifier in imbalanced settings. In Section 3, we propose an improved version of the R-QDA classifier that overcomes all these problems, and we design a consistent estimator of the misclassification error rate that can be used to properly choose the parameters of the proposed R-QDA and that constitutes a valuable alternative to the traditional cross-validation approach. Finally, Section 4 presents the results of a set of numerical simulations on both synthetic and real data that confirm our theoretical findings.

Notations
Scalars, vectors and matrices are respectively denoted by non-boldface, boldface lowercase and boldface uppercase characters. $\mathbf{0}_{p\times n}$ and $\mathbf{1}_{p\times n}$ are respectively the matrices of zeros and ones of size $p \times n$, and $\mathbf{I}_p$ denotes the $p \times p$ identity matrix. The notation $\|\cdot\|$ stands for the Euclidean norm for vectors and the spectral norm for matrices. $(\cdot)^T$, $\mathrm{Tr}[\cdot]$ and $|\cdot|$ stand for the transpose, the trace and the determinant of a matrix, respectively. For two functions $f$ and $g$, we say that $f = O(g)$ if there exists $0 < M < \infty$ such that $|f| \leq Mg$. Moreover, for a random variable $X$, $X = O_p(1)$ refers to a variable that is bounded in probability. We also say that $f = \Theta(g)$ if there exist $0 < C_1 < C_2 < \infty$ such that $C_1 g \leq |f| \leq C_2 g$. Moreover, we denote by $\xrightarrow{p}$ and $\xrightarrow{a.s.}$ the convergence in probability and the almost sure convergence of random variables. Finally, $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution, i.e. $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}e^{-t^2/2}\,dt$.
2. Regularized quadratic discriminant analysis
The asymptotic analysis carried out in (Elkhalil et al., 2017) has made it clear that if R-QDA is designed based on imbalanced training samples, it asymptotically assigns all testing observations to the same class. Such a behavior has led the authors in (Elkhalil et al., 2017) to consider the analysis of R-QDA only under a balanced training sample. Interestingly, understanding this behavior can be achieved through simple arguments based on a close examination of the mean and variance of the classification rule associated with R-QDA. These arguments do not necessitate random matrix theory results; we thus find it important to present them at the outset, in order to pave the way towards our improved classifier. But prior to that, let us first review the traditional R-QDA for binary classification.

For ease of presentation, we focus on binary classification problems where we have two distinct classes. We assume that the data follow a Gaussian mixture model, such that observations in class $\mathcal{C}_i$, $i \in \{0,1\}$, are drawn from a multivariate Gaussian distribution with mean $\boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_i$. More formally, we assume that
$$\mathbf{x} \in \mathcal{C}_i \Leftrightarrow \mathbf{x} = \boldsymbol{\mu}_i + \boldsymbol{\Sigma}_i^{1/2}\mathbf{z}, \quad \text{with } \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p) \tag{1}$$
Let $\pi_i$, $i = 0,1$, denote the prior probability that $\mathbf{x}$ belongs to class $\mathcal{C}_i$. The classification rule associated with the QDA classifier is given by
$$W^{QDA}(\mathbf{x}) = -\frac{1}{2}\log\frac{|\boldsymbol{\Sigma}_0|}{|\boldsymbol{\Sigma}_1|} - \frac{1}{2}\mathbf{x}^T\left(\boldsymbol{\Sigma}_0^{-1} - \boldsymbol{\Sigma}_1^{-1}\right)\mathbf{x} + \mathbf{x}^T\left(\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 - \boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1\right) - \frac{1}{2}\boldsymbol{\mu}_0^T\boldsymbol{\Sigma}_0^{-1}\boldsymbol{\mu}_0 + \frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 - \log\frac{\pi_1}{\pi_0} \tag{2}$$
which is used to classify observations based on the following rule:
$$\begin{cases} \mathbf{x} \in \mathcal{C}_0, & \text{if } W^{QDA}(\mathbf{x}) > 0 \\ \mathbf{x} \in \mathcal{C}_1, & \text{otherwise.} \end{cases} \tag{3}$$
As seen from (2), the classification rule of QDA involves the true parameters of the Gaussian distribution, namely the means and covariances associated with each class. In practice, these parameters are not known. One approach to solve this issue is to estimate them using the available training data. The obtained estimates are then used as plug-in estimators in (2). In particular, consider the case in which $n_i$, $i \in \{0,1\}$, training observations are available for each class $\mathcal{C}_i$, and denote by $\mathcal{T}_0 = \{\mathbf{x}_l \in \mathcal{C}_0\}_{l=1}^{n_0}$ and $\mathcal{T}_1 = \{\mathbf{x}_l \in \mathcal{C}_1\}_{l=n_0+1}^{n_0+n_1=n}$ their respective samples. The sample estimates of the mean and covariance of each class are then given by:
$$\hat{\boldsymbol{\mu}}_i = \frac{1}{n_i}\sum_{l \in \mathcal{T}_i}\mathbf{x}_l, \quad i \in \{0,1\}$$
$$\hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i - 1}\sum_{l \in \mathcal{T}_i}\left(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i\right)\left(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i\right)^T, \quad i \in \{0,1\}$$
In case the number of samples $n_0$ or $n_1$ is less than the number of features, the use of the sample covariance matrix as a plug-in estimator is not permitted, since its inverse is not defined. A popular approach to circumvent this issue is to consider a regularized estimator of the inverse of the covariance matrix, given by
$$\mathbf{H}_i(\gamma) = \left(\mathbf{I}_p + \gamma\hat{\boldsymbol{\Sigma}}_i\right)^{-1}, \quad i \in \{0,1\} \tag{4}$$
where $\gamma$ is a regularization parameter, which serves to shrink the sample covariance matrix towards the identity. Replacing $\boldsymbol{\Sigma}_i^{-1}$ by $\mathbf{H}_i(\gamma)$ yields the following classification rule for the traditional R-QDA:
$$\widehat{W}^{R\text{-}QDA}(\mathbf{x}) = \frac{1}{2}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2}\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right)^T\mathbf{H}_0(\gamma)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right) + \frac{1}{2}\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right)^T\mathbf{H}_1(\gamma)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right) - \log\frac{\pi_1}{\pi_0} \tag{5}$$
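For illustration, the following minimal NumPy sketch builds the plug-in quantities of (4) and evaluates the score (5); the function names (fit_rqda, rqda_score) are ours and the snippet is an illustrative sketch, not the authors' reference implementation.

```python
# Minimal sketch of the plug-in R-QDA rule (5); names are illustrative only.
import numpy as np

def fit_rqda(X0, X1, gamma):
    """Return (mu_hat_i, H_i(gamma)) for each class; X_i is an (n_i, p) matrix."""
    params = []
    for X in (X0, X1):
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)      # divides by n_i - 1, as in the text
        H = np.linalg.inv(np.eye(X.shape[1]) + gamma * Sigma)
        params.append((mu, H))
    return params

def rqda_score(x, params, pi0=0.5, pi1=0.5):
    """Score (5): x is assigned to class C_0 when the returned value is positive."""
    (mu0, H0), (mu1, H1) = params
    log_det = 0.5 * (np.linalg.slogdet(H0)[1] - np.linalg.slogdet(H1)[1])
    q0 = 0.5 * (x - mu0) @ H0 @ (x - mu0)
    q1 = 0.5 * (x - mu1) @ H1 @ (x - mu1)
    return log_det - q0 + q1 - np.log(pi1 / pi0)
```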
The classifier R-QDA wrongly classifies an observation $\mathbf{x}$ if $\widehat{W}^{R\text{-}QDA}(\mathbf{x}) < 0$ while $\mathbf{x} \in \mathcal{C}_0$, or if $\widehat{W}^{R\text{-}QDA}(\mathbf{x}) > 0$ while $\mathbf{x} \in \mathcal{C}_1$. Conditioning on the training samples $\mathcal{T}_i$, $i \in \{0,1\}$, the classification error associated with class $\mathcal{C}_i$ is thus given by
$$\epsilon_i^{R\text{-}QDA} = \mathbb{P}\left[(-1)^i\,\widehat{W}^{R\text{-}QDA}(\mathbf{x}) < 0 \,\middle|\, \mathbf{x} \in \mathcal{C}_i, \mathcal{T}_0, \mathcal{T}_1\right] \tag{6}$$
which gives the following expression for the total misclassification error probability:
$$\epsilon^{R\text{-}QDA} = \pi_0\,\epsilon_0^{R\text{-}QDA} + \pi_1\,\epsilon_1^{R\text{-}QDA}. \tag{7}$$

In this section, we unveil several issues related to the use of the classification rule (5) of R-QDA in high dimensional settings. First, we shall recall that for a classification rule to be able to discriminate observations, it is essential that it presents a non-negligible difference in distributional behavior when the testing observation changes from one class to the other. Clearly, if it behaves similarly for all testing observations, it is not possible for it to distinguish between observations from different classes. This change in behavior should be reflected by a notable difference in the expected values of the classification rule when the testing observations belong to class $\mathcal{C}_0$ or $\mathcal{C}_1$, which by reference to (3) need to be preferably of opposite signs. Consider the normalized classification rule of the traditional R-QDA, $\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}(\mathbf{x})$, and denote by $S_i$ and $V_i$ its expected value and its variance taken over the distribution of the testing observation $\mathbf{x}$ when it belongs to class $\mathcal{C}_i$. Letting $\boldsymbol{\mu} = \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1$, $S_i$ and $V_i$ are thus given by:
$$S_i = \frac{1}{2\sqrt{p}}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) + \frac{1}{2\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1) - \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma)\right] + \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma)\right] - \frac{1}{\sqrt{p}}\log\frac{\pi_1}{\pi_0} \tag{8}$$
$$V_i = \frac{1}{2p}\mathrm{Tr}\left[(\mathbf{H}_0(\gamma) - \mathbf{H}_1(\gamma))\boldsymbol{\Sigma}_i(\mathbf{H}_0(\gamma) - \mathbf{H}_1(\gamma))\boldsymbol{\Sigma}_i\right] + \frac{1}{p}\left((\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma) - (\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma)\right)\boldsymbol{\Sigma}_i\left(\mathbf{H}_0(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) - \mathbf{H}_1(\gamma)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)\right) \tag{9}$$
At this point, we shall recall that $S_i$ and $V_i$ are still random, since they depend on the training data, which are assumed to be drawn independently from the distribution associated with each class. At first sight, the expressions of $S_i$ and $V_i$ are complicated, and it does not seem that much information can be drawn from them. To gain insight into their behavior in high dimensional settings, we consider the regime in which $n_0$, $n_1$ and $p$ are large and commensurable, with $n_0$ and $n_1$ not asymptotically equal ($\frac{n_0}{n_1} \to \ell \neq 1$), and assume additionally that the spectral norms of $\boldsymbol{\Sigma}_i$, $i \in \{0,1\}$, do not grow with $p$, while $\|\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\|^2$ scales at most like $O(\sqrt{p})$. Under these assumptions, it is easy to see that $S_i$ and $V_i$ satisfy:
$$S_i = \underbrace{\frac{1}{2\sqrt{p}}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma)\right] + \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma)\right]}_{O_p(\sqrt{p})} + O_p(1) \tag{11}$$
$$V_i = O_p(1), \tag{12}$$
where we recall that $X = O_p(p^{\alpha})$ means that $p^{-\alpha}X$ is bounded in probability (see van der Vaart (1998b) for the formal definition of the notation $O_p(\cdot)$). Several important remarks are in order regarding (11). First, we note that the prior probabilities $\pi_0$ and $\pi_1$ do not asymptotically play any role in the classification, since the term $\frac{1}{\sqrt{p}}\log\frac{\pi_1}{\pi_0}$ tends to zero. Hence, the information regarding the prior probabilities is asymptotically lost in the quantities $S_i$, $i = 0,1$. Second, one can easily see that if the distance between the covariances is such that $\frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_0\mathbf{H}_0] - \frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_1\mathbf{H}_0] = O_p(1)$ and $\frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_0\mathbf{H}_1] - \frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_1\mathbf{H}_1] = O_p(1)$, which occurs for instance when $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ has rank at most $O(\sqrt{p})$ (Elkhalil et al., 2017), the quantities $S_i$, $i = 0,1$, satisfy:
$$S_i = \frac{1}{2\sqrt{p}}\log\frac{|\mathbf{H}_0(\gamma)|}{|\mathbf{H}_1(\gamma)|} - \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{H}_0(\gamma)\right] + \frac{1}{2\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{H}_1(\gamma)\right] + O_p(1), \quad i \in \{0,1\}. \tag{13}$$
From (13), it appears that the highest order of $S_i$ is $O_p(\sqrt{p})$ but is non-informative, being the same for all testing observations regardless of the class to which they belong. Moreover, as the variance is $O_p(1)$, from Chebyshev's inequality (applied conditionally on the training samples) one can deduce that R-QDA keeps the same sign for the majority of the testing observations, irrespective of their corresponding classes.

To visually illustrate this result, we display in Figure 1a and Figure 1b the histograms of the QDA statistic in (2) and of the R-QDA statistic in (5) when applied to testing observations from both classes. As can be seen, using the true statistics, QDA presents a clear change in distribution that should visually allow distinction between both classes. However, when using R-QDA, there is an important overlap between the histograms associated with both classes, with all realizations presenting the same sign. By reference to the decision rule in (3), this leads R-QDA to assign all observations to the same class.

Figure 1: Histograms of the classification rule for (a) R-QDA based on the regularized covariance estimate with $\gamma = 10$, and (b) the QDA classifier based on the true statistics. We consider $p = 1000$ features with unbalanced training sizes $n_0 = 500$, $n_1 = 1000$, $\boldsymbol{\Sigma}_0 = 10\,\mathbf{I}_p$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_0$, $\boldsymbol{\mu}_0 = \mathbf{1}_{p\times 1}$ and $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \frac{1}{\sqrt{p}}\mathbf{1}_{p\times 1}$. The testing set is of size 5000 and 10000 samples for the first and second class, respectively.

The reason why the same behavior is not encountered when the same number of training samples is used for both classes lies in the fact that, under this setting, the sample covariance matrices of both classes are computed based on the same number of training samples. The scores associated with the two classes are thus comparable, and as such their difference, which forms the R-QDA statistic, cancels out the non-informative estimation-induced noise and asymptotically keeps the information relevant for classification. On the contrary, when both classes do not have the same number of training samples, the scores associated with each class contain estimation-induced noises that are not of the same level. The R-QDA statistic resulting from the difference between these scores will thus essentially be, at its highest order, a non-informative quantity caused by this difference in estimation quality of the covariance matrices. As shall be shown next, the use of RMT tools theoretically confirms this intuition and, most importantly, allows us to propose an RMT-improved QDA classifier, outfitted with two regularization parameters as well as a modified bias, carefully chosen so as to minimize the misclassification error rate. More formally, the classification rule associated with the proposed classifier is given by:
$$\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x}) = -\theta\sqrt{p} - \left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right)^T\mathbf{H}_0(\gamma_0)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_0\right) + \left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right)^T\mathbf{H}_1(\gamma_1)\left(\mathbf{x} - \hat{\boldsymbol{\mu}}_1\right) \tag{14}$$
where 1) $\gamma_0$ and $\gamma_1$ are two regularization parameters weighting the sample covariance matrix of each class, carefully devised so that the expected value $\mathbb{E}_{\mathbf{x}}\left[\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x})\right]$, when $\mathbf{x} \in \mathcal{C}_0$ or $\mathcal{C}_1$, is $O_p(1)$ and reflects the class under consideration, and 2) $\theta$ is a bias term that will be set to the value that minimizes the asymptotic classification error rate.
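For concreteness, a short sketch of the improved score (14) follows, assuming $\mathbf{H}_0$ and $\mathbf{H}_1$ have already been formed with their class-specific regularizers $\gamma_0$, $\gamma_1$ and that the bias $\theta$ is given (its choice is the subject of Section 3); as before, the code is ours and purely illustrative.

```python
# Sketch of the improved rule (14): the two regularizers enter through H0, H1,
# and the log-prior term of (5) is replaced by the bias theta.
import numpy as np

def improved_rqda_score(x, mu0, mu1, H0, H1, theta):
    """Positive score -> assign x to C_0; H_i = (I + gamma_i * Sigma_hat_i)^{-1}."""
    p = x.shape[0]
    q0 = (x - mu0) @ H0 @ (x - mu0)
    q1 = (x - mu1) @ H1 @ (x - mu1)
    return -theta * np.sqrt(p) - q0 + q1
```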
3. Design of the improved R-QDA classifier
In this section, we propose an improved design of the R-QDA classifier that fixes the aforementioned issues met in imbalanced settings. The design is based on an asymptotic analysis of the statistic in (14) under the following asymptotic regime.
Assumption 1 (Data scaling). $\frac{p}{n} \to c \in (0, \infty)$ and $\frac{n_0}{n_1} \to \ell$.

Assumption 2 (Mean scaling). $\|\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1\|^2 = O(\sqrt{p})$.

Assumption 3 (Covariance scaling). $\|\boldsymbol{\Sigma}_i\| = \Theta(1)$, $i = 0,1$.

Assumption 4. The matrix $\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1$ has exactly $\Theta(\sqrt{p})$ eigenvalues of order $\Theta(1)$. The remaining eigenvalues are $O\left(\frac{1}{\sqrt{p}}\right)$.

Assumptions 1 and 3 are standard and are often used to describe a growth regime in which the number of features scales comparably with that of the samples and the spectral norms of both covariance matrices remain bounded. Note, however, that Assumption 1 is more general than the one considered in the work of (Elkhalil et al., 2017), as it accounts for imbalanced settings in which $\frac{n_0}{n_1} \to \ell \neq 1$. Assumption 2 provides the minimal scaling of the distance between the mean vectors so that they can be used to discriminate between both classes (Couillet et al., 2018). Finally, Assumption 4, introduced in (Elkhalil et al., 2017), specifies a difference between the covariances that suffices on its own (regardless of the condition on the mean vectors) to inform on the class of the testing observation.

Under the asymptotic regime specified by Assumptions 1-4, and along the same lines as in (Elkhalil et al., 2017), we analyze the classification error rate of the proposed classifier based on the classification rule (14).
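To experiment with this regime, one can draw synthetic classes satisfying Assumptions 1-4; the following hedged sketch mirrors the construction later used for Figure 2 ($\boldsymbol{\Sigma}_1$ a rank-$\sqrt{p}$ perturbation of $\boldsymbol{\Sigma}_0$), with constants that are ours and purely illustrative.

```python
# Hedged generator of a synthetic pair of classes satisfying Assumptions 1-4;
# the constants (4, 3, the sqrt(p) rank) mirror the Figure 2 setup and are
# otherwise illustrative.
import numpy as np

def make_classes(p=400, n0=200, n1=400, seed=0):
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(p)
    mu1 = mu0 + np.ones(p) / np.sqrt(p)        # ||mu0 - mu1||^2 = 1 = O(sqrt(p))
    Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # Haar-like orthogonal matrix
    d = np.zeros(p)
    d[: int(np.sqrt(p))] = 1.0                 # Theta(sqrt(p)) eigenvalues of order 1
    Sigma0 = 4.0 * np.eye(p)
    Sigma1 = Sigma0 + 3.0 * (Q * d) @ Q.T      # rank-sqrt(p) perturbation (Assumption 4)
    X0 = rng.multivariate_normal(mu0, Sigma0, size=n0)
    X1 = rng.multivariate_normal(mu1, Sigma1, size=n1)
    return X0, X1
```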
Before presenting the corresponding result, we first introduce the following notation, which defines deterministic objects that naturally appear when using random matrix theory results. For $i = 0,1$, let $\delta_i$ be the unique positive solution to the following equation:
$$\delta_i = \frac{1}{n_i}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\left(\mathbf{I}_p + \frac{\gamma_i}{1 + \gamma_i\delta_i}\boldsymbol{\Sigma}_i\right)^{-1}\right] \tag{15}$$
The existence and uniqueness of $\delta_i$ follow from standard results in random matrix theory (Hachem et al., 2008). For $i = 0,1$, we also define the matrices $\mathbf{T}_i$ as:
$$\mathbf{T}_i = \left(\mathbf{I}_p + \frac{\gamma_i}{1 + \gamma_i\delta_i}\boldsymbol{\Sigma}_i\right)^{-1} \tag{16}$$
and the scalars $\phi_i$ and $\tilde{\phi}_i$ as:
$$\phi_i = \frac{1}{n_i}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i^2\mathbf{T}_i^2\right], \qquad \tilde{\phi}_i = \frac{1}{(1 + \gamma_i\delta_i)^2} \tag{17}$$
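Numerically, the fixed point (15) can be solved by direct iteration; the following sketch (ours, usable only when $\boldsymbol{\Sigma}_i$ is known, e.g. in synthetic experiments) illustrates this, with an arbitrary tolerance and initialization.

```python
# Fixed-point iteration for delta_i in (15); tolerance and starting point are ours.
import numpy as np

def solve_delta(Sigma, n, gamma, tol=1e-10, max_iter=10_000):
    p = Sigma.shape[0]
    delta = p / n                           # any positive starting point works
    for _ in range(max_iter):
        T = np.linalg.inv(np.eye(p) + gamma / (1.0 + gamma * delta) * Sigma)
        new_delta = np.trace(Sigma @ T) / n
        if abs(new_delta - delta) < tol:
            break
        delta = new_delta
    return delta
```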
With these notations at hand, we are now in a position to state the first asymptotic result.

Theorem 1 Under Assumptions 1-4, and assuming that the regularization parameters $\gamma_0$ and $\gamma_1$ are $\Theta(1)$, for $i \in \{0,1\}$ the classification error rate associated with class $\mathcal{C}_i$, defined as $\epsilon_i^{imp} = \mathbb{P}\left[(-1)^i\,\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x}) < 0 \,\middle|\, \mathbf{x} \in \mathcal{C}_i, \mathcal{T}_0, \mathcal{T}_1\right]$, satisfies:
$$\epsilon_i^{imp} - \Phi\left(\frac{(-1)^i\left(\xi_i - b_i\right)}{\sqrt{B_i + 4r_i}}\right) \xrightarrow{p} 0 \tag{18}$$
where
$$\xi_i \triangleq \frac{(-1)^{i+1}}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} + \theta, \quad \text{with } \boldsymbol{\mu} = \boldsymbol{\mu}_0 - \boldsymbol{\mu}_1 \tag{19}$$
$$b_i = \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right] \tag{20}$$
$$B_i = \frac{n_i}{p}\,\frac{\phi_i}{1 - \gamma_i^2\phi_i\tilde{\phi}_i} + \frac{1}{p}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i^2\mathbf{T}_{1-i}^2\right] + \frac{n_{1-i}}{p}\,\frac{\gamma_{1-i}^2\tilde{\phi}_{1-i}}{1 - \gamma_{1-i}^2\phi_{1-i}\tilde{\phi}_{1-i}}\left(\frac{1}{n_{1-i}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\boldsymbol{\Sigma}_{1-i}\mathbf{T}_{1-i}^2\right]\right)^2 - \frac{2}{p}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_0\boldsymbol{\Sigma}_i\mathbf{T}_1\right] \tag{21}$$
$$r_i = \frac{1}{p}\,\frac{\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\boldsymbol{\mu}}{1 - \gamma_{1-i}^2\phi_{1-i}\tilde{\phi}_{1-i}} \tag{22}$$
Proof We only provide a sketch of the proof, since it follows along the same lines as in (Elkhalil et al., 2017). To begin with, note that $\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x})$ is a quadratic form in the testing observation $\mathbf{x}$, which, when it belongs to class $\mathcal{C}_i$, $i \in \{0,1\}$, is assumed to follow a Gaussian distribution with mean $\boldsymbol{\mu}_i$ and covariance $\boldsymbol{\Sigma}_i$. By Lyapunov's central limit theorem (Billingsley, 1995), when $\mathbf{x}$ is in $\mathcal{C}_i$, $\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x})$ satisfies:
$$\frac{1}{\sqrt{\tilde{V}_i}}\left(\frac{1}{\sqrt{p}}\widehat{W}^{R\text{-}QDA}_{imp}(\mathbf{x}) - \tilde{S}_i\right) \xrightarrow{d} \mathcal{N}(0, 1) \tag{23}$$
where $\tilde{S}_i$ and $\tilde{V}_i$ are given by:
$$\tilde{S}_i = -\theta - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma_0)\right] + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma_1)\right] - \frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma_0)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) + \frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma_1)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1) \tag{24}$$
$$\tilde{V}_i = \frac{2}{p}\mathrm{Tr}\left[(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i\right] + \frac{4}{p}\left((\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma_0) - (\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma_1)\right)\boldsymbol{\Sigma}_i\left(\mathbf{H}_0(\gamma_0)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) - \mathbf{H}_1(\gamma_1)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)\right) \tag{25}$$
From (23), and using Lemma 2.11 in van der Vaart (1998a), we can easily see that for $i \in \{0,1\}$:
$$\epsilon_i^{imp} - \Phi\left((-1)^i\frac{-\tilde{S}_i}{\sqrt{\tilde{V}_i}}\right) \to 0. \tag{26}$$
It then follows, using standard results on bilinear forms of large random matrices (Hachem et al., 2013), that:
$$\left(\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_1(\gamma_1)\right] - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{H}_0(\gamma_0)\right]\right) - b_i \xrightarrow{a.s.} 0 \tag{27}$$
Moreover,
$$\frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_i)^T\mathbf{H}_i(\gamma_i)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_i) \xrightarrow{a.s.} 0 \tag{28}$$
$$\frac{1}{\sqrt{p}}(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_{1-i})^T\mathbf{H}_{1-i}(\gamma_{1-i})(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_{1-i}) - \frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} \xrightarrow{a.s.} 0 \tag{29}$$
so that
$$-\tilde{S}_i - \left(\xi_i - b_i\right) \xrightarrow{a.s.} 0. \tag{30}$$
The variance can be treated similarly, and based on the results in (Elkhalil et al., 2017) one can prove that:
$$\frac{2}{p}\mathrm{Tr}\left[(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i(\mathbf{H}_0(\gamma_0) - \mathbf{H}_1(\gamma_1))\boldsymbol{\Sigma}_i\right] - B_i \xrightarrow{a.s.} 0$$
$$\frac{4}{p}\left((\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0)^T\mathbf{H}_0(\gamma_0) - (\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\gamma_1)\right)\boldsymbol{\Sigma}_i\left(\mathbf{H}_0(\gamma_0)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_0) - \mathbf{H}_1(\gamma_1)(\boldsymbol{\mu}_i - \hat{\boldsymbol{\mu}}_1)\right) - 4r_i \xrightarrow{a.s.} 0.$$
Hence,
$$\tilde{V}_i - \left(B_i + 4r_i\right) \xrightarrow{a.s.} 0. \tag{31}$$
Replacing $-\tilde{S}_i$ and $\tilde{V}_i$ in (26) by their deterministic equivalents in (30) and (31), we obtain the desired convergence (18).
Remark: Under Assumption 4, it can be shown that $B_i$ can asymptotically be simplified to
$$B_i \triangleq \frac{n_i}{p}\,\frac{\gamma_i^2\phi_i\tilde{\phi}_i}{1 - \gamma_i^2\phi_i\tilde{\phi}_i} + \Theta\left(\frac{1}{\sqrt{p}}\right)$$
and that $B_0 = B_1 + \Theta\left(\frac{1}{\sqrt{p}}\right)$. Moreover, the term $r_i$ is $O\left(\frac{1}{\sqrt{p}}\right)$ and as such converges to zero as $p$ and $n$ grow to infinity. However, in our simulations, we chose to work with the non-simplified expressions of $B_i$ and to keep the term $r_i$, since we observed that in doing so a better accuracy is obtained in finite-dimensional simulations.

The result of Theorem 1 provides guidelines on how to choose $\gamma_0$ and $\gamma_1$ and the optimal bias $\theta$. As discussed before, the design should require the mean of the classification rule to be $\Theta(1)$ and to reflect the class under consideration. This mean is represented in the asymptotic expression of the classification error rate by the quantity $\xi_i - b_i$, which is $\Theta(\sqrt{p})$ for arbitrary $\gamma_0$ and $\gamma_1$, as $b_i = \Theta(\sqrt{p})$ and $\xi_i = \Theta(1)$. Moreover, the class of the testing observation is not reflected in $b_i$, since under Assumptions 3-4, in case $b_i = \Theta(\sqrt{p})$, $b_i = \frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_0(\mathbf{T}_1 - \mathbf{T}_0)] + O(1)$, and as such, up to a quantity of order $O(1)$, $b_0$ and $b_1$ are equal. To solve this issue, we need to design $\gamma_0$ and $\gamma_1$ such that for $i \in \{0,1\}$, $b_i$ is $\Theta(1)$, or equivalently,
$$\frac{1}{p}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right] = \Theta\left(\frac{1}{\sqrt{p}}\right) \tag{32}$$
so that $b_0$ becomes different from $b_1$ at its highest order. To this end, we prove that it suffices to select the regularization parameter associated with the class with the largest number of samples as follows.
Theorem 2 Under Assumptions 1-4, and assuming that $n_1 > n_0$, if
$$\gamma_1 = \gamma_0 - \left(\frac{1}{n_0} - \frac{1}{n_1}\right)\gamma_0^2\,\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right], \tag{33}$$
where $\gamma_0$ is fixed to a given constant, then $b_i = \Theta(1)$.
Proof See Appendix A.

It is worth mentioning that in the balanced case, plugging $n_0 = n_1$ into (33) yields $\gamma_1 = \gamma_0$. This shows that when the same number of training samples is used across classes, it is not necessary to regularize the two sample covariance matrices with different regularization parameters.

Now that this choice of the regularization parameters has been set, it remains to select the optimal bias $\theta$. It can be chosen as the value that minimizes the asymptotic classification error rate, given by:
$$\epsilon = \pi_0\Phi\left(\frac{\xi_0 - b_0}{\sqrt{B_0}}\right) + \pi_1\Phi\left(-\frac{\xi_1 - b_1}{\sqrt{B_0}}\right)$$
where we have used the fact that $B_0 = B_1$ up to vanishing terms.
Theorem 3 The optimal bias minimizing the asymptotic classification error rate is given by:
$$\theta^\star = \frac{\beta_1 - \beta_0}{2} - \frac{\alpha^2}{\beta_0 + \beta_1}\log\frac{\pi_1}{\pi_0} \tag{34}$$
where
$$\beta_0 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_1\boldsymbol{\mu} - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\beta_1 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_0\boldsymbol{\mu} + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\alpha = \sqrt{B_0}$$
Proof See Appendix B.

Before proceeding further, it is important to note that, thanks to the careful choice of the regularization parameters $\gamma_0$ and $\gamma_1$ provided in Theorem 2, the term $\frac{1}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_i(\mathbf{T}_1 - \mathbf{T}_0)]$ is $\Theta(1)$ for $i \in \{0,1\}$. Additionally, it can easily be shown that the term $\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_i\boldsymbol{\mu}$ is of order $\Theta(1)$. As a result, both $\beta_0$ and $\beta_1$ are $\Theta(1)$.

On another note, it is worth mentioning that even in the case of balanced classes $n_0 = n_1$, characterized by $\gamma_0 = \gamma_1$ as proved in Theorem 2, the optimal bias is different from the one traditionally used in R-QDA. As such, the proposed design improves on the traditional R-QDA studied in (Elkhalil et al., 2017) even in the balanced case, as it optimally adapts the bias term to the situation in which the covariance matrices are not known.

Theorems 2 and 3 can be used to obtain an optimized design of the proposed R-QDA classifier. As can be seen, the improved classifier employs only one free regularization parameter, associated with the class that presents the smallest number of training samples. Assume $\mathcal{C}_0$ is such a class. The regularization parameter associated with the other class cannot be arbitrarily chosen and should be set as in (33), while the bias is selected according to (34). However, pursuing this design is not possible in practice, due to the dependence of (33) and (34) on the true covariance matrices. To solve this issue, we propose in the following theorem consistent estimators for the quantities arising in (33) and (34) that depend only on the training samples.
Theorem 4 Assume $n_1 > n_0$ and let $\gamma_0$ be the regularization parameter associated with class $\mathcal{C}_0$. Let $\hat{\delta}_0$ be given by:
$$\hat{\delta}_0 = \frac{1}{\gamma_0}\cdot\frac{\frac{p}{n_0} - \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}{1 - \frac{p}{n_0} + \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}$$
and define $\hat{\gamma}_1$ as:
$$\hat{\gamma}_1 = \gamma_0 - \gamma_0^2\left(1 - \frac{n_0}{n_1}\right)\hat{\delta}_0 \tag{35}$$
Then $\hat{\gamma}_1 - \gamma_1 \xrightarrow{a.s.} 0$, where $\gamma_1$ is given in (33). Define $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\alpha}$ as:
$$\hat{\beta}_0 = -\frac{1}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_1(\hat{\gamma}_1)(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1) - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] + \frac{n_0}{\sqrt{p}}\hat{\delta}_0$$
$$\hat{\beta}_1 = -\frac{1}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_0(\gamma_0)(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1) - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_1\mathbf{H}_0(\gamma_0)\right] + \frac{n_1}{\sqrt{p}}\hat{\delta}_1 \tag{36}$$
$$\hat{\alpha} = \sqrt{\hat{B}}$$
where $\hat{\delta}_1$ is defined in the same way as $\hat{\delta}_0$, with $(n_0, \gamma_0, \mathbf{H}_0)$ replaced by $(n_1, \hat{\gamma}_1, \mathbf{H}_1)$, and where $\hat{B}$ writes as:
$$\hat{B} = \frac{(1 + \gamma_0\hat{\delta}_0)^2}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_0(\gamma_0)\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_0(\gamma_0)\right] - \frac{n_0}{p}\hat{\delta}_0^2(1 + \gamma_0\hat{\delta}_0)^2 + \frac{1}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] - \frac{n_1}{p}\left(\frac{1}{n_1}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right]\right)^2 - \frac{2(1 + \gamma_0\hat{\delta}_0)}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_0(\gamma_0)\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] + \frac{2\hat{\delta}_0(1 + \gamma_0\hat{\delta}_0)}{p}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_0\mathbf{H}_1(\hat{\gamma}_1)\right] \tag{37}$$
Let $\hat{\theta}^\star$ be given by:
$$\hat{\theta}^\star = \frac{\hat{\beta}_1 - \hat{\beta}_0}{2} - \frac{\hat{\alpha}^2}{\hat{\beta}_0 + \hat{\beta}_1}\log\frac{\pi_1}{\pi_0} \tag{38}$$
Then $\hat{\theta}^\star - \theta^\star \xrightarrow{a.s.} 0$, where $\theta^\star$ is given in (34).

Proof See Appendix C.

It is worth mentioning that, unlike $\gamma_1$, $\hat{\gamma}_1$ is random. It does not satisfy (33) with equality, but it ensures (32) almost surely. Its use as a replacement for $\gamma_1$ leads asymptotically to the same results as the improved classifier using $\gamma_1$. For the reader's convenience, we provide hereafter the algorithm describing the proposed improved QDA classifier:
Algorithm 1: Improved design of the R-QDA classifier.
Input: Assuming $n_1 \geq n_0$, a regularization parameter $\gamma_0$ associated with class $\mathcal{C}_0$, and the training samples $\mathcal{T}_0 = \{\mathbf{x}_l\}_{l=1}^{n_0}$ of $\mathcal{C}_0$ and $\mathcal{T}_1 = \{\mathbf{x}_l\}_{l=n_0+1}^{n_0+n_1=n}$ of $\mathcal{C}_1$.
Output: Estimates of the parameters $\gamma_1$ and $\theta^\star$ to be plugged into (14).
1. Compute $\hat{\gamma}_1$ as in (35).
2. Compute $\hat{\theta}^\star$ as in (38).
3. Return $\hat{\theta}^\star$ and $\hat{\gamma}_1$, to be plugged into the classification rule (14).

The improved design described in Algorithm 1 depends on the regularization parameter $\gamma_0$ associated with the class with the smallest number of training samples. One possible way to adjust this parameter is to resort to a traditional cross-validation approach, which consists in estimating, based on a set of testing data, the classification error rate for each candidate value of the regularization parameter $\gamma_0$. Such an approach presents several drawbacks: it is computationally expensive and far from optimal, as it can only test a few values of $\gamma_0$. As an alternative, we propose to build a consistent estimator of the classification error rate based on results from random matrix theory, which can then assist in the setting of the regularization parameter $\gamma_0$.
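A sketch of steps 1 and 2 of Algorithm 1, computed from the training data alone, is given below; it covers $\hat{\delta}_0$ and the estimator (35), while $\hat{\theta}^\star$ additionally requires $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\alpha}$ from (36)-(37) and is omitted for brevity. The function names are ours.

```python
# Sketch of steps 1-2 of Algorithm 1; only delta_hat and the estimator (35)
# are shown, the bias estimator (38) needs (36)-(37) on top of these.
import numpy as np

def delta_hat(X, gamma):
    """Consistent estimator of delta for one class, built from H(gamma) alone."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    H = np.linalg.inv(np.eye(p) + gamma * Sigma)
    t = np.trace(H) / n
    return (p / n - t) / (gamma * (1.0 - p / n + t))

def gamma1_hat(X0, n1, gamma0):
    """Estimator (35) of the majority-class regularizer (assumes n1 >= n0)."""
    n0 = X0.shape[0]
    return gamma0 - gamma0**2 * (1.0 - n0 / n1) * delta_hat(X0, gamma0)
```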
This is the objective of the following theorem.

Theorem 5 Under Assumptions 1-4, a consistent estimator of the misclassification error rate associated with class $\mathcal{C}_i$ is given by:
$$\hat{\epsilon}_i = \Phi\left(\frac{(-1)^i\left(\hat{\xi}_i - \hat{b}_i\right)}{\sqrt{\hat{B} + 4\hat{r}_i}}\right)$$
where $\hat{B}$ is given in (37), $\gamma_1$ is set to $\hat{\gamma}_1$, and
$$\hat{\xi}_i = \hat{\theta}^\star + \frac{(-1)^{i+1}}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_{1-i}(\gamma_{1-i})(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1), \quad i \in \{0,1\}$$
$$\hat{\delta}_i = \frac{1}{\gamma_i}\cdot\frac{\frac{p}{n_i} - \frac{1}{n_i}\mathrm{Tr}\left[\mathbf{H}_i(\gamma_i)\right]}{1 - \frac{p}{n_i} + \frac{1}{n_i}\mathrm{Tr}\left[\mathbf{H}_i(\gamma_i)\right]}, \quad i \in \{0,1\}$$
$$\hat{b}_i = \frac{(-1)^i}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_i\mathbf{H}_{1-i}(\gamma_{1-i})\right] + \frac{(-1)^{i+1}n_i}{\sqrt{p}}\hat{\delta}_i, \quad i \in \{0,1\}$$
$$\hat{r}_i = \frac{1}{p}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_{1-i}(\gamma_{1-i})\hat{\boldsymbol{\Sigma}}_i\mathbf{H}_{1-i}(\gamma_{1-i})(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1), \quad i \in \{0,1\}$$
in the sense that:
$$\hat{\epsilon}_i - \epsilon_i^{imp} \xrightarrow{a.s.} 0$$
Proof The proof is based on employing the consistent estimators provided in (Elkhalil et al., 2017) and is thus omitted.
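In practice, Theorem 5 can be used as follows: sweep a grid of candidate $\gamma_0$ values and retain the one minimizing the estimated total error $\pi_0\hat{\epsilon}_0 + \pi_1\hat{\epsilon}_1$. A hedged sketch follows, where `estimate_errors` is a hypothetical placeholder for an implementation of the Theorem 5 formulas, passed in as a callable.

```python
# Hedged sketch of gamma_0 selection via Theorem 5; `estimate_errors` is a
# hypothetical callable returning (eps_hat_0, eps_hat_1) for a given gamma_0.
import numpy as np

def tune_gamma0(estimate_errors, X0, X1, pi0, pi1, grid=None):
    grid = np.logspace(-2, 2, 50) if grid is None else grid
    best_gamma, best_err = grid[0], np.inf
    for g in grid:
        e0, e1 = estimate_errors(X0, X1, g)   # Theorem 5 estimators at gamma_0 = g
        err = pi0 * e0 + pi1 * e1
        if err < best_err:
            best_gamma, best_err = g, err
    return best_gamma
```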
4. Numerical results
In this section, we assess the performance of our improved R-QDA classifier and compare it with the standard R-QDA classifier in the case of imbalanced training data. To this end, we start by generating, for both classes, synthetic data that comply with the different assumptions used throughout this work, for the sake of validating our theoretical results. In Figure 2 and Figure 3, we plot the classification error rates of the improved classifier and of the traditional R-QDA classifier with respect to the regularization parameter $\gamma_0$ and the feature dimension $p$, respectively. As can be seen, the standard R-QDA has a classification error rate that converges to the prior of the dominant class, which reveals that, as expected, it tends to assign all observations to the same class, which in this case coincides with the class that presents the highest number of training samples. On the contrary, the proposed R-QDA classifier presents much better performance, making it more suitable to cope with imbalanced settings. We finally note that the consistent estimator based on the results of Theorem 5 is accurate, and as such can be used to properly adjust the regularization parameter $\gamma_0$.

Figure 2: Average misclassification error rate versus the regularization parameter $\gamma_0$, for the improved classifier (opt QDA), the standard one (std QDA), and the consistent estimator of Theorem 5 (g est). We consider $p = 1000$ features with unbalanced training sizes $n_1 = 2n_0$, $\boldsymbol{\Sigma}_0 = 4\mathbf{I}_p$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_0 + 3\mathbf{Q}_p\mathbf{D}_p\mathbf{Q}_p^T$ with $\mathbf{Q}_p$ orthogonal, $\mathbf{D}_p = \mathrm{diag}\left(\mathbf{1}_{\sqrt{p}}, \mathbf{0}_{p-\sqrt{p}}\right)$, and $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \frac{1}{\sqrt{p}}\mathbf{1}_{p\times 1}$.

We then test the performance of the proposed R-QDA classifier on the public USPS dataset of handwritten digits (Lecun et al., 1998) and on the EEG dataset. The USPS dataset is composed of 42000 labeled digit images, each image having $p = 784$ features represented by $28 \times 28$ pixels. The EEG dataset is composed of 5 classes that contain 4097 observations, each observation having $p = 178$ features. We consider the classification of two classes from each dataset, composed of $n_0$ and $n_1$ samples. Based on the results of Theorem 5, we tune the regularization factor $\gamma_0$ to the value that minimizes the consistent estimate of the misclassification error rate. The values of $\hat{\gamma}_1$ and $\hat{\theta}^\star$ are then computed based on (35) and (38). Figure 4 and Figure 5 compare the performance of the proposed classifier with other state-of-the-art classification algorithms using cross-validation, for different ratios $\frac{n_0}{n_1}$. As seen, our classifier, termed RQDA$_{imp}$ in the figures, not only outperforms the standard QDA but also the other existing classification algorithms. Moreover, it is worth mentioning that the standard R-QDA is the classifier that presents the lowest performance in imbalanced settings, corresponding to $\frac{n_0}{n_1} < 1$. This suggests that the use of different regularization parameters across classes in the QDA classification rule, along with an adequate tuning of the bias, makes the R-QDA classifier more robust to the estimation noise of the covariance matrices in imbalanced settings.
Figure 3: Average misclassification error rate versus the dimension $p$. We consider $\gamma_0 = 1$ with unbalanced training sizes $n_1 = 2n_0$, $\boldsymbol{\Sigma}_0 = 4\mathbf{I}_p$, $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_0 + 3\mathbf{Q}_p\mathbf{D}_p\mathbf{Q}_p^T$ with $\mathbf{Q}_p$ orthogonal, $\mathbf{D}_p = \mathrm{diag}\left(\mathbf{1}_{\sqrt{p}}, \mathbf{0}_{p-\sqrt{p}}\right)$, and $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_0 + \frac{1}{\sqrt{p}}\mathbf{1}_{p\times 1}$.
Figure 4: Comparison of the accuracy of our improved R-QDA classifier (RQDA$_{imp}$) with other machine learning algorithms (QDA, LR, SVM, DT, RF, KNN, NN) on the EEG dataset, as a function of $\frac{n_0}{n_1}$.
Figure 5: Comparison of the accuracy of our improved R-QDA classifier (RQDA$_{imp}$) with other machine learning algorithms (QDA, LR, SVM, DT, RF, KNN) on the USPS dataset, as a function of $\frac{n_0}{n_1}$.
5. Conclusion
The traditional view holds that the use of R-QDA leads in general to lower classification performance than many other existing classification methods, even though it derives from the maximum likelihood principle under a general Gaussian mixture model. In this work, we establish that this loss in performance can be attributed to an induced estimation noise of high order that hides the information useful for classification, leading the R-QDA score to behave similarly for all testing observations. Based on this analysis, we propose a modified design of R-QDA that discriminates efficiently between classes. Our amendment of the R-QDA classifier is based on using one regularization parameter per class, as well as a carefully designed bias that minimizes an asymptotic approximation of the classification error rate. We confirm the efficacy of the proposed classifier through a set of numerical results, which show that our proposed classifier outperforms not only the standard R-QDA but also other state-of-the-art existing algorithms. Going further, we believe that this work shows that, contrary to common belief, there is still room for improvement of very basic classification methods through a careful study of their behavior using advanced statistical tools.
References
P. Billingsley. Probability and Measure. Wiley, 3rd edition, 1995.

Yu Cheng. Asymptotic probabilities of misclassification of two discriminant functions in cases of high dimensional data. Statistics & Probability Letters, 67:9-17, 2004. doi: 10.1016/j.spl.2003.12.001.

R. Couillet, Z. Liao, and X. Mai. Classification asymptotics in the random matrix regime. In EUSIPCO, 2018.

P. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, 1982.

Khalil Elkhalil, Abla Kammoun, Romain Couillet, Tareq Y. Al-Naffouri, and Mohamed-Slim Alouini. A large dimensional study of regularized discriminant analysis classifiers. arXiv:1711.00382, 2017. URL https://arxiv.org/abs/1711.00382.

Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.

W. Hachem, O. Khorunzhiy, P. Loubaton, J. Najim, and L. Pastur. A new approach for mutual information analysis of large dimensional multi-antenna channels. IEEE Transactions on Information Theory, 54(9):3987-4004, September 2008.

Walid Hachem, Philippe Loubaton, Jamal Najim, and Pascal Vallet. On bilinear forms based on the resolvent of large random matrices. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 49(1):36-63, 2013. doi: 10.1214/11-AIHP450. URL https://doi.org/10.1214/11-AIHP450.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.

Binyan Jiang, Xiangyu Wang, and Chenlei Leng. QUDA: A direct approach for sparse quadratic discriminant analysis. Journal of Machine Learning Research, 19, 2015.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. doi: 10.1109/5.726791.

Quefeng Li and Jun Shao. Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, 25, 2015. doi: 10.5705/ss.2013.150.

H. Richard McFarland and Donald St. P. Richards. Exact misclassification probabilities for plug-in normal quadratic discriminant functions. Journal of Multivariate Analysis, 82:299-330, 2002.

S. Raudys. On determining training sample size of a linear classifier. Computing Systems, 28:79-87, 1967. In Russian.

A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998a.

Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998b.

C. Wang and B. Jiang. On the dimension effect of regularized linear discriminant analysis. Electronic Journal of Statistics, 12:2709-2742, 2018.

A. Zollanvari and E. R. Dougherty. Generalized consistent error estimator of linear discriminant analysis. IEEE Transactions on Signal Processing, 63(11):2804-2814, June 2015.
As discussed in the paper, the design of the regularization parameters $\gamma_0$ and $\gamma_1$ should ensure that:
$$\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right] = \Theta(1) \tag{39}$$
where $\mathbf{T}_i = \left(\mathbf{I}_p + \tilde{\gamma}_i\boldsymbol{\Sigma}_i\right)^{-1}$, with $\tilde{\gamma}_i = \frac{\gamma_i}{1 + \gamma_i\delta_i}$. Using the relation $\mathbf{A}^{-1} - \mathbf{B}^{-1} = \mathbf{A}^{-1}(\mathbf{B} - \mathbf{A})\mathbf{B}^{-1}$, valid for any two invertible square matrices $\mathbf{A}$ and $\mathbf{B}$, (39) boils down to:
$$\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_1\left(\tilde{\gamma}_0\boldsymbol{\Sigma}_0 - \tilde{\gamma}_1\boldsymbol{\Sigma}_1\right)\mathbf{T}_0\right] = \Theta(1)$$
or equivalently:
$$\frac{\tilde{\gamma}_0}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_1\left(\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1\right)\mathbf{T}_0\right] + \frac{\tilde{\gamma}_0 - \tilde{\gamma}_1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_1\boldsymbol{\Sigma}_1\mathbf{T}_0\right] = \Theta(1)$$
Using Assumption 4, it can readily be seen that the first term $\frac{\tilde{\gamma}_0}{\sqrt{p}}\mathrm{Tr}[\boldsymbol{\Sigma}_i\mathbf{T}_1(\boldsymbol{\Sigma}_0 - \boldsymbol{\Sigma}_1)\mathbf{T}_0] = \Theta(1)$. To satisfy (39), we thus only need to design $\gamma_0$ and $\gamma_1$ such that:
$$\tilde{\gamma}_0 - \tilde{\gamma}_1 = \Theta\left(\frac{1}{\sqrt{p}}\right)$$
or equivalently:
$$\frac{\gamma_0}{1 + \frac{\gamma_0}{n_0}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right]} - \frac{\gamma_1}{1 + \frac{\gamma_1}{n_1}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\mathbf{T}_1\right]} = \Theta\left(\frac{1}{\sqrt{p}}\right)$$
Under Assumption 4,
$$\frac{1}{n_1}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\mathbf{T}_1\right] = \frac{1}{n_1}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right] + O\left(\frac{1}{\sqrt{p}}\right)$$
which proves that, choosing $\gamma_1$ as:
$$\gamma_1 = \gamma_0 - \left(\frac{1}{n_0} - \frac{1}{n_1}\right)\gamma_0^2\,\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right]$$
the condition (39) becomes satisfied.
The choice of the regularization parameters $\gamma_0$ and $\gamma_1$ ensures that:
$$B_0 = B_1 + O\left(\frac{1}{\sqrt{p}}\right)$$
As a result, the asymptotic equivalents of the classification error rates of both classes defined in (18) can, for $i \in \{0,1\}$, be reduced to:
$$\epsilon_i^{imp} - \Phi\left(\frac{(-1)^i\left(\xi_i - b_i\right)}{\sqrt{B_0}}\right) \xrightarrow{p} 0$$
Then, the total classification error can be written as:
$$\epsilon = \pi_0\Phi\left(\frac{\beta_0 + \theta}{\alpha}\right) + \pi_1\Phi\left(\frac{\beta_1 - \theta}{\alpha}\right)$$
where
$$\beta_0 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_1\boldsymbol{\mu} - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\beta_1 = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_0\boldsymbol{\mu} + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_1\left(\mathbf{T}_1 - \mathbf{T}_0\right)\right]$$
$$\alpha = \sqrt{B_0}$$
Taking the derivative of this expression with respect to $\theta$ and setting it to zero, the optimal bias $\theta^\star$ should satisfy:
$$\frac{\pi_0}{\pi_1}\,e^{\frac{(\beta_1 - \theta^\star)^2}{2\alpha^2} - \frac{(\beta_0 + \theta^\star)^2}{2\alpha^2}} = 1$$
Applying the logarithm on both sides, we obtain:
$$\log\frac{\pi_0}{\pi_1} + \frac{(\beta_1 - \theta^\star)^2}{2\alpha^2} - \frac{(\beta_0 + \theta^\star)^2}{2\alpha^2} = 0$$
thus leading to
$$\theta^\star = \frac{\beta_1 - \beta_0}{2} - \frac{\alpha^2}{\beta_0 + \beta_1}\log\frac{\pi_1}{\pi_0}$$
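The stationarity computation above can be double-checked symbolically; the following sketch (ours, using sympy) verifies that $\theta^\star$ of (34) zeroes the derivative of the two-term objective.

```python
# Symbolic check that theta* of (34) is a stationary point of
# pi0*Phi((beta0+theta)/alpha) + pi1*Phi((beta1-theta)/alpha).
import sympy as sp

theta = sp.Symbol('theta', real=True)
b0, b1, a, p0, p1 = sp.symbols('beta0 beta1 alpha pi0 pi1', positive=True)
Phi = lambda z: (1 + sp.erf(z / sp.sqrt(2))) / 2          # standard normal CDF
eps = p0 * Phi((b0 + theta) / a) + p1 * Phi((b1 - theta) / a)
theta_star = (b1 - b0) / 2 - a**2 / (b0 + b1) * sp.log(p1 / p0)
deriv = sp.diff(eps, theta).subs(theta, theta_star)
print(sp.simplify(deriv))                                  # expected output: 0
```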
In Theorem 4, we provide a consistent estimator of the regularization parameter $\gamma_1$ that satisfies (32) with high probability, as well as a consistent estimator of the optimal bias $\theta^\star$.

C.1 Consistent estimator of $\gamma_1$

We start by proving that $\gamma_1 - \hat{\gamma}_1 \xrightarrow{a.s.} 0$. To this end, we need a consistent estimator of $\left(\frac{1}{n_0} - \frac{1}{n_1}\right)\mathrm{Tr}[\boldsymbol{\Sigma}_0\mathbf{T}_0]$. We start by noticing that:
$$\left(\frac{1}{n_0} - \frac{1}{n_1}\right)\mathrm{Tr}\left[\boldsymbol{\Sigma}_0\mathbf{T}_0\right] = \left(1 - \frac{n_0}{n_1}\right)\delta_0$$
A consistent estimator of $\delta_0$ has been provided in Elkhalil et al. (2017) and is given by:
$$\hat{\delta}_0 = \frac{1}{\gamma_0}\cdot\frac{\frac{p}{n_0} - \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}{1 - \frac{p}{n_0} + \frac{1}{n_0}\mathrm{Tr}\left[\mathbf{H}_0(\gamma_0)\right]}$$
and as such, a consistent estimator of $\gamma_1$ in (33) is given by:
$$\hat{\gamma}_1 = \gamma_0 - \gamma_0^2\left(1 - \frac{n_0}{n_1}\right)\hat{\delta}_0$$
Note that the replacement of $\gamma_1$ by $\hat{\gamma}_1$ still ensures condition (39), since from standard results of random matrix theory $\hat{\delta}_0 - \delta_0 = O\left(\frac{1}{\sqrt{p}}\right)$ with high probability.

C.2 Consistent estimator of $\theta^\star$

Recall that
$$\theta^\star = \frac{\beta_1 - \beta_0}{2} - \frac{\alpha^2}{\beta_0 + \beta_1}\log\frac{\pi_1}{\pi_0}$$
To provide a consistent estimator of $\theta^\star$, it is thus required to provide those of $\beta_0$, $\beta_1$ and $\alpha$. Since $\alpha = \sqrt{B_0}$ and $\hat{B} - B_0 \xrightarrow{a.s.} 0$, we have $\hat{\alpha} - \alpha \xrightarrow{a.s.} 0$. As for $\beta_i$, $i = 0,1$, it can be written as:
$$\beta_i = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} + \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_i\right] - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\right] = -\frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\right] + \frac{n_i}{\sqrt{p}}\delta_i$$
Due to the independence of $\hat{\boldsymbol{\Sigma}}_i$ and $\mathbf{H}_{1-i}$, and of $\hat{\boldsymbol{\mu}}_0$ and $\hat{\boldsymbol{\mu}}_1$ from $\mathbf{H}_{1-i}$, $i = 0,1$, we have:
$$\frac{1}{\sqrt{p}}\mathrm{Tr}\left[\hat{\boldsymbol{\Sigma}}_i\mathbf{H}_{1-i}\right] - \frac{1}{\sqrt{p}}\mathrm{Tr}\left[\boldsymbol{\Sigma}_i\mathbf{T}_{1-i}\right] \xrightarrow{a.s.} 0$$
$$\frac{1}{\sqrt{p}}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1)^T\mathbf{H}_{1-i}(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1) - \frac{1}{\sqrt{p}}\boldsymbol{\mu}^T\mathbf{T}_{1-i}\boldsymbol{\mu} \xrightarrow{a.s.} 0$$
Combining these convergences with the consistent estimator $n_i\hat{\delta}_i$ of $\mathrm{Tr}[\boldsymbol{\Sigma}_i\mathbf{T}_i]$ yields the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ in (36), and consequently the consistency of $\hat{\theta}^\star$.