Flexible High-dimensional Classification Machines and Their Asymptotic Properties
Xingye Qiao∗
Department of Mathematical Sciences, State University of New York, Binghamton, NY 13902-6000. E-mail: [email protected]
Lingsong Zhang
Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: [email protected]

∗ Corresponding author

Abstract

Classification is an important topic in statistics and machine learning with great potential in many real applications. In this paper, we investigate two popular large margin classification methods, Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD), under two contexts: the high-dimensional, low-sample size data and the imbalanced data. A unified family of classification machines, the FLexible Assortment MachinE (FLAME), is proposed, within which DWD and SVM are special cases. The FLAME family helps to identify the similarities and differences between SVM and DWD. It is well known that many classifiers overfit the data in the high-dimensional setting; others are sensitive to the imbalanced data, that is, the class with a larger sample size overly influences the classifier and pushes the decision boundary towards the minority class. SVM is resistant to the imbalanced data issue, but it overfits high-dimensional data sets by showing the undesired data-piling phenomenon. The DWD method was proposed to improve SVM in the high-dimensional setting, but its decision boundary is sensitive to the imbalanced ratio of sample sizes. Our FLAME family helps to understand an intrinsic connection between SVM and DWD, and improves both methods by providing a better trade-off between sensitivity to the imbalanced data and overfitting the high-dimensional data. Several asymptotic properties of the FLAME classifiers are studied. Simulations and real data applications are investigated to illustrate the usefulness of the FLAME classifiers.
Key Words and Phrases:
Classification; Discriminant analysis; Fisher consistency; High-dimensional, low-sample size asymptotics; Imbalanced data; Support Vector Machine.

1 Introduction
Classification refers to predicting the class label, y ∈ C, of a data object based on its covariates, x ∈ X. Here C is the space of class labels, and X is the space of the covariates. Usually we consider X ≡ R^d, where d is the number of variables or the dimension. See Duda et al. (2001) and Hastie et al. (2009) for comprehensive introductions to many popular classification methods. When C = {+1, −1}, this is an important class of classification problems, called binary classification. The classification rule for a binary classifier usually has the form φ(x) = sign{f(x)}, where f(x) is called the discriminant function. Linear classifiers are the most important and the most commonly used classifiers, as they are often easy to interpret in addition to having reasonable classification performance. We focus on linear classifiers in this article. In the above formula, linear classifiers correspond to f(x; ω, β) = x^T ω + β. The sample space is divided into halves by the separating hyperplane, also known as the classification boundary, defined by {x : f(x) ≡ x^T ω + β = 0}. Note that the coefficient vector ω ∈ R^d defines the normal vector, and hence the direction, of the classification boundary, and the intercept term β ∈ R defines the location of the classification boundary.

In this paper, two popular classification methods, Support Vector Machine (SVM; Cortes and Vapnik, 1995, Vapnik, 1998, Cristianini and Shawe-Taylor, 2000) and Distance Weighted Discrimination (DWD; Marron et al., 2007, Qiao et al., 2010), are investigated under two important contexts: the High-Dimensional, Low-Sample Size (HDLSS) data and the imbalanced data. Both methods are large margin classifiers (Smola et al., 2000), which seek separating hyperplanes that maximize certain notions of gap (i.e., distances) between the two classes. The investigation of the performance of SVM and DWD motivates the invention of a novel family of classifiers, the FLexible Assortment MachinE (FLAME), which unifies the two classifiers, and helps to understand their connections and differences.

1.1 Motivation: Pros and Cons of SVM and DWD
SVM is a very popular classifier in statistics and machine learning. It has been shown to have Fisher consistency, i.e., when the sample size goes to infinity, its decision rule converges to the Bayes rule (Lin, 2004). SVM has several nice properties. 1) Its dual formulation is relatively easy to implement (by Quadratic Programming). 2) SVM is robust to the model specification, which makes it very popular in various real applications. However, when applied to HDLSS data, it has been observed that a large portion of the data (usually the support vectors, to be properly defined later) lie on two hyperplanes parallel to the SVM classification boundary. This is known as the data-piling phenomenon (Marron et al., 2007, Ahn and Marron, 2010). Data-piling of SVM indicates a type of overfitting. Other overfitting phenomena of SVM under the HDLSS context include:

1. The angle between the SVM direction and the Bayes rule direction is usually large.

2. The variability of the sampling distribution of the SVM direction ω is very large (Zhang and Lin, 2011). Moreover, because the separating hyperplane is decided only by the support vectors, the SVM direction tends to be unstable, in the sense that small turbulence or measurement error to the support vectors can lead to a big change of the direction.

3. In some cases, the out-of-sample classification performance may not be optimal due to the suboptimal direction of the estimated SVM discrimination direction.

DWD is a recently developed classifier to improve SVM in the HDLSS setting. It uses a different notion of gap from SVM. While SVM maximizes the smallest distance between classes, DWD maximizes a special average distance (harmonic mean) between classes. It has been shown in many earlier simulations that DWD largely overcomes the overfitting (data-piling) issue and it usually gives a better discrimination direction.

On the other hand, the intercept term β of the DWD method is sensitive to the sample size ratio between the two classes, i.e., to the imbalanced data (Qiao et al., 2010). Note that, even though a good discriminant direction ω is more important in revealing the profiling difference between the two populations, the classification/prediction performance heavily depends on the intercept β, more than on the direction ω. As shown in Qiao et al. (2010), usually the β of the SVM classifier is not sensitive to the sample size ratio, while the β of the DWD method will become too large (or too small) if the sample size of the positive class (or negative class) is very large.

In summary, both methods have pros and cons. SVM has larger stochastic variability and usually overfits the data by showing the data-piling phenomenon, but is less sensitive to the imbalanced data issue. DWD usually overcomes the overfitting/data-piling issue, and has smaller sampling variability, but is very sensitive to the imbalanced data. Driven by their similarity, we propose a unified class of classifiers, FLAME, in which the above two classifiers are special cases. FLAME provides a framework to study the connections and differences between SVM and DWD. Each FLAME classifier has a parameter θ which is used to control the performance balance between overfitting the HDLSS data and the sensitivity to the imbalanced data. It turns out that the DWD method is FLAME with θ = 0, and the SVM method corresponds to FLAME with θ = 1. The optimal θ depends on the trade-off among several factors: stochastic variability, overfitting and resistance against the imbalanced data.
In this paper, we also propose two approaches to select θ, where the resulting FLAME has a balanced performance between the SVM and DWD methods.

The rest of the paper is organized as follows. Section 2 provides toy examples and highlights the strengths and drawbacks of SVM and DWD on classifying the HDLSS and imbalanced data. We develop the FLAME method in Section 3, which is motivated by the investigation of the loss functions of SVM and DWD. Section 4 provides suggestions for the parameters. Three types of asymptotic results for the FLAME classifier are studied in Section 5. Section 6 discusses its properties using simulation experiments. A real application is discussed in Section 7. Some concluding remarks and discussions are made in Section 8.
2 Comparison of SVM and DWD
In this section, we use several toy examples to illustrate the strengths and drawbacks of SVM and DWD under two contexts: HDLSS data and imbalanced data.
We use simulations to compare SVM and DWD. The results show that the stochastic variability of the SVM direction is usually larger than that of the DWD method, and SVM directions deviate farther away from the Bayes rule directions. In addition, the newly proposed FLAME machine (see details in Section 3) is also included in the comparison, and it turns out that FLAME lies between the other two.

Figure 1 shows the comparison results between SVM, DWD and FLAME (with some chosen tuning parameters). We simulate 10 samples with the same underlying distribution. Each simulated data set contains 12 variables and two classes, with 120 observations in each class. The two classes have mean difference on only the first three dimensions and the within-class covariances are diagonal, that is, the variables are independent. For each simulated data set, we plot the first three components of the resulting discriminant directions from SVM, DWD and FLAME (after normalizing the 3D vectors to have unit norms), as shown in Figure 1. It clearly shows that the DWD directions (the blue down-pointing triangles) are the closest ones to the true Bayes rule direction (shown as the cyan diamond marker) among the three approaches. In addition, the DWD directions have a smaller variation (i.e., they are more stable) over different samples. The SVM directions (the red up-pointing triangles) are farthest from the true Bayes rule direction and have a larger variation than the other two methods.
To highlight the direction variabilities of the three methods, we introduce a novel measure for the variation (unstableness) of the discriminant directions: the trace of the sample covariance of the resulting direction vectors over the 10 replications, which we name as dispersion. The dispersion for the DWD method (0.0031) is much smaller than that for SVM (0.0453); the dispersion of FLAME with θ = 0.5 (0.0105) lies in between, which is better than SVM but worse than DWD.

Figure 1: The true population mean difference direction vector (the cyan dashed line and diamond marker; equivalent to the Bayes rule direction), the DWD directions (blue down-pointing triangles), the FLAME directions with θ = 0.5 (the magenta squares), and the SVM directions (the red up-pointing triangles).

Besides the stochastic variability and the deviation from the true direction comparisons shown above, DWD outperforms SVM in terms of stability in the presence of small perturbations applied to some observations. In Figure 2, we use a two-dimensional example to illustrate this phenomenon. We simulate a perfectly separable 2-dimensional data set.
Figure 2: A 2D example shows that the unstable SVM boundary has changed due to a small turbulence of a support vector (the solid red triangle and diamond) while the DWD boundary remains almost still.

The theoretical Bayes rule decision boundary is shown as the thick black line. The dashed red line and the dash-dotted blue line are the SVM and the DWD classification boundaries before the perturbation. We then move one observation in the positive group a little (from the solid triangle to the solid diamond as shown in the figure). This perturbation leads to a large change of direction in SVM (shown as the dotted red line), but a small change for DWD (shown as the solid blue line). Note that all four hyperplanes are capable of classifying this training data set perfectly. But it may not be true for an out-of-sample test set. This example shows that a small perturbation may lead to unstableness in SVM.
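The perturbation experiment in Figure 2 is easy to replicate in spirit with an off-the-shelf linear SVM. The snippet below is an illustrative sketch (not the authors' code, with made-up toy data): it fits a linear SVM, nudges one support vector slightly, refits, and reports how far the normal vector of the boundary rotates.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# two well-separated 2D classes (illustrative toy data)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, (20, 2)),
               rng.normal([-2.0, -2.0], 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

def boundary_direction(X, y):
    """Fit a linear SVM and return its unit normal vector and one support vector index."""
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w), clf.support_[0]

w_before, sv_idx = boundary_direction(X, y)

# perturb one support vector by a small amount and refit
X_moved = X.copy()
X_moved[sv_idx] += np.array([0.3, -0.3])
w_after, _ = boundary_direction(X_moved, y)

angle = np.degrees(np.arccos(np.clip(np.dot(w_before, w_after), -1.0, 1.0)))
print(f"rotation of the SVM normal vector: {angle:.1f} degrees")
```

The same exercise with a DWD fit would typically show a much smaller rotation, which is the point the figure makes.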
In the last subsection, we have shown that DWD outperforms SVM in estimating the discrimination direction, that is, DWD directions are closer to the Bayes rule discrimination directions and have smaller variability. However, it was found that the location of the DWD classification boundary, which is characterized by the intercept β, is sensitive to the sample size ratio between the two classes (Qiao et al., 2010).

Usually, a good discriminant direction ω helps to reveal the profiling difference between two classes of populations. But the classification/prediction performance heavily depends on the location coefficient β. We define the imbalance factor m ≥ 1 as the sample size ratio between the majority class and the minority class. The β in the SVM classifier is not sensitive to m. However, the β for the DWD method is very sensitive to m. We also notice that, as a consequence, the DWD separating hyperplane will be pushed toward the minority class when the ratio m is close to infinity, i.e., DWD classifiers tend to ignore the minority class. Again, we use a toy example in order to better illustrate the impact of the imbalanced data on β and on the classification performance.

Figure 3: A 1D example shows that the DWD boundary is pushed towards the minority class (blue) when the majority class (red) has tripled its sample size.

Figure 3 uses a one-dimensional example, so that estimating ω is not needed. This also corresponds to a multivariate data set, where ω is estimated correctly first, after which the data set is projected onto ω to form the one-dimensional data. In this plot, the x-coordinates of the red dots and the blue dots are the values of the data while the y-coordinates are random jitters for better visualization. The red and blue curves are the kernel density estimates for both classes. In the top subplot of Figure 3, where m = 1 (i.e., the balanced data), both the DWD (blue lines) and SVM (red lines) boundaries are close to the Bayes rule boundary (black solid line), which sits at 0. In the bottom subplot, the sample size of the red class is tripled, which corresponds to m = 3. Note that the SVM boundary moves a little towards the minority (blue) class, but is still fairly close to the true boundary. The DWD boundary, however, is pushed towards the minority. Although this does not impose immediate problems for the training data set, the DWD classifier will suffer from a great loss of classification performance when it is applied to an out-of-sample data set. It can be shown that when m goes to infinity, the DWD classification boundary will tend to negative infinity, which totally ignores the minority group (see our Theorem 3). However, SVM will not suffer from severe imbalanced data problems. One reason is that SVM only needs a small fraction of the data (called support vectors) for estimating both ω and β, which mitigates the imbalanced data issue naturally.

Imbalanced data issues have been investigated in both statistics and machine learning. See an extensive survey in Chawla et al. (2004). Recently, Owen (2007) studied the asymptotic behavior of infinitely imbalanced binary logistic regression. In addition, Qiao and Liu (2009) and Qiao et al. (2010) proposed to use adaptive weighting approaches to overcome the imbalanced data issue.

In summary, the performance of DWD and SVM differs in the following ways: 1) The SVM direction usually has a larger variation and deviates farther from the Bayes rule direction than the DWD direction does, which are indicators of overfitting HDLSS data. 2) The SVM intercept is not sensitive to the imbalanced data, but the DWD intercept is.
This motivates us to investigate their similarities and differences. In the next section, a new family of classifiers will be proposed, which unifies the above two classifiers.

3 FLAME Family
In this section, we introduce FLAME, a family of classifiers which is motivated by a thorough investigation of the loss functions of SVM and DWD in Section 3.1. The formulation and implementation of the FLAME classifiers are given in Section 3.2.
The key factors that drive the very distinct performances of the SVM and the DWD methods are their associated loss functions (see Figure 4).

Figure 4: FLAME loss functions for three θ values: θ = 0 (equivalent to DWD), θ = 0.5, and θ = 1 (equivalent to SVM/Hinge loss). The parameter C is set to be 1.

Figure 4 displays the loss functions of SVM, DWD and FLAME with some specific tuning parameters. SVM uses the Hinge loss function, H(u) = (1 − u)_+ (the red dashed curve in Figure 4), where u corresponds to the functional margin u ≡ yf(x). Note that the functional margin u can be viewed as the distance of vector x from the separating hyperplane (defined by {x : f(x) = 0}). When u > 0, the data vector is correctly classified; when u < 0, the data vector is wrongly classified. Note that when u > 1, the corresponding Hinge loss equals zero.
Thus, only those observations with u ≤ 1 have influence on the estimation of ω and β. These observations are called support vectors. This is why SVM is insensitive to observations that are far away from the decision boundary, and why it is less sensitive to the imbalanced data issue. However, the fact that only the support vectors have influence makes the SVM solution subject to the overfitting (data-piling) issue. This can be explained as follows: the optimization of SVM tries to push vectors towards small loss, i.e., large functional margin u. But once a vector is pushed to the point where u = 1, the optimization lacks further incentive to continue pushing it towards a larger functional margin, as the Hinge loss cannot be reduced for this vector. Therefore many data vectors pile up along the hyperplanes corresponding to u = 1. Data-piling is bad for generalization because small turbulence to the support vectors could lead to a big difference in the discriminant direction vector (recall the examples in Section 2.1).

The DWD method corresponds to a different DWD loss function,

V(u) = \begin{cases} 2\sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 1/u & \text{otherwise}. \end{cases} \qquad (1)

Here C is a pre-defined constant. Figure 4 shows the DWD loss function with C = 1. It is clear that the DWD loss function is very similar to the SVM loss function when u is small (both are linearly decreasing with respect to u). The major difference is that the DWD loss is always positive. This property makes the DWD method behave in a very different way than SVM. As there is always incentive to make the functional margin larger (and the loss smaller), the DWD loss function kills data-piling, and mitigates the overfitting issue for HDLSS data.

On the other hand, the DWD loss function makes the DWD method very sensitive to the imbalanced data issue, since each observation will have some influence, and thus the larger class will have larger influence. The decision boundary of the DWD method will tend to ignore the smaller class, because sacrificing the smaller class (boundary being closer to the smaller class and farther from the larger class) can lead to a dramatic reduction of the loss, which ultimately leads to a minimized overall loss.

3.2 FLAME

We propose to borrow strengths from both methods to simultaneously deal with both the imbalanced data and the overfitting (data-piling) issues. We first highlight the connections between the DWD loss and a modified version of the Hinge loss (of SVM). Then we modify the DWD loss so that samples far from the classification boundary will have zero loss.

Let f(x) = x^T ω + β. The formulation of SVM can be rewritten (see details in the appendix) in the form of argmin_{ω,β} Σ_i H*(y_i f(x_i)), s.t. ‖ω‖ ≤ 1, where H* is defined as

H*(u) = \begin{cases} \sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 0 & \text{otherwise}. \end{cases} \qquad (2)

Comparing the DWD loss (1) and this modified Hinge loss (2), one can easily see their connections: for u ≤ 1/√C, the DWD loss is greater than the Hinge loss of SVM by an exact constant √C, and for u > 1/√C, the DWD loss is 1/u while the SVM Hinge loss equals 0. Clearly the modified Hinge loss (2) is the result of soft-thresholding the DWD loss at √C. In other words, SVM can be seen as a special case of DWD where the losses of those vectors with u = y_i f(x_i) > 1/√C are shrunken to zero. To allow different levels of soft-thresholding, we propose to use a new loss function which (soft-)thresholds the DWD loss function by a constant θ√C where 0 ≤ θ ≤ 1, that is, a fraction of √C.
The new loss function is

L(u) = \big[V(u) - \theta\sqrt{C}\big]_+ = \begin{cases} (2-\theta)\sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 1/u - \theta\sqrt{C} & \text{if } 1/\sqrt{C} \le u < 1/(\theta\sqrt{C}), \\ 0 & \text{if } u \ge 1/(\theta\sqrt{C}), \end{cases} \qquad (3)

that is, to reduce the DWD loss by a constant, and truncate it at 0. The magenta solid curve in Figure 4 is the FLAME loss when C = 1 and θ = 0.5. This simple but useful modification unifies the DWD and SVM methods. When θ = 1, the new loss function (when C = 1) reduces to the SVM Hinge loss function; while when θ = 0, it remains as the DWD loss.

Note that L(u) = 0 for u > 1/(θ√C). Thus, those data vectors with large functional margins will have zero loss. For the DWD loss, because it corresponds to θ = 0 ⇒ 1/(θ√C) = ∞, no data vector can have zero loss. For the SVM loss, all the data vectors with u > 1/(θ√C) = 1/√C will have zero loss. Training a FLAME classifier with 0 < θ < 1 can be viewed as excluding data that are farther away from the boundary than 1/(θ√C) and assigning zero loss to them. Alternatively, it can be viewed as sampling data that are closer to the boundary than 1/(θ√C) and assigning positive loss to them. Note that the larger θ is, the fewer data are sampled to have positive loss. As one can flexibly choose θ, the new classification method with this new loss function is called the FLexible Assortment MachinE (FLAME).

FLAME can be implemented by a Second-Order Cone Programming algorithm (Toh et al., 1999, Tütüncü et al., 2003). Let θ ∈ [0, 1] be the FLAME parameter. The proposed method minimizes

\min_{\omega, \beta, \xi} \sum_{i=1}^{n} \Big( \frac{1}{r_i} + C\xi_i - \theta\sqrt{C} \Big)_+.

A slack variable ϕ_i ≥ 0 is introduced to replace the (·)_+ function. The optimization of FLAME can be written as

\min_{\omega, \beta, \xi} \sum_i \phi_i, \quad \text{s.t.} \quad \Big( \frac{1}{r_i} + C\xi_i - \theta\sqrt{C} \Big) - \phi_i \le 0, \quad \phi_i \ge 0,
r_i = y_i(x_i^T \omega + \beta) + \xi_i, \quad r_i \ge 0, \quad \xi_i \ge 0, \quad \|\omega\| \le 1.

A Matlab routine has been implemented and is available at the authors' personal websites. See the online supplementary materials for more details on the implementation.
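The optimization above is a convex program; besides the SOCP implementation referenced here, it can be prototyped directly with a generic convex modeling tool. The following is a minimal illustrative sketch (not the authors' Matlab routine) using Python and the cvxpy package; the function name flame_fit, the toy data, and the default solver choice are assumptions made for illustration.

```python
import numpy as np
import cvxpy as cp

def flame_fit(X, y, C=1.0, theta=0.5):
    """Illustrative FLAME solver: minimize
    sum_i (1/r_i + C*xi_i - theta*sqrt(C))_+  subject to
    r_i = y_i (x_i' w + b) + xi_i,  r_i >= 0,  xi_i >= 0,  ||w|| <= 1."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.Variable(n)
    # per-observation FLAME loss: DWD loss soft-thresholded at theta*sqrt(C)
    loss = cp.sum(cp.pos(cp.inv_pos(r) + C * xi - theta * np.sqrt(C)))
    constraints = [r == cp.multiply(y, X @ w + b) + xi,
                   r >= 0,
                   cp.norm(w, 2) <= 1]
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return w.value, b.value

# toy usage: theta = 0 behaves like DWD, theta = 1 like SVM (when C = 1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (30, 5)), rng.normal(-1.0, 1.0, (10, 5))])
y = np.concatenate([np.ones(30), -np.ones(10)])
w_hat, b_hat = flame_fit(X, y, C=1.0, theta=0.5)
print(np.sign(X @ w_hat + b_hat))
```

The soft-thresholded DWD loss is expressed here with cp.pos and cp.inv_pos, which keep the problem in disciplined convex form; a dedicated SOCP solver such as SDPT3, as used in the paper, would be faster for larger problems.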
There are two tuning parameters in the FLAME model: one is C, inherited from the DWD loss, which controls the amount of allowance for misclassification; the other is the FLAME parameter θ, which controls the level of soft-thresholding. Similar to the discussion in DWD (Marron et al., 2007), the classification performance of FLAME is insensitive to different values of C. In addition, it can be shown that for any C, FLAME is Fisher consistent, by applying the general results in Lin (2004). Thus, the default value for C as proposed in Marron et al. (2007) will be used in FLAME. In this section, we introduce two ways of choosing the second parameter θ. As the property and the performance of FLAME depend on the choice of this parameter, it is important to select the right amount of thresholding. In the following two subsections, we discuss two options for choosing the parameter θ. The first option is based on empirical plots resulting from the training data and is of practical use, and it is the θ value that we suggest. The second option is motivated by a theoretical consideration and is heuristically meaningful as well.

Note that an optimal θ depends on the nature of the data and the problems that users have. The optimal θ also depends on two performance measures: insensitivity to the imbalanced data, and resistance to overfitting. However, without prior knowledge, we may want to have a "good" trade-off between them. In this subsection, we suggest the following method to choose θ if no further information is provided.

As will become clear from the simulation examples shown in Section 6.3, we have observed that several performance measures for the FLAME classifiers, for example, the within-group error (see the definition in Section 6 and also in Qiao and Liu (2009)), are monotonically decreasing with respect to θ. On the other hand, performance measures such as the RankComp (see also Section 6) are monotonically increasing functions of θ. The RankComp measure is more related to the overfitting phenomena, and the within-group error is designed for measuring the performance against the imbalanced data. The lesson is that as θ increases, FLAME becomes less sensitive to the imbalanced data issue, but is subject to more overfitting. This motivates us to use the following parameter: the two curves of the two measures are normalized to be between 0 and 1. When θ = 0, the FLAME classifier (equivalent to DWD) has the smallest RankComp measure 0, but the largest within-group error 1. When θ = 1, the FLAME classifier (equivalent to SVM) has the smallest within-group error 0, but the largest RankComp 1. The suggested θ is chosen as the value where the two normalized curves intersect, that is, the normalized within-group error is the same as the normalized RankComp for this θ. This suggested parameter represents a natural trade-off between the two measures: neither measure is absolutely optimal, but each measure compromises by the same relative amount. This suggested parameter is called the equal-trade-off parameter.

Having observed that the DWD discrimination direction is usually closer to the Bayes rule direction, but its location term β is sensitive to the imbalanced data issue, we propose the following alternative data-driven approach to select an appropriate θ. Without loss of generality, we assume that the negative class is the majority class with sample size n_− and the positive class is the minority class with sample size n_+.
We point out that the main reason that DWD is sensitive to the imbalanced data issue is that it uses all vectors in the majority class to build up a classifier. A heuristic strategy to correct this would be to force the optimization to use the same number of vectors from both classes to build up a classifier: we first apply DWD to the data set, and calculate the distances of all data in the majority class to the current DWD classification boundary; we then train FLAME with a carefully chosen parameter θ which assigns positive loss to the closest n_+ data vectors in the majority class to the classification boundary. As a consequence, each class will have exactly n_+ vectors which have positive loss. In other words, while keeping the least imbalance (because we have the same numbers of vectors from both classes that have influence over the optimization), we obtain a model with the least possible overfitting (because 2n_+ vectors have influence, instead of only the limited support vectors as in SVM).

In practice, since the new FLAME classification boundary using the θ chosen above may be different from the initial DWD classification boundary, the n_+ closest points to the FLAME classification boundary may not be the same n_+ closest points to the DWD boundary. This means that it is not guaranteed that exactly n_+ points from the majority class will have positive loss. However, one can expect that a reasonable approximation can be achieved. Moreover, an iterative scheme for finding θ is introduced as follows in order to minimize such discrepancy.

For simplicity, we let (x_i, y_i) with index i be an observation from the positive/minority class and (x_j, y_j) with index j be an observation from the negative/majority class.

Algorithm 1. (Adaptive parameter)
1. Initiate θ_0 = 0.
2. For k = 0, 1, 2, ···,
   (a) Solve the FLAME solutions ω(θ_k) and β(θ_k) given parameter θ_k.
   (b) Let θ_{k+1} = max(θ_k, {g_{(n_+)}(θ_k)√C}^{−1}), where g_j(θ_k) is the functional margin u_j ≡ y_j(x_j^T ω(θ_k) + β(θ_k)) of the jth vector in the negative/majority class and g_{(l)}(θ_k) is the lth order statistic of these functional margins.
3. When θ_k = θ_{k−1}, the iteration stops.

The goal of this algorithm is to make g_{(n_+)}(θ_k) the greatest functional margin among all the data vectors that have positive loss in the negative/majority class. To achieve this, we calibrate θ by aligning g_{(n_+)}(θ_k) to the turning point u = 1/(θ√C) in the definition of the FLAME loss (3), that is, g_{(n_+)}(θ_k) = 1/(θ√C) ⇒ θ = (g_{(n_+)}(θ_k)√C)^{−1}.

We define the equivalent sample objective function of FLAME for the iterative algorithm above,

s(\omega, \beta, \theta) = \frac{1}{n_+ + n_-} \Big[ \sum_{i=1}^{n_+} L((x_i^T \omega + \beta), \theta) + \sum_{j=1}^{n_-} L(-(x_j^T \omega + \beta), \theta) \Big] + \lambda \|\omega\|^2.

Then the convergence of this algorithm is shown in Theorem 1.
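Before stating the convergence result, here is a short illustrative Python sketch of Algorithm 1. It assumes a FLAME solver with the signature flame_fit(X, y, C, theta) -> (w, b) (for instance, the hypothetical sketch in Section 3.2); the iteration cap, tolerance, and the clamp of θ to [0, 1] are practical safeguards added here, not part of the algorithm as stated.

```python
import numpy as np

def adaptive_theta(X, y, C=1.0, max_iter=20, tol=1e-6):
    """Illustrative sketch of Algorithm 1 (adaptive parameter).
    Assumes labels y in {+1, -1}, with -1 the majority class, and a
    FLAME solver flame_fit(X, y, C, theta) -> (w, b) as sketched earlier."""
    n_plus = int(np.sum(y == 1))            # minority class size n_+
    theta = 0.0                             # step 1: initiate theta_0 = 0
    for _ in range(max_iter):
        w, b = flame_fit(X, y, C=C, theta=theta)           # step 2(a)
        # functional margins of the negative/majority class
        g = y[y == -1] * (X[y == -1] @ w + b)
        g_nplus = np.sort(g)[n_plus - 1]    # n_+ th order statistic g_(n_+)
        if g_nplus > 0:                     # step 2(b): recalibrate theta
            theta_new = max(theta, 1.0 / (g_nplus * np.sqrt(C)))
        else:
            theta_new = theta
        theta_new = min(theta_new, 1.0)     # clamp to [0, 1] (safeguard, not in the paper)
        if abs(theta_new - theta) < tol:    # step 3: stop once theta stabilizes
            return theta_new
        theta = theta_new
    return theta
```

As noted in the text, one step of this iteration already gives a reasonable θ in many examples, so max_iter can be set to 1 for a cheap approximation.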
Theorem 1. In Algorithm 1, s(ω_k, β_k, θ_k) is non-increasing in k. As a consequence, Algorithm 1 converges to a stationary point s(ω_∞, β_∞, θ_∞) where s(ω_k, β_k, θ_k) ≥ s(ω_∞, β_∞, θ_∞). Moreover, Algorithm 1 terminates finitely.

Ideally, one would hope to get an optimal parameter θ* which satisfies θ* = (g_{(n_+)}(θ*)√C)^{−1}. In practice, θ_∞ will approximate θ* very well. In addition, we notice that one-step iteration usually gives decent results for simulation examples and some real examples.

5 Theoretical Properties
In this section, several important theoretical properties of the FLAME classifiers are investigated. We first prove the Fisher consistency (Lin, 2004) of FLAME in Section 5.1. As one focus of this paper is imbalanced data classification, the asymptotic properties of FLAME under the extremely imbalanced data setting are studied in Section 5.2. Lastly, a novel HDLSS asymptotics where n is fixed and d → ∞, the other focus of this article, is studied in Section 5.3.

Fisher consistency is a very basic property for a classifier. A classifier being Fisher consistent means that the minimizer of the conditional risk of the classifier given observation x has the same sign as the Bayes rule, argmax_{k∈{+1,−1}} P(Y = k | X = x). It has been shown that both SVM and DWD are Fisher consistent (Lin, 2004, Qiao et al., 2010). The following proposition states that the FLAME classifiers are Fisher consistent too.

Proposition 2. Let f* be the global minimizer of E[L(Y f(X), θ)], where L(·, θ) is the loss function for FLAME given parameter θ. Then sign(f*(x)) = sign(P(Y = +1 | X = x) − 1/2).

In this subsection, we investigate the asymptotic performance of SVM, DWD and FLAME. The asymptotic setting we focus on is when the minority sample size n_+ is fixed and the majority sample size n_− → ∞, which is similar to the setting in Owen (2007). We will show that DWD is sensitive to the imbalanced data, while FLAME with proper choices of parameter θ and SVM are not.

Let x̄_+ be the sample mean of the positive/minority class. Theorem 3 shows that in the imbalanced data setting, when the size of the negative/majority class grows while that of the positive/minority class is fixed, the intercept term for DWD tends to negative infinity, in the order of √m. Therefore, DWD will classify all the observations to the negative/majority class, that is, the minority class will be 100% misclassified.

Theorem 3. Let n_+ be fixed. Assume that the conditional distribution of the negative/majority class F_−(x) surrounds x̄_+ by the definition given in Owen (2007), and that γ is a constant satisfying

\inf_{\|\omega\|=1} \int 1\{(x - \bar{x}_+)'\omega > 0\} \, dF_-(x) > \gamma \ge 0,

then the DWD intercept β̂ satisfies

\hat{\beta} < -\sqrt{\gamma C m} - \bar{x}_+^T \omega = -\sqrt{n_- \gamma C / n_+} - \bar{x}_+^T \omega.

In Section 4.2, we have introduced an iterative approach to select the parameter θ. Theorem 4 shows that with the optimal parameter θ* found by Algorithm 1, the discriminant direction of FLAME is in the same direction as the vector that joins the sample mean of the positive class and the tilted population mean of the negative class. Moreover, in contrast to DWD, the intercept term of FLAME in this case is finite.

Theorem 4. Suppose that n_− ≫ n_+ and ω* and β* are the FLAME solutions trained with the parameter θ* that satisfies θ* = (g_{(n_+)}(θ*)√C)^{−1}. Then ω* and β* satisfy

\omega^* = \frac{C}{(1+m)\lambda} \Big[ \bar{x}_+ - \frac{\int (x^T \omega^* + \beta^*)^{-2} x \, dF_-(x \mid E)}{\int (x^T \omega^* + \beta^*)^{-2} \, dF_-(x \mid E)} \Big],

where E is the event that [Y(X^T ω* + β*)]^{−1} ≥ θ*√C, where (X, Y) is a random sample from the negative/majority class, and

\int (x^T \omega^* + \beta^*)^{-2} \, dF_-(x \mid E) = \frac{n_+}{n_o C},

where 0 < n_o ≤ n_+.

As a consequence of Theorem 4, when m = n_−/n_+ → ∞, we have ‖ω*‖ → 0.
Since the right-hand side of the last equation above is positive and finite, β* does not diverge. In addition, since P(E) → 0, β* < −1/(θ*√C).

The following theorem shows the performance of SVM under the imbalanced data context, which completes our comparisons between SVM, DWD and FLAME.

Theorem 5. Suppose that n_− ≫ n_+. The solutions ω̂ and β̂ to SVM satisfy

\hat{\omega} = \frac{1}{(1+m)\lambda} \Big\{ \bar{x}_+ - \int x \, dF_-(x \mid G) \Big\},

where G is the event that 1 − Y(X^T ω̂ + β̂) > 0, where (X, Y) is a random sample from the negative/majority class, and

1 - P(G) = P(1 + X^T \hat{\omega} + \hat{\beta} \le 0) = 1 - 1/m.
The last statement in Theorem 5 means that with probability converging to 1, β̂ ≤ −1 − X^T ω̂, which remains bounded (in contrast to the DWD intercept, which satisfies β̂ < −√(γCm) − x̄_+^T ω and diverges as m → ∞).

HDLSS data are emerging in many areas of scientific research. The HDLSS asymptotics is a recently developed theoretical framework. Hall et al. (2005) gave a geometric representation for HDLSS data, which can be used to study these new 'n fixed, d → ∞' asymptotic properties of binary classifiers such as SVM and DWD. Ahn et al. (2007) weakened the conditions under which the representation holds. Qiao et al. (2010) improved the conditions and applied this representation to investigate the performance of the weighted DWD classifier. The same geometric representation can be used to analyze FLAME. See a summary of some previous HDLSS results in the online supplementary materials. We develop the HDLSS asymptotic properties of the FLAME family by providing conditions in Theorem 6 under which the FLAME classifiers always correctly classify HDLSS data.

We first introduce the notations and give some regularity assumptions, then state the main theorem. Let k ∈ {+1, −1} be the class index. For the kth class and given a fixed n_k, consider a sequence of random data matrices X_1^k, X_2^k, ···, X_d^k, ···, indexed by the number of rows d, where each column of X_d^k is a random observation vector from R^d and each row represents a variable. Assume that each column of X_d^k comes independently from a multivariate distribution with dimension d and with covariance matrix Σ_d^k. Let λ_{1,d}^k ≥ ··· ≥ λ_{d,d}^k be the eigenvalues of the covariance, and (σ_d^k)² = d^{−1} Σ_{i=1}^d λ_{i,d}^k the average eigenvalue. The eigenvalue decomposition of Σ_d^k is Σ_d^k = V_d^k Λ_d^k (V_d^k)^T. We may define the square root of Σ_d^k as (Σ_d^k)^{1/2} = V_d^k (Λ_d^k)^{1/2}, and the inverse square root (Σ_d^k)^{−1/2} = (Λ_d^k)^{−1/2} (V_d^k)^T. With minimal abuse of notation, let E(X_d^k) denote the expectation of the columns of X_d^k. Lastly, the n_k × n_k dual sample covariance matrix is denoted by S_{D,d}^k = d^{−1} {X_d^k − E(X_d^k)}^T {X_d^k − E(X_d^k)}.

Assumption 1. There are five components:

(i) Each column of X_d^k has mean E(X_d^k) and the covariance matrix Σ_d^k of its distribution is positive definite.

(ii) The entries of Z_d^k ≡ (Σ_d^k)^{−1/2} {X_d^k − E(X_d^k)} = (Λ_d^k)^{−1/2} (V_d^k)^T {X_d^k − E(X_d^k)} are independent.

(iii) The fourth moment of each entry of each column is uniformly bounded by M > 0. Consider the Wishart representation of the dual sample covariance matrix S_{D,d}^k associated with X_d^k, that is,

d S_{D,d}^k = \{(Z_d^k)^T (\Lambda_d^k)^{1/2} (V_d^k)^T\} \{V_d^k (\Lambda_d^k)^{1/2} Z_d^k\} = \sum_{i=1}^{d} \lambda_{i,d}^k W_{i,d}^k,

where W_{i,d}^k ≡ (Z_{i,d}^k)^T Z_{i,d}^k and Z_{i,d}^k is the ith row of Z_d^k defined above. It is called a Wishart representation because if X_d^k is Gaussian, then each W_{i,d}^k follows the Wishart distribution W_{n_k}(1, I_{n_k}) independently.

(iv) The eigenvalues of Σ_d^k are sufficiently diffused, in the sense that

\epsilon_d^k = \frac{\sum_{i=1}^{d} (\lambda_{i,d}^k)^2}{\big(\sum_{i=1}^{d} \lambda_{i,d}^k\big)^2} \to 0, \quad \text{as } d \to \infty. \qquad (4)

(v) The sum of the eigenvalues of Σ_d^k is of the same order as d, in the sense that (σ_d^k)² = O(1) and 1/(σ_d^k)² = O(1).

Assumption 2. The distance between the two population expectations satisfies

d^{-1} \big\| E(X_d^{(+1)}) - E(X_d^{(-1)}) \big\|^2 \to \mu^2, \quad \text{as } d \to \infty.
Moreover, there exist constants σ and τ, such that (σ_d^{(+1)})² → σ² and (σ_d^{(−1)})² → τ².

Let ν² ≡ µ² + σ²/n_+ + τ²/n_−. The following theorem gives the sure classification condition for FLAME, which includes SVM and DWD as special cases.

Theorem 6. Without loss of generality, assume that n_+ ≤ n_−. The situation of n_+ > n_− is similar and omitted.

• If either one of the following three conditions is satisfied,
1. for θ ∈ [0, (1 + √(m^{−1}))/(ν√(dC))), µ² > (n_−/n_+)σ²/n_+ − τ²/n_− > 0;
2. for θ ∈ [(1 + √(m^{−1}))/(ν√(dC)), 2/(ν√(dC))), µ² > T − τ²/n_− > 0, where T := (1/(2θ√(dC)) + √(1/(4θ²dC) + σ²/n_+))² − σ²/n_+;
3. for θ ∈ [2/(ν√(dC)), 1], µ² > σ²/n_+ − τ²/n_− > 0,
then for a new data point x_0^+ from the positive class (+1), P(x_0^+ is correctly classified by FLAME) → 1, as d → ∞. Otherwise, the probability above → 0.

• If either one of the following three conditions is satisfied,
1. for θ ∈ [0, (1 + √(m^{−1}))/(ν√(dC))), (n_−/n_+)σ²/n_+ − τ²/n_− > 0;
2. for θ ∈ [(1 + √(m^{−1}))/(ν√(dC)), 2/(ν√(dC))), T − τ²/n_− > 0;
3. for θ ∈ [2/(ν√(dC)), 1], σ²/n_+ − τ²/n_− > 0,
then for any µ² > 0, for a new data point x_0^− from the negative class (−1), P(x_0^− is correctly classified by FLAME) → 1, as d → ∞.

Theorem 6 has two parts. The first part gives the conditions under which FLAME correctly classifies a new data point from the positive class, and the second part is for the negative class. Each part lists three conditions based on three disjoint intervals of the parameter θ. Note that the first and third intervals of each part generalize results which were shown to hold only for DWD and SVM before (c.f. Theorem 1 and Theorem 2 in Hall et al., 2005). In particular, it shows that all the FLAME classifiers with θ falling into the first interval behave like DWD asymptotically. Similarly, all the FLAME classifiers with θ falling into the third interval behave like SVM asymptotically. This partially explains the shape of the within-group error curve that we will show in Figures 6, S.2, and S.3, which we will discuss in the next section.

In the first part, the condition for the other FLAMEs (with θ in the second interval) is weaker than for the DWD-like FLAMEs (in the first interval), but stronger than for the SVM-like FLAMEs (in the third interval). This means that it is easier to classify a new data point from the positive/minority class by SVM, than by an intermediate FLAME, which is easier than by DWD. Note that when n_+ ≤ n_−, the hyperplane for FLAME is in general closer to the positive class.

In terms of classifying data points from the negative class, the order of the difficulties among DWD, FLAME and SVM reverses.

FLAME is not only a unified representation of DWD and SVM, but also introduces a new family of classifiers which are capable of avoiding the overfitting HDLSS data issue and the sensitivity to imbalanced data issue. In this section, we use simulations to show the performance of FLAME at various parameter levels. We will show that with a range of carefully chosen parameters, FLAME can outperform both the DWD and the SVM methods in various simulation settings.
Before we introduce our simulation examples, we first introduce the performance measures in this paper. Note that the Bayes rule classifier can be viewed as the "gold standard" classifier. In our simulation settings, we assume that data are generated from two Gaussian populations
MVN(µ_±, Σ) with different mean vectors µ_+ and µ_− and the same covariance matrix Σ. This setting leads to the following Bayes rule:

\text{sign}(x^T \omega_B + \beta_B), \quad \text{where } \omega_B = \Sigma^{-1}(\mu_+ - \mu_-) \text{ and } \beta_B = -\tfrac{1}{2}(\mu_+ + \mu_-)'\omega_B. \qquad (5)
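For concreteness, the Bayes rule (5) for this two-Gaussian setting can be computed as in the following small numpy sketch (illustrative only; the function name is ours):

```python
import numpy as np

def bayes_rule(mu_plus, mu_minus, Sigma):
    """Bayes rule (5) for two Gaussian classes with a common covariance matrix."""
    w_B = np.linalg.solve(Sigma, mu_plus - mu_minus)    # Sigma^{-1} (mu_+ - mu_-)
    beta_B = -0.5 * (mu_plus + mu_minus) @ w_B           # -(1/2)(mu_+ + mu_-)' w_B
    return w_B, beta_B

# a new point x0 is then classified as np.sign(x0 @ w_B + beta_B)
```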
Five performance measures are evaluated in this paper:

1. The mean within-class error (MWE) for an out-of-sample test set, which is defined as

MWE = \frac{1}{2n_+} \sum_{i=1}^{n_+} 1(\hat{Y}_i^+ \ne Y_i^+) + \frac{1}{2n_-} \sum_{j=1}^{n_-} 1(\hat{Y}_j^- \ne Y_j^-).

2. The deviation of the estimated intercept β from the Bayes rule intercept β_B: |β − β_B|.

3. Dispersion: a measure of the stochastic variability of the estimated discrimination direction vector ω. The dispersion measure was introduced in Section 2.1 as the trace of the sample covariance of the resulting discriminant direction vectors: dispersion = tr{Var([ω_r]_{r=1:R})}, where R is the number of repeated runs.

4. Angle between the estimated discrimination direction ω and the Bayes rule direction ω_B: ∠(ω, ω_B).

5. RankComp(ω, ω_B): in general, for two direction vectors ω and ω*, RankComp is defined as the proportion of pairs of variables, among all d(d−1)/2 pairs, whose relative ranks of importance (by the corresponding entries of the two direction vectors) disagree, i.e.,

\text{RankComp}(\omega, \omega^*) \equiv \{d(d-1)/2\}^{-1} \sum_{1 \le i < j \le d} 1\{\text{variables } i \text{ and } j \text{ are ranked in opposite orders by } \omega \text{ and } \omega^*\}.

FLAME (with θ = 0.5) has been compared with SVM (θ = 1) and DWD (θ = 0) in Figure 1, and on average, its discriminant directions are closer to the Bayes rule direction ω_B than the SVM directions, but less close than the DWD directions. In this subsection, we will further investigate the performance of FLAME with several different values of θ, and compare them with DWD and SVM under various simulation settings.

Figure 5: The dispersions (top row) and the angles between the FLAME direction and the Bayes direction (bottom row) for 50 runs of simulations, where the imbalance factors m are 1, 4 and 9 (the left, center and right panels), in the increasing dimension setting (d = 100, 400, 700, 1000 on the x-axes). The FLAME machines have θ = 0, 0.25, 0.5, 0.75, 1. As θ and the dimension d increase, both the dispersion and the deviation from the Bayes direction increase. The emergence of the imbalanced data (the increase of m) does not much deteriorate the FLAME directions except for large d.

Figure 5 shows the comparison results under the same simulation setting with various combinations of (d, m). In this simulation setting, data are from multivariate normal distributions with identity covariance matrices MVN_d(µ_±, I_d), where d = 100, 400, 700 and 1000. We let µ = c(d, d−1, d−2, ···, 1)^T, where c > 0 is chosen for µ to have norm 2.7. Then we let µ_+ = µ and µ_− = −µ. The imbalance factor varies among 1, 4 and 9 while the total sample size is 240. For each experiment, we repeat the simulation 50 times, and plot the average performance measure in Figure 5. The Bayes rule is calculated according to (5). It is obvious that when the dimension increases, both the dispersion and the angle increase. They are indicators of overfitting HDLSS data. When the imbalance factor m increases, the two measures increase as well, although not as much as when the dimension increases.
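For completeness, the direction-quality measures used in this section (angle from the Bayes direction, dispersion, and RankComp) can be computed as in the following illustrative numpy sketch; the magnitude-based indicator in rank_comp is one natural reading of the RankComp definition above, and `directions` is a hypothetical R × d array of estimated direction vectors over R runs.

```python
import numpy as np

def angle_from_bayes(w, w_B):
    """Angle (in degrees) between an estimated direction and the Bayes direction."""
    c = np.dot(w, w_B) / (np.linalg.norm(w) * np.linalg.norm(w_B))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def dispersion(directions):
    """Trace of the sample covariance of R estimated direction vectors (an R x d array)."""
    return float(np.trace(np.cov(np.asarray(directions), rowvar=False)))

def rank_comp(w, w_star):
    """Proportion of variable pairs whose importance order (by |entry|) disagrees."""
    a, b = np.abs(w), np.abs(w_star)
    d = len(a)
    bad = sum((a[i] - a[j]) * (b[i] - b[j]) < 0
              for i in range(d) for j in range(i + 1, d))
    return bad / (d * (d - 1) / 2)
```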
More importantly, it shows that when θ decreases (from 1 to 0, or equivalently as FLAME changes from SVM to DWD), the dispersion and the angle both decrease, which is promising because it shows that FLAME improves SVM in terms of the overfitting issue.

We also investigate the effect of different covariance structures, since the independence structure among variables as in the last subsection is not realistic in real applications. We investigate three covariance structures: independent, interchangeable and block-interchangeable covariance. Data are generated from two multivariate normal distributions MVN(µ_±, Σ) with d = 300. We first let µ = (75, ···, 0, ···, 0)', then scale it by multiplying a constant c such that the Mahalanobis distance between µ_+ = cµ and µ_− = −cµ equals 5.4, i.e., (µ_+ − µ_−)'Σ^{−1}(µ_+ − µ_−) = 5.4. Note that this represents a reasonable signal-to-noise ratio.

We consider the FLAME machines with different parameter θ from a grid of 11 values (0, 0.1, 0.2, ···, 1), under the settings m = 2, 3, 4 (× three covariance structures). For the independent structure example, Σ = I; for the interchangeable structure example, Σ_ii = 1 and Σ_ij = 0.8 for i ≠ j; for the block-interchangeable structure example, we let Σ be a block diagonal matrix with five diagonal blocks, the sizes of which are 150, 100, 25, 15, 10, and each block is an interchangeable covariance matrix with diagonal entries 1 and off-diagonal entries 0.8.

Figure 6 provides the summary results of the interchangeable structure example. Since the results are similar under different covariance structures, results from the other two covariance structures are included in the online supplementary materials to save space (Figure S.2 for the independent structure, and Figure S.3 for the block-interchangeable covariance).

Figure 6: Interchangeable example. It can be seen that as FLAME turns from DWD to SVM (θ from 0 to 1), the within-class error decreases (top-left), thanks to the more accurate estimate of the intercept term (top-middle). On the other hand, this comes at the cost of larger deviation from the Bayes direction (bottom-left), incorrect rank of the importance of the variables (bottom-middle) and larger stochastic variability of the estimated directions (bottom-right).

In each plot, we include the within-group error (top-left), the absolute value of the difference between the estimated intercept and the Bayes intercept |β − β_B| (top-middle), the angle between the estimated direction and the Bayes direction ∠(ω, ω_B) (bottom-left), the RankComp between the estimated direction and the Bayes direction (bottom-middle) and the dispersion of the estimated directions (bottom-right).

We can see that in Figure 6 (and Figures S.2 and S.3 in the online supplementary materials), when we increase θ from 0 to 1, i.e., when FLAME moves from the DWD end to the SVM end, the within-group error decreases. This is mostly due to the fact that the intercept term β comes closer to the Bayes rule intercept β_B.
On the other hand, the estimated direction deviates from the true direction (larger angle), gives the wrong rank of the variables (larger RankComp) and is more unstable (larger dispersion). Similar observations hold for the other two covariance structures, with one exception in the block-interchangeable setting (Figure S.3), where the RankComp first decreases then increases.

In the entire FLAME family, DWD represents one extreme, which provides better estimation of the direction, is closer to the Bayes direction, provides the right order for all variables, and is more stable. But it suffers from the inaccurate estimation of β in the presence of imbalanced data. SVM represents the other extreme, which is not sensitive to imbalanced data and usually provides a good estimation of β, but is in general outperformed by DWD in terms of closeness to the Bayes optimal direction. In most situations, within the FLAME family, there is no single machine that is better than both ends in the two aspects at the same time.

The observations above motivate the use of the equal-trade-off parameter introduced in Section 4.1. In the next subsection, we will compare this parameter choice with other options.

We have suggested an equal-trade-off parameter based on the plots of the within-group error and the RankComp (see Section 4.1) and justified the use of an adaptive θ based on an iterative procedure (see details in Section 4.2). Figure 7 compares FLAME with these two choices, and with θ = 0 (DWD) and 1 (SVM). Various covariance structures (independent, interchangeable and block-interchangeable) are investigated. To save space, we only show the results for the block-interchangeable dependence structure, as this is more realistic in many real applications in genomic science and other applications. Here the full dimensions (d = 80, ···, 600 or 900) are divided into three blocks (50%, 25% and 25% of d). The total sample size is 240 and the imbalance factor m is 3 (moderately imbalanced).

Figure 7: Comparison of four FLAMEs with θ = 0, 1, the suggested θ introduced in Section 4.1 and the adaptive θ (after one step) introduced in Section 4.2 for a simulated example with block-interchangeable dependence structure, in terms of the within-group error, deviation from the true intercept term β_B, deviation from the true direction ω_B, and the RankComp from ω_B. Intermediate FLAMEs provide improvements over DWD for the first two measures and over SVM for the last two measures.

In Figure 7, we compare the within-group error, |β − β_B|, ∠(ω, ω_B), and RankComp(ω, ω_B). We see that these intermediate FLAMEs provide improvements over DWD for the first two measures and over SVM for the last two measures. For relatively small d, the equal-trade-off θ value is very similar to the adaptive θ. For large d, the adaptive θ is closer to DWD than the equal-trade-off θ. For very large d, all four machines encounter difficulty in classification.

In this section we demonstrate the performance of FLAME on a real example: the Human Lung Carcinomas Microarray Dataset, which has been analyzed earlier in Bhattacharjee et al. (2001).
The Human Lung Carcinomas Dataset contains six classes: adenocarcinoma, squamous, pulmonary carcinoid, colon, normal and small cell carcinoma, with sample sizes of 128, 21, 20, 13, 17 and 6 respectively. Liu et al. (2008) used this data set as a test set to demonstrate their proposed significance analysis of clustering approach. We combine the first two subclasses and the last four subclasses to form the positive and negative classes respectively. The sample sizes are 149 and 56, with imbalance factor m = 2.66. The original data contain 12,625 genes. We first filter genes using the ratio of the sample standard deviation and sample mean of each gene and keep 2,530 of them with large ratios (Dudoit et al., 2002, Liu et al., 2008).

We conduct a five-fold cross-validation (CV) to evaluate the within-group error for the two classes over 100 random splits. The RankComp measure is calculated based on the full data set instead of on the samples in a single fold. For each replication, we find a suggested value for θ that leads to the same normalized value for RankComp and within-group error over a grid of θ values. The adaptive value for θ (after one step) is calculated based on the DWD direction using all the samples in the data set.

The average within-group errors and the RankComp measures for DWD, SVM, and the two intermediate FLAMEs (using the adaptive θ and the suggested equal-trade-off θ) are shown in Figure 8.

Note that as we do not know the true rank of importance of genes for this real data application (no Bayes rule direction or the rank of importance that it implies), we use the DWD rank as a surrogate for the truth, since simulation examples show that on average its estimated direction is the closest to the Bayes rule direction. Therefore, the RankComp measure for DWD is 0.

This experiment does show that FLAME opens a new dimension of improving both the classification performance and the interpretative ability of the classifier. The compromise of the FLAME classifier, with the suggested equal-trade-off θ, in terms of the within-group error is very small compared to the improvement obtained in terms of the direction.

Figure 8: The average cross-validation within-group errors and the RankComp measures for DWD, FLAME with the adaptive and equal-trade-off θ parameters and SVM for the Human Lung Carcinomas Dataset.

We would not recommend using either DWD or SVM for conducting prediction and interpretation simultaneously due to their bad performance for at least one criterion. A FLAME classifier with an appropriate parameter may be more suitable for practical use.

In this paper, we thoroughly investigate SVM and DWD on their performance when applied to HDLSS and imbalanced data. A novel family of binary classifiers called FLAME is proposed, where SVM and DWD are the two ends of the spectrum. On the DWD end, the estimation of the intercept term is deteriorated while it provides better estimation of the direction vector, and thus better handles the HDLSS data.
On the other hand, SVM is good at estimating the intercept term but not the direction and is subject to overfitting, and thus is more suitable for imbalanced data but not HDLSS data.

We conduct an extensive study of the asymptotic properties of the FLAME family in three different flavors: the 'd fixed, n → ∞' asymptotics (Fisher consistency), the 'd and n_+ fixed, n_− → ∞' asymptotics (extremely imbalanced data), and the 'n fixed, d → ∞' asymptotics (the HDLSS asymptotics). These results explain the performance we have seen in the simulations and suggest that with a smart choice of θ, FLAME can properly handle both the HDLSS data and the imbalanced data, by improving the estimations of the direction and the intercept term.

The FLAME family can be immediately extended to multi-class classification, as was done for SVM and DWD such as in Weston and Watkins (1999), Crammer and Singer (2000), Lee et al. (2004) or Huang et al. (2012). Another natural extension is variable selection for FLAME.

The FLAME machines generalize the concept of support vectors. In SVM, support vectors are referred to as vectors that sit on or fall into the two hyperplanes corresponding to u ≤ 1 (or u ≤ 1/√C for the modified version of the Hinge loss (2)). In SVM, only support vectors have impact on the final solution. DWD is the other extreme case where all the data vectors have some impact. In the presence of imbalanced sample sizes, the fact that all the data vectors influence the solution causes the optimization to ignore the minority class. The FLAME with 0 < θ < 1 lies between these two extremes. The adaptive choice of θ means that one needs to include as many vectors, and as balanced influential samples, as possible. More vectors usually lead to mitigated overfitting, and balanced sample sizes of the influential vectors from the two classes mean that the sensitivity issue of the intercept term can be alleviated.

The authors are aware that it is possible to implement a two-step procedure to conduct binary linear classification. In the first step, a good direction is found, probably in the fashion of DWD; in the second step, a fine intercept is chosen by borrowing the idea of SVM. However, the theoretical properties of such a procedure are unknown and will be left as a future research direction.

The choice of θ usually depends on the nature of the data and the scientific context. If the users prefer better classification performance over a reasonable discrimination direction for interpretation of the data, θ may be chosen to be closer to 1. If the right direction is the first priority, then θ should be chosen to be closer to 0. Note that, under some circumstances, the primary goal is to obtain a direction vector which can provide a score x^T ω for each observation for further use, and the intercept parameter β is of no use at all. For example, some users may use a receiver operating characteristic (ROC) curve as a graphical tool to evaluate classification performance over different β values instead of using a single β value given by the classifier. In this case, a FLAME machine close to the DWD method may be ideal.

Appendix

Derivation of the modified Hinge loss

Note that the original SVM formulation is

\text{argmin}_{\tilde{\omega}, \tilde{\beta}} \sum_i \big(1 - y_i \tilde{f}(x_i)\big)_+, \quad \text{s.t. } \|\tilde{\omega}\|^2 \le C,

where f̃(x) = x^T ω̃ + β̃. Here the coefficient vector ω̃ does not have unit norm. We let ω = ω̃/√C, β = β̃/√C and f = f̃/√C. Thus the SVM solution is given by

\text{argmin}_{\omega, \beta} \sum_i \big(1 - \sqrt{C} y_i f(x_i)\big)_+, \quad \text{s.t. } \|\omega\| \le 1,
Acknowledgements

The first author's work was partially supported by Binghamton University Harpur College Dean's New Faculty Start-up Funds and a collaboration grant from the Simons Foundation.

References

Ahn, J. and Marron, J. (2010), "The maximal data piling direction for discrimination," Biometrika, 97, 254-259.

Ahn, J., Marron, J., Muller, K., and Chi, Y. (2007), "The high-dimension, low-sample-size geometric representation holds under mild conditions," Biometrika, 94, 760-766.

Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., et al. (2001), "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses," Proceedings of the National Academy of Sciences, 98, 13790-13795.

Chawla, N., Japkowicz, N., and Kotcz, A. (2004), "Editorial: special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, 6, 1-6.

Cortes, C. and Vapnik, V. (1995), "Support-vector networks," Machine Learning, 20, 273-297.

Crammer, K. and Singer, Y. (2000), "On the learnability and design of output codes for multiclass problems," in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.

Duda, R., Hart, P., and Stork, D. (2001), Pattern Classification, Wiley.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, 97, 77-87.

Hall, P., Marron, J. S., and Neeman, A. (2005), "Geometric representation of high dimension, low sample size data," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 427-444.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition), Springer.

Huang, H., Liu, Y., Du, Y., Perou, C. M., Hayes, D. N., Todd, M. J., and Marron, J. (2012), "Multiclass distance weighted discrimination," Journal of Computational and Graphical Statistics.

Lee, Y., Lin, Y., and Wahba, G. (2004), "Multicategory support vector machines," Journal of the American Statistical Association, 99, 67-81.

Lin, Y. (2004), "A note on margin-based loss functions in classification," Statistics & Probability Letters, 68, 73-82.

Liu, Y., Hayes, D., Nobel, A., and Marron, J. (2008), "Statistical significance of clustering for high-dimension, low-sample size data," Journal of the American Statistical Association, 103, 1281-1293.

Marron, J., Todd, M., and Ahn, J. (2007), "Distance-weighted discrimination," Journal of the American Statistical Association, 102, 1267-1271.

Owen, A. (2007), "Infinitely imbalanced logistic regression," The Journal of Machine Learning Research, 8, 761-773.

Qiao, X. and Liu, Y. (2009), "Adaptive weighted learning for unbalanced multicategory classification," Biometrics, 65, 159-168.

Qiao, X., Zhang, H., Liu, Y., Todd, M., and Marron, J. (2010), "Weighted distance weighted discrimination and its asymptotic properties," Journal of the American Statistical Association, 105, 401-414.

Smola, A. J., Bartlett, P. L., Schölkopf, B., and Schuurmans, D. (2000), Advances in Large Margin Classifiers, vol. 1, MIT Press, Cambridge, MA.

Toh, K., Todd, M., and Tütüncü, R. (1999), "SDPT3 - a MATLAB software package for semidefinite programming, version 1.3," Optimization Methods and Software, 11, 545-581.

Tütüncü, R., Toh, K., and Todd, M. (2003), "Solving semidefinite-quadratic-linear programs using SDPT3," Mathematical Programming, 95, 189-217.

Vapnik, V. (1998), Statistical Learning Theory, Wiley.

Weston, J. and Watkins, C. (1999), "Support vector machines for multi-class pattern recognition," in European Symposium on Artificial Neural Networks, pp. 219-224.

Zhang, L. and Lin, X. (2011), "Some considerations of classification for high dimension low-sample size data," Statistical Methods in Medical Research.
Online supplementary materials

This document provides some additional details for the main paper, Flexible High-dimensional Classification Machines and Their Asymptotic Properties. It includes how we implement the FLAME machine with a pre-defined θ, detailed proofs of several theorems and propositions, and figures for additional simulations.

Implementation

In order to implement the FLAME algorithm, we introduce several new notations. Let $S^{d+1}$ be the second-order cone in the $(d+1)$-dimensional space, $S^{d+1} = \{(t_0, t_1, \dots, t_d)' : t_0 \ge \sqrt{\sum_{i=1}^{d} t_i^2}\}$. Note that $r_i$ and $1/r_i$ can be substituted by three auxiliary variables $\rho_i$, $\sigma_i$ and $\tau_i$ which satisfy $\rho_i + \sigma_i = r_i$, $\rho_i - \sigma_i = 1/r_i$, and $\tau_i = 1$. Then $\rho_i^2 = \sigma_i^2 + \tau_i^2$, and thus $(\rho_i, \sigma_i, \tau_i)' \in S^3$. Let $w = 1$; then $(w; \omega) \in S^{d+1}$ since $\|\omega\| \le 1$. Let $\eta_i \ge 0$ and $\varphi_i \ge 0$, where $\varphi_i$ and $\eta_i$ can be viewed as the positive and negative parts of $(1/r_i + C\xi_i - \theta\sqrt{C})$, i.e., $\varphi_i - \eta_i = 1/r_i + C\xi_i - \theta\sqrt{C}$.

With the reparameterization above, FLAME can be viewed as the following optimization problem:
$$\min_{\beta, w, \omega, \rho_i, \sigma_i, \tau_i, \xi_i, \eta_i, \varphi_i} \ \sum_{i=1}^{n} \varphi_i$$
subject to
$$y_i(x_i^T\omega + \beta) + \xi_i - \rho_i - \sigma_i = 0,$$
$$\rho_i - \sigma_i + C\xi_i - \theta\sqrt{C} + \eta_i - \varphi_i = 0,$$
$$w = 1, \quad \tau_i = 1,$$
and $(w; \omega) \in S^{d+1}$, $(\rho_i, \sigma_i, \tau_i)' \in S^3$, $\xi_i \ge 0$, $\eta_i \ge 0$, $\varphi_i \ge 0$.

Therefore, all the constraints can be converted to linear forms, all the variables are either nonnegative, free, or in second-order cones, and the objective function is linear. Such a problem is called a Second-Order Cone Program (SOCP), and can be efficiently solved by software such as SDPT3 (Toh et al., 1999, Tütüncü et al., 2003).
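For readers without access to SDPT3, the same convex program can be prototyped with an off-the-shelf modeling tool. The sketch below (ours, not the authors' implementation; the function name flame_fit is hypothetical) uses the open-source cvxpy package and states FLAME in the equivalent unreparameterized form $\min \sum_i (1/r_i + C\xi_i - \theta\sqrt{C})_+$ subject to $r_i = y_i(x_i^T\omega + \beta) + \xi_i$, $\xi_i \ge 0$ and $\|\omega\| \le 1$; cvxpy reduces this to a cone program internally.

```python
import numpy as np
import cvxpy as cp

def flame_fit(X, y, C=100.0, theta=0.5):
    """Fit a linear FLAME classifier by solving
    min sum_i (1/r_i + C*xi_i - theta*sqrt(C))_+
    s.t. r_i = y_i(x_i'w + b) + xi_i, xi_i >= 0, ||w|| <= 1."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.multiply(y, X @ w + b) + xi          # perturbed functional margins r_i
    loss = cp.sum(cp.pos(cp.inv_pos(r) + C * xi - theta * np.sqrt(C)))
    prob = cp.Problem(cp.Minimize(loss), [cp.norm(w, 2) <= 1])
    prob.solve()
    return w.value, b.value

# Example usage on a small synthetic, imbalanced data set.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, size=(10, 5)), rng.normal(-1, 1, size=(40, 5))])
y = np.concatenate([np.ones(10), -np.ones(40)])
w_hat, b_hat = flame_fit(X, y, C=100.0, theta=0.5)
print("direction:", w_hat, "intercept:", b_hat)
```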
Proof of Theorem 1

It suffices to show that $s(\omega_k, \beta_k, \theta_k) \ge s(\omega_{k+1}, \beta_{k+1}, \theta_{k+1})$. First, $s(\omega_k, \beta_k, \theta_k) \ge s(\omega_k, \beta_k, \theta_{k+1})$, due to the definition of $\theta_{k+1}$ and the fact that $\theta_k \le \theta_{k+1}$. Then $s(\omega_k, \beta_k, \theta_{k+1}) \ge s(\omega_{k+1}, \beta_{k+1}, \theta_{k+1})$, since $\omega_{k+1}$ and $\beta_{k+1}$ minimize $s(\omega, \beta, \theta_{k+1})$.

Proof of Proposition 2

For any $x$, denote $p(x) = \mathrm{P}(Y = +1 \mid X = x)$. The conditional risk is $R(f) \equiv \mathrm{E}[L(Yf(X), \theta) \mid X = x] = L(f(x), \theta)p(x) + L(-f(x), \theta)(1 - p(x))$. For simplicity, we write $L(f(x), \theta)$ as $L(f)$; thereby $R(f) = L(f)p(x) + L(-f)(1 - p(x))$.

We can see that for fixed $p(x) \in (0, 1)$, $R(f)$ is continuous, convex, and piecewise differentiable. Thus we find $f^*(x)$ by solving $R'(f) = 0$, where $R'(f) = L'(f)p(x) + [L(-f)]'(1 - p(x))$. The FLAME loss is
$$L(f) = \begin{cases} (2-\theta)\sqrt{C} - Cf, & f \le 1/\sqrt{C},\\ 1/f - \theta\sqrt{C}, & 1/\sqrt{C} < f \le 1/(\theta\sqrt{C}),\\ 0, & f > 1/(\theta\sqrt{C}). \end{cases}$$
So direct calculation gives us
$$L'(f) = \begin{cases} -C, & f \le 1/\sqrt{C},\\ -1/f^2, & 1/\sqrt{C} < f \le 1/(\theta\sqrt{C}),\\ 0, & f > 1/(\theta\sqrt{C}), \end{cases}
\qquad
[L(-f)]' = \begin{cases} C, & f \ge -1/\sqrt{C},\\ 1/f^2, & -1/(\theta\sqrt{C}) < f \le -1/\sqrt{C},\\ 0, & f < -1/(\theta\sqrt{C}). \end{cases}$$
Finally, if $\frac{1-p(x)}{p(x)} \ne 1$, the solution of $R'(f) = 0$ lies either in $-1/(\theta\sqrt{C}) < f \le -1/\sqrt{C}$ or in $1/\sqrt{C} < f \le 1/(\theta\sqrt{C})$. In the former case, $f^* = -\frac{1}{\sqrt{C}}\sqrt{\frac{1-p(x)}{p(x)}}$, which applies when $\frac{1-p(x)}{p(x)} > 1$. In the latter case, $f^* = +\frac{1}{\sqrt{C}}\Big/\sqrt{\frac{1-p(x)}{p(x)}}$, which applies when $\frac{1-p(x)}{p(x)} < 1$. If $\frac{1-p(x)}{p(x)} = 1$, then $R'(f^*) = 0$ for any $f^* \in [-1/\sqrt{C}, 1/\sqrt{C}]$.

Therefore, $f^*$ satisfies
$$\mathrm{sign}(f^*) = \mathrm{sign}(p(x) - 0.5) = \mathrm{sign}(2p(x) - 1) = \mathrm{sign}\big(p(x) - (1 - p(x))\big) = \mathrm{sign}\Big(\frac{p(x)}{1 - p(x)} - 1\Big).$$
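The sign property above can be checked numerically. The following sketch (ours, not from the paper) evaluates the conditional risk R(f) on a grid and confirms that its minimizer has the same sign as p(x) - 1/2, which is the Fisher-consistency property asserted by Proposition 2.

```python
import numpy as np

def flame_loss(u, C, theta):
    # Piecewise FLAME loss: DWD-type for small margins, zero beyond 1/(theta*sqrt(C)).
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    left = u <= 1.0 / np.sqrt(C)
    mid = (~left) & (u <= 1.0 / (theta * np.sqrt(C)))
    out[left] = (2.0 - theta) * np.sqrt(C) - C * u[left]
    out[mid] = 1.0 / u[mid] - theta * np.sqrt(C)
    return out

def conditional_risk(f, p, C, theta):
    # R(f) = L(f) p + L(-f) (1 - p)
    return flame_loss(f, C, theta) * p + flame_loss(-f, C, theta) * (1.0 - p)

C, theta = 100.0, 0.5
grid = np.linspace(-2.0 / (theta * np.sqrt(C)), 2.0 / (theta * np.sqrt(C)), 200001)
for p in (0.2, 0.4, 0.6, 0.9):
    f_star = grid[np.argmin(conditional_risk(grid, p, C, theta))]
    # The minimizer of the conditional risk carries the sign of p - 1/2.
    assert np.sign(f_star) == np.sign(p - 0.5)
    print(f"p = {p:.1f}  ->  f* = {f_star:+.4f}")
```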
Proof of Theorem 3

Since there are infinitely many negative-class samples, it is reasonable to assume that the classification boundary is pushed closer to the minority positive class, and therefore the functional margin $u_i = y_i f(x_i) = f(x_i)$ for the $i$th vector from the minority positive class is small and its DWD loss is $2\sqrt{C} - Cu_i = 2\sqrt{C} - Cf(x_i)$. Similarly, the DWD loss for the $j$th vector from the majority negative class is $1/[y_j f(x_j)] = -1/f(x_j)$. The objective function for DWD is therefore equivalent to
$$\frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta)\right] - \sum_{j=1}^{n_-}\frac{1}{x_j^T\omega + \beta}\right\} + \lambda\|\omega\|^2.$$
The second term inside the curly brackets above can be approximated by $-n_-\int \frac{1}{x^T\omega + \beta}\,dF_-(x)$, where $F_-(\cdot)$ is the conditional cumulative distribution function for the negative class. The objective function is therefore
$$l_D = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta)\right] - n_-\int\frac{1}{x^T\omega + \beta}\,dF_-(x)\right\} + \lambda\|\omega\|^2.$$

Before we continue, we need the definition of a distribution having a point surrounded (Owen, 2007).

Definition. The distribution $F$ on $\mathbb{R}^d$ has the point $x^*$ surrounded if
$$\int_{(x - x^*)'\omega > \epsilon} dF(x) > \delta,$$
for some $\delta > 0$, some $\epsilon > 0$, and all $\omega \in \mathbb{R}^d$ with $\|\omega\| = 1$.

Consequently, if $F_-$ has $x^*$ surrounded, then there exists $\gamma$ satisfying
$$\inf_{\|\omega\|=1}\int_{(x - x^*)'\omega > 0} dF_-(x) > \gamma \ge 0. \tag{S.1}$$

We observe that
$$\frac{\partial l_D}{\partial\beta} = \frac{1}{n_+ + n_-}\left[-n_+C + n_-\int(x^T\omega + \beta)^{-2}\,dF_-(x)\right]$$
$$\ge \frac{1}{n_+ + n_-}\left[-n_+C + n_-\int_{(x - \bar{x}_+)'\omega \ge 0}\left((x - \bar{x}_+)^T\omega + \bar{x}_+^T\omega + \beta\right)^{-2}dF_-(x)\right]$$
$$\ge \frac{1}{n_+ + n_-}\left[-n_+C + n_-\int_{(x - \bar{x}_+)'\omega \ge 0}(\bar{x}_+^T\omega + \beta)^{-2}\,dF_-(x)\right]
\ge \frac{1}{n_+ + n_-}\left[-n_+C + n_-\gamma(\bar{x}_+^T\omega + \beta)^{-2}\right].$$
Now suppose that $-\sqrt{n_-\gamma/(n_+C)} < \bar{x}_+^T\omega + \beta < 0$; then $n_-\gamma(\bar{x}_+^T\omega + \beta)^{-2} > n_+C$ and $\partial l_D/\partial\beta > 0$. Given the fact that $l_D$ is a strictly convex function, the minimizer $\widehat{\beta} < -\sqrt{n_-\gamma/(n_+C)} - \bar{x}_+^T\omega$.

Proof of Theorem 4

Again, with the imbalance assumption, we assume that the functional margins for the minority positive class are always greater than 0. Note that the penalized empirical loss for the FLAME machine is approximated by
$$l_F = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta) - \theta\sqrt{C}\right] + n_-\int\left(\frac{-1}{x^T\omega + \beta} - \theta\sqrt{C}\right)_+ dF_-(x)\right\} + \lambda\|\omega\|^2.$$
Let $g_j^* = -(x_j^T\omega^* + \beta^*)$, $j = 1, 2, \dots, n_-$, be the functional margins for the negative class. Because $1/\big(g^*_{(n_+)}\sqrt{C}\big) = \theta^*$, that is, the reduced loss $1/g^*_{(n_+)} - \theta^*\sqrt{C} = 0$, and $1/g^*_{(n_+)}$ is the $n_+$-th greatest among all the reciprocal functional margins of the negative class, $1/g^*_j = -1/(x_j^T\omega^* + \beta^*)$, there are at most $n_+$ negative-class samples whose reduced losses are $\ge 0$. Assume that there are $n_o \le n_+$ such samples.

For a random sample $(X, Y)$ from the negative class, let $E$ be the event that $\big(Y(X^T\omega^* + \beta^*)\big)^{-1} \ge \theta^*\sqrt{C}$. From the argument above, $\mathrm{P}(E)$ is approximately $n_o/n_-$. Then the integral
$$\int\left(\frac{-1}{x^T\omega + \beta} - \theta\sqrt{C}\right)_+ dF_-(x)$$
equals
$$\mathrm{E}\left[\Big(\tfrac{-1}{X^T\omega + \beta} - \theta\sqrt{C}\Big)_+ \,\Big|\, E^c\right]\mathrm{P}(E^c) + \mathrm{E}\left[\Big(\tfrac{-1}{X^T\omega + \beta} - \theta\sqrt{C}\Big)_+ \,\Big|\, E\right]\mathrm{P}(E)
\approx 0\cdot\Big(1 - \tfrac{n_o}{n_-}\Big) + \mathrm{E}\left[\Big(\tfrac{-1}{X^T\omega + \beta} - \theta\sqrt{C}\Big) \,\Big|\, E\right]\tfrac{n_o}{n_-}.$$
We then have
$$l_F = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta) - \theta\sqrt{C}\right] + n_o\int\left(\frac{-1}{x^T\omega + \beta} - \theta\sqrt{C}\right) dF_-(x \mid E)\right\} + \lambda\|\omega\|^2.$$
Here, $dF_-(x \mid E)$ is the conditional distribution function of $X$ for the negative class given event $E$.

Setting
$$\frac{\partial l_F}{\partial\beta} = 0 = \frac{1}{n_+ + n_-}\left\{-Cn_+ + n_o\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E)\right\},$$
we have
$$\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E) = C\frac{n_+}{n_o}.$$
Setting
$$\frac{\partial l_F}{\partial\omega} = 0 = \frac{1}{n_+ + n_-}\left\{-Cn_+\bar{x}_+ + n_o\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E)\right\} + \lambda\omega^*,$$
we have
$$\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E) = -\frac{n_+ + n_-}{n_o}\lambda\omega^* + C\frac{n_+}{n_o}\bar{x}_+.$$
Furthermore,
$$\frac{\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E)}{\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E)} = -\frac{n_+ + n_-}{n_+}\frac{\lambda}{C}\omega^* + \bar{x}_+ = -(1 + m)\frac{\lambda}{C}\omega^* + \bar{x}_+.$$
That is,
$$\omega^* = \frac{C}{(1 + m)\lambda}\left[\bar{x}_+ - \frac{\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E)}{\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E)}\right].$$
Proof of Theorem 5

For simplicity we use the original SVM formulation with the Hinge loss function instead of the FLAME formulation. The objective function for SVM is equivalent to
$$l_S = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[1 - (x_i^T\omega + \beta)\right] + n_-\int\left[1 + x^T\omega + \beta\right]_+ dF_-(x)\right\} + \lambda\|\omega\|^2$$
$$= \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[1 - (x_i^T\omega + \beta)\right] + n_-\int\left[1 + x^T\omega + \beta\right]\mathbb{1}_{\{1 + x^T\omega + \beta > 0\}}\,dF_-(x)\right\} + \lambda\|\omega\|^2.$$
Setting $\partial l_S/\partial\beta = 0$, we have
$$\frac{\partial l_S}{\partial\beta} = \frac{1}{n_+ + n_-}\left\{-n_+ + n_-\mathrm{P}(G;\omega,\beta) + n_-\int(1 + x^T\omega + \beta)\,\delta(1 + x^T\omega + \beta)\,dF_-(x)\right\}
= \frac{1}{n_+ + n_-}\left\{-n_+ + n_-\mathrm{P}(G;\omega,\beta)\right\} = 0,$$
where $\delta(\cdot)$ is the Dirac delta function. It follows that $\mathrm{P}(G;\widehat{\omega},\widehat{\beta}) = n_+/n_- = 1/m$.

Moreover,
$$\frac{\partial l_S}{\partial\omega} = \frac{1}{n_+ + n_-}\left\{-\sum_{i=1}^{n_+}x_i + n_-\int x\,\mathbb{1}_{\{1 + x^T\omega + \beta > 0\}}\,dF_-(x) + n_-\int(1 + x^T\omega + \beta)\,\delta(1 + x^T\omega + \beta)\,x\,dF_-(x)\right\} + \lambda\omega$$
$$= \frac{1}{n_+ + n_-}\left\{-n_+\bar{x}_+ + n_-\int x\,\mathbb{1}_{\{1 + x^T\omega + \beta > 0\}}\,dF_-(x)\right\} + \lambda\omega = 0,$$
which implies
$$\widehat{\omega} = \frac{1}{(n_+ + n_-)\lambda}\left\{n_+\bar{x}_+ - n_-\int x\,\mathbb{1}_{\{1 + x^T\widehat{\omega} + \widehat{\beta} > 0\}}\,dF_-(x)\right\}
= \frac{1}{(n_+ + n_-)\lambda}\left\{n_+\bar{x}_+ - n_-\int x\,dF_-(x \mid G)\,\mathrm{P}(G;\widehat{\omega},\widehat{\beta})\right\}$$
$$\approx \frac{n_+}{(n_+ + n_-)\lambda}\left\{\bar{x}_+ - \int x\,dF_-(x \mid G)\right\} = \frac{1}{(1 + m)\lambda}\left\{\bar{x}_+ - \int x\,dF_-(x \mid G)\right\}.$$

Classification boundaries for HDLSS data

The geometric representation in Hall et al. (2005) leads to some theoretical properties of several binary classifiers. In particular, as $d \to \infty$, the positive class and the negative class converge to an $(n_+ - 1)$-simplex and an $(n_- - 1)$-simplex with random rotation. Note that the (normalized) pairwise distances between observations within each class are the same, and the (normalized) distances between any two observations from two different classes are the same as well. The geometric representations of SVM and DWD from Hall et al. (2005) are summarized as follows.

1. SVM: It was shown that the linear SVM hyperplane, projected to the space spanned by the $N = n_+ + n_-$ data vectors, is given asymptotically by the unique $(N - 2)$-dimensional hyperplane in the $N$-polyhedron formed by the $N$ data vectors; there are $n_+ \times n_-$ such edges. Let $O_+$ be the centroid of the $(n_+ - 1)$-simplex $\mathcal{X}_+(d)$ and $O_-$ the centroid of the $(n_- - 1)$-simplex $\mathcal{X}_-(d)$. It can be further shown that the SVM hyperplane bisects the line segment between $O_+$ and $O_-$.

2. DWD: The case of DWD is a little different, especially when $n_+ \ll n_-$ (or $m \gg 1$). The DWD hyperplane intersects $O_+O_-$ at a point $P$. It can be shown that the two simplices, the DWD hyperplane, and the SVM hyperplane are all orthogonal to $O_+O_-$. Thus all the vertices in the simplex $\mathcal{X}_+$ are equally distant from the DWD hyperplane; this distance is denoted by $a$. Similarly, all the vertices in the simplex $\mathcal{X}_-$ are equally distant from the DWD hyperplane, by a distance $b$. The general-version DWD hyperplane minimizes the sum of the reciprocals of the distances of the data vectors to the hyperplane, $(n_+/a + n_-/b)$, with the constraint that $a + b$ equals a constant (determined by $\mu, \sigma, \tau, n_+, n_-$, and $d$). A simple calculus exercise reveals that $a/b = (n_+/n_-)^{1/2}$.
For the general FLAME case, we need to learn how the hyperplane moves from the point determined by $a/b = (n_+/n_-)^{1/2}$ on $O_+O_-$ to the midpoint of $O_+O_-$ as θ grows from 0 (DWD) to 1 (SVM). First, we consider the general version of FLAME, which seeks to minimize the sum of losses over all data points, $\sum (1/u - \theta\sqrt{C})_+$, where the functional margin $u$ is either $a$ or $b$ for samples from the positive or the negative class, respectively. When $\theta = 0$, the FLAME hyperplane is determined by $a/b = (n_+/n_-)^{1/2} = \sqrt{m^{-1}} < 1$, so that $b^0 > a^0$; that is, the hyperplane is closer to the minority class. We rename the distances $a^0$ and $b^0$, where the superscript "0" refers to the value of θ. When $\theta > 0$ but smaller than $1/(b^0\sqrt{C})$, the hyperplane does not move, because the loss for each data vector becomes $1/a - \theta\sqrt{C}$ or $1/b - \theta\sqrt{C}$, and both are greater than 0. The additional term $-\theta\sqrt{C}$ does not change the minimizer, and thus $a^\theta/b^\theta = (n_+/n_-)^{1/2}$ remains unchanged.

If we keep increasing θ so that it becomes greater than $1/(b^0\sqrt{C})$, then, if the hyperplane did not move, the loss for the majority class would become 0. In this case there is room for improvement: the hyperplane moves gradually towards the majority class, because this makes the loss on the minority class smaller while keeping the loss on the majority class zero. The FLAME hyperplane is then determined by $b = 1/(\theta\sqrt{C})$.

Finally, as θ increases, the distance $a$ increases and the distance $b$ decreases, until a point where $a = b$ and both $1/a - \theta\sqrt{C} = 1/b - \theta\sqrt{C} < 0$. After this point, a further increase of θ does not change the position of the FLAME hyperplane, which remains at the midpoint of $O_+O_-$.

Figure S.1: A 1D toy example with $n_+ < n_-$ and $m = 9$ is used to mimic the d-asymptotic situation. The length of the line segment $O_+O_-$ equals 1. As θ increases, the FLAME hyperplane first stands still ($|O_+P|$ unchanged); when $\theta > (1 + \sqrt{1/m})/\sqrt{C}$, $|O_+P|$ increases, which means the hyperplane moves towards the negative class, until $\theta = 2/\sqrt{C}$, after which the hyperplane remains at the midpoint of $O_+O_-$.

The derivation above assumes that the distance between the two simplices is reasonably large, at least greater than $2/\sqrt{C}$. This is not difficult to achieve because we choose $C$ to be a large number.

In summary, the intersection $P$ of the FLAME hyperplane and $O_+O_-$ stays closer to the minority class and remains still when θ is small. As θ increases, the boundary moves towards the majority class, until it reaches the midpoint of $O_+O_-$. This explains the simulation performance observed in Figures S.2, 6 and S.3. We use a toy example to show how the position of the FLAME hyperplane moves as θ increases in Figure S.1, in the same fashion as discussed above.

It is worth noting that the value of DWD/FLAME in terms of reducing overfitting is maximal when the dimension is greater than, but close to, the sample size. This is when data-piling starts to appear in SVM but not yet in DWD. Marron et al. (2007) showed some videos of this phenomenon. As a matter of fact, according to the geometric representation above, in the d-asymptotics the discriminant directions for most classifiers are the same. Moreover, the projections of the data points in the same class onto $O_+O_-$ coincide, and $O_+O_-$ gives the normal vector for the DWD, SVM and FLAME hyperplanes. Therefore, they all exhibit data-piling in the d-asymptotics.

Derivation of the FLAME hyperplane in d asymptotics

FLAME seeks to minimize
$$n_+\left(1/a - \theta\sqrt{C}\right)_+ + n_-\left(1/b - \theta\sqrt{C}\right)_+ \tag{S.2}$$
subject to
$$a + b = \sqrt{d\left(\mu^2 + \sigma^2/n_+ + \tau^2/n_-\right)} = \nu\sqrt{d}. \tag{S.3}$$

When $\theta \in \left[0,\ (1 + \sqrt{m^{-1}})/(\nu\sqrt{dC})\right)$, it is easy to verify that both $(1/a - \theta\sqrt{C})_+$ and $(1/b - \theta\sqrt{C})_+$ are positive and equal to $1/a - \theta\sqrt{C}$ and $1/b - \theta\sqrt{C}$. In this case, the optimal solutions of problem (S.2), $a^0$ and $b^0$, satisfy $a^0/b^0 = (n_+/n_-)^{1/2} = \sqrt{m^{-1}}$. In particular, $a^0 = \sqrt{m^{-1}}/(1 + \sqrt{m^{-1}})\,\nu\sqrt{d}$ and $b^0 = 1/(1 + \sqrt{m^{-1}})\,\nu\sqrt{d}$.

When $\theta \in \left[(1 + \sqrt{m^{-1}})/(\nu\sqrt{dC}),\ 2/(\nu\sqrt{dC})\right)$, we have $1/b^0 - \theta\sqrt{C} < 0$, and the solution satisfies $b = 1/(\theta\sqrt{C})$ and $a = \nu\sqrt{d} - b$. Note that $a$ now increases with θ while $b$ decreases.

When $\theta \in \left[2/(\nu\sqrt{dC}),\ 1\right]$, $a = b = 0.5\,\nu\sqrt{d}$.
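The three regimes can be visualized with a few lines of code. The sketch below (ours, not the code behind Figure S.1) computes the position $|O_+P|$ of the FLAME hyperplane on the segment $O_+O_-$ as a function of θ in the toy setting $m = 9$ and $|O_+O_-| = 1$.

```python
import numpy as np

def flame_offset(theta, C=100.0, m=9.0, length=1.0):
    """Position |O+P| of the FLAME hyperplane on the segment O+O- (of given length)
    in the d-asymptotic toy example, following the three regimes derived above."""
    s = np.sqrt(1.0 / m)                    # sqrt(m^{-1}) = sqrt(n+/n-)
    a0 = s / (1.0 + s) * length             # DWD position, a0/b0 = sqrt(n+/n-)
    t1 = (1.0 + s) / (length * np.sqrt(C))  # first threshold
    t2 = 2.0 / (length * np.sqrt(C))        # second threshold
    if theta < t1:
        return a0                               # hyperplane does not move
    elif theta < t2:
        return length - 1.0 / (theta * np.sqrt(C))   # b = 1/(theta*sqrt(C))
    return 0.5 * length                         # midpoint of O+O-

for theta in np.linspace(0.0, 0.3, 7):
    print(f"theta = {theta:.2f}  ->  |O+P| = {flame_offset(theta):.3f}")
```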
Proof of Theorem 6

We only need to prove the sure-classification result for the second interval, i.e., $\theta \in \left[(1 + \sqrt{m^{-1}})/(\nu\sqrt{dC}),\ 2/(\nu\sqrt{dC})\right)$. The proofs for the other two intervals are similar to those in Qiao et al. (2010).

It was shown in Hall et al. (2005) and Qiao et al. (2010) that the length of the line segment $O_+O_-$ is $\sqrt{d}\,\nu$, and that the distance between the projection (denoted $P'$) of a new data point from the $\mathcal{X}_+$-population onto $O_+O_-$ and the centroid $O_+$ of the positive class is $(\sigma^2/n_+)/(\mu^2 + \tau^2/n_-)$ times its distance to the centroid $O_-$ of the negative class, i.e., $|O_+P'|/|O_-P'| = (\sigma^2/n_+)/(\mu^2 + \tau^2/n_-)$, where $|AB|$ denotes the length of the line segment connecting points $A$ and $B$. Denote $|O_+P'|$ by $a'$ and $|O_-P'|$ by $b'$. Because $a' + b' = \sqrt{d}\,\nu$, we must have $b' = \sqrt{d}\,(\mu^2 + \tau^2/n_-)/\nu$. For this new data point to be correctly classified to the positive class, $P'$ has to lie on the same side as $O_+$ with respect to the intersection of the FLAME hyperplane with $O_+O_-$, that is, $b' > b$:
$$b' > b \iff \frac{\sqrt{d}\,(\mu^2 + \tau^2/n_-)}{\nu} > \frac{1}{\theta\sqrt{C}} \iff \mu^2 + \frac{\tau^2}{n_-} > \frac{\nu}{\theta\sqrt{dC}} \iff \nu^2 - \frac{\sigma^2}{n_+} > \frac{\nu}{\theta\sqrt{dC}}$$
$$\iff \nu^2 - \frac{\nu}{\theta\sqrt{dC}} - \frac{\sigma^2}{n_+} > 0 \iff \Big(\nu - \frac{1}{2\theta\sqrt{dC}}\Big)^2 - \frac{1}{4\theta^2 dC} - \frac{\sigma^2}{n_+} > 0$$
$$\impliedby \nu > \sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}$$
$$\iff \mu^2 > \Bigg[\sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}\Bigg]^2 - \frac{\sigma^2}{n_+} - \frac{\tau^2}{n_-} \iff \mu^2 > T - \frac{\tau^2}{n_-}.$$

We now assume that $P'$ is the projection of a new data point from the $\mathcal{X}_-$-population. In this situation, it can be shown that $a'/b' = (\mu^2 + \sigma^2/n_+)/(\tau^2/n_-)$, and thus $b' = \sqrt{d}\,(\tau^2/n_-)/\nu$. To correctly classify this new data point, we only need $b' < b$. That is,
$$\frac{\sqrt{d}\,\tau^2/n_-}{\nu} < b = \frac{1}{\theta\sqrt{C}} \iff \frac{\tau^2}{n_-} < \frac{\nu}{\theta\sqrt{dC}} = \frac{\sqrt{\mu^2 + \tau^2/n_- + \sigma^2/n_+}}{\theta\sqrt{dC}}.$$
It suffices to show that $\frac{\tau^2}{n_-} < \frac{\sqrt{\tau^2/n_- + \sigma^2/n_+}}{\theta\sqrt{dC}}$. Let $q = \tau^2/n_- + \sigma^2/n_+$. We need to show that
$$q - \frac{\sigma^2}{n_+} < \frac{\sqrt{q}}{\theta\sqrt{dC}} \iff \Big(\sqrt{q} - \frac{1}{2\theta\sqrt{dC}}\Big)^2 - \frac{1}{4\theta^2 dC} - \frac{\sigma^2}{n_+} < 0$$
$$\impliedby \sqrt{q} < \sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}$$
$$\iff \frac{\tau^2}{n_-} < \Bigg[\sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}\Bigg]^2 - \frac{\sigma^2}{n_+} \iff \frac{\tau^2}{n_-} < T.$$
The last inequality is the condition stipulated in the theorem.

Additional figures

Figure S.2: Independent example. It can be seen that as FLAME turns from DWD to SVM (θ from 0 to 1), the within-class error decreases (top-left), thanks to the more accurate estimate of the intercept term (top-middle). On the other hand, this comes at the cost of a larger deviation from the Bayes direction (bottom-left), an incorrect rank of the importance of the variables (bottom-middle), and larger stochastic variability of the estimated directions (bottom-right). Panels show the within-group error, |β - true β|, the angle from the theoretical Bayes direction, RankComp, and the dispersion, each as a function of θ for m = 2, 3, 4, where m is the ratio of sample sizes, the true β is 0, and the dispersion is the trace of the sample covariance of the estimated directions.

Figure S.3: Block interchangeable example. It can be seen that as FLAME turns from DWD to SVM (θ from 0 to 1), the same trade-off as in the independent example of Figure S.2 is observed, with panels arranged as in Figure S.2 for m = 2, 3, 4.