Flexible High-dimensional Classification Machines and Their Asymptotic Properties
Xingye Qiao∗
Department of Mathematical Sciences, State University of New York, Binghamton, NY 13902-6000. E-mail: [email protected]
Lingsong Zhang
Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: [email protected]

∗ Corresponding author

Abstract

Classification is an important topic in statistics and machine learning with great potential in many real applications. In this paper, we investigate two popular large margin classification methods, Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD), under two contexts: the high-dimensional, low-sample size data and the imbalanced data. A unified family of classification machines, the FLexible Assortment MachinE (FLAME), is proposed, within which DWD and SVM are special cases. The FLAME family helps to identify the similarities and differences between SVM and DWD. It is well known that many classifiers overfit the data in the high-dimensional setting; others are sensitive to the imbalanced data, that is, the class with a larger sample size overly influences the classifier and pushes the decision boundary towards the minority class. SVM is resistant to the imbalanced data issue, but it overfits high-dimensional data sets by showing the undesired data-piling phenomenon. The DWD method was proposed to improve SVM in the high-dimensional setting, but its decision boundary is sensitive to the imbalanced ratio of sample sizes. Our FLAME family helps to understand an intrinsic connection between SVM and DWD, and improves both methods by providing a better trade-off between sensitivity to the imbalanced data and overfitting the high-dimensional data. Several asymptotic properties of the FLAME classifiers are studied. Simulations and real data applications are investigated to illustrate the usefulness of the FLAME classifiers.
Key Words and Phrases:
Classification; Discriminant analysis; Fisher consistency; High-dimensional, low-sample size asymptotics; Imbalanced data; Support Vector Machine.

1 Introduction
Classification refers to predicting the class label, y ∈ C, of a data object based on its covariates, x ∈ X. Here C is the space of class labels, and X is the space of the covariates. Usually we consider X ≡ R^d, where d is the number of variables or the dimension. See Duda et al. (2001) and Hastie et al. (2009) for comprehensive introductions to many popular classification methods. When C = {+1, −1}, this is an important class of classification problems, called binary classification. The classification rule for a binary classifier usually has the form φ(x) = sign{f(x)}, where f(x) is called the discriminant function. Linear classifiers are the most important and the most commonly used classifiers, as they are often easy to interpret in addition to having reasonable classification performance. We focus on linear classifiers in this article. In the above formula, linear classifiers correspond to f(x; ω, β) = x^T ω + β. The sample space is divided into halves by the separating hyperplane, also known as the classification boundary, defined by {x : f(x) ≡ x^T ω + β = 0}. Note that the coefficient vector ω ∈ R^d defines the normal vector, and hence the direction, of the classification boundary, and the intercept term β ∈ R defines the location of the classification boundary.

In this paper, two popular classification methods, Support Vector Machine (SVM; Cortes and Vapnik, 1995, Vapnik, 1998, Cristianini and Shawe-Taylor, 2000) and Distance Weighted Discrimination (DWD; Marron et al., 2007, Qiao et al., 2010), are investigated under two important contexts: the High-Dimensional, Low-Sample Size (HDLSS) data and the imbalanced data. Both methods are large margin classifiers (Smola et al., 2000), which seek separating hyperplanes that maximize certain notions of gap (i.e., distances) between the two classes. The investigation of the performance of SVM and DWD motivates the invention of a novel family of classifiers, the FLexible Assortment MachinE (FLAME), which unifies the two classifiers, and helps to understand their connections and differences.

1.1 Motivation: Pros and Cons of SVM and DWD
SVM is a very popular classifier in statistics and machine learning. It has been shown to have Fisher consistency, i.e., when the sample size goes to infinity, its decision rule converges to the Bayes rule (Lin, 2004). SVM has several nice properties. 1) Its dual formulation is relatively easy to implement (by Quadratic Programming). 2) SVM is robust to the model specification, which makes it very popular in various real applications. However, when applied to HDLSS data, it has been observed that a large portion of the data (usually the support vectors, to be properly defined later) lie on two hyperplanes parallel to the SVM classification boundary. This is known as the data-piling phenomenon (Marron et al., 2007, Ahn and Marron, 2010). Data-piling of SVM indicates a type of overfitting. Other overfitting phenomena of SVM under the HDLSS context include:

1. The angle between the SVM direction and the Bayes rule direction is usually large.

2. The variability of the sampling distribution of the SVM direction ω is very large (Zhang and Lin, 2011). Moreover, because the separating hyperplane is decided only by the support vectors, the SVM direction tends to be unstable, in the sense that small turbulence or measurement error to the support vectors can lead to a big change of the direction.

3. In some cases, the out-of-sample classification performance may not be optimal due to the suboptimal direction of the estimated SVM discrimination direction.

DWD is a recently developed classifier to improve SVM in the HDLSS setting. It uses a different notion of gap from SVM. While SVM maximizes the smallest distance between classes, DWD maximizes a special average distance (harmonic mean) between classes. It has been shown in many earlier simulations that DWD largely overcomes the overfitting (data-piling) issue and it usually gives a better discrimination direction.

On the other hand, the intercept term β of the DWD method is sensitive to the sample size ratio between the two classes, i.e., to the imbalanced data (Qiao et al., 2010). Note that, even though a good discriminant direction ω is more important in revealing the profiling difference between the two populations, the classification/prediction performance heavily depends on the intercept β, more than on the direction ω. As shown in Qiao et al. (2010), usually the β of the SVM classifier is not sensitive to the sample size ratio, while the β of the DWD method will become too large (or too small) if the sample size of the positive class (or negative class) is very large.

In summary, both methods have pros and cons. SVM has larger stochastic variability and usually overfits the data by showing the data-piling phenomenon, but is less sensitive to the imbalanced data issue. DWD usually overcomes the overfitting/data-piling issue, and has smaller sampling variability, but is very sensitive to the imbalanced data. Driven by their similarity, we propose a unified class of classifiers, FLAME, in which the above two classifiers are special cases. FLAME provides a framework to study the connections and differences between SVM and DWD. Each FLAME classifier has a parameter θ which is used to control the performance balance between overfitting the HDLSS data and the sensitivity to the imbalanced data. It turns out that the DWD method is FLAME with θ = 0, and the SVM method corresponds to FLAME with θ = 1. The optimal θ depends on the trade-off among several factors: stochastic variability, overfitting and resistance against the imbalanced data.
In this paper, we also propose two approaches to select θ, where the resulting FLAME has a balanced performance between the SVM and DWD methods.

The rest of the paper is organized as follows. Section 2 provides toy examples and highlights the strengths and drawbacks of SVM and DWD on classifying the HDLSS and imbalanced data. We develop the FLAME method in Section 3, which is motivated by the investigation of the loss functions of SVM and DWD. Section 4 provides suggestions for the parameters. Three types of asymptotic results for the FLAME classifier are studied in Section 5. Section 6 discusses its properties using simulation experiments. A real application is discussed in Section 7. Some concluding remarks and discussions are made in Section 8.
2 Comparison of SVM and DWD
In this section, we use several toy examples to illustrate the strengths and drawbacks of SVM and DWD under two contexts: HDLSS data and imbalanced data.
We use simulations to compare SVM and DWD. The results show that the stochastic variability of the SVM direction is usually larger than that of the DWD method, and SVM directions deviate farther away from the Bayes rule directions. In addition, the newly proposed FLAME machine (see details in Section 3) is also included in the comparison, and it turns out that FLAME lies between the other two.

Figure 1 shows the comparison results between SVM, DWD and FLAME (with some chosen tuning parameters). We simulate 10 samples with the same underlying distribution. Each simulated data set contains 12 variables and two classes, with 120 observations in each class. The two classes have mean difference on only the first three dimensions and the within-class covariances are diagonal, that is, the variables are independent. For each simulated data set, we plot the first three components of the resulting discriminant directions from SVM, DWD and FLAME (after normalizing the 3D vectors to have unit norms), as shown in Figure 1. It clearly shows that the DWD directions (the blue down-pointing triangles) are the closest ones to the true Bayes rule direction (shown as the cyan diamond marker) among the three approaches. In addition, the DWD directions have a smaller variation (i.e., they are more stable) over different samples. The SVM directions (the red up-pointing triangles) are farthest from the true Bayes rule direction and have a larger variation than the other two methods.
To highlight the direction variabilities of the three methods, we introduce a novel measure for the variation (unstableness) of the discriminant directions: the trace of the sample covariance of the resulting direction vectors over the 10 replications, which we name as dispersion. The dispersion for the DWD method (0.0031) is much smaller than that for SVM (0.0453); the dispersion of FLAME with θ = 0.5 (0.0105) lies in between, which is better than SVM but worse than DWD.

Figure 1: The true population mean difference direction vector (the cyan dashed line and diamond marker; equivalent to the Bayes rule direction), the DWD directions (blue down-pointing triangles), the FLAME directions with θ = 0.5 (the magenta squares), and the SVM directions (the red up-pointing triangles).

Besides the stochastic variability and the deviation from the true direction comparisons shown above, DWD outperforms SVM in terms of stability in the presence of small perturbations applied to some observations. In Figure 2, we use a two-dimensional example to illustrate this phenomenon. We simulate a perfectly separable 2-dimensional data set.
Figure 2: A 2D example shows that the unstable SVM boundary has changed due to a small turbulence of a support vector (the solid red triangle and diamond) while the DWD boundary remains almost still.

The theoretical Bayes rule decision boundary is shown as the thick black line. The dashed red line and the dash-dotted blue line are the SVM and the DWD classification boundaries before the perturbation. We then move one observation in the positive group a little (from the solid triangle to the solid diamond as shown in the figure). This perturbation leads to a large change of direction in SVM (shown as the dotted red line), but a small change for DWD (shown as the solid blue line). Note that all four hyperplanes are capable of classifying this training data set perfectly. But it may not be true for an out-of-sample test set. This example shows that a small perturbation may lead to unstableness in SVM.
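The perturbation experiment in Figure 2 is easy to replicate in spirit with an off-the-shelf linear SVM. The snippet below is an illustrative sketch (not the authors' code, with made-up toy data): it fits a linear SVM, nudges one support vector slightly, refits, and reports how far the normal vector of the boundary rotates.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# two well-separated 2D classes (illustrative toy data)
X = np.vstack([rng.normal([2.0, 2.0], 0.5, (20, 2)),
               rng.normal([-2.0, -2.0], 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

def boundary_direction(X, y):
    """Fit a linear SVM and return its unit normal vector and one support vector index."""
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w), clf.support_[0]

w_before, sv_idx = boundary_direction(X, y)

# perturb one support vector by a small amount and refit
X_moved = X.copy()
X_moved[sv_idx] += np.array([0.3, -0.3])
w_after, _ = boundary_direction(X_moved, y)

angle = np.degrees(np.arccos(np.clip(np.dot(w_before, w_after), -1.0, 1.0)))
print(f"rotation of the SVM normal vector: {angle:.1f} degrees")
```

The same exercise with a DWD fit would typically show a much smaller rotation, which is the point the figure makes.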
In the last subsection, we have shown that DWD outperforms SVM in estimating the discrimination direction, that is, DWD directions are closer to the Bayes rule discrimination directions and have smaller variability. However, it was found that the location of the DWD classification boundary, which is characterized by the intercept β, is sensitive to the sample size ratio between the two classes (Qiao et al., 2010).

Usually, a good discriminant direction ω helps to reveal the profiling difference between two classes of populations. But the classification/prediction performance heavily depends on the location coefficient β. We define the imbalance factor m ≥ 1 as the sample size ratio between the majority class and the minority class. The β in the SVM classifier is not sensitive to m. However, the β for the DWD method is very sensitive to m. We also notice that, as a consequence, the DWD separating hyperplane will be pushed toward the minority class when the ratio m is close to infinity, i.e., DWD classifiers tend to ignore the minority class. Again, we use a toy example in order to better illustrate the impact of the imbalanced data on β and on the classification performance.

Figure 3: A 1D example shows that the DWD boundary is pushed towards the minority class (blue) when the majority class (red) has tripled its sample size.

Figure 3 uses a one-dimensional example, so that estimating ω is not needed. This also corresponds to a multivariate data set, where ω is estimated correctly first, after which the data set is projected onto ω to form the one-dimensional data. In this plot, the x-coordinates of the red dots and the blue dots are the values of the data while the y-coordinates are random jitters for better visualization. The red and blue curves are the kernel density estimates for both classes. In the top subplot of Figure 3, where m = 1 (i.e., the balanced data), both the DWD (blue lines) and SVM (red lines) boundaries are close to the Bayes rule boundary (black solid line), which sits at 0. In the bottom subplot, the sample size of the red class is tripled, which corresponds to m = 3. Note that the SVM boundary moves a little towards the minority (blue) class, but is still fairly close to the true boundary. The DWD boundary, however, is pushed towards the minority. Although this does not impose immediate problems for the training data set, the DWD classifier will suffer from a great loss of classification performance when it is applied to an out-of-sample data set. It can be shown that when m goes to infinity, the DWD classification boundary will tend to negative infinity, which totally ignores the minority group (see our Theorem 3). However, SVM will not suffer from severe imbalanced data problems. One reason is that SVM only needs a small fraction of the data (called support vectors) for estimating both ω and β, which mitigates the imbalanced data issue naturally.

Imbalanced data issues have been investigated in both statistics and machine learning. See an extensive survey in Chawla et al. (2004). Recently, Owen (2007) studied the asymptotic behavior of infinitely imbalanced binary logistic regression. In addition, Qiao and Liu (2009) and Qiao et al. (2010) proposed to use adaptive weighting approaches to overcome the imbalanced data issue.

In summary, the performance of DWD and SVM differs in the following ways: 1) The SVM direction usually has a larger variation and deviates farther from the Bayes rule direction than the DWD direction does, which are indicators of overfitting HDLSS data. 2) The SVM intercept is not sensitive to the imbalanced data, but the DWD intercept is.
This motivates us to investigate their similarities and differences. In the next section, a new family of classifiers will be proposed, which unifies the above two classifiers.

3 FLAME Family
In this section, we introduce FLAME, a family of classifiers which is motivated by a thorough investigation of the loss functions of SVM and DWD in Section 3.1. The formulation and implementation of the FLAME classifiers are given in Section 3.2.
The key factors that drive the very distinct performances of the SVM and the DWD methods are their associated loss functions (see Figure 4).

Figure 4: FLAME loss functions for three θ values: θ = 0 (equivalent to DWD), θ = 0.5, and θ = 1 (equivalent to SVM/Hinge loss). The parameter C is set to be 1.

Figure 4 displays the loss functions of SVM, DWD and FLAME with some specific tuning parameters. SVM uses the Hinge loss function, H(u) = (1 − u)_+ (the red dashed curve in Figure 4), where u corresponds to the functional margin u ≡ yf(x). Note that the functional margin u can be viewed as the distance of vector x from the separating hyperplane (defined by {x : f(x) = 0}). When u > 0, the data vector is correctly classified; when u < 0, the data vector is wrongly classified. Note that when u > 1, the corresponding Hinge loss equals zero.
Thus, only those observations with u ≤ 1 have influence on the estimation of ω and β. These observations are called support vectors. This is why SVM is insensitive to observations that are far away from the decision boundary, and why it is less sensitive to the imbalanced data issue. However, the fact that only the support vectors have influence makes the SVM solution subject to the overfitting (data-piling) issue. This can be explained as follows: the optimization of SVM tries to push vectors towards small loss, i.e., large functional margin u. But once a vector is pushed to the point where u = 1, the optimization lacks further incentive to continue pushing it towards a larger functional margin, as the Hinge loss cannot be reduced for this vector. Therefore many data vectors pile up along the hyperplanes corresponding to u = 1. Data-piling is bad for generalization because small turbulence to the support vectors could lead to a big difference in the discriminant direction vector (recall the examples in Section 2.1).

The DWD method corresponds to a different DWD loss function,

V(u) = \begin{cases} 2\sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 1/u & \text{otherwise}. \end{cases} \qquad (1)

Here C is a pre-defined constant. Figure 4 shows the DWD loss function with C = 1. It is clear that the DWD loss function is very similar to the SVM loss function when u is small (both are linearly decreasing with respect to u). The major difference is that the DWD loss is always positive. This property makes the DWD method behave in a very different way than SVM. As there is always incentive to make the functional margin larger (and the loss smaller), the DWD loss function kills data-piling, and mitigates the overfitting issue for HDLSS data.

On the other hand, the DWD loss function makes the DWD method very sensitive to the imbalanced data issue, since each observation will have some influence, and thus the larger class will have larger influence. The decision boundary of the DWD method will tend to ignore the smaller class, because sacrificing the smaller class (boundary being closer to the smaller class and farther from the larger class) can lead to a dramatic reduction of the loss, which ultimately leads to a minimized overall loss.

3.2 FLAME

We propose to borrow strengths from both methods to simultaneously deal with both the imbalanced data and the overfitting (data-piling) issues. We first highlight the connections between the DWD loss and a modified version of the Hinge loss (of SVM). Then we modify the DWD loss so that samples far from the classification boundary will have zero loss.

Let f(x) = x^T ω + β. The formulation of SVM can be rewritten (see details in the appendix) in the form of argmin_{ω,β} Σ_i H*(y_i f(x_i)), s.t. ‖ω‖ ≤ 1, where H* is defined as

H*(u) = \begin{cases} \sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 0 & \text{otherwise}. \end{cases} \qquad (2)

Comparing the DWD loss (1) and this modified Hinge loss (2), one can easily see their connections: for u ≤ 1/√C, the DWD loss is greater than the Hinge loss of SVM by an exact constant √C, and for u > 1/√C, the DWD loss is 1/u while the SVM Hinge loss equals 0. Clearly the modified Hinge loss (2) is the result of soft-thresholding the DWD loss at √C. In other words, SVM can be seen as a special case of DWD where the losses of those vectors with u = y_i f(x_i) > 1/√C are shrunken to zero. To allow different levels of soft-thresholding, we propose to use a new loss function which (soft-)thresholds the DWD loss function by a constant θ√C where 0 ≤ θ ≤ 1, that is, a fraction of √C.
The new loss function is

L(u) = \big[V(u) - \theta\sqrt{C}\big]_+ = \begin{cases} (2-\theta)\sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 1/u - \theta\sqrt{C} & \text{if } 1/\sqrt{C} \le u < 1/(\theta\sqrt{C}), \\ 0 & \text{if } u \ge 1/(\theta\sqrt{C}), \end{cases} \qquad (3)

that is, to reduce the DWD loss by a constant, and truncate it at 0. The magenta solid curve in Figure 4 is the FLAME loss when C = 1 and θ = 0.5. This simple but useful modification unifies the DWD and SVM methods. When θ = 1, the new loss function (when C = 1) reduces to the SVM Hinge loss function; while when θ = 0, it remains as the DWD loss.

Note that L(u) = 0 for u > 1/(θ√C). Thus, those data vectors with large functional margins will have zero loss. For the DWD loss, because it corresponds to θ = 0 ⇒ 1/(θ√C) = ∞, no data vector can have zero loss. For the SVM loss, all the data vectors with u > 1/(θ√C) = 1/√C will have zero loss. Training a FLAME classifier with 0 < θ < 1 can be viewed as excluding data that are farther away from the boundary than 1/(θ√C) and assigning zero loss to them. Alternatively, it can be viewed as sampling data that are closer to the boundary than 1/(θ√C) and assigning positive loss to them. Note that the larger θ is, the fewer data are sampled to have positive loss. As one can flexibly choose θ, the new classification method with this new loss function is called the FLexible Assortment MachinE (FLAME).

FLAME can be implemented by a Second-Order Cone Programming algorithm (Toh et al., 1999, Tütüncü et al., 2003). Let θ ∈ [0, 1] be the FLAME parameter. The proposed method minimizes

\min_{\omega, \beta, \xi} \sum_{i=1}^{n} \Big( \frac{1}{r_i} + C\xi_i - \theta\sqrt{C} \Big)_+.

A slack variable ϕ_i ≥ 0 is introduced to replace the (·)_+ function. The optimization of FLAME can be written as

\min_{\omega, \beta, \xi} \sum_i \phi_i, \quad \text{s.t.} \quad \Big( \frac{1}{r_i} + C\xi_i - \theta\sqrt{C} \Big) - \phi_i \le 0, \quad \phi_i \ge 0,
r_i = y_i(x_i^T \omega + \beta) + \xi_i, \quad r_i \ge 0, \quad \xi_i \ge 0, \quad \|\omega\| \le 1.

A Matlab routine has been implemented and is available at the authors' personal websites. See the online supplementary materials for more details on the implementation.
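The optimization above is a convex program; besides the SOCP implementation referenced here, it can be prototyped directly with a generic convex modeling tool. The following is a minimal illustrative sketch (not the authors' Matlab routine) using Python and the cvxpy package; the function name flame_fit, the toy data, and the default solver choice are assumptions made for illustration.

```python
import numpy as np
import cvxpy as cp

def flame_fit(X, y, C=1.0, theta=0.5):
    """Illustrative FLAME solver: minimize
    sum_i (1/r_i + C*xi_i - theta*sqrt(C))_+  subject to
    r_i = y_i (x_i' w + b) + xi_i,  r_i >= 0,  xi_i >= 0,  ||w|| <= 1."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.Variable(n)
    # per-observation FLAME loss: DWD loss soft-thresholded at theta*sqrt(C)
    loss = cp.sum(cp.pos(cp.inv_pos(r) + C * xi - theta * np.sqrt(C)))
    constraints = [r == cp.multiply(y, X @ w + b) + xi,
                   r >= 0,
                   cp.norm(w, 2) <= 1]
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return w.value, b.value

# toy usage: theta = 0 behaves like DWD, theta = 1 like SVM (when C = 1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (30, 5)), rng.normal(-1.0, 1.0, (10, 5))])
y = np.concatenate([np.ones(30), -np.ones(10)])
w_hat, b_hat = flame_fit(X, y, C=1.0, theta=0.5)
print(np.sign(X @ w_hat + b_hat))
```

The soft-thresholded DWD loss is expressed here with cp.pos and cp.inv_pos, which keep the problem in disciplined convex form; a dedicated SOCP solver such as SDPT3, as used in the paper, would be faster for larger problems.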
There are two tuning parameters in the FLAME model: one is C, inherited from the DWD loss, which controls the amount of allowance for misclassification; the other is the FLAME parameter θ, which controls the level of soft-thresholding. Similar to the discussion in DWD (Marron et al., 2007), the classification performance of FLAME is insensitive to different values of C. In addition, it can be shown that for any C, FLAME is Fisher consistent, by applying the general results in Lin (2004). Thus, the default value for C as proposed in Marron et al. (2007) will be used in FLAME. In this section, we introduce two ways of choosing the second parameter θ. As the property and the performance of FLAME depend on the choice of this parameter, it is important to select the right amount of thresholding. In the following two subsections, we discuss two options for choosing the parameter θ. The first option is based on empirical plots resulting from the training data and is of practical use, and it is the θ value that we suggest. The second option is motivated by a theoretical consideration and is heuristically meaningful as well.

Note that an optimal θ depends on the nature of the data and the problems that users have. The optimal θ also depends on two performance measures: insensitivity to the imbalanced data, and resistance to overfitting. However, without prior knowledge, we may want to have a "good" trade-off between them. In this subsection, we suggest the following method to choose θ if no further information is provided.

As will become clear from the simulation examples shown in Section 6.3, we have observed that several performance measures for the FLAME classifiers, for example, the within-group error (see the definition in Section 6 and also in Qiao and Liu (2009)), are monotonically decreasing with respect to θ. On the other hand, performance measures such as the RankComp (see also Section 6) are monotonically increasing functions of θ. The RankComp measure is more related to the overfitting phenomena, and the within-group error is designed for measuring the performance against the imbalanced data. The lesson is that as θ increases, FLAME becomes less sensitive to the imbalanced data issue, but is subject to more overfitting. This motivates us to use the following parameter: the two curves of the two measures are normalized to be between 0 and 1. When θ = 0, the FLAME classifier (equivalent to DWD) has the smallest RankComp measure 0, but the largest within-group error 1. When θ = 1, the FLAME classifier (equivalent to SVM) has the smallest within-group error 0, but the largest RankComp 1. The suggested θ is chosen as the value where the two normalized curves intersect, that is, the normalized within-group error is the same as the normalized RankComp for this θ. This suggested parameter represents a natural trade-off between the two measures: neither measure is absolutely optimal, but each measure compromises by the same relative amount. This suggested parameter is called the equal-trade-off parameter.

Having observed that the DWD discrimination direction is usually closer to the Bayes rule direction, but its location term β is sensitive to the imbalanced data issue, we propose the following alternative data-driven approach to select an appropriate θ. Without loss of generality, we assume that the negative class is the majority class with sample size n_− and the positive class is the minority class with sample size n_+.
We point out that the main reason that DWD is sensitive to the imbalanced data issue is that it uses all vectors in the majority class to build up a classifier. A heuristic strategy to correct this would be to force the optimization to use the same number of vectors from both classes to build up a classifier: we first apply DWD to the data set, and calculate the distances of all data in the majority class to the current DWD classification boundary; we then train FLAME with a carefully chosen parameter θ which assigns positive loss to the closest n_+ data vectors in the majority class to the classification boundary. As a consequence, each class will have exactly n_+ vectors which have positive loss. In other words, while keeping the least imbalance (because we have the same numbers of vectors from both classes that have influence over the optimization), we obtain a model with the least possible overfitting (because 2n_+ vectors have influence, instead of only the limited support vectors as in SVM).

In practice, since the new FLAME classification boundary using the θ chosen above may be different from the initial DWD classification boundary, the n_+ closest points to the FLAME classification boundary may not be the same n_+ closest points to the DWD boundary. This means that it is not guaranteed that exactly n_+ points from the majority class will have positive loss. However, one can expect that a reasonable approximation can be achieved. Moreover, an iterative scheme for finding θ is introduced as follows in order to minimize such discrepancy.

For simplicity, we let (x_i, y_i) with index i be an observation from the positive/minority class and (x_j, y_j) with index j be an observation from the negative/majority class.

Algorithm 1. (Adaptive parameter)
1. Initiate θ_0 = 0.
2. For k = 0, 1, 2, ···,
   (a) Solve the FLAME solutions ω(θ_k) and β(θ_k) given parameter θ_k.
   (b) Let θ_{k+1} = max(θ_k, {g_{(n_+)}(θ_k)√C}^{−1}), where g_j(θ_k) is the functional margin u_j ≡ y_j(x_j^T ω(θ_k) + β(θ_k)) of the jth vector in the negative/majority class and g_{(l)}(θ_k) is the lth order statistic of these functional margins.
3. When θ_k = θ_{k−1}, the iteration stops.

The goal of this algorithm is to make g_{(n_+)}(θ_k) the greatest functional margin among all the data vectors that have positive loss in the negative/majority class. To achieve this, we calibrate θ by aligning g_{(n_+)}(θ_k) to the turning point u = 1/(θ√C) in the definition of the FLAME loss (3), that is, g_{(n_+)}(θ_k) = 1/(θ√C) ⇒ θ = (g_{(n_+)}(θ_k)√C)^{−1}.

We define the equivalent sample objective function of FLAME for the iterative algorithm above,

s(\omega, \beta, \theta) = \frac{1}{n_+ + n_-} \Big[ \sum_{i=1}^{n_+} L((x_i^T \omega + \beta), \theta) + \sum_{j=1}^{n_-} L(-(x_j^T \omega + \beta), \theta) \Big] + \lambda \|\omega\|^2.

Then the convergence of this algorithm is shown in Theorem 1.
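Before stating the convergence result, here is a short illustrative Python sketch of Algorithm 1. It assumes a FLAME solver with the signature flame_fit(X, y, C, theta) -> (w, b) (for instance, the hypothetical sketch in Section 3.2); the iteration cap, tolerance, and the clamp of θ to [0, 1] are practical safeguards added here, not part of the algorithm as stated.

```python
import numpy as np

def adaptive_theta(X, y, C=1.0, max_iter=20, tol=1e-6):
    """Illustrative sketch of Algorithm 1 (adaptive parameter).
    Assumes labels y in {+1, -1}, with -1 the majority class, and a
    FLAME solver flame_fit(X, y, C, theta) -> (w, b) as sketched earlier."""
    n_plus = int(np.sum(y == 1))            # minority class size n_+
    theta = 0.0                             # step 1: initiate theta_0 = 0
    for _ in range(max_iter):
        w, b = flame_fit(X, y, C=C, theta=theta)           # step 2(a)
        # functional margins of the negative/majority class
        g = y[y == -1] * (X[y == -1] @ w + b)
        g_nplus = np.sort(g)[n_plus - 1]    # n_+ th order statistic g_(n_+)
        if g_nplus > 0:                     # step 2(b): recalibrate theta
            theta_new = max(theta, 1.0 / (g_nplus * np.sqrt(C)))
        else:
            theta_new = theta
        theta_new = min(theta_new, 1.0)     # clamp to [0, 1] (safeguard, not in the paper)
        if abs(theta_new - theta) < tol:    # step 3: stop once theta stabilizes
            return theta_new
        theta = theta_new
    return theta
```

As noted in the text, one step of this iteration already gives a reasonable θ in many examples, so max_iter can be set to 1 for a cheap approximation.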
Theorem 1. In Algorithm 1, s(ω_k, β_k, θ_k) is non-increasing in k. As a consequence, Algorithm 1 converges to a stationary point s(ω_∞, β_∞, θ_∞) where s(ω_k, β_k, θ_k) ≥ s(ω_∞, β_∞, θ_∞). Moreover, Algorithm 1 terminates finitely.

Ideally, one would hope to get an optimal parameter θ* which satisfies θ* = (g_{(n_+)}(θ*)√C)^{−1}. In practice, θ_∞ will approximate θ* very well. In addition, we notice that one-step iteration usually gives decent results for simulation examples and some real examples.

5 Theoretical Properties
In this section, several important theoretical properties of the FLAME classifiers are investigated. We first prove the Fisher consistency (Lin, 2004) of FLAME in Section 5.1. As one focus of this paper is imbalanced data classification, the asymptotic properties of FLAME under the extremely imbalanced data setting are studied in Section 5.2. Lastly, a novel HDLSS asymptotics where n is fixed and d → ∞, the other focus of this article, is studied in Section 5.3.

Fisher consistency is a very basic property for a classifier. A classifier being Fisher consistent means that the minimizer of the conditional risk of the classifier given observation x has the same sign as the Bayes rule, argmax_{k∈{+1,−1}} P(Y = k | X = x). It has been shown that both SVM and DWD are Fisher consistent (Lin, 2004, Qiao et al., 2010). The following proposition states that the FLAME classifiers are Fisher consistent too.

Proposition 2. Let f* be the global minimizer of E[L(Y f(X), θ)], where L(·, θ) is the loss function for FLAME given parameter θ. Then sign(f*(x)) = sign(P(Y = +1 | X = x) − 1/2).

In this subsection, we investigate the asymptotic performance of SVM, DWD and FLAME. The asymptotic setting we focus on is when the minority sample size n_+ is fixed and the majority sample size n_− → ∞, which is similar to the setting in Owen (2007). We will show that DWD is sensitive to the imbalanced data, while FLAME with proper choices of parameter θ and SVM are not.

Let x̄_+ be the sample mean of the positive/minority class. Theorem 3 shows that in the imbalanced data setting, when the size of the negative/majority class grows while that of the positive/minority class is fixed, the intercept term for DWD tends to negative infinity, in the order of √m. Therefore, DWD will classify all the observations to the negative/majority class, that is, the minority class will be 100% misclassified.

Theorem 3. Let n_+ be fixed. Assume that the conditional distribution of the negative/majority class F_−(x) surrounds x̄_+ by the definition given in Owen (2007), and that γ is a constant satisfying

\inf_{\|\omega\|=1} \int 1\{(x - \bar{x}_+)'\omega > 0\} \, dF_-(x) > \gamma \ge 0,

then the DWD intercept β̂ satisfies

\hat{\beta} < -\sqrt{\gamma C m} - \bar{x}_+^T \omega = -\sqrt{n_- \gamma C / n_+} - \bar{x}_+^T \omega.

In Section 4.2, we have introduced an iterative approach to select the parameter θ. Theorem 4 shows that with the optimal parameter θ* found by Algorithm 1, the discriminant direction of FLAME is in the same direction as the vector that joins the sample mean of the positive class and the tilted population mean of the negative class. Moreover, in contrast to DWD, the intercept term of FLAME in this case is finite.

Theorem 4. Suppose that n_− ≫ n_+ and ω* and β* are the FLAME solutions trained with the parameter θ* that satisfies θ* = (g_{(n_+)}(θ*)√C)^{−1}. Then ω* and β* satisfy

\omega^* = \frac{C}{(1+m)\lambda} \Big[ \bar{x}_+ - \frac{\int (x^T \omega^* + \beta^*)^{-2} x \, dF_-(x \mid E)}{\int (x^T \omega^* + \beta^*)^{-2} \, dF_-(x \mid E)} \Big],

where E is the event that [Y(X^T ω* + β*)]^{−1} ≥ θ*√C, where (X, Y) is a random sample from the negative/majority class, and

\int (x^T \omega^* + \beta^*)^{-2} \, dF_-(x \mid E) = \frac{n_+}{n_o C},

where 0 < n_o ≤ n_+.

As a consequence of Theorem 4, when m = n_−/n_+ → ∞, we have ‖ω*‖ → 0.
Since the right-hand side of the last equation above is positive and finite, β* does not diverge. In addition, since P(E) → 0, β* < −1/(θ*√C).

The following theorem shows the performance of SVM under the imbalanced data context, which completes our comparisons between SVM, DWD and FLAME.

Theorem 5. Suppose that n_− ≫ n_+. The solutions ω̂ and β̂ to SVM satisfy

\hat{\omega} = \frac{1}{(1+m)\lambda} \Big\{ \bar{x}_+ - \int x \, dF_-(x \mid G) \Big\},

where G is the event that 1 − Y(X^T ω̂ + β̂) > 0, where (X, Y) is a random sample from the negative/majority class, and

1 - P(G) = P(1 + X^T \hat{\omega} + \hat{\beta} \le 0) = 1 - 1/m.
The last statement in Theorem 5 means that with probability converging to 1, β̂ ≤ −1 − X^T ω̂, which remains bounded (in contrast to the DWD intercept, which satisfies β̂ < −√(γCm) − x̄_+^T ω and diverges as m → ∞).

HDLSS data are emerging in many areas of scientific research. The HDLSS asymptotics is a recently developed theoretical framework. Hall et al. (2005) gave a geometric representation for HDLSS data, which can be used to study these new 'n fixed, d → ∞' asymptotic properties of binary classifiers such as SVM and DWD. Ahn et al. (2007) weakened the conditions under which the representation holds. Qiao et al. (2010) improved the conditions and applied this representation to investigate the performance of the weighted DWD classifier. The same geometric representation can be used to analyze FLAME. See a summary of some previous HDLSS results in the online supplementary materials. We develop the HDLSS asymptotic properties of the FLAME family by providing conditions in Theorem 6 under which the FLAME classifiers always correctly classify HDLSS data.

We first introduce the notations and give some regularity assumptions, then state the main theorem. Let k ∈ {+1, −1} be the class index. For the kth class and given a fixed n_k, consider a sequence of random data matrices X_1^k, X_2^k, ···, X_d^k, ···, indexed by the number of rows d, where each column of X_d^k is a random observation vector from R^d and each row represents a variable. Assume that each column of X_d^k comes independently from a multivariate distribution with dimension d and with covariance matrix Σ_d^k. Let λ_{1,d}^k ≥ ··· ≥ λ_{d,d}^k be the eigenvalues of the covariance, and (σ_d^k)² = d^{−1} Σ_{i=1}^d λ_{i,d}^k the average eigenvalue. The eigenvalue decomposition of Σ_d^k is Σ_d^k = V_d^k Λ_d^k (V_d^k)^T. We may define the square root of Σ_d^k as (Σ_d^k)^{1/2} = V_d^k (Λ_d^k)^{1/2}, and the inverse square root (Σ_d^k)^{−1/2} = (Λ_d^k)^{−1/2} (V_d^k)^T. With minimal abuse of notation, let E(X_d^k) denote the expectation of the columns of X_d^k. Lastly, the n_k × n_k dual sample covariance matrix is denoted by S_{D,d}^k = d^{−1} {X_d^k − E(X_d^k)}^T {X_d^k − E(X_d^k)}.

Assumption 1. There are five components:

(i) Each column of X_d^k has mean E(X_d^k) and the covariance matrix Σ_d^k of its distribution is positive definite.

(ii) The entries of Z_d^k ≡ (Σ_d^k)^{−1/2} {X_d^k − E(X_d^k)} = (Λ_d^k)^{−1/2} (V_d^k)^T {X_d^k − E(X_d^k)} are independent.

(iii) The fourth moment of each entry of each column is uniformly bounded by M > 0. Consider the Wishart representation of the dual sample covariance matrix S_{D,d}^k associated with X_d^k, that is,

d S_{D,d}^k = \{(Z_d^k)^T (\Lambda_d^k)^{1/2} (V_d^k)^T\} \{V_d^k (\Lambda_d^k)^{1/2} Z_d^k\} = \sum_{i=1}^{d} \lambda_{i,d}^k W_{i,d}^k,

where W_{i,d}^k ≡ (Z_{i,d}^k)^T Z_{i,d}^k and Z_{i,d}^k is the ith row of Z_d^k defined above. It is called a Wishart representation because if X_d^k is Gaussian, then each W_{i,d}^k follows the Wishart distribution W_{n_k}(1, I_{n_k}) independently.

(iv) The eigenvalues of Σ_d^k are sufficiently diffused, in the sense that

\epsilon_d^k = \frac{\sum_{i=1}^{d} (\lambda_{i,d}^k)^2}{\big(\sum_{i=1}^{d} \lambda_{i,d}^k\big)^2} \to 0, \quad \text{as } d \to \infty. \qquad (4)

(v) The sum of the eigenvalues of Σ_d^k is of the same order as d, in the sense that (σ_d^k)² = O(1) and 1/(σ_d^k)² = O(1).

Assumption 2. The distance between the two population expectations satisfies

d^{-1} \big\| E(X_d^{(+1)}) - E(X_d^{(-1)}) \big\|^2 \to \mu^2, \quad \text{as } d \to \infty.
Moreover, there exist constants σ and τ, such that (σ_d^{(+1)})² → σ² and (σ_d^{(−1)})² → τ².

Let ν² ≡ µ² + σ²/n_+ + τ²/n_−. The following theorem gives the sure classification condition for FLAME, which includes SVM and DWD as special cases.

Theorem 6. Without loss of generality, assume that n_+ ≤ n_−. The situation of n_+ > n_− is similar and omitted.

• If either one of the following three conditions is satisfied,
1. for θ ∈ [0, (1 + √(m^{−1}))/(ν√(dC))), µ² > (n_−/n_+)σ²/n_+ − τ²/n_− > 0;
2. for θ ∈ [(1 + √(m^{−1}))/(ν√(dC)), 2/(ν√(dC))), µ² > T − τ²/n_− > 0, where T := (1/(2θ√(dC)) + √(1/(4θ²dC) + σ²/n_+))² − σ²/n_+;
3. for θ ∈ [2/(ν√(dC)), 1], µ² > σ²/n_+ − τ²/n_− > 0,
then for a new data point x_0^+ from the positive class (+1), P(x_0^+ is correctly classified by FLAME) → 1, as d → ∞. Otherwise, the probability above → 0.

• If either one of the following three conditions is satisfied,
1. for θ ∈ [0, (1 + √(m^{−1}))/(ν√(dC))), (n_−/n_+)σ²/n_+ − τ²/n_− > 0;
2. for θ ∈ [(1 + √(m^{−1}))/(ν√(dC)), 2/(ν√(dC))), T − τ²/n_− > 0;
3. for θ ∈ [2/(ν√(dC)), 1], σ²/n_+ − τ²/n_− > 0,
then for any µ² > 0, for a new data point x_0^− from the negative class (−1), P(x_0^− is correctly classified by FLAME) → 1, as d → ∞.

Theorem 6 has two parts. The first part gives the conditions under which FLAME correctly classifies a new data point from the positive class, and the second part is for the negative class. Each part lists three conditions based on three disjoint intervals of the parameter θ. Note that the first and third intervals of each part generalize results which were shown to hold only for DWD and SVM before (c.f. Theorem 1 and Theorem 2 in Hall et al., 2005). In particular, it shows that all the FLAME classifiers with θ falling into the first interval behave like DWD asymptotically. Similarly, all the FLAME classifiers with θ falling into the third interval behave like SVM asymptotically. This partially explains the shape of the within-group error curve that we will show in Figures 6, S.2, and S.3, which we will discuss in the next section.

In the first part, the condition for the other FLAMEs (with θ in the second interval) is weaker than for the DWD-like FLAMEs (in the first interval), but stronger than for the SVM-like FLAMEs (in the third interval). This means that it is easier to classify a new data point from the positive/minority class by SVM, than by an intermediate FLAME, which is easier than by DWD. Note that when n_+ ≤ n_−, the hyperplane for FLAME is in general closer to the positive class.

In terms of classifying data points from the negative class, the order of the difficulties among DWD, FLAME and SVM reverses.

FLAME is not only a unified representation of DWD and SVM, but also introduces a new family of classifiers which are capable of avoiding the overfitting HDLSS data issue and the sensitivity to imbalanced data issue. In this section, we use simulations to show the performance of FLAME at various parameter levels. We will show that with a range of carefully chosen parameters, FLAME can outperform both the DWD and the SVM methods in various simulation settings.
Before we introduce our simulation examples, we first introduce the performance measures in this paper. Note that the Bayes rule classifier can be viewed as the "gold standard" classifier. In our simulation settings, we assume that data are generated from two Gaussian populations
MVN(µ_±, Σ) with different mean vectors µ_+ and µ_− and the same covariance matrix Σ. This setting leads to the following Bayes rule:

\text{sign}(x^T \omega_B + \beta_B), \quad \text{where } \omega_B = \Sigma^{-1}(\mu_+ - \mu_-) \text{ and } \beta_B = -\tfrac{1}{2}(\mu_+ + \mu_-)'\omega_B. \qquad (5)
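For concreteness, the Bayes rule (5) for this two-Gaussian setting can be computed as in the following small numpy sketch (illustrative only; the function name is ours):

```python
import numpy as np

def bayes_rule(mu_plus, mu_minus, Sigma):
    """Bayes rule (5) for two Gaussian classes with a common covariance matrix."""
    w_B = np.linalg.solve(Sigma, mu_plus - mu_minus)    # Sigma^{-1} (mu_+ - mu_-)
    beta_B = -0.5 * (mu_plus + mu_minus) @ w_B           # -(1/2)(mu_+ + mu_-)' w_B
    return w_B, beta_B

# a new point x0 is then classified as np.sign(x0 @ w_B + beta_B)
```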
Five performance measures are evaluated in this paper:

1. The mean within-class error (MWE) for an out-of-sample test set, which is defined as

MWE = \frac{1}{2n_+} \sum_{i=1}^{n_+} 1(\hat{Y}_i^+ \ne Y_i^+) + \frac{1}{2n_-} \sum_{j=1}^{n_-} 1(\hat{Y}_j^- \ne Y_j^-).

2. The deviation of the estimated intercept β from the Bayes rule intercept β_B: |β − β_B|.

3. Dispersion: a measure of the stochastic variability of the estimated discrimination direction vector ω. The dispersion measure was introduced in Section 2.1 as the trace of the sample covariance of the resulting discriminant direction vectors: dispersion = tr{Var([ω_r]_{r=1:R})}, where R is the number of repeated runs.

4. Angle between the estimated discrimination direction ω and the Bayes rule direction ω_B: ∠(ω, ω_B).

5. RankComp(ω, ω_B): in general, for two direction vectors ω and ω*, RankComp is defined as the proportion of pairs of variables, among all d(d−1)/2 pairs, whose relative ranks of importance (by the corresponding entries of the two direction vectors) disagree, i.e.,

\text{RankComp}(\omega, \omega^*) \equiv \{d(d-1)/2\}^{-1} \sum_{1 \le i < j \le d} 1\{\text{variables } i \text{ and } j \text{ are ranked in opposite orders by } \omega \text{ and } \omega^*\}.

FLAME (with θ = 0.5) has been compared with SVM (θ = 1) and DWD (θ = 0) in Figure 1, and on average, its discriminant directions are closer to the Bayes rule direction ω_B than the SVM directions, but less close than the DWD directions. In this subsection, we will further investigate the performance of FLAME with several different values of θ, and compare them with DWD and SVM under various simulation settings.

Figure 5: The dispersions (top row) and the angles between the FLAME direction and the Bayes direction (bottom row) for 50 runs of simulations, where the imbalance factors m are 1, 4 and 9 (the left, center and right panels), in the increasing dimension setting (d = 100, 400, 700, 1000 on the x-axes). The FLAME machines have θ = 0, 0.25, 0.5, 0.75, 1. As θ and the dimension d increase, both the dispersion and the deviation from the Bayes direction increase. The emergence of the imbalanced data (the increase of m) does not much deteriorate the FLAME directions except for large d.

Figure 5 shows the comparison results under the same simulation setting with various combinations of (d, m). In this simulation setting, data are from multivariate normal distributions with identity covariance matrices MVN_d(µ_±, I_d), where d = 100, 400, 700 and 1000. We let µ = c(d, d−1, d−2, ···, 1)^T, where c > 0 is chosen for µ to have norm 2.7. Then we let µ_+ = µ and µ_− = −µ. The imbalance factor varies among 1, 4 and 9 while the total sample size is 240. For each experiment, we repeat the simulation 50 times, and plot the average performance measure in Figure 5. The Bayes rule is calculated according to (5). It is obvious that when the dimension increases, both the dispersion and the angle increase. They are indicators of overfitting HDLSS data. When the imbalance factor m increases, the two measures increase as well, although not as much as when the dimension increases.
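For completeness, the direction-quality measures used in this section (angle from the Bayes direction, dispersion, and RankComp) can be computed as in the following illustrative numpy sketch; the magnitude-based indicator in rank_comp is one natural reading of the RankComp definition above, and `directions` is a hypothetical R × d array of estimated direction vectors over R runs.

```python
import numpy as np

def angle_from_bayes(w, w_B):
    """Angle (in degrees) between an estimated direction and the Bayes direction."""
    c = np.dot(w, w_B) / (np.linalg.norm(w) * np.linalg.norm(w_B))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def dispersion(directions):
    """Trace of the sample covariance of R estimated direction vectors (an R x d array)."""
    return float(np.trace(np.cov(np.asarray(directions), rowvar=False)))

def rank_comp(w, w_star):
    """Proportion of variable pairs whose importance order (by |entry|) disagrees."""
    a, b = np.abs(w), np.abs(w_star)
    d = len(a)
    bad = sum((a[i] - a[j]) * (b[i] - b[j]) < 0
              for i in range(d) for j in range(i + 1, d))
    return bad / (d * (d - 1) / 2)
```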
More importantly, it shows that when θ decreases (from 1 to 0, or equivalently as FLAME changes from SVM to DWD), the dispersion and the angle both decrease, which is promising because it shows that FLAME improves SVM in terms of the overfitting issue.

We also investigate the effect of different covariance structures, since the independence structure among variables as in the last subsection is not realistic in real applications. We investigate three covariance structures: independent, interchangeable and block-interchangeable covariance. Data are generated from two multivariate normal distributions MVN(µ_±, Σ) with d = 300. We first let µ = (75, ···, 0, ···, 0)', then scale it by multiplying a constant c such that the Mahalanobis distance between µ_+ = cµ and µ_− = −cµ equals 5.4, i.e., (µ_+ − µ_−)'Σ^{−1}(µ_+ − µ_−) = 5.4. Note that this represents a reasonable signal-to-noise ratio.

We consider the FLAME machines with different parameter θ from a grid of 11 values (0, 0.1, 0.2, ···, 1), under the settings m = 2, 3, 4 (× three covariance structures). For the independent structure example, Σ = I; for the interchangeable structure example, Σ_ii = 1 and Σ_ij = 0.8 for i ≠ j; for the block-interchangeable structure example, we let Σ be a block diagonal matrix with five diagonal blocks, the sizes of which are 150, 100, 25, 15, 10, and each block is an interchangeable covariance matrix with diagonal entries 1 and off-diagonal entries 0.8.

Figure 6 provides the summary results of the interchangeable structure example. Since the results are similar under different covariance structures, results from the other two covariance structures are included in the online supplementary materials to save space (Figure S.2 for the independent structure, and Figure S.3 for the block-interchangeable covariance).

Figure 6: Interchangeable example. It can be seen that as FLAME turns from DWD to SVM (θ from 0 to 1), the within-class error decreases (top-left), thanks to the more accurate estimate of the intercept term (top-middle). On the other hand, this comes at the cost of larger deviation from the Bayes direction (bottom-left), incorrect rank of the importance of the variables (bottom-middle) and larger stochastic variability of the estimated directions (bottom-right).

In each plot, we include the within-group error (top-left), the absolute value of the difference between the estimated intercept and the Bayes intercept |β − β_B| (top-middle), the angle between the estimated direction and the Bayes direction ∠(ω, ω_B) (bottom-left), the RankComp between the estimated direction and the Bayes direction (bottom-middle) and the dispersion of the estimated directions (bottom-right).

We can see that in Figure 6 (and Figures S.2 and S.3 in the online supplementary materials), when we increase θ from 0 to 1, i.e., when FLAME moves from the DWD end to the SVM end, the within-group error decreases. This is mostly due to the fact that the intercept term β comes closer to the Bayes rule intercept β_B.
On the other hand, the estimated direction deviates from the true direction (larger angle), gives the wrong rank of the variables (larger RankComp) and is more unstable (larger dispersion). Similar observations hold for the other two covariance structures, with one exception in the block-interchangeable setting (Figure S.3), where the RankComp first decreases then increases.

In the entire FLAME family, DWD represents one extreme, which provides better estimation of the direction, is closer to the Bayes direction, provides the right order for all variables, and is more stable. But it suffers from the inaccurate estimation of β in the presence of imbalanced data. SVM represents the other extreme, which is not sensitive to imbalanced data and usually provides a good estimation of β, but is in general outperformed by DWD in terms of closeness to the Bayes optimal direction. In most situations, within the FLAME family, there is no single machine that is better than both ends in the two aspects at the same time.

The observations above motivate the use of the equal-trade-off parameter introduced in Section 4.1. In the next subsection, we will compare this parameter choice with other options.

We have suggested an equal-trade-off parameter based on the plots of the within-group error and the RankComp (see Section 4.1) and justified the use of an adaptive θ based on an iterative procedure (see details in Section 4.2). Figure 7 compares FLAME with these two choices, and with θ = 0 (DWD) and 1 (SVM). Various covariance structures (independent, interchangeable and block-interchangeable) are investigated. To save space, we only show the results for the block-interchangeable dependence structure, as this is more realistic in many real applications in genomic science and other applications. Here the full dimensions (d = 80, ···, 600 or 900) are divided into three blocks (50%, 25% and 25% of d). The total sample size is 240 and the imbalance factor m is 3 (moderately imbalanced).

Figure 7: Comparison of four FLAMEs with θ = 0, 1, the suggested θ introduced in Section 4.1 and the adaptive θ (after one step) introduced in Section 4.2 for a simulated example with block-interchangeable dependence structure, in terms of the within-group error, deviation from the true intercept term β_B, deviation from the true direction ω_B, and the RankComp from ω_B. Intermediate FLAMEs provide improvements over DWD for the first two measures and over SVM for the last two measures.

In Figure 7, we compare the within-group error, |β − β_B|, ∠(ω, ω_B), and RankComp(ω, ω_B). We see that these intermediate FLAMEs provide improvements over DWD for the first two measures and over SVM for the last two measures. For relatively small d, the equal-trade-off θ value is very similar to the adaptive θ. For large d, the adaptive θ is closer to DWD than the equal-trade-off θ. For very large d, all four machines encounter difficulty in classification.

In this section we demonstrate the performance of FLAME on a real example: the Human Lung Carcinomas Microarray Dataset, which has been analyzed earlier in Bhattacharjee et al. (2001).
The Human Lung Carcinomas Dataset contains six classes: adenocarcinoma, squamous, pulmonary carcinoid, colon, normal and small cell carcinoma, with sample sizes of 128, 21, 20, 13, 17 and 6 respectively. Liu et al. (2008) used this data set as a test set to demonstrate their proposed significance analysis of clustering approach. We combine the first two subclasses and the last four subclasses to form the positive and negative classes respectively. The sample sizes are 149 and 56, with imbalance factor m = 2.66. The original data contain 12,625 genes. We first filter genes using the ratio of the sample standard deviation and sample mean of each gene and keep 2,530 of them with large ratios (Dudoit et al., 2002, Liu et al., 2008).

We conduct a five-fold cross-validation (CV) to evaluate the within-group error for the two classes over 100 random splits. The RankComp measure is calculated based on the full data set instead of on the samples in a single fold. For each replication, we find a suggested value for θ that leads to the same normalized value for RankComp and within-group error over a grid of θ values. The adaptive value for θ (after one step) is calculated based on the DWD direction using all the samples in the data set.

The average within-group errors and the RankComp measures for DWD, SVM, and the two intermediate FLAMEs (using the adaptive θ and the suggested equal-trade-off θ) are shown in Figure 8.

Note that as we do not know the true rank of importance of genes for this real data application (no Bayes rule direction or the rank of importance that it implies), we use the DWD rank as a surrogate for the truth, since simulation examples show that on average its estimated direction is the closest to the Bayes rule direction. Therefore, the RankComp measure for DWD is 0.

This experiment does show that FLAME opens a new dimension of improving both the classification performance and the interpretative ability of the classifier. The compromise of the FLAME classifier, with the suggested equal-trade-off θ, in terms of the within-group error is very small compared to the improvement obtained in terms of the direction.

Figure 8: The average cross-validation within-group errors and the RankComp measures for DWD, FLAME with the adaptive and equal-trade-off θ parameters and SVM for the Human Lung Carcinomas Dataset.

We would not recommend using either DWD or SVM for conducting prediction and interpretation simultaneously due to their bad performance for at least one criterion. A FLAME classifier with an appropriate parameter may be more suitable for practical use.

In this paper, we thoroughly investigate SVM and DWD on their performance when applied to HDLSS and imbalanced data. A novel family of binary classifiers called FLAME is proposed, where SVM and DWD are the two ends of the spectrum. On the DWD end, the estimation of the intercept term is deteriorated while it provides better estimation of the direction vector, and thus better handles the HDLSS data.
On the other hand, SVM is good at estimating the intercept term but not the direction and is subject to overfitting, and thus is more suitable for imbalanced data but not HDLSS data.

We conduct an extensive study of the asymptotic properties of the FLAME family in three different flavors: the 'd fixed, n → ∞' asymptotics (Fisher consistency), the 'd and n_+ fixed, n_− → ∞' asymptotics (extremely imbalanced data), and the 'n fixed, d → ∞' asymptotics (the HDLSS asymptotics). These results explain the performance we have seen in the simulations and suggest that with a smart choice of θ, FLAME can properly handle both the HDLSS data and the imbalanced data, by improving the estimations of the direction and the intercept term.

The FLAME family can be immediately extended to multi-class classification, as was done for SVM and DWD such as in Weston and Watkins (1999), Crammer and Singer (2000), Lee et al. (2004) or Huang et al. (2012). Another natural extension is variable selection for FLAME.

The FLAME machines generalize the concept of support vectors. In SVM, support vectors are referred to as vectors that sit on or fall into the two hyperplanes corresponding to u ≤ 1 (or u ≤ 1/√C for the modified version of the Hinge loss (2)). In SVM, only support vectors have impact on the final solution. DWD is the other extreme case where all the data vectors have some impact. In the presence of imbalanced sample sizes, the fact that all the data vectors influence the solution causes the optimization to ignore the minority class. The FLAME with 0 < θ < 1 lies between these two extremes. The adaptive choice of θ means that one needs to include as many vectors, and as balanced influential samples, as possible. More vectors usually lead to mitigated overfitting, and balanced sample sizes of the influential vectors from the two classes mean that the sensitivity issue of the intercept term can be alleviated.

The authors are aware that it is possible to implement a two-step procedure to conduct binary linear classification. In the first step, a good direction is found, probably in the fashion of DWD; in the second step, a fine intercept is chosen by borrowing the idea of SVM. However, the theoretical properties of such a procedure are unknown and will be left as a future research direction.

The choice of θ usually depends on the nature of the data and the scientific context. If the users prefer better classification performance over a reasonable discrimination direction for interpretation of the data, θ may be chosen to be closer to 1. If the right direction is the first priority, then θ should be chosen to be closer to 0. Note that, under some circumstances, the primary goal is to obtain a direction vector which can provide a score x^T ω for each observation for further use, and the intercept parameter β is of no use at all. For example, some users may use a receiver operating characteristic (ROC) curve as a graphical tool to evaluate classification performance over different β values instead of using a single β value given by the classifier. In this case, a FLAME machine close to the DWD method may be ideal.

Appendix

Derivation of the modified Hinge loss

Note that the original SVM formulation is

\text{argmin}_{\tilde{\omega}, \tilde{\beta}} \sum_i \big(1 - y_i \tilde{f}(x_i)\big)_+, \quad \text{s.t. } \|\tilde{\omega}\|^2 \le C,

where f̃(x) = x^T ω̃ + β̃. Here the coefficient vector ω̃ does not have unit norm. We let ω = ω̃/√C, β = β̃/√C and f = f̃/√C. Thus the SVM solution is given by

\text{argmin}_{\omega, \beta} \sum_i \big(1 - \sqrt{C} y_i f(x_i)\big)_+, \quad \text{s.t. } \|\omega\| \le 1,
Acknowledgements

The first author's work was partially supported by Binghamton University Harpur College Dean's New Faculty Start-up Funds and a collaboration grant from the Simons Foundation.

References

Ahn, J. and Marron, J. (2010), "The maximal data piling direction for discrimination," Biometrika, 97, 254-259.

Ahn, J., Marron, J., Muller, K., and Chi, Y. (2007), "The high-dimension, low-sample-size geometric representation holds under mild conditions," Biometrika, 94, 760-766.

Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., et al. (2001), "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses," Proceedings of the National Academy of Sciences, 98, 13790-13795.

Chawla, N., Japkowicz, N., and Kotcz, A. (2004), "Editorial: special issue on learning from imbalanced data sets," ACM SIGKDD Explorations Newsletter, 6, 1-6.

Cortes, C. and Vapnik, V. (1995), "Support-vector networks," Machine Learning, 20, 273-297.

Crammer, K. and Singer, Y. (2000), "On the learnability and design of output codes for multiclass problems," in Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.

Duda, R., Hart, P., and Stork, D. (2001), Pattern Classification, Wiley.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, 97, 77-87.

Hall, P., Marron, J. S., and Neeman, A. (2005), "Geometric representation of high dimension, low sample size data," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 427-444.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition), Springer.

Huang, H., Liu, Y., Du, Y., Perou, C. M., Hayes, D. N., Todd, M. J., and Marron, J. (2012), "Multiclass distance weighted discrimination," Journal of Computational and Graphical Statistics.

Lee, Y., Lin, Y., and Wahba, G. (2004), "Multicategory support vector machines," Journal of the American Statistical Association, 99, 67-81.

Lin, Y. (2004), "A note on margin-based loss functions in classification," Statistics & Probability Letters, 68, 73-82.

Liu, Y., Hayes, D., Nobel, A., and Marron, J. (2008), "Statistical significance of clustering for high-dimension, low-sample size data," Journal of the American Statistical Association, 103, 1281-1293.

Marron, J., Todd, M., and Ahn, J. (2007), "Distance-weighted discrimination," Journal of the American Statistical Association, 102, 1267-1271.

Owen, A. (2007), "Infinitely imbalanced logistic regression," The Journal of Machine Learning Research, 8, 761-773.

Qiao, X. and Liu, Y. (2009), "Adaptive weighted learning for unbalanced multicategory classification," Biometrics, 65, 159-168.

Qiao, X., Zhang, H., Liu, Y., Todd, M., and Marron, J. (2010), "Weighted distance weighted discrimination and its asymptotic properties," Journal of the American Statistical Association, 105, 401-414.

Smola, A. J., Bartlett, P. L., Schölkopf, B., and Schuurmans, D. (2000), Advances in Large Margin Classifiers, vol. 1, MIT Press, Cambridge, MA.

Toh, K., Todd, M., and Tütüncü, R. (1999), "SDPT3 - a MATLAB software package for semidefinite programming, version 1.3," Optimization Methods and Software, 11, 545-581.

Tütüncü, R., Toh, K., and Todd, M. (2003), "Solving semidefinite-quadratic-linear programs using SDPT3," Mathematical Programming, 95, 189-217.

Vapnik, V. (1998), Statistical Learning Theory, Wiley.

Weston, J. and Watkins, C. (1999), "Support vector machines for multi-class pattern recognition," in European Symposium on Artificial Neural Networks, pp. 219-224.

Zhang, L. and Lin, X. (2011), "Some considerations of classification for high dimension low-sample size data," Statistical Methods in Medical Research.
Online supplementary materials

This document provides some additional details for the main paper, Flexible High-dimensional Classification Machines and Their Asymptotic Properties. It includes how we implement the FLAME machine with a pre-defined θ, detailed proofs of several theorems and propositions, and figures for additional simulations.

Implementation

In order to implement the FLAME algorithm, we introduce several new notations. Let $S^{d+1}$ be the second-order cone in the $(d+1)$-dimensional space, $S^{d+1} = \{(t_0, t_1, \dots, t_d)' : t_0 \ge \sqrt{\sum_{i=1}^{d} t_i^2}\}$. Note that $r_i$ and $1/r_i$ can be substituted by three auxiliary variables $\rho_i$, $\sigma_i$ and $\tau_i$ which satisfy $\rho_i + \sigma_i = r_i$, $\rho_i - \sigma_i = 1/r_i$, and $\tau_i = 1$. Then $\rho_i^2 = \sigma_i^2 + \tau_i^2$, and thus $(\rho_i, \sigma_i, \tau_i)' \in S^3$. Let $w = 1$; then $(w; \omega) \in S^{d+1}$ since $\|\omega\| \le 1$. Let $\eta_i \ge 0$ and $\varphi_i \ge 0$, where $\varphi_i$ and $\eta_i$ can be viewed as the positive and negative parts of $(1/r_i + C\xi_i - \theta\sqrt{C})$, i.e., $\varphi_i - \eta_i = 1/r_i + C\xi_i - \theta\sqrt{C}$.

With the reparameterization above, FLAME can be viewed as the following optimization problem:
$$\min_{\beta, w, \omega, \rho_i, \sigma_i, \tau_i, \xi_i, \eta_i, \varphi_i} \ \sum_{i=1}^{n} \varphi_i$$
subject to
$$y_i(x_i^T\omega + \beta) + \xi_i - \rho_i - \sigma_i = 0,$$
$$\rho_i - \sigma_i + C\xi_i - \theta\sqrt{C} + \eta_i - \varphi_i = 0,$$
$$w = 1, \quad \tau_i = 1,$$
and $(w; \omega) \in S^{d+1}$, $(\rho_i, \sigma_i, \tau_i)' \in S^3$, $\xi_i \ge 0$, $\eta_i \ge 0$, $\varphi_i \ge 0$.

Therefore, all the constraints can be converted to linear forms, all the variables are either nonnegative, free, or in second-order cones, and the objective function is linear. Such a problem is called a Second-Order Cone Program (SOCP), and can be efficiently solved by software such as SDPT3 (Toh et al., 1999, Tütüncü et al., 2003).
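For readers without access to SDPT3, the same convex program can be prototyped with an off-the-shelf modeling tool. The sketch below (ours, not the authors' implementation; the function name flame_fit is hypothetical) uses the open-source cvxpy package and states FLAME in the equivalent unreparameterized form $\min \sum_i (1/r_i + C\xi_i - \theta\sqrt{C})_+$ subject to $r_i = y_i(x_i^T\omega + \beta) + \xi_i$, $\xi_i \ge 0$ and $\|\omega\| \le 1$; cvxpy reduces this to a cone program internally.

```python
import numpy as np
import cvxpy as cp

def flame_fit(X, y, C=100.0, theta=0.5):
    """Fit a linear FLAME classifier by solving
    min sum_i (1/r_i + C*xi_i - theta*sqrt(C))_+
    s.t. r_i = y_i(x_i'w + b) + xi_i, xi_i >= 0, ||w|| <= 1."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    r = cp.multiply(y, X @ w + b) + xi          # perturbed functional margins r_i
    loss = cp.sum(cp.pos(cp.inv_pos(r) + C * xi - theta * np.sqrt(C)))
    prob = cp.Problem(cp.Minimize(loss), [cp.norm(w, 2) <= 1])
    prob.solve()
    return w.value, b.value

# Example usage on a small synthetic, imbalanced data set.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, size=(10, 5)), rng.normal(-1, 1, size=(40, 5))])
y = np.concatenate([np.ones(10), -np.ones(40)])
w_hat, b_hat = flame_fit(X, y, C=100.0, theta=0.5)
print("direction:", w_hat, "intercept:", b_hat)
```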
Proof of Theorem 1

It suffices to show that $s(\omega_k, \beta_k, \theta_k) \ge s(\omega_{k+1}, \beta_{k+1}, \theta_{k+1})$. First, $s(\omega_k, \beta_k, \theta_k) \ge s(\omega_k, \beta_k, \theta_{k+1})$, due to the definition of $\theta_{k+1}$ and the fact that $\theta_k \le \theta_{k+1}$. Then $s(\omega_k, \beta_k, \theta_{k+1}) \ge s(\omega_{k+1}, \beta_{k+1}, \theta_{k+1})$, since $\omega_{k+1}$ and $\beta_{k+1}$ minimize $s(\omega, \beta, \theta_{k+1})$.

Proof of Proposition 2

For any $x$, denote $p(x) = \mathrm{P}(Y = +1 \mid X = x)$. The conditional risk is $R(f) \equiv \mathrm{E}[L(Yf(X), \theta) \mid X = x] = L(f(x), \theta)p(x) + L(-f(x), \theta)(1 - p(x))$. For simplicity, we write $L(f(x), \theta)$ as $L(f)$; thereby $R(f) = L(f)p(x) + L(-f)(1 - p(x))$.

We can see that for fixed $p(x) \in (0, 1)$, $R(f)$ is continuous, convex, and piecewise differentiable. Thus we find $f^*(x)$ by solving $R'(f) = 0$, where $R'(f) = L'(f)p(x) + [L(-f)]'(1 - p(x))$. The FLAME loss is
$$L(f) = \begin{cases} (2-\theta)\sqrt{C} - Cf, & f \le 1/\sqrt{C},\\ 1/f - \theta\sqrt{C}, & 1/\sqrt{C} < f \le 1/(\theta\sqrt{C}),\\ 0, & f > 1/(\theta\sqrt{C}). \end{cases}$$
So direct calculation gives us
$$L'(f) = \begin{cases} -C, & f \le 1/\sqrt{C},\\ -1/f^2, & 1/\sqrt{C} < f \le 1/(\theta\sqrt{C}),\\ 0, & f > 1/(\theta\sqrt{C}), \end{cases}
\qquad
[L(-f)]' = \begin{cases} C, & f \ge -1/\sqrt{C},\\ 1/f^2, & -1/(\theta\sqrt{C}) < f \le -1/\sqrt{C},\\ 0, & f < -1/(\theta\sqrt{C}). \end{cases}$$
Finally, if $\frac{1-p(x)}{p(x)} \ne 1$, the solution of $R'(f) = 0$ lies either in $-1/(\theta\sqrt{C}) < f \le -1/\sqrt{C}$ or in $1/\sqrt{C} < f \le 1/(\theta\sqrt{C})$. In the former case, $f^* = -\frac{1}{\sqrt{C}}\sqrt{\frac{1-p(x)}{p(x)}}$, which applies when $\frac{1-p(x)}{p(x)} > 1$. In the latter case, $f^* = +\frac{1}{\sqrt{C}}\Big/\sqrt{\frac{1-p(x)}{p(x)}}$, which applies when $\frac{1-p(x)}{p(x)} < 1$. If $\frac{1-p(x)}{p(x)} = 1$, then $R'(f^*) = 0$ for any $f^* \in [-1/\sqrt{C}, 1/\sqrt{C}]$.

Therefore, $f^*$ satisfies
$$\mathrm{sign}(f^*) = \mathrm{sign}(p(x) - 0.5) = \mathrm{sign}(2p(x) - 1) = \mathrm{sign}\big(p(x) - (1 - p(x))\big) = \mathrm{sign}\Big(\frac{p(x)}{1 - p(x)} - 1\Big).$$
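The sign property above can be checked numerically. The following sketch (ours, not from the paper) evaluates the conditional risk R(f) on a grid and confirms that its minimizer has the same sign as p(x) - 1/2, which is the Fisher-consistency property asserted by Proposition 2.

```python
import numpy as np

def flame_loss(u, C, theta):
    # Piecewise FLAME loss: DWD-type for small margins, zero beyond 1/(theta*sqrt(C)).
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    left = u <= 1.0 / np.sqrt(C)
    mid = (~left) & (u <= 1.0 / (theta * np.sqrt(C)))
    out[left] = (2.0 - theta) * np.sqrt(C) - C * u[left]
    out[mid] = 1.0 / u[mid] - theta * np.sqrt(C)
    return out

def conditional_risk(f, p, C, theta):
    # R(f) = L(f) p + L(-f) (1 - p)
    return flame_loss(f, C, theta) * p + flame_loss(-f, C, theta) * (1.0 - p)

C, theta = 100.0, 0.5
grid = np.linspace(-2.0 / (theta * np.sqrt(C)), 2.0 / (theta * np.sqrt(C)), 200001)
for p in (0.2, 0.4, 0.6, 0.9):
    f_star = grid[np.argmin(conditional_risk(grid, p, C, theta))]
    # The minimizer of the conditional risk carries the sign of p - 1/2.
    assert np.sign(f_star) == np.sign(p - 0.5)
    print(f"p = {p:.1f}  ->  f* = {f_star:+.4f}")
```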
Proof of Theorem 3

Since there are infinitely many negative-class samples, it is reasonable to assume that the classification boundary is pushed closer to the minority positive class, and therefore the functional margin $u_i = y_i f(x_i) = f(x_i)$ for the $i$th vector from the minority positive class is small and its DWD loss is $2\sqrt{C} - Cu_i = 2\sqrt{C} - Cf(x_i)$. Similarly, the DWD loss for the $j$th vector from the majority negative class is $1/[y_j f(x_j)] = -1/f(x_j)$. The objective function for DWD is therefore equivalent to
$$\frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta)\right] - \sum_{j=1}^{n_-}\frac{1}{x_j^T\omega + \beta}\right\} + \lambda\|\omega\|^2.$$
The second term inside the curly brackets above can be approximated by $-n_-\int \frac{1}{x^T\omega + \beta}\,dF_-(x)$, where $F_-(\cdot)$ is the conditional cumulative distribution function for the negative class. The objective function is therefore
$$l_D = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta)\right] - n_-\int\frac{1}{x^T\omega + \beta}\,dF_-(x)\right\} + \lambda\|\omega\|^2.$$

Before we continue, we need the definition of a distribution having a point surrounded (Owen, 2007).

Definition. The distribution $F$ on $\mathbb{R}^d$ has the point $x^*$ surrounded if
$$\int_{(x - x^*)'\omega > \epsilon} dF(x) > \delta,$$
for some $\delta > 0$, some $\epsilon > 0$, and all $\omega \in \mathbb{R}^d$ with $\|\omega\| = 1$.

Consequently, if $F_-$ has $x^*$ surrounded, then there exists $\gamma$ satisfying
$$\inf_{\|\omega\|=1}\int_{(x - x^*)'\omega > 0} dF_-(x) > \gamma \ge 0. \tag{S.1}$$

We observe that
$$\frac{\partial l_D}{\partial\beta} = \frac{1}{n_+ + n_-}\left[-n_+C + n_-\int(x^T\omega + \beta)^{-2}\,dF_-(x)\right]$$
$$\ge \frac{1}{n_+ + n_-}\left[-n_+C + n_-\int_{(x - \bar{x}_+)'\omega \ge 0}\left((x - \bar{x}_+)^T\omega + \bar{x}_+^T\omega + \beta\right)^{-2}dF_-(x)\right]$$
$$\ge \frac{1}{n_+ + n_-}\left[-n_+C + n_-\int_{(x - \bar{x}_+)'\omega \ge 0}(\bar{x}_+^T\omega + \beta)^{-2}\,dF_-(x)\right]
\ge \frac{1}{n_+ + n_-}\left[-n_+C + n_-\gamma(\bar{x}_+^T\omega + \beta)^{-2}\right].$$
Now suppose that $-\sqrt{n_-\gamma/(n_+C)} < \bar{x}_+^T\omega + \beta < 0$; then $n_-\gamma(\bar{x}_+^T\omega + \beta)^{-2} > n_+C$ and $\partial l_D/\partial\beta > 0$. Given the fact that $l_D$ is a strictly convex function, the minimizer $\widehat{\beta} < -\sqrt{n_-\gamma/(n_+C)} - \bar{x}_+^T\omega$.

Proof of Theorem 4

Again, with the imbalance assumption, we assume that the functional margins for the minority positive class are always greater than 0. Note that the penalized empirical loss for the FLAME machine is approximated by
$$l_F = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta) - \theta\sqrt{C}\right] + n_-\int\left(\frac{-1}{x^T\omega + \beta} - \theta\sqrt{C}\right)_+ dF_-(x)\right\} + \lambda\|\omega\|^2.$$
Let $g_j^* = -(x_j^T\omega^* + \beta^*)$, $j = 1, 2, \dots, n_-$, be the functional margins for the negative class. Because $1/\big(g^*_{(n_+)}\sqrt{C}\big) = \theta^*$, that is, the reduced loss $1/g^*_{(n_+)} - \theta^*\sqrt{C} = 0$, and $1/g^*_{(n_+)}$ is the $n_+$-th greatest among all the reciprocal functional margins of the negative class, $1/g^*_j = -1/(x_j^T\omega^* + \beta^*)$, there are at most $n_+$ negative-class samples whose reduced losses are $\ge 0$. Assume that there are $n_o \le n_+$ such samples.

For a random sample $(X, Y)$ from the negative class, let $E$ be the event that $\big(Y(X^T\omega^* + \beta^*)\big)^{-1} \ge \theta^*\sqrt{C}$. From the argument above, $\mathrm{P}(E)$ is approximately $n_o/n_-$. Then the integral
$$\int\left(\frac{-1}{x^T\omega + \beta} - \theta\sqrt{C}\right)_+ dF_-(x)$$
equals
$$\mathrm{E}\left[\Big(\tfrac{-1}{X^T\omega + \beta} - \theta\sqrt{C}\Big)_+ \,\Big|\, E^c\right]\mathrm{P}(E^c) + \mathrm{E}\left[\Big(\tfrac{-1}{X^T\omega + \beta} - \theta\sqrt{C}\Big)_+ \,\Big|\, E\right]\mathrm{P}(E)
\approx 0\cdot\Big(1 - \tfrac{n_o}{n_-}\Big) + \mathrm{E}\left[\Big(\tfrac{-1}{X^T\omega + \beta} - \theta\sqrt{C}\Big) \,\Big|\, E\right]\tfrac{n_o}{n_-}.$$
We then have
$$l_F = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[2\sqrt{C} - C(x_i^T\omega + \beta) - \theta\sqrt{C}\right] + n_o\int\left(\frac{-1}{x^T\omega + \beta} - \theta\sqrt{C}\right) dF_-(x \mid E)\right\} + \lambda\|\omega\|^2.$$
Here, $dF_-(x \mid E)$ is the conditional distribution function of $X$ for the negative class given event $E$.

Setting
$$\frac{\partial l_F}{\partial\beta} = 0 = \frac{1}{n_+ + n_-}\left\{-Cn_+ + n_o\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E)\right\},$$
we have
$$\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E) = C\frac{n_+}{n_o}.$$
Setting
$$\frac{\partial l_F}{\partial\omega} = 0 = \frac{1}{n_+ + n_-}\left\{-Cn_+\bar{x}_+ + n_o\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E)\right\} + \lambda\omega^*,$$
we have
$$\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E) = -\frac{n_+ + n_-}{n_o}\lambda\omega^* + C\frac{n_+}{n_o}\bar{x}_+.$$
Furthermore,
$$\frac{\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E)}{\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E)} = -\frac{n_+ + n_-}{n_+}\frac{\lambda}{C}\omega^* + \bar{x}_+ = -(1 + m)\frac{\lambda}{C}\omega^* + \bar{x}_+.$$
That is,
$$\omega^* = \frac{C}{(1 + m)\lambda}\left[\bar{x}_+ - \frac{\int(x^T\omega^* + \beta^*)^{-2}\,x\,dF_-(x \mid E)}{\int(x^T\omega^* + \beta^*)^{-2}\,dF_-(x \mid E)}\right].$$
Proof of Theorem 5

For simplicity we use the original SVM formulation with the Hinge loss function instead of the FLAME formulation. The objective function for SVM is equivalent to
$$l_S = \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[1 - (x_i^T\omega + \beta)\right] + n_-\int\left[1 + x^T\omega + \beta\right]_+ dF_-(x)\right\} + \lambda\|\omega\|^2$$
$$= \frac{1}{n_+ + n_-}\left\{\sum_{i=1}^{n_+}\left[1 - (x_i^T\omega + \beta)\right] + n_-\int\left[1 + x^T\omega + \beta\right]\mathbb{1}_{\{1 + x^T\omega + \beta > 0\}}\,dF_-(x)\right\} + \lambda\|\omega\|^2.$$
Setting $\partial l_S/\partial\beta = 0$, we have
$$\frac{\partial l_S}{\partial\beta} = \frac{1}{n_+ + n_-}\left\{-n_+ + n_-\mathrm{P}(G;\omega,\beta) + n_-\int(1 + x^T\omega + \beta)\,\delta(1 + x^T\omega + \beta)\,dF_-(x)\right\}
= \frac{1}{n_+ + n_-}\left\{-n_+ + n_-\mathrm{P}(G;\omega,\beta)\right\} = 0,$$
where $\delta(\cdot)$ is the Dirac delta function. It follows that $\mathrm{P}(G;\widehat{\omega},\widehat{\beta}) = n_+/n_- = 1/m$.

Moreover,
$$\frac{\partial l_S}{\partial\omega} = \frac{1}{n_+ + n_-}\left\{-\sum_{i=1}^{n_+}x_i + n_-\int x\,\mathbb{1}_{\{1 + x^T\omega + \beta > 0\}}\,dF_-(x) + n_-\int(1 + x^T\omega + \beta)\,\delta(1 + x^T\omega + \beta)\,x\,dF_-(x)\right\} + \lambda\omega$$
$$= \frac{1}{n_+ + n_-}\left\{-n_+\bar{x}_+ + n_-\int x\,\mathbb{1}_{\{1 + x^T\omega + \beta > 0\}}\,dF_-(x)\right\} + \lambda\omega = 0,$$
which implies
$$\widehat{\omega} = \frac{1}{(n_+ + n_-)\lambda}\left\{n_+\bar{x}_+ - n_-\int x\,\mathbb{1}_{\{1 + x^T\widehat{\omega} + \widehat{\beta} > 0\}}\,dF_-(x)\right\}
= \frac{1}{(n_+ + n_-)\lambda}\left\{n_+\bar{x}_+ - n_-\int x\,dF_-(x \mid G)\,\mathrm{P}(G;\widehat{\omega},\widehat{\beta})\right\}$$
$$\approx \frac{n_+}{(n_+ + n_-)\lambda}\left\{\bar{x}_+ - \int x\,dF_-(x \mid G)\right\} = \frac{1}{(1 + m)\lambda}\left\{\bar{x}_+ - \int x\,dF_-(x \mid G)\right\}.$$

Classification boundaries for HDLSS data

The geometric representation in Hall et al. (2005) leads to some theoretical properties of several binary classifiers. In particular, as $d \to \infty$, the positive class and the negative class converge to an $(n_+ - 1)$-simplex and an $(n_- - 1)$-simplex with random rotation. Note that the (normalized) pairwise distances between observations within each class are the same, and the (normalized) distances between any two observations from two different classes are the same as well. The geometric representations of SVM and DWD from Hall et al. (2005) are summarized as follows.

1. SVM: It was shown that the linear SVM hyperplane, projected to the space spanned by the $N = n_+ + n_-$ data vectors, is given asymptotically by the unique $(N - 2)$-dimensional hyperplane in the $N$-polyhedron formed by the $N$ data vectors; there are $n_+ \times n_-$ such edges. Let $O_+$ be the centroid of the $(n_+ - 1)$-simplex $\mathcal{X}_+(d)$ and $O_-$ the centroid of the $(n_- - 1)$-simplex $\mathcal{X}_-(d)$. It can be further shown that the SVM hyperplane bisects the line segment between $O_+$ and $O_-$.

2. DWD: The case of DWD is a little different, especially when $n_+ \ll n_-$ (or $m \gg 1$). The DWD hyperplane intersects $O_+O_-$ at a point $P$. It can be shown that the two simplices, the DWD hyperplane, and the SVM hyperplane are all orthogonal to $O_+O_-$. Thus all the vertices in the simplex $\mathcal{X}_+$ are equally distant from the DWD hyperplane; this distance is denoted by $a$. Similarly, all the vertices in the simplex $\mathcal{X}_-$ are equally distant from the DWD hyperplane, by a distance $b$. The general-version DWD hyperplane minimizes the sum of the reciprocals of the distances of the data vectors to the hyperplane, $(n_+/a + n_-/b)$, with the constraint that $a + b$ equals a constant (determined by $\mu, \sigma, \tau, n_+, n_-$, and $d$). A simple calculus exercise reveals that $a/b = (n_+/n_-)^{1/2}$.
For the general FLAME case, we need to learn how the hyperplane moves from the point determined by $a/b = (n_+/n_-)^{1/2}$ on $O_+O_-$ to the midpoint of $O_+O_-$ as θ grows from 0 (DWD) to 1 (SVM). First, we consider the general version of FLAME, which seeks to minimize the sum of losses over all data points, $\sum (1/u - \theta\sqrt{C})_+$, where the functional margin $u$ is either $a$ or $b$ for samples from the positive or the negative class, respectively. When $\theta = 0$, the FLAME hyperplane is determined by $a/b = (n_+/n_-)^{1/2} = \sqrt{m^{-1}} < 1$, so that $b^0 > a^0$; that is, the hyperplane is closer to the minority class. We rename the distances $a^0$ and $b^0$, where the superscript "0" refers to the value of θ. When $\theta > 0$ but smaller than $1/(b^0\sqrt{C})$, the hyperplane does not move, because the loss for each data vector becomes $1/a - \theta\sqrt{C}$ or $1/b - \theta\sqrt{C}$, and both are greater than 0. The additional term $-\theta\sqrt{C}$ does not change the minimizer, and thus $a^\theta/b^\theta = (n_+/n_-)^{1/2}$ remains unchanged.

If we keep increasing θ so that it becomes greater than $1/(b^0\sqrt{C})$, then, if the hyperplane did not move, the loss for the majority class would become 0. In this case there is room for improvement: the hyperplane moves gradually towards the majority class, because this makes the loss on the minority class smaller while keeping the loss on the majority class zero. The FLAME hyperplane is then determined by $b = 1/(\theta\sqrt{C})$.

Finally, as θ increases, the distance $a$ increases and the distance $b$ decreases, until a point where $a = b$ and both $1/a - \theta\sqrt{C} = 1/b - \theta\sqrt{C} < 0$. After this point, a further increase of θ does not change the position of the FLAME hyperplane, which remains at the midpoint of $O_+O_-$.

Figure S.1: A 1D toy example with $n_+ < n_-$ and $m = 9$ is used to mimic the d-asymptotic situation. The length of the line segment $O_+O_-$ equals 1. As θ increases, the FLAME hyperplane first stands still ($|O_+P|$ unchanged); when $\theta > (1 + \sqrt{1/m})/\sqrt{C}$, $|O_+P|$ increases, which means the hyperplane moves towards the negative class, until $\theta = 2/\sqrt{C}$, after which the hyperplane remains at the midpoint of $O_+O_-$.

The derivation above assumes that the distance between the two simplices is reasonably large, at least greater than $2/\sqrt{C}$. This is not difficult to achieve because we choose $C$ to be a large number.

In summary, the intersection $P$ of the FLAME hyperplane and $O_+O_-$ stays closer to the minority class and remains still when θ is small. As θ increases, the boundary moves towards the majority class, until it reaches the midpoint of $O_+O_-$. This explains the simulation performance observed in Figures S.2, 6 and S.3. We use a toy example to show how the position of the FLAME hyperplane moves as θ increases in Figure S.1, in the same fashion as discussed above.

It is worth noting that the value of DWD/FLAME in terms of reducing overfitting is maximal when the dimension is greater than, but close to, the sample size. This is when data-piling starts to appear in SVM but not yet in DWD. Marron et al. (2007) showed some videos of this phenomenon. As a matter of fact, according to the geometric representation above, in the d-asymptotics the discriminant directions for most classifiers are the same. Moreover, the projections of the data points in the same class onto $O_+O_-$ coincide, and $O_+O_-$ gives the normal vector for the DWD, SVM and FLAME hyperplanes. Therefore, they all exhibit data-piling in the d-asymptotics.

Derivation of the FLAME hyperplane in d asymptotics

FLAME seeks to minimize
$$n_+\left(1/a - \theta\sqrt{C}\right)_+ + n_-\left(1/b - \theta\sqrt{C}\right)_+ \tag{S.2}$$
subject to
$$a + b = \sqrt{d\left(\mu^2 + \sigma^2/n_+ + \tau^2/n_-\right)} = \nu\sqrt{d}. \tag{S.3}$$

When $\theta \in \left[0,\ (1 + \sqrt{m^{-1}})/(\nu\sqrt{dC})\right)$, it is easy to verify that both $(1/a - \theta\sqrt{C})_+$ and $(1/b - \theta\sqrt{C})_+$ are positive and equal to $1/a - \theta\sqrt{C}$ and $1/b - \theta\sqrt{C}$. In this case, the optimal solutions of problem (S.2), $a^0$ and $b^0$, satisfy $a^0/b^0 = (n_+/n_-)^{1/2} = \sqrt{m^{-1}}$. In particular, $a^0 = \sqrt{m^{-1}}/(1 + \sqrt{m^{-1}})\,\nu\sqrt{d}$ and $b^0 = 1/(1 + \sqrt{m^{-1}})\,\nu\sqrt{d}$.

When $\theta \in \left[(1 + \sqrt{m^{-1}})/(\nu\sqrt{dC}),\ 2/(\nu\sqrt{dC})\right)$, we have $1/b^0 - \theta\sqrt{C} < 0$, and the solution satisfies $b = 1/(\theta\sqrt{C})$ and $a = \nu\sqrt{d} - b$. Note that $a$ now increases with θ while $b$ decreases.

When $\theta \in \left[2/(\nu\sqrt{dC}),\ 1\right]$, $a = b = 0.5\,\nu\sqrt{d}$.
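The three regimes can be visualized with a few lines of code. The sketch below (ours, not the code behind Figure S.1) computes the position $|O_+P|$ of the FLAME hyperplane on the segment $O_+O_-$ as a function of θ in the toy setting $m = 9$ and $|O_+O_-| = 1$.

```python
import numpy as np

def flame_offset(theta, C=100.0, m=9.0, length=1.0):
    """Position |O+P| of the FLAME hyperplane on the segment O+O- (of given length)
    in the d-asymptotic toy example, following the three regimes derived above."""
    s = np.sqrt(1.0 / m)                    # sqrt(m^{-1}) = sqrt(n+/n-)
    a0 = s / (1.0 + s) * length             # DWD position, a0/b0 = sqrt(n+/n-)
    t1 = (1.0 + s) / (length * np.sqrt(C))  # first threshold
    t2 = 2.0 / (length * np.sqrt(C))        # second threshold
    if theta < t1:
        return a0                               # hyperplane does not move
    elif theta < t2:
        return length - 1.0 / (theta * np.sqrt(C))   # b = 1/(theta*sqrt(C))
    return 0.5 * length                         # midpoint of O+O-

for theta in np.linspace(0.0, 0.3, 7):
    print(f"theta = {theta:.2f}  ->  |O+P| = {flame_offset(theta):.3f}")
```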
Proof of Theorem 6

We only need to prove the sure-classification result for the second interval, i.e., $\theta \in \left[(1 + \sqrt{m^{-1}})/(\nu\sqrt{dC}),\ 2/(\nu\sqrt{dC})\right)$. The proofs for the other two intervals are similar to those in Qiao et al. (2010).

It was shown in Hall et al. (2005) and Qiao et al. (2010) that the length of the line segment $O_+O_-$ is $\sqrt{d}\,\nu$, and that the distance between the projection (denoted $P'$) of a new data point from the $\mathcal{X}_+$-population onto $O_+O_-$ and the centroid $O_+$ of the positive class is $(\sigma^2/n_+)/(\mu^2 + \tau^2/n_-)$ times its distance to the centroid $O_-$ of the negative class, i.e., $|O_+P'|/|O_-P'| = (\sigma^2/n_+)/(\mu^2 + \tau^2/n_-)$, where $|AB|$ denotes the length of the line segment connecting points $A$ and $B$. Denote $|O_+P'|$ by $a'$ and $|O_-P'|$ by $b'$. Because $a' + b' = \sqrt{d}\,\nu$, we must have $b' = \sqrt{d}\,(\mu^2 + \tau^2/n_-)/\nu$. For this new data point to be correctly classified to the positive class, $P'$ has to lie on the same side as $O_+$ with respect to the intersection of the FLAME hyperplane with $O_+O_-$, that is, $b' > b$:
$$b' > b \iff \frac{\sqrt{d}\,(\mu^2 + \tau^2/n_-)}{\nu} > \frac{1}{\theta\sqrt{C}} \iff \mu^2 + \frac{\tau^2}{n_-} > \frac{\nu}{\theta\sqrt{dC}} \iff \nu^2 - \frac{\sigma^2}{n_+} > \frac{\nu}{\theta\sqrt{dC}}$$
$$\iff \nu^2 - \frac{\nu}{\theta\sqrt{dC}} - \frac{\sigma^2}{n_+} > 0 \iff \Big(\nu - \frac{1}{2\theta\sqrt{dC}}\Big)^2 - \frac{1}{4\theta^2 dC} - \frac{\sigma^2}{n_+} > 0$$
$$\impliedby \nu > \sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}$$
$$\iff \mu^2 > \Bigg[\sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}\Bigg]^2 - \frac{\sigma^2}{n_+} - \frac{\tau^2}{n_-} \iff \mu^2 > T - \frac{\tau^2}{n_-}.$$

We now assume that $P'$ is the projection of a new data point from the $\mathcal{X}_-$-population. In this situation, it can be shown that $a'/b' = (\mu^2 + \sigma^2/n_+)/(\tau^2/n_-)$, and thus $b' = \sqrt{d}\,(\tau^2/n_-)/\nu$. To correctly classify this new data point, we only need $b' < b$. That is,
$$\frac{\sqrt{d}\,\tau^2/n_-}{\nu} < b = \frac{1}{\theta\sqrt{C}} \iff \frac{\tau^2}{n_-} < \frac{\nu}{\theta\sqrt{dC}} = \frac{\sqrt{\mu^2 + \tau^2/n_- + \sigma^2/n_+}}{\theta\sqrt{dC}}.$$
It suffices to show that $\frac{\tau^2}{n_-} < \frac{\sqrt{\tau^2/n_- + \sigma^2/n_+}}{\theta\sqrt{dC}}$. Let $q = \tau^2/n_- + \sigma^2/n_+$. We need to show that
$$q - \frac{\sigma^2}{n_+} < \frac{\sqrt{q}}{\theta\sqrt{dC}} \iff \Big(\sqrt{q} - \frac{1}{2\theta\sqrt{dC}}\Big)^2 - \frac{1}{4\theta^2 dC} - \frac{\sigma^2}{n_+} < 0$$
$$\impliedby \sqrt{q} < \sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}$$
$$\iff \frac{\tau^2}{n_-} < \Bigg[\sqrt{\frac{1}{4\theta^2 dC} + \frac{\sigma^2}{n_+}} + \frac{1}{2\theta\sqrt{dC}}\Bigg]^2 - \frac{\sigma^2}{n_+} \iff \frac{\tau^2}{n_-} < T.$$
The last inequality is the condition stipulated in the theorem.

Additional figures

Figure S.2: Independent example. It can be seen that as FLAME turns from DWD to SVM (θ from 0 to 1), the within-class error decreases (top-left), thanks to the more accurate estimate of the intercept term (top-middle). On the other hand, this comes at the cost of a larger deviation from the Bayes direction (bottom-left), an incorrect rank of the importance of the variables (bottom-middle), and larger stochastic variability of the estimated directions (bottom-right). Panels show the within-group error, |β - true β|, the angle from the theoretical Bayes direction, RankComp, and the dispersion, each as a function of θ for m = 2, 3, 4, where m is the ratio of sample sizes, the true β is 0, and the dispersion is the trace of the sample covariance of the estimated directions.

Figure S.3: Block interchangeable example. It can be seen that as FLAME turns from DWD to SVM (θ from 0 to 1), the same trade-off as in the independent example of Figure S.2 is observed, with panels arranged as in Figure S.2 for m = 2, 3, 4.