Distance-weighted Support Vector Machine
Xingye Qiao†
Department of Mathematical Sciences, State University of New York, Binghamton, NY 13902-6000. E-mail: [email protected]
†Corresponding author.

Lingsong Zhang
Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: [email protected]
October 9, 2015
Abstract
A novel linear classification method that possesses the merits of both the Support Vector Machine (SVM) and Distance-weighted Discrimination (DWD) is proposed in this article. The proposed Distance-weighted Support Vector Machine (DWSVM) method can be viewed as a hybrid of SVM and DWD that finds the classification direction by minimizing mainly the DWD loss, and determines the intercept term in the SVM manner. We show that our method inherits the merits of DWD and hence overcomes the data-piling and overfitting issues of SVM. On the other hand, the new method is not subject to the imbalanced data issue, which was a main advantage of SVM over DWD. It uses an unusual loss which combines the hinge loss (of SVM) and the DWD loss through the trick of an auxiliary hyperplane. Several theoretical properties, including Fisher consistency and asymptotic normality of the DWSVM solution, are developed. We use simulated examples to show that the new method can compete with DWD and SVM in both classification performance and interpretability. A real data application further establishes the usefulness of our approach.

KEYWORDS: Discriminant analysis; Fisher consistency; Imbalanced data; High-dimensional, low-sample size data; Support Vector Machine.

1 Introduction
Classification is an important research topic in statistical machine learning and has many useful applications in various scientific and social research areas. In this article, we focus on the binary linear classification problem, in which a classification rule $\phi: \mathcal{X} \mapsto \mathcal{Y}$ is to be found that maps a point in $\mathcal{X}$ to a class label in $\mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{+1, -1\}$. We focus on linear classification methods instead of nonlinear ones because their simple formulations make them easy to interpret. In particular, each linear classification rule is associated with a linear discriminant function $f(x) = x^T\omega + \beta$, where the coefficient direction vector $\omega \in \mathbb{R}^d$ has unit $L_2$ norm and $\beta \in \mathbb{R}$ is the intercept term. The classification rule is then $\phi(x) = \mathrm{sign}(f(x))$; that is, the sample space $\mathbb{R}^d$ is divided into halves by the separating hyperplane $\{x: f(x) \equiv x^T\omega + \beta = 0\}$. The coefficient direction vector $\omega$ determines the orientation of the hyperplane (it is in fact the normal vector of the hyperplane), and the intercept term $\beta$ determines its location.

There is a large body of literature on linear classification; see Duda et al. (2001) and Hastie et al. (2009) for comprehensive introductions. Among many linear classification methods, the Support Vector Machine (SVM; Cortes and Vapnik, 1995; Vapnik, 1998; Cristianini and Shawe-Taylor, 2000) and the Distance-weighted Discrimination (DWD; Marron et al., 2007; Qiao et al., 2010) are two state-of-the-art instances and have received much attention. A brief review of these two methods is given in Section 2.

In the high-dimensional, low-sample size (HDLSS) data setting, a so-called "data-piling" phenomenon has been observed for SVM (Marron et al., 2007) and some other classifiers (for example, Ahn and Marron, 2010). Data-piling refers to the phenomenon that, after being projected onto the direction vector $\omega$ given by a linear classifier, a large portion of the data vectors pile upon each other and concentrate on two points. Data-piling reflects severe overfitting in the HDLSS data setting and indicates that the direction is driven by artifacts in the data; hence the direction, as well as the classification performance, can be stochastically volatile. Moreover, it turns out that the directions from these linear classification methods deviate substantially from the Bayes rule direction (when the Bayes rule exists and is linear). To this end, DWD was proposed largely to overcome the data-piling issue in the HDLSS setting and has been quite successful at that.

While DWD overcomes data-piling and mitigates the overfitting effect, it is sensitive to imbalanced sample sizes between the two classes (Qiao et al., 2010). In particular, when the sample size of one class is much greater than that of the other, the classification boundary is pushed towards the minority class and, consequently, all future data vectors are classified into the majority class.

[Figure 1 about here. Panel titles: (a) Bayes rule; (b) SVM, 67.61 degrees; (c) DWD, 39.87 degrees; (d) DWSVM, 40.23 degrees.]

Figure 1:
Plots of projections onto: (a) the true mean difference (Bayes rule) direction, (b) the SVM direction, (c) the DWD direction and (d) the proposed DWSVM direction. The angles (in degrees) between the last three directions and the first are shown in the panel titles. Projections of the separating hyperplanes of the different methods are depicted by the magenta vertical lines. Panel (a) shows the Bayes direction and the separating hyperplane to be compared with. SVM in Panel (b) demonstrates very good separation between the two classes, but severe data-piling also appears. The projected data vectors are nowhere near Gaussian, which suggests that the direction deviates too much from the Bayes direction in Panel (a). Panel (c) shows that DWD has no data-piling issue, and the projection plot preserves the Gaussian pattern. However, the separating hyperplane is pushed towards the red class because of its relatively small sample size. Our proposed DWSVM approach (Panel (d)) combines the merits of SVM and DWD: it preserves a good direction, as shown by the Gaussian pattern in the projections, while finding a good intercept term which is not subject to imbalanced sample sizes.

Qiao and Zhang (2013) thoroughly studied the high-dimensional overfitting issue of SVM and the imbalanced data issue of DWD. Moreover, they proposed a new family of classifiers, called FLAME, to which both SVM and DWD belong. To illustrate the main points of the data-piling and imbalanced data issues, we show projection plots of a toy example onto four different discriminant direction vectors in Figure 1. In this example, the data vectors from the two classes are generated from multivariate normal distributions $N_d(\pm\mu\mathbf{1}_d, I_d)$, where the dimension $d = 300$, $\mathbf{1}_d$ is a $d$-dimensional vector of all ones, $I_d$ is the $d \times d$ identity matrix, and the scalar $\mu$ is chosen so that the two classes are moderately separated. The Bayes rule in this example has direction $\omega_B = \mathbf{1}_d/\sqrt{d}$ and intercept $\beta_B = 0$. Here the sample size of the positive class (with $Y = +1$) is $n_+ = 200$ and the negative class sample size is $n_- = 50$.

Panel (a) in Figure 1 shows the true mean difference direction (which in fact is the Bayes direction) and the projections of the data vectors onto it. They serve as the benchmark to be compared with. Panel (b) is for the SVM direction, and it demonstrates a very dramatic separation between the two classes. This could be an alarm bell for overfitting; indeed, severe data-piling is visible. The projected data vectors are nowhere near Gaussian, which suggests that the direction deviates too much from the true direction in Panel (a). This deviation is also measured by the angle between the SVM direction and the Bayes direction (about 67 degrees, shown in the title). Panel (c) shows that DWD has no data-piling issue, and the projection plot preserves the Gaussian pattern, which means that there is some potential to interpret the data using the DWD direction. However, because the blue class (the positive class with $Y = +1$) has four times the sample size of the red class, the separating hyperplane is pushed towards the red class. Expectedly, its classification performance is not good.
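The data-piling phenomenon above is easy to reproduce. The following sketch (ours, not the authors' code) uses scikit-learn's LinearSVC as a stand-in for the SVM in the paper; the mean scaling is an assumption chosen to give a between-class Mahalanobis distance of 2.7, matching the simulation settings of Section 4. It reports the angle between the fitted SVM direction and the Bayes direction $\mathbf{1}_d/\sqrt{d}$, which is typically large in HDLSS settings.

```python
# Sketch of the Figure 1 toy setting: Gaussian classes N_d(+/- mu*1_d, I_d),
# d = 300, n+ = 200, n- = 50; fit a linear SVM and measure the angle between
# its direction and the Bayes direction 1_d/sqrt(d).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, n_pos, n_neg = 300, 200, 50
mu = 2.7 / (2.0 * np.sqrt(d))        # assumed scaling: 2*mu*||1_d|| = 2.7

X = np.vstack([rng.standard_normal((n_pos, d)) + mu,
               rng.standard_normal((n_neg, d)) - mu])
y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

svm = LinearSVC(C=1.0, loss="hinge", max_iter=50000).fit(X, y)
w = svm.coef_.ravel() / np.linalg.norm(svm.coef_)
w_bayes = np.ones(d) / np.sqrt(d)

angle = np.degrees(np.arccos(np.clip(w @ w_bayes, -1.0, 1.0)))
print(f"angle between SVM and Bayes directions: {angle:.1f} degrees")
```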
In this article, we propose a new method which integrates the merits of SVM and DWD, and thus addresses the data-piling issue and the imbalanced data issue at the same time. Our proposed method is named the Distance-weighted Support Vector Machine (DWSVM), in tribute to the two classical methods above. As shown in Panel (d) of Figure 1, DWSVM preserves a good direction, as shown by the Gaussian pattern in the projections, while finding a good intercept term which is not subject to the imbalanced sample sizes. In addition, we prove in theory that DWSVM is Fisher consistent and asymptotically normal, and that its intercept term is not sensitive to imbalanced sample sizes, as DWD's is.

The rest of the article is organized as follows. Section 2 gives a brief introduction to the SVM and DWD methods. Our DWSVM method is proposed in Section 3. Simulated examples and a real application are studied in Sections 4 and 5. Several theoretical results are given in Section 6. Some concluding remarks are made in Section 7.

2 Background: SVM and DWD

In this section, we give a brief introduction to SVM and DWD, their formulations, and a discussion of the roles of the different terms.
2.1 Binary classification and convex surrogate losses

In classification, one is given a training data set,
$\mathcal{D} \equiv \{(x_i, y_i) \in \mathcal{X} \otimes \mathcal{Y},\ i = 1, \ldots, n\}$, and the goal is to find a rule, $\phi(x) \equiv \mathrm{sign}(f(x))$, depending on $\mathcal{D}$, so that the classification error $E[\phi(X) \ne Y]$ is minimized. A natural estimate of the classification error is $n^{-1}\sum_{i=1}^n \mathbb{1}[\mathrm{sign}(f(x_i)) \ne y_i] = n^{-1}\sum_{i=1}^n \mathbb{1}[y_i f(x_i) < 0]$. However, even in the simple case of linear classification, where $f(x)$ is assumed to have the form $f(x) = x^T\omega + \beta$, searching for $(\omega, \beta)$ to minimize $\sum_{i=1}^n \mathbb{1}[y_i f(x_i) < 0]$ is intractable due to the discontinuity and nonconvexity of the objective function. In statistical learning, a common practice to avoid these issues is to use a convex surrogate function to approximate/upper-bound the 0-1 loss function $\mathbb{1}[yf(x) < 0]$. For any discriminant function $f(x)$, define the functional margin $u \equiv yf(x)$, which can be viewed as the signed distance (up to a constant) from the data point $x$ to the separating hyperplane $\{x: f(x) = 0\}$. A convex surrogate $\psi(u): \mathbb{R} \mapsto \mathbb{R}^+$ can be used in place of $\mathbb{1}[u < 0]$. For example, a classification rule can be obtained by
$$\min_{\omega, \beta} \sum_{i=1}^n \psi\left(y_i(x_i^T\omega + \beta)\right) + \lambda\|\omega\|^2.$$
Here, the first term in the objective function bounds the empirical classification error, and the $\|\omega\|^2$ term measures the complexity of the model; the choice of the tuning parameter $\lambda$ balances the two concerns. By standard optimization theory, this problem can equivalently be cast as $\min_{\omega,\beta} \sum_{i=1}^n \psi(y_i(x_i^T\omega + \beta))$, subject to $\|\omega\|^2 \le C$. Many classification methods fall into this category, such as the Support Vector Machine, AdaBoost (Freund and Schapire, 1997), and logistic regression (Friedman et al., 2000). See Bartlett et al. (2006) and the references therein for a more sophisticated discussion of convex loss functions and their implications for risk bounds.
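As a quick numeric illustration (ours, not from the paper), the hinge surrogate $(1-u)_+$ dominates the 0-1 loss pointwise, so the average surrogate loss upper-bounds the empirical classification error:

```python
import numpy as np

u = np.array([-1.5, -0.2, 0.1, 0.8, 2.0])   # functional margins u_i = y_i * f(x_i)
zero_one = (u < 0).astype(float)            # the intractable 0-1 loss
hinge = np.maximum(1.0 - u, 0.0)            # the convex surrogate psi(u) = (1-u)_+

assert np.all(hinge >= zero_one)            # psi upper-bounds the 0-1 loss pointwise
print(zero_one.mean(), hinge.mean())        # 0.4 <= 0.96
```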
2.2 Support Vector Machine (SVM)

By choosing the hinge loss function $(1-u)_+$ as the convex surrogate, where $(a)_+ \equiv \max(a, 0)$ is the positive part of $a$, the SVM method is defined to maximize the smallest distance of all observations to the separating hyperplane. Mathematically, for some positive $\lambda$, the optimization problem of SVM can be written as
$$\min_{\tilde\omega, \tilde\beta} \sum_{i=1}^n \left(1 - y_i(x_i^T\tilde\omega + \tilde\beta)\right)_+ + \lambda\|\tilde\omega\|^2.$$
Here, in addition to measuring the model complexity, $\|\tilde\omega\|$ also defines a notion of gap between the two classes for SVM. In particular, $2/\|\tilde\omega\|$ is the distance between the classes (up to a constant); hence, minimizing $\|\tilde\omega\|$ is the same as maximizing the gap between the classes. The notion of gap plays a central role in the derivation of the methods in this article.

The formulation above can be equivalently written as $\min_{\tilde\omega, \tilde\beta} \sum_{i=1}^n (1 - y_i(x_i^T\tilde\omega + \tilde\beta))_+$, subject to $\|\tilde\omega\|^2 \le C$. Here the coefficient vector $\tilde\omega$ does not necessarily have unit norm. We let $\omega = \tilde\omega/\sqrt{C}$ and $\beta = \tilde\beta/\sqrt{C}$. Then the SVM solution is given by
$$\mathop{\mathrm{argmin}}_{\omega, \beta} \sum_{i=1}^n \left(\sqrt{C} - Cy_i(x_i^T\omega + \beta)\right)_+, \quad \text{s.t. } \|\omega\| \le 1.$$
In other words, a rescaled hinge loss,
$$H_C(u) = \begin{cases} \sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 0 & \text{otherwise}, \end{cases} \qquad (1)$$
is used, such that SVM can be viewed as minimizing $\sum_{i=1}^n H_C(u_i)$, subject to $\|\omega\| \le 1$, where the functional margin for the $i$th datum is $u_i = y_i(x_i^T\omega + \beta)$. In order to align this formulation with that of DWD, we introduce slack variables $\xi_i$ and rewrite SVM as
$$\mathop{\mathrm{argmin}}_{\omega, \beta, \xi_i} \sum_{i=1}^n \xi_i, \qquad (2)$$
$$\text{s.t. } Cy_i(x_i^T\omega + \beta) + \xi_i \ge \sqrt{C},\ \xi_i \ge 0, \qquad (3)$$
$$\|\omega\| \le 1. \qquad (4)$$
2.3 Distance-weighted Discrimination (DWD)

The DWD method was proposed by Marron et al. (2007) to improve the performance of SVM in the HDLSS setting. It also maximizes a notion of gap between the classes: the harmonic mean of the distances of all data vectors to the separating hyperplane. Let $r_i = y_i(x_i^T\omega + \beta) + \eta_i$ be the (adjusted) distance of the $i$th data vector to the separating hyperplane. Mathematically, the solution of DWD is
$$\mathop{\mathrm{argmin}}_{\omega, \beta, \eta_i} \sum_{i=1}^n \left(\frac{1}{r_i} + C\eta_i\right), \qquad (5)$$
$$\text{s.t. } r_i = y_i(x_i^T\omega + \beta) + \eta_i,\ r_i \ge 0,\ \eta_i \ge 0, \qquad (6)$$
$$\|\omega\| \le 1. \qquad (7)$$
When $\eta_i = 0$ and $y_i(x_i^T\omega + \beta) > 0$, $r_i = y_i(x_i^T\omega + \beta)$ is the positive distance from each data vector to the separating hyperplane, due to (6). Thus $\sum_{i=1}^n 1/r_i$ defines a notion of gap between the classes different from that of SVM (which was $2/\|\omega\|$).

If a positive distance $y_i(x_i^T\omega + \beta)$ is not achievable for a data vector, then a positive slack variable $\eta_i$ is added to make $r_i$ positive. Note that the value of the correction $\eta_i$ corresponds to the amount of misclassification for the $i$th vector; hence, in order to minimize misclassification, we must control $\sum_{i=1}^n \eta_i$ in the objective function.

We will use this formulation and combine it with that of the SVM method in (2)–(4). Here, in order to understand the underlying DWD loss function for later use, we modify (5)–(7) as follows. For each $i$, the term $(1/r_i + C\eta_i)$ in the objective function can be minimized over $\eta_i$. Some algebraic manipulations reveal that the optimization problem (about $\omega$ and $\beta$) becomes
$$\mathop{\mathrm{argmin}}_{\omega, \beta} \sum_{i=1}^n V_C\left(y_i(x_i^T\omega + \beta)\right), \qquad (8)$$
$$\text{s.t. } \|\omega\| \le 1, \qquad (9)$$
where the DWD loss function is defined as
$$V_C(u) = \begin{cases} 2\sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 1/u & \text{otherwise}. \end{cases} \qquad (10)$$

One key observation is to be made here. There are two main tasks in a binary linear classification method:

1. a notion of gap, which is to be maximized so as to make the two classes more separated; and
2. a measure of misclassification, which is to be minimized.

Recall that in the SVM formulation, the notion of gap is $2/\|\omega\|$, and misclassification is measured by the hinge loss function; SVM jointly minimizes the sum of these two components to search for a solution. In contrast, the DWD loss function in (10) (derived from the objective function (5)) has two functionalities: the first term $\sum_i r_i^{-1}$ in (5), the sum of inverse distances, is a notion of gap, while the second term $\sum_{i=1}^n \eta_i$ in (5) measures misclassification. The constraint $\|\omega\| \le 1$ normalizes the direction vector so that the distances are well defined.
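To make the two building blocks concrete, the following sketch (ours, under the reconstruction of (1) and (10) above) implements both loss functions. Note that $V_C$ matches the linear piece in both value and slope at $u = 1/\sqrt{C}$ (both equal $\sqrt{C}$ and $-C$ there), so the DWD loss is continuously differentiable while the hinge loss is not.

```python
import numpy as np

def hinge_loss(u, C):
    """H_C(u) = sqrt(C) - C*u for u <= 1/sqrt(C), and 0 otherwise; see (1)."""
    u = np.asarray(u, dtype=float)
    return np.maximum(np.sqrt(C) - C * u, 0.0)

def dwd_loss(u, C):
    """V_C(u) = 2*sqrt(C) - C*u for u <= 1/sqrt(C), and 1/u otherwise; see (10)."""
    u = np.asarray(u, dtype=float)
    inv = np.divide(1.0, u, out=np.full_like(u, np.inf), where=u > 0)
    return np.where(u <= 1.0 / np.sqrt(C), 2.0 * np.sqrt(C) - C * u, inv)

u = np.linspace(-1.0, 2.0, 7)
print(hinge_loss(u, 1.0))   # exactly zero once the margin exceeds 1/sqrt(C)
print(dwd_loss(u, 1.0))     # keeps (mildly) penalizing even well-classified points
```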
3 Distance-weighted Support Vector Machine

In Section 3.1, we first introduce a method which can be intuitively viewed as a prototype of the hybridization between SVM and DWD. Our proposed main method is discussed in Section 3.2. Some explanations of our method are given in Section 3.3.

3.1 A naive hybrid: nDWSVM

Before we introduce the DWSVM method, we discuss an intuitive hybridization between SVM and DWD, which we call the naive DWSVM method (nDWSVM). Based on the previous discussion and other results in the literature, a linear classifier with a direction given by DWD and an intercept term found by SVM is desirable. However, naively matching a DWD direction and an SVM intercept together would be problematic, because the intercept would lose its context without the corresponding discriminant direction. Instead, we could train a DWD classifier on the data set, discard the DWD intercept, keep the DWD direction, and project all the data vectors onto the 1-dimensional DWD direction to obtain a set of 1-dimensional data points. Lastly, we find an intercept (a cutoff) by applying SVM to this 1-dimensional data set. Following this paradigm, we can get a DWD direction, which is thought to be better than an SVM direction in overcoming overfitting, and then, given this DWD direction, search for an intercept in an SVM manner so as to mitigate the imbalanced data issue. We name this two-step procedure nDWSVM. The nDWSVM method is a simple prototype of DWSVM, in which the DWD component and the SVM component are trained separately.

3.2 Distance-weighted Support Vector Machine (DWSVM)

In this subsection, we formally define the Distance-weighted Support Vector Machine (DWSVM), which improves on nDWSVM. DWSVM simultaneously minimizes both the SVM loss function and the DWD loss function to identify a common discriminant direction. The less imbalance-sensitive, SVM-driven intercept term is then used to identify the location of the optimal separating hyperplane. Mathematically, the optimization problem can be written as follows. Let $C_{dwd}, C_{svm} > 0$ and $\alpha \in [0, 1]$:
$$\mathop{\mathrm{argmin}}_{\omega, \beta_1, \beta_2, \xi_i, \eta_i} \sum_{i=1}^n \left\{\alpha\left(\frac{1}{r_i} + C_{dwd}\cdot\eta_i\right) + (1-\alpha)\xi_i\right\}, \qquad (11)$$
$$\text{s.t. } r_i = y_i(x_i^T\omega + \beta_2) + \eta_i,\ r_i \ge 0,\ \eta_i \ge 0, \qquad (12)$$
$$C_{svm}\,y_i(x_i^T\omega + \beta_1) + \xi_i \ge \sqrt{C_{svm}},\ \xi_i \ge 0, \qquad (13)$$
$$\|\omega\| \le 1. \qquad (14)$$
Importantly, in the end, we let $f_1(x) \equiv x^T\omega + \beta_1$ and use $\mathrm{sign}(f_1(x)) = \mathrm{sign}(x^T\omega + \beta_1)$ as the classification rule, instead of $\mathrm{sign}(x^T\omega + \beta_2)$. Thus $\omega$ and $\beta_1$ are the only two variables that really participate in classifying future data vectors, while $\beta_2$ is not involved. However, this does not mean that $\beta_2$ is of no significance; we will elaborate on this point later.

Comparing (11)–(14) with (2)–(4) and (5)–(7), we can see that the first term in (11) and the constraint (12) are similar to (5) and (6), while the second term in (11) and the constraint (13) are similar to (2) and (3).

Figure 2:
The main separating hyperplane (magenta solid line) and the auxiliary hyperplane (magenta dashed line) for DWSVM applied to a two-dimensional toy example. The distance from each data vector to the main hyperplane is depicted as a dotted line segment, while the distance to the auxiliary hyperplane is depicted as a dot-dashed line segment. A positive $\eta_i$ is added to each negative functional margin $y_i(x_i^T\omega + \beta_2)$, $i = 22, 25$, to make the sum positive.
Thus we can write the DWSVM formulation (11)–(14) as
$$\mathop{\mathrm{argmin}}_{\omega, \beta_1, \beta_2} \sum_{i=1}^n \left\{\alpha V_{C_{dwd}}\left(y_i(x_i^T\omega + \beta_2)\right) + (1-\alpha)H_{C_{svm}}\left(y_i(x_i^T\omega + \beta_1)\right)\right\}, \qquad (15)$$
$$\text{s.t. } \|\omega\| \le 1. \qquad (16)$$
One might think that our DWSVM is just an optimization problem whose objective function equals a weighted average of the DWD loss and the SVM loss. However, it is more sophisticated than that. In the next subsection, we give some explanations of the different components and parameters in DWSVM to help understand the new method.
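To make (15) concrete, the following sketch (an illustration under our reconstruction of the losses in (1) and (10), not the authors' solver; in practice (11)–(14) resembles the second-order cone programs used for DWD and would be handed to a conic solver) evaluates the DWSVM objective for a candidate $(\omega, \beta_1, \beta_2)$, with the constraint (16) enforced by rescaling:

```python
import numpy as np

def dwsvm_objective(w, beta1, beta2, X, y, C_d=100.0, C_s=100.0, alpha=0.5):
    """Objective (15): alpha * V_{C_d} at the auxiliary margins plus
    (1 - alpha) * H_{C_s} at the main margins, with ||w|| <= 1."""
    w = np.asarray(w, dtype=float)
    w = w / max(np.linalg.norm(w), 1.0)    # enforce the constraint (16)
    u_main = y * (X @ w + beta1)           # margins w.r.t. the main hyperplane f1
    u_aux = y * (X @ w + beta2)            # margins w.r.t. the auxiliary hyperplane f2
    hinge = np.maximum(np.sqrt(C_s) - C_s * u_main, 0.0)
    inv = np.divide(1.0, u_aux, out=np.full_like(u_aux, np.inf), where=u_aux > 0)
    dwd = np.where(u_aux <= 1.0 / np.sqrt(C_d),
                   2.0 * np.sqrt(C_d) - C_d * u_aux, inv)
    return np.sum(alpha * dwd + (1.0 - alpha) * hinge)
```

Note that $\beta_2$ enters only through the DWD term and $\beta_1$ only through the hinge term, which is exactly the division of labor discussed next.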
3.3 Understanding DWSVM

Two hyperplanes

First of all, there are two intercept terms, $\beta_1$ and $\beta_2$, and only one direction vector $\omega$ in the DWSVM method; that is, there are two hyperplanes that are parallel to each other, $\{x: x^T\omega + \beta_1 = 0\}$ and $\{x: x^T\omega + \beta_2 = 0\}$. For convenience, we call them the main hyperplane and the auxiliary hyperplane, respectively, with corresponding discriminant functions $f_1 \equiv x^T\omega + \beta_1$ and $f_2 \equiv x^T\omega + \beta_2$. See Figure 2 for an illustration using a two-dimensional toy example. In the plot, the magenta solid line is the main hyperplane and the magenta dashed line is the auxiliary hyperplane.

Auxiliary hyperplane
Note that $f_2$ is involved in the definition of $r_i$, the adjusted distance of a data vector to the auxiliary hyperplane, shown as dot-dashed line segments in Figure 2. Similar to its role in DWD, $\sum_{i=1}^n (1/r_i)$ controls the gap between the two classes: the smaller $\sum_{i=1}^n (1/r_i)$ is, the more separated the two classes are.

In words, the purpose of the auxiliary hyperplane is not to classify data vectors, but to make it possible to define a number of distances (from the data vectors to itself) so that we can minimize the sum of the inverse distances. In ordinary DWD, this auxiliary hyperplane has to coincide with the hyperplane that is actually used for classification. Here we allow some flexibility, so that it is free of this restriction.

Necessity of the slack variable $\eta_i$
When $y_i f_2(x_i) < 0$, the (signed) distance from the data vector to the auxiliary hyperplane is negative. In this case, a positive $\eta_i$ is added to $y_i f_2(x_i)$ to make their sum $r_i$ positive. For example, in Figure 2, the functional margins $y_i f_2(x_i)$, $i = 22, 25$, are negative, and the DWSVM optimization adds positive $\eta_i$'s to make the sums $r_i = y_i f_2(x_i) + \eta_i$ positive. It is the sum of the inverses of the $r_i$ that we minimize, instead of the sum of the inverses of the signed distances $y_i f_2(x_i)$. This adjustment is necessary: otherwise, one could always send $\beta_2$ to infinity, i.e., push the auxiliary hyperplane infinitely far from the data, so that all the distances $y_i f_2(x_i)$ are infinite (positive or negative) and hence $1/(y_i f_2(x_i)) = 0$. This is certainly not a desired situation, because it would make the direction vector trivial (the minimum of the objective function would always be 0 regardless of the choice of the direction). For these reasons, the addition of $\eta_i$ and the inclusion of $\sum_{i=1}^n \eta_i$ in the objective function are necessary to make the optimization problem meaningful.

Slack variable $\eta_i$ does not measure misclassification

In the original DWD, the reason to minimize $\sum_{i=1}^n \eta_i$ is to control misclassification. However, the slack variable $\eta_i$ here is defined with respect to the auxiliary hyperplane (which is not used for classification), rather than the main separating hyperplane. Thus, we have liberated the DWD component from the burden of controlling misclassification, so that it can focus on defining the notion of gap and help search for an optimal direction vector in the DWD fashion, which overcomes overfitting.

Slack variable $\xi_i$

Last of all, the second term $\xi_i$ in (11) is a proxy for the modified hinge loss function of SVM in (1). This term is included for the purpose of controlling misclassification, because $\xi_i$ can be seen as $(\sqrt{C_{svm}} - C_{svm}u_i)_+$, where $u_i$ is the functional margin $y_i f_1(x_i)$ with respect to the main hyperplane. Minimizing the sum of the $\xi_i$'s helps to increase the functional margins $u_i$. Note that the functional margin $u_i = y_i f_1(x_i)$ can be interpreted as the distance to the main hyperplane (instead of the auxiliary one), which is ultimately the hyperplane used for classifying new data.
Summary

In summary, the hyperplane defined by $\omega$ and $\beta_2$ is an auxiliary hyperplane which is useful for finding the best direction, and the one defined by $\omega$ and $\beta_1$ is the main hyperplane which is useful for searching for the intercept and for good classification performance. By the trick of allowing two intercept terms, we gain some flexibility and let the two hyperplanes each do their own job.

Empirically, nDWSVM can be used to approximate DWSVM, especially for low to moderate dimensions. Moreover, nDWSVM is very easy to implement, so long as the user has accessible implementations of both SVM and DWD (both are now available in R and MATLAB). The differences between DWSVM and nDWSVM are that in the two-step prototype nDWSVM, the direction is determined only by the DWD algorithm, and the intercept is found by SVM based on the projections onto the DWD direction, whereas in DWSVM, the auxiliary hyperplane (concerning DWD) and the main hyperplane (concerning SVM) work together to find the optimal direction, and the optimization is done all at once.

Between DWD and DWSVM, the latter inherits the direction of the former and adopts a very effective intercept term from its SVM component. Compared with SVM, the DWSVM method has a direction that is much improved due to the DWD component.
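For readers who want to try the two-step prototype, here is a sketch of nDWSVM. It assumes only that `fit_dwd` is some available DWD implementation returning a direction vector (e.g., a wrapper around the R or MATLAB packages mentioned above, supplied by the user); the 1-dimensional SVM step uses scikit-learn:

```python
import numpy as np
from sklearn.svm import LinearSVC

def ndwsvm(X, y, fit_dwd, C_svm=100.0):
    """Two-step nDWSVM: DWD direction, then an SVM cutoff on the projections."""
    w = np.asarray(fit_dwd(X, y), dtype=float)   # step 1: DWD direction
    w = w / np.linalg.norm(w)                    # (the DWD intercept is discarded)
    proj = (X @ w).reshape(-1, 1)                # step 2: project the data onto w
    svm1d = LinearSVC(C=C_svm, loss="hinge", max_iter=100000).fit(proj, y)
    a, b = svm1d.coef_[0, 0], svm1d.intercept_[0]   # 1-d SVM rule: sign(a*t + b)
    s = np.sign(a)                               # a != 0 whenever the projections separate at all
    return s * w, s * (b / a)                    # classify x by sign(x @ w_out + beta_out)
```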
4 Simulation Studies

In this section, we first compare the classification performance and the interpretability of the DWSVM approaches with those of the original SVM and DWD. The classification performance is measured by the misclassification rate on a large test data set with 4000 observations. Interpretability is a concept that is more or less vague; we partially measure it by the angle between the discriminant direction vector of the classifier under investigation and that of the Bayes classifier. We believe that the closer a direction is to the Bayes rule direction, the better the interpretability of the linear classifier.

4.1 Comparison with SVM and DWD

We consider two different simulation settings. In each setting, samples from the two classes are generated from multivariate normal distributions $N_d(\pm\mu, \Sigma)$.

1. Example 1: Constant mean difference, identity covariance matrix. Here $\mu \equiv c\mathbf{1}_d$ and $\Sigma \equiv I_d$, where $c > 0$ satisfies $2c\|\mathbf{1}_d\| = 2.7$. This corresponds to the Mahalanobis distance between the two classes and represents a reasonable difficulty of classification using the Bayes rule.
2. Example 2: Decreasing mean difference, block-diagonal exchangeable covariance matrix. Here we let $\mu \equiv c\mathbf{v}_d$, where $\mathbf{v}_d = (\sqrt{\,\cdot\,}, \sqrt{\,\cdot\,}, \ldots, \sqrt{\,\cdot\,}, 0, 0, \ldots, 0)^T \in \mathbb{R}^d$ is a vector whose leading entries are a decreasing sequence of square roots and whose remaining entries are zero, and $\Sigma \equiv \mathrm{Block\text{-}Diag}\{\Sigma_0, \Sigma_0, \ldots, \Sigma_0\}$, where each $\Sigma_0$ is a $50 \times 50$ exchangeable sub-covariance matrix whose diagonal entries are all 1 and whose off-diagonal entries are 0.8. The scaling factor $c$ is chosen to make the Mahalanobis distance $\{(2c\mathbf{v}_d)^T\Sigma^{-1}(2c\mathbf{v}_d)\}^{1/2} = 2.7$.

We vary the dimension $d$ among 100, 200, 300, 500 and 1000; the last three cases correspond to the HDLSS data settings.
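A sketch of these data-generating mechanisms follows (under our reconstruction; the exact entries of $\mathbf{v}_d$ in Example 2 are not recoverable from the text, so the mean direction is left as an input):

```python
import numpy as np

def make_example1(n_pos, n_neg, d, delta=2.7, rng=None):
    """Example 1: mu = c*1_d, Sigma = I_d, with 2*c*||1_d|| = delta."""
    rng = rng or np.random.default_rng()
    mu = (delta / (2.0 * np.sqrt(d))) * np.ones(d)
    X = np.vstack([rng.standard_normal((n_pos, d)) + mu,
                   rng.standard_normal((n_neg, d)) - mu])
    return X, np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

def block_exchangeable_sigma(d, block=50, rho=0.8):
    """Example 2 covariance: block-diagonal, 50 x 50 exchangeable blocks."""
    B = rho * np.ones((block, block)) + (1.0 - rho) * np.eye(block)
    S = np.zeros((d, d))
    for k in range(d // block):
        S[k*block:(k+1)*block, k*block:(k+1)*block] = B
    return S

def mahalanobis_scale(v, Sigma, delta=2.7):
    """Return c such that {(2c v)' Sigma^{-1} (2c v)}^{1/2} = delta."""
    return delta / (2.0 * np.sqrt(v @ np.linalg.solve(Sigma, v)))
```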
Figure 3:
Comparison between four methods for Example 1 (the left panel) and Example 2 (the right panel). The misclassification error rates are shown on the top row and the angles between the classification directions and the Bayes direction are shown on the bottom. For Example 1 (left), for smaller dimensions, the two DWSVM approaches are better than SVM and DWD in terms of classification. For a large dimension, SVM outperforms the nDWSVM approach. The one-step DWSVM approach dominates all the other approaches in terms of classification performance. In terms of the interpretability (bottom) in Example 1, the two DWSVM approaches and the DWD approach all give similar and better results than the SVM approach. For Example 2 (right), DWSVM has similarly good classification performance to SVM and similarly good interpretability performance to DWD and nDWSVM. For small and moderate dimensions, the classification performance of DWSVM is significantly better than SVM.
4.1.1 Example 1

In the top-left panel of Figure 3, we report the misclassification errors of DWSVM, nDWSVM, DWD and SVM applied to a test data set with 2000 data points in each class, generated according to the Constant mean difference, identity covariance matrix example. We conduct the simulation 100 times and report the averages of the measurements. Our DWSVM approach uniformly gives the best classification results. The two-step alternative nDWSVM has very similar performance for dimensions 100, 200 and 500, but its performance degrades for higher dimensions. For all dimensions, unsurprisingly, the original DWD has a misclassification rate close to 50%, which is largely due to its intercept term being subject to the imbalanced data.

In the bottom-left panel of Figure 3, we calculate the angles between the directions from the different classifiers and the Bayes direction (for both simulation settings in this article, the Bayes classifiers are linear and the Bayes directions are well defined). It shows that all the DWD-related classifiers give very similar angles. As a matter of fact, the angles from DWSVM, nDWSVM and DWD almost overlap with each other in this plot, except for the low-dimensional case, where the DWSVM angle is a bit larger than the other two. On the other hand, the SVM directions differ from the Bayes direction significantly more than the DWD-family directions do.

The observations so far verify the conjecture that DWD is worse in misclassification rate and SVM is worse at giving an interpretable classification direction. DWSVM and nDWSVM appear to be able to address both issues simultaneously.

In the simulations, we tune the parameter $C_{svm}$ for SVM from a grid of possible values $2^{-5}, 2^{-4}, \ldots, 2^{10}$, and choose the one which gives rise to the smallest misclassification rate on a tuning data set that is identical to the training data set in terms of sample size and underlying distributions. For the DWD family of classifiers (DWSVM, nDWSVM and DWD), we let $C_{dwd}$ be 100 divided by a scaling factor that accounts for the scale of the data, as recommended by Marron et al. (2007). We fix $C_{svm} = 100$ for DWSVM and nDWSVM. Lastly, we let $\alpha = 0.5$.

4.1.2 Example 2

We have conducted the same comparison for the Decreasing mean difference, block-diagonal exchangeable covariance matrix example (Example 2), and the results are shown in the right panel of Figure 3. This time, the classification performances of DWSVM and SVM closely compete with each other. For small and moderate dimensions, DWSVM classifies significantly better than SVM; for $d = 500$ and 1000, its classification error rates are slightly greater than SVM's (not statistically significantly so). In terms of the angles between the classification direction vectors and the Bayes direction, the DWSVM directions are similar to those from nDWSVM and DWD, while all three are better than SVM. For the highest-dimensional case, all four directions differ considerably from the Bayes direction; however, the DWSVM direction is the best in this situation.

4.2 Sensitivity to the tuning parameters

In this subsection, we study the impact of different parameter values on DWSVM. First, we use ordinary DWD to search for an optimal choice of the $C_{dwd}$ parameter and fix its value in the sequel; in particular, we adopt the recommendation for $C_{dwd}$ in Marron et al. (2007). We choose not to pursue $C_{dwd}$ further because this parameter has been well studied for ordinary DWD by Marron et al. (2007) and for weighted DWD by Qiao et al. (2010).
Here, we use simulation to illustrate the sensitivity of the DWSVM method to different choices of the other two parameters, $C_{svm}$ and $\alpha$. We applied DWSVM to 100 simulations from the simulated examples defined above (Example 1 and Example 2), using the following schedules:

• for fixed $\alpha = 0.5$, various values of $C_{svm} = 2^{-5}, 2^{-4}, \ldots, 2^{10}$;

• for fixed $C_{svm} = 100$, various values of $\alpha = 0, 0.1, 0.2, \ldots, 1$.

The right panels of Figure 4 show the test errors against $\alpha$. It is very clear that in these settings (where $C_{svm} = 100$), the performance of DWSVM does not depend on the value of $\alpha$, as all the curves appear to be horizontal straight lines. It may be too early to conclude from this observation that the performance of DWSVM is independent of $\alpha$, since it could be due to the fact that $C_{svm} = 100$ happens to be a reasonably good parameter (see the discussion below). But it does suggest that the performance is influenced less by the $\alpha$ parameter than by the other parameters.
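A sketch of these two schedules follows, assuming a DWSVM solver with the (hypothetical) interface `fit_dwsvm(X, y, C_dwd, C_svm, alpha) -> (w, beta1)`; any implementation of (11)–(14) would do:

```python
import numpy as np

def test_error(w, beta1, X_test, y_test):
    return float(np.mean(np.sign(X_test @ w + beta1) != y_test))

def sensitivity_sweep(fit_dwsvm, X, y, X_test, y_test, C_dwd=100.0):
    """Test errors over the two schedules: C_svm on a log2 grid with
    alpha = 0.5, and alpha on a 0-to-1 grid with C_svm = 100."""
    errs_C = {k: test_error(*fit_dwsvm(X, y, C_dwd, 2.0 ** k, 0.5), X_test, y_test)
              for k in range(-5, 11)}
    errs_a = {round(a, 1): test_error(*fit_dwsvm(X, y, C_dwd, 100.0, a), X_test, y_test)
              for a in np.arange(0.0, 1.01, 0.1)}
    return errs_C, errs_a
```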
In the left panels of Figure 4, we do the same for different values of $C_{svm}$, given $\alpha = 0.5$. A similar message can be obtained, although under a restrictive condition: for Example 1, the curves appear to be flat once $C_{svm}$ is not too small, so any value in this range should work reasonably well. For Example 2, the optimal $C_{svm}$ lies in a narrow range of small values; however, even its performance is not significantly better than that of greater $C_{svm}$ values. Overall, it seems that as long as the value of $C_{svm}$ is not too small, the classification performance will be close to optimal.
Figure 4:
Left panels: test errors of DWSVM applied to Example 1 and Example 2 (with $d = 300$ and $\alpha = 0.5$) for different values of $C_{svm}$ over 100 runs. The plots show that for Example 1, any $C_{svm}$ beyond a small threshold leads to similar classification performance, while for Example 2 a narrow range of small $C_{svm}$ values is the best, although the performance of such parameter choices is not very different from that of even greater $C_{svm}$ values. Right panels: test errors of DWSVM applied to Example 1 and Example 2 (with $d = 300$ and $C_{svm} = 100$) for different values of $\alpha$ over 100 runs. The performance does not depend much on the choice of the $\alpha$ value.

This is the reason why we fix the values of $\alpha$ and $C_{svm}$ to be 0.5 and 100, respectively, in the comparison study conducted in the previous section. Users are free to grid-search the values of $C_{svm}$ and $\alpha$ if they wish, although it seems that the effort for the latter is not worthwhile.

5 Real Data Application

In this section, we compare DWSVM with the competing classifiers by applying them to the Golub data set (Golub et al., 1999). This gene expression data set has 3051 genes and 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). Pre-processing was done as described in Dudoit et al. (2002).
Figure 5:
Cross-validated numbers of misclassified observations for SVM, DWD, DWSVM and nDWSVM for the original Golub data set, the Golub data with one pair of mislabeled observations, and the data with two pairs of mislabeled observations. For the original data, both the SVM and the DWSVM methods have CV error almost 0, with DWSVM being a little better. When there are mislabeled observations, the advantage of DWSVM becomes more obvious: it can be seen that DWSVM has the smallest CV errors, while nDWSVM is on a par with SVM. The DWD classifier is always worse than the others in terms of classification performance.

As there are 11 and 27 observations in the two classes, we expect the SVM and DWSVM classifiers to give better results than DWD, because the latter is subject to the imbalanced sample sizes. Moreover, because the dimension is much higher than the sample size, we expect severe overfitting in this data set. We apply SVM, DWD, DWSVM and nDWSVM to the data set and use 3-fold cross-validation to find the best $C_{svm}$ tuning parameter value; the $C_{dwd}$ and $\alpha$ values are fixed. In the left panel of Figure 5, we report the average cross-validated (CV) number of misclassified observations and the standard error over 100 random foldings. Both SVM and DWSVM give very good results (CV error almost zero), although the DWSVM method is a little better. The nDWSVM error is almost twice that of SVM, and the DWD error is almost four times as large.

In order to see the extent to which our DWSVM avoids overfitting, we perturb the original data set as follows. We randomly switch the class labels of $k$ pairs of observations ($k$ observations from each class), $k = 1, 2$, and repeat the comparison. As shown in Figure 5, DWSVM then has the smallest CV errors, while nDWSVM is on a par with SVM and DWD remains the worst.
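A sketch of this cross-validation experiment follows (SVM would be scikit-learn's LinearSVC; DWD and DWSVM stand in as hypothetical wrappers exposing the same fit/predict interface):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def flip_k_pairs(y, k, rng):
    """Randomly switch the labels of k observations from each class."""
    y = y.copy()
    pos = rng.choice(np.where(y == 1)[0], size=k, replace=False)
    neg = rng.choice(np.where(y == -1)[0], size=k, replace=False)
    y[pos], y[neg] = -1, 1
    return y

def cv_misclassified(clf, X, y, n_folds=3, seed=0):
    """Number of CV-misclassified observations for one random folding."""
    total = 0
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        clf.fit(X[train], y[train])
        total += int(np.sum(clf.predict(X[test]) != y[test]))
    return total
```

Averaging `cv_misclassified` over 100 random foldings, before and after `flip_k_pairs` with $k = 1, 2$, reproduces the kind of quantities reported in Figure 5.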
6 Theoretical Properties

We will show some theoretical properties of DWSVM in three different flavors. First, we derive the Fisher consistency of the DWSVM loss function; note that the loss function of DWSVM is not a typical large-margin loss function. Second, we derive the asymptotic normality of the DWSVM coefficient vector. Third, we show that the intercept of DWSVM does not diverge, even in an extremely imbalanced setting.

6.1 Fisher consistency

The DWSVM method can be estimated from equations (15)–(16). Thus the underlying loss function can be written as $L(yf_1(x), yf_2(x)) = \alpha V_{C_{dwd}}(yf_2(x)) + (1-\alpha)H_{C_{svm}}(yf_1(x))$. Because there are two functions involved, the underlying loss function is not a traditional margin-based loss function, which involves only one function, such as that considered in Lin (2004). Moreover, the two hyperplanes implied by $f_1$ and $f_2$ in our method are parallel to each other. In general cases (beyond linear functions), this can be interpreted as the difference of these two functions being a constant, i.e., $f_1(x) - f_2(x)$ is independent of $x$. Theorem 1 below shows the Fisher consistency of the DWSVM loss function.

Theorem 1. For any given $C_{svm}, C_{dwd} > 0$ and $\alpha \in [0, 1]$, if $E[L\{Yf_1(X), Yf_2(X)\}]$ has a global minimizer $(f_1^*(x), f_2^*(x))$ subject to $f_1(x) - f_2(x)$ being a constant, then $\mathrm{sign}[f_1^*(x)] = \mathrm{sign}[q(x) - 1/2]$, where $q(x) \equiv P(Y = +1 \mid X = x)$.

Fisher consistency of the DWSVM loss function ensures that the sign of the minimizer of the expected loss function (subject to the parallel condition) coincides with the Bayes rule.
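Theorem 1 can also be checked numerically on a grid (a sanity-check sketch under the reconstructed losses (1) and (10); this is not part of the proof): for each fixed constant difference $\delta = f_1 - f_2$, the risk-minimizing $f_1$ carries the sign of $q(x) - 1/2$.

```python
import numpy as np

def H(u, C):   # rescaled hinge loss (1)
    return max(np.sqrt(C) - C * u, 0.0)

def V(u, C):   # DWD loss (10)
    return 2.0 * np.sqrt(C) - C * u if u <= 1.0 / np.sqrt(C) else 1.0 / u

def fisher_check(q, delta, C=4.0, alpha=0.5):
    """Minimize the conditional risk over f1 (with f2 = f1 - delta) on a grid
    and verify sign(f1*) = sign(q - 1/2)."""
    def risk(f1):
        f2 = f1 - delta
        return (q * (alpha * V(f2, C) + (1 - alpha) * H(f1, C))
                + (1 - q) * (alpha * V(-f2, C) + (1 - alpha) * H(-f1, C)))
    grid = np.linspace(-3.0, 3.0, 1201)
    f1_star = min(grid, key=risk)
    return np.sign(f1_star) == np.sign(q - 0.5)

print(all(fisher_check(q, d) for q in (0.1, 0.3, 0.7, 0.9) for d in (-0.5, 0.0, 0.5)))
```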
6.2 Asymptotic normality

Koo et al. (2008) studied the asymptotic normality of the coefficient vector of the SVM classifier. We follow the same direction and prove the corresponding results for the DWSVM classifier.

For ease of presentation, we let $\omega_+$ denote the augmented parameter vector $(\beta_2, \beta_1, \omega^T)^T \in \mathbb{R}^{d+2}$, and $x_+$, $x_\dagger$ and $x_\ddagger$ the augmented data vectors $(0, 1, x^T)^T$, $(1, 0, x^T)^T$ and $(1, 1, x^T)^T \in \mathbb{R}^{d+2}$. Consequently, the main discriminant function is $f_1(x; \omega_+) \equiv x_+^T\omega_+ = x^T\omega + \beta_1$, and the auxiliary discriminant function is $f_2(x; \omega_+) \equiv x_\dagger^T\omega_+ = x^T\omega + \beta_2$.

We cast DWSVM as an optimization problem with an unconstrained objective function,
$$q_{\lambda,n}(\omega_+) \equiv \frac{1}{n}\sum_{i=1}^n L(x_i, y_i, \omega_+) + \lambda\|\omega\|^2 \qquad (17)$$
$$= \frac{1}{n}\sum_{i=1}^n \left\{\alpha V_{C_d}(y_i f_2(x_i; \omega_+)) + (1-\alpha)H_{C_s}(y_i f_1(x_i; \omega_+))\right\} + \lambda\|\omega\|^2. \qquad (18)$$
The solution to the optimization problem can be scaled by the norm of $\omega$ so as to make it have unit norm.

The population version of (18) without the penalty term is defined as $Q(\omega_+) \equiv E\{\alpha V_{C_d}(Yf_2(X;\omega_+)) + (1-\alpha)H_{C_s}(Yf_1(X;\omega_+))\}$, whose minimizer is $\omega_+^* \equiv \mathop{\mathrm{argmin}}_{\omega_+} Q(\omega_+)$.

For ease of presentation, let
$$g(x,y,\omega_+) \equiv \alpha\left(-\mathbb{1}\{yf_2(x;\omega_+) \le 1/\sqrt{C_d}\}C_d - \mathbb{1}\{yf_2(x;\omega_+) > 1/\sqrt{C_d}\}/[yf_2(x;\omega_+)]^2\right),$$
$$h(x,y,\omega_+) \equiv (1-\alpha)\left(-\mathbb{1}\{yf_1(x;\omega_+) \le 1/\sqrt{C_s}\}C_s\right),$$
$$v(x,y,\omega_+) \equiv \alpha\left(\mathbb{1}\{yf_2(x;\omega_+) > 1/\sqrt{C_d}\}\cdot 2/[yf_2(x;\omega_+)]^3\right),$$
$$w(x,y,\omega_+) \equiv (1-\alpha)\,\delta\left(1/\sqrt{C_s} - yf_1(x;\omega_+)\right)C_s,$$
where $\delta(\cdot)$ denotes the Dirac delta function. Furthermore, let $S(\omega_+) \equiv E\{g(X,Y,\omega_+)YX_\dagger + h(X,Y,\omega_+)YX_+\}$ and $U(\omega_+) \equiv E\{v(X,Y,\omega_+)X_\dagger X_\dagger^T + w(X,Y,\omega_+)X_+X_+^T\}$. Let $\Omega(X_i,Y_i,\omega_+^*) = \mathrm{diag}\{g(X_i,Y_i,\omega_+^*),\, h(X_i,Y_i,\omega_+^*),\, [g(X_i,Y_i,\omega_+^*) + h(X_i,Y_i,\omega_+^*)]I_d\}$, where $I_d$ is the $d \times d$ identity matrix. Then define
$$T_n \equiv \sum_{i=1}^n\left\{g(X_i,Y_i,\omega_+^*)Y_i(X_i)_\dagger + h(X_i,Y_i,\omega_+^*)Y_i(X_i)_+\right\} = \sum_{i=1}^n Y_i\,\Omega(X_i,Y_i,\omega_+^*)(X_i)_\ddagger.$$
Lastly, define $G(\omega_+^*) \equiv E\left[\Omega(X_i,Y_i,\omega_+^*)(X_i)_\ddagger(X_i)_\ddagger^T\,\Omega(X_i,Y_i,\omega_+^*)^T\right]$.

Some regularity conditions are needed; we state them in the Appendix. Conditions (A1), (A2) and (A4) are the same as in Koo et al. (2008); our new (A3) is tailored for DWSVM and incorporates the DWD component. In particular, (A1) ensures that $U(\omega_+)$ is well defined and continuous in $\omega_+$, while (A1) and (A2) ensure that the minimizer $\omega_+^*$ exists. (A3) is a sufficient condition for $\omega_+^*$ to be nonzero. (A4) guarantees the positive-definiteness of $U(\omega_+)$ around $\omega_+^*$.

Under these regularity conditions, we obtain a Bahadur representation of $\widehat{\omega}_{\lambda,n,+}$ in Theorem 2, the asymptotic normality in Theorem 3, and consequently the asymptotic normality of the discriminant function $f_1(x; \widehat{\omega}_{\lambda,n,+})$ at $x$ in Corollary 4.

Theorem 2. Suppose that (A1)–(A4) are met. For $\lambda = o(n^{-1/2})$, we have
$$\sqrt{n}\left(\widehat{\omega}_{\lambda,n,+} - \omega_+^*\right) = -\frac{1}{\sqrt{n}}U(\omega_+^*)^{-1}T_n + o_P(1).$$

Theorem 3. Suppose that (A1)–(A4) are met. For $\lambda = o(n^{-1/2})$, we have
$$\sqrt{n}\left(\widehat{\omega}_{\lambda,n,+} - \omega_+^*\right) \xrightarrow{d} N\left(\mathbf{0},\; U(\omega_+^*)^{-1}G(\omega_+^*)U(\omega_+^*)^{-1}\right).$$

This will lead to the following corollary.
Corollary 4. Under the same conditions as in Theorem 3, for $\lambda = o(n^{-1/2})$ and any $x \in \mathbb{R}^d$,
$$\sqrt{n}\left(f_1(x; \widehat{\omega}_{\lambda,n,+}) - f_1(x; \omega_+^*)\right) \xrightarrow{d} N\left(0,\; x_+^T U(\omega_+^*)^{-1}G(\omega_+^*)U(\omega_+^*)^{-1}x_+\right).$$

6.3 Non-divergence of the intercept term

Owen (2007) discussed the behavior of the intercept term in logistic regression when the sample size of one class is extremely large while that of the other class is fixed. Moreover, Qiao and Zhang (2013) showed that the intercept term of DWD diverges. In this subsection, we prove that the intercept term of the DWSVM classifier does not diverge. Without loss of generality, we assume that $n_- \gg n_+$, i.e., the negative class is the majority class.

Lemma 5. Suppose that the negative majority class is sampled from a distribution with compact support $S$. Then the intercept term $\beta$ in SVM does not diverge to negative infinity as $n_- \to \infty$.

Corollary 6. Suppose that the negative majority class is sampled from a distribution with compact support $S$. Then the intercept term $\beta_1$ in DWSVM does not diverge to negative infinity as $n_- \to \infty$.

The assumption of compact support $S$ is essential here, but it is fairly weak and holds in many real applications. Note that this result does not ensure that the sensitivity issue is completely overcome by SVM or DWSVM. Instead, it suggests that in the $n_- \to \infty$ asymptotics, the impact of the imbalanced sample size is limited to some extent.

7 Conclusion
Both SVM and DWD enjoy certain advantages and are subject to certain disadvantages. DWSVM combines the merits of both methods by creatively deploying an auxiliary intercept term. We have shown standard asymptotic results for the DWSVM classifier. The simulations and the real data application establish the superiority of the DWSVM method over SVM and DWD in some situations. In particular, the DWSVM method can lead to a discriminant direction vector that, like the DWD direction, preserves important features of the data set. More importantly, DWSVM also performs very well in terms of classification; as a bottom line, its performance is just as good as that of SVM. In special settings such as the perturbed data, we have demonstrated that DWSVM can overcome overfitting and is more robust against perturbation/mislabeling of the data.

We have shown some asymptotic properties of DWSVM in this paper. More work can be done to investigate its statistical properties, for example, along the lines of Blanchard et al. (2008).

An immediate extension of the DWSVM classifier is multiclass classification. For a multiclass classification problem with $K$ classes, the following optimization problem accomplishes such an extension:
$$\mathop{\mathrm{argmin}}_{\omega_j,\,\beta_{1,j},\,\beta_{2,j},\,\xi,\,\eta}\ \sum_{i=1}^n \sum_{y_i = j,\, k \ne j} \left\{\alpha\left(\frac{1}{r_{ijk}} + C_{dwd}\cdot\eta_{ijk}\right) + (1-\alpha)\xi_{ijk}\right\},$$
$$\text{s.t. } r_{ijk} = y_i\{x_i^T(\omega_j - \omega_k) + (\beta_{2,j} - \beta_{2,k})\} + \eta_{ijk},\quad r_{ijk} \ge 0,\ \eta_{ijk} \ge 0,$$
$$C_{svm}\,y_i\{x_i^T(\omega_j - \omega_k) + (\beta_{1,j} - \beta_{1,k})\} + \xi_{ijk} \ge \sqrt{C_{svm}},\quad \xi_{ijk} \ge 0,$$
$$\sum_{j=1}^K \|\omega_j\|^2 \le 1,\quad \sum_{j=1}^K \omega_j = \mathbf{0},\quad \sum_{j=1}^K \beta_{1,j} = 0,\quad \sum_{j=1}^K \beta_{2,j} = 0.$$
Other extensions, such as kernel DWSVM or sparse DWSVM, are also readily within reach.

In summary, DWSVM integrates the merits of classical classification methods. Its numerical performance is very good and it is theoretically justified. This is evidence that it is a very promising linear learner with great potential in many applications. Future work will also concentrate on developing a more efficient implementation of DWSVM.

Acknowledgment

The first author's work was partially supported by Binghamton University Harpur College Dean's New Faculty Start-up Funds and a collaboration grant from the Simons Foundation.
Appendices
Proof of Theorem 1
For any $x$, denote $q(x) = P(Y = +1 \mid X = x)$. The conditional risk is
$$R(f_1, f_2) \equiv E[L\{Yf_1(X), Yf_2(X)\} \mid X = x] = \{\alpha V_{C_{dwd}}(f_2) + (1-\alpha)H_{C_{svm}}(f_1)\}q(x) + \{\alpha V_{C_{dwd}}(-f_2) + (1-\alpha)H_{C_{svm}}(-f_1)\}\{1 - q(x)\},$$
where for simplicity we write $f_1(x)$ and $f_2(x)$ as $f_1$ and $f_2$.

For the global minimizer $(f_1^*, f_2^*)$, since $f_1^* - f_2^* = \Delta^*$ is independent of $x$, we can consider another feasible (but not optimal) solution $(-f_1^*, -f_1^* - \Delta^*)$. Due to the optimality of $(f_1^*, f_2^*) = (f_1^*, f_1^* - \Delta^*)$, we can show that
$$0 \ge R(f_1^*, f_1^* - \Delta^*) - R(-f_1^*, -f_1^* - \Delta^*) = \{2q(x) - 1\}\left[\alpha\{V_{C_{dwd}}(f_1^* - \Delta^*) - V_{C_{dwd}}(-f_1^* - \Delta^*)\} + (1-\alpha)\{H_{C_{svm}}(f_1^*) - H_{C_{svm}}(-f_1^*)\}\right].$$
Thus, if $q(x) > 1/2$, then
$$\alpha\{V_{C_{dwd}}(f_1^* - \Delta^*) - V_{C_{dwd}}(-f_1^* - \Delta^*)\} + (1-\alpha)\{H_{C_{svm}}(f_1^*) - H_{C_{svm}}(-f_1^*)\} \le 0.$$
Because $V_{C_{dwd}}(\cdot)$ is strictly decreasing everywhere and $H_{C_{svm}}(\cdot)$ is strictly decreasing around 0, the terms $V_{C_{dwd}}(f_1^* - \Delta^*) - V_{C_{dwd}}(-f_1^* - \Delta^*)$ and $H_{C_{svm}}(f_1^*) - H_{C_{svm}}(-f_1^*)$ have the same sign, and hence $f_1^* \ge 0$. By a similar argument, if $q(x) < 1/2$, then $f_1^* \le 0$. Lastly, it is easy to show that $f_1^* \ne 0$. Hence we have $\mathrm{sign}(f_1^*) = \mathrm{sign}(q(x) - 1/2)$.

Regularity conditions
We state the regularity conditions for the asymptotics below. We use $C_1, C_2, \ldots$ to denote positive constants independent of $n$.

(A1) The densities $p_+$ and $p_-$ are continuous and have finite second moments.

(A2) There exists a ball $B(x_0, \delta_0)$, centered at $x_0$ with radius $\delta_0 > 0$, such that $p_+(x) > C_1$ and $p_-(x) > C_1$ for every $x \in B(x_0, \delta_0)$.

(A3) For some $1 \le l \le d$,
$$E\left(\mathbb{1}\{X_l \ge F_{L-}\}X_l \mid Y = -1\right) < E\left(\mathbb{1}\{X_l \le F_{U+}\}X_l \mid Y = +1\right)$$
or
$$E\left(\mathbb{1}\{X_l \le F_{U-}\}X_l \mid Y = -1\right) > E\left(\mathbb{1}\{X_l \ge F_{L+}\}X_l \mid Y = +1\right),$$
where $F_{L+}$ and $F_{L-}$ ($F_{U+}$ and $F_{U-}$, respectively) are the lower bounds (upper bounds, respectively) for the positive and negative classes. They are defined through
$$P\left(X_l \ge F_{L+} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_+\{\alpha C_d + (1-\alpha)C_s\}}{\pi_-(1-\alpha)C_s}\right),$$
$$P\left(X_l \ge F_{L-} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_-\{\alpha C_d + (1-\alpha)C_s\}}{\pi_+(1-\alpha)C_s}\right),$$
$$P\left(X_l \le F_{U+} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_+(1-\alpha)C_s}{\pi_-\{\alpha C_d + (1-\alpha)C_s\}}\right),$$
$$P\left(X_l \le F_{U-} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_-(1-\alpha)C_s}{\pi_+\{\alpha C_d + (1-\alpha)C_s\}}\right).$$

(A4) For an orthogonal transformation $A_l$ that maps $\omega^*/\|\omega^*\|$ to the $l$th unit basis vector $e_l$ for some $1 \le l \le d$, there exist rectangles
$$D_+ = \left\{x \in M_+: l_s \le (A_lx)_s \le v_s \text{ with } l_s < v_s \text{ for } s \ne l\right\}$$
and
$$D_- = \left\{x \in M_-: l_s \le (A_lx)_s \le v_s \text{ with } l_s < v_s \text{ for } s \ne l\right\}$$
such that $p_+(x) \ge C_2 > 0$ on $D_+$ and $p_-(x) \ge C_2 > 0$ on $D_-$, where $M_+ \equiv \{x: x^T\omega^* + \beta_1^* = 1/\sqrt{C_s}\}$ and $M_- \equiv \{x: x^T\omega^* + \beta_1^* = -1/\sqrt{C_s}\}$.

Proof of Theorems 2 and 3 and Corollary 4
For fixed $\theta \in \mathbb{R}^{d+2}$, define
$$\Lambda_n(\theta) \equiv n\left\{q_{\lambda,n}(\omega_+^* + \theta/\sqrt{n}) - q_{\lambda,n}(\omega_+^*)\right\}, \quad \text{and} \quad \Gamma_n(\theta) \equiv E\Lambda_n(\theta).$$
Observe that
$$\Gamma_n(\theta) = n\left\{Q(\omega_+^* + \theta/\sqrt{n}) - Q(\omega_+^*)\right\} + \lambda\left(\|\theta_{(3:d+2)}\|^2 + 2\sqrt{n}\,\omega^{*T}\theta_{(3:d+2)}\right),$$
where $\theta_{(3:d+2)}$ denotes the sub-vector of $\theta$ corresponding to $\omega$. By a Taylor series expansion of $Q$ around $\omega_+^*$, we obtain, for some $0 < t < 1$,
$$\Gamma_n(\theta) = \frac{1}{2}\theta^T U\left(\omega_+^* + (t/\sqrt{n})\theta\right)\theta + \lambda\left(\|\theta_{(3:d+2)}\|^2 + 2\sqrt{n}\,\omega^{*T}\theta_{(3:d+2)}\right).$$
Because $U(\omega_+)$ is continuous in $\omega_+$, due to condition (A1), we have
$$\frac{1}{2}\theta^T U\left(\omega_+^* + (t/\sqrt{n})\theta\right)\theta = \frac{1}{2}\theta^T U(\omega_+^*)\theta + o(1).$$
This, combined with $\lambda = o(n^{-1/2})$, results in
$$\Gamma_n(\theta) = \frac{1}{2}\theta^T U(\omega_+^*)\theta + o(1).$$

Now observe that $ET_n = nS(\omega_+^*) = \mathbf{0}$ and $E(T_nT_n^T) = \sum_{i=1}^n E[\Omega(X_i,Y_i,\omega_+^*)(X_i)_\ddagger(X_i)_\ddagger^T\Omega(X_i,Y_i,\omega_+^*)^T] = nG(\omega_+^*)$. Hence $T_n/\sqrt{n}$ follows $N(\mathbf{0}, G(\omega_+^*))$ asymptotically by the central limit theorem.

Next, we define
$$R_{i,n}(\theta) \equiv L_{i,n}(\omega_+^* + \theta/\sqrt{n}) - L_{i,n}(\omega_+^*) - \left(\frac{\partial L_{i,n}}{\partial\omega_+}(\omega_+)\Big|_{\omega_+ = \omega_+^*}\right)^T\theta/\sqrt{n},$$
where $L_{i,n}(\omega_+) \equiv \alpha V_{C_d}(Y_i(X_i)_\dagger^T\omega_+) + (1-\alpha)H_{C_s}(Y_i(X_i)_+^T\omega_+)$.

We continue by splitting $R_{i,n}$ into two parts, $R_{i,n} = \alpha R_{i,n}^d + (1-\alpha)R_{i,n}^s$, where the first term concerns the DWD component and the second the SVM component. For the DWD component,
$$R_{i,n}^d(\theta) \equiv V\left(Y_i(X_i)_\dagger^T(\omega_+^* + \theta/\sqrt{n})\right) - V\left(Y_i(X_i)_\dagger^T\omega_+^*\right) - \left(\frac{\partial V}{\partial\omega_+}(\omega_+)\Big|_{\omega_+ = \omega_+^*}\right)^T\theta/\sqrt{n}.$$
Because the DWD loss $V$ has a continuous first-order derivative, $R_{i,n}^d(\theta) = O(n^{-1})$. For the SVM component,
$$R_{i,n}^s(\theta) \equiv H\left[Y_i(X_i)_+^T(\omega_+^* + \theta/\sqrt{n})\right] - H\left[Y_i(X_i)_+^T\omega_+^*\right] + \sqrt{C_s}\,\mathbb{1}\left\{Y_i(X_i)_+^T\omega_+^* < 1/\sqrt{C_s}\right\}Y_i(X_i)_+^T\theta/\sqrt{n}.$$
Following the argument of Koo et al. (2008) and combining the fact that $R_{i,n}^d(\theta) = O(n^{-1})$, we can show that $\sum_{i=1}^n E\left(|R_{i,n}(\theta) - ER_{i,n}(\theta)|^2\right) \to 0$ as $n \to \infty$. Note that
$$\Lambda_n(\theta) = \Gamma_n(\theta) + T_n^T\theta/\sqrt{n} + \sum_{i=1}^n \left(R_{i,n}(\theta) - ER_{i,n}(\theta)\right).$$
Thus
$$\Lambda_n(\theta) = \frac{1}{2}\theta^T U(\omega_+^*)\theta + T_n^T\theta/\sqrt{n} + o_P(1).$$
By the Convexity Lemma in Pollard (1991), we have, for any fixed $\theta$,
$$\Lambda_n(\theta) = \frac{1}{2}(\theta - \zeta_n)^T U(\omega_+^*)(\theta - \zeta_n) - \frac{1}{2}\zeta_n^T U(\omega_+^*)\zeta_n + r_n(\theta),$$
where $\zeta_n \equiv -U(\omega_+^*)^{-1}T_n/\sqrt{n}$, and for each compact set $K \subset \mathbb{R}^{d+2}$,
$$\sup_{\theta \in K}|r_n(\theta)| \xrightarrow{p} 0.$$
We then follow the argument in Koo et al. (2008): for each $\varepsilon > 0$, with $\widehat\theta_{\lambda,n} = \sqrt{n}(\widehat\omega_{\lambda,n,+} - \omega_+^*)$,
$$P\left(\|\widehat\theta_{\lambda,n} - \zeta_n\| > \varepsilon\right) \to 0,$$
which completes the proof.

Proof of Lemma 5
We prove the result for the simpler and more intuitive case of $d = 1$. In this case $\omega \in \mathbb{R}$ does not need to be optimized, and we can simply assume that $\omega = 1$. Moreover, we consider the worst-case scenario where $n_+ = 1$, which represents the most imbalanced sample sizes. We let $x_0$ denote the sole data vector in the positive minority class.

Since the negative class is extremely large compared to the positive class, we can assume that the functional margin with respect to the main hyperplane, $u_0 \equiv y_0(x_0 + \beta) = x_0 + \beta$, for the data vector from the positive minority class is always less than $1/\sqrt{C_s}$, that is, $\beta \le 1/\sqrt{C_s} - x_0$. Write the objective function of SVM as
$$L_s(\beta) \equiv \left(\sqrt{C_s} - C_sx_0 - C_s\beta\right)_+ + \sum_{i=1}^n \mathbb{1}\{y_i = -1\}\left(\sqrt{C_s} + C_sx_i + C_s\beta\right)_+ \approx \left(\sqrt{C_s} - C_sx_0 - C_s\beta\right)_+ + n_-E\left\{\left(\sqrt{C_s} + C_sX + C_s\beta\right)_+ \mid Y = -1\right\}.$$
Note that
$$\frac{\partial L_s}{\partial\beta}(\beta) = -C_s + n_-C_sE\left[\mathbb{1}\{\sqrt{C_s} + C_sX + C_s\beta > 0\} \mid Y = -1\right] = -C_s + n_-C_sP\left[\sqrt{C_s} + C_sX + C_s\beta > 0 \mid Y = -1\right].$$
This leads to
$$\lim_{\beta \to -\infty}\frac{\partial L_s}{\partial\beta}(\beta) = -C_s < 0.$$
Let $M$ be an upper bound of the compact support $S$ of the negative class, so that $P[X > M \mid Y = -1] = 0$. Then
$$\frac{\partial L_s}{\partial\beta}\left(-M - 1/\sqrt{C_s}\right) = -C_s + n_-C_sP\left[\sqrt{C_s} + C_sX + C_s(-M - 1/\sqrt{C_s}) > 0 \mid Y = -1\right] = -C_s + n_-C_sP\left[X > M \mid Y = -1\right] = -C_s < 0.$$
If $1/\sqrt{C_s} - x_0 \le -M - 1/\sqrt{C_s}$, then
$$\frac{\partial L_s}{\partial\beta}\left(1/\sqrt{C_s} - x_0\right) = -C_s + n_-C_sP\left[\sqrt{C_s} + C_sX + C_s(1/\sqrt{C_s} - x_0) > 0 \mid Y = -1\right] = -C_s + n_-C_sP\left[X > x_0 - 2/\sqrt{C_s} \mid Y = -1\right] = -C_s < 0,$$
and $\beta = 1/\sqrt{C_s} - x_0$ is the minimizer of $L_s$. On the other hand, if $1/\sqrt{C_s} - x_0 > -M - 1/\sqrt{C_s}$, then the minimizer $\beta^*$ is greater than $-M - 1/\sqrt{C_s}$ but less than or equal to $1/\sqrt{C_s} - x_0$. This means that the intercept term $\beta$ in SVM does not diverge to $-\infty$.

References
Ahn, J. and Marron, J. (2010), "The maximal data piling direction for discrimination," Biometrika, 97, 254–259.

Bartlett, P., Jordan, M., and McAuliffe, J. (2006), "Convexity, classification, and risk bounds," Journal of the American Statistical Association, 101, 138–156.

Blanchard, G., Bousquet, O., and Massart, P. (2008), "Statistical performance of support vector machines," The Annals of Statistics, 36, 489–531.

Cortes, C. and Vapnik, V. (1995), "Support-vector networks," Machine Learning, 20, 273–297.

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.

Duda, R., Hart, P., and Stork, D. (2001), Pattern Classification, Wiley.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, 97, 77–87.

Freund, Y. and Schapire, R. E. (1997), "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55, 119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000), "Additive logistic regression: a statistical view of boosting," The Annals of Statistics, 28, 337–374.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., et al. (1999), "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, 286, 531–537.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition), Springer.

Koo, J., Lee, Y., Kim, Y., and Park, C. (2008), "A Bahadur representation of the linear support vector machine," Journal of Machine Learning Research, 9, 1343–1368.

Lin, Y. (2004), "A note on margin-based loss functions in classification," Statistics & Probability Letters, 68, 73–82.

Marron, J., Todd, M., and Ahn, J. (2007), "Distance-weighted discrimination," Journal of the American Statistical Association, 102, 1267–1271.

Owen, A. (2007), "Infinitely imbalanced logistic regression," Journal of Machine Learning Research, 8, 761–773.

Pollard, D. (1991), "Asymptotics for least absolute deviation regression estimators," Econometric Theory, 7, 186–199.

Qiao, X., Zhang, H., Liu, Y., Todd, M., and Marron, J. (2010), "Weighted distance weighted discrimination and its asymptotic properties," Journal of the American Statistical Association, 105, 401–414.

Qiao, X. and Zhang, L. (2013), "Flexible high-dimensional classification machines and their asymptotic properties."

Vapnik, V. (1998), Statistical Learning Theory, Wiley.