Distance-weighted Support Vector Machine
Xingye Qiao†
Department of Mathematical Sciences, State University of New York, Binghamton, NY 13902-6000. E-mail: [email protected]
†Corresponding author.

Lingsong Zhang
Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: [email protected]
October 9, 2015
Abstract
A novel linear classification method that possesses the merits of both the Support Vector Machine (SVM) and Distance-weighted Discrimination (DWD) is proposed in this article. The proposed Distance-weighted Support Vector Machine (DWSVM) method can be viewed as a hybrid of SVM and DWD that finds the classification direction by minimizing mainly the DWD loss, and determines the intercept term in the SVM manner. We show that our method inherits the merits of DWD and hence overcomes the data-piling and overfitting issues of SVM. On the other hand, the new method is not subject to the imbalanced data issue, which was a main advantage of SVM over DWD. It uses an unusual loss which combines the hinge loss (of SVM) and the DWD loss through the trick of an auxiliary hyperplane. Several theoretical properties, including Fisher consistency and asymptotic normality of the DWSVM solution, are developed. We use simulated examples to show that the new method can compete with DWD and SVM in both classification performance and interpretability. A real data application further establishes the usefulness of our approach.

KEYWORDS: Discriminant analysis; Fisher consistency; Imbalanced data; High-dimensional, low-sample size data; Support Vector Machine.

1 Introduction
Classification is an important research topic in statistical machine learning and has many useful applications in various scientific and social research areas. In this article, we focus on the binary linear classification problem, in which a classification rule $\phi: \mathcal{X} \mapsto \mathcal{Y}$ is to be found that maps a point in $\mathcal{X}$ to a class label in $\mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{+1, -1\}$. We focus on linear classification methods instead of nonlinear ones because their simple formulations make them easy to interpret. In particular, each linear classification rule is associated with a linear discriminant function $f(x) = x^T\omega + \beta$, where the coefficient direction vector $\omega \in \mathbb{R}^d$ has unit $L_2$ norm and $\beta \in \mathbb{R}$ is the intercept term. The classification rule is then $\phi(x) = \mathrm{sign}(f(x))$; that is, the sample space $\mathbb{R}^d$ is divided into halves by the separating hyperplane $\{x: f(x) \equiv x^T\omega + \beta = 0\}$. The coefficient direction vector $\omega$ determines the orientation of the hyperplane (it is in fact the normal vector of the hyperplane), and the intercept term $\beta$ determines its location.

There is a large body of literature on linear classification; see Duda et al. (2001) and Hastie et al. (2009) for comprehensive introductions. Among many linear classification methods, the Support Vector Machine (SVM; Cortes and Vapnik, 1995; Vapnik, 1998; Cristianini and Shawe-Taylor, 2000) and the Distance-weighted Discrimination (DWD; Marron et al., 2007; Qiao et al., 2010) are two state-of-the-art instances and have received much attention. A brief review of these two methods is given in Section 2.

In the high-dimensional, low-sample size (HDLSS) data setting, a so-called "data-piling" phenomenon has been observed for SVM (Marron et al., 2007) and some other classifiers (for example, Ahn and Marron, 2010). Data-piling refers to the phenomenon that, after being projected onto the direction vector $\omega$ given by a linear classifier, a large portion of the data vectors pile upon each other and concentrate on two points. Data-piling reflects severe overfitting in the HDLSS data setting and indicates that the direction is driven by artifacts in the data; hence the direction, as well as the classification performance, can be stochastically volatile. Moreover, it turns out that the directions from these linear classification methods deviate substantially from the Bayes rule direction (when the Bayes rule exists and is linear). To this end, DWD was proposed largely to overcome the data-piling issue in the HDLSS setting and has been quite successful at that.

While DWD overcomes data-piling and mitigates the overfitting effect, it is sensitive to imbalanced sample sizes between the two classes (Qiao et al., 2010). In particular, when the sample size of one class is much greater than that of the other, the classification boundary is pushed towards the minority class and, consequently, all future data vectors are classified into the majority class.

[Figure 1 about here. Panel titles: (a) Bayes rule; (b) SVM, 67.61 degrees; (c) DWD, 39.87 degrees; (d) DWSVM, 40.23 degrees.]

Figure 1:
Plots of projections onto: (a) the true mean difference (Bayes rule) direction, (b) the SVM direction, (c) the DWD direction and (d) the proposed DWSVM direction. The angles (in degrees) between the last three directions and the first are shown in the panel titles. Projections of the separating hyperplanes of the different methods are depicted by the magenta vertical lines. Panel (a) shows the Bayes direction and the separating hyperplane to be compared with. SVM in Panel (b) demonstrates very good separation between the two classes, but severe data-piling also appears. The projected data vectors are nowhere near Gaussian, which suggests that the direction deviates too much from the Bayes direction in Panel (a). Panel (c) shows that DWD has no data-piling issue, and the projection plot preserves the Gaussian pattern. However, the separating hyperplane is pushed towards the red class because of its relatively small sample size. Our proposed DWSVM approach (Panel (d)) combines the merits of SVM and DWD: it preserves a good direction, as shown by the Gaussian pattern in the projections, while finding a good intercept term which is not subject to imbalanced sample sizes.

Qiao and Zhang (2013) thoroughly studied the high-dimensional overfitting issue of SVM and the imbalanced data issue of DWD. Moreover, they proposed a new family of classifiers, called FLAME, to which both SVM and DWD belong. To illustrate the main points of the data-piling and imbalanced data issues, we show projection plots of a toy example onto four different discriminant direction vectors in Figure 1. In this example, the data vectors from the two classes are generated from multivariate normal distributions $N_d(\pm\mu\mathbf{1}_d, I_d)$, where the dimension $d = 300$, $\mathbf{1}_d$ is a $d$-dimensional vector of all ones, $I_d$ is the $d \times d$ identity matrix, and the scalar $\mu$ is chosen so that the two classes are moderately separated. The Bayes rule in this example has direction $\omega_B = \mathbf{1}_d/\sqrt{d}$ and intercept $\beta_B = 0$. Here the sample size of the positive class (with $Y = +1$) is $n_+ = 200$ and the negative class sample size is $n_- = 50$.

Panel (a) in Figure 1 shows the true mean difference direction (which in fact is the Bayes direction) and the projections of the data vectors onto it. They serve as the benchmark to be compared with. Panel (b) is for the SVM direction, and it demonstrates a very dramatic separation between the two classes. This could be an alarm bell for overfitting; indeed, severe data-piling is visible. The projected data vectors are nowhere near Gaussian, which suggests that the direction deviates too much from the true direction in Panel (a). This deviation is also measured by the angle between the SVM direction and the Bayes direction (about 67 degrees, shown in the title). Panel (c) shows that DWD has no data-piling issue, and the projection plot preserves the Gaussian pattern, which means that there is some potential to interpret the data using the DWD direction. However, because the blue class (the positive class with $Y = +1$) has four times the sample size of the red class, the separating hyperplane is pushed towards the red class. Expectedly, its classification performance is not good.
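The data-piling phenomenon above is easy to reproduce. The following sketch (ours, not the authors' code) uses scikit-learn's LinearSVC as a stand-in for the SVM in the paper; the mean scaling is an assumption chosen to give a between-class Mahalanobis distance of 2.7, matching the simulation settings of Section 4. It reports the angle between the fitted SVM direction and the Bayes direction $\mathbf{1}_d/\sqrt{d}$, which is typically large in HDLSS settings.

```python
# Sketch of the Figure 1 toy setting: Gaussian classes N_d(+/- mu*1_d, I_d),
# d = 300, n+ = 200, n- = 50; fit a linear SVM and measure the angle between
# its direction and the Bayes direction 1_d/sqrt(d).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, n_pos, n_neg = 300, 200, 50
mu = 2.7 / (2.0 * np.sqrt(d))        # assumed scaling: 2*mu*||1_d|| = 2.7

X = np.vstack([rng.standard_normal((n_pos, d)) + mu,
               rng.standard_normal((n_neg, d)) - mu])
y = np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

svm = LinearSVC(C=1.0, loss="hinge", max_iter=50000).fit(X, y)
w = svm.coef_.ravel() / np.linalg.norm(svm.coef_)
w_bayes = np.ones(d) / np.sqrt(d)

angle = np.degrees(np.arccos(np.clip(w @ w_bayes, -1.0, 1.0)))
print(f"angle between SVM and Bayes directions: {angle:.1f} degrees")
```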
In this article, we propose a new method which integrates the merits of SVM and DWD, and thus addresses the data-piling issue and the imbalanced data issue at the same time. Our proposed method is named the Distance-weighted Support Vector Machine (DWSVM), in tribute to the two classical methods above. As shown in Panel (d) of Figure 1, DWSVM preserves a good direction, as shown by the Gaussian pattern in the projections, while finding a good intercept term which is not subject to the imbalanced sample sizes. In addition, we prove in theory that DWSVM is Fisher consistent and asymptotically normal, and that its intercept term is not sensitive to imbalanced sample sizes, as DWD's is.

The rest of the article is organized as follows. Section 2 gives a brief introduction to the SVM and DWD methods. Our DWSVM method is proposed in Section 3. Simulated examples and a real application are studied in Sections 4 and 5. Several theoretical results are given in Section 6. Some concluding remarks are made in Section 7.

2 Background: SVM and DWD

In this section, we give a brief introduction to SVM and DWD, their formulations, and a discussion of the roles of the different terms.
2.1 Binary classification and convex surrogate losses

In classification, one is given a training data set,
$\mathcal{D} \equiv \{(x_i, y_i) \in \mathcal{X} \otimes \mathcal{Y},\ i = 1, \ldots, n\}$, and the goal is to find a rule, $\phi(x) \equiv \mathrm{sign}(f(x))$, depending on $\mathcal{D}$, so that the classification error $E[\phi(X) \ne Y]$ is minimized. A natural estimate of the classification error is $n^{-1}\sum_{i=1}^n \mathbb{1}[\mathrm{sign}(f(x_i)) \ne y_i] = n^{-1}\sum_{i=1}^n \mathbb{1}[y_i f(x_i) < 0]$. However, even in the simple case of linear classification, where $f(x)$ is assumed to have the form $f(x) = x^T\omega + \beta$, searching for $(\omega, \beta)$ to minimize $\sum_{i=1}^n \mathbb{1}[y_i f(x_i) < 0]$ is intractable due to the discontinuity and nonconvexity of the objective function. In statistical learning, a common practice to avoid these issues is to use a convex surrogate function to approximate/upper-bound the 0-1 loss function $\mathbb{1}[yf(x) < 0]$. For any discriminant function $f(x)$, define the functional margin $u \equiv yf(x)$, which can be viewed as the signed distance (up to a constant) from the data point $x$ to the separating hyperplane $\{x: f(x) = 0\}$. A convex surrogate $\psi(u): \mathbb{R} \mapsto \mathbb{R}^+$ can be used in place of $\mathbb{1}[u < 0]$. For example, a classification rule can be obtained by
$$\min_{\omega, \beta} \sum_{i=1}^n \psi\left(y_i(x_i^T\omega + \beta)\right) + \lambda\|\omega\|^2.$$
Here, the first term in the objective function bounds the empirical classification error, and the $\|\omega\|^2$ term measures the complexity of the model; the choice of the tuning parameter $\lambda$ balances the two concerns. By standard optimization theory, this problem can equivalently be cast as $\min_{\omega,\beta} \sum_{i=1}^n \psi(y_i(x_i^T\omega + \beta))$, subject to $\|\omega\|^2 \le C$. Many classification methods fall into this category, such as the Support Vector Machine, AdaBoost (Freund and Schapire, 1997), and logistic regression (Friedman et al., 2000). See Bartlett et al. (2006) and the references therein for a more sophisticated discussion of convex loss functions and their implications for risk bounds.
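As a quick numeric illustration (ours, not from the paper), the hinge surrogate $(1-u)_+$ dominates the 0-1 loss pointwise, so the average surrogate loss upper-bounds the empirical classification error:

```python
import numpy as np

u = np.array([-1.5, -0.2, 0.1, 0.8, 2.0])   # functional margins u_i = y_i * f(x_i)
zero_one = (u < 0).astype(float)            # the intractable 0-1 loss
hinge = np.maximum(1.0 - u, 0.0)            # the convex surrogate psi(u) = (1-u)_+

assert np.all(hinge >= zero_one)            # psi upper-bounds the 0-1 loss pointwise
print(zero_one.mean(), hinge.mean())        # 0.4 <= 0.96
```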
2.2 Support Vector Machine (SVM)

By choosing the hinge loss function $(1-u)_+$ as the convex surrogate, where $(a)_+ \equiv \max(a, 0)$ is the positive part of $a$, the SVM method is defined to maximize the smallest distance of all observations to the separating hyperplane. Mathematically, for some positive $\lambda$, the optimization problem of SVM can be written as
$$\min_{\tilde\omega, \tilde\beta} \sum_{i=1}^n \left(1 - y_i(x_i^T\tilde\omega + \tilde\beta)\right)_+ + \lambda\|\tilde\omega\|^2.$$
Here, in addition to measuring the model complexity, $\|\tilde\omega\|$ also defines a notion of gap between the two classes for SVM. In particular, $2/\|\tilde\omega\|$ is the distance between the classes (up to a constant); hence, minimizing $\|\tilde\omega\|$ is the same as maximizing the gap between the classes. The notion of gap plays a central role in the derivation of the methods in this article.

The formulation above can be equivalently written as $\min_{\tilde\omega, \tilde\beta} \sum_{i=1}^n (1 - y_i(x_i^T\tilde\omega + \tilde\beta))_+$, subject to $\|\tilde\omega\|^2 \le C$. Here the coefficient vector $\tilde\omega$ does not necessarily have unit norm. We let $\omega = \tilde\omega/\sqrt{C}$ and $\beta = \tilde\beta/\sqrt{C}$. Then the SVM solution is given by
$$\mathop{\mathrm{argmin}}_{\omega, \beta} \sum_{i=1}^n \left(\sqrt{C} - Cy_i(x_i^T\omega + \beta)\right)_+, \quad \text{s.t. } \|\omega\| \le 1.$$
In other words, a rescaled hinge loss,
$$H_C(u) = \begin{cases} \sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 0 & \text{otherwise}, \end{cases} \qquad (1)$$
is used, such that SVM can be viewed as minimizing $\sum_{i=1}^n H_C(u_i)$, subject to $\|\omega\| \le 1$, where the functional margin for the $i$th datum is $u_i = y_i(x_i^T\omega + \beta)$. In order to align this formulation with that of DWD, we introduce slack variables $\xi_i$ and rewrite SVM as
$$\mathop{\mathrm{argmin}}_{\omega, \beta, \xi_i} \sum_{i=1}^n \xi_i, \qquad (2)$$
$$\text{s.t. } Cy_i(x_i^T\omega + \beta) + \xi_i \ge \sqrt{C},\ \xi_i \ge 0, \qquad (3)$$
$$\|\omega\| \le 1. \qquad (4)$$
2.3 Distance-weighted Discrimination (DWD)

The DWD method was proposed by Marron et al. (2007) to improve the performance of SVM in the HDLSS setting. It also maximizes a notion of gap between the classes: the harmonic mean of the distances of all data vectors to the separating hyperplane. Let $r_i = y_i(x_i^T\omega + \beta) + \eta_i$ be the (adjusted) distance of the $i$th data vector to the separating hyperplane. Mathematically, the solution of DWD is
$$\mathop{\mathrm{argmin}}_{\omega, \beta, \eta_i} \sum_{i=1}^n \left(\frac{1}{r_i} + C\eta_i\right), \qquad (5)$$
$$\text{s.t. } r_i = y_i(x_i^T\omega + \beta) + \eta_i,\ r_i \ge 0,\ \eta_i \ge 0, \qquad (6)$$
$$\|\omega\| \le 1. \qquad (7)$$
When $\eta_i = 0$ and $y_i(x_i^T\omega + \beta) > 0$, $r_i = y_i(x_i^T\omega + \beta)$ is the positive distance from each data vector to the separating hyperplane, due to (6). Thus $\sum_{i=1}^n 1/r_i$ defines a notion of gap between the classes different from that of SVM (which was $2/\|\omega\|$).

If a positive distance $y_i(x_i^T\omega + \beta)$ is not achievable for a data vector, then a positive slack variable $\eta_i$ is added to make $r_i$ positive. Note that the value of the correction $\eta_i$ corresponds to the amount of misclassification for the $i$th vector; hence, in order to minimize misclassification, we must control $\sum_{i=1}^n \eta_i$ in the objective function.

We will use this formulation and combine it with that of the SVM method in (2)–(4). Here, in order to understand the underlying DWD loss function for later use, we modify (5)–(7) as follows. For each $i$, the term $(1/r_i + C\eta_i)$ in the objective function can be minimized over $\eta_i$. Some algebraic manipulations reveal that the optimization problem (about $\omega$ and $\beta$) becomes
$$\mathop{\mathrm{argmin}}_{\omega, \beta} \sum_{i=1}^n V_C\left(y_i(x_i^T\omega + \beta)\right), \qquad (8)$$
$$\text{s.t. } \|\omega\| \le 1, \qquad (9)$$
where the DWD loss function is defined as
$$V_C(u) = \begin{cases} 2\sqrt{C} - Cu & \text{if } u \le 1/\sqrt{C}, \\ 1/u & \text{otherwise}. \end{cases} \qquad (10)$$

One key observation is to be made here. There are two main tasks in a binary linear classification method:

1. a notion of gap, which is to be maximized so as to make the two classes more separated; and
2. a measure of misclassification, which is to be minimized.

Recall that in the SVM formulation, the notion of gap is $2/\|\omega\|$, and misclassification is measured by the hinge loss function; SVM jointly minimizes the sum of these two components to search for a solution. In contrast, the DWD loss function in (10) (derived from the objective function (5)) has two functionalities: the first term $\sum_i r_i^{-1}$ in (5), the sum of inverse distances, is a notion of gap, while the second term $\sum_{i=1}^n \eta_i$ in (5) measures misclassification. The constraint $\|\omega\| \le 1$ normalizes the direction vector so that the distances are well defined.
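To make the two building blocks concrete, the following sketch (ours, under the reconstruction of (1) and (10) above) implements both loss functions. Note that $V_C$ matches the linear piece in both value and slope at $u = 1/\sqrt{C}$ (both equal $\sqrt{C}$ and $-C$ there), so the DWD loss is continuously differentiable while the hinge loss is not.

```python
import numpy as np

def hinge_loss(u, C):
    """H_C(u) = sqrt(C) - C*u for u <= 1/sqrt(C), and 0 otherwise; see (1)."""
    u = np.asarray(u, dtype=float)
    return np.maximum(np.sqrt(C) - C * u, 0.0)

def dwd_loss(u, C):
    """V_C(u) = 2*sqrt(C) - C*u for u <= 1/sqrt(C), and 1/u otherwise; see (10)."""
    u = np.asarray(u, dtype=float)
    inv = np.divide(1.0, u, out=np.full_like(u, np.inf), where=u > 0)
    return np.where(u <= 1.0 / np.sqrt(C), 2.0 * np.sqrt(C) - C * u, inv)

u = np.linspace(-1.0, 2.0, 7)
print(hinge_loss(u, 1.0))   # exactly zero once the margin exceeds 1/sqrt(C)
print(dwd_loss(u, 1.0))     # keeps (mildly) penalizing even well-classified points
```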
3 Distance-weighted Support Vector Machine

In Section 3.1, we first introduce a method which can be intuitively viewed as a prototype of the hybridization between SVM and DWD. Our proposed main method is discussed in Section 3.2. Some explanations of our method are given in Section 3.3.

3.1 A naive hybrid: nDWSVM

Before we introduce the DWSVM method, we discuss an intuitive hybridization between SVM and DWD, which we call the naive DWSVM method (nDWSVM). Based on the previous discussion and other results in the literature, a linear classifier with a direction given by DWD and an intercept term found by SVM is desirable. However, naively matching a DWD direction and an SVM intercept together would be problematic, because the intercept would lose its context without the corresponding discriminant direction. Instead, we could train a DWD classifier on the data set, discard the DWD intercept, keep the DWD direction, and project all the data vectors onto the 1-dimensional DWD direction to obtain a set of 1-dimensional data points. Lastly, we find an intercept (a cutoff) by applying SVM to this 1-dimensional data set. Following this paradigm, we can get a DWD direction, which is thought to be better than an SVM direction in overcoming overfitting, and then, given this DWD direction, search for an intercept in an SVM manner so as to mitigate the imbalanced data issue. We name this two-step procedure nDWSVM. The nDWSVM method is a simple prototype of DWSVM, in which the DWD component and the SVM component are trained separately.

3.2 Distance-weighted Support Vector Machine (DWSVM)

In this subsection, we formally define the Distance-weighted Support Vector Machine (DWSVM), which improves on nDWSVM. DWSVM simultaneously minimizes both the SVM loss function and the DWD loss function to identify a common discriminant direction. The less imbalance-sensitive, SVM-driven intercept term is then used to identify the location of the optimal separating hyperplane. Mathematically, the optimization problem can be written as follows. Let $C_{dwd}, C_{svm} > 0$ and $\alpha \in [0, 1]$:
$$\mathop{\mathrm{argmin}}_{\omega, \beta_1, \beta_2, \xi_i, \eta_i} \sum_{i=1}^n \left\{\alpha\left(\frac{1}{r_i} + C_{dwd}\cdot\eta_i\right) + (1-\alpha)\xi_i\right\}, \qquad (11)$$
$$\text{s.t. } r_i = y_i(x_i^T\omega + \beta_2) + \eta_i,\ r_i \ge 0,\ \eta_i \ge 0, \qquad (12)$$
$$C_{svm}\,y_i(x_i^T\omega + \beta_1) + \xi_i \ge \sqrt{C_{svm}},\ \xi_i \ge 0, \qquad (13)$$
$$\|\omega\| \le 1. \qquad (14)$$
Importantly, in the end, we let $f_1(x) \equiv x^T\omega + \beta_1$ and use $\mathrm{sign}(f_1(x)) = \mathrm{sign}(x^T\omega + \beta_1)$ as the classification rule, instead of $\mathrm{sign}(x^T\omega + \beta_2)$. Thus $\omega$ and $\beta_1$ are the only two variables that really participate in classifying future data vectors, while $\beta_2$ is not involved. However, this does not mean that $\beta_2$ is of no significance; we will elaborate on this point later.

Comparing (11)–(14) with (2)–(4) and (5)–(7), we can see that the first term in (11) and the constraint (12) are similar to (5) and (6), while the second term in (11) and the constraint (13) are similar to (2) and (3).

Figure 2:
The main separating hyperplane (magenta solid line) and the auxiliary hyperplane (magenta dashed line) for DWSVM applied to a two-dimensional toy example. The distance from each data vector to the main hyperplane is depicted as a dotted line segment, while the distance to the auxiliary hyperplane is depicted as a dot-dashed line segment. A positive $\eta_i$ is added to each negative functional margin $y_i(x_i^T\omega + \beta_2)$, $i = 22, 25$, to make the sum positive.
Thus we can write the DWSVM formulation (11)–(14) as
$$\mathop{\mathrm{argmin}}_{\omega, \beta_1, \beta_2} \sum_{i=1}^n \left\{\alpha V_{C_{dwd}}\left(y_i(x_i^T\omega + \beta_2)\right) + (1-\alpha)H_{C_{svm}}\left(y_i(x_i^T\omega + \beta_1)\right)\right\}, \qquad (15)$$
$$\text{s.t. } \|\omega\| \le 1. \qquad (16)$$
One might think that our DWSVM is just an optimization problem whose objective function equals a weighted average of the DWD loss and the SVM loss. However, it is more sophisticated than that. In the next subsection, we give some explanations of the different components and parameters in DWSVM to help understand the new method.
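To make (15) concrete, the following sketch (an illustration under our reconstruction of the losses in (1) and (10), not the authors' solver; in practice (11)–(14) resembles the second-order cone programs used for DWD and would be handed to a conic solver) evaluates the DWSVM objective for a candidate $(\omega, \beta_1, \beta_2)$, with the constraint (16) enforced by rescaling:

```python
import numpy as np

def dwsvm_objective(w, beta1, beta2, X, y, C_d=100.0, C_s=100.0, alpha=0.5):
    """Objective (15): alpha * V_{C_d} at the auxiliary margins plus
    (1 - alpha) * H_{C_s} at the main margins, with ||w|| <= 1."""
    w = np.asarray(w, dtype=float)
    w = w / max(np.linalg.norm(w), 1.0)    # enforce the constraint (16)
    u_main = y * (X @ w + beta1)           # margins w.r.t. the main hyperplane f1
    u_aux = y * (X @ w + beta2)            # margins w.r.t. the auxiliary hyperplane f2
    hinge = np.maximum(np.sqrt(C_s) - C_s * u_main, 0.0)
    inv = np.divide(1.0, u_aux, out=np.full_like(u_aux, np.inf), where=u_aux > 0)
    dwd = np.where(u_aux <= 1.0 / np.sqrt(C_d),
                   2.0 * np.sqrt(C_d) - C_d * u_aux, inv)
    return np.sum(alpha * dwd + (1.0 - alpha) * hinge)
```

Note that $\beta_2$ enters only through the DWD term and $\beta_1$ only through the hinge term, which is exactly the division of labor discussed next.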
3.3 Understanding DWSVM

Two hyperplanes

First of all, there are two intercept terms, $\beta_1$ and $\beta_2$, and only one direction vector $\omega$ in the DWSVM method; that is, there are two hyperplanes that are parallel to each other, $\{x: x^T\omega + \beta_1 = 0\}$ and $\{x: x^T\omega + \beta_2 = 0\}$. For convenience, we call them the main hyperplane and the auxiliary hyperplane, respectively, with corresponding discriminant functions $f_1 \equiv x^T\omega + \beta_1$ and $f_2 \equiv x^T\omega + \beta_2$. See Figure 2 for an illustration using a two-dimensional toy example. In the plot, the magenta solid line is the main hyperplane and the magenta dashed line is the auxiliary hyperplane.

Auxiliary hyperplane
Note that $f_2$ is involved in the definition of $r_i$, the adjusted distance of a data vector to the auxiliary hyperplane, shown as dot-dashed line segments in Figure 2. Similar to its role in DWD, $\sum_{i=1}^n (1/r_i)$ controls the gap between the two classes: the smaller $\sum_{i=1}^n (1/r_i)$ is, the more separated the two classes are.

In words, the purpose of the auxiliary hyperplane is not to classify data vectors, but to make it possible to define a number of distances (from the data vectors to itself) so that we can minimize the sum of the inverse distances. In ordinary DWD, this auxiliary hyperplane has to coincide with the hyperplane that is actually used for classification. Here we allow some flexibility, so that it is free of this restriction.

Necessity of the slack variable $\eta_i$
When $y_i f_2(x_i) < 0$, the (signed) distance from the data vector to the auxiliary hyperplane is negative. In this case, a positive $\eta_i$ is added to $y_i f_2(x_i)$ to make their sum $r_i$ positive. For example, in Figure 2, the functional margins $y_i f_2(x_i)$, $i = 22, 25$, are negative, and the DWSVM optimization adds positive $\eta_i$'s to make the sums $r_i = y_i f_2(x_i) + \eta_i$ positive. It is the sum of the inverses of the $r_i$ that we minimize, instead of the sum of the inverses of the signed distances $y_i f_2(x_i)$. This adjustment is necessary: otherwise, one could always send $\beta_2$ to infinity, i.e., push the auxiliary hyperplane infinitely far from the data, so that all the distances $y_i f_2(x_i)$ are infinite (positive or negative) and hence $1/(y_i f_2(x_i)) = 0$. This is certainly not a desired situation, because it would make the direction vector trivial (the minimum of the objective function would always be 0 regardless of the choice of the direction). For these reasons, the addition of $\eta_i$ and the inclusion of $\sum_{i=1}^n \eta_i$ in the objective function are necessary to make the optimization problem meaningful.

Slack variable $\eta_i$ does not measure misclassification

In the original DWD, the reason to minimize $\sum_{i=1}^n \eta_i$ is to control misclassification. However, the slack variable $\eta_i$ here is defined with respect to the auxiliary hyperplane (which is not used for classification), rather than the main separating hyperplane. Thus, we have liberated the DWD component from the burden of controlling misclassification, so that it can focus on defining the notion of gap and help search for an optimal direction vector in the DWD fashion, which overcomes overfitting.

Slack variable $\xi_i$

Last of all, the second term $\xi_i$ in (11) is a proxy for the modified hinge loss function of SVM in (1). This term is included for the purpose of controlling misclassification, because $\xi_i$ can be seen as $(\sqrt{C_{svm}} - C_{svm}u_i)_+$, where $u_i$ is the functional margin $y_i f_1(x_i)$ with respect to the main hyperplane. Minimizing the sum of the $\xi_i$'s helps to increase the functional margins $u_i$. Note that the functional margin $u_i = y_i f_1(x_i)$ can be interpreted as the distance to the main hyperplane (instead of the auxiliary one), which is ultimately the hyperplane used for classifying new data.
Summary

In summary, the hyperplane defined by $\omega$ and $\beta_2$ is an auxiliary hyperplane which is useful for finding the best direction, and the one defined by $\omega$ and $\beta_1$ is the main hyperplane which is useful for searching for the intercept and for good classification performance. By the trick of allowing two intercept terms, we gain some flexibility and let the two hyperplanes each do their own job.

Empirically, nDWSVM can be used to approximate DWSVM, especially for low to moderate dimensions. Moreover, nDWSVM is very easy to implement, so long as the user has accessible implementations of both SVM and DWD (both are now available in R and MATLAB). The differences between DWSVM and nDWSVM are that in the two-step prototype nDWSVM, the direction is determined only by the DWD algorithm, and the intercept is found by SVM based on the projections onto the DWD direction, whereas in DWSVM, the auxiliary hyperplane (concerning DWD) and the main hyperplane (concerning SVM) work together to find the optimal direction, and the optimization is done all at once.

Between DWD and DWSVM, the latter inherits the direction of the former and adopts a very effective intercept term from its SVM component. Compared with SVM, the DWSVM method has a direction that is much improved due to the DWD component.
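For readers who want to try the two-step prototype, here is a sketch of nDWSVM. It assumes only that `fit_dwd` is some available DWD implementation returning a direction vector (e.g., a wrapper around the R or MATLAB packages mentioned above, supplied by the user); the 1-dimensional SVM step uses scikit-learn:

```python
import numpy as np
from sklearn.svm import LinearSVC

def ndwsvm(X, y, fit_dwd, C_svm=100.0):
    """Two-step nDWSVM: DWD direction, then an SVM cutoff on the projections."""
    w = np.asarray(fit_dwd(X, y), dtype=float)   # step 1: DWD direction
    w = w / np.linalg.norm(w)                    # (the DWD intercept is discarded)
    proj = (X @ w).reshape(-1, 1)                # step 2: project the data onto w
    svm1d = LinearSVC(C=C_svm, loss="hinge", max_iter=100000).fit(proj, y)
    a, b = svm1d.coef_[0, 0], svm1d.intercept_[0]   # 1-d SVM rule: sign(a*t + b)
    s = np.sign(a)                               # a != 0 whenever the projections separate at all
    return s * w, s * (b / a)                    # classify x by sign(x @ w_out + beta_out)
```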
4 Simulation Studies

In this section, we first compare the classification performance and the interpretability of the DWSVM approaches with those of the original SVM and DWD. The classification performance is measured by the misclassification rate on a large test data set with 4000 observations. Interpretability is a concept that is more or less vague; we partially measure it by the angle between the discriminant direction vector of the classifier under investigation and that of the Bayes classifier. We believe that the closer a direction is to the Bayes rule direction, the better the interpretability of the linear classifier.

4.1 Comparison with SVM and DWD

We consider two different simulation settings. In each setting, samples from the two classes are generated from multivariate normal distributions $N_d(\pm\mu, \Sigma)$.

1. Example 1: Constant mean difference, identity covariance matrix. Here $\mu \equiv c\mathbf{1}_d$ and $\Sigma \equiv I_d$, where $c > 0$ satisfies $2c\|\mathbf{1}_d\| = 2.7$. This corresponds to the Mahalanobis distance between the two classes and represents a reasonable difficulty of classification using the Bayes rule.
2. Example 2: Decreasing mean difference, block-diagonal exchangeable covariance matrix. Here we let $\mu \equiv c\mathbf{v}_d$, where $\mathbf{v}_d = (\sqrt{\,\cdot\,}, \sqrt{\,\cdot\,}, \ldots, \sqrt{\,\cdot\,}, 0, 0, \ldots, 0)^T \in \mathbb{R}^d$ is a vector whose leading entries are a decreasing sequence of square roots and whose remaining entries are zero, and $\Sigma \equiv \mathrm{Block\text{-}Diag}\{\Sigma_0, \Sigma_0, \ldots, \Sigma_0\}$, where each $\Sigma_0$ is a $50 \times 50$ exchangeable sub-covariance matrix whose diagonal entries are all 1 and whose off-diagonal entries are 0.8. The scaling factor $c$ is chosen to make the Mahalanobis distance $\{(2c\mathbf{v}_d)^T\Sigma^{-1}(2c\mathbf{v}_d)\}^{1/2} = 2.7$.

We vary the dimension $d$ among 100, 200, 300, 500 and 1000; the last three cases correspond to the HDLSS data settings.
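A sketch of these data-generating mechanisms follows (under our reconstruction; the exact entries of $\mathbf{v}_d$ in Example 2 are not recoverable from the text, so the mean direction is left as an input):

```python
import numpy as np

def make_example1(n_pos, n_neg, d, delta=2.7, rng=None):
    """Example 1: mu = c*1_d, Sigma = I_d, with 2*c*||1_d|| = delta."""
    rng = rng or np.random.default_rng()
    mu = (delta / (2.0 * np.sqrt(d))) * np.ones(d)
    X = np.vstack([rng.standard_normal((n_pos, d)) + mu,
                   rng.standard_normal((n_neg, d)) - mu])
    return X, np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

def block_exchangeable_sigma(d, block=50, rho=0.8):
    """Example 2 covariance: block-diagonal, 50 x 50 exchangeable blocks."""
    B = rho * np.ones((block, block)) + (1.0 - rho) * np.eye(block)
    S = np.zeros((d, d))
    for k in range(d // block):
        S[k*block:(k+1)*block, k*block:(k+1)*block] = B
    return S

def mahalanobis_scale(v, Sigma, delta=2.7):
    """Return c such that {(2c v)' Sigma^{-1} (2c v)}^{1/2} = delta."""
    return delta / (2.0 * np.sqrt(v @ np.linalg.solve(Sigma, v)))
```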
Figure 3:
Comparison between four methods for Example 1 (the left panel) and Example 2 (the right panel). The misclassification error rates are shown on the top row and the angles between the classification directions and the Bayes direction are shown on the bottom. For Example 1 (left), for smaller dimensions, the two DWSVM approaches are better than SVM and DWD in terms of classification. For a large dimension, SVM outperforms the nDWSVM approach. The one-step DWSVM approach dominates all the other approaches in terms of classification performance. In terms of the interpretability (bottom) in Example 1, the two DWSVM approaches and the DWD approach all give similar and better results than the SVM approach. For Example 2 (right), DWSVM has similarly good classification performance to SVM and similarly good interpretability performance to DWD and nDWSVM. For small and moderate dimensions, the classification performance of DWSVM is significantly better than SVM.
4.1.1 Example 1

In the top-left panel of Figure 3, we report the misclassification errors of DWSVM, nDWSVM, DWD and SVM applied to a test data set with 2000 data points in each class, generated according to the Constant mean difference, identity covariance matrix example. We conduct the simulation 100 times and report the averages of the measurements. Our DWSVM approach uniformly gives the best classification results. The two-step alternative nDWSVM has very similar performance for dimensions 100, 200 and 500, but its performance degrades for higher dimensions. For all dimensions, unsurprisingly, the original DWD has a misclassification rate close to 50%, which is largely due to its intercept term being subject to the imbalanced data.

In the bottom-left panel of Figure 3, we calculate the angles between the directions from the different classifiers and the Bayes direction (for both simulation settings in this article, the Bayes classifiers are linear and the Bayes directions are well defined). It shows that all the DWD-related classifiers give very similar angles. As a matter of fact, the angles from DWSVM, nDWSVM and DWD almost overlap with each other in this plot, except for the low-dimensional case, where the DWSVM angle is a bit larger than the other two. On the other hand, the SVM directions differ from the Bayes direction significantly more than the DWD-family directions do.

The observations so far verify the conjecture that DWD is worse in misclassification rate and SVM is worse at giving an interpretable classification direction. DWSVM and nDWSVM appear to be able to address both issues simultaneously.

In the simulations, we tune the parameter $C_{svm}$ for SVM from a grid of possible values $2^{-5}, 2^{-4}, \ldots, 2^{10}$, and choose the one which gives rise to the smallest misclassification rate on a tuning data set that is identical to the training data set in terms of sample size and underlying distributions. For the DWD family of classifiers (DWSVM, nDWSVM and DWD), we let $C_{dwd}$ be 100 divided by a scaling factor that accounts for the scale of the data, as recommended by Marron et al. (2007). We fix $C_{svm} = 100$ for DWSVM and nDWSVM. Lastly, we let $\alpha = 0.5$.

4.1.2 Example 2

We have conducted the same comparison for the Decreasing mean difference, block-diagonal exchangeable covariance matrix example (Example 2), and the results are shown in the right panel of Figure 3. This time, the classification performances of DWSVM and SVM closely compete with each other. For small and moderate dimensions, DWSVM classifies significantly better than SVM; for $d = 500$ and 1000, its classification error rates are slightly greater than SVM's (not statistically significantly so). In terms of the angles between the classification direction vectors and the Bayes direction, the DWSVM directions are similar to those from nDWSVM and DWD, while all three are better than SVM. For the highest-dimensional case, all four directions differ considerably from the Bayes direction; however, the DWSVM direction is the best in this situation.

4.2 Sensitivity to the tuning parameters

In this subsection, we study the impact of different parameter values on DWSVM. First, we use ordinary DWD to search for an optimal choice of the $C_{dwd}$ parameter and fix its value in the sequel; in particular, we adopt the recommendation for $C_{dwd}$ in Marron et al. (2007). We choose not to pursue $C_{dwd}$ further because this parameter has been well studied for ordinary DWD by Marron et al. (2007) and for weighted DWD by Qiao et al. (2010).
Here, we use simulation to illustrate the sensitivity of the DWSVM method to different choices of the other two parameters, $C_{svm}$ and $\alpha$. We applied DWSVM to 100 simulations from the simulated examples defined above (Example 1 and Example 2), using the following schedules:

• for fixed $\alpha = 0.5$, various values of $C_{svm} = 2^{-5}, 2^{-4}, \ldots, 2^{10}$;

• for fixed $C_{svm} = 100$, various values of $\alpha = 0, 0.1, 0.2, \ldots, 1$.

The right panels of Figure 4 show the test errors against $\alpha$. It is very clear that in these settings (where $C_{svm} = 100$), the performance of DWSVM does not depend on the value of $\alpha$, as all the curves appear to be horizontal straight lines. It may be too early to conclude from this observation that the performance of DWSVM is independent of $\alpha$, since it could be due to the fact that $C_{svm} = 100$ happens to be a reasonably good parameter (see the discussion below). But it does suggest that the performance is influenced less by the $\alpha$ parameter than by the other parameters.
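A sketch of these two schedules follows, assuming a DWSVM solver with the (hypothetical) interface `fit_dwsvm(X, y, C_dwd, C_svm, alpha) -> (w, beta1)`; any implementation of (11)–(14) would do:

```python
import numpy as np

def test_error(w, beta1, X_test, y_test):
    return float(np.mean(np.sign(X_test @ w + beta1) != y_test))

def sensitivity_sweep(fit_dwsvm, X, y, X_test, y_test, C_dwd=100.0):
    """Test errors over the two schedules: C_svm on a log2 grid with
    alpha = 0.5, and alpha on a 0-to-1 grid with C_svm = 100."""
    errs_C = {k: test_error(*fit_dwsvm(X, y, C_dwd, 2.0 ** k, 0.5), X_test, y_test)
              for k in range(-5, 11)}
    errs_a = {round(a, 1): test_error(*fit_dwsvm(X, y, C_dwd, 100.0, a), X_test, y_test)
              for a in np.arange(0.0, 1.01, 0.1)}
    return errs_C, errs_a
```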
In the left panels of Figure 4, we do the same for different values of $C_{svm}$, given $\alpha = 0.5$. A similar message can be obtained, although under a restrictive condition: for Example 1, the curves appear to be flat once $C_{svm}$ is not too small, so any value in this range should work reasonably well. For Example 2, the optimal $C_{svm}$ lies in a narrow range of small values; however, even its performance is not significantly better than that of greater $C_{svm}$ values. Overall, it seems that as long as the value of $C_{svm}$ is not too small, the classification performance will be close to optimal.
Figure 4:
Left panels: test errors of DWSVM applied to Example 1 and Example 2 (with $d = 300$ and $\alpha = 0.5$) for different values of $C_{svm}$ over 100 runs. The plots show that for Example 1, any $C_{svm}$ beyond a small threshold leads to similar classification performance, while for Example 2 a narrow range of small $C_{svm}$ values is the best, although the performance of such parameter choices is not very different from that of even greater $C_{svm}$ values. Right panels: test errors of DWSVM applied to Example 1 and Example 2 (with $d = 300$ and $C_{svm} = 100$) for different values of $\alpha$ over 100 runs. The performance does not depend much on the choice of the $\alpha$ value.

This is the reason why we fix the values of $\alpha$ and $C_{svm}$ to be 0.5 and 100, respectively, in the comparison study conducted in the previous section. Users are free to grid-search the values of $C_{svm}$ and $\alpha$ if they wish, although it seems that the effort for the latter is not worthwhile.

5 Real Data Application

In this section, we compare DWSVM with the competing classifiers by applying them to the Golub data set (Golub et al., 1999). This gene expression data set has 3051 genes and 38 tumor mRNA samples from the leukemia microarray study of Golub et al. (1999). Pre-processing was done as described in Dudoit et al. (2002).
Figure 5:
Cross-validated numbers of misclassified observations for SVM, DWD, DWSVM and nDWSVM for the original Golub data set, the Golub data with one pair of mislabeled observations, and the data with two pairs of mislabeled observations. For the original data, both the SVM and the DWSVM methods have CV error almost 0, with DWSVM being a little better. When there are mislabeled observations, the advantage of DWSVM becomes more obvious: it can be seen that DWSVM has the smallest CV errors, while nDWSVM is on a par with SVM. The DWD classifier is always worse than the others in terms of classification performance.

As there are 11 and 27 observations in the two classes, we expect the SVM and DWSVM classifiers to give better results than DWD, because the latter is subject to the imbalanced sample sizes. Moreover, because the dimension is much higher than the sample size, we expect severe overfitting in this data set. We apply SVM, DWD, DWSVM and nDWSVM to the data set and use 3-fold cross-validation to find the best $C_{svm}$ tuning parameter value; the $C_{dwd}$ and $\alpha$ values are fixed. In the left panel of Figure 5, we report the average cross-validated (CV) number of misclassified observations and the standard error over 100 random foldings. Both SVM and DWSVM give very good results (CV error almost zero), although the DWSVM method is a little better. The nDWSVM error is almost twice that of SVM, and the DWD error is almost four times as large.

In order to see the extent to which our DWSVM avoids overfitting, we perturb the original data set as follows. We randomly switch the class labels of $k$ pairs of observations ($k$ observations from each class), $k = 1, 2$, and repeat the comparison. As shown in Figure 5, DWSVM then has the smallest CV errors, while nDWSVM is on a par with SVM and DWD remains the worst.
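A sketch of this cross-validation experiment follows (SVM would be scikit-learn's LinearSVC; DWD and DWSVM stand in as hypothetical wrappers exposing the same fit/predict interface):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def flip_k_pairs(y, k, rng):
    """Randomly switch the labels of k observations from each class."""
    y = y.copy()
    pos = rng.choice(np.where(y == 1)[0], size=k, replace=False)
    neg = rng.choice(np.where(y == -1)[0], size=k, replace=False)
    y[pos], y[neg] = -1, 1
    return y

def cv_misclassified(clf, X, y, n_folds=3, seed=0):
    """Number of CV-misclassified observations for one random folding."""
    total = 0
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in skf.split(X, y):
        clf.fit(X[train], y[train])
        total += int(np.sum(clf.predict(X[test]) != y[test]))
    return total
```

Averaging `cv_misclassified` over 100 random foldings, before and after `flip_k_pairs` with $k = 1, 2$, reproduces the kind of quantities reported in Figure 5.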
6 Theoretical Properties

We will show some theoretical properties of DWSVM in three different flavors. First, we derive the Fisher consistency of the DWSVM loss function; note that the loss function of DWSVM is not a typical large-margin loss function. Second, we derive the asymptotic normality of the DWSVM coefficient vector. Third, we show that the intercept of DWSVM does not diverge, even in an extremely imbalanced setting.

6.1 Fisher consistency

The DWSVM method can be estimated from equations (15)–(16). Thus the underlying loss function can be written as $L(yf_1(x), yf_2(x)) = \alpha V_{C_{dwd}}(yf_2(x)) + (1-\alpha)H_{C_{svm}}(yf_1(x))$. Because there are two functions involved, the underlying loss function is not a traditional margin-based loss function, which involves only one function, such as that considered in Lin (2004). Moreover, the two hyperplanes implied by $f_1$ and $f_2$ in our method are parallel to each other. In general cases (beyond linear functions), this can be interpreted as the difference of these two functions being a constant, i.e., $f_1(x) - f_2(x)$ is independent of $x$. Theorem 1 below shows the Fisher consistency of the DWSVM loss function.

Theorem 1. For any given $C_{svm}, C_{dwd} > 0$ and $\alpha \in [0, 1]$, if $E[L\{Yf_1(X), Yf_2(X)\}]$ has a global minimizer $(f_1^*(x), f_2^*(x))$ subject to $f_1(x) - f_2(x)$ being a constant, then $\mathrm{sign}[f_1^*(x)] = \mathrm{sign}[q(x) - 1/2]$, where $q(x) \equiv P(Y = +1 \mid X = x)$.

Fisher consistency of the DWSVM loss function ensures that the sign of the minimizer of the expected loss function (subject to the parallel condition) coincides with the Bayes rule.
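Theorem 1 can also be checked numerically on a grid (a sanity-check sketch under the reconstructed losses (1) and (10); this is not part of the proof): for each fixed constant difference $\delta = f_1 - f_2$, the risk-minimizing $f_1$ carries the sign of $q(x) - 1/2$.

```python
import numpy as np

def H(u, C):   # rescaled hinge loss (1)
    return max(np.sqrt(C) - C * u, 0.0)

def V(u, C):   # DWD loss (10)
    return 2.0 * np.sqrt(C) - C * u if u <= 1.0 / np.sqrt(C) else 1.0 / u

def fisher_check(q, delta, C=4.0, alpha=0.5):
    """Minimize the conditional risk over f1 (with f2 = f1 - delta) on a grid
    and verify sign(f1*) = sign(q - 1/2)."""
    def risk(f1):
        f2 = f1 - delta
        return (q * (alpha * V(f2, C) + (1 - alpha) * H(f1, C))
                + (1 - q) * (alpha * V(-f2, C) + (1 - alpha) * H(-f1, C)))
    grid = np.linspace(-3.0, 3.0, 1201)
    f1_star = min(grid, key=risk)
    return np.sign(f1_star) == np.sign(q - 0.5)

print(all(fisher_check(q, d) for q in (0.1, 0.3, 0.7, 0.9) for d in (-0.5, 0.0, 0.5)))
```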
6.2 Asymptotic normality

Koo et al. (2008) studied the asymptotic normality of the coefficient vector of the SVM classifier. We follow the same direction and prove the corresponding results for the DWSVM classifier.

For ease of presentation, we let $\omega_+$ denote the augmented parameter vector $(\beta_2, \beta_1, \omega^T)^T \in \mathbb{R}^{d+2}$, and $x_+$, $x_\dagger$ and $x_\ddagger$ the augmented data vectors $(0, 1, x^T)^T$, $(1, 0, x^T)^T$ and $(1, 1, x^T)^T \in \mathbb{R}^{d+2}$. Consequently, the main discriminant function is $f_1(x; \omega_+) \equiv x_+^T\omega_+ = x^T\omega + \beta_1$, and the auxiliary discriminant function is $f_2(x; \omega_+) \equiv x_\dagger^T\omega_+ = x^T\omega + \beta_2$.

We cast DWSVM as an optimization problem with an unconstrained objective function,
$$q_{\lambda,n}(\omega_+) \equiv \frac{1}{n}\sum_{i=1}^n L(x_i, y_i, \omega_+) + \lambda\|\omega\|^2 \qquad (17)$$
$$= \frac{1}{n}\sum_{i=1}^n \left\{\alpha V_{C_d}(y_i f_2(x_i; \omega_+)) + (1-\alpha)H_{C_s}(y_i f_1(x_i; \omega_+))\right\} + \lambda\|\omega\|^2. \qquad (18)$$
The solution to the optimization problem can be scaled by the norm of $\omega$ so as to make it have unit norm.

The population version of (18) without the penalty term is defined as $Q(\omega_+) \equiv E\{\alpha V_{C_d}(Yf_2(X;\omega_+)) + (1-\alpha)H_{C_s}(Yf_1(X;\omega_+))\}$, whose minimizer is $\omega_+^* \equiv \mathop{\mathrm{argmin}}_{\omega_+} Q(\omega_+)$.

For ease of presentation, let
$$g(x,y,\omega_+) \equiv \alpha\left(-\mathbb{1}\{yf_2(x;\omega_+) \le 1/\sqrt{C_d}\}C_d - \mathbb{1}\{yf_2(x;\omega_+) > 1/\sqrt{C_d}\}/[yf_2(x;\omega_+)]^2\right),$$
$$h(x,y,\omega_+) \equiv (1-\alpha)\left(-\mathbb{1}\{yf_1(x;\omega_+) \le 1/\sqrt{C_s}\}C_s\right),$$
$$v(x,y,\omega_+) \equiv \alpha\left(\mathbb{1}\{yf_2(x;\omega_+) > 1/\sqrt{C_d}\}\cdot 2/[yf_2(x;\omega_+)]^3\right),$$
$$w(x,y,\omega_+) \equiv (1-\alpha)\,\delta\left(1/\sqrt{C_s} - yf_1(x;\omega_+)\right)C_s,$$
where $\delta(\cdot)$ denotes the Dirac delta function. Furthermore, let $S(\omega_+) \equiv E\{g(X,Y,\omega_+)YX_\dagger + h(X,Y,\omega_+)YX_+\}$ and $U(\omega_+) \equiv E\{v(X,Y,\omega_+)X_\dagger X_\dagger^T + w(X,Y,\omega_+)X_+X_+^T\}$. Let $\Omega(X_i,Y_i,\omega_+^*) = \mathrm{diag}\{g(X_i,Y_i,\omega_+^*),\, h(X_i,Y_i,\omega_+^*),\, [g(X_i,Y_i,\omega_+^*) + h(X_i,Y_i,\omega_+^*)]I_d\}$, where $I_d$ is the $d \times d$ identity matrix. Then define
$$T_n \equiv \sum_{i=1}^n\left\{g(X_i,Y_i,\omega_+^*)Y_i(X_i)_\dagger + h(X_i,Y_i,\omega_+^*)Y_i(X_i)_+\right\} = \sum_{i=1}^n Y_i\,\Omega(X_i,Y_i,\omega_+^*)(X_i)_\ddagger.$$
Lastly, define $G(\omega_+^*) \equiv E\left[\Omega(X_i,Y_i,\omega_+^*)(X_i)_\ddagger(X_i)_\ddagger^T\,\Omega(X_i,Y_i,\omega_+^*)^T\right]$.

Some regularity conditions are needed; we state them in the Appendix. Conditions (A1), (A2) and (A4) are the same as in Koo et al. (2008); our new (A3) is tailored for DWSVM and incorporates the DWD component. In particular, (A1) ensures that $U(\omega_+)$ is well defined and continuous in $\omega_+$, while (A1) and (A2) ensure that the minimizer $\omega_+^*$ exists. (A3) is a sufficient condition for $\omega_+^*$ to be nonzero. (A4) guarantees the positive-definiteness of $U(\omega_+)$ around $\omega_+^*$.

Under these regularity conditions, we obtain a Bahadur representation of $\widehat{\omega}_{\lambda,n,+}$ in Theorem 2, the asymptotic normality in Theorem 3, and consequently the asymptotic normality of the discriminant function $f_1(x; \widehat{\omega}_{\lambda,n,+})$ at $x$ in Corollary 4.

Theorem 2. Suppose that (A1)–(A4) are met. For $\lambda = o(n^{-1/2})$, we have
$$\sqrt{n}\left(\widehat{\omega}_{\lambda,n,+} - \omega_+^*\right) = -\frac{1}{\sqrt{n}}U(\omega_+^*)^{-1}T_n + o_P(1).$$

Theorem 3. Suppose that (A1)–(A4) are met. For $\lambda = o(n^{-1/2})$, we have
$$\sqrt{n}\left(\widehat{\omega}_{\lambda,n,+} - \omega_+^*\right) \xrightarrow{d} N\left(\mathbf{0},\; U(\omega_+^*)^{-1}G(\omega_+^*)U(\omega_+^*)^{-1}\right).$$

This will lead to the following corollary.
Corollary 4. Under the same conditions as in Theorem 3, for $\lambda = o(n^{-1/2})$ and any $x \in \mathbb{R}^d$,
$$\sqrt{n}\left(f_1(x; \widehat{\omega}_{\lambda,n,+}) - f_1(x; \omega_+^*)\right) \xrightarrow{d} N\left(0,\; x_+^T U(\omega_+^*)^{-1}G(\omega_+^*)U(\omega_+^*)^{-1}x_+\right).$$

6.3 Non-divergence of the intercept term

Owen (2007) discussed the behavior of the intercept term in logistic regression when the sample size of one class is extremely large while that of the other class is fixed. Moreover, Qiao and Zhang (2013) showed that the intercept term of DWD diverges. In this subsection, we prove that the intercept term of the DWSVM classifier does not diverge. Without loss of generality, we assume that $n_- \gg n_+$, i.e., the negative class is the majority class.

Lemma 5. Suppose that the negative majority class is sampled from a distribution with compact support $S$. Then the intercept term $\beta$ in SVM does not diverge to negative infinity as $n_- \to \infty$.

Corollary 6. Suppose that the negative majority class is sampled from a distribution with compact support $S$. Then the intercept term $\beta_1$ in DWSVM does not diverge to negative infinity as $n_- \to \infty$.

The assumption of compact support $S$ is essential here, but it is fairly weak and holds in many real applications. Note that this result does not ensure that the sensitivity issue is completely overcome by SVM or DWSVM. Instead, it suggests that in the $n_- \to \infty$ asymptotics, the impact of the imbalanced sample size is limited to some extent.

7 Conclusion
Both SVM and DWD enjoy certain advantages and are subject to certain disadvantages. DWSVM combines the merits of both methods by creatively deploying an auxiliary intercept term. We have shown standard asymptotic results for the DWSVM classifier. The simulations and the real data application establish the superiority of the DWSVM method over SVM and DWD in some situations. In particular, the DWSVM method can lead to a discriminant direction vector that, like the DWD direction, preserves important features of the data set. More importantly, DWSVM also performs very well in terms of classification; as a bottom line, its performance is just as good as that of SVM. In special settings such as the perturbed data, we have demonstrated that DWSVM can overcome overfitting and is more robust against perturbation/mislabeling of the data.

We have shown some asymptotic properties of DWSVM in this paper. More work can be done to investigate its statistical properties, for example, along the lines of Blanchard et al. (2008).

An immediate extension of the DWSVM classifier is multiclass classification. For a multiclass classification problem with $K$ classes, the following optimization problem accomplishes such an extension:
$$\mathop{\mathrm{argmin}}_{\omega_j,\,\beta_{1,j},\,\beta_{2,j},\,\xi,\,\eta}\ \sum_{i=1}^n \sum_{y_i = j,\, k \ne j} \left\{\alpha\left(\frac{1}{r_{ijk}} + C_{dwd}\cdot\eta_{ijk}\right) + (1-\alpha)\xi_{ijk}\right\},$$
$$\text{s.t. } r_{ijk} = y_i\{x_i^T(\omega_j - \omega_k) + (\beta_{2,j} - \beta_{2,k})\} + \eta_{ijk},\quad r_{ijk} \ge 0,\ \eta_{ijk} \ge 0,$$
$$C_{svm}\,y_i\{x_i^T(\omega_j - \omega_k) + (\beta_{1,j} - \beta_{1,k})\} + \xi_{ijk} \ge \sqrt{C_{svm}},\quad \xi_{ijk} \ge 0,$$
$$\sum_{j=1}^K \|\omega_j\|^2 \le 1,\quad \sum_{j=1}^K \omega_j = \mathbf{0},\quad \sum_{j=1}^K \beta_{1,j} = 0,\quad \sum_{j=1}^K \beta_{2,j} = 0.$$
Other extensions, such as kernel DWSVM or sparse DWSVM, are also readily within reach.

In summary, DWSVM integrates the merits of classical classification methods. Its numerical performance is very good and it is theoretically justified. This is evidence that it is a very promising linear learner with great potential in many applications. Future work will also concentrate on developing a more efficient implementation of DWSVM.

Acknowledgment

The first author's work was partially supported by Binghamton University Harpur College Dean's New Faculty Start-up Funds and a collaboration grant from the Simons Foundation.
Appendices
Proof of Theorem 1
For any $x$, denote $q(x) = P(Y = +1 \mid X = x)$. The conditional risk is
$$R(f_1, f_2) \equiv E[L\{Yf_1(X), Yf_2(X)\} \mid X = x] = \{\alpha V_{C_{dwd}}(f_2) + (1-\alpha)H_{C_{svm}}(f_1)\}q(x) + \{\alpha V_{C_{dwd}}(-f_2) + (1-\alpha)H_{C_{svm}}(-f_1)\}\{1 - q(x)\},$$
where for simplicity we write $f_1(x)$ and $f_2(x)$ as $f_1$ and $f_2$.

For the global minimizer $(f_1^*, f_2^*)$, since $f_1^* - f_2^* = \Delta^*$ is independent of $x$, we can consider another feasible (but not optimal) solution $(-f_1^*, -f_1^* - \Delta^*)$. Due to the optimality of $(f_1^*, f_2^*) = (f_1^*, f_1^* - \Delta^*)$, we can show that
$$0 \ge R(f_1^*, f_1^* - \Delta^*) - R(-f_1^*, -f_1^* - \Delta^*) = \{2q(x) - 1\}\left[\alpha\{V_{C_{dwd}}(f_1^* - \Delta^*) - V_{C_{dwd}}(-f_1^* - \Delta^*)\} + (1-\alpha)\{H_{C_{svm}}(f_1^*) - H_{C_{svm}}(-f_1^*)\}\right].$$
Thus, if $q(x) > 1/2$, then
$$\alpha\{V_{C_{dwd}}(f_1^* - \Delta^*) - V_{C_{dwd}}(-f_1^* - \Delta^*)\} + (1-\alpha)\{H_{C_{svm}}(f_1^*) - H_{C_{svm}}(-f_1^*)\} \le 0.$$
Because $V_{C_{dwd}}(\cdot)$ is strictly decreasing everywhere and $H_{C_{svm}}(\cdot)$ is strictly decreasing around 0, the terms $V_{C_{dwd}}(f_1^* - \Delta^*) - V_{C_{dwd}}(-f_1^* - \Delta^*)$ and $H_{C_{svm}}(f_1^*) - H_{C_{svm}}(-f_1^*)$ have the same sign, and hence $f_1^* \ge 0$. By a similar argument, if $q(x) < 1/2$, then $f_1^* \le 0$. Lastly, it is easy to show that $f_1^* \ne 0$. Hence we have $\mathrm{sign}(f_1^*) = \mathrm{sign}(q(x) - 1/2)$.

Regularity conditions
We state the regularity conditions for the asymptotics below. We use $C_1, C_2, \ldots$ to denote positive constants independent of $n$.

(A1) The densities $p_+$ and $p_-$ are continuous and have finite second moments.

(A2) There exists a ball $B(x_0, \delta_0)$, centered at $x_0$ with radius $\delta_0 > 0$, such that $p_+(x) > C_1$ and $p_-(x) > C_1$ for every $x \in B(x_0, \delta_0)$.

(A3) For some $1 \le l \le d$,
$$E\left(\mathbb{1}\{X_l \ge F_{L-}\}X_l \mid Y = -1\right) < E\left(\mathbb{1}\{X_l \le F_{U+}\}X_l \mid Y = +1\right)$$
or
$$E\left(\mathbb{1}\{X_l \le F_{U-}\}X_l \mid Y = -1\right) > E\left(\mathbb{1}\{X_l \ge F_{L+}\}X_l \mid Y = +1\right),$$
where $F_{L+}$ and $F_{L-}$ ($F_{U+}$ and $F_{U-}$, respectively) are the lower bounds (upper bounds, respectively) for the positive and negative classes. They are defined through
$$P\left(X_l \ge F_{L+} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_+\{\alpha C_d + (1-\alpha)C_s\}}{\pi_-(1-\alpha)C_s}\right),$$
$$P\left(X_l \ge F_{L-} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_-\{\alpha C_d + (1-\alpha)C_s\}}{\pi_+(1-\alpha)C_s}\right),$$
$$P\left(X_l \le F_{U+} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_+(1-\alpha)C_s}{\pi_-\{\alpha C_d + (1-\alpha)C_s\}}\right),$$
$$P\left(X_l \le F_{U-} \mid Y = +1\right) = \min\left(1,\ \frac{\pi_-(1-\alpha)C_s}{\pi_+\{\alpha C_d + (1-\alpha)C_s\}}\right).$$

(A4) For an orthogonal transformation $A_l$ that maps $\omega^*/\|\omega^*\|$ to the $l$th unit basis vector $e_l$ for some $1 \le l \le d$, there exist rectangles
$$D_+ = \left\{x \in M_+: l_s \le (A_lx)_s \le v_s \text{ with } l_s < v_s \text{ for } s \ne l\right\}$$
and
$$D_- = \left\{x \in M_-: l_s \le (A_lx)_s \le v_s \text{ with } l_s < v_s \text{ for } s \ne l\right\}$$
such that $p_+(x) \ge C_2 > 0$ on $D_+$ and $p_-(x) \ge C_2 > 0$ on $D_-$, where $M_+ \equiv \{x: x^T\omega^* + \beta_1^* = 1/\sqrt{C_s}\}$ and $M_- \equiv \{x: x^T\omega^* + \beta_1^* = -1/\sqrt{C_s}\}$.

Proof of Theorems 2 and 3 and Corollary 4
For fixed $\theta \in \mathbb{R}^{d+2}$, define
$$\Lambda_n(\theta) \equiv n\left\{q_{\lambda,n}(\omega_+^* + \theta/\sqrt{n}) - q_{\lambda,n}(\omega_+^*)\right\}, \quad \text{and} \quad \Gamma_n(\theta) \equiv E\Lambda_n(\theta).$$
Observe that
$$\Gamma_n(\theta) = n\left\{Q(\omega_+^* + \theta/\sqrt{n}) - Q(\omega_+^*)\right\} + \lambda\left(\|\theta_{(3:d+2)}\|^2 + 2\sqrt{n}\,\omega^{*T}\theta_{(3:d+2)}\right),$$
where $\theta_{(3:d+2)}$ denotes the sub-vector of $\theta$ corresponding to $\omega$. By a Taylor series expansion of $Q$ around $\omega_+^*$, we obtain, for some $0 < t < 1$,
$$\Gamma_n(\theta) = \frac{1}{2}\theta^T U\left(\omega_+^* + (t/\sqrt{n})\theta\right)\theta + \lambda\left(\|\theta_{(3:d+2)}\|^2 + 2\sqrt{n}\,\omega^{*T}\theta_{(3:d+2)}\right).$$
Because $U(\omega_+)$ is continuous in $\omega_+$, due to condition (A1), we have
$$\frac{1}{2}\theta^T U\left(\omega_+^* + (t/\sqrt{n})\theta\right)\theta = \frac{1}{2}\theta^T U(\omega_+^*)\theta + o(1).$$
This, combined with $\lambda = o(n^{-1/2})$, results in
$$\Gamma_n(\theta) = \frac{1}{2}\theta^T U(\omega_+^*)\theta + o(1).$$

Now observe that $ET_n = nS(\omega_+^*) = \mathbf{0}$ and $E(T_nT_n^T) = \sum_{i=1}^n E[\Omega(X_i,Y_i,\omega_+^*)(X_i)_\ddagger(X_i)_\ddagger^T\Omega(X_i,Y_i,\omega_+^*)^T] = nG(\omega_+^*)$. Hence $T_n/\sqrt{n}$ follows $N(\mathbf{0}, G(\omega_+^*))$ asymptotically by the central limit theorem.

Next, we define
$$R_{i,n}(\theta) \equiv L_{i,n}(\omega_+^* + \theta/\sqrt{n}) - L_{i,n}(\omega_+^*) - \left(\frac{\partial L_{i,n}}{\partial\omega_+}(\omega_+)\Big|_{\omega_+ = \omega_+^*}\right)^T\theta/\sqrt{n},$$
where $L_{i,n}(\omega_+) \equiv \alpha V_{C_d}(Y_i(X_i)_\dagger^T\omega_+) + (1-\alpha)H_{C_s}(Y_i(X_i)_+^T\omega_+)$.

We continue by splitting $R_{i,n}$ into two parts, $R_{i,n} = \alpha R_{i,n}^d + (1-\alpha)R_{i,n}^s$, where the first term concerns the DWD component and the second the SVM component. For the DWD component,
$$R_{i,n}^d(\theta) \equiv V\left(Y_i(X_i)_\dagger^T(\omega_+^* + \theta/\sqrt{n})\right) - V\left(Y_i(X_i)_\dagger^T\omega_+^*\right) - \left(\frac{\partial V}{\partial\omega_+}(\omega_+)\Big|_{\omega_+ = \omega_+^*}\right)^T\theta/\sqrt{n}.$$
Because the DWD loss $V$ has a continuous first-order derivative, $R_{i,n}^d(\theta) = O(n^{-1})$. For the SVM component,
$$R_{i,n}^s(\theta) \equiv H\left[Y_i(X_i)_+^T(\omega_+^* + \theta/\sqrt{n})\right] - H\left[Y_i(X_i)_+^T\omega_+^*\right] + \sqrt{C_s}\,\mathbb{1}\left\{Y_i(X_i)_+^T\omega_+^* < 1/\sqrt{C_s}\right\}Y_i(X_i)_+^T\theta/\sqrt{n}.$$
Following the argument of Koo et al. (2008) and combining the fact that $R_{i,n}^d(\theta) = O(n^{-1})$, we can show that $\sum_{i=1}^n E\left(|R_{i,n}(\theta) - ER_{i,n}(\theta)|^2\right) \to 0$ as $n \to \infty$. Note that
$$\Lambda_n(\theta) = \Gamma_n(\theta) + T_n^T\theta/\sqrt{n} + \sum_{i=1}^n \left(R_{i,n}(\theta) - ER_{i,n}(\theta)\right).$$
Thus
$$\Lambda_n(\theta) = \frac{1}{2}\theta^T U(\omega_+^*)\theta + T_n^T\theta/\sqrt{n} + o_P(1).$$
By the Convexity Lemma in Pollard (1991), we have, for any fixed $\theta$,
$$\Lambda_n(\theta) = \frac{1}{2}(\theta - \zeta_n)^T U(\omega_+^*)(\theta - \zeta_n) - \frac{1}{2}\zeta_n^T U(\omega_+^*)\zeta_n + r_n(\theta),$$
where $\zeta_n \equiv -U(\omega_+^*)^{-1}T_n/\sqrt{n}$, and for each compact set $K \subset \mathbb{R}^{d+2}$,
$$\sup_{\theta \in K}|r_n(\theta)| \xrightarrow{p} 0.$$
We then follow the argument in Koo et al. (2008): for each $\varepsilon > 0$, with $\widehat\theta_{\lambda,n} = \sqrt{n}(\widehat\omega_{\lambda,n,+} - \omega_+^*)$,
$$P\left(\|\widehat\theta_{\lambda,n} - \zeta_n\| > \varepsilon\right) \to 0,$$
which completes the proof.

Proof of Lemma 5
We prove the result for the simpler and more intuitive case of $d = 1$. In this case $\omega \in \mathbb{R}$ does not need to be optimized, and we can simply assume that $\omega = 1$. Moreover, we consider the worst-case scenario where $n_+ = 1$, which represents the most imbalanced sample sizes. We let $x_0$ denote the sole data vector in the positive minority class.

Since the negative class is extremely large compared to the positive class, we can assume that the functional margin with respect to the main hyperplane, $u_0 \equiv y_0(x_0 + \beta) = x_0 + \beta$, for the data vector from the positive minority class is always less than $1/\sqrt{C_s}$, that is, $\beta \le 1/\sqrt{C_s} - x_0$. Write the objective function of SVM as
$$L_s(\beta) \equiv \left(\sqrt{C_s} - C_sx_0 - C_s\beta\right)_+ + \sum_{i=1}^n \mathbb{1}\{y_i = -1\}\left(\sqrt{C_s} + C_sx_i + C_s\beta\right)_+ \approx \left(\sqrt{C_s} - C_sx_0 - C_s\beta\right)_+ + n_-E\left\{\left(\sqrt{C_s} + C_sX + C_s\beta\right)_+ \mid Y = -1\right\}.$$
Note that
$$\frac{\partial L_s}{\partial\beta}(\beta) = -C_s + n_-C_sE\left[\mathbb{1}\{\sqrt{C_s} + C_sX + C_s\beta > 0\} \mid Y = -1\right] = -C_s + n_-C_sP\left[\sqrt{C_s} + C_sX + C_s\beta > 0 \mid Y = -1\right].$$
This leads to
$$\lim_{\beta \to -\infty}\frac{\partial L_s}{\partial\beta}(\beta) = -C_s < 0.$$
Let $M$ be an upper bound of the compact support $S$ of the negative class, so that $P[X > M \mid Y = -1] = 0$. Then
$$\frac{\partial L_s}{\partial\beta}\left(-M - 1/\sqrt{C_s}\right) = -C_s + n_-C_sP\left[\sqrt{C_s} + C_sX + C_s(-M - 1/\sqrt{C_s}) > 0 \mid Y = -1\right] = -C_s + n_-C_sP\left[X > M \mid Y = -1\right] = -C_s < 0.$$
If $1/\sqrt{C_s} - x_0 \le -M - 1/\sqrt{C_s}$, then
$$\frac{\partial L_s}{\partial\beta}\left(1/\sqrt{C_s} - x_0\right) = -C_s + n_-C_sP\left[\sqrt{C_s} + C_sX + C_s(1/\sqrt{C_s} - x_0) > 0 \mid Y = -1\right] = -C_s + n_-C_sP\left[X > x_0 - 2/\sqrt{C_s} \mid Y = -1\right] = -C_s < 0,$$
and $\beta = 1/\sqrt{C_s} - x_0$ is the minimizer of $L_s$. On the other hand, if $1/\sqrt{C_s} - x_0 > -M - 1/\sqrt{C_s}$, then the minimizer $\beta^*$ is greater than $-M - 1/\sqrt{C_s}$ but less than or equal to $1/\sqrt{C_s} - x_0$. This means that the intercept term $\beta$ in SVM does not diverge to $-\infty$.

References
Ahn, J. and Marron, J. (2010), "The maximal data piling direction for discrimination," Biometrika, 97, 254–259.

Bartlett, P., Jordan, M., and McAuliffe, J. (2006), "Convexity, classification, and risk bounds," Journal of the American Statistical Association, 101, 138–156.

Blanchard, G., Bousquet, O., and Massart, P. (2008), "Statistical performance of support vector machines," The Annals of Statistics, 36, 489–531.

Cortes, C. and Vapnik, V. (1995), "Support-vector networks," Machine Learning, 20, 273–297.

Cristianini, N. and Shawe-Taylor, J. (2000), An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press.

Duda, R., Hart, P., and Stork, D. (2001), Pattern Classification, Wiley.

Dudoit, S., Fridlyand, J., and Speed, T. (2002), "Comparison of discrimination methods for the classification of tumors using gene expression data," Journal of the American Statistical Association, 97, 77–87.

Freund, Y. and Schapire, R. E. (1997), "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55, 119–139.

Friedman, J., Hastie, T., and Tibshirani, R. (2000), "Additive logistic regression: a statistical view of boosting," The Annals of Statistics, 28, 337–374.

Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Downing, J., Caligiuri, M., et al. (1999), "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, 286, 531–537.

Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition), Springer.

Koo, J., Lee, Y., Kim, Y., and Park, C. (2008), "A Bahadur representation of the linear support vector machine," Journal of Machine Learning Research, 9, 1343–1368.

Lin, Y. (2004), "A note on margin-based loss functions in classification," Statistics & Probability Letters, 68, 73–82.

Marron, J., Todd, M., and Ahn, J. (2007), "Distance-weighted discrimination," Journal of the American Statistical Association, 102, 1267–1271.

Owen, A. (2007), "Infinitely imbalanced logistic regression," Journal of Machine Learning Research, 8, 761–773.

Pollard, D. (1991), "Asymptotics for least absolute deviation regression estimators," Econometric Theory, 7, 186–199.

Qiao, X., Zhang, H., Liu, Y., Todd, M., and Marron, J. (2010), "Weighted distance weighted discrimination and its asymptotic properties," Journal of the American Statistical Association, 105, 401–414.

Qiao, X. and Zhang, L. (2013), "Flexible high-dimensional classification machines and their asymptotic properties."

Vapnik, V. (1998), Statistical Learning Theory, Wiley.