Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS
Boxiang Wang ∗ and Hui Zou † August 21, 2015
Abstract
Distance weighted discrimination (DWD) is a margin-based classifier with an interesting geometric motivation. DWD was originally proposed as a superior alternative to the support vector machine (SVM); however, DWD is yet to be popular compared with the SVM. The main reasons are twofold. First, the state-of-the-art algorithm for solving DWD is based on second-order-cone programming (SOCP), while the SVM is a quadratic programming problem which is much more efficient to solve. Second, the current statistical theory of DWD mainly focuses on the linear DWD for the high-dimension-low-sample-size setting and data-piling, while the learning theory for the SVM mainly focuses on the Bayes risk consistency of the kernel SVM. In fact, the Bayes risk consistency of DWD is presented as an open problem in the original DWD paper. In this work, we advance the current understanding of DWD from both computational and theoretical perspectives. We propose a novel efficient algorithm for solving DWD, and our algorithm can be several hundred times faster than the existing state-of-the-art algorithm based on SOCP. In addition, our algorithm can handle the generalized DWD, while the SOCP algorithm only works well for a special DWD but not the generalized DWD. Furthermore, we consider a natural kernel DWD in a reproducing kernel Hilbert space and then establish the Bayes risk consistency of the kernel DWD. We compare DWD and the SVM on several benchmark data sets and show that the two have comparable classification accuracy, but DWD equipped with our new algorithm can be much faster to compute than the SVM.

Keywords: Bayes risk consistency, Classification, DWD, Kernel methods, MM principle, SOCP.

∗ School of Statistics, University of Minnesota.
† Corresponding author, [email protected]. School of Statistics, University of Minnesota.
1 Introduction

Binary classification problems appear in diverse practical applications, such as financial fraud detection, spam email classification, medical diagnosis with genomics data, and drug response modeling, among many others. In these classification problems, the goal is to predict class labels based on a given set of variables. Suppose that we observe a training data set consisting of n pairs {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^p and y_i ∈ {−1, 1}. A classifier fits a discriminant function f and constructs a classification rule that assigns a data point x_i to either class 1 or class −1 according to the sign of f(x_i). The decision boundary is given by {x : f(x) = 0}. Two canonical classifiers are linear discriminant analysis and logistic regression. Modern classification algorithms can produce flexible non-linear decision boundaries with high accuracy. The two most popular approaches are ensemble learning and support vector machines/kernel machines. Ensemble learning methods such as boosting (Freund and Schapire, 1997) and random forest (Breiman, 2001) combine many weak learners like decision trees into a powerful one. The support vector machine (SVM) (Vapnik, 1995, 1998) fits an optimal separating hyperplane in the extended kernel feature space, which is non-linear in the original covariate space. In a recent extensive numerical study by Fernández-Delgado et al. (2014), the kernel SVM is shown to be one of the best among 179 commonly used classifiers.

Motivated by "data-piling" in high-dimension-low-sample-size problems, Marron et al. (2007) invented a new classification algorithm named distance weighted discrimination (DWD) that retains the elegant geometric interpretation of the SVM and delivers competitive performance. Since then much work has been devoted to the development of DWD. The readers are referred to Marron (2015) for an up-to-date list of work on DWD. On the other hand, we notice that DWD has not attained the popularity it deserves. We can think of two reasons for that.
First, the current state-of-the-art algorithm for DWD is based on second-order-cone programming (SOCP), proposed in Marron et al. (2007). SOCP was an essential part of the DWD development. As acknowledged in Marron et al. (2007), SOCP was then much less well-known than quadratic programming, even in optimization. Furthermore, SOCP is generally more computationally demanding than quadratic programming. There are two existing implementations of the SOCP algorithm: Marron (2013) in Matlab and Huang et al. (2012) in R. With these two implementations, we find that DWD is usually more time-consuming than the SVM. Therefore, SOCP contributes to both the success and the unpopularity of DWD. Second, the kernel extension of DWD and the corresponding kernel learning theory are under-developed compared to the kernel SVM. Although Marron et al. (2007) proposed a version of non-linear DWD by mimicking the kernel trick used for deriving the kernel SVM, theoretical justification of such a kernel DWD is still absent. On the contrary, the kernel SVM as well as the kernel logistic regression (Wahba et al., 1994; Zhu and Hastie, 2005) have mature theoretical understandings built upon the theory of reproducing kernel Hilbert spaces (RKHS) (Wahba, 1999; Hastie et al., 2009). Most learning theories of DWD succeed Hall et al. (2005)'s geometric view of HDLSS data and assume that p → ∞ while n is fixed, as opposed to the learning theory for the SVM, where n → ∞ and p is fixed. We are not against the fixed-n, p → ∞ theory, but it would be desirable to develop the canonical learning theory for the kernel DWD when p is fixed and n → ∞. In fact, how to establish the Bayes risk consistency of the DWD and kernel DWD was proposed as an open research problem in the original DWD paper (Marron et al., 2007). Nearly a decade later, the problem still remains open.

In this paper, we aim to resolve the aforementioned issues.
We show that the kernel DWD in an RKHS has the Bayes risk consistency property if a universal kernel is used. This result should convince those who are less familiar with DWD to treat the kernel DWD as a serious competitor to the kernel SVM. To popularize DWD, it is also important to allow practitioners to easily try DWD alongside the SVM in real applications. To this end, we develop a novel fast algorithm to solve the linear and kernel DWD by using the majorization-minimization (MM) principle. Compared with the SOCP algorithm, our new algorithm has multiple advantages. First, our algorithm is much faster than the SOCP algorithm. In some examples, our algorithm can be several hundred times faster. Second, DWD equipped with our algorithm can be faster than the SVM. Third, our algorithm is easier to understand than the SOCP algorithm, especially for those who are not familiar with semi-definite and second-order-cone programming. This could help demystify DWD and hence may increase its popularity.

To give a quick demonstration, we use a simulation example to compare the kernel DWD and the kernel SVM. We drew 10 centers {µ_{k+}} from N((1, 0)^T, I). For each data point in the positive class, we randomly picked a center µ_{k+} and then generated the point from a Gaussian distribution centered at that µ_{k+} with a common spherical covariance. The centers {µ_{k−}} of the negative class

[Figure 1, left panel: SVM with Gaussian kernel. Training error: 0.2300; test error: 0.1682; Bayes error: 0.1645.]

[Figure 1, right panel: DWD with Gaussian kernel. Training error: 0.2400; test error: 0.1678; Bayes error: 0.1645.]
Figure 1.
Nonlinear SVM and DWD with Gaussian kernels. The broken curves are the Bayes decision boundary. The R package kerndwd used 2.396 seconds to solve the kernel DWD, and kernlab took 7.244 seconds to solve the kernel SVM. The timings include tuning parameters and are averaged over 100 runs.

were drawn from N((0, 1)^T, I). For this model the Bayes rule is nonlinear.¹ Figure 1 displays the training data from the simulation model, where 100 observations are from the positive class (plotted as triangles) and another 100 observations are from the negative class (plotted as circles). We fitted the SVM and DWD using Gaussian kernels. We have implemented our new algorithm for DWD in a publicly available R package kerndwd. We computed the kernel SVM by using the R package kernlab (Karatzoglou et al., 2004). We recorded their training errors and test errors. From Figure 1, we observe that, like the kernel SVM, the kernel DWD has a test error close to the Bayes error, which is consistent with the Bayes risk consistency property of the kernel DWD established in section 4.2. Notably, the kernel DWD is about three times as fast as the kernel SVM in this example.

The rest of the paper is organized as follows. To be self-contained, we first review the SVM and DWD in section 2. We then derive the novel algorithm for DWD in section 3. We introduce the kernel DWD in a reproducing kernel Hilbert space and establish the learning theory of kernel DWD in section 4. Real data examples are given in section 5 to compare DWD and the SVM. Technical proofs are provided in the appendix.

¹ The Bayes decision boundary is a curve: {z : Σ_k exp(−‖z − µ_{k+}‖²/c) = Σ_k exp(−‖z − µ_{k−}‖²/c)}, where c > 0 is the constant determined by the common component variance.

2 Review of SVMs and DWD
2.1 The support vector machine

The introduction of the SVM usually begins with its geometric interpretation as a maximum margin classifier (Vapnik, 1995). Consider a case when two classes are separable by a hyperplane {x : f(x) = ω_0 + x^T ω = 0} such that the y_i(ω_0 + x_i^T ω) are all non-negative. Without loss of generality, we assume that ω is a unit vector, i.e., ω^T ω = 1, and we observe that each d_i ≡ y_i(ω_0 + x_i^T ω) equals the Euclidean distance between the data point x_i and the hyperplane. The reason is that d_i = (x_i − x_0)^T ω, where x_0 is any point on the hyperplane (so that ω_0 + x_0^T ω = 0) and ω is the unit normal vector. The SVM classifier is defined as the optimal separating hyperplane that maximizes the smallest distance of the data points to the separating hyperplane. Mathematically, the SVM can be written as the following optimization problem (for the separable data case):

max_{ω_0, ω} min_i d_i, subject to d_i = y_i(ω_0 + x_i^T ω) ≥ 0, ∀i, and ω^T ω = 1. (2.1)

The smallest distance min_i d_i is called the margin, and the SVM is thereby regarded as a large-margin classifier. The data points closest to the hyperplane, i.e., those with d_i = min_j d_j, are dubbed the support vectors.

In general, the two classes are not separable, and thus y_i(ω_0 + x_i^T ω) cannot be non-negative for all i = 1, …, n. To handle this issue, non-negative slack variables η_i, 1 ≤ i ≤ n, are introduced to ensure that all y_i(ω_0 + x_i^T ω) + η_i are non-negative. With these slack variables, the optimization problem (2.1) is generalized as follows:

max_{ω_0, ω} min_i d_i, subject to d_i = y_i(ω_0 + x_i^T ω) + η_i ≥ 0, η_i ≥ 0, ∀i, Σ_{i=1}^n η_i ≤ constant, and ω^T ω = 1. (2.2)

To compute SVMs, the optimization problem (2.2) is usually rephrased as an equivalent quadratic programming (QP) problem,

min_{β_0, β} [ (1/2) β^T β + c Σ_{i=1}^n ξ_i ], subject to y_i(β_0 + x_i^T β) + ξ_i ≥ 1, ξ_i ≥ 0, ∀i, (2.3)

and it can be solved by maximizing its Lagrange dual function,

max_µ [ Σ_{i=1}^n µ_i − (1/2) Σ_{i=1}^n Σ_{i′=1}^n µ_i µ_{i′} y_i y_{i′} ⟨x_i, x_{i′}⟩ ], subject to 0 ≤ µ_i ≤ c and Σ_{i=1}^n µ_i y_i = 0. (2.4)

By solving (2.4), one can show that the solution of (2.3) has the form

β̂ = Σ_{i=1}^n µ̂_i y_i x_i, and thus f̂(x) = β̂_0 + Σ_{i=1}^n µ̂_i y_i ⟨x, x_i⟩, (2.5)

with µ̂_i being non-zero only when x_i is a support vector.

One widely used method to extend the linear SVM to non-linear classifiers is the kernel method (Aizerman et al., 1964), which replaces the dot product ⟨x_i, x_{i′}⟩ in the Lagrange dual problem (2.4) with a kernel function K(x_i, x_{i′}), and hence the solution has the form

f̂(x) = β̂_0 + x^T β̂ = β̂_0 + Σ_{i=1}^n µ̂_i y_i K(x, x_i).

Some popular examples of the kernel function K include K(x, x′) = ⟨x, x′⟩ (linear kernel), K(x, x′) = (a + ⟨x, x′⟩)^d (polynomial kernel), and K(x, x′) = exp(−σ‖x − x′‖²) (Gaussian kernel), among others.

2.2 Distance weighted discrimination

Distance weighted discrimination was originally proposed by Marron et al. (2007) to resolve the data-piling issue. Marron et al. (2007) observed that many data points become support vectors when the SVM is applied to so-called high-dimension-low-sample-size (HDLSS)
[Figure 2: index plots of β̂_0 + x_i^T β̂ under the SVM (left panel, with the percentage of points that are support vectors indicated) and under DWD (right panel).]

Figure 2.
A toy example illustrating data-piling. The values β̂_0 + x_i^T β̂ are plotted for the SVM and DWD. Indices 1 to 50 represent the negative class (triangles) and indices 51 to 100 the positive class (circles). In the left panel, data points that are support vectors are depicted as solid circles and triangles.

data, and Marron et al. (2007) coined the term data-piling to describe this phenomenon. We delineate it in Figure 2 through a simulation example. Let µ = (3, 0, …, 0) be a 200-dimensional vector. We generated 50 points (indexed from 1 to 50 and represented as triangles) from N(−µ, I_p) as the negative class and another 50 points (indexed from 51 to 100 and represented as circles) from N(µ, I_p) as the positive class. We computed β̂_0 and β̂ for the SVM (2.3). In the left panel of Figure 2, we plotted β̂_0 + x_i^T β̂ for each data point, and we portrayed the support vectors by solid triangles and circles. We observe that 65 out of 100 data points become support vectors. The right panel of Figure 2 corresponds to DWD (to be defined shortly), where data-piling is attenuated. A real example revealing data-piling can be seen in Figure 1 of Ahn and Marron (2010).

Marron et al. (2007) viewed "data-piling" as a drawback of the SVM, because the SVM classifier (2.5) is a function of only the support vectors. Another popular classifier, logistic regression, does classification by using all the data points. However, the classical logistic regression classifier is derived by following the maximum likelihood principle, not from a nice margin-maximization motivation.² Marron et al. (2007) wanted to have a new method that uses all the data points while keeping the margin-based geometric motivation; this led to the standard DWD,³ which solves

min_{ω_0, ω} [ Σ_{i=1}^n 1/d_i + c Σ_{i=1}^n η_i ], subject to d_i = y_i(ω_0 + x_i^T ω) + η_i ≥ 0, η_i ≥ 0, ∀i, and ω^T ω = 1. (2.6)

There has been much work on variants of the standard DWD. We can only give an incomplete list here. Qiao et al. (2010) introduced the weighted DWD to tackle unequal costs or sample sizes by imposing different weights on the two classes. Huang et al. (2013) extended the binary DWD to the multiclass case. Wang and Zou (2015) proposed the sparse DWD for high-dimensional classification.

² Zhu and Hastie (2005) later showed that the limiting ℓ₂-penalized logistic regression approaches the margin-maximizing hyperplane for the separable data case.
³ DWD was first proposed in 2002.
In addition, work connecting DWD with other classifiers, e.g., the SVM, includes but is not limited to LUM (Liu et al., 2011), DWSVM (Qiao and Zhang, 2015a), and FLAME (Qiao and Zhang, 2015b). Marron (2015) provided a more comprehensive review of the current DWD literature.

Marron et al. (2007) solved the standard DWD by reformulating (2.6) as a second-order cone programming (SOCP) problem (Alizadeh and Goldfarb, 2004; Boyd and Vandenberghe, 2004), which has a linear objective, linear constraints, and second-order-cone constraints. Specifically, for each i, let ρ_i = (1/d_i + d_i)/2 and σ_i = (1/d_i − d_i)/2, so that ρ_i + σ_i = 1/d_i, ρ_i − σ_i = d_i, and ρ_i² − σ_i² = 1. Hence the original optimization problem (2.6) becomes

min_{ω_0, ω, ρ, σ, η} [ 1^T ρ + 1^T σ + c 1^T η ], subject to ρ − σ = ỸXω + ω_0 y + η, η_i ≥ 0, (ρ_i; σ_i, 1) ∈ S³, ∀i, and (1; ω) ∈ S^{p+1}, (2.7)

where Ỹ is an n × n diagonal matrix with the ith diagonal element y_i, X is the n × p data matrix with the ith row x_i^T, and S^{m+1} = {(ψ; φ) ∈ R^{m+1} : ψ ≥ ‖φ‖} denotes the second-order cone. After solving for ω̂_0 and ω̂ in (2.7), a new observation x_new is classified by sign(ω̂_0 + x_new^T ω̂).

2.3 Non-linear extension

Note that the kernel SVM was derived by applying the kernel trick to the dual formulation (2.5). Marron et al. (2007) followed the same approach to consider a version of kernel DWD for achieving non-linear classification. The dual of the problem (2.7) is (Marron et al., 2007)

max_α [ −√(α^T ỸXX^TỸ α) + 2·1^T √α ], subject to y^T α = 0, 0 ≤ α ≤ c·1, (2.8)

where (√α)_i = √α_i, i = 1, 2, …, n. Note that (2.8) only uses XX^T, which makes it easy to employ the kernel trick to get a nonlinear extension of the linear DWD. For a given kernel function K, define the kernel matrix by (K)_{ij} = K(x_i, x_j), 1 ≤ i, j ≤ n. Then a kernel DWD can be defined as (Marron et al., 2007)

max_α [ −√(α^T ỸKỸ α) + 2·1^T √α ], subject to y^T α = 0, 0 ≤ α ≤ c·1. (2.9)

To solve (2.9), Marron et al. (2007) used the Cholesky decomposition of the kernel matrix, i.e., K = ΦΦ^T, and then replaced the predictors X in (2.7) with Φ. Marron et al. (2007) also carefully discussed several algorithmic issues that ensure the equivalent optimality of (2.7) and (2.8).

Remark 1.
Two DWD implementations have been published thus far: a Matlab software (Marron, 2013) and an R package DWD (Huang et al., 2012). Both implementations are based on a Matlab SOCP solver SDPT3, which was developed by Tütüncü et al. (2003). We notice that the R package DWD can only compute the linear DWD.
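The Cholesky-based kernel trick described in section 2.3 (factor K = ΦΦ^T and hand Φ to a linear DWD solver) is easy to sketch. In the following Python illustration, the data, the bandwidth, and the jitter value are arbitrary choices for demonstration; this is not part of any of the implementations above.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian kernel K(x, x') = exp(-sigma * ||x - x'||^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sigma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 3))       # toy design matrix (illustrative)
K = gaussian_kernel(X, X)

# Factor K = Phi Phi^T; a tiny jitter guards against numerical rank deficiency.
Phi = np.linalg.cholesky(K + 1e-10 * np.eye(len(K)))

# A linear DWD fitted to the rows of Phi is then a kernel DWD,
# because the inner products of the rows of Phi reproduce K.
print(np.allclose(Phi @ Phi.T, K, atol=1e-8))
```

The jitter is needed because a Gram matrix can be numerically singular even when the kernel is positive definite in theory.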
Remark 2.
To the best of our knowledge, the theoretical justification for the kernel DWD in Marron et al. (2007) is still unclear. The reason is likely that the nonlinear extension is purely algorithmic. In fact, the Bayes risk consistency of DWD was proposed as an open research problem in Marron et al. (2007). The kernel DWD considered in this paper can be rigorously justified to have a universal Bayes risk consistency property; see details in section 4.2.

2.4 Generalized DWD

Marron et al. (2007) also attempted to replace the reciprocals in the DWD optimization problem (2.6) with the qth power (q > 0) of the inverse distances, and Hall et al. (2005) used this as the original definition of DWD. We name the DWD with this new formulation the generalized DWD:

min_{ω_0, ω} [ Σ_{i=1}^n 1/d_i^q + c Σ_{i=1}^n η_i ], subject to d_i = y_i(ω_0 + x_i^T ω) + η_i ≥ 0, η_i ≥ 0, ∀i, and ω^T ω = 1, (2.10)

which degenerates to the standard DWD (2.6) when q = 1.

The first asymptotic theory for DWD and generalized DWD was given in Hall et al. (2005), who presented a novel geometric representation of HDLSS data. Assume X⁺_1, X⁺_2, …, X⁺_{n₊} are the data from the positive class and X⁻_1, X⁻_2, …, X⁻_{n₋} are from the negative class. Hall et al. (2005) stated that, when the sample size n is fixed and the dimension p goes to infinity, under some regularity conditions, there exist two constants l₊ and l₋ such that for each pair of i and j,

p^{−1/2} ‖X⁺_i − X⁺_j‖ →_P √l₊ and p^{−1/2} ‖X⁻_i − X⁻_j‖ →_P √l₋, as p → ∞.

This result was applied to study several classifiers including the SVM and the generalized DWD. For ease of presentation, let us consider the equal subgroup size case, i.e., n₊ = n₋ = n/2. Hall et al. (2005) assumed that

p^{−1/2} ‖E X⁺ − E X⁻‖ → µ, as p → ∞.

The basic conclusion is that when µ is greater than a threshold that depends on l₊, l₋, and n, the misclassification error converges to zero, and when µ is less than the same threshold, the misclassification error converges to 50%. For more details, see Theorem 1 and Theorem 2 in Hall et al. (2005). Ahn et al. (2007) further relaxed the assumptions thereof.

Remark 3.
The generalized DWD has not been implemented yet, because the SOCP transformation only works for the standard DWD (q = 1) in (2.7), while its extension to handle the general case is unclear, if not impossible. That is why the current DWD literature only focuses on DWD with q = 1. In fact, the generalized DWD with q ≠ 1 was proposed as an open research problem in Marron et al. (2007). The new algorithm proposed in this paper can easily solve the generalized DWD problem for any q > 0; see section 3.
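Before moving on, the HDLSS geometric representation of Hall et al. (2005) quoted above is easy to observe in simulation. In the sketch below, the coordinates are i.i.d. standard Gaussian, for which l₊ = 2, so the scaled pairwise distances concentrate near √2; the sample size and dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10                                  # sample size stays fixed ...
for p in (100, 10_000):                 # ... while the dimension grows
    X = rng.standard_normal((n, p))
    dists = np.array([np.linalg.norm(X[i] - X[j]) / np.sqrt(p)
                      for i in range(n) for j in range(i + 1, n)])
    # All scaled pairwise distances tighten around sqrt(2) as p increases.
    print(p, dists.min(), dists.max())
```

The spread of the printed minima and maxima around √2 shrinks roughly at the rate p^{−1/2}, which is the concentration behind the Hall et al. (2005) representation.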
3 A Novel Algorithm for DWD
Marron et al. (2007) originally solved the standard DWD by transforming (2.6) into an SOCP problem. This algorithm, however, cannot compute the generalized DWD (2.10) with q ≠ 1. In this section, we propose an entirely different algorithm based on the majorization-minimization (MM) principle. Our new algorithm offers a unified solution to the standard DWD and the generalized DWD.

Our algorithm begins with a loss + penalty formulation of the DWD; Lemma 1 presents the result. Note that the loss function also lays the foundation of the kernel DWD learning theory that will be discussed in section 4.

Lemma 1.
The generalized DWD classifier in (2.10) can be written as sign(β̂_0 + x^T β̂), where (β̂_0, β̂) is computed from

min_{β_0, β} C(β_0, β) ≡ min_{β_0, β} [ (1/n) Σ_{i=1}^n V_q(y_i(β_0 + x_i^T β)) + λ β^T β ], (3.1)

for some λ > 0, where

V_q(u) = 1 − u, if u ≤ q/(q+1); (1/u^q) · q^q/(q+1)^{q+1}, if u > q/(q+1). (3.2)

Remark 4.
The proof of Lemma 1 provides the one-to-one mapping between λ in (3.1) and c in (2.10). Write (β̂_0(λ), β̂(λ)) for the solution to (3.1), and define

c(λ) = ((q+1)^{q+1}/q^q) ‖β̂(λ)‖^{q+1}.

Considering (2.10) with c = c(λ),

(ω̂_0, ω̂) = argmin_{ω_0, ω} [ Σ_{i=1}^n 1/d_i^q + c(λ) Σ_{i=1}^n η_i ], subject to d_i = y_i(ω_0 + x_i^T ω) + η_i ≥ 0, η_i ≥ 0, ∀i, and ω^T ω = 1, (3.3)

we have ω̂_0 = β̂_0(λ)/‖β̂(λ)‖ and ω̂ = β̂(λ)/‖β̂(λ)‖. Note that sign(ω̂_0 + x_i^T ω̂) = sign(β̂_0(λ) + x_i^T β̂(λ)), which means that the generalized DWD classifier defined by (3.3) is equivalent to the generalized DWD classifier defined by (3.1).

By Lemma 1, we call V_q(·) the generalized DWD loss. It is visualized in Figure 3. We observe that the generalized DWD loss decreases as q increases, and it approaches the SVM hinge loss as q → ∞. When q = 1, the generalized DWD loss becomes

V₁(u) = 1 − u, if u ≤ 1/2; 1/(4u), if u > 1/2.

We notice that V₁(u) has appeared in the literature (Qiao et al., 2010; Liu et al., 2011). In this work we give a unified treatment of all q values, not just q = 1.

[Figure 3: the generalized DWD loss functions V_q(u) for several values of q, together with the SVM hinge loss, plotted against u.]

Figure 3.
Top to bottom are the DWD loss functions with q = 0.5, 1, 4, 8, and the SVM hinge loss.

We now show how to develop the new algorithm by using the MM principle (De Leeuw and Heiser, 1977; Lange et al., 2000; Hunter and Lange, 2004). Some recent successful applications of the MM principle can be seen in Hunter and Li (2005); Wu and Lange (2008); Zou and Li (2008); Zhou and Lange (2010); Yang and Zou (2013); Lange and Zhou (2014), among others. The main idea of the MM principle is easy to understand. Suppose θ = (β_0, β^T)^T and we aim to minimize C(θ) defined in (3.1). The MM principle finds a majorization function D(θ | θ^(k)) satisfying C(θ) < D(θ | θ^(k)) for any θ ≠ θ^(k) and C(θ^(k)) = D(θ^(k) | θ^(k)), and then generates a sequence {C(θ^(k))}_{k=1}^∞ via the update θ^(k+1) = argmin_θ D(θ | θ^(k)).

We first expose some properties of the generalized DWD loss function, which give rise to a quadratic majorization function of C(θ). The generalized DWD loss is differentiable everywhere; its first-order derivative is given below:

V′_q(u) = −1, if u ≤ q/(q+1); −(1/u^{q+1}) · (q/(q+1))^{q+1}, if u > q/(q+1). (3.4)

Lemma 2.
The generalized DWD loss function V_q(·) has a Lipschitz continuous derivative,

|V′_q(t) − V′_q(t̃)| < M |t − t̃| for any t ≠ t̃, (3.5)

which further implies a quadratic majorization of V_q(·):

V_q(t) < V_q(t̃) + V′_q(t̃)(t − t̃) + (M/2)(t − t̃)² (3.6)

for any t ≠ t̃, where M = (q+1)²/q.

Denote the current solution by θ̃ = (β̃_0, β̃^T)^T and the updated solution by θ = (β_0, β^T)^T. We write C(θ) = C(β_0, β) and D(θ | θ̃) = D(β_0, β) without abusing notation. We have that for any (β_0, β) ≠ (β̃_0, β̃),

C(β_0, β) ≡ (1/n) Σ_{i=1}^n V_q(y_i(β_0 + x_i^T β)) + λ β^T β
< (1/n) Σ_{i=1}^n V_q(y_i(β̃_0 + x_i^T β̃)) + (1/n) Σ_{i=1}^n V′_q(y_i(β̃_0 + x_i^T β̃)) [y_i(β_0 − β̃_0) + y_i x_i^T(β − β̃)] + (M/2n) Σ_{i=1}^n [y_i(β_0 − β̃_0) + y_i x_i^T(β − β̃)]² + λ β^T β
≡ D(β_0, β). (3.7)

We now find the minimizer of D(β_0, β). The gradients of D(β_0, β) are given as follows:

∂D(β_0, β)/∂β = (1/n) Σ_{i=1}^n V′_q(y_i(β̃_0 + x_i^T β̃)) y_i x_i + (M/n) Σ_{i=1}^n [(β_0 − β̃_0) + x_i^T(β − β̃)] x_i + 2λβ
= X^T z + (M/n)(β_0 − β̃_0) X^T 1 + ((M/n) X^T X + 2λ I_p)(β − β̃) + 2λβ̃, (3.8)

∂D(β_0, β)/∂β_0 = (1/n) Σ_{i=1}^n V′_q(y_i(β̃_0 + x_i^T β̃)) y_i + (M/n) Σ_{i=1}^n [(β_0 − β̃_0) + x_i^T(β − β̃)]
= 1^T z + M(β_0 − β̃_0) + (M/n) 1^T X(β − β̃), (3.9)

where X is the n × p data matrix with the ith row x_i^T, z is an n × 1 vector with the ith element y_i V′_q(y_i(β̃_0 + x_i^T β̃))/n, and 1 ∈ R^n is the vector of ones.
Setting [∂D(β_0, β)/∂β_0, ∂D(β_0, β)/∂β] to zero, we obtain the minimizer of D(β_0, β):

(β_0; β) = (β̃_0; β̃) − (n/M) ( n , 1^T X ; X^T 1 , X^T X + (2nλ/M) I_p )^{−1} ( 1^T z ; X^T z + 2λβ̃ ). (3.10)

So far we have completed all the steps of the MM algorithm. Details are summarized in Algorithm 1. We have implemented Algorithm 1 in an R package kerndwd, which is publicly available for download on CRAN.

In this section, we show the superior computational performance of our R implementation, kerndwd, over the two existing implementations, the R package
DWD (Huang et al., 2012) and the Matlab software (Marron, 2013). To avoid confusion, we henceforth use OURS, HUANG, and MARRON to denote kerndwd, DWD, and the Matlab implementation, respectively. Since
HUANG is incapable of non-linear kernels and the generalized DWD with q ≠ 1, we only attend to the linear DWD with q fixed to be one. All experiments were conducted on an Intel Core i5 M560 (2.67 GHz) processor.

For a fair comparison, we study the four numerical examples used in Marron et al. (2007), except with different sample sizes and dimensions. In each example, we generate a data set

Algorithm 1 Linear generalized DWD
  Initialize (β̃_0, β̃^T).
  for each λ do
    Compute P(λ)^{−1} = ( n , 1^T X ; X^T 1 , X^T X + (2nλq/(q+1)²) I_p )^{−1}.
    repeat
      Compute z = (z_1, …, z_n)^T with z_i = y_i V′_q(y_i(β̃_0 + x_i^T β̃))/n.
      Update (β_0; β) ← (β̃_0; β̃) − (nq/(q+1)²) P(λ)^{−1} ( 1^T z ; X^T z + 2λβ̃ ).
      Set (β̃_0, β̃^T) = (β_0, β^T).
    until the convergence condition is met
  end for

with sample size n = 500 and dimension p = 50. The responses are always binary; one half of the data have responses +1 and the other half have −
1. Data in example 1 are generated from Gaussian distributions whose class means differ in the first coordinate, with an identity covariance matrix for both classes; examples 2 and 3 use Gaussian data whose class means differ in more coordinates, as in Marron et al. (2007); in example 4, the first coordinate is ±100 and the remaining coordinates are 0.09 times standard Gaussian, and for both classes the last 25 coordinates are just the squares of the first 25.

In each example, we fitted a linear DWD with five different tuning parameter values λ. Given each solution (β̂_0, β̂), we computed (ω̂_0, ω̂) and the constant c in (2.7) by using Remark 4. We then used HUANG and
MARRON to compute their solutions. Note that in theory all three implementations should yield identical (ω̂_0, ω̂).⁴ From Table 1 we observe that OURS took remarkably less computation time than HUANG and MARRON. In example 1, for instance, OURS spent only 0.012 seconds on average to fit a DWD model, while HUANG used 14.525 seconds and MARRON took 2.204 seconds, which were 1210 and 183 times larger, respectively. In all four examples, the timings of OURS were more than 700 times faster than the existing R implementation HUANG, and also more than 70 times faster than the Matlab implementation MARRON.

Table 1.
Timing comparisons among the R package kerndwd (denoted as OURS), the R package DWD (denoted as HUANG), and the Matlab implementation (denoted as MARRON). All the timings are averaged over 100 independent replicates.
[Table 1 reports, for each example, the timings (in sec.) of OURS, HUANG, and MARRON, together with the ratios t(HUANG)/t(OURS) and t(MARRON)/t(OURS).]

4 Kernel DWD in Reproducing Kernel Hilbert Spaces

The kernel SVM can be derived by using the kernel trick or by using the view of non-parametric function estimation in a reproducing kernel Hilbert space (RKHS). Much of the theoretical work on the kernel SVM is based on the RKHS formulation of the SVM. The derivation of the kernel SVM in an RKHS is given in Hastie et al. (2009). We take a similar approach to derive the kernel DWD, as our goal is to establish the kernel learning theory for DWD.

4.1 Kernel DWD

Consider H_K, a reproducing kernel Hilbert space generated by the kernel function K. Mercer's theorem ensures that K has an eigen-expansion K(x, x′) = Σ_{t=1}^∞ γ_t φ_t(x) φ_t(x′), with γ_t ≥ 0 and Σ_{t=1}^∞ γ_t < ∞. The Hilbert space H_K is then defined as the collection of functions h(x) = Σ_{t=1}^∞ θ_t φ_t(x), for any θ_t such that Σ_{t=1}^∞ θ_t²/γ_t < ∞, with the inner product ⟨Σ_{t=1}^∞ θ_t φ_t(x), Σ_{t′=1}^∞ δ_{t′} φ_{t′}(x)⟩_{H_K} = Σ_{t=1}^∞ θ_t δ_t/γ_t.

Given H_K, let the non-linear DWD be written as sign(β̂_0 + ĥ(x)), where (β̂_0, ĥ) is the

⁴ We also checked the quality of the computed solutions by these different algorithms. In theory they should be identical. In practice, due to machine errors and implementations, they could be different. We found that in all examples our new algorithm gave better solutions in the sense that the objective function in (2.7) has the smallest value.
HUANG and
MARRON gave similar but slightly larger objective function values.

min_{h ∈ H_K, β_0 ∈ R} [ (1/n) Σ_{i=1}^n V_q(y_i(β_0 + h(x_i))) + λ ‖h‖²_{H_K} ], (4.1)

where V_q(·) is the generalized DWD loss (3.2). The representer theorem concludes that the solution of (4.1) has a finite expansion in terms of K(x, x_i) (Wahba, 1990):

ĥ(x) = Σ_{i=1}^n α̂_i K(x, x_i), and thus ‖ĥ‖²_{H_K} = Σ_{i=1}^n Σ_{j=1}^n α̂_i α̂_j K(x_i, x_j).

Consequently, (4.1) can be rephrased in matrix notation:

min_{β_0, α} C_K(β_0, α) ≡ min_{β_0, α} [ (1/n) Σ_{i=1}^n V_q(y_i(β_0 + K_i^T α)) + λ α^T K α ], (4.2)

where K is the kernel matrix with the (i, j)th element K(x_i, x_j) and K_i is the ith column of K.

Remark 5.
We can compare (4.2) to the kernel SVM (Hastie et al., 2009),

min_{β_0, α} [ (1/n) Σ_{i=1}^n [1 − y_i(β_0 + K_i^T α)]₊ + λ α^T K α ], (4.3)

where [1 − t]₊ is the hinge loss underlying the SVM. As shown in Figure 3, the generalized DWD loss takes the hinge loss as its limit when q → ∞. In general, the generalized DWD loss and the hinge loss look very similar, which suggests that the kernel DWD and the kernel SVM equipped with the same kernel have similar statistical behavior.

The procedure for deriving Algorithm 1 for the linear DWD can be directly adopted to derive an efficient algorithm for solving the kernel DWD. We obtain the majorization function

D_K(β_0, α) = (1/n) Σ_{i=1}^n V_q(y_i(β̃_0 + K_i^T α̃)) + (1/n) Σ_{i=1}^n V′_q(y_i(β̃_0 + K_i^T α̃)) [y_i(β_0 − β̃_0) + y_i K_i^T(α − α̃)] + (M/2n) Σ_{i=1}^n [y_i(β_0 − β̃_0) + y_i K_i^T(α − α̃)]² + λ α^T K α,

and then find the minimizer of D_K(β_0, α), which has a closed-form expression. We omit the details here for space considerations. Algorithm 2 summarizes the entire procedure for the kernel DWD.

Algorithm 2 Kernel DWD
  Initialize (β̃_0, α̃^T).
  for each λ do
    Compute P(λ)^{−1} = ( n , 1^T K ; K1 , KK + (2nλq/(q+1)²) K )^{−1}.
    repeat
      Compute z = (z_1, …, z_n)^T with z_i = y_i V′_q(y_i(β̃_0 + K_i^T α̃))/n.
      Update (β_0; α) ← (β̃_0; α̃) − (nq/(q+1)²) P(λ)^{−1} ( 1^T z ; Kz + 2λKα̃ ).
      Set (β̃_0, α̃^T) = (β_0, α^T).
    until the convergence condition is met
  end for

4.2 Bayes risk consistency

Lin (2002) formulated the kernel SVM as a non-parametric function estimation problem in a reproducing kernel Hilbert space and showed that the population minimizer of the SVM loss function is the Bayes rule, indicating that the SVM directly approximates the optimal Bayes classifier. Lin (2004) further coined the name "Fisher consistency" to describe such a result.
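Lin (2002)'s population-level statement is easy to check numerically: at a point x with η = P(Y = 1 | X = x), minimizing the conditional hinge risk η[1 − f]₊ + (1 − η)[1 + f]₊ over a grid of f recovers a minimizer whose sign equals sign(η − 1/2). A small illustrative sketch (the grid and η values are arbitrary):

```python
import numpy as np

hinge = lambda t: np.maximum(0.0, 1.0 - t)
f_grid = np.linspace(-3, 3, 1201)

# Conditional risk of the hinge loss at a point with P(Y=1|x) = eta:
# R(f) = eta * hinge(f) + (1 - eta) * hinge(-f); its minimizer is the Bayes rule.
for eta in (0.1, 0.3, 0.7, 0.9):
    risk = eta * hinge(f_grid) + (1 - eta) * hinge(-f_grid)
    f_star = f_grid[np.argmin(risk)]
    print(eta, np.sign(f_star))
```

The same grid-search check applied to the generalized DWD loss recovers a minimizer of the form given in Lemma 3 below, which is the Fisher consistency property of the DWD family.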
The Vapnik-Chervonenkis (VC) analysis (Vapnik, 1998; Anthony and Bartlett, 1999) and the margin analysis (Bartlett and Shawe-Taylor, 1999; Shawe-Taylor and Cristianini, 2000) have been used to bound the expected classification error of the SVM. Zhang (2004) used the so-called leave-one-out analysis (Jaakkola and Haussler, 1999) to study a class of kernel machines. The existing theoretical work on the kernel SVM provides us a nice road map to study the kernel DWD. In this section we first elucidate the Fisher consistency (Lin, 2004) of the generalized kernel DWD, and we then establish the Bayes risk consistency of the kernel DWD when a universal kernel is employed.

Let $\eta(x)$ denote the conditional probability $P(Y = 1 \mid X = x)$. Under the 0-1 loss, the theoretical optimal Bayes rule is $f^\star(x) = \mathrm{sign}(\eta(x) - 1/2)$. We assume that $\eta(x)$ is a measurable function and $P(\eta(x) = 1/2) = 0$ throughout.
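For concreteness, the generalized DWD loss and the hinge loss can be implemented in a few lines of Python (a sketch; the names `dwd_loss` and `hinge_loss` are ours). Note that $V_q$ is continuous at $u = q/(q+1)$, where both branches equal $1/(q+1)$, and that it approaches the hinge loss as $q$ grows.

```python
import numpy as np

def dwd_loss(u, q=1.0):
    """Generalized DWD loss V_q(u) from (3.2):
    1 - u                     for u <= q/(q+1),
    (1/u^q) q^q/(q+1)^(q+1)   for u >  q/(q+1)."""
    t = q / (q + 1.0)                        # boundary between the two branches
    c = q**q / (q + 1.0)**(q + 1.0)
    u = np.asarray(u, dtype=float)
    # np.maximum guards the power against non-positive u on the unused branch
    return np.where(u <= t, 1.0 - u, c / np.maximum(u, t)**q)

def hinge_loss(u):
    """SVM hinge loss [1 - u]_+ for comparison."""
    return np.maximum(0.0, 1.0 - np.asarray(u, dtype=float))
```

For example, with $q = 1$ both branches meet at $u = 1/2$ with value $1/2$, while for moderately large $q$ the two losses are numerically indistinguishable away from the kink.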
Lemma 3.
The population minimizer of the expected generalized DWD loss $E_{XY}[V_q(Yf(X))]$ is
$$ \tilde f(x) = \frac{q}{q+1}\left[ \left(\frac{\eta(x)}{1-\eta(x)}\right)^{\frac{1}{q+1}} I\big(\eta(x) > 1/2\big) - \left(\frac{1-\eta(x)}{\eta(x)}\right)^{\frac{1}{q+1}} I\big(\eta(x) < 1/2\big) \right], \qquad (4.4) $$
where $I(\cdot)$ is the indicator function. The population minimizer $\tilde f(x)$ has the same sign as $\eta(x) - 1/2$.

Fisher consistency is a property of the loss function. The interpretation is that the generalized DWD can approach the Bayes rule given infinitely many samples. We note that Fisher consistency of $V_1(u)$ has been shown before (Qiao et al., 2010; Liu et al., 2011). In reality, all classifiers are estimated from a finite sample. Thus, a more refined analysis of the actual DWD classifier is needed, and that is what we achieve in the following. Following the convention in the literature, we absorb the intercept into $h$ and present the kernel DWD as follows:
$$ \hat f_n = \operatorname*{argmin}_{f \in \mathcal{H}_K} \left[ \frac{1}{n}\sum_{i=1}^n V_q\big(y_i f(x_i)\big) + \lambda_n \|f\|^2_{\mathcal{H}_K} \right]. \qquad (4.5) $$
The ultimate goal is to show that the misclassification error of the kernel DWD approaches the Bayes error rate, so that we can say the kernel DWD classifier works as well as the Bayes rule (asymptotically speaking). Following Zhang (2004), we derive the following lemma.

Lemma 4.
For a discrimination function $f$, we define $R(f) = P\big(Y \ne \mathrm{sign}(f(X))\big)$. Assume that $f^\star = \operatorname{argmin}_f R(f)$ is the Bayes rule and $\hat f_n$ is the solution of (4.5). Then
$$ R(\hat f_n) - R(f^\star) \le \frac{q+1}{q}\,(\varepsilon_A + \varepsilon_E), \qquad (4.6) $$
where $\varepsilon_A$ and $\varepsilon_E$ are defined as follows and $V_q$ is the generalized DWD loss:
$$ \varepsilon_A = \inf_{f \in \mathcal{H}_K} E_{XY}\big[V_q(Yf(X))\big] - E_{XY}\big[V_q\big(Y\tilde f(X)\big)\big], $$
$$ \varepsilon_E = \varepsilon_E(\hat f_n) = E_{XY}\big[V_q\big(Y\hat f_n(X)\big)\big] - \inf_{f \in \mathcal{H}_K} E_{XY}\big[V_q(Yf(X))\big]. \qquad (4.7) $$

In the above lemma $R(f^\star)$ is the Bayes error rate and $R(\hat f_n)$ is the misclassification error of the kernel DWD applied to new data points. If $R(\hat f_n) \to R(f^\star)$, we say the classifier is Bayes risk consistent. Based on Lemma 4, it suffices to show that both $\varepsilon_A$ and $\varepsilon_E$ approach zero in order to demonstrate the Bayes risk consistency of the kernel DWD. Note that $\varepsilon_A$ is deterministic and is called the approximation error. If the RKHS is rich enough, the approximation error can be made arbitrarily small. In the literature, the notion of a universal kernel (Steinwart, 2001; Micchelli et al., 2006) has been proposed and studied. Suppose $\mathcal{X} \subset \mathbb{R}^p$ is the compact input space of $X$ and $C(\mathcal{X})$ is the space of all continuous functions $g: \mathcal{X} \to \mathbb{R}$. The kernel $K$ is said to be universal if the function space $\mathcal{H}_K$ generated by $K$ is dense in $C(\mathcal{X})$; that is, for any positive $\epsilon$ and any function $g \in C(\mathcal{X})$, there exists an $f \in \mathcal{H}_K$ such that $\|f - g\|_\infty < \epsilon$.

Theorem 1.
Suppose $\hat f_n$ is the solution of (4.5), $\mathcal{H}_K$ is induced by a universal kernel $K$, and the sample space $\mathcal{X}$ is compact. Then we have:
(1) $\varepsilon_A = 0$;
(2) Let $B = \sup_{x} K(x, x) < \infty$. When $\lambda_n \to 0$ and $n\lambda_n \to \infty$, for any $\epsilon > 0$,
$$ \lim_{n \to \infty} P\big(\varepsilon_E(\hat f_n) > \epsilon\big) = 0. $$
By (1), (2), and (4.6), we have $R(\hat f_n) \to R(f^\star)$ in probability.

The Gaussian kernel is universal and satisfies $B = 1$. Thus Theorem 1 says that the kernel DWD using the Gaussian kernel is Bayes risk consistent. This offers a theoretical explanation for the numerical results in Figure 1.
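Lemma 3 can also be checked numerically: for a fixed $\eta$, a grid search over $f$ should locate the minimizer (4.4) of $\zeta(f) = \eta\,V_q(f) + (1-\eta)\,V_q(-f)$, and the sign of that minimizer should agree with the Bayes rule. A small Python sketch (the helper names `vq` and `f_tilde` are ours):

```python
import numpy as np

def vq(u, q):
    # generalized DWD loss V_q from (3.2)
    t = q / (q + 1.0)
    c = q**q / (q + 1.0)**(q + 1.0)
    return np.where(u <= t, 1.0 - u, c / np.maximum(u, t)**q)

def f_tilde(eta, q):
    # population minimizer (4.4); assumes eta != 1/2
    ratio = np.where(eta > 0.5, eta / (1.0 - eta), (1.0 - eta) / eta)
    return np.sign(eta - 0.5) * q / (q + 1.0) * ratio**(1.0 / (q + 1.0))

# brute-force minimization of the conditional risk zeta(f) for one eta
q, eta = 3.0, 0.8
grid = np.linspace(-5.0, 5.0, 200001)
zeta = eta * vq(grid, q) + (1.0 - eta) * vq(-grid, q)
f_hat = grid[np.argmin(zeta)]   # numerically close to f_tilde(0.8, 3.0)
```

Here $\tilde f(0.8) = \tfrac{3}{4}\,4^{1/4} \approx 1.06$, and the grid minimizer agrees with the closed form to the grid resolution; repeating with $\eta < 1/2$ gives the negative mirror image, consistent with $\mathrm{sign}(\tilde f) = \mathrm{sign}(\eta - 1/2)$.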
Real Data Analysis
In this section, we investigate the performance of kerndwd on four benchmark data sets: the BUPA liver disorder data, the Haberman's survival data, the Connectionist Bench (sonar, mines vs. rocks) data, and the vertebral column data. All the data sets were obtained from the UCI Machine Learning Repository (Lichman, 2013).

For comparison purposes, we considered the SVM, the standard DWD ($q = 1$), and the generalized DWD models with $q = 0.5$, $4$, and $8$. We computed all DWD models using our R package kerndwd and solved the SVM using the R package kernlab (Karatzoglou et al., 2004). We randomly split each data set into a training and a test set with a ratio of 2:1. For each method using the linear kernel, we conducted five-fold cross-validation on the training set to tune $\lambda$. For each method using Gaussian kernels, the pair $(\sigma, \lambda)$ was tuned by five-fold cross-validation. We then fitted each model with the selected $\lambda$ and evaluated its prediction accuracy on the test set.

Table 2 displays the average timings and mis-classification rates. We do not argue that either the SVM or DWD outperforms the other; the two models are highly comparable. SVM models work better on the sonar and vertebral data, and DWD performs better on the bupa and haberman data. For three out of the four data sets, the best method uses a Gaussian kernel, indicating that linear classifiers may not be adequate in such cases. In terms of timing, kerndwd runs faster than kernlab in all these examples. It is also interesting to see that DWD with $q = 0.5$ performs slightly better than DWD with $q = 1$ on the bupa and haberman data, although the difference is not significant.

In this paper we have developed a new algorithm for solving the linear generalized DWD and the kernel generalized DWD. Compared with the current state-of-the-art algorithm for solving the linear DWD, our new algorithm is easier to understand, more general, and much more efficient. DWD equipped with the new algorithm can be computationally more efficient than the SVM. We have established the statistical learning theory of the kernel generalized DWD, showing that the kernel DWD and the kernel SVM are comparable in theory. Our theoretical analysis and algorithm do not suggest that DWD with $q = 1$ has any special merit compared with the other members of the generalized DWD family. Numerical examples further support our theoretical conclusions. DWD with $q = 1$ is called the standard DWD purely
Table 2. The mis-classification rates and timings (in seconds) for four benchmark data sets. Each data set was split into a training and a test set. On the training set, the tuning parameters were selected by five-fold cross-validation and the models were fitted accordingly. The mis-classification rates were assessed on the test sets. All the timings include the time spent tuning parameters. For each data set, the method with the best prediction accuracy is marked by black boxes.
                   Bupa              Haberman          Sonar             Vertebral
                   n=345, p=6        n=305, p=3        n=208, p=60       n=310, p=6
                   error (%)   time  error (%)   time  error (%)   time  error (%)   time
Linear kernel
  SVM              31.63 (0.50) 17.47  26.97 (0.53) 11.74  25.97 (0.66) 8.01  — (0.42)     8.07
  DWD q = 1        34.82 (0.75)  0.05  26.71 (0.54)  0.03  25.65 (0.75) 0.30  16.76 (0.53) 0.07
  DWD q = 0.5      —             —     —             —     —            —     —            —
  DWD q = 4        35.08 (0.71)  0.05  26.69 (0.55)  0.03  26.00 (0.76) 0.32  16.54 (0.53) 0.06
  DWD q = 8        35.08 (0.76)  0.06  26.53 (0.56)  0.03  25.97 (0.71) 0.34  17.01 (0.53) 0.06
Gaussian kernel
  SVM              32.23 (0.48)  6.57  27.92 (0.61)  6.00  — (0.56)     8.96  16.50 (0.46) 6.07
  DWD q = 1        32.14 (0.63)  2.83  26.46 (0.57)  2.03  20.67 (0.76) 0.83  17.57 (0.49) 2.23
  DWD q = 0.5      — (0.61)      2.80  — (0.58)      2.06  21.42 (0.79) 0.84  17.59 (0.56) 2.27
  DWD q = 4        31.63 (0.61)  3.05  26.42 (0.57)  2.08  20.26 (0.76) 0.91  17.15 (0.50) 2.28
  DWD q = 8        32.07 (0.57)  3.28  26.53 (0.56)  2.21  20.00 (0.67) 0.98  16.93 (0.50) 2.39

Standard errors are given in parentheses; — indicates a missing entry.

due to the fact that it, and not the other generalized DWDs, could be solved by SOCP when the DWD idea was first proposed. Now with our new algorithm and theory, practitioners have the option to explore different DWD classifiers.

In the present paper we have considered the standard classification problem under the 0-1 loss. In many applications we may face so-called non-standard classification problems. For example, the observed data may be collected via biased sampling, and/or we may need to consider unequal costs for different types of mis-classification. Qiao et al. (2010) introduced a weighted DWD to handle the non-standard classification problem, which follows the treatment of the non-standard SVM in Lin et al. (2002). Qiao et al.
(2010) defined the weighted DWD as follows:
$$ \min_{\beta_0, \beta} \left[ \sum_{i=1}^n w(y_i)\left( \frac{1}{r_i} + c\,\xi_i \right) \right], \quad \text{subject to } r_i = y_i(\beta_0 + x_i^T\beta) + \xi_i,\ r_i \ge 0,\ \xi_i \ge 0,\ \beta^T\beta = 1, \qquad (6.1) $$
which can be further generalized to the weighted kernel DWD:
$$ \min_{\beta_0, \alpha} C_w(\beta_0, \alpha) \equiv \min_{\beta_0, \alpha} \left[ \frac{1}{n}\sum_{i=1}^n w(y_i)\, V_q\big(y_i(\beta_0 + K_i^T\alpha)\big) + \lambda\, \alpha^T K \alpha \right]. \qquad (6.2) $$
Qiao et al. (2010) gave the expressions for $w(y_i)$ for various non-standard classification problems. Qiao et al. (2010) solved the weighted DWD with $q = 1$ in (6.1) by second-order-cone programming. The MM procedure in Algorithm 1 and Algorithm 2 can easily accommodate the weight factors $w(y_i)$ to solve the weighted DWD and the weighted kernel DWD. We have implemented the weighted DWD in the R package kerndwd.

Appendix: technical proofs
Proof of Lemma 1
Write $v_i = y_i(\omega_0 + x_i^T\omega)$ and $G(\eta_i) = 1/(v_i + \eta_i)^q + c\,\eta_i$. The objective function of (2.10) can be written as $\sum_{i=1}^n G(\eta_i)$. We next minimize (2.10) over $\eta_i$ for each fixed $i$ by computing the first-order and second-order derivatives of $G(\eta_i)$:
$$ G'(\eta_i) = -\frac{q}{(v_i + \eta_i)^{q+1}} + c = 0 \ \Rightarrow\ v_i + \eta_i = \Big(\frac{q}{c}\Big)^{\frac{1}{q+1}}, \qquad G''(\eta_i) = \frac{q(q+1)}{(v_i + \eta_i)^{q+2}} > 0. $$
If $v_i > (q/c)^{1/(q+1)}$, then $G'(\eta_i) > 0$ for all $\eta_i \ge 0$, and $\eta_i^\star = 0$ is the minimizer. If $v_i \le (q/c)^{1/(q+1)}$, then $\eta_i^\star = (q/c)^{1/(q+1)} - v_i$ is the minimizer, as $G'(\eta_i^\star) = 0$ and $G''(\eta_i^\star) > 0$. Plugging $\eta_i^\star$ into $\sum_{i=1}^n G(\eta_i)$, we obtain
$$ \min_{\omega_0, \omega} \sum_{i=1}^n \tilde V_q\big(y_i(\omega_0 + x_i^T\omega)\big), \quad \text{subject to } \omega^T\omega = 1, \qquad (6.3) $$
where
$$ \tilde V_q(v) = \begin{cases} \big(\frac{q}{c}\big)^{-\frac{q}{q+1}} + c\big(\frac{q}{c}\big)^{\frac{1}{q+1}} - cv, & \text{if } v \le \big(\frac{q}{c}\big)^{\frac{1}{q+1}}, \\[4pt] \dfrac{1}{v^q}, & \text{if } v > \big(\frac{q}{c}\big)^{\frac{1}{q+1}}. \end{cases} $$
We now simplify (6.3). Let $t_1 = \frac{q}{q+1}\big(\frac{q}{c}\big)^{-\frac{1}{q+1}}$ and $t_2 = \frac{1}{q+1}\big(\frac{q}{c}\big)^{\frac{q}{q+1}}$, and define $V_q(u) = t_2 \cdot \tilde V_q(u/t_1)$ for each $q$:
$$ V_q(u) = \begin{cases} 1 - u, & \text{if } u \le \frac{q}{q+1}, \\[4pt] \dfrac{1}{u^q}\,\dfrac{q^q}{(q+1)^{q+1}}, & \text{if } u > \frac{q}{q+1}. \end{cases} $$
By setting $\beta_0 = t_1\,\omega_0$ and $\beta = t_1\,\omega$, we find that (6.3) becomes
$$ \min_{\beta_0, \beta} \sum_{i=1}^n V_q\big(y_i(\beta_0 + x_i^T\beta)\big), \quad \text{subject to } \beta^T\beta = t_1^2, $$
which can be further transformed to (3.1), with $\lambda$ and $t_1$ in one-to-one correspondence.

Proof of Lemma 2
We first prove (3.5). We observe that
$$ 0 < V_q''(u) = \frac{q^{q+1}}{(q+1)^q}\,\frac{1}{u^{q+2}} < \frac{(q+1)^2}{q}, \quad \text{for any } u > \frac{q}{q+1}. $$
Also, $V_q'(u)$ is continuous on $[\frac{q}{q+1}, \infty)$ and differentiable on $(\frac{q}{q+1}, \infty)$.

If both $u_1$ and $u_2 > \frac{q}{q+1}$, the mean value theorem implies that there exists $u^{\star\star} > \frac{q}{q+1}$ such that
$$ \frac{|V_q'(u_1) - V_q'(u_2)|}{|u_1 - u_2|} = |V_q''(u^{\star\star})| < \frac{(q+1)^2}{q}. \qquad (6.4) $$
If $u_1 > \frac{q}{q+1}$ and $u_2 \le \frac{q}{q+1}$, then $V_q'(u_2) = V_q'\big(\frac{q}{q+1}\big) = -1$. The mean value theorem implies that there exists $u^{\star\star} > \frac{q}{q+1}$ satisfying
$$ \frac{|V_q'(u_1) - V_q'(u_2)|}{|u_1 - u_2|} \le \frac{\big|V_q'(u_1) - V_q'\big(\frac{q}{q+1}\big)\big|}{\big|u_1 - \frac{q}{q+1}\big|} = |V_q''(u^{\star\star})| < \frac{(q+1)^2}{q}. \qquad (6.5) $$
If both $u_1$ and $u_2 \le \frac{q}{q+1}$, then $V_q'(u_1) = V_q'(u_2) = -1$, and it is trivial that
$$ \frac{|V_q'(u_1) - V_q'(u_2)|}{|u_1 - u_2|} = 0 < \frac{(q+1)^2}{q}. \qquad (6.6) $$
By (6.4), (6.5), and (6.6), we prove (3.5).

We now prove (3.6). Let $\nu(a) \equiv \frac{(q+1)^2}{2q}\,a^2 - V_q(a)$. From (3.5), it is not hard to show that $\nu'(a) = \frac{(q+1)^2}{q}\,a - V_q'(a)$ is strictly increasing. Therefore $\nu(a)$ is a strictly convex function, and its first-order condition, $\nu(t) > \nu(\tilde t) + \nu'(\tilde t)(t - \tilde t)$, verifies (3.6) directly.

Proof of Lemma 3

Since $\eta(x) = P(Y = 1 \mid X = x)$, we have $E_{XY}[V_q(Yf(X))] \equiv E_X\,\zeta(f(X))$, where
$$ \zeta(f(x)) \equiv \eta(x)\,V_q(f(x)) + [1 - \eta(x)]\,V_q(-f(x)) = \begin{cases} \eta(x)\,\dfrac{1}{f(x)^q}\,\dfrac{q^q}{(q+1)^{q+1}} + [1 - \eta(x)][1 + f(x)], & \text{if } f(x) > \frac{q}{q+1}, \\[4pt] \eta(x)[1 - f(x)] + [1 - \eta(x)][1 + f(x)], & \text{if } -\frac{q}{q+1} \le f(x) \le \frac{q}{q+1}, \\[4pt] \eta(x)[1 - f(x)] + [1 - \eta(x)]\,\dfrac{1}{[-f(x)]^q}\,\dfrac{q^q}{(q+1)^{q+1}}, & \text{if } f(x) < -\frac{q}{q+1}. \end{cases} $$
For each given $x$, we treat both $f(x)$ and $\eta(x)$ as scalars and write them as $f$ and $\eta$, respectively. We then take $\zeta(f) = \zeta(f(x))$ as a function of $f$ and compute its derivative with respect to $f$:
$$ \frac{\partial \zeta(f)}{\partial f} = \begin{cases} -\eta\,\dfrac{1}{f^{q+1}}\,\dfrac{q^{q+1}}{(q+1)^{q+1}} + 1 - \eta, & \text{if } f > \frac{q}{q+1}, \\[4pt] 1 - 2\eta, & \text{if } -\frac{q}{q+1} \le f \le \frac{q}{q+1}, \\[4pt] -\eta + (1 - \eta)\,\dfrac{1}{(-f)^{q+1}}\,\dfrac{q^{q+1}}{(q+1)^{q+1}}, & \text{if } f < -\frac{q}{q+1}. \end{cases} $$
We see that (1) when $\eta > 1/2$, $\partial\zeta(f)/\partial f = 0$ only when $f = \tilde f \equiv \frac{q}{q+1}\big(\frac{\eta}{1-\eta}\big)^{\frac{1}{q+1}}$, and (2) when $\eta < 1/2$, $\partial\zeta(f)/\partial f = 0$ only when $f = \tilde f \equiv -\frac{q}{q+1}\big(\frac{1-\eta}{\eta}\big)^{\frac{1}{q+1}}$. For these two cases, we also observe that
$$ \partial\zeta(f)/\partial f < 0 \ \text{ if } f < \tilde f, \qquad \partial\zeta(f)/\partial f > 0 \ \text{ if } f > \tilde f, \qquad (6.7) $$
from which it follows that $\tilde f$ is the minimizer of $\zeta(f)$.

Proof of Lemma 4
As $\tilde f(x)$ was defined in (4.4), we see that for each $x$,
$$ \zeta\big(\tilde f(x)\big) \equiv \eta(x)\,V_q\big(\tilde f(x)\big) + [1 - \eta(x)]\,V_q\big(-\tilde f(x)\big) = \begin{cases} \eta(x) + [1 - \eta(x)]^{\frac{1}{q+1}}\,\eta(x)^{\frac{q}{q+1}}, & \text{if } \eta(x) \le 1/2, \\[4pt] 1 - \eta(x) + \eta(x)^{\frac{1}{q+1}}\,[1 - \eta(x)]^{\frac{q}{q+1}}, & \text{if } \eta(x) > 1/2, \end{cases} $$
$$ = \Big(\tfrac{1}{2} - \big|\eta(x) - \tfrac{1}{2}\big|\Big) + \Big(\tfrac{1}{2} + \big|\eta(x) - \tfrac{1}{2}\big|\Big)^{\frac{1}{q+1}} \Big(\tfrac{1}{2} - \big|\eta(x) - \tfrac{1}{2}\big|\Big)^{\frac{q}{q+1}}. $$
For $a \in [0, \tfrac{1}{2}]$, define $\gamma(a)$ and compute its first-order derivative as follows:
$$ \gamma(a) \equiv 1 - \Big(\tfrac{1}{2} - a\Big) - \Big(\tfrac{1}{2} + a\Big)^{\frac{1}{q+1}} \Big(\tfrac{1}{2} - a\Big)^{\frac{q}{q+1}} - \frac{2q}{q+1}\,a, $$
$$ \gamma'(a) = 1 - \frac{1}{q+1}\left(\frac{1/2 - a}{1/2 + a}\right)^{\frac{q}{q+1}} + \frac{q}{q+1}\left(\frac{1/2 + a}{1/2 - a}\right)^{\frac{1}{q+1}} - \frac{2q}{q+1} $$
$$ = \left[\frac{1}{q+1} - \frac{1}{q+1}\left(\frac{1/2 - a}{1/2 + a}\right)^{\frac{q}{q+1}}\right] + \left[\frac{q}{q+1}\left(\frac{1/2 + a}{1/2 - a}\right)^{\frac{1}{q+1}} - \frac{q}{q+1}\right] \ge 0. $$
Hence for each $a \in [0, \tfrac{1}{2}]$, $\gamma(a) \ge \gamma(0) = 0$. For each $x$, let $a = |\eta(x) - \tfrac{1}{2}|$, and we see that
$$ 1 - \zeta\big(\tilde f(x)\big) \ge \frac{2q}{q+1}\,\Big|\eta(x) - \tfrac{1}{2}\Big|. $$
By
$$ R(f) = P\big(Y \ne \mathrm{sign}(f(X))\big) = E_{\{X: f(X) \ge 0\}}\big[1 - \eta(X)\big] + E_{\{X: f(X) < 0\}}\,\eta(X), $$
we obtain
$$ R(\hat f_n) - R(f^\star) = E_{\{X:\, \hat f_n(X) \ge 0,\, f^\star(X) < 0\}}\big[1 - 2\eta(X)\big] + E_{\{X:\, \hat f_n(X) < 0,\, f^\star(X) > 0\}}\big[2\eta(X) - 1\big] $$
$$ \le E_{\{X:\, \hat f_n(X) f^\star(X) \le 0\}}\,2\Big|\eta(X) - \tfrac{1}{2}\Big| \le \frac{q+1}{q}\, E_{\{X:\, \hat f_n(X) f^\star(X) \le 0\}}\Big[1 - \zeta\big(\tilde f(X)\big)\Big]. \qquad (6.8) $$
Since $f^\star(X)$ and $\tilde f(X)$ share the same sign, $\hat f_n(X) f^\star(X) \le 0$ implies $\hat f_n(X)\,\tilde f(X) \le 0$. When $\hat f_n(X)\,\tilde f(X) \le 0$, $0$ lies between $\hat f_n(X)$ and $\tilde f(X)$, and thus (6.7) indicates that $\zeta(\tilde f(X)) \le \zeta(0) = 1 \le \zeta(\hat f_n(X))$. From (6.8), we conclude that
$$ R(\hat f_n) - R(f^\star) \le \frac{q+1}{q}\, E_{\{X:\, \hat f_n(X) f^\star(X) \le 0\}}\Big[\zeta\big(\hat f_n(X)\big) - \zeta\big(\tilde f(X)\big)\Big] \le \frac{q+1}{q}\, E_X\Big[\zeta\big(\hat f_n(X)\big) - \zeta\big(\tilde f(X)\big)\Big] $$
$$ = \frac{q+1}{q}\, E_{XY}\Big[V_q\big(Y\hat f_n(X)\big) - V_q\big(Y\tilde f(X)\big)\Big] = \frac{q+1}{q}\,(\varepsilon_A + \varepsilon_E). $$

Proof of Theorem 1
Part (1). We first show that when $\mathcal{H}_K$ is induced by a universal kernel, the approximation error $\varepsilon_A = 0$. By definition, we need to show that for any $\epsilon > 0$, there exists $f_\epsilon \in \mathcal{H}_K$ such that
$$ \Big| E_{XY} V_q\big(Y f_\epsilon(X)\big) - E_{XY} V_q\big(Y \tilde f(X)\big) \Big| < \epsilon. \qquad (6.9) $$
We first consider a truncated version of $\tilde f$. For any given $\delta \in (0, 1/2)$, define
$$ f_\delta(X) = \begin{cases} \dfrac{q}{q+1}\Big(\dfrac{1-\delta}{\delta}\Big)^{\frac{1}{q+1}}, & \text{if } \eta(X) > 1 - \delta, \\[4pt] \tilde f(X), & \text{if } \delta \le \eta(X) \le 1 - \delta, \\[4pt] -\dfrac{q}{q+1}\Big(\dfrac{1-\delta}{\delta}\Big)^{\frac{1}{q+1}}, & \text{if } \eta(X) < \delta. \end{cases} $$
We have $0 \le E_{XY} V_q(Y f_\delta(X)) - E_{XY} V_q(Y\tilde f(X)) = \kappa_+ + \kappa_-$, where
$$ \kappa_+ = E_{X:\, \eta(X) > 1-\delta}\big[\eta(X) V_q(f_\delta(X)) + (1 - \eta(X)) V_q(-f_\delta(X))\big] - E_{X:\, \eta(X) > 1-\delta}\big[\eta(X) V_q(\tilde f(X)) + (1 - \eta(X)) V_q(-\tilde f(X))\big], $$
$$ \kappa_- = E_{X:\, \eta(X) < \delta}\big[\eta(X) V_q(f_\delta(X)) + (1 - \eta(X)) V_q(-f_\delta(X))\big] - E_{X:\, \eta(X) < \delta}\big[\eta(X) V_q(\tilde f(X)) + (1 - \eta(X)) V_q(-\tilde f(X))\big]. $$
Since $V_q(f_\delta(X)) < V_q(-f_\delta(X))$ when $\eta(X) > 1 - \delta$, there exists a sufficiently small $\delta$ such that $\kappa_+ < \epsilon/6$. We can also obtain $\kappa_- < \epsilon/6$, and hence
$$ 0 \le E_{XY} V_q\big(Y f_\delta(X)\big) - E_{XY} V_q\big(Y\tilde f(X)\big) \le \kappa_+ + \kappa_- < \epsilon/3. \qquad (6.10) $$
Let $M_\delta = \sup_X |f_\delta(X)| = \frac{q}{q+1}\big(\frac{1-\delta}{\delta}\big)^{\frac{1}{q+1}}$. By Lusin's theorem, there exists a continuous function $\varrho(X)$ such that $P(\varrho(X) \ne f_\delta(X)) \le \epsilon/(12 M_\delta)$. Define
$$ \tau(X) = \begin{cases} \varrho(X), & \text{if } |\varrho(X)| \le M_\delta, \\[2pt] M_\delta \cdot \dfrac{\varrho(X)}{|\varrho(X)|}, & \text{if } |\varrho(X)| > M_\delta, \end{cases} $$
so that $P(\tau(X) \ne f_\delta(X)) \le \epsilon/(12 M_\delta)$ as well. Hence
$$ \Big| E_{XY} V_q\big(Y f_\delta(X)\big) - E_{XY} V_q\big(Y \tau(X)\big) \Big| \le E_X |f_\delta(X) - \tau(X)| = E_{\{X:\, \tau(X) \ne f_\delta(X)\}} |f_\delta(X) - \tau(X)| \le 2 M_\delta \cdot \frac{\epsilon}{12 M_\delta} = \epsilon/6, $$
where the first inequality comes from the fact that $V_q(u)$ is Lipschitz continuous, i.e.,
$$ |V_q(u_1) - V_q(u_2)| \le |u_1 - u_2|, \quad \forall\, u_1, u_2 \in \mathbb{R}. $$
Notice that $\tau(X)$ is also continuous. The definition of the universal kernel implies the existence of a function $f_\epsilon \in \mathcal{H}_K$ such that
$$ \Big| E_{XY} V_q\big(Y f_\epsilon(X)\big) - E_{XY} V_q\big(Y \tau(X)\big) \Big| \le \sup_X |f_\epsilon(X) - \tau(X)| < \epsilon/6. \qquad (6.11) $$
Combining (6.10), (6.11), and the bound in between, we obtain (6.9).

Part (2). In this part we bound the estimation error $\varepsilon_E(\hat f_n)$. Note that the RKHS has the following reproducing property (Wahba, 1990; Hastie et al., 2009):
$$ \langle K(x_i, \cdot), f \rangle_{\mathcal{H}_K} = f(x_i), \qquad \langle K(x_i, \cdot), K(x_j, \cdot) \rangle_{\mathcal{H}_K} = K(x_i, x_j). \qquad (6.12) $$
Fix any $\epsilon > 0$. By the KKT condition of (4.5) and the representer theorem, we have
$$ \frac{1}{n}\sum_{i=1}^n V_q'\big(y_i \hat f_n(x_i)\big)\, y_i K(x_i, \cdot) + 2\lambda_n \hat f_n = 0. \qquad (6.13) $$
We define $\hat f^{[k]}$ as the solution of (4.5) when the $k$th observation is excluded from the training data, i.e.,
$$ \hat f^{[k]} = \operatorname*{argmin}_{f \in \mathcal{H}_K} \left[ \frac{1}{n} \sum_{i=1,\, i \ne k}^n V_q\big(y_i f(x_i)\big) + \lambda_n \|f\|^2_{\mathcal{H}_K} \right]. \qquad (6.14) $$
By the definition of $\hat f^{[k]}$ and the convexity of $V_q$, we have
$$ 0 \le \frac{1}{n}\sum_{i=1,\, i\ne k}^n V_q\big(y_i \hat f_n(x_i)\big) + \lambda_n \|\hat f_n\|^2_{\mathcal{H}_K} - \frac{1}{n}\sum_{i=1,\, i\ne k}^n V_q\big(y_i \hat f^{[k]}(x_i)\big) - \lambda_n \|\hat f^{[k]}\|^2_{\mathcal{H}_K} $$
$$ \le -\frac{1}{n}\sum_{i=1,\, i\ne k}^n V_q'\big(y_i \hat f_n(x_i)\big)\, y_i \big(\hat f^{[k]}(x_i) - \hat f_n(x_i)\big) + \lambda_n \|\hat f_n\|^2_{\mathcal{H}_K} - \lambda_n \|\hat f^{[k]}\|^2_{\mathcal{H}_K}. $$
By the reproducing property, we further have
$$ 0 \le -\frac{1}{n}\sum_{i=1,\, i\ne k}^n V_q'\big(y_i \hat f_n(x_i)\big)\, y_i \big\langle K(x_i, \cdot),\, \hat f^{[k]} - \hat f_n \big\rangle_{\mathcal{H}_K} + \lambda_n \|\hat f_n\|^2_{\mathcal{H}_K} - \lambda_n \|\hat f^{[k]}\|^2_{\mathcal{H}_K} $$
$$ = -\frac{1}{n}\sum_{i=1,\, i\ne k}^n V_q'\big(y_i \hat f_n(x_i)\big)\, y_i \big\langle K(x_i, \cdot),\, \hat f^{[k]} - \hat f_n \big\rangle_{\mathcal{H}_K} - 2\lambda_n \big\langle \hat f_n,\, \hat f^{[k]} - \hat f_n \big\rangle_{\mathcal{H}_K} - \lambda_n \|\hat f^{[k]} - \hat f_n\|^2_{\mathcal{H}_K} $$
$$ = \frac{1}{n}\, V_q'\big(y_k \hat f_n(x_k)\big)\, y_k \big\langle K(x_k, \cdot),\, \hat f^{[k]} - \hat f_n \big\rangle_{\mathcal{H}_K} - \lambda_n \|\hat f^{[k]} - \hat f_n\|^2_{\mathcal{H}_K}, $$
where the last equality holds by (6.13). Thus, by the Cauchy-Schwarz inequality,
$$ n\lambda_n \|\hat f^{[k]} - \hat f_n\|^2_{\mathcal{H}_K} \le V_q'\big(y_k \hat f_n(x_k)\big)\, y_k \big\langle K(x_k, \cdot),\, \hat f^{[k]} - \hat f_n \big\rangle_{\mathcal{H}_K} \le \big|V_q'\big(y_k \hat f_n(x_k)\big)\big| \cdot \|K(x_k, \cdot)\|_{\mathcal{H}_K} \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K} \le \sqrt{K(x_k, x_k)}\; \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K}, $$
which implies
$$ \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K} \le \frac{\sqrt{B}}{n\lambda_n}, $$
where $B = \sup_x K(x, x)$. By the reproducing property, we have
$$ \big|\hat f^{[k]}(x_k) - \hat f_n(x_k)\big| = \Big| \big\langle K(x_k, \cdot),\, \hat f^{[k]} - \hat f_n \big\rangle_{\mathcal{H}_K} \Big| \le \sqrt{K(x_k, x_k)}\; \|\hat f^{[k]} - \hat f_n\|_{\mathcal{H}_K} \le \frac{B}{n\lambda_n}. $$
By the Lipschitz continuity of the DWD loss, we obtain for each $k = 1, \ldots, n$,
$$ V_q\big(y_k \hat f^{[k]}(x_k)\big) - V_q\big(y_k \hat f_n(x_k)\big) \le \big|\hat f^{[k]}(x_k) - \hat f_n(x_k)\big| \le \frac{B}{n\lambda_n}, $$
and therefore
$$ \frac{1}{n}\sum_{k=1}^n V_q\big(y_k \hat f^{[k]}(x_k)\big) \le \frac{1}{n}\sum_{k=1}^n V_q\big(y_k \hat f_n(x_k)\big) + \frac{B}{n\lambda_n}. \qquad (6.15) $$
Let $f^*_\epsilon \in \mathcal{H}_K$ be such that
$$ E_{XY} V_q\big(Y f^*_\epsilon(X)\big) \le \inf_{f \in \mathcal{H}_K} E_{XY} V_q\big(Y f(X)\big) + \epsilon/3. \qquad (6.16) $$
By the definition of $\hat f_n$, we have
$$ \frac{1}{n}\sum_{k=1}^n V_q\big(y_k \hat f_n(x_k)\big) + \lambda_n \|\hat f_n\|^2_{\mathcal{H}_K} \le \frac{1}{n}\sum_{k=1}^n V_q\big(y_k f^*_\epsilon(x_k)\big) + \lambda_n \|f^*_\epsilon\|^2_{\mathcal{H}_K}. \qquad (6.17) $$
Since each data point in $\mathcal{T}_n = \{(x_k, y_k)\}_{k=1}^n$ is drawn from the same distribution, we have
$$ E_{\mathcal{T}_n}\left[ \frac{1}{n}\sum_{k=1}^n V_q\big(y_k \hat f^{[k]}(x_k)\big) \right] = \frac{1}{n}\sum_{k=1}^n E_{\mathcal{T}_n}\, V_q\big(y_k \hat f^{[k]}(x_k)\big) = E_{\mathcal{T}_{n-1}}\, E_{XY} V_q\big(Y \hat f_{n-1}(X)\big). \qquad (6.18) $$
Combining (6.15)-(6.18), we have
$$ E_{\mathcal{T}_{n-1}}\, E_{XY} V_q\big(Y \hat f_{n-1}(X)\big) \le \inf_{f \in \mathcal{H}_K} E_{XY} V_q\big(Y f(X)\big) + \lambda_n \|f^*_\epsilon\|^2_{\mathcal{H}_K} + \frac{B}{n\lambda_n} + \frac{\epsilon}{3}. \qquad (6.19) $$
By the choice of $\lambda_n$, there exists $N_\epsilon$ such that when $n > N_\epsilon$ we have $\lambda_n < \epsilon/(3\|f^*_\epsilon\|^2_{\mathcal{H}_K})$ and $n\lambda_n > 3B/\epsilon$, and hence
$$ E_{\mathcal{T}_{n-1}}\big[ E_{XY} V_q\big(Y \hat f_{n-1}(X)\big) \big] \le \inf_{f \in \mathcal{H}_K} E_{XY} V_q\big(Y f(X)\big) + \epsilon. $$
Because $\epsilon$ is arbitrary and $E_{\mathcal{T}_{n-1}}\big[E_{XY} V_q\big(Y \hat f_{n-1}(X)\big)\big] \ge \inf_{f \in \mathcal{H}_K} E_{XY} V_q\big(Y f(X)\big)$, we have $\lim_{n\to\infty} E_{\mathcal{T}_{n-1}}\big[E_{XY} V_q\big(Y \hat f_{n-1}(X)\big)\big] = \inf_{f \in \mathcal{H}_K} E_{XY} V_q\big(Y f(X)\big)$, which equivalently indicates that
$$ \lim_{n\to\infty} E_{\mathcal{T}_n}\, \varepsilon_E(\hat f_n) = 0. $$
Since $\varepsilon_E(\hat f_n) \ge 0$, part (2) then follows from Markov's inequality.

References
Ahn, J., Marron, J.S., Muller, K., and Chi, Y. (2007), "The high-dimension, low-sample-size geometric representation holds under mild conditions," Biometrika, 94(3), 760–766.
Ahn, J. and Marron, J.S. (2010), "The maximal data-piling direction for discrimination," Biometrika, 97(1), 254–259.
Aizerman, A., Braverman, E., and Rozoner, L. (1964), "Theoretical foundations of the potential function method in pattern recognition learning," Automation and Remote Control, 25, 821–837.
Alizadeh, F. and Goldfarb, D. (2004), "Second-order cone programming," Mathematical Programming, Series B, 95(1), 3–51.
Anthony, M. and Bartlett, P. (1999), Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge.
Bartlett, P. and Shawe-Taylor, J. (1999), "Generalization performance of support vector machines and other pattern classifiers," Advances in Kernel Methods–Support Vector Learning, 43–54.
Boyd, S. and Vandenberghe, L. (2004), Convex Optimization, Cambridge University Press, Cambridge.
Breiman, L. (2001), "Random forests," Machine Learning, 45(1), 5–32.
De Leeuw, J. and Heiser, W. (1977), "Convergence of correction matrix algorithms for multidimensional scaling," 735–752.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014), "Do we need hundreds of classifiers to solve real world classification problems?" The Journal of Machine Learning Research, 15, 3133–3181.
Freund, Y. and Schapire, R. (1997), "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007), "Pathwise coordinate optimization," The Annals of Applied Statistics, 1(2), 302–332.
Girosi, F., Jones, M., and Poggio, T. (1995), "Regularization theory and neural networks architectures," Neural Computation, 7(2), 219–269.
Hall, P., Marron, J.S., and Neeman, A. (2005), "Geometric representation of high dimension, low sample size data," Journal of the Royal Statistical Society, Series B, 67(3), 427–444.
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, Springer-Verlag, New York.
Huang, H., Liu, Y., Du, Y., Perou, C., Hayes, D., Todd, M., and Marron, J.S. (2013), "Multiclass distance-weighted discrimination," Journal of Computational and Graphical Statistics, 22(4), 953–969.
Huang, H., Lu, X., Liu, Y., Haaland, P., and Marron, J.S. (2012), "R/DWD: distance-weighted discrimination for classification, visualization and batch adjustment," Bioinformatics, 28(8), 1182–1183.
Hunter, D. and Lange, K. (2004), "A tutorial on MM algorithms," The American Statistician, 58(1), 30–37.
Hunter, D. and Li, R. (2005), "Variable selection using MM algorithms," The Annals of Statistics, 33(4), 1617–1642.
Jaakkola, T. and Haussler, D. (1999), "Probabilistic kernel regression models," Proceedings of the 1999 Conference on AI and Statistics, 126, 00–04.
Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004), "kernlab – An S4 package for kernel methods in R," Journal of Statistical Software, 11(9), 1–20.
Lange, K., Hunter, D., and Yang, I. (2000), "Optimization transfer using surrogate objective functions," Journal of Computational and Graphical Statistics, 9(1), 1–20.
Lange, K. and Zhou, H. (2014), "MM algorithms for geometric and signomial programming," Mathematical Programming, 143(1-2), 339–356.
Lichman, M. (2013), "UCI Machine Learning Repository," http://archive.ics.uci.edu/ml, Irvine, CA: University of California, School of Information and Computer Science.
Lin, Y., Lee, Y., and Wahba, G. (2002), "Support vector machines for classification in nonstandard situations," Machine Learning, 46, 191–202.
Lin, Y. (2002), "Support vector machines and the Bayes rule in classification," Data Mining and Knowledge Discovery, 6(3), 259–275.
Lin, Y. (2004), "A note on margin-based loss functions in classification," Statistics & Probability Letters, 68(1), 73–82.
Liu, Y., Zhang, H., and Wu, Y. (2011), "Hard or soft classification? Large-margin unified machines," Journal of the American Statistical Association, 106(493), 166–177.
Marron, J.S., Todd, M., and Ahn, J. (2007), "Distance weighted discrimination," Journal of the American Statistical Association, 102(480), 1267–1271.
Marron, J.S. (2013), "Smoothing, functional data analysis, and distance weighted discrimination software."
Marron, J.S. (2015), "Distance-weighted discrimination," Wiley Interdisciplinary Reviews: Computational Statistics, 7(2), 109–114.
Micchelli, C., Xu, Y., and Zhang, H. (2006), "Universal kernels," Journal of Machine Learning Research, 7, 2651–2667.
Qiao, X., Zhang, H., Liu, Y., Todd, M., and Marron, J.S. (2010), "Weighted distance weighted discrimination and its asymptotic properties," Journal of the American Statistical Association, 105(489), 401–414.
Qiao, X. and Zhang, L. (2015a), "Distance-weighted support vector machine," Statistics and Its Interface, 8(3), 331–345.
Qiao, X. and Zhang, L. (2015b), "Flexible high-dimensional classification machines and their asymptotic properties," Journal of Machine Learning Research, forthcoming.
Shawe-Taylor, J. and Cristianini, N. (2000), "Margin distribution and soft margin," Advances in Kernel Methods–Support Vector Learning, 349–358.
Steinwart, I. (2001), "On the influence of the kernel on the consistency of support vector machines," Journal of Machine Learning Research, 2, 67–93.
Tütüncü, R., Toh, K., and Todd, M. (2003), "Solving semidefinite-quadratic-linear programs using SDPT3," Mathematical Programming, 95(2), 189–217.
Vapnik, V. (1995), The Nature of Statistical Learning Theory, Springer-Verlag, New York.
Vapnik, V. (1998), Statistical Learning Theory, Wiley, New York.
Wahba, G. (1990), Spline Models for Observational Data, 59, SIAM.
Wahba, G., Gu, C., Wang, Y., and Campbell, R. (1994), "Soft classification, a.k.a. risk estimation, via penalized log likelihood and smoothing spline analysis of variance," in Santa Fe Institute Studies in the Sciences of Complexity, 20, Addison-Wesley Publishing Co., 331–331.
Wahba, G. (1999), "Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV," Advances in Kernel Methods–Support Vector Learning, 6, 69–87.
Wang, B. and Zou, H. (2015), "Sparse distance weighted discrimination," Journal of Computational and Graphical Statistics, forthcoming.
Wu, T.T. and Lange, K. (2008), "Coordinate descent algorithms for lasso penalized regression," The Annals of Applied Statistics, 2(1), 224–244.
Yang, Y. and Zou, H. (2013), "An efficient algorithm for computing the HHSVM and its generalizations," Journal of Computational and Graphical Statistics, 22(2), 396–415.
Zhang, T. (2004), "Statistical behavior and consistency of classification methods based on convex risk minimization," The Annals of Statistics, 32(1), 56–134.
Zhou, H. and Lange, K. (2010), "MM algorithms for some discrete multivariate distributions," Journal of Computational and Graphical Statistics, 19(3), 645–665.
Zhu, J. and Hastie, T. (2005), "Kernel logistic regression and the import vector machine," Journal of Computational and Graphical Statistics, 14(1), 185–205.
Zou, H. and Li, R. (2008), "One-step sparse estimates in nonconcave penalized likelihood models," The Annals of Statistics, 36(4), 1509–1533.