Learning a powerful SVM using piece-wise linear loss functions
Pritam Anand, Department of Computer Science, South Asian University, New Delhi-110021. Email: [email protected]
Abstract
In this paper, we have considered general $k$-piece-wise linear convex loss functions in the SVM model for measuring the empirical risk. The resulting $k$-Piece-wise Linear loss Support Vector Machine ($k$-PL-SVM) model is an adaptive SVM model which can learn a suitable piece-wise linear loss function according to the nature of the given training set. The $k$-PL-SVM models are general SVM models, and existing popular SVM models, like the C-SVM, LS-SVM and Pin-SVM models, are their particular cases. We have performed extensive numerical experiments with the $k$-PL-SVM models for $k = 2$ and $3$ and shown that they are an improvement over existing SVM models.

Support Vector Machine (SVM) models (Cortes & Vapnik, 1995)(Vapnik, 2013)(Gunn, 1998) are still very useful and popular among researchers. It is because of their interesting characteristics which remain missing in other machine learning models. SVM models implement the Structural Risk Minimization (SRM) principle (Vapnik, 2013) and can explicitly minimize the regularization term in their optimization problem to avoid over-fitting. Most of the existing SVM models only require solving an appropriate convex programming problem, which guarantees a globally optimal solution. Further, there are different choices of specific loss functions and kernel functions (Mercer, 1909) available which can be used in the SVM model according to the characteristics of the given dataset.

For a given binary classification problem with the training set $T = \{(x_i, y_i) : x_i \in \mathbb{R}^n, y_i \in \{-1, 1\}, i = 1, \ldots, l\}$, the SVM model obtains the kernel-generated decision function $\mathrm{sign}(w^T\phi(x) + b)$. For this, the SVM models solve an optimization problem in which a good trade-off between the empirical error on the training set and the regularization term is minimized efficiently.

In SVM models, we use a loss function to measure the empirical risk of the given training set. The characteristics and performance of an SVM model depend upon the way it measures the empirical error of the given training set. Therefore, the choice of the loss function is very crucial in SVM models.

The standard C-SVM model uses the Hinge loss function to measure the empirical risk. The Hinge loss function is given by $L_{Hinge}(u) = \max(u, 0)$, $u \in \mathbb{R}$. For the training set $T$, the C-SVM model solves the optimization problem

\[ \min_{(w,b)} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} L_{Hinge}(1 - y_i(w^T\phi(x_i) + b)), \tag{1} \]

where $C$ is a user-supplied parameter which is used for tuning the trade-off between the empirical error and the model complexity. The use of the Hinge loss function in the C-SVM model lets us obtain its geometrical interpretation. The solution of the C-SVM model is geometrically equivalent to obtaining a separating hyperplane $w^T\phi(x) + b = 0$ in the feature space with maximum margin.

The Least Squares Support Vector Machine (LS-SVM) model (Suykens & Vandewalle, 1999) uses the well-known least squares loss function (Legendre, 1805) to measure the empirical error, which is given by $L(u) = u^2$, $u \in \mathbb{R}$. For the given training set $T$, the LS-SVM model solves the optimization problem

\[ \min_{(w,b)} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (1 - y_i(w^T\phi(x_i) + b))^2. \tag{2} \]

The Pin-SVM model (Huang et al., 2014) minimizes the pinball loss function (Koenker & Bassett Jr, 1978)(Huang et al., 2014) to measure the empirical error of the training set. For $-1 \le \tau \le 1$, it is given by $L_\tau(u) = \max(u, -\tau u)$, $u \in \mathbb{R}$. For the given training set, it solves the optimization problem

\[ \min_{(w,b)} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} L_\tau(1 - y_i(w^T\phi(x_i) + b)). \tag{3} \]
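As a quick illustration of the three losses discussed above, the following Python sketch evaluates them on the margin variable $u = 1 - y_i(w^T\phi(x_i) + b)$. The function names are ours, for illustration only, and do not come from any reference implementation.

```python
import numpy as np

def hinge_loss(u):
    # L_Hinge(u) = max(u, 0), used by the C-SVM model
    return np.maximum(u, 0.0)

def least_squares_loss(u):
    # L(u) = u^2, used by the LS-SVM model
    return u ** 2

def pinball_loss(u, tau):
    # L_tau(u) = max(u, -tau * u) with -1 <= tau <= 1, used by the Pin-SVM model
    return np.maximum(u, -tau * u)

u = np.linspace(-2.0, 2.0, 9)
print(hinge_loss(u), least_squares_loss(u), pinball_loss(u, 0.5), sep="\n")
```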
We have plotted the Hinge loss function, the least squares loss function and the pinball loss function in Figure 1. We can realize that the SVM models based on these loss functions, particularly the C-SVM and LS-SVM models, are rigid in nature.

Figure 1. Existing loss functions in SVM: (a) Hinge loss (b) Pinball loss (c) Least squares loss.
They measure the empirical error of the given training set without taking the nature of the data into account. The loss functions used in these SVM models do not possess the capability of adapting themselves according to the nature of the given data. Therefore, there is a need for introducing adaptive and flexible loss functions in SVM models.

We also note that linear loss functions are robust: the influence of any outlier data point on the decision function obtained with a linear loss function is limited. But the least squares loss function and other polynomial loss functions are not robust.

In this paper, we have introduced a family of $k$-piece-wise linear loss functions and used them in the SVM model. The $k$-piece-wise linear loss functions are convex and robust. The use of these loss functions in the SVM results in an adaptive SVM model which can learn a suitable piece-wise linear loss function according to the nature of the data. The proposed $k$-Piece-wise Linear loss function based Support Vector Machine ($k$-PL-SVM) model is a general SVM model. Most of the popular SVM models, like the C-SVM, Pin-SVM and LS-SVM models, are its particular cases. We have briefly described the other interesting characteristics of the proposed $k$-PL-SVM model in this paper. Further, we have carried out extensive numerical experiments with the $k$-PL-SVM model for $k = 2$ and $3$ and shown that the $k$-PL-SVM model can obtain a significant improvement in prediction over existing SVM models.

We have organized the rest of this paper as follows. We describe our $k$-piece-wise linear loss functions and the resulting SVM model in Section 1. Section 2 describes some interesting properties of the proposed $k$-PL-SVM model. In Section 3, we present extensive numerical results which show that the proposed $k$-PL-SVM model has a lot of potential for improving the prediction of SVM models. Section 4 concludes this paper.
1. Piece-wise linear loss function based Support Vector Machine
We propose a general piece-wise linear convex loss function for the SVM model. The $k$-piece-wise linear loss function is defined as

\[ L_k(u) = \max(u,\ -\tau_1 u + \epsilon_1,\ -\tau_2 u + \epsilon_2,\ \ldots,\ -\tau_{k-1} u + \epsilon_{k-1}), \quad u \in \mathbb{R}, \tag{4} \]

where $\tau_1, \tau_2, \ldots, \tau_{k-1}$ and $\epsilon_1, \epsilon_2, \ldots, \epsilon_{k-1}$ are real-valued parameters.
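A direct Python sketch of the loss (4), vectorized over $u$, with parameter names matching the equation. This helper is ours, added for illustration only.

```python
import numpy as np

def pl_loss(u, taus, eps):
    """k-piece-wise linear loss (4):
    max(u, -tau_1*u + eps_1, ..., -tau_{k-1}*u + eps_{k-1}).
    taus and eps hold the (k-1) parameters; u may be a scalar or array."""
    u = np.asarray(u, dtype=float)
    pieces = [u] + [-t * u + e for t, e in zip(taus, eps)]
    return np.max(np.stack(pieces), axis=0)

# Example: a 3-piece-wise linear loss
print(pl_loss([-1.0, 0.0, 1.0, 2.0], taus=[-0.5, 0.5], eps=[0.0, -0.5]))
```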
Proposition 1. Let $F^{(k)}$ be the collection of all $k$-piece-wise linear convex functions. Let $f^{(k)} \in F^{(k)}$ be such that $f^{(k)}(u) = u$, $\forall u \in [u_1, u_1']$, for some $u_1, u_1' > 0$. Then, the loss function $L_k(u)$ is sufficient to represent the family $F^{(k)}$.
Proof. We shall prove the statement by using the principle of induction.

At first, we show that the elements of $F^{(2)}$ can be obtained from our proposed loss function $L_2(u) = \max(u, -\tau_1 u + \epsilon_1)$. One of the segment lines of an arbitrary $f^{(2)} \in F^{(2)}$ would be $y = u$, as it must satisfy $f(u) = u$, $\forall u \in [u_1, u_1']$ for some $u_1, u_1' > 0$. Let the other segment line of $f^{(2)}$ be $y = au + b$; then for $\tau_1 = -a$ and $\epsilon_1 = b$, the $f^{(2)}$ can be represented by $L_2(u) = \max(u, -\tau_1 u + \epsilon_1)$.

Further, let $F^{(m)}$ be representable by the loss function $L_m(u)$, which means that any $f^{(m)}(u)$ can be obtained from $\max(u, -\tau_1 u + \epsilon_1, \ldots, -\tau_{m-1} u + \epsilon_{m-1})$. Then, to complete the proof, we need to show that any $f^{(m+1)}$ can be obtained from the loss function $L_{m+1}$. The $f^{(m+1)}$ can be constructed by considering an additional segment line $y = a'u + b'$. Since $f^{(m+1)}$ has to remain convex, it can be obtained as $\max(L_m(u), -\tau_m u + \epsilon_m)$ with $\tau_m = -a'$ and $\epsilon_m = b'$. It means that $f^{(m+1)}$ can be obtained from $L_{m+1}(u) = \max(u, -\tau_1 u + \epsilon_1, \ldots, -\tau_m u + \epsilon_m)$. $\square$

Figure 2 shows the 3-piece-wise linear loss function for some particular values of its parameters. It can be noted that it also reduces to the pinball loss function and the Hinge loss function for particular chosen values of its parameters.
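A quick numerical check of these reductions, reusing the pl_loss helper sketched above: with all parameters zero, the loss (4) collapses to the Hinge loss, and with $\tau_1 = \tau > 0$ and the remaining parameters zero it collapses to the pinball loss.

```python
import numpy as np

u = np.linspace(-3.0, 3.0, 61)

# tau_m = 0, eps_m = 0 for all m: L_k(u) = max(u, 0), the Hinge loss
assert np.allclose(pl_loss(u, taus=[0.0, 0.0], eps=[0.0, 0.0]),
                   np.maximum(u, 0.0))

# tau_1 = tau > 0, other parameters zero: L_k(u) = max(u, -tau*u), the pinball loss
tau = 0.3
assert np.allclose(pl_loss(u, taus=[tau, 0.0], eps=[0.0, 0.0]),
                   np.maximum(u, -tau * u))
```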
Figure 2. The 3-piece-wise linear loss function for six different settings (a)-(f) of the parameters $(\tau_1, \tau_2, \epsilon_1, \epsilon_2)$.

For a given classification task with training set $T = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n, y_i \in \{1, -1\}, i = 1, \ldots, l\}$, we use the proposed $k$-piece-wise linear loss function for measuring the empirical risk. The resulting $k$-Piece-wise Linear loss Support Vector Machine ($k$-PL-SVM) model minimizes the empirical risk obtained by the chosen $k$-piece-wise linear loss function along with the regularization term $\frac{1}{2}\|w\|^2$ in its optimization problem, as follows:

\[ \begin{aligned} & \min_{(w,b)} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} L_k(1 - y_i(w^T\phi(x_i) + b)) \\ = \ & \min_{(w,b)} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \max\Big( 1 - y_i(w^T\phi(x_i) + b),\ -\tau_1(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_1,\ \ldots,\ -\tau_{k-1}(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_{k-1} \Big). \end{aligned} \tag{5} \]

Let us consider the $l$-dimensional slack variable $\xi$ such that $\xi_i = \max\big(1 - y_i(w^T\phi(x_i) + b),\ -\tau_1(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_1,\ \ldots,\ -\tau_{k-1}(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_{k-1}\big)$. Then the optimization problem (5) can be converted to the following Quadratic Programming Problem (QPP):

\[ \begin{aligned} \min_{(w,b,\xi)} \ & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \\ \text{subject to, } \ & \xi_i \ge 1 - y_i(w^T\phi(x_i) + b), \\ & \xi_i \ge -\tau_1(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_1, \\ & \ \ \vdots \\ & \xi_i \ge -\tau_{k-1}(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_{k-1}, \quad i = 1, 2, \ldots, l, \end{aligned} \tag{6} \]

where $\tau_1, \tau_2, \ldots, \tau_{k-1}$, $\epsilon_1, \epsilon_2, \ldots, \epsilon_{k-1}$ and $C \ge 0$ are user-supplied parameters. The parameter $C$ can be used to control the trade-off between the empirical error and the model complexity. To handle the unbalanced class labeling problem, we may consider an $l$-dimensional vector $C = (C_1, C_2, \ldots, C_l)$ in place of the single constant $C$, such that

\[ C_i = \begin{cases} C, & y_i = +1, \\ pC, & y_i = -1, \end{cases} \tag{7} \]

where $p$ is defined as $p = \dfrac{\text{number of data points in class } +1}{\text{number of data points in class } -1}$. Thereafter, we have preferred to solve the following optimization problem for our $k$-PL-SVM model:

\[ \begin{aligned} \min_{(w,b,\xi)} \ & \frac{1}{2}\|w\|^2 + \sum_{i=1}^{l} C_i \xi_i \\ \text{subject to, } \ & \xi_i \ge 1 - y_i(w^T\phi(x_i) + b), \\ & \xi_i \ge -\tau_1(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_1, \\ & \xi_i \ge -\tau_2(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_2, \\ & \ \ \vdots \\ & \xi_i \ge -\tau_{k-1}(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_{k-1}, \quad i = 1, 2, \ldots, l. \end{aligned} \tag{8} \]
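For a linear kernel ($\phi$ the identity map), problem (8) can be prototyped directly with an off-the-shelf convex solver. The sketch below uses cvxpy, which is our choice here and not something the paper prescribes (the experiments later solve the dual (18) with MATLAB's quadprog).

```python
import numpy as np
import cvxpy as cp

def kpl_svm_primal_linear(X, y, C=1.0, taus=(0.5,), eps=(0.0,)):
    """Sketch of the k-PL-SVM primal (8) with a linear kernel.
    X: (l, n) data; y: (l,) labels in {+1, -1};
    taus, eps: the (k-1) loss parameters tau_1..tau_{k-1}, eps_1..eps_{k-1}."""
    l, n = X.shape
    # Class-weighted penalties C_i from (7)
    p = np.sum(y == 1) / np.sum(y == -1)
    Ci = np.where(y == 1, C, p * C)
    w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(l)
    margin = 1 - cp.multiply(y, X @ w + b)        # 1 - y_i (w^T x_i + b)
    cons = [xi >= margin]                         # first piece of the loss
    for t, e in zip(taus, eps):                   # remaining k-1 pieces
        cons.append(xi >= -t * margin + e)
    prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + Ci @ xi), cons)
    prob.solve()
    return w.value, b.value

# Toy usage on random two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = kpl_svm_primal_linear(X, y, C=1.0, taus=(0.5, -0.2), eps=(0.0, 0.1))
print("train accuracy:", np.mean(np.sign(X @ w + b) == y))
```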
To derive the Wolfe dual, we need to obtain the Lagrangian function of the primal problem (8) of our $k$-PL-SVM model. The Lagrangian function is

\[ \begin{aligned} L(w, b, \xi, \alpha, \alpha^{(1)}, \ldots, \alpha^{(k-1)}) = \ & \frac{1}{2}\|w\|^2 + \sum_{i=1}^{l} C_i \xi_i - \sum_{i=1}^{l} \alpha_i \big(y_i(w^T\phi(x_i) + b) - 1 + \xi_i\big) \\ & - \sum_{m=1}^{k-1} \sum_{i=1}^{l} \alpha^{(m)}_i \big(\tau_m(1 - y_i(w^T\phi(x_i) + b)) + \xi_i - \epsilon_m\big). \end{aligned} \]

Here $\alpha, \alpha^{(1)}, \alpha^{(2)}, \ldots, \alpha^{(k-1)} \ge 0$ are $l$-dimensional vectors of Lagrangian multipliers. We list the Karush-Kuhn-Tucker (KKT) conditions for the primal problem (8) as follows:

\[ w = \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i \phi(x_i), \tag{9} \]
\[ \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i = 0, \tag{10} \]
\[ C_i - \alpha_i - \alpha^{(1)}_i - \ldots - \alpha^{(k-1)}_i = 0, \quad i = 1, 2, \ldots, l, \tag{11} \]
\[ \alpha_i \big(y_i(w^T\phi(x_i) + b) - 1 + \xi_i\big) = 0, \quad i = 1, 2, \ldots, l, \tag{12} \]
\[ \alpha^{(m)}_i \big(\tau_m(1 - y_i(w^T\phi(x_i) + b)) + \xi_i - \epsilon_m\big) = 0, \quad i = 1, \ldots, l, \ m = 1, \ldots, k-1, \tag{13} \]
\[ \xi_i \ge 1 - y_i(w^T\phi(x_i) + b), \quad i = 1, \ldots, l, \tag{14} \]
\[ \xi_i \ge -\tau_m(1 - y_i(w^T\phi(x_i) + b)) + \epsilon_m, \quad i = 1, \ldots, l, \ m = 1, \ldots, k-1, \tag{15} \]
\[ \alpha_i \ge 0, \quad \alpha^{(m)}_i \ge 0, \quad i = 1, \ldots, l, \ m = 1, \ldots, k-1. \tag{16} \]

Using the above KKT conditions, the Wolfe dual of the primal problem (8) of our $k$-PL-SVM model can be obtained as

\[ \begin{aligned} \min_{(\alpha, \alpha^{(1)}, \ldots, \alpha^{(k-1)})} \ & \frac{1}{2}\sum_{j=1}^{l}\sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i y_j \phi(x_i)^T\phi(x_j)\, (\alpha_j - \tau_1\alpha^{(1)}_j - \ldots - \tau_{k-1}\alpha^{(k-1)}_j) \\ & - \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i) - \sum_{i=1}^{l} (\alpha^{(1)}_i\epsilon_1 + \ldots + \alpha^{(k-1)}_i\epsilon_{k-1}) \\ \text{subject to, } \ & \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i = 0, \\ & C_i - \alpha_i - \alpha^{(1)}_i - \ldots - \alpha^{(k-1)}_i = 0, \quad i = 1, \ldots, l, \\ & \alpha_i \ge 0, \ \alpha^{(m)}_i \ge 0, \quad i = 1, \ldots, l, \ m = 1, \ldots, k-1. \end{aligned} \tag{17} \]

For a given positive semi-definite kernel $k$ satisfying the Mercer condition (Mercer, 1909), we can obtain $k(x_i, x_j) = \phi(x_i)^T\phi(x_j)$ without explicit knowledge of the mapping $\phi$. It reduces the above dual problem to

\[ \begin{aligned} \min_{(\alpha, \alpha^{(1)}, \ldots, \alpha^{(k-1)})} \ & \frac{1}{2}\sum_{j=1}^{l}\sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i y_j k(x_i, x_j)\, (\alpha_j - \tau_1\alpha^{(1)}_j - \ldots - \tau_{k-1}\alpha^{(k-1)}_j) \\ & - \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i) - \sum_{i=1}^{l} (\alpha^{(1)}_i\epsilon_1 + \ldots + \alpha^{(k-1)}_i\epsilon_{k-1}) \\ \text{subject to, } \ & \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i = 0, \\ & C_i - \alpha_i - \alpha^{(1)}_i - \ldots - \alpha^{(k-1)}_i = 0, \quad i = 1, \ldots, l, \\ & \alpha_i \ge 0, \ \alpha^{(m)}_i \ge 0, \quad i = 1, \ldots, l, \ m = 1, \ldots, k-1. \end{aligned} \tag{18} \]

After obtaining the solution vectors $\alpha, \alpha^{(1)}, \ldots, \alpha^{(k-1)}$ of the dual problem (18), we can classify an unseen data point $x \in \mathbb{R}^n$ using the decision function

\[ f(x) = \mathrm{sign}(w^T\phi(x) + b) = \mathrm{sign}\Big( \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i k(x_i, x) + b \Big). \]
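Given the dual solution, prediction only needs the combined multipliers $\beta_i = \alpha_i - \sum_m \tau_m \alpha^{(m)}_i$, the kernel and the bias. Here is a small sketch with the RBF kernel $\exp(-\|x - z\|^2/q)$ used later in the experiments; the helper names are ours.

```python
import numpy as np

def rbf_kernel(A, B, q):
    # k(x, z) = exp(-||x - z||^2 / q)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / q)

def decision_function(X_train, y_train, beta, b, X_test, q):
    # f(x) = sign( sum_i beta_i * y_i * k(x_i, x) + b ),
    # with beta_i = alpha_i - tau_1*alpha_i^(1) - ... - tau_{k-1}*alpha_i^(k-1)
    K = rbf_kernel(X_test, X_train, q)
    return np.sign(K @ (beta * y_train) + b)
```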
Obtaining the value of b: For $\alpha_j > 0$, $\alpha^{(m)}_j > 0$ and $\tau_m \ne -1$, $m = 1, 2, \ldots, k-1$, the KKT conditions (12) and (13) give

\[ y_j(w^T\phi(x_j) + b) - 1 + \xi_j = 0 \quad\text{and}\quad \tau_m(1 - y_j(w^T\phi(x_j) + b)) + \xi_j - \epsilon_m = 0, \]

which give

\[ b = y_j\Big(1 - \frac{\epsilon_m}{1 + \tau_m}\Big) - \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i k(x_i, x_j). \]

Also, for $\alpha^{(m_1)}_j > 0$ and $\alpha^{(m_2)}_j > 0$ with $m_1 \ne m_2$ and $\tau_{m_1} \ne \tau_{m_2}$, $m_1, m_2 = 1, 2, \ldots, k-1$, the KKT condition (13) gives

\[ \tau_{m_1}(1 - y_j(w^T\phi(x_j) + b)) + \xi_j - \epsilon_{m_1} = 0 \quad\text{and}\quad \tau_{m_2}(1 - y_j(w^T\phi(x_j) + b)) + \xi_j - \epsilon_{m_2} = 0, \]

which give

\[ b = y_j\Big(1 - \frac{\epsilon_{m_1} - \epsilon_{m_2}}{\tau_{m_1} - \tau_{m_2}}\Big) - \sum_{i=1}^{l} (\alpha_i - \tau_1\alpha^{(1)}_i - \ldots - \tau_{k-1}\alpha^{(k-1)}_i)\, y_i k(x_i, x_j). \]

In practice, we compute all possible values of the bias $b$ and take their average as the final value of $b$.
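A hedged sketch of this averaging rule (our helper, assuming beta as defined earlier): scan the training points where the relevant multipliers are active, collect every candidate bias implied by the two KKT cases above, and average them.

```python
import numpy as np

def average_bias(K, y, alpha, alpham, beta, taus, eps, tol=1e-6):
    """Average the candidate biases from KKT conditions (12)-(13).
    K: (l, l) training Gram matrix; alpha: (l,); alpham: (k-1, l);
    beta = alpha - sum_m taus[m] * alpham[m]."""
    g = K @ (beta * y)                      # w^T phi(x_j) for each j
    cands = []
    for j in range(len(y)):
        for m, (t, e) in enumerate(zip(taus, eps)):
            if alpha[j] > tol and alpham[m, j] > tol and t != -1:
                # case 1: 1 - y_j f(x_j) = eps_m / (1 + tau_m)
                cands.append(y[j] * (1 - e / (1 + t)) - g[j])
        for m1 in range(len(taus)):
            for m2 in range(m1 + 1, len(taus)):
                if (alpham[m1, j] > tol and alpham[m2, j] > tol
                        and taus[m1] != taus[m2]):
                    # case 2: 1 - y_j f(x_j) = (eps_m1 - eps_m2)/(tau_m1 - tau_m2)
                    r = (eps[m1] - eps[m2]) / (taus[m1] - taus[m2])
                    cands.append(y[j] * (1 - r) - g[j])
    return float(np.mean(cands)) if cands else 0.0
```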
2. Properties of the $k$-PL-SVM model

In this section, we shall describe some interesting properties of the proposed $k$-PL-SVM model.

The proposed $k$-PL-SVM model is a general SVM model. We shall show that several existing popular SVM models can be realized as particular cases of our $k$-PL-SVM model.

(a) C-SVM:
The underlying Hinge loss function used in the C-SVM model is a particular case of our $k$-piece-wise linear loss function with $\tau_m = 0$ and $\epsilon_m = 0$, for $m = 1, 2, \ldots, k-1$. Also, if we consider $\tau_m = 0$ and $\epsilon_m = 0$, for $m = 1, 2, \ldots, k-1$, in the optimization problem (6) of the proposed $k$-PL-SVM model, it becomes equivalent to the optimization problem of the C-SVM model.

(b) Pin-SVM:
The pinball loss function is also equivalent to the $k$-piece-wise linear loss function with $\tau_1 = \tau$ for any $-1 \le \tau \le 1$, $\tau_m = 0$ for $m = 2, \ldots, k-1$, and $\epsilon_m = 0$ for $m = 1, \ldots, k-1$. At these values of the parameters, the optimization problem (6) of the proposed $k$-PL-SVM model becomes equivalent to the optimization problem of the Pin-SVM model.

(c) LS-SVM: As we increase the value of $k$ in the $k$-piece-wise linear loss function, it involves more segment lines. As $k \to \infty$, the $k$-piece-wise linear loss function becomes smooth, and the least squares loss function becomes its particular case. Therefore, for a very large value of $k$, the optimization problem (6) of the proposed $k$-PL-SVM model is equivalent to solving the optimization problem of the LS-SVM model.

The $k$-piece-wise linear loss functions form a family of convex and robust loss functions that lie between the Hinge loss function and the least squares loss function. In the $k$-PL-SVM model, the data points are assigned the empirical risk according to their location: the $k$-PL-SVM model partitions the feature space into different zones using hyperplanes parallel to the separating hyperplane $w^T\phi(x) + b = 0$, and it assigns the empirical risk to a data point according to the zone in which it lies.

Let us suppose that, for the given training set $T$, the 3-PL-SVM model learns a suitable 3-piece-wise linear loss function with parameters $\tau_1, \epsilon_1, \tau_2$ and $\epsilon_2$. Then, similar to the C-SVM model, we have attempted to obtain the geometrical interpretation of our 3-PL-SVM model in Figure 3. We have represented the data points from label $+1$ in blue and from label $-1$ in red, respectively. In Figure 3 we show one possible division of the feature space into four different zones for the data points with label $+1$ by the 3-PL-SVM model. The 3-PL-SVM model assigns the empirical risk to the data points lying in these different zones using different expressions, which are of the form $-\tau_k(1 - y_i(w^Tx_i + b)) + \epsilon_k$ or $1 - y_i(w^Tx_i + b)$. A similar symmetric interpretation can be obtained for the data points with label $-1$.

Figure 3. Geometrical interpretation of the 3-PL-SVM model.
For $\tau_1, \ldots, \tau_{k-1} \ge 0$, we can easily show that the $k$-PL-SVM model minimizes the scatter of the data points along the resulting decision boundary $w^T\phi(x) + b = 0$.

It should be noted that the $k$-PL-SVM model does not always learn a $k$-piece-wise linear loss function which contains $k$ different segment lines. It may learn values of its parameters $\tau_1, \epsilon_1, \ldots, \tau_{k-1}$ and $\epsilon_{k-1}$ from the given data such that the resulting loss function involves only $m$ segment lines, where $m \le k$.

Now we shall evaluate the underlying $k$-piece-wise linear loss functions (4) used in the proposed $k$-PL-SVM model against the existing literature on loss functions for classification problems. According to the studies done in (Bartlett et al., 2006) and (Huang et al., 2017), a typical classification loss function $L$ should have the following four properties.

(a) $L(u)$ should be Lipschitz continuous for a given constant.
(b) $L(u)$ should be convex.
(c) $\frac{\partial L(u)}{\partial u}\big|_{u=1} > 0$.
(d) $L(u) \ge 0$ for any $u \in \mathbb{R}$.

A classification loss function which satisfies these four properties enjoys many nice properties, like Bayes consistency and classification calibration. It is not hard to see that the existing Hinge loss function, and the pinball loss function with $\tau \ge 0$, satisfy these four properties.

The proposed $k$-piece-wise linear loss functions (4) are Lipschitz continuous and convex. For $\epsilon_i\tau_i \ne 1$, $\forall i = 1, 2, \ldots, k-1$, we can easily obtain that the $k$-piece-wise linear loss function $L_k(u)$ satisfies $\frac{\partial L_k(u)}{\partial u}\big|_{u=1} > 0$.

Our $k$-piece-wise linear loss functions can also take negative values for some values of their parameters. But we can show that the $k$-piece-wise linear loss functions which satisfy $\frac{\epsilon_i\tau_j - \epsilon_j\tau_i}{\tau_j - \tau_i} \ge 0$ and $\frac{\epsilon_j}{\tau_j} \ge 0$, $\forall i, j = 1, 2, \ldots, k-1$, always take non-negative values. Therefore, we can claim that, similar to the Hinge loss function, the $k$-piece-wise linear loss functions with $\frac{\epsilon_i\tau_j - \epsilon_j\tau_i}{\tau_j - \tau_i} \ge 0$, $\frac{\epsilon_j}{\tau_j} \ge 0$ and $\epsilon_i\tau_i \ne 1$, $\forall i, j = 1, 2, \ldots, k-1$, enjoy the nice properties like Bayes consistency and classification calibration.

Apart from this, similar to the Hinge loss function and the pinball loss function, the proposed $k$-piece-wise linear loss functions are robust for finite values of $k$. The influence function of the proposed family of $k$-piece-wise linear loss functions can be shown to be bounded in the interval $[t_1, t_2]$, where $t_1 = \min(1, -\tau_1, \ldots, -\tau_{k-1})$ and $t_2 = \max(1, -\tau_1, \ldots, -\tau_{k-1})$. It means that any outlier data point can affect the resulting decision function only up to a certain constant extent.
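The boundedness claim is easy to probe numerically: the (sub)gradient of $L_k$ only ever equals one of the slopes $\{1, -\tau_1, \ldots, -\tau_{k-1}\}$, so a finite-difference estimate should always land in $[t_1, t_2]$. A small self-check, reusing the pl_loss helper sketched earlier:

```python
import numpy as np

taus, eps = [0.4, -0.3], [0.1, 0.2]
u = np.linspace(-5, 5, 100001)
slopes = np.diff(pl_loss(u, taus, eps)) / np.diff(u)   # finite-difference gradient

t1 = min(1.0, *(-t for t in taus))
t2 = max(1.0, *(-t for t in taus))
assert slopes.min() >= t1 - 1e-6 and slopes.max() <= t2 + 1e-6
print(f"gradient range [{slopes.min():.3f}, {slopes.max():.3f}] within [{t1}, {t2}]")
```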
Table 1. Dataset description

No.  Dataset          Size         Training points
1    Monk 1           556 x ...    ...
2    Monk 2           ...          ...
3    Monk 3           ...          ...
4    Spect            ... x 22     80
5    Haberman         306 x ...    ...
6    Heart            ... x 14     150
7    Ionosphere       351 x 34     200
8    Pima Indian      768 x ...    ...
9    ...              ... x 30     400
10   Echocardiogram   131 x 10     80
11   Australian       690 x 15     400
12   Bupa Liver       345 x ...    ...
13   ...              ... x 17     200
14   Diabetes         768 x ...    ...
15   ...              ... x 10     50
16   Sonar            208 x 61     100
17   Ecoil            327 x ...    ...
18   ...              ... x 13     100
19   Spambase         4601 x 57    1500
3. Experimental Results
In this section, we shall present the numerical results obtained from an extensive set of experiments and show the efficacy of the proposed $k$-PL-SVM model. We have compared the performance of the proposed 2-PL-SVM and 3-PL-SVM with the C-SVM, LS-SVM and Pin-SVM models on benchmark datasets and shown that the proposed models have better generalization ability than existing SVM models.

As we increase the value of $k$ in the $k$-PL-SVM model, it becomes more adaptive and powerful. But the optimization problem (6) of the $k$-PL-SVM model involves $k$ linear constraints per data point, which increases its solution time as the value of $k$ increases. We could have solved the optimization problem (5) of the proposed $k$-PL-SVM model using stochastic gradient descent methods (Bottou, 2010) to get rid of this problem. But we need an accurate solution of our proposed $k$-PL-SVM model for studying its characteristics efficiently. Therefore, we have preferred to solve the dual problem (18) of the $k$-PL-SVM models for $k = 2$ and $3$ using the quadprog function of MATLAB (in.mathworks.com) with the 'interior-point-convex' algorithm. We have also solved the C-SVM model and the Pin-SVM model with the quadprog function in MATLAB with the 'interior-point-convex' algorithm. The LS-SVM model only requires the solution of a system of equations.

Also, the $k$-PL-SVM model requires tuning of $2k-1$ parameters for the linear kernel and $2k$ parameters for the RBF kernel. As the value of $k$ increases, we require tuning of a larger number of parameters in the proposed $k$-PL-SVM model, but this extra endeavor in parameter tuning results in a significant improvement in the prediction ability.

In our numerical experiments, we have used the direct grid search method to tune the parameters of the 3-PL-SVM, 2-PL-SVM, Pin-SVM and LS-SVM models. We could have used some meta-heuristic methods for tuning the parameters of these SVM models. But these algorithms are of a random nature and may cause difficulty in making unbiased comparisons of the considered SVM models. Therefore, we have preferred to use the direct grid search method to tune the parameters of the considered SVM models over the same ranges.

It should be noted that the objective of our numerical experiments is to establish the fact that the proposed $k$-PL-SVM models are an improvement over existing SVM models.

The C-SVM and LS-SVM models involve two parameters: $C$ and the RBF parameter $q$. The Pin-SVM model contains one extra parameter $\tau$. The 2-PL-SVM model requires tuning of $\tau_1$ and $\epsilon_1$ apart from $C$ and $q$. The 3-PL-SVM model requires the tuning of two more parameters, $\tau_2$ and $\epsilon_2$, than the 2-PL-SVM model. For the linear kernel, we do not require the use and tuning of the kernel parameter $q$. We have used the RBF kernel of the form $\exp\big(-\frac{\|x - y\|^2}{q}\big)$.

We have tuned the values of the parameters $C$ and $q$ over a common grid of values using the grid search method for the C-SVM model. After obtaining suitable values of these parameters, we have used them in the Pin-SVM, 2-PL-SVM and 3-PL-SVM models. The reason behind considering the same values of the parameters $C$ and $q$ in the C-SVM, Pin-SVM, 2-PL-SVM and 3-PL-SVM models is that we want to exclude the effects of these parameters on their performance. It helps us to observe how the tuning of the extra parameters in the Pin-SVM, 2-PL-SVM and 3-PL-SVM models affects the improvement in accuracy. The parameter $\tau$ in the Pin-SVM model is obtained from the set $\{-1, -0.9, \ldots, 0.9, 1\}$ using the grid search method. The parameters $\tau_1$ and $\epsilon_1$ of the 2-PL-SVM model have been obtained from the set $\{-1, -0.9, \ldots, 0.9, 1\} \times \{-1, -0.9, \ldots, 0.9, 1\}$ using the grid search method. The parameters $\tau_1$, $\epsilon_1$, $\tau_2$ and $\epsilon_2$ of the 3-PL-SVM model have been obtained from the set $\{-1, -0.9, \ldots, 0.9, 1\}^4$ using the grid search method. For the LS-SVM model, we have explicitly tuned the parameters $C$ and $q$ over the same grid using the grid search method. After obtaining the value of $C$, we have computed the $C_i$ from (7) for all SVM models.

We have performed all experiments in the MATLAB 2018 (in.mathworks.com) environment on a Dell Xeon processor with 16 GB of RAM and the Windows 10 operating system. We have considered 19 benchmark datasets for our numerical experiments. We have listed and numbered these datasets with their dimensions in Table 1. These datasets have been downloaded from the UCI repository (Dua & Graff, 2017). For some datasets, like Monk 1, Monk 2, Monk 3 and Spect, the training sets and testing sets were provided separately. For the other datasets, we have arbitrarily separated the training and testing sets as listed in Table 1. Further, we have normalized the training and testing sets in $[-1, 1]$ for all datasets.

We have listed and compared the performance of the C-SVM, LS-SVM, Pin-SVM, 2-PL-SVM and 3-PL-SVM models in Table 2 for the linear kernel. We have also listed their tuned parameters and execution times.

Table 2. Numerical results with the linear kernel: accuracy, time (s) and tuned parameters of the C-SVM ($C$), LS-SVM ($C$), Pin-SVM ($C, \tau$), 2-PL-SVM ($C, \tau_1, \epsilon_1$) and 3-PL-SVM ($C, \tau_1, \tau_2, \epsilon_1, \epsilon_2$) models on each dataset.

Table 3. Numerical results with the non-linear (RBF) kernel: accuracy, time (s) and tuned parameters of the C-SVM ($q, C$), LS-SVM ($q, C$), Pin-SVM ($q, C, \tau$), 2-PL-SVM ($q, C, \tau_1, \epsilon_1$) and 3-PL-SVM ($q, C, \tau_1, \tau_2, \epsilon_1, \epsilon_2$) models on each dataset.
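As a sketch of the tuning protocol described above (not the authors' MATLAB code): fix $C$ at the value found for the C-SVM, then grid-search the loss parameters of the 2-PL-SVM on $\{-1, -0.9, \ldots, 1\}^2$, scoring by validation accuracy. It reuses the kpl_svm_primal_linear helper sketched earlier, so it covers the linear-kernel case only.

```python
import itertools
import numpy as np

def tune_2pl_svm(X_tr, y_tr, X_val, y_val, C):
    # Grid-search tau_1 and eps_1 on {-1, -0.9, ..., 1}^2 with C held fixed.
    grid = np.round(np.arange(-1.0, 1.05, 0.1), 1)
    best = (-np.inf, None)
    for tau1, eps1 in itertools.product(grid, grid):
        w, b = kpl_svm_primal_linear(X_tr, y_tr, C=C, taus=(tau1,), eps=(eps1,))
        acc = np.mean(np.sign(X_val @ w + b) == y_val)
        if acc > best[0]:
            best = (acc, (tau1, eps1))
    return best  # (validation accuracy, (tau_1, eps_1))
```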
We can easily make the following observations from the numerical results listed in Table 2.

• The accuracy obtained by the SVM models in Table 2 always follows the rule: C-SVM ≤ Pin-SVM ≤ 2-PL-SVM ≤ 3-PL-SVM. It shows that the proposed $k$-PL-SVM models are useful.

• We can observe that the 3-PL-SVM model is powerful, as it obtains a direct improvement in accuracy over the existing C-SVM, LS-SVM and Pin-SVM models on 12 datasets. We have plotted this improvement in accuracy obtained by the 3-PL-SVM model over the C-SVM, LS-SVM and Pin-SVM models in Figure 4. The use of the 3-piece-wise linear loss function makes the 3-PL-SVM model more efficient and adaptive, as it enables the 3-PL-SVM model to learn suitable values of the parameters $\tau_1$, $\tau_2$, $\epsilon_1$ and $\epsilon_2$. We have plotted the loss functions learned by the 3-PL-SVM model in Figure 5. For datasets of different natures, the 3-PL-SVM model has the capability to learn a suitable piece-wise linear loss function, which is missing in traditional SVM models like the C-SVM and LS-SVM models.

• The 2-PL-SVM model also improves significantly upon the C-SVM model on several datasets. The 3-PL-SVM model improves upon the 2-PL-SVM model on 11 datasets. As we increase the value of $k$ in the $k$-PL-SVM model, its adaptive ability to learn the loss function according to the nature of the dataset increases, which further results in an improvement of the prediction ability.

• As we move from the C-SVM model to the 3-PL-SVM model, the total execution time increases. It is because of the fact that we need to solve a QPP with more linear constraints.

We have also compared the performance of the C-SVM, LS-SVM, Pin-SVM, 2-PL-SVM and 3-PL-SVM models for the RBF kernel in Table 3. We can draw observations similar to the linear kernel case.
Figure 4. Improvement in accuracy obtained by the proposed 3-PL-SVM model over the existing C-SVM, LS-SVM and Pin-SVM models.

Figure 5. Loss functions learned by the 3-PL-SVM model for different datasets: (a) Monk 1 (b) Monk 2 (c) Monk 3 (d) Heart (e) Haberman (f) Echo (g) Bupa Liver (h) Diabetes.
4. Conclusions
In this paper, we have developed a general and adaptive SVM model. For this, we have introduced a family of $k$-piece-wise linear loss functions. These loss functions are general, robust and convex. The resulting $k$-PL-SVM model can learn a suitable piece-wise linear loss function from the given data. The $k$-PL-SVM model is a general SVM model, and the popular SVM models, like the C-SVM, LS-SVM and Pin-SVM models, are its particular cases. The $k$-PL-SVM model divides the feature space into different zones and assigns the empirical risk to each data point according to its location. We have also shown that the $k$-PL-SVM model becomes a Bayes-consistent classification model if we impose certain constraints on the values of its parameters. We have carried out an extensive set of experiments with the $k$-PL-SVM model for $k = 2$ and $3$ and shown that the proposed SVM model is an improvement over existing SVM models.

We would like to check the performance of the $k$-PL-SVM model for larger values of $k$ in the future. We shall develop a suitable stochastic gradient descent method to efficiently approximate the solution of the optimization problem of the $k$-PL-SVM model. It will enable us to test the proposed $k$-PL-SVM model for large values of $k$. We have also planned to develop an appropriate meta-heuristic algorithm for tuning the parameters of the $k$-PL-SVM model in the future. It will help us to reduce its total model selection time and also increase its adaptive capability.

Acknowledgment
I am very grateful to Prof. Suresh Chandra for his valuable suggestions, which have significantly improved the quality of this paper.
References
Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177-186. Springer, 2010.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273-297, 1995.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Gunn, S. Support vector machines for classification and regression. ISIS Technical Report, 1998.

Huang, X., Shi, L., and Suykens, J. A. Support vector machine classifier with pinball loss. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):984-997, 2014.

Huang, X., Shi, L., and Suykens, J. A. Solution path for pin-SVM classifiers with positive and negative τ values. IEEE Transactions on Neural Networks and Learning Systems, 28(7):1584-1593, 2017.

Koenker, R. and Bassett Jr, G. Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33-50, 1978.

Legendre, A. M. Nouvelles méthodes pour la détermination des orbites des comètes. F. Didot, 1805.

Mercer, J. XVI. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London. Series A, 209(441-458):415-446, 1909.

Suykens, J. A. and Vandewalle, J. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293-300, 1999.

Vapnik, V. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.