Double Ramp Loss Based Reject Option Classifier

Naresh Manwani, Kalpit Desai, Sanand Sasidharan, and Ramasubramanian Sundararajan

Data Mining Lab, GE Global Research, JFWTC, Whitefield, Bangalore-560066 ([email protected], [email protected]); Bidgely, Bangalore ([email protected]); Sabre Airline Solutions, Bangalore ([email protected])

Abstract.
We consider the problem of learning reject option classifiers. The goodness of a reject option classifier is quantified using the 0-d-1 loss, where a cost d ∈ (0, 0.5) is assigned for rejection. In this paper, we propose the double ramp loss function, which gives a continuous upper bound for the 0-d-1 loss. Our approach is based on minimizing regularized risk under the double ramp loss using difference of convex (DC) programming. We show the effectiveness of our approach through experiments on synthetic and benchmark datasets, where it performs better than state-of-the-art reject option classification approaches.
1 Introduction

The primary focus of classification problems has been on algorithms that return a prediction on every example. However, in many real-life situations, it may be prudent to reject an example rather than run the risk of a costly potential misclassification. Consider, for instance, a physician who has to return a diagnosis for a patient based on the observed symptoms and a preliminary examination. If the symptoms are either ambiguous, or rare enough to be unexplainable without further investigation, then the physician might choose not to risk misdiagnosing the patient (which might lead to further complications). He might instead ask for further medical tests to be performed, or refer the case to an appropriate specialist. Similarly, a banker, when faced with a loan application from a customer, may choose not to decide on the basis of the available information, and ask for a credit bureau score. While the follow-up actions might vary (asking for more features to describe the example, or using a different classifier), the principal response in these cases is to "reject" the example. This paper focuses on the manner in which this principal response is decided, i.e., which examples should a classifier reject, and why? From a geometric standpoint, we can view the classifier as being possessed of a decision surface (which separates points of different classes) as well as a rejection surface. The size of the rejection region impacts the proportion of cases that are likely to be rejected by the classifier, as well as the proportion of predicted cases that are likely to be correctly classified.
A well-optimized classifier with a reject option is one which minimizes the rejection rate as well as the misclassification rate on the predicted examples.

Let x ∈ R^p be the feature vector and y ∈ {−1, +1} be the class label, and let D(x, y) be the joint distribution of x and y. A typical reject option classifier is defined using a bandwidth parameter ρ and a separating surface f(x) = 0; ρ is the parameter which determines the rejection region. The reject option classifier h(f(x), ρ) is formed as

    h(f(x), ρ) = +1 if f(x) > ρ;  reject if |f(x)| ≤ ρ;  −1 if f(x) < −ρ.    (1)

The reject option classifier can be viewed as two parallel surfaces with the rejection area in between. The goal is to determine f(x) as well as ρ simultaneously. The performance of this classifier is evaluated using the loss L_{0-d-1} [13,9], which is

    L_{0-d-1}(f(x), y, ρ) = 1 if yf(x) < −ρ;  d if |f(x)| ≤ ρ;  0 otherwise.    (2)

In the above loss, d is the cost of rejection. If d = 0, then we will always reject. When d > 0.5, then we will never reject (because the expected loss of random labeling is 0.5). Thus, we always take d ∈ (0, 0.5). A classifier is learned such that the expectation of L_{0-d-1} with respect to D(x, y) (the risk) is minimized. Since D(x, y) is fixed but unknown, the empirical risk minimization principle is used. The risk under L_{0-d-1} is minimized by the generalized Bayes discriminant [9,4], which is as below:

    f*_d(x) = −1 if P(y = 1 | x) < d;  reject if d ≤ P(y = 1 | x) ≤ 1 − d;  +1 if P(y = 1 | x) > 1 − d.    (3)

h(f(x), ρ) (equation (1)) is shown to be infinite sample consistent with respect to the generalized Bayes classifier f*_d(x) described in equation (3) [15].

Loss | Function Definition
Generalized Hinge | L_GH(f(x), y) = 1 − ((1 − d)/d) yf(x) if yf(x) < 0;  1 − yf(x) if 0 ≤ yf(x) < 1;  0 otherwise
Double Hinge | L_DH(f(x), y) = max[ −(1 − d) yf(x) + H(d),  −d yf(x) + H(d),  0 ], where H(d) = −d log(d) − (1 − d) log(1 − d)

Table 1. Convex surrogates for L_{0-d-1}.
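To make the decision rule concrete, the following is a minimal sketch of equations (1) and (2); the function names and toy values are our own illustration, not code from the paper.

```python
def reject_option_classifier(f_x, rho):
    """Equation (1): predict +1 or -1, or reject (encoded as 0)."""
    if f_x > rho:
        return +1
    if f_x < -rho:
        return -1
    return 0  # |f(x)| <= rho: reject

def loss_0_d_1(f_x, y, rho, d):
    """Equation (2): 1 for a misclassification, d for a rejection, 0 otherwise."""
    if abs(f_x) <= rho:
        return d       # rejected
    if y * f_x < -rho:
        return 1.0     # confidently wrong
    return 0.0         # confidently right

# A score inside the band [-0.5, 0.5] is rejected and costs d.
print(reject_option_classifier(0.3, rho=0.5))    # -> 0
print(loss_0_d_1(0.3, y=+1, rho=0.5, d=0.2))     # -> 0.2
```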
Since minimizing the risk under L_{0-d-1} is computationally cumbersome, convex surrogates for L_{0-d-1} have been proposed. The generalized hinge loss L_GH (see Table 1) is a convex surrogate for L_{0-d-1} [13,14,3]; it is shown that a minimizer of risk under L_GH is consistent with the generalized Bayes classifier [3]. The double hinge loss L_DH (see Table 1) is another convex surrogate for L_{0-d-1} [7]; the minimizer of the risk under L_DH is shown to be strongly universally consistent with the generalized Bayes classifier [7].

Fig. 1. L_GH and L_DH for d = 0.2. (a) For ρ = 0.7, both losses upper bound L_{0-d-1}. (b) For ρ = 2, both losses fail to upper bound L_{0-d-1}. L_GH and L_DH both increase linearly even in the rejection region rather than being flat.

We observe that these convex loss functions have some limitations. For example, L_GH is a convex upper bound to L_{0-d-1} only provided ρ ≤ 1 − d, and L_DH forms an upper bound to L_{0-d-1} only provided ρ ∈ ((1 − H(d))/(1 − d), (H(d) − d)/d) (see Fig. 1). Also, both L_GH and L_DH increase linearly in the rejection region instead of remaining constant. These convex losses can become unbounded for misclassified examples under scaling of the parameters of f. Moreover, only limited experimental results are available to validate the practical significance of these losses [13,14,3,7]. A non-convex formulation for learning a reject option classifier is proposed in [5]; however, theoretical guarantees for the approach in [5] are not known.

While learning a reject option classifier, one has to deal with overlapping class regions as well as the presence of outliers. SVM and other convex loss based approaches are less robust to label noise and outliers in the data [11], and it is shown that ramp loss based risk minimization is more robust to noise [6]. Motivated by this, we propose the double ramp loss (L_DR), which incorporates a different loss value for rejection. L_DR forms a continuous nonconvex upper bound for L_{0-d-1} and overcomes many of the issues of the convex surrogates of L_{0-d-1}. To learn a reject option classifier, we minimize the regularized risk under L_DR, which becomes an instance of a difference of convex (DC) functions. To minimize such a DC function, we use the difference of convex programming approach [1], which essentially solves a sequence of convex programs. The proposed method has the following advantages over existing approaches: (1) the proposed loss function L_DR gives a tighter upper bound to L_{0-d-1}; (2) L_DR requires no constraint on ρ, unlike L_GH and L_DH; (3) our approach can be easily kernelized for dealing with nonlinear problems.

The rest of the paper is organized as follows. In Section 2 we define the double ramp loss function (L_DR), discuss its properties, and describe the proposed formulation based on risk minimization under L_DR. In Section 3 we derive the algorithm for learning a reject option classifier based on regularized risk minimization under L_DR using DC programming. We present experimental results in Section 4. We conclude the paper with a discussion in Section 5.

2 Double Ramp Loss (L_DR)

Our approach for learning a classifier with a reject option is based on minimizing regularized risk under L_DR (the double ramp loss). We define the double ramp loss as a continuous upper bound for L_{0-d-1}. This loss function is defined as a sum of two ramp loss functions as follows:

    L_DR(f(x), y, ρ) = (d/µ) { [µ − yf(x) + ρ]_+ − [−µ² − yf(x) + ρ]_+ } + ((1 − d)/µ) { [µ − yf(x) − ρ]_+ − [−µ² − yf(x) − ρ]_+ }    (4)

Fig. 2. L_DR and L_{0-d-1}: for all µ > 0, ρ ≥ 0, L_DR is an upper bound for L_{0-d-1} (shown for d = 0.2, ρ = 2 and µ ∈ {1, 0.5, 0.1}).
where [a]_+ = max(0, a), µ ∈ (0, 1] defines the slope of the ramps in the loss function, d ∈ (0, 0.5) is the cost of rejection, and ρ ≥ 0. As in L_{0-d-1}, L_DR also treats [−ρ, ρ] as the rejection region. Fig. 2 shows L_DR for d = 0.2, ρ = 2 with different values of µ.

Theorem 1. (1) L_DR ≥ L_{0-d-1} for all µ > 0, ρ ≥ 0. (2) lim_{µ→0} L_DR(f(x), ρ, y) = L_{0-d-1}(f(x), ρ, y). (3) In the rejection region yf(x) ∈ (−ρ + µ, ρ − µ²), the loss remains constant, that is, L_DR(f(x), y, ρ) = d(1 + µ). (4) For µ > 0, L_DR ≤ (1 + µ) for all ρ ≥ 0 and d ≥ 0. (5) When ρ = 0, L_DR is the same as the µ-ramp loss [12] used for classification problems without a rejection option. (6) L_DR is a non-convex function of (yf(x), ρ).

The proof of Theorem 1 is provided in Appendix A. We see that L_DR does not put any restriction on ρ for it to be an upper bound of L_{0-d-1}. Thus, L_DR is a general ramp loss function which also allows a rejection option.

2.1 Regularized Risk under L_DR

Let S = {(x_n, y_n), n = 1 … N} be the training dataset, where x_n ∈ R^p and y_n ∈ {−1, +1} for all n. As discussed, we minimize the regularized risk under L_DR to find a reject option classifier. In this paper, we use l₂ regularization. Let Θ = [w^T b ρ]^T. Thus, for f(x) = w^T φ(x) + b, the regularized risk under the double ramp loss is

    R(Θ) = (1/2)||w||² + C Σ_{n=1}^N L_DR(w^T φ(x_n) + b, y_n, ρ)
         = (1/2)||w||² + (C/µ) Σ_{n=1}^N { d[µ − y_n f(x_n) + ρ]_+ + (1 − d)[µ − y_n f(x_n) − ρ]_+ − d[−µ² − y_n f(x_n) + ρ]_+ − (1 − d)[−µ² − y_n f(x_n) − ρ]_+ }    (5)

where C is the regularization parameter. (While L_DR is parametrized by µ and d as well, we omit them for the sake of notational consistency.) While minimizing R(Θ), no non-negativity condition on ρ is required, due to Lemma 1 below.
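Before proceeding, here is a small numerical sketch of equation (4) and of Theorem 1(1). It follows the reconstruction of the loss given above (in particular the kinks at −µ²), and all names are ours.

```python
import numpy as np

def ramp_plus(a):
    """[a]_+ = max(0, a), elementwise."""
    return np.maximum(0.0, a)

def double_ramp_loss(margin, rho, d, mu):
    """Equation (4): L_DR as a function of the margin t = y f(x)."""
    t = np.asarray(margin, dtype=float)
    outer = ramp_plus(mu - t + rho) - ramp_plus(-mu**2 - t + rho)
    inner = ramp_plus(mu - t - rho) - ramp_plus(-mu**2 - t - rho)
    return (d / mu) * outer + ((1.0 - d) / mu) * inner

def loss_0_d_1(margin, rho, d):
    """L_{0-d-1} as a function of the margin."""
    t = np.asarray(margin, dtype=float)
    return np.where(np.abs(t) <= rho, d, np.where(t < -rho, 1.0, 0.0))

# Spot-check Theorem 1(1): L_DR upper-bounds L_{0-d-1} on a grid of margins.
d, mu, rho = 0.2, 0.5, 2.0
t = np.linspace(-5.0, 5.0, 1001)
assert np.all(double_ramp_loss(t, rho, d, mu) >= loss_0_d_1(t, rho, d) - 1e-12)
# On the flat part of the rejection region the loss equals d * (1 + mu).
print(double_ramp_loss(0.0, rho, d, mu))  # -> 0.3 = 0.2 * (1 + 0.5)
```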
Lemma 1. At the minimum of R(Θ), ρ must be non-negative.

The proof of the above lemma is provided in Appendix B.

R(Θ) (equation (5)) is a nonconvex function of Θ. However, R(Θ) can be written as R(Θ) = R₁(Θ) − R₂(Θ), where R₁(Θ) and R₂(Θ) are convex functions of Θ:

    R₁(Θ) = (1/2)||w||² + (C/µ) Σ_{n=1}^N { d[µ − y_n f(x_n) + ρ]_+ + (1 − d)[µ − y_n f(x_n) − ρ]_+ }
    R₂(Θ) = (C/µ) Σ_{n=1}^N { d[−µ² − y_n f(x_n) + ρ]_+ + (1 − d)[−µ² − y_n f(x_n) − ρ]_+ }

In this case, DC programming is guaranteed to find a local optimum of R(Θ) [1]. In the simplified DC algorithm [1], an upper bound on R(Θ) is found using the convexity of R₂(Θ) as follows:

    R(Θ) ≤ R₁(Θ) − R₂(Θ^(l)) − (Θ − Θ^(l))^T ∇R₂(Θ^(l)) =: ub(Θ, Θ^(l))    (6)

where Θ^(l) is the parameter vector after the l-th iteration and ∇R₂(Θ^(l)) is a subgradient of R₂ at Θ^(l). Θ^(l+1) is found by minimizing ub(Θ, Θ^(l)). Thus, R(Θ^(l+1)) ≤ ub(Θ^(l+1), Θ^(l)) ≤ ub(Θ^(l), Θ^(l)) = R(Θ^(l)), which means that in every iteration the DC program reduces the value of R(Θ).

3 Learning the Reject Option Classifier using DC Programming

In this section, we derive a DC algorithm for minimizing R(Θ). We initialize with Θ = Θ^(0). For any l ≥ 0, we find ub(Θ, Θ^(l)) as an upper bound for R(Θ) (see equation (6)). Given Θ^(l), we find Θ^(l+1) by minimizing the upper bound ub(Θ, Θ^(l)). Thus,

    Θ^(l+1) ∈ argmin_Θ ub(Θ, Θ^(l)) = argmin_Θ R₁(Θ) − Θ^T ∇R₂(Θ^(l))    (7)

where ∇R₂(Θ^(l)) is the subgradient of R₂(Θ) at Θ^(l). We choose ∇R₂(Θ^(l)) as

    ∇R₂(Θ^(l)) = Σ_{n=1}^N β′^(l)_n [−y_n φ(x_n)^T  −y_n  1]^T + Σ_{n=1}^N β″^(l)_n [−y_n φ(x_n)^T  −y_n  −1]^T

where

    β′^(l)_n = (Cd/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) − ρ^(l) < −µ² },
    β″^(l)_n = (C(1 − d)/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) + ρ^(l) < −µ² }.    (8)

For f(x) = w^T φ(x) + b, we rewrite the upper bound minimization problem described in equation (7) as follows:

    P^(l+1) = min_Θ R₁(Θ) − Θ^T ∇R₂(Θ^(l))
            = min_{w, b, ρ} (1/2)||w||² + (C/µ) Σ_{n=1}^N { d[µ − y_n f(x_n) + ρ]_+ + (1 − d)[µ − y_n f(x_n) − ρ]_+ } + Σ_{n=1}^N β′^(l)_n [y_n f(x_n) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n f(x_n) + ρ]

Note that P^(l+1) is a convex optimization problem in the variables (w, b, ρ). We rewrite P^(l+1) as

    P^(l+1) = min_{w, b, ξ′, ξ″, ρ} (1/2)||w||² + (C/µ) Σ_{n=1}^N [ dξ′_n + (1 − d)ξ″_n ] + Σ_{n=1}^N β′^(l)_n [y_n(w^T φ(x_n) + b) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n(w^T φ(x_n) + b) + ρ]
    s.t.  y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ′_n,  ξ′_n ≥ 0,  n = 1 … N
          y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ″_n,  ξ″_n ≥ 0,  n = 1 … N

where ξ′ = [ξ′₁ … ξ′_N]^T and ξ″ = [ξ″₁ … ξ″_N]^T. The dual optimization problem D^(l+1) of P^(l+1) is as follows:

    D^(l+1) = min_{γ′, γ″} (1/2) Σ_{n=1}^N Σ_{m=1}^N y_n y_m (γ′_n + γ″_n)(γ′_m + γ″_m) k(x_n, x_m) − µ Σ_{n=1}^N (γ′_n + γ″_n)
    s.t.  −β′^(l)_n ≤ γ′_n ≤ Cd/µ − β′^(l)_n,  n = 1 … N
          −β″^(l)_n ≤ γ″_n ≤ C(1 − d)/µ − β″^(l)_n,  n = 1 … N
          Σ_{n=1}^N y_n (γ′_n + γ″_n) = 0
          Σ_{n=1}^N (γ′_n − γ″_n) = 0    (9)

where γ′ = [γ′₁ … γ′_N]^T and γ″ = [γ″₁ … γ″_N]^T are the dual variables and k(x_n, x_m) = φ(x_n)^T φ(x_m). The derivation of the dual D^(l+1) is given in Appendix C. At the optimality of P^(l+1), w can be found as w = Σ_{n=1}^N y_n (γ′_n + γ″_n) φ(x_n). Since P^(l+1) has a quadratic objective and linear constraints, strong duality holds with D^(l+1). Solving D^(l+1) is more useful, as it can be easily kernelized for nonlinear problems.
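To illustrate the structure of one DC iteration, here is a sketch for the linear case (φ(x) = x): the coefficients β′, β″ of equation (8) are computed at the current iterate, and the convex subproblem P^(l+1) is then handed to a generic solver. We use cvxpy purely for illustration; the paper instead solves the dual D^(l+1). The −µ² thresholds follow the reconstruction above, and all names are ours.

```python
import numpy as np
import cvxpy as cp

def dc_iteration(X, y, w0, b0, rho0, C=1.0, d=0.2, mu=0.5):
    """One DC step: linearize R2 at (w0, b0, rho0) via beta', beta''
    (equation (8)) and minimize the convex upper bound P^{(l+1)}."""
    N, p = X.shape
    margins = y * (X @ w0 + b0)
    beta1 = (C * d / mu) * (margins - rho0 < -mu**2)        # beta'_n
    beta2 = (C * (1 - d) / mu) * (margins + rho0 < -mu**2)  # beta''_n

    w, b, rho = cp.Variable(p), cp.Variable(), cp.Variable()
    xi1, xi2 = cp.Variable(N, nonneg=True), cp.Variable(N, nonneg=True)
    scores = cp.multiply(y, X @ w + b)                      # y_n (w^T x_n + b)
    objective = (0.5 * cp.sum_squares(w)
                 + (C / mu) * cp.sum(d * xi1 + (1 - d) * xi2)
                 + beta1 @ (scores - rho) + beta2 @ (scores + rho))
    constraints = [scores >= rho + mu - xi1,
                   scores >= -rho + mu - xi2]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, b.value, rho.value
```

Iterating this step until Θ = (w, b, ρ) stops changing reproduces the overall structure of Algorithm 1 below.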
The behavior of γ′_n and γ″_n under the different cases is as follows:

    y_n(w^T φ(x_n) + b) − µ > ρ        ⇒  γ′_n = −β′^(l)_n;  γ″_n = −β″^(l)_n
    y_n(w^T φ(x_n) + b) − µ = ρ        ⇒  γ′_n ∈ (−β′^(l)_n, Cd/µ − β′^(l)_n);  γ″_n = −β″^(l)_n
    y_n(w^T φ(x_n) + b) − µ ∈ (−ρ, ρ)  ⇒  γ′_n = Cd/µ − β′^(l)_n;  γ″_n = −β″^(l)_n
    y_n(w^T φ(x_n) + b) − µ = −ρ       ⇒  γ′_n = Cd/µ − β′^(l)_n;  γ″_n ∈ (−β″^(l)_n, C(1 − d)/µ − β″^(l)_n)
    y_n(w^T φ(x_n) + b) − µ < −ρ       ⇒  γ′_n = Cd/µ − β′^(l)_n;  γ″_n = C(1 − d)/µ − β″^(l)_n

3.1 Finding b^(l+1) and ρ^(l+1)

The dual optimization problem above gives the dual variables γ′^(l+1) and γ″^(l+1), using which the normal vector is found as w^(l+1) = Σ_{n=1}^N y_n (γ′^(l+1)_n + γ″^(l+1)_n) φ(x_n). To find b^(l+1) and ρ^(l+1), we consider x_n ∈ SV′^(l+1) ∪ SV″^(l+1), where

    SV′^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = ρ^(l+1) + µ }
    SV″^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = −ρ^(l+1) + µ }

We already saw that:
1. If x_n ∈ SV′^(l+1), then γ′^(l+1)_n ∈ (−β′^(l)_n, Cd/µ − β′^(l)_n) and γ″^(l+1)_n = −β″^(l)_n.
2. If x_n ∈ SV″^(l+1), then γ′^(l+1)_n = Cd/µ − β′^(l)_n and γ″^(l+1)_n ∈ (−β″^(l)_n, C(1 − d)/µ − β″^(l)_n).

We solve the system of linear equations corresponding to the sets SV′^(l+1) and SV″^(l+1) for identifying b^(l+1) and ρ^(l+1), as in the sketch below.
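In practice this is a small two-variable linear system in (b, ρ); a least-squares sketch for the linear kernel (our own helper, not the paper's code):

```python
import numpy as np

def recover_b_rho(X_sv1, y_sv1, X_sv2, y_sv2, w, mu):
    """Solve  y_n (x_n^T w + b) =  rho + mu  for x_n in SV'
       and    y_n (x_n^T w + b) = -rho + mu  for x_n in SV''
       for (b, rho), in the least-squares sense if overdetermined."""
    rows, rhs = [], []
    for X_sv, y_sv, sign in ((X_sv1, y_sv1, -1.0), (X_sv2, y_sv2, +1.0)):
        for x_n, y_n in zip(X_sv, y_sv):
            rows.append([y_n, sign])                 # y_n * b + sign * rho
            rhs.append(mu - y_n * float(x_n @ w))    # = mu - y_n * x_n^T w
    (b, rho), *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return b, rho
```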
3.2 Overall Algorithm

We fix d ∈ (0, 0.5), µ ∈ (0, 1] and C, and initialize the parameter vector Θ as Θ^(0). In any iteration l, we find β′^(l)_n, β″^(l)_n, n = 1 … N (see equation (8)) using Θ^(l). We use these to solve D^(l+1) and find γ′^(l+1), γ″^(l+1). w^(l+1) is found as w^(l+1) = Σ_{n=1}^N y_n (γ′^(l+1)_n + γ″^(l+1)_n) φ(x_n). We find b^(l+1) and ρ^(l+1) as described in Section 3.1. Thus, we have found Θ^(l+1). Using Θ^(l+1), we now find β′^(l+1)_n, β″^(l+1)_n, n = 1 … N. We repeat the above steps until the parameter vector Θ no longer changes significantly. A more formal description of our algorithm is provided in Algorithm 1.

Algorithm 1: Learning Reject Option Classifier by Minimizing R(Θ)
Input: d ∈ (0, 0.5), µ ∈ (0, 1], C > 0, training set S
Output: w*, b*, ρ*
Initialize w^(0), b^(0), ρ^(0), l = 0
repeat
    Compute β′^(l)_n = (Cd/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) − ρ^(l) < −µ² }
            β″^(l)_n = (C(1 − d)/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) + ρ^(l) < −µ² }
    Find γ′^(l+1), γ″^(l+1) by solving D^(l+1) described in equation (9)
    Find w^(l+1) = Σ_{n=1}^N y_n (γ′^(l+1)_n + γ″^(l+1)_n) φ(x_n)
    Find b^(l+1) and ρ^(l+1) by solving the system of linear equations corresponding to the sets SV′^(l+1) and SV″^(l+1), where
        SV′^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = ρ^(l+1) + µ }
        SV″^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = −ρ^(l+1) + µ }
until convergence of Θ^(l)

3.3 γ′ and γ″ at the Convergence of Algorithm 1

Let γ′*_n, γ″*_n, n = 1 … N, be the values of the dual variables at the convergence of Algorithm 1. The behavior of γ′*_n and γ″*_n is described in Table 2. For any x_n, only one of γ′*_n and γ″*_n can be nonzero. We observe that the parameters w, b and ρ are determined by the points whose margin yf(x) is in the range [ρ − µ², ρ + µ] ∪ [−ρ − µ², −ρ + µ]. We call these points support vectors. We also see that for x_n with y_n f(x_n) ∈ (ρ + µ, ∞) ∪ (−ρ + µ, ρ − µ²) ∪ (−∞, −ρ − µ²), both γ′*_n = 0 and γ″*_n = 0. Thus, points which are correctly classified with margin at least ρ + µ, points falling close to the decision boundary with margin in the interval (−ρ + µ, ρ − µ²), and points misclassified with a high negative margin (less than −ρ − µ²) are ignored in the final classifier. Thus, our approach not only rejects points falling in the overlapping region of the classes, it also ignores potential outliers.

Condition | γ′*_n | γ″*_n
y_n(w^T φ(x_n) + b) ∈ (ρ + µ, ∞)       | 0            | 0
y_n(w^T φ(x_n) + b) = ρ + µ            | ∈ (0, Cd/µ)  | 0
y_n(w^T φ(x_n) + b) ∈ [ρ − µ², ρ + µ)  | Cd/µ         | 0
y_n(w^T φ(x_n) + b) ∈ (−ρ + µ, ρ − µ²) | 0            | 0
y_n(w^T φ(x_n) + b) = −ρ + µ           | 0            | ∈ (0, C(1 − d)/µ)
y_n(w^T φ(x_n) + b) ∈ [−ρ − µ², −ρ + µ)| 0            | C(1 − d)/µ
y_n(w^T φ(x_n) + b) ∈ (−∞, −ρ − µ²)    | 0            | 0

Table 2. Behavior of γ′* and γ″*.

We illustrate these insights through experiments on a synthetic dataset, shown in Fig. 3 (see also the data-generation sketch below). 400 points are uniformly sampled from the square region [0, 1] × [0, 1]. We consider the diagonal passing through the origin as the separating surface and assign labels {−1, +1} to all the points using it. We then changed the labels of 80 points inside a band (width 0.225) around the separating surface. Fig. 3 shows the reject option classifier learnt using the proposed method. We see that the proposed approach learns the rejection region accurately. We also observe that all of the support vectors are near the two parallel hyperplanes.

Fig. 3. Left: label noise affects points near the true classification boundary; classes are represented using empty circles and triangles. Right: reject option classifier learnt using the proposed L_DR based approach (C = 100, µ = 1, d = 0.2); filled circles and triangles represent the support vectors.
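The noisy dataset of Fig. 3 can be generated as in the following sketch (the seed and the interpretation of "width 0.225" as the full width of the band are our assumptions):

```python
import numpy as np

def make_noisy_band_data(n=400, n_flip=80, band_width=0.225, seed=0):
    """400 uniform points on [0,1]^2 labeled by the main diagonal, with the
    labels of 80 points inside a band around the diagonal flipped."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    y = np.where(X[:, 1] > X[:, 0], 1, -1)   # the diagonal x2 = x1 separates
    band = np.flatnonzero(np.abs(X[:, 1] - X[:, 0]) <= band_width / 2)
    flip = rng.choice(band, size=min(n_flip, band.size), replace=False)
    y[flip] = -y[flip]
    return X, y
```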
4 Experiments

We show the effectiveness of our approach through its performance on several datasets. We also compare our approach with the approach proposed in [7].
We report experimental results on two synthetic datasets and two datasets taken from the UCI ML repository [2].

1. Synthetic Dataset 1: Let f₁ and f₂ be two mixture density functions in R², each a mixture of three uniform densities, where U(A) denotes the uniform density function with support set A. We sample 150 points independently each from f₁ and f₂. We label these points using the hyperplane with w = [1 0]^T and b = 0. We choose 10% of these points uniformly at random and flip their labels.

2. Synthetic Dataset 2 [8] (see the sampling sketch after this list): Means m_k1, k = 1, …, 10, were drawn from N((1, 0)^T, I) and labeled as class C₁. Similarly, means m_k2, k = 1, …, 10, were drawn from N((0, 1)^T, I) and labeled as class C₂. For each class, 100 observations were drawn from the mixture distribution

    f(x | C_i) = (1/10) Σ_{k=1}^{10} N(m_ki, I/5),  i = 1, 2.

3. Ionosphere Dataset [2]:
This dataset describes the problem of discriminating good versus bad radar returns based on whether they send back useful information about the ionosphere. There are 34 variables and 351 observations.

4. Parkinsons Disease Dataset [2]:
This dataset is used to discriminate people with Parkinson's disease from healthy people. There are 22 features, comprising a range of biomedical voice measurements from individuals. There are 195 such feature vectors.
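Synthetic Dataset 2 can be sampled as in the following sketch, following the mixture construction of [8] described in item 2 above (the seed and function name are ours):

```python
import numpy as np

def make_mixture_data(n_per_class=100, seed=0):
    """10 class-conditional means from N((1,0)^T, I) and N((0,1)^T, I); each
    observation picks one of its class's means uniformly, adds N(0, I/5)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for label, centre in ((+1, [1.0, 0.0]), (-1, [0.0, 1.0])):
        means = rng.multivariate_normal(centre, np.eye(2), size=10)
        picked = means[rng.integers(0, 10, size=n_per_class)]
        noise = rng.multivariate_normal([0.0, 0.0], np.eye(2) / 5.0,
                                        size=n_per_class)
        X.append(picked + noise)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)
```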
In the proposed L_DR based approach, for solving the dual D^(l) at every iteration we have used the kernlab package [10] in R. We thank the authors of the L_DH based method [7] for providing the code for their approach. For nonlinear problems, we use the RBF kernel. In our approach, we set µ = 1. C and σ (the width parameter of the RBF kernel) are chosen using 10-fold cross validation.

For every dataset, we report results for values of d in the interval [0.05, 0.5] with a step size of 0.05. For every value of d, we find the cross-validation risk (under L_{0-d-1}), the % accuracy on the non-rejected examples (Acc), and the % rejection rate (RR). The results provided are based on 10 repetitions of 10-fold cross validation (CV). We show the average values and standard deviations (computed over the 10 repetitions).

We now discuss the experimental results. Fig. 4(a) shows Synthetic Dataset 1 and the true classification boundary. This dataset has some mislabeled points, creating noise around the classification surface. Figs. 4(b) and (c) show the classifiers learnt using the L_DR and L_DH based approaches, respectively, for d = 0.2. The L_DR based approach accurately finds the true classification boundary, as opposed to the L_DH based approach. Also, the reject region found by the L_DR based approach covers the most ambiguous region, unlike the L_DH based approach, which rejects almost all the points.

Fig. 4. (a) Synthetic Dataset 1 and the true classification boundary. Reject option classifiers learnt using (b) the proposed L_DR based approach for d = 0.2, and (c) the L_DH based approach for d = 0.2.

Table 3. Comparison results on Synthetic Dataset 1 (linear classifiers for both approaches; L_DR with C = 2, L_DH with C = 32). For each value of d, the table reports the CV risk, rejection rate (RR), and accuracy on non-rejected examples (Acc), as mean ± standard deviation.
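The quantities reported in Tables 3-6 can be computed from a classifier's ternary predictions as in the following sketch (names are ours; 0 encodes rejection):

```python
import numpy as np

def evaluate_reject_option(pred, y, d):
    """Risk under L_{0-d-1}, % rejection rate (RR), and % accuracy on the
    non-rejected examples (Acc).  pred contains +1, -1, or 0 (reject)."""
    pred, y = np.asarray(pred), np.asarray(y)
    rejected = pred == 0
    wrong = (~rejected) & (pred != y)
    risk = (d * rejected.sum() + wrong.sum()) / len(y)
    rr = 100.0 * rejected.mean()
    kept = ~rejected
    acc = 100.0 * (pred[kept] == y[kept]).mean() if kept.any() else float("nan")
    return risk, rr, acc
```

When every example is rejected, Acc is undefined (the NA entries in Table 3).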
Tables 3-6 show the experimental results on all the datasets. We observe the following:

1. The proposed L_DR based method outperforms the L_DH based approach in terms of the risk (the expectation of L_{0-d-1}). For Synthetic Dataset 1, the L_DR based method has lower CV risk except for two values of d (one of them d = 0.05). For Synthetic Dataset 2, both approaches perform comparably to each other. For the Ionosphere dataset, the L_DR based method has lower CV risk except for three values of d (one of them d = 0.25). For the Parkinsons dataset, the L_DR based method has lower CV risk except for one value of d.
2. The L_DR based method outputs classifiers with significantly smaller rejection rates for all the datasets and for all values of d.

Thus, for most of the cases, the proposed L_DR based approach outputs classifiers with smaller risk. Moreover, the learnt classifier always has a smaller rejection rate compared to the L_DH based approach.

Table 4. Comparison results on Synthetic Dataset 2 (nonlinear classifiers using the RBF kernel for both approaches; C = 64 for both, with cross-validated kernel width γ).

Table 5. Comparison results on the Ionosphere dataset (nonlinear classifiers using the RBF kernel for both approaches; L_DR with C = 2, L_DH with C = 16, with cross-validated kernel width γ).

Table 6. Comparison results on the Parkinsons Disease dataset (linear classifiers for both approaches; C = 32 for both).

5 Conclusion

In this paper, we have proposed a new loss function L_DR (the double ramp loss) for learning a reject option classifier. L_DR gives a tighter upper bound for L_{0-d-1} compared to the convex losses L_DH and L_GH. Our approach learns the classifier by minimizing the regularized risk under the double ramp loss, which becomes an instance of a DC optimization problem. Our approach can also learn nonlinear classifiers using an appropriate kernel function. Experimentally, we have shown that our approach is superior to the L_DH based approach for learning reject option classifiers.
References
1. Le Thi Hoai An and Pham Dinh Tao. Solving a class of linearly constrained indefinite quadratic problems by d.c. algorithms. Journal of Global Optimization, 11:253-285, 1997.
2. K. Bache and M. Lichman. UCI machine learning repository, 2013.
3. Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823-1840, June 2008.
4. C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41-46, January 1970.
5. Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In Proceedings of the First International Workshop on Pattern Recognition with Support Vector Machines, SVM '02, pages 68-82, 2002.
6. Aritra Ghosh, Naresh Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. CoRR, abs/1403.3610, 2014.
7. Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In NIPS, pages 537-544, 2008.
8. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics. Springer, New York, 2nd edition, 2009.
9. Radu Herbei and Marten H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709-721, December 2006.
10. Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab - an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1-20, November 2004.
11. Naresh Manwani and P. S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Systems, Man and Cybernetics: Part B, 43:1146-1151, March 2013.
12. Cheng Soon Ong and Le Thi Hoai An. Learning sparse classifiers with difference of convex functions algorithms. Optimization Methods and Software, (ahead-of-print):1-25, 2012.
13. Marten Wegkamp and Ming Yuan. Support vector machines with a reject option. Bernoulli, 17(4):1368-1385, 2011.
14. Marten H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155-168, 2007.
15. Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111-130, March 2010.
A Proof of Theorem 1

    L_DR(f(x), ρ, y) = (d/µ) { [µ − yf(x) + ρ]_+ − [−µ² − yf(x) + ρ]_+ } + ((1 − d)/µ) { [µ − yf(x) − ρ]_+ − [−µ² − yf(x) − ρ]_+ }

Interval | L_DR | L_{0-d-1}
yf(x) ∈ [ρ + µ, ∞)        | 0                                    | 0
yf(x) ∈ (ρ, ρ + µ)        | ∈ (0, d)                             | 0
yf(x) ∈ (ρ − µ², ρ]       | ∈ [d, (1 + µ)d)                      | d
yf(x) ∈ [−ρ + µ, ρ − µ²]  | (1 + µ)d                             | d
yf(x) ∈ [−ρ, −ρ + µ)      | ∈ ((1 + µ)d, (1 + µ)d + (1 − d)]     | d
yf(x) ∈ (−ρ − µ², −ρ)     | ∈ ((1 + µ)d + (1 − d), 1 + µ)        | 1
yf(x) ∈ (−∞, −ρ − µ²]     | 1 + µ                                | 1

Table 7. Proof for Theorem 1.(1).
1. Table 7 shows that L_DR ≥ L_{0-d-1} for all µ > 0, ρ ≥ 0.
2. We show that lim_{µ→0} L_DR(f(x), ρ, y) = L_{0-d-1}(f(x), ρ, y). We first see the values that L_DR takes for different values of yf(x); Table 8 shows how L_DR changes as a function of yf(x).

Interval | L_DR
yf(x) ∈ (ρ + µ, ∞)        | 0
yf(x) ∈ [ρ − µ², ρ + µ]   | (d/µ)(µ − yf(x) + ρ)
yf(x) ∈ (−ρ + µ, ρ − µ²)  | (1 + µ)d
yf(x) ∈ [−ρ − µ², −ρ + µ] | (1 + µ)d + ((1 − d)/µ)(µ − yf(x) − ρ)
yf(x) ∈ (−∞, −ρ − µ²)     | 1 + µ

Table 8. L_DR in different intervals (proof for Theorem 1.(2)).

Now we take the limit µ → 0, which is shown in Table 9. We see that lim_{µ→0} L_DR = L_{0-d-1}.

Interval | lim_{µ→0} L_DR | L_{0-d-1}
yf(x) ∈ (ρ, ∞)    | 0 | 0
yf(x) = ρ         | d | d
yf(x) ∈ (−ρ, ρ)   | d | d
yf(x) = −ρ        | 1 | 1
yf(x) ∈ (−∞, −ρ)  | 1 | 1

Table 9. lim_{µ→0} L_DR in different intervals (proof for Theorem 1.(2)).
3. In the rejection region yf(x) ∈ (−ρ + µ, ρ − µ²), the loss remains constant, that is, L_DR(f(x), ρ, y) = d(1 + µ). This can be seen in Table 8.
4. For µ > 0, L_DR ≤ (1 + µ) for all ρ ≥ 0 and d ≥ 0. This can be seen in Table 8.
5. When ρ = 0, L_DR becomes

    L_DR(f(x), 0, y) = (d/µ) { [µ − yf(x)]_+ − [−µ² − yf(x)]_+ } + ((1 − d)/µ) { [µ − yf(x)]_+ − [−µ² − yf(x)]_+ } = (1/µ) { [µ − yf(x)]_+ − [−µ² − yf(x)]_+ }

which is the same as the µ-ramp loss function used for classification problems without a rejection option.
6. We have to show that L_DR is a non-convex function of (yf(x), ρ). From part (4), we know that L_DR ≤ (1 + µ); that is, L_DR is bounded above. We show the non-convexity of L_DR by contradiction.
Suppose L_DR is a convex function of (yf(x), ρ). Let z = (yf(x), ρ), and rewrite L_DR(f(x), ρ, y) as L_DR(z). Choose two points z₁, z₂ such that L_DR(z₁) > L_DR(z₂). By the definition of convexity, for all λ ∈ (0, 1),

    L_DR(z₁) ≤ λ L_DR((z₁ − (1 − λ)z₂)/λ) + (1 − λ) L_DR(z₂),

so

    (L_DR(z₁) − (1 − λ)L_DR(z₂))/λ ≤ L_DR((z₁ − (1 − λ)z₂)/λ).

Now, since L_DR(z₁) > L_DR(z₂),

    (L_DR(z₁) − (1 − λ)L_DR(z₂))/λ = (L_DR(z₁) − L_DR(z₂))/λ + L_DR(z₂) → ∞ as λ → 0⁺.

Thus lim_{λ→0⁺} L_DR((z₁ − (1 − λ)z₂)/λ) = ∞. But L_DR is upper bounded by (1 + µ). This contradicts the assumption that L_DR is convex.

B Proof of Lemma 1
Suppose Θ′ = (w′, b′, ρ′) minimizes R(Θ), where ρ′ < 0; thus −ρ′ > 0. Consider Θ″ = (w′, b′, −ρ′) as another point. Then

    R(Θ′) − R(Θ″) = (C(1 − 2d)/µ) Σ_{n=1}^N { −[µ − y_n f(x_n) + ρ′]_+ + [−µ² − y_n f(x_n) + ρ′]_+ + [µ − y_n f(x_n) − ρ′]_+ − [−µ² − y_n f(x_n) − ρ′]_+ }
                  = C(1 − 2d) Σ_{n=1}^N { L_ramp(y_n f(x_n) + ρ′) − L_ramp(y_n f(x_n) − ρ′) }

where L_ramp(t) = (1/µ)([µ − t]_+ − [−µ² − t]_+) is a monotonically non-increasing function of t [12]. Since ρ′ < 0, we have y_n f(x_n) + ρ′ < y_n f(x_n) − ρ′ for all n, which implies L_ramp(y_n f(x_n) + ρ′) ≥ L_ramp(y_n f(x_n) − ρ′) for all n. Also (1 − 2d) ≥ 0, since 0 ≤ d ≤ 0.5. Thus R(Θ′) − R(Θ″) ≥ 0, which contradicts that Θ′ minimizes R(Θ). Thus, at the minimum of R(Θ), ρ must be non-negative.

C Derivation of the Dual Optimization Problem D^(l+1)

    P^(l+1): min_{w, b, ξ′, ξ″, ρ} (1/2)||w||² + (C/µ) Σ_{n=1}^N [ dξ′_n + (1 − d)ξ″_n ] + Σ_{n=1}^N β′^(l)_n [y_n(w^T φ(x_n) + b) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n(w^T φ(x_n) + b) + ρ]
    s.t.  y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ′_n,  ξ′_n ≥ 0,  n = 1 … N
          y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ″_n,  ξ″_n ≥ 0,  n = 1 … N

The Lagrangian for the above problem is

    L = (1/2)||w||² + (C/µ) Σ_{n=1}^N [ dξ′_n + (1 − d)ξ″_n ] + Σ_{n=1}^N β′^(l)_n [y_n(w^T φ(x_n) + b) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n(w^T φ(x_n) + b) + ρ] + Σ_{n=1}^N α′_n [ρ + µ − ξ′_n − y_n(w^T φ(x_n) + b)] − Σ_{n=1}^N η′_n ξ′_n + Σ_{n=1}^N α″_n [−ρ + µ − ξ″_n − y_n(w^T φ(x_n) + b)] − Σ_{n=1}^N η″_n ξ″_n

where α′_n is the dual variable corresponding to the constraint y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ′_n, α″_n is the dual variable corresponding to y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ″_n, η′_n is the dual variable corresponding to ξ′_n ≥ 0, and η″_n is the dual variable corresponding to ξ″_n ≥ 0.
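Setting the partial derivatives of L to zero and substituting γ′_n = α′_n − β′^(l)_n and γ″_n = α″_n − β″^(l)_n recovers D^(l+1). A sketch of these standard remaining steps (our completion of the derivation, consistent with the dual stated in Section 3):

    ∂L/∂w = w − Σ_{n=1}^N (α′_n − β′^(l)_n + α″_n − β″^(l)_n) y_n φ(x_n) = 0   ⇒   w = Σ_{n=1}^N y_n (γ′_n + γ″_n) φ(x_n)
    ∂L/∂b = −Σ_{n=1}^N (γ′_n + γ″_n) y_n = 0   ⇒   Σ_{n=1}^N y_n (γ′_n + γ″_n) = 0
    ∂L/∂ρ = Σ_{n=1}^N (γ′_n − γ″_n) = 0
    ∂L/∂ξ′_n = Cd/µ − α′_n − η′_n = 0,   ∂L/∂ξ″_n = C(1 − d)/µ − α″_n − η″_n = 0

Together with α′_n, α″_n, η′_n, η″_n ≥ 0, the last two conditions give 0 ≤ α′_n ≤ Cd/µ and 0 ≤ α″_n ≤ C(1 − d)/µ, i.e., −β′^(l)_n ≤ γ′_n ≤ Cd/µ − β′^(l)_n and −β″^(l)_n ≤ γ″_n ≤ C(1 − d)/µ − β″^(l)_n. Substituting w back into L and writing k(x_n, x_m) = φ(x_n)^T φ(x_m) yields the objective of D^(l+1).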