Double Ramp Loss Based Reject Option Classifier

Naresh Manwani, Kalpit Desai, Sanand Sasidharan, and Ramasubramanian Sundararajan

Data Mining Lab, GE Global Research, JFWTC, Whitefield, Bangalore-560066 ([email protected], [email protected]); Bidgely, Bangalore ([email protected]); Sabre Airline Solutions, Bangalore ([email protected])

Abstract.
We consider the problem of learning reject option classifiers. The goodness of a reject option classifier is quantified using the 0-d-1 loss, where a cost d ∈ (0, 0.5) is assigned for rejection. In this paper, we propose the double ramp loss function, which gives a continuous upper bound for the 0-d-1 loss. Our approach is based on minimizing regularized risk under the double ramp loss using difference of convex (DC) programming. We show the effectiveness of our approach through experiments on synthetic and benchmark datasets, where it performs better than state-of-the-art reject option classification approaches.
1 Introduction

The primary focus of classification problems has been on algorithms that return a prediction on every example. However, in many real-life situations, it may be prudent to reject an example rather than run the risk of a costly potential misclassification. Consider, for instance, a physician who has to return a diagnosis for a patient based on the observed symptoms and a preliminary examination. If the symptoms are either ambiguous, or rare enough to be unexplainable without further investigation, then the physician might choose not to risk misdiagnosing the patient (which might lead to further complications). He might instead ask for further medical tests to be performed, or refer the case to an appropriate specialist. Similarly, a banker, when faced with a loan application from a customer, may choose not to decide on the basis of the available information, and ask for a credit bureau score. While the follow-up actions might vary (asking for more features to describe the example, or using a different classifier), the principal response in these cases is to "reject" the example. This paper focuses on the manner in which this principal response is decided, i.e., which examples should a classifier reject, and why? From a geometric standpoint, we can view the classifier as being possessed of a decision surface (which separates points of different classes) as well as a rejection surface. The size of the rejection region impacts the proportion of cases that are likely to be rejected by the classifier, as well as the proportion of predicted cases that are likely to be correctly classified.
A well-optimized classifier with a reject option is one which minimizes the rejection rate as well as the misclassification rate on the predicted examples.

Let x ∈ R^p be the feature vector and y ∈ {−1, +1} be the class label, and let D(x, y) be the joint distribution of x and y. A typical reject option classifier is defined using a bandwidth parameter ρ and a separating surface f(x) = 0; ρ is the parameter which determines the rejection region. The reject option classifier h(f(x), ρ) is formed as

    h(f(x), ρ) = +1 if f(x) > ρ;  reject if |f(x)| ≤ ρ;  −1 if f(x) < −ρ.    (1)

The reject option classifier can be viewed as two parallel surfaces with the rejection area in between. The goal is to determine f(x) as well as ρ simultaneously. The performance of this classifier is evaluated using the loss L_{0-d-1} [13,9], which is

    L_{0-d-1}(f(x), y, ρ) = 1 if yf(x) < −ρ;  d if |f(x)| ≤ ρ;  0 otherwise.    (2)

In the above loss, d is the cost of rejection. If d = 0, then we will always reject. When d > 0.5, then we will never reject (because the expected loss of random labeling is 0.5). Thus, we always take d ∈ (0, 0.5). A classifier is learned such that the expectation of L_{0-d-1} with respect to D(x, y) (the risk) is minimized. Since D(x, y) is fixed but unknown, the empirical risk minimization principle is used. The risk under L_{0-d-1} is minimized by the generalized Bayes discriminant [9,4], which is as below:

    f*_d(x) = −1 if P(y = 1 | x) < d;  reject if d ≤ P(y = 1 | x) ≤ 1 − d;  +1 if P(y = 1 | x) > 1 − d.    (3)

h(f(x), ρ) (equation (1)) is shown to be infinite sample consistent with respect to the generalized Bayes classifier f*_d(x) described in equation (3) [15].

Loss | Function Definition
Generalized Hinge | L_GH(f(x), y) = 1 − ((1 − d)/d) yf(x) if yf(x) < 0;  1 − yf(x) if 0 ≤ yf(x) < 1;  0 otherwise
Double Hinge | L_DH(f(x), y) = max[ −(1 − d) yf(x) + H(d),  −d yf(x) + H(d),  0 ], where H(d) = −d log(d) − (1 − d) log(1 − d)

Table 1. Convex surrogates for L_{0-d-1}.
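To make the decision rule concrete, the following is a minimal sketch of equations (1) and (2); the function names and toy values are our own illustration, not code from the paper.

```python
def reject_option_classifier(f_x, rho):
    """Equation (1): predict +1 or -1, or reject (encoded as 0)."""
    if f_x > rho:
        return +1
    if f_x < -rho:
        return -1
    return 0  # |f(x)| <= rho: reject

def loss_0_d_1(f_x, y, rho, d):
    """Equation (2): 1 for a misclassification, d for a rejection, 0 otherwise."""
    if abs(f_x) <= rho:
        return d       # rejected
    if y * f_x < -rho:
        return 1.0     # confidently wrong
    return 0.0         # confidently right

# A score inside the band [-0.5, 0.5] is rejected and costs d.
print(reject_option_classifier(0.3, rho=0.5))    # -> 0
print(loss_0_d_1(0.3, y=+1, rho=0.5, d=0.2))     # -> 0.2
```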
Since minimizing the risk under L_{0-d-1} is computationally cumbersome, convex surrogates for L_{0-d-1} have been proposed. The generalized hinge loss L_GH (see Table 1) is a convex surrogate for L_{0-d-1} [13,14,3]; it is shown that a minimizer of risk under L_GH is consistent with the generalized Bayes classifier [3]. The double hinge loss L_DH (see Table 1) is another convex surrogate for L_{0-d-1} [7]; the minimizer of the risk under L_DH is shown to be strongly universally consistent with the generalized Bayes classifier [7].

Fig. 1. L_GH and L_DH for d = 0.2. (a) For ρ = 0.7, both losses upper bound L_{0-d-1}. (b) For ρ = 2, both losses fail to upper bound L_{0-d-1}. L_GH and L_DH both increase linearly even in the rejection region rather than being flat.

We observe that these convex loss functions have some limitations. For example, L_GH is a convex upper bound to L_{0-d-1} only provided ρ ≤ 1 − d, and L_DH forms an upper bound to L_{0-d-1} only provided ρ ∈ ((1 − H(d))/(1 − d), (H(d) − d)/d) (see Fig. 1). Also, both L_GH and L_DH increase linearly in the rejection region instead of remaining constant. These convex losses can become unbounded for misclassified examples under scaling of the parameters of f. Moreover, only limited experimental results are available to validate the practical significance of these losses [13,14,3,7]. A non-convex formulation for learning a reject option classifier is proposed in [5]; however, theoretical guarantees for the approach in [5] are not known.

While learning a reject option classifier, one has to deal with overlapping class regions as well as the presence of outliers. SVM and other convex loss based approaches are less robust to label noise and outliers in the data [11], and it is shown that ramp loss based risk minimization is more robust to noise [6]. Motivated by this, we propose the double ramp loss (L_DR), which incorporates a different loss value for rejection. L_DR forms a continuous nonconvex upper bound for L_{0-d-1} and overcomes many of the issues of the convex surrogates of L_{0-d-1}. To learn a reject option classifier, we minimize the regularized risk under L_DR, which becomes an instance of a difference of convex (DC) functions. To minimize such a DC function, we use the difference of convex programming approach [1], which essentially solves a sequence of convex programs. The proposed method has the following advantages over existing approaches: (1) the proposed loss function L_DR gives a tighter upper bound to L_{0-d-1}; (2) L_DR requires no constraint on ρ, unlike L_GH and L_DH; (3) our approach can be easily kernelized for dealing with nonlinear problems.

The rest of the paper is organized as follows. In Section 2 we define the double ramp loss function (L_DR), discuss its properties, and describe the proposed formulation based on risk minimization under L_DR. In Section 3 we derive the algorithm for learning a reject option classifier based on regularized risk minimization under L_DR using DC programming. We present experimental results in Section 4. We conclude the paper with a discussion in Section 5.

2 Double Ramp Loss (L_DR)

Our approach for learning a classifier with a reject option is based on minimizing regularized risk under L_DR (the double ramp loss). We define the double ramp loss as a continuous upper bound for L_{0-d-1}. This loss function is defined as a sum of two ramp loss functions as follows:

    L_DR(f(x), y, ρ) = (d/µ) { [µ − yf(x) + ρ]_+ − [−µ² − yf(x) + ρ]_+ } + ((1 − d)/µ) { [µ − yf(x) − ρ]_+ − [−µ² − yf(x) − ρ]_+ }    (4)

Fig. 2. L_DR and L_{0-d-1}: for all µ > 0, ρ ≥ 0, L_DR is an upper bound for L_{0-d-1} (shown for d = 0.2, ρ = 2 and µ ∈ {1, 0.5, 0.1}).
where [a]_+ = max(0, a), µ ∈ (0, 1] defines the slope of the ramps in the loss function, d ∈ (0, 0.5) is the cost of rejection, and ρ ≥ 0. As in L_{0-d-1}, L_DR also treats [−ρ, ρ] as the rejection region. Fig. 2 shows L_DR for d = 0.2, ρ = 2 with different values of µ.

Theorem 1. (1) L_DR ≥ L_{0-d-1} for all µ > 0, ρ ≥ 0. (2) lim_{µ→0} L_DR(f(x), ρ, y) = L_{0-d-1}(f(x), ρ, y). (3) In the rejection region yf(x) ∈ (−ρ + µ, ρ − µ²), the loss remains constant, that is, L_DR(f(x), y, ρ) = d(1 + µ). (4) For µ > 0, L_DR ≤ (1 + µ) for all ρ ≥ 0 and d ≥ 0. (5) When ρ = 0, L_DR is the same as the µ-ramp loss [12] used for classification problems without a rejection option. (6) L_DR is a non-convex function of (yf(x), ρ).

The proof of Theorem 1 is provided in Appendix A. We see that L_DR does not put any restriction on ρ for it to be an upper bound of L_{0-d-1}. Thus, L_DR is a general ramp loss function which also allows a rejection option.

2.1 Regularized Risk under L_DR

Let S = {(x_n, y_n), n = 1 … N} be the training dataset, where x_n ∈ R^p and y_n ∈ {−1, +1} for all n. As discussed, we minimize the regularized risk under L_DR to find a reject option classifier. In this paper, we use l₂ regularization. Let Θ = [w^T b ρ]^T. Thus, for f(x) = w^T φ(x) + b, the regularized risk under the double ramp loss is

    R(Θ) = (1/2)||w||² + C Σ_{n=1}^N L_DR(w^T φ(x_n) + b, y_n, ρ)
         = (1/2)||w||² + (C/µ) Σ_{n=1}^N { d[µ − y_n f(x_n) + ρ]_+ + (1 − d)[µ − y_n f(x_n) − ρ]_+ − d[−µ² − y_n f(x_n) + ρ]_+ − (1 − d)[−µ² − y_n f(x_n) − ρ]_+ }    (5)

where C is the regularization parameter. (While L_DR is parametrized by µ and d as well, we omit them for the sake of notational consistency.) While minimizing R(Θ), no non-negativity condition on ρ is required, due to Lemma 1 below.
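Before proceeding, here is a small numerical sketch of equation (4) and of Theorem 1(1). It follows the reconstruction of the loss given above (in particular the kinks at −µ²), and all names are ours.

```python
import numpy as np

def ramp_plus(a):
    """[a]_+ = max(0, a), elementwise."""
    return np.maximum(0.0, a)

def double_ramp_loss(margin, rho, d, mu):
    """Equation (4): L_DR as a function of the margin t = y f(x)."""
    t = np.asarray(margin, dtype=float)
    outer = ramp_plus(mu - t + rho) - ramp_plus(-mu**2 - t + rho)
    inner = ramp_plus(mu - t - rho) - ramp_plus(-mu**2 - t - rho)
    return (d / mu) * outer + ((1.0 - d) / mu) * inner

def loss_0_d_1(margin, rho, d):
    """L_{0-d-1} as a function of the margin."""
    t = np.asarray(margin, dtype=float)
    return np.where(np.abs(t) <= rho, d, np.where(t < -rho, 1.0, 0.0))

# Spot-check Theorem 1(1): L_DR upper-bounds L_{0-d-1} on a grid of margins.
d, mu, rho = 0.2, 0.5, 2.0
t = np.linspace(-5.0, 5.0, 1001)
assert np.all(double_ramp_loss(t, rho, d, mu) >= loss_0_d_1(t, rho, d) - 1e-12)
# On the flat part of the rejection region the loss equals d * (1 + mu).
print(double_ramp_loss(0.0, rho, d, mu))  # -> 0.3 = 0.2 * (1 + 0.5)
```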
Lemma 1. At the minimum of R(Θ), ρ must be non-negative.

The proof of the above lemma is provided in Appendix B.

R(Θ) (equation (5)) is a nonconvex function of Θ. However, R(Θ) can be written as R(Θ) = R₁(Θ) − R₂(Θ), where R₁(Θ) and R₂(Θ) are convex functions of Θ:

    R₁(Θ) = (1/2)||w||² + (C/µ) Σ_{n=1}^N { d[µ − y_n f(x_n) + ρ]_+ + (1 − d)[µ − y_n f(x_n) − ρ]_+ }
    R₂(Θ) = (C/µ) Σ_{n=1}^N { d[−µ² − y_n f(x_n) + ρ]_+ + (1 − d)[−µ² − y_n f(x_n) − ρ]_+ }

In this case, DC programming is guaranteed to find a local optimum of R(Θ) [1]. In the simplified DC algorithm [1], an upper bound on R(Θ) is found using the convexity of R₂(Θ) as follows:

    R(Θ) ≤ R₁(Θ) − R₂(Θ^(l)) − (Θ − Θ^(l))^T ∇R₂(Θ^(l)) =: ub(Θ, Θ^(l))    (6)

where Θ^(l) is the parameter vector after the l-th iteration and ∇R₂(Θ^(l)) is a subgradient of R₂ at Θ^(l). Θ^(l+1) is found by minimizing ub(Θ, Θ^(l)). Thus, R(Θ^(l+1)) ≤ ub(Θ^(l+1), Θ^(l)) ≤ ub(Θ^(l), Θ^(l)) = R(Θ^(l)), which means that in every iteration the DC program reduces the value of R(Θ).

3 Learning the Reject Option Classifier using DC Programming

In this section, we derive a DC algorithm for minimizing R(Θ). We initialize with Θ = Θ^(0). For any l ≥ 0, we find ub(Θ, Θ^(l)) as an upper bound for R(Θ) (see equation (6)). Given Θ^(l), we find Θ^(l+1) by minimizing the upper bound ub(Θ, Θ^(l)). Thus,

    Θ^(l+1) ∈ argmin_Θ ub(Θ, Θ^(l)) = argmin_Θ R₁(Θ) − Θ^T ∇R₂(Θ^(l))    (7)

where ∇R₂(Θ^(l)) is the subgradient of R₂(Θ) at Θ^(l). We choose ∇R₂(Θ^(l)) as

    ∇R₂(Θ^(l)) = Σ_{n=1}^N β′^(l)_n [−y_n φ(x_n)^T  −y_n  1]^T + Σ_{n=1}^N β″^(l)_n [−y_n φ(x_n)^T  −y_n  −1]^T

where

    β′^(l)_n = (Cd/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) − ρ^(l) < −µ² },
    β″^(l)_n = (C(1 − d)/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) + ρ^(l) < −µ² }.    (8)

For f(x) = w^T φ(x) + b, we rewrite the upper bound minimization problem described in equation (7) as follows:

    P^(l+1) = min_Θ R₁(Θ) − Θ^T ∇R₂(Θ^(l))
            = min_{w, b, ρ} (1/2)||w||² + (C/µ) Σ_{n=1}^N { d[µ − y_n f(x_n) + ρ]_+ + (1 − d)[µ − y_n f(x_n) − ρ]_+ } + Σ_{n=1}^N β′^(l)_n [y_n f(x_n) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n f(x_n) + ρ]

Note that P^(l+1) is a convex optimization problem in the variables (w, b, ρ). We rewrite P^(l+1) as

    P^(l+1) = min_{w, b, ξ′, ξ″, ρ} (1/2)||w||² + (C/µ) Σ_{n=1}^N [ dξ′_n + (1 − d)ξ″_n ] + Σ_{n=1}^N β′^(l)_n [y_n(w^T φ(x_n) + b) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n(w^T φ(x_n) + b) + ρ]
    s.t.  y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ′_n,  ξ′_n ≥ 0,  n = 1 … N
          y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ″_n,  ξ″_n ≥ 0,  n = 1 … N

where ξ′ = [ξ′₁ … ξ′_N]^T and ξ″ = [ξ″₁ … ξ″_N]^T. The dual optimization problem D^(l+1) of P^(l+1) is as follows:

    D^(l+1) = min_{γ′, γ″} (1/2) Σ_{n=1}^N Σ_{m=1}^N y_n y_m (γ′_n + γ″_n)(γ′_m + γ″_m) k(x_n, x_m) − µ Σ_{n=1}^N (γ′_n + γ″_n)
    s.t.  −β′^(l)_n ≤ γ′_n ≤ Cd/µ − β′^(l)_n,  n = 1 … N
          −β″^(l)_n ≤ γ″_n ≤ C(1 − d)/µ − β″^(l)_n,  n = 1 … N
          Σ_{n=1}^N y_n (γ′_n + γ″_n) = 0
          Σ_{n=1}^N (γ′_n − γ″_n) = 0    (9)

where γ′ = [γ′₁ … γ′_N]^T and γ″ = [γ″₁ … γ″_N]^T are the dual variables and k(x_n, x_m) = φ(x_n)^T φ(x_m). The derivation of the dual D^(l+1) is given in Appendix C. At the optimality of P^(l+1), w can be found as w = Σ_{n=1}^N y_n (γ′_n + γ″_n) φ(x_n). Since P^(l+1) has a quadratic objective and linear constraints, strong duality holds with D^(l+1). Solving D^(l+1) is more useful, as it can be easily kernelized for nonlinear problems.
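To illustrate the structure of one DC iteration, here is a sketch for the linear case (φ(x) = x): the coefficients β′, β″ of equation (8) are computed at the current iterate, and the convex subproblem P^(l+1) is then handed to a generic solver. We use cvxpy purely for illustration; the paper instead solves the dual D^(l+1). The −µ² thresholds follow the reconstruction above, and all names are ours.

```python
import numpy as np
import cvxpy as cp

def dc_iteration(X, y, w0, b0, rho0, C=1.0, d=0.2, mu=0.5):
    """One DC step: linearize R2 at (w0, b0, rho0) via beta', beta''
    (equation (8)) and minimize the convex upper bound P^{(l+1)}."""
    N, p = X.shape
    margins = y * (X @ w0 + b0)
    beta1 = (C * d / mu) * (margins - rho0 < -mu**2)        # beta'_n
    beta2 = (C * (1 - d) / mu) * (margins + rho0 < -mu**2)  # beta''_n

    w, b, rho = cp.Variable(p), cp.Variable(), cp.Variable()
    xi1, xi2 = cp.Variable(N, nonneg=True), cp.Variable(N, nonneg=True)
    scores = cp.multiply(y, X @ w + b)                      # y_n (w^T x_n + b)
    objective = (0.5 * cp.sum_squares(w)
                 + (C / mu) * cp.sum(d * xi1 + (1 - d) * xi2)
                 + beta1 @ (scores - rho) + beta2 @ (scores + rho))
    constraints = [scores >= rho + mu - xi1,
                   scores >= -rho + mu - xi2]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, b.value, rho.value
```

Iterating this step until Θ = (w, b, ρ) stops changing reproduces the overall structure of Algorithm 1 below.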
The behavior of γ′_n and γ″_n under the different cases is as follows:

    y_n(w^T φ(x_n) + b) − µ > ρ        ⇒  γ′_n = −β′^(l)_n;  γ″_n = −β″^(l)_n
    y_n(w^T φ(x_n) + b) − µ = ρ        ⇒  γ′_n ∈ (−β′^(l)_n, Cd/µ − β′^(l)_n);  γ″_n = −β″^(l)_n
    y_n(w^T φ(x_n) + b) − µ ∈ (−ρ, ρ)  ⇒  γ′_n = Cd/µ − β′^(l)_n;  γ″_n = −β″^(l)_n
    y_n(w^T φ(x_n) + b) − µ = −ρ       ⇒  γ′_n = Cd/µ − β′^(l)_n;  γ″_n ∈ (−β″^(l)_n, C(1 − d)/µ − β″^(l)_n)
    y_n(w^T φ(x_n) + b) − µ < −ρ       ⇒  γ′_n = Cd/µ − β′^(l)_n;  γ″_n = C(1 − d)/µ − β″^(l)_n

3.1 Finding b^(l+1) and ρ^(l+1)

The dual optimization problem above gives the dual variables γ′^(l+1) and γ″^(l+1), using which the normal vector is found as w^(l+1) = Σ_{n=1}^N y_n (γ′^(l+1)_n + γ″^(l+1)_n) φ(x_n). To find b^(l+1) and ρ^(l+1), we consider x_n ∈ SV′^(l+1) ∪ SV″^(l+1), where

    SV′^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = ρ^(l+1) + µ }
    SV″^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = −ρ^(l+1) + µ }

We already saw that:
1. If x_n ∈ SV′^(l+1), then γ′^(l+1)_n ∈ (−β′^(l)_n, Cd/µ − β′^(l)_n) and γ″^(l+1)_n = −β″^(l)_n.
2. If x_n ∈ SV″^(l+1), then γ′^(l+1)_n = Cd/µ − β′^(l)_n and γ″^(l+1)_n ∈ (−β″^(l)_n, C(1 − d)/µ − β″^(l)_n).

We solve the system of linear equations corresponding to the sets SV′^(l+1) and SV″^(l+1) for identifying b^(l+1) and ρ^(l+1), as in the sketch below.
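In practice this is a small two-variable linear system in (b, ρ); a least-squares sketch for the linear kernel (our own helper, not the paper's code):

```python
import numpy as np

def recover_b_rho(X_sv1, y_sv1, X_sv2, y_sv2, w, mu):
    """Solve  y_n (x_n^T w + b) =  rho + mu  for x_n in SV'
       and    y_n (x_n^T w + b) = -rho + mu  for x_n in SV''
       for (b, rho), in the least-squares sense if overdetermined."""
    rows, rhs = [], []
    for X_sv, y_sv, sign in ((X_sv1, y_sv1, -1.0), (X_sv2, y_sv2, +1.0)):
        for x_n, y_n in zip(X_sv, y_sv):
            rows.append([y_n, sign])                 # y_n * b + sign * rho
            rhs.append(mu - y_n * float(x_n @ w))    # = mu - y_n * x_n^T w
    (b, rho), *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return b, rho
```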
3.2 Overall Algorithm

We fix d ∈ (0, 0.5), µ ∈ (0, 1] and C, and initialize the parameter vector Θ as Θ^(0). In any iteration l, we find β′^(l)_n, β″^(l)_n, n = 1 … N (see equation (8)) using Θ^(l). We use these to solve D^(l+1) and find γ′^(l+1), γ″^(l+1). w^(l+1) is found as w^(l+1) = Σ_{n=1}^N y_n (γ′^(l+1)_n + γ″^(l+1)_n) φ(x_n). We find b^(l+1) and ρ^(l+1) as described in Section 3.1. Thus, we have found Θ^(l+1). Using Θ^(l+1), we now find β′^(l+1)_n, β″^(l+1)_n, n = 1 … N. We repeat the above steps until the parameter vector Θ no longer changes significantly. A more formal description of our algorithm is provided in Algorithm 1.

Algorithm 1: Learning Reject Option Classifier by Minimizing R(Θ)
Input: d ∈ (0, 0.5), µ ∈ (0, 1], C > 0, training set S
Output: w*, b*, ρ*
Initialize w^(0), b^(0), ρ^(0), l = 0
repeat
    Compute β′^(l)_n = (Cd/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) − ρ^(l) < −µ² }
            β″^(l)_n = (C(1 − d)/µ) 1{ y_n(φ(x_n)^T w^(l) + b^(l)) + ρ^(l) < −µ² }
    Find γ′^(l+1), γ″^(l+1) by solving D^(l+1) described in equation (9)
    Find w^(l+1) = Σ_{n=1}^N y_n (γ′^(l+1)_n + γ″^(l+1)_n) φ(x_n)
    Find b^(l+1) and ρ^(l+1) by solving the system of linear equations corresponding to the sets SV′^(l+1) and SV″^(l+1), where
        SV′^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = ρ^(l+1) + µ }
        SV″^(l+1) = { x_n | y_n(φ(x_n)^T w^(l+1) + b^(l+1)) = −ρ^(l+1) + µ }
until convergence of Θ^(l)

3.3 γ′ and γ″ at the Convergence of Algorithm 1

Let γ′*_n, γ″*_n, n = 1 … N, be the values of the dual variables at the convergence of Algorithm 1. The behavior of γ′*_n and γ″*_n is described in Table 2. For any x_n, only one of γ′*_n and γ″*_n can be nonzero. We observe that the parameters w, b and ρ are determined by the points whose margin yf(x) is in the range [ρ − µ², ρ + µ] ∪ [−ρ − µ², −ρ + µ]. We call these points support vectors. We also see that for x_n with y_n f(x_n) ∈ (ρ + µ, ∞) ∪ (−ρ + µ, ρ − µ²) ∪ (−∞, −ρ − µ²), both γ′*_n = 0 and γ″*_n = 0. Thus, points which are correctly classified with margin at least ρ + µ, points falling close to the decision boundary with margin in the interval (−ρ + µ, ρ − µ²), and points misclassified with a high negative margin (less than −ρ − µ²) are ignored in the final classifier. Thus, our approach not only rejects points falling in the overlapping region of the classes, it also ignores potential outliers.

Condition | γ′*_n | γ″*_n
y_n(w^T φ(x_n) + b) ∈ (ρ + µ, ∞)       | 0            | 0
y_n(w^T φ(x_n) + b) = ρ + µ            | ∈ (0, Cd/µ)  | 0
y_n(w^T φ(x_n) + b) ∈ [ρ − µ², ρ + µ)  | Cd/µ         | 0
y_n(w^T φ(x_n) + b) ∈ (−ρ + µ, ρ − µ²) | 0            | 0
y_n(w^T φ(x_n) + b) = −ρ + µ           | 0            | ∈ (0, C(1 − d)/µ)
y_n(w^T φ(x_n) + b) ∈ [−ρ − µ², −ρ + µ)| 0            | C(1 − d)/µ
y_n(w^T φ(x_n) + b) ∈ (−∞, −ρ − µ²)    | 0            | 0

Table 2. Behavior of γ′* and γ″*.

We illustrate these insights through experiments on a synthetic dataset, shown in Fig. 3 (see also the data-generation sketch below). 400 points are uniformly sampled from the square region [0, 1] × [0, 1]. We consider the diagonal passing through the origin as the separating surface and assign labels {−1, +1} to all the points using it. We then changed the labels of 80 points inside a band (width 0.225) around the separating surface. Fig. 3 shows the reject option classifier learnt using the proposed method. We see that the proposed approach learns the rejection region accurately. We also observe that all of the support vectors are near the two parallel hyperplanes.

Fig. 3. Left: label noise affects points near the true classification boundary; classes are represented using empty circles and triangles. Right: reject option classifier learnt using the proposed L_DR based approach (C = 100, µ = 1, d = 0.2); filled circles and triangles represent the support vectors.
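The noisy dataset of Fig. 3 can be generated as in the following sketch (the seed and the interpretation of "width 0.225" as the full width of the band are our assumptions):

```python
import numpy as np

def make_noisy_band_data(n=400, n_flip=80, band_width=0.225, seed=0):
    """400 uniform points on [0,1]^2 labeled by the main diagonal, with the
    labels of 80 points inside a band around the diagonal flipped."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    y = np.where(X[:, 1] > X[:, 0], 1, -1)   # the diagonal x2 = x1 separates
    band = np.flatnonzero(np.abs(X[:, 1] - X[:, 0]) <= band_width / 2)
    flip = rng.choice(band, size=min(n_flip, band.size), replace=False)
    y[flip] = -y[flip]
    return X, y
```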
4 Experiments

We show the effectiveness of our approach through its performance on several datasets. We also compare our approach with the approach proposed in [7].
We report experimental results on two synthetic datasets and two datasets taken from the UCI ML repository [2].

1. Synthetic Dataset 1: Let f₁ and f₂ be two mixture density functions in R², each a mixture of three uniform densities, where U(A) denotes the uniform density function with support set A. We sample 150 points independently each from f₁ and f₂. We label these points using the hyperplane with w = [1 0]^T and b = 0. We choose 10% of these points uniformly at random and flip their labels.

2. Synthetic Dataset 2 [8] (see the sampling sketch after this list): Means m_k1, k = 1, …, 10, were drawn from N((1, 0)^T, I) and labeled as class C₁. Similarly, means m_k2, k = 1, …, 10, were drawn from N((0, 1)^T, I) and labeled as class C₂. For each class, 100 observations were drawn from the mixture distribution

    f(x | C_i) = (1/10) Σ_{k=1}^{10} N(m_ki, I/5),  i = 1, 2.

3. Ionosphere Dataset [2]:
This dataset describes the problem of discriminating good versus bad radar returns based on whether they send back useful information about the ionosphere. There are 34 variables and 351 observations.

4. Parkinsons Disease Dataset [2]:
This dataset is used to discriminate people with Parkinson's disease from healthy people. There are 22 features, comprising a range of biomedical voice measurements from individuals. There are 195 such feature vectors.
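Synthetic Dataset 2 can be sampled as in the following sketch, following the mixture construction of [8] described in item 2 above (the seed and function name are ours):

```python
import numpy as np

def make_mixture_data(n_per_class=100, seed=0):
    """10 class-conditional means from N((1,0)^T, I) and N((0,1)^T, I); each
    observation picks one of its class's means uniformly, adds N(0, I/5)."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for label, centre in ((+1, [1.0, 0.0]), (-1, [0.0, 1.0])):
        means = rng.multivariate_normal(centre, np.eye(2), size=10)
        picked = means[rng.integers(0, 10, size=n_per_class)]
        noise = rng.multivariate_normal([0.0, 0.0], np.eye(2) / 5.0,
                                        size=n_per_class)
        X.append(picked + noise)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)
```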
In the proposed L_DR based approach, for solving the dual D^(l) at every iteration we have used the kernlab package [10] in R. We thank the authors of the L_DH based method [7] for providing the code for their approach. For nonlinear problems, we use the RBF kernel. In our approach, we set µ = 1. C and σ (the width parameter of the RBF kernel) are chosen using 10-fold cross validation.

For every dataset, we report results for values of d in the interval [0.05, 0.5] with a step size of 0.05. For every value of d, we find the cross-validation risk (under L_{0-d-1}), the % accuracy on the non-rejected examples (Acc), and the % rejection rate (RR). The results provided are based on 10 repetitions of 10-fold cross validation (CV). We show the average values and standard deviations (computed over the 10 repetitions).

We now discuss the experimental results. Fig. 4(a) shows Synthetic Dataset 1 and the true classification boundary. This dataset has some mislabeled points, creating noise around the classification surface. Figs. 4(b) and (c) show the classifiers learnt using the L_DR and L_DH based approaches, respectively, for d = 0.2. The L_DR based approach accurately finds the true classification boundary, as opposed to the L_DH based approach. Also, the reject region found by the L_DR based approach covers the most ambiguous region, unlike the L_DH based approach, which rejects almost all the points.

Fig. 4. (a) Synthetic Dataset 1 and the true classification boundary. Reject option classifiers learnt using (b) the proposed L_DR based approach for d = 0.2, and (c) the L_DH based approach for d = 0.2.

Table 3. Comparison results on Synthetic Dataset 1 (linear classifiers for both approaches; L_DR with C = 2, L_DH with C = 32). For each value of d, the table reports the CV risk, rejection rate (RR), and accuracy on non-rejected examples (Acc), as mean ± standard deviation.
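The quantities reported in Tables 3-6 can be computed from a classifier's ternary predictions as in the following sketch (names are ours; 0 encodes rejection):

```python
import numpy as np

def evaluate_reject_option(pred, y, d):
    """Risk under L_{0-d-1}, % rejection rate (RR), and % accuracy on the
    non-rejected examples (Acc).  pred contains +1, -1, or 0 (reject)."""
    pred, y = np.asarray(pred), np.asarray(y)
    rejected = pred == 0
    wrong = (~rejected) & (pred != y)
    risk = (d * rejected.sum() + wrong.sum()) / len(y)
    rr = 100.0 * rejected.mean()
    kept = ~rejected
    acc = 100.0 * (pred[kept] == y[kept]).mean() if kept.any() else float("nan")
    return risk, rr, acc
```

When every example is rejected, Acc is undefined (the NA entries in Table 3).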
Tables 3-6 show the experimental results on all the datasets. We observe the following:

1. The proposed L_DR based method outperforms the L_DH based approach in terms of the risk (the expectation of L_{0-d-1}). For Synthetic Dataset 1, the L_DR based method has lower CV risk except for two values of d (one of them d = 0.05). For Synthetic Dataset 2, both approaches perform comparably to each other. For the Ionosphere dataset, the L_DR based method has lower CV risk except for three values of d (one of them d = 0.25). For the Parkinsons dataset, the L_DR based method has lower CV risk except for one value of d.
2. The L_DR based method outputs classifiers with significantly smaller rejection rates for all the datasets and for all values of d.

Thus, for most of the cases, the proposed L_DR based approach outputs classifiers with smaller risk. Moreover, the learnt classifier always has a smaller rejection rate compared to the L_DH based approach.

Table 4. Comparison results on Synthetic Dataset 2 (nonlinear classifiers using the RBF kernel for both approaches; C = 64 for both, with cross-validated kernel width γ).

Table 5. Comparison results on the Ionosphere dataset (nonlinear classifiers using the RBF kernel for both approaches; L_DR with C = 2, L_DH with C = 16, with cross-validated kernel width γ).

Table 6. Comparison results on the Parkinsons Disease dataset (linear classifiers for both approaches; C = 32 for both).

5 Conclusion

In this paper, we have proposed a new loss function L_DR (the double ramp loss) for learning a reject option classifier. L_DR gives a tighter upper bound for L_{0-d-1} compared to the convex losses L_DH and L_GH. Our approach learns the classifier by minimizing the regularized risk under the double ramp loss, which becomes an instance of a DC optimization problem. Our approach can also learn nonlinear classifiers using an appropriate kernel function. Experimentally, we have shown that our approach is superior to the L_DH based approach for learning reject option classifiers.
References
1. Le Thi Hoai An and Pham Dinh Tao. Solving a class of linearly constrained indefinite quadratic problems by d.c. algorithms. Journal of Global Optimization, 11:253-285, 1997.
2. K. Bache and M. Lichman. UCI machine learning repository, 2013.
3. Peter L. Bartlett and Marten H. Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9:1823-1840, June 2008.
4. C. K. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41-46, January 1970.
5. Giorgio Fumera and Fabio Roli. Support vector machines with embedded reject option. In Proceedings of the First International Workshop on Pattern Recognition with Support Vector Machines, SVM '02, pages 68-82, 2002.
6. Aritra Ghosh, Naresh Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. CoRR, abs/1403.3610, 2014.
7. Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, and Stéphane Canu. Support vector machines with a reject option. In NIPS, pages 537-544, 2008.
8. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Series in Statistics. Springer, New York, 2nd edition, 2009.
9. Radu Herbei and Marten H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics, 34(4):709-721, December 2006.
10. Alexandros Karatzoglou, Alex Smola, Kurt Hornik, and Achim Zeileis. kernlab - an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1-20, November 2004.
11. Naresh Manwani and P. S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Systems, Man and Cybernetics: Part B, 43:1146-1151, March 2013.
12. Cheng Soon Ong and Le Thi Hoai An. Learning sparse classifiers with difference of convex functions algorithms. Optimization Methods and Software, (ahead-of-print):1-25, 2012.
13. Marten Wegkamp and Ming Yuan. Support vector machines with a reject option. Bernoulli, 17(4):1368-1385, 2011.
14. Marten H. Wegkamp. Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1:155-168, 2007.
15. Ming Yuan and Marten Wegkamp. Classification methods with reject option based on convex risk minimization. Journal of Machine Learning Research, 11:111-130, March 2010.
A Proof of Theorem 1

    L_DR(f(x), ρ, y) = (d/µ) { [µ − yf(x) + ρ]_+ − [−µ² − yf(x) + ρ]_+ } + ((1 − d)/µ) { [µ − yf(x) − ρ]_+ − [−µ² − yf(x) − ρ]_+ }

Interval | L_DR | L_{0-d-1}
yf(x) ∈ [ρ + µ, ∞)        | 0                                    | 0
yf(x) ∈ (ρ, ρ + µ)        | ∈ (0, d)                             | 0
yf(x) ∈ (ρ − µ², ρ]       | ∈ [d, (1 + µ)d)                      | d
yf(x) ∈ [−ρ + µ, ρ − µ²]  | (1 + µ)d                             | d
yf(x) ∈ [−ρ, −ρ + µ)      | ∈ ((1 + µ)d, (1 + µ)d + (1 − d)]     | d
yf(x) ∈ (−ρ − µ², −ρ)     | ∈ ((1 + µ)d + (1 − d), 1 + µ)        | 1
yf(x) ∈ (−∞, −ρ − µ²]     | 1 + µ                                | 1

Table 7. Proof for Theorem 1.(1).
1. Table 7 shows that L_DR ≥ L_{0-d-1} for all µ > 0, ρ ≥ 0.
2. We show that lim_{µ→0} L_DR(f(x), ρ, y) = L_{0-d-1}(f(x), ρ, y). We first see the values that L_DR takes for different values of yf(x); Table 8 shows how L_DR changes as a function of yf(x).

Interval | L_DR
yf(x) ∈ (ρ + µ, ∞)        | 0
yf(x) ∈ [ρ − µ², ρ + µ]   | (d/µ)(µ − yf(x) + ρ)
yf(x) ∈ (−ρ + µ, ρ − µ²)  | (1 + µ)d
yf(x) ∈ [−ρ − µ², −ρ + µ] | (1 + µ)d + ((1 − d)/µ)(µ − yf(x) − ρ)
yf(x) ∈ (−∞, −ρ − µ²)     | 1 + µ

Table 8. L_DR in different intervals (proof for Theorem 1.(2)).

Now we take the limit µ → 0, which is shown in Table 9. We see that lim_{µ→0} L_DR = L_{0-d-1}.

Interval | lim_{µ→0} L_DR | L_{0-d-1}
yf(x) ∈ (ρ, ∞)    | 0 | 0
yf(x) = ρ         | d | d
yf(x) ∈ (−ρ, ρ)   | d | d
yf(x) = −ρ        | 1 | 1
yf(x) ∈ (−∞, −ρ)  | 1 | 1

Table 9. lim_{µ→0} L_DR in different intervals (proof for Theorem 1.(2)).
3. In the rejection region yf(x) ∈ (−ρ + µ, ρ − µ²), the loss remains constant, that is, L_DR(f(x), ρ, y) = d(1 + µ). This can be seen in Table 8.
4. For µ > 0, L_DR ≤ (1 + µ) for all ρ ≥ 0 and d ≥ 0. This can be seen in Table 8.
5. When ρ = 0, L_DR becomes

    L_DR(f(x), 0, y) = (d/µ) { [µ − yf(x)]_+ − [−µ² − yf(x)]_+ } + ((1 − d)/µ) { [µ − yf(x)]_+ − [−µ² − yf(x)]_+ } = (1/µ) { [µ − yf(x)]_+ − [−µ² − yf(x)]_+ }

which is the same as the µ-ramp loss function used for classification problems without a rejection option.
6. We have to show that L_DR is a non-convex function of (yf(x), ρ). From part (4), we know that L_DR ≤ (1 + µ); that is, L_DR is bounded above. We show the non-convexity of L_DR by contradiction.
Suppose L_DR is a convex function of (yf(x), ρ). Let z = (yf(x), ρ), and rewrite L_DR(f(x), ρ, y) as L_DR(z). Choose two points z₁, z₂ such that L_DR(z₁) > L_DR(z₂). By the definition of convexity, for all λ ∈ (0, 1),

    L_DR(z₁) ≤ λ L_DR((z₁ − (1 − λ)z₂)/λ) + (1 − λ) L_DR(z₂),

so

    (L_DR(z₁) − (1 − λ)L_DR(z₂))/λ ≤ L_DR((z₁ − (1 − λ)z₂)/λ).

Now, since L_DR(z₁) > L_DR(z₂),

    (L_DR(z₁) − (1 − λ)L_DR(z₂))/λ = (L_DR(z₁) − L_DR(z₂))/λ + L_DR(z₂) → ∞ as λ → 0⁺.

Thus lim_{λ→0⁺} L_DR((z₁ − (1 − λ)z₂)/λ) = ∞. But L_DR is upper bounded by (1 + µ). This contradicts the assumption that L_DR is convex.

B Proof of Lemma 1
Suppose Θ′ = (w′, b′, ρ′) minimizes R(Θ), where ρ′ < 0; thus −ρ′ > 0. Consider Θ″ = (w′, b′, −ρ′) as another point. Then

    R(Θ′) − R(Θ″) = (C(1 − 2d)/µ) Σ_{n=1}^N { −[µ − y_n f(x_n) + ρ′]_+ + [−µ² − y_n f(x_n) + ρ′]_+ + [µ − y_n f(x_n) − ρ′]_+ − [−µ² − y_n f(x_n) − ρ′]_+ }
                  = C(1 − 2d) Σ_{n=1}^N { L_ramp(y_n f(x_n) + ρ′) − L_ramp(y_n f(x_n) − ρ′) }

where L_ramp(t) = (1/µ)([µ − t]_+ − [−µ² − t]_+) is a monotonically non-increasing function of t [12]. Since ρ′ < 0, we have y_n f(x_n) + ρ′ < y_n f(x_n) − ρ′ for all n, which implies L_ramp(y_n f(x_n) + ρ′) ≥ L_ramp(y_n f(x_n) − ρ′) for all n. Also (1 − 2d) ≥ 0, since 0 ≤ d ≤ 0.5. Thus R(Θ′) − R(Θ″) ≥ 0, which contradicts that Θ′ minimizes R(Θ). Thus, at the minimum of R(Θ), ρ must be non-negative.

C Derivation of the Dual Optimization Problem D^(l+1)

    P^(l+1): min_{w, b, ξ′, ξ″, ρ} (1/2)||w||² + (C/µ) Σ_{n=1}^N [ dξ′_n + (1 − d)ξ″_n ] + Σ_{n=1}^N β′^(l)_n [y_n(w^T φ(x_n) + b) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n(w^T φ(x_n) + b) + ρ]
    s.t.  y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ′_n,  ξ′_n ≥ 0,  n = 1 … N
          y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ″_n,  ξ″_n ≥ 0,  n = 1 … N

The Lagrangian for the above problem is

    L = (1/2)||w||² + (C/µ) Σ_{n=1}^N [ dξ′_n + (1 − d)ξ″_n ] + Σ_{n=1}^N β′^(l)_n [y_n(w^T φ(x_n) + b) − ρ] + Σ_{n=1}^N β″^(l)_n [y_n(w^T φ(x_n) + b) + ρ] + Σ_{n=1}^N α′_n [ρ + µ − ξ′_n − y_n(w^T φ(x_n) + b)] − Σ_{n=1}^N η′_n ξ′_n + Σ_{n=1}^N α″_n [−ρ + µ − ξ″_n − y_n(w^T φ(x_n) + b)] − Σ_{n=1}^N η″_n ξ″_n

where α′_n is the dual variable corresponding to the constraint y_n(w^T φ(x_n) + b) ≥ ρ + µ − ξ′_n, α″_n is the dual variable corresponding to y_n(w^T φ(x_n) + b) ≥ −ρ + µ − ξ″_n, η′_n is the dual variable corresponding to ξ′_n ≥ 0, and η″_n is the dual variable corresponding to ξ″_n ≥ 0.
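Setting the partial derivatives of L to zero and substituting γ′_n = α′_n − β′^(l)_n and γ″_n = α″_n − β″^(l)_n recovers D^(l+1). A sketch of these standard remaining steps (our completion of the derivation, consistent with the dual stated in Section 3):

    ∂L/∂w = w − Σ_{n=1}^N (α′_n − β′^(l)_n + α″_n − β″^(l)_n) y_n φ(x_n) = 0   ⇒   w = Σ_{n=1}^N y_n (γ′_n + γ″_n) φ(x_n)
    ∂L/∂b = −Σ_{n=1}^N (γ′_n + γ″_n) y_n = 0   ⇒   Σ_{n=1}^N y_n (γ′_n + γ″_n) = 0
    ∂L/∂ρ = Σ_{n=1}^N (γ′_n − γ″_n) = 0
    ∂L/∂ξ′_n = Cd/µ − α′_n − η′_n = 0,   ∂L/∂ξ″_n = C(1 − d)/µ − α″_n − η″_n = 0

Together with α′_n, α″_n, η′_n, η″_n ≥ 0, the last two conditions give 0 ≤ α′_n ≤ Cd/µ and 0 ≤ α″_n ≤ C(1 − d)/µ, i.e., −β′^(l)_n ≤ γ′_n ≤ Cd/µ − β′^(l)_n and −β″^(l)_n ≤ γ″_n ≤ C(1 − d)/µ − β″^(l)_n. Substituting w back into L and writing k(x_n, x_m) = φ(x_n)^T φ(x_m) yields the objective of D^(l+1).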