Bending the Curve: Improving the ROC Curve Through Error Redistribution
Oran Richman, Department of Electrical Engineering, Technion, Haifa, Israel. [email protected]
Shie Mannor, Department of Electrical Engineering, Technion, Haifa, Israel. [email protected]
September 21, 2018
Abstract
Classification performance is often not uniform over the data: some areas in the input space are easier to classify than others. Features that hold information about the "difficulty" of the data may be non-discriminative and are therefore disregarded in the classification process. We propose a meta-learning approach in which performance may be improved by post-processing. This improvement is done by establishing a dynamic threshold on the base classifier's results. Since the base classifier is treated as a "black box", the method presented can be used on any state-of-the-art classifier in order to try to improve its performance. We focus our attention on how to better control the true-positive/false-positive trade-off known as the ROC curve. We propose an algorithm for the derivation of optimal thresholds by redistributing the error depending on features that hold information about difficulty. We demonstrate the resulting benefit on both synthetic and real-life data.
1 Introduction

Binary classification is perhaps the most widely studied problem in machine learning, and many methods are used to obtain binary classifiers from data. For most applications two performance measures are of special interest. The first is the True Positive Rate (TPR): the portion of true positives that are classified as such by the classifier. The second is the False Positive Rate (FPR): the portion of true negatives that are classified as positive by the classifier.

There is a fundamental trade-off between these two measures. This trade-off is often controlled through thresholding: the classifier produces a continuous score for each sample, and a threshold is used to determine whether the sample is classified as positive (above the threshold) or negative (below the threshold). The pair (FPR, TPR) is the operating point of the resulting classifier.

The typical approach is to vary the threshold and obtain the complete curve of operating points, called the Receiver Operating Characteristic (ROC) curve [12]. The performance of the classifier is then evaluated based on the whole curve, using a specific operating point (i.e., a desired FPR level) or by considering the area under the curve (AUC). The AUC is an interesting measure since it has a probabilistic interpretation: the AUC of a classifier h(x): Rⁿ → R is the probability that for a random positive sample x⁺ and a random negative sample x⁻ the classifier will produce h(x⁺) > h(x⁻). In this paper we show that the thresholding approach can be refined such that performance can be improved without retraining the classifier.

Our approach is based on two observations. The first is that even after conditioning on the true class of the sample, the score is often correlated with some features (we will refer to them as auxiliary features). Moreover, those features may hold little or no discriminative information and are therefore disregarded during the learning process. For example, picture resolution may greatly affect the performance of object recognition [18], yet it is often uncorrelated with the picture content. The Discriminatively Trained Deformable Part Model classifier [10] is a popular state-of-the-art object detector; in this classifier, high-resolution pictures receive higher scores than low-resolution pictures [20].

The second observation is that the correlation with the score of positive examples and the correlation with the score of negative examples may be statistically different, even significantly so. We are mainly concerned with features that are correlated with the "difficulty" of the problem. The reference to "difficulty" implies some differential effect of those features on the scores of positive and negative examples; for example, the scores of the positive and negative examples become more or less concentrated. Revisiting the image-resolution example, the effect of reducing resolution on a real object's score differs from the effect on a random background image. This difference can be exploited to improve performance for a specific operating point.

For every desired operating point, we propose to use a threshold that depends on auxiliary features instead of being fixed for the entire input space: the threshold is a function instead of a constant as in the standard approach. The threshold "curve" can be designed so that performance is improved (i.e., higher TPR for a given FPR, or a lower FPR for a given TPR).
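To make these notions concrete, here is a minimal NumPy sketch (ours, not part of the original method description) of the operating point of a thresholded scorer, the empirical AUC in its pairwise-probability interpretation, and the proposed feature-dependent decision rule with one threshold per bin of the auxiliary feature:

```python
import numpy as np

def operating_point(scores, labels, k):
    """(FPR, TPR) of the rule 1{h(x) >= k} for labels in {-1, +1}."""
    pred = scores >= k
    return np.mean(pred[labels == -1]), np.mean(pred[labels == 1])

def empirical_auc(scores, labels):
    """AUC as P(h(x+) > h(x-)) over all positive/negative pairs.

    Uses O(n_pos * n_neg) memory; fine for a sketch."""
    pos, neg = scores[labels == 1], scores[labels == -1]
    return np.mean(pos[:, None] > neg[None, :])

def dynamic_predict(scores, bin_ids, k):
    """Meta-classifier: positive iff h(x) >= k_i, where i is the bin of x~."""
    return np.where(scores >= k[bin_ids], 1, -1)
```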
Our approach effectively rebalances the performance in different areas of the input space and redistributes the error.

A simple heuristic for determining the threshold (as a function of the features) is to eliminate the correlation between the adjusted score (the difference between the original score and the threshold) and the features. However, in the case where the positive and negative samples are affected differentially this is not trivial, and it requires estimating the conditional distribution of each class given the features.

The score can be adjusted either according to the positive examples or according to the negative examples. In the first case we use a threshold which follows the mean of the score of the negative examples; we refer to this approach as "constant false positive rate". Another approach is to use a threshold that follows the mean of the score of the positive examples; we refer to this approach as "constant true positive rate". An illustration of these approaches on a simple example can be seen in Figure 2. Both approaches, however, suffer from the same structural deficiency: some threshold "curve" is derived, and then the entire ROC curve is created by adding a fixed offset to it.

We present the Optimal Error Redistribution (OER) framework that "bends" the curve differently for different operating points. Our method is general and does not require any knowledge concerning the learning process used to train the classifier. The classifier is treated as a "black box", allowing one to "bend the curve" for a wide variety of classifiers.

Our method is based on an alternative view of the ROC curve. Instead of viewing the operating point as a consequence of a varying threshold, we can consider the following optimization: given some desired FPR, find the threshold curve (threshold as a function of the auxiliary features) that maximizes the TPR. This essentially treats the FPR as a resource which needs to be distributed between samples: easy examples will contribute (in expectation) a lower FPR than that contributed by the harder examples. This view allows introducing methods from the field of resource allocation (for example, methods from sensor management; a good review can be found in [15]).
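As a sketch (our illustration; per-bin empirical means stand in for the conditional means), the two baseline threshold curves can be computed as follows, with a fixed offset sweeping the operating point:

```python
import numpy as np

def baseline_thresholds(scores, labels, bin_ids, n_bins, offset, follow=-1):
    """Per-bin threshold: mean score of the followed class plus a fixed offset.

    follow=-1 tracks the negatives ("constant false positive rate");
    follow=+1 tracks the positives ("constant true positive rate").
    """
    k = np.empty(n_bins)
    for i in range(n_bins):
        k[i] = scores[(bin_ids == i) & (labels == follow)].mean() + offset
    return k
```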
Example 1

Consider the following case. A random variable X₁ is drawn uniformly from the set [1, …], and Y ∈ {−1, 1} is drawn such that Y = 1 with probability 0.5. The random variable X₂ is then drawn according to the following distribution: X₂ | y = 1 ∼ N(X₁, (σ₊)²), X₂ | y = −1 ∼ N(0, (σ₋)²), with σ₊ ≠ σ₋.

Figure 1: Data distribution of Example 1.

Since X₁ contains no discriminative information, a reasonable classifier for Y is h(X₁, X₂) = X₂ (using a linear classifier does not change the results significantly, but it makes the visual understanding of the following figures more difficult). Figure 2 shows dynamic thresholds (with respect to X₁) derived from the different approaches described above. The upper figure shows the curve matching the constant false positive approach; in this example it coincides with the original fixed (with respect to X₁) threshold. The middle figure shows the curve matching the constant true positive approach; this corresponds to a linear classifier which also uses the data in X₁. Both threshold curves are not optimal. The lower figure shows the optimal curves. It can be seen that for different operating points the curve "bends": when the example is "hard" to classify, the optimal threshold varies much more than when the example is "easy". Using a more complex classifier may produce different curves than those presented in these figures, but it will not be able to produce the "bending" effect.
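Since several constants of this example are unspecified here, the following is a hypothetical instantiation (all concrete numbers below are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x1 = rng.uniform(1.0, 5.0, size=n)        # assumed range of the auxiliary feature
y = rng.choice([-1, 1], size=n)           # P(Y = 1) = 0.5
mean = np.where(y == 1, x1, 0.0)          # positives centered at X1, negatives at 0
x2 = rng.normal(mean, np.where(y == 1, 1.5, 1.0))  # assumed sigma+ != sigma-
scores = x2                               # the base classifier h(x1, x2) = x2
```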
Figure 2: Threshold as a function of X₁ for the different approaches and for different operating points (Example 1).

Example 2

The optimal threshold may vary even when the mean and standard deviations do not depend on the features. This may happen when the prior changes, meaning that the ratio between the quantity of positive and negative examples is related to the auxiliary features. As an example, consider the following. A random variable Y ∈ {−1, 1} is drawn such that Y = 1 with probability 0.5. A random vector x = (X₁, X₂) is then drawn according to the following distribution: x | y = 1 ∼ N((0, µ₊), (σ₊)² I), x | y = −1 ∼ N((0, µ₋), (σ₋)² I), where I is the 2×2 unit matrix and σ₊ > σ₋. It is easy to see that a reasonable classifier for Y is h(X₁, X₂) = X₂. Observe that in this example the constant true positive and constant false positive approaches coincide and derive a constant threshold. Figure 3 shows the data distribution of this example along with some optimal dynamic thresholds (with respect to X₁). It can be seen that our method has essentially created a non-linear classifier for each desired operating point. As before, the different curves are not at a fixed offset from one another. Interestingly, for large enough |X₁| the prior is so significant that the optimal threshold is at −∞. This characterizes situations in which the standard deviation of the positive examples' score is larger than that of the negative examples' score.

Figure 3: Data distribution and some optimal thresholds for Example 2.
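A hypothetical instantiation of Example 2 (constants again assumed by us): because the positive class is assumed more spread out, P(y = 1 | X₁) grows with |X₁|, which is the changing prior discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
y = rng.choice([-1, 1], size=n)
mu2 = np.where(y == 1, 1.0, -1.0)          # assumed mu+ = 1, mu- = -1 in X2
sigma = np.where(y == 1, 2.0, 1.0)         # assumed sigma+ > sigma-
x = sigma[:, None] * rng.standard_normal((n, 2))   # isotropic classes
x[:, 1] += mu2                             # shift the X2 coordinate per class
scores = x[:, 1]                           # h(x1, x2) = x2; x1 is auxiliary
```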
Related work

Meta-learning [25] is concerned with the enhancement of classifiers. A meta-classifier takes a set of classifiers (base classifiers) and merges them in various ways to produce a unified classification result. The base classifiers are often trained using some variations of the same training set. This includes bagging [2], boosting [9] and many others (for example [3, 17]). Some works in this field specifically target the improvement of the ROC curve. In [22] the authors proposed the ROC Convex Hull (ROCCH) method. The ROCCH is based on the observation that, given two classifiers with different ROC curves, any point on the line segment between two operating points can be achieved by randomly using one or the other classifier with appropriate probabilities. This allows combining several classifiers to achieve an ROC curve which is better than that of each individual classifier. The method we present in this paper shares its basic approach with the field of meta-learning: in our case, the set of base classifiers is the base classifier equipped with different thresholds.

We differ, however, from existing work in this field in two important aspects. First, we use only a single classifier as input. Second, the auxiliary feature space can be completely different from the base classifier's feature space. We do not require any "re-training", and no access to the classifier's inner workings is needed. As a result, our method is much less sensitive to the way the original classifier was derived. This allows, in our view, much greater flexibility in applying the method. Note that we do require some training set to determine the dynamic thresholds; this set, however, can be different from the one used to train the classifier.

A different approach that specifically targets the improvement of the ROC curve is to build classifiers that optimize the area under the curve (AUC) directly [4, 16, 26]. Using various surrogates, the area under the curve can be optimized to derive some h_opt(x). The optimization is done with respect to some hypothesis class. Our method does not optimize the AUC but rather optimizes the ROC curve point by point. The resulting classifier, however, is in a different hypothesis class than the base classifier. The relation between h_opt(x) and the result of using our method on some h(x) is unclear, since the optimization of the two methods is done over different hypothesis classes. However, our method can even improve h_opt after it is derived using one of the AUC optimization methods. It is also important to note that while optimizing the AUC is possible only for a limited set of hypothesis classes, our method is general and can accommodate complex learning schemes.

Recent work has also explored different threshold choice methods [5, 7, 14]. A threshold choice method adjusts the threshold to accommodate changes in the cost functions or class distributions. Those methods share a similarity with the ideas presented in this paper. However, the setting which we explore is substantially different: in our setting the threshold may vary between different regions of the input space, with the goal of achieving maximal average performance, whereas the above-mentioned work explores the case where the threshold is used to adapt the base classifier in order to maximize current performance.

It is important to note that simply appending the auxiliary features to the feature vector will not produce the same result.
First, similarly to meta-learning, the resulting hypothesis class is significantly larger than that of the base classifier. Moreover, in many cases it is far from trivial to parametrize the resulting hypothesis class in a way that allows learning a "standard" classifier. As can be seen in Examples 1 and 2, the method presented allows creating complex classifiers using a simple (linear) base classifier. This also implies that simply treating the auxiliary features as ordinary features will often provide a much smaller benefit; it is also far from trivial to directly learn such complex hypothesis classes.

It is possible, obviously, to incorporate the ideas presented in this paper into the learning process of the base classifier. While such tight coupling may produce better results, such adoption is far from trivial for most learning schemes. The method presented treats the classifier as a "black box"; therefore, it can easily be incorporated on top of any existing classifier. As mentioned before, in our method the threshold is a function of the auxiliary features. If the base classifier were "smart enough" to use the full information contained in those features, then the method would produce no benefit. As we will see in the following, this is often not the case, especially when the features have low correlation with the true class.

Our contributions are threefold. First, we introduce a novel framework in which the threshold may vary over the input space. Second, we introduce the Optimal Error Redistribution (OER) method, which allows the creation of a meta-classifier with an improved ROC curve compared with the base classifier; in addition, we derive a closed-form solution of the optimal threshold for the special case of Gaussian distributions, and we present simulations which demonstrate the benefit that may arise. Finally, we present a feature selection technique (for OER) that allows selecting the auxiliary features without the explicit calculation of the ROC curve.

We believe that the method presented in this paper should become a standard tool in ROC analysis. It is always beneficial to try to improve the ROC curve some more, and our method proposes a generic way to do so.

This paper is structured as follows: Section 2 defines the problem formally and provides the general OER method. Section 3 details a simple implementation and provides a closed-form solution for a special case. Section 4 outlines a feature selection technique which allows selecting features for the method without the explicit calculation of the ROC curve. Section 5 demonstrates the feasibility of the method on real-life data, while Section 6 concludes the paper with some final thoughts and some still-open questions.

2 Problem Formulation and the OER Method

Consider binary classification of objects represented by some vector x ∈ Rⁿ. The base classifier is based on some function h(x): Rⁿ → R. In the original classification scheme a threshold is used to transform the output of the function into a binary classification: a sample is classified as positive if h(x) ≥ k and negative otherwise. We allow the threshold to depend on some auxiliary feature vector x̃. Notice that x̃ should not be confused with the vector x that represents the data: the feature vector x̃ can be some subset of x, or it can be measured separately from the raw data (as in the example of picture resolution).

We would like to find some function k(x̃) which assigns a threshold to each example. We approximate this function by partitioning the feature space into N bins.
Each bin can be assigned a different threshold. The determination of a continuous function k(x̃) is possible in a special case which is outlined in Section 3.1. Formally, the data distribution is modelled as a superposition of N populations {A_i, i = 1, …, N}. The auxiliary feature vector x̃ deterministically determines the population from which the example was taken. In the derived meta-classifier the original scalar threshold k is replaced with a vector (k_1, …, k_N): a sample x ∈ A_i is classified as positive if h(x) ≥ k_i and negative otherwise.

In each population the score distribution obeys the following:

h(x) | x ∈ A_i, y = 1 ∼ f_i,    h(x) | x ∈ A_i, y = −1 ∼ g_i,    (1)

where f_i and g_i are probability density functions. Denote the corresponding cumulative distribution functions by F_i and G_i. Further, let p⁺_i = P(x ∈ A_i | y = 1) and p⁻_i = P(x ∈ A_i | y = −1). The optimal thresholds for a desired FPR level C are then the solution of

max_{(k_1, …, k_N)} Σ_{i=1}^N p⁺_i (1 − F_i(k_i))    s.t.    Σ_{i=1}^N p⁻_i (1 − G_i(k_i)) = C.    (2)

This problem can be non-concave, and finding the global maximum may be hard [23]. We can, however, use an equivalent form of problem (2) to construct a gradient-ascent algorithm that will lead us to a local maximum. Instead of solving problem (2) we will solve the following problem for some λ > 0:

max_{(k_1, …, k_N)} Σ_{i=1}^N p⁺_i (1 − F_i(k_i)) − λ Σ_{i=1}^N p⁻_i (1 − G_i(k_i)).    (3)

It is known that for both problems a necessary condition for a vector (k_1, …, k_N) to be a solution is given by

p⁺_i f_i(k_i) = λ p⁻_i g_i(k_i).    (4)

We will call the expression

p⁺_i f_i(k_i) / (p⁻_i g_i(k_i))    (5)

the benefit-cost ratio. For a threshold vector (k_1, …, k_N) to be optimal, the benefit-cost ratio should be constant between populations.

The OER algorithm is given by Algorithm 1. As we will see in the following, for the special case where f_i and g_i are Gaussian with the same variance, it is possible to derive a closed-form solution for the global maximum.
Algorithm 1 OER
Parameters: ζ (learning rate), ε (stopping threshold).
Input: f, g, p⁺, p⁻, λ. All vector operations are done point-wise.
  Δ ← 1, k ← (0, 0, …, 0)
  while Δ > ε do
    k ← k − ζ [p⁺ f(k) − λ p⁻ g(k)]
    Δ ← ‖p⁺ f(k) − λ p⁻ g(k)‖
  end while
  return k

The necessary condition (4) implies that for the optimal threshold the benefit-cost ratio is constant between populations. Notice that since we would like to derive the complete ROC curve, there is no need to solve the problem for different values of C: we can use the common benefit-cost ratio λ as a parameter and derive the ROC curve by varying λ. A specific operating point can then be chosen for implementation.

The method presented can also be interpreted from a calibration perspective. Calibration is used to transform classifier outputs into posterior probabilities [13, 21]. One popular calibration method, known as Platt calibration, fits a sigmoid model to the data [21]: the method finds two parameters a and b such that the posterior probability fits as well as possible to

P(y = 1 | h(x)) = 1 / (1 + exp(a·h(x) + b)).

Earlier work has used a Gaussian fit as the base distribution [13]. Our method (with a slight modification, since we also use a Gaussian as our base distribution) can be viewed as an extension of Platt calibration where the two scalars a and b are replaced with two functions of the auxiliary features. This results in:

P(y = 1 | h(x)) = 1 / (1 + exp(a(x̃)·h(x) + b(x̃))).

The posterior probabilities can then be compared to a threshold such that the resulting classifier is equivalent to that obtained by our method. While this interpretation of our method is valid, we believe that the interpretation detailed in this paper is clearer and easier to implement. Some previous work by Vapnik [24] considered a calibration method which is not uniform over the sample space; however, that method is limited to Support Vector Machines (SVMs) and uses the original feature space with no auxiliary features. Our method is much more general.
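As a minimal sketch, Algorithm 1 translates to NumPy almost line by line; f and g are assumed to be vectorized per-bin density evaluations (for instance built from scipy.stats.norm.pdf):

```python
import numpy as np

def oer(f, g, p_pos, p_neg, lam, zeta=0.01, eps=1e-6, max_iter=100_000):
    """Gradient ascent for problem (3); f(k) and g(k) evaluate the per-bin
    densities f_i, g_i at the thresholds k_i, element-wise."""
    k = np.zeros_like(p_pos, dtype=float)
    for _ in range(max_iter):
        step = p_pos * f(k) - lam * p_neg * g(k)   # bracketed term in Algorithm 1
        k = k - zeta * step
        if np.linalg.norm(step) <= eps:            # Delta in Algorithm 1
            break
    return k
```

For instance, under a Gaussian model one could pass f = lambda k: scipy.stats.norm.pdf(k, mu_p, sig_p) with per-bin moment vectors mu_p and sig_p.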
3 Implementation with a Gaussian Model

The OER method presented earlier is general and flexible. There are two main design choices: first, choosing the auxiliary features and the corresponding populations A_i; second, choosing a model for f_i(y) and g_i(y) and a corresponding method for fitting the data. Section 4 provides a heuristic method for choosing auxiliary features; this question, however, is still open and a topic for future research. One simple model for f_i(y) and g_i(y) is a Gaussian model for the conditional behaviour of the score. Formally, the Gaussian model is stated as:

h(x) | x ∈ A_i, y = 1 ∼ N(µ⁺_i, (σ⁺_i)²),    h(x) | x ∈ A_i, y = −1 ∼ N(µ⁻_i, (σ⁻_i)²).

One of the main benefits of using such a model is that it requires only the estimation of the first and second moments, both of which can easily be estimated for each bin.

The necessary condition for an extremum now takes the form

(p⁺_i / σ⁺_i) exp(−(k_i − µ⁺_i)² / (2(σ⁺_i)²)) = λ (p⁻_i / σ⁻_i) exp(−(k_i − µ⁻_i)² / (2(σ⁻_i)²)),    (6)

where k_i is the threshold for the desired classifier. The benefit-cost ratio is

(p⁺_i σ⁻_i / (p⁻_i σ⁺_i)) exp(−(k_i − µ⁺_i)² / (2(σ⁺_i)²) + (k_i − µ⁻_i)² / (2(σ⁻_i)²)).    (7)

An illustration of the benefit-cost ratio for different relations between σ⁺_i and σ⁻_i can be seen in Figure 4. Notice that when σ⁺_i = σ⁻_i the ratio (7) is strictly monotone in k_i. Therefore, if σ⁺_i = σ⁻_i for all i, then (6) admits a single solution for every λ, and a closed-form solution to the optimization problem can be derived. This, however, is not the case in general.

Figure 4: Benefit-cost ratio for different parameters of a Gaussian distribution.

In the general case multiple extremum points may exist, and therefore local optimization methods need to be used. Notice also that if σ⁺_i > σ⁻_i then the benefit-cost ratio has a minimum; therefore, for a large enough FPR the optimal threshold is −∞. Similarly, if σ⁺_i < σ⁻_i then the benefit-cost ratio has a maximum, and for a small enough FPR the optimal threshold is ∞.

A solution to the optimization problem can be found by using OER (Algorithm 1). The gradient is given by

∇_i = (p⁺_i / σ⁺_i) exp(−(k_i − µ⁺_i)² / (2(σ⁺_i)²)) − λ (p⁻_i / σ⁻_i) exp(−(k_i − µ⁻_i)² / (2(σ⁻_i)²)).    (8)

Since the optimal threshold may be −∞ (when σ⁺_i > σ⁻_i) or ∞ (when σ⁺_i < σ⁻_i), it is advised at each step to project the threshold onto some fixed interval [−K, K] so that the gradient-ascent method converges.
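A sketch of this Gaussian specialization (our code; it uses the gradient (8) without the 1/√(2π) factors, which only rescale the learning rate) with the projection step advised above:

```python
import numpy as np

def gaussian_oer(mu_p, sig_p, mu_n, sig_n, p_pos, p_neg, lam,
                 zeta=0.01, eps=1e-6, K=10.0, max_iter=100_000):
    """Per-bin thresholds under the Gaussian model, projected onto [-K, K]."""
    k = np.zeros_like(mu_p, dtype=float)
    for _ in range(max_iter):
        f_k = p_pos / sig_p * np.exp(-(k - mu_p) ** 2 / (2 * sig_p ** 2))
        g_k = p_neg / sig_n * np.exp(-(k - mu_n) ** 2 / (2 * sig_n ** 2))
        grad = f_k - lam * g_k                 # Eq. (8)
        k = np.clip(k - zeta * grad, -K, K)    # projected gradient step
        if np.linalg.norm(grad) <= eps:
            break
    return k
```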
Example 2 revisited
We used the described method on Example 2. We divided the X₁ values into 120 bins, where x ∈ A_i if −6 + 0.1(i − 1) < X₁ < −6 + 0.1·i. Two additional bins were used for the intervals X₁ < −6 and X₁ >
6. We generated a data set of 20000 data points and tested the method. The results can be seen in Figure 5. It can be seen that the method presents a significant benefit over the two other approaches. As mentioned before, for sufficiently large |X₁| the threshold is −∞: since σ⁺_i > σ⁻_i, the benefit-cost ratio admits a minimum, and when the desired benefit-cost ratio is below the minimum possible value it is always desirable to trade more TPR for more FPR. Notice that in those bins the calculation of h(x) is useless and can be avoided, thereby reducing the computation resources needed.

Direct comparison to AUC optimization methods (like [16]) is inappropriate, since the outcome is highly sensitive to the hypothesis set over which the AUC is optimized. It is clear from Figure 3 that, despite the fact that our base classifier is linear, no linear classifier can achieve decent performance. Optimizing the AUC over a different hypothesis class may produce better results than ours; however, using that classifier as our base classifier and employing OER may improve it even further, or at least will not reduce its performance.

3.1 A Closed-Form Solution for Equal Variances

In some cases we can assume that the auxiliary features affect only the expectation of the score and do not affect the variance of the positive and negative samples; formally, σ⁻_i = σ⁺_i for all i. In this case problem (2) can be solved directly. The solution is given by

k_i = (σ⁺_i)² (log(p⁻_i / p⁺_i) + λ) / (µ⁺_i − µ⁻_i) + (µ⁺_i + µ⁻_i) / 2,    (9)

where −∞ < λ < ∞. The ROC curve can then be derived by calculating the optimal threshold for different values of λ ranging from −∞ to ∞.
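A one-function sketch of (9) (our code, under the convention that sig is the common per-bin standard deviation):

```python
import numpy as np

def closed_form_thresholds(mu_p, mu_n, sig, p_pos, p_neg, lam):
    """Eq. (9): per-bin optimal thresholds when sigma+_i == sigma-_i."""
    return sig ** 2 * (np.log(p_neg / p_pos) + lam) / (mu_p - mu_n) \
        + (mu_p + mu_n) / 2.0

# Trace the ROC curve by sweeping lam, e.g. over np.linspace(-20, 20, 200).
```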
Figure 5: ROC curves of Example 2 for the different approaches (constant false positive, constant true positive, optimal).

Remark 1. For this special case the extension to an infinite number of bins is straightforward. Instead
of fitting a Gaussian model to each bin, it is possible to estimate functions µ⁺(x̃) and µ⁻(x̃) that represent the mean score as a function of the features for the positive and negative examples, respectively. Similarly, the functions σ⁺(x̃), σ⁻(x̃), p⁺(x̃) and p⁻(x̃) should be estimated. All of these functions can be estimated using conventional parametric estimation methods (for example, maximizing the log-likelihood). The optimal threshold for each example can then be calculated using (9) by substituting µ⁺(x̃) for µ⁺_i, µ⁻(x̃) for µ⁻_i, and so on.
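A sketch of this continuous variant (our illustration, with a least-squares linear model standing in for "conventional parametric estimation"):

```python
import numpy as np

def fit_linear_mean(aux, scores, labels, cls):
    """Least-squares fit of the class-conditional mean score as a
    linear function of the auxiliary feature."""
    m = labels == cls
    A = np.column_stack([aux[m], np.ones(m.sum())])
    coef, *_ = np.linalg.lstsq(A, scores[m], rcond=None)
    return lambda a: coef[0] * a + coef[1]   # mu(aux)

# mu_pos = fit_linear_mean(aux, scores, labels, 1)
# mu_neg = fit_linear_mean(aux, scores, labels, -1)
# These plug into Eq. (9) point-wise in place of mu+_i and mu-_i.
```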
Example 1 revisited

We used the described method on Example 1. We divided the X₁ values into 8 equal-width bins x ∈ A_i, and took µ⁺_i to be the corresponding bin center (since X₂ | y = 1 is centered at X₁). Notice that we neglected the fact that σ⁺_i ≠ σ⁻_i. We generated a data set of 20000 data points and tested the method. The results are presented in Figure 6. It can be seen that the method presents a significant benefit over the two other approaches. Notice also that the derived ROC curve outperforms the convex hull of the two other methods, and therefore outperforms the ROCCH method.
Figure 6: ROC curve of Example 1
4 Feature Selection

One important question that arises in the context of OER is how to choose auxiliary features that provide the most benefit. Using features that do not contain relevant information may degrade performance due to over-fitting. The simplest approach is probably to use knowledge about the domain of the problem and consider features that may impact the problem's difficulty. In image classification these can be, for example, the picture's size, lighting conditions, etc. In speaker verification, difficulty is often related to the type of recording device: as the quality of the recording gets better, classification gets easier, so using the type of recording device as an auxiliary feature seems natural for this setting. Other examples include document length in spam filtering, channel characteristics in communication, distance from target in remote sensing, and many more.

Another obvious approach is to enumerate over potential options: for each feature, apply OER, then calculate the derived ROC and choose the features that give the most benefit. Sufficient estimation of the ROC, however, requires a large amount of labelled data; in certain cases labelled data are scarce, and the estimation of the ROC is therefore prone to errors.

An alternative approach is to use the modelling process to uncover potential auxiliary features. Looking at the benefit-cost ratio provides the necessary insight into the elements of the model that impact performance. One measure that can be proposed is the difference in separation difficulty. The separation difficulty (SD) is defined as the number of standard deviations between the means of the positive and negative examples, namely

SD_i = (µ⁺_i − µ⁻_i) / (σ⁺_i + σ⁻_i).

The difference in separation difficulty can then be defined as var(SD_i), where the variance is taken with respect to the data's distribution. A large difference causes significant bending of the curve for different operating points. While this does not guarantee a significant benefit on the ROC, it implies a potential for such benefit. Example 1 demonstrates the feasibility of this measure.

Another measure is the difference in the prior. The prior of bin i (denoted by P_i) can be defined as

P_i = log(p⁺_i σ⁻_i / (p⁻_i σ⁺_i)).

As before, the difference in prior can be taken to be var(P_i), where the variance is taken with respect to the data distribution. A large difference indicates that there might be a potential for significant benefits. Example 2 demonstrates the feasibility of this measure.

These measures allow establishing a feature selection mechanism (a code sketch closes this section): first, enumerate over possible features; for each feature, partition the space into bins and measure the difference in separation difficulty and the difference in prior. Only features for which those measures exceed some threshold should be used for OER. In the spirit of supervised PCA [1], a further reduction of the feature space's dimension can be achieved by using only the few main principal components of the remaining features. The resulting feature space can then be divided into bins and OER can be applied.

It is important to calibrate the number of bins to the amount of training data available. Using too few bins leads to a mismatch between the data and the model, and therefore to sub-optimal performance (which may even be worse than the original classifier); using too many bins may lead to over-fitting. We advise using cross-validation in order to optimize the number of bins.
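A compact sketch of the two screening measures (our code; the operator combining σ⁺_i and σ⁻_i in SD_i is assumed to be a sum, and the variance across bins is unweighted for simplicity):

```python
import numpy as np

def screening_measures(scores, labels, bin_ids, n_bins):
    """var(SD_i) and var(P_i) across bins, for auxiliary-feature screening."""
    sd, prior = np.empty(n_bins), np.empty(n_bins)
    n_pos, n_neg = np.sum(labels == 1), np.sum(labels == -1)
    for i in range(n_bins):
        sp = scores[(bin_ids == i) & (labels == 1)]
        sn = scores[(bin_ids == i) & (labels == -1)]
        sd[i] = (sp.mean() - sn.mean()) / (sp.std() + sn.std())
        prior[i] = np.log((sp.size / n_pos) * sn.std()
                          / ((sn.size / n_neg) * sp.std()))
    return np.var(sd), np.var(prior)
```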
5 Experimental Results

In addition to the results described earlier on synthetic examples, we demonstrate our method's potential benefit on real-life data. First we tested the method using the UCI "Adult" dataset [19]. In this dataset the goal is to predict whether a person's income exceeds 50
K/yr based on census data. We used an SVM as the base classifier. As the auxiliary feature we selected the number of years of education. This selection was made by reviewing the difference in separation difficulty and the difference in prior of all available features, as explained in Section 4. It is possible that choosing more than one auxiliary feature would improve the results. Figure 7 shows the derived ROC curves, and Figure 8 shows a zoom-in on the ROC. It is clearly visible that the derived ROC curve is always better than the original. The AUC improves from 0.878 for the baseline SVM to 0.9028 for the derived classifier, a 20.
33% improvement in 1 − AUC. It should be noted that since some of the input features are categorical, the ROC curve is highly sensitive; the results shown are averaged over ten-fold cross-validation. Note that in all conducted experiments OER outperforms the original classifier (0.001 p-value with the sign test).

Taking a closer look at the data distribution and the derived thresholds shows that the improvement is made by keeping the threshold low in the "easy" bins and increasing it in the more "difficult" bins. It can be seen that the benefit arises even though the data distribution isn't Gaussian; it is possible that using a different distribution for modelling would produce better results. The data distribution, as well as three possible thresholds, can be seen in Figure 9.
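An end-to-end sketch of this experiment as we read it (X, y with labels in {−1, +1} and education_years are assumed preloaded and numeric; the bin edges are assumptions; gaussian_oer is the sketch from Section 3):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_bin_moments(scores, labels, bin_ids, n_bins):
    """Per-bin estimates (mu+, sigma+, mu-, sigma-, p+, p-).

    Assumes every bin contains samples of both classes."""
    rows = []
    for i in range(n_bins):
        sp = scores[(bin_ids == i) & (labels == 1)]
        sn = scores[(bin_ids == i) & (labels == -1)]
        rows.append((sp.mean(), sp.std(), sn.mean(), sn.std(),
                     sp.size / np.sum(labels == 1),
                     sn.size / np.sum(labels == -1)))
    return [np.asarray(c) for c in zip(*rows)]

svm = LinearSVC().fit(X, y)                   # base classifier
h = svm.decision_function(X)                  # continuous scores
bins = np.digitize(education_years, np.arange(2, 17))   # assumed bin edges
mu_p, sig_p, mu_n, sig_n, p_pos, p_neg = fit_bin_moments(h, y, bins, bins.max() + 1)

roc = []
for lam in np.logspace(-3, 3, 50):            # sweep the benefit-cost ratio
    k = gaussian_oer(mu_p, sig_p, mu_n, sig_n, p_pos, p_neg, lam)
    pred = h >= k[bins]                       # bin-dependent dynamic threshold
    roc.append((np.mean(pred[y == -1]), np.mean(pred[y == 1])))   # (FPR, TPR)
```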
Figure 7: ROC curve for the Adult dataset (original and bended), 20.33% improvement in 1 − AUC.
Figure 8: Zoom-in on the ROC curve for the Adult dataset.
Figure 9: Score as a function of the number of education years, together with some possible threshold curves.

Second, we used OER for the task of object recognition. The task at hand is finding a certain object (person, car, dog, etc.) inside a picture. For that purpose, multiple bounding boxes (BBs) are extracted from the picture, a classifier assigns a score to each of the BBs, and detection is made using some threshold on this score. For simplicity we tested only the "classification" stage of this problem.

From the PASCAL database [6], positive examples of several classes of objects were extracted (only the bounding box which contains the object). From the same database, 100000 background examples were taken (from 10 different pictures). Each example was scored using the state-of-the-art Discriminatively Trained Deformable Part Model classifier [8, 10]. This classifier models the object as composed of a set of parts (for example, a person is composed of head, hands, body, etc.); the classifier then matches the content of the bounding box against all possible orientations of the modelled object and its parts. It is known that the size of the bounding box significantly affects the performance of this classifier [18].

The size of the bounding box was used to divide the data into 4 bins. For each bin, the expectation and standard deviation of the positive and negative examples were estimated, as well as p⁺ and p⁻. The scores for the class "person" as a function of size can be seen in Figure 12. Two effects are notable. First, the bigger the BB (higher resolution), the higher the score; the effects on positive and negative examples are roughly the same in expectation, but for larger BBs the variance of the positives decreases while the variance of the negatives remains roughly the same. Second, since the database is constructed by partitioning pictures, it contains a high number of small BBs and a low number of large BBs, whereas the positive examples are distributed roughly uniformly over size. This causes the change of prior to be rather large.

Optimal thresholds were calculated using OER, and the results were compared to using a fixed threshold, with the area under the curve (AUC) as the performance measure. The results are summarised in Table 1. As can be seen, a substantial benefit (around 20% improvement) arises from using OER. Further examination of the benefit shows that for a very low FPR modelling errors start to have an effect and the benefit is minor, while for a very high FPR there is not much room for improvement; in between, there is a substantial region in which benefit arises. Figure 10 shows the derived ROC curves for the class "person", and Figure 11 shows a zoom-in on the region of the ROC in which the benefit is maximal. This improvement is achieved using only the picture size as a feature: this feature boosts the base classifier's performance although it holds little to no discriminative information. Ten-fold cross-validation was performed. Recently the validity of AUC for model comparison was questioned [11]; while for simplicity we do use AUC as a performance measure, our method improves the entire ROC curve, and since the improved ROC curve dominates the original one, other measures are also likely to show improvement. Note that in all conducted experiments OER outperforms the original classifier (0.001 p-value with the sign test).
Notice also that we used only a few bins (four; a minimal binning sketch follows below) and a simplistic Gaussian model. We believe that by using more complex features and more complex models these results can be improved even further.
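For reference, a minimal sketch of size-based binning (quantile edges are our assumption; the text does not specify how the four bins were chosen):

```python
import numpy as np

# bb_size: assumed array of bounding-box sizes (e.g., pixel areas)
edges = np.quantile(bb_size, [0.25, 0.5, 0.75])   # three cut points -> 4 bins
bins = np.digitize(bb_size, edges)                # bin index in {0, 1, 2, 3}
```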
Figure 10: ROC curve of object recognition, class "person" (optimal vs. fixed threshold).
Figure 11: Zoom-in on the ROC curve of object recognition, class "person".
Figure 12: Score as a function of the bounding box size.

Table 1: Simulation results for several object classes.

  class                         person    dog      car      chair
  Number of positive examples   2358      253      625      400
  Fixed-threshold AUC           0.98663   0.97827  0.99292  0.99540
  OER AUC                       0.99043   0.98329  0.99411  0.99648
  Improvement in 1 − AUC        28.4%     23.1%    16.8%    23.5%
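The last row is derived from the two AUC rows; as a check, the arithmetic:

```python
# Relative reduction in the area above the curve, i.e. in 1 - AUC.
fixed = {"person": 0.98663, "dog": 0.97827, "car": 0.99292, "chair": 0.99540}
oer = {"person": 0.99043, "dog": 0.98329, "car": 0.99411, "chair": 0.99648}
for c in fixed:
    imp = ((1 - fixed[c]) - (1 - oer[c])) / (1 - fixed[c])
    print(f"{c}: {100 * imp:.1f}%")   # person: 28.4%, dog: 23.1%, ...
```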
6 Conclusion

In this work we presented a novel approach for improving the ROC curve of existing classifiers. We believe that this method should become a standard tool in ROC analysis and can enhance essentially any classifier. The method presented is general and may provide substantial benefit for any application: as long as there is sufficient data to mitigate overfitting, anyone who considers ROC optimization should try to "bend the curve", since there is not much to lose from it and potentially much to gain.

We suggest three natural directions for further research. First, the method presented takes a two-step approach: start by modelling the data, and then find the optimal threshold curve according to this model. The model is used to derive the benefit-cost ratio. An alternative approach is to use empirical estimates of the benefit-cost ratio directly. The effect of such an approach is unclear: on the one hand, it may improve performance whenever a parametric model is inadequate to describe the data; on the other hand, it may increase over-fitting.

Second, accurate estimation of the model's parameters requires a large amount of labelled data. This is especially true when the number of prospective features is large. Partitioning the space into too many bins may lead to a faulty model. An interesting open question is how to optimally partition the feature space.

Third, in some scenarios it may be preferable to use a different optimization problem than (2). For example, in multi-view problems several classifiers, each with a different feature space, are fused into a single classification output. It may be interesting to jointly optimize the threshold curves of those classifiers.
References

[1] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 2006.
[2] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
[4] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16(16):313–320, 2004.
[5] Chris Drummond and Robert C. Holte. Cost curves: An improved method for visualizing classifier performance. 2006.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[7] Tom Fawcett and Foster Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[9] Yoav Freund, Robert E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.
[10] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[11] David J. Hand. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1):103–123, 2009.
[12] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[13] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling, 1998.
[14] José Hernández-Orallo, Peter Flach, and Cèsar Ferri. A unified view of performance metrics: Translating threshold choice into expected classification loss. The Journal of Machine Learning Research, 13(1):2813–2869, 2012.
[15] Alfred O. Hero and Douglas Cochran. Sensor management: Past, present, and future. IEEE Sensors Journal, 11(12):3064–3075, 2011.
[16] Alan Herschtal and Bhavani Raskutti. Optimising area under the ROC curve using gradient descent. In Proceedings of the Twenty-First International Conference on Machine Learning, page 49. ACM, 2004.
[17] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
[18] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In Computer Vision – ECCV 2012, pages 340–353. Springer, 2012.
[19] M. Lichman. UCI machine learning repository, 2013.
[20] Marco Pedersoli, Jordi Gonzàlez, Xu Hu, and Xavier Roca. Toward real-time pedestrian detection based on a deformable template model. IEEE Transactions on Intelligent Transportation Systems, 15(1):355–364, 2014.
[21] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[22] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.
[23] Singiresu S. Rao and S. S. Rao. Engineering Optimization: Theory and Practice. John Wiley & Sons, 2009.
[24] Vladimir Naumovich Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.
[25] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
[26] Lian Yan, Robert H. Dodier, Michael Mozer, and Richard H. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon–Mann–Whitney statistic. In