Bending the Curve: Improving the ROC Curve Through Error Redistribution
Oran Richman, Department of Electrical Engineering, Technion, Haifa, Israel. [email protected]
Shie Mannor, Department of Electrical Engineering, Technion, Haifa, Israel. [email protected]
September 21, 2018
Abstract
Classification performance is often not uniform over the data: some areas in the input space are easier to classify than others. Features that hold information about the "difficulty" of the data may be non-discriminative and are therefore disregarded in the classification process. We propose a meta-learning approach in which performance may be improved by post-processing. This improvement is done by establishing a dynamic threshold on the base classifier's results. Since the base classifier is treated as a "black box", the method presented can be used on any state-of-the-art classifier in order to try to improve its performance. We focus our attention on how to better control the true-positive/false-positive trade-off known as the ROC curve. We propose an algorithm for the derivation of optimal thresholds by redistributing the error depending on features that hold information about difficulty. We demonstrate the resulting benefit on both synthetic and real-life data.
1 Introduction

Binary classification is perhaps the most widely studied problem in machine learning, and many methods are used to obtain binary classifiers from data. For most applications two performance measures are of special interest. The first is the True Positive Rate (TPR): the portion of true positives that are classified as such by the classifier. The second is the False Positive Rate (FPR): the portion of true negatives that are classified as positive by the classifier.

There is a fundamental trade-off between these two measures. This trade-off is often controlled through thresholding: the classifier produces a continuous score for each sample, and a threshold is used to determine whether the sample is classified as positive (above the threshold) or negative (below the threshold). The pair (FPR, TPR) is the operating point of the resulting classifier.

The typical approach is to vary the threshold and obtain the complete curve of operating points, called the Receiver Operating Characteristic (ROC) curve [12]. The performance of the classifier is then evaluated based on the whole curve, using a specific operating point (i.e., a desired FPR level) or by considering the area under the curve (AUC). The AUC is an interesting measure since it has a probabilistic interpretation: the AUC of a classifier h(x): Rⁿ → R is the probability that for a random positive sample x⁺ and a random negative sample x⁻ the classifier will produce h(x⁺) > h(x⁻). In this paper we show that the thresholding approach can be refined such that performance can be improved without retraining the classifier.

Our approach is based on two observations. The first is that even after conditioning on the true class of the sample, the score is often correlated with some features (we will refer to them as auxiliary features). Moreover, those features may hold little or no discriminative information and are therefore disregarded during the learning process. For example, picture resolution may greatly affect the performance of object recognition [18], yet it is often uncorrelated with the picture content. The Discriminatively Trained Deformable Part Model classifier [10] is a popular state-of-the-art object detector; in this classifier, high-resolution pictures receive higher scores than low-resolution pictures [20].

The second observation is that the correlation with the score of positive examples and the correlation with the score of negative examples may be statistically different, even significantly so. We are mainly concerned with features that are correlated with the "difficulty" of the problem. The reference to "difficulty" implies some differential effect of those features on the scores of positive and negative examples; for example, the scores of the positive and negative examples become more or less concentrated. Revisiting the image-resolution example, the effect of reducing resolution on a real object's score differs from the effect on a random background image. This difference can be exploited to improve performance for a specific operating point.

For every desired operating point, we propose to use a threshold that depends on auxiliary features instead of being fixed for the entire input space: the threshold is a function instead of a constant as in the standard approach. The threshold "curve" can be designed so that performance is improved (i.e., higher TPR for a given FPR, or a lower FPR for a given TPR).
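To make these notions concrete, here is a minimal NumPy sketch (ours, not part of the original method description) of the operating point of a thresholded scorer, the empirical AUC in its pairwise-probability interpretation, and the proposed feature-dependent decision rule with one threshold per bin of the auxiliary feature:

```python
import numpy as np

def operating_point(scores, labels, k):
    """(FPR, TPR) of the rule 1{h(x) >= k} for labels in {-1, +1}."""
    pred = scores >= k
    return np.mean(pred[labels == -1]), np.mean(pred[labels == 1])

def empirical_auc(scores, labels):
    """AUC as P(h(x+) > h(x-)) over all positive/negative pairs.

    Uses O(n_pos * n_neg) memory; fine for a sketch."""
    pos, neg = scores[labels == 1], scores[labels == -1]
    return np.mean(pos[:, None] > neg[None, :])

def dynamic_predict(scores, bin_ids, k):
    """Meta-classifier: positive iff h(x) >= k_i, where i is the bin of x~."""
    return np.where(scores >= k[bin_ids], 1, -1)
```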
Our approach effectively rebalances the performance in different areas of the input space and redistributes the error.

A simple heuristic for determining the threshold (as a function of the features) is to eliminate the correlation between the adjusted score (the difference between the original score and the threshold) and the features. However, in the case where the positive and negative samples are affected differentially this is not trivial, and it requires estimating the conditional distribution of each class given the features.

The score can be adjusted either according to the positive examples or according to the negative examples. In the first case we use a threshold which follows the mean of the score of the negative examples; we refer to this approach as "constant false positive rate". Another approach is to use a threshold that follows the mean of the score of the positive examples; we refer to this approach as "constant true positive rate". An illustration of these approaches on a simple example can be seen in Figure 2. Both approaches, however, suffer from the same structural deficiency: some threshold "curve" is derived, and then the entire ROC curve is created by adding a fixed offset to it.

We present the Optimal Error Redistribution (OER) framework that "bends" the curve differently for different operating points. Our method is general and does not require any knowledge concerning the learning process used to train the classifier. The classifier is treated as a "black box", allowing one to "bend the curve" for a wide variety of classifiers.

Our method is based on an alternative view of the ROC curve. Instead of viewing the operating point as a consequence of a varying threshold, we can consider the following optimization: given some desired FPR, find the threshold curve (threshold as a function of the auxiliary features) that maximizes the TPR. This essentially treats the FPR as a resource which needs to be distributed between samples: easy examples will contribute (in expectation) a lower FPR than that contributed by the harder examples. This view allows introducing methods from the field of resource allocation (for example, methods from sensor management; a good review can be found in [15]).
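As a sketch (our illustration; per-bin empirical means stand in for the conditional means), the two baseline threshold curves can be computed as follows, with a fixed offset sweeping the operating point:

```python
import numpy as np

def baseline_thresholds(scores, labels, bin_ids, n_bins, offset, follow=-1):
    """Per-bin threshold: mean score of the followed class plus a fixed offset.

    follow=-1 tracks the negatives ("constant false positive rate");
    follow=+1 tracks the positives ("constant true positive rate").
    """
    k = np.empty(n_bins)
    for i in range(n_bins):
        k[i] = scores[(bin_ids == i) & (labels == follow)].mean() + offset
    return k
```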
Example 1

Consider the following case. A random variable X₁ is drawn uniformly from the set [1, …], and Y ∈ {−1, 1} is drawn such that Y = 1 with probability 0.5. The random variable X₂ is then drawn according to the following distribution: X₂ | y = 1 ∼ N(X₁, (σ₊)²), X₂ | y = −1 ∼ N(0, (σ₋)²), with σ₊ ≠ σ₋.

Figure 1: Data distribution of Example 1.

Since X₁ contains no discriminative information, a reasonable classifier for Y is h(X₁, X₂) = X₂ (using a linear classifier does not change the results significantly, but it makes the visual understanding of the following figures more difficult). Figure 2 shows dynamic thresholds (with respect to X₁) derived from the different approaches described above. The upper figure shows the curve matching the constant false positive approach; in this example it coincides with the original fixed (with respect to X₁) threshold. The middle figure shows the curve matching the constant true positive approach; this corresponds to a linear classifier which also uses the data in X₁. Both threshold curves are not optimal. The lower figure shows the optimal curves. It can be seen that for different operating points the curve "bends": when the example is "hard" to classify, the optimal threshold varies much more than when the example is "easy". Using a more complex classifier may produce different curves than those presented in these figures, but it will not be able to produce the "bending" effect.
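Since several constants of this example are unspecified here, the following is a hypothetical instantiation (all concrete numbers below are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x1 = rng.uniform(1.0, 5.0, size=n)        # assumed range of the auxiliary feature
y = rng.choice([-1, 1], size=n)           # P(Y = 1) = 0.5
mean = np.where(y == 1, x1, 0.0)          # positives centered at X1, negatives at 0
x2 = rng.normal(mean, np.where(y == 1, 1.5, 1.0))  # assumed sigma+ != sigma-
scores = x2                               # the base classifier h(x1, x2) = x2
```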
Figure 2: Threshold as a function of X₁ for the different approaches and for different operating points (Example 1).

Example 2

The optimal threshold may vary even when the mean and standard deviations do not depend on the features. This may happen when the prior changes, meaning that the ratio between the quantity of positive and negative examples is related to the auxiliary features. As an example, consider the following. A random variable Y ∈ {−1, 1} is drawn such that Y = 1 with probability 0.5. A random vector x = (X₁, X₂) is then drawn according to the following distribution: x | y = 1 ∼ N((0, µ₊), (σ₊)² I), x | y = −1 ∼ N((0, µ₋), (σ₋)² I), where I is the 2×2 unit matrix and σ₊ > σ₋. It is easy to see that a reasonable classifier for Y is h(X₁, X₂) = X₂. Observe that in this example the constant true positive and constant false positive approaches coincide and derive a constant threshold. Figure 3 shows the data distribution of this example along with some optimal dynamic thresholds (with respect to X₁). It can be seen that our method has essentially created a non-linear classifier for each desired operating point. As before, the different curves are not at a fixed offset from one another. Interestingly, for large enough |X₁| the prior is so significant that the optimal threshold is at −∞. This characterizes situations in which the standard deviation of the positive examples' score is larger than that of the negative examples' score.

Figure 3: Data distribution and some optimal thresholds for Example 2.
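A hypothetical instantiation of Example 2 (constants again assumed by us): because the positive class is assumed more spread out, P(y = 1 | X₁) grows with |X₁|, which is the changing prior discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
y = rng.choice([-1, 1], size=n)
mu2 = np.where(y == 1, 1.0, -1.0)          # assumed mu+ = 1, mu- = -1 in X2
sigma = np.where(y == 1, 2.0, 1.0)         # assumed sigma+ > sigma-
x = sigma[:, None] * rng.standard_normal((n, 2))   # isotropic classes
x[:, 1] += mu2                             # shift the X2 coordinate per class
scores = x[:, 1]                           # h(x1, x2) = x2; x1 is auxiliary
```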
Related work

Meta-learning [25] is concerned with the enhancement of classifiers. A meta-classifier takes a set of classifiers (base classifiers) and merges them in various ways to produce a unified classification result. The base classifiers are often trained using some variations of the same training set. This includes bagging [2], boosting [9] and many others (for example [3, 17]). Some works in this field specifically target the improvement of the ROC curve. In [22] the authors proposed the ROC Convex Hull (ROCCH) method. The ROCCH is based on the observation that, given two classifiers with different ROC curves, any point on the line segment between two operating points can be achieved by randomly using one or the other classifier with appropriate probabilities. This allows combining several classifiers to achieve an ROC curve which is better than that of each individual classifier. The method we present in this paper shares its basic approach with the field of meta-learning: in our case, the set of base classifiers is the base classifier equipped with different thresholds.

We differ, however, from existing work in this field in two important aspects. First, we use only a single classifier as input. Second, the auxiliary feature space can be completely different from the base classifier's feature space. We do not require any "re-training", and no access to the classifier's inner workings is needed. As a result, our method is much less sensitive to the way the original classifier was derived. This allows, in our view, much greater flexibility in applying the method. Note that we do require some training set to determine the dynamic thresholds; this set, however, can be different from the one used to train the classifier.

A different approach that specifically targets the improvement of the ROC curve is to build classifiers that optimize the area under the curve (AUC) directly [4, 16, 26]. Using various surrogates, the area under the curve can be optimized to derive some h_opt(x). The optimization is done with respect to some hypothesis class. Our method does not optimize the AUC but rather optimizes the ROC curve point by point. The resulting classifier, however, is in a different hypothesis class than the base classifier. The relation between h_opt(x) and the result of using our method on some h(x) is unclear, since the optimization of the two methods is done over different hypothesis classes. However, our method can even improve h_opt after it is derived using one of the AUC optimization methods. It is also important to note that while optimizing the AUC is possible only for a limited set of hypothesis classes, our method is general and can accommodate complex learning schemes.

Recent work has also explored different threshold choice methods [5, 7, 14]. A threshold choice method adjusts the threshold to accommodate changes in the cost functions or class distributions. Those methods share a similarity with the ideas presented in this paper. However, the setting which we explore is substantially different: in our setting the threshold may vary between different regions of the input space, with the goal of achieving maximal average performance, whereas the above-mentioned work explores the case where the threshold is used to adapt the base classifier in order to maximize current performance.

It is important to note that simply appending the auxiliary features to the feature vector will not produce the same result.
First, similarly to meta-learning, the resulting hypothesis class is significantly larger than that of the base classifier. Moreover, in many cases it is far from trivial to parametrize the resulting hypothesis class in a way that allows learning a "standard" classifier. As can be seen in Examples 1 and 2, the method presented allows creating complex classifiers using a simple (linear) base classifier. This also implies that simply treating the auxiliary features as ordinary features will often provide a much smaller benefit; it is also far from trivial to directly learn such complex hypothesis classes.

It is possible, obviously, to incorporate the ideas presented in this paper into the learning process of the base classifier. While such tight coupling may produce better results, such adoption is far from trivial for most learning schemes. The method presented treats the classifier as a "black box"; therefore, it can easily be incorporated on top of any existing classifier. As mentioned before, in our method the threshold is a function of the auxiliary features. If the base classifier were "smart enough" to use the full information contained in those features, then the method would produce no benefit. As we will see in the following, this is often not the case, especially when the features have low correlation with the true class.

Our contributions are threefold. First, we introduce a novel framework in which the threshold may vary over the input space. Second, we introduce the Optimal Error Redistribution (OER) method, which allows the creation of a meta-classifier with an improved ROC curve compared with the base classifier; in addition, we derive a closed-form solution of the optimal threshold for the special case of Gaussian distributions, and we present simulations which demonstrate the benefit that may arise. Finally, we present a feature selection technique (for OER) that allows selecting the auxiliary features without the explicit calculation of the ROC curve.

We believe that the method presented in this paper should become a standard tool in ROC analysis. It is always beneficial to try to improve the ROC curve some more, and our method proposes a generic way to do so.

This paper is structured as follows: Section 2 defines the problem formally and provides the general OER method. Section 3 details a simple implementation and provides a closed-form solution for a special case. Section 4 outlines a feature selection technique which allows selecting features for the method without the explicit calculation of the ROC curve. Section 5 demonstrates the feasibility of the method on real-life data, while Section 6 concludes the paper with some final thoughts and some still-open questions.

2 Problem Formulation and the OER Method

Consider binary classification of objects represented by some vector x ∈ Rⁿ. The base classifier is based on some function h(x): Rⁿ → R. In the original classification scheme a threshold is used to transform the output of the function into a binary classification: a sample is classified as positive if h(x) ≥ k and negative otherwise. We allow the threshold to depend on some auxiliary feature vector x̃. Notice that x̃ should not be confused with the vector x that represents the data: the feature vector x̃ can be some subset of x, or it can be measured separately from the raw data (as in the example of picture resolution).

We would like to find some function k(x̃) which assigns a threshold to each example. We approximate this function by partitioning the feature space into N bins.
Each bin can be assigned a different threshold. The determination of a continuous function k(x̃) is possible in a special case which is outlined in Section 3.1. Formally, the data distribution is modelled as a superposition of N populations {A_i, i = 1, …, N}. The auxiliary feature vector x̃ deterministically determines the population from which the example was taken. In the derived meta-classifier the original scalar threshold k is replaced with a vector (k_1, …, k_N): a sample x ∈ A_i is classified as positive if h(x) ≥ k_i and negative otherwise.

In each population the score distribution obeys the following:

h(x) | x ∈ A_i, y = 1 ∼ f_i,    h(x) | x ∈ A_i, y = −1 ∼ g_i,    (1)

where f_i and g_i are probability density functions. Denote the corresponding cumulative distribution functions by F_i and G_i. Further, let p⁺_i = P(x ∈ A_i | y = 1) and p⁻_i = P(x ∈ A_i | y = −1). The optimal thresholds for a desired FPR level C are then the solution of

max_{(k_1, …, k_N)} Σ_{i=1}^N p⁺_i (1 − F_i(k_i))    s.t.    Σ_{i=1}^N p⁻_i (1 − G_i(k_i)) = C.    (2)

This problem can be non-concave, and finding the global maximum may be hard [23]. We can, however, use an equivalent form of problem (2) to construct a gradient-ascent algorithm that will lead us to a local maximum. Instead of solving problem (2) we will solve the following problem for some λ > 0:

max_{(k_1, …, k_N)} Σ_{i=1}^N p⁺_i (1 − F_i(k_i)) − λ Σ_{i=1}^N p⁻_i (1 − G_i(k_i)).    (3)

It is known that for both problems a necessary condition for a vector (k_1, …, k_N) to be a solution is given by

p⁺_i f_i(k_i) = λ p⁻_i g_i(k_i).    (4)

We will call the expression

p⁺_i f_i(k_i) / (p⁻_i g_i(k_i))    (5)

the benefit-cost ratio. For a threshold vector (k_1, …, k_N) to be optimal, the benefit-cost ratio should be constant between populations.

The OER algorithm is given by Algorithm 1. As we will see in the following, for the special case where f_i and g_i are Gaussian with the same variance, it is possible to derive a closed-form solution for the global maximum.
Algorithm 1 OER
Parameters: ζ (learning rate), ε (stopping threshold).
Input: f, g, p⁺, p⁻, λ. All vector operations are done point-wise.
  Δ ← 1, k ← (0, 0, …, 0)
  while Δ > ε do
    k ← k − ζ [p⁺ f(k) − λ p⁻ g(k)]
    Δ ← ‖p⁺ f(k) − λ p⁻ g(k)‖
  end while
  return k

The necessary condition (4) implies that for the optimal threshold the benefit-cost ratio is constant between populations. Notice that since we would like to derive the complete ROC curve, there is no need to solve the problem for different values of C: we can use the common benefit-cost ratio λ as a parameter and derive the ROC curve by varying λ. A specific operating point can then be chosen for implementation.

The method presented can also be interpreted from a calibration perspective. Calibration is used to transform classifier outputs into posterior probabilities [13, 21]. One popular calibration method, known as Platt calibration, fits a sigmoid model to the data [21]: the method finds two parameters a and b such that the posterior probability fits as well as possible to

P(y = 1 | h(x)) = 1 / (1 + exp(a·h(x) + b)).

Earlier work has used a Gaussian fit as the base distribution [13]. Our method (with a slight modification, since we also use a Gaussian as our base distribution) can be viewed as an extension of Platt calibration where the two scalars a and b are replaced with two functions of the auxiliary features. This results in:

P(y = 1 | h(x)) = 1 / (1 + exp(a(x̃)·h(x) + b(x̃))).

The posterior probabilities can then be compared to a threshold such that the resulting classifier is equivalent to that obtained by our method. While this interpretation of our method is valid, we believe that the interpretation detailed in this paper is clearer and easier to implement. Some previous work by Vapnik [24] considered a calibration method which is not uniform over the sample space; however, that method is limited to Support Vector Machines (SVMs) and uses the original feature space with no auxiliary features. Our method is much more general.
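As a minimal sketch, Algorithm 1 translates to NumPy almost line by line; f and g are assumed to be vectorized per-bin density evaluations (for instance built from scipy.stats.norm.pdf):

```python
import numpy as np

def oer(f, g, p_pos, p_neg, lam, zeta=0.01, eps=1e-6, max_iter=100_000):
    """Gradient ascent for problem (3); f(k) and g(k) evaluate the per-bin
    densities f_i, g_i at the thresholds k_i, element-wise."""
    k = np.zeros_like(p_pos, dtype=float)
    for _ in range(max_iter):
        step = p_pos * f(k) - lam * p_neg * g(k)   # bracketed term in Algorithm 1
        k = k - zeta * step
        if np.linalg.norm(step) <= eps:            # Delta in Algorithm 1
            break
    return k
```

For instance, under a Gaussian model one could pass f = lambda k: scipy.stats.norm.pdf(k, mu_p, sig_p) with per-bin moment vectors mu_p and sig_p.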
3 Implementation with a Gaussian Model

The OER method presented earlier is general and flexible. There are two main design choices: first, choosing the auxiliary features and the corresponding populations A_i; second, choosing a model for f_i(y) and g_i(y) and a corresponding method for fitting the data. Section 4 provides a heuristic method for choosing auxiliary features; this question, however, is still open and a topic for future research. One simple model for f_i(y) and g_i(y) is a Gaussian model for the conditional behaviour of the score. Formally, the Gaussian model is stated as:

h(x) | x ∈ A_i, y = 1 ∼ N(µ⁺_i, (σ⁺_i)²),    h(x) | x ∈ A_i, y = −1 ∼ N(µ⁻_i, (σ⁻_i)²).

One of the main benefits of using such a model is that it requires only the estimation of the first and second moments, both of which can easily be estimated for each bin.

The necessary condition for an extremum now takes the form

(p⁺_i / σ⁺_i) exp(−(k_i − µ⁺_i)² / (2(σ⁺_i)²)) = λ (p⁻_i / σ⁻_i) exp(−(k_i − µ⁻_i)² / (2(σ⁻_i)²)),    (6)

where k_i is the threshold for the desired classifier. The benefit-cost ratio is

(p⁺_i σ⁻_i / (p⁻_i σ⁺_i)) exp(−(k_i − µ⁺_i)² / (2(σ⁺_i)²) + (k_i − µ⁻_i)² / (2(σ⁻_i)²)).    (7)

An illustration of the benefit-cost ratio for different relations between σ⁺_i and σ⁻_i can be seen in Figure 4. Notice that when σ⁺_i = σ⁻_i the ratio (7) is strictly monotone in k_i. Therefore, if σ⁺_i = σ⁻_i for all i, then (6) admits a single solution for every λ, and a closed-form solution to the optimization problem can be derived. This, however, is not the case in general.

Figure 4: Benefit-cost ratio for different parameters of a Gaussian distribution.

In the general case multiple extremum points may exist, and therefore local optimization methods need to be used. Notice also that if σ⁺_i > σ⁻_i then the benefit-cost ratio has a minimum; therefore, for a large enough FPR the optimal threshold is −∞. Similarly, if σ⁺_i < σ⁻_i then the benefit-cost ratio has a maximum, and for a small enough FPR the optimal threshold is ∞.

A solution to the optimization problem can be found by using OER (Algorithm 1). The gradient is given by

∇_i = (p⁺_i / σ⁺_i) exp(−(k_i − µ⁺_i)² / (2(σ⁺_i)²)) − λ (p⁻_i / σ⁻_i) exp(−(k_i − µ⁻_i)² / (2(σ⁻_i)²)).    (8)

Since the optimal threshold may be −∞ (when σ⁺_i > σ⁻_i) or ∞ (when σ⁺_i < σ⁻_i), it is advised at each step to project the threshold onto some fixed interval [−K, K] so that the gradient-ascent method converges.
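A sketch of this Gaussian specialization (our code; it uses the gradient (8) without the 1/√(2π) factors, which only rescale the learning rate) with the projection step advised above:

```python
import numpy as np

def gaussian_oer(mu_p, sig_p, mu_n, sig_n, p_pos, p_neg, lam,
                 zeta=0.01, eps=1e-6, K=10.0, max_iter=100_000):
    """Per-bin thresholds under the Gaussian model, projected onto [-K, K]."""
    k = np.zeros_like(mu_p, dtype=float)
    for _ in range(max_iter):
        f_k = p_pos / sig_p * np.exp(-(k - mu_p) ** 2 / (2 * sig_p ** 2))
        g_k = p_neg / sig_n * np.exp(-(k - mu_n) ** 2 / (2 * sig_n ** 2))
        grad = f_k - lam * g_k                 # Eq. (8)
        k = np.clip(k - zeta * grad, -K, K)    # projected gradient step
        if np.linalg.norm(grad) <= eps:
            break
    return k
```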
Example 2 revisited
We used the described method on Example 2. We divided the X₁ values into 120 bins, where x ∈ A_i if −6 + 0.1(i − 1) < X₁ < −6 + 0.1·i. Two additional bins were used for the intervals X₁ < −6 and X₁ >
6. We generated a data set of 20000 data points and tested the method. The results can be seen in Figure 5. It can be seen that the method presents a significant benefit over the two other approaches. As mentioned before, for sufficiently large |X₁| the threshold is −∞: since σ⁺_i > σ⁻_i, the benefit-cost ratio admits a minimum, and when the desired benefit-cost ratio is below the minimum possible value it is always desirable to trade more TPR for more FPR. Notice that in those bins the calculation of h(x) is useless and can be avoided, thereby reducing the computation resources needed.

Direct comparison to AUC optimization methods (like [16]) is inappropriate, since the outcome is highly sensitive to the hypothesis set over which the AUC is optimized. It is clear from Figure 3 that, despite the fact that our base classifier is linear, no linear classifier can achieve decent performance. Optimizing the AUC over a different hypothesis class may produce better results than ours; however, using that classifier as our base classifier and employing OER may improve it even further, or at least will not reduce its performance.

3.1 A Closed-Form Solution for Equal Variances

In some cases we can assume that the auxiliary features affect only the expectation of the score and do not affect the variance of the positive and negative samples; formally, σ⁻_i = σ⁺_i for all i. In this case problem (2) can be solved directly. The solution is given by

k_i = (σ⁺_i)² (log(p⁻_i / p⁺_i) + λ) / (µ⁺_i − µ⁻_i) + (µ⁺_i + µ⁻_i) / 2,    (9)

where −∞ < λ < ∞. The ROC curve can then be derived by calculating the optimal threshold for different values of λ ranging from −∞ to ∞.
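A one-function sketch of (9) (our code, under the convention that sig is the common per-bin standard deviation):

```python
import numpy as np

def closed_form_thresholds(mu_p, mu_n, sig, p_pos, p_neg, lam):
    """Eq. (9): per-bin optimal thresholds when sigma+_i == sigma-_i."""
    return sig ** 2 * (np.log(p_neg / p_pos) + lam) / (mu_p - mu_n) \
        + (mu_p + mu_n) / 2.0

# Trace the ROC curve by sweeping lam, e.g. over np.linspace(-20, 20, 200).
```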
Figure 5: ROC curves of Example 2 for the different approaches (constant false positive, constant true positive, optimal).

Remark 1. For this special case the extension to an infinite number of bins is straightforward. Instead
of fitting a Gaussian model to each bin, it is possible to estimate functions µ⁺(x̃) and µ⁻(x̃) that represent the mean score as a function of the features for the positive and negative examples, respectively. Similarly, the functions σ⁺(x̃), σ⁻(x̃), p⁺(x̃) and p⁻(x̃) should be estimated. All of these functions can be estimated using conventional parametric estimation methods (for example, maximizing the log-likelihood). The optimal threshold for each example can then be calculated using (9) by substituting µ⁺(x̃) for µ⁺_i, µ⁻(x̃) for µ⁻_i, and so on.
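A sketch of this continuous variant (our illustration, with a least-squares linear model standing in for "conventional parametric estimation"):

```python
import numpy as np

def fit_linear_mean(aux, scores, labels, cls):
    """Least-squares fit of the class-conditional mean score as a
    linear function of the auxiliary feature."""
    m = labels == cls
    A = np.column_stack([aux[m], np.ones(m.sum())])
    coef, *_ = np.linalg.lstsq(A, scores[m], rcond=None)
    return lambda a: coef[0] * a + coef[1]   # mu(aux)

# mu_pos = fit_linear_mean(aux, scores, labels, 1)
# mu_neg = fit_linear_mean(aux, scores, labels, -1)
# These plug into Eq. (9) point-wise in place of mu+_i and mu-_i.
```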
Example 1 revisited

We used the described method on Example 1. We divided the X₁ values into 8 equal-width bins x ∈ A_i, and took µ⁺_i to be the corresponding bin center (since X₂ | y = 1 is centered at X₁). Notice that we neglected the fact that σ⁺_i ≠ σ⁻_i. We generated a data set of 20000 data points and tested the method. The results are presented in Figure 6. It can be seen that the method presents a significant benefit over the two other approaches. Notice also that the derived ROC curve outperforms the convex hull of the two other methods, and therefore outperforms the ROCCH method.
Figure 6: ROC curve of Example 1
4 Feature Selection

One important question that arises in the context of OER is how to choose auxiliary features that provide the most benefit. Using features that do not contain relevant information may degrade performance due to over-fitting. The simplest approach is probably to use knowledge about the domain of the problem and consider features that may impact the problem's difficulty. In image classification these can be, for example, the picture's size, lighting conditions, etc. In speaker verification, difficulty is often related to the type of recording device: as the quality of the recording gets better, classification gets easier, so using the type of recording device as an auxiliary feature seems natural for this setting. Other examples include document length in spam filtering, channel characteristics in communication, distance from target in remote sensing, and many more.

Another obvious approach is to enumerate over potential options: for each feature, apply OER, then calculate the derived ROC and choose the features that give the most benefit. Sufficient estimation of the ROC, however, requires a large amount of labelled data; in certain cases labelled data are scarce, and the estimation of the ROC is therefore prone to errors.

An alternative approach is to use the modelling process to uncover potential auxiliary features. Looking at the benefit-cost ratio provides the necessary insight into the elements of the model that impact performance. One measure that can be proposed is the difference in separation difficulty. The separation difficulty (SD) is defined as the number of standard deviations between the means of the positive and negative examples, namely

SD_i = (µ⁺_i − µ⁻_i) / (σ⁺_i + σ⁻_i).

The difference in separation difficulty can then be defined as var(SD_i), where the variance is taken with respect to the data's distribution. A large difference causes significant bending of the curve for different operating points. While this does not guarantee a significant benefit on the ROC, it implies a potential for such benefit. Example 1 demonstrates the feasibility of this measure.

Another measure is the difference in the prior. The prior of bin i (denoted by P_i) can be defined as

P_i = log(p⁺_i σ⁻_i / (p⁻_i σ⁺_i)).

As before, the difference in prior can be taken to be var(P_i), where the variance is taken with respect to the data distribution. A large difference indicates that there might be a potential for significant benefits. Example 2 demonstrates the feasibility of this measure.

These measures allow establishing a feature selection mechanism (a code sketch closes this section): first, enumerate over possible features; for each feature, partition the space into bins and measure the difference in separation difficulty and the difference in prior. Only features for which those measures exceed some threshold should be used for OER. In the spirit of supervised PCA [1], a further reduction of the feature space's dimension can be achieved by using only the few main principal components of the remaining features. The resulting feature space can then be divided into bins and OER can be applied.

It is important to calibrate the number of bins to the amount of training data available. Using too few bins leads to a mismatch between the data and the model, and therefore to sub-optimal performance (which may even be worse than the original classifier); using too many bins may lead to over-fitting. We advise using cross-validation in order to optimize the number of bins.
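A compact sketch of the two screening measures (our code; the operator combining σ⁺_i and σ⁻_i in SD_i is assumed to be a sum, and the variance across bins is unweighted for simplicity):

```python
import numpy as np

def screening_measures(scores, labels, bin_ids, n_bins):
    """var(SD_i) and var(P_i) across bins, for auxiliary-feature screening."""
    sd, prior = np.empty(n_bins), np.empty(n_bins)
    n_pos, n_neg = np.sum(labels == 1), np.sum(labels == -1)
    for i in range(n_bins):
        sp = scores[(bin_ids == i) & (labels == 1)]
        sn = scores[(bin_ids == i) & (labels == -1)]
        sd[i] = (sp.mean() - sn.mean()) / (sp.std() + sn.std())
        prior[i] = np.log((sp.size / n_pos) * sn.std()
                          / ((sn.size / n_neg) * sp.std()))
    return np.var(sd), np.var(prior)
```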
5 Experimental Results

In addition to the results described earlier on synthetic examples, we demonstrate our method's potential benefit on real-life data. First we tested the method using the UCI "Adult" dataset [19]. In this dataset the goal is to predict whether a person's income exceeds 50
K/yr based on census data. We used an SVM as the base classifier. As the auxiliary feature we selected the number of years of education. This selection was made by reviewing the difference in separation difficulty and the difference in prior of all available features, as explained in Section 4. It is possible that choosing more than one auxiliary feature would improve the results. Figure 7 shows the derived ROC curves, and Figure 8 shows a zoom-in on the ROC. It is clearly visible that the derived ROC curve is always better than the original. The AUC improves from 0.878 for the baseline SVM to 0.9028 for the derived classifier, a 20.
33% improvement in 1 − AUC. It should be noted that since some of the input features are categorical, the ROC curve is highly sensitive; the results shown are averaged over ten-fold cross-validation. Note that in all conducted experiments OER outperforms the original classifier (0.001 p-value with the sign test).

Taking a closer look at the data distribution and the derived thresholds shows that the improvement is made by keeping the threshold low in the "easy" bins and increasing it in the more "difficult" bins. It can be seen that the benefit arises even though the data distribution isn't Gaussian; it is possible that using a different distribution for modelling would produce better results. The data distribution, as well as three possible thresholds, can be seen in Figure 9.
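An end-to-end sketch of this experiment as we read it (X, y with labels in {−1, +1} and education_years are assumed preloaded and numeric; the bin edges are assumptions; gaussian_oer is the sketch from Section 3):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_bin_moments(scores, labels, bin_ids, n_bins):
    """Per-bin estimates (mu+, sigma+, mu-, sigma-, p+, p-).

    Assumes every bin contains samples of both classes."""
    rows = []
    for i in range(n_bins):
        sp = scores[(bin_ids == i) & (labels == 1)]
        sn = scores[(bin_ids == i) & (labels == -1)]
        rows.append((sp.mean(), sp.std(), sn.mean(), sn.std(),
                     sp.size / np.sum(labels == 1),
                     sn.size / np.sum(labels == -1)))
    return [np.asarray(c) for c in zip(*rows)]

svm = LinearSVC().fit(X, y)                   # base classifier
h = svm.decision_function(X)                  # continuous scores
bins = np.digitize(education_years, np.arange(2, 17))   # assumed bin edges
mu_p, sig_p, mu_n, sig_n, p_pos, p_neg = fit_bin_moments(h, y, bins, bins.max() + 1)

roc = []
for lam in np.logspace(-3, 3, 50):            # sweep the benefit-cost ratio
    k = gaussian_oer(mu_p, sig_p, mu_n, sig_n, p_pos, p_neg, lam)
    pred = h >= k[bins]                       # bin-dependent dynamic threshold
    roc.append((np.mean(pred[y == -1]), np.mean(pred[y == 1])))   # (FPR, TPR)
```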
Figure 7: ROC curve for the Adult dataset (original and bended), 20.33% improvement in 1 − AUC.
Figure 8: Zoom-in on the ROC curve for the Adult dataset.
Figure 9: Score as a function of the number of education years, together with some possible threshold curves.

Second, we used OER for the task of object recognition. The task at hand is finding a certain object (person, car, dog, etc.) inside a picture. For that purpose, multiple bounding boxes (BBs) are extracted from the picture, a classifier assigns a score to each of the BBs, and detection is made using some threshold on this score. For simplicity we tested only the "classification" stage of this problem.

From the PASCAL database [6], positive examples of several classes of objects were extracted (only the bounding box which contains the object). From the same database, 100000 background examples were taken (from 10 different pictures). Each example was scored using the state-of-the-art Discriminatively Trained Deformable Part Model classifier [8, 10]. This classifier models the object as composed of a set of parts (for example, a person is composed of head, hands, body, etc.); the classifier then matches the content of the bounding box against all possible orientations of the modelled object and its parts. It is known that the size of the bounding box significantly affects the performance of this classifier [18].

The size of the bounding box was used to divide the data into 4 bins. For each bin, the expectation and standard deviation of the positive and negative examples were estimated, as well as p⁺ and p⁻. The scores for the class "person" as a function of size can be seen in Figure 12. Two effects are notable. First, the bigger the BB (higher resolution), the higher the score; the effects on positive and negative examples are roughly the same in expectation, but for larger BBs the variance of the positives decreases while the variance of the negatives remains roughly the same. Second, since the database is constructed by partitioning pictures, it contains a high number of small BBs and a low number of large BBs, whereas the positive examples are distributed roughly uniformly over size. This causes the change of prior to be rather large.

Optimal thresholds were calculated using OER, and the results were compared to using a fixed threshold, with the area under the curve (AUC) as the performance measure. The results are summarised in Table 1. As can be seen, a substantial benefit (around 20% improvement) arises from using OER. Further examination of the benefit shows that for a very low FPR modelling errors start to have an effect and the benefit is minor, while for a very high FPR there is not much room for improvement; in between, there is a substantial region in which benefit arises. Figure 10 shows the derived ROC curves for the class "person", and Figure 11 shows a zoom-in on the region of the ROC in which the benefit is maximal. This improvement is achieved using only the picture size as a feature: this feature boosts the base classifier's performance although it holds little to no discriminative information. Ten-fold cross-validation was performed. Recently the validity of AUC for model comparison was questioned [11]; while for simplicity we do use AUC as a performance measure, our method improves the entire ROC curve, and since the improved ROC curve dominates the original one, other measures are also likely to show improvement. Note that in all conducted experiments OER outperforms the original classifier (0.001 p-value with the sign test).
Notice also that we used only a few bins (four; a minimal binning sketch follows below) and a simplistic Gaussian model. We believe that by using more complex features and more complex models these results can be improved even further.
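For reference, a minimal sketch of size-based binning (quantile edges are our assumption; the text does not specify how the four bins were chosen):

```python
import numpy as np

# bb_size: assumed array of bounding-box sizes (e.g., pixel areas)
edges = np.quantile(bb_size, [0.25, 0.5, 0.75])   # three cut points -> 4 bins
bins = np.digitize(bb_size, edges)                # bin index in {0, 1, 2, 3}
```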
Figure 10: ROC curve of object recognition, class "person" (optimal vs. fixed threshold).
Figure 11: Zoom-in on the ROC curve of object recognition, class "person".
Figure 12: Score as a function of the bounding box size.

Table 1: Simulation results for several object classes.

  class                         person    dog      car      chair
  Number of positive examples   2358      253      625      400
  Fixed-threshold AUC           0.98663   0.97827  0.99292  0.99540
  OER AUC                       0.99043   0.98329  0.99411  0.99648
  Improvement in 1 − AUC        28.4%     23.1%    16.8%    23.5%
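The last row is derived from the two AUC rows; as a check, the arithmetic:

```python
# Relative reduction in the area above the curve, i.e. in 1 - AUC.
fixed = {"person": 0.98663, "dog": 0.97827, "car": 0.99292, "chair": 0.99540}
oer = {"person": 0.99043, "dog": 0.98329, "car": 0.99411, "chair": 0.99648}
for c in fixed:
    imp = ((1 - fixed[c]) - (1 - oer[c])) / (1 - fixed[c])
    print(f"{c}: {100 * imp:.1f}%")   # person: 28.4%, dog: 23.1%, ...
```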
6 Conclusion

In this work we presented a novel approach for improving the ROC curve of existing classifiers. We believe that this method should become a standard tool in ROC analysis and can enhance essentially any classifier. The method presented is general and may provide substantial benefit for any application: as long as there is sufficient data to mitigate overfitting, anyone who considers ROC optimization should try to "bend the curve", since there is not much to lose from it and potentially much to gain.

We suggest three natural directions for further research. First, the method presented takes a two-step approach: start by modelling the data, and then find the optimal threshold curve according to this model. The model is used to derive the benefit-cost ratio. An alternative approach is to use empirical estimates of the benefit-cost ratio directly. The effect of such an approach is unclear: on the one hand, it may improve performance whenever a parametric model is inadequate to describe the data; on the other hand, it may increase over-fitting.

Second, accurate estimation of the model's parameters requires a large amount of labelled data. This is especially true when the number of prospective features is large. Partitioning the space into too many bins may lead to a faulty model. An interesting open question is how to optimally partition the feature space.

Third, in some scenarios it may be preferable to use a different optimization problem than (2). For example, in multi-view problems several classifiers, each with a different feature space, are fused into a single classification output. It may be interesting to jointly optimize the threshold curves of those classifiers.
References

[1] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473), 2006.
[2] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
[4] Corinna Cortes and Mehryar Mohri. AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16(16):313–320, 2004.
[5] Chris Drummond and Robert C. Holte. Cost curves: An improved method for visualizing classifier performance. 2006.
[6] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[7] Tom Fawcett and Foster Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291–316, 1997.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
[9] Yoav Freund, Robert E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.
[10] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[11] David J. Hand. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1):103–123, 2009.
[12] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
[13] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling, 1998.
[14] José Hernández-Orallo, Peter Flach, and Cèsar Ferri. A unified view of performance metrics: Translating threshold choice into expected classification loss. The Journal of Machine Learning Research, 13(1):2813–2869, 2012.
[15] Alfred O. Hero and Douglas Cochran. Sensor management: Past, present, and future. IEEE Sensors Journal, 11(12):3064–3075, 2011.
[16] Alan Herschtal and Bhavani Raskutti. Optimising area under the ROC curve using gradient descent. In Proceedings of the Twenty-First International Conference on Machine Learning, page 49. ACM, 2004.
[17] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
[18] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In Computer Vision – ECCV 2012, pages 340–353. Springer, 2012.
[19] M. Lichman. UCI machine learning repository, 2013.
[20] Marco Pedersoli, Jordi Gonzàlez, Xu Hu, and Xavier Roca. Toward real-time pedestrian detection based on a deformable template model. IEEE Transactions on Intelligent Transportation Systems, 15(1):355–364, 2014.
[21] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[22] Foster Provost and Tom Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.
[23] Singiresu S. Rao and S. S. Rao. Engineering Optimization: Theory and Practice. John Wiley & Sons, 2009.
[24] Vladimir Naumovich Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.
[25] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002.
[26] Lian Yan, Robert H. Dodier, Michael Mozer, and Richard H. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon–Mann–Whitney statistic. In