Binary Classifier Calibration: Non-parametric approach
Mahdi Pakdaman Naeini
Intelligent Systems Program, University of Pittsburgh, [email protected]
Gregory F. Cooper
Department of Biomedical Informatics, University of Pittsburgh, [email protected]
Milos Hauskrecht
Computer Science Department, University of Pittsburgh, [email protected]
Abstract
Accurate calibration of learned probabilistic predictive models is critical for many practical prediction and decision-making tasks. There are two main categories of methods for building calibrated classifiers. One approach is to develop methods for learning probabilistic models that are well calibrated, ab initio. The other approach is to apply post-processing methods that transform the output of a classifier so that it is well calibrated, as for example histogram binning, Platt scaling, and isotonic regression. One advantage of the post-processing approach is that it can be applied to any existing probabilistic classification model that was constructed using any machine-learning method. In this paper, we first introduce two measures for evaluating how well a classifier is calibrated. We prove three theorems showing that, using a simple histogram binning post-processing method, it is possible to make a classifier well calibrated while retaining its discrimination capability. Also, by casting the histogram binning method as a density-based non-parametric binary classifier, we extend it using two simple non-parametric density estimation methods. We demonstrate the performance of the proposed calibration methods on synthetic and real datasets. Experimental results show that the proposed methods either outperform or are comparable to existing calibration methods.
1 Introduction

The development of accurate probabilistic prediction models from data is critical for many practical prediction and decision-making tasks. Unfortunately, the majority of existing machine learning and data mining models and algorithms are not optimized for this objective, and the predictions they produce may be miscalibrated.

Generally, a set of predictions of a binary outcome is well calibrated if the outcomes predicted to occur with probability p do occur about p fraction of the time, for each probability p that is predicted. This concept can be readily generalized to outcomes with more than two values. Figure 1 shows a hypothetical example of a reliability curve (4; 14), which displays the calibration performance of a prediction method: for each probability with which the method predicts Z = 1, the curve shows the fraction of instances (cases) in which the outcome Z = 1 actually occurs. The curve indicates that the method is fairly well calibrated, but it tends to assign probabilities that are too low. In general, perfect calibration corresponds to a straight line from (0, 0) to (1, 1). The closer a calibration curve is to this line, the better calibrated is the associated prediction method.

If uncertainty is represented using probabilities, then optimal decision making under uncertainty requires having models that are well calibrated. Producing well-calibrated probabilistic predictions is critical in many areas of science (e.g., determining which experiments to perform), medicine (e.g., deciding which therapy to give a patient), business (e.g., making investment decisions), and many other areas. At the same time, calibration has not been studied nearly as extensively as discrimination (e.g., ROC curve analysis) in machine learning and other fields that research probabilistic modeling.

[Figure 1: The solid line shows a calibration (reliability) curve for predicting Z = 1. The dotted line is the ideal calibration curve.]
[Figure 2: Scatter plot of the non-linearly separable simulated data.]

One approach to achieving a high level of calibration is to develop methods for learning probabilistic models that are well calibrated, ab initio. However, data mining and machine learning research has traditionally focused on the development of methods and models for improving discrimination, rather than on methods for improving calibration. As a result, existing methods have the potential to produce models that are not well calibrated. The miscalibration problem can be aggravated when models are learned from small-sample data or when the models make additional simplifying assumptions (such as linearity or independence).

Another approach is to apply post-processing methods (e.g., histogram binning, Platt scaling, or isotonic regression) to the output of classifiers to improve their calibration. The post-processing step can be seen as a function that maps the output of a prediction model to probabilities that are intended to be well calibrated. Figure 1 shows an example of such a mapping. This approach frees the designer of the machine learning model from the need to add calibration measures and terms to the objective function used to learn the model. The advantage of this approach is that it can be used with any existing classification method, since calibration is performed solely as a post-processing step.

The objective of the current paper is to show that the post-processing approach for calibrating binary classifiers is theoretically justified.
In particular, we show in the large-sample limit that post-processing will produce a perfectly calibrated classifier whose discrimination performance (in terms of area under the ROC curve) is at least as good as that of the original classifier. In the current paper we also introduce two simple but effective methods that can address the miscalibration problem.

Existing post-processing calibration methods can be divided into parametric and non-parametric methods. An example of a parametric method is Platt's method, which applies a sigmoidal transformation that maps the output of a predictive model (15) to a calibrated probability output. The parameters of the sigmoidal transformation function are learned in a maximum likelihood estimation framework. The key limitation of the approach is the (sigmoidal) form of the transformation function, which only rarely fits the true distribution of predictions.

The above problem can be alleviated using non-parametric methods. The most common non-parametric methods are based either on binning (19) or on isotonic regression (3). In the histogram binning approach, the raw predictions of a binary classifier are first sorted and then partitioned into b subsets of equal size, called bins. Given a prediction y, the method finds the bin containing that prediction and returns as ŷ the fraction of positive outcomes (Z = 1) in the bin (a code sketch is given at the end of this section). Zadrozny and Elkan (20) developed a calibration method that is based on isotonic regression. This method only requires that the mapping function be isotonic (monotonically increasing) (14). The pair-adjacent violators (PAV) algorithm is one instance of an isotonic regression algorithm (3). The isotonic calibration method based on the PAV algorithm can be viewed as a binning algorithm in which the positions of the bin boundaries and the sizes of the bins are selected according to how well the classifier ranks the examples in the training data (20). Recently, a variation of the isotonic-regression-based calibration method was described for predicting accurate probabilities with a ranking loss (13).

In this paper, Section 2 introduces two measures, maximum calibration error (MCE) and expected calibration error (ECE), for evaluating how well a classifier is calibrated. In Section 3 we prove three theorems showing that, by using a simple histogram-binning calibration method, it is possible to improve the calibration of a classifier, measured in terms of MCE and ECE, without sacrificing its discrimination capability, measured in terms of the area under the ROC curve (AUC). Section 4 introduces two simple extensions of the histogram binning method by casting the method as a simple density-based non-parametric binary classification problem. The results of experiments that evaluate the various calibration methods are presented in Section 5. Finally, Section 6 states conclusions and describes several areas for future work.
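To make the histogram binning procedure described above concrete, the following is a minimal Python sketch (ours, not the authors' implementation); the function names and the quantile-based choice of bin edges are our own assumptions:

```python
import numpy as np

def fit_histogram_binning(y_cal, z_cal, n_bins=10):
    """Equal-frequency histogram binning: sort the uncalibrated scores,
    cut them into n_bins bins of (roughly) equal size, and store the
    fraction of positive outcomes observed in each bin."""
    order = np.argsort(y_cal)
    y_sorted, z_sorted = np.asarray(y_cal)[order], np.asarray(z_cal)[order]
    # Bin edges chosen so each bin holds about the same number of scores.
    edges = np.quantile(y_sorted, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover scores outside the calibration range
    bin_idx = np.clip(np.searchsorted(edges, y_sorted, side="right") - 1, 0, n_bins - 1)
    # theta_hat[i] = fraction of positives among calibration scores in bin i
    # (0.5 is an arbitrary fallback for a bin left empty by duplicate edges).
    theta_hat = np.array([z_sorted[bin_idx == i].mean() if np.any(bin_idx == i) else 0.5
                          for i in range(n_bins)])
    return edges, theta_hat

def apply_histogram_binning(y, edges, theta_hat):
    """Map each raw score y to the positive fraction of its bin."""
    idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, len(theta_hat) - 1)
    return theta_hat[idx]
```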
2 Notation and Calibration Measures

This section presents the notation and assumptions we use for formalizing the problem of calibrating a binary classifier. We also define two measures for assessing the calibration of such classifiers.

Assume a binary classifier is defined as a mapping φ : R^d → [0, 1]. As a result, for every input instance x ∈ R^d the output of the classifier is y = φ(x), where y ∈ [0, 1]. For calibrating the classifier φ(.) we assume there is a training set {(x_i, y_i, z_i)}_{i=1}^N, where x_i ∈ R^d is the i'th instance, y_i = φ(x_i) ∈ [0, 1], and z_i ∈ {0, 1} is the true class of the i'th instance. We also define ŷ_i as the probability estimate for instance x_i obtained by using the histogram binning calibration method, which is intended to be better calibrated than y_i. In addition, we use the following notation and assumptions in the remainder of the paper:

• N is the total number of instances
• m is the total number of positive instances
• n is the total number of negative instances
• p_in is the space of uncalibrated probabilities {y_i} defined by the classifier output
• p_out is the space of transformed probability estimates {ŷ_i} produced by histogram binning
• B is the total number of bins defined on p_in in the histogram binning model
• B_i is the i'th bin defined on p_in
• N_i is the total number of instances x_k for which the predicted value y_k lies inside B_i
• m_i is the number of positive instances x_k for which the predicted value y_k lies inside B_i
• n_i is the number of negative instances x_k for which the predicted value y_k lies inside B_i
• η̂_i = N_i / N is an empirical estimate of P{y ∈ B_i}
• η_i is the value of P{y ∈ B_i} as N goes to infinity
• θ̂_i = m_i / N_i is an empirical estimate of P{z = 1 | y ∈ B_i}
• θ_i is the value of θ̂_i as N goes to infinity

In order to evaluate the calibration capability of a classifier, we use two simple statistics that measure calibration relative to the ideal reliability diagram (4; 14) (Figure 1 shows an example of a reliability diagram). These measures are called Expected Calibration Error (ECE) and Maximum Calibration Error (MCE). In computing these measures, the predictions are sorted and partitioned into ten bins, and the predicted value of each test instance falls into one of the bins. ECE is the expected calibration error over the bins, and MCE is the maximum calibration error among the bins, computed with empirical estimates as follows:
$$ECE = \sum_{i=1}^{10} P(i) \cdot |o_i - e_i|, \qquad MCE = \max_{i=1,\dots,10} |o_i - e_i|,$$
where o_i is the true fraction of positive instances in bin i, e_i is the mean of the post-calibrated probabilities for the instances in bin i, and P(i) is the empirical probability (fraction) of all instances that fall into bin i. The lower the values of ECE and MCE, the better calibrated the model.
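The two measures can be computed directly from their definitions. The following is a minimal Python sketch (our illustration, not code from the paper), assuming equal-frequency binning of the sorted predictions as described above:

```python
import numpy as np

def ece_mce(y_hat, z, n_bins=10):
    """Empirical ECE and MCE over equal-frequency bins of the
    predictions y_hat against binary outcomes z."""
    order = np.argsort(y_hat)
    y_sorted, z_sorted = np.asarray(y_hat)[order], np.asarray(z)[order]
    # Split the sorted predictions into n_bins groups of (roughly) equal size.
    bins_y = np.array_split(y_sorted, n_bins)
    bins_z = np.array_split(z_sorted, n_bins)
    gaps, weights = [], []
    for by, bz in zip(bins_y, bins_z):
        if len(by) == 0:
            continue
        o_i = bz.mean()   # true fraction of positives in the bin
        e_i = by.mean()   # mean predicted probability in the bin
        gaps.append(abs(o_i - e_i))
        weights.append(len(by) / len(y_sorted))  # P(i)
    gaps, weights = np.array(gaps), np.array(weights)
    return float(np.sum(weights * gaps)), float(np.max(gaps))  # (ECE, MCE)
```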
3 Calibration Theorems
In this section we study the properties of the histogram-binning calibration method. We prove three theorems showing that this method can improve the calibration of a classifier without sacrificing its discrimination capability. The first theorem shows that the MCE of the histogram binning method is concentrated around zero:
Theorem 3.1.
Using histogram binning calibration, with probability at least 1 − δ we have
$$MCE \le \sqrt{\frac{B \log\frac{2B}{\delta}}{2N}}.$$

Proof. To prove this theorem, we first use a concentration result for θ̂_i. Since each bin contains roughly N/B instances, Hoeffding's inequality gives
$$P\{|\hat{\theta}_i - \theta_i| \ge \epsilon\} \le 2e^{-2N\epsilon^2/B}. \qquad (1)$$
Let B̃_i be a bin defined on the space of transformed probabilities p_out used for calculating the MCE of the histogram binning method. Assume that after applying histogram binning over p_in (the space of uncalibrated probabilities generated by the classifier φ), the estimates θ̂_{i1}, ..., θ̂_{ik_i} are mapped into B̃_i. We define o_i as the true fraction of positive instances in bin B̃_i, and e_i as the mean of the post-calibrated probabilities for the instances in bin B̃_i. Using the notation defined in Section 2, we can write o_i and e_i as follows:
$$o_i = \frac{\eta_{i1}\theta_{i1} + \dots + \eta_{ik_i}\theta_{ik_i}}{\eta_{i1} + \dots + \eta_{ik_i}}, \qquad e_i = \frac{\eta_{i1}\hat{\theta}_{i1} + \dots + \eta_{ik_i}\hat{\theta}_{ik_i}}{\eta_{i1} + \dots + \eta_{ik_i}}.$$
Defining $\alpha_{it} = \frac{\eta_{it}}{\eta_{i1} + \dots + \eta_{ik_i}}$ and using the triangle inequality, we have
$$|o_i - e_i| \le \sum_{t \in \{1,\dots,k_i\}} \alpha_{it}\,|\hat{\theta}_{it} - \theta_{it}| \le \max_{t \in \{1,\dots,k_i\}} |\hat{\theta}_{it} - \theta_{it}|. \qquad (2)$$
Using this result and the concentration inequality (1) for θ̂_i, we can conclude that
$$P\{|o_i - e_i| > \epsilon\} \le P\Big\{\max_{t \in \{1,\dots,k_i\}} |\hat{\theta}_{it} - \theta_{it}| > \epsilon\Big\} \le 2k_i e^{-2N\epsilon^2/B}, \qquad (3)$$
where the last step follows from a union bound and k_i is the number of bins on the space p_in whose calibrated probability estimates are mapped into the bin B̃_i.

Applying a union bound again over the different bins B̃_i defined on the space p_out, and noting that the k_i sum to B, we obtain the following probability bound for the MCE over the space of calibrated estimates p_out:
$$P\Big\{\max_i |o_i - e_i| \ge \epsilon\Big\} \le 2(k_1 + \dots + k_B)\,e^{-2N\epsilon^2/B} \implies P\{MCE \ge \epsilon\} \le 2B e^{-2N\epsilon^2/B}.$$
Setting $\delta = 2B e^{-2N\epsilon^2/B}$ and solving for ε shows that with probability 1 − δ the inequality $MCE \le \sqrt{\frac{B \log\frac{2B}{\delta}}{2N}}$ holds.
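As a purely illustrative instance of this bound (the numbers are ours, not from the paper), plugging in B = 10 bins, N = 5000 calibration instances, and δ = 0.05 gives
$$MCE \le \sqrt{\frac{10 \cdot \log\frac{2 \cdot 10}{0.05}}{2 \cdot 5000}} = \sqrt{\frac{10 \cdot \log 400}{10^4}} \approx \sqrt{0.006} \approx 0.077,$$
i.e., with 95% probability the maximum calibration error after binning is below about 0.08.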
Corollary 3.2. Using the histogram binning calibration method, MCE converges to zero at the rate $O\!\left(\sqrt{\frac{B \log B}{N}}\right)$.

Next, we prove a theorem bounding the ECE of the histogram-binning calibration method:
Theorem 3.3.
Using the histogram binning calibration method, ECE converges to zero at the rate $O\!\left(\sqrt{\frac{B}{N}}\right)$.

Proof. The proof of this theorem uses the concentration inequality (3). Due to space limitations, the details are given in the appendix in the supplementary part of the paper.

The above two theorems show that we can bound the calibration error of a binary classifier, measured in terms of MCE and ECE, by using a histogram-binning post-processing method. We next show that, in addition to gaining calibration power, by using histogram binning we are guaranteed not to sacrifice the discrimination capability of the base classifier φ(.), measured in terms of AUC. Recall the definitions of y_i and ŷ_i, where y_i = φ(x_i) is the probability prediction of the base classifier φ(.) for the input instance x_i, and ŷ_i is the transformed estimate for instance x_i obtained by the histogram-binning calibration method. We define the AUC loss of the histogram-binning calibration method as follows:

Definition (AUC_Loss). AUC_Loss is the difference between the AUC of the base classifier estimate and the AUC of the transformed estimate produced by the histogram-binning calibration method. Using the notation in Section 2, it is defined as
$$AUC_{Loss} = AUC(y) - AUC(\hat{y}).$$

Using the above definition, our third theorem bounds the AUC_Loss of the histogram binning calibration method as follows:
Theorem 3.4.
Using the histogram-binning calibration method, the worst-case AUC_Loss is upper bounded by O(1/B).

Proof. Due to space limitations, the proof of this theorem is given in the appendix in the supplementary part of the paper.

Using the above theorems, we can conclude that the histogram-binning calibration method improves the calibration performance of a classifier, measured in terms of MCE and ECE, without losing the discrimination performance of the base classifier, measured in terms of AUC.

We will show in Section 4 that the histogram binning calibration method is simply a non-parametric plug-in classifier. By casting histogram binning as a non-parametric histogram binary classifier, there are other results showing that the histogram classifier is a mini-max rate classifier for Lipschitz Bayes decision boundaries (5). Although these results are valid for histogram classifiers with fixed bin size, our experiments show that fixed-bin-size and fixed-frequency histogram classifiers behave quite similarly. We conjecture that a histogram classifier with equal-frequency binning is also a mini-max (or near mini-max) rate classifier (16; 9); this is an interesting open problem that we intend to study in the future. These results make histogram binning a reasonable choice for binary classifier calibration under the condition that B → ∞ and N/(B log B) → ∞ as N → ∞. This can be achieved by setting B ≃ N^{1/3}, which is the optimal number of bins for obtaining the optimal convergence rate of the non-parametric histogram classifier (5).

4 Density-based Non-parametric Calibration Methods

In this section, we show that the histogram-binning calibration method (19) is a simple non-parametric plug-in classifier. In the calibration problem, given an uncalibrated probability estimate y, one way of finding the calibrated estimate ŷ = P(Z = 1 | y) is to apply Bayes' rule as follows:
$$P(Z = 1 \mid y) = \frac{P(z = 1)\,P(y \mid z = 1)}{P(z = 1)\,P(y \mid z = 1) + P(z = 0)\,P(y \mid z = 0)}, \qquad (4)$$
where P(z = 0) and P(z = 1) are the priors of class 0 and class 1, estimated from the training dataset, and P(y | z = 1) and P(y | z = 0) are predictive likelihood terms. If we use histogram density estimation for the predictive likelihood terms in Equation (4), we obtain
$$\hat{P}(y \mid z = t) = \sum_{j=1}^{B} \frac{\hat{\theta}_{tj}}{h_j}\,I(y \in B_j), \quad t \in \{0, 1\},$$
where $\hat{\theta}_{0j} = \frac{1}{n}\sum_{i=1}^{N} I(y_i \in B_j, z_i = 0)$ and $\hat{\theta}_{1j} = \frac{1}{m}\sum_{i=1}^{N} I(y_i \in B_j, z_i = 1)$ are the empirical estimates of the probability that a prediction of class z = t falls into bin B_j, and h_j is the width of bin B_j. Now, let us assume y ∈ B_j; using the assumptions in Section 2, substituting the empirical estimates $\hat{\theta}_{0j} = \frac{n_j}{n}$, $\hat{\theta}_{1j} = \frac{m_j}{m}$, $\hat{P}(z = 0) = \frac{n}{N}$, $\hat{P}(z = 1) = \frac{m}{N}$ from the training data, and performing some basic algebra, we obtain the calibrated estimate
$$\hat{y} = \frac{m_j}{m_j + n_j},$$
where m_j and n_j are the numbers of positive and negative examples in bin B_j.

The above computations show that the histogram-binning calibration method is actually a simple plug-in classifier in which histogram density estimation is used for the predictive likelihoods in Bayes' rule, as given by Equation (4). By casting histogram binning as a plug-in method for classification, it is possible to use more advanced frequentist density estimation methods in place of simple histogram-based density estimation. For example, if we use kernel density estimation (KDE) for the predictive likelihood terms, the resulting calibrated probability P(Z = 1 | X = x) is
$$\hat{P}(Z = 1 \mid X = x) = \frac{\frac{1}{N h_1}\sum_{X_i \in X^{+}} K\!\left(\frac{x - X_i}{h_1}\right)}{\frac{1}{N h_1}\sum_{X_i \in X^{+}} K\!\left(\frac{x - X_i}{h_1}\right) + \frac{1}{N h_0}\sum_{X_i \in X^{-}} K\!\left(\frac{x - X_i}{h_0}\right)}, \qquad (5)$$
where the X_i are training instances, X^+ and X^- are the sets of positive and negative training examples, and m and n are, respectively, the numbers of positive and negative examples in the training data. Also, h_1 and h_0 are the bandwidths of the predictive likelihoods for class 1 and class 0. The bandwidth parameters can be optimized using cross-validation techniques; however, in this paper we used Silverman's rule of thumb (17) for setting the bandwidth, $h = 1.06\,\hat{\sigma}\,N^{-1/5}$, where σ̂ is the empirical estimate of the standard deviation. It is possible to use the same bandwidth for both class 0 and class 1, which leads to the Nadaraya-Watson kernel estimator that we use in our experiments; however, we noticed that there are some cases in which KDE with different bandwidths performs better.

There are different types of smoothing kernel functions, such as the Gaussian, boxcar, Epanechnikov, and tricube functions. Due to the similarity of the results we obtained when using the different types of kernels, we report here only the results for the simplest one, the boxcar kernel.

It has been shown in (18) that kernel density estimators are mini-max rate estimators, and that under the $L_2$ loss function the risk of the estimator converges to zero at the rate $O_P\!\left(n^{-2\beta/(2\beta + d)}\right)$, where β is a measure of the smoothness of the target density and d is the dimensionality of the input data. From this convergence rate, we can infer that the application of kernel density estimation is likely to be practical when d is low. Fortunately, for the binary classifier calibration problem the input space of the model is the space of uncalibrated predictions, which is one-dimensional. This justifies the application of KDE to the classifier calibration problem.
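To illustrate Equation (5), here is a minimal Python sketch of KDE-based calibration (our own illustration, with hypothetical function names), using Silverman's rule for the bandwidths and a boxcar or Gaussian kernel:

```python
import numpy as np

def kde_calibrate(y, y_cal, z_cal, kernel="boxcar"):
    """Plug-in calibration with class-conditional kernel density
    estimates, as in Equation (5)."""
    def silverman(scores):
        # Silverman's rule of thumb: h = 1.06 * sigma_hat * n^(-1/5)
        return 1.06 * np.std(scores, ddof=1) * len(scores) ** (-1 / 5)
    def kde(points, scores, h):
        u = (points[:, None] - scores[None, :]) / h
        if kernel == "boxcar":
            k = 0.5 * (np.abs(u) <= 1)                     # boxcar kernel
        else:
            k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
        return k.sum(axis=1) / (len(scores) * h)
    y_cal, z_cal = np.asarray(y_cal), np.asarray(z_cal)
    pos, neg = y_cal[z_cal == 1], y_cal[z_cal == 0]
    h1, h0 = silverman(pos), silverman(neg)
    prior_pos = len(pos) / len(y_cal)
    # Bayes' rule with prior-weighted class-conditional densities.
    num = prior_pos * kde(np.asarray(y), pos, h1)
    den = num + (1 - prior_pos) * kde(np.asarray(y), neg, h0)
    return num / np.maximum(den, 1e-12)  # guard against zero density
```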
The KDE approach presented above is a non-parametric frequentist approach to estimating the likelihood terms of Equation (4). Instead of the frequentist approach, we can use Bayesian methods for modeling the density functions. The Dirichlet process mixture (DPM) method is a well-known Bayesian approach to density estimation (2; 8; 6; 11). For building a Bayesian calibration model, we model the predictive likelihood terms P(X_i = x | Z_i = 1) and P(X_i = x | Z_i = 0) in Equation (4) using the DPM method. Due to a lack of space, we do not present the details of the DPM model here, but instead refer the reader to (2; 8; 6; 11).

There are different ways of performing inference in a DPM model: one can choose, for example, either Gibbs sampling (non-collapsed or collapsed) or variational inference. In implementing our calibration model, we use the variational inference method described in (10), which we chose because of its fast convergence. We will refer to this method as DPM.
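As a rough illustration of the DPM-based approach, the following sketch substitutes scikit-learn's variational Dirichlet-process Gaussian mixture (BayesianGaussianMixture) for the class-conditional densities in Equation (4); this stand-in is our own assumption and not the exact model or inference method (10) used in the paper:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_dpm_calibrator(y_cal, z_cal, max_components=10):
    """Fit truncated variational DP mixtures to the class-conditional
    score densities p(y | z = 1) and p(y | z = 0)."""
    def fit_density(scores):
        return BayesianGaussianMixture(
            n_components=max_components,
            weight_concentration_prior_type="dirichlet_process",
        ).fit(scores.reshape(-1, 1))
    y_cal, z_cal = np.asarray(y_cal), np.asarray(z_cal)
    pos, neg = y_cal[z_cal == 1], y_cal[z_cal == 0]
    prior_pos = len(pos) / len(y_cal)
    return fit_density(pos), fit_density(neg), prior_pos

def dpm_calibrate(y, dens_pos, dens_neg, prior_pos):
    """Bayes' rule (Equation (4)) with DPM density estimates."""
    y = np.asarray(y).reshape(-1, 1)
    lik_pos = np.exp(dens_pos.score_samples(y)) * prior_pos
    lik_neg = np.exp(dens_neg.score_samples(y)) * (1.0 - prior_pos)
    return lik_pos / (lik_pos + lik_neg)
```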
5 Empirical Results

This section describes the set of experiments that we performed to evaluate the performance of the calibration methods described above. To evaluate each method, we ran experiments on both simulated and real data, using five measures. The first two, accuracy (ACC) and area under the ROC curve (AUC), measure discrimination. The other three, root mean squared error (RMSE), expected calibration error (ECE), and maximum calibration error (MCE), measure calibration.

Simulated data. For the simulated data experiments, we used a binary classification dataset in which the outcomes were not linearly separable; a scatter plot of the simulated dataset is shown in Figure 2. The data were divided into one set of instances for training and calibrating the prediction model, and a separate set of instances for testing the models.

To conduct the experiments on the simulated data, we used two extreme classifiers: support vector machines (SVMs) with linear and with quadratic kernels. The choice of the SVM with a linear kernel allows us to see how the calibration methods perform when the classification model makes oversimplifying (linear) assumptions. Also, to achieve good discrimination on the data in Figure 2, an SVM with a quadratic kernel is intuitively an ideal choice, so the experiment using the quadratic kernel SVM allows us to see how well the different calibration methods perform when we use an ideal learner for the classification problem, in terms of discrimination.

Table 1: Experimental results on the simulated dataset.

(a) SVM with linear kernel
        SVM    Hist   Platt  IsoReg  KDE    DPM
RMSE    0.50   0.39   0.50   0.46    0.38   0.39
AUC     0.50   0.84   0.50   0.65    0.85   0.85
ACC     0.48   0.78   0.52   0.64    0.78   0.78
MCE     0.52   0.19   0.54   0.58    0.09   0.16
ECE     0.28   0.07   0.28   0.35    0.03   0.07

(b) SVM with quadratic kernel
        SVM    Hist   Platt  IsoReg  KDE    DPM
RMSE    0.21   0.09   0.19   0.08    0.09   0.08
AUC     1.00   1.00   1.00   1.00    1.00   1.00
ACC     0.99   0.99   0.99   0.99    0.99   0.99
MCE     0.35   0.04   0.32   0.03    0.07   0.03
ECE     0.14   0.01   0.15   0.00    0.01   0.00

As seen in Table 1, the KDE- and DPM-based calibration methods performed better than Platt scaling and isotonic regression on the simulated data, especially when the linear SVM is used as the base learner. The poor performance of Platt scaling is not surprising given its simplicity: it is a parametric model with only two parameters. Isotonic regression, in contrast, is a non-parametric model that makes only a monotonicity assumption over the output of the base classifier. When we use a linear kernel SVM, this assumption is violated because of the non-linearity of the data; as a result, isotonic regression performs relatively poorly at improving the discrimination and calibration of the base classifier. The violation of this assumption can happen in real data as well. In order to mitigate this pitfall, Menon et al. (13) proposed combining the optimization of AUC as a ranking loss measure with isotonic regression for building a ranking model. However, this is counter to our goal of developing post-processing methods that can be used with any existing classification model. As shown in Table 1b, even if we use an ideal SVM classifier for this linearly non-separable dataset, our proposed methods perform as well as or better than isotonic-regression-based calibration.

As can also be seen in Table 1b, although the SVM base learner performs very well in the sense of discrimination, based on the AUC and ACC measures, it performs poorly in terms of calibration, as measured by RMSE, MCE, and ECE. Moreover, all of the calibration methods retain the discrimination performance that was obtained prior to post-processing, while improving calibration.

Table 2 shows the results of using the histogram-binning calibration method with calibration sets of different sizes on the simulated data, for the linear and quadratic kernels. In these experiments we fixed the sizes of the training and test sets. To capture the effect of the calibration set size, we varied the number of calibration instances over a wide range, repeating the experiment several times for each calibration set size and averaging the results. As seen in Table 2, with more calibration data there is a steady decrease in the values of the MCE and ECE errors.

Table 2: Experimental results of histogram-binning calibration for increasing sizes of the calibration dataset (columns ordered from smallest to largest calibration set; the rightmost column is the base SVM without calibration).

(a) SVM with linear kernel
       (increasing calibration set size →)    Base SVM
AUC    0.82   0.84   0.85   0.85   0.85       0.49
MCE    0.40   0.15   0.07   0.05   0.03       0.52
ECE    0.14   0.05   0.03   0.02   0.01       0.28

(b) SVM with quadratic kernel
       (increasing calibration set size →)    Base SVM
AUC    0.99   1.00   1.00   1.00   1.00       1.00
MCE    0.14   0.09   0.03   0.01   0.01       0.36
ECE    0.03   0.01   0.00   0.00   0.00       0.15
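For concreteness, the following hypothetical end-to-end run (ours, with made-up synthetic data loosely mimicking Figure 2) ties together the earlier sketches: it trains a linear SVM, calibrates its scores by histogram binning on held-out data, and compares ECE/MCE before and after calibration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
z = ((X[:, 0] ** 2 + X[:, 1] ** 2) > 1.0).astype(int)   # non-linearly separable

svm = SVC(kernel="linear", probability=True).fit(X[:1000], z[:1000])
y_cal = svm.predict_proba(X[1000:1500])[:, 1]            # calibration scores
y_test = svm.predict_proba(X[1500:])[:, 1]               # test scores

edges, theta_hat = fit_histogram_binning(y_cal, z[1000:1500])
y_test_cal = apply_histogram_binning(y_test, edges, theta_hat)
print("before:", ece_mce(y_test, z[1500:]))
print("after: ", ece_mce(y_test_cal, z[1500:]))
```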
Real data. For the real data experiments, we used the KDD-98 dataset, which is available at the UCI KDD repository. The dataset contains information about people who donated to a particular charity. Here the decision-making task is to decide whether a solicitation letter, which has a fixed mailing cost, should be mailed to a person. The training set includes instances for which it is known whether a person made a donation and, if so, how much the person donated; only a small fraction of these training cases were responders. The validation set includes instances from the same donation campaign, again with a small fraction of responders.

Following the procedure in (19; 20), we build two models: a response model r(x) for predicting the probability that person x responds to a solicitation, and an amount model a(x) for predicting the amount of the donation of person x. The optimal mailing policy is to send a letter to those people for whom the expected donation return r(x)a(x) is greater than the cost of mailing the letter. Since in this paper we are not concerned with feature selection, our choice of attributes for building the response and amount prediction models is based on (12). Following the approach in (21), we build the amount model on the positive cases in the training data, removing the cases with donations of more than $50 as outliers. Following their construction, we also provide the output of the response model r(x) as an augmented feature to the amount model a(x).

In our experiments, we used three different classifiers to build the response model: SVM, logistic regression, and naive Bayes. For building the amount model, we used a support vector regression model. For implementing these models we used the liblinear package (7). The results of the experiment are shown in Table 3. In addition to the previous measures of comparison, we also show the amount of profit obtained when using each method. As seen in these tables, the application of the calibration methods results in at least $3000 more in expected net gain from sending solicitations.

Table 3: Experimental results on the KDD-98 dataset.

(a) Logistic regression
         LR      Hist    Platt   IsoReg  KDE     DPM
RMSE     0.500   0.218   0.218   0.218   0.218   0.219
AUC      0.613   0.610   0.613   0.612   0.611   0.613
ACC      0.56    0.95    0.95    0.95    0.95    0.95
MCE      0.454   0.020   0.013   0.030   0.004   0.017
ECE      0.449   0.007   0.004   0.013   0.002   0.003
Profit   10560   13183   13444   13690   12998   13696

(b) Naïve Bayes
         NB      Hist    Platt   IsoReg  KDE     DPM
RMSE     0.514   0.218   0.218   0.218   0.218   0.218
AUC      0.603   0.600   0.603   0.602   0.602   0.603
ACC      0.622   0.949   0.949   0.949   0.949   0.949
MCE      0.850   0.008   0.008   0.046   0.005   0.010
ECE      0.390   0.004   0.004   0.023   0.002   0.003
Profit   7885    11631   10259   10816   12037   12631

(c) SVM with linear kernel
         SVM     Hist    Platt   IsoReg  KDE     DPM
RMSE     0.696   0.218   0.218   0.219   0.218   0.218
AUC      0.615   0.614   0.615   0.500   0.614   0.615
ACC      0.95    0.95    0.95    0.95    0.95    0.95
MCE      0.694   0.011   0.013   0.454   0.003   0.019
ECE      0.660   0.004   0.004   0.091   0.002   0.004
Profit   10560   13480   13080   11771   13118   13544
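The decision rule underlying the profit comparison can be stated in a few lines; the following sketch (the function names are ours) shows why calibrated response probabilities matter, since the rule thresholds the product r(x)a(x) against the mailing cost:

```python
import numpy as np

def mailing_decisions(response_prob, amount_pred, cost):
    """Mail a letter when the expected donation return r(x)*a(x)
    exceeds the mailing cost. response_prob should be the *calibrated*
    output of the response model r(x)."""
    expected_return = np.asarray(response_prob) * np.asarray(amount_pred)
    return expected_return > cost  # True -> mail the letter

def expected_profit(response_prob, amount_pred, cost):
    """Expected net gain of the mailing policy over the scored population."""
    expected_return = np.asarray(response_prob) * np.asarray(amount_pred)
    mail = expected_return > cost
    return float(np.sum(expected_return[mail] - cost))
```

A miscalibrated r(x) shifts which people cross the cost threshold, which is why the profit rows in Table 3 improve after calibration even though the AUC of the response model barely changes.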
In all of our experiments, we used the same training data for model calibration as we used for model construction. In doing so, we did not notice any over-fitting. However, if we want to be completely sure not to over-fit the training data, we can do one of the following (a minimal sketch of the second scheme follows this list):

• Data partitioning: This approach uses different data sets for model training and model calibration. The amount of data needed to calibrate a model is generally much less than the amount needed to train it, because the calibration feature space has a single dimension. We observed that a relatively small calibration set is sufficient for obtaining well-calibrated models, as seen in Table 2.

• Leave-one-out: If the amount of available training data is small and it is not possible to partition the data, we can use a leave-one-out (or k-fold) scheme for building the calibration dataset. In this approach we learn a model on N − 1 instances, apply it to the one remaining instance, and save the resulting calibration instance (x_i, ŷ_{−i}, z_i), where ŷ_{−i} is the predicted value for x_i using the model trained on the remaining data points. Repeating the process for all examples yields the calibration dataset {(x_i, ŷ_{−i}, z_i)}_{i=1}^{N}.
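A minimal sketch of the leave-one-out scheme, assuming a scikit-learn-style estimator with fit and predict_proba (the helper name is ours):

```python
import numpy as np

def loo_calibration_set(model_factory, X, z):
    """Build the calibration dataset {(x_i, y_hat_{-i}, z_i)} by training
    on all instances except i and scoring the held-out instance."""
    X, z = np.asarray(X), np.asarray(z)
    y_loo = np.empty(len(X))
    for i in range(len(X)):
        mask = np.ones(len(X), dtype=bool)
        mask[i] = False                      # hold out instance i
        model = model_factory().fit(X[mask], z[mask])
        y_loo[i] = model.predict_proba(X[i:i + 1])[0, 1]
    return X, y_loo, z
```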
6 Conclusion
In this paper, we described two measures for evaluating the calibration capability of a binary classifier: maximum calibration error (MCE) and expected calibration error (ECE). We also proved three theorems that justify post-processing as an approach for calibrating binary classifiers. Specifically, we showed that by using a simple histogram-binning calibration method we can improve the calibration of a binary classifier, in terms of MCE and ECE, without sacrificing the discrimination performance of the classifier, as measured in terms of AUC. The other contribution of this paper is the introduction of two extensions of the histogram-binning method that are based on kernel density estimation and on the Dirichlet process mixture model. Our experiments on simulated and real datasets showed that the proposed methods perform well and are promising when compared with two popular existing calibration methods.

In future work, we plan to investigate the conjecture that histogram binning with equal-frequency bins is a mini-max (or near mini-max) rate classifier, as equal-width binning is known to be. Our extensive experimental studies comparing histogram binning with equal-frequency and equal-width bins provide support for this conjecture. We would also like to prove similar theoretical results for kernel density estimation. Another direction for future research is to extend the methods described in this paper to multi-class calibration problems.
Appendix

In this appendix, we sketch the proofs of the ECE and AUC bound theorems stated in Section 3 (Calibration Theorems). It may be helpful to review the notation and assumptions in Section 2 before reading the proofs.
Proof of the ECE Bound (Theorem 3.3)

Here we show that, using the histogram binning calibration method, ECE converges to zero at the rate $O\!\left(\sqrt{\frac{B}{N}}\right)$. Let us define E_i as the expected calibration loss on bin B̃_i for the histogram binning method. Following the assumptions stated for the MCE bound theorem in Section 3, we have E_i = E(|e_i − o_i|). Using the definition of ECE and the notation in Section 2, we can rewrite ECE as a convex combination of the E_i's. As a result, in order to bound ECE it suffices to show that each of its components E_i is bounded. Recalling the concentration result proved for the MCE bound theorem, we have
$$P\{|o_i - e_i| > \epsilon\} \le 2k_i e^{-2N\epsilon^2/B}. \qquad (6)$$
We also recall the following two identities:

Lemma 7.1. If X is a positive random variable, then $E[X] = \int_0^{\infty} P(X > t)\,dt$.

Lemma 7.2. $\int_0^{\infty} e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}$.

Using the concentration result in Equation (6) and applying the two identities above, we can bound E_i as $E_i \le C\sqrt{\frac{B}{N}}$, where C is a constant. Finally, since ECE is a convex combination of the E_i's, we conclude that, using the histogram binning method, ECE converges to zero at the rate $O\!\left(\sqrt{\frac{B}{N}}\right)$.

Proof of the AUC Bound (Theorem 3.4)

Here we show that the worst-case AUC loss of the histogram binning calibration method is at the rate of O(1/B). To prove the theorem, we first recall the concentration results for η̂_i and θ̂_i. Using Hoeffding's inequality, we have the following:
$$P\{|\hat{\theta}_i - \theta_i| \ge \epsilon\} \le 2e^{-2N\epsilon^2/B} \qquad (7)$$
$$P\{|\hat{\eta}_i - \eta_i| \ge \epsilon\} \le 2e^{-2N\epsilon^2} \qquad (8)$$
The above concentration inequalities show that with probability 1 − δ we have the following inequalities:
$$|\hat{\theta}_i - \theta_i| \le \sqrt{\frac{B}{2N}\log\frac{2}{\delta}} \qquad (9)$$
$$|\hat{\eta}_i - \eta_i| \le \sqrt{\frac{1}{2N}\log\frac{2}{\delta}} \qquad (10)$$
These results show that, for large amounts of data, with high probability η̂_i is concentrated around η_i and θ̂_i is concentrated around θ_i.

Based on (1), the empirical AUC of a classifier φ(.) is defined as
$$\widehat{AUC} = \frac{1}{mn} \sum_{i: z_i = 1}\; \sum_{j: z_j = 0} \left[ I(y_i > y_j) + \frac{1}{2} I(y_i = y_j) \right], \qquad (11)$$
where m and n, as defined in Section 2 (assumptions and notation), are respectively the total numbers of positive and negative examples. Taking the expectation of Equation (11) gives the actual AUC:
$$AUC = \Pr\{y_i > y_j \mid z_i = 1, z_j = 0\} + \frac{1}{2}\Pr\{y_i = y_j \mid z_i = 1, z_j = 0\}. \qquad (12)$$
We note that, using McDiarmid's concentration inequality, it is also possible to show that the empirical $\widehat{AUC}$ is highly concentrated around the true AUC (1).

Recall that p_in is the space of outputs of the base classifier φ, and p_out is the space of transformed probability estimates produced by histogram binning. Assume B_1, ..., B_B are the non-overlapping bins defined on p_in in the histogram binning approach. Also assume that y_i and y_j are the base classifier outputs for two different instances with z_i = 1 and z_j = 0, and that ŷ_i and ŷ_j are, respectively, the transformed probability estimates for y_i and y_j produced by histogram binning. Under these assumptions, we can write the AUC loss of the histogram binning method as
$$AUC_{Loss} = AUC(y) - AUC(\hat{y}) \qquad (13)$$
$$= \Pr\{y_i > y_j \mid z_i = 1, z_j = 0\} + \frac{1}{2}\Pr\{y_i = y_j \mid z_i = 1, z_j = 0\} - \left( \Pr\{\hat{y}_i > \hat{y}_j \mid z_i = 1, z_j = 0\} + \frac{1}{2}\Pr\{\hat{y}_i = \hat{y}_j \mid z_i = 1, z_j = 0\} \right). \qquad (14)$$
By partitioning the space of uncalibrated estimates p_in, one can write the AUC_Loss as follows:
$$AUC_{Loss} = \sum_{K \ne L} \Big( \Pr\{y_i > y_j,\, y_i \in B_K,\, y_j \in B_L \mid z_i = 1, z_j = 0\} - \Pr\{\hat{y}_i > \hat{y}_j,\, y_i \in B_K,\, y_j \in B_L \mid z_i = 1, z_j = 0\} \Big)$$
$$+ \sum_{K} \Big( \Pr\{y_i > y_j,\, y_i \in B_K,\, y_j \in B_K \mid z_i = 1, z_j = 0\} + \frac{1}{2}\Pr\{y_i = y_j,\, y_i \in B_K,\, y_j \in B_K \mid z_i = 1, z_j = 0\} - \frac{1}{2}\Pr\{\hat{y}_i = \hat{y}_j,\, y_i \in B_K,\, y_j \in B_K \mid z_i = 1, z_j = 0\} \Big) \qquad (15)$$
Here we make the following reasonable assumption, which simplifies our calculations:

• Assumption: $\hat{\theta}_i \ne \hat{\theta}_j$ if $i \ne j$.

We will show that the first summation part in Equation (15) is less than or equal to zero, and that the second summation part goes to zero at the rate of O(1/B).

First summation part

Recall that in the histogram binning method the calibrated estimate is $\hat{y} = \hat{\theta}_K$ if $y \in B_K$. Also, notice that if $y_i \in B_K$, $y_j \in B_L$, and $K > L$, then certainly $y_i > y_j$. Using these facts, we can rewrite the first summation part of Equation (15) as
$$Loss_1 = \sum_{K > L} \Pr\{y_i \in B_K, y_j \in B_L \mid z_i = 1, z_j = 0\} - \sum_{K \ne L} \Pr\{\hat{\theta}_K > \hat{\theta}_L,\, y_i \in B_K,\, y_j \in B_L \mid z_i = 1, z_j = 0\}. \qquad (16)$$
We can rewrite this as
$$Loss_1 = \sum_{K > L} \Big( \Pr\{y_i \in B_K, y_j \in B_L \mid z_i = 1, z_j = 0\} - \Pr\{\hat{\theta}_K > \hat{\theta}_L,\, y_i \in B_K,\, y_j \in B_L \mid z_i = 1, z_j = 0\} - \Pr\{\hat{\theta}_L > \hat{\theta}_K,\, y_i \in B_L,\, y_j \in B_K \mid z_i = 1, z_j = 0\} \Big). \qquad (17)$$
Next, using Bayes' rule and omitting the denominators common to all terms, we have
$$Loss_1 \propto \sum_{K > L} \Big( \Pr\{z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_L\} - \Pr\{\hat{\theta}_K > \hat{\theta}_L,\, z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_L\} - \Pr\{\hat{\theta}_L > \hat{\theta}_K,\, z_i = 1, z_j = 0 \mid y_i \in B_L, y_j \in B_K\} \Big) \times \Pr\{y_i \in B_K, y_j \in B_L\}. \qquad (18)$$
We next show that the term inside the parentheses in Equation (18) is less than or equal to zero, using the i.i.d. assumption and the notation of Section 2:
$$IT = \theta_K(1 - \theta_L) - I\{\hat{\theta}_K > \hat{\theta}_L\}\,\theta_K(1 - \theta_L) - I\{\hat{\theta}_L > \hat{\theta}_K\}\,\theta_L(1 - \theta_K). \qquad (19)$$
If $\hat{\theta}_K > \hat{\theta}_L$, then IT is exactly zero. If $\hat{\theta}_L > \hat{\theta}_K$, then
$$IT = \theta_K(1 - \theta_L) - \theta_L(1 - \theta_K) \simeq \hat{\theta}_K(1 - \hat{\theta}_L) - \hat{\theta}_L(1 - \hat{\theta}_K) \le 0, \qquad (20)$$
where the last inequality holds with high probability by the concentration results for $\hat{\theta}_i$ and $\theta_i$ in Equation (7).

Second summation part
Using the fact that in the second summation part $\hat{y}_i = \hat{y}_j = \hat{\theta}_K$, we can rewrite the second summation part as
$$Loss_2 = \sum_{K} \Big( \Pr\{y_i > y_j,\, y_i \in B_K,\, y_j \in B_K \mid z_i = 1, z_j = 0\} + \frac{1}{2}\Pr\{y_i = y_j,\, y_i \in B_K,\, y_j \in B_K \mid z_i = 1, z_j = 0\} - \frac{1}{2}\Pr\{y_i \in B_K,\, y_j \in B_K \mid z_i = 1, z_j = 0\} \Big)$$
$$\le \sum_{K} \Big( \Pr\{y_i \in B_K, y_j \in B_K \mid z_i = 1, z_j = 0\} - \frac{1}{2}\Pr\{y_i \in B_K, y_j \in B_K \mid z_i = 1, z_j = 0\} \Big) = \frac{1}{2}\sum_{K} \Pr\{y_i \in B_K, y_j \in B_K \mid z_i = 1, z_j = 0\}. \qquad (21)$$
Using Bayes' rule and the i.i.d. assumption on the data, we can rewrite Equation (21) as
$$Loss_2 \le \frac{1}{2}\,\frac{\sum_K \Pr\{z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_K\}\,\Pr\{y_i \in B_K\}\Pr\{y_j \in B_K\}}{\Pr\{z_i = 1, z_j = 0\}}$$
$$= \frac{1}{2}\,\frac{\sum_K \Pr\{z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_K\}\,\eta_K^2}{\sum_{K,L} \Pr\{z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_L\}\,\eta_K \eta_L} = \frac{1}{2}\,\frac{\sum_K \Pr\{z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_K\}}{\sum_{K,L} \Pr\{z_i = 1, z_j = 0 \mid y_i \in B_K, y_j \in B_L\}}, \qquad (22)$$
where the last equality uses the fact that η_K and η_L are concentrated around their empirical estimates η̂_K and η̂_L, which are equal to 1/B by construction (we build our histogram model with equal-frequency bins).

Using the i.i.d. assumption about the calibration samples, we can rewrite Equation (22) as
$$Loss_2 \le \frac{1}{2}\,\frac{\sum_K \Pr\{z_i = 1 \mid y_i \in B_K\}\,\Pr\{z_j = 0 \mid y_j \in B_K\}}{\sum_K \Pr\{z_i = 1 \mid y_i \in B_K\} \times \sum_L \Pr\{z_j = 0 \mid y_j \in B_L\}} = \frac{\sum_{k=1}^{B} \theta_k (1 - \theta_k)}{2 \sum_{k=1}^{B} \theta_k \times \sum_{l=1}^{B} (1 - \theta_l)} \le \frac{1}{2B}, \qquad (23)$$
where the last inequality follows from the fact that the order of $\{(1 - \theta_1), \dots, (1 - \theta_B)\}$ is completely reversed in comparison to the order of $\{\theta_1, \dots, \theta_B\}$, together with Chebyshev's sum inequality:

Theorem 7.1 (Chebyshev's sum inequality). If $a_1 \le a_2 \le \dots \le a_n$ and $b_1 \ge b_2 \ge \dots \ge b_n$, then
$$\frac{1}{n}\sum_{k=1}^{n} a_k b_k \le \left(\frac{1}{n}\sum_{k=1}^{n} a_k\right)\left(\frac{1}{n}\sum_{k=1}^{n} b_k\right).$$

The facts proved above about Loss_1 and Loss_2, in Equations (20) and (23), show that the worst-case AUC_Loss of the histogram binning calibration method is upper bounded by O(1/B).
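For reference, the empirical AUC of Equation (11) and the AUC loss of Equation (13) can be computed directly; this is a small illustration of ours, not code from the paper:

```python
import numpy as np

def empirical_auc(y, z):
    """Empirical AUC as in Equation (11): the fraction of positive-negative
    pairs ranked correctly, counting ties as one half."""
    y, z = np.asarray(y), np.asarray(z)
    pos, neg = y[z == 1], y[z == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# AUC loss of a calibration map, as in Equation (13):
# auc_loss = empirical_auc(y_raw, z) - empirical_auc(y_calibrated, z)
```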
Remark
It should be noticed that the above proof shows the worst-case AUC loss, in the presence of a large number of training data points, to be bounded by O(1/B). However, it is possible to even gain AUC power by using the histogram binning calibration method, as happened when we applied calibration to the linear SVM model on our simulated dataset.

References

[1] Shivani Agarwal, Thore Graepel, Ralf Herbrich, Sariel Har-Peled, and Dan Roth. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6:393–425, 2005.
[2] C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pages 1152–1174, 1974.
[3] M. Ayer, H.D. Brunk, G.M. Ewing, W.T. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. The Annals of Mathematical Statistics, pages 641–647, 1955.
[4] M.H. DeGroot and S.E. Fienberg. The comparison and evaluation of forecasters. The Statistician, pages 12–22, 1983.
[5] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. New York: Springer, 1996.
[6] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, pages 577–588, 1995.
[7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[8] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pages 209–230, 1973.
[9] Jussi Klemelä. Multivariate histograms with data-dependent partitions. Statistica Sinica, 19(1):159–176, 2009.
[10] K. Kurihara, M. Welling, and N. Vlassis. Accelerated variational Dirichlet process mixtures. Advances in Neural Information Processing Systems, 19:761, 2007.
[11] S.N. MacEachern and P. Müller. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, pages 223–238, 1998.
[12] Uwe F. Mayer and Armand Sarkissian. Experimental design for solicitation campaigns. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 717–722. ACM, 2003.
[13] Aditya Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, and Lucila Ohno-Machado. Predicting accurate probabilities with a ranking loss. arXiv preprint arXiv:1206.4661, 2012.
[14] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
[15] J. Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[16] Clayton Scott and Robert Nowak. Near-minimax optimal classification with dyadic classification trees. Advances in Neural Information Processing Systems, 16, 2003.
[17] Bernard W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26. Chapman & Hall/CRC, 1986.
[18] L. Wasserman. All of Nonparametric Statistics. Springer, 2006.
[19] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 609–616, 2001.
[20] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699, 2002.
[21] Bianca Zadrozny and Charles Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.