Classifier Calibration: with implications to threat scores in cybersecurity
Waleed A. Yousef a,∗, Issa Traoré b, William Briguglio c

a ECE Dep., University of Victoria, Victoria, Canada; and CS Dep., Helwan University, Cairo, Egypt.
b ECE Dep., University of Victoria, Victoria, Canada.
c ECE Dep., University of Victoria, Victoria, Canada.
Abstract
This paper explores the calibration of a classifier output score in binary classification problems. A calibrator is a function that maps the arbitrary classifier score of a testing observation onto [0, 1] to provide an estimate of the posterior probability of belonging to one of the two classes. Calibration is important for two reasons: first, it provides a meaningful score, namely the posterior probability; second, it puts the scores of different classifiers on the same scale for comparable interpretation. The paper presents three main contributions: (1) introducing multi-score calibration, where more than one classifier provides a score for a single observation; (2) introducing the idea that the classifier scores fed to a calibration process are nothing but features to a classifier, and hence proposing to extend the classifier scores to higher dimensions to boost the calibrator's performance; and (3) conducting a massive simulation study, on the order of 24,000 experiments, that incorporates different configurations, in addition to experimenting on two real datasets from the cybersecurity domain. The results show that there is no overall winner among the different calibrators and different configurations. However, general advice for practitioners includes the following: Platt's calibrator [22], a version of logistic regression that decreases bias for a small sample size, has a very stable and acceptable performance across all experiments; our suggested multi-score calibration provides better performance than single-score calibration in the majority of experiments, including the two real datasets. In addition, extending the scores can help in some experiments.
Keywords:
Classification, Score Calibration, Threat Metrics, Cybersecurity.
1. Introduction
Scoring is broadly used in cybersecurity to measure attributes and outcomes of security mechanisms (e.g., intrusion detection, biometric authentication) and underlying phenomena (e.g., threats, vulnerabilities, intrusions). Metrics such as threat, vulnerability, and intrusion scores are not only used to assess the severity of the corresponding phenomena; they are also used for security decision-making. For instance, biometric technologies involve computing a matching score that is compared against a threshold and used to make crucial authentication or identification decisions. In practice, the scores are computed using ad-hoc or subjective methods, for example, using heuristics or relying on human expertise. A good example of these kinds of scores is the common vulnerability scoring system (CVSS), which is one of the prominent vulnerability scoring schemes currently available. A more rigorous alternative to ad-hoc scores is using probability scores for decision-making. The initial motivation behind the present article was to find a systematic approach to convert the confidence score into probability values that are ready to be used in cybersecurity models. In practice, each decision model is a classifier in its own right.

From the point of view of machine learning (ML), each decision model is a classifier, and each score value given is the decision score of that classifier, to be compared with a threshold value to ultimately provide a hard binary decision. Converting classifier scores to probability measures is called classifier calibration. Having more than one data source amounts to having more than one classifier providing scores for the same problem; this is called a multiple classifier system (MCS) in ML terminology. Therefore, in the present article we treat the classifier calibration problem, and design the experiments on both simulated and real datasets, to accommodate the whole ML community in general and the cybersecurity domain in particular.
∗ Corresponding Author
Email addresses: [email protected], [email protected] (Waleed A. Yousef), [email protected] (Issa Traoré), [email protected] (William Briguglio)

If a random vector X and a random variable (r.v.) Y have a joint probability density f_{X,Y}(x, y), and Y is qualitative (or categorical) with only two possible values (class ω₁ or class ω₀), the binary classification problem is defined as follows: how to find the classification rule Ŷ = η(X) that can predict the class Y (the response) from an observed value of the variable X (the predictor). It is known that the best prediction function (classifier) that minimizes the risk is given by

η(x): h(x) ≷_{ω₀}^{ω₁} th,  (1a)
h(x) = log [ f_X(X = x | ω₁) / f_X(X = x | ω₀) ].  (1b)

The log-likelihood ratio (LHR) h(x) acts as the decision value, discriminant value, or score. Now, if the joint distribution is unknown, a classification function is modeled parametrically or nonparametrically, and a training sample is used to build that model. The modeled decision function (or scoring function) h_tr(X) is built using the training dataset tr, and the final classification rule takes the form (1a). Of course, it is then no longer the LHR, nor the optimal classification rule that minimizes the risk.

The scoring function h_tr(X) is usually compared with a chosen threshold value th to give the final binary decision. One common way of assessing the decision (scoring) function h_tr(X) is via the two types of error: e₁ = ∫_{−∞}^{th} f_h(h(x) | ω₁) dh(x) is the probability of classifying a case as belonging to class 0 when it belongs to class 1, called the false negative fraction (FNF); e₀ = ∫_{th}^{∞} f_h(h(x) | ω₀) dh(x) is the opposite, called the false positive fraction (FPF). The error is not a sufficient measure, because it is a function of a single fixed threshold. A more general way to assess a classifier is provided by the receiver operating characteristic (ROC) curve, a plot of the true positive fraction (TPF = 1 − FNF) versus the FPF under different threshold values. The area under the ROC curve (AUC) can be thought of as a summary measure of the ROC curve. Formally, AUC = ∫ TPF d(FPF), which is the same as Pr(h(x | ω₀) < h(x | ω₁)), and which expresses the separation between the two sets of decision scores h(X | ω₀) and h(X | ω₁). In a practical setup, only a finite testing dataset ts = {x_i, y_j | i = 1, ..., n₁, j = 1, ..., n₀} is available to estimate the performance measures. The uniform minimum variance unbiased estimator (UMVUE) of the AUC, under nonparametric assumptions, is given by the Mann-Whitney statistic [2, 10, 24, 31]:

ÂUC = (1 / n₁n₀) Σ_{i=1}^{n₁} Σ_{j=1}^{n₀} ψ( h(x_i | ω₁), h(x_j | ω₀) ),  (2a)
ψ(a, b) = 1 if a > b;  1/2 if a = b;  0 if a < b.  (2b)
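To make the estimator (2) concrete, the following minimal numpy sketch (the function name and vectorized form are ours, not from the paper's code) computes the Mann-Whitney statistic from the two sets of scores:

```python
import numpy as np

def auc_mann_whitney(scores_pos, scores_neg):
    """Nonparametric AUC estimate of Eq. (2a): average of the kernel psi
    of Eq. (2b) over all (positive, negative) score pairs, ties counting 1/2."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]  # h(x_i | omega_1)
    neg = np.asarray(scores_neg, dtype=float)[None, :]  # h(x_j | omega_0)
    psi = (pos > neg) + 0.5 * (pos == neg)              # Eq. (2b), broadcast n1 x n0
    return psi.mean()                                   # (1/(n1*n0)) * double sum
```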
In many applications, it is desirable that the classifier output score correspond to the posterior probability Pr(ω_i | X = x), i = 0, 1. Suppose that we have two classifiers giving scores of 3.5 and 17.4, respectively, for the same observation. Not only is each score non-informative on its own, but the two scores also cannot be combined in a direct way, because they are on two different scales owing to the particular design process of each classifier. To wit, in cybersecurity, the score of a classifier—either an intrusion detection system (IDS) or an anti-virus (AV)—should correspond directly to the probability of, say, being malicious (ω₁) given the feature vector X = x. This allows for the interpretation and comparison of different scores. The connection between the posterior probability and the score likelihood is straightforward through Bayes' theorem, as follows:

Pr(ω₁ | h) / Pr(ω₀ | h) = [ f_h(h | ω₁) Pr(ω₁) / Pr(h) ] / [ f_h(h | ω₀) Pr(ω₀) / Pr(h) ].  (3)

After solving for Pr(ω₁ | h), giving it the shorthand notation p, we get:

p(h) = 1 / ( 1 + L⁻¹ (1 − π) / π ),  (4a)
L = f_h(h | ω₁) / f_h(h | ω₀),  (4b)
π = Pr(ω₁).  (4c)

An aside technical point to mention is that the probability p(h) in (4a) is the posterior probability Pr(ω = ω₁ | h) given a score value h, not the posterior probability Pr(ω = ω₁ | X = x) given a feature vector X = x. This may only be important for theoretical treatment in cases where the score function h(X) assigns the same value to two or more vectors X.
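Computing (4a) is a one-liner once estimates of the likelihood ratio and the prior are in hand; a minimal sketch (ours) for reference:

```python
def posterior_from_lhr(L, pi):
    """Eq. (4a): posterior Pr(omega_1 | h) from the likelihood ratio
    L = f(h | omega_1) / f(h | omega_0) of Eq. (4b) and the prior pi = Pr(omega_1)."""
    return 1.0 / (1.0 + (1.0 - pi) / (pi * L))
```

For example, a likelihood ratio L = 1 (a completely non-informative score) returns exactly the prior π, as it should.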
The problem of classifier calibration (surveyed in Sec. 2) is designing the estimator p̂(h) to be as "close" as possible to the true value p(h) in Eq. (4). Because p ∈ [0, 1] is no longer an indicator function, as in the original classification problem, the two common accuracy measures used in the literature are the mean square error (MSE) and the Brier score, defined as:

MSE(p̂, p) = E_h[ (p̂(h) − p(h))² ],  (5a)
Brier(p̂, p) = E_h[ (p̂(h) − y)² ],  (5b)

where y (= 0, 1) is the true class label of the observation with score h. However, we always prefer the root version of each measure, RMSE(p̂, p) = √MSE(p̂, p) and RB(p̂, p) = √Brier(p̂, p), because it has the same units as the estimand, not their square. For a black-box trained classifier, with no knowledge of the probability distributions of the scores, at most we can have access to a labeled dataset, from which we can construct the set

H = { (h_i, y_i) | i = 1, ..., n },  (6)

where h_i is the score of the i-th observation and y_i is its label. Then, the calibration performance measures can be estimated by:

R̂MSE(p̂, p) = √[ (1/n) Σ_i (p̂(h_i) − p(h_i))² ],  (7a)
R̂B(p̂, p) = √[ (1/n) Σ_i (p̂(h_i) − y_i)² ].  (7b)

However, because p(h_i) requires knowledge of the parametric distributions (per Eq. (4)), the estimate (7a) is of practical use only for simulated datasets, whereas the estimate (7b) is suitable for both simulated and real datasets. This is a trade-off between what we need, i.e., the error p̂(h_i) − p(h_i), and what we have, i.e., the error p̂(h_i) − y_i.

The following observation is quite interesting and is the basis for Sec. 3, which extends the literature on calibration. From a purely mathematical point of view, the function p̂(h) that calibrates the classifier output h(X), itself obtained from the feature vector X, is a classifier that trains on the dataset (6) with the single feature h(X) and output score function p̂(h). Moreover, one of the calibration methods is the logistic regression classifier. In the present article, we use this observation and use logistic regression to: (1) introduce multi-score calibration, showing how to provide a single calibrated probability score from multiple scores, and (2) suggest extending the score(s) of the single (or multiple) classifier(s) to higher dimensions to boost the calibration accuracy of the logistic regression. In the sequel, and to avoid any confusion in terminology, we exclusively reserve the terms classifier and score/scoring function to mean h(X) and use them interchangeably as needed by context. On the other hand, we use the term calibrator to mean p̂, which converts the initial score(s) to a calibrated probability score.

The rest of the article is organized as follows: Sec. 2 is a literature review of the main methods of calibration. In Sec. 3, we formalize the single- and multi-score calibration problem with more mathematical rigor, and we detail the contributions by which we extend the literature. Sec. 4 is a comprehensive set of experiments on simulated datasets and on two real datasets from the cybersecurity domain. The experiments establish a systematic comparison among all known calibrators under different configurations, while reproducing previous results published in [3, 4] for the sake of comparison. In Sec. 5, we discuss the results further, contemplate the discrepancy between the results on simulated data and real data, and provide final advice for practitioners of ML in general and the network security domain in particular. Finally, Sec. 6 concludes the article.
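Both estimators (7) are simple plug-ins over the set H; a short sketch (our own naming) implements them:

```python
import numpy as np

def rmse_hat(p_hat, p_true):
    """Eq. (7a): requires the true posterior p(h_i), hence simulated data only."""
    p_hat, p_true = np.asarray(p_hat), np.asarray(p_true)
    return np.sqrt(np.mean((p_hat - p_true) ** 2))

def rb_hat(p_hat, y):
    """Eq. (7b): root Brier score against the 0/1 labels; usable on real data."""
    p_hat, y = np.asarray(p_hat), np.asarray(y)
    return np.sqrt(np.mean((p_hat - y) ** 2))
```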
2. Literature Review and Related Work
Before reviewing the literature on score calibration, we first ponder equation (4). Modeling the calibration function p(h): ℝ ↦ [0, 1] can be accomplished via several approaches. We provide our view of these general approaches and survey the literature belonging to each of them. We emphasize that our calibration problem deals with a black-box classifier, with no access to its internal structure: only the classifier score along with the true labels, that is, the dataset (6), is available. This should be distinguished from other approaches, e.g., [9], which assume some knowledge of the internal basis kernel of the classifier and use the labeled dataset to estimate some of its parameters.
In the first approach, the two LH functions f_i(h | ω_i), i = 0, 1, are estimated; then (4) is used by plugging in the estimates f̂_i(h | ω_i) along with π̂ = n₁/n. An example of the literature leveraging this approach is [32], where the estimation of the LH is done by the basic binning method (the same method used for drawing histograms), treating the bin width as a tuning parameter that can be obtained via cross-validation (CV). The LH estimate is then given by the relative number of observations belonging to a bin interval B. Formally, for x belonging to a bin interval B,

f̂_i(x | ω_i) = (1/n_i) Σ_{j=1}^{n_i} I(x_j ∈ B),  i = 0, 1.  (8)

In the second approach, the LHR (4b) (not the individual LH functions) is modeled; then, similarly to the above, this estimate, along with the estimate of the prior probability, is plugged into (4a). There is, as well, a wide range of statistical literature on modeling the LHR of two-sample data. However, this literature has not been leveraged in the score calibration problem except by [3, 4], where the authors used the PROPROC software for fitting the ROC of two-sample data to estimate the LHR (see [20]). They call this method the semi-parametric approach, because the parametric model used involves some internal latent variables. The details are out of the scope of the present article and can be found in the literature on ROC modeling of two-sample data that was well developed and established by Metz et al. over the previous two decades, for example in [7, 8, 15, 16, 18, 20, 21, 25].
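A minimal sketch of the first (binning) approach, plugging the histogram estimate (8) into (4a); the function name, the fixed bin count, and the numerical guards are ours, and in practice the number of bins would be tuned by CV as described above:

```python
import numpy as np

def binning_calibrator(h_train, y_train, n_bins=10):
    """Histogram (binning) estimates of the two class-conditional score
    densities (Eq. (8)), combined into a posterior via Eq. (4a)."""
    edges = np.histogram_bin_edges(h_train, bins=n_bins)
    f1, _ = np.histogram(h_train[y_train == 1], bins=edges, density=True)
    f0, _ = np.histogram(h_train[y_train == 0], bins=edges, density=True)
    pi = np.mean(y_train == 1)                        # pi_hat = n1 / n

    def p_hat(h):
        # Scores outside the training range are clipped to the edge bins.
        idx = np.clip(np.searchsorted(edges, h, side="right") - 1, 0, n_bins - 1)
        L = f1[idx] / np.maximum(f0[idx], 1e-12)      # binwise LHR, Eq. (4b)
        return 1.0 / (1.0 + (1.0 - pi) / (pi * np.maximum(L, 1e-12)))

    return p_hat
```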
In the third approach, the posterior probability (4a) is directly modeled. This is a typical regression problem that can be solved parametrically or nonparametrically. [33] seems to be the first to propose nonparametric regression for modeling the posterior probability directly from score values. The authors proposed using isotonic regression, an optimization method that minimizes the square loss under the constraint that the resulting regression function is monotonic (not strictly monotonic, a distinction made in Definition 1). Therefore, the method provides a non-smooth regression curve with piece-wise constant segments. [22] suggested using logistic regression to model the posterior probability from scores. Their initial motivation was to model the score of the SVM in particular. The modeled regression function is the typical smooth sigmoid function. However, they used a slightly modified version of the standard logistic regression to decrease the bias for a small sample size. It is worth mentioning that [9] follows [22] but with the additional ingredient of estimating some parameters related to the basis kernel of the classifier. As introduced above, this is a bit different from our assumption of knowing nothing about the internal structure of the classifier.
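Both direct-posterior methods are available off the shelf; a hedged sketch using scikit-learn (a plain logistic fit rather than Platt's bias-corrected variant, with synthetic arrays standing in for a real labeled score set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
h = rng.normal(size=200)                              # classifier scores (stand-in)
y = (h + 0.5 * rng.normal(size=200) > 0).astype(int)  # 0/1 labels (stand-in)

# Sigmoid fit of the posterior, in the spirit of [22]; Platt's version
# additionally smooths the 0/1 targets to reduce small-sample bias.
logreg = LogisticRegression().fit(h.reshape(-1, 1), y)
p_logistic = logreg.predict_proba(h.reshape(-1, 1))[:, 1]

# Monotonic, piece-wise constant fit, as in [33].
iso = IsotonicRegression(out_of_bounds="clip").fit(h, y)
p_isotonic = iso.predict(h)
```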
There are a few studies, for example [3, 4, 17], that compare the methods surveyed above. These studies assume different score distributions, either by explicitly specifying the distribution and playing with its parameters or by producing the scores from classifiers trained on different datasets. The conclusion is that there is no overall winner over all the score distributions; however, the logistic fit [22] and the semiparametric fit [3] approaches are very acceptable. We would like to point out that, aside from the calibration accuracy, logistic regression is quite easy to use because it is available in almost any scientific computing environment (SCE) and does not require obtaining and integrating code into a particular SCE, in contrast to PROPROC.
3. Score Calibration: formalization
Given a classifier that provides a scoring function h(X), calibration seeks a function that maps h(X = x) to Pr(ω₁ | h). In this section, we first formalize the calibration problem with more mathematical rigor (Sec. 3.1); then, we propose how to calibrate multiple scores and how to extend the available score(s) to higher dimensions to boost the calibration accuracy of both single and multiple scores (Sec. 3.2). We also discuss attempts at calibrating multiple scores without access to labeled datasets (Sec. 3.3).

Definition 1 (Calibration Function). A calibration function is a monotonically increasing mapping from ℝ to [0, 1] that maps a classifier output score to a posterior probability.

This formal definition verbally means the following: consider two observations x₁ and x₂, belonging to any set (not necessarily ℝ), along with their classifier output scores h(x₁) < h(x₂). The calibration function p(h) that estimates the posterior probabilities should satisfy p(h₁) < p(h₂) under a strictly monotonic assumption and p(h₁) ≤ p(h₂) under a monotonic assumption. This distinction is made formally here to conform with some methods in the literature, discussed in Sec. 2. The ROC and AUC are invariant under a strictly monotonic calibration (Corollary 2). This means that the original classifier score and its calibrated score have the same ROC and the same classification accuracy at two equivalent threshold values. The mathematical details are deferred to the Appendix.

Table 1: Truth table of the eight possible combinations of the true label of an observation along with the corresponding outputs of two binary classifiers. The last two columns of the table are the simple
AND and OR classifier combinations. Neither the very aggressive AND rule nor the very loose OR rule always improves the final classification decision; it depends on when the two classifiers agree/disagree, that is, on the joint PDF of both classifiers' scores.

Suppose we have access to more than one classifier instead of just a single classifier. Then, much like being able to generate the dataset (6), we can generate the following dataset:

H_{1,2} = { (h₁(x_i), h₂(x_i), y_i) | y_i ∈ {ω₀, ω₁}, i = 1, ..., n }.  (9)

This dataset can be used to produce a calibrator either in a direct approach or in a training-and-testing approach. This should boost both the classification and calibration accuracies of the resulting calibrator. These two approaches are elaborated on in the following two subsections. Although the material presented here treats combining two classification scores, it is immediately extensible to K ≥ 2 classifiers, with the dataset H_{1,...,K} extending (9).

The scores h₁(x), h₂(x) of the dataset H_{1,2} have a 2D PDF. Formally, and similar to (4), we can relate the posterior probability to this joint PDF by:

Pr(ω₁ | h₁, h₂) = 1 / ( 1 + L⁻¹ (1 − π)/π ),  (10)
L = f_{h₁,h₂}(h₁, h₂ | ω₁) / f_{h₁,h₂}(h₁, h₂ | ω₀),  π = Pr(ω₁).

Therefore, the three approaches for calibrating a single classifier (Sec. 2) apply immediately. However, the logistic regression approach may be preferred over the other two, because it is a stable procedure and applying it to a 2D dataset is straightforward. The other two approaches face some hurdles: modeling the 2D LH f(h₁, h₂) may not be very accurate under a small sample size n, and modeling the LHR through the ROC technology is not yet available as a ready software package and would require new development.

We can treat the calibration problem as—and indeed it is—a learning problem, where the training dataset is the labeled two-feature dataset H_{1,2} and the model is logistic regression. Therefore, we do not have to restrict ourselves to the two predictors h₁ and h₂; rather, we can extend them to a higher-dimensional space by including, for example, h₁², h₂², h₁h₂. The new dimensionality will be p, including the original two features. The optimal p is chosen by CV, stage-wise forward regression, best subset selection, or regularization, among many other readily available techniques from the literature on statistical learning. In this elaborate approach, not only will the output be calibrated to the appropriate posterior probability (10), but the overall classification accuracy will also be better than that achieved by each individual feature (classifier output) h_i, as sketched below.
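A minimal sketch of this multi-score calibration with a second-degree extension, using scikit-learn; the data here are synthetic stand-ins, and the pipeline is one possible realization, not the paper's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 2))              # two classifier scores (h1, h2) per case
y = (H.sum(axis=1) > 0).astype(int)        # 0/1 labels (stand-in)

# Multi-score calibration on the plain scores (LogReg) ...
logreg = LogisticRegression().fit(H, y)

# ... and on the extended space h1, h2, h1^2, h1*h2, h2^2 (LogRegExt).
logreg_ext = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(H, y)
p_multi = logreg_ext.predict_proba(H)[:, 1]   # calibrated Pr(omega_1 | h1, h2)
```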
There have been many attempts to combine classifiers from an unlabeled dataset. Consider the existence of two black-box trained classifiers, along with their accuracies, without the availability of a labeled dataset; therefore, the set H_{1,2} defined in (9) is not available either. What is available is only the pairs (h₁(x_i), h₂(x_i)) ∀i, produced by directly testing the classifiers on the observations x_i. Before investigating how the two scores can be combined, which has been done (as will be discussed in Sec. 3.3.2), we need to discuss some important aspects.

3.3.1. Known But Ignored Facts

Indeed, h₁ and h₂ are not only two dependent random variables (r.v.s); they are also correlated, because both are designed to produce high (low) scores for malicious (normal) activities. It should be clear that the higher their accuracies, the stronger their correlation, because the majority of the time they will agree on the right decision—a consequence of high accuracies. Combining these scores in general, and estimating the posterior in particular, is theoretically impossible without any further assumptions about their joint PDF. This is clear from the posterior expression (10). For more intuition, consider the counterexample depicted in Table 1. The table represents the truth table of the binary class ω of an observation x along with the outputs h₁(x) and h₂(x) of two binary classifiers. It is clear that the simple AND rule wins over the simple OR rule in two cases (the green cases), whereas the situation is reversed in the other two cases. In four cases, when the classifier outputs agree, the two rules have no effect on the final decision. However, under some assumptions on the classifiers' outputs and their dependence/correlation, some rules can prove superior to others. This is known, and we tersely summarize it in Sec. 3.3.2.

Combining the outputs of two classifiers, whether each output is a posterior or a mere score, and where there is no labeled dataset such as H_{1,2} in (9), was studied more than 20 years ago. It started as a very ad-hoc field, with naive combination rules being suggested, for example sum, product, and so forth. There was no mathematical justification for the underlying assumptions under which these rules should work. In addition, the probabilistic dependence between the outputs of the classifiers (explained above) was completely ignored. As Hastie et al. put it, "there is a vast and varied literature often referred to as combining classifiers which abounds in ad-hoc schemes for mixing methods of different types to achieve better performance. For a principled approach, see Kittler et. al. (1998)." [11, p. 624]. In [12], they justified the mathematical assumptions under which the simple combination rules should work. Afterwards, [13] derived the errors of these combination rules for some particular probability distributions.

Knowing only the accuracies of the classifiers, and with zero knowledge of and no assumptions on the joint density function of the classifier outputs, we do not see any mathematical justification for combining the output scores except from a simple Bayesian point of view. This is pursued by assigning a subjective prior to each classifier proportional to its relative accuracy. Said differently, the accuracy in this context plays the role of a subjective belief (or trust) that justifies taking the classifier's output seriously. The mathematical argument is explained in [11, Sec. 8.8] and goes as follows: suppose each classifier η_k, k = 1, ..., K, is already calibrated and produces the output Pr(ω₁ | x, η_k) for an observation x; then

Pr(ω₁ | x) = Σ_k Pr(ω₁ | x, η_k) Pr(η_k | x),  (11a)
Pr(η_k | x) = A_k / Σ_{k′} A_{k′},  (11b)

where Pr(η_k | x) is the subjective prior assigned to the k-th classifier and A_k is its accuracy. However, the subjective priors (11b) can be any other function that is monotonic in accuracy and sums to 1 (e.g., [14] raise accuracies to an arbitrary power).
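A small sketch of the averaging (11); the example accuracies are hypothetical:

```python
import numpy as np

def combine_posteriors(posteriors, accuracies):
    """Eq. (11a): mixture of K calibrated posteriors Pr(omega_1 | x, eta_k),
    with subjective priors proportional to the accuracies A_k (Eq. (11b))."""
    posteriors = np.asarray(posteriors, dtype=float)
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()                          # Eq. (11b)
    return float(w @ posteriors)             # Eq. (11a)

# Two calibrated classifiers, accuracies 0.96 and 0.90:
print(combine_posteriors([0.80, 0.60], [0.96, 0.90]))  # ~0.703
```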
4. Experiments and Results
In this section, we conduct experiments using simulated datasets and two popular public datasets from the cybersecurity domain.
We leverage the generalized lambda distribution (GLD), introduced in [23], to generate a wide variety of score distributions. We follow [28, 30] in using the GLD, but in a more systematic way. The GLD is a univariate unimodal distribution family with a set of four population parameters; we use the notation L(λ₁, λ₂, λ₃, λ₄) to denote such a distribution. These parameters determine the characteristics and shape of the distribution, for example location, skewness, etc. By adjusting these population parameters λ_i, i = 1, ..., 4, we can generate three basic distributions that are symmetric, skewed left, and skewed right. In addition, we add the normal distribution N(µ, σ). For the score calibration problem, we have two distributions F₀ and F₁; each can be chosen from this list of four distributions, which results in 16 pairs. For each of these 16 pairs, without loss of generality, we fix the mean of F₀ to zero; we set the standard deviation to 1; and we adjust the location parameter of F₁ to obtain a desired separability (measured in AUC) between the two distributions; namely, we choose AUC = 0.65, 0.7, 0.9. This is done to express the different discriminating powers of a particular classifier. Figure 1 lists the parameters of the basic four distributions and illustrates the 16 pairs; each pair is illustrated by three AUC values. Each of these 16 pairs is given an index of 1, ..., 16, which is the same index used on the x-axis of Figure B.3, explained later.
This index is not only a mere ID assigned to each pair; rather, it has an important ordering significance that will be explained when explaining that figure.

Figure 1: The parameters of the basic four distributions {a, b, c, d}: a = L(λ₁, −0.1125, −0.1359, −0.1359) (GLD symmetric), b = L(λ₁, 0.014, 0.009695, 0.0285) (GLD left-skewed), c = L(λ₁, 0.014, 0.0285, 0.009695) (GLD right-skewed), and d = N(µ, 1) (normal); and the 16 plots of their cross product used for single-score calibration, indexed 1: (c, b), 2: (d, d), 3: (c, c), 4: (b, b), 5: (c, a), 6: (a, a), 7: (a, b), 8: (b, c), 9: (a, c), 10: (b, a), 11: (b, d), 12: (d, b), 13: (c, d), 14: (d, c), 15: (a, d), 16: (d, a). Each plot is labeled x: (α, β), x = 1, ..., 16, α, β ∈ {a, b, c, d}, where x is the index used in plotting Figure B.3. α is the distribution F₀, with location parameter (λ₁ or µ) set to zero (plotted in black). β is the distribution F₁, with its location parameter set to three different values to give a separation from F₀ of AUC = 0.65, 0.7, 0.9 (plotted in blue, green, and red, respectively, in each of the 16 plots). The figure therefore illustrates 16 × 3 = 48 distribution pairs.

A minor technical issue is that, because of the mathematical formalization of the GLD, the parameter values shown in Figure 1 do not give a distribution directly with σ = 1.
It is quite tedious to find the closed form of σ and solve for the λ_i. Rather, we simulate a large sample and divide by the estimated standard deviation. The smoothed histogram of this standardized version is what is plotted in the figure and what is used for further simulations. It is remarkable that plotting these curves with that smoothness, especially for the heavy-tailed pairs, required simulating up to a million observations.
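For concreteness, a sketch of how such samples can be drawn, assuming [23] is the Ramberg–Schmeiser parameterization of the GLD, whose quantile function is Q(u) = λ₁ + (u^λ₃ − (1 − u)^λ₄)/λ₂; the empirical standardization follows the procedure just described:

```python
import numpy as np

def sample_gld(lam, size, rng):
    """Inverse-transform sampling from L(lam1, lam2, lam3, lam4), assuming the
    Ramberg-Schmeiser quantile function Q(u) = lam1 + (u**lam3 - (1-u)**lam4)/lam2."""
    l1, l2, l3, l4 = lam
    u = rng.uniform(size=size)
    return l1 + (u**l3 - (1.0 - u)**l4) / l2

rng = np.random.default_rng(0)
lam_a = (0.0, -0.1125, -0.1359, -0.1359)            # distribution 'a' of Figure 1
sigma = sample_gld(lam_a, 1_000_000, rng).std()     # estimated from a large sample
scores = sample_gld(lam_a, 320, rng) / sigma        # standardized training scores
```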
For each of these 16 × 3 = 48 distribution pairs, we generate training datasets of sizes n = 10 × 2^i, i = 0, ..., 9. A single configuration (or experiment) is described as follows: we select a pair out of the 48, generate n scores for each class, train each calibrator on this training dataset, and finally test and assess each calibrator on (a) the same training dataset, to get both RMSE_sub,m and RB_sub,m (resubstitution), and (b) a pseudo-infinite testing set of 10,000 scores generated from the same distribution pair, to get RMSE_ind,m and RB_ind,m (independent). The subscript m denotes the m-th Monte-Carlo (MC) trial, which is repeated M times to obtain the average E_M(·) = (1/M) Σ_m (·). This accounts for the variability of the training datasets of finite size n.

The calibrators used in each configuration are those discussed in Sec. 2, excluding PROPROC for reasons discussed in Sec. 4.1.2. These calibrators are as follows:

Logistic Regression (LogReg), with three different versions: (1) Platt's version [22], which we implement in Python by literal translation (line by line) of Matlab code written and shared with us by the authors of [3, 4]—our results can then be compared with theirs; (2) the scikit implementation [19]; and (3) the scikit implementation on the extended score space, as we suggest in Sec. 3.2.2. For short, in the sequel, we call these three versions Platt, LogReg, and LogRegExt.
Isotonic Regression (IsoReg) of the scikit package [19].
Binning Method (BinReg), which we implement with five different numbers of bins, the smallest being 10; each bin count is treated as a separate calibrator, giving, together with the three logistic versions and IsoReg, nine calibrators in total.

Figure B.3, subfigures (a)–(b), illustrates RMSE_ind and RMSE_sub for each of the nine calibrators versus the index of the distribution pair explained in Figure 1. There are 30 plots corresponding to the 3 × 10 cross product of the AUC and n values. Similarly, subfigures (c)–(d) illustrate RB. The index of the distribution pair on the x-axis is not just an ID for the pair; rather, it expresses its rank in terms of the average RMSE_ind. This average is calculated over the nine calibrators across the 3 × 10 configurations. Therefore, the overall trend of the curves increases with that index. This is done to visualize the effect of the distribution pair on the performance of the calibrators in a more informative way. Subfigure (e) illustrates the average of each of the four metrics versus the sample size n. This average is taken, conditionally on each n, over all the 16 × 3 configurations. The figures are typeset in LaTeX to control their position at the pixel level, so that when pages scroll, all fixed components, for example axes, ticks, and so forth, look stationary, and only the components carrying information, for example the curves, run as if animated.

Before delving into the details of this figure, it is important to study its summary in subfigure (e). It is obvious that at smaller n, the two versions of logistic regression, LogReg and Platt, behave better than all others, including the third version, LogRegExt. IsoReg comes second to LogReg. Then, all calibrators improve and converge to almost the same value as n increases, with a slight superiority of the nonparametric BinReg. This comes as no surprise, because it is known that the histogram method converges pointwise to the PDF of a r.v. Also, we observe that a smaller number of bins works better at smaller sample sizes n, and vice versa, quite as expected. This summary subfigure is the final message, of practical value to practitioners for whom the only available information in the calibration problem is the sample size n.

We can infer more about the calibrators' behavior by conducting a deeper investigation of subfigures B.3a–B.3d. A pilot view of subfigures B.3a–B.3b reveals that no calibrator is a consistent winner across all configurations. In addition, the majority of calibrators, with a few exceptions, converge to the same value as n increases. It is obvious that the larger the AUC, i.e., the more the classifier is capable of separating the two classes, the smaller the calibration errors. LogRegExt slightly wins across all distribution pairs only at small values of both n and the AUC. It deteriorates at higher AUC values, in contrast to almost all other calibrators. This may be interpreted as follows: at a higher AUC, the two distributions are already separated, and the problem is much easier for LogReg; indeed, it is not worth the extra variance component exhibited by the extended feature space.

A very interesting observation is that, in terms of RMSE, the calibrators' performance gets worse with distribution pairs 11–16. From Figure 1, those are the six pairs involving the normal distribution. Not to accuse the normal distribution: the second-best distribution pair (and the first best for LogReg) is distribution pair 2, both of whose components are normal (d, d). On the other hand, in terms of RB_ind, the distribution pairs hardly affect the calibrators' performance. We think that both observations are very counter-intuitive, and no qualitative interpretation is ready off the shelf. Rather, it requires a decomposition into the components of variance under the corresponding parametric distribution pairs, which is quite out of the scope of the present article and is deferred to future work.

It is obvious that RB_sub is optimistically biased, which complies with the conventional wisdom in the field of ML for any performance measure. However, this wisdom is surprisingly violated when comparing RMSE_sub with RMSE_ind for almost all calibrators. It seems that RMSE_sub is at least the same as, or pessimistically biased relative to, RMSE_ind for most configurations.
A similar observation is reported in [4], where the authors provided their interpretation, with which we agree. Indeed, the Brier score is itself the objective function that is minimized by the different regression methods, which learn from the truth of the class indicator variable. This indicator variable is, as well, the one involved in the Brier score definition (5b). This causes the usual optimistic/pessimistic bias of resubstitution/independent testing. In contrast, for the MSE, the truth of the posterior probability (the variable involved in the MSE definition (5a)) is not involved in the training phase. Hence, the estimated MSE using resubstitution differs from that using independent testing only by the variance exhibited by the finite sample size of the testing dataset.

The PROPROC software version of [20] calculates the LHR (1b) (Eq. (4a) in [20]), which is necessary for calibrating new scores after fitting the ROC curve. This software version was used by the authors of [3, 4]. However, this version is available neither publicly at the official website of the University of Chicago nor by personal communication with the developers and administrative staff. After the inventor of PROPROC, Charles E. Metz, passed away, it seems that the publicly available version of PROPROC was frozen and is no longer updated. Therefore, the best that we can do is compare our nine calibrators to the PROPROC results published in [3, 4] using their limited number of configurations. This requires reproducing their results so that we can compare all methods in the same workflow. Figure B.4 uses their visualization method to illustrate a reproduction of their results on the two distributions they considered, normal and beta; we used the calibrators LogReg, Platt, and IsoReg. It is clear that the two versions of logistic regression, LogReg and Platt, are almost identical and compare well to PROPROC for the normal distribution. However, Platt outperforms the others at small n for the beta distribution; then, all converge to the same performance as n increases. Of course, this comparison is very vulnerable to criticism, because these calibrators may not compare well on other configurations. However, because of the unavailability of PROPROC, no further conclusions can be reached.

As discussed in Sec. 3.2, the only candidate for multi-score calibration is the logistic regression. Therefore, we construct the following set of experiments to study the effect on this calibrator of extending the feature space; that is, we compare LogReg to LogRegExt. Suppose we have two classifiers (scoring functions) h₁, h₂. Because we have two classes ω₀, ω₁, we then have four distributions F_ij, i = 0, 1, j = 1, 2, where F_ij is the distribution of the scoring function h_j under the class ω_i. We choose each of the F_ij from the list of the four basic distributions {a, b, c, d} presented at the top of Figure 1. Therefore, we have a total of 4⁴ = 256 combinations of distribution pairs. We always adjust the parameters of F_0j, j = 1, 2, to have zero mean and unit variance. We adjust the parameters of F_1j, j = 1, 2, to both have unit variance and achieve a desired separation (expressed in terms of the AUC) from F_0j. We consider the three different values of AUC used in the single-score calibration. In addition, if two classifiers provide scores for the same problem, we anticipate some correlation between their scores, because the two classifiers were initially designed to solve the same problem. We account for this by simply assuming a correlation coefficient ρ between the scores h_j | ω₀ ∼ F_0j, j = 1, 2 (i.e., between h₁ and h₂ under ω₀), and likewise between the scores h_j | ω₁ ∼ F_1j, j = 1, 2 (i.e., between h₁ and h₂ under ω₁). We construct this correlation by building the model below. For class ω₀, we consider the two r.v. v₁ ∼ F₀₁ and v₂ ∼ F₀₂; then, we construct the scores by the multivariate transformation

h₁ = v₁,  (12)
h₂ = v₁ ρ + v₂ √(1 − ρ²).  (13)

It is quite straightforward to show that h₁ and h₂ have a correlation coefficient of ρ. Similarly, we construct h₁ and h₂ for class ω₁ after adjusting the parameters for the required AUC.
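A direct transcription of the transformation (12)–(13); the standard-normal draws stand in for the chosen basic distributions, and the values of ρ and n are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.5, 320                            # hypothetical correlation and sample size

v1 = rng.standard_normal(n)                  # v1 ~ F01 (stand-in draw)
v2 = rng.standard_normal(n)                  # v2 ~ F02 (stand-in draw)

h1 = v1                                      # Eq. (12)
h2 = v1 * rho + v2 * np.sqrt(1 - rho**2)     # Eq. (13): corr(h1, h2) = rho
```

Because v₁ and v₂ are independent with unit variance, the linear combination (13) has unit variance and correlation ρ with h₁, regardless of the underlying basic distribution.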
10 points atparticular ρ and AUC. A common heat map of the subfigures codes the sample size n , and is displayed under subfigure(e). The latter summarizes subfigures (a)–(d) by plotting the average of each of the four performance measures for eachcalibrator versus n . The average is taken over all the 4 × × n .From the summary subfigure (e), it is obvious that the extended features enhances the LogRegExt only asymptoticallyand marginally. However, from the detailed figures (a)–(b), it is obvious that in many configurations, the LogRegExt hasa lower error than LogReg (points located below the y = x line.) The majority of these points are red (large n ) and occurat high ρ and both medium and large AUCs. This is not surprising because at a larger sample size, the regression modelcan benefit from the extended feature space without overfitting. At a stronger correlation, there will be more informationgained from the extended feature h h , which will boost LogRegExt. However, this little improvement may be washed outat smaller AUC values because the problem becomes harder. At this low AUC, the two LH functions are very interleaved;hence, the LHR will be close to 1, which gives a posterior close to 0.5 and average Brier score close to 0.5 as well. This isevident from subfigures (c)–(d). We observe again that both RMSE ind and RMSE sub are almost identical. Subfigures (c)–(d)show that in terms of RB, LogReg is better than LogRegExt in almost all configurations and that the extended features didnot boost the latter except at very small fraction of configurations with very little improvements. To test its utility, we compare the multi-score calibration of logistic regression to its single-score calibration. In bothcases, we compare its performance with and without feature extension. We explain how we establish and visualize thiscomparison for LogReg, and the same applies immediately to LogRegExt. Consider a configuration used to generate asingle point on Figure B.5a; we denote the performance of LogReg (the x coordinate of this point) by r , and the subscriptrefers to the fact that the performance is obtained from the multi-score calibration of the two features h and h . Under theaforementioned configuration, we calibrate LogReg using each of the single score h or h individually; we denote theseperformances by r ,r . Then, for all configurations, we plot r , / r versus r , / r . We do that for all the eight combinations{RMSE, RB} × {LogReg, LogRegExt} × {ind, sub}.The results of these comparisons are illustrated in Figure B.6, which consists of nine subfigures that follow the sameplotting style, convention, and legends of Figure B.5. The first eight subfigures correspond to the eight combinations,respectively, with the order of the flattened cross-product. If the multi-score calibration is better than each individualsingle-score calibration, that is r , < min( r ,r ), then the point will be inside the unit square. The mark + inside the unitsquare is located at position ( p,p ), where p is the fraction of points located inside the unit square. Therefore, p representshow frequent the multi-score calibration wins over both single score calibrations. 
The summary of the eight subfigures (a)–(h) is illustrated in subfigure (i), which plots the average p versus n for each of the eight combinations, where the average is taken over all configurations of each subfigure, conditionally on each value of n.

From the summary subfigure (i), in terms of RMSE, the multi-score calibration improves LogReg in only roughly less than 7% of experiments, almost constantly with the sample size n; it improves LogRegExt in less than 9% of experiments at n = 320. In terms of RB, the multi-score calibration improves the performance in at least 80% of the experiments at small n and in almost 100% of experiments at n > 40. The detailed subfigures (a)–(h) reveal more. Subfigures (a) and (c) show that, for LogReg and LogRegExt, respectively, p is small at low AUCs, regardless of ρ. As explained above when we commented on Figure B.5, the calibration problem itself is hard there, and the posterior is near 0.5; this leaves a small margin for improvement. However, at higher AUCs and a low correlation coefficient ρ, the fraction p increases for LogReg and increases significantly for LogRegExt. This increase in the value of p retracts again when increasing ρ. The interpretation is that at a high AUC the problem is easier for each single-score calibration, and at a high ρ the two scores behave almost the same; hence, combining both scores in the multi-score calibration may not be of great value. Subfigures (e)–(h) illustrate the RB and show that, for the majority of the experiments, p ≈ 1. The improvement factor is significant, almost at 0.75, where the majority of points are located at the upper right corner of the unit square.
We carried out experiments, similar to those of the previous section, on two real datasets from the cybersecurity domain: the trec05-1 dataset [5, 6] and the drebin dataset [1, 27]. We interpret the results here and defer further investigations and interpretations to the discussion in Sec. 5.

4.2.1. trec05-1 Dataset
Trec05-1 is a popular public email security dataset consisting of 72,450 emails labeled as spam or not spam. Because the dataset is very large, we have the luxury of splitting it into training and testing sets to construct the classifiers before the calibration process. Therefore, we split the dataset into two halves, each with 36,225 samples. Support vector machine (SVM) and random forest (RF) classifiers were trained on n-gram features extracted from the training dataset; then, both classifiers were frozen, and the training set was never used in the rest of the experiments. The SVM and RF were tested on the testing set, and each produced 36,225 scores. These scores have a correlation coefficient of 0.91, and the two classifiers achieved, respectively, 96.6% and 97.53% balanced accuracy and 0.994 and 0.996 AUC. The smoothed histograms of these scores, under the two classes, are shown in Figure B.7. For the RF, these histograms are abruptly exponential-like and surprisingly very different from the simulated datasets in the previous sections. It is worth mentioning that [29] proved that classifier scores can be abruptly exponential even when the training features are normally distributed.

It is interesting to study the behavior of the different calibrators on different sample sizes n obtained from this dataset. Said differently, we pretend we are in a situation where we have obtained only n observations from this dataset. Then, to train the calibrators on different training sets of the same size n, to mimic an MC trial, and to further test the trained calibrators on a large independent testing set, we proceeded as follows. Out of the 36,225 scores available from testing each classifier, we sequestered 10,000 to serve as the independent testing set for the calibrators. Then, for each value of n = 10 × 2^i, i = 0, ..., 9, we randomly selected M training sets, each of size n, and trained the calibrators on these training sets. We report only RB_ind and RB_sub, because the RMSE is not available owing to the absence of the true posterior probability. We use the same calibrators and the same comparative study used for the simulated datasets above.

The results of these experiments are illustrated in Figure B.8, which plots RB_ind and RB_sub versus the sample size n for three cases: single-score calibration of each of the SVM and RF scores, and multi-score calibration using both. The results differ from those obtained on the simulated datasets. For the single-score calibration, although all calibrators converge to almost the same performance as n increases, at small n the simple BinReg competes with IsoReg, and third place goes to LogReg or Platt for the SVM or RF, respectively. Platt almost wins over the rest of the values of n. LogReg and LogRegExt are the worst two; however, LogRegExt is better than LogReg on the RF scores only. The multi-score calibration, with either LogReg or LogRegExt, wins over all the single-score calibrators except at small values of n. This once more shows the value of our proposed multi-score calibration in many (not all) cases.
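A sketch of this subsampling protocol for one calibrator and one (n, M) setting; the arrays, and the values of n and M, are stand-ins for the frozen-classifier scores described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
scores = rng.standard_normal(36_225)               # stand-in for the frozen SVM scores
y = (scores > 0.5).astype(int)                     # stand-in labels

test = rng.choice(scores.size, 10_000, replace=False)   # sequestered testing set
pool = np.setdiff1d(np.arange(scores.size), test)

n, M, rb_ind = 320, 100, []                        # hypothetical n and M
for m in range(M):                                 # Monte-Carlo trials
    tr = rng.choice(pool, n, replace=False)        # one training set of size n
    cal = LogisticRegression().fit(scores[tr, None], y[tr])
    p = cal.predict_proba(scores[test, None])[:, 1]
    rb_ind.append(np.sqrt(np.mean((p - y[test]) ** 2)))  # RB_ind,m, Eq. (7b)

print(np.mean(rb_ind))                             # E_M(RB_ind)
```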
4.2.2. drebin Dataset

Drebin is a popular malware analysis dataset consisting of features extracted from 129,013 Android applications, 5,560 of which are malicious applications belonging to 179 different malware families, while the rest are benign. From each application, 545,334 binary features are extracted, indicating the presence of various API calls, permissions, network addresses, and so forth. Because of the size of the dataset, it is not available as one large file. Instead, for each sample, the plain-text representation of each present feature is listed in a text file; features that are not present are not listed. This allows for a very condensed representation of the dataset, because most samples contain well under 100 features.

To establish some basis of comparison, we used the SVM and RF for this dataset. Because these algorithms do not support online (incremental) learning, the dataset had to be manually extracted into one large matrix using the sparse matrix representation of Scipy. If the dataset were stored in a standard numpy array, it would require at least 70 GB just to fit in memory (assuming a byte-sized datatype for each element).

We used the Scikit implementation (with liblinear for the SVM) of each classifier and trained them using 10-fold CV, where the predicted scores on the left-out folds were used for calibration. This is because the sample size of the malicious class is only 5,560. We adjusted the classification threshold so that the FPF was 1 in 100 (the same FPF chosen in [1]). Our SVM and RF achieved a 0.955 and 0.971 detection rate, respectively. This is higher than the reported detection rate of 0.939 in [1], where the SVM was used. The scores of the SVM and RF achieved AUCs of 0.992 and 0.996, respectively, and their correlation coefficient is 0.64. Figure B.7 illustrates the histograms of the scores; they show almost the same pattern as those of the trec05-1 dataset.

The results of these calibration experiments are illustrated in Figure B.8 and are plotted in the same way as those of the trec05-1 dataset. The results are very similar but have some differences. For single-score calibration, all the methods, except LogRegExt on the SVM and both LogRegExt and LogReg on the RF, are almost comparable at all values of n. Compared with their performance on the RF scores of the trec05-1 dataset, it seems that LogReg and LogRegExt switched places. However, LogReg on the multi-score calibration is almost the overall winner.

All experiments on the simulated datasets were carried out on the Compute Canada Cedar Cluster. For the single-score calibration (Sec. 4.1.1), the experiments were run using a job array for scheduling multiple independent jobs (scripts), with all jobs running in parallel. We wrote 48 separate jobs, one job for each configuration of the cross product of the 16 distribution pairs and three AUC values; each job sweeps the 10 values of n. A job took about an hour on one core with 8 GB of RAM, obtaining 96% CPU utilization and 63% memory utilization. Only the single-score simulation for Platt's sigmoid fit was carried out separately, on an eight-core machine with 16 GB of memory; it took 36 hours to complete using six threads, with a thread per job.

For the multi-score calibration (Sec. 4.1.3), a job array with 768 jobs was created, one job for each configuration of the cross product of the 4⁴ = 256 distribution pairs and three AUC values. Each job accounts for the cross product of the three values of ρ and the 10 values of n. The whole job array took five hours to execute, although the majority of jobs were executed in three hours. On average, these jobs obtained 94% CPU efficiency and 62% memory efficiency with one core and 8 GB of RAM per job. These experiments obviously took longer than their counterparts for the single-score calibration, owing to the relative number of configurations (23,040 vs. 480).
5. Discussion
The empirical distributions in Figure B.7, for the real datasets, show two main differences from the simulated data in the previous section: (1) they are exponential-like for the RF scores, and (2) the AUC is very high (0.994, 0.996, 0.992, and 0.996, respectively, for the combinations {trec05-1, drebin} × {SVM, RF} appearing in the figure). Therefore, we designed the following simulation experiment for a deeper investigation. Let F₀ be a truncated exponential distribution and F₁ a flipped truncated exponential distribution, both sharing the same population parameter λ and the finite support [0, 1], so that they look symmetric. The population parameter λ is adjusted to obtain a required AUC, which takes the same three values of the previously simulated data, in addition to the value 0.99.

The results of this experiment are illustrated in Figure B.9. The RMSE_ind, RMSE_sub, RB_ind, and RB_sub of the nine calibrators are plotted versus the sample size n at different AUC values. One more surprise is that the performance for the first three values of AUC complies with the simulated data of the previous section, with some performance deterioration observed for LogReg. Only the performance at the very high AUC of 0.99 looks similar to the performance of the calibrators on the real datasets. Still comparing the results in Figure B.9, but now recalling the results of the real datasets in Figure B.8: at this very high AUC, both LogReg and LogRegExt are inferior to the other methods, including the Platt version of logistic regression, exactly as was observed for the real datasets. At this very high AUC, the two distributions are very separated, which is an ideal situation to exhibit the value of the bias-correction term of Platt's version of logistic regression. IsoReg always has a moderate performance among the others.

It is quite important to note that obtaining such a very high discriminating power, represented as AUC = 0.99+, is very unrealistic in the vast majority of ML applications. The real datasets above, where both the SVM and RF were able to provide a very high AUC, look very clean and atypical. Such a very high AUC means that the two sets of classification scores are almost perfectly separable, such that any threshold will provide near 100% TPF and 0% FPF.

We try to summarize the findings and the final message of Figures B.3–B.9 to provide advice for practitioners. It is clear that there is no overall winner across all possible configurations. However, major remarks are in order.

1. The Platt version of logistic regression shows superiority over the default version that exists in scientific libraries, for example scikit, in some special configurations. Therefore, it is worth implementing it in an optimized way in these libraries to replace the default version, and it is better to leverage it in the calibration process.
2. Among the single-score calibrators on the plain features, that is, without feature extension, the Platt method is the winner in the vast majority of configurations, even if it is not the overall winner. It only loses in a few experiments, and by a small margin.
3. The isotonic regression, although it rarely wins in contrast to Platt, also rarely deteriorates, similarly to Platt.
4. The PROPROC method does not seem to be feasible, because the software version that can calibrate scores is no longer publicly available.
5. The very rare cases with a very high AUC should not be taken as a basis for comparing the calibrators. These are very special cases, and even though we found them in some real datasets, they should be dealt with on a case-by-case basis. For example, the practitioner should run a Mann-Whitney test on the scores to estimate the AUC and then decide whether it is exceptional or not.
6. The multi-score calibration boosted the performance of LogReg in some experiments in terms of RMSE and in almost all experiments, including the real datasets, in terms of RB.
7. The feature extension boosted the performance in a small fraction of experiments in terms of RMSE and in a negligible fraction of experiments in terms of RB.

Based on all of these observations, if we have to provide only a single piece of advice to practitioners, we would recommend the Platt version of logistic regression for single-score calibration without feature extension. When more classifiers provide scores for the same problem, we would recommend the Platt regression using multi-score calibration, without feature extension as well.

However, for a more mature practitioner, we recall the observation that there is no overall winner among calibrators and configurations. Therefore, one may include more than one calibrator, including Platt, on the extended features, and incorporate all of them in a CV model selection, per the usual paradigm of training and testing ML models.
6. Conclusion
The current paper studied the calibration of binary classifiers, where the classifier score of a testing observation does not necessarily correspond to its posterior probability of belonging to one of the two classes. The paper surveyed the literature, summarized the different methods available for calibrating a single classifier, and formalized the mathematical properties of the calibration process, for example the invariance of some classification performance measures. The paper presented three main contributions: (1) proposing multi-score calibration, that is, designing a calibrator when more than one classifier provides a score for a single observation; (2) treating the score(s) as features and extending them to higher dimensions to design a calibrator; and (3) conducting a massive number of experiments on simulated datasets, on the order of 24,000 experiments, and on two real datasets from the cybersecurity domain, to study the behavior of different calibrators under different configurations. A final piece of advice for practitioners is provided, based on the visualization and interpretation of the results, which show that there is no overall winner among the different calibrators. However, in general: (1) the Platt version of logistic regression has a decent, acceptable performance over a wide range of experiments; (2) the multi-score calibration boosts the performance in the majority of experiments; and (3) extending the score to higher dimensions helps in some experiments.
7. Acknowledgment
We would like to thank our friends Dr. Weijie Chen, of the US FDA, and Dr. Lorenzo Pesce, formerly the developer of the PROPROC software at the University of Chicago, for the long discussions on the software. Special thanks to Dr. Weijie Chen for sharing with us the Matlab code of his publications [3, 4].
Appendix A. Proofs

Corollary 2.
Consider applying a strictly monotonic calibration function to the output score of a classifier, and then treating its output as the new final score. Then, all the performance measures defined in Sec. 1.2, that is, errors, accuracy, risk, ROC, and AUC, along with their estimators, are invariant under this transformation.
Proof. Immediate by direct application of Lemma 3.
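As a quick numerical illustration of this invariance, the following minimal sketch (ours, not from the paper) checks that the empirical AUC is unchanged when the scores pass through a sigmoid, which stands in for any strictly monotonically increasing calibrator:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = np.r_[np.zeros(500), np.ones(500)]
scores = np.r_[rng.normal(0, 1, 500), rng.normal(1, 1, 500)]

# A strictly increasing map preserves the ordering of the scores, hence
# every (FP, TP) point of the empirical ROC curve, and hence the AUC.
calibrated = 1.0 / (1.0 + np.exp(-scores))
assert np.isclose(roc_auc_score(y, scores), roc_auc_score(y, calibrated))
```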
Lemma 3.
Consider any two r.v. $X$ and $Y$ belonging to $\mathbb{R}$ with distribution functions $F_X$ and $F_Y$, respectively, along with a strictly monotonically increasing function $p$, and $p(X) = X' \sim F_{X'}$ and $p(Y) = Y' \sim F_{Y'}$. Then:
1. For every threshold value $\theta \in \mathbb{R}$, with the corresponding probabilities $F_X(\theta)$ and $F_Y(\theta)$, there exists a unique corresponding threshold value $\theta' = p(\theta)$, with the corresponding probabilities $F_{X'}(\theta')$ and $F_{Y'}(\theta')$, such that $F_{X'}(\theta') = F_X(\theta)$ and $F_{Y'}(\theta') = F_Y(\theta)$.
2. For every threshold value $\theta' \in \mathbb{R}$, the converse argument applies.
3. These two properties are not satisfied when relaxing the strict monotonicity assumption and considering $p$ a monotonically increasing function.

Proof.
1. Because the mapping $p$ is strictly monotonically increasing, it is a bijection. Therefore, $\forall x \in (-\infty, \theta]$ there exists a unique $x' = p(x) \in (-\infty, \theta']$, where $\theta' = p(\theta)$; therefore, the two sets are equivalent. Hence, $F_X(\theta) = \Pr[X \le \theta] = \Pr[X' \le \theta'] = F_{X'}(\theta')$. The same argument is immediate for $Y$.
2. Because the function $p$ is a bijection, its inverse $p^{-1}$ is as well; then, the argument is identical to the one above.
3. Consider a non-strictly monotonically increasing function $p$ such that $p(x_1) \le p(x_2)\ \forall x_1 < x_2$. Take any $\theta_1 < \theta_2$, where $p(\theta_1) = p(\theta_2) = \theta'$. Then, for any $\theta_1 < \theta < \theta_2$, we have $p(\theta) = \theta'$. Suppose that for some $\theta \in\ ]\theta_1, \theta_2[$, $\Pr[X \le \theta] = \Pr[X' \le \theta']$. However, $\Pr[X \le \theta] \ne \Pr[X' \le \theta']\ \forall \theta \in\ ]\theta_1, \theta_2[$ unless $\Pr[\theta_1 < X \le \theta_2] = 0$, which is not necessary for any distribution $F_X$.

Appendix B. Pedagogical Example on Calibration
It is more informative to illustrate the calibration process on a real dataset. We use the labeled subset of one of the datasets from the 2019 IEEE BigData Cup Challenges. The dataset consists of event-log data received at a security operation center (SOC) run by a company called Security On-Demand (SOD) [26]. The considered subset contains 39,427 observations. The single feature overallseverity, which represents the intrusion-alert severity generated by the system rules, is an integer value $0, 1, \cdots, 5$; it accounts for the score of an intrusion detection system (IDS). The column notified contains the true label, indicating whether the observation is an intrusion or normal behavior. The 39,427 records include 2,276 intrusions and 37,151 normal records. Because the dataset size is very large, the nonparametric estimate of the LHR, and hence of the posterior probabilities, can be considered a very accurate estimate of the population values.

Figure B.2: Histograms and logistic fits [22] of the scores of an IDS system. Left: without score modification, where the two sets of scores almost overlap (Mann-Whitney statistic of 0.53) and hence are non-predictive, and their logistic fit is very poor. Right: after artificial score modification, where a more natural separability is obvious (Mann-Whitney statistic of 0.92) and the logistic fit is, at least visually, very reasonable. The second row illustrates the same experiment but with both the score and its squared value (two features) given to the logistic regression model. The fit is obviously better.

Before showing how to train the calibrators using the techniques discussed in Sec. 2, it is important to comment on this dataset. The score overallseverity is not a discriminating feature. The Mann-Whitney statistic (2a) of this score is 0.53, which is close to a random guess. Just for the sake of clarification, modifying the value 5 of the score of the normal records to 0 (to mimic a more realistic and practical IDS scoring) gives a Mann-Whitney statistic of 0.92. Figure B.2 illustrates the histograms of the two classes for the data before and after modification.

We calibrate this score, before and after modification, using two calibration methods: the logistic regression and the nonparametric binning method of [32]. Calibrating the original score gives the poor model of Figure B.2a, where the binning method is plotted as individual dot markers and the logistic fit is plotted as a solid curve. (The solid curve is just for illustrating the smoothness of the method; however, because the score takes only the discrete values $0, \cdots, 5$, the curve is effectively discrete as well.)
This poor fit arises from the fact that the two classes have almost the same likelihood functions, which is obvious from both the histograms and the Mann-Whitney value of 0.53. For example, at a score value of 5, the two likelihoods are almost the same, so the posterior probability (4a) will be almost the prior probability $\pi_1$ estimated by $n_1/n = 2{,}276/39{,}427 \approx 0.06$. After the score modification, the logistic fit gives $\hat{f}_{\text{logit}}(h = 5 \mid \omega = \omega_1) = 0.75$; whereas the binning method gives an estimate of exactly 1, because $f_h(h = 5 \mid \omega_2) = 0$, which gives an estimate of the LHR of $\infty$. When we provide the logistic regression with each value of the score $h$ and its square $h^2$ as an extended feature (this approach is detailed in Sec. 3.2), the fit to the posterior probabilities is better, as in Figures B.2c–B.2d.
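Both calibrators in this example are straightforward to reproduce. The following minimal sketch (ours; it uses hypothetical synthetic severity scores standing in for overallseverity rather than the SOD data) contrasts the binning estimate in the spirit of [32] with a logistic regression on the extended features $(h, h^2)$, and also shows the Mann-Whitney estimate of the AUC mentioned above:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical discrete severity scores in 0..5 (standing in for
# overallseverity); intrusions skew toward the higher values.
n_norm, n_intr = 5000, 300
h = np.r_[rng.choice(6, n_norm, p=[.35, .30, .20, .10, .04, .01]),
          rng.choice(6, n_intr, p=[.05, .10, .15, .20, .25, .25])]
y = np.r_[np.zeros(n_norm), np.ones(n_intr)]

# Mann-Whitney estimate of the AUC: U / (n1 * n2).
u = mannwhitneyu(h[y == 1], h[y == 0]).statistic
print("AUC (Mann-Whitney):", u / (n_intr * n_norm))

# Binning calibrator: the posterior in a bin is estimated by the
# fraction of positives falling in that bin; here each discrete
# score value is its own bin.
bins = np.unique(h)
binned = {b: y[h == b].mean() for b in bins}
print("binning estimates:", binned)

# Logistic calibrator on the extended features (h, h^2), cf. Sec. 3.2.
ext = np.column_stack([h, h ** 2])
logit = LogisticRegression().fit(ext, y)
grid = np.column_stack([bins, bins ** 2])
print("logistic estimates:",
      dict(zip(bins, logit.predict_proba(grid)[:, 1].round(3))))
```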
References

[1] Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., Siemens, C., 2014. Drebin: Effective and explainable detection of Android malware in your pocket, in: NDSS, pp. 23–26.
[2] Chen, W., Gallas, B.D., Yousef, W.A., 2012. Classifier variability: Accounting for training and testing. Pattern Recognition 45, 2661–2671.
[3] Chen, W., Sahiner, B., Samuelson, F., Pezeshk, A., Petrick, N., 2015. Investigation of methods for calibration of classifier scores to probability of disease, in: Medical Imaging 2015: Image Perception, Observer Performance, and Technology Assessment, International Society for Optics and
Photonics, p. 94161E.
[4] Chen, W., Sahiner, B., Samuelson, F., Pezeshk, A., Petrick, N., 2018. Calibration of medical diagnostic classifier scores to the probability of disease. Statistical Methods in Medical Research 27, 1394–1409.
[5] Cormack, G.V., Lynam, T.R., 2005. TREC 2005 spam track overview, in: TREC, pp. 500–274.
[6] Cormack, G.V., Lynam, T.R., 2007. Online supervised spam filter evaluation. ACM Transactions on Information Systems (TOIS) 25, 11–es.
[7] Dorfman, D.D., Berbaum, K.S., Metz, C.E., 1992. Receiver operating characteristic rating analysis: Generalization to the population of readers and patients with the jackknife method. Investigative Radiology 27, 723–731.
[8] Dorfman, D.D., Berbaum, K.S., Metz, C.E., Lenth, R.V., Hanley, J.A., Abu Dagga, H., 1997. Proper receiver operating characteristic analysis: The bigamma model. Academic Radiology 4, 138–149.
[9] Grudic, G.Z., 2004. Predicting the probability of correct classification. Unpublished technical report.
[10] Hájek, J., Šidák, Z., Sen, P.K., 1999. Theory of Rank Tests. 2nd ed., Academic Press, San Diego, Calif.
[11] Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, New York.
[12] Kittler, J., Hatef, M., Duin, R.P.W., Matas, J., 1998. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on 20, 226–239.
[15] Metz, C.E., Herman, B.A., Shen, J.H., 1998. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Statistics in Medicine 17, 1033–1053.
[16] Metz, C.E., Pan, X., 1999. "Proper" binormal ROC curves: Theory and maximum-likelihood estimation. Journal of Mathematical Psychology 43, 1–33.
[17] Niculescu-Mizil, A., Caruana, R., 2005. Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, ACM, pp. 625–632.
[18] Pan, X., Metz, C.E., 1997. The "proper" binormal model: Parametric receiver operating characteristic curve estimation with degenerate data. Academic Radiology 4, 380.
[19] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[20] Pesce, L.L., Horsch, K., Drukker, K., Metz, C.E., 2011. Semiparametric estimation of the relationship between ROC operating points and the test-result scale: Application to the proper binormal model. Academic Radiology 18, 1537–1548.
[21] Pesce, L.L., Metz, C.E., 2007. Reliable and computationally efficient maximum-likelihood estimation of "proper" binormal ROC curves. Academic Radiology 14, 814.
[22] Platt, J., et al., 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10, 61–74.
[23] Ramberg, J.S., Tadikamalla, P.R., Dudewicz, E.J., Mykytka, E.F., 1979. A probability distribution and its uses in fitting data. Technometrics 21, 201.
[24] Randles, R.H., Wolfe, D.A., 1979. Introduction to the Theory of Nonparametric Statistics. Wiley, New York.
[25] Roe, C.A., Metz, C.E., 1997. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: Validation with computer simulation. Academic Radiology 4, 298–303.
[26] Ślęzak, D., Chądzyńska-Krasowska, A., Holland, J., Synak, P., Glick, R., Perkowski, M., 2017. Scalable cyber-security analytics with a new summary-based approximate query engine, in: 2017 IEEE International Conference on Big Data (Big Data), IEEE, pp. 1840–1849.
[27] Spreitzenbarth, M., Freiling, F., Echtler, F., Schreck, T., Hoffmann, J., 2013. Mobile-Sandbox: Having a deeper look into Android applications, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing, pp. 1808–1815.
[28] Yousef, W.A., 2013. Assessing classifiers in terms of the partial area under the ROC curve. Computational Statistics & Data Analysis 64, 51–70.
[29] Yousef, W.A., 2019. Prudence when assuming normality: An advice for machine learning practitioners. arXiv preprint arXiv:1907.12852.
[30] Yousef, W.A., Kundu, S., Wagner, R.F., 2009. Nonparametric estimation of the threshold at an operating point on the ROC curve. Computational Statistics & Data Analysis 53, 4370–4383.
[31] Yousef, W.A., Wagner, R.F., Loew, M.H., 2006. Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28, 1809–1817.
[32] Zadrozny, B., Elkan, C., 2001. Learning and making decisions when costs and probabilities are both unknown, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 204–213.
[33] Zadrozny, B., Elkan, C., 2002.
Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 694–699.

Figure B.3a: single-score calibration: RMSE.
Figure B.3b: single-score calibration: RMSE, cont.
Figure B.3c: single-score calibration: RB.
Figure B.3d: single-score calibration: RB, cont.
Figure B.3e: single-score calibration: summary plot of (a)–(d).
Legend: 10 bins, 20 bins, 30 bins, 40 bins, 50 bins, 50binPlatt, LogReg, LogRegExt, IsoReg; Independent, Resubstitution.
Figure B.3: consists of 5 subfigures (a)–(e) with a common legend for the calibrators. (a)–(b) illustrate RMSE_ind and RMSE_sub of the 9 calibrators for the 16 × 3 × 10 configurations. The 16 pairs are explained in Figure 1, and the x-axis expresses their relative rank after ordering the average RMSE over all calibrators; the 3 × 10 cross product of AUC and n values is listed on the top and left of the subfigures, respectively. (c)–(d) illustrate RB_ind and RB_sub with the same convention as (a)–(b). Subfigure (e) summarizes (a)–(d) by illustrating the average of RMSE_ind and RMSE_sub (left), and RB_ind and RB_sub (right). The averaging is taken over all configurations conditionally on each value of n.
Figure B.4: An exact reproduction of the experiments in [3, 4] in terms of MSE and Brier score (rather than their roots) for three calibrators (IsoReg, LogReg, Platt; both ind and sub versions), on Normal (first two panels) and Beta (second two panels) distributions. The two versions LogReg and Platt are almost identical.
Figure B.5a: multi-score calibration: RMSE_ind.
Figure B.5b: multi-score calibration: RMSE_sub.
Figure B.5c: multi-score calibration: RB_ind.
Figure B.5d: multi-score calibration: RB_sub.
Figure B.5e: multi-score calibration: summary plot for (a)–(d).
Figure B.5: consists of 5 subfigures (a)–(e) with two legends: (left) the heat map that codes the sample size n for subfigures (a)–(d), and (right) the calibrators for subfigure (e). (a)–(d) illustrate the RMSE_ind, RMSE_sub, RB_ind, and RB_sub, respectively, of LogRegExt vs. LogReg for all configurations. Each subfigure consists of 3 × 3 plots; each plot illustrates 4 × 10 configurations (points) at a particular ρ and AUC. Subfigure (e) summarizes (a)–(d) by illustrating the average of RMSE_ind and RMSE_sub (left), and RB_ind and RB_sub (right). The averaging is taken over all configurations conditionally on each value of n.

Figure B.6a: multi- vs. single-score calibration, LogReg, RMSE_ind.
Figure B.6b: multi- vs. single-score calibration, LogReg, RMSE_sub.
Figure B.6e: multi- vs. single-score calibration, LogRegExt, RMSE_ind.
Figure B.6f: multi- vs. single-score calibration, LogRegExt, RMSE_sub.
Figure B.6c: multi- vs. single-score calibration, LogReg, RB_ind.
Figure B.6d: multi- vs. single-score calibration, LogReg, RB_sub.
Figure B.6g: multi- vs. single-score calibration, LogRegExt, RB_ind.
Figure B.6h: multi- vs. single-score calibration, LogRegExt, RB_sub.
Figure B.6i: multi- vs. single-score calibration: summary plot for (a)–(h).
Figure B.6: consists of 9 subfigures, with the same plotting style, convention, and legends as Figure B.5. The first 8 subfigures correspond to the 8 combinations of the cross product {RMSE, RB} × {LogReg, LogRegExt} × {ind, sub}, as indicated by the sub-captions. Each subfigure consists of 3 × 3 plots (for the ρ and AUC combinations); each plot illustrates 4 × 10 points (for the distribution pairs and n combinations); each point is $(r_{12}/r_1,\ r_{12}/r_2)$, where $r_1$, $r_2$, and $r_{12}$ are, respectively, the performance of the calibrator using each of the two single scores and the multi-score. The symbol + inside the unit square is located at $(p, p)$, where $p$ is the ratio of the points inside the unit square to the total number of points.

Figure B.7: Smoothed histograms of the two classes $\omega_1$ (blue) and $\omega_2$ (red), of each of the SVM and RF scores, on each of the terk05-1 and drebin datasets. The AUC values of the four plots are, from left to right: 0.994, 0.996, 0.992, 0.996. The correlation coefficients between the SVM and RF scores on the two datasets are 0.91 and 0.64, respectively.
Figure B.8: The RB_ind and RB_sub of all calibrators on the real datasets terk05-1 (top) and drebin (bottom). On each row, there are three plots corresponding to the calibration of: the SVM single-score, the RF single-score, and the multi-score of both. The behavior on both datasets is similar.
Figure B.9: RMSE (top row) and RB (bottom row) of the nine calibrators versus n, at AUC = 0.60, 0.75, 0.90, and 0.99. Legend: 10 bins, 20 bins, 30 bins, 40 bins, 50 bins, 50binPlatt, LogReg, LogRegExt, IsoReg; Independent, Resubstitution.