Classifier Calibration: with implications to threat scores in cybersecurity
Waleed A. Yousef a,∗, Issa Traoré b, William Briguglio c

a ECE Dep., University of Victoria, Victoria, Canada; and CS Dep., Helwan University, Cairo, Egypt.
b ECE Dep., University of Victoria, Victoria, Canada.
c ECE Dep., University of Victoria, Victoria, Canada.
Abstract
This paper explores the calibration of a classifier output score in binary classification problems. A calibrator is a function that maps the arbitrary classifier score of a testing observation onto [0, 1] to provide an estimate of the posterior probability of belonging to one of the two classes. Calibration is important for two reasons: first, it provides a meaningful score, namely the posterior probability; second, it puts the scores of different classifiers on the same scale for comparable interpretation. The paper presents three main contributions: (1) introducing multi-score calibration, where more than one classifier provides a score for a single observation; (2) introducing the idea that the classifier scores fed to a calibration process are nothing but features to a classifier, and hence proposing to extend the classifier scores to higher dimensions to boost the calibrator's performance; and (3) conducting a massive simulation study, on the order of 24,000 experiments, that incorporates different configurations, in addition to experimenting on two real datasets from the cybersecurity domain. The results show that there is no overall winner among the different calibrators and different configurations. However, general advice for practitioners includes the following: Platt's calibrator [22], a version of logistic regression that decreases bias for a small sample size, has a very stable and acceptable performance across all experiments; our suggested multi-score calibration provides better performance than single-score calibration in the majority of experiments, including the two real datasets. In addition, extending the scores can help in some experiments.
Keywords:
Classification, Score Calibration, Threat Metrics, Cybersecurity.
1. Introduction
Scoring is broadly used in cybersecurity to measure attributes and outcomes of security mechanisms (e.g., intrusion detection, biometric authentication) and underlying phenomena (e.g., threats, vulnerabilities, intrusions). Metrics such as threat, vulnerability, and intrusion scores are not only used to assess the severity of the corresponding phenomena; they are also used for security decision-making. For instance, biometric technologies involve computing a matching score that is compared against a threshold and used to make crucial authentication or identification decisions. In practice, the scores are computed using ad-hoc or subjective methods, for example, using heuristics or relying on human expertise. A good example of these kinds of scores is the common vulnerability scoring system (CVSS), which is one of the prominent vulnerability scoring schemes currently available. A more rigorous alternative to ad-hoc scores is using probability scores for decision-making. The initial motivation behind the present article was to find a systematic approach to convert the confidence score into probability values that are ready to be used in cybersecurity models. In practice, each decision model is a classifier in its own right.

From the point of view of machine learning (ML), each decision model is a classifier, and each score value given is the decision score of that classifier, to be compared with a threshold value to ultimately provide a hard binary decision. Converting classifier scores to probability measures is called classifier calibration. Having more than one data source amounts to having more than one classifier providing scores for the same problem; this is called a multiple classifier system (MCS) in ML terminology. Therefore, in the present article we treat the classifier calibration problem, and design the experiments on both simulated and real datasets, to accommodate the whole ML community in general and the cybersecurity domain in particular.
∗ Corresponding Author
Email addresses: [email protected], [email protected] (Waleed A. Yousef), [email protected] (Issa Traoré), [email protected] (William Briguglio)

If a random vector X and a random variable (r.v.) Y have a joint probability density f_{X,Y}(x, y), and Y is qualitative (or categorical) with only two possible values (class ω₁ or class ω₀), the binary classification problem is defined as follows: how to find the classification rule Ŷ = η(X) that can predict the class Y (the response) from an observed value of the variable X (the predictor). It is known that the best prediction function (classifier) that minimizes the risk is given by

η(x): h(x) ≷_{ω₀}^{ω₁} th,  (1a)
h(x) = log [ f_X(X = x | ω₁) / f_X(X = x | ω₀) ].  (1b)

The log-likelihood ratio (LHR) h(x) acts as the decision value, discriminant value, or score. Now, if the joint distribution is unknown, a classification function is modeled parametrically or nonparametrically, and a training sample is used to build that model. The modeled decision function (or scoring function) h_tr(X) is built using the training dataset tr, and the final classification rule takes the form (1a). Of course, it is then no longer the LHR, nor the optimal classification rule that minimizes the risk.

The scoring function h_tr(X) is usually compared with a chosen threshold value th to give the final binary decision. One common way of assessing the decision (scoring) function h_tr(X) is via the two types of error: e₁ = ∫_{−∞}^{th} f_h(h(x) | ω₁) dh(x) is the probability of classifying a case as belonging to class 0 when it belongs to class 1, called the false negative fraction (FNF); e₀ = ∫_{th}^{∞} f_h(h(x) | ω₀) dh(x) is the opposite, called the false positive fraction (FPF). The error is not a sufficient measure, because it is a function of a single fixed threshold. A more general way to assess a classifier is provided by the receiver operating characteristic (ROC) curve, a plot of the true positive fraction (TPF = 1 − FNF) versus the FPF under different threshold values. The area under the ROC curve (AUC) can be thought of as a summary measure of the ROC curve. Formally, AUC = ∫ TPF d(FPF), which is the same as Pr(h(x | ω₀) < h(x | ω₁)), and which expresses the separation between the two sets of decision scores h(X | ω₀) and h(X | ω₁). In a practical setup, only a finite testing dataset ts = {x_i, y_j | i = 1, ..., n₁, j = 1, ..., n₀} is available to estimate the performance measures. The uniform minimum variance unbiased estimator (UMVUE) of the AUC, under nonparametric assumptions, is given by the Mann-Whitney statistic [2, 10, 24, 31]:

ÂUC = (1 / n₁n₀) Σ_{i=1}^{n₁} Σ_{j=1}^{n₀} ψ( h(x_i | ω₁), h(x_j | ω₀) ),  (2a)
ψ(a, b) = 1 if a > b;  1/2 if a = b;  0 if a < b.  (2b)
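To make the estimator (2) concrete, the following minimal numpy sketch (the function name and vectorized form are ours, not from the paper's code) computes the Mann-Whitney statistic from the two sets of scores:

```python
import numpy as np

def auc_mann_whitney(scores_pos, scores_neg):
    """Nonparametric AUC estimate of Eq. (2a): average of the kernel psi
    of Eq. (2b) over all (positive, negative) score pairs, ties counting 1/2."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]  # h(x_i | omega_1)
    neg = np.asarray(scores_neg, dtype=float)[None, :]  # h(x_j | omega_0)
    psi = (pos > neg) + 0.5 * (pos == neg)              # Eq. (2b), broadcast n1 x n0
    return psi.mean()                                   # (1/(n1*n0)) * double sum
```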
In many applications, it is desirable that the classifier output score correspond to the posterior probability Pr(ω_i | X = x), i = 0, 1. Suppose that we have two classifiers giving scores of 3.5 and 17.4, respectively, for the same observation. Not only is each score non-informative on its own, but the two scores also cannot be combined in a direct way, because they are on two different scales owing to the particular design process of each classifier. To wit, in cybersecurity, the score of a classifier—either an intrusion detection system (IDS) or an anti-virus (AV)—should correspond directly to the probability of, say, being malicious (ω₁) given the feature vector X = x. This allows for the interpretation and comparison of different scores. The connection between the posterior probability and the score likelihood is straightforward through Bayes' theorem, as follows:

Pr(ω₁ | h) / Pr(ω₀ | h) = [ f_h(h | ω₁) Pr(ω₁) / Pr(h) ] / [ f_h(h | ω₀) Pr(ω₀) / Pr(h) ].  (3)

After solving for Pr(ω₁ | h), giving it the shorthand notation p, we get:

p(h) = 1 / ( 1 + L⁻¹ (1 − π) / π ),  (4a)
L = f_h(h | ω₁) / f_h(h | ω₀),  (4b)
π = Pr(ω₁).  (4c)

An aside technical point to mention is that the probability p(h) in (4a) is the posterior probability Pr(ω = ω₁ | h) given a score value h, not the posterior probability Pr(ω = ω₁ | X = x) given a feature vector X = x. This may only be important for theoretical treatment in cases where the score function h(X) assigns the same value to two or more vectors X.
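Computing (4a) is a one-liner once estimates of the likelihood ratio and the prior are in hand; a minimal sketch (ours) for reference:

```python
def posterior_from_lhr(L, pi):
    """Eq. (4a): posterior Pr(omega_1 | h) from the likelihood ratio
    L = f(h | omega_1) / f(h | omega_0) of Eq. (4b) and the prior pi = Pr(omega_1)."""
    return 1.0 / (1.0 + (1.0 - pi) / (pi * L))
```

For example, a likelihood ratio L = 1 (a completely non-informative score) returns exactly the prior π, as it should.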
The problem of classifier calibration (surveyed in Sec. 2) is designing the estimator p̂(h) to be as "close" as possible to the true value p(h) in Eq. (4). Because p ∈ [0, 1] is no longer an indicator function, as in the original classification problem, the two common accuracy measures used in the literature are the mean square error (MSE) and the Brier score, defined as:

MSE(p̂, p) = E_h[ (p̂(h) − p(h))² ],  (5a)
Brier(p̂, p) = E_h[ (p̂(h) − y)² ],  (5b)

where y (= 0, 1) is the true class label of the observation with score h. However, we always prefer the root version of each measure, RMSE(p̂, p) = √MSE(p̂, p) and RB(p̂, p) = √Brier(p̂, p), because it has the same units as the estimand, not their square. For a black-box trained classifier, with no knowledge of the probability distributions of the scores, at most we can have access to a labeled dataset, from which we can construct the set

H = { (h_i, y_i) | i = 1, ..., n },  (6)

where h_i is the score of the i-th observation and y_i is its label. Then, the calibration performance measures can be estimated by:

R̂MSE(p̂, p) = √[ (1/n) Σ_i (p̂(h_i) − p(h_i))² ],  (7a)
R̂B(p̂, p) = √[ (1/n) Σ_i (p̂(h_i) − y_i)² ].  (7b)

However, because p(h_i) requires knowledge of the parametric distributions (per Eq. (4)), the estimate (7a) is of practical use only for simulated datasets, whereas the estimate (7b) is suitable for both simulated and real datasets. This is a trade-off between what we need, i.e., the error p̂(h_i) − p(h_i), and what we have, i.e., the error p̂(h_i) − y_i.

The following observation is quite interesting and is the basis for Sec. 3, which extends the literature on calibration. From a purely mathematical point of view, the function p̂(h) that calibrates the classifier output h(X), itself obtained from the feature vector X, is a classifier that trains on the dataset (6) with the single feature h(X) and output score function p̂(h). Moreover, one of the calibration methods is the logistic regression classifier. In the present article, we use this observation and use logistic regression to: (1) introduce multi-score calibration, showing how to provide a single calibrated probability score from multiple scores, and (2) suggest extending the score(s) of the single (or multiple) classifier(s) to higher dimensions to boost the calibration accuracy of the logistic regression. In the sequel, and to avoid any confusion in terminology, we exclusively reserve the terms classifier and score/scoring function to mean h(X) and use them interchangeably as needed by context. On the other hand, we use the term calibrator to mean p̂, which converts the initial score(s) to a calibrated probability score.

The rest of the article is organized as follows: Sec. 2 is a literature review of the main methods of calibration. In Sec. 3, we formalize the single- and multi-score calibration problem with more mathematical rigor, and we detail the contributions by which we extend the literature. Sec. 4 is a comprehensive set of experiments on simulated datasets and on two real datasets from the cybersecurity domain. The experiments establish a systematic comparison among all known calibrators under different configurations, while reproducing previous results published in [3, 4] for the sake of comparison. In Sec. 5, we discuss the results further, contemplate the discrepancy between the results on simulated data and real data, and provide final advice for practitioners of ML in general and the network security domain in particular. Finally, Sec. 6 concludes the article.
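Both estimators (7) are simple plug-ins over the set H; a short sketch (our own naming) implements them:

```python
import numpy as np

def rmse_hat(p_hat, p_true):
    """Eq. (7a): requires the true posterior p(h_i), hence simulated data only."""
    p_hat, p_true = np.asarray(p_hat), np.asarray(p_true)
    return np.sqrt(np.mean((p_hat - p_true) ** 2))

def rb_hat(p_hat, y):
    """Eq. (7b): root Brier score against the 0/1 labels; usable on real data."""
    p_hat, y = np.asarray(p_hat), np.asarray(y)
    return np.sqrt(np.mean((p_hat - y) ** 2))
```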
2. Literature Review and Related Work
Before reviewing the literature on score calibration, we first ponder equation (4). Modeling the calibration function p(h): ℝ ↦ [0, 1] can be accomplished via several approaches. We provide our view of these general approaches and survey the literature belonging to each of them. We emphasize that our calibration problem deals with a black-box classifier, with no access to its internal structure: only the classifier score along with the true labels, that is, the dataset (6), is available. This should be distinguished from other approaches, e.g., [9], which assume some knowledge of the internal basis kernel of the classifier and use the labeled dataset to estimate some of its parameters.
In the first approach, the two LH functions f_i(h | ω_i), i = 0, 1, are estimated; then (4) is used by plugging in the estimates f̂_i(h | ω_i) along with π̂ = n₁/n. An example of the literature leveraging this approach is [32], where the estimation of the LH is done by the basic binning method (the same method used for drawing histograms), treating the bin width as a tuning parameter that can be obtained via cross-validation (CV). The LH estimate is then given by the relative number of observations belonging to a bin interval B. Formally, for x belonging to a bin interval B,

f̂_i(x | ω_i) = (1/n_i) Σ_{j=1}^{n_i} I(x_j ∈ B),  i = 0, 1.  (8)

In the second approach, the LHR (4b) (not the individual LH functions) is modeled; then, similarly to the above, this estimate, along with the estimate of the prior probability, is plugged into (4a). There is, as well, a wide range of statistical literature on modeling the LHR of two-sample data. However, this literature has not been leveraged in the score calibration problem except by [3, 4], where the authors used the PROPROC software for fitting the ROC of two-sample data to estimate the LHR (see [20]). They call this method the semi-parametric approach, because the parametric model used involves some internal latent variables. The details are out of the scope of the present article and can be found in the literature on ROC modeling of two-sample data that was well developed and established by Metz et al. over the previous two decades, for example in [7, 8, 15, 16, 18, 20, 21, 25].
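A minimal sketch of the first (binning) approach, plugging the histogram estimate (8) into (4a); the function name, the fixed bin count, and the numerical guards are ours, and in practice the number of bins would be tuned by CV as described above:

```python
import numpy as np

def binning_calibrator(h_train, y_train, n_bins=10):
    """Histogram (binning) estimates of the two class-conditional score
    densities (Eq. (8)), combined into a posterior via Eq. (4a)."""
    edges = np.histogram_bin_edges(h_train, bins=n_bins)
    f1, _ = np.histogram(h_train[y_train == 1], bins=edges, density=True)
    f0, _ = np.histogram(h_train[y_train == 0], bins=edges, density=True)
    pi = np.mean(y_train == 1)                        # pi_hat = n1 / n

    def p_hat(h):
        # Scores outside the training range are clipped to the edge bins.
        idx = np.clip(np.searchsorted(edges, h, side="right") - 1, 0, n_bins - 1)
        L = f1[idx] / np.maximum(f0[idx], 1e-12)      # binwise LHR, Eq. (4b)
        return 1.0 / (1.0 + (1.0 - pi) / (pi * np.maximum(L, 1e-12)))

    return p_hat
```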
In the third approach, the posterior probability (4a) is directly modeled. This is a typical regression problem that can be solved parametrically or nonparametrically. [33] seems to be the first to propose nonparametric regression for modeling the posterior probability directly from score values. The authors proposed using isotonic regression, an optimization method that minimizes the square loss under the constraint that the resulting regression function is monotonic (not strictly monotonic, a distinction made in Definition 1). Therefore, the method provides a non-smooth regression curve with piece-wise constant segments. [22] suggested using logistic regression to model the posterior probability from scores. Their initial motivation was to model the score of the SVM in particular. The modeled regression function is the typical smooth sigmoid function. However, they used a slightly modified version of the standard logistic regression to decrease the bias for a small sample size. It is worth mentioning that [9] follows [22] but with the additional ingredient of estimating some parameters related to the basis kernel of the classifier. As introduced above, this is a bit different from our assumption of knowing nothing about the internal structure of the classifier.
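Both direct-posterior methods are available off the shelf; a hedged sketch using scikit-learn (a plain logistic fit rather than Platt's bias-corrected variant, with synthetic arrays standing in for a real labeled score set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
h = rng.normal(size=200)                              # classifier scores (stand-in)
y = (h + 0.5 * rng.normal(size=200) > 0).astype(int)  # 0/1 labels (stand-in)

# Sigmoid fit of the posterior, in the spirit of [22]; Platt's version
# additionally smooths the 0/1 targets to reduce small-sample bias.
logreg = LogisticRegression().fit(h.reshape(-1, 1), y)
p_logistic = logreg.predict_proba(h.reshape(-1, 1))[:, 1]

# Monotonic, piece-wise constant fit, as in [33].
iso = IsotonicRegression(out_of_bounds="clip").fit(h, y)
p_isotonic = iso.predict(h)
```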
There are a few studies, for example [3, 4, 17], that compare the methods surveyed above. These studies assume different score distributions, either by explicitly specifying the distribution and playing with its parameters or by producing the scores from classifiers trained on different datasets. The conclusion is that there is no overall winner over all the score distributions; however, the logistic fit [22] and the semiparametric fit [3] approaches are very acceptable. We would like to point out that, aside from the calibration accuracy, logistic regression is quite easy to use because it is available in almost any scientific computing environment (SCE) and does not require obtaining and integrating code into a particular SCE, in contrast to PROPROC.
3. Score Calibration: formalization
Given a classifier that provides a scoring function h(X), calibration seeks a function that maps h(X = x) to Pr(ω₁ | h). In this section, we first formalize the calibration problem with more mathematical rigor (Sec. 3.1); then, we propose how to calibrate multiple scores and how to extend the available score(s) to higher dimensions to boost the calibration accuracy of both single and multiple scores (Sec. 3.2). We also discuss attempts at calibrating multiple scores without access to labeled datasets (Sec. 3.3).

Definition 1 (Calibration Function). A calibration function is a monotonically increasing mapping from ℝ to [0, 1] that maps a classifier output score to a posterior probability.

This formal definition verbally means the following: consider two observations x₁ and x₂, belonging to any set (not necessarily ℝ), along with their classifier output scores h(x₁) < h(x₂). The calibration function p(h) that estimates the posterior probabilities should satisfy p(h₁) < p(h₂) under a strictly monotonic assumption and p(h₁) ≤ p(h₂) under a monotonic assumption. This distinction is made formally here to conform with some methods in the literature, discussed in Sec. 2. The ROC and AUC are invariant under a strictly monotonic calibration (Corollary 2). This means that the original classifier score and its calibrated score have the same ROC and the same classification accuracy at two equivalent threshold values. The mathematical details are deferred to the Appendix.

Table 1: Truth table of the eight possible combinations of the true label of an observation along with the corresponding outputs of two binary classifiers. The last two columns of the table are the simple
AND and OR classifier combinations. Neither the very aggressive AND rule nor the very loose OR rule always improves the final classification decision; it depends on when the two classifiers agree/disagree, that is, on the joint PDF of both classifiers' scores.

Suppose we have access to more than one classifier instead of just a single classifier. Then, much like being able to generate the dataset (6), we can generate the following dataset:

H_{1,2} = { (h₁(x_i), h₂(x_i), y_i) | y_i ∈ {ω₀, ω₁}, i = 1, ..., n }.  (9)

This dataset can be used to produce a calibrator either in a direct approach or in a training-and-testing approach. This should boost both the classification and calibration accuracies of the resulting calibrator. These two approaches are elaborated on in the following two subsections. Although the material presented here treats combining two classification scores, it is immediately extensible to K ≥ 2 classifiers, with the dataset H_{1,...,K} extending (9).

The scores h₁(x), h₂(x) of the dataset H_{1,2} have a 2D PDF. Formally, and similar to (4), we can relate the posterior probability to this joint PDF by:

Pr(ω₁ | h₁, h₂) = 1 / ( 1 + L⁻¹ (1 − π)/π ),  (10)
L = f_{h₁,h₂}(h₁, h₂ | ω₁) / f_{h₁,h₂}(h₁, h₂ | ω₀),  π = Pr(ω₁).

Therefore, the three approaches for calibrating a single classifier (Sec. 2) apply immediately. However, the logistic regression approach may be preferred over the other two, because it is a stable procedure and applying it to a 2D dataset is straightforward. The other two approaches face some hurdles: modeling the 2D LH f(h₁, h₂) may not be very accurate under a small sample size n, and modeling the LHR through the ROC technology is not yet available as a ready software package and would require new development.

We can treat the calibration problem as—and indeed it is—a learning problem, where the training dataset is the labeled two-feature dataset H_{1,2} and the model is logistic regression. Therefore, we do not have to restrict ourselves to the two predictors h₁ and h₂; rather, we can extend them to a higher-dimensional space by including, for example, h₁², h₂², h₁h₂. The new dimensionality will be p, including the original two features. The optimal p is chosen by CV, stage-wise forward regression, best subset selection, or regularization, among many other readily available techniques from the literature on statistical learning. In this elaborate approach, not only will the output be calibrated to the appropriate posterior probability (10), but the overall classification accuracy will also be better than that achieved by each individual feature (classifier output) h_i, as sketched below.
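A minimal sketch of this multi-score calibration with a second-degree extension, using scikit-learn; the data here are synthetic stand-ins, and the pipeline is one possible realization, not the paper's exact implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
H = rng.normal(size=(500, 2))              # two classifier scores (h1, h2) per case
y = (H.sum(axis=1) > 0).astype(int)        # 0/1 labels (stand-in)

# Multi-score calibration on the plain scores (LogReg) ...
logreg = LogisticRegression().fit(H, y)

# ... and on the extended space h1, h2, h1^2, h1*h2, h2^2 (LogRegExt).
logreg_ext = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(H, y)
p_multi = logreg_ext.predict_proba(H)[:, 1]   # calibrated Pr(omega_1 | h1, h2)
```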
There have been many attempts to combine classifiers from an unlabeled dataset. Consider the existence of two black-box trained classifiers, along with their accuracies, without the availability of a labeled dataset; therefore, the set H_{1,2} defined in (9) is not available either. What is available is only the pairs (h₁(x_i), h₂(x_i)) ∀i, produced by directly testing the classifiers on the observations x_i. Before investigating how the two scores can be combined, which has been done (as will be discussed in Sec. 3.3.2), we need to discuss some important aspects.

3.3.1. Known But Ignored Facts

Indeed, h₁ and h₂ are not only two dependent random variables (r.v.s); they are also correlated, because both are designed to produce high (low) scores for malicious (normal) activities. It should be clear that the higher their accuracies, the stronger their correlation, because the majority of the time they will agree on the right decision—a consequence of high accuracies. Combining these scores in general, and estimating the posterior in particular, is theoretically impossible without any further assumptions about their joint PDF. This is clear from the posterior expression (10). For more intuition, consider the counterexample depicted in Table 1. The table represents the truth table of the binary class ω of an observation x along with the outputs h₁(x) and h₂(x) of two binary classifiers. It is clear that the simple AND rule wins over the simple OR rule in two cases (the green cases), whereas the situation is reversed in the other two cases. In four cases, when the classifier outputs agree, the two rules have no effect on the final decision. However, under some assumptions on the classifiers' outputs and their dependence/correlation, some rules can prove superior to others. This is known, and we tersely summarize it in Sec. 3.3.2.

Combining the outputs of two classifiers, whether each output is a posterior or a mere score, and where there is no labeled dataset such as H_{1,2} in (9), was studied more than 20 years ago. It started as a very ad-hoc field, with naive combination rules being suggested, for example sum, product, and so forth. There was no mathematical justification for the underlying assumptions under which these rules should work. In addition, the probabilistic dependence between the outputs of the classifiers (explained above) was completely ignored. As Hastie et al. put it, "there is a vast and varied literature often referred to as combining classifiers which abounds in ad-hoc schemes for mixing methods of different types to achieve better performance. For a principled approach, see Kittler et. al. (1998)." [11, p. 624]. In [12], they justified the mathematical assumptions under which the simple combination rules should work. Afterwards, [13] derived the errors of these combination rules for some particular probability distributions.

Knowing only the accuracies of the classifiers, and with zero knowledge of and no assumptions on the joint density function of the classifier outputs, we do not see any mathematical justification for combining the output scores except from a simple Bayesian point of view. This is pursued by assigning a subjective prior to each classifier proportional to its relative accuracy. Said differently, the accuracy in this context plays the role of a subjective belief (or trust) that justifies taking the classifier's output seriously. The mathematical argument is explained in [11, Sec. 8.8] and goes as follows: suppose each classifier η_k, k = 1, ..., K, is already calibrated and produces the output Pr(ω₁ | x, η_k) for an observation x; then

Pr(ω₁ | x) = Σ_k Pr(ω₁ | x, η_k) Pr(η_k | x),  (11a)
Pr(η_k | x) = A_k / Σ_{k′} A_{k′},  (11b)

where Pr(η_k | x) is the subjective prior assigned to the k-th classifier and A_k is its accuracy. However, the subjective priors (11b) can be any other function that is monotonic in accuracy and sums to 1 (e.g., [14] raise accuracies to an arbitrary power).
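A small sketch of the averaging (11); the example accuracies are hypothetical:

```python
import numpy as np

def combine_posteriors(posteriors, accuracies):
    """Eq. (11a): mixture of K calibrated posteriors Pr(omega_1 | x, eta_k),
    with subjective priors proportional to the accuracies A_k (Eq. (11b))."""
    posteriors = np.asarray(posteriors, dtype=float)
    w = np.asarray(accuracies, dtype=float)
    w = w / w.sum()                          # Eq. (11b)
    return float(w @ posteriors)             # Eq. (11a)

# Two calibrated classifiers, accuracies 0.96 and 0.90:
print(combine_posteriors([0.80, 0.60], [0.96, 0.90]))  # ~0.703
```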
4. Experiments and Results
In this section, we conduct experiments using simulated datasets and two popular public datasets from the cybersecurity domain.
We leverage the generalized lambda distribution (GLD), introduced in [23], to generate a wide variety of score distributions. We follow [28, 30] in using the GLD, but in a more systematic way. The GLD is a univariate unimodal distribution family with a set of four population parameters; we use the notation L(λ₁, λ₂, λ₃, λ₄) to denote such a distribution. These parameters determine the characteristics and shape of the distribution, for example location, skewness, etc. By adjusting these population parameters λ_i, i = 1, ..., 4, we can generate three basic distributions that are symmetric, skewed left, and skewed right. In addition, we add the normal distribution N(µ, σ). For the score calibration problem, we have two distributions F₀ and F₁; each can be chosen from this list of four distributions, which results in 16 pairs. For each of these 16 pairs, without loss of generality, we fix the mean of F₀ to zero; we set the standard deviation to 1; and we adjust the location parameter of F₁ to obtain a desired separability (measured in AUC) between the two distributions; namely, we choose AUC = 0.65, 0.7, 0.9. This is done to express the different discriminating powers of a particular classifier. Figure 1 lists the parameters of the basic four distributions and illustrates the 16 pairs; each pair is illustrated by three AUC values. Each of these 16 pairs is given an index of 1, ..., 16, which is the same index used on the x-axis of Figure B.3, explained later.
This index is not only a mere ID assigned to each pair; rather, it has an important ordering significance that will be explained when explaining that figure.

Figure 1: The parameters of the basic four distributions {a, b, c, d}: a = L(λ₁, −0.1125, −0.1359, −0.1359) (GLD symmetric), b = L(λ₁, 0.014, 0.009695, 0.0285) (GLD left-skewed), c = L(λ₁, 0.014, 0.0285, 0.009695) (GLD right-skewed), and d = N(µ, 1) (normal); and the 16 plots of their cross product used for single-score calibration, indexed 1: (c, b), 2: (d, d), 3: (c, c), 4: (b, b), 5: (c, a), 6: (a, a), 7: (a, b), 8: (b, c), 9: (a, c), 10: (b, a), 11: (b, d), 12: (d, b), 13: (c, d), 14: (d, c), 15: (a, d), 16: (d, a). Each plot is labeled x: (α, β), x = 1, ..., 16, α, β ∈ {a, b, c, d}, where x is the index used in plotting Figure B.3. α is the distribution F₀, with location parameter (λ₁ or µ) set to zero (plotted in black). β is the distribution F₁, with its location parameter set to three different values to give a separation from F₀ of AUC = 0.65, 0.7, 0.9 (plotted in blue, green, and red, respectively, in each of the 16 plots). The figure therefore illustrates 16 × 3 = 48 distribution pairs.

A minor technical issue is that, because of the mathematical formalization of the GLD, the parameter values shown in Figure 1 do not give a distribution directly with σ = 1.
It is quite tedious to find the closed form of σ and solve for the λ_i. Rather, we simulate a large sample and divide by the estimated standard deviation. The smoothed histogram of this standardized version is what is plotted in the figure and what is used for further simulations. It is remarkable that plotting these curves with that smoothness, especially for the heavy-tailed pairs, required simulating up to a million observations.
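For concreteness, a sketch of how such samples can be drawn, assuming [23] is the Ramberg–Schmeiser parameterization of the GLD, whose quantile function is Q(u) = λ₁ + (u^λ₃ − (1 − u)^λ₄)/λ₂; the empirical standardization follows the procedure just described:

```python
import numpy as np

def sample_gld(lam, size, rng):
    """Inverse-transform sampling from L(lam1, lam2, lam3, lam4), assuming the
    Ramberg-Schmeiser quantile function Q(u) = lam1 + (u**lam3 - (1-u)**lam4)/lam2."""
    l1, l2, l3, l4 = lam
    u = rng.uniform(size=size)
    return l1 + (u**l3 - (1.0 - u)**l4) / l2

rng = np.random.default_rng(0)
lam_a = (0.0, -0.1125, -0.1359, -0.1359)            # distribution 'a' of Figure 1
sigma = sample_gld(lam_a, 1_000_000, rng).std()     # estimated from a large sample
scores = sample_gld(lam_a, 320, rng) / sigma        # standardized training scores
```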
For each of these 16 × 3 = 48 distribution pairs, we generate training datasets of sizes n = 10 × 2^i, i = 0, ..., 9. A single configuration (or experiment) is described as follows: we select a pair out of the 48, generate n scores for each class, train each calibrator on this training dataset, and finally test and assess each calibrator on (a) the same training dataset, to get both RMSE_sub,m and RB_sub,m (resubstitution), and (b) a pseudo-infinite testing set of 10,000 scores generated from the same distribution pair, to get RMSE_ind,m and RB_ind,m (independent). The subscript m denotes the m-th Monte-Carlo (MC) trial, which is repeated M times to obtain the average E_M(·) = (1/M) Σ_m (·). This accounts for the variability of the training datasets of finite size n.

The calibrators used in each configuration are those discussed in Sec. 2, excluding PROPROC for reasons discussed in Sec. 4.1.2. These calibrators are as follows:

Logistic Regression (LogReg), with three different versions: (1) Platt's version [22], which we implement in Python by literal translation (line by line) of Matlab code written and shared with us by the authors of [3, 4]—our results can then be compared with theirs; (2) the scikit implementation [19]; and (3) the scikit implementation on the extended score space, as we suggest in Sec. 3.2.2. For short, in the sequel, we call these three versions Platt, LogReg, and LogRegExt.
Isotonic Regression (IsoReg) of the scikit package [19].
Binning Method (BinReg), which we implement with five different numbers of bins, the smallest being 10; each bin count is treated as a separate calibrator, giving, together with the three logistic versions and IsoReg, nine calibrators in total.

Figure B.3, subfigures (a)–(b), illustrates RMSE_ind and RMSE_sub for each of the nine calibrators versus the index of the distribution pair explained in Figure 1. There are 30 plots corresponding to the 3 × 10 cross product of the AUC and n values. Similarly, subfigures (c)–(d) illustrate RB. The index of the distribution pair on the x-axis is not just an ID for the pair; rather, it expresses its rank in terms of the average RMSE_ind. This average is calculated over the nine calibrators across the 3 × 10 configurations. Therefore, the overall trend of the curves increases with that index. This is done to visualize the effect of the distribution pair on the performance of the calibrators in a more informative way. Subfigure (e) illustrates the average of each of the four metrics versus the sample size n. This average is taken, conditionally on each n, over all the 16 × 3 configurations. The figures are typeset in LaTeX to control their position at the pixel level, so that when pages scroll, all fixed components, for example axes, ticks, and so forth, look stationary, and only the components carrying information, for example the curves, run as if animated.

Before delving into the details of this figure, it is important to study its summary in subfigure (e). It is obvious that at smaller n, the two versions of logistic regression, LogReg and Platt, behave better than all others, including the third version, LogRegExt. IsoReg comes second to LogReg. Then, all calibrators improve and converge to almost the same value as n increases, with a slight superiority of the nonparametric BinReg. This comes as no surprise, because it is known that the histogram method converges pointwise to the PDF of a r.v. Also, we observe that a smaller number of bins works better at smaller sample sizes n, and vice versa, quite as expected. This summary subfigure is the final message, of practical value to practitioners for whom the only available information in the calibration problem is the sample size n.

We can infer more about the calibrators' behavior by conducting a deeper investigation of subfigures B.3a–B.3d. A pilot view of subfigures B.3a–B.3b reveals that no calibrator is a consistent winner across all configurations. In addition, the majority of calibrators, with a few exceptions, converge to the same value as n increases. It is obvious that the larger the AUC, i.e., the more the classifier is capable of separating the two classes, the smaller the calibration errors. LogRegExt slightly wins across all distribution pairs only at small values of both n and the AUC. It deteriorates at higher AUC values, in contrast to almost all other calibrators. This may be interpreted as follows: at a higher AUC, the two distributions are already separated, and the problem is much easier for LogReg; indeed, it is not worth the extra variance component exhibited by the extended feature space.

A very interesting observation is that, in terms of RMSE, the calibrators' performance gets worse with distribution pairs 11–16. From Figure 1, those are the six pairs involving the normal distribution. Not to accuse the normal distribution: the second-best distribution pair (and the first best for LogReg) is distribution pair 2, both of whose components are normal (d, d). On the other hand, in terms of RB_ind, the distribution pairs hardly affect the calibrators' performance. We think that both observations are very counter-intuitive, and no qualitative interpretation is ready off the shelf. Rather, it requires a decomposition into the components of variance under the corresponding parametric distribution pairs, which is quite out of the scope of the present article and is deferred to future work.

It is obvious that RB_sub is optimistically biased, which complies with the conventional wisdom in the field of ML for any performance measure. However, this wisdom is surprisingly violated when comparing RMSE_sub with RMSE_ind for almost all calibrators. It seems that RMSE_sub is at least the same as, or pessimistically biased relative to, RMSE_ind for most configurations.
A similar observation is reported in [4], where the authors provided their interpretation, with which we agree. Indeed, the Brier score is itself the objective function that is minimized by the different regression methods, which learn from the truth of the class indicator variable. This indicator variable is, as well, the one involved in the Brier score definition (5b). This causes the usual optimistic/pessimistic bias of resubstitution/independent testing. In contrast, for the MSE, the truth of the posterior probability (the variable involved in the MSE definition (5a)) is not involved in the training phase. Hence, the estimated MSE using resubstitution differs from that using independent testing only by the variance exhibited by the finite sample size of the testing dataset.

The PROPROC software version of [20] calculates the LHR (1b) (Eq. (4a) in [20]), which is necessary for calibrating new scores after fitting the ROC curve. This software version was used by the authors of [3, 4]. However, this version is available neither publicly at the official website of the University of Chicago nor by personal communication with the developers and administrative staff. After the inventor of PROPROC, Charles E. Metz, passed away, it seems that the publicly available version of PROPROC was frozen and is no longer updated. Therefore, the best that we can do is compare our nine calibrators to the PROPROC results published in [3, 4] using their limited number of configurations. This requires reproducing their results so that we can compare all methods in the same workflow. Figure B.4 uses their visualization method to illustrate a reproduction of their results on the two distributions they considered, normal and beta; we used the calibrators LogReg, Platt, and IsoReg. It is clear that the two versions of logistic regression, LogReg and Platt, are almost identical and compare well to PROPROC for the normal distribution. However, Platt outperforms the others at small n for the beta distribution; then, all converge to the same performance as n increases. Of course, this comparison is very vulnerable to criticism, because these calibrators may not compare well on other configurations. However, because of the unavailability of PROPROC, no further conclusions can be reached.

As discussed in Sec. 3.2, the only candidate for multi-score calibration is the logistic regression. Therefore, we construct the following set of experiments to study the effect on this calibrator of extending the feature space; that is, we compare LogReg to LogRegExt. Suppose we have two classifiers (scoring functions) h₁, h₂. Because we have two classes ω₀, ω₁, we then have four distributions F_ij, i = 0, 1, j = 1, 2, where F_ij is the distribution of the scoring function h_j under the class ω_i. We choose each of the F_ij from the list of the four basic distributions {a, b, c, d} presented at the top of Figure 1. Therefore, we have a total of 4⁴ = 256 combinations of distribution pairs. We always adjust the parameters of F_0j, j = 1, 2, to have zero mean and unit variance. We adjust the parameters of F_1j, j = 1, 2, to both have unit variance and achieve a desired separation (expressed in terms of the AUC) from F_0j. We consider the three different values of AUC used in the single-score calibration. In addition, if two classifiers provide scores for the same problem, we anticipate some correlation between their scores, because the two classifiers were initially designed to solve the same problem. We account for this by simply assuming a correlation coefficient ρ between the scores h_j | ω₀ ∼ F_0j, j = 1, 2 (i.e., between h₁ and h₂ under ω₀), and likewise between the scores h_j | ω₁ ∼ F_1j, j = 1, 2 (i.e., between h₁ and h₂ under ω₁). We construct this correlation by building the model below. For class ω₀, we consider the two r.v. v₁ ∼ F₀₁ and v₂ ∼ F₀₂; then, we construct the scores by the multivariate transformation

h₁ = v₁,  (12)
h₂ = v₁ ρ + v₂ √(1 − ρ²).  (13)

It is quite straightforward to show that h₁ and h₂ have a correlation coefficient of ρ. Similarly, we construct h₁ and h₂ for class ω₁ after adjusting the parameters for the required AUC.
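A direct transcription of the transformation (12)–(13); the standard-normal draws stand in for the chosen basic distributions, and the values of ρ and n are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.5, 320                            # hypothetical correlation and sample size

v1 = rng.standard_normal(n)                  # v1 ~ F01 (stand-in draw)
v2 = rng.standard_normal(n)                  # v2 ~ F02 (stand-in draw)

h1 = v1                                      # Eq. (12)
h2 = v1 * rho + v2 * np.sqrt(1 - rho**2)     # Eq. (13): corr(h1, h2) = rho
```

Because v₁ and v₂ are independent with unit variance, the linear combination (13) has unit variance and correlation ρ with h₁, regardless of the underlying basic distribution.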
10 points atparticular ρ and AUC. A common heat map of the subfigures codes the sample size n , and is displayed under subfigure(e). The latter summarizes subfigures (a)–(d) by plotting the average of each of the four performance measures for eachcalibrator versus n . The average is taken over all the 4 × × n .From the summary subfigure (e), it is obvious that the extended features enhances the LogRegExt only asymptoticallyand marginally. However, from the detailed figures (a)–(b), it is obvious that in many configurations, the LogRegExt hasa lower error than LogReg (points located below the y = x line.) The majority of these points are red (large n ) and occurat high ρ and both medium and large AUCs. This is not surprising because at a larger sample size, the regression modelcan benefit from the extended feature space without overfitting. At a stronger correlation, there will be more informationgained from the extended feature h h , which will boost LogRegExt. However, this little improvement may be washed outat smaller AUC values because the problem becomes harder. At this low AUC, the two LH functions are very interleaved;hence, the LHR will be close to 1, which gives a posterior close to 0.5 and average Brier score close to 0.5 as well. This isevident from subfigures (c)–(d). We observe again that both RMSE ind and RMSE sub are almost identical. Subfigures (c)–(d)show that in terms of RB, LogReg is better than LogRegExt in almost all configurations and that the extended features didnot boost the latter except at very small fraction of configurations with very little improvements. To test its utility, we compare the multi-score calibration of logistic regression to its single-score calibration. In bothcases, we compare its performance with and without feature extension. We explain how we establish and visualize thiscomparison for LogReg, and the same applies immediately to LogRegExt. Consider a configuration used to generate asingle point on Figure B.5a; we denote the performance of LogReg (the x coordinate of this point) by r , and the subscriptrefers to the fact that the performance is obtained from the multi-score calibration of the two features h and h . Under theaforementioned configuration, we calibrate LogReg using each of the single score h or h individually; we denote theseperformances by r ,r . Then, for all configurations, we plot r , / r versus r , / r . We do that for all the eight combinations{RMSE, RB} × {LogReg, LogRegExt} × {ind, sub}.The results of these comparisons are illustrated in Figure B.6, which consists of nine subfigures that follow the sameplotting style, convention, and legends of Figure B.5. The first eight subfigures correspond to the eight combinations,respectively, with the order of the flattened cross-product. If the multi-score calibration is better than each individualsingle-score calibration, that is r , < min( r ,r ), then the point will be inside the unit square. The mark + inside the unitsquare is located at position ( p,p ), where p is the fraction of points located inside the unit square. Therefore, p representshow frequent the multi-score calibration wins over both single score calibrations. 
The summary of the eight subfigures (a)–(h) is illustrated in subfigure (i), which plots the average p versus n for each of the eight combinations, where the average is taken over all configurations of each subfigure, conditionally on each value of n.

From the summary subfigure (i), in terms of RMSE, the multi-score calibration improves LogReg in only roughly less than 7% of experiments, almost constantly with the sample size n; it improves LogRegExt in less than 9% of experiments at n = 320. In terms of RB, the multi-score calibration improves the performance in at least 80% of the experiments at small n and in almost 100% of experiments at n > 40. The detailed subfigures (a)–(h) reveal more. Subfigures (a) and (c) show that, for LogReg and LogRegExt, respectively, p is small at low AUCs, regardless of ρ. As explained above when we commented on Figure B.5, the calibration problem itself is hard there, and the posterior is near 0.5; this leaves a small margin for improvement. However, at higher AUCs and a low correlation coefficient ρ, the fraction p increases for LogReg and increases significantly for LogRegExt. This increase in the value of p retracts again when increasing ρ. The interpretation is that at a high AUC the problem is easier for each single-score calibration, and at a high ρ the two scores behave almost the same; hence, combining both scores in the multi-score calibration may not be of great value. Subfigures (e)–(h) illustrate the RB and show that, for the majority of the experiments, p ≈ 1. The improvement factor is significant, almost at 0.75, where the majority of points are located at the upper right corner of the unit square.
We carried out experiments, similar to those of the previous section, on two real datasets from the cybersecurity domain: the trec05-1 dataset [5, 6] and the drebin dataset [1, 27]. We interpret the results here and defer further investigations and interpretations to the discussion in Sec. 5.

4.2.1. trec05-1 Dataset
Trec05-1 is a popular public email security dataset consisting of 72,450 emails labeled as spam or not spam. Because the dataset is very large, we have the luxury of splitting it into training and testing sets to construct the classifiers before the calibration process. Therefore, we split the dataset into two halves, each with 36,225 samples. Support vector machine (SVM) and random forest (RF) classifiers were trained on n-gram features extracted from the training dataset; then, both classifiers were frozen, and the training set was never used in the rest of the experiments. The SVM and RF were tested on the testing set, and each produced 36,225 scores. These scores have a correlation coefficient of 0.91, and the two classifiers achieved, respectively, 96.6% and 97.53% balanced accuracy and 0.994 and 0.996 AUC. The smoothed histograms of these scores, under the two classes, are shown in Figure B.7. For the RF, these histograms are abruptly exponential-like and surprisingly very different from the simulated datasets in the previous sections. It is worth mentioning that [29] proved that classifier scores can be abruptly exponential even when the training features are normally distributed.

It is interesting to study the behavior of the different calibrators on different sample sizes n obtained from this dataset. Said differently, we pretend we are in a situation where we have obtained only n observations from this dataset. Then, to train the calibrators on different training sets of the same size n, to mimic an MC trial, and to further test the trained calibrators on a large independent testing set, we proceeded as follows. Out of the 36,225 scores available from testing each classifier, we sequestered 10,000 to serve as the independent testing set for the calibrators. Then, for each value of n = 10 × 2^i, i = 0, ..., 9, we randomly selected M training sets, each of size n, and trained the calibrators on these training sets. We report only RB_ind and RB_sub, because the RMSE is not available owing to the absence of the true posterior probability. We use the same calibrators and the same comparative study used for the simulated datasets above.

The results of these experiments are illustrated in Figure B.8, which plots RB_ind and RB_sub versus the sample size n for three cases: single-score calibration of each of the SVM and RF scores, and multi-score calibration using both. The results differ from those obtained on the simulated datasets. For the single-score calibration, although all calibrators converge to almost the same performance as n increases, at small n the simple BinReg competes with IsoReg, and third place goes to LogReg or Platt for the SVM or RF, respectively. Platt almost wins over the rest of the values of n. LogReg and LogRegExt are the worst two; however, LogRegExt is better than LogReg on the RF scores only. The multi-score calibration, with either LogReg or LogRegExt, wins over all the single-score calibrators except at small values of n. This once more shows the value of our proposed multi-score calibration in many (not all) cases.
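A sketch of this subsampling protocol for one calibrator and one (n, M) setting; the arrays, and the values of n and M, are stand-ins for the frozen-classifier scores described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
scores = rng.standard_normal(36_225)               # stand-in for the frozen SVM scores
y = (scores > 0.5).astype(int)                     # stand-in labels

test = rng.choice(scores.size, 10_000, replace=False)   # sequestered testing set
pool = np.setdiff1d(np.arange(scores.size), test)

n, M, rb_ind = 320, 100, []                        # hypothetical n and M
for m in range(M):                                 # Monte-Carlo trials
    tr = rng.choice(pool, n, replace=False)        # one training set of size n
    cal = LogisticRegression().fit(scores[tr, None], y[tr])
    p = cal.predict_proba(scores[test, None])[:, 1]
    rb_ind.append(np.sqrt(np.mean((p - y[test]) ** 2)))  # RB_ind,m, Eq. (7b)

print(np.mean(rb_ind))                             # E_M(RB_ind)
```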
4.2.2. drebin Dataset

Drebin is a popular malware analysis dataset consisting of features extracted from 129,013 Android applications, 5,560 of which are malicious applications belonging to 179 different malware families, while the rest are benign. From each application, 545,334 binary features are extracted, indicating the presence of various API calls, permissions, network addresses, and so forth. Because of the size of the dataset, it is not available as one large file. Instead, for each sample, the plain-text representation of each present feature is listed in a text file; features that are not present are not listed. This allows for a very condensed representation of the dataset, because most samples contain well under 100 features.

To establish some basis of comparison, we used the SVM and RF for this dataset. Because these algorithms do not support online (incremental) learning, the dataset had to be manually extracted into one large matrix using the sparse matrix representation of Scipy. If the dataset were stored in a standard numpy array, it would require at least 70 GB just to fit in memory (assuming a byte-sized datatype for each element).

We used the Scikit implementation (with liblinear for the SVM) of each classifier and trained them using 10-fold CV, where the predicted scores on the left-out folds were used for calibration. This is because the sample size of the malicious class is only 5,560. We adjusted the classification threshold so that the FPF was 1 in 100 (the same FPF chosen in [1]). Our SVM and RF achieved a 0.955 and 0.971 detection rate, respectively. This is higher than the reported detection rate of 0.939 in [1], where the SVM was used. The scores of the SVM and RF achieved AUCs of 0.992 and 0.996, respectively, and their correlation coefficient is 0.64. Figure B.7 illustrates the histograms of the scores; they show almost the same pattern as those of the trec05-1 dataset.

The results of these calibration experiments are illustrated in Figure B.8 and are plotted in the same way as those of the trec05-1 dataset. The results are very similar but have some differences. For single-score calibration, all the methods, except LogRegExt on the SVM and both LogRegExt and LogReg on the RF, are almost comparable at all values of n. Compared with their performance on the RF scores of the trec05-1 dataset, it seems that LogReg and LogRegExt switched places. However, LogReg on the multi-score calibration is almost the overall winner.

All experiments on the simulated datasets were carried out on the Compute Canada Cedar Cluster. For the single-score calibration (Sec. 4.1.1), the experiments were run using a job array for scheduling multiple independent jobs (scripts), with all jobs running in parallel. We wrote 48 separate jobs, one job for each configuration of the cross product of the 16 distribution pairs and three AUC values; each job sweeps the 10 values of n. A job took about an hour on one core with 8 GB of RAM, obtaining 96% CPU utilization and 63% memory utilization. Only the single-score simulation for Platt's sigmoid fit was carried out separately, on an eight-core machine with 16 GB of memory; it took 36 hours to complete using six threads, with a thread per job.

For the multi-score calibration (Sec. 4.1.3), a job array with 768 jobs was created, one job for each configuration of the cross product of the 4⁴ = 256 distribution pairs and three AUC values. Each job accounts for the cross product of the three values of ρ and the 10 values of n. The whole job array took five hours to execute, although the majority of jobs were executed in three hours. On average, these jobs obtained 94% CPU efficiency and 62% memory efficiency with one core and 8 GB of RAM per job. These experiments obviously took longer than their counterparts for the single-score calibration, owing to the relative number of configurations (23,040 vs. 480).
5. Discussion
The empirical distributions in Figure B.7, for the real datasets, show two main differences from the simulated data in the previous section: (1) they are exponential-like for the RF scores, and (2) the AUC is very high (0.994, 0.996, 0.992, and 0.996, respectively, for the combinations {trec05-1, drebin} × {SVM, RF} appearing in the figure). Therefore, we designed the following simulation experiment for a deeper investigation. Let F₀ be a truncated exponential distribution and F₁ a flipped truncated exponential distribution, both sharing the same population parameter λ and the finite support [0, 1], so that they look symmetric. The population parameter λ is adjusted to obtain a required AUC, which takes the same three values of the previously simulated data, in addition to the value 0.99.

The results of this experiment are illustrated in Figure B.9. The RMSE_ind, RMSE_sub, RB_ind, and RB_sub of the nine calibrators are plotted versus the sample size n at different AUC values. One more surprise is that the performance for the first three values of AUC complies with the simulated data of the previous section, with some performance deterioration observed for LogReg. Only the performance at the very high AUC of 0.99 looks similar to the performance of the calibrators on the real datasets. Still comparing the results in Figure B.9, but now recalling the results of the real datasets in Figure B.8: at this very high AUC, both LogReg and LogRegExt are inferior to the other methods, including the Platt version of logistic regression, exactly as was observed for the real datasets. At this very high AUC, the two distributions are very separated, which is an ideal situation to exhibit the value of the bias-correction term of Platt's version of logistic regression. IsoReg always has a moderate performance among the others.

It is quite important to note that obtaining such a very high discriminating power, represented as AUC = 0.99+, is very unrealistic in the vast majority of ML applications. The real datasets above, where both the SVM and RF were able to provide a very high AUC, look very clean and atypical. Such a very high AUC means that the two sets of classification scores are almost perfectly separable, such that any threshold will provide near 100% TPF and 0% FPF.

We try to summarize the findings and the final message of Figures B.3–B.9 to provide advice for practitioners. It is clear that there is no overall winner across all possible configurations. However, major remarks are in order.

1. The Platt version of logistic regression shows superiority over the default version that exists in scientific libraries, for example scikit, in some special configurations. Therefore, it is worth implementing it in an optimized way in these libraries to replace the default version, and it is better to leverage it in the calibration process.
2. Among the single-score calibrators on the plain features, that is, without feature extension, the Platt method is the winner in the vast majority of configurations, even if it is not the overall winner. It only loses in a few experiments, and by a small margin.
3. The isotonic regression, although it rarely wins in contrast to Platt, also rarely deteriorates, similarly to Platt.
4. The PROPROC method does not seem to be feasible, because the software version that can calibrate scores is no longer publicly available.
5. The very rare cases with a very high AUC should not be taken as a basis for comparing the calibrators. These are very special cases, and even though we found them in some real datasets, they should be dealt with on a case-by-case basis. For example, the practitioner should run a Mann-Whitney test on the scores to estimate the AUC and then decide whether it is exceptional or not.
6. The multi-score calibration boosted the performance of LogReg in some experiments in terms of RMSE and in almost all experiments, including the real datasets, in terms of RB.
7. The feature extension boosted the performance in a small fraction of experiments in terms of RMSE and in a negligible fraction of experiments in terms of RB.

Based on all of these observations, if we have to provide only a single piece of advice to practitioners, we would recommend the Platt version of logistic regression for single-score calibration without feature extension. When more classifiers provide scores for the same problem, we would recommend the Platt regression using multi-score calibration, without feature extension as well.

However, for a more mature practitioner, we recall the observation that there is no overall winner among calibrators and configurations. Therefore, one may include more than one calibrator, including Platt, on the extended features, and incorporate all of them in a CV model selection, per the usual paradigm of training and testing ML models.
6. Conclusion
The current paper studied the calibration of binary classifiers, where the classifier score of a testing observation does not necessarily correspond to its posterior probability of belonging to one of the two classes. The paper surveyed the literature, summarized the different methods available for calibrating a single classifier, and formalized the mathematical properties of the calibration process, for example the invariance of some classification performance measures. The paper presented three main contributions: (1) proposing multi-score calibration, that is, designing a calibrator when more than one classifier provides a score for a single observation; (2) treating the score(s) as features and extending them to higher dimensions to design a calibrator; and (3) conducting a massive number of experiments on simulated datasets, on the order of 24,000 experiments, and on two real datasets from the cybersecurity domain, to study the behavior of different calibrators under different configurations. A final piece of advice for practitioners is provided, based on the visualization and interpretation of the results, which show that there is no overall winner among the different calibrators. However, in general: (1) the Platt version of logistic regression has a decent, acceptable performance over a wide range of experiments; (2) the multi-score calibration boosts the performance in the majority of experiments; and (3) extending the score to higher dimensions helps in some experiments.
7. Acknowledgment
We would like to thank our friends Dr. Weijie Chen, of the US FDA, and Dr. Lorenzo Pesce, formerly the developer of the PROPROC software at the University of Chicago, for the long discussions on the software. Special thanks to Dr. Weijie Chen for sharing with us the Matlab code of his publications [3, 4].
Appendix A. Proofs

Corollary 2.
Consider applying a strictly monotonic calibration function to the output score of a classifier, and then treating its output as the new final score. Then, all the performance measures defined in Sec. 1.2, that is, errors, accuracy, risk, ROC, and AUC, along with their estimators, are invariant under this transformation.
Proof. Immediate by direct application of Lemma 3.
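As a quick numerical illustration of this invariance, the following minimal sketch (ours, not from the paper) checks that the empirical AUC is unchanged when the scores pass through a sigmoid, which stands in for any strictly monotonically increasing calibrator:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = np.r_[np.zeros(500), np.ones(500)]
scores = np.r_[rng.normal(0, 1, 500), rng.normal(1, 1, 500)]

# A strictly increasing map preserves the ordering of the scores, hence
# every (FP, TP) point of the empirical ROC curve, and hence the AUC.
calibrated = 1.0 / (1.0 + np.exp(-scores))
assert np.isclose(roc_auc_score(y, scores), roc_auc_score(y, calibrated))
```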
Lemma 3.
Consider any two r.v. $X$ and $Y$ belonging to $\mathbb{R}$ with distribution functions $F_X$ and $F_Y$, respectively, along with a strictly monotonically increasing function $p$, and $p(X) = X' \sim F_{X'}$ and $p(Y) = Y' \sim F_{Y'}$. Then:
1. For every threshold value $\theta \in \mathbb{R}$, with the corresponding probabilities $F_X(\theta)$ and $F_Y(\theta)$, there exists a unique corresponding threshold value $\theta' = p(\theta)$, with the corresponding probabilities $F_{X'}(\theta')$ and $F_{Y'}(\theta')$, such that $F_{X'}(\theta') = F_X(\theta)$ and $F_{Y'}(\theta') = F_Y(\theta)$.
2. For every threshold value $\theta' \in \mathbb{R}$, the converse argument applies.
3. These two properties are not satisfied when relaxing the strict monotonicity assumption and considering $p$ a monotonically increasing function.

Proof.
1. Because the mapping $p$ is strictly monotonically increasing, it is a bijection. Therefore, $\forall x \in (-\infty, \theta]$ there exists a unique $x' = p(x) \in (-\infty, \theta']$, where $\theta' = p(\theta)$; therefore, the two sets are equivalent. Hence, $F_X(\theta) = \Pr[X \le \theta] = \Pr[X' \le \theta'] = F_{X'}(\theta')$. The same argument is immediate for $Y$.
2. Because the function $p$ is a bijection, its inverse $p^{-1}$ is as well; then, the argument is identical to the one above.
3. Consider a non-strictly monotonically increasing function $p$ such that $p(x_1) \le p(x_2)\ \forall x_1 < x_2$. Take any $\theta_1 < \theta_2$, where $p(\theta_1) = p(\theta_2) = \theta'$. Then, for any $\theta_1 < \theta < \theta_2$, we have $p(\theta) = \theta'$. Suppose that for some $\theta \in\ ]\theta_1, \theta_2[$, $\Pr[X \le \theta] = \Pr[X' \le \theta']$. However, $\Pr[X \le \theta] \ne \Pr[X' \le \theta']\ \forall \theta \in\ ]\theta_1, \theta_2[$ unless $\Pr[\theta_1 < X \le \theta_2] = 0$, which is not necessary for any distribution $F_X$.

Appendix B. Pedagogical Example on Calibration
It is more informative to illustrate the calibration process on a real dataset. We use the labeled subset of one of the datasets from the 2019 IEEE BigData Cup Challenges. The dataset consists of event-log data received at a security operation center (SOC) run by a company called Security On-Demand (SOD) [26]. The considered subset contains 39,427 observations. The single feature overallseverity, which represents the intrusion-alert severity generated by the system rules, is an integer value $0, 1, \cdots, 5$; it accounts for the score of an intrusion detection system (IDS). The column notified contains the true label, indicating whether the observation is an intrusion or normal behavior. The 39,427 records include 2,276 intrusions and 37,151 normal records. Because the dataset size is very large, the nonparametric estimate of the LHR, and hence of the posterior probabilities, can be considered a very accurate estimate of the population values.

Figure B.2: Histograms and logistic fits [22] of the scores of an IDS system. Left: without score modification, where the two sets of scores almost overlap (Mann-Whitney statistic of 0.53) and hence are non-predictive, and their logistic fit is very poor. Right: after artificial score modification, where a more natural separability is obvious (Mann-Whitney statistic of 0.92) and the logistic fit is, at least visually, very reasonable. The second row illustrates the same experiment but with both the score and its squared value (two features) given to the logistic regression model. The fit is obviously better.

Before showing how to train the calibrators using the techniques discussed in Sec. 2, it is important to comment on this dataset. The score overallseverity is not a discriminating feature. The Mann-Whitney statistic (2a) of this score is 0.53, which is close to a random guess. Just for the sake of clarification, modifying the value 5 of the score of the normal records to 0 (to mimic a more realistic and practical IDS scoring) gives a Mann-Whitney statistic of 0.92. Figure B.2 illustrates the histograms of the two classes for the data before and after modification.

We calibrate this score, before and after modification, using two calibration methods: the logistic regression and the nonparametric binning method of [32]. Calibrating the original score gives the poor model of Figure B.2a, where the binning method is plotted as individual dot markers and the logistic fit is plotted as a solid curve. (The solid curve is just for illustrating the smoothness of the method; however, because the score takes only the discrete values $0, \cdots, 5$, the curve is effectively discrete as well.)
This poor fit arises from the fact that the two classes have almost the same likelihood functions, which is obvious from both the histograms and the Mann-Whitney value of 0.53. For example, at a score value of 5, the two likelihoods are almost the same, so the posterior probability (4a) will be almost the prior probability $\pi_1$ estimated by $n_1/n = 2{,}276/39{,}427 \approx 0.06$. After the score modification, the logistic fit gives $\hat{f}_{\text{logit}}(h = 5 \mid \omega = \omega_1) = 0.75$; whereas the binning method gives an estimate of exactly 1, because $f_h(h = 5 \mid \omega_2) = 0$, which gives an estimate of the LHR of $\infty$. When we provide the logistic regression with each value of the score $h$ and its square $h^2$ as an extended feature (this approach is detailed in Sec. 3.2), the fit to the posterior probabilities is better, as in Figures B.2c–B.2d.
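Both calibrators in this example are straightforward to reproduce. The following minimal sketch (ours; it uses hypothetical synthetic severity scores standing in for overallseverity rather than the SOD data) contrasts the binning estimate in the spirit of [32] with a logistic regression on the extended features $(h, h^2)$, and also shows the Mann-Whitney estimate of the AUC mentioned above:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Hypothetical discrete severity scores in 0..5 (standing in for
# overallseverity); intrusions skew toward the higher values.
n_norm, n_intr = 5000, 300
h = np.r_[rng.choice(6, n_norm, p=[.35, .30, .20, .10, .04, .01]),
          rng.choice(6, n_intr, p=[.05, .10, .15, .20, .25, .25])]
y = np.r_[np.zeros(n_norm), np.ones(n_intr)]

# Mann-Whitney estimate of the AUC: U / (n1 * n2).
u = mannwhitneyu(h[y == 1], h[y == 0]).statistic
print("AUC (Mann-Whitney):", u / (n_intr * n_norm))

# Binning calibrator: the posterior in a bin is estimated by the
# fraction of positives falling in that bin; here each discrete
# score value is its own bin.
bins = np.unique(h)
binned = {b: y[h == b].mean() for b in bins}
print("binning estimates:", binned)

# Logistic calibrator on the extended features (h, h^2), cf. Sec. 3.2.
ext = np.column_stack([h, h ** 2])
logit = LogisticRegression().fit(ext, y)
grid = np.column_stack([bins, bins ** 2])
print("logistic estimates:",
      dict(zip(bins, logit.predict_proba(grid)[:, 1].round(3))))
```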
References

[1] Arp, D., Spreitzenbarth, M., Hubner, M., Gascon, H., Rieck, K., Siemens, C., 2014. Drebin: Effective and explainable detection of Android malware in your pocket, in: NDSS, pp. 23–26.
[2] Chen, W., Gallas, B.D., Yousef, W.A., 2012. Classifier variability: Accounting for training and testing. Pattern Recognition 45, 2661–2671.
[3] Chen, W., Sahiner, B., Samuelson, F., Pezeshk, A., Petrick, N., 2015. Investigation of methods for calibration of classifier scores to probability of disease, in: Medical Imaging 2015: Image Perception, Observer Performance, and Technology Assessment, International Society for Optics and
Photonics, p. 94161E.
[4] Chen, W., Sahiner, B., Samuelson, F., Pezeshk, A., Petrick, N., 2018. Calibration of medical diagnostic classifier scores to the probability of disease. Statistical Methods in Medical Research 27, 1394–1409.
[5] Cormack, G.V., Lynam, T.R., 2005. TREC 2005 spam track overview, in: TREC, pp. 500–274.
[6] Cormack, G.V., Lynam, T.R., 2007. Online supervised spam filter evaluation. ACM Transactions on Information Systems (TOIS) 25, 11–es.
[7] Dorfman, D.D., Berbaum, K.S., Metz, C.E., 1992. Receiver operating characteristic rating analysis: Generalization to the population of readers and patients with the jackknife method. Investigative Radiology 27, 723–731.
[8] Dorfman, D.D., Berbaum, K.S., Metz, C.E., Lenth, R.V., Hanley, J.A., Abu Dagga, H., 1997. Proper receiver operating characteristic analysis: The bigamma model. Academic Radiology 4, 138–149.
[9] Grudic, G.Z., 2004. Predicting the probability of correct classification. Unpublished technical report.
[10] Hájek, J., Šidák, Z., Sen, P.K., 1999. Theory of Rank Tests. 2nd ed., Academic Press, San Diego, Calif.
[11] Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, New York.
[12] Kittler, J., Hatef, M., Duin, R.P.W., Matas, J., 1998. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on 20, 226–239.
[15] Metz, C.E., Herman, B.A., Shen, J.H., 1998. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Statistics in Medicine 17, 1033–1053.
[16] Metz, C.E., Pan, X., 1999. "Proper" binormal ROC curves: Theory and maximum-likelihood estimation. Journal of Mathematical Psychology 43, 1–33.
[17] Niculescu-Mizil, A., Caruana, R., 2005. Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning, ACM, pp. 625–632.
[18] Pan, X., Metz, C.E., 1997. The "proper" binormal model: Parametric receiver operating characteristic curve estimation with degenerate data. Academic Radiology 4, 380.
[19] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[20] Pesce, L.L., Horsch, K., Drukker, K., Metz, C.E., 2011. Semiparametric estimation of the relationship between ROC operating points and the test-result scale: Application to the proper binormal model. Academic Radiology 18, 1537–1548.
[21] Pesce, L.L., Metz, C.E., 2007. Reliable and computationally efficient maximum-likelihood estimation of "proper" binormal ROC curves. Academic Radiology 14, 814.
[22] Platt, J., et al., 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10, 61–74.
[23] Ramberg, J.S., Tadikamalla, P.R., Dudewicz, E.J., Mykytka, E.F., 1979. A probability distribution and its uses in fitting data. Technometrics 21, 201.
[24] Randles, R.H., Wolfe, D.A., 1979. Introduction to the Theory of Nonparametric Statistics. Wiley, New York.
[25] Roe, C.A., Metz, C.E., 1997. Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: Validation with computer simulation. Academic Radiology 4, 298–303.
[26] Ślęzak, D., Chądzyńska-Krasowska, A., Holland, J., Synak, P., Glick, R., Perkowski, M., 2017. Scalable cyber-security analytics with a new summary-based approximate query engine, in: 2017 IEEE International Conference on Big Data (Big Data), IEEE, pp. 1840–1849.
[27] Spreitzenbarth, M., Freiling, F., Echtler, F., Schreck, T., Hoffmann, J., 2013. Mobile-Sandbox: Having a deeper look into Android applications, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing, pp. 1808–1815.
[28] Yousef, W.A., 2013. Assessing classifiers in terms of the partial area under the ROC curve. Computational Statistics & Data Analysis 64, 51–70.
[29] Yousef, W.A., 2019. Prudence when assuming normality: An advice for machine learning practitioners. arXiv preprint arXiv:1907.12852.
[30] Yousef, W.A., Kundu, S., Wagner, R.F., 2009. Nonparametric estimation of the threshold at an operating point on the ROC curve. Computational Statistics & Data Analysis 53, 4370–4383.
[31] Yousef, W.A., Wagner, R.F., Loew, M.H., 2006. Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28, 1809–1817.
[32] Zadrozny, B., Elkan, C., 2001. Learning and making decisions when costs and probabilities are both unknown, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 204–213.
[33] Zadrozny, B., Elkan, C., 2002.
Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 694–699.

Figure B.3a: single-score calibration: RMSE.
Figure B.3b: single-score calibration: RMSE, cont.
Figure B.3c: single-score calibration: RB.
Figure B.3d: single-score calibration: RB, cont.
Figure B.3e: single-score calibration: summary plot of (a)–(d).
Legend: 10 bins, 20 bins, 30 bins, 40 bins, 50 bins, 50binPlatt, LogReg, LogRegExt, IsoReg; Independent, Resubstitution.
Figure B.3: consists of 5 subfigures (a)–(e) with a common legend for the calibrators. (a)–(b) illustrate RMSE_ind and RMSE_sub of the 9 calibrators for the 16 × 3 × 10 configurations. The 16 pairs are explained in Figure 1, and the x-axis expresses their relative rank after ordering the average RMSE over all calibrators; the 3 × 10 cross product of AUC and n values is listed on the top and left of the subfigures, respectively. (c)–(d) illustrate RB_ind and RB_sub with the same convention as (a)–(b). Subfigure (e) summarizes (a)–(d) by illustrating the average of RMSE_ind and RMSE_sub (left), and RB_ind and RB_sub (right). The averaging is taken over all configurations conditionally on each value of n.
Figure B.4: An exact reproduction of the experiments in [3, 4] in terms of MSE and Brier score (rather than their roots) for three calibrators (IsoReg, LogReg, Platt; both ind and sub versions), on Normal (first two panels) and Beta (second two panels) distributions. The two versions LogReg and Platt are almost identical.
Figure B.5a: multi-score calibration: RMSE_ind.
Figure B.5b: multi-score calibration: RMSE_sub.
Figure B.5c: multi-score calibration: RB_ind.
Figure B.5d: multi-score calibration: RB_sub.
Figure B.5e: multi-score calibration: summary plot for (a)–(d).
Figure B.5: consists of 5 subfigures (a)–(e) with two legends: (left) the heat map that codes the sample size n for subfigures (a)–(d), and (right) the calibrators for subfigure (e). (a)–(d) illustrate the RMSE_ind, RMSE_sub, RB_ind, and RB_sub, respectively, of LogRegExt vs. LogReg for all configurations. Each subfigure consists of 3 × 3 plots; each plot illustrates 4 × 10 configurations (points) at a particular ρ and AUC. Subfigure (e) summarizes (a)–(d) by illustrating the average of RMSE_ind and RMSE_sub (left), and RB_ind and RB_sub (right). The averaging is taken over all configurations conditionally on each value of n.

Figure B.6a: multi- vs. single-score calibration, LogReg, RMSE_ind.
Figure B.6b: multi- vs. single-score calibration, LogReg, RMSE_sub.
Figure B.6e: multi- vs. single-score calibration, LogRegExt, RMSE_ind.
Figure B.6f: multi- vs. single-score calibration, LogRegExt, RMSE_sub.
Figure B.6c: multi- vs. single-score calibration, LogReg, RB_ind.
Figure B.6d: multi- vs. single-score calibration, LogReg, RB_sub.
Figure B.6g: multi- vs. single-score calibration, LogRegExt, RB_ind.
Figure B.6h: multi- vs. single-score calibration, LogRegExt, RB_sub.
Figure B.6i: multi- vs. single-score calibration: summary plot for (a)–(h).
Figure B.6: consists of 9 subfigures, with the same plotting style, convention, and legends as Figure B.5. The first 8 subfigures correspond to the 8 combinations of the cross product {RMSE, RB} × {LogReg, LogRegExt} × {ind, sub}, as indicated by the sub-captions. Each subfigure consists of 3 × 3 plots (for the ρ and AUC combinations); each plot illustrates 4 × 10 points (for the distribution pairs and n combinations); each point is $(r_{12}/r_1,\ r_{12}/r_2)$, where $r_1$, $r_2$, and $r_{12}$ are, respectively, the performance of the calibrator using each of the two single scores and the multi-score. The symbol + inside the unit square is located at $(p, p)$, where $p$ is the ratio of the points inside the unit square to the total number of points.

Figure B.7: Smoothed histograms of the two classes $\omega_1$ (blue) and $\omega_2$ (red), of each of the SVM and RF scores, on each of the terk05-1 and drebin datasets. The AUC values of the four plots are, from left to right: 0.994, 0.996, 0.992, 0.996. The correlation coefficients between the SVM and RF scores on the two datasets are 0.91 and 0.64, respectively.
Figure B.8: The RB_ind and RB_sub of all calibrators on the real datasets terk05-1 (top) and drebin (bottom). On each row, there are three plots corresponding to the calibration of: the SVM single-score, the RF single-score, and the multi-score of both. The behavior on both datasets is similar.
Figure B.9: RMSE (top row) and RB (bottom row) of the nine calibrators versus n, at AUC = 0.60, 0.75, 0.90, and 0.99. Legend: 10 bins, 20 bins, 30 bins, 40 bins, 50 bins, 50binPlatt, LogReg, LogRegExt, IsoReg; Independent, Resubstitution.