Evaluating probabilistic classifiers: Reliability diagrams and score decompositions revisited
Timo Dimitriadis, Tilmann Gneiting, and Alexander I. Jordan
Computational Statistics (CST) Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
Institute of Economics, University of Hohenheim, Stuttgart, Germany
Institute for Stochastics, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
August 10, 2020
Abstract
A probability forecast or probabilistic classifier is reliable or calibrated if the predicted probabilities are matched by ex post observed frequencies, as examined visually in reliability diagrams. The classical binning and counting approach to plotting reliability diagrams has been hampered by a lack of stability under unavoidable, ad hoc implementation decisions. Here we introduce the CORP approach, which generates provably statistically Consistent, Optimally binned, and Reproducible reliability diagrams in an automated way. CORP is based on non-parametric isotonic regression and implemented via the Pool-adjacent-violators (PAV) algorithm – essentially, the CORP reliability diagram shows the graph of the PAV-(re)calibrated forecast probabilities. The CORP approach allows for uncertainty quantification via either resampling techniques or asymptotic theory, furnishes a new numerical measure of miscalibration, and provides a CORP based Brier score decomposition that generalizes to any proper scoring rule. We anticipate that judicious uses of the PAV algorithm yield improved tools for diagnostics and inference for a very wide range of statistical and machine learning methods.
Keywords: calibration | discrimination ability | probability forecast | reliability diagram | weather prediction

We thank Andreas Fink, Peter Knippertz, Peter Vogel and seminar participants at MMMS2 for providing data, discussion and encouragement. Our work has been supported by the Klaus Tschira Foundation, by the University of Hohenheim, by the Helmholtz Association, and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 257899354 – TRR 165.

Introduction
Calibration or reliability is a key requirement on any probability forecast or probabilistic classifier. In a nutshell, a probabilistic classifier assigns a predictive probability to a binary event. The classifier is calibrated or reliable if, when looking back at a series of extant forecasts, the conditional event frequencies match the predictive probabilities. For example, if we consider all cases with a predictive probability of about 0.80, the observed event frequency ought to be about 0.80 as well. While for many decades researchers and practitioners have been checking calibration in myriads of applications (1, 2), the topic is subject to a surge of interest in machine learning (3), spurred by the recent recognition that “modern neural networks are uncalibrated, unlike those from a decade ago” (4).
The key diagnostic tool for checking calibration is the reliability diagram, which plots the observed event frequency against the predictive probability. In discrete settings where there are only a few predictive probabilities, such as, e.g., 0, 0.1, . . . , 0.9, 1, this is straightforward. However, statistical and machine learning approaches to binary classification generate continuous predictive probabilities that can take any value between 0 and 1, and typically the forecast values are pairwise distinct. In this ubiquitous setting, researchers have been using the “binning and counting” approach, which starts by selecting a certain, typically arbitrary number of bins for the forecast values. Then, for each bin, one plots the respective conditional event frequency versus the midpoint or average forecast value in the bin. For calibrated or reliable forecasts the two quantities ought to match, and so the points plotted ought to lie on, or close to, the diagonal (2, 5).

In Fig. 1(a,c,e) we show reliability diagrams based on the binning and counting approach with a choice of m = 10 equally spaced bins for 24-hour ahead daily probability of precipitation forecasts at Niamey, Niger in July–September 2016. They concern three competing forecasting methods, including the world-leading, 52-member ensemble system run by the European Centre for Medium-Range Weather Forecasts (ENS, 6), a reference forecast called extended probabilistic climatology (EPC), and a purely data-driven statistical forecast (Logistic), as described by Vogel et al. (7, Fig. 2).

Not surprisingly, the classical approach to plotting reliability diagrams is highly sensitive to the specification of the bins, and the visual appearance may change drastically under the slightest change. We show an example in Fig. 2(a–c) for a fourth type of forecast at Niamey, namely, a statistically postprocessed version of the ENS forecast called ensemble model output statistics (EMOS), for which choices of m = 9, 10, or 11 equidistant bins yield drastically distinct reliability diagrams. This is a disconcerting state of affairs for a widely used data analytic tool, and contrary to well-argued recent pleas for reproducibility (8) and stability (9). Similar instabilities under the binning and counting approach have been reported for numerical measures of calibration, even when the size n of the dataset considered is large (10, p. 6, 11, Sect. 3.1).

While methods for the choice of the binning and related implementation decisions for reliability diagrams have been proposed in the literature (5, 12, 13), extant approaches lack theoretical justification, are elaborate, and have not been adopted by practitioners. Instead, researchers across disciplines continue to craft reliability diagrams and report associated measures of (mis)calibration, such as the Brier score reliability component (14–16), based on ad hoc choices. In this light, Stephenson et al. (17, p. 757) call for the development of “nonparametric approaches for estimating the reliability curves (and hence the Brier score components), which also include[d] point-wise confidence intervals.”

Figure 1: Reliability diagrams for probability of precipitation forecasts over Niamey, Niger (7) in July–September 2016 under (a,b) ENS, (c,d) EPC, and (e,f) Logistic methods. At left (a,c,e), we show reliability diagrams under the binning and counting approach with a choice of ten equally spaced bins. At right (b,d,f), we show CORP reliability diagrams with uncertainty quantification through 90% consistency bands; the CORP panels display MCB values of 0.066 (ENS), 0.022 (EPC), and 0.017 (Logistic). The histograms at bottom illustrate the distribution of the n = 86 forecast values.

Figure 2: Reliability diagrams for probability of precipitation forecasts over Niamey, Niger (7) in July–September 2016 with the EMOS method, using the binning and counting approach with a choice of (a) 9, (b) 10, and (c) 11 equidistant bins, together with (d) the CORP reliability diagram (MCB = 0.018), for which we provide uncertainty quantification through 90% consistency bands.

Here we introduce a new approach to reliability diagrams and score decompositions, which resolves these issues in a theoretically optimal and readily implementable way, as illustrated on the forecasts at Niamey in Figs. 1(b,d,f) and 2(d). In a nutshell, we use nonparametric isotonic regression and the pool-adjacent-violators (PAV) algorithm to estimate conditional event probabilities (CEPs), which yields a fully automated choice of bins that adapts to both discrete and continuous settings, without any need for tuning parameters or implementation decisions. We call this stable, new approach CORP, as its novelty and power include the following four properties.
Consistency
The CORP reliability diagram and associated numerical measures of (mis)calibration are consistent in the classical statistical sense of convergence to population characteristics. We leverage existing asymptotic theory (18–20) to demonstrate that the rate of convergence is best possible, and to generate large sample consistency and confidence bands for uncertainty quantification.

Optimality
The CORP reliability diagram is optimally binned, in that no other choice of bins generates more skillful (re)calibrated forecasts, subject to regularization via isotonicity (21, Thm. 1.10, 22, 23).
Reproducibility
The CORP approach does not require tuning parameters or implementation decisions, thus yielding well-defined and readily reproducible reliability diagrams and score decompositions.
Pool-adjacent-violators (PAV) algorithm based
CORP is based on nonparametric isotonic regression and implemented via the PAV algorithm, a classical iterative procedure with linear complexity only (24, 25). Essentially, the CORP reliability diagram shows the graph of the PAV-(re)calibrated forecast probabilities.

In the remainder of the article we provide the details of CORP reliability diagrams and score decompositions, and we substantiate the above claims via mathematical analysis and simulation experiments.
The basic idea of CORP is to use nonparametric isotonic regression to estimate a forecast's CEPs as a monotonic, non-decreasing function of the original forecast values. Fortunately, in this simple setting there is one, and only one, kind of nonparametric isotonic regression, for which the PAV algorithm provides a simple algorithmic solution (24, 25). To each original forecast value, the PAV algorithm assigns a (re)calibrated probability under the regularizing constraint of isotonicity, as illustrated in textbooks (26, Figs. 2.13 and 10.7), and this solution is optimal under a very broad class of loss functions (21, Thm. 1.10). In particular, the PAV solution constitutes both the nonparametric isotonic least squares and the nonparametric isotonic maximum likelihood estimate of the CEPs.

The CORP reliability diagram plots the PAV-calibrated probability versus the original forecast value, as illustrated on the Niamey data in Figs. 1(b,d,f) and 2(d). The PAV algorithm assigns calibrated probabilities to the individual unique forecast values, and we interpolate linearly in between, to facilitate comparison with the diagonal that corresponds to perfect calibration. If a group of (one or more) forecast values are assigned identical PAV-calibrated probabilities, the CORP reliability diagram displays a horizontal segment. The horizontal sections can be interpreted as bins, and the respective PAV-calibrated probabilities are simply the bin-specific empirical event frequencies. For example, we see from Fig. 1(b) that the PAV algorithm assigns a calibrated probability of 0.125 to one range of low ENS forecast values, and a calibrated probability of 0.481 to another, higher range of ENS values. The PAV algorithm guarantees that both the number and the positions of the horizontal segments (and hence the bins) in the CORP reliability diagram are determined in a fully automated, optimal way.

The assumption of nondecreasing CEPs is natural, as decreasing estimates are counterintuitive, routinely being dismissed as artifacts by practitioners. Furthermore, the constraint provides an implicit regularization, serving to stabilize the estimate and counteract overfitting, despite the method being entirely nonparametric. Under the binning and counting approach, small or sparsely populated bins are subject to overfitting and large estimation uncertainty, as exemplified by the sharp upward spike at about 0.25 in Fig. 2(b). The assumption of isotonicity in CORP stabilizes the estimate and avoids artifacts (Fig. 2d).
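To make the construction concrete, the following minimal R sketch implements weighted PAV recalibration and draws a bare-bones CORP-style reliability diagram. It is not the reliabilitydiag package (29): the function names pava and corp_recalibrate are our own illustrative choices, and the sketch omits the uncertainty quantification and the discrete/continuous display logic described below.

# Weighted pool-adjacent-violators (PAV): isotonic least squares fit to y.
pava <- function(y, w = rep(1, length(y))) {
  m <- numeric(0); wt <- numeric(0); sz <- integer(0)  # block means, weights, sizes
  for (i in seq_along(y)) {
    m <- c(m, y[i]); wt <- c(wt, w[i]); sz <- c(sz, 1L)
    k <- length(m)
    while (k > 1 && m[k - 1] > m[k]) {  # pool adjacent blocks violating isotonicity
      m[k - 1] <- (wt[k - 1] * m[k - 1] + wt[k] * m[k]) / (wt[k - 1] + wt[k])
      wt[k - 1] <- wt[k - 1] + wt[k]
      sz[k - 1] <- sz[k - 1] + sz[k]
      m <- m[-k]; wt <- wt[-k]; sz <- sz[-k]
      k <- k - 1
    }
  }
  rep(m, times = sz)  # expand block means back to one fitted value per input
}

# CORP recalibration: isotonic fit to the bin-wise event frequencies,
# weighted by the number of cases per unique forecast value.
corp_recalibrate <- function(x, y) {
  ux <- sort(unique(x))                     # unique forecast values, increasing
  f  <- factor(x, levels = ux)
  n_j    <- as.numeric(table(f))            # cases per unique forecast value
  ybar_j <- as.numeric(tapply(y, f, mean))  # raw event frequency per value
  data.frame(x = ux, cep = pava(ybar_j, n_j))
}

# Example: miscalibrated forecasts with true CEP sqrt(x), as in Fig. 3.
set.seed(1)
x <- runif(1024)
y <- rbinom(1024, 1, sqrt(x))
rd <- corp_recalibrate(x, y)
plot(rd$x, rd$cep, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Forecast value", ylab = "CEP")
abline(0, 1, lty = 2)  # diagonal: perfect calibration

Note that pooling the raw bin-wise event frequencies only where isotonicity is violated is exactly what yields the automated binning: each surviving block corresponds to one horizontal segment of the diagram.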
Figure 3: CORP reliability diagrams in the setting of (a,b) discretely and (c,d) continuously, uniformly distributed, simulated predictive probabilities x with a true, miscalibrated CEP of √x, with uncertainty quantification via (a,c) consistency and (b,d) confidence bands at the 90% level. The displayed MCB values are 0.042 in (a,b) and 0.039 in (c,d).

In contrast to the binning and counting approach, which has not been subject to asymptotic analysis, CORP reliability diagrams are provably statistically consistent: If the predictive probabilities and event realizations are samples from a fixed, joint distribution, then the graph of the diagram converges to the respective population equivalent, as a direct consequence of existing large sample theory for nonparametric isotonic regression estimates (18–20). Furthermore, CORP is asymptotically efficient, in the sense that its automated choice of binning results in an estimate that is as accurate as possible in the large sample limit. In Appendix B we formalize these arguments and report on a simulation study, for which we give details in Appendix A, and which demonstrates that the efficiency of the CORP approach also holds in small samples.

Traditionally, reliability diagrams have been accompanied by histograms or bar plots of the marginal distribution of the predictive probabilities, on either standard or logarithmic scales (e.g., 27). Under the binning and counting approach, the histogram bins are typically the same as the reliability bins. In plotting CORP reliability diagrams, we distinguish discretely and continuously distributed classifiers or forecasts. Intuitively, the discrete case refers to forecast values that only take on a finite and sufficiently small number of distinct values. Then we show the PAV-calibrated probabilities as dots, interpolate linearly in between, and visualize the marginal distribution of the forecast values in a bar diagram, as illustrated in Fig. 3(a,b). For continuously distributed forecasts, essentially every forecast takes on a different value, whence the choice of binning becomes crucial. The CORP reliability diagram displays the bin-wise constant PAV-calibrated probabilities in horizontal segments, which are linearly interpolated in between, and we use the Freedman–Diaconis rule (28) to generate a histogram estimate of the marginal density of the forecast values, as exemplified in Fig. 3(c,d). In our software implementation (29) a simple default is used: If the smallest distance between any two distinct forecast values is 0.01 or larger, we operate in the discrete setting, and else in the continuous one. The CORP reliability diagrams in Figs. 1–3 also display a new measure of miscalibration (MCB), discussed in detail later on as we introduce the CORP score decomposition.
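In code, the discrete-versus-continuous default just described amounts to a one-line check; the function name is_discrete is our own illustrative choice, and we assume at least two distinct forecast values.

# Default discrete/continuous choice (sketch of the 0.01 rule stated above):
# discrete if no two distinct forecast values are closer than 0.01.
is_discrete <- function(x) min(diff(sort(unique(x)))) >= 0.01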
Bröcker and Smith (30) convincingly advocate the need for uncertainty quantification, so that structural deviations of the estimated CEP from the diagonal can be distinguished from deviations that merely reflect noise. They employ a resampling technique for the binning and counting method in order to find consistency bands under the assumption of calibration. For CORP, we extend this approach in two crucial ways, by generating either consistency or confidence bands, and by using either a resampling technique or asymptotic distribution theory, where we leverage existing theory for nonparametric isotonic regression estimates (18–20).

Consistency bands are generated under the assumption that the probability forecasts are calibrated, and so they are positioned around the diagonal. There is a close relation to the classical interpretation of statistical tests and p-values: Under the hypothesized perfect calibration, how much do reliability diagrams vary, and how (un)likely is the outcome at hand? In contrast, confidence bands cluster around the CORP estimate and follow the classical interpretation of frequentist confidence intervals: If one repeats the experiment numerous times, the fraction of confidence intervals that contain the true CEP approaches the nominal level. The two methods are illustrated in Fig. 3, where the right column (b,d) shows confidence bands, and the left column (a,c) shows consistency bands, as do the CORP reliability diagrams in Figs. 1(b,d,f) and 2(d).

In our adaptation of the resampling approach, for each iteration the resampled CORP reliability diagram is computed, and confidence or consistency bands are then specified by using resampling percentiles, in customary ways. For consistency bands, the resampling is based on the assumption of calibrated original forecast values, whereas PAV-calibrated probabilities are used to generate confidence bands. While resampling works well in small to medium samples, the use of asymptotic theory suits cases where the sample size n of the dataset is large – exactly when the computational cost of resampling based procedures becomes prohibitive. Existing asymptotic theory is readily applicable and operates under weak conditions on the marginal distribution of the forecast values, and (strict) monotonicity and smoothness of (true) CEPs (18–20).

The distinction between discretely and continuously distributed forecasts becomes critical here as the asymptotic theory differs between these cases. For discrete forecasts, results of El Barmi and Mukerjee (18) imply that the difference between the estimated and the true CEP, scaled by n^{1/2}, converges to a (mixture of) normal distribution(s). For continuous forecasts, following Wright (19), the difference between the estimated and the true CEP, magnified by n^{1/3}, converges to Chernoff's distribution (31). The distinct scaling laws imply that the convergence is faster in the discrete than in the continuous case, since in the former the CORP binning stabilizes as it captures the discrete forecast values, and thereafter the number of samples per bin increases linearly, in accordance with the standard n^{-1/2} rate. In either setting, asymptotic consistency and confidence bands can be obtained from quantiles of the asymptotic distributions in customary ways. As a caveat, both resampling and asymptotic techniques operate under the assumption of independent, or at least exchangeable, forecast cases, which may or may not be warranted in practice. We encourage follow-up work in dependent data settings, as recently tackled for related types of data science tools (32).
Figure 4: Empirical coverage, averaged equally over the forecast values, of 90% uncertainty bands for CORP reliability diagrams under default choices for 1000 simulation replicates. The upper row concerns consistency bands, and the lower row confidence bands. The columns correspond to three types of marginal distributions for the forecast values (Uniform, Linear, Beta Mixture), and colors distinguish discrete settings with k = 10, 20, or 50 distinct forecast values and the continuous setting, as described in Appendix A. Different symbols denote reliance of the bands on resampling, discrete, or continuous asymptotic distribution theory.

In our software implementation (29), we use the following default choices. Suppose that the sample size is n and there are k unique forecast values. For consistency bands, we use resampling when n and k are small relative to fixed thresholds, and we rely on asymptotic theory otherwise; in the latter case we employ the discrete asymptotic distribution if n is sufficiently large relative to k, while otherwise we use the continuous one. For confidence bands, the current default uses resampling throughout, as the asymptotic theory depends on the assumption of a true CEP with strictly positive derivative. In the simulation examples in Fig. 3, which are based on n = 1024 observations, this implies the use of resampling in panels (b,c,d) and of discrete asymptotic theory in panel (a). Fig. 4 shows coverage rates of 90% consistency and confidence bands in the simulation settings described in Appendix A, based on the default choices. The coverage rates are generally accurate, or slightly conservative, especially in large samples.
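As an illustration of the resampling variant, the sketch below computes pointwise 90% consistency bands under the hypothesis of calibration, reusing corp_recalibrate from the earlier sketch. The function name consistency_bands, the evaluation grid, the number of replicates B, and the use of linear interpolation to evaluate each resampled diagram are simplified choices of ours, not the package defaults.

# Pointwise 90% consistency bands via resampling: under the hypothesis of
# calibration, outcomes are redrawn as Bernoulli(x_i), the CORP diagram is
# recomputed, and pointwise quantiles are taken across replicates.
consistency_bands <- function(x, B = 1000, level = 0.90, grid = sort(unique(x))) {
  sims <- replicate(B, {
    y_star <- rbinom(length(x), 1, x)  # resampled outcomes, as if calibrated
    rd_b <- corp_recalibrate(x, y_star)
    approx(rd_b$x, rd_b$cep, xout = grid, rule = 2)$y  # evaluate on common grid
  })
  a <- (1 - level) / 2
  data.frame(x = grid,
             lower = apply(sims, 1, quantile, probs = a),
             upper = apply(sims, 1, quantile, probs = 1 - a))
}

rd <- corp_recalibrate(x, y)  # x, y from the earlier sketch
cb <- consistency_bands(x)
plot(rd$x, rd$cep, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Forecast value", ylab = "CEP")
lines(cb$x, cb$lower, lty = 3)
lines(cb$x, cb$upper, lty = 3)
abline(0, 1, lty = 2)

Confidence bands would instead resample outcomes from the PAV-calibrated probabilities, as described above.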
Score decompositions into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components
Scoring rules provide a numerical measure of the quality of a classifier or forecast by assigning a score or penalty S(x, y), based on forecast value x ∈ [0, 1] for a dichotomous event y ∈ {0, 1}. A scoring rule is proper (33) if it assigns the minimal penalty in expectation when x equals the true underlying event probability. If the minimum is unique the scoring rule is strictly proper. In practice, for a given sample (x_1, y_1), . . . , (x_n, y_n) of forecast–realization pairs the empirical score

\[
\bar{S}_X = \frac{1}{n} \sum_{i=1}^{n} S(x_i, y_i)  \qquad [1]
\]

is used for forecast ranking. Table 1 presents examples of proper and strictly proper scoring rules. The Brier score and logarithmic score are strictly proper. In contrast, the misclassification error is proper, but not strictly proper – all that matters is whether or not a classifier probability is on the correct side of 1/2.

Table 1: Scoring rules for probabilistic forecasts of binary events

  Score                     Propriety    Analytic form of S(x, y)
  Brier                     strict       (x − y)^2
  Logarithmic               strict       −y log x − (1 − y) log(1 − x)
  Misclassification error   non-strict   1{x < 1/2, y = 1} + 1{x > 1/2, y = 0} + (1/2) 1{x = 1/2}

Under any proper scoring rule, the mean score S̄_X constitutes a measure of overall predictive performance. For several decades, researchers have been seeking to decompose S̄_X into intuitively appealing components, typically thought of as reliability (REL), resolution (
RES), and uncertainty (UNC) terms. The REL component measures how much the conditional event frequencies deviate from the forecast probabilities, while RES quantifies the ability of the forecasts to discriminate between events and non-events. Finally, UNC measures the inherent difficulty of the prediction problem, but does not depend on the issued forecast under consideration. While there is a consensus on the character and intuitive interpretation of the decomposition terms, their exact form remains subject to debate, despite a half-century quest in the wake of Murphy's (15) Brier score decomposition. In particular, Murphy's decomposition is exact in the discrete case, but fails to be exact under continuous forecasts, which has prompted the development of increasingly complex types of decompositions (16, 17).

Here we adopt the general score decomposition advocated forcefully by Siegert (34), and discussed by various other authors (e.g., 16, 35). Specifically, let S̄_X,

\[
\bar{S}_C = \frac{1}{n} \sum_{i=1}^{n} S(\hat{x}_i, y_i), \qquad \text{and} \qquad \bar{S}_R = \frac{1}{n} \sum_{i=1}^{n} S(r, y_i)  \qquad [2]
\]

denote the mean score for the original forecast values of Eq. [1], the mean score for Calibrated probabilities x̂_1, . . . , x̂_n, and the mean score for a constant Reference forecast r, respectively. Then S̄_X decomposes as

\[
\bar{S}_X = \underbrace{\left( \bar{S}_X - \bar{S}_C \right)}_{\mathrm{MCB}} - \underbrace{\left( \bar{S}_R - \bar{S}_C \right)}_{\mathrm{DSC}} + \underbrace{\bar{S}_R}_{\mathrm{UNC}},  \qquad [3]
\]

where we adopt, in part, terminology proposed by Ehm and Ovcharov (36) and Pohle (37). As defined in Eq. [3], the miscalibration component MCB is the difference of the mean scores of the original and the calibrated forecasts. Similarly, the
DSC component quantifies discrimination ability via the difference between the mean score for the reference and the calibrated forecast, while the classical measure of uncertainty (UNC) is simply the mean score for the reference forecast.

In the extant literature, it has been assumed implicitly or explicitly that the calibrated and reference forecasts can be chosen at researchers' discretion (34, 37). We argue otherwise and contend that the calibrated forecasts ought to be the PAV-(re)calibrated probabilities, as displayed in the CORP reliability diagram, whereas the reference forecast r ought to be the marginal event frequency ȳ = (1/n) Σ_{i=1}^n y_i. We refer to the resulting decomposition as the CORP score decomposition, which enjoys the following properties:

• MCB ≥ 0
• DSC ≥ 0
• The decomposition is exact.

Table 2: CORP Brier score decomposition for the probability of precipitation forecasts in Figs. 1 and 2.

  Forecast   S̄_X    MCB    DSC    UNC
  ENS        .266   .066   .044   .244
  EPC        .234   .022   .032   .244
  EMOS       .232   .018   .030   .244
  Logistic   .206   .017   .056   .244

In particular, the CORP score decomposition never yields counterintuitive negative values of the components, contrary to choices in the extant literature. The cases of vanishing components (
MCB = 0 or
DSC = 0) support the intuitive interpretation of CORP reliability diagrams, in that parts away from the diagonal indicate lack of calibration, whereas extended horizontal segments are indicative of diminished discrimination ability. For refined statements and proofs see Theorem 1 in Appendix C.

If S is the Brier score, then in the special case of discrete forecasts with non-decreasing CEPs, the MCB, DSC, and
UNC terms in Eq. [3] agree with the REL, RES, and UNC components, respectively, in the classical Murphy decomposition, as we demonstrate in Theorem 2 in Appendix C. If S is the misclassification error, MCB equals the fraction of cases in which the PAV-calibrated probability was on the correct side of 1/2, but the original forecast value was not, minus the fraction vice versa, with natural adaptations in the case of ties.

In Table 2 we illustrate the CORP Brier score decomposition for the probability of precipitation forecasts at Niamey in Figs. 1–2. The purely data-driven Logistic forecast obtains the best (smallest) mean score, the best (smallest) MCB term, and the best (highest)
DSC component, well in line with the insights offered by the CORP reliability diagrams, and attesting to the particular challenges for precipitation forecasts over northern tropical Africa (7).

Interestingly, every proper scoring rule admits a representation as a mixture of elementary scoring rules (e.g., 33, Sect. 3.2). Consequently, the
MCB, DSC, and UNC components of the CORP decomposition admit analogous representations as mixtures of the respective components under the elementary scores, whence we may plot Murphy diagrams in the sense of Ehm et al. (38) for the MCB, DSC, and UNC components.
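The following sketch computes the CORP decomposition of Eq. [3] under the Brier score, reusing corp_recalibrate from the first sketch; any other proper scoring rule could be substituted for the brier function, and the function name corp_decomposition is our own illustrative choice.

# CORP score decomposition of Eq. [3]: the calibrated forecasts are the
# PAV-recalibrated probabilities, the reference forecast is the marginal
# event frequency, and here the score is the Brier score.
brier <- function(p, y) mean((p - y)^2)

corp_decomposition <- function(x, y, score = brier) {
  rd <- corp_recalibrate(x, y)              # from the earlier sketch
  x_hat <- rd$cep[match(x, rd$x)]           # PAV-calibrated probability per case
  S_x <- score(x, y)                        # mean score, original forecasts
  S_c <- score(x_hat, y)                    # mean score, calibrated forecasts
  S_r <- score(rep(mean(y), length(y)), y)  # mean score, reference forecast
  c(mean_score = S_x, MCB = S_x - S_c, DSC = S_r - S_c, UNC = S_r)
}

corp_decomposition(x, y)  # exact: mean_score equals MCB - DSC + UNC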
Our paper addresses two long-standing challenges in the evaluation of probabilistic classifiers by developing the CORP reliability diagram that enjoys theoretical guarantees, avoids artifacts, allows for uncertainty quantification, and yields a fully automated choice of the underlying binning, without any need for tuning parameters or implementation choices. The associated CORP decomposition disaggregates the mean score under any proper scoring rule into components that are guaranteed to be non-negative.

Of particular relevance is the remarkable fact that CORP reliability diagrams feature optimality properties in both finite sample and large sample settings. Asymptotically, the PAV-(re)calibrated probabilities, which are plotted in a CORP reliability diagram, minimize estimation error, while in finite samples PAV-calibrated probabilities are optimal in terms of any proper scoring rule, subject to the regularizing constraint of isotonicity.

We believe that the proposals in this paper can serve as a blueprint for the development of novel diagnostic and inference tools for a very wide range of data science methods. For example, the popular Hosmer–Lemeshow goodness-of-fit test (39) for logistic regression is subject to the same types of ad hoc decisions on binning schemes, and hence the same types of instabilities as the binning and counting approach (10, p. 6). Tests based on CORP and the
MCB miscalibration measure are promising candidates for powerful alternatives.

Perhaps surprisingly, the PAV algorithm and its appealing properties generalize from probabilistic classifiers to mean, quantile, and expectile assessments for real-valued outcomes (40). In this light, far-reaching generalizations of the CORP approach apply to binary regression in general, to standard (mean) regression, where they yield a new mean squared error (MSE) decomposition with desirable properties, and to quantile and expectile regression. In all these settings, score decompositions have been studied (37, 41), and we contend that the PAV algorithm ought to be used to generate the Calibrated forecast in the general decomposition in Eq. [3], whereas the Reference forecast ought to be the respective marginal, unconditional event frequency, mean, quantile, or expectile. We leave these extensions to future work and encourage further investigation from theoretical, methodological, and applied perspectives.

Open source code for the implementation of the CORP approach in the R language and environment for statistical computing (42) is available on GitHub (29).
Appendix A: Simulation settings
Here we give details for the simulation scenarios in Figs. 4–5, where we use simple random samples with forecast values drawn from either Uniform, Linear, or Beta Mixture distributions, in either the continuous setting, or discrete settings with k = 10, 20, or 50 unique forecast values. The binary outcomes are drawn under the assumption of calibration, whence the true CEP function coincides with the diagonal.

We begin by describing the continuous setting, where the Uniform distribution has a uniform density, and the Linear distribution a linearly increasing density with ordinate 0.40 at x = 0 and 1.60 at x = 1. The Beta Mixture distribution mixes Beta(1, 10) and Uniform components. In the discrete settings with k unique forecast values we maintain the shape of these distributions, but discretize. Specifically, for j = 1, . . . , k the probabilistic classifier or forecast attains the value x_j = (j − 1/2)/k with probability

\[
p_j = q(x_j) \Big/ \sum_{i=1}^{k} q(x_i),
\]

where q is the density in the continuous case. In Fig. 4, we consider discrete settings with k = 10, 20, and 50 unique forecast values and the continuous case (marked Inf). Fig. 5 uses discrete settings with k = 10 and 50 unique forecast values and the continuous case.
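As an illustration, here is a sketch of one such setting, the Linear forecast distribution with density q(x) = 0.4 + 1.2x; the function name simulate_linear and the inverse-CDF sampler for the continuous case are our own choices.

# One simulation setting (sketch): forecasts from the Linear distribution,
# either discretized to k unique values or continuous; outcomes are drawn
# under calibration, so the true CEP is the diagonal.
simulate_linear <- function(n, k = Inf) {
  q <- function(x) 0.4 + 1.2 * x
  if (is.finite(k)) {
    xj <- (seq_len(k) - 0.5) / k  # k equally spaced unique forecast values
    x <- sample(xj, n, replace = TRUE, prob = q(xj) / sum(q(xj)))
  } else {
    # continuous case: invert the CDF F(x) = 0.4 x + 0.6 x^2
    u <- runif(n)
    x <- (-0.4 + sqrt(0.16 + 2.4 * u)) / 1.2
  }
  y <- rbinom(n, 1, x)  # calibrated binary outcomes
  list(x = x, y = y)
}

dat <- simulate_linear(1024, k = 10)  # one replicate of a discrete setting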
Appendix B: Statistical efficiency of CORP

Suppose that we are given a simple random sample (x_1, y_1), . . . , (x_n, y_n) of predictive probabilities x_1, . . . , x_n ∈ [0, 1] and associated realizations y_1, . . . , y_n ∈ {0, 1} from an underlying population, with the true CEP being non-decreasing.

In the case of discretely distributed forecasts that attain a small number k of distinct values only, results of El Barmi and Mukerjee (18) imply that the mean squared error (MSE) of the estimates in a CORP reliability diagram decays at the standard rate of n^{−1}. If the binning and counting approach separates the distinct forecast values, the traditional reliability diagram and the CORP reliability diagram are asymptotically the same, and so are the respective asymptotic distributions. However, under the CORP approach the unique forecast values are always correctly identified as the sample size increases, while under the binning and counting approach this may or may not be the case, depending on implementation decisions.

Figure 5: Mean squared error (MSE) of the CEP estimates in CORP reliability diagrams for samples of size n, in comparison to the binning and counting approach with m = 5, 10, or 50 fixed bins, or m(n) = [n^α] quantile-spaced bins for three choices of α. Note the log-log scale. The simulation settings are described in Appendix A, and MSE values are averaged over 1000 replicates.

Large sample theory for the continuously distributed case is more involved, and generally assumes that the CEP is differentiable with strictly positive derivative. Asymptotic results of Wright (19) for the variance and of Dai et al. (43) for the bias imply that the MSE of the CORP estimates decays like n^{−2/3}. We now compare to the binning and counting approach, using either m fixed, equidistant bins, or using m = m(n) empirical quantile-dependent bins. For a general sequence of m(n) bins, the magnitudes of the asymptotic variance and squared bias are governed by the most sparsely populated bin, at a disadvantage relative to the quantile-dependent case. The classical reliability diagram relies on a fixed number m of bins, finds the respective bin-averaged event frequencies, and plots them against the bin midpoints or bin-averaged forecast values. Any such approach fails asymptotically, with estimates that are in general biased and inconsistent. More adequately, a flexible number m(n) of bins can be used, with boundaries defined via empirical quantiles of x_1, . . . , x_n. Specifically, m(n) bins can be bracketed by 0, the empirical quantiles at level j/m(n) for j = 1, . . . , m(n) −
1, and 1. Then, for n sufficiently large, each bin covers about n/m(n) data points, and the bin-averaged CEPs converge to the true CEPs at the respective true quantiles with an estimation variance that decays like m(n)/n and a squared bias that decays like m(n)^{−2}. When m(n) is of order n^α for α ∈ (0, 1), this corresponds to a variance that decays like n^{α−1} and a squared bias that decays like n^{−2α}. Consequently, the MSE of the estimates is of order n^β where β = max(α − 1, −2α). The optimal choice of the exponent, α = 1/3, balances the two terms, since α − 1 = −2α at α = 1/3, and results in an MSE of order n^{−2/3}. While this asymptotic rate is the same as under the CORP approach, the CORP reliability diagram is preferable in finite samples, as we now demonstrate.

In Fig. 5 we detail a comparison of CORP reliability diagrams to the binning and counting approach with either a fixed number m of bins, or m = m(n) = [n^α] empirical-quantile dependent bins, where [x] denotes the largest integer less than or equal to x ∈ R. For this, we plot the empirical mean squared error (MSE) of the various CEP estimates against the sample size n, using settings described in Appendix A. Across columns, the distributions of the forecast values differ in shape; across rows, we are in the discrete setting with k = 10 and 50 unique forecast values, and in the continuous setting, respectively. Throughout, the CORP reliability diagrams exhibit the smallest MSE, uniformly over all sample sizes and against all alternative methods, with the superiority being the most pronounced under non-uniform forecast distributions with many unique forecast values, as frequently generated by statistical or machine learning techniques.

Appendix C: Properties of CORP score decomposition
Consider data (x_1, y_1), . . . , (x_n, y_n) in the form of probability forecasts and binary outcomes, so that x_1, . . . , x_n ∈ [0, 1] and y_1, . . . , y_n ∈ {0, 1}. Let ȳ = (1/n) Σ_{i=1}^n y_i be the marginal event frequency, and write x̂_1, . . . , x̂_n for the PAV-(re)calibrated probabilities. Furthermore, let S̄_X, S̄_C, and S̄_R denote the mean scores for the original forecast values, (re)calibrated probabilities, and a reference forecast, as defined in Eqs. [1] and [2]. With the specific choices of the PAV-calibrated probabilities as the (re)calibrated forecasts, and the marginal event frequency ȳ as the reference forecast, the CORP score decomposition in Eq. [3] enjoys the following properties.

Theorem 1

For every proper scoring rule S, every set of forecast values, and every set of binary outcomes, the CORP decomposition satisfies the following:

(a) MCB = S̄_X − S̄_C ≥ 0;
(b) if S is strictly proper, then MCB > 0 unless the original forecast values are calibrated;
(c) DSC = S̄_R − S̄_C ≥ 0;
(d) if S is strictly proper, then DSC > 0 unless the PAV-calibrated probabilities are constant;
(e) the decomposition in Eq. [3] is exact.

Proof
The claims in (a) and (c) rely on the fact that the PAV algorithm generates a calibrated forecast that is no worse than the original forecast in terms of any proper scoring rule (21, Thm. 1.10, 22, 23). If the original forecast is calibrated, the PAV algorithm leaves it unchanged; if the PAV algorithm generates a constant forecast, the constant equals the marginal event frequency ȳ.

The statements in (b) and (d) follow from the equivalence of (i) and (iii) in Theorem 2.11 in ref. (44) in concert with Theorem 3 in ref. (45). Finally, the claim in (e) is immediate from the definition of the decomposition. □

In the discrete setting we assume that the unique forecast values z_1 < · · · < z_k are issued n_1, . . . , n_k times, with o_1, . . . , o_k of these cases being events, so that n_1 + · · · + n_k = n and o_1 + · · · + o_k = nȳ. We denote the respective PAV-calibrated probabilities by ẑ_1 ≤ · · · ≤ ẑ_k. The classical Brier score decomposition under our choice of the PAV-calibrated forecast as the calibrated forecast, and ȳ as the reference forecast, then becomes

\[
\bar{S}_X = \underbrace{\frac{1}{n} \sum_{j=1}^{k} n_j \left( \frac{o_j}{n_j} - z_j \right)^2}_{\mathrm{REL}} - \underbrace{\frac{1}{n} \sum_{j=1}^{k} n_j \left( \frac{o_j}{n_j} - \bar{y} \right)^2}_{\mathrm{RES}} + \underbrace{\bar{y} (1 - \bar{y})}_{\mathrm{UNC}},
\]

where the UNC component is the same as in the CORP decomposition in Eq. [3]. Furthermore, subject to mild conditions, the decompositions agree in full.
Theorem 2
Under the Brier score, if the sequence o_1/n_1, . . . , o_k/n_k is nondecreasing, then MCB = REL and DSC = RES, respectively.
Proof
As the sequence o_1/n_1, . . . , o_k/n_k is nondecreasing, the PAV-calibrated probabilities satisfy ẑ_j = o_j/n_j for j = 1, . . . , k. Adopting the arguments in the Appendix of ref. (34), we see that MCB = S̄_X − S̄_C = REL and DSC = RES. □
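As a quick numerical illustration of Theorem 2 (a sketch, reusing pava, corp_recalibrate, and corp_decomposition from the sketches in the main text), one can verify that MCB and DSC coincide with Murphy's REL and RES whenever the bin-wise event frequencies happen to be nondecreasing:

# Numerical check of Theorem 2: when the bin-wise event frequencies
# o_j / n_j are nondecreasing, MCB = REL and DSC = RES under the Brier score.
set.seed(2)
z <- (1:5 - 0.5) / 5                # unique forecast values z_1 < ... < z_k
x <- sample(z, 400, replace = TRUE)
y <- rbinom(400, 1, x)              # calibrated outcomes
f <- factor(x, levels = z)
n_j <- as.numeric(table(f))
o_j <- as.numeric(tapply(y, f, sum))
stopifnot(!is.unsorted(o_j / n_j))  # premise of Theorem 2; may fail by chance
REL <- sum(n_j * (o_j / n_j - z)^2) / length(x)
RES <- sum(n_j * (o_j / n_j - mean(y))^2) / length(x)
corp_decomposition(x, y)[c("MCB", "DSC")] - c(REL, RES)  # both differences ~ 0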
Data availability

The probability of precipitation forecast data at Niamey, Niger are from the paper by Vogel et al. (7, Fig. 2), where the original data sources are acknowledged. Reproduction materials, including data and code in the R software environment (42), are available on GitHub (29, 46).
References
1. D. J. Spiegelhalter, Probabilistic prediction in patient management and clinical trials. Stat. Med., 421–433 (1986).
2. A. H. Murphy, R. L. Winkler, Diagnostic verification of probability forecasts. Int. J. Forecasting, 435–455 (1992).
3. P. A. Flach, "Classifier calibration", in Encyclopedia of Machine Learning and Data Mining (Springer, 2016).
4. C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, "On calibration of modern neural networks," Proc. 34th Int. Conf. Mach. Learn. (2017).
5. J. Bröcker, Some remarks on the reliability of categorical probability forecasts. Mon. Wea. Rev., 4488–4502 (2008).
6. ECMWF Directorate, Describing ECMWF's forecasts and forecasting system. ECMWF Newsl., 11–13 (2012).
7. P. Vogel, P. Knippertz, T. Gneiting, A. H. Fink, M. Klar, A. Schlueter, Statistical forecasts for the occurrence of precipitation outperform global models over northern tropical Africa. Preprint, https://doi.org/10.1002/essoar.10502501.1 (2020).
8. V. Stodden, M. McNutt, D. H. Bailey, E. Deelman, Y. Gil, B. Hanson, M. A. Heroux, J. P. A. Ioannidis, M. Taufer, Enhancing reproducibility for computational methods. Science, 1240–1241 (2016).
9. B. Yu, K. Kumbier, Veridical data science. Proc. Natl. Acad. Sci. U.S.A., 3920–3929 (2020).
10. P. D. Allison, Measures of fit for logistic regression. Paper 1485-2014, SAS Global Forum, Washington DC (2014).
11. A. Kumar, P. Liang, T. Ma, "Verified uncertainty calibration," Proc. 33rd Conf. Neur. Inform. Process. Syst. (NeurIPS) (2019).
12. J. B. Copas, Plotting p against x. Appl. Stat., 25–31 (1983).
13. F. Atger, Estimation of the reliability of ensemble-based probabilistic forecasts. Q. J. R. Meteorol. Soc., 627–646 (2004).
14. G. W. Brier, Verification of forecasts expressed in terms of probability. Mon. Wea. Rev., 1–3 (1950).
15. A. H. Murphy, A new vector partition of the probability score. J. Appl. Meteorol., 595–600 (1973).
16. M. Kull, P. Flach, "Novel decompositions of proper scoring rules for classification: Score adjustment as precursor to calibration", ECML PKDD (2015).
17. D. B. Stephenson, C. A. S. Coelho, I. T. Jolliffe, Two extra components in the Brier score decomposition. Wea. Forecasting, 752–757 (2008).
18. H. El Barmi, H. Mukerjee, Inferences under a stochastic ordering constraint. J. Am. Stat. Assoc., 252–261 (2005).
19. F. T. Wright, The asymptotic behavior of monotone regression estimates. Ann. Stat., 443–448 (1981).
20. A. Mösching, L. Dümbgen, Monotone least squares and isotonic quantiles. El. J. Stat., 24–49 (2020).
21. R. E. Barlow, D. J. Bartholomew, J. M. Bremner, H. D. Brunk, Statistical Inference under Order Restrictions (Wiley, 1972).
22. T. Fawcett, A. Niculescu-Mizil, PAV and the ROC convex hull. Mach. Learn., 97–106 (2007).
23. N. Brümmer, J. du Preez, The PAV algorithm optimizes binary proper scoring rules. Preprint, arXiv:1304.2331 (2013).
24. M. Ayer, H. D. Brunk, G. M. Ewing, W. T. Reid, E. Silverman, An empirical distribution function for sampling with incomplete information. Ann. Math. Stat., 641–647 (1955).
25. J. de Leeuw, K. Hornik, P. Mair, Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. J. Stat. Softw. (2009).
26. P. Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data (Cambridge University Press, 2012).
27. T. M. Hamill, R. Hagedorn, J. S. Whitaker, Probabilistic forecast calibration using ECMWF and GFS ensemble reforecasts. Mon. Wea. Rev., 2620–2632 (2008).
28. D. Freedman, P. Diaconis, On the histogram as a density estimator: L_2 theory. Z. Wahrscheinlichkeitsth. Verw. Geb., 453–476 (1981).
29. T. Dimitriadis, A. I. Jordan, R package 'reliabilitydiag', available at https://github.com/aijordan/reliabilitydiag (2020).
30. J. Bröcker, L. A. Smith, Increasing the reliability of reliability diagrams. Wea. Forecasting, 651–661 (2007).
31. P. Groeneboom, J. A. Wellner, Computing Chernoff's distribution. J. Comput. Graph. Stat., 388–400 (2001).
32. J. Bröcker, Z. Ben Bouallègue, Stratified rank histograms for ensemble forecast verification under serial dependence. Q. J. R. Meteorol. Soc., in press (2020).
33. T. Gneiting, A. E. Raftery, Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc., 359–379 (2007).
34. S. Siegert, Simplifying and generalising Murphy's Brier score decomposition. Q. J. R. Meteorol. Soc., 1178–1183 (2017).
35. J. Bröcker, Reliability, sufficiency, and the decomposition of proper scores. Q. J. R. Meteorol. Soc., 1512–1519 (2009).
36. W. Ehm, E. Y. Ovcharov, Bias-corrected score decomposition for generalized quantiles. Biometrika, 473–480 (2017).
37. M.-O. Pohle, The Murphy decomposition and the calibration–resolution principle: A new perspective on forecast evaluation. Preprint, arXiv:2005.01835 (2020).
38. W. Ehm, T. Gneiting, A. Jordan, F. Krüger, Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings (with discussion).
J. R. Stat. Soc. Ser. B, 505–562 (2016).
39. D. W. Hosmer, S. Lemeshow, Goodness-of-fit tests for the multiple logistic regression model. Commun. Stat. A, 1043–1069 (1980).
40. A. I. Jordan, A. Mühlemann, J. F. Ziegel, Optimal solutions to the isotonic regression problem. Preprint, arXiv:1904.04761 (2019).
41. S. Bentzien, P. Friederichs, Decomposition and graphical portrayal of the quantile score. Q. J. R. Meteorol. Soc.
42. R Core Team, R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria).
43. R. Dai, H. Song, R. F. Barber, G. Raskutti, The bias of isotonic regression. El. J. Stat., 801–834 (2020).
44. T. Gneiting, R. Ranjan, Combining predictive distributions. El. J. Stat., 1747–1782 (2013).
45. H. Holzmann, M. Eulert, The role of the information set for forecasting – with applications to risk management. Ann. Appl. Stat., 595–621 (2014).
46. T. Dimitriadis, A. I. Jordan, Replication material, available at https://github.com/TimoDimi/replication_DGJ20 (2020).