Evaluation of point forecasts for extreme events using consistent scoring functions

Robert J. Taggart, Bureau of Meteorology

February 2, 2021
Abstract
We present a method for comparing point forecasts in a region of interest, such as the tails or centre of a variable's range. This method cannot be hedged, in contrast to conditionally selecting events to evaluate and then using a scoring function that would have been consistent (or proper) prior to event selection. Our method also gives decompositions of scoring functions that are consistent for the mean or a particular quantile or expectile. Each member of each decomposition is itself a consistent scoring function that emphasises performance over a selected region of the variable's range. The score of each member of the decomposition has a natural interpretation rooted in optimal decision theory. It is the weighted average of economic regret over user decision thresholds, where the weight emphasises those decision thresholds in the corresponding region of interest.
Keywords:
Consistent scoring function; Decision theory; Forecast ranking; Forecast verification; Point forecast; Proper scoring rule; Rare and extreme events.
Extreme events occur in many systems, from atmospheric to economic, and present significant challenges to society. Hence the accurate prediction of extreme events is of vital importance. In many such situations, competing forecasts are produced by a variety of forecast systems and it is natural to want to compare the performance of such forecasts with emphasis on the extremes.

In this context, it is critical that the methodology for requesting, evaluating and ranking competing forecasting systems is decision-theoretically coherent. Since the future is not precisely known, ideally forecasts ought to be probabilistic in nature, taking the form of a predictive probability distribution and assessed using a proper scoring rule (Gneiting and Katzfuss 2014). Nonetheless, in many contexts and for a variety of reasons, point forecasts (i.e. single-valued forecasts taking values from the prediction space) are issued and used. For decision-theoretically coherent evaluation of point forecasts, either the scoring function (such as the squared error or absolute error scoring function) that will be used to assess predictive performance should be advertised in advance, or a specific functional (such as the mean or median) of the forecaster's predictive distribution should be requested and evaluation then conducted using a scoring function that is consistent for that functional (Gneiting 2011a). The use of proper scoring rules and consistent scoring functions encourages forecasters to quote honest, carefully considered forecasts.

To compare competing forecasts for the extremes, one may be tempted to use what would otherwise be a proper scoring rule or consistent scoring function, but restrict evaluation to a subset of events for which extremes were either observed, or forecast, or perhaps both. However, such methodologies promote hedging strategies (as illustrated in Section 2) and can result in misguided inferences. This gives rise to the forecaster's dilemma, whereby "if forecast evaluation proceeds conditionally on a catastrophic event having been observed [then] always predicting a calamity becomes a worthwhile strategy" (Lerch et al. 2017).

Nonetheless, for predictive distributions Gneiting and Ranjan (2011) showed that a suitable alternative exists in the threshold-weighted continuous ranked probability score, which is a weighted version of the commonly used continuous ranked probability score (CRPS). The weight is selected to emphasise the region of interest (such as the tails or some other region of a variable's range) and induces a proper scoring rule. This technique extends in a natural way to point forecasts targeting the median functional, since the CRPS is a generalisation of the absolute error scoring function. The UK Met Office has recently applied this method to report the performance of temperature forecasts, with emphasis on climatological extremes (Sharpe et al. 2020).

Despite this progress, Lerch et al. (2017) offer this summary of the general situation for evaluating point forecasts at the extremes: "there is no obvious way to abate the forecaster's dilemma by adapting existing forecast evaluation methods appropriately, such that particular emphasis can be put on extreme outcomes." In this paper we remedy this situation. We construct consistent scoring functions for evaluating point forecasts that emphasise performance in the region of interest, for important classes of functionals including expectations and quantiles. Moreover, the relevant specific case of these constructions agrees with the extension of the threshold-weighted CRPS to the median functional.

The main idea of this paper can be illustrated using the squared error scoring function S(x, y) = (x − y)², which is consistent for point forecasts of the expectation functional.
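Consistency here has a concrete operational meaning that can be checked by simulation: if the advertised score is squared error, then quoting the mean of one's predictive distribution minimises one's expected score. A minimal sketch (the gamma predictive distribution and the search grid are illustrative assumptions, not part of the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative predictive distribution for a nonnegative quantity:
# a gamma distribution with mean shape * scale = 6.
outcomes = rng.gamma(shape=2.0, scale=3.0, size=100_000)

def expected_squared_error(x, y):
    """Monte Carlo estimate of E[S(x, Y)] for S(x, y) = (x - y)^2."""
    return np.mean((x - y) ** 2)

candidates = np.linspace(0.0, 15.0, 301)
scores = [expected_squared_error(x, outcomes) for x in candidates]
best = candidates[int(np.argmin(scores))]

# The minimiser coincides (up to grid resolution) with the predictive mean,
# so an honest forecaster quotes the mean of their predictive distribution.
print(best, outcomes.mean())
```

The same experiment with the absolute error scoring function would instead select the predictive median, which is why the directive (functional or scoring function) must be announced in advance.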
Suppose that the outcome space R is partitioned as R = I₁ ∪ I₂, where I₁ = (−∞, a) and I₂ = [a, ∞) for some a in R. Corollary 3.2 gives the decomposition S = S₁ + S₂, where each scoring function S_i is consistent for expectations whilst also emphasising predictive performance on the interval I_i. In particular, if x, y ∈ I₁ then S₂(x, y) = 0. Given a point forecast x and corresponding observation y, the explicit formula for S₂ is

S₂(x, y) = (y − a)² 1{y ≥ a} − (x − a)² 1{x ≥ a} − 2(y − x)(x − a) 1{x ≥ a}.    (1.1)

Here 1 denotes the indicator function, so that 1{x ≥ a} equals 1 if x ≥ a and 0 otherwise.

The performance of competing point forecasts for expectations can then be compared by computing the mean scores S̄, S̄₁ and S̄₂ for each forecast system, over the same set of events. To illustrate, consider two forecast systems A and B, whose error characteristics are depicted by the scatter plot of Figure 1. System B is homoscedastic (i.e., has even scatter about the diagonal line throughout the variable's range) whereas System A is heteroscedastic (with relatively small errors over lower parts of the variable's range and relatively large errors over higher parts). For this set of events, the mean squared error S̄ for each system is very similar and there is no statistical significance in their difference. However, with a = 10, the mean score S̄₂ of System A is significantly higher than that of B (i.e., B is superior when emphasis is placed on the region [10, ∞)) whilst the mean score S̄₁ of System A is significantly lower than that of B (i.e., A is superior when emphasis is placed on the region (−∞, 10)).

Figure 1: Scatter plot of forecasts against observations for a random sample of events from the Synthetic data example of Section 3.5.

More generally, we present decompositions

S = Σ_{i=1}^n S_i    (1.2)

of a given consistent scoring function S, whose summands S_i are also consistent for the functional of concern and with each S_i emphasising forecast performance in the relevant region of the outcome space. Such decompositions are presented for the consistent scoring functions of quantiles, expectations, expectiles, and Huber means. The approach is unified, in that the same decomposition of the outcome space induces decompositions of the CRPS and of a variety of scoring functions for point forecasts. Details are given in Section 3 with illustrative examples.

Mathematically, the main result of this paper is a corollary of the mixture representation theorems for the consistent scoring functions of quantiles, expectiles (Ehm et al. 2016) and Huber functionals (Taggart 2020). This furnishes each member S_i of the decomposition (1.2) with an interpretation rooted in optimal decision theory. Certain classes of optimal decision rules elicit a user action if and only if a point forecast x exceeds a particular decision threshold θ. The score S_i(x, y) is a measure of economic regret, relative to decisions based on a perfect forecast, of using the point forecast x when the observation y realises, averaged over all decision thresholds θ belonging to the corresponding interval I_i of the partition. Precise details are given in Section 4 and illustrated with the aid of Murphy diagrams.

We illustrate how a seemingly natural approach for comparing predictive performance at the extremes creates opportunities for hedging strategies.

A meteorological agency maintains two forecast systems A and B, each of which produces predictive distributions for hourly rainfall Y at a particular location. System A is fully automated and hence cheaper to support than B. System B has knowledge of the System A forecast prior to issuing its own forecast. The agency requests the mean value of their predictive distributions with a lead time of 1 day.
These point forecasts are assessed using the squared error scoring function, which is consistent for forecasts of the mean (i.e., the forecast strategy that minimises one's expected score is to issue the mean value of one's predictive distribution). For a two year period, the bulk of observed and forecast values fall in the interval [0 mm, 10 mm] and there is no statistically significant difference between the performance of the two systems when scored using the squared error function.

However, the maintainers of System B claim that B performs better for the extremes and that bulk verification statistics fail to highlight this. The agency decides to use forecasts from A for the majority of cases, but will test whether B is significantly better than A at forecasting heavy rainfall events. The agency considers four options for selecting hourly events to assess, after which the squared error scoring function will be used to compare predictive performance on those events. If System B does not perform significantly better than A over the next 12 months then it will be decommissioned.

For a given forecast case, let F_A and F_B denote the predictive distributions produced by each system, and x_A and x_B the respective point forecasts issued. Write Y ∼ F_B to indicate that a random variable Y has a distribution specified by F_B. For each option, the maintainers of System B have a strategy to optimise their expected score; that is, to choose x_B such that E[(x_B − Y)²] is minimised, where Y ∼ F_B.

Option 1: Only assess events where the observation is at least 20 mm.
Strategy: x_B = max(20, Mean(F_B)).

Option 2: Only assess events where either x_A or x_B is at least 20 mm.
Strategy: If max(x_A, Mean(F_B)) ≥ 20 then x_B = Mean(F_B). Otherwise, if E[(20 − Y)²] < E[(x_A − Y)²], where Y ∼ F_B, then x_B = 20, else x_B = Mean(F_B). This ensures that the event is assessed whenever System B expects that a forecast of 20 will receive a more favourable score than a forecast of x_A.

Option 3: Only assess events where either x_A, x_B or the observation is at least 20 mm.
Strategy: If max(x_A, Mean(F_B)) ≥ 20 then x_B = Mean(F_B). Else if E[(20 − Y)²] < E[(x_A − Y)²], where Y ∼ F_B, then x_B = 20. Otherwise, the only other way the event will be assessed is if the observation is at least 20 mm, so forecast x_B = 19.9.

Option 4: Only assess events where x_A is at least 20 mm.
Strategy: x_B = Mean(F_B).

Of these, only Option 4 does not expose the agency to a hedging strategy from System B. However, under Option 4 the developers of A may employ the strategy of forecasting x_A = min(19.9, Mean(F_A)), so that no further comparison of systems is made.

There is an Option 5: use a scoring function that is consistent for the mean functional and that emphasises performance at the extremes. We turn attention to this now.

We work in a setting where point forecasts x and observations y belong to some interval I of the real line R (possibly with I = R). In Section 3.1 we introduce partitions of unity, which are used to 'subdivide' the outcome space I into subregions of interest. We then illustrate in Section 3.2 how such subdivisions induce decompositions of the CRPS, where each member of the decomposition emphasises performance on the corresponding subregion of I whilst retaining propriety. To obtain analogous decompositions for consistent scoring functions, we recall in Section 3.3 that the consistent scoring functions for quantiles and expectations (among others) have general mathematical forms. The aim is to find which specific instances of the general form emphasise performance on the subregions specified by the partition of unity. This is answered in Section 3.4. Section 3.5 contains examples of such decompositions, and opens with a worked example showing how to find the formula for the squared error decomposition of Equation (1.1).

Recall that the support of a function χ: I → R, denoted supp(χ), is defined by supp(χ) = {t ∈ I : χ(t) ≠ 0}.
Figure 2: Two partitions {χ_j}_{j=1}^n of unity on an interval: a rectangular partition and a trapezoidal partition.

We say that {χ_j}_{j=1}^n is a partition of unity on I if {χ_j}_{j=1}^n is a finite set of piecewise continuous functions χ_j: I → R such that

0 ≤ χ_j(t) ≤ 1  and  Σ_{j=1}^n χ_j(t) = 1

whenever t ∈ I. We will call each χ_j a bump function. We note that these differ from typical definitions in that we do not require χ_j to be continuous or have bounded support.

Figure 2 illustrates two different partitions of unity on an interval I. The rectangular partition of unity consists of rectangular bump functions, each of the form

χ_j(t) = 1 if a ≤ t < b, and 0 otherwise,    (3.1)

for suitable constants satisfying a < b, both depending on j. The trapezoidal partition of unity consists of trapezoidal bump functions, each typically having the form

χ_j(t) = (t − a)/(b − a) if a ≤ t < b;  1 if b ≤ t < c;  (d − t)/(d − c) if c ≤ t < d;  0 otherwise,    (3.2)

for suitable constants satisfying a < b < c < d, all depending on j, with appropriate modification for the end cases.

More generally, if {ψ_j}_{j=1}^n is a set of piecewise continuous nonnegative functions with the property that Σ_{j=1}^n ψ_j(t) > 0 for all t ∈ I, then a partition of unity {χ_j}_{j=1}^n can be constructed by defining

χ_j(t) = ψ_j(t) (Σ_{k=1}^n ψ_k(t))^{−1}.

3.2 Decomposition of the CRPS

Each partition {χ_j}_{j=1}^n of unity induces a corresponding decomposition of the CRPS. Given a predictive distribution F, expressed as a cumulative distribution function, and a corresponding observation y, the CRPS is defined by

CRPS(F, y) = ∫_I (F(z) − 1{y ≤ z})² dz

and for each χ_j the threshold-weighted CRPS by

CRPS_j(F, y) = ∫_I (F(z) − 1{y ≤ z})² χ_j(z) dz.

Both are proper scoring rules (Gneiting and Ranjan 2011).
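These integrals are straightforward to approximate numerically. A sketch with an illustrative logistic predictive distribution and a simple two-member partition of unity (all specific choices here are assumptions for the demonstration, not taken from the paper):

```python
import numpy as np

def logistic_cdf(z, mu=2.0, s=1.0):
    """CDF of an illustrative logistic predictive distribution F."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(z, dtype=float) - mu) / s))

def tw_crps(cdf, y, weight, grid):
    """Threshold-weighted CRPS of Gneiting and Ranjan (2011),
    approximated by a Riemann sum on a finite, uniform grid."""
    integrand = (cdf(grid) - (y <= grid)) ** 2 * weight(grid)
    return float(np.sum(integrand) * (grid[1] - grid[0]))

# A two-member partition of unity: chi2 ramps from 0 to 1 over [0, 2].
chi2 = lambda t: np.clip(np.asarray(t, dtype=float) / 2.0, 0.0, 1.0)
chi1 = lambda t: 1.0 - chi2(t)

grid = np.linspace(-30.0, 30.0, 60001)
y = 1.3
crps = tw_crps(logistic_cdf, y, lambda t: np.ones_like(np.asarray(t, dtype=float)), grid)
crps1 = tw_crps(logistic_cdf, y, chi1, grid)
crps2 = tw_crps(logistic_cdf, y, chi2, grid)

# Because chi1 + chi2 = 1, the weighted components sum to the unweighted CRPS.
print(crps, crps1 + crps2)
```

Since the weights sum to one pointwise, the additivity of the components is exact up to floating-point roundoff, whatever predictive distribution is used.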
Thus the CRPS has a decomposition

CRPS = Σ_{j=1}^n CRPS_j,

where each component CRPS_j is proper and emphasises performance in the region determined by the weight χ_j. The Sydney rainfall forecasts example of Section 3.5 illustrates the application of such a decomposition. We now establish analogous decompositions for a wide range of scoring functions.

For decision-theoretically coherent point forecasting, forecasters need a directive in the form of a statistical functional (Gneiting 2011a) or a scoring function which should be minimised (Patton 2020). A statistical functional T is a (potentially set-valued) mapping from a class of probability distributions F to the real line R. Examples include the mean (or expectation) functional, quantiles and expectiles (Newey and Powell 1987), the latter recently attracting interest in risk management (Bellini and Di Bernardino 2017). A consistent scoring function is a special case of a proper scoring rule in the context of point forecasts, and rewards forecasters who make careful, honest forecasts.

Definition 3.1. (Gneiting 2011a) A scoring function S: I × I → [0, ∞) is consistent for the functional T relative to a class F of probability distributions if

E S(t, Y) ≤ E S(x, Y), whenever Y ∼ F,    (3.3)

for all probability distributions F ∈ F, all t ∈ T(F) and all x ∈ I. The function S is strictly consistent if S is consistent and if equality in Equation (3.3) implies that x ∈ T(F).

The consistent scoring functions for many commonly used statistical functionals have general forms. Given g: I → R and α ∈ (0, 1), define the function QSF(g, α): I × I → R by

QSF(g, α)(x, y) = (1{y < x} − α)(g(x) − g(y)) for all x, y ∈ I.    (3.4)

The name QSF is justified because, subject to slight regularity conditions, a scoring function S is consistent for the α-quantile functional if and only if S = QSF(g, α) where g is nondecreasing (Gneiting 2011b,a; Thomson 1979). The absolute error scoring function S(x, y) = |x − y| for the median functional arises from Equation (3.4) when g(t) = 2t and α = 1/2. The commonly used α-quantile scoring function

S(x, y) = (1{y < x} − α)(x − y)    (3.5)

arises when g(t) = t.

Given φ: I → R with subderivative φ′ and α ∈ (0, 1), define the function ESF(φ, α): I × I → R by

ESF(φ, α)(x, y) = |1{y < x} − α| (φ(y) − φ(x) − φ′(x)(y − x)) for all x, y ∈ I.    (3.6)

Subject to weak regularity conditions, a scoring function S is consistent for the α-expectile functional if and only if S = ESF(φ, α) where φ is convex (Gneiting 2011a; Savage 1971). The expectation (or mean) functional corresponds to the special case α = 1/2, with the squared error scoring function S(x, y) = (x − y)² for expectations arising from Equation (3.6) when φ(t) = 2t² and α = 1/2. A special case of the squared error scoring function is the Brier score, where I = [0, 1] and observations typically take values in {0, 1}.

Given φ: I → R with subderivative φ′ and ν > 0, define the function HSF(φ, ν): I × I → R by

HSF(φ, ν)(x, y) = (1/2) (φ(y) − φ(κ_ν(x − y) + y) + κ_ν(x − y) φ′(x)) for all x, y ∈ I,    (3.7)

where κ_ν is the 'capping' function defined by κ_ν(x) = max(−ν, min(x, ν)). Subject to slight regularity conditions, a scoring function S is consistent for the Huber mean functional (with tuning parameter ν) if and only if S = HSF(φ, ν) where φ is convex (Taggart 2020). The Huber loss scoring function

S(x, y) = (x − y)² if |x − y| ≤ ν, and 2ν|x − y| − ν² if |x − y| > ν,

arises from HSF(φ, ν) when φ(t) = 2t², and is used by the Bureau of Meteorology to score forecasts of various parameters. The Huber mean is an intermediary between the median and the mean functionals, and is a robust measure of the centre of a distribution (Huber 1964; Taggart 2020).

We now state the main result of this paper, namely that the consistent scoring functions for the quantile, expectile and Huber mean functionals can be written as a sum of consistent scoring functions with respect to the chosen partition of unity. It is presented as a corollary since it follows from the mixture representation theorems of Ehm et al. (2016) and Taggart (2020).
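The general forms QSF and ESF above translate directly into code. The sketch below (function names and the choice of test points are ours) recovers the absolute error, squared error and standard quantile scoring functions as the special cases just noted:

```python
def qsf(g, alpha):
    """QSF(g, alpha) of Equation (3.4): consistent for the alpha-quantile."""
    return lambda x, y: (float(y < x) - alpha) * (g(x) - g(y))

def esf(phi, dphi, alpha):
    """ESF(phi, alpha) of Equation (3.6): consistent for the alpha-expectile.
    dphi is a (sub)derivative of phi, supplied by the caller."""
    return lambda x, y: abs(float(y < x) - alpha) * (phi(y) - phi(x) - dphi(x) * (y - x))

# Special cases noted above:
abs_error = qsf(lambda t: 2.0 * t, 0.5)                       # |x - y|
sq_error = esf(lambda t: 2.0 * t**2, lambda t: 4.0 * t, 0.5)  # (x - y)^2
quantile_25 = qsf(lambda t: t, 0.25)                          # Equation (3.5)

print(abs_error(3.0, 5.0), sq_error(3.0, 5.0), quantile_25(3.0, 5.0))
```

For the forecast–observation pair (3, 5) these return 2.0, 4.0 and 0.5 respectively, matching |x − y|, (x − y)² and the 0.25-quantile score 0.25(y − x) for y ≥ x.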
Corollary 3.2.
Suppose that {χ_j}_{j=1}^n is a partition of unity on I. For each j in {1, ..., n}, fix any points u_j and v_j in I.

(a) If g: I → R is nondecreasing and α ∈ (0, 1), then

QSF(g, α) = Σ_{j=1}^n QSF(g_j, α),

where g_j is nondecreasing and defined by

g_j(u) = ∫_{u_j}^{u} χ_j(θ) g′(θ) dθ.    (3.8)

Moreover, if I₀ ⊂ I is an interval and supp(χ_j) ∩ I₀ = ∅ then QSF(g_j, α) = 0 on I₀ × I₀.

(b) If φ: I → R is convex, α ∈ (0, 1) and ν > 0, then

ESF(φ, α) = Σ_{j=1}^n ESF(φ_j, α)    (3.9)

and

HSF(φ, ν) = Σ_{j=1}^n HSF(φ_j, ν),

where φ_j is convex and defined by

φ_j(u) = ∫_{u_j}^{u} ∫_{v_j}^{v} χ_j(θ) φ″(θ) dθ dv.    (3.10)

Moreover, if I₀ ⊂ I is an interval and supp(χ_j) ∩ I₀ = ∅ then ESF(φ_j, α) = HSF(φ_j, ν) = 0 on I₀ × I₀.

See Appendix A for the proof. Appendix B states closed-form expressions for g_j and φ_j for commonly used scoring functions when the partition of unity is rectangular or trapezoidal. We note that the natural analogue of Corollary 3.2 for the consistent scoring functions of Huber functionals (which are generalised Huber means) also holds. Also, the condition that χ_j is piecewise continuous can be weakened to χ_j being Lebesgue integrable over finite intervals.

One can show that QSF(g_j, α), ESF(φ_j, α) and HSF(φ_j, ν) are independent of the choice of points u_j and v_j in I. In practice, the choice of u_j and v_j may be determined by computational convenience, such as selecting (if it exists) the minimum of supp(χ_j).

Finally, we discuss strict consistency. Suppose that QSF(g, α), ESF(φ, α) and HSF(φ, ν) are strictly consistent for the quantile, expectile or Huber mean functionals respectively for some class F of probability distributions. This occurs when g is strictly increasing, or when φ is strictly convex, perhaps subject to mild regularity conditions on F (Gneiting 2011a; Taggart 2020).
If χ_j is strictly positive on I then QSF(g_j, α), ESF(φ_j, α) and HSF(φ_j, ν) are also strictly consistent for their respective functionals. An example of such a partition {χ₁, χ₂} of unity on R is given by

χ₁(t) = 1/2 + (1/π) arctan(t − a)

and χ₂(t) = 1 − χ₁(t). Here, χ₁ induces scoring functions S₁ that emphasise performance on (a, ∞) but do not completely ignore performance on any subinterval of R.

We begin by demonstrating how to find explicit formulae for a particular decomposition.
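Alongside the analytic derivations that follow, the decomposition can be checked numerically. The sketch below implements the two components of the squared error decomposition of Equation (1.1) for a rectangular partition (the threshold a = 10 and the sampled ranges are arbitrary choices for the check):

```python
import numpy as np

def s2(x, y, a):
    """Component of squared error emphasising [a, inf), per Equation (1.1)."""
    return ((y - a)**2 * (y >= a) - (x - a)**2 * (x >= a)
            - 2.0 * (y - x) * (x - a) * (x >= a))

def s1(x, y, a):
    """Complementary component emphasising (-inf, a)."""
    return (x - y)**2 - s2(x, y, a)

a = 10.0
rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 20.0, 1000)
ys = rng.uniform(0.0, 20.0, 1000)

# The components sum to squared error, are nonnegative (each is a consistent
# scoring function), and S2 vanishes when forecast and observation both lie
# below the threshold a.
assert np.allclose(s1(xs, ys, a) + s2(xs, ys, a), (xs - ys)**2)
assert (s2(xs, ys, a) >= -1e-12).all() and (s1(xs, ys, a) >= -1e-12).all()
assert s2(5.0, 7.0, a) == 0.0 and s1(5.0, 7.0, a) == (5.0 - 7.0)**2
print("decomposition checks passed")
```

Because each component is itself consistent for the mean, a forecaster cannot improve its expected value of either component by misquoting the predictive mean, which is exactly the anti-hedging property motivating the construction.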
Example (Squared error decomposition). Let S denote the scoring function S(x, y) = (x − y)². Note that S = ESF(φ, 1/2) with φ(t) = 2t² via Equation (3.6), so that Corollary 3.2 applies. We use a simple rectangular partition {χ₁, χ₂} of unity, where

χ₁(t) = 1 if t < a, and 0 if t ≥ a,

and χ₂(t) = 1 − χ₁(t). Corollary 3.2 gives the corresponding decomposition S = S₁ + S₂, where S_i = ESF(φ_i, 1/2). To find S₂, we compute the function φ₂ using Equation (3.10). Integrating twice with the choice u₂ = v₂ = a gives

φ₂(u) = 0 if u < a, and 2(u − a)² if u ≥ a; that is, φ₂(u) = 2(u − a)² 1{u ≥ a}.

Thus S₂ = ESF(φ₂, 1/2), which yields the formula of Equation (1.1). The formula for S₁ follows easily from the identity S₁ = S − S₂.

Figure 3: Decomposition S = S₁ + S₂ of the squared error scoring function, using the rectangular and trapezoidal partitions. The solid line represents S, the dotted line S₁ and the dashed line S₂ for different forecast–observation pairs. In each case, S₁ emphasises performance on the interval (0, 3) while S₂ emphasises performance above 3.

Figure 3 illustrates the decomposition S = S₁ + S₂ with respect to the rectangular partition {χ₁, χ₂} of unity with a = 3, and also with respect to a trapezoidal partition. For each forecast x and observation y, the solid line represents the score S(x, y), the dotted line the score S₁(x, y) corresponding to the bump function χ₁ with support on the left of the interval, and the dashed line the score S₂(x, y) corresponding to the bump function χ₂ with support on the right of the interval. Note that when the forecast x and observation y both lie outside the support of χ_j then S_j(x, y) = 0.

Example (Quantile scoring function). Let S denote the scoring function

S(x, y) = 0.25(y − x) if y ≥ x, and 0.75(x − y) if y < x,

so that S = QSF(g, α) where g(t) = t and α = 0.25. The decomposition S = S₁ + S₂ is illustrated in Figure 4 with respect to two different partitions {χ₁, χ₂} of unity. The solid line represents S, the dotted line S₁ and the dashed line S₂.

Figure 4: Decomposition S = S₁ + S₂ of the commonly used consistent scoring function for the 0.25-quantile. The solid line represents S, the dotted line S₁ and the dashed line S₂ for different forecast–observation pairs. In each case, S₁ emphasises performance on the interval (0, 4) while S₂ emphasises performance above 4.

Example (Synthetic data). Suppose that the climatological distribution of an unknown quantity Y is normal with mean 4. Two forecast systems A and B issue point forecasts for Y targeting the mean functional. Their respective forecast errors are independently normally distributed with error characteristics e_A ∼ N(0, (arctan(y − 10) + 2)²) (where y is the observation) and e_B ∼ N(0, 2²). Hence the standard deviation of e_A is lower than the standard deviation of e_B when y is less than 10, while the reverse is true when y is greater than 10. This is evident in the varying degree of scatter about the diagonal line in the forecast–observation plot of Figure 1.

Each system's point forecasts are scored using the squared error scoring function S together with the two components of its decomposition S = S₁ + S₂. The decomposition is induced from a rectangular partition {χ₁, χ₂} of unity on R with supp(χ₁) = (−∞, 10) and supp(χ₂) = [10, ∞). The mean scores for each system are shown in Table 1, along with a 95% confidence interval of the difference in mean scores.

Table 1: Comparative evaluation of forecast systems A and B from the Synthetic data example, rounded to 2 decimal places. The difference is the mean score of A minus the mean score of B. (Rows give the mean scores S̄, S̄₁ and S̄₂ for each system, together with the 95% confidence interval of the difference.)

An analysis of this sample concludes that neither System A nor System B is significantly better than its competitor as measured by S, since 0 belongs to the confidence interval of the difference in their mean scores. However, one can infer with high confidence that A performs better when scored by S₁, since 0 is well outside the corresponding confidence interval. Similarly, one may conclude with high confidence that B performs better when scored by S₂.

Based on the results in the table, one would use the forecast from A if both forecasts are less than 10, and the forecast from B if both forecasts are greater than 10. If the forecasts lie on opposite sides of 10, one option is to take the average of the two forecasts. We revisit this example again in Section 4 from the perspective of users and optimal decision rules.

Example (Sydney rainfall forecasts). Two Bureau of Meteorology forecast systems, BoM and OCF, issue predictive distributions for daily rainfall at Sydney Observatory Hill. Climatologically, daily rainfall exceeds 35.8 mm on average 12 times a year, and 42.2 mm on average 6 times a year. We partition the outcome space [0, ∞) using a trapezoidal partition {χ₁, χ₂} of unity, where χ₂(t) = 0 if 0 ≤ t < 35.8, χ₂(t) = 1 if t ≥ 42.2, and χ₂(t) = (t − 35.8)/(42.2 − 35.8) if 35.8 ≤ t < 42.2. Naturally, χ₁(t) = 1 − χ₂(t).

Over the entire outcome space, the quality of each system's forecasts is assessed using the CRPS (for predictive distributions), the squared error scoring function (for expectation point forecasts), and the standard quantile scoring function of Equation (3.5) (for quantile point forecasts). Moreover, each of these scores S is decomposed as S = S₁ + S₂ using the common partition {χ₁, χ₂}. Their mean scores by forecast lead day, for the period July 2018 to June 2020, are shown in Figure 5. This example illustrates that the same partition of unity can be used to hone in on particular regions of interest across a variety of forecast types, using an assessment method that cannot be hedged. When performance is assessed with emphasis on heavier rainfall via S₂, BoM is better than OCF at lead days 1 and 2 for expectation and 0.9-quantile forecasts, marginally better with its predictive distributions, but worse than OCF at lead day 1 for 0.25-quantile forecasts.

Figure 5: Proper score decompositions S = S₁ + S₂ by lead day for different types of Sydney daily rainfall forecasts, where S is a standard quantile scoring function (top), the squared error scoring function (bottom left) or the CRPS (bottom right). In each case the score S₂ emphasises performance for heavier rainfall.

Mixture representations and elementary scoring functions facilitate a decision-theoretic interpretation of the scoring function decompositions of Corollary 3.2.

To avoid unimportant technical details, in this section assume that I = R and that the functions g and φ are twice-differentiable on R. Suppose that S is any one of the scoring functions QSF(g, α), ESF(φ, α) or HSF(φ, ν), where g is nondecreasing and φ is convex. Thus S is consistent for some functional T (either a specific quantile, expectile or Huber mean). Then S admits a mixture representation

S(x, y) = ∫_R S^T_θ(x, y) dM(θ),    (4.1)

where each S^T_θ is an elementary scoring function for T and the mixing measure dM(θ) is nonnegative (Ehm et al. 2016; Taggart 2020). That is, S is a weighted average of elementary scoring functions. Explicit formulae for these elementary scoring functions and mixing measures are given in Tables 2 and 3.

Table 2: Elementary scoring functions S^T_θ(x, y) for different functionals T.

  α-quantile:    S^T_θ(x, y) = 1 − α if y ≤ θ < x;  α if x ≤ θ < y;  0 otherwise.
  α-expectile:   S^T_θ(x, y) = (1 − α)|y − θ| if y ≤ θ < x;  α|y − θ| if x ≤ θ < y;  0 otherwise.
  ν-Huber mean:  S^T_θ(x, y) = (1/2) min(θ − y, ν) if y ≤ θ < x;  (1/2) min(y − θ, ν) if x ≤ θ < y;  0 otherwise.

The elementary scoring functions S^T_θ and their corresponding functionals T arise naturally in the context of optimal decision rules. In a certain class of such rules, a predefined action is taken if and only if the point forecast x for some unknown quantity Y exceeds a certain decision threshold θ. The usefulness of the forecast for the problem at hand can be assessed via a loss function S_θ, where S_θ(x, y) encodes the economic regret, relative to a perfect forecast, of applying the decision rule with forecast x when the observation y is realised. Typically S_θ(x, y) > 0 if θ lies between x and y (e.g., the forecast exceeded the decision threshold but the realisation did not, resulting in regret), and S_θ(x, y) = 0 otherwise (i.e., a perfect forecast would have resulted in the same decision, resulting in no regret).

For the decision rule to be optimal over many forecast cases, the point forecast x should be one that minimises the expected score E S_θ(x, Y), where Y ∼ F for the predictive distribution F. Binary betting decisions (Ehm et al. 2016) and simple cost–loss decision models (where the user must weigh up the cost of taking protective action as insurance against an event which may or may not occur (Richardson 2000)) give rise to some α-quantile of F being the optimal point forecast. The α-expectile functional gives the optimal point forecast for investment problems with cost basis θ, revenue y and simple taxation structures (Ehm et al. 2016).
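The mixture representation (4.1) can be checked numerically in the squared error case: with φ(t) = 2t² the mixing measure is dM(θ) = 4 dθ, and integrating the elementary expectile score of Table 2 (with α = 1/2) recovers (x − y)². A sketch (the grid bounds and resolution are arbitrary choices):

```python
import numpy as np

def elementary_expectile_score(theta, x, y, alpha=0.5):
    """Elementary score for the alpha-expectile at decision threshold theta (Table 2)."""
    if y <= theta < x:
        return (1.0 - alpha) * abs(y - theta)
    if x <= theta < y:
        return alpha * abs(y - theta)
    return 0.0

def mixture_score(x, y, dm_density, lo=-20.0, hi=20.0, n=100_001):
    """Riemann-sum approximation of Equation (4.1) with mixing density dm_density."""
    thetas = np.linspace(lo, hi, n)
    vals = [elementary_expectile_score(t, x, y) * dm_density(t) for t in thetas]
    return float(np.sum(vals) * (thetas[1] - thetas[0]))

# For squared error, phi(t) = 2 t^2, so dM(theta) = phi''(theta) dtheta = 4 dtheta.
x, y = 3.0, 7.5
print(mixture_score(x, y, lambda t: 4.0), (x - y) ** 2)  # both close to 20.25
```

Replacing the constant density by 4 χ_j(θ) reproduces the component ESF(φ_j, 1/2) of Corollary 3.2, which is exactly how the decomposition privileges the decision thresholds in supp(χ_j).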
The Huber mean generates optimal point forecasts when profits and losses are capped in such investment problems (Taggart 2020). Though the language here is financial, such investment decisions can be made using weather forecasts (Taggart 2020, Example 5.4). In each case, the particular score S_θ(x, y) is, up to a multiplicative constant, the elementary score S^T_θ(x, y) in Equation (4.1) for the appropriate functional T.

Thus the representation (4.1) expresses S(x, y) as a weighted average of elementary scores, each of which encodes the economic regret associated with a decision made using the forecast x with decision threshold θ. The weight on each θ is determined by the mixing measure dM(θ), as detailed in Table 3.

Table 3: Weights in the mixing measure dM(θ) for different scoring functions S.

  QSF(g, α):                 dM(θ) = g′(θ) dθ
  QSF(g_j, α):               dM(θ) = χ_j(θ) g′(θ) dθ
  ESF(φ, α), HSF(φ, ν):      dM(θ) = φ″(θ) dθ
  ESF(φ_j, α), HSF(φ_j, ν):  dM(θ) = χ_j(θ) φ″(θ) dθ

Relative to QSF(g, α), the scoring function QSF(g_j, α) weighs the economic regret for decision threshold θ by a factor of χ_j(θ), thus privileging certain decision thresholds over others.

We illustrate these ideas using the point forecasts generated by Systems A and B from the Synthetic data example (Section 3.5). These point forecasts target the mean functional, and so induce optimal decision rules for investment problems with cost basis θ. The decision rule is to invest if and only if the forecast future value x of the investment exceeds the fixed up-front cost θ of the investment. As previously noted, neither forecast system performs significantly better than the other when scored by squared error.
However, for certain decision thresholds θ, the mean elementary score ¯S_θ (which measures the average economic regret) of one system is clearly superior to the other. This is illustrated by the Murphy diagram of Figure 6, which is a plot of ¯S_θ against θ (Ehm et al. 2016). Since a lower score is better, users with a decision threshold exceeding about 8 should use forecasts from B, while those with a decision threshold less than 8 should use forecasts from A.

Let S = S_1 + S_2 denote the same decomposition of the squared error scoring function used for the Synthetic data example (Section 3.5), induced from a rectangular partition of unity. To aid simplicity of interpretation, assume for the user group under consideration that there is one constant k such that each elementary score equals k multiplied by the economic regret. We observe the following.

For S, φ(t) = 2t² and so dM(θ) = 4 dθ by Table 3. So by Equation (4.1), the mean squared error ¯S for each forecast system is proportional to the total area under the respective Murphy graph. Hence mean squared error ¯S can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds.

For S_2, dM(θ) = 4 χ_2(θ) dθ, where χ_2 is the rectangular bump function χ_R shown on the top panel of Figure 6. From Equation (4.1), the mean score ¯S_2 for each forecast system is proportional to the area under the respective Murphy curve for decision thresholds θ in [10, ∞). This is illustrated on the top panel of Figure 6 with shaded regions. System B has the smaller area and mean score and is hence superior. So ¯S_2 can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds θ in [10, ∞).
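A Murphy diagram of this kind can be computed by averaging the elementary score over forecast cases at each decision threshold θ. The sketch below is our own, with made-up data standing in for Systems A and B of the synthetic example (the distributions and the tail threshold of 8 are illustrative assumptions, not the paper's data).

```python
import numpy as np

def murphy_curve(forecasts, obs, thetas, alpha=0.5):
    """Mean elementary expectile score (average economic regret) at each
    decision threshold theta: one curve of a Murphy diagram."""
    x = np.asarray(forecasts, dtype=float)[:, None]   # shape (cases, 1)
    y = np.asarray(obs, dtype=float)[:, None]
    t = np.asarray(thetas, dtype=float)[None, :]      # shape (1, thresholds)
    between = (np.minimum(x, y) <= t) & (t < np.maximum(x, y))
    return (np.abs((y < x) - alpha) * np.abs(y - t) * between).mean(axis=0)

# Made-up stand-ins for Systems A and B.
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 4.0, size=2000)
fcst_a = obs + rng.normal(0.0, 2.0, size=2000)                # uniform skill
fcst_b = np.where(obs > 8.0,
                  obs + rng.normal(0.0, 1.0, size=2000),      # sharper in the tail
                  obs + rng.normal(0.0, 3.0, size=2000))      # noisier elsewhere

thetas = np.linspace(0.0, 30.0, 301)
curve_a = murphy_curve(fcst_a, obs, thetas)
curve_b = murphy_curve(fcst_b, obs, thetas)
# Plotting curve_a and curve_b against thetas yields the Murphy diagram;
# weighting each curve by 4*chi(theta) and integrating over theta gives the
# mean score of the corresponding member of the squared error decomposition.
```

Shading the region under a curve for θ in a chosen interval, as in Figure 6, then visualises the mean score of the corresponding member of the decomposition.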
Similarly, ¯S_1 can be interpreted as being proportional to the mean economic regret averaged across all decision thresholds θ less than 10.

The bottom panel of Figure 6 illustrates the effect of using a trapezoidal bump function χ_T. The corresponding scoring function applies zero weight to economic regret experienced by users with decision thresholds less than 0, and increasing weight to thresholds greater than zero until a full weight of 1 is applied for decision thresholds greater than or equal to 20. The areas of the shaded regions are proportional to the mean score for Systems A and B.

Figure 6: Murphy diagrams for the two forecast systems A (lighter) and B (darker) of the Synthetic data example. The graphs on each panel are the same but the shaded regions differ. Each shaded area is proportional to the mean score of the scoring function ESF(φ_j, 1/2), where φ(t) = 2t² and φ_j is induced by the bump function χ. Here χ is either the rectangular bump function χ_R (top) or trapezoidal bump function χ_T (bottom).

5 Conclusion

We have shown that the scoring functions that are consistent for quantiles, expectations, expectiles or Huber means can be decomposed as a sum of consistent scoring functions, each of which emphasises predictive performance in a region of the variable's range according to the selected partition of unity. In particular, this allows the comparison of predictive performance for the extreme region of a variable's range in a way that cannot be hedged and is not susceptible to misguided inferences. The decomposition is consonant with the analogous decomposition of the CRPS using the threshold-weighting method, and hence by extension with known decompositions for the absolute error scoring function.

Such decompositions are a consequence of the mixture representation for each of the aforementioned classes of consistent scoring functions and their associated functionals. This has two implications. First, each score in the decomposition can be interpreted as the weighted average of economic regret associated with optimal decision rules, where the average is taken over all user decision thresholds and the weight applied is given by the corresponding bump function in the partition of unity. Second, such decompositions will also exist for the consistent scoring functions of other functionals that have analogous mixture representations.

Acknowledgements

The author thanks Jonas Brehmer, Deryn Griffiths, Michael Foley and Tom Pagano for their constructive comments, which helped improve the quality of this paper.
A Proof of Corollary 3.2
As mentioned in Section 3.4, Corollary 3.2 essentially follows from the mixture representation theorems of Ehm et al. (2016) and Taggart (2020) by decomposing the mixing measure. However, we provide a direct proof to avoid some technical hypotheses of those theorems.
Proof.
We give the proof for the case ESF; the proofs for the cases QSF and HSF are similar.

To begin, note that if φ(t) = ψ(t) + ct + d whenever t ∈ I for some real constants c and d, then ESF(φ, α) = ESF(ψ, α). So without loss of generality we may assume that u_j = u_0 and v_j = v_0 whenever 1 ≤ j ≤ n, for some constants u_0 and v_0 in I.

Suppose that φ is convex and α ∈ (0, 1). Given a partition of unity {χ_j}_{j=1}^n, define φ_j by Equation (3.10). Then

    |𝟙{y < x} − α|^{−1} Σ_{j=1}^n ESF(φ_j, α)(x, y)
      = Σ_{j=1}^n ( φ_j(y) − φ_j(x) − φ_j′(x)(y − x) )
      = Σ_{j=1}^n ( ∫_x^y ∫_{v_0}^v χ_j(θ) φ″(θ) dθ dv − (y − x) ∫_{v_0}^x χ_j(θ) φ″(θ) dθ )
      = ∫_x^y ∫_{v_0}^v φ″(θ) dθ dv − (y − x) ∫_{v_0}^x φ″(θ) dθ
      = ∫_x^y ( φ′(v) − φ′(v_0) ) dv − (y − x)( φ′(x) − φ′(v_0) )
      = ( φ(y) − φ′(v_0) y ) − ( φ(x) − φ′(v_0) x ) − (y − x)( φ′(x) − φ′(v_0) )
      = |𝟙{y < x} − α|^{−1} ESF(φ, α)(x, y),

which establishes Equation (3.9). The convexity of φ_j follows from the fact that φ_j″(u) = χ_j(u) φ″(u) ≥ 0, since φ is convex.

Finally, suppose that x, y ∈ I′ ⊂ I and supp(χ_j) ∩ I′ = ∅. Now

    |𝟙{y < x} − α|^{−1} ESF(φ_j, α)(x, y) = ∫_x^y ( φ_j′(w) − φ_j′(x) ) dw = ∫_x^y ∫_x^w φ_j″(z) dz dw = ∫_x^y ∫_x^w χ_j(z) φ″(z) dz dw,

and since w lies between x and y, it follows that w also lies in I′. But χ_j = 0 on I′ and hence the inner integral vanishes. This shows that ESF(φ_j, α)(x, y) = 0 whenever x, y ∈ I′.

B Formulae for commonly used scoring functions
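As a numerical illustration of the decomposition (3.9) for a commonly used scoring function, the sketch below (our own check, with an illustrative split point θ = 10) verifies that the squared error score, which is ESF(φ, 1/2) with φ(t) = 2t², equals the sum of the two component scores induced by the rectangular partition χ_1 = 1 on (−∞, 10), χ_2 = 1 on [10, ∞).

```python
import numpy as np

def bregman(phi, dphi, x, y):
    """Bregman divergence phi(y) - phi(x) - phi'(x)(y - x)."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

def esf(phi, dphi, alpha, x, y):
    """Expectile scoring function ESF(phi, alpha)(x, y)."""
    return np.abs((y < x) - alpha) * bregman(phi, dphi, x, y)

phi   = lambda t: 2.0 * t * t                            # squared error: ESF(phi, 1/2)
dphi  = lambda t: 4.0 * t
phi2  = lambda t: 2.0 * np.maximum(t - 10.0, 0.0) ** 2   # phi_2'' = 4 * chi_2
dphi2 = lambda t: 4.0 * np.maximum(t - 10.0, 0.0)
phi1  = lambda t: phi(t) - phi2(t)                       # phi_1'' = 4 * chi_1
dphi1 = lambda t: dphi(t) - dphi2(t)

rng = np.random.default_rng(1)
x, y = rng.normal(10.0, 5.0, 100), rng.normal(10.0, 5.0, 100)
total = esf(phi1, dphi1, 0.5, x, y) + esf(phi2, dphi2, 0.5, x, y)
assert np.allclose(total, (x - y) ** 2)   # Equation (3.9) for squared error
```

Because the Bregman divergence is linear in φ, the two component scores sum exactly to the squared error for every forecast case.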
Table 4 presents closed-form expressions for g_j of Equation (3.8) in the case when g(t) = t, and for φ_j of Equation (3.10) in the case when φ(t) = 2t², each induced by rectangular bump functions of the form (3.1) or trapezoidal bump functions of the form (3.2). These expressions facilitate the computation of decompositions for the absolute error, standard quantile, squared error and Huber loss scoring functions. Note also that for each bump function we have φ_j′(t) = 4 g_j(t).

Table 4: Closed-form expressions for g_j and φ_j for different bump functions when g(t) = t and φ(t) = 2t².

Rectangular bump function, given by Equation (3.1):

    g_j(t) = { 0,        t < a
             { t − a,    a ≤ t < b
             { b − a,    t ≥ b

    φ_j(t) = { 0,                            t < a
             { 2(t − a)²,                    a ≤ t < b
             { 4(b − a)t + 2(a² − b²),       t ≥ b

Trapezoidal bump function, given by Equation (3.2):

    g_j(t) = { 0,                                            t < a
             { (t − a)²/(2(b − a)),                          a ≤ t < b
             { t − (b + a)/2,                                b ≤ t < c
             { −(d − t)²/(2(d − c)) + (d + c − a − b)/2,     c ≤ t < d
             { (d + c − a − b)/2,                            t ≥ d

    φ_j(t) = { 0,                                            t < a
             { 2(t − a)³/(3(b − a)),                         a ≤ t < b
             { 2t² − 2(a + b)t + (2/3)(b − a)² + 2ab,        b ≤ t < c
             { 2(d − t)³/(3(d − c)) + 2(d + c − a − b)t + C, c ≤ t < d
             { 2(d + c − a − b)t + C,                        t ≥ d

    where C = (2/3)((b − a)² + 3ab − (d − c)² − 3cd).

Bibliography
Bellini, F. and Di Bernardino, E. (2017). Risk management with expectiles. The European Journal of Finance, 23(6):487–506.

Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society: Series B, 78:505–562.

Gneiting, T. (2011a). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.

Gneiting, T. (2011b). Quantiles as optimal point forecasts. International Journal of Forecasting, 27(2):197–207.

Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1:125–151.

Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29(3):411–422.

Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101.

Lerch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster's dilemma: extreme events and forecast evaluation. Statistical Science, 32(1):106–127.

Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55:819–847.

Patton, A. J. (2020). Comparing possibly misspecified forecasts. Journal of Business & Economic Statistics, 38(4):796–809.

Richardson, D. S. (2000). Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126(563):649–667.

Savage, L. J. (1971). Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801.

Sharpe, M., Bysouth, C., and Gill, P. (2020). New operational measure to assess extreme events using site-specific climatology. Presented at the 2020 International Verification Methods Workshop Online. Retrieved from https://jwgfvr.univie.ac.at/presentations-and-notes/ on 5 January 2021.

Taggart, R. (2020). Point forecasting and forecast evaluation with generalised Huber loss. Bureau Research Report, 50.

Thomson, W. (1979). Eliciting production possibilities from a well-informed manager.