Contrasting Probabilistic Scoring Rules
Reason L. Machete ∗ Dept. of Mathematics and Statistics, P. O. Box 220, Reading, RG6 6AX, UK
Dated: May 24, 2012
Abstract
There are several scoring rules that one can choose from in order to score probabilistic forecasting models or estimate model parameters. Whilst it is generally agreed that proper scoring rules are preferable, there is no clear criterion for preferring one proper scoring rule above another. This manuscript compares and contrasts some commonly used proper scoring rules and provides guidance on scoring rule selection. In particular, it is shown that the logarithmic scoring rule prefers erring with more uncertainty, the spherical scoring rule prefers erring with lower uncertainty, whereas the other scoring rules are indifferent to either option.
Keywords: estimation; forecast evaluation; probabilistic forecasting; utility function
Issuing probabilistic forecasts is meant to express uncertainty about the future evolution of some quantity of interest. Such forecasts arise in many applications such as macroeconomics, finance, weather and climate forecasting. There are several scoring rules that one can choose from in order to elicit probabilistic forecasts, rank competing forecasting models or estimate forecast distribution parameters. It is generally agreed that one should select scoring rules that encourage a forecaster to state his 'best' judgement of the distribution, the so-called proper scoring rules (Friedman, 1983; Nau, 1985; Gneiting and Raftery, 2007), but which one to use is generally an open question. We shall take scoring rules to be loss functions that a forecaster wishes to minimise. Scoring rules that are minimised if and only if the issued forecasts coincide with the forecaster's best judgement are said to be strictly proper (Gneiting and Raftery, 2007; Brocker and Smith, 2007). We shall restrict our attention to strictly proper scoring rules.

Nonetheless, using scoring rules to rank competing forecasting models poses a problem: scoring rules do not provide a universally acceptable ranking of performance. In estimation, different scoring rules will yield different parameter estimates (Gneiting and Raftery, 2007; Johnstone and Lin, 2011). Moreover, a forecaster's best judgement may depart from the ideal; the ideal is the distribution that nature or the data generating process would give (Gneiting et al., 2007). Although strictly proper scoring rules encourage experts to issue their best judgements, such judgements may yet differ from each other.

∗ email: [email protected], tel: +44 (0)118 378 6378

et al. (2008) considered weighted scoring rules and showed that they correspond to different utility functions. A limiting feature of the utility functions considered is that they are defined on bounded intervals; there are many applications in which the variable of interest is unbounded.
Their motivation for weighted scoring rules is based on betting arguments, but it is not clear what the betting strategies (if any) are. Recently, Boero et al. (2011) empirically compared the Quadratic Probability Score (QPS), the Ranked Probability Score (RPS) and the logarithmic scoring rule on UK inflation forecasts by the Monetary Policy Committee and the Survey of External Forecasters (SEF). They found the scoring rules to rank the two sets of distributions similarly. Upon ranking individual forecasters from the SEF, they found the RPS to have better discriminatory power than the QPS, a feature they attributed to the RPS's sensitivity to distance. Despite the foregoing efforts, there is no theoretical assessment of what the preferences of the commonly used scoring rules are with respect to the ideal.

This paper contrasts how different scoring rules would rank competing forecasts with specified departures from ideal forecasts and provides guidance on scoring rule selection. It focuses upon those scoring rules that are commonly used in the forecasting literature, including econometrics and meteorology. More specifically, we contrast the relative information content of forecasts preferred by different scoring rules. Implications of the results for decision making are then suggested, noting that it may be desirable to be more or less uncertain when communicating probabilistic forecasts. We realise that an appropriate utility function may be unknown (Bickel, 2007) and expected utility theory may not even be appropriate (Kahneman and Tversky, 1979).

In section 2, we consider the case of scoring categorical forecasts by the Brier score (Brier, 1950), the logarithmic scoring rule (Good, 1952) and the spherical scoring rule (Friedman, 1983). For simplicity, special attention is focused on binary forecasts.
This section then inspires our study of density forecasts in section 3, where we consider four scoring rules: the Quadratic Score (Gneiting and Raftery, 2007), the Logarithmic Score (Good, 1952), the Spherical Score (Friedman, 1983) and the Continuous Ranked Probability Score (Epstein, 1969). We conclude with a discussion of the results in section 4.

In this section, we consider the scoring of categorical forecasts. The scoring rules considered are the Brier score (Brier, 1950), the logarithmic scoring rule and the spherical scoring rule (Friedman, 1983). In order to aid intuition in the next section, here we focus on the binary case. Another commonly used scoring rule for categorical forecasts is the Ranked Probability Score (RPS) (Epstein, 1969). In the binary case, the RPS reduces to the Brier score.

It will be useful to be aware of the following basics. Given any vectors $\mathbf{f}, \mathbf{g} \in \Re^m$, the inner product between the two vectors is
\[ \langle \mathbf{f}, \mathbf{g} \rangle = \sum_{i=1}^m f_i g_i, \]
from which the $L_2$ norm is defined by $\|\mathbf{f}\| = \langle \mathbf{f}, \mathbf{f} \rangle^{1/2}$.

Consider a probabilistic forecast $\{f_i\}_{i=1}^m$ of $m$ categorical events. Suppose the true distribution is $\{p_i\}_{i=1}^m$. If the actual outcome is the $j$th category, the Brier score is given by (Brier, 1950)
\[ \mathrm{BS}(\mathbf{f}, j) = \frac{1}{m} \sum_{i=1}^m (f_i - \delta_{ij})^2, \]
where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$. It follows that if we expand out the bracket we get
\[ \mathrm{BS}(\mathbf{f}, j) = \frac{1}{m} \left( \sum_{i=1}^m f_i^2 - 2 f_j + 1 \right). \]
The expected Brier score is then given by
\[ \mathbb{E}[\mathrm{BS}(\mathbf{f}, J)] = \sum_{j=1}^m p_j \,\mathrm{BS}(\mathbf{f}, j) = \frac{1}{m} \sum_{i=1}^m \left( f_i^2 - 2 p_i f_i + p_i \right) = \frac{1}{m} \sum_{i=1}^m \left[ (f_i - p_i)^2 + p_i - p_i^2 \right] = \frac{1}{m} \left\{ \|\boldsymbol{\gamma}\|^2 + \sum_{i=1}^m p_i (1 - p_i) \right\}, \]
where $\boldsymbol{\gamma}$ is a vector with components $\gamma_i = f_i - p_i$ for all $i = 1, \ldots, m$. It is evident from the last expression on the right hand side that the Brier score is effective with respect to the metric $d(\mathbf{f}, \mathbf{g}) = \|\mathbf{f} - \mathbf{g}\|$. When $m = 2$, we can put $\mathbf{f} = (p + \gamma, q - \gamma)$ and $\mathbf{p} = (p, q)$, with $p + q = 1$, and obtain
\[ \mathbb{E}[\mathrm{BS}(\mathbf{f}, J)] = \gamma^2 + pq. \]
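As a quick numerical sanity check of the expected Brier score identity for $m = 2$, the sketch below (the values of $p$, $q$ and $\gamma$ are illustrative choices, not from the paper) compares direct enumeration over outcomes with the closed form $\gamma^2 + pq$:

```python
# Numerical sanity check of the expected Brier score identity for m = 2
# (p, q and gamma below are illustrative choices).

def brier_score(f, j):
    """Brier score of binary forecast f when category j materialises."""
    m = len(f)
    return sum((f[i] - (1.0 if i == j else 0.0)) ** 2 for i in range(m)) / m

p, q, gamma = 0.7, 0.3, 0.1       # true distribution (p + q = 1) and error
f = (p + gamma, q - gamma)

# Expectation over the outcome J ~ (p, q)
expected = p * brier_score(f, 0) + q * brier_score(f, 1)
closed_form = gamma ** 2 + p * q  # the identity derived in the text
assert abs(expected - closed_form) < 1e-12
```

Replacing $\gamma$ with $-\gamma$ leaves both quantities unchanged, anticipating the indifference of the Brier score to the sign of the error.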
It follows that $\pm\gamma$ will yield the same Brier score. This means the Brier score does not discriminate between over-estimating and under-estimating the probabilities by the same amount. Furthermore, for any two forecasts $\mathbf{f}_i = (p + \gamma_i, q - \gamma_i)$, $i = 1, 2$, with $|\gamma_1| < |\gamma_2|$, the Brier score would prefer the forecast corresponding to $\gamma_1$.

The logarithmic scoring rule was proposed by Good (1952). It was later termed Ignorance by Roulston and Smith (2002) when they introduced it to the meteorological community. Given a probabilistic forecast $\mathbf{f} = (f_1, f_2, \ldots, f_m)$, the logarithmic scoring rule is given by $\mathrm{LS}(\mathbf{f}, j) = -\log f_j$, where $j$ denotes the category that materialises. Let us consider the expected logarithmic score of the forecasting scheme $\mathbf{f} = (p + \gamma, q - \gamma)$:
\[ \mathbb{E}[\mathrm{LS}(\mathbf{f}, J)] = -p \log(p + \gamma) - q \log(q - \gamma), \quad (1) \]
where $J \in \{1, 2\}$ is a random variable. The above expectation is also referred to as the Kullback-Leibler Information Criterion (Corradi and Swanson, 2006). As noted by Friedman (1983), this scoring rule is not effective.

If we let $\mathbf{f}_+ = (p + \gamma, q - \gamma)$ and $\mathbf{f}_- = (p - \gamma, q + \gamma)$, then we can define $\mathbb{E}[\mathrm{LS}]_\pm = \mathbb{E}[\mathrm{LS}(\mathbf{f}_+, J)] - \mathbb{E}[\mathrm{LS}(\mathbf{f}_-, J)]$. Then, assuming that $\gamma > 0$,
\[ \mathbb{E}[\mathrm{LS}]_\pm = p \log\left(\frac{p - \gamma}{p + \gamma}\right) + q \log\left(\frac{q + \gamma}{q - \gamma}\right). \quad (2) \]
Note that when $p = q = 0.5$, then $\mathbb{E}[\mathrm{LS}]_\pm = 0$; otherwise $\mathbb{E}[\mathrm{LS}]_\pm \neq 0$. Differentiating (2) with respect to $\gamma$ yields
\[ \frac{\mathrm{d}}{\mathrm{d}\gamma} \mathbb{E}[\mathrm{LS}]_\pm = \frac{2\gamma^2 (p - q)}{(p^2 - \gamma^2)(q^2 - \gamma^2)}. \quad (3) \]
Expressions (2) and (3) are well defined provided $\gamma < \min(p, q)$. Hence
\[ \frac{\mathrm{d}}{\mathrm{d}\gamma} \mathbb{E}[\mathrm{LS}]_\pm > 0 \ \text{if}\ p > q, \qquad \frac{\mathrm{d}}{\mathrm{d}\gamma} \mathbb{E}[\mathrm{LS}]_\pm < 0 \ \text{if}\ p < q. \]
It follows that $\mathbb{E}[\mathrm{LS}]_\pm > 0$ if $p > q$ and $\mathbb{E}[\mathrm{LS}]_\pm < 0$ if $p < q$. In other words, the logarithmic score penalises over-confidence on the likely outcome and rewards erring on the side of caution. Given forecasting schemes that are equally calibrated, the logarithmic score will prefer the one with a higher entropy. To explain this further, let us denote the entropy of the forecast corresponding to $\gamma$ by $h(\gamma)$, i.e.
\[ h(\gamma) = -(p + \gamma)\log(p + \gamma) - (q - \gamma)\log(q - \gamma). \quad (4) \]
We now define the function $G(\gamma) = h(\gamma) - h(-\gamma)$ and claim that $G(\gamma) < 0$ for $0 < \gamma < q$. Indeed, $G(0) = 0$ and $q < p$ implies that $G'(\gamma) < 0$ for all $\gamma \in (0, q)$. Therefore, it is evident that, of the two forecasts, the logarithmic score prefers the one with a higher entropy. We have thus proved the following proposition:

Proposition 2.1
Given two forecasts, $\mathbf{f}_+ = (p + \gamma, q - \gamma)$ and $\mathbf{f}_- = (p - \gamma, q + \gamma)$, where $0 < \gamma < q < p$, the logarithmic scoring rule prefers $\mathbf{f}_-$. Moreover, $\mathbf{f}_-$ has a higher entropy than $\mathbf{f}_+$.

What about two forecasts $\mathbf{f}_i = (p + \gamma_i, q - \gamma_i)$, $i = 1, 2$, with $0 < \gamma_1 < \gamma_2 < q$ and $p > q$? It is obvious that the Brier score will prefer $\mathbf{f}_1$ over $\mathbf{f}_2$. The question is, which of the two forecasts will the logarithmic scoring rule prefer? We answer this question by stating the following proposition:

Proposition 2.2
Given two forecasts $\mathbf{f}_i = (p + \gamma_i, q - \gamma_i)$, $i = 1, 2$, with $0 < \gamma_1 < \gamma_2 < q$ and $p > q$, the logarithmic scoring rule prefers $\mathbf{f}_1$ over $\mathbf{f}_2$.

Proof.
In order to prove this proposition, it is sufficient to consider the expected logarithmic score of the forecast $\mathbf{f} = (p + \gamma, q - \gamma)$, which is given by equation (1). Differentiating the equation with respect to $\gamma$ yields
\[ \frac{\mathrm{d}}{\mathrm{d}\gamma} \mathbb{E}[\mathrm{LS}(\mathbf{f}, J)] = \frac{\gamma}{(p + \gamma)(q - \gamma)}. \quad (5) \]
Equation (5) implies that, if $q > \gamma > 0$, $\mathbb{E}[\mathrm{LS}(\mathbf{f}, J)]$ is an increasing function of $\gamma$. Hence, the logarithmic scoring rule prefers the forecast $\mathbf{f}_1$. On the other hand, if $\gamma < 0$ and $|\gamma| < p$, then equation (5) implies that $\mathbb{E}[\mathrm{LS}(\mathbf{f}, J)]$ is a decreasing function of $\gamma$. It then follows that, given $\gamma_2 < \gamma_1 < 0$ with $|\gamma_2| < p$, the logarithmic scoring rule will prefer the forecast $\mathbf{f}_1$.

Finally, let us consider the case of two forecasts $\mathbf{f}_1 = (p + \gamma_1, q - \gamma_1)$ and $\mathbf{f}_2 = (p - \gamma_2, q + \gamma_2)$, where $0 < \gamma_1 < \gamma_2 < q < p$. Again, it is clear that the Brier score will prefer the forecast $\mathbf{f}_1$ over $\mathbf{f}_2$. It remains to be seen which forecast the logarithmic scoring rule will prefer. This may be determined by considering the function $H(\gamma_1, \gamma_2)$, where
\[ H(\gamma_1, \gamma_2) = p \log\left(\frac{p - \gamma_2}{p + \gamma_1}\right) + q \log\left(\frac{q + \gamma_2}{q - \gamma_1}\right). \quad (6) \]
Note that $H(\gamma_1, \gamma_2) = \mathbb{E}[\mathrm{LS}(\mathbf{f}_1, J)] - \mathbb{E}[\mathrm{LS}(\mathbf{f}_2, J)]$. The forecast $\mathbf{f}_1$ is preferred if $H(\gamma_1, \gamma_2) < 0$. The following proposition gives insights into relative forecast performance in the parameter space.
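The monotonicity claims above are easy to check numerically. The sketch below (illustrative values with $p > q$, our own choices) verifies that the expected logarithmic score prefers $\mathbf{f}_-$ to $\mathbf{f}_+$ and increases with $\gamma$ on $(0, q)$:

```python
import math

# Illustrative values with p > q (our own choices).
p, q = 0.7, 0.3

def expected_log_score(gamma):
    """E[LS(f, J)] of f = (p + gamma, q - gamma) under the truth (p, q)."""
    return -p * math.log(p + gamma) - q * math.log(q - gamma)

# Proposition 2.1: the rule prefers f_- = (p - gamma, q + gamma).
gamma = 0.1
assert expected_log_score(-gamma) < expected_log_score(gamma)

# Proposition 2.2: the expected score increases with gamma on (0, q),
# so the smaller positive error is preferred.
scores = [expected_log_score(g) for g in (0.05, 0.1, 0.15, 0.2)]
assert all(a < b for a, b in zip(scores, scores[1:]))
```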
Proposition 2.3
Given that $0 < \gamma_2 < q < p$, there exists $\gamma^* \in (0, \gamma_2)$ such that (a) $H(\gamma^*, \gamma_2) = 0$, (b) $H(\gamma_1, \gamma_2) > 0$ for $\gamma_1 \in (\gamma^*, \gamma_2)$ and (c) $H(\gamma_1, \gamma_2) < 0$ for $\gamma_1 \in (0, \gamma^*)$.

Before proving the above proposition, we remark that $H(\gamma_1, \gamma_2) < 0$ corresponds to the logarithmic scoring rule preferring $\mathbf{f}_1$. This proposition implies that the logarithmic scoring rule and the Brier score prefer different forecasts when $\gamma_1 \in (\gamma^*, \gamma_2)$. Let us now consider the proof of this proposition.

Proof.
In proving this proposition, it is useful to bear in mind that $H(\gamma_2, \gamma_2) > 0$, which follows from Proposition 2.1. The first partial derivatives of $H$ are
\[ \frac{\partial H}{\partial \gamma_1} = \frac{\gamma_1}{(p + \gamma_1)(q - \gamma_1)} \quad \text{and} \quad \frac{\partial H}{\partial \gamma_2} = -\frac{\gamma_2}{(p - \gamma_2)(q + \gamma_2)}. \quad (7) \]
Furthermore, we can differentiate equations (7) to obtain
\[ \frac{\partial^2 H}{\partial \gamma_1^2} = \frac{pq + \gamma_1^2}{(p + \gamma_1)^2 (q - \gamma_1)^2} \quad \text{and} \quad \frac{\partial^2 H}{\partial \gamma_2^2} = -\frac{pq + \gamma_2^2}{(p - \gamma_2)^2 (q + \gamma_2)^2}. \quad (8) \]
It follows from equations (7) that $\partial H/\partial\gamma_1 = 0$ at $\gamma_1 = 0$ and $\partial H/\partial\gamma_2 = 0$ at $\gamma_2 = 0$. Since $\partial^2 H/\partial\gamma_1^2 > 0$ for all $\gamma_1$, $H(\cdot, \gamma_2)$ has a global minimum at $\gamma_1 = 0$. Similarly, $H(\gamma_1, \cdot)$ has a global maximum at $\gamma_2 = 0$, since $\partial^2 H/\partial\gamma_2^2 < 0$ for all $\gamma_2$ and the first partial derivative with respect to $\gamma_2$ vanishes there. In particular, $H(0, \gamma_2) \leq H(0, 0) = 0$, i.e. $H(0, \gamma_2) \leq 0$. For $\gamma_2 > 0$, we have the strict inequality, $H(0, \gamma_2) < 0$. But we also have $H(\gamma_2, \gamma_2) > 0$. By continuity, $H(\gamma_1, \gamma_2) = 0$ for some $\gamma_1 = \gamma^* \in (0, \gamma_2)$, which completes the proof.

Proposition 2.4
For positive $\gamma_1$ and $\gamma_2$ such that $\gamma_1 < q < p$ and $\gamma_2 < p$, the entropy of the forecast $\mathbf{f}_1 = (p + \gamma_1, q - \gamma_1)$ is lower than that of the forecast $\mathbf{f}_2 = (p - \gamma_2, q + \gamma_2)$ whenever $\gamma_2 \leq (p - q)/2$.

A consequence of this proposition is that the forecast corresponding to $\gamma_1 = \gamma^*$ is more informative than $\mathbf{f}_2$ provided $\gamma_2 \leq (p - q)/2$. Otherwise, either forecast could be more informative than the other. We now give the proof of this proposition.
Proof.
To prove the above proposition, we consider the derivative of equation (4):
\[ \frac{\mathrm{d}h}{\mathrm{d}\gamma} = -\log\left(\frac{p + \gamma}{q - \gamma}\right). \]
We then note that $\mathrm{d}h/\mathrm{d}\gamma < 0$ if and only if $(p - q) > -2\gamma$. If $\gamma > 0$, this inequality is trivially satisfied. On the other hand, if $\gamma < 0$, then the inequality is satisfied provided $|\gamma| < (p - q)/2$. If $\gamma_2 < (p - q)/2$, then $h(\gamma)$ is a strictly decreasing function for all $\gamma \in [-\gamma_2, \gamma_1]$, which implies that $h(\gamma_1) < h(-\gamma_2)$. If $\gamma_2 > (p - q)/2$, then $h(\gamma)$ is an increasing function for all $\gamma \in (-\gamma_2, -(p - q)/2)$ (provided $p > q$) and a strictly decreasing function in $(-(p - q)/2, \gamma_1)$, which implies that $h(-(p - q)/2) > \max\{h(\gamma_1), h(-\gamma_2)\}$. Hence, in this case, we cannot determine which of $h(\gamma_1)$ and $h(-\gamma_2)$ is lower.

The spherical scoring rule is given by
\[ S(\mathbf{f}, j) = -\frac{f_j}{\|\mathbf{f}\|}. \]
Define $\mathbf{f}_- = \mathbf{p} - \boldsymbol{\gamma}$ and $\mathbf{f}_+ = \mathbf{p} + \boldsymbol{\gamma}$. Which of the two forecasts $\mathbf{f}_-$ and $\mathbf{f}_+$ does the spherical scoring rule prefer? In order to address this question, we appeal to geometry. Considering $\mathbf{f} = \mathbf{p} + \boldsymbol{\gamma}$, the dot product rule yields
\[ \|\mathbf{p}\| \|\mathbf{f}\| \cos\theta = \langle \mathbf{f}, \mathbf{p} \rangle, \]
where $\theta$ is the angle between $\mathbf{f}$ and $\mathbf{p}$. The above formula may be rewritten as
\[ \cos\theta = \frac{\|\mathbf{p}\|^2 + \langle \boldsymbol{\gamma}, \mathbf{p} \rangle}{\|\mathbf{p}\| \|\mathbf{f}\|}. \quad (9) \]
We then state the following proposition:

Proposition 2.5 If $\mathbf{p} = (p, q)$ and $\boldsymbol{\gamma} = (\gamma, -\gamma)$, and if we denote the right hand side of equation (9) by $C(\gamma)$, then
\[ \frac{\mathrm{d}C(\gamma)}{\mathrm{d}\gamma} = -\frac{\gamma}{\|\mathbf{p}\| \|\mathbf{f}\|^3}. \]

Proof. First note that $\|\mathbf{f}\|^2 = \|\mathbf{p}\|^2 + 2\langle \boldsymbol{\gamma}, \mathbf{p} \rangle + \|\boldsymbol{\gamma}\|^2$ and
\[ \frac{\mathrm{d}\|\mathbf{f}\|}{\mathrm{d}\gamma} = \frac{(p - q) + 2\gamma}{\|\mathbf{f}\|}. \]
Using the quotient rule, we then differentiate $C(\gamma)$ with respect to $\gamma$ to obtain
\[ \frac{\mathrm{d}C(\gamma)}{\mathrm{d}\gamma} = \frac{\|\mathbf{f}\|(p - q) - \left(\|\mathbf{p}\|^2 + \langle \boldsymbol{\gamma}, \mathbf{p} \rangle\right) \mathrm{d}\|\mathbf{f}\|/\mathrm{d}\gamma}{\|\mathbf{p}\| \|\mathbf{f}\|^2} = \frac{\|\mathbf{f}\|^2 (p - q) - \left(\|\mathbf{p}\|^2 + \langle \boldsymbol{\gamma}, \mathbf{p} \rangle\right)\left[(p - q) + 2\gamma\right]}{\|\mathbf{p}\| \|\mathbf{f}\|^3}. \]
Noting that $\langle \boldsymbol{\gamma}, \mathbf{p} \rangle = \gamma(p - q)$ and $\|\boldsymbol{\gamma}\|^2 = 2\gamma^2$, the numerator simplifies to
\[ (p - q)\left(\|\mathbf{p}\|^2 + 2\gamma(p - q) + 2\gamma^2\right) - \left(\|\mathbf{p}\|^2 + \gamma(p - q)\right)\left[(p - q) + 2\gamma\right] = -2\gamma\|\mathbf{p}\|^2 + \gamma(p - q)^2 = -2\gamma\|\mathbf{p}\|^2 + \gamma\left(\|\mathbf{p}\|^2 - 2pq\right) = -\gamma\left(\|\mathbf{p}\|^2 + 2pq\right) = -\gamma(p + q)^2. \]
The desired result follows from noting that $p + q = 1$.
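Anticipating the next proposition, the sketch below (with illustrative $p$, $q$ and $\gamma$, our own choices) evaluates the expected spherical and logarithmic scores of $\mathbf{f}_+$ and $\mathbf{f}_-$ and confirms that the two rules pull in opposite directions:

```python
import math

# Illustrative binary example (p, q and gamma are our own choices).
p, q, gamma = 0.7, 0.3, 0.1
p_vec = (p, q)
f_plus = (p + gamma, q - gamma)   # lower entropy forecast
f_minus = (p - gamma, q + gamma)  # higher entropy forecast

def expected_spherical(f, true_dist):
    """E[S(f, J)] = -<f, true_dist> / ||f|| (lower is better)."""
    norm = math.sqrt(sum(v * v for v in f))
    return -sum(tv * fv for tv, fv in zip(true_dist, f)) / norm

def expected_log(f, true_dist):
    """E[LS(f, J)] = -sum_j true_j * log(f_j)."""
    return -sum(tv * math.log(fv) for tv, fv in zip(true_dist, f))

s_plus = expected_spherical(f_plus, p_vec)
s_minus = expected_spherical(f_minus, p_vec)
l_plus = expected_log(f_plus, p_vec)
l_minus = expected_log(f_minus, p_vec)

assert s_plus < s_minus   # spherical rule prefers the lower entropy f_+
assert l_minus < l_plus   # logarithmic rule prefers the higher entropy f_-
```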
Suppose that $p > q$ and $\gamma \in (0, q)$. Then the spherical scoring rule prefers the lower entropy forecast, $\mathbf{f}_+$, instead of $\mathbf{f}_-$.

Proof.
Since the spherical scoring rule is effective, it suffices for us to show that $d^*(\mathbf{f}_+, \mathbf{p}) < d^*(\mathbf{f}_-, \mathbf{p})$, where $d^*$ is the associated metric. Suppose the angles that each of $\mathbf{f}_+$ and $\mathbf{f}_-$ makes with $\mathbf{p}$ are respectively $\theta_+$ and $\theta_-$. It is then true that $d^*(\mathbf{f}_+, \mathbf{p}) < d^*(\mathbf{f}_-, \mathbf{p})$ if and only if $\theta_+ < \theta_-$, since each distance is the length of a chord on a unit circle. Note that $C(0) = 1$ and $C'(0) = 0$. From Proposition 2.5, $-C'(\tau) < C'(-\tau)$ for all $\tau \in (0, \gamma)$, since $p > q$ implies $\|\mathbf{p} - \boldsymbol{\gamma}(\tau)\| < \|\mathbf{p} + \boldsymbol{\gamma}(\tau)\|$. This implies that
\[ -\int_0^\gamma C'(\tau)\,\mathrm{d}\tau < \int_0^\gamma C'(-\tau)\,\mathrm{d}\tau = \int_{-\gamma}^0 C'(\tau)\,\mathrm{d}\tau \;\Rightarrow\; C(0) - C(\gamma) < C(0) - C(-\gamma) \;\Rightarrow\; C(\gamma) > C(-\gamma). \]
But $C(\gamma) > C(-\gamma)$ implies that $\theta_+ < \theta_-$.

This section considers scoring rules for forecasts of continuous variables. It is in some sense a generalisation of the previous section. As before, we consider how each scoring rule would rank two competing predictive distributions of fairly good quality. In the case of the logarithmic scoring rule and the Continuous Ranked Probability Score, we consider errors of each predictive distribution, $f(x)$, from the target distribution, $p(x)$, that are odd functions, i.e. $\gamma(x) = f(x) - p(x)$ with $\gamma(-x) = -\gamma(x)$.

Familiarity with the following notation and definitions will be useful. Given two functions $f(x)$ and $g(x)$ that are bounded, an inner product is defined by
\[ \langle f, g \rangle = \int_{-\infty}^{\infty} f(x) g(x)\,\mathrm{d}x. \]
Then the $L_2$ norm is defined to be $\|f\| = \langle f, f \rangle^{1/2}$.

A continuous counterpart of the Brier score is the quadratic score (Gneiting and Raftery, 2007), given by
\[ \mathrm{QS}(f, X) = \|f\|^2 - 2 f(X), \]
where $X$ is a random variable. Taking the expectation yields
\[ \mathbb{E}[\mathrm{QS}(f, X)] = \|f - p\|^2 - \|p\|^2. \quad (10) \]
We can now write $f(x) = p(x) + \gamma(x)$, where $\int \gamma(x)\,\mathrm{d}x = 0$, and substitute it into (10) to obtain
\[ \mathbb{E}[\mathrm{QS}(f, X)] = \|\gamma\|^2 - \|p\|^2. \]
As was the case with the Brier score, the functions $\pm\gamma(x)$ yield the same quadratic score.
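The expected quadratic score identity can be checked by simple quadrature. In the sketch below, the Gaussian target density and the odd perturbation are illustrative choices of ours, discretised on a grid:

```python
import math

# Grid-based quadrature sketch; the target density and error are our own choices.
dx = 0.001
xs = [(i - 8000) * dx for i in range(16001)]   # symmetric grid on [-8, 8]

def p(x):  # standard normal density as an example target
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def gamma(x):  # odd perturbation, so it integrates to zero
    return -0.05 * x * math.exp(-x * x)

ps = [p(x) for x in xs]
f = [pv + gamma(x) for pv, x in zip(ps, xs)]

def norm2(g):  # squared L2 norm by Riemann sum
    return sum(v * v for v in g) * dx

# E[QS(f, X)] = ||f||^2 - 2 E[f(X)] under the target density
e_qs = norm2(f) - 2 * sum(pv * fv for pv, fv in zip(ps, f)) * dx
identity = norm2([fv - pv for fv, pv in zip(f, ps)]) - norm2(ps)
assert abs(e_qs - identity) < 1e-9   # matches ||f - p||^2 - ||p||^2
```

Since the error enters only through $\|\gamma\|^2$, replacing `gamma` by its negative leaves `e_qs` unchanged.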
For any two forecasts, $f_i(x) = p(x) + \gamma_i(x)$, $i = 1, 2$, with $\|\gamma_1\| < \|\gamma_2\|$, the quadratic scoring rule would prefer $f_1(x)$. Furthermore, $\|\gamma_1\| = \|\gamma_2\|$ implies that $\mathbb{E}[\mathrm{QS}(f_1, X)] = \mathbb{E}[\mathrm{QS}(f_2, X)]$.

The expectation of the logarithmic scoring rule for the forecast is
\[ \mathbb{E}[\mathrm{LS}(f, X)] = -\int p(x) \log(p(x) + \gamma(x))\,\mathrm{d}x. \]
As in the discrete case, we introduce the pdfs $f_+(x) = p(x) + \gamma(x)$ and $f_-(x) = p(x) - \gamma(x)$ so that we can define $\mathbb{E}[\mathrm{LS}]_\pm = \mathbb{E}[\mathrm{LS}(f_+, X)] - \mathbb{E}[\mathrm{LS}(f_-, X)]$. It follows that
\[ \mathbb{E}[\mathrm{LS}]_\pm = \int p(x) \log\left(\frac{p(x) - \gamma(x)}{p(x) + \gamma(x)}\right)\mathrm{d}x. \quad (11) \]
It is necessary that $|\gamma(x)| \leq p(x)$ for (11) to be well defined. Consider the case when $p(x) = p(-x)$. If, in addition, $\gamma(x)$ is an odd function, i.e. $\gamma(-x) = -\gamma(x)$, then equation (11) yields $\mathbb{E}[\mathrm{LS}]_\pm = 0$. When $\gamma(-x) = -\gamma(x)$ and $\int_{-\infty}^{0} p(x)\,\mathrm{d}x > 0.5$, we state the following proposition:
Proposition 3.1
Given that $\gamma(-x) = -\gamma(x)$ with $\gamma(|x|) < 0$ and $p(|x|) \leq p(x)$, then $\mathbb{E}[\mathrm{LS}]_\pm \geq 0$. That is, $f_-(x)$ is preferred by the logarithmic scoring rule over $f_+(x)$.

Proof.
The proof proceeds as follows:
\[ \mathbb{E}[\mathrm{LS}]_\pm = \int_{-\infty}^{\infty} p(x)\log\left(\frac{p(x) - \gamma(x)}{p(x) + \gamma(x)}\right)\mathrm{d}x = \int_{-\infty}^{0} p(x)\log\left(\frac{p(x) - \gamma(x)}{p(x) + \gamma(x)}\right)\mathrm{d}x + \int_{0}^{\infty} p(x)\log\left(\frac{p(x) - \gamma(x)}{p(x) + \gamma(x)}\right)\mathrm{d}x. \]
If we now perform a change of variable $u = -x$ in the right hand integral and then replace $u$ by $x$, we obtain
\[ \mathbb{E}[\mathrm{LS}]_\pm = \int_{-\infty}^{0} p(x)\log\left(\frac{p(x) - \gamma(x)}{p(x) + \gamma(x)}\right)\mathrm{d}x + \int_{-\infty}^{0} p(-x)\log\left(\frac{p(-x) + \gamma(x)}{p(-x) - \gamma(x)}\right)\mathrm{d}x \geq \int_{-\infty}^{0} p(x)\log\left(\frac{p(x) - \gamma(x)}{p(x) + \gamma(x)}\right)\mathrm{d}x + \int_{-\infty}^{0} p(x)\log\left(\frac{p(x) + \gamma(x)}{p(x) - \gamma(x)}\right)\mathrm{d}x = 0, \]
where we used $\gamma(-x) = -\gamma(x)$ in the second integral and $p(|x|) \leq p(x)$ to obtain the inequality. To justify the use of this inequality, we need to show that the function
\[ \Phi(p) = p \log\left(\frac{p + \gamma}{p - \gamma}\right) \]
is a decreasing function for $\gamma \in (0, p)$. Differentiating $\Phi$ with respect to $p$ yields
\[ \Phi'(p) = \log\left(\frac{p + \gamma}{p - \gamma}\right) - \frac{2p\gamma}{p^2 - \gamma^2}. \]
It now suffices to show that $\Phi'(p) < 0$ for all $p > \gamma$. Let us introduce the notation $W(p) = \log[(p + \gamma)/(p - \gamma)]$ and $Y(p) = 2p\gamma/(p^2 - \gamma^2)$ so that $\Phi'(p) = W(p) - Y(p)$. Note that $W(2\gamma) = \log 3$ and $Y(2\gamma) = 4/3$, and $\log 3 < 4/3$ since $3 < e^{4/3}$. Hence $W(2\gamma) < Y(2\gamma)$, which implies that $\Phi'(2\gamma) < 0$. Differentiating $W(p)$ and $Y(p)$ with respect to $p$ yields
\[ W'(p) = -\frac{2\gamma}{p^2 - \gamma^2} \quad \text{and} \quad Y'(p) = -\frac{2\gamma(p^2 + \gamma^2)}{(p^2 - \gamma^2)^2}. \]
It is now clear that $W'(p) < 0$ and $Y'(p) < 0$ for all $p > \gamma$. Furthermore, $Y'(p) < W'(p)$. Integrating $Y'(p) < W'(p)$ from $p$ to $2\gamma$ gives $Y(p) - W(p) > Y(2\gamma) - W(2\gamma) > 0$, hence $W(p) < Y(p)$ for all $p \in (\gamma, 2\gamma]$, which implies that $\Phi'(p) < 0$ for $p \in (\gamma, 2\gamma]$. It now remains to be shown that $\Phi'(p) < 0$ for $p \in (2\gamma, \infty)$. It suffices to consider the asymptotic behaviour as $p \to \infty$. Applying L'Hopital's rule, we obtain
\[ \lim_{p \to \infty} \frac{|W(p)|}{|Y(p)|} = \lim_{p \to \infty} \frac{|W'(p)|}{|Y'(p)|} = \lim_{p \to \infty} \frac{p^2 - \gamma^2}{p^2 + \gamma^2} = 1. \]
Hence, $\lim_{p\to\infty} W(p) = \lim_{p\to\infty} Y(p)$, i.e. $W(\infty) = Y(\infty) = 0$. With this result in mind, for all $p \in [2\gamma, \infty)$, we have
\[ \int_p^\infty Y'(\tau)\,\mathrm{d}\tau < \int_p^\infty W'(\tau)\,\mathrm{d}\tau \;\Rightarrow\; Y(\infty) - Y(p) < W(\infty) - W(p) \;\Rightarrow\; W(p) < Y(p), \]
which completes the argument that $\Phi$ is decreasing.

Note that the condition $p(|x|) \leq p(x)$ implies that $\int_{-\infty}^{0} p(x)\,\mathrm{d}x \geq 1/2$, which corresponds to $p > q$ in the discrete case.

We now want to compare the entropies of the forecasts $f(x) = p(x) \pm \gamma(x)$ when $\gamma(-x) = -\gamma(x)$ and $\gamma(|x|) \leq 0$. The entropy of the forecast $f(x) = p(x) + \gamma(x)$ is then given by
\[ h(\gamma) = -\int (p(x) + \gamma(x)) \log(p(x) + \gamma(x))\,\mathrm{d}x. \quad (12) \]
The functional derivative of $h(\gamma)$ with respect to $\gamma$ is given by
\[ \frac{\delta h(\gamma)}{\delta \gamma(x)} = -\frac{\partial}{\partial \gamma(x)}\left\{ (p(x) + \gamma(x)) \log(p(x) + \gamma(x)) \right\} = -[\log(p + \gamma) + 1]. \quad (13) \]
The order $O(\varepsilon)$ part of $h(\gamma + \varepsilon\,\delta\gamma) - h(\gamma)$ is given by (see Stone and Goldbart (2008) for further insights)
\[ \delta h(\gamma) = \int_{-\infty}^{\infty} \frac{\delta h(\gamma)}{\delta \gamma(x)}\, \delta\gamma(x)\,\mathrm{d}x. \quad (14) \]
Plugging (13) into (14), splitting the integral at $x = 0$, applying a change of variable $x \to -x$ on $(0, \infty)$ and assuming $\delta\gamma(-x) = -\delta\gamma(x)$ yields
\[ \delta h(\gamma) = -\int_{-\infty}^{\infty} [\log(p(x) + \gamma(x)) + 1]\, \delta\gamma(x)\,\mathrm{d}x = -\int_{-\infty}^{0} \log\left(\frac{p(x) + \gamma(x)}{p(-x) - \gamma(x)}\right) \delta\gamma(x)\,\mathrm{d}x. \]
In particular,
\[ \delta h(\gamma)\big|_{\gamma = 0} = -\int_{-\infty}^{0} \log\left(\frac{p(x)}{p(-x)}\right) \delta\gamma(x)\,\mathrm{d}x. \]
Using the assumption that $p(x) \geq p(-x)$ whenever $x < 0$, we consequently obtain
\[ \delta h(\gamma)\big|_{\gamma = 0} \leq 0, \quad (15) \]
if $\delta\gamma(x) > 0$ for $x < 0$. More generally, since $\gamma(x) \geq 0$ and $p(x) \geq p(-x)$ for $x < 0$, the logarithm in the integrand above is non-negative, so $\delta h(\gamma) \leq 0$ along the whole path from $-\gamma$ to $\gamma$. In effect, we have just proved the following proposition:
Proposition 3.2
Given that $\gamma(-x) = -\gamma(x)$, $\int \gamma(x)\,\mathrm{d}x = 0$, $\gamma(|x|) \leq 0$, $p(|x|) \leq p(x)$ and $|\gamma(x)| < p(x)$, then the entropy of the forecast density $f_+(x) = p(x) + \gamma(x)$ is lower than that of the forecast density $f_-(x) = p(x) - \gamma(x)$.
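This entropy ordering can be illustrated numerically. In the sketch below, the target density and the odd error are illustrative choices of ours that satisfy the hypotheses (more probability mass on the negative half-line, error negative for positive arguments):

```python
import math

# Illustrative target and error (our own choices, for demonstration only).
dx = 0.001
xs = [(i - 10000) * dx for i in range(20001)]   # symmetric grid on [-10, 10]

def p(x):  # normal density centred at -0.5, so p(-|x|) >= p(|x|)
    return math.exp(-(x + 0.5) ** 2 / 2) / math.sqrt(2 * math.pi)

def gamma(x):  # odd error, negative for x > 0
    return -0.05 * x * math.exp(-x * x)

def entropy(f):  # differential entropy by Riemann sum
    return -sum(v * math.log(v) for v in f) * dx

f_plus = [p(x) + gamma(x) for x in xs]
f_minus = [p(x) - gamma(x) for x in xs]
assert entropy(f_plus) < entropy(f_minus)   # f_+ is the lower entropy forecast
```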
Given two forecasts $f_i(x) = p(x) + \gamma_i(x)$, $i = 1, 2$, with (i) $|\gamma_1(x)| < |\gamma_2(x)|$, (ii) $\gamma_i(|x|) \leq 0$, (iii) $\gamma_i(-x) = -\gamma_i(x)$, (iv) $|\gamma_i(x)| \leq p(x)$ and (v) $p(|x|) \leq p(x)$, then the logarithmic scoring rule prefers forecast $f_1(x)$ over forecast $f_2(x)$.

Proof.
To prove the above proposition, we consider the functional derivative of the expected logarithmic scoring rule, $\mathbb{E}[\mathrm{LS}] = -\int_{-\infty}^{\infty} p(x)\log(p(x) + \gamma(x))\,\mathrm{d}x$. The functional derivative with respect to $\gamma(x)$ is
\[ \frac{\delta}{\delta\gamma} \mathbb{E}[\mathrm{LS}] = -\frac{p(x)}{p(x) + \gamma(x)}. \]
Using this result, we obtain the first variation of $\mathbb{E}[\mathrm{LS}]$ as
\[ \delta \mathbb{E}[\mathrm{LS}] = \int_{-\infty}^{\infty} \frac{\delta \mathbb{E}[\mathrm{LS}]}{\delta\gamma(x)}\,\delta\gamma(x)\,\mathrm{d}x = \int_{-\infty}^{0} \left[ \frac{p(-x)}{p(-x) - \gamma(x)} - \frac{p(x)}{p(x) + \gamma(x)} \right] \delta\gamma(x)\,\mathrm{d}x = \int_{-\infty}^{0} \frac{[p(-x) + p(x)]\,\gamma(x)}{[p(-x) - \gamma(x)][p(x) + \gamma(x)]}\,\delta\gamma(x)\,\mathrm{d}x \geq 0, \]
where we have split the integral at $x = 0$ and applied the change of variable $x \to -x$ on $(0, \infty)$; the inequality holds provided $\delta\gamma(x) > 0$ for $x < 0$, $\delta\gamma(-x) = -\delta\gamma(x)$, $\gamma(-x) = -\gamma(x)$ and $\gamma(|x|) \leq 0$. What has been shown is that as $\gamma(x)$ changes by $\delta\gamma(x)$, the expected logarithmic score changes by a positive amount. In particular, if we start at $\gamma(x) = \gamma_1(x)$ and progressively move towards $\gamma(x) = \gamma_2(x)$ by making successive additions of $\delta\gamma(x)$, the expected logarithmic score can only increase. Hence the expected logarithmic score of $\gamma_2(x)$ will be higher than that of $\gamma_1(x)$, which yields the result.

We shall now consider two forecasts, $f_1(x) = p(x) + \gamma_1(x)$ and $f_2(x) = p(x) - \gamma_2(x)$, with $|\gamma_1(x)| \leq |\gamma_2(x)| \leq p(x)$. In this case, the quadratic scoring rule would prefer $f_1(x)$ over $f_2(x)$. In order to determine which forecast the logarithmic scoring rule would prefer, we consider the functional
\[ H(\gamma_1, \gamma_2) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x) - \gamma_2(x)}{p(x) + \gamma_1(x)}\right)\mathrm{d}x. \quad (16) \]
Then the following proposition holds.

Proposition 3.4 Given that $|\gamma_1(x)| \leq |\gamma_2(x)| \leq p(x)$ and $\gamma_i(-x) = -\gamma_i(x)$, $i = 1, 2$, there exists $\gamma^*(x)$ satisfying the inequalities $\gamma^*(x)\gamma_2(x) \geq 0$ and $|\gamma^*(x)| \leq |\gamma_2(x)|$ such that (a) $H(\gamma^*, \gamma_2) = 0$, (b) $H(\gamma_1, \gamma_2) > 0$ for $|\gamma_1| > |\gamma^*|$ and (c) $H(\gamma_1, \gamma_2) < 0$ for $|\gamma_1| < |\gamma^*|$.

Proof.
It is helpful to first note that Proposition 3.1 implies that $H(\gamma_2, \gamma_2) > 0$ provided $\gamma_2 \neq 0$. Thinking of $\gamma_1(x)$ as fixed, the first variation of $H(\gamma_1, \cdot)$ with respect to $\gamma_2(x)$ is given by
\[ \delta H(\gamma_1, \cdot) = \int_{-\infty}^{\infty} \frac{\delta H(\gamma_1, \cdot)}{\delta\gamma_2(x)}\,\delta\gamma_2(x)\,\mathrm{d}x = \int_{-\infty}^{\infty} \frac{-p(x)}{p(x) - \gamma_2(x)}\,\delta\gamma_2(x)\,\mathrm{d}x = \int_{-\infty}^{0} \frac{-[p(-x) + p(x)]\,\gamma_2(x)}{[p(-x) + \gamma_2(x)][p(x) - \gamma_2(x)]}\,\delta\gamma_2(x)\,\mathrm{d}x \leq 0, \]
provided $\delta\gamma_2(x) > 0$ for $x < 0$, $\delta\gamma_2(-x) = -\delta\gamma_2(x)$ and $\gamma_2(|x|) \leq 0$. In the third expression, a change of variable $x = -\tau$ was applied on $(0, \infty)$ and then $\tau$ was replaced with $x$ since it is a dummy variable. It follows that $H(\gamma_1, \cdot)$ has a maximum when $\gamma_2 = 0$, i.e. $H(\gamma_1, \gamma_2) \leq H(\gamma_1, 0)$. In particular, $H(0, \gamma_2) \leq H(0, 0) = 0$. For $\gamma_2 \neq 0$, we have the strict inequality, $H(0, \gamma_2) < 0$. Since $H(\gamma_2, \gamma_2) > 0$, continuity implies that $H(\gamma_1, \gamma_2) = 0$ for some $\gamma_1(x) = \gamma^*(x)$ such that $|\gamma^*| < |\gamma_2|$, and this completes the proof.

Given a forecast $f(x)$, the spherical scoring rule is given by
\[ S(f, X) = -\frac{f(X)}{\|f\|}. \]
If we define the operator $\rho f = f(x)/\|f\|$, the expected spherical score is the inner product
\[ \mathbb{E}[S(f, X)] = -\langle \rho f, p \rangle. \]
The minimum of this expectation is achieved if and only if $f = p$, since it is a strictly proper scoring rule (Friedman, 1983). We now state the following proposition:

Proposition 3.5
Given that $\gamma(-x) = -\gamma(x)$ with $\gamma(|x|) \leq 0$, $|\gamma(x)| < p(x)$ and $p(|x|) \leq p(x)$, then the spherical scoring rule prefers the forecast $f_+(x)$ over $f_-(x)$, i.e. $\mathbb{E}[S(f_+, X)] \leq \mathbb{E}[S(f_-, X)]$.

Proof. The aim here is to show that $\mathbb{E}[S(f_+, X)] \leq \mathbb{E}[S(f_-, X)]$, which is equivalent to $\langle \rho f_+, p \rangle \geq \langle \rho f_-, p \rangle$. Note that each of these inner products is non-negative since
\[ \langle \rho f_\pm, p \rangle = \frac{\langle f_\pm, p \rangle}{\|f_\pm\|} = \frac{\langle p \pm \gamma, p \rangle}{\|f_\pm\|} = \frac{\|p\|^2 \pm \langle \gamma, p \rangle}{\|f_\pm\|} \geq 0, \]
due to the Cauchy-Schwarz inequality, $|\langle \gamma, p \rangle| \leq \|\gamma\| \|p\|$, and the hypothesis $|\gamma(x)| \leq p(x)$, which implies $\|\gamma\| \leq \|p\|$. Therefore, $\langle \rho f_+, p \rangle \geq \langle \rho f_-, p \rangle$ is equivalent to $\langle \rho f_+, p \rangle^2 \geq \langle \rho f_-, p \rangle^2$, and it suffices to show that the latter inequality holds. Now,
\[ \langle \rho f_+, p \rangle^2 - \langle \rho f_-, p \rangle^2 = \frac{\left(\|p\|^2 + \langle \gamma, p \rangle\right)^2}{\|f_+\|^2} - \frac{\left(\|p\|^2 - \langle \gamma, p \rangle\right)^2}{\|f_-\|^2} = \frac{\|f_-\|^2 \left(\|p\|^2 + \langle \gamma, p \rangle\right)^2 - \|f_+\|^2 \left(\|p\|^2 - \langle \gamma, p \rangle\right)^2}{\|f_+\|^2 \|f_-\|^2}. \]
Plugging $\|f_+\|^2 = \|p\|^2 + 2\langle \gamma, p \rangle + \|\gamma\|^2$ and $\|f_-\|^2 = \|p\|^2 - 2\langle \gamma, p \rangle + \|\gamma\|^2$ into the numerator of the last expression, removing brackets and collecting like terms yield
\[ \langle \rho f_+, p \rangle^2 - \langle \rho f_-, p \rangle^2 = \frac{4 \langle \gamma, p \rangle \left( \|p\|^2 \|\gamma\|^2 - \langle \gamma, p \rangle^2 \right)}{\|f_+\|^2 \|f_-\|^2}. \quad (17) \]
As a consequence of the Cauchy-Schwarz inequality, $\|p\|^2 \|\gamma\|^2 - \langle \gamma, p \rangle^2 \geq 0$. It will now be shown that, under the hypothesis of the proposition, $\langle \gamma, p \rangle \geq 0$:
\[ \langle \gamma, p \rangle = \int_{-\infty}^{\infty} \gamma(x) p(x)\,\mathrm{d}x = \int_{-\infty}^{0} \gamma(x) p(x)\,\mathrm{d}x + \int_{0}^{\infty} \gamma(x) p(x)\,\mathrm{d}x = \int_{-\infty}^{0} \gamma(x)\left[p(x) - p(-x)\right]\mathrm{d}x \geq 0, \]
where the change of variable $x \to -x$ and the oddness of $\gamma$ were used in the second integral, and the inequality follows since $p(x) \geq p(-x)$ and $\gamma(x) \geq 0$ for $x \leq 0$. Hence, the right hand side of equation (17) is non-negative.

The distribution preferred by the spherical scoring rule is already known through Proposition 3.2 to be of lower entropy. As was the case in the binary setting, the spherical scoring rule prefers the opposite distribution to the logarithmic scoring rule.
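The opposing preferences of the spherical and logarithmic rules in the continuous case can be checked with a simple discretisation (the density and error below are illustrative choices of ours satisfying the hypotheses):

```python
import math

dx = 0.001
xs = [(i - 10000) * dx for i in range(20001)]   # symmetric grid on [-10, 10]

def p(x):  # illustrative target with more mass for x < 0
    return math.exp(-(x + 0.5) ** 2 / 2) / math.sqrt(2 * math.pi)

def gamma(x):  # odd error, negative for x > 0
    return -0.05 * x * math.exp(-x * x)

ps = [p(x) for x in xs]
gs = [gamma(x) for x in xs]
f_plus = [a + b for a, b in zip(ps, gs)]
f_minus = [a - b for a, b in zip(ps, gs)]

def inner(u, v):  # L2 inner product by Riemann sum
    return sum(a * b for a, b in zip(u, v)) * dx

def expected_spherical(f):
    """E[S(f, X)] = -<f, p> / ||f||."""
    return -inner(f, ps) / math.sqrt(inner(f, f))

s_plus, s_minus = expected_spherical(f_plus), expected_spherical(f_minus)
assert s_plus < s_minus   # spherical rule prefers the lower entropy f_+

# Logarithmic rule: the quantity in (11) is positive, so it prefers f_-.
e_ls_pm = sum(pv * math.log((pv - gv) / (pv + gv)) for pv, gv in zip(ps, gs)) * dx
assert e_ls_pm > 0
```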
Finally, we consider the Continuous Ranked Probability Score (CRPS) of the density forecast $f(x)$ whose cumulative distribution is $F(x)$. The CRPS is a function of $F$ and the verification $X$ and is defined by (Gneiting and Raftery, 2007)
\[ \mathrm{CRPS}(F, X) = \int_{-\infty}^{\infty} \left( F(\tau) - \mathbb{I}\{\tau \geq X\} \right)^2 \mathrm{d}\tau. \]
The above score may equivalently be written as
\[ \mathrm{CRPS}(F, X) = \int_{-\infty}^{X} F(\tau)^2\,\mathrm{d}\tau + \int_{X}^{\infty} \left(F(\tau) - 1\right)^2\,\mathrm{d}\tau. \quad (18) \]
It follows from (18) that
\[ \mathbb{E}[\mathrm{CRPS}(F, X)] = \int_{-\infty}^{\infty} p(x) \int_{-\infty}^{x} F(\tau)^2\,\mathrm{d}\tau\,\mathrm{d}x + \int_{-\infty}^{\infty} p(x) \int_{x}^{\infty} \left(F(\tau) - 1\right)^2\,\mathrm{d}\tau\,\mathrm{d}x, \quad (19) \]
where $p(x)$ is the true (or target) density function. If $P(x) = \int_{-\infty}^{x} p(\tau)\,\mathrm{d}\tau$, we can then apply the integration by parts formula to each term on the right hand side of (19) to obtain
\[ \int_{-\infty}^{\infty} p(x) \int_{-\infty}^{x} F(\tau)^2\,\mathrm{d}\tau\,\mathrm{d}x = \left. P(x) \int_{-\infty}^{x} F(\tau)^2\,\mathrm{d}\tau \right|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} P(x) F(x)^2\,\mathrm{d}x = \int_{-\infty}^{\infty} F(x)^2\,\mathrm{d}x - \int_{-\infty}^{\infty} P(x) F(x)^2\,\mathrm{d}x \]
and
\[ \int_{-\infty}^{\infty} p(x) \int_{x}^{\infty} \left(F(\tau) - 1\right)^2\,\mathrm{d}\tau\,\mathrm{d}x = \left. P(x) \int_{x}^{\infty} \left(F(\tau) - 1\right)^2\,\mathrm{d}\tau \right|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} P(x) \left(F(x) - 1\right)^2\,\mathrm{d}x = 0 + \int_{-\infty}^{\infty} P(x) \left(F(x) - 1\right)^2\,\mathrm{d}x, \]
whence
\[ \mathbb{E}[\mathrm{CRPS}(F, X)] = \int_{-\infty}^{\infty} P(x)\left(1 - P(x)\right)\mathrm{d}x + \int_{-\infty}^{\infty} \left(F(x) - P(x)\right)^2\,\mathrm{d}x, \quad (20) \]
after some algebraic manipulation. Define $F(x) = P(x) + \Gamma(x)$, where $\Gamma(x) = \int_{-\infty}^{x} \gamma(\tau)\,\mathrm{d}\tau$ and $f(x) = p(x) + \gamma(x)$. If we also define $G(P) = \int P(x)(1 - P(x))\,\mathrm{d}x$, then equation (20) can be re-written as $\mathbb{E}[\mathrm{CRPS}(F, X)] = G(P) + \|\Gamma\|^2$. We have thus proved the following proposition:

Proposition 3.6
The Continuous Ranked Probability Score does not distinguish between distributions whose cumulative errors from the target distribution are equal in the sense of the $L_2$ norm.

In particular, consider two forecasts whose errors from the target density are $\gamma_1(x)$ and $\gamma_2(x)$ respectively, with $\gamma_1(x) = -\gamma_2(x)$. It then follows that
\[ \|\Gamma_1\|^2 - \|\Gamma_2\|^2 = \int_{-\infty}^{\infty} \Gamma_1(x)^2\,\mathrm{d}x - \int_{-\infty}^{\infty} \Gamma_2(x)^2\,\mathrm{d}x = \int_{-\infty}^{\infty} \left(\Gamma_1(x) + \Gamma_2(x)\right)\left(\Gamma_1(x) - \Gamma_2(x)\right)\mathrm{d}x = 0, \]
since $\gamma_1(x) = -\gamma_2(x)$ implies $\Gamma_1(x) + \Gamma_2(x) = 0$.

As a final remark, we note that the second term in the expectation of the CRPS in (20) somewhat resembles the mean squared error criterion discussed in Corradi and Swanson (2006). The mean squared error of the forecast $F(x)$ is $\mathbb{E}[\Gamma^2(X)] = \int p(x)\Gamma(x)^2\,\mathrm{d}x$. Likewise, the mean squared error criterion does not distinguish between forecasts whose errors from the target density differ by a sign (i.e. $\gamma_1(x) = -\gamma_2(x)$) because
\[ \mathbb{E}[\Gamma_1^2(X)] - \mathbb{E}[\Gamma_2^2(X)] = \int_{-\infty}^{\infty} p(x)\left(\Gamma_1(x)^2 - \Gamma_2(x)^2\right)\mathrm{d}x = \int_{-\infty}^{\infty} p(x)\left(\Gamma_1(x) + \Gamma_2(x)\right)\left(\Gamma_1(x) - \Gamma_2(x)\right)\mathrm{d}x = 0. \]

This manuscript contrasted how certain scoring rules would rank competing forecasts with specified departures from the target distribution. In the categorical case, we considered the Brier score, the logarithmic scoring rule and the spherical scoring rule, focusing on the binary case. Given two forecasts whose errors from the target distribution differ only by sign, we found that the logarithmic scoring rule prefers the higher entropy distribution whilst the spherical scoring rule prefers the lower entropy distribution. The Brier score does not distinguish the two distributions.
The logarithmic scoring rule selects a lower entropy forecast only if it is nearer to the target distribution in the sense of the $L_2$ norm, and vice versa for the spherical scoring rule.

We extended the investigation from binary forecasts to the continuous case, where we considered the Quadratic Score, the logarithmic scoring rule, the spherical scoring rule and the Continuous Ranked Probability Score (CRPS). Just like the Brier score in the binary case, the Quadratic Score does not distinguish between forecasts with equal $L_2$ norms of their errors from the target distribution. On the other hand, given two density forecasts whose errors from the target forecast differ by a sign, the logarithmic scoring rule prefers the distribution with higher entropy whilst the spherical scoring rule prefers the one with lower entropy: bear in mind that higher entropy corresponds to more uncertainty (Shannon, 1948). The CRPS is indifferent to forecasts whose errors from the target density differ by a sign.

Some have criticised the logarithmic scoring rule for placing a heavy penalty on assigning zero probability to events that materialise (e.g. Boero et al., 2011; Gneiting and Raftery, 2007); but assigning zero probability to events that are possible is also discouraged by Laplace's rule of succession (Jaynes, 2003). What has been shown here is that the logarithmic scoring rule is good at highlighting forecasts that are less uncertain than ideal forecasts. Such forecasts may have to be dealt with appropriately. One way of dealing with such forecasts is discussed in Machete (2012). Nonetheless, given two density forecasts, the logarithmic scoring rule does not just reject the more extreme in the sense of entropy: if both forecasts are more uncertain than the ideal forecast, the logarithmic scoring rule will tend to prefer the less uncertain of the two.

Does our consideration of departures from ideal forecasts amount to advocating for dishonesty by forecasters? Not at all.
We are merely making the observation that forecasters can honestly report predictive distributions that depart from ideal forecasts. Although strictly proper scoring rules encourage forecasters to be honest when they report their best judgements, they do not guarantee that the reported forecasts will coincide with ideal forecasts. Our point, then, is that a given scoring rule may inherently favour departures from ideal forecasts in one direction more than in another. Therefore, when one selects a scoring rule to estimate distribution parameters or to choose between two competing experts, one is implicitly deciding which departures are preferred.

Which scoring rule one should choose will depend on the application at hand. Combining the insights into scoring rules set forth in this paper with an understanding of the situation at hand can help decide which scoring rule is most appropriate. A key issue to consider may be decisions associated with high-impact, low-probability events. To illustrate the point, consider inflation forecasting. It is undesirable to overestimate the probability of extreme inflation because of the panic it can create as buyers rush to spend now before prices rise. In order to manage people's expectations better, the spherical scoring rule is preferable in this case. As another example, consider seasonal forecasts of drought in the UK, which is arguably a rare event. Underestimating the probability of this event could result in water shortages, since water companies might not be stringent on water usage. In this case, the logarithmic scoring rule is preferable.
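The binary-case rankings summarised above can be verified directly. The sketch below is our own worked example (the target distribution $p=(0.7,0.3)$ and the error $\pm 0.1$ are numbers we chose, not values from the paper): it compares the expected Brier loss, expected logarithmic loss and expected spherical score under $p$ for two forecasts whose errors differ only in sign.

```python
import math

p = (0.7, 0.3)       # target (ideal) distribution
q_low = (0.8, 0.2)   # error +0.1 on the first outcome: lower entropy
q_high = (0.6, 0.4)  # error -0.1 (sign flipped): higher entropy

def expected_brier(p, q):
    # Expected Brier loss: sum_i p_i * sum_j (q_j - 1{j = i})^2
    return sum(pi * sum((qj - (i == j)) ** 2 for j, qj in enumerate(q))
               for i, pi in enumerate(p))

def expected_log_loss(p, q):
    # Expected logarithmic loss: -sum_i p_i log q_i (smaller is better)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def expected_spherical(p, q):
    # Expected spherical score: sum_i p_i q_i / ||q||_2 (a reward; larger is better)
    norm = math.sqrt(sum(qi ** 2 for qi in q))
    return sum(pi * qi for pi, qi in zip(p, q)) / norm

# The Brier score cannot tell the two forecasts apart...
assert abs(expected_brier(p, q_low) - expected_brier(p, q_high)) < 1e-12
# ...the logarithmic score prefers the higher entropy forecast...
assert expected_log_loss(p, q_high) < expected_log_loss(p, q_low)
# ...and the spherical score prefers the lower entropy forecast.
assert expected_spherical(p, q_low) > expected_spherical(p, q_high)
```

Both expected Brier losses come to 0.44 here, while the expected logarithmic losses (about 0.633 versus 0.639) and expected spherical scores (about 0.752 versus 0.749) split in opposite directions, matching the rankings stated above.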
Acknowledgements
This work was supported by the RCUK Digital Economy Programme at the University of Reading via EPSRC grant EP/G065802/1, The Horizon Digital Economy Hub.
References
Bickel JE, 2007. Some Comparisons among Quadratic, Spherical, and Logarithmic Scoring Rules. Decision Analysis :49–65.

Boero G, Smith J, Wallis KF, 2011. Scoring rules and survey density forecasts. International Journal of Forecasting :379–393.

Brier GW, 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review :1–3.

Bröcker J, Smith LA, 2007. Scoring Probabilistic Forecasts: The Importance of Being Proper. Weather and Forecasting :382–388.

Corradi V, Swanson NR, 2006. Predictive density evaluation. In Elliott G, Granger CWJ, Timmermann A (editors), Handbook of Economic Forecasting, volume 1, 197–284. North-Holland.

Epstein ES, 1969. A Scoring System for Probability Forecasts of Ranked Categories. Journal of Applied Meteorology :985–987.

Friedman D, 1983. Effective Scoring Rules for Probabilistic Forecasts. Management Science :447–454.

Gneiting T, Balabdaoui F, Raftery AE, 2007. Probabilistic forecasts, calibration and sharpness. J. R. Statist. Soc. B :243–268.

Gneiting T, Raftery AE, 2007. Strictly proper scoring rules, prediction and estimation. J. Amer. Stat. Ass. :359–378.

Good IJ, 1952. Rational decisions. Journal of the Royal Statistical Society. Series B (Methodological) :107–114.

Jaynes ET, 2003. Probability Theory: The Logic of Science. Cambridge University Press.

Johnstone D, Lin Y, 2011. Fitting probability forecasting models by scoring rules and maximum likelihood. Journal of Statistical Planning and Inference :1832–1837.

Jose VRR, Nau RF, Winkler RL, 2008. Scoring Rules, Generalized Entropy, and Utility Maximisation. Operations Research :1146–1157.

Kahneman D, Tversky A, 1979. Prospect Theory: An Analysis of Decision under Risk. Econometrica :263–291.

Machete RL, 2012. Early warning with calibrated and sharper probabilistic forecasts. Journal of Forecasting doi:10.1002/for.2242.

Nau RF, 1985. Should scoring rules be 'effective'? Management Science :527–535.

Roulston MS, Smith LA, 2002. Evaluating Probabilistic Forecasts Using Information Theory. Monthly Weather Review :1653–1660.

Savage LJ, 1971. Elicitation of Personal Probabilities and Expectations.