Compatible Weighted Proper Scoring Rules ∗

Peter G. M. Forbes
Department of Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, U.K.

September 19, 2012
Abstract
Many proper scoring rules such as the Brier and log scoring rules implicitly reward a probability forecaster relative to a uniform baseline distribution. Recent work has motivated weighted proper scoring rules, which have an additional baseline parameter. To date two families of weighted proper scoring rules have been introduced, the weighted power and pseudospherical scoring families. These families are compatible with the log scoring rule: when the baseline maximizes the log scoring rule over some set of distributions, the baseline also maximizes the weighted power and pseudospherical scoring rules over the same set. We characterize all weighted proper scoring families and prove a general property: every proper scoring rule is compatible with some weighted scoring family, and every weighted scoring family is compatible with some proper scoring rule.
∗ This is a pre-copyedited, author-produced PDF of an article accepted for publication in Biometrika following peer review. The definitive publisher-authenticated version (Biometrika 99 (4): 989-994, 2012) is available online at http://biomet.oxfordjournals.org/cgi/content/abstract/ass046?ijkey=CaRYBhLvVa4XvRY&keytype=ref .

Suppose $Y$ is a random variable taking values in $\{1, \ldots, m\}$. The valid distributions for $Y$ are
\[ \mathcal{P} = \Big\{ (p_1, \ldots, p_m)^T : 0 \le p_i \le 1, \ \sum_{i=1}^m p_i = 1 \Big\} \subset \mathbb{R}^m. \]

A scoring rule $s : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ is a function linear in its second argument. The scoring rule is proper if $s(p, r)$ is maximized over $p$ at $p = r$, and strictly proper if this maximum is unique.

Consider a forecaster asked to issue a probabilistic prediction $p$ for $Y$. She is motivated by a reward of $s(p, r)$ upon observing outcome distribution $r$. If the forecaster's true belief is $p^*$, her expected score $s(p, p^*)$ is maximized when she predicts $p = p^*$. Hence proper scoring rules encourage honesty.

Two scoring rules $s_1$ and $s_2$ are equivalent if their rewards are linearly related for all $p, r \in \mathcal{P}$:
\[ s_1(p, r) = a \{ s_2(p, r) + \langle b, r \rangle \} \tag{1} \]
where $\langle \cdot, \cdot \rangle$ is the standard inner product on $\mathbb{R}^m$, $a > 0$ and $b \in \mathbb{R}^m$.

The main characterization theorem for proper scoring rules was stated by McCarthy (1956) and proved by Hendrickson and Buehler (1971).

Theorem 1.
A scoring rule $s$ is proper if and only if the function
\[ S(\lambda p) = \lambda s(p, p) \tag{2} \]
defined on $\mathcal{P}_\Lambda = \{ \lambda p : \lambda > 0, \ p \in \mathcal{P} \}$ is convex and satisfies $S(p) \ge s(q, p)$ for all $p, q \in \mathcal{P}$. The scoring rule is strictly proper if and only if $S$ is strictly convex on $\mathcal{P}$.

The function $S$ is called the optimal expected score. Grünwald and Dawid (2004) showed that the negative optimal expected score can be interpreted as a generalized entropy.

When $S$ is differentiable we have (Hendrickson and Buehler, 1971)
\[ s(p, r) = s(\lambda p, r) = \langle \nabla_{\lambda p} S(\lambda p), r \rangle \tag{3} \]
which associates a proper scoring rule with any convex differentiable function $S$. For the rest of this paper we assume that $S$ is twice differentiable on $\mathcal{P}_\Lambda$, strictly convex on $\mathcal{P}$, and achieves its unique minimum in $\mathcal{P}_+$, the interior of $\mathcal{P}$.

Equation (3) extends the domain of $s$ to $\mathcal{P}_\Lambda \times \mathcal{P}$ and allows us to differentiate $s$ with respect to its first parameter. Since $s(\lambda p, r) = s(p, r)$ for any $\lambda > 0$, we have $\langle \nabla_p s(p, r), 1_m \rangle = 0$ for any $p$ and $r \in \mathcal{P}$, where $1_m \in \mathbb{R}^m$ has all entries equal to one.

Consider a sequence of observations $y_1, \ldots, y_n$ with empirical distribution $r \in \mathcal{P}$. Let $p(\theta)$ be some model which takes values in $\mathcal{P}_+$ and is differentiable over some open convex set $\Theta$. Then any scoring rule defines an optimal score estimator (Gneiting and Raftery, 2007) via
\[ \tilde\theta(r) = \arg\max_{\theta \in \Theta} s\{p(\theta), r\} = \arg\max_{\theta \in \Theta} \sum_{i=1}^n s\{p(\theta), y_i\}. \]
From (1), all equivalent scoring rules have the same optimal score estimator. The optimal score estimator is well behaved at $r$ if $\tilde\theta(r)$ exists and is the unique root of $\nabla_\theta s\{p(\theta), r\}$ in $\Theta$. When $s$ is the log scoring rule $s(p, r) = \sum_{i=1}^m r_i \log p_i$, the optimal score estimator becomes the maximum likelihood estimator.

A well behaved optimal score estimate $\tilde\theta(r)$ yields the parameter choice that maximizes the forecaster's expected score under the assumption that the future is similar to the past. Specifically we suppose that our forecaster issues the prediction $p(\theta)$ for some $\theta \in \Theta$. If she believes that the next observation's distribution is $r$, then $p\{\tilde\theta(r)\}$ maximizes her expected score.

The optimal score estimator can be generalized so that each $y_i$ follows a different probability distribution, as long as these distributions share a common parameter $\theta \in \Theta$. Thus the optimal score estimator is applicable to regression models that depend on both $\theta$ and some additional covariates. For the sake of brevity we consider only the basic optimal score estimator here, though all the results hold in the general case.

We define the baseline of a strictly proper scoring rule to be the unique $q \in \mathcal{P}_+$ that maximizes the generalized entropy $-S(p)$. For example, the log scoring rule's generalized entropy is the Shannon entropy, which is maximized by the uniform distribution.
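The propriety condition and the uniform baseline of the log scoring rule can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names, the random sampling scheme, and the quadratic (Brier-type) score written in the linear-in-$r$ form $2\langle p, r\rangle - \langle p, p\rangle$ are all illustrative assumptions.

```python
import math
import random

def log_score(p, r):
    """Expected log score s(p, r) = sum_i r_i log p_i (linear in r)."""
    return sum(ri * math.log(pi) for pi, ri in zip(p, r) if ri > 0)

def quadratic_score(p, r):
    """Quadratic (Brier-type) score 2<p, r> - <p, p>; an assumed, equivalent form."""
    return 2 * sum(pi * ri for pi, ri in zip(p, r)) - sum(pi * pi for pi in p)

def random_distribution(m, rng):
    """Draw a point in the interior of the simplex by normalising positive weights."""
    w = [rng.random() + 1e-3 for _ in range(m)]
    total = sum(w)
    return [wi / total for wi in w]

def is_proper_on_samples(score, m=4, trials=2000, seed=0):
    """Check s(r, r) >= s(p, r) on randomly drawn pairs (p, r)."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = random_distribution(m, rng)
        r = random_distribution(m, rng)
        if score(r, r) < score(p, r) - 1e-12:
            return False
    return True

def entropy_peaks_at_uniform(score, m=4, trials=2000, seed=1):
    """Check that the generalized entropy -S(p) = -s(p, p) never exceeds its uniform value."""
    rng = random.Random(seed)
    uniform = [1.0 / m] * m
    best = -score(uniform, uniform)
    return all(
        -score(p, p) <= best + 1e-12
        for p in (random_distribution(m, rng) for _ in range(trials))
    )
```

A sampling check of this kind can only fail to find a counterexample; it does not prove propriety. It is meant to make the definitions of $S$ and of the baseline concrete.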
Proper scoring rules tend to give larger rewards for riskier predictions which vary significantly from the baseline. Given $q \in \mathcal{P}_+$ and a strictly proper scoring rule $s(p, r)$, there is an equivalent rule with baseline $q$ given by $s(p, r) - s(q, r)$.

A weighted scoring family
\[ s(p, r \,\|\, \cdot) = \{ s(p, r \,\|\, q) : q \in \mathcal{P}_+ \} \]
is a family of strictly proper scoring rules where each member $s(p, r \,\|\, q)$ has baseline $q$. Two weighted proper scoring rules are equivalent if (1) is satisfied, where now $a$ and $b$ are functions of $q$. Different members from the same family need not be equivalent.

Weighted scoring families allow us to tailor our scoring rule to the problem at hand, as motivated in Jose et al. (2009) and Johnstone and Lin (2011). This tailoring is achieved by modifying the baseline. The baseline is easily interpretable and justifiable in many real world situations. For instance, weighted scoring families are used in Jose et al. (2008) for an optimal portfolio allocation problem, where the baseline corresponds to the market price.

Let $s(p, r)$ be a proper scoring rule and $s(p, r \,\|\, \cdot)$ be a weighted scoring family. We say $s(p, r \,\|\, \cdot)$ is compatible with $s(p, r)$ if for any $q$ and $r \in \mathcal{P}_+$,
\[ \nabla_p s(p, r) \big|_{p=q} = a(q) \, \nabla_p s(p, r \,\|\, q) \big|_{p=q} \tag{4} \]
for some function $a(q) > 0$. In words, equation (4) says that the tangent of a weighted scoring rule at its baseline $q$ is parallel to the compatible scoring rule's tangent at $q$. By approximating $s(p, r \,\|\, q)$ with its tangent at $p = q$ and applying (4), we obtain
\[ s(p, r \,\|\, q) \approx s(q, r \,\|\, q) + \frac{1}{a(q)} \Big\langle \nabla_p s(p, r) \big|_{p=q}, \ p - q \Big\rangle. \]
The first term corresponds to an equivalence factor $\langle b(q), r \rangle$. Thus, up to equivalence, every member of the weighted scoring family $s(p, r \,\|\, \cdot)$ is linearly approximated by the compatible proper scoring rule $s(p, r)$ in the vicinity of its baseline.

Theorem 2.
Any proper scoring rule is compatible with at least one weighted scoring family. Conversely, every weighted scoring family is compatible with some proper scoring rule, which is unique up to equivalence.

Proof.
Let $s(p, r)$ be a proper scoring rule. From the definition (4), it is compatible with the weighted scoring family where each member is equivalent to $s(p, r)$:
\[ s(p, r \,\|\, q) = s(p, r) - s(q, r). \]

Conversely, consider the weighted scoring family $s(p, r \,\|\, \cdot)$. From (3) and (4), a proper scoring rule $s(p, r)$ is compatible with this family if and only if its optimal expected score $S(p)$ satisfies
\[ \nabla_q^2 S(q) = a(q) \, \nabla_p^2 S(p \,\|\, q) \big|_{p=q} \tag{5} \]
for some $a(q) > 0$ and all $q \in \mathcal{P}_+$. The right hand side is a positive definite matrix since it is the Hessian of the convex function $S(p \,\|\, q)$. Thus the $S$ satisfying (5) is convex and corresponds to a strictly proper scoring rule. This solution is unique up to equivalence since the solution of a second-order differential equation is unique up to a linear term.

Having shown that a compatible proper scoring rule always exists, we now provide an alternative characterization for compatibility which has direct applications to optimal score estimation and decision theory.

Lemma 1.
A weighted scoring family $s(p, r \,\|\, \cdot)$ is compatible with the proper scoring rule $s(p, r)$ if and only if $\nabla_\theta s\{p(\theta), r\} \big|_{\theta=\theta_0} = 0$ implies $\nabla_\theta s\{p(\theta), r \,\|\, p(\theta_0)\} \big|_{\theta=\theta_0} = 0$ for all differentiable models $p(\theta)$, all $\theta_0 \in \Theta$, and all $r \in \mathcal{P}_+$.

Proof. Choose some model $p(\theta)$ and $r \in \mathcal{P}_+$. Suppose that $s(p, r \,\|\, \cdot)$ is compatible with $s(p, r)$, so that (4) holds for all $q \in \mathcal{P}_+$. Then (4) certainly holds when $q = p(\theta_0)$ for any $\theta_0 \in \Theta$. Left multiplying both sides of (4) with the matrix $\nabla_\theta p^T(\theta) \big|_{\theta=\theta_0}$ and using the chain rule,
\[ \nabla_\theta s\{p(\theta), r\} \big|_{\theta=\theta_0} = a\{p(\theta_0)\} \, \nabla_\theta s\{p(\theta), r \,\|\, p(\theta_0)\} \big|_{\theta=\theta_0}. \]
Thus if $\nabla_\theta s\{p(\theta), r\} \big|_{\theta=\theta_0} = 0$ then $\nabla_\theta s\{p(\theta), r \,\|\, p(\theta_0)\} \big|_{\theta=\theta_0} = 0$.

Conversely, suppose $\nabla_\theta s\{p(\theta), r\} \big|_{\theta=\theta_0} = 0$ implies $\nabla_\theta s\{p(\theta), r \,\|\, p(\theta_0)\} \big|_{\theta=\theta_0} = 0$. When $q = r$, both sides of (4) are being evaluated at their critical points and hence are zero. We will show (4) holds for $q \ne r$ by showing that $v = \nabla_p s(p, r \,\|\, q) \big|_{p=q}$ is parallel to $w = \nabla_p s(p, r) \big|_{p=q}$. Using (3) we can rewrite $v$ as
\[ v = \nabla_p^2 S(p \,\|\, q) \big|_{p=q} \, r \tag{6} \]
where $\nabla_p^2 S(p \,\|\, q)$ is the positive definite Hessian of $S(p \,\|\, q)$. This implies $v \ne 0$ since $r \ne 0$. Furthermore since $v$ is a gradient of $s(p, r \,\|\, q)$, $\langle v, 1_m \rangle = 0$. The same arguments show that $w \ne 0$ and $\langle w, 1_m \rangle = 0$.

Suppose $v$ is not parallel to $w$. Then we can define the non-zero vector
\[ b = v - \frac{\langle v, w \rangle}{\langle w, w \rangle} w. \tag{7} \]
By construction $\langle b, w \rangle = 0$. Consider the model $p(\theta) = q + \theta b$ where $\theta$ takes values on $\Theta$, an open neighbourhood of zero small enough such that $\{ p(\theta) : \theta \in \Theta \} \subset \mathcal{P}_+$. It follows from $\langle v, 1_m \rangle = 0$ and $\langle w, 1_m \rangle = 0$ that $p(\theta)$ is normalized for all $\theta \in \Theta$. Thus $p(\theta)$ is a valid distribution for $\theta \in \Theta$ and, by our choice of $w$ and $p(\theta)$,
\[ \nabla_\theta s\{p(\theta), r\} \big|_{\theta=0} = \big\langle \nabla_\theta p(\theta) \big|_{\theta=0}, \ w \big\rangle = \langle b, w \rangle = 0. \]
Hence by assumption, $\nabla_\theta s\{p(\theta), r \,\|\, q\} \big|_{\theta=0} = 0$.
By definition of $v$ we have $\nabla_\theta s\{p(\theta), r \,\|\, q\} \big|_{\theta=0} = \langle b, v \rangle$ and thus $\langle b, v \rangle = 0$. Substituting this into (7),
\[ \langle v, v \rangle \langle w, w \rangle = \langle v, w \rangle^2 \]
and the Cauchy–Schwarz inequality implies that $v$ is parallel to $w$: $w = a(q, r) \, v$. Using (3) for $w$ and (6) for $v$, we rewrite $w = a(q, r) \, v$ as
\[ \nabla_p^2 S(p) \big|_{p=q} \, r = a(q, r) \, \nabla_p^2 S(p \,\|\, q) \big|_{p=q} \, r. \]
Since both matrices are positive definite, $a(q, r) > 0$. Since the left hand side is linear in $r$, we see $a = a(q)$, which proves (4).

Consider a forecaster motivated by a weighted scoring rule with baseline $q$ to issue a prediction $p(\theta)$ for $Y$. She chooses her prediction based on some decision rule $p\{\breve\theta(r)\}$, where $r$ is the empirical distribution of the previous observations of $Y$. For instance, $\breve\theta$ could be the optimal score estimator for her weighted scoring rule. Her risk function is $-s[p\{\breve\theta(p^*)\}, p^* \,\|\, q]$, which depends on the unknown true distribution $p^*$ of $Y$. Since $p^*$ is unknown it is approximated with the empirical distribution $r$.

Suppose the baseline is determined by the optimal score estimator of the compatible scoring rule, $q = p\{\tilde\theta(r)\}$. Then, assuming $\breve\theta$ and $\tilde\theta$ to be well behaved at $r$, Lemma 1 implies that the forecaster's risk function is uniquely minimized when she issues the prediction $q$. The optimal score estimator of the compatible scoring rule dominates any other estimator $\breve\theta$ for this choice of baseline.

Define the quasi-Bregman weighted scoring families to be the proper scoring rules with optimal expected scores
\[ S(p \,\|\, q) = h\Big\{ \sum_{i=1}^m f(q_i) \, g\Big(\frac{p_i}{q_i}\Big) \Big\} - g'(1) \sum_{i=1}^m p_i \frac{f(q_i)}{q_i} \, h'\Big\{ g(1) \sum_{j=1}^m f(q_j) \Big\}, \tag{8} \]
where $g'$ denotes the derivative of $g$ with respect to its parameter, and similarly for $h'$. We require that $f$ is positive, $g$ is twice differentiable and strictly convex, and that $h$ is twice differentiable and strictly increasing. This defines a weighted scoring family for each choice of $f$, $g$ and $h$. The expected score $S(p \,\|\, q)$ is strictly convex since $g$ is strictly convex, $f$ is positive and $h$ is increasing. Hence the quasi-Bregman weighted scoring families are strictly proper. The second term of (8) ensures that $S(p \,\|\, q)$ has baseline $q$, though removing it achieves a simpler, equivalent rule for optimal score estimation.
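The role of the second term in (8) can be illustrated numerically: with it in place, $S(p \,\|\, q)$ should attain its minimum over the simplex at $p = q$. The sketch below is an illustration only. The helper names, the two example choices of $(f, g, h)$, and the sampling check are assumptions introduced here, and the derivatives $g'$ and $h'$ are passed in by hand rather than computed.

```python
import random

def quasi_bregman_S(p, q, f, g, g_prime, h, h_prime):
    """Optimal expected score S(p || q) of equation (8)."""
    inner = sum(f(qi) * g(pi / qi) for pi, qi in zip(p, q))
    slope = h_prime(g(1.0) * sum(f(qj) for qj in q))
    linear = g_prime(1.0) * slope * sum(pi * f(qi) / qi for pi, qi in zip(p, q))
    return h(inner) - linear

def random_distribution(m, rng):
    """Draw a point in the interior of the simplex by normalising positive weights."""
    w = [rng.random() + 1e-3 for _ in range(m)]
    total = sum(w)
    return [wi / total for wi in w]

def baseline_is_minimiser(S, q, trials=2000, seed=0):
    """Check S(p || q) >= S(q || q) on randomly drawn p."""
    rng = random.Random(seed)
    floor = S(q, q)
    return all(
        S(random_distribution(len(q), rng), q) >= floor - 1e-12
        for _ in range(trials)
    )

BETA = 3.0  # illustrative choice of the power parameter

def power_S(p, q):
    """Example: f(x) = x, g(x) = x^beta, h(x) = (x - 1) / {beta (beta - 1)}."""
    return quasi_bregman_S(
        p, q,
        f=lambda x: x,
        g=lambda x: x ** BETA,
        g_prime=lambda x: BETA * x ** (BETA - 1),
        h=lambda x: (x - 1) / (BETA * (BETA - 1)),
        h_prime=lambda x: 1 / (BETA * (BETA - 1)),
    )

def shifted_quadratic_S(p, q):
    """Example: f(x) = x^2, g(x) = x^2, h(x) = x, giving sum p_i^2 - 2 sum p_i q_i."""
    return quasi_bregman_S(
        p, q,
        f=lambda x: x * x,
        g=lambda x: x * x,
        g_prime=lambda x: 2 * x,
        h=lambda x: x,
        h_prime=lambda x: 1.0,
    )
```

In the second example $S(p \,\|\, q) = \|p - q\|^2 - \|q\|^2$, so the minimum at the baseline can also be read off directly, which gives a convenient sanity check on the implementation.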
The weighted power and pseudospherical scoring families of Jose et al. (2008), defined by
\[ s_{\mathrm{pow}}(p, r \,\|\, q) = \frac{1 - \sum_{i=1}^m p_i^{\beta} q_i^{1-\beta}}{\beta} - \frac{1 - \sum_{i=1}^m r_i p_i^{\beta-1} q_i^{1-\beta}}{\beta - 1}, \]
\[ s_{\mathrm{ps}}(p, r \,\|\, q) = \frac{1}{\beta - 1} \Bigg[ \sum_{i=1}^m r_i \Bigg\{ \frac{p_i / q_i}{\big( \sum_{j=1}^m p_j^{\beta} q_j^{1-\beta} \big)^{1/\beta}} \Bigg\}^{\beta - 1} - 1 \Bigg] \]
for $\beta > 1$, are quasi-Bregman weighted scoring families with $f(x) = x$ and
\[ h_{\mathrm{pow}}(x) = \frac{x - 1}{\beta(\beta - 1)}, \quad g_{\mathrm{pow}}(x) = x^{\beta}, \quad h_{\mathrm{ps}}(x) = \frac{x^{1/\beta} - 1}{\beta(\beta - 1)}, \quad g_{\mathrm{ps}}(x) = x^{\beta}. \]

Johnstone and Lin (2011) proved that $\nabla_\theta s\{p(\theta), r\} \big|_{\theta=\theta_0} = 0$ implies $\nabla_\theta s\{p(\theta), r \,\|\, p(\theta_0)\} \big|_{\theta=\theta_0} = 0$ when $s(p, r \,\|\, \cdot)$ is a power or pseudospherical weighted scoring family and $s(p, r)$ is the log scoring rule. From Lemma 1, this is equivalent to showing that the power and pseudospherical weighted scoring families are compatible with the log scoring rule.

Corollary 1.
The log scoring rule is compatible with any quasi-Bregman weighted scoring family with $f(x) = x$. This holds for any twice differentiable and strictly convex $g$, and any twice differentiable and strictly increasing $h$.

Proof. By substituting $f(x) = x$ into (8) and using (3), we obtain
\[ s(p, r \,\|\, q) = h'\Big\{ \sum_{i=1}^m q_i \, g\Big(\frac{p_i}{q_i}\Big) \Big\} \sum_{j=1}^m r_j \, g'\Big(\frac{p_j}{q_j}\Big). \tag{9} \]
The log scoring rule is $s(p, r) = \sum_{i=1}^m r_i \log p_i$. Substituting (9) and the log scoring rule into (4) shows that the equality holds with $a = h'\{g(1)\} \, g'(1)$. The functions $h$ and $g$ enter only through their values and first derivatives at 1.

We define the Bregman weighted scoring families as the quasi-Bregman weighted scoring families with $h(x) = x$. By substituting (8) into (3) and using equivalence, the Bregman weighted scoring families take the simple form
\[ s(p, r \,\|\, q) = \sum_{i=1}^m f(q_i) \Big\{ g\Big(\frac{p_i}{q_i}\Big) + g'\Big(\frac{p_i}{q_i}\Big) \frac{r_i - p_i}{q_i} \Big\}. \tag{10} \]
We recover the unweighted Bregman scoring rules of Grünwald and Dawid (2004), i.e.,
\[ s(p, r) = \sum_{i=1}^m \big\{ \tilde g(p_i) + \tilde g'(p_i)(r_i - p_i) \big\}, \tag{11} \]
by using a flat baseline and rescaling $g$ to $\tilde g(p_i) = f(q_i) \, g(p_i / q_i) = f(m^{-1}) \, g(m p_i)$. The unweighted Bregman scoring rules are uniquely specified through the convex function $\tilde g$ alone.

Corollary 2.
The unweighted Bregman rule specified by $\tilde g$ is compatible with all weighted Bregman families with $f(x) = x^2 \tilde g''(x)$. This holds for any twice differentiable and strictly convex $g$.

Proof. We use (4) with $s(p, r \,\|\, q)$ given by (10) with $f(x) = x^2 \tilde g''(x)$ and $s(p, r)$ given by (11).

We illustrate the use of this corollary via an example. The unweighted power scoring rule is defined by $\tilde g(x) = x^{\beta} / \{\beta(\beta - 1)\}$ for $\beta > 1$. Using the above corollary with $f(x) = x^{\beta}$, we see that the unweighted power scoring rule is compatible with all weighted scoring families of the form
\[ s(p, r \,\|\, q) = \sum_{i=1}^m q_i^{\beta} \Big\{ g\Big(\frac{p_i}{q_i}\Big) + g'\Big(\frac{p_i}{q_i}\Big) \frac{r_i - p_i}{q_i} \Big\} \tag{12} \]
for any choice of $g$.

As an application of compatible proper scoring rules, consider a portfolio allocation problem similar to Jose et al. (2008). There is a market consisting of $m$ assets, and a market maker who sets the prices at $q$. After one time period asset $Y$ will be worth 1 unit and the other assets will be worthless. The investor purchases a portfolio, spending a proportion of his wealth $p_i(\theta)$ on each asset and thus receiving $p_i(\theta) / q_i(\theta)$ units of each asset. He chooses $\theta$ based on the current prices $q$ and the historical outcome distribution $r$. Suppose the investor's negative risk is given by a weighted scoring rule $s\{p(\theta), r \,\|\, q\}$. The market maker does not know the form of the investor's scoring rule, but he believes it to come from a weighted scoring family compatible with some known proper scoring rule. The market maker prices the assets using the compatible rule's optimal score estimator, $q = p\{\tilde\theta(r)\}$. Then the market maker's price coincides with the investor's minimal risk portfolio $p(\theta)$: when the pricing is done by a compatible proper scoring rule, the investor is best served by buying the same number of units of each asset. Johnstone (2011) interprets this minimal risk portfolio from an economic perspective, for the special case where the compatible rule is the log score.

Until now, the only weighted scoring families considered in the literature were the weighted power and pseudospherical scoring rules. Since both are compatible with the log scoring rule, their optimal score estimators are dominated by the maximum likelihood estimator when the baseline is given by the latter.
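The domination statement can be checked on a toy example via the stationarity condition of Lemma 1. The sketch below is illustrative only: the Binomial(2, $\theta$) model on $m = 3$ outcomes, the empirical distribution `r`, the value $\beta = 3$, and the central finite-difference derivative are all assumptions made here, and the score functions encode the weighted power and pseudospherical definitions as read from the formulas quoted from Jose et al. (2008).

```python
import math

def binomial_model(theta):
    """Hypothetical one-parameter model on three outcomes: Binomial(2, theta)."""
    return [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]

def log_score(p, r):
    """Log scoring rule s(p, r) = sum_i r_i log p_i."""
    return sum(ri * math.log(pi) for pi, ri in zip(p, r))

def weighted_power_score(p, r, q, beta):
    """Weighted power score with baseline q (assumed reading of s_pow)."""
    a = sum(pi ** beta * qi ** (1 - beta) for pi, qi in zip(p, q))
    b = sum(ri * pi ** (beta - 1) * qi ** (1 - beta) for pi, ri, qi in zip(p, r, q))
    return (1 - a) / beta - (1 - b) / (beta - 1)

def weighted_pseudospherical_score(p, r, q, beta):
    """Weighted pseudospherical score with baseline q (assumed reading of s_ps)."""
    norm = sum(pi ** beta * qi ** (1 - beta) for pi, qi in zip(p, q)) ** (1 / beta)
    num = sum(ri * (pi / (qi * norm)) ** (beta - 1) for pi, ri, qi in zip(p, r, q))
    return (num - 1) / (beta - 1)

def derivative(fn, theta, step=1e-6):
    """Central finite-difference approximation to d fn / d theta."""
    return (fn(theta + step) - fn(theta - step)) / (2 * step)

r = [0.5, 0.3, 0.2]                  # illustrative empirical distribution
theta_mle = (r[1] + 2 * r[2]) / 2    # closed-form MLE for Binomial(2, theta)
q = binomial_model(theta_mle)        # baseline set by the log-score estimator
uniform = [1 / 3] * 3
beta = 3.0

d_log = derivative(lambda t: log_score(binomial_model(t), r), theta_mle)
d_pow = derivative(lambda t: weighted_power_score(binomial_model(t), r, q, beta), theta_mle)
d_ps = derivative(lambda t: weighted_pseudospherical_score(binomial_model(t), r, q, beta), theta_mle)
d_pow_uniform = derivative(
    lambda t: weighted_power_score(binomial_model(t), r, uniform, beta), theta_mle
)
```

With the baseline set to the fitted model `q`, all three derivatives should vanish at the maximum likelihood estimate up to finite-difference error, whereas with a uniform baseline the weighted power score is generally not stationary there.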
Johnstone and Lin (2011) conjectured the existence of a characterization theorem for all weighted proper scoring families whose optimal score estimators are dominated in this way. They went on to suggest that this theorem might reveal an unrecognized property of the log scoring rule.

We have found their conjectured characterization theorem: the optimal score estimator of any weighted proper scoring rule is dominated by the compatible proper scoring rule's optimal score estimator when the baseline is set to the compatible proper scoring rule's optimal score estimate. However, instead of revealing a special property of the log scoring rule, we have shown that every proper scoring rule is compatible with some family of weighted proper scoring rules.

Acknowledgment
I thank Steffen Lauritzen, Philip Dawid, Tilmann Gneiting and the referees for their helpful comments.
References
Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), 359–378.

Grünwald, P. D. and A. P. Dawid (2004). Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics 32, 1367–1433.

Hendrickson, A. D. and R. J. Buehler (1971). Proper scores for probability forecasters. Annals of Mathematical Statistics 42 (6), 1916–1921.

Johnstone, D. J. (2011). Economic interpretation of probabilities estimated by MLE or score. Management Science 57 (2), 308–314.

Johnstone, D. J. and Y.-X. Lin (2011). Fitting probability forecasting models by scoring rules and maximum likelihood. Journal of Statistical Planning and Inference 141 (5), 1832–1837.

Jose, V. R. R., R. F. Nau, and R. L. Winkler (2008). Scoring rules, generalized entropy, and utility maximization. Operations Research 56 (5), 1146–1157.

Jose, V. R. R., R. F. Nau, and R. L. Winkler (2009). Sensitivity to distance and baseline distributions in forecast evaluation. Management Science 55 (4), 582–590.

McCarthy, J. (1956). Measures of the value of information.