Optimal probabilistic forecasts: When do they work?
Gael M. Martin, Rubén Loaiza-Maya, David T. Frazier, Worapree Maneesoonthorn, Andrés Ramírez Hassan
September 22, 2020
Abstract
Proper scoring rules are used to assess the out-of-sample accuracy of probabilistic forecasts, with different scoring rules rewarding distinct aspects of forecast performance. Herein, we re-investigate the practice of using proper scoring rules to produce probabilistic forecasts that are 'optimal' according to a given score, and assess when their out-of-sample accuracy is superior to alternative forecasts, according to that score. Particular attention is paid to relative predictive performance under misspecification of the predictive model. Using numerical illustrations, we document several novel findings within this paradigm that highlight the important interplay between the true data generating process, the assumed predictive model and the scoring rule. Notably, we show that only when a predictive model is sufficiently compatible with the true process to allow a particular score criterion to reward what it is designed to reward, will this approach to forecasting reap benefits. Subject to this compatibility, however, the superiority of the optimal forecast will be greater, the greater is the degree of misspecification. We explore these issues under a range of different scenarios, and using both artificially simulated and empirical data.
Keywords:
Coherent predictions; linear predictive pools; predictive distributions; proper scoring rules; stochastic volatility with jumps; testing equal predictive ability
MSC2010 Subject Classification: 60G25, 62M20, 60G35
JEL Classifications:
C18, C53, C58.

∗ This research has been supported by Australian Research Council (ARC) Discovery Grants DP170100729 and DP200101414. Frazier was also supported by ARC Early Career Researcher Award DE200101070; and Martin, Loaiza-Maya and Frazier were provided support by the Australian Centre of Excellence in Mathematics and Statistics.
† Department of Econometrics and Business Statistics, Monash University, Australia. Corresponding author: [email protected].
‡ Department of Econometrics and Business Statistics, Monash University, Australia.
§ Department of Econometrics and Business Statistics, Monash University, Australia.
¶ Melbourne Business School, University of Melbourne, Australia.
‖ Department of Economics, Universidad EAFIT, Colombia.

1 Introduction
Over the past two decades, the use of scoring rules to measure the accuracy of distributional forecasts has become ubiquitous. In brief, a scoring rule rewards a probabilistic forecast for assigning a high density ordinate (or high probability mass) to the observed value, so-called 'calibration', subject to some criterion of 'sharpness', or some reward for accuracy in a part of the predictive support that is critical to the problem at hand. We refer to Tay and Wallis (2000), Gneiting et al. (2007) and Gneiting and Raftery (2007) for early extensive reviews, and Diks et al. (2011) and Opschoor et al. (2017) for examples of later developments.

In the main, scoring rules have been used to compare the relative predictive accuracy of probabilistic forecasts produced by different forecasting models and/or methods. It is fair to say that, on the whole, less attention has been given to the relationship between the manner in which the forecast is produced, and the way in which its accuracy is assessed. Exceptions to this comment include Gneiting et al. (2005), Gneiting and Raftery (2007), Elliott and Timmermann (2008), Loaiza-Maya et al. (2019) and Patton (2019), and related work on the scoring of point, quantile or expectile forecasts in Gneiting (2011b,a), Holzmann and Eulert (2014), Ehm et al. (2016), Fissler and Ziegel (2016), Krüger and Ziegel (2020) and Ziegel et al. (2020). In this work, focus is given to producing forecasts that are, in some sense, optimal for the particular empirical problem and, as part of that, deliberately matched to the score used to evaluate out-of-sample performance; the idea here being that the forecast so chosen will, by construction, perform best out-of-sample according to that scoring rule. The literature on optimal forecast combinations is similar in spirit, with the combination weights chosen with a view to optimizing a particular forecast-accuracy criterion. (See Aastveit et al., 2019, for a recent review.)

Our work continues in this vein, but with three very specific, and inter-related, questions addressed regarding the production of an 'optimal' probabilistic forecast via the optimization of a criterion function defined by a given scoring rule. First, what is the impact of model misspecification on the performance of an optimal forecast? Second, when can we be assured that an optimal forecast will yield the best out-of-sample performance, as measured by the relevant score? Third, when can we have no such assurance?

We phrase answers to these questions in terms of the concept of 'coherence': a forecast that is optimal with respect to a particular score is said to be 'coherent' if it cannot be beaten out-of-sample (as assessed by that score) by a forecast that is optimal according to a different score. An optimal forecast is 'strictly coherent' if it is strictly preferable to all alternatives, when evaluated according to its own score. The word 'coherent' is used here to reflect the fact that the method chosen to produce a forecast performs out-of-sample in a manner that fits with, or is coherent with, that choice: i.e. no other choice (within the context of forecasts produced via proper scoring rules) is strictly preferable.

Nested within this concept of coherence is the known result (Gneiting and Raftery, 2007; Patton, 2019) that correct specification of the model, and under equivalent conditioning sets, leads to optimal forecasts that have theoretically equivalent out-of-sample performance according to any proper score, with numerical differences reflecting sampling variation only.
That is, all such methods are coherent in the sense that, in the limit, no one forecast is out-performed by another. However, the concept of coherence really has most import in the empirically relevant case where a predictive model is misspecified. In this setting, one cannot presume that estimating the parameters of a predictive model by optimizing any proper criterion will reveal the true predictive model. Instead, one is forced to confront the fact that no such 'true model' will be revealed, and that the criterion should be defined by a score that rewards the type of predictive accuracy that matters for the problem at hand. It is in this misspecified setting that we would hope to see strict coherence on display; providing justification as it would for simply producing a forecast via the scoring rule that is pertinent to the problem at hand, and leaving matters at that.

The concept of 'coherence' is distinct from the concept of 'consistency' that is used in some of the literature cited above (e.g. Gneiting, 2011a, Holzmann and Eulert, 2014, Ehm et al., 2016, and Patton, 2019). As pointed out by Patton, in the probabilistic forecasting setting a 'consistent' scoring function is analogous to a 'proper' scoring rule, which is 'consistent' for the true forecast distribution in the sense of being maximized (for positively-oriented scores) at that distribution. We restrict our attention only to proper (or 'consistent') scores. Within that set of scores, we then document when optimizing according to any one proper score produces out-of-sample performance - according to that score - that is superior to that of predictions deduced by optimizing alternative scores, and when it does not; i.e. when strict coherence between in-sample estimation and out-of-sample performance is in evidence and when it is not. What we illustrate is that the extent to which coherent forecasts arise in practice actually depends on the form, and degree, of misspecification. First, if the interplay between the predictive model and the true data generating process is such that a particular score cannot reward the type of predictive accuracy it is designed to reward, then optimizing that model according to that score will not necessarily lead to a strictly coherent forecast. Second, if a misspecified model is sufficiently 'compatible' with the process generating the data, in so much as it allows a particular score criterion to reward what it was designed to, strict coherence will indeed result; with the superiority of the optimal forecast being more marked, the greater the degree of misspecification, subject to this basic compatibility.

We demonstrate all such behaviours in the context of both probabilistic forecasts based on a single parametric model, and forecasts produced by a linear combination of predictive distributions. In the first case optimization is performed with respect to the parameters of the assumed model; in the second case optimization is with respect to both the weights of the linear combination and the parameters of the constituent predictives. To reflect our focus on model misspecification, at no point do we assume that the true model is spanned by the linear pool; that is, we adopt the so-called M-open view of the world (Bernardo and Smith, 1994). (We note that throughout the paper we only consider examples in which there is a single, common conditioning set. That is, in contrast to Holzmann and Eulert (2014) and Patton (2019), for example, we do not explore the impact on relative predictive performance of different conditioning sets.) Our results confirm that this approach to forecasting can reap benefits relative to alternative approaches. However, the very use of a combination of predictives to provide a more flexible and, hence, less misspecified representation of the true model can in some cases mitigate against the benefits of optimization. Section 4 documents the results of an empirical exercise that focuses on accurate prediction of returns on the S&P500 index and the MSCI Emerging Market (MSCIEM) index. Once again we demonstrate that there are benefits in seeking a predictor that is optimal according to a particular scoring rule, with slightly more marked gains in evidence in the case of the single predictive models than in the case of the linear pool. The paper concludes in Section 5.
Let (Ω, F, G) be a probability space, and let Y_∞ := {Y_1, . . . , Y_n, . . .} be a sequence of random variables whose infinite-dimensional distribution is G. In general, G is unknown, and so a hypothetical class of probability distributions is postulated for G. Let P be a convex class of probability distributions operating on (Ω, F) that represents our best approximation of G.

Assume our goal is to analyze the ability of the distribution P ∈ P to generate accurate probabilistic forecasts. The most common concept used to capture accuracy of such forecasts is a scoring rule. A scoring rule is a function S : {P ∪ {G}} × Ω → R whereby, if the forecaster quotes the distribution P and the value y eventuates, then the reward (or 'score') is S(P, y). As described earlier, in general terms a scoring rule rewards a forecast for assigning a high density ordinate (or high probability mass) to y, often subject to some shape, or sharpness, criterion, with higher scores denoting qualitatively better predictions than lower scores, assuming all scores are positively-oriented.

The result of a single score evaluation is, however, of little use by itself as a measure of predictive accuracy. To obtain a meaningful gauge of predictive accuracy, as measured by S(·, ·), we require some notion of regularity against which different predictions can be assessed. By far the most common measure of regularity used in the literature is via the notion of the expected score: following Gneiting and Raftery (2007), the expected score under the true measure G of the probability forecast P is given by

S(P, G) = ∫_{y∈Ω} S(P, y) dG(y).

A scoring rule S(·, ·) is 'proper' relative to P if, for all P, G ∈ P, S(G, G) ≥ S(P, G), and is strictly proper if S(G, G) = S(P, G) ⇔ P = G. That is, if the forecaster's best judgement is G, then a proper scoring rule rewards the forecaster for quoting P = G.

The concept of a proper scoring rule is useful from a practical perspective since it guarantees that, if we knew that the true DGP was G, then according to the rule S(·, ·) the best forecast we could hope to obtain would, on average, result by choosing G. Note that this notion of 'average' is embedded into the very definition of a proper scoring rule since it, itself, relies on the notion of an expectation. It is clear that, in practice, the expected score S(·, G) is unknown and cannot be calculated. However, if one believes that the true DGP is an element of P, a sensible approach to adopt is to form an empirical version of S(·, G) and search P to find the 'best' predictive over this class (Gneiting and Raftery, 2007).
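As a concrete illustration of propriety (our own sketch, not taken from the paper), the following Python snippet approximates the expected log score S(P, G) by Monte Carlo in a toy Gaussian setting, and shows that it is maximized when the quoted predictive P coincides with the true G = N(0, 1):

```python
# A minimal sketch (ours): a Monte Carlo check that the log score is
# proper. With data drawn from G = N(0, 1), the expected log score of a
# candidate predictive N(mu, 1) should be maximized at mu = 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=100_000)   # draws from the true G

for mu in (-0.5, -0.1, 0.0, 0.1, 0.5):
    expected_score = norm.logpdf(y, loc=mu, scale=1.0).mean()
    print(f"mu = {mu:+.1f}: approx E_G[S_LS] = {expected_score:.4f}")
# The printout peaks at mu = 0.0, illustrating S(G, G) >= S(P, G).
```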
More formally, for τ such that T ≥ τ ≥ 1, let {y_t}_{t=2}^{T−τ} denote a series of size T − (τ + 1), over which we wish to search for the most accurate predictive, where T is the total number of observations on y_t, and where τ denotes the size of a hold-out sample. Assume that the class of models under analysis is indexed by a vector of unknown parameters θ ∈ Θ ⊂ R^{d_θ}, i.e., P ≡ P(Θ), where P(Θ) := {P_θ : θ ∈ Θ}. For F_{t−1} denoting the time t − 1 information set, and for each θ ∈ Θ, we associate to the model P_θ the predictive measure P_θ^{t−1} := P(·|F_{t−1}, θ) and, where applicable, the associated predictive density p(·|F_{t−1}, θ). We can define an estimator of θ as

θ̂ := arg max_{θ∈Θ} S̄(θ),   (1)

where

S̄(θ) := (1/(T − (τ + 1))) Σ_{t=2}^{T−τ} S(P_θ^{t−1}, y_t),   (2)

and where the notation S̄(·) clarifies that the criterion function is a sample average, with components defined by the particular choice of score, S(·, ·). The estimator θ̂ is referred to as the optimal score estimator for the scoring rule S(·, ·), and P_{θ̂}^{t−1} as the optimal predictive.

The predictive P_{θ̂}^{t−1} is 'optimal' in the following specific sense: if our goal is to achieve good forecasting performance according to the given scoring rule S, all we have to do is optimize the parameters of the predictive according to this rule. Implicitly then, if we have two proper scoring rules S_1 and S_2, by which we produce two different optimal predictives P_{θ̂_1}^{t−1} and P_{θ̂_2}^{t−1}, where θ̂_1 and θ̂_2 denote the optimizers according to S_1 and S_2, it should be the case that, for large enough τ,

(1/τ) Σ_{t=T−τ+1}^{T} S_1(P_{θ̂_1}^{t−1}, y_t) ≥ (1/τ) Σ_{t=T−τ+1}^{T} S_1(P_{θ̂_2}^{t−1}, y_t)   (3)

and

(1/τ) Σ_{t=T−τ+1}^{T} S_2(P_{θ̂_1}^{t−1}, y_t) ≤ (1/τ) Σ_{t=T−τ+1}^{T} S_2(P_{θ̂_2}^{t−1}, y_t).   (4)

That is, coherent results are expected: the predictive that is optimal with respect to S_1 cannot be beaten out-of-sample (as assessed by that score) by a predictive that is optimal according to S_2, and vice-versa. As mentioned earlier, this definition of coherence subsumes the case where G ∈ P and θ̂_1 and θ̂_2 are both consistent for the true (vector) value of θ, θ_0. Hence, in this special case, for τ → ∞, the expressions in (3) and (4) would collapse to equalities.

What is of relevance empirically though, as already highlighted, is the case where G ∉ P. Whether coherence holds in this setting depends on four things: the unknown true model, G; the assumed (but misspecified) model, P_θ; and the two rules, S_1 and S_2, under which we are optimizing to obtain predictive distributions. As we will illustrate with particular examples, it is this collection, {G, P_θ, S_1, S_2}, that determines whether or not the above notion of coherence holds.

We begin this illustration with a series of simulation experiments in Section 3. We first specify a single predictive model P_θ to be, in order: correctly specified; misspecified, but suitably 'compatible' with G to allow strict coherence to prevail; and misspecified in a way in which strict coherence does not hold; where by strict coherence we mean that strict inequalities hold in expressions like (3) and (4). The numerical results are presented in a variety of different ways, in order to shed light on this phenomenon of coherence and help practitioners gain an appreciation of what they should be alert to. As part of this exercise, we make the link between our broadly descriptive analysis and the formal test of equal predictive ability of Giacomini and White (2006), in a manner to be described.
Six different proper scoring rules are entertained, both in the production of the optimal predictions and in the out-of-sample evaluation. We then shift the focus to a linear predictive pool that does not span the true model and is, as a consequence, misspecified; documenting the nature of coherence in this context. In Section 4 the illustration - based on both single models and linear pools - proceeds with empirical returns data.

In this first set of simulation experiments the aim is to produce an optimal predictive distribution for a variable that possesses the stylized features of a financial return. With this in mind, we assume a predictive associated with an autoregressive conditional heteroscedastic model of order 1 (ARCH(1)) for the logarithmic return, y_t,

y_t = φ_1 + σ_t ε_t;  σ_t² = φ_2 + φ_3 (y_{t−1} − φ_1)²;  ε_t ∼ i.i.d. N(0, 1).   (5)

Panels A, B and C of Table 1 then describe both the true data generating process (DGP) and the precise specification of the model in (5) for the three scenarios: correct specification in (i), and two different types of misspecification in (ii) and (iii). As is clear: in scenario (i), the assumed model matches the Gaussian ARCH(1) model that has generated the data; in scenario (ii) - in which a generalized ARCH (GARCH) model with Student t innovations generates y_t - it does not; whilst in scenario (iii) there is not only misspecification of the assumed model, but the marginal mean in that model is held fixed at zero. Thus, in the third case, the predictive of the assumed model is unable to shift location and, hence, to 'move' to accommodate extreme observations, in either tail. The consequences of this, in terms of relative predictive accuracy, are highlighted below.

Table 1: Simulation designs for the single model case. Here, t_ν denotes the Student t distribution with ν degrees of freedom, and a and b denote the fixed (positive) ARCH and GARCH coefficients used in the designs.

| | Panel A: Scenario (i) | Panel B: Scenario (ii) | Panel C: Scenario (iii) |
|---|---|---|---|
| True DGP | y_t = σ_t ε_t | y_t = √((ν−2)/ν) σ_t ε_t | y_t = √((ν−2)/ν) σ_t ε_t |
| | σ_t² = 1 + a y²_{t−1} | σ_t² = 1 + a y²_{t−1} + b σ²_{t−1} | σ_t² = 1 + a y²_{t−1} + b σ²_{t−1} |
| | ε_t ∼ i.i.d. N(0, 1) | ε_t ∼ i.i.d. t_ν | ε_t ∼ i.i.d. t_ν |
| Assumed model | y_t = θ_1 + σ_t ε_t | y_t = θ_1 + σ_t ε_t | y_t = σ_t ε_t |
| | σ_t² = θ_2 + θ_3 (y_{t−1} − θ_1)² | σ_t² = θ_2 + θ_3 (y_{t−1} − θ_1)² | σ_t² = θ_2 + θ_3 y²_{t−1} |
| | ε_t ∼ i.i.d. N(0, 1) | ε_t ∼ i.i.d. N(0, 1) | ε_t ∼ i.i.d. N(0, 1) |

The predictive P_θ^{t−1}, with density p(y_t|F_{t−1}, θ), is associated with the assumed Gaussian ARCH(1) model, where θ = (θ_1, θ_2, θ_3)′. We estimate θ as in (1) using the following three types of scoring rules, where I(y ∈ A) denotes the indicator on the event y ∈ A:

S_LS(P_θ^{t−1}, y_t) = ln p(y_t|F_{t−1}, θ),   (6)

S_CRPS(P_θ^{t−1}, y_t) = −∫_{−∞}^{∞} [P(y|F_{t−1}, θ) − I(y ≥ y_t)]² dy,   (7)

S_CLS(P_θ^{t−1}, y_t) = ln p(y_t|F_{t−1}, θ) I(y_t ∈ A) + [ln ∫_{A^c} p(y|F_{t−1}, θ) dy] I(y_t ∈ A^c),   (8)

where P(·|F_{t−1}, θ) in (7) denotes the predictive cumulative distribution function associated with p(·|F_{t−1}, θ). Use of the log-score (LS) in (6) yields the average log-likelihood function as the criterion in (2) and, under correct specification and appropriate regularity, the asymptotically efficient estimator of θ. The score in (8) is the censored likelihood score (CLS) of Diks et al. (2011). This score rewards predictive accuracy over any region of interest A (A^c denoting the complement of this region). We report results for A defining the lower and upper tails of the predictive distribution, as determined in turn by the 10%, 20%, 80% and 90% percentiles of the empirical distribution of y_t.
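For reference, the three scores in (6)-(8) have simple closed forms for a Gaussian predictive. The sketch below (ours; the paper supplies no code) implements positively-oriented versions using standard SciPy routines, with the Gaussian CRPS closed form as given in Gneiting and Raftery (2007):

```python
# A sketch (under our notation, not the authors' code) of the scores in
# (6)-(8) for a Gaussian predictive N(mu, sigma^2); all are positively
# oriented, so larger values indicate greater predictive accuracy.
import numpy as np
from scipy.stats import norm

def log_score(y, mu, sigma):
    """S_LS in (6): log predictive density at the realized y."""
    return norm.logpdf(y, loc=mu, scale=sigma)

def crps_score(y, mu, sigma):
    """S_CRPS in (7), via the closed form for a Gaussian predictive;
    the leading minus sign makes it positively oriented."""
    z = (y - mu) / sigma
    crps = sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))
    return -crps

def censored_log_score(y, mu, sigma, a, lower_tail=True):
    """S_CLS in (8) with A = (-inf, a] (lower_tail) or A = [a, inf)."""
    if lower_tail:
        in_A = y <= a
        log_mass_Ac = norm.logsf(a, loc=mu, scale=sigma)   # log P(Y > a)
    else:
        in_A = y >= a
        log_mass_Ac = norm.logcdf(a, loc=mu, scale=sigma)  # log P(Y < a)
    return np.where(in_A, norm.logpdf(y, loc=mu, scale=sigma), log_mass_Ac)
```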
The results based on the use of (8) in (2) are labelled hereafter as CLS 10%, CLS 20%, CLS 80% and CLS 90%. The continuously ranked probability score (CRPS) (see Gneiting and Raftery, 2007) is sensitive to distance, and rewards the assignment of high predictive mass near to the realized value of y_t, rather than just at that value, as in the case of the log-score. It can be evaluated in closed form for the (conditionally) Gaussian predictive model assumed under all three scenarios described in Table 1. Similarly, in the case of the CLS in (8), all components, including the integral ∫_{A^c} p(y|F_{t−1}, θ) dy, have closed-form representations for the Gaussian predictive model. Note that all scores are positively-oriented; hence, higher values indicate greater predictive accuracy.

For each of the Monte Carlo designs, we conduct the following steps (a schematic sketch in code is given below the list):

1. Generate T observations of y_t from the true DGP;

2. Use observations t = 1, ..., 1,000 to compute θ̂^[i] as in (1), for S_i, i ∈ {LS, CRPS, CLS 10%, CLS 20%, CLS 80%, CLS 90%};

3. Construct the one-step-ahead predictive P_{θ̂^[i]}^{t−1}, and compute the score, S_j(P_{θ̂^[i]}^{t−1}, y_t), based on the 'observed' value, y_t, using S_j, j ∈ {LS, CRPS, CLS 10%, CLS 20%, CLS 80%, CLS 90%};

4. Expand the estimation sample by one observation and repeat Steps 2 and 3, retaining notation θ̂^[i] for the S_i-based estimator of θ constructed from each expanding sample. Do this τ = T − 1,000 times, and compute:

S̄_j(θ̂^[i]) = (1/τ) Σ_{t=T−τ+1}^{T} S_j(P_{θ̂^[i]}^{t−1}, y_t)   (9)

for each (i, j) combination.
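As flagged above, a schematic sketch of Steps 1-4 follows. It reuses the score functions sketched earlier, implements the Gaussian ARCH(1) recursion in (5) directly, and is purely illustrative: in practice one would impose positivity constraints on (θ_2, θ_3), refit less frequently, and vectorize the loops.

```python
import numpy as np
from scipy.optimize import minimize

def arch_moments(theta, y):
    """One-step-ahead moments of the Gaussian ARCH(1) predictive:
    mu = theta[0]; sigma_t^2 = theta[1] + theta[2] * (y_{t-1} - theta[0])^2.
    Returns (mu, sigma), with sigma aligned with y[1:]."""
    sig2 = theta[1] + theta[2] * (y[:-1] - theta[0]) ** 2
    return theta[0], np.sqrt(np.maximum(sig2, 1e-10))

def fit_theta(y_in, score_fn):
    """Optimal score estimator (1)-(2): maximize the average in-sample score."""
    def neg_avg(theta):
        mu, sigma = arch_moments(theta, y_in)
        return -np.mean(score_fn(y_in[1:], mu, sigma))
    return minimize(neg_avg, x0=np.array([0.0, 1.0, 0.2]),
                    method="Nelder-Mead").x

def score_matrix(y, n_init, score_fns):
    """Entry [i, j] approximates (9): the out-of-sample average of score j
    for the predictive whose parameters are optimal under score i,
    re-estimated on an expanding window."""
    tau = len(y) - n_init
    out = np.zeros((len(score_fns), len(score_fns)))
    for t in range(n_init, len(y)):
        for i, s_i in enumerate(score_fns):
            theta_i = fit_theta(y[:t], s_i)
            mu, sigma = arch_moments(theta_i, y[:t + 1])
            for j, s_j in enumerate(score_fns):
                out[i, j] += s_j(y[t], mu, sigma[-1]) / tau
    return out
```

Tail scores such as CLS 10% can be passed as, for example, `functools.partial(censored_log_score, a=tail_quantile, lower_tail=True)`, with `tail_quantile` an empirical percentile of y_t.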
The results are tabulated and discussed in Section 3.2. In Table 2, results are recorded for both τ = 5,000 and τ = 10,000. The large value of τ = 5,000 is adopted in order to minimize the effect of sampling error on the results. The even larger value of τ = 10,000 is then adopted as a check that τ = 5,000 is sufficiently large to be used in all subsequent experiments. All numbers on the main diagonal correspond to S̄_j(θ̂^[i]) in (9) with i = j. Numbers on the off-diagonal correspond to S̄_j(θ̂^[i]) with i ≠ j. Rows in the table correspond to the ith optimizing criterion (with θ̂^[i] the corresponding 'optimizer'), and columns to the results based on the jth out-of-sample score, S_j. Using the definition of coherence in Section 2.2, and given the correct specification, we would expect any given diagonal element to be equivalent to all values in the column in which it appears, at least up to sampling error. As is clear, in Panel A the results based on τ = 5,000 essentially bear out this expectation, as do those in Panel B, in which τ = 10,000 is used. Given the similarity of the τ = 5,000 and τ = 10,000 results in the correct specification case, we now record results based on τ = 5,000 only.

In Table 3, the degrees of freedom in the Student t innovation of the true DGP moves from being very low (ν = 3 in Panel A) to high (ν = 30 in Panel C), thereby producing a spectrum of misspecification - at least in terms of the distributional form of the innovations - from very severe to less severe; and the results on relative out-of-sample accuracy change accordingly. In Panel A, a strict form of coherence is in evidence: each diagonal value exceeds all other values in its column (and is highlighted in bold accordingly). As ν increases, the diagonal values remain bold, although there is less difference between the numbers in any particular column. Hence, in this case, the advice to a practitioner would certainly be to optimize the score criterion that is relevant. In particular, given the importance of accurate estimation of extreme returns, the edict would indeed be: produce a predictive based on an estimate of θ that is optimal in terms of the relevant CLS-based criterion. Given the chosen predictive model, no other estimate of this model will produce better predictive accuracy in the relevant tail, and this specific estimate may well yield quite markedly superior results to any other choice, depending on the fatness of the tails in the true DGP.

In Table 4, however, the results tell a very different story. In particular, in Panel A - despite ν being very low - with one exception (the optimizer based on CRPS), the predictive based on any given optimizer is never superior out-of-sample according to that same score criterion; i.e. the main diagonal is not uniformly bold. A similar comment applies to the results in Panels B and C. In other words, the assumed predictive model - in which the marginal mean is held fixed - is not flexible enough to allow any particular scoring rule to produce a point estimator that delivers good out-of-sample performance in that rule. For example, the value of θ that optimizes the criterion in (2) based on CLS 10% does not correspond to an estimated predictive that gives a high score to extremely low values of y_t, as the predictive model cannot shift location, and thereby assign high density ordinates to these values. The assumed model is, in this sense, incompatible with the true DGP, which will sometimes produce very low values of y_t.

We now provide further insights into the results in Tables 2-4, including the lack of strict coherence in Table 4, by providing useful approaches for visualizing strict coherence, and its absence. Reiterating: under correct specification of the predictive model, in the limit all predictives optimized according to criteria based on proper scoring rules will yield equivalent predictive performance out-of-sample. In contrast, under misspecification we expect that each score criterion will yield, in principle, a distinct optimizing predictive and, hence, that out-of-sample performance will differ, with the best out-of-sample performance according to a given criterion produced by the predictive that is optimal according to that criterion.
Table 2: Average out-of-sample scores under a correctly specified Gaussian ARCH(1) model (Scenario (i) in Table 1). Panel A (B) reports the average scores based on τ = 5,000 (τ = 10,000) out-of-sample values. The rows in each panel refer to the optimizer used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

Panel A: 5,000 out-of-sample evaluations

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.510** | **-0.624** | **-0.376** | **-0.602** | **-0.593** | **-0.363** |
| CRPS | **-1.510** | **-0.624** | **-0.376** | **-0.602** | **-0.593** | **-0.363** |
| CLS 10% | -1.512 | -0.625 | -0.377 | **-0.602** | -0.595 | -0.365 |
| CLS 20% | -1.514 | -0.625 | -0.377 | **-0.602** | -0.597 | -0.366 |
| CLS 80% | -1.518 | -0.626 | -0.381 | -0.609 | -0.594 | **-0.363** |
| CLS 90% | -1.516 | -0.626 | -0.380 | -0.608 | -0.594 | **-0.363** |

Panel B: 10,000 out-of-sample evaluations

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.510** | **-0.623** | **-0.367** | **-0.598** | **-0.597** | **-0.364** |
| CRPS | **-1.510** | **-0.623** | **-0.367** | **-0.598** | **-0.597** | **-0.364** |
| CLS 10% | -1.511 | -0.624 | **-0.367** | -0.599 | -0.598 | -0.365 |
| CLS 20% | -1.512 | -0.624 | **-0.367** | -0.599 | -0.599 | -0.366 |
| CLS 80% | -1.515 | -0.624 | -0.370 | -0.602 | **-0.597** | **-0.364** |
| CLS 90% | -1.514 | -0.624 | -0.369 | -0.602 | **-0.597** | **-0.364** |
Table 3: Average out-of-sample scores under a misspecified Gaussian ARCH(1) model (Scenario (ii) in Table 1). All results are based on τ = 5,000 out-of-sample values. Panels A, B and C, respectively, report the average scores when the true DGP is GARCH(1,1) with t_{ν=3}, t_{ν=10} and t_{ν=30} errors. The rows in each panel refer to the optimizer used. The columns refer to the out-of-sample measure. The figures in bold are the largest average scores according to a given out-of-sample measure.

Panel A: The true DGP is GARCH(1,1)-t_{ν=3}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-2.335** | -1.248 | -0.568 | -0.873 | -0.892 | -0.574 |
| CRPS | -2.452 | **-1.233** | -0.625 | -0.929 | -0.967 | -0.654 |
| CLS 10% | -2.752 | -2.120 | **-0.520** | -0.843 | -1.311 | -0.960 |
| CLS 20% | -2.472 | -1.519 | -0.528 | **-0.834** | -1.045 | -0.704 |
| CLS 80% | -2.489 | -1.532 | -0.725 | -1.049 | **-0.841** | -0.526 |
| CLS 90% | -2.736 | -2.093 | -0.957 | -1.287 | -0.842 | **-0.513** |

Panel B: The true DGP is GARCH(1,1)-t_{ν=10}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-2.517** | -1.678 | -0.533 | -0.853 | -0.837 | -0.511 |
| CRPS | -2.525 | **-1.677** | -0.538 | -0.857 | -0.841 | -0.516 |
| CLS 10% | -2.563 | -1.754 | **-0.530** | **-0.850** | -0.882 | -0.546 |
| CLS 20% | -2.534 | -1.706 | -0.531 | **-0.850** | -0.854 | -0.523 |
| CLS 80% | -2.532 | -1.704 | -0.543 | -0.868 | **-0.834** | **-0.508** |
| CLS 90% | -2.555 | -1.738 | -0.561 | -0.888 | **-0.834** | **-0.508** |

Panel C: The true DGP is GARCH(1,1)-t_{ν=30}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-2.532** | **-1.727** | -0.525 | -0.824 | -0.817 | -0.501 |
| CRPS | -2.535 | **-1.727** | -0.527 | -0.825 | -0.818 | -0.503 |
| CLS 10% | -2.563 | -1.776 | **-0.524** | -0.824 | -0.846 | -0.525 |
| CLS 20% | -2.538 | -1.733 | -0.525 | **-0.823** | -0.822 | -0.506 |
| CLS 80% | -2.538 | -1.735 | -0.528 | -0.829 | **-0.816** | -0.501 |
| CLS 90% | -2.553 | -1.758 | -0.540 | -0.842 | -0.817 | **-0.500** |

Table 4: Average out-of-sample scores under a misspecified Gaussian ARCH(1) model with a fixed marginal mean (Scenario (iii) in Table 1). All results are based on τ = 5,000 out-of-sample values. Panels A, B and C, respectively, report the average scores when the true DGP is GARCH(1,1) with t_{ν=3}, t_{ν=10} and t_{ν=30} errors. The rows in each panel refer to the optimizer used. The columns refer to the out-of-sample measure. The figures in bold are the largest average scores according to a given out-of-sample measure.

Panel A: The true DGP is GARCH(1,1)-t_{ν=3}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | -2.335 | -1.248 | -0.568 | -0.873 | -0.892 | -0.574 |
| CRPS | -2.451 | **-1.233** | -0.626 | -0.929 | -0.966 | -0.652 |
| CLS 10% | **-2.329** | -1.257 | -0.565 | -0.871 | **-0.883** | **-0.565** |
| CLS 20% | **-2.329** | -1.257 | -0.565 | -0.871 | **-0.883** | **-0.565** |
| CLS 80% | -2.335 | -1.257 | **-0.564** | -0.870 | -0.888 | -0.571 |
| CLS 90% | -2.334 | -1.257 | **-0.564** | **-0.869** | -0.888 | -0.570 |

Panel B: The true DGP is GARCH(1,1)-t_{ν=10}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | -2.517 | -1.678 | -0.533 | -0.852 | **-0.837** | -0.511 |
| CRPS | -2.524 | **-1.677** | -0.538 | -0.857 | -0.841 | -0.515 |
| CLS 10% | -2.519 | -1.680 | -0.534 | -0.853 | -0.838 | -0.511 |
| CLS 20% | -2.519 | -1.679 | -0.534 | -0.853 | -0.838 | -0.511 |
| CLS 80% | **-2.515** | -1.679 | **-0.531** | **-0.851** | **-0.837** | **-0.510** |
| CLS 90% | **-2.515** | -1.679 | **-0.531** | **-0.851** | **-0.837** | **-0.510** |

Panel C: The true DGP is GARCH(1,1)-t_{ν=30}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-2.532** | **-1.727** | **-0.525** | **-0.824** | **-0.817** | **-0.501** |
| CRPS | -2.535 | **-1.727** | -0.527 | -0.825 | -0.818 | -0.502 |
| CLS 10% | -2.535 | -1.728 | -0.526 | **-0.824** | -0.819 | -0.503 |
| CLS 20% | -2.535 | -1.728 | -0.526 | **-0.824** | -0.818 | -0.503 |
| CLS 80% | **-2.532** | -1.728 | **-0.525** | **-0.824** | **-0.817** | **-0.501** |
| CLS 90% | **-2.532** | -1.728 | **-0.525** | **-0.824** | **-0.817** | **-0.501** |

Therefore, a lack of evidence in favour of strict coherence, in the presence of misspecification, implies that the conjunction of the model and scoring rule is unable to produce sufficiently distinct optimizers to, in turn, yield distinct out-of-sample performance.

It is possible to shed light on this phenomenon by considering the limiting behavior of the optimizers for the various scoring rules, across different model specification regimes (reflecting those scenarios given in Table 1). To this end, define

g_t(θ*) = ∂S(P_θ^{t−1}, y_t)/∂θ |_{θ=θ*}  and  h_t(θ*) = ∂²S(P_θ^{t−1}, y_t)/∂θ∂θ′ |_{θ=θ*},

and the limit quantities

J(θ*) = lim_{T→∞} Var[T^{−1/2} Σ_{t=2}^{T} g_t(θ*)]  and  H(θ*) = lim_{T→∞} T^{−1} Σ_{t=2}^{T} E[h_t(θ*)],

where θ* denotes the maximum of the limiting criterion function to which S̄(θ) in (2) converges as T diverges. Under regularity, the following limiting distribution is in evidence:

√T (θ̂ − θ*) →_d N(0, V*), where V* = H^{−1}(θ*) J(θ*) H^{−1}(θ*).   (10)

Under correct specification, and for criteria defined by proper scoring rules, we expect that θ* = θ_0 for all versions of S̄(θ). Given the efficiency of the maximum likelihood estimator in this scenario, we would expect that the sampling distribution of the optimizer associated with the log-score would be more tightly concentrated around θ* than optimizers associated with the other rules. However, since all optimizers would be concentrating towards the same value, this difference would abate and ultimately lead to scoring performances that are quite similar; i.e., a form of strict coherence would not be in evidence, as is consistent with the results in Table 2.

In contrast, under misspecification we expect that θ* ≠ θ_0, with different optimizers consistent for different values of θ*. While the sampling distributions of the different optimizers may differ substantially from each other, thereby leading to a form of strict coherence as in Table 3, this is not guaranteed to occur. Indeed, it remains entirely possible that the resulting optimizers, while distinct, have sampling distributions that are quite similar, even for very large values of T. (This could occur for two non-exclusive reasons: (1) the variances V* in (10) are large; (2) the different limiting optimized values are very similar. In either case, the sampling distributions that result from this optimization procedure are likely to be very similar.) In this case, the sampling distribution of the out-of-sample 'optimized' jth scoring rule S̄_j(θ̂^[i]), evaluated at the ith optimizer, will not vary significantly with i, and strict coherence will likely not be in evidence, even for large sample sizes, even though the model is misspecified (and the limit optimizers unique).
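The asymptotic variance V* in (10) underpins the graphical analysis that follows. A minimal sketch (ours, not the authors' code) of the standard sandwich estimator, using finite-difference derivatives of the per-observation score, and a naive estimator of J (a HAC estimator would be substituted under serial dependence in the score gradients):

```python
import numpy as np

def _e(k, d, eps=1e-5):
    """eps times the k-th standard basis vector in R^d."""
    v = np.zeros(d)
    v[k] = eps
    return v

def num_grad(f, x, eps=1e-5):
    """Two-sided finite-difference gradient of a scalar-valued f."""
    g = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        g[k] = (f(x + _e(k, x.size, eps)) - f(x - _e(k, x.size, eps))) / (2.0 * eps)
    return g

def sandwich_V(theta_hat, score_path):
    """score_path(theta) -> length-T vector of scores S(P_theta^{t-1}, y_t).
    Returns the estimate of V* = H^{-1} J H^{-1} in (10)."""
    d = theta_hat.size
    # g_t(theta_hat): finite-difference gradient of each per-observation score
    G = np.column_stack([
        (score_path(theta_hat + _e(k, d)) - score_path(theta_hat - _e(k, d))) / 2e-5
        for k in range(d)])
    J = np.cov(G, rowvar=False, bias=True)     # naive estimator of J(theta*)
    # H: Hessian of the average score, by differencing its gradient
    H = np.column_stack([
        (num_grad(lambda th: score_path(th).mean(), theta_hat + _e(k, d, 1e-4))
         - num_grad(lambda th: score_path(th).mean(), theta_hat - _e(k, d, 1e-4))) / 2e-4
        for k in range(d)])
    Hinv = np.linalg.inv(H)
    return Hinv @ J @ Hinv
```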
This behavior can be illustrated graphically by simulating and analyzing (an approximation to) the sampling distribution of S̄_j(θ̂^[i]). We begin by generating T = 10,000 observations from the three 'true' DGPs in Table 1, and producing predictions from the corresponding assumed predictive in each of the three scenarios, (i) to (iii). Using the simulated observations, and for each scenario, we compute θ̂^[i] in (1) by maximizing S̄_i(θ) := (1/(T−1)) Σ_{t=2}^{T} S_i(P_θ^{t−1}, y_t), for S_i, i ∈ {LS, CLS 10%, CLS 20%, CLS 80%, CLS 90%}. Coherence can then be visualized by constructing and analyzing the density of s_ij = S̄_j(θ̂^[i]), for i, j ∈ {LS, CLS 10%, CLS 20%, CLS 80%, CLS 90%}, denoted here as f(s_ij). That is, we are interested in the density of the jth sample score criterion evaluated at the ith optimizer, where f(s_jj) denotes the density of the jth score evaluated at its own optimizer. To approximate this density we first simulate {θ̂^[i]_m}_{m=1}^{M} from the corresponding sampling distribution of θ̂^[i]: N(θ̂^[i], V̂*/T), where V̂* is the usual finite-sample estimator of V* in (10) (namely, the sample estimator of V* with θ* replaced by θ̂^[i], for the ith rule). Given the simulated draws {θ̂^[i]_m}_{m=1}^{M}, we then compute s_ij,m = S̄_j(θ̂^[i]_m) for m = 1, . . . , M. Under coherence, we do not expect any (estimated) density, f̂(s_ij), for i ≠ j, to be located to the right of the (estimated) score-specific density, f̂(s_jj), as, with positively-oriented scores, this would reflect an inferior performance of the optimal predictive. Under strict coherence, we expect f̂(s_jj) to lie to the right of all other densities, and for there to be little overlap in probability mass between f̂(s_jj) and any other density.

The results are given in Figure 1. In the name of brevity, we focus on Panels B and C of Figure 1, which correspond respectively to Panels B and C of Table 1. Each sub-panel in these two panels plots f̂(s_ij) for i ∈ {LS, CLS 10%, CLS 20%, CLS 80%, CLS 90%} (as indicated in the key), and j ∈ {LS, CLS 10%, CLS 20%} (as indicated in the sub-panel heading). The results in Panels B.1 to B.3 correspond to Scenario (ii) in Panel B of Table 1. In this case, the impact of the misspecification is stark. The score-specific (i = j) density in each case is far to the right of the densities based on the other optimizers, and markedly more concentrated. In Panels B.2 and B.3 we see that optimizing according to some sort of left-tail criterion, even if not that which matches the criterion used to measure out-of-sample performance, produces densities that are further to the right than those based on the log-score optimizer. (In these two sub-panels, the densities produced using optimizers focussed on the tails that are opposite to those of predictive interest, i.e. θ̂^[i], i ∈ {CLS 80%, CLS 90%}, are omitted, due to their being centred so far to the left of the other densities, and being so dispersed, as to distort the figures.) In contrast, we note in Panel B.1 that when the log-score itself is the out-of-sample criterion of interest, it is preferable to use an optimizer that focuses on a larger part of the support (either θ̂^[i], i ∈ {CLS 20%}, or θ̂^[i], i ∈ {CLS 80%}), rather than one that focuses on the more extreme tails. Moreover, due to the symmetry of both the true DGP and the assumed model, it makes no difference (in terms of performance in log-score) which tail optimizer (upper or lower) is used.

Panels C.1 to C.3 correspond to Scenario (iii) in Panel C of Table 1, with ν = 3 for the true DGP, and with predictions produced via the misspecified Gaussian ARCH(1) model with the marginal mean fixed at zero.
The assumed model thus has no flexibility to shift location, a feature that clearly limits the ability of the estimated predictive to assign higher weight to the relevant part of the support when the realized out-of-sample value demands it. As a consequence, there is no measurable gain in using an optimizer that fits with the out-of-sample measure. These observations are all consistent with the distinct similarity of all scores (within a column) in Columns 1, 3 and 4 of Panel A in Table 4. In short: strict coherence is not in evidence, despite the misspecification of the predictive model. Just one simple, seemingly innocuous, change in specification has been sufficient to eradicate the benefit of seeking an optimal predictor. This suggests that even minor specification choices can determine whether or not optimization according to a given score reaps benefits out-of-sample.

The distinction between coherence and strict coherence can be couched in terms of the distinction between the null hypothesis that two predictives - one 'optimal' and one not - have equal expected performance, and the alternative hypothesis that the optimal predictive has superior expected performance. The test of equal predictive ability of (any) two predictives was a focus of Giacomini and White (2006) (GW hereafter; see also related references: Diebold and Mariano, 1995, Hansen, 2005, and Corradi and Swanson, 2006); hence, accessing the asymptotic distribution of their test statistic enables us to shed some light on coherence. Specifically, what we do is solve the GW test decision rule for the (out-of-sample) sample size required to yield strict coherence, under misspecification. This enables us to gauge how large the sample size must be to differentiate between an optimal and a non-optimal prediction, in any particular misspecified scenario. In terms of the illustration in the previous section, this is equivalent to gauging how large the sample size needs to be to enable the relevant score-specific density in each figure in Panels B and C of Figure 1 to lie to the right of the others.

For i ≠ j, define

Δ_t^{ji} = S_j(p(y_t|F_{t−1}, θ̂^[j]), y_t) − S_j(p(y_t|F_{t−1}, θ̂^[i]), y_t)  and  Δ̄_τ^{ji} = (1/τ) Σ_{t=T−τ+1}^{T} Δ_t^{ji},

where the subscript τ is used to make explicit the number of out-of-sample evaluations used to compute the difference in the two average scores. The test of equal predictive ability is a test of H_0 : E[Δ_t^{ji}|F_{t−1}] = 0 versus H_1 : E[Δ_t^{ji}|F_{t−1}] ≠ 0. Following GW, under H_0,

Z_τ = τ (Δ̄_τ^{ji})² / var_τ(Δ_t^{ji}) →_d χ²_1,

where var_τ(Δ_t^{ji}) denotes the sample variance of Δ_t^{ji} computed over the evaluation period of size τ. Hence, at the α × 100% level of significance, and for given values of (Δ̄_τ^{ji})² and var_τ(Δ_t^{ji}), the null will be rejected whenever

τ > χ²_1(1 − α) × var_τ(Δ_t^{ji}) / (Δ̄_τ^{ji})² ≡ τ*,   (11)

where χ²_1(1 − α) denotes the relevant critical value of the limiting χ²_1 distribution of the test statistic.

The right-hand side of the inequality in (11), from now on denoted by τ*, indicates the minimum number of out-of-sample evaluations associated with detection of a significant difference between S̄_j(θ̂^[j]) and S̄_j(θ̂^[i]).
For the purpose of this exercise, if Δ̄_τ^{ji} < 0, we set τ* = τ, as no value of τ* will induce rejection of the null hypothesis in favour of strict coherence, which is the outcome we are interested in. The value of τ* thus depends, for any given α, on the relative magnitudes of the sample quantities, var_τ(Δ_t^{ji}) and (Δ̄_τ^{ji})². At a heuristic level, if (Δ̄_τ^{ji})² and var_τ(Δ_t^{ji}) converge in probability to constants c_1 and c_2, at rates that are some function of τ, then we are interested in plotting τ* as a function of τ, and discerning when (if) τ* begins to stabilize at a particular value. It is this value that then serves as a measure of the 'ease' with which strict coherence is in evidence in any particular example.
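In code, τ* is a one-line consequence of (11). A minimal sketch (ours), taking the vector of score differentials Δ_t^{ji} over the evaluation period and applying the conventions just described for non-positive mean differentials and for τ* > τ:

```python
import numpy as np
from scipy.stats import chi2

def tau_star(delta, alpha=0.05):
    """Minimum number of out-of-sample evaluations in (11); delta holds
    the per-period differentials Delta_t^{ji}. Returns tau itself when
    the mean differential is non-positive, or when tau* exceeds tau."""
    tau = delta.size
    dbar = delta.mean()
    if dbar <= 0:
        return tau                                    # no rejection possible
    t_star = chi2.ppf(1.0 - alpha, df=1) * delta.var(ddof=1) / dbar ** 2
    return min(int(np.ceil(t_star)), tau)
```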
Figure 1: Plots of the approximate density functions for s_ij = S̄_j(θ̂^[i]). Panels A to C plot the density functions for j ∈ {LS, CLS 10%, CLS 20%}, evaluated at θ̂^[i] for i ∈ {LS, CLS 10%, CLS 20%, CLS 80%, CLS 90%}. Panel A corresponds to the case of correct specification (Scenario (i) in Table 1). Panel B corresponds to misspecification Scenario (ii) in Table 1. Panel C corresponds to misspecification Scenario (iii) in Table 1. Each sub-panel plots f̂(s_ij) for i ∈ {LS, CLS 10%, CLS 20%, CLS 80%, CLS 90%} (as indicated in the key), and j ∈ {LS, CLS 10%, CLS 20%} (as indicated in the sub-panel heading). The reason for the omission of results for i ∈ {CLS 80%, CLS 90%} in sub-panels B.2 and B.3 is given in the text.
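A sketch (ours) of the density approximation just described, assuming the optimizer θ̂^[i], the estimate V̂* and a callable evaluating the jth sample criterion S̄_j(·) are available:

```python
import numpy as np
from scipy.stats import gaussian_kde

def score_density(theta_hat_i, V_hat, T, avg_score_j, M=1000, seed=0):
    """Approximate f(s_ij): draw M optimizers from N(theta_hat_i, V_hat/T),
    evaluate the j-th sample score criterion at each draw, and return a
    kernel density estimate together with the draws."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(theta_hat_i, V_hat / T, size=M)
    s = np.array([avg_score_j(th) for th in draws])
    return gaussian_kde(s), s
```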
In Figures 2 to 4 we plot τ* as a function of τ, for τ = 1, 2, . . . , 5,000, and α = 0.05, for the misspecification scenarios (ii) and (iii) in Table 1. In all figures, the diagonal panels simply plot a 45° line, as these plots correspond to the case where j = i and Δ̄_τ^{ji} = 0 by construction. Again, for the purpose of the exercise, if Δ̄_τ^{ji} < 0, we set τ* = τ, as no value of τ* will induce support of strict coherence. Moreover, whenever Δ̄_τ^{ji} > 0, but τ* > τ, we also set τ* = τ. This allows us to avoid arbitrarily large values of τ* that cannot be easily visualized. These latter two cases are thus also associated with 45° lines. Figures 2 and 3 report results for Scenario (ii) with ν = 3 and ν = 30 respectively, whilst Figure 4 presents the results for Scenario (iii) with ν = 3. In each figure, sub-panels A.1 to A.3 record results for j ∈ {LS}, and i ∈ {LS, CLS 10% and CLS 90%}. Sub-panels B.1 to B.3 record the corresponding results for j ∈ {CLS 10%}, while sub-panels C.1 to C.3 record the results for j ∈ {CLS 90%}.

First consider sub-panels B.3 and C.2 in Figure 2. Once τ is sufficiently large, τ* stabilizes at a value that is approximately 20 in both cases. Viewing this value of τ* as 'small', we conclude that it is 'easy' to discern the strict coherence of an upper tail optimizer relative to its lower tail counterpart, and vice versa, under this form of misspecification. In contrast, Panels A.2 and A.3 indicate that whilst strict coherence of the log-score optimizer is eventually discernible, the value at which τ* settles is larger (between about 100 and 200) than when the distinction is to be drawn between the two distinct tail optimizers. Panels B.1 and C.1 show that it takes an even larger number of out-of-sample observations (τ* exceeding 1,000) to detect the strict coherence of a tail optimizer relative to the log-score optimizer; indeed, the value of τ* required to detect strict coherence relative to the log-score in the case of CLS 90% has not settled to a finite value even by τ = 5,000.

A comparison of Figures 2 and 3 highlights the effect of a reduction in misspecification. In each off-diagonal sub-panel in Figure 3, the value of τ* is markedly higher (i.e. more observations are required to detect strict coherence) than in the corresponding sub-panel in Figure 2. Indeed, Panel C.1 in Figure 3 indicates that strict coherence in this particular case is, to all intents and purposes, unable to be discerned in any reasonable number of out-of-sample observations. The dissimilarity of the true DGP from the assumed model is simply not marked enough for the optimal version of the CLS 90% score to reap accuracy benefits relative to the version of this score based on the log-score optimizer. This particular scenario highlights the fact that, even if attainable, the pursuit of coherence may not always be a practical endeavour. For example, if the desired scoring rule is more computationally costly to evaluate than, say, the log-score, then the small improvement in predictive accuracy yielded by optimal prediction may not justify the added computational burden, in particular for real-time forecasting exercises.

Finally, even more startling are the results in Figure 4, which we have termed the 'incompatible' case. For all out-of-sample scores considered, and all pairs of optimizers, a diagonal line, τ* = τ, results, as either τ* exceeds τ (and, hence, τ* is set to τ) for all values of τ, or Δ̄_τ^{ji} < 0, in which case τ* is also set to τ. Due to the incompatibility of the assumed model with the true DGP, strict coherence simply does not prevail in any sense.
A common method of producing density forecasts from diverse models is to consider the 'optimal' combination of forecast (or predictive) densities defined by a linear pool. Consider the setting where we entertain several possible models M_k, k = 1, ..., n, all based on the same information set, and with associated predictive distributions,

m_k(y_t|F_{t−1}) := p(y_t|F_{t−1}, θ_k, M_k), k = 1, ..., n,   (12)

where the dependence of the kth model on a d_k-dimensional set of unknown parameters, θ_k = (θ_{k,1}, θ_{k,2}, ..., θ_{k,d_k})′, is captured in the short-hand notation, m_k(·|·), and the manner in which θ_k is estimated is addressed below. The goal is to determine how to combine the n predictives in (12) to produce an accurate forecast, in accordance with some measure of predictive accuracy. As highlighted in the Introduction, we do not assume that the true DGP coincides with any one of the constituent models in the model set.

Herein, we follow McConway (1981), and focus on the class of linear combination processes only; i.e., the class of 'linear pools' (see also Genest, 1984, and Geweke and Amisano, 2011):

P := { p(y_t|F_{t−1}, w) := Σ_{k=1}^{n} w_k m_k(y_t|F_{t−1}); Σ_{k=1}^{n} w_k = 1; and w_k ≥ 0, k = 1, ..., n }.   (13)

Following the notion of optimal predictive estimation, and building on the established literature cited earlier, we produce optimal weight estimates
ŵ := arg max_{w∈Δ_n} S̄(w), where Δ_n := { w ∈ [0, 1]^n : Σ_{k=1}^{n} w_k = 1, w_k ≥ 0, k = 1, ..., n },   (14)

where S̄(w) is a sample average of the chosen scoring rule, evaluated at the predictive distribution with density p(y_t|F_{t−1}, w), over a set of values defined below. The estimator ŵ is referred to as the optimal score estimator (of w) and the density p(y_t|F_{t−1}, ŵ) as the optimal linear pool. The same set of scoring rules as described in Section 3.1 are adopted herein.

We simulate observations of y_t from an autoregressive moving average model of order (1,1) (ARMA(1,1)),

y_t = φ_0 + φ_1 y_{t−1} + φ_2 ε_{t−1} + ε_t,   (15)

where φ_0 = 0, φ_1 = 0.95, and φ_2 is set to a fixed negative value. We employ five different distributional assumptions for ε_t: ε_t ∼ i.i.d. N(0, 1); ε_t ∼ i.i.d. t_ν, with ν = (5, 10, 30); and ε_t ∼ i.i.d. [p N(µ_1, σ_1²) + (1 − p) N(µ_2, σ_2²)]. In the case of the mixture, the component means and standard deviations (with σ_2 = 1.43) and the mixing weight p are set such that E(ε_t) = 0 and Var(ε_t) = 1, with this setting inducing a negative skewness of -1.58. In constructing the model pool, we consider three constituent models:

M_1 : y_t ∼ i.i.d. N(θ_{1,1}, θ_{1,2})   (16)

M_2 : y_t = θ_{2,1} + θ_{2,2} y_{t−1} + η_t, with η_t ∼ i.i.d. N(0, θ_{2,3})   (17)

M_3 : y_t = θ_{3,1} + θ_{3,2} η_{t−1} + η_t, with η_t ∼ i.i.d. N(0, θ_{3,3}).   (18)
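A sketch (ours) of the optimization in (14) under the log score: given a ζ × n matrix D whose (t, k) element is the constituent predictive density m̂_k(y_t|F_{t−1}) evaluated at the realized y_t, a softmax reparameterization keeps the weights on the unit simplex Δ_n; other criteria, such as the CLS, simply replace the objective.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_pool_weights(D):
    """Optimal linear pool weights (14) under the log score.
    D: (zeta x n) array of constituent predictive densities at realized y_t."""
    n = D.shape[1]
    def neg_avg_log_score(eta):
        w = np.exp(eta - eta.max())
        w /= w.sum()                     # softmax: w stays in the simplex
        return -np.mean(np.log(D @ w + 1e-300))
    eta_hat = minimize(neg_avg_log_score, x0=np.zeros(n),
                       method="Nelder-Mead").x
    w_hat = np.exp(eta_hat - eta_hat.max())
    return w_hat / w_hat.sum()
```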
Figure 2: Required number of out-of-sample evaluations to reject H_0 : E[Δ_t^{ji}|F_{t−1}] = 0 in favour of strict coherence: misspecification Scenario (ii) in Table 1, where the true DGP is the GARCH(1,1)-t_{ν=3} model.
Figure 3: Required number of out-of-sample evaluations to reject H_0 : E[Δ_t^{ji}|F_{t−1}] = 0 in favour of strict coherence: misspecification Scenario (ii) in Table 1, where the true DGP is the GARCH(1,1)-t_{ν=30} model.
Figure 4: Required number of out-of-sample evaluations to reject H_0 : E[Δ_t^{ji}|F_{t−1}] = 0 in favour of strict coherence: misspecification Scenario (iii) in Table 1, where the true DGP is the GARCH(1,1)-t_{ν=3} model.

All designs thus correspond to some degree of misspecification, with less misspecification occurring when the true error term is either normal or Student t with a large value for ν. Use of a skewed error term in (15) arguably produces the most extreme case of misspecification and, hence, is the case where we would expect strict coherence to be most evident. (Whilst there is no one model that corresponds to the true DGP in (15), an appropriately weighted sum of the three predictives would be able to reproduce certain key features of the true predictive, such as the autocorrelation structure, at least if the parameters in each constituent model were set to appropriate values.)

For each design scenario, we take the following steps:

1. Generate T observations of y_t from the true DGP;

2. Use observations t = 1, ..., J, where J = 1,000, to compute θ̂_i as in (1) for each model m_k, for S_i, i ∈ {LS, CRPS, CLS 10%, CLS 20%, CLS 80% and CLS 90%};
3. For each k = 1, 2, 3, construct the one-step-ahead predictive density m̂_k(y_t|F_{t−1}) = p(y_t|F_{t−1}, θ̂_i, M_k), for t = J + 1, ..., J + ζ, and compute ŵ = (ŵ_1, ŵ_2, ŵ_3)′ based on these ζ = 50 sets of predictive densities as in (14), with S̄(w) := (1/ζ) Σ_{t=J+1}^{J+ζ} S(P_{θ̂,w}^{t−1}, y_t), where θ̂ = (θ̂_1, θ̂_2, θ̂_3)′ and P_{θ̂,w}^{t−1} is the predictive distribution associated with the density p(y_t|F_{t−1}, w) in (13).

4. Use ŵ to obtain the pooled predictive density for time point t = J + ζ + 1, p(y_t|F_{t−1}, ŵ) = Σ_{k=1}^{n=3} ŵ_k m̂_k(y_t|F_{t−1}).
5. Roll the estimation sample forward by one observation and repeat Steps 2 to 4, using the (non-subscripted) notation θ̂ = (θ̂_1, θ̂_2, θ̂_3)′ for the estimator of θ = (θ_1, θ_2, θ_3)′ and ŵ for the estimator of w based on each rolling sample of size J + ζ. Produce τ = T − (J + ζ) pooled predictive densities, and compute:

S̄(θ̂, ŵ) = (1/τ) Σ_{t=T−τ+1}^{T} S(P_{θ̂,ŵ}^{t−1}, y_t).   (19)

The results are tabulated and discussed in Section 3.6. To keep the notation manageable, we have not made explicit the fact that θ̂ and ŵ are produced by a given choice of score criterion, which may or may not match the score used to construct (19). The notation P_{θ̂,ŵ}^{t−1} refers to the predictive distribution associated with the density p(y_t|F_{t−1}, ŵ).

3.6 Linear Pool Case: Simulation Results

With reference to the results in Table 5, our expectations are borne out to a large extent. The average out-of-sample scores in Panel B pertain to arguably the most misspecified case, with the mixture of normals inducing skewness in the true DGP, a feature that is not captured by any of the components of the predictive pool. Whilst not uniformly indicative of strict coherence, the results for this case are close to being so. In particular, the optimal pools based on the CLS 20%, CLS 80% and CLS 90% criteria always beat everything else out of sample, according to each of those same measures (i.e. the bold values appear on the diagonal in the last three columns in Panel B). To two decimal places, the bold value also appears on the diagonal in the column for the out-of-sample average of CLS 10%. Thus, the degree of misspecification of the model pool is sufficient to enable strict coherence to be in evidence - most notably when it comes to accurate prediction in the tails. It can also be seen that log-score optimization reaps benefits out-of-sample in terms of the log-score measure itself; only the CRPS optimizer does not out-perform all others out-of-sample, in terms of the CRPS measure.

In contrast to the results in Panel B, those in Panel A (for the normal error in the true DGP) are much more reminiscent of the 'correct specification' results in Table 2, in that all numbers within a column are very similar, one to the other, and there is no marked diagonal pattern. Interestingly however, given the earlier comments in the single model context regarding the impact of the efficiency of the log-score optimizer under correct specification, we note that the log-score optimizer yields the largest out-of-sample averages according to all measures in Panel A.

This superiority of the log-score optimizer continues to be in evidence in all three panels in Table 6, in which the degrees of freedom in the error term in the true DGP is successively increased, across the panels. Moreover, there is arguably no more uniformity within columns in Panel C of this table (in which the t_{ν=30} errors are a better match to the Gaussian errors assumed in each component model in the pool), than there is in Panel A.
Clearly, the use of the model pool is sufficient to pick up any degree of fatness in the tails in the true DGP, so that no one design scenario is any further from (or closer to) 'correct specification' than the other. Hence, what we observe in this table is simply a lack of strict coherence - i.e. the degree of misspecification is not marked enough for score-specific optimizers to reap benefits out-of-sample, and there is a good deal of similarity in the performance of all optimizers, in any particular setting. Reiterating the opening comment in this paragraph, in these settings of 'near' to correct specification, the efficiency of the log-score optimizer seems to be in evidence. It is, in these cases, the only optimizer that one needs to entertain, no matter what the specific performance metric of interest!

Table 5: Average out-of-sample scores under two different specifications for the true innovation, ε_t, in (15). Panel A (B) reports the average scores based on ε_t ∼ i.i.d. N(0, 1) (ε_t ∼ i.i.d. mixture of normals). The rows in each panel refer to the optimizer used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure. All results are based on τ = 5,000 out-of-sample values.

Panel A: ε_t ∼ i.i.d. N(0, 1)
| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.493** | **-0.500** | **-0.250** | **-0.451** | **-0.447** | **-0.251** |
| CRPS | -1.512 | -0.529 | -0.255 | -0.461 | -0.452 | -0.253 |
| CLS 10% | -1.709 | -0.570 | -0.254 | -0.465 | -0.608 | -0.376 |
| CLS 20% | -1.514 | -0.507 | -0.252 | -0.455 | -0.460 | -0.261 |
| CLS 80% | -1.532 | -0.518 | -0.271 | -0.478 | -0.451 | -0.253 |
| CLS 90% | -1.648 | -0.551 | -0.318 | -0.555 | -0.459 | -0.257 |

Panel B: ε_t ∼ i.i.d. mixture of normals

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.479** | **-0.472** | -0.313 | -0.522 | -0.374 | -0.207 |
| CRPS | -1.502 | -0.528 | -0.348 | -0.563 | -0.363 | -0.198 |
| CLS 10% | -1.703 | -0.585 | -0.300 | -0.525 | -0.529 | -0.319 |
| CLS 20% | -1.605 | -0.519 | **-0.297** | **-0.511** | -0.466 | -0.275 |
| CLS 80% | -1.772 | -0.494 | -0.557 | -0.824 | **-0.347** | **-0.191** |
| CLS 90% | -2.319 | -0.580 | -0.863 | -1.246 | -0.358 | **-0.191** |
Table 6: Average out-of-sample scores under three different specifications for the true innovation, ε_t, in (15). Panel A (B; C) reports the average scores based on ε_t ∼ i.i.d. t_{ν=5} (ε_t ∼ i.i.d. t_{ν=10}; ε_t ∼ i.i.d. t_{ν=30}). The rows in each panel refer to the optimizer used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure. All results are based on τ = 5,000 out-of-sample values.

Panel A: ε_t ∼ i.i.d. t_{ν=5}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.756** | **-0.630** | **-0.311** | **-0.522** | **-0.508** | **-0.298** |
| CRPS | -1.782 | -0.676 | -0.317 | -0.534 | -0.515 | -0.304 |
| CLS 10% | -1.881 | -0.706 | -0.315 | -0.540 | -0.591 | -0.358 |
| CLS 20% | -1.805 | -0.656 | -0.312 | -0.528 | -0.545 | -0.326 |
| CLS 80% | -1.810 | -0.665 | -0.339 | -0.560 | -0.514 | -0.300 |
| CLS 90% | -1.909 | -0.732 | -0.375 | -0.617 | -0.532 | -0.306 |

Panel B: ε_t ∼ i.i.d. t_{ν=10}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.611** | **-0.557** | **-0.263** | **-0.478** | **-0.475** | **-0.274** |
| CRPS | -1.622 | -0.588 | -0.266 | -0.483 | -0.478 | -0.276 |
| CLS 10% | -1.822 | -0.627 | -0.269 | -0.489 | -0.653 | -0.426 |
| CLS 20% | -1.674 | -0.582 | -0.264 | -0.481 | -0.529 | -0.320 |
| CLS 80% | -1.659 | -0.582 | -0.286 | -0.513 | -0.480 | -0.275 |
| CLS 90% | -1.757 | -0.634 | -0.335 | -0.583 | -0.489 | -0.278 |

Panel C: ε_t ∼ i.i.d. t_{ν=30}

| Optimizer | LS | CRPS | CLS 10% | CLS 20% | CLS 80% | CLS 90% |
|---|---|---|---|---|---|---|
| LS | **-1.532** | **-0.517** | **-0.260** | **-0.473** | **-0.450** | **-0.255** |
| CRPS | -1.553 | -0.547 | -0.265 | -0.483 | -0.457 | -0.258 |
| CLS 10% | -1.909 | -0.619 | -0.264 | -0.487 | -0.747 | -0.484 |
| CLS 20% | -1.559 | -0.530 | -0.262 | -0.477 | -0.470 | -0.271 |
| CLS 80% | -1.572 | -0.538 | -0.285 | -0.505 | -0.453 | -0.256 |
| CLS 90% | -1.748 | -0.588 | -0.368 | -0.623 | -0.463 | -0.261 |

| Index | Min | Max | Mean | Median | St.Dev | Range | Skewness | Kurtosis | JB stat | LB stat (10) |
|---|---|---|---|---|---|---|---|---|---|---|
| S&P500 | -12.765 | 10.957 | 0.014 | 0.054 | 1.255 | 23.722 | -0.364 | 14.200 | 26821 | 5430 |
| MSCIEM | -9.995 | 10.073 | 0.011 | 0.073 | 1.190 | 20.068 | -0.549 | 11.163 | 14791 | 4959 |
Summary statistics. ‘JB stat’ is the test statistic for the Jarque-Bera test of normality, with a critical valueof 5.99. ‘LB stat’ is the test statistic for the Ljung-Box test of serial correlation in the squared returns; the criticalvalue based on a lag length of 10 is 18.31. ‘Skewness’ is the Pearson measure of sample skewness, and ‘Kurtosis’ asample measure of excess kurtosis. The labels ‘Min’ and ‘Max’ refer to the smallest and largest value, respectively,while ‘Range’ is the difference between these two. The remaining statistics have the obvious interpretations.
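The diagnostics reported in Table 7 are standard and can be reproduced with off-the-shelf routines. The following sketch (assuming pandas, scipy and statsmodels are available, with `returns` a series of daily percentage returns) indicates one way of computing them; it is illustrative only.

```python
# Sketch of the Table 7 diagnostics using standard library routines.
import pandas as pd
from scipy import stats
from statsmodels.stats.diagnostic import acorr_ljungbox

def summary_stats(returns: pd.Series) -> dict:
    jb = stats.jarque_bera(returns)                  # normality of returns
    lb = acorr_ljungbox(returns**2, lags=[10])       # serial corr. in squares
    return {
        "Min": returns.min(), "Max": returns.max(),
        "Mean": returns.mean(), "Median": returns.median(),
        "St.Dev": returns.std(),
        "Range": returns.max() - returns.min(),
        "Skewness": stats.skew(returns),             # Pearson sample skewness
        "Kurtosis": stats.kurtosis(returns),         # excess kurtosis
        "JB stat": jb.statistic,
        "LB stat (10)": float(lb["lb_stat"].iloc[0]),
    }
```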
4 Empirical Illustration

We now illustrate the performance of optimal prediction in a realistic empirical setting. We return to the earlier example of financial returns, but with a range of increasingly sophisticated models used to capture the features of the observed data. Both single models and a linear pool are entertained. We consider returns on two indexes: the S&P500 and the MSCI Emerging Markets index (MSCIEM). The data for both series extend from January 3rd, 2000 to May 7th, 2020. All returns are continuously compounded, in daily percentage units. For each time series, we reserve the first 1,500 observations for the initial parameter estimation, and conduct the predictive evaluation exercise for the period between March 16th, 2006 and May 7th, 2020, with the predictive evaluation period covering both the global financial crisis (GFC) and the recent downturn caused by the COVID-19 pandemic.

As is consistent with the typical features exhibited by financial returns, the descriptive statistics reported in Table 7 provide evidence of time-varying and autocorrelated volatility (significant serial correlation in the squared returns) and marginal non-Gaussianity (significant non-normality in the level of returns) in both series, with evidence of slightly more negative skewness in the MSCIEM series.

Treatment of the single predictive models proceeds following the steps outlined in Section 3.1, whilst the steps outlined in Section 3.5 are adopted for the linear predictive pool. However, due to the computational burden associated with the more complex models employed in this empirical setting, we update the model parameter estimates every 50 observations only. The predictive distributions are still updated daily with new data, with the model pool weights also updated daily using the window size ζ = 50. In the case of the S&P500 index, the out-of-sample predictive assessment is based on τ = 3,560 observations, while for the MSCIEM index, the out-of-sample period comprises τ = 3,683 observations.

For both series, we employ three candidate predictive models of increasing complexity:

i) a naïve Gaussian white noise model:
M1: y_t ~ i.i.d. N(θ_{1,1}, θ_{1,2});

ii) a GARCH(1,1) model with Gaussian innovations:
M2: y_t = θ_{2,1} + σ_t ε_t;  σ_t² = θ_{2,2} + θ_{2,3}(y_{t−1} − θ_{2,1})² + θ_{2,4} σ²_{t−1};  ε_t ~ i.i.d. N(0,1);

iii) a stochastic volatility model with price jumps (SVJ):
M3: y_t = θ_{3,1} + exp(h_t/2) ε_t + ΔN_t Z_t^p;  h_t = θ_{3,2} + θ_{3,3} h_{t−1} + θ_{3,4} η_t;  (ε_t, η_t)′ ~ i.i.d. N(0, I_{2×2});  Pr(ΔN_t = 1) = θ_{3,5};  Z_t^p ~ i.i.d. N(θ_{3,6}, θ_{3,7}).

The first model is obviously inappropriate for financial returns, but is included to capture misspecification and, potentially, incompatibility. Both M2 and M3 account for the stylized feature of time-varying and autocorrelated return volatility, but M3 also captures the random price jumps that are observed in practice, and is the only model of the three that can account for skewness in the predictive distribution. The linear predictive pool is constructed from all three models, M1, M2 and M3.
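To illustrate how a one-step-ahead predictive distribution is formed from any one of these models, the sketch below filters the variance recursion of M2 and returns the implied Gaussian predictive. The parameter ordering and the initialisation of the recursion at the sample variance are our own assumptions, made purely for illustration.

```python
# Illustrative one-step-ahead predictive for the Gaussian GARCH(1,1) model M2.
import numpy as np
from scipy import stats

def garch_one_step_predictive(y, theta):
    """theta = (mu, omega, alpha, beta), standing in for
    (theta_{2,1}, theta_{2,2}, theta_{2,3}, theta_{2,4})."""
    mu, omega, alpha, beta = theta
    sigma2 = np.var(y)                        # initialise the variance recursion
    for t in range(1, len(y)):
        sigma2 = omega + alpha * (y[t - 1] - mu)**2 + beta * sigma2
    # one further update gives the variance of the next, unobserved period
    sigma2_next = omega + alpha * (y[-1] - mu)**2 + beta * sigma2
    return stats.norm(loc=mu, scale=np.sqrt(sigma2_next))
```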
For this empirical exercise we consider seven scoring rules: the log-score in (6); four versions of the CLS in (8), for the 10%, 20%, 80% and 90% percentiles; and two quantile scores (QS), evaluated at the 5th and 10th percentiles (denoted by QS 5% and QS 10%, respectively). The QS defined at the p-th percentile is
$$
QS_{p\%} = (y_t - q_t^p)\left(I(y_t \le q_t^p) - p\right),
$$
with $q_t^p$ denoting the predictive quantile satisfying $\Pr(y_t \le q_t^p \,|\, \mathcal{F}_{t-1}) = p$. (See Gneiting and Raftery (2007) for a discussion of the properties of QS as a proper scoring rule.) Use of QS (in addition to CLS) enables some conclusions to be drawn regarding the relevance of targeting tail accuracy per se in the production of optimal predictions, as opposed to the importance of the score itself. Tables 8 and 9 report the results for the S&P500 and MSCIEM indexes respectively, with the format of both tables mimicking that used in the simulation exercises. In particular, we continue to use an asterisk (*) to indicate the largest average score according to a given out-of-sample measure, but now supplement this with a dagger (†) to indicate the second largest value in any column.
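Before turning to the results, a hedged sketch of these two score families is given below. Here `dist` is any scipy.stats-style frozen predictive distribution, the CLS shown is the lower-tail censored likelihood score in the spirit of Diks et al. (2011) (the 80% and 90% variants censor the complementary region), and the function names are ours rather than those of any existing library.

```python
# Illustrative QS and lower-tail CLS; larger values are better, matching
# the (negative) averages reported in Tables 8 and 9.
import numpy as np

def quantile_score(dist, y, p):
    """QS_p = (y - q)(1{y <= q} - p), with q the predictive p-quantile."""
    q = dist.ppf(p)
    return (y - q) * (float(y <= q) - p)

def censored_log_score_lower(dist, y, p):
    """Log density for observations in the lower p-tail; the log of the
    censored probability mass otherwise (cf. Diks et al., 2011)."""
    q = dist.ppf(p)
    if y <= q:
        return dist.logpdf(y)
    return np.log(1.0 - dist.cdf(q))
```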
We make three comments regarding the empirical results. First, for both data sets, and for all three single models, strict coherence is close to holding uniformly, with most of the diagonal elements in all panels being either the highest (marked *) or the second highest (marked †) values in their respective columns. This suggests that each individual model, whilst inevitably a misspecified version of the true unknown DGP, is compatible enough with the true process to enable score-specific optimization to reap benefits.
Second, we remark that the three individual models are quite distinct, and are likely to be associated with quite different degrees of estimation error. Hence, while the naïve model is no doubt the most misspecified, given the documented features of both return series, it is also the most parsimonious and, hence, likely to produce estimated scores with small sampling variation. Thus, it is difficult to assess which model has the best predictive performance overall, due to the interplay between sampling variation and model misspecification (see Patton, 2019, for an extensive investigation of this issue). While the matter of model selection per se is not the focus of the paper, we do note that, of the single models, the Gaussian GARCH(1,1) model estimated using the relevant score-specific optimizer is the best performer out-of-sample overall, according to all measures.
Third, we note that the pooled forecasts exhibit close to uniform strict coherence for both series, highlighting that the degree of misspecification in the pool is still sufficient for benefits to be had via score-specific optimization. However, the numerical gains reaped by score-specific optimization in the case of the pool are typically not as large as in the single model cases. That is, and as is consistent with the earlier discussion, the additional flexibility produced by the pooling can reduce the ability of score-specific optimization to produce marked predictive improvements. We also note that, for both data sets, the (time-varying) weights in the linear pool (not recorded here for reasons of space) tend to favour the GARCH(1,1) model most frequently; this finding is consistent with the fact that the magnitudes of the average scores for the linear pool are most similar to the corresponding values for the GARCH(1,1) model.
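For completeness, a minimal sketch of score-optimal pool weights, in the spirit of Geweke and Amisano (2011), is given below for the log score. The softmax reparameterization used to keep the weights on the unit simplex is an implementation convenience of ours, not necessarily the approach behind the reported results.

```python
# Sketch: linear-pool weights chosen to maximise the average log score.
import numpy as np
from scipy.optimize import minimize

def optimal_pool_weights(log_pred_dens):
    """log_pred_dens: (n_models, n_obs) array holding log p_i(y_t | F_{t-1})
    for each constituent model i; returns the score-optimal weights."""
    n_models = log_pred_dens.shape[0]
    dens = np.exp(log_pred_dens)                   # predictive density ordinates

    def neg_avg_log_score(z):
        w = np.exp(z) / np.exp(z).sum()            # softmax: w >= 0, sums to 1
        return -np.mean(np.log(w @ dens))          # pooled density at each y_t

    res = minimize(neg_avg_log_score, x0=np.zeros(n_models), method="BFGS")
    return np.exp(res.x) / np.exp(res.x).sum()
```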
Table 8: Predictive results for the S&P500 index returns. Average out-of-sample scores are recorded for the three competing models, as well as for the linear pool of these three models, based on τ = 3,560 out-of-sample observations. Within each model block, an asterisk (*) marks the largest average score in a column and a dagger (†) the second largest (ties are marked jointly).

M1: Naïve
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.688*   -0.166    -0.250    -0.475    -0.726    -0.699    -0.444
QS 5%         -2.230    -0.165†   -0.252    -0.594    -0.594†   -0.966    -0.647
QS 10%        -1.769†   -0.164*   -0.244†   -0.533    -0.533*   -0.733    -0.480
CLS 10%       -2.083    -0.167    -0.243*   -0.403*   -0.636    -1.118    -0.841
CLS 20%       -1.875    -0.167    -0.247    -0.416†   -0.647    -0.917    -0.645
CLS 80%       -1.853    -0.206    -0.335    -0.606    -0.896    -0.653*   -0.397†
CLS 90%       -2.132    -0.271    -0.455    -0.853    -1.157    -0.654†   -0.381*

M2: GARCH
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.566†   -0.156    -0.238    -0.359    -0.590    -0.547    -0.300†
QS 5%         -1.884    -0.129*   -0.215†   -0.402    -0.637    -1.058    -0.797
QS 10%        -1.468*   -0.130†   -0.207*   -0.450    -0.686    -0.830    -0.576
CLS 10%       -2.257    -0.288    -0.445    -0.337*   -0.563*   -0.677    -0.399
CLS 20%       -2.242    -0.286    -0.442    -0.338†   -0.565†   -0.648    -0.375
CLS 80%       -1.749    -0.167    -0.237    -0.377    -0.606    -0.543*   -0.298*
CLS 90%       -6.437    -0.243    -0.296    -0.425    -0.665    -0.544†   -0.298*

M3: SVJ
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.591*   -0.194†   -0.269†   -0.414†   -0.657†   -0.669*   -0.412*
QS 5%         -3.110    -0.170*   -0.275    -0.664    -1.263    -1.011    -0.608
QS 10%        -2.756    -0.195    -0.265*   -0.541    -1.072    -0.987    -0.653
CLS 10%       -2.351    -0.198    -0.269†   -0.413*   -0.685    -0.990    -0.646
CLS 20%       -1.870    -0.202    -0.275    -0.413*   -0.649*   -0.770    -0.497
CLS 80%       -1.839†   -0.218    -0.325    -0.493    -0.755    -0.676†   -0.418
CLS 90%       -2.250    -0.248    -0.378    -0.620    -0.901    -0.709    -0.415†

Pooled Forecasts
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.499*   -0.159    -0.239    -0.366    -0.602    -0.583    -0.333
QS 5%         -1.907    -0.134†   -0.217†   -0.466    -0.717    -0.563    -0.723
QS 10%        -1.566†   -0.133*   -0.211*   -0.443    -0.720    -0.568    -0.554
CLS 10%       -2.218    -0.266    -0.403    -0.348*   -0.577†   -0.558    -0.473
CLS 20%       -2.111    -0.264    -0.401    -0.349†   -0.576*   -0.559    -0.409
CLS 80%       -1.570    -0.169    -0.244    -0.385    -0.617    -0.556†   -0.306†
CLS 90%       -2.801    -0.228    -0.297    -0.434    -0.676    -0.555*   -0.303*
Table 9: Predictive results for the MSCIEM index returns. Average out-of-sample scores are recorded for the three competing models, as well as for the linear pool of these three models, based on τ = 3,683 out-of-sample observations. Within each model block, an asterisk (*) marks the largest average score in a column and a dagger (†) the second largest (ties are marked jointly).

M1: Naïve
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.692*   -0.161    -0.243    -0.518    -0.789    -0.676    -0.450
QS 5%         -2.332    -0.159*   -0.243    -0.597    -0.973    -1.072    -0.746
QS 10%        -1.759†   -0.160†   -0.239†   -0.557    -0.835    -0.707    -0.483
CLS 10%       -2.141    -0.161    -0.238*   -0.450*   -0.733†   -1.149    -0.905
CLS 20%       -1.890    -0.162    -0.241    -0.459†   -0.728*   -0.906    -0.667
CLS 80%       -1.784    -0.182    -0.284    -0.587    -0.878    -0.664*   -0.439†
CLS 90%       -1.985    -0.227    -0.368    -0.762    -1.068    -0.668†   -0.431*

M2: GARCH
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.604*   -0.152    -0.234    -0.400†   -0.664    -0.580†   -0.358†
QS 5%         -1.922    -0.125*   -0.213†   -0.405    -0.681    -1.066    -0.815
QS 10%        -1.986    -0.129†   -0.208*   -0.423    -0.703    -1.093    -0.852
CLS 10%       -2.439    -0.326    -0.499    -0.383*   -0.650†   -0.810    -0.557
CLS 20%       -2.340    -0.302    -0.466    -0.383*   -0.647*   -0.727    -0.483
CLS 80%       -1.704†   -0.160    -0.237    -0.414    -0.680    -0.575*   -0.355*
CLS 90%       -1.715    -0.160    -0.239    -0.413    -0.682    -0.575*   -0.355*

M3: SVJ
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.689*   -0.202†   -0.278†   -0.462*   -0.734*   -0.699†   -0.471
QS 5%         -2.630    -0.181*   -0.307    -0.703    -1.200    -1.004    -0.707
QS 10%        -2.409    -0.203    -0.268*   -0.560    -0.972    -0.966    -0.684
CLS 10%       -2.356    -0.207    -0.281    -0.471†   -0.764†   -0.951    -0.702
CLS 20%       -1.848†   -0.211    -0.288    -0.483    -0.765    -0.764    -0.528
CLS 80%       -2.038    -0.215    -0.306    -0.537    -0.818    -0.697*   -0.466†
CLS 90%       -2.494    -0.260    -0.374    -0.669    -0.980    -0.717    -0.462*

Pooled Forecasts
Optimizer     LS        QS 5%     QS 10%    CLS 10%   CLS 20%   CLS 80%   CLS 90%
LS            -1.539*   -0.149    -0.232    -0.410    -0.626    -0.594    -0.372
QS 5%         -1.921    -0.127*   -0.214†   -0.409†   -0.631    -0.595    -0.752
QS 10%        -1.944    -0.133†   -0.211*   -0.415    -0.631    -0.599    -0.751
CLS 10%       -2.320    -0.277    -0.416    -0.392*   -0.582†   -0.588†   -0.630
CLS 20%       -1.853    -0.154    -0.236    -0.421    -0.429*   -0.626    -0.656
CLS 80%       -1.571†   -0.158    -0.239    -0.430    -0.665    -0.586*   -0.361†
CLS 90%       -1.601    -0.160    -0.246    -0.443    -0.681    -0.586*   -0.360*
5 Discussion

This paper contributes to a growing literature in which the role of scoring rules in the production of bespoke forecasts - i.e. forecasts designed to be optimal according to a particular measure of forecast accuracy - is given attention. With our focus solely on probabilistic forecasts, our results highlight the care that needs to be taken in the production and interpretation of such forecasts. It is not assured that optimization according to a problem-specific scoring rule will yield benefits; the relative performance of so-called 'optimal' forecasts depends on the nature of, and interplay between, the true model, the assumed model and the score. That is, if the predictive model simply does not allow a given score to reward the type of accuracy it should, optimization with respect to that score criterion comes to naught: one may as well use the simplest optimizer for the problem at hand, and leave it at that. However, subject to a basic match, or compatibility, between the true process and the assumed predictive model, it is certainly the case that optimization can produce accuracy gains in the manner intended, with the gains being more marked the greater the degree of misspecification.

Knowing when optimization will yield benefits in any particular empirical scenario is difficult, but the use of a plausible predictive model that captures the key features of the true data generating process is obviously key. The results in the paper also highlight the fact that score-specific optimization in the linear pool context is likely to reap fewer benefits than in the context of a single misspecified model. Theoretical exploration and characterization of all of these matters is likely to prove difficult, given the number of aspects at play; however, such work, even if confined to very specific combinations of generating process, model and scoring rule, would be of value. We leave such explorations for future work.
References
Aastveit, K. A., Mitchell, J., Ravazzolo, F., and van Dijk, H. K. (2019). The evolution of forecast density combinations in economics. Oxford Research Encyclopedias: Economics and Finance, 4:1–39.

Bernardo, J. and Smith, A. (1994). Bayesian Theory. Wiley Series in Probability & Statistics. Wiley.

Clements, M. and Harvey, D. (2011). Combining probability forecasts. International Journal of Forecasting, 27(2):208–223.
Corradi, V. and Swanson, N. R. (2006). Predictive density and conditional confidence interval accuracy tests. Journal of Econometrics, 135(1-2):187–228.

Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy.
Journal of Business & Economic Statistics, 13(3):253–263.

Diks, C., Panchenko, V., and van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163(2):215–230.

Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(3):505–562.

Elliott, G. and Timmermann, A. (2008). Economic forecasting. Journal of Economic Literature, 46(1):3–56.

Fissler, T. and Ziegel, J. F. (2016). Higher order elicitability and Osband's principle. The Annals of Statistics, 44(4):1680–1707.

Ganics, G. (2018). Optimal density forecast combinations. Technical report, Banco de España Working Paper.

Genest, C. (1984). Pooling operators with the marginalization property. Canadian Journal of Statistics, 12(2):153–163.

Geweke, J. and Amisano, G. (2011). Optimal prediction pools. Journal of Econometrics, 164(1):130–141.

Giacomini, R. and White, H. (2006). Tests of conditional predictive ability. Econometrica, 74(6):1545–1578.

Gneiting, T. (2011a). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762.

Gneiting, T. (2011b). Quantiles as optimal point forecasts. International Journal of Forecasting, 27(2):197–207.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.

Gneiting, T. and Raftery, A. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

Gneiting, T., Raftery, A. E., Westveld III, A. H., and Goldman, T. (2005). Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5):1098–1118.

Gneiting, T. and Ranjan, R. (2013). Combining predictive distributions. Electronic Journal of Statistics, 7:1747–1782.

Hall, S. G. and Mitchell, J. (2007). Combining density forecasts. International Journal of Forecasting, 23(1):1–13.

Hansen, P. R. (2005). A test for superior predictive ability. Journal of Business & Economic Statistics, 23(4):365–380.

Holzmann, H. and Eulert, M. (2014). The role of the information set for forecasting with applications to risk management. Annals of Applied Statistics, 8(1):595–621.

Kapetanios, G., Mitchell, J., Price, S., and Fawcett, N. (2015). Generalised density forecast combinations. Journal of Econometrics, 188(1):150–165.

Krüger, F. and Ziegel, J. F. (2020). Generic conditions for forecast dominance. Journal of Business & Economic Statistics, 0(0):1–12.

Loaiza-Maya, R., Martin, G. M., and Frazier, D. T. (2019). Focused Bayesian prediction. https://arxiv.org/abs/1912.12571.

McConway, K. J. (1981). Marginalization and linear opinion pools. Journal of the American Statistical Association, 76(374):410–414.

Opschoor, A., van Dijk, D., and van der Wel, M. (2017). Combining density forecasts using focused scoring rules. Journal of Applied Econometrics, 32(7):1298–1313.

Patton, A. J. (2019). Comparing possibly misspecified forecasts. Journal of Business & Economic Statistics, pages 1–23.

Pauwels, L. L., Radchenko, P., and Vasnev, A. L. (2020). Higher moment constraints for predictive density combination. CAMA Working Paper.

Ranjan, R. and Gneiting, T. (2010). Combining probability forecasts. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):71–91.

Tay, A. S. and Wallis, K. F. (2000). Density forecasting: a survey. Journal of Forecasting, 19(4):235–254.

Ziegel, J. F., Krüger, F., Jordan, A., and Fasciati, F. (2020). Robust forecast evaluation of expected shortfall.