Evaluating the Discrimination Ability of Proper Multivariate Scoring Rules
C. Alexander†∗, M. Coulon†, Y. Han§, X. Meng†

February 1, 2021
Abstract
Proper scoring rules are commonly applied to quantify the accuracy of distribution forecasts. Given an observation they assign a scalar score to each distribution forecast, with the lowest expected score attributed to the true distribution. The energy and variogram scores are two rules that have recently gained some popularity in multivariate settings because their computation does not require a forecast to have a parametric density function, and so they are broadly applicable. Here we conduct a simulation study to compare the discrimination ability of the energy score and three variogram scores. Compared with other studies, our simulation design is more realistic because it is supported by a historical data set containing commodity prices, currencies and interest rates, and our data generating processes include a diverse selection of models with different marginal distributions, dependence structures and calibration windows. This facilitates a comprehensive comparison of the performance of proper scoring rules in different settings. To compare the scores we use three metrics: the mean relative score, the error rate and a generalised discrimination heuristic. Overall, we find that the variogram score with parameter p = 0.5 outperforms the energy score and the other two variogram scores.

Keywords: multivariate forecasting, proper scoring rules, discrimination heuristic, energy score, variogram score

† Department of Accounting and Finance, University of Sussex Business School, Falmer, Brighton BN1 9SL, UK
§ McKinsey and Co., Frankfurt, Germany
∗ Peking University PHBS Business School

1 Introduction
Alongside the profusion of models for generating point or distribution forecasts, a prolific strand of theoretical research focusses on developing methods for evaluating these forecasts. A standard approach is to quantify the accuracy of each prediction with a proper scoring rule (see, for example, Gneiting and Raftery 2007). Many scoring rules have been proposed and considered in the literature: Bao et al. (2007) advocate the Kullback-Leibler information criterion, which is derived from the logarithmic score; Gneiting and Raftery (2007) advocate the continuous ranked probability score (CRPS), and Gneiting and Ranjan (2011) extend this to adopt the weighting approach of Amisano and Giacomini (2007) so that evaluation can be focused on a specific area of the distribution, such as a tail or the centre; Gneiting and Raftery (2007) generalise the CRPS to the energy score for multivariate distributions; Scheuerer and Hamill (2015) use the concept of variograms from geostatistics to derive a proper score; Hyvärinen (2005) and Parry et al. (2012) consider local scoring rules, which are based on the density function or its derivatives; Diks et al. (2011) and Diks et al. (2014) generalise the log score to the conditional likelihood and censored likelihood scores, which can allocate more weight to specific regions.

Given the large number of scoring rules, conventional wisdom dictates applying one suitable for the application at hand. While it is generally agreed that only proper scoring rules quantify the accuracy of probabilistic forecasts adequately (Winkler, 1996; Gneiting and Ranjan, 2011), the question of which of the proper scores to use remains largely open (Gneiting and Raftery, 2007). This problem is especially relevant for multivariate evaluation, since the rankings of univariate scoring rules mostly coincide, which reduces the risk of conflicting conclusions (see, for example, Staël von Holstein, 1970; Winkler, 1971; Bickel, 2007).

Some previous studies provide an analytic assessment of proper scoring rules for a specific forecasting problem, but these are restricted to a univariate setting in which some strong assumptions are made (see, for example, Machete 2013; Buja et al. 2005; Merkle and Steyvers 2013; Johnstone et al. 2011). Previous studies comparing multivariate proper scoring rules are limited to very simple simulation settings. In their evaluation of the energy score, Pinson and Tastu (2013) restrict themselves to a bivariate Gaussian as the true distribution. They introduce a useful discrimination heuristic which measures the average relative distance of the sub-optimal scores (i.e. those obtained from mis-specified models) from the score obtained from the true distribution. They conclude that the energy score is able to discriminate errors in mean but lacks sensitivity to errors in variance and especially to errors in correlation: even in the worst case they consider, where a perfect correlation is mistaken for zero correlation, the energy score changes only by 7%. Scheuerer and Hamill (2015) compare three different variogram scores with the energy score and the score proposed by Dawid and Sebastiani (1999). Their system has 5 or 15 dimensions, but again they only consider a very simple true model, i.e. a Gaussian or Poisson distribution. A simulation study assesses the ability of each score to identify the correct model through a simple heuristic, i.e. the score sample mean. Overall, only the variogram score with p = 0.5 is devoid of ranking issues in their study. They also confirm the finding of Pinson and Tastu (2013) that the energy score lacks sensitivity to misspecification in the dependency structure.
Ziel and Berk (2019) compare the sensitivity of the energy score, the variogram score and the score proposed by Dawid and Sebastiani (1999). They first consider a simulation study focusing on multivariate Gaussian distributions with unit variances and different correlations. They then consider an empirical time-series study in which autoregressive (AR) models with different lags are applied to airline passenger data. They conclude that the energy score performs the best in their simulation study, which seems to contradict the conclusion of Pinson and Tastu (2013). Diks and Fang (2020) use the conditional likelihood and censored likelihood scores to compare multivariate and univariate approaches for forecasting portfolio returns. Their empirical study considers univariate and multivariate generalised autoregressive conditional heteroscedasticity (GARCH) models with elliptical distributional assumptions applied to financial data. Their focus is not to compare scoring rules, but to illustrate that multivariate models do not necessarily lead to better forecasting accuracy for portfolio returns when compared with a univariate approach.

We make two major contributions to this literature. The first is to propose new metrics for discriminating between correctly specified and misspecified models using proper multivariate scoring rules. The second is to provide the first empirical study on real financial data of the relative performance of multivariate scoring rules. Most empirical research in the forecasting literature in finance and economics includes a short empirical study, but extensive applications of proper scoring rules, even to univariate distribution forecasts of financial returns, are hard to find. We fill this gap in the literature by conducting an extensive comparison between different rules across a diverse selection of eight static and dynamic models for multivariate distributions. We do this in a forecasting setting using three large sets of daily financial data spanning 1991 to 2018: commodity prices, exchange rates and interest rates.
2 Proper Scoring Rules

Forecasting accuracy evaluations rely on loss measures to quantify the performance of a distribution forecast. As Diebold and Mariano (1995) mention, this loss generally depends on the underlying economic structures associated with the forecast. Scoring rules offer a promising measure by condensing the accuracy of a distribution forecast into a single penalty-oriented value while retaining attractive statistical properties.

Let $\mathcal{F}$ be the convex class of distributions on $(\Omega, \mathcal{A})$. Let $y$ be an observation of the random variable $Y$ and let $F$ be a forecast of the distribution of $Y$. A scoring rule is a function

$$S : \mathcal{F} \times \Omega \to \mathbb{R} \cup \{-\infty, \infty\}.$$

A scoring rule $S$ is said to be proper if, for all distributions $F$ and the true distribution $G$, the following holds:

$$\mathbb{E}_G\, S(G, \cdot) \le \mathbb{E}_G\, S(F, \cdot). \qquad (1)$$

Further, a scoring rule is strictly proper if $G$ is the unique minimiser in equation (1).

Propriety of a scoring rule is important because the ideal forecast is preferred irrespective of the cost-loss structure (Diebold et al., 1998; Granger and Pesaran, 2000). A proper scoring rule is designed so that a forecaster who believes the future distribution to be $G$ has no incentive to predict any distribution $F \neq G$ (Gneiting et al., 2007). The term was coined by Winkler (1977, 1996), who shows that proper scoring rules test for both calibration and sharpness of a distribution forecast simultaneously. The use of non-proper scoring rules is generally not recommended since they can lead to wrong inferences (Gneiting and Ranjan, 2011).

The most well-known score is undoubtedly the log score, defined as

$$\text{LogS}(F, y) = -\log f(y),$$

where $f$ is the probability density function (PDF) of the distribution $F$. The log score has been widely applied in both model evaluation and estimation, but it is undefined whenever the PDF $f$ is zero at the realisation. To overcome this weakness, the quadratic and pseudo-spherical scores have been considered as alternative PDF-based scores:

$$\text{QS}(F, y) = \|f\|_2^2 - 2 f(y), \qquad \text{PseudoS}(F, y) = -\frac{f(y)^{\alpha - 1}}{\|f\|_\alpha^{\alpha - 1}},$$

where $\|f\|_\alpha := \left(\int f(y)^\alpha\, \mu(\mathrm{d}y)\right)^{1/\alpha}$. The spherical score is a special case of the pseudo-spherical score with $\alpha = 2$. All three scores are strictly proper. However, these scores have been criticised for only crediting forecasts for high probabilities of the realising value but not for high probabilities of values near the realising one (Gneiting and Raftery, 2007). In addition, the PDF-based scores rely on predictive densities, which might not be available, especially with ensemble forecasts. In view of these limitations, efforts have been devoted to developing alternative scores that do not require predictive densities. In the rest of this section, we briefly review three such scores: the CRPS, the energy score and the variogram score.

2.1 Continuous Ranked Probability Score

The CRPS, introduced by Matheson and Winkler (1976) and augmented by Gneiting and Ranjan (2011), has been widely used to properly compare distribution forecasts with a potential focus on certain regions of interest. The CRPS is defined as

$$\text{CRPS}_\nu(F, y) = \int_0^1 \text{QS}_\alpha\left(F^{-1}(\alpha), y\right) \nu(\alpha)\, \mathrm{d}\alpha, \qquad (2)$$

$$\text{QS}_\alpha\left(F^{-1}(\alpha), y\right) = 2\left(\mathbb{1}\{y \le F^{-1}(\alpha)\} - \alpha\right)\left(F^{-1}(\alpha) - y\right),$$

where $F^{-1}(\alpha) = \inf\{x \mid F(x) \ge \alpha\}$ denotes the $\alpha$ quantile of $F$, $\nu : [0,1] \to \mathbb{R}_{\ge 0}$ is a quantile weight function and $\text{QS}_\alpha$ is known as the quantile score.

Apart from this quantile score representation, the CRPS can also be expressed using the Brier probability score through

$$\text{CRPS}_u(F, y) = \int_{-\infty}^{\infty} \text{PS}\left(F(z), \mathbb{1}\{y \le z\}\right) u(z)\, \mathrm{d}z, \qquad (3)$$

$$\text{PS}\left(F(z), \mathbb{1}\{y \le z\}\right) = \left(F(z) - \mathbb{1}\{y \le z\}\right)^2,$$

with threshold weight function $u : \mathbb{R} \to \mathbb{R}_{\ge 0}$, as shown by Laio and Tamea (2007). Given a realisation $y$, the integral in equation (3) splits into two easily interpretable parts which are penalised by the score, as visualised in Figure 1. We remark that equations (2) and (3) are generally not equivalent.

Additionally, Gneiting and Raftery (2007) derive the kernel score representation

$$\text{CRPS}(F, y) = \mathbb{E}_F |Y - y| - \frac{1}{2}\, \mathbb{E}_F |Y - Y'|, \qquad (4)$$

where $Y$ and $Y'$ are independent random variables with sampling distribution $F$.
This concise expression serves as a foundation for the generalisation of the CRPS to the multivariate energy score discussed in the next subsection.
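To make the kernel representation (4) concrete, the following minimal sketch (our own illustration, not code from the paper; the function name is ours) estimates the CRPS by Monte Carlo from an ensemble of draws, exactly as the energy score is computed in the multivariate case.

```python
import numpy as np

def crps_mc(sample: np.ndarray, y: float) -> float:
    """Monte Carlo estimate of the kernel representation (4):
    CRPS(F, y) = E_F|Y - y| - 0.5 * E_F|Y - Y'|,
    where `sample` holds independent draws from the forecast F."""
    term1 = np.mean(np.abs(sample - y))
    # E_F|Y - Y'| estimated from all ordered pairs of draws.
    term2 = np.mean(np.abs(sample[:, None] - sample[None, :]))
    return term1 - 0.5 * term2

rng = np.random.default_rng(1)
ensemble = rng.normal(0.0, 1.0, size=2000)   # draws from F = N(0, 1)
print(round(crps_mc(ensemble, y=1.0), 4))    # lower is better; the exact CRPS here is about 0.60
```

The pairwise term makes the naive estimator quadratic in the ensemble size; for large ensembles one would subsample the pairs, but the simple version suffices to illustrate the representation.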
Figure 1: CRPS schematic. We use $F = \Phi$, the standard normal distribution function, and the realisation y = 1 to illustrate the concept of the CRPS. The forecast distribution $F$ is penalised for the shaded areas to the left and right of the realised value $y$ through $\int_{-\infty}^{y} F(z)^2\, \mathrm{d}z$ and $\int_{y}^{\infty} (1 - F(z))^2\, \mathrm{d}z$, respectively. A low score suggests high sharpness of the distribution forecast around the realisation.

For distributions with finite first moment, the CRPS is strictly proper. Thus, the true probability distribution receives the lowest CRPS and is preferred to any other probabilistic forecast. Emphasising specific parts of the distribution by the choice of the quantile or threshold weight functions is simple, since any non-negative functions $\nu$ and $u$ can be used, provided that equations (2) and (3) are convergent. Table 1 lists the functions proposed by Amisano and Giacomini (2007) that we use in our analysis.

Table 1: Possible weight functions for the CRPS

Emphasis    | Quantile weights       | Threshold weights
Uniform     | ν(α) = 1               | u(z) = 1
Centre      | ν(α) = α(1 − α)        | u(z) = φ(z)
Both tails  | ν(α) = (2α − 1)²       | u(z) = 1 − φ(z)/φ(0)
Right tail  | ν(α) = α²              | u(z) = Φ(z)
Left tail   | ν(α) = (1 − α)²        | u(z) = 1 − Φ(z)

The weight functions ν : [0,1] → ℝ≥0 and u : ℝ → ℝ≥0 put additional emphasis on certain parts of the distribution. Forecasts which deviate on those parts are penalised additionally and receive a higher CRPS. Here, φ and Φ denote the density and the distribution functions of the standard normal distribution.

2.2 Energy Score

The energy score is a popular multivariate strictly proper score introduced by Gneiting and Raftery (2007) which generalises the kernel representation of the CRPS in equation (4). Let $y = (y_1, \ldots, y_d)'$ be an observation of the random vector $Y$ and let $F$ be a forecast of the distribution of $Y$ such that $\mathbb{E}_F(\|Y\|^\beta)$ is finite. The energy score is then defined as

$$\text{ES}_\beta(F, y) = \mathbb{E}_F\left(\|Y - y\|^\beta\right) - \frac{1}{2}\, \mathbb{E}_F\left(\|Y - Y'\|^\beta\right),$$

where $Y$ and $Y'$ are independent random vectors with distribution $F$.

Székely (2003) shows that the energy score with $\beta \in (0, 2)$ is strictly proper, while Gneiting and Raftery (2007) provide an alternative general proof. In practice, $\beta = 1$ has been a common choice; in this case, the energy score reduces to the CRPS in the univariate case. Closed-form expressions for the energy score are generally unavailable, which means that computations are done through Monte Carlo methods.

Despite its popularity, this score has been criticised for being insensitive to misspecification of the dependency structure (Pinson and Girard, 2012; Pinson and Tastu, 2013) and for being unable to distinguish a good representation of the predictive distribution from a very sparse one (Scheuerer and Hamill, 2015).
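The Monte Carlo computation of the energy score is a direct multivariate analogue of the CRPS sketch above. The following sketch (again our own; names are ours) estimates $\text{ES}_1$ from an $(n \times d)$ sample of the forecast distribution.

```python
import numpy as np

def energy_score(sample: np.ndarray, y: np.ndarray, beta: float = 1.0) -> float:
    """Monte Carlo estimate of ES_beta(F, y) from an (n x d) sample of F:
    ES = E_F ||Y - y||^beta - 0.5 * E_F ||Y - Y'||^beta."""
    term1 = np.mean(np.linalg.norm(sample - y, axis=1) ** beta)
    pair_diffs = sample[:, None, :] - sample[None, :, :]      # (n, n, d) pairwise differences
    term2 = np.mean(np.linalg.norm(pair_diffs, axis=2) ** beta)
    return term1 - 0.5 * term2

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
forecast = rng.multivariate_normal([0.0, 0.0], cov, size=500)  # draws from the forecast F
print(energy_score(forecast, y=np.array([0.5, -0.5])))
```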
2.3 Variogram Score

The variogram score proposed by Scheuerer and Hamill (2015) is based on the concept of variograms from geostatistics. Similar to the diagnostic methods of Hamill (2001) and Feldmann et al. (2015), the score examines pairwise differences between the elements $y_i$ of $y$. The variogram score of order $p$ is defined as

$$\text{VS}_p(F, y) = \sum_{i=1}^{d} \sum_{j=1}^{d} \left( |y_i - y_j|^p - \mathbb{E}_F\left(|Y_i - Y_j|^p\right) \right)^2,$$

where $Y_i$ and $Y_j$ are the $i$-th and $j$-th components of a random vector $Y$ with distribution $F$. Intuitively, the score makes use of the variogram of order $p$,

$$\gamma_p(i, j) = \frac{1}{2}\, \mathbb{E}\left(|Y_i - Y_j|^p\right),$$

which quantifies the degree of spatial dependence of a stochastic process. Pairwise comparisons measure the closeness of the deviations in the observations relative to those of the corresponding expectations.

Scheuerer and Hamill (2015) show that the score is proper relative to the class of distributions for which the $p$-th moments of all elements are finite. The variogram score is not strictly proper because it depends only on the $p$-th absolute moment of the distribution of the element differences. Therefore, it cannot distinguish distributions whose element differences deviate in higher moments but agree in all moments of order less than or equal to $p$.

In practice, the choice of $p$ depends on the forecast distribution and should generally be large enough to capture all relevant moments of the pairwise deviations, but not so large as to overly emphasise outliers through the exponentiation. The values $p = 0.5, 1, 2$ are often suggested, and Figure 2 shows the effect of $p$ by illustrating the observed variogram $|y_i - y_j|^p$ of these popular orders relative to changes in $|y_i - y_j|$. It is clearly visible that the magnitude of the effect depends heavily on the value of $|y_i - y_j|$, with the absolute slope varying considerably over the depicted domain. The sensitivity of the observed variogram in a neighbourhood of zero deviation is strongest for $p = 0.5$ and very weak for $p = 2$; this order reverses for larger $|y_i - y_j|$. We expect the parameter $p = 0.5$ to be more sensitive to similar $y_i$ and $y_j$, while $p = 2$ will respond more to cases where $|y_i - y_j|$ is expected to be large. The choice $p = 1$ should strike a balance between these two cases.

Figure 2: Variogram observation of various orders. The figure shows the effect of the variogram order depending on the observed absolute difference $|y_i - y_j|$. Slight deviations in $|y_i - y_j|$ affect the observed variogram $|y_i - y_j|^p$ differently, depending on its order $p$.

As with the energy score, the encapsulation of the information into a single score leads to a loss of information. However, empirical applications support the sensitivity of the score to flawed forecasts, especially regarding the dependency structure (Scheuerer and Hamill, 2015). Since the score is based on pairwise deviations, any bias that is the same for all components of the forecast cancels out and is therefore undetectable. This further motivates the good practice of using multiple proper scores for the evaluation of multivariate distribution forecasts.
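For reference, a minimal sketch of the variogram score with unit weights follows (our own illustration, mirroring the definition above; names are ours). The loop at the end shows how the three orders compared in this study respond differently to the same forecast and observation.

```python
import numpy as np

def variogram_score(sample: np.ndarray, y: np.ndarray, p: float = 0.5) -> float:
    """VS_p(F, y) with unit weights, estimated from an (n x d) sample of F:
    sum over i, j of ( |y_i - y_j|^p - E_F|Y_i - Y_j|^p )^2."""
    obs_vario = np.abs(y[:, None] - y[None, :]) ** p                        # observed variogram
    exp_vario = np.mean(np.abs(sample[:, :, None] - sample[:, None, :]) ** p, axis=0)
    return float(np.sum((obs_vario - exp_vario) ** 2))

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
forecast = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
y = np.array([0.5, -0.5])
for p in (0.5, 1.0, 2.0):    # the three orders compared in this study
    print(p, variogram_score(forecast, y, p))
```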
3 Models

While most prior studies of this type have only considered static parametric distributions with fixed parameters as the data generating process (DGP), with parameter variations for the misspecified models, here we instead select from a broad range of realistic models for both the DGP and the misspecified models. There is a very long list of alternative static and dynamic models that could be selected for our study. We require a diverse selection, but would like all of these to fit the historical datasets reasonably well, while differing in features such as whether the marginal distributions are static, whether the dependence structure is static, how much of the noise is filtered from the historical data, and finally how much sampling error is generated by the length of the calibration window. In this section we summarise all the models to be used in our study, beginning with static models before moving to dynamic models. (Here noise refers to the variation in the data which is not useful for improving forecast error, similar to the definition of noise that can be identified by Principal Component Analysis (PCA).)

3.1 Static Models

Our simplest model assumes that the next-period joint distribution of the variables can be well approximated by their joint historical distribution, i.e. the frequency distribution of contemporaneous observations on the random variables. The lack of complexity makes this approach particularly popular with practitioners: in a survey by Pérignon and Smith (2010) of risk models used by commercial banks, almost three-quarters used such historical simulation techniques. Moreover, Danielsson et al. (2016) show that it remains unclear whether dynamic time-series models can outperform this so-called 'historical' approach, at least for predicting quantiles. We therefore include models that use historical data to build an empirical distribution function (EDF) for each marginal and apply the Gaussian copula for the dependence structure. More sophisticated copulas could be selected, but we only apply the Gaussian copula for consistency with the other models employed. We use EDF^C_n to denote the model with empirical marginal distributions and Gaussian copula, calibrated to a sample of size n. A sketch of how such a model can be simulated is given below.
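The following is one plausible implementation of the EDF^C_n simulation step (the paper does not publish code, so all names and the rank-based copula calibration are our own assumptions): estimate the copula correlation from the normal scores of the ranks, draw correlated Gaussian variates, and invert each empirical marginal by quantile interpolation.

```python
import numpy as np
from scipy import stats

def simulate_edf_gaussian_copula(window: np.ndarray, n_sims: int, rng) -> np.ndarray:
    """Sketch of the EDF^C_n model: empirical marginals coupled by a Gaussian
    copula, calibrated on an (n x d) window (e.g. n = 250 or 2,000 returns)."""
    n, d = window.shape
    # Copula correlation from the normal scores of the ranks.
    u = stats.rankdata(window, axis=0) / (n + 1.0)
    corr = np.corrcoef(stats.norm.ppf(u), rowvar=False)
    # Draw copula uniforms and invert each empirical marginal.
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_sims)
    u_sim = stats.norm.cdf(z)
    return np.column_stack([np.quantile(window[:, j], u_sim[:, j]) for j in range(d)])

rng = np.random.default_rng(1)
window = rng.standard_t(df=5, size=(250, 8)) * 0.01   # stand-in for 250 days of returns
sims = simulate_edf_gaussian_copula(window, n_sims=5000, rng=rng)
```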
The Factor Quantile (FQ) methodology developed by Alexander and Han (2020) uses quantile regression to capture the marginals. Each random variable in the system is regressed on a set of m common factors, which may be exogenously selected or determined endogenously. Here we use the latter approach, with latent factors derived from a principal component analysis of all the variables in the system. Given a quantile partition Q, common to each variable, we perform a set of quantile regressions for each variable, i.e. one for each quantile τ ∈ Q. Then, conditional on a set of predetermined values (one for each of the latent factors), we compute the fitted quantiles of each variable and apply a shape-preserving spline to the fitted quantiles over all τ ∈ Q. This way we interpolate a conditional marginal distribution for each variable, all conditional on the same latent-factor values. Alexander and Han (2020) claim that such latent-factor FQ models are at least as successful for predicting distributions of financial asset returns as multivariate GARCH models. Furthermore, they have several advantages: they are very quick to calibrate, more computationally robust than GARCH models, and they scale far more easily to high dimensions.

Here we confine our selection to two types of latent-factor FQ models, each with a Gaussian copula capturing dependence, which are designated FQ-AL^C_n and FQ-AB^C_n, where n is again the number of observations used to calibrate the model and C denotes the Gaussian copula. The difference between the AL and AB versions stems from the choice of factors: the AB version uses the first m principal components and the AL version uses the last m principal components. The components are ordered by size of eigenvalue, so that the variance explained by each component progressively decreases. The AL version simply separates relevant information (captured by the intercept) from the noise (captured by the last m components). In effect, it is a means of removing unwanted variation from the EDF. This may also be viewed as a latent-factor version of the Jensen (1968) 'alpha', hence the terminology 'alpha-latent' or AL. The AB version uses the first m eigenvectors as latent factors, but in this case each quantile forecast is associated with a large uncertainty. Therefore a variance-reduction technique based on 'bootstrap aggregation' or bagging is applied to reduce this uncertainty, via an algorithm proposed by Breiman (1996). We refer to Alexander and Han (2020) for further technical details, and give a schematic sketch of the FQ idea below.
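The sketch below conveys only the core FQ idea for a single marginal, under our own simplifying assumptions; it is not the Alexander-Han algorithm, and the bagging and other refinements of the AB version are omitted. All function and variable names are hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from scipy.interpolate import PchipInterpolator

def fq_conditional_quantiles(y, factors, factor_values,
                             taus=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Schematic of the latent-factor FQ idea for one marginal: quantile-regress
    the variable on the latent factors, evaluate the fitted quantiles at the
    chosen factor values, and interpolate with a shape-preserving spline."""
    X = sm.add_constant(factors)
    x_star = np.r_[1.0, factor_values].reshape(1, -1)
    q_fit = np.sort([sm.QuantReg(y, X).fit(q=tau).predict(x_star)[0] for tau in taus])
    return PchipInterpolator(np.asarray(taus), q_fit)   # maps tau -> conditional quantile

# Latent factors from PCA; np.linalg.eigh sorts eigenvalues in ascending order,
# so the first columns below are the LAST m principal components (the AL choice).
rng = np.random.default_rng(1)
returns = rng.standard_normal((250, 8)) * 0.01
centred = returns - returns.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
factors = centred @ eigvecs[:, :2]                      # m = 2 latent factors
marginal = fq_conditional_quantiles(returns[:, 0], factors, factor_values=factors[-1])
print(marginal(0.5))                                    # fitted conditional median
```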
3.2 Dynamic Models

The most common class of dynamic model for an expected value is the family of autoregressive moving average (ARMA) time-series models, because they are flexible, have the temporal aggregation property, and allow both easy estimation and straightforward inference. However, for distribution forecasts we also require a parametric forecast of variance. To this end, we employ time-series models for conditional variance forecasts using the GARCH(p, q) model class, a generalisation by Bollerslev (1986) of the autoregressive conditional heteroscedasticity (ARCH) model of Engle (1982). The GARCH(p, q) process assumes a time-varying conditional variance as follows:

$$\sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i \sigma_{t-i}^2, \qquad (5)$$

$$\varepsilon_t \mid \mathcal{I}_{t-1} \sim N(0, \sigma_t^2), \qquad (6)$$

where $\varepsilon_t$ is the market shock or innovation at time $t$ and $\mathcal{I}_t$ is the information set containing all past returns up to $t$. The parameters $(\alpha_i)_{i=1}^{q}$ and $(\beta_i)_{i=1}^{p}$ measure the reaction of the conditional variance to market shocks and the persistence of conditional variance, respectively.

These models have been particularly successful in modelling univariate time series of financial returns (Engle, 2001), partly because they are designed to capture the volatility clustering effects often present in finance (Mandelbrot, 1963). In the presence of such effects, volatility becomes time-dependent and can exhibit prolonged periods of exceptionally high or low values. The multivariate GARCH models we consider are all of the (1,1) class, so we shall allow for asymmetries in innovations as well as leverage effects, and for two kinds of conditional covariance dynamics. (GARCH is usually conducive to a two-step maximum likelihood estimation, with the first step being the application of an ARMA model to the conditional means and the second being the estimation of a GARCH model on the residuals of the conditional mean equations.) In multivariate analysis, clustering extends beyond volatilities to correlations, and generalisations of univariate GARCH models also capture time-varying conditional covariances and spillovers of volatility between different assets.

Bollerslev (1990) introduces the constant conditional correlation GARCH (CCC-GARCH) model, where the conditional correlations are assumed to be time-invariant. The covariance matrix is estimated as

$$\varepsilon_t = V_t^{1/2} z_t, \qquad V_t = D_t C D_t, \qquad D_t = \text{diag}(V_t)^{1/2},$$

where $V_t^{1/2}$ is a matrix such that $V_t$ is the covariance matrix (e.g. $V_t^{1/2}$ can be chosen as the Cholesky factorisation of $V_t$), $z_t$ is the standardised residual, which has mean zero and identity covariance matrix, $C$ is a constant correlation matrix and $D_t$ is the diagonal matrix containing the time-varying individual volatilities. In practice, each volatility can be estimated by a univariate GARCH model, while $C$ can be specified as the sample correlation between standardised residuals. The model is easy to estimate since the dependency and the volatilities are examined separately.

The assumption of constant correlation may seem too strong and is not fulfilled for many assets (Tsui and Yu, 1999). Dynamic conditional correlation GARCH (DCC-GARCH) by Engle (2002) generalises CCC-GARCH to account for time-varying correlations. The correlation is estimated directly from the residuals of the univariate models and adjusted depending on the co-movement of the returns. As such, the covariance matrix is given by

$$V_t = D_t C_t D_t, \qquad D_t = \text{diag}(V_t)^{1/2},$$

where $C_t$ is described by

$$C_t = \text{diag}(Q_t)^{-1/2}\, Q_t\, \text{diag}(Q_t)^{-1/2},$$

$$Q_t = \left(1 - \sum_{m=1}^{M} \alpha_m - \sum_{n=1}^{N} \beta_n\right) \bar{Q} + \sum_{m=1}^{M} \alpha_m \left(\varepsilon_{t-m} \varepsilon_{t-m}'\right) + \sum_{n=1}^{N} \beta_n Q_{t-n}, \qquad (7)$$

with $\bar{Q} = \mathbb{E}(\varepsilon_t \varepsilon_t')$. The transformation of $Q_t$ to $C_t$ guarantees a well-defined correlation matrix as long as $Q_t$ is positive definite. Similar to CCC-GARCH, there are no restrictions on the choice of univariate GARCH models for the volatility. For both the CCC-GARCH and DCC-GARCH models, the distribution of $z_t$ can be chosen as multivariate Gaussian, which leads to consistent parameter estimators even under distributional misspecification (Bauwens and Laurent, 2005). It is also possible to consider non-normal distributions. For example, Bauwens and Laurent (2005) use multivariate skew distributions, while Cajigas and Urga (2006) and Pelagatti (2004) apply the Laplace distribution and general elliptical distributions, respectively. There are several alternative multivariate GARCH models, but in this paper we limit our analysis to CCC-GARCH(1,1) and DCC-GARCH(1,1).
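A minimal sketch of the covariance recursion follows, with two caveats: the models in our study use Student-t E-GARCH(1,1) marginals (Table 4), whereas this sketch uses plain GARCH(1,1) to stay short, and we drive the Q recursion with standardised residuals, following the common Engle (2002) convention, whereas equation (7) is written in terms of the residuals. All parameters are assumed to be pre-estimated; names are ours.

```python
import numpy as np

def dcc_covariance_path(eps, omega, alpha_g, beta_g, a, b):
    """Sketch of the DCC-GARCH(1,1) recursion behind equations (5) and (7).
    eps: (T x d) mean-equation residuals; omega, alpha_g, beta_g: per-asset
    GARCH(1,1) parameters (length-d arrays); a, b: scalar DCC parameters."""
    T, d = eps.shape
    h = np.empty((T, d))                      # conditional variances, one column per asset
    h[0] = eps.var(axis=0)
    for t in range(1, T):
        h[t] = omega + alpha_g * eps[t - 1] ** 2 + beta_g * h[t - 1]
    z = eps / np.sqrt(h)                      # standardised residuals
    Q_bar = np.cov(z, rowvar=False)           # unconditional correlation target
    Q = Q_bar.copy()
    V = np.empty((T, d, d))
    for t in range(T):
        if t > 0:                             # DCC recursion, equation (7) with M = N = 1
            Q = (1 - a - b) * Q_bar + a * np.outer(z[t - 1], z[t - 1]) + b * Q
        q = np.sqrt(np.diag(Q))
        C = Q / np.outer(q, q)                # C_t = diag(Q_t)^{-1/2} Q_t diag(Q_t)^{-1/2}
        D = np.diag(np.sqrt(h[t]))
        V[t] = D @ C @ D                      # V_t = D_t C_t D_t
    return V
```

Setting a = b = 0 recovers a CCC-style model in which the correlation matrix is constant at Q_bar.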
4 Simulation Study

Our discussion in Section 2 introduced several univariate and multivariate proper scoring rules that are prevalent in the literature. While it is agreed that these offer a sound way to quantify the accuracy of probabilistic forecasts (Winkler, 1996; Gneiting and Ranjan, 2011), the question of which score to use remains largely open (Gneiting and Raftery, 2007). Conventional wisdom dictates applying a suitable scoring rule for the application at hand (Machete, 2013), but this only provides a few requirements and does not sufficiently restrict the selection. (As pointed out in Section 2, scoring rules have varying assumptions for propriety and compare different forecasting types, e.g. density forecasts, distribution forecasts or ensemble forecasts, which are sometimes not easily interchangeable.)

Our aim is to analyse the ability of different scoring rules to discriminate between true and misspecified distributions. Improving on previous studies, we employ a realistic simulation setting that is supported by historical financial data.
4.1 Data

We obtain three eight-dimensional time series of daily frequency: the first is on USD-denominated exchange rates, the second is on US interest rates and the third is on Bloomberg investable commodity indices. We obtain the daily exchange rates and commodity index values from Thomson Reuters Datastream and the interest rates data from the US Treasury website. All time series end on 30 June 2018, but the start date varies with data availability. Within each set we have selected variables to broadly represent the asset class. The exchange rates are those with the highest trading volume, excluding the Chinese Renminbi, which was pegged to the USD until recently (Bank of International Settlements, 2016). This data starts in January 1999 with the introduction of the Euro as an accounting currency. The interest rates span the term structure of US Treasury bonds from 6 months to 20 years. Alternative available maturities are 1 month, 2 months, 3 months and 30 years, but those maturities are missing data for an extended period of time and are therefore excluded. This data starts in January 1994, after the 20-year maturity interest rate became available in October 1993. The commodity indices are chosen to reflect the most liquid commodities with the highest USD-weighted production value and are diversified to represent the energy, grains, industrial/precious metals, softs and livestock sectors (Bloomberg, 2017). The Bloomberg commodity indices were launched in 1998 with a backward projection to January 1991. We include all available data in our study. We summarise the total sample period and the starting date of our out-of-sample evaluation in Table 2.

Summary statistics of the data are listed in Table 3 for monthly returns of exchange rates and commodities, and for monthly changes in interest rates. The sample mean for all monthly returns and changes is near zero, which allows us to apply the principal component representation of Factor Quantile models without prior transformations. Furthermore, all assets are leptokurtic and require heavy-tailed distributions.

Table 2: Sample for each data set
Data set                | First date      | Start evaluation | End date
Exchange rate returns   | 01 January 1999 | 28 February 2007 | 30 June 2018
Interest rate changes   | 01 January 1994 | 03 July 2002     | 30 June 2018
Commodity index returns | 01 January 1991 | 26 February 1999 | 30 June 2018
All three data sets use daily frequencies, yielding over 18,000 observations in total. The first dates vary due to data availability.

To examine the robustness of our analysis, we segment the data into three parts, covering 2006-2010, 2010-2014 and 2014-2018, with breakpoints at the end of June in each case. Because the exchange rate data starts much later than the other data sets, the period from June 2006 to February 2007 is still used for calibration; the first sub-period therefore begins in March 2007 in this case.
Table 3: Summary statistics of the data

Asset | Mean | Volatility | Skewness | Kurtosis

Exchange rate returns:
AUD | 0.0000 | 0.0368 | 0.7922 | 5.9807
CAD | -0.0003 | 0.0265 | 0.7218 | 6.4690
CHF | -0.0011 | 0.0297 | -0.0101 | 4.8305
EUR | 0.0003 | 0.0292 | 0.3192 | 4.1236
GBP | 0.0013 | 0.0252 | 0.5218 | 4.9380
JPY | 0.0002 | 0.0281 | 0.3077 | 3.5187
NZD | -0.0003 | 0.0383 | 0.5727 | 4.6172
SEK | 0.0011 | 0.0328 | 0.1555 | 3.5244

Interest rate changes:
6 month | -0.0039 | 0.2016 | -2.2169 | 14.6083
1 year | -0.0041 | 0.2137 | -1.2097 | 8.6316
2 year | -0.0055 | 0.2462 | -0.3439 | 4.4936
3 year | -0.0062 | 0.2617 | -0.0753 | 3.9820
5 year | -0.0078 | 0.2728 | 0.0288 | 3.7443
7 year | -0.0086 | 0.2697 | 0.1118 | 3.7856
10 year | -0.0097 | 0.2594 | -0.0196 | 4.2279
20 year | -0.0115 | 0.2372 | 0.0341 | 4.6910

Commodity index returns:
Copper | 0.0065 | 0.0724 | -0.0517 | 5.8856
Corn | -0.0048 | 0.0752 | 0.2992 | 4.0710
Gold | 0.0027 | 0.0453 | 0.1885 | 4.1817
Live Cattle | -0.0006 | 0.0392 | -0.4110 | 5.1238
Natural Gas | -0.0080 | 0.1316 | 0.4827 | 3.9878
Soybean | 0.0047 | 0.0684 | -0.0485 | 3.5828
Sugar | 0.0037 | 0.0886 | 0.2235 | 3.4728
WTI Oil | 0.0034 | 0.0876 | -0.0136 | 3.8353

The monthly returns and changes are calculated using the values at the start of each month, and the summary statistics aggregate over the time periods specified in Table 2. Our study only applies daily data, but we use a monthly frequency in this table to avoid minuscule magnitudes.

4.2 Choice of Models

Although there have been a few studies explicitly comparing scoring rules (as reviewed in Section 1), the models chosen and the corresponding discussions generally fail to put enough emphasis on the distinct features of financial data, such as time-varying volatility and covariance and leptokurtic conditional distributions. Pinson and Tastu (2013) restrict themselves to bivariate Gaussian distributions with different means, variances and correlations, and do not consider time series in their study. Similarly, Scheuerer and Hamill (2015) only consider simple distributions such as Gaussian and Poisson distributions. (Their comparison of multivariate scoring rules is more comprehensive than the alternatives, but still mainly uses Gaussian distributions as the DGP. In the case where a Poisson distribution is assumed as the DGP, all scores but the variogram score with p = 0.5 have at least some ranking issues and may identify the wrong model as the correct one.) Ziel and Berk (2019) conduct a time-series empirical study, but they only consider AR models for the mean with no dynamic volatility structure. In addition, their data is essentially a univariate time series, where observations on multiple days are viewed jointly as multivariate distributions. This is inherently different from standard financial applications, where synchronised financial data are viewed as a multivariate system. Diks and Fang (2020) do consider financial applications, in which they use univariate and multivariate GARCH models to capture the dynamic variance and covariance structure. However, as we explained in Section 1, their focus is not to compare scoring rules.

Most of the empirical literature on financial forecasting accuracy concerns univariate time series and, moreover, considers only point forecasts. Yet, even in this restricted context, it remains unclear whether a dynamic model structure is always preferable to a static one which assumes that the next-period distribution of the variables can be well approximated by the historical distribution.
Furthermore, multivariate models can differ both in the specification of the marginals and in their dependence structure. As described in Section 3, a static marginal distribution is commonly represented by an empirical distribution function, while we also consider models in the Factor Quantile class. The dynamic models are instead from the GARCH class. Regarding dependence, all but one of our models assume a static dependence structure (typically the Gaussian copula); only the DCC-GARCH model extends this assumption to dynamic conditional correlation.

Our simulation study selects a DGP as the true distribution and assesses the ability of different proper scoring rules to detect model misspecification. To reduce dependence on our choice of DGP we employ a diverse selection of models for multivariate distributions that can be used as either the DGP or a misspecified model. Also, for our results to be relevant to practical applications, we calibrate these models to a very comprehensive selection of real-world data. However, even though the models are calibrated to actual data, it is important to emphasise that we do not seek to compare the forecasting performance of different models. Instead, our focus is on a comparison of the ability of different proper scoring rules to detect model misspecification. Nonetheless, we do choose models which have been shown in Alexander and Han (2020) to have similar forecasting abilities in typical financial datasets, thereby posing a greater challenge to scoring rules in terms of identifying the DGP and discriminating between competing models. Compared with other studies, we thus employ a more realistic setting with models that are better representations of financial data, we calibrate these models to actual data, and our results are much more comprehensive.

We employ sample sizes of n = 250 and n = 2000 for our results based on static models. However, the dynamic multivariate GARCH models cannot be calibrated using n = 250 because the optimisation of their complex likelihood function fails to converge most of the time. Hence, their parameters are estimated using n = 2000 only. The FQ models are much easier to calibrate on small samples, so we present results for these models based on both n = 250 and n = 2000. Because the EDF and FQ models are calibrated on two different sample sizes, we end up with eight competing models in total: (i) two FQ-AL models, (ii) two FQ-AB models, (iii) two EDF models and (iv) two multivariate GARCH models. This way, each model has one associated model that is similar but differs either in the calibration length or in the conditional covariance structure. Table 4 summarises the models employed.

Our motivation for selecting the models described here is to use a small but representative selection from the static and dynamic modelling paradigms that are commonly used for multivariate financial systems. Expanding the model set beyond the eight models described in this section may hinder rather than help our aim.
Our selection of models strikes a balance between the need to represent the salient qualities of real data adequately and a structure sufficiently simple that calibration difficulties will not interfere with the very large scale of our simulation study. (Results for other choices of n are available, as are those with the assumption of independent marginals. These are excluded here for brevity and because their use does not change our conclusions about the ability of proper scoring rules to discriminate misspecified models; they are available from the authors on request.)

Table 4: Models employed

Model          | Marginals              | Dependency               | Calibration
EDF^C_250      | EDF                    | Gaussian copula          | 250
EDF^C_2000     | EDF                    | Gaussian copula          | 2,000
FQ-AL^C_250    | Alpha FQ w/ last PCs   | Gaussian copula          | 250
FQ-AL^C_2000   | Alpha FQ w/ last PCs   | Gaussian copula          | 2,000
FQ-AB^C_250    | Asym. bagging FQ       | Gaussian copula          | 250
FQ-AB^C_2000   | Asym. bagging FQ       | Gaussian copula          | 2,000
CCC-GARCH      | Student-t E-GARCH(1,1) | Conditional correlation  | 2,000
DCC-GARCH      | Student-t E-GARCH(1,1) | Dyn. cond. correlation   | 2,000
Column 1 summarises the notation used for the model, column 2 describes the marginals, column 3 the dependency structure and column 4 the length of the calibration window. For example, the GARCH models are calibrated to an eight-dimensional time series spanning 2,000 days and the FQ-AB^C_250 model is calibrated to an eight-dimensional time series spanning 250 days.

4.3 Simulation Design

Our simulation study quantifies and compares the ability of the energy score and the variogram scores with p = 0.5, 1, 2 to distinguish the correct DGP from misspecified models. These scoring rules were defined in Section 2, and the selected values of p, which have also been used by Scheuerer and Hamill (2015), are considered typical choices (Jordan et al., 2017).

The models described above are calibrated on the systems of daily, eight-dimensional USD-denominated exchange rates, interest rates and Bloomberg investable commodity indices that we discussed in Section 4.1. The parameters for each model are estimated on an initial sample of size n, then the calibrated model is used to generate a one-day-ahead multivariate distribution forecast, then the sample is rolled forward keeping the sample size fixed, and the calibration and forecasting are repeated until the data set is exhausted. Then we move to another data set and repeat, until we have a continuous set of daily, rolling distribution forecasts for each model and for each of the three data sets.

Our aim is not to assess the forecasting performance of different models. It is to assess the ability of a scoring rule to distinguish the true DGP from misspecified models when used for multivariate distribution forecasting. Our simulation setting controls the DGP so that we know the true distribution every time we evaluate the forecasts, and the other seven models are misspecified. For the forecasts made at time t, all the models are calibrated using the n most recent observations up to time t, as previously mentioned. In addition, we select one of these models to generate the observations for time t + 1 that are used to analyse the performance of the scoring rules. Put another way, realisations are simulated from just one of these models, i.e. the one that is designated as the DGP. The reason why we have selected such diverse models is that we seek to reduce the dependence of our results on the choice of a specific DGP. Therefore, we rotate the choice of DGP to span all eight models.

The simulation algorithm is summarised as follows (a schematic code sketch is given at the end of this subsection). First, fix a proper scoring rule s and a model m* that is designated as the true DGP. The rule s will quantify each of the eight models' distribution forecasts, for each data set, as follows:

Stage 1. Calibrate a model using data ending at time t and forecast a multivariate distribution for time t + 1, i.e. one day ahead. Repeat this eight times, once for each model in Table 4.

Stage 2. Draw 5,000 observations from the distribution forecast by model m*. These observations are realisations from the true model at time t + 1.

Stage 3. Given a distribution forecast for time t + 1 by model m, apply s to each of the 5,000 realisations generated in Stage 2 to quantify the performance of model m. Repeating this for all models in Table 4 gives 5,000 scores for each of the eight models at time t, including the designated DGP model m*.

Stage 4. Roll t forward by one quarter of a year and repeat Stages 1, 2 and 3 until the entire sample for that data set is exhausted.

The four stages outlined above are then repeated eight times, so that each of the models in Table 4 is selected as the designated DGP m*. Then we change the scoring rule s and again repeat the four stages eight times, so that each model can once more be the designated DGP; in total, s spans all four scoring rules, and every time we change the scoring rule we repeat Stages 1 to 4 eight times, once per choice of DGP. Finally, we change the data set and repeat the entire process again.

We evaluate the scoring rules at the first date of each quarter in our evaluation period. This yields new simulations (each with eight dimensions) on 50, 66 and 82 dates for USD-denominated exchange rates, US interest rates and Bloomberg investable commodity indices, respectively. Since we have eight possible DGPs, this leads to approximately 1,600 applications of the simulation study above for each of the four multivariate scoring rules.

This setting gives us a very detailed view of the discrimination ability of each scoring rule over time (i.e. different market conditions) and for various choices of the DGP. Note that our simulation design provides optimistic conditions for the scoring rules, since the distribution of the DGP is known for certain, and moreover a very large number of realisations is sampled at each time t. In practice, we only observe one realisation and therefore must consider the scores over a long period; and the longer the sample used, the less likely the DGP is to be stationary. To approximate a more realistic setting, in which the length of the out-of-sample period is restricted due to lack of data, we shall also compare the scoring rules on smaller sub-samples, with only 100 realisations instead of 5,000. Recall that all models use historical information to forecast their joint distribution, but in our setting they are evaluated using samples from the chosen DGP rather than realisations of the original time series. The models are only punished if their forecasts deviate from those of the true DGP, and a good scoring rule should be able to distinguish misspecified models from the true distribution.
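The four-stage design can be summarised in a single loop. The sketch below is schematic: `fit_and_forecast` and the forecast object's `.sample` method stand in for the model-specific calibration and simulation routines described above, and are our own placeholder names, not the authors' code.

```python
import numpy as np

def run_simulation(eval_dates, models, fit_and_forecast, scoring_rules, dgp,
                   n_draws=5000, rng=None):
    """Schematic of the four-stage design for one data set and one designated
    DGP. `fit_and_forecast(model, t)` must return a forecast object with a
    `.sample(k, rng)` method; scoring rules map (forecast, y) to a scalar."""
    rng = rng or np.random.default_rng(0)
    scores = {}                                              # (rule, model, t) -> N scores
    for t in eval_dates:                                     # quarterly evaluation dates
        # Stage 1: calibrate all eight models on data up to t, forecast t + 1.
        forecasts = {name: fit_and_forecast(m, t) for name, m in models.items()}
        # Stage 2: draw the realisations from the designated DGP's forecast.
        y_draws = forecasts[dgp].sample(n_draws, rng)
        # Stage 3: score every model's forecast against each realisation.
        for s_name, score in scoring_rules.items():
            for name, f in forecasts.items():
                scores[(s_name, name, t)] = np.array([score(f, y) for y in y_draws])
        # Stage 4: the loop itself rolls t forward by one quarter.
    return scores
```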
We shall generalise the approach of Pinson and Tastu (2013) to compare the discrimination ability of several scoring rules, and introduce a new metric, the error rate, as an additional measure of the sensitivity of scoring rules. Suppose there are M models in the study and that scores are based on N realisations at time t. The N × 1 score vector assigned by scoring rule s to model m at time t, when model m* is the DGP, is defined as:

$$S^s_t(m, m^*) = \left(S^s_{1,t}(m, m^*), \ldots, S^s_{N,t}(m, m^*)\right)'. \qquad (8)$$

We employ several metrics aimed at summarising the properties of (8). First, and following Scheuerer and Hamill (2015), the simplest metric we employ is the mean relative score:

$$\frac{1}{N} \sum_{i=1}^{N} S^s_{i,t}(m, m^*)\, /\, S^s_{i,t}(m^*, m^*). \qquad (9)$$

Due to propriety, this ratio should on average exceed one, but erroneous conclusions may arise when N is small. (Each simulation yields a scoring-rule evaluation based on observations with a stationary DGP. Depending on the realisation, the scoring rules may favour a model other than the DGP, but the sample mean based on all 5,000 scores should be smallest for the DGP, because the distribution of the DGP is used to generate the realisations. A good scoring rule should assign the lowest scores to the DGP and also produce robust rankings over the entire evaluation period; as Pinson and Tastu (2013) point out, a large distance between the scores of the DGP and alternative models may help to avoid erroneous conclusions.) For some realisations, the lowest score may be assigned to a model that is not the DGP. We study this probability in our simulation study by introducing a new 'error rate' metric and also by analysing the distribution of the differences between the scores assigned to each model:

$$S^s_t(m, m^*) - S^s_t(m^*, m^*). \qquad (10)$$

As an additional measure of discrimination ability, we consider a simple heuristic that examines the relative distance of the scores between the models. This approach was suggested by Pinson and Tastu (2013), who compare the sensitivity of the energy score to various misspecifications. Given

$$\bar{S}^s_t(m, m^*) := \frac{1}{N} \sum_{i=1}^{N} S^s_{i,t}(m, m^*),$$

they utilise a true Gaussian distribution and measure the sensitivity pairwise through

$$\frac{\bar{S}^s_t(m, m^*) - \bar{S}^s_t(m^*, m^*)}{\bar{S}^s_t(m^*, m^*)}$$

with misspecified models that deviate from the true distribution in only one aspect, e.g. mean or variance. For our purposes we need to adjust their measure to consider the discrimination across scoring rules over multiple misspecified models. To this end, we propose a generalised discrimination heuristic defined as

$$d^s_t(m^*) = \frac{1}{M} \sum_{m=1}^{M} \frac{\bar{S}^s_t(m, m^*)}{\bar{S}^s_t(m^*, m^*)}. \qquad (11)$$

A large value of this generalised discrimination metric may indicate that the ranking of the scores is reliable and robust to differences in the simulation sample size N.
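All three metrics reduce to a few lines given the simulated score vectors of equation (8); a minimal sketch (our own, with hypothetical names) follows.

```python
import numpy as np

def mean_relative_score(S_m, S_dgp):
    """Equation (9): average ratio of model m's scores to the DGP's scores."""
    return np.mean(np.asarray(S_m) / np.asarray(S_dgp))

def error_rate(S_m, S_dgp):
    """Our error-rate metric: the fraction of realisations on which the
    misspecified model receives a lower (better) score than the DGP,
    i.e. the probability that equation (10) is negative."""
    return np.mean(np.asarray(S_m) < np.asarray(S_dgp))

def discrimination_heuristic(mean_scores, dgp):
    """Equation (11): mean scores of all M models relative to the DGP's mean
    score (the DGP itself contributes a ratio of one)."""
    return np.mean([mean_scores[m] / mean_scores[dgp] for m in mean_scores])
```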
5 Results

We analyse the energy score and three parametrisations of the variogram score with respect to their ability to identify the DGP under the simulation design described in the previous section. As discussed in Section 4.3, we use N = 5000 and M = 8. To examine the discrimination ability of each scoring rule at time t, we apply the entire sample of 5,000 scores but also smaller sub-samples of N = 100 scores that reflect more realistic conditions. These correspond to out-of-sample evaluations with 5,000 and 100 observations, respectively. Each score assesses the closeness of the distribution forecasts to the distribution of the DGP, rather than indicating real forecasting accuracy: because we impose a distribution through the DGP, a model's scores only measure how similar the model is to the DGP, according to that score.

We begin our results in Section 5.1 with the mean relative score (9), focusing on exchange rate returns and a DCC-GARCH as DGP. Then, in Section 5.2, we generalise these results to multiple DGPs and data sets by using the error rate to measure the percentage of cases in which a misspecified model receives a lower (i.e. better) score than the DGP. Further, we analyse the deviation between the scores of misspecified models and that of the DGP, derived from (10). Finally, in Section 5.3, we illustrate the distribution of these deviations and use our generalisation (11) of the discrimination heuristic proposed by Pinson and Tastu (2013), which allows us to compare multiple scoring rules. For reasons of space, several figures are only available electronically in the supplementary materials. There is also an expanded set of results where the model set includes some models in which the correlation matrix is replaced with the identity matrix.

5.1 Mean Relative Score

We begin the analysis of the multivariate scoring rules with a comparison of their mean relative score (9) for each model. Figure 3 shows results for exchange rate returns with DCC-GARCH as the DGP and four selected models. The mean relative score corresponds to the average ratio between the score of the misspecified model and that of the DGP. The shaded areas cover everything between the 0.25- and 0.75-quantiles of the sample mean based on a sample size of 100 instead of 5,000; these confidence intervals are generated through a statistical bootstrap with 5,000 repetitions (a sketch is given below). We limit the illustration to four models only for clarity, but the results are comparable when other misspecified models, DGPs or data sets are considered. Figures for other DGPs and data sets can be found in the supplementary materials.

The results based on the sample mean of 5,000 scores indicate that all four scoring rules manage to evaluate the models successfully. Due to propriety, they assign the lowest expectation to the DGP, which is why almost none of the mean relative scores fall below 1 in Figure 3. Further, CCC-GARCH almost always obtains the lowest score among all misspecified models, as expected given its similarity to DCC-GARCH. The scores can distinguish distributions which differ only in their marginals, and they are able to identify FQ-AL^C as one of the misspecified models with great confidence. In contrast, the difference between FQ-AB^C and EDF^C is less pronounced, which means that they produce predictions which each deviate by a similar amount from the distribution forecast of DCC-GARCH.

Both the variogram scores with p = 0.5 and p = 1 show clear and robust rankings between the misspecified models and distinguish them well from the DGP. The discrimination ability is weaker for the energy score.

Figure 3: Average scores relative to score of DGP (USD exchange rates). The figure illustrates the mean relative score in equation (9) based on 5,000 scores. A value larger than 1 means that the scoring rule is on average able to distinguish between the misspecified model and DCC-GARCH to identify the true DGP. We generate a confidence interval covering the area between the 0.25- and 0.75-quantiles of the sample mean based on a sample size of 100 through a bootstrap with 5,000 repetitions.
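The bootstrap bands in Figure 3 can be reproduced along the following lines (a minimal sketch under our own naming; the paper does not publish its bootstrap code).

```python
import numpy as np

def bootstrap_band(relative_scores, subsample=100, reps=5000, q=(0.25, 0.75), rng=None):
    """Bootstrap the mean relative score over sub-samples of 100 scores,
    mimicking a short out-of-sample period; returns the 0.25- and
    0.75-quantiles that form the shaded bands in Figure 3."""
    rng = rng or np.random.default_rng(0)
    means = np.array([rng.choice(relative_scores, size=subsample, replace=True).mean()
                      for _ in range(reps)])
    return np.quantile(means, q)
```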
As pointed out by Pinson and Tastu (2013) and Scheuerer and Hamill (2015), the energy score changes only by a small amount between the DGP and other models. This is evident in Figure 3 as well, where the average score of the worst model is only 25% larger than that of the DGP. In comparison, the variogram scores with p = 0.5 and p = 1 assign average scores over 200% and 100% larger than that of the DGP, respectively. Unlike the other scoring rules, the variogram score with p = 2 changes the rankings at several times and is also the only scoring rule which makes wrong inferences even with a large sample size of 5,000 scores. For instance, FQ-AB^C is preferred over the DGP around the end of 2017. Hence, the energy score and the variogram scores with p = 0.5 and p = 1 may be preferable to the variogram score with p = 2.

However, there are vast differences in the discrimination ability of the scoring rules, which can lead to wrong inferences for smaller sample sizes:

1. Despite the overall success of the variogram scores with p = 0.5 and p = 1, wrong inferences may occur with only 100 samples. The shaded areas of CCC-GARCH dip below 1 frequently, which means that a slightly misspecified model may be chosen over the DGP.

2. The energy score suffers from misidentification risk to a much larger extent. Besides CCC-GARCH, FQ-AB^C and EDF^C are also at times assigned lower scores than the DGP in 2010, 2013, 2016 and 2017 with the smaller sample size. Overall though, the energy score still manages to produce a clear ranking that is mostly accurate.

3. The variogram score with p = 2 largely fails to yield any meaningful results with the smaller sample size. The rankings can change considerably, and all models obtain a lower sample mean than the DGP at various times. Even FQ-AL^C, which is regarded as the worst model by all other scoring rules, has lower scores than DCC-GARCH around 2016. Additionally, the variogram score with p = 2 may assign scores of very large magnitude that greatly affect the sample mean. This is visible in Figure 3 in two respects: (i) the scoring rule has wide confidence intervals and (ii) the sample mean is at times higher than the sample 0.75-quantile. This is, for instance, the case around the end of 2013.

These initial findings suggest that the variogram scores with p = 0.5 and p = 1 offer superior discrimination ability to the more popular energy score. The variogram score with p = 2 performs very poorly and may yield erroneous rankings of the forecasting models, even with a very large sample of scores.

5.2 Error Rates

Figure 4 shows the results from applying (10) with DCC-GARCH as DGP for all datasets and all competing models, using the scores for all t to generate the densities. Each column of the figure illustrates the density of equation (10) for a specific misspecified model, under various scoring rules and data sets. We include the error rate in the upper right corner of each sub-figure, which shows the probability that equation (10) yields a negative value. For clarity, we do not use the same x-axis for all sub-figures, but show all values between the 0.001- and 0.999-quantiles of each distribution. This means that the magnitude of the error is not visible in these figures, but instead we gain insight into the shape of the error density. Figures for alternative DGPs can be found in the supplementary materials.

Overall, Figure 4 shows that the probability of obtaining scores which are lower than that of the DGP is quite high and varies significantly across cases.
Depending on the data set and scoring rule,the average error rate (for each row of plots) varies between 31% and 54%. The variogram score with p = 2 in particular often assigns lower scores to misspecified models. This happens in 50%, 54%, 53%of cases for exchange rate returns, interest rate changes and commodity rate returns respectively and istherefore around 60% worse than the error rate of the variogram score with p = 0 . (with correspondingerror rates of 31%, 32% and 32%). This scoring rule achieves the lowest error rate, followed by thevariogram score with p = 1 and the energy score, which both achieve error rates in the 39-42% range.It is important to note that the error rate is only a binary statistic which does not take intoaccount the magnitude by which the scores of misspecified models are smaller than that of the DGP.By averaging over a sample of scores, the error rate decreases, until it reaches zero due to the proprietyof the scoring rules. The number of samples needed for a sample mean that favours the DGP dependson the shape of the distribution. If the tail of the shaded area is small in comparison to the tail ofthe non-shaded area, a small sample might be sufficient. However, many of the distributions in Figure4 are approximately symmetric which means that large positive and negative values in equation (10)are equally likely. As an additional indicator for the convergence speed, we illustrate the expectationof the distributions with a dotted line. These are always non-negative due to propriety of the scoringrules, but an expectation far right from the shaded area corresponds to a faster convergence towardslower mean relative scores for the DGP. Again, the values are generally close to zero, which suggestsslow convergence towards positive sample mean scores.The average error rate over all DGPs for the evaluation period of the multivariate scoring rules iscompared in Figure 5. Similar to Figure 4, we examine the number of times the score of a misspecifiedmodel is lower than that of the DGP but we now include the error rate across multiple choices ofthe DGP. The conclusions from Figure 5 are similar to Figure 4, although the magnitude of the errorrates is lower than for the case of only DCC-GARCH as DGP. The variogram score with p = 2 hasa significantly higher error rate that is more than 47% higher than that of the variogram score with25igure 4: Density of differences between scores with DCC-GARCH as DGP V S . C C C C C C V S .
Figure 4: Density of differences between scores with DCC-GARCH as DGP. [Sub-figure panels omitted: rows correspond to the variogram scores (p = 0.5, 1.0, 2.0) and the energy score for each data set (exchange rate returns, interest rate changes, commodity index returns); columns correspond to the misspecified models; each panel reports its error rate in the upper right corner.] This figure displays the density of the difference between the scores of the DGP and the misspecified models described in equation (10). A Gaussian kernel is used to smooth the densities. The shaded areas correspond to negative values, where a lower score is assigned to the misspecified model. In the upper right corner of each sub-figure, the probability of the shaded area is displayed. The dotted vertical line shows the expectation of the density. For clarity, we limit each sub-figure to values between the 0.001- and 0.999-quantiles. Figures for alternative DGPs can be found in the supplementary materials.
The higher error rate of the variogram score with p = 2 is also consistent with Figure 3, where misspecified models were sometimes preferred over the DGP. Again, there is typically a clear ranking of the scoring rules that persists across all three data sets and the entire evaluation period. For the variogram scores, the error rate increases with the parameter p, and the error rate of the energy score typically falls between those of the variogram scores with p = 1 and p = 2, but interestingly it often outperforms the variogram score with p = 1 for the interest rate data.

Figure 5: Average error rate over all DGPs. [Panels omitted: one panel per data set (exchange rate returns, interest rate changes, commodity index returns), plotting the error rate over time for the variogram scores (p = 0.5, 1.0, 2.0) and the energy score.] The error rates show how often a misspecified model is assigned a lower score than the DGP. Higher values are associated with inferior scoring rules and more frequent wrong inferences.

Through the consideration of multiple models, we go beyond ceteris paribus sensitivities to obtain more general results. Our misspecified models combine various misspecifications at once and are therefore more similar to the settings in which proper scoring rules are applied in practice. Our generalised discrimination heuristic, defined in (11), allows us to test discrimination ability against multiple competing models at once. Note that we do not subtract the scores of the DGP from those of the misspecified models in the numerator, but this does not affect the rankings of the scoring rules. Figure 6 depicts our discrimination heuristic for each scoring rule over time on a logarithmic scale. Unlike Figure 3, results are illustrated for all DGPs (eight rows of plots) and all data sets (three columns of plots).
Figure 6: Discrimination heuristic of scoring rules. [Panels omitted: eight rows of DGPs by three columns of data sets, each plotting the four scoring rules over time.] We display the discrimination heuristic of equation (11) for all three data sets and eight DGPs on a logarithmic scale. Scoring rules which separate the scores of misspecified models and the DGP by a larger relative distance are assigned higher values for the discrimination heuristic. We smooth the discrimination heuristic with a moving average of 8 observations to improve the interpretability of the figure, but the same patterns are present when no smoothing is applied.
Figure 6 shows a clear distinction between scoring rules in most cases, but the preferences of the discrimination heuristic for the four rules can vary slightly depending on the data set, the time period and the DGP chosen. Overall, several key features emerge:

1. The energy score is always the scoring rule with the lowest discrimination heuristic. This, again, is in accord with the prior simulation studies of Pinson and Tastu (2013) and Scheuerer and Hamill (2015). Across all data sets and DGPs, the energy score only receives an average discrimination heuristic of . , compared to . , . and . for the variogram scores with p = 0.5, p = 1 and p = 2.

2. In all cases, the variogram score with p = 1 is the scoring rule with the second highest discrimination heuristic.

3. The variogram score with p = 2 achieves extremely high values for the discrimination heuristic in some settings, but is also the only scoring rule which receives values below 1. This occurs for commodity index returns with DCC-GARCH as DGP. For those t, the model ranking of the variogram score with p = 2 is erroneous and multiple misspecified models receive lower scores than the DGP.

4. The scoring rule with the highest discrimination heuristic varies depending on the choice of data and DGP, but a pattern is evident. In most cases, the variogram score with p = 0.5 has the highest discrimination heuristic, but it is surpassed by the variogram score with p = 2 during some periods, and always when FQ-AL models are used as DGP.

The high discrimination heuristic of the variogram score with p = 2 in some settings, despite its poor performance in Figure 3, can be explained by Figure 2 and our discussion of the effect of different choices of p for the variogram score. Generally, the variogram score with p = 2 outputs a large range of scores, some of which may be vastly larger in magnitude than others. These outliers shift the sample mean in Figure 3 to a larger value than the sample 0.75-quantile and affect the discrimination heuristic to a similar extent. For instance, for exchange rate returns with DCC-GARCH as DGP, the largest summand of equation (11) takes a value around 4,700. In comparison, the largest summands of the energy score and the variogram scores with p = 0.5 and p = 1 are , and respectively.

The quadratic nature of the variogram score with p = 2 further amplifies large distances between models. Therefore, the variogram score with p = 2 achieves a particularly high discrimination heuristic when the models are easily distinguishable. The cases where the variogram score with p = 2 has the highest discrimination heuristic mostly correspond to two scenarios:

1. Around the financial crisis in 2008, the distribution forecasts become easier to distinguish. This is because models with a calibration window of 2,000 observations are much less affected by the abnormal values during the crisis, in contrast to models with a calibration window of only 250 observations. Hence, the distribution forecasts may deviate more strongly between the competing models, and scoring rules may assign larger relative distances between the scores of misspecified models and those of the DGP.

2. Similarly, the use of FQ-AL as DGP also increases the relative distances between the scores of the models. The Factor Quantile model typically produces a much sharper forecast than the alternative models and is therefore more easily identifiable as the DGP.

In those two cases, all scoring rules manage to identify the DGP quite clearly, so the even larger relative distances between scores under the variogram score with p = 2 arguably bring no additional benefit. At the same time, this scoring rule suffers from erroneous rankings, despite having high discrimination heuristics in some settings.
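For completeness, a minimal sketch of the heuristic follows. Equation (11) is defined earlier in the paper and is not reproduced in this section, so the implementation below assumes one plausible form that is consistent with the description here, namely the average ratio of each competing model's mean score to the DGP's mean score, with no subtraction in the numerator; the function name and toy inputs are hypothetical.

```python
import numpy as np

def discrimination_heuristic(scores_dgp, scores_models):
    """Assumed form of the generalised heuristic: the average ratio of each
    competing model's mean score to the DGP's mean score. Values above 1
    indicate that the scoring rule separates the DGP from the misspecified
    models; values below 1 indicate an erroneous ranking, as observed for
    the variogram score with p = 2. NOTE: the exact form of equation (11)
    appears earlier in the paper; this ratio form is an assumption based on
    its description in this section."""
    s_dgp = float(np.mean(scores_dgp))
    return float(np.mean([np.mean(s) / s_dgp for s in scores_models]))

# Hypothetical usage: positive per-observation scores for the DGP and three
# competing models, drawn from gamma distributions with increasing scale.
rng = np.random.default_rng(3)
scores_dgp = rng.gamma(2.0, 1.0, size=500)
scores_models = [rng.gamma(2.0, scale, size=500) for scale in (1.2, 1.5, 3.0)]
print(discrimination_heuristic(scores_dgp, scores_models))
```

Because the summands are ratios of mean scores, a single outlying score can dominate the heuristic, which is precisely why a high value should always be read alongside the error rate.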
These issues clearly show that the discrimination heuristic should be considered as one possible indicator of the quality of a scoring rule, but by itself it is inadequate to quantify discrimination ability. A large heuristic does not imply more robust or less erroneous rankings. Therefore, a high discrimination heuristic between the models is not useful unless it is accompanied by a low error rate, i.e. the percentage of times a misspecified model is chosen over the DGP.

In this paper we have conducted an extensive investigation of the discrimination ability of multivariate proper scoring rules, comparing the performance of four popular choices across a variety of time periods, financial data sets and competing stochastic models. By rotating our DGP across a selection of eight realistic models with different static or dynamic joint distributions and different calibration windows, our simulation study provides a much less restrictive and more comprehensive set of results than the existing literature in this area. While results of course vary by scenario, overall trends and insights do emerge. We observe that the energy score generally has poorer discrimination ability than the other scores, while the variogram score with p = 2 can appear to provide strong discrimination in certain cases but is prone to high error rates (failure to identify the correct DGP). The variogram score with p = 0.5 consistently maintains the lowest error rate, but is sometimes outperformed by the variogram score with p = 1 in terms of discrimination ability, since differences between models are magnified less. On the whole, the variogram score with p = 0.5 seems to perform best in most settings, but another important conclusion is that one should typically consider multiple performance metrics and be aware of how different scoring rules tend to behave in different model settings. Such insights can undoubtedly be of great value when deciding which scoring rule to use in different multivariate forecasting applications.

References

Alexander, C. and Y. Han (2020). Static and dynamic models for multivariate distribution forecasts: Proper scoring rule tests of factor-quantile vs. multivariate GARCH models. arXiv 2004.14108.
Amisano, G. and R. Giacomini (2007). Comparing density forecasts via weighted likelihood ratio tests. Journal of Business & Economic Statistics 25(2), 177–190.
Bank of International Settlements (2016, April). Triennial Central Bank survey: Foreign exchange turnover in April 2016. Technical report.
Bao, Y., T.-H. Lee, and B. Saltoğlu (2007). Comparing density forecast models. Journal of Forecasting 26(3), 203–225.
Bauwens, L. and S. Laurent (2005). A new class of multivariate skew densities, with application to generalized autoregressive conditional heteroscedasticity models. Journal of Business & Economic Statistics 23(3), 346–354.
Bickel, J. E. (2007). Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis 4(2), 49–65.
Bloomberg (2017, May). The Bloomberg commodity index family: Index methodology. Technical report.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31(3), 307–327.
Bollerslev, T. (1990). Modelling the coherence in short-run nominal exchange rates: A multivariate generalized ARCH model. Review of Economics and Statistics 72(3), 498–505.
Breiman, L. (1996). Bagging predictors. Machine Learning 24(2), 123–140.
Buja, A., W. Stuetzle, and Y. Shen (2005). Loss functions for binary class probability estimation and classification: Structure and applications.
Cajigas, J.-P. and G. Urga (2006). Dynamic conditional correlation models with asymmetric multivariate Laplace innovations.
Danielsson, J., K. James, M. Valenzuela, and I. Zer (2016). Model risk of risk models. Journal of Financial Stability 23, 79–91.
Dawid, A. P. and P. Sebastiani (1999). Coherent dispersion criteria for optimal experimental design. Annals of Statistics 27, 65–81.
Diebold, F. X., T. A. Gunther, and A. S. Tay (1998). Evaluating density forecasts, with applications to financial risk management. International Economic Review 39, 863–883.
Diebold, F. X. and R. S. Mariano (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics 13(3), 253–263.
Diks, C. and H. Fang (2020). Comparing density forecasts in a risk management context. International Journal of Forecasting 36(2), 531–551.
Diks, C., V. Panchenko, O. Sokolinskiy, and D. van Dijk (2014). Comparing the accuracy of multivariate density forecasts in selected regions of the copula support. Journal of Economic Dynamics and Control 48, 79–94.
Diks, C., V. Panchenko, and D. van Dijk (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics 163(2), 215–230.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50(4), 987–1007.
Engle, R. F. (2001). GARCH 101: The use of ARCH/GARCH models in applied econometrics. Journal of Economic Perspectives 15(4), 157–168.
Engle, R. F. (2002). Dynamic conditional correlation: A simple class of multivariate generalized autoregressive conditional heteroskedasticity models. Journal of Business & Economic Statistics 20(3), 339–350.
Feldmann, K., M. Scheuerer, and T. L. Thorarinsdottir (2015). Spatial postprocessing of ensemble forecasts for temperature using nonhomogeneous Gaussian regression. Monthly Weather Review 143(3), 955–971.
Gneiting, T., F. Balabdaoui, and A. E. Raftery (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(2), 243–268.
Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378.
Gneiting, T. and R. Ranjan (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics 29(3), 411–422.
Granger, C. W. J. and H. M. Pesaran (2000). A decision theoretic approach to forecast evaluation. In Statistics and Finance: An Interface, pp. 261–278. World Scientific.
Hamill, T. M. (2001). Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review 129(3), 550–560.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 695–709.
Jensen, M. C. (1968). The performance of mutual funds in the period 1945–1964. The Journal of Finance 23(2), 389–416.
Johnstone, D. J., V. R. R. Jose, and R. L. Winkler (2011). Tailored scoring rules for probabilities. Decision Analysis 8(4), 256–268.
Jordan, A., F. Krüger, and S. Lerch (2017). Evaluating probabilistic forecasts with the R package scoringRules.
Laio, F. and S. Tamea (2007). Verification tools for probabilistic forecasts of continuous hydrological variables. Hydrology and Earth System Sciences Discussions 11(4), 1267–1277.
Machete, R. L. (2013). Contrasting probabilistic scoring rules. Journal of Statistical Planning and Inference 143(10), 1781–1790.
Mandelbrot, B. B. (1963). The variation of certain speculative prices. Journal of Business 36, 394–419.
Matheson, J. E. and R. L. Winkler (1976). Scoring rules for continuous probability distributions. Management Science 22(10), 1087–1096.
Merkle, E. C. and M. Steyvers (2013). Choosing a strictly proper scoring rule. Decision Analysis 10(4), 292–304.
Parry, M., A. P. Dawid, and S. Lauritzen (2012). Proper local scoring rules. The Annals of Statistics 40(1), 561–592.
Pelagatti, M. M. (2004). Dynamic conditional correlation with elliptical distributions.
Pérignon, C. and D. Smith (2010). The level and quality of Value-at-Risk disclosure by commercial banks. Journal of Banking and Finance 34(2), 362–377.
Pinson, P. and R. Girard (2012). Evaluating the quality of scenarios of short-term wind power generation. Applied Energy 96, 12–20.
Pinson, P. and J. Tastu (2013). Discrimination ability of the energy score. Technical report.
Scheuerer, M. and T. M. Hamill (2015). Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities. Monthly Weather Review 143(4), 1321–1334.
Staël von Holstein, C.-A. S. (1970). Measurement of subjective probability. Acta Psychologica 34, 146–159.
Székely, G. J. (2003). E-statistics: The energy of statistical samples. Bowling Green State University, Department of Mathematics and Statistics Technical Report 3(05), 1–18.
Tsui, A. K. and Q. Yu (1999). Constant conditional correlation in a bivariate GARCH model: Evidence from the stock markets of China. Mathematics and Computers in Simulation 48(4–6), 503–509.
Winkler, R. L. (1971). Probabilistic prediction: Some experimental results. Journal of the American Statistical Association 66(336), 675–685.
Winkler, R. L. (1977). Rewarding expertise in probability assessment. In Decision Making and Change in Human Affairs, pp. 127–140. Springer.
Winkler, R. L. (1996). Scoring rules and the evaluation of probabilities. Test 5(1), 1–60.
Ziel, F. and K. Berk (2019). Multivariate forecasting evaluation: On sensitive and strictly proper scoring rules. arXiv preprint arXiv:1910.07325.