On the Aggregation of Probability Assessments: Regularized Mixtures of Predictive Densities for Eurozone Inflation and Real Interest Rates
Francis X. Diebold, University of Pennsylvania
Minchul Shin, Federal Reserve Bank of Philadelphia
Boyuan Zhang, University of Pennsylvania

January 6, 2021
Abstract: We propose methods for constructing regularized mixtures of density forecasts. We explore a variety of objectives and regularization penalties, and we use them in a substantive exploration of Eurozone inflation and real interest rate density forecasts. All individual inflation forecasters (even the ex post best forecaster) are outperformed by our regularized mixtures. From the Great Recession onward, the optimal regularization tends to move density forecasts' probability mass from the centers to the tails, correcting for overconfidence.
Acknowledgments: For helpful comments and/or assistance we are grateful to Umut Akovali, Brendan Beare, Graham Elliott, Rob Engle, Domenico Giannone, Christian Hansen, Nour Meddahi, Mike McCracken, Marcelo Medeiros, James Mitchell, Joon Park, Hashem Pesaran, Youngki Shin, Mike West, and Ken Wolpin. We are also grateful to conference participants at EC, and seminar participants at KAEA and AMLEDS. The views expressed in this paper are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia or the Federal Reserve System.

Key words: Density forecasts, forecast combination, survey forecasts, shrinkage, model selection, regularization, partially egalitarian LASSO, model averaging, subset averaging
JEL codes: C2, C5, C8
Contact : [email protected], [email protected] a r X i v : . [ ec on . E M ] J a n Introduction
Forecast combination for a series y involves transforming a set of forecasts of y, f = (f_1, ..., f_K)′, into a "combined", and hopefully superior, forecast c(f). (Broad and insightful surveys include Timmermann (2006), Elliott and Timmermann (2016), and Aastveit et al. (2020).) Most of the huge literature focuses on linear combinations of univariate point forecasts, in which case we can write the combined forecast as c(f; ω) = ω′f, for combining weight vector ω = (ω_1, ..., ω_K)′. We typically proceed under quadratic loss, choosing the weights to minimize the sum of squared combined forecast errors (SSE),

SSE(c(f; ω), y) = Σ_{t=1}^T (y_t − ω′f_t)²,

where the sample of forecasts and realizations covers t = 1, ..., T. That is, we simply run the least-squares regression y → f_1, ..., f_K, so that

ω̂ = argmin_ω SSE(c(f; ω), y).

(We assume unbiased forecasts, so there is no need for an intercept.) This is the classic Bates and Granger (1969) and Granger and Ramanathan (1984) solution. Recent point forecast combination literature such as Diebold and Shin (2019), however, focuses instead on weights that solve a penalized estimation problem,

ω̂ = argmin_ω [ Objective(c(f; ω), y) + λ · Penalty(ω) ],   (1)

where the Lagrange multiplier λ governs the strength of the penalty. Maintaining quadratic loss we have

ω̂ = argmin_ω [ SSE(c(f; ω), y) + λ · Penalty(ω) ].

If λ = 0 we obviously obtain the Bates-Granger-Ramanathan solution, but the recent literature focuses on λ > 0. This produces regularization, which can be highly valuable in the finite samples often of practical relevance, particularly for economic survey forecasts, where the sample size T is often very small relative to the number of forecasters K. The precise form of the penalty determines the precise form of regularization, but in general it involves selection and/or shrinkage in directions guided by the penalty. For example, the famous LASSO penalty of Tibshirani (1996), Penalty(ω) = Σ_{k=1}^K |ω_k|, induces both selection to 0 and shrinkage toward 0.

In this paper we extend the idea of regularized forecast combination to the density forecast case. Density forecasting is important because predictive densities are complete probabilistic statements, which are always desirable, sometimes invaluable, and increasingly available. Density forecasts provide much more information, for example, than interval forecasts, which in turn provide more information than point forecasts.

We work with "linear opinion pools" (mixtures), as in the key contributions of Hall and Mitchell (2007), Geweke and Amisano (2011) and Amisano and Geweke (2017), but we consider a variety of estimation objectives, and most importantly, we introduce regularization constraints. Our regularized density forecast combinations are regularized mixtures, and important subtleties arise in constructing appropriate penalties for mixture regularization. In this paper we confront this situation and propose several solutions. Our methods are related to earlier and current work in both the econometrics and statistics literatures.
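The penalized point-forecast combination problem above, in its quadratic-loss form, is easy to sketch numerically. The snippet below computes Bates-Granger-Ramanathan least-squares weights and a ridge-penalized variant in closed form; the simulated data, the function name, and the use of a ridge rather than LASSO penalty (chosen only because ridge has a closed form) are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: target y and K = 3 unbiased point forecasts,
# each equal to the truth plus forecaster-specific noise.
T, K = 40, 3
y = rng.normal(size=T)
f = y[:, None] + rng.normal(size=(T, K)) * np.array([0.5, 1.0, 2.0])

def combining_weights(f, y, lam=0.0):
    """Minimize SSE(y - f @ w) + lam * ||w||^2.
    lam = 0 gives the Bates-Granger-Ramanathan least-squares weights;
    lam > 0 adds a simple ridge penalty shrinking the weights toward 0."""
    K = f.shape[1]
    return np.linalg.solve(f.T @ f + lam * np.eye(K), f.T @ y)

w_bgr = combining_weights(f, y)             # unregularized solution
w_ridge = combining_weights(f, y, lam=5.0)  # regularized solution
```

A LASSO penalty Σ|ω_k| would instead require a convex solver; the point of the sketch is only the structure of equation (1): an objective plus a λ-scaled penalty.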
A basic insight underlying our work and much of the recent literature is that Bayesian model averaging (BMA) as traditionally implemented is unattractive for combining density forecasts from misspecified models, because it fails to acknowledge misspecification (Diebold, 1991). That is, it assumes implicitly or explicitly that one of the models is "true", in which case the posterior predictive density asymptotically puts all probability on that model, so that BMA actually fails to average. Instead, once we acknowledge that all models are misspecified, we want a method capable of delivering a defensible and diversified portfolio (weighted average) of models, even asymptotically.

In one strand of econometrics literature this led Hall and Mitchell (2007), Brodie et al. (2009), Geweke and Amisano (2011), and Amisano and Geweke (2017), inter alia, to move away from BMA, working instead with linear opinion pools that optimize the log score. In a different strand of econometrics literature that also moved away from BMA, it led Billio et al. (2013) to treat density forecast combination as a nonlinear filtering problem, potentially with time-varying mixture weights. Parallel developments in the statistics literature now acknowledge misspecification, distinguishing between "M-open" vs. "M-complete" situations, and achieve diversified density forecast mixtures by "stacking" predictive densities (Yao et al., 2018), or via "dynamic Bayesian predictive synthesis" (McAlinn and West, 2019).

We pick up from there and proceed as follows. In section 2 we discuss objectives for mixture regularization, that is, various choices and issues associated with
Objective(c(f; ω), y). (The evaluation of interval forecasts, moreover, is fundamentally problematic, as detailed in recent work by Askanazi et al. (2018) and Brehmer and Gneiting (2020).) In section 3 we discuss penalties, that is, various choices and issues associated with
Penalty(ω), starting with the key unit simplex penalty, which we maintain throughout, and then introducing hybrid penalties that blend the simplex penalty with others. In section 4 we present Monte Carlo evidence on the efficacy of our procedures. In section 5 we present empirical results for European Central Bank (ECB) survey density forecasts of Eurozone inflation and real interest rates. We conclude in section 6.

2 Objectives

Consider a discrete density (histogram) forecast for a scalar variable y, which takes values in m = 1, ..., M bins, or categories. Denote the forecast by p = (p_1, ..., p_M)′. We start with density forecast "scores" for a single forecaster in a single period in sections 2.1-2.3, we extend the discussion to multiple forecasters and periods in section 2.4, and we provide additional discussion in section 2.5. (We focus largely on the discrete case, because it is the one of practical relevance for the survey forecasts that we eventually analyze. Parallel developments of course exist for the continuous case.)

The log score (Good, 1952; Winkler and Murphy, 1968) is

L(p, y) = −log( Σ_{m=1}^M p_m 1(y ∈ b_m) ),   (2)

where p_m is the probability assigned to bin b_m, and 1(y ∈ b_m) = 1 if y ∈ b_m and 0 otherwise. Ranking density forecasts by L, where smaller is better, reflects a preference for "small surprises". In a frequentist interpretation, L is just the (negative of the) log predictive density evaluated at the realization; that is, it is the (negative of the) predictive log likelihood. In a Bayesian interpretation, L is, desirably, a strictly proper scoring rule. (On scoring rules see Gneiting and Raftery (2007) and the references therein.)

The Brier score (Brier, 1950) is

B(p, y) = (1/M) Σ_{m=1}^M (p_m − 1(y ∈ b_m))².

The Brier score generalizes the idea of quadratic loss to density forecasts. Indeed B is effectively the same as the so-called "quadratic score",

Q(p, y) = −2( Σ_{m=1}^M p_m 1(y ∈ b_m) ) + Σ_{m=1}^M p_m²,   (3)

as noted by Czado et al. (2009). Rankings by Q must match rankings by B, because one is a positive monotonic transformation of the other. Both B and Q are strictly proper scoring rules under weak conditions.

The ranked score (Epstein, 1969) is

R(p, y) = Σ_{m=1}^M (P_m − 1(y ≤ b_m^+))²,

where P_m = Σ_{h=1}^m p(b_h) is the cdf of the density forecast p, defined on bins b_m = [b_m^−, b_m^+], m = 1, ..., M. R effectively proceeds by comparing realizations to the cdf forecast rather than the density forecast. R is strictly proper under weak conditions.

Let us now modify the notation to identify the specific forecaster, k. Thus far there has been no need, as we have considered just one forecaster, but shortly we will want to consider a set of forecasters, k = 1, ..., K. This is just a notational change, inserting "k" subscripts in the relevant places. In addition let us write the scores for a set of periods, t = 1, ..., T, rather than for just one period. This just involves summing over time. We have:

L_k(p_k, y) = Σ_{t=1}^T ( −log( Σ_{m=1}^M p_{mkt} 1(y_t ∈ b_m) ) ),  k = 1, ..., K

B_k(p_k, y) = Σ_{t=1}^T ( (1/M) Σ_{m=1}^M (p_{mkt} − 1(y_t ∈ b_m))² ),  k = 1, ..., K

R_k(p_k, y) = Σ_{t=1}^T ( Σ_{m=1}^M (P_{mkt} − 1(y_t ≤ b_m^+))² ),  k = 1, ..., K,

where p_k = (p_{k1}, ..., p_{kT}) is the sequence of density forecasts over time for forecaster k, and y = (y_1, ..., y_T) is the sequence of realizations over time.

Thus far we have implicitly emphasized the differences among the L, B, and R scores, but there are also many similarities.
B, for example, might appear linked to Gaussian environments, because it is a mean-squared-error analog, unlike L, which is based directly on the likelihood and therefore valid under great generality. But it is not; indeed its "Q version" (3),

Q = −2 exp(−L) + Σ_{m=1}^M p_m²,

reveals its close link to L. Moreover, B remains a strictly proper scoring rule regardless of distributional environment.

Now consider R. First, it is interesting to note that R is a generalization of absolute-error loss to density forecasts, just as B is a generalization of squared-error loss to density forecasts. In particular, Gneiting and Raftery (2007) show that R is driven by E_p|Y − y|:

R(p, y) = E_p|Y − y| − (1/2) E_p|Y − Y′|,

where Y and Y′ are independent copies of a random variable with distribution p.

Second, R's generalization of absolute-error loss (MAE) to density forecasts also makes it a generalization of the Diebold and Shin (2017) stochastic error distance (SED), because MAE and SED rankings must agree, and interestingly, SED is based on cdf divergences, just as is R.

Finally, although R might appear linked to a particular (Laplace) distributional environment, because it is an absolute-error analog, it is not. R is a strictly proper scoring rule regardless of distributional environment.

3 Penalties
Our goal is to produce mixtures of density forecasts,

c(ω) = Σ_{k=1}^K ω_k p_k,

with regularized mixture weights ω = (ω_1, ..., ω_K)′. We score mixtures in the same way as we scored individual density forecasts. The only difference is that we now score the mixture, c(ω), rather than an individual forecast, p_k.

Thus far we have focused on appropriate objectives for regularized mixture weight estimation, Objective(c(ω), y), and we emphasized use of strictly proper density forecast scoring rules. Now we consider appropriate constraints for regularized mixture weight estimation, Penalty(ω). As we shall see, imposition of the unit simplex constraint (i.e., imposing that mixture weights be non-negative and sum to one: ω_i ≥ 0 ∀ i and Σ_{i=1}^K ω_i = 1) provides essential regularization. In addition, however, simultaneous imposition of other regularization constraints may also be helpful.

The unit simplex constraint has two parts: non-negativity and sum-to-one. For point forecasts we can relax both parts and potentially achieve better combined point-forecasting performance, as recognized by Granger and Ramanathan (1984) and done routinely ever since. As first recognized in the pioneering work of Brodie et al. (2009), it turns out that density forecasts are different:
When combining density forecasts it is crucial to impose (both parts of) the simplex constraint.

First consider non-negativity. For point forecasts, allowing negative combining weights can improve performance, in a fashion analogous to allowing short positions in a financial asset portfolio. For density forecasts, in contrast, negative weights are unambiguously problematic, producing pathologies even if sum-to-one holds, because negative mixture weights can drive parts of the mixture density negative.

Now consider sum-to-one. Immediately, sum-to-one is required for the mixture combination to be a valid probability density. Moreover, and separately, the solution to the mixture weight estimation problem can be pathological without imposition of sum-to-one. (See also Yao et al. (2018), who briefly discuss issues related to the imposition of convex mixture weights.)
To see this, consider a simple example with two continuous density forecasts and a log score objective. We have

ω̂ = argmin_{ω_1, ω_2} [ −Σ_{t=1}^T log( ω_1 f_{1,t}(y_t) + ω_2 f_{2,t}(y_t) ) ],

where f_{k,t}(y_t) is forecaster k's density forecast evaluated at the realization, y_t. Without the sum-to-one constraint, the optimal solution is not well defined: either ω_1 → ∞ or ω_2 → ∞ leads to the smallest possible objective function value, because f_{1,t} and f_{2,t} are non-negative for any y_t.

For all of the above reasons, we henceforth impose both the non-negativity and sum-to-one parts of the simplex constraint. Interestingly, moreover, their imposition is not only necessary to eliminate pathologies, but also desirable to provide regularization. In particular, the simplex constraint clearly imposes a particular L_1 "parameter budget"; it is effectively a special case of LASSO.

Assembling everything, the basic regularized estimator with log score objective (Geweke and Amisano, 2011; Amisano and Geweke, 2017) is

argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) ) ]   (4)

s.t. ω_k ∈ [0, 1], Σ_{k=1}^K ω_k = 1.

(Other objectives may of course be used, as discussed earlier in section 2. Note that for a histogram forecast we have f_{k,t}(y_t) = Σ_{m=1}^M p_{mkt} 1(y_t ∈ b_m).) The methodological question remains, however, of how to provide additional, and more flexible, regularization, as does the substantive situation-specific empirical question of whether and where additional regularization is helpful. In the remainder of this paper we work toward answering both questions.

L_1 simplex regularization is a special case of L_1 LASSO regularization, corresponding to a specific choice of LASSO regularization parameter. Hence we cannot introduce additional L_1 regularization. Instead, we can introduce penalties that pull the K mixture weights away from 0, thereby "undoing" the selection implicit in the LASSO-style L_1 penalty, allowing for non-zero mixture weights on all forecasts.
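Equation (4) is a convex program over the unit simplex and is straightforward to solve numerically. A sketch using scipy follows; the data and names are illustrative, and the optional `lam` argument adds a ridge penalty toward equal weights (set `lam=0` for the pure simplex estimator of equation (4)).

```python
import numpy as np
from scipy.optimize import minimize

def simplex_weights(F, lam=0.0):
    """Estimate mixture weights under the unit simplex constraint.
    F is a T x K matrix with F[t, k] = f_{k,t}(y_t), forecaster k's predictive
    density evaluated at the period-t realization. lam > 0 adds a ridge
    penalty shrinking the weights toward the equal-weight vector 1/K."""
    T, K = F.shape
    def objective(w):
        return -np.sum(np.log(F @ w)) + lam * np.sum((w - 1.0 / K) ** 2)
    res = minimize(objective, np.full(K, 1.0 / K),
                   bounds=[(0.0, 1.0)] * K,
                   constraints=({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},),
                   method='SLSQP')
    return res.x

rng = np.random.default_rng(1)
F = rng.uniform(0.05, 1.0, size=(30, 4))  # fake evaluated densities
w = simplex_weights(F)                    # equation (4)
w_ridge = simplex_weights(F, lam=500.0)   # heavily shrunk toward equal weights
```

Because the negative log score is convex in ω and the constraint set is convex, any local solution the solver finds is global.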
We focus in particular on introducing shrinkage toward an equally-weighted mixture (i.e., shrinkage of all K weights toward 1/K).

Consider, for example, introducing L_2 regularization. Immediately, incorporating an L_2 penalty in addition to the simplex constraint, we have:

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ Σ_{k=1}^K (ω_k − 1/K)²  {L_2 penalty} ]   (5)

s.t. ω_k ∈ [0, 1], Σ_{k=1}^K ω_k = 1.

(For transparency we make most of our arguments using a log score objective.) This parallels the egalitarian ridge estimator of Diebold and Shin (2019), with an additional simplex constraint imposed. Note that, due to the simplex constraint, the solution may discard some forecasters (setting some weights approximately if not exactly to zero), but that situation becomes progressively less likely as λ grows, pulling the weights toward equality.

We can re-write (5) as

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ_1 ( Σ_{k=1}^K |ω_k| − 1 )  {L_1 simplex/LASSO penalty}  +  λ_2 Σ_{k=1}^K (ω_k − 1/K)²  {L_2 ridge penalty} ]   (6)

s.t. ω_k ∈ [0, 1],

which emphasizes that simplex+ridge regularization involves a combination of L_1 and L_2 penalties. Note, however, that we are not free to choose λ_1, because the sum-to-one constraint must bind; equations (5) and (6) instead coincide for "large enough" λ_1.

Equation (6) in turn reveals that simplex+ridge regularization is closely related to the elastic net, which uses

Penalty(ω) = α Σ_{k=1}^K |ω_k|  {L_1 LASSO penalty}  +  (1 − α) Σ_{k=1}^K ω_k²  {L_2 ridge penalty},

where α ∈ [0, 1] is a parameter, so that elastic net also involves combinations of L_1 and L_2 (that is, LASSO/simplex and ridge) penalties. Elastic net is well known to work well for regularization problems with many correlated predictors, exactly the situation of relevance for the large sets of economic forecasts on which we focus. (Equation (6) also reveals that simplex+ridge is closely related to an additive-penalty version of partially egalitarian LASSO (Diebold and Shin, 2019), but with the egalitarian penalty done in L_2 (ridge) form rather than L_1 (LASSO) form.)

Here we move from simplex+ridge to simplex plus a general penalty based on the divergence between two discrete probability measures. As we will see, the divergence penalty includes simplex+ridge as a special case, but it also introduces a rich variety of new possibilities. Write the estimator as

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ D(ω, ω*)  {penalty} ]   (7)

s.t. ω_k ∈ [0, 1], Σ_{k=1}^K ω_k = 1,

where D(ω, ω*) is a measure of divergence between ω and ω*. The key insight is that once the simplex restriction is imposed, ω can be interpreted as a discrete probability measure on {1, 2, ..., K}. If we let ω* be the uniform probability mass function with weight 1/K on each outcome, then the penalized optimization (7) shrinks the solution toward equal weights.

Maintaining uniform ω* throughout, but using different divergence measures D(ω, ω*), we obtain new regularized estimators. For example:

1. The L_2 norm,

D(ω, ω*) = Σ_{k=1}^K (ω_k − 1/K)²,

produces the simplex plus egalitarian ridge penalty given in (5) and (6).

2. The L_1 norm (total variation),

D(ω, ω*) = Σ_{k=1}^K |ω_k − 1/K|,

produces a simplex plus egalitarian LASSO penalty (Diebold and Shin, 2019).

3. Kullback-Leibler divergence (entropy) from ω to ω*,

D(ω, ω*) = −log K − (1/K) Σ_{k=1}^K log ω_k,

produces a "simplex+entropy" penalty, −Σ_{k=1}^K log ω_k.
In Appendix A we formally show that the simplex+entropy regularized estimator,

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ ( −Σ_{k=1}^K log ω_k )  {entropy penalty} ]   (8)

s.t. ω_k ∈ (0, 1], Σ_{k=1}^K ω_k = 1,

arises as the posterior mode in a Bayesian analysis with a log score (pseudo-)likelihood and a Dirichlet prior, which puts positive probability only on the unit simplex and also shrinks weights toward equality for a certain hyperparameter configuration.

4. Rényi divergence of order α from ω* to ω,

D_α(ω*||ω) = (1/(α − 1)) log( Σ_{k=1}^K (1/K)^α ω_k^{1−α} ),

encompasses various statistical divergences, including Kullback-Leibler divergence (α → 1) and Hellinger distance (α = 1/2), and can be used to produce still more interesting regularized estimators. (Rényi divergence, moreover, is equivalent to Cressie-Read discrepancy up to an affine transformation.)

All of the above divergence functions shrink the density mixture weights toward equality, and the penalized objective remains convex whenever D(ω, ω*) is a convex function of ω, because the log score and simplex constraints are convex functions of ω. This makes numerical computation of the estimator straightforward.

One might want a density forecast version of partially egalitarian penalization, as developed for the point forecast case by Diebold and Shin (2019). The additive version of partially egalitarian ridge or LASSO is possible, in the sense that the solution is computable in principle. To see this, consider the simplex-constrained partially egalitarian ridge problem:

ω̂ = argmin_w [ −Σ_{t=1}^T log( Σ_{k=1}^K w_k f_{k,t}(y_t) ) + λ Σ_{k=1}^K (w_k − 1/δ(w))² ]   (9)

s.t. w_k ∈ [0, 1], Σ_{k=1}^K w_k = 1,

where δ(w) is the number of non-zero elements in w. Computation of the solution proceeds as follows:

1. Define κ as the number of forecasters to be included.

2. For a particular value of κ (among κ = 1, 2, 3, ..., K), there are C(K, κ) possible combinations of forecasters.

3. For the j-th such combination (j = 1, 2, ..., C(K, κ)), we solve

L*(κ, j) = min_{w_j} [ −Σ_{t=1}^T log( Σ_{k=1}^K w_{jk} f_{k,t}(y_t) ) + λ Σ_{k=1}^K (w_{jk} − 1/δ(w))² ]

s.t. w_{jk} ∈ [0, 1], Σ_{k=1}^K w_{jk} = 1,

where w_{jk} is zero if the k-th forecaster is not selected in the j-th combination. In this case, some of the weights are forced to zero, so the penalty term reduces to

λ Σ_{k=1}^K (w_{jk} − 1/δ(w))² = λ Σ_{k∈N} (w_{jk} − 1/κ)²,

where N = {k : w_{jk} ≠ 0}. This is just partial egalitarian ridge for a particular set of forecasters.

4. The solution to the original partial egalitarian ridge problem is then argmin_{κ,j} L*(κ, j).

Unfortunately, however, the computational cost is huge, because we need to solve the penalized optimization n_K = Σ_{κ=1}^K C(K, κ) times. For example, when K = 20, n_K = 1,048,575. Fortunately, as λ → ∞ in equation (9), the partially egalitarian estimator converges to a direct subset averaging procedure in the spirit of Elliott (2011), which is simple to compute and automatically imposes the simplex constraint. The subset averaging idea is trivial: At each time, rolling forward, we simply find the historically best-performing average, and use it. A first variation is "best N-Average". At each time we determine the historically best-performing N-forecast average and use it. A second variation is "best ≤ N_max-Average". At each time we determine the historically best-performing ≤ N_max-forecast average and use it.

Subset averaging computation time can be substantial in principle, depending on K and N (or N_max). With K forecasters, finding the best N-average requires computing C(K, N) simple averages and then sorting them to determine the minimum, each period.
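The best N-average search is a brute-force enumeration over subsets; a sketch follows (data and names are illustrative), together with a check of the binomial counting involved.

```python
import numpy as np
from itertools import combinations
from math import comb

def best_n_average(F, n):
    """Return the size-n subset of forecasters whose equally weighted mixture
    has the best (largest) historical log score. F[t, k] = f_{k,t}(y_t)."""
    K = F.shape[1]
    def log_score(subset):
        return np.sum(np.log(F[:, list(subset)].mean(axis=1)))
    return max(combinations(range(K), n), key=log_score)

# Search cost: best N-average evaluates C(K, N) averages per period;
# best <= N_max-average evaluates the sum of C(K, n) for n = 1, ..., N_max.
cost = sum(comb(19, n) for n in range(1, 5))  # K = 19, N_max = 4
```

With K = 19 and N_max = 4 the per-period cost is 19 + 171 + 969 + 3876 = 5035 evaluated averages, which is trivial; the cost grows combinatorially in K and N_max.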
The per-period computational burden of best ≤ N_max-forecast averaging is still larger, because we now consider all subsets of size up to N_max rather than only subsets of size N. Fortunately, the relevant K and N_max are quite small in typical economic forecast combinations. In our subsequent empirical work, for example, N_max = 4 and K = 19; best ≤ 4-Average combination therefore requires evaluating and sorting just C(19, 1) + C(19, 2) + C(19, 3) + C(19, 4) = 5035 averages per period.

It bears emphasizing that our regularized mixtures of density forecasts are not just straightforward adaptations of existing methods of combining point forecasts. They differ in important and interesting ways.

1. The objective function changes. Things like "forecast errors" and the "sum of squared errors" are ill-defined in the density case. Appropriate density forecast scoring rules must be used. We have emphasized several, including the log score, the Brier score, and the ranked score.

2. The penalty function changes.

(a) When forming mixtures of density forecasts, the unit simplex constraint must be imposed, and it has the side benefit of providing some regularization.

(b) Mixtures of density forecasts admit new regularization penalties that are intimately connected to the maintained simplex constraint, by viewing the mixture weights as a discrete probability distribution. We introduced several such penalties, emphasizing Kullback-Leibler distance (entropy).

3. Finally (and we have not yet noted this), it is generally unnecessary to center regularization penalties around equal weights once the simplex constraint is imposed. Shrinkage toward equal weights will be induced either way.

Consider, for example, the ridge+simplex penalty in equation (5), and consider centering around equal weights, as written, vs centering around 0.
There is no difference, because

Σ_{k=1}^K (ω_k − 1/K)² = Σ_{k=1}^K ω_k² − (2/K) Σ_{k=1}^K ω_k + 1/K = Σ_{k=1}^K ω_k² − 1/K,   (10)

where the last equality is due to the sum-to-one restriction embedded in the simplex constraint. The intuition is simply that shrinkage toward 0 is impossible when maintaining the sum-to-one restriction, and equal weights are as close to 0 as one can get.
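The algebra in (10) is easy to check numerically: under sum-to-one, the equal-weights-centered penalty and the zero-centered penalty differ only by the constant 1/K, so they induce identical shrinkage. The weights below are an arbitrary simplex point, chosen purely for illustration.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.15, 0.05])  # arbitrary weights on the unit simplex
K = len(w)

centered = np.sum((w - 1.0 / K) ** 2)  # ridge penalty centered at equal weights
zero_centered = np.sum(w ** 2)         # ridge penalty centered at zero

# By equation (10), centered == zero_centered - 1/K whenever the weights sum to one.
```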
(In fact this equivalence holds as long as all weights are centered on the same value (it does not have to be 1/K) and the weights are constrained to sum to a bounded real value (it does not have to be 1).)

4 Monte Carlo

We now explore the potential of our regularized mixture estimators via a small Monte Carlo analysis. The data-generating process (DGP), which we assume to be known by the forecasters, is:

y_t = x_t + σ_y e_t,  e_t ~ iid N(0, 1)
x_t = φ_x x_{t−1} + σ_x v_t,  v_t ~ iid N(0, 1),   (11)

where e and v are orthogonal at all leads and lags. y is the variable to be forecast, and x_t can be interpreted as the long-run component of y_t. Individual forecasters receive heterogeneous independent noisy signals about x_t. For forecaster k we have

z_kt = x_t + σ_zk η_kt,  η_kt ~ iid N(0, 1),   (12)

where η_k and η_k′ are orthogonal at all leads and lags for all forecasters k and k′. Assume that forecasters have a strong belief that the 1-step-ahead predictive density is Gaussian with variance σ_y², but that they don't know its mean, and that forecaster k therefore uses z_kt, resulting in the predictive density

p_kt(y_{t+1}) = N(φ_x z_kt, σ_y²).   (13)

Note that in this environment, forecasters' predictive densities differ only by their locations (means).

Table 1: Average Log Scores, DGP 1

Regularization group            L̄        N        λ*
  Simplex                     -1.31     5.27       NA
  Simplex + Ridge             -1.15    20.00    2511.25
  Simplex + Entropy           -1.15    20.00       5.22
Subset Averages
  Best N-Average:   N=1       -2.64     1.00       NA
                    N=2       -1.59     2.00       NA
                    N=3       -1.37     3.00       NA
                    N=4       -1.29     4.00       NA
                    N=5       -1.23     5.00       NA
                    N=6       -1.22     6.00       NA
                    N=7       -1.21     7.00       NA
                    N=8       -1.20     8.00       NA
                    N=9       -1.18     9.00       NA
                    N=10      -1.18    10.00       NA
                    N=15      -1.16    15.00       NA
                    N=20      -1.15    20.00       NA
  Best ≤ N_max-Average: ...
Individual forecasters and simple average
  Best                        -0.24     1          NA
  95th Percentile             -0.53     1          NA
  Median                      -1.40     1          NA
  5th Percentile              -4.16     1          NA
  Worst                      -12.19     1          NA
  Simple K-Average            -1.15    20          NA

Notes: L̄ is the average log score, λ* is the ex post optimal penalty parameter, and K is the total number of forecasters. We perform 10,000 Monte Carlo replications.

Table 2: Average Log Scores, DGP 2

Regularization group            L̄        N        λ*
  Simplex                     -1.29     4.74       NA
  Simplex + Ridge             -1.19     8.65      15.00
  Simplex + Entropy           -1.27    20.00       0.05
Subset Averages
  Best N-Average:   N=1       -2.65     1.00       NA
                    N=2       -1.57     2.00       NA
                    N=3       -1.34     3.00       NA
                    N=4       -1.26     4.00       NA
                    N=5       -1.21     5.00       NA
                    N=6       -1.19     6.00       NA
                    N=7       -1.19     7.00       NA
                    N=8       -1.18     8.00       NA
                    N=9       -1.18     9.00       NA
                    N=10      -1.18    10.00       NA
                    N=15      -1.46    15.00       NA
                    N=20      -1.64    20.00       NA
  Best ≤ N_max-Average: ...
Individual forecasters and simple average
  Best                        -0.28     1          NA
  95th Percentile             -0.98     1          NA
  Median                      -3.79     1          NA
  5th Percentile             -32.69     1          NA
  Worst                     -182.42     1          NA
  Simple K-Average            -1.64    20          NA

Notes: L̄ is the average log score, λ* is the ex post optimal penalty parameter, and K is the total number of forecasters. We perform 10,000 Monte Carlo replications.

Figure 1: Monte Carlo Estimates of Expected Mixture Performance vs Penalty Strength. Notes: We perform 10,000 Monte Carlo replications.

We consider two parameterizations:

1. DGP 1: σ_zk = 1 for all k
2. DGP 2: σ_zk = 1 for k = 1, ..., K/2 and σ_zk = 5 for k = K/2 + 1, ..., K,

where each DGP has common parameters φ_x, σ_x = 1, and σ_y = 0.5. The two DGPs differ only by the quality of the signals that forecasters receive. Under DGP 1 the simple average should be preferred, because all signals are of the same quality, while under DGP 2 the linear opinion rule should be preferred (at least asymptotically, so that estimation error vanishes), giving more weight to forecasters k = 1, ..., K/2, who receive better signals.

To cohere with our subsequent empirical work, we explore K = T = 20. We generate data, estimate mixture weights, generate 1-step-ahead mixture densities, and evaluate them using the log score objective. We repeat this 10,000 times and compute the average log predictive score for several methods:

1. Simple Average
2. Simplex (equation (4))
3. Simplex+Ridge (equation (5))
4. Simplex+Entropy (equation (8))
5. Subset Averaging (equation (9) with λ → ∞).

For each of simplex+ridge and simplex+entropy, we explore 20 penalization strengths. For simplex+ridge, we choose 10 equispaced points in [1e-15, 10] and 10 equispaced points in [15, 10000]. For simplex+entropy we choose 10 equispaced points in [1e-15, 0.2] and 10 equispaced points in [0.3, 20].

Numerical results appear in Tables 1 and 2, in which we present the optimized average log score for each method under DGPs 1 and 2, respectively. Graphical results appear in Figure 1, in which we show how the optimized score varies with regularization penalty strength under DGPs 1 and 2, respectively. Under DGP 1, simple averaging performs well, and unregularized simplex performs poorly, as expected. As the strength of shrinkage gets heavier, the performance of both simplex+entropy and simplex+ridge improves monotonically until they perform as well as the simple average (full shrinkage). In addition, the performance of simplex+entropy improves more quickly than that of simplex+ridge as shrinkage strength increases, and dominates throughout.
Finally, subset averaging performs admirably under DGP 1, and as expected the optimal "subset" includes all forecasters.

Under DGP 2, simplex is expected to perform well, and simple averaging is expected to perform poorly. Simplex does indeed outperform simple averaging. Moreover, both simplex+ridge and simplex+entropy behave as expected. For little shrinkage (toward the left), their performance is similar to that of simplex, and for heavy shrinkage (toward the right), their performance is similar to that of the simple average. In between, for moderate amounts of shrinkage, they outperform simplex. In that region, regularized simplex improves on unregularized simplex, because the large unregularized simplex estimation error makes it likely that some relevant forecasters are dropped from the pool, and regularization brings them back. Importantly, subset averaging continues to perform admirably under DGP 2, but now the optimal average involves only 10 or so forecasters, as expected.

It is important to note that the performance documented in Tables 1 and 2, and in Figure 1, is almost surely not achievable in practice, because it requires ex post omniscience (use of the ex post optimal penalty parameter for the regularized estimators, and use of the ex post optimal N for the N-averages). Nevertheless the results are informative, because they document what can be achieved in principle, even if not in practice. Practical performance is an empirical matter, to which we now turn, in a detailed application to density forecasts of Eurozone inflation and real interest rates.

Figure 2: Individual and Average Density Forecasts, Eurozone Inflation, 2004Q4 (left) and 2018Q4 (right). Notes: We show the individual survey forecasts in gray (as frequency polygons), and the average forecast in orange (as a histogram).
Here we use our methods to construct regularized mixtures of density forecasts for Eurozone inflation and real interest rates. Expected inflation is a key driver of the bond market via its direct impact on nominal interest rates. Expected inflation may also negatively impact real growth, and hence the stock market, insofar as it "puts sand in the Walrasian gears", as classically emphasized by Bresciani-Turroni (1937). High inflation, moreover, also tends to be volatile inflation (Friedman, 1977), which adds additional sand. Expected inflation is also a key part of the ex ante real interest rate, which in turn is a key guide to intertemporal allocation and a key link between macroeconomic fundamentals and financial markets. From a variety of angles, then, inflation forecasts are central to financial markets, the macroeconomy, and the interface between them.
Following the pathbreaking work of Conflitti et al. (2015), we study inflation density forecasts from the European Central Bank Survey of Professional Forecasters (ECB-SPF), which has been undertaken since 1999. Participants are surveyed quarterly, in January, April, July, and October. Our forecast sample contains 83 quarterly surveys, starting in 1999Q1 and ending in 2019Q3. (See also Chen et al. (1986).)

As an entrée into the data, in Figure 2 we show all forecasts expressed as frequency polygons, and the simple average forecast expressed as a histogram, for two illustrative surveys (2004Q4, 2018Q4). Substantial differences are apparent at the two survey dates. The simple average forecast in 2004Q4, for example, puts 2.3% probability on the event that the inflation rate is less than 1%, whereas in 2018Q4 it puts 10.5% probability on the same event. Continuing, in the top panel of Figure 3 we show the complete time series of simple average forecasts. Again, large movements are evident over time, in both location and scale.

The precise Euro-area inflation forecast target is the percentage change in the Harmonised [sic] Index of Consumer Prices (HICP), for the year following the forecast. For example, when the survey was conducted in October 2017 (2017Q4), HICP inflation data were available up to September 2017, so the 2017Q4 survey asks for a forecast for the year from October 2017 through September 2018. Our realization sample, matched to our forecast sample, contains 83 quarterly observations, starting in December 1999 and ending in June 2020.

We will soon obtain mixture densities using the log score objective and several regularizations, including simplex, simplex+ridge, simplex+entropy, and subset averaging. Before proceeding to empirical results, however, we address several issues.
First, forecasters can enter and exit the survey pool. There are 103 unique forecasters between 1999Q1 and 2019Q4, and no forecaster appears in the pool continuously. Following Genre et al. (2013), we proceed by first excluding forecasters who miss more than four consecutive surveys, which leaves 18 forecasters. Then we interpolate the remaining gaps based on historical performance. (More precisely, we fill in the gaps in the first survey (t = 1, 1999Q1) with the average of non-missing forecasts from all other available forecasters. Then we calculate the ranked score for each forecaster, divide the forecasters into five mutually exclusive groups based on the score, and move to the second survey. At each of the following rounds (t = 2, 3, ..., T), we set the missing observations of a particular forecaster to the average of non-missing forecasts from her group, and then, using the full set of forecasts, we re-calculate ranked scores and update the group structure for use in the next round.)

(Data source: Eurostat, Harmonized Index of Consumer Prices: All Items for Euro area (19 countries) [CP0000EZ19M086NEST], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CP0000EZ19M086NEST.)

Time-Varying Bin Definitions

Second, outcome bin definitions vary over time. Although bin definitions have been stable for mid-range "standard" inflation values, extreme tail bins have become finer over time, as realizations fell in the tails. We proceed by merging extreme tail bins sufficiently to produce 11 bin definitions, fixed for the entire sample.

Finally, complications can arise with the log-score objective. Consider, for example, a survey forecast that assigns zero probability to the open-ended leftmost and rightmost bins, concentrating all probability mass on a few interior bins.
The zero probabilities assigned to the leftmost and rightmost bins obviously create a problem (infinite loss) for the log-score objective, due to its use of logs, if a realization occurs that was assigned zero probability.

Zero-probability realizations rarely, but occasionally, appear in our data. Sometimes they occur in edge bins (e.g., (4, ∞]), because forecasters sometimes fail to put positive probability on those bins. In addition to the edge-bin phenomenon, some forecasters' histograms are simply too sharp, and they sometimes put zero probability on an interior bin that eventually contains the realization.

One can address the log score "zero problem" by requiring the survey bin into which the realization falls to have been assigned at least some small probability, say 1%. We achieve this by assigning 1% probability to the bin containing the realization if it had originally been assigned 0, where the 1% is taken in equal shares from the bins originally assigned non-zero probability. (During our sample period the number of bins started at 9, peaked at 14 during the Great Recession, and eventually dropped to 12. One could of course switch to another objective, but the log score objective is simple and deservedly popular, which is why we have used it throughout this paper as a leading case for both our theory and Monte Carlo. We will continue to use it for our empirical work, where it is also deservedly popular, despite the zero problem.)

Regularized Mixtures

Notes: We show log scores for 1-year-ahead Eurozone inflation density forecasts, made quarterly, using a 20-quarter rolling estimation window. The burn-in sample is 1999Q1-2000Q4, and the forecast evaluation sample is 2001Q1-2019Q3 (75 quarters). There are 18 ECB-SPF density forecasters in the pool, plus a 19th forecaster whose predictive density is constant and uniform, for a total of 19 forecasters. L is the log score.

There are 18 ECB-SPF density forecasters in the pool. We also include a fictitious 19th forecaster whose predictive density is constant and uniform, in rough parallel to including a constant in point forecast combining regressions, for a total of 19 forecasters. Doing so appears desirable a priori in the spirit of Granger and Ramanathan (1984). (Moreover, it constrains the mixture density to put positive probability on each histogram bin as long as the uniform forecaster gets a non-zero mixture weight, in which case the earlier-discussed log score "zero problem" vanishes.)

Results appear in Table 3. Strikingly, each regularized mixture outperforms each ECB/SPF individual forecaster (even the ex post best forecaster). To get a feel for the size of the improvement, note that the log scores of the regularized mixtures are roughly 7% better than that of the ex post best individual forecaster.
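The 1% fix for the log-score zero problem described above can be sketched as follows. This is an illustrative helper of our own; it assumes each non-zero bin holds at least its equal share of the redistributed mass.

```python
import numpy as np

def fix_zero_bin(p, hit, floor=0.01):
    """Log-score 'zero problem' fix: if the bin containing the realization
    (index `hit`) was assigned zero probability, give it `floor` (1%),
    taken in equal shares from the bins originally assigned non-zero
    probability.  Assumes each donor bin has at least floor/len(donors)."""
    p = np.asarray(p, dtype=float).copy()
    if p[hit] > 0:
        return p
    donors = np.where(p > 0)[0]
    p[donors] -= floor / len(donors)
    p[hit] = floor
    return p
```

The adjusted histogram still sums to one, and only the realization's bin is touched; other zero bins are left at zero, exactly as in the text.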
Notes: We show density forecast mixtures expressed as frequency polygons. The forecasts are quarterly, from 1999Q1 to 2019Q3.

The differences among the regularized mixtures' log scores are always small, regardless of the regularization method. Simultaneously, both the log scores in Table 3 and the graphs in the bottom two panels of Figure 3 reveal that the Simplex and Best Average regularized mixtures are almost identical, suggesting that the Simplex solution is effectively dropping all but a few forecasts and simply averaging the survivors, producing something very close to a Best 4-Average.

The good performance of both Simplex and Best Average is particularly noteworthy insofar as they do not require tuning. That is, quite remarkably, the Simplex and Best Average regularizations perform as well as those requiring choice of tuning parameters (Simplex+Ridge and Simplex+Entropy), despite the fact that we evaluate the latter in Table 3 using ex post optimal tuning parameters, which is not feasible in real time. (Simplex+Entropy must select all 19 forecasters, because the entropy penalty −log(ω_k) → ∞ as ω_k → 0. All regularizations capable of selecting only a few forecasters do in fact select only a few. Strictly speaking, Best Average procedures require some slight tuning – a choice of N – although we are comfortable with simply always adopting N = 4.)

Notes: We show heat maps of differences between a regularized mixture (Simplex or Best Average) and the Simple Average mixture.

Figure 3 merits additional examination. If its middle and bottom panels reveal that the Simplex and Best Average regularized mixtures are nearly identical, a comparison of those panels with the top panel also reveals that (1) Simplex / Best Average regularization is nevertheless very different from a simple average, and (2) the effects of Simplex / Best Average regularization differ strikingly before and after the onset of the Great Recession. Before the onset of the Great Recession, Simplex / Best Average regularization moves probability mass upward toward higher inflation relative to simple averaging, particularly from the 1.0%-1.5% range to the 1.5%-2.5% range, mostly adjusting density forecast location and symmetry. After that, however, Simplex / Best Average regularization spreads probability mass from the center into both tails of the distribution, from the 1.0%-2.5% range outward to below 0.5% and above 3.0%, mostly adjusting density forecast dispersion and kurtosis. The regularization effects, and their structural shift at the onset of the Great Recession, are revealed even more clearly in the heat maps shown in Figure 4.

Figure 5: PIT Histograms, Eurozone Inflation

Notes: We show PIT histograms for the Simple Average and Best Average mixtures. Under correct calibration, PIT ∼ iid U(0,1).

It is informative to examine and compare probability integral transforms (PITs) for various mixtures. Diebold et al. (1998) consider the continuous case, in which the PIT is defined as $PIT_t = \int_{-\infty}^{y_t} p_t(u)\,du$, and show that correct conditional calibration of density forecasts implies that PIT ∼ iid U(0,1). Our histogram forecasts instead require a discrete PIT definition. To assess uniformity, and any patterns in deviations from uniformity, in Figure 5 we show histograms of the Czado et al. (2009) discrete PIT for the Simple Average and Best Average mixtures.

The PIT histograms reveal problems with the Simple Average mixture, which match our discussion of the two regimes in Figures 3 and 4, and which are ameliorated by the Best Average mixture. The Simple Average PIT histograms show noticeable deviations from uniformity in both subsamples, and the shapes of the deviations are very different. (There is no need to show the Simplex PIT histograms, because the Simplex and Best Average mixtures are nearly identical.)

Notes: We show density forecast mixtures expressed as frequency polygons. The forecasts are quarterly, from 1999Q1 to 2019Q3.
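For a histogram forecast the predictive CDF is a step function over bins, so the PIT can be illustrated with a simplified randomized variant: draw uniformly between the CDF values at the realization bin's edges. This helper is our own simplification for illustration; the figures in the paper use the nonrandomized discrete construction of Czado et al. (2009).

```python
import numpy as np

def discrete_pit(p, hit, rng=None):
    """Simplified randomized discrete PIT for a histogram forecast:
    draw uniformly between F(hit-1) and F(hit), where F is the forecast
    CDF over bins and `hit` is the realization's bin index.  Under correct
    calibration such draws are iid U(0,1)."""
    F = np.concatenate([[0.0], np.cumsum(p)])   # CDF at bin edges
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(F[hit], F[hit + 1])
```

Collecting these draws over the evaluation sample and plotting their histogram gives (an approximation to) the PIT histograms in Figure 5.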
In the first subsample, the Simple Average PIT histogram is highly skewed, as shown in the upper-left panel of Figure 5, with far too little probability mass near 0 and far too much near 1, again indicating too many large inflation realizations relative to the Simple Average density forecasts. Regularization, however, shifts the densities upward as discussed earlier, producing an improved (if still imperfect) Best Average PIT, as seen in the bottom-left panel of Figure 5.

In the second subsample, the Simple Average PIT histogram is more U-shaped, as shown in the upper-right panel of Figure 5. In this regime the regularization spreads out the densities as discussed earlier, better accommodating the tail realizations and producing an improved Best Average PIT, as seen in the bottom-right panel of Figure 5.

Finally, in parallel to our earlier examination of ECB/SPF inflation forecasts, we examine real interest rate density forecasts. The real interest rate density is a simple sign change and location shift of the inflation density:

$$f(r_{t,t+1}) = i_{t,t+1} - f(\pi_{t,t+1}), \qquad (15)$$

where r denotes the real interest rate, i denotes the nominal interest rate, and π denotes inflation. Real interest rate densities are of course driven by the inflation densities via equation (15), but it is nevertheless interesting to make the translation into the real cost of borrowing.

In Figure 6 we show the Simple Average and Best Average real interest rate density forecast mixtures.

Notes: We show a heat map of the difference between the Best Average mixture and the Simple Average mixture.

One is immediately struck by the high probability assigned to negative real rates through much of the sample. Nevertheless our earlier inflation patterns and lessons remain firmly intact, because real interest rate density forecasts are driven by inflation density forecasts. There are two clear real interest rate "regularization regimes," demarcated by the onset of the Great Recession. In the first, real interest rate densities are pushed downward, because, as discussed earlier, regularization pushes inflation densities upward. In the second, real interest rate densities are made more dispersed, because regularization makes inflation densities more dispersed.
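Equation (15) says the real-rate histogram is obtained from the inflation histogram by a sign flip and a location shift. A minimal sketch of the translation, using a hypothetical helper of our own that assumes known histogram bin edges and a known nominal rate:

```python
import numpy as np

def real_rate_density(pi_edges, pi_probs, i_nom):
    """Translate an inflation histogram into a real-rate histogram via
    r = i - pi (equation (15)): bin edges are sign-flipped and shifted by
    the nominal rate, and bin probabilities are reversed."""
    pi_edges = np.asarray(pi_edges, dtype=float)
    r_edges = i_nom - pi_edges[::-1]
    r_probs = np.asarray(pi_probs, dtype=float)[::-1]
    return r_edges, r_probs
```

Because the transformation is monotone, event probabilities simply carry over: the mass that inflation places above any threshold equals the mass the real rate places below the mirrored threshold, which is why the log score is invariant to the switch.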
We have proposed methods for constructing regularized mixtures of density forecasts, exploring a variety of objectives and penalties, which we used in a substantive exploration of Eurozone inflation and real interest rate survey density forecasts. (There is no need to show regularized estimation results for real interest rates, because the log score is invariant to the switch from inflation to real interest rate density forecasts defined by equation (15). There is also no need to include Simplex panels in Figures 6 and 7, because the Simplex and Best Average mixtures are nearly identical, or to show real interest rate PIT histograms, because they are exact mirror images of the inflation PIT histograms in Figure 5, as revealed by equation (15).)

All individual survey forecasters (even the ex post best forecaster) are outperformed by our regularized mixtures. The log scores of the Simplex and Best-Average mixtures, for example, are approximately 7% better than that of the ex post best individual forecaster, and 15% better than that of the ex post median forecaster. Before the Great Recession, regularization shifts inflation density locations upward toward higher inflation, and hence real interest rate density locations downward, correcting for bias. From the Great Recession onward, the regularization tends to move probability mass from the centers to the tails of both inflation and real interest rate density forecasts, correcting for overconfidence.

A variety of avenues for future research are possible. For example, one could use the probability integral transform as a regularized mixture estimation objective, minimizing a goodness-of-fit statistic (e.g., Kolmogorov-Smirnov) for testing the joint hypothesis of an iid U(0,1) probability integral transform.

Second, one could broaden our approach to allow for nonlinear mixtures as in recent work by Takanashi and McAlinn (2020), flexibly time-varying mixture weights as in Jore et al. (2010), and mixture weights that vary over regions of density support, as in Kapetanios et al. (2015).

Finally, although we did not emphasize regularization methods that require hyperparameter selection in our empirical work (Simplex+Ridge or Simplex+Entropy), they nevertheless represent interesting directions for future exploration. An obvious issue is feasible real-time hyperparameter selection.

Appendices
A Derivation of the Simplex+Entropy Regularized Estimator
The Simplex+Entropy estimator solves the optimization problem:

$$\hat{\omega} = \arg\min_{\omega} \underbrace{-\sum_{t=1}^{T} \log\left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right)}_{\text{log score}} + (\alpha - 1)\underbrace{\left(-\sum_{k=1}^{K} \log(\omega_k)\right)}_{\text{entropy penalty}} \qquad \text{(A.1)}$$

$$\text{s.t. } \omega_k \in (0,1), \quad \sum_{k=1}^{K} \omega_k = 1.$$

As we will show, it arises as the posterior mode in a Bayesian analysis with (1) log likelihood given by the log score, and (2) a Dirichlet prior, which puts positive probability only on the unit simplex but also shrinks toward equal weights for a certain hyperparameter configuration. In particular, the K-dimensional Dirichlet prior is governed by K hyperparameters, and when they are all equal, the prior mean is 1/K. Hence the simplex+entropy regularization (8) with equal prior hyperparameters does the same thing as simplex+ridge (5): impose the simplex and shrink toward equal weights.

A.1 Prior
The Dirichlet prior on $\omega = (\omega_1, \omega_2, ..., \omega_K)$ with hyperparameter $\alpha = (\alpha_1, \alpha_2, ..., \alpha_K)$ is

$$f_D(\omega; \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \omega_k^{\alpha_k - 1},$$

where $B(\cdot)$ is the beta function, $\alpha_k > 0 \; \forall k \in \{1, ..., K\}$, and the support of $\omega$ is $\omega_k \in (0,1)$ with $\sum_{k=1}^{K} \omega_k = 1$.

As is well known, the Dirichlet mean and variance are:

$$E(\omega_i) = \frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}, \qquad var(\omega_i) = \frac{\frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}\left(1 - \frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}\right)}{1 + \sum_{k=1}^{K} \alpha_k}.$$

Hence when $\alpha_1 = \alpha_2 = ... = \alpha_K = \alpha$, we have $E[\omega_k] = 1/K$ and $var(\omega_k) = \frac{K-1}{K^2(\alpha K + 1)}$, for all $k = 1, ..., K$. That is, the prior is centered on equal weights $1/K$, and $var(\omega_k) \to 0$ as $\alpha \to \infty$, so that $\alpha$ governs prior precision, with larger $\alpha$ producing heavier shrinkage toward $1/K$.

A.2 Posterior
The posterior distribution is

$$f_D(\omega \mid y; \alpha) \propto \underbrace{\prod_{t=1}^{T} \left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right)}_{\text{pseudo-likelihood}} \times \underbrace{\frac{1}{B(\alpha)} \prod_{k=1}^{K} \omega_k^{\alpha - 1}}_{\text{prior}},$$

so the log posterior is

$$\log f_D(\omega; \alpha) = \sum_{t=1}^{T} \log\left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right) + (\alpha - 1)\sum_{k=1}^{K} \log(\omega_k) - \log B(\alpha).$$

Because $B(\alpha)$ does not depend on $\omega$, we can drop the last term, so the posterior mode is

$$\hat{\omega} = \arg\min_{\omega} \underbrace{-\sum_{t=1}^{T} \log\left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right)}_{\text{log score}} + (\alpha - 1)\underbrace{\left(-\sum_{k=1}^{K} \log(\omega_k)\right)}_{\text{penalty}} \qquad \text{(A.2)}$$

$$\text{s.t. } \omega_k \in (0,1), \quad \sum_{k=1}^{K} \omega_k = 1.$$

A.3 Understanding the Penalty Term

One way to understand the penalty term is to recall the solution to the empirical likelihood maximization problem of Owen (2001),

$$\arg\min_{\omega} \left(-\sum_{k=1}^{K} \log(\omega_k)\right) \quad \text{s.t. } \omega_k \in (0,1), \quad \sum_{k=1}^{K} \omega_k = 1,$$

which is equal weights, $\omega_k = 1/K, \; \forall k$. Hence we see that the penalty part of (A.2) is minimized at $\omega_k = 1/K$, which yields a clear interpretation of the penalty term. Larger $\alpha$ means a tighter prior on $\omega$, with heavier shrinkage toward equal weights. Several interesting limiting cases emerge. First, for $\alpha \to \infty$, the penalty term dominates, and the optimal solution is equal weights. Second, for $\alpha \to 1$, the penalty term vanishes, and the optimal solution matches that of the optimal linear pool, with the simplex constraint imposed. Third, there is an upper bound for $var(\omega_k)$: as $\alpha \to 0$, $var(\omega_k) \to (K-1)/K^2$.

A.4 Remarks
1. The entropy regularization optimization problem is convex, because both the log score and the penalty are convex. A closed form may not exist for the regularized ω, but convexity makes numerical computation straightforward.

2. Entropy regularization has a clear parallel to ridge regularization. As is well known, ridge regularization emerges as the posterior mode in a Bayesian analysis with a Gaussian prior, and as we have shown, entropy regularization emerges as the posterior mode in a Bayesian analysis with a Dirichlet prior. Both regularizations, moreover, are governed by a single parameter linked to prior precision.

3. If the effects of the ridge and entropy penalties are very similar in certain respects (imposition of the simplex and shrinkage toward 1/K), their full Bayesian interpretations are nevertheless different. In particular, the ridge (Gaussian) and entropy (Dirichlet) priors differ, even if their means are the same (1/K), and so the posteriors differ. For α < 1, moreover, the "penalty" switches sign, pushing the weights toward the vertices of the simplex rather than toward equal weights.

References

Aastveit, K.A., J. Mitchell, F. Ravazzolo, and H.K. van Dijk (2020), "The Evolution of Forecast Density Combinations in Economics,"
Oxford Research Encyclopedia of Economics and Finance, in press.

Amisano, G. and J. Geweke (2017), "Prediction Using Several Macroeconomic Models," Review of Economics and Statistics, 99, 912–925.

Askanazi, R., F.X. Diebold, F. Schorfheide, and M. Shin (2018), "On the Comparison of Interval Forecasts," Journal of Time Series Analysis, 39, 953–965.

Bates, J.M. and C.W.J. Granger (1969), "The Combination of Forecasts," Operations Research Quarterly, 20, 451–468.

Billio, M., R. Casarin, F. Ravazzolo, and H.K. Van Dijk (2013), "Time-Varying Combinations of Predictive Densities Using Nonlinear Filtering," Journal of Econometrics, 177, 213–232.

Brehmer, J. and T. Gneiting (2020), "Scoring Interval Forecasts: Equal-Tailed, Shortest, and Modal Interval," arXiv:2007.05709 [math.ST], https://arxiv.org/abs/2007.05709.

Bresciani-Turroni, C. (1937), The Economics of Inflation, Allen and Unwin.

Brier, G.W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.

Brodie, J., I. Daubechies, C. De Mol, D. Giannone, and I. Loris (2009), "Sparse and Stable Markowitz Portfolios," Proceedings of the National Academy of Sciences, 106, 12267–12272.

Chen, N.-F., R. Roll, and S. Ross (1986), "Economic Forces and the Stock Market," Journal of Business, 383–403.

Conflitti, C., C. De Mol, and D. Giannone (2015), "Optimal Combination of Survey Forecasts," International Journal of Forecasting, 31, 1096–1103.

Czado, C., T. Gneiting, and L. Held (2009), "Predictive Model Assessment for Count Data," Biometrics, 65, 1254–1261.

Diebold, F.X. (1991), "A Note on Bayesian Forecast Combination Procedures," in P. Hackl and A. Westlund (eds.), Economic Structural Change: Analysis and Forecasting.

Diebold, F.X., T.A. Gunther, and A.S. Tay (1998), "Evaluating Density Forecasts with Applications to Financial Risk Management," International Economic Review, 39, 863–883.

Diebold, F.X. and M. Shin (2017), "Assessing Point Forecast Accuracy by Stochastic Error Distance," Econometric Reviews, 36, 588–598.

Diebold, F.X. and M. Shin (2019), "Machine Learning for Regularized Survey Forecast Combination: Partially-Egalitarian Lasso and its Derivatives," International Journal of Forecasting, 35, 1679–1691.

Elliott, G. (2011), "Averaging and the Optimal Combination of Forecasts," Manuscript, Department of Economics, UCSD.

Elliott, G. and A. Timmermann (2016), Economic Forecasting, Princeton University Press.

Epstein, E.S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.

Friedman, M. (1977), "Nobel Lecture: Inflation and Unemployment," Journal of Political Economy, 85, 451–472.

Genre, V., G. Kenny, A. Meyler, and A. Timmermann (2013), "Combining Expert Forecasts: Can Anything Beat the Simple Average?" International Journal of Forecasting, 29, 108–121.

Geweke, J. and G. Amisano (2011), "Optimal Prediction Pools," Journal of Econometrics, 164, 130–141.

Giannone, D., M. Lenza, and G.E. Primiceri (2017), "Economic Predictions with Big Data: The Illusion of Sparsity," CEPR Discussion Paper 12256.

Gneiting, T. and A.E. Raftery (2007), "Strictly Proper Scoring Rules, Prediction, and Estimation," Journal of the American Statistical Association, 102, 359–378.

Good, I.J. (1952), "Rational Decisions," Journal of the Royal Statistical Society: Series B, 14, 107–114.

Granger, C.W.J. and R. Ramanathan (1984), "Improved Methods of Combining Forecasts," Journal of Forecasting, 3, 197–204.

Hall, S.G. and J. Mitchell (2007), "Combining Density Forecasts," International Journal of Forecasting, 23, 1–13.

Jore, A.S., J. Mitchell, and S.P. Vahey (2010), "Combining Forecast Densities from VARs with Uncertain Instabilities," Journal of Applied Econometrics, 25, 621–634.

Kapetanios, G., J. Mitchell, S. Price, and N. Fawcett (2015), "Generalised Density Forecast Combinations," Journal of Econometrics, 188, 150–165.

McAlinn, K. and M. West (2019), "Dynamic Bayesian Predictive Synthesis in Time Series Forecasting," Journal of Econometrics, 210, 155–169.

Owen, A. (2001), Empirical Likelihood, Chapman and Hall.

Takanashi, K. and K. McAlinn (2020), "Predictive Properties and Minimaxity of Bayesian Predictive Synthesis," Preprint, RIKEN and Temple University.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Timmermann, A. (2006), "Forecast Combinations," in G. Elliott, C.W.J. Granger and A. Timmermann (eds.), Handbook of Economic Forecasting, North Holland, 135–196.

Winkler, R.L. and A.H. Murphy (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.

Yao, Y., A. Vehtari, D. Simpson, and A. Gelman (2018), "Using Stacking to Average Bayesian Predictive Distributions," Bayesian Analysis, 13, 917–1003.

Zou, H. and T. Hastie (2005), "Regularization and Variable Selection via the Elastic Net,"