On the Aggregation of Probability Assessments: Regularized Mixtures of Predictive Densities for Eurozone Inflation and Real Interest Rates
Francis X. Diebold, University of Pennsylvania
Minchul Shin, Federal Reserve Bank of Philadelphia
Boyuan Zhang, University of Pennsylvania

January 6, 2021
Abstract: We propose methods for constructing regularized mixtures of density forecasts. We explore a variety of objectives and regularization penalties, and we use them in a substantive exploration of Eurozone inflation and real interest rate density forecasts. All individual inflation forecasters (even the ex post best forecaster) are outperformed by our regularized mixtures. From the Great Recession onward, the optimal regularization tends to move density forecasts' probability mass from the centers to the tails, correcting for overconfidence.
Acknowledgments: For helpful comments and/or assistance we are grateful to Umut Akovali, Brendan Beare, Graham Elliott, Rob Engle, Domenico Giannone, Christian Hansen, Nour Meddahi, Mike McCracken, Marcelo Medeiros, James Mitchell, Joon Park, Hashem Pesaran, Youngki Shin, Mike West, and Ken Wolpin. We are also grateful to conference participants at EC, and seminar participants at KAEA and AMLEDS. The views expressed in this paper are solely those of the authors and do not necessarily reflect the views of the Federal Reserve Bank of Philadelphia or the Federal Reserve System.

Key words: Density forecasts, forecast combination, survey forecasts, shrinkage, model selection, regularization, partially egalitarian LASSO, model averaging, subset averaging
JEL codes: C2, C5, C8
Contact : [email protected], [email protected] a r X i v : . [ ec on . E M ] J a n Introduction
Forecast combination for a series y involves transforming a set of forecasts of y, f = (f_1, ..., f_K)′, into a "combined", and hopefully superior, forecast c(f). (Broad and insightful surveys include Timmermann (2006), Elliott and Timmermann (2016), and Aastveit et al. (2020).) Most of the huge literature focuses on linear combinations of univariate point forecasts, in which case we can write the combined forecast as c(f; ω) = ω′f, for combining weight vector ω = (ω_1, ..., ω_K)′. We typically proceed under quadratic loss, choosing the weights to minimize the sum of squared combined forecast errors (SSE),

SSE(c(f; ω), y) = Σ_{t=1}^T (y_t − ω′f_t)²,

where the sample of forecasts and realizations covers t = 1, ..., T. That is, we simply run the least-squares regression y → f_1, ..., f_K, so that

ω̂ = argmin_ω SSE(c(f; ω), y).

(We assume unbiased forecasts, so there is no need for an intercept.) This is the classic Bates and Granger (1969) and Granger and Ramanathan (1984) solution. Recent point forecast combination literature such as Diebold and Shin (2019), however, focuses instead on weights that solve a penalized estimation problem,

ω̂ = argmin_ω [ Objective(c(f; ω), y) + λ · Penalty(ω) ],   (1)

where the Lagrange multiplier λ governs the strength of the penalty. Maintaining quadratic loss we have

ω̂ = argmin_ω [ SSE(c(f; ω), y) + λ · Penalty(ω) ].

If λ = 0 we obviously obtain the Bates-Granger-Ramanathan solution, but the recent literature focuses on λ > 0. This produces regularization, which can be highly valuable in the finite samples often of practical relevance, particularly for economic survey forecasts, where the sample size T is often very small relative to the number of forecasters K. The precise form of the penalty determines the precise form of regularization, but in general it involves selection and/or shrinkage in directions guided by the penalty. For example, the famous LASSO penalty of Tibshirani (1996), Penalty(ω) = Σ_{k=1}^K |ω_k|, induces both selection to 0 and shrinkage toward 0.

In this paper we extend the idea of regularized forecast combination to the density forecast case. Density forecasting is important because predictive densities are complete probabilistic statements, which are always desirable, sometimes invaluable, and increasingly available. Density forecasts provide much more information, for example, than interval forecasts, which in turn provide more information than point forecasts.

We work with "linear opinion pools" (mixtures), as in the key contributions of Hall and Mitchell (2007), Geweke and Amisano (2011) and Amisano and Geweke (2017), but we consider a variety of estimation objectives, and most importantly, we introduce regularization constraints. Our regularized density forecast combinations are regularized mixtures, and important subtleties arise in constructing appropriate penalties for mixture regularization. In this paper we confront this situation and propose several solutions. Our methods are related to earlier and current work in both the econometrics and statistics literatures.
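The penalized point-forecast combination problem above, in its quadratic-loss form, is easy to sketch numerically. The snippet below computes Bates-Granger-Ramanathan least-squares weights and a ridge-penalized variant in closed form; the simulated data, the function name, and the use of a ridge rather than LASSO penalty (chosen only because ridge has a closed form) are ours, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: target y and K = 3 unbiased point forecasts,
# each equal to the truth plus forecaster-specific noise.
T, K = 40, 3
y = rng.normal(size=T)
f = y[:, None] + rng.normal(size=(T, K)) * np.array([0.5, 1.0, 2.0])

def combining_weights(f, y, lam=0.0):
    """Minimize SSE(y - f @ w) + lam * ||w||^2.
    lam = 0 gives the Bates-Granger-Ramanathan least-squares weights;
    lam > 0 adds a simple ridge penalty shrinking the weights toward 0."""
    K = f.shape[1]
    return np.linalg.solve(f.T @ f + lam * np.eye(K), f.T @ y)

w_bgr = combining_weights(f, y)             # unregularized solution
w_ridge = combining_weights(f, y, lam=5.0)  # regularized solution
```

A LASSO penalty Σ|ω_k| would instead require a convex solver; the point of the sketch is only the structure of equation (1): an objective plus a λ-scaled penalty.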
A basic insight underlying our work and much of the recent literature is that Bayesian model averaging (BMA) as traditionally implemented is unattractive for combining density forecasts from misspecified models, because it fails to acknowledge misspecification (Diebold, 1991). That is, it assumes implicitly or explicitly that one of the models is "true", in which case the posterior predictive density asymptotically puts all probability on that model, so that BMA actually fails to average. Instead, once we acknowledge that all models are misspecified, we want a method capable of delivering a defensible and diversified portfolio (weighted average) of models, even asymptotically.

In one strand of econometrics literature this led Hall and Mitchell (2007), Brodie et al. (2009), Geweke and Amisano (2011), and Amisano and Geweke (2017), inter alia, to move away from BMA, working instead with linear opinion pools that optimize the log score. In a different strand of econometrics literature that also moved away from BMA, it led Billio et al. (2013) to treat density forecast combination as a nonlinear filtering problem, potentially with time-varying mixture weights. Parallel developments in the statistics literature now acknowledge misspecification, distinguishing between "M-open" vs. "M-complete" situations, and achieve diversified density forecast mixtures by "stacking" predictive densities (Yao et al., 2018), or via "dynamic Bayesian predictive synthesis" (McAlinn and West, 2019).

We pick up from there and proceed as follows. In section 2 we discuss objectives for mixture regularization, that is, various choices and issues associated with
Objective(c(f; ω), y). (The evaluation of interval forecasts, moreover, is fundamentally problematic, as detailed in recent work by Askanazi et al. (2018) and Brehmer and Gneiting (2020).) In section 3 we discuss penalties, that is, various choices and issues associated with
Penalty(ω), starting with the key unit simplex penalty, which we maintain throughout, and then introducing hybrid penalties that blend the simplex penalty with others. In section 4 we present Monte Carlo evidence on the efficacy of our procedures. In section 5 we present empirical results for European Central Bank (ECB) survey density forecasts of Eurozone inflation and real interest rates. We conclude in section 6.

2 Objectives

Consider a discrete density (histogram) forecast for a scalar variable y, which takes values in m = 1, ..., M bins, or categories. Denote the forecast by p = (p_1, ..., p_M)′. We start with density forecast "scores" for a single forecaster in a single period in sections 2.1-2.3, we extend the discussion to multiple forecasters and periods in section 2.4, and we provide additional discussion in section 2.5. (We focus largely on the discrete case, because it is the one of practical relevance for the survey forecasts that we eventually analyze. Parallel developments of course exist for the continuous case.)

The log score (Good, 1952; Winkler and Murphy, 1968) is

L(p, y) = −log( Σ_{m=1}^M p_m 1(y ∈ b_m) ),   (2)

where p_m is the probability assigned to bin b_m, and 1(y ∈ b_m) = 1 if y ∈ b_m and 0 otherwise. Ranking density forecasts by L, where smaller is better, reflects a preference for "small surprises". In a frequentist interpretation, L is just the (negative of the) log predictive density evaluated at the realization; that is, it is the (negative of the) predictive log likelihood. In a Bayesian interpretation, L is, desirably, a strictly proper scoring rule. (On scoring rules see Gneiting and Raftery (2007) and the references therein.)

The Brier score (Brier, 1950) is

B(p, y) = (1/M) Σ_{m=1}^M (p_m − 1(y ∈ b_m))².

The Brier score generalizes the idea of quadratic loss to density forecasts. Indeed B is effectively the same as the so-called "quadratic score",

Q(p, y) = −2( Σ_{m=1}^M p_m 1(y ∈ b_m) ) + Σ_{m=1}^M p_m²,   (3)

as noted by Czado et al. (2009). Rankings by Q must match rankings by B, because one is a positive monotonic transformation of the other. Both B and Q are strictly proper scoring rules under weak conditions.

The ranked score (Epstein, 1969) is

R(p, y) = Σ_{m=1}^M (P_m − 1(y ≤ b_m^+))²,

where P_m = Σ_{h=1}^m p(b_h) is the cdf of the density forecast p, defined on bins b_m = [b_m^−, b_m^+], m = 1, ..., M. R effectively proceeds by comparing realizations to the cdf forecast rather than the density forecast. R is strictly proper under weak conditions.

Let us now modify the notation to identify the specific forecaster, k. Thus far there has been no need, as we have considered just one forecaster, but shortly we will want to consider a set of forecasters, k = 1, ..., K. This is just a notational change, inserting "k" subscripts in the relevant places. In addition let us write the scores for a set of periods, t = 1, ..., T, rather than for just one period. This just involves summing over time. We have:

L_k(p_k, y) = Σ_{t=1}^T ( −log( Σ_{m=1}^M p_{mkt} 1(y_t ∈ b_m) ) ),  k = 1, ..., K

B_k(p_k, y) = Σ_{t=1}^T ( (1/M) Σ_{m=1}^M (p_{mkt} − 1(y_t ∈ b_m))² ),  k = 1, ..., K

R_k(p_k, y) = Σ_{t=1}^T ( Σ_{m=1}^M (P_{mkt} − 1(y_t ≤ b_m^+))² ),  k = 1, ..., K,

where p_k = (p_{k1}, ..., p_{kT}) is the sequence of density forecasts over time for forecaster k, and y = (y_1, ..., y_T) is the sequence of realizations over time.

Thus far we have implicitly emphasized the differences among the L, B, and R scores, but there are also many similarities.
B, for example, might appear linked to Gaussian environments, because it is a mean-squared-error analog, unlike L, which is based directly on the likelihood and therefore valid under great generality. But it is not; indeed its "Q version" (3),

Q = −2 exp(−L) + Σ_{m=1}^M p_m²,

reveals its close link to L. Moreover, B remains a strictly proper scoring rule regardless of distributional environment.

Now consider R. First, it is interesting to note that R is a generalization of absolute-error loss to density forecasts, just as B is a generalization of squared-error loss to density forecasts. In particular, Gneiting and Raftery (2007) show that R is driven by E_p|Y − y|:

R(p, y) = E_p|Y − y| − (1/2) E_p|Y − Y′|,

where Y and Y′ are independent copies of a random variable with distribution p.

Second, R's generalization of absolute-error loss (MAE) to density forecasts also makes it a generalization of the Diebold and Shin (2017) stochastic error distance (SED), because MAE and SED rankings must agree, and interestingly, SED is based on cdf divergences, just as is R.

Finally, although R might appear linked to a particular (Laplace) distributional environment, because it is an absolute-error analog, it is not. R is a strictly proper scoring rule regardless of distributional environment.

3 Penalties
Our goal is to produce mixtures of density forecasts,

c(ω) = Σ_{k=1}^K ω_k p_k,

with regularized mixture weights ω = (ω_1, ..., ω_K)′. We score mixtures in the same way as we scored individual density forecasts. The only difference is that we now score the mixture, c(ω), rather than an individual forecast, p_k.

Thus far we have focused on appropriate objectives for regularized mixture weight estimation, Objective(c(ω), y), and we emphasized use of strictly proper density forecast scoring rules. Now we consider appropriate constraints for regularized mixture weight estimation, Penalty(ω). As we shall see, imposition of the unit simplex constraint (i.e., imposing that mixture weights be non-negative and sum to one: ω_i ≥ 0 ∀ i and Σ_{i=1}^K ω_i = 1) provides essential regularization. In addition, however, simultaneous imposition of other regularization constraints may also be helpful.

The unit simplex constraint has two parts: non-negativity and sum-to-one. For point forecasts we can relax both parts and potentially achieve better combined point-forecasting performance, as recognized by Granger and Ramanathan (1984) and done routinely ever since. As first recognized in the pioneering work of Brodie et al. (2009), it turns out that density forecasts are different:
When combining density forecasts it is crucial to impose (both parts of) the simplex constraint.

First consider non-negativity. For point forecasts, allowing negative combining weights can improve performance, in a fashion analogous to allowing short positions in a financial asset portfolio. For density forecasts, in contrast, negative weights are unambiguously problematic, producing pathologies even if sum-to-one holds, because negative mixture weights can drive parts of the mixture density negative.

Now consider sum-to-one. Immediately, sum-to-one is required for the mixture combination to be a valid probability density. Moreover, and separately, the solution to the mixture weight estimation problem can be pathological without imposition of sum-to-one. (See also Yao et al. (2018), who briefly discuss issues related to the imposition of convex mixture weights.)
To see this, consider a simple example with two continuous density forecasts and a log score objective. We have

ω̂ = argmin_{ω_1, ω_2} [ −Σ_{t=1}^T log( ω_1 f_{1,t}(y_t) + ω_2 f_{2,t}(y_t) ) ],

where f_{k,t}(y_t) is forecaster k's density forecast evaluated at the realization, y_t. Without the sum-to-one constraint, the optimal solution is not well defined: either ω_1 → ∞ or ω_2 → ∞ leads to the smallest possible objective function value, because f_{1,t} and f_{2,t} are non-negative for any y_t.

For all of the above reasons, we henceforth impose both the non-negativity and sum-to-one parts of the simplex constraint. Interestingly, moreover, their imposition is not only necessary to eliminate pathologies, but also desirable to provide regularization. In particular, the simplex constraint clearly imposes a particular L_1 "parameter budget"; it is effectively a special case of LASSO.

Assembling everything, the basic regularized estimator with log score objective (Geweke and Amisano, 2011; Amisano and Geweke, 2017) is

argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) ) ]   (4)

s.t. ω_k ∈ [0, 1], Σ_{k=1}^K ω_k = 1.

(Other objectives may of course be used, as discussed earlier in section 2. Note that for a histogram forecast we have f_{k,t}(y_t) = Σ_{m=1}^M p_{mkt} 1(y_t ∈ b_m).) The methodological question remains, however, of how to provide additional, and more flexible, regularization, as does the substantive situation-specific empirical question of whether and where additional regularization is helpful. In the remainder of this paper we work toward answering both questions.

L_1 simplex regularization is a special case of L_1 LASSO regularization, corresponding to a specific choice of LASSO regularization parameter. Hence we cannot introduce additional L_1 regularization. Instead, we can introduce penalties that pull the K mixture weights away from 0, thereby "undoing" the selection implicit in the LASSO-style L_1 penalty, allowing for non-zero mixture weights on all forecasts.
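Equation (4) is a convex program over the unit simplex and is straightforward to solve numerically. A sketch using scipy follows; the data and names are illustrative, and the optional `lam` argument adds a ridge penalty toward equal weights (set `lam=0` for the pure simplex estimator of equation (4)).

```python
import numpy as np
from scipy.optimize import minimize

def simplex_weights(F, lam=0.0):
    """Estimate mixture weights under the unit simplex constraint.
    F is a T x K matrix with F[t, k] = f_{k,t}(y_t), forecaster k's predictive
    density evaluated at the period-t realization. lam > 0 adds a ridge
    penalty shrinking the weights toward the equal-weight vector 1/K."""
    T, K = F.shape
    def objective(w):
        return -np.sum(np.log(F @ w)) + lam * np.sum((w - 1.0 / K) ** 2)
    res = minimize(objective, np.full(K, 1.0 / K),
                   bounds=[(0.0, 1.0)] * K,
                   constraints=({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},),
                   method='SLSQP')
    return res.x

rng = np.random.default_rng(1)
F = rng.uniform(0.05, 1.0, size=(30, 4))  # fake evaluated densities
w = simplex_weights(F)                    # equation (4)
w_ridge = simplex_weights(F, lam=500.0)   # heavily shrunk toward equal weights
```

Because the negative log score is convex in ω and the constraint set is convex, any local solution the solver finds is global.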
We focus in particular on introducing shrinkage toward an equally-weighted mixture (i.e., shrinkage of all K weights toward 1/K).

Consider, for example, introducing L_2 regularization. Immediately, incorporating an L_2 penalty in addition to the simplex constraint, we have:

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ Σ_{k=1}^K (ω_k − 1/K)²  {L_2 penalty} ]   (5)

s.t. ω_k ∈ [0, 1], Σ_{k=1}^K ω_k = 1.

(For transparency we make most of our arguments using a log score objective.) This parallels the egalitarian ridge estimator of Diebold and Shin (2019), with an additional simplex constraint imposed. Note that, due to the simplex constraint, the solution may discard some forecasters (setting some weights approximately if not exactly to zero), but that situation becomes progressively less likely as λ grows, pulling the weights toward equality.

We can re-write (5) as

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ_1 ( Σ_{k=1}^K |ω_k| − 1 )  {L_1 simplex/LASSO penalty}  +  λ_2 Σ_{k=1}^K (ω_k − 1/K)²  {L_2 ridge penalty} ]   (6)

s.t. ω_k ∈ [0, 1],

which emphasizes that simplex+ridge regularization involves a combination of L_1 and L_2 penalties. Note, however, that we are not free to choose λ_1, because the sum-to-one constraint must bind; equations (5) and (6) instead coincide for "large enough" λ_1.

Equation (6) in turn reveals that simplex+ridge regularization is closely related to the elastic net, which uses

Penalty(ω) = α Σ_{k=1}^K |ω_k|  {L_1 LASSO penalty}  +  (1 − α) Σ_{k=1}^K ω_k²  {L_2 ridge penalty},

where α ∈ [0, 1] is a parameter, so that elastic net also involves combinations of L_1 and L_2 (that is, LASSO/simplex and ridge) penalties. Elastic net is well known to work well for regularization problems with many correlated predictors, exactly the situation of relevance for the large sets of economic forecasts on which we focus. (Equation (6) also reveals that simplex+ridge is closely related to an additive-penalty version of partially egalitarian LASSO (Diebold and Shin, 2019), but with the egalitarian penalty done in L_2 (ridge) form rather than L_1 (LASSO) form.)

Here we move from simplex+ridge to simplex plus a general penalty based on the divergence between two discrete probability measures. As we will see, the divergence penalty includes simplex+ridge as a special case, but it also introduces a rich variety of new possibilities. Write the estimator as

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ D(ω, ω*)  {penalty} ]   (7)

s.t. ω_k ∈ [0, 1], Σ_{k=1}^K ω_k = 1,

where D(ω, ω*) is a measure of divergence between ω and ω*. The key insight is that once the simplex restriction is imposed, ω can be interpreted as a discrete probability measure on {1, 2, ..., K}. If we let ω* be the uniform probability mass function with weight 1/K on each outcome, then the penalized optimization (7) shrinks the solution toward equal weights.

Maintaining uniform ω* throughout, but using different divergence measures D(ω, ω*), we obtain new regularized estimators. For example:

1. The L_2 norm,

D(ω, ω*) = Σ_{k=1}^K (ω_k − 1/K)²,

produces the simplex plus egalitarian ridge penalty given in (5) and (6).

2. The L_1 norm (total variation),

D(ω, ω*) = Σ_{k=1}^K |ω_k − 1/K|,

produces a simplex plus egalitarian LASSO penalty (Diebold and Shin, 2019).

3. Kullback-Leibler divergence (entropy) from ω to ω*,

D(ω, ω*) = −log K − (1/K) Σ_{k=1}^K log ω_k,

produces a "simplex+entropy" penalty, −Σ_{k=1}^K log ω_k.
In Appendix A we formally show that the simplex+entropy regularized estimator,

ω̂ = argmin_ω [ −Σ_{t=1}^T log( Σ_{k=1}^K ω_k f_{k,t}(y_t) )  {log score}  +  λ ( −Σ_{k=1}^K log ω_k )  {entropy penalty} ]   (8)

s.t. ω_k ∈ (0, 1], Σ_{k=1}^K ω_k = 1,

arises as the posterior mode in a Bayesian analysis with a log score (pseudo-)likelihood and a Dirichlet prior, which puts positive probability only on the unit simplex and also shrinks weights toward equality for a certain hyperparameter configuration.

4. Rényi divergence of order α from ω* to ω,

D_α(ω*||ω) = (1/(α − 1)) log( Σ_{k=1}^K (1/K)^α ω_k^{1−α} ),

encompasses various statistical divergences, including Kullback-Leibler divergence (α → 1) and Hellinger distance (α = 1/2), and can be used to produce still more interesting regularized estimators. (Rényi divergence, moreover, is equivalent to Cressie-Read discrepancy up to an affine transformation.)

All of the above divergence functions shrink the density mixture weights toward equality, and the penalized objective remains convex whenever D(ω, ω*) is a convex function of ω, because the log score and simplex constraints are convex functions of ω. This makes numerical computation of the estimator straightforward.

One might want a density forecast version of partially egalitarian penalization, as developed for the point forecast case by Diebold and Shin (2019). The additive version of partially egalitarian ridge or LASSO is possible, in the sense that the solution is computable in principle. To see this, consider the simplex-constrained partially egalitarian ridge problem:

ω̂ = argmin_w [ −Σ_{t=1}^T log( Σ_{k=1}^K w_k f_{k,t}(y_t) ) + λ Σ_{k=1}^K (w_k − 1/δ(w))² ]   (9)

s.t. w_k ∈ [0, 1], Σ_{k=1}^K w_k = 1,

where δ(w) is the number of non-zero elements in w. Computation of the solution proceeds as follows:

1. Define κ as the number of forecasters to be included.

2. For a particular value of κ (among κ = 1, 2, 3, ..., K), there are C(K, κ) possible combinations of forecasters.

3. For the j-th such combination (j = 1, 2, ..., C(K, κ)), we solve

L*(κ, j) = min_{w_j} [ −Σ_{t=1}^T log( Σ_{k=1}^K w_{jk} f_{k,t}(y_t) ) + λ Σ_{k=1}^K (w_{jk} − 1/δ(w))² ]

s.t. w_{jk} ∈ [0, 1], Σ_{k=1}^K w_{jk} = 1,

where w_{jk} is zero if the k-th forecaster is not selected in the j-th combination. In this case, some of the weights are forced to zero, so the penalty term reduces to

λ Σ_{k=1}^K (w_{jk} − 1/δ(w))² = λ Σ_{k∈N} (w_{jk} − 1/κ)²,

where N = {k : w_{jk} ≠ 0}. This is just partial egalitarian ridge for a particular set of forecasters.

4. The solution to the original partial egalitarian ridge problem is then argmin_{κ,j} L*(κ, j).

Unfortunately, however, the computational cost is huge, because we need to solve the penalized optimization n_K = Σ_{κ=1}^K C(K, κ) times. For example, when K = 20, n_K = 1,048,575. Fortunately, as λ → ∞ in equation (9), the partially egalitarian estimator converges to a direct subset averaging procedure in the spirit of Elliott (2011), which is simple to compute and automatically imposes the simplex constraint. The subset averaging idea is trivial: At each time, rolling forward, we simply find the historically best-performing average, and use it. A first variation is "best N-Average". At each time we determine the historically best-performing N-forecast average and use it. A second variation is "best ≤ N_max-Average". At each time we determine the historically best-performing ≤ N_max-forecast average and use it.

Subset averaging computation time can be substantial in principle, depending on K and N (or N_max). With K forecasters, finding the best N-average requires computing C(K, N) simple averages and then sorting them to determine the minimum, each period.
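The best N-average search is a brute-force enumeration over subsets; a sketch follows (data and names are illustrative), together with a check of the binomial counting involved.

```python
import numpy as np
from itertools import combinations
from math import comb

def best_n_average(F, n):
    """Return the size-n subset of forecasters whose equally weighted mixture
    has the best (largest) historical log score. F[t, k] = f_{k,t}(y_t)."""
    K = F.shape[1]
    def log_score(subset):
        return np.sum(np.log(F[:, list(subset)].mean(axis=1)))
    return max(combinations(range(K), n), key=log_score)

# Search cost: best N-average evaluates C(K, N) averages per period;
# best <= N_max-average evaluates the sum of C(K, n) for n = 1, ..., N_max.
cost = sum(comb(19, n) for n in range(1, 5))  # K = 19, N_max = 4
```

With K = 19 and N_max = 4 the per-period cost is 19 + 171 + 969 + 3876 = 5035 evaluated averages, which is trivial; the cost grows combinatorially in K and N_max.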
The per-period computational burden of best ≤ N_max-forecast averaging is still larger, because we now consider all subsets of size up to N_max rather than only subsets of size N. Fortunately, the relevant K and N_max are quite small in typical economic forecast combinations. In our subsequent empirical work, for example, N_max = 4 and K = 19; best ≤ 4-Average combination therefore requires evaluating and sorting just C(19, 1) + C(19, 2) + C(19, 3) + C(19, 4) = 5035 averages per period.

It bears emphasizing that our regularized mixtures of density forecasts are not just straightforward adaptations of existing methods of combining point forecasts. They differ in important and interesting ways.

1. The objective function changes. Things like "forecast errors" and the "sum of squared errors" are ill-defined in the density case. Appropriate density forecast scoring rules must be used. We have emphasized several, including the log score, the Brier score, and the ranked score.

2. The penalty function changes.

(a) When forming mixtures of density forecasts, the unit simplex constraint must be imposed, and it has the side benefit of providing some regularization.

(b) Mixtures of density forecasts admit new regularization penalties that are intimately connected to the maintained simplex constraint, by viewing the mixture weights as a discrete probability distribution. We introduced several such penalties, emphasizing Kullback-Leibler distance (entropy).

3. Finally (and we have not yet noted this), it is generally unnecessary to center regularization penalties around equal weights once the simplex constraint is imposed. Shrinkage toward equal weights will be induced either way.

Consider, for example, the ridge+simplex penalty in equation (5), and consider centering around equal weights, as written, vs centering around 0.
There is no difference, because

Σ_{k=1}^K (ω_k − 1/K)² = Σ_{k=1}^K ω_k² − (2/K) Σ_{k=1}^K ω_k + 1/K = Σ_{k=1}^K ω_k² − 1/K,   (10)

where the last equality is due to the sum-to-one restriction embedded in the simplex constraint. The intuition is simply that shrinkage toward 0 is impossible when maintaining the sum-to-one restriction, and equal weights are as close to 0 as one can get.
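The algebra in (10) is easy to check numerically: under sum-to-one, the equal-weights-centered penalty and the zero-centered penalty differ only by the constant 1/K, so they induce identical shrinkage. The weights below are an arbitrary simplex point, chosen purely for illustration.

```python
import numpy as np

w = np.array([0.5, 0.3, 0.15, 0.05])  # arbitrary weights on the unit simplex
K = len(w)

centered = np.sum((w - 1.0 / K) ** 2)  # ridge penalty centered at equal weights
zero_centered = np.sum(w ** 2)         # ridge penalty centered at zero

# By equation (10), centered == zero_centered - 1/K whenever the weights sum to one.
```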
(In fact this equivalence holds as long as all weights are centered on the same value (it does not have to be 1/K) and the weights are constrained to sum to a bounded real value (it does not have to be 1).)

4 Monte Carlo

We now explore the potential of our regularized mixture estimators via a small Monte Carlo analysis. The data-generating process (DGP), which we assume to be known by the forecasters, is:

y_t = x_t + σ_y e_t,  e_t ~ iid N(0, 1)
x_t = φ_x x_{t−1} + σ_x v_t,  v_t ~ iid N(0, 1),   (11)

where e and v are orthogonal at all leads and lags. y is the variable to be forecast, and x_t can be interpreted as the long-run component of y_t. Individual forecasters receive heterogeneous independent noisy signals about x_t. For forecaster k we have

z_kt = x_t + σ_zk η_kt,  η_kt ~ iid N(0, 1),   (12)

where η_k and η_k′ are orthogonal at all leads and lags for all forecasters k and k′. Assume that forecasters have a strong belief that the 1-step-ahead predictive density is Gaussian with variance σ_y², but that they don't know its mean, and that forecaster k therefore uses z_kt, resulting in the predictive density

p_kt(y_{t+1}) = N(φ_x z_kt, σ_y²).   (13)

Note that in this environment, forecasters' predictive densities differ only by their locations (means).

Table 1: Average Log Scores, DGP 1

Regularization group            L̄        N        λ*
  Simplex                     -1.31     5.27       NA
  Simplex + Ridge             -1.15    20.00    2511.25
  Simplex + Entropy           -1.15    20.00       5.22
Subset Averages
  Best N-Average:   N=1       -2.64     1.00       NA
                    N=2       -1.59     2.00       NA
                    N=3       -1.37     3.00       NA
                    N=4       -1.29     4.00       NA
                    N=5       -1.23     5.00       NA
                    N=6       -1.22     6.00       NA
                    N=7       -1.21     7.00       NA
                    N=8       -1.20     8.00       NA
                    N=9       -1.18     9.00       NA
                    N=10      -1.18    10.00       NA
                    N=15      -1.16    15.00       NA
                    N=20      -1.15    20.00       NA
  Best ≤ N_max-Average: ...
Individual forecasters and simple average
  Best                        -0.24     1          NA
  95th Percentile             -0.53     1          NA
  Median                      -1.40     1          NA
  5th Percentile              -4.16     1          NA
  Worst                      -12.19     1          NA
  Simple K-Average            -1.15    20          NA

Notes: L̄ is the average log score, λ* is the ex post optimal penalty parameter, and K is the total number of forecasters. We perform 10,000 Monte Carlo replications.

Table 2: Average Log Scores, DGP 2

Regularization group            L̄        N        λ*
  Simplex                     -1.29     4.74       NA
  Simplex + Ridge             -1.19     8.65      15.00
  Simplex + Entropy           -1.27    20.00       0.05
Subset Averages
  Best N-Average:   N=1       -2.65     1.00       NA
                    N=2       -1.57     2.00       NA
                    N=3       -1.34     3.00       NA
                    N=4       -1.26     4.00       NA
                    N=5       -1.21     5.00       NA
                    N=6       -1.19     6.00       NA
                    N=7       -1.19     7.00       NA
                    N=8       -1.18     8.00       NA
                    N=9       -1.18     9.00       NA
                    N=10      -1.18    10.00       NA
                    N=15      -1.46    15.00       NA
                    N=20      -1.64    20.00       NA
  Best ≤ N_max-Average: ...
Individual forecasters and simple average
  Best                        -0.28     1          NA
  95th Percentile             -0.98     1          NA
  Median                      -3.79     1          NA
  5th Percentile             -32.69     1          NA
  Worst                     -182.42     1          NA
  Simple K-Average            -1.64    20          NA

Notes: L̄ is the average log score, λ* is the ex post optimal penalty parameter, and K is the total number of forecasters. We perform 10,000 Monte Carlo replications.

Figure 1: Monte Carlo Estimates of Expected Mixture Performance vs Penalty Strength. Notes: We perform 10,000 Monte Carlo replications.

We consider two parameterizations:

1. DGP 1: σ_zk = 1 for all k
2. DGP 2: σ_zk = 1 for k = 1, ..., K/2 and σ_zk = 5 for k = K/2 + 1, ..., K,

where each DGP has common parameters φ_x, σ_x = 1, and σ_y = 0.5. The two DGPs differ only by the quality of the signals that forecasters receive. Under DGP 1 the simple average should be preferred, because all signals are of the same quality, while under DGP 2 the linear opinion rule should be preferred (at least asymptotically, so that estimation error vanishes), giving more weight to forecasters k = 1, ..., K/2, who receive better signals.

To cohere with our subsequent empirical work, we explore K = T = 20. We generate data, estimate mixture weights, generate 1-step-ahead mixture densities, and evaluate them using the log score objective. We repeat this 10,000 times and compute the average log predictive score for several methods:

1. Simple Average
2. Simplex (equation (4))
3. Simplex+Ridge (equation (5))
4. Simplex+Entropy (equation (8))
5. Subset Averaging (equation (9) with λ → ∞).

For each of simplex+ridge and simplex+entropy, we explore 20 penalization strengths. For simplex+ridge, we choose 10 equispaced points in [1e-15, 10] and 10 equispaced points in [15, 10000]. For simplex+entropy we choose 10 equispaced points in [1e-15, 0.2] and 10 equispaced points in [0.3, 20].

Numerical results appear in Tables 1 and 2, in which we present the optimized average log score for each method under DGPs 1 and 2, respectively. Graphical results appear in Figure 1, in which we show how the optimized score varies with regularization penalty strength under DGPs 1 and 2, respectively. Under DGP 1, simple averaging performs well, and unregularized simplex performs poorly, as expected. As the strength of shrinkage gets heavier, the performance of both simplex+entropy and simplex+ridge improves monotonically until they perform as well as the simple average (full shrinkage). In addition, the performance of simplex+entropy improves more quickly than that of simplex+ridge as shrinkage strength increases, and dominates throughout.
Finally, subset averaging performs admirably under DGP 1, and as expected the optimal "subset" includes all forecasters.

Under DGP 2, simplex is expected to perform well, and simple averaging is expected to perform poorly. Simplex does indeed outperform simple averaging. Moreover, both simplex+ridge and simplex+entropy behave as expected. For little shrinkage (toward the left), their performance is similar to that of simplex, and for heavy shrinkage (toward the right), their performance is similar to that of the simple average. In between, for moderate amounts of shrinkage, they outperform simplex. In that region, regularized simplex improves on unregularized simplex, because the large unregularized simplex estimation error makes it likely that some relevant forecasters are dropped from the pool, and regularization brings them back. Importantly, subset averaging continues to perform admirably under DGP 2, but now the optimal average involves only 10 or so forecasters, as expected.

It is important to note that the performance documented in Tables 1 and 2, and in Figure 1, is almost surely not achievable in practice, because it requires ex post omniscience (use of the ex post optimal penalty parameter for the regularized estimators, and use of the ex post optimal N for the N-averages). Nevertheless the results are informative, because they document what can be achieved in principle, even if not in practice. Practical performance is an empirical matter, to which we now turn, in a detailed application to density forecasts of Eurozone inflation and real interest rates.

Figure 2: Individual and Average Density Forecasts, Eurozone Inflation, 2004Q4 (left) and 2018Q4 (right). Notes: We show the individual survey forecasts in gray (as frequency polygons), and the average forecast in orange (as a histogram).
Here we use our methods to construct regularized mixtures of density forecasts for Eurozone inflation and real interest rates. Expected inflation is a key driver of the bond market via its direct impact on nominal interest rates. Expected inflation may also negatively impact real growth, and hence the stock market, insofar as it "puts sand in the Walrasian gears", as classically emphasized by Bresciani-Turroni (1937). High inflation, moreover, also tends to be volatile inflation (Friedman, 1977), which adds additional sand. Expected inflation is also a key part of the ex ante real interest rate, which in turn is a key guide to intertemporal allocation and a key link between macroeconomic fundamentals and financial markets. From a variety of angles, then, inflation forecasts are central to financial markets, the macroeconomy, and the interface between them.
Following the pathbreaking work of Conflitti et al. (2015), we study inflation density forecasts from the European Central Bank Survey of Professional Forecasters (ECB-SPF), which has been undertaken since 1999. Participants are surveyed quarterly, in January, April, July, and October. Our forecast sample contains 83 quarterly surveys, starting in 1999Q1 and ending in 2019Q3. (See also Chen et al. (1986).)

As an entrée into the data, in Figure 2 we show all forecasts expressed as frequency polygons, and the simple average forecast expressed as a histogram, for two illustrative surveys (2004Q4, 2018Q4). Substantial differences are apparent at the two survey dates. The simple average forecast in 2004Q4, for example, puts 2.3% probability on the event that the inflation rate is less than 1%, whereas in 2018Q4 it puts 10.5% probability on the same event. Continuing, in the top panel of Figure 3 we show the complete time series of simple average forecasts. Again, large movements are evident over time, in both location and scale.

The precise Euro-area inflation forecast target is the percentage change in the Harmonised [sic] Index of Consumer Prices (HICP), for the year following the forecast. For example, when the survey was conducted in October 2017 (2017Q4), HICP inflation data were available up to September 2017, so the 2017Q4 survey asks for a forecast for the year from October 2017 through September 2018. Our realization sample, matched to our forecast sample, contains 83 quarterly observations, starting in December 1999 and ending in June 2020.

We will soon obtain mixture densities using the log score objective and several regularizations, including simplex, simplex+ridge, simplex+entropy, and subset averaging. Before proceeding to empirical results, however, we address several issues.
First, forecasters can enter and exit the survey pool. There are 103 unique forecasters between 1999Q1 and 2019Q4, and no forecaster appears in the pool continuously. Following Genre et al. (2013), we proceed by first excluding forecasters who miss more than four consecutive surveys, which leaves 18 forecasters. Then we interpolate the remaining gaps based on historical performance. (More precisely, we fill in the gaps in the first survey (t = 1, 1999Q1) with the average of non-missing forecasts from all other available forecasters. Then we calculate the ranked score for each forecaster, divide the forecasters into five mutually exclusive groups based on the score, and move to the second survey. At each of the following rounds (t = 2, 3, ..., T), we set the missing observations of a particular forecaster to the average of non-missing forecasts from her group, and then, using the full set of forecasts, we re-calculate ranked scores and update the group structure for use in the next round.)

(Data source: Eurostat, Harmonized Index of Consumer Prices: All Items for Euro area (19 countries) [CP0000EZ19M086NEST], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/CP0000EZ19M086NEST.)

Time-Varying Bin Definitions

Second, outcome bin definitions vary over time. Although bin definitions have been stable for mid-range "standard" inflation values, extreme tail bins have become finer over time, as realizations fell in the tails. We proceed by merging extreme tail bins sufficiently to produce 11 bin definitions, fixed for the entire sample.

Finally, complications can arise with the log-score objective. Consider, for example, a survey forecast that assigns zero probability to the open-ended leftmost and rightmost bins, concentrating all probability mass on a few interior bins.
The zero probabilities assigned to the leftmost and rightmost bins obviously create a problem (infinite loss) for the log-score objective, due to its use of logs, if a realization occurs that was assigned zero probability.

Zero-probability realizations rarely, but occasionally, appear in our data. Sometimes they occur in edge bins (e.g., (4, ∞]), because forecasters sometimes fail to put positive probability on those bins. In addition to the edge-bin phenomenon, some forecasters' histograms are simply too sharp, and they sometimes put zero probability on an interior bin that eventually contains the realization.

One can address the log score "zero problem" by requiring the survey bin into which the realization falls to have been assigned at least some small probability, say 1%. We achieve this by assigning 1% probability to the bin containing the realization if it had originally been assigned 0, where the 1% is taken in equal shares from the bins originally assigned non-zero probability. (During our sample period the number of bins started at 9, peaked at 14 during the Great Recession, and eventually dropped to 12. One could of course switch to another objective, but the log score objective is simple and deservedly popular, which is why we have used it throughout this paper as a leading case for both our theory and Monte Carlo. We will continue to use it for our empirical work, where it is also deservedly popular, despite the zero problem.)

Regularized Mixtures

Notes: We show log scores for 1-year-ahead Eurozone inflation density forecasts, made quarterly, using a 20-quarter rolling estimation window. The burn-in sample is 1999Q1-2000Q4, and the forecast evaluation sample is 2001Q1-2019Q3 (75 quarters). There are 18 ECB-SPF density forecasters in the pool, plus a 19th forecaster whose predictive density is constant and uniform, for a total of 19 forecasters. L is the log score.

There are 18 ECB-SPF density forecasters in the pool. We also include a fictitious 19th forecaster whose predictive density is constant and uniform, in rough parallel to including a constant in point forecast combining regressions, for a total of 19 forecasters. Doing so appears desirable a priori in the spirit of Granger and Ramanathan (1984). (Moreover, it constrains the mixture density to put positive probability on each histogram bin as long as the uniform forecaster gets a non-zero mixture weight, in which case the earlier-discussed log score "zero problem" vanishes.)

Results appear in Table 3. Strikingly, each regularized mixture outperforms each ECB/SPF individual forecaster (even the ex post best forecaster). To get a feel for the size of the improvement, note that the log scores of the regularized mixtures are roughly 7% better than that of the ex post best individual forecaster.
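The 1% fix for the log-score zero problem described above can be sketched as follows. This is an illustrative helper of our own; it assumes each non-zero bin holds at least its equal share of the redistributed mass.

```python
import numpy as np

def fix_zero_bin(p, hit, floor=0.01):
    """Log-score 'zero problem' fix: if the bin containing the realization
    (index `hit`) was assigned zero probability, give it `floor` (1%),
    taken in equal shares from the bins originally assigned non-zero
    probability.  Assumes each donor bin has at least floor/len(donors)."""
    p = np.asarray(p, dtype=float).copy()
    if p[hit] > 0:
        return p
    donors = np.where(p > 0)[0]
    p[donors] -= floor / len(donors)
    p[hit] = floor
    return p
```

The adjusted histogram still sums to one, and only the realization's bin is touched; other zero bins are left at zero, exactly as in the text.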
Notes: We show density forecast mixtures expressed as frequency polygons. The forecasts are quarterly, from 1999Q1 to 2019Q3.

The differences among the regularized mixtures' log scores are always small, regardless of the regularization method. Simultaneously, both the log scores in Table 3 and the graphs in the bottom two panels of Figure 3 reveal that the Simplex and Best Average regularized mixtures are almost identical, suggesting that the Simplex solution is effectively dropping all but a few forecasts and simply averaging the survivors, producing something very close to a Best 4-Average.

The good performance of both Simplex and Best Average is particularly noteworthy insofar as they do not require tuning. That is, quite remarkably, the Simplex and Best Average regularizations perform as well as those requiring choice of tuning parameters (Simplex+Ridge and Simplex+Entropy), despite the fact that we evaluate the latter in Table 3 using ex post optimal tuning parameters, which is not feasible in real time. (Simplex+Entropy must select all 19 forecasters, because the entropy penalty −log(ω_k) → ∞ as ω_k → 0. All regularizations capable of selecting only a few forecasters do in fact select only a few. Strictly speaking, Best Average procedures require some slight tuning – a choice of N – although we are comfortable with simply always adopting N = 4.)

Notes: We show heat maps of differences between a regularized mixture (Simplex or Best Average) and the Simple Average mixture.

Figure 3 merits additional examination. If its middle and bottom panels reveal that the Simplex and Best Average regularized mixtures are nearly identical, a comparison of those panels with the top panel also reveals that (1) Simplex / Best Average regularization is nevertheless very different from a simple average, and (2) the effects of Simplex / Best Average regularization differ strikingly before and after the onset of the Great Recession. Before the onset of the Great Recession, Simplex / Best Average regularization moves probability mass upward toward higher inflation relative to simple averaging, particularly from the 1.0%-1.5% range to the 1.5%-2.5% range, mostly adjusting density forecast location and symmetry. After that, however, Simplex / Best Average regularization spreads probability mass from the center into both tails of the distribution, from the 1.0%-2.5% range outward to below 0.5% and above 3.0%, mostly adjusting density forecast dispersion and kurtosis. The regularization effects, and their structural shift at the onset of the Great Recession, are revealed even more clearly in the heat maps shown in Figure 4.

Figure 5: PIT Histograms, Eurozone Inflation

Notes: We show PIT histograms for the Simple Average and Best Average mixtures. Under correct calibration, PIT ∼ iid U(0,1).

It is informative to examine and compare probability integral transforms (PITs) for various mixtures. Diebold et al. (1998) consider the continuous case, in which the PIT is defined as $PIT_t = \int_{-\infty}^{y_t} p_t(u)\,du$, and show that correct conditional calibration of density forecasts implies that PIT ∼ iid U(0,1). Our histogram forecasts instead require a discrete PIT definition. To assess uniformity, and any patterns in deviations from uniformity, in Figure 5 we show histograms of the Czado et al. (2009) discrete PIT for the Simple Average and Best Average mixtures.

The PIT histograms reveal problems with the Simple Average mixture, which match our discussion of the two regimes in Figures 3 and 4, and which are ameliorated by the Best Average mixture. The Simple Average PIT histograms show noticeable deviations from uniformity in both subsamples, and the shapes of the deviations are very different. (There is no need to show the Simplex PIT histograms, because the Simplex and Best Average mixtures are nearly identical.)

Notes: We show density forecast mixtures expressed as frequency polygons. The forecasts are quarterly, from 1999Q1 to 2019Q3.
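For a histogram forecast the predictive CDF is a step function over bins, so the PIT can be illustrated with a simplified randomized variant: draw uniformly between the CDF values at the realization bin's edges. This helper is our own simplification for illustration; the figures in the paper use the nonrandomized discrete construction of Czado et al. (2009).

```python
import numpy as np

def discrete_pit(p, hit, rng=None):
    """Simplified randomized discrete PIT for a histogram forecast:
    draw uniformly between F(hit-1) and F(hit), where F is the forecast
    CDF over bins and `hit` is the realization's bin index.  Under correct
    calibration such draws are iid U(0,1)."""
    F = np.concatenate([[0.0], np.cumsum(p)])   # CDF at bin edges
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(F[hit], F[hit + 1])
```

Collecting these draws over the evaluation sample and plotting their histogram gives (an approximation to) the PIT histograms in Figure 5.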
In the first subsample, the Simple Average PIT histogram is highly skewed, as shown in the upper-left panel of Figure 5, with far too little probability mass near 0 and far too much near 1, again indicating too many large inflation realizations relative to the Simple Average density forecasts. Regularization, however, shifts the densities upward as discussed earlier, producing an improved (if still imperfect) Best Average PIT, as seen in the bottom-left panel of Figure 5.

In the second subsample, the Simple Average PIT histogram is more U-shaped, as shown in the upper-right panel of Figure 5. In this regime the regularization spreads out the densities as discussed earlier, better accommodating the tail realizations and producing an improved Best Average PIT, as seen in the bottom-right panel of Figure 5.

Finally, in parallel to our earlier examination of ECB/SPF inflation forecasts, we examine real interest rate density forecasts. The real interest rate density is a simple sign change and location shift of the inflation density:

$$f(r_{t,t+1}) = i_{t,t+1} - f(\pi_{t,t+1}), \qquad (15)$$

where r denotes the real interest rate, i denotes the nominal interest rate, and π denotes inflation. Real interest rate densities are of course driven by the inflation densities via equation (15), but it is nevertheless interesting to make the translation into the real cost of borrowing.

In Figure 6 we show the Simple Average and Best Average real interest rate density forecast mixtures.

Notes: We show a heat map of the difference between the Best Average mixture and the Simple Average mixture.

One is immediately struck by the high probability assigned to negative real rates through much of the sample. Nevertheless our earlier inflation patterns and lessons remain firmly intact, because real interest rate density forecasts are driven by inflation density forecasts. There are two clear real interest rate "regularization regimes," demarcated by the onset of the Great Recession. In the first, real interest rate densities are pushed downward, because, as discussed earlier, regularization pushes inflation densities upward. In the second, real interest rate densities are made more dispersed, because regularization makes inflation densities more dispersed.
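Equation (15) says the real-rate histogram is obtained from the inflation histogram by a sign flip and a location shift. A minimal sketch of the translation, using a hypothetical helper of our own that assumes known histogram bin edges and a known nominal rate:

```python
import numpy as np

def real_rate_density(pi_edges, pi_probs, i_nom):
    """Translate an inflation histogram into a real-rate histogram via
    r = i - pi (equation (15)): bin edges are sign-flipped and shifted by
    the nominal rate, and bin probabilities are reversed."""
    pi_edges = np.asarray(pi_edges, dtype=float)
    r_edges = i_nom - pi_edges[::-1]
    r_probs = np.asarray(pi_probs, dtype=float)[::-1]
    return r_edges, r_probs
```

Because the transformation is monotone, event probabilities simply carry over: the mass that inflation places above any threshold equals the mass the real rate places below the mirrored threshold, which is why the log score is invariant to the switch.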
We have proposed methods for constructing regularized mixtures of density forecasts, exploring a variety of objectives and penalties, which we used in a substantive exploration of Eurozone inflation and real interest rate survey density forecasts. (There is no need to show regularized estimation results for real interest rates, because the log score is invariant to the switch from inflation to real interest rate density forecasts defined by equation (15). There is also no need to include Simplex panels in Figures 6 and 7, because the Simplex and Best Average mixtures are nearly identical, or to show real interest rate PIT histograms, because they are exact mirror images of the inflation PIT histograms in Figure 5, as revealed by equation (15).)

All individual survey forecasters (even the ex post best forecaster) are outperformed by our regularized mixtures. The log scores of the Simplex and Best-Average mixtures, for example, are approximately 7% better than that of the ex post best individual forecaster, and 15% better than that of the ex post median forecaster. Before the Great Recession, regularization shifts inflation density locations upward toward higher inflation, and hence real interest rate density locations downward, correcting for bias. From the Great Recession onward, the regularization tends to move probability mass from the centers to the tails of both inflation and real interest rate density forecasts, correcting for overconfidence.

A variety of avenues for future research are possible. For example, one could use the probability integral transform as a regularized mixture estimation objective, minimizing a goodness-of-fit statistic (e.g., Kolmogorov-Smirnov) for testing the joint hypothesis of an iid U(0,1) probability integral transform.

Second, one could broaden our approach to allow for nonlinear mixtures as in recent work by Takanashi and McAlinn (2020), flexibly time-varying mixture weights as in Jore et al. (2010), and mixture weights that vary over regions of density support, as in Kapetanios et al. (2015).

Finally, although we did not emphasize regularization methods that require hyperparameter selection in our empirical work (Simplex+Ridge or Simplex+Entropy), they nevertheless represent interesting directions for future exploration. An obvious issue is feasible real-time hyperparameter selection.

Appendices
A Derivation of the Simplex+Entropy Regularized Estimator
The Simplex+Entropy estimator solves the optimization problem:

$$\hat{\omega} = \arg\min_{\omega} \underbrace{-\sum_{t=1}^{T} \log\left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right)}_{\text{log score}} + (\alpha - 1)\underbrace{\left(-\sum_{k=1}^{K} \log(\omega_k)\right)}_{\text{entropy penalty}} \qquad \text{(A.1)}$$

$$\text{s.t. } \omega_k \in (0,1), \quad \sum_{k=1}^{K} \omega_k = 1.$$

As we will show, it arises as the posterior mode in a Bayesian analysis with (1) log likelihood given by the log score, and (2) a Dirichlet prior, which puts positive probability only on the unit simplex but also shrinks toward equal weights for a certain hyperparameter configuration. In particular, the K-dimensional Dirichlet prior is governed by K hyperparameters, and when they are all equal, the prior mean is 1/K. Hence the simplex+entropy regularization (8) with equal prior hyperparameters does the same thing as simplex+ridge (5): impose the simplex and shrink toward equal weights.

A.1 Prior
The Dirichlet prior on $\omega = (\omega_1, \omega_2, ..., \omega_K)$ with hyperparameter $\alpha = (\alpha_1, \alpha_2, ..., \alpha_K)$ is

$$f_D(\omega; \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \omega_k^{\alpha_k - 1},$$

where $B(\cdot)$ is the beta function, $\alpha_k > 0 \; \forall k \in \{1, ..., K\}$, and the support of $\omega$ is $\omega_k \in (0,1)$ with $\sum_{k=1}^{K} \omega_k = 1$.

As is well known, the Dirichlet mean and variance are:

$$E(\omega_i) = \frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}, \qquad var(\omega_i) = \frac{\frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}\left(1 - \frac{\alpha_i}{\sum_{k=1}^{K} \alpha_k}\right)}{1 + \sum_{k=1}^{K} \alpha_k}.$$

Hence when $\alpha_1 = \alpha_2 = ... = \alpha_K = \alpha$, we have $E[\omega_k] = 1/K$ and $var(\omega_k) = \frac{K-1}{K^2(\alpha K + 1)}$, for all $k = 1, ..., K$. That is, the prior is centered on equal weights $1/K$, and $var(\omega_k) \to 0$ as $\alpha \to \infty$, so that $\alpha$ governs prior precision, with larger $\alpha$ producing heavier shrinkage toward $1/K$.

A.2 Posterior
The posterior distribution is

$$f_D(\omega \mid y; \alpha) \propto \underbrace{\prod_{t=1}^{T} \left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right)}_{\text{pseudo-likelihood}} \times \underbrace{\frac{1}{B(\alpha)} \prod_{k=1}^{K} \omega_k^{\alpha - 1}}_{\text{prior}},$$

so the log posterior is

$$\log f_D(\omega; \alpha) = \sum_{t=1}^{T} \log\left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right) + (\alpha - 1)\sum_{k=1}^{K} \log(\omega_k) - \log B(\alpha).$$

Because $B(\alpha)$ does not depend on $\omega$, we can drop the last term, so the posterior mode is

$$\hat{\omega} = \arg\min_{\omega} \underbrace{-\sum_{t=1}^{T} \log\left(\sum_{k=1}^{K} \omega_k f_{k,t}(y_t)\right)}_{\text{log score}} + (\alpha - 1)\underbrace{\left(-\sum_{k=1}^{K} \log(\omega_k)\right)}_{\text{penalty}} \qquad \text{(A.2)}$$

$$\text{s.t. } \omega_k \in (0,1), \quad \sum_{k=1}^{K} \omega_k = 1.$$

A.3 Understanding the Penalty Term

One way to understand the penalty term is to recall the solution to the empirical likelihood maximization problem of Owen (2001),

$$\arg\min_{\omega} \left(-\sum_{k=1}^{K} \log(\omega_k)\right) \quad \text{s.t. } \omega_k \in (0,1), \quad \sum_{k=1}^{K} \omega_k = 1,$$

which is equal weights, $\omega_k = 1/K, \; \forall k$. Hence we see that the penalty part of (A.2) is minimized at $\omega_k = 1/K$, which yields a clear interpretation of the penalty term. Larger $\alpha$ means a tighter prior on $\omega$, with heavier shrinkage toward equal weights. Several interesting limiting cases emerge. First, for $\alpha \to \infty$, the penalty term dominates, and the optimal solution is equal weights. Second, for $\alpha \to 1$, the penalty term vanishes, and the optimal solution matches that of the optimal linear pool, with the simplex constraint imposed. Third, there is an upper bound for $var(\omega_k)$: as $\alpha \to 0$, $var(\omega_k) \to (K-1)/K^2$.

A.4 Remarks
1. The entropy regularization optimization problem is convex, because both the log score and the penalty are convex. A closed form may not exist for the regularized ω, but convexity makes numerical computation straightforward.

2. Entropy regularization has a clear parallel to ridge regularization. As is well known, ridge regularization emerges as the posterior mode in a Bayesian analysis with a Gaussian prior, and as we have shown, entropy regularization emerges as the posterior mode in a Bayesian analysis with a Dirichlet prior. Both regularizations, moreover, are governed by a single parameter linked to prior precision.

3. If the effects of the ridge and entropy penalties are very similar in certain respects (imposition of the simplex and shrinkage toward 1/K), their full Bayesian interpretations are nevertheless different. In particular, the ridge (Gaussian) and entropy (Dirichlet) priors differ, even if their means are the same (1/K), and so the posteriors differ. For α < 1, moreover, the "penalty" switches sign, pushing the weights toward the vertices of the simplex rather than toward equal weights.

References

Aastveit, K.A., J. Mitchell, F. Ravazzolo, and H.K. van Dijk (2020), "The Evolution of Forecast Density Combinations in Economics,"
Oxford Research Encyclopedia of Economics and Finance, in press.

Amisano, G. and J. Geweke (2017), "Prediction Using Several Macroeconomic Models," Review of Economics and Statistics, 99, 912–925.

Askanazi, R., F.X. Diebold, F. Schorfheide, and M. Shin (2018), "On the Comparison of Interval Forecasts," Journal of Time Series Analysis, 39, 953–965.

Bates, J.M. and C.W.J. Granger (1969), "The Combination of Forecasts," Operations Research Quarterly, 20, 451–468.

Billio, M., R. Casarin, F. Ravazzolo, and H.K. Van Dijk (2013), "Time-Varying Combinations of Predictive Densities Using Nonlinear Filtering," Journal of Econometrics, 177, 213–232.

Brehmer, J. and T. Gneiting (2020), "Scoring Interval Forecasts: Equal-Tailed, Shortest, and Modal Interval," arXiv:2007.05709 [math.ST], https://arxiv.org/abs/2007.05709.

Bresciani-Turroni, C. (1937), The Economics of Inflation, Allen and Unwin.

Brier, G.W. (1950), "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 78, 1–3.

Brodie, J., I. Daubechies, C. De Mol, D. Giannone, and I. Loris (2009), "Sparse and Stable Markowitz Portfolios," Proceedings of the National Academy of Sciences, 106, 12267–12272.

Chen, N.-F., R. Roll, and S. Ross (1986), "Economic Forces and the Stock Market," Journal of Business, 383–403.

Conflitti, C., C. De Mol, and D. Giannone (2015), "Optimal Combination of Survey Forecasts," International Journal of Forecasting, 31, 1096–1103.

Czado, C., T. Gneiting, and L. Held (2009), "Predictive Model Assessment for Count Data," Biometrics, 65, 1254–1261.

Diebold, F.X. (1991), "A Note on Bayesian Forecast Combination Procedures," in P. Hackl and A. Westlund (eds.), Economic Structural Change: Analysis and Forecasting.

Diebold, F.X., T.A. Gunther, and A.S. Tay (1998), "Evaluating Density Forecasts with Applications to Financial Risk Management," International Economic Review, 39, 863–883.

Diebold, F.X. and M. Shin (2017), "Assessing Point Forecast Accuracy by Stochastic Error Distance," Econometric Reviews, 36, 588–598.

Diebold, F.X. and M. Shin (2019), "Machine Learning for Regularized Survey Forecast Combination: Partially-Egalitarian Lasso and its Derivatives," International Journal of Forecasting, 35, 1679–1691.

Elliott, G. (2011), "Averaging and the Optimal Combination of Forecasts," Manuscript, Department of Economics, UCSD.

Elliott, G. and A. Timmermann (2016), Economic Forecasting, Princeton University Press.

Epstein, E.S. (1969), "A Scoring System for Probability Forecasts of Ranked Categories," Journal of Applied Meteorology, 8, 985–987.

Friedman, M. (1977), "Nobel Lecture: Inflation and Unemployment," Journal of Political Economy, 85, 451–472.

Genre, V., G. Kenny, A. Meyler, and A. Timmermann (2013), "Combining Expert Forecasts: Can Anything Beat the Simple Average?" International Journal of Forecasting, 29, 108–121.

Geweke, J. and G. Amisano (2011), "Optimal Prediction Pools," Journal of Econometrics, 164, 130–141.

Giannone, D., M. Lenza, and G.E. Primiceri (2017), "Economic Predictions with Big Data: The Illusion of Sparsity," CEPR Discussion Paper 12256.

Gneiting, T. and A.E. Raftery (2007), "Strictly Proper Scoring Rules, Prediction, and Estimation," Journal of the American Statistical Association, 102, 359–378.

Good, I.J. (1952), "Rational Decisions," Journal of the Royal Statistical Society: Series B, 14, 107–114.

Granger, C.W.J. and R. Ramanathan (1984), "Improved Methods of Combining Forecasts," Journal of Forecasting, 3, 197–204.

Hall, S.G. and J. Mitchell (2007), "Combining Density Forecasts," International Journal of Forecasting, 23, 1–13.

Jore, A.S., J. Mitchell, and S.P. Vahey (2010), "Combining Forecast Densities from VARs with Uncertain Instabilities," Journal of Applied Econometrics, 25, 621–634.

Kapetanios, G., J. Mitchell, S. Price, and N. Fawcett (2015), "Generalised Density Forecast Combinations," Journal of Econometrics, 188, 150–165.

McAlinn, K. and M. West (2019), "Dynamic Bayesian Predictive Synthesis in Time Series Forecasting," Journal of Econometrics, 210, 155–169.

Owen, A. (2001), Empirical Likelihood, Chapman and Hall.

Takanashi, K. and K. McAlinn (2020), "Predictive Properties and Minimaxity of Bayesian Predictive Synthesis," Preprint, RIKEN and Temple University.

Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.

Timmermann, A. (2006), "Forecast Combinations," in G. Elliott, C.W.J. Granger and A. Timmermann (eds.), Handbook of Economic Forecasting, North Holland, 135–196.

Winkler, R.L. and A.H. Murphy (1968), "'Good' Probability Assessors," Journal of Applied Meteorology, 7, 751–758.

Yao, Y., A. Vehtari, D. Simpson, and A. Gelman (2018), "Using Stacking to Average Bayesian Predictive Distributions," Bayesian Analysis, 13, 917–1003.

Zou, H. and T. Hastie (2005), "Regularization and Variable Selection via the Elastic Net,"