Bayesian hierarchical stacking: Some models are (somewhere) useful
Yuling Yao (Department of Statistics, Columbia University, New York, USA; [email protected]), Gregor Pirš (Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia), Aki Vehtari (Department of Computer Science, Aalto University, Espoo, Finland), Andrew Gelman (Department of Statistics and Political Science, Columbia University, New York, USA)
22 Jan 2021
Abstract
Stacking is a widely used model averaging technique that yields asymptotically optimal predictions among linear averages. We show that stacking is most effective when the model predictive performance is heterogeneous in the inputs, so that we can further improve the stacked mixture by full-Bayesian hierarchical modeling. With input-varying yet partially-pooled model weights, hierarchical stacking improves average and conditional predictions. Our Bayesian formulation includes constant-weight (complete-pooling) stacking as a special case. We generalize to incorporate discrete and continuous inputs, nested and grouped priors, and time-series and longitudinal data. We demonstrate on several applied problems.
Keywords: Bayesian hierarchical modeling, conditional prediction, covariate shift, model averaging, stacking, prior construction.
1. Introduction
Statistical inference is conditional on the model, and a general challenge is how to make better predictions in the presence of multiple candidate models. Consider data $\mathcal{D} = (y_i \in \mathcal{Y}, x_i \in \mathcal{X})_{i=1}^n$ and $K$ models $M_1, \dots, M_K$, each having its own parameter vector $\theta_k \in \Theta_k$, likelihood, and prior. We fit each model and obtain posterior predictive distributions
$$p(\tilde y \mid \tilde x, M_k) = \int_{\Theta_k} p(\tilde y \mid \tilde x, \theta_k, M_k)\, p(\theta_k \mid \{y_i, x_i\}_{i=1}^n, M_k)\, d\theta_k. \quad (1)$$
The model fit is judged by its expected predictive utility on future (out-of-sample) data $(\tilde y, \tilde x) \in \mathcal{Y} \times \mathcal{X}$, which generally has an unknown "true" joint density $p_t(\tilde y, \tilde x)$. Model selection seeks the single best model with the highest utility when averaged over $p_t(\tilde y, \tilde x)$. Model averaging assigns the models weights $w_1, \dots, w_K$ subject to a simplex constraint, $w \in \mathbb{S}_K = \{w : \sum_{k=1}^K w_k = 1;\ w_k \in [0, 1], \forall k\}$, and the future prediction is a linear mixture of the individual models:
$$p(\tilde y \mid \tilde x, w, \text{model averaging}) = \sum_{k=1}^K w_k\, p(\tilde y \mid \tilde x, M_k), \quad w \in \mathbb{S}_K. \quad (2)$$
Stacking (Wolpert, 1992), among other ensemble learners, has been successful for various prediction tasks. The first step is to fit each individual model and evaluate the pointwise leave-one-out predictive density of each data point $i$ under each model $k$:
$$p_{k,-i} = \int_{\Theta_k} p(y_i \mid \theta_k, x_i, M_k)\, p\bigl(\theta_k \mid M_k, \{(x_{i'}, y_{i'}) : i' \neq i\}\bigr)\, d\theta_k,$$
which in a Bayesian context we can approximate using posterior simulations and Pareto-smoothed importance sampling (Vehtari et al., 2017). Reusing data eliminates the need to model the unknown joint density $p_t(\tilde y, \tilde x)$. The next step is to determine the vector of weights $w_1, \dots, w_K$ that optimizes the average log score of the prediction of the stacked model,
$$\hat w^{\text{stacking}} = \arg\max_w \sum_{i=1}^n \log\Bigl(\sum_{k=1}^K w_k\, p_{k,-i}\Bigr), \quad \text{such that } w \in \mathbb{S}_K. \quad (3)$$
Yao et al. (2018) apply the stacking idea to combine predictions from separate Bayesian inferences. However, the linear mixture (2) is restricted to an identical set of weights for all inputs $x$. We will later label the solution (3) as complete-pooling stacking.
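Concretely, once the $n \times K$ matrix of leave-one-out log densities $\log p_{k,-i}$ is computed (e.g., by PSIS-LOO), solving (3) is a small concave optimization over the simplex. The following minimal Python sketch — our illustration, not the implementation used in the paper — parameterizes the simplex with a softmax so that an off-the-shelf unconstrained optimizer applies; the optional per-observation weights will be reused in the sketches for the time-discounting and covariate-shift variants of later sections.

```python
import numpy as np
from scipy.optimize import minimize

def stacking_weights(lpd, obs_weights=None):
    """Complete-pooling stacking, eq. (3).

    lpd: (n, K) array with lpd[i, k] = log p_{k,-i}, the leave-one-out
         log predictive density of point i under model k.
    obs_weights: optional length-n per-observation weights.
    Returns the length-K simplex vector maximizing the log score.
    """
    n, K = lpd.shape
    r = np.ones(n) if obs_weights is None else np.asarray(obs_weights)

    def neg_log_score(a):
        alpha = np.append(a, 0.0)                  # alpha_K = 0
        logw = alpha - np.logaddexp.reduce(alpha)  # log softmax
        # -sum_i r_i log sum_k w_k p_{k,-i}, computed on the log scale
        return -np.sum(r * np.logaddexp.reduce(lpd + logw, axis=1))

    a_hat = minimize(neg_log_score, np.zeros(K - 1), method="BFGS").x
    alpha = np.append(a_hat, 0.0)
    return np.exp(alpha - np.logaddexp.reduce(alpha))
```

Because the log score is concave in $w$, the softmax-reparameterized problem is well behaved in practice.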
The present paper proposes hierarchical stacking, an approach that goes further in three ways:

1. Framing the estimation of the stacking weights as a Bayesian inference problem rather than a pure optimization problem. This in itself does not make much difference in the complete-pooling estimate (3), but it is helpful in the later development.
2. Expanding to a hierarchical model in which the stacking weights can vary over the population. If the model predictors $x$ take on $J$ different values in the data, we can use Bayesian inference to estimate a $J \times K$ matrix of weights that partially pools the data both in row and column.
3. Further expanding to allow weights to vary as a function of continuous predictors. This idea generalizes feature-weighted linear stacking (Sill et al., 2009) with a more flexible form and Bayesian hierarchical shrinkage.

There are two reasons to consider input-dependent model weights. First, the scoring rule measures the expected predictive performance averaged over $\tilde x$ and $\tilde y$: the objective function in (3) divided by $n$ is an asymptotically unbiased estimate of $\mathrm{E}_{\tilde x, \tilde y} \log\bigl(\sum_{k=1}^K w_k\, p(\tilde y \mid \tilde x, \mathcal{D}, M_k)\bigr)$. But an overall good model fit does not ensure a good conditional prediction at a given location $\tilde x = x_0$, or under covariate shift, when the distribution of the input $x$ in the observations differs from that of the population of interest. Second, and more importantly, different models can be good at explaining different parts of the data, which is why model averaging can be a better solution than model selection. Even if we are only interested in the average performance, we can further improve model averaging by learning where a model is good so as to locally inflate its weight.

In Section 2, we develop a detailed implementation of hierarchical stacking. We explain why it is legitimate to convert an optimization problem into a formal Bayesian model. With hierarchical shrinkage, we partially pool the stacking weights across data. By varying priors, hierarchical stacking includes classic stacking and model selection as special cases. We generalize this approach to continuous input variables, other structured priors, and time-series and longitudinal data. In Section 3, we turn the heuristics from the previous paragraph into a rigorous learning bound, indicating the benefit of moving from model selection to model averaging, and from complete-pooling model averaging to a local averaging that allows the model weights to vary in the population. We outline related work in Section 4. In Section 5, we evaluate the proposed method in several simulated and real-data examples, including a U.S. presidential election forecast. We discuss the hierarchical shrinkage, the connection to full-Bayesian inference, and future directions in Section 6.

This paper makes two main contributions:

• The proposed hierarchical stacking provides a practical Bayesian recipe for model averaging with input-dependent weights and hierarchical regularization. It is beneficial both for improving the overall model fit and for the conditional local fit in small areas.

• Our theoretical results characterize how the model list should be locally separated to be useful in model averaging and local model averaging.

2. Hierarchical stacking
The present paper generalizes the linear model averaging (2) to pointwise model averaging. The goal is to construct an input-dependent model weight function $w(x) = (w_1(x), \dots, w_K(x)) : \mathcal{X} \to \mathbb{S}_K$, and combine the predictive densities pointwise:
$$p(\tilde y \mid \tilde x, w(\cdot), \text{pointwise averaging}) = \sum_{k=1}^K w_k(\tilde x)\, p(\tilde y \mid \tilde x, M_k), \quad \text{such that } w(\cdot) \in \mathbb{S}_K^{\mathcal{X}}. \quad (4)$$
If the input is discrete and has finitely many categories, one naive estimate of the pointwise optimal weights is to run complete-pooling stacking (3) separately on each category, which we will label no-pooling stacking. The no-pooling procedure generally has a larger variance and overfits the data.

From a Bayesian perspective, it is natural to compromise between the unpooled and completely pooled procedures by a hierarchical model. Given some hierarchical prior $p_{\text{prior}}(\cdot)$, we define the posterior distribution of the stacking weights $w \in \mathbb{S}_K^{\mathcal{X}}$ through the usual likelihood-prior protocol:
$$\log p(w(\cdot) \mid \mathcal{D}) = \sum_{i=1}^n \log\Bigl(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\Bigr) + \log p_{\text{prior}}(w(\cdot)) + \text{constant}, \quad w(\cdot) \in \mathbb{S}_K^{\mathcal{X}}. \quad (5)$$
The final estimate of the pointwise stacking weights used in (4) is then the posterior mean from this joint density. We call this approach hierarchical stacking.

For notational consistency, we rewrite the input variables into two groups $(x, z)$, where $x$ collects the variables on which the model weight $w(x)$ depends during model averaging (4), and $z$ collects all remaining input variables. To start, we consider $x$ to be discrete with $J < \infty$ categories, $x = 1, \dots, J$; we extend to continuous and hybrid $x$ later. The input-varying stacking weight function is parameterized by a $J \times K$ matrix $\{w_{jk}\} \in (\mathbb{S}_K)^J$: each row of the matrix is a length-$K$ simplex, and the $k$-th model in cell $j$ has weight $w_k(x_i) = w_{jk}$ for all $x_i = j$. We fit each individual model $M_k$ to all observed data $\mathcal{D} = (x_i, z_i, y_i)_{i=1}^n$ and obtain pointwise leave-one-out cross-validated log predictive densities:
$$p_{k,-i} := \int_{\Theta_k} p(y_i \mid \theta_k, x_i, z_i, M_k)\, p(\theta_k \mid \{(x_l, y_l, z_l) : l \neq i\}, M_k)\, d\theta_k. \quad (6)$$
As in complete-pooling stacking, here we avoid refitting each model $n$ times, and instead use Pareto smoothed importance sampling (PSIS, Vehtari et al., 2017, 2019) to approximate $\{p_{k,-i}\}_{i=1}^n$ from one-time-fit posterior draws from $p(\theta_k \mid M_k, \mathcal{D})$. The cost of this approximate leave-one-out cross validation is often negligible compared with the individual model fitting.

To optimize the expected predictive performance of the pointwise combined model averaging, we can maximize the leave-one-out predictive density
$$\max_{w(\cdot)} \sum_{i=1}^n \log\Bigl(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\Bigr). \quad (7)$$
At one extreme, complete-pooling stacking solves (7) subject to a constant constraint $w_k(x) = w_k(x')$ for all $k, x, x'$, which is equivalent to solving (3) with all data $\mathcal{D}$.
At the other extreme, no-pooling stacking maximizes the objective function (7) with no constraint other than the row-simplex condition, which is equivalent to separately solving complete-pooling stacking (3) on each input cell $\mathcal{D}_j = \{(x_i, z_i, y_i) : x_i = j\}$.

If there is a large number of repeated measurements in each cell, $n_j := \#\{i : x_i = j\} \to \infty$, then for each model $k$ the cell average $(1/n_j) \sum_{i : x_i = j} \log p_{k,-i}$ becomes a reasonable estimate of the conditional log predictive density $\int_{\mathcal{Y}} p_t(\tilde y \mid \tilde x = j) \log p(\tilde y \mid \tilde x = j, M_k)\, d\tilde y$, with convergence rate $\sqrt{n_j}$, and therefore no-pooling stacking becomes asymptotically optimal among all cell-wise combination weights. For a finite sample size, because each cell is smaller than the total sample, we would expect a larger variance in no-pooling stacking than in complete-pooling stacking. Moreover, the cell sizes are often unbalanced, which entails large noise in the no-pooling stacking weights of small cells.

Vanilla (optimization-based) stacking (3) is justified by
Bayesian decision theory: the expected log predictive density of the combined model, $\mathrm{E}_{\tilde y} \log\bigl(\sum_{k=1}^K w_k\, p(\tilde y \mid M_k)\bigr)$, is estimated by $(1/n) \sum_{i=1}^n \log\bigl(\sum_{k=1}^K w_k\, p_{k,-i}\bigr)$. The point optimum $\hat w$ asymptotically maximizes the expected utility, hence is an $\mathcal{M}^*$-optimal decision in the sense of Vehtari and Ojanen (2012).

To fold stacking into a Bayesian inference problem, we treat the objective function in (7) as a log likelihood with parameter $w$. It is pro forma a multinomial likelihood, except that the $p_{k,-i}$ are neither integers nor observations. Rigorously, this log likelihood is on the outcomes $y_{1,\dots,n}$. After integrating out the individual-model-specific parameters $\theta_k$ so that $p(y \mid x, M_k)$ is given, the actual generative model for outcome $y_i$ at input location $x_i$ in the combined model is $p(y_i \mid x_i, w_k(x_i)) = \sum_{k=1}^K w_k(x_i)\, p(y_i \mid x_i, M_k)$, which implies the log joint likelihood $\sum_{i=1}^n \log\bigl(\sum_{k=1}^K w_k(x_i)\, p(y_i \mid x_i, M_k)\bigr)$. But this procedure uses the data twice: in other practice, data are often used twice to pick the prior, whereas here data would be used twice to pick the likelihood.

We use a two-stage estimation procedure to avoid reusing data. Hypothetically provided a holdout dataset $\mathcal{D}'$ of the same size and identical distribution as the observations $\mathcal{D} = (y, x)$, we could use $\mathcal{D}'$ to fit the individual models first and compute $\tilde p(y_i \mid x_i, M_k, \mathcal{D}') = \int p(y_i \mid x_i, M_k, \theta_k)\, p(\theta_k \mid M_k, \mathcal{D}')\, d\theta_k$. In the second stage we plug in the observed $y_i, x_i$ and obtain the pointwise full likelihood $p(y_i \mid w, \mathcal{D}', x_i) = \sum_{k=1}^K w_k(x_i)\, \tilde p(y_i \mid x_i, M_k, \mathcal{D}')$.

Now, lacking holdout data $\mathcal{D}'$, the leave-$i$-out predictive density $p_{k,-i}$ is a consistent estimate of the pointwise out-of-sample predictive density $\mathrm{E}_{\mathcal{D}'}(\tilde p(y_i \mid x_i, M_k, \mathcal{D}'))$. Plugging it into the two-stage log likelihood and integrating out the unobserved holdout data $\mathcal{D}'$, we get a profile likelihood
$$p(y_i \mid w, x_i) := \mathrm{E}_{\mathcal{D}'}\bigl(p(y_i \mid w, x_i, \mathcal{D}')\bigr) = \sum_{k=1}^K w_k(x_i)\, \mathrm{E}_{\mathcal{D}'}\bigl(p(y_i \mid x_i, M_k, \mathcal{D}')\bigr) \approx \sum_{k=1}^K w_k(x_i)\, p_{k,-i}.$$
Summing over $i$ arrives at $\log p(\mathcal{D} \mid w) \approx \sum_{i=1}^n \log\bigl(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\bigr)$. This log likelihood coincides with the no-pooling optimization objective function (7). The hypothetical dataset $\mathcal{D}'$ is related to the idea of marginal data augmentation (Meng and van Dyk, 1999). Polson and Scott (2011) took a similar approach to convert the optimization-based support vector machine into a Bayesian inference.

The log posterior density of the hierarchical stacking model (11) contains the log likelihood defined above, $\sum_{i=1}^n \log\bigl(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\bigr)$, and a prior distribution on the weight matrix $w = \{w_{jk}\} \in \mathbb{S}_K^J$, which we specify in the following. We first take a softmax transformation that bijectively converts the simplex matrix space $\mathbb{S}_K^J$ to the unconstrained space $\mathbb{R}^{J(K-1)}$:
$$\text{transformed parameters:} \quad w_{jk} = \frac{\exp(\alpha_{jk})}{\sum_{k'=1}^K \exp(\alpha_{jk'})}, \quad 1 \le k \le K,\ 1 \le j \le J; \qquad \alpha_{jK} = 0, \quad 1 \le j \le J. \quad (8)$$
The free parameter $\alpha_{jk} \in \mathbb{R}$ is interpreted as the log odds ratio of model $k$ with reference to $M_K$ in cell $j$. We propose a normal hierarchical prior on the unconstrained model weights $(\alpha_{jk})_{k=1}^{K-1}$, conditional on hyperparameters $\mu \in \mathbb{R}^{K-1}$ and $\sigma \in \mathbb{R}_+^{K-1}$:
$$\text{prior:} \quad \alpha_{jk} \mid \mu_k, \sigma_k \sim \mathrm{normal}(\mu_k, \sigma_k), \quad k = 1, \dots, K-1,\ j = 1, \dots, J. \quad (9)$$
The prior partially pools the unconstrained weights toward the shared overall mean $(\mu_1, \dots, \mu_{K-1})$. The shrinkage effect depends both on the cell sample size $n_j$ (how strong the likelihood is in cell $j$) and on the model-specific $\sigma_k$ (how much across-cell discrepancy is allowed in model $k$). If $\mu$ and $\sigma$ are given constants, and if the posterior distribution is summarized by its mode, then hierarchical stacking contains two special cases:

• no-pooling stacking, from a flat prior $\sigma_k \to \infty$, $k = 1, \dots, K-1$;

• complete-pooling stacking, from a concentrated prior $\sigma_k \to 0$, $k = 1, \dots, K-1$.

A prior that instead pushes each row $w_{j\cdot}$ toward a vertex of the simplex will enforce a cell-wise selection. Instead of choosing fixed values, we view $\mu$ and $\sigma$ as hyperparameters and aim for a full Bayesian solution: to describe the uncertainty of all parameters by their joint posterior distribution $p(\alpha, \mu, \sigma \mid \mathcal{D})$, letting the data tell how much regularization is desired. To accomplish this Bayesian inference, we assign a hyperprior to $(\mu, \sigma)$:
$$\text{hyperprior:} \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma), \quad k = 1, \dots, K-1. \quad (10)$$
Putting the pieces (5), (9), (10) together, up to an omitted normalization constant, we attain a joint posterior density of all free parameters $\alpha \in \mathbb{R}^{J \times (K-1)}$, $\mu \in \mathbb{R}^{K-1}$, $\sigma \in \mathbb{R}_+^{K-1}$:
$$\log p(\alpha, \mu, \sigma \mid \mathcal{D}) = \sum_{i=1}^n \log\Bigl(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\Bigr) + \sum_{k=1}^{K-1} \sum_{j=1}^J \log p_{\text{prior}}(\alpha_{jk} \mid \mu_k, \sigma_k) + \sum_{k=1}^{K-1} \log p_{\text{hyperprior}}(\mu_k, \sigma_k). \quad (11)$$
Unlike complete- and no-pooling stacking, which are typically solved by optimization, the maximum a posteriori (MAP) estimate of (11) is not meaningful: the mode is attained on the complete-pooling subspace $\alpha_{jk} = \mu_k, \sigma_k = 0, \forall j, k$, on which the joint density is infinite. Instead, we sample $(\alpha, \mu, \sigma)$ from the joint density (11) using Markov chain Monte Carlo (MCMC) methods and compute the Monte Carlo mean of the posterior draws, $\hat w_{jk}$, which we call the hierarchical stacking weights. The final prediction for $y$ at any input location $(\tilde x, \tilde z)$ is
$$\text{final prediction:} \quad p(\tilde y \mid \tilde x, \tilde z) = \sum_{k=1}^K \hat w_k(\tilde x) \int_{\Theta_k} p(\tilde y \mid \tilde x, \tilde z, \theta_k, M_k)\, p(\theta_k \mid M_k, \mathcal{D})\, d\theta_k. \quad (12)$$
Using a point estimate $\hat w_{jk}$ is not a waste of the joint simulation draws: because equation (12) is linear in $w_k$, and because of the linearity of expectation, using $\hat w$ is as good as using all simulation draws. Nonetheless, for the purpose of post-processing and approximate cross validation, we will use all posterior simulation draws; see the discussion in Section 6.3.
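To make the construction concrete, here is a minimal sketch — our own illustration with hypothetical names; the paper's implementation is the Stan program referenced in Appendix C — of the joint log density (11) for discrete cells, in a form that could be handed to a general-purpose MCMC sampler:

```python
import numpy as np
from scipy.stats import norm

def log_posterior(params, lpd, cell, J, K, mu0=0.0, tau_mu=1.0, tau_sigma=1.0):
    """Joint log density (11) for discrete cells.

    params: flat vector packing alpha (J*(K-1)), mu (K-1), log_sigma (K-1).
    lpd:    (n, K) leave-one-out log predictive densities log p_{k,-i}.
    cell:   (n,) integer cell indexes x_i in {0, ..., J-1}.
    """
    alpha = params[: J * (K - 1)].reshape(J, K - 1)
    mu = params[J * (K - 1) : J * (K - 1) + (K - 1)]
    log_sigma = params[J * (K - 1) + (K - 1) :]
    sigma = np.exp(log_sigma)

    # row-wise softmax (8), with model K as the reference (alpha_jK = 0)
    a_full = np.hstack([alpha, np.zeros((J, 1))])
    logw = a_full - np.logaddexp.reduce(a_full, axis=1, keepdims=True)

    # log likelihood: sum_i log sum_k w_k(x_i) p_{k,-i}
    loglik = np.sum(np.logaddexp.reduce(lpd + logw[cell], axis=1))

    # prior (9), hyperprior (10) with half-normal on sigma, plus the
    # Jacobian term of the log_sigma reparameterization
    logprior = (norm.logpdf(alpha, mu, sigma).sum()
                + norm.logpdf(mu, mu0, tau_mu).sum()
                + np.log(2.0) * (K - 1)
                + norm.logpdf(sigma, 0.0, tau_sigma).sum()
                + log_sigma.sum())
    return loglik + logprior
```

A sampler such as NUTS would then be run on `log_posterior`, and $\hat w_{jk}$ computed as the Monte Carlo average of the softmax-transformed draws.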
2.4. Hierarchical stacking: continuous and hybrid inputs

The next step is to include more structure in the weights, which could correspond to regression for continuous predictors, nonexchangeable models for nested or crossed grouping factors, nonparametric priors, or combinations of these.

Additive model
Hierarchical stacking is not limited to a discrete cell-divider $x$. When the input $x$ is continuous or hybrid, one extension is to model the constrained weights via an additive model:
$$w(x) = \mathrm{softmax}\bigl(w^*_1(x), \dots, w^*_K(x)\bigr), \quad w^*_k(x) = \mu_k + \sum_{m=1}^M \alpha_{mk} f_m(x),\ k \le K-1, \quad w^*_K(x) = 0, \quad (13)$$
where $\{f_m : \mathcal{X} \to \mathbb{R}\}$ are $M$ distinct features. Here we have already extracted the prior mean $\mu_k$, representing the "average" weight of model $k$ in the unconstrained space. The discrete model (9) is now the special case with $f_m(x) = \mathbb{1}(x = m)$ for $m = 1, \dots, J$. We may still use the basic prior (9) and hyperprior (10):
$$\alpha_{mk} \mid \sigma_k \sim \mathrm{normal}(0, \sigma_k), \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma). \quad (14)$$
We provide Stan (Stan Development Team, 2020) code for this additive model in Appendix C. Because the main motivation of our paper is to convert a one-size-fits-all model-averaging algorithm into open-ended Bayesian modeling, the basic shrinkage prior above should be viewed as a starting point for model building and improvement. Without trying to exhaust all possible variants, we list a few useful prior structures:

• Grouped hierarchical prior. The basic model (14) is limited to the same regularization $\sigma_k$ for all $\alpha_{mk}$. When the features $f_m(x)$ are grouped (e.g., the $f_m$ are dummy variables from two discrete inputs; states are grouped in regions), we achieve group-specific shrinkage by replacing (14) with
$$\alpha_{mk} \mid \mu_k, \sigma_{gk} \sim \mathrm{normal}(0, \sigma_{g[m]k}), \quad \sigma_{gk} \sim \mathrm{normal}^+(0, \tau_\sigma), \quad \mu_k \sim \mathrm{normal}(\mu_0, \tau_\mu),$$
where $g[m] = 1, \dots, G$ is the group index of feature $m$.

• Feature-model decomposition. Alternatively, we can learn feature-dependent regularization via
$$\alpha_{mk} \mid \mu_k, \sigma_k, \lambda_m \sim \mathrm{normal}(0, \sigma_k \lambda_m), \quad \lambda_m \sim \mathrm{InvGamma}(a, b), \quad \sigma_k \sim \mathrm{normal}^+(0, \tau_\sigma).$$

• Prior correlation. For discrete cells, we would like to incorporate prior knowledge of the group correlation. For example, in the election forecast (Section 5.3), we have a rough sense of some states being demographically close, and would expect similar model weights there. We calculate the prior correlation matrix $\Omega_{J \times J}$ from various sources of state-level historical data, and replace the independent prior (9) by a multivariate normal prior (see the sketch after this list),
$$(\alpha_{1k}, \dots, \alpha_{Jk}) \mid \sigma, \Omega, \mu \sim \mathrm{MVN}\bigl((\mu_k, \dots, \mu_k),\ \sigma_k^2\, \Omega\bigr). \quad (15)$$
The prior correlation is especially useful to stabilize the stacking weights in small cells.
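As a small sketch of the correlated prior (15) — our illustration, assuming the prior correlation matrix $\Omega$ has already been estimated from external data:

```python
import numpy as np
from scipy.stats import multivariate_normal

def correlated_prior_logpdf(alpha_k, mu_k, sigma_k, Omega):
    """Log density of prior (15) for one model's unconstrained weights.

    alpha_k: length-J vector (alpha_{1k}, ..., alpha_{Jk}).
    Omega:   J x J prior correlation matrix from external data.
    """
    mean = np.full(len(alpha_k), mu_k)
    cov = sigma_k**2 * Omega       # shared scale, external correlation
    return multivariate_normal.logpdf(alpha_k, mean=mean, cov=cov)
```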
Crude approximation of input density. When applying the basic model (13) to continuous inputs $x = (x_1, \dots, x_D) \in \mathbb{R}^D$, instead of a direct linear regression $f_d(x) = x_d$, we recommend a coordinate-wise ReLU-type transformation:
$$\{f : f_{2d-1}(x) = (x_d - \mathrm{med}(x_d))_+,\ f_{2d}(x) = (\mathrm{med}(x_d) - x_d)_+,\ d \le D\}, \quad (16)$$
where $\mathrm{med}(x_d)$ is the sample median of $x_d$. The pointwise model predictive performance typically depends on the training density $P^{\text{train}}_X(\tilde x)$: the more training data seen nearby, the better the predictions. The features (16) are designed as a crude approximation of the log marginal input densities.
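A minimal sketch of the feature map (16); in practice, the medians should be computed on the training data and reused at test time:

```python
import numpy as np

def rectified_features(X, medians=None):
    """Hinge-feature pairs (16) for continuous inputs.

    X: (n, D) matrix. Returns an (n, 2D) feature matrix with columns
    (x_d - med(x_d))_+ and (med(x_d) - x_d)_+ for each dimension d.
    """
    med = np.median(X, axis=0) if medians is None else medians
    return np.hstack([np.maximum(X - med, 0.0), np.maximum(med - X, 0.0)])
```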
Choice of features and exploratory data analysis

Choosing the division of $(x, z)$ with discrete inputs is now a more general problem of how to construct the features $f_m(x)$, or a variable selection problem in the regression (13). In ordinary statistical modeling, we often start variable selection with exploratory data analysis. Here we cannot directly associate model weights $w_{ki}$ with observable quantities. Nevertheless, we can use the paired pointwise log predictive density difference $\Delta_{ki} = \log p_{k,-i} - \log p_{K,-i}$ as an exploratory approximation to the trend of $\alpha_k(x_i)$. A scatter plot of $\Delta_{ki}$ against $x$ may suggest which margin of $x$ is likely important (see the sketch at the end of this subsection). For example, the dependence of $\Delta_{ki}$ on whether $x_i$ lies in the bulk or in the tail is evidence for our previous recommendation of the rectified features.

As more variables $x$ are allowed to vary in the stacking model, model averaging becomes more prone to overfitting. Pointwise stacking typically has a large noise-to-signal ratio, not only due to model similarity, but also due to the high variance of pointwise model evaluation: the approximate leave-one-out cross validation possesses Monte Carlo errors, and even if we run exact leave-one-out, or use an independent validation set in lieu of leave-one-out, we only observe one $y_i$ for each $x_i$ (if $x$ is continuous), so that $\log p_{k,-i}$ is at best a one-sample estimate of $\mathrm{E}_{\tilde y \mid x_i}(\log p(\tilde y \mid x_i, M_k))$ with non-vanishing variance. If $f_m(\cdot)$ is flexible enough, then the sample optimum of no-pooling stacking (7) always degenerates to pointwise model selection that pointwise picks the model that "best" fits the current realization of $y_i$, $w_{\arg\max_k p_{k,-i}}(x_i) = 1$, which is pure overfitting.

Even accompanied by hierarchical priors, we do not expect to include too many features on which the stacking weights depend. In our experiments, an additive model with discrete variables and rectified continuous variables without interactions is often adequate. With a moderate or large number of features/cells $M$, it is sensible to scale the default hyperprior $\tau_\sigma = O(\sqrt{1/M})$, or to adopt other feature-wise shrinkage priors such as the horseshoe.
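As a sketch of the exploratory plot described above — our illustration, assuming the LOO matrix from the earlier sketches is available:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_lpd_difference(lpd, x, k=0, ref=-1):
    """Scatter the pointwise LOO advantage Delta_{ki} of model k over a
    reference model against one input margin, as a rough guide to which
    features f_m the stacking weights should depend on."""
    delta = lpd[:, k] - lpd[:, ref]
    plt.scatter(np.asarray(x), delta, s=5)
    plt.axhline(0.0, linestyle="--", color="gray")
    plt.xlabel("x")
    plt.ylabel("pointwise LOO lpd difference")
    plt.show()
```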
Gaussian process prior

An alternative way to generalize both the discrete prior in Section 2.3 and the prior correlation (15) is a Gaussian process prior. To this end we need $K - 1$ kernels $\mathcal{K}_1, \dots, \mathcal{K}_{K-1}$, and place priors on the unconstrained weights $\alpha_k(x)$, viewed as $\mathcal{X} \to \mathbb{R}$ functions: $\alpha_k(x) \sim \mathcal{GP}(\mu_k, \mathcal{K}_k)$. The discrete model becomes a special case of the Gaussian process via a zero-one kernel $\mathcal{K}_k(x_i, x_j) = \sigma_k^2\, \mathbb{1}(x_i = x_j)$. Due to the previously discussed measurement error and the preference for stronger regularization with continuous $x$, we recommend simple squared exponential kernels $\mathcal{K}_k(x_i, x_j) = a_k^2 \exp\bigl(-((x_i - x_j)/\rho_k)^2\bigr)$ with an informative hyperprior that avoids small length scales $\rho_k$ and big amplitudes $a_k$. We present an example in Section 5.2.

2.5. Time series and longitudinal data

Hierarchical stacking easily extends to time series and longitudinal data. Consider a time series dataset where the outcomes $y_i$ come sequentially in time, $0 \le t_i \le T$. The joint likelihood is not exchangeable, but it is still factorizable via $p(y_{1:n} \mid \theta) = \prod_{i=1}^n p(y_i \mid \theta, y_{1:i-1})$. Therefore, assuming some stationarity condition, we can approximate the expected log predictive density of the next unseen outcome by the historical average of one-unit-ahead log predictive densities, defined by
$$p_{k,-i} := \int_{\Theta_k} p(y_i \mid x_i, y_{1:i-1}, x_{1:i-1}, \theta_k, M_k)\, p(\theta_k \mid y_{1:i-1}, x_{1:i-1})\, d\theta_k.$$
In hierarchical stacking, we only need to replace the regular leave-one-out predictive density (6) by this redefined $p_{k,-i}$, and run hierarchical stacking (11) as usual. Using an importance-sampling-based approximation (Bürkner et al., 2019), we again obtain efficient computation without the need to fit each model $n$ times.

If we worry about the time series being non-stationary, we can reweigh the likelihood in (11) by a non-decreasing sequence $\pi_{1:n}$, as $n \sum_{i=1}^n \bigl(\pi_i \log\bigl(\sum_{k=1}^K w_k(x_i)\, p_{k,-i}\bigr)\bigr) / \sum_{i=1}^n \pi_i$, so as to put more emphasis on recent dates. For example, $\pi_i = (1 + \delta)^{-(1 - t_i/T)}$, where $\delta > 0$ is a fixed discount parameter. Alternatively, by augmenting the input $x := (x, t)$, the stacking weight can depend on a continuous time variable, too.

In Section 5.3, we present an election example with longitudinal polling data (a 40-week time series $\times$ 50 states). For the $i$-th poll (already ordered by date), we encode the state index as the input $x_i = 1, \dots, 50$, collect all other poll-specific variables in $z_i$ and the date in $t_i$, and let $y_i$ be the poll outcome. We compute the one-week-ahead predictive density $p_{k,-i} := \int p(y_i \mid x_i, z_i, \theta_k, M_k)\, p(\theta_k \mid \mathcal{D}_{-i}, M_k)\, d\theta_k$, where the dataset $\mathcal{D}_{-i} = \{(y_l, x_l, z_l) : t_l \le t_i - 1\ \text{week}\}$ contains polls from all states up to one week before date $t_i$.
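Reusing the `stacking_weights` sketch from Section 1, the discounted objective only changes the per-observation weights; `delta` below is an assumed discount strength:

```python
import numpy as np

def discounted_stacking_weights(lpd, t, T, delta=0.5):
    """Time-discounted complete-pooling stacking (Section 2.5):
    observation i gets weight pi_i = (1 + delta)**(-(1 - t_i / T)),
    so recent observations count more.

    lpd: (n, K) one-unit-ahead log predictive densities.
    t:   (n,) observation times in [0, T].
    """
    pi = (1.0 + delta) ** (-(1.0 - np.asarray(t) / T))
    return stacking_weights(lpd, obs_weights=pi)
```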
3. Why model averaging works and why hierarchical stacking can work better
The consistency of leave-one-out cross validation ensures that complete-pooling stacking (3) is asymptotically no worse than model selection in prediction (Clarke, 2003; Le and Clarke, 2017), hence justified by Bayesian decision theory. The theorems we establish in Section 3.2 go a step further, providing lower bounds on the utility gain of stacking and of hierarchical stacking. In short, the advantage of model averaging is more pronounced when the model predictive performances are locally separable, but in the same situation we can further improve the linear mixture by learning locally which model is better, so that stacking is a step toward model improvement rather than an end in itself. We illustrate with a theoretical example in Appendix A and provide proofs in Appendix B.

3.1. M-views: The truer, the better?

With an M-closed view (Bernardo and Smith, 1994), one of the candidate models is the true data generating process, whereas in the more realistic M-open scenario, none of the candidate models is completely correct, hence all models are evaluated by the extent to which they interpret the data. A strictly proper scoring rule, such as the expected log predictive density (elpd), is maximized at the correct data generating process. However, the extent to which a model is "true" is contingent on the input information we have collected. Consider an input-outcome pair $(x, y)$ generated by
$$x \in [0, 1],\ y \in \{0, 1\}, \quad x \sim \mathrm{uniform}(0, 1), \quad \Pr(y = 1 \mid x) = x.$$
If the input $x$ is not observed or is omitted in the analysis, then $M_1 : y \sim \mathrm{Bernoulli}(0.5)$ is the only correct model and is optimal among all probabilistic predictions of $y$ unconditional on $x$. But this marginally true model is strictly worse than a misspecified conditional prediction, $M_2 : \Pr(y = 1 \mid x) = \sqrt{x}$, since the expected log predictive densities are $\log(0.5) = -0.69$ and $-7/12 = -0.58$, respectively, after averaging over $x$ and $y$. The former model is true purely because it ignores some predictors.

This wronger-model-does-better example does not contradict the log score being strictly proper, as we are changing the decision space from measures on $y$ to conditional measures on $y \mid x$. But this example does underline two properties of model evaluation and averaging. First, we have little interest in a binary model check. The hypothesis-testing notion of a model being true or false depends on which variables we condition on, and is not necessarily related to model fit or prediction accuracy. In a non-quantum scheme, a really "everywhere true" model that has exhausted all potentially unobserved inputs contains no aleatory uncertainty. Second, the model fit typically varies across the input space. In the Bernoulli example, despite its larger overall error, $M_1$ is more desirable near $x \approx 0.5$, and is optimal at $x = 0.5$.
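These numbers are easy to verify numerically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(0.0, 1.0, n)
y = rng.binomial(1, x)

# M1 predicts Pr(y = 1) = 0.5 everywhere; M2 predicts sqrt(x)
lpd1 = np.full(n, np.log(0.5))
p2 = np.sqrt(x)
lpd2 = np.where(y == 1, np.log(p2), np.log(1.0 - p2))
print(lpd1.mean(), lpd2.mean())   # approximately -0.69 and -0.58

# region where M1 is locally best: celpd_1(x) = log 0.5 for all x
grid = np.linspace(0.001, 0.999, 999)
celpd2 = grid * np.log(np.sqrt(grid)) + (1 - grid) * np.log(1 - np.sqrt(grid))
best1 = grid[celpd2 < np.log(0.5)]
print(best1.min(), best1.max())   # I_1 is roughly [0.25, 0.68]
```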
3.2. All models are somewhere useful

For theoretical interest, we define the conditional (on $\tilde x$) expected (on $\tilde y \mid \tilde x$) log predictive density of the $k$-th model, $\mathrm{celpd}_k(\tilde x) := \int_{\mathcal{Y}} p_t(\tilde y \mid \tilde x) \log p(\tilde y \mid \tilde x, M_k)\, d\tilde y$. If $\{\mathrm{celpd}_k\}_{k=1}^K$ were known, we could divide the input space $\mathcal{X}$ into $K$ disjoint sets based on which model has the locally best fit (when there is a tie, the point is assigned to the smallest index; $I$ stands for "input"):
$$I_k := \{\tilde x \in \mathcal{X} : \mathrm{celpd}_k(\tilde x) > \mathrm{celpd}_{k'}(\tilde x),\ \forall k' \neq k\}, \quad k = 1, \dots, K. \quad (17)$$
In the Bernoulli example above, $I_1 \approx [0.25, 0.68]$.

In this subsection, we focus on the oracle expressive power of (local) model selection and averaging, and $w^{\text{stacking}}$ refers to the complete-pooling stacking weight in the population:
$$w^{\text{stacking}} := \arg\max_{w \in \mathbb{S}_K} \mathrm{elpd}(w), \quad \mathrm{elpd}(w) = \int_{\mathcal{X} \times \mathcal{Y}} \log\Bigl(\sum_{k=1}^K w_k\, p(\tilde y \mid M_k, \tilde x)\Bigr)\, p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x. \quad (18)$$
Apart from the heuristic that model averaging is likely to be more useful when the candidate models are more "dissimilar" or "distinct" (Breiman, 1996; Clarke, 2003), we are not aware of rigorous theories that characterize this "diversity" with regard to the effectiveness of stacking. It seems tempting to use some divergence measure between the posterior predictions of the models as a metric of how close these models are, but that is irrelevant to the true data generating process. We define a more relevant metric on how the individual predictive distributions can be pointwise separated. The description of a forecast being good is probabilistic in both $\tilde x$ and $\tilde y$: an overall bad forecast may be lucky at a one-time realization of the outcome $\tilde y$ and covariate $\tilde x$. We consider the input-output product space $\mathcal{X} \times \mathcal{Y}$ and divide it into $K$ disjoint subsets ($J$ stands for "joint"):
$$J_k := \{(\tilde x, \tilde y) \in \mathcal{X} \times \mathcal{Y} : p(\tilde y \mid M_k, \tilde x) > p(\tilde y \mid M_{k'}, \tilde x),\ \forall k' \neq k\}, \quad k = 1, \dots, K.$$
In this framework, we call a family of predictive densities $\{p(\tilde y \mid M_k, \tilde x)\}_{k=1}^K$ locally separable with a constant pair $L > 0$ and $0 \le \epsilon < 1$ if
$$\sum_{k=1}^K \int_{(\tilde x, \tilde y) \in J_k} \mathbb{1}\bigl(\log p(\tilde y \mid M_k, \tilde x) < \log p(\tilde y \mid M_{k'}, \tilde x) + L,\ \forall k' \neq k\bigr)\, p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x \le \epsilon. \quad (19)$$
Stacking is sometimes criticized for being a black box. The next two theorems link the stacking weights to a probabilistic explanation with respect to the true joint measure $p_t(\tilde y, \tilde x)$.

Theorem 1. When the separation condition (19) holds, the stacking weight is approximately the probability of the model being the locally best fit,
$$w_k^{\text{stacking}} \approx w_k^{\text{approx}} := \Pr(J_k) = \int_{J_k} p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x, \quad (20)$$
in the sense that the objective function is nearly optimal:
$$\bigl|\mathrm{elpd}(w^{\text{approx}}) - \mathrm{elpd}(w^{\text{stacking}})\bigr| \le O(\epsilon + \exp(-L)). \quad (21)$$
Conversely, a model will only be ignored by stacking if its winning probability is small.

Theorem 2.
When the separation condition (19) holds, and if the $k$-th model has zero weight in stacking, $w_k^{\text{stacking}} = 0$, then the probability of its winning region is bounded by
$$\Pr(J_k) \le \bigl(1 + (\exp(L) - 1)(1 - \epsilon) + \epsilon\bigr)^{-1}. \quad (22)$$
The right hand side can be further upper-bounded by $\exp(-L) + \epsilon$.

The separation condition (19) trivially holds for $\epsilon = 1$ and an arbitrary $L$, or for $L = 0$ and an arbitrary $\epsilon$, though in those cases the bounds (21) and (22) become too loose. To be clear, we only use the closed-form approximation (20) for theoretical assessment. The next theorem ensures a lower bound on the utility gain from shifting from model selection to stacking.

Theorem 3.
Under the separation condition (19), let $\rho = \sup_k \Pr(J_k)$ and define the deterministic function $g(L, K, \rho, \epsilon) = L(1 - \rho)(1 - \epsilon) - \log K$; then the utility gain of stacking is lower-bounded by
$$\mathrm{elpd}^{\text{stacking}} - \sup_k \mathrm{elpd}_k \ge \max\bigl(g(L, K, \rho, \epsilon) + O(\exp(-L) + \epsilon),\ 0\bigr).$$
Evaluating $J_k$ requires access to $\tilde y \mid \tilde x$ and $\tilde x$. Though both terms are unknown, the roles of $\tilde x$ and $\tilde y$ are not symmetric: we could tailor the model in preparation for a future prediction at a given $\tilde x$, but we cannot tailor it for a pointwise $\tilde y$. To make the theoretical discussion more tractable, we consider the case where the variation in $\tilde x$ dominates the uncertainty of the model comparison. More precisely, we define a strong local separation condition with a distance-probability pair $(L, \epsilon)$:
$$\sum_{k=1}^K \int_{\tilde x \in I_k} \int_{\tilde y \in \mathcal{Y}} \mathbb{1}\bigl(\log p(\tilde y \mid M_k, \tilde x) < \log p(\tilde y \mid M_{k'}, \tilde x) + L,\ \forall k' \neq k\bigr)\, p_t(\tilde y, \tilde x)\, d\tilde y\, d\tilde x \le \epsilon. \quad (23)$$
We define $\rho_{\mathcal{X}} = \sup_k \Pr(I_k)$. When condition (23) holds, $J_k \approx I_k \times \mathcal{Y}$, so that $\rho_{\mathcal{X}} \approx \rho$.

As per Theorem 3, for a given $L$ and $\epsilon$, the smaller $\rho$ is, the larger the improvement $L(1 - \rho)(1 - \epsilon) - \log K$ that stacking can achieve over model selection: the situation in which no model predominates everywhere. Put another way, the effectiveness of stacking is an indicator of heterogeneity of the model fit. Interestingly, it is the same direction that further gives rise to an additional utility gain if we shift from stacking to pointwise selection. That is, if we knew the input space division $\{I_k\}$ in (17) a priori, we could assign model $M_k$ only to $I_k$:
$$p(\tilde y \mid \tilde x, I, \text{pointwise selection}) = \sum_{k=1}^K \mathbb{1}(\tilde x \in I_k)\, p(\tilde y \mid \tilde x, M_k). \quad (24)$$
The following theorem lower-bounds the gain from moving from stacking to pointwise selection.
Figure 1: Evolution of methods. First row, from left to right (single model selection, complete-pooling stacking (3), pointwise selection (24), no-pooling stacking (7), hierarchical stacking (11)): the methods have a higher degree of freedom to ensure a higher asymptotic predictive accuracy, the gain of which is bounded by the labeled theorems (Theorems 3 and 4). Meanwhile, more complex methods come with a slower convergence rate. Hierarchical stacking generalizes all the remaining methods by assigning various structured priors ($\sigma_k \to 0, \forall k$ recovers complete pooling; $\sigma_k \to \infty, \forall k$ recovers no pooling), and adapts to the complexity-expressiveness tradeoff by hierarchical modeling.
Theorem 4.
Under the strong separation condition (23), and if the divisions $\{I_k\}$ are known exactly, then the extra utility gain of pointwise selection has a lower bound,
$$\mathrm{elpd}^{\text{pointwise selection}} - \mathrm{elpd}^{\text{stacking}} \ge -\log \rho_{\mathcal{X}} + O(\exp(-L) + \epsilon).$$
For any input location $x_0 \in \mathcal{X}$, if we can approach the theoretically pointwise-optimal $\hat w(x_0) \in \mathbb{S}_K$, then applying Theorem 3 to the space $\{x_0\} \times \mathcal{Y}$ suggests the advantage of pointwise averaging (4) over pointwise selection (24).

The potential utility gain from Theorems 3 and 4 is the motivation behind input-varying model averaging. Despite this asymptotic expressiveness, the finite-sample estimation remains challenging. We do not know $I_k$ or $J_k$; we may use leave-one-out cross validation to estimate the overall model fit $\mathrm{elpd}_k$, but in the pointwise version we want to assess conditional model performance—the more data come in, the more input locations there are to assess. The asymptotic expressiveness comes with increasing complexity, as the free parameters in single-model selection, complete-pooling stacking, pointwise selection, and no-pooling stacking are, respectively, a single model index, a length-$K$ simplex, a vector of pointwise model selection indexes $\{1, 2, \dots, K\}^{\mathcal{X}}$, and a matrix of pointwise weights $(\mathbb{S}_K)^{\mathcal{X}}$. It is natural to apply the hierarchical shrinkage prior to handle this complexity-expressiveness tradeoff (Figure 1).

3.3. Covariate shift

So far we have adopted an IID view: the training and out-of-sample data are from the same distribution. Yet another appealing property of hierarchical stacking is its immunity to covariate shift (Shimodaira, 2000), a ubiquitous problem in non-representative sample surveys, data-dependent collection, causal inference, and many other areas. If the distribution of the inputs $x$ in the training sample, $p^{\text{train}}_X(\cdot)$, differs from these predictors' distribution in the population of interest, $p^{\text{pop}}_X(\cdot)$ (with $p^{\text{pop}}_X$ absolutely continuous with respect to $p^{\text{train}}_X$), and if $p(z \mid x)$ and $p(y \mid x, z)$ remain invariant, then we do not need to adjust the weight estimate from (11), because it has already been pointwise optimized.

By contrast, complete-pooling stacking is aimed at the average risk. Under covariate shift, the sample mean of the leave-one-out score of the $k$-th model, $(1/n) \sum_{i=1}^n \log p_{k,-i}$, is no longer a consistent estimate of the population elpd. To adjust, we can run importance sampling (Sugiyama and Müller, 2005; Sugiyama et al., 2007; Yao et al., 2018) and reweigh the $i$-th term in the objective (3) proportionally to the inverse probability ratio $p^{\text{pop}}_X(x_i) / p^{\text{train}}_X(x_i)$; a sketch follows at the end of this section. Even in the ideal situation when both $p^{\text{pop}}_X$ and $p^{\text{train}}_X$ are known, the importance-weighted sum has in general a larger or even infinite variance (Vehtari et al., 2019), thereby lowering the effective sample size and the convergence rate of complete-pooling stacking (toward its population optimum (18)). When $p^{\text{train}}_X$ is unknown, the covariate reweighting is more complex still, while hierarchical stacking circumvents the need to explicitly model $p^{\text{train}}_X$.

When we are interested in only one fixed input location, $p^{\text{pop}}_X(\tilde x) = \delta_{x_0}$, hierarchical stacking is ready for conditional predictions, whereas no-pooling stacking and reweighted complete-pooling stacking effectively discard all training data with $x_i \neq x_0$ in their objectives, which is especially a drawback when $x_0$ is rarely observed in the sample.
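For completeness, here is the reweighting sketch promised above, again reusing the `stacking_weights` helper from Section 1; `pdf_pop` and `pdf_train` are hypothetical density callables that must be known or estimated:

```python
import numpy as np

def shift_adjusted_stacking_weights(lpd, x, pdf_pop, pdf_train):
    """Importance-weighted complete-pooling stacking under covariate
    shift: term i in objective (3) is reweighted by the probability
    ratio p_pop(x_i) / p_train(x_i)."""
    r = pdf_pop(x) / pdf_train(x)
    return stacking_weights(lpd, obs_weights=r)
```

As noted above, these ratios can have heavy tails, which inflates the variance of the reweighted objective.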
4. Related literature
Stacking (Wolpert, 1992; Breiman, 1996; LeBlanc and Tibshirani, 1996), or what we call complete-pooling stacking in this paper, has long been a popular method to combine learning algorithms, and has been advocated for averaging Bayesian models (Clarke, 2003; Clyde and Iversen, 2013; Le and Clarke, 2017; Yao et al., 2018). This simple approach has been applied in various areas such as recommendation systems, epidemiology (Bhatt et al., 2017), network modeling (Ghasemian et al., 2020), and post-processing in Monte Carlo computation (Tracey and Wolpert, 2016; Yao et al., 2020). Stacking can be equipped with any scoring rule, while the present paper focuses on the logarithmic score by default.

Our theoretical investigation in Section 3.2 is inspired by the discussion of how to choose candidate models in Clarke (2003) and Le and Clarke (2017). In $L_2$ loss stacking, they recommended "independent" models, in the sense of the posterior point predictions $(\mathrm{E}_{\tilde y} p(\tilde y \mid \tilde x, M_1), \dots, \mathrm{E}_{\tilde y} p(\tilde y \mid \tilde x, M_K))$ being independent. When combining Bayesian predictive distributions, the correlations of the posterior predictive means are not enough to summarize the relation between the distributions $(p(\tilde y \mid \tilde x, M_1), \dots, p(\tilde y \mid \tilde x, M_K))$, hence we consider the local separation condition instead.

Allowing a heterogeneous stacking model weight that changes with the input $x$ is not a new idea. Feature-weighted linear stacking (Sill et al., 2009) constructs data-varying model weights of the $k$-th model by $w_k(x) = \sum_{m=1}^M \alpha_{km} f_m(x)$, where $\alpha_{km}$ optimizes the $L_2$ loss of the point predictions of the weighted model. This is similar to the likelihood term of our additive model specification in Section 2.4, except that we model the unconstrained weights. The direct least-squares optimization solution of feature-weighted linear stacking is what we label no-pooling stacking.

It is also not a new idea to add regularization and optimize the penalized loss function. For $L_2$ loss stacking, Breiman (1996) advocated non-negative constraints. In the context of combining Bayesian predictive densities, a simplex constraint is necessary. Reid and Grudic (2009) investigated adding an $L_1$ or $L_2$ penalty, $-\lambda \|w\|_1$ or $-\lambda \|w\|_2$, to the complete-pooling stacking objective (3). Yao et al. (2020) assigned a Dirichlet$(\lambda)$, $\lambda > 1$, prior on $w$ to ensure strict concavity of the objective function. Sill et al. (2009) mentioned the use of $L_2$ penalization in feature-weighted linear stacking, which is equivalent to setting a fixed prior for all free parameters, $\alpha_{km} \sim \mathrm{normal}(0, \tau)$, $\forall k, m$, whose solution path connects uniform weighting and no-pooling stacking as $\tau$ is tuned. All of these schemes have been shown to reduce overfitting with an appropriate amount of regularization, but the tuning is computationally intensive.
In particular, each stacking run is built upon one layer of cross validation to compute the expected pointwise score $p_{k,-i}$ of each model, and this extra tuning would require fitting each model $n(n-1)$ times per tuning parameter value if both layers were done by exact leave-one-out. Fushiki (2020) approximated this double cross validation for $L_2$ loss complete-pooling stacking with an $L_2$ penalty on $w$, beyond which there is no general efficient approximation.

Hierarchical stacking treats $\{\mu_k\}$ and $\{\sigma_k\}$ as parameters and samples them from the joint density. Such a hierarchy could be approximated by using an $L_2$ penalized point estimate with a different tuning parameter in each model, tuning all parameters ($\{\sigma_k\}_{k=1}^{K-1}$ for the basic model, or $\{\sigma_{mk}\}_{m=1,k=1}^{M,K-1}$ for the product model). But this intensive tuning is equivalent to approximating the Type-II MAP of hierarchical stacking in an inefficient grid-searching fashion (in contrast to gradient-based MCMC).

Another popular family of regularization in stacking enforces sparse weights (e.g., Zhang and Zhou, 2011; Şen and Erdogan, 2013; Yang and Dunson, 2014), including sparse and grouped-sparse priors on the unconstrained weights, and sparse Dirichlet priors on simplex weights. The goal is that only a limited number of models are expressed. From our discussion in Section 3, all models are somewhere useful, hence we do not aim for model sparsity: the concavity of the log scoring rule implicitly resists sparsity, and the posterior mean of the hierarchical stacking weights $w_{jk}$ is in general never sparse. Nevertheless, when sparsity is a concern for memory saving or interpretability, we can run hierarchical stacking first and then apply projection predictive variable selection (Piironen and Vehtari, 2017) to the posterior draws from the stacking model (11) and pick a sparse (or cell-wise sparse) solution.

In contrast to fitting the individual models in parallel before model averaging, an alternative approach is to fit all models jointly in a bigger mixture model. Kamary et al. (2019) proposed a Bayesian hypothesis test that fits an encompassing model $p(y \mid w, \theta) = \sum_{k=1}^K w_k\, p(y \mid \theta_k, M_k)$. The mixture model requires simultaneously fitting the model parameters and model weights, $p(w_{1,\dots,K}, \theta_{1,\dots,K} \mid y)$, whose computational burden is a concern when $K$ is big. Yao et al. (2018) illustrated that (complete-pooling) stacking is often more stable than the full mixture, especially when the sample size is small or some models are similar. Nevertheless, our formulation of hierarchical stacking agrees with Kamary et al. (2019) in terms of sampling from the posterior marginal distribution $p(w \mid y)$ in a full-Bayesian model. A no-pooling yet jointly-inferred model $p(y \mid x, w(x), \theta) = \sum_{k=1}^K w_k(x)\, p(y \mid x, \theta_k, M_k)$ is related to a "mixture of local experts" (Jacobs et al., 1991; Jordan and Jacobs, 1994; Svensén and Bishop, 2003) in neural network modeling, where $w_k(\cdot)$ and $p(\cdot \mid x, M_k)$ are both parameterized by neural networks and trained jointly in the bigger mixture model. Analogous to the distinction between a linear mixture and complete-pooling stacking, hierarchical stacking fits $p(\theta_1 \mid y, M_1), \dots, p(\theta_K \mid y, M_K)$ separately and globally. Both the mixture model and stacking have limitations. If the true data generating process is truly a mixture, then fitting the individual components $p(\theta_k \mid y)$ separately is wrong and stacking cannot remedy it. On the other hand, stacking and hierarchical stacking are more suitable when the models have already been developed to fit the data on their own.
Put another way, rather than competing with a mixture-of-experts at combining weak learners, hierarchical stacking is recommended for combining a mixture-of-experts with other sophisticated models.
5. Examples
We present three examples. The well-switching example demonstrates an automated hierarchical stacking implementation with both continuous and categorical inputs. The Gaussian process example highlights the benefit of hierarchical stacking when the individual models are already highly expressive. The election forecast illustrates a real-data example with a complex data structure. We evaluate the proposed method on several metrics, including the mean log predictive density on holdout data, conditional log predictive densities, and the calibration error.
5.1. Well switching in Bangladesh

We work with a dataset used by Vehtari et al. (2017) to demonstrate cross validation. A survey of size $n = 3020$ was conducted on residents of a small area of Bangladesh that was affected by arsenic in drinking water. Households with elevated arsenic levels in their wells were asked whether or not they were interested in switching to a neighbor's well, denoted by $y$. To predict this well-switching behavior, we collect a series of household covariates $x$: the detected arsenic concentration in the well, the distance to the closest known safe well, the household's education level, and whether any household member is in community organizations. The first two inputs are continuous and the remaining two are categorical variables.

We fit a series of plausible logistic regressions, starting with an additive model including all covariates $x$ in model 1. In model 2, we replace one input, the well arsenic level, by its logarithm. In models 3 and 4, we add cubic spline basis functions with ten knots in the well arsenic level and the well distance, respectively. In model 5, we replace the categorical education stage by years of schooling as a continuous variable.

Using the additive model specification (13) and the default prior (14), we model the unconstrained weights $\alpha_k(x)$ by a linear regression on all categorical inputs and all rectified continuous inputs (16). In this example the categorical input has eight distinct levels, the product of education (four levels) and community participation (binary). For comparison, we consider three alternative approaches: (a) complete-pooling stacking, (b) no-pooling stacking, i.e., the maximum likelihood estimate of (13), and (c) model selection, picking the model with the highest leave-one-out log predictive density. We split the data into a training set ($n_{\text{train}} = 2000$) and an independent holdout test set.

Figure 2: (1) Pointwise difference of leave-one-out log scores between models 1 and 2, plotted against log arsenic. Model 1 fits points with high arsenic poorly. (2) Posterior mean of the pointwise unconstrained weight difference between models 1 and 2, $\alpha_1(x) - \alpha_2(x)$, in hierarchical stacking. (3) Pointwise log weight difference between models 1 and 2 in no-pooling stacking. (4) Posterior mean of $w_4(x)$, the weight assigned to model 4 in hierarchical stacking, displayed against log arsenic and education level. There are few respondents with high school education and above, whose effect on the model weights is pooled toward the shared mean. The blue line is complete-pooling stacking. (5) The unconstrained weight of model 4, $\alpha_4(x)$, in no-pooling stacking. The "high school" effect stands out and the resulting model weights $w$ are nearly all zeroes and ones.

The leftmost panel in Figure 2 displays the pointwise difference of leave-one-out log scores between models 1 and 2 against the log arsenic values in the training data. Intuitively, model 1 fits poorly for data with high arsenic. In line with this evidence, hierarchical stacking assigns model 1 an overall low weight, and an especially low weight at the right end of the arsenic range.
The second panel shows the pointwise posterior mean of the unconstrained weight difference between models 1 and 2, $\alpha_1(x) - \alpha_2(x)$, against the arsenic values in the training data. No-pooling stacking reveals the same direction, namely that model 1's weight should be lower at higher arsenic values, but for lack of hierarchical prior regularization, the fitted $\alpha_1(x) - \alpha_2(x)$ is orders of magnitude larger (third panel). As a result, the realized pointwise weights $w$ are nearly all either zero or one.
Figure 3:
We evaluate hierarchical, complete-pooling, and no-pooling stacking and model selection on three metrics: (a) average log predictive density on test data, where we set hierarchical stacking as the benchmark 0; (b) calibration error: the discrepancy between the predicted positive probability and the realized proportion of positives in the test data, averaged over 20 equally spaced bins; and (c) average log predictive density among the $10 \le n \le 200$ worst test data points. We repeat 50 random training-test splits with training size 2000 and test size 1020.

Figure 4:
Same comparisons as Figure 3, with training sample size varying from 100 to 1200.
The rightmost two columns in Figure 2 display the fitted pointwise weights of model 4 against log arsenic values and education level in the test data. Because only a small proportion (7%) of respondents had high school education or above, the no-pooling stacking weight for this category is largely determined by small-sample variation. Hierarchical stacking partially pools this "high school" effect toward the shared posterior mean of all education levels, and the realized hierarchical stacking model weights do not clearly depend on education levels.

We evaluate model fit on three metrics. (a) The log predictive density averaged over test data. In the first panel of Figure 3, we set hierarchical stacking as the baseline; all other methods attain a lower predictive density. (b) The $L_1$ calibration error. We set 20 equally spaced bins between 0 and 1. For each bin and each learning algorithm, we collect the test data points whose model-predicted positive probability falls in that bin, and compute the absolute discrepancy between the realized proportion of positives in the test data and the model-predicted probabilities. The middle panel of Figure 3 displays the resulting calibration error averaged over the 20 bins. The proposed hierarchical stacking has the lowest error. No-pooling stacking has the highest calibration error despite its higher overall log predictive density than model selection, suggesting prediction overconfidence. (c) We compute the average log predictive density of the four methods among the $n$ most shocking test data points (the ones with the lowest predictive densities conditional on a given method), for $n$ varying from 10 to 200 out of 1020 total test points. As exhibited in the last panel of Figure 3, the proposed hierarchical stacking consistently outperforms all other approaches for all $n$: a robust performance in the worst-case scenario. To reduce randomness, all evaluation metrics above are averaged over 50 random training-test splits.
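The calibration metric (b) is straightforward to compute; a minimal sketch:

```python
import numpy as np

def binned_calibration_error(p_pred, y, bins=20):
    """Mean absolute calibration error as in Figure 3(b): average over
    equally spaced bins of |realized positive rate - predicted rate|.

    p_pred: (n,) predicted positive probabilities on test data.
    y:      (n,) binary outcomes.
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    errs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p_pred >= lo) & (p_pred < hi)
        if in_bin.any():
            errs.append(abs(y[in_bin].mean() - p_pred[in_bin].mean()))
    return float(np.mean(errs))
```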
Figure 4 presents the same comparisons of the four methods while the training sample size $n_{\text{train}}$ varies from 100 to 1200 (averaged over 50 random training-test splits). In agreement with the heuristic in Figure 1, the most complex method, no-pooling stacking, performs especially poorly with a small sample size. By contrast, the simplest method, model selection, reaches its peak elpd quickly with a moderate sample size but cannot keep improving as the training data size grows. The proposed hierarchical stacking performs the best in this setting under all metrics.

Figure 5: From left to right. Column 1: posterior density at $\sigma = 0.25$; at least two modes exist. Column 2: predictive distributions of $y$ from the two modes. Column 3: the pointwise comparison of the log predictive densities of the Laplace approximations at the two modes, and the hierarchical stacking weight of mode 1. Column 4: the test-data mean predictive densities of the weighted model, where the individual components of the final model consist of either the MAP, the Laplace approximation, or importance resampling around the two modes, and the weighting methods include hierarchical stacking, complete-pooling stacking, mode heights, and importance weighting.

5.2. Gaussian process regression

The local model averaging (12) entangles an $x$-dependent weight $w(x)$ and $x$-dependent individual predictions $p(y \mid x, M_k)$. If the individual model $y \mid x, M_k$ is big enough to have already exhausted "all" the variability in the input $x$, is there still room for improvement by modeling local model weights $w(x)$? The next example suggests a positive answer.

Consider a regression problem with scalar observations $y_i = f(x_i) + \epsilon_i$, $i = 1, \dots, n$, at input locations $\{x_i\}_{i=1}^n$, where the $\epsilon_i$ are independent noise. The training model is a Gaussian process regression on the latent function $f$ with zero mean and squared exponential covariance:
$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathrm{normal}(0, \sigma), \quad f(x) \sim \mathcal{GP}\Bigl(0,\ a^2 \exp\Bigl(-\frac{(x - x')^2}{\rho^2}\Bigr)\Bigr). \quad (25)$$
We adopt training data from Neal (1998), generated such that the posterior distribution of the hyperparameters $\theta = (a, \rho, \sigma)$ contains at least two isolated modes (see the first panel of Figure 5). We consider three mode-based approximate inferences for $\theta \mid y$: (a) Type-II MAP, where we pick a local mode of the hyperparameters that maximizes the marginal density, $\hat\theta = \arg\max p(\theta \mid y)$, and further draw the local variables $f \mid \hat\theta, y$; (b) Laplace approximation of $\theta \mid y$ around the mode; and (c) importance resampling, where we draw uniform samples near the mode and keep each sample with probability proportional to $p(\theta \mid y)$. In the presence of two local modes $\hat\theta_1, \hat\theta_2$, we either obtain two MAPs or two nearly non-overlapping sets of draws, leading to two predictive distributions. Yao et al. (2020) suggest using complete-pooling stacking to combine the two predictions, which shows advantages over other ad hoc weighting strategies such as mode heights or importance weighting.

Visually, mode 1 has a smaller length scale, is more wiggly, and is attracted by the training data. Because of its better overall fit, it receives the higher complete-pooling stacking weight. However, the wiggly tail makes its extrapolation less robust. We now run hierarchical stacking with an $x$-dependent weight $w_1(x)$ for mode 1 (and $w_2(x) = 1 - w_1(x)$), with $w_1(x) = \mathrm{invlogit}(\alpha(x))$, $\alpha(x) \sim \mathcal{GP}(0, \mathcal{K})$. Although this weight model also uses a GP prior, it is not related to the training regression model (25).
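A sketch of the corresponding joint log density — our illustration; the kernel hyperparameters $a$ and $\rho$ are held fixed for brevity, whereas the recommendation in Section 2.4 is to give them informative hyperpriors:

```python
import numpy as np

def gp_weight_log_posterior(alpha, lpd, x, a=1.0, rho=1.0):
    """Hierarchical stacking of K = 2 posterior modes with a GP prior on
    the unconstrained weight: w_1(x_i) = invlogit(alpha_i).

    alpha: (n,) latent values of alpha(x) at the training inputs.
    lpd:   (n, 2) pointwise log predictive densities of the two modes.
    x:     (n,) training inputs.
    """
    n = len(x)
    # squared exponential kernel, with jitter for numerical stability
    K_mat = a**2 * np.exp(-((x[:, None] - x[None, :]) / rho) ** 2)
    K_mat += 1e-6 * np.eye(n)

    logw1 = -np.log1p(np.exp(-alpha))   # log w_1 = log invlogit(alpha)
    logw2 = -np.log1p(np.exp(alpha))    # log w_2 = log(1 - w_1)
    loglik = np.sum(np.logaddexp(logw1 + lpd[:, 0], logw2 + lpd[:, 1]))

    sign, logdet = np.linalg.slogdet(K_mat)
    logprior = -0.5 * (alpha @ np.linalg.solve(K_mat, alpha)
                       + logdet + n * np.log(2.0 * np.pi))
    return loglik + logprior
```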
Becauseof the a better overall fit, it receives higher complete-stacking weights. However, the wiggling tailmakes its extrapolation less robust. We now run hierarchical stacking with x -dependent weight w k ( x ) for mode k = 1 , w ( x )), w ( x ) = invlogit ( α ( x )) , α ( x ) ∼ GP (0 , K ( x )) . Despite the same GP prior, it is not related to the training regression model (25). To evaluate howgood the weighted ensemble is, we generate independent holdout test data (˜ x i , ˜ y i ). Both trainingand test inputs, x and ˜ x , are distributed from normal(0 , l l l l l l l l l l −2.7 −0.9 0.9 2.7−0.020.000.020.04 bin center of xtest lpd diff. Binned test−lpd difference hierarchical stacking minus exact Bayes l l l l l l l l l l −2.7 −0.9 0.9 2.70.00.10.2 bin center of xBinned test−lpd difference hierarchical stacking minus complete−pooling stacking Figure 6:
Figure 6: Comparison between hierarchical stacking of the two local Laplace approximations and either long-chain exact Bayesian inference of the true model (left) or complete-pooling stacking (right). We compare the binned test log predictive densities over 10 equally spaced bins on $(-3, 3)$. In each bin, hierarchical stacking is the benchmark 0, and a positive value means a worse fit than hierarchical stacking.

In this dataset, exact MCMC is able to explore both posterior modes of model (25) given a long enough sampling run. Gaussian process regression equipped with exact Bayesian inference can be regarded as the "always true" model here. Hierarchical stacking achieves a similar average test-data fit by combining two Laplace approximations. Furthermore, hierarchical stacking has better predictive performance under covariate shift. To examine the local model fit, we generate another independent holdout test set, this time with the test inputs $\tilde x$ drawn from $\mathrm{uniform}(-3, 3)$. Compared with exact Bayesian inference, hierarchical stacking fits slightly worse in the bulk of $x$, while it yields higher predictive densities in the tails, suggesting a more reliable extrapolation.

5.3. Presidential election forecast

We explore the use of hierarchical stacking on a practical example: forecasting polls for the 2016 United States presidential election. Since the polling data are naturally divided into states, this provides a suitable platform for hierarchical stacking, in which the model weights vary by state.

To create a pool of candidate models, we first concisely describe the model of Heidemanns et al. (2020), an updated dynamic Bayesian forecasting model (Linzer, 2013) for the presidential election, and then follow up with different variations of it. Let $i$ be the index of an individual poll, $y_i$ the number of respondents who support the Democratic candidate, and $n_i$ the number of respondents who support either the Democratic or the Republican candidate in the poll. Let $s[i]$ and $t[i]$ denote the state and time of poll $i$, respectively. The model is expressed by
$$y_i \sim \mathrm{Binomial}(\theta_i, n_i), \quad \theta_i = \begin{cases} \mathrm{logit}^{-1}\bigl(\mu^b_{s[i], t[i]} + \alpha_i + \zeta^{\text{state}}_i + \xi_{s[i]}\bigr), & i \text{ is a state poll}, \\ \mathrm{logit}^{-1}\bigl(\sum_{s=1}^S u_s \mu^b_{s, t[i]} + \alpha_i + \zeta^{\text{national}}_i + \sum_{s=1}^S u_s \xi_s\bigr), & i \text{ is a national poll}, \end{cases} \quad (26)$$
where superscripts denote parameter names and subscripts their indexes. The term $\mu^b$ is the underlying support for the Democratic candidate, and $\alpha_i$, $\zeta$, and $\xi$ represent different bias terms. $\alpha_i$ is further decomposed into
$$\alpha_i = \mu^c_{p[i]} + \mu^r_{r[i]} + \mu^m_{m[i]} + z \epsilon_{t[i]}, \quad (27)$$
where $\mu^c$ is the house effect, $\mu^r$ the polling population effect, $\mu^m$ the polling mode effect, and $\epsilon$ an adjustment term for non-response bias. Furthermore, an AR(1) process prior is given to $\mu^b$: $\mu^b_t \mid \mu^b_{t-1} \sim \mathrm{MVN}(\mu^b_{t-1}, \Sigma^b)$, where $\Sigma^b$ is the estimated state-covariance matrix and $\mu^b_T$ is the estimate from the fundamentals.

Although we believe this model fits the data reasonably, there is always some researcher degree of freedom in model building and room for improvement. Our pool of candidates consists of eight models. $M_1$: the fundamentals-based model of Abramowitz (2008). $M_2$: the model of Heidemanns et al. (2020). $M_3$: $M_2$ without the fundamentals prior, $\mu^b_T = 0$. $M_4$: $M_2$ with an AR(2) structure, $\mu^b_t \mid \mu^b_{t-1}, \mu^b_{t-2} \sim \mathrm{MVN}(0.5 \mu^b_{t-1} + 0.5 \mu^b_{t-2}, \Sigma^b)$. $M_5$: simplified $M_2$, without the polling population effect, the polling mode effect, and the adjustment trend for non-response bias, $\alpha_i = \mu^c_{p[i]}$. $M_6$: $M_2$ with an extra regression term $\beta^{\text{stock}}\, \text{stock}_{t[i]}$ added to model (26), using the S&P 500 index at the time of poll $i$. $M_7$: $M_2$ without the entire shared bias term, $\alpha_i = 0$.
M_8: M_2 without the hierarchical structure on states.

We equip hierarchical stacking with either the basic independent prior (9) or the state-correlated prior (15). The state correlation Ω is estimated using a pool of state-level macro variables, and it has already been used in some of the individual models to partially pool state-level polling. We plug this pre-estimated prior correlation into the correlated stacking prior (15) and refer to the result as "hierarchical stacking with correlation" in later comparisons.

Since the data are longitudinal, we evaluate the different pooling approaches using one-week-ahead forecasts with an expanding window over the conducted polls. We extract the fitted one-week-ahead predictions from each individual model, train hierarchical stacking, complete-pooling stacking, and no-pooling stacking to find the optimal model weights, and evaluate the combined models by computing their mean log predictive densities on the unseen test data of the following week. To account for the non-stationarity discussed in Section 2.5, we only use the last four weeks prior to the prediction day for training the model averaging; a schematic of this back-test appears after the next paragraph. In the end we obtain a trajectory of this back-testing performance for hierarchical stacking, complete-pooling stacking, no-pooling stacking, and single-model selection.

The left-hand side of Figure 7 shows the seven-day running average of the one-week-ahead back-test log predictive density for the models combined with the various approaches. The right-hand side of Figure 7 shows the overall cumulative one-week-ahead back-test log predictive density. We set the uncorrelated hierarchical stacking to be a constant zero for reference. Hierarchical stacking performs best, followed by complete-pooling stacking, no-pooling stacking, and model selection, respectively. The advantage of hierarchical stacking is largest at the beginning and slowly decreases the closer we get to election day: as we move toward the election, more polls become available, so the candidate models become better and also more similar, since some models differ only in their priors. As a result, all combination methods eventually become more similar.
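As a concrete outline, the back-test described above amounts to the following loop. This is an illustrative sketch: the `polls` and `lpd` objects and the two helper functions are hypothetical stand-ins, not the paper's replication code.

```r
# One-week-ahead back-test with a four-week training window (schematic).
# polls: data frame with columns id, date, state; lpd: poll-by-model matrix of
# precomputed one-week-ahead log predictive densities of the K candidate models.
week_starts <- seq(min(polls$date), max(polls$date), by = "week")
elpd_test <- list()
for (i in 5:length(week_starts)) {
  w <- week_starts[i]
  train <- polls[polls$date <  w & polls$date >= w - 28, ]  # last four weeks only
  test  <- polls[polls$date >= w & polls$date <  w + 7, ]
  fit <- fit_hier_stacking(lpd[train$id, ], state = train$state)    # hypothetical helper
  elpd_test[[i]] <- score_stacked(fit, lpd[test$id, ], test$state)  # hypothetical helper
}
```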
Figure 7:
Left: pointwise differences in 7-day running mean log predictive densities on one-week-ahead test data, where we set hierarchical stacking as the benchmark 0. Right: pointwise differences in cumulative average log predictive density by date. The advantage of hierarchical stacking is most noticeable toward the beginning, where there are fewer polls available.
Figure 8:
Log predictive density of the combined model in three small states (in number of state polls: RI, SD, WV) and three swing states (GA, NC, FL). We fix the uncorrelated hierarchical stacking to be a constant zero as a reference. The number in brackets is the total number of polls in that state.

No-pooling stacking has high variance and hence performs the worst out of all combination methods. Hierarchical stacking with the correlated prior performs similarly to the independent approach, with a minor advantage at the beginning of the year, where the prior correlation stabilizes the state weights when data are scarce; later we see this advantage more discernibly in individual states.

Figure 8 shows the cumulative log predictive density of the combined model over time in three small states (in number of state polls) and three swing states, which have large poll counts. With a large number of state polls available, for example close to election day in Florida and North Carolina, no-pooling stacking performs well. With fewer polls, no-pooling stacking is unstable, as can be seen in Rhode Island, South Dakota, West Virginia, and the early part of the Georgia panel. The test-data performance is a random variable, and a method can be lucky for one realization of the test data; here no-pooling stacking and model selection are either too lucky or too unlucky. From Figure 7, this instability induces an overall low predictive density. Hierarchical stacking alleviates this instability while retaining enough flexibility for good performance when more data come in.

Figure 9:
Mean test log predictive densities with 50% and 95% confidence intervals, among subsets of states with few, moderate, and many state polls, and among all states. Correlated hierarchical stacking is set as the reference 0. It is better than independent hierarchical stacking when data are scarce. Complete-pooling stacking is close to hierarchical stacking in small states, but worse in bigger states.
Figure 10:
Hierarchical stacking weights for M_1 in the polling example. Left: weights for M_1 in the 10 states with the fewest polls and the 10 states with the most polls, over time. The dotted line shows the complete-pooling stacking weight and the solid black line is the nationwide mean weight; states with fewer polls are shrunk more toward the mean. Middle: absolute differences between state-wise hierarchical stacking weights and the nationwide mean, against the number of respondents. The blue line is a linear trend reference; states with smaller sample sizes are pooled more toward the mean. Right: absolute differences between hierarchical stacking and no-pooling stacking weights, generally decreasing with bigger sample sizes.

Results for all other states can be found in Appendix D. Furthermore, we divided the states into three categories based on how many state polls were conducted. Figure 9 shows the overall mean pointwise differences in test log predictive densities by category, along with a fourth panel over all states. No-pooling stacking performs the worst in all panels. An explanation could be that we are using a four-week moving window to tackle non-stationarity, which might not contain enough data for the no-pooling method. The variance of no-pooling stacking is amended by the hierarchical approach, which performs on par with complete-pooling stacking when data are scarcer and outperforms it otherwise. We can also observe the advantage of using prior correlations in hierarchical stacking in the first panel.

Figure 10 illustrates how cell size affects pooling. The first panel shows the state-wise hierarchical stacking weights for the first candidate model as a function of date. For early-date forecasts, and for states with few polls, hierarchical stacking weights are pooled more toward the shared nationwide mean. The middle and right panels compare the differences between state-wise hierarchical stacking weights and the nationwide mean, or the no-pooling weights, against the total number of respondents for each state and prediction date. The cells with more observed data are pooled less and are closer to their no-pooling optima, and vice versa.
6. Discussion
The input-varying model averaging is designed to improve both the overall averaged prediction E_{(ỹ,x̃)}(log p(ỹ|x̃)) and the conditional prediction E_{ỹ|x̃=x}(log p(ỹ|x̃)), whereas these two tasks are subject to a trade-off in complete-pooling stacking.

In addition, the partial-pooling prior (9) borrows information from other cells, which stabilizes model weights in small cells where there are not enough data for no-pooling stacking. In a crude mean-field approximation, the likelihood in the discrete model (11) is approximately Π_{j,k} normal(α_jk | α^mode_jk, λ_jk), where

α^mode = argmax_α Σ_{i=1}^n log( Σ_{k=1}^K w_k(x_i) p_{k,-i} )

is the unconstrained no-pooling stacking weight, and

−λ_jk^{−2} = (∂²/∂α_jk²)|_{mode} Σ_{i=1}^n log( Σ_{k=1}^K w_k(x_i) p_{k,-i} )

is the corresponding diagonal element of the Hessian. Because α_jk appears in n_j terms of the summation, λ_jk = O(n_j^{−1/2}) for a given k. Combined with the prior α_jk ∼ normal(μ_k, σ_k), the conditional posterior mean of the k-th model weight in the j-th cell is the usual precision-weighted average of the no-pooling optimum and the shared mean:

α̂_jk | λ_jk, σ_k, μ_k ≈ ( λ_jk^{−2} α^mode_jk + σ_k^{−2} μ_k ) ( λ_jk^{−2} + σ_k^{−2} )^{−1}.

Hence for a given model k, |α^mode_jk − α̂_jk| = O(n_j^{−1}): larger pooling usually occurs in smaller cells. This pooling factor is in line with Figure 10 and with general hierarchical model theory (Gelman and Pardoe, 2006). Our full-Bayesian solution also integrates out μ_k and σ_k, which further partially pools across models.

The possibility of partial pooling across cells encourages open-ended data gathering. In the election polling example, even if a pollster is only interested in the forecast for one state, they could gather polling data from everywhere else, fit multiple models, evaluate the models on each state, and use hierarchical stacking to construct the model average; this is especially applicable when the state of interest does not have enough polls to support a meaningful model evaluation on its own. In this context swing states naturally have more state polls, so the small-area estimation may not be crucial, but in general we conjecture that the hierarchical techniques can be useful for model evaluation and averaging in a more general domain-adaptation setting. Without going into extra details, hierarchical models are as useful for making inferences from a subset of data (small-area estimation) as for generalizing to a new area (extrapolation). When the latter task is the focus, hierarchical stacking only needs to redefine the leave-one-data-out predictive density (6) by leave-one-cell-out:

p_{k,-i} := ∫_{Θ_k} p(y_i | θ_k, x_i, z_i, M_k) p( θ_k | M_k, {(x_{i'}, z_{i'}, y_{i'}) : x_{i'} ≠ x_i} ) dθ_k.

We use hierarchical stacking not only as a tool for optimizing predictions but also as a way to understand problems with fitted models. The fact that hierarchical stacking is being used is already an implicit recognition that we have different models that perform better or worse in different subsets of data, and it can be valuable to explore the conditions under which different models fit poorly, revealing potential problems in the data or data processing, and pointing to directions for individual-model improvement.
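Returning to the precision-weighted pooling approximation above, a toy numeric illustration follows; all numbers are made up for the sketch.

```r
# Precision-weighted pooling of a cell-level stacking weight (toy numbers).
alpha_mode <- 1.2          # no-pooling optimum in cell j
lambda     <- 0.8          # likelihood scale, O(n_j^{-1/2}); large when n_j is small
mu_k <- 0; sigma_k <- 0.5  # hierarchical prior mean and scale
alpha_hat <- (lambda^-2 * alpha_mode + sigma_k^-2 * mu_k) / (lambda^-2 + sigma_k^-2)
alpha_hat  # about 0.34: strongly pooled toward mu_k; with lambda = 0.2 (a larger
           # cell) the same formula gives about 1.03, much closer to alpha_mode
```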
Vehtari et al. (2017) and Gelman et al. (2020) suggested examining the pointwise cross-validated log score log p_{k,-i} as a function of x_i, to see whether there is a pattern or an explanation for why some observations are harder to fit than others. For example, the first panel of Figure 2 seems to indicate that Model 1 is incapable of fitting the rightmost 10–15 non-switchers. However, log p_{k,-i} contains a non-vanishing variance, since y_i is a single realization from p_t(y|x_i). Despite its merit in exploratory data analysis, it is hard to tell from the raw cross-validation scores whether Model 1 is incapable of fitting high-arsenic observations or is merely unlucky on these few points. The hierarchical stacking weight w(x) provides a smoothed summary of how each model fits locally in x, and it comes with built-in Bayesian uncertainty estimation. For example, in Figure 5, log p_{2,-i} − log p_{1,-i} has a slightly inflated right tail, but this small bump is smoothed out by stacking, and the local weight there remains essentially unchanged.

The implication of hierarchical stacking (11) being a formal Bayesian model is that we can evaluate its posterior distribution as with any regular Bayesian model. For example, we can run (approximate) leave-one-out cross-validation of the stacking posterior,

p(w | D_{-i}) ∝ p(w | D) / p(y_i | x_i, w) = p(w | D) / ( Σ_{k=1}^K w_k(x_i) p_{k,-i} ).

In practice, we only need to fit the stacking model (11) once, collect a size-S MCMC sample of stacking parameters from the full posterior p(w | D), denoted {(w_1^s(x_i), ..., w_K^s(x_i))}_{s=1}^S, compute the PSIS-stabilized importance ratio of each draw,

r_i^s ≈ ( Σ_{k=1}^K w_k^s(x_i) p_{k,-i} )^{−1},

and then compute the mean leave-one-out cross-validated log predictive density to evaluate the overall out-of-sample fit of the final stacked model:

Σ_{i=1}^n log ∫_{S_K} p(y_i | x_i, w(x_i)) p(w(x_i) | D_{-i}) d(w(x_i)) ≈ Σ_{i=1}^n log [ Σ_{s=1}^S r_i^s ( Σ_{k=1}^K w_k^s(x_i) p_{k,-i} ) / Σ_{s=1}^S r_i^s ].   (28)

As discussed in Section 4, the same task of out-of-sample prediction evaluation in optimization-based stacking requires double cross-validation (refitting the model n(n − 1) times if using leave-one-out), but it now becomes almost computationally free by post-processing the posterior draws of stacking.
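The computation in (28) can be post-processed from saved draws. Below is a minimal R sketch using the PSIS routine of the loo package; the object names and array shapes are our own assumptions.

```r
library(loo)
# lpd: n x K matrix of p_{k,-i} on the density scale;
# w_draws: S x n x K array of posterior draws of the stacking weights w_k(x_i).
S <- dim(w_draws)[1]; n <- dim(w_draws)[2]
lik <- sapply(1:n, function(i) w_draws[, i, ] %*% lpd[i, ])  # S x n: sum_k w_k^s(x_i) p_{k,-i}
psis_fit <- psis(-log(lik))              # log importance ratios log r_i^s, Pareto-smoothed
r <- weights(psis_fit, log = FALSE)      # normalized ratios; each column sums to 1
sum(log(colSums(r * lik)))               # approximate elpd of the stacked model, eq. (28)
```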
The Bayesian justification above applies to log-score stacking. In general we cannot convert an arbitrary objective function into a log density: its exponential is not necessarily integrable, and even if it is, the resulting density does not necessarily match a relevant model. Take linear regression for example: the ordinary least squares estimate argmin_β Σ_{i=1}^n (y_i − x_i^T β)² is equivalent to the maximum likelihood estimate of β in the generative model y_i | x_i, β, σ ∼ normal(x_i^T β, σ) with flat priors. But the directly adapted "log posterior density" from the negative L2 loss, log p(β|y) = −Σ_{i=1}^n (y_i − x_i^T β)² + C, differs from the Bayesian inference in the latter generative model unless σ ≡ 1. The hierarchical stacking framework may still work for other scoring rules in practice, but we leave their Bayesian justification for future research.
Unlike our previous work (Yao et al., 2018), which merely applies stacking to Bayesian models, the present paper converts optimization-based stacking itself into a formal Bayesian model, analogous to reformulating a least-squares estimate as a normal-error regression. Breiman (2001) distinguished two cultures in statistics: the generative modeling culture assumes the data come from a given stochastic model, whereas algorithmic modeling treats the data mechanism as unknown and advocates black-box learning aimed at predictive accuracy. As a method that Breiman himself introduced (along with Wolpert, 1992), stacking is arguably closer to the algorithmic end of the spectrum, while our hierarchical Bayesian formulation pulls it toward the generative modeling end.

Such a formulation is appealing for two reasons. First, the generative modeling language facilitates flexible data inclusion during model averaging. For example, the election forecast model contains various outcomes on state polls and national polls from several pollsters, together with pollster-level, state-level, and national-level fundamental predictors, as well as prior state-level correlations. It is not clear how methods like bagging or boosting could include all of them: data do not have to conveniently arrive in independent (x_i, y_i) pairs and compliantly await an algorithm to train upon. Second, instead of a static one-size-fits-all algorithm, hierarchical stacking is now part of a statistical workflow (Gelman et al., 2020) that is subject to model criticism and model improvement: we can incorporate other Bayesian shrinkage priors as add-on components without reinventing them, and we can use approximate leave-one-out cross-validation (28) to check the out-of-sample performance of the final stacking model and to compare multiple priors. Looking ahead, the success of this approach suggests that we can use generative Bayesian modeling to improve other black-box prediction algorithms.

Acknowledgments
The authors would like to thank the National Science Foundation, Institute of Education Sciences, Office of Naval Research, National Institutes of Health, Sloan Foundation, and Schmidt Futures for financial support.
References
Abramowitz, A. I. (2008). Forecasting the 2008 presidential election with the time-for-change model. Political Science and Politics, 41:691–695.
Bernardo, J. M. and Smith, A. F. (1994). Bayesian Theory. John Wiley & Sons.
Bhatt, S., Cameron, E., Flaxman, S. R., Weiss, D. J., Smith, D. L., and Gething, P. W. (2017). Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization. Journal of The Royal Society Interface, 14.
Breiman, L. (1996). Stacked regressions. Machine Learning, 24:49–64.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16:199–231.
Bürkner, P.-C., Gabry, J., and Vehtari, A. (2019). Approximate leave-future-out cross-validation for Bayesian time series models. arXiv:1902.06281.
Clarke, B. (2003). Comparing Bayes model averaging and stacking when model approximation error cannot be ignored. Journal of Machine Learning Research, 4:683–712.
Clyde, M. and Iversen, E. S. (2013). Bayesian model averaging in the M-open framework. In Bayesian Theory and Applications, pages 483–498. Oxford University Press.
Fushiki, T. (2020). On the selection of the regularization parameter in stacking. Neural Processing Letters, pages 1–12.
Gelman, A. and Pardoe, I. (2006). Bayesian measures of explained variance and pooling in multilevel (hierarchical) models. Technometrics, 48:241–251.
Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.-C., and Modrák, M. (2020). Bayesian workflow. arXiv:2011.01808.
Ghasemian, A., Hosseinmardi, H., Galstyan, A., Airoldi, E. M., and Clauset, A. (2020). Stacking models for nearly optimal link prediction in complex networks. Proceedings of the National Academy of Sciences, 117:23393–23400.
Heidemanns, M., Gelman, A., and Morris, G. E. (2020). An updated dynamic Bayesian forecasting model for the 2020 election. Harvard Data Science Review.
Heiner, M., Kottas, A., and Munch, S. (2019). Structured priors for sparse probability vectors with application to model selection in Markov chains. Statistics and Computing, 29:1077–1093.
Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, pages 382–401.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79–87.
Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214.
Kamary, K., Mengersen, K., Robert, C. P., and Rousseau, J. (2019). Testing hypotheses via a mixture estimation model. arXiv:1412.2044.
Le, T. and Clarke, B. (2017). A Bayes interpretation of stacking for M-complete and M-open settings. Bayesian Analysis, 12:807–829.
LeBlanc, M. and Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91:1641–1650.
Linzer, D. A. (2013). Dynamic Bayesian forecasting of presidential elections in the states. Journal of the American Statistical Association, 108:124–134.
Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika, 86:301–320.
Neal, R. M. (1998). Regression and classification using Gaussian process priors. In Bernardo, J., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics, volume 6, pages 475–501. Oxford University Press.
Piironen, J. and Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735.
Polson, N. G. and Scott, S. L. (2011). Data augmentation for support vector machines. Bayesian Analysis, 6:1–23.
Reid, S. and Grudic, G. (2009). Regularized linear models in stacked generalization. In International Workshop on Multiple Classifier Systems, pages 112–121.
Şen, M. U. and Erdogan, H. (2013). Linear classifier combination and selection using group sparse regularization and hinge loss. Pattern Recognition Letters, 34:265–274.
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227–244.
Sill, J., Takács, G., Mackey, L., and Lin, D. (2009). Feature-weighted linear stacking. arXiv:0911.0460.
Stan Development Team (2020). Stan Modeling Language Users Guide and Reference Manual. Version 2.25.0, http://mc-stan.org.
Sugiyama, M., Krauledat, M., and Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005.
Sugiyama, M. and Müller, K.-R. (2005). Input-dependent estimation of generalization error under covariate shift. Statistics and Decisions, 23:249–280.
Svensén, M. and Bishop, C. M. (2003). Bayesian hierarchical mixtures of experts. In Uncertainty in Artificial Intelligence.
Tracey, B. D. and Wolpert, D. H. (2016). Reducing the error of Monte Carlo algorithms by learning control variates. In Conference on Neural Information Processing Systems.
Vehtari, A., Gabry, J., Magnusson, M., Yao, Y., Bürkner, P.-C., Paananen, T., and Gelman, A. (2020). loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models. R package version 2.4.1, https://mc-stan.org/loo/.
Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27:1413–1432.
Vehtari, A. and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142–228.
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2019). Pareto smoothed importance sampling. arXiv:1507.02646.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5:241–259.
Yang, Y. and Dunson, D. B. (2014). Minimax optimal Bayesian aggregation. arXiv:1403.1345.
Yao, Y. (2019). Bayesian aggregation. arXiv:1912.11218.
Yao, Y., Vehtari, A., and Gelman, A. (2020). Stacking for non-mixing Bayesian computations: The curse and blessing of multimodal posteriors. arXiv:2006.12335.
Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Using stacking to average Bayesian predictive distributions (with discussion). Bayesian Analysis, 13:917–1007.
Zhang, L. and Zhou, W.-D. (2011). Sparse ensembles using weighted combination methods based on linear programming. Pattern Recognition, 44:97–106.

Appendices

A. Theoretical example
Before the theorem proofs, we first consider a toy example with a closed-form solution to illustrate how Theorems 1–4 apply.

As shown in Figure 11, the true data generating process (DG) is y ∼ uniform(−6, 2), and the two candidate models are

M_1: y ∼ 0.99 uniform(−6, 0) + 0.01 uniform(0, 2),
M_2: y ∼ 0.99 uniform(0, 2) + 0.01 uniform(−6, 0),

which yield piecewise-constant predictive densities

p_1(y) = (0.99/6) 1(y ∈ [−6, 0]) + (0.01/2) 1(y ∈ (0, 2]),
p_2(y) = (0.01/6) 1(y ∈ [−6, 0]) + (0.99/2) 1(y ∈ (0, 2]).

Using our notation in Section 3.2, the region in which M_1 predominates is J_1 = [−6, 0], and M_2 outperforms on J_2 = (0, 2] (the convention sends the tie {0} to M_1). We count their masses with respect to the true DG: Pr(J_1) = 3/4, Pr(J_2) = 1/4. Now consider the weighted density

(w_1 0.99/6 + w_2 0.01/6) 1(y ∈ [−6, 0]) + (w_1 0.01/2 + w_2 0.99/2) 1(y ∈ (0, 2]),

and the population stacking objective

max_{w ∈ S_2} (1/8) ∫_{−6}^{2} log( (w_1 0.99/6 + w_2 0.01/6) 1(y ∈ [−6, 0]) + (w_1 0.01/2 + w_2 0.99/2) 1(y ∈ (0, 2]) ) dy.

The exact optimal weight is w_1 ≈ 0.755 ≈ Pr(J_1) and is irrelevant to the height of each region. For instance, if the right bump in M_2 shrinks to the interval (0, 1) (i.e., y ∼ 0.99 uniform(0, 1) + 0.01 uniform(−6, 0), which only enlarges the constant L in condition (19)), the winner takes all in that region and its winning margin no longer matters.

By contrast, likelihood-based model averaging techniques such as Bayesian model averaging (BMA; Hoeting et al., 1999) and pseudo-Bayesian model averaging (Yao et al., 2018) are analogies of proportional representation: every count of the winning margin matters.

Figure 11:
Theoretical example: the true data is generated from uniform(−6, 2), and there are two models with spikes and slabs on the intervals (−6, 0) and (0, 2), respectively. J_1 and J_2 in Theorem 1 are [−6, 0] and (0, 2], with DG probabilities 3/4 and 1/4. The stacking weights are approximately these two probabilities, and irrelevant to how high the winning margins are.

Figure 12:
Left: pointwise predictive density p(ỹ | M_1 or M_2) when the slab probability δ is chosen as 0.2, 1/3, and 0.45. Right: weight of model 1 in complete-pooling stacking (not defined at δ = 0.5) and pseudo-BMA (sample size n = 1 or 10, not defined at δ = 0 or 1) as a function of the slab probability δ. They evolve in opposite directions. Moreover, stacking weights are more polarized when the models are more similar.
Figure 13:
From left: (1) KL divergence between model 1 or model 2 and the data generating process; (2) KL divergence between model 1 and model 2; (3) the separation constant L; (4) stacking elpd gain compared with the best individual model, as a function of δ; (5) stacking elpd gain as a function of L.

For illustration, we vary the slab probability δ in models 1 and 2:

M_1 | δ: y ∼ (1 − δ) uniform[−6, 0] + δ uniform[0, 2],
M_2 | δ: y ∼ (1 − δ) uniform[0, 2] + δ uniform[−6, 0].

The left column of Figure 12 visualizes the predictive densities of these two models at δ = 0.2, 0.33, and 0.45. As δ increases from 0 to 0.5, the two models become closer and closer to each other, as measured by a smaller KL(M_1, M_2); the δ ∈ (0.5, 1) counterpart is similar, though not exactly symmetric. We compute the stacking weight and the expected pseudo-BMA weight with sample size n,

w_1^{BMA}(n, δ) = ( 1 + exp( n E_{y|δ} log p_2(y) − n E_{y|δ} log p_1(y) ) )^{−1}.

Interestingly, the pseudo-BMA weight w_1^{BMA}(n, δ) is strictly decreasing as a function of δ ∈ (0, 0.5). As δ → 0+, the log predictive density of model 2 on the left region, log(δ/6) → −∞, can be arbitrarily small, and the influence of this bad region dominates the overall performance of model 2. By contrast, the stacking weight of model 2 is monotonically non-increasing on (0, 0.5) (strictly decreasing on (0, 1/4)): stacking only cares that model 1 wins the [−6, 0] interval and does not haggle over how much it wins by.

In addition, when δ = 1/4, M_1 becomes a uniform density on [−6, 2] and coincides with the true DG; for δ ∈ (1/4, 1/2) there is a positive constant L for which the separation condition (19) holds. The last two panels of Figure 13 show the stacking elpd gain (compared with the best individual model) as a function of δ and of L. This constructive example reflects the worst case, for it matches the theoretical lower bound g*(L, K, ρ, ε) = L(1 − ρ)(1 − ε) + ρ log(ρ) + (1 − ρ)(log(1 − ρ) − log(K − 1)) (here with K = 2 and ε = 0) in Theorem 3, which also means the latter bound is tight. When δ ∈ [1/3, 1/2), model 2 predominates only on J_2 = (0, 2], with separation constant ε = 0 and L = log((1 − δ)/δ) ≤ log 2 (the winning margin is maximized at δ = 1/3 in this range).
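A quick numeric check of these closed forms, under the interval reconstruction above; this sketch is ours, not part of the paper's replication code.

```r
# Predictive densities of the two models in the toy example
p1 <- function(y) 0.99/6 * (y >= -6 & y <= 0) + 0.01/2 * (y > 0 & y <= 2)
p2 <- function(y) 0.01/6 * (y >= -6 & y <= 0) + 0.99/2 * (y > 0 & y <= 2)
# Population stacking objective under the true DG, uniform(-6, 2)
elpd <- function(w) {
  f <- function(y) log(w * p1(y) + (1 - w) * p2(y)) / 8
  integrate(f, -6, 0)$value + integrate(f, 0, 2)$value
}
w1 <- optimize(elpd, c(0, 1), maximum = TRUE)$maximum  # about 0.755 ~ Pr(J_1) = 3/4
# Stacking elpd gain over the best single model, versus the Theorem 3 bound g*
gain   <- elpd(w1) - max(elpd(1), elpd(0))
g_star <- log(99) * (1 - 3/4) + 3/4 * log(3/4) + 1/4 * log(1/4)  # L = log 99, rho = 3/4, eps = 0
c(gain = gain, bound = g_star)  # gain ~ 0.60, just above the bound ~ 0.59: the bound is tight
```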
B. Proofs of theorems
Theorem 1.
We call K predictive densities {p(ỹ = · | x̃ = ·, M_k)}_{k=1}^K locally separable with a constant pair L > 0 and 0 < ε < 1 with respect to the true data generating process of (ỹ, x̃), if

Σ_{k=1}^K ∫_{(x̃,ỹ) ∈ J_k} 1( log p(ỹ | x̃, M_k) < log p(ỹ | x̃, M_{k'}) + L for some k' ≠ k ) p_t(ỹ, x̃) dỹ dx̃ ≤ ε.

For a small ε and a large L, the stacking weights that solve (3) are approximately the probabilities of each model being the locally best model,

w_k^{stacking} ≈ w_k^{approx} := Pr(J_k) = ∫_{J_k} p_t(ỹ | x̃) p(x̃) dỹ dx̃,

in the sense that the objective function is nearly optimal:

| elpd(w^{approx}) − elpd(w^{stacking}) | ≤ O(ε + exp(−L)).

Proof.
The expected log predictive density of the weighted prediction Σ_k w_k p_k(· | x̃), as a function of w, is

elpd(w) = ∫_{X×Y} log( Σ_{l=1}^K w_l p_l(ỹ | x̃) ) p_t(ỹ | x̃) p(x̃) dx̃ dỹ
= Σ_{k=1}^K ∫_{J_k} log( w_k p_k(ỹ | x̃) + Σ_{l≠k} w_l p_l(ỹ | x̃) ) p_t(ỹ | x̃) p(x̃) dx̃ dỹ
= Σ_{k=1}^K ∫_{J_k} [ log( w_k p_k(ỹ | x̃) ) + log( 1 + Σ_{l≠k} (w_l p_l(ỹ | x̃)) / (w_k p_k(ỹ | x̃)) ) ] p_t(ỹ | x̃) p(x̃) dx̃ dỹ.

The expression is legitimate for any simplex vector w ∈ S_K that does not contain zeros. We treat zeros later; for now we only consider a dense weight, {w ∈ S_K : w_k > 0, k = 1, ..., K}. Consider a surrogate objective function (the first term in the integral above):

elpd_surrogate(w) = Σ_{k=1}^K ∫_{J_k} log( w_k p_k(ỹ | x̃) ) p_t(ỹ | x̃) p(x̃) dx̃ dỹ
= Σ_{k=1}^K log w_k ∫_{J_k} p_t(ỹ | x̃) p(x̃) dx̃ dỹ + Σ_{k=1}^K ∫_{J_k} log p_k(ỹ | x̃) p_t(ỹ | x̃) p(x̃) dx̃ dỹ
= Σ_{k=1}^K Pr(J_k) log w_k + constant.

Ignoring the constant term (the expected cross-entropy between each conditional prediction and the true DG), maximizing the surrogate objective is equivalent to maximizing elbo(w) := Σ_{k=1}^K Pr(J_k) log w_k, which has a closed-form optimum. By Jensen's inequality,

elbo(w) = Σ_{k=1}^K Pr(J_k) log( w_k / Pr(J_k) ) + Σ_{k=1}^K Pr(J_k) log Pr(J_k)
≤ log( Σ_{k=1}^K Pr(J_k) w_k / Pr(J_k) ) + Σ_{k=1}^K Pr(J_k) log Pr(J_k)
= Σ_{k=1}^K Pr(J_k) log Pr(J_k).

The equality is attained at w_k = Pr(J_k) for all k, which recovers the definition of w^{approx} in Theorem 1.

What remains to be proved is that this surrogate objective is close to the original objective. We divide each set J_k into two disjoint subsets, J_k = J_k° ∪ J_k•, with

J_k° := { (x̃, ỹ) ∈ J_k : log p(ỹ | x̃, M_k) < log p(ỹ | x̃, M_{k'}) + L for some k' ≠ k },
J_k• := { (x̃, ỹ) ∈ J_k : log p(ỹ | x̃, M_k) ≥ log p(ỹ | x̃, M_{k'}) + L for all k' ≠ k }.

The separation condition ensures Σ_k Pr(J_k°) ≤ ε. Let Δ(w) = elpd(w) − elpd_surrogate(w).
For any fixed simplex vector w, this absolute difference is bounded by

|Δ(w)| = | Σ_{k=1}^K ∫_{J_k} log( 1 + Σ_{l≠k} (w_l p_l(ỹ|x̃)) / (w_k p_k(ỹ|x̃)) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ |
≤ Σ_{k=1}^K ( ∫_{J_k°} + ∫_{J_k•} ) log( 1 + Σ_{l≠k} (w_l p_l(ỹ|x̃)) / (w_k p_k(ỹ|x̃)) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ
≤ Σ_{k=1}^K ∫_{J_k°} log( 1 + Σ_{l≠k} w_l / w_k ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ + Σ_{k=1}^K ∫_{J_k•} Σ_{l≠k} (w_l / w_k) (p_l(ỹ|x̃) / p_k(ỹ|x̃)) p_t(ỹ|x̃) p(x̃) dx̃ dỹ
≤ ( Σ_{k=1}^K (1 − w_k) / w_k ) ( ε + exp(−L) ),

where we have used p_l ≤ p_k on J_k and log(1 + u) ≤ u. The exact optimum of the objective function is w^{stacking}. Using the inequality above twice,

0 ≤ elpd(w^{stacking}) − elpd(w^{approx})
≤ | elpd_surrogate(w^{stacking}) − elpd(w^{stacking}) | + | elpd_surrogate(w^{approx}) − elpd(w^{approx}) | + elpd_surrogate(w^{stacking}) − elpd_surrogate(w^{approx})
≤ | Δ(w^{approx}) | + | Δ(w^{stacking}) |
≤ Σ_{k=1}^K ( (1 − w_k^{approx}) / w_k^{approx} + (1 − w_k^{stacking}) / w_k^{stacking} ) ( ε + exp(−L) ).

This almost finishes the proof, except for the simplex edges where w_k^{stacking} or w_k^{approx} attains zero. Without loss of generality, suppose w_1^{approx} = 0 and w_k^{approx} ≠ 0 for all k ≠ 1, which means p(ỹ | x̃, M_1) is almost surely (with respect to p_t(ỹ|x̃) p(x̃)) inferior to some other model. Hence we can remove model 1 from the model list, and the same O(ε + exp(−L)) bound applies to the remaining models 2, ..., K. If there is more than one zero, we repeat this step until all zeros have been removed.

Next, we deal with w_1^{stacking} = 0 and w_k^{stacking} ≠ 0 for all k ≠ 1. If w_1^{approx} = 0 too, the previous paragraph applies. If not, Theorem 2 shows that w_1^{approx} must be a small-order term:

Pr(J_1) ≤ ( 1 + (exp(L) − 1)(1 − ε) + ε )^{−1} < exp(−L) + ε.

We leave the proof of this inequality to Theorem 2. The contribution of the first model to the surrogate objective is at most Pr(J_1) log Pr(J_1), so after we remove the first model from the list, the surrogate elpd changes by at most a small-order term, not affecting the final bound. Because the separation condition with constants (ε, L) applies to models 1, ..., K, and due to the lack of a competition source, the same separation condition applies to models 2, ..., K, and the same bound applies.

Theorem 2. When the separation condition (19) holds, and if the k-th model has zero weight in stacking, w_k^{stacking} = 0, then the probability of its winning region is bounded by

Pr(J_k) ≤ ( 1 + (exp(L) − 1)(1 − ε) + ε )^{−1}.
The right-hand side can be further upper-bounded by exp(−L) + ε.

Proof. Without loss of generality, assume w_1^{stacking} = 0. Let p_{−1}(ỹ | x̃) = Σ_{k=2}^K w_k^{stacking} p_k(ỹ | x̃), and consider the constrained objective elpd~(w) = E( log( w p_1(ỹ | x̃) + (1 − w) p_{−1}(ỹ | x̃) ) ), where the expectation is over both ỹ and x̃ as before. Because the maximum is attained at w = 0 and because log(·) is a concave function, the derivative at any w ∈ [0, 1] satisfies

(d/dw) elpd~(w) = E[ ( p_1(ỹ|x̃) − p_{−1}(ỹ|x̃) ) / ( w p_1(ỹ|x̃) + (1 − w) p_{−1}(ỹ|x̃) ) ] ≤ 0.

That is,

0 ≥ E[ ( p_1(ỹ|x̃) − p_{−1}(ỹ|x̃) ) / p_{−1}(ỹ|x̃) ]
= Pr(J_1) E[ ( p_1 − p_{−1} ) / p_{−1} | J_1 ] + (1 − Pr(J_1)) E[ ( p_1 − p_{−1} ) / p_{−1} | J_1^c ]
≥ Pr(J_1) E[ ( p_1 − p_{−1} ) / p_{−1} | J_1 ] − (1 − Pr(J_1)).

Rearranging this inequality arrives at

1 ≥ Pr(J_1) ( E[ ( p_1 − p_{−1} ) / p_{−1} | J_1 ] + 1 ) ≥ Pr(J_1) ( 1 + (exp(L) − 1)(1 − ε) + ε ).

As a result, a model that has stacking weight zero cannot have a large probability of predominating all other models:

Pr(J_1) ≤ ( 1 + (exp(L) − 1)(1 − ε) + ε )^{−1} < exp(−L) + ε.

Theorem 3.
Let ρ = sup_{1≤k≤K} Pr(J_k), and define two deterministic functions g and g* by

g(L, K, ρ, ε) = L(1 − ρ)(1 − ε) − log K ≤ g*(L, K, ρ, ε) = L(1 − ρ)(1 − ε) + ρ log(ρ) + (1 − ρ)( log(1 − ρ) − log(K − 1) ).

Assuming the separation condition (19) holds for all k = 1, ..., K, the utility gain of stacking over any single model k is lower-bounded by

elpd_stacking − elpd_k ≥ max( g*(L, K, ρ, ε) + O(exp(−L) + ε), 0 ).

Proof. As before, we consider the approximate weights w_k^{approx} = Pr(J_k), and the surrogate objective elpd_surrogate(w) = Σ_k ∫_{J_k} log( w_k p_k(ỹ|x̃) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ. Then

elpd_surrogate(w^{approx}) − elpd_k
= Σ_{l=1}^K ∫_{J_l} log( Pr(J_l) p_l(ỹ|x̃) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ − Σ_{l=1}^K ∫_{J_l} log p_k(ỹ|x̃) p_t(ỹ|x̃) p(x̃) dx̃ dỹ
= Σ_{l=1}^K Pr(J_l) log Pr(J_l) + Σ_{l≠k} ∫_{J_l} ( log p_l(ỹ|x̃) − log p_k(ỹ|x̃) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ
= Σ_{l=1}^K Pr(J_l) log Pr(J_l) + Σ_{l≠k} ( ∫_{J_l°} + ∫_{J_l•} ) ( log p_l(ỹ|x̃) − log p_k(ỹ|x̃) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ
≥ Σ_{l=1}^K Pr(J_l) log Pr(J_l) + (1 − ε)(1 − ρ) L − ε
≥ ρ log ρ + (1 − ρ) log( (1 − ρ)/(K − 1) ) + (1 − ε)(1 − ρ) L − ε
= g*(L, K, ρ, ε) − ε.

The last inequality comes from the fact that the entropy term Σ_l Pr(J_l) log Pr(J_l) attains its minimum at (ρ, (1 − ρ)/(K − 1), ..., (1 − ρ)/(K − 1)), by the convexity of x log x. We may further use the lower bound ρ log ρ + (1 − ρ) log( (1 − ρ)/(K − 1) ) ≥ −log K, which arrives at g(·). Finally, using the proof of Theorem 1,

| elpd_surrogate(w^{approx}) − elpd(w^{stacking}) | ≤ O(exp(−L) + ε),

we get elpd_stacking − elpd_k ≥ g*(L, K, ρ, ε) + O(exp(−L) + ε). Because selection is always a special case of averaging, the utility gain is further bounded below by 0. From the constructive example (Appendix A), g*(·) is a tight bound; we use the looser bound g(·) in the main paper for its simpler form.

Theorem 4.
Under the strong separation assumption

Σ_{k=1}^K ∫_{x̃ ∈ I_k} ∫_{ỹ ∈ Y} 1( log p(ỹ | x̃, M_k) < log p(ỹ | x̃, M_{k'}) + L for some k' ≠ k ) p_t(ỹ | x̃) p(x̃) dỹ dx̃ ≤ ε,

and if the sets {I_k} are known exactly, then we can construct the pointwise selection

p(ỹ | x̃, pointwise selection) = Σ_{k=1}^K 1(x̃ ∈ I_k) p(ỹ | x̃, M_k).

Its utility gain is bounded from below by

elpd_pointwise selection − elpd_stacking ≥ −log ρ_X + O(exp(−L) + ε),

where ρ_X = sup_{1≤k≤K} Pr(x̃ ∈ I_k).

Proof.

elpd_pointwise selection − elpd_surrogate(w^{approx})
= Σ_{l=1}^K ∫_{I_l} ( log p_l(ỹ|x̃) − log( Pr(I_l) p_l(ỹ|x̃) ) ) p_t(ỹ|x̃) p(x̃) dx̃ dỹ
= −Σ_{l=1}^K Pr(I_l) log Pr(I_l)
≥ −Σ_{l=1}^K Pr(I_l) log ρ_X = −log ρ_X.

Finally, from the proof of Theorem 1, | elpd_surrogate(w^{approx}) − elpd(w^{stacking}) | ≤ O(exp(−L) + ε).

We close this section with two remarks. First, Yao (2019) approximates the probabilistic stacking weights under the strong separation condition (23); the result therein can be viewed as a special case of Theorem 4 in the present paper, as Pr(J_k) ≈ Pr(I_k) under assumption (23). Second, most of the proofs only use the concavity of the log scoring rule; therefore, some properties of the stacking weights could be extended to other concave scoring rules, too.
C. Software implementation in Stan
We summarize our formulation of hierarchical stacking in the pseudocode of Algorithm 1.
Algorithm 1:
Hierarchical stacking
Data: y: outcomes; x: inputs on which the stacking weights vary; z: other inputs; p_{k,-i}: approximate leave-one-out predictive density of the k-th model at the i-th data point.
Result: input-dependent stacking weights ŵ(x): X → S_K; the combined model.
1. Sample from the joint posterior density p(α, μ, σ | D) in the hierarchical stacking model (11).
2. Compute the posterior mean of w_k(x̃) at any x̃, and make predictions p(ỹ | x̃, z̃) by (12).

To code the basic additive model, we prepare the input covariates X = (X_discrete, X_continuous), where X_discrete contains the discrete dummy variables and X_continuous the remaining features (already rectified as in (16)). The dimensions of the two parts are d_discrete and d_continuous. Here we use the "grouped hierarchical priors" (Section 6.3) with only two groups, distinguishing between discrete and continuous variables:

w(x) = softmax( w_1*(x), ..., w_K*(x) ),
w_k*(x) = Σ_{m=1}^{d_discrete + d_continuous} α_{mk} f_m(x) + μ_k, k ≤ K − 1; w_K*(x) = 0,
α_{mk} | σ_{k,1} ∼ normal(0, σ_{k,1}), k = 1, ..., K − 1, m = 1, ..., d_discrete,
α_{mk} | σ_{k,2} ∼ normal(0, σ_{k,2}), k = 1, ..., K − 1, m = d_discrete + 1, ..., d_discrete + d_continuous,
μ_k ∼ normal(μ_0, τ_μ), σ_{k,1} ∼ normal⁺(0, τ_{σ,1}), σ_{k,2} ∼ normal⁺(0, τ_{σ,2}), k = 1, ..., K − 1.

This model is encoded by the following
Stan (Stan Development Team, 2020) program, sketched below. In practice, we first fit the K candidate models in Stan, extract their leave-one-out likelihoods {p_{k,-i}} by efficient approximation with the R package loo (Vehtari et al., 2020), and finally run hierarchical stacking as a regular Bayesian model in Stan, for example through rstan.
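The original listings are not recoverable from this copy of the paper; the following is a minimal sketch consistent with the model displayed above. The variable names, data layout, hyperprior scales (set to 1 for illustration), and the R driver are our own assumptions.

```stan
// hier_stacking.stan: grouped-prior hierarchical stacking (sketch)
data {
  int<lower=1> n;                     // number of data points
  int<lower=2> K;                     // number of candidate models
  int<lower=0> d_dis;                 // number of discrete dummy inputs
  int<lower=0> d_con;                 // number of rectified continuous inputs
  matrix[n, d_dis + d_con] X;         // discrete dummies first, continuous after
  matrix[n, K] lpd_point;             // log p_{k,-i} from the K individual fits
  real mu_0;                          // hyperprior location and scales
  real<lower=0> tau_mu;
  real<lower=0> tau_sigma_dis;
  real<lower=0> tau_sigma_con;
}
parameters {
  matrix[d_dis + d_con, K - 1] alpha;
  vector[K - 1] mu;
  vector<lower=0>[K - 1] sigma_dis;
  vector<lower=0>[K - 1] sigma_con;
}
transformed parameters {
  // unconstrained weights w*_k(x_i); the K-th model is the softmax baseline
  matrix[n, K] w_star = append_col(X * alpha + rep_matrix(mu', n), rep_vector(0, n));
}
model {
  mu ~ normal(mu_0, tau_mu);
  sigma_dis ~ normal(0, tau_sigma_dis);   // half-normal via the <lower=0> constraint
  sigma_con ~ normal(0, tau_sigma_con);
  for (k in 1:(K - 1)) {
    if (d_dis > 0) alpha[1:d_dis, k] ~ normal(0, sigma_dis[k]);
    if (d_con > 0) alpha[(d_dis + 1):(d_dis + d_con), k] ~ normal(0, sigma_con[k]);
  }
  for (i in 1:n)                          // sum_i log sum_k w_k(x_i) p_{k,-i}
    target += log_sum_exp(log_softmax(w_star[i]') + lpd_point[i]');
}
```

A corresponding R driver, with `fits` a hypothetical list of K stanfit objects that each expose a `log_lik` output:

```r
library(loo)
library(rstan)
lpd_point <- sapply(fits, function(f)
  loo(extract_log_lik(f))$pointwise[, "elpd_loo"])   # approximate log p_{k,-i}
stacking_fit <- stan("hier_stacking.stan",
                     data = list(n = nrow(X), K = ncol(lpd_point),
                                 d_dis = d_dis, d_con = d_con, X = X,
                                 lpd_point = lpd_point, mu_0 = 0, tau_mu = 1,
                                 tau_sigma_dis = 1, tau_sigma_con = 1))
```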
D. Experiment details
The replication code for the experiments is available at https://github.com/yao-yl/hierarchical-stacking-code.

Vehtari et al. (2017) and Gelman et al. (2020) used the same pointwise pattern (first panel of Figure 2) in our well-switching example to demonstrate the heterogeneity of model fit. The input contains both continuous covariates x_con ∈ R^D and a categorical covariate x_cat. As per the previous discussion (16), we convert each continuous input x_con,j into two parts, x⁺_con,j := (x_con,j − median(x_con,j))⁺ and x⁻_con,j := (x_con,j − median(x_con,j))⁻. We then model the unconstrained weight by a linear regression,

α_k(x) = Σ_{j=1}^D ( β_{2j−1,k} x⁺_con,j + β_{2j,k} x⁻_con,j ) + z_k[x_cat], k = 1, ..., K − 1; α_K(x) = 0,   (29)

and place default priors on the parameters and hyperparameters: z_k[j] ∼ normal(μ_k, σ_k), with weakly informative normal priors on the β coefficients and μ_k and a half-normal prior on σ_k.
The Gaussian process regression experiment uses the same setting as Yao et al. (2020), who use this example to explain the benefit of complete-pooling stacking. The original training data of Neal (1998) can be found in the file odata.txt. In the first experiment, we use the first half as training data; in the second experiment, we simulate data with varying sample sizes according to his data generating process. For the hyperparameter optimization, we found the two modes by running the GP hyperparameter optimization from two different initializations of (log ρ, log a, log σ).
In the election-polling example, we conduct a back-test for one-week-ahead forecasts. For example, if there are 20 polls between Aug 1 and Aug 7, we first fit each model on the data prior to Aug 1 and forecast each of the 20 polls in that week; next, we move on to forecasting for the week between Aug 8 and Aug 14. We use this step-wise approach both for fitting the candidate models and for stacking.

We evaluate all combining methods on the same data, so we can compare them pointwise by selecting a reference model (in our case the proposed hierarchical stacking) and setting it to zero in all visualizations. For each combination method and each poll i, we compute the pointwise difference in elpd, elpd_diff_i^{M_j} = elpd_i^{M_j} − elpd_i^{M_ref}, where M_j is the j-th method and M_ref is the reference method. We then report the mean of these differences over all polls in the test data, elpd_diff^{M_j} = (1/N) Σ_i elpd_diff_i^{M_j}, where N is the total number of polls.

Additionally, we are interested in how the methods perform over time, as there are few polls available in the early days of the election year and their number continuously increases toward election day; this results in noisier observations at the beginning. To suitably evaluate the combining methods, we compute the cumulative mean elpd at each day d, elpd*_d^{M_j} = (1/N_d) Σ_{i: t_i ≤ d} elpd_i^{M_j}, where N_d is the number of polls conducted prior to or on day d. We then compute the pointwise differences between the cumulative mean elpds of each method and the reference method, elpd_diff*_d^{M_j} = elpd*_d^{M_j} − elpd*_d^{M_ref}. To get the elpd of a state, we take the average of all elpds in that state, for example elpd_NY = (1/N_NY) Σ_{i ∈ A_NY} elpd_i, where N_NY is the number of polls conducted in New York and A_NY is the set of indexes of polls in New York.
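These quantities are simple to compute from a poll-by-method score matrix; the sketch below is illustrative, with our own object and column names.

```r
# elpd: N x M matrix of per-poll log scores, one column per combination method,
# rows sorted by poll date; "hier" is the assumed column name of the reference.
elpd_diff <- sweep(elpd, 1, elpd[, "hier"])   # elpd_diff_i^{M_j} = elpd_i^{M_j} - elpd_i^{ref}
colMeans(elpd_diff)                           # mean pointwise difference over all polls
# cumulative mean differences by date, as plotted in Figures 7 and 8:
cum_diff <- apply(elpd_diff, 2, function(v) cumsum(v) / seq_along(v))
```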
Figure 14: In the election forecast, the hierarchical stacking weight of model 1 (the fundamentals model) across time and states. Apart from how the shrinkage factor is related to the training sample size in each cell (Figure 10), this local stacking weight is also informative for understanding which states are more sensitive to the macro fundamentals.

Figure 15: Pointwise differences in mean cumulative test log predictive density by state (higher is better), for all states.