Bayesian Optimization of Hyperparameters when the Marginal Likelihood is Estimated by MCMC
Oskar Gustafsson∗, Mattias Villani†, and Pär Stockhammar‡

April 22, 2020
Abstract
Bayesian models often involve a small set of hyperparameters determined by maximizing the marginal likelihood. Bayesian optimization is a popular iterative method where a Gaussian process posterior of the underlying function is sequentially updated by new function evaluations. An acquisition strategy uses this posterior distribution to decide where to place the next function evaluation. We propose a novel Bayesian optimization framework for situations where the user controls the computational effort, and therefore the precision of the function evaluations. This is a common situation in econometrics where the marginal likelihood is often computed by Markov Chain Monte Carlo (MCMC) methods, with the precision of the marginal likelihood estimate determined by the number of MCMC draws. The proposed acquisition strategy gives the optimizer the option to explore the function with cheap noisy evaluations and therefore finds the optimum faster. Prior hyperparameter estimation in the steady-state Bayesian vector autoregressive (BVAR) model on US macroeconomic time series data is used for illustration. The proposed method is shown to find the optimum much quicker than traditional Bayesian optimization or grid search.
Keywords:
Acquisition strategy, Optimized precision, Steady-state BVAR, US example.

∗ Department of Statistics, Stockholm University, e-mail address: [email protected].
† Department of Statistics, Stockholm University and Department of Computer and Information Science, Linköping University, e-mail: [email protected].
‡ Department of Statistics, Stockholm University and Sveriges Riksbank, e-mail: [email protected].

Introduction
The trend in econometrics is to use increasingly more flexible models that can give a richer description of the economy under investigation. As the model complexity increases, the estimation problems get more involved and computationally costly MCMC methods are often used to sample from the posterior distribution.

Most models involve a relatively small set of hyperparameters that needs to be chosen by the user. As an example, consider the steady-state BVAR model of Villani (2009), which is widely used among practitioners and professional forecasters (see e.g. Gustafsson et al. (2016) and Stockhammar and Österholm (2017)) and is used in Section 4 for illustration. The choice of prior distribution in BVARs is often reduced to the selection of a small set of prior hyperparameters. Some of these hyperparameters can be specified subjectively by experts; for example, the steady state is usually given rather informative subjective priors. Other prior hyperparameters control the smoothness/shrinkage properties of the model and are less easy to specify subjectively. Giannone et al. (2015) proposed to treat these hard-to-specify prior hyperparameters as unknown parameters and explore the joint posterior of the hyperparameters, the VAR dynamics, and the shock covariance matrix. This is a statistically elegant approach but can be computationally costly, and most practitioners seem to prefer to fix the hyperparameters before estimating the other model parameters. Carriero et al. (2012) propose a brute force approach where the marginal likelihood is evaluated over a grid. This is very computationally demanding, and the vast majority of applications instead use conventional values for the hyperparameters, dating back to Doan et al. (1984).
However, the conventional values were found to be optimal on a specific historical dataset and are likely to be suboptimal for other datasets. Hence, there is a real practical demand for a fast method for optimizing the marginal likelihood over a set of hyperparameters. However, the marginal likelihood is rarely available in closed form. The BVARs with conjugate priors considered in Carriero et al. (2012) are an exception, but already the steady-state VAR needs MCMC methods to evaluate the marginal likelihood. This makes the optimization problem challenging since every evaluation of the marginal likelihood requires a full MCMC run.

Bayesian optimization (BO) is an iterative optimization technique originating from machine learning. BO is particularly suitable for optimization of costly noisy functions in small to moderate dimensional parameter spaces (Brochu et al., 2010; Snoek et al., 2012) and is therefore well suited for marginal likelihood optimization. The method treats the underlying function as an unknown object that can be inferred by Bayesian inference by evaluating the function at a finite number of points. A Gaussian process prior expresses the Bayesian prior beliefs about the underlying function, often just containing the information that the function is believed to have a certain degree of smoothness. Bayes' theorem is then used to sequentially update the Gaussian process posterior after each new function evaluation. Bayesian optimization uses the most recently updated posterior of the function to decide where to optimally place the next function evaluation. This so-called acquisition strategy is a trade-off between: i) exploiting the available knowledge about the function to improve the current maximum and ii) exploring the function to reduce the posterior uncertainty.

Our paper proposes a framework for Bayesian optimization for the setting where the user can control the precision and computational cost of each function evaluation.
The typical scenario that we have in mind is when the marginal likelihood is computed by MCMC. This is a very common situation in econometrics using, for example, the estimators in Chib (1995), Chib and Jeliazkov (2001) and Geweke (1999). The precision of the marginal likelihood estimate at each evaluation point is then chosen by the user via the number of MCMC iterations. This makes it possible to use occasional cheap noisy evaluations of the marginal likelihood to quickly explore the marginal likelihood over the hyperparameter space during the optimization. Our proposed acquisition strategy can be seen as jointly deciding where to place the new evaluation and how much computational effort to spend in obtaining the estimate. We implement this strategy by a stopping rule for the MCMC sampling combined with an auxiliary prediction model for the computational effort at any new evaluation point; the auxiliary prediction model is learned during the course of the optimization.

We apply the method to the steady-state BVAR and demonstrate that the new acquisition strategy finds the optimal hyperparameters faster than traditionally used acquisition functions. It is also substantially faster than a grid search and finds a better optimum.

The outline of the paper is as follows. Section 2 introduces the hyperparameter estimation problem and presents Chib's marginal likelihood estimator from Gibbs sampling. Section 3 gives the necessary background on Gaussian processes and Bayesian optimization and introduces our new Bayesian optimization framework. Section 4 assesses the performance of the proposed algorithm in empirical examples. The final section concludes, and the appendix gives implementation details.
Hyperparameters in Bayesian econometric models can have a large effect on empirical results and have to be selected with care. The method proposed here is generally applicable and will be presented in full generality, but we will consider selection of hyperparameters in the popular class of Bayesian vector autoregressive models (BVARs) as our running example.
Consider the standard BVAR model

    y_t = ∑_{k=1}^{K} Π_k y_{t−k} + ε_t,

where {ε_t}_{t=1}^{T} are iid N(0, Σ). A simplified version of the Minnesota prior (see e.g. Karlsson (2013)) without cross-equation shrinkage is of the form (Π, Σ) ∼ MNIW(Π_0, Ω_Π, S, ν), with

    Σ ∼ IW(S, ν),    vec(Π′) | Σ ∼ N(vec(Π_0′), Σ ⊗ Ω_Π),

where vec(Π_0) and Ω_Π denote the prior mean and covariance of the coefficient matrix, and S is the prior scale matrix with prior degrees of freedom ν. The diagonal elements of Ω_Π are given by

    ω_ii = (λ_1 / (l^{λ_3} s_r))²,  for lag l of variable r, i = (l − 1)p + r,

where λ_1 controls the overall shrinkage and λ_3 the lag-decay shrinkage set by the user, and s_r denotes the estimated standard deviation of variable r. The fact that we do not use the additional cross-equation shrinkage hyperparameter, λ_2, makes this prior conjugate to the VAR likelihood, a fact that will be important in the following. It has been common practice to use standard values that date back to Doan et al. (1984), but there has been a renewed interest in finding values that are optimal for the given application (see e.g. Giannone et al. (2015), Carriero et al. (2012), and Bańbura et al. (2010)). Two main approaches have been proposed. First, Giannone et al. (2015) proposed to sample from the joint posterior distribution using the decomposition

    p(β, θ | y_T) = p(β | θ, y_T) p(θ | y_T),

where p(θ | y_T) is the marginal posterior distribution of the hyperparameters. The algorithm samples from p(θ | y_T) using Metropolis-Hastings (MH) and then samples directly from p(β | θ, y_T) for each θ draw by drawing Π and Σ from the Normal-Inverse Wishart distribution. There are some limitations to using this approach. First, p(θ | y_T) can be multimodal and it can be hard to find a good MH proposal density, making the sampling time-consuming.
Second, it has been our experience that practitioners view hyperparameter selection similarly to model selection, and want to determine a fixed value for θ once and for all early in the model building process.

Carriero et al. (2012) instead propose an exhaustive grid search to find the θ that maximizes p(θ | y_T) and then use that optimal θ throughout the remaining analysis. The obvious drawback here is that a grid search is very costly, especially with more than a couple of hyperparameters.

A problem with both the approach in Giannone et al. (2015) and Carriero et al. (2012) is that p(θ | y_T) is often not available in closed form. This is true e.g. for the Minnesota prior with cross-equation shrinkage since the prior is no longer conjugate, and is also true for the steady-state BVAR (Villani, 2009). However, MCMC can often be used to get a noisy estimate of p(θ | y_T) at any θ, usually at a sizeable computational cost.

A common approach to select hyperparameters is to maximize the marginal likelihood, p(y_T | θ) = ∫ p(y_T | θ, β) p(β | θ) dβ, which is equivalent to maximizing the posterior distribution of the hyperparameters under a flat hyper-prior, as is noted in both Carriero et al. (2012) and Giannone et al. (2015).

Neither the posterior of the hyperparameters nor the marginal likelihood is tractable for most models. Chib (1995) proposes an accurate way of computing a simulation-consistent estimate of the marginal likelihood when the posterior can be obtained via Gibbs sampling, which is the case for many econometric models. Chib and Jeliazkov (2001) extend Chib's estimator to when the posterior is simulated from with MH. Chib's (1995) estimator is based on the following identity, obtained by inverting Bayes' theorem:

    m(y) = p(y | θ) p(θ) / p(θ | y).

Consider, for example, the steady-state BVAR model, which can be sampled using a three-block Gibbs sampler, see Villani (2009).
The joint posterior of the three-block model can be factorized, and evaluated in a point β* as

    p_θ(β*_1, β*_2, β*_3 | y) = p_θ(β*_1 | β*_2, β*_3, y) p_θ(β*_2 | β*_3, y) p_θ(β*_3 | y).

Now, p_θ(β*_1 | β*_2, β*_3, y) is a full conditional distribution, w.r.t. the hyperparameters in θ, which we have available in closed form, and

    p_θ(β*_3 | y) = ∫ p_θ(β*_3 | β_1, β_2, y) p_θ(β_1, β_2 | y) dβ_1 dβ_2

can be estimated by G^{−1} ∑_{i=1}^{G} p_θ(β*_3 | β_1^{(i)}, β_2^{(i)}, y) using the MCMC draws {β_1^{(i)}, β_2^{(i)}}_{i=1}^{G}. Further, p_θ(β*_2 | β*_3, y) can easily be obtained by running a reduced version of the same Gibbs sampler, but this time we fix β_3 to β*_3 in every Gibbs iteration. We can then estimate p_θ(β*_2 | β*_3, y) = ∫ p_θ(β*_2 | β_1, β*_3, y) p_θ(β_1 | β*_3, y) dβ_1 by G^{−1} ∑_{i=1}^{G} p_θ(β*_2 | β̃_1^{(i)}, β*_3, y), where {β̃_1^{(i)}}_{i=1}^{G} are draws from the reduced Gibbs sampler. Chib (1995) also derives asymptotic standard errors for the estimator. For more details regarding the Chib estimator, see Appendix A.

Figure 1: Illustration of two Gaussian processes with squared exponential kernel with different length scales and the same variance σ_f². The figure shows the prior mean (dashed line) and 95% probability intervals (shaded) and five realizations from each process.
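As a concrete illustration of the identity and the Rao-Blackwell averaging above, the following Python sketch verifies Chib's identity exactly in a toy conjugate normal model and implements the log-scale averaging of a full conditional over Gibbs draws. The toy model and function names are ours, for illustration only; they are not the paper's implementation.

```python
import math

def log_norm_pdf(x, m, v):
    """Log density of N(m, v) at x."""
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def log_ordinate(log_full_cond, draws):
    """Rao-Blackwell estimate of a posterior ordinate such as
    p(b3*|y) ~= G^{-1} sum_i p(b3* | b1^(i), b2^(i), y),
    averaging the full conditional over Gibbs draws of the other blocks.
    The average is computed on the log scale for numerical stability."""
    logs = [log_full_cond(d) for d in draws]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs) / len(logs))

# Chib's identity m(y) = p(y|theta) p(theta) / p(theta|y) holds exactly at
# any evaluation point; toy check with y ~ N(mu, 1) and prior mu ~ N(0, 1),
# for which the posterior is N(y/2, 1/2) and the marginal is N(y; 0, 2):
y, mu_star = 0.7, 0.3
log_m = (log_norm_pdf(y, mu_star, 1.0)           # log p(y | mu*)
         + log_norm_pdf(mu_star, 0.0, 1.0)       # log p(mu*)
         - log_norm_pdf(mu_star, 0.5 * y, 0.5))  # log p(mu* | y)
```

In practice the ordinate in the denominator is the only intractable piece, and it is exactly the quantity the Gibbs-draw average above estimates.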
Since Bayesian optimization is a relatively unknown method in econometrics, we give an introduction here to Gaussian processes and their use in Bayesian optimization.

A Gaussian process (GP) is a (possibly infinite) collection of random variables such that any subset is jointly distributed according to a multivariate normal distribution, see e.g. Williams and Rasmussen (2006). A Gaussian process, denoted by f(x) ∼ GP(m(x), k(x, x′)), can be seen as a probability distribution over functions f : X → R that is completely specified by its mean function, m(x) ≡ E f(x), and its covariance function, C(f(x), f(x′)) ≡ k(x, x′), where x and x′ are two arbitrary input values to f(·). Note that the covariance function specifies the covariance between any two function values, f(x) and f(x′). A popular covariance function is the squared exponential (SE):

    k(x, x′) = σ_f² exp( −|x − x′|² / (2ℓ²) ),

where |x − x′| is the Euclidean distance between the two points; the covariance function is specified by its two kernel hyperparameters, the scale parameter σ_f > 0 and the length scale ℓ > 0. The scale parameter σ_f governs the variability of the function and the length scale determines how fast the correlation between two function values tapers off with the distance |x − x′|, see Figure 1. The covariance function k(x, x′) can be used to compute the covariance matrix for any subset of function values. The fact that any finite sampling of function values {f(x_n) for x_n ∈ X}_{n=1}^{N} constitutes a multivariate normal distribution on R^N allows for the convenient conditioning and marginalization properties of the multivariate normal distribution. Decisions regarding the kernel functions are important when working with GPs, and there are a few standard alternatives to choose from.
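The SE covariance and prior realizations of the kind shown in Figure 1 can be reproduced in a few lines of Python. This is a sketch under the common parameterization with 2ℓ² in the denominator; the helper names are ours.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f, ell):
    """Squared exponential covariance matrix between two sets of 1-D inputs:
    k(x, x') = sigma_f^2 * exp(-|x - x'|^2 / (2 ell^2))."""
    d2 = np.subtract.outer(x1, x2) ** 2
    return sigma_f ** 2 * np.exp(-d2 / (2.0 * ell ** 2))

# Draw realizations from the GP prior: a small length scale gives wiggly
# functions, a large one gives slowly varying functions (cf. Figure 1).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
K = se_kernel(x, x, sigma_f=1.0, ell=0.2) + 1e-10 * np.eye(len(x))  # jitter
paths = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
```

The small diagonal jitter keeps the covariance matrix numerically positive definite when sampling.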
An increasingly popular alternative to the squared exponential kernel is the Matérn kernel, see e.g. Matérn (1960) and Williams and Rasmussen (2006). The Matérn kernel has an additional hyperparameter, ν > 0, in addition to the length scale ℓ and scale σ_f, such that the process is k times mean square differentiable if ν > k. Hence, ν controls the smoothness of the process and it can be shown that the Matérn kernel approaches the SE kernel as ν → ∞ (Williams and Rasmussen, 2006). The SE kernel is therefore considered too smooth for many applications. Our approach is directly applicable for any valid kernel function, but we will use the popular Matérn ν = 5/2 kernel in our applications:

    k_{ν=5/2}(r) = σ_f² ( 1 + √5 r/ℓ + 5r²/(3ℓ²) ) exp( −√5 r/ℓ ),

where r = |x − x′|. The Matérn 5/2 has two continuous derivatives, which is often a requirement for Newton-type optimizers, and is often recommended for Bayesian optimization (Snoek et al., 2012). To find the optimal values of σ_f and ℓ, the method of MacDonald et al. (2015) is used.

Bayesian optimization is distinct from other optimization methods in that it constructs a probabilistic model for f, which it then uses to solve the optimization problem. The main use of a probabilistic model on f is that we can quantify the uncertainty about the function in different regions of the function space in order to select optimal future evaluation points. The utility functions used to select these points are referred to as acquisition functions in the Bayesian optimization literature (see e.g.
Brochu et al., 2010 and Snoek et al., 2012) and are generally denoted by a(x). An intuitively sensible acquisition rule is to select a new evaluation point that maximizes the probability of obtaining a higher function value than the current maximum, f_max, i.e. the Probability of Improvement (PI):

    PI(x) ≡ P(f(x) > f_max) = Φ( (m(x) − f_max) / s(x) ),

where f_max is the maximum value of the function obtained so far, m(x) and s(x) are the posterior mean and standard deviation of f in the point x, conditional on the available function evaluations, and Φ denotes the cumulative standard normal distribution function. The PI criterion will choose the point which is most probable to give an improvement, but does so without regard to the size of the improvement. For this reason, the Expected Improvement (EI) is usually preferred:

    EI(x) = (m(x) − f_max) Φ( (m(x) − f_max) / s(x) ) + s(x) φ( (m(x) − f_max) / s(x) ),   (1)

where φ denotes the density function of the standard normal distribution. We can see from (1) that the first part is associated with the magnitude of our predicted improvement and the second part is related to the uncertainty about the function in that area. Thus, the EI provides a deterministic way to select a new point where the decision incorporates the trade-off between high expected improvement (exploitation) and learning more about the underlying function (exploration). This is illustrated in Figure 2, where the black line shows the true objective function, the blue line denotes the posterior mean of the GP, and the pink-shaded regions show 95% posterior probability bands for f. The red dashed lines indicate the position of the current maximum, red (small) dots show historical function evaluations, and the green (large) dot denotes the current evaluation. The lower part shows the EI acquisition function evaluated for each x.
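Both acquisition functions depend only on the GP posterior mean and standard deviation at the candidate point, so they are cheap to evaluate. A minimal standard-library sketch (the helper is ours, not from the paper's code):

```python
from math import erf, exp, pi, sqrt

def pi_ei(m, s, f_max):
    """Probability of Improvement and Expected Improvement (1) at a point
    with GP posterior mean m and standard deviation s."""
    if s <= 0.0:
        return float(m > f_max), max(m - f_max, 0.0)
    u = (m - f_max) / s
    Phi = 0.5 * (1.0 + erf(u / sqrt(2.0)))    # standard normal cdf
    phi = exp(-0.5 * u * u) / sqrt(2.0 * pi)  # standard normal pdf
    return Phi, (m - f_max) * Phi + s * phi
```

Note that at m = f_max we get PI = 1/2 and EI = s·φ(0), so EI still rewards uncertain points, which is exactly the exploration term in (1).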
As we start in the top-left corner, we can see that the algorithm starts to climb towards the maximum (upper part), and the EI acquisition function (lower part) indicates that there is a high expected improvement from moving further to the right. After further improvements (top-right plot followed by bottom-left plot), the algorithm has found an x close to the maximum such that further expected improvement is low in the middle x-region. The EI strategy then suggests that it is worth exploring the endpoints where there is still high uncertainty. In the final (lower-right) figure, the uncertainty at the endpoints is removed, and the algorithm will continue a more thorough search in the region close to the maximum until it is stopped.

Figure 2: Bayesian optimization illustrated with the regular expected improvements acquisition strategy.

However, acquisition rules like PI or EI do not take into account that different evaluation points can be more or less costly. To introduce the notion of cost into the acquisition strategy, Snoek et al. (2012) proposed Expected Improvement per second, EIS(x) ≡ EI(x)/c(x), where c : X → R+ is a duration function that measures the evaluation time at input x in seconds. More generally, we can define a(x)/c(x) as an effort-aware acquisition function. The duration function is typically unknown, and Snoek et al. (2012) proposed to estimate it alongside f using an additional Gaussian process for log c(x).

EIS assumes that the duration (or the cost) of a function evaluation is unknown, but fixed for a given input x; once we visit x, the cost of the function estimate f̂(x) is given. However, the user can often choose the duration spent to obtain a certain precision in the estimate, for example by increasing the number of MCMC iterations when the marginal likelihood is estimated by MCMC.
This perspective opens up for strategies that not only optimize for the next evaluation point, but also optimize over the computational resources, or equivalently, the precision of the estimate f̂(x). We formally extend BO by modeling the function evaluations with a heteroscedastic GP

    f̂(x) = f(x) + ε,  ε ∼ N(0, σ²(G)),   (2)
    f ∼ GP(µ(x), k(x, x′)),

where the noise variance σ²(G) is now an explicit function of the number of MCMC iterations, G, or some other duration measure. Hence the user can now choose both where to place the next evaluation and the effort spent in computing it by maximizing

    ã(x, G) ≡ a(x)/G,

with respect to both x and G, where a(x) is a baseline acquisition function, for example EI. A complication with the maximization of ã(x, G) is that while we typically know that σ(G) = O(1/√G) in Monte Carlo or MCMC, the exact numerical standard error depends on the integrated autocorrelation time (IACT) of the MCMC chain. Note that the evaluation points can, for example, be hyperparameters in the prior, where different values can give rise to varying degrees of well-behaved posteriors, so we cannot expect the IACT to be constant over the hyperparameter space. Rather than maximizing ã(x, G) with respect to both x and G directly, we propose to implement the algorithm in an alternative way that achieves a similar effect. The idea is to stop the evaluation early whenever the function evaluation turns out to be hopelessly low, with a low probability of improvement over the current f_max. For a given x we let G increase, in batches of a fixed size, until

    PI(x) ≡ Φ( (m̂^{(g)}(x) − f_max) / s^{(g)}(x) ) < α,

for some small value α, or until G reaches a predetermined upper bound, Ḡ, where m̂^{(g)}(x) and s^{(g)}(x) denote the posterior mean and standard deviation of the GP evaluated at x in the g:th MCMC iteration.
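The early stopping rule can be sketched as follows, with the MCMC run and the GP update left as user-supplied placeholders (hypothetical callables, not the paper's implementation; the default batch size and cap are illustrative):

```python
from math import erf, sqrt

def evaluate_with_early_stopping(extend_chain, posterior_at_x, f_max,
                                 alpha=0.01, batch=200, g_max=10_000):
    """Grow the MCMC sample for a fixed x in batches until either
    PI(x) < alpha (almost certainly no improvement on f_max) or the draw
    budget g_max is reached. `extend_chain(n)` runs n more MCMC iterations;
    `posterior_at_x(g)` returns the GP posterior mean and sd of f at x
    given the g draws so far. Returns the total number of draws used."""
    g = 0
    while g < g_max:
        extend_chain(batch)
        g += batch
        m, s = posterior_at_x(g)
        pi = 0.5 * (1.0 + erf((m - f_max) / (s * sqrt(2.0))))  # Phi(u)
        if pi < alpha:
            break  # stop early: a hopelessly low evaluation
    return g
```

A hopeless point is abandoned after one cheap batch, while a promising point is refined up to the full budget; the realized g at each x is exactly the quantity the auxiliary model below is asked to predict.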
Note that both the posterior mean m(x) and standard deviation s(x) are functions of the noise variance, which in turn is a function of G. The posterior distribution for f(x) is hence continuously updated as G grows, until 1 − α of the posterior mass of f(x) is concentrated below f_max, at which point the evaluation stops. The optimization approach is insensitive to the choice of α, as long as it is a relatively small number. We now propose to maximize the following acquisition function based on early stopping:

    ã_α(x) = a(x)/Ĝ_α(x),   (3)

where Ĝ_α(x) is a prediction of the number of MCMC draws needed at x before the evaluation stops, with the probability α as the threshold for stopping. We emphasize that early stopping is here used in a subtle way, not only as a simple rule to short-circuit useless computations, but also in the planning of future computations; the mere possibility of early stopping can make the algorithm try an x which does not have the highest a(x), but which is expected to be cheap and is therefore worth a try. This effect, which comes via σ²(G), is not present in the EIS of Snoek et al. (2012), where the cost is fixed and is not influenced by the probability model on f.

Although one can use any model to predict G, we will here fit a GP regression model to the logarithm of the number of MCMC draws, log G_j for j = 1, ..., J, in the J previous evaluations:

    log G_j = h(z_j) + ε_j,  ε_j iid ∼ N(0, ψ²),
    h ∼ GP(m_G(z), k_G(z, z′)),   (4)

where z_j is a vector of covariates. The hyperparameters, x_j, themselves may be used as predictors of Ĝ(x), but D(x) = m̂(x) − f_max and s(x) are also likely to have predictive power for G, as well as u(x) = (m̂(x) − f_max)/s(x). We will use z_j = (x_j, D^{(j)}(x_j), s^{(j)}(x_j), u^{(j)}(x_j)) in our applications, where the superscript j denotes the BO iteration.
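A minimal stand-in for the auxiliary effort model can illustrate how the prediction enters the acquisition function: here a ridge-regularized linear regression on log G replaces the GP in (4) (an assumption of this sketch, chosen for brevity), with the point prediction taken as exp of the fitted mean, i.e. the median of a log-normal.

```python
import numpy as np

def fit_log_g(Z, log_g, ridge=1e-8):
    """Fit the auxiliary effort model on past evaluations: a ridge-regularized
    linear regression of log G_j on the covariates z_j (one row of Z per
    evaluation). A simple stand-in for the GP regression in (4)."""
    Zb = np.column_stack([np.ones(len(Z)), Z])  # add intercept
    return np.linalg.solve(Zb.T @ Zb + ridge * np.eye(Zb.shape[1]),
                           Zb.T @ np.asarray(log_g, dtype=float))

def predict_g(beta, z_new):
    """Point prediction G_hat = exp(m(z_new)), a log-normal median."""
    return float(np.exp(np.concatenate(([1.0], z_new)) @ beta))
```

The BOOP acquisition at a candidate x is then simply a(x) / predict_g(beta, z(x)), with a(x) a baseline acquisition such as EI.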
Note that the prediction for G is taken to be Ĝ = exp(m_G(z)) in our case, which corresponds to the median of a log-normal distribution.

Algorithm 1
Bayesian Optimization with Optimized Precision (BOOP)

input
• an estimator f̂(x) of the function to be maximized, and its standard error function σ(G);
• j_0 initial points x_{1:j_0} ≡ (x_1, ..., x_{j_0}), a vector of corresponding function evaluations f̂(x_{1:j_0}), and standard errors σ(G_{1:j_0});
• a baseline acquisition function a(x), and early stopping threshold probability α.

initialization Estimate the function f(x) together with the standard error σ(G) at the j_0 initial points x_{1:j_0} ≡ (x_1, ..., x_{j_0}).

for j from j_0 + 1 until convergence do:
a) Estimate the heteroscedastic GP for f based on past evaluations
       f̂(x_{1:j−1}) = f(x_{1:j−1}) + ε,  ε ∼ N(0, Σ_{j−1}),
       f(x) ∼ GP(m(x), k(x, x′)),
   where Σ_{j−1} ≡ Diag(σ²(G_1), ..., σ²(G_{j−1})).
b) Estimate the GP for log G based on past evaluations
       log G_{1:j−1} = h(z_{1:j−1}) + ε,  ε ∼ N(0, ψ²I),
       h(z) ∼ GP(m_G(z), k_G(z, z′)),
   where the elements of z are functions of x. Return the point prediction Ĝ_α(x).
c) Optimize the acquisition function ã_α(x) = a(x)/Ĝ_α(x) to select the next point, x_j.
d) Compute f̂(x_j) and σ(G_j) by early stopping at threshold probability α.
e) Update the datasets in a) with (x_j, f̂(x_j), σ(G_j)) and in b) with (z_j, log G_j).

We will use the term
Bayesian Optimization with Optimized Precision (BOOP) for BO methods that optimize ã_α(x) in (3), and more specifically BOOP-EI when EI is used as the baseline acquisition function, a(x). The whole procedure is described in Algorithm 1. The specific Chib estimator of the log marginal likelihood used in the applications in Section 4 is detailed in Algorithm 2 in the appendix; the implementation of the early stopping step is straightforward in its application.

Note that (2) assumes that f̂(x) is an unbiased estimator at any x and for any MCMC sample size. We performed a small simulation exercise showing that the Chib estimator is approximately unbiased after a few iterations, see Figure 7 in Appendix A. See also Section 5 for some ideas on extending the current methods to biased estimators.

Figure 3 illustrates the early stopping part of BOOP in a toy example. The upper left graph illustrates the situation at the first BOOP iteration. The black lines show the true unknown function. The three red dots denote initial evaluations and the large green dot is the fourth evaluation, obtained with N = 4 MCMC iterations. The inferred Gaussian process posterior for f based on these four evaluations is plotted as the mean m̂(x) (blue line) and 95% posterior intervals (pink shaded area). We can see that the 95% posterior interval at the current x includes f_max^{(1)} (dotted red line), the highest function value observed so far, and it is therefore worthwhile to increase the number of MCMC iterations for this x. Moving one graph to the right, we see that after N = 6 MCMC iterations the 95% posterior interval still includes f_max^{(1)}, and we move one more graph to the right for N = 8 iterations. Here we conclude that the sampled point is almost certainly not an improvement and we move on to a new evaluation point.
The new evaluation point is found by maximizing the BOOP-EI acquisition function in (3), with the updated effort prediction function Ĝ(z) in Equation (4), and is depicted by the green dot in the leftmost graph in the second row of Figure 3. Following the progress in the second row, we see that it takes only N = 6 samples to conclude that the function value is almost certainly lower than the current maximum at the second BO iteration. Finally, in the third row, we can see that the point is sampled with high variance at the beginning, but as we increase N it becomes clear that this x is indeed an improvement.
Figure 3: Illustration of BOOP-EI implemented with early stopping.
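Step a) of Algorithm 1 amounts to standard GP regression with a fixed, evaluation-specific noise variance on the diagonal. A self-contained sketch using the Matérn 5/2 kernel of Section 3, with one-dimensional inputs for simplicity (the function names are ours):

```python
import numpy as np

def matern52(x1, x2, sigma_f=1.0, ell=1.0):
    """Matern nu=5/2 covariance between two sets of 1-D inputs:
    k(r) = sigma_f^2 (1 + sqrt(5)r/ell + 5r^2/(3 ell^2)) exp(-sqrt(5)r/ell)."""
    r = np.abs(np.subtract.outer(x1, x2))
    a = np.sqrt(5.0) * r / ell
    return sigma_f ** 2 * (1.0 + a + a ** 2 / 3.0) * np.exp(-a)

def gp_posterior(x_obs, y_obs, noise_var, x_new, kernel=matern52):
    """Heteroscedastic GP posterior mean and sd at x_new, with the
    per-evaluation noise variances sigma^2(G_j) on the diagonal,
    as in step a) of Algorithm 1."""
    K = kernel(x_obs, x_obs) + np.diag(noise_var)
    Ks = kernel(x_obs, x_new)
    L = np.linalg.cholesky(K + 1e-10 * np.eye(len(x_obs)))  # jitter
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = np.maximum(kernel(x_new, x_new).diagonal() - (v * v).sum(axis=0), 0.0)
    return mean, np.sqrt(var)
```

Cheap, noisy evaluations carry a large σ²(G_j) and therefore pull the posterior less than precise ones, which is what lets BOOP mix evaluation qualities in one surrogate model.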
In this section we apply the tools described in the previous sections to the steady-state BVAR of Villani (2009). This model has been widely used among practitioners and forecasters of macroeconomic variables, see e.g. Gustafsson et al. (2016) and the references therein. Giannone et al. (2015) show that finding the right values for the hyperparameters in BVARs can significantly improve forecasting performance. Moreover, Bańbura et al. (2010) show that different degrees of shrinkage (controlled by the hyperparameters) are necessary under different model specifications.
The steady-state BVAR model of Villani (2009) is given by

    Π(L)(y_t − Ψx_t) = ε_t,  where ε_t iid ∼ N(0, Σ),

and E[y_t] = Ψx_t. In particular, if we assume that x_t = 1 ∀t, then Ψ is the overall mean of the process. We take the prior distribution to be

    p(Σ) ∝ |Σ|^{−(n+1)/2},
    vec(Π) ∼ N(θ_Π, Ω_Π),
    Ψ ∼ N(θ_Ψ, Ω_Ψ),

where θ_Ψ and Ω_Ψ are the mean and covariance matrix for the steady states, which we set informatively according to Table 1. The prior mean for the lag dynamics, θ_Π, is explained in the coming subsection and the prior covariance matrix for the dynamics, Ω_Π, is constructed using

    ω_ii = (λ_1 / l^{λ_3})²,  for own lag l of variable r, i = (l − 1)n + r,
    ω_ii = (λ_1 λ_2 s_r / (l^{λ_3} s_j))²,  for cross-lag l of variable r ≠ j, i = (l − 1)n + j,

where ω_ii are the diagonal elements of Ω_Π. We also assume prior independence, following Villani (2009). The hyperparameters that we optimize over are the overall shrinkage parameter λ_1, the cross-lag shrinkage λ_2, and the lag-decay parameter λ_3.

The steady-state BVAR is non-linear in the parameters, but the posterior distribution of the model parameters can be sampled with a simple Gibbs sampling scheme (Villani, 2009). The marginal likelihood, together with its empirical standard error, can be estimated by the method in Chib (1995), as explained in Section 2.2 and detailed in Appendix A.

Table 1 describes the data used in our applications, which are also used in Giannone et al. (2015). It contains 23 macroeconomic variables, from which two subsets are selected to represent a medium-sized model with 7 variables and a large model that contains 22 of the variables (real investment is excluded). Before the analysis, the consumer price index and the five-year bond rate were transformed from monthly to quarterly frequency. All series are transformed such that they become stationary according to the augmented Dickey-Fuller test. This is necessary for the data to be consistent with the prior assumption of a steady state.
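The prior covariance construction above can be sketched in a few lines of Python. The index convention i = (l−1)n + j within each equation is an assumption of this sketch and should be checked against Villani (2009); the function name is ours.

```python
import numpy as np

def omega_pi_diag(lam1, lam2, lam3, s, n_lags):
    """Diagonal of the prior covariance Omega_Pi, one row per equation r:
    own lags get (lam1 / l**lam3)**2 and cross-lags get
    (lam1 * lam2 * s_r / (l**lam3 * s_j))**2, with column index
    i = (l-1)*n + j for lag l and right-hand-side variable j."""
    n = len(s)
    omega = np.empty((n, n_lags * n))
    for r in range(n):                 # equation r
        for l in range(1, n_lags + 1):  # lag l
            for j in range(n):          # variable j at lag l
                i = (l - 1) * n + j
                if j == r:
                    omega[r, i] = (lam1 / l ** lam3) ** 2
                else:
                    omega[r, i] = (lam1 * lam2 * s[r] /
                                   (l ** lam3 * s[j])) ** 2
    return omega
```

Smaller λ_2 shrinks cross-lags harder than own lags, and larger λ_3 shrinks longer lags more aggressively, which is exactly the role the optimizer exploits when searching over (λ_1, λ_2, λ_3).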
The number of lags is chosen according to the HQ criterion (Hannan and Quinn, 1979; Quinn, 1980). This resulted in p = 2 lags for both the medium-sized and the large model.

We set the prior mean of the coefficient matrix, Π_0, to values that reflect some persistence on the first lag, but also that all the time series are stationary. E.g. the prior mean on the first lag of the FED interest rate and the GDP deflator is set to 0.6, while others are set to zero in the medium-sized model. Lags longer than 1 and cross-lags all have zero prior means. The priors for the steady states are set informatively to the values listed in Table 1; these values follow suggestions from the literature for most variables, see e.g. Louzis (2019) and Österholm (2012). There were a few variables where we could not find theoretical values for either the mean or the standard deviation; in those cases, we set them close to their empirical counterparts.

Table 1: Data description.

Variable                                       | Mnemonic (FRED) | Transform      | Medium | Freq | Prior
Real GDP                                       | GDPCI           | 400 x diff-log | x      | Q    | (2.5;3.5)
GDP deflator                                   | GDPCTPI         | 400 x diff-log | x      | Q    | (1.5;2.5)
FED funds rate                                 | FEDFUNDS        | -              | x      | Q    | (4.3;5.2)
Consumer price index                           | CPIAUCSL        | 400 x diff-log |        | M    | (1.5;2.5)
Commodity prices                               | PPIACO          | 400 x diff-log |        | Q    | (1.5;2.5)
Industrial production                          | INDPRO          | 400 x diff-log |        | Q    | (2.3;3.7)
Employment                                     | PAYEMS          | 400 x diff-log |        | Q    | (2.5;3.5)
Employment, service                            | SRVPRD          | 400 x diff-log |        | Q    | (1.5;2.5)
Real consumption                               | PCECC96         | 400 x diff-log | x      | Q    | (2.5;3.5)
Real investment                                | GDPIC1          | 400 x diff-log | x      | Q    | (2.3;3.7)
Real residential investment                    | PRFIx           | 400 x diff-log |        | Q    | (1.5;4.5)
Non-residential investment                     | PNFIx           | 400 x diff-log |        | Q    | (1.5;4.5)
Personal consumption expenditure, price index  | PCECTPI         | 400 diff-log   | x      | Q    | (1.5;4.5)
Gross private domestic investment, price index | GPDICTPI        | 400 x diff-log |        | Q    | (1.5;4.5)
Capacity utilization                           | TCU             | -              |        | Q    | (79.3;80.7)
Consumer expectations                          | UMCSENTx        | diff           |        | Q    | (-0.5;0.5)
Hours Worked                                   | HOANBS          | 400 x diff-log | x      | Q    | (2.5;3.5)
Real compensation/hour                         | AHETPIx         | 400 x diff-log | x      | Q    | (1.5;2.5)
One year bond rate                             | GS1             | diff           |        | Q    | (-0.5;0.5)
Five year bond rate                            | GS5             | diff           |        | M    | (-0.5;0.5)
SP 500                                         | S&P 500         | 400 x diff-log |        | Q    | (-2;2)
Effective exchange rate                        | TWEXMMTH        | 400 x diff-log |        | Q    | (-1;1)
M2                                             | M2REAL          | 400 x diff-log |        | Q    | (5.5;6.5)

The table shows the 23 US macroeconomic time series from the FRED database used in the empirical analysis. The column named Prior contains the steady-state mean ± one standard deviation.

We consider three competing optimization strategies: (I) an exhaustive grid search, (II) Bayesian optimization with the EI acquisition function (BO-EI), and (III) our BOOP-EI algorithm. In each approach, we use the restrictions λ_1 ∈ (0, ), λ_2 ∈ (0, ), and λ_3 ∈ (0, ). In the grid search, λ_1 and λ_2 move in steps of 0.05 and λ_3 moves in steps of 0. , yielding in total 10000 marginal likelihood evaluations. For the Bayesian optimization algorithms, we set the number of evaluations to 150, and we use two random draws as initial values for the GPs.

For (I) and (II) we use in total 10000 Gibbs iterations with 2500 as a burn-in sample in each model evaluation. For (III) we draw 3000 Gibbs samples, of which we burn 2500 and use the rest to calculate the probability of improvement PI; by doing this we ensure that the estimated marginal likelihood will be approximately unbiased (see Figure 7 in Appendix A). If PI < α we accept the draw with relatively large standard errors; otherwise we generate a new batch (of size 200) of Gibbs samples and again check the criterion. As a consequence, the total number of Gibbs iterations will vary between 3000 and 10000 in each of the 150 BO iterations. The application is robust to the choice of α, as long as it is a reasonably low number; in this study we use α = 0. . This means that the probability of improvement should be very low before we stop the MCMC run. We run methods (II) and (III) ten times and report the results in Figure 6 and Table 2.

For comparison, we will also use the standard values of the hyperparameters used in e.g. the BEAR toolbox, Dieppe et al. (2016), λ_1 = 0. , λ_2 = 0.
, and λ = 1 , as a benchmark.The methods are compared with respect to i) how well the optimized hyperparametersmaximizes the marginal likelihood, and ii) how much computational resources that has tobe spent in the optimization. Table 2: Optimization Results Medium Steady-State BVAR.Standard BO-EI BOOP-EI GridLog ML − . − . − . − . Gibbs iterations - . ∗ Avg. iter to 90% - ∗ -Model evaluations -
150 150 10 λ . .
30 0 .
27 0 . λ . .
38 0 .
43 0 . λ .
69 0 .
76 0 . The table compares different methods for hyperparameter optimization in the medium-sized steady-stateBVAR. Each method is run 10 times and the reported hyperparameters for each method are the best onesover the 10 runs. The reported duration measures are averages taken over all runs. The third row in thetable show the number of iterations it takes (on average) to close 90% of the gap between the log ML fromstandard values and the maximum. The marginal likelihood of the selected models were re-estimated using100,000 Gibbs iterations with 40,000 as a burn-in.
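The precision-controlled evaluation step of BOOP described above can be sketched as follows. The defaults mirror the numbers in the text (3 000 initial draws, batches of 200, a cap of 10 000), but the helper run_batch, its interface, and the default alpha are illustrative assumptions (the paper only requires α to be small):

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def evaluate_with_optimized_precision(run_batch, best_so_far, alpha=0.01,
                                      init_draws=3000, batch=200,
                                      max_draws=10000):
    """One BOOP function evaluation with early stopping.

    run_batch(k) is assumed to return (estimate, std_error) of the log
    marginal likelihood based on k cumulative Gibbs draws. While the
    probability of improving on the incumbent best is at least alpha,
    more draws are added to sharpen the estimate; otherwise we stop
    early and accept a cheap, noisy evaluation.
    """
    draws = init_draws
    est, se = run_batch(draws)
    while draws < max_draws:
        prob_improve = normal_cdf((est - best_so_far) / se)
        if prob_improve < alpha:   # unlikely to beat the incumbent: stop
            break
        draws += batch
        est, se = run_batch(draws)
    return est, se, draws
```

A clearly poor hyperparameter setting is thus rejected after the initial cheap batch, while a promising one is refined up to the full budget, which is where the savings in Table 2 come from.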
Table 2 summarizes the results for the medium-sized BVAR model. We see that all three optimization strategies find hyperparameters that yield substantially higher log marginal likelihood than the standard values. We can also see that both Bayesian optimization methods yield hyperparameters as good as the grid search at only a small fraction of the computational cost. It is also clear from Table 2 that a substantial amount of computation can be saved via our new acquisition strategy. It is interesting to note that the values for λ1 and λ2 are similar for all three optimization approaches, but that λ3 differs to some extent. This is due to the flatness of the log marginal likelihood in that region.

Figure 4 displays the log marginal likelihood surfaces over the grid of (λ1, λ2)-values used in the grid search. Each subgraph is for a fixed value of λ3 ∈ {0.5, 1, 2, 5}. The red dot indicates the maximum log marginal likelihood for the given λ3, and the green dot indicates the standard values. In all four subgraphs, we can see that the standard values are located outside the high-density regions, relatively far from the maximum (except when the lag-decay shrinkage parameter, λ3, is 0.5). A comparison of Figures 4 and 5 shows that the GP's predicted log marginal likelihood surface is quite accurate already after merely 150 evaluations; this is quite impressive considering that Bayesian optimization tries to find the maximum in the fastest way, and does not aim for high precision in low-density areas.

Figure 4: Log marginal likelihood surfaces over a fine grid of (λ1, λ2)-values. The different panels are: (a) λ3 = 0.5, (b) λ3 = 1, (c) λ3 = 2, (d) λ3 = 5. The red dot denotes the maximum log marginal likelihood value for the given λ3, and the green dot indicates the standard values.
Figure 5: GP predictions of the hyperparameter surfaces in Figure 4 based on 150 evaluations.

Figure 6 shows that BOOP-EI finds higher values of the log marginal likelihood much faster than plain BO with EI acquisitions. From Table 2 we can see that BOOP-EI uses less than a third of the MCMC iterations compared to BO-EI for a full run. To close 90% of the gap between the standard values and the maximum, BOOP-EI takes about half the time. It is interesting to note that BO-EI leads to (on average) a higher number of improvements on the way to the maximum, while BOOP-EI gives fewer improvements but of larger magnitude; the strategy of cheaply exploring new territory before locally optimizing the function pays off.

We also optimize the hyperparameters of the more challenging large BVAR model containing the 22 time series, using 200 iterations for both BO-EI and BOOP-EI. A complete grid search is too costly here, so we instead compare with the hyperparameters obtained from the grid search for the medium-sized BVAR in Section 4.4, which is a realistic strategy in practical work.

Figure 6: Comparison of the convergence speed of the Bayesian optimization methods.

Table 3: Optimization Results, Large Steady-State BVAR.

                Standard   BOOP-EI   BO-EI   Grid (Medium)
Log ML             −.        −.       −.        −.
Sd log ML         0.54      0.28     0.42       0.
λ1                0.1       0.52     0.58       0.
λ2                0.5       0.11     0.08       0.
λ3                1         0.79     1.72       0.

Hyperparameter optimization in the large steady-state BVAR. The column named Grid contains the values obtained from the grid search for the medium-sized model. The marginal likelihood of the selected models was re-estimated using 100,000 Gibbs iterations with 40,000 as burn-in.
Table 3 shows that our method again finds hyperparameters with substantially larger log ML than the standard values, and also much better values than those from the grid search on the medium-sized BVAR. We can also see that the standard values are worse than those of the grid search for the medium-sized model. Finally, note that the hyperparameters selected by BOOP-EI in the large BVAR are quite different from those in the medium-sized model. The optimal λ1 applies less baseline shrinkage than before, but the lag decay (λ3) is higher and, in particular, the cross-lag shrinkage λ2 is much closer to zero, implying much harder shrinkage towards univariate AR processes. This latter result strongly suggests that the computationally attractive conjugate prior structure is a highly sub-optimal solution, since such a prior requires that λ2 = 1.

Concluding remarks
We propose a new Bayesian optimization method for finding optimal hyperparameters in econometric models. The method can be used to optimize any noisy function where the precision is under the control of the user. We focus on the common situation of maximizing a marginal likelihood evaluated by MCMC, where the precision is determined by the number of MCMC iterations. The ability to choose the precision makes it possible for the algorithm to take occasional cheap and noisy evaluations to explore the marginal likelihood surface, thereby finding the optimum faster.

We assess the performance of the new algorithm by optimizing the prior hyperparameters of the extensively used steady-state BVAR model in both a medium-sized and a large VAR. The method is shown to be practical and competitive with other approaches in that it finds the optimum using a substantially smaller computational budget, and it has the potential of becoming part of the standard toolkit for BVARs. We have focused on optimizing the marginal likelihood, but the method is directly applicable to other utility functions, e.g. the popular log predictive score (Geweke and Keane, 2007; Villani et al., 2012).

Our approach builds on the assumption that the noisy estimates of the log marginal likelihood are approximately unbiased, which we show is a reasonable assumption in our case. The unbiasedness of the log marginal likelihood estimate will, however, depend on the combination of MCMC sampler and marginal likelihood estimator; see Adolfson et al. (2007) for some evidence. It would therefore be interesting to extend the method to cases with biased evaluations. One such example in econometrics is Dynamic Stochastic General Equilibrium (DSGE) models (An and Schorfheide, 2007), which are usually analyzed by random walk Metropolis sampling with the marginal likelihood estimated by the modified harmonic mean estimator (Geweke, 1999).
The marginal likelihood estimates are typically very persistent over MCMC iterations in models with many parameters and therefore approach the true marginal likelihood only slowly. Since the marginal likelihood trajectory over MCMC iterations is very smooth, one can try to predict its evolution and then correct the bias in the marginal likelihood estimates.

Sequential Monte Carlo (SMC) gives unbiased but noisy estimates of the likelihood, with precision determined by the number of particles. Extending our method to SMC is non-trivial, however, since the number of particles cannot be increased adaptively in the way we increase the MCMC iterations in our algorithm, at least when particle resampling is used. We are currently working on an alternative algorithm in the same spirit as the current proposal which circumvents this complication.

References
Adolfson, M., J. Lindé, and M. Villani (2007). Bayesian analysis of DSGE models: some comments. Econometric Reviews 26(2-4), 173–185.

An, S. and F. Schorfheide (2007). Bayesian analysis of DSGE models. Econometric Reviews 26(2-4), 113–172.

Bańbura, M., D. Giannone, and L. Reichlin (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics 25(1), 71–92.

Brochu, E., V. M. Cora, and N. De Freitas (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

Carriero, A., G. Kapetanios, and M. Marcellino (2012). Forecasting government bond yields with large Bayesian vector autoregressions. Journal of Banking & Finance 36(7), 2026–2047.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90(432), 1313–1321.

Chib, S. and I. Jeliazkov (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association 96(453), 270–281.

Dieppe, A., R. Legrand, and B. Van Roye (2016). The BEAR toolbox.

Doan, T., R. Litterman, and C. Sims (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews 3(1), 1–100.

Geweke, J. (1999). Using simulation methods for Bayesian econometric models: inference, development, and communication. Econometric Reviews 18(1), 1–73.

Geweke, J. and M. Keane (2007). Smoothly mixing regressions. Journal of Econometrics 138(1), 252–290.

Giannone, D., M. Lenza, and G. E. Primiceri (2015). Prior selection for vector autoregressions. Review of Economics and Statistics 97(2), 436–451.

Gustafsson, P., P. Stockhammar, and P. Österholm (2016). Macroeconomic effects of a decline in housing prices in Sweden. Journal of Policy Modeling 38(2), 242–255.

Hannan, E. J. and B. G. Quinn (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society: Series B (Methodological) 41(2), 190–195.

Karlsson, S. (2013). Forecasting with Bayesian vector autoregression. In Handbook of Economic Forecasting, Volume 2, pp. 791–897. Elsevier.

Louzis, D. P. (2019). Steady-state modeling and macroeconomic forecasting quality. Journal of Applied Econometrics 34(2), 285–314.

MacDonald, B., P. Ranjan, H. Chipman, et al. (2015). GPfit: An R package for fitting a Gaussian process model to deterministic simulator outputs. Journal of Statistical Software 64(i12).

Matérn, B. (1960). Spatial variation. Medd. fr. St. Skogsf. Inst. 49(5). Also appeared as number 36 of Lecture Notes in Statistics.

Newey, W. K. and K. D. West (1987). Hypothesis testing with efficient method of moments estimation. International Economic Review, 777–787.

Österholm, P. (2012). The limited usefulness of macroeconomic Bayesian VARs when estimating the probability of a US recession. Journal of Macroeconomics 34(1), 76–86.

Quinn, B. G. (1980). Order determination for a multivariate autoregression. Journal of the Royal Statistical Society: Series B (Methodological) 42(2), 182–185.

Snoek, J., H. Larochelle, and R. P. Adams (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.

Stockhammar, P. and P. Österholm (2017). The impact of US uncertainty shocks on small open economies. Open Economies Review 28(2), 347–368.

Villani, M. (2009). Steady-state priors for vector autoregressions. Journal of Applied Econometrics 24(4), 630–650.

Villani, M., R. Kohn, and D. J. Nott (2012). Generalized smooth finite mixtures. Journal of Econometrics 171(2), 121–133.

Williams, C. K. and C. E. Rasmussen (2006). Gaussian Processes for Machine Learning, Volume 2. MIT Press, Cambridge, MA.
A Implementation details for Chib's marginal likelihood estimator
The marginal likelihood estimator for Gibbs sampling proposed by Chib (1995) has been shown to be very accurate. We briefly present the estimator here for the case with three blocks in the Gibbs sampler, which is the case used in our BVAR application. The idea is that the posterior distribution of the three-block model can be factorized as

p(θ1, θ2, θ3 | y) = p(θ3 | θ1, θ2, y) · p(θ2 | θ1, y) · p(θ1 | y),

with parts (a), (b) and (c) from left to right. Part (a) is a full conditional which is available in closed form. Part (c) can be estimated at a point θ1* by Monte Carlo integration directly from the Gibbs output,

p̂(θ1* | y) = ∫ p(θ1* | θ2, θ3, y) p(θ2, θ3 | y) dθ2 dθ3 ≈ (1/G) Σ_{g=1}^{G} p(θ1* | θ2^(g), θ3^(g), y).

The only part that is not immediately accessible is part (b), but it can be obtained by running a reduced version of the first Gibbs sampler. Practically, the only thing we do is fix θ1 at θ1* and skip this updating step in the reduced Gibbs sampler. Monte Carlo integration can again be used to obtain

p̂(θ2* | θ1*, y) = ∫ p(θ2* | θ1*, θ3, y) p(θ3 | θ1*, y) dθ3 ≈ (1/G) Σ_{g=1}^{G} p(θ2* | θ1*, θ3^(g), y).

When all three parts (a, b and c) have been obtained, they are put together with the prior and likelihood to form an estimate of the log marginal likelihood,

ln m̂(y) = ln p(y | θ*) + ln p(θ*) − ln p̂(θ1*, θ2*, θ3* | y),

where

p̂(θ1*, θ2*, θ3* | y) = p(θ3* | θ1*, θ2*, y) · p̂(θ2* | θ1*, y) · p̂(θ1* | y).

Chib (1995) also provides a method to compute asymptotic standard errors, which is explained here for the three-block case. After running the full and reduced Gibbs samplers we have the vector process

h^(g) = ( h1(θ1* | y), h2(θ2* | θ1*, y) )' = ( p(θ1* | θ2^(g), θ3^(g), y), p(θ2* | θ1*, θ3^(g), y) )'.
We can now calculate the covariance matrix for h, where it is important to account for autocorrelation in the MCMC draws. Note that, due to the use of two separate Gibbs samplers, h1 and h2 should be approximately independent (see Chib, 1995), but the vector notation is kept for convenience. We should also note that the empirical standard errors refer to repeated experiments using the same θ* every time. Using the vector notation, we have

ĥ ≡ (1/G) Σ_{g=1}^{G} h^(g) = ( p̂(θ1* | y), p̂(θ2* | θ1*, y) )'.

Our objective is to find the variance of a function of ĥ, namely

ψ = ln ĥ1 + ln ĥ2 ≡ ln p̂(θ1* | y) + ln p̂(θ2* | θ1*, y).

The strategy is to first find the variance of ĥ and then use the delta method to find the variance of ψ. Since ĥ inherits ergodicity from the Gibbs chains, we have ĥ → μ as G → ∞, with

μ = ( p(θ1* | y), p(θ2* | θ1*, y) )'.

Further,

lim_{G→∞} G · E[(ĥ − μ)(ĥ − μ)'] = 2π S(0),

where S(0) denotes the spectral density matrix at frequency zero. An estimate of Ω ≡ 2π S(0) can be obtained by the approach of Newey and West (1987). With

Ω_s = (1/G) Σ_{g=s+1}^{G} (h^(g) − ĥ)(h^(g−s) − ĥ)',

we get

var(ĥ) = (1/G) [ Ω_0 + Σ_{s=1}^{q} (1 − s/(q+1)) (Ω_s + Ω_s') ],

where q is a positive integer determining how many autocovariances to account for. Chib (1995) suggests (conservatively) using q = 10, while we instead fit a vector autoregression to h and select q as the lag length that minimizes AIC. The variance we are really after is the one for ψ, and it is obtained by the delta method through

var(ψ) = (∂ψ/∂ĥ)' var(ĥ) (∂ψ/∂ĥ),

where the gradient consists of the elements ĥ1^{-1} and ĥ2^{-1}.
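The Newey-West variance and the delta-method step can be sketched in plain Python. The function below takes the per-draw vectors h^(g) and returns the standard error of ψ; the function name and the list-based interface are illustrative assumptions:

```python
import math

def chib_log_ml_se(h_draws, q):
    """Delta-method standard error of psi = ln h1_bar + ln h2_bar.

    h_draws: list of (h1, h2) pairs from the full and reduced Gibbs runs;
    q: number of autocovariance lags in the Newey-West estimator.
    """
    G = len(h_draws)
    dim = 2
    hbar = [sum(h[i] for h in h_draws) / G for i in range(dim)]

    def omega(s):
        # Omega_s = (1/G) * sum_{g=s+1}^{G} (h_g - hbar)(h_{g-s} - hbar)'
        m = [[0.0] * dim for _ in range(dim)]
        for g in range(s, G):
            a = [h_draws[g][i] - hbar[i] for i in range(dim)]
            b = [h_draws[g - s][i] - hbar[i] for i in range(dim)]
            for i in range(dim):
                for j in range(dim):
                    m[i][j] += a[i] * b[j] / G
        return m

    # var(hbar) = (1/G) [Omega_0 + sum_s (1 - s/(q+1)) (Omega_s + Omega_s')]
    V = omega(0)
    for s in range(1, q + 1):
        w = 1.0 - s / (q + 1.0)
        Os = omega(s)
        for i in range(dim):
            for j in range(dim):
                V[i][j] += w * (Os[i][j] + Os[j][i])
    for i in range(dim):
        for j in range(dim):
            V[i][j] /= G

    # delta method: gradient of psi is (1/h1_bar, 1/h2_bar)
    grad = [1.0 / hbar[0], 1.0 / hbar[1]]
    var_psi = sum(grad[i] * V[i][j] * grad[j]
                  for i in range(dim) for j in range(dim))
    return math.sqrt(var_psi)
```

For q = 0 this reduces to the iid formula grad' (S/G) grad, with S the sample covariance of the h draws; the Bartlett weights only matter when the Gibbs output is autocorrelated.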
To get the empirical standard errors we simply take the square root of this expression. Our method assumes an unbiased log marginal likelihood estimator. Figure 7 shows that Chib's estimator is nearly unbiased in the steady-state BVAR already after a few hundred MCMC iterations.

Figure 7: Unbiasedness of Chib's log marginal likelihood estimator in the steady-state BVAR application. The horizontal axis denotes the number of MCMC draws (excluding 50 observations as burn-in); the blue dots are draws from the sampling distribution of Chib's estimator for a given MCMC sample size. The red line is the mean of the draws and the blue line represents the true log marginal likelihood, obtained from 100 000 MCMC iterations with 5 000 as burn-in.

Algorithm 2: Chib's marginal likelihood estimator with standard errors

1. (a) Run the first Gibbs sampler to obtain the full conditional distributions.
   (b) Select θ* = (Π*, Σ*, Ψ*) from a high-density location.
   (c) Use the output to calculate p(Ψ* | Π*, Σ*, y) and p̂(Π* | y).

2. (a) Run the reduced Gibbs sampler in the same way as the first, but fix Π at Π* in every iteration.
   (b) Calculate p̂(Σ* | Π*, y).

3. (a) Calculate the likelihood and prior density at θ* = (Π*, Σ*, Ψ*).
   (b) Put all the pieces together to obtain
       ln p̂(y) = ln p(y | Π*, Σ*, Ψ*) + ln p(Π*, Σ*, Ψ*) − ln p̂(Π* | y) − ln p̂(Σ* | Π*, y) − ln p(Ψ* | Π*, Σ*, y).

4. (a) Compute Ω_0 as Ω_s = (1/G) Σ_{g=s+1}^{G} (h^(g) − ĥ)(h^(g−s) − ĥ)' with s = 0.
   (b) Select the autocorrelation order q by minimizing AIC for a VAR(q) fitted to h, and compute the correction terms Ω_s for s = 1, 2, ..., q.
   (c) Compute the variance of ĥ as var(ĥ) = (1/G) [ Ω_0 + Σ_{s=1}^{q} (1 − s/(q+1)) (Ω_s + Ω_s') ].
   (d) Finally, obtain the variance of the log marginal likelihood by the delta method as var(ψ) = (∂ψ/∂ĥ)' var(ĥ) (∂ψ/∂ĥ).
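As a concrete, self-contained illustration of the estimator in Algorithm 2, the following toy example applies the same steps to a two-block Gibbs sampler (mean and variance of normal data under a semi-conjugate prior) rather than the three-block BVAR. All model choices, priors, and draw counts are illustrative assumptions; with only two blocks the reduced run is unnecessary, because p(σ²* | μ*, y) is already a closed-form full conditional:

```python
import math, random

random.seed(1)

# toy data: y_i ~ N(mu, sigma2)
y = [random.gauss(1.0, 2.0) for _ in range(30)]
n, ybar = len(y), sum(y) / len(y)

# priors: mu ~ N(m0, V0), sigma2 ~ InvGamma(a0, b0)
m0, V0, a0, b0 = 0.0, 10.0, 3.0, 3.0

def log_norm(x, m, v):
    return -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)

def log_invgamma(x, a, b):
    return a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(x) - b / x

def mu_cond(sigma2):
    # mu | sigma2, y ~ N(mn, vn)
    vn = 1.0 / (1.0 / V0 + n / sigma2)
    return vn * (m0 / V0 + n * ybar / sigma2), vn

def sigma2_cond_params(mu):
    # sigma2 | mu, y ~ InvGamma(an, bn)
    return a0 + n / 2.0, b0 + 0.5 * sum((yi - mu) ** 2 for yi in y)

# step 1: full Gibbs run
G, draws = 5000, []
mu, sigma2 = ybar, 1.0
for _ in range(G):
    mn, vn = mu_cond(sigma2)
    mu = random.gauss(mn, math.sqrt(vn))
    an, bn = sigma2_cond_params(mu)
    sigma2 = 1.0 / random.gammavariate(an, 1.0 / bn)   # inverse-gamma draw
    draws.append((mu, sigma2))

# evaluation point: posterior means (a high-density location)
mu_s = sum(d[0] for d in draws) / G
sig_s = sum(d[1] for d in draws) / G

# p-hat(mu* | y): average the full conditional over the Gibbs draws
log_terms = []
for _, s2 in draws:
    mn, vn = mu_cond(s2)
    log_terms.append(log_norm(mu_s, mn, vn))
mmax = max(log_terms)
log_p_mu = mmax + math.log(sum(math.exp(t - mmax) for t in log_terms) / G)

# p(sigma2* | mu*, y) in closed form (no reduced run needed with two blocks)
an, bn = sigma2_cond_params(mu_s)
log_p_sig = log_invgamma(sig_s, an, bn)

# assemble ln m-hat(y) = log lik + log prior - log posterior ordinate
log_lik = sum(log_norm(yi, mu_s, sig_s) for yi in y)
log_prior = log_norm(mu_s, m0, V0) + log_invgamma(sig_s, a0, b0)
log_ml = log_lik + log_prior - log_p_mu - log_p_sig
print(round(log_ml, 3))
```

The estimate can be checked against brute-force numerical integration of the likelihood times the prior, which agrees closely even at this modest number of Gibbs draws.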