Inference in Bayesian Additive Vector Autoregressive Tree Models
Florian Huber (a) and Luca Rossini (b) ∗
(a) University of Salzburg, Austria
(b) Vrije Universiteit Amsterdam, The Netherlands
July 1, 2020
Abstract
Vector autoregressive (VAR) models assume linearity between the endogenous variables and their lags. This linearity assumption might be overly restrictive and could have a deleterious impact on forecasting accuracy. As a solution, we propose combining VAR with Bayesian additive regression tree (BART) models. The resulting Bayesian additive vector autoregressive tree (BAVART) model is capable of capturing arbitrary non-linear relations between the endogenous variables and the covariates without much input from the researcher. Since controlling for heteroscedasticity is key for producing precise density forecasts, our model allows for stochastic volatility in the errors. Using synthetic and real data, we demonstrate the advantages of our methods. For Eurozone data, we show that our nonparametric approach improves upon commonly used forecasting models and that it produces impulse responses to an uncertainty shock that are consistent with established findings in the literature.
Keywords:
Bayesian additive regression trees; BAVART; Decision trees; Nonparametric regression; Vector autoregressions
∗ Florian Huber gratefully acknowledges funding from the Austrian Science Fund (FWF): ZK 35. Luca Rossini acknowledges financial support from the EU Horizon 2020 programme under the Marie Skłodowska-Curie scheme (grant agreement no. 796902).

In macroeconomics and finance, most models commonly employed to study the transmission of economic shocks or produce predictions assume linearity in their parameters and are fully parametric. One prominent example of a frequently used fully parametric and linear model is the vector autoregression (VAR) that is extensively used in central banks and academia (see Sims, 1980; Doan et al., 1984; Litterman, 1986; Sims and Zha, 1998). In normal times, with macroeconomic relations remaining stable, this linearity assumption might describe the data well. In turbulent times, however, key transmission channels often change and more flexibility could be necessary.

The linearity assumption has been questioned several times in the recent literature. For instance, Granger and Terasvirta (1993) show that macroeconomic quantities might exhibit non-linear relations and thus assuming linearity might be overly restrictive. As a solution, researchers increasingly wish to estimate non-linear models that assume the parameters to change over time and/or shock variances to be time-varying. These time-varying parameter (TVP) models, however, assume that within each point in time, the relationship between the endogenous and explanatory variables is linear. For recent contributions, see Primiceri (2005); Kalli and Griffin (2014); Bitto and Frühwirth-Schnatter (2019); Huber et al. (2019); Chan et al. (2020).

Another strand of the literature proposes Bayesian nonparametric time series models in order to relax the linearity assumption. In particular, several recent papers (see, e.g., Bassetti et al., 2014; Kalli and Griffin, 2018; Billio et al., 2019) propose novel methods that assume the transition densities to be nonparametric. These techniques are characterized by featuring an infinite dimensional parameter space that is flexibly adjusted to the complexity of the data at hand. For a review of Bayesian nonparametric methods, see Hjort et al. (2010). Within the nonparametric paradigm, there have been a number of popular methods that make use of machine learning techniques such as boosting (Freund and Schapire, 1997; Friedman, 2001), bagging and random forests (Breiman, 2001), decision trees and Mondrian forests (Roy and Teh, 2009; Lakshminarayanan et al., 2014) and Bayesian tree models (Chipman et al., 2002; Gramacy and Lee, 2008; Chipman et al., 2010; Linero, 2018). In this paper, we focus on Bayesian tree models applied to multivariate time series regressions.

The main contribution of the present paper is to connect the literature on Bayesian additive regression trees (BART, see Chipman et al., 2010) with the literature on VAR models. BART is a nonparametric regression approach which allows for unveiling non-linear relations between a set of endogenous and explanatory variables without much input needed from the researcher. Intuitively speaking, it models the conditional mean of the regression model by summing over a large number of trees which are, by themselves, constrained through a regularization prior. The resulting individual trees will take a particularly simple form and can thus be classified as "weak learners".
Each single tree only explains a small fraction of the variation of the response variable while a large sum of them is capable of describing the data extremely well.

It is precisely this intuition on which we build when we generalize these techniques to a multivariate setting. More precisely, we assume that a potentially large dimensional vector of endogenous variables is driven by the lags of the response variables. As opposed to VAR models which assume linearity between the explanatory and endogenous variables, we capture these relations using BART. The resulting model, labeled the Bayesian additive vector autoregressive tree (BAVART) model, is a highly flexible variant that can be used for forecasting and impulse response analysis.

Estimation and inference is carried out using Markov chain Monte Carlo (MCMC) techniques. Since the error covariance matrix is non-diagonal, we propose methods that allow for equation-by-equation estimation of the multivariate model. These techniques imply that model estimation scales well in high dimensions and permits estimation of huge dimensional models. Another crucial feature of our approach is that it allows for flexibly handling heteroscedasticity. We control for heteroscedasticity by proposing a stochastic volatility specification. This feature is crucial for producing precise density forecasts.

We illustrate our BAVART model using macroeconomic data for the Eurozone. In this application, we assess the forecasting accuracy of our method relative to competitors that are commonly used in the literature. Since interest not only centers on one-step-ahead predictive distributions but also on multi-step-ahead forecasts, we provide algorithms to simulate from the relevant predictive distributions. These techniques are then used in a second application where we consider the effects of economic uncertainty on the Eurozone. We show how our model can be used to compute structural generalized impulse response functions (SGIRFs) and find responses that are consistent with established findings in the literature.

The remainder of this paper is organized as follows. Section 2 discusses the BART model in the context of the homoscedastic regression model while Section 3 extends this method to the VAR case, proposes the BAVART specification and shows how we control for heteroscedasticity. Section 4 presents the results of the forecasting and impulse response exercise. Finally, the last section summarizes our findings and concludes the paper.

In this section, we briefly review the BART approach for regression on which we build in later sections. For an extensive introduction, see Hill et al. (2020). Let y = (y_1, ..., y_T)' denote the T-dimensional response vector and X = (x_1, ..., x_T)' be a T × p-dimensional matrix of exogenous variables with x_t being the p covariates in time t. We assume the following relationship between y and X:

$$y = f(X) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0_T, \sigma^2 I_T),$$

where σ² denotes the error variance and f(·) is a (potentially) highly non-linear function. The BART model is then obtained by approximating f through a sum of many regression trees as follows:

$$f(X) \approx \sum_{j=1}^{m} g(X; T_j, m_j). \qquad (1)$$

In Eq. (1), (T_j, m_j) corresponds to a single tree model with T_j denoting the tree structure associated with the j-th binary tree and m_j = (µ_{j1}, ..., µ_{j b_j})' being the vector of terminal node parameters associated with T_j, with b_j denoting the number of leaves of the j-th tree.
The binary trees are constructed by considering splitting rules of the form {X ∈ A_jk} or {X ∉ A_jk} with A_jk being a partition set. These rules typically depend on selected columns of X, denoted as X_{•j}, and a threshold c. The set is then defined by a unique partition of the predictor space such that {X_{•j} ≤ c} or {X_{•j} > c}. The step function g(·; ·, ·) is constant over the elements of A_jk:

$$g(X; T_j, m_j) = \mu_{jk}, \quad \text{if } X \in A_{jk}, \quad k = 1, \ldots, b_j.$$

Hence, the set A_jk defines a tree-specific unique partition of the covariate space such that the function g returns a specific value µ_jk for specific values of x_t.

To avoid overfitting, the trees are encouraged to be small (i.e. take a particularly simple form) and the terminal node parameters to be shrunk to zero. If the first tree, g(x; T_1, m_1), is a weak learner and fitted in a reasonable way, the corresponding tree structure will be very simple and the elements in m_1 will be pushed towards zero. This implies that the first tree will explain a small fraction of the variation in y. Subtracting g(X; T_1, m_1) from y yields a new conditional model with transformed ỹ = y − g(X; T_1, m_1) and the next tree will be fitted. This procedure is repeated for a sufficiently large number m of trees until the fit of the additive model becomes reasonably good.

We illustrate BART using a simple example. In this simple example, we use a single regression tree model (i.e. m = 1). The case of several regression trees is considered afterwards.

Example 1. Consider the regression tree g(X; T_j, m_j) in Figure 1. We include two covariates, x_t = (x_{1t}, x_{2t}). In the figure, each rectangle refers to a splitting rule and we henceforth label this a node. At the top node (also called root node or simply root), we have the condition that asks whether x_{1t} < 0.8. If this condition is not true, then we arrive at the terminal node µ_{j1} and set E(y_t) = µ_{j1}. By contrast, if the condition is fulfilled we move to the next node. The corresponding splitting rule states that if x_{2t} ≥ 0.3, we reach a terminal node with E(y_t) = µ_{j2}, whereas if x_{2t} < 0.3, we set E(y_t) = µ_{j3}. Using a single tree thus yields three possible values for the conditional mean, µ_{j1}, µ_{j2} and µ_{j3}, and implies abrupt shifts between them. If y_t is evolving smoothly over time, using a single tree would thus lead to a rather simplistic conditional mean structure.

Example 2. Since the tree model in the first example is too simplistic, our second example deals with a more flexible conditional mean structure. Let us consider a sum of regression trees with m = 2 trees and 3 covariates, depicted in Figure 2. Within a given tree, the same intuition as in Example 1 applies. However, since we now use two trees we gain more flexibility in modeling the conditional mean of y_t. This is because the conditional mean at a given point in time is equal to the sum of the terminal node parameters for the two trees and the combination of the two parameters depends on the specific set of decision rules across both trees. This allows for a substantially richer mean structure and increased flexibility as opposed to a single tree model.
Figure 1: Example of a binary regression tree, g(x; T_j, m_j), with internal nodes labeled by their splitting rules and leaf nodes labeled with the corresponding parameters µ_jk, with j = 1 and k = 1, 2, 3.
Figure 2: Example of a sum of regression trees, g(x; T_j, m_j), with internal nodes labeled by their splitting rules and leaf nodes labeled with the corresponding parameters µ_jk, where j = 1, 2 and k = 1, 2, 3.

The second example shows how flexibility is increased by adding more trees and illustrates how BART handles non-linearities in a flexible manner. In particular, each regression tree is a simple step-wise function and when we sum the different regression trees, we gain flexibility. The resulting additive model essentially allows for approximating non-linearities without prior assumptions on the specific form of the non-linearities.
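To make the mechanics of the two examples concrete, the sketch below evaluates a single regression tree and a sum of two trees as plain step functions. The thresholds 0.8 and 0.3 follow Example 1; all other thresholds, the terminal node values and the helper names are made-up illustrative choices rather than quantities taken from the paper.

```python
import numpy as np

def tree_example_1(x, mu=(0.1, -0.2, 0.3)):
    """Single tree from Example 1: two splits, three terminal nodes (mu_j1, mu_j2, mu_j3)."""
    x1, x2 = x[0], x[1]
    if not (x1 < 0.8):      # root rule fails -> first terminal node
        return mu[0]
    if x2 >= 0.3:           # second splitting rule
        return mu[1]
    return mu[2]

def sum_of_trees(x, trees):
    """Sum-of-trees conditional mean: add up the step functions of all trees."""
    return sum(tree(x) for tree in trees)

# two illustrative trees in the spirit of Example 2 (thresholds and node values are invented)
tree_a = lambda x: 0.05 if x[0] < 0.5 else (0.20 if x[1] < 0.4 else -0.10)
tree_b = lambda x: -0.15 if x[2] < 0.6 else 0.25

x_t = np.array([0.3, 0.7, 0.9])
print(tree_example_1(x_t))                   # conditional mean implied by a single tree
print(sum_of_trees(x_t, [tree_a, tree_b]))   # richer conditional mean from the sum of two trees
```

Evaluating the two functions at a given x_t shows how a single tree can only return one of three values, whereas the sum of trees already generates a considerably larger set of possible conditional means.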
In this section we generalize the model outlined in the previous section to the multivariate case. Consider an M-dimensional vector of endogenous variables y_t = (y_{1t}, ..., y_{Mt})'. Stacking the rows yields a T × M matrix Y = (y_1, ..., y_T)'. We assume that Y depends on a T × K matrix X = (X_1, ..., X_T)' with each X_t = (y'_{t−1}, ..., y'_{t−P})' being a K (= PM)-dimensional vector of lagged endogenous variables. The Bayesian Additive Vector Autoregressive Tree (BAVART) model is then given by:

$$Y = F(X) + \varepsilon, \qquad (2)$$

where ε = (ε_1, ..., ε_T)' and ε_t ∼ N(0_M, Σ). For the moment, we assume that the error variance-covariance matrix Σ is time-invariant. In Eq. (2), F(·) is defined in terms of equation-specific functions f_j(X):

$$F(X) = \big(f_1(X), \ldots, f_M(X)\big).$$

Similarly to the standard BART specification, we approximate each f_j(X) through a sum of N regression trees:

$$f_j(X) = \sum_{k=1}^{N} g_{jk}(X; T_{jk}, m_{jk}).$$

Here, g_jk (k = 1, ..., N) denotes an equation-specific step function with arguments T_jk and m_jk. As before, the individual tree structures T_jk are associated with a b_jk-dimensional vector m_jk = (µ_{jk,1}, ..., µ_{jk,b_jk})' of terminal node coefficients, with b_jk denoting the number of leaves per tree in equation j.

As stated in Section 2, a regression tree creates a partition of the covariate space into subgroups. Internal nodes of regression trees are equipped with splitting rules that are characterized by some column of X, labeled X_{•i}, and a specific threshold c which can take any value in the range of X_{•i}. Such a decision rule takes the form {X_{•i} ≤ c} or {X_{•i} > c} and thus allocates observations according to this condition.

We estimate the BAVART model by exploiting its structural form. Using the structural form of the VAR to speed up computation has been done in several recent papers (see, e.g., Carriero et al., 2019; Huber et al., 2020). In our case, Eq. (2) can be written as follows:

$$Y = \tilde{F}(X) + Y(I_M - A)' + \epsilon, \qquad (3)$$

whereby F̃ is defined similarly to F, A is an M × M-dimensional lower triangular matrix with diag(A) = (1, ..., 1)', and ϵ = (ϵ_1, ..., ϵ_T)' denotes a T × M matrix of orthogonal shocks with ϵ_t ∼ N(0_M, H), with H denoting an M × M-dimensional diagonal matrix with the variances on its main diagonal.

Conditional on A, this form permits equation-by-equation estimation since the shocks are independent. This leads to enormous computational gains. Notice that the j-th equation of the system is given by:

$$y_{\bullet j} = \sum_{k=1}^{N} \tilde{g}_{jk}(X; T_{jk}, m_{jk}) + \sum_{l=1}^{j-1} a_{jl} y_{\bullet l} + \epsilon_{\bullet j}. \qquad (4)$$

Here, we let g̃_jk(X; T_jk, m_jk) denote a regression tree while y_{•j} and ϵ_{•j} refer to the j-th column of Y and ϵ, respectively, and a_jl denotes the (j, l)-th element of −A.

Notice that Eq. (4) is a generalized additive model that consists of a non-parametric part Σ_{k=1}^{N} g̃_jk(X; T_jk, m_jk) and a regression part Σ_{l=1}^{j−1} a_jl y_{•l}. For j = 1, the model reduces to a standard BART specification without including the contemporaneous values of the other endogenous variables.

The key idea behind this formulation is that, for a sufficiently large number of trees, we approximate non-linear relations between y_t and its lags while allowing for linear relations between the contemporaneous values of y_t.
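The following sketch illustrates, on simulated data, the equation-by-equation logic implied by Eqs. (3)-(4): conditional on A, the contemporaneous part of equation j is linear and the remainder is the target handed to the tree ensemble. The placeholder tree fit, the numerical values and all helper names are assumptions chosen for illustration; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 200, 3
Y = rng.normal(size=(T, M))      # stand-in for the endogenous variables
X = np.roll(Y, 1, axis=0)        # one lag as the covariate matrix (first row is meaningless here)

# lower triangular A with unit diagonal, as in Eq. (3)
A = np.eye(M)
A[1, 0], A[2, 0], A[2, 1] = 0.5, -0.3, 0.2

def tree_ensemble_fit(X, target):
    """Placeholder for the sum-of-trees fit of one equation (here simply the sample mean)."""
    return np.full(target.shape, target.mean())

for j in range(M):
    # Eq. (4): a_jl is the (j, l)-th element of -A, so the contemporaneous part is Y[:, :j] @ (-A[j, :j])
    contemporaneous = Y[:, :j] @ (-A[j, :j])
    target_j = Y[:, j] - contemporaneous        # what the trees of equation j have to explain
    fit_j = tree_ensemble_fit(X, target_j)

# Eq. (5): reduced-form error covariance implied by the structural form
H = np.diag([0.9, 1.1, 0.7])                    # diagonal structural variances
A_inv = np.linalg.inv(A)
Sigma = A_inv @ H @ A_inv.T                     # Σ = A^{-1} H A^{-1}'
```

Because each target_j only involves quantities that are known conditional on A, the M equations can be handled one at a time, which is what makes the approach scale to larger systems.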
After obtaining estimates of the trees and the free elements of A, we can solve Eq. (3) for its reduced form representation as follows:

$$Y = \tilde{F}(X) A^{-1\prime} + \epsilon A^{-1\prime}, \qquad (5)$$

with ε = ϵA^{−1′}, Σ = A^{−1} H A^{−1′} and F(X) = F̃(X) A^{−1′}.

3.2 Allowing for Heteroscedasticity

Up to this point we assumed the error variance to be constant. If the researcher wishes to relax this assumption, several feasible options exist. In a recent paper, Pratola et al. (2019) propose a combination of a standard additive BART model with a multiplicative BART specification to flexibly control for heteroscedasticity. But modeling heteroscedasticity with BART implies that we need some information on how the volatility of economic shocks depends on additional covariates (which potentially differ from X). As a simple yet flexible solution we adopt a standard stochastic volatility (SV) model. SV models are frequently used in macroeconomics and finance and possess a proven track record for forecasting applications (see, e.g., Clark, 2011; Clark and Ravazzolo, 2015). Our SV specification simply assumes that H is time-varying: H_t = diag(e^{h_{1t}}, ..., e^{h_{Mt}}) with the time-varying variances e^{h_{jt}}. Their logarithms follow an AR(1) model:

$$h_{jt} = c_j + \rho_j (h_{j,t-1} - c_j) + \sigma_{jh} \nu_{jt}, \qquad \nu_{jt} \sim \mathcal{N}(0, 1), \qquad h_{j0} \sim \mathcal{N}\left(c_j, \frac{\sigma_{jh}^2}{1 - \rho_j^2}\right). \qquad (6)$$

Here, c_j is the unconditional mean, ρ_j is the persistence parameter and σ²_{jh} is the error variance of the log-volatility process. In our empirical work, all models feature stochastic volatility of this form.

The Bayesian approach calls for the specification of suitable priors over the parameters of the model. Here we mainly follow the different strands of the literature our approach combines and thus only provide a brief summary. In particular, we focus on the priors associated with the trees T_jk and the terminal node parameters m_jk for j = 1, ..., M and k = 1, ..., N. We assume that the hyperparameters across priors are the same for each equation.

For each equation j, the joint prior structure is given by:

$$p\big((T_{j1}, m_{j1}), \ldots, (T_{jN}, m_{jN}), c_j, \rho_j, \sigma_{jh}^2, a_j\big) = p\big((T_{j1}, m_{j1}), \ldots, (T_{jN}, m_{jN})\big)\, p(c_j)\, p(\rho_j)\, p(\sigma_{jh}^2)\, p(a_j).$$

At this level, the prior implies independence between p((T_{j1}, m_{j1}), ..., (T_{jN}, m_{jN})) and the remaining model parameters. Chipman et al. (2010) further introduce the following factorization:

$$p\big((T_{j1}, m_{j1}), \ldots, (T_{jN}, m_{jN})\big) = \prod_{k=1}^{N} p(m_{jk} \mid T_{jk})\, p(T_{jk}), \qquad p(m_{jk} \mid T_{jk}) = \prod_{q=1}^{b_{jk}} p(\mu_{jk,q} \mid T_{jk}).$$

This structure indicates that the tree components (T_jk, m_jk) are independent of each other and of the remaining parameters while the prior on m_jk depends on the tree structure.

The prior on the tree structure is specified along the lines suggested in Chipman et al. (2010). More precisely, we assume a tree generating stochastic process that follows three recursive steps per tree. Assume initially s = 0; then we have the following steps:

1. Start with the trivial tree that consists of a single terminal node (or only its root), which will be denoted by η_jk for each equation j and k = 1, ..., N.

2. Split the terminal node η_jk with probability

$$p_{SPLIT}(\eta_{jk}, T^{(s)}_{jk}) = \alpha (1 + d)^{-\beta}, \qquad \alpha \in (0, 1), \; \beta \geq 0, \; d = 0, 1, 2, \ldots \qquad (7)$$

where d denotes the depth of the tree and (α, β) are prior hyperparameters.
In detail, a node at depth d spawns children with probability p_SPLIT(η_jk, T^(s)_jk) and, as the tree grows, d increases while the prior probability decreases, making it less likely that a node spawns children. This implies that the probability that a given node becomes a bottom node increases and thus a penalty on tree complexity is introduced.

3. If the current node is split, we introduce a splitting rule κ_jk drawn from the distribution function p_RULE(κ_jk | η_jk, T^(s)_jk) and use this rule to spawn a left and a right child node. In particular, the rule p_RULE(κ_jk | η_jk, T^(s)_jk) is set such that a given "splitting" covariate X_{•j} is chosen with probability 1/K. As described above, conditional on a chosen covariate, one needs to estimate a threshold. The prior on this threshold is, in the absence of substantial information, assumed to be uniformly distributed over the range of X_{•j}.

4. Once we obtain all terminal nodes (i.e. no node is split anymore), we label this new tree T^(s+1)_jk and return to step (2).

On the terminal node parameter µ_jk,q, we use a conjugate Gaussian prior distribution N(0, σ²_µ), where σ_µ = 0.5/(s√N). Recall that N is the number of trees and s is the number of prior standard deviations between 0 and 0.5. Hence, larger values of N increasingly push µ_jk,q towards zero.

For the free elements in A, we introduce a Horseshoe prior on each element of a_j = (a_{j1}, ..., a_{j,j−1})' for every j = 1, ..., M:

$$a_{jl} \sim \mathcal{N}(0, \tau_{jl}^2 \lambda_j^2), \qquad \tau_{jl} \sim \mathcal{C}^{+}(0, 1), \qquad \lambda_j \sim \mathcal{C}^{+}(0, 1),$$

where l = 1, ..., j−1 and C⁺ denotes the half-Cauchy distribution. The prior specification on the parameters of the log-volatility equation follows the setup proposed in Kastner and Frühwirth-Schnatter (2014). In detail, we use a zero mean Gaussian prior with variance 10 on the unconditional mean c_j, a Beta prior on a transformation of the persistence parameter, (ρ_j + 1)/2 ∼ B(25, 5), and a Gamma prior on the error variance of the log-volatility process, σ²_{j,h} ∼ G(1/2, 1/2).

We simulate the trees and terminal node parameters (T_jk, m_jk for all j, k) using the algorithm outlined in Chipman et al. (2010). The latent states and the coefficients associated with the state equation of the log-volatilities are obtained through the efficient algorithm discussed in Kastner and Frühwirth-Schnatter (2014). Conditional on the trees and log-volatilities, the posterior of a_j is multivariate Gaussian and takes a standard form since the resulting conditional model is a linear regression model with heteroscedastic shocks. Finally, the parameters of the Horseshoe prior are simulated using techniques outlined in Makalic and Schmidt (2015).

In this section we briefly discuss the dataset adopted. Instead of focusing on US data, we apply our approach to monthly Eurozone data that ranges from January 1999 to January 2019. Our dataset is a selection of time series from the popular Euro Area Real Time Database (EA-RTD). For the in-sample results discussed in the next sub-section, we consider a medium-scale model that includes six macroeconomic and financial time series. These six time series are augmented by an uncertainty index (abbreviated as Unc) which is estimated using the approach outlined in Jurado et al. (2015). Apart from the uncertainty indicator, we include the growth rate of the Dow Jones Euro Stoxx 50 price index (DJE50), HICP inflation (C OV), the unemployment rate (UNETO), the 3-month Euribor (EUR3M), industrial production (XCONS) and the yield on Euro area 10-year benchmark government bonds (10Y).

In the forecasting application, we consider three different datasets: a small one that includes only HICP inflation, the unemployment rate and the 3-month Euribor. These three variables will consequently form our set of focus variables. Moreover, we consider the medium-sized dataset discussed above but without the uncertainty index as well as a large dataset. The large dataset augments the medium-sized one by adding the real effective exchange rate (ERC0 BGR), the M2 money base (M2 V NC), the one-month forward price of Brent crude oil (OILBR) and the spread between ten and two year government bond yields (10Y2Y).

All variables are transformed to be approximately stationary. We include one lag of the endogenous variables.

In this section, we illustrate key features of the BART approach using a subset of the dataset discussed in the previous section. Variable importance is gauged by considering the posterior
median of the number of times a given quantity shows up in a splitting rule. These frequencies are shown in Table 1. The columns refer to the different equations and the rows to the corresponding covariates. By simply inspecting the main diagonal elements of the table we directly observe that the first, own lag of a given variable within some equation appears to be important. This finding holds for all equations and resembles key results of the literature on Bayesian VARs which states that the AR(1) term explains most variation in y_t. However, here it is worth stressing that our model is far from sparse and also the lags of other variables seem to play a role in determining the dynamics of a given endogenous variable.

                Unc_t   DJE50_t   EUR3M_t   C OV_t   UNETO_t   XCONS_t   10Y_t
Unc_{t-1}         31        10        11        6         9        14       9
DJE50_{t-1}       11        40         9        9         6        10      10
EUR3M_{t-1}       12        12        45       13         9        11      14
C OV_{t-1}        11        12        13       55        12        14      12
UNETO_{t-1}       13        10        11        9        54        12      11
XCONS_{t-1}       12        10         9       10        10        35      12
10Y_{t-1}         11        10         5        5         7        11      41

Table 1: Number of times a given covariate shows up across splitting rules in a medium-scale VAR with one lag.

In principle, we could also use Table 1 to construct a restricted VAR model. This can be achieved by introducing a threshold (such as including only variables that show up at least 12 times). Using this (arbitrary) threshold would lead to a model that assumes uncertainty to depend only on its lagged value, the short-term interest rate, the unemployment rate and industrial production. Similarly, stock returns only depend on lagged stock returns, short-term interest rates and inflation. Short-term interest rates would only depend on inflation and lagged interest rates while inflation only depends on lagged inflation and the short rate. The unemployment rate is determined exclusively by its own lag and (lagged) inflation, closely resembling a Phillips curve-type relationship. Furthermore, we find that industrial production growth depends on lagged inflation, unemployment, uncertainty and its lag. Finally, 10-year bond yields depend on lagged bond yields, short-term interest rates, inflation and industrial production. This simple example essentially shows that, conditional on choosing a suitable cut-off value, one can investigate the determinants of each equation in a multivariate model and potentially use these to construct a restricted linear VAR model.

Using a single lag allows for simple inspection of several features of the model. Including more lags is straightforward but, as discussed in the text, lags larger than one only rarely show up in the splitting rules.

In this section, we investigate whether our combination of BART and VAR models yields better forecasts of short-term interest rates, the unemployment rate and inflation. Before discussing the precise design of the forecast exercise, the question on how to compute the predictive distribution naturally arises. Producing one-step-ahead forecasts is computationally easy since conditional on the tree structures, splitting values and splitting covariates it is straightforward to compute a prediction for y_{T+1}. More precisely (and with a slight abuse of notation), the one-step-ahead predictive distribution is given by:

$$p(y_{T+1} \mid y^{T}) = \int p(y_{T+1} \mid y^{T}, \Xi)\, p(\Xi \mid y^{T})\, d\Xi,$$

where y^T denotes the full history of y_t and Ξ is a generic notation that summarizes all parameters and latent states (i.e. the tree structures, terminal nodes, log-volatilities etc.). The conditional density p(y_{T+1} | y^T, Ξ) is:

$$y_{T+1} \mid y^{T}, \Xi \sim \mathcal{N}\big(F(X_{T+1}), \Sigma_{T+1|T}\big).$$

Σ_{T+1|T} is obtained by using Eq. (6) to predict the log-volatilities and using these predictions to form H_{T+1|T}.
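To illustrate the predictive density just described, the sketch below forms a single draw from p(y_{T+1} | y^T, Ξ) for one retained MCMC draw: the log-volatilities are moved one period ahead with the AR(1) law of motion in Eq. (6), H_{T+1|T} is rotated into Σ_{T+1|T} via A, and y_{T+1} is simulated around the tree-based conditional mean. All inputs and names are placeholder values standing in for output of the sampler.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3

# placeholders for one retained MCMC draw (in practice these come from the sampler)
f_tilde_next = np.array([0.2, -0.1, 0.05])    # structural-form tree fit evaluated at X_{T+1}
A = np.array([[1.0, 0.0, 0.0],
              [0.4, 1.0, 0.0],
              [-0.2, 0.3, 1.0]])              # lower triangular with unit diagonal
h_T = np.array([-0.5, -1.0, -0.8])            # log-volatilities in period T
c, rho, sig_h = np.array([-0.6, -0.9, -0.7]), np.array([0.95, 0.90, 0.85]), np.array([0.20, 0.30, 0.25])

# Eq. (6): propagate the log-volatilities one period ahead
h_next = c + rho * (h_T - c) + sig_h * rng.normal(size=M)
H_next = np.diag(np.exp(h_next))

# rotate the structural quantities into the reduced form
A_inv = np.linalg.inv(A)
mean_next = A_inv @ f_tilde_next
Sigma_next = A_inv @ H_next @ A_inv.T

# one draw from the one-step-ahead predictive distribution
y_next = rng.multivariate_normal(mean_next, Sigma_next)
```

Repeating this over all retained MCMC draws yields a sample from the predictive distribution; multi-step-ahead forecasts iterate this step, as described next.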
Higher order forecasts are then computed iteratively by first simulating from the one-step-ahead predictive distribution to obtain a prediction for y_{T+1}. This prediction is then used to form X_{T+2} which is consequently used to predict y_{T+2}. We repeat this procedure until we have a prediction y_{T+o} with o denoting the desired forecast horizon.

The design of the forecasting exercise is the following. Considering a hold-out period that runs from 2009:M07 to 2019:M06 (120 monthly observations), we use an initial estimation period that lasts until 2009:M06 to produce the o = 1 and o = 3 steps-ahead predictive distributions, the first of which refers to 2009:M07. After these are obtained, we repeat the procedure until we reach the final point in our sample. The predictions are then evaluated by using relative mean squared forecast errors (MSFEs) and the continuous ranked probability score (CRPS). The MSFEs allow us to gauge the quality of point forecasts while the CRPS is used to evaluate the quality of the predictive distribution.

We compare our BAVART model to several commonly used multivariate time series models in the literature. First, we use a set of VAR models that feature different shrinkage priors. The first is the non-conjugate Minnesota prior that allows for asymmetric treatment of "own" lags (defined as lags of a given endogenous variable in y_t) and "other" lags (defined as the lags of the other variables in y_t). The corresponding shrinkage hyperparameters are estimated using a random walk Metropolis-Hastings step. Since this prior is often used by practitioners, we use it as a baseline model to which we benchmark our models. Second, we use the stochastic search variable selection (SSVS) prior of George et al. (2008). The third prior is the Normal-Gamma (NG) prior proposed in Griffin and Brown (2010) that has been applied to VARs in Huber and Feldkircher (2019). Moreover, we use a TVP-VAR with an NG shrinkage prior on the state innovation variances and the initial state of the system. Finally, all models are estimated using a stochastic volatility specification. Recall that all results are relative to the Minnesota VAR.
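For reference, the snippet below evaluates the two criteria for one variable, horizon and forecast origin from a set of predictive draws, using the sample representation CRPS = E|d − y| − ½E|d − d'| with d, d' independent draws from the predictive distribution. The numbers are synthetic, and the relative scores reported below divide each model's average score by that of the Minnesota benchmark.

```python
import numpy as np

rng = np.random.default_rng(2)

draws = rng.normal(loc=1.0, scale=0.5, size=4000)   # predictive draws for one variable and horizon
outcome = 1.3                                        # realized value

# squared forecast error of the point forecast (posterior mean of the predictive distribution)
sfe = (draws.mean() - outcome) ** 2

# CRPS via its expectation form, estimated with independent pairs of draws
term1 = np.abs(draws - outcome).mean()
term2 = 0.5 * np.abs(draws[:2000] - draws[2000:]).mean()
crps = term1 - term2

print(sfe, crps)
```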
                        Small (M = 3)                     Medium (M = 7)                    Large (M = 10)
             BAVART    NG      SSVS    NGTVP    BAVART    NG      SSVS    NGTVP    BAVART    NG      SSVS    NGTVP
One-step-ahead
EUR3M         1.021   0.977   0.984   1.014     1.033   0.987   0.985   1.009     1.018   0.977   0.977   1.005
             (0.924) (0.984) (0.987) (0.997)   (0.958) (0.992) (0.990) (0.996)   (0.951) (0.983) (0.981) (0.985)
UNETO         0.905   0.860   0.881   0.725     0.991   0.877   0.937   0.838     0.947   0.891   0.929   0.859
             (1.053) (0.932) (0.941) (0.849)   (1.050) (0.931) (0.958) (0.890)   (1.001) (0.926) (0.945) (0.882)
C OV          0.922   1.006   1.016   1.108     0.932   0.998   1.002   1.034     0.933   1.009   1.007   1.040
             (0.878) (1.005) (1.012) (1.102)   (0.899) (0.996) (0.997) (1.022)   (0.909) (1.007) (1.002) (1.025)
Three-steps-ahead
EUR3M         0.942   0.974   0.978   0.996     1.255   0.938   0.887   1.043     1.156   0.962   1.006   1.501
             (0.867) (0.984) (0.987) (1.003)   (1.014) (0.969) (0.947) (0.971)   (0.916) (0.959) (0.971) (1.053)
UNETO         0.858   0.939   0.937   0.757     2.321   1.444   1.439   4.592     1.883   1.786   1.769   3.357
             (0.811) (0.965) (0.959) (0.820)   (1.550) (1.074) (1.079) (1.547)   (1.449) (1.110) (1.130) (1.600)
C OV          0.745   0.998   1.012   1.177     0.731   0.989   0.995   1.144     0.720   1.024   1.020   1.143
             (0.769) (1.000) (1.008) (1.168)   (0.750) (0.988) (0.990) (1.114)   (0.755) (1.018) (1.012) (1.121)

Table 2: Relative mean squared forecast errors (MSFEs) and continuous ranked probability scores (CRPS, in parentheses) relative to the Minnesota VAR.

The results of our forecasting exercise are shown in Table 2. Our forecasting horse race draws a rich picture on relative model performance. We consider models of different sizes, priors, assumptions on the conditional mean and forecast horizons. For all these specifications, we provide a table that summarizes forecasting performance across the three focus variables and for point and density forecasts. Starting with the one-step-ahead horizon and point forecasts, we observe that for the interest rate and unemployment, our BAVART setup (irrespective of model size) is dominated by linear VAR models in terms of MSFEs. More precisely, the small-scale SSVS (for interest rates) and TVP-VARs (for unemployment) yield the lowest prediction errors. But notice that most of the ratios are close to one, indicating that the BAVART is never far off and thus highly competitive. Inflation appears to be an exception. Across all model sizes considered, we find that our model generally yields the smallest forecast errors. This suggests that controlling for non-linearities pays off when interest centers on forecasting inflation.

Turning to higher order forecasts provides a relatively similar picture. For longer-run forecasts, we again identify models which improve upon the BAVART setup for interest rate and unemployment forecasts. These improvements, however, are again quite small. One striking difference between the one-step- and the three-steps-ahead case is the strong performance of the Minnesota VAR for larger information sets and especially for unemployment. The results for inflation are again consistent with the findings for the one-step-ahead forecasts but appear to be more pronounced. Across model sizes, BAVART produces multi-step-ahead inflation predictions which feature the lowest forecast errors across all models considered.

Considering only point forecasts implies that we do not factor in how well a given model predicts higher order moments of the predictive distribution. We evaluate density forecasts using CRPSs. These are a generalization of the MSFE and consider the qualities of the full predictive distribution under scrutiny. Similarly to the one-step-ahead point forecasts, we find that most values are below but close to unity. This suggests that using the BAVART model improves upon the Minnesota benchmark in most cases. When compared to the other models, we can again identify a model that improves upon BAVART for interest rates and unemployment. For inflation, the CRPSs corroborate the findings for the point forecasts. For all information sets, the BAVART model produces the most precise predictive densities.

When we consider three-months-ahead predictive densities, the pattern observed above remains mainly in place. While there are models which yield more precise density forecasts than our proposed setup, these differences are often very small. By contrast, our model dominates when we consider higher-order inflation predictions. In this case, the performance gains vis-à-vis the competing models are appreciable.

To sum up, our forecasting exercise shows that the BAVART model works remarkably well for forecasting Eurozone inflation. For unemployment and interest rates, our proposed specification is slightly outperformed but both point and density forecasts are still competitive.
In this section we briefly illustrate how the BAVART model can be used to carry out structural inference. Since our model is dynamic, we can consider how a shock at time t affects the quantities in the system over time. Because of the highly non-linear nature of our model, we resort to structural generalized impulse response (SGIRF) functions (see Koop et al., 1996).

Let ϵ_t denote the M structural shocks and let s_j denote an M × 1 selection vector with a one in its j-th position. Hence, ϵ_jt = s'_j ϵ_t yields the j-th structural shock. The SGIRF to the j-th structural shock is then defined as the difference between a forecast that assumes ϵ_jt = 1 (while setting the other shocks to zero) and the unconditional forecast (i.e. with ϵ_t = 0_M). The posterior distribution of the SGIRFs is computed by repeating this procedure during MCMC sampling.

Figure 3 shows the dynamic responses to a one standard deviation uncertainty shock. The dashed gray lines represent 16th and 84th posterior credible intervals. For brevity, we only report impulse responses that are significant (i.e. the credible intervals do not cover zero) for at least one month across the impulse response horizon.

Figure 3: Dynamic responses of selected macroeconomic quantities to an uncertainty shock (panels: Unc, C OV, DJE50, XCONS). The dashed gray lines represent 16th and 84th posterior credible intervals.

The responses to an uncertainty shock largely mirror the ones reported elsewhere in the literature (see, among many others, Bloom, 2009; Jurado et al., 2015). The uncertainty index displays an immediate increase that fades out after the first five months. By contrast, we observe that inflation falls due to decreasing demand that drives down prices. Notice that prices react with a lag of one month. Since the uncertainty index is ordered first in y_t, this is purely driven by our shrinkage prior on the covariance parameters. This sluggish reaction of prices to uncertainty shocks reflects the fact that prices tend to be sticky and it takes some time for companies to adjust due to menu costs.

Turning to stock market reactions shows that equities display a rather short lived decline that fades out after around five months. Finally, real activity measured through industrial production also decreases. Interestingly, we do not observe a rebound in real activity arising from a "wait-and-see" mechanism reported in, e.g., Bloom (2009).

To sum up, this brief structural exercise shows that BART is capable of producing dynamic responses that are consistent with the ones commonly found in the literature.
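A minimal sketch of the SGIRF computation described above, under simplifying assumptions: for one retained MCMC draw, the path obtained when ϵ_{j,t} = 1 (other structural shocks set to zero) is compared with the unconditional path, and the difference is the impulse response. The one-step mean function is a placeholder for the BAVART prediction step, all numerical values and names are ours, and averaging the result over MCMC draws would give the posterior distribution of the responses.

```python
import numpy as np

def forecast_path(y_T, structural_shock, horizon, A_inv, mean_fn):
    """Iterate the model forward, feeding the shock into the first period only."""
    path, y_prev = [], y_T.copy()
    for h in range(horizon):
        eps = A_inv @ structural_shock if h == 0 else np.zeros_like(structural_shock)
        y_next = mean_fn(y_prev) + eps      # conditional mean plus (mapped) structural shock
        path.append(y_next)
        y_prev = y_next
    return np.array(path)

M, horizon, j = 3, 12, 0                    # shock to the first variable (uncertainty)
A = np.array([[1.0, 0.0, 0.0], [0.4, 1.0, 0.0], [-0.2, 0.3, 1.0]])
A_inv = np.linalg.inv(A)
mean_fn = lambda y: 0.8 * y                 # placeholder for the tree-based one-step-ahead mean
y_T = np.array([0.1, -0.2, 0.05])

s_j = np.zeros(M); s_j[j] = 1.0             # selection vector picking the j-th structural shock
sgirf = forecast_path(y_T, s_j, horizon, A_inv, mean_fn) \
      - forecast_path(y_T, np.zeros(M), horizon, A_inv, mean_fn)
```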
Conclusions

VAR models assume that the lagged dependent variables influence the contemporaneous values in a linear fashion. In this paper, we relax this assumption by blending the literature on Bayesian additive regression tree (BART) models and VARs. The resulting model, labeled the Bayesian additive vector autoregressive tree (BAVART) model, can handle non-linear relations between the endogenous and the exogenous variables. Our proposed model is, moreover, capable of handling stochastic volatility in the shocks. This feature is crucial for obtaining precise density forecasts. We illustrate our model using synthetic as well as real data. Using data on the Eurozone shows that the BAVART model is capable of producing precise forecasts and that it yields reasonable impulse responses to an uncertainty shock.
References
Bassetti, F., Casarin, R., and Leisen, F. (2014). Beta-product dependent Pitman–Yor processes for Bayesian inference. Journal of Econometrics, 180(1):49–72.

Billio, M., Casarin, R., and Rossini, L. (2019). Bayesian nonparametric sparse VAR models. Journal of Econometrics, 212(1):97–115.

Bitto, A. and Frühwirth-Schnatter, S. (2019). Achieving shrinkage in a time-varying parameter model framework. Journal of Econometrics, 210(1):75–97.

Bloom, N. (2009). The impact of uncertainty shocks. Econometrica, 77(3):623–685.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Carriero, A., Clark, T. E., and Marcellino, M. (2019). Large Bayesian vector autoregressions with stochastic volatility and non-conjugate priors. Journal of Econometrics, 212(1):137–154.

Chan, J. C. C., Eisenstat, E., and Strachan, R. W. (2020). Reducing the state space dimension in a large TVP-VAR. Journal of Econometrics.

Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2002). Bayesian treed models. Machine Learning, 48(1):299–320.

Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics, 4(1):266–298.

Clark, T. E. (2011). Real-time density forecasts from Bayesian vector autoregressions with stochastic volatility. Journal of Business & Economic Statistics, 29(3):327–341.

Clark, T. E. and Ravazzolo, F. (2015). Macroeconomic forecasting performance under alternative specifications of time-varying volatility. Journal of Applied Econometrics, 30(4):551–575.

Doan, T., Litterman, R., and Sims, C. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews, 3(1):1–100.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5):1189–1232.

George, E. I., Sun, D., and Ni, S. (2008). Bayesian stochastic search for VAR model restrictions. Journal of Econometrics, 142(1):553–580.

Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130.

Granger, C. W. and Terasvirta, T. (1993). Modelling Non-Linear Economic Relationships. Oxford University Press, Oxford.

Griffin, J. E. and Brown, P. J. (2010). Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188.

Hill, J., Linero, A., and Murray, J. (2020). Bayesian additive regression trees: A review and look forward. Annual Review of Statistics and Its Application, 7(1):251–278.

Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G. (2010). Bayesian Nonparametrics. Cambridge University Press, Cambridge.

Huber, F. and Feldkircher, M. (2019). Adaptive shrinkage in Bayesian vector autoregressive models. Journal of Business & Economic Statistics, 37(1):27–39.

Huber, F., Kastner, G., and Feldkircher, M. (2019). Should I stay or should I go? A latent threshold approach to large-scale mixture innovation models. Journal of Applied Econometrics, 34(5):621–640.

Huber, F., Koop, G., and Onorante, L. (2020). Inducing sparsity and shrinkage in time-varying parameter models. Journal of Business & Economic Statistics, pages 1–15.

Jurado, K., Ludvigson, S. C., and Ng, S. (2015). Measuring uncertainty. American Economic Review, 105(3):1177–1216.

Kalli, M. and Griffin, J. E. (2014). Time-varying sparsity in dynamic regression models. Journal of Econometrics, 178(2):779–793.

Kalli, M. and Griffin, J. E. (2018). Bayesian nonparametric vector autoregressive models. Journal of Econometrics, 203(2):267–282.

Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408–423.

Kastner, G. and Huber, F. (2020). Sparse Bayesian vector autoregressions in huge dimensions. Journal of Forecasting.

Koop, G., Korobilis, D., and Pettenuzzo, D. (2019). Bayesian compressed vector autoregressions. Journal of Econometrics, 210(1):135–154.

Koop, G., Pesaran, M. H., and Potter, S. M. (1996). Impulse response analysis in nonlinear multivariate models. Journal of Econometrics, 74(1):119–147.

Lakshminarayanan, B., Roy, D. M., and Teh, Y. W. (2014). Mondrian forests: Efficient online random forests. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3140–3148. Curran Associates, Inc.

Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. Journal of the American Statistical Association, 113(522):626–636.

Litterman, R. B. (1986). Forecasting with Bayesian vector autoregressions – five years of experience. Journal of Business & Economic Statistics, 4(1):25–38.

Makalic, E. and Schmidt, D. F. (2015). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182.

Pratola, M. T., Chipman, H. A., George, E. I., and McCulloch, R. E. (2019). Heteroscedastic BART via multiplicative regression trees. Journal of Computational and Graphical Statistics, pages 1–13.

Primiceri, G. E. (2005). Time varying structural vector autoregressions and monetary policy. The Review of Economic Studies, 72(3):821–852.

Roy, D. M. and Teh, Y. W. (2009). The Mondrian process. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 1377–1384. Curran Associates, Inc.

Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48(1):1–48.

Sims, C. A. and Zha, T. (1998). Bayesian methods for dynamic multivariate models. International Economic Review, 39(4):949–968.

Posterior Approximation of the Trees
In this appendix, we provide details on the MCMC algorithm used to simulate from the joint posterior distribution of the trees and the terminal node parameters. The remaining quantities take well known forms and are very easy to simulate using textbook results for the Gaussian linear regression model.

As stated in Section 3.3 and following Carriero et al. (2019), Koop et al. (2019) and Kastner and Huber (2020), we rely on a structural representation of the model that entails equation-by-equation estimation.

We simulate from the conditional posterior of (T_jk, m_jk) by conditioning on all trees except for the k-th tree. This is achieved by using the Bayesian backfitting strategy discussed in Chipman et al. (2010). More precisely, the likelihood function depends on (T_jk, m_jk) through the partial residuals

$$R_{jk} = y_{\bullet j} - \sum_{l=1}^{j-1} a_{jl} y_{\bullet l} - \sum_{s \neq k} g_{js}(X; T_{js}, m_{js}).$$

The algorithm proposed in Chipman et al. (2010) then proceeds by integrating out m_jk and then drawing T_jk with the Metropolis-Hastings (MH) algorithm detailed in Chipman et al. (1998). This step is implemented by generating a candidate tree T*_jk from a proposal distribution q(T_jk, T*_jk) and then accepting the proposed value with probability:

$$\alpha(T_{jk}, T^{*}_{jk}) = \min\left\{1,\; \frac{q(T^{*}_{jk}, T_{jk})}{q(T_{jk}, T^{*}_{jk})}\, \frac{p(R_{jk} \mid X, T^{*}_{jk}, M_j)}{p(R_{jk} \mid X, T_{jk}, M_j)}\, \frac{p(T^{*}_{jk})}{p(T_{jk})}\right\}.$$

The first term is the ratio of the probability of how the previous tree moves to the new tree against the probability of how the new tree moves to the previous one. The second term is the likelihood ratio of the new tree against the previous tree and the last term is the ratio of the prior probabilities of the new and the previous tree.

The proposal distribution q(T_jk, T*_jk) features four local steps (Chipman et al., 1998). The first step grows the tree by splitting a node into two different nodes. The second step combines two non-terminal nodes into a single terminal node. The third step swaps splitting rules between two non-terminal nodes and the final step changes the splitting rule of a single non-terminal node.

After sampling the tree structure we can easily simulate from the posterior distribution of the terminal nodes. Since the prior is conjugate, the resulting posterior will also be a Gaussian distribution that takes a standard form and does not explicitly depend on the tree structure but only on the subset of elements in the residuals allocated to a specific terminal node.
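To fix ideas on the backfitting step, the following sketch computes the partial residual R_{jk} for one tree of one equation: the linear contemporaneous part and the fits of all other trees are subtracted from y_{•j}, and only this residual enters the Metropolis-Hastings update of (T_{jk}, m_{jk}). Tree fits are represented by plain arrays and the function name is ours; everything stands in for the evaluated step functions produced by the sampler.

```python
import numpy as np

rng = np.random.default_rng(3)
T, M, N = 100, 3, 10                        # observations, equations, trees per equation

Y = rng.normal(size=(T, M))
tree_fits = rng.normal(scale=0.1, size=(M, N, T))   # g_js(X; T_js, m_js) evaluated at the data
A = np.eye(M)
A[1, 0], A[2, 0], A[2, 1] = 0.3, -0.1, 0.2          # free elements of the lower triangular A

def partial_residual(j, k):
    """R_jk: what remains of y_j after the linear part and all trees except tree k."""
    linear_part = Y[:, :j] @ (-A[j, :j])             # a_jl * y_l, with a_jl the (j, l) element of -A
    other_trees = tree_fits[j].sum(axis=0) - tree_fits[j, k]
    return Y[:, j] - linear_part - other_trees

R = partial_residual(j=1, k=0)     # residual used when updating the first tree of equation 2
```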