Bayesian Inference in High-Dimensional Time-varying Parameter Models using Integrated Rotated Gaussian Approximations
FLORIAN HUBER* (University of Salzburg), GARY KOOP (University of Strathclyde), and MICHAEL PFARRHOFER (University of Salzburg)
February 25, 2020
Abstract
Researchers increasingly wish to estimate time-varying parameter (TVP) regressions which involve a large number of explanatory variables. Including prior information to mitigate over-parameterization concerns has led to many using Bayesian methods. However, Bayesian Markov Chain Monte Carlo (MCMC) methods can be very computationally demanding. In this paper, we develop computationally efficient Bayesian methods for estimating TVP models using an integrated rotated Gaussian approximation (IRGA). This exploits the fact that, whereas constant coefficients on regressors are often important, most of the TVPs are often unimportant. Since Gaussian distributions are invariant to rotations, we can split the posterior into two parts: one involving the constant coefficients, the other involving the TVPs. Approximate methods are used on the latter and, conditional on these, the former are estimated with precision using MCMC methods. In empirical exercises involving artificial data and a large macroeconomic data set, we show the accuracy and computational benefits of IRGA methods.
Keywords:
Time-varying parameter regression, Bayesian, Gaussian approximation, macroeconomic forecasting
JEL Codes:
C11, C30, E3, D31

*Corresponding author: Florian Huber. Salzburg Centre of European Union Studies, University of Salzburg. Address: Mönchsberg 2a, 5020 Salzburg, Austria. Email: fl[email protected]. Florian Huber and Michael Pfarrhofer gratefully acknowledge financial support from the Austrian Science Fund (FWF, grant no. ZK 35).

1 Introduction
There has been an explosion of interest in carrying out structural analysis and forecasting with time-varying parameter (TVP) regressions and Vector Autoregressions (TVP-VARs) in recent years, with many of the papers employing Bayesian methods. An incomplete survey of significant recent Bayesian contributions includes Cogley and Sargent (2005), Primiceri (2005), Chan et al. (2012), Dangl and Halling (2012), Koop and Korobilis (2012), Koop and Korobilis (2013), D'Agostino et al. (2013), Groen et al. (2013), Nakajima and West (2013), Belmonte et al. (2014), Kalli and Griffin (2014), Feldkircher et al. (2017), Kowal et al. (2017), Uribe and Lopes (2017), Koop and Korobilis (2018), Ročková and McAlinn (2018), Bitto and Frühwirth-Schnatter (2019), Hauzenberger et al. (2019), Korobilis (2019), Paul (2019) and Huber et al. (forthcoming).

There are two issues in this literature that we wish to highlight as they form the basis of the contribution of our paper. The first is that Bayesian computation using Markov Chain Monte Carlo (MCMC) methods becomes increasingly computationally burdensome as the number of explanatory variables in the regression increases. Indeed, the contribution of much of the Big Data side of this literature lies in the development of computationally efficient approximate methods which do not involve MCMC.

The second issue is that the need for TVP methods seems to decrease as the number of explanatory variables increases. Many of the early contributions used a small number of regressors or involved low dimensional TVP-VARs and found strong evidence for the benefits of time-variation in coefficients. Indeed, the unobserved components stochastic volatility (UC-SV) model of Stock and Watson (2007) has had great success and it is the simplest of TVP regressions as it involves only an intercept. However, with more regressors, the benefits of allowing for time-variation in parameters have typically been found to be small.
A potential explanation for this finding is that, in TVP models with a small number of variables, the TVP aspect is controlling for omitted variables. When including a large number of explanatory variables, this factor is mitigated as the extra variables provide more heterogeneity that can be fit by the regression, leaving less of a role for the TVPs.

Accordingly, there may be many data sets where the researcher wishes to work with a TVP regression with many explanatory variables, but the computational burden using MCMC methods is prohibitive. This computational burden largely arises due to the TVP aspect of the model, and the researcher suspects that it is likely that most of the TVP part of the model is unnecessary. One alternative, of course, is to work with a constant coefficient regression model. But this risks missing small amounts of parameter change, perhaps only in a few parameters or in a few time periods, which are genuinely important. The methods developed in the present paper are designed for such a case. They are inspired by papers such as van den Boom et al. (2019) who use an integrated rotated Gaussian approximation (IRGA) which breaks the parameters in the model into two blocks. The first is for a low-dimensional set of regression coefficients which are of primary interest. The second is for a high-dimensional set of nuisance parameters. Fast and approximate methods are used with the latter and, conditional on the approximation obtained for the nuisance parameters, more accurate methods are used for the coefficients of interest. In our case, the nuisance parameters are the TVPs.

To be precise, we exploit the rotation-invariance property of the Gaussian distribution to split the model into two parts: one involving constant coefficients and one involving the TVPs. The first involves the explanatory variables with constant coefficients and a low-dimensional function of the TVPs. For this block we use MCMC methods (conditional on the approximation we use for the TVPs).
The second block depends on the high-dimensional TVPs and for this we use a computationally fast approximation. In particular, we use the computationally fast vector approximate message passing (VAMP) methods of Rangan et al. (2019).¹

Papers such as van den Boom et al. (2019) focus on parameter estimation of the coefficients on the primary variables of interest as opposed to forecasting. We show that these methods can be extended to allow for efficient forecasting within a large TVP-VAR. Previous work with IRGAs assumed homoskedastic errors. An additional contribution of our paper lies in the development of methods for adding stochastic volatility (SV) to TVP regressions in the context of the IRGA approximation.

Using artificial data, we demonstrate the computational scaleability and accuracy of IRGA methods. Moreover, employing a large US macroeconomic data set, we find IRGA estimation of TVP-VARs leads to good forecast performance in a computationally efficient manner. Finally, we use our large macroeconomic data set to investigate the effect of uncertainty shocks on the economy in a range of TVP-VARs of varying dimension. We find that evidence of time-variation in parameters decreases as the number of variables in the TVP-VAR increases. Our methods can discover this in a data-based fashion in a manner in which neither a constant coefficient VAR nor a conventional TVP-VAR which allows for random walk behavior in all coefficients could.

The remainder of the paper is structured as follows: Section 2 discusses the static form of a TVP regression, provides details on the prior setup and shows how IRGA can be used to approximate the TVPs.

¹ This statement holds true for time-variation in regression coefficients. Allowing for stochastic volatility or other types of time-variation in the error variance is typically more important and leads to forecasting gains even in large models. In our macroeconomic forecasting exercise, every model allows for stochastic volatility.
In Section 3, we use simulated data to illustrate the merits of the IRGA approach, while Section 4 shows that our model also works well for a prominent US macroeconomic dataset. Finally, the last section summarizes and concludes the paper.
We write the TVP regression model as:

\[ y = X\beta + Z\gamma + \varepsilon, \quad (1) \]

where $y$ denotes a $T$-dimensional vector containing the dependent variable with typical element $y_t$ for $t = 1, \ldots, T$, $X$ is a $T \times K$-dimensional matrix of explanatory variables with typical $t$th row given by $x_t'$, and $\beta = (\beta_1, \ldots, \beta_K)'$ represents a $K \times 1$ vector of constant coefficients. The $TK$-dimensional vector of TVPs $\gamma = (\gamma_1', \ldots, \gamma_T')'$ is put in $Z$, which is a $T \times (TK)$ block-diagonal matrix defined by:

\[
Z = \begin{pmatrix}
x_1' & 0_{1 \times K} & \cdots & 0_{1 \times K} \\
0_{1 \times K} & x_2' & \cdots & 0_{1 \times K} \\
\vdots & \vdots & \ddots & \vdots \\
0_{1 \times K} & 0_{1 \times K} & \cdots & x_T'
\end{pmatrix}.
\]

For simplicity of exposition, for now we assume homoskedastic errors and, thus, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2 I_T)$. The addition of stochastic volatility is discussed in Sub-section 2.4.

Note that the regression coefficients are now divided into $\beta$, which are conventional constant coefficients, and $\gamma$, which are the TVPs. At this stage, our treatment assumes the latter to be completely unrestricted. For instance, we are not restricting them to follow random walk processes as is commonly done with TVP regressions. This specification implies that the coefficient on each explanatory variable is $\beta_{it} = \beta_i + \gamma_{it}$, with $\gamma_{it}$ denoting the $i$th element of $\gamma_t$.

It is also worth noting that the methods we develop can be easily extended to the TVP-VAR if it is written in equation-by-equation form (see, e.g., Kastner and Huber, 2017; Carriero et al., forthcoming; Koop et al., 2019; Huber et al., forthcoming). In such a case, the TVP-VAR in $N$ variables turns into $N$ TVP regressions. Appendix A shows how this can be done. In the subsequent discussion, we simplify notation by assuming that $N = 1$. For the empirical application, all hyperparameters are specified symmetrically across equations so that regression coefficients and error covariances are treated in the same way in each equation.²

In the case where $K$ is large, it is empirically probable that $\beta$ is much more important (i.e. having some non-zero elements) than $\gamma$, most of whose elements are likely to be zero. This consideration motivates both our prior choice and our choice of devoting most of the computational effort to the former rather than the latter.

² In the following discussion and without loss of generality, we normalize $y$ and the regressors to have unit variance.
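To fix ideas, the block-diagonal $Z$ from (1) can be assembled directly; the following is a minimal NumPy sketch (all data and names here are illustrative, not from the paper):

```python
import numpy as np

def build_Z(X):
    """Stack the rows of X into the T x (T*K) block-diagonal matrix Z,
    so that row t of Z @ gamma equals x_t' gamma_t."""
    T, K = X.shape
    Z = np.zeros((T, T * K))
    for t in range(T):
        Z[t, t * K:(t + 1) * K] = X[t]
    return Z

rng = np.random.default_rng(0)
T, K = 5, 3
X = rng.standard_normal((T, K))
gamma = rng.standard_normal(T * K)   # stacked (gamma_1', ..., gamma_T')'
Z = build_Z(X)

# Each row of Z picks out the period-t coefficients:
direct = np.array([X[t] @ gamma[t * K:(t + 1) * K] for t in range(T)])
assert np.allclose(Z @ gamma, direct)
```

For realistic $T$, a sparse representation (e.g. `scipy.sparse.block_diag`) avoids storing the many zeros of $Z$ explicitly.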
In particular, we obtain an approximation of the posterior for $\gamma$ using computationally fast techniques, then sample from the posterior of $\beta$ using conventional MCMC methods conditional on this approximation. Posterior analysis is discussed in the following sub-section; in this sub-section, we discuss the prior.

The methods developed in this paper will work for any prior which has a Gaussian hierarchical structure. So, for instance, any of the global-local shrinkage priors which involve Gaussianity at the first stage of the prior hierarchy can be used. See, for instance, Bhattacharya et al. (2015) for a discussion of some global-local shrinkage priors and their properties.

For the constant coefficients, $\beta$, we use the popular Normal-Gamma (NG) shrinkage prior of Griffin and Brown (2010):

\[ p(\beta) = \prod_{j=1}^{K} \mathcal{N}(\beta_j \mid 0, \tau_j), \quad \tau_j \sim \mathcal{G}(\vartheta, \vartheta\lambda/2), \quad \lambda \sim \mathcal{G}(d_0, d_1), \]

with $\vartheta$ being a hyperparameter that controls the excess kurtosis of the marginal prior (obtained by integrating out the local scaling parameters $\tau_j$). $\lambda$ is a global shrinkage parameter that pushes all elements in $\beta$ to zero. It follows a Gamma distribution with parameters $d_0$ and $d_1$ a priori. In the empirical work, we follow much of the literature in setting $\vartheta$, $d_0$ and $d_1$ to small values.

For the TVPs $\gamma$, we rely on two commonly used shrinkage priors. The first one is a Spike & Slab (S&S) prior implemented as in Mitchell and Beauchamp (1988). This is a mixture of two distributions, one (the Spike) with a point mass at zero and the other (the Slab) a zero mean Gaussian prior with variance equal to $\psi$. Formally, this prior is given by:

\[ p(\gamma) = \prod_{j=1}^{TK} \left[ q \, \mathcal{N}(0, \psi) + (1 - q) \, \delta(0) \right]. \]

$q$ denotes the prior probability of observing a non-zero element in $\gamma$, i.e. $q = \mathrm{Prob}(\gamma_j \neq 0)$, and $\delta(0)$ is a point mass at zero. In all applications, we set $q$ equal to 0.5. For the prior hyperparameter $\psi$, we consider a range of values as discussed in the empirical section of this paper. Since, as noted above, our data is standardized, the assumption of a single prior hyperparameter, $\psi$, is a reasonable one.

The second prior is the sparse Bayesian learning (SBL) specification originally proposed in Tipping (2001) and subsequently adopted in Zou et al. (2016) and Korobilis (2019). This prior assumes that

\[ p(\gamma) = \prod_{j=1}^{TK} \mathcal{N}(0, \psi_j), \quad \psi_j^{-1} \sim \mathcal{G}(a_\psi, b_\psi), \]

with $a_\psi = 1$ and $b_\psi$ set to a small value, with the specific choices motivated in Fang et al. (2016). This conditionally Gaussian prior has been shown to possess excellent shrinkage properties in many contexts (see, e.g., Korobilis, 2019).

To explain these prior choices, remember that the (low dimensional) constant coefficients are more likely to be non-zero than the (high dimensional) TVPs. For the constant coefficients, we use a popular global-local shrinkage prior which has been found to work well in many empirical applications. It allows for unimportant coefficients to be shrunk to be close to zero (not precisely zero). The Spike & Slab prior, by contrast, allows for unimportant coefficients to be precisely zero. Huber et al. (forthcoming) demonstrate the importance of shrinking coefficients to be precisely zero in high dimensional cases. That is, if the number of coefficients that equal zero is large, then a global-local shrinkage prior can successfully shrink them to be near to zero, but not exactly zero. The small amount of estimation error in each of a large number of coefficients can accumulate and cause forecast performance to deteriorate. This is why we use the Spike & Slab prior for the high dimensional vector of coefficients, $\gamma$. Bayesians using MCMC methods often avoid the use of the Spike & Slab prior since it can suffer from poor MCMC mixing.
However, since we are not using MCMC methods for $\gamma$, this difficulty of the Spike & Slab prior is not relevant for us.

As opposed to the Spike & Slab prior, which only allows for discriminating between the case that a coefficient is time-varying or constant at a given point in time, the SBL prior allows for cases in between and thus might capture situations where the coefficient is constant, mildly time-varying or strongly time-varying at certain points in time.

Finally, we use a weakly informative inverted Gamma prior on the error variance, $\sigma_\varepsilon^2$, with small hyperparameters.

Following van den Boom et al. (2019), we use IRGAs to separately model the TVPs and the constant coefficients. This approach exploits properties of the QR decomposition of $X$. The QR decomposition says we can write the $T \times K$ matrix $X = QR$, where $Q$ is a $T \times T$ orthogonal matrix and $R$ is a $T \times K$ upper triangular matrix with the final $T - K$ rows being zero. We partition $Q = (Q_1, Q_2)$ where $Q_1$ is $T \times K$ and $Q_2$ is $T \times (T - K)$. A key property of the QR decomposition, which we will rely on below, is $Q_2' X = 0_{(T-K) \times K}$. Note also that these results for the QR decomposition assume $T \geq K$. We will discuss the $T < K$ case in a subsequent sub-section.

The TVP regression model given in (1) implies the following distribution for $y$:

\[ y \sim \mathcal{N}(X\beta + Z\gamma, \; \sigma_\varepsilon^2 I_T). \quad (2) \]

Instead of working directly with the likelihood defined by (2), the IRGA approach exploits the fact that Gaussian distributions are invariant to rotations. It is based on the following two distributions:

\[ Q_1' y \sim \mathcal{N}(Q_1' X \beta + Q_1' Z \gamma, \; \sigma_\varepsilon^2 I_K), \quad (3) \]

and

\[ Q_2' y \sim \mathcal{N}(Q_2' Z \gamma, \; \sigma_\varepsilon^2 I_{T-K}). \quad (4) \]

Note that, since $Q_2' X = 0_{(T-K) \times K}$, $\beta$ does not appear in the second (high dimensional) relationship and, thus, if $\gamma$ were known, posterior inference for $\beta$ can be based on the first (low dimensional) relationship.
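The rotation step is easy to check numerically. A small sketch with simulated data (names ours) confirming the property $Q_2'X = 0$ that removes $\beta$ from (4):

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 40, 4
X = rng.standard_normal((T, K))

# Full QR decomposition: X = Q R with Q a T x T orthogonal matrix.
Q, R = np.linalg.qr(X, mode="complete")
Q1, Q2 = Q[:, :K], Q[:, K:]

# Q2'X = 0, so beta drops out of the rotated high-dimensional relationship...
assert np.allclose(Q2.T @ X, 0.0, atol=1e-8)
# ...and Q is orthogonal, so the rotated errors remain i.i.d. Gaussian.
assert np.allclose(Q.T @ Q, np.eye(T), atol=1e-8)
```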
This form motivates a two stage approach where a computationally fast approximation is first used to produce an approximate posterior for $\gamma$ (as well as any hyperparameters associated with the prior on $\gamma$) based on (4), and subsequently posterior inference on $\beta$, conditional on this, is done using a likelihood based on (3).

To be precise, the first stage involves obtaining a Gaussian approximation to the posterior of $\gamma$ using a regression with $Q_2' y$ as the dependent variable:

\[ \hat{p}(\gamma \mid Q_2' y, Q_2' Z) = \mathcal{N}(\mu_\gamma, V_\gamma). \]

$\hat{p}(\gamma \mid Q_2' y, Q_2' Z)$ can be obtained using any sort of Gaussian approximation. In this paper we use VAMP. Message passing algorithms have been used successfully in TVP regressions in Korobilis (2019). Note that these algorithms are particularly likely to be accurate when the correlations between explanatory variables are low. Since the form of our $Z$ matrix implies blocks of variables for different time periods are uncorrelated with one another, message passing algorithms seem appropriate for our case. VAMP, originally proposed in Rangan et al. (2019), differs from standard approximate message passing (AMP) algorithms by using a particularly simple (i.e. scalar-valued) state evolution equation to update the posterior for $\gamma$. More importantly, VAMP is more widely applicable relative to AMP in the sense that VAMP can be applied to a larger class of design matrices (Rangan et al., 2019). For a detailed implementation of the VAMP algorithm that includes information on how we estimate $\gamma$ and the posterior inclusion probabilities, see the technical appendix of van den Boom et al. (2019).

To update $\psi_j$ under the SBL prior, we follow Zou et al. (2016) and Korobilis (2019) and update $\psi_j$ within VAMP as follows:

\[ \psi_j^{(n+1)} = \frac{\left(\hat{\gamma}_j^{(n)}\right)^2 + 2 b_\psi}{2 a_\psi - 1}. \]
Here, we let $\hat{\gamma}_j^{(n)}$ denote the estimate of the $j$th element of $\gamma$ at iteration $n$ of the VAMP algorithm.

Plugging in the values of $\mu_\gamma$ and $V_\gamma$ produced by the VAMP algorithm in the first stage, the approximate posterior of $\beta$ in the second stage is based on:

\[ \hat{p}(\beta \mid y) \propto p(\beta) \times \mathcal{N}\!\left( Q_1' y \mid Q_1' X \beta + Q_1' Z \mu_\gamma, \; \sigma_\varepsilon^2 I_K + (Q_1' Z) V_\gamma (Q_1' Z)' \right), \]

where $\Omega_Q = \sigma_\varepsilon^2 I_K + (Q_1' Z) V_\gamma (Q_1' Z)'$ is a full error variance-covariance matrix with its lower Cholesky factor given by $P_Q$, and $p(\beta)$ is the prior. As noted above, this prior can take any form, but if it is Gaussian (or of a hierarchical form with the first level being Gaussian), the Gaussian prior and Gaussian likelihood will combine to produce a Gaussian posterior (conditional on the parameters in the hierarchical prior). This Gaussianity property, combined with the fact that $\beta$ is only $K$ dimensional, implies computation is fast. Textbook results for the linear regression model with dependent variable $\tilde{y} = P_Q^{-1}(Q_1' y - Q_1' Z \mu_\gamma)$ and explanatory variables $\tilde{X} = P_Q^{-1}(Q_1' X)$ provide the arguments in this posterior (or conditional posterior if, as we do, a hierarchical shrinkage prior is used). More precisely, this posterior is given by:

\[ \beta \mid y \sim \mathcal{N}(\mu_\beta, V_\beta) \quad (5) \]

with posterior moments:

\[ V_\beta = (\tilde{X}' \tilde{X} + \Lambda^{-1})^{-1}, \qquad \mu_\beta = V_\beta (\tilde{X}' \tilde{y}). \]

We let $\Lambda = \mathrm{diag}(\tau_1, \ldots, \tau_K)$ denote a diagonal prior variance-covariance matrix.

As noted above, we use the familiar shrinkage prior of Griffin and Brown (2010). With this choice, the MCMC algorithm involves adding, on to textbook MCMC for the Gaussian linear regression model with Gaussian prior, methods for drawing the shrinkage parameters $\tau_j$, for $j = 1, \ldots, K$, and $\lambda$, which appear in the NG prior.
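Conditional on first-stage output $(\mu_\gamma, V_\gamma)$, the moments in (5) follow from standard whitening algebra. A sketch with simulated inputs (all names and values ours), cross-checked against the equivalent GLS form:

```python
import numpy as np

def beta_posterior(Q1y, Q1X, Q1Z_mu_gamma, Omega_Q, tau):
    """Moments of beta | y ~ N(mu_beta, V_beta): whiten with the lower
    Cholesky factor P_Q of Omega_Q, then apply textbook Gaussian linear
    regression results with prior variances tau (Lambda = diag(tau))."""
    P_Q = np.linalg.cholesky(Omega_Q)
    y_tilde = np.linalg.solve(P_Q, Q1y - Q1Z_mu_gamma)
    X_tilde = np.linalg.solve(P_Q, Q1X)
    V_beta = np.linalg.inv(X_tilde.T @ X_tilde + np.diag(1.0 / tau))
    mu_beta = V_beta @ (X_tilde.T @ y_tilde)
    return mu_beta, V_beta

rng = np.random.default_rng(2)
K = 4
Q1y = rng.standard_normal(K)
Q1X = rng.standard_normal((K, K))
Q1Z_mu_gamma = rng.standard_normal(K)
A = rng.standard_normal((K, K))
Omega_Q = A @ A.T + K * np.eye(K)        # a valid (SPD) covariance matrix
tau = np.ones(K)                         # prior variances

mu_beta, V_beta = beta_posterior(Q1y, Q1X, Q1Z_mu_gamma, Omega_Q, tau)

# Equivalent GLS form without whitening:
Oinv = np.linalg.inv(Omega_Q)
V_direct = np.linalg.inv(Q1X.T @ Oinv @ Q1X + np.diag(1.0 / tau))
assert np.allclose(V_beta, V_direct)
assert np.allclose(mu_beta, V_direct @ Q1X.T @ Oinv @ (Q1y - Q1Z_mu_gamma))
```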
$\tau_j$ is simulated from its generalized inverted Gaussian (GIG) full conditional posterior distribution:

\[ \tau_j \mid \beta_j, \lambda \sim \mathcal{GIG}\left(\vartheta - 1/2, \; \beta_j^2, \; \lambda\vartheta\right), \quad (6) \]

and $\lambda$ has a Gamma-distributed conditional posterior distribution:

\[ \lambda \mid \tau_1, \ldots, \tau_K \sim \mathcal{G}\left( d_0 + K\vartheta, \; d_1 + \frac{\vartheta}{2} \sum_{j=1}^{K} \tau_j \right). \quad (7) \]

It is noteworthy that the error variance-covariance $\Omega_Q$ is a full matrix but, after multiplying from the left with the inverse of its lower Cholesky factor, $P_Q^{-1}$, we obtain a linear regression model with standard normally distributed errors. We simulate from the GIG distribution using the R package GIGrvg and the corresponding parameterization.

In principle, the error variance $\sigma_\varepsilon^2$ can be estimated using either (3) or (4); van den Boom et al. (2019) use the regression in (4) to estimate $\sigma_\varepsilon^2$. This implies that only $(T - K)$ observations are used to inform the posterior estimates of $\sigma_\varepsilon^2$. However, in the case where $T \gg K$ (which is often the case for monthly data), this is a sensible strategy since we would expect that the second-stage regression contains significant information on the error variance.
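The draws in (6) and (7) are cheap. The paper uses the R package GIGrvg; the following is an equivalent sketch via SciPy's `geninvgauss`, assuming the GIG$(p, \chi, \psi)$ density $\propto x^{p-1}\exp\{-(\chi/x + \psi x)/2\}$ (GIG parameterizations differ across packages, so this mapping should be double-checked against the one actually used):

```python
import numpy as np
from scipy.stats import geninvgauss

rng = np.random.default_rng(3)

def draw_tau(beta_j, lam, theta):
    """One draw from GIG(theta - 1/2, beta_j^2, lam*theta), mapped to
    SciPy's (p, b, scale) form via scale = sqrt(chi/psi), b = sqrt(chi*psi).
    (Requires beta_j != 0 for this simple mapping.)"""
    p, chi, psi = theta - 0.5, beta_j ** 2, lam * theta
    return geninvgauss.rvs(p, np.sqrt(chi * psi),
                           scale=np.sqrt(chi / psi), random_state=rng)

def draw_lambda(tau, theta, d0, d1):
    """Gamma(shape, rate) draw from (7); NumPy's gamma takes a scale."""
    shape = d0 + tau.size * theta
    rate = d1 + 0.5 * theta * tau.sum()
    return rng.gamma(shape, 1.0 / rate)

tau_draw = draw_tau(beta_j=0.5, lam=2.0, theta=0.1)
lam_draw = draw_lambda(np.array([0.1, 0.2]), theta=0.1, d0=0.01, d1=0.01)
assert tau_draw > 0 and lam_draw > 0
```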
2.4 Adding stochastic volatility

In macroeconomic forecasting applications, allowing for conditional heteroscedasticity has been shown to be of great importance to produce accurate point and density forecasts, see e.g. Clark (2011). Introducing SV in our model, however, is not straightforward. The fact that we split the likelihood in two parts means that we are not able to estimate a time-varying variance for all $t$ without additional complications.

As a simple solution, we modify (1) as follows:

\[ y = X\beta + Z\gamma + \eta + \varepsilon, \qquad \eta_t = e^{h_t/2} \varepsilon_{\eta,t} \quad \text{for } t = 1, \ldots, T. \quad (8) \]

Here, $\eta = (\eta_1, \ldots, \eta_T)'$ is a process that features SV, $h_t$ denotes a log-volatility process, and $\varepsilon_{\eta,t} \sim \mathcal{N}(0, 1)$. Note that the variance of the composite error $\eta_t + \varepsilon_t$ is $\sigma_t^2 = e^{h_t} + \sigma_\varepsilon^2$ and that neither $h_t$ nor $\sigma_\varepsilon^2$ are separately identified.

Estimating $\eta$ and the full history of $h_t$ is more complicated. Our proposed approach, however, is perfectly suited for handling such a situation. To be more precise, we approximate $\eta$ and $\{h_t\}_{t=1}^T$ alongside $\gamma$ by exploiting equation (4) while running VAMP. This is simply achieved by noting that $\eta$ is a time-varying intercept term that features SV. Estimating $h_t$ is, unfortunately, more difficult. In this paper, we adopt the SV estimator proposed in Korobilis (2019). Specifically, we square and take logs of (8) to obtain:

\[ \tilde{\eta}_t = h_t + v_t, \]

with $\tilde{\eta}_t = \log \eta_t^2$ and $v_t = \log \varepsilon_{\eta,t}^2$ denoting an error term that follows a $\log \chi^2$ distribution with one degree of freedom. Kim et al. (1998) propose approximating the distribution of $v_t$ using a seven component mixture of Gaussians. This renders the observation equation conditionally Gaussian:

\[ \tilde{\eta}_t = h_t + u_i, \qquad u_i \sim \mathcal{N}(\mu_{v,i}, \sigma_{v,i}^2), \quad \text{for } i = 1, \ldots, 7. \]

Here, we let $\mu_{v,i}$ and $\sigma_{v,i}^2$ denote component-specific means and variances, respectively. Let $\pi_i$ denote the corresponding mixture weights. As a point estimator for $h_t$, we then use

\[ \hat{h}_t = \sum_{i=1}^{7} \pi_i (\tilde{\eta}_t - \mu_{v,i}). \]

The precise values for $\mu_{v,i}$, $\sigma_{v,i}^2$ and $\pi_i$ can be found in Kim et al. (1998), Table 4. Korobilis (2019) shows that this volatility estimate possesses excellent empirical properties and closely matches estimates obtained using MCMC-based techniques, although it tends to be less persistent. This is because $\hat{h}_t$ does not depend on $\hat{h}_{t-1}$.

2.5 Posterior simulation

To estimate the posterior distribution of the coefficients of the model as well as the latent states, we use the following algorithm:

Step 1: Approximate $\gamma$ (and the corresponding hyperparameters associated with the prior), $\sigma_\varepsilon^2$ and $\{h_t\}_{t=1}^T$ by exploiting (4) and the methods outlined in Sub-sections 2.3 and 2.4.

Step 2: Conditional on having approximated the TVPs and log-volatilities, obtain a draw from $p(\beta \mid y)$ using (5).

Step 3: Draw $\{\tau_j\}_{j=1}^K$ from the GIG distribution in (6).

Step 4: Finally, simulate $\lambda$ from the Gamma posterior detailed in (7).

In all empirical applications, we repeat Steps (2) to (4) 15,000 times and discard the first 10,000 draws as burn-in. Since Step 1 is fast, estimating a TVP regression model using our IRGA approach takes approximately as long as estimating a constant parameter specification, significantly reducing computation time. Further details on computation times for the different applications can be found in Sections 3 and 4.
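The volatility point estimate used in Step 1 is a closed-form weighted average. A sketch of $\hat{h}_t$ using the seven-component mixture constants as commonly reported from Kim et al. (1998, Table 4) — these values, and the $-1.2704$ offset that converts the tabulated means to means of $v_t$, should be verified against the original table:

```python
import numpy as np

# Mixture weights and tabulated means (Kim, Shephard and Chib, 1998, Table 4);
# the component means of v_t = log(eps^2) are the tabulated values - 1.2704.
PI = np.array([0.00730, 0.10556, 0.00002, 0.04395, 0.34001, 0.24566, 0.25750])
MU = np.array([-10.12999, -3.97281, -8.56686, 2.77786,
               0.61942, 1.79518, -1.08819]) - 1.2704

def h_hat(eta):
    """Point estimator h_hat_t = sum_i pi_i (eta_tilde_t - mu_{v,i}),
    with eta_tilde_t = log(eta_t^2); a small constant guards log(0)."""
    eta = np.asarray(eta, dtype=float)
    eta_tilde = np.log(eta ** 2 + 1e-10)
    return np.sum(PI * (eta_tilde[..., None] - MU), axis=-1)

assert abs(PI.sum() - 1.0) < 1e-6   # weights sum to one
```

Because the weights sum to one, $\hat{h}_t$ is simply $\tilde{\eta}_t$ minus the mixture mean of $v_t$; it uses no information from neighboring periods, which is why the estimate is less persistent than MCMC-based smoothers.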
2.6 The $T < K$ case
An important limitation of the IRGA methods described in Sub-section 2.3 is that they require $T \geq K$ to work. But in multivariate time series models such as TVP-VARs we often encounter the situation that $T < K$. That is, if the number of endogenous variables ($N$) and the number of lags ($P$) is large, the number of explanatory variables in each equation (i.e. $NP$ plus any exogenous regressors) can be enormous. Given the number of observations typically associated with macroeconomic quarterly or monthly data sets, many relevant empirical settings have $T < K$.

In such a case, we would recommend that the best procedure is to reformulate the problem so that $T \geq K$. Remember, the basic idea of our approach is that there are $K$ parameters which are important and others which can be treated as nuisance parameters. When $T \geq K$ it is logical to class the constant coefficients as the important parameters and the TVPs as nuisance parameters. However, it is also possible to classify some of the less important constant coefficients as nuisance parameters. For instance, in a macroeconomic TVP-VAR there will typically be some variables of primary interest either for forecasting or for economic structural analysis (e.g. in our empirical section we use interest rates, output growth and inflation as the variables of primary interest). Constant coefficients on the lagged values of these primary variables can be estimated using MCMC methods, with all remaining parameters (i.e. constant coefficients on lagged values of other variables as well as TVPs) being treated as nuisance parameters.

Alternatively, in macroeconomic models such as VARs it is common to find that lower order lags of endogenous variables are more important than higher order lags. Thus, in some applications it might make sense to choose a small number of lags, $P_1 < P$, and treat coefficients on all lagged endogenous variables up to lag $P_1$ as being of primary interest, with coefficients on the remaining $P - P_1$ lags being put into the category of nuisance parameters.

In our empirical work, we consider two approaches using a TVP-VAR which assumes that $y_t$ is an $N$-dimensional vector of endogenous variables (see Appendix A). One is the $T \geq K$ case outlined above which treats only the TVPs as nuisance parameters. The second treats the TVPs and all cross-lags greater than one as nuisance parameters. As we will show in Sub-section 4.2, this only slightly decreases predictive accuracy but allows for estimating even larger models while still maintaining fast computation times.
(We assume throughout that everything has been initialized by using maximum likelihood estimates, if available, or by starting from the prior mean.)

3 Artificial data
In this section, we assess how our approach performs when applied to synthetic data. To this end, we use a data generating process (DGP) based on (1). Each DGP takes a draw of constant coefficients assuming $\beta \sim \mathcal{N}(0, I_K)$ and the elements of $\gamma$ are drawn as:

\[ \gamma_{it} \sim \mathcal{N}\left(0, \; s_t \, (0.1)^2\right), \qquad s_t = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p, \end{cases} \qquad x_t \sim \mathcal{N}(0_K, I_K), \]

for $t = 1, \ldots, 500$ and with $p$ denoting the probability of observing a parameter change. We set $\sigma_\varepsilon = 0.1$. We consider a range of values for $p$. If we set $p$ equal to zero we obtain a constant parameter specification while setting $p = 1$ yields a standard TVP model. To investigate how our approach performs across a wide variety of DGPs, we choose $K \in \{5, 10, 15, 20, 25\}$ and $p \in \{0, 0.25, 0.5, 0.75, 1\}$ and obtain 100 realizations of the DGP. To illustrate that approximating the TVPs using VAMP allows us to estimate the time-invariant part of the model efficiently, we benchmark our approach (henceforth labeled IRGA-TVP S&S for the Spike & Slab prior and IRGA-TVP SBL for the SBL prior) against a TVP regression model estimated using standard MCMC techniques and shrinkage priors (see, e.g., Bitto and Frühwirth-Schnatter, 2019; Huber et al., forthcoming) and a random walk state equation.

Note that our approach is an approximate one whereas our benchmark is more exact (if sufficient MCMC draws are taken). This represents an advantage for the benchmark approach. However, our approach allows for different types of TVPs. In addition, IRGA-TVP S&S allows for them to be shrunk to be precisely zero, whereas the benchmark approach does not since it assumes random walk behavior. We expect this, in many empirical contexts, to be an advantage for our approach. That is, in practice, it is likely to be the case that many or most of the TVPs are zero, particularly when the number of explanatory variables is large.

To see how these considerations play out in artificial data, consider Table 1, which displays the mean absolute errors (MAEs) for $\beta$ of the IRGA approach relative to the standard TVP regression. In general, and for both priors considered, we see numbers near one, indicating that both approaches are yielding similar estimates, despite the fact that our IRGA approaches are much more computationally efficient. Using IRGA in combination with the Spike & Slab prior yields slightly worse estimates than the benchmark when $p = 1$, but performs appreciably better in all other cases.
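For concreteness, one realization of the DGP described at the start of this section can be simulated in a few lines (a minimal sketch; function and variable names are ours):

```python
import numpy as np

def simulate_dgp(T=500, K=10, p=0.5, sigma_eps=0.1, seed=0):
    """One realization of the DGP: beta ~ N(0, I_K); gamma_it is a
    N(0, 0.1^2) draw switched on with probability p in period t."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((T, K))
    beta = rng.standard_normal(K)
    s = rng.binomial(1, p, size=T)                   # s_t = 1 w.p. p
    gamma = s[:, None] * rng.normal(0.0, 0.1, size=(T, K))
    y = X @ beta + np.sum(X * gamma, axis=1) + rng.normal(0.0, sigma_eps, T)
    return y, X, beta, gamma

y, X, beta, gamma = simulate_dgp(T=50, K=5, p=0.0, seed=1)
assert y.shape == (50,) and X.shape == (50, 5)
assert np.allclose(gamma, 0.0)                       # p = 0: constant model
```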
But the p = 1 case allows time variation in every coefficient and in every timeperiod. It is not surprising that the benchmark model, which allows for this high degree of timevariation, performs well here. And it is worth noting that, other than for the case where K = 5to which we will return later, our algorithm is doing roughly as well as the benchmark even with p = 1. Using the SBL prior in this situation yields estimates that are very close to the onesof the benchmark, highlighting the increased flexibility by allowing for a varying degree of timevariation over the estimation period.When we move to DGPs with less time variation and, thus, smaller values for p , our methodsdo appreciably better relative to the benchmark (but to a slightly lesser extent if the SBL prioris used). Our algorithm does particularly well for intermediate values of p , but even in theconstant coefficient case, p = 0, we are still beating the benchmark under both priors adopted.Note that the shrinkage prior in the benchmark approach allows for separate shrinkage onconstant coefficients and TVPs. Thus, it allows for any coefficient to be constant (over theentire sample) or time-varying (over the entire sample). But, unlike our approach, it does notallow for coefficients to be time-varying in some periods or not others. This is likely why ourmethods do particularly well for values in the region of p = 0 .
5. Artificial data sets generatedfrom such a value will have many periods where coefficients are constant and others where theyare varying. DGPs with the extreme choices of p = 0 or p = 1 will not have this property andlead to relatively better performance of the benchmark. This also explains the small differencesbetween the Spike & Slab and the SBL priors, with the latter having a slight disadvantage sincethey are not capable of pushing the TVPs exactly to zero.9ext we consider the impact of K on the relative MAEs. When K = 5 and p = 1 we havea small TVP regression with lots of parameter change. In such a case, the benchmark doeswell. However, even with p = 0 and larger values for K , our methods improve relative to thebenchmark. Our best relative performance occurs for the largest model with K = 25 and anintermediate amount of parameter change p = 0 .
75. This largely reflects the fact that whenusing a Spike & Slab prior for the TVPs, we allow them to be shrunk precisely to zero in periodswhen there is no parameter change. This contrasts with the SBL prior and the benchmarkapproach which only allow for them to be shrunk to be near zero. Huber et al. (forthcoming)show that, in models with many parameters, shrinkage to precisely zero is typically useful inachieving forecast accuracy. Other forms of shrinkage allow for small amounts of estimationerror to accrue over many parameters, which can be detrimental for forecast accuracy. Thiseffect can also be clearly seen in our results in terms of the impact on estimation accuracy forthe constant coefficients. Amount of time-variation pK .
                         Amount of time variation p
K                1        0.75      0.50      0.25      0
S&S prior
5                1.616    0.681     0.771     0.773     0.926
10               1.064    0.754     0.657     0.645     0.773
15               1.049    0.629     0.654     0.769     0.727
20               1.102    0.667     0.803     0.715     0.744
25               1.179    0.583     0.798     0.685     0.740
SBL prior
5                1.043    1.032     0.999     0.892     0.931
10               1.029    1.004     0.922     0.786     0.929
15               0.968    0.834     0.783     1.052     0.739
20               0.995    0.799     0.918     0.862     0.799
25               0.969    0.744     0.763     0.729     0.687
Table 1: Mean absolute error ratios between a TVP model estimated using our IRGA approach and a constant parameter model. All results are based on 100 realizations from the DGP.

We next turn to the issue of computation time. Figure 1 depicts the time necessary (in minutes) to generate 15,000 draws from the joint posterior for the benchmark TVP regression (in dashed black) and our IRGA-TVP approach (in solid black). The key point is that our IRGA methods are scalable and can potentially handle hundreds of explanatory variables, whereas the benchmark approach cannot. To be precise, estimation times increase linearly at a very slow rate with K under the IRGA approach, with overall time ranging from 2.7 to around 3.7 minutes. By contrast, estimating a standard TVP regression model using the algorithm proposed in Carter and Kohn (1994) is much more demanding, with estimation times for this model increasing quite rapidly at a nonlinear rate.

Summing up, we have shown that the estimates for β obtained using our IRGA approach are precise and close to the ones obtained from estimating a standard TVP regression. In addition, approximating the TVPs leads to vastly improved estimation times that only increase slightly with the number of predictors.

Figure 1: Estimation time (in minutes) for producing 15,000 draws from the joint posterior: IRGA-TVP (solid black) and TVP (dashed black).

In this section, we carry out a forecasting exercise and a structural economic exercise using a large TVP-VAR involving up to N = 78 variables (see Appendix B).
We use the popular FRED-MD macroeconomic data set; see McCracken and Ng (2016). We obtain monthly data from 1985:01 to 2019:08 on 78 variables. The exact variables (and the way each has been transformed to stationarity) are listed in Table 5 in Appendix B. We use measures of inflation (labelled CPIAUCSL), interest rates (FEDFUNDS) and the unemployment rate (UNRATE) as our three variables of primary interest. The remaining variables are only useful insofar as they improve forecasts of, or affect structural impulse responses for, the primary variables.
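The stationarity transformations listed in Table 5 follow the coding scheme of McCracken and Ng (2016). A minimal sketch of how these codes map to transformations is given below; the function name and the toy series are ours:

```python
import numpy as np

def fredmd_transform(x, code):
    """Apply a FRED-MD transformation code (McCracken and Ng, 2016) to a
    1-D array of levels: (1) none, (2) first difference, (4) log,
    (5) log first difference, (6) log second difference.
    Differencing shortens the series rather than padding with NaNs."""
    x = np.asarray(x, dtype=float)
    if code == 1:
        return x
    if code == 2:
        return np.diff(x)
    if code == 4:
        return np.log(x)
    if code == 5:
        return np.diff(np.log(x))
    if code == 6:
        return np.diff(np.log(x), n=2)
    raise ValueError(f"unhandled transformation code {code}")

cpi_levels = np.array([100.0, 100.5, 101.2, 101.6, 102.3])  # toy price levels
inflation = fredmd_transform(cpi_levels, 5)  # CPIAUCSL itself uses code 6 in Table 5
print(inflation.shape)  # (4,)
```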
We carry out a forecasting exercise using a huge data set involving 78 variables. The results are based on four different versions of our IRGA approach involving two priors (the SBL and S&S priors) and the two methods of implementation outlined at the end of Sub-section 2.6. These are labelled "TVPs" and "TVPs & lag" below. Prior hyperparameter choice is discussed in Sub-section 2.2. For the S&S prior we fix one of the key prior hyperparameters by setting ψ = 0.001, with the remaining hyperparameters estimated using a random walk Metropolis step, similar to Giannone et al. (2015). For the error covariances and the intercept term, we use normally distributed priors with zero mean and variance ten. The model is estimated using equation-by-equation estimation.

As a low-dimensional benchmark, we include a three-equation TVP-VAR with SV using just the focus variables. This is a standard TVP model which assumes the TVPs follow random walks. It is estimated using MCMC methods on an equation-by-equation basis using a noncentered specification with a Gaussian prior with zero mean and variance ten on all parameters. All benchmark models use a standard stochastic volatility specification that assumes an AR(1) evolution of the log-volatilities and is estimated using the algorithm put forward in Kastner and Frühwirth-Schnatter (2014).

We investigate the performance of our methods for the three primary variables of interest: inflation, unemployment and the interest rate. We use root mean squared errors (RMSEs) and average log predictive likelihoods (LPLs) to evaluate forecast performance. We use a forecast evaluation period beginning in 2005:01 and running to the end of the sample, for forecast horizons of one month and one year.

Before discussing forecast performance, it is worthwhile to briefly consider computation time. Table 2 presents average estimation times taken by the forecast exercise over the holdout sample. For the IRGA specifications approximating solely the TVPs, the sampling algorithm takes roughly 150 minutes. When we approximate second-order cross-variable lags alongside the TVPs, estimation is even faster, taking about 125 minutes. Note that the IRGA specifications allow for parallelization of the algorithm (for reference, we only used one CPU core). Estimation using multiple cores speeds up estimation by a factor of the number of available CPUs (e.g., eight cores would allow estimating the models roughly eight times as fast, and the 78-equation IRGA could be estimated in around 20 minutes).

For reference, the hierarchical Minnesota BVAR takes 270 minutes to estimate, while the small TVP-VAR takes roughly four minutes (however, as suggested above, it scales badly in T and K). One would expect the estimation times of the constant parameter VAR and one of the IRGA models to be comparable, which is clearly not the case here. This difference in runtimes arises from the fact that the BVAR features a standard SV specification, which implies that, if equation-by-equation estimation is used, we need to normalize y and X by dividing through by the time-varying error standard deviations and repeatedly compute a cross product involving X during MCMC sampling.

Table 2:
Average estimation time over the holdout.
Model      IRGA SBL (TVPs)    IRGA SBL (TVPs & lag)    Minnesota    TVP-VAR
Runtime    153                ~125                     270          ~4

Notes: Runtimes in minutes.
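The normalization step that drives the BVAR runtime, dividing y and X through by the time-varying error standard deviations before forming the cross product in every MCMC sweep, can be sketched as follows. Variable names and the toy data are ours; the Gaussian prior with variance ten follows the settings mentioned in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 300, 20
X = rng.standard_normal((T, K))
beta_true = rng.standard_normal(K)
sigma_t = np.exp(0.03 * np.cumsum(rng.standard_normal(T)))  # toy SV path
y = X @ beta_true + sigma_t * rng.standard_normal(T)

def gls_normalised_posterior(y, X, sigma_t, prior_var=10.0):
    """One conditional-posterior computation for constant coefficients
    under stochastic volatility: rescale y and X by the time-varying
    standard deviations, then form the cross product -- the step that
    must be repeated in every MCMC sweep because sigma_t changes."""
    Xs = X / sigma_t[:, None]                            # heteroskedasticity-weighted X
    ys = y / sigma_t
    prec = Xs.T @ Xs + np.eye(X.shape[1]) / prior_var    # posterior precision
    mean = np.linalg.solve(prec, Xs.T @ ys)              # posterior mean
    return mean, prec

mean, prec = gls_normalised_posterior(y, X, sigma_t)
print(mean.shape)  # (20,)
```

Because the rescaled cross product Xs.T @ Xs depends on the current volatility draw, it cannot be precomputed once, which is why the constant parameter BVAR with SV is slower than one might expect.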
Results of our forecasting exercise are given in Table 3. RMSEs shed light on point forecasting performance and indicate our methods are forecasting roughly as well as the benchmark Bayesian VAR. With one exception, the same can be said for the three-variable TVP-VAR. The one exception is long-run forecasting (h = 12) of the interest rate, where the TVP-VAR is forecasting the interest rate very poorly. In contrast, the IRGA approaches (in particular, the S&S variants) are forecasting very well. This suggests that, for the interest rate, there is little evidence of time-variation in parameters. In this case, the S&S prior is doing a very good job of shrinking excess time-variation to zero in a way that the TVP-VAR does not.

LPLs evaluate the quality of the entire predictive densities produced by the different approaches. Note that all of the LPLs for the IRGA approaches are positive, indicating that they are beating the benchmark Bayesian VAR. These gains are often pronounced, strongly highlighting the benefits of combining flexible models with large information sets using the IRGA framework. We stress that all of our models, including the Bayesian VAR, have stochastic volatility. Our findings suggest that, even in large VARs, there is some improvement in predictive densities to be found by allowing for time variation in both VAR coefficients and error covariance matrices. Note also that the low-dimensional TVP-VAR is producing predictive densities which are roughly as good as our high-dimensional IRGA approaches. This reinforces the idea that there may be a trade-off between TVPs and VAR dimension.
In small data sets, TVPs may be particularly important as they may reduce the effect of omitted variables bias, whereas in large dimensions this bias is substantially reduced by controlling for observed heterogeneity through the inclusion of additional covariates.

Figure 2, which plots our forecast metrics against time, tells a story of constant improvements, since the Great Recession ended, in the LPLs of our IRGA methods relative to the benchmark Bayesian VAR. Above, we noted the poor point forecast performance of the small TVP-VAR for the effective Federal funds rate at h = 12. However, this model yielded good density forecasts in this case. Figure 2, along with the fact that the Federal funds rate was essentially zero in 2009-2016, suggests a possible explanation for this finding. The three-dimensional TVP-VAR is more parsimonious than any of the other models considered, which all involve 78 variables. Parsimonious models often lead to low predictive variances. In the long period when the Federal funds rate was constant, this property benefitted the small TVP-VAR. However, during the Great Recession, it performed much more poorly, as evidenced by the large forecast errors during this period.

Table 3: Forecast root mean squared error (RMSE) and average log predictive score (LPS) over the period 2005–2019.
                                    RMSE                 average LPS
Model                           h = 1    h = 12       h = 1    h = 12
Minnesota
  UNRATE                        1.031    1.104        -1.856   -2.274
  FEDFUNDS                      0.563    0.786        -1.413   -2.217
  CPIAUCSL                      1.151    1.244        -2.007   -2.283
IRGA SBL (TVPs)
  UNRATE                        0.917    1.060         0.510    0.742
  FEDFUNDS                      1.028    1.460         0.388    1.076
  CPIAUCSL                      1.011    1.053         0.551    0.721
IRGA SBL (TVPs & lag)
  UNRATE                        0.968    1.023         0.462    0.703
  FEDFUNDS                      1.318    0.970         0.219    0.999
  CPIAUCSL                      1.071    1.002         0.438    0.660
IRGA S&S (TVPs), ψ = 0.001
IRGA S&S (TVPs & lag), ψ = 0.001
TVP-VAR
  UNRATE                        1.002    1.053         0.427    0.733
  FEDFUNDS                      1.001    1.814         1.299    1.811
  CPIAUCSL                      1.111    1.042         0.530    0.666
Notes: RMSEs and LPSs are reported in levels for the benchmark Bayesian VAR with hierarchical Minnesota prior; for all other models, RMSEs are reported as ratios to the benchmark and LPSs as differences from the benchmark. RMSE entries less than unity indicate better point-forecast performance than the benchmark; LPS entries greater than zero indicate better density-forecast performance. SBL refers to the sparse Bayesian learning prior; S&S is the spike-and-slab prior with the value of ψ performing best on average (additional results are in the Appendix). IRGA TVPs & lag is a specification in which not only the TVPs are approximated, but also the constant parameters on the second-order cross-variable lags.
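The relative entries in Table 3 can be computed from raw forecasts as RMSE ratios and LPS differences. A small sketch of the two metrics follows, using toy forecasts of our own and assuming Gaussian predictive densities:

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a vector of forecast errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

def avg_lps(y, mean, sd):
    """Average log predictive score under Gaussian predictive densities."""
    y, mean, sd = map(np.asarray, (y, mean, sd))
    return float(np.mean(-0.5 * np.log(2 * np.pi * sd ** 2)
                         - 0.5 * ((y - mean) / sd) ** 2))

# toy holdout: a model that tracks part of the signal vs. an unconditional benchmark
rng = np.random.default_rng(0)
x = rng.standard_normal(120)
y = 0.5 * x + rng.standard_normal(120)

rmse_ratio = rmse(y - 0.5 * x) / rmse(y)                                # < 1 favours the model
lps_diff = avg_lps(y, 0.5 * x, 1.0) - avg_lps(y, 0.0, np.sqrt(1.25))   # > 0 favours the model
print(round(rmse_ratio, 3), round(lps_diff, 3))
```

The LPS difference rewards both accurate point forecasts and well-calibrated predictive variances, which is why a parsimonious model with tight predictive densities can score well even when its point forecasts are mediocre.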
Figure 2: Absolute cumulative forecast errors and log predictive scores relative to the benchmark over the holdout period 2005–2019. Panels cover CPIAUCSL, FEDFUNDS and UNRATE at horizons h = 1 and h = 12 for the IRGA SBL and IRGA S&S (ψ = 0.001) specifications, the Minnesota benchmark and the TVP-VAR.

Notes: The grey shaded area indicates the Great Recession.

The effect uncertainty shocks have on the economy is an important issue in modern macroeconomics. In this section, we update the seminal analysis of Bloom (2009) using our IRGA methods and various data sets taken from the large data set used in our forecasting exercise. The impulse responses to the uncertainty shock are identified using standard Cholesky identification with the uncertainty index (VXOCLSx) ordered first. We consider small, medium and large TVP-VARs. We use a data set involving four variables (the three core variables plus the uncertainty index) as the small information set, and include additional series for the medium (nine additional series) and large (31 additional variables) information sets. Table 5 lists the variables included in each data set. For the structural analysis, we approximate the TVPs using VAMP relying on the SBL prior (results using the S&S prior are very similar). The remaining parameters are estimated using MCMC methods.

With a TVP model, impulse responses vary over time. We present impulse responses using the constant coefficients. Hence, they can be interpreted as average or mean impulse responses. Figure 3 shows the impulse responses from a constant parameter VAR (in red, with 68 percent posterior credible sets) versus our IRGA approach (blue, alongside 68 percent posterior credible sets). The constant parameter VAR is the same as the one used as a benchmark in the forecasting section of the paper. The right panel plots posterior features of the difference in impulse responses between the constant parameter and TVP models.

In general, results in the left panel of the figure are sensible. The uncertainty shock causes a substantial increase in unemployment and is associated with a drop in interest rates. Inflation falls on impact but then rises. The constant parameter and TVP models both produce qualitatively similar findings. However, an examination of the right panel reveals an interesting pattern. When working with the large VAR, the constant parameter and TVP model produce statistically similar impulse responses.
That is, the posterior credible intervals for the difference in impulse responses between the two models always include zero. Thus, in the large model the inclusion of TVPs has little or no impact on the impulse responses. However, for the small and (to a lesser extent) medium-sized VARs, the posterior credible intervals for the difference in impulse responses frequently do not include zero. This suggests that there is a trade-off between TVP-VAR dimension and the presence of TVPs. In small models, where the risk of omitted variables is high, the addition of TVPs clearly makes a difference. But in large models, where the risk is lower, the evidence for TVPs diminishes.

Overall, we find that our IRGA approach produces sensible results and that, by allowing for TVPs, we obtain some insights that would not be found in constant coefficient models. And it is worth stressing that IRGA methods are computationally efficient, capable of handling very large TVP-VAR dimensions where traditional MCMC-based TVP-VAR methods would not be computationally practicable.
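Cholesky identification with the uncertainty index ordered first can be sketched for a first-order system as follows. The numbers are toy values of our own; the paper's VARs have P lags and time-varying covariance matrices:

```python
import numpy as np

def irf_cholesky(A, Sigma, horizons=24, shock=0):
    """Impulse responses of a VAR(1) y_t = A y_{t-1} + e_t, e_t ~ (0, Sigma),
    identified recursively via the Cholesky factor of Sigma. The variable
    ordered first (here the uncertainty index, as in the text) delivers the
    impulsed shock. Illustrative sketch for a first-order system."""
    P = np.linalg.cholesky(Sigma)        # lower-triangular impact matrix
    impact = P[:, shock]                 # one-standard-deviation shock
    irf = np.empty((horizons + 1, A.shape[0]))
    for h in range(horizons + 1):
        irf[h] = np.linalg.matrix_power(A, h) @ impact
    return irf

# toy 4-variable system ordered (VXOCLSx, UNRATE, FEDFUNDS, CPIAUCSL)
A = 0.5 * np.eye(4)
Sigma = np.eye(4) + 0.1 * np.ones((4, 4))
irf = irf_cholesky(A, Sigma, horizons=24, shock=0)
print(irf.shape)  # (25, 4)
```

Because the impact matrix is lower triangular, the uncertainty index responds to its own shock on impact while the remaining variables respond only through the estimated covariances, which is the recursive ordering assumption used here.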
Figure 3: Impulse response functions (IRF panels) and differences between the time-varying and constant parameter specifications (Diff panels) across model sizes (small, medium, large). Columns show responses of CPIAUCSL, FEDFUNDS, UNRATE and VXOCLSx over an impulse response horizon of 0 to 24 months.

Notes: See Table 5 in the Appendix for further details on the respective information sets. The blue areas refer to the 16th and 84th percentiles (68 percent credible sets) of the IRFs associated with the IRGA approach (with the SBL prior), whereas the red lines are the IRFs associated with the constant parameter VAR.
This paper has argued that, when working with TVP regressions with large data sets, care has to be taken to ensure that estimation methods are computationally feasible and that the appropriate type of time-variation is allowed for. That is, particularly when K is large, it is likely that time-variation in regression coefficients occurs only for a few of the coefficients and only in some periods. With these considerations in mind, we have developed computationally efficient estimation methods which are scalable and can accommodate a range of shrinkage priors that allow for great flexibility in the location and timing of parameter change. These are based on IRGAs. On artificial data, we document the computational efficiency and accuracy of these methods. On a large macroeconomic data set, we demonstrate good forecast performance in a practical amount of computation time. In an investigation of the role of uncertainty shocks, we show that TVP models yield insights that constant coefficient models do not.

References

Belmonte M, Koop G, and Korobilis D (2014), "Hierarchical shrinkage in time-varying coefficient models,"
Journal of Forecasting (1), 80–94. Bhattacharya A, Pati D, Pillai N, and Dunson D (2015), "Dirichlet–Laplace priors for optimal shrinkage,"
Journal of the American Statistical Association, 1479–1490.
Bitto A, and Frühwirth-Schnatter S (2019), "Achieving shrinkage in a time-varying parameter model framework,"
Journal of Econometrics (1), 75–97.
Bloom N (2009), “The impact of uncertainty shocks,”
Econometrica, 623–685. Carriero A, Clark T, and Marcellino M (forthcoming), "Large Vector Autoregressions with stochastic volatility and flexible priors,"
Journal of Econometrics . Carter CK, and Kohn R (1994), “On Gibbs sampling for state space models,”
Biometrika (3), 541–553. Chan J, Koop G, Leon-Gonzalez R, and Strachan R (2012), "Time varying dimension models,"
Journal of Business and Economic Statistics , 358–367. Clark T (2011), “Real-time density forecasts from BVARs with stochastic volatility,”
Journal of Business and Economic Statistics, 327–341. Cogley T, and Sargent TJ (2005), "Drifts and volatilities: monetary policies and outcomes in the post WWII US,"
Review of Economic Dynamics (2), 262–302. D'Agostino A, Gambetti L, and Giannone D (2013), "Macroeconomic forecasting and structural change,"
Journal of Applied Econometrics (1), 82–101. Dangl T, and Halling M (2012), “Predictive regressions with time-varying coefficients,”
Journal of Financial Economics, 157–181.
Fang J, Zhang L, and Li H (2016), "Two-dimensional pattern-coupled sparse Bayesian learning via generalized approximate message passing,"
IEEE Transactions on Image Processing (6), 2920–2930. Feldkircher M, Huber F, and Kastner G (2017), "Sophisticated and small versus simple and sizeable: When does it pay off to introduce drifting coefficients in Bayesian VARs?" arXiv. Giannone D, Lenza M, and Primiceri GE (2015), "Prior selection for vector autoregressions,"
Review of Economics and Statistics (2), 436–451. Griffin J, and Brown P (2010), "Inference with normal-gamma prior distributions in regression problems,"
Bayesian Analysis (1), 171–188. Groen J, Paap R, and Ravazzolo F (2013), "Real time inflation forecasting in a changing world,"
Journal of Business and Economic Statistics, 29–44. Hauzenberger N, Huber F, Koop G, and Onorante L (2019), "Fast and Flexible Bayesian Inference in Time-varying Parameter Regression Models," arXiv. Huber F, Koop G, and Onorante L (forthcoming), "Inducing sparsity and shrinkage in time-varying parameter models,"
Journal of Business and Economic Statistics . Kalli M, and Griffin J (2014), “Time-varying sparsity in dynamic regression models,”
Journal of Econometrics, 779–793.
Kastner G, and Frühwirth-Schnatter S (2014), "Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models,"
Computational Statistics & Data Analysis, 408–423. Kastner G, and Huber F (2017), "Sparse Bayesian vector autoregressions in huge dimensions," manuscript. Kim S, Shephard N, and Chib S (1998), "Stochastic volatility: likelihood inference and comparison with ARCH models," The Review of Economic Studies (3), 361–393. Koop G, and Korobilis D (2012), "Forecasting inflation using dynamic model averaging,"
International Economic Review (3), 867–886. ——— (2013), "Large time-varying parameter VARs," Journal of Econometrics (2), 185–198. ——— (2018), "Variational Bayes inference in high dimensional time-varying parameter models," manuscript. Koop G, Korobilis D, and Pettenuzzo D (2019), "Bayesian compressed Vector Autoregressions,"
Journal of Econometrics (1), 135–154.
Korobilis D (2019), “High-dimensional macroeconomic forecasting using message passing algorithms,”
Journal of Business & Economic Statistics forthcoming , 1–30.
Kowal D, Matteson D, and Ruppert D (2017), “Dynamic shrinkage processes,” arXiv:1707.00763 . Litterman RB (1986), “Forecasting with Bayesian vector autoregressions—five years of experience,”
Journal of Business & Economic Statistics (1), 25–38. McCracken MW, and Ng S (2016), “FRED-MD: A monthly database for macroeconomic research,”
Journal of Business & Economic Statistics (4), 574–589. Mitchell TJ, and Beauchamp JJ (1988), “Bayesian variable selection in linear regression,”
Journal of the American Statistical Association (404), 1023–1032. Nakajima J, and West M (2013), "Bayesian analysis of latent threshold dynamic models,"
Journal of Business and Economic Statistics, 151–164. Paul P (2019), "The time-varying effect of monetary policy on asset prices,"
Review of Economics and Statistics forthcoming, 1–44.
Primiceri G (2005), "Time varying structural vector autoregressions and monetary policy,"
The Review of Economic Studies (3), 821–852. Rangan S, Schniter P, and Fletcher AK (2019), "Vector approximate message passing,"
IEEE Transactions on Information Theory (10), 6664–6684. Ročková V, and McAlinn K (2018), "Dynamic variable selection with spike-and-slab process priors,"
Technical report, Booth School of Business, University of Chicago . Stock J, and Watson M (2007), “Why has U.S. inflation become harder to forecast?”
Journal of Money, Credit and Banking, 3–33. Tipping ME (2001), "Sparse Bayesian learning and the relevance vector machine,"
Journal of Machine Learning Research (Jun), 211–244. Uribe P, and Lopes H (2017), "Dynamic sparsity on dynamic regression models," manuscript. van den Boom W, Reeves G, and Dunson DB (2019), "Approximating posteriors with high-dimensional nuisance parameters via integrated rotated Gaussian approximation," arXiv. Zou X, Li F, Fang J, and Li H (2016), "Computationally efficient sparse Bayesian learning via generalized approximate message passing," in "2016 IEEE International Conference on Ubiquitous Wireless Broadband (ICUWB)," 1–4, IEEE.

A TVP-VAR model
In this appendix, we briefly show how our methods can be applied to the VAR case. Let Y denote a T × N-dimensional matrix of endogenous variables with tth row given by y_t′ = (y_{1t}, ..., y_{Nt}). Using this notation, the TVP-VAR is given by:

y_t = (I_N ⊗ x_t′)(β + γ_t) + ε_t.   (A.1)

Here, x_t = (y_{t−1}′, ..., y_{t−P}′)′ now includes the P lags of y_t, and ε_t is an N-dimensional Gaussian shock vector with time-varying variance-covariance matrix Σ_t of dimension N × N.

Equation (A.1) can be rewritten on an equation-by-equation basis by augmenting x_t in the jth equation (for j ≥ 2) of the system with the contemporaneous values of the first j − 1 elements of y_t:

y_{jt} = β̃_j′ x̃_{jt} + γ̃_{jt}′ x̃_{jt} + ε_{jt},

whereby x̃_{jt} = (x_t′, y_{1t}, ..., y_{j−1,t})′, β̃_j is a K_j (= NP + (j − 1))-dimensional vector of constant coefficients and γ̃_{jt} denotes a K_j-dimensional vector of TVPs for equation j. The last j − 1 elements of β̃_j and γ̃_{jt} serve to approximate the covariance parameters and their time variation in Σ_t. In the case that j = 1, x̃_{jt} = x_t.

Note that the equation-by-equation form of the TVP-VAR has great advantages in terms of computation, since the computational burden of estimating a full system can be reduced efficiently by exploiting the fact that (A.1) constitutes a set of N independent regression models that can be estimated on a cluster.
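The construction of the augmented regressors x̃_jt above can be sketched as follows; this is an illustrative implementation of our own, with 0-indexed equations (equation j here corresponds to equation j + 1 in the text):

```python
import numpy as np

def augmented_regressors(Y, j, P=1):
    """Regressors for equation j (0-indexed) of the equation-by-equation
    TVP-VAR form: stack the P lags of y_t and, for j >= 1, append the
    contemporaneous values of the first j elements of y_t, whose
    coefficients absorb the covariance structure of Sigma_t."""
    T, N = Y.shape
    # lag p occupies columns p*N:(p+1)*N, aligned with y_t for t = P..T-1
    lags = np.hstack([Y[P - p - 1:T - p - 1] for p in range(P)])
    if j == 0:
        return lags
    return np.hstack([lags, Y[P:, :j]])  # append y_{1t}, ..., y_{jt}

Y = np.arange(20, dtype=float).reshape(10, 2)  # toy panel with T = 10, N = 2
print(augmented_regressors(Y, 0).shape, augmented_regressors(Y, 1).shape)  # (9, 2) (9, 3)
```

The resulting design matrices have K_j = NP + j columns for 0-indexed j, matching the dimension count in the text, and each equation can then be estimated independently, which is what makes the cluster-parallel strategy possible.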
In this appendix, we repeat the forecasting exercise using the S&S prior for a range of values of the key prior hyperparameter ψ. It can be seen that the results are very robust to this prior choice.

Table 4: Forecast RMSE and average LPS for the S&S prior with different values of ψ.

                                    RMSE                 average LPS
Model                           h = 1    h = 12       h = 1    h = 12
Minnesota
  UNRATE                        1.031    1.104        -1.86    -2.27
  FEDFUNDS                      0.563    0.786        -1.41    -2.22
  CPIAUCSL                      1.151    1.244        -2.01    -2.28
IRGA S&S (TVPs), ψ = 1/K
  UNRATE                        0.923    1.025         0.509    0.745
  FEDFUNDS                      1.122    0.978         0.248    0.911
  CPIAUCSL                      1.052    1.022         0.435    0.574
IRGA S&S (TVPs & lag), ψ = 1/K
  UNRATE                        0.963    1.021         0.441    0.744
  FEDFUNDS                      0.947    0.971         0.213    0.916
  CPIAUCSL                      1.057    1.002         0.434    0.644

Notes: RMSEs and LPSs are reported in levels for the benchmark Bayesian VAR with hierarchical Minnesota prior; for all other models, RMSEs are reported as ratios to the benchmark and LPSs as differences from the benchmark. RMSE entries less than unity indicate better point-forecast performance than the benchmark; LPS entries greater than zero indicate better density-forecast performance. S&S is the spike-and-slab prior.
IRGA TVPs & lag is a specification in which not only the TVPs are approximated, but also the constant parameters on the second-order cross-variable lags.

Table 5: Monthly data obtained from FRED-MD for 1985:01 to 2019:08.
Mnemonic  I(0)  Description  Size
RPI  5  Real personal income
W875RX1  5  Real personal income ex transfer receipts
INDPRO  5  IP Index  M
IPFPNSS  5  IP: Final Products
IPFINAL  5  IP: Final Products (Market Group)
IPCONGD  5  IP: Consumer Goods
IPDCONGD  5  IP: Durable Consumer Goods
IPNCONGD  5  IP: Nondurable Consumer Goods
IPBUSEQ  5  IP: Business Equipment
IPMAT  5  IP: Materials
IPDMAT  5  IP: Durable Materials
IPNMAT  5  IP: Nondurable Materials
IPMANSICS  5  IP: Manufacturing (SIC)
IPFUELS  5  IP: Fuels
CUMFNS  2  Capacity Utilization: Manufacturing
UNRATE  2  Civilian Unemployment Rate  S
UEMPMEAN  2  Average Duration of Unemployment (Weeks)  L
UEMPLT5  5  Civilians Unemployed: Less Than 5 Weeks
UEMP5TO14  5  Civilians Unemployed for 5-14 Weeks
UEMP15OV  5  Civilians Unemployed: 15 Weeks and Over
UEMP15T26  5  Civilians Unemployed for 15-26 Weeks
UEMP27OV  5  Civilians Unemployed for 27 Weeks and Over
CLAIMSx  5  Initial Claims  L
PAYEMS  5  All Employees: Total nonfarm  L
USGOOD  5  All Employees: Goods-Producing Industries
USCONS  5  All Employees: Construction
MANEMP  5  All Employees: Manufacturing
SRVPRD  5  All Employees: Service-Providing Industries
USTPU  5  All Employees: Trade, Transportation and Utilities
USWTRADE  5  All Employees: Wholesale Trade
USTRADE  5  All Employees: Retail Trade
USFIRE  5  All Employees: Financial Activities
USGOVT  5  All Employees: Government
AWOTMAN  2  Avg Weekly Overtime Hours: Manufacturing  M
AWHMAN  1  Avg Weekly Hours: Manufacturing  L
HOUST  4  Housing Starts: Total New Privately Owned  M
PERMIT  4  New Private Housing Permits (SAAR)  M
RETAILx  5  Retail and Food Services Sales  L
AMDMNOx  5  New Orders for Durable goods
ANDENOx  5  New Orders for Nondefense Capital goods
AMDMUOx  5  Unfilled Orders for Durable goods
UMCSENTx  2  Consumer Sentiment Index  L
BUSLOANS  6  Commercial and Industrial Loans  L
REALLN  6  Real Estate Loans at All Commercial Banks  L
INVEST  6  Securities in Bank Credit at All Commercial Banks  M
FEDFUNDS  2  Effective Federal Funds Rate  S
CP3Mx  2  3-Month AA Financial Commercial Paper Rate  L
TB3MS  2  3-Month Treasury Bill  L
TB6MS  2  6-Month Treasury Bill  L
GS1  2  1-Year Treasury Rate  L
GS5  2  5-Year Treasury Rate  L
GS10  2  10-Year Treasury Rate  M
AAA  2  Moody's Seasoned Aaa Corporate Bond Yield  M
BAA  2  Moody's Seasoned Baa Corporate Bond Yield  M
COMPAPFFx  1  3-Month Commercial Paper Minus FEDFUNDS
TB3SMFFM  1  3-Month Treasury C Minus FEDFUNDS
TB6SMFFM  1  6-Month Treasury C Minus FEDFUNDS
T1YFFM  1  1-Year Treasury C Minus FEDFUNDS
T5YFFM  1  5-Year Treasury C Minus FEDFUNDS
T10YFFM  1  10-Year Treasury C Minus FEDFUNDS  M
AAAFFM  1  Moody's Aaa Corporate Bond Minus FEDFUNDS
BAAFFM  1  Moody's Baa Corporate Bond Minus FEDFUNDS
TWEXMMTH  5  Trade Weighted U.S. Dollar Index: Major Currencies
EXSZUSx  5  Switzerland–U.S. Foreign Exchange Rate  L
EXJPUSx  5  Japan–U.S. Foreign Exchange Rate  L
EXUSUKx  5  U.S.–UK Foreign Exchange Rate  L
EXCAUSx  5  Canada–U.S. Foreign Exchange Rate  L
OILPRICEx  6  Crude Oil, spliced WTI and Cushing  M
PPICMM  6  PPI: Metals and metal products
CPIAUCSL  6  CPI: All Items  S
CPIAPPSL  6  CPI: Apparel
CPITRNSL  6  CPI: Transportation
CPIMEDSL  6  CPI: Medical Care
CPIULFSL  6  CPI: All Items Less Food
PCEPI  6  Personal Cons. Expend.: Chain Index  M
S&P 500  5  S&P's Common Stock Price Index: Composite  M
S&P: indust  5  S&P's Common Stock Price Index: Industrials  L
VXOCLSx  1  VXO  S

Notes: The data set discussed in McCracken and Ng (2016) is available for download at fred.stlouisfed.org. The column I(0) indicates the transformation applied to a series x_t to obtain a stationary series: (1) no transformation, (2) Δx_t, (4) log(x_t), (5) Δlog(x_t), (6) Δ²log(x_t), with Δ^i indicating ith differences. S (small), M (medium) and L (large) indicate inclusion in the differently sized information sets for the structural analysis, with the letter referring to additional variables per class.