Real-time Inflation Forecasting Using Non-linear Dimension Reduction Techniques
NIKO HAUZENBERGER¹,², FLORIAN HUBER¹, and KARIN KLIEBER¹,∗
¹University of Salzburg  ²Vienna University of Economics and Business
December 16, 2020
In this paper, we assess whether using non-linear dimension reduction techniques pays off for forecasting inflation in real-time. Several recent methods from the machine learning literature are adopted to map a large dimensional dataset into a lower dimensional set of latent factors. We model the relationship between inflation and these latent factors using state-of-the-art time-varying parameter (TVP) regressions with shrinkage priors. Using monthly real-time data for the US, our results suggest that adding such non-linearities yields forecasts that are on average highly competitive to ones obtained from methods using linear dimension reduction techniques. Zooming into model performance over time moreover reveals that controlling for non-linear relations in the data is of particular importance during recessionary episodes of the business cycle.
JEL: C11, C32, C40, C53, E31
Keywords: Non-linear principal components, machine learning, time-varying parameter regression, density forecasting, real-time data

∗ Corresponding author: Karin Klieber. Salzburg Centre of European Union Studies, University of Salzburg. Address: Mönchsberg 2a, 5020 Salzburg, Austria. Email: [email protected]. We thank Michael Pfarrhofer and Anna Stelzer for valuable comments and suggestions. The authors gratefully acknowledge financial support from the Austrian Science Fund (FWF, grant no. ZK 35) and the Oesterreichische Nationalbank (OeNB, Anniversary Fund, project no. 18127).

1 Introduction
Inflation expectations are used as crucial inputs for economic decision making in central banks such as the European Central Bank (ECB) and the US Federal Reserve (Fed). Given current and expected inflation, economic agents decide on how much to consume, save, and invest. In addition, measures of inflation expectations are often employed to estimate the slope of the Phillips curve or to infer the output gap or the natural rate of interest. Hence, being able to accurately predict inflation is key for designing and implementing appropriate monetary policies in a forward-looking manner.

Although the literature on modeling inflation is voluminous and the efforts invested considerable, predicting inflation remains a difficult task (Stock and Watson, 2007), and simple univariate models are still difficult to beat. The recent literature, however, has shown that using large datasets (Stock and Watson, 2002) and/or sophisticated models (see Koop and Potter, 2007; Koop and Korobilis, 2012; D'Agostino et al., 2013; Koop and Korobilis, 2013; Clark and Ravazzolo, 2015; Chan et al., 2018; Jarocinski and Lenza, 2018) has the potential to improve upon simpler benchmarks.

These studies often extract information from huge datasets. This is commonly achieved by extracting a relatively small number of principal components (PCs) and including them in a second-stage regression model. While this approach performs well empirically, it fails to capture non-linear relations in the dataset. In the presence of non-linearities, using simple PCs potentially reduces predictive accuracy by ignoring important features of the data. Moreover, the regression model that links the PCs with inflation is often assumed to feature constant parameters and homoscedastic errors. In the presence of structural breaks and/or heteroscedasticity, this may adversely affect forecasting accuracy.

Investigating whether allowing for non-linearities in the compression stage pays off for inflation forecasting is the key objective of the present paper. Building on recent advances in machine learning (see Gallant and White, 1992; McAdam and McNelis, 2005; Exterkate et al., 2016; Chakraborty and Joseph, 2017; Heaton et al., 2017; Mullainathan and Spiess, 2017; Feng et al., 2018; Coulombe et al., 2019; Kelly et al., 2019; Medeiros et al., 2019), we adopt several non-linear dimension reduction techniques. The resulting latent factors are then linked to inflation in a second-stage regression. In this second-stage regression we allow for substantial flexibility. Specifically, we consider dynamic regression models that allow for time-varying parameters (TVPs) and stochastic volatility (SV). Since the inclusion of a relatively large number of latent factors can still imply a considerable number of parameters (and this problem is even more severe in the TVP regression case), we rely on state-of-the-art shrinkage techniques.

From an empirical standpoint it is necessary to investigate how these dimension reduction techniques perform over time and during different business cycle phases. We show this using a thorough real-time forecasting experiment for the US. Our forecasting application uses monthly real-time datasets (i.e., the FRED-MD database proposed in McCracken and Ng, 2016) and includes a battery of well-established models commonly used in central banks and other policy institutions to forecast inflation.

Our results show that dimension reduction techniques yield forecasts that are highly competitive to the ones obtained from using linear methods based on PCs.
At first glance, this shows that existing models already perform well and that using more sophisticated methods yields only modest gains in predictive accuracy. However, zooming into model performance over time reveals that controlling for non-linear relations in the data is of particular importance during recessionary episodes of the business cycle.

This finding gives rise to the second contribution of our paper. Since we find that more sophisticated non-linear dimension reduction methods outperform simpler techniques during recessions, we combine the considered models using dynamic model averaging (see Raftery et al., 2010; Koop and Korobilis, 2013). We show that combining our proposed set of models with a variety of standard forecasting models yields predictive densities which are superior to the single best performing model in overall terms. These effects are even more pronounced when interest centers on multi-step-ahead forecasting.

The remainder of this paper is structured as follows. Section 2 discusses a set of dimension reduction techniques. Section 3 introduces the econometric modeling environment that we use to forecast inflation. Section 4 provides the results of the forecasting horse race and introduces weighted combinations of the competing models, including the results of the forecast combinations. The last section summarizes and concludes the paper.
2 Dimension Reduction Techniques

Suppose that we are interested in predicting inflation using a large number of $K$ regressors that we store in a $T \times K$ matrix $X = (x_1, \ldots, x_T)'$, where $x_t$ denotes a $K$-dimensional vector of observations at time $t$. If $K$ is large relative to $T$, estimation of an unrestricted model that uses all columns in $X$ quickly becomes cumbersome and overfitting issues arise. As a solution, dimension reduction techniques are commonly employed (see, e.g., Stock and Watson, 2002; Bernanke et al., 2005). These methods strike a balance between model fit and parsimony. At a very general level, the key idea is to introduce a function $f$ that takes the matrix $X$ as input and yields a lower dimensional representation $Z = f(X) = (z_1, \ldots, z_T)'$, which is of dimension $T \times q$, as output. The critical assumption to achieve parsimony is that $K \gg q$. The latent factors in $Z$ are then linked to inflation through a dynamic regression model (see Section 3).

The function $f: \mathbb{R}^{T \times K} \to \mathbb{R}^{T \times q}$ is typically assumed to be linear, with the most prominent example being PCs. In this paper, we consider several choices of $f$ that range from linear to highly non-linear (such as manifold learning as well as deep learning algorithms) specifications. We subsequently analyze how these different specifications impact inflation forecasting accuracy. In the following subsections, we briefly discuss the different techniques and refer to the original papers for additional information.

2.1 Principal Component Analysis

Minor alterations of the main PCA algorithm allow for introducing non-linearities in two ways. First, we can introduce a non-linear function $g$ that maps the covariates onto a matrix $W = g(X)$. Second, we could alter the sample covariance matrix (the kernel) with a function $h$: $\kappa = h(W'W)$. Both $W$ and $\kappa$ form the two main ingredients of a general PCA reducing the dimension to $q$, as outlined below (for details, see Schölkopf et al., 1998).

Independent of the functional form of $g$ and $h$, we obtain PCs by performing a truncated singular value decomposition (SVD) of the transformed sample covariance matrix $\kappa$. Conditional on the first $q$ eigenvalues, the resulting factor matrix $Z$ is of dimension $T \times q$. These PCs, for appropriate $q$, explain the vast majority of variation in $X$. In the following, the relationship between the PCs and $X$ is:

$$ Z = f(X) = g(X)\,\Lambda(\kappa) = W \Lambda(\kappa), \qquad (1) $$

with $\Lambda(\kappa)$ being the truncated $K \times q$ eigenvector matrix of $\kappa$ (Stock and Watson, 2002). Notice that this is always conditional on deciding on a suitable number $q$ of PCs. The number of factors is a crucial parameter that strongly influences predictive accuracy and inference (Bai and Ng, 2002). In our empirical work, we consider a small ($q = 5$), moderate ($q = 15$), and large ($q = 30$) number of PCs. In the case of a large number of PCs, we use shrinkage to solve overparameterization concerns.

By varying the functional form of $g$ and $h$ we are now able to discuss the first set of linear and non-linear dimension reduction techniques belonging to the class of PCA (a short code sketch covering all three variants follows the list):
1. Linear PCs. The simplest way is to define both $g$ and $h$ as the identity function, resulting in $W = X$ and $\kappa = X'X$. Due to the linear link between the PCs and the data, PCA is very easy to implement and yields consistent estimators for the latent factors as $K$ and $T$ go to infinity (Stock and Watson, 2002; Bai and Ng, 2008). Even if there is some time-variation in the factor loadings, Stock and Watson (1999) show that principal components remain a consistent estimator for the factors if $K$ is large.
2. Squared PCs. The literature suggests several ways to overcome the linearity restriction of PCs. Bai and Ng (2008), for example, apply a quadratic link function between the latent factors and the regressors, yielding a more flexible factor structure. This method considers squaring the elements of $X$, resulting in

$$ W = X^{(2)} \quad \text{and} \quad \kappa = (X^{(2)})'(X^{(2)}), \qquad (2) $$

with $X^{(2)} = (X \odot X)$ and $\odot$ denoting element-wise multiplication. Squared PCs focus on the second moments of the covariate matrix and allow for a non-linear relationship between the principal components and the predictors. Bai and Ng (2008) show that quadratic variables can have substantial predictive power as they provide additional information on the underlying time series. Intuitively speaking, given that we transform our data to stationarity in the empirical work, this transformation strongly overweights situations characterized by sharp movements in the columns of $X$ (such as during a recession). By contrast, periods characterized by little variation in our macroeconomic panel are transformed to fluctuate mildly around zero (and thus carry little predictive content for inflation). In our empirical model, our regressions always feature lagged inflation, and this transformation thus effectively implies that in tranquil periods the model is close to an autoregressive model, whereas in crisis periods more information is introduced.
3. Kernel PCs. Another approach for non-linear PCs is kernel principal component analysis (KPCA). KPCA dates back to Schölkopf et al. (1998), who proposed using integral operator kernel functions to compute PCs in a non-linear manner. In essence, this amounts to implicitly applying a non-linear transformation of the data through a kernel function and then applying PCA on this transformed dataset. Such an approach has been used for forecasting in Giovannelli (2012) and Exterkate et al. (2016). We allow for non-linearities in the kernel function between the data and the factors by defining $h$ to be a Gaussian or a polynomial kernel $\kappa$ (which is $K \times K$), with the $(i,j)$th element given by

$$ \kappa_{ij} = \exp\left( -\frac{\lVert x_{\bullet i} - x_{\bullet j} \rVert^2}{c_1} \right) \qquad (3) $$

for a Gaussian kernel and

$$ \kappa_{ij} = \left( \frac{x_{\bullet i}' x_{\bullet j}}{c_2} + 1 \right)^2 \qquad (4) $$

for a polynomial kernel. Here, $W = X$ (i.e., $g$ is the identity function), $x_{\bullet i}$ and $x_{\bullet j}$ ($i, j = 1, \ldots, K$) denote two columns of $X$, while $c_1$ and $c_2$ are scaling parameters. As suggested by Exterkate et al. (2016), we set $c_1 = \sqrt{(K+2)/2}$ and $c_2 = \sqrt{c_K}/\pi$, with $c_K$ being the 95th percentile of the $\chi^2$ distribution with $K$ degrees of freedom.
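To fix ideas, the following R sketch computes the three PC variants for a generic standardized data matrix X. It follows the construction $Z = W\Lambda(\kappa)$ in Eq. (1); the helper names and the choice to use base R (rather than the authors' exact code, which is not published here) are illustrative assumptions.

```r
# Hedged sketch: linear, squared, and Gaussian-kernel PCs for a T x K matrix X
# of standardized, stationary covariates; q is the number of factors.
pc_linear  <- function(X, q = 5) prcomp(X)$x[, 1:q]       # W = X, kappa = X'X
pc_squared <- function(X, q = 5) prcomp(X * X)$x[, 1:q]   # W = X (.) X, Eq. (2)

pc_kernel <- function(X, q = 5, c1 = sqrt((ncol(X) + 2) / 2)) {
  D2    <- as.matrix(dist(t(X)))^2   # squared distances between columns of X
  kappa <- exp(-D2 / c1)             # Gaussian kernel, Eq. (3)
  Lam   <- eigen(kappa, symmetric = TRUE)$vectors[, 1:q, drop = FALSE]
  X %*% Lam                          # Z = W Lambda(kappa) with W = X, Eq. (1)
}
Z <- pc_kernel(scale(X))             # T x 5 factor matrix
```

Note that prcomp() centers the data internally, which is consistent with the standardization applied in our empirical work.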
2.2 Diffusion Maps

Diffusion maps, originally proposed in Coifman et al. (2005) and Coifman and Lafon (2006), are another set of non-linear dimension reduction techniques that retain local interactions between data points in the presence of substantial non-linearities in the data. The local interactions are preserved by introducing a random walk process. The random walk captures the notion that moving between similar data points is more likely than moving to points which are less similar. We assume that the weight function which determines the strength of the relationship between $x_{\bullet i}$ and $x_{\bullet j}$ is given by

$$ w(x_{\bullet i}, x_{\bullet j}) = \exp\left( -\frac{\lVert x_{\bullet i} - x_{\bullet j} \rVert^2}{c} \right), \qquad (5) $$

where $\lVert x_{\bullet i} - x_{\bullet j} \rVert$ denotes the Euclidean distance between $x_{\bullet i}$ and $x_{\bullet j}$ and $c$ is a tuning parameter set such that $w(x_{\bullet i}, x_{\bullet j})$ is close to zero except for $x_{\bullet i} \approx x_{\bullet j}$. Here, $c$ is determined by the median distance of the $k$-nearest neighbors of $x_{\bullet i}$, as suggested by Zelnik-Manor and Perona; $k$ is chosen by taking a small percentage of $K$ (i.e., 1%) such that it scales with the size of the dataset. (For an application to astronomical spectra, see Richards et al., 2009.)

The probability of moving from $x_{\bullet i}$ to $x_{\bullet j}$ is then simply obtained by normalizing:

$$ p_{i \to j} = \text{Prob}(x_{\bullet i} \to x_{\bullet j}) = \frac{w(x_{\bullet i}, x_{\bullet j})}{\sum_{j'} w(x_{\bullet i}, x_{\bullet j'})}. \qquad (6) $$

This probability tends to be small except for the situation where $x_{\bullet i}$ and $x_{\bullet j}$ are similar to each other. As a result, the probability that the random walk moves from $x_{\bullet i}$ to $x_{\bullet j}$ will be large if they are equal but rather small if both covariates differ strongly. Let $P$ denote a transition matrix of dimension $K \times K$ with $(i,j)$th element given by $p_{i \to j}$. The probability of moving from $x_{\bullet i}$ to $x_{\bullet j}$ in $n = 1, 2, \ldots$ steps is then simply given by the matrix power $P^n$, with typical element denoted by $p^n_{i \to j}$. Using a biorthogonal spectral decomposition of $P^n$ yields:

$$ p^n_{i \to j} = \sum_{s \ge 0} \lambda_s^n\, \psi_s(x_{\bullet i})\, \phi_s(x_{\bullet j}), \qquad (7) $$

with $\psi_s$ and $\phi_s$ denoting left and right eigenvectors of $P$, respectively. The corresponding eigenvalues are given by $\lambda_s$. We then proceed by computing the so-called diffusion distance as follows:

$$ \xi_n(x_{\bullet i}, x_{\bullet j}) = \sum_{u} \frac{\left( p^n_{i \to u} - p^n_{j \to u} \right)^2}{p_0(x_{\bullet u})}, \qquad (8) $$

with $p_0$ being a normalizing factor that measures the proportion of time the random walk spends at $x_{\bullet u}$. This measure turns out to be robust with respect to noise and outliers. Coifman and Lafon (2006) show that

$$ \xi_n(x_{\bullet i}, x_{\bullet j}) = \sum_{s=1}^{\infty} \lambda_s^{2n} \left( \psi_s(x_{\bullet i}) - \psi_s(x_{\bullet j}) \right)^2. \qquad (9) $$

This allows us to introduce the family of diffusion maps from $\mathbb{R}^K \to \mathbb{R}^q$ given by:

$$ \Xi_n(x_{\bullet i}) = \left[ \lambda_1^n \psi_1(x_{\bullet i}), \ldots, \lambda_q^n \psi_q(x_{\bullet i}) \right]'. \qquad (10) $$

The diffusion distance can then be approximated as:

$$ \xi_n(x_{\bullet i}, x_{\bullet j}) \approx \sum_{s=1}^{q} \lambda_s^{2n} \left( \psi_s(x_{\bullet i}) - \psi_s(x_{\bullet j}) \right)^2 = \lVert \Xi_n(x_{\bullet i}) - \Xi_n(x_{\bullet j}) \rVert^2. \qquad (11) $$

Intuitively, this equation states that we now approximate diffusion distances in $\mathbb{R}^K$ through the Euclidean distance between $\Xi_n(x_{\bullet i})$ and $\Xi_n(x_{\bullet j})$. This discussion implies that we have to choose $n$ and $q$, and we do this by setting $q = \{5, 15, 30\}$ according to our approach with either a small, moderate, or large number of factors, and $n = T$, the number of time periods. The algorithm in our application is implemented using the R package diffusionMap (Richards and Cannoodt, 2019).
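A minimal sketch using the diffusionMap package cited above. Embedding the T observations directly (rather than the K columns) to obtain a T x q factor matrix, and leaving the bandwidth at the package default, are assumptions made here for illustration.

```r
library(diffusionMap)  # package cited in the text (Richards and Cannoodt, 2019)
# Hedged sketch: diffusion-map factors from pairwise Euclidean distances.
D    <- as.matrix(dist(scale(X)))  # pairwise distances; the default eps.val
dmap <- diffuse(D, neigen = 5)     # plays the role of c in Eq. (5); q = 5
Z_dm <- dmap$X                     # leading diffusion coordinates
```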
2.3 Local Linear Embedding

Locally linear embeddings (LLE) have been introduced by Roweis and Saul (2000). Intuitively, the LLE algorithm maps a high dimensional input dataset $X$ into a lower dimensional space while being neighborhood-preserving. This implies that points which are close to each other in the original space are also close to each other in the transformed space.

The LLE algorithm is based on the assumption that each $x_{\bullet i}$ is sampled from some underlying manifold. If this manifold is well defined, each $x_{\bullet i}$ and its neighbors $x_{\bullet j}$ are located close to a locally linear patch of this manifold. One consequence is that each $x_{\bullet i}$ can, conditional on suitably chosen linear coefficients, be reconstructed from its neighbors $x_{\bullet j}$, $j \neq i$. This reconstruction, however, will be corrupted by measurement errors. Roweis and Saul (2000) introduce a cost function to quantify these errors:

$$ C(\Omega) = \sum_i \left\lVert x_{\bullet i} - \sum_j \omega_{ij} x_{\bullet j} \right\rVert^2, \qquad (12) $$

with $\Omega$ denoting a weight matrix with $(i,j)$th element given by $\omega_{ij}$. This cost function is then minimized subject to the constraint that each $x_{\bullet i}$ is reconstructed only from its neighbors. This implies that $\omega_{ij} = 0$ if $x_{\bullet j}$ is not a neighbor of $x_{\bullet i}$. The second constraint is that the matrix $\Omega$ is row-stochastic, i.e., its rows sum to one. Conditional on these two restrictions, the cost function can be minimized by solving a least squares problem.

To make this algorithm operational we need to define our notion of neighbors. In the following, we use the $k$-nearest neighbors in terms of the Euclidean distance. We choose the number of neighbors by applying the algorithm proposed by Kayo (2006), which automatically determines the optimal number for $k$. The $q$ latent factors in $Z$, with typical $i$th column $z_{\bullet i}$, are then obtained by minimizing:

$$ \Phi(Z) = \sum_i \left\lVert z_{\bullet i} - \sum_j \omega_{ij} z_{\bullet j} \right\rVert^2, \qquad (13) $$

which implies a quadratic form in $z_t$. Subject to suitable constraints, this problem can be easily solved by computing:

$$ M = (I_T - \Omega)'(I_T - \Omega), \qquad (14) $$

and finding the $q + 1$ eigenvectors of $M$ associated with the $q + 1$ smallest eigenvalues. The bottom eigenvector is then discarded to arrive at $q$ factors. For our application, we use the R package lle (Diedrich and Abel, 2012).

2.4 Isometric Feature Mapping

Isometric feature mapping (ISOMAP) is one of the earliest methods developed in the category of manifold learning algorithms. Introduced by Tenenbaum et al. (2000), the ISOMAP algorithm determines the geodesic distance on the manifold and uses multidimensional scaling to come up with a low number of factors describing the underlying dataset. Originally, ISOMAP was constructed for applications in visual perception and image recognition. In economics and finance, some recent papers highlight its usefulness (see, e.g., Ribeiro et al., 2008; Lin et al., 2011; Orsenigo and Vercellis, 2013; Zime, 2014).

The algorithm consists of three steps. In the first step, a dissimilarity index that measures the distance between data points is computed. These distances are then used to identify neighboring points on the manifold. In the second step, the algorithm estimates the geodesic distance between the data points as shortest path distances. In the third step, metric scaling is performed by applying classical multidimensional scaling (MDS) to the matrix of distances. For the dissimilarity transformation, we determine the distance between points $i$ and $j$ by the Manhattan index $d_{ij} = \sum_k |x_{ki} - x_{kj}|$ and collect those points where $i$ is one of the $k$-nearest neighbors of $j$ in a dissimilarity matrix. For our empirical application, we again choose the number of neighbors by applying the algorithm proposed by Kayo (2006) and use the algorithm implemented in the R package vegan (Oksanen et al., 2019).

The described non-linear transformation of the dataset enables the identification of a non-linear structure hidden in a high-dimensional dataset and maps it to a lower dimension. Instead of pairwise Euclidean distances, ISOMAP uses the geodesic distances on the manifold and compresses information under consideration of the global structure.
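A hedged sketch of both manifold learners using the two R packages cited above. The paper selects the number of neighbors via Kayo (2006); the fixed k = 20 below is an illustrative stand-in, and embedding the T observations is likewise an assumption.

```r
library(lle)    # cited above (Diedrich and Abel, 2012)
library(vegan)  # cited above (Oksanen et al., 2019)
Xs      <- scale(X)                                # standardized T x K matrix
out_lle <- lle(Xs, m = 5, k = 20, reg = 2)         # q = 5 embedding, Eq. (13)
Z_lle   <- out_lle$Y                               # T x 5 LLE factors

d_man <- dist(Xs, method = "manhattan")            # Manhattan dissimilarities
Z_iso <- isomap(d_man, ndim = 5, k = 20)$points    # geodesic distances + MDS
```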
2.5 Autoencoders

Deep learning algorithms are characterized by not only non-linearly converting input to output but also by representing the input itself in a transformed way. This is called representation learning, in the sense that representations of the data are expressed in terms of other, simpler representations before mapping the data input to output values.

One tool which performs both representation of the input and representation to output is the autoencoder (AE). The first step is accomplished by the encoder function, which maps an input to an internal representation. The second part, which maps the representation to the output, is called the decoder function. Their ability to extract factors which largely explain the variability of the observed data in a non-linear manner makes deep learners a powerful tool complementing the range of commonly used dimension reduction techniques (Goodfellow et al., 2016). In empirical finance, Heaton et al. (2017), Feng et al. (2018), and Kelly et al. (2019) show that the application of these methods is beneficial for predicting asset returns.

Based on deep learning techniques, we propose obtaining hierarchical predictors $Z$ by applying a number of $l \in \{1, \ldots, L\}$ non-linear transformations to $X$. The non-linear transformations are also called hidden layers, with $L$ giving the depth of our architecture and $f_1, \ldots, f_L$ denoting univariate activation functions for each layer. More specifically, activation functions (non-linearly) transform data in each layer, taking the output of the previous layer. A common choice is the hyperbolic tangent (tanh), given by $\frac{\exp(X) - \exp(-X)}{\exp(X) + \exp(-X)}$, justified by several findings in recent studies such as Saxe et al. (2019) or Andreini et al. (2020).

The structure of our deep learning algorithm can be represented in the form of a composition of univariate semi-affine functions given by

$$ f_l^{W^{(l)}, b_l} = f_l\left( \sum_{i=1}^{N_l} W^{(l)}_{\bullet i}\, \hat{x}^{(l)}_{\bullet i} + b_l \right), \quad 1 \le l \le L, \qquad (15) $$

with $W^{(l)}$ denoting a weighting matrix associated with layer $l$ (with $W^{(l)}_{\bullet i}$ denoting the $i$th column of $W^{(l)}$), $\hat{x}^{(l)}_{\bullet i}$ denoting the $i$th column of the input matrix $\hat{X}^{(l)}$ to layer $l$, $b_l$ the corresponding bias term, and $N_l$ the number of neurons that determines the width of the network. Notice that if $l = 1$, $\hat{X}^{(1)} = X$, while for deeper layers the input matrix is obtained recursively by using the activation functions.

The lower dimensional representation of our covariate matrix is then obtained by computing the composite map:

$$ Z = f(X) = \left( f_1^{W^{(1)}, b_1} \circ \cdots \circ f_L^{W^{(L)}, b_L} \right)(X). \qquad (16) $$

The optimal sets of $\hat{W} = (\hat{W}^{(1)}, \ldots, \hat{W}^{(L)})$ and $\hat{b} = (\hat{b}_1, \ldots, \hat{b}_L)$ are obtained by minimizing a loss function, most commonly the mean squared error of the in-sample fit. The complexity of the neural network is determined by choosing the number of hidden layers $L$ and the number of neurons in each layer $N_l$. We create five hidden layers with the number of neurons evenly downsizing to the desired number of factors. Corresponding to the standard literature (see, e.g., Huang, 2003; Heaton, 2008), a huge number of covariates requires a more complex structure (i.e., a higher number of hidden layers). Furthermore, it is recommended to set the number of neurons between the size of the input and the output layer, where $N_l$ is high in the first hidden layer and smaller in the following layers. We employ the R interface to keras (Allaire and Chollet, 2019), a high-level neural networks API and widely used package for implementing deep learning models.
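A compressed sketch of such an autoencoder in the R interface to keras cited above. The layer widths, epoch count, and batch size are illustrative assumptions and only loosely mirror the five-hidden-layer architecture described in the text.

```r
library(keras)  # R interface cited in the text (Allaire and Chollet, 2019)
# Hedged sketch: tanh autoencoder with a q = 5 bottleneck for a T x K matrix X.
K_dim  <- ncol(X)
input  <- layer_input(shape = K_dim)
code   <- input %>%
  layer_dense(units = 64, activation = "tanh") %>%
  layer_dense(units = 16, activation = "tanh") %>%
  layer_dense(units = 5,  activation = "tanh", name = "bottleneck")
output <- code %>%
  layer_dense(units = 16, activation = "tanh") %>%
  layer_dense(units = 64, activation = "tanh") %>%
  layer_dense(units = K_dim)                      # linear reconstruction layer
ae <- keras_model(input, output)
ae %>% compile(optimizer = "adam", loss = "mse")  # MSE in-sample loss
ae %>% fit(X, X, epochs = 50, batch_size = 32, verbose = 0)
encoder <- keras_model(input, get_layer(ae, "bottleneck")$output)
Z_ae    <- predict(encoder, X)                    # T x 5 non-linear factors
```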
3 Econometric Framework

In the following, we introduce the predictive regression that links our target variable (US inflation) to $Z$ and $p$ lags of inflation. Following Stock and Watson (1999), inflation is specified such that:

$$ y_{t+h} = \ln\left(\frac{CPI_{t+h}}{CPI_t}\right) - \ln\left(\frac{CPI_t}{CPI_{t-1}}\right), \qquad (17) $$

with $CPI_{t+h}$ denoting the consumer price index in period $t + h$. In the empirical application we set $h \in \{1, 3, 12\}$. $y_{t+h}$ is then modeled using a dynamic regression model:

$$ y_{t+h} = d_t' \beta_{t+h} + \epsilon_{t+h}, \quad \epsilon_{t+h} \sim \mathcal{N}(0, \sigma^2_{t+h}), \qquad (18) $$

where $\beta_{t+h}$ is a vector of TVPs associated with $M\, (= q + p)$ covariates denoted by $d_t$, and $\sigma^2_{t+h}$ is a time-varying error variance. $d_t$ might include the latent factors extracted from the various methods discussed in the previous section, lags of inflation, an intercept term, or other covariates which are not compressed.

Following much of the literature (Taylor, 1982; Belmonte et al., 2014; Kalli and Griffin, 2014; Kastner and Frühwirth-Schnatter, 2014; Stock and Watson, 2016; Chan, 2017; Huber et al., 2020), we assume that the TVPs and the error variances evolve according to independent stochastic processes:

$$ \begin{pmatrix} \beta_{t+h} \\ \log \sigma^2_{t+h} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \beta_{t+h-1} \\ \mu_h + \rho_h \log \sigma^2_{t+h-1} \end{pmatrix}, \begin{pmatrix} V & 0 \\ 0 & \vartheta_h \end{pmatrix} \right), \qquad (19) $$

with $\mu_h$ denoting the conditional mean of the log-volatility, $\rho_h$ its persistence parameter, and $\vartheta_h$ the error variance of $\log \sigma^2_{t+h}$. The matrix $V$ is an $M \times M$-dimensional variance-covariance matrix with $V = \text{diag}(v_1, \ldots, v_M)$ and $v_j$ being the process innovation variance that determines the amount of time-variation in $\beta_{t+h}$. This setup implies that the TVPs are assumed to follow a random walk process while the log-volatilities evolve according to an AR(1) process.

The model described by Eqs. (18) and (19) is a flexible state space model that encompasses a wide range of models commonly used for forecasting inflation. For instance, if we set $V = 0_M$ and $\vartheta_h = 0$, we obtain a constant parameter model. If $d_t$ includes the lags of inflation and (lagged) PCs, we obtain a model closely related to the one used in Stock and Watson (2002). If we set $d_t = 1$ and allow for TVPs, we obtain a model very closely related to the unobserved components stochastic volatility (UC-SV) model successfully adopted in Stock and Watson (1999). A plethora of other models can be identified by appropriately choosing $d_t$, $V$, and $\vartheta_h$. This flexibility, however, calls for model selection. We select appropriate submodels by using Bayesian methods for estimation and forecasting. These techniques are further discussed in Appendix A and allow for data-based shrinkage towards simpler nested alternatives.

4 Forecasting US Inflation in Real Time

4.1 Data and Forecasting Design

For the empirical application, we consider the popular FRED-MD database. This dataset is publicly accessible and available in real-time. The monthly data vintages ensure that we only use information that would have been available at the time a given forecast is being produced. A detailed description of the database can be found in McCracken and Ng (2016). To achieve approximate stationarity we transform the dataset as given in Appendix B. Furthermore, each time series is standardized to have sample mean zero and unit sample variance prior to using the non-linear dimension reduction techniques.

Our US dataset includes 105 monthly variables that span the period from 1963:01 to 2019:06. The forecasting design relies on a rolling window, as justified by Clark (2011), that initially ranges from 1980:01 to 1999:12. For each month of the hold-out sample, which starts in 2000:01 and ends in 2018:12, we compute the $h$-step-ahead predictive distribution for each model (for $h \in \{1, 3, 12\}$), keeping the length of the estimation sample fixed at 240 observations (i.e., a rolling window of 20 years).

One key limitation is that all methods are specified conditionally on $d_t$ and thus implicitly on the specific function $f$ used to move from $X$ to $Z$. Another key objective of this paper is therefore to control for uncertainty with respect to $f$ by using dynamic model averaging techniques. For obtaining predictive combinations, we use the first 24 observations of our hold-out sample. The remaining periods (i.e., ranging from 2002:01 to 2018:12) then constitute our evaluation sample. For these periods we contrast each forecast (including the combined ones) with the realization of inflation in the final vintage of 2019:06. With such a strategy we aim at minimizing the risk that realized inflation, especially at the end of the evaluation sample, is still subject to revisions itself. (In general, the literature argues that most data revisions take place in the first quarter, while afterwards the vintages remain relatively unchanged; see Croushore, 2011; Pfarrhofer, 2020. A gap of six months between the final observation of inflation in the evaluation sample (2018:12) and our final vintage (2019:06) is therefore considered enough to render the evaluation valid.)

In terms of competing models, we can classify the specifications along two dimensions:
1. How $d_t$ is constructed. First and importantly, let $s_t$ denote a $\tilde{K}$-dimensional vector of covariates excluding $y_t$. $x_t = (s_t', \ldots, s_{t-p+1}')'$ is then composed of $p$ lags of $s_t$, with $K = p\tilde{K}$. In our empirical work we set $p = 12$ and include all variables in the dataset (except for the CPI series, i.e., $\tilde{K} = 104$). We then use the different dimension reduction techniques outlined in Section 2 to estimate $z_t$. Moreover, we add $p$ lags of $y_t$ to $z_t$ (a short sketch of how $d_t$ is assembled follows this list). This serves to investigate how different dimension reduction techniques perform when interest centers on predicting inflation. Moreover, we also consider simple AR(12) models as well as extended Phillips curve models (see, e.g., De Mol et al., 2008; Stock and Watson, 2008; Koop and Korobilis, 2012; Hauzenberger et al., 2019) as additional competitors. For the estimation of the extended Phillips curve model we select 20 covariates such that various economic sectors are covered, e.g.:
   • real activity: industrial production (INDPRO), real personal income (W875RX1), housing (HOUST, PERMIT), capacity utilization (CUMFNS), etc.
   • labor market: unemployment rates (UNRATE, CLAIMSx), employment (PAYEMS), average weekly hours of production (CES0600000007), etc.
   • price indices: producer price index (PPICMM)
   • others: the federal funds rate (FEDFUNDS), money supply (M2REAL), 3-month (TB3MS) and 10-year (GS10) treasuries, etc.
   Details can be found in Appendix B.
2. The relationship between $d_t$ and $y_{t+h}$. The second dimension along which our models differ is the specific relationship described by Eq. (18). To investigate whether non-linear dimension reduction techniques are sufficient to control for unknown forms of non-linearities, we benchmark all our models that feature TVPs against their respective constant parameter counterparts. To perform model selection we consider two priors. The first one is the Horseshoe (HS) prior (Carvalho et al., 2010) and the second one is the stochastic search variable selection (SSVS) prior outlined in George and McCulloch (1993).
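To make the construction in item 1 concrete, the following R sketch assembles $d_t$ for a generic covariate matrix S and an inflation series y built as in Eq. (17). The use of linear PCA for the compression step and all variable names are illustrative assumptions; any $f(\cdot)$ from Section 2 could be substituted.

```r
# Hedged sketch: stack p = 12 lags of the covariates, compress them to z_t,
# and append p lags of inflation plus an intercept to form d_t.
p  <- 12
Xl <- embed(as.matrix(S), p)       # rows: x_t = (s_t', ..., s_{t-p+1}')'
Z  <- prcomp(scale(Xl))$x[, 1:5]   # z_t via linear PCA (any f from Section 2)
Yl <- embed(y, p)                  # p lags of inflation, aligned with Xl
D  <- cbind(1, Z, Yl)              # d_t: intercept, factors, inflation lags

Tn <- length(y)
h  <- 1                            # forecast horizon
yh <- y[(p + h):Tn]                # target y_{t+h} for t = p, ..., T - h
Dh <- D[1:(Tn - p - h + 1), ]      # regressors aligned with the target
```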
4.2 A First Look at the Latent Factors

In this subsection we briefly discuss what the factors obtained from the different dimension reduction techniques look like. For exposition, we choose $q = 5$ factors. Panels (a) to (h) in Figure 1 show the different factors and reveal remarkable differences across the methods used to compress the data. Considering the different variants of the PCs suggests that the factors behave quite similarly and exhibit rather persistent behavior. This, however, does not hold for the case of squared PCA. In this case, the factors show sharp spikes during the global financial crisis. This is not surprising, since squaring the input dataset, which has been transformed for stationarity, strongly amplifies sharp movements when mapping $x_t$ to $z_t$.

A similar pattern arises for LLE (see panel (d)). In this case, some of the factors behave similarly to a regime-switching process with a moderate number of regimes. For instance, the dark gray line behaves similarly to the PCs during the first few years of the sample. It then strongly decreases in the midst of the 1980s before returning to values observed in the beginning of the sample. Then, in the first half of the 1990s, we observe a strong increase (reaching a peak of around 5) before the factor quickly reverts back to the previous regime. This regime stays in place from 1996 to around 2003. Then we again find that dynamics change and the corresponding factor increases in the run-up to the global financial crisis. Similar patterns can be found for the other factors obtained from using LLE to compress the input data.

Considering ISOMAP shows that the first few factors appear to be highly persistent. These factors look very smooth for some periods but seem to exhibit oscillating behavior during other time periods. The intensity of these cycles, however, is small. The final few factors are fully characterized by these oscillating dynamics.

This brief discussion shows that the non-linear dimension reduction techniques yield broadly similar results, but with distinct dynamics. Some of them (especially the autoencoder) pick up a lot of high frequency movements. These movements might be irrelevant for modeling inflation dynamics but could nevertheless carry relevant information during certain periods in time. A similar argument applies to the other techniques, which also yield factors that change their behavior over time.

4.3 Forecasting Results

We now consider the point and density forecasting performance of the different models and dimension reduction techniques. The forecast performance is evaluated through averaged log predictive likelihoods (LPLs) for density forecasts and root mean squared errors (RMSEs) for point forecasts. Superior models are those with high scores in terms of LPL and low values in terms of RMSE.
Formal descriptions of the evaluation metrics are provided in Appendix A. We benchmark all models relative to the autoregressive (AR) model with constant parameters and the HS prior. The first entry in the tables gives the actual LPL score (in averages) with actual RMSEs in parentheses for our benchmark model. The remaining entries are relative LPLs with relative RMSEs in parentheses.

Starting with the one-step-ahead horizon, Table 1 shows the relative LPLs and RMSEs (in parentheses) for inflation forecasts. This table suggests that, in terms of density forecasts, using dimension reduction techniques (both linear and non-linear) and allowing for non-linearities between the factors and inflation improves density forecasts substantially. This does not carry over to point forecasts. When we consider relative RMSEs, only small improvements are obtained by using more sophisticated modeling techniques.

Comparing linear to non-linear dimension reduction methods suggests that forecasts can be further improved. In particular, we observe that, among the different reduction techniques, squared PCA performs well. One explanation for this might relate to the fact that simple models such as a random walk or other univariate benchmarks are hard to beat in a real-time forecasting exercise (see Atkeson et al., 2001; Stock and Watson, 2008; Stella and Stock, 2013). When taking a closer look at panel (h) of Figure 1, we see that the factors are close to zero in tranquil periods while, at the same time, showing substantial movements in times of turmoil. Conditional on relatively small regression coefficients in Eq. (18), this pattern suggests that the forecast densities are close to the ones obtained from a random walk model. But in recessionary episodes, the factors convey information on the level and volatility of inflation that might be useful for predicting during crisis periods (see, e.g., Chan, 2017; Huber and Pfarrhofer, 2020).

When we consider the different specifications for the observation equation, we find that allowing for time-variation in the parameters improves one-step-ahead predictive densities. These improvements appear to be substantial for all specifications except the model using squared PCA. For squared PCA, we find only limited differences between constant and TVP regressions (conditional on the specific prior). The single best performing model for the one-step-ahead inflation forecasts is the TVP model with a Horseshoe prior and five factors obtained by using squared PCA.

Again, the strong differences in predictive accuracy between constant and TVP specifications arise from the necessity to discriminate between different stages of the business cycle. The somewhat smaller differences in the case of squared PCA are driven by the specific shape of the latent factors and the reason outlined in the previous paragraph.

Next, we inspect the longer forecast horizons in greater detail. Table 2 depicts the forecast performance of all competitors one-quarter and one-year ahead. The table indicates that non-linear dimension reduction techniques clearly outperform the autoregressive benchmark and perform similarly to the linear PCAs. The results reveal that diffusion maps, isometric feature mapping, and squared PCA in combination with time variation in the coefficients yield high LPLs. Here, again, the best performing model is squared PCA, which beats all other dimension reduction techniques irrespective of the prior structure or whether constant or time-varying parameters are considered.
For point forecasts, we again find little differences relative to the univariate benchmark model.
Figure 1: Illustration of linear and non-linear dimension reduction techniques applied to our US dataset with $\tilde{K} = 104$, based on the last vintage (end of 2018). Focussing on $q = 5$, we depict normalized factors with mean zero and variance one, ranging from January 1980 to December 2018. Panels: (a) Autoencoder; (b) Diffusion Maps; (c) ISOMAP; (d) LLE; (e) PCA Gaussian kernel; (f) PCA linear; (g) PCA polynomial kernel; (h) PCA squared.

Table 1: One-month ahead forecast performance.
Specification                 const. (HS)     const. (SSVS)   TVP (HS)       TVP (SSVS)
AR                            -336.98 (1.18)    0.40 (1.01)   15.57 (1.00)   19.69 (1.01)
Autoencoder (q = 5)              1.67 (1.00)    4.64 (–)      13.71 (1.00)   22.51 (–)
Autoencoder (q = 15)             1.00 (1.00)    2.88 (1.01)   10.79 (1.01)   14.00 (1.05)
Autoencoder (q = 30)             2.32 (1.00)    0.31 (1.01)   12.93 (1.00)   12.97 (1.06)
Diffusion Maps (q = 5)           2.57 (1.00)    1.14 (1.01)   13.81 (1.01)   15.59 (1.12)
Diffusion Maps (q = 15)          0.71 (1.00)    2.92 (1.01)   13.54 (1.00)   17.26 (1.06)
Diffusion Maps (q = 30)          2.28 (1.00)    3.14 (1.02)   14.44 (1.00)   -0.36 (1.15)
Extended PC                     11.25 (–)      15.73 (1.07)       –              –
ISOMAP (q = 5)                   0.99 (1.00)   -0.58 (1.01)   10.80 (1.00)   19.21 (1.01)
ISOMAP (q = 15)                  0.06 (1.00)    1.30 (1.01)    9.71 (1.01)   18.86 (1.02)
ISOMAP (q = 30)                 -1.18 (1.00)    2.38 (1.01)    9.73 (1.02)   20.37 (1.03)
LLE (q = 5)                      0.18 (1.00)   -1.83 (1.01)   13.81 (–)      19.75 (1.01)
LLE (q = 15)                    -2.02 (1.01)    0.05 (1.01)   11.64 (1.00)   19.06 (1.01)
LLE (q = 30)                    -1.11 (1.00)   -3.63 (1.01)    6.71 (1.01)   19.68 (1.01)
PCA gauss. kernel (q = 5)       -0.74 (1.00)    0.67 (1.01)   13.69 (1.00)   15.85 (1.05)
PCA gauss. kernel (q = 15)      -0.20 (1.00)    2.65 (1.01)   14.49 (1.01)   11.27 (1.17)
PCA gauss. kernel (q = 30)       0.28 (1.00)    6.86 (1.01)   15.78 (1.01)   -5.34 (1.30)
PCA linear (q = 5)              -0.80 (1.00)    0.51 (1.01)   11.48 (1.01)   18.95 (1.03)
PCA linear (q = 15)             -0.51 (1.01)    2.32 (1.01)   12.56 (1.02)   18.95 (1.04)
PCA linear (q = 30)              0.27 (1.01)    7.05 (1.00)   16.46 (1.02)       – (1.03)
PCA poly. kernel (q = 5)         1.86 (1.00)   -0.39 (1.01)   12.52 (1.00)   15.02 (1.05)
PCA poly. kernel (q = 15)       -0.11 (1.00)    2.78 (1.01)   15.56 (1.00)   11.82 (1.18)
PCA poly. kernel (q = 30)        0.64 (1.00)    4.44 (1.01)   16.10 (1.01)    0.59 (1.22)
PCA squared (q = 5)             16.79 (–)          –              –              –
PCA squared (q = 15)                –              –              –              –
PCA squared (q = 30)                –              –              –              –
Note: The first (red shaded) entry gives the actual average LPL, with the actual RMSE in parentheses, for our benchmark model, the autoregressive (AR) model with constant parameters and the HS prior. All other entries are relative LPLs with relative RMSEs in parentheses. Missing entries are denoted by '–'.

Table 2: One-quarter and one-year ahead forecast performance.
One-quarter ahead:

Specification                 const. (HS)     const. (SSVS)   TVP (HS)       TVP (SSVS)
AR                            -383.12 (1.31)   13.10 (0.99)   23.66 (1.00)   31.64 (1.03)
Autoencoder (q = 5)              1.02 (1.00)   17.96 (1.00)   26.39 (0.99)   36.60 (–)
Autoencoder (q = 15)             0.34 (1.00)   10.34 (1.00)   21.68 (1.00)   34.66 (1.06)
Autoencoder (q = 30)             0.00 (1.00)   19.77 (1.00)   19.29 (1.00)   33.63 (1.09)
Diffusion Maps (q = 5)           1.09 (1.00)   17.18 (0.99)   25.75 (0.99)   40.24 (1.13)
Diffusion Maps (q = 15)         -1.56 (–)      18.56 (–)      25.54 (1.00)       – (1.03)
Diffusion Maps (q = 30)             –              –              –              –
Extended PC                         –              –              –              –
ISOMAP (q = 5/15/30)                –              –              –              –
LLE (q = 5)                     -2.94 (1.00)    8.09 (1.00)   24.14 (1.00)   33.60 (1.03)
LLE (q = 15)                    -7.05 (1.00)   10.62 (1.00)   16.88 (1.00)   33.67 (1.03)
LLE (q = 30)                    -6.21 (1.00)    8.25 (1.00)   15.55 (1.00)   31.47 (1.04)
PCA gauss. kernel (q = 5)        2.83 (1.00)   14.65 (0.99)   24.91 (1.00)   32.19 (1.08)
PCA gauss. kernel (q = 15)       0.85 (1.00)   18.40 (1.00)   21.89 (1.00)   27.12 (1.34)
PCA gauss. kernel (q = 30)       4.74 (1.00)   18.56 (1.00)   27.43 (1.00)    9.82 (1.55)
PCA linear (q = 5)               1.06 (1.00)   12.45 (1.00)   20.24 (1.00)   34.72 (1.04)
PCA linear (q = 15)              4.74 (1.00)   16.12 (1.00)   22.90 (1.01)   37.77 (1.05)
PCA linear (q = 30)              7.76 (0.99)   21.16 (0.99)   22.94 (1.01)   45.40 (1.04)
PCA poly. kernel (q = 5)         2.32 (1.00)   10.80 (1.00)   21.68 (1.00)   34.27 (1.09)
PCA poly. kernel (q = 15)       -1.33 (1.00)   14.52 (1.00)   23.42 (1.00)   31.02 (1.24)
PCA poly. kernel (q = 30)        1.68 (1.00)   19.78 (0.99)   23.15 (–)      23.30 (1.36)
PCA squared (q = 5)                 – (1.03)       – (1.06)       – (1.01)       – (2.60)
PCA squared (q = 15)            51.10 (1.04)   54.01 (1.03)   57.09 (1.03)   32.56 (3.18)
PCA squared (q = 30)            48.84 (1.04)   52.21 (1.03)   60.12 (1.04)   23.10 (3.35)

One-year ahead:

Specification                 const. (HS)     const. (SSVS)   TVP (HS)       TVP (SSVS)
AR                            -408.15 (1.41)    8.87 (1.01)   16.52 (1.00)   26.25 (1.01)
Autoencoder (q = 5)              0.51 (1.01)   11.24 (1.00)   15.55 (–)      34.21 (–)
Autoencoder (q = 15)             3.44 (1.00)   12.95 (1.01)   15.60 (1.00)   39.35 (1.02)
Autoencoder (q = 30)            -1.54 (1.00)   12.97 (1.00)   12.55 (1.00)   37.53 (1.04)
Diffusion Maps (q = 5)          -0.43 (1.00)   17.39 (1.00)   16.16 (1.00)   29.34 (1.48)
Diffusion Maps (q = 15)             – (1.01)       – (1.01)       – (1.01)       – (1.04)
Diffusion Maps (q = 30)             –              –              –              –
Extended PC                         –              –              –              –
ISOMAP (q = 5/15/30)                –              –              –              –
LLE (q = 5)                     -1.86 (1.00)    5.70 (1.01)   11.68 (1.00)   27.44 (1.01)
LLE (q = 15)                     1.02 (1.00)    4.45 (1.01)    9.70 (1.00)   26.66 (1.01)
LLE (q = 30)                    -4.56 (1.00)    4.14 (1.00)    8.30 (1.00)   29.24 (1.01)
PCA gauss. kernel (q = 5)        1.53 (1.00)   10.61 (1.00)   14.64 (1.00)   31.25 (1.11)
PCA gauss. kernel (q = 15)      -2.78 (1.01)   15.55 (1.01)   15.88 (1.05)   27.95 (1.32)
PCA gauss. kernel (q = 30)      -2.27 (1.01)   19.62 (1.02)   11.78 (1.05)   16.79 (1.51)
PCA linear (q = 5)               0.61 (1.01)   10.96 (1.01)   18.44 (1.00)   30.69 (1.02)
PCA linear (q = 15)             -0.79 (1.01)   16.83 (1.01)   18.19 (1.03)   36.97 (1.09)
PCA linear (q = 30)              2.52 (1.00)   22.03 (–)      16.49 (1.03)   42.25 (1.10)
PCA poly. kernel (q = 5)         4.74 (1.00)   13.20 (1.01)   15.21 (1.00)   34.53 (1.08)
PCA poly. kernel (q = 15)        2.37 (1.00)   11.31 (1.01)   16.09 (1.01)   32.05 (1.25)
PCA poly. kernel (q = 30)       -0.07 (1.00)   15.35 (1.00)   15.37 (1.02)    8.87 (1.48)
PCA squared (q = 5)                 – (–)          – (1.02)       – (1.03)       – (3.36)
PCA squared (q = 15)            55.54 (1.00)   68.09 (1.25)   67.21 (1.05)   28.52 (4.32)
PCA squared (q = 30)            62.97 (0.99)   69.93 (1.25)   70.63 (1.05)   19.67 (4.62)
Note: The first (red shaded) entry gives the actual average LPL, with the actual RMSE in parentheses, for our benchmark model, the autoregressive (AR) model with constant parameters and the HS prior. All other entries are relative LPLs with relative RMSEs in parentheses. Missing entries are denoted by '–'.

So far, the LPLs are averaged over the full evaluation sample and thus only measure model quality over the full hold-out period (Geweke and Amisano, 2010). However, this might mask important differences in the forecast performance of the different models and compression techniques over time. Figure 2 depicts the average LPLs along the hold-out sample for the short-run forecasting exercise. The figure suggests a great deal of performance variation over time. Regardless of the model specification and the number of factors included in the models, accounting for instabilities in the relationship between the factors and inflation through time-varying parameters improves the forecasting performance. Especially during the global financial crisis (the gray shaded area), more flexible model specifications yield greater improvements relative to the univariate benchmark and compared to constant specifications.

4.4 Forecast Combination

The final paragraph of the previous subsection showed that model performance varies considerably over time. The key implication is that non-linear compression techniques are useful during turbulent times, whereas the forecast evidence is less pronounced in normal times. In this subsection, we ask whether combining models in a dynamic manner further improves predictive accuracy.

After having obtained the predictive densities of $y_{t+h}$ for the different dimensionality reduction techniques and model specifications, the goal is to exploit the advantages of both linear and non-linear approaches. This is achieved by combining models in a model pool such that better performing models over certain periods receive larger weights while inferior models are subsequently down-weighted. The literature on forecast combinations suggests several different weighting schemes, ranging from simply averaging over all models (see, e.g., Hendry and Clements, 2004; Hall and Mitchell, 2007; Clark and McCracken, 2010; Berg and Henzel, 2015) to estimating weights based on the models' performance according to the minimization of an objective or loss function (see, e.g., Timmermann, 2006; Hall and Mitchell, 2007; Geweke and Amisano, 2011; Conflitti et al., 2015; Pettenuzzo and Ravazzolo, 2016) or according to the posterior probabilities of the predictive densities (see, e.g., Raftery et al., 2010; Koop and Korobilis, 2012; Beckmann et al., 2020). Since the weights might change over time, we aim to compute them in a dynamic manner.

Combining the different predictive densities according to their posterior probabilities is referred to as Bayesian model averaging (BMA). The resulting weights are capable of reflecting the predictive power of each model for the respective periods. Dynamic model averaging (DMA), as specified by Raftery et al. (2010), extends the approach by adding a discount (or forgetting) factor to control for a model's forecasting performance in the recent past. The 'recent past' is determined by the discount factor, with higher values attaching greater importance to past forecasting performances of the model and lower values gradually ignoring the results of past predictive densities. Similar to Beckmann et al. (2020), Koop and Korobilis (2012), and Raftery et al.
(2010), we apply DMA to combine the predictive densities of our various models.
Figure 2: Evolution of one-month ahead cumulative LPBFs relative to the benchmark. Panels: (a) no dimension reduction (AR(p) and extended PC); (b) q = 5; (c) q = 15; (d) q = 30, each shown for the const. (HS), const. (SSVS), TVP (HS), and TVP (SSVS) specifications. The red dashed lines refer to the maximum/minimum Bayes factor over the full hold-out sample. The light gray shaded areas indicate NBER recessions in the US.

DMA works as follows. Let $\varrho_{t+h|t+h} = (\varrho_{t+h|t+h,1}, \ldots, \varrho_{t+h|t+h,J})'$ denote a set of weights for $J$ competing models. These (horizon-specific) weights vary over time and depend on the recent predictive performance of the models according to:

$$ \varrho_{t+h|t,j} = \frac{\varrho_{t|t,j}^{\delta}}{\sum_{l=1}^{J} \varrho_{t|t,l}^{\delta}}, \qquad (20) $$

$$ \varrho_{t+h|t+h,j} = \frac{\varrho_{t+h|t,j}\; p_j(y_{t+h} \mid y^t)}{\sum_{l=1}^{J} \varrho_{t+h|t,l}\; p_l(y_{t+h} \mid y^t)}, \qquad (21) $$

where $p_j(y_{t+h} \mid y^t)$ denotes the $h$-step-ahead predictive distribution of model $j$ and $\delta \in (0, 1]$ is a discount (forgetting) factor. In our application, we set $\delta = 0.9$. Notice that if $\delta = 1$ we obtain standard BMA weights, while $\delta = 0$ would imply that the weights depend exclusively on the forecasting performance in the last period.
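A compact sketch of this two-step updating in R; the function name and the example predictive likelihoods are ours, while $\delta = 0.9$ follows the text.

```r
# Hedged sketch of Eqs. (20)-(21): discount past performance with delta,
# then re-weight by the realized predictive likelihoods p_j(y_{t+h} | y^t).
dma_update <- function(w, pl, delta = 0.9) {
  w_pred <- w^delta / sum(w^delta)   # Eq. (20): forgetting / flattening step
  w_post <- w_pred * pl              # Eq. (21): Bayesian update ...
  w_post / sum(w_post)               # ... followed by normalization
}
# Usage: start from equal weights over J models, update period by period.
w <- rep(1 / 3, 3)
w <- dma_update(w, pl = c(0.21, 0.48, 0.05))   # pl: predictive likelihoods
```

Setting delta = 1 in this function reproduces standard BMA updating, while values below one gradually discount older forecast performance.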
Weights obtained by combining models according to their predictive power convey useful information about the adequacy of each model over time. In order to get a comprehensive picture of the effects of different model modifications, we combine our models and model specifications in various ways.

Table 3 presents the forecasting results when we use DMA to combine models. Again, all models are benchmarked to the AR model with constant parameters and the HS prior. The first row depicts the relative performance of the best performing single model for the chosen time horizon.

The table can be understood as follows. Each entry includes all dimension reduction techniques. The rows define whether the model space includes all factors $q \in \{5, 15, 30\}$ or whether we combine models with a fixed number of factors exclusively. The columns refer to model spaces which include only constant parameter, only time-varying parameter, or both specifications in the respective model pool. Since we also discriminate between two competing priors, we also consider model weights if we condition on either the HS or the SSVS prior or average across both prior specifications (the upper part of the table with {HS, SSVS}).

Across all three forecast horizons considered, we again find only limited accuracy improvements for point forecasts relative to the AR model. This, however, does not carry over to LPLs. For density forecasts, we find that DMA-based combinations improve upon the single best performing model for all forecast horizons. Hence, allowing models to change over the hold-out period leads to superior predictive accuracy.

Table 3: Forecast performance of predictive combinations.
Single best performing model: 30.15 (0.99) one-month ahead; 65.09 (0.98) one-quarter ahead; 80.56 (0.95) one-year ahead.

One-month ahead:

Prior        Combination         Const.          TVP             {const., TVP}
{HS, SSVS}   q = {5, 15, 30}        –               –               –
HS           q = {5, 15, 30}        –               –               –
HS           q = 5               17.97 (1.00)    29.95 (1.01)    26.85 (1.01)
HS           q = 15              18.82 (1.01)    29.44 (–)       26.37 (1.00)
HS           q = 30              18.32 (–)       26.66 (1.00)    23.81 (–)
SSVS         q = {5, 15, 30}        – (1.00)        – (1.04)        – (1.04)
SSVS         q = 5/15/30            –               –               –

One-quarter ahead:

Prior        Combination         Const.          TVP             {const., TVP}
{HS, SSVS}   q = {5, 15, 30}        –               –               –
HS           q = {5, 15, 30}        –               –               –
HS           q = 5               52.31 (1.03)    63.74 (1.01)    61.57 (1.01)
HS           q = 15              50.34 (1.03)    59.23 (1.02)    56.77 (1.02)
HS           q = 30              48.94 (1.03)    61.53 (1.03)    58.95 (1.02)
SSVS         q = {5, 15, 30}        – (1.02)        – (1.00)        – (0.98)
SSVS         q = 5/15/30            –               –               –

One-year ahead:

Prior        Combination         Const.          TVP             {const., TVP}
{HS, SSVS}   q = {5, 15, 30}        –               –               –
HS           q = {5, 15, 30}        –               –               –
HS           q = 5               72.00 (0.93)    82.70 (1.01)    81.20 (1.00)
HS           q = 15              66.54 (0.96)    82.79 (1.01)    81.50 (1.01)
HS           q = 30              70.54 (0.96)    83.68 (1.01)    82.02 (1.00)
SSVS         q = {5, 15, 30}        – (1.03)        – (1.22)        – (1.06)
SSVS         q = 5/15/30            –               –               –
Note: The first (grey shaded) row states the results of the single best performing model, as presented in the previous subsection, for each forecast horizon, benchmarked to the AR model with constant parameters and the HS prior. All other rows show the relative results for the combinations of the different dimension reduction techniques according to the specifications stated in the row and column headers. For example, the entry in row {HS, SSVS}, q = {5, 15, 30} and column Const. combines all models estimated with constant parameters, the HS prior, the SSVS prior, and 5, 15, and 30 factors. Entries denote relative LPLs with relative RMSEs in parentheses, benchmarked against the AR model with constant parameters and the HS prior. Missing entries are denoted by '–'.

Comparing whether restricting the model a priori improves predictions yields mixed insights. For the one-month and one-quarter-ahead predictions, we find that a combination scheme that uses only TVP models but both priors and q = 30 factors yields the most precise forecasts. In the case of one-year-ahead forecasts, we find that pooling across the different q's and exclusively including constant parameter models translates into the highest LPLs. In general, the differences in predictive performance across the DMA-based averaging schemes are small. Hence, as a general suggestion, we can recommend applying DMA and using the most exhaustive model space available (i.e., including both priors, the different numbers of factors, and both TVP and constant parameter regressions).

To investigate which model receives substantial posterior weight over time, Figure 3 depicts the weights associated with the one-step-ahead LPLs over the hold-out period. Panel (a) displays the weight placed on models that allow for TVPs, panel (b) shows the weight attached to the different numbers of factors, and panel (c) shows the weight attached to each model. These weights are obtained by using the full model space (i.e., including both priors, TVP and constant parameter regressions, and all numbers of factors). The weight placed on TVP specifications, for instance, is then simply obtained by summing up the weights associated with the different models that feature TVPs.

Starting with the top panel of the figure, we observe that during the beginning of the sample, appreciable model weight is placed on constant parameter models. In mid-2006, this changes and DMA places increasing posterior mass on models that allow for time-variation in the parameters. In the period from the beginning of 2007 to the onset of the financial crisis, we see that the weight on TVP models somewhat decreases. During the financial crisis, we again experience a pronounced increase in posterior weight towards TVP regressions. In that period, constant parameter models only play a limited role in forming inflation forecasts. With a few exceptions, the remainder of the hold-out period is characterized by evenly distributed posterior mass across constant and TVP regressions.

The middle panel of Figure 3 shows that DMA places increasing posterior mass on models with a large number of factors during recessions (and, similar to panel (a), in 2006). This indicates that in turbulent times it seems to pay off to include many factors. Since our previous analysis reveals that point forecasts are very similar to the ones obtained from simpler univariate models, this finding is most likely driven by a superior density forecasting performance.
Hence, we conjecture that the main driving force behind the strong performance of a model with many factors is that this increases posterior uncertainty (through the inclusion of a large number of covariates), which ultimately leads to slightly wider credible sets, implying a higher probability of observing outlying observations.

The bottom panel (panel (c)) of Figure 3 provides information on how much weight is allocated to models that exploit non-linear dimension reduction techniques. Again, we observe that non-linear dimension reduction techniques obtain considerable posterior mass during 2006 and the financial crisis of 2007/2008. In 2006, the autoencoder with q = 15 receives substantial posterior weight. During the financial crisis, we find that diffusion maps and squared PCA feature large weights. Apart from these two periods, the weights allocated to non-linear dimension reduction techniques are generally close to zero.
Figure 3: Evolution of the weights determined by DMA for one-month ahead cumulative LPBFs. Panels: (a) parameter change (constant vs. TVP); (b) number of factors; (c) model selection across all dimension reduction techniques and factor counts.

This discussion highlights that the strong performance of DMA relative to the single best performing model can be, at least partly, attributed to changes in model weights across business cycles. In expansionary periods with stable inflation rates and macroeconomic fundamentals, linear and simple models dominate the model pool. By contrast, adding more sophisticated models and dimension reduction techniques pays off during recessions. A dynamic combination of different approaches thus improves real-time inflation forecasts.
5 Closing Remarks

In macroeconomics, the vast majority of researchers compress information using linear methods such as principal components to efficiently summarize huge datasets in forecasting applications. Machine learning techniques describing large datasets with relatively few latent factors have gained relevance in recent years in various areas. In this paper, we have shown that using such approaches potentially improves real-time inflation forecasts for a wide range of competing model specifications. Our findings indicate that the point forecasts of simpler models are hard to beat. But when interest centers on predictive distributions, we find that more sophisticated modeling techniques that rely on non-linear dimension reduction yield favorable inflation predictions. These predictions can be further improved by using DMA to dynamically weight the different models, dimension reduction methods, and priors. Doing so further improves density forecasts. Weights obtained from dynamic model averaging reveal that using TVP models in combination with non-linear approaches to dimension reduction is preferred in turbulent times.

References
Allaire, J., and F. Chollet (2019): keras: R Interface to 'Keras', R package version 2.2.5.0.
Andreini, P., C. Izzo, and G. Ricco (2020): "Deep Dynamic Factor Models," arXiv preprint arXiv:2007.11887.
Atkeson, A., and L. E. Ohanian (2001): "Are Phillips curves useful for forecasting inflation?," Federal Reserve Bank of Minneapolis Quarterly Review, 25(1), 2-11.
Bai, J., and S. Ng (2002): "Determining the number of factors in approximate factor models," Econometrica, 70(1), 191-221.
Bai, J., and S. Ng (2008): "Forecasting economic time series using targeted predictors," Journal of Econometrics, 146(2), 304-317.
Beckmann, J., G. Koop, D. Korobilis, and R. A. Schüssler (2020): "Exchange rate predictability and dynamic Bayesian learning," Journal of Applied Econometrics, 35(4), 410-421.
Belmonte, M., G. Koop, and D. Korobilis (2014): "Hierarchical shrinkage in time-varying coefficient models," Journal of Forecasting, 33(1), 80-94.
Berg, T. O., and S. R. Henzel (2015): "Point and density forecasts for the euro area using Bayesian VARs," International Journal of Forecasting, 31(4), 1067-1095.
Bernanke, B. S., J. Boivin, and P. Eliasz (2005): "Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach," The Quarterly Journal of Economics, 120(1), 387-422.
Carter, C., and R. Kohn (1994): "On Gibbs sampling for state space models," Biometrika, 81(3), 541-553.
Carvalho, C. M., N. G. Polson, and J. G. Scott (2010): "The horseshoe estimator for sparse signals," Biometrika, 97(2), 465-480.
Chakraborty, C., and A. Joseph (2017): "Machine learning at central banks," Bank of England Working Papers 674, Bank of England.
Chan, J. C. (2017): "The stochastic volatility in mean model with time-varying parameters: An application to inflation modeling," Journal of Business & Economic Statistics, 35(1), 17-28.
Chan, J. C., T. E. Clark, and G. Koop (2018): "A new model of inflation, trend inflation, and long-run inflation expectations," Journal of Money, Credit and Banking, 50(1), 5-53.
Clark, T. E. (2011): "Real-time density forecasts from Bayesian vector autoregressions with stochastic volatility," Journal of Business & Economic Statistics, 29(3), 327-341.
Clark, T. E., and M. W. McCracken (2010): "Averaging forecasts from VARs with uncertain instabilities," Journal of Applied Econometrics, 25(1), 5-29.
Clark, T. E., and F. Ravazzolo (2015): "Macroeconomic forecasting performance under alternative specifications of time-varying volatility," Journal of Applied Econometrics, 30(4), 551-575.
Coifman, R. R., and S. Lafon (2006): "Diffusion maps," Applied and Computational Harmonic Analysis, 21(1), 5-30.
Coifman, R. R., S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker (2005): "Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps," Proceedings of the National Academy of Sciences, 102(21), 7426-7431.
Conflitti, C., C. De Mol, and D. Giannone (2015): "Optimal combination of survey forecasts," International Journal of Forecasting, 31(4), 1096-1103.
Coulombe, P. G., M. Leroux, D. Stevanovic, and S. Surprenant (2019): "How is machine learning useful for macroeconomic forecasting?," CIRANO Working Papers 2019s-22, CIRANO.
Croushore, D. (2011): "Frontiers of real-time data analysis," Journal of Economic Literature, 49(1), 72-100.
D'Agostino, A., L. Gambetti, and D. Giannone (2013): "Macroeconomic forecasting and structural change," Journal of Applied Econometrics, 28(1), 82-101.
De Mol, C., D. Giannone, and L. Reichlin (2008): "Forecasting using a large number of predictors: Is Bayesian shrinkage a valid alternative to principal components?," Journal of Econometrics, 146(2), 318-328.
Diedrich, H., and D. M. Abel (2012): lle: Locally linear embedding, R package version 1.1.
Exterkate, P., P. J. Groenen, C. Heij, and D. van Dijk (2016): "Nonlinear forecasting with many predictors using kernel ridge regression," International Journal of Forecasting, 32(3), 736-753.
Feng, G., J. He, and N. G. Polson (2018): "Deep learning for predicting asset returns," arXiv preprint arXiv:1804.09314.
Frühwirth-Schnatter, S. (1994): "Data augmentation and dynamic linear models," Journal of Time Series Analysis, 15(2), 183-202.
Frühwirth-Schnatter, S., and H. Wagner (2010): "Stochastic model specification search for Gaussian and partial non-Gaussian state space models," Journal of Econometrics, 154(1), 85-100.
Gallant, A. R., and H. White (1992): "On learning the derivatives of an unknown mapping with multilayer feedforward networks," Neural Networks, 5(1), 129-138.
George, E. I., and R. E. McCulloch (1993): "Variable selection via Gibbs sampling," Journal of the American Statistical Association, 88(423), 881-889.
George, E. I., D. Sun, and S. Ni (2008): "Bayesian stochastic search for VAR model restrictions," Journal of Econometrics, 142(1), 553-580.
Geweke, J., and G. Amisano (2010): "Comparing and evaluating Bayesian predictive distributions of asset returns," International Journal of Forecasting, 26(2), 216-230.
Geweke, J., and G. Amisano (2011): "Optimal prediction pools," Journal of Econometrics, 164(1), 130-141.
Giovannelli, A. (2012): "Nonlinear forecasting using large datasets: Evidences on US and Euro area economies," CEIS Research Paper 255, Tor Vergata University, CEIS.
Goodfellow, I., Y. Bengio, and A. Courville (2016): Deep Learning. MIT Press.
Hall, S. G., and J. Mitchell (2007): "Combining density forecasts," International Journal of Forecasting, 23(1), 1-13.
Hauzenberger, N., F. Huber, G. Koop, and L. Onorante (2019): "Fast and Flexible Bayesian Inference in Time-varying Parameter Regression Models," arXiv preprint arXiv:1910.10779.
Heaton, J. (2008): Introduction to neural networks with Java. Heaton Research, Inc.
Heaton, J. B., N. G. Polson, and J. H. Witte (2017): "Deep learning for finance: deep portfolios," Applied Stochastic Models in Business and Industry, 33(1), 3-12.
Hendry, D. F., and M. P. Clements (2004): "Pooling of forecasts," The Econometrics Journal, 7(1), 1-31.
Huang, G.-B. (2003): "Learning capability and storage capacity of two-hidden-layer feedforward networks," IEEE Transactions on Neural Networks, 14(2), 274-281.
Huber, F., G. Koop, and L. Onorante (2020): "Inducing sparsity and shrinkage in time-varying parameter models," Journal of Business & Economic Statistics, (forthcoming).
Huber, F., and M. Pfarrhofer (2020): "Dynamic shrinkage in time-varying parameter stochastic volatility in mean models," Journal of Applied Econometrics, (forthcoming).
Jarocinski, M., and M. Lenza (2018): "An inflation-predicting measure of the output gap in the Euro area," Journal of Money, Credit and Banking, 50(6), 1189-1224.
Kalli, M., and J. E. Griffin (2014): "Time-varying sparsity in dynamic regression models," Journal of Econometrics, 178(2), 779-793.
Kastner, G. (2016): "Dealing with stochastic volatility in time series using the R package stochvol," Journal of Statistical Software, 69(5), 1-30.
Kastner, G., and S. Frühwirth-Schnatter (2014): "Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models," Computational Statistics & Data Analysis, 76, 408-423.
Kayo, O. (2006): "Locally linear embedding algorithm: Extensions and applications."
Kelly, B. T., S. Pruitt, and Y. Su (2019): "Characteristics are covariances: A unified model of risk and return," Journal of Financial Economics, 134(3), 501-524.
Koop, G., and D. Korobilis (2012): "Forecasting inflation using dynamic model averaging," International Economic Review, 53(3), 867-886.
Koop, G., and D. Korobilis (2013): "Large time-varying parameter VARs," Journal of Econometrics, 177(2), 185-198.
Koop, G., and S. M. Potter (2007): "Estimation and forecasting in models with multiple breaks," The Review of Economic Studies, 74(3), 763-789.
Lin, F., C.-C. Yeh, and M.-Y. Lee (2011): "The use of hybrid manifold learning and support vector machines in the prediction of business failure," Knowledge-Based Systems, 24(1), 95-101.
Makalic, E., and D. F. Schmidt (2015): "A simple sampler for the horseshoe estimator," IEEE Signal Processing Letters, 23(1), 179-182.
McAdam, P., and P. McNelis (2005): "Forecasting inflation with thick models and neural networks," Economic Modelling, 22(5), 848-867.
McCracken, M. W., and S. Ng (2016): "FRED-MD: A monthly database for macroeconomic research," Journal of Business & Economic Statistics, 34(4), 574-589.
Medeiros, M. C., G. F. Vasconcelos, Á. Veiga, and E. Zilberman (2019): "Forecasting inflation in a data-rich environment: the benefits of machine learning methods," Journal of Business & Economic Statistics, (forthcoming).
Mullainathan, S., and J. Spiess (2017): "Machine learning: An applied econometric approach," Journal of Economic Perspectives, 31(2), 87-106.
Oksanen, J., F. G. Blanchet, M. Friendly, R. Kindt, P. Legendre, D. McGlinn, P. R. Minchin, R. B. O'Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, E. Szoecs, and H. Wagner (2019): vegan: Community Ecology Package, R package version 2.5-6.
Orsenigo, C., and C. Vercellis (2013): "Linear versus nonlinear dimensionality reduction for banks' credit rating prediction," Knowledge-Based Systems, 47, 14-22.
Pettenuzzo, D., and F. Ravazzolo (2016): "Optimal portfolio choice under decision-based model combinations," Journal of Applied Econometrics, 31(7), 1312-1332.
Pfarrhofer, M. (2020): "Forecasts with Bayesian vector autoregressions under real time conditions," arXiv preprint arXiv:2004.04984.
Polson, N. G., and J. G. Scott (2010): "Shrink globally, act locally: Sparse Bayesian regularization and prediction," Bayesian Statistics, 9, 501-538.
Raftery, A., M. Kárný, and P. Ettler (2010): "Online prediction under model uncertainty via Dynamic Model Averaging: Application to a cold rolling mill," Technometrics, 52(1), 52-66.
Ribeiro, B., A. Vieira, and J. C. das Neves (2008): "Supervised Isomap with dissimilarity measures in embedding learning," in Iberoamerican Congress on Pattern Recognition, pp. 389-396. Springer.
Richards, J., and R. Cannoodt (2019): diffusionMap: Diffusion Map, R package version 1.2.0.
Richards, J. W., P. E. Freeman, A. B. Lee, and C. M. Schafer (2009): "Exploiting low-dimensional structure in astronomical spectra," The Astrophysical Journal, 691(1), 32-42.
Roweis, S. T., and L. K. Saul (2000): "Nonlinear dimensionality reduction by locally linear embedding," Science, 290(5500), 2323-2326.
Saxe, A. M., Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox (2019): "On the information bottleneck theory of deep learning," Journal of Statistical Mechanics: Theory and Experiment, 2019(12), 124020.
Schölkopf, B., A. Smola, and K.-R. Müller (1998): "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, 10(5), 1299-1319.
Stella, A., and J. H. Stock (2013): "A state-dependent model for inflation forecasting," FRB International Finance Discussion Paper, (1062).
Stock, J., and M. Watson (1999): "Forecasting inflation," Journal of Monetary Economics, 44(2), 293-335.
Stock, J., and M. Watson (2002): "Macroeconomic forecasting using diffusion indexes," Journal of Business & Economic Statistics, 20(2), 147-162.
Stock, J., and M. Watson (2008): "Phillips curve inflation forecasts," NBER Working Papers 14322, National Bureau of Economic Research, Inc.
Stock, J. H., and M. W. Watson (2007): "Why has U.S. inflation become harder to forecast?," Journal of Money, Credit and Banking, 39(s1), 3-33.
Stock, J. H., and M. W. Watson (2016): "Core inflation and trend inflation," Review of Economics and Statistics, 98(4), 770-784.
Taylor, S. J. (1982): "Financial returns modelled by the product of two stochastic processes: a study of the daily sugar prices 1961-75," Time Series Analysis: Theory and Practice, 1, 203-226.
Tenenbaum, J. B., V. De Silva, and J. C. Langford (2000): "A global geometric framework for nonlinear dimensionality reduction," Science, 290(5500), 2319-2323.
Timmermann, A. (2006): "Forecast combinations," Handbook of Economic Forecasting, 1, 135-196.
Zelnik-Manor, L., and P. Perona (2004): "Self-tuning spectral clustering," Advances in Neural Information Processing Systems, 17, 1601-1608.
Zime, S. (2014): "Economic performance evaluation and classification using hybrid manifold learning and support vector machine model," IEEE, 184-191.

Appendices

A Technical Appendix
A.1 Non-centered Parameterization
To implement the Bayesian shrinkage priors in the TVP regression defined by Eq. 18 and Eq. 19, we use the non-centered parameterization proposed in Frühwirth-Schnatter and Wagner (2010). Intuitively speaking, this allows us to move the process innovation variances into the observation equation and to discriminate between a time-invariant and a time-varying part of the model. The non-centered parameterization of the model is given by:

    y_{t+h} = d_{t+h}' \beta + d_{t+h}' \sqrt{V} \tilde{\beta}_{t+h} + \epsilon_{t+h},   \epsilon_{t+h} ~ N(0, \sigma_{t+h}^2),   (B.1)
    \tilde{\beta}_{t+h} = \tilde{\beta}_{t+h-1} + \varepsilon_{t+h},   \varepsilon_{t+h} ~ N(0, I_M),   \tilde{\beta}_0 = 0_M,   (B.2)

where the jth element of \tilde{\beta}_{t+h} is given by \tilde{\beta}_{j,t+h} = (\beta_{j,t+h} - \beta_j) / \sqrt{v_j} for j = 1, ..., M.

Conditional on the normalized states \tilde{\beta}_{t+h}, Eq. B.1 can be written as a linear regression model as follows:

    y_{t+h} = D_{t+h}' \alpha + \epsilon_{t+h},   (B.3)

with D_{t+h} = [d_{t+h}', (\tilde{\beta}_{t+h} \odot d_{t+h})']' denoting a 2M-dimensional vector of regressors and \alpha = (\beta', \sqrt{v_1}, ..., \sqrt{v_M})' a 2M-dimensional coefficient vector. This parameterization implies that the state innovation variances (or, more precisely, their square roots) are moved into the observation equation, so that we can estimate them alongside \beta (conditional on the states \tilde{\beta}_{t+h}).
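As a small illustration of Eq. B.3, the following R sketch (ours, with all inputs simulated for illustration) builds the stacked regressor matrix with rows D_{t+h}' from a factor matrix and a draw of the normalized states:

    # Sketch: the static representation of the non-centered TVP regression, Eq. (B.3).
    set.seed(1)
    TT <- 200; M <- 3
    d <- matrix(rnorm(TT * M), TT, M)                             # latent factors d_t
    beta_tilde <- apply(matrix(rnorm(TT * M), TT, M), 2, cumsum)  # random walks with N(0, I_M) shocks
    D <- cbind(d, beta_tilde * d)        # rows D_t' = [d_t', (beta_tilde_t * d_t)']
    # Conditional on beta_tilde, y = D %*% alpha + eps is linear in
    # alpha = (beta', sqrt(v_1), ..., sqrt(v_M))'.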
A.2 Prior Setup

A.2.1 Priors on the Regression Coefficients
We use a zero-mean multivariate Gaussian prior on \alpha:

    \alpha | V_0 ~ N(0, V_0),   (B.4)

with V_0 denoting a 2M-dimensional prior variance-covariance matrix, V_0 = diag(\tau_1^2, ..., \tau_{2M}^2). This matrix collects the prior shrinkage parameters \tau_j^2 associated with the time-invariant regression coefficients and the process innovation standard deviations.

In the empirical work, the priors we consider differ in the specification of V_0. The first is the stochastic search variable selection (SSVS) prior of George and McCulloch (1993) and the second is the Horseshoe (HS) prior of Carvalho et al. (2010).

1. SSVS Prior:
The SSVS prior pushes coefficients associated with irrelevant variables towards zero by using a mixture of two Gaussians. A specific mixture component is selected by introducing an auxiliary binary indicator variable \gamma_j. More formally, the SSVS prior specifies \tau_j^2 (j = 1, ..., 2M) such that

    \tau_j^2 = (1 - \gamma_j) \tau_{0j}^2 + \gamma_j \tau_{1j}^2,   (B.5)

with \tau_{0j}^2 \ll \tau_{1j}^2 being fixed prior variances. If \gamma_j = 1, the prior variance is \tau_{1j}^2, which is set to a large value; hence, little shrinkage is introduced. By contrast, if \gamma_j = 0, the prior variance \tau_{0j}^2 is close to zero, the corresponding amount of shrinkage is large, and the posterior distribution is tightly centered on zero.

The prior probability that \gamma_j = 1 is set equal to

    Prob(\gamma_j = 1) = 1 - Prob(\gamma_j = 0) = p_m,   p_m = 1/2.   (B.6)

This choice of the prior inclusion probability implies that every quantity is equally likely to enter the model.

To control for scaling differences, we adopt the semi-automatic approach proposed in George et al. (2008) and choose \tau_{0j}^2 = 0.01 \hat{\sigma}_j^2 and \tau_{1j}^2 = 100 \hat{\sigma}_j^2 for j = 1, ..., 2M. Here, \hat{\sigma}_j^2 denotes the OLS variance of the corresponding coefficient in a standard regression model with constant parameters.
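A minimal R sketch of the implied mixture selection for a single coefficient, anticipating step 4 of the sampler in Section A.3, looks as follows (our illustration; the values of alpha_j, tau0 and tau1 are placeholders):

    # Sketch: drawing the SSVS indicator and prior variance for one coefficient.
    ssvs_draw <- function(alpha_j, tau0, tau1, p_m = 0.5) {
      u1 <- dnorm(alpha_j, 0, tau1) * p_m          # proportional to u_1j
      u0 <- dnorm(alpha_j, 0, tau0) * (1 - p_m)    # proportional to u_0j
      gamma_j <- rbinom(1, 1, u1 / (u0 + u1))      # select the mixture component
      c(gamma = gamma_j, tau2 = if (gamma_j == 1) tau1^2 else tau0^2)
    }
    ssvs_draw(alpha_j = 0.4, tau0 = 0.01, tau1 = 1)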
2. Horseshoe Prior:

The horseshoe prior of Carvalho et al. (2010) achieves shrinkage by introducing local and global shrinkage parameters (see Polson and Scott, 2010). These follow a standard half-Cauchy distribution restricted to the positive real numbers:

    \tau_j^2 = \zeta_j^2 \varsigma^2,   \zeta_j ~ C^+(0, 1),   \varsigma ~ C^+(0, 1).   (B.7)

While the global component \varsigma strongly pushes all coefficients in \alpha towards the prior mean (i.e., zero), the local scalings {\zeta_j}_{j=1}^{2M} allow for variable-specific departures from zero even in light of a global scaling parameter close to zero. This flexibility leads to heavy tails in the marginal prior (obtained after integrating out \zeta_j), which turns out to be useful for forecasting.
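For reference, one sweep of the auxiliary inverse-Gamma representation of Makalic and Schmidt (2015), used in step 4 of the algorithm below, can be sketched in a few lines of R (our illustration; alpha denotes the current coefficient draw and sig2 plays the role of the global parameter):

    # Sketch: one sweep of the horseshoe update via inverse-Gamma auxiliaries
    # (Makalic and Schmidt, 2015); 1/rgamma(n, a, rate = b) draws from G^-1(a, b).
    hs_step <- function(alpha, eta, phi, sig2) {
      K <- length(alpha)
      zeta2 <- 1 / rgamma(K, shape = 1, rate = 1 / eta + alpha^2 / (2 * sig2))
      sig2  <- 1 / rgamma(1, shape = (K + 1) / 2,
                          rate = 1 / phi + 0.5 * sum(alpha^2 / zeta2))
      eta <- 1 / rgamma(K, shape = 1, rate = 1 + 1 / zeta2)
      phi <- 1 / rgamma(1, shape = 1, rate = 1 + 1 / sig2)
      list(tau2 = zeta2 * sig2, zeta2 = zeta2, sig2 = sig2, eta = eta, phi = phi)
    }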
A.3 Full Conditional Posterior Simulation

We carry out posterior inference by using a Markov chain Monte Carlo (MCMC) algorithm to simulate from the joint posterior of the parameters, the log-volatilities and the TVPs. This MCMC algorithm consists of the following steps:

1. Conditional on the time-varying part of the coefficients and the stochastic volatilities, we draw \alpha = (\beta', \sqrt{v_1}, ..., \sqrt{v_M})' from N(\bar{\alpha}, \bar{V}) with \bar{V} = (\tilde{D}'\tilde{D} + V_0^{-1})^{-1} and \bar{\alpha} = \bar{V} \tilde{D}'\tilde{y}. Here, \tilde{y} is a T-dimensional vector with typical element y_{t+h}/\sigma_{t+h} and \tilde{D} is a T x 2M matrix with typical row D_{t+h}'/\sigma_{t+h}.

2. Controlling for all other model parameters, the full history of \tilde{\beta}_{t+h} is sampled using the forward-filtering backward-sampling (FFBS) algorithm proposed by Carter and Kohn (1994) and Frühwirth-Schnatter (1994). For constant parameter models this step is skipped.

3. The stochastic volatilities \log \sigma_{t+h} are drawn by employing the algorithm of Kastner and Frühwirth-Schnatter (2014), implemented in the stochvol R package of Kastner (2016).

4. Sampling the diagonal elements of V_0 depends on the specific prior setup chosen.

   • If the SSVS prior is used, we simulate the indicators \gamma_j from a Bernoulli distribution with the probability that \gamma_j = 1 given by

       Prob(\gamma_j = 1 | \alpha_j) = u_{1j} / (u_{0j} + u_{1j}),
       u_{1j} = \tau_{1j}^{-1} \exp{ -\alpha_j^2 / (2 \tau_{1j}^2) } p_m,
       u_{0j} = \tau_{0j}^{-1} \exp{ -\alpha_j^2 / (2 \tau_{0j}^2) } (1 - p_m).

   • If we adopt the HS prior, we rely on the hierarchical representation of Makalic and Schmidt (2015). Introducing auxiliary random quantities \eta_j and \varphi that follow an inverse Gamma distribution, we can draw \zeta_j^2 and \varsigma^2 as follows:

       \zeta_j^2 | \alpha_j, \varsigma^2, \eta_j ~ G^{-1}(1, \eta_j^{-1} + \alpha_j^2 / (2 \varsigma^2)),
       \varsigma^2 | \alpha, \zeta, \varphi ~ G^{-1}((2M + 1)/2, \varphi^{-1} + (1/2) \sum_{j=1}^{2M} \alpha_j^2 \zeta_j^{-2}),
       \eta_j | \zeta_j ~ G^{-1}(1, 1 + \zeta_j^{-2}),
       \varphi | \varsigma^2 ~ G^{-1}(1, 1 + \varsigma^{-2}).

We sample from the relevant full conditional posterior distributions iteratively. This is repeated 10,000 times, with an initial set of draws discarded as burn-in.
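Returning to step 1, the Gaussian update amounts to a standard Bayesian regression draw. A compact R sketch (ours; the Cholesky-based draw is one common implementation choice, not necessarily the authors'):

    # Sketch of step 1: drawing alpha from N(alpha_bar, V_bar) after dividing
    # the data by the stochastic volatilities.
    draw_alpha <- function(y, D, sigma, V0_inv) {
      Dt <- D / sigma                                   # row t becomes D_t' / sigma_t
      yt <- y / sigma
      V_bar <- chol2inv(chol(crossprod(Dt) + V0_inv))   # (D~'D~ + V0^-1)^-1
      a_bar <- V_bar %*% crossprod(Dt, yt)              # V_bar D~' y~
      as.vector(a_bar + t(chol(V_bar)) %*% rnorm(ncol(D)))
    }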
Data Appendix

The Federal Reserve Economic Data (FRED) database contains monthly observations of macroeconomic variables for the US and is available for download at https://research.stlouisfed.org. Details on the dataset can be found in McCracken and Ng (2016). For each data vintage (available from 1999:08), the time series start in January 1959. Due to missing values in some of the series, we preselect 105 variables and transform them according to Table C.1. We select all variables for our models except for the extended Phillips curve, for which we choose the variables indicated by the column PART.

Table C.1: Data description
FRED Mnemonic | Description | Trans I(0) | PART | FULL
RPI | Real personal income | 5 | | x
W875RX1 | Real personal income ex transfer receipts | 5 | x | x
INDPRO | IP Index | 5 | x | x
IPFPNSS | IP: Final Products | 5 | | x
IPFINAL | IP: Final Products (Market Group) | 5 | | x
IPCONGD | IP: Consumer Goods | 5 | | x
IPMAT | IP: Materials | 5 | | x
IPMANSICS | IP: Manufacturing (SIC) | 5 | | x
CUMFNS | Capacity Utilization: Manufacturing | 2 | x | x
CLF16OV | Civilian Labor Force | 5 | | x
CE16OV | Civilian Employment | 5 | | x
UNRATE | Civilian Unemployment Rate | 2 | x | x
UEMPMEAN | Average Duration of Unemployment (Weeks) | 2 | | x
UEMPLT5 | Civilians Unemployed: Less Than 5 Weeks | 5 | | x
UEMP5TO14 | Civilians Unemployed for 5-14 Weeks | 5 | | x
UEMP15OV | Civilians Unemployed: 15 Weeks & Over | 5 | | x
UEMP15T26 | Civilians Unemployed for 15-26 Weeks | 5 | | x
UEMP27OV | Civilians Unemployed for 27 Weeks and Over | 5 | | x
CLAIMSx | Initial Claims | 5 | x | x
PAYEMS | All Employees: Total nonfarm | 5 | x | x
USGOOD | All Employees: Goods-Producing Industries | 5 | | x
CES1021000001 | All Employees: Mining and Logging: Mining | 5 | | x
USCONS | All Employees: Construction | 5 | | x
MANEMP | All Employees: Manufacturing | 5 | | x
DMANEMP | All Employees: Durable goods | 5 | | x
NDMANEMP | All Employees: Nondurable goods | 5 | | x
SRVPRD | All Employees: Service-Providing Industries | 5 | | x
USWTRADE | All Employees: Wholesale Trade | 5 | | x
USTRADE | All Employees: Retail Trade | 5 | | x
USFIRE | All Employees: Financial Activities | 5 | | x
USGOVT | All Employees: Government | 5 | | x
CES0600000007 | Avg Weekly Hours: Goods-Producing | 1 | x | x
AWOTMAN | Avg Weekly Overtime Hours: Manufacturing | 2 | | x
AWHMAN | Avg Weekly Hours: Manufacturing | 1 | | x
CES0600000008 | Avg Hourly Earnings: Goods-Producing | 6 | x | x
CES2000000008 | Avg Hourly Earnings: Construction | 6 | | x
CES3000000008 | Avg Hourly Earnings: Manufacturing | 6 | | x
HOUST | Housing Starts: Total New Privately Owned | 4 | x | x
HOUSTNE | Housing Starts, Northeast | 4 | | x
HOUSTMW | Housing Starts, Midwest | 4 | | x
HOUSTS | Housing Starts, South | 4 | | x
HOUSTW | Housing Starts, West | 4 | | x
PERMIT | New Private Housing Permits (SAAR) | 4 | | x
PERMITNE | New Private Housing Permits, Northeast (SAAR) | 4 | | x
PERMITMW | New Private Housing Permits, Midwest (SAAR) | 4 | | x
PERMITS | New Private Housing Permits, South (SAAR) | 4 | | x
PERMITW | New Private Housing Permits, West (SAAR) | 4 | | x
CMRMTSPLx | Real Manu. and Trade Industries Sales | 5 | x | x
RETAILx | Retail and Food Services Sales | 5 | | x
AMDMNOx | New Orders for Durable goods | 5 | | x
ANDENOx | New Orders for Nondefense Capital goods | 5 | | x
AMDMUOx | Unfilled Orders for Durable goods | 5 | | x
BUSINVx | Total Business Inventories | 5 | x | x
ISRATIOx | Total Business: Inventories to Sales Ratio | 2 | | x
UMCSENTx | Consumer Sentiment Index | 2 | | x
OILPRICEx | Crude Oil, spliced WTI and Cushing | 6 | | x
PPICMM | PPI: Metals and metal products | 6 | x | x
CPIAUCSL | CPI: All Items | 6 | | x
CPIAPPSL | CPI: Apparel | 6 | | x
CPITRNSL | CPI: Transportation | 6 | | x
CPIMEDSL | CPI: Medical Care | 6 | | x
CUSR0000SAC | CPI: Commodities | 6 | | x
CUSR0000SAS | CPI: Services | 6 | | x
CPIULFSL | CPI: All Items Less Food | 6 | | x
CUSR0000SA0L5 | CPI: All Items Less Medical Care | 6 | | x
FEDFUNDS | Effective Federal Funds Rate | 2 | x | x
M1SL | M1 Money Stock | 6 | | x
M2SL | M2 Money Stock | 6 | | x
M2REAL | Real M2 Money Stock | 5 | x | x
AMBSL | St. Louis Adjusted Monetary Base | 6 | | x
TOTRESNS | Total Reserves of Depository Institutions | 6 | | x
NONBORRES | Reserves of Depository Institutions | 7 | | x
BUSLOANS | Commercial and Industrial Loans | 6 | x | x
REALLN | Real Estate Loans at All Commercial Banks | 6 | x | x
NONREVSL | Total Nonrevolving Credit | 6 | | x
CONSPI | Nonrevolving consumer credit to Personal Income | 2 | | x
MZMSL | MZM Money Stock | 6 | | x
DTCOLNVHFNM | Consumer Motor Vehicle Loans Outstanding | 6 | | x
DTCTHFNM | Total Consumer Loans and Leases Outstanding | 6 | | x
INVEST | Securities in Bank Credit at All Commercial Banks | 6 | | x
CP3Mx | 3-Month AA Financial Commercial Paper Rate | 2 | | x
TB3MS | 3-Month Treasury Bill | 2 | x | x
TB6MS | 6-Month Treasury Bill | 2 | | x
GS1 | 1-Year Treasury Rate | 2 | | x
GS5 | 5-Year Treasury Rate | 2 | | x
GS10 | 10-Year Treasury Rate | 2 | x | x
AAA | Moody's Seasoned Aaa Corporate Bond Yield | 2 | | x
BAA | Moody's Seasoned Baa Corporate Bond Yield | 2 | | x
COMPAPFFx | 3-Month Commercial Paper Minus FEDFUNDS | 1 | | x
TB3SMFFM | 3-Month Treasury C Minus FEDFUNDS | 1 | | x
TB6SMFFM | 6-Month Treasury C Minus FEDFUNDS | 1 | | x
T1YFFM | 1-Year Treasury C Minus FEDFUNDS | 1 | | x
T5YFFM | 5-Year Treasury C Minus FEDFUNDS | 1 | | x
T10YFFM | 10-Year Treasury C Minus FEDFUNDS | 1 | | x
AAAFFM | Moody's Aaa Corporate Bond Minus FEDFUNDS | 1 | | x
BAAFFM | Moody's Baa Corporate Bond Minus FEDFUNDS | 1 | | x
TWEXMMTH | Trade Weighted U.S. Dollar Index: Major Currencies | 5 | | x
EXSZUSx | Switzerland / U.S. Foreign Exchange Rate | 5 | x | x
EXJPUSx | Japan / U.S. Foreign Exchange Rate | 5 | | x
EXUSUKx | U.S. / UK Foreign Exchange Rate | 5 | | x
EXCAUSx | Canada / U.S. Foreign Exchange Rate | 5 | | x
S.P.500 | S&P's Common Stock Price Index: Composite | 5 | x | x
S.P..indust | S&P's Common Stock Price Index: Industrials | 5 | | x
S.P.div.yield | S&P's Composite Common Stock: Dividend Yield | 2 | | x
S.P.PE.ratio | S&P's Composite Common Stock: Price-Earnings Ratio | 5 | | x
Note: Column Trans I(0) denotes the transformation applied to each time series x_t to achieve approximate stationarity: (1) no transformation, (2) \Delta x_t, (4) \log(x_t), (5) \Delta \log(x_t), (6) \Delta^2 \log(x_t), (7) \Delta(x_t / x_{t-1} - 1.0).
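For reproducibility, the transformation codes map into R as follows (a minimal sketch for a single series x, following the code numbers in the note above):

    # Sketch: applying the Table C.1 transformation codes to one series.
    transform_series <- function(x, code) {
      switch(as.character(code),
        "1" = x,                                            # no transformation
        "2" = c(NA, diff(x)),                               # first difference
        "4" = log(x),                                       # log level
        "5" = c(NA, diff(log(x))),                          # log first difference
        "6" = c(NA, NA, diff(log(x), differences = 2)),     # second log difference
        "7" = c(NA, NA, diff(x[-1] / x[-length(x)] - 1)))   # change in the growth rate
    }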