Macroeconomic Data Transformations Matter
Philippe Goulet Coulombe, Maxime Leroux, Dalibor Stevanovic, Stéphane Surprenant
Philippe Goulet Coulombe∗ (University of Pennsylvania), Maxime Leroux, Dalibor Stevanovic†, Stéphane Surprenant (Université du Québec à Montréal)
This version: July 31, 2020
Abstract
From a purely predictive standpoint, rotating the predictors' matrix in a low-dimensional linear regression setup does not alter predictions. However, when the forecasting technology either uses shrinkage or is non-linear, it does. This is precisely the fabric of the machine learning (ML) macroeconomic forecasting environment. Pre-processing of the data translates to an alteration of the regularization – explicit or implicit – embedded in ML algorithms. We review old transformations and propose new ones, then empirically evaluate their merits in a substantial pseudo-out-of-sample exercise. It is found that traditional factors should almost always be included in the feature matrix and that moving average rotations of the data can provide important gains for various forecasting targets.
JEL Classification: C53, C55, E37
Keywords: Machine Learning, Big Data, Forecasting.
∗ Corresponding Author: [email protected]. Department of Economics, UPenn.
† Corresponding Author: [email protected]. Département des sciences économiques, UQAM.

Introduction
Following the recent enthusiasm for Machine Learning (ML) methods and the widespread availability of big data, macroeconomic forecasting research has gradually evolved further and further away from the traditional tightly specified OLS regression. Rather, nonparametric non-linearity and regularization of many forms are slowly taking center stage, largely because they can provide sizable forecasting gains with respect to traditional methods (see, among others, Kim and Swanson (2018); Medeiros et al. (2019); Goulet Coulombe et al. (2020); Goulet Coulombe (2020a)). In such environments, different linear transformations of the informational set
X can change the prediction, and taking first differences may not be the optimal transformation for many predictors, despite the fact that it guarantees viable frequentist inference. For instance, in penalized regression problems – like Lasso or Ridge – different rotations of X imply different priors on β in the original regressor space. Moreover, in tree-based algorithms, the problem of inverting a near-singular matrix X′X simply does not arise, which makes the use of more persistent (and potentially highly cross-correlated) regressors much less harmful. In sum, in the ML macro forecasting environment, traditional go-to data transformations (like those detailed in McCracken and Ng (2016)) may leave some forecasting gains on the table. To provide guidance for the growing number of researchers and practitioners in the field, we conduct an extensive pseudo-out-of-sample forecasting exercise to evaluate the virtues of standard and newly proposed data transformations.

From the ML perspective, it is often suggested that a "feature engineering" step may improve algorithms' performance (Kuhn and Johnson, 2019). This is especially true of Random Forests (RF) and Boosting, which successfully handle a high-dimensional X. This simply means that the data scientist, leveraging some domain knowledge, can create plausibly more salient features out of the original data matrix. Of course, an extremely flexible model, like a neural network with many layers, could very well create those relevant transformations internally in a data-driven way. Yet, this idyllic scenario is a dead end when data points are few, regressors are numerous and a noisy y serves as a prediction target. This sort of environment, of which macroeconomic forecasting is a notable example, will often benefit from any prior knowledge one can incorporate in the model. Since transforming the data transforms the prior, doing so properly by including well-motivated rotations of X has the power to increase ML performance on such challenging data sets.

Macroeconomic modelers have been thinking about designing successful priors for a long time. There is a wide literature on Bayesian Vector Autoregressions (VAR) starting with Doan et al. (1984). Even earlier on, the penalized/restricted estimation of lag polynomials was extensively studied (Almon, 1965; Shiller, 1973). The motivation for both strands of work is the relatively small number of observations versus that of parameters to estimate. 40 years later, many more data points are available, but models have grown in complexity. Consequently, large VARs (Bańbura et al., 2010) and MIDAS regressions (Ghysels et al., 2004) still use those tools to regularize over-parametrized models. ML algorithms, usually allowing for sophisticated functional forms, also critically rely on shrinkage. However, when it comes to nonlinear nonparametric methods, especially Boosting and Random Forests, there are no explicit parameters to penalize. Just like rotating regressors changes the prior in a Ridge regression (see discussion in Goulet Coulombe (2020b)), rotating regressors in such algorithms will alter the implicit shrinkage scheme. This motivates us to propose two rotations of X that implicitly implement a more time-series-friendly prior in ML models: moving average factors (MAF) and moving average rotation of X (MARX). Other than those motivated above, standard transformations are also being studied.
This includes factors extracted by principal components of X and the inclusion of variables in levels to retrieve low-frequency information. To evaluate the contribution of data transformations for macroeconomic prediction, we conduct an extensive pseudo-out-of-sample forecasting experiment (38 years, 10 key monthly macroeconomic indicators, 6 horizons) with three linear and two nonlinear ML methods (Elastic Net, Adaptive Lasso, Linear Boosting, Random Forests and Boosted Trees). Main results can be summarized as follows. First, combining non-standard data transformations,
MARX, MAF and
Level, minimizes the RMSE for 7 and 8 variables out of 10 when predicting 1 and 3 months ahead, respectively. They remain resilient at horizons 6, 9 and 12, as they are part of the best RMSE specifications around 60% of the time.
Second, their contribution is magnified when combined with nonlinear ML models (24 out of 35 cases), with an advantage for Random Forests over Boosted Trees. Both algorithms allow for nonlinearities via tree base learners and make heavy use of shrinkage via ensemble averaging. This is precisely the algorithmic environment we conjectured could benefit most from non-standard transformations of X. Third, traditional factors can tremendously help. The overwhelming majority of best information sets for each target include factors. In that regard, this amounts to a clear takeaway message: while ML methods can handle the high-dimensional X (both computationally and statistically), extracting common factors remains straightforward feature engineering that pays off.

The rest of the paper is organized as follows. In section 2, we present the ML predictive framework and detail the data transformations and forecasting models. In section 3, we detail the forecasting experiment and in section 3.2 we present the main results. Section 4 concludes.

(Nevertheless, as discussed in Hastie et al. (2009), the ensuing ensemble averaging prediction benefits from ridge-like shrinkage, as randomization allows each feature to contribute to the prediction, albeit in a moderate way.)

Machine Learning Forecasting Framework
Machine learning algorithms offer ways to approximate unknown and potentially complicated functional forms with the objective of minimizing the expected loss of a forecast over h periods. The focus of the current paper is to construct a feature matrix susceptible to improve the macroeconomic forecasting performance of off-the-shelf ML algorithms. Let H_t = [H_{1t}, ..., H_{Kt}] for t =
1, ..., T be the vector of variables found in a large macroeconomic dataset such as the FRED-MD database of McCracken and Ng (2016) and let y_{t+h} be our target variable. We follow Stock and Watson (2002a,b) and target average growth rates or average differences over h periods ahead:

y_{t+h} = g(f_Z(H_t)) + e_{t+h}.    (1)

Define Z_t ≡ f_Z(H_t) as the N_Z-dimensional feature vector, formed by combining several transformations of the variables in H_t. The function f_Z represents the data pre-processing and/or feature engineering whose effects on forecasting performance we seek to investigate. The training problem for f_Z = I() is

min_{g ∈ G} { Σ_{t=1}^{T} (y_{t+h} − g(H_t))² + pen(g; τ) }.    (2)

The function g, chosen as a point in the functional space G, maps transformed inputs into the transformed targets. pen() is the regularization function whose strength depends on some vector/scalar hyperparameter(s) τ. Let ∘ denote function composition and define g̃ := g ∘ f_Z. Clearly, introducing a general f_Z leads to

min_{g ∈ G} { Σ_{t=1}^{T} (y_{t+h} − g(f_Z(H_t)))² + pen(g; τ) }  ⟷  min_{g̃ ∈ G} { Σ_{t=1}^{T} (y_{t+h} − g̃(H_t))² + pen(f_Z^{-1} ∘ g̃; τ) },

which is, simply, a change of regularization. Now, let g*(f*_Z(H_t)) be the "oracle" combination of best transformation f_Z and true function g. Let g(f_Z(H_t)) be a functional form and data pre-processing selected by the practitioner. In addition, denote ĝ(Z_t) and ŷ_{t+h} the fitted model and its forecast. The forecast error can be decomposed as

y_{t+h} − ŷ_{t+h} = [g*(f*_Z(H_t)) − g(f_Z(H_t))]  (approximation error)  +  [g(Z_t) − ĝ(Z_t)]  (estimation error)  +  e_{t+h}.    (3)

Obviously, in the context of a pseudo-out-of-sample experiment, feature matrices must be built recursively to avoid data snooping. e_{t+h} is not shrinkable, while the estimation error can be reduced by either adding more relevant data points or restricting the domain G. The benefits of the latter can be offset by a corresponding increase of the approximation error. Thus, an optimal f_Z is one that entails a prior that reduces estimation error at a minimal approximation error cost. Additionally, since most ML algorithms perform variable selection, there is the extra possibility of pooling different f_Z's together and letting the algorithm itself choose the relevant restrictions.

The marginal impact of the increased domain G has been explicitly studied in Goulet Coulombe et al. (2020), with Z_t being factors extracted from the stationarized version of FRED-MD. The primary objective of this paper is to study the relevance of the choice of f_Z, combined with popular ML approximators g.
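To fix ideas, the following minimal sketch builds the h-period average-growth target of equation (1) for a monthly series. It assumes the raw series sits in a pandas Series; the helper name make_target and the absence of any annualization factor are illustrative choices, not the authors' code.

```python
import numpy as np
import pandas as pd

def make_target(raw, h, difference=False):
    """y_{t+h} of equation (1): average growth rate (or average difference, e.g. for the
    unemployment rate) of `raw` over the next h periods, indexed at date t."""
    x = raw if difference else np.log(raw)
    return (x.shift(-h) - x) / h

# Example: 3-month-ahead average growth of industrial production
# y = make_target(fred_md["INDPRO"], h=3)
```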
To evaluate the virtues of standard and newly proposed data transformations, we conduct a pseudo-out-of-sample (POOS) forecasting experiment using various combinations of f_Z's and g's. We start by describing candidate f_Z's. Firstly, we consider more traditional candidates for f_Z.

INCLUDING FACTORS. Common practice in the macroeconomic forecasting literature is to rely on some variant of the transformations proposed by McCracken and Ng (2016) to obtain a stationary X_t out of H_t. Letting X = [X_t]_{t=1}^{T} and imposing a linear latent factor structure X = FΛ + ε, we can estimate F by the principal components of X. The feature matrix of the autoregressive diffusion index (FM hereafter) model of Stock and Watson (2002a,b) can be formed as

Z_t = [y_t, Ly_t, ..., L^{p_y} y_t, F_t, LF_t, ..., L^{p_f} F_t],    (4)

where L is the lag operator and y_t is the current value of the target. In Goulet Coulombe et al. (2020), factors were deemed the most reliable shrinkage method for macroeconomic forecasting, even when considering ML alternatives. Furthermore, the combination of factors (and nothing else) with nonlinear nonparametric methods is (i) easy, (ii) fast and (iii) often quite successful. Point (iii) will be further reinforced by this paper's results, especially for forecasting inflation, which contrasts with the results found in Medeiros et al. (2019).

(More concretely, a factor F is a linear combination of X: if an algorithm picks F rather than creating its own combination of different elements of X, it is implicitly imposing a restriction. There are many recent contributions considering the macroeconomic forecasting problem with econometric and machine learning methods in a big data environment (Kim and Swanson, 2018; Kotchoni et al., 2019); however, they are done using the standard stationary version of the FRED-MD database. Recently, McCracken and Ng (2020) studied the relevance of unit root tests in the choice of stationarity transformation codes for macroeconomic forecasting with factor models.)

INCLUDING LEVELS. In econometrics, debates on the consequences of unit roots for frequentist inference have a long history, just as does the handling of low-frequency movements for macroeconomic forecasting (Elliott, 2006). Exploiting potential cointegration has been found useful to improve forecasting accuracy under some conditions (e.g., Christoffersen and Diebold (1998); Engle and Yoo (1987); Hall et al. (1992)). From the perspective of engineering a feature matrix, the error correction term could be obtained from a first-step regression à la Engle and Granger (1987) and is just a specific linear combination of existing variables. When it is unclear which variables should enter the cointegrating vector – or whether there exists any such vector – one can alternatively include both variables in levels and differences in the feature matrix. This sort of approach has been pursued most notably by Cook and Hall (2017), who combine variables in levels, first differences and even second differences in the feature matrix they provide to various neural network architectures in the forecasting of US unemployment data.

From a purely predictive point of view, using first differences rather than levels is a linear restriction (using the vector [1, −1]) on how H_t and H_{t−1} can jointly impact y_t. Depending on the prior/regularization being used with a linear regression, this may largely decrease the estimation error or inflate the approximation one.
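To make this restriction explicit in the simplest case, the display below (a reconstruction consistent with the argument above, not taken verbatim from the paper) writes the levels regression with one predictor and the constraint under which it collapses to a first-difference regression.

```latex
% Levels vs. first differences as a restricted levels regression (single-regressor case)
\[
y_{t+h} = \beta_1 H_t + \beta_2 H_{t-1} + e_{t+h}
\;\;\xrightarrow{\;\beta_2 = -\beta_1\;}\;\;
y_{t+h} = \beta_1 (H_t - H_{t-1}) + e_{t+h} = \beta_1 \Delta H_t + e_{t+h},
\]
% i.e., first-differencing imposes the direction [1, -1] on (beta_1, beta_2).
```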
However, it is often admitted that in a time series context (even if Bayesian inference is left largely unaltered by non-stationarity (Sims, 1988)), first differences are useful because they trim out low frequencies which may easily be redundant in large macroeconomic data sets. Using a collection of highly persistent time series in X can easily lead to an unstable X′X inverse (or even a regularized version of it). Such problems naturally extend to the Lasso (Lee et al., 2018). In contrast, tree-based approaches like RF and Boosted Trees do not rely on inverting any matrix. Of course, performing tree-like sample splitting on a trending variable like raw GDP (without any subsequent split on lagged GDP) is almost equivalent to splitting the sample according to a time trend and will often be redundant and/or useless. Nevertheless, there are numerous H_t's where opting for first differencing the data is much less trivial. In such cases, there may be forecasting benefits from augmenting the usual X with levels.

(See, for example, Phillips (1991b,a); Sims (1988); Sims et al. (1990); Sims and Uhlig (1991). Another approach is to consider factor modeling directly with nonstationary data (Bai and Ng, 2004; Peña and Poncela, 2006; Banerjee et al., 2014). A similar comment would apply to all parametric cointegration restrictions; for recent work on the subject, see for example Chan and Wang (2015).)

When regressors outnumber observations, regularization, whether explicit or implicit, is necessary. Hence, the ML algorithms we use all entail a prior which may or may not be well-suited for a time series problem. There is a wide Bayesian VAR literature, starting with Doan et al. (1984), dealing with exactly that. Our approach is rather to (i) observe that most nonparametric ML methods implicitly shrink the individual contribution of each feature to zero in a Ridge-ean fashion (Hastie et al., 2009; Elliott et al., 2013) and (ii) note that rotating regressors implies a new prior in the original space. Hence, by simply creating regressors that embody the more sophisticated linear restrictions, we obtain shrinkage better suited for time series. A first step in that direction is Goulet Coulombe (2020a), who proposes Moving Average Factors to specifically enhance RF's prediction and interpretation potential. A second is to find a rotation of the original lag polynomial such that implementing Ridge-ean shrinkage in fact yields Shiller (1973)'s approach to shrinking lag polynomials.

MOVING AVERAGE FACTORS. Using factors is a standard approach to summarize parsimoniously a panel of heavily cross-correlated variables. Analogously, one can extract a few principal components from each variable-specific panel of lagged values, i.e.,

X̃_{t,k} = [X_{t,k}, LX_{t,k}, ..., L^{P_MAF} X_{t,k}],    X̃_{t,k} = M_t Γ_k′ + ε̃_{k,t},    k = 1, ..., K,    (5)
to achieve a similar goal on the time axis. Define a moving average factor as the vector M_k. Mechanically, we obtain weighted moving averages, where the weights are the principal component estimates of the loadings in Γ_k. By construction, those extractions form moving averages of the P_MAF lags of X_{t,k}, so that they summarize its temporal information most efficiently. By doing so, the goal of summarizing the information in X̃_{t,k} is achieved without modifying any algorithm: we can use the MAFs, which compress information ex ante. As is the case for standard factors, MAFs are designed to maximize the explained variance in X̃_{t,k}, not the fit to the final target. It is the learning algorithm's job to select the relevant linear combinations to maximize the fit.

(A cross-section RF-based example is Rodriguez et al. (2006), who propose "Rotation Forest", which builds an ensemble of trees based on different rotations of X. While we work directly with the latent factors, a related decomposition called singular spectrum analysis works with the estimate of the summed common components, i.e. with M_k Γ_k′. Since this decomposition naturally yields a recursive formula, it has been used to forecast macroeconomic and financial variables (Hassani et al., 2009, 2013), usually in a univariate fashion. P_MAF is a tuning parameter analogous to the construction of the panel of variables (usually taken as given) in a standard factor model.)

We pick P_MAF = 12. We keep two MAFs for each series and they are obtained by PCA.
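A minimal sketch of how such moving average factors could be computed is given below, assuming X is a pandas DataFrame of stationarized monthly series; p_maf = 12 and two factors per series follow the choices above, while the helper name and implementation details are illustrative rather than the authors' code.

```python
import pandas as pd
from sklearn.decomposition import PCA

def moving_average_factors(X, p_maf=12, n_fac=2):
    """Equation (5): for each series, extract principal components of its own panel of
    p_maf lags; the resulting columns are data-driven weighted moving averages."""
    out = {}
    for k in X.columns:
        # variable-specific panel [X_t, L X_t, ..., L^{p_maf} X_t]
        lags = pd.concat({p: X[k].shift(p) for p in range(p_maf + 1)}, axis=1).dropna()
        scores = PCA(n_components=n_fac).fit_transform(lags.values)
        for r in range(n_fac):
            out[f"MAF{r+1}_{k}"] = pd.Series(scores[:, r], index=lags.index)
    # In the pseudo-out-of-sample exercise, this extraction is redone recursively on
    # each training window to avoid look-ahead.
    return pd.DataFrame(out)
```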
MOVING AVERAGE ROTATION OF X. There are many ways one can penalize a lag polynomial. One, in the Minnesota prior tradition, is to shrink all lag coefficients to zero (except for the first self-lag) with increasing harshness in p, the order of the lag. Another is to shrink each β_p to β_{p−1} and β_{p+1} rather than to 0. Intuitively, for higher-frequency series (like the monthly data used here) it is more plausible that a simple linear combination of lags impacts y_t rather than a single one of them with all other coefficients set to 0. For instance, it seems more likely that the average of March, April and May employment growth could impact, say, inflation, than only May's. Mechanically, this means we expect March, April and May's coefficients to be close to one another, which motivated the prior β_p ∼ N(β_{p−1}, σ_u² I_K) and more sophisticated versions of it in other works (Shiller, 1973). Feeding the ML algorithm a transformed X such that its implicit shrinkage to 0 is twisted into this new prior could generate forecasting gains. The only question left is how to make this operational.

The following derivation is a simple translation of Goulet Coulombe (2020b)'s insights for time-varying parameter models to regularized lag polynomials à la Shiller (1973). Consider a generic linear ARDL model with K variables:

y_t = Σ_{p=1}^{P} X_{t−p} β_p + ε_t,    ε_t ∼ N(0, σ_ε²),    (6a)
β_p = β_{p−1} + u_p,    u_p ∼ N(0, σ_u² I_K),    (6b)

where β_p ∈ R^K, X_{t−p} ∈ R^K, the u_p's stack into u ∈ R^{K×P}, and both y_t and ε_t are scalars. There is a natural way of writing this model as the penalized regression problem

min_{β_1,...,β_P}  (1/T) Σ_{t=1}^{T} ( y_t − Σ_{p=1}^{P} X_{t−p} β_p )² / σ_ε²  +  (1/(KP)) Σ_{p=1}^{P} ‖β_p − β_{p−1}‖² / σ_u².    (7)

It is well known that the l2 norm is equivalent to using a normal prior on the penalized quantity. Hence, the model in (7) implicitly assumes β_p − β_{p−1} ∼ N(0, σ_u²), which is exactly what (6) also states. Defining λ ≡ (σ_ε²/σ_u²)(T/(KP)), the problem has the more familiar look of

min_{β_1,...,β_P}  Σ_{t=1}^{T} ( y_t − Σ_{p=1}^{P} X_{t−p} β_p )²  +  λ Σ_{p=1}^{P} ‖β_p − β_{p−1}‖².    (8)

(This is basically a dense vs. sparse choice: MAFs go all the way with the first view by imposing it via the extraction procedure. Such reparametrization schemes are also discussed for the "fused" Lasso in Tibshirani et al. (2015) and employed for a Bayesian local-level model in Koop (2003). We use P as a generic maximum number of lags for presentation purposes; in Table 1 we define P_MARX.)

While we adopt the l2 norm for this exposition, our main goal is to extend traditional regularized lag polynomial ideas to cases where there is no explicitly specified norm on β_p − β_{p−1}. For instance, Elliott et al. (2013) prove that their Complete Subset Regression procedure implies Ridge shrinkage in a special case. Moving away from linearity makes formal arguments more difficult. Nevertheless, it has been argued several times that model/ensemble averaging performs shrinkage akin to that of a ridge regression (Hastie et al., 2009). For instance, the random selection of a subset of eligible features at each split encourages each feature to be included in the predictive function, but in a moderate fashion.

To get implicit regularized lag polynomial shrinkage, we now rewrite problem (7) as a ridge regression. For all derivations to come, it will be much less tedious to turn to matrix notations.
The Fused Ridge problem is now written as

min_β  (y − Xβ)′(y − Xβ) + λ β′D′Dβ,

where D is the first-difference operator. The first step is to reparametrize the problem by using the relationship β_k = Cθ_k that we have for all k regressors, where C is a lower triangular matrix of ones (for the random walk case) and θ_k stacks the initial coefficient and the subsequent increments u. For the simple case of one regressor and P = 4:

[β_1, β_2, β_3, β_4]′ = C [β_1, u_2, u_3, u_4]′,    C = [1 0 0 0; 1 1 0 0; 1 1 1 0; 1 1 1 1].

For the general case of K parameters, we have β = Cθ, C ≡ I_K ⊗ C, and θ is just the stacking of all the θ_k into one long vector of length KP. Using the reparametrization β = Cθ, the Fused Ridge problem becomes

min_θ  (y − XCθ)′(y − XCθ) + λ θ′C′D′DCθ.

Let Z ≡ XC and use the fact that D = C^{−1} to obtain the Ridge regression problem

min_θ  (y − Zθ)′(y − Zθ) + λ θ′θ.    (9)

We have arrived at destination. Using Z rather than X in an algorithm that performs shrinkage will implicitly shrink β_p to β_{p−1} rather than to 0. This is obviously much more convenient than modifying the algorithm itself, and directly applicable to any algorithm using time series data as input. One question remains: what is Z, exactly? For a single lag polynomial at time t, we have Z_{t,k} = X_{t,k}C. C is gradually summing up the columns of X_{t,k} over p. Thus, Z_{t,k,p} = Σ_{p′=1}^{p} X_{t,k,p′}. Dividing each Z_{t,k,p} by p (just another linear transformation, yielding Z̃_{t,k,p}), it is now clear that Z̃ is a matrix of moving averages. Those are of increasing order (from p = 1 to p = P) and the last observation in the average is always X_{t−1,k}. Hence, we refer to this particular form of feature engineering as the Moving Average Rotation of X (MARX).

(Recently, Goulet Coulombe (2020c) argued that ensemble averaging methods à la RF prune a latent tree. Following this view, the need for cleverly pre-assembled data combinations is even clearer.)

RECAP. We summarize our setup in Table 1. We have five basic sets of transformations to feed the approximation of f*_Z: (1) single-period differences and growth rates following McCracken and Ng (2016) (X_t and their lags), (2) principal components of X_t (F_t and their lags), (3) variables in levels (H_t and their lags), (4) moving average factors of X_t (MAF_t) and (5) sets of simple moving averages of X_t (MARX_t). We consider several forecasting models in order to approximate the true functional form: Autoregressive (AR), Factor Model (FM, à la Stock and Watson (2002a)), Adaptive Lasso (AL), Elastic Net (EN), Linear Boosting (LB), Random Forest (RF) and Boosted Trees (BT). The details on the forecasting models are presented in Appendix A.

Furthermore, most ML methodologies that handle high-dimensional data well perform some form or another of variable selection. For instance, RF evaluates a certain fraction of predictors at each split and selects the most potent one. Lasso selects relevant predictors and shrinks others perfectly to 0. By rotating X, we can get these algorithms (and others) to perform restriction/transformation selection. Thus, one should not refrain from studying different combinations of f_Z's. As a result, all the combinations of f_Z thereof are admissible, and 15 of them are included in the exercise. Moreover, there is a long-standing worry that well-accepted transformations may lead to some over-differenced X_k's (McCracken and Ng, 2020).
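As a concrete illustration of the Z = XC rotation derived above, the following minimal sketch builds the MARX features as scaled partial sums of lags; the function name and p_marx = 12 default are illustrative choices consistent with the text, not the authors' code.

```python
import numpy as np
import pandas as pd

def marx(X, p_marx=12):
    """Moving Average Rotation of X: for each series, the p-th feature at date t is the
    average of lags 1,...,p, i.e. the scaled partial sums Z = XC derived above."""
    out = {}
    for k in X.columns:
        lags = pd.concat({p: X[k].shift(p) for p in range(1, p_marx + 1)}, axis=1)
        # cumulative sums over the lag index, divided by p -> moving averages of order p
        ma = lags.cumsum(axis=1).div(np.arange(1, p_marx + 1), axis=1)
        for p in range(1, p_marx + 1):
            out[f"MARX{p}_{k}"] = ma[p]
    return pd.DataFrame(out).iloc[p_marx:]   # drop rows with incomplete lag history
```

Feeding the output of such a function to a shrinkage-based or tree-based learner, rather than the raw lag matrix, is all that is needed to obtain the implicit Shiller-type prior.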
Including MARX or MAF (which are both specific partial sums of lags) with X can be seen as bridging the gap between a first difference and keeping H_k in levels. Hence, interacting many f_Z's is not only statistically feasible, but econometrically desirable given the sizable uncertainty surrounding what is a "proper" transformation of the raw data (Choi, 2015).

(Notwithstanding, some authors have noted that a trade-off emerges between how focused an RF is and its robustness via diversification: Borup et al. (2020) sometimes get improvements over plain RF by adding a Lasso pre-processing step to trim X.)

Table 1: Feature matrices and forecasting models

Transformation | Feature matrix
F     | Z(F)_t := [F_t, LF_t, ..., L^{p_f} F_t]
X     | Z(X)_t := [X_t, LX_t, ..., L^{p_X} X_t]
MARX  | Z(MARX)_t := [MARX^{(1)}_{1t}, ..., MARX^{(p_MARX)}_{1t}, ..., MARX^{(1)}_{Kt}, ..., MARX^{(p_MARX)}_{Kt}]
MAF   | Z(MAF)_t := [MAF^{(1)}_{1t}, ..., MAF^{(r_1)}_{1t}, ..., MAF^{(1)}_{Kt}, ..., MAF^{(r_K)}_{Kt}]
Level | Z(Level)_t := [H_t, LH_t, ..., L^{p_H} H_t]

Model | Functional space
Autoregression (AR) | Linear
Factor Model (FM) | Linear
Adaptive Lasso (AL) | Linear
Elastic Net (EN) | Linear
Linear Boosting (LB) | Linear
Random Forest (RF) | Nonlinear
Boosted Trees (BT) | Nonlinear

In this section, we present the results of a pseudo-out-of-sample forecasting experiment for a group of target variables at monthly frequency from the FRED-MD dataset of McCracken and Ng (2016). Our target variables are the industrial production index (INDPRO), total nonfarm employment (EMP), the unemployment rate (UNRATE), real personal income excluding current transfers (INCOME), real personal consumption expenditures (CONS), retail and food services sales (RETAIL), housing starts (HOUST), the M2 money stock (M2), the consumer price index (CPI), and the producer price index (PPI). Given that we make predictions at horizons of 1, 3, 6, 9, 12 and 24 months, we are effectively targeting the average growth rates over those periods, except for the unemployment rate for which we target average differences. These series are representative macroeconomic indicators of the US economy, as stated in Kim and Swanson (2018); the selection is also based on the Goulet Coulombe et al. (2020) exercise for many ML models, itself based on Kotchoni et al. (2019) and a whole literature of extensive horse races in the spirit of Stock and Watson (1998).

The POOS period starts in January 1980 and ends in December 2017. We use an expanding window for estimation starting from 1960M01. Following standard practice in the literature, we evaluate the quality of point forecasts using the root Mean Square Error (RMSE). For the forecasted value at time t of variable v made h steps before, we compute

RMSE_{v,h,m} = sqrt( (1/#OOS) Σ_{t ∈ OOS} (y_{v,t} − ŷ^{m}_{v,h,t−h})² ).    (10)

The standard Diebold and Mariano (2002) (DM) test procedure is used to compare the predictive accuracy of each model against the reference factor model (FM). The RMSE is the most natural loss function given that all models are trained to minimize the squared loss in-sample. We also implement the Model Confidence Set (MCS), which selects the subset of best models at a given confidence level (Hansen et al., 2011).

Hyperparameter selection is performed using the BIC for AR and FM, and K-fold cross-validation is used for the remaining models. This approach is theoretically justified in time series models under conditions spelled out by Bergmeir et al. (2018). Moreover, Goulet Coulombe et al.
(2020) compared it with a scheme which respects the time structure of the data in the context of macroeconomic forecasting and found K-fold to perform as well as or better than this alternative scheme. All models are estimated every month while their hyperparameters are reoptimized every 2 years.

Table 2 shows the best RMSE data transformation combinations as well as the associated functional forms for every target and forecasting horizon. It summarizes the main findings and provides important recommendations for practitioners in the field of macroeconomic forecasting.
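For concreteness, the evaluation protocol just described (expanding window from 1960M01, monthly re-estimation, hyperparameters re-tuned every 24 months by K-fold CV) can be sketched as follows. The model object, grid and helper names are placeholders, not the authors' code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def poos_forecasts(Z, y, start="1980-01", retune_every=24):
    """Expanding-window POOS: monthly re-estimation, hyperparameters re-tuned by
    K-fold CV every `retune_every` months (sketched here with a Random Forest)."""
    preds, best = {}, None
    for i, t in enumerate(y.loc[start:].index):
        train = Z.index < t                      # expanding estimation window
        # NB: in the real exercise the last h training targets are not yet observed
        # at date t and must be dropped; omitted here for brevity.
        X_tr, y_tr = Z[train].values, y[train].values
        if i % retune_every == 0:                # biennial hyperparameter re-optimization
            grid = GridSearchCV(RandomForestRegressor(), {"max_features": [0.33, 1.0]}, cv=5)
            best = grid.fit(X_tr, y_tr).best_params_
        model = RandomForestRegressor(**best).fit(X_tr, y_tr)
        preds[t] = model.predict(Z.loc[[t]].values)[0]
    return pd.Series(preds)
```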
First, including non-standard choices of macroeconomic data transformation, MARX, MAF and Level, minimizes the RMSE for 7 and 8 variables out of 10 when predicting 1 and 3 months ahead, respectively. Their overall importance is still resilient at horizons 6, 9 and 12, as they are part of the best specifications for 6, 5 and 6 variables. At the 2-year ahead horizon, their performance is less stellar, with only 3 wins.
Second, their success is often paired with a nonlinear functional form g (24 out of 35 cases), with an advantage for Random Forests over Boosted Trees; the former is used for 15 of those 24 cases. Both algorithms make heavy use of shrinkage and allow for nonlinearities via tree base learners. This is precisely the algorithmic environment that we previously conjectured to be where data transformations matter.

Table 2: Best Model Specifications

       INDPRO  EMP  UNRATE  INCOME  CONS  RETAIL  HOUST  M2  CPI  PPI
H=1    RF      RF   LB      BT      FM    FM      FM     BT  EN   EN
H=3    RF      RF   RF      RF      FM    AL      BT     BT  RF   EN
H=6    AL      RF   LB      RF      RF    AL      BT     BT  RF   RF
H=9    LB      LB   EN      RF      RF    AL      RF     BT  RF   RF
H=12   RF      LB   RF      RF      RF    RF      RF     BT  EN   RF
H=24   RF      LB   RF      RF      RF    EN      RF     RF  AL   RF

Note: In the original table, colored bullets next to each entry indicate the data transformations (F, MARX, X, Level, MAF) included in the best model specification; the bullets are not reproducible in this text rendering.
Without a doubt, the most visually obvious feature of Table 2 is the abundance of green bullets. As expected, transforming X into factors is probably the most effective form of feature engineering available to the macroeconomic forecaster. Factors are included as part of the optimal specification for the overwhelming majority of targets. Furthermore, including factors only, in combination with RF, is the best forecasting strategy for both CPI and PPI inflation, and that for the vast majority of horizons. This is in contrast with the results found in Medeiros et al. (2019), but is in line with the findings in Goulet Coulombe et al. (2020). Finally, the omission of factors from the optimal specifications for industrial production growth 1, 3 and 12 months ahead is naturally surprising. This points out that current wisdom based on linear models may not be directly applicable to nonlinear ones. In fact, alternative rotations will sometimes do better.

There are plenty of red bullets populating the top rows of Table 2. Indeed, our most salient new transformation is MARX, which is generally important for short horizons, h =
1, 3. In combination with nonlinear models, it contributes to improving forecasting accuracy for real activity series such as industrial production, employment, the unemployment rate and income, while it is best paired with the elastic net to predict the CPI and PPI inflation rates. MARX remains a good choice for employment growth at horizons h = 12 and h =
24, and for income and retail when predicting 6 and 9 months ahead. We further investigate how those RMSE gains materialize in terms of forecasts around key periods in section 3.6. While
MAF performance is often positively correlated with
MARX, the latter is usually the better of the two, except for some one-year ahead real activity and consumption targets. Including levels is particularly important for the M2 money stock and consumption growth, for all horizons except 2-year ahead and h =
6, 9, 12, respectively. It is interesting to note that in those cases moving average transformations do not bring any relevant information. Levels are also omnipresent when predicting employment growth, but their marginal effects are not quantitatively substantial.

These findings are particularly important given the increasing interest in ML macro forecasting. They suggest that traditional data transformations, meant to achieve stationarity, do leave substantial forecasting gains on the practitioners' table. These losses can be successfully recovered by combining ML methods with well-motivated rotations of predictors such as
MARX and
MAF, or sometimes by simply including variables in levels.
While the previous results were desirably expeditive, the underlying performance gains were not quantified and their statistical significance not assessed. We present the h = 1 results in Table 3. Combining MARX and levels improves the industrial production, employment and unemployment rate forecast accuracy by 6, 3 and 5%, respectively. Although significant, this might seem like a small amelioration. Nevertheless, this is an important upgrade for two reasons. First, this is a one-month ahead prediction and, second, the benchmark is a state-of-the-art forecasting model that has been resilient since its introduction by Stock and Watson (2002b). Hence, little predictive gain remains on the table. The linear factor model remains the best option to predict consumption, retail and housing starts growth, while combining X, levels and factors within the Boosted Trees algorithm significantly improves the forecast accuracy for M2 by 7%. The short-run CPI and PPI inflation predictions are refined by 5 and 7% when the Elastic Net model is used with MARX.

The model confidence set (MCS) also highlights the importance of non-standard data transformations. In the case of industrial production, the MCS strikingly selects only three data combinations with Random Forests:
F-MARX, F-X-MARX-Level and
X-MARX-Level. The MCSs for CPI and PPI inflation rates contain
MARX transformations, which are also employed in linear models.

Forecasting gains associated with alternative transformations are in general larger when predicting 3 months ahead, as shown in Table 4. For instance, they are significant and achieve 10 and 15% reductions in RMSPE over the factor model for industrial production and the unemployment rate, respectively. Adding levels to X and F in Boosted Trees significantly improves the M2 growth prediction by 10%. Moreover, the MCS again selects MARX for industrial production, but also in the case of employment growth.

Table 3: Relative RMSE for h = 1
[The body of Table 3 is not recoverable from this version of the text. For each target (INDPRO, EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, PPI), it reports the RMSE of the benchmarks and the relative RMSE of each feature-set combination (F, X, MARX, MAF, Level and their interactions) estimated with the Adaptive Lasso, Elastic Net, Linear Boosting, Random Forest and Boosted Trees.]

Note: The numbers represent the relative (with respect to the AR,BIC model) root MSPE. Models retained in the model confidence set are in bold, the minimum values are underlined, while ***, ** and * stand for 1%, 5% and 10% significance of the Diebold-Mariano test.
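The stars in these tables come from Diebold-Mariano comparisons of squared-error losses against a benchmark. As a point of reference, a compact generic sketch of such a test is given below; the function name and the rectangular HAC truncation at h-1 lags are illustrative textbook choices, not the paper's code.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e_bench, e_model, h=1):
    """DM statistic for equal squared-error loss, benchmark vs. candidate model.
    Positive values favour the candidate. Rectangular HAC truncation at h-1 lags."""
    d = np.asarray(e_bench) ** 2 - np.asarray(e_model) ** 2   # loss differential
    T = d.size
    gammas = [d.var()] + [np.cov(d[k:], d[:T - k], bias=True)[0, 1] for k in range(1, h)]
    lrv = gammas[0] + 2 * sum(gammas[1:])                      # long-run variance of d_t
    dm = d.mean() / np.sqrt(lrv / T)
    return dm, 2 * (1 - stats.norm.cdf(abs(dm)))
```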
Tables 5 and 6 present the results for h = 6 and h = 9. Gains over the reference model remain substantial and significant in many cases. Silent for the two previous horizons, consumption growth forecast accuracy improves by 10%, while adding
MARX for the income growth prediction decreases the relative RMSE by 8% when h = 6. MARX and Levels are often selected by the MCS in the case of employment and income. At the 9-month ahead horizon, results demonstrate how the relevance of statistical factors can be magnified by using them as inputs in alternative linear models. For instance, using linear boosting and elastic net instead of OLS improves the forecast accuracy for industrial production and the unemployment rate by 12%. Combining them with nonparametric nonlinearity is equally, if not more, fruitful. Introducing factors in RF, we get RMSEs to decrease by 12, 16 and 14% for housing starts, CPI and PPI. Combining X and MARX in RF improves the forecast precision by 8% for income, while considering
X-Level reduces the consumption RMSE by 12%.

Tables 7 and 8 present results for the one and two-year ahead horizons. For h =
12, the
MAF data transformation emerges as superior for industrial production, the unemployment rate and retail sales. Looking carefully, we remark that including
MAF (with or without factors) in the Random Forests model works best in the case of INDPRO and UNRATE, while X is also needed when predicting retail sales. Inflation forecasting at this horizon is of special interest for monetary policy under an inflation targeting regime. Here, results show that Elastic Net and RF models using F improve the forecast precision by more than 15%. In the case of h =
24, only including levels plays a major role for some variables. Combined with X in RF, it significantly decreases the industrial production RMSE by 12%, while adding MARX improves the forecast accuracy by 9% in the case of employment. The combination
X-Level with EN is also important for retail sales, where the RMSE is reduced by 18%.
In order to disentangle the marginal effects of data transformations on forecast accuracy, we run the following regression, inspired by Carriero et al. (2019) and Goulet Coulombe et al. (2020):

R²_{t,h,v,m} = α_ℱ + ψ_{t,v,h} + v_{t,h,v,m},    (11)

where R²_{t,h,v,m} ≡ 1 − e²_{t,h,v,m} / [ (1/T) Σ_{t=1}^{T} (y_{v,t+h} − ȳ_{v,h})² ] is the pseudo-out-of-sample R², and e²_{t,h,v,m} are the squared prediction errors of model m for variable v and horizon h at time t. ψ_{t,v,h} is a fixed-effect term that demeans the dependent variable by "forecasting target," that is, a combination of t, v and h. α_ℱ is a vector of α_MARX, α_MAF and α_F terms associated with each new data transformation considered in this paper, as well as with the factor model. H_0 is α_f = 0 ∀ f ∈ ℱ = {MARX, MAF, F}. In other words, the null is that there is no predictive accuracy gain with respect to a base model that does not have this particular data pre-processing. While the generality of (11) is appealing, when investigating the heterogeneity of specific partial effects, it will be much more convenient to run specific regressions for the multiple hypotheses we wish to test. That is, to evaluate a feature f, we run

∀ m ∈ M_f :  R²_{t,h,v,m} = α_f + ψ_{t,v,h} + v_{t,h,v,m},    (12)

where M_f is defined as the set of models that differ only by the feature under study f.
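A compact sketch of how regression (11) could be run is given below, assuming a long-format DataFrame df with one row per (t, h, v, m); all column names, the data layout and the HAC lag length are illustrative assumptions, not the authors' code.

```python
import statsmodels.formula.api as smf

# `df` is assumed to hold the pseudo-R2 (`r2`), 0/1 indicators MARX, MAF, F for the
# transformations entering model m, and identifiers t, h, v (all hypothetical names).
df["target"] = df["v"].astype(str) + "_h" + df["h"].astype(str) + "_" + df["t"].astype(str)

# Equation (11): transformation dummies plus "forecasting target" fixed effects,
# with HAC standard errors as in the notes to Figures 1-3.
fit = smf.ols("r2 ~ MARX + MAF + F + C(target)", data=df).fit(
    cov_type="HAC", cov_kwds={"maxlags": 12})
print(fit.params[["MARX", "MAF", "F"]])
```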
Figure 1 plots the distribution of α^{(h,v)}_MARX from equation (11), done by (h, v) subsets. Hence, we allow for heterogeneous effects of the MARX transformation across 60 different targets. The marginal contribution of MARX on the pseudo-R² depends a lot on models, horizons and series. However, we remark that at short-run horizons, when combined with nonlinear methods, it produces positive and significant effects. It particularly improves the forecast accuracy for real activity series like industrial production, labor market series and income, even at larger horizons. For instance, the gains from using MARX with RF reach 16% when predicting INDPRO at the h =
6 horizon. When used with linear methods, the estimates are more often on the negative side, except for inflation rates and M2 at short horizons, and a few special cases at the one and two-year ahead horizons.

Figure 2 plots the distribution of α^{(h,v)}_MAF, conditional on including X in the model. The motivation for that is that MAF, by construction, summarizes the entirety of [X_{t−p}]_{p=1}^{P_MAF} with no special emphasis on the most recent information. Thus, it is better advised to always include the raw X with MAF, so recent information may interact with the lag polynomial summary if ever needed.
MAF contributions are overall more muted than those of
MARX, except when used with the Linear Boosting method. Nevertheless, it is noticed that it shares common gains with the latter, as short horizons (h =
3, 6) of real activity variables also benefit from it. More convincing improvements are observed for retail sales at the 2-year horizon for nonlinear methods.

It has already been documented that factors matter – and a lot (Stock and Watson, 2002a,b). Figure 3 evaluates their quantitative effects. Including a handful of factors rather than all of (stationary) X improves forecast accuracy substantially and significantly. The case for this is even stronger when those are used in conjunction with nonlinear methods, especially for prediction at longer horizons. This finding supports the view that a factor model is an accurate depiction of the macroeconomy, as originally suggested by Sargent and Sims (1977) and later expanded in various forecasting and structural analysis applications (Stock and Watson, 2002a; Bernanke et al., 2005).
MAF to introduce priority on recent lags à la Minesota-prior, butwe leave that possibility for future research.
Figure 1: Distribution of MARX Marginal Effects
(Panels by model: Adaptive Lasso, Elastic Net, Linear Boosting, Random Forest, Boosted Trees; rows by horizon; series: INDPRO, EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, PPI.)
Note: This figure plots the distribution of α^{(h,v)}_f from equation (12), done by (h, v) subsets. That is, it shows the average partial effect on the pseudo-R² from augmenting the model with MARX features, keeping everything else fixed. SEs are HAC. These are the 95% confidence bands.
In this line of thought, transforming X into F is not merely a mechanical dimension reduction step. Rather, it is meaningful feature engineering uncovering true latent factors which contain most, if not all, of the relevant information about the current state of the economy. Once F's are extracted, the standard diffusion indices model of Stock and Watson (2002b) can either be upgraded by using linear methods performing variable selection, or by nonlinear functional form approximators such as Random Forests and Boosted Trees.

In order to examine the stability of forecast accuracy, we consider the fluctuation test of Giacomini and Rossi (2010). Figure 4 shows the results for a few selected cases. Following the simulation results in Giacomini and Rossi (2010), the moving average of the standardized difference of MSEs is produced with a 136-month window, which corresponds to 30% of the out-of-sample size. The upper left panel shows the results for industrial production and labor market series at the 3-month ahead horizon. There is a fair amount of instability over time for both nonlinear models and all targets.
Figure 2: Distribution of MAF Marginal Effects
(Panels by model: Adaptive Lasso, Elastic Net, Linear Boosting, Random Forest, Boosted Trees; rows by horizon; series: INDPRO, EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, PPI.)
Note: This figure plots the distribution of α^{(h,v)}_f from equation (12), done by (h, v) subsets. That is, it shows the average partial effect on the pseudo-R² from augmenting the model with MAF features, keeping everything else fixed. SEs are HAC. These are the 95% confidence bands.
In the case of INDPRO with RF, the data combinations including the
MARX transformation dominates the benchmark and the alternatives most of the time, but takes off even more significantly and substantially since the Great Recession. A similar pattern is observed for the unemployment rate, while in the case of employment the improvements are not significant since 2010.

The upper right panel shows the results for M2 and inflation series when predicted 12 months ahead with RF and BT. Clearly, during the first half of the POOS, levels were an impactful inclusion for M2 growth. However, that improvement eventually shrank and, since the Great Recession, transforming data into factors appears to be the best choice. As documented in the previous section, including standard factors in RF was the overall best strategy. This performance is quite stable for CPI inflation, while it becomes better than the benchmark from the mid-1990s in the case of PPI. Interestingly, the inflation forecasting performances of all specifications converge since the Great Recession, and in the case of CPI, including
MARX and
Level transformations dominate. The bottom panel presents the results for four other series and the h =
12 horizon.

Figure 3: Distribution of F Marginal Effects
(Panels by model: Adaptive Lasso, Elastic Net, Linear Boosting, Random Forest, Boosted Trees; rows by horizon; series: INDPRO, EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI, PPI.)
Note: This figure plots the distribution of α^{(h,v)}_f from equation (12), done by (h, v) subsets. That is, it shows the partial effect on the pseudo-R² from considering only the F features versus including only the observables X. SEs are HAC. These are the 95% confidence bands.

In the case of income growth, factors used to be the key component between the two recessions, while the MAF transformation combined with RF wins since 2010. In the case of consumption, the fluctuation test shows a lot of instability, especially when RF is used as the forecasting model.
MAF and factors are relevant in the middle of the sample, while the
Level and
MARX transformations gain importance since the Great Recession.
MAF became the most resilient transformation when predicting retail sales since 2000. In the case of housing starts, all the specifications considered start improving and converging since the mid-2000s.
Figure 4: Giacomini-Rossi Fluctuation Test
(Panels: Random Forest and Boosted Trees for INDPRO, EMP, UNRATE; M2, CPI, PPI; INCOME, CONS, RETAIL, HOUST.)
Note: The figure shows the Giacomini-Rossi fluctuation test against the benchmark FM model. The horizontal lines depict the 10% critical values. A model is significantly better than the benchmark if the test statistic is above the upper critical value line. Colors represent selected data transformations included with each nonlinear forecasting model: F, F-X, F-MARX, F-X-MARX, F-X-MARX-Level, F-X-Level, F-MAF, F-X-MAF.

In this section we conduct "event studies" to highlight more explicitly the importance of data pre-processing when predicting real activity and inflation indicators. Figure 5 plots cumulative squared errors for three cases where specific transformations stand out. On the left, we compare the performance of RF when predicting industrial production growth 3 months ahead, using either F, X or F-X-MARX as the feature matrix. The middle panel shows the same exercise for employment growth. On the right, we report one-year ahead CPI inflation forecasts. The industrial production and employment examples clearly document the merits of including MARX: its cumulatively-summed squared errors (when using RF) are always below the ones produced by using F and X. The gap widens slowly until the Great Recession, after which it increases substantially. As discussed in section 3.2, using common factors with RF constitutes the optimal specification for CPI inflation. Figure 5 illustrates this finding and shows that the gap between using F or X widens during the mid-80s and mid-90s, and just before the Great Recession.

Figure 5: Cumulative Squared Errors. (a) INDPRO 3-month ahead; (b) EMP 3-month ahead; (c) CPI inflation 12-month ahead.
Note: The figure plots cumulative squared 3-month ahead forecast errors for industrial production and employment, and 12-month ahead for CPI inflation, produced with the Random Forest model that uses the predictors shown at the top of each panel.

In Figures 6 and 7, we look more closely at each model's performance during the last three recessions and subsequent recoveries. Precisely, we plot the 3-month ahead forecasts for the period covering 3 months before and 24 months after a recession, for industrial production and employment. The forecasting models are all RF-based, and differ by their use of either F, X or F-X-MARX. On the right side, we show the RMSE ratio of each RF specification against the benchmark FM model for the whole POOS and for the episode under analysis. In the case of industrial production, the
F-X-MARX specification outperforms the others during the Great Recession and its aftermath, and improves even more upon the benchmark model compared to the full POOS period. We observe on the left panel that forecasts made with
F-X-MARX are much closer to the realized values at the end of the recession and during the recovery. The situation is qualitatively similar during the 2001 recession, but the effects are smaller. Including
MARX also emerges as the best alternative around the 1990-1991 recession, but the benchmark model is more competitive for this particular episode. In the case of employment,
MARX again supplants F or X in all three recessions. For instance, around the Dotcom bubble burst, it displays an outstanding performance, surpassing the benchmark by 40%. However, during the Great Recession, it is outperformed by the traditional factor model. Finally, the F-X-MARX combination provides the most accurate forecast during and after the credit crunch recession of the early 1990s.

Figure 6: Case of Industrial Production. (a) Recession Episode of 2007-12-01; (b) Recession Episode of 2001-03-01; (c) Recession Episode of 1990-07-01.
Note: The figure plots 3-month ahead forecasts for the period covering 3 months before and 24 months after each recession. RMSE ratios are relative to the FM model and the episode RMSE refers to the visible time period.

Figure 7: (a) Recession Episode of 2007-12-01; (b) Recession Episode of 2001-03-01; (c) Recession Episode of 1990-07-01.
Note: The figure plots 3-month ahead forecasts for the period covering 3 months before and 24 months after the recession. Log RMSE ratios are relative to the FM model and the episode RMSE refers to the visible time period.

Conclusion
This paper studied the virtues of standard and newly proposed data transformations for macroeconomic forecasting with machine learning. The classic transformations comprise the dimension reduction of stationarized data by means of principal components, and the inclusion of level variables in order to take into account low-frequency movements. Newly proposed avenues include moving average factors (MAF) and the moving average rotation of X (MARX). The last two were motivated by the need to compress the information within a lag polynomial, especially if one desires to keep X close to its original – interpretable – space.

To evaluate the contribution of data transformations for macroeconomic prediction, we have considered three linear and two nonlinear ML methods (Elastic Net, Adaptive Lasso, Linear Boosting, Random Forests and Boosted Trees) in a substantive pseudo-out-of-sample forecasting exercise done over 38 years for 10 key macroeconomic indicators and 6 horizons. With the different permutations of f_Z's available from the above, we have analyzed a total of 15 different information sets. The combination of standard and non-standard data transformations (MARX, MAF, Level) is shown to minimize the RMSE, particularly at shorter horizons. Those consistent gains are usually obtained when a nonlinear non-parametric ML algorithm is being used. This is precisely the algorithmic environment we conjectured could benefit most from our proposed f_Z's. Additionally, traditional factors are featured in the overwhelming majority of best information sets for each target. Therefore, while ML methods can handle the high-dimensional X (both computationally and statistically), extracting common factors remains straightforward feature engineering that works. As the number of researchers and practitioners in the field is ever-growing, we believe those insights constitute a bedrock on which stronger ML-based systems can be developed to further improve macroeconomic forecasting.

References

Almon, S. (1965). The distributed lag between capital appropriations and expenditures. Econometrica, pages 178–196.
Almon, S. (1965). The distributed lag between capital appropriations and expenditures. Econometrica, pages 178–196.
Bai, J. and Ng, S. (2004). A PANIC attack on unit roots and cointegration. Econometrica, 72(4):1127–1177.
Bai, J. and Ng, S. (2009). Boosting diffusion indices. Journal of Applied Econometrics, 24:607–629.
Bańbura, M., Giannone, D., and Reichlin, L. (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics, 25(1):71–92.
Banerjee, A., Marcellino, M., and Masten, I. (2014). Forecasting with factor-augmented error correction models. International Journal of Forecasting, 30(3):589–612.
Bergmeir, C., Hyndman, R. J., and Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120:70–83.
Bernanke, B., Boivin, J., and Eliasz, P. (2005). Measuring the effects of monetary policy: a factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics, 120:387–422.
Borup, D., Christensen, B. J., Mühlbach, N. N., Nielsen, M. S., et al. (2020). Targeting predictors in random forest regression. Technical report, Department of Economics and Business Economics, Aarhus University.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Carriero, A., Galvão, A. B., and Kapetanios, G. (2019). A comprehensive evaluation of macroeconomic forecasting methods. International Journal of Forecasting, 35(4):1226–1239.
Chan, N. and Wang, Q. (2015). Nonlinear regressions with nonstationary time series. Journal of Econometrics, 185(1):182–195.
Choi, I. (2015). Almost All About Unit Roots: Foundations, Developments, and Applications. Cambridge University Press.
Christoffersen, P. F. and Diebold, F. X. (1998). Cointegration and long-horizon forecasting. Journal of Business & Economic Statistics, 16(4):450–456.
Cook, T. and Hall, A. S. (2017). Macroeconomic indicator forecasting with deep neural networks. Technical report, Federal Reserve Bank of Kansas City, Research Working Paper.
Diebold, F. X. and Mariano, R. S. (2002). Comparing predictive accuracy. Journal of Business & Economic Statistics, 20(1):134–144.
Doan, T., Litterman, R., and Sims, C. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews, 3(1):1–100.
Elliott, G. (2006). Forecasting with trending data. In Elliott, G., Granger, C., and Timmermann, A., editors, Handbook of Economic Forecasting, volume 1, chapter 11, pages 555–604. Elsevier.
Elliott, G., Gargano, A., and Timmermann, A. (2013). Complete subset regressions. Journal of Econometrics, 177(2):357–373.
Engle, R. F. and Granger, C. W. (1987). Co-integration and error correction: representation, estimation, and testing. Econometrica, pages 251–276.
Engle, R. F. and Yoo, B. S. (1987). Forecasting and testing in co-integrated systems. Journal of Econometrics, 35(1):143–159.
Ghysels, E., Santa-Clara, P., and Valkanov, R. (2004). The MIDAS touch: Mixed data sampling regression models.
Giacomini, R. and Rossi, B. (2010). Forecast comparisons in unstable environments. Journal of Applied Econometrics, 25(4):595–620.
Goulet Coulombe, P. (2020a). The macroeconomy as a random forest. arXiv preprint arXiv:2006.12724.
Goulet Coulombe, P. (2020b). Time-varying parameters as ridge regressions.
Goulet Coulombe, P. (2020c). To bag is to prune.
Goulet Coulombe, P., Leroux, M., Stevanovic, D., and Surprenant, S. (2020). How is machine learning useful for macroeconomic forecasting? Technical report, CIRANO Working Papers, 2019s-22.
Hall, A. D., Anderson, H. M., and Granger, C. W. (1992). A cointegration analysis of treasury bill yields. The Review of Economics and Statistics, pages 116–126.
Hansen, P. R., Lunde, A., and Nason, J. M. (2011). The model confidence set. Econometrica, 79(2):453–497.
Hassani, H., Heravi, S., and Zhigljavsky, A. (2009). Forecasting European industrial production with singular spectrum analysis. International Journal of Forecasting, 25(1):103–118.
Hassani, H., Soofi, A. S., and Zhigljavsky, A. (2013). Predicting inflation dynamics with singular spectrum analysis. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(3):743–760.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.
Kim, H. H. and Swanson, N. R. (2018). Mining big data using parsimonious factor, machine learning, variable selection and shrinkage methods. International Journal of Forecasting, 34(2):339–354.
Koop, G. M. (2003). Bayesian Econometrics. John Wiley & Sons Inc.
Kotchoni, R., Leroux, M., and Stevanovic, D. (2019). Macroeconomic forecast accuracy in a data-rich environment. Journal of Applied Econometrics, 34(7):1050–1072.
Kuhn, M. and Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.
Lee, J. H., Shi, Z., and Gao, Z. (2018). On LASSO for predictive regression. arXiv preprint arXiv:1810.03140.
McCracken, M. and Ng, S. (2020). FRED-QD: A quarterly database for macroeconomic research. Technical report, National Bureau of Economic Research.
McCracken, M. W. and Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics, 34(4):574–589.
Medeiros, M. C., Vasconcelos, G. F., Veiga, A., and Zilberman, E. (2019). Forecasting inflation in a data-rich environment: the benefits of machine learning methods. Journal of Business & Economic Statistics, pages 1–22.
Peña, D. and Poncela, P. (2006). Nonstationary dynamic factor analysis. Journal of Statistical Planning and Inference, 136(4):1237–1257.
Phillips, P. C. (1991a). Optimal inference in cointegrated systems. Econometrica, pages 283–306.
Phillips, P. C. (1991b). To criticize the critics: An objective Bayesian analysis of stochastic trends. Journal of Applied Econometrics, 6(4):333–364.
Rodriguez, J. J., Kuncheva, L. I., and Alonso, C. J. (2006). Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619–1630.
Sargent, T. and Sims, C. (1977). Business cycle modeling without pretending to have too much a priori economic theory. In Sims, C., editor, New Methods in Business Cycle Research. Federal Reserve Bank of Minneapolis, Minneapolis.
Shiller, R. J. (1973). A distributed lag estimator derived from smoothness priors. Econometrica, pages 775–788.
Sims, C. A. (1988). Bayesian skepticism on unit root econometrics. Journal of Economic Dynamics and Control, 12(2-3):463–474.
Sims, C. A., Stock, J. H., and Watson, M. W. (1990). Inference in linear time series models with some unit roots. Econometrica, pages 113–144.
Sims, C. A. and Uhlig, H. (1991). Understanding unit rooters: A helicopter tour. Econometrica, pages 1591–1599.
Stock, J. H. and Watson, M. W. (1998). A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series. Technical report, National Bureau of Economic Research.
Stock, J. H. and Watson, M. W. (2002a). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167–1179.
Stock, J. H. and Watson, M. W. (2002b). Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics, 20(2):147–162.
Tibshirani, R., Wainwright, M., and Hastie, T. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
Forecasting Models in Detail
In this section, we briefly review the basics of the econometric ML methods used in this paper. For a more complete discussion, see, among others, Hastie et al. (2009).

Linear Models. We consider the autoregressive model (AR) as well as the factor model of Stock and Watson (2002a,b). Let $Z_t := [y_t, \dots, L^{P_y} y_t, F_t, \dots, L^{P_f} F_t]$ be our feature matrix; the factor model is then given by
$$y_{t+h} = \beta Z_t + \epsilon_{t+h} \qquad (13)$$
where the aforementioned factors $F_t$ are extracted by principal components from $X_t$ and parameters are estimated by OLS. The AR model is obtained by imposing $\beta_k = 0$ for all $k$'s tied to latent factors and their lagged values.
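As a concrete illustration, the following is a minimal Python sketch of this direct factor-model forecast. The function name, the default of 8 factors, the 3-month horizon and the use of scikit-learn are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def factor_model_forecast(X, y, h=3, n_factors=8, p_y=12, p_f=12):
    """Direct h-step forecast y_{t+h} = beta * Z_t + e_{t+h}, where Z_t stacks
    lags of y and of principal-component factors extracted from X.
    X: (T, N) panel of stationarized predictors; y: (T,) target series.
    Defaults (8 factors, h=3) are illustrative choices, not the paper's."""
    T = len(y)
    # Factors F_t: principal components of the standardized predictor panel
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    F = PCA(n_components=n_factors).fit_transform(X_std)

    # Build Z_t = [y_t, ..., y_{t-p_y+1}, F_t, ..., F_{t-p_f+1}] and align with y_{t+h}
    max_lag = max(p_y, p_f)
    rows, targets = [], []
    for t in range(max_lag - 1, T - h):
        z = np.concatenate([y[t - p_y + 1:t + 1][::-1],
                            F[t - p_f + 1:t + 1][::-1].ravel()])
        rows.append(z)
        targets.append(y[t + h])
    Z, y_ahead = np.asarray(rows), np.asarray(targets)

    ols = LinearRegression().fit(Z, y_ahead)            # beta estimated by OLS
    z_last = np.concatenate([y[T - p_y:T][::-1], F[T - p_f:T][::-1].ravel()])
    return ols.predict(z_last.reshape(1, -1))[0]        # forecast of y_{T+h}
```

The same `Z`, `y_ahead` and `z_last` objects are reused in the sketches of the other methods below.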
Elastic Net and Adaptive Lasso. The Elastic Net algorithm forecasts the target variable $y_{t+h}$ using a linear combination of the $K$ features contained in $Z_t$, whose weights $\beta := (\beta_k)_{k=1}^K$ solve the following penalized regression problem
$$\hat{\beta} := \arg\min_{\beta} \sum_{t=1}^{T} (y_{t+h} - Z_t \beta)^2 + \lambda \sum_{k=1}^{K} \left( \alpha \hat{w}_k |\beta_k| + (1-\alpha) \beta_k^2 \right) \qquad (14)$$
where $(\alpha, \lambda)$ are hyperparameters and $\hat{w}$ is a weight vector. The Lasso estimator is obtained as the special case $\alpha = 1$, while the Ridge estimator imposes $\alpha = 0$. We select $\lambda$ and $\alpha$ with a grid search where $\alpha \in \{.01, .02, .03, \dots, 1\}$ and $\lambda \in [0, \lambda_{max}]$, with $\lambda_{max}$ the penalty term beyond which coefficients are guaranteed to be all zero (assuming $\alpha \neq 0$). For the Adaptive Lasso, the weights are $\hat{w}_k = 1/|\hat{\beta}_k|^{\gamma}$, where $\hat{\beta}$ is a $\sqrt{T}$-consistent estimator of the above regression, such as the OLS estimator, or the Ridge estimator as suggested by Zou (2006) when collinearity is an issue, and $\gamma > 0$. We use the Ridge estimator for $\hat{\beta}$, with $\lambda_{ridge}$ selected by a genetic algorithm of 25 generations of 25 individuals. The Lasso step is then done on the weighted features implied by the penalty, with $\lambda_{lasso}$ selected in the same way as for the Elastic Net. The weight $\gamma$ is set to one. Since the Elastic Net and the Adaptive Lasso both perform variable selection, we do not cross-validate $P_y$, $P_f$ and $k$; we impose $P_y = P_f = 12$ and fix $k$.
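Below is a hedged sketch of how (14) and the two-step Adaptive Lasso could be implemented on the `Z`, `y_ahead` and `z_last` objects built above. Cross-validation and the default penalty values stand in for the paper's grid search and genetic-algorithm tuning, and scikit-learn's Elastic Net parametrization differs slightly from (14).

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, Lasso, Ridge

def elastic_net_forecast(Z, y_ahead, z_last):
    """Elastic Net on the feature matrix Z; the mixing parameter grid mimics the
    text, but lambda is tuned here by cross-validation rather than grid search."""
    enet = ElasticNetCV(l1_ratio=list(np.linspace(0.01, 1.0, 100)), cv=5)
    enet.fit(Z, y_ahead)
    return enet.predict(z_last.reshape(1, -1))[0]

def adaptive_lasso_forecast(Z, y_ahead, z_last, gamma=1.0,
                            lam_ridge=1.0, lam_lasso=0.1):
    """Adaptive Lasso: a first-stage Ridge gives weights w_k = 1/|beta_k|^gamma,
    then a Lasso is run on the re-weighted features (Zou, 2006).
    lam_ridge and lam_lasso are illustrative defaults, not tuned values."""
    beta_ridge = Ridge(alpha=lam_ridge).fit(Z, y_ahead).coef_
    w = 1.0 / (np.abs(beta_ridge) ** gamma + 1e-8)   # penalty weights
    Z_w = Z / w                                      # weighting the penalty is equivalent
    lasso = Lasso(alpha=lam_lasso).fit(Z_w, y_ahead) # to rescaling the features
    beta = lasso.coef_ / w                           # map back to the original scale
    return float(z_last @ beta + lasso.intercept_)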
Random Forests. This algorithm provides a means of approximating nonlinear functions by combining regression trees. Each regression tree partitions the feature space defined by $Z_t$ into distinct regions and, in its simplest form, uses the region-specific mean of the target variable $y_{t+h}$ as the forecast, i.e., for $M$ leaf nodes,
$$\hat{y}_{t+h} = \sum_{m=1}^{M} c_m I(Z_t \in R_m) \qquad (15)$$
where $R_1, \dots, R_M$ is a partition of the feature space. To circumvent some of the limitations of regression trees, Breiman (2001) introduced Random Forests, which consist in growing many trees on subsamples (or nonparametric bootstrap samples) of observations. Only a random subset of features is eligible as the splitting variable at each split, which further decorrelates the trees, and the final forecast is obtained by averaging over the forecasts of all trees. In this paper we use 200 trees, which is normally enough to stabilize the predictions. The minimum number of observations in each terminal node is set to 5, while the features considered at each split are a random subset of the columns of $Z_t$. In addition, we impose $P_y = P_f = 12$ and fix $k$.
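A minimal sketch of this Random Forest setup follows, with the same `Z`, `y_ahead` and `z_last` conventions as above; the one-third share of features per split is an illustrative choice and is not taken from the paper.

```python
from sklearn.ensemble import RandomForestRegressor

def random_forest_forecast(Z, y_ahead, z_last):
    """Random Forest with settings mirroring the text: 200 trees and a minimum
    of 5 observations per terminal node; max_features=1/3 is illustrative."""
    rf = RandomForestRegressor(
        n_estimators=200,      # 200 trees, as in the text
        min_samples_leaf=5,    # minimum observations per terminal node
        max_features=1 / 3,    # share of features tried at each split (assumption)
        random_state=0,
    ).fit(Z, y_ahead)
    return rf.predict(z_last.reshape(1, -1))[0]
```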
Boosted Trees. This algorithm provides an alternative means of approximating nonlinear functions by additively combining regression trees in a sequential fashion. Let $\eta \in [0,1]$ be the learning rate, and let $\hat{y}^{(n)}_{t+h}$ and $e^{(n)}_{t+h} := y_{t+h} - \eta \hat{y}^{(n)}_{t+h}$ be the step-$n$ predicted value and pseudo-residuals, respectively. Then, for the square loss, the step $n+1$ prediction is
$$\hat{y}^{(n+1)}_{t+h} = \hat{y}^{(n)}_{t+h} + \rho_{n+1} f(Z_t, c_{n+1}) \qquad (16)$$
where $(c_{n+1}, \rho_{n+1}) := \arg\min_{\rho, c} \sum_{t=1}^{T} \left( e^{(n)}_{t+h} - \rho f(Z_t, c) \right)^2$ and $c_{n+1} := (c_{n+1,m})_{m=1}^M$ are the parameters of a regression tree. In other words, the algorithm recursively fits trees on pseudo-residuals. We consider a vanilla Boosted Trees implementation where the maximum depth of each tree is set to 10 and all features are considered at each split. We select the number of steps and $\eta \in [0,1]$ with Bayesian optimization. We impose $P_y = P_f = 12$ and fix $k$.
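The recursion in (16) can be sketched directly by fitting depth-limited trees on current residuals. The learning rate, number of steps and tree depth below are illustrative defaults, whereas in the paper the number of steps and $\eta$ are tuned by Bayesian optimization.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_trees_forecast(Z, y_ahead, z_last, n_steps=100, eta=0.1, max_depth=10):
    """Vanilla boosted trees for the square loss: each step fits a tree on the
    current residuals and adds eta times its fit. Defaults are illustrative."""
    pred = np.full_like(y_ahead, y_ahead.mean(), dtype=float)  # initialize at the mean
    pred_last = y_ahead.mean()
    for _ in range(n_steps):
        resid = y_ahead - pred                                  # pseudo-residuals e^(n)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(Z, resid)
        pred += eta * tree.predict(Z)                           # update in-sample fit
        pred_last += eta * tree.predict(z_last.reshape(1, -1))[0]
    return pred_last
```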
Component-wise L2 Boosting. Linear boosting algorithms are convenient methods for fitting models when the number of potential predictors is large. Many linear models are estimated and combined iteratively, using a single regressor at a time, chosen so that it reduces the loss the most. We specifically follow Bai and Ng (2009) and consider all features in $Z_t$ as separate predictors. We implement their method as follows (a code sketch is given after the list):

1. Let $\hat{\Phi}_{t+h,0} = \bar{y}_{t+h}$ for each $t$.
2. For $m = 1, \dots, M$:
   (a) for $t = 1, \dots, T$, let $u_{t+h} = y_{t+h} - \hat{\Phi}_{t+h,m-1}$;
   (b) select $n$ features at random from $Z_t$;
   (c) for each feature $i$ in the selected set, regress $u_{t+h}$ on $Z_{i,t}$ and compute $SSR_i$;
   (d) select $i^*$ so that $SSR_{i^*}$ is minimized;
   (e) set $\hat{\phi}_{t+h,m} = Z_{i^*,t} \hat{\beta}_{i^*}$.
3. Update $\hat{\Phi}_{t+h,m} = \hat{\Phi}_{t+h,m-1} + \eta \hat{\phi}_{t+h,m}$ for each $t$.

We impose $P_y = P_f = 12$ and fix $k$; $n$ is set to the minimum of a fixed value and the number of columns of $Z_t$. $M \in \{1, 2, \dots, 500\}$ and $\eta \in [0,1]$ are selected with a genetic algorithm of 25 generations of 25 individuals.
Detailed Relative RMSE Results

Table 4 and the following tables report relative RMSEs for each forecast horizon h. Columns: INDPRO, EMP, UNRATE, INCOME, CONS, RETAIL, HOUST, M2, CPI and PPI. Each table reports the FM benchmark RMSE in levels and, for the AR benchmark and the 15 information sets (combinations of F, X, MARX, MAF and Level), RMSE ratios relative to the FM model under each forecasting method considered (including Elastic Net, Linear Boosting, Random Forest and Boosted Trees). See the note under Table 3 for explanation.