[PDF] How is Machine Learning Useful for Macroeconomic Forecasting?

Abstract

We move beyond "Is Machine Learning Useful for Macroeconomic Forecasting?" by adding the "how". The current forecasting literature has focused on matching specific variables and horizons with a particularly successful algorithm. In contrast, we study the usefulness of the underlying features driving ML gains over standard macroeconometric methods. We distinguish four so-called features (nonlinearities, regularization, cross-validation and alternative loss function) and study their behavior in both the data-rich and data-poor environments. To do so, we design experiments that allow to identify the "treatment" effects of interest. We conclude that (i) nonlinearity is the true game changer for macroeconomic prediction, (ii) the standard factor model remains the best regularization, (iii) K-fold cross-validation is the best practice and (iv) the L 2 is preferred to the ϵ ¯ -insensitive in-sample loss. The forecasting gains of nonlinear techniques are associated with high macroeconomic uncertainty, financial stress and housing bubble bursts. This suggests that Machine Learning is useful for macroeconomic forecasting by mostly capturing important nonlinearities that arise in the context of uncertainty and financial frictions.

Full PDF

HHow is Machine Learning Useful for MacroeconomicForecasting? ∗ Philippe Goulet Coulombe † Maxime Leroux Dalibor Stevanovic ‡ Stéphane Surprenant University of Pennsylvania Université du Québec à MontréalFirst version: October 2019This version: August 31, 2020

Abstract

We move beyond

Is Machine Learning Useful for Macroeconomic Forecasting? by addingthe how . The current forecasting literature has focused on matching speciﬁc variables andhorizons with a particularly successful algorithm. To the contrary, we study the usefulnessof the underlying features driving ML gains over standard macroeconometric methods.We distinguish four so-called features (nonlinearities, regularization, cross-validation andalternative loss function) and study their behavior in both the data-rich and data-poorenvironments. To do so, we design experiments that allow to identify the “treatment”effects of interest. We conclude that (i) nonlinearity is the true game changer for macroe-conomic prediction, (ii) the standard factor model remains the best regularization, (iii)

K-fold cross-validation is the best practice and (iv) the L is preferred to the ¯ (cid:101) -insensitivein-sample loss. The forecasting gains of nonlinear techniques are associated with highmacroeconomic uncertainty, ﬁnancial stress and housing bubble bursts. This suggests thatMachine Learning is useful for macroeconomic forecasting by mostly capturing importantnonlinearities that arise in the context of uncertainty and ﬁnancial frictions. JEL Classiﬁcation: C53, C55, E37Keywords: Machine Learning, Big Data, Forecasting. ∗ The third author acknowledges ﬁnancial support from the Fonds de recherche sur la société et la culture(Québec) and the Social Sciences and Humanities Research Council. † Corresponding Author: [email protected]. Department of Economics, UPenn. ‡ Corresponding Author: [email protected]. Département des sciences économiques, UQAM. a r X i v : . [ ec on . E M ] A ug Introduction

The intersection of Machine Learning (ML) with econometrics has become an importantresearch landscape in economics. ML has gained prominence due to the availability of largedata sets, especially in microeconomic applications (Belloni et al., 2017; Athey, 2019). Despitethe growing interest in ML, understanding the properties of ML procedures when they areapplied to predict macroeconomic outcomes remains a difﬁcult challenge. Nevertheless,that very understanding is an interesting econometric research endeavor per se . It is moreappealing to applied econometricians to upgrade a standard framework with a subset ofspeciﬁc insights rather than to drop everything altogether for an off-the-shelf ML model.Despite appearances, ML has a long history in macroeconometrics (see Lee et al. (1993);Kuan and White (1994); Swanson and White (1997); Stock and Watson (1999); Trapletti et al.(2000); Medeiros et al. (2006)). However, only recently did the ﬁeld of macroeconomic fore-casting experience an overwhelming (and succesful) surge in the number of studies applyingML methods, while works such as Joseph (2019) and Zhao and Hastie (2019) contribute totheir interpretability. However, the vast catalogue of tools, often evaluated with few modelsand forecasting targets, creates a large conceptual space, much of which remains to be ex-plored. To map that large space without getting lost in it, we move beyond the coronationof a single winning model and its subsequent interpretation. Rather, we conduct a meta-analysis of many ML products by projecting them in their "characteristic" space. Then, weprovide a direct assessment of which characteristics matter and which do not.More precisely, we aim to answer the following question: What are the key features ofML modeling that improve the macroeconomic prediction? In particular, no clear attempt The linear techniques have been extensively examined since Stock and Watson (2002b,a). Kotchoni et al.(2019) compare more than 30 forecasting models, including factor-augmented and regularized regressions. Gi-annone et al. (2018) study the relevance of sparse modeling in various economic prediction problems. Moshiri and Cameron (2000); Nakamura (2005); Marcellino (2008) use neural networks to predict inﬂa-tion and Cook and Smalter Hall (2017) explore deep learning. Sermpinis et al. (2014) apply support vectorregressions, while Diebold and Shin (2019) propose a LASSO-based forecast combination technique. Ng (2014),Döpke et al. (2017) and Medeiros et al. (2019) improve forecast accuracy with random forests and boosting,while Yousuf and Ng (2019) use boosting for high-dimensional predictive regressions with time varying param-eters. Others compare machine learning methods in horse races (Ahmed et al., 2010; Stock and Watson, 2012b;Li and Chen, 2014; Kim and Swanson, 2018; Smeekes and Wijler, 2018; Chen et al., 2019; Milunovich, 2020). experiment to identify important characteristics ofmachine learning and big data techniques. The exercise consists of an extensive pseudo-out-of-sample forecasting horse race between many models that differ with respect to the fourmain features: nonlinearity, regularization, hyperparameter selection and loss function. Tocontrol for the big data aspect, we consider data-poor and data-rich models, and administerthose patients one particular ML treatment or combinations of them. Monthly forecast errorsare constructed for ﬁve important macroeconomic variables, ﬁve forecasting horizons and foralmost 40 years. Then, we provide a straightforward framework to identify which of themare actual game changers for macroeconomic forecasting.The main results can be summarized as follows. First, the ML nonparametric nonlineari-ties constitute the most salient feature as they improve substantially the forecasting accuracyfor all macroeconomic variables in our exercise, especially when predicting at long horizons.Second, in the big data framework, alternative regularization methods (Lasso, Ridge, Elastic-net) do not improve over the factor model, suggesting that the factor representation of themacroeconomy is quite accurate as a means of dimensionality reduction.Third, the hyperparameter selection by K-fold cross-validation (CV) and the standard BIC(when possible) do better on average than any other criterion. This suggests that ignoringinformation criteria when opting for more complicated ML models is not harmful. This isalso quite convenient: K-fold is the built-in CV option in most standard ML packages. Fourth,replacing the standard in-sample quadratic loss function by the ¯ (cid:101) -insensitive loss function inSupport Vector Regressions (SVR) is not useful, except in very rare cases. The latter ﬁnding isa direct by-product of our strategy to disentangle treatment effects. In accordance with otherempirical results (Sermpinis et al., 2014; Colombo and Pelagatti, 2020), in absolute terms,SVRs do perform well – even if they use a loss at odds with the one used for evaluation.However, that performance is a mixture of the attributes of both nonlinearities (via the kerneltrick) and an alternative loss function. Our results reveal that this change in the loss functionhas detrimental effects on performance in terms of both mean squared errors and absolute3rrors. Fifth, the marginal effect of big data is positive and signiﬁcant, and improves as theforecast horizon grows. The robustness analysis shows that these results remain valid when:(i) the absolute loss is considered; (ii) quarterly targets are predicted; (iii) the exercise is re-conducted with a large Canadian data set.The evolution of economic uncertainty and ﬁnancial conditions are important driversof the NL treatment effect. ML nonlinearities are particularly useful: (i) when the level ofmacroeconomic uncertainty is high; (ii) when ﬁnancial conditions are tight and (iii) duringhousing bubble bursts. The effects are bigger in the case of data-rich models, which sug-gests that combining nonlinearity with factors made of many predictors is an accurate wayto capture complex macroeconomic relationships.These results give a clear recommendation for practitioners. For most cases, start by re-ducing the dimensionality with principal components and then augment the standard diffu-sion indices model by a ML nonlinear function approximator of your choice. That recommen-dation is conditional on being able to keep overﬁtting in check. To that end, if cross-validationmust be applied to hyperparameter selection, the best practice is the standard K-fold.These novel empirical results also complement a growing theoretical literature on MLwith dependent observations. As Alquier et al. (2013) points out, much of the work in sta-tistical learning has focus on the cross-section setting where the assumption of independentdraws is more plausible. Nevertheless, some theoretical guarantees exist in the time seriescontext. Mohri and Rostamizadeh (2010) provide generalization bounds for Support VectorMachines and Regressions, and Kernel Ridge Regression under the assumption of a station-ary joint distribution of predictors and target variable. Kuznetsov and Mohri (2015) general-ize some of those results to non-stationary distributions and non-mixing processes. However,as the macroeconomic time series framework is characterized by short samples and structuralinstability, our exercise contributes to the general understanding of machine learning prop-erties in the context of time series modeling and forecasting.In the remainder of this paper, we ﬁrst present the general prediction problem with ma-chine learning and big data. Section 3 describes the four important features of machine learn-4ng methods. Section 4 presents the empirical setup, section 5 discusses the main results,followed by section 6 that aims to open the black box. Section 7 concludes. Appendices A, B,C and D contain respectively: tables with overall performance; robustness of treatment anal-ysis; additional results and robustness of nonlinearity analysis. The supplementary materialcontains the following appendices: results for absolute loss, results with quarterly US data,results with monthly Canadian data, description of CV techniques and technical details onforecasting models.

Machine learning methods are meant to improve our predictive ability especially whenthe “true” model is unknown and complex. To illustrate this point, let y t + h be the variable tobe predicted h periods ahead (target) and Z t the N Z -dimensional vector of predictors madeout of H t , the set of all the inputs available at time t . Let g ∗ ( Z t ) be the true model and g ( Z t ) a functional (parametric or not) form selected by the practitioner. In addition, denote ˆ g ( Z t ) and ˆ y t + h the ﬁtted model and its forecast. The forecast error can be decomposed as y t + h − ˆ y t + h = g ∗ ( Z t ) − g ( Z t ) (cid:124) (cid:123)(cid:122) (cid:125) approximation error + g ( Z t ) − ˆ g ( Z t ) (cid:124) (cid:123)(cid:122) (cid:125) estimation error + e t + h . (1)The intrinsic error e t + h is not shrinkable, while the estimation error can be reduced by addingmore data. The approximation error is controlled by the functional estimator choice. Whileit can be potentially minimized by using ﬂexible functions, it also rise the risk of overﬁttingand a judicious regularization is needed to control this risk. This problem can be embeddedin the general prediction setup from Hastie et al. (2009)min g ∈G { ˆ L ( y t + h , g ( Z t )) + pen ( g ; τ ) } , t =

1, . . . , T . (2)This setup has four main features:1. G is the space of possible functions g that combine the data to form the prediction. Inparticular, the interest is how much nonlinearities can we allow for in order to reducethe approximation error in (1)?2. pen () is the regularization penalty limiting the ﬂexibility of the function g and hence5ontrolling the overﬁtting risk. This is quite general and can accommodate Bridge-typepenalties and dimension reduction techniques.3. τ is the set of hyperparameters including those in the penalty and the approximator g .The usual problem is to choose the best data-driven method to optimize τ .4. ˆ L is the loss function that deﬁnes the optimal forecast. Some ML models feature anin-sample loss function different from the standard l norm.Most of (supervised) machine learning consists of a combination of those ingredients andpopular methods like linear (penalized) regressions can be obtained as special cases of (2). We consider the direct predictive modeling in which the target is projected on the informa-tion set, and the forecast is made directly using the most recent observables. This is opposedto iterative approach where the model recursion is used to simulate the future path of thevariable. Also, the direct approach is the standard practice for in ML applications.We now deﬁne the forecast objective given the variable of interest Y t . If Y t is stationary,we forecast its level h periods ahead: y ( h ) t + h = y t + h , (3)where y t ≡ ln Y t if Y t is strictly positive. If Y t is I(1), then we forecast the average growth rateover the period [ t + t + h ] (Stock and Watson, 2002b). We shall therefore deﬁne y ( h ) t + h as: y ( h ) t + h = ( h ) ln ( Y t + h / Y t ) . (4)In order to avoid a cumbersome notation, we use y t + h instead of y ( h ) t + h in what follows. Inaddition, all the predictors in Z t are assumed to be covariance stationary. Data-Poor versus

Data-Rich

Environments

Large time series panels are now widely constructed and used for macroeconomic analy-sis. The most popular is FRED-MD monthly panel of US variables constructed by McCracken Marcellino et al. (2006) conclude that the direct approach provides slightly better results but does notdominate uniformly across time and series. See Chevillon (2007) for a survey on multi-step forecasting. Unfortunately, the performance of standard econometric models tends to de-teriorate as the dimensionality of data increases. Stock and Watson (2002b) ﬁrst proposed tosolve the problem by replacing the high-dimensional predictor set by common factors. On other hand, even though the machine learning models do not require big data, theyare useful to perform variable selection and digest large information sets to improve the pre-diction. Therefore, in addition to treatment effects in terms of characteristics of forecastingmodels, we will also interact those with the width of the sample. The data-poor, deﬁned as H − t , will only contain a ﬁnite number of lagged values of the target, while the data-rich panel,deﬁned as H + t will also include a large number of exogenous predictors. Formally, H − t ≡ { y t − j } p y j = and H + t ≡ (cid:104) { y t − j } p y j = , { X t − j } p f j = (cid:105) . (5)The analysis we propose can thus be summarized in the following way. We will considertwo standard models for forecasting.1. The H − t model is the autoregressive direct (AR) model, which is speciﬁed as: y t + h = c + ρ ( L ) y t + e t + h , t =

1, . . . , T , (6)where h ≥ p y ,the order of the lag polynomial ρ ( L ) .2. The H + t workhorse model is the autoregression augmented with diffusion indices (ARDI)from Stock and Watson (2012b): y t + h = c + ρ ( L ) y t + β ( L ) F t + e t + h , t =

1, . . . , T (7) X t = Λ F t + u t (8)where F t are K consecutive static factors, and ρ ( L ) and β ( L ) are lag polynomials of orders p y and p f respectively. The feasible procedure requires an estimate of F t that is usuallyobtained by principal component analysis (PCA).Then, we will take these models as two different types of “patients” and will administer them Fortin-Gagnon et al. (2020) have recently proposed similar data for Canada. Another way to approach the dimensionality problem is to use Bayesian methods. Indeed, some of ourRidge regressions will look like a direct version of a Bayesian VAR with a Litterman (1979) prior. Giannoneet al. (2015) have shown that an hierarchical prior can lead the BVAR to perform as well as a factor model. , PCAremains popular (Uddin et al., 2018). As we insist on treating models as symmetrically aspossible, we will use the same feature transformations throughout such that our nonlinearmodels, such as Kernel Ridge Regression, will introduce nonlinear transformations of laggedtarget values as well as of lagged values of the principal components. Hence, our nonlinearmodels postulate that a sparse set of latent variables impact the target in a ﬂexible way. The objective of this paper is to disentangle important characteristics of the ML predictionalgorithms when forecasting macroeconomic variables. To do so, we design an experiment that consists of a pseudo-out-of-sample (POOS) forecasting horse race between many mod-els that differ with respect to the four main features above, i.e., nonlinearity, regularization,hyperparameter selection and loss function. To create variation around those treatments , wewill generate forecast errors from different models associated to each feature.To test this paper’s hypothesis, suppose the following model for forecasting errors e t , h , v , m = α m + ψ t , v , h + v t , h , v , m (9a) α m = α (cid:48) F + η m (9b)where e t , h , v , m are squared prediction errors of model m for variable v and horizon h at time t . ψ t , v , h is a ﬁxed effect term that demeans the dependent variable by “forecasting target”, thatis a combination of t , v and h . α F is a vector of α G , α pen () , α τ and α ˆ L terms associated to eachfeature. We re-arrange equation (9) to obtain e t , h , v , m = α (cid:48) F + ψ t , v , h + u t , h , v , m . (10) The autoencoder method of Gu et al. (2020a) can be seen as a form of feature engineering, just as theindependent components used in conjunction with SVR in Lu et al. (2009). The interested reader may also seeHastie et al. (2009) for a detailed discussion of the use of PCA and related method in machine learning. We omit considering a VAR as an additional option. VAR iterative approach to produce h -step-aheadpredictions is not comparable with the direct forecasting used with ML models. is now α f = ∀ f ∈ F = [ G , pen () , τ , ˆ L ] . In other words, the null is that there isno predictive accuracy gain with respect to a base model that does not have this particularfeature. By interacting α F with other ﬁxed effects or variables, we can test many hypothesesabout the heterogeneity of the “ML treatment effect.” To get interpretable coefﬁcients, wedeﬁne R t , h , v , m ≡ − e t , h , v , m T ∑ Tt = ( y v , t + h − ¯ y v , h ) and run R t , h , v , m = ˙ α (cid:48) F + ˙ ψ t , v , h + ˙ u t , h , v , m . (11)While (10) has the beneﬁt of connecting directly with the speciﬁcation of a Diebold andMariano (1995) test, the transformation of the regressand in (11) has two main advantagesjustifying its use. First and foremost, it provides standardized coefﬁcients ˙ α F interpretableas marginal improvements in OOS- R ’s. In contrast, α F are a unit- and series-dependantmarginal increases in MSE. Second, the R approach has the advantage of standardizing ex-ante the regressand and removing an obvious source of ( v , h ) -driven heteroskedasticity.While the generality of (10) and (11) is appealing, when investigating the heterogeneityof speciﬁc partial effects, it will be much more convenient to run speciﬁc regressions for themultiple hypothesis we wish to test. That is, to evaluate a feature f , we run ∀ m ∈ M f : R t , h , v , m = ˙ α f + ˙ φ t , v , h + ˙ u t , h , v , m (12)where M f is deﬁned as the set of models that differs only by the feature under study f . Ananalogous evaluation setup has been considered in Carriero et al. (2019). In this section we detail the forecasting approaches that create variations for each charac-teristic of machine learning prediction problem deﬁned in (2).

Although linearity is popular in practice, if the data generating process (DGP) is complex,using linear g introduces approximation error as shown in (1). As a solution, ML proposes anapparatus of nonlinear functions able to estimate the true DGP, and thus reduces the approx- If we consider two models that differ in one feature and run this regression for a speciﬁc ( h , v ) pair, thet-test on coefﬁcients amounts to Diebold and Mariano (1995) – conditional on having the proper standard errors. A simple way to make predictive regressions (6) and (7) nonlinear is to adopt a general-ized linear model with multivariate functions of predictors (e.g. spline series expansions).However, this rapidly becomes overparameterized, so we opt for the Kernel trick (KT) toavoid computing all possible interactions and higher order terms. It is worth noting thatKernel Ridge Regression (KRR) has several implementation advantages. It has a closed-formsolution that rules out convergence problems associated with models trained with gradientdescent. It is also fast to implement since it implies inverting a T x T matrix at each step.To show how KT is implemented in our benchmark models, suppose a Ridge regressiondirect forecast with generic regressors Z t min β T ∑ t = ( y t + h − Z t β ) + λ K ∑ k = β k .The solution to that problem is ˆ β = ( Z (cid:48) Z + λ I k ) − Z (cid:48) y . By the representer theorem of Smolaand Schölkopf (2004), β can also be obtained by solving the dual of the convex optimizationproblem above. The dual solution for β is ˆ β = Z (cid:48) ( ZZ (cid:48) + λ I T ) − y . This equivalence allows torewrite the conditional expectation in the following way:ˆ E ( y t + h | Z t ) = Z t ˆ β = t ∑ i = ˆ α i (cid:104) Z i , Z t (cid:105) where ˆ α = ( ZZ (cid:48) + λ I T ) − y is the solution to the dual Ridge Regression problem.Suppose now we approximate a general nonlinear model g ( Z t ) with basis functions φ () y t + h = g ( Z t ) + ε t + h = φ ( Z t ) (cid:48) γ + ε t + h . A popular approach to model nonlinearity is deep learning. However, since we re-optimize our modelsrecursively in a POOS, selecting an accurate network architecture by cross-validation is practically infeasible. Inaddition to optimize numerous neural net hyperparameters (such as the number of hidden layers and neurons,activation function, etc.), our forecasting models also require careful input selection (number of lags and numberof factors in case of data-rich). An alternative is to ﬁx ex-ante a variety of networks as in Gu et al. (2020b), but thiswould potentially beneﬁt other models that are optimized over time. Still, since few papers have found similarpredictive ability of random forests and neural nets (Gu et al., 2020a; Joseph, 2019), we believe that consideringrandom forests and Kernel trick is enough to properly identify the ML nonlinear treatment. Nevertheless, wehave conducted a robustness analysis with feed-forward neural networks and boosted trees. The results arepresented in Appendix D. K () such thatˆ E ( y t + h | Z t ) = t ∑ i = ˆ α i (cid:104) φ ( Z i ) , φ ( Z t ) (cid:105) = t ∑ i = ˆ α i K ( Z i , Z t ) .This means we do not need to specify the numerous basis functions, a well-chosen kernelimplicitly replicates them. This paper will use the standard radial basis function (RBF) kernel K σ ( x , x (cid:48) ) = exp (cid:18) − (cid:107) x − x (cid:48) (cid:107) σ (cid:19) where σ is a tuning parameter to be chosen by cross-validation. This choice of kernel is moti-vated by its good performance in macroeconomic forecasting as reported in Sermpinis et al.(2014) and Exterkate et al. (2016). The advantage of the kernel trick is that, by using the corre-sponding Z t , we can easily make our data-rich or data-poor model nonlinear. For instance, inthe case of the factor model, we can apply it to the regression equation to implicitly estimate y t + h = c + g ( Z t ) + ε t + h , (13) Z t = (cid:104) { y t − j } p y j = , { F t − j } p f j = (cid:105) , (14) X t = Λ F t + u t . (15)In terms of implementation, this means extracting factors via PCA and then gettingˆ E ( y t + h | Z t ) = K σ ( Z t , Z )( K σ ( Z t , Z ) + λ I T ) − y t . (16)The ﬁnal set of tuning parameters for such a model is τ = { λ , σ , p y , p f , n f } . Another way to introduce nonlinearity in the estimation of the predictive equation (7) is touse regression trees instead of OLS. The idea is to split sequentially the space of Z t , as deﬁnedin (14) into several regions and model the response by the mean of y t + h in each region. Theprocess continues according to some stopping rule. The details of the recursive algorithm canbe found in Hastie et al. (2009). Then, the tree regression forecast has the following form:ˆ f ( Z ) = M ∑ m = c m I ( Z ∈ R m ) , (17)where M is the number of terminal nodes, c m are node means and R , ..., R M represents apartition of feature space. In the diffusion indices setup, the regression tree would estimate a11onlinear relationship linking factors and their lags to y t + h . Once the tree structure is known,it can be related to a linear regression with dummy variables and their interactions.While the idea of obtaining nonlinearities via decision trees is intuitive and appealing– especially for its interpretability potential, the resulting prediction is usually plagued byhigh variance. The recursive tree ﬁtting process is (i) unstable and (ii) prone to overﬁtting.The latter can be partially addressed by the use of pruning and related methodologies (Hastieet al., 2009). Notwithstanding, a much more successful (and hence popular) ﬁx was proposedin Breiman (2001): Random Forests. This consists in growing many trees on subsamples(or nonparametric bootstrap samples) of observations. Further randomization of underlyingtrees is obtained by considering a random subset of regressors for each potential split. Themain hyperparameter to be selected is the number of variables to be considered at each split.The forecasts of the estimated regression trees are then averaged together to make one single"ensemble" prediction of the targeted variable. In this section we will only consider models where dimension reduction is needed, whichare the models with H + t . The traditional shrinkage method used in macroeconomic forecast-ing is the ARDI model that consists of extracting principal components of X t and to use themas data in an ARDL model. Obviously, this is only one out of many ways to compress theinformation contained in X t to run a well-behaved regression of y t + h on it. In order to create identifying variations for pen () treatment, we need to generate multipledifferent shrinkage schemes. Some will also blend in selection, some will not. The alternativeshrinkage methods will all be special cases of the Elastic Net (EN) problem:min β T ∑ t = ( y t + h − Z t β ) + λ K ∑ k = (cid:16) α | β k | + ( − α ) β k (cid:17) (18)where Z t = B ( H t ) is some transformation of the original predictive set X t . α ∈ [

0, 1 ] and Only using a bootstrap sample of observations would be a procedure called Bagging – for Bootstrap Ag-gregation. Also selecting randomly regressors has the effect of decorrelating the trees and hence boosting thevariance reduction effect of averaging them. In this paper, we consider 500 trees, which is usually more than enough to get a stabilized prediction (thatwill not change with the addition of another tree). De Mol et al. (2008) compares Lasso, Ridge and ARDI and ﬁnds that forecasts are very much alike. > B operators, we can generateshrinkage schemes. Also, by setting α to either 1 or 0 we generate LASSO and Ridge Regres-sion respectively. All these possibilities are reasonable alternatives to the traditional factorhard-thresholding procedure that is ARDI.Each type of shrinkage in this section will be deﬁned by the tuple S = { α , B () } . To beginwith the most straightforward dimension, for a given B , we will evaluate the results for α ∈{

0, ˆ α CV , 1 } . For instance, if B is the identity mapping, we get in turns the LASSO, EN andRidge shrinkage. We now detail different pen () resulting when we vary B () for a ﬁxed α .1. (Fat Regression) : First, we consider the case B () = I () . That is, we use the entiretyof the untransformed high-dimensional data set. The results of Giannone et al. (2018)point in the direction that speciﬁcations with a higher α should do better, that is, sparsemodels do worse than models where every regressor is kept but shrunk to zero.2. (Big ARDI) Second, B () corresponds to ﬁrst rotating X t ∈ IR N so that we get N -dimensional uncorrelated F t . Note here that contrary to the ARDI approach, we donot select factors recursively, we keep them all. Hence, F t has exactly the same span as X t . Comparing LASSO and Ridge in this setup will allow to verify whether sparsityemerges in a rotated space.3. (Principal Component Regression) A third possibility is to rotate H + t rather than X t and still keep all the factors. H + t includes all the relevant preselected lags. If we were tojust drop the F t using some hard-thresholding rule, this would correspond to PrincipalComponent Regression (PCR). Note that B () = B () only when no lags are included.Hence, the tuple S has a total of 9 elements. Since we will be considering both POOS-CV andK-fold CV for each of these models, this leads to a total of 18 models. To see clearly through all of this, we describe where the benchmark ARDI model standsin this setup. Since it uses a hard thresholding rule that is based on the eigenvalues ordering,it cannot be a special case of the Elastic Net problem. While it uses B , we would need to set Adaptive versions (in the sense of Zou (2006)) of the 9 models were also considered but gave either similaror deteriorated results with respect to their plain counterparts. = F t a priori with a hard-thresholding rule. The closest approximation in thisEN setup would be to set α = λ to match the number of consecutivefactors selected by an information criteria directly in the predictive regression (7). The conventional wisdom in macroeconomic forecasting is to either use AIC or BIC andcompare results. The prime reason for the popularity of CV is that it can be applied to anymodel, including those for which the derivation of an information criterion is impossible. It is not obvious that CV should work better only because it is “out of sample” while AICand BIC are ”in sample”. All model selection methods are actually approximations to theOOS prediction error that relies on different assumptions that are sometime motivated bydifferent theoretical goals. Also, it is well known that asymptotically, these methods havesimilar behavior. Hence, it is impossible a priori to think of one model selection techniquebeing the most appropriate for macroeconomic forecasting.For samples of small to medium size encountered in macro, the question of which oneis optimal in the forecasting sense is inevitably an empirical one. For instance, Granger andJeon (2004) compared AIC and BIC in a generic forecasting exercise. In this paper, we willcompare AIC, BIC and two types of CV for our two baseline models. The two types of CV arerelatively standard. We will ﬁrst use POOS CV and then K-fold CV. The ﬁrst one will alwaysbehave correctly in the context of time series data, but may be quite inefﬁcient by only usingthe end of the training set. The latter is known to be valid only if residual autocorrelation isabsent from the models as shown in Bergmeir et al. (2018). If it were not to be the case, thenwe should expect K-fold to underperform. The speciﬁc details of the implementation of bothCVs is discussed in the section D of the supplementary material.The contributions of this section are twofold. First, it will shed light on which model Abadie and Kasy (2019) show that hyperparemeter tuning by CV performs uniformly well in high-dimensional context. Hansen and Timmermann (2015) show equivalence between test statistics for OOS forecasting performanceand in-sample Wald statistics. For instance, one can show that Leave-one-out CV (a special case of K-fold) isasymptotically equivalent to the Takeuchi Information criterion (TIC), Claeskens and Hjort (2008). AIC is aspecial case of TIC where we need to assume in addition that all models being considered are at least correctlyspeciﬁed. Thus, under the latter assumption, Leave-one-out CV is asymptotically equivalent to AIC. Hence, it is worth asking the question whether some gainsfrom ML are simply coming from selecting hyperparameters in a different fashion using amethod whose assumptions are more in line with the data at hand. To investigate that, anatural ﬁrst step is to look at our benchmark macro models, AR and ARDI, and see if usingCV to select hyperparameters gives different selected models and forecasting performances.

Until now, all of our estimators use a quadratic loss function. Of course, it is very natu-ral for them to do so: the quadratic loss is the measure used for out-of-sample evaluation.Thus, someone may legitimately wonder if the fate of the SVR is not sealed in advance asit uses an in-sample loss function which is inconsistent with the out-of-sample performancemetric. As we will discuss later after the explanation of the SVR, there are reasons to believethe alternative (and mismatched) loss function can help. As a matter of fact, SVR has beensuccessfully applied to forecasting ﬁnancial and macroeconomic time series. An importantquestion remains unanswered: are the good results due to kernel-based non-linearities or tothe use of an alternative loss-function?We provide a strategy to isolate the marginal effect of the SVR’s ¯ (cid:101) -insensitive loss functionwhich consists in, perhaps unsurprisingly by now, estimating different variants of the samemodel. We considered the Kernel Ridge Regression earlier. The latter only differs from theKernel-SVR by the use of different in-sample loss functions. This identiﬁes directly the effectof the loss function, for nonlinear models. Furthermore, we do the same exercise for linear Zou et al. (2007) show that the number of remaining parameters in the LASSO is an unbiased estimatorof the degrees of freedom and derive LASSO-BIC and LASSO-AIC criteria. Considering these as well wouldprovide additional evidence on the empirical debate of CV vs IC. See for example, Lu et al. (2009), Choudhury et al. (2014), Patel et al. (2015a), Patel et al. (2015b), Yeh et al.(2011) and Qu and Zhang (2016) for ﬁnancial forecasting. See Sermpinis et al. (2014) and Zhang et al. (2010)macroeconomic forecasting. H − t ; (2) thelinear SVR with H + t ; (3) the RBF Kernel SVR with H − t ; and (4) the RBF Kernel SVR with H + t .What follows is a bird’s-eye overview of the underlying mechanics of the SVR. As it wasthe case for the Kernel Ridge regression, the SVR estimator approximates the function g ∈ G with basis functions. We opted to use the (cid:101) -SVR variant which implicitly deﬁnes the size 2 ¯ (cid:101) of the insensitivity tube of the loss function. The (cid:101) -SVR is deﬁned by: min γ γ (cid:48) γ + C (cid:34) T ∑ t = ( ξ t + ξ ∗ t ) (cid:35) s . t .  y t + h − γ (cid:48) φ ( Z t ) − α ≤ ¯ (cid:101) + ξ t γ (cid:48) φ ( Z t ) + α − y t + h ≤ ¯ (cid:101) + ξ ∗ t ξ t , ξ ∗ t ≥ ξ t , ξ ∗ t are slack variables, φ () is the basis function of the feature space implicitly de-ﬁned by the kernel used and T is the size of the sample used for estimation. C and ¯ (cid:101) arehyperparameters. Additional hyperparameters vary depending on the choice of a kernel.In case of the RBF kernel, a scale parameter σ also has to be cross-validated. AssociatingLagrange multipliers λ j , λ ∗ j to the ﬁrst two types of constraints, Smola and Schölkopf (2004)show that we can derive the dual problem out of which we would ﬁnd the optimal weights γ = ∑ Tj = ( λ j − λ ∗ j ) φ ( Z j ) and the forecasted valuesˆ E ( y t + h | Z t ) = ˆ c + T ∑ j = ( λ j − λ ∗ j ) φ ( Z j ) φ ( Z j ) = ˆ c + T ∑ j = ( λ j − λ ∗ j ) K ( Z j , Z t ) . (19)Let us now turn to the resulting loss function of such a problem. For the (cid:101) -SVR, the penaltyis given by: P ¯ (cid:101) ( (cid:101) t + h | t ) : =  i f | e t + h | ≤ ¯ (cid:101) | e t + h | − ¯ (cid:101) otherwise .For other estimators, the penalty function is quadratic P ( e t + h ) : = e t + h . Hence, for ourother estimators, the rate of the penalty increases with the size of the forecasting error, whereasit is constant and only applies to excess errors in the case of the (cid:101) -SVR. Note that this insen-16itivity has a nontrivial consequence for the forecasting values. The Karush-Kuhn-Tuckerconditions imply that only support vectors, i.e. points lying outside the insensitivity tube,will have nonzero Lagrange multipliers and contribute to the weight vector.As discussed brieﬂy earlier, given that SVR forecasts will eventually be evaluated accord-ing to a quadratic loss, it is reasonable to ask why this alternative loss function isn’t triviallysuboptimal. Smola et al. (1998) show that the optimal size of ¯ (cid:101) is a linear function of theunderlying noise, with the exact relationship depending on the nature of the data generatingprocess. This idea is not at odds with Gu et al. (2020a) using the Huber Loss for asset pric-ing with ML (where outliers seldomly happen in-sample) or Colombo and Pelagatti (2020)successfully using SVR to forecast (notoriously noisy) exchange rates. Thus, while SVR canwork well in macroeconomic forecasting, it is unclear which feature between the nonlinearityand ¯ (cid:101) -insensitive loss has the primary inﬂuence on its performance.To sum up, the table 1 shows a list of all forecasting models and highlights their relation-ship with each of four features discussed above. The computational details for every modelin this list are available in section E in the supplementary material. This section presents the data and the design of the pseudo-of-sample experiment used togenerate the treatment effects above.

We use historical data to evaluate and compare the performance of all the forecastingmodels described previously. The dataset is FRED-MD, available at the Federal Reserve ofSt-Louis’s web site. It contains 134 monthly US macroeconomic and ﬁnancial indicators ob-served from 1960M01 to 2017M12. Since many of them are usually very persistent or notstationary, we follow McCracken and Ng (2016) in the choice of transformations in order toachieve stationarity. Even though the universe of time series available at FRED is huge, westick to FRED-MD for several reasons. First, we want to have the test set as long as possible Alternative data transformations in the context of ML modeling are used in Goulet Coulombe et al. (2020).

Models Feature 1: selecting Feature 2: selecting Feature 3: optimizing Feature 4: selectingthe function g the regularization hyperparameters τ the loss functionData-poor modelsAR,BIC Linear BIC QuadraticAR,AIC Linear AIC QuadraticAR,POOS-CV Linear POOS CV QuadraticAR,K-fold Linear K-fold CV QuadraticRRAR,POOS-CV Linear Ridge POOS CV QuadraticRRAR,K-fold Lineal Ridge K-fold CV QuadraticRFAR,POOS-CV Nonlinear POOS CV QuadraticRFAR,K-fold Nonlinear K-fold CV QuadraticKRRAR,POOS-CV Nonlinear Ridge POOS CV QuadraticKRRAR,K-fold Nonlinear Ridge K-fold CV QuadraticSVR-AR,Lin,POOS-CV Linear POOS CV ¯ (cid:101) -insensitiveSVR-AR,Lin,K-fold Linear K-fold CV ¯ (cid:101) -insensitiveSVR-AR,RBF,POOS-CV Nonlinear POOS CV ¯ (cid:101) -insensitiveSVR-AR,RBF,K-fold Nonlinear K-fold CV ¯ (cid:101) -insensitiveData-rich modelsARDI,BIC Linear PCA BIC QuadraticARDI,AIC Linear PCA AIC QuadraticARDI,POOS-CV Linear PCA POOS CV QuadraticARDI,K-fold Linear PCA K-fold CV QuadraticRRARDI,POOS-CV Linear Ridge-PCA POOS CV QuadraticRRARDI,K-fold Linear Ridge-PCA K-fold CV QuadraticRFARDI,POOS-CV Nonlinear PCA POOS CV QuadraticRFARDI,K-fold Nonlinear PCA K-fold CV QuadraticKRRARDI,POOS-CV Nonlinear Ridge-PCR POOS CV QuadraticKRRARDI,K-fold Nonlinear Ridge-PCR K-fold CV Quadratic( B , α = ˆ α ),POOS-CV Linear EN POOS CV Quadratic( B , α = ˆ α ),K-fold Linear EN K-fold CV Quadratic( B , α = B , α = B , α = B , α = B , α = ˆ α ),POOS-CV Linear EN-PCA POOS CV Quadratic( B , α = ˆ α ),K-fold Linear EN-PCA K-fold CV Quadratic( B , α = B , α = B , α = B , α = B , α = ˆ α ),POOS-CV Linear EN-PCR POOS CV Quadratic( B , α = ˆ α ),K-fold Linear EN-PCR K-fold CV Quadratic( B , α = B , α = B , α = B , α = (cid:101) -insensitiveSVR-ARDI,Lin,K-fold Linear PCA K-fold CV ¯ (cid:101) -insensitiveSVR-ARDI,RBF,POOS-CV Nonlinear PCA POOS CV ¯ (cid:101) -insensitiveSVR-ARDI,RBF,K-fold Nonlinear PCA K-fold CV ¯ (cid:101) -insensitive Note: PCA stands for Principal Component Analysis, EN for Elastic Net regularizer, PCR for Principal Component Regression. since most of the variables do not start early enough.Second, most of the timely available se-ries are disaggregated components of the variables in FRED-MD. Hence, adding them altersthe estimation of common factors (Boivin and Ng, 2006), and induces too much collinearityfor Lasso performance (Fan and Lv, 2010). Third, it is the standard high-dimensional datasetthat has been extensively used in the macroeconomic literature.18 .2 Variables of Interest

We focus on predicting ﬁve representative macroeconomic indicators of the US economy:Industrial Production (INDPRO), Unemployment rate (UNRATE), Consumer Price Index(INF), difference between 10-year Treasury Constant Maturity rate and Federal funds rate(SPREAD) and housing starts (HOUST). INDPRO, CPI and HOUST are assumed I ( ) so weforecast the average growth rate as in equation (4). UNRATE is considered I ( ) and we targetthe average change as in (4) but without logs. SPREAD is I ( ) and the target is as in (3). The pseudo-out-of-sample period is 1980M01 - 2017M12. The forecasting horizons consid-ered are 1, 3, 9, 12 and 24 months. Hence, there are 456 evaluation periods for each horizon.All models are estimated recursively with an expanding window as means of erring on theside of including more data so as to potentially reduce the variance of more ﬂexible models. Hyperparameter optimization is done with in-sample criteria (AIC and BIC) and twotypes of CV (POOS and K-fold). The in-sample selection is standard, we ﬁx the upper boundsfor the set of HPs. For the POOS CV, the validation set consists of last 25% of the in-sample.In case of K-fold CV, we set k =

5. We re-optimize hyperparameters every two years. Thisisn’t uncommon for computationally demanding studies. It is also reasonable to assumethat optimal hyperparameters would not be terribly affected by expanding the training setwith observations that account for 2-3% of the new training set size. The information on up-per / lower bounds and grid search for HPs for every model is available in section E in the The US CPI is sometimes modeled as I ( ) due to the possible stochastic trend in inﬂation rate in 70’s and80’s, see (Stock and Watson, 2002b). Since in our test set the the inﬂation is mostly stationary, we treat the priceindex as I ( ) , as in Medeiros et al. (2019). We have compared the mean squared predictive errors of best modelsunder I ( ) and I ( ) alternatives, and found that errors are minimized when predicting the inﬂation rate directly. The alternative is obviously that of a rolling window, which could be more robust to issues of modelinstability. These are valid concerns and have motivated tests and methods for taking them into account (seefor example, Pesaran and Timmermann (2007); Pesaran et al. (2013); Inoue et al. (2017); Boot and Pick (2020)),an adequate evaluation lies beyond the scope of this paper. Moreover, as noted in Boot and Pick (2020), thenumber of relevant breaks may be much smaller than previously thought. Sermpinis et al. (2014), for example, split their out-of-sample into four year periods and update both hy-perparameters and model parameter estimates every 4 years. Likewise, Teräsvirta (2006) selected the numberof lagged values to be included in nonlinear autoregressive models once and for all at the start of the POOS.

Following a standard practice in the forecasting literature, we evaluate the quality of ourpoint forecasts using the root Mean Square Prediction Error (MSPE). Diebold and Mariano(1995) (DM) procedure is used to test the predictive accuracy of each model against the refer-ence (ARDI,BIC). We also implement the Model Conﬁdence Set (MCS), (Hansen et al., 2011),that selects the subset of best models at a given conﬁdence level. These metrics measure theoverall predictive performance and classify models according to DM and MCS tests. Regres-sion analysis from section 2.3 is used to estimate the treatment effect of each ML ingredient.

We present the results in several ways. First, for each variable, we summarize tablescontaining the relative root MSPEs (to AR,BIC model) with DM and MCS outputs, for thewhole pseudo-out-of-sample and NBER recession periods. Second, we evaluate the marginaleffect of important features of ML using regressions described in section 2.3.

Tables 4 - 8, in the appendix A, summarize the overall predictive performance in termsof root MSPE relative to the reference model AR,BIC. The analysis is done for the full out-of-sample as well as for NBER recessions (i.e., when the target belongs to a recession episode).This address two questions: is ML already useful for macroeconomic forecasting and when? In case of industrial production, table 4 shows that principal component regressions B and B with Ridge and Lasso penalty respectively are the best at short-run horizons of 1 and3 months. The kernel ridge ARDI with POOS CV is best for h =

9, while its autoregressivecounterpart with K-fold minimizes the MSPE at the one-year horizon. Random forest ARDI,the alternative nonlinear approximator, outperforms the reference model by 11% for h = The knowledge of the models that have performed best historically during recessions is of interest forpractitioners. If the probability of recession is high enough at a given period, our results can provide an ex-anteguidance on which model is likely to perform best in such circumstances. h =

24. Ameliorations with respect to AR,BIC are much larger duringeconomic downturns, and the MCS selects fewer models.Results for the unemployment rate, table 5, highlight the performance of nonlinear modelsespecially for longer horizons. Improvements with respect to the AR,BIC model are bigger forboth full OOS and recessions. MCSs are narrower than in case of INDPRO. A similar patternis observed during NBER recessions. Table 6 summarizes results for the Spread. Nonlinearmodels are generally the best, combined with data-rich predictors’ set.For inﬂation, table 7 shows that the kernel ridge autoregressive model with K-fold CVis the best for 3, 9 and 12 months ahead, while the nonlinear SVR-ARDI optimized with K-fold CV reduces the MSPE by more than 20% at two-year horizon. Random forest modelsare very resilient, as in Medeiros et al. (2019), but generally outperformed by KRR form ofnonlinearity. During recessions, the fat regression models ( B ) are the best at short horizons,while the ridge regression ARDI with K-fold dominates for h =

9, 12, 24. Housing starts, intable 8, are best predicted with nonlinear data-rich models for almost all horizons.Overall, using data-rich models and nonlinear g functions improve macroeconomic pre-diction. Their marginal contribution depends on the state of the economy. The results in the previous section does not easily allow to disentangle the marginal effectsof important ML features – as presented in section 3. Therefore, we turn to the regressionanalysis described in section 2.3. In what follows, [X, NL, SH, CV and LF] stand for data-rich,nonlinearity, alternative shrinkage, cross-validation and loss function features respectively.Figure 1 shows the distribution of ˙ α ( h , v ) F from equation (11) done by ( h , v ) subsets. Hence,here we allow for heterogeneous treatment effects according to 25 different targets. This ﬁg-ure highlights by itself the main ﬁndings of this paper. First , ML nonlinearities improvesubstantially the forecasting accuracy in almost all situations. The effects are positive and21igure 1:

This ﬁgure plots the distribution of ˙ α ( h , v ) F from equation (11) done by ( h , v ) subsets. That is, we arelooking at the average partial effect on the pseudo-OOS R from augmenting the model with ML features, keep-ing everything else ﬁxed. X is making the switch from data-poor to data-rich. Finally, variables are INDPRO,UNRATE, SPREAD, INF and HOUST. Within a speciﬁc color block, the horizon increases from h = h = X on the R of INF increasesdrastically with the forecasted horizon h . SEs are HAC. These are the 95% conﬁdence bands. signiﬁcant for all horizons in case of INDPRO and SPREAD, and for most of the cases whenpredicting UNRATE, INF and HOUST. The improvements of the nonlinearity treatment reachup to 23% in terms of pseudo- R . This is in contrast with previous literature that did not ﬁndsubstantial forecasting power from nonlinear methods, see for example Stock and Watson(1999). In fact, the ML nonlinearity is highly ﬂexible and well disciplined by a careful regu-larization, and thus can solve the general overﬁtting problem of standard nonlinear models(Teräsvirta, 2006). This is also in line with the ﬁnding in Gu et al. (2020b) that nonlinearities(from ML models) can help predicting ﬁnancial returns. Second , alternative regularization means of dimensionality reduction do not improve onaverage over the standard factor model, except few cases. Choosing sparse modeling candecrease the forecast accuracy by up to 20% of the pseudo- R which is not negligible. Inter-estingly, Gu et al. (2020b) also reach similar conclusions that dense outperforms sparse in thecontext of applying ML to returns. Third , the average effect of CV appears not signiﬁcant. However, as we will see in sec-22ion 5.2.3, the averaging in this case hides some interesting and relevant differences betweenK-fold and POOS CVs.

Fourth , on average, dropping the standard in-sample squared-lossfunction for what the SVR proposes is not useful, except in very rare cases.

Fifth and lastly,the marginal beneﬁts of data-rich models ( X ) seems roughly to increase with horizons forevery variable-horizon pair, except for few cases with spread and housing. Note that thisis almost exactly like the picture we described for NL. Indeed, visually, it seems like the re-sults for X are a compressed-range version of NL that was translated to the right. Seeing NLmodels as data augmentation via basis expansions, we conclude that for predicting macroe-conomic variables, we need to augment the AR( p ) model with more regressors either createdfrom the lags of the dependent variable itself or coming from additional data. The possibilityof joining these two forces to create a “data-ﬁlthy-rich” model is studied in section 5.2.1.It turns out these ﬁndings are somewhat robust as graphs included in the appendix sectionB show. ML treatment effects plots of very similar shapes are obtained for data-poor modelsonly (ﬁgure 11), data-rich models only (ﬁgure 12) and recessions / expansions periods (ﬁg-ures 13 and 14). It is important to notice that nonlinearity effect is not only present duringrecession periods, but it is even more important during expansions. The only exceptionis the data-rich feature that has negative and signiﬁcant effects for housing starts predictionwhen we condition on the last 20 years of the forecasting exercise (ﬁgure 15).Figure 2 aggregates by h and v in order to clarify whether variable or horizon heterogene-ity matters most. Two facts detailed earlier are now quite easy to see. For both X and NL, theaverage marginal effects roughly increase in h . In addition, it is now clear that all the vari-ables beneﬁt from both additional information and nonlinearities. Alternative shrinkage isleast harmful for inﬂation and housing, and at short horizons. Cross-validation has negativeand sometimes signiﬁcant impacts, while the SVR loss function is often damaging.Supplementary material contains additional results. Section A shows the results obtainedusing the absolute loss. The importance of each feature and the way it behaves according tothe variable/horizon pair is the same. Finally, sections B and C show results for two similar This suggests that our models behave relatively similarly over the business cycle and that our analysis doesnot suffer from undesirable forecast ranking due to extreme events as pointed out in Lerch et al. (2017).

This ﬁgure plots the distribution of ˙ α ( v ) F and ˙ α ( h ) F from equation (11) done by h and v subsets. Thatis, we are looking at the average partial effect on the pseudo-OOS R from augmenting the model with MLfeatures, keeping everything else ﬁxed. X is making the switch from data-poor to data-rich. However, in thisgraph, v − speciﬁc heterogeneity and h − speciﬁc heterogeneity have been integrated out in turns. SEs are HAC.These are the 95% conﬁdence bands. exercises. The ﬁrst consider quarterly US data where we forecast the average growth rate ofGDP, consumption, investment and disposable income, and the PCE inﬂation. The results areconsistent with the ﬁndings obtained in the main body of this paper. In the second, we usea large Canadian monthly dataset and forecast the same target variables for Canada. Resultsare qualitatively in line with those on US data, except that NL effect is smaller in size.In what follows we break down averages and run speciﬁc regressions as in (12) to studyhow homogeneous are the ˙ α F ’s reported above. Figure 3 suggests that nonlinearities can be very helpful at forecasting all the ﬁve variablesin the data-rich environment. The marginal effects of random forests and KRR are almostnever statistically different for data-rich models, except for inﬂation combined with data-rich,suggesting that the common NL feature is the driving force. However, this is not the casefor data-poor models where the kernel-type nonlinearity shows signiﬁcant improvements24igure 3:

This ﬁgure compares the two NL models averaged over all horizons. The unit of the x-axis areimprovements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. Figure 4:

This ﬁgure compares the two NL models averaged over all variables. The unit of the x-axis areimprovements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. h increases. Seeing NL models asdata augmentation via some basis expansions, we can join the two facts together to concludethat the need for a complex and “data-ﬁlthy-rich” model arises for predicting macroeconomicvariables at longer horizons. Similar conclusions are obtained with neural networks andboosted trees as shown in ﬁgures 20 and 21 in Appendix D.Figure 17 in the appendix C plots the cumulative and 3-year rolling window root MSPEfor linear and nonlinear data-poor and data-rich models, for h =

12, as well as Giacomini andRossi (2010) ﬂuctuation test for those alternatives. The cumulative root MSPE clearly showsthe positive impact on forecast accuracy of both nonlinearities and data-rich environment forall series except INF. The rolling window depicts the changing level of forecast accuracy. Forall series except the SPREAD, there is a common cyclical behavior with two relatively similarpeaks (1981 and 2008 recessions), as well as a drop in MSPE during the Great Moderationperiod. Fluctuation tests conﬁrm the important role of nonlinear and data-rich models.For CPI inﬂation at horizons of 3, 9 and 12 months, Random Forests perform distinctivelywell. In both its data-poor and data-rich incarnations, the algorithm is included in the supe-rior model set of Hansen et al. (2011) and signiﬁcantly outperforms the AR-BIC benchmarkaccording to the DM test. This result can help shed some light on long standing issues inthe inﬂation forecasting literature. A consensus emerged that nonlinear models in-samplegood performance does not materialize out-of-sample (Marcellino, 2008; Stock and Watson,2009). In contrast, we found – as in Medeiros et al. (2019), that Random Forests are a partic-ularly potent tool to forecast CPI inﬂation. One possible explanation is that previous studies Concurrently, simple benchmarks such as a random walk or moving averages emerged as surprisinglyhard to beat (Atkeson and Ohanian, 2001; Stock and Watson, 2009; Kotchoni et al., 2019).

This ﬁgure compares models of section 3.2 averaged over all variables and horizons. The unit of thex-axis are improvements in OOS R over the basis model. The base models are ARDIs speciﬁed with POOS-CVand KF-CV respectively. SEs are HAC. These are the 95% conﬁdence bands. suffer from overﬁtting (Marcellino, 2008) while Random Forests are arguably completely im-mune from it (Goulet Coulombe, 2020), all this while retaining relevant nonlinearities. In thatregard, it is noted that INF is the only target where KRR performance does not match thatof Random Forests in the data rich environment. In the data-poor case, roles are reversed.Unlike most other targets, it seems the type of NL being used matters for inﬂation. Nonethe-less, ML generally appears to be useful for inﬂation forecasting by providing better-behavednon-parametric nonlinearities than what was considered by the older literature. Figure 5 shows that the ARDI reduces dimensionality in a way that certainly works wellwith economic data: all competing schemes do at most as good on average. It is overall safeto say that on average, all shrinkage schemes give similar or lower performance, which isin line with conclusions from Stock and Watson (2012b) and Kim and Swanson (2018), butcontrary to Smeekes and Wijler (2018). No clear superiority for the Bayesian versions of someof these models was also documented in De Mol et al. (2008). This suggests that the factormodel view of the macroeconomy is quite accurate in the sense that when we use it as a27eans of dimensionality reduction, it extracts the most relevant information to forecast therelevant time series. This is good news. The ARDI is the simplest model to run and resultsfrom the preceding section tells us that adding nonlinearities to an ARDI can be quite helpful.Obviously, the deceiving behavior of alternative shrinkage methods does not mean thereare no interesting ( h , v ) cases where using a different dimensionality reduction has signiﬁcantbeneﬁts as discussed in section 5.1 and Smeekes and Wijler (2018). Furthermore, LASSO andRidge can still be useful to tackle speciﬁc time series problems (other than dimensionalityreduction), as shown with time-varying parameters in Coulombe (2019). Figure 6 shows how many regressors are kept by different selection methods in the caseof ARDI. As expected, BIC is in general the lower envelope of each of these graphs. Bothcross-validations favor larger models, especially when combined with Ridge regression. Weremark a common upward trend for all model selection methods in case of INDPRO andUNRATE. This is not the case for inﬂation where large models have been selected in 80’s andmost recently since 2005. In case of HOUST, there is a downward trend since 2000’s which isconsistent with the ﬁnding in Figure 15 that data-poor models do better in last 20 years. POOSCV selection is more volatile and selects bigger models for unemployment rate, spread andhousing. While K-fold also selects models of considerable size, it does so in a more slowlygrowing fashion. This is not surprising because K-fold samples from all available data tobuild the CV criterion: adding new data points only gradually change the average. POOSCV is a shorter window approach that offers ﬂexibility against structural hyperparameterschange at the cost of greater variance and vulnerability of rapid regime changes in the data.We know that different model selection methods lead to quite different models, but whatabout their predictions? First, let us note that changes in OOS- R are much smaller in mag-nitude for CV (as can be seen easily in ﬁgures 1 and 2) than for other studied ML treatmenteffects. Nevertheless, table 2 tells many interesting tales. The models included in the regres-sions are the standard linear ARs and ARDIs (that is, excluding the Ridge versions) that haveall been tuned using BIC, AIC, POOS CV and CV-KF. First, we see that overall, only POOS28

985 1990 1995 2000 2005 2010 2015204060 I ND P R O ARDI,BICARDI,AICARDI,POOS-CVARDI,K-foldRRARDI,POOS-CVRRARDI,K-fold

UNR A T E SP R EA D I N F H O U S T Figure 6:

This ﬁgure shows the number of regressors in linear ARDI models. Results averaged across horizons.

CV is distinctively worse, especially in data-rich environment, and that AIC and CV-KF arenot signiﬁcantly different from BIC on average. For data-poor models and during recessions,AIC and CV-KF are being signiﬁcantly better than BIC in downturns, while CV-KF seemsharmless. The state-dependent effects are not signiﬁcant in data-rich environment. Hence,29or that class of models, we can safely opt for either BIC or CV-KF. Assuming some degreeof external validity beyond that model class, we can be reassured that the quasi-necessity ofleaving ICs behind when opting for more complicated ML models is not harmful.Table 2: CV comparison(1) (2) (3) (4) (5)All Data-rich Data-poor Data-rich Data-poorCV-KF -0.0380 -0.314 0.237 -0.494 -0.181(0.800) (0.711) (0.411) (0.759) (0.438)CV-POOS -1.351 -1.440 ∗ -1.262 ∗∗ -1.069 -1.454 ∗∗∗ (0.800) (0.711) (0.411) (0.759) (0.438)AIC -0.509 -0.648 -0.370 -0.580 -0.812(0.800) (0.711) (0.411) (0.759) (0.438)CV-KF * Recessions 1.473 3.405 ∗∗ (2.166) (1.251)CV-POOS * Recessions -3.020 1.562(2.166) (1.251)AIC * Recessions -0.550 3.606 ∗∗ (2.166) (1.251)Observations 91200 45600 45600 45600 45600 Standard errors in parentheses. ∗ p < ∗∗ p < ∗∗∗ p < We now consider models that are usually tuned by CV and compare the performance ofthe two CVs by horizon and variables. Since we are now pooling multiple models, includ-ing all the alternative shrinkage models, if a clear pattern only attributable to a certain CVexisted, it would most likely appear in ﬁgure 7. What we see are two things. First, CV-KF isat least as good as POOS CV on average for almost all variables and horizons, irrespectiveof the informational content of the regression. The exceptions are HOUST in data-rich andINF in data-poor frameworks, and the two-year horizon with large data. Figure 8’s messagehas the virtue of clarity. POOS CV’s failure is mostly attributable to its poor record in reces-sions periods for the ﬁrst three variables at any horizon. Note that this is the same subset ofvariables that beneﬁts from adding in more data ( X ) and nonlinearities as discussed in 5.2.1.By using only recent data, POOS CV will be more robust to gradual structural change butwill perhaps have an Achilles heel in regime switching behavior. If the optimal hyperparam-eters are state-dependent, then a switch from expansion to recession at time t can be quite30igure 7: This ﬁgure compares the two CVs procedure averaged over all the models that use them. The unitof the x-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdencebands. Figure 8:

This ﬁgure compares the two CVs procedure averaged over all the models that use them. The unitof the x-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdencebands. In this section, we investigate whether replacing the l norm as an in-sample loss functionfor the SVR machinery helps in forecasting. We again use as baseline models ARs and ARDIstrained by the same corresponding CVs. The very nature of this ML feature is that the modelis less sensible to extreme residuals, thanks to the l norm outside of the ¯ (cid:101) -insensitivity tube.We ﬁrst compare linear models in ﬁgure 9. Clearly, changing the loss function is generallyharmful and that is mostly due to recessions period. However, in expansions, the linear SVRis better on average than a standard ARDI for UNRATE and SPREAD, but these small gainsare clearly offset (on average) by the huge recession losses.The SVR is usually used in its nonlinear form. We hereby compare KRR and SVR-NL tostudy whether the loss function effect could reverse when a nonlinear model is considered.Comparing these models makes sense since they both use the same kernel trick (with an RBFkernel). Hence, like linear models of ﬁgure 9, models in ﬁgure 10 only differ by the use of adifferent loss function ˆ L . It turns out conclusions are exactly the same as for linear modelswith the negative effects being slightly smaller in nonlinear world. There are few exceptions:inﬂation rate and one month ahead horizon during recessions. Furthermore, ﬁgures 18 and19 in the appendix C conﬁrm that these ﬁndings are valid for both the data-rich and thedata-poor environments.By investigating these results more in depth using tables 4 - 8, we see an emerging pat-tern. First, SVR sometimes does very good (best model for UNRATE at horizon 3 months)but underperforms for many targets – in its AR or ARDI form. When it does perform wellcompared to the benchmark, it is more often than not outshined marginally by the KRR ver-sion. For instance, in table 5, linear and nonlinear SVR-Kfold provide respectively reductions32igure 9: This graph displays the marginal (un)improvements by variables and horizons to opt for the SVRin-sample loss function in both recession and expansion periods. The unit of the x-axis are improvements inOOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. of 17% and 13% in RMSPE over the benchmark for UNRATE at horizon 9 months. However,analogous KRR and Random Forest similarly do so. Moreover, for targets for which SVRfails, the two models it is compared to in order to extract α ˆ L , KRR or the AR/ARDI, have amore stable (good) record. Hence, on average nonlinear SVR is much worse than KRR andthe linear SVR is also inferior to the plain ARDI. This explains the clear-cut results reportedin this section: if the SVR wins, it is rather for its use of the kernel trick (nonlinearities) thanan alternative in-sample loss function.These results point out that an alternative ˆ L like the ¯ (cid:101) -insensitive loss function is not themost salient feature ML has to offer for macroeconomic forecasting. From a practical point ofview, our results indicate that, on average, one can obtain the beneﬁts of SVR and more byconsidering the much simpler KRR. This is convenient since obtaining the KRR forecast is amatter of less than 10 lines of codes implying the most straightforward form of linear algebra.In contrast, obtaining the SVR solution can be a serious numerical enterprise.33igure 10: This graph displays the marginal (un)improvements by variables and horizons to opt for the SVRin-sample loss function in both recession and expansion periods. The unit of the x-axis are improvements inOOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. In this section we aim to explain some of the heterogeneity of ML treatment effects byinteracting them in equation (12) with few macroeconomic variables ξ t that have been usedto explain main sources of observed nonlinear macroeconomic ﬂuctuations. We focus on NLfeature only given its importance for both macroeconomic prediction and modeling.The ﬁrst element in ξ t is the Chicago Fed adjusted national ﬁnancial conditions index(ANFCI). Adrian et al. (2019) ﬁnd that lower quantiles of GDP growth are time varying andare predictable by tighter ﬁnancial conditions, suggesting that higher order approximationsare needed in general equilibrium models with ﬁnancial frictions. In addition, Beaudry et al.(2018) build on the observation that recessions are preceded by accumulations of business,consumer and housing capital, while Beaudry et al. (2020) add nonlinearities in the estimationpart of a model with ﬁnancial frictions and household capital accumulation. Therefore, weadd to the list the house price growth (HOUSPRICE), measured by the S&P/Case-ShillerU.S. National Home Price Index. The goal is then to test whether ﬁnancial conditions and34apital buildups interact with the nonlinear ML feature, and if they could explain its superiorperformance in macroeconomic forecasting.Uncertainty is also related to nonlinearity in macroeconomic modeling (Bloom, 2009). Be-nigno et al. (2013) provide a second-order approximation solution for a model with time-varying risk that has its own effect on endogenous variables. Gorodnichenko and Ng (2017)ﬁnd evidence on volatility factors that are persistent and load on the housing sector, whileCarriero et al. (2018) estimate uncertainty and its effects in a large nonlinear VAR model.Hence, we include the Macro Uncertainty from Jurado et al. (2015) (MACROUNCERT). Then we add measures of sentiments: University of Michigan Consumer Expectations(UMCSENT) and Purchasing Managers Index (PMI). Angeletos and La’O (2013) and Ben-habib et al. (2015) have suggested that waves of pessimism and optimism play an importantrole in generating (nonlinear) macroeconomic ﬂuctuations. In the case of Benhabib et al.(2015), optimal decisions based on sentiments produce multiple self-fulﬁlling rational expec-tations equilibria. Consequently, including measures of sentiment in ξ t aims to test if thischannel plays a role for nonlinearities in macro forecasting. Standard monetary VAR seriesare used as controls: UNRATE, PCE inﬂation (PCEPI) and one-year treasury rate (GS1). Interactions are formed with ξ t − h to measure its impact when the forecast is made. This isof interest for practitioners as it indicates which macroeconomic conditions favor nonlinearML forecast modeling. Hence, this expands the equation (12) to ∀ m ∈ M NL : R t , h , v , m = ˙ α NL + ˙ γ I ( m ∈ N L ) ξ t − h + ˙ φ t , v , h + ˙ u t , h , v , m where M NL is deﬁned as the set of models that differs only by the use of NL.The results are presented in table 3. The ﬁrst column shows regression coefﬁcients for h = {

9, 12, 24 } , since nonlinearity has been found more important for longer horizons. Thesecond column average across all horizons, while the third presents the results for data-richmodels only. The last column shows the heterogeneity of NL treatments during last 20 years.Results show that macroeconomic uncertainty is a true game changer for ML nonlinearity We did not consider the Economic Policy Uncertainty from Baker et al. (2016) as it starts only from 1985. We consider GS1 instead of the federal funds rate because of the long zero lower bound period. Time seriesof elements in ξ t are plotted in ﬁgure 16. ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ (0.748) (0.528) (1.012) (1.565)HOUSPRICE -9.668 ∗∗∗ -4.491 ∗∗∗ -11.56 ∗∗∗ -1.219(1.269) (0.871) (1.715) (1.596)ANFCI 7.244 ∗∗∗ ∗∗ ∗∗∗ (1.881) (1.379) (2.439) (4.891)MACROUNCERT 17.98 ∗∗∗ ∗∗∗ ∗∗∗ ∗∗∗ (1.875) (1.414) (2.745) (2.038)UMCSENT 4.695 ∗∗ ∗∗ ∗∗∗ -3.625(1.768) (1.315) (2.294) (1.922)PMI 0.0787 -1.443 -2.048 -1.919(1.179) (0.879) (1.643) (1.288)UNRATE 0.834 2.517 ∗∗ ∗∗∗ ∗∗∗ (1.353) (0.938) (1.734) (2.199)GS1 -14.24 ∗∗∗ -9.500 ∗∗∗ -17.30 ∗∗∗ ∗ ∗∗ -1.142 -6.242(2.828) (2.180) (4.093) (3.888)Observations 136800 228000 68400 72300 Standard errors in parentheses. ∗ p < ∗∗ p < ∗∗∗ p < as it improves its forecast accuracy by 34% in the case of data-rich models. This means thatif the macro uncertainty goes from -1 standard deviation to +1 standard deviation from itsmean, the expected NL treatment effect (in terms OOS- R difference) is 2*34=+68%. Tighterﬁnancial conditions and a decrease in house prices are also positively correlated with a higherNL treatment, which supports the ﬁndings in Adrian et al. (2019) and Beaudry et al. (2020).It is particularly interesting that the effect of ANFCI reaches 20% during last 20 years, whilethe impact of uncertainty decreases to less than 10%, emphasizing that the determinant roleof ﬁnancial conditions in recent US macro history is also reﬂected in our results. Waves ofconsumer optimism positively affect nonlinearities, especially with data-rich models.Among control variables, unemployment rate has a positive effect on nonlinearity. As ex-pected, this suggests that the importance of nonlinearities is a cyclical feature. Lower interestrates also improve NL treatment by as much as 17% in the data-rich setup. Higher inﬂationalso leads to stronger gains from ML nonlinearities, but mainly at shorter horizons and for36ata-poor models, as suggested by comparing speciﬁcations (2) and (3).These results document clear historical situations where NL consistently helps: (i) whenthe level of macroeconomic uncertainty is high and (ii) during episodes of tighter ﬁnancialconditions and housing bubble bursts. Also, we note that effects are often bigger in the caseof data-rich models. Hence, allowing nonlinear relationship between factors made of manypredictors can capture better the complex relationships that characterize the episodes above.These ﬁndings suggest that ML captures important macroeconomic nonlinearities, espe-cially in the context of ﬁnancial frictions and high macroeconomic uncertainty. They can alsoserve as guidance for forecasters that use a portfolio of predictive models: one should putmore weight on nonlinear speciﬁcations if economic conditions evolve as described above.

In this paper we have studied important features driving the performance of machinelearning techniques in the context of macroeconomic forecasting. We have considered manyML methods in a substantive POOS setup over 38 years for 5 key variables and 5 horizons. Wehave classiﬁed these models by “features” of machine learning: nonlinearities, regularization,cross-validation and alternative loss function. The data-rich and data-poor environmentswere considered. In order to recover their marginal effects on forecasting performance, wedesigned a series of experiments that easily allow to identify the treatment effects of interest.The ﬁrst result indicates that nonlinearities are the true game changer for the data-richenvironment, as they improve substantially the forecasting accuracy for all macroeconomicvariables in our exercise and especially when predicting at long horizons. This gives a starkrecommendation for practitioners. It recommends for most variables and horizons what is inthe end a partially nonlinear factor model – that is, factors are still obtained by PCA. The bestof ML (at least of what considered here) can be obtained by simply generating the data for astandard ARDI model and then feed it into a ML nonlinear function of choice. The perfor- Granziera and Sekhposyan (2019) have exploited similar regression setup for model selection and foundthat ‘economic’ forecasting models, AR augmented by few macroeconomic indicators, outperform the timeseries models during turbulent times (recessions, tight ﬁnancial conditions and high uncertainty). L is preferred to the ¯ (cid:101) -insensitive loss function for macroeconomic predictions. Wefound that most (if not all) the beneﬁts from the use of SVR in fact comes from the nonlinear-ities it creates via the kernel trick rather than its use of an alternative loss function. References

Abadie, A. and Kasy, M. (2019). Choosing among regularized estimators in empirical eco-nomics: The risk of machine learning.

Review of Economics and Statistics , 101(5):743–762.Adrian, T., Boyarchenko, N., and Giannone, D. (2019). Vulnerable growth.

American EconomicReview , 109(4):1263–1289.Ahmed, N. K., Atiya, A. F., El Gayar, N., and El-Shishiny, H. (2010). An empirical comparisonof machine learning models for time series forecasting.

Econometric Reviews , 29(5):594–621.Alquier, P., Li, X., and Wintenberger, O. (2013). Prediction of time series by statistical learning:General losses and fast rates.

Dependence Modeling , 1(1):65–93.Angeletos, G.-M. and La’O, J. (2013). Sentiments.

Econometrica , 81(2):739–779.Athey, S. (2019). The Impact of Machine Learning on Economics. In Agrawal, A., Gans, J.,and Goldfarb, A., editors,

The Economics of Artiﬁcial Intelligence: An Agenda , pages 507–552.University of Chicago Press.Atkeson, A. and Ohanian, L. E. (2001). Are Phillips Curves Useful for Forecasting Inﬂation?

Quarterly Review , 25(1):2–11.Baker, S. R., Bloom, N., and Davis, S. J. (2016). Measuring Economic Policy Uncertainty.

TheQuarterly Journal of Economics , 131(4):1593–1636.Beaudry, P., Galizia, D., and Portier, F. (2018). Reconciling Hayek’s and Keynes’views ofrecessions.

Review of Economic Studies , 85(1):119–156.Beaudry, P., Galizia, D., and Portier, F. (2020). Putting the cycle back into business cycleanalysis.

American Economic Review , 110(1):1–47.38elloni, A., Chernozhukov, V., Fernandes-Val, I., and Hansen, C. B. (2017). Program Evalua-tion and Causal Inference With High-Dimensional Data.

Econometrica , 85(1):233–298.Benhabib, J., Wang, P., and Wen, Y. (2015). Sentiments and Aggregate Demand Fluctuations.

Econometrica , 83(2):549–585.Benigno, G., Benigno, P., and Nisticò, S. (2013). Second-order approximation of dynamicmodels with time-varying risk.

Journal of Economic Dynamics and Control , 37(7):1231–1247.Bergmeir, C. and Benítez, J. M. (2012). On the use of cross-validation for time series predictorevaluation.

Information Sciences , 191:192–213.Bergmeir, C., Hyndman, R. J., and Koo, B. (2018). A Note on the Validity of Cross-Validationfor Evaluating Autoregressive Time Series Prediction.

Computational Statistics and DataAnalysis , 120:70–83.Bloom, N. (2009). The Impact of Uncertainty Shocks.

Econometrica , 77(3):623–685.Boivin, J. and Ng, S. (2006). Are More Data Always Better for Factor Analysis?

Journal ofEconometrics , 132(1):169–194.Boot, T. and Pick, A. (2020). Does Modeling a Structural Break Improve Forecast Accuracy?

Journal of Econometrics , 215(1):35–59.Bordo, M. D., Redish, A., and Rockoff, H. (2015). Why didn’t Canada have a banking crisisin 2008 (or in 1930, or 1907, or...)?

Economic History Review , 68(1):218–243.Breiman, L. (2001). Random forests.

Machine Learning , 45:5–32.Carriero, A., Clark, T. E., and Marcellino, M. (2018). Measuring Uncertainty and Its Impacton the Economy.

Review of Economics and Statistics , 100(5):799–815.Carriero, A., Galvão, A. B., and Kapetanios, G. (2019). A Comprehensive Evaluation ofMacroeconomic Forecasting Methods.

International Journal of Forecasting , 35(4):1226 – 1239.Chen, J., Dunn, A., Hood, K., and Batch, A. (2019). Off to the Races : A Comparison of Ma-chine Learning and Alternative Data for Nowcasting of Economic Indicators. In Abraham,K., Jarmin, R. S., Moyer, B., and Shapiro, M. D., editors,

Big Data for 21st Century EconomicStatistics . University of Chicago Press.Chevillon, G. (2007). Direct multi-step estimation and forecasting.

Journal of Economic Surveys ,21(4):746–785.Choudhury, S., Ghosh, S., Bhattacharya, A., Fernandes, K. J., and Tiwari, M. K. (2014). Areal time clustering and SVM based price-volatility prediction for optimal trading strategy.

Neurocomputing , 131:419–426.Claeskens, G. and Hjort, N. L. (2008). Akaike’s Information Criterion. In Claeskens, G. and39jort, N. L., editors,

Model Averaging and Model Selection , chapter 2, pages 22–69.Colombo, E. and Pelagatti, M. (2020). Statistical learning and exchange rate forecasting.

In-ternational Journal of Forecasting , xxx(xxxx):1–30.Cook, T. and Smalter Hall, A. (2017). Macroeconomic Indicator Forecasting with Deep NeuralNetworks.Coulombe, P. G. (2019). Time-varying Parameters : A Machine Learning Approach.De Mol, C., Giannone, D., and Reichlin, L. (2008). Forecasting using a large number of predic-tors: Is Bayesian shrinkage a valid alternative to principal components?

Journal of Econo-metrics , 146(2):318–328.Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy.

Journal of Businessand Economic Statistics , 13(3):253–263.Diebold, F. X. and Shin, M. (2019). Machine learning for regularized survey forecast combi-nation: Partially-egalitarian LASSO and its derivatives.

International Journal of Forecasting ,35(4):1679–1691.Döpke, J., Fritsche, U., and Pierdzioch, C. (2017). Predicting recessions with boosted regres-sion trees.

International Journal of Forecasting , 33(4):745–759.Exterkate, P., Groenen, P. J. F., Heij, C., and van Dijk, D. (2016). Nonlinear forecasting withmany predictors using kernel ridge regression.

International Journal of Forecasting , 32(3):736–753.Fan, J. and Lv, J. (2010). A Selective Overview of Variable Selection in High DimensionalFeature Space.

Statistica Sinica , 20(1):101–148.Fortin-Gagnon, O., Leroux, M., Stevanovic, D., and Surprenant, S. (2020). A Large CanadianDatabase for Macroeconomic Analysis.Giacomini, R. and Rossi, B. (2010). Forecast Comparisons in Unstable Environments.

Journalof Applied Econometrics , 25(4):595 – 620.Giannone, D., Lenza, M., and Primiceri, G. E. (2015). Prior Selection for Vector Autoregres-sions.

Review of Economics and Statistics , 97(2):436–451.Giannone, D., Lenza, M., and Primiceri, G. E. (2018). Economic Predictions with Big Data:The Illusion of Sparsity.Gorodnichenko, Y. and Ng, S. (2017). Level and volatility factors in macroeconomic data.

Journal of Monetary Economics , 91:52–68.Goulet Coulombe, P. (2020). To Bag is to Prune. arXiv preprint arXiv:2008.07063 .Goulet Coulombe, P., Leroux, M., Stevanovic, D., and Surprenant, S. (2020). Macroeconomic40ata Transformations Matter. arXiv preprint arXiv:2008.01714 .Granger, C. W. J. and Jeon, Y. (2004). Thick Modeling.

Economic Modelling , 21(2):323–343.Granziera, E. and Sekhposyan, T. (2019). Predicting relative forecasting performance: Anempirical investigation.

International Journal of Forecasting , 35(4):1636–1657.Gu, S., Kelly, B., and Xiu, D. (2020a). Autoencoder asset pricing models.

Journal of Economet-rics , 0(0).Gu, S., Kelly, B., and Xiu, D. (2020b). Empirical Asset Pricing via Machine Learning.

Reviewof Financial Studies , 33(5):2223–2273.Hansen, P. R., Lunde, A., and Nason, J. M. (2011). The Model Conﬁdence Set.

Econometrica ,79(2):453–497.Hansen, P. R. and Timmermann, A. (2015). Equivalence Between Out-of-Sample ForecastComparisons and Wald Statistics.

Econometrica , 83(6):2485–2505.Hastie, T., Tibshirani, R., and Friedman, J. (2009).

The Elements of Statistical Learning: DataMining, Interference, and Prediction . Springer Science & Business Media, second edition.Inoue, A., Jin, L., and Rossi, B. (2017). Rolling Window Selection for out-of-sample Forecast-ing with Time-varying Parameters.

Journal of Econometrics , 196(1):55–67.Joseph, A. (2019). Parametric Inference with Universal Function Approximators.Jurado, K., Ludvigson, S. C., and Ng, S. (2015). Measuring Uncertainty.

American EconomicReview , 105(3):1177–1216.Kim, H. H. and Swanson, N. R. (2018). Mining big data using parsimonious factor, ma-chine learning, variable selection and shrinkage methods.

International Journal of Forecast-ing , 34(2):339–354.Koenker, R. and Machado, J. A. (1999). Goodness of Fit and Related Inference Processes forQuantile Regression.

Journal of the American Statistical Association , 94(448):1296–1310.Kotchoni, R., Leroux, M., and Stevanovic, D. (2019). Macroeconomic forecast accuracy in adata-rich environment.

Journal of Applied Econometrics , 34(7):1050–1072.Kuan, C. M. and White, H. (1994). Artiﬁcial neural networks: An econometric perspective.

Econometric Reviews , 13(1).Kuznetsov, V. and Mohri, M. (2015). Learning theory and algorithms for forecasting non-stationary time series. In

Advances in Neural Information Processing Systems , pages 541–549.Lee, T. H., White, H., and Granger, C. W. J. (1993). Testing for neglected nonlinearity in timeseries models. A comparison of neural network methods and alternative tests.

Journal ofEconometrics , 56(3):269–290. 41erch, S., Thorarinsdottir, T. L., Ravazzolo, F., and Gneiting, T. (2017). Forecaster’s dilemma:Extreme events and forecast evaluation.

Statistical Science , 32(1):106–127.Li, J. and Chen, W. (2014). Forecasting macroeconomic time series: LASSO-based approachesand their forecast combinations with dynamic factor models.

International Journal of Fore-casting , 30(4):996–1015.Litterman, R. B. (1979). Techniques of Forecasting Using Vector Autoregressions.Lu, C. J., Lee, T. S., and Chiu, C. C. (2009). Financial time series forecasting using independentcomponent analysis and support vector regression.

Decision Support Systems , 47(2):115–125.Marcellino, M. (2008). A linear benchmark for forecasting GDP growth and inﬂation?

Journalof Forecasting , 27(4):305–340.Marcellino, M., Stock, J. H., and Watson, M. W. (2006). A comparison of Direct and IteratedMultistep AR Methods for Forecasting Macroeconomic Time Series.

Journal of Econometrics ,135(1-2):499–526.McCracken, M. W. and Ng, S. (2016). FRED-MD: A Monthly Database for MacroeconomicResearch.

Journal of Business and Economic Statistics , 34(4):574–589.Medeiros, M. C., Teräsvirta, T., and Rech, G. (2006). Building Neural Network Models forTime Series: A Statistical Approach.

Journal of Forecasting .Medeiros, M. C., Vasconcelos, G. F., Veiga, Á., and Zilberman, E. (2019). Forecasting Inﬂationin a Data-Rich Environment: The Beneﬁts of Machine Learning Methods.

Journal of Businessand Economic Statistics , 0(0):1–45.Milunovich, G. (2020). Forecasting Australia’s real house price index: A comparison of timeseries and machine learning methods.

Journal of Forecasting , pages 1–21.Mohri, M. and Rostamizadeh, A. (2010). Stability bounds for stationary φ -mixing and β -mixing processes. Journal of Machine Learning Research , 11:789–814.Moshiri, S. and Cameron, N. (2000). Neural network versus econometric models in forecast-ing inﬂation.

Journal of Forecasting , 19(3):201–217.Nakamura, E. (2005). Inﬂation forecasting using a neural network.

Economics Letters ,86(3):373–378.Ng, S. (2014). Viewpoint: Boosting recessions.

Canadian Journal of Economics , 47(1):1–34.Patel, J., Shah, S., Thakkar, P., and Kotecha, K. (2015a). Predicting stock and stock price indexmovement using Trend Deterministic Data Preparation and machine learning techniques.

Expert Systems with Applications , 42(1):259–268.Patel, J., Shah, S., Thakkar, P., and Kotecha, K. (2015b). Predicting stock market index using42usion of machine learning techniques.

Expert Systems with Applications , 42(4):2162–2172.Pesaran, M. H., Pick, A., and Pranovich, M. (2013). Optimal forecasts in the presence ofstructural breaks.

Journal of Econometrics , 177(2):134–152.Pesaran, M. H. and Timmermann, A. (2007). Selection of estimation window in the presenceof breaks.

Journal of Econometrics , 137(1):134–161.Qu, H. and Zhang, Y. (2016). A new kernel of support vector regression for forecasting high-frequency stock returns.

Mathematical Problems in Engineering , pages 1–9.Sermpinis, G., Stasinakis, C., Theoﬁlatos, K., and Karathanasopoulos, A. (2014). Inﬂation andunemployment forecasting with genetic support vector regression.

Journal of Forecasting ,33(6):471–487.Smeekes, S. and Wijler, E. (2018). Macroeconomic forecasting using penalized regressionmethods.

International Journal of Forecasting , 34(3):408–430.Smola, A. J., Murata, N., Schölkopf, B., and Müller, K.-R. (1998). Asymptotically OptimalChoice of (cid:101) -Loss for Support Vector Machines. In

International Conference on Artiﬁcial NeuralNetworks , number 2, pages 105–110, London. Springer.Smola, A. J. and Schölkopf, B. (2004). A Tutorial on Support Vector Regression.

Statistics andComputing , 14:199–222.Stock, J. H. and Watson, M. W. (1999). A Comparison of Linear and Nonlinear UnivariateModels for Forecasting Macroeconomic Time Series. In Engle, R. F. and White, H., edi-tors,

Cointegration, Causality and Forecasting: A Festschrift for Clive W.J. Granger , pages 1–44.Oxford University Press, Oxford.Stock, J. H. and Watson, M. W. (2002a). Forecasting using principal components from a largenumber of predictors.

Journal of the American Statistical Association , 97(460):1167–1179.Stock, J. H. and Watson, M. W. (2002b). Macroeconomic forecasting using diffusion indexes.

Journal of Business and Economic Statistics , 20(2):147–162.Stock, J. H. and Watson, M. W. (2009). Phillips Curve Inﬂation Forecasts. In Fuhrer, J., Kodrzy-cki, Y. K., Sneddon Little, J., and Olivei, G. P., editors,

Understanding Inﬂation and the Impli-cation for Monetary Policy , chapter 3, pages 99–202. MIT Press, Cambridge, Massachusetts.Stock, J. H. and Watson, M. W. (2012a). Disentangling the Channels of the 2007â ˘A ¸S09 Reces-sion.

Brookings Papers on Economic Activity , (1):81–156.Stock, J. H. and Watson, M. W. (2012b). Generalized Shrinkage Methods for Forecasting UsingMany Predictors.

Journal of Business and Economic Statistics , 30(4):481–493.Swanson, N. R. and White, H. (1997). A Model Selection Approach To Real-Time Macroe-43onomic Forecasting Using Linear Models And Artiﬁcial Neural Networks.

The Review ofEconomics and Statistics , 79(4):540–550.Tashman, L. J. (2000). Out-of-sample Tests of Forecasting Accuracy: An Analysis and Review.

International Journal of Forecasting , 16(4):437–450.Teräsvirta, T. (2006). Forecasting Economic Variables with Nonlinear Models. In Granger,C. W. J. and Elliott, G., editors,

Handbook of Economic Forecasting , chapter 8, pages 413–457.Elsevier.Trapletti, A., Leisch, F., and Hornik, K. (2000). Stationary and Integrated AutoregressiveNeural Network Processes.

Neural Computation , 12(10):2427–2450.Uddin, M. F., Lee, J., Rizvi, S., and Hamada, S. (2018). Proposing enhanced feature engineer-ing and a selection model for machine learning processes.

Applied Sciences (Switzerland) ,8(4):1–32.Yeh, C. Y., Huang, C. W., and Lee, S. J. (2011). A multiple-kernel support vector regressionapproach for stock market price forecasting.

Expert Systems with Applications , 38(3):2177–2186.Yousuf, K. and Ng, S. (2019). Boosting High Dimensional Predictive Regressions with TimeVarying Parameters.Zhang, X. R., Hu, L. Y., and Wang, Z. S. (2010). Multiple kernel support vector regression foreconomic forecasting. , (70872025):129–134.Zhao, Q. and Hastie, T. (2019). Causal Interpretations of Black-Box Models.

Journal of Businessand Economic Statistics , 0(0):1–19.Zou, H. (2006). The adaptive lasso and its oracle properties.

Journal of the American StatisticalAssociation , 101(476):1418–1429.Zou, H., Hastie, T., and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso.

Annalsof Statistics , 35(5):2173–2192. 44

Detailed Overall Predictive Performance

Table 4: Industrial Production: Relative Root MSPE

Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=3 h=9 h=12 h=24 h=1 h=3 h=9 h=12 h=24Data-poor ( H − t ) modelsAR,BIC (RMSPE) 0.0765 * * * ** *** ** * *** 0.864***SVR-AR,Lin,POOS-CV 1.134*** 1.226*** 1.114*** 1.132*** * 1.186** 1.285*** 1.079** 1.034*** 0.893***SVR-AR,Lin,K-fold 1.069* 1.159** 1.055** 1.042*** 1.016*** 1.268*** 1.319*** 1.067*** 1.035*** 1.013***SVR-AR,RBF,POOS-CV 0.999 1.061*** * H + t ) modelsARDI,BIC * *** *** 0.887** 0.833*** 0.784***ARDI,AIC * *** 0.844** ** ***ARDI,POOS-CV 0.994 *** *** *** 0.812***ARDI,K-fold * * *** *** 0.841** ** ***RRARDI,POOS-CV *** 0.793*** *** 0.861**RRARDI,K-fold ** *** *** *** ***RFARDI,POOS-CV ** * ** 0.865** *** 0.837*** *** 0.819***RFARDI,K-fold ** * ** 0.889*** * 0.846*** *** ***KRR-ARDI,POOS-CV 1.038 * * * *** 0.848*** ( B , α = ˆ α ) ,POOS-CV 1.014 ( B , α = ˆ α ) ,K-fold ** *** 0.890* ( B , α = ) ,POOS-CV * ( B , α = ) ,K-fold ** *** 0.890* ( B , α = ) ,POOS-CV 1.047 1.112** ( B , α = ) ,K-fold 1.025 1.056* 1.065 1.082 1.052 1.032 0.974 0.923 0.929 0.847*** ( B , α = ˆ α ) ,POOS-CV 1.061 ** 1.237 0.810*** 0.889*** 0.904** 0.869** ( B , α = ˆ α ) ,K-fold 1.098 *** 0.896** 0.851*** *** ( B , α = ) ,POOS-CV ** 1.034 1.033 0.997 0.957 0.839*** ( B , α = ) ,K-fold ** 1.022 1.032 ( B , α = ) ,POOS-CV *** ** 0.902** *** 0.904** 0.840*** 0.807*** ( B , α = ) ,K-fold ** ** *** 0.858*** *** 0.776*** ( B , α = ˆ α ) ,POOS-CV ( B , α = ˆ α ) ,K-fold ** ** 1.042 ( B , α = ) ,POOS-CV * 1.053 1.053 1.080* ( B , α = ) ,K-fold *** ** *** 0.822*** ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold 0.981 1.01 1.03 *** 0.787***SVR-ARDI,Lin,POOS-CV ***SVR-ARDI,Lin,K-fold 1.109** 1.367*** *** *** ***SVR-ARDI,RBF,POOS-CV * * 0.958 0.900* 0.873** *** 0.820***SVR-ARDI,RBF,K-fold * *** *** *** 0.791*** Note: The numbers represent the relative. with respect to AR,BIC model. root MSPE. Models retained in model conﬁdence set are in bold. the minimumvalues are underlined. while ∗∗∗ . ∗∗ . ∗ stand for 1%. 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=3 h=9 h=12 h=24 h=1 h=3 h=9 h=12 h=24Data-poor ( H − t ) modelsAR,BIC (RMSPE) 1.9578 1.1905 1.0169 1.0058 0.869 2.5318 2.0826 1.8823 1.7276 1.0562AR,AIC 0.991 0.984 0.988 0.993*** 1.000 0.958 0.960** 0.984* 1.000 1.000AR,POOS-CV 0.988 0.999 1.002 0.995 0.987 0.978 0.980** 0.996 0.998 1.04AR,K-fold 0.994 0.984 0.989 0.986*** 0.991 0.956* 0.960** 0.998 1.000 1.038RRAR,POOS-CV 0.989 1.000 1.002 0.990* 0.972** 0.984 0.988* 0.997 0.991* 1.001RRAR,K-fold 0.988 0.982* 0.983* 0.989** 0.999 0.963 0.971* 0.992 0.995 1.033RFAR,POOS-CV 0.983 0.995 0.968 1.000 1.002 0.989 1.003 0.929** 0.951** 0.994RFAR,K-fold 0.98 0.985 0.979 1.006 0.99 0.985 0.972 0.896*** 0.943* 0.983KRR-AR,POOS-CV 0.99 1.04 *** *** 0.876*** 1.04 1.116 0.843*** 0.883*** 0.904**KRR,AR,K-fold *** *** *** *** 0.852*** 0.847*** 0.838*** *** *** 0.908**SVR-AR,Lin,POOS-CV 1.028 1.133** 1.130*** 1.108*** 1.174*** 1.065* 1.274*** 1.137*** 1.094*** 1.185***SVR-AR,Lin,K-fold 0.993 1.061** 1.068*** 1.045*** 1.013*** 1.062** 1.108*** 1.032** 1.011 1.018***SVR-AR,RBF,POOS-CV 1.019 1.094* 1.029 1.076** 1.01 1.097** 1.247** 1.047* 1.034*** 1.112*SVR-AR,RBF,K-fold 0.997 1.011 1.078** 1.053* 0.993 1.026 1.009 1.058 1.023 0.985Data-rich ( H + t ) modelsARDI,BIC ** ** 0.938 0.939 0.875*** *** 0.715*** *** 0.782*** ***ARDI,AIC ** *** 0.928 0.953 0.893** *** 0.719*** *** 0.799*** ***ARDI,POOS-CV *** * 0.957 0.925* *** *** *** 0.840** *** ***ARDI,K-fold ** ** 0.929 0.93 0.915** *** 0.697*** *** 0.807*** ***RRARDI,POOS-CV *** * 0.968 0.946 0.870*** *** *** 0.849** *** ***RRARDI,K-fold ** ** 0.946 0.931* 0.908** ** *** *** 0.790*** ***RFARDI,POOS-CV *** 0.945 *** *** *** *** 0.769*** *** *** ***RFARDI,K-fold *** *** ** *** *** *** 0.742*** *** *** ***KRR-ARDI,POOS-CV * *** *** *** 1.01 1.017 *** *** 0.828***KRR,ARDI,K-fold *** ** *** *** *** 0.925 0.933 *** *** *** ( B . α = ˆ α ) ,POOS-CV 0.979 ( B . α = ˆ α ) ,K-fold 0.971 ** *** 0.919* 0.925* 0.787*** 0.848*** 0.840*** 0.839*** 0.829** ( B . α = ) ,POOS-CV 0.947*** * 0.962 0.922** 0.889*** 0.857** 0.789*** 0.888** 0.860*** 0.915* ( B . α = ) ,K-fold 0.971 ** *** 0.919* 0.925* 0.787*** 0.848*** 0.840*** 0.839*** 0.829** ( B . α = ) ,POOS-CV 1.238** 1.319** 1.021 1.07 1.01 1.393* 1.476* 0.979 0.972 *** ( B . α = ) ,K-fold 1.246** 0.994 1.062* 1.077* 1.018 1.322 0.963 0.991 0.933 0.802*** ( B , α = ˆ α ) ,POOS-CV *** ** 0.926* 0.936* 0.911** *** 0.767*** 0.869** 0.832*** 0.808*** ( B , α = ˆ α ) ,K-fold *** *** 0.915* 0.931 0.974 *** 0.777*** 0.829*** *** *** ( B , α = ) ,POOS-CV *** 0.955 1.057 1.011 0.883*** 0.810*** 0.830*** 1.029 0.952 0.795*** ( B , α = ) ,K-fold 0.97 ** 0.991 0.983 0.918** 0.837** 0.754*** 0.903 0.833*** *** ( B , α = ) ,POOS-CV *** *** 0.991 0.922* 0.889*** 0.781** 0.769*** 0.915 0.786*** *** ( B , α = ) ,K-fold ** *** 0.908** 0.906** 0.967 0.875 0.777*** 0.817*** *** *** ( B , α = ˆ α ) ,POOS-CV ** *** 0.952 0.943 0.874*** 0.933 0.843*** 0.886** 0.829*** 0.827*** ( B , α = ˆ α ) ,K-fold ** *** ** 0.923* 0.921** 0.836* 0.831*** 0.868*** 0.839*** *** ( B , α = ) ,POOS-CV *** ** 0.958 0.983 0.884*** 0.812** 0.771*** 0.864** 0.851** 0.845*** ( B , α = ) ,K-fold 0.968 0.941* *** 0.907* 0.943 0.808** 0.806*** 0.832*** 0.873** *** ( B , α = ) ,POOS-CV ** 0.974 0.994 1.066 0.946* 0.979 1.03 0.956 0.877** *** ( B , α = ) ,K-fold 0.969 *** 0.983 0.998 0.945* 0.963 0.901* 0.957 0.912* ***SVR-ARDI,Lin,POOS-CV * 1.041 1.072 0.929 1.028 0.872 0.858* 0.941 0.809*** ***SVR-ARDI,Lin,K-fold 0.959* *** *** 0.926 0.946 0.801** 0.791*** *** ** 0.872*SVR-ARDI,RBF,POOS-CV **SVR-ARDI,RBF,K-fold ** 0.958 ** 0.911* 0.930* *** 0.796*** *** *** *** Note: The numbers represent the relative, with respect to AR,BIC model, root MSPE. Models retained in model conﬁdence set are in bold, the minimumvalues are underlined, while ∗∗∗ , ∗∗ , ∗ stand for 1%, 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=3 h=9 h=12 h=24 h=1 h=3 h=9 h=12 h=24Data-poor ( H − t ) modelsAR,BIC (RMSPE) * 0.998 * 1.034** 1.041** * 1.139* ** 1.142* * 0.992 ** 0.945* *** 0.758*** 0.948KRR,AR,K-fold ** 0.999 1.048 ** SVR-AR,Lin,POOS-CV 1.158** 1.326*** * 1.045 1.045 1.111* 1.072 0.894* 0.828* 0.967SVR-AR,Lin,K-fold 1.191** 1.056 * 0.951 0.969 ***SVR-AR,RBF,K-fold H + t ) modelsARDI,BIC *** *** *** 1.049ARDI,AIC *** *** 1.076ARDI,POOS-CV ** *** 1.041RRARDI,POOS-CV *** 1.163*RFARDI,POOS-CV * ** ** *** *** 0.985RFARDI,K-fold *** 0.97KRR-ARDI,POOS-CV 1.355** 0.898 ** *** * *** ** *KRR,ARDI,K-fold 1.382*** 0.96 ** *** * *** *** 0.912* ( B , α = ˆ α ) ,POOS-CV 1.114 1.06 1.126*** 1.021 *** ** 1.012 ( B , α = ˆ α ) ,K-fold ( B , α = ) ,POOS-CV 1.125* 1.115 1.172*** 1.072 *** 1.071 1.006 1.033 0.833 0.96 ( B , α = ) ,K-fold ( B , α = ) ,POOS-CV 1.173** 1.312** 1.176*** 1.088 0.978 1.089 1.065 0.981 0.799 0.966 ( B , α = ) ,K-fold 1.163* 1.059 ** ** 0.729** * ( B , α = ˆ α ) ,POOS-CV ** 1.028 *** *** 0.989 ( B , α = ˆ α ) ,K-fold * 1.059 0.935* ( B , α = ) ,POOS-CV 1.062 0.968 ** 1.049 0.926*** ( B , α = ) ,K-fold ** 1.01 0.950* ( B , α = ) ,POOS-CV 1.118* 1.082 ** 1.008 *** *** 1.016 ( B , α = ) ,K-fold 1.102 0.988 ** ( B , α = ˆ α ) ,POOS-CV ** 1.076 0.933* ( B , α = ˆ α ) ,K-fold *** *** ( B , α = ) ,POOS-CV * 1.039 *** ( B , α = ) ,K-fold ( B , α = ) ,POOS-CV 1.181* 0.961 ** 1.056 0.937** 1.215 0.901 1.013 0.825 0.919* ( B , α = ) ,K-fold ** 1.113* 1.065 1.016 * 1.117 0.714** 1.097SVR-ARDI,Lin,K-fold ** ** 0.994SVR-ARDI,RBF,POOS-CV *** 0.707*** 1.204*SVR-ARDI,RBF,K-fold * *** ** Note: The numbers represent the relative, with respect to AR,BIC model, root MSPE. Models retained in model conﬁdence set are in bold, the minimumvalues are underlined, while ∗∗∗ , ∗∗ , ∗ stand for 1%, 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=3 h=9 h=12 h=24 h=1 h=3 h=9 h=12 h=24Data-poor ( H − t ) modelsAR,BIC (RMSPE) 0.0312 0.0257 0.0194 0.0187 0.0188 0.0556 0.0484 0.032 0.0277 0.0221AR,AIC 0.969*** 0.984 0.976* 0.988 0.995 1.000 0.970** 0.999 0.992 1.005AR,POOS-CV 0.966** 0.988 0.997 0.992 1.009 0.961** 0.981 0.995 0.978 1.003AR,K-fold 0.972** 0.976** 0.975* 0.988 0.987 1.002 0.965*** 0.998 0.992 1.005RRAR,POOS-CV 0.969** 0.984 0.99 0.993 1.006 0.961** 0.982 0.995 0.963* 0.998RRAR,K-fold 0.964*** 0.979** 0.970* 0.980* 0.989 0.989 0.973** 0.996 0.992 0.997RFAR,POOS-CV 0.983 * * ** * ** ** ** 0.907** 1.023 ** 0.927 0.91 0.852*KRR,AR,K-fold *** *** ** *** 0.942 0.965 ** 0.92 0.915 0.975SVR-AR,Lin,POOS-CV 1.119** 1.291** 1.210*** 1.438*** 1.417*** 1.116 1.196** 1.204** 1.055 1.613***SVR-AR,Lin,K-fold 1.239*** 1.369** 1.518*** 1.606*** 1.411*** 1.159* 1.326* 1.459** 1.501* 1.016SVR-AR,RBF,POOS-CV 0.988 1.004 1.086* 1.068** 1.127** 0.999 1.004 0.969 1.091** 1.501***SVR-AR,RBF,K-fold 0.99 1.025 1.025 1.003 1.370*** 0.965 0.979 0.996 0.896** 1.553**Data-rich ( H + t ) modelsARDI,BIC 0.96 * 0.880* 0.919* * 0.779* 0.755** 0.713**ARDI,AIC ** 0.676**ARDI,POOS-CV * 0.832** 0.781*** **ARDI,K-fold * * * 0.891** *** **RRARDI,POOS-CV * * 0.828** *** **RRARDI,K-fold ** * ** *** **RFARDI,POOS-CV ** *** ** * 0.979 0.976 ** 0.988 1.051 0.964RFARDI,K-fold *** *** ** ** 0.909* 0.962 ** 0.979 0.93 1.003KRR-ARDI,POOS-CV 1.006 1.043 0.959 0.972 1.067 1.046 1.093 0.952 0.948 0.946KRR,ARDI,K-fold ( B , α = ˆ α ) ,POOS-CV ** * 0.976 0.96 1.026 *** * 0.8 0.848 0.974 ( B , α = ˆ α ) ,K-fold ** * 1.012 1.056 1.092* ** * ( B , α = ) ,POOS-CV ** 1.11 1.03 1.076 ** * 0.794 0.825 0.989 ( B , α = ) ,K-fold ** * 1.012 1.056 1.092* ** * ( B , α = ) ,POOS-CV 0.971 1.035 1.114* 1.048 1.263** 0.848** ( B , α = ) ,K-fold * 1.057 1.246** 1.289** 1.260*** *** ( B , α = ˆ α ) ,POOS-CV ** ** ( B , α = ˆ α ) ,K-fold ** * 0.995 0.956 1.037 0.868* 0.957* 0.817* 0.778** 0.861 ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold ** ( B , α = ) ,POOS-CV ** ( B , α = ) ,K-fold ** * 0.997 1.011 1.212** 0.865* ( B , α = ˆ α ) ,POOS-CV * *** ** 0.901* 0.889* ** 0.791** 0.785** 0.808** ( B , α = ˆ α ) ,K-fold * ** ( B , α = ) ,POOS-CV * * * * ** 0.86 ( B , α = ) ,K-fold ( B , α = ) ,POOS-CV * ** * ** 1.022 0.894* ** 0.865 0.875 0.896 ( B , α = ) ,K-fold *** ** 1.002 0.997 0.945 *** 0.927* 0.964 0.816** 0.826** ** Note: The numbers represent the relative, with respect to AR,BIC model, root MSPE. Models retained in model conﬁdence set are in bold, the minimumvalues are underlined, while ∗∗∗ , ∗∗ , ∗ stand for 1%, 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=3 h=9 h=12 h=24 h=1 h=3 h=9 h=12 h=24Data-poor ( H − t ) modelsAR,BIC (RMSPE) * 1.01 1.036** 1.015 ** 1.011* 1.013 1.057**AR,K-fold ** 1.000 1.000 1.036RRAR,POOS-CV ** 1.001 1.013 1.050**RRAR,K-fold ** 1.000** 0.999 1.042**RFAR,POOS-CV 1.030*** * * 1.045** 1.018 1.023 * * * 1.044* 1.037* * ** *** 1.064*** 1.223** 1.024* * * 0.984 ***SVR-AR,Lin,K-fold 1.036*** 1.031 *** 1.015 1.017SVR-AR,RBF,K-fold 1.009 ** 1.020*** 1.034** 1.021* 0.969* 1.010*** 1.017** 1.001Data-rich ( H + t ) modelsARDI,BIC * ** ** * 0.899** * * *** *** ** ** *** * * ** ** ** ( B . α = ˆ α ) ,POOS-CV ** 1.023 1.099 ( B . α = ˆ α ) ,K-fold 1.040* 1.095** 1.250** 1.335** 1.151* 1.096* 1.152** 1.021 1.127 ( B . α = ) ,POOS-CV 1.032** 1.039 1.155 1.045 0.949 1.013 1.063 ( B . α = ) ,K-fold 1.040* 1.095** 1.250** 1.335** 1.151* 1.096* 1.152** 1.021 1.127 ( B . α = ) ,POOS-CV ( B . α = ) ,K-fold ** ( B . α = ˆ α ) ,POOS-CV 1.044 ( B . α = ˆ α ) ,K-fold ( B . α = ) ,POOS-CV ( B . α = ) ,K-fold ** ( B . α = ) ,POOS-CV 1.091* ( B . α = ) ,K-fold 1.066 *** 0.917 ( B , α = ˆ α ) ,POOS-CV 1.009 * ** 1.028 1.019 ( B , α = ˆ α ) ,K-fold * ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold 1.013 ( B , α = ) ,POOS-CV 1.022* * ** 1.001 ( B , α = ) ,K-fold 1.030** Note: The numbers represent the relative, with respect to AR,BIC model, root MSPE. Models retained in model conﬁdence set are in bold, the minimumvalues are underlined, while ∗∗∗ , ∗∗ , ∗ stand for 1%, 5% and 10% signiﬁcance of Diebold-Mariano test. Robustness of Treatment Effects Graphs

Figure 11:

This ﬁgure plots the distribution of ˙ α ( h , v ) F from equation 11 done by ( h , v ) subsets. The subsampleunder consideration here is data-poor models . The unit of the x-axis are improvements in OOS R over the basismodel. Variables are INDPRO, UNRATE, SPREAD, INF and HOUST. Within a speciﬁc color block, the horizonincreases from h = h =

24 as we are going down. SEs are HAC. These are the 95% conﬁdence bands.

Figure 12:

This ﬁgure plots the distribution of ˙ α ( h , v ) F from equation 11 done by ( h , v ) subsets. The subsampleunder consideration here is data-rich models . The unit of the x-axis are improvements in OOS R over the basismodel. Variables are INDPRO, UNRATE, SPREAD, INF and HOUST. Within a speciﬁc color block, the horizonincreases from h = h =

24 as we are going down. SEs are HAC. These are the 95% conﬁdence bands.

This ﬁgure plots the distribution of ˙ α ( h , v ) F from equation 11 done by ( h , v ) subsets. The subsampleunder consideration here are recessions . The unit of the x-axis are improvements in OOS R over the basismodel. Variables are INDPRO, UNRATE, SPREAD, INF and HOUST. Within a speciﬁc color block, the horizonincreases from h = h =

24 as we are going down. SEs are HAC. These are the 95% conﬁdence bands.

Figure 14:

This ﬁgure plots the distribution of ˙ α ( h , v ) F from equation 11 done by ( h , v ) subsets. The subsampleunder consideration here are expansions . The unit of the x-axis are improvements in OOS R over the basismodel. Variables are INDPRO, UNRATE, SPREAD, INF and HOUST. Within a speciﬁc color block, the horizonincreases from h = h =

24 as we are going down. SEs are HAC. These are the 95% conﬁdence bands.

C Additional Results

This ﬁgure plots the distribution of ˙ α ( h , v ) F from equation 11 done by ( h , v ) subsets. The subsampleunder consideration here are the last 20 years . The unit of the x-axis are improvements in OOS R over the basismodel. Variables are INDPRO, UNRATE, SPREAD, INF and HOUST. Within a speciﬁc color block, the horizonincreases from h = h =

24 as we are going down. SEs are HAC. These are the 95% conﬁdence bands. −3−2−1012 1980 1990 2000 2010

HOUSPRICE −101234 1980 1990 2000 2010

ANFCI −101234 1980 1990 2000 2010

MACROUNCERT −2−1012 1980 1990 2000 2010

UMCSENT −202 1980 1990 2000 2010

PMI −1012 1980 1990 2000 2010

UNRATE −1012 1980 1990 2000 2010

GS1 −10123 1980 1990 2000 2010

PCEPI

Figure 16:

This ﬁgure plots time series of variables explaining the heterogeneity of NL treatment effects insection 6. I ND P R O UNR A T ESP R EA D I N F H O U S T −2−10123−2−1012−2−1012−202−202 AR,K−foldRFAR,K−fold KRR,AR,K−foldARDI,K−fold RFARDI,K−foldKRR,ARDI,K−fold

Figure 17:

This ﬁgure shows the 3-year rolling window root MSPE, the cumulative root MSPE and Giacominiand Rossi (2010) ﬂuctuation tests for linear and nonlinear data-poor and data-rich models, at 12-month horizon.

This graph display the marginal (un)improvements by variables and horizons to opt for the SVRin-sample loss function in comparing the data-poor and data-rich environments for linear models. The unit ofthe x-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. Figure 19:

This graph display the marginal (un)improvements by variables and horizons to opt for the SVRin-sample loss function in comparing the data-poor and data-rich environments for nonlinear models. The unitof the x-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdencebands. Nonlinearites Matter – A Robustness Check

In this appendix, we trade Random Forests for Boosted Trees and KRR for Neural Net-works. First, we brieﬂy introduce the newest addition to our nonlinear arsenal. Second, wedemonstrate that very similar conclusions to that of section 5.2.1 are reached using those.This further backs our claim that nonlinearities matter, whichever way they were obtained.

D.1 Data-Poor

Boosted Trees AR (BTAR) . This algorithm provides an alternative means of approximatingnonlinear functions by additively combining regression trees in a sequential fashion. Let η ∈ [

0, 1 ] be the learning rate and ˆ y ( n ) t + h and e ( n ) t + h : = y t − h − η ˆ y ( n ) t + h be the step n predicted valueand pseudo-residuals, respectively. Then, the step n + y ( n + ) t + h = y ( n ) t + h + ρ n + f ( Z t , c n + ) where ( c n + , ρ n + ) : = argmin ρ , c ∑ Tt = (cid:16) e ( n ) t + h − ρ n + f ( Z t , c n + ) (cid:17) and c n + : = ( c n + m ) Mm = arethe parameters of a regression tree. In other words, it recursively ﬁts trees on pseudo-residuals. The maximum depth of each tree is set to 10 and all features are considered ateach split. We select the number of steps and η ∈ [

0, 1 ] with Bayesian optimization. Weimpose p y = Neural Network AR (NNAR) . We opted for fully connected feed-forward neural networks.The value of the input vector [ Z it ] N i = is represented by a layer of input neurons, each takingon the value of a different element in the vector. Each neuron j of the ﬁrst hidden layer takeson a value h ( n ) jt which is determined by applying a potentially nonlinear transformation to aweighted sum of the input value. The same is true of each subsequent hidden layer until wehave reached the output layer which contains a single neuron whose value is the h period55head forecast of the model. Formally, our neural network models have the following form: h ( n ) jt =  f ( ) (cid:16) ∑ N i = w ( ) ji Z it + w ( ) j (cid:17) n = f ( n ) (cid:16) ∑ N k i = w ( n ) ji h ( n − ) it + w ( n ) j (cid:17) n > y t + h = N Nh ∑ i = w ( y ) i h ( N h ) jt + w ( y ) .We restrict our attention to two ﬁxed architectures: the ﬁrst one uses a single hidden layerof 32 neurons ( ( N h , N ) = (

1, 32 ) ) and the second one uses two hidden layers of 32 and 16neurons, respectively ( ( N h , N , N ) = (

2, 32, 16 ) ). In all cases, we use rectiﬁed linear units(ReLU) as the activation functions, i.e. f ( n ) ( z ) = max { z } , ∀ n =

1, ..., N h .The training is carried out by batch gradient descent using the Adam algorithm. This algo-rithm is initialized with a learning rate of 0.01 and we use an early stopping rule . And,in an effort to mitigate the effects of overﬁtting and the impact of random initialization ofweights, we train 5 neural networks with the same architecture and use their average outputas our prediction value. In essence, those neural networks are simpliﬁed versions of the neu-ral networks used in Gu et al. (2020b) where we got rid of the hyperparameter optimizationand use 5 base learners instead of 10. For this algorithm, the input is a set of p y =

12 laggedvalues of the target variable. We do not make use of cross-validation, but we do estimatemodel weights recursively.

D.2 Data-Rich

Boosted Trees ARDI (BTARDI) . We consider a vanilla Boosted Trees where the maximumdepth of each tree is set to 10 and all features are considered at each split. We select thenumber of steps and η ∈ [

0, 1 ] with Bayesian optimization. We impose p y = p f =

12 and If improvements in the performance metric doesn’t exceed a tolerance threshold for 5 consecutive epochs,we stop the training.

This ﬁgure compares the two alternative NL models averaged over all horizons. The unit of thex-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. K = Neural Network ARDI (NNARDI) . We opted for fully connected feed-forward neural net-work with the same architecture as the data-poor version, but we now use ( p y , p f , K ) =(

12, 10, 12 ) for the inputs. D.3 Results

In line with what reported in section 5.2.1, we ﬁnd that NL’s treatment effect is magniﬁedfor horizons 9, 12 and 24. Additionally, it is found that both algorithms give very homoge-neous improvements in the data-rich environment, another ﬁnding detailed in the main text.Results for the data-poor environment are more scattered, as they were before. Targets bene-ﬁting most from NL in the data-rich environment are INF and HOUST, which is analogous toearlier ﬁndings. However, it was found that the real activity targets beneﬁted more from NLin our main text conﬁguration, which is the sole noticeable difference with results reportedhere. 57igure 21:

This ﬁgure compares the two alternative NL models averaged over all variables. The unit of thex-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. ow is Machine Learning Useful forMacroeconomic Forecasting? ∗ SUPPLEMENTARY MATERIAL

Philippe Goulet Coulombe † Maxime Leroux Dalibor Stevanovic ‡ Stéphane Surprenant University of Pennsylvania Université du Québec à MontréalThis version: August 21, 2020

Abstract

This document contains supplementary material for the paper entitled

How Is Ma-chine Learning Useful for Macroeconomic Forecasting?

It contains the following appendices:results for absolute loss; results with quarterly US data; results with monthly Canadiandata; description of CV techniques and technical details on forecasting models.

JEL Classiﬁcation: C53, C55, E37Keywords: Machine Learning, Big Data, Forecasting. ∗ The third author acknowledges ﬁnancial support from the Fonds de recherche sur la société et la culture(Québec) and the Social Sciences and Humanities Research Council. † Corresponding Author: [email protected]. Department of Economics, UPenn. ‡ Corresponding Author: [email protected]. Département des sciences économiques, UQAM. Results with Absolute Loss

In this section we present results for a different out-of-sample loss function that is oftenused in the literature: the absolute loss. Following Koenker and Machado (1999), we generatethe pseudo- R in order to perform regressions (11) and (12): R t , h , v , m ≡ − | e t , h , v , m | T ∑ Tt = | y v , t + h − ¯ y v , h | .Hence, the ﬁgure included in this section are exact replication of those included in the maintext except that the target variable of all the regressions has been changed.The main message here is that results obtained using the squared loss are very consistentwith what one would obtain using the absolute loss. The importance of each feature, ﬁgure22, and the way it behaves according to the variable/horizon pair is the same. Indeed, mostof the heterogeneity is variable speciﬁc while there are clear horizon patterns emerging whenwe average out variables. For instance, we clearly see by comparing ﬁgures 24 and 2 thatmore data and nonlinearities usefulness increase linearly in h . CV is ﬂat around the 0 line.Alternative shrinkage and loss function both are negative and follow a boomerang shape(they are not as bad for short and very long horizons, but quite bad in between).The pertinence of nonlinearities and the impertinence of alternative shrinkage follow verysimilar behavior to what is obtained in the main body of this paper. However, for nonlinear-ities, the data-poor advantages are not robust to the choice of MSPE vs MAPE. Fortunately,besides that, the ﬁgures are all very much alike.Results for the alternative in-sample loss function also seem to be independent of the pro-posed choices of out-of-sample loss function. Only for hyperparameters selection we do getslightly different results: CV-KF is now sometimes worse than BIC in a statistically signif-icant way. However, the negative effect is again much stronger for POOS CV. CV-KF stilloutperforms any other model selection criteria on recessions.2 o r . V a r . R e c . N L S H C V L F X Predictors P r ed i c t o r i m po r t an c e e s t i m a t e s Figure 22:

This ﬁgure presents predictive importance estimates. Random forest is trained to predict R t , h , v , m deﬁned in (11) and use out-of-bags observations to assess the performance of the model and compute features’importance. NL, SH, CV and LF stand for nonlinearity, shrinkage, cross-validation and loss function featuresrespectively. A dummy for H + t models, X , is included as well. Figure 23:

This compares the two NL models averaged over all horizons. The unit of the x-axis are improve-ments in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. This compares the two NL models averaged over all variables. The unit of the x-axis are improve-ments in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. Figure 27:

This compares models of section 3.2 averaged over all variables and horizons. The unit of the x-axisare improvements in OOS R over the basis model. The base models are ARDIs speciﬁed with POOS-CV andKF-CV respectively. SEs are HAC. These are the 95% conﬁdence bands. ∗ -0.762 ∗ -0.768 ∗∗∗ -0.700 -0.859 ∗∗∗ (0.375) (0.340) (0.181) (0.364) (0.193)AIC -0.396 -0.516 -0.275 -0.507 -0.522 ∗∗ (0.375) (0.340) (0.181) (0.364) (0.193)CV-KF * Recessions 1.609 1.264 ∗ (1.037) (0.552)CV-POOS * Recessions -0.506 0.747(1.037) (0.552)AIC * Recessions -0.0760 2.007 ∗∗∗ (1.037) (0.552)Observations 91200 45600 45600 45600 45600 Standard errors in parentheses ∗ p < ∗∗ p < ∗∗∗ p < Figure 28:

This compares the two CVs procedure averaged over all the models that use them. The unit of thex-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. This compares the two CVs procedure averaged over all the models that use them. The unit of thex-axis are improvements in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. Figure 30:

This graph display the marginal (un)improvements by variables and horizons to opt for the SVRin-sample loss function in both the data-poor and data-rich environments . The unit of the x-axis are improve-ments in OOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. This graph display the marginal (un)improvements by variables and horizons to opt for the SVRin-sample loss function in both recession and expansion periods . The unit of the x-axis are improvements inOOS R over the basis model. SEs are HAC. These are the 95% conﬁdence bands. Results with Quarterly Data

In this section we present results for quarterly frequency using the dataset FRED-QD, pub-licly available at the Federal Reserve of St-Louis’s web site. This is the quarterly companionto FRED-MD monthly dataset used in the main part of paper. It contains 248 US macroeco-nomic and ﬁnancial aggregates observed from 1960Q1 to 2018Q4. The series transformationsto induce stationarity are the same as in Stock and Watson (2012a). The variables of interestare: real GDP, real personal consumption expenditures (CONS), real gross private invest-ment (INV), real disposable personal income (INC) and the PCE deﬂator. All the targets areexpressed in average growth rate over h periods as in equation (4). Forecasting horizons are1, 2, 3, 4 and 8 quarters.The main message here is that results obtained using the quarterly data and predictingGDP components are consistent with those on monthly variables. Tables 10 - 14 summarizethe overall predictive ability in terms of RMPSE relative to the reference AR,BIC model. GDPand consumption growths are best predicted at short run by the standard Stock and Wat-son (2002a) ARDI,BIC model, while random forests dominate at longer horizons. Nonlinearmodels perform well for most horizons when predicting the disposable income growth. Fi-nally, kernel ridge regressions (both data-poor and data-rich) are the best options to predictthe PCE inﬂation.The ML features’ importance is plotted in ﬁgure 32. Contrary to monthly data, horizonsand variables ﬁxed effects are much less important which is somehow expected because ofrelative smoothness of quarterly data and similar targets (4 out 5 are real activity series).Among ML treatments, shrinkage is the most important, followed by loss function and non-linearity. As in the monthly application, CV is the least relevant, while the data-rich com-ponent remains very important. From ﬁgures 33 and 34, we see that: (i) the richness ofpredictors’ set is very helpful for most of the targets; (ii) nonlinearity treatment has positiveand signiﬁcant effects for investment, income and PCE deﬂator, while it is not signiﬁcant forGDP and CONS; (iii) the impertinence of alternative shrinkage follow very similar behaviorto what is obtained in the main body of this paper; (iv) CV has in general negative but smalland often insigniﬁcant effect; (v) SVR loss function decreases the predictive performance asin the monthly case, especially for income growth and inﬂation.9able 10: GDP: Relative Root MSPE Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=2 h=3 h=4 h=8 h=1 h=2 h=3 h=4 h=8Data-poor ( H − t ) modelsAR,BIC (RMSPE) ** ** 1.124* 1,085 1,021 1.089** 0,989KRR-AR,POOS-CV 1,049 1,044 1,011 1.065* * 1.051** * * * 1,018 * 1.157*** 1.032** 1.041** 0,986 1,002Data-rich ( H + t ) modelsARDI,BIC ** ** ** *** ** *** 0.714***ARDI,AIC * * * *** ** *** 0.687***ARDI,POOS-CV * ** ** *** 0.735***ARDI,K-fold * *** ***RRARDI,POOS-CV * * ** 0.740*** 0.770***RRARDI,K-fold * * * ** *** ***RFARDI,POOS-CV * * * 0,989 0.866* 0.810** 0.761*** 0.739***RFARDI,K-fold * * * ** 1,022 0.843** 0.813* 0.742*** 0.692***KRR-ARDI,POOS-CV 1,055 ( B , α = ˆ α ) ,POOS-CV 1.061* ( B , α = ˆ α ) ,K-fold 1,015 ( B , α = ) ,POOS-CV 1.076** 1.104* ( B , α = ) ,K-fold ( B , α = ) ,POOS-CV 1.082* ( B , α = ) ,K-fold 1.191** * 1,052 1.070* ( B , α = ˆ α ) ,POOS-CV 1,043 ( B , α = ˆ α ) ,K-fold *** 0.612*** ( B , α = ) ,POOS-CV 1.110** * ( B , α = ) ,K-fold 1,039 *** ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold ( B , α = ˆ α ) ,POOS-CV 1,047 1,055 1.049* 1,052 ( B , α = ˆ α ) ,K-fold 1,038 ( B , α = ) ,POOS-CV 1.055* 1.133** ( B , α = ) ,K-fold 1,045 ( B , α = ) ,POOS-CV 1.142** 1.153* ( B , α = ) ,K-fold 1.225* ** 0.871* 0.861**SVR-ARDI,RBF,K-fold Note: The numbers represent the relative. with respect to AR,BIC model. root MSPE. Models retained in model conﬁdence set are in bold. the minimumvalues are underlined. while ∗∗∗ . ∗∗ . ∗ stand for 1%. 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=2 h=3 h=4 h=8 h=1 h=2 h=3 h=4 h=8Data-poor ( H − t ) modelsAR,BIC (RMSPE) 0,0604 ** ** ** * 1 ** * 0,947 1,013 1.017* 1.015** 1,014RFAR,POOS-CV 0,989 * 1,101 1.234* 1.251*** 1.133*** 0,989SVR-AR,RBF,POOS-CV 1.081* ** 1.120*** 1.052** ** 1.083** 0.881* 1 1.054* 0.959** 1.109**Data-rich ( H + t ) modelsARDI,BIC * * ** ** ** 0.809***ARDI,AIC * 0.800* * 0.761***ARDI,POOS-CV 1,007 * ** ** 0.769***RRARDI,POOS-CV 1,009 ** 0,889 0,853 0.682***RFARDI,POOS-CV 0,976 * 0.840** *** 0.808***RFARDI,K-fold * ** ** ** 0.757***KRR-ARDI,POOS-CV 1.138** * 1.181** 1.141*** ( B , α = ˆ α ) ,POOS-CV 1.153*** 1.213*** 1.168** 1.107** ( B , α = ˆ α ) ,K-fold 1,069 1.193*** 1.186*** 1.120** 1.079* 1,103 1,155 1.212*** 1.151*** 0.901* ( B , α = ) ,POOS-CV 1.118** 1.215*** 1.184** 1.153*** ( B , α = ) ,K-fold 1,056 1.166*** 1.122** 1.079** ( B , α = ) ,POOS-CV 1.158*** 1.281*** 1.300*** 1.171** 1.062** 1,119 1,163 1,172 1,049 1,012 ( B , α = ) ,K-fold 1.453*** ** 1.288* 1.103** ( B , α = ˆ α ) ,POOS-CV 1.092* * 1.140* 1.105* ( B , α = ˆ α ) ,K-fold 1,036 ** 1.167** 1,082 1,129 1.080** 1.139** 1,119 ** *** ( B , α = ) ,POOS-CV 1.158** * 1.194** 1.187*** ( B , α = ) ,K-fold 1,057 1.179*** 1.113* 1,072 1,153 1,107 1.263*** 1,056 0.872* 0.672*** ( B , α = ) ,POOS-CV 1.054* * 1.194** 1,049 ( B , α = ) ,K-fold 1.072* ( B , α = ˆ α ) ,POOS-CV 1,061 1.128** 1.165** 1,055 1.052** 1,05 1.164* 1.183** 1,027 1,003 ( B , α = ˆ α ) ,K-fold 1.128** ( B , α = ) ,POOS-CV 1.096* 1.174** 1.186** 1.138** 1.079*** 1,095 1.202* 1.192* 1,05 1,006 ( B , α = ) ,K-fold 1,065 ** 1.153** 1.188*** 1.129* 1,052 1,107 1,149 1,04 0.825** ( B , α = ) ,POOS-CV 1,063 * 1.118*** 1.168** ( B , α = ) ,K-fold 1.441** * 1.144*** 1.152* * 1.584** 1,085 1.122*** 1,104 0,986SVR-ARDI,Lin,POOS-CV 1,046 * 1,108 1,064 1.106* 0,989 1,119 1,069 1,004 1,007SVR-ARDI,Lin,K-fold 1,105 ** * *** Note: The numbers represent the relative. with respect to AR,BIC model. root MSPE. Models retained in model conﬁdence set are in bold. the minimumvalues are underlined. while ∗∗∗ . ∗∗ . ∗ stand for 1%. 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=2 h=3 h=4 h=8 h=1 h=2 h=3 h=4 h=8Data-poor ( H − t ) modelsAR,BIC (RMSPE) 0,4078 0,3385 0,2986 0,277 0,2036 0,7551 0,6866 0,5725 0,5482 0,3834AR,AIC 1.015* 1.011* 1.007* 1 0,996 1.023** 1.015** 1.010* 1 0,991AR,POOS-CV 0.995* 1,004 1.007** 1,004 1,007 1 1.008* 1.006** 1,008 1,03AR,K-fold 1,007 1,004 1,009 1 1,021 1,002 1.018** 1.024*** 1,017 1.040*RRAR,POOS-CV 1,004 1,001 1.013*** 1.007** 1,001 1,01 1,002 1.016*** 1.007* 1,006RRAR,K-fold 1.015** 1.013* 1.008* 1 1,002 1.026*** 1,012 1.016*** 1.013*** 0,998RFAR,POOS-CV 1,055 1,013 0,979 0,985 1,046 1,024 0.905** *** 0,978 1,022RFAR,K-fold 1,036 1,016 1,019 1 0,977 0,992 0,942 1,007 0,934 0,957KRR-AR,POOS-CV 1,036 1 * 1 0.969** 1,022 0,987 0,975 1,015 0.965***SVR-AR,Lin,POOS-CV 1,033 1.097** 1.096*** 1.050* 1.116** 1,035 1,061 1.041** 1 0,98SVR-AR,Lin,K-fold 1.033* 1.033* 1.026** 1.016* 1,019 1.063** 1,021 1.028* 0,998 1,004SVR-AR,RBF,POOS-CV 1.038*** 1,13 1.062*** 1.047** 1.094*** 1.050** 1,145 1.069** 1.008** 1,006SVR-AR,RBF,K-fold 1,03 1,026 1.039** 1,01 0,986 1.066* 1,018 1.040** 0,994 0,995Data-rich ( H + t ) modelsARDI,BIC *** ** * ** * *** *** *** 0,949ARDI,AIC *** * * 0,948 *** 0,847 0,936 ** 0.849**ARDI,POOS-CV *** ** *** ** *** 0,924ARDI,K-fold *** *** 0,837 0,993 ** 0.811***RRARDI,POOS-CV *** ** *** 0,831 *** * *** 0.810* ** ** ** ** 0.917* 0,898 *** 0.750***RFARDI,K-fold 0,951 * ** ** ** 0,966 0,92 0,922 0.830** 0.735***KRR-ARDI,POOS-CV 0,989 ( B , α = ˆ α ) ,POOS-CV 1,036 0,976 1,014 1,007 0,939 0.884** 0.916*** 1,006 0.925* 0,965 ( B , α = ˆ α ) ,K-fold 1,046 0,967 ( B , α = ) ,POOS-CV 1,023 0,991 0,989 0,941 0,966 0.889* 0,954 0,974 0.902* 0,973 ( B , α = ) ,K-fold 0,953 * * ** 1,018 0.905* 0.941** 0,959 0.899*** 0,953 ( B , α = ) ,POOS-CV 1,019 0,997 1.110** 1,045 1,013 0,973 0,997 1.078*** 1.071* 1,008 ( B , α = ) ,K-fold 1.117** 0,98 ( B , α = ˆ α ) ,POOS-CV 0,996 0,973 1,01 1,016 ( B , α = ˆ α ) ,K-fold 0,974 0,975 0,958 1,005 0,956 1,026 0,965 0,94 0.886** 0.662** ( B , α = ) ,POOS-CV 0,988 0,961 1,076 1,069 1,003 1,008 0.959* 1.150** 1,067 0.874*** ( B , α = ) ,K-fold 0,974 0,965 0,967 1,014 ** 0,997 0,973 0,975 * *** ( B , α = ) ,POOS-CV 1,033 0,975 1,048 1,057 * 1,056 0,991 1,102 1,031 0.871** ( B , α = ) ,K-fold 1,023 0,923 0,966 0,996 0,966 1,025 0.892** 0,993 0,946 0,894 ( B , α = ˆ α ) ,POOS-CV 0,961 0,982 1,006 0,988 ** 0.901* 0,991 1,058 0,996 0.929*** ( B , α = ˆ α ) ,K-fold 0.948* 0,976 ** 0,941 0.928* 0.967* ( B , α = ) ,POOS-CV 0,946 0,985 ( B , α = ) ,K-fold 0,956 0,966 ** ** 0,954 0,937 0,973 ** 0.881*** 0.880*** ( B , α = ) ,POOS-CV 1.110* 1.036* 1,027 1,027 1 1,011 0,97 1,004 1,011 1,001 ( B , α = ) ,K-fold 1,151 0,989 0,982 1,136 1,023 0,99 0,965 0,974 1,089 0,968SVR-ARDI,Lin,POOS-CV 0,975 0,995 1,077 1,013 1,013 1,042 0,974 1,086 0,986 0,938SVR-ARDI,Lin,K-fold *** ** *** *** * 0,975 0,964SVR-ARDI,RBF,POOS-CV *** *** 0,856 *** * ** Note: The numbers represent the relative. with respect to AR,BIC model. root MSPE. Models retained in model conﬁdence set are in bold. the minimumvalues are underlined. while ∗∗∗ . ∗∗ . ∗ stand for 1%. 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=2 h=3 h=4 h=8 h=1 h=2 h=3 h=4 h=8Data-poor ( H − t ) modelsAR,BIC (RMSPE) * ** 0.980** 0.992* 0,998 0,993AR,K-fold ** 0.963** 0.969** 1 0,999RRAR,POOS-CV ** 0.976** 0,994 1.006*** 0.991**RRAR,K-fold * 1 ** 0.972*** 1 1.008*** 0.999**RFAR,POOS-CV ** 0.905** 0,959 0,979 0.908*KRR,AR,K-fold *** 0,976 0,996 1,015 1.016** 0.965***SVR-AR,RBF,POOS-CV ** 0,959 0,973 1,01 0.928***SVR-AR,RBF,K-fold 1.012* H + t ) modelsARDI,BIC ** ** *** *** 0.769***ARDI,AIC * *** ** 0,886 0.721***ARDI,POOS-CV ** *** *** 0.770**ARDI,K-fold 1,065 ** 0.796** 0,898 0.689***RRARDI,POOS-CV * 0,869 *** 0.743***RRARDI,K-fold 1,06 ***RFARDI,POOS-CV * ** * * * *** 0.822** *** 0.678***RFARDI,K-fold ** ** * ***KRR-ARDI,POOS-CV 1,026 1.069*** 1,025 1.090* 0,985 ( B , α = ˆ α ) ,POOS-CV *** 0,993 1,018 1,034 0.922* ( B , α = ˆ α ) ,K-fold ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold 1,017 ( B , α = ) ,POOS-CV 1.030* 1,034 1.050** 1.075*** 1,014 *** 1,021 1,034 1,031 1.120* ( B , α = ) ,K-fold 1.023* ( B , α = ˆ α ) ,POOS-CV ( B , α = ˆ α ) ,K-fold ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold 1.080* ** ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold 1,028 ( B , α = ˆ α ) ,POOS-CV ( B , α = ˆ α ) ,K-fold ( B , α = ) ,POOS-CV ( B , α = ) ,K-fold ( B , α = ) ,POOS-CV ** 0,954 0,987 1,145 0.959** ( B , α = ) ,K-fold ** 0.906** 0,991 1,134 1,001SVR-ARDI,Lin,POOS-CV 1,06 1,081 ** 0,988SVR-ARDI,RBF,POOS-CV 1.147** 1,097 Note: The numbers represent the relative. with respect to AR,BIC model. root MSPE. Models retained in model conﬁdence set are in bold. the minimumvalues are underlined. while ∗∗∗ . ∗∗ . ∗ stand for 1%. 5% and 10% signiﬁcance of Diebold-Mariano test. Full Out-of-Sample NBER Recessions PeriodsModels h=1 h=2 h=3 h=4 h=8 h=1 h=2 h=3 h=4 h=8Data-poor ( H − t ) modelsAR,BIC (RMSPE) ** * 1.033* 1,018 0,997 1 0.976*AR,POOS-CV ** * ** ** 1,025 0,976 0,998 0.984** 0.974*AR,K-fold ** ** ** ** * ** 1.019* 0,965 0,968 0,981 0.938***RRAR,K-fold *** * * 1,025 1,005 0.994** 0,993 0.955**RFAR,POOS-CV ** * ** * * 1,025 ** * * 1,178 * 0,817 ** 0.775**KRR,AR,K-fold * * 1,009 * ** 0.783**SVR-AR,Lin,POOS-CV ** 1.040** 1,059 1.222** 1.189** 1.019** 0,992 0.931*** 1,032 0,865Data-rich ( H + t ) modelsARDI,BIC ** 0.731**ARDI,AIC * 0.686**ARDI,K-fold ** **RRARDI,K-fold ** * ** * * 0.835* 0,902KRR,ARDI,K-fold * * ** 1,045 ** 0.668** ( B , α = ˆ α ) ,POOS-CV ** 1.200*** 1.195** 1.310*** 1.267** * 1,005 0.833** ( B , α = ˆ α ) ,K-fold 1.123** 1.221*** 1.187* 1.316*** 1.179* 1,029 ** 0,905 0.766*** ( B , α = ) ,POOS-CV 1.251*** 1.276*** 1.208** 1.221** 1.403*** 1,137 1,01 ( B , α = ) ,K-fold 1.368*** 1.340*** 1.412*** 1.409*** 1.270** 1.280** 0,91 0,957 0,903 0.726** ( B , α = ) ,POOS-CV 1.488** 1.562** 1.269* 1.396** 1.431*** 1.153* 0,961 0,979 ( B , α = ) ,K-fold 1.540** 1.493** 1.489** 1.429** 1.317** 1.125* * ( B , α = ˆ α ) ,POOS-CV 1.131*** 1.249** 1.152** 1.193** 1,111 1,051 1,268 0.903* 0.843** ** ( B , α = ˆ α ) ,K-fold 1.111** 1,266 1.103* * 1,079 1,115 1,387 0,925 * 0,749 ( B , α = ) ,POOS-CV ** 1.078** 1.095* 1.233** 1.259** 1,026 0,974 0.912** 0,884 ** ( B , α = ) ,K-fold 1.078* 1,315 1.098* * 1,172 1,11 1,449 0,933 ** 0.679* ( B , α = ) ,POOS-CV 1.316** 1.332** 1.418*** 1.393*** 1.169* 1,373 1.345* 1,298 0,948 *** ( B , α = ) ,K-fold 1.358** 1.291** 1.388** 1.313** 1,13 1,487 1,263 1,339 1,016 *** ( B , α = ˆ α ) ,POOS-CV * 1,009 1.063* 1.092** 1,102 1,016 0.945* 0,972 0.885* 0.854** ( B , α = ˆ α ) ,K-fold ( B , α = ) ,POOS-CV * 0.953* 0,993 0,923 0.824** ( B , α = ) ,K-fold ** 1,007 0,997 0,947 0.907* ( B , α = ) ,POOS-CV 1.084** ( B , α = ) ,K-fold * 1.198* 1,12 1.133* 1.127* 1.085* 1,149 0,979 0,948 0,923SVR-ARDI,Lin,POOS-CV * 1.271*** 1.292*** 1.228** 1.220** *** 0.798**SVR-ARDI,RBF,K-fold Note: The numbers represent the relative. with respect to AR,BIC model. root MSPE. Models retained in model conﬁdence set are in bold. the minimumvalues are underlined. while ∗∗∗ . ∗∗ . ∗ stand for 1%. 5% and 10% signiﬁcance of Diebold-Mariano test. o r . V a r . R e c . N L S H C V L F X Predictors P r ed i c t o r i m po r t an c e e s t i m a t e s Figure 32:

This compares models of section 3.2 averaged over all variables and horizons. The unit of the x-axisare improvements in OOS R over the basis model. The base models are ARDIs speciﬁed with POOS-CV andKF-CV respectively. SEs are HAC. These are the 95% conﬁdence bands. ∗ -8.304 ∗∗∗ -0.192 -9.651 ∗∗∗ ∗∗ ∗∗ -6.772 ∗∗ ∗ (1.887) (2.163) (0.382) (2.270) (0.386)AIC -2.182 -4.722 ∗∗ ∗∗ ∗∗ -2.956 ∗ (4.893) (1.500)CV-POOS * Recessions 1.002 0.683(5.345) (1.125)AIC * Recessions 8.421 -0.127(4.643) (1.101)Observations 36960 18480 18480 18360 18360 Standard errors in parentheses ∗ p < ∗∗ p < ∗∗∗ p < Figure 38:

In this section we present results obtained with Canadian data from Fortin-Gagnon et al.(2020). It is a monthly dataset of 139 macroeconomic and ﬁnancial variables, with categoriessimilar to those from McCracken and Ng (2016), except that it contains much more interna-tional trade indicators to take into account the openness of Canadian economy. Data startson 1981M01 and ends on 2017M12. The out-of-sample starts on 2000M01. The variables ofinterest are the same as in US application: industrial growth, unemployment rate change,term spread, CPI inﬂation and housing starts growth. Forecasting horizons are 1, 3, 9, 12and 24 months. We do not compute results for recession periods separately since Canada hasexperienced only one downturn in the evaluation period.The results with Canadian data are overall similar to those in the paper. The main differ-ence is a smaller NL treatment effect. That can be potentially explained through lenses of theanalysis in section 6. The pseudo-out-of-sample covers 2000-2017 period during which Cana-dian ﬁnancial system did not experience a dramatic nonﬁnancial cycle as in the US., and thehousing bubble did not burst. The main reason for this discrepancy being more concentratedand strictly regulated (since 80’s) Canadian ﬁnancial system (Bordo et al., 2015). Hence, thenonlinearities associated to ﬁnancial frictions found in the US case were probably less impor-tant and nonlinear methods did not have a signiﬁcant effect on predicting real activity serieson average. However, NL treatment is very important for inﬂation and housing. Shrinkage isstill not a good idea for industrial production and unemployment rate, but can be very help-ful other variables at some speciﬁc horizons. Cross-validation does not have a big impact andthe SVR loss function is still harmful. 21 o r . V a r . R e c . N L S H C V L F X Predictors P r ed i c t o r i m po r t an c e e s t i m a t e s Figure 42:

All of our models involve some kind of hyperparameter selection prior to estimation. Tocurb the overﬁtting problem, we use two distinct methods that we refer to loosely as cross-validation methods. To make it feasible, we optimize hyperparameters every 24 months asthe expanding window grows our in-sample set. The resulting optimization points are thesame across all models, variables and horizons considered. In all other periods, hyperparam-eter values are frozen to the previous values and models are estimated using the expandedin-sample set to generate forecasts.

POOS K folds

Figure 44: Illustration of cross-validation methods

Notes: Figures are drawn for 3 months forecasting horizon and depict the splits performed in the in-sample set.The pseudo-out-of-sample observation to be forecasted here is shown in black.

The ﬁrst cross-validation method we consider mimics in-sample the pseudo-out-of-samplecomparison we perform across models. For each set of hyperparameters considered, we keepthe last 25% of the in-sample set as a comparison window. Models are estimated every 12months, but the training set is gradually expanded to keep the forecasting horizon intact.This exercise is thus repeated 5 times. Figure 44 shows a toy example with smaller jumps,a smaller comparison window and a forecasting horizon of 3 months, hence the gaps. Oncehyperparameters have been selected, the model is estimated using the whole in-sample setand used to make a forecast in the pseudo-out-of-sample window that we use to compare allmodels (the black dot in the ﬁgure). This approach is a compromise between two methodsused to evaluate time series models detailed in Tashman (2000), rolling-origin recalibration23nd rolling-origin updating. For a simulation study of various cross-validation methods ina time series context, including the rolling-origin recalibration method, the reader is referredto Bergmeir and Benítez (2012). We stress again that the compromise is made to bring downcomputation time.The second cross-validation method, K-fold cross-validation, is based on a re-samplingscheme (Bergmeir et al., 2018). We chose to use 5 folds, meaning the in-sample set is ran-domly split into ﬁve disjoint subsets, each accounting on average for 20 % of the in-sampleobservations. For each one of the 5 subsets and each set of hyperparameters considered,4 subsets are used for estimation and the remaining corresponding observations of the in-sample set used as a test subset to generate forecasting errors. This is illustrated in ﬁgure 44where each subset is illustrated by red dots on different arrows.Note that the average mean squared error in the test subset is used as the performancemetric for both cross-validation methods to perform hyperparameter selection.

E Forecasting models in detail

E.1 Data-poor ( H − t ) models In this section we describe forecasting models that contain only lagged values of the de-pendent variable, and hence use a small amount of predictors, H − t . Autoregressive Direct (AR)

The ﬁrst univariate model is the so-called autoregressive direct (AR) model, which is speciﬁed as: y ( h ) t + h = c + ρ ( L ) y t + e t + h , t =

1, . . . , T ,where h ≥ p y , the orderof the lag polynomial ρ ( L ) . The optimal p is selected in four ways: (i) Bayesian InformationCriterion (AR,BIC); (ii) Akaike Information Criterion (AR,AIC); (iii) Pseudo-out-of-samplecross validation (AR,POOS-CV); and (iv) K-fold cross validation (AR,K-fold). The lag orderis selected from the following subset p y ∈ {

1, 3, 6, 12 } . Hence, this model enters the followingcategories: linear g function, no regularization, in-sample and cross-validation selection ofhyperparameters and quadratic loss function. Ridge Regression AR (RRAR)

The second speciﬁcation is a penalized version of the pre-vious AR model that allows potentially more lagged predictors by using Ridge regression.The model is written as in (6), and the parameters are estimated using Ridge penalty. TheRidge hyperparameter is selected with two cross validation strategies, which gives two mod- In both cases, the last observation (the origin of the forecast) of the training set is rolled forward. However,in the ﬁrst case, hyperparameters are recalibrated and, in the second, only the information set is updated. p y ∈ {

1, 3, 6, 12 } and for each of these value we choose the Ridge hyperparameter. This modelcreates variation on following axes: linear g , Ridge regularization, cross-validation for tuningparameters and quadratic loss function. Random Forests AR (RFAR)

A popular way to introduce nonlinearities in the predictivefunction g is to use a tree method that splits the predictors space in a collection of dummyvariables and their interactions. Since a standard tree regression is prompt to the overﬁt, weuse instead the random forest approach described in Section 3.1.2. We adopt the default valuein the literature of one third for ’mtry’, the share of randomly selected predictors that arecandidates for splits in each tree. Observations in each set are sampled with replacement toget as many observations in the trees as in the full sample. The number of lags of y t , is chosenfrom the subset p y ∈ {

1, 3, 6, 12 } with cross-validation while the number of trees is selectedinternally with out-of-bag observations. This model generates nonlinear approximation ofthe optimal forecast, without regularization, using both CV techniques with the quadraticloss function: RFAR,K-fold and RFAR,POOS-CV. Kernel Ridge Regression AR (KRRAR)

This speciﬁcation adds a nonlinear approximationof the function g by using the Kernel trick as in Section 3.1.1. The model is written as in (13)and (14) but with the autoregressive part only y t + h = c + g ( Z t ) + ε t + h , Z t = (cid:104) { y t − } p y j = (cid:105) ,and the forecast is obtained using the equation (16). The hyperparameters of Ridge andof its kernel are selected by two cross-validation procedures, which gives two forecastingspeciﬁcations: (i) KRRAR,POOS-CV, (ii) KRRAR,K-fold. Z t consists of y t and its p y lags, p y ∈ {

1, 3, 6, 12 } . This model is representative of a nonlinear g function, Ridge regularization,cross-validation to select τ and quadratic ˆ L . Support Vector Regression AR (SVR-AR)

We use the SVR model to create variation alongthe loss function dimension. In the data-poor version the predictors set Z t contains y t and anumber of lags chosen from p y ∈ {

1, 3, 6, 12 } . The hyperparameters are selected with bothcross-validation techniques, and we consider 2 kernels to approximate basis functions, linearand RBF. Hence, there are 4 versions: (i) SVR-AR,Lin,POOS-CV, (ii) SVR-AR,Lin,K-fold, (iii)SVR-AR,RBF,POOS-CV and (iv) SVR-AR,RBF,K-fold. The forecasts are generated using (19). E.2 Data-rich ( H + t ) models We now describe forecasting models that use a large dataset of predictors, including theautoregressive components, H + t . 25 iffusion Indices (ARDI) The reference model in the case of large predictor set is the au-toregression augmented with diffusion indices from Stock and Watson (2002b): y ( h ) t + h = c + ρ ( L ) y t + β ( L ) F t + e t + h , t =

1, . . . , T (20) X t = Λ F t + u t (21)where F t are K consecutive static factors, and ρ ( L ) and β ( L ) are lag polynomials of orders p y and p f respectively. The feasible procedure requires an estimate of F t that is usually doneby PCA.The optimal values of hyperparamters p , K and m are selected in four ways: (i)Bayesian Information Criterion (ARDI,BIC); (ii) Akaike Information Criterion (ARDI,AIC);(iii) Pseudo-out-of-sample cross validation (ARDI,POOS-CV); and (iv) K-fold cross validation(ARDI,K-fold). These are selected from following subsets: p y ∈ {

1, 3, 6, 12 } , K ∈ {

3, 6, 10 } , p f ∈ {

1, 3, 6, 12 } . Hence, this model following features: linear g function, PCA regularization,in-sample and cross-validation selection of hyperparameters and L . Ridge Regression Diffusion Indices (RRARDI)

As for the small data case, we explore howa regularization affects the predictive performance of the reference model ARDI above. Thepredictive regression is written as in (7) and p y , p f and K are selected from the same subsetsof values as for the ARDI case above. The parameters are estimated using Ridge penalty. Allthe hyperparameters are selected with two cross validation strategies, giving two models:RRARDI,POOS-CV and RRARDI,K-fold. This model creates variation on following axes:linear g , Ridge regularization, CV for tuning parameters and L . Random Forest Diffusion Indices (RFARDI)

We also explore how nonlinearities affect thepredictive performance of the ARDI model. The model is as in (7) but a Random Forest ofregression trees is used. The ARDI hyperparameters are chosen from the grid as in the linearcase, while the number of trees is selected with out-of-bag observations. Both POOS andK-fold CV are used to generate two forecasting models: RFARDI,POOS-CV and RFARDI,K-fold. This model generates nonlinear treatment, with PCA regularization, using both CVtechniques with the quadratic loss function.

Kernel Ridge Regression Diffusion Indices (KRRARDI)

As for the autoregressive case,we can use the KT to generate nonlinear predictive functions g . The model is representedby equations (13) - (15) and the forecast is obtained using the equation (16). The hyper-parameters of Ridge and of its kernel, as well as p y , K and p f are selected by two cross-validation procedures, which gives two forecasting speciﬁcations: (i) KRRARDI,POOS-CV,(ii) KRRARDI,K-fold. We use the same grid as in ARDI case for discrete hyperparameters.This model is representative of a nonlinear g function, Ridge regularization with PCA, cross-validation to select τ and quadratic ˆ L . 26 upport Vector Regression ARDI (SVR-ARDI) We use four versions of the SVR model: (i)SVR-ARDI,Lin,POOS-CV, (ii) SVR-ARDI,Lin,K-fold, (iii) SVR-ARDI,RBF,POOS-CV and (iv)SVR-ARDI,RBF,K-fold. The SVR hyperparameters are chosen by cross-validation and theARDI hyperparameters are chosen using a grid that search in the same subsets as the ARDImodel. The forecasts are generated from equation (19). This model creates variations in allcategories: nonlinear g , PCA regularization, CV and ¯ (cid:101) -insensitive loss function. E.2.1 Generating shrinkage schemes

The rest of the forecasting models relies on using different B operators to generate varia-tions across shrinkage schemes, as depicted in section 3.2. B : taking all observables H + t When B is identity mapping, we consider Z t = H + t in theElastic Net problem (18), where H + t is deﬁned by (5). The following lag structures for y t and X t are considered, p y ∈ {

1, 3, 6, 12 } p f ∈ {

1, 3, 6, 12 } , and the exact number is cross-validated.The hyperparameter λ is always selected by two cross validation procedures, while we con-sider three cases for α : ˆ α , α = α =

0, which correspond to EN, Ridge and Lassospeciﬁcations respectively. In case of EN, α is also cross-validated. This gives six combina-tions: ( B , α = ˆ α ),POOS-CV; ( B , α = ˆ α ),K-fold; ( B , α = B , α = B , α = B , α = B : taking all principal components of X t Here B () rotates X t into N factors, F t , estimatedby principal components, which then constitute Z t to be used in (18). Same lag structuresand hyperparameters’ optimization from the B case are used to generate the following sixspeciﬁcations: ( B , α = ˆ α ),POOS-CV; ( B , α = ˆ α ),K-fold; ( B , α = B , α = B , α = B , α = B : taking all principal components of H + t Finally, B () rotates H + t by taking all principalcomponents, where H + t lag structure is to be selected as in the B case. Same variations andhyperparameters’ selection are used to generate the following six speciﬁcations: ( B , α = ˆ α ),POOS-CV; ( B , α = ˆ α ),K-fold; ( B , α = B , α = B , α = B , α ==