Machine Learning Advances for Time Series Forecasting
Ricardo P. Masini
São Paulo School of Economics, Getulio Vargas Foundation. E-mail: [email protected]
Marcelo C. Medeiros
Department of Economics, Pontifical Catholic University of Rio de Janeiro. E-mail: [email protected]
Eduardo F. Mendes
School of Applied Mathematics, Getulio Vargas Foundation. E-mail: [email protected]
January 20, 2021
Abstract
In this paper we survey the most recent advances in supervised machine learning and high-dimensional models for time series forecasting. We consider both linear and nonlinear alternatives. Among the linear methods we pay special attention to penalized regressions and ensembles of models. The nonlinear methods considered in the paper include shallow and deep neural networks, in their feed-forward and recurrent versions, and tree-based methods, such as random forests and boosted trees. We also consider ensemble and hybrid models by combining ingredients from different alternatives. Tests for superior predictive ability are briefly reviewed. Finally, we discuss applications of machine learning in economics and finance and provide an illustration with high-frequency financial data.
JEL Codes: C22
Keywords: Machine learning, statistical learning theory, penalized regressions, regularization, sieve approximation, nonlinear models, neural networks, deep learning, regression trees, random forests, boosting, bagging, forecasting.
Acknowledgements: We are very grateful for the insightful comments made by two anonymous referees. The second author gratefully acknowledges the partial financial support from CNPq. We are also grateful to Francis X. Diebold, Daniel Borup, and Andrii Babii for helpful comments.

Introduction
This paper surveys the recent developments in Machine Learning (ML) methods for economic and financial time series forecasting. ML methods have become an important estimation, model selection and forecasting tool for applied researchers in Economics and Finance. With the availability of vast datasets in the era of
Big Data, producing reliable and robust forecasts is of great importance. However, what is Machine Learning? It is certainly a buzzword which has gained a lot of popularity during the last few years. There are a myriad of definitions in the literature and one of the most well established is from the artificial intelligence pioneer Arthur L. Samuel, who defines ML as the "field of study that gives computers the ability to learn without being explicitly programmed." (The original sentence is "Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort"; see Samuel (1959).) We prefer a less vague definition where ML is the combination of automated computer algorithms with powerful statistical methods to learn (discover) hidden patterns in rich datasets. In that sense,
Statistical Learning Theory gives the statistical foundation of ML. Therefore, this paper is about Statistical Learning developments and not ML in general, as we are going to focus on statistical models. (More recently, ML for causal inference has started to receive a lot of attention; however, this survey will not cover causal inference with ML methods.) ML methods can be divided into three major groups: supervised, unsupervised, and reinforcement learning. This survey is about supervised learning, where the task is to learn a function that maps an input (explanatory variables) to an output (dependent variable) based on data organized as input-output pairs. Regression models, for example, belong to this class. On the other hand, unsupervised learning is a class of ML methods that uncover undetected patterns in a data set with no pre-existing labels as, for example, cluster analysis or data compression algorithms. Finally, in reinforcement learning, an agent learns to perform certain actions in an environment which lead it to maximum reward. It does so by exploration and exploitation of the knowledge it learns by repeated trials of maximizing the reward. This is the core of several artificial intelligence game players (AlphaGo, for instance) as well as of sequential treatments, like bandit problems. The supervised ML methods presented here can be roughly divided into two groups. The first one includes linear models and is discussed in Section 2. We focus mainly on specifications estimated by regularization, also known as shrinkage. Such methods date back at least to Tikhonov (1943). In Statistics and Econometrics, regularized estimators gained attention after the seminal papers by Willard James and Charles Stein who popularized the bias-variance trade-off in statistical estimation (Stein, 1956; James and Stein, 1961). We start by considering the Ridge Regression estimator put forward by Hoerl and Kennard (1970). After that, we present the Least Absolute Shrinkage and Selection Operator (LASSO) estimator of Tibshirani (1996) and its many extensions. We also include a discussion of other penalties. Theoretical derivations and inference for dependent data are also reviewed. The second group of ML techniques focuses on nonlinear models. We cover this topic in Section 3 and start by presenting a unified framework based on sieve semiparametric approximation
as in Grenander (1981). We continue by analysing specific models as special cases of our general setup. More specifically, we cover feedforward neural networks, both in their shallow and deep versions, recurrent neural networks, and tree-based models such as random forests and boosted trees. Neural Networks (NN) are probably one of the most popular ML methods. The success is partly due to the, in our opinion, misguided analogy to the functioning of the human brain. Contrary to what has been boasted in the early literature, the empirical success of NN models comes from the mathematical fact that a linear combination of sufficiently many simple basis functions is able to approximate very complicated functions arbitrarily well in some specific choice of metric. Regression trees only achieved popularity after the development of algorithms to attenuate the instability of the estimated models. Algorithms like Random Forests and Boosted Trees are now in the toolbox of applied economists. In addition to the models mentioned above, we also include a survey on ensemble-based methods such as Bagging (Breiman, 1996) and the Complete Subset Regression (Elliott et al., 2013, 2015). Furthermore, we give a brief introduction to what we call "hybrid methods", where ideas from both linear and nonlinear models are combined to generate new ML forecasting methods. Before presenting an empirical illustration of the methods, we discuss tests of superior predictive ability in the context of ML methods.
A quick word on notation: an uppercase letter as in $X$ denotes a random quantity, as opposed to a lowercase letter $x$, which denotes a deterministic (non-random) quantity. Bold letters as in $\boldsymbol{X}$ and $\boldsymbol{x}$ are reserved for multivariate objects such as vectors and matrices. The symbol $\|\cdot\|_q$ for $q \geq 1$ denotes the $\ell_q$ norm of a vector. For a set $S$ we use $|S|$ to denote its cardinality. Given a sample with $T$ realizations of the random vector $(Y_t, \boldsymbol{Z}_t')'$, the goal is to predict $Y_{T+h}$ for horizons $h = 1, \ldots, H$. Throughout the paper, we consider the following assumption:

Assumption 1 (DGP). Let $\{(Y_t, \boldsymbol{Z}_t')'\}_{t=1}^{\infty}$ be a covariance-stationary stochastic process taking values on $\mathbb{R}^{d+1}$.

Therefore, we are excluding important non-stationary processes that usually appear in time-series applications. In particular, unit-root and some types of long-memory processes are excluded by Assumption 1. For (usually predetermined) integers $p \geq 1$ and $r \geq 0$, define the $n$-dimensional vector of predictors $\boldsymbol{X}_t := (Y_{t-1}, \ldots, Y_{t-p}, \boldsymbol{Z}_t', \ldots, \boldsymbol{Z}_{t-r}')'$, where $n = p + d(r+1)$, and consider the following direct forecasting model:
$$Y_{t+h} = f_h(\boldsymbol{X}_t) + U_{t+h}, \quad h = 1, \ldots, H, \quad t = 1, \ldots, T, \tag{1.1}$$
where $f_h : \mathbb{R}^n \to \mathbb{R}$ is an unknown (measurable) function and $U_{t+h} := Y_{t+h} - f_h(\boldsymbol{X}_t)$ is assumed to have zero mean and finite variance. (The zero mean condition can always be ensured by including an intercept in the model; also, requiring the variance of $f_h(\boldsymbol{X}_t)$ to be finite suffices for the finite variance of $U_{t+h}$.) $f_h$ could be the conditional expectation function, $f_h(\boldsymbol{x}) = E(Y_{t+h}|\boldsymbol{X}_t = \boldsymbol{x})$, or simply the best linear projection of $Y_{t+h}$ onto the space spanned by $\boldsymbol{X}_t$. Regardless of the model choice, our target becomes $f_h$, for $h = 1, \ldots, H$. As $f_h$ is unknown, it should be estimated from data. The target function $f_h$ can be a single model or an ensemble of different specifications, and it can also change substantially for each forecasting horizon. Given an estimate $\widehat{f}_h$ for $f_h$, the next step is to evaluate the forecasting method by estimating its prediction accuracy. Most measures of prediction accuracy derive from the random quantity $\Delta_h(\boldsymbol{X}_t) := |\widehat{f}_h(\boldsymbol{X}_t) - f_h(\boldsymbol{X}_t)|$. For instance, the term prediction consistency refers to estimators such that $\Delta_h(\boldsymbol{X}_t) \stackrel{p}{\longrightarrow} 0$ as $T \to \infty$, where the probability is taken to be unconditional; as opposed to its conditional counterpart, which is given by $\Delta_h(\boldsymbol{x}_t) \stackrel{p}{\longrightarrow} 0$,
where the probability law is conditional on $\boldsymbol{X}_t = \boldsymbol{x}_t$. Clearly, if the latter holds for (almost) every $\boldsymbol{x}_t$ then the former holds by the law of iterated expectations. Other measures of prediction accuracy can be derived from the $L_q$ norm induced by either the unconditional probability law, $E|\Delta_h(\boldsymbol{X}_t)|^q$, or the conditional one, $E(|\Delta_h(\boldsymbol{X}_t)|^q \,|\, \boldsymbol{X}_t = \boldsymbol{x}_t)$, for $q \geq 1$.
By far, the most used are the (conditional) mean absolute prediction error (MAPE) when $q = 1$ and the (conditional) mean squared prediction error (MSPE) when $q = 2$, or the (conditional) root mean squared prediction error (RMSPE), which is simply the square root of the MSPE.
Those measures of prediction accuracy based on the $L_q$ norms are stronger than prediction consistency, in the sense that convergence to zero, as the sample size increases, of any of those measures ($q \geq 1$) implies prediction consistency by Markov's inequality.
This approach stems from casting economic forecasting as a decision problem. Under the choice of a loss function, the goal is to select $f_h$ from a family of candidate models that minimises the expected predictive loss or risk. Given an estimate $\widehat{f}_h$ for $f_h$, the next step is to evaluate the forecasting method by estimating its risk. The most commonly used losses are the absolute error and squared error, corresponding to the $L_1$ and $L_2$ risk functions, respectively. See Granger and Machina (2006) for a detailed exposition of this topic, Elliott and Timmermann (2008) for a discussion of the role of the loss function in forecasting, and Elliott and Timmermann (2016) for a more recent review. Apart from this brief introduction, the paper is organized as follows. Section 2 reviews penalized linear regression models. Nonlinear ML models are discussed in Section 3. Ensemble and hybrid methods are presented in Section 4. Section 5 briefly discusses tests for superior predictive ability. An empirical application is presented in Section 6. Finally, we conclude and discuss some directions for future research in Section 7.
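As a concrete illustration of the accuracy measures just discussed, the short Python sketch below computes the sample MAPE, MSPE, and RMSPE of a sequence of $h$-step-ahead forecasts. The variable names and the toy numbers are illustrative choices, not part of the original text.

```python
import numpy as np

def prediction_errors(y_true, y_hat):
    """Sample estimates of the MAPE, MSPE and RMSPE of h-step forecasts."""
    e = np.asarray(y_true) - np.asarray(y_hat)   # forecast errors
    mape = np.mean(np.abs(e))                     # mean absolute prediction error (q = 1)
    mspe = np.mean(e ** 2)                        # mean squared prediction error  (q = 2)
    rmspe = np.sqrt(mspe)                         # root mean squared prediction error
    return mape, mspe, rmspe

# toy example: forecasts of 5 out-of-sample observations
mape, mspe, rmspe = prediction_errors([1.2, 0.7, -0.3, 0.9, 1.5],
                                      [1.0, 0.5,  0.0, 1.1, 1.3])
print(mape, mspe, rmspe)
```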
Penalized Linear Models

We consider the family of linear models where $f(\boldsymbol{x}) = \boldsymbol{\beta}'\boldsymbol{x}$ in (1.1) for a vector of unknown parameters $\boldsymbol{\beta} \in \mathbb{R}^n$. Notice that we drop the subscript $h$ for clarity. However, the model as well as the parameter $\boldsymbol{\beta}$ have to be understood for a particular value of the forecasting horizon $h$. These models contemplate a series of well-known specifications in time series analysis, such as predictive regressions, autoregressive models of order $p$, AR($p$), autoregressive models with exogenous variables, ARX($p$), and autoregressive models with dynamic lags, ADL($p,r$), among many others (Hamilton, 1994). In particular, (1.1) becomes
$$Y_{t+h} = \boldsymbol{\beta}'\boldsymbol{X}_t + U_{t+h}, \quad h = 1, \ldots, H, \quad t = 1, \ldots, T, \tag{2.1}$$
where, under squared loss, $\boldsymbol{\beta}$ is identified by the best linear projection of $Y_{t+h}$ onto $\boldsymbol{X}_t$, which is well defined whenever $\boldsymbol{\Sigma} := E(\boldsymbol{X}_t\boldsymbol{X}_t')$ is non-singular. In that case, $U_{t+h}$ is orthogonal to $\boldsymbol{X}_t$ by construction and this property is exploited to derive estimation procedures such as Ordinary Least Squares (OLS). However, when $n > T$ (and sometimes $n \gg T$) the OLS estimator is not unique as the sample counterpart of $\boldsymbol{\Sigma}$ is rank deficient. In fact, we can completely overfit whenever $n \geq T$. Penalized linear regression arises in the setting where the regression parameter is not uniquely defined. It is usually the case when $n$ is large, possibly larger than the number of observations $T$, and/or when covariates are highly correlated. The general idea is to restrict the solution of the OLS problem to a ball around the origin. It can be shown that, although biased, the restricted solution has smaller mean squared error when compared to the unrestricted OLS (Hastie et al., 2009, Ch. 3 and Ch. 6). In penalized regressions the estimator $\widehat{\boldsymbol{\beta}}$ for the unknown parameter vector $\boldsymbol{\beta}$ minimizes the Lagrangian form
$$Q(\boldsymbol{\beta}) = \sum_{t=1}^{T-h} (Y_{t+h} - \boldsymbol{\beta}'\boldsymbol{X}_t)^2 + p(\boldsymbol{\beta}) = \|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + p(\boldsymbol{\beta}), \tag{2.2}$$
where $\boldsymbol{Y} := (Y_{h+1}, \ldots, Y_T)'$, $\boldsymbol{X} := (\boldsymbol{X}_1, \ldots, \boldsymbol{X}_{T-h})'$, and $p(\boldsymbol{\beta}) := p(\boldsymbol{\beta}; \lambda, \boldsymbol{\gamma}, \boldsymbol{X}) \geq 0$ is a penalty function that depends on a tuning parameter $\lambda \geq 0$,
which controls the trade-off between the goodness of fit and the regularization term. If $\lambda = 0$, we have the classical unrestricted regression, since $p(\boldsymbol{\beta}; 0, \boldsymbol{\gamma}, \boldsymbol{X}) = 0$. The penalty function may also depend on a set of extra hyper-parameters $\boldsymbol{\gamma}$, as well as on the data $\boldsymbol{X}$. Naturally, the estimator $\widehat{\boldsymbol{\beta}}$ also depends on the choice of $\lambda$ and $\boldsymbol{\gamma}$. Different choices for the penalty function were considered in the literature of penalized regression.

Ridge Regression

The ridge regression was proposed by Hoerl and Kennard (1970) as a way to fight highly correlated regressors and stabilize the solution of the linear regression problem. The idea was to introduce a small bias but, in turn, reduce the variance of the estimator. The ridge regression is also known as a particular case of Tikhonov regularization (Tikhonov, 1943, 1963; Tikhonov and Arsenin, 1977), in which the scale matrix is diagonal with identical entries. The ridge regression corresponds to penalizing the regression by the squared $\ell_2$ norm of the parameter vector, i.e., the penalty in (2.2) is given by
$$p(\boldsymbol{\beta}) = \lambda \sum_{i=1}^{n} \beta_i^2 = \lambda \|\boldsymbol{\beta}\|_2^2.$$
Ridge regression has the advantage of having an easy-to-compute analytic solution, where the coefficients associated with the least relevant predictors are shrunk towards zero, but never reach exactly zero. Therefore, it cannot be used for selecting predictors, unless some truncation scheme is employed.
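To fix ideas, the sketch below builds the lag matrix of the direct forecasting model (2.1) for a univariate series and fits a ridge regression with scikit-learn. It is only a minimal illustration: the simulated series, the lag order, and the penalty level (`alpha`, playing the role of $\lambda$) are arbitrary choices and not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lagged_design(y, p, h):
    """Pairs (X_t, Y_{t+h}): X_t holds the p most recent values of y up to the forecast origin t."""
    T = len(y)
    X = np.column_stack([y[p - 1 - j: T - h - j] for j in range(p)])  # column j is the (j+1)-th lag
    target = y[p - 1 + h:]                                            # Y_{t+h} aligned with the rows of X
    return X, target

rng = np.random.default_rng(0)
y = 0.1 * rng.standard_normal(300).cumsum() + rng.standard_normal(300)  # toy series

X, target = lagged_design(y, p=12, h=1)
model = Ridge(alpha=1.0).fit(X, target)          # alpha plays the role of lambda
x_last = y[-1:-13:-1]                            # the 12 most recent observations, newest first
print(model.predict(x_last.reshape(1, -1)))      # one-step-ahead forecast
```

In practice, $\lambda$ would be chosen by an information criterion or by a cross-validation scheme adapted to time series, rather than fixed a priori as in this sketch.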
Least Absolute Shrinkage and Selection Operator (LASSO)
The LASSO was proposed by Tibshirani (1996) and Chen et al. (2001) as a method to regularize and perform variable selection at the same time. LASSO is one of the most popular regularization methods and it is widely applied in data-rich environments where the number of features $n$ is much larger than the number of observations. LASSO corresponds to penalizing the regression by the $\ell_1$ norm of the parameter vector, i.e., the penalty in (2.2) is given by
$$p(\boldsymbol{\beta}) = \lambda \sum_{i=1}^{n} |\beta_i| = \lambda \|\boldsymbol{\beta}\|_1.$$
The solution of the LASSO is efficiently calculated by coordinate descent algorithms (Hastie et al., 2015, Ch. 5). The $\ell_1$ penalty is the smallest convex $\ell_p$ penalty norm that yields sparse solutions. We say the solution is sparse if only a subset of $k < n$ coefficients are non-zero. In other words, only a subset of variables is selected by the method. Hence, LASSO is most useful when the total number of regressors $n \gg T$ and it is not feasible to test combinations of models. Despite its attractive properties, there are still limitations to the LASSO. A large number of alternative penalties have been proposed to keep its desired properties whilst overcoming its limitations.
Adaptive LASSO

The adaptive LASSO (adaLASSO) was proposed by H. Zou (2006) and aims to improve the LASSO regression by introducing a weight parameter coming from a first-step OLS regression. It also has sparse solutions and an efficient estimation algorithm, but enjoys the oracle property, meaning that it has the same asymptotic distribution as the OLS conditional on knowing the variables that should enter the model. The adaLASSO penalty consists in using a weighted $\ell_1$ penalty:
$$p(\boldsymbol{\beta}) = \lambda \sum_{i=1}^{n} \omega_i |\beta_i|,$$
where $\omega_i = |\beta_i^*|^{-1}$ and $\beta_i^*$ is the coefficient from the first-step estimation (any consistent estimator of $\beta_i$). AdaLASSO can deal with many more variables than observations. Using the LASSO as the first-step estimator can be regarded as the two-step implementation of the local linear approximation in Fan et al. (2014) with a zero initial estimate.
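The two-step nature of the adaLASSO is easy to reproduce with off-the-shelf LASSO routines by rescaling the regressors with the first-step weights. The sketch below is one such implementation; the small constant `eps`, which guards against zero first-step coefficients, and the toy data are our own illustrative choices rather than part of the original procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ada_lasso(X, y, lam, eps=1e-6):
    """Two-step adaptive LASSO: first-step LASSO weights, then a weighted l1 fit."""
    beta_first = Lasso(alpha=lam, max_iter=10_000).fit(X, y).coef_   # first-step estimate
    w = 1.0 / (np.abs(beta_first) + eps)                             # omega_i = 1/|beta*_i| (eps avoids division by zero)
    X_tilde = X / w                                                  # rescaled columns: a plain LASSO on X_tilde solves the weighted problem
    b = Lasso(alpha=lam, max_iter=10_000).fit(X_tilde, y).coef_
    return b / w                                                     # map the coefficients back to the original scale

# toy sparse regression: only the first 3 of 50 candidate predictors matter
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = X[:, 0] - 0.5 * X[:, 1] + 0.25 * X[:, 2] + 0.1 * rng.standard_normal(200)
beta_hat = ada_lasso(X, y, lam=0.05)
print(np.flatnonzero(beta_hat))   # indices of the selected (non-zero) coefficients
```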
Elastic net

The elastic net (ElNet) was proposed by Zou and Hastie (2005) as a way of combining the strengths of the LASSO and ridge regression. While the $\ell_1$ part of the method performs variable selection, the $\ell_2$ part stabilizes the solution. This conclusion is even more accentuated when correlations among predictors become high. As a consequence, there is a significant improvement in prediction accuracy over the LASSO (Zou and Zhang, 2009). The elastic-net penalty is a convex combination of the $\ell_1$ and $\ell_2$ penalties:
$$p(\boldsymbol{\beta}) = \lambda \left[ \alpha \sum_{i=1}^{n} \beta_i^2 + (1-\alpha) \sum_{i=1}^{n} |\beta_i| \right] = \lambda \left[ \alpha \|\boldsymbol{\beta}\|_2^2 + (1-\alpha)\|\boldsymbol{\beta}\|_1 \right],$$
where $\alpha \in [0,1]$.
Folded concave penalization

LASSO approaches became popular in sparse high-dimensional estimation problems largely due to their computational properties. Another very popular approach is the folded concave penalization of Fan and Li (2001). This approach covers a collection of penalty functions satisfying a set of properties. The penalties aim to penalize parameters close to zero more heavily than those that are further away, improving the performance of the method. In this way, the penalties are concave with respect to each $|\beta_i|$. One of the most popular formulations is the SCAD (smoothly clipped absolute deviation), whose penalty depends on $\lambda$ in a nonlinear way. (The oracle property was first described in Fan and Li (2001) in the context of non-concave penalized estimation.) We set the penalty in (2.2) as $p(\boldsymbol{\beta}) = \sum_{i=1}^{n} \tilde{p}(\beta_i, \lambda, \gamma)$, where
$$\tilde{p}(u, \lambda, \gamma) = \begin{cases} \lambda |u| & \text{if } |u| \leq \lambda, \\ \dfrac{2\gamma\lambda|u| - u^2 - \lambda^2}{2(\gamma - 1)} & \text{if } \lambda \leq |u| \leq \gamma\lambda, \\ \dfrac{\lambda^2(\gamma+1)}{2} & \text{if } |u| > \gamma\lambda, \end{cases}$$
for $\gamma > 2$ and $\lambda > 0$.
The SCAD penalty is identical to the LASSO penalty for small coefficients, but continuously relaxes the rate of penalization as the coefficient departs from zero. Unlike OLS or LASSO, we have to solve a non-convex optimization problem that may have multiple minima and is computationally more intensive than the LASSO. Nevertheless, Fan et al. (2014) showed how to calculate the oracle estimator using an iterative Local Linear Approximation algorithm.
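For concreteness, the SCAD penalty above can be coded in a few lines of Python; the grid of coefficient values and the tuning constants below are illustrative only.

```python
import numpy as np

def scad_penalty(u, lam, gamma):
    """Folded concave SCAD penalty p~(u, lambda, gamma), evaluated element-wise."""
    u = np.abs(np.asarray(u, dtype=float))
    small = lam * u                                                       # |u| <= lambda: LASSO-like region
    middle = (2 * gamma * lam * u - u**2 - lam**2) / (2 * (gamma - 1))    # lambda < |u| <= gamma*lambda
    flat = lam**2 * (gamma + 1) / 2                                       # |u| > gamma*lambda: constant penalty
    return np.where(u <= lam, small, np.where(u <= gamma * lam, middle, flat))

# compare the SCAD and LASSO penalties on a grid of coefficient values
grid = np.linspace(-3, 3, 7)
print(scad_penalty(grid, lam=0.5, gamma=3.7))   # gamma = 3.7 is the value suggested by Fan and Li (2001)
print(0.5 * np.abs(grid))                        # LASSO penalty for comparison
```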
Other Penalties
Regularization imposes a restriction on the solution space, possibly imposing sparsity. In a data-rich environment this is a desirable property, as it is likely that many regressors are not relevant to our prediction problem. The presentation above concentrates on the, possibly, most used penalties in time series forecasting. Nevertheless, there are many alternative penalties that can be used in regularized linear models. The group LASSO, proposed by Yuan and Lin (2006), penalizes the parameters in groups, combining the $\ell_1$ and $\ell_2$ norms. It is motivated by the problem of identifying "factors", denoted by groups of regressors as, for instance, in regressions with categorical variables that can assume many values. Let $G = \{g_1, \ldots, g_M\}$ denote a partition of $\{1, \ldots, n\}$ and $\boldsymbol{\beta}_{g_i} = [\beta_j : j \in g_i]$ the corresponding regression sub-vector. The group LASSO assigns to (2.2) the penalty $p(\boldsymbol{\beta}) = \lambda \sum_{i=1}^{M} \sqrt{|g_i|}\, \|\boldsymbol{\beta}_{g_i}\|_2$, where $|g_i|$ is the cardinality of the set $g_i$. The solution is efficiently estimated using, for instance, the group-wise majorization-descent algorithm (Yang and H. Zou, 2015). Naturally, the adaptive group LASSO was also proposed, aiming to improve some of the limitations present in the group LASSO algorithm (Wang and Leng, 2008). In the group LASSO, the groups as a whole enter or not the regression. The sparse group LASSO recovers sparse groups by combining the group LASSO penalty with the $\ell_1$ penalty on the parameter vector (Simon et al., 2013). Park and Sakaori (2013) modify the adaptive LASSO penalty to explicitly take into account lag information. Konzen and Ziegelmann (2016) propose a small change in the penalty and perform a large simulation study to assess the performance of this penalty in distinct settings. They observe that taking into account lag information improves model selection and forecasting performance when compared to the LASSO and adaLASSO. They apply their method to forecasting inflation and risk premium with satisfactory results. There is a Bayesian interpretation of the regularization methods presented here. The ridge regression can also be seen as a maximum a posteriori estimator of a Gaussian linear regression with independent, equivariant, Gaussian priors. The LASSO replaces the Gaussian prior by a Laplace prior (Park and Casella, 2008; Hans, 2009). These methods fall within the area of Bayesian shrinkage methods, which is a very large and active research area, and it is beyond the scope of this survey.

Theoretical Properties

In this section we give an overview of the theoretical properties of the penalized regression estimators previously discussed. Most results in high-dimensional time series estimation focus on model selection consistency, the oracle property and oracle bounds, for both the finite dimension ($n$ fixed, but possibly larger than $T$) and high dimension ($n$ increases with $T$, usually faster). More precisely, suppose there is a population parameter vector $\boldsymbol{\beta}_0$ that minimizes equation (2.1) over repeated samples. Suppose this parameter is sparse in the sense that only the components indexed by $S_0 \subset \{1, \ldots, n\}$ are non-null. Let $\widehat{S} := \{j : \widehat{\beta}_j \neq 0\}$. We say a method is model selection consistent if the index set of non-zero estimated components converges to $S_0$ in probability,
$$P(\widehat{S} = S_0) \to 1, \quad T \to \infty.$$
Consistency can also be stated in terms of how close the estimator is to the true parameter in a given norm. We say that the estimation method is $L_q$-consistent if for every $\epsilon > 0$,
$$P(\|\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\|_q > \epsilon) \to 0, \quad T \to \infty.$$
It is important to note that model selection consistency does not imply, nor is it implied by, $L_q$-consistency. As a matter of fact, one usually has to impose specific assumptions to achieve each of those modes of convergence. Model selection performance of a given estimation procedure can be further broken down in terms of how many relevant variables $j \in S_0$ are included in the model (screening), or how many irrelevant variables $j \notin S_0$ are excluded from the model. In terms of probabilities, these are given by $P(\widehat{S} \supseteq S_0) \to 1$ and $P(\widehat{S} \subseteq S_0) \to 1$ as $T \to \infty$, respectively. (A more precise treatment would separate sign consistency from model selection consistency. Sign consistency first appeared in Zhao and Yu (2006) and also verifies whether the signs of the estimated regression weights converge to the population ones.) We say a penalized estimator has the oracle property if its asymptotic distribution is the same as the unpenalized one considering only the $S_0$ regressors. Finally, oracle risk bounds are finite sample bounds on the estimation error of $\widehat{\boldsymbol{\beta}}$ that hold with high probability. These bounds require relatively strong conditions on the curvature of the objective function, which translate into a bound on the minimum restricted eigenvalue of the covariance matrix among predictors for linear models and a rate condition on $\lambda$ that involves the number of non-zero parameters, $|S_0|$. The LASSO was originally developed in a fixed design with independent and identically distributed (IID) errors, but it has been extended and adapted to a large set of models and dependence structures. Early asymptotic results were derived in the fixed $n$ framework; from those results, it is clear that the distribution of the parameters related to the irrelevant variables is non-Gaussian. To our knowledge, the first work extending the results to a dependent setting was Wang et al. (2007), where the error term was allowed to follow an autoregressive process. The authors show that the LASSO is model selection consistent, whereas a modified LASSO, similar to the adaLASSO, is both model selection consistent and has the oracle property. Nardi and Rinaldo (2011) show model selection consistency and prediction consistency for lag selection in autoregressive models. Chan and Chen (2011) show oracle properties and model selection consistency for lag selection in ARMA models. Yoon et al. (2013) derive model selection consistency and the asymptotic distribution of the LASSO, adaLASSO and SCAD for penalized regressions with autoregressive error terms. Sang and Sun (2015) study lag estimation of autoregressive processes with long memory innovations using general penalties and show model selection consistency and the asymptotic distribution for the LASSO and SCAD as particular cases. Kock (2016) shows model selection consistency and the oracle property of the adaLASSO for lag selection in stationary and integrated processes. All the results above hold for the case of a fixed number of regressors or a relatively high dimension, meaning that $n/T \to 0$. For the case where $n \to \infty$ at some rate faster than $T$, Medeiros and Mendes (2016, 2017) show model selection consistency and the oracle property for a large set of linear time series models with martingale difference, strong mixing, and non-Gaussian innovations. This includes predictive regressions, autoregressive models AR($p$), autoregressive models with exogenous variables ARX($p$), and autoregressive models with dynamic lags ADL($p,r$), with possibly conditionally heteroscedastic errors. Xie et al. (2017) show oracle bounds for fixed design regression with $\beta$-mixing errors.
Wu and Wu (2016) derive oracle bounds for the LASSO in regressions with fixed design and weakly dependent innovations, in the sense of Wu (2005), whereas Han and Tsay (2020) show model selection consistency for linear regression with random design and weak sparsity under serially dependent errors and covariates, within the same weak dependence framework. (Weak sparsity generalizes sparsity by supposing that coefficients are (very) small instead of exactly zero.) Xue and Taniguchi (2020) show model selection consistency and parameter consistency for a modified version of the LASSO in time series regressions with long memory innovations. Fan and Li (2001) show model selection consistency and the oracle property for folded concave penalty estimators in a fixed dimensional setting. Kim et al. (2008) showed that the SCAD also enjoys these properties in high dimensions. In time-series settings, Uematsu and Tanaka (2019) show oracle properties and model selection consistency in time series models with dependent regressors. Lederer et al. (2019) derived oracle prediction bounds for many penalized regression problems. The authors conclude that generic high-dimensional penalized estimators provide consistent prediction with any design matrix. Although the results are not directly focused on time series problems, they are general enough to hold in such settings. Babii et al. (2020c) proposed the sparse-group LASSO as an estimation technique when
VAR($p$) models, extending previous works. Melnyk and Banerjee (2016) extended these results to a large collection of penalties. Zhu (2020) derives oracle estimation bounds for folded concave penalties for Gaussian VAR($p$) models in high dimensions. More recently, researchers have departed from Gaussianity and correct model specification. Wong et al. (2020) derived finite-sample guarantees for the LASSO in a misspecified VAR model involving $\beta$-mixing processes with sub-Weibull marginal distributions. Masini et al. (2019) derive equation-wise error bounds for the LASSO estimator of weakly sparse VAR($p$) models in mixingale dependence settings, which include models with conditionally heteroscedastic innovations.

Although several papers have derived the asymptotic properties of penalized estimators as well as the oracle property, these results rely on the assumption that the true non-zero coefficients are large enough. This condition is known as the $\beta$-min restriction. Furthermore, model selection, such as the choice of the penalty parameter, has not been taken into account. Therefore, the true limit distribution, derived under uniform asymptotics and without the $\beta$-min restriction, can be very different from Gaussian, being even bimodal; see, for instance, Leeb and Pötscher (2005), Leeb and Pötscher (2008), and Belloni et al. (2014) for a detailed discussion. Inference after model selection is actually a very active area of research and a vast number of papers have recently appeared in the literature. van de Geer et al. (2014) proposed the desparsified LASSO in order to construct (asymptotically) valid confidence intervals for each $\beta_{j,0}$, by modifying the original LASSO estimate $\widehat{\boldsymbol{\beta}}$. Let $\boldsymbol{\Sigma}^*$ be an approximation to the inverse of $\boldsymbol{\Sigma} := E(\boldsymbol{X}_t\boldsymbol{X}_t')$; then the desparsified LASSO is defined as $\widetilde{\boldsymbol{\beta}} := \widehat{\boldsymbol{\beta}} + \boldsymbol{\Sigma}^*\boldsymbol{X}'(\boldsymbol{Y} - \boldsymbol{X}\widehat{\boldsymbol{\beta}})/T$. The addition of this extra term to the LASSO estimator results in an unbiased estimator that no longer estimates any coefficient exactly as zero. More importantly, asymptotic normality can be recovered in the sense that $\sqrt{T}(\widetilde{\beta}_i - \beta_{i,0})$ converges in distribution to a Gaussian distribution under appropriate regularity conditions. Not surprisingly, the most important condition is how well $\boldsymbol{\Sigma}^{-1}$ can be approximated by $\boldsymbol{\Sigma}^*$. In particular, the authors propose to run $n$ LASSO regressions of $X_i$ onto $\boldsymbol{X}_{-i} := (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$, for $1 \leq i \leq n$. The authors named this process nodewise regressions, and use those estimates to construct $\boldsymbol{\Sigma}^*$ (refer to Section 2.1.1 in van de Geer et al. (2014) for details). Belloni et al. (2014) put forward the double-selection method in the context of a linear model of the form $Y_t = \beta_1 X_t^{(1)} + \boldsymbol{\beta}_2'\boldsymbol{X}_t^{(2)} + U_t$, where the interest lies in the scalar parameter $\beta_1$ and $\boldsymbol{X}_t^{(2)}$ is a high-dimensional vector of control variables. The procedure consists in obtaining an estimate of the set of active (relevant) regressors, i.e., the ones associated with non-zero parameter estimates, in the high-dimensional auxiliary regressions of $Y_t$ on $\boldsymbol{X}_t^{(2)}$ and of $X_t^{(1)}$ on $\boldsymbol{X}_t^{(2)}$, given by $\widehat{S}_1$ and $\widehat{S}_2$, respectively. This can be obtained either by the LASSO or any other estimation procedure. Once the set $\widehat{S} := \widehat{S}_1 \cup \widehat{S}_2$ is identified, the (a priori) non-zero parameters can be estimated by a low-dimensional regression of $Y_t$ on $X_t^{(1)}$ and $\{X_{it}^{(2)} : i \in \widehat{S}\}$. The main result (Theorem 1 of Belloni et al.
(2014)) states conditions under which the estimator $\widehat{\beta}_1$ of the parameter of interest, properly studentized, is asymptotically normal. Therefore, uniformly valid asymptotic confidence intervals for $\beta_1$ can be constructed in the usual fashion. Similar to Taylor et al. (2014) and Lockhart et al. (2014), Lee et al. (2016) put forward a general approach to valid inference after model selection. The idea is to characterize the distribution of a post-selection estimator conditioned on the selection event. More specifically, the authors argue that the post-selection confidence intervals for regression coefficients should have the correct coverage conditional on the selected model. The specific case of the LASSO estimator is discussed in detail. The main difference between Lee et al. (2016) and Taylor et al. (2014) and Lockhart et al. (2014) is that in the former, confidence intervals can be formed at any value of the LASSO penalty parameter and for any coefficient in the model. Finally, it is important to stress that Lee et al. (2016) inference is carried out on the coefficients of the selected model, while van de Geer et al. (2014) and Belloni et al. (2014) consider inference on the coefficients of the true model. The above papers do not consider a time-series environment. Hecq et al. (2019) is one of the first papers which attempt to consider post-selection inference in a time-series environment. The authors generalize the results in Belloni et al. (2014) to dependent processes. However, their results are derived under a fixed number of variables. Babii et al. (2020a) and Adámek et al. (2020) extend the seminal work of van de Geer et al. (2014) to a time-series framework. More specifically, Babii et al. (2020a) consider inference in time-series regression models under heteroskedastic and autocorrelated errors. The authors consider heteroskedasticity- and autocorrelation-consistent (HAC) estimation with the sparse-group LASSO. They propose a debiased central limit theorem for low-dimensional groups of regression coefficients and study the HAC estimator of the long-run variance based on the sparse-group LASSO residuals. Adámek
et al. (2020) extend the desparsified LASSO to a time-series setting under near-epoch dependence assumptions, allowing for non-Gaussian, serially correlated and heteroskedastic processes. Furthermore, the number of regressors can possibly grow faster than the sample size.
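To make the double-selection idea of Belloni et al. (2014) discussed above more concrete, the sketch below selects controls with two LASSO regressions and then refits a low-dimensional OLS. It is a schematic implementation on IID-style toy data, not the authors' code, and it ignores the time-series corrections discussed in the text; all variable names and tuning values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def double_selection(y, x1, X2, lam=0.05):
    """Post-double-selection: union of the two LASSO supports, then low-dimensional OLS."""
    s1 = np.flatnonzero(Lasso(alpha=lam, max_iter=10_000).fit(X2, y).coef_)    # controls relevant for y
    s2 = np.flatnonzero(Lasso(alpha=lam, max_iter=10_000).fit(X2, x1).coef_)   # controls relevant for x1
    keep = np.union1d(s1, s2)
    Z = np.column_stack([x1, X2[:, keep]])                                     # regressor of interest + selected controls
    ols = LinearRegression().fit(Z, y)
    return ols.coef_[0], keep                                                  # estimate of beta_1 and the selected set

# toy data: one regressor of interest and 100 potential controls, few of them relevant
rng = np.random.default_rng(2)
X2 = rng.standard_normal((300, 100))
x1 = X2[:, 0] + 0.5 * rng.standard_normal(300)          # x1 is correlated with a relevant control
y = 1.0 * x1 + 2.0 * X2[:, 0] - 1.0 * X2[:, 1] + rng.standard_normal(300)
beta1_hat, selected = double_selection(y, x1, X2)
print(beta1_hat, selected)
```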
Nonlinear Models

The function $f_h$ appearing in (1.1) is unknown and in several applications the linearity assumption is too restrictive, so more flexible forms must be considered. Assuming a quadratic loss function, the estimation problem turns out to be the minimization of the functional
$$S(f) := \sum_{t=1}^{T-h} [Y_{t+h} - f(\boldsymbol{X}_t)]^2, \tag{3.1}$$
where $f \in \mathcal{G}$, a generic function space. However, the optimization problem stated in (3.1) is infeasible when $\mathcal{G}$ is infinite dimensional, as there is no efficient technique to search over all of $\mathcal{G}$. Of course, one solution is to restrict the function space, as for instance, imposing linearity or specific forms of parametric nonlinear models as in, for example, Teräsvirta (1994), Suarez-Fariñas et al. (2004) or McAleer and Medeiros (2008); see also Teräsvirta et al. (2010) for a recent review of such models. Alternatively, we can replace $\mathcal{G}$ by a simpler and finite-dimensional $\mathcal{G}_D$. The idea is to consider a sequence of finite-dimensional spaces, the sieve spaces, $\mathcal{G}_D$, $D = 1, 2, 3, \ldots$, that converges to $\mathcal{G}$ in some norm. The approximating function $g_D(\boldsymbol{X}_t)$ is written as
$$g_D(\boldsymbol{X}_t) = \sum_{j=1}^{J} \beta_j g_j(\boldsymbol{X}_t),$$
where $g_j(\cdot)$ is the $j$-th basis function for $\mathcal{G}_D$ and can be either fully known or indexed by a vector of parameters, such that $g_j(\boldsymbol{X}_t) := g(\boldsymbol{X}_t; \boldsymbol{\theta}_j)$. The number of basis functions $J := J_T$ will depend on the sample size $T$. $D$ is the dimension of the space and it also depends on the sample size: $D := D_T$. Therefore, the optimization problem is modified to
$$\widehat{g}_D(\boldsymbol{X}_t) = \arg\min_{g_D(\boldsymbol{X}_t) \in \mathcal{G}_D} \sum_{t=1}^{T-h} [Y_{t+h} - g_D(\boldsymbol{X}_t)]^2. \tag{3.2}$$
The sequence of approximating spaces $\mathcal{G}_D$ is chosen by using the structure of the original underlying space $\mathcal{G}$ and the fundamental concept of dense sets. If we have two sets $A$ and $B$ of a metric space $\mathcal{X}$, $A$ is dense in $B$ if for any $\epsilon > 0$ and $x \in B$ there is a $y \in A$ such that $\|x - y\|_{\mathcal{X}} < \epsilon$. This is called the method of sieves. For a comprehensive review of the method for time-series data, see Chen (2007). For example, from the theory of approximating functions we know that the proper subset $\mathcal{P} \subset \mathcal{C}$ of polynomials is dense in $\mathcal{C}$, the space of continuous functions. The set of polynomials is smaller and simpler than the set of all continuous functions. In this case, it is natural to define the sequence of approximating spaces $\mathcal{G}_D$, $D = 1, 2, 3, \ldots$, by making $\mathcal{G}_D$ the set of polynomials of degree smaller than or equal to $D - 1$, so that $\dim(\mathcal{G}_D) = D < \infty$. In the limit this sequence of finite-dimensional spaces converges to the infinite-dimensional space of polynomials, which in its turn is dense in $\mathcal{C}$. When the basis functions are all known (linear sieves), the problem is linear in the parameters and methods like ordinary least squares (when $J \ll T$) or penalized estimation as previously described can be used. For example, let $p = 1$ and pick a polynomial basis such that
$$g_D(X_t) = \beta_0 + \beta_1 X_t + \beta_2 X_t^2 + \beta_3 X_t^3 + \cdots + \beta_J X_t^J.$$
In this case, the dimension $D$ of $\mathcal{G}_D$ is $J + 1$, due to the presence of a constant term. If $J \ll T$, the vector of parameters $\boldsymbol{\beta} = (\beta_0, \ldots, \beta_J)'$ can be estimated by $\widehat{\boldsymbol{\beta}} = (\boldsymbol{X}_J'\boldsymbol{X}_J)^{-1}\boldsymbol{X}_J'\boldsymbol{Y}$, where $\boldsymbol{X}_J$ is the $T \times (J+1)$ design matrix and $\boldsymbol{Y} = (Y_1, \ldots, Y_T)'$. When the basis functions are also indexed by parameters (nonlinear sieves), nonlinear least-squares methods should be used. In this paper we focus on two frequently used nonlinear sieves: neural networks and regression trees.
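The polynomial sieve just described amounts to ordinary least squares on powers of the regressor. A minimal sketch, with an artificial target function and an arbitrary order $J$, is given below.

```python
import numpy as np

# Linear sieve with a polynomial basis: approximate an unknown f by a degree-J polynomial
# and estimate the coefficients by OLS. The target function and J are illustrative choices.
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=500)
y = np.sin(1.5 * x) + 0.1 * rng.standard_normal(500)     # "unknown" f(x) = sin(1.5x) plus noise

J = 5                                                     # number of basis terms beyond the constant
X_J = np.column_stack([x**j for j in range(J + 1)])       # design matrix [1, x, x^2, ..., x^J]
beta_hat, *_ = np.linalg.lstsq(X_J, y, rcond=None)        # OLS estimate of the sieve coefficients

x_grid = np.linspace(-2, 2, 9)
fitted = np.column_stack([x_grid**j for j in range(J + 1)]) @ beta_hat
print(np.round(fitted, 3))                                # sieve approximation of f on a grid
```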
Neural Networks

Neural Networks (NN) are one of the most traditional nonlinear sieves. NN can be classified into shallow or deep networks. We start by describing shallow NNs. The most common shallow NN is the feedforward neural network, where the approximating function $g_D(\boldsymbol{X}_t)$ is defined as
$$g_D(\boldsymbol{X}_t) := g_D(\boldsymbol{X}_t; \boldsymbol{\theta}) = \beta_0 + \sum_{j=1}^{J_T} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_t + \gamma_{0,j}) = \beta_0 + \sum_{j=1}^{J_T} \beta_j S(\tilde{\boldsymbol{\gamma}}_j'\tilde{\boldsymbol{X}}_t). \tag{3.3}$$
In the above model, $\tilde{\boldsymbol{X}}_t = (1, \boldsymbol{X}_t')'$, $S(\cdot)$ is a basis function and the parameter vector to be estimated is given by $\boldsymbol{\theta} = (\beta_0, \ldots, \beta_{J_T}, \boldsymbol{\gamma}_1', \ldots, \boldsymbol{\gamma}_{J_T}', \gamma_{0,1}, \ldots, \gamma_{0,J_T})'$, where $\tilde{\boldsymbol{\gamma}}_j = (\gamma_{0,j}, \boldsymbol{\gamma}_j')'$. NN models form a very popular class of nonlinear sieves and have been used in many applications of economic forecasting. Usually, the basis functions $S(\cdot)$ are called activation functions and the parameters are called weights. The terms in the sum are called hidden neurons, as an unfortunate analogy to the human brain. Specification (3.3) is also known as a single hidden layer NN model and is usually represented graphically as in Figure 1.

Figure 1: Graphical representation of a single hidden layer neural network.

The green circles in the figure represent the input layer, which consists of the covariates of the model ($\boldsymbol{X}_t$). In the example in the figure there are four input variables. The blue and red circles indicate the hidden and output layers, respectively. In the example, there are five elements (neurons) in the hidden layer. The arrows from the green to the blue circles represent the linear combination of inputs: $\boldsymbol{\gamma}_j'\boldsymbol{X}_t + \gamma_{0,j}$, $j = 1, \ldots, 5$.
Finally, the arrows from the blue to the red circles represent the linear combination of the outputs from the hidden layer: $\beta_0 + \sum_{j=1}^{5} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_t + \gamma_{0,j})$. There are several possible choices for the activation function. In the early days, $S(\cdot)$ was chosen among the class of squashing functions as per the definition below.

Definition 1.
A function $S : \mathbb{R} \longrightarrow [a, b]$, $a < b$, is a squashing (sigmoid) function if it is non-decreasing, $\lim_{x \to \infty} S(x) = b$ and $\lim_{x \to -\infty} S(x) = a$.

Historically, the most popular choices are the logistic and hyperbolic tangent functions, such that:
$$\text{Logistic: } S(x) = \frac{1}{1 + \exp(-x)}, \qquad \text{Hyperbolic tangent: } S(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}.$$
The popularity of such functions was partially due to theoretical results on function approximation. Funahashi (1989) establishes that NN models as in (3.3) with generic squashing functions are capable of approximating any continuous function from one finite-dimensional space to another to any desired degree of accuracy, provided that $J_T$ is sufficiently large. Cybenko (1989) and Hornik et al. (1989) simultaneously proved approximation capabilities of NN models for any Borel measurable function, and Hornik et al. (1989) extended the previous results and showed that NN models are also capable of approximating the derivatives of the unknown function. Barron (1993) relates the previous results to the number of terms in the model. Stinchcombe and White (1989) and Park and Sandberg (1991) derived the same results as Cybenko (1989) and Hornik et al. (1989) but without requiring the activation function to be sigmoid. While the former considered a very general class of functions, the latter focused on radial-basis functions (RBF), defined as
$$\text{Radial Basis: } S(x) = \exp(-x^2).$$
More recently, Yarotsky (2017) showed that rectified linear units (ReLU),
$$\text{Rectified Linear Unit: } S(x) = \max(0, x),$$
are also universal approximators. Model (3.3) can be written in matrix notation. Let $\boldsymbol{\Gamma} = (\tilde{\boldsymbol{\gamma}}_1, \ldots, \tilde{\boldsymbol{\gamma}}_{J_T})$,
$$\boldsymbol{X} = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & X_{T1} & \cdots & X_{Tp} \end{pmatrix}, \qquad \boldsymbol{O}(\boldsymbol{X}\boldsymbol{\Gamma}) = \begin{pmatrix} 1 & S(\tilde{\boldsymbol{\gamma}}_1'\tilde{\boldsymbol{x}}_1) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{J_T}'\tilde{\boldsymbol{x}}_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & S(\tilde{\boldsymbol{\gamma}}_1'\tilde{\boldsymbol{x}}_T) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{J_T}'\tilde{\boldsymbol{x}}_T) \end{pmatrix}.$$
Therefore, by defining $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_{J_T})'$, the output of a feed-forward NN is given by
$$\boldsymbol{h}_D(\boldsymbol{X}, \boldsymbol{\theta}) = [h_D(\boldsymbol{X}_1; \boldsymbol{\theta}), \ldots, h_D(\boldsymbol{X}_T; \boldsymbol{\theta})]' = \begin{pmatrix} \beta_0 + \sum_{j=1}^{J_T} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_1 + \gamma_{0,j}) \\ \vdots \\ \beta_0 + \sum_{j=1}^{J_T} \beta_j S(\boldsymbol{\gamma}_j'\boldsymbol{X}_T + \gamma_{0,j}) \end{pmatrix} = \boldsymbol{O}(\boldsymbol{X}\boldsymbol{\Gamma})\boldsymbol{\beta}. \tag{3.4}$$
The dimension of the parameter vector $\boldsymbol{\theta} = [\mathrm{vec}(\boldsymbol{\Gamma})', \boldsymbol{\beta}']'$ is $k = (n+1) \times J_T + (J_T + 1)$ and can easily get very large, such that the unrestricted estimation problem defined as
$$\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^k} \|\boldsymbol{Y} - \boldsymbol{O}(\boldsymbol{X}\boldsymbol{\Gamma})\boldsymbol{\beta}\|_2^2$$
is unfeasible. A solution is to use regularization as in the case of linear models and consider the minimization of the following function:
$$Q(\boldsymbol{\theta}) = \|\boldsymbol{Y} - \boldsymbol{O}(\boldsymbol{X}\boldsymbol{\Gamma})\boldsymbol{\beta}\|_2^2 + p(\boldsymbol{\theta}), \tag{3.5}$$
where usually $p(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}'\boldsymbol{\theta}$. Traditionally, the most common approach to minimize (3.5) is to use Bayesian methods as in MacKay (1992) and Foresee and Hagan (1997). A more modern approach is to use a technique known as Dropout (Srivastava et al., 2014). The key idea is to randomly drop neurons (along with their connections) from the neural network during estimation. A NN with $J_T$ neurons in the hidden layer can generate $2^{J_T}$ possible "thinned" NNs by just removing some neurons. Dropout samples from these $2^{J_T}$ different thinned NNs and trains the sampled NN. To predict the target variable, we use a single unthinned network that has weights adjusted by the probability law induced by the random drop.
This procedure significantly reduces overfitting and gives major improvements over other regularization methods. We modify equation (3.3) as
$$g_D^*(\boldsymbol{X}_t) = \beta_0 + \sum_{j=1}^{J_T} s_j \beta_j S(\boldsymbol{\gamma}_j'[\boldsymbol{r} \odot \boldsymbol{X}_t] + v_j\gamma_{0,j}),$$
where $s_j$, $v_j$, and $\boldsymbol{r} = (r_1, \ldots, r_n)'$ are independent Bernoulli random variables, each with probability $q$ of being equal to 1. The NN model is thus estimated by using $g_D^*(\boldsymbol{X}_t)$ instead of $g_D(\boldsymbol{X}_t)$, where, for each training example, the values of the entries of $\boldsymbol{r}$ are drawn from the Bernoulli distribution. The final estimates for $\beta_j$, $\boldsymbol{\gamma}_j$, and $\gamma_{0,j}$ are multiplied by $q$.

Deep Neural Networks

A Deep Neural Network model is a straightforward generalization of specification (3.3) where more hidden layers are included in the model, as represented in Figure 2.

Figure 2: Deep neural network architecture.

In the figure we represent a Deep NN with two hidden layers with the same number of hidden units in each. However, the number of hidden neurons can vary across layers. As pointed out in Mhaskar et al. (2017), while the universal approximation property holds for shallow NNs, deep networks can approximate the class of compositional functions as well as shallow networks, but with an exponentially lower number of training parameters and sample complexity. Set $J_\ell$ as the number of hidden units in layer $\ell \in \{1, \ldots, L\}$. For each hidden layer $\ell$ define $\boldsymbol{\Gamma}_\ell = (\tilde{\boldsymbol{\gamma}}_{1\ell}, \ldots, \tilde{\boldsymbol{\gamma}}_{J_\ell\ell})$. Then the output $\boldsymbol{O}_\ell$ of layer $\ell$, an $n \times (J_\ell + 1)$ matrix, is given recursively by
$$\boldsymbol{O}_\ell(\boldsymbol{O}_{\ell-1}(\cdot)\boldsymbol{\Gamma}_\ell) = \begin{pmatrix} 1 & S(\tilde{\boldsymbol{\gamma}}_{1\ell}'\boldsymbol{O}_{\ell-1}^{1}(\cdot)) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{J_\ell\ell}'\boldsymbol{O}_{\ell-1}^{1}(\cdot)) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & S(\tilde{\boldsymbol{\gamma}}_{1\ell}'\boldsymbol{O}_{\ell-1}^{n}(\cdot)) & \cdots & S(\tilde{\boldsymbol{\gamma}}_{J_\ell\ell}'\boldsymbol{O}_{\ell-1}^{n}(\cdot)) \end{pmatrix},$$
where $\boldsymbol{O}_0 := \boldsymbol{X}$. Therefore, the output of the Deep NN is the composition
$$\boldsymbol{h}_D(\boldsymbol{X}) = \boldsymbol{O}_L(\cdots \boldsymbol{O}_3(\boldsymbol{O}_2(\boldsymbol{O}_1(\boldsymbol{X}\boldsymbol{\Gamma}_1)\boldsymbol{\Gamma}_2)\boldsymbol{\Gamma}_3)\cdots\boldsymbol{\Gamma}_L)\boldsymbol{\beta}.$$
The estimation of the parameters is usually carried out by stochastic gradient descent methods with dropout to control the complexity of the model.
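A shallow or deep feed-forward network with dropout, as described above, can be specified in a few lines with a standard deep-learning library. The sketch below uses Keras and assumes TensorFlow is installed; the layer sizes, dropout rate, and training data are illustrative choices only, not settings from the paper.

```python
import numpy as np
import tensorflow as tf

# toy supervised pairs (X_t, Y_{t+h}); in practice X would hold lags of Y and other predictors
rng = np.random.default_rng(4)
X = rng.standard_normal((500, 10)).astype("float32")
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),   # first hidden layer
    tf.keras.layers.Dropout(0.2),                    # randomly drops 20% of the neurons during training
    tf.keras.layers.Dense(16, activation="relu"),    # second hidden layer (deep NN)
    tf.keras.layers.Dense(1),                        # linear output layer
])
model.compile(optimizer="adam", loss="mse")          # quadratic loss, SGD-type optimizer
model.fit(X, y, epochs=50, batch_size=32, verbose=0)

print(model.predict(X[:3], verbose=0).ravel())       # fitted values for the first rows
```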
Recurrent Neural Networks

Broadly speaking, Recurrent Neural Networks (RNNs) are NNs that allow for feedback among the hidden layers. RNNs can use their internal state (memory) to process sequences of inputs. In the framework considered in this paper, a generic RNN could be written as
$$\boldsymbol{H}_t = f(\boldsymbol{H}_{t-1}, \boldsymbol{X}_t), \qquad \widehat{Y}_{t+h|t} = g(\boldsymbol{H}_t),$$
where $\widehat{Y}_{t+h|t}$ is the prediction of $Y_{t+h}$ given observations only up to time $t$, $f$ and $g$ are functions to be defined, and $\boldsymbol{H}_t$ is what we call the (hidden) state. From a time-series perspective, RNNs can be seen as a kind of nonlinear state-space model. RNNs can remember the order in which the inputs appear through their hidden state (memory) and they can also model sequences of data so that each sample can be assumed to be dependent on the previous ones, as in time series models. However, RNNs are hard to estimate as they suffer from the vanishing/exploding gradient problem. Set the cost function to be
$$Q_T(\boldsymbol{\theta}) = \sum_{t=1}^{T-h} \left( Y_{t+h} - \widehat{Y}_{t+h|t} \right)^2,$$
where $\boldsymbol{\theta}$ is the vector of parameters to be estimated. It is easy to show that the gradient $\partial Q_T(\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ can either be very small or diverge. Fortunately, there is a solution to the problem proposed by Hochreiter and Schmidhuber (1997): a variant of the RNN called the Long Short-Term Memory (LSTM) network. Figure 3 shows the architecture of a typical LSTM layer.

Figure 3: Architecture of the Long-Short-Term Memory Cell (LSTM).

An LSTM network can be composed of several layers. In the figure, red circles indicate logistic activation functions, while blue circles represent hyperbolic tangent activations. The symbols "$\times$" and "$+$" represent, respectively, the element-wise multiplication and sum operations. The RNN layer is composed of several blocks: the cell state and the forget, input, and output gates. The cell state introduces a bit of memory to the LSTM so it can "remember" the past. The LSTM learns to keep only relevant information to make predictions, and to forget non-relevant data. The forget gate tells which information to throw away from the cell state. The output gate provides the activation of the final output of the LSTM block at time $t$. Usually, the dimension of the hidden state ($\boldsymbol{H}_t$) is associated with the number of hidden neurons. Algorithm 1 describes analytically how the LSTM cell works. $\boldsymbol{f}_t$ represents the output of the forget gate. Note that it is a combination of the previous hidden state ($\boldsymbol{H}_{t-1}$) with the new information ($\boldsymbol{X}_t$).
Note that $\boldsymbol{f}_t \in [0,1]$ and it will attenuate the signal coming from $\boldsymbol{c}_{t-1}$. The input and output gates have the same structure. Their function is to filter the "relevant" information from the previous time period as well as from the new input. $\boldsymbol{p}_t$ scales the combination of inputs and previous information. This signal will then be combined with the output of the input gate ($\boldsymbol{i}_t$). The new hidden state will be an attenuation of the signal coming from the output gate. Finally, the prediction is a linear combination of hidden states. Figure 4 illustrates how the information flows in an LSTM cell.

Algorithm 1.
Mathematically, RNNs can be defined by the following algorithm:
1. Initiate with $\boldsymbol{c}_0 = \boldsymbol{0}$ and $\boldsymbol{H}_0 = \boldsymbol{0}$.
2. Given the input $\boldsymbol{X}_t$, for $t \in \{1, \ldots, T\}$, do:
$$\begin{aligned} \boldsymbol{f}_t &= \text{Logistic}(\boldsymbol{W}_f\boldsymbol{X}_t + \boldsymbol{U}_f\boldsymbol{H}_{t-1} + \boldsymbol{b}_f), \\ \boldsymbol{i}_t &= \text{Logistic}(\boldsymbol{W}_i\boldsymbol{X}_t + \boldsymbol{U}_i\boldsymbol{H}_{t-1} + \boldsymbol{b}_i), \\ \boldsymbol{o}_t &= \text{Logistic}(\boldsymbol{W}_o\boldsymbol{X}_t + \boldsymbol{U}_o\boldsymbol{H}_{t-1} + \boldsymbol{b}_o), \\ \boldsymbol{p}_t &= \text{Tanh}(\boldsymbol{W}_c\boldsymbol{X}_t + \boldsymbol{U}_c\boldsymbol{H}_{t-1} + \boldsymbol{b}_c), \\ \boldsymbol{c}_t &= (\boldsymbol{f}_t \odot \boldsymbol{c}_{t-1}) + (\boldsymbol{i}_t \odot \boldsymbol{p}_t), \\ \boldsymbol{H}_t &= \boldsymbol{o}_t \odot \text{Tanh}(\boldsymbol{c}_t), \\ \widehat{Y}_{t+h|t} &= \boldsymbol{W}_y\boldsymbol{H}_t + \boldsymbol{b}_y, \end{aligned}$$
where $\boldsymbol{W}_f$, $\boldsymbol{W}_i$, $\boldsymbol{W}_o$, $\boldsymbol{W}_c$, $\boldsymbol{U}_f$, $\boldsymbol{U}_i$, $\boldsymbol{U}_o$, $\boldsymbol{U}_c$, $\boldsymbol{b}_f$, $\boldsymbol{b}_i$, $\boldsymbol{b}_o$, $\boldsymbol{b}_c$, $\boldsymbol{W}_y$, and $\boldsymbol{b}_y$ are parameters to be estimated.
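A direct transcription of Algorithm 1 into Python is given below. The weight matrices are drawn at random purely to make the sketch self-contained (in practice they would be estimated by stochastic gradient descent), and all dimensions are arbitrary illustrative choices.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forecasts(X, params, h_dim):
    """Run the LSTM recursion of Algorithm 1 over a sequence of inputs X (T x n)."""
    Wf, Wi, Wo, Wc, Uf, Ui, Uo, Uc, bf, bi, bo, bc, Wy, by = params
    c = np.zeros(h_dim)
    H = np.zeros(h_dim)
    preds = []
    for x in X:
        f = logistic(Wf @ x + Uf @ H + bf)     # forget gate
        i = logistic(Wi @ x + Ui @ H + bi)     # input gate
        o = logistic(Wo @ x + Uo @ H + bo)     # output gate
        p = np.tanh(Wc @ x + Uc @ H + bc)      # candidate cell update
        c = f * c + i * p                      # new cell state
        H = o * np.tanh(c)                     # new hidden state
        preds.append(Wy @ H + by)              # linear read-out: forecast of Y_{t+h}
    return np.array(preds)

# random (untrained) parameters just to exercise the recursion
rng = np.random.default_rng(5)
n, h_dim, T = 4, 3, 20
params = ([rng.standard_normal((h_dim, n)) * 0.1 for _ in range(4)] +       # Wf, Wi, Wo, Wc
          [rng.standard_normal((h_dim, h_dim)) * 0.1 for _ in range(4)] +   # Uf, Ui, Uo, Uc
          [np.zeros(h_dim) for _ in range(4)] +                             # bf, bi, bo, bc
          [rng.standard_normal(h_dim) * 0.1, 0.0])                          # Wy, by
X = rng.standard_normal((T, n))
print(lstm_forecasts(X, params, h_dim)[:5])
```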
Regression Trees

A regression tree is a nonparametric model that approximates an unknown nonlinear function $f_h(\boldsymbol{X}_t)$ in (1.1) with local predictions using recursive partitioning of the space of the covariates. A tree may be represented by a graph as in the left side of Figure 5, which is equivalent to the partitioning in the right side of the figure for this bi-dimensional case. For example, suppose that we want to predict the scores of basketball players based on their height and weight. The first node of the tree in the example splits the players taller than 1.85m from the shorter players. The second node on the left takes the group of shorter players and splits them by weight, and the second node on the right does the same with the taller players. The prediction for each group is displayed in the terminal nodes and is calculated as the average score in each group. To grow a tree we must find the optimal splitting point in each node, which consists of an optimal variable and an optimal observation. In the same example, the optimal variable in the first node is height and the observation is 1.85m. The idea of regression trees is to approximate $f_h(\boldsymbol{X}_t)$ by
$$h_D(\boldsymbol{X}_t) = \sum_{j=1}^{J_T} \beta_j I_j(\boldsymbol{X}_t), \quad \text{where} \quad I_j(\boldsymbol{X}_t) = \begin{cases} 1 & \text{if } \boldsymbol{X}_t \in R_j, \\ 0 & \text{otherwise.} \end{cases}$$
From the above expression, it becomes clear that the approximation of $f_h(\cdot)$ is equivalent to a linear regression on $J_T$ dummy variables, where $I_j(\boldsymbol{X}_t)$ is a product of indicator functions. Let $J := J_T$ and $N := N_T$ be, respectively, the number of terminal nodes (regions, leaves) and parent nodes. The different regions are denoted as $R_1, \ldots, R_J$.

Figure 5: Example of a simple tree.

The root node is at position 0. The parent node at position $j$ has two split (child) nodes at positions $2j+1$ and $2j+2$. Each parent node has an associated threshold (split) variable $X_{s_j t}$, where $s_j \in \mathbb{S} = \{1, 2, \ldots, p\}$. Define $\mathbb{J}$ and $\mathbb{T}$ as the sets of parent and terminal nodes, respectively. Figure 6 gives an example. In the example, the parent nodes are $\mathbb{J} = \{0, 2, 5\}$ and the terminal nodes are $\mathbb{T} = \{1, 6, 11, 12\}$. Therefore, we can write the approximating model as
$$h_D(\boldsymbol{X}_t) = \sum_{i \in \mathbb{T}} \beta_i B_{\mathbb{J}_i}(\boldsymbol{X}_t; \boldsymbol{\theta}_i), \tag{3.6}$$
where
$$B_{\mathbb{J}_i}(\boldsymbol{X}_t; \boldsymbol{\theta}_i) = \prod_{j \in \mathbb{J}_i} I(X_{s_j,t}; c_j)^{\frac{n_{i,j}(1+n_{i,j})}{2}} \times \left[1 - I(X_{s_j,t}; c_j)\right]^{(1-n_{i,j})(1+n_{i,j})}, \tag{3.7}$$
$$I(X_{s_j,t}; c_j) = \begin{cases} 1 & \text{if } X_{s_j,t} \leq c_j, \\ 0 & \text{otherwise,} \end{cases} \qquad n_{i,j} = \begin{cases} -1 & \text{if the path to leaf } i \text{ does not include parent node } j, \\ 0 & \text{if the path to leaf } i \text{ includes the right-hand child of parent node } j, \\ 1 & \text{if the path to leaf } i \text{ includes the left-hand child of parent node } j. \end{cases}$$
$\mathbb{J}_i$ denotes the indexes of the parent nodes included in the path to leaf $i$, $\boldsymbol{\theta}_i = \{c_k : k \in \mathbb{J}_i\}$, $i \in \mathbb{T}$, and $\sum_{i \in \mathbb{T}} B_{\mathbb{J}_i}(\boldsymbol{X}_t; \boldsymbol{\theta}_i) = 1$.

Figure 6: Example of a tree with labels. Parent nodes are 0, 2 and 5; terminal nodes 1, 11, 12 and 6 correspond to Regions 1 to 4, respectively.

Random Forests

Random Forest (RF) is a collection of regression trees, each specified on a bootstrap sample of the original data. The method was originally proposed by Breiman (2001). Since we are dealing with time series, we use a block bootstrap. Suppose there are $B$ bootstrap samples. For each sample $b$, $b = 1, \ldots, B$, a tree with $K_b$ regions is estimated for a randomly selected subset of the original regressors. $K_b$ is determined so as to leave a minimum number of observations in each region. The final forecast is the average of the forecasts of each tree applied to the original data:
$$\widehat{Y}_{t+h|t} = \frac{1}{B} \sum_{b=1}^{B} \left[ \sum_{i \in \mathbb{T}_b} \widehat{\beta}_{i,b} B_{\mathbb{J}_{i,b}}(\boldsymbol{X}_t; \widehat{\boldsymbol{\theta}}_{i,b}) \right].$$
The theory for RF models has been developed only for independent and identically distributed random variables. For instance, Scornet et al. (2015) prove consistency of the RF approximation to the unknown function $f_h(\boldsymbol{X}_t)$. More recently, Wager and Athey (2018) proved consistency and asymptotic normality of the RF estimator.
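The block-bootstrap random forest described above can be emulated by fitting one tree per block-bootstrapped sample and averaging the forecasts. The sketch below uses scikit-learn decision trees; the block length, number of trees, minimum leaf size, and toy data are illustrative choices, and the random selection of regressors for each tree is delegated to the `max_features` option.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def block_bootstrap_indices(T, block_len, rng):
    """Draw a circular block-bootstrap sample of row indices 0..T-1."""
    n_blocks = int(np.ceil(T / block_len))
    starts = rng.integers(0, T, size=n_blocks)
    idx = np.concatenate([(s + np.arange(block_len)) % T for s in starts])
    return idx[:T]

def random_forest_forecast(X, y, x_new, B=200, block_len=10, seed=0):
    """Average of B trees, each grown on a block-bootstrap sample with random feature subsets."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(B):
        idx = block_bootstrap_indices(len(y), block_len, rng)
        tree = DecisionTreeRegressor(min_samples_leaf=5, max_features="sqrt",
                                     random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        preds.append(tree.predict(x_new.reshape(1, -1))[0])
    return np.mean(preds)

# toy nonlinear regression with 8 predictors (standing in for lags of the target and covariates)
rng = np.random.default_rng(6)
X = rng.standard_normal((300, 8))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2] + 0.1 * rng.standard_normal(300)
print(random_forest_forecast(X, y, x_new=X[-1]))
```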
Boosting

Boosting is another greedy method to approximate nonlinear functions that uses base learners for a sequential approximation. The model we consider here, called Gradient Boosting, was introduced by Friedman (2001) and can be seen as a Gradient Descent method in functional space. The study of the statistical properties of Gradient Boosting is well developed for independent data. For example, for regression problems, Duffy and Helmbold (2002) derived bounds on the convergence of boosting algorithms using assumptions on the performance of the base learner. Zhang and Yu (2005) prove convergence, consistency and results on the speed of convergence with mild assumptions on the base learners. Bühlmann (2002) shows similar results for consistency in the case of $\ell_2$ loss functions and tree base models. Since boosting indefinitely leads to overfitting problems, some authors have demonstrated the consistency of boosting with different types of stopping rules, which are usually related to small step sizes, as suggested by Friedman (2001). Some of these works include boosting in classification problems and gradient boosting for both classification and regression problems. See, for instance, Jiang (2004); Lugosi and Vayatis (2004); Bartlett and Traskin (2007); Zhang and Yu (2005); Bühlmann (2006); Bühlmann (2002). Boosting is an iterative algorithm. The idea of boosted trees is, at each iteration, to sequentially refit the gradient of the loss function by small trees. In the case of the quadratic loss considered in this paper, the algorithm simply refits the residuals from the previous iteration. Algorithm 2 presents the simplified boosting procedure for a quadratic loss. It is recommended to use a shrinkage parameter $v \in (0,1]$ to control the learning rate of the algorithm. If $v$ is close to 1, we have a faster convergence rate and a better in-sample fit. However, we are more likely to have over-fitting and produce poor out-of-sample results. Additionally, the derivative is highly affected by over-fitting, even if we look at in-sample estimates. A learning rate between 0.1 and 0.2 is recommended to maintain a reasonable convergence ratio and to limit over-fitting problems.

Algorithm 2.
The boosting algorithm is defined by the following steps:
1. Initialize $\phi_{t0} = \bar{Y} := \frac{1}{T}\sum_{t=1}^{T} Y_t$;
2. For $m = 1, \ldots, M$:
(a) Compute $U_{tm} = Y_t - \phi_{t,m-1}$;
(b) Grow a (small) tree model to fit $u_{tm}$: $\widehat{u}_{tm} = \sum_{i \in \mathbb{T}_m} \widehat{\beta}_{im} B_{\mathbb{J}_{m_i}}(\boldsymbol{X}_t; \widehat{\boldsymbol{\theta}}_{im})$;
(c) Compute $\rho_m = \arg\min_{\rho} \sum_{t=1}^{T} [u_{tm} - \rho\widehat{u}_{tm}]^2$;
(d) Update $\phi_{tm} = \phi_{t,m-1} + v\rho_m\widehat{u}_{tm}$.

The final fitted value may be written as
$$\widehat{Y}_{t+h} = \bar{Y} + \sum_{m=1}^{M} v\widehat{\rho}_m\widehat{u}_{tm} = \bar{Y} + \sum_{m=1}^{M} v\widehat{\rho}_m \sum_{k \in \mathbb{T}_m} \widehat{\beta}_{km} B_{\mathbb{J}_{m_k}}(\boldsymbol{X}_t; \widehat{\boldsymbol{\theta}}_{km}). \tag{3.8}$$
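A compact Python transcription of Algorithm 2, using small scikit-learn trees as base learners, is sketched below. The tree depth, learning rate `v`, and number of iterations `M` are illustrative choices, not recommendations from the paper beyond the learning-rate range discussed above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_trees(X, y, M=100, v=0.1, max_depth=2):
    """Gradient boosting with quadratic loss (Algorithm 2): sequentially refit residuals."""
    phi = np.full(len(y), y.mean())                                   # step 1: start from the unconditional mean
    fits = []
    for _ in range(M):
        u = y - phi                                                   # (a) current residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, u)   # (b) small tree fitted to the residuals
        u_hat = tree.predict(X)
        rho = (u @ u_hat) / (u_hat @ u_hat)                           # (c) line search for the step size
        phi = phi + v * rho * u_hat                                   # (d) shrunken update
        fits.append((rho, tree))
    return y.mean(), fits

def boost_predict(x_new, y_bar, fits, v=0.1):
    return y_bar + sum(v * rho * tree.predict(x_new) for rho, tree in fits)

# toy data
rng = np.random.default_rng(7)
X = rng.standard_normal((400, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(400)
y_bar, fits = boost_trees(X, y)
print(boost_predict(X[:3], y_bar, fits))
```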
Conducting inference in nonlinear ML methods is tricky. One possible way is to follow Medeiros et al. (2006), Medeiros and Veiga (2005), and Suarez-Fariñas et al. (2004) and interpret particular nonlinear ML specifications as parametric models, as, for example, general forms of smooth transition regressions. However, this approach restricts the application of ML methods to very specific settings. An alternative is to consider models that can be cast in the sieves framework described earlier. This is the case of splines and feed-forward NNs, for example. In this setup, Chen and Shen (1998) and Chen (2007) derived, under regularity conditions, the consistency and asymptotic normality of the estimates of semi-parametric sieve approximations. Their setup is defined as follows:
$$Y_{t+h} = \beta'\boldsymbol{X}_t + f(\boldsymbol{X}_t) + U_{t+h},$$
where $f(\boldsymbol{X}_t)$ is a nonlinear function that is nonparametrically modeled by sieve approximations. Chen and Shen (1998) and Chen (2007) consider the estimation of both the linear and nonlinear components of the model. However, their results are derived for the case where the dimension of $\boldsymbol{X}_t$ is fixed.

Recently, Chernozhukov et al. (2017) and Chernozhukov et al. (2018) consider the case where the number of covariates diverges as the sample size increases in a very general setup. In this case the asymptotic results in Chen and Shen (1998) and Chen (2007) are not valid, and the authors put forward the so-called double ML methods as a generalization of the results in Belloni et al. (2014). For deep neural networks, Farrell et al. (2021) consider semiparametric inference and establish nonasymptotic high-probability bounds. Consequently, the authors are able to derive rates of convergence that are sufficiently fast to allow them to establish valid second-step inference after first-step estimation with deep learning. Nevertheless, the above papers do not cover the case of time-series models.

More specifically, for the case of Random Forests, asymptotic and inferential results are derived in Scornet et al. (2015) and Wager and Athey (2018) for the case of IID data. More recently, Davis and Nielsen (2020) prove a uniform concentration inequality for regression trees built on nonlinear autoregressive stochastic processes and establish consistency for a large class of random forests. Finally, it is worth mentioning the interesting work of Borup et al. (2020). In their paper, the authors show that proper predictor targeting controls the probability of placing splits along strong predictors and improves prediction.

The term bagging means Bootstrap Aggregating and was proposed by Breiman (1996) to reduce the variance of unstable predictors. It was popularized in the time-series literature by Inoue and Kilian (2008), who used it to construct forecasts from multiple regression models with local-to-zero regression parameters and errors subject to possible serial correlation or conditional heteroscedasticity. Bagging is designed for situations in which the number of predictors is moderately large relative to the sample size.

An unstable predictor has large variance. Intuitively, small changes in the data yield large changes in the predictive model.

The bagging algorithm in time-series settings has to take the time-dependence dimension into account when constructing the bootstrap samples.

Algorithm 3 (Bagging for Time-Series Models). The Bagging algorithm is defined as follows.

1. Arrange the set of tuples $(y_{t+h}, \boldsymbol{x}_t')$, $t = h+1, \ldots, T$, in the form of a matrix $\boldsymbol{V}$ of dimension $(T-h)\times n$.

2. Construct (block) bootstrap samples of the form $\left\{\left(y^{*(i)}_2, \boldsymbol{x}^{\prime *(i)}_2\right), \ldots, \left(y^{*(i)}_T, \boldsymbol{x}^{\prime *(i)}_T\right)\right\}$, $i = 1, \ldots, B$, by drawing blocks of $M$ rows of $\boldsymbol{V}$ with replacement.

3. Compute the $i$th bootstrap forecast as
$$\widehat{y}^{*(i)}_{t+h|t} = \begin{cases} 0 & \text{if } |t^*_j| < c \;\; \forall j, \\ \widehat{\lambda}^{*(i)\prime}\,\widetilde{\boldsymbol{x}}^{*(i)}_t & \text{otherwise}, \end{cases} \qquad (4.1)$$
where $\widetilde{\boldsymbol{x}}^{*(i)}_t := \boldsymbol{S}^{*(i)}_t \boldsymbol{z}^{*(i)}_t$ and $\boldsymbol{S}_t$ is a diagonal selection matrix whose $j$th diagonal element is
$$\mathbb{I}_{\{|t_j|>c\}} = \begin{cases} 1 & \text{if } |t_j| > c, \\ 0 & \text{otherwise}, \end{cases}$$
$c$ is a pre-specified critical value of the test, and $\widehat{\lambda}^{*(i)}$ is the OLS estimator at each bootstrap repetition.

4. Compute the average forecast over the bootstrap samples:
$$\widetilde{y}_{t+h|t} = \frac{1}{B}\sum_{i=1}^{B} \widehat{y}^{*(i)}_{t+h|t}.$$

Algorithm 3 above requires that it be possible to estimate and conduct inference in the linear model. This is certainly infeasible if the number of predictors is larger than the sample size ($n > T$), which requires the algorithm to be modified. Garcia et al. (2017) and Medeiros et al. (2021) adopt the following changes to the algorithm:
Algorithm 4 (Bagging for Time-Series Models and Many Regressors). The Bagging algorithm is defined as follows.

0. Run $n$ univariate regressions of $y_{t+h}$ on each covariate in $\boldsymbol{x}_t$. Compute the $t$-statistics and keep only the covariates that turn out to be significant at a given pre-specified level. Call this new set of regressors $\check{\boldsymbol{x}}_t$.

1.–4. Proceed as in Algorithm 3, with $\boldsymbol{x}_t$ replaced by $\check{\boldsymbol{x}}_t$.
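The following is a minimal sketch of this pre-testing-plus-bagging strategy (Algorithms 3 and 4 combined). For simplicity it produces a single forecast from the last available observation, uses a moving-block bootstrap, and relies on standard (non-HAC) t-statistics; the block length, critical value, and number of replications are illustrative choices.

```python
import numpy as np
import statsmodels.api as sm

def pretest_bagging_forecast(X, y, x_new, B=200, block=20, c=1.96, seed=1):
    """Sketch of bagging with pre-testing for a linear forecasting model."""
    rng = np.random.default_rng(seed)
    T, n = X.shape

    # Step 0 (Algorithm 4): univariate screening when n is large.
    keep = []
    for j in range(n):
        t_j = sm.OLS(y, sm.add_constant(X[:, j])).fit().tvalues[1]
        if abs(t_j) > c:
            keep.append(j)
    keep = keep if keep else list(range(n))       # fall back to all regressors
    Xk, xk_new = X[:, keep], x_new[keep]

    forecasts = []
    for _ in range(B):
        # Moving-block bootstrap of the rows of (y, X).
        starts = rng.integers(0, T - block + 1, size=int(np.ceil(T / block)))
        idx = np.concatenate([np.arange(s, s + block) for s in starts])[:T]
        yb, Xb = y[idx], Xk[idx]

        # Pre-test: keep only regressors with |t| > c in this bootstrap sample.
        fit = sm.OLS(yb, sm.add_constant(Xb)).fit()
        sel = np.abs(fit.tvalues[1:]) > c
        if not sel.any():
            # No significant predictor: fall back to the mean (eq. (4.1) uses 0).
            forecasts.append(yb.mean())
            continue
        fit_sel = sm.OLS(yb, sm.add_constant(Xb[:, sel])).fit()
        forecasts.append(fit_sel.params[0] + xk_new[sel] @ fit_sel.params[1:])

    return np.mean(forecasts)                     # average over bootstrap forecasts
```

Algorithm 3 computes the whole path of forecasts in every bootstrap replication; the sketch above keeps only the forecast for the last origin in order to stay short.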
Complete Subset Regression (CSR) is a method for combining forecasts developed by Elliott et al. (2013, 2015). The motivation is that selecting the optimal subset of $\boldsymbol{X}_t$ to predict $Y_{t+h}$ by testing all possible combinations of regressors is computationally very demanding and, in most cases, infeasible. For a given set of potential predictor variables, the idea is to combine forecasts by averaging over all possible linear regression models with a fixed number of predictors. For example, with $n$ possible predictors, there are $n$ unique univariate models and
$$n_{k,n} = \frac{n!}{(n-k)!\,k!}$$
different $k$-variate models for $k \leq K$. The set of models for a fixed value of $k$ is known as the complete subset.

When the set of regressors is large, the number of models to be estimated increases rapidly. Moreover, it is likely that many potential predictors are irrelevant. In these cases it was suggested that one should include only a small, fixed number $k$ of predictors, such as five or ten. Nevertheless, the number of models is still very large: for example, with $n = 30$ and $k = 8$, there are 5,852,925 regressions. An alternative solution is to follow Garcia et al. (2017) and Medeiros et al. (2021) and adopt a strategy similar to the one used for Bagging high-dimensional models. The idea is to start by fitting a regression of $Y_{t+h}$ on each of the candidate variables and saving the $t$-statistic of each variable. The $t$-statistics are ranked by absolute value, and we select the $\tilde{n}$ variables that are most relevant in the ranking. The CSR forecast is then computed on these variables for different values of $k$. This approach is based on the Sure Independence Screening of Fan and Lv (2008), extended to dependent data by Yousuf (2018), which aims to select a superset of relevant predictors among a very large set.

Recently, Medeiros and Mendes (2013) proposed the combination of LASSO-based estimation and NN models. The idea is to construct a feedforward single-hidden-layer NN where the parameters of the nonlinear terms (neurons) are randomly generated and the linear parameters are estimated by LASSO (or one of its generalizations). Similar ideas were also considered by Kock and Teräsvirta (2014) and Kock and Teräsvirta (2015).

Trapletti et al. (2000) and Medeiros et al. (2006) proposed to augment a feedforward shallow NN by a linear term. The motivation is that the nonlinear component should capture only the nonlinear dependence, making the model more interpretable. This is in the same spirit as the semi-parametric models considered in Chen (2007).

Inspired by the above ideas, Medeiros et al. (2021) proposed combining random forests with adaLASSO and OLS. The authors considered two specifications. In the first one, called RF/OLS, the idea is to use the variables selected by a Random Forest in an OLS regression. The second approach, named adaLASSO/RF, works in the opposite direction: first select the variables by adaLASSO and then use them in a Random Forest model. The goal is to disentangle the relative importance of variable selection and nonlinearity in forecasting inflation. Recently, Diebold and Shin (2019) proposed the "partially-egalitarian" LASSO to combine survey forecasts.

It is possible to combine forecasts using any weighting scheme. However, it is difficult to beat uniform weighting; see Genre et al. (2013).
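A minimal sketch of a hybrid of the kind described above is given below: a LASSO step selects the predictors and a random forest is then fit on the selected columns. It uses plain (cross-validated) LASSO rather than adaLASSO to keep the example short, so it only approximates the adaLASSO/RF strategy; all tuning values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor

def lasso_then_rf(X, y, x_new, n_trees=500, seed=1):
    """LASSO variable selection followed by a random forest on the survivors."""
    # Step 1: LASSO with cross-validated penalty selects the relevant predictors.
    # Note: standard K-fold CV ignores serial dependence; a time-series split may be preferable.
    lasso = LassoCV(cv=5, random_state=seed).fit(X, y)
    selected = np.flatnonzero(lasso.coef_ != 0)
    if selected.size == 0:                       # nothing selected: forecast the mean
        return y.mean()

    # Step 2: the random forest captures possible nonlinearities in the selected set.
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(X[:, selected], y)
    return rf.predict(x_new[selected].reshape(1, -1))[0]
```

The RF/OLS variant simply reverses the two steps: variables ranked as important by a random forest are passed to an OLS regression.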
With the advances in the ML literature, the number of available forecasting models and methods has been increasing at a fast pace. Consequently, it is very important to apply statistical tools to compare different models. The forecasting literature provides a number of tests, since the seminal paper by Diebold and Mariano (1995), that can be applied as well to the ML models described in this survey.

In Diebold and Mariano's (1995) test, two competing methods have the same unconditional expected loss under the null hypothesis, and the test can be carried out using a simple t-test. A small-sample adjustment was developed by Harvey et al. (1997). See also the recent discussion in Diebold (2015). One drawback of the Diebold-Mariano test is that its statistic diverges under the null when the competing models are nested. However, Giacomini and White (2006) show that the test is valid if the forecasts are derived from models estimated in a rolling-window framework. Recently, McCracken (2020) shows that if the estimation window is fixed, the Diebold-Mariano statistic may diverge under the null. Therefore, it is very important that the forecasts are computed in a rolling-window scheme.

In order to accommodate cases where there are more than two competing models, an unconditional superior predictive ability (USPA) test was proposed by White (2000). The null hypothesis states that a benchmark method outperforms a set of competing alternatives. However, Hansen (2005) showed that White's (2000) test can be very conservative when there are competing methods that are inferior to the benchmark. Another important contribution to the forecasting literature is the model confidence set (MCS) proposed by Hansen et al. (2011). A MCS is a set of competing models built so as to contain the best model with respect to a certain loss function and with a given level of confidence. The MCS acknowledges the potential limitations of the dataset, such that uninformative data yield a MCS with a large number of models, whereas informative data yield a MCS with only a few models. Importantly, the MCS procedure does not assume that a particular model is the true one.

Another extension of the Diebold-Mariano test is the conditional equal predictive ability (CEPA) test proposed by Giacomini and White (2006). In practical applications, it is important to know not only whether a given model is superior but also when it is better than the alternatives. Recently, Li et al. (2020) proposed a very general framework to conduct conditional predictive ability tests.

In summary, it is very important to compare the forecasts from different ML methods, and the literature provides a number of tests that can be used.
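As an illustration, a bare-bones version of the Diebold-Mariano statistic with a Newey-West estimate of the long-run variance of the loss differential might look as follows. The squared-error loss, the truncation-lag rule, and the two-sided normal p-value are illustrative choices, and the small-sample correction of Harvey et al. (1997) is omitted.

```python
import numpy as np
from scipy.stats import norm

def diebold_mariano(e1, e2, h=1):
    """DM test of equal unconditional expected squared-error loss.

    e1, e2: forecast-error series of two competing methods over the same period.
    h: forecast horizon, used to set the Newey-West truncation lag to h - 1.
    """
    d = e1**2 - e2**2                 # loss differential under squared-error loss
    T = d.shape[0]
    d_bar = d.mean()

    # Newey-West long-run variance of d with Bartlett weights.
    lag = max(h - 1, 0)
    lrv = np.var(d, ddof=0)
    for k in range(1, lag + 1):
        gamma_k = np.cov(d[k:], d[:-k], ddof=0)[0, 1]
        lrv += 2 * (1 - k / (lag + 1)) * gamma_k

    dm_stat = d_bar / np.sqrt(lrv / T)
    p_value = 2 * (1 - norm.cdf(abs(dm_stat)))    # two-sided, asymptotic normal
    return dm_stat, p_value
```

A negative statistic favours the first method; for nested models and fixed estimation windows the caveats discussed above apply.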
Penalized regressions are now an important option in the toolkit of applied economists, and there is a vast literature considering the use of such techniques in economic and financial forecasting. Macroeconomic forecasting is certainly one of the most successful applications of penalized regressions. Medeiros and Mendes (2016) applied the adaLASSO to forecasting US inflation and showed that the method outperforms linear autoregressive and factor models. Medeiros and Vasconcelos (2016) show that high-dimensional linear models produce, on average, smaller forecasting errors for macroeconomic variables when a large set of predictors is considered. Their results also indicate that a good selection of the adaLASSO hyperparameters reduces forecasting errors. Garcia et al. (2017) show that high-dimensional econometric models, such as shrinkage and complete subset regression, perform very well in real-time forecasting of Brazilian inflation in data-rich environments. The authors combine forecasts of different alternatives and show that model combination can achieve superior predictive performance. Smeekes and Wijler (2018) consider an application to a large macroeconomic US dataset and demonstrate that penalized regressions are very competitive. Medeiros et al. (2021) conduct a vast comparison of models to forecast US inflation and show that penalized regressions were far superior to several benchmarks, including factor models. Ardia et al. (2019) introduce a general text sentiment framework that optimizes the design for forecasting purposes and apply it to forecasting economic growth in the US. The method includes the use of the elastic net for sparse data-driven selection and the weighting of thousands of sentiment values. Tarassow (2019) considers penalized VARs to forecast six different economic uncertainty variables for the growth of the real M2 and real M4 Divisia money series for the US using monthly data. Uematsu and Tanaka (2019) consider high-dimensional forecasting and variable selection via folded-concave penalized regressions. The authors forecast quarterly US gross domestic product data using a high-dimensional monthly data set and the mixed data sampling (MIDAS) framework with penalization. See also Babii et al. (2020c) and Babii et al. (2020b).

There is also a vast list of applications in empirical finance. Elliott et al. (2013) find that combinations of subset regressions can produce more accurate forecasts of the equity premium than conventional approaches based on equal-weighted forecasts and other regularization techniques. Audrino and Knaus (2016) used LASSO-based methods to estimate forecasting models for realized volatilities. Callot et al. (2017) consider modelling and forecasting the large realized covariance matrices of the 30 Dow Jones stocks by penalized vector autoregressive (VAR) models. The authors find that penalized VARs outperform the benchmarks by a wide margin and improve the portfolio construction of a mean-variance investor. Chinco et al. (2019) use the LASSO to make 1-minute-ahead return forecasts for a vast set of stocks traded on the New York Stock Exchange. The authors provide evidence that penalized regressions estimated by the LASSO boost out-of-sample predictive power by choosing predictors that trace out the consequences of unexpected news announcements.
There are many papers on the application of nonlinear ML methods to economic and financial forecasting. Most of the papers focus on NN methods, especially the ones from the early literature.

With respect to the early papers, most of the models considered were nonlinear versions of autoregressive models. At best, a small number of extra covariates were included. See, for example, Teräsvirta et al. (2005) and the references therein. In the majority of the papers, including Teräsvirta et al. (2005), there was no strong evidence of the superiority of nonlinear models, as the differences in performance were marginal. Other examples from the early literature are Swanson and White (1995), Swanson and White (1997a), Swanson and White (1997b), Balkin and Ord (2000), Tkacz (2001), Medeiros et al. (2001), and Heravi et al. (2004).

More recently, with the availability of large datasets, nonlinear models are back on the scene. For example, Medeiros et al. (2021) show that, despite the skepticism of the previous literature on inflation forecasting, ML models with a large number of covariates are systematically more accurate than the benchmarks for several forecasting horizons, and that Random Forests dominated all other models. The good performance of the Random Forest is due not only to its specific method of variable selection but also to the potential nonlinearities between past key macroeconomic variables and inflation. Another successful example is Gu et al. (2020). The authors show large economic gains to investors using ML forecasts of future stock returns based on a very large set of predictors. The best-performing models are tree-based methods and neural networks. Coulombe et al. (2020) show significant gains when nonlinear ML methods are used to forecast macroeconomic time series. Borup and Schütte (2020) consider penalized regressions, ensemble methods, and random forests to forecast employment growth in the United States over the period 2004-2019 using Google search activity. Their results strongly indicate that Google search data have predictive power. Borup et al. (2020) compute now- and backcasts of weekly unemployment insurance initial claims in the US based on a rich set of daily Google Trends search-volume data and machine learning methods.
In this section we illustrate the use of some of the methods reviewed in this paper to forecast the daily realized variance of the Brazilian stock market index (BOVESPA). We use as regressors information from other major indexes, namely, the S&P500 (US), the FTSE100 (United Kingdom), the DAX (Germany), the Hang Seng (Hong Kong), and the Nikkei (Japan). Our measure of realized volatility is constructed by aggregating intraday returns sampled at the 5-minute frequency. The data were obtained from the Oxford-Man Realized Library at Oxford University (https://realized.oxford-man.ox.ac.uk/data/assets).

For each stock index, we define the realized variance as
$$RV_t = \sum_{s=1}^{S} r_{st}^2,$$
where $r_{st}$ is the log return sampled at the five-minute frequency and $S$ is the number of available returns on day $t$.

The benchmark model is the Heterogeneous Autoregressive (HAR) model proposed by Corsi (2009):
$$\log RV_{t+1} = \beta_0 + \beta_1 \log RV_t + \beta_2 \log RV_{5,t} + \beta_3 \log RV_{22,t} + U_{t+1}, \qquad (6.1)$$
where $RV_t$ is the daily realized variance of the BOVESPA index,
$$RV_{5,t} = \frac{1}{5}\sum_{i=0}^{4} RV_{t-i}, \quad \text{and} \quad RV_{22,t} = \frac{1}{22}\sum_{i=0}^{21} RV_{t-i}.$$

As alternatives we consider an extended HAR model with additional regressors estimated by adaLASSO. We include as extra regressors the daily past volatilities of the other five indexes considered here. The model has a total of eight candidate predictors. Furthermore, we consider two nonlinear alternatives using all predictors: a random forest and shallow and deep neural networks.

The realized variances of the different indexes are illustrated in Figure 7. The data start on February 2, 2000 and end on May 21, 2020, a total of 4,200 observations. The sample includes two periods of very high volatility, namely the financial crisis of 2007-2008 and the Covid-19 pandemic of 2020. We consider a rolling-window exercise, where we set 1,500 observations in each window. The models are re-estimated every day.

Several other authors have estimated nonlinear and machine learning models to forecast realized variances. McAleer and Medeiros (2008) considered a smooth transition version of the HAR, while Hillebrand and Medeiros (2016) considered the combination of smooth transitions, long memory, and neural network models. Hillebrand and Medeiros (2010) and McAleer and Medeiros (2011) combined NN models with bagging, and Scharth and Medeiros (2009) considered smooth transition regression trees. The use of the LASSO and its generalizations to estimate extensions of the HAR model was proposed by Audrino and Knaus (2016).

Although the models are estimated in logarithms, we report the results in levels, which in the end is the quantity of interest. We compare the models according to the Mean Squared Error (MSE) and the QLIKE metric.
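To fix ideas, the following sketch builds the daily realized variance from 5-minute log returns and estimates the HAR benchmark in (6.1) by OLS. The input format (a pandas Series with a datetime index holding 5-minute log returns) and all variable names are assumptions made for the example, not a description of the Oxford-Man files.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def daily_rv(ret5min: pd.Series) -> pd.Series:
    """Realized variance: sum of squared 5-minute log returns within each day."""
    return (ret5min**2).groupby(ret5min.index.date).sum()

def fit_har(rv: pd.Series):
    """HAR in logs: log RV_{t+1} regressed on log RV_t, log RV_{5,t}, log RV_{22,t}."""
    df = pd.DataFrame({"rv_d": rv})
    df["rv_w"] = rv.rolling(5).mean()             # RV_{5,t}: weekly average
    df["rv_m"] = rv.rolling(22).mean()            # RV_{22,t}: monthly average
    X = np.log(df[["rv_d", "rv_w", "rv_m"]]).shift(1)   # regressors dated t
    y = np.log(df["rv_d"])                        # dependent variable dated t+1
    data = pd.concat([y.rename("y"), X], axis=1).dropna()
    return sm.OLS(data["y"], sm.add_constant(data[["rv_d", "rv_w", "rv_m"]])).fit()

# A one-step-ahead forecast uses the last available row of regressors;
# exponentiating the fitted value gives a (naive) forecast in levels.
```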
Figure 7: Realized variance of the different stock indexes (panels: Brazil/Bovespa, US/SPX, Germany/DAX, Hong Kong/Hang Seng, and Japan/Nikkei).

The results are shown in Table 1. The table reports, for each model, the mean squared error (MSE) and the QLIKE statistic as a ratio to the HAR benchmark. Values smaller than one indicate that the model outperforms the HAR. The asterisks indicate the results of the Diebold-Mariano test of equal forecasting performance: *, **, and *** indicate rejection of the null of equal forecasting ability at the 10%, 5%, and 1% levels, respectively. We report results for the full out-of-sample period, the financial crisis years (2007-2008), and the year 2020, as a way to capture the effects of the Covid-19 pandemic on the forecasting performance of the different models.

As we can see from the table, the ML methods considered here outperform the HAR benchmark. The winning model is clearly the HAR model with additional regressors estimated with adaLASSO. Its performance improves during the high-volatility periods, and the gains reach 10% during the Covid-19 pandemic. Random Forests do not perform well. On the other hand, NN models with different numbers of hidden layers outperform the benchmark.
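For completeness, the sketch below computes the two loss ratios reported in Table 1 for a generic model against the HAR benchmark. The QLIKE definition follows one common convention, $L(RV, F) = RV/F - \log(RV/F) - 1$, which is an assumption of this example rather than a definition taken from the paper.

```python
import numpy as np

def loss_ratios(rv, f_model, f_har):
    """MSE and QLIKE of a model's variance forecasts, relative to the HAR benchmark.

    rv:      realized variances (levels)
    f_model: model forecasts (levels)
    f_har:   HAR benchmark forecasts (levels)
    """
    def mse(f):
        return np.mean((rv - f) ** 2)

    def qlike(f):
        # One common QLIKE convention; assumed here, not taken from the paper.
        r = rv / f
        return np.mean(r - np.log(r) - 1.0)

    return mse(f_model) / mse(f_har), qlike(f_model) / qlike(f_har)
```

Ratios below one indicate that the model beats the HAR on that loss; significance would then be assessed with the Diebold-Mariano test sketched earlier.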
In this paper we presented a non-exhaustive review of the most recent developments in machine learning and high-dimensional statistics applied to time-series modeling and forecasting. We considered both linear and nonlinear alternatives, as well as ensemble and hybrid models. Finally, we briefly discussed tests for superior predictive ability.

Table 1: Forecasting Results
The table reports, for each model, the mean squared error (MSE) and the QLIKE statistic as a ratio to the HAR benchmark. Values smaller than one indicate that the model outperforms the HAR. The asterisks indicate the results of the Diebold-Mariano test of equal forecasting performance: *, **, and *** indicate rejection of the null of equal forecasting ability at the 10%, 5%, and 1% levels, respectively.
Full Sample | 2007-2008 | 2020
Model | MSE | QLIKE | MSE | QLIKE | MSE | QLIKE
[Entries for HARX-LASSO, Random Forest, and the neural network specifications are not legible in the source and are omitted.]
Several directions for future research remain open, among them:

2. [...] of $f_h(\boldsymbol{X}_t)$ when the data are dependent.

3. Derive a better understanding of the variable selection mechanism of nonlinear ML methods.

4. Develop inferential methods to assess variable importance in nonlinear ML methods.

5. Develop models based on unstructured data, such as text data, for economic forecasting.

6. Evaluate ML models for nowcasting.

7. Evaluate ML in very unstable environments with many structural breaks.

Finally, we would like to point out that we left a number of other interesting ML methods out of this survey, such as, for example, Support Vector Regressions, autoencoders, nonlinear factor models, and many more. However, we hope that the material presented here can be of value to anyone interested in applying ML techniques to economic and/or financial forecasting.

References

Adámek, R., S. Smeekes, and I. Wilms (2020). LASSO inference for high-dimensional time series. Technical Report 2007.10952, arxiv.

Ardia, D., K. Bluteau, and K. Boudt (2019). Questioning the news about economic growth: Sparse forecasting using thousands of news-based sentiment values.
International Journal ofForecasting 35 , 1370–1386.Audrino, F. and S. D. Knaus (2016). Lassoing the HAR model: A model selection perspectiveon realized volatility dynamics.
Econometric Reviews 35 , 1485–1521.Babii, A., E. Ghysels, and J. Striaukas (2020a). Inference for high-dimensional regressions withheteroskedasticity and autocorrelation. Technical Report 1912.06307, arxiv.Babii, A., E. Ghysels, and J. Striaukas (2020b). Machine learning panel data regressions withan application to nowcasting price earnings ratios. Technical Report 2008.03600, arxiv.Babii, A., E. Ghysels, and J. Striaukas (2020c). Machine learning time series regressions withan application to nowcasting. Technical Report 2005.14057, arxiv.Balkin, S. D. and J. K. Ord (2000). Automatic neural network modeling for univariate timeseries.
International Journal of Forecasting 16 , 509–515.Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Transactions on Information Theory 39 , 930–945.Bartlett, P. and M. M. Traskin (2007). AdaBoost is consistent.
Journal of Machine LearningResearch 8 , 2347–2368.Basu, S. and G. Michailidis (2015). Regularized estimation in sparse high-dimensional timeseries models.
Annals of Statistics 43 , 1535–1567.Belloni, A., V. Chernozhukov, and C. Hansen (2014). Inference on treatment effects afterselection amongst high-dimensional controls.
Review of Economic Studies 81 , 608–650.B¨uhlmann, P. (2006). Boosting for high-dimensional linear models.
Annals of Statistics 34 ,559–583.Borup, D., B. Christensen, N. M¨uhlbach, and M. Nielsen (2020). Targeting predictors in randomforest regression. Technical Report 2004.01411, arxiv.Borup, D., D. Rapach, and E. Sch¨utte (2020). Now- and backcasting initial claims with high-dimensional daily internet search-volume data. Technical Report 3690832, SSRN.Borup, D. and E. Sch¨utte (2020). In search of a job: Forecasting employment growth usingGoogle trends.
Journal of Business and Economic Statistics. Forthcoming.
Breiman, L. (1996). Bagging predictors.
Machine Learning 24 , 123–140.Breiman, L. (2001). Random forests.
Machine Learning 45 , 5–32.B¨uhlmann, P. L. (2002). Consistency for l2boosting and matching pursuit with trees and tree-type basis functions. In
Research report/Seminar f¨ur Statistik, Eidgen¨ossische TechnischeHochschule (ETH) , Volume 109. Seminar f¨ur Statistik, Eidgen¨ossische Technische Hochschule(ETH).Callot, L., A.B., and Kock (2013). Oracle efficient estimation and forecasting with the adaptiveLASSO and the adaptive group LASSO in vector autoregressions. In N. Haldrup, M. Meitz,and P. Saikkonen (Eds.),
Essays in Nonlinear Time Series Econometrics . Oxford UniversityPress.Callot, L., A. Kock, and M. Medeiros (2017). Modeling and forecasting large realized covariancematrices and portfolio choice.
Journal of Applied Econometrics 32 , 140–158.Chan, K.-S. and K. Chen (2011). Subset ARMA selection via the adaptive LASSO.
Statisticsand its Interface 4 , 197–205.Chen, S., D. Donoho, and M. Saunders (2001). Atomic decomposition by basis pursuit.
SIAMreview 43 , 129–159.Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In J. Heckmanand E. Leamer (Eds.),
Handbook of Econometrics . Elsevier.Chen, X., J. Racine, and N. Swanson (2007). Semiparametric ARX neural-network models withan application to forecasting inflation.
IEEE Transactions on Neural Networks 12 , 674–683.Chen, X. and S. Shen (1998). Sieve extremum estimates for weakly dependent data.
Econo-metrica 66 , 289–314.Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. Newey (2017).Double/debiased/neyman machine learning of treatment effects.
American Economic Re-view 107 , 261–265.Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins(2018). Double/debiased machine learning for treatment and structural parameters.
Econo-metrics Journal 21 , C1–C68.Chinco, A., A. Clark-Joseph, and M. Ye (2019). Sparse signals in the cross-section of returns.
Journal of Finance 74 , 449–492.Corsi, F. (2009). A simple long memory model of realized volatility.
Journal of Financial Econometrics 7, 174–196.
Coulombe, P., M. Leroux, D. Stevanovic, and S. Surprenant (2020). How is machine learning useful for macroeconomic forecasting? Technical report, University of Pennsylvania.
Cybenko, G. (1989). Approximation by superposition of sigmoidal functions.
Mathematics ofControl, Signals, and Systems 2 , 303–314.Davis, R. and M. Nielsen (2020). Modeling of time series using random forests: Theoreticaldevelopments.
Electronic Journal of Statistics 14 , 3644–3671.Diebold, F. (2015). Comparing predictive accuracy, twenty years later: A personal perspec-tive on the use and abuse of Diebold-Mariano tests.
Journal of Business and EconomicStatistics 33 , 1–9.Diebold, F. and M. Shin (2019). Machine learning for regularized survey forecast combination:Partially-egalitarian LASSO and its derivatives.
International Journal of Forecasting 35 ,1679–1691.Diebold, F., M. Shin, and B. Zhang (2021). On the aggregation of probability assessments:Regularized mixtures of predictive densities for Eurozone inflation and real interest rates.Technical Report 2012.11649, arxiv.Diebold, F. X. and R. S. Mariano (1995). Comparing predictive accuracy.
Journal of Businessand Economic Statistics 13 , 253–263.Duffy, N. and D. Helmbold (2002). Boosting methods for regression.
Machine Learning 47 ,153–200.Elliott, G., A. Gargano, and A. Timmermann (2013). Complete subset regressions.
Journal ofEconometrics 177 (2), 357–373.Elliott, G., A. Gargano, and A. Timmermann (2015). Complete subset regressions with large-dimensional sets of predictors.
Journal of Economic Dynamics and Control 54 , 86–110.Elliott, G. and A. Timmermann (2008). Economic forecasting.
Journal of Economic Litera-ture 46 , 3–56.Elliott, G. and A. Timmermann (2016). Forecasting in economics and finance.
Annual Reviewof Economics 8 , 81–110.Fan, J. and R. Li (2001). Variable selection via nonconcave penalized likelihood and its oracleproperties.
Journal of the American Statistical Association 96 , 1348–1360.Fan, J. and J. Lv (2008). Sure independence screening for ultrahigh dimensional feature space.
Journal of the Royal Statistical Society, Series B 70 , 849–911.Fan, J., L. Xue, and H. Zou (2014). Strong oracle optimality of folded concave penalizedestimation.
Annals of Statistics 42, 819–849.
Farrell, M., T. Liang, and S. Misra (2021). Deep neural networks for estimation and inference.
Econometrica 89 , 181–213.Foresee, F. D. and M. . T. Hagan (1997). Gauss-newton approximation to Bayesian regu-larization. In
IEEE International Conference on Neural Networks (Vol. 3) , New York, pp.1930–1935. IEEE.Friedman, J. (2001). Greedy function approximation: a gradient boosting machine.
Annals ofStatistics 29 , 1189–1232.Funahashi, K. (1989). On the approximate realization of continuous mappings by neural net-works.
Neural Networks 2 , 183–192.Garcia, M., M. Medeiros, and G. Vasconcelos (2017). Real-time inflation forecasting withhigh-dimensional models: The case of brazil.
International Journal of Forecasting 33 (3),679–693.Genre, V., G. Kenny, A. Meyler, and A. Timmermann (2013). Combining expert forecasts:Can anything beat the simple average?
International Journal of Forecasting 29 , 108–121.Giacomini, R. and H. White (2006). Tests of conditional predictive ability.
Econometrica 74 ,1545–1578.Granger, C. and M. Machina (2006). Forecasting and decision theory.
Handbook of EconomicForecasting 1 , 81–98.Grenander, U. (1981).
Abstract Inference . New York, USA: Wiley.Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning.
Review of Financial Studies 33, 2223–2273.
Zou, H. (2006). The adaptive LASSO and its oracle properties.
Journal of the AmericanStatistical Association 101 , 1418–1429.Hamilton, J. (1994).
Time Series Analysis . Princeton University Press.Han, Y. and R. Tsay (2020). High-dimensional linear regression for dependent data withapplications to nowcasting.
Statistica Sinica 30 , 1797–1827.Hans, C. (2009). Bayesian LASSO regression.
Biometrika 96 , 835–845.Hansen, P. (2005). A test for superior predictive ability.
Journal of Business and EconomicStatistics 23 , 365–380.Hansen, P., A. Lunde, and J. Nason (2011). The model confidence set.
Econometrica 79, 453–497.
Harvey, D., S. Leybourne, and P. Newbold (1997). Testing the equality of prediction mean squared errors.
International Journal of Forecasting 13 , 281–291.Hastie, T., R. Tibshirani, and J. Friedman (2009).
The elements of statistical learning: datamining, inference, and prediction . Springer.Hastie, T., R. Tibshirani, and M. Wainwright (2015).
Statistical learning with sparsity: theLASSO and generalizations . CRC Press.Hecq, A., L. Margaritella, and S. Smeekes (2019). Granger causality testing in high-dimensionalVARs: a post-double-selection procedure. Technical Report 1902.10991, arxiv.Heravi, S., D. Osborne, and C. Birchenhall (2004). Linear versus neural network forecasts foreuropean industrial production series.
International Journal of Forecasting 20 , 435–446.Hillebrand, E. and M. Medeiros (2010). The benefits of bagging for forecast models of realizedvolatility.
Econometric Reviews 29 , 571–593.Hillebrand, E. and M. C. Medeiros (2016). Asymmetries, breaks, and long-range dependence.
Journal of Business and Economic Statistics 34 , 23–41.Hochreiter, S. and J. Schmidhuber (1997). Long short-term memory.
Neural Computation 9 ,1735–1780.Hoerl, A. and R. Kennard (1970). Ridge regression: Biased estimation for nonorthogonalproblems.
Technometrics 12 , 55–67.Hornik, K., M. Stinchombe, and H. White (1989). Multi-layer Feedforward networks are uni-versal approximators.
Neural Networks 2 , 359–366.Hsu, N.-J., H.-L. Hung, and Y.-M. Chang (2008). Subset selection for vector autoregressiveprocesses using LASSO.
Computational Statistics & Data Analysis 52 , 3645–3657.Inoue, A. and L. Kilian (2008). How useful is bagging in forecasting economic time series? a casestudy of U.S. consumer price inflation.
Journal of the American Statistical Association 103 ,511–522.James, W. and C. Stein (1961). Estimation with quadratic loss.
Proceedings of the ThirdBerkeley Symposium on Mathematical Statistics and Probability 1 , 361–379.Jiang, W. (2004). Process consistency for AdaBoost.
Annals of Statistics 32 , 13–29.Kim, Y., H. Choi, and H.-S. Oh (2008). Smoothly clipped absolute deviation on high dimen-sions.
Journal of the American Statistical Association 103 , 1665–1673.Knight, K. and W. Fu (2000). Asymptotics for LASSO-type estimators.
Annals of Statistics 28, 1356–1378.
Kock, A. (2016). Consistent and conservative model selection with the adaptive lasso in stationary and nonstationary autoregressions.
Econometric Theory 32 , 243–259.Kock, A. and L. Callot (2015). Oracle inequalities for high dimensional vector autoregressions.
Journal of Econometrics 186 , 325–344.Kock, A. and T. Ter¨asvirta (2014). Forecasting performance of three automated modellingtechniques during the economic crisis 2007-2009.
International Journal of Forecasting 30 ,616–631.Kock, A. and T. Ter¨asvirta (2015). Forecasting macroeconomic variables using neural networkmodels and three automated model selection techniques.
Econometric Reviews 35 , 1753–1779.Konzen, E. and F. Ziegelmann (2016). LASSO-type penalties for covariate selection and fore-casting in time series.
Journal of Forecasting 35 , 592–612.Koo, B., H. Anderson, M. Seo, and W. Yao (2020). High-dimensional predictive regression inthe presence of cointegration.
Journal of Econometrics 219 , 456–477.Lederer, J., L. Yu, and I. Gaynanova (2019). Oracle inequalities for high-dimensional prediction.
Bernoulli 25 , 1225–1255.Lee, J., D. Sun, Y. Sun, and J. Taylor (2016). Exact post-selection inference with applicationto the LASSO.
Annals of Statistics 44 , 907–927.Lee, J. and Z. G. Z. Shi (2020). On LASSO for predictive regression. Technical Report1810.03140, arxiv.Leeb, H. and B. P¨otscher (2005). Model selection and inference: Facts and fiction.
EconometricTheory 21 , 21–59.Leeb, H. and B. P¨otscher (2008). Sparse estimators and the oracle property, or the return ofHodges’ estimator.
Journal of Econometrics 142 , 201–211.Li, J., Z. Liao, and R. Quaedvlieg (2020). Conditional superior predictive ability. Technicalreport, Erasmus School of Economics.Lockhart, R., J. Taylor, R. Tibshirani, and R. Tibshirani (2014). On asymptotically optimalconfidence regions and tests for high-dimensional models.
Annals of Statistics 42 , 413–468.Lugosi, G. and N. Vayatis (2004). On the Bayes-risk consistency of regularized boosting meth-ods.
Annals of Statistics 32 , 30–55.MacKay, D. J. C. (1992). Bayesian interpolation.
Neural Computation 4 , 415–447.MacKay, D. J. C. (1992). A practical bayesian framework for backpropagation networks.
Neural Computation 4, 448–472.
Masini, R., M. Medeiros, and E. Mendes (2019). Regularized estimation of high-dimensional vector autoregressions with weakly dependent innovations. Technical Report 1912.09002, arxiv.
McAleer, M. and M. Medeiros (2011). Forecasting realized volatility with linear and nonlinear models.
Journal of Economic Surveys 25 , 6–18.McAleer, M. and M. C. Medeiros (2008). A multiple regime smooth transition heterogeneousautoregressive model for long memory and asymmetries.
Journal of Econometrics 147 , 104–119.McCracken, M. (2020). Diverging tests of equal predictive ability.
Econometrica 88 , 1753–1754.Medeiros, M. and E. Mendes (2013). Penalized estimation of semi-parametric additive time-series models. In N. Haldrup, M. Meitz, and P. Saikkonen (Eds.),
Essays in Nonlinear Time Series Econometrics. Oxford University Press.
Medeiros, M. and E. Mendes (2016). ℓ1-regularization of high-dimensional time-series models with non-Gaussian and heteroskedastic errors. Journal of Econometrics 191, 255–271.
Medeiros, M. and E. Mendes (2017). Adaptive LASSO estimation for ARDL models with GARCH innovations.
Econometric Reviews 36 , 622–637.Medeiros, M. and G. Vasconcelos (2016). Forecasting macroeconomic variables in data-richenvironments.
Economics Letters 138 , 50–52.Medeiros, M. C., T. Ter¨asvirta, and G. Rech (2006). Building neural network models for timeseries: A statistical approach.
Journal of Forecasting 25 , 49–75.Medeiros, M. C., G. Vasconcelos, A. Veiga, and E. Zilberman (2021). Forecasting inflation in adata-rich environment: The benefits of machine learning methods.
Journal of Business andEconomic Statistics 39 , 98–119.Medeiros, M. C. and A. Veiga (2005). A flexible coefficient smooth transition time series model.
IEEE Transactions on Neural Networks 16 , 97–113.Medeiros, M. C., A. Veiga, and C. Pedreira (2001). Modelling exchange rates: Smooth tran-sitions, neural networks, and linear models.
IEEE Transactions on Neural Networks 12 ,755–764.Melnyk, I. and A. Banerjee (2016). Estimating structured vector autoregressive models. In
International Conference on Machine Learning, pp. 830–839.
Mhaskar, H., Q. Liao, and T. Poggio (2017). When and why are deep networks better than shallow ones? In
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pp. 2343–2349.
Nardi, Y. and A. Rinaldo (2011). Autoregressive process modeling via the LASSO procedure.
Journal of Multivariate Analysis 102 , 528–549.Park, H. and F. Sakaori (2013). Lag weighted LASSO for time series model.
ComputationalStatistics 28 , 493–504.Park, J. and I. Sandberg (1991). Universal approximation using radial-basis-function networks.
Neural Computation 3 , 246–257.Park, T. and G. Casella (2008). The Bayesian LASSO.
Journal of the American StatisticalAssociation 103 , 681–686.Ren, Y. and X. Zhang (2010). Subset selection for vector autoregressive processes via adaptiveLASSO.
Statistics & Probability Letters 80 , 1705–1712.Samuel, A. (1959). Some studies in machine learning using the game of checkers.
IBM Journalof Research and Development 3.3 , 210–229.Sang, H. and Y. Sun (2015). Simultaneous sparse model selection and coefficient estimationfor heavy-tailed autoregressive processes.
Statistics 49 , 187–208.Scharth, M. and M. Medeiros (2009). Asymmetric effects and long memory in the volatility ofdow jones stocks.
International Journal of Forecasting 25 , 304–325.Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests.
Annals of Statis-tics 43 , 1716–1741.Simon, N., J. Friedman, T. Hastie, and R. Tibshirani (2013). A sparse-group LASSO.
Journal of Computational and Graphical Statistics 22, 231–245.
Smeekes, S. and E. Wijler (2018). Macroeconomic forecasting using penalized regression methods.
International Journal of Forecasting 34, 408–430.
Smeekes, S. and E. Wijler (2020). An automated approach towards sparse single-equation cointegration modelling.
Journal of Econometrics . forthcoming.Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Simpleway to prevent neural networks from overfitting.
Journal of Machine Learning Research 15 ,1929–1958.Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate dis-tribution.
Proceedings of the Third Berkeley Symposium on Mathematical Statistics andProbability 1 , 197–206.Stinchcombe, M. and S. White (1989). Universal approximation using feedforward neural net-works with non-sigmoid hidden layer activation functions. In
Proceedings of the International Joint Conference on Neural Networks, Washington, pp. 613–617. IEEE Press, New York, NY.
Suarez-Fariñas, M., C. Pedreira, and M. C. Medeiros (2004). Local-global neural networks: A new approach for nonlinear time series modelling.
Journal of the American Statistical Associa-tion 99 , 1092–1107.Swanson, N. R. and H. White (1995). A model selection approach to assesssing the informationin the term structure using linear models and artificial neural networks.
Journal of Businessand Economic Statistics 13 , 265–275.Swanson, N. R. and H. White (1997a). Forecasting economic time series using flexible versusfixed specification and linear versus nonlinear econometric models.
International Journal ofForecasting 13 , 439–461.Swanson, N. R. and H. White (1997b). A model selection approach to real-time macroeconomicforecasting using linear models and artificial neural networks.
Review of Economic andStatistics 79 , 540–550.Tarassow, A. (2019). Forecasting u.s. money growth using economic uncertainty measures andregularisation techniques.
International Journal of Forecasting 35 , 443–457.Taylor, J., R. Lockhart, R. Tibshirani, and R. Tibshirani (2014). Post-selection adaptiveinference for least angle regression and the LASSO. Technical Report 1401.3889, arxiv.Ter¨asvirta, T. (1994). Specification, estimation, and evaluation of smooth transition autore-gressive models.
Journal of the American Statistical Association 89 , 208–218.Ter¨asvirta, T., D. Tj¨ostheim, and C. Granger (2010).
Modelling Nonlinear Economic TimeSeries . Oxford, UK: Oxford University Press.Ter¨asvirta, T., D. van Dijk, and M. Medeiros (2005). Linear models, smooth transition autore-gressions and neural networks for forecasting macroeconomic time series: A reexamination(with discussion).
International Journal of Forecasting 21 , 755–774.Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO.
Journal of the RoyalStatistical Society, Series B 58 , 267–288.Tikhonov, A. (1943). On the stability of inverse problems.
Doklady Akademii Nauk SSSR 39 ,195–198. in Russian.Tikhonov, A. (1963). On the solution of ill-posed problems and the method of regularization.
Doklady Akademii Nauk 151 , 501–504.Tikhonov, A. and V. Arsenin (1977).
Solutions of ill-posed problems . V.H Winston and Sons.Tkacz, G. (2001). Neural network forecasting of Canadian GDP growth.
International Journal of Forecasting 17, 57–69.
Trapletti, A., F. Leisch, and K. Hornik (2000). Stationary and integrated autoregressive neural network processes.
Neural Computation 12 , 2427–2450.Uematsu, Y. and S. Tanaka (2019). High-dimensional macroeconomic forecasting and variableselection via penalized regression.
The Econometrics Journal 22 , 34–56.van de Geer, S., P. B¨uhlmann, Y. Ritov, and R. Dezeure (2014). On asymptotically optimalconfidence regions and tests for high-dimensional models.
Annals of Statistics 42 , 1166–1202.Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effectsusing random forests.
Journal of the American Statistical Association 113 , 1228–1242.Wang, H. and C. Leng (2008). A note on adaptive group LASSO.
Computational Statistics &data analysis 52 , 5277–5286.Wang, H., G. Li, and C.-L. Tsai (2007). Regression coefficient and autoregressive order shrink-age and selection via the LASSO.
Journal of the Royal Statistical Society, Series B 69 ,63–78.White, H. (2000). A reality check for data snooping.
Econometrica 68 , 1097–1126.Wong, K., Z. Li, and A. Tewari (2020). LASSO guarantees for β -mixing heavy tailed timeseries. Annals of Statistics 48 , 1124–1142.Wu, W. (2005). Nonlinear system theory: Another look at dependence.
Proceedings of theNational Academy of Sciences 102 , 14150–14154.Wu, W. and Y. Wu (2016). Performance bounds for parameter estimates of high-dimensionallinear models with correlated errors.
Electronic Journal of Statistics 10 , 352–379.Xie, F., L. Xu, and Y. Yang (2017). LASSO for sparse linear regression with exponentially β -mixing errors. Statistics & Probability Letters 125 , 64–70.Xue, Y. and M. Taniguchi (2020). Modified LASSO estimators for time series regression modelswith dependent disturbances.
Statistical Methods & Applications 29 , 845–869.Yang, Y. and H. H. Zou (2015). A fast unified algorithm for solving group-LASSO penalizelearning problems.
Statistics and Computing 25 , 1129–1141.Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks.
NeuralNetworks 94 , 103–114.Yoon, Y., C. Park, and T. Lee (2013). Penalized regression models with autoregressive errorterms.
Journal of Statistical Computation and Simulation 83 , 1756–1772.Yousuf, K. (2018). Variable screening for high dimensional time series.
Electronic Journal of Statistics 12, 667–702.
Yuan, M. and Y. Lin (2006). Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B 68 , 49–67.Zhang, T. and B. Yu (2005). Boosting with early stopping: Convergence and consistency.
Annals of Statistics 33 , 1538–1579.Zhao, P. and B. Yu (2006). On model selection consistency of LASSO.
Journal of Machinelearning research 7 , 2541–2563.Zhu, X. (2020). Nonconcave penalized estimation in sparse vector autoregression model.
Elec-tronic Journal of Statistics 14 , 1413–1448.Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net.
Journalof the Royal Statistical Society, Series B 67 , 301–320.Zou, H. and H. Zhang (2009). On the adaptive elastic-net with a diverging number of param-eters.