Machine Learning Time Series Regressions with an Application to Nowcasting
Andrii Babii †   Eric Ghysels ‡   Jonas Striaukas §

June 1, 2020
Abstract
This paper introduces structured machine learning regressions for high-dimensional time series data potentially sampled at different frequencies. The sparse-group LASSO estimator can take advantage of such time series data structures and outperforms the unstructured LASSO. We establish oracle inequalities for the sparse-group LASSO estimator within a framework that allows for mixing processes and recognizes that financial and macroeconomic data may have heavier than exponential tails. An empirical application to nowcasting US GDP growth indicates that the estimator performs favorably compared to other alternatives and that text data can be a useful addition to more traditional numerical data.
Keywords: high-dimensional time series, text data, mixed frequency data, sparse-group LASSO, Fuk-Nagaev inequality, tau-dependent processes.

∗ We thank participants at the Financial Econometrics Conference at the TSE Toulouse, the JRC Big Data and Forecasting Conference, the Big Data and Machine Learning in Econometrics, Finance, and Statistics Conference at the University of Chicago, the Nontraditional Data, Machine Learning, and Natural Language Processing in Macroeconomics Conference at the Board of Governors, and the AI Innovations Forum organized by SAS and the Kenan-Flagler Business School, as well as Jianqing Fan, Michele Lenza, and Dacheng Xiu for comments. All remaining errors are ours.
† Department of Economics, University of North Carolina–Chapel Hill, Gardner Hall, CB 3305, Chapel Hill, NC 27599-3305. Email: [email protected]
‡ Department of Economics and Kenan-Flagler Business School, University of North Carolina–Chapel Hill. Email: [email protected].
§ LIDAM UC Louvain and FRS–FNRS Research Fellow. Email: [email protected].

1 Introduction
The statistical imprecision inherent in the quarterly gross domestic product (GDP) estimates, together with the fact that even the first estimate is available with a delay of nearly a month, poses a significant challenge to policymakers and other observers with an interest in monitoring the state of the economy in real time. A term that originated in meteorology, nowcasting pertains to the prediction of the present and very near future. Nowcasting is intrinsically a mixed frequency data problem, as the object of interest is a low-frequency data series (observed, say, quarterly, like GDP), whereas real-time information (daily, weekly, or monthly) during the quarter can be used to assess and potentially continuously update the state of the low-frequency series, or, put differently, nowcast the series of interest. Traditional methods used for nowcasting rely on dynamic factor models that treat the underlying low-frequency series of interest as a latent process observed with noise through high-frequency data. These models are naturally cast in a state-space form, and inference can be performed using standard Kalman filtering techniques.

So far, nowcasting has mostly relied on so-called standard macroeconomic data releases. Perhaps the most prominent among these releases in the US is the Bureau of Labor Statistics Employment Situation report, which is issued on the first Friday of every month. This report includes data on payroll employment, unemployment, earnings, and many other aspects of the labor market. The nature of business cycles, in which most sectors of the economy tend to move together, implies that good news for the labor market (or for manufacturing, construction, retail trade, and so on) usually reflects good news for the economy as a whole. The Employment report releases are followed closely not just by economists, but also by market participants, people in business, and the media.
Besides labor market data, nowcasting models typically also rely on construction spending, the (non-)manufacturing reports, price indices, etc., which we will call traditional macroeconomic data. One prominent example is produced by the Federal Reserve Bank of New York, using a dynamic factor model with thirty-seven predictors of different frequencies. See, e.g., Ghysels, Horan, and Moench (2018) for a recent discussion of macroeconomic data revision and publication delays, and Bańbura, Giannone, Modugno, and Reichlin (2013) for a recent survey.

A collection of Wall Street Journal articles that has recently been made available features a taxonomy of 180 topics. Which topics are relevant? How should they be selected? Thorsrud (2020) constructs a daily business cycle index based on quarterly GDP growth and textual information contained in a daily business newspaper, using a time-varying dynamic factor model where dynamic sparsity is enforced upon the factor loadings using a latent threshold mechanism. His work shows the feasibility of using variations of the traditional state-space setting. Yet, the challenges grow when we start thinking about also adding potentially large-dimensional traditional data sets as well as non-traditional data such as, for example, payment systems information or GPS tracking data.

We study nowcasting low-frequency series (focusing on the key example of US GDP growth) in a data-rich environment, where our data not only include conventional high-frequency series but also non-standard data generated by textual analysis. The latter type of data is shown to be statistically significant using HAC-based inference based on Babii, Ghysels, and Striaukas (2020). Our nowcasts are superior to those posted by the Federal Reserve Bank of New York, which involve proprietary information, whereas our models exclusively rely on public domain data and, most importantly, involve high-dimensional non-conventional data sources.
To deal with such massive non-traditional datasets we need to rely on a different approach, one involving machine learning methods dealing with data sampled at different frequencies. We adopt a MIDAS (Mixed Data Sampling) approach. See Bybee, Kelly, Manela, and Xiu (2020) and the website http://structureofnews.com/. Studies for Canada (Galbraith and Tkacz (2018)), Denmark (Carlsen and Storgaard (2010)), India (Raju and Balakrishnan (2019)), Italy (Aprigliano, Ardizzi, and Monteforte (2019)), Portugal (Duarte, Rodrigues, and Rua (2017)), and the United States (Barnett, Chauvet, Leiva-Leon, and Su (2016)) find that payment transactions can help with nowcasting and with forecasting GDP and private consumption in the short term. Another related application is Moriwaki (2019), who nowcasts unemployment rates with smartphone GPS data, among others.

Relatively little is known about handling high-dimensional mixed frequency data. Among the exceptions is Andreou, Gagliardini, Ghysels, and Rubin (2019), who study principal component analysis with large dimensional panels and focus on mixed frequency data. The attractive feature of the sg-LASSO estimator is that it allows us to combine effectively the approximately sparse and dense signals; see, e.g., Carrasco and Rossi (2016) for a comprehensive treatment of ill-posed dense time series regressions. We recognize that economic and financial time series data are frequently heavy-tailed, while the bulk of the machine learning methods assumes i.i.d. data and/or exponential tails for covariates and regression errors; see Belloni, Chernozhukov, Chetverikov, Hansen, and Kato (2018) for a comprehensive review of high-dimensional regressions with i.i.d. data. There have been several recent attempts to expand the asymptotic theory to settings involving time series dependent data, mostly for the LASSO estimator. For instance, Kock and Callot (2015) establish oracle inequalities for the VAR with i.i.d.
errors; Wong, Li, and Tewari (2019) consider β-mixing series with exponential tails; Wu and Wu (2016), Han and Tsay (2017), and Chernozhukov, Härdle, Huang, and Wang (2019) allow for polynomial tails under the functional dependence measure of Wu (2005). Despite these efforts, there is no complete estimation theory for high-dimensional time series regressions under assumptions comparable to those of the classical GMM and QML estimators. To the best of our knowledge, high-dimensional mixing processes with polynomial tails have not been treated in the relevant literature. This paper fills this gap in the literature, relying on the Fuk-Nagaev inequality for τ-dependent processes recently obtained in Babii, Ghysels, and Striaukas (2020) in the context of the LASSO estimator. The sparse-group LASSO was introduced by Simon, Friedman, Hastie, and Tibshirani (2013). The idea to apply group structures to time series covariates is novel. In contrast to the group LASSO, the sparse-group LASSO promotes sparsity between and within groups (i.e., lags of time series covariates). τ-dependence coefficients are introduced in Dedecker and Prieur (2004) and Dedecker and Prieur (2005). The Fuk-Nagaev inequality, cf. Fuk and Nagaev (1971), describes the concentration of sums of random variables with a mixture of sub-Gaussian and polynomial tails. It provides sharp estimates of tail probabilities, unlike Markov's bound in conjunction with the Marcinkiewicz-Zygmund or Rosenthal moment inequalities. Our results cover the LASSO and the group LASSO as special cases and, to the best of our knowledge, such treatment of other multi-penalty regularized estimators, e.g., the elastic net, is not currently available even in the i.i.d. case.

The rest of the paper is organized as follows. Section 2 presents the generic time series regression setting used in the paper.
Section 3 characterizes the non-asymptotic estimation and prediction accuracy of the sg-LASSO estimator for τ-dependent processes with polynomial tails. We report on a Monte Carlo study in Section 4, which provides further insights about the validity of our theoretical analysis in small sample settings typically encountered in empirical applications. Section 5 covers the empirical application. Conclusions appear in Section 6.

Notation:
For a random variable X ∈ R, let ‖X‖_q = (E|X|^q)^{1/q}, q ≥ 1, denote its L_q norm. For p ∈ N, put [p] = {1, 2, ..., p}. For a vector Δ ∈ R^p and a subset J ⊂ [p], let Δ_J be the vector in R^p with the same coordinates as Δ on J and zero coordinates on J^c. Let G be a partition of [p] defining the group structure. For a vector β ∈ R^p, the sparse-group structure is described by a pair (S, G), where S = {j ∈ [p] : β_j ≠ 0} is the support of β and G = {G ∈ G : β_G ≠ 0} is its group support. For b ∈ R^p, its ℓ_q norm, q ≥ 1, is |b|_q = (Σ_{j=1}^p |b_j|^q)^{1/q} for q < ∞ and |b|_∞ = max_{1≤j≤p} |b_j|. For u, v ∈ R^T, the empirical inner product is defined as ⟨u, v⟩_T = (1/T) Σ_{t=1}^T u_t v_t with the induced empirical norm ‖v‖²_T = ⟨v, v⟩_T = |v|²_2/T. For a symmetric p × p matrix A, let vech(A) ∈ R^{p(p+1)/2} be its vectorization consisting of the lower triangular and the diagonal parts. For a, b ∈ R, we put a ∨ b = max{a, b} and a ∧ b = min{a, b}. Lastly, we write a_n ≲ b_n if there exists a (sufficiently large) absolute constant C such that a_n ≤ C b_n for all n ≥ 1, and a_n ∼ b_n if a_n ≲ b_n and b_n ≲ a_n.

τ-dependence coefficients are weaker than mixing coefficients. Therefore, our results cover mixing processes, which is not the case for the physical dependence measures.

2 Time series regressions and sparse-group LASSO
Let (y_t)_{t∈[T]} be the target series measured at discrete time points t ∈ [T]. Predictions of y_t can involve its lags as well as a large set of covariates and lags thereof. In the interest of generality, but more importantly because of the empirical relevance, we allow the covariates to be sampled at higher frequencies, with the same frequency being a special case. More specifically, let there be K covariates {x_{t−j/m,k} : j ∈ [m], t ∈ [T], k ∈ [K]}, possibly measured at some higher frequency with m observations for every t, and consider the following regression model:

φ(L) y_t = Σ_{k=1}^K ψ(L^{1/m}; β_k) x_{t,k} + u_t,   t ∈ [T],   (1)

where φ(L) = I − ρ_1 L − ρ_2 L² − ··· − ρ_J L^J is the low-frequency lag polynomial and ψ(L^{1/m}; β_k) x_{t,k} = (1/m) Σ_{j=1}^m β_{j,k} x_{t−j/m,k} is the high-frequency lag polynomial. For m = 1, we have a standard autoregressive distributed lag (ARDL) model, which is the workhorse regression model of the time series econometrics literature.

The ARDL-MIDAS model (using the terminology of Andreou, Ghysels, and Kourtellos (2013)) features J + 1 + m × K parameters. In the big data setting with a large number of covariates sampled at high frequency, the total number of parameters may be large compared to the effective sample size or may even exceed it. This leads to poor estimation and out-of-sample prediction accuracy in finite samples. For instance, with m = 3 (quarterly/monthly setting) and 35 covariates at 4 lagged quarters, we need to estimate 4 × m × K = 420 parameters. At the same time, say, the post-WWII quarterly GDP growth series has fewer than 300 observations.

The LASSO estimator, see Tibshirani (1996), offers an appealing convex relaxation of a difficult non-convex best subset selection problem. By construction, it produces sparse parsimonious models, zeroing out a large number of the estimated parameters.
The model selection is not free and comes at a price that can be high in a low signal-to-noise environment with heavy-tailed dependent data. In this paper, we focus on structured sparsity with additional dimensionality reductions that aim to improve upon the unstructured LASSO estimator. For a natural number N, we denote [N] = {1, 2, ..., N}. Note that the polynomial ψ(L^{1/m}; β_k) x_{t,k} only involves m lags, which is done for the sake of simplicity and without loss of generality. In addition, regression (1) involves high-frequency lags t − j/m. In some applications, lags t − 1 − j/m might be more suitable, or a combination of both, again without loss of generality. We parameterize the high-frequency lag polynomial as

ψ(L^{1/m}; β_k) x_{t,k} = (1/m) Σ_{j=1}^m ω(j/m; β_k) x_{t−j/m,k},   (2)

where dim(β_k) = L < m. The weight function ω : [0, 1] × R^L → R is approximated as

ω(t; β_k) ≈ Σ_{l=1}^L β_{k,l} w_l(t),   t ∈ [0, 1],   (3)

where {w_l : l = 1, ..., L} is a collection of functions, called the dictionary. The simplest example of a dictionary consists of algebraic power polynomials, also known as Almon (1965) polynomials in time series analysis. More generally, the dictionary may consist of arbitrary approximating functions, including classical orthonormal bases.

The size of the dictionary L and the number of covariates K can still be large, and approximate sparsity is a key assumption imposed throughout the paper. With approximate sparsity, we recognize that assuming that most of the estimated coefficients are zero is overly restrictive and that the approximation error should be taken into account. For instance, the weight function may have an infinite series expansion; nonetheless, most of it can be captured by a relatively small number of orthogonal basis functions.
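The dictionary construction in (2)-(3) can be made concrete with a small sketch. The snippet below builds a matrix of shifted Legendre polynomials (one of the orthonormal dictionaries used later in the Monte Carlo) evaluated at the high-frequency lag grid j/m, so that a weight function is a linear combination of the columns; the coefficient vector `beta` is a hypothetical illustration, not a value from the paper.

```python
import numpy as np

def legendre_dictionary(m, L):
    """Dictionary matrix W with W[j-1, l-1] = w_l(j/m), where w_l are
    Legendre polynomials mapped from [-1, 1] to the [0, 1] interval."""
    grid = np.arange(1, m + 1) / m          # high-frequency lag grid j/m
    W = np.empty((m, L))
    for l in range(L):
        coefs = np.zeros(L)
        coefs[l] = 1.0
        # numpy's Legendre basis lives on [-1, 1]; shift it to [0, 1]
        W[:, l] = np.polynomial.legendre.legval(2.0 * grid - 1.0, coefs)
    return W

# omega(s) = sum_l beta_l w_l(s) is linear in beta, so the MIDAS term
# (1/m) sum_j omega(j/m) x_{t-j/m} becomes a linear regression in beta.
m, L = 12, 3
W = legendre_dictionary(m, L)
beta = np.array([1.0, -0.5, 0.2])           # hypothetical dictionary coefficients
omega = W @ beta                            # implied weights at j/m, j = 1..m
```

Because the approximation is linear in β_k, the nonlinear MIDAS weighting problem reduces to an ordinary (penalized) linear regression with L coefficients per covariate instead of m.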
Similarly, there can be a large number of economically relevant predictors; nonetheless, it might be sufficient to select only a smaller number of the most relevant ones to achieve good out-of-sample forecasting performance. Both model selection goals can be achieved with the LASSO estimator. However, the LASSO does not recognize that covariates at different (high-frequency) lags are temporally related.

In the baseline model, all high-frequency lags (or approximating functions once we parametrize the lag polynomial) of a single covariate constitute a group. We can also assemble all lagged dependent variables into a group. Other group structures could be considered, for instance combining various covariates into a single group, but we will work with the simplest group setting of the aforementioned baseline model. The sparse-group LASSO (sg-LASSO), see Simon, Friedman, Hastie, and Tibshirani (2013), promotes sparsity between and within groups and allows us to capture the predictive information from each group, such as approximating functions from the dictionary or specific covariates from each group.

See appendix section A.1 for more examples of dictionaries. Using orthogonal polynomials typically reduces the multicollinearity and leads to better finite sample performance. The specification in (2) deviates from the standard MIDAS polynomial specification and results in a linear regression model, a subtle but key innovation as it maps MIDAS regressions into the standard regression framework. Selecting the most important elements from the dictionary to approximate the MIDAS weights is superior to selecting, e.g., a polynomial of fixed degree; see DeVore (1998) for the comparison between linear and nonlinear approximation. It should also be noted that Marsilli (2014) and Uematsu and Tanaka (2019) are recent examples extending the MIDAS regression setting to a penalized regression setting. None of these existing papers provide an asymptotic theory supporting the proposed methods.

Figure 1: Geometry of {b ∈ R² : Ω(b) ≤ 1} for different groupings and values of α: (a) LASSO (α = 1); (b) group LASSO with 1 group (α = 0); (c) sg-LASSO with 1 group and (d) sg-LASSO with 2 groups, both for an intermediate value of α.

To describe the estimation procedure, let y = (y_1, ..., y_T)⊤ be the vector of the dependent variable and let X = (ι, y_1, ..., y_J, Z_1 W, ..., Z_K W) be the design matrix, where ι = (1, 1, ..., 1)⊤ is a vector of ones, y_j = (y_{1−j}, ..., y_{T−j})⊤, Z_k = (x_{k,t−j/m})_{t∈[T], j∈[m]} is the T × m matrix of the covariate k ∈ [K], and W = (m⁻¹ w_l(j/m))_{j∈[m], l∈[L]} is an m × L matrix of weights. In addition, put β = (β_0⊤, β_1⊤, ..., β_K⊤)⊤, where β_0 = (ρ_1, ρ_2, ..., ρ_J)⊤ is the vector of parameters pertaining to the group of autoregressive coefficients and β_k ∈ R^L denotes the parameters of the high-frequency lag polynomial pertaining to the covariate k ≥ 1. Then the sparse-group LASSO estimator, denoted β̂, solves the penalized least-squares problem

min_{b∈R^p} ‖y − Xb‖²_T + 2λΩ(b)   (4)

with a penalty function that interpolates between the ℓ_1 LASSO penalty and the ℓ_{2,1} group LASSO penalty,

Ω(b) = α|b|_1 + (1 − α)‖b‖_{2,1},

where ‖b‖_{2,1} = Σ_{G∈G} |b_G|_2 is the group LASSO norm. The amount of penalization is controlled by the regularization parameter λ > 0, while α ∈ [0, 1] is a weight parameter that determines the relative importance of the sparsity and the group structure. Setting α = 1, we obtain the LASSO estimator, while setting α = 0 leads to the group LASSO estimator. In practice, groups are defined by a particular problem, while α can be fixed or selected in a data-driven way. Figure 1 illustrates the geometry of Ω for different groupings and different values of α. The estimator can be computed efficiently using an appropriate coordinate descent algorithm, cf. Simon, Friedman, Hastie, and Tibshirani (2013). Note that with a single group, the penalty resembles the elastic net penalty, with the only difference that we have |·|_2 instead of |·|²_2, so that the sg-LASSO may achieve regularization goals similar to the elastic net.

3 High-dimensional time series regressions

We focus on the generic dynamic linear regression model that nests the ARDL-MIDAS regression as a special case:

y_t = E[y_t | F_t] + u_t,   E[u_t | F_t] = 0,

where (y_t)_{t∈Z} is a real-valued stochastic process and (F_t)_{t∈Z} is a filtration. The filtration reflects the information set available at a particular point of time and is generated by a large number of covariates, lags of covariates, as well as lags of the dependent variable. We approximate the conditional mean by the best linear prediction in the L_2 norm, denoted X_t⊤β, where (X_t)_{t∈Z} is a stochastic process in R^p that may include some covariates, lags of covariates up to a certain order, as well as lags of the dependent variable. Using the setting of equation (4), in vector notation we write

y = m + u,

where y = (y_1, ..., y_T)⊤, m = (E[y_1|F_1], ..., E[y_T|F_T])⊤, and u = y − m. The best linear approximation is denoted Xβ, where X is a T × p design matrix and β ∈ R^p is a vector of unknown parameters. We measure the time series dependence with τ-dependence coefficients.
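The interpolation property of the penalty Ω is easy to verify numerically. The sketch below evaluates Ω(b) = α|b|₁ + (1 − α)‖b‖₂,₁ for a toy coefficient vector and a hypothetical two-group partition; the vector and grouping are illustrative, not from the paper.

```python
import numpy as np

def sg_lasso_penalty(b, groups, alpha):
    """Omega(b) = alpha * |b|_1 + (1 - alpha) * sum_G |b_G|_2.

    `groups` is a list of index arrays partitioning {0, ..., p-1}.
    alpha = 1 recovers the LASSO penalty, alpha = 0 the group LASSO penalty.
    """
    l1 = np.sum(np.abs(b))
    group_norm = sum(np.linalg.norm(b[g]) for g in groups)
    return alpha * l1 + (1.0 - alpha) * group_norm

b = np.array([3.0, 4.0, 0.0, -2.0])          # hypothetical coefficients
groups = [np.array([0, 1]), np.array([2, 3])]  # hypothetical two-group partition
print(sg_lasso_penalty(b, groups, 1.0))   # 9.0  (the l1 norm)
print(sg_lasso_penalty(b, groups, 0.0))   # 7.0  (group norms 5.0 + 2.0)
```

Note how the group term charges a whole group at once: zeroing a full group (here indices 2-3) reduces the α = 0 penalty by the group's ℓ₂ norm, which is what drives group-level selection.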
For a σ-algebra M and a random vector ξ ∈ R^l, the τ coefficient is defined as

τ(M, ξ) = sup_{f∈Λ(R^l)} ∫_R ‖F_{f(ξ)|M}(t) − F_{f(ξ)}(t)‖_1 dt,

where Λ(R^l) = {f : R^l → R : |f(x) − f(y)| ≤ |x − y|} is the set of 1-Lipschitz functions, F_{f(ξ)} is the CDF of f(ξ), and F_{f(ξ)|M} is the CDF of f(ξ) conditional on M. Let (ξ_t)_{t∈Z} be a stochastic process and let M_t = σ(ξ_t, ξ_{t−1}, ...) be its natural filtration. The τ-dependence coefficient is defined as

τ_k = sup_{j≥1} max_{1≤l≤j} (1/l) sup { τ(M_t, (ξ_{t_1}, ..., ξ_{t_l})) : t + k ≤ t_1 < ··· < t_l }.
Theorem 3.1. Suppose that Assumptions 3.1, 3.2, and 3.3 are satisfied. Then there exist A_1, A_2 > 0 such that with probability at least 1 − δ − A_1 s_α^κ̃ p T^{1−κ̃} − p(p + 1) e^{−A_2 T/s_α²},

‖X(β̂ − β)‖_T ≲ s_α λ + ‖m − Xβ‖_T

and

Ω(β̂ − β) ≲ s_α λ + λ⁻¹ ‖m − Xβ‖²_T + s_α^{1/2} ‖m − Xβ‖_T,

where s_α^{1/2} = α√|S| + (1 − α)√|G| and κ̃ = ((ã + 1)q̃ − 1)/(ã + q̃ − 1).

In the special case of the LASSO estimator, α = 1, we obtain the counterpart to the result of Belloni, Chen, Chernozhukov, and Hansen (2012) for the LASSO with i.i.d. data that takes into account the approximation error. For the other extreme, α = 0, we obtain non-asymptotic bounds for the group LASSO reflecting the approximation error. We call the constant s_α the effective sparsity. The effective sparsity is a linear combination of the sparsity and group sparsity constants with weights defined by the penalty function.

An immediate consequence of the bounds stated in Theorem 3.1 is the asymptotic guarantee for the sg-LASSO estimator presented in the following corollary, which we state under the assumption that the approximation error is negligible and the dimension/sparsity increase at a certain rate.

Assumption 3.4.
Suppose that (i) ‖m − Xβ‖_T = O_P(s_α λ); (ii) s_α^κ̃ p T^{1−κ̃} → 0 and p² e^{−A_2 T/s_α²} → 0.

Corollary 3.1.
Suppose that Assumptions 3.1, 3.2, 3.3, and 3.4 are satisfied. Then

‖X(β̂ − β)‖_T = O_P( s_α p^{1/κ} T^{1/κ−1} ∨ s_α √(log p / T) )

and

Ω(β̂ − β) = O_P( s_α p^{1/κ} T^{1/κ−1} ∨ s_α √(log p / T) ).
If the effective sparsity constant is fixed, then p = o(T^{κ−1}) is a sufficient condition for the prediction error and the Ω-norm error to converge to zero, whenever κ̃ ≥ κ − 1. In this case Assumption 3.4 (ii) is vacuous. Convergence rates reflect a trade-off between tails, dependence, and the number of covariates. The number of covariates p can increase at a faster rate than the sample size, provided that κ > 2, which is not the case for the classical OLS, ridge regression, and PCR estimators that require p/T → 0.

Remark 3.1.
Since the ℓ_2-norm is equivalent to the Ω-norm whenever groups have fixed size, the ℓ_2-norm convergence rate is the same.

Remark 3.2.
For a fixed sparsity constant, in the special case of the LASSO estimator with independent data, Caner and Kock (2018) obtain a convergence rate of order O_P(p^{1/q}/√T). Since κ → q as a → ∞, we recover the O_P(p^{1/q} T^{1/q−1} ∨ √(log p / T)) convergence rate that one would obtain in the case of independent data applying directly Fuk and Nagaev (1971), Corollary 4, whence we conclude that the dependence on q is optimal. Furthermore, increasing q, the polynomial term can be made arbitrarily small compared to the sub-Gaussian term. Therefore, the Fuk-Nagaev inequality provides a more accurate description of the performance of the LASSO estimator for financial and economic time series data that are often believed to have heavier than sub-Gaussian tails. Recall that sub-Gaussianity requires that moments of all orders q ≥ 1 exist.

4 Monte Carlo study

In this section, we aim to assess the out-of-sample predictive performance (forecasting and nowcasting) and the MIDAS weight recovery of the sg-LASSO with dictionaries. We benchmark the performance of our novel sg-LASSO setup against two alternatives: (a) unstructured, meaning standard, LASSO with MIDAS, and (b) unstructured LASSO with unrestricted lag polynomial. The former allows us to assess the benefits of exploiting group structures, whereas the latter focuses on the advantages of using dictionaries in a high dimensional setting.

4.1 Simulation Design

To assess the predictive performance and the MIDAS weight recovery, we simulate the data from the following DGP:

y_t = ρ_1 y_{t−1} + ρ_2 y_{t−2} + Σ_{k=1}^K (1/m) Σ_{j=1}^m ω(j/m; β_k) x_{t−j/m,k} + u_t,

where u_t ∼ i.i.d. N(0, σ_u²) and the DGP for the covariates {x_{k,t−j/m} : k = 1, ..., K} is specified below. This corresponds to a target of interest y_t driven by two autoregressive lags augmented with high-frequency series; hence, the DGP is an ARDL-MIDAS model. We set σ_u = 1, ρ_1 = 0. , ρ_2 = 0. , and K = 3.
In some scenarios we also decrease the signal-to-noise ratio, setting σ_u = 5. We are interested in quarterly/monthly data and use four quarters of data for the high-frequency regressors, so that m = 12. We rely on a commonly used weighting scheme in the MIDAS literature; namely, ω(s; β_k) for k = 1, 2, and 3 are determined by beta densities respectively equal to Beta(1, ), Beta(2, ), and Beta(2, ). For the covariates, we consider:

1. K i.i.d. realizations of the univariate autoregressive (AR) process x_h = ρ x_{h−1} + ε_h, where ρ = 0. , ε_h ∼ i.i.d. N(0, σ_ε²) with σ_ε = 5, or ε_h ∼ i.i.d. Student-t(5), where h denotes the high-frequency sampling.

2. A multivariate vector autoregressive (VAR) process X_h = Φ X_{h−1} + ε_h, where ε_h ∼ i.i.d. N(0, I_K). The latter creates contemporaneously correlated high-frequency regressors.

In the estimation procedure, we add 7 noisy covariates, which are generated in the same way as the relevant covariates, and use 5 low-frequency lags. The empirical models use a dictionary which consists of Legendre polynomials up to degree L = 10 shifted to the [0, 1] interval, with ω(s; β_k) defined in equation (3). The sample size is T ∈ { , , }. Throughout the experiment, we use 5000 simulation replications and 10-fold cross-validation to select the tuning parameter. In the AR case, we initiate the two processes from x_0 ∼ N(0, σ_ε²/(1 − ρ²)) and y_0 ∼ N(0, σ_u²(1 − ρ_2)/((1 + ρ_2)((1 − ρ_2)² − ρ_1²))). In the VAR case, we use the same initial value for (y_t) and initiate X_0 ∼ N(0, I_K). For all cases, the first 200 observations are treated as burn-in.
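A stripped-down version of this ARDL-MIDAS data generating process can be sketched as follows. Since several constants are not legible in this copy of the design, the AR coefficients (0.3, 0.1) and the Beta(2, 3) weight shape below are illustrative placeholders only, and a single i.i.d. normal high-frequency covariate stands in for the full covariate designs.

```python
import numpy as np
from math import gamma

def beta_density(s, a, b):
    """Beta(a, b) density on (0, 1), a common MIDAS weight shape."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * s ** (a - 1) * (1 - s) ** (b - 1)

def simulate_ardl_midas(T, m=12, rho=(0.3, 0.1), sigma_u=1.0, seed=0):
    """Toy ARDL-MIDAS DGP: y_t = rho1*y_{t-1} + rho2*y_{t-2}
    + (1/m) sum_j omega(j/m) x_{t-j/m} + u_t.

    rho and the Beta(2, 3) weight shape are hypothetical, not the
    paper's exact Monte Carlo values.
    """
    rng = np.random.default_rng(seed)
    grid = np.arange(1, m + 1) / m
    w = beta_density(grid, 2.0, 3.0)        # high-frequency weight function
    x = rng.standard_normal((T, m))          # i.i.d. high-frequency covariate draws
    y = np.zeros(T)
    for t in range(2, T):
        y[t] = (rho[0] * y[t - 1] + rho[1] * y[t - 2]
                + (w @ x[t]) / m + sigma_u * rng.standard_normal())
    return y, x

y, x = simulate_ardl_midas(200)
```

In a full replication one would discard a burn-in sample and draw the covariates from the AR or VAR designs above; the sketch keeps only the structure of the target equation.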
We assess the performance of the different methods by varying the assumptions on the error terms of the high-frequency process ε_h, considering a multivariate high-frequency process, changing the degree of the Legendre polynomials L, increasing the noise level of the low-frequency process σ_u, using only half of the high-frequency lags in predictive regressions, and adding a larger number of noisy covariates. In the case of the VAR high-frequency process, we set Φ to be block-diagonal, with the first 5 × 5 block equal to 0.15 and the remaining 5 × 5 block. Models based on MIDAS weights estimate L + 1 coefficients per high-frequency covariate. The third model applies the sg-LASSO estimator together with MIDAS weights. Groups are defined as in Section 2; each low-frequency lag and each high-frequency covariate is a group, therefore we have K + 5 groups. We set the relative weight α to 0.65. This model is denoted sg-LASSO-MIDAS.

For regressions with aggregated data, we consider: (a) flow aggregation (FLOW): x^A_{k,t} = (1/m) Σ_{j=1}^m x_{k,t−j/m}; (b) stock aggregation (STOCK): x^A_{k,t} = x_{k,t}; and (c) a single high-frequency lag taken from the middle of the low-frequency period (MIDDLE). In these cases, the models are estimated using the OLS estimator.
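The three aggregation schemes can be sketched in a few lines. The column convention below (column j holds the lag x_{t−(j+1)/m}, so column 0 is the most recent within-period observation) and the exact middle-lag index are assumptions of this sketch, since the MIDDLE formula is garbled in this copy.

```python
import numpy as np

def aggregate(x_hf, scheme):
    """Collapse a (T, m) matrix of within-period high-frequency lags
    into one low-frequency series per period.

    FLOW averages the m observations, STOCK keeps the most recent one,
    MIDDLE keeps a single lag from the middle of the period.
    """
    T, m = x_hf.shape
    if scheme == "FLOW":
        return x_hf.mean(axis=1)
    if scheme == "STOCK":
        return x_hf[:, 0]
    if scheme == "MIDDLE":
        return x_hf[:, m // 2]
    raise ValueError(f"unknown scheme: {scheme}")

x_hf = np.tile(np.arange(12.0), (5, 1))    # 5 periods, m = 12 lags each
print(aggregate(x_hf, "FLOW")[0])           # 5.5
print(aggregate(x_hf, "STOCK")[0])          # 0.0
```

Each scheme discards cross-lag information that the MIDAS weighting retains, which is one way to read the simulation finding that flow aggregation does best among the simple schemes yet still trails sg-LASSO-MIDAS.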
Detailed results are reported in the Appendix. Tables A.1–A.2 cover the average mean squared forecast errors for one-step-ahead forecasts and nowcasts. The sg-LASSO with MIDAS weighting (sg-LASSO-MIDAS) outperforms all other methods in all simulation scenarios. Importantly, both sg-LASSO-MIDAS and the unstructured LASSO-MIDAS with non-linear weight function approximation perform much better than all other methods in most of the scenarios when the sample size is small (T = 50). In this case, sg-LASSO-MIDAS yields the largest improvements over the alternatives, in particular, with a large number of noisy covariates (bottom-right block). The LASSO without MIDAS weighting typically has large forecast errors. The method performs better when half of the high-frequency lags are included in the regression model. Lastly, forecasts using flow-aggregated covariates seem to perform better than the other simple aggregation methods in all simulation scenarios, but significantly worse than the sg-LASSO-MIDAS.

In Tables A.3–A.4 we report additional results on the estimation accuracy of the weight functions. In Figures A.1–A.3, we plot the estimated weight functions from several methods. The results indicate that the LASSO without MIDAS weighting cannot accurately recover the weights in small samples and/or with a low signal-to-noise ratio. Using Legendre polynomials improves the performance substantially, and the sg-LASSO seems to improve even more over the unstructured LASSO.

5 Empirical application

In this section we nowcast US GDP with macroeconomic, financial, and textual news data. The data used in our empirical analysis are described in Appendix Section A.4. For standard macro variables, we use the real-time FRED-MD monthly dataset. The data are available at the Federal Reserve Bank of St. Louis FRED database; see McCracken and Ng (2016) for more details on this dataset. For our main results, we use a subset of all available macro covariates, which we list in Table A.5.
Next, we add data from the Survey of Professional Forecasters, namely US GDP nowcasts and forecasts for several horizons, which we aggregate using Legendre polynomials. In addition, we augment the predictive regression with news attention data based on textual analysis that has recently been made available by Bybee, Kelly, Manela, and Xiu (2020). Finally, we follow the literature on nowcasting real GDP and define our target variable to be the annualized growth rate. To measure forecast errors, we take the February 2019 real GDP data vintage.
Denote by x_{t,k} the k-th high-frequency covariate at time t. The general ARDL-MIDAS predictive regression is

φ(L) y_{t+1} = µ + Σ_{k=1}^K ψ(L^{1/m}; β_k) x_{t,k} + u_{t+1},   t = 1, ..., T,

where φ(L) is the low-frequency lag polynomial, µ is the regression intercept, and Σ_{k=1}^K ψ(L^{1/m}; β_k) x_{t,k} are the high-frequency covariates. An additional set of results for the full set of FRED-MD monthly covariates, with a detailed implementation description, is available in Appendix Section A.5. As discussed in Section 2, we parameterize the weight function as

ψ(L^{1/m}; β_k) x_{t,k} = (1/m) Σ_{j=1}^m ω(j/m; β_k) x_{t+(h+1−j)/m,k},

where h indicates the number of leading months in the quarter t. For example, if h = 2, we shift the high-frequency covariates two months into the quarter, and hence we nowcast the dependent variable one month ahead.

We benchmark our predictions with the simple random walk (RW) model, which is considered to be a reasonable benchmark for short-term GDP growth predictions. We focus on the predictions of our method, sg-LASSO-MIDAS, with and without series based on textual analysis. One natural comparison is with the Federal Reserve Bank of New York (denoted New York Fed) model-implied nowcasts.

Table 1 reports nowcasting results for the US GDP growth rate in real time at one given instance, namely two months into a quarter (or, put differently, with one month left in the quarter). First, we observe from the table that the sg-LASSO-MIDAS model with standard macro information improves upon the New York Fed predictions in terms of smaller out-of-sample root mean squared errors, although the margin is slim, reducing from 0.790 (ratio with respect to RW) to 0.761. Without text-based information, the improvement of sg-LASSO-MIDAS over the New York Fed nowcasts is therefore, not surprisingly, insignificant based on the Diebold and Mariano (1995) test statistic.
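The lead-shifted regressor x_{t+(h+1−j)/m,k} amounts to ending the within-quarter lag window h months into the current quarter. A minimal sketch, under the simplifying indexing assumption that x_monthly[m*t − 1] is the last month of quarter t (an assumption of this sketch, not a convention stated in the paper):

```python
import numpy as np

def midas_regressor(x_monthly, W, t, m=3, h=2):
    """Dictionary-weighted MIDAS regressor block for nowcasting quarter t + 1.

    Collects the most recent monthly observations ending h months into
    quarter t + 1 and maps them through a dictionary matrix W (rows =
    lags, newest first; columns = basis functions), normalized as in the
    1/m weighting when W has m rows.
    """
    lags = W.shape[0]
    last = m * t + h - 1                     # index of the newest observed month
    window = x_monthly[last - lags + 1:last + 1][::-1]   # newest lag first
    return window @ W / lags

# With a single constant basis function this is just the mean of the
# last three observed months (months 8, 9, 10 under this indexing).
row = midas_regressor(np.arange(12.0), np.ones((3, 1)), t=3, h=2)
```

With h = 0 no current-quarter months are used (a pure forecast), while h = 2 reproduces the one-month-ahead nowcast discussed above.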
We report similar findings using the full dataset of covariates in Appendix Section A.5, where we also compare alternative machine learning methods with the sg-LASSO-MIDAS method and find that the latter outperforms all other alternatives.

Turning to the results using additional text-based covariates, we see a significant improvement in the quality of out-of-sample predictions. Relative to the New York Fed nowcasts, sg-LASSO-MIDAS with textual data decreases prediction errors by 19%. The gain is also large, albeit slightly smaller, relative to the sg-LASSO-MIDAS model that does not condition on news attention information. The Diebold and Mariano (1995) test statistic reveals that the increase in prediction accuracy is statistically significant.

                                     Rel-RMSE   DM-stat-1   DM-stat-2
RW                                    2.606      3.624       3.629
sg-LASSO-MIDAS (with textual data)    0.639     -1.687
sg-LASSO-MIDAS (without)              0.761                  1.687
NY Fed                                0.790      0.490       1.727

Table 1: Nowcast real GDP comparison table – Forecast horizon is one month ahead. Column Rel-RMSE reports the root mean squared forecast error relative to the RW model. Column DM-stat-1 reports the Diebold and Mariano (1995) test statistic for all models relative to the sg-LASSO-MIDAS model without text-based information, while column DM-stat-2 reports the Diebold and Mariano test statistic relative to the sg-LASSO-MIDAS model with text-based information. The out-of-sample period is 2002 Q1 to 2017 Q2.
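The Diebold and Mariano (1995) comparisons in Table 1 can be sketched as follows. This is a generic textbook implementation, not the authors' exact one: it uses the squared-error loss differential and a rectangular-kernel long-run variance with h - 1 autocovariances for an h-step forecast.

```python
import numpy as np

def diebold_mariano(e1, e2, h=1):
    """Diebold-Mariano statistic for equal predictive accuracy.

    e1, e2 are forecast-error series of two models over the same
    out-of-sample periods. A positive value indicates that model 1
    has the larger average squared-error loss.
    """
    d = e1 ** 2 - e2 ** 2                  # squared-error loss differential
    n = d.size
    dbar = d.mean()
    # autocovariances of d up to lag h - 1 (rectangular kernel)
    gamma = [np.sum((d[k:] - dbar) * (d[:n - k] - dbar)) / n for k in range(h)]
    lrv = gamma[0] + 2 * sum(gamma[1:])    # long-run variance of d
    return dbar / np.sqrt(lrv / n)
```

Under the null of equal predictive accuracy the statistic is asymptotically standard normal, which is how the entries in the DM-stat columns are interpreted.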
In Figure 2, we plot a heat map of selected covariates through time for the sg-LASSO-MIDAS model which includes news attention data. In addition to the heat map, which reveals sparsity patterns, we also plot the evolution of the number of selected covariates and the squared forecast errors across time. In general, the pattern is relatively sparse, with more covariates being selected after the Great Recession. Specifically, on average 14.93 covariates are selected before the crisis, 19.63 after the crisis, and 17.58 for the entire out-of-sample exercise. Interestingly, after the crisis the number of selected covariates is more stable: fourteen covariates are always selected. Three covariates are always selected throughout the out-of-sample period: Government budgets, All Employees: Financial Activities, and the 3-Month AA Financial Commercial Paper Rate. Figure A.5 in the Appendix is a similar plot for the full-sample results using only macro data. Note that without the news data, see Figure A.4, autoregressive lags are selected more often.

In Figure 3, we plot the cumulative sum of the loss differential (cumsfe), which is computed as

\mathrm{cumsfe}_{t,t+k} = \sum_{q=t}^{t+k} \left( e_{q,M_1}^2 - e_{q,M_2}^2 \right) \quad (5)

for models $M_1$ and $M_2$. A positive value of $\mathrm{cumsfe}_{t,t+k}$ means that model $M_1$ has accumulated larger squared forecast errors than model $M_2$ up to period $t + k$, and a negative value implies the opposite. In our case, $M_1$ is the New York Fed nowcast and $M_2$ is the sg-LASSO-MIDAS model.

[Figure 2 here: heat map of the sparsity pattern across the selected covariates (autoregressive lags, industrial production, employment, housing, money, prices, interest rates and spreads, and news attention series), together with panels showing the number of selected covariates and the squared forecast errors over time.]

Figure 2: Sparsity pattern.
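The cumulative loss differential in Eq. (5) is straightforward to compute; the sketch below (our illustration, with hypothetical error series) makes the convention explicit.

```python
import numpy as np

def cumsfe(e1, e2):
    """Cumulative sum of squared-forecast-error differentials, Eq. (5).

    e1, e2 are forecast-error series of models M1 and M2 over the same
    out-of-sample periods; positive values mean M1 has accumulated
    larger squared errors (M2 is ahead) up to that period.
    """
    return np.cumsum(e1 ** 2 - e2 ** 2)

# hypothetical forecast errors for two competing models
e_bench = np.array([0.5, -1.0, 0.8, -0.2])
e_model = np.array([0.3, -0.6, 0.4, -0.3])
print(cumsfe(e_bench, e_model))
```

Plotting this running sum against time, as in Figure 3, shows exactly when one model pulls ahead of the other.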
[Figure 3 here: cumulative sum of loss differential over quarters, for sg-LASSO-MIDAS with and without textual data.]

Figure 3: Cumulative sum of loss differential. Gray shaded area: NBER recession period.

The figure shows that our predictions improve upon the New York Fed throughout the out-of-sample period. Interestingly, the largest gains occur during the 2008-2009 recession period and at the beginning of 2011, which is around the period of the peak of the European sovereign debt crisis. The figure also shows marked improvements in prediction quality when using the news attention series; notably, the largest gains are in the two crisis periods.

In this section, we test whether the news attention series are significant predictors of the real GDP growth rate using the inferential methods developed in Babii, Ghysels, and Striaukas (2020). We estimate the same sg-LASSO-MIDAS model using real-time macro data and news attention series. As in the nowcasting application, we use 4 quarters of lagged data; the effective sample starts in February 1985 and ends in May 2017, and the sample size is 126 quarters. We select the tuning parameter by 10-fold cross-validation and set $\alpha = 0.65$. As an aside, using the cumsfe plots, we also show in the Appendix that the choice $\alpha = 0.65$ is optimal and yields the largest gains versus the New York Fed nowcasts, see Figure A.6; the same figure also shows the effect of favoring sparsity over group sparsity, i.e. values of $\alpha$ above 0.65. For the real-time macro data, we take the May 2017 FRED-MD vintage and use real GDP values as of May 30th. To compute the precision matrix, we use nodewise LASSO regressions with a data-driven choice of the penalty parameter; see Babii, Ghysels, and Striaukas (2020) for more details. Since the news attention series are high-frequency, we test the restriction that all coefficients associated with each series are jointly zero.

Table 2 reports p-values of the Wald test for each series, where we use the HAC estimator proposed by Babii, Ghysels, and Striaukas (2020) with the Parzen kernel; the test statistic values are reported in Appendix Table A.8. We report results for a grid of truncation parameter $M_T$ values. The results indicate that Government budgets and Oil market are highly significant predictors: Government budgets is significant at the 1% level, while Oil market is significant at the 5% level, for all truncation parameter values. In the nowcasting application, the former was always selected throughout the out-of-sample period, while the latter was always selected after the crisis.
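The sg-LASSO objective underlying these estimates combines an L1 and a group-L2 penalty, and can be minimized by proximal gradient descent with the sparse-group soft-thresholding operator of Simon et al. (2013). The sketch below is a minimal illustration of that standard algorithm, not the authors' implementation; group indices and step-size choices are our own assumptions.

```python
import numpy as np

def sg_lasso_prox(v, step, lam, alpha, groups):
    """Proximal operator of the sparse-group LASSO penalty
    alpha * |b|_1 + (1 - alpha) * sum_G |b_G|_2 (Simon et al., 2013)."""
    # elementwise soft-thresholding (LASSO part)
    u = np.sign(v) * np.maximum(np.abs(v) - step * lam * alpha, 0.0)
    # groupwise shrinkage (group-LASSO part)
    for g in groups:
        norm = np.linalg.norm(u[g])
        u[g] = 0.0 if norm == 0 else u[g] * max(0.0, 1 - step * lam * (1 - alpha) / norm)
    return u

def sg_lasso(X, y, lam, alpha, groups, n_iter=500):
    """Plain proximal-gradient solver for
    min_b |y - X b|_2^2 / T + lam * Omega(b)."""
    T, p = X.shape
    step = T / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ b - y) / T
        b = sg_lasso_prox(b - step * grad, step, lam, alpha, groups)
    return b
```

Setting alpha = 1 recovers the unstructured LASSO and alpha = 0 the group LASSO, which is why alpha is treated as a tuning parameter in the empirical exercise.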
                     M_T = 10   M_T = 20   M_T = 30
Commodities            0.533      0.514      0.554
Government budgets     0.002      0.009      0.008
Oil market             0.024      0.044      0.017
Recession              0.211      0.317      0.388
Savings & loans        0.754      0.685      0.655
Mortgages              0.750      0.604      0.553

Table 2: Significance test table – p-values for the news attention series based on the Wald test are reported for a set of truncation parameter M_T values. The number of lags in the HAC estimator corresponds to M_T.

Conclusion

This paper offers a new perspective on high-dimensional time series regressions with data sampled at the same or mixed frequencies, and contributes more broadly to the rapidly growing literature on estimation, inference, forecasting, and nowcasting with regularized machine learning methods. The first contribution of the paper is to introduce the sparse-group LASSO estimator for high-dimensional time series regressions. An attractive feature of the estimator is that it recognizes time series data structures and allows us to perform hierarchical model selection within and between groups. The classical LASSO and the group LASSO are covered as special cases.

To recognize that economic and financial time series typically have heavier than Gaussian tails, we use a new Fuk-Nagaev concentration inequality, introduced in Babii, Ghysels, and Striaukas (2020), valid for a large class of $\tau$-dependent processes, including the mixing processes commonly used in econometrics. Building on this inequality, we establish non-asymptotic and asymptotic properties of the sparse-group LASSO estimator.

Our empirical application provides new perspectives on applying machine learning methods to real-time forecasting, nowcasting, and monitoring with time series data, including non-conventional data, sampled at different frequencies. To that end, we introduce a new class of MIDAS regressions with dictionaries linear in the parameters and based on orthogonal polynomials, with lag selection performed by the sg-LASSO estimator. We find that the sg-LASSO estimator outperforms the unstructured LASSO in small samples and conclude that incorporating specific data structures should be helpful in various applications.

References
Almon, S. (1965): "The distributed lag between capital appropriations and expenditures," Econometrica, 33(1), 178-196.

Andreou, E., P. Gagliardini, E. Ghysels, and M. Rubin (2019): "Inference in group factor models with an application to mixed frequency data," Econometrica, 87(4), 1267-1305.

Andreou, E., E. Ghysels, and A. Kourtellos (2013): "Should macroeconomic forecasters use daily financial data and how?," Journal of Business and Economic Statistics, 31, 240-251.

Andrews, D. W. (1984): "Non-strong mixing autoregressive processes," Journal of Applied Probability, 21(4), 930-934.

Aprigliano, V., G. Ardizzi, and L. Monteforte (2019): "Using Payment System Data to Forecast Economic Activity," International Journal of Central Banking, 15(4), 55-80.

Babii, A., and J.-P. Florens (2020): "Is completeness necessary? Estimation in nonidentified linear models," arXiv preprint arXiv:1709.03473v3.

Babii, A., E. Ghysels, and J. Striaukas (2020): "Inference for high-dimensional regressions with heteroskedasticity and autocorrelation," arXiv preprint arXiv:1912.06307v2.

Bańbura, M., D. Giannone, M. Modugno, and L. Reichlin (2013): "Now-casting and the real-time data flow," in Handbook of Economic Forecasting, Volume 2 Part A, ed. by G. Elliott and A. Timmermann, pp. 195-237. Elsevier.

Barnett, W., M. Chauvet, D. Leiva-Leon, and L. Su (2016): "Nowcasting Nominal GDP with the Credit-Card Augmented Divisia Monetary Aggregates."

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012): "Sparse models and methods for optimal instruments with an application to eminent domain," Econometrica, 80(6), 2369-2429.

Belloni, A., V. Chernozhukov, D. Chetverikov, C. Hansen, and K. Kato (2018): "High-dimensional econometrics and generalized GMM," arXiv preprint arXiv:1806.01888.

Bok, B., D. Caratelli, D. Giannone, A. M. Sbordone, and A. Tambalotti (2018): "Macroeconomic nowcasting and forecasting with big data," Annual Review of Economics, 10, 615-643.

Bybee, L., B. T. Kelly, A. Manela, and D. Xiu (2020): "The structure of economic news," National Bureau of Economic Research, and http://structureofnews.com.

Caner, M., and A. B. Kock (2018): "Asymptotically honest confidence regions for high dimensional parameters by the desparsified conservative lasso," Journal of Econometrics, 203(1), 143-168.

Carlsen, M., and P. E. Storgaard (2010): "Dankort payments as a timely indicator of retail sales in Denmark."

Carrasco, M., and B. Rossi (2016): "In-sample inference and forecasting in misspecified factor models," Journal of Business and Economic Statistics, 34(3), 313-338.

Chernozhukov, V., W. K. Härdle, C. Huang, and W. Wang (2019): "Lasso-driven inference in time and space," Annals of Statistics (forthcoming).

Dedecker, J., and P. Doukhan (2003): "A new covariance inequality and applications," Stochastic Processes and their Applications, 106(1), 63-80.

Dedecker, J., and C. Prieur (2004): "Coupling for τ-dependent sequences and applications," Journal of Theoretical Probability, 17(4), 861-885.

Dedecker, J., and C. Prieur (2005): "New dependence coefficients. Examples and applications to statistics," Probability Theory and Related Fields, 132(2), 203-236.

DeVore, R. A. (1998): "Nonlinear approximation," Acta Numerica, 7, 51-150.

Diebold, F. X., and R. S. Mariano (1995): "Comparing predictive accuracy," Journal of Business and Economic Statistics, 13(3), 253-263.

Duarte, C., P. M. Rodrigues, and A. Rua (2017): "A mixed frequency approach to the forecasting of private consumption with ATM/POS data," International Journal of Forecasting, 33(1), 61-75.

Fan, J., Q. Li, and Y. Wang (2017): "Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1), 247-265.

Foroni, C., M. Marcellino, and C. Schumacher (2015a): "Unrestricted mixed data sampling (U-MIDAS): MIDAS regressions with unrestricted lag polynomials," Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(1), 57-82.

Foroni, C., M. Marcellino, and C. Schumacher (2015b): "Unrestricted mixed data sampling (U-MIDAS): MIDAS regressions with unrestricted lag polynomials," Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(1), 57-82.

Fuk, D. K., and S. V. Nagaev (1971): "Probability inequalities for sums of independent random variables," Theory of Probability and Its Applications, 16(4), 643-660.

Galbraith, J. W., and G. Tkacz (2018): "Nowcasting with payments system data," International Journal of Forecasting, 34(2), 366-376.

Ghysels, E., C. Horan, and E. Moench (2018): "Forecasting through the Rearview Mirror: Data Revisions and Bond Return Predictability," Review of Financial Studies, 31(2), 678-714.

Ghysels, E., and H. Qian (2019): "Estimating MIDAS regressions via OLS with polynomial parameter profiling," Econometrics and Statistics, 9, 1-16.

Ghysels, E., P. Santa-Clara, and R. Valkanov (2006): "Predicting volatility: getting the most out of return data sampled at different frequencies," Journal of Econometrics, 131, 59-95.

Ghysels, E., A. Sinko, and R. Valkanov (2007): "MIDAS regressions: Further results and new directions," Econometric Reviews, 26(1), 53-90.

Han, Y., and R. S. Tsay (2017): "High-dimensional Linear Regression for Dependent Observations with Application to Nowcasting," arXiv preprint arXiv:1706.07899.

Kock, A. B., and L. Callot (2015): "Oracle inequalities for high dimensional vector autoregressions," Journal of Econometrics, 186(2), 325-344.

Lecué, G., and M. Lerasle (2019): "Robust machine learning by median-of-means: theory and practice," Annals of Statistics (forthcoming).

Marsilli, C. (2014): "Variable Selection in Predictive MIDAS Models," Working papers 520, Banque de France.

McCracken, M. W., and S. Ng (2016): "FRED-MD: A monthly database for macroeconomic research," Journal of Business and Economic Statistics, 34(4), 574-589.

Moriwaki, D. (2019): "Nowcasting Unemployment Rates with Smartphone GPS Data," in International Workshop on Multiple-Aspect Analysis of Semantic Trajectories, pp. 21-33. Springer.

Negahban, S. N., P. Ravikumar, M. J. Wainwright, and B. Yu (2012): "A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers," Statistical Science, 27(4), 538-557.

Raju, S., and M. Balakrishnan (2019): "Nowcasting economic activity in India using payment systems data," Journal of Payments Strategy and Systems, 13(1), 72-81.

Siliverstovs, B. (2017): "Short-term forecasting with mixed-frequency data: a MIDASSO approach," Applied Economics, 49(13), 1326-1343.

Simon, N., J. Friedman, T. Hastie, and R. Tibshirani (2013): "A sparse-group LASSO," Journal of Computational and Graphical Statistics, 22(2), 231-245.

Thorsrud, L. A. (2020): "Words are the new numbers: A newsy coincident index of the business cycle," Journal of Business and Economic Statistics, 38(2), 393-409.

Tibshirani, R. (1996): "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), 58, 267-288.

Uematsu, Y., and S. Tanaka (2019): "High-dimensional macroeconomic forecasting and variable selection via penalized regression," Econometrics Journal, 22, 34-56.

Wong, K. C., Z. Li, and A. Tewari (2019): "LASSO guarantees for β-mixing heavy tailed time series," Annals of Statistics (forthcoming).

Wu, W. B. (2005): "Nonlinear system theory: Another look at dependence," Proceedings of the National Academy of Sciences, 102(40), 14150-14154.

Wu, W.-B., and Y. N. Wu (2016): "Performance bounds for parameter estimates of high-dimensional linear models with correlated errors," Electronic Journal of Statistics, 10(1), 352-379.

Yuan, M., and Y. Lin (2006): "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1), 49-67.

Appendix
A.1 Dictionaries
In this section we briefly review the choice of dictionaries for the MIDAS weight function. It is possible to construct dictionaries using arbitrary sets of functions, including a mix of algebraic polynomials, trigonometric polynomials, B-splines, the Haar basis, or wavelets. In this paper, we mostly focus on dictionaries generated by orthogonalized algebraic polynomials, though it might be interesting to tailor the dictionary to each particular application. The attractiveness of algebraic polynomials comes from their ability to generate a variety of shapes with a relatively low number of parameters, which is especially desirable in low signal-to-noise environments. The general family of appropriate orthogonal algebraic polynomials is given by the Jacobi polynomials, which nest the Legendre, Gegenbauer, and Chebyshev polynomials as special cases.
Example A.1.1 (Jacobi polynomials). Applying the Gram-Schmidt orthogonalization to $\{1, x, x^2, x^3, \dots\}$ with respect to the measure

d\mu(x) = (1 - x)^\alpha (1 + x)^\beta dx, \quad \alpha, \beta > -1,

on $[-1, 1]$, we obtain the Jacobi polynomials. In practice, Jacobi polynomials can be computed through the well-known three-term recurrence relation: for $n \ge 1$,

P^{(\alpha,\beta)}_{n+1}(x) = (a_n x + b_n) P^{(\alpha,\beta)}_n(x) - c_n P^{(\alpha,\beta)}_{n-1}(x)

with

a_n = \frac{(2n + \alpha + \beta + 1)(2n + \alpha + \beta + 2)}{2(n + 1)(n + \alpha + \beta + 1)}, \quad
b_n = \frac{(2n + \alpha + \beta + 1)(\alpha^2 - \beta^2)}{2(n + 1)(n + \alpha + \beta + 1)(2n + \alpha + \beta)}, \quad
c_n = \frac{(\alpha + n)(\beta + n)(2n + \alpha + \beta + 2)}{(n + 1)(n + \alpha + \beta + 1)(2n + \alpha + \beta)}.

To obtain the orthogonal basis on $[0, 1]$, we shift the Jacobi polynomials with the affine bijection $x \mapsto 2x - 1$. For $\alpha = \beta$, we obtain the Gegenbauer polynomials; for $\alpha = \beta = 0$, we obtain the Legendre polynomials; while for $\alpha = \beta = -1/2$ or $\alpha = \beta = 1/2$, we obtain the Chebyshev polynomials of the two kinds.

In the mixed frequency setting, the non-orthogonalized polynomials $\{1, x, x^2, x^3, \dots\}$ are also called Almon polynomials. It is preferable to use orthogonal polynomials in practice due to reduced multicollinearity and better numerical properties. At the same time, orthogonal polynomials are readily available in Matlab, R, Python, and Julia packages. Legendre polynomials are our default recommendation, while other choices of $\alpha$ and $\beta$ are preferable if we want to accommodate heavier tails.

We noted in the main body of the paper that the specification in (2) deviates from the standard MIDAS polynomial specification as it results in a linear regression model, a subtle but key innovation as it maps MIDAS regressions into the standard regression framework. Moreover, casting the MIDAS regressions in a linear regression framework renders the optimization problem convex, something only achieved by Siliverstovs (2017) using the U-MIDAS of Foroni, Marcellino, and Schumacher (2015b), which does not recognize the mixed frequency data structure, unlike our sg-LASSO.
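For the Legendre case ($\alpha = \beta = 0$), the Jacobi recurrence above reduces to $(n+1)P_{n+1}(z) = (2n+1)zP_n(z) - nP_{n-1}(z)$. The sketch below (our own illustration) evaluates the shifted Legendre basis on $[0, 1]$ via this recurrence and the affine map $x \mapsto 2x - 1$.

```python
import numpy as np

def shifted_legendre(degree, x):
    """Evaluate shifted Legendre polynomials P_0, ..., P_degree on x in [0, 1]
    via the three-term recurrence (the alpha = beta = 0 case of the Jacobi
    recurrence), after the affine shift x -> 2x - 1."""
    z = 2 * np.asarray(x, dtype=float) - 1        # map [0, 1] onto [-1, 1]
    P = np.ones((degree + 1,) + z.shape)          # P_0 is identically one
    if degree >= 1:
        P[1] = z
    for n in range(1, degree):
        # (n + 1) P_{n+1}(z) = (2n + 1) z P_n(z) - n P_{n-1}(z)
        P[n + 1] = ((2 * n + 1) * z * P[n] - n * P[n - 1]) / (n + 1)
    return P

x = np.linspace(0, 1, 5)
print(np.round(shifted_legendre(2, x), 3))
```

The rows of the returned array are the dictionary functions used to build the low-dimensional MIDAS weight parameterization.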
A.2 Proofs of main results
Lemma A.2.1. Consider $\|\cdot\| = \alpha|\cdot|_1 + (1 - \alpha)|\cdot|_2$, where $|\cdot|_1$ and $|\cdot|_2$ are two norms on $\mathbf{R}^p$. Then the dual norm of $\|\cdot\|$, denoted $\|\cdot\|^*$, satisfies

\|z\|^* \le \alpha|z|_1^* + (1 - \alpha)|z|_2^*, \quad \forall z \in \mathbf{R}^p,

where $|\cdot|_1^*$ is the dual norm of $|\cdot|_1$ and $|\cdot|_2^*$ is the dual norm of $|\cdot|_2$.

Proof. Clearly, $\|\cdot\|$ is a norm. By the convexity of $x \mapsto x^{-1}$ on $(0, \infty)$,

\|z\|^* = \sup_{b \ne 0} \frac{|\langle z, b\rangle|}{\|b\|} \le \sup_{b \ne 0} \left\{ \alpha\frac{|\langle z, b\rangle|}{|b|_1} + (1 - \alpha)\frac{|\langle z, b\rangle|}{|b|_2} \right\} \le \alpha\sup_{b \ne 0}\frac{|\langle z, b\rangle|}{|b|_1} + (1 - \alpha)\sup_{b \ne 0}\frac{|\langle z, b\rangle|}{|b|_2} = \alpha|z|_1^* + (1 - \alpha)|z|_2^*.

Proof of Theorem 3.1.
Note that the sg-LASSO penalty $\Omega$ is a norm. By Lemma A.2.1, its dual norm satisfies

\Omega^*(X^\top u/T) \le \alpha|X^\top u/T|_\infty + (1 - \alpha)\max_{G\in\mathcal{G}}|(X^\top u)_G/T|_2 \lesssim |X^\top u/T|_\infty \lesssim \left(\frac{p}{\delta T^{\kappa-1}}\right)^{1/\kappa} \vee \sqrt{\frac{\log(8p/\delta)}{T}} \lesssim \lambda, \quad (A.1)

where the first inequality follows since $|z|_1^* = |z|_\infty$ and $\big(\sum_{G\in\mathcal{G}}|z_G|_2\big)^* = \max_{G\in\mathcal{G}}|z_G|_2$, the second by elementary computations, the third under Assumptions 3.1 (i) and (iii), by Theorem A.1, with probability at least $1 - \delta$, and the last from the definition of $\lambda$ in Assumption 3.3. By Fermat's rule, the sg-LASSO satisfies

X^\top(X\hat\beta - y)/T + \lambda z^* = 0

for some $z^* \in \partial\Omega(\hat\beta)$, where $\partial\Omega(\hat\beta)$ is the subdifferential of $b \mapsto \Omega(b)$ at $\hat\beta$. Taking the inner product with $\beta - \hat\beta$,

\langle X^\top(y - X\hat\beta), \beta - \hat\beta\rangle_T = \lambda\langle z^*, \beta - \hat\beta\rangle \le \lambda\{\Omega(\beta) - \Omega(\hat\beta)\},

where the inequality follows from the definition of the subdifferential. Using $y = m + u$ and rearranging this inequality,

\|X(\hat\beta - \beta)\|_T^2 - \lambda\{\Omega(\beta) - \Omega(\hat\beta)\} \le \langle X^\top u, \hat\beta - \beta\rangle_T + \langle X^\top(m - X\beta), \hat\beta - \beta\rangle_T
\le \Omega^*(X^\top u/T)\,\Omega(\hat\beta - \beta) + \|X(\hat\beta - \beta)\|_T\|m - X\beta\|_T
\le c^{-1}\lambda\,\Omega(\hat\beta - \beta) + \|X(\hat\beta - \beta)\|_T\|m - X\beta\|_T,

where the second line follows by the dual norm inequality and the last by $\Omega^*(X^\top u/T) \le c^{-1}\lambda$ for some $c > 1$. Therefore, with $\Delta = \hat\beta - \beta$,

\|X\Delta\|_T^2 \le c^{-1}\lambda\Omega(\Delta) + \|X\Delta\|_T\|m - X\beta\|_T + \lambda\{\Omega(\beta) - \Omega(\hat\beta)\} \le (c^{-1} + 1)\lambda\Omega(\Delta) + \|X\Delta\|_T\|m - X\beta\|_T. \quad (A.2)

Note that the sg-LASSO penalty can be decomposed as a sum of two seminorms, $\Omega(b) = \Omega_0(b) + \Omega_1(b)$ for all $b \in \mathbf{R}^p$, with

\Omega_0(b) = \alpha|b_{S_0}|_1 + (1 - \alpha)\sum_{G\in\mathcal{G}_0}|b_G|_2 \quad\text{and}\quad \Omega_1(b) = \alpha|b_{S_0^c}|_1 + (1 - \alpha)\sum_{G\in\mathcal{G}_0^c}|b_G|_2.

Note also that $\Omega_1(\beta) = 0$ and $\Omega_1(\hat\beta) = \Omega_1(\Delta)$. Then by the triangle inequality,

\Omega(\beta) - \Omega(\hat\beta) \le \Omega_0(\Delta) - \Omega_1(\Delta). \quad (A.3)

If $\|m - X\beta\|_T \le \|X\Delta\|_T$, then it follows from the first inequality in Eq. A.2 and Eq. A.3 that

\|X\Delta\|_T^2 \le 2c^{-1}\lambda\Omega(\Delta) + 2\lambda\{\Omega_0(\Delta) - \Omega_1(\Delta)\}.

Since the left side of this inequality is positive, this shows that $\Omega_1(\Delta) \le c_0\Omega_0(\Delta)$ with $c_0 = \frac{c+1}{c-1}$, and hence $\Delta \in \mathcal{C}(c_0)$, cf. Assumption 3.2. Then

\Omega(\Delta) \le (1 + c_0)\Omega_0(\Delta)
\le (1 + c_0)\left[\alpha\sqrt{|S_0|}\,|\Delta_{S_0}|_2 + (1 - \alpha)\sqrt{|\mathcal{G}_0|}\Big(\sum_{G\in\mathcal{G}_0}|\Delta_G|_2^2\Big)^{1/2}\right]
\le (1 + c_0)s_\alpha^{1/2}\Big(\sum_{G\in\mathcal{G}_0}|\Delta_G|_2^2\Big)^{1/2}
\le (1 + c_0)s_\alpha^{1/2}\gamma^{-1/2}|\Sigma^{1/2}\Delta|_2, \quad (A.4)

where the second line follows by Jensen's inequality and the last under Assumption 3.2, with $s_\alpha^{1/2} = \alpha\sqrt{|S_0|} + (1 - \alpha)\sqrt{|\mathcal{G}_0|}$. Next, put $\bar{G} = \max_{G\in\mathcal{G}}|G|$, where $|G|$ is the cardinality of the group $G \subset [p]$, and note that

|\Sigma^{1/2}\Delta|_2^2 = \Delta^\top\Sigma\Delta
= \|X\Delta\|_T^2 + \Delta^\top(\Sigma - \hat\Sigma)\Delta
\le 2(c^{-1} + 1)\lambda\Omega(\Delta) + \Omega(\Delta)\,\Omega^*\big((\hat\Sigma - \Sigma)\Delta\big)
\le 2(c^{-1} + 1)\lambda\Omega(\Delta) + \Omega^2(\Delta)\,\bar{G}\,|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty,

where the third line follows by the inequality in Eq. A.2 and the dual norm inequality, and the fourth by Lemma A.2.1 and the elementary computations

\Omega^*\big((\hat\Sigma - \Sigma)\Delta\big) \le \alpha|(\hat\Sigma - \Sigma)\Delta|_\infty + (1 - \alpha)\max_{1\le k\le K}|[(\hat\Sigma - \Sigma)\Delta]_{G_k}|_2
\le \alpha|\Delta|_1|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty + (1 - \alpha)\bar{G}^{1/2}|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty|\Delta|_1
\le \bar{G}\,|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty\,\Omega(\Delta).

Combining these computations with the inequality in Eq. A.4,

\Omega(\Delta) \le (1 + c_0)^2\gamma^{-1}s_\alpha\big\{2(c^{-1} + 1)\lambda + \bar{G}|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty\Omega(\Delta)\big\}
\le 2(1 + c_0)^2\gamma^{-1}s_\alpha(c^{-1} + 1)\lambda + (1 - A^{-1})\Omega(\Delta),

where the second line holds on the event $\mathcal{E} \triangleq \big\{|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty \le \frac{\gamma}{2\bar{G}s_\alpha(1 + 2c_0)^2}\big\}$ with $1 - A^{-1} = \frac{(1 + c_0)^2}{2(1 + 2c_0)^2} < 1$. This observation in conjunction with the inequality in Eq. A.2 gives

\Omega(\Delta) \le 2A(1 + c_0)^2\gamma^{-1}s_\alpha(c^{-1} + 1)\lambda \quad\text{and}\quad \|X\Delta\|_T^2 \le 4A(1 + c_0)^2\gamma^{-1}s_\alpha(c^{-1} + 1)^2\lambda^2.

On the other hand, if $\|m - X\beta\|_T > \|X\Delta\|_T$, then $\|X\Delta\|_T^2 \le \|m - X\beta\|_T^2$. Therefore, on $\mathcal{E}$, we always have

\|X\Delta\|_T^2 \le C s_\alpha\lambda^2 + 4\|m - X\beta\|_T^2 \quad (A.5)

with $C = 4A(1 + c_0)^2\gamma^{-1}(c^{-1} + 1)^2$, which proves the first claim of Theorem 3.1.

For the second claim, suppose first that $\Delta \in \mathcal{C}(2c_0)$. Then on $\mathcal{E}$,

\Omega^2(\Delta) \le (1 + 2c_0)^2 s_\alpha\gamma^{-1}|\Sigma^{1/2}\Delta|_2^2
= (1 + 2c_0)^2 s_\alpha\gamma^{-1}\big\{\|X\Delta\|_T^2 + \Delta^\top(\Sigma - \hat\Sigma)\Delta\big\}
\le (1 + 2c_0)^2 s_\alpha\gamma^{-1}\big\{C s_\alpha\lambda^2 + 4\|m - X\beta\|_T^2 + \Omega^2(\Delta)\bar{G}|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty\big\}
\le (1 + 2c_0)^2 s_\alpha\gamma^{-1}\big\{C s_\alpha\lambda^2 + 4\|m - X\beta\|_T^2\big\} + 2^{-1}\Omega^2(\Delta),

where the first inequality follows by computations similar to Eq. A.4 and the second inequality from Eq. A.5. Therefore,

\Omega^2(\Delta) \le 2(1 + 2c_0)^2 s_\alpha\gamma^{-1}\big\{C s_\alpha\lambda^2 + 4\|m - X\beta\|_T^2\big\}. \quad (A.6)

On the other hand, if $\Delta \notin \mathcal{C}(2c_0)$, then $\Delta \notin \mathcal{C}(c_0)$, which as we have already shown implies $\|m - X\beta\|_T > \|X\Delta\|_T$. In conjunction with Eq. A.2 and Eq. A.3, this shows that

0 \le \lambda c^{-1}\Omega(\Delta) + 2\|m - X\beta\|_T^2 + \lambda\{\Omega_0(\Delta) - \Omega_1(\Delta)\},

and whence

\Omega_1(\Delta) \le c_0\Omega_0(\Delta) + 2c\lambda^{-1}(c - 1)^{-1}\|m - X\beta\|_T^2 \le 2^{-1}\Omega_1(\Delta) + 2c\lambda^{-1}(c - 1)^{-1}\|m - X\beta\|_T^2.

This shows that

\Omega(\Delta) \le (1 + (2c_0)^{-1})\Omega_1(\Delta) \le 4(1 + (2c_0)^{-1})c\lambda^{-1}(c - 1)^{-1}\|m - X\beta\|_T^2.

Combining this with the inequality in Eq. A.6, we obtain the second claim of Theorem 3.1.

Lastly, under Assumptions 3.1 (ii) and (iv), by Theorem 3.1 in Babii, Ghysels, and Striaukas (2020),

\Pr(\mathcal{E}^c) = \Pr\left(|\mathrm{vech}(\hat\Sigma - \Sigma)|_\infty > \frac{\gamma}{2\bar{G}s_\alpha(1 + 2c_0)^2}\right) \le \frac{A_1 s_\alpha^{\tilde\kappa}p^2}{T^{\tilde\kappa - 1}} + 2p(p + 1)\exp\left(-\frac{c_1 T^2}{s_\alpha^2 B_T}\right)

for some universal constants $A_1$ and $c_1$, and $B_T = \max_{j,k\in[p]}\sum_{t=1}^T\sum_{l=1}^T|\mathrm{Cov}(X_{t,j}X_{t,k}, X_{l,j}X_{l,k})|$. Lastly, under Assumptions 3.1 (ii) and (iv), by Lemma A.1.2 in Babii, Ghysels, and Striaukas (2020), $B_T = O(T)$.

The following result is proven in Babii, Ghysels, and Striaukas (2020), see their Theorem 3.1 and Eq. (4) following it.

Theorem A.1. Let $(\xi_t)_{t\in\mathbf{Z}}$ be a centered stationary stochastic process in $\mathbf{R}^p$ such that (i) $\max_{j\in[p]}\|\xi_{0,j}\|_q = O(1)$ for some $q > 2$; (ii) for every $j \in [p]$, the $\tau$-dependence coefficients of $(\xi_{t,j})_{t\in\mathbf{Z}}$ satisfy $\tau_k^{(j)} \le ck^{-a}$ for some universal constants $c > 0$ and $a > \frac{q-1}{q-2}$. Then there exists $C > 0$ such that for every $\delta \in (0, 1)$,

\Pr\left(\left|\frac{1}{T}\sum_{t=1}^T\xi_t\right|_\infty \le C\left\{\left(\frac{p}{\delta T^{\kappa-1}}\right)^{1/\kappa} \vee \sqrt{\frac{\log(8p/\delta)}{T}}\right\}\right) \ge 1 - \delta. \quad (A.7)

A.3 Monte Carlo Simulations

     FLOW   STOCK  MIDDLE LASSO-U LASSO-M SGL-M  |  FLOW   STOCK  MIDDLE LASSO-U LASSO-M SGL-M
T    Baseline scenario                           |  ε_h ∼ i.i.d. student-t(5)
50   2.847  3.839  4.660  4.213  2.561  2.188   |  2.081  2.427  2.702  2.334  2.066  1.749
     0.059  0.077  0.090  0.087  0.054  0.044   |  0.042  0.053  0.062  0.056  0.051  0.041
100  2.110  2.912  3.814  2.244  1.579  1.473   |  1.504  1.900  2.155  1.761  1.535  1.343
     0.041  0.057  0.076  0.045  0.032  0.030   |  0.030  0.038  0.043  0.034  0.030  0.026
200  1.882  2.772  3.681  1.539  1.302  1.230   |  1.357  1.714  1.986  1.414  1.238  1.192
     0.037  0.056  0.072  0.031  0.026  0.025   |  0.027  0.035  0.040  0.029  0.025  0.024
     High-frequency process: VAR(1)              |  Legendre degree L = 3
50   1.869  2.645  2.863  2.135  1.726  1.533   |  2.847  3.839  4.660  4.213  2.339  1.979
     0.039  0.053  0.057  0.046  0.036  0.032   |  0.059  0.077  0.090  0.087  0.050  0.041
100  1.453  2.073  2.245  1.575  1.373  1.284   |  2.110  2.912  3.814  2.244  1.503  1.386
     0.028  0.042  0.046  0.031  0.028  0.025   |  0.041  0.057  0.076  0.045  0.031  0.029
200  1.283  1.921  2.040  1.348  1.240  1.201   |  1.882  2.772  3.681  1.539  1.277  1.196
     0.026  0.038  0.041  0.026  0.024  0.023   |  0.037  0.056  0.072  0.031  0.025  0.024
     Legendre degree L = 10                      |  Low frequency noise level σ_u = 5
50   2.847  3.839  4.660  4.213  2.983  2.583   |  9.598  10.429 10.726 9.799  8.732  7.785
     0.059  0.077  0.090  0.087  0.063  0.053   |  0.196  0.211  0.213  0.198  0.180  0.159
100  2.110  2.912  3.814  2.244  1.719  1.633   |  7.319  8.177  8.880  8.928  7.359  6.606
     0.041  0.057  0.076  0.045  0.035  0.032   |  0.147  0.163  0.176  0.179  0.147  0.135
200  1.882  2.772  3.681  1.539  1.348  1.300   |  6.489  7.699  8.381  7.275  6.391  5.919
     0.037  0.056  0.072  0.031  0.027  0.026   |  0.127  0.154  0.165  0.146  0.126  0.117
     Half high-frequency lags                    |  Number of covariates p = 50
50   2.750  2.730  3.562  2.455  2.344  1.905   |                       5.189  3.610  2.658
     0.058  0.056  0.070  0.050  0.048  0.038   |                       0.104  0.075  0.054
100  2.134  2.167  3.082  1.899  1.718  1.468   |  5.582  5.633  6.298  3.527  2.034  1.753
     0.043  0.043  0.061  0.038  0.034  0.030   |  0.117  0.113  0.126  0.075  0.042  0.036
200  1.833  1.971  2.808  1.400  1.356  1.225   |  2.679  3.573  4.399  1.867  1.413  1.319
     0.036  0.039  0.055  0.028  0.027  0.024   |  0.053  0.071  0.090  0.038  0.028  0.026

Table A.1: Forecasting accuracy results – See Table A.2.
     FLOW   STOCK  MIDDLE LASSO-U LASSO-M SGL-M  |  FLOW   STOCK  MIDDLE LASSO-U LASSO-M SGL-M
T    Baseline scenario                           |  ε_h ∼ i.i.d. student-t(5)
50   3.095  3.793  4.659  4.622  3.196  2.646   |  2.257  2.391  2.649  2.357  2.131  1.786
     0.067  0.078  0.094  0.094  0.064  0.055   |  0.046  0.054  0.057  0.050  0.047  0.038
100  2.393  2.948  3.860  2.805  2.113  1.888   |  1.598  1.840  2.068  1.824  1.653  1.433
     0.048  0.060  0.078  0.058  0.044  0.038   |  0.032  0.037  0.043  0.036  0.033  0.029
200  2.122  2.682  3.597  1.971  1.712  1.604   |  1.452  1.690  1.969  1.544  1.383  1.302
     0.042  0.055  0.072  0.039  0.034  0.032   |  0.030  0.035  0.041  0.032  0.028  0.026
     High-frequency process: VAR(1)              |  Legendre degree L = 3
50   2.086  2.418  2.856  2.208  1.828  1.612   |  3.095  3.793  4.659  4.622  2.987  2.451
     0.044  0.050  0.057  0.049  0.039  0.033   |  0.067  0.078  0.094  0.094  0.061  0.050
100  1.571  1.906  2.341  1.671  1.430  1.329   |  2.393  2.948  3.860  2.805  2.020  1.796
     0.031  0.039  0.047  0.033  0.028  0.026   |  0.048  0.060  0.078  0.058  0.042  0.037
200  1.397  1.720  2.168  1.428  1.307  1.248   |  2.122  2.682  3.597  1.971  1.680  1.560
     0.028  0.034  0.043  0.028  0.026  0.024   |  0.042  0.055  0.072  0.039  0.033  0.031
     Legendre degree L = 10                      |  Low frequency noise level σ_u = 5
50   3.095  3.793  4.659  4.622  3.528  2.948   |  9.934  10.566 10.921 9.819  9.037  8.091
     0.067  0.078  0.094  0.094  0.071  0.059   |  0.213  0.212  0.216  0.198  0.184  0.168
100  2.393  2.948  3.860  2.805  2.271  2.079   |  7.576  8.130  8.854  9.190  7.743  6.876
     0.048  0.060  0.078  0.058  0.047  0.042   |  0.150  0.166  0.180  0.188  0.160  0.141
200  2.122  2.682  3.597  1.971  1.777  1.693   |  6.830  7.580  8.351  7.648  6.820  6.258
     0.042  0.055  0.072  0.039  0.035  0.034   |  0.135  0.152  0.168  0.156  0.136  0.124
     Half high-frequency lags                    |  Number of covariates p = 50
50   3.014  2.773  3.638  2.455  2.509  2.201   |                       5.222  3.919  3.002
     0.063  0.056  0.072  0.050  0.051  0.046   |                       0.105  0.081  0.061
100  2.344  2.087  3.116  1.899  2.101  1.774   |  5.978  5.556  6.536  3.948  2.665  2.232
     0.046  0.041  0.063  0.038  0.043  0.036   |  0.121  0.112  0.132  0.083  0.053  0.044
200  2.119  1.985  2.988  1.400  1.761  1.590   |  2.974  3.422  4.412  2.355  1.938  1.725
     0.041  0.040  0.061  0.028  0.035  0.032   |  0.059  0.070  0.087  0.048  0.040  0.035

Table A.2: Nowcasting accuracy results – The table reports simulation results for nowcasting accuracy. We report eight different scenarios for the DGP. In the baseline scenario (upper-left block), the DGP has a low-frequency noise level σ_u = 1, which we keep for all other scenarios except where we change it to σ_u = 5, a Legendre polynomial of degree L = 5, and a Gaussian high-frequency error term. All remaining blocks report results for different DGPs: e.g., in the upper-right block, we report results where the noise term of the high-frequency covariates is i.i.d. student-t(5). Each block reports results for the LASSO-U-MIDAS (LASSO-U), LASSO-MIDAS (LASSO-M), and sg-LASSO-MIDAS (SGL-M) estimators (the last three columns). In addition, we report results for predictive regressions using aggregated data with different aggregation schemes: flow aggregation (FLOW), stock aggregation (STOCK), and taking the middle value of the high-frequency covariates (MIDDLE). We vary the sample size T from 50 to 200. Each entry in an odd row is the average mean squared forecast error, and each entry in an even row is the simulation standard error.
                  T = 50                   T = 100                  T = 200
           LASSO-U LASSO-M SGL-M   LASSO-U LASSO-M SGL-M   LASSO-U LASSO-M SGL-M
Baseline scenario
Beta(1,3)   1.955   0.887  0.652    1.846   0.287  0.247    1.804   0.138  0.106
            0.002   0.012  0.010    0.002   0.004  0.004    0.001   0.002  0.002
Beta(2,3)   1.211   0.739  0.625    1.157   0.351  0.268    1.128   0.199  0.118
            0.001   0.008  0.008    0.001   0.004  0.004    0.001   0.002  0.002
Beta(2,2)   1.062   0.593  0.537    1.019   0.231  0.216    0.995   0.106  0.092
            0.001   0.007  0.007    0.001   0.003  0.003    0.001   0.001  0.001
ε_h ~ i.i.d. student-t(5)
Beta(1,3)   2.005   1.688  1.290    1.953   1.064  0.624    1.885   0.471  0.401
            0.002   0.012  0.014    0.002   0.011  0.009    0.002   0.005  0.005
Beta(2,3)   1.237   1.126  0.993    1.218   0.848  0.614    1.185   0.506  0.440
            0.001   0.007  0.010    0.001   0.007  0.007    0.001   0.005  0.004
Beta(2,2)   1.084   0.969  0.874    1.070   0.691  0.518    1.047   0.369  0.356
            0.001   0.006  0.008    0.001   0.006  0.006    0.001   0.004  0.004
High-frequency process: VAR(1)
Beta(1,3)   1.935   1.271  0.939    1.890   0.772  0.492    1.842   0.419  0.288
            0.003   0.016  0.015    0.002   0.010  0.008    0.002   0.005  0.004
Beta(2,3)   1.177   0.864  0.811    1.155   0.610  0.505    1.136   0.468  0.359
            0.002   0.011  0.012    0.002   0.008  0.008    0.001   0.005  0.005
Beta(2,2)   1.036   0.706  0.729    1.023   0.477  0.458    1.008   0.326  0.299
            0.002   0.009  0.011    0.002   0.007  0.007    0.001   0.004  0.004
Legendre degree L = 3
Beta(1,3)   1.955   0.727  0.484    1.846   0.248  0.178    1.804   0.123  0.081
            0.002   0.010  0.008    0.002   0.004  0.003    0.001   0.002  0.001
Beta(2,3)   1.211   0.642  0.491    1.157   0.313  0.201    1.128   0.181  0.094
            0.001   0.008  0.007    0.001   0.004  0.003    0.001   0.002  0.001
Beta(2,2)   1.062   0.508  0.414    1.019   0.200  0.156    0.995   0.094  0.069
            0.001   0.007  0.006    0.001   0.003  0.003    0.001   0.001  0.001

Table A.3: Shape of weights estimation accuracy I. – The table reports the accuracy of the estimated weight functions for the first four DGPs of Tables A.1–A.2, using the LASSO-U, LASSO-M, and SGL-M estimators and the weight functions Beta(1,3), Beta(2,3), and Beta(2,2), with sample sizes T = 50, 100, and 200. Entries in odd rows are the average point-wise mean squared error, and entries in even rows are the simulation standard error.
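The Beta(a, b) target weights and the Legendre dictionary used by the MIDAS estimators in these tables can be sketched as follows. The lag-grid conventions here are our own assumptions, since implementations differ in how lags are placed on [0, 1]:

```python
import numpy as np

def beta_weights(a, b, n_lags):
    """Beta(a, b) MIDAS weights: u**(a-1) * (1-u)**(b-1) evaluated on an
    equally spaced grid in (0, 1) and normalized to sum to one."""
    u = np.linspace(1e-4, 1 - 1e-4, n_lags)
    w = u ** (a - 1) * (1 - u) ** (b - 1)
    return w / w.sum()

def legendre_dictionary(n_lags, degree):
    """Dictionary W whose columns are Legendre polynomials P_0..P_degree;
    numpy's Legendre basis lives on [-1, 1], so map the [0, 1] grid onto it."""
    u = np.linspace(0, 1, n_lags)
    return np.polynomial.legendre.legvander(2 * u - 1, degree)

n_lags = 12
w = beta_weights(1, 3, n_lags)             # declining weights: w[0] > ... > w[-1]
W = legendre_dictionary(n_lags, degree=3)  # shape (12, 4)
# Beta(1, 3) is proportional to (1-u)**2, a degree-2 polynomial in u, so a
# degree-3 Legendre dictionary reproduces it essentially exactly:
coef, *_ = np.linalg.lstsq(W, w, rcond=None)
print(np.max(np.abs(W @ coef - w)))
```

Higher Beta shapes are not exact polynomials of low degree, which is why the choice of L (compare the L = 3 and L = 10 blocks) trades off approximation bias against the number of parameters per group.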
                  T = 50                   T = 100                  T = 200
           LASSO-U LASSO-M SGL-M   LASSO-U LASSO-M SGL-M   LASSO-U LASSO-M SGL-M
Legendre degree L = 10
Beta(1,3)   1.955   1.155  0.952    1.846   0.378  0.386    1.804   0.163  0.179
            0.002   0.013  0.011    0.002   0.006  0.006    0.001   0.002  0.003
Beta(2,3)   1.211   0.902  0.885    1.157   0.423  0.370    1.128   0.225  0.162
            0.001   0.008  0.009    0.001   0.005  0.005    0.001   0.002  0.002
Beta(2,2)   1.062   0.747  0.775    1.019   0.293  0.314    0.995   0.126  0.135
            0.001   0.007  0.008    0.001   0.004  0.005    0.001   0.002  0.002
Low-frequency noise level σ_u = 5
Beta(1,3)   2.022   1.736  1.389    1.972   1.290  0.757    1.893   0.716  0.355
            0.001   0.012  0.017    0.002   0.011  0.011    0.002   0.008  0.006
Beta(2,3)   1.244   1.132  1.060    1.220   0.929  0.700    1.186   0.657  0.385
            0.001   0.008  0.013    0.001   0.007  0.009    0.001   0.006  0.006
Beta(2,2)   1.089   0.980  0.936    1.069   0.781  0.604    1.042   0.509  0.315
            0.001   0.007  0.012    0.001   0.007  0.008    0.001   0.005  0.005
Half high-frequency lags
Beta(1,3)   1.997   1.509  1.083    1.925   0.882  0.686    1.878   0.571  0.535
            0.001   0.011  0.011    0.001   0.008  0.006    0.001   0.004  0.004
Beta(2,3)   1.243   1.121  1.026    1.221   0.913  0.828    1.202   0.729  0.716
            0.001   0.006  0.007    0.001   0.005  0.006    0.001   0.004  0.004
Beta(2,2)   1.090   0.998  0.955    1.074   0.838  0.813    1.059   0.719  0.740
            0.001   0.005  0.007    0.001   0.005  0.005    0.000   0.004  0.004
Number of covariates p = 50
Beta(1,3)   2.031   1.563  1.038    1.931   0.620  0.401    1.841   0.223  0.174
            0.001   0.010  0.011    0.002   0.007  0.005    0.001   0.002  0.002
Beta(2,3)   1.250   1.067  0.883    1.206   0.606  0.436    1.156   0.296  0.196
            0.001   0.006  0.007    0.001   0.006  0.005    0.001   0.003  0.002
Beta(2,2)   1.095   0.923  0.782    1.062   0.461  0.360    1.023   0.178  0.153
            0.000   0.005  0.006    0.001   0.005  0.004    0.001   0.002  0.002

Table A.4: Shape of weights estimation accuracy II. – See Table A.3.
Figure A.1: The figure shows the fitted Beta(1,3) weights. We plot the estimated weights for the LASSO-U-MIDAS, LASSO-MIDAS, and sg-LASSO-MIDAS estimators for the baseline DGP scenario. The first row plots weights for sample size T = 50; the second row plots weights for sample size T = 200. The black solid line is the median estimate of the weight function, the black dashed line is the population weight function, and the grey area is the 90% confidence interval.

Figure A.2: The figure shows the fitted Beta(2,3) weights. We plot the estimated weights for the LASSO-U-MIDAS, LASSO-MIDAS, and sg-LASSO-MIDAS estimators for the baseline DGP scenario. The first row plots weights for sample size T = 50; the second row plots weights for sample size T = 200. The black solid line is the median estimate of the weight function, the black dashed line is the population weight function, and the grey area is the 90% confidence interval.

Figure A.3: The figure shows the fitted Beta(2,2) weights. We plot the estimated weights for the LASSO-U-MIDAS, LASSO-MIDAS, and sg-LASSO-MIDAS estimators for the baseline DGP scenario. The first row plots weights for sample size T = 50; the second row plots weights for sample size T = 200. The black solid line is the median estimate of the weight function, the black dashed line is the population weight function, and the grey area is the 90% confidence interval.

A.4 Detailed data description
To compute the main results reported in Table 1, we use thirty monthly macro series, eight quarterly survey covariates, and six news attention series, all aggregated using Legendre polynomials. The thirty predictors are real-time macro series that are either taken directly from the FRED-MD dataset or calculated from FRED-MD series, denoted FRED-MD and FRED-MD (calc.), respectively, in the Source column of the data description Table A.5 below. Note that for all monthly macro data we use real-time vintages, which effectively means that we take all macro series with a one-month delay. For example, if we nowcast the first quarter of GDP one month ahead, we use data up to the end of February, and thus all macro series that enter the model are available up to the end of January. We use Legendre polynomials of degree three to aggregate twelve lags of monthly macro data for all covariates. In particular, let x_{t+(h+1−j)/m,k} be the k-th covariate at quarter t, with j = 1, . . . , 12, m = 3, and h = 2, minus an additional lag to account for the publication delay of the macro series. Therefore, for macro series the first lag index is (h + 1 − j − 1)/m = (2 + 1 − 1 − 1)/3 = 1/3. We then collect all lags in the vector X_{t,k} = (x_{t+1/3,k}, x_{t+0/3,k}, . . . , x_{t−10/3,k}) and aggregate X_{t,k} using the dictionary W consisting of Legendre polynomials, i.e., X_{t,k}W. In this case, X_{t,k}W is defined as a single group for the sg-LASSO estimator.

Furthermore, we use data from the Survey of Professional Forecasters (SPF) - nowcasts and forecasts - aggregated using a Legendre polynomial of degree three. More precisely, denote by x^h_{mean,t} and x^h_{median,t} the mean and the median forecast at horizon h. We collect all mean and median forecast horizons that do not have missing entries, h = 0, 1, 2, 3, in the vector X_t = (x^0_{mean,t}, x^1_{mean,t}, x^2_{mean,t}, x^3_{mean,t}, x^0_{median,t}, x^1_{median,t}, x^2_{median,t}, x^3_{median,t}), aggregate these data using the same dictionary W, i.e., X_tW, and define X_tW as a single group for the sg-LASSO. Note that the SPF data is quarterly; therefore, we aggregate it cross-sectionally rather than over time.

Lastly, we take six news attention series from http://structureofnews.com/, see Table A.5, and, as for the macro series, use Legendre polynomials of degree three to aggregate twelve monthly lags of each news attention series. We transform all macro covariates using the transformations suggested by McCracken and Ng (2016), see Table A.5, and we standardize all covariates before the aggregation step. The news attention series, however, are used without a publication delay; that is, for the one-month horizon, we take the series up to the end of the second month.

We compute the predictions using an expanding window scheme. The first nowcast is for 2002 Q1: the effective sample runs from February 1990 to November 2001, and the prediction is computed using February 2002 data. We calculate predictions until the sample is exhausted in 2017 Q2, the last date for which news attention data is available.

id  Series                                        Source                                T-code
 1  Commodities                                   Bybee, Kelly, Manela, and Xiu (2020)  1
 2  Government budgets                            Bybee, Kelly, Manela, and Xiu (2020)  1
 3  Oil market                                    Bybee, Kelly, Manela, and Xiu (2020)  1
 4  Recession                                     Bybee, Kelly, Manela, and Xiu (2020)  1
 5  Savings & loans                               Bybee, Kelly, Manela, and Xiu (2020)  1
 6  Mortgages                                     Bybee, Kelly, Manela, and Xiu (2020)  1
 7  IP: Business Equipment                        FRED-MD                               5
 8  IP: Fuels                                     FRED-MD                               5
 9  IP: Manufacturing (SIC)                       FRED-MD                               5
10  IP: Durable Consumer Goods                    FRED-MD                               5
11  Civilians Unemployed - Less Than 5 Weeks      FRED-MD                               5
12  All Employees: Financial Activities           FRED-MD                               5
13  All Employees: Government                     FRED-MD                               5
14  Initial Claims                                FRED-MD                               5
15  All Employees: Total nonfarm                  FRED-MD                               5
16  All Employees: Service-Providing Industries   FRED-MD                               5
17  All Employees: Mining and Logging: Mining     FRED-MD                               5
18  Unemployment Rate                             FRED-MD                               2
19  All Employees: Manufacturing                  FRED-MD                               5
20  Housing Starts, Midwest                       FRED-MD                               4
21  Housing Starts, West                          FRED-MD                               4
22  Housing Starts: Total New Privately Owned     FRED-MD                               4
23  Retail and Food Services Sales                FRED-MD                               5
24  New Orders for Durable Goods                  FRED-MD                               5
25  MZM Money Stock                               FRED-MD                               6
26  Personal Cons. Expend.: Chain Index           FRED-MD                               6
27  CPI: All Items                                FRED-MD                               6
28  S&P: Industrials                              FRED-MD                               5
29  3-Month AA Fin. Comm. Paper Rate              FRED-MD                               2
30  Crude Oil                                     FRED-MD                               6
31  5-Year Treasury                               FRED-MD                               2
32  3-Month Commercial Paper - FEDFUNDS           FRED-MD                               1
33  Moodys Baa Corporate Bond - FEDFUNDS          FRED-MD                               1
34  10-Year Treasury                              FRED-MD                               2
35  S&P 500                                       FRED-MD                               5
36  Moodys Baa - Aaa Corporate Bond Spread        FRED-MD (calc.)                       1
37  Survey of professional forecasters            Phil. Fed                             1
Table A.5: Data description table – The id column gives the series identifier; the data source is given in the second column, Source. The T-code column denotes the data transformation applied to a time series: (1) not transformed, (2) ∆x_t, (3) ∆²x_t, (4) log(x_t), (5) ∆log(x_t), (6) ∆²log(x_t).

A.5 Additional results for empirical application
A.5.1 Full-sample nowcasting results
As for the main results, we benchmark our predictions against the simple random walk (RW) model. We implemented the following alternative machine learning nowcasting methods. The first method is a PCA factor-augmented autoregression, where we estimate the first principal component of the monthly macro panel and use it together with four autoregressive lags; we denote this model PCA-OLS. We then consider three alternative penalty functions for the same linear model: Ridge, LASSO, and Elastic Net. For these methods we leave the high-frequency lags unrestricted, and we therefore call them unrestricted MIDAS (U-MIDAS) methods. Lastly, we use the sg-LASSO estimator, where we aggregate the high-frequency lags using MIDAS weights; the weight function is approximated by Legendre polynomials of degree three. For each method, we use four lags of GDP and twelve lags of each high-frequency covariate. The first prediction is for 2002 Q1, and we use an expanding window scheme up until 2017 Q2. In this case, we use larger samples to estimate all models, and thus the effective sample starts from February 1960. For each quarter, we take predictors that do not have missing values. In addition, we compute the corporate bond spread and discard the NONBORRES series due to a possible break in this series, see Uematsu and Tanaka (2019). In total, the number of covariates (not counting lags) ranges from 94 to 114. In Table A.6, we report out-of-sample nowcasting results for the one-month horizon using real-time data vintages. We report the root mean squared forecast error relative to the RW model (column Rel-RMSE) and the Diebold-Mariano predictive accuracy test statistic (DM). In addition to the models we implemented, we compare with the GDP growth nowcasts provided by the New York Fed (denoted NY Fed).
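The expanding window scheme described above can be sketched as follows; the "model" here is a trivial historical mean, purely to illustrate the mechanics (all names are ours, not from the paper):

```python
import numpy as np

def expanding_window_predictions(y, first_train_size, fit_predict):
    """Expanding window: for each t, fit on observations 0..t-1 and predict
    observation t, so the training sample grows by one at every step."""
    return np.array([fit_predict(y[:t])
                     for t in range(first_train_size, len(y))])

# Toy model: the prediction is the historical mean of the training sample.
y = np.arange(6, dtype=float)  # 0, 1, 2, 3, 4, 5
preds = expanding_window_predictions(y, first_train_size=3, fit_predict=np.mean)
print(preds)  # one prediction each for t = 3, 4, 5
```

In the empirical application the same loop would refit each penalized regression on every expanded sample before producing the next nowcast.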
The column DM-stat-1 reports the Diebold-Mariano test statistic comparing NY Fed predictions with the other methods, and the column DM-stat-2 compares sg-LASSO-MIDAS with the alternative methods. Using full-sample data, the sg-LASSO-MIDAS model also gives smaller forecast errors than the NY Fed predictions; however, the gains are not statistically significant. Nonetheless, sg-LASSO-MIDAS nowcasts give significantly smaller prediction errors than the other alternative machine learning methods.

Figure A.4: Sparsity pattern for the 50 most selected covariates. The panels show the sparsity pattern over the out-of-sample quarters, the number of selected covariates, and the squared forecast errors.

Model                 Rel-RMSE  DM-stat-1  DM-stat-2
RW                    2.606     2.370      3.318
PCA-OLS               0.849     0.854      1.975
Ridge-U-MIDAS         0.838     0.763      1.974
LASSO-U-MIDAS         0.853     0.967      2.039
Elastic Net-U-MIDAS   0.833     0.699      1.888
sg-LASSO-MIDAS        0.750     -0.739     –
NY Fed                0.790     –          0.739

Table A.6: Nowcast comparison table – Forecast horizon is one month ahead. Column Rel-RMSE reports the root mean squared forecast error relative to the RW model. Column DM-stat-1 reports the Diebold and Mariano (1995) test statistic of all models relative to NY Fed nowcasts, while column DM-stat-2 reports the Diebold-Mariano test statistic relative to the sg-LASSO-MIDAS model. Out-of-sample period: 2002 Q1 to 2017 Q2.

Figure A.5: Cumulative sum of loss differential. The gray shaded area marks NBER recession periods.
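The cumulative sum of loss differentials plotted in Figures A.5 and A.6 can be computed as below; the arrays are illustrative values of ours, not data from the paper:

```python
import numpy as np

def cumulative_loss_differential(y, yhat_benchmark, yhat_model):
    """Cumulative sum of squared-error loss differentials; the curve rises
    whenever the model's squared error is smaller than the benchmark's."""
    d = (np.asarray(y) - np.asarray(yhat_benchmark)) ** 2 \
        - (np.asarray(y) - np.asarray(yhat_model)) ** 2
    return np.cumsum(d)

y = np.array([2.0, 1.0, 3.0, 2.5])      # realized values (illustrative)
benchmark = np.zeros(4)                  # no-change benchmark nowcasts
model = np.array([1.5, 1.2, 2.7, 2.4])   # competing model nowcasts
cumsfe = cumulative_loss_differential(y, benchmark, model)
print(cumsfe)
```

A persistently rising curve indicates that the model accumulates smaller losses than the benchmark, which is how the figures should be read.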
Figure A.6: Cumulative sum of loss differentials (cumsfe) of New York Fed nowcasts compared with the sg-LASSO-MIDAS model with textual-analysis-based data for different values of α ∈ [0, 1]. Dashed black lines are the cumsfe's for values of α other than 0.65, and the dash-dotted black line is the cumsfe for α = 0.65. The gray shaded area marks NBER recession periods.

Table A.7: Significance test table

                     M_T = 10   M_T = 20   M_T = 30
Commodities            3.152      3.271      3.024
Government budgets    16.461     13.512     13.680
Oil market            11.216      9.796     12.068
Recession              5.843      4.720      4.132
Savings & loans        1.900      2.275      2.440
Mortgages              1.923      2.728      3.031

Table A.8: