Large Vector Auto Regressions
Song Song∗, Peter J. Bickel†

June 21, 2011

∗ University of California, Berkeley. Email: [email protected]
† University of California, Berkeley. Email: [email protected]
Abstract
One popular approach for nonstructural economic and financial forecasting is to include a large number of economic and financial variables, which has been shown to lead to significant improvements in forecasting, for example, by the dynamic factor models. A challenging issue is to determine which variables and (their) lags are relevant, especially when there is a mixture of serial correlation (temporal dynamics), high dimensional (spatial) dependence structure and moderate sample size (relative to dimensionality and lags). To this end, an integrated solution that addresses these three challenges simultaneously is appealing. We study large vector auto regressions here with three types of estimates, which treat each variable's own lags differently from other variables' lags, distinguish various lags over time, and are able to select the variables and lags simultaneously. We first show the consequences of using Lasso type estimates directly for time series without considering the temporal dependence. In contrast, our proposed method can still produce an estimate as efficient as an oracle under such scenarios. The tuning parameters are chosen via a data driven "rolling scheme" method to optimize the forecasting performance. A macroeconomic and financial forecasting problem is considered to illustrate its superiority over existing estimators.
Keywords: Time Series, Vector Auto Regression, Regularization, Lasso, Group Lasso, Oracle estimator
JEL classification : C13, C14, C32, E30, E40, G10
1 Introduction

Macroeconomic forecasting is one of the central tasks in Economics. Broadly speaking, there are two approaches, structural and nonstructural forecasting. Structural forecasting, which aligns itself with economic theory, and hence rises and falls with that, receded following the decline of Keynesian theory. In recent years, new dynamic stochastic general equilibrium theory has been developed, and structural macroeconomic forecasting is poised for resurgence. Nonstructural forecasting, in contrast, attempts to exploit the reduced-form correlations in observed macroeconomic time series, has little reliance on economic theory, has always worked well and continues to be improved. Various univariate and multivariate time series techniques have been proposed, e.g. the auto regression (AR), moving average (MA), autoregressive moving average (ARMA), generalized autoregressive conditional heteroskedasticity (GARCH) and vector auto regression (VAR) models among many others. A very challenging issue for this nonstructural approach is to determine which variables and (their) lags are relevant. If we omit some "important" variables by mistake, it potentially creates an omitted variable bias with adverse consequences for both structural analysis and forecasting. For example, Christiano et al. (1999) point out that the positive reaction of prices in response to a monetary tightening, the so-called price puzzle, is an artefact resulting from the omission of forward-looking variables, such as the commodity price index. Recently, Bańbura et al. (2010) show that, when using cross-sectional shrinkage, the forecasting performance of a small monetary vector auto regression can be improved by adding additional macroeconomic variables and sectoral information. To illustrate this, we consider an example of interest rate forecasting. Nowadays people primarily use univariate or multivariate time series models, e.g. the Vasicek, CIR, Jump-Diffusion, Regime-Switching, and time-varying coefficients models, all of which are mostly based on the information from the interest rate time series itself. However, in practice, the central bank (Fed) bases its decisions on interest rate adjustment (as a monetary policy instrument) heavily on the national macroeconomic situation by taking many macro and financial measures into account. Bringing in this additional spatial information (over the space of variables instead of from a geographic point of view; this usage is kept below for convenience) will therefore help improve the forecasting performance. Another example of the interactions between macroeconomics and finance comes from modeling credit defaults by also using macroeconomic information, since variation in aggregate default rates over time presumably also reflects changes in general economic conditions. Figlewski et al. (2006) find credit events are significantly affected by macroeconomic factors. Not only can macroeconomics affect finance; finance can also affect macroeconomics. For example, an economic crisis typically starts with a stock market crash. All of these call for an integrated analysis of macroeconomics and finance. Thus recently there has been a growing trend of using large panels of macroeconomic and financial time series for forecasting, impulse response studies and structural analysis, Forni et al.
(2000), Stock and Watson (2002a), Stock and Watson (2002b); also see Forni et al. (2005), Stock and Watson (2005b), Giannone et al. (2005) and Bańbura et al. (2010) for the latest advancements.

Besides its presence in empirical macroeconomics, high dimensional data, where information often scatters through a large number of interrelated time series, is also attracting increasing attention in many other fields of economics and finance. In neuro-economics and behavioral finance, one uses high dimensional functional magnetic resonance imaging (fMRI) data to analyze the brain's response to certain risk related stimuli as well as to identify its activation area, Worsley et al. (2002) and Myšičková et al. (2011). In quantitative finance, one studies the dynamics of the implied volatility surface for risk management, calibration and pricing purposes, Fengler et al. (2007). Other examples and research fields for very large dimensional time series include mortality analysis, Lee and Carter (1992); bond portfolio risk management or derivative pricing, Nelson and Siegel (1987) and Diebold and Li (2006); international economy (many countries); industrial economy (many firms); and quantitative finance (many assets) among many others.

On the methodology side, if people still use either low dimensional (multivariate) time series techniques on a few subjectively (or from some background knowledge) selected variables or high dimensional "static" methods initially designed for independent data, they might either disregard potentially relevant information (temporal dynamics and spatial dependence) and produce suboptimal forecasts, or bring in additional risk. Examples include the already mentioned price puzzle and interest rate forecasting problems. The more scattered and dynamic the information is, the more severe this loss becomes. Modeling becomes more challenging given that the macroeconomic data we typically deal with have only low frequencies, e.g. monthly or yearly. For example, the popularly used dataset introduced by Stock and Watson (2005a) contains 131 monthly macro indicators covering a broad range of categories including income, industrial production, capacity, employment and unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders, stock prices, interest rates for different maturities, exchange rates and money aggregates, and so on. The time span is from January 1959 to December 2003 (so T = 540). In summary, we can see that the challenge of modeling high dimensional time series, especially macroeconomic ones, comes from a mixture of serial correlation (temporal dynamics), high dimensional (spatial) dependence structure and moderate sample size (relative to dimensionality and lags). To this end, an integrated solution addressing these three challenges simultaneously is appealing.

To circumvent this problem, dynamic factor models have recently been considered quite successful in the analysis of large panels of time series data, Forni et al. (2000), Stock and Watson (2002a), Stock and Watson (2002b); also see Forni et al. (2005), Giannone et al. (2005), Park et al. (2009) and Song et al. (2010) (nonstationary case). They rely on the assumption that the bulk of dynamic interrelations within a large dataset can be explained and represented by a few common factors (low dimensional time series).
Less general models in the literature include the static factor models proposed by Stock and Watson (2002a), Stock and Watson (2002b) and the exact factor model suggested by Sargent and Sims (1977) and Geweke (1977).

Compared with the well studied dynamic factor models through the use of dynamic principal component analysis, the vector auto regressive (VAR) models have several natural advantages. For example, compared with the dynamic factor models' typical 2-step estimation procedure (dimension reduction first, then low dimensional time series modeling), the VAR approach is able to model the high dimensional time series in one step, which may lead to greater efficiency. It also allows variable-to-variable relationship (impulse response) analysis and facilitates the corresponding interpretation, which is not feasible in the factor modeling setup since the variables are "represented" by the corresponding factors. Historically, the VAR models were not appropriate for analyzing high dimensional time series because they involve the estimation of too many ($J^2 P$, where $J$ is the dimensionality and $P$ is the number of lags) parameters. Thus they were primarily implemented in relatively low dimensional situations, e.g. the Bayesian VARs (BVAR) of Doan et al. (1984), or still through the idea of factor modeling, e.g. the factor-augmented VAR (FAVAR) of Bernanke et al. (2005). However, based on recent advances in variable selection, shrinkage and regularization theory from Tibshirani (1996), Zou (2006) and Yuan and Lin (2006), large unrestricted vector auto regression becomes an alternative for the analysis of large dynamic systems. Therefore, the VAR framework can also be applied to empirical problems that require the analysis of more than a handful of time series. De Mol et al. (2008) and Bańbura et al. (2010) proceed from the Bayesian point of view. Chudik and Pesaran (2007) consider the case that $P = 1$ and both $J$ and $T$ are large through some "neighboring" procedure, which can be viewed as a special case of the "segmentized grouping" we study here (details in Subsection 2.4). In the univariate case ($J = 1$), Wang et al. (2007) study regression coefficient and autoregressive order shrinkage and selection via the Lasso when $P$ is large and $T$ is small (relative to $P$).

In this article, we study large vector auto regressions when $J, P \to \infty$ and $T$ is moderate (relative to $JP$). Compared to prior works on (large) vector auto regressions, the novelty of this article lies in the following perspectives.
First, from the variable selection and regularization point of view, the theoretical properties of many existing methods have been established under the independent scenario, which is rarely met in practice and contradicts the original time series setup (if such methods are used directly). Disregarding the serial correlation in variable selection and regularization can be dangerous in the sense that various risk bounds in fact depend on the degree of time dependence, as we will illustrate later. We propose a new methodology to address this serial correlation (time dependence) issue together with high dimensionality and moderate sample size, which enables us to obtain consistency of variable selection even under the dependent scenario, i.e. to reveal the equilibrium among them.
Second, our method is able to do variable selection and lag selection simultaneously. In the previous literature, variable selection is usually carried out first, and then the corresponding estimates' performances w.r.t. different numbers of lags are compared through some information criterion to select the "optimal" number of lags. By doing so, we neglect the "interaction" between variable selection and lag selection. Additionally, when the number of lag candidates to be searched over is large, it is also computationally inefficient, due to the cost of the repeated variable selection procedures.
Third, we differentiate the variable of interest's own lags (abbreviated as own lags afterwards) from the lags of other variables (abbreviated as others' lags afterwards). Their relative weights are also allowed to vary when predicting different variables, while in the other literature they are assumed to stay the same. This reflects the fact that the dynamics of some variables are driven by their own past, while for a different variable, they might be driven by the dynamics of others. When we include a vast number of macroeconomic and financial time series, assuming the same weight seems too restrictive.
Fourth, our method is based on a more computationally efficient approach, which mostly uses existing packages, e.g. the LARS (least angle regression) package developed by Efron et al. (2004), while most other works in the literature go through the Bayesian approach, which requires the choice of priors.

The rest of the article is organized as follows. In the next section, we present the main ingredients of the large vector autoregressive model (Large VAR) together with several corresponding estimation procedures and comparisons among them. The estimates' properties are presented in Section 3. In Section 4, the method is applied to the motivating macroeconomic forecasting problem, which shows that it outperforms some major existing methods. Section 5 contains concluding remarks with discussions about relaxing some assumptions. All technical proofs are sketched in the appendix.
2 Large Vector Auto Regressions

In this section, we introduce the model with three different estimates first, then discuss the data driven choice of hyperparameters to optimize the forecasting performance, provide a numerical algorithm, and finally summarize comparisons among these three estimates.

2.1 The Model
Assume that the high dimensional time series $\{Y_{tj}\}_{t=1,j=1}^{T,J}$ is generated from
\[
Y_t^\top = Y_{t-1}^\top B_1 + \ldots + Y_{t-P}^\top B_P + U_t^\top, \tag{1}
\]
or, stacking all observations,
\[
\underbrace{\begin{pmatrix} Y_T^\top \\ Y_{T-1}^\top \\ \vdots \end{pmatrix}}_{T \times J}
=
\underbrace{\begin{pmatrix} Y_{T-1}^\top & Y_{T-2}^\top & \ldots & Y_{T-P}^\top \\ Y_{T-2}^\top & Y_{T-3}^\top & \ldots & Y_{T-1-P}^\top \\ \vdots & \vdots & \ddots & \vdots \end{pmatrix}}_{T \times JP}
\underbrace{\begin{pmatrix} B_1 \\ \vdots \\ B_P \end{pmatrix}}_{JP \times J}
+
\underbrace{\begin{pmatrix} U_T^\top \\ U_{T-1}^\top \\ \vdots \end{pmatrix}}_{T \times J},
\qquad
Y = XB + U \quad \text{(compact form)}, \tag{2}
\]
where

• $Y = (Y_T, \ldots, Y_1)^\top$ with $Y_t^\top = (Y_{t1}, \ldots, Y_{tJ})$;

• $X = (X_T, \ldots, X_1)^\top$ (the lags of $Y$) with $X_t = (Y_{t-1}^\top, \ldots, Y_{t-P}^\top)^\top$;

• $B_1, \ldots, B_P$ are $J \times J$ autoregressive matrices, where $P$ is the number of lags, $B = (B_1, \ldots, B_P)^\top$ is the $JP \times J$ matrix containing all coefficients $\{B_{pij}\}_{p=1,i=1,j=1}^{P,J,J}$, and $B_{\cdot\cdot j}$, $B_{p\cdot j}$, $B_{\cdot i\cdot}$, $B_{pi\cdot}$ denote the $j$th columns of $B$ and $B_p$ and the $i$th rows of $B$ and $B_p$ respectively;

• $U = (U_T, \ldots, U_1)^\top$, where $U_t$ is a $J$-dimensional noise independent of $X_t$.

All $Y_t$, $X_t$ and $U_t$ are assumed to have mean zero. The $J \times J$ covariance matrix of $U_t$, $\mathrm{Cov}(U_t)$, is assumed to be independent of $t$. Here we assume $\mathrm{Cov}(U_t)$ to be diagonal, say $I_{J \times J}$. In our case, this is justified by the fact that the variables in the panel we consider for estimation are standardized and demeaned. A similar assumption is also made in De Mol et al. (2008). The relaxation allowing nonzero off-diagonal entries is discussed in Section 5.

We can see that given large $J$ and $P$, we have to estimate a total of $J^2 P$ parameters, which is much larger than the moderate number of observations $JT$, i.e.
$JP \gg T$. Consequently, ordinary least squares estimation is not feasible. Additionally, due to structural change points in macro and financial data (although not explored in this paper), the effective number of observations used for estimation could be much smaller than the original $T$. Thus we can see that, on one hand, we do not want to impose any restrictions on the parameters so as to attain a general representation; on the other hand, it is known that making the model unnecessarily complex can degrade the efficiency of the resulting parameter estimates and yield less accurate predictions, as well as making interpretation and variable selection difficult. Hence, to avoid over fitting, regularization and variable selection are necessary. In the following, we discuss the estimation procedure with different kinds of regularization (illustrated in Figure 1). Before moving on, we incorporate the following very mild belief, as also considered in Bańbura et al. (2010): the more recent lags should provide more reliable information than the more distant ones, which tries to strike a balance between attaining model simplicity and keeping the historic information.

Figure 1: Illustration of three different types of estimates.
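To make the stacked form in (2) concrete, here is a minimal sketch (our own illustration, not the authors' code; the helper name `build_var_design` is hypothetical) that builds $Y$ and $X$ from a $(T+P) \times J$ panel. Rows are ordered forward in time rather than backward as in the display above, which does not affect estimation.

```python
import numpy as np

def build_var_design(series, P):
    """Stack a (T+P) x J panel into the regression form Y = X B + U of (2).

    Row t of Y is Y_t^T; the matching row of X is (Y_{t-1}^T, ..., Y_{t-P}^T),
    so X has J*P columns, ordered lag by lag.
    """
    TP, J = series.shape
    Y = series[P:, :]                               # responses, after losing P lags
    X = np.hstack([series[P - p:TP - p, :] for p in range(1, P + 1)])
    return Y, X

# toy usage: J = 5 variables, P = 3 lags
rng = np.random.default_rng(0)
panel = rng.standard_normal((100, 5))
Y, X = build_var_design(panel, P=3)
print(Y.shape, X.shape)                             # (97, 5) (97, 15)
```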
2.2 Universal Grouping

Without loss of generality, we start by considering one coefficient matrix, say $B_p$ with entries $\{B_{pij}, 1 \le i, j \le J\}$. Inspired by Bańbura et al. (2010), we note the fact that the dynamics of some variables are driven by their own past, while for a different variable, they might be driven by the dynamics of others. Consequently, we treat the variables' own lags (diagonal terms of $B_p$) differently from others' lags (off-diagonal terms of $B_p$) and impose different regularizations on them. We assume that the off-diagonal coefficients of $B_p$ are not only sparse, but also have the same sparsity pattern across different columns, which we call group sparsity. Thus we base our selection solution on group Lasso techniques (Yuan and Lin (2006)) for the off-diagonal terms and Lasso techniques (Tibshirani (1996)) for the diagonal terms. We use $B_{pj-j}$ to denote the vector composed of $\{B_{pji}\}_{i \ne j}$ and $W_{-j}$ to denote the
$(J-1) \times (J-1)$ diagonal matrix $\mathrm{diag}[w_1, \ldots, w_{j-1}, w_{j+1}, \ldots, w_J]$, where $w_i$ is the positive real-valued weight associated with the $i$th variable for $1 \le i \le J$. It is included here primarily for practical implementation since, if $w_i$ is chosen as $\mathrm{Std}(Y_i)$, it is equivalent (see Subsection 2.6 for details) to standardizing the predictors so that they all have zero mean and unit variance, Tibshirani (1996), which is also preferable for comparisons to prior works.

Specifically, given the above notation, we use the group Lasso type penalty $\sum_{j=1}^J \|B_{pj-j} W_{-j}\|_2$ and the Lasso type penalty $\mu \sum_{j=1}^J w_j |B_{pjj}|$ to impose regularization on other regressors' lags and the predicted variables' own lags respectively, and have the following penalty for the $B_p$ matrix:
\[
\sum_{j=1}^J \|B_{pj-j} W_{-j}\|_2 + \mu \sum_{j=1}^J w_j |B_{pjj}| \le C p^{-\alpha}, \tag{3}
\]
with some generic constant $C$.

• The hyperparameter $\mu$ controls the extent to which others' lags are less (more) "important" than the own lags. When $\mu$ is large, the penalty assigned to own lags is larger than to others' lags. As a result, it is more likely that the off-diagonal entries are shrunk to 0 instead of the diagonal ones, which corresponds to the case that the variable's dynamics are driven by its own past, and vice versa when $\mu$ is small.

• The term $p^{-\alpha}$ reflects different regularization for different lags (over time). It becomes smaller as $p$ gets larger. This is consistent with the previous belief: the more recent lags should provide more reliable information than the more distant ones. Thus, large amounts of shrinkage are applied to the more distant lags, whereas small amounts are applied to the more recent ones. The hyperparameter $\alpha$ governs the relative importance of distant lags w.r.t. the more recent ones. Other decreasing functions of $p$, e.g. $f(p) = \log(p)^{-\alpha}$ or $f(p) = \exp(p)^{-\alpha}$, could also be used. However, we do not consider a general representation (with a data driven way to estimate $f(1), \ldots, f(p), \ldots, f(P)$ correspondingly) to avoid too many tuning parameters, especially when $P \to \infty$.

Since we have $P$ coefficient matrices $B_1, \ldots, B_P$, summing (3) up over $p$ (after multiplying by $p^\alpha$ on both sides) yields $\sum_{p=1}^P \sum_{j=1}^J p^\alpha \|B_{pj-j} W_{-j}\|_2 + \mu \sum_{p=1}^P \sum_{j=1}^J p^\alpha w_j |B_{pjj}| \le CP$. If we couple this to the quadratic loss $\{J(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_t^\top - X_t^\top B\|_2^2$ through Lagrange multipliers, we have equation (4):
\[
\min_B \{J(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_t^\top - X_t^\top B\|_2^2 + \lambda \Big( \sum_{p=1}^P \sum_{j=1}^J p^\alpha \|B_{pj-j} W_{-j}\|_2 + \mu \sum_{p=1}^P \sum_{j=1}^J p^\alpha w_j |B_{pjj}| \Big)
\]
\[
\overset{\gamma = \lambda\mu}{=} \min_B \{J(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_t^\top - X_t^\top B\|_2^2 + \lambda \sum_{p=1}^P \sum_{j=1}^J p^\alpha \|B_{pj-j} W_{-j}\|_2 + \gamma \sum_{p=1}^P \sum_{j=1}^J p^\alpha w_j |B_{pjj}| \tag{4}
\]
with hyperparameters $\lambda$, $\gamma$ and $\alpha$. We call this estimate $\hat{B}$ the universal grouping estimate. As the number of variables $J$ increases, the autoregressive coefficients should be shrunk more in order to avoid over-fitting, as already discussed by De Mol et al. (2008).

Using the group Lasso type regularization for the off-diagonal terms actually poses some strong assumptions on the underlying structure, which is not realistic from an economic point of view.
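As a concrete reading of the penalty in (4), here is a minimal sketch (our own illustration; the function name and array layout are assumptions, not the authors' code) that evaluates the mixed group-Lasso/Lasso penalty for a coefficient array $B$ of shape $(P, J, J)$:

```python
import numpy as np

def universal_grouping_penalty(B, w, lam, gam, alpha):
    """Penalty term of (4): an (unsquared) L2 group norm on each off-diagonal
    row of B_p, an L1 penalty on the diagonal, both reweighted by p^alpha.

    B: array of shape (P, J, J) with B[p-1] = B_p; w: length-J positive weights.
    """
    P, J, _ = B.shape
    pen_group, pen_diag = 0.0, 0.0
    for p in range(1, P + 1):
        Bp = B[p - 1]
        for j in range(J):
            off = np.delete(Bp[j, :], j)     # B_{pj,-j}: others' lags in row j
            w_off = np.delete(w, j)          # diagonal of W_{-j}
            pen_group += p ** alpha * np.linalg.norm(off * w_off)
            pen_diag += p ** alpha * w[j] * abs(Bp[j, j])   # own lag
    return lam * pen_group + gam * pen_diag  # gamma = lambda * mu
```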
Remark 2.2.1 First, we have just one hyperparameter $\mu$ ($\mu = \gamma/\lambda$) to control the relative weights between own lags and others' lags. This means that the weights between own lags and others' lags are the same across different dimensions, which is hardly met in practice. Correspondingly, when we select the "optimal" $\mu$ to optimize the forecasting performance, we are actually optimizing the averaged forecasting performance over all $J$ variables instead of the variable of particular interest. This might produce suboptimal forecasts. Bańbura et al. (2010) consider a special case where own lags are always more "important" than others' lags, which might be less general than ours.
Remark 2.2.2 Second, using the $L_2$ norm $\|B_{pj-j} W_{-j}\|_2$ might shrink all off-diagonal terms in the same row ($\{B_{pj\cdot}\}_{\cdot \ne j}$) to zero simultaneously, which implicitly means that, for the $j$th corresponding variable, we assume it is either significant for predicting all the other $J-1$ variables or insignificant for all of them.

2.3 No Grouping

To amend the deficiencies of the universal grouping estimate, we estimate the autoregressive coefficient matrix $B$ column by column instead of all at once. Without loss of generality, we consider the $j$th column $B_{\cdot\cdot j}$ here. Since $B_{\cdot\cdot j}$ is a vector, we can use Lasso type penalties for both own lags and others' lags. Following ideas and abbreviations similar to those in Subsection 2.2, we have equation (5) to get $\hat{B}_{\cdot\cdot j}$:
\[
\min_{B_{\cdot\cdot j}} (T-P)^{-1} \sum_{t=P+1}^T (Y_{tj} - X_t^\top B_{\cdot\cdot j})^2 + \lambda_j \Big( \sum_{p=1}^P \sum_{i \ne j} p^\alpha w_i |B_{pij}| + \mu_j \sum_{p=1}^P p^\alpha w_j |B_{pjj}| \Big)
\]
\[
\overset{\gamma_j = \lambda_j \mu_j}{=} \min_{B_{\cdot\cdot j}} (T-P)^{-1} \sum_{t=P+1}^T (Y_{tj} - X_t^\top B_{\cdot\cdot j})^2 + \lambda_j \sum_{p=1}^P \sum_{i \ne j} p^\alpha w_i |B_{pij}| + \gamma_j \sum_{p=1}^P p^\alpha w_j |B_{pjj}| \tag{5}
\]
with hyperparameters $\lambda_j$, $\gamma_j$ and $\alpha$. The subindex $j$ is added to $\lambda_j$ and $\gamma_j$ to emphasize that they can vary when estimating different $B_{\cdot\cdot j}$'s, $1 \le j \le J$. We call this estimate $\hat{B} = (\hat{B}_{\cdot\cdot 1}, \ldots, \hat{B}_{\cdot\cdot J})$ the no grouping estimate.
Remark 2.3.1 Because of the different $\mu_j$'s ($\mu_j = \gamma_j/\lambda_j$) for different columns' estimates $\hat{B}_{\cdot\cdot j}$, we allow individualized weights between own lags and others' lags, and we can tune the $\lambda_j$'s and $\gamma_j$'s to produce optimal forecasting performance for each variable of interest, say the $j$th.
Remark 2.3.2 Also for the same reason, we get rid of the disadvantage that all off-diagonal terms in one row might be shrunk to 0 simultaneously.

For simplicity of notation, we drop the common subindex $j$ and write $Y_{tj} = y_t$, $B_{\cdot\cdot j} = \beta$, $B_{pij} = c_i$ for $i \ne j$, $B_{pjj} = d_p$, $\lambda_j = \lambda$, $\gamma_j = \gamma$, and (5) becomes:
\[
\min_\beta Q_T(\beta) = \min_\beta (T-P)^{-1} \sum_{t=P+1}^T (y_t - X_t^\top \beta)^2 + \lambda \sum_{p=1}^P \sum_{i \ne j} p^\alpha w_i |c_i| + \gamma \sum_{p=1}^P p^\alpha w_p |d_p|
\]
\[
= \min_\beta (T-P)^{-1} \sum_{t=P+1}^T (y_t - X_t^\top \beta)^2 + \sum_{i=1}^{P(J-1)} \lambda_i |c_i| + \sum_{p=1}^P \gamma_p |d_p| \tag{6}
\]
with $\lambda_i = \lambda p^\alpha w_i$ and $\gamma_p = \gamma p^\alpha w_p$.
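Since (6) is simply an $L_1$-penalized regression with per-coefficient weights $\lambda_i = \lambda p^\alpha w_i$ and $\gamma_p = \gamma p^\alpha w_p$, here is a small sketch (ours; the helper name is hypothetical) that assembles these weights for a target column $j$, aligned with the lag-by-lag column order of the design-matrix sketch above:

```python
import numpy as np

def no_grouping_weights(w, j, P, lam, gam, alpha):
    """Per-coefficient L1 weights of (6) for target variable j.

    Returns a length J*P vector: the entry for variable i at lag p is
    gam * p^alpha * w_j if i == j (own lag, d_p) and lam * p^alpha * w_i
    otherwise (others' lags, c_i).
    """
    J = len(w)
    weights = np.empty(J * P)
    for p in range(1, P + 1):
        block = lam * p ** alpha * np.asarray(w, dtype=float)
        block[j] = gam * p ** alpha * w[j]   # own lag gets the gamma weight
        weights[(p - 1) * J : p * J] = block
    return weights
```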
2.4 Segmentized Grouping

Since large panels of macroeconomic and financial data usually have some natural "segment" structure, e.g. multiple interest rate time series w.r.t. different maturities, numbers of employees w.r.t. different industrial sectors, price indices w.r.t. different goods, etc., if we take this information into account, instead of estimating $B$ either all at once or column by column, we can also do it segment by segment. Without loss of generality, we consider the $i$th segment $B_{\cdot\cdot \mathcal{N}_i}$. $\mathcal{N}_i$ is the index set of the $i$th segment and $N_i = |\mathcal{N}_i|$ denotes the cardinality of the set $\mathcal{N}_i$. $Y_{t\mathcal{N}_i}^\top$ is the corresponding part of $Y_t^\top$. We also use $W_{\mathcal{N}_i}$ to denote the $N_i \times N_i$ diagonal matrix with diagonal entries $\{w_i\}_{i \in \mathcal{N}_i}$ and $W_{\mathcal{N}_i - j}$ to denote the $(N_i - 1) \times (N_i - 1)$ diagonal matrix with diagonal entries $\{w_i\}_{i \in \mathcal{N}_i, i \ne j}$.

Under this situation, we have own lags, others' (in the same segment) lags and others' (outside the segment) lags for the estimation of the $i$th segment's autoregressive coefficients $B_{\cdot\cdot \mathcal{N}_i}$. Again following ideas and abbreviations similar to those in Subsection 2.2, we have the following estimation equation:
\[
\min_{B_{\cdot\cdot \mathcal{N}_i}} \{N_i(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_{t\mathcal{N}_i}^\top - X_t^\top B_{\cdot\cdot \mathcal{N}_i}\|_2^2 + \lambda_{\mathcal{N}_i} \Big( \sum_{p=1}^P \sum_{j \notin \mathcal{N}_i} p^\alpha \|B_{pj\cdot} W_{\mathcal{N}_i}\|_2 + \mu_{1\mathcal{N}_i} \sum_{p=1}^P \sum_{j \in \mathcal{N}_i} p^\alpha w_j |B_{pjj}| + \mu_{2\mathcal{N}_i} \sum_{p=1}^P \sum_{j \in \mathcal{N}_i} p^\alpha \|B_{pj-j} W_{\mathcal{N}_i - j}\|_2 \Big)
\]
\[
= \min_{B_{\cdot\cdot \mathcal{N}_i}} \{N_i(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_{t\mathcal{N}_i}^\top - X_t^\top B_{\cdot\cdot \mathcal{N}_i}\|_2^2 + \lambda_{\mathcal{N}_i} \sum_{p=1}^P \sum_{j \notin \mathcal{N}_i} p^\alpha \|B_{pj\cdot} W_{\mathcal{N}_i}\|_2 + \gamma_{\mathcal{N}_i} \sum_{p=1}^P \sum_{j \in \mathcal{N}_i} p^\alpha w_j |B_{pjj}| + \eta_{\mathcal{N}_i} \sum_{p=1}^P \sum_{j \in \mathcal{N}_i} p^\alpha \|B_{pj-j} W_{\mathcal{N}_i - j}\|_2 \tag{7}
\]
with hyperparameters $\lambda_{\mathcal{N}_i}$, $\gamma_{\mathcal{N}_i}$, $\eta_{\mathcal{N}_i}$ and $\alpha$, where $\gamma_{\mathcal{N}_i} = \lambda_{\mathcal{N}_i} \mu_{1\mathcal{N}_i}$, $\eta_{\mathcal{N}_i} = \lambda_{\mathcal{N}_i} \mu_{2\mathcal{N}_i}$ and $i = 1, \ldots, I$, with $I$ the overall number of segments. We call this estimate $\hat{B} = (\hat{B}_{\cdot\cdot \mathcal{N}_1}, \ldots, \hat{B}_{\cdot\cdot \mathcal{N}_I})$ the segmentized grouping estimate.

Chudik and Pesaran (2007) consider the case where $P = 1$ and $T$ is large (relative to $J$) through some "neighboring" procedure, which can be viewed as a special case of the "segmentized grouping" studied here.

2.5 Choice of Parameters: Forecasting Optimization

The three penalization methods discussed above depend critically on penalty parameter selection for their performance in model selection, parameter estimation and prediction accuracy. Here we have hyperparameters $\lambda, \gamma$ (universal grouping), $\lambda_j, \gamma_j$, $1 \le j \le J$ (no grouping), and $\lambda_{\mathcal{N}_i}, \gamma_{\mathcal{N}_i}, \eta_{\mathcal{N}_i}$, $1 \le i \le I$ (segmentized grouping), plus $\alpha$, and we choose them via a data driven "rolling scheme". To simulate real-time forecasting, we conduct an out-of-sample experiment.
Let $T_0$ and $T_1$ denote the beginning and the end of the evaluation sample respectively. The point estimate of the $j$th variable's forecast is denoted by $\hat{y}^{(\lambda,\gamma,\alpha)}_{j,t|\sigma(t)}$, based on $\sigma(t)$, the information up to time $t$. The point estimate of the one-step-ahead forecast is computed as in equation (5), and the $h$-step-ahead forecasts are computed in a similar spirit, as direct multiple-step-ahead forecasts rather than recursive ones. Out-of-sample forecast accuracy is measured in terms of the mean squared forecast error (MSFE):
\[
MSFE^{(\lambda,\gamma,\alpha)}_{j,h} = \frac{1}{T_1 - T_0 - h + 1} \sum_{t=T_0}^{T_1 - h} \big( \hat{y}^{(\lambda,\gamma,\alpha)}_{j,t+h|\sigma(t)} - y_{j,t+h|\sigma(t)} \big)^2.
\]
We report results for the MSFE relative to that of the benchmark (random walk with drift) model, abbreviated as $MSFE^{(0)}_{j,h}$, as also considered by Bańbura et al. (2010), i.e.
\[
RMSFE^{(\lambda,\gamma,\alpha)}_{j,h} = \frac{MSFE^{(\lambda,\gamma,\alpha)}_{j,h}}{MSFE^{(0)}_{j,h}}.
\]
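A hedged sketch of the rolling evaluation just described (our own illustration: `fit_and_forecast` is a hypothetical stand-in for solving (5) on the training window and issuing a direct $h$-step forecast; the benchmark is the random walk with drift):

```python
import numpy as np

def rolling_rmsfe(series, j, h, window, fit_and_forecast):
    """Relative MSFE in the spirit of Subsection 2.5: for each origin, refit
    on the trailing `window` observations and issue a direct h-step forecast.

    fit_and_forecast(train, j, h) is a placeholder returning the h-step-ahead
    point forecast of variable j based on the training panel `train`.
    """
    T = series.shape[0]
    err_model, err_rw = [], []
    for t in range(window, T - h):
        train = series[t - window:t, :]
        y_hat = fit_and_forecast(train, j, h)
        drift = np.mean(np.diff(train[:, j]))      # random walk with drift
        y_rw = series[t - 1, j] + h * drift
        err_model.append((y_hat - series[t + h - 1, j]) ** 2)
        err_rw.append((y_rw - series[t + h - 1, j]) ** 2)
    return np.mean(err_model) / np.mean(err_rw)    # the RMSFE above
```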
The parameters are estimated using the observations from the most recent 10 years (rolling scheme), as illustrated in Figure 2.

Figure 2: Illustration of the rolling scheme (direct multiple-step-ahead forecasts are used instead of recursive ones).

The parameters are set to yield a desired fit for the variable(s) of interest from $T_0$ to $T_1$. In other words, to obtain the desired magnitude of fit, the search is performed over a grid of $\lambda, \gamma$ and $\alpha$ to minimize $\sum_{j=1}^J RMSFE^{(\lambda,\gamma,\alpha)}_{j,h}$ (universal grouping); over $\lambda_j, \gamma_j$ and $\alpha$ to minimize $RMSFE^{(\lambda,\gamma,\alpha)}_{j,h}$ (no grouping); and over $\lambda_{\mathcal{N}_i}, \gamma_{\mathcal{N}_i}, \eta_{\mathcal{N}_i}$ and $\alpha$ to minimize $\sum_{j \in \mathcal{N}_i} RMSFE^{(\lambda,\gamma,\eta,\alpha)}_{j,h}$ (segmentized grouping) respectively. Due to computational cost, we prefix $\alpha$ to be 1 or 2 first, and then search for the $\lambda$'s and $\gamma$'s over loose grids. For the well performing $\lambda$'s and $\gamma$'s, we then search over denser grids around them. Parallelized loops (parfor) together with fast Lasso solvers (e.g. the glmnet package, ~tibs/glmnet-matlab) make the computation time, including the parameter selection, very moderate in our experience.

2.6 Numerical Implementation

Motivated by the adaptive Lasso procedure, Zou (2006), if we define

• $\mathcal{P} = \mathrm{diag}[1^\alpha, 2^\alpha, \ldots, P^\alpha] \otimes I_{J \times J}$, so that $\mathcal{P}^{-1}$ has diagonal entries $\{1^{-\alpha}, 2^{-\alpha}, \ldots, P^{-\alpha}\}$ (each repeated $J$ times), where $\otimes$ is the Kronecker product and $I_{J \times J}$ is the $J \times J$ identity matrix;

• $W = I_{P \times P} \otimes \mathrm{diag}[w_1, w_2, \ldots, w_J]$;

• $\tilde{X}^\top = X^\top W^{-1} \mathcal{P}^{-1}$ and $\tilde{B} = \mathcal{P} W B$,

and note the fact that $X^\top B$ in (2) is the same as $X^\top W^{-1} \mathcal{P}^{-1} \mathcal{P} W B = \tilde{X}^\top \tilde{B}$, we have the following estimation procedure (the proof is very simple and hence is omitted):

(1) Generate $\tilde{X}^\top = X^\top W^{-1} \mathcal{P}^{-1}$;

(2) Corresponding to the three different estimates (4), (5) and (7), solve:
\[
\min_{\tilde{B}} \{J(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_t^\top - \tilde{X}_t^\top \tilde{B}\|_2^2 + \lambda \sum_{p=1}^P \sum_{j=1}^J \|\tilde{B}_{pj-j}\|_2 + \gamma \sum_{p=1}^P \sum_{j=1}^J |\tilde{B}_{pjj}|, \tag{8}
\]
\[
\min_{\tilde{B}_{\cdot\cdot j}} (T-P)^{-1} \sum_{t=P+1}^T (Y_{tj} - \tilde{X}_t^\top \tilde{B}_{\cdot\cdot j})^2 + \lambda_j \sum_{p=1}^P \sum_{i \ne j} |\tilde{B}_{pij}| + \gamma_j \sum_{p=1}^P |\tilde{B}_{pjj}|, \tag{9}
\]
\[
\min_{\tilde{B}_{\cdot\cdot \mathcal{N}_i}} \{N_i(T-P)\}^{-1} \sum_{t=P+1}^T \|Y_{t\mathcal{N}_i}^\top - \tilde{X}_t^\top \tilde{B}_{\cdot\cdot \mathcal{N}_i}\|_2^2 + \lambda_{\mathcal{N}_i} \sum_{p=1}^P \sum_{j \notin \mathcal{N}_i} \|\tilde{B}_{pj\cdot}\|_2 + \gamma_{\mathcal{N}_i} \sum_{p=1}^P \sum_{j \in \mathcal{N}_i} |\tilde{B}_{pjj}| + \eta_{\mathcal{N}_i} \sum_{p=1}^P \sum_{j \in \mathcal{N}_i} \|\tilde{B}_{pj-j}\|_2; \tag{10}
\]

(3) Output $\hat{B} = W^{-1} \mathcal{P}^{-1} \hat{\tilde{B}}$ with $\hat{\tilde{B}}$ minimizing (8) (universal grouping); $\hat{\tilde{B}} = (\hat{\tilde{B}}_{\cdot\cdot 1}, \ldots, \hat{\tilde{B}}_{\cdot\cdot J})$, $\hat{\tilde{B}}_{\cdot\cdot j}$ minimizing (9) (no grouping); $\hat{\tilde{B}} = (\hat{\tilde{B}}_{\cdot\cdot \mathcal{N}_1}, \ldots, \hat{\tilde{B}}_{\cdot\cdot \mathcal{N}_I})$, $\hat{\tilde{B}}_{\cdot\cdot \mathcal{N}_i}$ minimizing (10) (segmentized grouping).

At Step (2), motivated by Wang et al. (2007), as we have more than one penalty term (mixed Lasso and group Lasso), we can iterate between the penalties to solve it as a standard (group) Lasso problem.

For the "no grouping" estimate, by noting that $\gamma_j = \mu_j \lambda_j$ and $\gamma_j \sum_{p=1}^P |\tilde{B}_{pjj}| = \lambda_j \sum_{p=1}^P |\mu_j \tilde{B}_{pjj}|$, the estimation procedure above is equivalent to:

(1) Generate $\tilde{X}^\top = X^\top W'^{-1} \mathcal{P}^{-1}$ with $W' = I_{P \times P} \otimes \mathrm{diag}[w_1, \ldots, w_{j-1}, \mu_j w_j, w_{j+1}, \ldots, w_J]$ for estimating $B_{\cdot\cdot j}$, $1 \le j \le J$;

(2) Corresponding to (9), solve:
\[
\min_{\tilde{B}_{\cdot\cdot j}} (T-P)^{-1} \sum_{t=P+1}^T (Y_{tj} - \tilde{X}_t^\top \tilde{B}_{\cdot\cdot j})^2 + \lambda_j \sum_{p=1}^P \sum_{i=1}^J |\tilde{B}_{pij}|; \tag{11}
\]

(3) Output $\hat{B} = W'^{-1} \mathcal{P}^{-1} \hat{\tilde{B}}$ with $\hat{\tilde{B}} = (\hat{\tilde{B}}_{\cdot\cdot 1}, \ldots, \hat{\tilde{B}}_{\cdot\cdot J})$ and $\hat{\tilde{B}}_{\cdot\cdot j}$ minimizing (11), which is a plain Lasso problem solvable by existing packages (e.g. glmnet, ~tibs/glmnet-matlab).
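Putting steps (1)-(3) together for the no grouping estimate, here is a minimal sketch (ours, not the authors' implementation). It reuses the `no_grouping_weights` helper sketched in Subsection 2.3, uses scikit-learn's Lasso as a stand-in for glmnet/LARS, and finishes with the OLS refit on the selected support discussed next:

```python
import numpy as np
from sklearn.linear_model import Lasso

def no_grouping_fit(Y, X, j, w, lam, gam, alpha, P):
    """Steps (1)-(3) of Subsection 2.6 for column j, plus an OLS refit."""
    weights = no_grouping_weights(w, j, P, lam, gam, alpha)
    X_tilde = X / weights                    # step (1): X~ = X W'^{-1} P^{-1}
    y = Y[:, j]
    # step (2): one plain Lasso solves (11); all per-coefficient scaling has
    # been absorbed into X_tilde, and sklearn's 1/(2n) loss convention means
    # alpha=0.5 matches the (T-P)^{-1} quadratic loss with unit L1 weights.
    fit = Lasso(alpha=0.5, fit_intercept=False, max_iter=10000).fit(X_tilde, y)
    b_hat = fit.coef_ / weights              # step (3): B_hat = W'^{-1} P^{-1} B~_hat
    # OLS refit on the selected variables/lags to undo the Lasso shrinkage
    S = np.flatnonzero(b_hat)
    if S.size:
        b_hat[S] = np.linalg.lstsq(X[:, S], y, rcond=None)[0]
    return b_hat
```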
In "large $J$, small $T$" paradigms, to get parsimonious models, shrinkage via penalized model selection can shrink insignificant regression coefficients exactly to zero, but at the same time significant coefficients are shrunk as well, even though they are retained in the selected working models, Wainwright (2009) and Huang et al. (2008). To this end, we use our method only for variable (and lag) selection, but not for estimation, Chernozhukov et al. (2011). Thus, we implement ordinary least squares estimation for the variables (and lags) selected from (8), (9) and (10) w.r.t. the three different estimates.

2.7 Comparisons

Now it is a matter of which of these three regularization techniques to use in practice.
First, as already discussed in Remarks 2.2.1 and 2.3.1, from the point of view of allowing individualized weights between own lags and others' lags (for the variable of particular interest) and individualized optimization of forecasting performance, the "no grouping" approach is the best, the "universal grouping" one is the worst, and the "segmentized grouping" one is in between.
Second, as in Remarks 2.2.2 and 2.3.2, from the point of view of whether all off-diagonal autoregressive coefficients in one row are shrunk to zero simultaneously, the "no grouping" one is still favored.
Third, as in Subsection 2.5, the tuning parameters w.r.t. the universal grouping, no grouping and segmentized grouping estimates are selected to optimize the averaged forecasting performance over all variables, the specific variable's forecasting performance, and the averaged forecasting performance over the variables in the same segment respectively. When different variables' time series have very distinct patterns, this individualized optimization is preferred.
Fourth, for the estimation of large coefficient matrices, due to the strong group-sparsity assumption on the underlying structure mentioned in Subsection 2.2, the group Lasso type estimator actually has a sharper theoretical risk bound; see Huang and Zhang (2009) for more details. In particular, they show that the group Lasso is more robust to noise due to the stability associated with the group structure and thus requires a smaller sample size to satisfy the sparse eigenvalue condition required in modern sparsity analysis. The universal grouping estimate is also more computationally efficient since the whole coefficient matrix is estimated at once. However, note that the statistical error is a combination of modeling error and estimation error. Even though the group Lasso type estimate might have a smaller estimation error, due to the strong assumption on the underlying structure the overall risk might not be smaller, as we discussed in Subsection 2.2. Moreover, typical macroeconomic data has low frequency, i.e. monthly, so the computational cost is not a severe problem since we only need to update the model once per month at most. For all these reasons, we suggest the no grouping estimate for practical implementation, as a compromise between flexibility and the plausibility of the assumptions. For this reason, and for technical simplicity, we mainly study the theoretical properties of the no grouping estimate as defined in (6) in what follows.
3 Properties of the Estimates

In this section, we first show that, under the time series setup, if we just use the classic Lasso estimator, the risk bound will depend on the level of time dependence, as in Theorem 3.1. To circumvent this problem, through reweighting over time, our estimate in (6) can still produce an estimator which, as shown in Theorems 3.2 and 3.3, is equivalent to an appropriate oracle. The techniques of the proofs are closely built upon those in Lounici et al. (2009), Bickel et al. (2009), Lounici (2008) and Wang et al. (2007).
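Before the formal statements, here is a toy simulation (entirely ours and purely illustrative) of the phenomenon quantified by Theorem 3.1 below: with a fixed sparse truth, the plain Lasso's estimation error tends to grow as the temporal dependence of the regressors increases, since dependent observations carry less effective information.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
T, P, s = 200, 50, 3
theta = np.zeros(P)
theta[:s] = 1.0                                   # sparse truth, M(theta*) = s

for rho in (0.0, 0.5, 0.95):                      # increasing time dependence
    err = 0.0
    for _ in range(100):
        x = np.empty((T, P))
        x[0] = rng.standard_normal(P)
        for t in range(1, T):                     # AR(1) dynamics in each column
            x[t] = rho * x[t - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal(P)
        y = x @ theta + rng.standard_normal(T)
        theta_hat = Lasso(alpha=0.1, fit_intercept=False).fit(x, y).coef_
        err += np.linalg.norm(theta_hat - theta)
    print(f"rho = {rho}: mean L2 estimation error = {err / 100:.3f}")
```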
3.1 Risk Bounds of the Lasso under Time Dependence

Now we illustrate how the temporal dependence level affects the risk bounds of the Lasso type estimator. For technical simplicity, we consider the univariate AR($P$) (or MA($P$)) model with $P \to \infty$, i.e. $J = 1$ in equations (1) and (2):
\[
e_t = x_{t1} \theta_1 + \ldots + x_{tP} \theta_P + \epsilon_t = x_t^\top \theta + \epsilon_t, \tag{12}
\]
with the regressors $(x_{t1}, \ldots, x_{tP}) = x_t^\top$, the coefficients $(\theta_1, \ldots, \theta_P) = \theta^\top$ and the error term $\epsilon_t$. We also define $x$ as the $T \times P$ matrix with $(t,p)$th entry $x_{tp}$, and $e = (e_1, \ldots, e_T)^\top$. Taking $x_{tp} = e_{t-p}$ (or $\epsilon_{t-p}$) corresponds to the AR($P$) (or MA($P$)) model. In this situation, since there are no "others' lags" ($J = 1$) and $\theta$ is a vector, the standard Lasso estimator $\hat{\theta}$ is defined through:
\[
\min_\theta (T-P)^{-1} \sum_{t=P+1}^T (e_t - x_t^\top \theta)^2 + 2\lambda \|\theta\|_1. \tag{13}
\]
We assume there is a true coefficient vector $\theta^*$ for (12) and define $M(\theta^*) = \sum_{p=1}^P 1(\theta^*_p \ne 0)$ and $M(\hat{\theta}) = \sum_{p=1}^P 1(\hat{\theta}_p \ne 0)$. Before moving on, we recall the fractional cover theory based definitions, introduced by Janson (2004), which can be viewed as a generalization of $m$-dependency. Given a set $\mathcal{T}$ and random variables $V_t$, $t \in \mathcal{T}$, we say:

• A subset $\mathcal{T}'$ of $\mathcal{T}$ is independent if the corresponding random variables $\{V_t\}_{t \in \mathcal{T}'}$ are independent.

• A family $\{\mathcal{T}_j\}_j$ of subsets of $\mathcal{T}$ is a cover of $\mathcal{T}$ if $\bigcup_j \mathcal{T}_j = \mathcal{T}$.

• A family $\{(\mathcal{T}_j, w_j)\}_j$ of pairs $(\mathcal{T}_j, w_j)$, where $\mathcal{T}_j \subseteq \mathcal{T}$ and $w_j \in [0, 1]$, is a fractional cover of $\mathcal{T}$ if $\sum_j w_j 1_{\mathcal{T}_j} \ge 1_{\mathcal{T}}$, i.e. $\sum_{j : t \in \mathcal{T}_j} w_j \ge 1$ for each $t \in \mathcal{T}$.

• A (fractional) cover is proper if each set $\mathcal{T}_j$ in it is independent.

• $\mathcal{X}(\mathcal{T})$ is the size of the smallest proper cover of $\mathcal{T}$, i.e. the smallest $m$ such that $\mathcal{T}$ is the union of $m$ independent subsets.

• $\mathcal{X}^*(\mathcal{T})$ is the minimum of $\sum_j w_j$ over all proper fractional covers $\{(\mathcal{T}_j, w_j)\}_j$.

Notice that, in the spirit of these notations, $\mathcal{X}(\mathcal{T})$ and $\mathcal{X}^*(\mathcal{T})$ depend not only on $\mathcal{T}$ but also on the family $\{V_t\}_{t \in \mathcal{T}}$. Further note that $\mathcal{X}^*(\mathcal{T}) \ge 1$ (when $\mathcal{T} \ne \emptyset$) and that $\mathcal{X}^*(\mathcal{T}) = 1$ if and only if the variables $V_t$, $t \in \mathcal{T}$, are independent, i.e. $\mathcal{X}^*(\mathcal{T})$ is a measure of the dependence structure of $\{V_t\}_{t \in \mathcal{T}}$. For example, if $V_t$ depends only on $V_{t-1}, \ldots, V_{t-k}$ but is independent of all $\{V_s\}_{s < t-k}$, then $\mathcal{X}^*(\mathcal{T}) \le k + 1$.

ASSUMPTION 3.1 With high probability $q$, $\forall p$, the random variables $x_{tp}$ and $\epsilon_t$ satisfy $|\epsilon_t x_{tp}| \le b_t$ and $T^{-1} \sum_{t=1}^T b_t^2 \le C'$ for some constants $b_t, C' > 0$, $t = 1, \ldots, T$.

ASSUMPTION 3.2 There exists a positive number $\kappa = \kappa(s)$ such that
\[
\min \Big\{ \frac{|x\Delta|_2}{\sqrt{T} |\Delta_{\mathcal{R}}|_2} : |\mathcal{R}| \le s, \ \Delta \in \mathbb{R}^P \setminus \{0\}, \ \|\Delta_{\mathcal{R}^c}\|_1 \le 3 \|\Delta_{\mathcal{R}}\|_1 \Big\} \ge \kappa,
\]
where $\mathcal{R}^c$ denotes the complement of the set of indices $\mathcal{R}$, and $\Delta_{\mathcal{R}}$ denotes the vector formed by the coordinates of the vector $\Delta$ w.r.t. the index set $\mathcal{R}$.

Assumption 3.2 is the restricted eigenvalue assumption from Bickel et al. (2009), which is essentially a restriction on the eigenvalues of the Gram matrix $\Psi_T = x^\top x / T$ as a function of the sparsity $s$. To see this, recall the definitions of restricted eigenvalues and restricted correlations in Bickel et al.
(2009):
\[
\psi_{\min}(u) = \min_{z \in \mathbb{R}^P : 1 \le M(z) \le u} \frac{z^\top \Psi_T z}{|z|_2^2}, \qquad
\psi_{\max}(u) = \max_{z \in \mathbb{R}^P : 1 \le M(z) \le u} \frac{z^\top \Psi_T z}{|z|_2^2}, \qquad 1 \le u \le P,
\]
\[
\psi_{m_1, m_2} = \max \Big\{ \frac{f_1^\top x_{I_1}^\top x_{I_2} f_2}{T |f_1|_2 |f_2|_2} : I_1 \cap I_2 = \emptyset, \ |I_i| \le m_i, \ f_i \in \mathbb{R}^{I_i} \setminus \{0\}, \ i = 1, 2 \Big\},
\]
where $|I_i|$ denotes the cardinality of $I_i$ and $x_{I_i}$ is the $T \times |I_i|$ submatrix of $x$ obtained by removing from $x$ the columns that do not correspond to the indices in $I_i$. Lemma 4.1 in Bickel et al. (2009) shows that if the restricted eigenvalue of the Gram matrix $\Psi_T$ satisfies $\psi_{\min}(2s) > 3\psi_{s,2s}$ for some integer $1 \le s \le P/2$, Assumption 3.2 holds.

We can now state our first main result.

THEOREM 3.1 Consider the model (12) with $P \ge 1$, $T \ge 1$ and random variables $V_t = \epsilon_t x_{tp}$, $t \in \mathcal{T} = \{1, \ldots, T\}$. Let the random variables $x_{tp}$ and $\epsilon_t$ satisfy Assumption 3.1 for any $p$, let all diagonal elements of the matrix $x^\top x / T$ equal 1, and let $M(\theta^*) \le s$. Furthermore, let $\kappa$ be defined as in Assumption 3.2, and let $\phi_{\max}$ be the maximum eigenvalue of the matrix $x^\top x / T$. Let $\lambda = \sqrt{\mathcal{X}^*(\mathcal{T}) (\log P) \delta' C' / T}$, $\delta' > 1$. Then with probability at least $q(1 - P^{1-\delta'})$, for any solution $\hat{\theta}$ of (13), we have:
\[
T^{-1} \|x(\hat{\theta} - \theta^*)\|_2^2 \le 16 s \, \mathcal{X}^*(\mathcal{T}) (\log P) \delta' C' / (T \kappa^2), \tag{14}
\]
\[
\|\hat{\theta} - \theta^*\|_1 \le 16 s \sqrt{\mathcal{X}^*(\mathcal{T}) (\log P) \delta' C' / T} \,\big/\, \kappa^2, \tag{15}
\]
\[
M(\hat{\theta}) \le 64 \phi_{\max} s / \kappa^2. \tag{16}
\]
Before explaining the results, we would like to discuss some related results first. Suppose $x$ in (12) has full rank $P$ and $\epsilon_t$ is N$(0, \sigma^2)$. Consider the least squares estimate ($P \le T$) $\hat{\theta}^{OLS} = (x^\top x)^{-1} x^\top e$. Then from standard least squares theory, we know that the prediction error $\|x(\hat{\theta}^{OLS} - \theta^*)\|_2^2 / \sigma^2$ is $\chi^2_P$-distributed, i.e.
\[
\frac{\mathrm{E} \|x(\hat{\theta}^{OLS} - \theta^*)\|_2^2}{T} = \frac{\sigma^2}{T} P. \tag{17}
\]
In the sparse situation, if $\epsilon_t$ is N$(0, \sigma^2)$ (different from our case), Corollary 6.2 of Bühlmann and van de Geer (2011) shows that the Lasso estimate obeys the following oracle inequality:
\[
\frac{\|x(\hat{\theta}^{Lasso} - \theta^*)\|_2^2}{T} \le C_1 \sigma^2 \frac{\log P}{T} M(\theta^*) \tag{18}
\]
with large probability and some constant $C_1$. The additional $\log P$ factor here can be seen as the price to pay for not knowing the set $\{\theta^*_p : \theta^*_p \ne 0\}$, Donoho and Johnstone (1994).

Similar to the i.i.d. Gaussian situation discussed above, the term $s(\log P)\delta'$ in (14) can be interpreted as the price to pay for not knowing the set $\{\theta^*_p : \theta^*_p \ne 0\}$. Here we have $(\log P)\delta'$ instead of $\log P$ because we deviate from the typical i.i.d. Gaussian situation and establish the result under the more general Assumption 3.1, which can be thought of as a "finite second moment" condition; the $\delta'$ term is the price to pay for this deviation.

For the case $x_{tp} = \epsilon_{t-p}$ (MA($P$) model), by the definition of $\mathcal{X}^*(\mathcal{T})$ and $V_t$, if $k^* = \max\{p : \theta^*_p \ne 0, \ \theta^*_{p+1} = \theta^*_{p+2} = \ldots = \theta^*_P = 0\}$, we have $\mathcal{X}^*(\mathcal{T}) = k^* + 1$ (if $k^* + 1 < T$). The RHS of (14) becomes $16 s (k^* + 1)(\log P)\delta' C' / (T \kappa^2)$. For the case $x_{tp} = e_{t-p}$ (AR($P$) model), which is equivalent to MA($\infty$) with $\mathcal{X}^*(\mathcal{T}) \le T$, the RHS of (14) becomes $16 s (\log P)\delta' C' / \kappa^2$. Thus $\mathcal{X}^*(\mathcal{T})$ can be interpreted as a measure of how many past lags $V_t$ depends on.
Additionally, when the time dependence level increases, by the definition of the Gram matrix, $\kappa$ will decrease, since it characterizes how strongly $x_{t1}, \ldots, x_{tP}$ depend on each other. Still using the MA($k^*$) example considered above, $\kappa$ can be thought of as a measure of how strongly $e_t$ depends on $e_{t-1}, \ldots, e_{t-k^*}$, which complements the measure $\mathcal{X}^*(\mathcal{T})$ of the time dependence level. In both cases (MA($P$) or AR($P$)), Theorem 3.1 states that if we use the standard Lasso estimate directly for time series, the bounds get larger when the dependence level ($\mathcal{X}^*(\mathcal{T})$) increases and $\kappa$ decreases. In other words, the bound is minimized when $\mathcal{X}^*(\mathcal{T}) = 1$, which corresponds to the independent situation in the literature. When $\mathcal{X}^*(\mathcal{T})$ reaches $T$, it is offset by the $T$ in the denominator, so the risk bound does not decrease when $T$ increases. The intuition behind this is clear: if the dependence level is strong, then the additional information brought by a "new" observation is effectively less, i.e. the overall information from $\{V_t\}_{t=1}^T$ is correspondingly less, which results in larger risk bounds for the estimates. Consequently, we expect the selection not to be stable and to be very sensitive to minor perturbations of the data. In this sense, we do not expect variable selection to provide results that lead to clearer economic interpretation than principal components or Ridge regression.

3.2 Oracle Properties

To study the oracle properties of the estimator in (6), we assume that there is a correct model with regression and autoregression coefficients $\beta^* = (c^{*\top}, d^{*\top})^\top = (c^*_1, \ldots, c^*_{P(J-1)}, d^*_1, \ldots, d^*_P)^\top$. Furthermore, we assume that there are a total of $p_0 \le P(J-1)$ non-zero others-lag coefficients and $q_0 \le P$ non-zero own-lag coefficients. For convenience, we define $S_1 = \{1 \le i \le P(J-1) : c^*_i \ne 0\}$, $\hat{S}_1 = \{1 \le i \le P(J-1) : \hat{c}_i \ne 0\}$, $S_2 = \{1 \le p \le P : d^*_p \ne 0\}$ and $\hat{S}_2 = \{1 \le p \le P : \hat{d}_p \ne 0\}$. Then, the sets $S_1$ and $S_2$ contain the indices of the significant others-lag and own-lag coefficients respectively, and their complements $S_1^c$ and $S_2^c$ contain the indices of the insignificant coefficients. Next, let $c^*_{S_1}$ denote the $p_0 \times 1$ vector of significant others-lag coefficients, with $\hat{c}_{S_1}$ being its associated estimator. Moreover, other related parameters and their corresponding estimators are analogously defined (e.g. $c^*_{S_1^c}$, $\hat{c}_{S_1^c}$, $d^*_{S_2}$, $\hat{d}_{S_2}$, $d^*_{S_2^c}$, $\hat{d}_{S_2^c}$). Finally, let $\beta^*_1 = (c^{*\prime}_{S_1}, d^{*\prime}_{S_2})'$ and $\beta^*_2 = (c^{*\prime}_{S_1^c}, d^{*\prime}_{S_2^c})'$ with corresponding estimates $\hat{\beta}_1$, $\hat{\beta}_2$. To facilitate the study, we also introduce the notations
\[
a_T \overset{\mathrm{def}}{=} \max(\lambda_i, \gamma_p : i \in S_1, \ p \in S_2), \qquad
b_T \overset{\mathrm{def}}{=} \min(\lambda_i, \gamma_p : i \in S_1^c, \ p \in S_2^c),
\]
where $\lambda_i$ and $\gamma_p$ are functions of $T$. To investigate the theoretical properties of $\hat{\beta}$, we introduce the following conditions:

A1 The sequence $\{X_t\}$ is independent of $\varepsilon_t$ ($\varepsilon_t \overset{\mathrm{def}}{=} U_{tj}$);

A2 All roots of the polynomial $1 - \sum_{p=1}^P d^*_p z^p$ lie outside the unit circle (a quick numerical check is sketched after this list);

A3 $\varepsilon_t$ has finite fourth-order moment, i.e. $\mathrm{E}(\varepsilon_t^4) < \infty$;

A4 $\forall j$, $Y_{tj}$ (as a component of the covariate $X_t$) is strictly stationary and ergodic with finite second-order moment (i.e. $\mathrm{E}\|Y_{tj}\|^2 < \infty$).
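Condition (A2) is easy to verify numerically for a candidate $d^*$; a minimal sketch (ours) using numpy's polynomial roots:

```python
import numpy as np

def satisfies_A2(d, tol=1e-10):
    """Check condition (A2): all roots of 1 - d_1 z - ... - d_P z^P lie
    strictly outside the unit circle."""
    # np.roots expects coefficients from the highest power down:
    # -d_P z^P - ... - d_1 z + 1
    coeffs = np.concatenate((-np.asarray(d, dtype=float)[::-1], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1 + tol))

print(satisfies_A2([0.5]))   # AR(1) with d_1 = 0.5: root at z = 2 -> True
print(satisfies_A2([1.1]))   # explosive case: root inside unit circle -> False
```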
The technical conditions above are typically used to ensure the $\sqrt{T}$-consistency and asymptotic normality of the unpenalized least squares estimator.

We also rewrite equation (6) as
\[
\min_\beta Q_T(\beta) = \min_\beta \sum_{t=P+1}^T (y_t - X_t^\top \beta)^2 + 2T \sum_{i=1}^{P(J-1)} \lambda_i |c_i| + 2T \sum_{p=1}^P \gamma_p |d_p| \tag{19}
\]
by multiplying by $2(T-P)$ and writing $T-P$ as $T$ without confusion. Define $\sum_{t=P+1}^T (y_t - X_t^\top \beta)^2$ as $L_T(\beta)$, let $x_t$ collect the regressors corresponding to the coefficients $c$ (the others' lags), and let $z_t^\top = X_t^\top \setminus x_t^\top$ collect those corresponding to $d$ (the own lags). Then $X_t^\top \beta = x_t^\top c + z_t^\top d$.

We first investigate the consistency of the estimator of (6).

LEMMA 3.1 Assume that $a_T = O(1)$ as $T \to \infty$. Then, under conditions (A1)-(A4), there exists a local minimizer $\hat{\beta}$ of (6) s.t. $\|\hat{\beta} - \beta^*\| = O_p(T^{-1/2} + a_T)$.

The proof is given in the appendix. Lemma 3.1 implies that, if the tuning parameters associated with the significant regressors converge to 0 at a speed faster than $T^{-1/2}$, then there is a local minimizer of (6) which is $\sqrt{T}$-consistent. Next, we show that, if the tuning parameters associated with the insignificant regressors shrink to 0 more slowly than $T^{-1/2}$, then their coefficients are estimated exactly as 0 with probability tending to 1.

THEOREM 3.2 (Consistency of Selection) Assume that $b_T \sqrt{T} \to \infty$ and $\|\hat{\beta} - \beta^*\| = O_p(T^{-1/2})$; then $\mathrm{P}(\hat{\beta}_2 = 0) \to 1$.

Theorem 3.2 shows that our method can produce a sparse solution for the insignificant coefficients consistently. Furthermore, this theorem, together with Lemma 3.1, indicates that the $\sqrt{T}$-consistent estimator must satisfy $\mathrm{P}(\hat{\beta}_2 = 0) \to 1$.

THEOREM 3.3 Assume that $a_T \sqrt{T} \to 0$ and $b_T \sqrt{T} \to \infty$. Then, under conditions (A1)-(A4), the "nonzero" component $\hat{\beta}_1$ of the local minimizer $\hat{\beta}$ in Lemma 3.1 satisfies
\[
\sqrt{T} (\hat{\beta}_1 - \beta^*_1) \overset{d}{\to} \mathrm{N}(0, \Sigma_1^{-1}), \qquad \mathrm{P}(\hat{S}_1 = S_1) \to 1, \qquad \mathrm{P}(\hat{S}_2 = S_2) \to 1,
\]
where $\Sigma_1$ is the submatrix of $\Sigma$ corresponding to $\beta^*_1$, $\Sigma = \mathrm{diag}(B, C)$, $B = \mathrm{E}(x_t x_t^\top)$ and $C = \mathrm{E}(z_t z_t^\top)$.

Theorem 3.3 implies that, if the tuning parameters satisfy the conditions $a_T \sqrt{T} \to 0$ and $b_T \sqrt{T} \to \infty$, then, asymptotically, the resulting estimator can be as efficient as the oracle estimator; moreover, our method recovers the significant coefficients consistently. Since consistency of selection is established here, if we use ordinary least squares estimation for the selected variables, we can avoid the log term in (18) and (14).

4 Macroeconomic Forecasting

We use the dataset of Stock and Watson (2005a) for illustration. This dataset contains 131 monthly macro indicators covering a broad range of categories including income, industrial production, capacity, employment and unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders, stock prices, interest rates for different maturities, exchange rates, money aggregates and so on. The time span is from January 1959 to December 2003. We apply logarithms to most of the series, except those already expressed in rates. The variables of special interest include a measure of real economic activity, a measure of prices and a monetary policy instrument. As in Christiano et al. (1999), we use employment as the indicator of real economic activity, measured by the number of employees on non-farm payrolls (EMPL).
The level of prices is measured by the consumer price index (CPI) and the monetary policy instrument is the Federal Funds Rate (FFR). All 131 variables' lags are used as regressors. As discussed earlier, because of the stationarity requirement of our method, the series are transformed to obtain stationarity, so that many of the series are (2nd order) differences of the raw data series (or of the logarithm of the raw series).

We evaluate the forecast performance over the period from $T_0$ = January 1970 to $T_1$ = December 2003 and for forecast horizons up to one year ($h = 1, 3, 6, 12$), with $P = 1, 4, 7, 13, 25$. The resulting performance is summarized in Table 1, with comparisons to the results of Bańbura et al. (2010) listed under the "BVAR" column.

Table 1: RMSFE of EMPL, CPI and FFR w.r.t. different $h$ and $P$.

As we can see, unlike the information criteria based lag selection techniques, the RMSFE is very robust to the initial choice of $P$, which primarily benefits from the "re-weighting over lags" technique ($p^{-\alpha}$) used before. For this specific data set, $P = 1$ seems enough. But in general, since we never know the true number of lags, we can include a large enough $P$ at the beginning to allow flexibility without worrying about over fitting. Moreover, for the one-step-ahead forecast, our method outperforms for EMPL, CPI and FFR, while when $h \ge 3$, it outperforms mainly for EMPL and FFR, especially for the latter. This results from the fact that different time series might have quite different behaviors, so if we just have a "universal" penalty parameter for all of them, as in Bańbura et al. (2010), the corresponding forecasting performance might not be optimized. For reference purposes, we also provide the factor-augmented vector autoregressive results of Bernanke et al. (2005) in Figure 3.
Figure 3: Results of Bernanke et al. (2005) and Bańbura et al. (2010): relative MSFE (against the random walk with drift) of the FAVAR with 1 or 3 factors (lags fixed to 13 or chosen by the BIC, or with Bayesian shrinkage) and of the large BVAR, for EMPL, CPI and FFR over 1971-2003.

5 Conclusions and Discussion

To summarize, in this article we first show that, under the time series setup, if we still use the classic Lasso type estimator, the risk bound increases as the time dependence level increases, whereas our method can still achieve consistency of variable selection under such scenarios; second, our method is able to do variable selection and lag selection simultaneously, and is rather robust to the initial choice of lags; third, we allow individualized weights between own and others' lags. All of these are confirmed by the real forecasting performance in the previous section and come at a low computational cost.

Some issues we do not explore here include nonstationarity, rank tests, cointegration and causality tests. For a typical macroeconomic data set, the nonstationarity comes from seasonality, the business cycle and economic developments. In the spirit of Song et al. (2010), this motivates us to add a nonstationary component $Z_t \Gamma$ to equation (2) as below:
\[
Y_t = Z_t \Gamma + X_t B + U_t,
\]
where $Z_t = (Z_1(t), \ldots, Z_R(t))^\top$ contains $R$ basis functions of time consisting of Fourier series with different frequencies and segment-by-segment orthonormal polynomials, with corresponding $R \times J$ coefficient matrix $\Gamma$, and $X_t$, $B$ are the same as in equation (1). Studying this extended model deserves further investigation and will be presented in a separate paper. If we want to consider rank tests, cointegration and causality tests, what we need for this high dimensional time series is not the univariate versions, but high dimensional simultaneous tests, which might be much more difficult.

Heteroscedasticity with Cross-section Correlations. We consider $\mathrm{Cov}(U_t) = \Sigma$ with nonzero off-diagonal entries in $\Sigma$. Assume that we have a consistent estimate $\hat{\Sigma}$ of $\Sigma$ (which is another challenging task since $\Sigma$ is a $J \times J$ matrix) with Cholesky decomposition $\hat{\Sigma} = C^\top C$, where $C$ is an upper triangular matrix with inverse $D$ (which is also an upper triangular matrix). Without loss of generality, assume all diagonal entries of $\hat{\Sigma}$, $C$ and $D$ are equal to 1. Transform the original $X_t$ by $D$ to generate $\tilde{X}_t$ ($\tilde{X}_t = X_t D$) s.t. $\mathrm{Cov}(U_t D) = I$. Under this situation, we are no longer selecting the original variables, but linear transformations of them. Thus we must show that this does not affect the inference. We have
\[
\tilde{\beta}_1 \tilde{x}_{t1} + \tilde{\beta}_2 \tilde{x}_{t2} + \ldots + \tilde{\beta}_J \tilde{x}_{tJ}
= \tilde{\beta}_1 x_{t1} + \tilde{\beta}_2 (d_{12} x_{t1} + x_{t2}) + \ldots + \tilde{\beta}_J \Big( \sum_{j=1}^{J-1} d_{jJ} x_{tj} + x_{tJ} \Big)
= \Big( \tilde{\beta}_1 + \sum_{j=2}^J \tilde{\beta}_j d_{1j} \Big) x_{t1} + \Big( \tilde{\beta}_2 + \sum_{j=3}^J \tilde{\beta}_j d_{2j} \Big) x_{t2} + \ldots + \tilde{\beta}_J x_{tJ}.
\]
If the off-diagonal entries of $D$, $\{d_{ij}\}_{i<j}$, are small, the coefficients on the transformed variables stay close to those on the original ones, so the selection results carry over.
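A small sketch of this whitening step (our own illustration under the stated assumptions; the normalization of the diagonal entries to 1 is omitted for brevity):

```python
import numpy as np

def whiten_columns(M, Sigma_hat):
    """Right-multiply each J-column block of M by D = C^{-1}, where
    Sigma_hat = C^T C is the (upper-triangular) Cholesky factorization,
    so that the transformed noise U_t D has identity covariance."""
    C = np.linalg.cholesky(Sigma_hat).T   # upper triangular, Sigma = C^T C
    D = np.linalg.inv(C)                  # also upper triangular
    J = Sigma_hat.shape[0]
    blocks = [M[:, k:k + J] @ D for k in range(0, M.shape[1], J)]
    return np.concatenate(blocks, axis=1)

# toy usage on the lag design X (J*P columns, J per lag):
# X_tilde = whiten_columns(X, Sigma_hat)
```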
Acknowledgements. The main results of this paper were first presented at the annual Winter Meeting of the Econometric Society, Denver, January 2011. We are grateful for seminar participants' many interesting comments on several versions of the paper. In particular, the first author would like to thank Prof. Peter Bickel for sponsoring his stay at the University of California, Berkeley. We would also like to thank Prof. Marta Bańbura, Prof. Domenico Giannone and Prof. Lucrezia Reichlin for sharing the code of the BVAR.

Appendix: Technical Proofs

Proof of Theorem 3.1. The proof of this theorem is based on those of Lemma 3.1 and Theorem 3.1 in Lounici et al. (2009), up to a modification of the bound on $\mathrm{P}(\mathcal{A}^c)$ with the random event
\[
\mathcal{A} = \Big\{ \max_{1 \le p \le P} \Big| \sum_{t=1}^T \varepsilon_t x_{tp} \Big| \le \lambda T \Big\},
\]
where $n$, $M$ and $T$ there are equivalent to $T$, $P$ and 1 here respectively. The intermediate results in the proof of Theorem 3.1 in Lounici et al. (2009) show that
\[
T^{-1} \|x(\hat{\theta} - \theta^*)\|_2^2 \le 16 s \lambda^2 / \kappa^2, \tag{20}
\]
\[
\|\hat{\theta} - \theta^*\|_1 \le 16 s \lambda / \kappa^2, \tag{21}
\]
\[
M(\hat{\theta}) \le 64 \phi_{\max} s / \kappa^2. \tag{22}
\]
We have:
\[
\mathrm{P}(\mathcal{A}^c) = \mathrm{P}\Big( \max_{1 \le p \le P} \Big| \sum_{t=1}^T \varepsilon_t x_{tp} \Big| > \lambda T \Big) \le P \, \mathrm{P}\Big( \Big| \sum_{t=1}^T \varepsilon_t x_{tp} \Big| > \lambda T \Big) \overset{\mathrm{def}}{=} P \, \mathrm{P}\Big( \Big| \sum_{t=1}^T V_t \Big| > \lambda T \Big).
\]
Then, by the (extended) McDiarmid inequality applied to $\{V_t\}_{t=1}^T$ (see Theorem 2.1 of Janson (2004)), we have
\[
\mathrm{P}(\mathcal{A}^c) \le P \exp\Big\{ - \frac{\lambda^2 T^2}{\mathcal{X}^*(\mathcal{T}) \sum_t b_t^2} \Big\} \le P^{1 - \delta'}
\]
with $\lambda = \sqrt{\mathcal{X}^*(\mathcal{T})(\log P) \delta' C' / T}$, $\delta' > 1$, which, together with (20), (21) and (22), leads to (14), (15) and (16). □

Proof of Lemma 3.1. The proof is closely built upon those of Wang et al. (2007). Let $\delta = (u^\top, v^\top)^\top$, $u = (u_1, \ldots, u_{P(J-1)})^\top$, $v = (v_1, \ldots, v_P)^\top$, $\alpha_T = T^{-1/2} + a_T$, and let $\{\beta^* + \alpha_T \delta : \|\delta\| \le e\}$ be the ball around $\beta^*$. Then, for $\|\delta\| = e$, we have
\[
D_T(\delta) = Q_T(\beta^* + \alpha_T \delta) - Q_T(\beta^*)
\ge L_T(\beta^* + \alpha_T \delta) - L_T(\beta^*) + 2T \sum_{i \in S_1} \lambda_i (|c^*_i + \alpha_T u_i| - |c^*_i|) + 2T \sum_{j \in S_2} \gamma_j (|d^*_j + \alpha_T v_j| - |d^*_j|)
\]
\[
\ge L_T(\beta^* + \alpha_T \delta) - L_T(\beta^*) - 2T \alpha_T a_T (\sqrt{p_0} + \sqrt{q_0}) e. \tag{23}
\]
Furthermore, writing $\varepsilon_t = y_t - X_t^\top \beta^*$,
\[
L_T(\beta^* + \alpha_T \delta) - L_T(\beta^*) = \sum_t \{\varepsilon_t - \alpha_T z_t^\top v - \alpha_T u^\top x_t\}^2 - \sum_t \varepsilon_t^2
\]
\[
= \alpha_T^2 \sum_t \{ (z_t^\top v)^2 + u^\top x_t x_t^\top u \} \tag{24}
\]
\[
\quad - 2 \alpha_T \sum_t \varepsilon_t (z_t^\top v + u^\top x_t) \tag{25}
\]
\[
\quad + 2 \alpha_T^2 \sum_t z_t^\top v \, u^\top x_t. \tag{26}
\]
By employing the martingale central limit theorem and the ergodic theorem, we can show that (24) $= T \alpha_T^2 \{\delta^\top \Sigma \delta + o_p(1)\}$, (25) $= \delta^\top O_p(\sqrt{T} \alpha_T)$ and (26) $= T \alpha_T^2 O_p(1)$. Because (24) dominates the terms (25), (26) and $2T \alpha_T a_T (\sqrt{p_0} + \sqrt{q_0}) e$ in equation (23), for any given $\epsilon > 0$ there is a large constant $e$ such that
\[
\mathrm{P}\big[ \inf_{\|\delta\| = e} Q_T(\beta^* + \alpha_T \delta) > Q_T(\beta^*) \big] \ge 1 - \epsilon.
\]
Hence, with probability at least $1 - \epsilon$, there is a local minimizer in the ball $\{\beta^* + \alpha_T \delta : \|\delta\| \le e\}$, Bickel et al. (1998) and Fan and Li (2001). Consequently, there is a local minimizer of $Q_T(\beta)$ such that $\|\hat{\beta} - \beta^*\| = O_p(\alpha_T)$. This completes the proof. □

Proof of Theorem 3.2. The proof is essentially the same as those of Theorem 2 of Zou (2006) and of Wang et al. (2007). For $i \in S_1^c$, assume that there is a local minimizer $\hat{\beta}$ with $\hat{c}_i \ne 0$. By the KKT optimality condition, we have
\[
0 = \frac{\partial L_T(\hat{\beta})}{\partial c_i} + 2T \lambda_i \,\mathrm{sgn}(\hat{c}_i)
= \frac{\partial L_T(\beta^*)}{\partial c_i} + 2T \Sigma_i (\hat{\beta} - \beta^*) \{1 + o_p(1)\} + 2T \lambda_i \,\mathrm{sgn}(\hat{c}_i), \tag{27}
\]
where $\Sigma_i$ denotes the $i$th row of $\Sigma$ and $i \in S_1^c$. By employing the central limit theorem, the first term in equation (27) is of order $O_p(T^{1/2})$. Furthermore, the condition in Theorem 3.2 implies that its second term is also of order $O_p(T^{1/2})$. Both are dominated by $2T \lambda_i$, since $b_T \sqrt{T} \to \infty$. Therefore, the sign of equation (27) is dominated by the sign of $\hat{c}_i$, and thus (27) cannot be equal to 0. Consequently, we must have $\hat{c}_i = 0$ with probability tending to one. Analogously, we can show that $\mathrm{P}(\hat{d}_{S_2^c} = 0) \to 1$. This completes the proof. □

Proof of Theorem 3.3. Applying Lemma 3.1 and Theorem 3.2, we have $\mathrm{P}(\hat{\beta}_2 = 0) \to 1$. Hence, the minimizer of $Q_T(\beta)$ is the same as that of $Q_T(\beta_1)$ with probability tending to 1. This implies that the estimator $\hat{\beta}_1$ satisfies the equation
\[
\frac{\partial Q_T(\beta_1)}{\partial \beta_1} \Big|_{\beta_1 = \hat{\beta}_1} = 0. \tag{28}
\]
According to Lemma 3.1, $\hat{\beta}_1$ is $\sqrt{T}$-consistent. Thus, the Taylor series expansion of equation (28) yields
\[
0 = \frac{1}{\sqrt{T}} \frac{\partial L_T(\hat{\beta}_1)}{\partial \beta_1} + 2\sqrt{T} g(\hat{\beta}_1)
= \frac{1}{\sqrt{T}} \frac{\partial L_T(\beta^*_1)}{\partial \beta_1} + 2\sqrt{T} g(\beta^*_1) + 2 \Sigma_1 \sqrt{T} (\hat{\beta}_1 - \beta^*_1) + o_p(1),
\]
where $g$ is the first-order derivative of the penalty function $\sum_{i \in S_1} \lambda_i |c_i| + \sum_{j \in S_2} \gamma_j |d_j|$, and $g(\hat{\beta}_1) = g(\beta^*_1)$ when $T$ is sufficiently large. Furthermore, it can easily be shown that $\sqrt{T} g(\beta^*_1) = o_p(1)$ (since $a_T \sqrt{T} \to 0$), which implies that
\[
\sqrt{T} (\hat{\beta}_1 - \beta^*_1) = - \frac{\Sigma_1^{-1}}{2\sqrt{T}} \frac{\partial L_T(\beta^*_1)}{\partial \beta_1} + o_p(1) \overset{d}{\to} \mathrm{N}(0, \Sigma_1^{-1}).
\]
It remains to show $\mathrm{P}(\hat{S}_1 = S_1) \to 1$ and $\mathrm{P}(\hat{S}_2 = S_2) \to 1$. $\forall i \in S_1$ and $p \in S_2$, the asymptotic normality result indicates that $\hat{c}_i \overset{p}{\to} c^*_i$ and $\hat{d}_p \overset{p}{\to} d^*_p$, where $\overset{p}{\to}$ stands for convergence in probability. Thus $\mathrm{P}(i \in \hat{S}_1) \to 1$ and $\mathrm{P}(p \in \hat{S}_2) \to 1$. It suffices to show that $\forall i' \notin S_1$ and $p' \notin S_2$, $\mathrm{P}(i' \in \hat{S}_1) \to 0$ and $\mathrm{P}(p' \in \hat{S}_2) \to 0$, which has been shown by Theorem 3.2. This completes the proof. □