Lasso Inference for High-Dimensional Time Series
Robert Adamek, Stephan Smeekes, Ines Wilms
Department of Quantitative Economics, Maastricht University
July 22, 2020
Abstract
The desparsified lasso is a high-dimensional estimation method which provides uniformly valid inference. We extend this method to a time series setting under Near-Epoch Dependence (NED) assumptions allowing for non-Gaussian, serially correlated and heteroskedastic processes, where the number of regressors can possibly grow faster than the time dimension. We first derive an oracle inequality for the (regular) lasso, relaxing the commonly made exact sparsity assumption to a weaker alternative, which permits many small but non-zero parameters. The weak sparsity coupled with the NED assumption means this inequality can also be applied to the (inherently misspecified) nodewise regressions performed in the desparsified lasso. This allows us to establish the uniform asymptotic normality of the desparsified lasso under general conditions. Additionally, we show consistency of a long-run variance estimator, thus providing a complete set of tools for performing inference in high-dimensional linear time series models. Finally, we perform a simulation exercise to demonstrate the small sample properties of the desparsified lasso in common time series settings.
Keywords: honest inference, lasso, time series, high-dimensional data
JEL codes: C22, C55
In this paper we propose methods for performing uniformly valid inference on high-dimensional time series regression models. Specifically, we establish the uniform asymptotic normality of the desparsified lasso method (van de Geer et al., 2014) under very general conditions, thereby allowing for inference in high-dimensional time series settings that encompass many of the settings typically encountered in econometric applications. That is, we establish validity for potentially misspecified time series models, where the regressors and errors may exhibit serial dependence, heteroskedasticity and fat tails. In addition, as part of our analysis we derive new oracle inequalities for the lasso (Tibshirani, 1996), on which the desparsified lasso is based.∗

∗ The first and second author were financially supported by the Netherlands Organization for Scientific Research (NWO) under grant number 452-17-010. The third author was supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 832671. Previous versions of this paper were presented at CFE-CM Statistics 2019 and NESG 2020. We gratefully acknowledge the comments by participants at these conferences. In addition, we thank Etienne Wijler for helpful discussions. All remaining errors are our own.

Although traditionally, approaches to high-dimensionality in econometric time series have been dominated by factor models (cf. Bai and Ng, 2008; Stock and Watson, 2011), shrinkage methods have rapidly been gaining ground. Unlike factor models, where dimensionality is reduced by assuming common structures underlying the regressors, shrinkage methods assume a certain structure on the parameter vector. Typically, sparsity is assumed, where only a small, unknown, subset of the variables is thought to have "significantly non-zero" coefficients, and all the other variables have negligible – or even exactly zero – coefficients. The most prominent among shrinkage methods exploiting sparsity is the lasso proposed by Tibshirani (1996), which adds a penalty on the absolute value of the parameters to the least squares objective function. This penalty ensures that many of the coefficients will be set to zero and thus variable selection is performed, an attractive feature that helps to make the results of a high-dimensional analysis interpretable. Due to this feature, the lasso and its many extensions are now standard tools for high-dimensional analysis (see e.g., Hesterberg et al., 2008; Vidaurre et al., 2013; Hastie et al., 2015, for reviews).

Much effort has been devoted to establishing oracle inequalities for lasso-based methods to guarantee consistency for prediction (e.g., Greenshtein and Ritov, 2004; Bühlmann, 2006) and estimation of a high-dimensional parameter (e.g., Bunea et al., 2007; Zhang and Huang, 2008; Bickel et al., 2009; Meinshausen and Yu, 2009; Huang et al., 2008). While most of these advances have been made in the IID framework, early extensions of lasso-based methods to the time series case can be found in Wang et al. (2007) and Hsu et al. (2008). These authors, however, only consider the case where the number of variables is smaller than the sample size.
Various papers (e.g., Nardi and Rinaldo, 2011; Kock and Callot, 2015; Basu and Michailidis, 2015) let the number of variables increase with the sample size, but often require restrictive assumptions (for instance, Gaussianity) on the error process when investigating theoretical properties of lasso-based estimators in time series models. Exceptions are Medeiros and Mendes (2016), Masini et al. (2019) and Wong et al. (2020). Medeiros and Mendes (2016) consider the adaptive lasso for sparse, high-dimensional time series models and show that it is model selection consistent and has the oracle property, even when the errors are non-Gaussian and conditionally heteroskedastic. Masini et al. (2019) derive consistency properties of lasso estimation of high-dimensional approximately sparse vector autoregressions for a class of potentially fat tailed and serially dependent errors, which encompass many multivariate volatility models. Wong et al. (2020) consider sparse, potentially misspecified, vector autoregressions estimated by the lasso and rely on mixing assumptions to derive nonasymptotic inequalities for the estimation and prediction error of the lasso for sub-Weibull random vectors.

While one of the attractive features of lasso-type methods is their ability to perform variable selection, this also causes serious issues when performing inference on the estimated parameters. In particular, performing inference on a (data-driven) selected model, while ignoring the selection, causes the inference to be invalid. This has been discussed by, among others, Leeb and Pötscher (2005) in the general context of model selection and Leeb and Pötscher (2008) for shrinkage estimators. As a consequence, the recent statistical literature has seen a surge in the development of so-called post-selection inference methods that circumvent the problem induced by model selection. In particular, many articles on selective inference have appeared in recent years (see e.g., Fithian et al., 2015; Lockhart et al., 2014; Lee et al., 2016; Taylor and Tibshirani, 2018) where inference is performed conditional on the selected model. However, while conceptually appealing, the derivation of conditional probabilities requires "well-behaved", typically IID, data, and extensions to econometric time series settings appear difficult. Recently, Tian and Taylor (2017) and Tibshirani et al. (2018) have considered asymptotic and bootstrap extensions of the selective approach which alleviate some strict conditions, such as membership of the exponential family, but still require IIDness.

An alternative approach is developed by Berk et al. (2013), who consider inference simultaneously over all possible models. Bachoc et al. (2016, 2019) extend their approach to allow for more general processes, but the approach is computationally very demanding. Moreover, both the selective and simultaneous approaches share the feature that their inference target is model-dependent; in linear models, the target is the best linear prediction coefficients given only the selected coefficients. As such, these methods "assume away" omitted variable bias, which is one of the most important sources of invalidity of inference after selection (Leeb et al., 2015).
This means that no structural interpretation can be given to the inferential results, which limits their use for many econometric applications.

On the other hand, methods have been developed that do allow for inference on true, structural, parameters, based on the idea of orthogonalizing the estimation of the parameter of interest to the estimation (and potential incorrect selection) of the other parameters. Belloni et al. (2014) and Chernozhukov et al. (2015) propose a post-double-selection approach that uses a Frisch-Waugh partialling out strategy to achieve this orthogonalization by selecting important covariates in initial selection steps on both the dependent variable and the variable of interest, and show this approach yields uniformly valid and standard normal inference for independent data. In a related approach, Javanmard and Montanari (2014), van de Geer et al. (2014) and Zhang and Zhang (2014) introduce debiased or desparsified versions of the lasso that achieve uniform validity based on similar principles for IID Gaussian data. Extensions to the time series case include Chernozhukov et al. (2019), who provide desparsified simultaneous inference on the parameters in a high-dimensional regression model allowing for temporal and cross-sectional dependency in covariates and error processes; Krampe et al. (2018), who introduce bootstrap-based inference for autoregressive time series models based on the desparsification idea; and Hecq et al. (2019), who use the post-double-selection procedure of Belloni et al. (2014) for constructing uniformly valid Granger causality tests in high-dimensional VAR models.

In this paper, we contribute to the literature on shrinkage methods for high-dimensional time series models by providing novel theoretical results for both point estimation and inference via the desparsified lasso. We consider a very general time series framework where the regressors and error terms are allowed to be non-Gaussian, serially correlated and heteroskedastic, and the number of variables can grow faster than the time dimension. Moreover, our assumptions allow for both correctly specified and misspecified models, thus providing results relevant for structural interpretations if the overall model is specified correctly, but not limited to this.

We derive oracle inequalities for the lasso in high-dimensional, linear time series models under mixingale assumptions and a weak sparsity assumption on the parameter vector. Our setting generalizes the one of Medeiros and Mendes (2016), who require a martingale difference sequence assumption – and hence correct specification – on the error process. Moreover, we relax the traditional sparsity assumption to allow for weak sparsity, thereby recognizing that the true parameters are likely not exactly zero. The oracle inequalities are used to establish estimation and prediction consistency even when the number of parameters grows faster than the sample size.

We extend the oracle inequalities to the nodewise regressions performed in the desparsified lasso, where each regressor (on which inference is performed) is regressed on all other regressors. Note that, contrary to the setting with independence over time, these nodewise regressions are inherently misspecified in dynamic models with temporal dependence. As such, our oracle inequalities are specifically derived under potential misspecification. We then establish the asymptotic normality of the desparsified lasso under general conditions. As such, we ensure uniformly valid inference over the class of weakly sparse models.
This result is accompanied by a consistent estimator for the long-run variance, thereby providing a complete set of tools for performing inference in high-dimensional, linear time series models. As such, our theoretical results accommodate various financial and macroeconomic applications encountered by applied researchers.

The remainder of this paper is structured as follows. Section 2 introduces the time series setting and its assumptions. In Section 3, we derive an oracle inequality for the lasso (Theorem 1). In Section 4, we introduce further assumptions, derive a central limit theorem for the desparsified lasso estimator (Theorem 2) and present a consistent long-run covariance estimator (Theorem 3). Section 5 contains a simulation study examining the small sample performance of the desparsified lasso, and Section 6 concludes. The main proofs and preliminary lemmas needed for Section 3 are contained in Appendix A, while Appendix B contains the results and proofs for Section 4. Appendix C contains supplementary material.

A word on notation. For any $N$-dimensional vector $x$, $\|x\|_r = \left(\sum_{i=1}^{N} |x_i|^r\right)^{1/r}$ denotes the $L_r$-norm. The $L_\infty$-vector norm is denoted by $\|x\|_\infty = \max_i |x_i|$; for any matrix $X$, we denote $\|X\|_\infty = \max_{i,j} |X_{i,j}|$. We use $\xrightarrow{p}$ and $\xrightarrow{d}$ to denote convergence in probability and distribution respectively. Depending on the context, $\sim$ denotes equivalence in order of magnitude of sequences, or equivalence in distribution. We frequently make use of arbitrary positive finite constants $C$ (or its sub-indexed version $C_i$) whose values may change from line to line throughout the paper, but they are always independent of the time and cross-sectional dimensions. Similarly, generic sequences converging to zero as $T \to \infty$ are denoted by $\eta_T$ (or its sub-indexed version $\eta_{T,i}$). We say a sequence $\eta_T$ is of size $-x$ if $\eta_T = O(T^{-x-\varepsilon})$ for some $\varepsilon > 0$.

Consider the linear model
$$y_t = x_t' \beta + u_t, \qquad t = 1, \dots, T, \qquad (1)$$
where $x_t = (x_{1,t}, \dots, x_{N,t})'$ is an $N \times 1$ vector of explanatory variables, $\beta$ is an $N \times 1$ parameter vector, and $u_t$ is an error term. Throughout the paper, we examine the high-dimensional time series model where $N$ can be larger than $T$. We impose the following assumptions on the processes $\{x_t\}$ and $\{u_t\}$.
Assumption 1. Let $z_t = (x_t', u_t)'$. For some $m > 1$ and $c > 0$, assume that

(a) $\{z_t\}$ is a weakly stationary process with $\mathbb{E}[u_t] = 0$, $\mathbb{E}[u_t x_{j,t}] = 0$, and $\mathbb{E}|z_{j,t}|^{2(m+c)} \le C$ for all $j = 1, \dots, N+1$.

(b) Let $s_{T,t}$ denote a $k(T)$-dimensional triangular array that is $\alpha$-mixing of size $-\frac{m(m+c)}{c}$, with $\sigma$-field $\mathcal{F}_t^s := \sigma\{s_{T,t}, s_{T,t-1}, \dots\}$, such that $z_t$ is $\mathcal{F}_t^s$-measurable. For all $j = 1, \dots, N+1$, the process $\{z_{j,t}\}$ is $L_{2m}$-near-epoch-dependent (NED) on $s_{T,t}$ of size $-1$.

One may interpret $s_{T,t}$ in Assumption 1(b) as an underlying shock process driving the regressors and errors in $z_t$, where we assume $z_t$ to depend almost entirely on the "near epoch" of $s_{T,t}$. Since $z_t$ grows asymptotically in dimension, it is natural to let the dimension of $s_{T,t}$ grow with $T$, though this is not theoretically required. Assumption 1 allows for very general forms of dependence including, but not limited to, mixingales, strong mixing processes (McLeish, 1975) and linear processes (Davidson, 2002, Section 14.3).

Under Assumption 1, Model (1) encompasses many time series models that are often encountered in econometric applications, allowing for general forms of serial dependence, conditional heteroskedasticity and dependence among regressors. The NED assumption on $u_t$ allows for misspecified models as well. In particular, it allows one to view (1) as simply the linear projection of $y_t$ on $x_t$, with $\beta$ in that case representing the corresponding best linear projection coefficients. In such a case $\mathbb{E}[u_t] = 0$ and $\mathbb{E}[u_t x_{j,t}] = 0$ hold by construction, and the additional conditions of Assumption 1 can be shown to hold under weak further assumptions. On the other hand, $u_t$ is not likely to be an m.d.s. in that case, such that typical m.d.s. assumptions as used for instance in Medeiros and Mendes (2016) and Masini et al. (2019) do not allow for dynamic misspecification. Wong et al. (2020) also allow for misspecification by allowing for mixing errors, which is a subset of the error processes allowed here. As will be explained later, allowing for misspecified dynamics is crucial for developing the theory for the desparsified lasso. We further elaborate on misspecification in Example 3, after we present two examples of correctly specified common econometric time series DGPs.

Example 1 (ARDL model with GARCH errors). Consider the autoregressive distributed lag (ARDL) model with GARCH errors
$$y_t = \sum_{i=1}^{p} \rho_i y_{t-i} + \sum_{i=0}^{q} \theta_i' w_{t-i} + u_t = x_t'\beta + u_t, \qquad u_t = \sqrt{h_t}\,\varepsilon_t, \quad \varepsilon_t \sim \text{IID}(0,1), \qquad h_t = \pi_0 + \pi_1 h_{t-1} + \pi_2 u_{t-1}^2,$$
where the roots of the lag polynomial $\rho(z) = 1 - \sum_{i=1}^{p} \rho_i z^i$ are outside the unit circle. Take $\varepsilon_t$, $\pi_1$ and $\pi_2$ such that $\mathbb{E}\left[\ln(\pi_2 \varepsilon_t^2 + \pi_1)\right] < 0$, in which case $u_t$ is a strictly stationary geometrically $\beta$-mixing process (Francq and Zakoïan, 2010, Theorem 3.4), and additionally such that $\mathbb{E}\left[|u_t|^{2m}\right] < \infty$ (cf. Francq and Zakoïan, 2010, Example 2.3). Also assume that $w_t$ is stationary and geometrically $\beta$-mixing as well, with finite $2m$ moments. Given the invertibility of the lag polynomial, we may then write $y_t = \rho^{-1}(L) v_t$, where $v_t = \sum_{i=0}^{q} \theta_i' w_{t-i} + u_t$ and the inverse lag polynomial $\rho^{-1}(z)$ has geometrically decaying coefficients. Then it follows directly that $y_t$ is NED on $v_t$, where $v_t$ is strong mixing of size $-\infty$ as its components are geometrically $\beta$-mixing, and the sum inherits the mixing properties. Furthermore, if $\|\theta_i\|_1 \le C$ for all $i = 0, \dots, q$, it follows directly from Minkowski's inequality that $\mathbb{E}|v_t|^{2m} \le C$ and consequently $\mathbb{E}|y_t|^{2m} \le C$. Then $y_t$ is NED of size $-\infty$ on $(w_t, u_t)$, and consequently so is $z_t = (y_{t-1}, w_t', u_t)'$.
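To make Example 1 concrete, the following sketch simulates a small ARDL-GARCH process of this type; the specific parameter values ($p = q = 1$, the $\pi_i$'s, etc.) are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def simulate_ardl_garch(T, rho=0.5, theta=0.3, pi0=0.1, pi1=0.85, pi2=0.1, seed=0):
    """Simulate y_t = rho*y_{t-1} + theta*w_{t-1} + u_t with GARCH(1,1) errors.

    Illustrative parameter values; stationarity requires the root of
    1 - rho*z outside the unit circle and E[ln(pi2*eps^2 + pi1)] < 0.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T)
    w = rng.standard_normal(T)          # exogenous regressor (IID for simplicity)
    h = np.empty(T); u = np.empty(T); y = np.empty(T)
    h[0] = pi0 / (1.0 - pi1 - pi2)      # start at the unconditional variance
    u[0] = np.sqrt(h[0]) * eps[0]
    y[0] = u[0]
    for t in range(1, T):
        h[t] = pi0 + pi1 * h[t - 1] + pi2 * u[t - 1] ** 2
        u[t] = np.sqrt(h[t]) * eps[t]
        y[t] = rho * y[t - 1] + theta * w[t - 1] + u[t]
    return y, w, u

# check the strict-stationarity condition E[ln(pi2*eps^2 + pi1)] < 0 by simulation
eps = np.random.default_rng(1).standard_normal(100_000)
print(np.mean(np.log(0.1 * eps**2 + 0.85)))  # negative => condition holds
```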
Example 2 (Equation-by-equation VAR). Consider the vector autoregressive model
$$y_t = \sum_{i=1}^{p} \Phi_i y_{t-i} + u_t,$$
where $y_t$ is a $K \times 1$ vector and the $K \times K$ matrices $\Phi_i$ satisfy appropriate stationarity conditions. The equivalent equation-by-equation representation is
$$y_{k,t} = \sum_{i=1}^{p} [\Phi_{k,1,i}, \dots, \Phi_{k,K,i}]\, y_{t-i} + u_{k,t} = \left[y_{t-1}', \dots, y_{t-p}'\right] \beta_k + u_{k,t}, \qquad k \in \{1, \dots, K\}.$$
Assuming a well-specified model with $\mathbb{E}\left[u_t \mid y_{t-1}, \dots, y_{t-p}\right] = 0$, the conditions of Assumption 1 are satisfied trivially.

Example 3 (Misspecified AR model). Consider an autoregressive (AR) model of order 2,
$$y_t = \rho_1 y_{t-1} + \rho_2 y_{t-2} + v_t, \qquad v_t \sim \text{IID}(0, 1),$$
where $\mathbb{E}|v_t|^{2m} \le C$ and the roots of $1 - \rho_1 L - \rho_2 L^2$ are outside the unit circle. Define the misspecified model $y_t = \tilde{\rho} y_{t-1} + u_t$, where
$$\tilde{\rho} = \arg\min_\rho \mathbb{E}\left[(y_t - \rho y_{t-1})^2\right] = \frac{\mathbb{E}[y_t y_{t-1}]}{\mathbb{E}[y_{t-1}^2]} = \frac{\rho_1}{1 - \rho_2},$$
and $u_t$ is autocorrelated. An m.d.s. assumption would be inappropriate in this case, as
$$\mathbb{E}\left[u_t \mid \sigma\{y_{t-1}, y_{t-2}, \dots\}\right] = \mathbb{E}\left[y_t - \tilde{\rho} y_{t-1} \mid \sigma\{y_{t-1}, y_{t-2}, \dots\}\right] = \left(\rho_1 - \frac{\rho_1}{1 - \rho_2}\right) y_{t-1} + \rho_2 y_{t-2} \ne 0.$$
However, it can be shown that $(y_{t-1}, u_t)'$ satisfies Assumption 1(b) by considering the moving average representation of $y_t$ and, by extension, of $u_t = y_t - \tilde{\rho} y_{t-1}$. As the coefficients are geometrically decaying, $u_t$ is clearly NED on $v_t$ and Assumption 1(b) is satisfied.
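As a quick numerical illustration of Example 3 (a sketch of ours, not part of the paper's formal results), one can simulate an AR(2), fit the misspecified AR(1) by least squares, and verify both the limit $\tilde{\rho} = \rho_1/(1-\rho_2)$ and the autocorrelation left in $u_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
rho1, rho2, T = 0.5, 0.3, 200_000
y = np.zeros(T)
v = rng.standard_normal(T)
for t in range(2, T):
    y[t] = rho1 * y[t - 1] + rho2 * y[t - 2] + v[t]

# least-squares fit of the misspecified AR(1)
rho_tilde_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
print(rho_tilde_hat, rho1 / (1 - rho2))      # both approx 0.714

# the pseudo-errors u_t = y_t - rho_tilde * y_{t-1} are autocorrelated,
# so a martingale difference assumption on u_t would fail here
u = y[1:] - rho_tilde_hat * y[:-1]
print(np.corrcoef(u[1:], u[:-1])[0, 1])      # clearly non-zero
```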
The key condition for applying the lasso successfully is that the parameter vector $\beta$ is (at least approximately) sparse. We formulate this in Assumption 2 below.

Assumption 2. For some $0 \le r < 1$ and $s_r > 0$, define the $N$-dimensional sparse compact parameter space
$$\mathcal{B}_N(r, s_r) := \left\{\beta \in \mathbb{R}^N : \|\beta\|_r^r \le s_r,\ \|\beta\|_\infty \le C \text{ for some } C < \infty\right\},$$
and assume that $\beta \in \mathcal{B}_N(r, s_r)$.

Assumption 2 implies that $\beta$ is sparse, with the degree of sparsity governed by both $r$ and $s_r$. Without further assumptions on $r$ and $s_r$, Assumption 2 is not binding, but as will be seen later, the allowed rates will interact with other DGP parameters, creating binding conditions. Assumption 2 generalizes the common assumption of exact sparsity obtained by taking $r = 0$ (see e.g., Medeiros and Mendes, 2016; van de Geer et al., 2014), which assumes that there are only a few (at most $s_0$) non-zero components in $\beta$, to weak sparsity (see e.g., van de Geer, 2019).¹ This allows us to have many non-zero elements in the parameter vector, as long as they are sufficiently small. It follows directly from the formulation in Assumption 2 that, given the compactness of the parameter space, exact sparsity of order $s_0$ implies weak sparsity for any $r > 0$: the smaller $r$ is, the more restrictive the assumption and the tighter the restrictions on $s_r$, and $s_0$ can be seen as a special case of $s_r$ when $r = 0$.

Example 4 (Infinite order AR). Consider an infinite order autoregressive model
$$y_t = \sum_{j=1}^{\infty} \rho_j y_{t-j} + \varepsilon_t,$$
where $\varepsilon_t$ is a stationary m.d.s. with sufficient moments existing, and the lag polynomial $1 - \sum_{j=1}^{\infty} \rho_j L^j$ is invertible and satisfies the summability condition $\sum_{j=1}^{\infty} j^a |\rho_j| < \infty$. One might consider fitting an autoregressive approximation of order $P$ to $y_t$,
$$y_t = \sum_{j=1}^{P} \beta_j y_{t-j} + u_t,$$
as it is well known that if $P$ is sufficiently large, the best linear prediction coefficients $\beta_j$ will be close to the true coefficients $\rho_j$ (see e.g. Kreiss et al., 2011, Lemma 2.2). To relate the summability condition above to the weak sparsity condition, note that by Hölder's inequality we have that
$$\|\beta\|_r^r = \sum_{j=1}^{P} \left(j^a |\beta_j|\right)^r j^{-ar} \le \left(\sum_{j=1}^{P} j^a |\beta_j|\right)^r \left(\sum_{j=1}^{P} j^{-\frac{ar}{1-r}}\right)^{1-r} \le C \max\left\{P^{1-(a+1)r}, 1\right\}.$$
The constant comes from bounding the first term by the convergence of $\beta_j$ to $\rho_j$ plus the summability of the latter, while the second term involving $P$ follows from Lemma 5.1 of Phillips and Solo (1992).² As such, summability conditions on lag polynomials imply weak sparsity conditions, where the strength of the summability condition (measured through $a$) and the required strictness of the sparsity (measured through $r$) determine the order $s_r$ of the sparsity. Therefore, weak sparsity – unlike exact sparsity – can accommodate sparse sieve estimation of infinite-order, appropriately summable, processes, providing an alternative to least-squares estimation of lower order approximations.

¹ We use the convention $0^0 = 0$ in this case.
² As the same lemma shows, one should in fact treat the case $r = 1/(a+1)$ separately, in which a bound of order $(\ln P)^{\frac{a}{a+1}}$ holds.
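A small numerical illustration of the point in Example 4 (our own sketch, with illustrative geometrically decaying coefficients): the weak sparsity measure $\|\beta\|_r^r$ stays bounded as the approximation order grows, while the exact sparsity $s_0$ does not.

```python
import numpy as np

# Weak vs. exact sparsity for a truncated AR(inf) with geometric decay
# (illustrative coefficients; any summable sequence behaves similarly).
for P in [10, 100, 1000]:
    beta = 0.7 ** np.arange(1, P + 1)      # beta_j = 0.7^j, all non-zero
    r = 0.5
    print(P,
          np.sum(np.abs(beta) ** r),       # ||beta||_r^r stays bounded in P
          np.count_nonzero(beta))          # exact sparsity s_0 grows like P
```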
Remark 1. Another common generalization of exact sparsity is approximate sparsity (Belloni et al., 2014), where it is assumed that the true functional form can accurately be approximated by a sparse linear model. As we allow for misspecified models, this is implicitly encompassed in our setup as well. Approximate sparsity essentially states that the amount of misspecification incurred by considering a sparse linear model is sufficiently small to be ignored, whereas we allow for 'substantial' misspecification, but with the consequence that the interpretation of the coefficients must be changed. In that sense, to be able to attach a structural meaning to the parameters $\beta$, one must make the additional assumption that (1) is sufficiently well specified, which then roughly corresponds to the approximate sparsity assumption. We do not make that assumption here, as we will need to deal explicitly with misspecified models in the development of the desparsified lasso, and in itself this assumption is not needed for the development of the statistical theory.
For $\lambda \ge 0$, define the weak sparsity index set
$$S_\lambda := \left\{j : |\beta_j| > \lambda\right\} \quad \text{with cardinality } s_\lambda := |S_\lambda|, \qquad (2)$$
and complement set $S_\lambda^c = \{1, \dots, N\} \setminus S_\lambda$. With an appropriate choice of $\lambda$, this set contains all 'sufficiently large' coefficients; for $\lambda = 0$ it contains all non-zero parameters. We need this set in the following conditions, which formulate the standard compatibility conditions needed for lasso consistency (see e.g., Bühlmann and van de Geer, 2011, Chapter 6). Let $\Sigma := \mathbb{E}[x_t x_t']$, with sample counterpart $\hat{\Sigma} := X'X/T$. For clarity, we choose to formulate the compatibility condition on the population covariance matrix $\Sigma$ rather than the sample covariance matrix $\hat{\Sigma}$; as a consequence, though, we then need an additional assumption on the closeness between the population and sample covariance matrices. These two assumptions are stated below.

Assumption 3.
For a general index set $S$ with cardinality $|S|$, define the compatibility constant
$$\phi_\Sigma^2(S) := \min_{\{z \in \mathbb{R}^N \setminus \{0\}:\ \|z_{S^c}\|_1 \le 3\|z_S\|_1\}} \left\{\frac{|S|\, z' \Sigma z}{\|z_S\|_1^2}\right\}.$$
Assume that $\phi_\Sigma^2(S_\lambda) > 0$, which implies that
$$\|z_{S_\lambda}\|_1^2 \le \frac{s_\lambda\, z' \Sigma z}{\phi_\Sigma^2(S_\lambda)}, \qquad \text{for all } z \text{ satisfying } \|z_{S_\lambda^c}\|_1 \le 3 \|z_{S_\lambda}\|_1 \ne 0.$$

Assumption 4.
Let
$$\mathcal{CC}_T(S_\lambda) := \left\{\|\hat{\Sigma} - \Sigma\|_\infty \le \frac{C\, \phi_\Sigma^2(S_\lambda)}{s_\lambda}\right\},$$
and assume that $\lim_{T \to \infty} P\left(\mathcal{CC}_T(S_\lambda)\right) = 1$.

The compatibility constant in Assumption 3 is an upper bound on the minimum eigenvalue of $\Sigma$, so this condition is considerably weaker than assuming $\Sigma$ to be positive definite. Furthermore, if the restricted eigenvalue condition (Bickel et al., 2009) is satisfied, Bühlmann and van de Geer (2011, Figure 6.1) show that the compatibility condition holds.

We prefer to formulate the compatibility condition in Assumption 3 on the population covariance matrix, in conjunction with Assumption 4, which links it to the sample covariance by stating that the differences between both asymptotically disappear at a certain rate, rather than formulating the condition directly on the sample covariance matrix; see e.g. the restricted eigenvalue condition in Medeiros and Mendes (2016) or Assumption (A2) in Chernozhukov et al. (2019). The direct assumption is implied by the two assumptions considered here, but the indirect way we consider allows for easier verification of the compatibility condition. For an example of conditions under which this is satisfied, see Lemma C.1. Finally, note that the compatibility assumption for the weak sparsity index set $S_\lambda$ is weaker than (and implied by) its equivalent for $S_0$, see Lemma A.4.

In this section, we derive new oracle inequalities for the lasso in a high-dimensional time series model. The lasso estimator (Tibshirani, 1996) of the parameter vector $\beta$ in Model (1) is given by
$$\hat{\beta} := \arg\min_{\beta \in \mathbb{R}^N} \left\{\frac{\|y - X\beta\|_2^2}{T} + 2\lambda \|\beta\|_1\right\}, \qquad (3)$$
where $y = (y_1, \dots, y_T)'$ is the $T \times 1$ response vector, $X = (x_1, \dots, x_T)'$ the $T \times N$ design matrix, and $\lambda > 0$ the penalty parameter.
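As an implementation note (ours, not the paper's): the objective in (3) is exactly twice the scikit-learn Lasso objective $\frac{1}{2T}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so (3) can be minimized by coordinate descent with `alpha` set equal to $\lambda$. A minimal sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_fit(X, y, lam):
    """Minimize ||y - X b||_2^2 / T + 2*lam*||b||_1 via coordinate descent.

    sklearn minimizes ||y - X b||^2 / (2T) + alpha*||b||_1, which is half of
    the objective in (3); the minimizers coincide when alpha = lam.
    """
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=100_000)
    fit.fit(X, y)
    return fit.coef_

# toy usage on simulated data
rng = np.random.default_rng(0)
T, N = 200, 500
X = rng.standard_normal((T, N))
beta = np.zeros(N); beta[:5] = 1.0
y = X @ beta + rng.standard_normal(T)
print(np.nonzero(lasso_fit(X, y, lam=0.2))[0][:10])
```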
Theorem 1. Let
$$\mathcal{E}_T(x) := \left\{\max_{j \le N,\, l \le T} \left|\sum_{t=1}^{l} u_t x_{j,t}\right| \le x\right\}, \qquad x > 0.$$
Under Assumptions 2 and 3, on the set $\mathcal{P}_{T,las} := \mathcal{E}_T(T\lambda) \cap \mathcal{CC}_T(S_\lambda)$, we have
$$\frac{\|X(\hat{\beta} - \beta)\|_2^2}{T} + \lambda \|\hat{\beta} - \beta\|_1 \le \left[C_1 + \frac{C_2}{\phi_\Sigma^2(S_\lambda)}\right] \lambda^{2-r} s_r,$$
for some constants $0 < C_1, C_2 < \infty$.

Note that Theorem 1 is a deterministic result holding on the stochastic set $\mathcal{P}_{T,las} = \mathcal{E}_T(T\lambda) \cap \mathcal{CC}_T(S_\lambda)$. In order for this inequality to lead to consistency, we need that $P\left(\mathcal{E}_T(T\lambda) \cap \mathcal{CC}_T(S_\lambda)\right) \to 1$, in which case the oracle inequality holds with probability one asymptotically. For $\mathcal{E}_T(\cdot)$ this is shown in Lemma A.5, while $\mathcal{CC}_T(\cdot)$ is covered by Assumption 4. The oracle inequality gives an upper bound on the deviation of estimated quantities from their true counterparts. By letting this upper bound converge to zero asymptotically, consistency results can be established. Corollary 1 provides estimation and prediction consistency of the lasso.

Corollary 1.
Let Assumptions 1-4 hold. Furthermore, assume that $N = O(T^a)$ for $a \ge 0$, $1/\phi_\Sigma^2(S_\lambda) = O(1)$, $s_r = O\left(N^{b/a}\right)$ for $b \ge 0$, and $\lambda \sim T^{-\ell}$ for $\ell > 0$. Then, if
$$\frac{b}{2-r} < \ell < \frac{1}{2} - \frac{a}{m}, \qquad 1 - r - b > 0, \qquad \text{and} \qquad m > \frac{a(1-r)}{1-r-b},$$
we have that

(a) Prediction consistency: $\frac{1}{T}\left\|X(\hat{\beta} - \beta)\right\|_2^2 = O_p\left(T^{b - \ell(2-r)}\right)$;

(b) Estimation consistency: $\left\|\hat{\beta} - \beta\right\|_1 = O_p\left(T^{b - \ell(1-r)}\right)$.

Under the conditions of Corollary 1, the convergence rates in (a) and (b) can be further bounded as $O_p(T^{-\varepsilon})$ for some $\varepsilon > 0$. While Theorem 1 is a useful result in its own right, it is also vital for deriving the theoretical results for the desparsified lasso, which will be elaborated on below.

We use the desparsified lasso to perform uniformly valid inference in general high-dimensional time series settings. After briefly reviewing the desparsified lasso, we formulate the assumptions needed in Section 4.1. The asymptotic theory is then derived in Section 4.2. The desparsified lasso (van de Geer et al., 2014) is defined as
$$\hat{b} := \hat{\beta} + \frac{\hat{\Theta} X'(y - X\hat{\beta})}{T}, \qquad (4)$$
where $\hat{\beta}$ is the lasso estimator from eq. (3) and $\hat{\Theta} := \hat{\Upsilon}^{-2}\hat{\Gamma}$ is a reasonable approximation of the inverse of $\hat{\Sigma}$. By de-sparsifying the initial lasso, the bias in the lasso estimator is removed and uniformly valid inference can be obtained. The matrix $\hat{\Gamma}$ is constructed using nodewise regressions: each column of $X$ is regressed on all other explanatory variables using the lasso. Let the lasso estimates of the $j = 1, \dots, N$ nodewise regressions be
$$\hat{\gamma}_j := \arg\min_{\gamma_j \in \mathbb{R}^{N-1}} \left\{\frac{\|x_j - X_{-j}\gamma_j\|_2^2}{T} + 2\lambda_j \|\gamma_j\|_1\right\}, \qquad (5)$$
where the $T \times (N-1)$ matrix $X_{-j}$ is $X$ with its $j$th column removed.
Their components are given by $\hat{\gamma}_j = \{\hat{\gamma}_{j,k} : k \in \{1, \dots, N\} \setminus \{j\}\}$. Stacking these estimated parameter vectors row-wise, with ones on the diagonal, gives the matrix
$$\hat{\Gamma} := \begin{pmatrix} 1 & -\hat{\gamma}_{1,2} & \dots & -\hat{\gamma}_{1,N} \\ -\hat{\gamma}_{2,1} & 1 & \dots & -\hat{\gamma}_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat{\gamma}_{N,1} & -\hat{\gamma}_{N,2} & \dots & 1 \end{pmatrix}.$$
The matrix $\hat{\Upsilon}^{-2} := \mathrm{diag}\left(1/\hat{\tau}_1^2, \dots, 1/\hat{\tau}_N^2\right)$, where
$$\hat{\tau}_j^2 := \frac{\|x_j - X_{-j}\hat{\gamma}_j\|_2^2}{T} + 2\lambda_j \|\hat{\gamma}_j\|_1.$$
The nodewise regressions are viewed as linear projections of one explanatory variable on all the others, with
$$\gamma_j := \arg\min_{\gamma} \left\{\mathbb{E}\left[\left(x_{j,t} - x_{-j,t}'\gamma\right)^2\right]\right\} \qquad (6)$$
representing the best linear regression coefficients. Still, to work with familiar notation as in Section 2, consider the corresponding "nodewise regression model"
$$x_{j,t} = x_{-j,t}'\gamma_j + v_{j,t},$$
with $\mathbb{E}\left[v_{j,t}^2\right] = \tau_j^2$ and $\Upsilon^{-2} = \mathrm{diag}(1/\tau_1^2, \dots, 1/\tau_N^2)$. Note that by construction, it holds that $\mathbb{E}[v_{j,t}] = 0$ for all $j$ and $\mathbb{E}[v_{j,t} x_{k,t}] = 0$ for all $k \ne j$. We first present Assumptions 5 and 6, which allow us to extend Theorem 1 to the nodewise lasso regressions.

Assumption 5.
(a) Assume that $\{z_t\}$ is stationary of order 4.
(b) Let $\mathbb{E}|v_{j,t}|^{2(m+c)} \le C$ for all $j = 1, \dots, N$.

Assumption 6.
(a) For some $0 \le r < 1$ and $s_r^{(j)} > 0$, let $\gamma_j \in \mathcal{B}_{N-1}(r, s_r^{(j)})$.
(b) Define $\Lambda_{\min}$ and $\Lambda_{\max}$ as the smallest and largest eigenvalues of $\Sigma$ respectively. Assume that $1/C \le \Lambda_{\min} \le \Lambda_{\max} \le C$.
(c) Take the weak sparsity index sets $S_{\lambda,j} := \left\{k : |\gamma_{j,k}| > \lambda_j\right\}$ with cardinality $s_{\lambda,j} := |S_{\lambda,j}|$, and $\bar{s}_\lambda := \max_j \{s_{\lambda,j}\}$. Let $\mathcal{CC}_{T,nw}(x) := \left\{\|\hat{\Sigma} - \Sigma\|_\infty \le \frac{C \Lambda_{\min}}{x}\right\}$. Then $\lim_{T \to \infty} P\left(\mathcal{CC}_{T,nw}(\bar{s}_\lambda)\right) = 1$.

Assumption 5 requires $\{z_t\}$ to be fourth-order stationary (item (a)), and the errors $v_{j,t}$ from the nodewise linear projections to have bounded moments (item (b)). By the properties of NED processes, we use Assumptions 1 and 5 to establish mixingale properties of the products $v_{j,t} u_t =: w_{j,t}$ and $w_{j,t} w_{k,t-l}$ in Lemma B.2, which are used extensively in the derivation of the desparsified lasso's asymptotic distribution.

Assumption 6(a), similarly to Assumption 2, requires weak sparsity of the nodewise regressions rather than exact sparsity. The latter could be problematic, as it would imply that many of the regressors are uncorrelated. In contrast, weak sparsity is a plausible alternative, see e.g. Example 4. Assumption 6(b) requires the population covariance matrix to be positive definite, with its smallest eigenvalue bounded away from zero, and to have finite variances. Assumption 6(b) replaces Assumption 3 of Section 3, with $\Lambda_{\min}$ fulfilling the role of $\phi_\Sigma^2$. It also implies that the explanatory variables, including the irrelevant ones, cannot be linear combinations of each other, even as we let the number of variables tend to infinity. Finally, Assumption 6(c) replaces Assumption 4 for the nodewise regressions. For a more direct comparison, one could make the marginally more general assumption of the form
$$\lim_{T \to \infty} P\left(\bigcap_{j=1}^{N} \left\{\|\hat{\Sigma}_{-j} - \Sigma_{-j}\|_\infty \le \frac{C \Lambda_{\min}}{s_{\lambda,j}}\right\}\right) = 1,$$
exploiting potential variations in asymptotic sparsity over the nodewise regressions. These assumptions allow us to apply Theorem 1 to the nodewise regressions.
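The construction in (4)-(6) can be sketched compactly as follows. This is our own schematic implementation, simplified by using one common nodewise penalty for all $j$ (the theory above allows regression-specific $\lambda_j$'s):

```python
import numpy as np
from sklearn.linear_model import Lasso

def desparsified_lasso(X, y, lam, lam_nw):
    """Sketch of the desparsified lasso (4): initial lasso fit, nodewise
    regressions (5), tau_j^2 as defined above, and the debiasing step.
    Uses one common nodewise penalty lam_nw for all j, for simplicity."""
    T, N = X.shape
    beta_hat = Lasso(alpha=lam, fit_intercept=False,
                     max_iter=100_000).fit(X, y).coef_
    Gamma = np.eye(N)
    tau2 = np.empty(N)
    for j in range(N):
        X_j = np.delete(X, j, axis=1)
        gam = Lasso(alpha=lam_nw, fit_intercept=False,
                    max_iter=100_000).fit(X_j, X[:, j]).coef_
        resid = X[:, j] - X_j @ gam
        tau2[j] = resid @ resid / T + 2 * lam_nw * np.abs(gam).sum()
        Gamma[j, np.arange(N) != j] = -gam
    Theta = Gamma / tau2[:, None]                 # Theta = Upsilon^{-2} Gamma
    b_hat = beta_hat + Theta @ X.T @ (y - X @ beta_hat) / T
    return b_hat, Theta, tau2
```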
Let
$$\mathcal{E}_T^{(j)}(x) := \left\{\max_{k \ne j,\, l \le T} \left|\sum_{t=1}^{l} v_{j,t} x_{k,t}\right| \le x\right\}$$
denote the set bounding the empirical process for the $j$-th nodewise regression. Then on the set $\mathcal{E}_T^{(j)}(T\lambda_j) \cap \mathcal{CC}_{T,nw}(s_{\lambda,j})$ we have
$$\frac{\|X_{-j}(\hat{\gamma}_j - \gamma_j)\|_2^2}{T} + \lambda_j \|\hat{\gamma}_j - \gamma_j\|_1 \le \left[C_1 + \frac{C_2}{\Lambda_{\min}}\right] \lambda_j^{2-r} s_r^{(j)} \le C \bar{\lambda}^{2-r} \bar{s}_r, \qquad (7)$$
where $\bar{\lambda} := \max_j \lambda_j$ and $\bar{s}_r := \max_j s_r^{(j)}$. As we generally need (7) to hold uniformly over all nodewise regressions, we show that the set $\mathcal{P}_{T,nw} := \bigcap_{j=1}^{N} \mathcal{E}_T^{(j)}(T\lambda_j) \cap \mathcal{CC}_{T,nw}(\bar{s}_\lambda)$ holds with probability converging to 1. In the remainder of the theory, instead of $\bar{\lambda}$ and $\bar{s}_r$, we consider the more general upper bounds
$$\lambda_{\max} = \max\{\lambda, \lambda_1, \dots, \lambda_N\} = \max\{\lambda, \bar{\lambda}\}, \qquad s_{r,\max} = \max\{s_r, s_r^{(1)}, \dots, s_r^{(N)}\} = \max\{s_r, \bar{s}_r\}, \qquad (8)$$
as this simplifies many of the final expressions. If we wanted to allow for full generality, these conditions could occasionally be weakened to have them in terms of $\bar{\lambda}$ or $\bar{s}_r$ explicitly. However, this would come at the expense of more conditions, which would not benefit readability, and we therefore opt against it. We make one final assumption to establish our theoretical results.

Assumption 7.
Define the set
$$\mathcal{L}_T := \left\{\max_{1 \le j \le N} \left|\frac{1}{T}\sum_{t=1}^{T} v_{j,t}^2 - \tau_j^2\right| \le \delta_T\right\}.$$
Then there exists a sequence $\eta_T \to 0$ such that, for $\delta_T \ge N \eta_T$, $\lim_{T \to \infty} P(\mathcal{L}_T) = 1$.

Assumption 7 gives us a Law of Large Numbers for the squared error terms from the nodewise regressions. The $N$ term follows from taking the maximum over all $j = 1, \dots, N$, while the term $\delta_T$ is affected by the dependence and tail behaviour of $v_{j,t}$. Explicit values can be derived by assuming a more specific stochastic process for $v_{j,t}$, and then deriving a bound by, for instance, the Triplex inequality (Jiang, 2009), in a fashion similar to Lemma C.1.

We now establish the uniform asymptotic normality of the desparsified lasso. To this end, write
$$\sqrt{T}\left(\hat{b} - \beta\right) = \sqrt{T}\left(\hat{\beta} - \beta + \frac{\hat{\Theta}X'(y - X\hat{\beta})}{T}\right) = \frac{\hat{\Theta}X'u}{\sqrt{T}} + \Delta,$$
where $\Delta = \sqrt{T}(I - \hat{\Theta}\hat{\Sigma})(\hat{\beta} - \beta)$. Roughly speaking, the proof of asymptotic normality consists of showing that $\Delta$ is uniformly asymptotically negligible (Lemma B.6) and applying a mixingale central limit theorem (De Jong, 1997) to the first term after establishing the consistency of $\hat{\Theta}$. As the parameter vector asymptotically grows in dimension, care must be taken in characterizing limit distributions. While one could derive high-dimensional limit distributions for the maximum of the parameter vector in the spirit of, for example, Chernozhukov et al. (2013) and Zhang and Wu (2017), we abstract from these complications by deriving results for linear combinations of finite subsets of parameters in Theorem 2. Our approach allows for testing $P$ joint hypotheses of the form $R_N \beta = q$, where $R_N$ is an appropriate $P \times N$ matrix whose non-zero columns are indexed by the set $H := \left\{j : \sum_{p=1}^{P} |R_{N,p,j}| > 0\right\}$ of cardinality $h := |H| < \infty$. By focusing on inference for a finite subset of parameters, computational gains can be obtained with respect to the nodewise regressions. Define the reduced desparsified lasso estimator
$$\hat{b}_H := \hat{\beta}_H + \frac{\hat{\Theta}_H X_H'(y - X\hat{\beta})}{T},$$
with inverse covariance matrix $\hat{\Theta}_H$ whose rows and columns not in $H$ are replaced by zeros, and analogously for $\hat{\beta}_H$ and $X_H$. The reduced estimator only requires one to compute $h+1$ nodewise regressions as opposed to $N+1$ regressions, which can be a considerable reduction for small $h$ relative to large $N$.

Given our time series setting, the long-run covariance matrix
$$\Omega_{N,T} = \mathbb{E}\left[\frac{1}{T}\left(\sum_{t=1}^{T} w_t\right)\left(\sum_{t=1}^{T} w_t'\right)\right],$$
where $w_t = (v_{1,t} u_t, \dots, v_{N,t} u_t)'$, enters the asymptotic distribution in Theorem 2. Under the fourth-order stationarity of Assumption 5, $\Omega_{N,T}$ can equivalently be written as $\Omega_{N,T} = \Xi(0) + \sum_{l=1}^{T-1} \left(1 - \frac{l}{T}\right)\left(\Xi(l) + \Xi'(l)\right)$, where $\Xi(l) = \mathbb{E}\left[w_t w_{t-l}'\right]$.

Theorem 2.
Let Assumptions 1 to 7 hold, and assume that the smallest eigenvalue of $\Omega_{N,T}$ is bounded away from 0. Furthermore, as $T \to \infty$, assume that
$$N \lambda^{-m} T^{-m/2} \to 0, \qquad N \lambda_{\min}^{-m} T^{-m/2} \to 0, \qquad \sqrt{T}\, \lambda_{\max}^{1-r}\, s_{r,\max} \to 0,$$
where $\lambda_{\min} := \min_j \lambda_j$. Then we have that
$$\sqrt{T}\, R_N (\hat{b} - \beta) \xrightarrow{d} N(0, \Psi), \qquad \text{uniformly in } \beta \in \mathcal{B}_N(r, s_r),$$
where $\Psi := \lim_{N,T \to \infty} R_N \Upsilon^{-2} \Omega_{N,T} \Upsilon^{-2} R_N'$ and $\Upsilon^{-2} := \mathrm{diag}(1/\tau_1^2, \dots, 1/\tau_N^2)$.

Remark 2.
Unlike van de Geer et al. (2014), we do not require the regularization parameters $\lambda_j$ to have a uniform growth rate. We only control the slowest and fastest converging $\lambda_j$ (covered by $\lambda_{\max}$ and $\lambda_{\min}$ respectively) through convergence rates that also involve $N$, $T$, and the sparsity $s_{r,\max}$. We provide a specific example of a joint asymptotic setup for these quantities in Corollary 2.

In order to estimate the asymptotic variance $\Psi$, we suggest estimating $\Omega_{N,T}$ with the long-run variance kernel estimator
$$\hat{\Omega} = \hat{\Xi}(0) + \sum_{l=1}^{Q_T - 1} K\left(\frac{l}{Q_T}\right)\left(\hat{\Xi}(l) + \hat{\Xi}'(l)\right), \qquad (9)$$
where $\hat{\Xi}(l) = \frac{1}{T} \sum_{t=l+1}^{T} \hat{w}_t \hat{w}_{t-l}'$ with $\hat{w}_{j,t} = \hat{v}_{j,t} \hat{u}_t$, the kernel $K(\cdot)$ can be taken as the Bartlett kernel $K(l/Q_T) = \left(1 - \frac{l}{Q_T}\right)$ (Newey and West, 1987), and the bandwidth $Q_T$ should increase with the sample size at an appropriate rate. In Theorem 3, we show that $\hat{\Psi} = R_N (\hat{\Upsilon}^{-2} \hat{\Omega} \hat{\Upsilon}^{-2}) R_N'$ is consistent for $\Psi$.
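A minimal sketch of the Bartlett-kernel estimator (9), written for the low-dimensional block of $\hat{w}_t$ actually needed for inference (our own code; variable names are ours):

```python
import numpy as np

def long_run_variance(W, Q_T):
    """Bartlett/Newey-West estimator (9) for Omega from the T x h matrix W,
    whose rows are the (estimated) products w_t = v_t * u_t."""
    T = W.shape[0]
    Omega = W.T @ W / T                       # Xi(0)
    for l in range(1, Q_T):
        Xi_l = W[l:].T @ W[:-l] / T           # Xi(l) = (1/T) sum_t w_t w_{t-l}'
        Omega += (1 - l / Q_T) * (Xi_l + Xi_l.T)
    return Omega
```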
Theorem 3. Take $\hat{\Omega}$ with bandwidth $Q_T$ such that $1/Q_T \to 0$ while $Q_T$ diverges sufficiently slowly, at a rate jointly restricted by $T$, $m$, $r$, $\lambda_{\max}$ and $s_{r,\max}$, as $T \to \infty$. Assume that the following convergence rates hold as $T \to \infty$:
$$N \lambda_{\max}^{-m} T^{-m} \to 0, \qquad N \lambda_{\min}^{-m} T^{-m/2} \to 0.$$
For $R_N$ with $P, h < \infty$, under Assumptions 1 to 7,
$$\left|R_N (\hat{\Upsilon}^{-2} \hat{\Omega} \hat{\Upsilon}^{-2}) R_N' - \Psi\right| \xrightarrow{p} 0, \qquad \text{uniformly in } \beta \in \mathcal{B}_N(r, s_r).$$

Theorem 3 provides a consistent estimator for any finite submatrix of $\Omega_{N,T}$, as is required for Theorem 2. As a natural implication of Theorems 2 and 3, Corollary 2 gives an asymptotic distribution result for a quantity composed exclusively of estimated components.

Corollary 2.
Let Assumptions 1 to 7 hold, and assume that the smallest eigenvalue of $\Omega_{N,T}$ is bounded away from 0. As $T \to \infty$, take the asymptotic growth rates
$$N = O(T^a), \quad s_{r,\max} = O\left(T^B\right) = O\left(N^{B/a}\right), \quad \lambda_{\max} \sim T^{-L}, \quad \lambda \sim T^{-\ell}, \quad \lambda_{\min} \sim T^{-\bar{\ell}},$$
with $\bar{\ell} \ge \ell \ge L > 0$, and $Q_T = O(T^{\delta_Q})$. Consider the following conditions:
$$\frac{\delta_Q + 1 + 2B}{2(2-r)} < L \le \bar{\ell} < \frac{1}{2} - \frac{a}{m}, \qquad 1 - r - \delta_Q - B > 0, \qquad m > \frac{a(2-r)}{1 - r - \delta_Q - B},$$
and $\delta_Q > 0$ sufficiently small relative to $m$. Under these conditions, for a $1 \times N$ vector $R_N$ with $h < \infty$,
$$\sup_{\beta \in \mathcal{B}_N(r, s_r)} \left|P\left(\frac{\sqrt{T}\, R_N (\hat{b} - \beta)}{\sqrt{R_N (\hat{\Upsilon}^{-2} \hat{\Omega} \hat{\Upsilon}^{-2}) R_N'}} \le z\right) - \Phi(z)\right| = o_p(1), \qquad \forall z \in \mathbb{R},$$
where $\Phi(\cdot)$ is the CDF of $N(0,1)$.

Corollary 2 allows one to perform a variety of hypothesis tests. For a significance test on a single variable $j$, for instance, take $R_N'$ as the $j$th basis vector. Then, inference on $\beta_j$ of the form
$$P\left(\frac{\sqrt{T}(\hat{b}_j - \beta_j)}{\sqrt{\hat{\omega}_{j,j}/\hat{\tau}_j^4}} \le z\right) - \Phi(z) = o_p(1), \qquad \forall z \in \mathbb{R}, \qquad (10)$$
can be obtained, where $\Phi(\cdot)$ is the standard normal CDF. One can then obtain standard confidence intervals
$$CI(\alpha) := \left[\hat{b}_j - z_{\alpha/2}\sqrt{\frac{\hat{\omega}_{j,j}/\hat{\tau}_j^4}{T}},\ \hat{b}_j + z_{\alpha/2}\sqrt{\frac{\hat{\omega}_{j,j}/\hat{\tau}_j^4}{T}}\right], \qquad (11)$$
where $z_{\alpha/2} := \Phi^{-1}(1 - \alpha/2)$, satisfying $\sup_{\beta \in \mathcal{B}_N(r, s_r)} \left|P\left(\beta_j \in CI(\alpha)\right) - (1 - \alpha)\right| = o_p(1)$. For a joint test with $P$ restrictions on $N$ variables of interest, of the form $R_N \beta = q$, one can construct a Wald-type test statistic of the form
$$\left(R_N \hat{b}_H - q\right)' \left(\frac{R_N \hat{\Upsilon}^{-2} \hat{\Omega} \hat{\Upsilon}^{-2} R_N'}{T}\right)^{-1} \left(R_N \hat{b}_H - q\right) \xrightarrow{d} \chi_P^2. \qquad (12)$$
Note that these results can also be used to test nonlinear restrictions on the parameters via the Delta method (e.g. Casella and Berger, 2002, Theorems 5.5.23-28).
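Putting the pieces together, the following sketch forms the interval (11) for a single coefficient. It reuses the `long_run_variance` function from the earlier sketch and is again our own illustrative code, with ad hoc penalties and bandwidth:

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

# assumes long_run_variance(W, Q_T) from the sketch above is in scope

def ci_single(X, y, beta_hat, b_hat, j, lam_nw, Q_T, alpha=0.05):
    """Confidence interval (11) for beta_j, given the initial lasso estimate
    beta_hat and the desparsified estimate b_hat. Recomputes the j-th
    nodewise regression to obtain v_hat and tau_j^2."""
    T, N = X.shape
    X_j = np.delete(X, j, axis=1)
    gam = Lasso(alpha=lam_nw, fit_intercept=False,
                max_iter=100_000).fit(X_j, X[:, j]).coef_
    v_hat = X[:, j] - X_j @ gam
    tau2_j = v_hat @ v_hat / T + 2 * lam_nw * np.abs(gam).sum()
    u_hat = y - X @ beta_hat                    # lasso residuals
    w = (v_hat * u_hat)[:, None]                # w_{j,t} = v_{j,t} u_t
    omega_jj = long_run_variance(w, Q_T)[0, 0]
    half = norm.ppf(1 - alpha / 2) * np.sqrt(omega_jj / tau2_j**2 / T)
    return b_hat[j] - half, b_hat[j] + half
```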
We analyze the finite sample performance of the desparsified lasso by means of simulations. We consider three simulation settings: a high-dimensional autoregressive model with exogenous variables (in Section 5.1), a factor model (in Section 5.2), and a weakly sparse VAR model (in Section 5.3). In Sections 5.1 and 5.2, we compute coverage rates of confidence intervals for single hypothesis tests. In Section 5.3, we perform a multiple hypothesis test for Granger causality.

Across all settings, we take different values of the time series length $T \in \{100, 200, 500, 1000\}$ and of the number of regressors $N \in \{101, 201, 501, 1001\}$. The number of regressors is rounded up when an even number is required, as in Section 5.3. The number of lags in the long-run covariance estimator is chosen as $Q_T = \left\lceil (2T)^{\delta_Q} \right\rceil$ with $\delta_Q = 0.1$. In practice, this means $Q_T = 2$ for $T = 100, 200, 500$ and $Q_T = 3$ for $T = 1000$.

All lasso estimates are obtained through the coordinate descent algorithm (Friedman et al., 2010). In Tables 1 to 3, we select the tuning parameter $\lambda$ from a grid of 200 values by minimizing the Bayesian Information Criterion (BIC). Note that we only consider values of the tuning parameter that result in at most $T/2$ non-zero coefficients.
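For concreteness, BIC-based tuning of $\lambda$ can be sketched as follows (our own minimal version; the degrees-of-freedom proxy counts non-zero coefficients, and the grid endpoints are our assumptions, while the $T/2$ cap mirrors the description above):

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_lambda_bic(X, y, n_grid=200):
    """Pick lambda from a log-spaced grid by minimizing the BIC,
    keeping only fits with at most T/2 non-zero coefficients."""
    T, N = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / T      # smallest lambda giving an all-zero fit
    grid = np.geomspace(lam_max, lam_max * 1e-4, n_grid)
    best = (np.inf, None)
    for lam in grid:
        beta = Lasso(alpha=lam, fit_intercept=False,
                     max_iter=50_000).fit(X, y).coef_
        df = np.count_nonzero(beta)
        if df > T // 2:
            continue
        rss = np.sum((y - X @ beta) ** 2)
        bic = T * np.log(rss / T) + df * np.log(T)
        if bic < best[0]:
            best = (bic, lam)
    return best[1]
```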
Inspired by the simulation studies in Kock and Callot (2015, Experiment B) and Medeiros and Mendes (2016), we take the following DGP:
$$y_t = \rho y_{t-1} + \beta' x_{t-1} + u_t, \qquad x_t = A_1 x_{t-1} + A_2 x_{t-2} + \nu_t,$$
where $x_t$ is an $(N-1) \times 1$ vector, $\rho$ is a fixed autoregressive coefficient inside the unit circle, and $\beta_j = \frac{(-1)^j}{\sqrt{s}}$ for $j = 1, \dots, s$, and zero otherwise.
For $N = 101$ and $201$ we set $s = 5$, and $s = 10$ for $N = 501$ and $1001$. $A_1$ and $A_2$ are block-diagonal, with each block of dimension $5 \times 5$. Within each matrix, all blocks are identical, with typical elements of 0.15 and -0.1 for $A_1$ and $A_2$ respectively. Due to the misspecification of the nodewise regressions, there is induced autocorrelation in the nodewise errors $v_{j,t}$. However, the block-diagonal structure of $A_1$ and $A_2$ keeps the sparsity of the nodewise regressions constant asymptotically. We consider different processes for the error terms $u_t$ and $\nu_t$:

A) IID errors: $u_t \sim \text{IID } N(0, 1)$ and $\nu_t \sim \text{IID } N(0, I)$. Since all moments of the normal distribution are finite, all moment conditions are satisfied.

B) GARCH(1,1) errors: $u_t = \sqrt{h_t}\, \varepsilon_t$, $h_t = \pi_0 + \pi_1 h_{t-1} + \pi_2 u_{t-1}^2$ with $\pi_0 = 5 \times 10^{-4}$, $\varepsilon_t \sim \text{IID } N(0, 1)$, and each $\nu_{j,t}$ an independent copy of the same process as $u_t$, for $j = 1, \dots, N-1$. Under this choice of GARCH parameters, not all moments of $u_t$ are guaranteed to exist, but $\mathbb{E}\left[u_t^4\right] < \infty$.

Table 1: Coverage rates of the 95% confidence intervals for $\rho$ and $\beta_1$ under Models A and B, for $N \in \{101, 201, 501, 1001\}$ and $T \in \{100, 200, 500, 1000\}$; mean interval widths are reported in parentheses. [Table entries not reproduced.]
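The design above can be sketched as follows (a minimal version of ours, with placeholder values for $\rho$ and the GARCH coefficients, which the original design fixes but which we do not reproduce):

```python
import numpy as np

def simulate_design(T, N, s=5, rho=0.5, model="A", seed=0, burn=200):
    """Simulate y_t = rho*y_{t-1} + beta' x_{t-1} + u_t with VAR(2) regressors.
    Blocks of A1 (A2) are filled with 0.15 (-0.1); rho and the GARCH
    parameters used under model "B" are placeholder values."""
    rng = np.random.default_rng(seed)
    Nx = N - 1
    A1 = np.kron(np.eye(Nx // 5), np.full((5, 5), 0.15))
    A2 = np.kron(np.eye(Nx // 5), np.full((5, 5), -0.1))
    beta = np.zeros(Nx)
    beta[:s] = (-1.0) ** np.arange(1, s + 1) / np.sqrt(s)

    def errors(n):
        if model == "A":
            return rng.standard_normal(n)
        h, u, eps = 5e-4, np.empty(n), rng.standard_normal(n)  # GARCH(1,1)
        for t in range(n):
            u[t] = np.sqrt(h) * eps[t]
            h = 5e-4 + 0.9 * h + 0.05 * u[t] ** 2
        return u

    u = errors(T + burn)
    nu = np.column_stack([errors(T + burn) for _ in range(Nx)])
    x = np.zeros((T + burn, Nx)); y = np.zeros(T + burn)
    for t in range(2, T + burn):
        x[t] = A1 @ x[t - 1] + A2 @ x[t - 2] + nu[t]
        y[t] = rho * y[t - 1] + beta @ x[t - 1] + u[t]
    return y[burn:], x[burn:]
```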
For both choices, we evaluate whether the 95% confidence intervals corresponding to $\rho$ and $\beta_1$ cover their true values at the correct rates.
The intervals are constructed as in eq. (11), namely
$$\left[\hat{\rho} \pm z_{0.025}\sqrt{\frac{\hat{\omega}_{1,1}/\hat{\tau}_1^4}{T}}\right] \qquad \text{and} \qquad \left[\hat{\beta}_1 \pm z_{0.025}\sqrt{\frac{\hat{\omega}_{2,2}/\hat{\tau}_2^4}{T}}\right],$$
with $z_{0.025} \approx 1.96$, and $\hat{\rho}$, $\hat{\beta}_1$ obtained by regressing $y_t$ on $\left(y_{t-1}, x_{t-1}'\right)'$. The rates at which these intervals contain the true values are reported in Table 1.

We start by discussing the results for the model with Gaussian errors (Model A). In line with our theoretical setup, we are mainly interested in the finite sample performance as $N$ and $T$ increase jointly. We expect to see an improvement in coverage rates as we move along the diagonals of Table 1, where $N$ and $T$ remain approximately proportional. The coverage rates in Table 1 support our expectation. Furthermore, by inspecting the results row-by-row and column-by-column, we observe a trade-off between the number of regressors $N$ and the sample size $T$: for fixed $N$ the coverage rates improve as $T$ increases, while for fixed $T$, the curse of dimensionality leads to lower coverage rates as $N$ increases. Comparing the results across the parameters, we see that the coverage rates for $\rho$ are closer to the nominal value of 95% than those for $\beta_1$.

When turning to the results for the model with GARCH errors (Model B), the finite sample coverage rates do not worsen. Coverage is overall better for Model B, especially when $T$ is small. Comparing the coverage of $\rho$, the intervals are overly conservative for small $N$ and $T$, but still closer to their nominal level than for Model A. We observe a similar pattern in the coverage of $\beta_1$, with coverage rates for low $T$ being better for Model B. Models A and B perform similarly for $T = 1000$, and this for both parameters, indicating convergence to a common limit.

While a detailed examination of selection methods for the tuning parameters is outside the scope of our work, Figures 1 to 4 do provide some initial insight. In addition to selection by the BIC (blue), we indicate selection by the AIC (red) and the EBIC (yellow) as in Chen and Chen (2012), with $\gamma = 1$. Similarly to the BIC, the AIC and EBIC are restricted to select models with at most $T/2$ variables. First, regions of coverage close to the nominal level appear for all combinations of $N$ and $T$, suggesting that good coverage could be achieved by selecting the tuning parameters well. Second, as expected, the AIC produces, overall, the least sparse solutions, the EBIC the sparsest, and the BIC lies in between. Across all scenarios, either BIC or EBIC generally tends to result in coverage rates closest to the nominal coverage of 95%. Third, there is a region of relatively low coverage in the top right of these plots, especially for $T = 1000$, which is larger for $\beta_1$ than for $\rho$. Since the BIC tends to select near this region, this partly explains why its coverage is worse for $\beta_1$. Given that the regions of good coverage are in different places for $\rho$ and $\beta_1$, using the AIC or EBIC to obtain generally smaller or larger $\lambda$ would not lead to consistently better coverage across scenarios.

We take the following factor model:
$$y_t = \beta' x_t + u_t, \quad u_t \sim \text{IID } N(0, 1), \qquad x_t = \Lambda f_t + \nu_t, \quad \nu_t \sim \text{IID } N(0, I), \qquad f_t = \phi f_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim \text{IID } N(0, 1),$$
where $x_t$ is an $N \times 1$ vector driven by the single autoregressive factor $f_t$, with fixed $|\phi| < 1$. We draw the values of the $N \times 1$ loading vector $\Lambda$ from a Uniform(0,1) distribution once at the beginning of the simulation experiment.
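A sketch of this factor design (our own code, with an assumed value for the factor's autoregressive coefficient $\phi$):

```python
import numpy as np

def simulate_factor(T, N, s=6, phi=0.8, seed=0, burn=200):
    """Single-factor design: x_t = Lambda * f_t + nu_t, y_t = beta'x_t + u_t.
    phi is a placeholder for the factor AR coefficient."""
    rng = np.random.default_rng(seed)
    Lam = rng.uniform(0.0, 1.0, N)            # loadings drawn once
    f = np.zeros(T + burn)
    for t in range(1, T + burn):
        f[t] = phi * f[t - 1] + rng.standard_normal()
    x = np.outer(f, Lam) + rng.standard_normal((T + burn, N))
    beta = np.zeros(N)
    beta[:s] = (-1.0) ** np.arange(1, s + 1) / np.sqrt(s)
    y = x @ beta + rng.standard_normal(T + burn)
    return y[burn:], x[burn:]
```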
Figure 1: Model A, $\rho$ heat map coverage. Contours mark the coverage thresholds at 5% intervals, from 75% to the nominal 95%, from dark green to white respectively. Units on the axes are not proportional to the $\lambda$-value but rather to its position in the grid: the value of $\lambda$ is $(10T)^{-1}$ at 0, and increases exponentially to a value that sets all parameters to zero at 50. Plots are based on 100 replications, with colored dots representing combinations of $\lambda$'s selected by AIC (red), BIC (blue), EBIC (yellow). [Panels: Initial $\lambda$ versus Nodewise $\lambda$, for each combination of $N \in \{101, 201, 501, 1001\}$ and $T \in \{100, 200, 500, 1000\}$.]
Figure 2: Model A, $\beta_1$ heat map coverage. Contours, axes, replications and selection markers as in Figure 1. [Panels: Initial $\lambda$ versus Nodewise $\lambda$, for each combination of $N \in \{101, 201, 501, 1001\}$ and $T \in \{100, 200, 500, 1000\}$.]
Figure 3: Model B, $\rho$ heat map coverage. Contours, axes, replications and selection markers as in Figure 1. [Panels: Initial $\lambda$ versus Nodewise $\lambda$, for each combination of $N \in \{101, 201, 501, 1001\}$ and $T \in \{100, 200, 500, 1000\}$.]
Figure 4: Model B, $\beta_1$ heat map coverage. Contours, axes, replications and selection markers as in Figure 1. [Panels: Initial $\lambda$ versus Nodewise $\lambda$, for each combination of $N \in \{101, 201, 501, 1001\}$ and $T \in \{100, 200, 500, 1000\}$.]
Table 2: Coverage rates of the 95% confidence interval for $\beta_1$ in the factor model, for $N \in \{101, 201, 501, 1001\}$ and $T \in \{100, 200, 500, 1000\}$; the mean interval widths are reported in parentheses. [Table entries not reproduced.]

We take $\beta$ as in Section 5.1, with $s$ increased by one to match the number of non-zero parameters. While the sparsity assumption is not violated in the regression of $y_t$ on $x_t$, it is in the nodewise regressions. We investigate whether the confidence interval corresponding to $\beta_1$ covers the true value at the correct rate. Following eq. (11), we use the interval
$$\left[\hat{\beta}_1 \pm z_{0.025}\sqrt{\frac{\hat{\omega}_{1,1}/\hat{\tau}_1^4}{T}}\right]$$
with $z_{0.025} \approx 1.96$.
Results are reported in Table 2. Coverage rates are generally around 85-90%, with some exceptions reaching the nominal coverage (for $T = 1000$) or producing severe undercoverage (for small $T$). For $T = 100$ and 200, the performance fluctuates over different $N$ without an apparent pattern, but it appears to stabilize somewhat for $T = 500$ and 1000. At $T = 1000$, coverage comes close to the nominal level for $N = 201$ and 501, but falls short at only around 85% for $N = 101$ and 1001.

Inspired by Kock and Callot (2015, Experiment D), we consider the VAR(1) model
$$z_t = \begin{pmatrix} y_t \\ x_t \\ w_t \end{pmatrix} = A_1 z_{t-1} + u_t, \qquad u_t \sim \text{IID } N(0, I),$$
with $z_t$ an $(N/2) \times 1$ vector, in which we test whether $x_t$ Granger causes $y_t$. The $(j,k)$-th element of the autoregressive matrix $A_1$ is given by $A_1^{(j,k)} = (-1)^{|j-k|} \rho^{|j-k|+1}$, with $\rho = 0.4$.
To measure the size of the test, we set $A_1^{(1,2)} = 0$; to measure the power of the test, we keep its regular value of $-\rho^2$. Weak sparsity holds under our choice of the autoregressive parameters, but exact sparsity is violated by having half of the parameters non-zero.³

³ The weak sparsity measure is $\sum_{j=1}^{N} |\rho^j|^r$ with asymptotic limit $\frac{\rho^r}{1 - \rho^r} < \infty$, trivially satisfying $B = 0$.
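A sketch of the Wald statistic (12) used for the two-restriction Granger causality test described below (our own helper; the variance input is assumed to be assembled from the `long_run_variance` and $\hat{\tau}_j^2$ pieces of the earlier sketches):

```python
import numpy as np
from scipy.stats import chi2

def wald_granger(b_hat, V_hat, T, idx, q=None):
    """Wald statistic (12) for H0: b[idx] = q (here two Granger restrictions).
    V_hat is the relevant submatrix of Upsilon^{-2} Omega Upsilon^{-2},
    obtained e.g. from long_run_variance and tau2 in the earlier sketches."""
    q = np.zeros(len(idx)) if q is None else q
    diff = b_hat[idx] - q
    stat = diff @ np.linalg.solve(V_hat / T, diff)
    pval = chi2.sf(stat, df=len(idx))
    return stat, pval

# reject at the 5% level when stat exceeds chi2.ppf(0.95, 2), approx 5.99
```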
Table 3: Rejection rates of the Granger causality test at nominal size $\alpha = 5\%$.

                Size                            Power
N \ T     100    200    500    1000      100    200    500    1000
102       0.080  0.080  0.080  0.069     0.507  0.784  0.987  1.000
202       0.084  0.088  0.089  0.078     0.514  0.801  0.990  1.000
502       0.082  0.096  0.102  0.090     0.536  0.832  0.994  1.000
1002      0.091  0.104  0.109  0.102     0.533  0.847  0.995  1.000

Note that the desparsified lasso is convenient for estimating the full VAR equation-by-equation, since all equations share the same regressors and $\hat{\Theta}$ needs to be computed only once. For our Granger causality test, however, only a single equation needs to be estimated.

We test whether $x_t$ Granger causes $y_t$ by regressing $y_t$ on the first and second lag of $z_t$. To this end, we test the null hypothesis $A_1^{(1,2)} = A_2^{(1,2)} = 0$ using the Wald test statistic in eq. (12), with $\hat{b}_H = \left(0, \hat{A}_1^{(1,2)}, 0, \dots, 0, \hat{A}_2^{(1,2)}, 0, \dots\right)'$, $H = \{2, N/2 + 2\}$, and $\hat{A}_1^{(1,2)}, \hat{A}_2^{(1,2)}$ obtained by regressing $y_t$ on $\left(z_{t-1}', z_{t-2}'\right)'$. We reject the null hypothesis when the statistic exceeds $\chi^2_{2,0.95} \approx 5.99$.

The rejection rates are reported in Table 3. The size of the test deteriorates somewhat as $N$ increases: performance is generally worse (i.e., the rejection rate is further from 5%) for larger $N$, and growing $T$ does not appear to improve it. In fact, performance decreases with $T$ for all values until $T = 1000$, where a small improvement occurs. However, the changes in performance are rather small, with most rejection rates lying around 8-10%. The power of the test displays nearly uniform behaviour, increasing with both $N$ and $T$, and reaching its maximum at $T = 1000$ regardless of the value of $N$.

We provide a complete set of tools for uniformly valid inference in high-dimensional stationary time series settings, where the number of regressors $N$ can possibly grow at a faster rate than the time dimension $T$. Our main results include (i) an oracle inequality for the lasso under a weak sparsity assumption on the parameter vector, thereby establishing parameter and prediction consistency; (ii) the asymptotic normality of the desparsified lasso, leading to uniformly valid inference for finite subsets of parameters; and (iii) a consistent Bartlett kernel Newey-West long-run covariance estimator to conduct inference in practice.

These results are established under very general conditions, thereby allowing for typical settings encountered in many econometric applications where the errors may be non-Gaussian, autocorrelated, heteroskedastic and weakly dependent. Crucially, this allows for certain types of misspecified time series models, such as omitted lags in an AR model.

Through a small simulation study, we examine the finite sample performance of the desparsified lasso in popular types of time series models. We perform both single and joint hypothesis tests and examine the desparsified lasso's robustness to, amongst others, regressors and error terms exhibiting serial dependence and conditional heteroskedasticity, and a violation of the sparsity assumption in the nodewise regressions. Overall, our results show that good coverage rates are obtained even when $N$ and $T$ increase jointly. Coverage rates fall back slightly to around 85-90% for factor models, where the sparsity assumption of the nodewise regressions is violated. Finally, Granger causality tests in the VAR are slightly oversized, but empirical sizes generally remain close to the nominal sizes, and the test's power increases with both $N$ and $T$.

References
Bachoc, F., H. Leeb, and B. M. Pötscher (2019). Valid confidence intervals for post-model-selection predictors. The Annals of Statistics 47(3), 1475–1504.

Bachoc, F., D. Preinerstorfer, and L. Steinberger (2016). Uniformly valid confidence intervals post-model-selection. arXiv e-print 1611.01043.

Bai, J. and S. Ng (2008). Large dimensional factor analysis. Foundations and Trends in Econometrics 3, 89–163.

Basu, S. and G. Michailidis (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics 43(4), 1535–1567.

Belloni, A., V. Chernozhukov, and C. Hansen (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81, 608–650.

Berk, R., L. Brown, A. Buja, K. Zhang, and L. Zhao (2013). Valid post-selection inference. Annals of Statistics 41, 802–837.

Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics 37(4), 1705–1732.

Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34(2), 559–583.

Bühlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Bunea, F., A. Tsybakov, and M. Wegkamp (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics 1, 169–194.

Casella, G. and R. L. Berger (2002). Statistical Inference (2nd ed.). Duxbury, Pacific Grove, CA.

Chen, J. and Z. Chen (2012). Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 22, 555–574.

Chernozhukov, V., D. Chetverikov, and K. Kato (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Annals of Statistics 41(6), 2786–2819.

Chernozhukov, V., C. Hansen, and M. Spindler (2015). Valid post-selection and post-regularization inference: an elementary, general approach. Annual Review of Economics 7, 649–688.

Chernozhukov, V., W. K. Härdle, C. Huang, and W. Wang (2019). LASSO-driven inference in time and space. arXiv preprint arXiv:1806.05081.

Davidson, J. (2002). Stochastic Limit Theory (2nd ed.). Oxford: Oxford University Press.

De Jong, R. M. (1997). Central limit theorems for dependent heterogeneous random variables. Econometric Theory 13(3), 353–367.

Fithian, W., D. Sun, and J. Taylor (2015). Optimal inference after model selection. arXiv preprint 1410.2597v2.

Francq, C. and J.-M. Zakoïan (2010). GARCH Models: Structure, Statistical Inference and Financial Applications. Wiley.

Friedman, J. H., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.

Greenshtein, E. and Y. Ritov (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971–988.

Hansen, B. E. (1991). Strong laws for dependent heterogeneous processes. Econometric Theory 7, 213–221.

Hastie, T., R. Tibshirani, and M. Wainwright (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC.

Hecq, A., L. Margaritella, and S. Smeekes (2019). Granger causality testing in high-dimensional VARs: a post-double-selection procedure. arXiv e-print 1902.10991.

Hesterberg, T., N. H. Choi, L. Meier, and C. Fraley (2008). Least angle and ℓ1 penalized regression: A review. Statistics Surveys 2, 61–93.

Hsu, N.-J., H.-L. Hung, and Y.-M. Chang (2008). Subset selection for vector autoregressive processes using lasso. Computational Statistics & Data Analysis 52(7), 3645–3657.

Huang, J., S. Ma, and C.-H. Zhang (2008). Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica 18, 1603–1618.

Javanmard, A. and A. Montanari (2014). Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research 15(1), 2869–2909.

Jiang, W. (2009). On uniform deviations of general empirical risks with unboundedness, dependence, and high dimensionality.
Journal of Machine Learning Research 10 (Apr), 977–996.Kock, A. B. and L. Callot (2015). Oracle inequalities for high dimensional vector autoregressions.
Journal of Econometrics 186 , 325–344.Krampe, J., J. Kreiss, and E. Paparoditis (2018). Bootstrap based inference for sparse high-dimensional time series models. arXiv preprint arXiv:1806.11083 .Kreiss, J.-P., E. Paparoditis, and D. N. Politis (2011). On the range of validity of the autoregressivesieve bootstrap.
Annals of Statistics 39 , 2103–2130.Lee, J. D., D. L. Sun, Y. Sun, and J. E. Taylor (2016). Exact post-selection inference, withapplication to the lasso.
Annals of Statistics 44 , 907–927.Leeb, H. and B. M. P¨otscher (2005). Model selection and inference: Facts and fiction.
EconometricTheory 21 , 21–59.Leeb, H. and B. M. P¨otscher (2008). Sparse estimators and the oracle property, or the return ofthe Hodges’ estimator.
Journal of Econometrics 142 , 201–211.Leeb, H., B. M. P¨otscher, and K. Ewald (2015). On various confidence intervals post-model-selection.
Statistical Science 30 (2), 216–227. 28ockhart, R., R. J. Tibshirani, J. Taylor, and R. Tibshirani (1996). A significance test for the lasso.
Annals of Statistics 42 , 413–468.Masini, R. P., M. C. Medeiros, and E. F. Mendes (2019). Regularized estimation of high-dimensionalvector autoregressions with weakly dependent innovations. arXiv e-print 1912.09002.McLeish, D. L. (1975). A maximal inequality and dependent strong laws.
Annals of Probability 3 ,829–839.Medeiros, M. C. and E. F. Mendes (2016). (cid:96) -regularization of high-dimensional time-series modelswith non-gaussian and heteroskedastic errors. Journal of Econometrics 191 , 255–271.Meinshausen, N. and B. Yu (2009). Lasso-type recovery of sparse representations for high-dimensional data.
Annals of Statistics 37 (1), 246–270.Nardi, Y. and A. Rinaldo (2011). Autoregressive process modeling via the lasso procedure.
Journalof Multivariate Analysis 102 , 529–549.Newey, W. K. and K. D. West (1987). A simple, positive semi-definite, heteroskedasticity andautocorrelation consistent covariance matrix.
Econometrica 55 , 703–708.Phillips, P. C. B. and V. Solo (1992). Asymptotics for linear processes.
Annals of Statistics 20 ,971–1001.Stock, J. H. and M. W. Watson (2011). Dynamic factor models. In M. P. Clements and D. F.Hendry (Eds.),
Oxford Handbook of Economic Forecasting , pp. 35–59. Oxford University Press.Taylor, J. and R. Tibshirani (2018). Post-selection inference for (cid:96) -penalized likelihood models. Canadian Journal of Statistics 46 (1), 41–61.Tian, X. and J. Taylor (2017). Asymptotics of selective inference.
Scandinavian Journal of Statis-tics 44 (2), 480–499.Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
Journal of the RoyalStatistical Society: Series B (Methodological) 58 (1), 267–288.Tibshirani, R. J., A. Rinaldo, R. Tibshirani, and L. Wasserman (2018). Uniform asymptoticinference and the bootstrap after model selection.
Annals of Statistics 46 (3), 1255–1287.van de Geer, S. (2019). On the asymptotic variance of the debiased lasso.
Electronic Journal ofStatistics 13 (2), 2970–3008. 29an de Geer, S., P. B¨uhlmann, Y. Ritov, and R. Dezeure (2014). On asymptotically optimalconfidence regions and tests for high-dimensional models.
The Annals of Statistics 42 (3), 1166–1202.van de Geer, S. A. (2016).
Estimation and testing under sparsity . Springer.Vidaurre, D., C. Bielza, and P. Larra˜naga (2013). A survey of L regression. International StatisticalReview 81 (3), 361–387.Wang, H., G. Li, and C.-L. Tsai (2007). Regression coefficient and autoregressive order shrink-age and selection via the lasso.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) 69 (1), 63–78.Wecker, W. E. (1978). A note on the time series which is the product of two stationary time series.
Stochastic Processes and their Applications 8 (2), 153–157.Wong, K. C., Z. Li, and A. Tewari (2020). Lasso guarantees for β -mixing heavy-tailed time series. Annals of Statistics 48 (2), 1124–1142.Zhang, C.-H. and J. Huang (2008). The sparsity and bias of the lasso selection in high-dimensionallinear regression.
Annals of Statistics 36 (4), 1567–1594.Zhang, C.-H. and S. S. Zhang (2014). Confidence intervals for low dimensional parameters in highdimensional linear models.
Journal of the Royal Statistical Society Series B 76 , 217–242.Zhang, D. and W. B. Wu (2017). Gaussian approximation for high dimensional time series.
Annalsof Statistics 45 (5), 1895–1919. 30 ppendix A Proofs for Section 3
A.1 Prelimininary results
Lemma A.1.
Under Assumption 1, for every j = 1 , . . . , N , { u t x j,t } is an L m -Mixingale withrespect to F t = σ { z t , z t − , . . . } , with non-negative mixingale constants c t ≤ C and sequence ψ q satisfying ∞ (cid:80) q =1 ψ q < ∞ . Proof of Lemma A.1 . L m + c -boundedness of { x j,t u t } follows directly from the L m + c ) -boundednessof { z t } and the Cauchy-Schwartz inequality. By Theorem 17.9 in Davidson (1994) it follows that { x j,t u t } is L m -NED on { s T,t } of size −
1. We then apply Theorem 17.5 in Davidson (1994) toconclude that { x j,t u t } is an L m -mixingale of size − min { , m ( m + c ) c (1 /m − / ( m + c )) } = −
1, withrespect to F s t = σ { s T,t , s T,t − , . . . } ; the F s t -measurability of z t implies σ { z t , z t − , . . . } ⊂ F s t , whichin turn implies that { x j,t u t } it is also an L m -mixingale with respect to F t = σ { z t , z t − , . . . } . Thesummability condition ∞ (cid:80) q =1 ψ q < ∞ is satisfied by the convergence property of p -series: ∞ (cid:80) q =1 q − p < ∞ for any p > Lemma A.2.
Take an index set S with cardinality | S | . Assuming that (cid:107) β S (cid:107) ≤ | S | β (cid:48) Σ β φ Σ ( S ) holds for (cid:8) β ∈ R N : (cid:107) β S c (cid:107) ≤ (cid:107) β S (cid:107) (cid:9) , then on the set CC T ( S ) := (cid:110) (cid:107) ˆ Σ − Σ (cid:107) ∞ ≤ C φ Σ ( S ) | S | (cid:111) (cid:107) β S (cid:107) ≤ (cid:113) | S | β (cid:48) ˆ Σ β φ Σ ( S ) , for (cid:8) β ∈ R N : (cid:107) β S c (cid:107) ≤ (cid:107) β S (cid:107) (cid:9) . Proof of Lemma A.2 . This result follows directly by Corollary 6.8 in B¨uhlmann and van De Geer(2011).
Lemma A.3.
For index set S with cardinality | S | , assume that Assumption 3 and Assumption 4hold. Recall the sets E T ( x ) = (cid:26) max j ≤ N,l ≤ T (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) l (cid:80) t =1 u t x j,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ x : x > (cid:27) and CC T ( S ) = (cid:110) (cid:107) ˆ Σ − Σ (cid:107) ∞ ≤ C φ Σ ( S ) | S | (cid:111) .On the set E T ( T λ ) (cid:84) CC T ( S ) : (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) ≤ λ | S | φ Σ ( S ) + 83 λ (cid:107) β S c (cid:107) . Proof of Lemma A.3 . The proof largely follows Theorem 2.2 of van de Geer (2016) applied to β = β with some modifications. For the sake of clarity and readability, we include the full proofhere.Consider two cases. First, consider the case where (cid:107) X (ˆ β − β ) (cid:107) T < − λ (cid:107) ˆ β − β (cid:107) + 2 λ (cid:107) β S c (cid:107) .31hen (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) < λ (cid:107) β S c (cid:107) < λ (cid:107) β S c (cid:107) + 323 λ | S | φ Σ ( S ) , which satisfies Lemma A.3.Next, consider the case where (cid:107) X (ˆ β − β ) (cid:107) T ≥ − λ (cid:107) ˆ β − β (cid:107) + 2 λ (cid:107) β S c (cid:107) . From the Lasso opti-mization problem in eq. (3), we have the Karush-Kuhn-Tucker conditions X (cid:48) ( y − X ˆ β ) T = λ ˆ κ, whereˆ κ is the subdifferential of (cid:107) ˆ β (cid:107) . Premultiplying by ( β − ˆ β ) (cid:48) , we get( β − ˆ β ) (cid:48) X (cid:48) ( y − X ˆ β ) T = λ ( β − ˆ β ) (cid:48) ˆ κ = λ β (cid:48) ˆ κ − λ (cid:107) ˆ β (cid:107) ≤ λ (cid:107) β (cid:107) − λ (cid:107) ˆ β (cid:107) . By plugging in y = Xβ + u , the left-hand-side can be re-written as (cid:107) X (ˆ β − β ) (cid:107) T + u (cid:48) X ( β − ˆ β ) T , andtherefore (cid:107) X ( ˆ β − β ) (cid:107) T ≤ u (cid:48) X ( ˆ β − β ) T + λ (cid:107) β (cid:107) − λ (cid:107) ˆ β (cid:107) ≤ (1) T (cid:13)(cid:13) u (cid:48) X (cid:13)(cid:13) ∞ (cid:107) ˆ β − β (cid:107) + λ (cid:107) β (cid:107) − λ (cid:107) ˆ β (cid:107) ≤ (2) λ (cid:107) ˆ β − β (cid:107) + λ (cid:107) β (cid:107) − λ (cid:107) ˆ β (cid:107) ≤ (3) λ (cid:107) ˆ β S − β S (cid:107) − λ (cid:107) ˆ β S c (cid:107) + 5 λ (cid:107) β S c (cid:107) ≤ (4) λ (cid:107) ˆ β S − β S (cid:107) − λ (cid:107)(cid:107) ˆ β S c − β S c (cid:107) + 2 λ (cid:107) β S c (cid:107) , where (1) follows from the dual norm inequality, (2) from the bound on the empirical processgiven by E T ( T λ ), (3) from the property (cid:107) β (cid:107) = (cid:107) β S (cid:107) + (cid:107) β S c (cid:107) with β j,S = β j { j ∈ S } , as wellas several applications of the triangle inequality, and (4) follows from the fact that (cid:107) ˆ β S c (cid:107) ≤ (cid:104) (cid:107) ˆ β S c − β S c (cid:107) − (cid:107) β S c (cid:107) (cid:105) . Note that it follows from the condition (cid:107) X (ˆ β − β ) (cid:107) T ≥ − λ (cid:107) ˆ β − β (cid:107) +2 λ (cid:107) β S c (cid:107) combined with the previous inequality that (cid:107) ˆ β S c − β S c (cid:107) ≤ (cid:107) ˆ β S − β S (cid:107) such thatLemma A.2 can be applied. Adding λ (cid:107) ˆ β S − β S (cid:107) to both sides and re-arranging, we get byapplying Lemma A.243 (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) ≤ λ (cid:107) ˆ β S − β S (cid:107) + 83 λ (cid:107) β S c (cid:107) ≤ λ (cid:113) | S | ( ˆ β − β ) (cid:48) ˆ Σ ( ˆ β − β ) φ Σ ( S ) + 83 λ (cid:107) β S c (cid:107) . 
Using that 2 uv ≤ u + v with u = (cid:113) ( ˆ β − β ) (cid:48) ˆ Σ ( ˆ β − β ), v = √ √ λ √ | S | φ Σ ( S ) , we further bound theright-hand-side to arrive at43 (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) ≤ (cid:107) X ( ˆ β − β ) (cid:107) T + 323 λ | S | φ Σ ( S ) + 83 λ (cid:107) β S c (cid:107) , from which the result follows. Lemma A.4.
For S λ ⊂ S (cid:54) = ∅ , we have thst φ Σ ( S ) s ≤ φ Σ ( S λ ) s λ . roof of Lemma A.4 . See Lemma 6.19 in B¨uhlmann and van De Geer (2011).
Lemma A.5.
Under Assumption 1, we have that P ( E T ( x )) ≥ − CN (cid:32) √ Tx (cid:33) m . Proof of Lemma A.5 . By the union bound, Markov’s inequality and the mixingale concentrationinequality of Hansen (1991, Lemma 2), it follows that P (cid:32) max j ≤ N,l ≤ T (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) l (cid:88) t =1 u t x j,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) > x (cid:33) ≤ N (cid:88) j =1 P (cid:32) max l ≤ T (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) l (cid:88) t =1 u t x j,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:35) > x (cid:33) ≤ x − m N (cid:88) j =1 E (cid:34) max l ≤ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) l (cid:88) t =1 u t x j,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m (cid:35) ≤ x − m N (cid:88) j =1 C m (cid:32) T (cid:88) t =1 c t (cid:33) m/ ≤ CN T m/ x − m , as { x j,t u t } is a mixingale of appropriate size by Lemma A.1. A.2 Proofs of the main results
Proof of Theorem 1 . By Assumption 3 and Lemma A.3, we have on the set E T ( T λ ) (cid:84) CC T ( S λ ) (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) ≤ λ s λ φ Σ ( S λ ) + 83 λ (cid:107) β S cλ (cid:107) . It follows directly from Assumption 2 that s λ ≤ N (cid:88) j =1 { | β j | >λ } (cid:12)(cid:12)(cid:12) β j (cid:12)(cid:12)(cid:12) λ r ≤ λ − r N (cid:88) j =1 (cid:12)(cid:12) β j (cid:12)(cid:12) r = λ − r s r . (cid:13)(cid:13)(cid:13) β S cλ (cid:13)(cid:13)(cid:13) = N (cid:88) j =1 { | β j | ≤ λ } (cid:12)(cid:12) β j (cid:12)(cid:12) ≤ N (cid:88) j =1 λ (cid:12)(cid:12)(cid:12) β j (cid:12)(cid:12)(cid:12) − r (cid:12)(cid:12) β j (cid:12)(cid:12) = λ − r N (cid:88) j =1 (cid:12)(cid:12) β j (cid:12)(cid:12) r ≤ λ − r s r . Plugging these in, we obtain (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) ≤ λ λ − r s r φ Σ ( S λ ) + 83 λλ − r s r = (cid:2) C + C φ Σ ( S λ ) (cid:3) λ − r s r φ Σ ( S λ ) . Proof of Corollary 1 . By Theorem 1, we can bound the expressions in (a) and (b) as (cid:107) X ( ˆ β − β ) (cid:107) T ≤ (cid:2) C + C φ Σ ( S λ ) (cid:3) λ − r s r φ Σ ( S λ ) = O (cid:16) T b − (cid:96) (2 − r ) (cid:17) , (cid:107) ˆ β − β (cid:107) ≤ (cid:2) C + C φ Σ ( S λ ) (cid:3) λ − r s r φ Σ ( S λ ) = O (cid:16) T b − (cid:96) (1 − r ) (cid:17) . Both upper bounds therefore converge to 0 when − (cid:96) (1 − r ) + b < E T ( T λ ) (cid:84) CC T ( S λ ). CC T ( S λ ) asymptotically holds by Assump-33ion 4, and by Lemma A.5 P ( E T ( T λ/ ≥ − C NT m/ λ m = 1 − O (cid:16) T a − m/ m(cid:96) (cid:17) , and this probability converges to 1 when a − m/ m(cid:96) <
0. The intersection of these sets holdswith probability converging to 1 by Boole’s inequality. Combining both bounds gives a − m/ m(cid:96) < , − (cid:96) (1 − r ) + b < . = ⇒ b − r < (cid:96) < − am , − r − b > ,m > a (1 − r )1 − r − b . Appendix B Proofs for Section 4
B.1 Preliminary results
Lemma B.1.
Under Assumptions 1 and 5, the following holds:(a) { v j,t } is a weakly stationary process with E [ v j,t ] = , ∀ j , E [ v j,t x k,t ] = 0 , ∀ k (cid:54) = j, t .(b) E [ | v j,t x j,t | m ] ≤ C, ∀ j, t .(c) { v j,t x k,t } is an L m -Mixingale with respect to F ( j ) t = σ { v j,t , x − j,t , v j,t − , x − j,t − , . . . } , ∀ k (cid:54) = j ,with non-negative mixingale constants c t ≤ C and sequences ψ q satisfying ∞ (cid:80) q =1 ψ q ≤ C . Proof of Lemma B.1 . As v j,t are the projection errors from projecting x j,t on all other x k,t , itfollows directly that E [ v j,t ] = 0 and E [ v j,t x k,t ] = 0. L m + c -boundedness of { v j,t x k,t } , ∀ j, k followsfrom Assumption 1(a), Assumption 5(b), and the Cauchy-Schwarz inequality. Weak stationarityfollows directly as v j,t is a time-constant function of x t (which 4 th -order stationary by Assump-tion 5(a)) and following the derivations in Wecker (1978), the product of 4 th -order stationarysequences is weakly stationary. By Theorem 17.8 of Davidson (2002), { v j,t } is L m -NED on { v T,t } of size -1, { v j,t } is L m -NED on { v T,t } of size -1. The remainder of the proof follows as in the proofof Lemma A.1. Lemma B.2.
Let w t = ( w ,t , . . . , w N,t ) (cid:48) with w j,t = v j,t u t . Under Assumptions 1 and 5 thefollowing holds:(a) Let sup (cid:107) h (cid:107) =1 (cid:80) ∞ l = −∞ (cid:12)(cid:12) h (cid:48) Ξ( l ) h (cid:12)(cid:12) < ∞ , where Ξ( l ) = E w t w (cid:48) t − l .(b) For all j , w j,t is L m + c -bounded and an L m -Mixingale of size -1/2 with respect to F t = σ { u t , v t , u t − , v t − , . . . } , with non-negative mixingale constants C ≤ c t ≤ C . c) For all j, k, l , w j,t w k,t − l − E [ w j,t w k,t − l ] is L m/ -bounded and an L -Mixingale with respectto F t , with non-negative mixingale constants c t ≤ C , and sequences ψ q = O ( q − s ) for some s ≥ . Proof of Lemma B.2 . It follows by the Cauchy-Schwarz inequality that { w j,t } is L m + c -boundedfor all j = 1 , . . . , p , and from the properties of { v j,t } by Theorem 17.9 of Davidson (2002) that { w j,t } is L m -NED of size -1. Consequently, Theorem 17.7 (with r - as used in this Theorem - equalto m + c ) ensures the summability of the autocovariances in (a). Note that the formulation of Ξ ( l )follows from weak stationarity of { w t } , which in turn follows from 4 th -order stationarity of { z t } Part (b) follows again by Theorem 17.5 in the same way as the first part of the proof, while (c)follows by repeated application of Corollary 17.11 and Theorem 17.5, noting that E ( w j,t w k,t − l ) isa time-constant function, so trivially NED. Lemma B.3.
Under Assumption 6(a)-(b),on the set P T,nw (cid:84) L T , we have max ≤ j ≤ N (cid:12)(cid:12) ˆ τ j − τ j (cid:12)(cid:12) ≤ Nδ T + C λ − r max ¯ s r + C (cid:113) λ − r max ¯ s r , and max ≤ j ≤ N (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τ j − τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Nδ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r C − C (cid:16) Nδ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) . Proof.
Note that ˆ τ j can be rewritten as followsˆ τ j = (cid:13)(cid:13)(cid:13) x j − X − j γ j (cid:13)(cid:13)(cid:13) T + (cid:13)(cid:13)(cid:13) X − j (cid:16) ˆ γ j − γ j (cid:17)(cid:13)(cid:13)(cid:13) T − (cid:16) x j − X − j γ j (cid:17) (cid:48) X − j (cid:16) ˆ γ j − γ j (cid:17) T + λ j (cid:107) ˆ γ j (cid:107) = 1 T T (cid:88) t =1 v j,t + (cid:13)(cid:13)(cid:13) X − j (cid:16) ˆ γ j − γ j (cid:17)(cid:13)(cid:13)(cid:13) T − (cid:16) x j − X − j γ j (cid:17) (cid:48) X − j (cid:16) ˆ γ j − γ j (cid:17) T + λ j (cid:107) ˆ γ j (cid:107) . (B.1)Then | ˆ τ j − τ j | ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T T (cid:88) t =1 v j,t − τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:13)(cid:13)(cid:13) X − j (cid:16) ˆ γ j − γ j (cid:17)(cid:13)(cid:13)(cid:13) T + 2 (cid:12)(cid:12)(cid:12)(cid:12)(cid:16) x j − X − j γ j (cid:17) (cid:48) X − j (cid:16) ˆ γ j − γ j (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) T + λ j (cid:107) ˆ γ j (cid:107) =: R (i) + R (ii) + R (iii) + R (iv) . By the set L T , we have R (i) ≤ max j (cid:12)(cid:12)(cid:12)(cid:12) T T (cid:80) t =1 v j,t − τ j (cid:12)(cid:12)(cid:12)(cid:12) ≤ Nδ T . By eq. (7), it holds that R (ii) ≤ C λ − rj s ( j ) r ≤ C λ − r max ¯ s r . By the set N (cid:84) j =1 {E ( j ) T ( T λ j ) } and eq. (7), we have R (iii) = 2 (cid:12)(cid:12)(cid:12) v (cid:48) j X − j (cid:16) ˆ γ j − γ j (cid:17)(cid:12)(cid:12)(cid:12) T ≤ C λ j (cid:13)(cid:13) ˆ γ j − γ j (cid:13)(cid:13) ≤ C λ − r max ¯ s r .
35y the triangle inequality R (iv) ≤ λ j (cid:107) γ j (cid:107) + λ j (cid:107) ˆ γ j − γ j (cid:107) . Using the weak sparsity index for thenodewise regressions S λ,j = { k (cid:54) = j : | γ j,k | > λ j } , write (cid:107) γ j (cid:107) = (cid:13)(cid:13)(cid:13) ( γ j ) S cλ,j (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) ( γ j ) S λ,j (cid:13)(cid:13)(cid:13) . Theseterms can then be bounded as follows (cid:13)(cid:13)(cid:13) ( γ j ) S cλ,j (cid:13)(cid:13)(cid:13) = (cid:88) k (cid:54) = j {| γ j,k |≤ λ j } | γ j,k | ≤ λ − rj s ( j ) r ≤ λ − r max ¯ s r . Bounding the L norm by the L norm, we get (cid:13)(cid:13) ( γ j ) S λ,j (cid:13)(cid:13) ≤ s λ,j (cid:107) γ j (cid:107) ≤ λ − r max ¯ s r (cid:107) γ j (cid:107) , To further bound (cid:107) γ j (cid:107) , consider the matrix Θ = Σ − = ( E [ x t x (cid:48) t ]) − and the partitioning Σ = E (cid:16) x j,t (cid:17) E (cid:16) x j,t x (cid:48)− j,t (cid:17) E ( x − j,t x j,t ) E (cid:16) x − j,t x (cid:48)− j,t (cid:17) . By blockwise matrix inversion, we can write the j th row of Θ as Θ j = (cid:34) τ j , − τ j E (cid:0) x j,t x (cid:48)− j,t (cid:1) E (cid:0) x − j,t x (cid:48)− j,t (cid:1) − (cid:35) = 1 τ j (cid:2) , ( γ j ) (cid:48) (cid:3) . (B.2)It then follows that (cid:107) γ j (cid:107) = (cid:88) k (cid:54) = j ( γ j,k ) ≤ (cid:88) k (cid:54) = j ( γ j,k ) = τ j Θ j Θ (cid:48) j ≤ τ j Λ , as min is the largest eigenvalue of Θ . For a bound on τ j , by the definition of γ j from eq. (6) itfollows that τ j = min γ j (cid:110) E (cid:104)(cid:0) x j,t − x (cid:48)− j,t γ j (cid:1) (cid:105)(cid:111) ≤ E (cid:104)(cid:0) x j,t − x (cid:48)− j,t (cid:1) (cid:105) = E (cid:2) x j,t (cid:3) = Σ j,j ≤ Λ max . Similar arguments can be used to bound τ j from below. By the proof of Lemma 5.3 in van de Geeret al. (2014), τ j = j,j , and therefore τ j ≥ Λ min . It then follows from Assumption 6(b) that1 C ≤ τ j ≤ C. (B.3)We therefore have (cid:107) γ j (cid:107) ≤ τ j Λ min ≤ C , such that we can bound the fourth term as R (iv) ≤ λ − r max ¯ s r + λ − r/ ¯ s / r C + C λ − r max ¯ s r . Combining all bounds, we have | ˆ τ j − τ j | ≤ Nδ T + C λ − r max ¯ s r + C λ − r max ¯ s r + λ − r max ¯ s r + (cid:113) λ − r max ¯ s r C + C λ − r max ¯ s r = Nδ T + C λ − r max ¯ s r + C (cid:113) λ − r max ¯ s r . (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τ j − τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ | ˆ τ j − τ j | τ j − τ j | ˆ τ j − τ j | ≤ | ˆ τ j − τ j | C − C | ˆ τ j − τ j | ≤ Nδ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r C − C (cid:16) Nδ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) . Lemma B.4.
Under Assumption 6(a)-(b), it holds for a sufficiently large T that on the set N (cid:84) j =1 E ( j ) T ( T λ j ) (cid:84) L T , (cid:107) I − ˆ Θ ˆ Σ (cid:107) ∞ ≤ λ max C − Nδ T − C λ − r max ¯ s r . Proof of Lemma B.4 . First, note that since ˆ Σ is a symmetric matrix (cid:107) I − ˆ Θ ˆ Σ (cid:107) ∞ = (cid:107) ˆ Θ ˆ Σ − I (cid:107) ∞ = (cid:107) ˆ Σ ˆ Θ (cid:48) − I (cid:107) ∞ = max j (cid:110) (cid:107) ˆ Σ ˆ Θ (cid:48) j − e j (cid:107) ∞ (cid:111) . By the extended KKT conditions (see Section 2.1.1 of van de Geer et al., 2014), we have thatmax j (cid:110) (cid:107) ˆ Σ ˆ Θ (cid:48) j − e j (cid:107) ∞ (cid:111) ≤ max j (cid:26) λ j ˆ τ j (cid:27) ≤ λ max min j { ˆ τ j } . For a lower bound on min j (cid:110) ˆ τ j (cid:111) , note that byeq. (B.1), ˆ τ j can be rewritten asˆ τ j = (cid:107) x j − X − j γ j (cid:107) T + (cid:107) X − j (cid:16) ˆ γ j − γ j (cid:17) (cid:107) T − (cid:16) x j − X − j γ j (cid:17) (cid:48) X − j (cid:16) ˆ γ j − γ j (cid:17) T + λ j (cid:107) ˆ γ j (cid:107) . With (cid:107) X − j ( ˆ γ j − γ j ) (cid:107) T ≥ λ j (cid:107) ˆ γ j (cid:107) ≥ j , we haveˆ τ j ≥ (cid:107) x j − X − j γ j (cid:107) T − (cid:16) x j − X − j γ j (cid:17) (cid:48) X − j (cid:16) ˆ γ j − γ j (cid:17) T = T (cid:80) t =1 v j,t T − v (cid:48) j X − j (cid:16) ˆ γ j − γ j (cid:17) T .
The dual norm inequality in combination with the triangle inequality then givesˆ τ j ≥ τ j − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T T (cid:88) t =1 v j,t − τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − T max k (cid:54) = j (cid:8) | v (cid:48) j x k | (cid:9) (cid:107) ˆ γ j − γ j (cid:107) , ≥ C − max j (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T T (cid:88) t =1 v j,t − τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:41) − T max k (cid:54) = j (cid:8) | v (cid:48) j x k | (cid:9) (cid:107) ˆ γ j − γ j (cid:107) , where the second line follows from eq. (B.3). Then, on the sets L T and E ( j ) T ( T λ j )ˆ τ j ≥ C − Nδ T − λ j (cid:107) ˆ γ j − γ j (cid:107) ≥ C − Nδ T − C λ − rj s ( j ) r ≥ C − Nδ T − C λ − r max ¯ s r , where we applied Theorem 1 for the second inequality. As λ − r max ¯ s r →
0, for a large enough T wehave thatmin j τ j ≤ C − Nδ T − C λ − r max ¯ s r Lemma B.5.
Under Assumptions 1 and 5, the following holds P N (cid:92) j =1 E ( j ) T ( x j ) ≥ − C N T m/ min ≤ j ≤ N x mj . Proof of Lemma B.5 . By Lemmas A.5 and B.1, we have P (cid:16) E ( j ) T ( x j ) (cid:17) ≤ CN ( √ T /x j ) m . Then P N (cid:92) j =1 E ( j ) T ( x j ) ≥ − N (cid:88) j =1 P (cid:16)(cid:110) E ( j ) T x j (cid:111) c (cid:17) ≥ − N T m/ min ≤ j ≤ N x mj . Lemma B.6.
Under Assumptions 1, 2 and 6(a)-(b), on the set P T,las (cid:84) P T,nw (cid:84) L T we have that (cid:107) ∆ (cid:107) ∞ ≤ √ T λ − r s r λ max C − Nδ T − C λ − r max ¯ s r . Proof of Lemma B.6 . Plugging in the definition of ∆, we have (cid:107) ∆ (cid:107) ∞ ≤ √ T (cid:107) I − ˆ Θ ˆ Σ (cid:107) ∞ (cid:107) ˆ β − β (cid:107) ∞ ≤ √ T (cid:107) I − ˆ Θ ˆ Σ (cid:107) ∞ (cid:107) ˆ β − β (cid:107) . Under Assumption 6(a) and (b), on the sets E T ( T λ ) (cid:84) CC T ( S λ ), we have (cid:107) X ( ˆ β − β ) (cid:107) T + λ (cid:107) ˆ β − β (cid:107) ≤ [ C + C Λ min ] λ − r s r Λ min = Cλ − r s r , (B.4)from which it follows that (cid:107) ˆ β − β (cid:107) ≤ Cλ − r s r . Combining this bound with Lemma B.4 gives (cid:107) ∆ (cid:107) ∞ ≤√ T λ − r s r λ max C − Nδ T − C λ − r max ¯ s r . Lemma B.7.
Under Assumption 6(a)-(b), on the set E T ( T λ ) (cid:84) P T,nw , max ≤ j ≤ N √ T (cid:12)(cid:12) ˆ v (cid:48) j u − v (cid:48) j u (cid:12)(cid:12) ≤ C √ T λ − r max ¯ s r . Proof of Lemma B.7 . Starting from the nodewise regression model, write1 √ T (cid:12)(cid:12) ˆ v (cid:48) j u − v (cid:48) j u (cid:12)(cid:12) = 1 √ T (cid:12)(cid:12) u (cid:48) X − j (cid:0) γ j − ˆ γ j (cid:1)(cid:12)(cid:12) ≤ √ T (cid:13)(cid:13) u (cid:48) X (cid:13)(cid:13) ∞ (cid:13)(cid:13) ˆ γ j − γ j (cid:13)(cid:13) . By the set E T ( T λ ) and eq. (7), √ T max j {| u (cid:48) X j |} T (cid:13)(cid:13) ˆ γ j − γ j (cid:13)(cid:13) ≤√ T λ (cid:13)(cid:13) ˆ γ j − γ j (cid:13)(cid:13) ≤ C √ T λλ − rj s ( j ) r ≤ C √ T λ − r max ¯ s r , where the upper bound is uniform in j . Lemma B.8.
Define the set E ( j ) T,uv ( x ) := (cid:26) max s ≤ T (cid:12)(cid:12)(cid:12)(cid:12) s (cid:80) t =1 v j,t u t (cid:12)(cid:12)(cid:12)(cid:12) ≤ x : x > (cid:27) . Under Assumptions 1and 5, it follows that P (cid:16) E ( j ) T,uv ( x ) (cid:17) ≥ − CT m/ x m . roof of Lemma B.8 . By the Markov inequality, Lemma B.2 and the mixingale concentrationinequality of Hansen (1991, Lemma 2), P (cid:32) max s ≤ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) s (cid:88) t =1 v j,t u t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > x (cid:33) ≤ E (cid:18) max s ≤ T (cid:12)(cid:12)(cid:12)(cid:12) s (cid:80) t =1 v j,t u t (cid:12)(cid:12)(cid:12)(cid:12) m (cid:19) x m ≤ C m (cid:18) T (cid:80) t =1 (cid:16) c ( j ) t (cid:17) (cid:19) m/ x m = CT m/ x m , from which the result follows. Lemma B.9.
Under Assumptions 1, 3, 5 and 6(a)-(b), on the set E T ( T λ ) (cid:84) P T,nw (cid:84) L T (cid:84) E ( j ) T,uv ( T / η − T ) with η − T ≤ C √ T , we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ T ˆ v (cid:48) j u ˆ τ j − √ T v (cid:48) j u τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η − T /δ T + C √ T λ − r max ¯ s r + C η − T (cid:112) λ − r max ¯ s r C − C (cid:16) /δ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) . Proof of Lemma B.9 . Start by writing (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ T ˆ v (cid:48) j u ˆ τ j − √ T v (cid:48) j u τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ √ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:16) ˆ v (cid:48) j u − v (cid:48) j u (cid:17) ˆ τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) τ j − τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) v (cid:48) j u √ T (cid:12)(cid:12)(cid:12)(cid:12) =: R (i) + R (ii) . For the first term, we can bound from above using Lemmas B.3 and B.7, where the factor N is notneeded as we consider pointwise bounds here. We then get R (i) ≤ | ˆ v (cid:48) j u − v (cid:48) j u |√ T | τ j | − | ˆ τ j − τ j | ≤ C √ T λ − r max ¯ s r /C − (cid:16) δ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) . For the second term, we can bound from above using the pointwise version of Lemma B.3 and theset E ( j ) T,uv ( T / η − T ) to get R (ii) ≤ η − T /δ T + C λ − r max ¯ s r η − T + C (cid:112) λ − r max ¯ s r η − T C − C (cid:16) /δ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) . Combining both bounds gives R (i) + R (ii) ≤ η − T /δ T + C √ T λ − r max ¯ s r + C η − T (cid:112) λ − r max ¯ s r C − C (cid:16) /δ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) from which the result follows. Lemma B.10.
For any process { d t } Tt =1 and constant x > , define the set E T,d ( x ) := {(cid:107) d (cid:107) ∞ ≤ x } . Let max t E | d t | p ≤ C < ∞ . Then P ( E T,d ( x )) ≤ Cx − p T .Proof. The result follows directly from the Markov inequality P ( (cid:107) d (cid:107) ∞ > x ) ≤ x − p E (cid:104) max t | d t | p (cid:105) ≤ x − p T max t E | d t | p ≤ Cx − p T. emma B.11. Under Assumptions 1, 2, 5 and 6(a)-(b), on the set P ( j,k ) T,uv := P T,las (cid:92) P T,nw (cid:92) L T (cid:92) E ( j ) l,uv ( l / η − T ) (cid:92) E ( k ) l,uv ( l / η − T ) (cid:92) E l ( lλ ) (cid:92) E T,uvw , where E T,uvw is a set, defined within the proof, with probability at least − CT − c/m for some c > ,the following holds (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 ( ˆ w j,t ˆ w k,t − l − w j,t w k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ C (cid:104) T / λ − r max s max ,r (cid:105) + C T m λ − r max s max ,r + C (cid:113) T − mm λ − r max s max ,r . Proof.
We can write (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 ( ˆ w j,t ˆ w k,t − l − w j,t w k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 ( ˆ w j,t − w j,t ) ( ˆ w k,t − l − w k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 ( ˆ w j,t − w j,t ) w k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 w j,t ( ˆ w k,t − l − w k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) =: 1 T − l (cid:2) R (i) + R (ii) + R (iii) (cid:3) . Take R (i) first. Using that ˆ w j,t − q = ˆ u t − q ˆ v j,t − q , straightforward but tedious calculations showthat R (i) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 (ˆ u t − u t ) (ˆ u t − l − u t − l ) (ˆ v j,t − v j,t ) (ˆ v k,t − l − v k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 (ˆ u t − u t ) (ˆ u t − l − u t − l ) (ˆ v j,t − v j,t ) v k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 (ˆ u t − u t ) u t − l (ˆ v j,t − v j,t ) (ˆ v k,t − l − v k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 (ˆ u t − u t ) (ˆ u t − l − u t − l ) v j,t (ˆ v k,t − l − v k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 (ˆ u t − u t ) (ˆ u t − l − u t − l ) v j,t v k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 (ˆ u t − u t ) u t − l v j,t (ˆ v k,t − l − v k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 u t (ˆ u t − l − u t − l ) (ˆ v j,t − v j,t ) (ˆ v k,t − l − v k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 u t (ˆ u t − l − u t − l ) (ˆ v j,t − v j,t ) v k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 u t u t − l (ˆ v j,t − v j,t ) (ˆ v k,t − l − v k,t − l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) =: (cid:88) i =1 R (i) ,i . Using that (cid:107) ˆ v j − v j (cid:107) = (cid:13)(cid:13)(cid:13) X − j (cid:16) ˆ γ − γ j (cid:17)(cid:13)(cid:13)(cid:13) ≤ C (cid:112) T λ − r max ¯ s r and (cid:107) ˆ u j − u j (cid:107) = (cid:13)(cid:13)(cid:13) X (cid:16) ˆ β − β (cid:17)(cid:13)(cid:13)(cid:13) ≤ C √ T λ − r s r , we can use the Cauchy-Schwarz inequality to conclude that R (i) , ≤ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v j − v j (cid:107) (cid:107) ˆ v k − v k (cid:107) ≤ C T λ − r s r λ − r max ¯ s r ≤ C T (cid:2) λ − r max s max ,r (cid:3) . On the set E T,u ( T / m ) ) (cid:84) E T,v j ( T / m ) ) (cid:84) E T,v k ( T / m ), we have that (cid:107) u (cid:107) ∞ , (cid:107) v j (cid:107) ∞ , (cid:107) v k (cid:107) ∞ ≤ T / m . Then we can use this, plus the previous results to find that R (i) , ≤ (cid:107) v k (cid:107) ∞ T (cid:88) t = l +1 | ˆ u t − u t | | ˆ u t − l − u t − l | | ˆ v j,t − v j,t |≤ (cid:107) v k (cid:107) ∞ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v j − v j (cid:107) ≤ CT m T / (cid:2) λ − r max s max ,r (cid:3) / . 
We then find in the same way that R (i) , ≤ (cid:107) u (cid:107) ∞ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v j − v j (cid:107) (cid:107) ˆ v k − v k (cid:107) ≤ CT m T / (cid:2) λ − r max s max ,r (cid:3) / ,R (i) , ≤ (cid:107) ˆ u − u (cid:107) (cid:107) v j (cid:107) ∞ (cid:107) ˆ v k − v k (cid:107) ≤ CT m T / (cid:2) λ − r max s max ,r (cid:3) / . For R (i) , , let ˜ v j,k,l = ( v j,l +1 v k, , . . . , v j,T v k,T − l ) (cid:48) where we know that ˜ v j,k,l has bounded m + c moments. Then, on the set E T, ˜ v j,k,l ( T /m ), we have that R (i) , ≤ (cid:107) ˆ u − u (cid:107) (cid:107) ˜ v j,k,l (cid:107) ∞ ≤ CT m T λ − r max s max ,r . Similarly defining ˜ w j,l = ( u v k,l +1 , . . . , u T − l v j,T ) (cid:48) , ˜ w k, − l = ( u l +1 v k, , . . . , u T v k,T − l ) (cid:48) and ˜ u l =( u u l +1 , . . . , u T − l u T ) (cid:48) , all with m + c bounded moments, we find on the set E T,u ( T / m ) (cid:92) E T, ˜ u ( T /m ) (cid:92) E T, ˜ w j,l ( T /m ) (cid:92) E T, ˜ w k, − l ( T /m )that R (i) , ≤ (cid:107) ˜ w j,l (cid:107) ∞ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v k − v k (cid:107) ≤ CT m T λ − r max s max ,r ,R (i) , ≤ (cid:107) u (cid:107) ∞ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v j − v j (cid:107) (cid:107) ˆ v k − v k (cid:107) ≤ CT m T (cid:2) λ − r max s max ,r (cid:3) / ,R (i) , ≤ (cid:107) ˜ w k, − l (cid:107) ∞ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v j − v j (cid:107) ≤ CT m T λ − r max s max ,r ,R (i) , ≤ (cid:107) ˜ u l (cid:107) (cid:107) ˆ v j − v j (cid:107) (cid:107) ˆ v k − v k (cid:107) ≤ CT m T λ − r max s max ,r . It then follows that T − l R (i) ≤ C T (cid:2) λ − r max s max ,r (cid:3) + C T /m λ − r max s max ,r .For R (ii) we get analogously on the set E T,u ( T / m ) (cid:84) E T,v j ( T / m ) (cid:84) E T,w j ( T /m ) R (ii) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 (ˆ u t − u t ) (ˆ v j,t − v j,t ) w k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 (ˆ u t − u t ) v j,t w k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 u t (ˆ v j,t − v j,t ) w k,t − l (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107) ˆ u − u (cid:107) (cid:107) ˆ v j − v j (cid:107) (cid:107) w k (cid:107) ∞ + (cid:107) ˆ u − u (cid:107) (cid:107) v j (cid:107) ∞ (cid:107) w k (cid:107) ∞ + (cid:107) u (cid:107) ∞ (cid:107) ˆ v j − v j (cid:107) (cid:107) w k (cid:107) ∞ , ≤ C T m T λ − r max s max ,r + C T m T / (cid:113) λ − r max s max ,r + C T m T / (cid:113) λ − r max s max ,r . Finally, R (iii) follows identically to R (ii) . 41ollect all sets in the set E ( j,k ) T,uvw := E T,u ( T / m ) (cid:92) E T,v j ( T / m ) (cid:92) E T,v k ( T / m ) (cid:92) E T, ˜ v j,k,l ( T /m ) (cid:92) E T, ˜ u ( T /m ) (cid:92) E T, ˜ w j,l ( T /m ) (cid:92) E T, ˜ w k, − l ( T /m ) . Now note that by application of Lemma B.10, we can show that all sets, and by extension theirintersection, have a probability of at least 1 − CT − c/m for some c >
0. Take for instance the setswith x = T /m . In that case we can apply Lemma B.10 with p = m + c moments to obtain aprobability of 1 − C (cid:0) T /m (cid:1) − m − c T = 1 − CT − ( m + c ) /m = 1 − CT − c/m . The sets for p = 2( m + c )moments can be treated similarly. Lemma B.12.
Define E T,ww ( x ) := (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:88) t = l +1 w j,t w k,t − l − ξ j,k ( l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ x (cid:41) . Under Assumptions 1 and 5, it holds that P (cid:104) E T,ww (cid:16) T − m s +1) − (cid:17)(cid:105) ≥ − η − T . Proof.
Consider the set (cid:40)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − l T (cid:80) t = l +1 w j,t w k,t − l − ξ j,k ( l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ κ T (cid:41) . We can use the Triplex inequality(Jiang, 2009) to show under which conditions this set holds with probability converging to 1. Let z t = w j,t w k,t − l : P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t = l +1 [ z t − E z t ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > κ T ( T − l ) (cid:33) ≤ q × exp (cid:18) − ( T − l ) κ T q χ T (cid:19) + 6 κ T ( T − l ) T − l (cid:88) t =1 E | E ( z t |F t − q ) − E ( z t ) | + 15 κ T ( T − l ) T − l (cid:88) t =1 E (cid:2) | z t | {| z t | >χ T } (cid:3) =: R (i) + R (ii) + R (iii) . We treat the first term last, as we first need to establish the restrictions put on κ T , q and χ T from R (ii) and R (iii) . For the second term, by Lemma B.2(c) E | E ( z t |F t − q ) − E ( z t ) | ≤ c t ψ q ≤ Cψ q ≤ C q − s , such that R (ii) ≤ Cκ − T q − s . Hence we need that κ − T q − s → T → ∞ .For the third term, we have by H¨older’s and Markov’s inequalities E (cid:2) | z i | {| z i | >χ T } (cid:3) ≤ E (cid:34) | z t | (cid:18) | z t | χ T (cid:19) m/ − {| z i | >χ T } (cid:35) ≤ χ − m/ T E | z t | m/ so R (iii) ≤ Cκ − T χ − m/ T . Hence we know that we need to take χ T and κ T such that χ m/ − T κ T → ∞ T → ∞ .Our goal is to minimize κ T while ensuring all conditions are satisfied. For R (ii) we need that κ T ≥ q − s η − T, , where η T, is a sequence that decreases to 0 arbitrarily slowly. For R (iii) we needthat κ T ≥ η − T, χ − m/ T . Finally, consider R (i) . For R (i) we need that2 q exp (cid:18) − C T κ T q χ T (cid:19) ≤ η T, ⇒ κ T ≥ C qχ T √ T ln q, where we take η T, ≥ Cq − . Hence, we can set κ T = C max (cid:26) qχ T √ T ln q, η − T q − s , η − T χ − m/ T (cid:27) , where we minimize this expression by solving for the ( q, χ T ) pair that sets all three terms equal.This calculation yields that choosing κ T = CT − m s +1) m − is the lowest rate possible. B.2 Proofs of main results
Proof of Theorem 2 . Using eq. (4), we can write √ T R N (cid:16) ˆ b − β (cid:17) = √ T R N (cid:32) ˆ β − β + ˆ Θ X (cid:48) ( y − X ˆ β ) T (cid:33) = R N (cid:32) ˆ Θ X (cid:48) u √ T + ∆ (cid:33) , Furthermore, note that by the definition of ˆ Θ , it follows directly that ˆ Θ X (cid:48) = ˆ Υ − ˆ V (cid:48) , whereˆ V = (ˆ v , . . . , ˆ v N ), such that ˆ Θ X (cid:48) u / √ T = ˆ Υ − ˆ V (cid:48) u / √ T .Regarding R N , we may without loss of generality consider the case with P = 1. In the multi-variate setting, let R ∗ N be a P × N matrix with 1 < P < ∞ , and non-zero columns indexed by theset H of cardinality h = | H | < ∞ . By the Cram´er-Wold theorem, √ T R ∗ N (ˆ b − β ) d → N ( , Ψ ∗ ) ifand only if √ T α (cid:48) R ∗ N (ˆ b − β ) d → N ( , α (cid:48) Ψ ∗ α ) for all α (cid:54) = . We show this directly by letting the1 × N vector R N = α (cid:48) R ∗ N and the scalar ψ = lim N,T →∞ α (cid:48) R ∗ N ( Υ − Ω N,T Υ − ) R ∗(cid:48) N α .The proof will now proceed by showing that R N ∆ p −→ (cid:13)(cid:13)(cid:13) ˆ Θ X (cid:48) u − Υ − V (cid:48) u (cid:13)(cid:13)(cid:13) ∞ / √ T p −→ . By Lemma B.6, it holds that (cid:107) ∆ (cid:107) ∞ ≤ √ T λ − r s r λ max C − η T − C λ − r max ¯ s r =: U ∆ ,T , on the set P T,las (cid:84) P T,nw (cid:84) L T . First note that U ∆ ,T → √ T λ max λ − r s r → λ − r max ¯ s r →
0. Regarding P T,las (cid:84) P T,nw (cid:84) L T , it follows from Lemma A.5 that P ( E T ( T λ/ ≥ − C NT m/ λ m → NT m/ λ m → T → ∞ . Similarly, Lemma B.5 shows43hat P (cid:32) N (cid:84) j =1 (cid:110) E ( j ) T ( T λ j ) (cid:111)(cid:33) ≥ − C N T m/ λ m min →
1. The probabilities of sets CC T ( S λ ), CC T,nw (¯ s λ ),and L T converge to 1 by Assumptions 4, 6(c), and 7 respectively. It then directly follows that | R N ∆ | ≤ (cid:107) R N (cid:107) (cid:107) ∆ (cid:107) ∞ → E V,T := E T ( T λ/ (cid:92) P T,nw (cid:92) L T (cid:92) E ( j ) T,uv ( T / η − T )it holds that1 √ T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ v (cid:48) j u ˆ τ j − v (cid:48) j u τ j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ η − T /δ T + C √ T λ ¯ s r + C η − T (cid:112) λ − r max ¯ s r C − C (cid:16) /δ T + C λ − r max ¯ s r + C (cid:112) λ − r max ¯ s r (cid:17) =: U V,T . By Assumption 7, η − T /δ T ≤ /N →
0, and letting η T such that η − T ≤ T / gives η − T (cid:112) λ − r max ¯ s r ≤ (cid:104) √ T λ − r max ¯ s r (cid:105) / . As √ T λ − r max ¯ s r ≤ √ T λ − r max s r, max → U V,T →
0. The only new set appearing in E V,T is E ( j ) T,uv ( T / η − T ), whose probability converges to 1 byLemma B.8.As R N has only finitely many non-zero elements, the pointwise convergence established aboveallows us to conclude that (cid:13)(cid:13)(cid:13) ˆ Θ X (cid:48) u − Υ − V (cid:48) u (cid:13)(cid:13)(cid:13) ∞ / √ T p −→ X T,t = √ P N,T ψT R N Υ − w t , where P N,T = R N Υ − Ω N,T Υ − R (cid:48) N ψ ; note that bydefinition of ψ , P N,T → N, T → ∞ . Further, left F tT, −∞ = σ { s T,t , s T,t − , . . . } , the positiveconstant array { c T,t } = √ P N,T ψT , and r = m + c . We show that the requirements of this Theoremare satisfied.Part (a), F tT, −∞ -measurability of X T,t , follows from the measurability of z t in Assumption 1(b), E [ X T,t ] = √ P N,T ψT R N Υ − E [ w t ] = 0 follows from the rewriting w j,t = (cid:16) x j,t − x (cid:48)− j,t γ j (cid:17) u t andnoting that E [ x j,t u t ] = 0 , ∀ j by Assumption 1(a), and E (cid:32) T (cid:88) t =1 X T,t (cid:33) = 1 P N,T ψ R N Υ − E (cid:34) T (cid:32) T (cid:88) t =1 w t (cid:33) (cid:32) T (cid:88) t =1 w (cid:48) t (cid:33)(cid:35) Υ − R (cid:48) N = 1 P N,T ψ R N Υ − Ω N,T Υ − R (cid:48) N = 1 . T,t (cid:110)(cid:0) E | R N Υ − w t | m + c (cid:1) / ( m + c ) (cid:111) = sup T,t E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) j ∈ H R N,j τ j w j,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) m + c / ( m + c ) ≤ (1) (cid:88) j ∈ H | R N,j | τ j sup T,t (cid:110)(cid:0) E | w j,t | m + c (cid:1) / ( m + c ) (cid:111) ≤ (2) C, where (1) is due to Minkowski’s inequality, and (2) follows from h < τ j ≤ C by eq. (B.3), and w j,t is L m + c -bounded by Lemma B.2(b).For part (c’), by the arguments in the proof of Lemma B.2, w j,t is L m -NED of size -1 on s T,t ,which is α -mixing of size − m ( m + c ) /c < − ( m + c ) / ( m + c − M T = max t { c T,t } = √ P N,T ψT , such that sup T T M T = sup T R N Υ − Ω N,T Υ − R (cid:48) N ≤ C , where the inequality follows from τ j ≥ C by eq. (B.3), and R N Υ − Ω N,T Υ − R (cid:48) N is boundedfrom below by the minimum eigenvalue of Ω N,T (assumed to be bounded away from 0), via theMin-max theorem.Finally, Theorem 2 states that this convergence is uniform in β ∈ B ( s r ). This follows bynoting that eq. (B.4) holds uniformly in β ∈ B ( s r ). Proof of Theorem 3 . As in the proof of Theorem 2, without loss of generality, take R N to be a1 × N vector with non-zero elements indexed by the set H of cardinality h = | H | < ∞ . We canwrite (cid:12)(cid:12)(cid:12) R N ˆ Υ − ˆ Ω ˆ Υ − R (cid:48) N − Ψ (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − ˆ Ω ˆ Υ − − Υ − ˆ ΩΥ − (cid:105) R (cid:48) N (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) R N Υ − ˆ ΩΥ − R (cid:48) N − Ψ (cid:12)(cid:12)(cid:12) =: R (a) + R (b) . 
For R ( a ) we get that R (a) ≤ (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − − Υ − (cid:105) ˆ Ω (cid:104) ˆ Υ − − Υ − (cid:105) R (cid:48) N (cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − − Υ − (cid:105) ˆ ΩΥ − R (cid:48) N (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − − Υ − (cid:105) (cid:104) ˆ Ω − Ω N,Q T (cid:105) (cid:104) ˆ Υ − − Υ − (cid:105) R (cid:48) N (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − − Υ − (cid:105) Ω N,Q T (cid:104) ˆ Υ − − Υ − (cid:105) R (cid:48) N (cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − − Υ − (cid:105) (cid:104) ˆ Ω − Ω N,Q T (cid:105) Υ − R (cid:48) N (cid:12)(cid:12)(cid:12) + 2 (cid:12)(cid:12)(cid:12) R N (cid:104) ˆ Υ − − Υ − (cid:105) Ω N,Q T Υ − R (cid:48) N (cid:12)(cid:12)(cid:12) , where Ω N,Q T := E (cid:34) T (cid:32) Q T (cid:88) t =1 w t (cid:33) (cid:32) Q T (cid:88) t =1 w (cid:48) t (cid:33)(cid:35) = Q T − (cid:88) l =1 − Q T (cid:18) − | l | Q T (cid:19) Ξ ( l ) . As R N only contains finitely many arguments, it suffices to consider | ˆ τ j − τ j | for any j = 1 , . . . , N and (cid:12)(cid:12)(cid:12) ˆ ω j,k − ω N,Q T j,k (cid:12)(cid:12)(cid:12) for all j, k = 1 , . . . , N , where ω N,Q T j,k is the ( j, k )-th element of Ω N,Q T . The firstresult follows directly from Lemma B.3. 45e now show the second result. First note that (cid:12)(cid:12)(cid:12) ˆ ω j,k − ω N,Q T j,k (cid:12)(cid:12)(cid:12) ≤ Q T − (cid:88) l =1 − Q T (cid:18) − | l | Q T (cid:19) (cid:12)(cid:12)(cid:12) ˆ ξ j,k ( l ) − ξ j,k ( l ) (cid:12)(cid:12)(cid:12) ≤ Q T − (cid:88) l =1 − Q T (cid:18) − | l | Q T (cid:19) (cid:12)(cid:12)(cid:12) ˆ ξ j,k ( l ) − ˜ ξ j,k ( l ) (cid:12)(cid:12)(cid:12) + Q T − (cid:88) l =1 − Q T (cid:18) − | l | Q T (cid:19) (cid:12)(cid:12)(cid:12) ˜ ξ j,k ( l ) − ξ j,k ( l ) (cid:12)(cid:12)(cid:12) where we define ˜ ξ j,k ( l ) := T − l T (cid:80) t = l +1 w j,t w k,t − l . It follows from Lemmas B.11 and B.12 that (cid:12)(cid:12)(cid:12) ˆ ξ j,k ( l ) − ˜ ξ j,k ( l ) (cid:12)(cid:12)(cid:12) ≤ C (cid:104) T / λ − r max s max ,r (cid:105) + C T m λ − r max s max ,r + C (cid:113) T − mm λ − r max s max ,r (cid:12)(cid:12)(cid:12) ˜ ξ j,k ( l ) − ξ j,k ( l ) (cid:12)(cid:12)(cid:12) ≤ C T − m s +1) m − . on the set P ( j,k ) T,uv (cid:84) E T,ww (cid:16) T − m s +1) m − (cid:17) , and it also follows directly from these lemmas that this setholds with probability converging to 1. The set P ( j,k ) T,uv := P T,las (cid:92) P T,nw (cid:92) L T (cid:92) E ( j ) l,uv ( l / η − T ) (cid:92) E ( k ) l,uv ( l / η − T ) (cid:92) E l ( lλ ) (cid:92) E T,uvw holds with probability converging to 1. This can be shown by the arguments in the proof ofTheorem 2 for P T,las (cid:84) P T,nw (cid:84) L T , by Lemma B.8 for E ( j ) l,uv ( l / η − T ), by Lemma A.5 for E l ( lλ ), and E T,uvw follows from Lemma B.11. Similarly E T,ww (cid:16) T − m s +1) m − (cid:17) holds with probability convergingto 1 by Lemma B.12. Plugging the upper bounds in, we find that (cid:12)(cid:12)(cid:12) ˆ ω j,k − ω N,Q T j,k (cid:12)(cid:12)(cid:12) ≤ (2 Q T + 1) (cid:20) C (cid:104) T / λ − r max s max ,r (cid:105) + C T m λ − r max s max ,r + C (cid:113) T − mm λ − r max s max ,r + C T − m s +1) m − (cid:21) . Hence, (cid:12)(cid:12)(cid:12) ˆ ω j,k − ω N,Q T j,k (cid:12)(cid:12)(cid:12) p −→ Q T ≤ Cη T min (cid:26) T − λ r − s − ,r , T − m λ r − s − ,r , T m − m λ r − max s − / ,r , T m − s +1) m − (cid:27) . This concludes the part of R (a) . 
With the results above, it remains to be shown for R (b) that (cid:12)(cid:12) R N Υ − ( Ω N,Q T − Ω N,T ) Υ − R (cid:48) N (cid:12)(cid:12) →
0. Given the characteristics of R N , it suffices to show that (cid:12)(cid:12)(cid:12) ω N,Q T j,k − ω j,k (cid:12)(cid:12)(cid:12) →
0. Note that (cid:12)(cid:12)(cid:12) ω N,Q T j,k − ω j,k (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) l = Q T [ ξ j,k ( l ) + ξ k,j ( l )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q T − (cid:88) l =1 − Q T lQ T ξ j,k ( l ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ T (cid:88) l = Q T | ξ j,k ( l ) | + Q T − (cid:88) l =1 − Q T lQ T | ξ j,k ( l ) | . The first part converges to 0 as (cid:80) Tl =0 | ξ j,k ( l ) | ≤ C by Lemma B.2(a) and Q T → ∞ . For the second46art we have, for an arbitrary δ >
0, that Q T − (cid:88) l =1 − Q T lQ T | ξ j,k ( l ) | ≤ Q T − (cid:88) l =1 − Q T l − δ Q T l δ | ξ j,k ( l ) | ≤ Q δT Q T − (cid:88) l =1 − Q T l δ | ξ j,k ( l ) | ≤ Q − δT C, where the summability of ξ j,j ( l ) follows from the NED property of w j,t by Theorem 17.7 of Davidson(2002). In particular, it follows from eq. (17.26) therein that | ξ j,k ( l ) | is smaller in order of magnitudethan Cψ l = O ( l − − (cid:15) ) for some (cid:15) >
0, and therefore summable. It is then clear that for any (cid:15) > δ > l δ | ξ j,k ( l ) | ≤ O ( l − − (cid:15) + δ ), which is also summable.This shows that R (b) p −→
0. Finally, this result holding uniformly in β ∈ B ( s r ) follows the samelogic as the proof of Theorem 2, namely that eq. (B.4) holds uniformly in β ∈ B ( s r ). Proof of Corollary 2 . The result follows by applying Theorems 2 and 3, so the assumed ratesfrom both must be satisfied:
N λ − m T − m/ → ,N λ − m min T − m/ → , √ T λ − r max s r, max → ,Q T T λ − r )max s r, max → Q T T /m λ − r max s r, max → Q T T − m m λ (2 − r ) / s / r, max → Q T T − m s +1) m − → ⇒ a + (cid:96)m − m/ < a + (cid:96) ¯ m − m/ < / − L (2 − r ) + B < δ Q + 1 − L − r ) + 2 B < δ Q + 1 /m − L (2 − r ) + B < δ Q + − m m − L (2 − r ) + B/ < δ Q + − m s +1) m − < ⇒ δ Q +1+2 B − r ) < L ≤ (cid:96) ¯ < − am − r − δ Q − B > m > a (2 − r )1 − r − δ Q − B < δ Q < m − m − . By implication of Theorem 2 √ T R N (ˆ b − β ) d → N (0 , ψ ) , uniformly in β ∈ B ( s r ). Then, by Theorem 3 R N ( ˆ Υ − ˆ Ω ˆ Υ − ) R (cid:48) N p → ψ , also uniformly in β ∈ B ( s r ). By Slutsky’s Theorem, it is then the case that √ T R N (ˆ b − β ) d → N (0 , ψ ) , uniformly in β ∈ B ( s r ). As a consequence,sup β ∈ B ( s r ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) P √ T R N (ˆ b − β ) (cid:113) R N ( ˆ Υ − ˆ Ω ˆ Υ − ) R (cid:48) N ≤ z − Φ ( z ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = o p (1) , ∀ z ∈ R . ppendix C Supplementary Results Lemma C.1.
Assume x t admits the following VMA( ∞ ) decomposition x t = ∞ (cid:88) s =0 Φ s (cid:15) t − s , where Φ s = diag ( φ ,s , . . . , φ N,s ) and (cid:15) t is a Martingale difference sequence with respect to F (cid:15),t = σ { (cid:15) t , (cid:15) t − , . . . } . Furthermore, assume(a) E [ (cid:15) t (cid:15) (cid:48) t |F (cid:15),t − ] = Σ t with [ Σ t ] i,i = σ i,t and [ Σ t ] i,j = ρ t .(b) E [ | (cid:15) j,t | ν ] ≤ C, ∀ j, t , and some ν > .(c) max s ≤ r E [ | (cid:15) i,t − s (cid:15) j,t − r − E ( (cid:15) i,t − s (cid:15) j,t − r ) | ] = c i,j ( t ) ≤ C ∀ i, j, t, (d) ∞ (cid:80) s = q | φ j,s | ≤ ψ j,q = O ( q − π ) ∀ j, q ∈ N , and some π > .Take the following asymptotic growth rates N ∼ T a , a ≥ , and s λ φ Σ ( S λ ) = O (cid:0) T b (cid:1) = O (cid:0) N b/a (cid:1) , η T (cid:17) ≤ N (cid:88) i =1 N (cid:88) j =1 P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t =1 ( x it x j,t − E [ x it x j,t ]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > T η T (cid:33) . Now apply the Triplex inequality (Jiang, 2009) P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:88) t =1 ( x it x j,t − E [ x it x j,t ]) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > T η T (cid:33) ≤ q × exp (cid:18) − T η T × q κ T (cid:19) + 6 T η
T T (cid:88) t =1 E [ | E ( x it x j,t |F (cid:15),t − q ) − E ( x it x j,t ) | ] + 15 T η
T T (cid:88) t =1 E (cid:104) | x it x j,t | {| x it x j,t | >κ T } (cid:105) = R (i) + R (ii) + R (iii) . Let η T = φ Σ ( S λ ) s λ = O ( T − b ), q ∼ T δ q , δ q > κ T ∼ T δ K , δ K >
0. If we can show that all threeterms go to zero as T → ∞ , then the proof is complete. N (cid:88) i =1 N (cid:88) j =1 R (i) = 2 N q × exp (cid:18) − T η T × q κ T (cid:19) . Due to the exponent, this term converges when
T η T × q κ T → ∞ . Plugging in the chosen growth48ates: T η T × q κ T = C × T (1 − b ) × q κ T = O (cid:16) T (1 − b − δ q − δ K ) (cid:17) , and we need 1 − b − δ q − δ K > . By Lemma 12(1) of Medeiros and Mendes (2016), R (ii) ≤ T η T T (cid:80) t =1 c i,j ( t ) φ i,q φ j,q , so N (cid:88) i =1 N (cid:88) j =1 R (ii) ≤ N (cid:88) i =1 N (cid:88) j =1 (cid:32) T η
T T (cid:88) t =1 Cψ q (cid:33) = C N η T ψ q = O (cid:16) T (2 a + b − πδ q ) (cid:17) , and we need 2 a + b − πδ q < E [ | x i,t | ν ] = E (cid:34)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ (cid:88) s =0 φ i,s (cid:15) i,t − s (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ν (cid:35) ≤ max s { E [ | (cid:15) i,t − s | ν ] } (cid:32) ∞ (cid:88) s =0 | φ i,s | (cid:33) ν ≤ Cψ νi, < ∞ , and by Cauchy-Schwarz E (cid:2) | x i,t x j,t | ν/ (cid:3) < ∞ . By Lemma 10 of Medeiros and Mendes (2016), R (iii) ≤ Cη T κ ν/ − T , so N (cid:88) i =1 N (cid:88) j =1 R (iii) ≤ CN η T κ ν/ − T = O (cid:16) T a + b − δ K ( ν/ − (cid:17) , and we need 2 a + b − δ k ( ν/ − < − b − δ q − δ K > a + b − πδ q < a + b − δ K ( ν/ − < ⇒ a + b < π (1 − b − δ K )2 a + b − δ K ( ν/ − < ⇒ ν/ − π < / − b a + b .b .