Confidence Intervals for Parameters in High-dimensional Sparse Vector Autoregression

Ke Zhu and Hanzhong Liu

Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China

September 22, 2020
Abstract
Vector autoregression (VAR) models are widely used to analyze the interrelationship between multiple variables over time. Estimation and inference for the transition matrices of VAR models are crucial for practitioners to make decisions in fields such as economics and finance. However, when the number of variables is larger than the sample size, it remains a challenge to perform statistical inference for the model parameters. In this article, we propose the de-biased Lasso and two bootstrap de-biased Lasso methods to construct confidence intervals for the elements of the transition matrices of high-dimensional VAR models. We show that the proposed methods are asymptotically valid under appropriate sparsity and other regularity conditions. To implement our methods, we develop feasible and parallelizable algorithms, which save a large amount of the computation required by the nodewise Lasso and the bootstrap. A simulation study illustrates that our methods perform well in finite samples. Finally, we apply our methods to analyze the price data of stocks in the S&P 500 index in 2019. We find that some stocks, such as the largest producer of gold in the world, Newmont Corporation, have significant predictive power over most stocks.
Keywords:
Bootstrap, De-biased Lasso, De-sparsified Lasso, Granger Causality, High-dimensional time series

1 Introduction
Vector autoregression (VAR) models have been widely used in econometric, business statis-tics, and other fields (Sims, 1980; Fuller, 1996; L¨utkepohl, 2007). These models can capturethe dynamic relationship between factors through transition matrices. In practice, the num-ber of factors can be large, even larger than the sample size. For instance, the number ofstocks in the market is usually larger than the number of observations. This fact leadsresearchers to consider high-dimensional VAR models.High-dimensional VAR models have been thoroughly studied in the past decades. Forinstance, Guo et al. (2016) imposed banded structure on the transition matrices and es-tablished the convergence rates of the least squares estimators. A series of works in theliterature has proposed the sparsity constraint on the transition matrices and used theLasso (Tibshirani, 1996) to estimate the parameters (Hsu et al. , 2008; Song and Bickel,2011; Negahban and Wainwright, 2011; Loh and Wainwright, 2012; Chen et al. , 2013; Han et al. , 2015; Kock and Callot, 2015; Basu and Michailidis, 2015; Davis et al. , 2016). Underregularity conditions, Basu and Michailidis (2015) established the deviation bounds of theLasso estimators, which are essential for studying the theoretical properties of Lasso-basedestimators in high-dimensional sparse VAR models.In addition to parameter estimation, confidence intervals and hypothesis testing providepractitioners, such as, policy makers and business owners, with more valuable and solidinformation for assessing the significance of the correlations between factors. For instance,Wilms et al. (2016) used Granger causality test (Granger, 1969) based on the bootstrapadaptive Lasso to detect the most predictive industry-specific economic sentiment indicatorsfor macro-economic indicators. Lin and Michailidis (2017) proposed testing proceduresin multi-block VAR models to test whether a block “Granger-causes” another block ofvariables and applied them to analyze the temporal dynamics of the S&P100 componentstocks and key macroeconomic factors. In the context of VAR models, the Granger causalitytest is equivalent to test if the element of transition matrices is equal to zero (L¨utkepohl,2007). There is an increasing demand for constructing confidence intervals or performinghypothesis testing for the element of transition matrices in high-dimensional VAR models.Since the limiting distribution of the Lasso is complicated (Knight and Fu, 2000), we cannot2irectly use it for statistical inference. In high-dimensional sparse linear regression models,a series of studies has proposed methods based on the Lasso or its variants for statisticalinference. One direction of research has proposed the de-biased Lasso method, whichfocuses on obtaining an asymptotically normal estimator by correcting the bias of the Lassoestimator (Zhang and Zhang, 2014; van de Geer et al. , 2014; Javanmard and Montanari,2014). We will refer the method in Zhang and Zhang (2014) and van de Geer et al. (2014) asLDPE (low dimensional projection estimator) and the method in Javanmard and Montanari(2014) as JM. These estimators are also called the de-sparsified Lasso since they are nolonger sparse after correcting for the bias. 
Another direction of research uses the bootstrap(Efron, 1979) for inference, including the bootstrap threshold Lasso (Chatterjee and Lahiri,2011), bootstrap adaptive Lasso (Chatterjee and Lahiri, 2013), bootstrap Lasso + ordinaryleast squares (OLS) (Liu and Yu, 2013) and bootstrap de-biased Lasso (Dezeure et al. ,2017). Ning and Liu (2017) generalized the de-biased Lasso method from linear regressionto penalized M-estimators using the so called de-correlated score function. Neykov et al. (2018) generalized this method to estimating equation problems that are likelihood-freeusing the projection approach. Based on the framework of Ning and Liu (2017), Zhengand Raskutti (2019) studied hypothesis testing for sub-Gaussian VAR models using thede-correlated method. Basu et al. (2019) proposed using the de-biased Lasso estimator toperform inference for high-dimensional sparse VAR models but did not provide rigoroustheoretical guarantees. Since the predictors in VAR models are random and exhibit complexcorrelation structure, it is uncertain whether the de-biased Lasso method is valid and it ischallenging to provide theoretical guarantees. Krampe et al. (2019) proposed a model-basedbootstrap de-biased Lasso estimator to perform inference, but its computational burdenis heavy when the dimension p is large. Moreover, in order to guarantee the requiredorder of sparsity in the bootstrap sample, the study used the thresholded Lasso which iscomplicated than the Lasso although it is easier for theoretical study.To fill in the theoretical gap of the de-biased Lasso estimator and to address the compu-tation problem of the model-based bootstrap de-biased Lasso estimator, in this study, weformally establish the asymptotic normality of the de-biased Lasso estimator for inferringthe elements of the transition matrix of high-dimensional sparse VAR(1) models. Then, we3ropose two computational feasible bootstrap methods, residual bootstrap de-biased Lasso(BtLDPE) and multiplier wild bootstrap de-biased Lasso (MultiBtLDPE). Our methodscan be generalized to VAR( k ) models through transformation. Our contributions are sum-marized as follows.First, we derive the asymptotic properties of the de-biased Lasso, residual bootstrap de-biased Lasso and multiplier wild bootstrap de-biased Lasso estimators in high-dimensionalsparse VAR models by using the deviation bounds of the Lasso estimator (Basu and Michai-lidis, 2015) and the martingale central limit theorem (Theorem 5.3.4 in Fuller (1996)). Wedemonstrate the validity and robustness of these methods for statistical inference in a con-text broader than linear regression models. As a by-product, we show that the sparse Rieszcondition holds for high-dimensional sparse VAR models and provide an upper bound onthe sparsity of the Lasso estimator which is essential for obtaining appropriate varianceestimators and studying the theoretical properties of bootstrap de-biased Lasso.Second, to implement our methods, we provide algorithms which are feasible and easy toparallel. Specifically, for each of the p equations of the VAR models, we perform statisticalinference separately. Since the p equations share the same design matrix, we need onlyrun nodewise lasso once, which is the main computational burden of de-biased Lasso andbootstrap de-biased Lasso. 
Compared to the model-based bootstrap de-biased Lasso, ourmethods have significant computational advantages especially when p is large.Third, we conduct comprehensive simulation studies to compare our methods with thebootstrap Lasso, bootstrap Lasso+OLS and another de-biased Lasso method proposed byJavanmard and Montanari (2014). We find that the de-biased Lasso method, the LDPE,can always yield honest coverage probabilities, and when the sample size is large, twobootstrap methods, the BtLDPE and MultiBtLDPE, can yield the honest coverage prob-abilities with shorter interval lengths. We also apply our methods to analyze the S&P 500constituent stocks data set in 2019. We find that the prices of some stocks have significantpredictive power over the prices of most stocks, for example, Newmont Corporation, whichis the largest producer of gold in the world and the only gold producer listed in the S&P500 Index, has the ability to affect many other stock prices in advance. Notation . For a vector a = ( a , ..., a n ), we denote (cid:107) a (cid:107) = (cid:80) ni =1 | a i | , (cid:107) a (cid:107) = (cid:80) ni =1 a i (cid:107) a (cid:107) ∞ = max i | a i | . We use e j to denote the vector whose j th element is one, zerootherwise. For a square matrix A , let Λ min ( A ) and Λ max ( A ) be the smallest and largesteigenvalues of A respectively. Let | A | denote the determinant of A . For a design matrix X ,let X tj denote the ( t, j )th element of X , X j denote the j th column of X and X − j denotethe X without the j th column. Let || X || ∞ denote the maximum absolute value of theelements of X . For a set S , let | S | denote the number of elements of S and X S denote theselected columns of X in S . We use i.i.d as the abbreviation of independent and identicallydistributed. Let z α denotes the lower α quantile of the standard normal distribution. Fortwo sequences of positive numbers { a n } and { b n } , we denote a n = o ( b n ) if a n /b n → n → ∞ and a n = O ( b n ) if lim sup | a n /b n | < ∞ . We denote a n (cid:16) b n if there exist positiveconstants C , C > | a n /b n | ≤ C and lim inf | a n /b n | ≥ C . We denote Z n = o p (1) if Z n convergent to zero in probability and Z n = O p (1) if Z n is bounded inprobability.The rest of the paper is organized as follows. In Section 2, we introduce the high-dimensional sparse VAR models and the de-biased Lasso, residual bootstrap de-biasedLasso and multiplier wild bootstrap de-biased Lasso methods. In Section 3, we providetheoretical results, including the upper bound of the number of selected variables by theLasso estimator and the asymptotic normality of the proposed estimators. In Section 4,simulation studies are provided for investigating the finite-sample performance of differentmethods. In Section 5, we illustrate our method using the S&P 500 constituent stocks dataset. We summarize the results and discuss possible extensions in the last section. Proofdetails are given in the supplementary material. In this section, we first introduce the high-dimensional sparse VAR models. Consider a p -dimensional VAR(1) model, y t = A y t − + e t , t = 1 , ..., n, (1)where y t = ( y t , ..., y pt ) T is a p -dimensional random vector, A = ( a ij ) p × p is a p × p transitionmatrix and e t = ( e t , ..., e pt ) T is a p -dimensional Gaussian white noise, namely, e t is inde-5endently and identically distributed in multivariate Gaussian distribution N p ( , Σ e ). Weassume that the VAR process { y t } is stable, namely, Λ max ( A ) <
1. We have n observationsof y t and n may be smaller than p .There are p equations, corresponding to p components of y t , in the VAR(1) model. Weconsider each of the p equations separately, y it = a T i y t − + e it , i = 1 , ..., p, t = 1 , ..., n, where a i = ( a i , ..., a ip ) is the i th row of A and we denote the support set of a i by S i = { j ∈ { , ..., p } : a ij (cid:54) = 0 } and let s i = | S i | . To gather all observations, we denote Y i = X a i + ε i , i = 1 , ..., p, (2)where Y i = ( y i , ..., y in ) T , X = ( y , ..., y n − ) T and ε i = ( e i , ..., e in ) T . The denotation is todistinguish them from y t and e t . Although equation (2) violates the basic assumptions oflinear regression models, it has the same form. Thus, we can apply the de-biased Lassoand bootstrap de-biased Lasso original proposed for linear regression models to constructconfidence intervals for a i ’s, with caution that these methods may not be valid. The Lasso is widely used for simultaneous parameter estimation and model selection inhigh-dimensional linear regression models (Tibshirani, 1996), which adds an l penalty tothe loss function to obtain sparse estimates. Hsu et al. (2008) proposed to use the Lassoestimator in VAR models,ˆ a Lasso i := argmin α ∈ R p {|| Y i − X α || /n + 2 λ || α || } , i = 1 , ..., p, (3)where λ is the tuning parameter which controls the amount of regularization. In practice, λ is often chosen by cross-validation. We denote the set of selected variables by ˆ S i := { j ∈{ , ..., p } : ˆ a Lasso ij (cid:54) = 0 } and let ˆ s i := | ˆ S i | . The Lasso estimator is hard to use directly forstatistical inference due to its bias. Zhang and Zhang (2014) and van de Geer et al. (2014)proposed the de-biased Lasso method for construct confidence intervals, which proceeds asfollows. 6he Lasso estimator in (3) satisfies the Karush-Kuhn-Tucker (KKT) conditions, − X T ( Y i − X ˆ a Lasso i ) /n + λ ˆ κ = 0 , (4)where ˆ κ is the sub-gradient of (cid:96) norm and satisfies (cid:107) ˆ κ (cid:107) ∞ ≤ κ j = sign(ˆ a Lasso ij ) ifˆ a Lasso ij (cid:54) = 0. By Y i = X a i + ε i and ˆΣ := X T X /n , we obtainˆΣ(ˆ a Lasso i − a i ) + λ ˆ κ = X T ε i /n. (5)With Σ := E ( ˆΣ) = E ( y t − y T t − ), if we have a proper approximation for the precision matrixΘ := Σ − , say ˆΘ, then multiplying both hand sides of (5) by ˆΘ, we obtainˆ a Lasso i − a i + ˆΘ λ ˆ κ = ˆΘ X T ε i /n + ( I − ˆΘ ˆΣ)(ˆ a Lasso i − a i ) . (6)The de-biased Lasso estimator isˆ a i := ˆ a Lasso i + ˆΘ λ ˆ κ = ˆ a Lasso i + ˆΘ X T ( Y i − X ˆ a Lasso i ) /n, where the second equality is due to (4). Intuitively, as long as we can prove that the firstterm of the right-hand of (6) is asymptotically normal with mean zero and the second termis asymptotically negligible, then with a consistent estimator of the variance of ε i , we canperform inference for a i .Following Zhang and Zhang (2014) and van de Geer et al. (2014), we get ˆΘ by nodewiseLasso (Meinshausen and B¨uhlmann, 2006). Specifically, for j = 1 , ..., p , we run a Lassoregression of X j versus X − j ,ˆ γ j := argmin γ ∈ R p − {|| X j − X − j γ || /n + 2 λ j || γ || } , (7)where λ j is the tuning parameter. Note that ˆ γ j with the components of { ˆ γ jk ; k = 1 , ..., p, k (cid:54) = j } is an estimator of γ j := Σ − − j, − j Σ − j,j , where Σ − j, − j is the covariance matrix of X − j andΣ − j,j is the covariance matrix of X − j and X j . We denote the sparsity of γ j by q j := | { k (cid:54) = j : γ jk (cid:54) = 0 } | . 
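As a concrete illustration of the setup in (1)-(3), the following minimal R sketch (not the authors' code; glmnet is assumed, and the dimensions, sparsity level, and coefficient range are illustrative choices rather than the paper's simulation settings) simulates a stable sparse VAR(1) and fits the row-wise Lasso of equation (3), with cv.glmnet's default 10-fold cross-validation mirroring the tuning-parameter choice used in the paper.

```r
# Minimal illustrative sketch (not the authors' code): simulate a stable sparse
# VAR(1) as in (1) and fit the row-wise Lasso of equation (3) with glmnet.
library(glmnet)
set.seed(1)

p <- 50; n <- 100; s <- 3

# Sparse transition matrix A, rescaled so its spectral radius is 0.9 < 1 (stability).
A <- matrix(0, p, p)
for (i in 1:p) {
  A[i, sample(p, s)] <- runif(s, 0.3, 0.6) * sample(c(-1, 1), s, replace = TRUE)
}
A <- 0.9 * A / max(Mod(eigen(A)$values))

# Generate y_0, ..., y_n from y_t = A y_{t-1} + e_t with Gaussian white noise.
y <- matrix(0, n + 1, p)
for (t in 2:(n + 1)) y[t, ] <- A %*% y[t - 1, ] + rnorm(p)

X  <- y[1:n, ]            # rows y_0, ..., y_{n-1}: the design matrix in (2)
Y1 <- y[2:(n + 1), 1]     # first equation: Y_1 = X a_1 + eps_1

cv1      <- cv.glmnet(X, Y1, intercept = FALSE)
a1_lasso <- as.numeric(coef(cv1, s = "lambda.min"))[-1]  # drop the intercept slot
```

The remaining rows of A are estimated in the same way by replacing Y1 with the other columns of y[2:(n + 1), ].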
We further denote
$$\hat{C} := \begin{pmatrix} 1 & -\hat{\gamma}_{1,2} & \cdots & -\hat{\gamma}_{1,p} \\ -\hat{\gamma}_{2,1} & 1 & \cdots & -\hat{\gamma}_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat{\gamma}_{p,1} & -\hat{\gamma}_{p,2} & \cdots & 1 \end{pmatrix}, \qquad \hat{\tau}_j^2 := \|X_j - X_{-j}\hat{\gamma}_j\|_2^2/n + \lambda_j\|\hat{\gamma}_j\|_1, \qquad \hat{T}^2 := \mathrm{diag}(\hat{\tau}_1^2, \ldots, \hat{\tau}_p^2),$$
and set
$$\hat{\Theta} := \hat{T}^{-2}\hat{C}. \qquad (8)$$
Javanmard and Montanari (2014) use a different approach to obtain Θ̂. We do not adopt their approach because of its inferior finite-sample performance, as shown in the simulation section.

Following the discussion in Reid et al. (2016), we estimate the variance of ε_it, say σ_i², by the residual sum of squares of the Lasso fit divided by its degrees of freedom,
σ̂_i² := ‖ε̂_i‖₂² / (n − ŝ_i),
where ε̂_i = (ε̂_{i1}, ..., ε̂_{in})^T := Y_i − X â_i^Lasso.

The nodewise Lasso requires computing the Lasso solution path p times, which is the main computational burden of the de-biased Lasso. Fortunately, for the different equations i = 1, ..., p, we need to compute Θ̂ only once, since the p equations share the same design matrix X. The whole procedure is summarized in Algorithm 1, where every for loop can be run in parallel.

In this section, we introduce the residual bootstrap de-biased Lasso and the wild multiplier bootstrap de-biased Lasso proposed in Dezeure et al. (2017). The only difference between the two bootstrap methods is how the bootstrap residuals are generated. For the residual bootstrap, we resample with replacement from the centered residuals {ε̂_{it} − ε̂_{i·}, t = 1, ..., n}, where ε̂_{i·} := Σ_{t=1}^n ε̂_{it}/n, and obtain the bootstrap residuals ε*_i = (ε*_{i1}, ..., ε*_{in})^T. For the wild multiplier bootstrap, we generate i.i.d. multipliers W_{i1}, ..., W_{in} with E(W_{it}) = 0, E(W_{it}²) = 1 and E(W_{it}⁴) < ∞, for example W_{it} ~ N(0, 1), set ε*_{it} = W_{it}(ε̂_{it} − ε̂_{i·}), and obtain the bootstrap residuals ε*_i = (ε*_{i1}, ..., ε*_{in})^T.

Next, the bootstrap sample is generated as Y*_i = X â_i + ε*_i.

Algorithm 1: Construction of confidence intervals by the de-biased Lasso
Input: Data {y_t}, t = 0, ..., n; confidence level 1 − α.
Output: Confidence intervals for the elements of the transition matrix A in the VAR model.

for j = 1, ..., p do
    Compute the nodewise Lasso estimator γ̂_j and the residuals Ẑ_j = X_j − X_{−j} γ̂_j;
end
Compute Θ̂ in (8);
for i = 1, ..., p do
    Compute the Lasso estimator â_i^Lasso given the data (Y_i, X);
    Compute the de-biased Lasso estimator â_i = â_i^Lasso + Θ̂ X^T (Y_i − X â_i^Lasso)/n, the residuals ε̂_i = Y_i − X â_i^Lasso, and the variance estimator σ̂_i² = ‖ε̂_i‖₂²/(n − ŝ_i);
    for j = 1, ..., p do
        Compute l_ij = â_ij − z_{1−α/2} σ̂_i ‖Ẑ_j‖₂ / |Ẑ_j^T X_j| and u_ij = â_ij − z_{α/2} σ̂_i ‖Ẑ_j‖₂ / |Ẑ_j^T X_j|;
    end
end
Return the 1 − α confidence interval [l_ij, u_ij] for a_ij, i, j = 1, ..., p.

Then we can replace the original sample (Y_i, X) by the bootstrap sample (Y*_i, X) to compute the corresponding quantities for the de-biased Lasso. For example, the bootstrap version of the de-biased Lasso estimator is obtained by
â*_i := â_i^{Lasso*} + Θ̂ X^T (Y*_i − X â_i^{Lasso*})/n,
where â_i^{Lasso*} is the bootstrap version of the Lasso estimator, defined by
â_i^{Lasso*} := argmin_{α ∈ R^p} { ‖Y*_i − Xα‖₂²/n + 2λ* ‖α‖₁ },
where λ* is the tuning parameter. Note that one can use different tuning parameters in the original and bootstrap Lasso estimators. Our simulation results indicate that using the same tuning parameter often performs well. We also denote the set of selected variables by Ŝ*_i := {j ∈ {1, ..., p} : â_ij^{Lasso*} ≠ 0} and let ŝ*_i := |Ŝ*_i|. For both bootstrap procedures above, we have E*(ε*_{it}) = 0 and σ*_i² := E*(ε*_{it}²) = Σ_{t=1}^n (ε̂_{it} − ε̂_{i·})²/n, where E* indicates

Algorithm 2: Construction of confidence intervals by bootstrap de-biased Lasso
Input
Data { y t } , t = 0 , ..., n ; Confidence level 1 − α ; Bootstrap type (residualbootstrap or wild multiplier bootstrap); Number of bootstrap replications B . Output
Confidence intervals for elements of transition matrix A in VAR models. for j = 1 , ..., p do Compute the nodewise Lasso estimator ˆ γ j and residuals ˆ Z j = X j − X − j ˆ γ j ; end Compute the ˆΘ in (8); for i = 1 , ..., p do Compute the Lasso estimator ˆ a Lasso i given the data ( Y i , X );Compute the de-biased Lasso estimator ˆ a i = ˆ a Lasso i + ˆΘ X T ( Y i − X ˆ a Lasso i ) /n ,residual ˆ ε i = Y i − X ˆ a Lasso i , and variance estimator ˆ σ i = (cid:107) ˆ ε i (cid:107) / ( n − ˆ s i ); for b = 1 , ..., B doif Bootstrap type == residual bootstrap then Resample with replacement from the centered residuals { ˆ ε ij − ˆ ε i · } andobtain the bootstrap residuals ε ∗ i = ( ε ∗ i , ..., ε ∗ in ) T ; else if Bootstrap type == wild multiplier bootstrap then Generate i.i.d multiplier W i , . . . , W in with E ( W ij ) = 0, E ( W ij ) = 1 and E ( W ij ) < ∞ . Then we obtain ε ∗ ij = W ij ( ε ij − ˆ ε i · );Generate bootstrap samples Y ∗ i = X ˆ a i + ε ∗ i ;Compute the Lasso estimator ˆ a Lasso ∗ i given the bootstrap data ( Y ∗ i , X );Compute ˆ a ∗ i = ˆ a Lasso ∗ i + ˆΘ X T ( Y ∗ i − X ˆ a Lasso ∗ i ) /n , ˆ ε ∗ i = Y ∗ i − X ˆ a Lasso ∗ i , andˆ σ ∗ i = (cid:107) ˆ ε ∗ i (cid:107) / ( n − ˆ s ∗ i );Compute the pivot T ∗ ( b ) with T ∗ ( b ) ij = (ˆ a ∗ ij − ˆ a Lasso ∗ ij ) | ˆ Z T j X j | / (ˆ σ ∗ i (cid:107) ˆ Z j (cid:107) ); endfor j = 1 , ..., p do Let l ij = ˆ a ij − q − α/ ˆ σ i (cid:107) ˆ Z j (cid:107) / | ˆ Z T j X j | and u ij = ˆ a ij − q α/ ˆ σ i (cid:107) ˆ Z j (cid:107) / | ˆ Z T j X j | ,where q α denotes the lower α quantile of { T ∗ (1) ij , ..., T ∗ ( B ) ij } ; endendReturn − α confidence interval [ l ij , u ij ] for a ij , i, j = 1 , ..., p .10he expectation is with respect to the bootstrap measure. Similarly, we estimate σ ∗ i byˆ σ ∗ i := (cid:107) ˆ ε ∗ i (cid:107) / ( n − ˆ s ∗ i ), where ˆ ε ∗ i := Y ∗ i − X ˆ a Lasso ∗ i . We repeat the above procedure B timesto obtain the empirical distribution of the statistics of interest.Quantities determined by X , such as ˆΘ, do not need to be re-computed in the bootstrapreplications, since X is the same in every bootstrap sample. This is the most significantadvantage of our bootstrap methods over the model-based bootstrap method (Krampe et al. , 2019), which will regenerate the entire time series { y t } . Such bootstrap time serieswill not share the same X , leading to the computations of ˆΘ for B times, which is usuallynot feasible in practice for relatively large p . Furthermore, we will show that our twobootstrap de-biased Lasso methods are consistent even though they ignore the dependencestructure of VAR models.In all, in order to make the bootstrap de-biased Lasso methods feasible, we save thecomputation cost of ˆΘ on two folds. First, once we obtain ˆΘ, we do not need to compute ˆΘfor the p equations in VAR models; second, we do not need to compute ˆΘ for B bootstrapreplications. The whole procedure is summarized in Algorithm 2, where every for loopcould be also ran in parallel. In this section we discuss the theoretical properties of the de-biased Lasso and the bootstrapde-biased Lasso. Different from the fixed design matrix case in linear regression models,the design matrix in equation (2) of VAR models is random, exhibits complex dependencestructure, and is correlated with ε i , that is, equation (2) does not justify the assumptions oflinear regression models. 
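For reference, the per-equation procedure of Algorithm 1 can be sketched in a few lines of R. This is our own illustrative re-implementation with glmnet, not the hdi-based code used in the paper; the function name debiased_lasso_ci and the use of lambda.min are our choices.

```r
# Illustrative R sketch of the per-equation procedure in Algorithm 1
# (our own re-implementation with glmnet, not the authors' hdi-based code).
library(glmnet)

debiased_lasso_ci <- function(X, Yi, alpha = 0.05) {
  n <- nrow(X); p <- ncol(X)

  # Lasso fit for equation i, residuals, and variance estimate sigma_i^2.
  cv      <- cv.glmnet(X, Yi, intercept = FALSE)
  a_lasso <- as.numeric(coef(cv, s = "lambda.min"))[-1]
  eps_hat <- as.numeric(Yi - X %*% a_lasso)
  s_hat   <- sum(a_lasso != 0)
  sigma2  <- sum(eps_hat^2) / (n - s_hat)

  # Nodewise Lasso residuals Z_j of X_j regressed on X_{-j}, as in equation (7).
  Z <- sapply(1:p, function(j) {
    cvj <- cv.glmnet(X[, -j], X[, j], intercept = FALSE)
    as.numeric(X[, j] - X[, -j] %*% as.numeric(coef(cvj, s = "lambda.min"))[-1])
  })

  # De-biased estimates in the Z_j form and normal-approximation intervals:
  # a_hat_j = a_lasso_j + Z_j'(Y_i - X a_lasso) / (Z_j' X_j),
  # se_j    = sigma_i * ||Z_j||_2 / |Z_j' X_j|.
  zx    <- colSums(Z * X)
  a_hat <- a_lasso + colSums(Z * eps_hat) / zx
  se    <- sqrt(sigma2) * sqrt(colSums(Z^2)) / abs(zx)
  zq    <- qnorm(1 - alpha / 2)
  cbind(estimate = a_hat, lower = a_hat - zq * se, upper = a_hat + zq * se)
}
```

A production implementation would compute the nodewise residuals Z once and reuse them across the p equations, since they depend only on X; this is exactly the computational saving emphasized above.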
However, we can still obtain their asymptotic normality under appropriate conditions, using the deviation bound of the Lasso estimator in high-dimensional sparse VAR models (Basu and Michailidis, 2015) and the martingale central limit theorem (Theorem 5.3.4 in Fuller (1996)). We first introduce the following three assumptions.

Assumption 1. Suppose that the tuning parameters for the Lasso and the nodewise Lasso satisfy λ ≍ √(log(p)/n) and λ_j ≍ √(log(p)/n).

Assumption 2. max_i s_i log(p)/√n = o(1).

Assumption 3. max_j q_j log(p)/√n = o(1).

Assumption 1 requires that the tuning parameters are of order √(log(p)/n). Assumption 2 is a sparsity assumption on each row of the transition matrix A, which is commonly assumed in statistical inference based on de-biased Lasso methods. These two assumptions are the same as their counterparts in van de Geer et al. (2014). The sparsity assumption on the precision matrix, Assumption 3, is slightly stronger than that in van de Geer et al. (2014). We need this assumption because of the complicated dependence structure of VAR models.

Proposition 1.
Under Assumption 2, for i = 1 , ..., p , there exist constants < c ∗ < c ∗ < ∞ , such that, for s R > (2+2 c ∗ /c ∗ ) s i +1 and s R = O ( s i ) , the following holds with probabilityconverging to 1: c ∗ ≤ min (cid:107) v (cid:107) ≤ s R min (cid:107) v (cid:107) =1 (cid:107) X v (cid:107) /n ≤ max (cid:107) v (cid:107) ≤ s R max (cid:107) v (cid:107) =1 (cid:107) X v (cid:107) /n ≤ c ∗ . Proposition 1 states that X satisfies the sparse Riesz condition (Zhang and Huang,2008), which bounds the extreme eigenvalues of X T X /n in a sparse space. Basu andMichailidis (2015) has obtained the lower bound in the sparse Riesz condition and wecomplement their result by providing the upper bound. The sparse Riesz condition iscrucial for proving the conclusion (b) of Theorem 1 below. Theorem 1.
For i = 1, ..., p,

(a) Under Assumptions 1 and 2, we have ‖â_i^Lasso − a_i‖₁ = O_p(s_i √(log(p)/n)) and ‖X(â_i^Lasso − a_i)‖₂²/n = O_p(s_i log(p)/n).

(b) Under Assumption 2, for any λ ≥ C c^*/c_* √(log p/n), where C is a constant defined in the supplementary material, we have ŝ_i = O_p(s_i).

Theorem 1 (b) provides an upper bound on the sparsity of the Lasso estimator â_i^Lasso. Proposition 4.1 in Basu and Michailidis (2015) provided a similar bound, but for a thresholded variant of the Lasso.

Proposition 2.
Under Assumptions 1 and 3, for j = 1, ..., p, we have ‖γ̂_j − γ_j‖₁ = O_p(q_j √(log p/n)) and ‖X_{−j}(γ̂_j − γ_j)‖₂²/n = O_p(q_j log p/n).

Proposition 2 provides the estimation and prediction error bounds for the nodewise Lasso estimator defined in (7).
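For concreteness, the nodewise Lasso regression (7) and the corresponding row of Θ̂ = T̂^{−2}Ĉ in (8) can be sketched as follows. This is our own illustrative code, assuming glmnet, with the tuning parameter chosen by cross-validation as in the paper; the helper name nodewise_row is ours.

```r
# Sketch of one nodewise Lasso regression (7) and the corresponding row of
# Theta_hat = T_hat^{-2} C_hat in (8), for a given design matrix X and index j.
library(glmnet)

nodewise_row <- function(X, j) {
  n <- nrow(X); p <- ncol(X)
  cvj   <- cv.glmnet(X[, -j], X[, j], intercept = FALSE, standardize = FALSE)
  gam   <- as.numeric(coef(cvj, s = "lambda.min"))[-1]   # gamma_hat_j
  lam   <- cvj$lambda.min                                # lambda_j
  Zj    <- as.numeric(X[, j] - X[, -j] %*% gam)          # nodewise residuals
  tau2  <- sum(Zj^2) / n + lam * sum(abs(gam))           # tau_hat_j^2 in (8)
  theta <- numeric(p)
  theta[j]  <- 1
  theta[-j] <- -gam
  theta / tau2                                           # j-th row of Theta_hat
}
```

By the KKT conditions of (7), Ẑ_j^T X_j = n τ̂_j², which is why the Θ̂-based estimator coincides with the Ẑ_j-based intervals in Algorithm 1.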
Theorem 2.
Under Assumptions 1, 2 and 3, the de-biased Lasso estimator is asymptotically normal, that is,

(â_ij − a_ij)/s.e._ij →_d N(0, 1), i, j = 1, ..., p,

where s.e._ij = σ_i ‖Ẑ_j‖₂ / |Ẑ_j^T X_j|. Furthermore, σ̂_i/σ_i →_P 1, i = 1, ..., p.

Remark 1.
The asymptotic normality of the de-biased Lasso is also proven in Theorem3.4 in Zheng and Raskutti (2019). They propose (cid:80) pi =1 (cid:107) ˆ ε i (cid:107) / ( np ) as a consistent varianceestimator of ε i .Theorem 2 shows that the de-biased Lasso estimator is asymptotically normal and itsasymptotic variance can be estimated consistently. Thus, we can construct an asymptoti-cally valid confidence interval for each element of A , a ij , by using normal approximation.13 .2 Asymptotic distribution of the bootstrap de-biased Lasso Assumption 4.
Suppose that the tuning parameters for the Lasso, nodewise Lasso, andbootstrap Lasso satisfy: λ (cid:16) (cid:112) log( p ) /n , λ j (cid:16) (cid:112) log( p ) /n , and λ ∗ (cid:16) log p/ √ n . Assumption 5. s i (log p ) / / √ n = o (1) , i = 1 , ..., p .Assumptions 4 and 5 require stronger convergence rates compared to those in Dezeure et al. (2017) used to obtain the asymptotic distribution of the bootstrap de-biased Lasso inhigh-dimensional sparse linear regression models. However, Dezeure et al. (2017) assumedthat (cid:107) X (cid:107) ∞ = O (1), which does not hold for the random design matrix in VAR models.In fact, we can only show that (cid:107) X (cid:107) ∞ = O p ( √ log pn ). This is why we require strongerconditions on the sparsity and tuning parameters of the bootstrap Lasso estimator. Theorem 3.
For i = 1, ..., p,

(a) Under Assumptions 4 and 5, we have ‖â_i^{Lasso*} − â_i^Lasso‖₁ = O_p(s_i log(p)/√n) and ‖X(â_i^{Lasso*} − â_i^Lasso)‖₂²/n = O_p(s_i log²(p)/n).

(b) Under Assumption 5, for any λ* ≥ C c^*/c_* √(log(p)/n), where C, c_* and c^* are constants defined in the supplementary material, we have ŝ*_i = O_p(s_i), where ŝ*_i = |{j ∈ {1, ..., p} : â_ij^{Lasso*} ≠ 0}|.

Theorem 3 is the bootstrap analogue of Theorem 1. The different convergence rates in statement (a) are due to Assumption 4.
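The two resampling schemes analyzed here can be sketched as follows. This is our own illustrative code, not the authors' implementation; the helper name make_bootstrap_sample is ours, and standard normal multipliers are used as an example multiplier distribution.

```r
# Illustrative sketch of the two resampling schemes in Algorithm 2, built from
# the centered Lasso residuals of one equation.
make_bootstrap_sample <- function(X, a_hat, eps_hat,
                                  type = c("residual", "wild")) {
  type  <- match.arg(type)
  n     <- nrow(X)
  eps_c <- as.numeric(eps_hat) - mean(eps_hat)        # centered residuals
  eps_star <- switch(type,
    residual = sample(eps_c, n, replace = TRUE),      # resample with replacement
    wild     = rnorm(n) * eps_c)                      # W_t * centered residuals
  as.numeric(X %*% a_hat) + eps_star                  # Y*_i = X a_hat_i + eps*_i
}
```

Because the resampling is carried out conditionally on the data, X and Θ̂ stay fixed across the B bootstrap replications, which is the computational advantage over the model-based bootstrap discussed in Section 2.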
Theorem 4.
Under Assumptions 3 – 5, the bootstrap de-biased Lasso estimators are asymp-totically normal, that is, (ˆ a ∗ ij − ˆ a Lasso ij ) /s.e. ∗ ij d ∗ → N (0 , in probability , i, j = 1 , ..., p, where ˆ a ∗ ij denotes either residual or multiplier wild bootstrap de-biased Lasso estimators, s.e. ∗ ij = σ ∗ i (cid:107) ˆ Z j (cid:107) | ˆ Z T j X j | , nd d ∗ indicates the convergence is with respect to the bootstrap measure. Furthermore, ˆ σ ∗ i /σ ∗ i P ∗ → in probability , i = 1 , ...p. Theorems 2 and 4 imply that the conditional distributions of both residual and multi-plier wild bootstrap de-biased Lasso estimators are valid approximations to the (uncondi-tional) distribution of the de-biased Lasso estimator. Thus, we could perform valid inferenceabout a ij ’s using the bootstrap. We evaluate the finite-sample performance of the proposed methods by simulation studies inthis section. We compare the methods with the bootstrap Lasso/Lasso+OLS and anotherde-biased Lasso method, JM (Javanmard and Montanari, 2014), in terms of bias, root meansquared error (RMSE), coverage probabilities and mean confidence interval lengths.We use R package hdi to implement the de-biased Lasso, residual bootstrap de-biasedLasso and multiplier wild bootstrap de-biased Lasso, R package HDCI to implement thebootstrap Lasso and bootstrap Lasso+OLS, and R code provided by Javanmard and Mon-tanari (2014) to implement their version of de-biased Lasso. The tuning parameters areselected by 10-fold cross-validation. We set the number of bootstrap replications B = 500. With sample size n = 100 , p = 200 and sparsity s i = 5 ,
10 for i = 1 , ..., p ,we first generate transition matrix A as follows.1. We generate s i × p non-zero parameters independently from a uniform distributionon [ − , − . ∪ [0 . , A init , we set the diagonal element to be non-zero and randomlyarrange the other s i − A init may greater than 1, in order to maketime series stable, we let A = 0 . max ( A init ) A init . A equals to 0 .
9. We also try tolet Λ max ( A ) = 0 . s i equals 5 and 10, theranges of the absolute values of elements of A are (0 . , .
46) and (0 . , .
34) respectively.The smaller absolute values of the parameters implies that the de-biased Lasso will out-perform the bootstrap Lasso/Lasso+OLS since the validity of the latter usually requiresthe “beta-min” condition (all nonzero parameters are sufficiently large in absolute values)while the former does not.We generate data { y , ..., y n } from VAR model (1) with1. homoscedastic Gaussian errors u t = ξ t ;2. homoscedastic non-Gaussian errors u t = ( ξ t − / √ u t = η t ξ t ;4. heteroscedastic non-Gaussian errors u t = η t ( ξ t − / √ ξ t i.i.d ∼ N p ( , I ) and η t i.i.d ∼ U (1 , n and s i , there are four cases in totaland the results will be showed in Section 4.2 and Section 4.3. For the other three types oferrors, since we set n = 100 and s i = 5, there are three cases in total and the results will bediscussed in Section 4.4. For every case, we generate 1000 sets of data in order to evaluatethe repeated sampling performance of different methods. We only report the results withrespect to the first row of A since the conclusions for other rows are similar. In this section, we compare the bias and RMSE of four estimation methods: Lasso,Lasso+OLS, LDPE and JM. Figures 1 and 2 and Table 1 show the results of absolute16 asso Lasso+OLS LDPE JM s : : s :
Figure 1: Comparison of absolute bias for 1000 replications produced by four methods (columns)in four cases (rows). Index on the x -axis corresponds to different a j ’s, which are arranged fromsmall to large in absolute values. The first p − s elements of a j ’s are zeros (blue points) and thelast s are non-zeros (red points). The black lines are total averages of absolute bias for zero andnon-zero parameters respectively. bias ( | E ˆ a j − a j | ) and RMSE ([ E (ˆ a j − a j ) ] / ). For non-zero parameters, the Lassoestimator has large bias, the Lasso+OLS estimator reduces the bias (23% - 41% when n = 100, 77% - 90% when n = 300), and two de-biased Lasso estimators further reduce thebias (60% - 90%). For zero parameters, the Lasso and Lasso+OLS estimators have nearlyzero bias, while the bias of two de-biased Lasso estimators are about the same magnitudeas those of non-zero parameters. In terms of RMSE, when n = 100, situation is almostthe same as bias; when n = 300, for those non-zero parameters, the Lasso and Lasso+OLSestimators have RMSE comparable to two de-biased Lasso estimators while for those zeroparameters, the Lasso and Lasso+OLS estimators have much smaller RMSE. Specifically,for those zero parameters, compared to the LDPE, the Lasso reduces the RMSE by 75% -82%. For estimation purpose, we recommend the Lasso and Lasso+OLS since their RMSEs17 asso Lasso+OLS LDPE JM s : : s :
Figure 2: Comparison of RMSE for 1000 replications produced by four methods (columns) in fourcases (rows). Index on the x -axis corresponds to different a j ’s, which are arranged from smallto large in absolute values. The first p − s elements of a j ’s are zeros (blue points) and the last s are non-zeros (red points). The black lines are total averages of absolute RMSE for zero andnon-zero parameters respectively. are smaller. However, for construction of confidence intervals, small bias will lead to moreaccurate coverage probabilities, which will be seen in the next section. We now compare the coverage probabilities and mean confidence interval lengths of 95%confidence intervals constructed by six methods: de-biased Lasso (LDPE), residual boot-strap de-biased Lasso (BtLDPE), multiplier wild bootstrap de-biased Lasso (MultiBtLDPE),bootstrap Lasso (BtLasso), bootstrap Lasso+OLS (BtLasso+OLS) and de-biased Lasso ofJavanmard and Montanari (2014) (JM). 18 able 1: Average absolute bias and RMSE n s a ij Lasso Lasso+OLS LDPE JMAverage absolute bias100 5 non-zero 0.1228 0.0719 0.02 0.0378100 5 zero 0.001 0.0009 0.0122 0.0062100 10 non-zero 0.1174 0.0902 0.0215 0.0463100 10 zero 0.0019 0.0021 0.0175 0.0104300 5 non-zero 0.0631 0.0058 0.0066 0.0068300 5 zero 0.0006 0.0003 0.0056 0.0022300 10 non-zero 0.057 0.0126 0.0056 0.0066300 10 zero 0.0011 0.0008 0.0067 0.0036Average RMSE100 5 non-zero 0.1467 0.1409 0.0853 0.0882100 5 zero 0.0149 0.017 0.0783 0.0589100 10 non-zero 0.1389 0.1422 0.0842 0.0888100 10 zero 0.0195 0.0258 0.0772 0.0592300 5 non-zero 0.0761 0.0437 0.0446 0.0485300 5 zero 0.0076 0.0058 0.0428 0.0443300 10 non-zero 0.0698 0.0507 0.0425 0.0451300 10 zero 0.0101 0.01 0.0409 0.0419Figure 3 and Table 2 show the results of coverage probabilities. For non-zero param-eters, the coverage probabilities of BtLasso and BtLasso+OLS do not reach the nominalconfidence level in all cases except for n = 300 , s i = 5. The LDPE, BtLDPE, MultiBtLDPEand JM can reach the nominal confidence level when n = 300 while only the LDPE reachesthe nominal confidence level when n = 100. For zero parameters, all methods except theBtLasso reach the nominal confidence level. Note that the BtLasso+OLS and JM producemuch higher coverage probabilities, for example 99%, than the nominal level 95%.Figure 4 and Table 2 show the results of average confidence interval lengths. Fornon-zero parameters, the BtLasso and BtLasso+OLS produce shorter confidence intervals19 tLasso BtLasso+OLS LDPE BtLDPE MultiBtLDPE JM s : : s :
Figure 3: Comparison of empirical coverage probabilities for 1000 replications produced by sixmethods (columns) in four cases (rows). Index on the x -axis corresponds to different a j ’s, whichare arranged from small to large in absolute values. The first p − s elements of a j ’s are zeros (bluepoints) and the last s are non-zeros (red points). The black lines are total averages of coverageprobabilities for zero and non-zero parameters respectively. The red dashed lines correspond to thenominal confidence level 95%. than the other four methods. Moreover, compared to the LDPE, the BtLDPE reducesconfidence interval lengths by 6% - 12% when n = 300 and 3% - 9% when n = 100.For zero parameters, the BtLasso and BtLasso+OLS have nearly zero average confidenceinterval lengths, reflecting the super-efficiency of these two methods. For the other fourmethods, the LDPE, BtLDPE, MultiBtLDPE and JM, the confidence interval lengthsfor zero parameters are nearly the same as those for non-zero parameters. Meanwhile,confidence intervals produced by the BtLDPE and MultiBtLDPE are shorter than thoseproduced by the LDPE and JM.Taking into account both coverage probabilities and confidence interval lengths, when n is small, we recommend the LDPE for its honest coverage probabilities; when n is large,20 tLasso BtLasso+OLS LDPE BtLDPE MultiBtLDPE JM s : : s :
Figure 4: Comparison of average confidence interval lengths for 1000 replications produced bysix methods (columns) in four cases (rows). Index on the x -axis corresponds to different a j ’s,which are arranged from small to large in absolute values. The first p − s elements of a j ’s arezeros (blue points) and the last s are non-zeros (red points). The black lines are total averagesof interval lengths for zero and non-zero parameters respectively. we recommend the BtLDPE and MultiBtLDPE for their honest coverage probabilities andshorter confidence interval lengths. In this subsection, we explore the robustness of our methods to different distributionsof errors, namely, homoscedastic non-Gaussian errors, heteroscedastic Gaussian errors andheteroscedastic non-Gaussian errors. Compared to homoscedastic Gaussian errors, differentdistributions of errors do not lead to significant difference of performance; see the resultsin the supplementary material. Again, we can see that the LDPE has honest coverageprobabilities and the BtLDPE and MultiBtLDPE have shorter confidence interval lengths21 able 2: Average empirical coverage probabilities and average confidence interval lengths n s a ij BtLasso BtLassoOLS LDPE BtLDPE MultiBtLDPE JMAverage empirical coverage probabilities (%)100 5 non-zero 55.8 69 93.2 90.7 90.5 89.4100 5 zero 95.1 99 95.6 95.9 95.7 98.3100 10 non-zero 40 51.3 93.2 86.9 87.2 92.1100 10 zero 93.4 97.3 96 96.1 96 98.6300 5 non-zero 88.2 95.2 94.5 93.8 94.1 94.3300 5 zero 95.3 99.7 95.4 95 94.9 96.3300 10 non-zero 82.4 91.5 94.8 93.7 93.8 96300 10 zero 94.3 99 95.7 95.1 95.1 97.1Average confidence interval lengths100 5 non-zero 0.174 0.259 0.315 0.296 0.302 0.295100 5 zero 0.021 0.012 0.313 0.283 0.28 0.294100 10 non-zero 0.124 0.178 0.313 0.285 0.289 0.309100 10 zero 0.03 0.027 0.314 0.276 0.274 0.311300 5 non-zero 0.145 0.19 0.172 0.166 0.167 0.186300 5 zero 0.017 0.003 0.17 0.159 0.159 0.184300 10 non-zero 0.13 0.165 0.165 0.157 0.158 0.182300 10 zero 0.03 0.013 0.166 0.151 0.151 0.183compared to the LDPE.
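For readers reproducing results of this kind, the simulation metrics reported in Tables 1 and 2 can be tabulated with a small helper. The input structure ci_list below is hypothetical and not tied to the packages used in the paper.

```r
# Sketch of how coverage probabilities and average interval lengths of the kind
# reported in Tables 1-2 can be tabulated. `ci_list` is a hypothetical list over
# Monte Carlo replications, each element a p x 2 matrix with columns "lower" and
# "upper"; `a_true` is the true coefficient vector of one row of A.
coverage_summary <- function(ci_list, a_true) {
  cover <- sapply(ci_list, function(ci) ci[, "lower"] <= a_true & a_true <= ci[, "upper"])
  len   <- sapply(ci_list, function(ci) ci[, "upper"] - ci[, "lower"])
  data.frame(parameter  = seq_along(a_true),
             coverage   = rowMeans(cover),   # empirical coverage per parameter
             avg_length = rowMeans(len))     # average CI length per parameter
}
```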
In practice, researchers often use VAR models to analyze the time series data of stocksand conduct statistical inference on the elements of transition matrix, so as to produceknowledge about the relationship between different stocks. In this section, we use theprices of the S&P 500 constituent stocks in 2019 to demonstrate our methods. There are505 stocks in total because there are five companies have two share classes of stock. There22 igure 5: Statistical inference results of the × transition matrix of the VAR model for thereturns of the S&P 500 constituent stocks using six methods: de-biased Lasso (LDPE), residualbootstrap de-biased Lasso (BtLDPE), multiplier wild bootstrap de-biased Lasso (MultiBtLDPE),de-biased Lasso of Javanmard and Montanari (2014) (JM), bootstrap Lasso (BtLasso) and boot-strap Lasso+OLS (BtLasso+OLS). The red point indicates that the corresponding parameter a ij is significant (its 95% confidence interval does not include 0). The gray square indicates that thetwo stocks corresponding to a ij are in the same sector. are 252 trading days in 2019, but five stocks have incomplete data for some reasons. Afterdeleting these five stocks, we have data of 500 stocks for 252 days. Because our modelneeds the stationarity of time series data, we use the daily return r t = p t p t − − , t = 0 , . . . , n as { y t } in model (1), where p t is the adjusted price. Since the transformation reducesone observation and t begins from 0, in our model, n = 250. These 500 companies arein eleven different sectors: Communication Services, Consumer Discretionary, ConsumerStaples, Energy, Financials, Health Care, Industrials, Information Technology, Materials,Real Estate and Utilities. 23 igure 6: Box plot of confidence interval lengths of the × parameters produced by sixmethods: de-biased Lasso (LDPE), residual bootstrap de-biased Lasso (BtLDPE), multiplier wildbootstrap de-biased Lasso (MultiBtLDPE), de-biased Lasso of Javanmard and Montanari (2014)(JM), bootstrap Lasso (BtLasso) and bootstrap Lasso+OLS (BtLasso+OLS). We apply the same six methods as in Section 4 to this data set and obtain the 95% con-fidence intervals for elements of the transition matrix A . The results are shown in Figures5 and 6. First, compared with the BtLasso, BtLasso+OLS and JM, our proposed meth-ods produce more significant parameters, which can provide users with more candidatesfor effective relationships to make future decisions. Second, some of the columns of theestimated transition matrix have many significant parameters (see the vertical red lines inFigure 5), indicating that the prices of some stocks have prediction power on the pricesof most stocks. A further look at the results reveals that all six methods indicate thatNewmont Corporation is such a stock having the ability to predict many other stock pricesin advance. Newmont Corporation is the largest producer of gold in the world and the onlygold producer listed in the S&P 500 Index. Considering that the turbulent global financialenvironment in 2019 made gold the best safe-haven asset, our findings have practical sig-24ificance. Third, there are relationships both within and between sectors. Finally, Figure 6shows that the BtLasso and BtLasso+OLS produce confidence intervals with lengths nearlyzero, while the confidence interval lengths of the other four methods are larger than zeroand roughly comparable. Performing statistical inference for parameters in high-dimensional VAR models is a chal-lenging but important problem. 
We propose to use the de-biased Lasso (LDPE), residualbootstrap de-biased Lasso (BtLDPE) and multiplier wild bootstrap de-biased Lasso (Multi-BtLDPE) to construct confidence intervals for the individual parameter of the transitionmatrix. Unlike the fixed design case in linear regression models, the design matrix in VARmodels is random with complex dependence structure, which makes theoretical analysischallenging. Based on the convergence rates of the Lasso and the nodewise Lasso estima-tors, we obtain the asymptotic unbiasedness of the de-biased Lasso estimator. Combinedwith the martingale central limit theorem, we obtain its asymptotic normality. For thetwo bootstrap de-biased Lasso methods, the analysis is conditional on the original dataand we derive their asymptotic properties based on the randomness coming from the boot-strap sampling. We demonstrate the validity of statistical inference for the parametersof high-dimensional sparse VAR models using these methods. Furthermore, we proposefeasible and parallelizable algorithms to implement our methods. More specifically, we ap-ply the de-biased Lasso, residual bootstrap de-biased Lasso and multiplier wild bootstrapde-biased Lasso to each of the p equations of VAR models separately, which can be ranin parallel. More importantly, the p equations share the same design matrix, so we needonly to compute the nodewise Lasso once, which is the main computational burden of thede-biased Lasso and bootstrap de-biased Lasso. The proposed methods have significantcomputational advantages, especially when p is large.We conduct comprehensive simulation studies to compare our methods with the boot-strap Lasso, bootstrap Lasso+OLS and another de-biased Lasso method proposed by Javan-mard and Montanari (2014). We find that the LDPE can always give the honest coverageprobabilities, and when sample size is large, the BtLDPE and MultiBtLDPE can also give25he honest coverage probabilities but with shorter confidence interval lengths. Therefore,when the sample size is small, we recommend the LDPE for its reliability, and when thesample size is large, we recommend the BtLDPE and MultiBtLDPE for their reliabilityand power. Lastly, we apply our methods to analyze the S&P 500 constituent stock pricesdata set and obtain reasonable confidence intervals.In our theoretical study, we assume the homoscedastic Gaussian errors. However, wefind in simulations that our methods are robust to heteroscedastic and/or non-Gaussianerrors. It is interesting and worthy of further investigation to obtain the asymptotic dis-tributions of the proposed methods for heteroscedastic and/or non-Gaussian errors. Themain technical difficulty is to establish convergence rates of the Lasso and martingale cen-tral limit theorem for this type of errors.This article focuses on statistical inference for individual parameter of the transitionmatrix in high-dimensional sparse VAR models. It is interesting to extend the methodsfor simultaneous confidence intervals and multiple hypothesis testing. For this purpose, wecan use the Bonferroni correction or Westfall-Young procedure (Westfall and Young, 1993).We leave the corresponding theoretical investigation to future work. Supplementary Material
The document provides the detailed proofs of the theoretical results, as well as additionalsimulation results for different distributions of errors.
Acknowledgments
The authors thank Dr. Lixiang Zhang for his suggestions that have helped clarify the text.
Funding
Dr. Hanzhong Liu acknowledges financial support from the National Natural Science Foundation of China (grant no. 11701316).

References
Basu, S. and Michailidis, G. (2015). Regularized Estimation in Sparse High-dimensionalTime Series Models.
The Annals of statistics , (4), 1535–1567.Basu, S., Das, S., Michailidis, G., and Purnanandam, A. K. (2019). A System-WideApproach to Measure Connectivity in the Financial Sector. Available at SSRN 2816137 .B¨uhlmann, P. and van de Geer, S. A. (2011).
Statistics for High-Dimensional Data Methods,Theory and Applications . Springer.Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping Lasso Estimators.
Journal of theAmerican Statistical Association , (494), 608–625.Chatterjee, A. and Lahiri, S. N. (2013). Rates of Convergence of the Adaptive LASSOEstimators to the Oracle Distribution and Higher Order Refinements by the Bootstrap. The Annals of statistics , (3), 1232–1259.Chen, X., Xu, M., and Wu, W.-B. (2013). Covariance and Precision Matrix Estimation forHigh-dimensional Time Series. The Annals of statistics , (6), 2994–3021.Davis, R. A., Zang, P., and Zheng, T. (2016). Sparse Vector Autoregressive Modeling. Journal of Computational and Graphical Statistics , (4), 1077–1096.Dezeure, R., B¨uhlmann, P., and Zhang, C.-H. (2017). High-dimensional SimultaneousInference with the Bootstrap. TEST , (4), 685–719.Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals ofstatistics , (1), 1–26.Fuller, W. A. (1996). Introduction to Statistical Time Series . John Wiley & Sons.Granger, C. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods.
Econometrica , (3), 424–438.Guo, S., Wang, Y., and Yao, Q. (2016). High-dimensional and Banded Vector Autoregres-sions. Biometrika , (4), 889–903. 27an, F., Lu, H., and Liu, H. (2015). A Direct Estimation of High Dimensional StationaryVector Autoregressions. Journal of Machine Learning Research , (1), 3115–3150.Hsu, N.-J., Hung, H.-L., and Chang, Y.-M. (2008). Subset Selection for Vector Autore-gressive Processes Using Lasso. Computational Statistics and Data Analysis , (7),3645–3657.Javanmard, A. and Montanari, A. (2014). Confidence Intervals and Hypothesis Testing forHigh-dimensional Regression. Journal of Machine Learning Research , (1), 2869–2909.Knight, K. and Fu, W. (2000). Asymptotics for Lasso-type Estimators. The Annals ofstatistics , (5), 1356–1378.Kock, A. B. and Callot, L. (2015). Oracle Inequalities for High Dimensional Vector Au-toregressions. Journal of Econometrics , (2), 325–344.Krampe, J., Kreiss, J. P., and Paparoditis, E. (2019). Bootstrap Based Inference for SparseHigh-Dimensional Time Series Models. arXiv preprint arXiv:1806.11083v3 .Lin, J. and Michailidis, G. (2017). Regularized Estimation and Testing for High-dimensionalMulti-block Vector-autoregressive Models. The Journal of Machine Learning Research , (1), 4188–4236.Liu, H. and Yu, B. (2013). Asymptotic Properties of Lasso+mLS and Lasso+Ridge inSparse High-dimensional Linear Regression. Electronic Journal of Statistics , , 3124–3169.Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional Regression with Noisy andMissing data: Provable Guarantees with Nonconvexity. The Annals of statistics , (3),1637–1664.L¨utkepohl, H. (2007). New Introduction to Multiple Time Series Analysis . Springer Science& Business Media.Meinshausen, N. and B¨uhlmann, P. (2006). High-dimensional Graphs and Variable Selec-tion with the Lasso.
The Annals of statistics , (3), 1436–1462.28egahban, S. N. and Wainwright, M. J. (2011). Estimation of (Near) Low-rank Matriceswith Noise and High-dimensional Scaling. The Annals of statistics , (2), 1069–1097.Neykov, M., Ning, Y., Liu, J. S., and Liu, H. (2018). A Unified Theory of ConfidenceRegions and Testing for High-Dimensional Estimating Equations. Statistical Science , (3), 427–443.Ning, Y. and Liu, H. (2017). A General Theory of Hypothesis Tests and Confidence Regionsfor Sparse High Dimensional Models. The Annals of statistics , (1), 158–195.Reid, S., Tibshirani, R. J., and Friedman, J. H. (2016). A Study of Error Variance Esti-mation in Lasso Regression. Statistica Sinica , , 35–67.Sims, C. A. (1980). Macroeconomics and Reality. Econometrica , (1), 1–48.Song, S. and Bickel, P. J. (2011). Large Vector Auto Regressions. arXiv preprintarXiv:1106.3915 .Tibshirani, R. J. (1996). Regression Shrinkage and Selection via the Lasso. Journal of theRoyal Statistical Society: Series B (Statistical Methodology) , (1), 267–288.van de Geer, S. A., B¨uhlmann, P., Ritov, Y., and Dezeure, R. (2014). On AsymptoticallyOptimal Confidence Regions and Tests for High-dimensional Models. The Annals ofstatistics , (3), 1166–1202.Westfall, P. and Young, S. (1993). Resampling-based Multiple Testing: Examples andMethods for P-value Adjustment . Wiley, Hoboken.Wilms, I., Gelper, S., and Croux, C. (2016). The Predictive Power of the Business andBank Sentiment of Firms: A High-dimensional Granger Causality Approach.
EuropeanJournal of Operational Research , (1), 138–147.Zhang, C.-H. and Huang, J. (2008). The Sparsity and Bias of the Lasso Selection inHigh-dimensional Linear Regression. The Annals of statistics , (4), 1567–1594.29hang, C.-H. and Zhang, S. S. (2014). Confidence Intervals for Low Dimensional Parametersin High Dimensional Linear Models. Journal of the Royal Statistical Society: Series B(Statistical Methodology) , (1), 217–242.Zheng, L. and Raskutti, G. (2019). Testing for High-dimensional Network Parameters inAuto-regressive Models. Electronic Journal of Statistics , (2), 4977–5043. Supplementary Material for “Confidence Intervals forParameters in High-dimensional Sparse Vector Autoregression”
The document provides the detailed proofs of the theoretical results in the main text,as well as additional simulation results for different distributions of errors.
A Proofs of the theoretical results
Our proofs require several theoretical results from Basu and Michailidis (2015). Firstly, webound the extreme eigenvalue of Σ, which can be obtained directly from Proposition 2.3in Basu and Michailidis (2015).
Proposition 3.
For the stable VAR model in (1), we have 1/Λ_min(Σ) = O(1) and Λ_max(Σ) = O(1).

Secondly, the following concentration inequalities are obtained from Proposition 2.4 in Basu and Michailidis (2015).
Proposition 4.
For the stable VAR model in (1),(a) There exists a constant c > such that for any vector v ∈ R p with (cid:107) v (cid:107) ≤ , for any η ≥ , we have P [ v T ( X T X /n − Σ) v > Q η ] ≤ (cid:2) − c n min (cid:8) η , η (cid:9)(cid:3) , (9) where Q is a constant depends only on the dependence structure of the VAR model.(b) For ( p − × p matrix W and p -dimensional vector w , if Cov( W y t , w T y t ) = 0 for every ∈ Z , then there exist constants c such that for any vector u ∈ R p − with (cid:107) u (cid:107) ≤ , forany η ≥ , we have P [ u T ( W X T X w/n ) > Q η ] ≤ (cid:2) − c n min (cid:8) η , η (cid:9)(cid:3) , (10) where Q is a constant depends only on the dependence structure of the VAR model. Thirdly, the following Proposition 5 are from Proposition 4.2 and 4.3 in Basu andMichailidis (2015). The first inequality is related to the restricted eigenvalue (RE) conditionfor the design matrix X and we modify the original general conclusion to fit our proof relatedto the nodewise Lasso. The second is the deviation bound for X T ε i /n . Proposition 5. (a) Under Assumptions 2 and 3, we have, in probability, θ (cid:48) ( X T X /n ) θ ≥ α (cid:107) θ (cid:107) − τ (cid:107) θ (cid:107) , ∀ θ ∈ R p , (11) where α > is a constant depends only on the dependence structure of the VAR model, τ satisfies τ s i = O (1) for any i and τ q j = O (1) for any j .(b) Under Assumption 2, for i = 1 , ..., p , we have, in probability, (cid:107) X T ε i /n (cid:107) ∞ ≤ C (cid:112) log p/n, (12) where C > is a constant depends only on the dependence structure of the VAR model. Lastly, we need Lemma F.2 in the supplementary material of Basu and Michailidis(2015), which used discretization to expand the bound of a single vector in (9) to a set ofsparse vectors.
Lemma 1.
Consider a symmetric matrix D p × p . If, for any vector v ∈ R p with (cid:107) v (cid:107) ≤ ,and any η ≥ , P [ v T Dv > Cη ] ≤ (cid:2) − cn min (cid:8) η , η (cid:9)(cid:3) , then, for any integer s ≥ , we have P (cid:34) sup (cid:107) v (cid:107) ≤ s sup (cid:107) v (cid:107) ≤ | v (cid:48) Dv | > Cη (cid:35) ≤ (cid:2) − cn min (cid:8) η, η (cid:9) + s min { log p, log(21 ep/s ) } (cid:3) . .1 Proofs of the theoretical results in Section 3.1 We first show that the following sparse Riesz condition (Zhang and Huang, 2008) holdswith probability converging to 1: c ∗ ≤ min (cid:107) v (cid:107) ≤ s R min (cid:107) v (cid:107) =1 (cid:107) X v (cid:107) /n ≤ max (cid:107) v (cid:107) ≤ s R max (cid:107) v (cid:107) =1 (cid:107) X v (cid:107) /n ≤ c ∗ . (13) Proof of Proposition 1 . By (9) in Proposition 4 and Lemma 1, we have P (cid:34) sup (cid:107) v (cid:107) ≤ s R sup (cid:107) v (cid:107) ≤ | v (cid:48) ( X T X /n − Σ) v | > Q η (cid:35) ≤ − cn min { η, η } + s R min { log p, log(21 ep/s R ) } ] . Thus, we havesup (cid:107) v (cid:107) ≤ s R sup (cid:107) v (cid:107) ≤ | v (cid:48) ( X T X /n − Σ) v | = O p ( (cid:112) s R log p/n ) = O p ( (cid:112) s i log p/n ) = o p (1) , where the last equality is due to the sparsity Assumption 2 ( s i log p/ √ n → (cid:107) v (cid:107) ≤ s R sup (cid:107) v (cid:107) ≤ | v (cid:48) Σ v | = O p (1) , (cid:46) min (cid:107) v (cid:107) ≤ s R min (cid:107) v (cid:107) ≤ | v (cid:48) Σ v | = O p (1) . The result follows from triangle inequality.Now, we can prove Theorem 1. Since the statement (a) have been obtained by Basuand Michailidis (2015) (see Propositions 4.1), we need only to prove the statement (b).
Proof of Theorem 1 (b) . Firstly, we introduce some notations. Note that we omit thesubscript i for simplicity. We define the sets as follows: A tp := { j : a ij (cid:54) = 0 } ∩ { j : ˆ a Lasso ij (cid:54) = 0 } ,A fp := { j : a ij = 0 } ∩ { j : ˆ a Lasso ij (cid:54) = 0 } ,A fn := { j : a ij (cid:54) = 0 } ∩ { j : ˆ a Lasso ij = 0 } ,A tn := { j : a ij = 0 } ∩ { j : ˆ a Lasso ij = 0 } ,A := A tp ∪ A fp ∪ A fn , A := A tp ∪ A fn , A := A fp , s + := | A | . By these definitions, we have s i = | A | ≤ | A | = ˆ s + , ˆ s i = | A tp ∪ A fp | ≤ | A | = ˆ s + , | A | = ˆ s + − s i . (14)Since A ⊆ A and A ⊆ A , for k = 2 ,
3, let Q k be an | A k | × | A | transformation matrixwith elements 0 and 1. Q k selects variables in A k from A , such that Q k X T A = X T A k .Recall that ˆ a Lasso i := argmin α ∈ R p {(cid:107) Y i − X α (cid:107) /n + 2 λ (cid:107) α (cid:107) } , where ˆ a Lasso i = (ˆ a Lasso i , ..., ˆ a Lasso ip ). By KKT condition, we have X T j ( Y i − X ˆ a Lasso i ) /n = λ sign(ˆ a Lasso ij ) , ˆ a Lasso ij (cid:54) = 0 | X T j ( Y i − X ˆ a Lasso i ) /n | ≤ λ, ˆ a Lasso ij = 0 . (15)For k = 2 ,
3, we denote ξ k := X T A k ( Y i − X ˆ a Lasso i ) / ( nλ ) , (16)and v k := λ √ n ˆΣ − / A ,A Q T k ξ k , (17)where ˆΣ A ,A := X T A X A /n . By (15), we have (cid:107) ξ (cid:107) ≤ | A | , (cid:107) ξ (cid:107) = | A | = ˆ s + − s i . (18)Next, we intend to proving that ˆ s + = O p ( s i ), which implies ˆ s i = O p ( s i ) by (14). Ourproof is divided into three steps. In steps 1 and 2, taking v as a bridge, we prove thatˆ s + ≤ s R implies ˆ s + ≤ (2 + 2 c ∗ /c ∗ ) s i . Specifically, we give a lower bound and an upperbound of (cid:107) v (cid:107) in steps 1 and 2 respectively. In step 3, using the results in steps 1 and 2,we proves the desired result by contradiction. Step
Step 1. Assume that $\{\hat s_+ \le s_R\}$ holds. By (13) and (18), we have
$$\|v_3\|_2^2 = \lambda^2 n\,\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}Q_3^T\xi_3 \ge \lambda^2 n\,\|Q_3^T\xi_3\|_2^2/c^* = \lambda^2 n(\hat s_+ - s_i)/c^*. \qquad (19)$$

Step 2. Assume again that $\{\hat s_+ \le s_R\}$ holds. Since $a_{i,A_{tn}} = \hat a^{Lasso}_{i,A_{tn}} = 0$ and $A_2 \cup A_3 = A_1$, we have
$$n\lambda(Q_2^T\xi_2 + Q_3^T\xi_3) = X^T_{A_1}(Y_i - X\hat a^{Lasso}_i) = X^T_{A_1}(Y_i - X_{A_1}\hat a^{Lasso}_{i,A_1}) = X^T_{A_1}X_{A_1}a_{i,A_1} + X^T_{A_1}X_{A_{tn}}a_{i,A_{tn}} + X^T_{A_1}\varepsilon_i - X^T_{A_1}X_{A_1}\hat a^{Lasso}_{i,A_1} = n\hat\Sigma_{A_1,A_1}a_{i,A_1} + X^T_{A_1}\varepsilon_i - n\hat\Sigma_{A_1,A_1}\hat a^{Lasso}_{i,A_1}. \qquad (20)$$
Then we have
$$v_3^T(v_2 + v_3) = \lambda^2 n\,\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}(Q_2^T\xi_2 + Q_3^T\xi_3) = \lambda\,\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}X^T_{A_1}\varepsilon_i + \lambda n\,\xi_3^TQ_3(a_{i,A_1} - \hat a^{Lasso}_{i,A_1}) \le \lambda\,\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}X^T_{A_1}\varepsilon_i \le \lambda n\,\|\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}\|_1\|X^T_{A_1}\varepsilon_i/n\|_\infty, \qquad (21)$$
where the first inequality is due to $Q_3(a_{i,A_1} - \hat a^{Lasso}_{i,A_1}) = a_{i,A_3} - \hat a^{Lasso}_{i,A_3} = -\hat a^{Lasso}_{i,A_3}$ together with $\xi_3^T\hat a^{Lasso}_{i,A_3} \ge 0$ by the KKT condition (15), and the second inequality is due to H\"older's inequality. Furthermore, we have
$$\|\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}\|_1 \le \sqrt{\hat s_+}\,\|\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}\|_2 \le \sqrt{\hat s_+}\,\|\xi_3^TQ_3\|_2/c_* \le \hat s_+/c_*, \qquad (22)$$
where the second inequality is due to the sparse Riesz condition (13) and the last inequality holds because $\|\xi_3^TQ_3\|_2 = \|\xi_3\|_2 = \sqrt{\hat s_+ - s_i} \le \sqrt{\hat s_+}$. By the sparse Riesz condition (13) and (18), we also have
$$\|v_2\|_2^2 = \lambda^2 n\,\xi_2^TQ_2\hat\Sigma^{-1}_{A_1,A_1}Q_2^T\xi_2 \le \lambda^2 n\,\|Q_2^T\xi_2\|_2^2/c_* \le \lambda^2 n\,s_i/c_*. \qquad (23)$$
By (21), the triangle inequality, H\"older's inequality, (22) and (23), we have
$$\|v_3\|_2^2 \le \lambda n\,\|\xi_3^TQ_3\hat\Sigma^{-1}_{A_1,A_1}\|_1\|X^T_{A_1}\varepsilon_i/n\|_\infty + \|v_2\|_2\|v_3\|_2 \le (\lambda n\,\hat s_+/c_*)\|X^T_{A_1}\varepsilon_i/n\|_\infty + (\lambda^2 n\,s_i/c_*)^{1/2}\|v_3\|_2. \qquad (24)$$
Now we need to bound the term $\|X^T_{A_1}\varepsilon_i/n\|_\infty$. By (12), we have, with probability tending to one,
$$\|X^T_{A_1}\varepsilon_i/n\|_\infty \le \|X^T\varepsilon_i/n\|_\infty \le C\sqrt{\log p/n}. \qquad (25)$$
Since $x^2 \le c + 2bx$ implies $x^2 \le (b + \sqrt{b^2 + c})^2 \le 2c + 4b^2$, applying this with $x = \|v_3\|_2$ to (24) and using (25), we have, with probability tending to one,
$$\|v_3\|_2^2 \le 2(\lambda n\,\hat s_+/c_*)C\sqrt{\log p/n} + \lambda^2 n\,s_i/c_*. \qquad (26)$$
Combining this with the lower bound (19), we have
$$\lambda^2 n(\hat s_+ - s_i)/c^* \le 2(\lambda n\,\hat s_+/c_*)C\sqrt{\log p/n} + \lambda^2 n\,s_i/c_*.$$
Moving the terms involving $\hat s_+$ and $s_i$ to opposite sides of the inequality, we obtain
$$\Big(\lambda - 2C\frac{c^*}{c_*}\sqrt{\frac{\log p}{n}}\Big)\hat s_+ \le \Big(1 + \frac{c^*}{c_*}\Big)\lambda s_i.$$
Recalling the condition in Theorem 1, $\lambda \ge 4C\frac{c^*}{c_*}\sqrt{\frac{\log p}{n}}$, we have
$$\frac{1}{2}\lambda\hat s_+ \le \Big(\lambda - 2C\frac{c^*}{c_*}\sqrt{\frac{\log p}{n}}\Big)\hat s_+ \le \Big(1 + \frac{c^*}{c_*}\Big)\lambda s_i,$$
and hence $\hat s_+ \le (2 + 2c^*/c_*)s_i$. Thus, we obtain
$$\hat s_+ \le s_R \ \Longrightarrow\ \hat s_+ \le (2 + 2c^*/c_*)s_i. \qquad (27)$$
Step 3. Note that $\hat s_+ = \hat s_+(\lambda)$ is a function of $\lambda$. Denote
$$\hat s_{+,\min} := \min_{\lambda \ge 4Cc^*/c_*\sqrt{\log p/n}}\hat s_+(\lambda), \qquad \hat s_{+,\max} := \max_{\lambda \ge 4Cc^*/c_*\sqrt{\log p/n}}\hat s_+(\lambda).$$
Because of the continuity of the Lasso path, variables enter or leave the model one at a time; therefore, when $\lambda \ge 4Cc^*/c_*\sqrt{\log p/n}$, $\hat s_+(\lambda)$ takes every integer value from $\hat s_{+,\min}$ to $\hat s_{+,\max}$. When $\lambda = \infty$, we have $A_1 = A_2$ and thus
$$\hat s_{+,\min} = s_i \le (2 + 2c^*/c_*)s_i.$$
If $\hat s_{+,\max} > s_R$, then by the definition of $s_R$ in Proposition 1 we have
$$\hat s_{+,\max} > s_R > (2 + 2c^*/c_*)s_i + 1.$$
Then there exists a $\lambda \ge 4Cc^*/c_*\sqrt{\log p/n}$ such that $\hat s_+ = \lfloor(2 + 2c^*/c_*)s_i + 1\rfloor$, which contradicts (27), since this $\hat s_+$ satisfies $\hat s_+ \le s_R$ and $\hat s_+ > (2 + 2c^*/c_*)s_i$. By contradiction, $\hat s_{+,\max} \le s_R$, and the result follows.

Next, we prove the estimation and prediction error bounds for the nodewise Lasso estimator defined in (7); these bounds are used in the proof of Theorem 2. Recall that the bounds are
$$\|\hat\gamma_j - \gamma_j\|_1 = O_p\big(q_j\sqrt{\log p/n}\big), \qquad (28)$$
$$\|X_{-j}(\hat\gamma_j - \gamma_j)\|_2^2/n = O_p(q_j\log p/n), \qquad (29)$$
where $j = 1, \ldots, p$.

Proof of Proposition 2. For $j = 1, \ldots, p$, since $\hat\gamma_j$ is the minimizer
$$\hat\gamma_j := \mathop{\mathrm{argmin}}_{\gamma \in \mathbb{R}^{p-1}}\big\{\|X_j - X_{-j}\gamma\|_2^2/n + 2\lambda_j\|\gamma\|_1\big\},$$
we obtain the basic inequality
$$\|X_j - X_{-j}\hat\gamma_j\|_2^2/n + 2\lambda_j\|\hat\gamma_j\|_1 \le \|X_j - X_{-j}\gamma_j\|_2^2/n + 2\lambda_j\|\gamma_j\|_1.$$
Denote $\delta := \hat\gamma_j - \gamma_j$. By simple algebra, we have
$$\|X_{-j}\delta\|_2^2/n \le 2\tilde\varepsilon^TX_{-j}\delta/n + 2\lambda_j(\|\gamma_j\|_1 - \|\hat\gamma_j\|_1), \qquad (30)$$
where $\tilde\varepsilon := X_j - X_{-j}\gamma_j$. Since $\mathrm{Cov}(X_{-j}, \tilde\varepsilon) = 0$, (10) yields
$$P\big(\|\tilde\varepsilon^TX_{-j}/n\|_\infty > Q\eta\big) \le 6p\exp\big[-c_2 n\min\{\eta, \eta^2\}\big].$$
With $\eta = c\sqrt{\log p/n}$ and a suitably chosen $\lambda_j \asymp \sqrt{\log p/n}$, we have, with probability tending to one,
$$\|\tilde\varepsilon^TX_{-j}/n\|_\infty \le Qc\sqrt{\log p/n} \le \lambda_j/2,$$
and therefore
$$2\tilde\varepsilon^TX_{-j}\delta/n \le 2\|\tilde\varepsilon^TX_{-j}/n\|_\infty\|\delta\|_1 \le \lambda_j\|\delta\|_1.$$
Denote the support set of $\gamma_j$ by $K$. By the triangle inequality and $\gamma_{jK^C} = 0$, we have
$$\|\gamma_j\|_1 - \|\hat\gamma_j\|_1 = \|\gamma_{jK}\|_1 - \|\hat\gamma_{jK}\|_1 + \|\gamma_{jK^C}\|_1 - \|\hat\gamma_{jK^C}\|_1 \le \|\delta_K\|_1 - \|\delta_{K^C}\|_1.$$
Therefore, (30) becomes
$$0 \le \|X_{-j}\delta\|_2^2/n \le \lambda_j\|\delta\|_1 + 2\lambda_j(\|\delta_K\|_1 - \|\delta_{K^C}\|_1) = \lambda_j(\|\delta_K\|_1 + \|\delta_{K^C}\|_1) + 2\lambda_j(\|\delta_K\|_1 - \|\delta_{K^C}\|_1) = 3\lambda_j\|\delta_K\|_1 - \lambda_j\|\delta_{K^C}\|_1 \le 3\lambda_j\|\delta\|_1. \qquad (31)$$
Inequality (31) implies $\|\delta_{K^C}\|_1 \le 3\|\delta_K\|_1$, so that
$$\|\delta\|_1 \le 4\|\delta_K\|_1 \le 4\sqrt{q_j}\,\|\delta\|_2. \qquad (32)$$
By (11) and (32), we have
$$\|X_{-j}\delta\|_2^2/n \ge \alpha\|\delta\|_2^2 - \tau\|\delta\|_1^2 \ge (\alpha - 16\tau q_j)\|\delta\|_2^2. \qquad (33)$$
By (31), (33), $\tau q_j = O(1)$ and Assumption 1 ($\lambda_j \asymp \sqrt{\log p/n}$), we obtain (28). Then, by (31), (28) and Assumption 1, we obtain (29).
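The nodewise Lasso regressions analyzed in Proposition 2, and the matrices $\hat C$, $\hat T^2$ and $\hat\Theta$ recalled in Step 1 of the next proof, can be computed column by column. The following is a minimal sketch, assuming a common tuning parameter for every regression and using scikit-learn's coordinate-descent Lasso as a stand-in for whatever solver and tuning rule are actually used in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, lam):
    """Regress each column X_j on X_{-j} with the Lasso; return Theta_hat = T_hat^{-2} C_hat."""
    n, p = X.shape
    C = np.eye(p)                      # row j will hold (1, -gamma_hat_j) in the right slots
    tau2 = np.zeros(p)
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        # sklearn minimizes ||y - Zb||_2^2 / (2n) + alpha * ||b||_1, which has the same
        # minimizer as ||y - Zb||_2^2 / n + 2 * lam * ||b||_1 when alpha = lam
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X_minus_j, X[:, j])
        gamma_j = fit.coef_
        resid = X[:, j] - X_minus_j @ gamma_j                      # Z_hat_j
        tau2[j] = resid @ resid / n + lam * np.abs(gamma_j).sum()  # tau_hat_j^2
        C[j, np.arange(p) != j] = -gamma_j
    Theta_hat = C / tau2[:, None]      # T_hat^{-2} C_hat: row j divided by tau_hat_j^2
    return Theta_hat, tau2
```

Row $j$ of the returned matrix then plays the role of $\hat\Theta_j$ in (35)-(38) below.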
Compared with the proof of the validity of the de-biased Lasso in high-dimensional sparse linear regression models (van de Geer et al., 2014), the proof of Theorem 2 is more challenging because of the dependence structure of the data generating process. We denote the nodewise Lasso residuals by $\hat Z_j = (\hat Z_{1j}, \ldots, \hat Z_{nj})^T := X_j - X_{-j}\hat\gamma_j$, which is an estimator of $Z_j = (Z_{1j}, \ldots, Z_{nj})^T := X_j - X_{-j}\gamma_j$. We also denote $\tau_j^2 := E(\|Z_j\|_2^2/n)$. Unlike in the proof for high-dimensional sparse linear regression models, we must distinguish $\hat Z_j$ from $Z_j$ carefully because of the correlation between $X_{-j}$ and $\hat\gamma_j$.

Proof of Theorem 2. Recall that
$$\hat a_i - a_i = \hat\Theta X^T\varepsilon_i/n + (I - \hat\Theta\hat\Sigma)(\hat a^{Lasso}_i - a_i). \qquad (34)$$
We prove the theorem in three steps. Step 1 proves that the second term on the right-hand side of (34) is asymptotically negligible. Step 2 proves that the first term is asymptotically normal. Step 3 proves that our variance estimator, $\hat\sigma_i^2 = \|\hat\varepsilon_i\|_2^2/(n - \hat s_i)$, is consistent.
Step 1. Recall that
$$\hat C := \begin{pmatrix} 1 & -\hat\gamma_{1,2} & \cdots & -\hat\gamma_{1,p} \\ -\hat\gamma_{2,1} & 1 & \cdots & -\hat\gamma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat\gamma_{p,1} & -\hat\gamma_{p,2} & \cdots & 1 \end{pmatrix}, \qquad \hat\tau_j^2 := \|X_j - X_{-j}\hat\gamma_j\|_2^2/n + \lambda_j\|\hat\gamma_j\|_1,$$
$$\hat T^2 := \mathrm{diag}(\hat\tau_1^2, \ldots, \hat\tau_p^2), \qquad \hat\Theta := \hat T^{-2}\hat C.$$
We denote the $j$th row of $\hat\Theta$ by $\hat\Theta_j$ and obtain
$$X\hat\Theta_j^T/n = (X_j - X_{-j}\hat\gamma_j)/(n\hat\tau_j^2). \qquad (35)$$
By the KKT conditions for the nodewise Lasso, we have
$$-X_{-j}^T(X_j - X_{-j}\hat\gamma_j)/n + \lambda_j\hat\kappa_j = 0, \qquad (36)$$
where $\hat\kappa_j$ is the sub-gradient of the $\ell_1$ norm and satisfies $\|\hat\kappa_j\|_\infty \le 1$ and $\hat\kappa_{jk} = \mathrm{sign}(\hat\gamma_{jk})$ if $\hat\gamma_{jk} \ne 0$. Multiplying both sides of (36) by $\hat\gamma_j^T$, we obtain
$$\lambda_j\|\hat\gamma_j\|_1 = \hat\gamma_j^TX_{-j}^T(X_j - X_{-j}\hat\gamma_j)/n.$$
Substituting this into the definition of $\hat\tau_j^2$, we have
$$\hat\tau_j^2 := \|X_j - X_{-j}\hat\gamma_j\|_2^2/n + \lambda_j\|\hat\gamma_j\|_1 = X_j^T(X_j - X_{-j}\hat\gamma_j)/n.$$
Combining this with (35), we have
$$X_j^TX\hat\Theta_j^T/n = X_j^T(X_j - X_{-j}\hat\gamma_j)/(n\hat\tau_j^2) = 1. \qquad (37)$$
By (35) and (36), we have
$$\|X_{-j}^TX\hat\Theta_j^T/n\|_\infty = \|X_{-j}^T(X_j - X_{-j}\hat\gamma_j)/(n\hat\tau_j^2)\|_\infty \le \lambda_j/\hat\tau_j^2. \qquad (38)$$
By (37) and (38), we have
$$\|I - \hat\Theta\hat\Sigma\|_\infty = \max_j\|e_j - X^TX\hat\Theta_j^T/n\|_\infty \le \max_j\lambda_j/\hat\tau_j^2.$$
By H\"older's inequality, we have
$$\sqrt{n}\,\|(I - \hat\Theta\hat\Sigma)(\hat a^{Lasso}_i - a_i)\|_\infty \le \sqrt{n}\,\|I - \hat\Theta\hat\Sigma\|_\infty\|\hat a^{Lasso}_i - a_i\|_1 \le \sqrt{n}\,\Big(\max_j\lambda_j/\hat\tau_j^2\Big)\|\hat a^{Lasso}_i - a_i\|_1. \qquad (39)$$
By Theorem 1, we have $\|\hat a^{Lasso}_i - a_i\|_1 = O_p(s_i\sqrt{\log p/n})$. Next, we intend to prove that, for $j = 1, \ldots, p$,
$$1/\hat\tau_j^2 = O_p(1). \qquad (40)$$
Since the proofs are the same for different $j$, for simplicity we only consider $j = 1$. By simple algebra, we have the following bounds for $\tau_1^2$, the population-level counterpart of $\hat\tau_1^2$:
$$1/\tau_1^2 = \Theta_{11} = e_1^T\Sigma^{-1}e_1 \le 1/\Lambda_{\min}(\Sigma), \qquad \tau_1^2 \le E[X_{11}^2] = \Sigma_{1,1} = e_1^T\Sigma e_1 \le \Lambda_{\max}(\Sigma), \qquad (41)$$
where $1/\Lambda_{\min}(\Sigma) = O(1)$ and $\Lambda_{\max}(\Sigma) = O(1)$ by Proposition 3. By the definition of $\hat\tau_1^2$ and the Cauchy--Schwarz inequality, we have
$$|\hat\tau_1^2 - \tau_1^2| \le \big|\|\hat Z_1\|_2^2/n - \tau_1^2\big| + \lambda_1\|\hat\gamma_1\|_1 \le \big|\|Z_1\|_2^2/n - \tau_1^2\big| + \|\hat Z_1 - Z_1\|_2^2/n + 2\|Z_1\|_2\|\hat Z_1 - Z_1\|_2/n + \lambda_1\|\hat\gamma_1\|_1. \qquad (42)$$
Since $\|\hat Z_1 - Z_1\|_2^2/n = \|X_{-1}(\hat\gamma_1 - \gamma_1)\|_2^2/n = O_p(q_1\log p/n)$ has already been bounded in (29), it remains to work out $\big|\|Z_1\|_2^2/n - \tau_1^2\big|$ and $\|\hat\gamma_1\|_1$.

Before that, we first bound the corresponding population-level terms $\|\gamma_1\|_2$ and $\|\gamma_1\|_1$. Recall that $\gamma_1 = \Sigma^{-1}_{-1,-1}\Sigma_{-1,1}$. Partition the covariance matrix $\Sigma$ as
$$\Sigma = \begin{pmatrix} \Sigma_{1,1} & \Sigma_{1,-1} \\ \Sigma_{-1,1} & \Sigma_{-1,-1} \end{pmatrix}.$$
We denote $\Sigma_{1\cdot} := \Sigma_{1,1} - \Sigma_{1,-1}\Sigma^{-1}_{-1,-1}\Sigma_{-1,1} = \Sigma_{1,1} - \gamma_1^T\Sigma_{-1,-1}\gamma_1$. Since $|\Sigma_{1\cdot}|\,|\Sigma_{-1,-1}| = |\Sigma| > 0$, $|\Sigma_{-1,-1}| > 0$, and $\Sigma_{1\cdot}$ is a one-by-one matrix (so its determinant is itself), we have $\Sigma_{1,1} - \gamma_1^T\Sigma_{-1,-1}\gamma_1 > 0$. Therefore,
$$\Sigma_{1,1} \ge \gamma_1^T\Sigma_{-1,-1}\gamma_1 \ge \|\gamma_1\|_2^2\Lambda_{\min}(\Sigma_{-1,-1}) \ge \|\gamma_1\|_2^2\Lambda_{\min}(\Sigma).$$
We obtain
$$\|\gamma_1\|_2 \le \sqrt{\Sigma_{1,1}/\Lambda_{\min}(\Sigma)} = O(1). \qquad (43)$$
By the Cauchy--Schwarz inequality,
$$\|\gamma_1\|_1 \le \sqrt{\|\gamma_1\|_0}\,\|\gamma_1\|_2 \le \sqrt{q_1}\,\|\gamma_1\|_2 \le \sqrt{q_1\Sigma_{1,1}/\Lambda_{\min}(\Sigma)} = O(\sqrt{q_1}). \qquad (44)$$
Then we study the term $\|Z_1\|_2^2/n$. With $\tilde v := (1, -\gamma_{1,2}, \ldots, -\gamma_{1,p})^T/\sqrt{\|\gamma_1\|_2^2 + 1}$, we have
$$\big|\|Z_1\|_2^2/n - \tau_1^2\big| = \big|\|X_1 - X_{-1}\gamma_1\|_2^2/n - E(\|X_1 - X_{-1}\gamma_1\|_2^2/n)\big| = (\|\gamma_1\|_2^2 + 1)\,\big|\tilde v^T(X^TX/n - \Sigma)\tilde v\big| = O_p\big(\sqrt{\log p/n}\big) = o_p(1), \qquad (45)$$
where the first equality is due to the definitions of $\|Z_1\|_2^2$ and $\tau_1^2$, the second is due to the definition of $\tilde v$, the third is due to (43) and (9) with $\eta = c\sqrt{\log p/n}$, and the fourth is due to Assumption 2. Now we can bound $\|\hat\gamma_1\|_1$:
$$\lambda_1\|\hat\gamma_1\|_1 \le \lambda_1\|\gamma_1\|_1 + \lambda_1\|\hat\gamma_1 - \gamma_1\|_1 = O_p\big(\sqrt{q_1\log p/n}\big) + O_p(q_1\log p/n), \qquad (46)$$
where the inequality is due to the triangle inequality and the equality is due to (44), (28) and Assumption 1. Continuing (42) with the results (29), (45), (41), (46) and Assumption 3, we have
$$|\hat\tau_1^2 - \tau_1^2| \le \big|\|Z_1\|_2^2/n - \tau_1^2\big| + \|\hat Z_1 - Z_1\|_2^2/n + 2\|Z_1\|_2\|\hat Z_1 - Z_1\|_2/n + \lambda_1\|\hat\gamma_1\|_1 = O_p\big(\sqrt{\log p/n}\big) + O_p(q_1\log p/n) + O_p(1)\,O_p\big(\sqrt{q_1\log p/n}\big) + O_p\big(\sqrt{q_1\log p/n}\big) + O_p(q_1\log p/n) = o_p(1). \qquad (47)$$
Combining this with (41), we obtain (40). We also summarize some of the intermediate results obtained above:
$$\big|\|Z_j\|_2^2/n - \tau_j^2\big| = o_p(1), \qquad (48)$$
$$|\hat\tau_j^2 - \tau_j^2| = o_p(1). \qquad (49)$$
By (39), (40) and Assumptions 1 and 2, we have
$$\sqrt{n}\,\|(I - \hat\Theta\hat\Sigma)(\hat a^{Lasso}_i - a_i)\|_\infty = o_p(1). \qquad (50)$$
Step 2. For $i, j = 1, \ldots, p$, recall that $\varepsilon_i = (e_{i1}, \ldots, e_{in})^T$; our goal is to prove that
$$\frac{1}{\sigma_i\|\hat Z_j\|_2}\sum_{t=1}^n\hat Z_{tj}e_{it} \stackrel{d}{\to} N(0, 1). \qquad (51)$$
Firstly, we prove the asymptotic normality of the sum of the martingale difference sequence $n^{-1/2}\sum_{t=1}^nZ_{tj}e_{it}$. Let $\mathcal{A}_t := \sigma(e_1, \ldots, e_t)$; then $Z_{tj}e_{it}$ is $\mathcal{A}_t$-measurable and $\mathcal{A}_{t-1}$ is contained in $\mathcal{A}_t$. Since $Z_{tj} = X_{tj} - \sum_{l\ne j}\gamma_{jl}X_{tl}$ is $\mathcal{A}_{t-1}$-measurable, we have
$$E(Z_{tj}e_{it}\mid\mathcal{A}_{t-1}) = 0. \qquad (52)$$
Let $V_n^2 := \sum_{t=1}^nE(n^{-1}Z_{tj}^2e_{it}^2\mid\mathcal{A}_{t-1}) = \sigma_i^2n^{-1}\sum_{t=1}^nZ_{tj}^2$ and $v_n^2 := E(V_n^2) = \sigma_i^2\tau_j^2$. By (48), we have
$$V_n^2v_n^{-2} \stackrel{P}{\to} 1. \qquad (53)$$
Let $\sigma_X^2 = \max_{1\le t\le n,\,1\le j\le p}\mathrm{Var}(X_{tj})$. By Proposition 3, we have $\sigma_X^2 = O(1)$. Since the elements of $X$ are Gaussian random variables, we have
$$P\Big(\|X\|_\infty > \sqrt{t^2 + 2\sigma_X^2\log(pn)}\Big) \le 2pn\exp\big\{-\big(t^2 + 2\sigma_X^2\log(pn)\big)/(2\sigma_X^2)\big\} \le 2\exp\big(-t^2/(2\sigma_X^2)\big).$$
Together with Assumption 2, we have $\|X\|_\infty = O_p(\sqrt{\log p})$. By the Cauchy--Schwarz inequality, we have
$$n^{-1-\nu/2}\sum_{t=1}^nE|Z_{tj}|^{2+\nu} \le \frac{\|Z_j\|_2^2}{n}\bigg(\frac{\|Z_j\|_\infty}{\sqrt{n}}\bigg)^\nu \le \frac{\|Z_j\|_2^2}{n}\bigg(\frac{(1 + \|\gamma_j\|_1)\|X\|_\infty}{\sqrt{n}}\bigg)^\nu = O_p\big[(q_j\log p/n)^{\nu/2}\big]. \qquad (54)$$
Then, we have, with probability tending to one,
$$v_n^{-2}n^{-1}\sum_{t=1}^nE\big(Z_{tj}^2e_{it}^2\,1\{|n^{-1/2}Z_{tj}e_{it}| \ge \rho v_n\}\big) \le v_n^{-2-\nu}n^{-1-\nu/2}\rho^{-\nu}\sum_{t=1}^nE|Z_{tj}e_{it}|^{2+\nu} \le v_n^{-2-\nu}n^{-1-\nu/2}\rho^{-\nu}L\sum_{t=1}^nE|Z_{tj}|^{2+\nu} \to 0, \qquad (55)$$
where the first inequality follows from bounding the indicator $1\{|n^{-1/2}Z_{tj}e_{it}| \ge \rho v_n\}$ by $\big(|n^{-1/2}Z_{tj}e_{it}|/(\rho v_n)\big)^\nu$, the second holds because $e_{it}$ is normal, so its $(2+\nu)$th absolute moment is bounded by a constant $L$, and the third is due to (54) and Assumption 3. Together with (52), (53) and (55), the martingale central limit theorem (Theorem 5.3.4 in Fuller (1996)) implies
$$\frac{1}{\sqrt{n}\,\sigma_i\tau_j}\sum_{t=1}^nZ_{tj}e_{it} \stackrel{d}{\to} N(0, 1). \qquad (56)$$
Secondly, we have
$$\frac{1}{\sqrt{n}}\sum_{t=1}^n(\hat Z_{tj} - Z_{tj})e_{it} = \frac{1}{\sqrt{n}}\sum_{t=1}^n\sum_{l\ne j}(\gamma_{jl} - \hat\gamma_{jl})X_{tl}e_{it} \le \sqrt{n}\,\|\gamma_j - \hat\gamma_j\|_1\|X^T\varepsilon_i/n\|_\infty = O_p\big(q_j\log p/\sqrt{n}\big), \qquad (57)$$
where the equality on the left is due to the definitions, the inequality is due to H\"older's inequality, and the last equality is due to (28) and (12). By (49), we have
$$\frac{\sqrt{n}\,\tau_j}{\|\hat Z_j\|_2} \stackrel{P}{\to} 1. \qquad (58)$$
By Slutsky's theorem, (56), (57), (58) and Assumption 3, we obtain (51).
Step 3. By Theorem 1, we have $\hat s_i = O_p(s_i) = o_p(n)$, so $(n - \hat s_i)\hat\sigma_i^2/n$ is asymptotically equivalent to $\hat\sigma_i^2$. We have
$$\big|(n - \hat s_i)\hat\sigma_i^2/n - \sigma_i^2\big| = \big|(\hat\varepsilon_i^T\hat\varepsilon_i - n\sigma_i^2)/n\big| = \big|(\hat\varepsilon_i - \varepsilon_i)^T(\hat\varepsilon_i - \varepsilon_i)/n + 2\varepsilon_i^T(\hat\varepsilon_i - \varepsilon_i)/n + (\varepsilon_i^T\varepsilon_i - n\sigma_i^2)/n\big| = \big|\|X(\hat a^{Lasso}_i - a_i)\|_2^2/n - 2\varepsilon_i^TX(\hat a^{Lasso}_i - a_i)/n + (\varepsilon_i^T\varepsilon_i - n\sigma_i^2)/n\big| = O_p(s_i\log p/n) + O_p\big(\sqrt{\log p/n}\big)O_p\big(s_i\sqrt{\log p/n}\big) + O_p(1/\sqrt{n}) = o_p(1),$$
where the first three equalities are due to the definitions and simple algebra, the fourth equality is due to Theorem 1, (12) and the central limit theorem ($\varepsilon_i^T\varepsilon_i - n\sigma_i^2 = O_p(\sqrt{n})$), and the fifth equality is due to Assumption 2. By Theorem 1 ($\hat s_i = O_p(s_i) = o_p(n)$), we obtain
$$\hat\sigma_i^2 - \sigma_i^2 = o_p(1). \qquad (59)$$
In all, with (50), (51) and (59), the results of Theorem 2 follow.
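To see how the quantities in (34) and in Steps 1-3 fit together computationally, the following is a hedged sketch of the de-biased estimator and the normal-approximation interval suggested by Theorem 2. It reuses the hypothetical nodewise_lasso function from the earlier sketch; the penalty levels, the use of one common nodewise penalty, and the plug-in standard error based on $\mathrm{diag}(\hat\Theta\hat\Sigma\hat\Theta^T)$ are illustrative simplifications rather than the paper's exact implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def debiased_lasso_ci(X, Y_i, lam, lam_node, level=0.95):
    """De-biased Lasso for one VAR equation with normal-approximation CIs (cf. Theorem 2)."""
    n, p = X.shape
    a_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, Y_i).coef_
    Theta_hat, _ = nodewise_lasso(X, lam_node)           # defined in the earlier sketch
    resid = Y_i - X @ a_lasso
    a_debiased = a_lasso + Theta_hat @ X.T @ resid / n   # de-biasing step behind (34)
    s_hat = np.count_nonzero(a_lasso)
    sigma2_hat = resid @ resid / (n - s_hat)             # variance estimator from Step 3
    # Plug-in standard errors: sigma_hat^2 * diag(Theta_hat Sigma_hat Theta_hat^T) / n
    Sigma_hat = X.T @ X / n
    se = np.sqrt(sigma2_hat * np.diag(Theta_hat @ Sigma_hat @ Theta_hat.T) / n)
    z = norm.ppf(0.5 + level / 2)
    return a_debiased, a_debiased - z * se, a_debiased + z * se
```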
A.2 Proofs of the theoretical results in Section 3.2

We denote $\hat\varepsilon^{\mathrm{cent}} := (\hat\varepsilon_{\mathrm{cent},1}, \ldots, \hat\varepsilon_{\mathrm{cent},n})^T$, where $\hat\varepsilon_{\mathrm{cent},t} := \hat\varepsilon_{it} - \bar{\hat\varepsilon}_{i\cdot}$; we omit the subscript $i$ for simplicity. When we condition on $X$ and $Y_i$ (and therefore on $\hat\varepsilon^{\mathrm{cent}}$), the only randomness comes from $\varepsilon_i^*$, which is generated by the residual bootstrap or the multiplier wild bootstrap. Since the two bootstrap methods have much in common in the proofs, we use the same symbol $\varepsilon_i^*$ for the residuals generated by either of them and discuss them separately only when necessary.

Proposition 6. (a) Under Assumption 5, we have
$$\|X\|_\infty = O_p\big(\sqrt{\log p}\big). \qquad (60)$$
(b) Under Assumption 5, for a tuning parameter satisfying $\lambda \asymp \sqrt{\log p/n}$, we have
$$E^*\big(\|\varepsilon_i^*\|_\phi^\phi/n\big) = O_p(1), \qquad (61)$$
where $2 \le \phi < \infty$. In particular, when $\phi = 2$, we have
$$E^*\big(\|\varepsilon_i^*\|_2^2/n\big) = O_p(1). \qquad (62)$$
Proof. Let $\sigma_X^2 = \max_{1\le t\le n,\,1\le j\le p}\mathrm{Var}(X_{tj})$. By Proposition 3, we have $\sigma_X^2 = O(1)$. Since the elements of $X$ are Gaussian random variables, we have
$$P\Big(\|X\|_\infty > \sqrt{t^2 + 2\sigma_X^2\log(pn)}\Big) \le 2pn\exp\big\{-\big(t^2 + 2\sigma_X^2\log(pn)\big)/(2\sigma_X^2)\big\} \le 2\exp\big(-t^2/(2\sigma_X^2)\big).$$
Together with Assumption 5, (60) follows.

Since $\varepsilon_i = (e_{i1}, \ldots, e_{in})^T$ is Gaussian, we have $E\|\varepsilon_i\|_\phi^\phi = O(n)$. Denote $\bar\varepsilon_{i\cdot} := \sum_{t=1}^ne_{it}/n$; then
$$|\bar{\hat\varepsilon}_{i\cdot}| \le |\bar{\hat\varepsilon}_{i\cdot} - \bar\varepsilon_{i\cdot}| + |\bar\varepsilon_{i\cdot}| \le \|\hat\varepsilon_i - \varepsilon_i\|_1/n + |\bar\varepsilon_{i\cdot}| = o_p(1),$$
where the first two inequalities are due to the triangle inequality and the last equality is due to Theorem 1, Assumption 5 and the strong law of large numbers. By Theorem 1, (60) and Assumption 5, we have
$$\|\hat\varepsilon^{\mathrm{cent}} - \varepsilon_i\|_\infty \le |\bar{\hat\varepsilon}_{i\cdot}| + \|\hat\varepsilon_i - \varepsilon_i\|_\infty \le |\bar{\hat\varepsilon}_{i\cdot}| + \|X\|_\infty\|\hat a^{Lasso}_i - a_i\|_1 = o_p(1).$$
Thus, we have $\|\hat\varepsilon^{\mathrm{cent}}\|_\phi^\phi/n = O_p(1)$. For the residual bootstrap,
$$E^*\big(\|\varepsilon_i^*\|_\phi^\phi/n\big) = \|\hat\varepsilon^{\mathrm{cent}}\|_\phi^\phi/n = O_p(1).$$
For the multiplier wild bootstrap,
$$E^*\big(\|\varepsilon_i^*\|_\phi^\phi/n\big) = (E|W|^\phi)\,\|\hat\varepsilon^{\mathrm{cent}}\|_\phi^\phi/n = O_p(1).$$

Proof of Theorem 3 (a). Recall that we intend to prove
$$\|\hat a^{Lasso*}_i - \hat a^{Lasso}_i\|_1 = O_p\big(s_i\log p/\sqrt{n}\big), \qquad \|X(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)\|_2^2/n = O_p\big(s_i\log^2 p/n\big).$$
By Theorem 1, we have $\hat s_i = O_p(s_i)$; thus, we only need to prove
$$\|\hat a^{Lasso*}_i - \hat a^{Lasso}_i\|_1 = O_p\big(\hat s_i\log p/\sqrt{n}\big), \qquad (63)$$
$$\|X(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)\|_2^2/n = O_p\big(\hat s_i\log^2 p/n\big). \qquad (64)$$
Recall that
$$\hat a^{Lasso*}_i := \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^p}\big\{\|Y_i^* - X\alpha\|_2^2/n + 2\lambda^*\|\alpha\|_1\big\}.$$
We obtain the basic inequality
$$\|Y_i^* - X\hat a^{Lasso*}_i\|_2^2/n + 2\lambda^*\|\hat a^{Lasso*}_i\|_1 \le \|Y_i^* - X\hat a^{Lasso}_i\|_2^2/n + 2\lambda^*\|\hat a^{Lasso}_i\|_1.$$
Denote $\delta^* := \hat a^{Lasso*}_i - \hat a^{Lasso}_i$. By simple algebra, we have
$$\|X\delta^*\|_2^2/n \le 2\varepsilon_i^{*T}X\delta^*/n + 2\lambda^*(\|\hat a^{Lasso}_i\|_1 - \|\hat a^{Lasso*}_i\|_1). \qquad (65)$$
We can obtain a bound for $\|X^T\varepsilon_i^*/n\|_\infty$ as follows:
$$P^*\Big(\|X^T\varepsilon_i^*/n\|_\infty > C\frac{\log p}{\sqrt{n}}\Big) \le \frac{E^*\big(\|X^T\varepsilon_i^*\|_\infty^2\big)}{C^2n\log^2p} \le \frac{8\log(2p)\,\|X\|_\infty^2E^*\big(\|\varepsilon_i^*\|_2^2\big)}{C^2n\log^2p},$$
where the first inequality is due to Markov's inequality and the second inequality is due to Nemirovski's inequality and equation (6.5) in B\"uhlmann and van de Geer (2011). By (60) and (62), we have, in probability,
$$\|X^T\varepsilon_i^*/n\|_\infty \le C\frac{\log p}{\sqrt{n}}. \qquad (66)$$
With a suitably chosen $\lambda^* \asymp \log p/\sqrt{n}$, we have, in probability,
$$\|X^T\varepsilon_i^*/n\|_\infty \le C\frac{\log p}{\sqrt{n}} \le \lambda^*/2.$$
Furthermore, by H\"older's inequality, we have
$$2\varepsilon_i^{*T}X\delta^*/n \le 2\|X^T\varepsilon_i^*/n\|_\infty\|\delta^*\|_1 \le \lambda^*\|\delta^*\|_1.$$
Recall that $\hat S_i = \{j \in \{1, \ldots, p\}: \hat a^{Lasso}_{ij} \ne 0\}$. By the triangle inequality and $\hat a^{Lasso}_{i,\hat S_i^C} = 0$, we have
$$\|\hat a^{Lasso}_i\|_1 - \|\hat a^{Lasso*}_i\|_1 = \|\hat a^{Lasso}_{i,\hat S_i}\|_1 - \|\hat a^{Lasso*}_{i,\hat S_i}\|_1 + \|\hat a^{Lasso}_{i,\hat S_i^C}\|_1 - \|\hat a^{Lasso*}_{i,\hat S_i^C}\|_1 \le \|\delta^*_{\hat S_i}\|_1 - \|\delta^*_{\hat S_i^C}\|_1.$$
Therefore, (65) becomes
$$0 \le \|X\delta^*\|_2^2/n \le \lambda^*\|\delta^*\|_1 + 2\lambda^*(\|\delta^*_{\hat S_i}\|_1 - \|\delta^*_{\hat S_i^C}\|_1) = \lambda^*(\|\delta^*_{\hat S_i}\|_1 + \|\delta^*_{\hat S_i^C}\|_1) + 2\lambda^*(\|\delta^*_{\hat S_i}\|_1 - \|\delta^*_{\hat S_i^C}\|_1) = 3\lambda^*\|\delta^*_{\hat S_i}\|_1 - \lambda^*\|\delta^*_{\hat S_i^C}\|_1 \le 3\lambda^*\|\delta^*\|_1. \qquad (67)$$
Inequality (67) implies $\|\delta^*_{\hat S_i^C}\|_1 \le 3\|\delta^*_{\hat S_i}\|_1$, so that
$$\|\delta^*\|_1 \le 4\|\delta^*_{\hat S_i}\|_1 \le 4\sqrt{\hat s_i}\,\|\delta^*\|_2. \qquad (68)$$
By (11) and (68), we have
$$\|X\delta^*\|_2^2/n \ge \alpha\|\delta^*\|_2^2 - \tau\|\delta^*\|_1^2 \ge (\alpha - 16\tau\hat s_i)\|\delta^*\|_2^2. \qquad (69)$$
By Theorem 1 and Proposition 5, we have $\tau\hat s_i = O_p(\tau s_i) = O_p(1)$. Combining this with (67), (69) and Assumption 4 ($\lambda^* \asymp \log p/\sqrt{n}$), we obtain (63). Then, by (67), (63) and Assumption 4, we obtain (64).

Proof of Theorem 3 (b). By Theorem 1, we have $\hat s_i = O_p(s_i)$. To prove $\hat s_i^* = O_p(s_i)$, we only need to prove $\hat s_i^* = O_p(\hat s_i)$. The proof is the same as its counterpart in Theorem 1 (b) except for the bound (25): we only need to replace
$$\|X^T\varepsilon_i/n\|_\infty \le C\sqrt{\log p/n} \qquad\text{by}\qquad \|X^T\varepsilon_i^*/n\|_\infty \le C\log p/\sqrt{n},$$
where the latter inequality is due to (66). With $\lambda^* \ge 4C\frac{c^*}{c_*}\frac{\log p}{\sqrt{n}}$, the result follows from the same reasoning.

Similar to (34), we have the bootstrap analogue
$$\hat a_i^* - \hat a^{Lasso}_i = \hat\Theta X^T\varepsilon_i^*/n + (I - \hat\Theta\hat\Sigma)(\hat a^{Lasso*}_i - \hat a^{Lasso}_i).$$
Since we use the same $X$ for the nodewise Lasso during the bootstrap procedure, the bounds established in the proof of Theorem 2 still hold.
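Before the proof of Theorem 4, the following sketch shows how one bootstrap replicate of the Lasso fit analyzed in Theorem 3 can be generated from the centered residuals $\hat\varepsilon^{\mathrm{cent}}$, either by resampling them (residual bootstrap) or by multiplying them with independent multipliers (multiplier wild bootstrap). The standard-normal multipliers and the particular penalties passed in are illustrative assumptions; the paper's multiplier distribution and tuning rules may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bootstrap_lasso(X, Y_i, lam, lam_star, scheme="residual", rng=None):
    """One bootstrap replicate of the Lasso fit for a single VAR equation (cf. Theorem 3)."""
    rng = np.random.default_rng() if rng is None else rng
    n, _ = X.shape
    a_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, Y_i).coef_
    resid = Y_i - X @ a_lasso
    resid_cent = resid - resid.mean()                    # eps_hat_cent
    if scheme == "residual":
        eps_star = rng.choice(resid_cent, size=n, replace=True)
    else:                                                # multiplier wild bootstrap
        eps_star = rng.standard_normal(n) * resid_cent
    Y_star = X @ a_lasso + eps_star                      # bootstrap responses Y_i^*
    a_lasso_star = Lasso(alpha=lam_star, fit_intercept=False).fit(X, Y_star).coef_
    return a_lasso_star
```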
Proof of Theorem 4. We prove the theorem in three steps. Step 1 proves that $(I - \hat\Theta\hat\Sigma)(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)$ is asymptotically negligible. Step 2 proves that $\hat\Theta X^T\varepsilon_i^*/n$ is asymptotically normal. Step 3 proves that our variance estimator is consistent.
Step 1. By H\"older's inequality, we have
$$\sqrt{n}\,\|(I - \hat\Theta\hat\Sigma)(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)\|_\infty \le \sqrt{n}\,\|I - \hat\Theta\hat\Sigma\|_\infty\|\hat a^{Lasso*}_i - \hat a^{Lasso}_i\|_1 \le \sqrt{n}\,\Big(\max_j\lambda_j/\hat\tau_j^2\Big)\|\hat a^{Lasso*}_i - \hat a^{Lasso}_i\|_1. \qquad (70)$$
By Theorem 3, we have
$$\|\hat a^{Lasso*}_i - \hat a^{Lasso}_i\|_1 = O_p\big(s_i\log p/\sqrt{n}\big). \qquad (71)$$
Since we use the same $X$ for the nodewise Lasso during the bootstrap procedure, (40) still holds. Together with Assumptions 4 and 5, we have
$$\sqrt{n}\,\|(I - \hat\Theta\hat\Sigma)(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)\|_\infty = o_p(1). \qquad (72)$$
Step 2. Compared to its counterpart in Theorem 2, $\hat Z_{tj}$ is no longer correlated with $\varepsilon_i^*$ once we condition on $X$ and work under the bootstrap measure. By (61), the variables
$$\bigg\{\frac{\hat Z_{tj}\varepsilon_t^*}{\sqrt{\|\hat\varepsilon^{\mathrm{cent}}\|_2^2/n}\;\|\hat Z_j\|_2},\ t = 1, \ldots, n\bigg\}$$
are independent and satisfy the Lyapunov condition. By the central limit theorem,
$$\frac{1}{\sqrt{\|\hat\varepsilon^{\mathrm{cent}}\|_2^2/n}\;\|\hat Z_j\|_2}\sum_{t=1}^n\hat Z_{tj}\varepsilon_t^* \stackrel{d^*}{\to} N(0, 1) \quad\text{in probability}. \qquad (73)$$
Step 3. By Theorem 3, we have $\hat s_i^* = O_p(s_i) = o_p(n)$, so $(n - \hat s_i^*)\hat\sigma_i^{*2}/n$ is asymptotically equivalent to $\hat\sigma_i^{*2}$. Also, we have
$$\big|(n - \hat s_i^*)\hat\sigma_i^{*2}/n - \|\hat\varepsilon^{\mathrm{cent}}\|_2^2/n\big| = \big|(\hat\varepsilon_i^{*T}\hat\varepsilon_i^* - \|\hat\varepsilon^{\mathrm{cent}}\|_2^2)/n\big| = \big|(\hat\varepsilon_i^* - \varepsilon_i^*)^T(\hat\varepsilon_i^* - \varepsilon_i^*)/n + 2\varepsilon_i^{*T}(\hat\varepsilon_i^* - \varepsilon_i^*)/n + (\varepsilon_i^{*T}\varepsilon_i^* - \|\hat\varepsilon^{\mathrm{cent}}\|_2^2)/n\big| = \big|\|X(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)\|_2^2/n - 2\varepsilon_i^{*T}X(\hat a^{Lasso*}_i - \hat a^{Lasso}_i)/n + (\varepsilon_i^{*T}\varepsilon_i^* - \|\hat\varepsilon^{\mathrm{cent}}\|_2^2)/n\big| = O_p\big(s_i\log^2p/n\big) + O_p\big(\log p/\sqrt{n}\big)O_p\big(s_i\log p/\sqrt{n}\big) + O_p(1/\sqrt{n}) = o_p(1),$$
where the first three equalities are due to the definitions and simple algebra, the fourth equality is due to Theorem 3, (66), the central limit theorem, and (62), and the fifth equality is due to Assumption 5. By Theorem 3 ($\hat s_i^* = O_p(s_i) = o_p(n)$), we obtain
$$\hat\sigma_i^{*2} - \|\hat\varepsilon^{\mathrm{cent}}\|_2^2/n = o_p(1) \quad\text{in probability}. \qquad (74)$$
In all, with (72), (73) and (74), the results of Theorem 4 follow.
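Putting Theorems 3 and 4 together, one natural way to turn the bootstrap replicates into intervals is a percentile-type construction centered at the original Lasso fit, sketched below. This is only a plausible reading of how intervals could be assembled from the quantities defined in this appendix, not a verbatim description of the authors' BtLDPE/MultiBtLDPE algorithm; the number of replicates, the penalty levels, and the quantile rule are assumptions, and nodewise_lasso is the hypothetical helper from the earlier sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bootstrap_debiased_ci(X, Y_i, lam, lam_node, lam_star, B=500, level=0.95,
                          wild=False, rng=None):
    """Percentile-type bootstrap CIs for a_i built from de-biased bootstrap replicates."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    a_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, Y_i).coef_
    Theta_hat, _ = nodewise_lasso(X, lam_node)       # same X, so Theta_hat is computed once
    a_deb = a_lasso + Theta_hat @ X.T @ (Y_i - X @ a_lasso) / n
    resid_cent = Y_i - X @ a_lasso
    resid_cent = resid_cent - resid_cent.mean()
    draws = np.empty((B, p))
    for b in range(B):
        eps_star = (rng.standard_normal(n) * resid_cent if wild
                    else rng.choice(resid_cent, size=n, replace=True))
        Y_star = X @ a_lasso + eps_star
        a_lasso_star = Lasso(alpha=lam_star, fit_intercept=False).fit(X, Y_star).coef_
        # bootstrap analogue of (34): de-bias the replicate and center at the Lasso fit
        draws[b] = a_lasso_star + Theta_hat @ X.T @ (Y_star - X @ a_lasso_star) / n - a_lasso
    lo, hi = np.quantile(draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return a_deb, a_deb - hi, a_deb - lo             # basic-bootstrap-style interval for a_i
```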
Additional Simulation Results

Figures 7 and 8 show the coverage probabilities and the average confidence interval lengths for homoscedastic non-Gaussian errors, heteroscedastic Gaussian errors and heteroscedastic non-Gaussian errors. Compared to homoscedastic Gaussian errors, the different error distributions do not lead to significant differences in performance. Again, we can see that the LDPE has honest coverage probabilities and that the BtLDPE and MultiBtLDPE have shorter confidence intervals than the LDPE.

[Figure 7 about here: empirical coverage probabilities; columns: Lasso, BtLasso+OLS, LDPE, BtLDPE, MultiBtLDPE, JM; rows: homoscedastic Gaussian, homoscedastic non-Gaussian, heteroscedastic Gaussian, heteroscedastic non-Gaussian errors.]
Figure 7: Comparison of empirical coverage probabilities for 1000 replications produced by six methods (columns) in four cases (rows). We set n = 100 and s_i = 5. Different rows correspond to different distributions of errors. The index on the x-axis corresponds to different a_j's, which are arranged from small to large in absolute value. The first p - 5 elements of the a_j's are zeros (blue points) and the last 5 are non-zeros (red points). The black lines are the overall averages of the coverage probabilities for the zero and non-zero parameters, respectively. The red dashed lines correspond to the nominal confidence level 95%.

[Figure 8 about here: average confidence interval lengths; same layout of columns and rows as Figure 7.]
Figure 8: Comparison of average confidence interval lengths for 1000 replications produced by six methods (columns) in four cases (rows). We set n = 100 and s_i = 5. Different rows correspond to different distributions of errors. The index on the x-axis corresponds to different a_j's, which are arranged from small to large in absolute value. The first p - 5 elements of the a_j's are zeros (blue points) and the last 5 are non-zeros (red points). The black lines are the overall averages of the confidence interval lengths for the zero and non-zero parameters, respectively.