Regularized Estimation and Testing for High-Dimensional Multi-Block Vector-Autoregressive Models
Jiahe Lin∗, George Michailidis†

∗ Ph.D. Candidate, Department of Statistics, University of Michigan.
† Corresponding Author. Professor, Department of Statistics and the Informatics Institute, University of Florida, 205 Griffin-Floyd Hall, PO Box 118545, Gainesville, FL, 32611-8545, USA. Email: [email protected].

Abstract
Dynamical systems comprising multiple components that can be partitioned into distinct blocks originate in many scientific areas. A pertinent example is the interaction between financial assets and selected macroeconomic indicators, which has been studied extensively at the aggregate level (e.g., a stock index and an employment index) in the macroeconomics literature. A key shortcoming of this approach is that it ignores potential influences from other related components (e.g., Gross Domestic Product) that may exert influence on the system's dynamics and structure, and thus produces incorrect results. To mitigate this issue, we consider a multi-block linear dynamical system with Granger-causal ordering between blocks, wherein the blocks' temporal dynamics are described by vector autoregressive processes and are influenced by blocks higher in the system hierarchy. We derive the maximum likelihood estimator for the posited model for Gaussian data in the high-dimensional setting, based on appropriate regularization schemes for the parameters of the block components. To optimize the underlying non-convex likelihood function, we develop an iterative algorithm with convergence guarantees. We establish theoretical properties of the maximum likelihood estimates, leveraging the decomposability of the regularizers and a careful analysis of the iterates. Finally, we develop testing procedures for the null hypothesis that a block "Granger-causes" another block of variables. The performance of the model and the testing procedures is evaluated on synthetic data, and illustrated on a data set involving log-returns of the US S&P 100 component stocks and key macroeconomic variables for the 2001–16 period.
The study of linear dynamical systems has a long history in control theory (Kumar and Varaiya, 1986) and economics (Hansen and Sargent, 2013), due to their analytical tractability and the ease of estimating their parameters. Such systems in their so-called reduced form give rise to Vector Autoregressive (VAR) models (Lütkepohl, 2005) that have been widely used in macroeconomic modeling for policy analysis (Sims, 1980; Sims et al., 1982; Sims, 1992), in financial econometrics (Gourieroux and Jasiak, 2001), and more recently in functional genomics (Shojaie et al., 2012), financial systemic risk analysis (Billio et al., 2012), and neuroscience (Seth, 2013).

In many applications, the components of the system under consideration can be naturally partitioned into interacting blocks. For example, Cushman and Zha (1997) studied the impact of monetary policy in a small open economy, where the economy under consideration is modeled as one block, and variables in other (foreign) economies as the other. Both blocks have their own autoregressive structure, and the inter-dependence between blocks is unidirectional: the foreign block influences the small open economy, but not the other way around, thus effectively introducing a linear ordering amongst blocks. Another example comes from the connection between the stock market and employment macroeconomic variables (Fitoussi et al., 2000; Phelps, 1999; Farmer, 2015), which focuses on the impact of the former on the latter through a wealth effect mechanism. Once again, the underlying hypothesis of interest is that the stock market influences employment, but not the other way around. In another application domain, molecular biologists conduct time course experiments on cell lines or animal models and collect data across multiple molecular compartments (transcriptomics, proteomics, metabolomics, lipidomics) in order to delineate mechanisms of disease onset and progression, or to study basic biological processes. In this case, the interactions amongst the blocks (molecular compartments) are clearly delineated (the transcriptomic compartment influencing the proteomic and metabolomic ones), thus leading again to a linear ordering of the blocks (see Richardson et al., 2016).

The proposed model also encompasses the VAR-X model, popular in marketing, regional science and growth theory, provided that the temporal evolution of the exogenous block of variables "X" exhibits autoregressive dynamics. For example, Nijs et al. (2001) examine the sensitivity of over 500 product prices to various marketing promotion strategies (the exogenous block), while Pauwels and Weiss (2008) examine changes in subscription rates, search engine referrals and marketing efforts of customers when switched from a free account to a fee-based structure, the latter together with customer characteristics representing the exogenous block. Pesaran et al. (2004) examine regional inter-dependencies, building a model where country-specific macroeconomic indicators evolve according to a VAR model and are influenced exogenously by key macroeconomic variables from neighboring countries/regions.
Finally, Abeysinghe (2001) studies the impact of the price of oil on Gross Domestic Product growth rates for a number of countries, while controlling for other exogenous variables such as the country's consumption and investment expenditures along with its trade balance.

The proposed model gives rise to a network structure that in its most general form corresponds to a multi-partite graph, depicted in Figure 1 for 3 blocks, that exhibits a directed acyclic structure between the constituent blocks, and can also exhibit additional dependence between the nodes in each block. Selected properties of such multi-block structures, known as chain graphs (Drton and Perlman, 2008), have been studied in the literature. Further, their maximum likelihood estimation for independent and identically distributed Gaussian data under a high-dimensional sparse regime is thoroughly investigated in Lin et al. (2016), where a provably convergent estimation procedure is introduced and its theoretical properties are established.

[Figure 1: Diagram for a dynamic system with three groups of variables; the nodes $x$, $y$, $z$ are shown at times $t-1$, $t$ and $t+1$.]

Given the wide range of applications of multi-block VAR models, which in addition encompass the widely used VAR-X model, the key contributions of the current paper are fourfold: (i) formulating the model as a recursive dynamical system and examining its stability properties; (ii) developing a provably convergent algorithm for obtaining the regularized maximum likelihood estimates (MLE) of the model parameters under high-dimensional scaling; (iii) establishing theoretical properties of the ML estimates; and (iv) devising a testing procedure for the parameters that connect the constituent blocks of the model: if the null hypothesis is not rejected, then one is dealing with a set of independently evolving VAR models; otherwise, with the posited multi-block VAR model. Finally, the model, estimation and testing procedures are illustrated on an important problem in macroeconomics, as gleaned from the background of the problem and the discussion of the results provided in Section 6.

For the multi-block VAR model, we assume that the time series within each block are generated by a Gaussian VAR process. Further, the transition matrices within and across blocks can be either sparse or low rank. The posited regularized Gaussian likelihood function is not jointly convex in all the model parameters, which poses a number of technical challenges that are compounded by the presence of temporal dependence. These are successfully addressed and resolved in Section 3, where we provide a numerically convergent algorithm and establish the theoretical properties of the resulting ML estimates; this constitutes a key contribution to the study of multi-block VAR models.

The remainder of this paper is organized as follows. In Section 2, we introduce the model setup and the corresponding estimation procedure. In Section 3, we provide consistency properties of the obtained ML estimates under high-dimensional scaling. In Section 4, we introduce the proposed testing framework, both for low-rank and sparse interaction matrices between the blocks. Section 5 contains selected numerical results that assess the performance of the estimation and testing procedures. Finally, an application to financial and macroeconomic data that was previously discussed as motivation for the model under consideration is presented in Section 6.

Notation.
Throughout this paper, we use $|||A|||_1$ and $|||A|||_\infty$ respectively to denote the matrix induced 1-norm and infinity norm of $A \in \mathbb{R}^{m \times n}$, that is, $|||A|||_1 = \max_{1 \le j \le n} \sum_{i=1}^m |a_{ij}|$ and $|||A|||_\infty = \max_{1 \le i \le m} \sum_{j=1}^n |a_{ij}|$, and use $\|A\|_1$ and $\|A\|_\infty$ respectively to denote the element-wise 1-norm and infinity norm: $\|A\|_1 = \sum_{i,j}|a_{ij}|$, $\|A\|_\infty = \max_{i,j}|a_{ij}|$. Moreover, we use $|||A|||_*$, $|||A|||_F$ and $|||A|||_{\mathrm{op}}$ to denote the nuclear, Frobenius and operator norms of $A$, respectively. For two matrices $A$ and $B$ of commensurate dimensions, we denote their inner product by $\langle\langle A, B\rangle\rangle = \mathrm{trace}(A'B)$. Finally, we write $A \succsim B$ if there exists some absolute constant $c$, independent of the model parameters, such that $A \ge cB$.

To convey the main ideas and the key technical contributions, we consider a recursive linear dynamical system comprising two blocks of variables, whose structure is given by:
\[
X_t = A X_{t-1} + U_t, \qquad Z_t = B X_{t-1} + C Z_{t-1} + V_t, \tag{1}
\]
where $X_t \in \mathbb{R}^{p_1}$ and $Z_t \in \mathbb{R}^{p_2}$ are the variables in groups 1 and 2, respectively. The temporal intra-block dependence is captured by the transition matrices $A$ and $C$, while the inter-block dependence is captured by $B$. The noise processes $\{U_t\}$ and $\{V_t\}$ capture additional contemporaneous intra-block dependence of $X_t$ and $Z_t$, respectively, after conditioning on their past values. Further, we assume that $U_t$ and $V_t$ follow mean-zero Gaussian distributions with covariance matrices $\Sigma_u$ and $\Sigma_v$, i.e., $U_t \sim \mathcal{N}(0, \Sigma_u)$ and $V_t \sim \mathcal{N}(0, \Sigma_v)$.

With the above model setup, the parameters of interest are the transition matrices $A \in \mathbb{R}^{p_1 \times p_1}$, $B \in \mathbb{R}^{p_2 \times p_1}$ and $C \in \mathbb{R}^{p_2 \times p_2}$, as well as the covariances $\Sigma_u$ and $\Sigma_v$. In high-dimensional settings, different combinations of structural assumptions can be imposed on these transition matrices to enable their estimation from limited time series data. In particular, the intra-block transition matrices $A$ and $C$ are sparse, while the inter-block matrix $B$ can be either sparse or low rank. Note that the block of $X_t$ variables acts as an exogenous effect on the evolution of the $Z_t$ block (e.g., Cushman and Zha, 1997; Nicholson et al., 2016). Further, we assume that $\Omega_u := \Sigma_u^{-1}$ and $\Omega_v := \Sigma_v^{-1}$ are sparse.

Remark 1. For ease of exposition, we posit a VAR(1) modeling structure. Extensions to general multi-block structures akin to the one depicted in Figure 1 and to VAR($d$) specifications are rather straightforward, and are briefly discussed in Section 7.

The triangular (recursive) structure of the system enables a certain degree of separability between $X_t$ and $Z_t$. In the posited model, $X_t$ is a stand-alone VAR(1) process, and the time series in block $Z_t$ is "Granger-caused" by that in block $X_t$, but not vice versa. The second equation in (1), as mentioned in the introductory section, also corresponds to the so-called "VAR-X" model in the econometrics literature (e.g., Sims, 1980; Bianchi et al., 2010; Pesaran, 2015), which extends the standard VAR model to include influences from lagged values of exogenous variables. Consider the joint process $W_t = (X_t', Z_t')'$; it corresponds to a VAR(1) model whose transition matrix $G$ has a block lower-triangular form:
\[
W_t = G W_{t-1} + \varepsilon_t, \quad \text{where } G := \begin{bmatrix} A & O \\ B & C \end{bmatrix}, \quad \varepsilon_t = \begin{bmatrix} U_t \\ V_t \end{bmatrix}. \tag{2}
\]
The model in (2) can also be viewed from a Structural Equations Modeling viewpoint involving time series data, and also has a Moving Average representation corresponding to a structural VAR representation with Granger-causal ordering (Lütkepohl, 2005). As mentioned in the introductory section, the focus of this paper is model parameter estimation under high-dimensional scaling, rather than cause-and-effect relationships. For a comprehensive discourse on causality issues for VAR models, we refer the reader to Granger (1969), Lütkepohl (2005), and references therein.
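To make the data-generating mechanism in (1)–(2) concrete, the following minimal sketch simulates the joint block-triangular VAR(1) process. The dimensions, sparsity levels, noise scales and scaling constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, T = 10, 6, 500  # illustrative dimensions and sample size

def scale_to_radius(M, rho):
    """Rescale a square matrix so its spectral radius equals rho."""
    return M * (rho / max(np.abs(np.linalg.eigvals(M)).max(), 1e-12))

A = scale_to_radius(rng.normal(size=(p1, p1)) * (rng.random((p1, p1)) < 0.2), 0.5)
C = scale_to_radius(rng.normal(size=(p2, p2)) * (rng.random((p2, p2)) < 0.2), 0.5)
B = rng.normal(size=(p2, 1)) @ rng.normal(size=(1, p1))  # rank-1 inter-block matrix

# joint transition matrix G of the stacked process W_t = (X_t', Z_t')'
G = np.block([[A, np.zeros((p1, p2))], [B, C]])
print(np.abs(np.linalg.eigvals(G)).max())  # < 1: the joint process is stable

X = np.zeros((T + 1, p1))
Z = np.zeros((T + 1, p2))
for t in range(1, T + 1):
    X[t] = A @ X[t - 1] + rng.normal(scale=0.1, size=p1)               # U_t
    Z[t] = B @ X[t - 1] + C @ Z[t - 1] + rng.normal(scale=0.1, size=p2)  # V_t
```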
Next, we introduce the notions of stability and the spectrum of the system.

System Stability. To ensure that the joint process $\{W_t\}$ is stable (Lütkepohl, 2005), we require the spectral radius, denoted by $\rho(\cdot)$, of the transition matrix $G$ to be smaller than 1, which is guaranteed by requiring that $\rho(A) < 1$ and $\rho(C) < 1$, since
\[
|\lambda \mathrm{I}_{(p_1+p_2)} - G| = \begin{vmatrix} \lambda \mathrm{I}_{p_1} - A & O \\ -B & \lambda \mathrm{I}_{p_2} - C \end{vmatrix} = |\lambda \mathrm{I}_{p_1} - A|\,|\lambda \mathrm{I}_{p_2} - C|,
\]
implying that the set of eigenvalues of $G$ is the union of the sets of eigenvalues of $A$ and $C$; hence
\[
\rho(A) < 1,\ \rho(C) < 1 \ \Rightarrow\ \rho(G) = \max\{\rho(A), \rho(C)\} < 1.
\]
The latter relation implies that the stability of such a recursive system imposes spectrum constraints only on the diagonal blocks that govern the intra-block evolution, whereas the off-diagonal block that governs the inter-block interaction is left unrestricted.
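A quick numerical sanity check of the determinant factorization above; this is a sketch with small random matrices, and the inter-block matrix $B$ is deliberately made large to illustrate that it plays no role in the spectrum of $G$.

```python
import numpy as np

rng = np.random.default_rng(1)
p1, p2 = 5, 3
A = 0.4 * rng.normal(size=(p1, p1)) / np.sqrt(p1)
C = 0.4 * rng.normal(size=(p2, p2)) / np.sqrt(p2)
B = 10.0 * rng.normal(size=(p2, p1))   # large, unrestricted inter-block matrix

G = np.block([[A, np.zeros((p1, p2))], [B, C]])
eig_G = np.linalg.eigvals(G)

# every eigenvalue of A and of C reappears as an eigenvalue of G, regardless of B
for lam in np.concatenate([np.linalg.eigvals(A), np.linalg.eigvals(C)]):
    assert np.abs(eig_G - lam).min() < 1e-8

rho = lambda M: np.abs(np.linalg.eigvals(M)).max()
print(rho(G), max(rho(A), rho(C)))   # agree up to numerical error
```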
Spectrum of the joint process. Throughout, we assume that the spectral density of $\{W_t\}$ exists; it then possesses a special structure as a result of the block-triangular transition matrix $G$. Formally, we define the spectral density of $\{W_t\}$ as
\[
f_W(\theta) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \Gamma_W(h)\, e^{-ih\theta}, \qquad \theta \in [-\pi, \pi],
\]
where $\Gamma_W(h) := \mathbb{E}\, W_t W_{t+h}'$. For two (generic) processes $\{X_t\}$ and $\{Z_t\}$, define their cross-covariances as $\Gamma_{X,Z}(h) = \mathbb{E}\, X_t Z_{t+h}'$ and $\Gamma_{Z,X}(h) = \mathbb{E}\, Z_t X_{t+h}'$. In general, $\Gamma_{X,Z}(h) \neq \Gamma_{Z,X}(h)$. The cross-spectra are defined as:
\[
f_{X,Z}(\theta) := \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \Gamma_{X,Z}(h)\, e^{-ih\theta} \quad \text{and} \quad f_{Z,X}(\theta) := \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \Gamma_{Z,X}(h)\, e^{-ih\theta}, \qquad \theta \in [-\pi, \pi].
\]
For the model given in (2), by writing out the dynamics of $Z_t$, the cross-spectra between $X_t$ and $Z_t$ satisfy
\[
f_{X,Z}(\theta)\,(\mathrm{I}_{p_2} - C' e^{-i\theta}) = f_X(\theta)\, B' e^{-i\theta} \quad \text{and} \quad (\mathrm{I}_{p_2} - C e^{i\theta})\, f_{Z,X}(\theta) = B e^{i\theta} f_X(\theta). \tag{3}
\]
Similarly, we have
\[
(\mathrm{I}_{p_2} - C e^{i\theta})\, f_Z(\theta) = B e^{i\theta} f_{X,Z}(\theta) + f_{V,Z}(\theta). \tag{4}
\]
Combining (3) and (4), and after some algebra, the spectrum of the joint process $W_t$ is given by
\[
f_W(\theta) = \big[H_1(e^{i\theta})\big]^{-1} \Big( \big[H_2(e^{i\theta})\big] \big[\mathbf{1}_{2\times 2} \otimes f_X(\theta)\big] \big[H_2(e^{-i\theta})\big]^{\top} + \begin{bmatrix} O & O \\ O & \tfrac{1}{2\pi}\Sigma_v \end{bmatrix} \Big) \big[H_1(e^{-i\theta})\big]^{-\top}, \tag{5}
\]
where $\mathbf{1}_{2\times 2}$ is a $2 \times 2$ matrix of ones, and
\[
H_1(x) := \begin{bmatrix} \mathrm{I}_{p_1} & O \\ O & \mathrm{I}_{p_2} - Cx \end{bmatrix} \in \mathbb{R}^{(p_1+p_2)\times(p_1+p_2)}, \qquad H_2(x) := \begin{bmatrix} \mathrm{I}_{p_1} & O \\ O & Bx \end{bmatrix} \in \mathbb{R}^{(p_1+p_2)\times(2p_1)}.
\]
Equation (5) implies that the spectrum of the joint process $\{W_t\}$ can be decomposed into the sum of two parts: the first is a function of $f_X(\theta)$, while the second involves the embedded idiosyncratic error process $\{V_t\}$ of $\{Z_t\}$, which only affects the bottom-right block of the spectrum. Note that since $\{W_t\}$ is a VAR(1) process, its matrix-valued characteristic polynomial is given by
\[
\mathcal{G}(\theta) := \mathrm{I}_{(p_1+p_2)} - G\theta,
\]
and its spectral density also takes the following form (c.f. Hannan, 1970; Anderson, 1978):
\[
f_W(\theta) = \frac{1}{2\pi} \big[\mathcal{G}^{-1}(e^{i\theta})\big]\, \Sigma_\varepsilon\, \big[\mathcal{G}^{-1}(e^{i\theta})\big]^{*}, \quad \text{with } \mathcal{G}(x) = \begin{bmatrix} \mathrm{I}_{p_1} - Ax & O \\ -Bx & \mathrm{I}_{p_2} - Cx \end{bmatrix}, \quad \Sigma_\varepsilon = \begin{bmatrix} \Sigma_u & O \\ O & \Sigma_v \end{bmatrix},
\]
and $\mathcal{G}^{*}$ being the conjugate transpose. One can easily reach the same conclusion as in (5) by multiplying out each term, followed by some algebraic manipulations.
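As an illustration, the spectral density of the joint VAR(1) process can be evaluated directly from its characteristic polynomial. Below is a minimal sketch; all matrices are small illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
p1, p2 = 4, 3
A, C = 0.3 * np.eye(p1), 0.3 * np.eye(p2)
B = rng.normal(size=(p2, p1))
G = np.block([[A, np.zeros((p1, p2))], [B, C]])
Sigma_eps = np.eye(p1 + p2)  # block-diagonal Sigma_u, Sigma_v (identity here)

def f_W(theta):
    """f_W(theta) = (1/2pi) G(e^{i theta})^{-1} Sigma_eps G(e^{i theta})^{-*}."""
    Ginv = np.linalg.inv(np.eye(p1 + p2) - G * np.exp(1j * theta))
    return (Ginv @ Sigma_eps @ Ginv.conj().T) / (2 * np.pi)

# for a stable system, f_W is Hermitian positive definite at every frequency
S = f_W(0.7)
assert np.allclose(S, S.conj().T) and np.linalg.eigvalsh(S).min() > 0
```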
Next, we outline the algorithm for obtaining the ML estimates of the transition matrices $A$, $B$ and $C$ and the inverse covariance matrices $\Sigma_u^{-1}$ and $\Sigma_v^{-1}$ from time series data. We allow for a potentially high-dimensional setting, where the ambient dimensions $p_1$ and $p_2$ of the model exceed the total number of observations $T$.

Given centered time series data $\{x_0, \cdots, x_T\}$ and $\{z_0, \cdots, z_T\}$, we use $\mathcal{X}^T$ and $\mathcal{Z}^T$, respectively, to denote the "response" matrices from time 1 to $T$, that is,
\[
\mathcal{X}^T = \begin{bmatrix} x_1 & x_2 & \dots & x_T \end{bmatrix}', \qquad \mathcal{Z}^T = \begin{bmatrix} z_1 & z_2 & \dots & z_T \end{bmatrix}',
\]
and use $\mathcal{X}$ and $\mathcal{Z}$ without the superscript to denote the "design" matrices from time 0 to $T-1$:
\[
\mathcal{X} = \begin{bmatrix} x_0 & x_1 & \dots & x_{T-1} \end{bmatrix}', \qquad \mathcal{Z} = \begin{bmatrix} z_0 & z_1 & \dots & z_{T-1} \end{bmatrix}'.
\]
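In code, with the series stored row-wise, these response and design matrices are just shifted slices; a small sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(5)
T, p1, p2 = 200, 10, 6
X = rng.normal(size=(T + 1, p1))   # placeholder series x_0, ..., x_T
Z = rng.normal(size=(T + 1, p2))   # placeholder series z_0, ..., z_T

X_resp, Z_resp = X[1:], Z[1:]      # "response" matrices: rows x_1', ..., x_T'
X_des, Z_des = X[:-1], Z[:-1]      # "design" matrices:   rows x_0', ..., x_{T-1}'
```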
We use $\mathcal{U}$ and $\mathcal{V}$ to denote the corresponding error matrices. To obtain estimates of the parameters of interest, we formulate optimization problems using a penalized log-likelihood function, with regularization terms corresponding to the structural assumptions imposed on the model parameters (sparsity and/or low-rankness). To solve the optimization problems, we employ block-coordinate descent algorithms, akin to those described in Lin et al. (2016).

As previously mentioned, $\{X_t\}$ is not "Granger-caused" by $Z_t$ and hence is a stand-alone VAR(1) process; this enables us to estimate the parameters governing the $X_t$ process ($A$ and $\Sigma_u^{-1}$) separately from those of the $Z_t$ process ($B$, $C$ and $\Sigma_v^{-1}$).

Estimation of $A$ and $\Sigma_u^{-1}$. Conditional on the initial observation $x_0$, the likelihood of $\{x_t\}_{t=1}^T$ is given by:
\[
\mathcal{L}(x_T, x_{T-1}, \cdots, x_1 \mid x_0) = \mathcal{L}(x_T \mid x_{T-1}, \cdots, x_0) \cdots \mathcal{L}(x_1 \mid x_0) = \mathcal{L}(x_T \mid x_{T-1})\, \mathcal{L}(x_{T-1} \mid x_{T-2}) \cdots \mathcal{L}(x_1 \mid x_0),
\]
where the second equality follows from the Markov property of the process. The log-likelihood function is then
\[
\ell(A, \Sigma_u^{-1}) = \frac{T}{2} \log\det(\Sigma_u^{-1}) - \frac{1}{2} \sum_{t=1}^T (x_t - A x_{t-1})' \Sigma_u^{-1} (x_t - A x_{t-1}) + \text{constant}.
\]
Letting $\Omega_u := \Sigma_u^{-1}$, the penalized maximum likelihood estimator can be written as
\[
(\widehat{A}, \widehat{\Omega}_u) = \mathop{\mathrm{argmin}}_{A \in \mathbb{R}^{p_1 \times p_1},\ \Omega_u \in \mathcal{S}^{p_1 \times p_1}_{++}} \Big\{ \mathrm{tr}\big[\Omega_u (\mathcal{X}^T - \mathcal{X} A')'(\mathcal{X}^T - \mathcal{X} A')/T\big] - \log\det\Omega_u + \lambda_A \|A\|_1 + \rho_u \|\Omega_u\|_{1,\mathrm{off}} \Big\}. \tag{6}
\]
Algorithm 1 describes the key steps for obtaining $\widehat{A}$ and $\widehat{\Omega}_u$.
Algorithm 1: Computational procedure for estimating $A$ and $\Sigma_u^{-1}$.
Input:
Time series data $\{x_t\}_{t=0}^{T}$, tuning parameters $\lambda_A$ and $\rho_u$.
Initialization:
Initialize with $\widehat{\Omega}^{(0)}_u = \mathrm{I}_{p_1}$; then
\[
\widehat{A}^{(0)} = \mathop{\mathrm{argmin}}_{A} \Big\{ \tfrac{1}{T} \big|\big|\big|\mathcal{X}^T - \mathcal{X} A'\big|\big|\big|_F^2 + \lambda_A \|A\|_1 \Big\}.
\]
Iterate until convergence:
– Update $\widehat{\Omega}^{(k)}_u$ by applying the graphical Lasso (Friedman et al., 2008) to the residuals based on the plug-in estimate $\widehat{A}^{(k)}$;
– Update $\widehat{A}^{(k)}$ with the plug-in $\widehat{\Omega}^{(k-1)}_u$, cyclically updating each row with a Lasso penalty, which solves
\[
\min_{A} \Big\{ \mathrm{tr}\big[\widehat{\Omega}^{(k-1)}_u (\mathcal{X}^T - \mathcal{X} A')'(\mathcal{X}^T - \mathcal{X} A')/T\big] + \lambda_A \|A\|_1 \Big\}. \tag{7}
\]
Output: Estimated sparse transition matrix $\widehat{A}$ and sparse inverse covariance matrix $\widehat{\Omega}_u$.

Specifically, for fixed $\widehat{\Omega}_u$, each row $j = 1, \dots, p_1$ of $\widehat{A}$ is cyclically updated by:
\[
\widehat{A}^{[s+1]}_{j\cdot} = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^{p_1}} \Big\{ \frac{\widehat{\omega}^{jj}_u}{T} \big\| \mathcal{X}^T_{\cdot j} + r^{[s+1]}_j - \mathcal{X}\beta \big\|^2 + \lambda_A \|\beta\|_1 \Big\},
\]
where
\[
r^{[s+1]}_j = \frac{1}{\widehat{\omega}^{jj}_u} \Big[ \sum_{i=1}^{j-1} \widehat{\omega}^{ij}_u \big( \mathcal{X}^T_{\cdot i} - \mathcal{X} (\widehat{A}^{[s+1]}_{i\cdot})' \big) + \sum_{i=j+1}^{p_1} \widehat{\omega}^{ij}_u \big( \mathcal{X}^T_{\cdot i} - \mathcal{X} (\widehat{A}^{[s]}_{i\cdot})' \big) \Big].
\]
Here $\widehat{\Omega}_u = [\widehat{\omega}^{ij}_u]$ is the plug-in estimate from the previous outer iteration; for notational convenience we drop the outer-iteration index $(k)$ and use $[s]$ to denote the index of the inner iteration, for each round of cyclic updates of the rows.
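A compact sketch of the alternating scheme in Algorithm 1 is given below, using off-the-shelf solvers. For simplicity the row updates use the unweighted Lasso of the initialization step rather than the $\widehat{\Omega}_u$-weighted cyclic update in (7), so this is an illustrative simplification, not the exact procedure; the function name and defaults are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import GraphicalLasso

def estimate_A_Omega(X_design, X_resp, lam_A=0.1, rho_u=0.1, n_iter=10):
    """Alternate (i) row-wise Lasso for A and (ii) graphical Lasso for Omega_u.
    X_design: T x p1 matrix of lagged observations; X_resp: T x p1 responses."""
    p1 = X_design.shape[1]
    A_hat = np.zeros((p1, p1))
    for _ in range(n_iter):
        # step 1: sparse regression, one row of A per response coordinate
        for j in range(p1):
            lasso = Lasso(alpha=lam_A, fit_intercept=False)
            A_hat[j] = lasso.fit(X_design, X_resp[:, j]).coef_
        # step 2: graphical Lasso on the residuals to estimate Omega_u
        resid = X_resp - X_design @ A_hat.T
        Omega_hat = GraphicalLasso(alpha=rho_u).fit(resid).precision_
    return A_hat, Omega_hat
```

In the full algorithm, each row update would additionally be reweighted by $\widehat{\omega}^{jj}_u$ and the correction term $r_j$, as in (7).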
Estimation of $B$, $C$ and $\Sigma_v^{-1}$. Similarly, to obtain estimates of $B$, $C$ and $\Omega_v := \Sigma_v^{-1}$, we formulate the optimization problem as follows:
\[
(\widehat{B}, \widehat{C}, \widehat{\Omega}_v) = \mathop{\mathrm{argmin}}_{\substack{B \in \mathbb{R}^{p_2 \times p_1},\, C \in \mathbb{R}^{p_2 \times p_2} \\ \Omega_v \in \mathcal{S}^{p_2 \times p_2}_{++}}} \Big\{ \mathrm{tr}\big[\Omega_v (\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')/T\big] - \log\det\Omega_v + \lambda_B \mathcal{R}(B) + \lambda_C \|C\|_1 + \rho_v \|\Omega_v\|_{1,\mathrm{off}} \Big\}, \tag{8}
\]
where the regularizer $\mathcal{R}(B) = \|B\|_1$ if $B$ is assumed to be sparse, and $\mathcal{R}(B) = |||B|||_*$ if $B$ is assumed to be low rank. Algorithm 2 outlines the procedure for obtaining the estimates $\widehat{B}$, $\widehat{C}$ and $\widehat{\Omega}_v$. Note that $\widehat{B}^{(k)}$ and $\widehat{C}^{(k)}$ need to be treated as a "joint block" in the outer update, and convergence of the "joint block" is required before moving on to updating $\Omega_v$.

Algorithm 2: Computational procedure for estimating $B$, $C$ and $\Sigma_v^{-1}$.
Input:
Time series data $\{x_t\}_{t=0}^{T}$ and $\{z_t\}_{t=0}^{T}$, tuning parameters $\lambda_B$, $\lambda_C$ and $\rho_v$.
Initialization:
Initialize with $\widehat{\Omega}^{(0)}_v = \mathrm{I}_{p_2}$; then
\[
(\widehat{B}^{(0)}, \widehat{C}^{(0)}) = \mathop{\mathrm{argmin}}_{(B, C)} \Big\{ \tfrac{1}{T} \big|\big|\big|\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C'\big|\big|\big|_F^2 + \lambda_B \mathcal{R}(B) + \lambda_C \|C\|_1 \Big\}.
\]
Iterate until convergence:
– Update $\widehat{\Omega}^{(k)}_v$ by applying the graphical Lasso to the residuals based on the plug-in estimates $\widehat{B}^{(k)}$ and $\widehat{C}^{(k)}$;
– For fixed $\widehat{\Omega}^{(k)}_v$, $(\widehat{B}^{(k+1)}, \widehat{C}^{(k+1)})$ solves
\[
\min_{B, C} \Big\{ \mathrm{tr}\big[\widehat{\Omega}^{(k)}_v (\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')/T\big] + \lambda_B \mathcal{R}(B) + \lambda_C \|C\|_1 \Big\}:
\]
  · Fixing $\widehat{C}^{[s]}$, update $\widehat{B}^{[s+1]}$ by Lasso or singular value thresholding, which solves
\[
\min_{B} \Big\{ \mathrm{tr}\big[\widehat{\Omega}^{(k)}_v (\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} \widehat{C}^{[s]\prime})'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} \widehat{C}^{[s]\prime})/T\big] + \lambda_B \mathcal{R}(B) \Big\};
\]
  · Fixing $\widehat{B}^{[s+1]}$, update $\widehat{C}^{[s+1]}$ by Lasso, which solves
\[
\min_{C} \Big\{ \mathrm{tr}\big[\widehat{\Omega}^{(k)}_v (\mathcal{Z}^T - \mathcal{X} \widehat{B}^{[s+1]\prime} - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} \widehat{B}^{[s+1]\prime} - \mathcal{Z} C')/T\big] + \lambda_C \|C\|_1 \Big\}.
\]
Output: Estimated transition matrices $\widehat{B}$, $\widehat{C}$ and sparse $\widehat{\Omega}_v$.

In many real applications, $B$ is low rank and $C$ is sparse, while in other settings both are sparse. In the first case, $X_t$ "Granger-causes" $Z_t$ and the information can be compressed into a lower-dimensional space spanned by a relatively small number of bases compared to the dimension of the blocks, while $Z_t$ is autoregressive through a subset of its components. Next, we give details for updating $B$ and $C$ under this model specification.

For fixed $\widehat{\Omega}^{(k)}_v$, with $B$ low rank and $C$ sparse, the updated $\widehat{B}^{(k)}$ and $\widehat{C}^{(k)}$ satisfy
\[
(\widehat{B}^{(k)}, \widehat{C}^{(k)}) = \mathop{\mathrm{argmin}}_{B, C} \Big\{ \tfrac{1}{T} \mathrm{tr}\big[\widehat{\Omega}^{(k-1)}_v (\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')\big] + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\}.
\]
To obtain the solution to the above optimization problem, $B$ and $C$ are updated alternately, and within the update of each block an iterative algorithm is required. Here we drop the superscript $(k)$ that denotes the outer iterations, use $[s]$ as the inner iteration index for the alternate update between $B$ and $C$, and use $\{t\}$ as the index for the within-block iterative update.

– Update $\widehat{B}^{[s+1]}$ for fixed $\widehat{C}^{[s]}$: instead of directly updating $B$, update $\widetilde{B} := \widehat{\Omega}^{1/2}_v B$ with
\[
\widetilde{B}^{\{t+1\}} = \mathcal{T}_{\alpha_{t+1}\lambda_B}\big( \widetilde{B}^{\{t\}} - \alpha_{t+1} \nabla g(\widetilde{B}^{\{t\}}) \big), \tag{9}
\]
where $\mathcal{T}_\tau(\cdot)$ is the singular value thresholding operator with threshold level $\tau$,
\[
g(\widetilde{B}) = \tfrac{1}{T} \mathrm{tr}\big[ \widetilde{B}'\widetilde{B}\,\mathcal{X}'\mathcal{X} - 2\,\mathcal{X}'(\mathcal{Z}^T - \mathcal{Z}\widehat{C}^{[s]\prime})\,\widehat{\Omega}^{1/2}_v \widetilde{B} \big], \qquad \nabla g(\widetilde{B}) = \tfrac{2}{T} \big[ \widetilde{B}\,\mathcal{X}'\mathcal{X} - \widehat{\Omega}^{1/2}_v (\mathcal{Z}^T - \mathcal{Z}\widehat{C}^{[s]\prime})'\mathcal{X} \big].
\]
Denote the convergent solution by $\widetilde{B}^{\{\infty\}}$; then $\widehat{B}^{[s+1]} = \widehat{\Omega}^{-1/2}_v \widetilde{B}^{\{\infty\}}$.

– Update $\widehat{C}^{[s+1]}$ for fixed $\widehat{B}^{[s+1]}$: each row $j = 1, \dots, p_2$ is cyclically updated by
\[
\widehat{C}^{\{t+1\}}_{j\cdot} = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^{p_2}} \Big\{ \frac{\widehat{\omega}^{jj}_v}{T} \big\| (\mathcal{Z}^T - \mathcal{X}\widehat{B}^{[s+1]\prime})_{\cdot j} + r^{\{t+1\}}_j - \mathcal{Z}\beta \big\|^2 + \lambda_C \|\beta\|_1 \Big\},
\]
where
\[
r^{\{t+1\}}_j = \frac{1}{\widehat{\omega}^{jj}_v} \Big[ \sum_{i=1}^{j-1} \widehat{\omega}^{ij}_v \big[ (\mathcal{Z}^T - \mathcal{X}\widehat{B}^{[s+1]\prime})_{\cdot i} - \mathcal{Z}(\widehat{C}^{\{t+1\}}_{i\cdot})' \big] + \sum_{i=j+1}^{p_2} \widehat{\omega}^{ij}_v \big[ (\mathcal{Z}^T - \mathcal{X}\widehat{B}^{[s+1]\prime})_{\cdot i} - \mathcal{Z}(\widehat{C}^{\{t\}}_{i\cdot})' \big] \Big],
\]
and the $\widehat{\omega}^{ij}_v$ are entries of the $\widehat{\Omega}_v$ estimate from the previous outer iteration.

Although a number of iterative steps are required to obtain the final estimate, we have empirically observed that the number of iterations between the $(B, C)$ and $\Omega_v$ blocks is usually rather small. Specifically, based on a large number of simulation settings (selected ones presented in Section 5), for fixed $\widehat{\Omega}^{(k)}_v$ the alternate update of $B$ and $C$ usually converges within 20 iterations, while the update involving $(B, C)$ and $\Omega_v$ takes fewer than 10 iterations.

In case $B$ is sparse, it can be updated by Lasso regression as outlined in Algorithm 2 (details omitted).
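The singular value thresholding step in (9) is a proximal gradient update for the nuclear-norm penalty. Below is a minimal self-contained sketch; the step size, iteration count and function names are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm at M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def update_B_lowrank(X, R, Omega_half, lam_B, n_steps=200):
    """Proximal gradient on Btilde = Omega_v^{1/2} B with nuclear-norm penalty,
    following the gradient in (9); R = Z^T - Z C' is the partial residual."""
    T = X.shape[0]
    Btil = np.zeros((R.shape[1], X.shape[1]))
    alpha = T / (2 * np.linalg.norm(X.T @ X, 2))  # 1 / Lipschitz constant of grad g
    for _ in range(n_steps):
        grad = (2.0 / T) * (Btil @ X.T @ X - Omega_half @ R.T @ X)
        Btil = svt(Btil - alpha * grad, alpha * lam_B)
    return np.linalg.solve(Omega_half, Btil)  # recover B = Omega_v^{-1/2} Btilde
```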
Remark 2. In the low-dimensional setting where the Gram matrix $\mathcal{X}'\mathcal{X}$ is invertible, the update of $B$ when $\mathcal{R}(B) = |||B|||_*$ can be obtained by a one-shot SVD and singular value thresholding; that is, we first obtain the generalized least squares estimator, then threshold its singular values at the appropriate level. In the high-dimensional setting, on the other hand, an iterative algorithm is required. Note that the singular value thresholding algorithm corresponds to a proximal gradient algorithm, and thus a number of acceleration schemes are available (see Nesterov, 1983, 1988, 2005, 2007), whose theoretical properties have been thoroughly investigated in Tseng (2008). We recommend the acceleration scheme proposed by Beck and Teboulle (2009), in which the "momentum" is carried over to the next iteration, as an extension of Nesterov (1983) to composite functions. Instead of updating $\widetilde{B}$ with (9), an "accelerated" update within the SVT step is given by:
\[
y = \widetilde{B}^{\{t\}} - \frac{t-1}{t+2}\big( \widetilde{B}^{\{t\}} - \widetilde{B}^{\{t-1\}} \big), \qquad \widetilde{B}^{\{t+1\}} = \mathcal{T}_{\alpha_{t+1}\lambda_B}\big( y - \alpha_{t+1}\nabla g(\widetilde{B}^{\{t\}}) \big).
\]

Note that the objective function in (6) is not jointly convex in both parameters, but biconvex. Similarly, the objective function in (8) is biconvex in $\big[(B, C), \Omega_v\big]$. Consequently, convergence to a stationary point is guaranteed, as long as the estimates from all iterations lie within a ball around the true values of the parameters, with the radius of the ball upper bounded by a universal constant that depends only on the model dimensions and the sample size (Lin et al., 2016, Theorem 4.1). This condition is satisfied once the consistency properties of the estimates are established.

Establishing consistency of the estimates requires the existence of good initial values for the model parameters $(A, \Omega_u)$ and $(B, C, \Omega_v)$, respectively, in the sense that they are sufficiently close to the true parameters. For the $(A, \Omega_u)$ parameters, the results in Basu and Michailidis (2015) guarantee that for random realizations of $\{X_t, U_t\}$, with a sufficiently large sample size, the errors of $\widehat{A}^{(0)}$ and $\widehat{\Omega}^{(0)}_u$ are bounded with high probability, which provides good initialization values. Yet, additional handling of the bounds is required to ensure that estimates from subsequent iterations are also uniformly close to the true values (see Theorem 1 in Section 3.2). A similar property for $(\widehat{B}^{(0)}, \widehat{C}^{(0)}, \widehat{\Omega}^{(0)}_v)$ and subsequent iterations is established in Theorem 2 of Section 3.2 (see also Theorem 4 in Appendix A).
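Returning to the accelerated variant discussed in Remark 2 above: it amounts to adding a standard Beck–Teboulle momentum step before the SVT proximal map. A sketch (function names, the example data and the objective in the usage lines are illustrative):

```python
import numpy as np

def svt(M, tau):
    """Prox of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def accelerated_svt(grad_g, shape, alpha, lam_B, n_steps=200):
    """FISTA-style iteration: momentum extrapolation followed by SVT."""
    B_prev = B_cur = np.zeros(shape)
    for t in range(1, n_steps + 1):
        y = B_cur + (t - 1.0) / (t + 2.0) * (B_cur - B_prev)  # momentum step
        B_prev, B_cur = B_cur, svt(y - alpha * grad_g(y), alpha * lam_B)
    return B_cur

# usage sketch: min_B (1/2T)||Y - X B'||_F^2 + lam_B * |||B|||_*
rng = np.random.default_rng(7)
Xd, Y = rng.normal(size=(100, 8)), rng.normal(size=(100, 5))
g = lambda B: (B @ Xd.T @ Xd - Y.T @ Xd) / 100.0          # gradient of the smooth part
B_hat = accelerated_svt(g, (5, 8),
                        alpha=100.0 / np.linalg.norm(Xd.T @ Xd, 2), lam_B=0.1)
```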
In this section, we investigate the theoretical properties of the penalized maximum likelihood estimation procedure proposed in Section 2, with an emphasis on the error bounds for the obtained estimates. We focus on the model specification in which the inter-block transition matrix $B$ is low rank, which is of interest in many applied settings. Specifically, we consider the consistency properties of $\widehat{A}$ and $(\widehat{B}, \widehat{C})$ that are solutions to the following two optimization problems:
\[
(\widehat{A}, \widehat{\Omega}_u) = \mathop{\mathrm{argmin}}_{A,\, \Omega_u} \Big\{ \mathrm{tr}\big[\Omega_u (\mathcal{X}^T - \mathcal{X} A')'(\mathcal{X}^T - \mathcal{X} A')/T\big] - \log\det\Omega_u + \lambda_A \|A\|_1 + \rho_u \|\Omega_u\|_{1,\mathrm{off}} \Big\}, \tag{10}
\]
and
\[
(\widehat{B}, \widehat{C}, \widehat{\Omega}_v) = \mathop{\mathrm{argmin}}_{B,\, C,\, \Omega_v} \Big\{ \mathrm{tr}\big[\Omega_v (\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')/T\big] - \log\det\Omega_v + \lambda_B |||B|||_* + \lambda_C \|C\|_1 + \rho_v \|\Omega_v\|_{1,\mathrm{off}} \Big\}. \tag{11}
\]
The case of a sparse $B$ can be handled similarly to that of $A$ and/or $C$ with minor modifications (details given in Appendix E).

Next, we outline the main steps followed in establishing the theoretical properties of the model parameters. Throughout, we denote by a superscript "$\star$" the true value of the corresponding parameter. The following key concepts, widely used in high-dimensional regularized estimation problems, are needed in subsequent developments.
Definition 1 (Restricted Strong Convexity (RSC)). A generic operator $\mathcal{X}: \mathbb{R}^{m_1 \times m_2} \mapsto \mathbb{R}^{T \times m_1}$ satisfies the RSC condition with respect to a norm $\Phi$, with curvature $\alpha_{\mathrm{RSC}} > 0$ and tolerance $\tau > 0$, if
\[
\frac{1}{2T} |||\mathcal{X}(\Delta)|||_F^2 \ \ge\ \frac{\alpha_{\mathrm{RSC}}}{2} |||\Delta|||_F^2 - \tau\, \Phi^2(\Delta), \qquad \text{for all } \Delta \in \mathbb{R}^{m_1 \times m_2}.
\]
Note that the choice of the norm $\Phi$ is context specific. For example, in sparse regression problems, $\Phi(\Delta) = \|\Delta\|_1$ corresponds to the element-wise $\ell_1$ norm of the matrix (or the usual vector $\ell_1$ norm for the vectorized version), and the RSC condition becomes equivalent to the restricted eigenvalue (RE) condition (see Loh and Wainwright, 2012; Basu and Michailidis, 2015, and references therein). This is the case for the problem of estimating the transition matrix $A$. For estimating $B$ and $C$, define the weighted regularizer
\[
\mathcal{Q}(B, C) := |||B|||_* + \frac{\lambda_C}{\lambda_B} \|C\|_1,
\]
and the associated norm $\Phi$ in this setting is defined as $\Phi(\Delta) := \inf_{B_{\mathrm{aug}} + C_{\mathrm{aug}} = \Delta} \mathcal{Q}(B, C)$.

Definition 2 (Diagonal dominance). A matrix $\Omega \in \mathbb{R}^{p \times p}$ is strictly diagonally dominant if
\[
|\Omega_{ii}| > \sum_{j \neq i} |\Omega_{ij}|, \qquad \forall\, i = 1, \cdots, p.
\]

Definition 3 (Incoherence condition (Ravikumar et al., 2011)). A matrix $\Omega \in \mathbb{R}^{p \times p}$ satisfies the incoherence condition if
\[
\max_{e \in (S_\Omega)^c} \big\| H_{e S_\Omega} (H_{S_\Omega S_\Omega})^{-1} \big\|_1 \le 1 - \xi, \qquad \text{for some } \xi \in (0, 1],
\]
where $H_{S_\Omega S_\Omega}$ denotes the Hessian of the log-determinant barrier $\log\det\Omega$ restricted to the true edge set $S_\Omega$ of $\Omega$, and $H_{e S_\Omega}$ is defined analogously.

The above two conditions are associated with the inverse covariance matrices $\Omega_u$ and $\Omega_v$. Specifically, the diagonal dominance condition is required of $\Omega^\star_u$ and $\Omega^\star_v$ when establishing the consistency properties of $\widehat{A}$ and $(\widehat{B}, \widehat{C})$ under the penalized maximum likelihood formulation, while the incoherence condition is primarily required for establishing the consistency of $\widehat{\Omega}_u$ and $\widehat{\Omega}_v$.

We additionally introduce the upper and lower extremes of the spectrum, defined as
\[
\mathcal{M}(f_X) := \mathop{\mathrm{ess\,sup}}_{\theta \in [-\pi, \pi]} \Lambda_{\max}(f_X(\theta)) \qquad \text{and} \qquad \mathfrak{m}(f_X) := \mathop{\mathrm{ess\,inf}}_{\theta \in [-\pi, \pi]} \Lambda_{\min}(f_X(\theta)).
\]
Analogously, the upper extreme of the cross-spectrum is given by
\[
\mathcal{M}(f_{X,Z}) := \mathop{\mathrm{ess\,sup}}_{\theta \in [-\pi, \pi]} \sqrt{\Lambda_{\max}\big( f^*_{X,Z}(\theta) f_{X,Z}(\theta) \big)},
\]
with $f^*_{X,Z}(\theta)$ being the conjugate transpose of $f_{X,Z}(\theta)$. With this definition, $\mathcal{M}(f_{X,Z}) = \mathcal{M}(f_{Z,X})$.

Next, consider the solution to (10) obtained by the alternate update of $A$ and $\Omega_u$. If $\Omega_u$ is held fixed, then $A$ solves (12); we denote the solution by $\bar{A}$ and its vectorized version by $\bar{\beta}_A := \mathrm{vec}(\bar{A})$:
\[
\bar{\beta}_A := \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^{p_1^2}} \big\{ -2\beta'\gamma^X + \beta'\Gamma^X\beta + \lambda_A \|\beta\|_1 \big\}, \tag{12}
\]
where
\[
\Gamma^X = \Omega_u \otimes \frac{\mathcal{X}'\mathcal{X}}{T}, \qquad \gamma^X = \frac{1}{T} (\Omega_u \otimes \mathcal{X}')\,\mathrm{vec}(\mathcal{X}^T). \tag{13}
\]
Using similar notation, if $A$ is held fixed, then $\Omega_u$ solves (14):
\[
\bar{\Omega}_u := \mathop{\mathrm{argmin}}_{\Omega_u \in \mathcal{S}^{p_1 \times p_1}_{++}} \big\{ -\log\det\Omega_u + \mathrm{trace}(S_u \Omega_u) + \rho_u \|\Omega_u\|_{1,\mathrm{off}} \big\}, \tag{14}
\]
where
\[
S_u = \frac{1}{T} (\mathcal{X}^T - \mathcal{X} A')'(\mathcal{X}^T - \mathcal{X} A'). \tag{15}
\]
For fixed realizations of $\mathcal{X}$ and $\mathcal{U}$, by Basu and Michailidis (2015), the error bound for $\bar{\beta}_A$ relies on (1) $\Gamma^X$ satisfying the RSC condition defined above, and (2) the tuning parameter $\lambda_A$ being chosen in accordance with a deviation bound condition associated with $\|\mathcal{X}'\mathcal{U}\Omega_u/T\|_\infty$. By Ravikumar et al. (2011), the error bound for $\bar{\Omega}_u$ relies on how well $S_u$ concentrates around $\Sigma^\star_u$, that is, on $\|S_u - \Sigma^\star_u\|_\infty$.
Specifically, for (13) and (15), with $\Omega^\star_u$ and $A^\star$ plugged in respectively, these conditions hold with high probability for random realizations of $\mathcal{X}$ and $\mathcal{U}$. In the actual implementation of the algorithm, however, the quantities in (13) and (15) are substituted by estimates, so that at iteration $k$, $\widehat{\beta}^{(k)}_A$ and $\widehat{\Omega}^{(k)}_u$ solve
\[
\widehat{\beta}^{(k)}_A := \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^{p_1^2}} \big\{ -2\beta'\widehat{\gamma}^{X,(k)} + \beta'\widehat{\Gamma}^{X,(k)}\beta + \lambda_A \|\beta\|_1 \big\}, \qquad \widehat{\Omega}^{(k)}_u := \mathop{\mathrm{argmin}}_{\Omega_u \in \mathcal{S}^{p_1 \times p_1}_{++}} \big\{ -\log\det\Omega_u + \mathrm{trace}\big(\widehat{S}^{(k)}_u \Omega_u\big) + \rho_u \|\Omega_u\|_{1,\mathrm{off}} \big\},
\]
where
\[
\widehat{\Gamma}^{X,(k)} = \widehat{\Omega}^{(k-1)}_u \otimes \frac{\mathcal{X}'\mathcal{X}}{T}, \qquad \widehat{\gamma}^{X,(k)} = \frac{1}{T}\big(\widehat{\Omega}^{(k-1)}_u \otimes \mathcal{X}'\big)\mathrm{vec}(\mathcal{X}^T), \qquad \widehat{S}^{(k)}_u = \frac{1}{T}\big[\mathcal{X}^T - \mathcal{X}(\widehat{A}^{(k)})'\big]'\big[\mathcal{X}^T - \mathcal{X}(\widehat{A}^{(k)})'\big].
\]
As a consequence, to establish the finite-sample bounds for $\widehat{A}$ and $\widehat{\Omega}_u$ given in (10), we need $\widehat{\Gamma}^{X,(k)}$ to satisfy the RSC condition, a bound on $\|\mathcal{X}'\mathcal{U}\widehat{\Omega}^{(k-1)}_u/T\|_\infty$, and a bound on $\|\widehat{S}^{(k)}_u - \Sigma^\star_u\|_\infty$, for all $k$. Toward this end, we prove that for random realizations of $\mathcal{X}$ and $\mathcal{U}$, with high probability, the RSC condition for $\widehat{\Gamma}^{X,(k)}$ and the universal bounds on $\|\mathcal{X}'\mathcal{U}\widehat{\Omega}^{(k-1)}_u/T\|_\infty$ and $\|\widehat{S}^{(k)}_u - \Sigma^\star_u\|_\infty$ hold for all iterations $k$, albeit the quantities of interest rely on estimates from the previous or current iterations. Consistency results for $\widehat{A}$ and $\widehat{\Omega}_u$ then readily follow.

Next, consider the solution to (11) that alternately updates $(B, C)$ and $\Omega_v$. As the regularization term involves both the nuclear norm penalty and the $\ell_1$ norm penalty, additional handling of the norms is required, which leverages the idea of decomposable regularizers (Agarwal et al., 2012). Specifically, if $\Omega_v$ and $(B, C)$ are respectively held fixed, then
\[
(\bar{B}, \bar{C}) := \mathop{\mathrm{argmin}}_{B, C} \Big\{ \tfrac{1}{T}\mathrm{tr}\big[\Omega_v (\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')\big] + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\},
\]
\[
\bar{\Omega}_v := \mathop{\mathrm{argmin}}_{\Omega_v} \big\{ -\log\det\Omega_v + \mathrm{trace}(S_v \Omega_v) + \rho_v \|\Omega_v\|_{1,\mathrm{off}} \big\},
\]
where $S_v = \frac{1}{T}(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')'(\mathcal{Z}^T - \mathcal{X} B' - \mathcal{Z} C')$. If we let $\mathcal{W} := [\mathcal{X}, \mathcal{Z}] \in \mathbb{R}^{T \times (p_1 + p_2)}$ and define the operator $\mathcal{W}_{\Omega_v}: \mathbb{R}^{p_2 \times (p_1 + p_2)} \mapsto \mathbb{R}^{T \times p_2}$, induced jointly by $\mathcal{W}$ and $\Omega_v$, as
\[
\mathcal{W}_{\Omega_v}(\Delta) := \mathcal{W}\Delta'\Omega^{1/2}_v \qquad \text{for } \Delta \in \mathbb{R}^{p_2 \times (p_1 + p_2)}, \tag{16}
\]
then $\bar{B}_{\mathrm{aug}} := [\bar{B}, O_{p_2 \times p_2}]$ and $\bar{C}_{\mathrm{aug}} := [O_{p_2 \times p_1}, \bar{C}]$ are equivalently given by
\[
(\bar{B}_{\mathrm{aug}}, \bar{C}_{\mathrm{aug}}) = \mathop{\mathrm{argmin}}_{B, C} \Big\{ \tfrac{1}{T}\big|\big|\big| \mathcal{Z}^T \Omega^{1/2}_v - \mathcal{W}_{\Omega_v}(B_{\mathrm{aug}} + C_{\mathrm{aug}}) \big|\big|\big|_F^2 + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\}, \tag{17}
\]
where $B_{\mathrm{aug}} := [B, O_{p_2 \times p_2}]$ and $C_{\mathrm{aug}} := [O_{p_2 \times p_1}, C] \in \mathbb{R}^{p_2 \times (p_1 + p_2)}$. Then, for fixed realizations of $\mathcal{Z}$, $\mathcal{X}$ and $\mathcal{V}$, by an extension of Agarwal et al. (2012), the error bound for $(\bar{B}_{\mathrm{aug}}, \bar{C}_{\mathrm{aug}})$ relies on (1) the operator $\mathcal{W}_{\Omega_v}$ satisfying the RSC condition, and (2) the tuning parameters $\lambda_B$ and $\lambda_C$ being respectively chosen in accordance with the deviation bound conditions associated with
\[
|||\mathcal{W}'\mathcal{V}\Omega_v/T|||_{\mathrm{op}} \quad \text{and} \quad \|\mathcal{W}'\mathcal{V}\Omega_v/T\|_\infty. \tag{18}
\]
The error bound for $\bar{\Omega}_v$ again relies on $\|S_v - \Sigma^\star_v\|_\infty$. In an analogous way, for the actual alternate update,
\[
(\widehat{B}^{(k)}_{\mathrm{aug}}, \widehat{C}^{(k)}_{\mathrm{aug}}) = \mathop{\mathrm{argmin}}_{B, C} \Big\{ \tfrac{1}{T}\big|\big|\big| \mathcal{Z}^T\big[\widehat{\Omega}^{(k-1)}_v\big]^{1/2} - \mathcal{W}_{\widehat{\Omega}^{(k-1)}_v}(B_{\mathrm{aug}} + C_{\mathrm{aug}}) \big|\big|\big|_F^2 + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\},
\]
\[
\widehat{\Omega}^{(k)}_v := \mathop{\mathrm{argmin}}_{\Omega_v} \big\{ -\log\det\Omega_v + \mathrm{trace}\big(\widehat{S}^{(k)}_v \Omega_v\big) + \rho_v \|\Omega_v\|_{1,\mathrm{off}} \big\},
\]
and the error bound for $(\widehat{B}, \widehat{C}, \widehat{\Omega}_v)$ defined in (11) depends on the properties of $\mathcal{W}_{\widehat{\Omega}^{(k)}_v}$, $|||\mathcal{W}'\mathcal{V}\widehat{\Omega}^{(k)}_v/T|||_{\mathrm{op}}$ and $\|\mathcal{W}'\mathcal{V}\widehat{\Omega}^{(k)}_v/T\|_\infty$ for all $k$. Specifically, when $\Omega_v$ and $(B, C)$ (in (16) and (18), respectively) are substituted by estimated quantities, we prove that the RSC condition and the bounds hold with high probability for random realizations of $\mathcal{Z}$, $\mathcal{X}$ and $\mathcal{V}$, for all iterations $k$, which then establishes the consistency properties of $(\widehat{B}, \widehat{C})$ and $\widehat{\Omega}_v$.

Theorems 1 and 2 below give the error bounds for the estimators in (10) and (11) obtained through Algorithms 1 and 2, using random realizations coming from the stable VAR system defined in (1). As previously mentioned, to establish error bounds for both the transition matrices and the inverse covariance matrices obtained from the alternating updates, we need to take into account that the quantities associated with the RSC condition and the deviation bound condition are based on estimates obtained from the previous iteration. On the other hand, the sources of randomness contained in the observed data are fixed; hence errors from the observed data stop accumulating once all sources of randomness have been accounted for after a few iterations, and these govern both the leading term of the error bounds and the probability with which the bounds hold.

Specifically, using the same notation as in Section 3.1, we obtain the error bounds of the estimated transition matrices and inverse covariance matrices iteratively, building upon the facts that for all iterations $k$:
(1) $\widehat{\Gamma}^{X,(k)}$ or the operator $\mathcal{W}_{\widehat{\Omega}^{(k)}_v}$ satisfies the RSC condition;
(2) deviation bounds hold for $\|\mathcal{X}'\mathcal{U}\widehat{\Omega}^{(k)}_u/T\|_\infty$, $\|\mathcal{W}'\mathcal{V}\widehat{\Omega}^{(k)}_v/T\|_\infty$, and $|||\mathcal{W}'\mathcal{V}\widehat{\Omega}^{(k)}_v/T|||_{\mathrm{op}}$;
(3) good concentration holds for $\|\widehat{S}^{(k)}_u - \Sigma^\star_u\|_\infty$ and $\|\widehat{S}^{(k)}_v - \Sigma^\star_v\|_\infty$.
We keep track of how the bounds change in each iteration until convergence by properly controlling the norms, track the rate of the error bound as it depends on $p_1$, $p_2$ and $T$, and reach the conclusion that the error bounds hold uniformly over all iterations, for the estimates of the transition matrices $A$, $B$ and $C$ as well as of the inverse covariance matrices $\Omega_u$ and $\Omega_v$.

Theorem 1.
Consider the stable Gaussian VAR process defined in (1), in which $A^\star$ is assumed to be $s^\star_A$-sparse. Further, assume the following:
C.1 The incoherence condition holds for $\Omega^\star_u$.
C.2 $\Omega^\star_u$ is diagonally dominant.
C.3 The maximum node degree of $\Omega^\star_u$ satisfies $d^{\max}_{\Omega^\star_u} = o(p_1)$.
Then, for random realizations of $\{X_t\}$ and $\{U_t\}$, and the sequence $\{\widehat{A}^{(k)}, \widehat{\Omega}^{(k)}_u\}_k$ returned by Algorithm 1 outlined in Section 2.1, there exist constants $c_1, c_2, \tilde{c}_1, \tilde{c}_2 > 0$ and $\tau > 0$ such that for sample size $T \succsim \max\{(d^{\max}_{\Omega^\star_u})^2, s^\star_A\}\log p_1$, with probability at least $1 - c_1\exp(-c_2 T) - \tilde{c}_1\exp(-\tilde{c}_2\log p_1) - \exp(-\tau\log p_1)$, the following hold for all $k \ge 1$, for some $C, C' > 0$ that are functions of the upper and lower extremes $\mathcal{M}(f_X)$, $\mathfrak{m}(f_X)$ of the spectrum $f_X(\theta)$ and do not depend on $p_1$, $T$ or $k$:
(i) $\widehat{\Gamma}^{X,(k)}$ satisfies the RSC condition;
(ii) $\|\mathcal{X}'\mathcal{U}\widehat{\Omega}^{(k)}_u/T\|_\infty \le C\sqrt{\log p_1/T}$;
(iii) $\|\widehat{S}^{(k)}_u - \Sigma^\star_u\|_\infty \le C'\sqrt{\log p_1/T}$.
As a consequence, the following bounds hold uniformly over all iterations $k \ge 1$:
\[
\big|\big|\big|\widehat{A}^{(k)} - A^\star\big|\big|\big|_F = O\Big(\sqrt{\tfrac{s^\star_A \log p_1}{T}}\Big), \qquad \big|\big|\big|\widehat{\Omega}^{(k)}_u - \Omega^\star_u\big|\big|\big|_F = O\Big(\sqrt{\tfrac{(s^\star_{\Omega_u} + p_1)\log p_1}{T}}\Big).
\]
It should be noted that the above result establishes consistency for the ML estimates of the model presented in Basu and Michailidis (2015).
Theorem 2.
Consider the stable Gaussian VAR system defined in (1), in which $B^\star$ is assumed to be low rank with rank $r^\star_B$ and $C^\star$ is assumed to be $s^\star_C$-sparse. Further, assume the following:
C.1 The incoherence condition holds for $\Omega^\star_v$.
C.2 $\Omega^\star_v$ is diagonally dominant.
C.3 The maximum node degree of $\Omega^\star_v$ satisfies $d^{\max}_{\Omega^\star_v} = o(p_2)$.
Then, for random realizations of $\{X_t\}$, $\{Z_t\}$ and $\{V_t\}$, and the sequence $\{(\widehat{B}^{(k)}, \widehat{C}^{(k)}), \widehat{\Omega}^{(k)}_v\}_k$ returned by Algorithm 2 outlined in Section 2.1, there exist constants $\{c_i, \tilde{c}_i\}$, $i = 0, 1, 2$, and $\tau > 0$ such that for sample size $T \succsim (d^{\max}_{\Omega^\star_v})^2(p_1 + 2p_2)$, with probability at least
\[
1 - c_0\exp\{-\tilde{c}_0(p_1 + p_2)\} - c_1\exp\{-\tilde{c}_1(p_1 + 2p_2)\} - c_2\exp\{-\tilde{c}_2\log[p_2(p_1 + p_2)]\} - \exp\{-\tau\log p_2\},
\]
the following hold for all $k \ge 1$, for $C, C', C'' > 0$ that are functions of the upper and lower extremes $\mathcal{M}(f_W)$, $\mathfrak{m}(f_W)$ of the spectrum $f_W(\theta)$ and of the upper extreme $\mathcal{M}(f_{W,V})$ of the cross-spectrum $f_{W,V}(\theta)$, and do not depend on $p_1$, $p_2$ or $T$:
(i) $\widehat{\Gamma}^{W,(k)}$ satisfies the RSC condition;
(ii) $\|\mathcal{W}'\mathcal{V}\widehat{\Omega}^{(k)}_v/T\|_\infty \le C\sqrt{\tfrac{(p_1 + p_2) + p_2}{T}}$ and $|||\mathcal{W}'\mathcal{V}\widehat{\Omega}^{(k)}_v/T|||_{\mathrm{op}} \le C'\sqrt{\tfrac{(p_1 + p_2) + p_2}{T}}$;
(iii) $\|\widehat{S}^{(k)}_v - \Sigma^\star_v\|_\infty \le C''\sqrt{\tfrac{(p_1 + p_2) + p_2}{T}}$.
As a consequence, the following bounds hold uniformly over all iterations $k \ge 1$:
\[
\big|\big|\big|\widehat{B}^{(k)} - B^\star\big|\big|\big|_F + \big|\big|\big|\widehat{C}^{(k)} - C^\star\big|\big|\big|_F = O\Big(\sqrt{\tfrac{\max\{r^\star_B, s^\star_C\}(p_1 + 2p_2)}{T}}\Big), \qquad \big|\big|\big|\widehat{\Omega}^{(k)}_v - \Omega^\star_v\big|\big|\big|_F = O\Big(\sqrt{\tfrac{(s^\star_{\Omega_v} + p_2)(p_1 + 2p_2)}{T}}\Big).
\]

Remark 3. It is worth pointing out that the initializers $\widehat{A}^{(0)}$ and $(\widehat{B}^{(0)}, \widehat{C}^{(0)})$ are slightly different from the estimates obtained in successive iterations, as they come from the penalized least squares formulation in which the inverse covariance matrices are temporarily assumed diagonal. Consistency results for these initializers under deterministic realizations are established in Theorems 3 and 4 (see Appendix A), and the corresponding conditions are later verified for random realizations in Lemmas B.1 to B.4 (see Appendix B). These theorems and lemmas serve as stepping stones toward the proofs of Theorems 1 and 2. Further, the constants $C, C', C''$ reflect both the temporal dependence among the $X_t$ and $Z_t$ blocks, and the cross-sectional dependence within and across the two blocks.

We conclude this section with a discussion of the error bounds of the estimators that provides additional insight into the impact of temporal and cross-sectional dependence within and between the blocks; specifically, how the exact bounds depend on the underlying processes through their spectra when the triangular structure of the joint transition matrix is explicitly taken into consideration. First, we introduce additional notation needed in the subsequent technical developments.
The definitions of the spectral densities and the cross-spectrum are the same as in Section 2, and their upper and lower extremes are defined in Section 3.1. For $\{X_t\}$ defined in (1), let $\mathcal{A}(\theta) = \mathrm{I}_{p_1} - A\theta$ denote the matrix-valued characteristic polynomial of $\{X_t\}$ and $\mathcal{A}^*(\theta)$ its conjugate transpose. We further define its upper and lower extremes by:
\[
\mu_{\max}(\mathcal{A}) = \max_{|\theta| = 1} \Lambda_{\max}\big(\mathcal{A}^*(\theta)\mathcal{A}(\theta)\big), \qquad \mu_{\min}(\mathcal{A}) = \min_{|\theta| = 1} \Lambda_{\min}\big(\mathcal{A}^*(\theta)\mathcal{A}(\theta)\big).
\]
The same set of quantities is defined analogously for the joint process $\{W_t = (X_t', Z_t')'\}$, that is,
\[
\mathcal{G}(\theta) = \mathrm{I}_{p_1 + p_2} - G\theta, \qquad \mu_{\max}(\mathcal{G}) = \max_{|\theta| = 1} \Lambda_{\max}\big(\mathcal{G}^*(\theta)\mathcal{G}(\theta)\big), \qquad \mu_{\min}(\mathcal{G}) = \min_{|\theta| = 1} \Lambda_{\min}\big(\mathcal{G}^*(\theta)\mathcal{G}(\theta)\big).
\]
Using the result in Theorem 2 as an example, we show how the error bound depends on the underlying processes $\{(X_t', Z_t')'\}$. Specifically, we note that the bounds for $(\widehat{B}^{(k)}, \widehat{C}^{(k)})$ can be equivalently written as
\[
|||\widehat{B}^{(k)} - B^\star|||_F + |||\widehat{C}^{(k)} - C^\star|||_F \le \bar{C}\sqrt{\frac{\max\{r^\star_B, s^\star_C\}(p_1 + 2p_2)}{T}},
\]
which holds for all $k$ and some constant $\bar{C}$ that does not depend on $p_1$, $p_2$ or $T$. Specifically, by Theorem 4 and Lemmas B.3 and B.4,
\[
\bar{C} \propto \big[\mathcal{M}(f_W) + \Lambda_{\max}(\Sigma_v) + \mathcal{M}(f_{W,V})\big] \big/ \mathfrak{m}^2(f_W).
\]
This indicates that the exact error bound depends on $\mathfrak{m}(f_W)$, $\mathcal{M}(f_W)$ and $\mathcal{M}(f_{W,V})$. Next, we provide bounds on these quantities. The joint process $W_t$, as noted in (2), is a VAR(1) process with characteristic polynomial $\mathcal{G}(\theta)$ and spectral density $f_W(\theta)$. The bounds for $\mathfrak{m}(f_W)$ and $\mathcal{M}(f_W)$ are given by Basu and Michailidis (2015, Proposition 2.1), that is,
\[
\mathfrak{m}(f_W) \ge \frac{\min\{\Lambda_{\min}(\Sigma_u), \Lambda_{\min}(\Sigma_v)\}}{2\pi\,\mu_{\max}(\mathcal{G})} \quad \text{and} \quad \mathcal{M}(f_W) \le \frac{\max\{\Lambda_{\max}(\Sigma_u), \Lambda_{\max}(\Sigma_v)\}}{2\pi\,\mu_{\min}(\mathcal{G})}. \tag{19}
\]
Consider next the bound for $\mathcal{M}(f_{W,V})$. First, we note that $\{V_t\}$ is a sub-process of the joint error process $\{\varepsilon_t\}$, where $\varepsilon_t = (U_t', V_t')'$. Then, by Lemma C.8,
\[
\mathcal{M}(f_{W,V}) \le \mathcal{M}(f_{W,\varepsilon}) \le \mathcal{M}(f_W)\sqrt{\mu_{\max}(\mathcal{G})},
\]
where the second inequality follows from Basu and Michailidis (2015, Proof of Proposition 2.4). What remain to be bounded are $\mu_{\min}(\mathcal{G})$ and $\mu_{\max}(\mathcal{G})$. By Proposition 2.2 in Basu and Michailidis (2015), these two quantities are bounded as:
\[
\mu_{\max}(\mathcal{G}) \le \Big[1 + \tfrac{1}{2}\big(|||G|||_\infty + |||G|||_1\big)\Big]^2 \tag{20}
\]
and
\[
\mu_{\min}(\mathcal{G}) \ge (1 - \rho(G))^2 \cdot |||P|||^{-2} \cdot |||P^{-1}|||^{-2},
\]
where $G = P\Lambda_G P^{-1}$, with $\Lambda_G$ a diagonal matrix consisting of the eigenvalues of $G$. Since $|||P^{-1}|||_{\mathrm{op}} \ge |||P|||_{\mathrm{op}}^{-1}$, it follows that
\[
|||P|||^{-2} \cdot |||P^{-1}|||^{-2} \ge |||P^{-1}|||^{2} \cdot |||P^{-1}|||^{-2} = 1,
\]
and therefore
\[
\mu_{\min}(\mathcal{G}) \ge (1 - \max\{\rho(A), \rho(C)\})^2. \tag{21}
\]
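The extremes $\mu_{\max}(\mathcal{G})$ and $\mu_{\min}(\mathcal{G})$ are straightforward to approximate numerically by scanning the unit circle. A rough sketch, with an illustrative grid resolution and stand-in matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
p1, p2 = 4, 3
A, C = 0.5 * np.eye(p1), 0.5 * np.eye(p2)
B = rng.normal(size=(p2, p1))
G = np.block([[A, np.zeros((p1, p2))], [B, C]])
m = p1 + p2

ext = []
for th in np.linspace(-np.pi, np.pi, 2000):
    Gpoly = np.eye(m) - G * np.exp(1j * th)          # characteristic polynomial at e^{i theta}
    ev = np.linalg.eigvalsh(Gpoly.conj().T @ Gpoly)  # spectrum of G*(theta) G(theta)
    ext.append((ev.min(), ev.max()))
mu_min = min(e[0] for e in ext)
mu_max = max(e[1] for e in ext)

rho = np.abs(np.linalg.eigvals(G)).max()
print(mu_min, (1 - rho) ** 2)  # compare with the lower bound in (21)
```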
Remark 4 (The impact of the system's lower-triangular structure on the established bounds). Consider the bounds in (20) and (21): an upper bound of $\mu_{\max}(\mathcal{G})$ depends on $|||G|||_\infty$ and $|||G|||_1$, whereas a lower bound of $\mu_{\min}(\mathcal{G})$ involves only the spectral radius of $G$. Combined with (19), this suggests that the lower extreme of the spectral density is associated with the average of the maximum weighted in-degree and out-degree of the system, whereas the upper extreme is associated with the stability condition: the less intra- and inter-connected the system is, the tighter the bound on the lower extreme; similarly, the more stable the system is (i.e., the smaller its temporal dependence), the tighter the bound on the upper extreme. Finally, we note that an upper bound for $(|||G|||_\infty + |||G|||_1)$ is given by
\[
\max\{|||A|||_\infty + |||B|||_\infty,\, |||C|||_\infty\} + \max\{|||A|||_1,\, |||B|||_1 + |||C|||_1\}.
\]
The presence of $|||B|||_\infty$ and $|||B|||_1$ depicts the role of the inter-connectedness between $\{X_t\}$ and $\{Z_t\}$ on the lower extreme of the spectrum, which is associated with the overall curvature of the joint process.

The impact of the system's lower-triangular structure on the system capacity. With $G$ being a lower-triangular matrix, we only require $\rho(A) < 1$ and $\rho(C) < 1$, so $A$ and $C$ can exhibit larger average weighted in- and out-degrees compared with a system in which $G$ does not possess such triangular structure. In the case where $G$ is a complete matrix, one deals with a $(p_1 + p_2)$-dimensional VAR system and requires $\rho(G) < 1$.

In this section, we develop a procedure for testing the hypothesis $H_0: B = 0$. Note that without the presence of $B$, the blocks $X_t$ and $Z_t$ in the model become decoupled and can be treated as two separate VAR models, whereas with a nonzero $B$, the group of variables in $Z_t$ is collectively "Granger-caused" by those in $X_t$. Moreover, since we are testing whether or not the entire block $B$ is zero, we do not need to rely on the exact distribution of its individual entries, but rather on a properly measured correlation between the responses and the covariates. To facilitate the presentation of the testing procedure, we first illustrate the proposed framework in a simpler model setting with $Y_t = \Pi X_t + \epsilon_t$, testing whether $\Pi = 0$; subsequently, we translate the results to the actual setting of interest, namely, whether or not $B = 0$ in the model $Z_t = B X_{t-1} + C Z_{t-1} + V_t$.

The testing procedure focuses on the following sequence of tests for the rank of $B$:
\[
H_0: \mathrm{rank}(B) \le r, \qquad \text{for an arbitrary } r < \min(p_1, p_2). \tag{22}
\]
Note that the hypothesis of interest, $B = 0$, corresponds to the special case with $r = 0$. To test it, we develop a procedure based on canonical correlations, which leverages ideas present in the literature (see Anderson, 1999).

As mentioned above, we consider a simpler setting similar to that in Anderson (1999, 2002), given by $Y_t = \Pi X_t + \epsilon_t$, where $Y_t \in \mathbb{R}^{p_2}$, $X_t \in \mathbb{R}^{p_1}$, and $\epsilon_t$ is independent of $X_t$. At the population level, let $\mathbb{E} Y_t Y_t' = \Sigma_Y$, $\mathbb{E} X_t X_t' = \Sigma_X$, and $\mathbb{E} Y_t X_t' = \Sigma_{YX} = \Sigma_{XY}'$. The population canonical correlations between $Y_t$ and $X_t$ are the roots of
\[
\begin{vmatrix} -\rho\Sigma_Y & \Sigma_{YX} \\ \Sigma_{XY} & -\rho\Sigma_X \end{vmatrix} = 0,
\]
i.e., the nonnegative solutions of
\[
|\Sigma_{YX}\Sigma_X^{-1}\Sigma_{XY} - \rho^2\Sigma_Y| = 0, \tag{23}
\]
with $\rho$ being the unknown. By the results in Anderson (1999, 2002), the number of positive solutions of (23) is equal to the rank of $\Pi$, and indicates the "degree of dependency" between the processes $Y_t$ and $X_t$.
This suggests that if $\mathrm{rank}(\Pi) \le r < p_2$, we would expect $\sum_{k=r+1}^{p_2} \lambda_k$ to be small, where the $\lambda$'s solve the eigen-equation
\[
|S_{YX}S_X^{-1}S_{XY} - \lambda S_Y| = 0, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{p_2},
\]
and $S_X$, $S_{XY}$ and $S_Y$ are the sample counterparts of $\Sigma_X$, $\Sigma_{XY}$ and $\Sigma_Y$, respectively.

With this background, we switch to our model setting, given by
\[
Z_t = B X_{t-1} + C Z_{t-1} + V_t, \tag{24}
\]
where $V_t$ is assumed to be independent of $X_{t-1}$ and $Z_{t-1}$, and $B$ encodes the canonical correlation between $Z_t$ and $X_{t-1}$, conditional on $Z_{t-1}$. We use the same notation as in Section 3; that is, let $\Gamma_X(h) = \mathbb{E} X_t X_{t+h}'$, $\Gamma_Z(h) = \mathbb{E} Z_t Z_{t+h}'$, and $\Gamma_{X,Z}(h) = \mathbb{E} X_t Z_{t+h}'$, with $(h)$ omitted whenever $h = 0$. At the population level, under the Gaussian assumption,
\[
\begin{bmatrix} Z_t \\ X_{t-1} \\ Z_{t-1} \end{bmatrix} \sim \mathcal{N}\left( 0,\ \begin{bmatrix} \Gamma_Z & \Gamma_{X,Z}'(1) & \Gamma_Z(1) \\ \Gamma_{X,Z}(1) & \Gamma_X & \Gamma_{X,Z} \\ \Gamma_Z'(1) & \Gamma_{X,Z}' & \Gamma_Z \end{bmatrix} \right),
\]
\[
Z_t \,\big|\, Z_{t-1} \sim \mathcal{N}\big( \Gamma_Z(1)\Gamma_Z^{-1} Z_{t-1},\ \Sigma_{11} \big) \quad \text{and} \quad X_{t-1} \,\big|\, Z_{t-1} \sim \mathcal{N}\big( \Gamma_{X,Z}\Gamma_Z^{-1} Z_{t-1},\ \Sigma_{22} \big),
\]
where
\[
\Sigma_{11} := \Gamma_Z - \Gamma_Z(1)\Gamma_Z^{-1}\Gamma_Z'(1) \quad \text{and} \quad \Sigma_{22} := \Gamma_X - \Gamma_{X,Z}\Gamma_Z^{-1}\Gamma_{X,Z}'. \tag{25}
\]
Then, we have that jointly
\[
\begin{bmatrix} Z_t \\ X_{t-1} \end{bmatrix} \Big|\, Z_{t-1} \sim \mathcal{N}\left( \begin{bmatrix} \Gamma_Z(1) \\ \Gamma_{X,Z} \end{bmatrix}\Gamma_Z^{-1} Z_{t-1},\ \begin{bmatrix} \Gamma_Z & \Gamma_{X,Z}'(1) \\ \Gamma_{X,Z}(1) & \Gamma_X \end{bmatrix} - \begin{bmatrix} \Gamma_Z(1) \\ \Gamma_{X,Z} \end{bmatrix}\Gamma_Z^{-1}\begin{bmatrix} \Gamma_Z'(1) & \Gamma_{X,Z}' \end{bmatrix} \right),
\]
so the partial covariance matrix between $X_{t-1}$ and $Z_t$, conditional on $Z_{t-1}$, is given by
\[
\Sigma_{21} = \Sigma_{12}' := \Gamma_{X,Z}(1) - \Gamma_{X,Z}\Gamma_Z^{-1}\Gamma_Z'(1). \tag{26}
\]
The population canonical correlations between $Z_t$ and $X_{t-1}$, conditional on $Z_{t-1}$, are the non-negative roots of
\[
\big| \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \rho^2\Sigma_{11} \big| = 0,
\]
and the number of positive solutions corresponds to the rank of $B$; see Anderson (1951) for a discussion in which the author is interested in estimating and testing linear restrictions on regression coefficients. Therefore, to test $\mathrm{rank}(B) \le r$, it is appropriate to examine the behavior of $\Psi_r := \sum_{k=r+1}^{\min(p_1, p_2)} \phi_k$, where the $\phi$'s are the ordered non-increasing solutions of
\[
|S_{12}S_{22}^{-1}S_{21} - \phi S_{11}| = 0, \tag{27}
\]
and $S_{11}$, $S_{22}$ and $S_{12}$ are the empirical surrogates of the population quantities $\Sigma_{11}$, $\Sigma_{22}$ and $\Sigma_{12}$. For subsequent developments, we make the very mild assumption that $p_1 < T$ and $p_2 < T$, so that $\mathcal{Z}'\mathcal{Z}$ is invertible.

Proposition 1 below gives the tail behavior of the eigenvalues, and Corollary 1 gives the testing procedure for block "Granger-causality" as a direct consequence.

Proposition 1.
Consider the model setup given in (24), where $B \in \mathbb{R}^{p_2 \times p_1}$. Further, assume that all positive eigenvalues $\mu$ of the eigen-equation
\[
\big| \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \mu\Sigma_{11} \big| = 0 \tag{28}
\]
are of algebraic multiplicity one, where $\Sigma_{11}$, $\Sigma_{22}$ and $\Sigma_{12}$ are given in (25) and (26). The test statistic for testing
\[
H_0: \mathrm{rank}(B) \le r, \qquad \text{for an arbitrary } r < \min(p_1, p_2),
\]
is given by $\Psi_r := \sum_{k=r+1}^{\min(p_1, p_2)} \phi_k$, where the $\phi_k$'s are the ordered decreasing solutions of the eigen-equation $|S_{12}S_{22}^{-1}S_{21} - \phi S_{11}| = 0$, with
\[
S_{22} = \frac{1}{T}\mathcal{X}'(\mathrm{I} - P_z)\mathcal{X}, \qquad S_{11} = \frac{1}{T}(\mathcal{Z}^T)'(\mathrm{I} - P_z)(\mathcal{Z}^T), \qquad S_{21} = S_{12}' = \frac{1}{T}\mathcal{X}'(\mathrm{I} - P_z)(\mathcal{Z}^T),
\]
and $P_z = \mathcal{Z}(\mathcal{Z}'\mathcal{Z})^{-1}\mathcal{Z}'$. Moreover, the limiting behavior of $\Psi_r$ is given by
\[
T\Psi_r \sim \chi^2_{(p_1 - r)(p_2 - r)}.
\]

Remark 5. We provide a short comment on the assumption in Proposition 1 that the positive solutions of (28) have algebraic multiplicity one. This assumption is imposed on the eigen-equation associated with the population quantities, to exclude the case where a positive root has algebraic multiplicity greater than one while its geometric multiplicity does not match the algebraic one, in which case we would fail to obtain $r$ mutually independent canonical variates and the rank-$r$ structure would become degenerate. With the imposed assumption, which is common in the canonical correlation analysis literature (e.g., Anderson, 2002; Bach and Jordan, 2005), such a scenario is automatically excluded. This condition is not stringent: for the $\phi$'s that solve the eigen-equation associated with the sample quantities, distinctness of the roots holds with probability 1 (see Hsu, 1941b, Proof of Lemma 3).

Corollary 1 (Testing group Granger-causality). Under the model setup in (24), the test statistic for testing $H_0: B = 0$ is given by
\[
\Psi := \sum_{k=1}^{\min(p_1, p_2)} \phi_k,
\]
with the $\phi_k$'s being the ordered decreasing solutions of
\[
\Big| S_{12}\big[\mathrm{diag}(S_{22})\big]^{-1} S_{21} - \phi S_{11} \Big| = 0.
\]
Asymptotically, $T\Psi \sim \chi^2_{p_1 p_2}$. To conduct a level-$\alpha$ test, we reject the null hypothesis $H_0: B = 0$ if
\[
\Psi > \frac{1}{T}\chi^2_{p_1 p_2}(\alpha),
\]
where $\chi^2_d(\alpha)$ is the upper-$\alpha$ quantile of the $\chi^2$ distribution with $d$ degrees of freedom.

Remark 6. Corollary 1 is a special case of Proposition 1 with the null hypothesis being $H_0: r = 0$, which corresponds to the Granger-causality test. Under this particular setting, we are able to take the inverse with respect to $\mathrm{diag}(S_{22})$ yet maintain the same asymptotic distribution, owing to the fact that $\Sigma_{12} = \Sigma_{21} = 0$ under the null hypothesis $B = 0$. This enables us to perform the test even when $p_1 > T$.

The above testing procedure takes advantage of the fact that when $B = 0$, the canonical correlations among the partial regression residuals, after removing the effect of $Z_{t-1}$, are very close to zero. However, the test may not be as powerful under a sparse alternative, i.e., $H_A$: $B$ is sparse. In Appendix D, we present a testing procedure that specifically takes into consideration the fact that the alternative hypothesis is sparse; the corresponding performance evaluation under this setting is shown in Section 5.3.
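A sketch of the group Granger-causality test in Corollary 1, computing $\Psi$ from the partial-regression quantities and comparing $T\Psi$ against the $\chi^2$ critical value. The function name and defaults are illustrative, and scipy is assumed available.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.stats import chi2

def block_granger_test(X_lag, Z_lag, Z_resp, alpha=0.05):
    """Test H0: B = 0 in Z_t = B x_{t-1} + C z_{t-1} + v_t (Corollary 1).
    X_lag: T x p1 (lagged X); Z_lag: T x p2 (lagged Z); Z_resp: T x p2."""
    T = X_lag.shape[0]
    # residual-maker matrix I - P_z removing the effect of lagged Z
    P_z = Z_lag @ np.linalg.solve(Z_lag.T @ Z_lag, Z_lag.T)
    M = np.eye(T) - P_z
    S22 = X_lag.T @ M @ X_lag / T
    S11 = Z_resp.T @ M @ Z_resp / T
    S21 = X_lag.T @ M @ Z_resp / T               # p1 x p2
    # generalized eigenvalues of S12 [diag(S22)]^{-1} S21 w.r.t. S11
    D_inv = np.diag(1.0 / np.diag(S22))
    lhs = S21.T @ D_inv @ S21                    # p2 x p2, PSD
    phi = eigh(lhs, S11, eigvals_only=True)
    # roots beyond min(p1, p2) are zero since S21 has rank <= min(p1, p2)
    Psi = phi.sum()
    crit = chi2.ppf(1 - alpha, df=X_lag.shape[1] * Z_resp.shape[1])
    return T * Psi, crit, T * Psi > crit
```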
Performance Evaluation.

Next, we present the results of numerical studies to evaluate the performance of the developed ML estimates (Section 2.1) of the model parameters, as well as that of the testing procedure (Section 4).

A number of factors may potentially influence the performance of the estimation procedure; in particular, the model dimensions $p_1$ and $p_2$, the sample size $T$, the rank of $B^\star$ and the sparsity levels of $A^\star$ and $C^\star$, as well as the spectral radii of $A^\star$ and $C^\star$. Hence, we consider several settings in which these parameters vary.

For all settings, the data $\{x_t\}$ and $\{z_t\}$ are generated according to the model
\[ x_t = A^\star x_{t-1} + u_t, \qquad z_t = B^\star x_{t-1} + C^\star z_{t-1} + v_t. \]
For the sparse components, each entry in $A^\star$ and $C^\star$ is nonzero with probability $2/p_1$ and $1/p_2$, respectively; the nonzero entries are drawn uniformly from an interval bounded away from zero, and the matrices are then rescaled so that $\rho(A^\star)$ and $\rho(C^\star)$ satisfy the stability condition. For the low-rank component, the entries of $B^\star$ are generated uniformly at random and the matrix is then adjusted so that $\mathrm{rank}(B^\star)$ conforms with the model setup. For the contemporaneous dependence encoded by $\Omega^\star_u$ and $\Omega^\star_v$, both matrices are generated according to an Erdős–Rényi random graph, with sparsity 0.05 and condition number 3.

Table 1 depicts the values of the model parameters under the different settings. Specifically, we consider three major settings in which the size of the system, the rank of the cross-dependence component, and the stability of the system vary. The sample size is fixed at $T = 200$ unless otherwise specified. Additional settings examined (not reported due to space considerations) are consistent with the main conclusions presented next.

We use sensitivity (SEN), specificity (SPC) and the relative error in Frobenius norm (Error) as criteria to evaluate the performance of the estimates of the transition matrices $A$, $B$ and $C$; tuning parameters are chosen based on BIC. Since the exact contemporaneous dependence is not of primary concern, we omit the numerical results for $\widehat{\Omega}_u$ and $\widehat{\Omega}_v$. The criteria are defined as
\[ \mathrm{SEN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{SPC} = \frac{\mathrm{TN}}{\mathrm{FP} + \mathrm{TN}}, \qquad \mathrm{Error} = \frac{|||\,\mathrm{Est.} - \mathrm{Truth}\,|||_F}{|||\,\mathrm{Truth}\,|||_F}. \]
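As a complement to this description, the following sketch generates data from one simulation setting, assuming NumPy; the uniform interval $[1.5, 2.5]$ and the identity innovation covariance are illustrative stand-ins, since the paper's exact interval bounds and the Erdős–Rényi-based $\Omega^\star_u$, $\Omega^\star_v$ construction are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_transition(p, prob, radius):
    """Sparse transition matrix rescaled to a target spectral radius."""
    mask = rng.random((p, p)) < prob
    signs = rng.choice([-1.0, 1.0], size=(p, p))
    A = mask * signs * rng.uniform(1.5, 2.5, size=(p, p))  # illustrative bounds
    return A * radius / np.max(np.abs(np.linalg.eigvals(A)))

def low_rank_B(p2, p1, r):
    """Rank-r cross-dependence matrix via truncated SVD of a random matrix."""
    M = rng.uniform(-1.0, 1.0, size=(p2, p1))
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r]

p1, p2, r, T = 50, 20, 5, 200                 # setting A.1
A = sparse_transition(p1, 2 / p1, 0.5)
C = sparse_transition(p2, 1 / p2, 0.5)
B = low_rank_B(p2, p1, r)
x = np.zeros((T + 1, p1)); z = np.zeros((T + 1, p2))
for t in range(T):                            # simulate the two-block VAR
    x[t + 1] = A @ x[t] + rng.standard_normal(p1)
    z[t + 1] = B @ x[t] + C @ z[t] + rng.standard_normal(p2)
```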
Table 1: Model parameters under the different simulation settings.

                     Setting   p1    p2    rank(B*)   rho(A*)   rho(C*)
model dimension      A.1       50    20    5          0.5       0.5
                     A.2       100   50    5          0.5       0.5
                     A.3       200   50    5          0.5       0.5
                     A.4       50    100   5          0.5       0.5
rank                 B.1       100   50    10         0.5       0.5
                     B.2       100   50    20         0.5       0.5
spectral radius      C.1       50    20    5          0.8       0.5
                     C.2       50    20    5          0.5       0.8
                     C.3       50    20    5          0.8       0.8

Table 2: Performance evaluation of Â, B̂ and Ĉ under different model settings (standard deviations in parentheses).

        performance of Â                    performance of B̂       performance of Ĉ
        SEN           SPC      Error        rank(B̂)  Error         SEN      SPC      Error
A.1     0.98 (0.014)  (0.004)  (0.032)      (0.42)    (0.008)       (0.000)  (0.008)  (0.074)
A.2     0.97 (0.014)  (0.001)  (0.015)      (0.42)    (0.011)       (0.008)  (0.004)  (0.033)
A.3     0.99 (0.005)  (0.002)  (0.011)      (0.92)    (0.022)       (0.000)  (0.009)  (0.028)
A.4     0.96 (0.026)  (0.002)  (0.034)      (0.42)    (0.012)       (0.009)  (0.001)  (0.010)
B.1     0.97 (0.008)  (0.001)  (0.017)      (1.17)    (0.008)       (0.000)  (0.001)  (0.021)
B.2     0.98 (0.008)  (0.001)  (0.016)      (0.91)    (0.006)       (0.000)  (0.001)  (0.018)
C.1     1.00 (0.000)  (0.005)  (0.015)      (0.52)    (0.006)       (0.000)  (0.021)  (0.072)
C.2     0.99 (0.007)  (0.004)  (0.022)      (0.00)    (0.014)       (0.000)  (0.019)  (0.011)
C.3     1.00 (0.000)  (0.004)  (0.013)      (1.16)    (0.011)       (0.000)  (0.029)  (0.067)
C.3'    1.00 (0.000)  (0.002)  (0.016)      (0.42)    (0.005)       (0.000)  (0.021)  (0.023)
Table 2 illustrates the performance for each of the parameters under the different simulation settings considered. The results are averages over 100 replications, with their standard deviations given in parentheses. Overall, the results are highly satisfactory and all the parameters are estimated with a high degree of accuracy. Further, all estimates were obtained in fewer than 20 iterations, indicating that the estimation procedure is numerically stable. As expected, when the spectral radii of $A$ and $C$ increase, thus leading to less stable $\{X_t\}$ and $\{Z_t\}$ processes, a larger sample size is required for the estimation procedure to match the performance of the setting with the same parameters but smaller $\rho(A)$ and $\rho(C)$. This is illustrated in row C.3' of Table 2, where the sample size is increased to $T = 500$; it outperforms the results in row C.3, in which $T = 200$, and broadly matches those of row A.1.

Next, we investigate the robustness of the algorithm in settings where the marginal distributions of $\{X_t\}$ and $\{Z_t\}$ deviate from the posited Gaussian assumption and may be more heavy-tailed. Specifically, we consider the following two distributions that have been studied in Qiu et al. (2015):
- $t$-distribution: the idiosyncratic error processes $\{u_t\}$ and $\{v_t\}$ are generated from multivariate $t$-distributions with 3 degrees of freedom, and covariance matrices $(\Omega^\star_u)^{-1}$ and $(\Omega^\star_v)^{-1}$, respectively.
- elliptical distribution: $(u'_1, \ldots, u'_T)'$ and $(v'_1, \ldots, v'_T)'$ are generated from an elliptical distribution (e.g., Rémillard et al., 2012) with a log-normal generating variate $\log\mathcal{N}(0, 2)$ and covariance matrices $\widetilde{\Sigma}_u$ and $\widetilde{\Sigma}_v$; both are block-diagonal, with $\Sigma^\star_u = (\Omega^\star_u)^{-1}$ and $\Sigma^\star_v = (\Omega^\star_v)^{-1}$ respectively on the diagonals.

For both scenarios, the transition matrices, $\Omega^\star_u$ and $\Omega^\star_v$ are generated analogously to those in the Gaussian setting. We present the results for $\widehat{A}$, $\widehat{B}$ and $\widehat{C}$ under model settings A.2, B.1, C.1 and C.2 (see Table 1).

Based on Table 3, the performance of the estimates under these heavy-tailed settings is comparable to that under the Gaussian settings, in terms of sensitivity and specificity for $A$ and $C$, as well as rank selection for $B$. However, the estimation error exhibits some deterioration, which is more pronounced in the $t$-distribution case. In summary, the estimation procedure proves to be very robust for support recovery and rank estimation, even in the presence of more heavy-tailed noise terms.

Table 3: Performance evaluation of Â, B̂ and Ĉ under non-Gaussian settings (standard deviations in parentheses).

                performance of Â                    performance of B̂       performance of Ĉ
                SEN           SPC      Error        rank(B̂)  Error         SEN      SPC      Error
A.2 t(df=3)     0.99 (0.005)  (0.013)  (0.062)      (1.45)    (0.019)       (0.013)  (0.005)  (0.038)
    elliptical  0.97 (0.014)  (0.001)  (0.016)      (0.30)    (0.009)       (0.000)  (0.026)  (0.033)
B.1 t(df=3)     0.98 (0.008)  (0.014)  (0.083)      (0.49)    (0.026)       (0.015)  (0.004)  (0.091)
    elliptical  0.95 (0.015)  (0.001)  (0.024)      (0.22)    (0.013)       (0.000)  (0.013)  (0.001)
C.1 t(df=3)     0.99 (0.001)  (0.011)  (0.03)       (1.13)    (0.014)       (0.000)  (0.016)  (0.068)
    elliptical  1.00 (0.000)  (0.006)  (0.013)      (0.44)    (0.007)       (0.000)  (0.020)  (0.041)
C.2 t(df=3)     0.99 (0.002)  (0.023)  (0.056)      (0.22)    (0.017)       (0.000)  (0.017)  (0.029)
    elliptical  0.88 (0.029)  (0.001)  (0.032)      (0.14)    (0.020)       (0.000)  (0.026)  (0.046)
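The two heavy-tailed schemes above can be reproduced along the following lines (a sketch, assuming NumPy; `Sigma` stands for $(\Omega^\star)^{-1}$ and the empirical normalization of the generating variate is our simplification).

```python
import numpy as np

rng = np.random.default_rng(1)

def multivariate_t(Sigma, df, size):
    """Multivariate t with df > 2 degrees of freedom, rescaled so that
    its covariance matrix equals Sigma."""
    g = rng.chisquare(df, size) / df
    Z = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size)
    return Z / np.sqrt(g)[:, None] * np.sqrt((df - 2) / df)

def elliptical_lognormal(Sigma, size):
    """Elliptical draws with a log N(0, 2) generating variate, crudely
    normalized so the covariance matches Sigma."""
    xi = np.exp(rng.normal(0.0, np.sqrt(2.0), size))
    xi /= np.sqrt(np.mean(xi ** 2))              # empirical normalization
    Z = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size)
    return xi[:, None] * Z
```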
Lastly, we examine performance with respect to one-step-ahead forecasting; recall that VAR models are widely used for forecasting purposes in many application areas (Lütkepohl, 2005). The performance metric is the relative error, measured in the $\ell_2$ norm, for the out-of-sample points $x_{T+1}$ and $z_{T+1}$, where the predicted values are given by $\widehat{x}_{T+1} = \widehat{A} x_T$ and $\widehat{z}_{T+1} = \widehat{B} x_T + \widehat{C} z_T$, respectively. It is worth noting that both $\{X_t\}$ and $\{Z_t\}$ are mean-zero processes. However, since the transition matrix of $\{X_t\}$ is subject to the spectral radius constraint that ensures the stability of the corresponding process, the magnitudes of the realized values $x_t$ are small; whereas for $\{Z_t\}$, since no constraint is imposed on the coefficient matrix $B$ that encodes the inter-dependence, the $z_t$'s can take relatively large values in magnitude. Consequently, the relative error of $\widehat{x}_{T+1}$ is significantly larger than that of $\widehat{z}_{T+1}$, partially due to the small total magnitude of the denominator.

The results show that an increase in the spectral radius (keeping the other structural parameters fixed) leads to a decrease in the relative error, since future observations become more strongly correlated over time. On the other hand, an increase in dimension leads to a deterioration in forecasting, since the available sample size impacts the quality of the parameter estimates. Finally, an increase in the rank of the $B$ matrix is beneficial for forecasting, since $B$ then plays a stronger role in the system's temporal evolution.

Table 4: One-step-ahead relative forecasting error (standard deviations in parentheses).

                              rel. error of x̂_{T+1}      rel. error of ẑ_{T+1}
baseline           A.1        0.89 (0.066)                (0.075)
spectral radius    C.1        0.62 (0.100)                (0.035)
                   C.2        0.93 (0.062)                (0.059)
                   C.3        0.68 (0.096)                (0.045)
rank               B.1        0.92 (0.044)                (0.038)
                   B.2        0.94 (0.042)                (0.025)
dimension          A.2        0.87 (0.051)                (0.073)
                   A.3        0.96 (0.040)                (0.139)
                   A.4        0.89 (0.059)                (0.068)
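The evaluation metric itself amounts to the following few lines (a sketch, assuming NumPy and fitted estimates from the procedures above).

```python
import numpy as np

def one_step_relative_errors(A_hat, B_hat, C_hat, x, z):
    """x, z hold observations up to time T+1; forecast T+1 from time T."""
    x_pred = A_hat @ x[-2]
    z_pred = B_hat @ x[-2] + C_hat @ z[-2]
    err_x = np.linalg.norm(x_pred - x[-1]) / np.linalg.norm(x[-1])
    err_z = np.linalg.norm(z_pred - z[-1]) / np.linalg.norm(z[-1])
    return err_x, err_z
```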
We briefly compare the ML estimates to those obtained through the following two-step procedure:
- Step 1: estimate the transition matrices through penalized least squares:
\[ \widehat{A}^{\text{t-s}} = \operatorname*{argmin}_{A} \Big\{ \tfrac{1}{2T} \big|\big|\big| \mathcal{X}^T - \mathcal{X}A' \big|\big|\big|_F^2 + \lambda_A \|A\|_1 \Big\}, \qquad (\widehat{B}^{\text{t-s}}, \widehat{C}^{\text{t-s}}) = \operatorname*{argmin}_{(B,C)} \Big\{ \tfrac{1}{2T} \big|\big|\big| \mathcal{Z}^T - \mathcal{X}B' - \mathcal{Z}C' \big|\big|\big|_F^2 + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\}. \]
- Step 2: estimate the inverse covariance matrices by applying the graphical Lasso algorithm (Friedman et al., 2008) to the residuals calculated based on the Step 1 estimates.

Note that the two-step estimates coincide with our ML estimates at iteration 0, and they yield the same error rate in terms of the relative scale of $p_1$, $p_2$ and $T$. We compare the two sets of estimates under setting A.1 with $B^\star$ being low rank, and under setting A.2 with $B^\star$ being sparse, whose entries are nonzero with probability $1/p_1$.

In Table 5, we present the performance evaluation of the two-step estimates and the ML estimates under setting A.1. Additionally, in Tables 6 and 7 we track the value of the objective function, the relative error (in $|||\cdot|||_F$) and the cardinality (or rank) of the estimates along the iterations, with iteration 0 corresponding to the two-step estimates. A similar set of results is shown in Tables 8 to 10 for setting A.2, but with a sparse $B^\star$. All other model parameters are generated identically, according to the procedure described in Section 5.1.

As the results show, the ML estimates clearly outperform their two-step counterparts in terms of the relative error in Frobenius norm, while both sets of estimates exhibit similar performance in terms of sensitivity, specificity and rank specification. More specifically, when estimating $A$, the ML estimate decreases the false positive rate (higher SPC). Under setting A.1, when estimating $B$ and $C$, both estimates correctly identify the rank of $B$, and the ML estimate is more accurate in terms of the magnitude of $C$, at the expense of incorrectly including a few more entries in its support set; under setting A.2 with a sparse $B^\star$, improvements in the relative errors of both $B$ and $C$ are observed. In particular, due to the descent nature of the algorithm, we observe a sharp drop in the value of the objective function at iteration 1, as well as the most pronounced change in the estimates.

Table 5: Performance comparison under A.1 with a low-rank B.

                      performance of Â       performance of B̂       performance of Ĉ
                      SEN    SPC    Error    Error   rank(B̂)        SEN    SPC    Error
two-step estimates    0.97   0.95   0.52     0.27    5               1.00   0.98   0.12
ML estimates          0.97   0.97   0.36     0.24    5               1.00   0.95   0.05

Table 6: Relative error of Â and the values of the objective function under A.1.

iteration        0        1        2        3        4        5
Rel.Error        0.521    0.408    0.376    0.360    0.359    0.359
Cardinality      227      169      160      155      155      155
Value of Obj     128.14   41.74    37.94    37.85    37.70    37.70
Table 7: Relative error of B̂, Ĉ and the values of the objective function under A.1.

iteration            0        1        2        3        4        5        6        7        8        9        10
Cardinality of Ĉ    30       34       38       41       39       39       39       39       39       39       39
Value of Obj         160.41   134.26   131.90   131.48   131.38   131.17   131.01   130.96   130.96   130.85   130.8
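For reference, a minimal sketch of the two-step benchmark is given below, assuming scikit-learn; for simplicity, the nuclear-norm penalty on $B$ is replaced here by an $\ell_1$ penalty (as in the sparse-$B$ variant), since the proximal SVD step needed for $|||\cdot|||_*$ is not shown.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import GraphicalLasso

def two_step(x, z, lam_A=0.1, lam_BC=0.1, rho=0.1):
    Xlag, Xresp = x[:-1], x[1:]
    W, Zresp = np.hstack([x[:-1], z[:-1]]), z[1:]   # joint predictors [X, Z]
    # Step 1: penalized least squares, one response coordinate at a time.
    A_hat = np.vstack([Lasso(alpha=lam_A, fit_intercept=False)
                       .fit(Xlag, Xresp[:, j]).coef_
                       for j in range(x.shape[1])])
    BC_hat = np.vstack([Lasso(alpha=lam_BC, fit_intercept=False)
                        .fit(W, Zresp[:, j]).coef_
                        for j in range(z.shape[1])])
    B_hat, C_hat = BC_hat[:, :x.shape[1]], BC_hat[:, x.shape[1]:]
    # Step 2: graphical Lasso on the Step-1 residuals.
    Omega_u = GraphicalLasso(alpha=rho).fit(Xresp - Xlag @ A_hat.T).precision_
    Omega_v = GraphicalLasso(alpha=rho).fit(Zresp - W @ BC_hat.T).precision_
    return A_hat, B_hat, C_hat, Omega_u, Omega_v
```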
Next, we illustrate the empirical performance of the testing procedure introduced in Section 4, together with the alternative one (in Appendix D) for sparse $B$, with the null hypothesis being $B^\star = 0$ and the alternative being $B^\star \neq 0$, either low rank or sparse. Specifically, when the alternative hypothesis is true and has a low-rank structure, we use the general testing procedure proposed in Section 4, whereas when the alternative is true and sparse, we use the testing procedure presented in Appendix D. We focus on evaluating the type I error (empirical false rejection rate) when $B^\star = 0$, as well as the power of the test when $B^\star$ has nonzero entries.

For both testing procedures, the transition matrix $A^\star$ is generated with each entry being nonzero with probability $2/p_1$; the nonzero entries are drawn uniformly from an interval bounded away from zero, with the matrix rescaled so that $\rho(A^\star) = 0.5$. For the transition matrix $C^\star$, each entry is nonzero with probability $1/p_2$, with the nonzero entries generated analogously and the matrix rescaled so that $\rho(C^\star) = 0.5$ or $0.8$, depending on the simulation setting. Finally, we only consider the case where $v_t$ and $u_t$ have diagonal covariance matrices.

Table 8: Performance comparison under A.2 with a sparse B.

                      performance of Â       performance of B̂       performance of Ĉ
                      SEN    SPC    Error    SEN    SPC    Error    SEN    SPC    Error
two-step estimates    0.97   0.95   0.44     0.96   0.98   0.45     1      0.99   0.35
ML estimates          0.97   0.98   0.35     0.99   0.95   0.34     1      0.98   0.30

Table 9: Relative error of Â and the values of the objective function under A.2.

iteration        0        1        2        3        4
Rel.Error        0.438    0.346    0.351    0.351    0.351
Cardinality      479      350      325      324      324
Value of Obj     156.58   48.25    38.16    36.97    36.93
Table 10: Changes over the iterations under A.2.

iteration            0         1        2        3        4
Cardinality of B̂    301       325      323      323      323
Cardinality of Ĉ    63        70       74       74       74
Value of Obj         143.942   59.63    41.46    41.87    41.87

We use sub-sampling, as in Politis and Romano (1994) and Politis et al. (1999), with the number of subsamples set to 3,000; an alternative would have been a block bootstrap procedure (e.g., Hall, 1985; Carlstein, 1986; Kunsch, 1989). Note that the length of the subsamples varies across the simulation settings, in order to gain insight into how the sample size impacts the type I error and the power of the test.
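A sketch of such a subsampling scheme, assuming NumPy and a callable `statistic` evaluated on a block of length `b` (including its scaling, e.g., $b \cdot \Psi_r$), is given below; using random rather than exhaustive block starts is our simplification of the Politis–Romano construction.

```python
import numpy as np

def subsample_quantile(statistic, X, Z, b, alpha=0.05, n_sub=3000, seed=0):
    """Approximate the upper-alpha critical value of statistic(X, Z)
    by recomputing it on n_sub random contiguous blocks of length b."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    starts = rng.integers(0, T - b + 1, size=n_sub)
    stats = np.array([statistic(X[s:s + b], Z[s:s + b]) for s in starts])
    return np.quantile(stats, 1 - alpha)
```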
Low-rank testing.
To evaluate the type I error control and the power of the test, we primarily consider the case where $\mathrm{rank}(B^\star) = 0$, with the data alternatively generated based on $\mathrm{rank}(B^\star) = 1$. We test the hypothesis $H_0: \mathrm{rank}(B) = 0$ and tabulate the empirical proportion of falsely rejecting $H_0$ when $\mathrm{rank}(B^\star) = 0$ (type I error), and the probability of rejecting $H_0$ when $\mathrm{rank}(B^\star) = 1$ (power). In addition, we also show how the testing procedure performs when the underlying $B^\star$ has rank $r \geq 0$. In particular, when $\mathrm{rank}(B^\star) = r^\star$, the type I error of the test corresponds to the empirical proportion of rejections of the null hypothesis $H_0: r \leq r^\star$, while the power corresponds to the empirical proportion of rejections of the null hypothesis $H_0: r \leq (r^\star - 1)$. When $B^\star = 0$, the value of the proposed test statistic mostly falls below the cut-off value (upper $\alpha$ quantile), while when $\mathrm{rank}(B^\star) = 1$, it mostly falls beyond the critical value $\chi^2_{p_1 p_2}(\alpha)/T$, with $T$ being the sample size, hence leading to a detection. Table 11 gives the type I error of the test when setting $\alpha = 0.01, 0.05, 0.1$, and the power of the test using the upper 0.01 quantile of the reference distribution as the cut-off, for different combinations of model dimensions $(p_1, p_2)$ and sample sizes.

Based on the results shown in Table 11, it can be concluded that the proposed low-rank testing procedure accurately detects the presence of "Granger causality" across the two blocks when the data have been generated from a truly multi-layer VAR system. Further, when $B^\star = 0$, the type I error is close to the nominal $\alpha$ level for sufficiently large sample sizes, but deteriorates for increased model dimensions. In particular, relatively large values of $p_2$ and a larger spectral radius $\rho(C^\star)$ negatively impact the empirical false rejection proportion, which then deviates from the desired control level of the type I error. In the case where $\mathrm{rank}(B^\star) = r > 0$, the testing procedure provides satisfactory type I error control for larger sample sizes and excellent power.
Sparse testing.
Since the rejection rule of the HC-statistic is based on empirical process theory (Shorack and Wellner, 2009) and its dependence on $\alpha$ is not explicit, we focus on illustrating how the empirical proportion of false rejections (type I error) varies with the sample size $T$, the model dimensions $(p_1, p_2)$ and the spectral radius of $C^\star$. To examine the power of the test, each entry in $B^\star$ is nonzero with probability $q \in (0,1)$ such that $q(p_1 p_2) = (p_1 p_2)^{\theta}$ with $\theta = 0.6$, to ensure that the overall sparsity of $B^\star$ satisfies the sparsity requirement posited in Proposition 2. The magnitude of the nonzero entries is set such that the signal-to-noise ratio is 1.2. Note that the actual number of parameters is $p_1 p_2$, while the total number of subsamples used is 3,000, with the length of the subsamples varying according to the simulation setting, so as to demonstrate the dependence of the type I error and the power on the sample size.

Table 11: Empirical type I error and power for low-rank testing.

                                 type I error (B* = 0)                       power (rank(B*) = 1)
(p1, p2)       sample size       α = 0.01     α = 0.05     α = 0.10          cut-off χ²_{p1·p2}(0.01)/T
ρ(C*) = 0.5
(20, 20)       T = 500           0.028        0.123        0.227             1
               T = 1000          0.015        0.073        0.137             1
               T = 2000          0.011        0.059        0.118             1
(50, 20)       T = 500           0.070        0.228        0.355             1
               T = 1000          0.026        0.125        0.226             1
               T = 2000          0.013        0.094        0.163             1
(20, 50)       T = 500           0.484        0.751        0.857             1
               T = 1000          0.089        0.246        0.375             1
               T = 2000          0.020        0.088        0.164             1
(100, 50)      T = 500           0.997        0.999        1                 1
               T = 1000          0.608        0.828        0.908             1
               T = 2000          0.166        0.374        0.511             1
ρ(C*) = 0.8
(20, 20)       T = 500           0.533        0.789        0.880             1
               T = 1000          0.130        0.306        0.452             1
               T = 2000          0.045        0.145        0.252             1
(50, 20)       T = 500           0.083        0.250        0.382             1
               T = 1000          0.039        0.133        0.234             1
               T = 2000          0.019        0.096        0.174             1

                                 type I error (H0: r ≤ 5)                    power (H0: r ≤ 4)
                                 α = 0.01     α = 0.05     α = 0.10          cut-off χ²_{(p1−5)(p2−5)}(0.01)/T
ρ(C*) = 0.5, rank(B*) = 5
(20, 20)       T = 500           0.092        0.274        0.400             1
               T = 1000          0.034        0.140        0.236             1
               T = 2000          0.022        0.096        0.178             1
(50, 20)       T = 500           0.454        0.722        0.829             1
               T = 1000          0.126        0.313        0.452             1
               T = 2000          0.062        0.184        0.284             1

Table 12: Empirical type I error and power for sparse testing.

               type I error (B* = 0)                      power (SNR(B*) = 1.2)
(p1, p2)  T =  200      500      1000     2000            200      500      1000     2000
ρ(C*) = 0.5
(20, 20)       0.244    0.097    0.074    0.055           1        1        1        1
(50, 20)       0.393    0.131    0.108    0.074           1        1        1        1
(20, 50)       0.996    0.351    0.153    0.093           1        1        1        1
(100, 50)      1.000    0.963    0.270    0.115           1        1        1        1
ρ(C*) = 0.8
(20, 20)       0.402    0.158    0.112    0.075           0.829    0.996    1        1
(20, 50)       0.999    0.430    0.166    0.111           1        1        1        1

Based on the results shown in Table 12, when $B^\star = 0$ the proposed testing procedure effectively detects the absence of block "Granger causality", provided that the sample size is moderately large compared to the total number of parameters being tested. However, when the model dimension is large and the sample size is small, the testing procedure becomes problematic and fails to provide the desired type I error control. When $B^\star$ is nonzero, the test empirically always detects its presence, as long as the effective signal-to-noise ratio is beyond the detection threshold.

Real Data Analysis Illustration.
We employ the developed framework and the associated testing procedures to address one of the motivating applications. Specifically, we analyze the temporal dynamics of the log-returns of stocks with large market capitalization and of key macroeconomic variables, as well as their cross-dependence. Using the notation in (1), we assume that the $X_t$ block consists of the stock log-returns, while the macroeconomic variables form the $Z_t$ block. With this specification, we posit that the macroeconomic variables are "Granger-caused" by the stock market, but not vice versa. Note that our framework allows us to pose and test a more general question than previous work in the economics literature considered. For example, Farmer (2015), building on previous work by Fitoussi et al. (2000) and Phelps (1999), tests only the relationship between the employment index and the composite stock index, using a bivariate VAR model. On the other hand, our framework enables us to consider the components of the S&P 100 index and the "medium" list of macroeconomic variables considered in the work of Stock and Watson (2005).

Next, we provide a brief description of the data used. The stock data consist of monthly observations of 71 stocks corresponding to a stable set of historical components comprising the S&P 100 index for the 2001–2016 period. The macroeconomic variables are chosen from the "medium" list in Stock and Watson (2005); that is, the 3 core economic indicators (Federal Funds Rate, Consumer Price Index and Gross Domestic Product growth rate), plus 17 additional variables with aggregate information (e.g., exchange rate, employment, housing). However, in our study, we exclude variables that exhibit a significant change after the financial crisis of 2008 (e.g., total reserves/reserves of depository institutions). We process the macroeconomic variables to ensure stationarity, following the recommendations in Stock and Watson (2005). As a general guideline, for real quantitative variables (e.g., GDP, money supply M2), we use the relative first difference, which corresponds to their growth rate, while for rate-related variables (e.g., Federal Funds Rate, unemployment rate), we use their first differences directly. The complete lists of stocks and macroeconomic variables used in this study are given in Appendix G.

We start the analysis by using the VAR model for the stock log-returns to study their evolution over the 2001–2016 period. Analogously to the strategy employed by Billio et al. (2012), we consider 36-month-long rolling windows for fitting the model $X_t = A X_{t-1} + U_t$, for a total of 143 estimates of the transition matrix $A$. VAR models involving more than 1 lag were also fitted to the data, but did not indicate temporal dependence beyond lag 1. To obtain the final estimates across all 143 subsamples, we employ stability selection (Meinshausen and Bühlmann, 2010), with the threshold for including an edge in $A$ set at 0.6.
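A minimal sketch of this rolling-window stability-selection scheme, assuming NumPy and a user-supplied `fit_A` routine that returns a sparse estimate of $A$ on one window, is given below; the window length (36) and the threshold (0.6) follow the text.

```python
import numpy as np

def stable_support(x, fit_A, window=36, threshold=0.6):
    """Retain an edge of A if it is selected in at least a `threshold`
    fraction of the rolling windows (143 windows in the application)."""
    T, p = x.shape
    counts = np.zeros((p, p))
    n_windows = T - window + 1
    for s in range(n_windows):
        counts += fit_A(x[s:s + window]) != 0   # per-window selection indicator
    return counts / n_windows >= threshold
```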
Figure 2 depicts the global clustering coefficient (Luce and Perry, 1949) of the skeleton of the estimated $A$ over all 143 rolling windows, with the time stamps on the horizontal axis specifying the starting time of the corresponding window. The threshold is set at a relatively low level to compensate for the relatively small rolling-window size.

Figure 2: Global clustering coefficient of the estimated A over the different periods.

Next, we fit the model $Z_t = B X_{t-1} + C Z_{t-1} + V_t$, with the stock log-returns corresponding to the $X_t$ block and the (stationary) macroeconomic variables to the $Z_t$ block. As before, we fit the data within each rolling window, with the tuning parameters selected via a search over a 10 × 10 lattice of $(\lambda_B, \lambda_C)$ values.
For all rolling windows, the rank of the estimated $B$ is 1 (data not shown). The sparsity level of the estimated $C$ over the 143 rolling windows is depicted in Figure 3; the connectivity patterns in $C$ are more complex and nuanced than those for the stocks.

Figure 3: Sparsity level of the estimated C over the different periods.

Several of the local peaks depicted correspond to the following events: (i) March–April 2003, when the Federal Reserve cut the Effective Federal Funds Rate aggressively, driving it down to 1%, the lowest level in 45 years up to that point; (ii) January–March 2008, a significant decline in the major US stock indexes, coupled with the freezing up of the market for auction rate securities, with investors declining to bid; (iii) January–April 2009, which characterizes the unfolding of the European debt crisis, with a rescue package put together for Greece and significant downgrades of Spanish and Portuguese sovereign debt by rating agencies; and (iv) July 2010, which corresponds to the enactment of the Dodd–Frank Wall Street Reform and Consumer Protection Act and the acceleration of quantitative easing by the Federal Reserve Board.

Based on the previous findings, we partition the 2001–2016 time frame into pre- (2001/07–2007/03), during- (2007/01–2009/12) and post-crisis (2010/01–2016/06) periods, and estimate the model parameters using the data within each entire sub-period.

The estimation procedure for the transition matrix $A$ in the different periods is identical to that described above using subsamples over rolling windows. For the pre- and post-crisis periods, since we have 76 and 77 samples respectively, the stability selection threshold is set at 0.75, whereas for the during-crisis period it is set at 0.6, to compensate for the small sample size (36). Table 13 shows the average R-squared over all 71 stocks, as well as its standard deviation, calculated based on the in-sample fit, i.e., the proportion of variation explained by fitting the VAR(1) model to the data. The overall sparsity level and the spectral radius of the estimated transition matrices $A$ are also presented. The results are consistent with the previous finding of increased connectivity during the crisis. Further, for all periods the estimate of the spectral radius is fairly large, indicating strong temporal dependence of the log-returns.

Table 13: Summary for the estimated A within the different periods.

                      2001/07–2007/03    2007/01–2009/12    2010/01–2016/06
Averaged R-sq         0.31               0.72               0.28
Sd of R-sq            0.103              0.105              0.094

Figure 7 depicts the estimated $A$ for the different periods as a network, with edges thresholded based on their magnitude for ease of presentation. The red/blue coloring of a node or edge indicates the positive/negative sign of the corresponding entry in the transition matrix. Further, node size is proportional to the out-degree, thus indicating which stocks influence other stocks in the next time period. The most striking feature is the outsize influence exerted by the insurance company AIG and the investment bank Goldman Sachs, whose roles during the financial crisis have been well documented (Financial Crisis Inquiry Commission, 2011).
On the other hand, the pre- and post-crisis periods are characterized by sparser and more balanced networks, in terms of in- and out-degree magnitudes.

Next, we focus on the key motivation for developing the proposed modeling framework, namely the inter-dependence of the stocks and the macroeconomic variables over the three specified sub-periods. The $p$-values for testing the hypothesis of lack of block "Granger causality", $H_0: B = 0$, together with the spectral radius and the sparsity level of the estimated $C$ transition matrices, are listed in Table 14. For all three periods, the rank of the estimated $B$ is 1, indicating that the stock market, as captured by its leading stocks, "Granger-causes" the temporal evolution of the macroeconomic variables. The fact that the rank of $B$ is 1 indicates that the inter-block influence can be captured by a single portfolio acting in unison. To investigate the relative importance of each sector in this portfolio, we group the stocks by sector. The proportion of each sector (up to normalization) is obtained by summing the loadings (the first right singular vector of the estimated $B$) of the stocks within the sector, weighted by their market capitalization. Further, the estimated transition matrices $C$ are depicted in network form in Figures 4 to 6. It is worth noting that the temporal correlation of the macroeconomic variables increased significantly during the crisis.

Note that the proportions of the various sectors in the portfolio are highly consistent with their role in the stock market. For example, before the crisis the financial sector had a large market capitalization (roughly 20%), which shrank (to roughly 12%) after the crisis. Also, the Information Technology (IT) and Financial (FIN) sectors are the ones exhibiting the highest volatility relative to the market (high beta), while Utilities is the one with low volatility (low beta), a well-established stylized fact in the literature for the time period under consideration.

Next, we discuss some key relationships emerging from the model. We start with total employment (ETTL), whose dynamics are only influenced by its own past values, as seen by the lack of an incoming arrow in Figure 5. Further, an examination of the left singular vector of $B$ (see Table 15) strongly indicates the impact of the stock market on total employment. This finding is consistent with the analysis in Farmer (2015), which argues that the crash of the stock market provides a plausible explanation for the Great Recession. However, the analysis in Farmer (2015) is based on bivariate VAR models involving only employment and the stock index. Therefore, there is a possibility that the stock market is reacting to some other information captured by other macroeconomic variables, such as GDP, capital spending, inflation or interest rates. In contrast, our high-dimensional VAR model simultaneously analyzes a key set of macroeconomic variables and also accounts for the influence of the largest stocks in the market. Hence, it automatically overcomes the criticism leveled by Sims (1992) about misinterpretations of findings from small-scale VAR models due to the omission of important variables, further echoed in the discussion in Bernanke et al. (2005).

Another interesting finding is the strong influence of the stock market on GDP in the pre- and post-crisis periods, consistent with the popular view of the stock market being a leading indicator for GDP growth.
Further, capital utilization is positively impacted during the crisis period by GDP growth and total employment (both falling, and hence reducing capital utilization), and this is further accentuated by the impact of the stock market (also falling), thus reinforcing the lack of available capital goods and resources.

In summary, the brief analysis presented above provides interesting insights into the interaction of the stock market with the real economy, identifies a number of interesting developments during the crisis period and reaffirms a number of findings studied in the literature, while ensuring that a much larger information set is utilized (a larger number of variables included) than in previous analyses. Hence, high-dimensional multi-block VAR models prove useful for analyzing complex temporal relationships and for providing insights into their dynamics.

Table 14: Summary for the estimated B and C within the different periods.

                                    2001/07–2007/03    2007/01–2009/12    2010/01–2016/06
p-value for testing H0: B = 0       0.075              0.009              0.044

Table 15: Loadings of the left singular vector of the estimated B for the different periods.

            Pre-Crisis    During-Crisis    Post-Crisis
FFR         -0.24         -0.26            -0.23
T10yr       -0.09          0.14             0.16
UNEMPL      -0.07          0.01            -0.07
IPI         -0.43          0.34             0.26
ETTL         0.33          0.24             0.13
M1           0.23         -0.12            -0.47
AHES        -0.01          0.30             0.17
CU          -0.49          0.32             0.27
M2           0.10         -0.04            -0.32
HS           0.51         -0.02            -0.02
EX          -0.18          0.41             0.06
PCEQI       -0.07         -0.18             0.41
GDP          0.10         -0.02             0.05
PCEPI        0.00          0.14            -0.01
PPI         -0.15          0.00             0.06
CPI          0.01          0.15            -0.31
SP.IND      -0.06         -0.53             0.38

Remark. We also applied our multi-block model with the first block $X_t$ corresponding to the macroeconomic variables and the second block $Z_t$ to the stock variables (results not shown). The key question is whether there is also "Granger causality" from the broader economy to the stock market. The results are inconclusive due to sample size issues, which do not allow us to properly test the key hypothesis of whether $B = 0$ or not. Specifically, the length of the sub-periods is short compared to the dimensionality required by the testing procedure. A similar issue, related to the detection boundary for the sparse testing procedure, arises during the crisis period. Further, for a sparse $B$, an examination of its entries shows that total employment did not impact the stock market, which is in line with the conclusion reached at the aggregate level by Farmer (2015). On the other hand, GDP negatively impacts the stock log-returns and may act as a leading indicator for suppressed investment and business growth, and hence future stock returns.

We briefly discuss generalizations of the model to the case of more than two blocks, as mentioned in the introductory section. For the sake of concreteness, consider a triangular recursive linear dynamical system given by:
\[ \begin{aligned} X^{(1)}_t &= A_{11} X^{(1)}_{t-1} + \epsilon^{(1)}_t, \\ X^{(2)}_t &= A_{12} X^{(1)}_{t-1} + A_{22} X^{(2)}_{t-1} + \epsilon^{(2)}_t, \\ X^{(3)}_t &= A_{13} X^{(1)}_{t-1} + A_{23} X^{(2)}_{t-1} + A_{33} X^{(3)}_{t-1} + \epsilon^{(3)}_t, \\ &\;\;\vdots \end{aligned} \tag{29} \]
where $X^{(j)} \in \mathbb{R}^{p_j}$ denotes the variables in group $j$, $A_{ij}$ ($i < j$) encodes the dependence of $X^{(j)}$ on the past values of the variables in group $i$, and $A_{jj}$ encodes the dependence on its own past values. Further, $\{\epsilon^{(j)}_t\}$ is an innovation process that is neither temporally nor cross-sectionally correlated, i.e.,
\[ \mathrm{Cov}(\epsilon^{(j)}_t, \epsilon^{(j)}_s) = 0 \;\; (s \neq t), \qquad \mathrm{Cov}(\epsilon^{(i)}_t, \epsilon^{(j)}_s) = 0 \;\; (i \neq j,\; \forall (s,t)), \qquad \mathrm{Cov}(\epsilon^{(j)}_t, \epsilon^{(j)}_t) = \big(\Omega^{(j)}\big)^{-1}, \]
with $\Omega^{(j)}$ capturing the conditional contemporaneous dependence of the variables within group $j$.
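For illustration, the following sketch simulates the triangular system (29) with three blocks, assuming NumPy; the transition matrices are supplied by the user (for instance generated as in Section 5.1), and standard normal innovations are used for simplicity, in place of draws with the $(\Omega^{(j)})^{-1}$ covariances.

```python
import numpy as np

def simulate_triangular(A, dims, T, rng):
    """A: dict of transition matrices A[i, j] (block i -> block j, i <= j),
    with A[i, j] of shape (p_j, p_i); dims = (p1, p2, p3).
    Returns a list of three T x p_j trajectories."""
    X = [np.zeros((T + 1, p)) for p in dims]
    for t in range(T):
        for j in range(3):
            drive = sum(A[i, j] @ X[i][t] for i in range(j + 1))
            X[j][t + 1] = drive + rng.standard_normal(dims[j])
    return [x[1:] for x in X]
```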
The model in (29) can also be viewed from a multi-layered time-varying network perspective: nodes in each layer are "Granger-caused" by nodes from the previous layers, and also depend on their own past values. As previously mentioned, in various real applications it is of interest to obtain estimates of the transition matrices, and/or to test whether "Granger-causality" is present between interacting blocks, i.e., to test $A_{ij} = 0$ for some $i \neq j$.

The triangular structure of the system decouples the estimation of the transition matrices across equations, and hence a straightforward extension of the estimation procedure presented in Section 2.1 becomes applicable. Specifically, to obtain estimates of the transition matrices $A_{ij}$ for fixed $j$ and $1 \leq i \leq j$, together with the inverse covariance matrix $\Omega^{(j)}$, the optimization problem is formulated as follows:
\[ \big( \{\widehat{A}_{ij}\}_{i \leq j},\, \widehat{\Omega}^{(j)} \big) = \operatorname*{argmin}_{A_{ij},\, \Omega^{(j)}} \Big\{ -\log\det \Omega^{(j)} + \frac{1}{T} \sum_{t=1}^{T} \Big( x^{(j)}_t - \sum_{i=1}^{j} A_{ij} x^{(i)}_{t-1} \Big)' \Omega^{(j)} \Big( x^{(j)}_t - \sum_{1 \leq i \leq j} A_{ij} x^{(i)}_{t-1} \Big) + \sum_{i=1}^{j} \mathcal{R}(A_{ij}) + \rho^{(j)} \|\Omega^{(j)}\|_{1,\text{off}} \Big\}, \tag{30} \]
where the exact expression for the regularizer $\mathcal{R}(A_{ij})$ adapts to the structural assumption imposed on the corresponding transition matrix (sparse/low-rank). Solving (30) again requires an iterative algorithm that alternates between updating the transition matrices and the inverse covariance matrix; for updating the transition matrices, a cyclic block-coordinate procedure is used.

Consistency results can be established analogously to those provided in Section 3, under the posited conditions of restricted strong convexity (RSC) and a deviation bound. With a larger number of interacting blocks of variables, the lower bounds for the lower extremes of the spectra involve all the corresponding transition matrices. The error rates that can be obtained are as follows: (i) if equation $k$ only involves sparse transition matrices, then the finite-sample bounds for the transition matrices in this layer, in Frobenius norm, are of the order $O\big(\sqrt{(\log p_k + \log \sum_{i \leq k} p_i)/T}\big)$, while (ii) if some of the transition matrices are assumed low rank, then the corresponding finite-sample bounds are of the order $O\big(\sqrt{(p_k + \sum_{i \leq k} p_i)/T}\big)$.

Another generalization that can be handled algorithmically by the same estimation procedure discussed above is the presence of $d$ lags in the specification of the linear dynamical system. Based on the consistency results developed in this work, together with the theoretical findings for VAR($d$) models presented in Basu and Michailidis (2015), we expect all the established theoretical properties of the transition matrix estimates to carry over under appropriate RSC and deviation bound conditions.
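A minimal sketch of the alternating scheme for a single layer $j$ of (30) is given below, assuming NumPy and scikit-learn, with all $A_{ij}$ treated as sparse; the low-rank case would replace the entrywise soft-thresholding by singular-value thresholding, and the step sizes and penalty levels are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def soft(M, t):
    """Entrywise soft-thresholding, the proximal map of the l1 penalty."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def fit_layer(Y, W, lam=0.1, rho=0.1, n_outer=20, n_inner=50):
    """Y: T x p_j responses; W: T x q stacked lagged predictors of blocks
    1..j. Alternates a proximal-gradient update of Pi = [A_1j ... A_jj]
    (weighted least squares with an l1 penalty) with a graphical-Lasso
    update of Omega^(j) on the residuals."""
    T, q = W.shape
    Pi = np.zeros((Y.shape[1], q))
    Omega = np.eye(Y.shape[1])
    G = W.T @ W / T
    YW = Y.T @ W / T
    for _ in range(n_outer):
        # Step size 1/L, with L bounding the gradient's Lipschitz constant.
        step = 1.0 / (np.linalg.eigvalsh(G)[-1] * np.linalg.eigvalsh(Omega)[-1])
        for _ in range(n_inner):
            grad = Omega @ (Pi @ G - YW)        # gradient of the weighted LS loss
            Pi = soft(Pi - step * grad, step * lam)
        resid = Y - W @ Pi.T
        Omega = GraphicalLasso(alpha=rho).fit(resid).precision_
    return Pi, Omega
```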
Figure 4: Sector proportions (pre-crisis: C.D 20%, C.S 0%, Energy 8%, FIN 36%, H.C 5%, IND 2%, IT 18%, MATL 2%, Telecom 9%, UTIL 0%) and the estimated C in network form for the pre-crisis period (2001–2007); red/blue edges indicate positive/negative entries.
Figure 5: Sector proportions (during-crisis: C.D 15%, C.S 7%, Energy 3%, FIN 21%, H.C 11%, IND 20%, IT 17%, MATL 3%, Telecom 2%, UTIL 1%) and the estimated C in network form for the during-crisis period (2007–2009); red/blue edges indicate positive/negative entries.
Figure 6: Sector proportions (post-crisis: C.D 10%, C.S 9%, Energy 9%, FIN 14%, H.C 8%, IND 11%, IT 37%, MATL 1%, Telecom 0%, UTIL 1%) and the estimated C in network form for the post-crisis period (2010–2016); red/blue edges indicate positive/negative entries.
Figure 7: Estimated transition matrices A for the stock dynamics, shown in network form for the 2001–2007, 2007–2009 and 2010–2016 periods; red/blue indicates positive/negative entries.

Additional Theorems and Proofs for Theorems 1 and 2.
In this section, we introduce two additional theorems that establish, respectively, the consistency properties of the initializers $\widehat{A}^{(0)}$ and $(\widehat{B}^{(0)}, \widehat{C}^{(0)})$, for fixed realizations of the processes $\{X_t\}$ and $\{Z_t\}$. Specifically, $\widehat{A}^{(0)}$ and $(\widehat{B}^{(0)}, \widehat{C}^{(0)})$ are the solutions to the following optimization problems:
\[ \widehat{A}^{(0)} := \operatorname*{argmin}_{A} \Big\{ \tfrac{1}{2T} \big|\big|\big| \mathcal{X}^T - \mathcal{X}A' \big|\big|\big|_F^2 + \lambda_A \|A\|_1 \Big\}, \tag{31} \]
\[ (\widehat{B}^{(0)}, \widehat{C}^{(0)}) := \operatorname*{argmin}_{B, C} \Big\{ \tfrac{1}{2T} \big|\big|\big| \mathcal{Z}^T - \mathcal{X}B' - \mathcal{Z}C' \big|\big|\big|_F^2 + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\}. \tag{32} \]
Note that they also correspond to the estimators in the setting where there is no contemporaneous dependence among the idiosyncratic error processes. If we additionally introduce the operators $\mathfrak{X}$ and $\mathfrak{W}$ defined as
\[ \mathfrak{X}(\Delta) = \mathcal{X}\Delta', \;\; \text{for } \Delta \in \mathbb{R}^{p_1 \times p_1}, \qquad \mathfrak{W}(\Delta) = \mathcal{W}\Delta', \;\; \text{for } \Delta \in \mathbb{R}^{p_2 \times (p_1 + p_2)}, \quad \text{where } \mathcal{W} := [\mathcal{X}, \mathcal{Z}], \]
then (31) and (32) can be equivalently written as
\[ \widehat{A}^{(0)} := \operatorname*{argmin}_{A} \Big\{ \tfrac{1}{2T} \big|\big|\big| \mathcal{X}^T - \mathfrak{X}(A) \big|\big|\big|_F^2 + \lambda_A \|A\|_1 \Big\}, \qquad (\widehat{B}^{(0)}, \widehat{C}^{(0)}) := \operatorname*{argmin}_{B, C} \Big\{ \tfrac{1}{2T} \big|\big|\big| \mathcal{Z}^T - \mathfrak{W}(B_{\mathrm{aug}} + C_{\mathrm{aug}}) \big|\big|\big|_F^2 + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\}, \]
where $B_{\mathrm{aug}} := [B, O_{p_2 \times p_2}]$ and $C_{\mathrm{aug}} := [O_{p_2 \times p_1}, C]$.

Theorem 3 (Error bounds for $\widehat{A}^{(0)}$). Suppose the operator $\mathfrak{X}$ satisfies the RSC condition with norm $\Phi(\Delta) = \|\Delta\|_1$, curvature $\alpha_{\mathrm{RSC}} > 0$ and tolerance $\tau > 0$, such that $s^\star_A \tau \leq \alpha_{\mathrm{RSC}}/32$. Then, with a regularization parameter $\lambda_A$ satisfying $\lambda_A \geq 2\|\mathcal{X}'U/T\|_\infty$, the solution to (31) satisfies the following bounds:
\[ |||\widehat{A}^{(0)} - A^\star|||_F \leq 12\sqrt{s^\star_A}\,\lambda_A/\alpha_{\mathrm{RSC}} \quad \text{and} \quad \|\widehat{A}^{(0)} - A^\star\|_1 \leq 48\, s^\star_A \lambda_A/\alpha_{\mathrm{RSC}}. \]

Theorem 4 (Error bound for $(\widehat{B}^{(0)}, \widehat{C}^{(0)})$). Let $J_{C^\star}$ be the support set of $C^\star$, with $s^\star_C$ denoting its cardinality, and let $r^\star_B$ be the rank of $B^\star$. Assume that $\mathfrak{W}$ satisfies the RSC condition with norm
\[ \Phi(\Delta) := \inf_{B_{\mathrm{aug}} + C_{\mathrm{aug}} = \Delta} Q(B, C), \quad \text{where} \quad Q(B, C) := |||B|||_* + \tfrac{\lambda_C}{\lambda_B}\|C\|_1, \]
curvature $\alpha_{\mathrm{RSC}}$ and tolerance $\tau$, such that $\tau r^\star_B \leq \alpha_{\mathrm{RSC}}/128$ and $\tau s^\star_C(\lambda_C/\lambda_B)^2 \leq \alpha_{\mathrm{RSC}}/64$. Then, with regularization parameters $\lambda_B$ and $\lambda_C$ satisfying $\lambda_B \geq 2|||\mathcal{W}'V/T|||_{\mathrm{op}}$ and $\lambda_C \geq 2\|\mathcal{W}'V/T\|_\infty$, the solution to (32) satisfies the following bound:
\[ |||\widehat{B}^{(0)} - B^\star|||_F + |||\widehat{C}^{(0)} - C^\star|||_F \leq 6\big( \sqrt{r^\star_B}\,\lambda_B + \sqrt{s^\star_C}\,\lambda_C \big)/\alpha_{\mathrm{RSC}}. \]

In the rest of this subsection, we first prove Theorems 3 and 4, and then prove Theorems 1 and 2, whose statements are given in Section 3.2.
Proof of Theorem 3. For ease of notation, in this proof we use $\widehat{A}$ to refer to $\widehat{A}^{(0)}$ whenever there is no ambiguity. Let $\beta^\star_A = \mathrm{vec}(A^\star)$, and denote the residual matrix and its vectorized version by $\Delta_A = \widehat{A} - A^\star$ and $\Delta_{\beta_A} = \widehat{\beta}_A - \beta^\star_A$, respectively. By the optimality of $\widehat{A}$ and the feasibility of $A^\star$, the following basic inequality holds:
\[ \tfrac{1}{2T} |||\mathfrak{X}(\Delta_A)|||_F^2 \leq \tfrac{1}{T} \langle\langle \Delta_A, \mathcal{X}'U \rangle\rangle + \lambda_A \big\{ \|A^\star\|_1 - \|A^\star + \Delta_A\|_1 \big\}, \]
or, in vectorized form,
\[ \tfrac{1}{2} \Delta'_{\beta_A} \widehat{\Gamma}^{(0)}_X \Delta_{\beta_A} \leq \tfrac{1}{T} \langle \Delta_{\beta_A}, \mathrm{vec}(\mathcal{X}'U) \rangle + \lambda_A \big\{ \|\beta^\star_A\|_1 - \|\beta^\star_A + \Delta_{\beta_A}\|_1 \big\}, \tag{33} \]
where $\widehat{\Gamma}^{(0)}_X = I_{p_1} \otimes \mathcal{X}'\mathcal{X}/T$. By Hölder's inequality and the triangle inequality, an upper bound for the right-hand side of (33) is given by
\[ \tfrac{1}{T} \|\Delta_{\beta_A}\|_1 \|\mathcal{X}'U\|_\infty + \lambda_A \|\Delta_{\beta_A}\|_1. \tag{34} \]
Now, with the specified choice of $\lambda_A$, by Lemma C.1 we have $\|\Delta_{\beta_A}|_{J^c_{A^\star}}\|_1 \leq 3\|\Delta_{\beta_A}|_{J_{A^\star}}\|_1$, i.e., $\Delta_{\beta_A} \in \mathbb{C}(J_{A^\star}, 3)$, so that
\[ \|\Delta_{\beta_A}\|_1 \leq 4\|\Delta_{\beta_A}|_{J_{A^\star}}\|_1 \leq 4\sqrt{s^\star_A}\,\|\Delta_{\beta_A}\|_2. \]
Since $\lambda_A \geq 2\|\mathcal{X}'U/T\|_\infty$, (34) is further upper bounded by
\[ \tfrac{3}{2}\lambda_A \|\Delta_{\beta_A}\|_1 \leq 6\sqrt{s^\star_A}\,\lambda_A \|\Delta_{\beta_A}\|_2. \tag{35} \]
Combining the RSC condition with the upper bound given in (35), we have
\[ \alpha_{\mathrm{RSC}} \|\Delta_{\beta_A}\|_2^2 - \tau \|\Delta_{\beta_A}\|_1^2 \leq \tfrac{1}{2} \Delta'_{\beta_A} \widehat{\Gamma}^{(0)}_X \Delta_{\beta_A} \leq 6\sqrt{s^\star_A}\,\lambda_A \|\Delta_{\beta_A}\|_2, \]
so that
\[ \tfrac{1}{2}\alpha_{\mathrm{RSC}} \|\Delta_{\beta_A}\|_2^2 \leq \big( \alpha_{\mathrm{RSC}} - 16 s^\star_A \tau \big) \|\Delta_{\beta_A}\|_2^2 \leq 6\sqrt{s^\star_A}\,\lambda_A \|\Delta_{\beta_A}\|_2, \]
which implies $\|\Delta_{\beta_A}\|_2 \leq 12\sqrt{s^\star_A}\,\lambda_A/\alpha_{\mathrm{RSC}}$ and $\|\Delta_{\beta_A}\|_1 \leq 48\, s^\star_A \lambda_A/\alpha_{\mathrm{RSC}}$. It is easy to see that these bounds also hold for $|||\Delta_A|||_F$ and $\|\Delta_A\|_1$, respectively.

Next, to prove Theorem 4, we introduce the following two pairs of subspaces, $\{\mathcal{S}_\Theta, \mathcal{S}^\perp_\Theta\}$ and $\{\mathcal{R}_\Theta, \mathcal{R}^c_\Theta\}$, associated with a generic matrix $\Theta \in \mathbb{R}^{m \times m}$, in which the nuclear norm and the $\ell_1$-norm are decomposable, respectively (see Negahban et al., 2012). Specifically, let the singular value decomposition of $\Theta$ be $\Theta = U\Sigma V'$, with $U$ and $V$ orthogonal matrices. Let $r = \mathrm{rank}(\Theta)$, and use $U^r$ and $V^r$ to denote the first $r$ columns of $U$ and $V$, associated with the $r$ nonzero singular values of $\Theta$. Further, define
\[ \mathcal{S}_\Theta := \big\{ \Delta \in \mathbb{R}^{m \times m} \mid \mathrm{row}(\Delta) \subseteq V^r \text{ and } \mathrm{col}(\Delta) \subseteq U^r \big\}, \qquad \mathcal{S}^\perp_\Theta := \big\{ \Delta \in \mathbb{R}^{m \times m} \mid \mathrm{row}(\Delta) \perp V^r \text{ and } \mathrm{col}(\Delta) \perp U^r \big\}. \tag{36} \]
Then, for an arbitrary (generic) matrix $M \in \mathbb{R}^{m \times m}$, its restrictions to the subspaces $\mathcal{S}_\Theta$ and $\mathcal{S}^\perp_\Theta$, denoted by $M_{\mathcal{S}_\Theta}$ and $M_{\mathcal{S}^\perp_\Theta}$ respectively, are given by
\[ M_{\mathcal{S}_\Theta} = U \begin{pmatrix} \widetilde{M}_{11} & \widetilde{M}_{12} \\ \widetilde{M}_{21} & O \end{pmatrix} V' \qquad \text{and} \qquad M_{\mathcal{S}^\perp_\Theta} = U \begin{pmatrix} O & O \\ O & \widetilde{M}_{22} \end{pmatrix} V', \]
where $\widetilde{M}$ is defined and partitioned as
\[ \widetilde{M} = U'MV = \begin{pmatrix} \widetilde{M}_{11} & \widetilde{M}_{12} \\ \widetilde{M}_{21} & \widetilde{M}_{22} \end{pmatrix}, \quad \text{with } \widetilde{M}_{11} \in \mathbb{R}^{r \times r}. \]
Note that, by Lemma C.3, $M_{\mathcal{S}_\Theta} + M_{\mathcal{S}^\perp_\Theta} = M$. Moreover, when $\Theta$ is restricted to the subspace induced by itself, $\Theta_{\mathcal{S}_\Theta}$ (written $\Theta_{\mathcal{S}}$ for short), the following decomposition of the nuclear norm holds:
\[ |||\Theta|||_* = |||\Theta_{\mathcal{S}} + \Theta_{\mathcal{S}^\perp}|||_* = |||\Theta_{\mathcal{S}}|||_* + |||\Theta_{\mathcal{S}^\perp}|||_*. \]
Let $J(\Theta)$ be the set of indices at which $\Theta$ is nonzero. Analogously, we define
\[ \mathcal{R}_\Theta := \big\{ \Delta \in \mathbb{R}^{m \times m} \mid \Delta_{ij} = 0 \text{ for } (i,j) \notin J(\Theta) \big\}, \qquad \mathcal{R}^c_\Theta := \big\{ \Delta \in \mathbb{R}^{m \times m} \mid \Delta_{ij} = 0 \text{ for } (i,j) \in J(\Theta) \big\}. \tag{37} \]
Then, for an arbitrary matrix $M$, $M_{J_\Theta} \in \mathcal{R}_\Theta$ is obtained by setting the entries of $M$ whose indices are not in $J(\Theta)$ to 0, and $M_{J^c_\Theta} \in \mathcal{R}^c_\Theta$ is obtained by setting the entries of $M$ whose indices are in $J(\Theta)$ to 0, so that the following decomposition holds:
\[ \|M_{J_\Theta} + M_{J^c_\Theta}\|_1 = \|M_{J_\Theta}\|_1 + \|M_{J^c_\Theta}\|_1. \]

Proof of Theorem 4. Again, for ease of notation, in this proof we drop the superscript and use $(\widehat{B}, \widehat{C})$ to denote $(\widehat{B}^{(0)}, \widehat{C}^{(0)})$ whenever there is no ambiguity. Define $Q$ to be the weighted regularizer
\[ Q(B, C) = |||B|||_* + \tfrac{\lambda_C}{\lambda_B}\|C\|_1. \]
Note that $(B^\star, C^\star)$ is always feasible, and by the optimality of $(\widehat{B}, \widehat{C})$ the following inequality holds:
\[ \tfrac{1}{2T} |||\mathcal{Z}^T - \mathfrak{W}(\widehat{B}_{\mathrm{aug}} + \widehat{C}_{\mathrm{aug}})|||_F^2 + \lambda_B |||\widehat{B}|||_* + \lambda_C \|\widehat{C}\|_1 \leq \tfrac{1}{2T} |||\mathcal{Z}^T - \mathfrak{W}(B^\star_{\mathrm{aug}} + C^\star_{\mathrm{aug}})|||_F^2 + \lambda_B |||B^\star|||_* + \lambda_C \|C^\star\|_1. \]
Defining $\Delta_{B_{\mathrm{aug}}} = \widehat{B}_{\mathrm{aug}} - B^\star_{\mathrm{aug}} = [\Delta_B, O]$ and $\Delta_{C_{\mathrm{aug}}} = \widehat{C}_{\mathrm{aug}} - C^\star_{\mathrm{aug}} = [O, \Delta_C]$, we obtain the following basic inequality:
\[ \tfrac{1}{2T} |||\mathfrak{W}(\Delta_{B_{\mathrm{aug}}} + \Delta_{C_{\mathrm{aug}}})|||_F^2 \leq \tfrac{1}{T} \langle\langle \Delta_{B_{\mathrm{aug}}} + \Delta_{C_{\mathrm{aug}}}, \mathcal{W}'V \rangle\rangle + \lambda_B Q(B^\star, C^\star) - \lambda_B Q(\widehat{B}, \widehat{C}). \tag{38} \]
By Hölder's inequality and Lemma C.4, the right-hand side of (38) is bounded above by
\[ \tfrac{1}{T}\big( |||\Delta_{B,\mathcal{S}_{B^\star}}|||_* + |||\Delta_{B,\mathcal{S}^\perp_{B^\star}}|||_* \big) |||\mathcal{W}'V|||_{\mathrm{op}} + \tfrac{1}{T}\big( \|\Delta_{C,J_{C^\star}}\|_1 + \|\Delta_{C,J^c_{C^\star}}\|_1 \big) \|\mathcal{W}'V\|_\infty + \lambda_B Q(\Delta_{B,\mathcal{S}_{B^\star}}, \Delta_{C,J_{C^\star}}) - \lambda_B Q(\Delta_{B,\mathcal{S}^\perp_{B^\star}}, \Delta_{C,J^c_{C^\star}}). \tag{39} \]
With the specified choices of $\lambda_B$ and $\lambda_C$, after some algebra, (39) is further bounded by
\[ \tfrac{3}{2}\lambda_B Q(\Delta_{B,\mathcal{S}_{B^\star}}, \Delta_{C,J_{C^\star}}) - \tfrac{1}{2}\lambda_B Q(\Delta_{B,\mathcal{S}^\perp_{B^\star}}, \Delta_{C,J^c_{C^\star}}). \]
By Lemma C.5 (the RSC condition for $\mathfrak{W}$ with norm $\Phi$), combined with this upper bound, the triangle inequality $Q(\Delta_B, \Delta_C) \leq Q(\Delta_{B,\mathcal{S}_{B^\star}}, \Delta_{C,J_{C^\star}}) + Q(\Delta_{B,\mathcal{S}^\perp_{B^\star}}, \Delta_{C,J^c_{C^\star}})$ and the assumed bounds on the tolerance $\tau$, rearranging gives
\[ \tfrac{1}{2}\alpha_{\mathrm{RSC}} \big( |||\Delta_B|||_F^2 + |||\Delta_C|||_F^2 \big) \leq \tfrac{3}{2}\lambda_B Q(\Delta_{B,\mathcal{S}_{B^\star}}, \Delta_{C,J_{C^\star}}). \tag{40} \]
By Lemma C.3, applied with $N = B^\star$, $M_1 = \Delta_{B,\mathcal{S}_{B^\star}}$ and $M_2 = \Delta_{B,\mathcal{S}^\perp_{B^\star}}$, we get
\[ \mathrm{rank}(\Delta_{B,\mathcal{S}_{B^\star}}) \leq 2 r^\star_B \quad \text{and} \quad \langle\langle \Delta_{B,\mathcal{S}_{B^\star}}, \Delta_{B,\mathcal{S}^\perp_{B^\star}} \rangle\rangle = 0, \]
which implies $|||\Delta_{B,\mathcal{S}_{B^\star}}|||_* \leq \sqrt{2 r^\star_B}\,|||\Delta_{B,\mathcal{S}_{B^\star}}|||_F \leq \sqrt{2 r^\star_B}\,|||\Delta_B|||_F$. Since $\Delta_{C,J_{C^\star}}$ has at most $s^\star_C$ nonzero entries, it follows that $\|\Delta_{C,J_{C^\star}}\|_1 \leq \sqrt{s^\star_C}\,|||\Delta_{C,J_{C^\star}}|||_F \leq \sqrt{s^\star_C}\,|||\Delta_C|||_F$. Therefore,
\[ \lambda_B Q(\Delta_{B,\mathcal{S}_{B^\star}}, \Delta_{C,J_{C^\star}}) = \lambda_B |||\Delta_{B,\mathcal{S}_{B^\star}}|||_* + \lambda_C \|\Delta_{C,J_{C^\star}}\|_1 \leq \sqrt{2 r^\star_B}\,\lambda_B |||\Delta_B|||_F + \sqrt{s^\star_C}\,\lambda_C |||\Delta_C|||_F. \]
An application of the Cauchy–Schwarz inequality to (40) then yields
\[ \tfrac{1}{2}\alpha_{\mathrm{RSC}} \big( |||\Delta_B|||_F^2 + |||\Delta_C|||_F^2 \big) \leq \tfrac{3}{2}\sqrt{2 r^\star_B \lambda_B^2 + s^\star_C \lambda_C^2} \cdot \sqrt{|||\Delta_B|||_F^2 + |||\Delta_C|||_F^2}, \]
and we obtain the claimed bound
\[ |||\Delta_B|||_F + |||\Delta_C|||_F \leq 6\big( \sqrt{r^\star_B}\,\lambda_B + \sqrt{s^\star_C}\,\lambda_C \big)/\alpha_{\mathrm{RSC}}. \]

Proof of Theorem 1. At iteration 0, $\widehat{A}^{(0)}$ solves the following optimization problem:
\[ \widehat{A}^{(0)} = \operatorname*{argmin}_{A \in \mathbb{R}^{p_1 \times p_1}} \Big\{ \tfrac{1}{2T} |||\mathcal{X}^T - \mathcal{X}A'|||_F^2 + \lambda_A \|A\|_1 \Big\}. \]
By Theorem 3, its error bound is given by $\|\widehat{A}^{(0)} - A^\star\|_1 \leq 48\, s^\star_A \lambda_A/\alpha_{\mathrm{RSC}}$, provided that $\widehat{\Gamma}^{(0)}_X = I_{p_1} \otimes \mathcal{X}'\mathcal{X}/T$ satisfies the RSC condition and the regularization parameter satisfies $\lambda_A \geq 2\|\mathcal{X}'U/T\|_\infty$. For random realizations $\mathcal{X}$ and $U$, by Lemmas B.1 and B.2, there exist constants $c_i$ and $c'_i$ such that for sample size $T \gtrsim s^\star_A \log p_1$, with probability at least $1 - c_1\exp(-c_2 T\min\{1, \omega^{-2}\})$, where $\omega = c_3\,\mu_{\max}(A^\star)/\mu_{\min}(A^\star)$,
\[ (E_1) \quad \widehat{\Gamma}^{(0)}_X \text{ satisfies the RSC condition with } \alpha_{\mathrm{RSC}} = \Lambda_{\min}(\Sigma^\star_u)/(2\mu_{\max}(A^\star)), \]
and, with probability at least $1 - c'_1\exp(-c'_2\log p_1)$,
\[ (E_2) \quad \|\mathcal{X}'U/T\|_\infty \leq C_0\sqrt{\tfrac{\log p_1}{T}}, \text{ for some constant } C_0. \]
Hence, with probability at least $1 - c_1\exp(-c_2 T) - c'_1\exp(-c'_2\log p_1)$,
\[ \|\widehat{A}^{(0)} - A^\star\|_1 = O\Big( s^\star_A \sqrt{\tfrac{\log p_1}{T}} \Big). \]
Moving on to $\widehat{\Omega}^{(0)}_u$, which is given by
\[ \widehat{\Omega}^{(0)}_u = \operatorname*{argmin}_{\Omega_u \in \mathbb{S}^{p_1 \times p_1}_{++}} \Big\{ -\log\det\Omega_u + \mathrm{trace}\big( \widehat{S}^{(0)}_u \Omega_u \big) + \rho_u \|\Omega_u\|_{1,\text{off}} \Big\}, \]
where $\widehat{S}^{(0)}_u = \tfrac{1}{T}(\mathcal{X}^T - \mathcal{X}\widehat{A}^{(0)\prime})'(\mathcal{X}^T - \mathcal{X}\widehat{A}^{(0)\prime})$. By Theorem 1 in Ravikumar et al. (2011), the error bound for $\widehat{\Omega}^{(0)}_u$ relies on how well $\widehat{S}^{(0)}_u$ concentrates around $\Sigma^\star_u$, more specifically on $\|\widehat{S}^{(0)}_u - \Sigma^\star_u\|_\infty$. Note that
\[ \|\widehat{S}^{(0)}_u - \Sigma^\star_u\|_\infty \leq \|S_u - \Sigma^\star_u\|_\infty + \|\widehat{S}^{(0)}_u - S_u\|_\infty, \]
where $S_u = U'U/T$ is the sample covariance matrix based on the true errors.
For the first term, by Ravikumar et al. (2011), there exists a constant $\tau_1 > 2$ such that, with probability at least $1 - 1/p_1^{\tau_1 - 2} = 1 - \exp(-(\tau_1 - 2)\log p_1)$,
\[ (E_3) \quad \|S_u - \Sigma^\star_u\|_\infty \leq C_1\sqrt{\tfrac{\log p_1}{T}}, \text{ for some constant } C_1. \]
For the second term,
\[ \widehat{S}^{(0)}_u - S_u = \tfrac{2}{T}\, U'\mathcal{X}(A^\star - \widehat{A}^{(0)})' + (A^\star - \widehat{A}^{(0)})\Big( \tfrac{\mathcal{X}'\mathcal{X}}{T} \Big)(A^\star - \widehat{A}^{(0)})' := I_1 + I_2. \]
For $I_1$, based on the bounds for $\|A^\star - \widehat{A}^{(0)}\|_1$ and $\|\mathcal{X}'U/T\|_\infty$,
\[ \|I_1\|_\infty \leq 2\|A^\star - \widehat{A}^{(0)}\|_1 \big\|\tfrac{1}{T}\mathcal{X}'U\big\|_\infty = O\Big( s^\star_A \tfrac{\log p_1}{T} \Big). \]
For $I_2$,
\[ \|I_2\|_\infty \leq \|A^\star - \widehat{A}^{(0)}\|_1^2 \,\big\|\tfrac{\mathcal{X}'\mathcal{X}}{T}\big\|_\infty, \]
where, by Proposition 2.4 in Basu and Michailidis (2015) followed by a union bound, with probability at least $1 - c''_1\exp(-c''_2\log p_1)$ ($c''_1, c''_2 > 0$),
\[ (E_4) \quad \big\|\tfrac{\mathcal{X}'\mathcal{X}}{T}\big\|_\infty \leq C_2\sqrt{\tfrac{\log p_1}{T}} + \Lambda_{\max}(\Gamma_X), \text{ for some constant } C_2. \]
Hence,
\[ \|I_2\|_\infty = O\Big( (s^\star_A)^2 \big(\tfrac{\log p_1}{T}\big)^{3/2} \Big) + O\Big( (s^\star_A)^2 \tfrac{\log p_1}{T} \Big). \]
Combining all terms, and since we assume that $T^{-1}\log p_1$ is small, $O(\sqrt{T^{-1}\log p_1})$ becomes the leading term, and the following bounds hold with probability at least $1 - c_1\exp(-c_2 T) - c'_1\exp(-c'_2\log p_1) - c''_1\exp(-c''_2\log p_1) - \exp(-(\tau_1 - 2)\log p_1)$:
\[ \|\widehat{S}^{(0)}_u - \Sigma^\star_u\|_\infty = O\Big( \sqrt{\tfrac{\log p_1}{T}} \Big) \quad \text{and consequently} \quad \|\widehat{\Omega}^{(0)}_u - \Omega^\star_u\|_\infty = O\Big( \sqrt{\tfrac{\log p_1}{T}} \Big). \]

At iteration 1, the vectorized $\widehat{A}^{(1)}$ solves
\[ \widehat{\beta}^{(1)}_A = \operatorname*{argmin}_{\beta \in \mathbb{R}^{p_1^2}} \big\{ -\beta'\widehat{\gamma}^{(1)}_X + \tfrac{1}{2}\beta'\widehat{\Gamma}^{(1)}_X\beta + \lambda_A\|\beta\|_1 \big\}, \quad \text{where} \quad \widehat{\gamma}^{(1)}_X = \tfrac{1}{T}\big( \widehat{\Omega}^{(0)}_u \otimes \mathcal{X}' \big)\mathrm{vec}(\mathcal{X}^T), \qquad \widehat{\Gamma}^{(1)}_X = \widehat{\Omega}^{(0)}_u \otimes \tfrac{\mathcal{X}'\mathcal{X}}{T}. \]
The error bound for $\widehat{\beta}^{(1)}_A$ relies on: (1) $\widehat{\Gamma}^{(1)}_X$ satisfying the RSC condition, which holds for sample size $T \gtrsim (d^{\max}_{\Omega^\star_u})^2\log p_1$, given that $\|\widehat{\Omega}^{(0)}_u - \Omega^\star_u\|_\infty = O(\sqrt{T^{-1}\log p_1})$; and (2) a bound for $\|\mathcal{X}'U\widehat{\Omega}^{(0)}_u/T\|_\infty$. For the latter, write
\[ \tfrac{1}{T}\mathcal{X}'U\widehat{\Omega}^{(0)}_u = \tfrac{1}{T}\mathcal{X}'U\Omega^\star_u + \tfrac{1}{T}\mathcal{X}'U(\widehat{\Omega}^{(0)}_u - \Omega^\star_u) := I_1 + I_2. \]
For $I_1$, by Lemma 3 in Lin et al. (2016), with the aid of Proposition 2.4 in Basu and Michailidis (2015), with probability at least $1 - c'''_1\exp(-c'''_2\log p_1)$ we get
\[ (E_5) \quad \big\|\tfrac{1}{T}\mathcal{X}'U\Omega^\star_u\big\|_\infty \leq C_3\sqrt{\tfrac{\log p_1}{T}}, \text{ for some constant } C_3. \]
For $I_2$, by Corollary 3 in Ravikumar et al. (2011), we get
\[ \big\|\tfrac{1}{T}\mathcal{X}'U(\widehat{\Omega}^{(0)}_u - \Omega^\star_u)\big\|_\infty \leq d^{\max}_{\Omega^\star_u} \big\|\tfrac{1}{T}\mathcal{X}'U\big\|_\infty \big\|\widehat{\Omega}^{(0)}_u - \Omega^\star_u\big\|_\infty = O\Big( \tfrac{\log p_1}{T} \Big). \]
Combining all terms and keeping the leading one, once again we have
\[ \|\widehat{A}^{(1)} - A^\star\|_1 = O\Big( s^\star_A\sqrt{\tfrac{\log p_1}{T}} \Big), \]
which holds with probability at least $1 - c_1\exp(-c_2 T) - \tilde{c}_1\exp(-\tilde{c}_2\log p_1) - \exp(-(\tau_1 - 2)\log p_1)$, by letting $\tilde{c}_1 = \max\{c'_1, c''_1, c'''_1\}$ and $\tilde{c}_2 = \min\{c'_2, c''_2, c'''_2\}$. It should be noted that up to this step, all sources of randomness from the random realizations have been captured by the events $E_1$ to $E_5$; thus, for $\widehat{\Omega}^{(1)}_u$ and the iterations thereafter, the probability with which the bounds hold no longer changes, and the same is true for the error bounds of $\widehat{A}^{(k)}$ and $\widehat{\Omega}^{(k)}_u$ in terms of their order relative to the dimension $p_1$ and the sample size $T$. Therefore, we conclude that, with high probability, for all iterations $k$,
\[ \|\mathcal{X}'U\widehat{\Omega}^{(k)}_u/T\|_\infty = O\Big( \sqrt{\tfrac{\log p_1}{T}} \Big), \qquad \|\widehat{S}^{(k)}_u - \Sigma^\star_u\|_\infty = O\Big( \sqrt{\tfrac{\log p_1}{T}} \Big). \]
With the aid of Theorem 3, it then follows that
\[ |||\widehat{A}^{(k)} - A^\star|||_F = O\Big( \sqrt{\tfrac{s^\star_A\log p_1}{T}} \Big), \qquad |||\widehat{\Omega}^{(k)}_u - \Omega^\star_u|||_F = O\Big( \sqrt{\tfrac{(s_{\Omega^\star_u} + p_1)\log p_1}{T}} \Big). \]

Proof of Theorem 2. At iteration 0, $(\widehat{B}^{(0)}, \widehat{C}^{(0)})$ solves the following optimization problem:
\[ (\widehat{B}^{(0)}, \widehat{C}^{(0)}) = \operatorname*{argmin}_{(B,C)} \Big\{ \tfrac{1}{2T} |||\mathcal{Z}^T - \mathcal{X}B' - \mathcal{Z}C'|||_F^2 + \lambda_B |||B|||_* + \lambda_C \|C\|_1 \Big\}. \]
Let $W_t = (X'_t, Z'_t)' \in \mathbb{R}^{p_1 + p_2}$ be the joint process and $\mathcal{W}$ its realizations, with the operator $\mathfrak{W}$ defined identically to that in Theorem 4. By Theorem 4,
\[ |||\widehat{B}^{(0)} - B^\star|||_F + |||\widehat{C}^{(0)} - C^\star|||_F \leq 6\big( \sqrt{r^\star_B}\,\lambda_B + \sqrt{s^\star_C}\,\lambda_C \big)/\alpha_{\mathrm{RSC}}, \]
provided that $\mathfrak{W}$ satisfies the RSC condition and $\lambda_B$, $\lambda_C$ respectively satisfy $\lambda_B \geq 2|||\mathcal{W}'V/T|||_{\mathrm{op}}$ and $\lambda_C \geq 2\|\mathcal{W}'V/T\|_\infty$.
34n particular, by Lemma B.3 for random realizations of X , Z and V , for sample size T (cid:37) c ( p + 2 p ), withprobability at least 1 − c exp {− c ( p + p ) } ,( E (cid:48) ) W satisfies the RSC condition . By Lemma B.4, for sample size T (cid:37) ( p + 2 p ) and some constant C , C > E (cid:48) ) |||W (cid:48) V /T ||| op ≤ C (cid:114) p + 2 p T and ||W (cid:48) V /T || ∞ ≤ C (cid:114) log( p + p ) + log p T , with probability at least 1 − c (cid:48) exp {− c (cid:48) ( p + 2 p ) } and 1 − c (cid:48)(cid:48) exp {− c (cid:48)(cid:48) log[ p ( p + p )] } , respectively. Hence, withprobability at least1 − c exp {− c ( p + p ) } − c (cid:48) exp {− c (cid:48) ( p + 2 p ) } − c (cid:48)(cid:48) exp {− c (cid:48)(cid:48) log[ p ( p + p )] } , the following bound holds for the initializers as long as sample size T (cid:37) ( p + 2 p ): ||| (cid:98) B (0) − B (cid:63) ||| F + ||| (cid:98) C (0) − C (cid:63) ||| F = O (cid:16) p +2 p T (cid:17) + O (cid:16) log( p + p )+log p T (cid:17) . (41)Considering the estimation of (cid:98) Ω (0) v , it solves a graphical Lasso problem: (cid:98) Ω (0) v = argmin Ω v ∈ S p × p (cid:110) log det Ω v − trace (cid:0) (cid:98) S (0) u Ω v (cid:1) + ρ v (cid:107) Ω v (cid:107) , off (cid:111) , where (cid:98) S (0) v = T ( Z T − X (cid:98) B (0) (cid:48) − Z (cid:98) C (0) (cid:48) ) (cid:48) ( Z T − X (cid:98) B (0) (cid:48) − Z (cid:98) C (0) (cid:48) ). Similar to the proof of Theorem 1, the errorbound for (cid:98) Ω (0) v depends on (cid:107) (cid:98) S (0) v − Σ (cid:63)v (cid:107) ∞ , which can be decomposed as (cid:107) (cid:98) S (0) v − Σ (cid:63)v (cid:107) ∞ ≤ (cid:107) S v − Σ (cid:63)v (cid:107) ∞ + (cid:107) (cid:98) S (0) v − S v (cid:107) ∞ , where S v = V (cid:48) V /T is the sample covariance based on the true errors. For the first term, by Lemma 1 in Ravikumaret al. (2011), there exists constant τ > − /p τ − = 1 − exp( − τ log p ) ( τ > E (cid:48) ) (cid:107) S v − Σ (cid:63)v (cid:107) ∞ ≤ C (cid:114) log p T , for some constant C .For the second term, let Π = [ B, C ] ∈ R p × ( p + p ) , then (cid:98) S (0) v − S v = 2 T V (cid:48) W (Π (cid:63) − (cid:98) Π (0) ) (cid:48) + (Π (cid:63) − (cid:98) Π (0) ) (cid:16) W (cid:48) W T (cid:17) (Π (cid:63) − (cid:98) Π (0) ) (cid:48) := I + I , For I , we have || T V (cid:48) W (Π (cid:63) − (cid:98) Π (0) ) (cid:48) || ∞ ≤ || T V (cid:48) W (Π (cid:63) − (cid:98) Π (0) ) (cid:48) || F ≤ ||| T W (cid:48) V||| op ||| Π (cid:63) − (cid:98) Π (0) ||| F . Consider the leading term of ||| Π (cid:63) − (cid:98) Π (0) ||| F as in (41), whose rate is O ( (cid:112) T − ( p + 2 p )). We therefore obtain (cid:107) I (cid:107) ∞ ≤ (cid:107) I (cid:107) F = O (cid:16) p +2 p T (cid:17) . Similarly for I , (cid:107) I (cid:107) ∞ ≤ (cid:107) I (cid:107) F ≤ ||| Π (cid:63) − (cid:98) Π (0) ||| F ||| W (cid:48) W T ||| op , where with a similar derivation to that in Lemma C.6, for sample size T (cid:37) ( p + p ), with probability at least1 − c (cid:48)(cid:48)(cid:48) exp {− c (cid:48)(cid:48)(cid:48) ( p + p ) } , we get( E (cid:48) ) ||| W (cid:48) W T ||| op ≤ C (cid:114) p + 2 p T + Λ max (Γ X ) , for some constant C .Hence, (cid:107) I (cid:107) ∞ ≤ (cid:107) I (cid:107) F ≤ = O (cid:16)(cid:0) p + 2 p T (cid:1) / (cid:17) . 
− c exp {− c ( p + p ) } − c (cid:48) exp {− c (cid:48) ( p + 2 p ) } − c (cid:48)(cid:48) exp {− c (cid:48)(cid:48) log[ p ( p + p )] }− c (cid:48)(cid:48)(cid:48) exp {− c (cid:48)(cid:48)(cid:48) ( p + p ) } − exp( − τ log p ) , we obtain (cid:107) (cid:98) S (0) v − Σ (cid:63)v (cid:107) ∞ = O (cid:16)(cid:113) p +2 p T (cid:17) . Note that here with the required sample size, ( p + 2 p ) /T is a small quantity, and therefore O (cid:16)(cid:0) p +2 p T (cid:1) / (cid:17) ≤ O (cid:16) p +2 p T (cid:17) ≤ O (cid:16)(cid:113) p +2 p T (cid:17) . At iteration 1, the bound of ||| (cid:98) B (1) − B (cid:63) ||| F + ||| (cid:98) C (1) − C (cid:63) ||| F relies on the following two quantities: (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V (cid:98) Ω (0) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) op and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V (cid:98) Ω (0) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ . Using a similar derivation to that in the proof of Theorem 1, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V (cid:98) Ω (0) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V ( (cid:98) Ω (0) v − Ω (cid:63)v ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ + (cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V Ω (cid:63)v (cid:12)(cid:12)(cid:12)(cid:12) ∞ , (42)where by viewing V Ω (cid:63)v as some random realization coming from a certain sub-Gaussian process, with probabilityat least 1 − ¯ c (cid:48)(cid:48) exp {− ¯ c (cid:48)(cid:48) log[ p ( p + p )] } , we get( E (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V Ω (cid:63)v (cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ C (cid:114) log( p + p ) + log p T , for some constant C , and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V ( (cid:98) Ω (0) v − Ω (cid:63)v ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ d Ω (cid:63)v max || T W (cid:48) V|| ∞ || (cid:98) Ω (0) v − Ω (cid:63)v || ∞ = O (cid:16)(cid:113) log( p + p )+log p T (cid:17) · O (cid:16)(cid:113) p +2 p T (cid:17) . For ||| T W (cid:48) V (cid:98) Ω (0) v ||| op , similarly we have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V (cid:98) Ω (0) v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) op ≤ ||| T W (cid:48) V ( (cid:98) Ω (0) v − Ω (cid:63)v ) ||| op + ||| T W (cid:48) V Ω (cid:63)v ||| op , (43)where with probability at least 1 − ¯ c (cid:48) exp {− ¯ c (cid:48) ( p + p ) } ,( E (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V Ω (cid:63)v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) op ≤ C (cid:114) p + 2 p T for some constant C , and (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T W (cid:48) V ( (cid:98) Ω (0) v − Ω (cid:63)v ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) op ≤ ||| T W (cid:48) V||| op ||| (cid:98) Ω (0) v − Ω (cid:63)v ||| op ≤ ||| T W (cid:48) V||| op (cid:104) d Ω (cid:63)v max || (cid:98) Ω (0) v − Ω (cid:63)v || ∞ (cid:105) = O (cid:16) p +2 p T (cid:17) , where the second inequality follows from Corollary 3 of Ravikumar et al. (2011). 
Combining all terms from (42)and (43), the leading term gives the following bound: ||| (cid:98) B (1) − B (cid:63) ||| F + ||| (cid:98) C (1) − C (cid:63) ||| F ≤ C (cid:16) p + 2 p T (cid:17) for some constant C , and this error rate coincides with that in the bound of ||| (cid:98) B (0) − B (cid:63) ||| F + ||| (cid:98) C (0) − C (cid:63) ||| F . This implies that for (cid:98) Ω (1) v and iterations thereafter, the error rate remains unchanged. Moreover, all sources of randomness have been36aptured up to this step in events E (cid:48) to E (cid:48) , and therefore the probability for the bounds to hold no longer changes.Consequently, the following bounds hold for all iterations k : (cid:107)W (cid:48) V (cid:98) Ω ( k ) v /T (cid:107) ∞ = |||W (cid:48) V (cid:98) Ω ( k ) v /T ||| op = O (cid:16)(cid:113) p +2 p T (cid:17) and (cid:107) (cid:98) S ( k ) v − Σ (cid:63)v (cid:107) ∞ = O (cid:16)(cid:113) p +2 p T (cid:17) , with probability at least1 − c exp {− ˜ c ( p + p ) } − c exp {− ˜ c ( p + 2 p ) } − c exp {− ˜ c log[ p ( p + p )] } − exp {− τ log p } . for some new positive constants c i , ˜ c i ( i = 0 , ,
2) and τ . The above bounds directly imply the bound in thestatement in Theorem 2, with the aid of Theorem 4.
B Key Lemmas and Their Proofs.
In this section, we verify the conditions that are required for establishing the consistency results in Theorem 3and 4, under random realizations of X , Z , U and V .The following two lemmas verify the conditions for establishing the consistency properties for (cid:98) A (0) . Specifically,Lemma B.1 establishes that with high probability, X satisfies the RSC condition. Further, Lemma B.2 gives ahigh probability upper bound for (cid:107)X (cid:48) U /T (cid:107) ∞ for random X and U . Lemma B.1 (Verification of the RSC condition) . For the VAR (1) model { X t } posited in (1) , there exist c i > i = 1 , , such that for sample size T (cid:37) max { ω , } s (cid:63)A log p , with probability at least − c exp (cid:2) − c T min { , ω − } (cid:3) , ω = c Λ max (Σ u ) µ max ( A )Λ min (Σ u ) µ min ( A ) , the following inequality holds T ||| X (∆) ||| F ≥ α RSC ||| ∆ ||| F − τ || ∆ || , for ∆ ∈ R p × p , where α RSC = Λ min (Σ u ) µ max ( A ) , τ = 4 α RSC max { ω , } log p /T . Proof of Lemma B.1 . For the specific VAR(1) process { X t } given in (1), using Proposition 4.2 in Basu andMichailidis (2015) with d = 1 directly gives the result. Specifically, we note that by letting θ = vec(∆), T ||| X (∆) ||| F = θ (cid:48) (cid:98) Γ (0) X θ, where (cid:98) Γ (0) X = I p ⊗ ( X (cid:48) X /T ), and (cid:107) θ (cid:107) = ||| ∆ ||| F , (cid:107) θ (cid:107) = || ∆ || . Lemma B.2 (Verification of the deviation bound) . For the model in (1) , there exist constants c i > , i = 0 , , such that for T (cid:37) p , with probability at least − c exp( − c log p ) , the following bound holds: (cid:107)X (cid:48) U /T (cid:107) ∞ ≤ c Λ max (Σ u ) (cid:20) µ min ( A ) + µ max ( A ) µ min ( A ) (cid:21) (cid:114) p T . (44)
Proof of Lemma B.2 . First, we note that, ||X (cid:48) U /T || ∞ = max ≤ i ≤ p ≤ j ≤ p | e (cid:48) i ( X (cid:48) U /T ) e j | . Applying Proposition 2.4(b) in Basu and Michailidis (2015) for an arbitrary pair of ( i, j ) gives: P (cid:18) | e (cid:48) i ( X (cid:48) U /T ) e j | > η (cid:20) Λ max (Σ u ) (cid:18) µ min ( A ) + µ max ( A ) µ min ( A ) (cid:19)(cid:21)(cid:19) ≤ − cT min { η, η } ] . Here we slightly abuse the notations and redefine c := max { c , c (cid:48)(cid:48)(cid:48) } , c := max { c (cid:48) , ¯ c (cid:48) } , ˜ c := min { c (cid:48) , ¯ c (cid:48) } , c = max { c (cid:48)(cid:48) , ¯ c (cid:48)(cid:48) } ,˜ c := min { c (cid:48)(cid:48) , ¯ c (cid:48)(cid:48) } . η = c (cid:112) p /T and taking a union bound over all 1 ≤ i ≤ p , ≤ j ≤ p , we get that for some c , c > − c exp[ − c log p ],max ≤ i ≤ p ≤ j ≤ p | e (cid:48) i ( X (cid:48) U /T ) e j | ≤ c Λ max (Σ u ) (cid:20) µ min ( A ) + µ max ( A ) µ min ( A ) (cid:21) (cid:114) p T .
In the next two lemmas, Lemma B.3 gives an RSC curvature that holds with high probability for W inducedby a random W , and Lemma B.4 gives a high probability upper bound for ||| W (cid:48) V /T ||| op and || W (cid:48) V /T || ∞ . Lemma B.3 (Verification of the RSC condition) . Consider the covariance stationary process W t = ( X (cid:48) t , Z (cid:48) t ) (cid:48) ∈ R p + p whose spectral density exists. Suppose m ( f W ) > . There exist constants c i > , i = 1 , , such that withprobability at least − c exp( − c ( p + p )) , the RSC condition for W induced by a random W holds for α RSC andtolerance 0, where α RSC = π m ( f W ) / , whenever T (cid:37) c ( p + p ) . Proof of Lemma B.3 . First ,we note that the following inequality holds, for any W :12 T ||| W (∆) ||| F = 12 T |||W (cid:48) ∆ ||| F = 12 T p (cid:88) j =1 || [ W (cid:48) ∆] j || ≥
12 Λ min (cid:16)(cid:98) Γ (0) W (cid:17) ||| ∆ ||| F . (45)where (cid:98) Γ (0) W = W (cid:48) W /T . Applying Lemma 4 in Negahban and Wainwright (2011) on W together with Proposition2.3 in Basu and Michailidis (2015), the following bound holds with probability at least 1 − c exp[ − c ( p + p )],as long as T (cid:37) c ( p + p ): Λ min (cid:16)(cid:98) Γ (0) W (cid:17) ≥ Λ min (Γ W (0))4 ≥ π m ( f W ) , where Γ W (0) = E W t W (cid:48) t . Combining with (45), the RSC condition holds with κ ( W ) = π m ( f W ) / Lemma B.4 (Verification of the deviation bound) . There exist constants c i > and c (cid:48) i > , i = 1 , , such thatthe following statements hold:(a) With probability at least − c exp[ − c ( p + 2 p )] , as long as T (cid:37) c ( p + 2 p ) , |||W (cid:48) V /T ||| op ≤ c [ M ( f W ) + Λ max (Σ v ) + M ( f W,V )] (cid:114) p + 2 p T . (46) (b) With probability at least − c (cid:48) exp( − c (cid:48) log( p + p ) − c (cid:48) log p ) , as long as T (cid:37) c (cid:48) log[( p + p ) p ] , ||W (cid:48) V /T || ∞ ≤ c (cid:48) [ M ( f W ) + Λ max (Σ v ) + M ( f W,V )] (cid:114) log( p + p ) + log p T . (47)
Proof of Lemma B.4 . (a) is a direct application of Lemma C.6 on processes { W t } ∈ R ( p + p ) and { V t } ∈ R p ,and (b) is a direct application of Lemma B.2. C Auxiliary Lemmas and Their Proofs.
Lemma C.1.
Let (cid:98) β be the solution to the following optimization problem: (cid:98) β = argmin β ∈ R p (cid:110) n || Y − Xβ || + λ n (cid:107) β (cid:107) (cid:111) , where the data is generated according to Y = Xβ (cid:63) + E with X ∈ R n × p and E ∈ R n . The errors E i are assumedto be i.i.d. for all i = 1 , · · · , n . Then, by choosing λ n ≥ (cid:107) X (cid:48) E/n (cid:107) ∞ , the error vector ∆ := (cid:98) β − β (cid:63) satisfies (cid:107) ∆ J (cid:107) ≤ (cid:107) ∆ J c (cid:107) . roof. Note that β (cid:63) is always feasible. By the optimality of (cid:98) β , we have12 n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Y − X (cid:98) β (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + λ n (cid:107) (cid:98) β (cid:107) ≤ n || Y − Xβ (cid:63) || + λ n (cid:107) β (cid:63) (cid:107) which after some algebra, can be simplified to12 ∆ (cid:48) (cid:18) X (cid:48) Xn (cid:19) ∆ ≤ (cid:104) ∆ (cid:48) , n X (cid:48) E (cid:105) + λ n (cid:107) β (cid:63) (cid:107) − λ n (cid:107) β (cid:63) + ∆ (cid:107) ≤ (cid:107) ∆ (cid:107) (cid:107) n X (cid:48) E (cid:107) ∞ + λ n (cid:107) β (cid:63) (cid:107) − λ n (cid:107) β (cid:63) + ∆ (cid:107) , with the second inequality obtained through an application of from H¨older’s inequality. With the specified choiceof λ n , if follows that0 ≤ λ n (cid:107) ∆ (cid:107) + λ n (cid:8) (cid:107) β (cid:63) J (cid:107) − (cid:107) β (cid:63) J + ∆ J + β (cid:63) J c + ∆ J c (cid:107) (cid:9) = λ n (cid:107) ∆ J (cid:107) + (cid:107) ∆ J c (cid:107) ) + λ n {(cid:107) β (cid:63) J (cid:107) − (cid:107) β (cid:63) J + ∆ J (cid:107) − (cid:107) ∆ J c (cid:107) } ( β (cid:63) J c = 0, (cid:96) norm decomposable) ≤ λ n (cid:107) ∆ J (cid:107) + (cid:107) ∆ J c (cid:107) ) + λ n ( (cid:107) ∆ J (cid:107) − (cid:107) ∆ J c (cid:107) ) (triangle inequality)= 3 λ n (cid:107) ∆ J (cid:107) − λ n (cid:107) ∆ J c (cid:107) , which implies (cid:107) ∆ J (cid:107) ≤ (cid:107) ∆ J c (cid:107) . Lemma C.2.
Consider two centered stationary Gaussian processes { X t } and { Z t } . Further, assume that thespectral density of the joint process { ( X (cid:48) t , Z (cid:48) t ) (cid:48) } exists. Denote their cross-covariance by Γ X,Z ( (cid:96) ) := Cov( X t , Z t + (cid:96) ) ,and their cross-spectral density is defined as f X,Z ( θ ) := 12 π ∞ (cid:88) (cid:96) = −∞ Γ X,Z ( (cid:96) ) e − i(cid:96)θ , θ ∈ [ − π, π ] , whose upper extreme is given by: M ( f X,Z ) = esssup θ ∈ [ − π,π ] (cid:114) Λ max (cid:16) f ∗ X,Z ( θ ) f X,Z ( θ ) (cid:17) . Let X and Z be data matrices with sample size n . Then, there exists a constant c > , such that for any u, v ∈ R p with (cid:107) u (cid:107) ≤ , (cid:107) v (cid:107) ≤ , we have P (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) u (cid:48) (cid:18) X (cid:48) ZT − Cov( X t , Z t ) (cid:19) v (cid:12)(cid:12)(cid:12)(cid:12) > π ( M ( f X ) + M ( f Z ) + M ( f X,Z )) η (cid:21) ≤ (cid:0) − cT min { η, η } (cid:1) . Proof.
Let ξ t = (cid:104) u, X t (cid:105) , η t = (cid:104) v, Z t (cid:105) . Let f X ( θ ) , f Z ( θ ) denote the spectral density of { X t } and { Z t } , respectively.Then, the spectral density of { ξ t } and { η t } , respectively, is f ξ ( θ ) = u (cid:48) f X ( θ ) u , f η ( θ ) = v (cid:48) f Z ( θ ) v . Also, we note that M ( f ξ ) ≤ M ( f X ), M ( f η ) ≤ M ( f Z ). Then,2 T (cid:34) T (cid:88) t =0 ξ t η t − Cov( ξ t , η t ) (cid:35) = (cid:34) T T (cid:88) t =0 ( ξ t + η t ) − Var( ξ t + η t ) (cid:35) − (cid:34) T T (cid:88) t =0 ( ξ t ) − Var( ξ t ) (cid:35) − (cid:34) T T (cid:88) t =0 ( η t ) − Var( η t ) (cid:35) . (48)By Proposition 2.7 in Basu and Michailidis (2015), P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T T (cid:88) t =0 ( ξ t ) − Var( ξ t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > π M ( f X ) η (cid:33) ≥ (cid:2) − cn min( η, η ) (cid:3) , P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T T (cid:88) t =0 ( η t ) − Var( η t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > π M ( f Z ) η (cid:33) ≥ (cid:2) − cn min( η, η ) (cid:3) . What remains to be considered is the first term in (48), whose spectral density is given by f ξ + η ( θ ) = u (cid:48) f X ( θ ) u + v (cid:48) f Z ( θ ) z + u (cid:48) f X,Z ( θ ) v + v (cid:48) f ∗ X,Z ( θ ) u, and its upper extreme satisfies M ( f ξ + η ) ≤ M ( f X ) + M ( f Z ) + 2 M ( f X,Z ) . Hence, we get: P (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T T (cid:88) t =0 ( ξ t + η t ) − Var( ξ t + η t ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) > π [ M ( f X ) + M ( f Z ) + 2 M ( f X,Z )] η (cid:33) ≥ − cn min( η, η )] . Combining all three terms yields the desired result.
Lemma C.3.
Let N and M be matrices of the same dimension. Then, there exists a decomposition M = M + M ,such that(a) rank ( M ) ≤ rank ( N ) ;(b) (cid:104)(cid:104) M , M (cid:105)(cid:105) = 0 .Proof. Let the SVD of N be given by N = U Σ V (cid:48) , where both U and V are orthogonal matrices and assumerank( N ) = r . Define (cid:102) M and do partition it as follows: (cid:102) M = U (cid:48) M V = (cid:34) (cid:102) M (cid:102) M (cid:102) M (cid:102) M (cid:35) . Next, let M = U (cid:34) (cid:102) M (cid:102) M (cid:102) M O (cid:35) V (cid:48) and M = U (cid:20) O OO (cid:102) M (cid:21) V (cid:48) , (cid:102) M ∈ R r × r . Then, M + M = M andrank( M ) ≤ rank (cid:18) U (cid:20) (cid:102) M (cid:102) M O O (cid:21) V (cid:48) (cid:19) + rank (cid:32) U (cid:34) (cid:102) M O (cid:102) M O (cid:35) V (cid:48) (cid:33) ≤ r. Moreover, (cid:104)(cid:104) M , M (cid:105)(cid:105) = tr [ M M (cid:48) ] = 0 . Lemma C.4.
Define the error matrix by ∆ B = (cid:98) B − B (cid:63) and ∆ C = (cid:98) C − C (cid:63) , and let the weighted regularizer Q bedefined as Q ( B, C ) = ||| B ||| ∗ + λ C λ B || C || . With the subspaces defined in (36) and (37), the following inequality holds: Q ( B (cid:63) , C (cid:63) ) − Q ( (cid:98) B, (cid:98) C ) ≤ Q (∆ B S B(cid:63) , ∆ C J C(cid:63) ) − Q (∆ B S ⊥ B(cid:63) , ∆ C J cC(cid:63) ) . Proof.
First, from definitions (36) and (37), we know that B (cid:63) S ⊥ = 0 and C (cid:63) J cC(cid:63) = 0. Using the definition of Q , weobtain Q ( B (cid:63) , C (cid:63) ) = ||| B (cid:63) S + B (cid:63) S ⊥ ||| ∗ + λ C λ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C + C (cid:63) J cC(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = ||| B (cid:63) S ||| ∗ + λ C λ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , Q ( (cid:98) B, (cid:98) C ) = Q ( B (cid:63) + ∆ B , C (cid:63) + ∆ C )= (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) B (cid:63) S + ∆ B S ⊥ B(cid:63) + ∆ B S B(cid:63) + B (cid:63) S ⊥ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ + λ C λ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C + ∆ C J C(cid:63) + C (cid:63) J cC(cid:63) + ∆ C J cC(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≥ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) B (cid:63) S + ∆ B S ⊥ B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ + λ C λ B (cid:16)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C + ∆ C J C(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J cC(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:17) ≥ ||| B (cid:63) S ||| ∗ + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S ⊥ B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ + λ C λ B (cid:16)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J C(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J cC(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:17) . The decomposition of the first term comes from the construction of ∆ B S ⊥ B(cid:63) . 
It then follows that Q ( B (cid:63) , C (cid:63) ) − Q ( (cid:98) B, (cid:98) C ) ≤ λ C λ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S ⊥ B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ + λ C λ B (cid:16)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J C(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J cC(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) C (cid:63) J (cid:63)C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:17) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ + λ C λ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J C(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − (cid:18)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B S ⊥ B(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∗ + λ C λ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C J cC(cid:63) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:19) = Q (∆ B S B(cid:63) , ∆ C J C(cid:63) ) − Q (∆ B S ⊥ B(cid:63) , ∆ C J cC(cid:63) ) . Lemma C.5.
Under the conditions of Theorem 4, the following bound holds: T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) W (∆ B aug + ∆ C aug ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F ≥ α RSC (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ B (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F ) − λ B Q (∆ B , ∆ C ) . Proof.
This lemma directly follows from Lemma 2 in Agarwal et al. (2012), by setting Θ (cid:63) = B (cid:63) , Γ (cid:63) = C (cid:63) , withthe regularizer R ( · ) being the element-wise (cid:96) norm. Note that σ j ( B (cid:63) ) = 0 for j = r + 1 , · · · , min { p , p } sincerank( B ) = r . For our problem, it suffices to set M ⊥ as J cC (cid:63) , and therefore (cid:107) C (cid:63) J cC(cid:63) (cid:107) = 0. Lemma C.6.
Consider the two centered Gaussian processes { X t } ∈ R p and { Z t } ∈ R p , and denote their crosscovariance matrix by Γ X,Z ( h ) = ( X t , Z t + h ) = E ( X t Z (cid:48) t + h ) . Let X and Z denote the data matrix. There existpositive constants c i > such that whenever T (cid:37) c ( p + p ) , with probability at least − c exp[ − c ( p + p )] , the following bound holds: T |||X (cid:48) Z||| op ≤ Q X,Z (cid:114) p + p T + 4 ||| Γ X,Z (0) ||| op , where Q X,Z = c [ M ( f X ) + M ( f Z ) + M ( f X,Z )] . Proof.
The main structure of this proof follows from that of Lemma 3 in Negahban and Wainwright (2011), andhere we focus on how to handle the temporal dependency present in our problem. Let S p = { u ∈ R p |(cid:107) u (cid:107) = 1 } denote the p -dimensional unit sphere. The operator norm has the following variational representation form:1 T |||X (cid:48) Z||| op = 1 n sup u ∈ S p sup v ∈ S p u (cid:48) X (cid:48) Z v. For positive scalars s and s , define Ψ( s , s ) = sup u ∈ s S p sup v ∈ s S p (cid:104)X u, Z v (cid:105) , and the goal is to establish an upper bound for Ψ(1 , /T . Let A = { u , · · · , u A } and B = { v , · · · , v B } denote the1 / S p and S p , respectively. Negahban and Wainwright (2011) showed thatΨ(1 , ≤ u a ∈A ,v b ∈B (cid:104)X u a , Z v b (cid:105) , and by Anderson et al. (1998) and Anderson (2011), there exists a 1 / S p and S p with at most A ≤ p and B ≤ p elements, respectively. Consequently, P (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) T Ψ(1 , (cid:12)(cid:12)(cid:12)(cid:12) ≥ δ (cid:21) ≤ p + p max u a ,v b P (cid:20) | ( u a ) (cid:48) X Z ( v b ) | T ≥ δ (cid:21) . T u (cid:48) X (cid:48) Z v, for an arbitrary fixed pair of ( u, v ) ∈ S p × S p . By Lemma C.2, we have P (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) u (cid:48) (cid:18) X (cid:48) ZT (cid:19) v (cid:12)(cid:12)(cid:12)(cid:12) > π ( M ( f X ) + M ( f Z ) + M ( f X,Z )) η + ||| Γ X,Z (0) ||| op (cid:21) ≤ (cid:0) − cT min { η, η } (cid:1) . Therefore, we have P (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) T Ψ(1 , (cid:12)(cid:12)(cid:12)(cid:12) ≥ π ( M ( f X ) + M ( f Z ) + M ( f X,Z )) η + 4 ||| Γ X,Z (0) ||| op (cid:21) ≤ (cid:2) ( p + p ) log 8 − cT min { η, η } (cid:3) . With the specified choice of sample size T , the probability vanishes by choosing η = c (cid:113) p + p T , for c large enough,and we yield the conclusion in Lemma C.6. Lemma C.7.
Consider some generic matrix A ∈ R m × n and let γ = { γ , . . . , γ p } ( p < n ) denote the set of columnindices of interest. Then, the following inequalities hold Λ min ( A (cid:48) A ) ≤ Λ min ( A (cid:48) γ A γ ) ≤ Λ max ( A (cid:48) γ A γ ) ≤ Λ max ( A (cid:48) A ) . Proof.
Let V := { v = ( v , · · · , v n ) ∈ R n | v (cid:48) v = 1 } . and V γ := { v = ( v , · · · , v n ) ∈ R n | v (cid:48) v = 1 and v j = 0 ∀ j / ∈ γ } . It is obvious that V γ ⊆ V . By the definition of eigenvalues through their Rayleigh quotient characterization,Λ min ( A (cid:48) γ A γ ) = min u (cid:48) u =1 ,u ∈ R p u (cid:48) ( A (cid:48) γ A γ ) u = min v (cid:48) v =1 ,v ∈V γ v (cid:48) ( A (cid:48) A ) v ≥ min v (cid:48) v =1 ,v ∈V v (cid:48) ( A (cid:48) A ) v = Λ min ( A (cid:48) A ) . Similarly,Λ max ( A (cid:48) γ A γ ) = max u (cid:48) u =1 ,u ∈ R p u (cid:48) ( A (cid:48) γ A γ ) u = max v (cid:48) v =1 ,v ∈V γ v (cid:48) ( A (cid:48) A ) v ≤ max v (cid:48) v =1 ,v ∈V v (cid:48) ( A (cid:48) A ) v = Λ max ( A (cid:48) A ) . Lemma C.8.
Let { X t } and { ε t } be two generic processes, where ε t = ( U (cid:48) t , V (cid:48) t ) (cid:48) . Suppose the spectral density ofthe joint process ( X (cid:48) t , ε (cid:48) t ) exists. Then, the following inequalities hold m ( f X,V ) ≥ m ( f X,ε ) , M ( f X,V ) ≤ M ( f X,ε ) . Proof.
By definition, the spectral density f X,ε ( θ ) can be written as f X,ε ( θ ) = (cid:18) π (cid:19) ∞ (cid:88) (cid:96) = −∞ Γ X,ε ( (cid:96) ) e − i(cid:96)θ , θ ∈ [ − π, π ]= (cid:18) π (cid:19) ∞ (cid:88) (cid:96) = −∞ ( E X t U t + (cid:96) , E X t V t + (cid:96) ) e − i(cid:96)θ = ( f X,U ( θ ) , f X,V ( θ )) . It follows that M ( f X,ε ) = ess sup θ ∈ [ − π,π ] (cid:112) Λ max ( H ( θ )) , where H ( θ ) = (cid:20) f ∗ X,U ( θ ) f ∗ X,V ( θ ) (cid:21) (cid:2) f X,U ( θ ) f X,V ( θ ) (cid:3) = (cid:20) f ∗ X,U ( θ ) f X,U ( θ ) f ∗ X,U ( θ ) f X,V ( θ ) f ∗ X,V ( θ ) f X,U ( θ ) f ∗ X,V ( θ ) f X,V ( θ ) (cid:21) . Note that M ( f X,V ) = ess sup θ ∈ [ − π,π ] (cid:113) Λ max ( f ∗ X,V ( θ ) f X,V ( θ )) . By Lemma C.7, ∀ θ , Λ min ( f ∗ X,V ( θ ) f X,V ( θ )) ≥ Λ min ( H ( θ )) and Λ max ( f ∗ X,V ( θ ) f X,V ( θ )) ≤ Λ max ( H ( θ )), hence m ( f X,V ) ≥ m ( f X,ε ) , M ( f X,V ) ≤ M ( f X,ε ) . Testing group Granger-causality under a sparse alternative.
In this section, we develop a testing procedure to test the null hypothesis against its sparse alternatives, that is, H : B = 0 vs H A : B is nonzero and sparse . Throughout, we impose assumptions on the sparsity level of B (to be specified later), and use the higher criticism framework (c.f. Tukey, 1989; Donoho and Jin, 2004; Arias-Castro et al., 2011) as the building block of the testingprocedure.Once again, we start with testing sparse alternatives in a simpler model setting Y t = Π X t + (cid:15) t , where Y t ∈ R p , X t ∈ R p , and (cid:15) t ∈ R p with each component being independent and identically distributed (i.i.d)and also independent of X t . We would like to test the null hypothesis H : Π = 0. Written in a compact form, themodel is given by Y = X Π (cid:48) + E , (49)where Y ∈ R T × p , X ∈ R T × p , and E are both contemporaneously and temporally independent. The lattershares similarities to the setting in Arias-Castro et al. (2011), with the main difference being that here we have amulti-response Y . By rewriting (49) using Kronecker products, we havevec( Y ) = ( I p ⊗ X ) vec(Π (cid:48) ) + vec( E ) i.e., Y = XΠ + E , where Y = vec( Y ) ∈ R T p , X = I p ⊗ X ∈ R T p × p p and Π = vec(Π (cid:48) ) ∈ R p p . Each coordinate in E is iid. Inthis form, using the higher criticism (Donoho and Jin, 2004; Ingster et al., 2010; Arias-Castro et al., 2011) withproper scaling, the test statistic is given by:HC ∗ ( X , Y ) = sup t> H ( t, X , Y ) := (cid:114) p p t )(1 − t )) (cid:104) p p p p (cid:88) k =1 (cid:16) √ T · X (cid:48) k Y (cid:107) X k (cid:107) (cid:107) Y (cid:107) > t (cid:17) − t ) (cid:105) , (50)where X k is the k th column of X and ¯Φ( t ) = 1 − Φ( t ) with Φ( t ) being the cumulative distribution function of astandard Normal random variable. Intuitively, (cid:0) p p (cid:1) p p (cid:88) k =1 (cid:8) √ T X (cid:48) k Y / ( (cid:107) X k (cid:107) (cid:107) Y (cid:107) ) > t (cid:9) is the fraction of significance beyond a given level t , after scaling for the vector length and the noise level. Toconduct a level α test, H is rejected when HC ∗ ( X , Y ) > h ( p p , α p p ) where h ( p p , α p p ) ≈ (cid:112) p p ),provided that α p p → h ( p p , α p p ) = 2 (cid:112) log log( p p )(1 + o (1)) (see Donohoand Jin, 2004). The effectiveness of the test relies on a number of assumptions on the design matrix and the sparsevector to be tested. Next, we introduce the three most relevant definitions for subsequent developments, originallymentioned in Arias-Castro et al. (2011). Definition 4 (Bayes risk) . Following Arias-Castro et al. (2011), the Bayes risk of a test T for testing Π = 0 vs. Π ∼ π , when H and H occur with the same probability, is defined as the sum of type I error and its averageprobability of type II error; i.e., Risk π ( T ) = P ( T = 1) + π [ P Π ( T = 0)] , where π is a prior on the set of alternatives Ω. When no prior is specified, the risk is defined as the worst-case risk:Risk( T ) = P ( T = 1) + max Π ∈ Ω [ P Π ( T = 0)] . Definition 5 (Asymptotically powerful) . We use T n,p to denote the dependency of the test on the sample size n and the parameter dimension p . With p → ∞ and n = n ( p ) → ∞ , a sequence of tests {T n,p } is said to be asymptotically powerful if lim p →∞ Risk( T n,p ) = 0 . efinition 6 (Weakly correlated) . Let S p ( γ, ∆) denote the set of p × p correlation matrices C = [ c jk ] satisfyingthe weakly correlated assumption: for all j = 1 , . . . , p , | c jk | < − (log p ) − and { k : | c jk | > γ } ≤ ∆ , for some γ ≤ , ∆ ≥ . 
With the above definitions, Arias-Castro et al. (2011) establishes that using the test based on higher criticism isasymptotically powerful, provided that (1) Π satisfies the strong sparsity assumption , that is, the total number ofnonzeros s (cid:63) Π = ( p p ) θ with θ ∈ (1 / , X belongs to S ( γ, ∆) with γ and ∆ satisfyingcertain assumptions in terms of their relative order with respect to parameter dimension and sample size; and (3)the minimum magnitude of the nonzero elements of Π exceeds a certain lower detection threshold.Switching to our model setting in which Z t = BX t − + CZ t − + V t , B ∈ R p × p , where B encodes the dependency between Z t and X t − , conditional on Z t − , the above discussion suggests thatwe can use higher criticism on the residuals R and R , where R and R are identically defined to those in thelow-rank testing; that is, R is the residual after regressing X on Z , and R is the residual after regressing Z T on Z : R = ( I − P z ) X and R = ( I − P z ) Z T , where P z = Z ( Z (cid:48) Z ) − Z (cid:48) . Writing the model in terms of R and R , we have R = R B (cid:48) + V , i.e., R = R β B + V , where R = vec( R ) , R = I ⊗ R , V = vec( V ), and β B = vec( B (cid:48) ) ∈ R p p . To test H : β B = 0, the highercriticism is given by HC ∗ ( R , R ) = sup t> H ( t, R , R ) := (cid:114) p p t )(1 − t )) (cid:104) p p p (cid:88) j =1 p (cid:88) i =1 (cid:16) √ T | R (cid:48) k R |(cid:107) R k (cid:107) (cid:107) R (cid:107) > t (cid:17) − t ) (cid:105) = sup t> (cid:114) p p t )(1 − t )) (cid:104) p p p (cid:88) j =1 p (cid:88) i =1 (cid:16) √ T | S ,ij | √ S ,ii S ,jj > t (cid:17) − t ) (cid:105) (51) where S = R (cid:48) R /T , S = R (cid:48) R /T and S = R (cid:48) R /T . The second equality is due to the block-diagonalstructure of R . We reject the null hypothesis ifHC ∗ ( R , R ) > (cid:112) log log( p p ) . Empirically, t can be chosen from { [1 , (cid:112) p p )] ∩ N } (Arias-Castro et al., 2011).Next, we analyze the theoretical properties of the above testing procedure. If the parameter dimension is fixed,then classical consistency results in terms of convergence (in probability or almost surely) hold when letting T → ∞ ,and everything follows trivially, as long as the corresponding population quantities satisfy the posited assumptions.In the remainder, we allow the parameter dimension p p to slowly vary with the sample size T . Let S R = R (cid:48) R /n be the sample covariance matrix based on the residuals R , and let C R be the corresponding correlationmatrix. The following proposition directly follows from Theorem 4 in Arias-Castro et al. (2011). Proposition 2 (An asymptotically powerful test) . Under the following conditions, the testing procedure associatedwith the Higher Criticism statistics defined in (51) is asymptotically powerful, provided that the smallest magnitudeof nonzero entries of B (cid:63) exceeds the lower detection boundary. (a) Strong sparsity: let p B = p p be the dimension of β (cid:63)B , then the total number of nonzeros satisfies s (cid:63)B = p θB ,where θ ∈ (1 / , .(b) Weakly correlated design: C R ∈ S ( γ, ∆) with the parameters satisfying ∆ = O ( p (cid:15) ) , γ = O ( p − / (cid:15) ) , ∀ (cid:15) > . Note that S R = I ⊗ S , where S = R (cid:48) R /T ; hence, C R = I ⊗ C , with C being the sample correlationmatrix based on R . The weakly correlated design assumption is thus effectively imposed on C , with theparameters γ and ∆ satisfying the same condition. 
The weakly correlated design assumption on C in Proposition 2is for a deterministic realization of R . The following corollary states that for a random realization of R , obtainedby regressing a random X on Z , to satisfy the weakly correlated design assumption with high probability, it issufficient that the population counterparts of the associated quantities satisfy the required assumptions. For a thorough discussion on the lower detection boundary, we refer the reader to Ingster et al. (2010); Arias-Castro et al. (2011)and references therein. orollary 2. Consider residual R obtained by regressing a random realization of X on that of Z . Let Σ :=Γ X − Γ X,Z Γ − Z Γ (cid:48) X,Z be the covariance of X t conditional on Z t , and ρ be the corresponding correlation matrix.Suppose ρ ∈ S ( γ, ∆) with γ and ∆ satisfying the same condition as in Proposition 2. Then with high probability,the sample correlation matrix based on R belongs to S ( γ (cid:48) , ∆ (cid:48) ) , where γ (cid:48) and ∆ (cid:48) respectively satisfy the samecondition as γ and ∆ , provided that the same condition imposed on γ holds for (cid:112) T − log( p p ) . Moreover, theconclusion in Proposition 2 holds.Remark . In the work of Anderson (2002) and Arias-Castro et al. (2011), the authors focus their analysis primarilyon the multiple regression setting, where the regression coefficient matrix directly encodes the relationship betweenthe response variable and the covariates, in an iid data setting. We consider a more complicated model settingin which the regression coefficient matrix of interest encodes the partial auto-correlations between a multivariateresponse and a set of exogenous variables, while the data exhibit temporal dependence. It is worth pointing outthat with the presence of temporal dependence, the rate with respect to the model dimension p and sample size T stays the same, as in the case where the data are iid (e.g., Rudelson and Vershynin, 2013; Basu and Michailidis,2015); specifically, it is (cid:112) log p/T in terms of the element-wise infinity norm, whereas the associated constant is afunction of the lower and upper extremes of the spectral density, which intricately controls the exact coverage andpower of the testing procedures. Therefore, as long as the rate constraint on p and T is satisfied (as in Corollary 2),the main conclusion is compatible with previous work, and asymptotically, we either obtain the distribution of thetest statistic (low rank testing), or have a powerful test (sparse testing). Remark . To solve the global testing problem for the sparse setting, a possible alternative is to construct a teststatistic based on estimates of the regression coefficients, then perform a global or max test on the estimatedcoefficients. A key issue for such a test is that the estimated entries of B are biased due to the use of Lasso;therefore, a debiasing procedure (e.g. Javanmard and Montanari, 2014) would be required to obtain valid marginaldistributions for the entries of the B matrix. In contrast, the higher criticism test statistic is based on the correlationbetween the response and the covariates (see Equation (50)), and here we employ the idea on the residuals so thatthe effect of Z t block is removed. We do not directly deal with the estimates of the B matrix and thus avoid thecomplications induced by the potentially biased estimates of B . E Estimation and Consistency for an Alternative Model Specification.
In this section, we consider the finite-sample error bound for the case where both B and C are sparse. We assumethe presence of a sparse contemporaneous conditional dependence, hence the alternate between the estimation oftransition matrices and that of the covariance matrix is required. In what follows, we briefly outline the estimationprocedure and the error bounds of the estimates. All notations follow from those in Section 3.The joint optimization problem is given by( (cid:98) B, (cid:98) C, (cid:98) Ω v ) = argmin B,C, Ω v (cid:110) tr (cid:2) Ω v ( Z T − X B (cid:48) − Z C (cid:48) ) (cid:48) ( Z T − X B (cid:48) − Z C (cid:48) ) /T (cid:3) − log det Ω v + λ B || B || + λ C || C || + ρ v || Ω v || , off (cid:111) . (52)For every iteration, with a fixed (cid:98) Ω ( k ) v , (cid:98) B ( k +1) and (cid:98) C ( k +1) are both updated via Lasso; for fixed ( (cid:98) B ( k ) , (cid:98) C ( k ) ), (cid:98) Ω ( k ) v isupdated by the graphical Lasso. Corollary 3.
Consider the stable Gaussian VAR system defined in (1) in which B (cid:63) is assumed to be low rank withrank r (cid:63)B and C (cid:63) is assumed to be s (cid:63)C -sparse. Further, assume the followingC.1 The incoherence condition holds for Ω (cid:63)v .C.2 Ω (cid:63)v is diagonally dominant.C.3 The maximum node degree of Ω (cid:63)v satisfies d maxΩ (cid:63)v = o ( p ) .Then, for random realizations of { X t } , { Z t } and { V t } , and the sequence { ( (cid:98) B ( k ) , (cid:98) C ( k ) ) , (cid:98) Ω ( k ) v } k returned by Algo-rithm 2 outlined in Section 2.1, with high probability, the following bounds hold for all iterations k for sufficientlylarge sample size T : ||| (cid:98) B ( k ) − B (cid:63) ||| F + ||| (cid:98) C ( k ) − C (cid:63) ||| F = O (cid:16)(cid:114) max { s (cid:63)B ,s (cid:63)C } (cid:0) log( p + p )+log p (cid:1) T (cid:17) , nd || (cid:98) Ω ( k ) v − Ω (cid:63)v || F = O (cid:16)(cid:113) ( s (cid:63) Ω v + p )(log( p + p )+log p ) T (cid:17) . Note that when no contemporaneous dependence is present, ( (cid:98) B, (cid:98) C ) solves( (cid:98) B, (cid:98) C ) = argmin ( B,C ) (cid:110) T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Z T − W ( B aug + C aug ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F + λ B (cid:107) B (cid:107) + λ C (cid:107) C (cid:107) (cid:111) , (53)whose error bound is given by ||| (cid:98) B − B (cid:63) ||| F + ||| (cid:98) C − C (cid:63) ||| F ≤ λ B + λ C ) /α , (54)provided that the RSC condition holds and the regularization parameters are chosen properly. By setting theweighted regularizer as Q ( B, C ) = (cid:107) B (cid:107) + λ C λ B (cid:107) C (cid:107) and ∆ B := (cid:98) B − B (cid:63) can be decomposed as (see equation (37)) (cid:107) ∆ B J B + ∆ B J cB (cid:107) = (cid:107) ∆ B J B (cid:107) + (cid:107) ∆ B J cB (cid:107) . The rest of the proof is similar to that of Theorem 4 hence is omitted here.
F Proof of Propositions and Corollaries.
Proof of Proposition 1 . The joint process W t = { ( X (cid:48) t , Z (cid:48) t ) (cid:48) } is a stationary VAR(1) process, and it follows that S w ( h ) := (cid:20) S x ( h ) S x,z ( h ) S z,x ( h ) S z ( h ) (cid:21) = 1 T T (cid:88) t =1 w t w (cid:48) t + h p → Γ W ( h ) := E W t W (cid:48) t + h , as T → ∞ , which implies S x p → Γ X , S z p → Γ Z , S x,z p → Γ X,Z , S x,z (1) p → Γ X,Z (1) . Note that sample partial regression residual covariances can be obtained by S = S z − S z (1) S − z S (cid:48) z (1) , S = S x − S x,z S − z S (cid:48) x,z , S = S x,z (1) − S z (1) S − z S (cid:48) x,z . An application of the Continuous Mapping Theorem yields S p → Σ , S p → Σ , S p → Σ . By Hsu (1941a,b), the limiting behavior of T Ψ r is given by T Ψ r ∼ χ p − r )( p − r ) , as T → ∞ . Note that since µ is of multiplicity one and the ordered eigenvalues are continuous functions of the matrices, thefollowing holds: φ k p → µ k , ∀ k = 1 , . . . , min( p , p ) . Proof of Corollary 2 . First, we note that R effectively comes from the following stochastic regression: X t = QZ t + R t , for some regression matrix Q, (55)with R = X − Z (cid:98) Q being the sample residual. The population covariance of R t is the conditional covariance of X t on Z t , given by Σ = Σ (cid:63)R := Γ X − Γ X,Z Γ − Z Γ (cid:48) X,Z . Σ is identical to that defined in equation (26). Writing the model in terms of data, we have X = Z Q + R (cid:63) , R (cid:63) to denote the true error term, for the purpose of distinguishing it from the residuals by regressing X on Z . Note that R (cid:63) is also sub-Gaussian. First, we would like to obtain a bound for (cid:107) S − Σ (cid:107) ∞ . Let S R bethe sample covariance matrix based on the actual errors, i.e., S R (cid:63) = ( R (cid:63) ) (cid:48) ( R (cid:63) ) /T , then (cid:107) S − Σ (cid:107) ∞ ≤ (cid:107) S R (cid:63) − Σ (cid:107) ∞ + (cid:107) S − S R (cid:63) (cid:107) ∞ The first term can be directed bounded by Lemma 1 in Ravikumar et al. (2011), that is, there exists some constant τ >
2, such that for large enough sample size T , with probability at least 1 − /p τ − , (cid:107) S R (cid:63) − Σ (cid:107) ∞ ≤ C (cid:112) log p /T , for some constant C > . Consider the second term. Rewrite it as S − S R (cid:63) = T ( R (cid:63) ) (cid:48) Z ( Q (cid:63) − (cid:98) Q ) + ( Q (cid:63) − (cid:98) Q ) (cid:48) (cid:16) Z (cid:48) Z T (cid:17) ( Q (cid:63) − (cid:98) Q ) := I + I , then for I , I ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q (cid:63) − (cid:98) Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T ( R (cid:63) ) (cid:48) Z (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) vec( Q (cid:63) ) − vec( (cid:98) Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T ( R (cid:63) ) (cid:48) Z (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ . By Lemma B.4, there exist constants c i > − c exp( − c log( p p )), forsufficiently large sample size T , we get (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T ( R (cid:63) ) (cid:48) Z (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ C (cid:112) log( p p ) /T , for some constant C > . (56)For I , we have that I ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q (cid:63) − (cid:98) Q (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Z (cid:48) Z T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) vec( Q (cid:63) ) − vec( (cid:98) Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Z (cid:48) Z T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ . By Proposition 2.4 in Basu and Michailidis (2015) and taking the union bound, there exist some constants c (cid:48) and c (cid:48) such that with probability at least 1 − c (cid:48) exp( − c (cid:48) log p ), for sufficiently large sample size T , we obtain (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Z (cid:48) Z T − Γ Z (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ C (cid:112) log p /T , for some constant C > , which implies (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Z (cid:48) Z T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ C (cid:112) log p /n + max i (Γ Z,ii ) . (57)By assuming that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) vec( Q (cid:63) ) − vec( (cid:98) Q ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ ε Q , it follows that I + I ≤ C (cid:48) ε Q (cid:114) log( p p ) T + C (cid:48) ε Q (cid:114) log p T , hence (cid:107) S − Σ (cid:107) ∞ ≤ C (cid:114) log p T + C (cid:48) ε Q (cid:114) log( p p ) T + C (cid:48) ε Q (cid:114) log p T . (58)Regardless of the relative order of p and p , one can easily verify that (cid:107) S − Σ (cid:107) ∞ = O (cid:16)(cid:113) log( p p ) T (cid:17) . (59)by assuming log( p p ) /T being a small quantity. 
Since C = (diag( S )) − / S (diag( S )) − / and letting (cid:101) R t = diag(Σ ) − / R t , we then have that C is simply the sample covariance matrix based on residualsurrogates of (cid:101) R t , whose error rate stays unchanged by scaling, i.e, || C − ρ || ∞ = O ( (cid:112) T − log( p p )). Thelatter fact further implies that if ρ ∈ S ( γ, ∆), then C ∈ S ( γ (cid:48) , ∆ (cid:48) ) with ∆ (cid:48) ≥ ∆ − (const) (cid:112) log( p p ) /T and γ (cid:48) ≥ γ + (const) (cid:112) log( p p ) /T .It then follows that as long as (cid:112) T − log( p p ) satisfies the same condition imposed on γ = O ( p − / (cid:15) ), that is, p − (cid:15) log( p p ) = O ( T ) , for all (cid:15) > , with high probability, the sample covariance matrix based on the residuals R satisfies the weakly correlated designassumption, for a random realization X and Z . 47inally, we briefly outline the main steps of how to obtain the error bound in Corollary 3. Proof of Corollary 3 . Let W t = ( X (cid:48) t , Z (cid:48) t ) (cid:48) . At iteration 0, ( (cid:98) B (0) , (cid:98) C (0) ) solves (53), and the following bound holds: ||| (cid:98) B (0) − B (cid:63) ||| F + ||| (cid:98) C (0) − C (cid:63) ||| F ≤ λ B + λ C ) /α , provided that W satisfies the RSC condition, and λ B , λ C both satisfy λ B ≥ (cid:107)W (cid:48) V /T (cid:107) ∞ , λ C ≥ (cid:107)W (cid:48) V /T (cid:107) ∞ . In particular, by Lemma B.3 and Lemma B.4, for random realizations of { X t } , { Z t } and { V t } , for sufficiently largesample size, with high probability W satisfies the RSC condition , and ||W (cid:48) V /T || ∞ ≤ C (cid:114) log( p + p ) + log p T , for some constant C . Hence, with high probability, ||| (cid:98) B (0) − B (cid:63) ||| F + ||| (cid:98) C (0) − C (cid:63) ||| F = O (cid:16) log( p + p )+log p T (cid:17) . (60)For (cid:98) Ω (0) v , it solves a graphical Lasso problem: (cid:98) Ω (0) v = argmin Ω v ∈ S p × p (cid:110) log det Ω v − trace (cid:0) (cid:98) S (0) u Ω v (cid:1) + ρ v (cid:107) Ω v (cid:107) , off (cid:111) , where (cid:98) S (0) v = T ( Z T − X (cid:98) B (0) (cid:48) − Z (cid:98) C (0) (cid:48) ) (cid:48) ( Z T − X (cid:98) B (0) (cid:48) − Z (cid:98) C (0) (cid:48) ). Similar to the proof of Theorem 2, its error bounddepends on (cid:107) (cid:98) S (0) v − Σ (cid:63)v (cid:107) ∞ . With the same decomposition and consider only the leading term, (cid:107) (cid:98) S (0) v − Σ (cid:63)v (cid:107) ∞ = O (cid:16)(cid:113) log( p + p )+log p T (cid:17) , ⇒ (cid:107) (cid:98) Ω (0) v − Ω (cid:63)v (cid:107) ∞ = O (cid:16)(cid:113) log( p + p )+log p T (cid:17) . At iteration 1, the bound of ||| (cid:98) B (1) − B (cid:63) ||| F + ||| (cid:98) C (1) − C (cid:63) ||| F relies on || T W (cid:48) V (cid:98) Ω (0) v || ∞ ≤ || T W (cid:48) V ( (cid:98) Ω (0) v − Ω (cid:63)v ) || ∞ + || T W (cid:48) V Ω (cid:63)v || ∞ , ≤ C (cid:113) log( p + p )+log p T + d Ω (cid:63)v max || T W (cid:48) V|| ∞ || (cid:98) Ω (0) v − Ω (cid:63)v || ∞ = O (cid:16)(cid:113) log( p + p )+log p T (cid:17) , (61)hence ||| (cid:98) B (1) − B (cid:63) ||| F + ||| (cid:98) C (1) − C (cid:63) ||| F = O ( log( p + p )+log p T ), which coincides with the bound of the estimator ofiteration 0, implying the error rate remains unchanged henceforth. Up to this step, all sources of randomness havebeen captured. 
Consequently, the following bounds hold with high probability for all iterations k : (cid:107)W (cid:48) V (cid:98) Ω ( k ) v /T (cid:107) ∞ = O (cid:16)(cid:113) log( p + p )+log p T (cid:17) , and (cid:107) (cid:98) S ( k ) v − Σ (cid:63)v (cid:107) ∞ = O (cid:16)(cid:113) log( p + p )+log p T (cid:17) , which imply the bounds in Corollary 3. 48 List of Stock and Macroeconomic Variables
Table 16: List of Stocks used in the AnalysisStock Symbol Name Stock Symbol Company NameAAPL Apple Inc. JNJ Johnson & Johnson IncAIG American International Group Inc. JPM JP Morgan Chase & CoALL Allstate Corp. KO The Coca-Cola CompanyAMGN Amgen Inc. LMT Lockheed-MartinAXP American Express Inc. LOW Lowe’sBA Boeing Co. MCD McDonald’s CorpBAC Bank of America Corp MDLZ Mondel InternationalBK Bank of New York Mellon Corp MDT Medtronic Inc.BMY Bristol-Myers Squibb MMM 3M CompanyC Citigroup Inc MO Altria GroupCAT Caterpillar Inc MRK Merck & Co.CL Colgate-Palmolive Co. MS Morgan StanleyCMCSA Comcast Corporation MSFT MicrosoftCOF Capital One Financial Corp. NSC Norfolk Southern CorpCOP ConocoPhillips ORCL Oracle CorporationCSCO Cisco Systems OXY Occidental Petroleum Corp.CVS CVS Caremark PEP Pepsico Inc.CVX Chevron PFE Pfizer IncDD DuPont PG Procter & Gamble CoDIS The Walt Disney Company RTN Raytheon CompanyDOW Dow Chemical SLB SchlumbergerDVN Devon Energy Corp SO Southern CompanyEMC EMC Corporation T AT&T IncEXC Exelon TGT Target Corp.F Ford Motor TWX Time Warner Inc.FCX Freeport-McMoran TXN Texas InstrumentsFDX FedEx UNH UnitedHealth Group Inc.GD General Dynamics UPS United Parcel Service IncGE General Electric Co. USB US BancorpGILD Gilead Sciences UTX United Technologies CorpGS Goldman Sachs VZ Verizon Communications IncHAL Halliburton WBA Walgreens Boots AllianceHD Home Depot WFC Wells FargoHON Honeywell WMT Wal-MartIBM International Business Machines XOM Exxon Mobil CorpINTC Intel Corporation 49able 17: List of Macroeconomic Variables and the transformation used in the AnalysisSymbol Description TransformationFFR Federal Funds Rate abs diffT10yr 10-Year Treasury Yield with Constant Maturity abs diffUNEMPL Unemployment Rate for 16 and above abs diffIPI Industrial Production Index relative diffETTL Employment Total relative diffM1 M1 Money Stock relative diffAHES Average Hourly Earnings of Production and Nonsupervisory Employees relative diffCU Capital Utilization relative diffM2 M2 Money Stock relative diffHS Housing starts relative diffEX US Exchange Rate abs diffPCEQI Personal Consumption Expenditures Quantity Index relative diffGDP real Gross Domestic Product relative diffPCEPI Personal Consumption Expenditures Price Index relative diffPPI Producer Price Index relative diffCPI Consumer Price Index relative diffSP.IND S&P Industrial Sector index relative diff*abs diff: x t − x t − , relative diff: x t − x t − x t − eferences Tilak Abeysinghe. Estimation of direct and indirect impact of oil price on growth.
Economics letters , 73(2):147–153,2001.Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Noisy matrix decomposition via convex relaxation:Optimal rates in high dimensions.
The Annals of Statistics , 40(2):1171–1197, 2012.Charles W Anderson, Erik A Stolz, and Sanyogita Shamsunder. Multivariate autoregressive models for classifi-cation of spontaneous electroencephalographic signals during mental tasks.
IEEE Transactions on BiomedicalEngineering , 45(3):277–286, 1998.Theodore W Anderson. Maximum likelihood estimation for vector autoregressive moving average models. Technicalreport, STANFORD UNIV CA DEPT OF STATISTICS, 1978.Theodore Wilbur Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distri-butions.
The Annals of Mathematical Statistics , 22(3):327–351, 1951.Theodore Wilbur Anderson. Asymptotic distribution of the reduced rank regression estimator under generalconditions.
The Annals of Statistics , 27(4):1141–1154, 1999.Theodore Wilbur Anderson. Canonical correlation analysis and reduced rank regression in autoregressive models.
The Annals of Statistics , 30(4):1134–1154, 2002.Theodore Wilbur Anderson.
The statistical analysis of time series , volume 19. John Wiley & Sons, 2011.Ery Arias-Castro, Emmanuel J Cand`es, and Yaniv Plan. Global testing under sparse alternatives: ANOVA, multiplecomparisons and the higher criticism.
The Annals of Statistics , 39(5):2533–2556, 2011.Francis R Bach and Michael I Jordan. A probabilistic interpretation of canonical correlation analysis. Technicalreport, University of California Berkeley, 2005.Sumanta Basu and George Michailidis. Regularized estimation in sparse high-dimensional time series models.
TheAnnals of Statistics , 43(4):1535–1567, 2015.Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAMJournal on Imaging Sciences , 2(1):183–202, 2009.Ben S Bernanke, Jean Boivin, and Piotr Eliasz. Measuring the effects of monetary policy: a factor-augmentedvector autoregressive (FAVAR) approach.
The Quarterly Journal of Economics , 120(1):387–422, 2005.Carluccio Bianchi, Alessandro Carta, Dean Fantazzini, Maria Elena De Giuli, and Mario A Maggi. A copula-VAR-Xapproach for industrial production modeling and forecasting.
Applied Economics , 42(25):3267–3277, 2010.Monica Billio, Mila Getmansky, Andrew W Lo, and Loriana Pelizzon. Econometric measures of connectedness andsystemic risk in the finance and insurance sectors.
Journal of Financial Economics , 104(3):535–559, 2012.Celso Brunetti, Jeffrey H Harris, Shawn Mankad, and George Michailidis. Interconnectedness in the interbankmarket.
Social Sciences Research Network , 2015.Jian-Feng Cai, Emmanuel J Cand`es, and Zuowei Shen. A singular value thresholding algorithm for matrix com-pletion.
SIAM Journal on Optimization , 20(4):1956–1982, 2010.Edward Carlstein. The use of subseries values for estimating the variance of a general statistic from a stationarysequence.
The Annals of Statistics , 14(3):1171–1179, 1986.David O Cushman and Tao Zha. Identifying monetary policy in a small open economy under flexible exchangerates.
Journal of Monetary economics , 39(3):433–448, 1997.David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures.
The Annals ofStatistics , pages 962–994, 2004.Mathias Drton and Michael D Perlman. A sinful approach to gaussian graphical model selection.
Mathias Drton and Michael D Perlman. A SINful approach to Gaussian graphical model selection. Journal of Statistical Planning and Inference, 138(4):1179–1200, 2008.
Roger EA Farmer. The stock market crash really did cause the Great Recession. Oxford Bulletin of Economics and Statistics, 77(5):617–633, 2015.
US Financial Crisis Inquiry Commission. The Financial Crisis Inquiry Report: Final Report of the National Commission on the Causes of the Financial and Economic Crisis in the United States. Public Affairs, 2011.
Jean-Paul Fitoussi, David Jestaz, Edmund S Phelps, Gylfi Zoega, Olivier Blanchard, and Christopher A Sims. Roots of the recent recoveries: Labor reforms or private sector forces? Brookings Papers on Economic Activity, 2000(1):237–311, 2000.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
Christian Gourieroux and Joann Jasiak. Financial Econometrics: Problems, Models, and Methods. Princeton University Press, Princeton, NJ, 2001.
Clive WJ Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 37(3):424–438, 1969.
Peter Hall. Resampling a coverage pattern. Stochastic Processes and Their Applications, 20(2):231–246, 1985.
Edward James Hannan. Multiple Time Series. John Wiley & Sons, 1970.
Lars Peter Hansen and Thomas J Sargent. Recursive Models of Dynamic Linear Economies. Princeton University Press, 2013.
Yu Hsing. Impacts of macroeconomic variables on the US stock market index and policy implications. Economics Bulletin, 31(1):883–892, 2011.
PL Hsu. On the limiting distribution of roots of a determinantal equation. Journal of the London Mathematical Society, 1(3):183–194, 1941a.
PL Hsu. On the problem of rank and the limiting distribution of Fisher's test function. The Annals of Eugenics, 11(1):39–41, 1941b.
Andreas Humpe and Peter Macmillan. Can macroeconomic variables explain long-term stock market movements? A comparison of the US and Japan. Applied Financial Economics, 19(2):111–119, 2009.
Yuri I Ingster, Alexandre B Tsybakov, and Nicolas Verzelen. Detection boundary in sparse regression. Electronic Journal of Statistics, 4:1476–1526, 2010.
Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15(1):2869–2909, 2014.
Søren Johansen. Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control, 12(2):231–254, 1988.
PR Kumar and Pravin Varaiya. Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice Hall, 1986.
Hans R Kunsch. The jackknife and the bootstrap for general stationary observations. The Annals of Statistics, 17(3):1217–1241, 1989.
Jiahe Lin, Sumanta Basu, Moulinath Banerjee, and George Michailidis. Penalized maximum likelihood estimation of multi-layered Gaussian graphical models. Journal of Machine Learning Research, 17(146):1–51, 2016.
Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637–1664, 2012.
R Duncan Luce and Albert D Perry. A method of matrix analysis of group structure. Psychometrika, 14(2):95–116, 1949.
Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.
Mahedi Masuduzzaman. Impact of the macroeconomic variables on the stock market returns: The case of Germany and the United Kingdom. Global Journal of Management and Business Research, 12(16), 2012.
Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
Yurii Nesterov. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody, 24(3):509–517, 1988.
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, UCL, 2007.
William Nicholson, David Matteson, and Jacob Bien. VARX-L: Structured regularization for large vector autoregressions with exogenous variables. arXiv preprint arXiv:1508.07497, 2016.
Vincent R Nijs, Marnik G Dekimpe, Jan-Benedict EM Steenkamp, and Dominique M Hanssens. The category-demand effects of price promotions. Marketing Science, 20(1):1–22, 2001.
Koen Pauwels and Allen Weiss. Moving from free to fee: How online firms market to change their business model successfully. Journal of Marketing, 72(3):14–31, 2008.
M Hashem Pesaran. Time Series and Panel Data Econometrics. Oxford University Press, 2015.
M Hashem Pesaran, Til Schuermann, and Scott M Weiner. Modeling regional interdependencies using a global error-correcting macroeconometric model. Journal of Business & Economic Statistics, 22(2):129–162, 2004.
M Hashem Pesaran, Til Schuermann, and L Vanessa Smith. Forecasting economic and financial variables with global VARs. International Journal of Forecasting, 25(4):642–675, 2009.
Edmund S Phelps. Behind this structural boom: The role of asset valuations. The American Economic Review, 89(2):63–68, 1999.
Dimitris N Politis and Joseph P Romano. Large sample confidence regions based on subsamples under minimal assumptions. The Annals of Statistics, 22(4):2031–2050, 1994.
Dimitris N Politis, Joseph P Romano, and Michael Wolf. Subsampling. Springer Series in Statistics, 1999.
Huitong Qiu, Sheng Xu, Fang Han, Han Liu, and Brian Caffo. Robust estimation of transition matrices in high dimensional heavy-tailed vector autoregressive processes. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1843–1851, 2015.
Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
Bruno Rémillard, Nicolas Papageorgiou, and Frédéric Soustra. Copula-based semiparametric models for multivariate time series. Journal of Multivariate Analysis, 110:30–42, 2012.
Sylvia Richardson, George C Tseng, and Wei Sun. Statistical methods in integrative genomics. Annual Review of Statistics and Its Application, 3:181–209, 2016.
Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18(82):1–9, 2013.
Anil K Seth. Interoceptive inference, emotion, and the embodied self. Trends in Cognitive Sciences, 17(11):565–573, 2013.
Ali Shojaie, Sumanta Basu, and George Michailidis. Adaptive thresholding for reconstructing regulatory networks from time-course gene expression data. Statistics in Biosciences, 4(1):66–83, 2012.
Galen R Shorack and Jon A Wellner. Empirical Processes with Applications to Statistics. SIAM, 2009.
Christopher A Sims. Macroeconomics and reality. Econometrica: Journal of the Econometric Society, 48(1):1–48, 1980.
Christopher A Sims. Interpreting the macroeconomic time series facts: The effects of monetary policy. European Economic Review, 36(5):975–1000, 1992.
Christopher A Sims, Stephen M Goldfeld, and Jeffrey D Sachs. Policy analysis with econometric models. Brookings Papers on Economic Activity, 1982(1):107–164, 1982.
James H Stock and Mark W Watson. An empirical comparison of methods for forecasting using many predictors. Manuscript, Princeton University, 2005.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
Paul Tseng. Convergence of a block coordinate descent method for non-differentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Technical report, University of Washington, Seattle, 2008.
JW Tukey. Higher criticism for individual significances in several tables or parts of tables. Princeton University, Princeton (internal working paper), 1989.
Tuo Zhao, Xingguo Li, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. huge: High-Dimensional Undirected Graph Estimation, 2015. URL http://CRAN.R-project.org/package=huge.