High-Dimensional Low-Rank Tensor Autoregressive Time Series Modeling
Di Wang†, Yao Zheng‡ and Guodong Li†
†University of Hong Kong and ‡University of Connecticut
January 13, 2021
Abstract
Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, urging the need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This paper provides the first practical tool to accomplish this task via autoregression (AR). By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both a low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, by leveraging the special balanced structure of the AR transition tensor, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.
Keywords: dimension reduction; high-dimensional time series; nuclear norm; tensor decomposition; tensor-valued data

1 Introduction
Modern technological development has made available enormous quantities of data, many of which are structured and collected over time. Tensor-valued time series data, namely observations on a set of variables structured in tensor form and collected over time, have become increasingly common in a wide variety of areas. Examples include time series of multiple macroeconomic indices for multiple countries (Chen et al., 2020c), dynamic inter-regional transport flow data (Chen and Chen, 2019), and sequential image and video processing (Walden and Serroukh, 2002), among many others.

To model temporal dependencies of tensor-valued time series data, one might naturally resort to vectorization of the tensor-valued observations so that conventional vector-valued time series models can be applied directly. For instance, let $\mathcal{Y}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ be the tensor-valued observation at time $t$, for $t = 1,\dots,T$, where $T$ is the sample size. Then an obvious first step is to conduct the following vector autoregression (VAR) for its vectorization:

$$\mathrm{vec}(\mathcal{Y}_t) = A\,\mathrm{vec}(\mathcal{Y}_{t-1}) + \mathrm{vec}(\mathcal{E}_t), \quad (1)$$

where $A \in \mathbb{R}^{p\times p}$ is the unknown transition matrix, with $p = \prod_{i=1}^d p_i$, and $\mathcal{E}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is the tensor-valued random error at time $t$. Although model (1) is potent in that it takes into account the linear association between every variable in $\mathcal{Y}_t$ and that in $\mathcal{Y}_{t-1}$, it clearly suffers from two major drawbacks:

• The vectorization destroys the important multidimensional structural information inherently embedded in the tensor-valued observations, resulting in a lack of interpretability;

• The number of unknown parameters, $p^2 = (\prod_{i=1}^d p_i)^2$, can be formidable even for small $d$ and $p_i$; e.g., even when $d = 3$ and $p_1 = p_2 = p_3 = 3$, the number of unknown parameters is as large as $(3\times 3\times 3)^2 = 729$, while the sample size $T$ often has a similar magnitude in practice; see, e.g., the real data examples in Section 7.

In this work, motivated by the idea of the Tucker decomposition (Tucker, 1966), we propose the Low-Rank Tensor Autoregressive (LRTAR) model through folding the $p\times p$ transition matrix $A$ in (1) into the $2d$-th-order transition tensor $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, which is assumed to have low multilinear ranks $(r_1,\dots,r_{2d})$. That is, $r_i$ can be much smaller than $p_i$, where $p_{d+i} = p_i$ for $i = 1,\dots,d$. As a result, the aforementioned drawbacks of the vectorization approach are overcome by the proposed model:

• As we will show in Section 3, the proposed model can flexibly retain the distinct structural information across all modes of the observed tensor process, encapsulating an interpretable low-dimensional tensor dynamic relationship between $\mathcal{Y}_t$ and $\mathcal{Y}_{t-1}$.

• As the transition tensor $\mathcal{A}$ is assumed to have low multilinear ranks, the parameter space is simultaneously constrained in $2d$ directions, reducing its dimension drastically from $(\prod_{i=1}^d p_i)^2$ to $\prod_{i=1}^{2d} r_i + \sum_{i=1}^d r_i(p_i - r_i) + \sum_{i=1}^d r_{d+i}(p_i - r_{d+i})$; a small numerical comparison is sketched after this list.
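To make the scale of this dimension reduction concrete, here is a minimal Python sketch (an illustration, not the authors' code) comparing the VAR parameter count $p^2$ with the LRTAR count above; the dimensions and ranks are arbitrary choices for illustration.

```python
import numpy as np

def var_param_count(dims):
    """Number of free parameters in the vectorized VAR(1) model (1)."""
    p = int(np.prod(dims))
    return p ** 2

def lrtar_param_count(dims, ranks):
    """LRTAR parameter count: core tensor size plus the cost of the 2d
    factor matrices, as in the displayed formula above (p_{d+i} = p_i)."""
    d = len(dims)
    r = list(ranks)                     # (r_1, ..., r_{2d})
    core = int(np.prod(r))
    factors = sum(r[i] * (dims[i] - r[i]) for i in range(d)) \
        + sum(r[d + i] * (dims[i] - r[d + i]) for i in range(d))
    return core + factors

dims = (3, 3, 3)                        # hypothetical p_1 = p_2 = p_3 = 3
ranks = (1, 1, 1, 1, 1, 1)              # hypothetical multilinear ranks
print(var_param_count(dims))            # 729
print(lrtar_param_count(dims, ranks))   # 1 + 6 + 6 = 13
```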
Recently there has been rapidly emerging interest in high-dimensional matrix- and tensor-valued time series analysis, particularly through factor models, e.g., the matrix factor model proposed in Wang et al. (2019), the constrained matrix factor model in Chen et al. (2020a), and the tensor factor model in Chen et al. (2020c). Factor models are powerful tools for dimension reduction with great interpretability. However, unlike autoregression, factor models do not seek to explicitly model the temporal dependency and thus by themselves cannot be used directly for forecasting. On the other hand, despite the extensive literature on high-dimensional VAR models in recent years (e.g., Negahban and Wainwright, 2011; Basu and Michailidis, 2015; Han et al., 2015a; Wang et al., 2020; Zheng and Cheng, 2020), counterparts able to meet the particular needs of matrix- and tensor-valued time series modeling have rarely been explored. The most relevant existing work in this direction so far might be the matrix autoregressive (MAR) model in Chen et al. (2020b), where the focus is on low-dimensional modeling; see also Hoff (2015). As we will discuss in Section 3, the proposed model includes the MAR model as a special case, but enjoys greater flexibility and a more drastic reduction of the dimensionality; see also Figure 1 for an illustration.

The estimation of the proposed model is studied under both low- and high-dimensional scaling. When the sample size $T$ is sufficiently large and the transition tensor $\mathcal{A}$ is exactly low-rank, we consider the rank-constrained estimation method and prove the asymptotic normality of the proposed low-Tucker-rank (LTR) estimator. Under the high-dimensional setup, we consider a more general and natural setting where $\mathcal{A}$ can be well approximated by a low-Tucker-rank tensor, and develop regularized estimation methods based on nuclear-norm-type penalties. We first study the Sum of Nuclear (SN) norm regularizer, defined as the sum of nuclear norms of all one-mode matricizations of $\mathcal{A}$, and derive the non-asymptotic estimation error bound for the corresponding estimator. The SN norm regularizer has been widely used in the literature for various low-rank tensor problems (Gandy et al., 2011; Tomioka et al., 2011; Liu et al., 2013; Raskutti et al., 2019). Its major strength lies in the fact that the summation of nuclear norms enforces low-rankness simultaneously across all modes of the tensor. In contrast, if only a single nuclear norm is used, the low-rankness of only one mode is accounted for, obviously leading to a much less efficient estimator.

However, although penalizing multiple one-mode matricizations of $\mathcal{A}$ simultaneously is far more efficient than penalizing only one of them, the SN regularized estimator suffers from serious efficiency loss due to the fat-and-short shape of the one-mode matricizations. Note that low-rankness can in fact also be encouraged by penalizing nuclear norms of multi-mode matricizations. For instance, the conventional Matrix Nuclear (MN) norm regularized estimator (Negahban and Wainwright, 2011) simply penalizes the nuclear norm of the transition matrix $A$ in representation (1), which under the proposed LRTAR model is actually a square-shaped multi-mode matricization, namely a square matricization, of the transition tensor $\mathcal{A}$. As we will show in Theorem 3 and in simulations, even though the MN regularizer incorporates only one single square matricization, it still clearly beats the SN regularizer, since the former avoids the efficiency bottleneck caused by the imbalance of the one-mode matricizations.

Indeed, due to the autoregressive form of the proposed model, the transition tensor $\mathcal{A}$ has a special balanced structure in the sense that its first $d$ modes have exactly the same dimensions as its other $d$ modes. As a result, many different square matricizations of $\mathcal{A}$ can be formed by appropriately pairing up its modes; see Section 5.2 for details.
This naturally motivates us to propose a new regularizer that combines the strengths of both the SN and MN norms. Specifically, for the proposed tensor autoregression, we introduce a novel Sum of Square-matrix Nuclear (SSN) norm regularizer, defined as the sum of nuclear norms of all the $p\times p$ square matricizations of the transition tensor $\mathcal{A}$. The SSN regularizer is expected to be superior to the MN, since it simultaneously encourages the low-rankness across all possible square matricizations of $\mathcal{A}$ rather than only one of them. Moreover, thanks to the use of square matricizations, the SSN regularized estimator is provably more efficient than the SN regularized one; see Theorem 4 and the last simulation experiment in Section 6. Note that the adoption of a more balanced (square) matricization to improve estimation performance was proposed in Mu et al. (2014) for low-rank tensor completion problems, where only a single square matricization was penalized, similarly to the MN regularizer.

This work is also related to the literature on matrix-variate regression and tensor regression for independent data. The matrix-variate regression model in Ding and Cook (2018) has the same basic bilinear form as that of the MAR model mentioned earlier, while an envelope method was introduced to further reduce the dimension. Raskutti et al. (2019) proposed a multi-response tensor regression model, where they mainly studied the third-order coefficient tensor and the SN regularization, which is known to be statistically sub-optimal for higher-order tensor estimation. In contrast, we study the model for general higher-order tensor-valued time series. Moreover, our SSN regularized estimator has a much faster statistical convergence rate than the SN estimator. Recently, Chen et al. (2019) and Han et al. (2020) studied non-convex projected gradient descent methods for tensor regression. While their non-convex approaches require exact low-rankness with known Tucker ranks, our methods can handle both exact and approximate low-rankness and select the unknown ranks consistently in the exactly low-rank case. In addition, the existing literature on tensor regression has considered only independent data and Gaussian time series data, whereas we allow sub-Gaussianity of the time series. This is a non-trivial relaxation, since unlike the Gaussian case, sub-Gaussian time series cannot be linearly transformed into independent samples.

We summarize the most important contributions of this paper as follows:

(i) This paper provides the first practical tool to model and forecast general structured, high-dimensional data with complex temporal dependence via tensor autoregression. By flexibly and efficiently capturing the underlying low-dimensional tensor dynamics, the proposed model delivers significant dimension reduction, meaningful structural interpretation and favorable forecasting performance.

(ii) By exploiting the special balanced structure of the transition tensor $\mathcal{A}$, a novel SSN regularization approach is introduced to simultaneously and efficiently encourage low-rankness across all square matricizations of $\mathcal{A}$, outperforming both the SN and MN methods under both exact and approximate low-rankness. For exactly low-rank $\mathcal{A}$, a truncated estimator is further introduced for consistent rank selection.

(iii) On the technical front, by establishing a novel martingale-based concentration bound, this paper relaxes the conventional Gaussian assumption in the literature on high-dimensional time series to sub-Gaussianity.
This technique is generally applicable to the non-asymptotic estimation theory for high-dimensional time series models with a VAR representation and hence is of independent interest.

The rest of the paper is organized as follows. Section 2 introduces basic notation and tensor algebra. Section 3 presents the proposed LRTAR model. Section 4 studies the low-dimensional least squares estimator and its asymptotic properties. The high-dimensional regularized estimation is covered in Section 5, where we develop the non-asymptotic theory for three regularized estimators and the rank selection consistency of the truncated estimator. Sections 6 and 7 present simulation studies and real data analysis, respectively. Section 8 concludes with a brief discussion. All technical proofs are relegated to the Appendix.

2 Preliminaries: Notation and Tensor Algebra
Tensors, also known as multidimensional arrays, are natural higher-order extensions of matrices. The order of a tensor is the number of its dimensions, also known as ways or modes, so a multidimensional array $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is called a $d$-th-order tensor. We refer readers to Kolda and Bader (2009) for a detailed review of basic tensor algebra.

Throughout this paper, we denote vectors by boldface small letters, e.g. $x$, $y$, matrices by boldface capital letters, e.g. $X$, $Y$, and tensors by boldface Euler capital letters, e.g. $\mathcal{X}$, $\mathcal{Y}$. For any two real-valued sequences $x_k$ and $y_k$, we write $x_k \gtrsim y_k$ if there exists a constant $c > 0$ such that $x_k \ge c y_k$ for all $k$, and write $x_k \gg y_k$ if $\lim_{k\to\infty} y_k/x_k = 0$. In addition, we write $x_k \asymp y_k$ if $x_k \gtrsim y_k$ and $y_k \gtrsim x_k$. We use $C$ to denote a generic positive constant, which is independent of the dimensions and the sample size.

For a generic matrix $X$, we let $X^\top$, $\|X\|_{\mathrm{F}}$, $\|X\|_{\mathrm{op}}$, $\|X\|_*$, $\mathrm{vec}(X)$, and $\sigma_j(X)$ denote its transpose, Frobenius norm, operator norm, nuclear norm, vectorization, and $j$-th largest singular value, respectively. For any matrix $X \in \mathbb{R}^{m\times n}$, recall that the nuclear norm and its dual norm, the operator norm, are defined as

$$\|X\|_* = \sum_{j=1}^{\min(m,n)} \sigma_j(X) \quad \text{and} \quad \|X\|_{\mathrm{op}} = \sigma_1(X).$$

For any square matrix $X$, we let $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote its minimum and maximum eigenvalues. For any real symmetric matrices $X$ and $Y$, we write $X \le Y$ if $Y - X$ is a positive semidefinite matrix.

Matricization, also known as unfolding, is the process of reordering the elements of a third- or higher-order tensor into a matrix. The most commonly used matricization is the one-mode matricization, defined as follows. For any $d$-th-order tensor $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$, its mode-$s$ matricization $\mathcal{X}_{(s)} \in \mathbb{R}^{p_s\times p_{-s}}$, with $p_{-s} = \prod_{i=1,i\neq s}^d p_i$, is the matrix obtained by setting the $s$-th tensor mode as its rows and collapsing all the others into its columns, for $s = 1,\dots,d$. Specifically, the $(i_1,\dots,i_d)$-th element of $\mathcal{X}$ is mapped to the $(i_s, j)$-th element of $\mathcal{X}_{(s)}$, where

$$j = 1 + \sum_{k=1, k\neq s}^{d} (i_k - 1)\,J_k, \quad \text{with} \quad J_k = \prod_{\ell=1, \ell\neq s}^{k-1} p_\ell.$$

The above one-mode matricization can be extended to the multi-mode matricization by combining multiple modes into the rows and combining the rest into the columns of a matrix. For any index subset $S \subset \{1,\dots,d\}$, the multi-mode matricization $\mathcal{X}_{[S]}$ is the $(\prod_{i\in S} p_i)$-by-$(\prod_{i\notin S} p_i)$ matrix whose $(i, j)$-th element is mapped from the $(i_1,\dots,i_d)$-th element of $\mathcal{X}$, where

$$i = 1 + \sum_{k\in S} (i_k - 1)\,I_k \quad \text{and} \quad j = 1 + \sum_{k\notin S} (i_k - 1)\,J_k,$$

with $I_k = \prod_{\ell\in S, \ell<k} p_\ell$ and $J_k = \prod_{\ell\notin S, \ell<k} p_\ell$.

For a tensor $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$ and a matrix $Y \in \mathbb{R}^{q_k\times p_k}$, the mode-$k$ multiplication $\mathcal{X}\times_k Y \in \mathbb{R}^{p_1\times\cdots\times q_k\times\cdots\times p_d}$ is defined by contracting the $k$-th mode of $\mathcal{X}$ with the rows of $Y$, namely $(\mathcal{X}\times_k Y)_{i_1,\dots,j_k,\dots,i_d} = \sum_{i_k=1}^{p_k} \mathcal{X}_{i_1,\dots,i_k,\dots,i_d} Y_{j_k, i_k}$. Moreover, for tensors $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_n}$ and $\mathcal{Y} \in \mathbb{R}^{p_1\times\cdots\times p_m}$ with $m \le n$, we define the generalized tensor inner product $\langle\mathcal{X},\mathcal{Y}\rangle \in \mathbb{R}^{p_{m+1}\times\cdots\times p_n}$ by contracting the first $m$ modes,

$$\langle\mathcal{X},\mathcal{Y}\rangle_{i_{m+1},\dots,i_n} = \sum_{i_1=1}^{p_1}\cdots\sum_{i_m=1}^{p_m} \mathcal{X}_{i_1,\dots,i_m,i_{m+1},\dots,i_n}\,\mathcal{Y}_{i_1,\dots,i_m}, \quad (2)$$

which reduces to the conventional inner product when $m = n$. The following identities will be used frequently. For conformable tensors $\mathcal{X}$, $\mathcal{Z}$ and a matrix $Y \in \mathbb{R}^{q_k\times p_k}$ with $1 \le k \le m$,

$$\langle\mathcal{X}\times_k Y, \mathcal{Z}\rangle = \langle\mathcal{X}, \mathcal{Z}\times_k Y^\top\rangle. \quad (3)$$

If $Z \in \mathbb{R}^{q_{m+j}\times p_{m+j}}$ with $1 \le j \le n - m$, then

$$\langle\mathcal{X},\mathcal{Y}\rangle \times_j Z = \langle\mathcal{X}\times_{m+j} Z, \mathcal{Y}\rangle. \quad (4)$$

Moreover,

$$\mathrm{vec}(\langle\mathcal{X},\mathcal{Y}\rangle) = \mathcal{X}_{[S]}\,\mathrm{vec}(\mathcal{Y}), \quad (5)$$

where $S = \{m+1,\dots,n\}$, and when $m = n$, $\mathcal{X}_{[\varnothing]} = \mathrm{vec}(\mathcal{X})^\top$. A small NumPy sketch of these matricization operations is given below.
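The following minimal sketch (illustrative helper names, not the authors' code) implements the one-mode and multi-mode matricizations just defined; the column index follows the same column-major ordering as in the displayed formulas.

```python
import numpy as np

def unfold(X, s):
    """Mode-s matricization X_(s): mode s to rows, remaining modes collapsed
    into columns in column-major order, matching the index map above."""
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def multi_mode_matricization(X, S):
    """X_[S]: modes in S indexed over the rows, the rest over the columns."""
    cols = [i for i in range(X.ndim) if i not in S]
    n_rows = int(np.prod([X.shape[i] for i in S]))
    return np.transpose(X, list(S) + cols).reshape(n_rows, -1, order="F")

X = np.arange(24, dtype=float).reshape(2, 3, 4)    # a 2 x 3 x 4 tensor
print(unfold(X, 1).shape)                          # (3, 8)
print(multi_mode_matricization(X, [0, 2]).shape)   # (8, 3)
```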
Finally, we summarize some concepts and useful results for the Tucker decomposition (Tucker, 1966; De Lathauwer et al., 2000). For any tensor $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$, its multilinear ranks $(r_1,\dots,r_d)$ are defined as the matrix ranks of its one-mode matricizations, namely $r_i = \mathrm{rank}(\mathcal{X}_{(i)})$, for $i = 1,\dots,d$. Note that the $r_i$'s are analogous to the row and column ranks of a matrix, but are not necessarily equal for third- and higher-order tensors. Suppose that $\mathcal{X}$ has multilinear ranks $(r_1,\dots,r_d)$. Then $\mathcal{X}$ has the following Tucker decomposition:

$$\mathcal{X} = \mathcal{Y}\times_1 Y_1 \times_2 Y_2 \cdots \times_d Y_d = \mathcal{Y}\times_{i=1}^{d} Y_i, \quad (6)$$

where the $Y_i \in \mathbb{R}^{p_i\times r_i}$ for $i = 1,\dots,d$ are the factor matrices and $\mathcal{Y} \in \mathbb{R}^{r_1\times\cdots\times r_d}$ is the core tensor. Hence, the multilinear ranks are also called Tucker ranks.

If $\mathcal{X}$ has the Tucker decomposition in (6), then we have the following results for its one- and multi-mode matricizations:

$$\mathcal{X}_{(s)} = Y_s\,\mathcal{Y}_{(s)}\,(Y_d\otimes\cdots\otimes Y_{s+1}\otimes Y_{s-1}\otimes\cdots\otimes Y_1)^\top = Y_s\,\mathcal{Y}_{(s)}\,(\otimes_{i\neq s} Y_i)^\top, \quad s = 1,\dots,d,$$

and

$$\mathcal{X}_{[S]} = (\otimes_{i\in S} Y_i)\,\mathcal{Y}_{[S]}\,(\otimes_{i\notin S} Y_i)^\top, \quad S \subset \{1,\dots,d\}, \quad (7)$$

where $\otimes_{i\neq s}$, $\otimes_{i\in S}$ and $\otimes_{i\notin S}$ are matrix Kronecker products operating in reverse order within the corresponding index sets.

Note that for any nonsingular matrices $O_i \in \mathbb{R}^{r_i\times r_i}$ for $i = 1,\dots,d$, it holds that

$$\mathcal{Y}\times_{i=1}^{d} Y_i = (\mathcal{Y}\times_{i=1}^{d} O_i)\times_{i=1}^{d} (Y_i O_i^{-1}).$$

This indicates that the Tucker decomposition in (6) is not unique unless appropriate identifiability constraints are imposed. In this paper, to fix ideas, we will focus on a special Tucker decomposition called the higher-order singular value decomposition (HOSVD). In the HOSVD, the factor matrix $Y_i$ in (6) is defined as the tall orthonormal matrix consisting of the top $r_i$ left singular vectors of $\mathcal{X}_{(i)}$, for $i = 1,\dots,d$. Consequently, the core tensor $\mathcal{Y} = \mathcal{X}\times_{i=1}^{d} Y_i^\top$ has the all-orthogonal property that $\mathcal{Y}_{(i)}\mathcal{Y}_{(i)}^\top$ is a diagonal matrix for $i = 1,\dots,d$. A small numerical sketch of the HOSVD is given below.
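As a quick illustration of the HOSVD just described, the following sketch (illustrative only, with arbitrary ranks) computes the factor matrices from the one-mode matricizations and then forms the core tensor.

```python
import numpy as np

def unfold(X, s):
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def mode_multiply(X, U, mode):
    """Mode product X x_mode U, contracting the second index of U."""
    return np.moveaxis(
        np.tensordot(U, np.moveaxis(X, mode, 0), axes=(1, 0)), 0, mode
    )

def hosvd(X, ranks):
    """HOSVD: Y_i = top r_i left singular vectors of X_(i); the core
    X x_1 Y_1^T x_2 Y_2^T ... is all-orthogonal by construction."""
    factors = []
    for i, r in enumerate(ranks):
        Ui, _, _ = np.linalg.svd(unfold(X, i), full_matrices=False)
        factors.append(Ui[:, :r])
    core = X
    for i, Yi in enumerate(factors):
        core = mode_multiply(core, Yi.T, i)
    return core, factors

X = np.random.default_rng(3).standard_normal((5, 4, 3))
core, factors = hosvd(X, (2, 2, 2))
# For an exactly low-rank X, multiplying the core back by the factors
# recovers X; in general it gives the best multilinear-rank approximation.
```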
3 Low-Rank Tensor Autoregressive Model

For the tensor-valued time series $\{\mathcal{Y}_t\}_{t=1}^T$, we propose the following Low-Rank Tensor Autoregressive (LRTAR) model:

$$\mathcal{Y}_t = \langle\mathcal{A}, \mathcal{Y}_{t-1}\rangle + \mathcal{E}_t, \quad (8)$$

where $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$ is the $2d$-th-order transition tensor, which is assumed to have multilinear ranks $(r_1,\dots,r_{2d})$, with $r_i = \mathrm{rank}(\mathcal{A}_{(i)})$, $\langle\cdot,\cdot\rangle$ is the generalized tensor inner product defined in (2), and $\mathcal{E}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is the mean-zero random error at time $t$ with possible dependencies among its contemporaneous elements. Denote $S_1 = \{1,2,\dots,d\}$ and $S_2 = \{d+1,d+2,\dots,2d\}$. Note that by (5), model (8) can be written in the VAR form (1) with transition matrix $A = \mathcal{A}_{[S_2]}$.

Then we have the higher-order singular value decomposition (HOSVD) of $\mathcal{A}$,

$$\mathcal{A} = \mathcal{G}\times_{i=1}^{2d} U_i, \quad (9)$$

where $\mathcal{G} \in \mathbb{R}^{r_1\times\cdots\times r_{2d}}$ is the core tensor, and each $U_i \in \mathbb{R}^{p_i\times r_i}$ is the orthonormal factor matrix consisting of the top $r_i$ left singular vectors of $\mathcal{A}_{(i)}$, for $1 \le i \le 2d$. Thus, by (7), the VAR representation of model (8) can be written as

$$\underbrace{\mathrm{vec}(\mathcal{Y}_t)}_{y_t} = \underbrace{(\otimes_{i\in S_2} U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1} U_i)^\top}_{\mathcal{A}_{[S_2]}}\,\underbrace{\mathrm{vec}(\mathcal{Y}_{t-1})}_{y_{t-1}} + \underbrace{\mathrm{vec}(\mathcal{E}_t)}_{e_t}, \quad (10)$$

where $y_t = \mathrm{vec}(\mathcal{Y}_t)$ and $e_t = \mathrm{vec}(\mathcal{E}_t)$.

In contrast to the conventional VAR model in (1), which has $p^2$ unknown parameters, where $p = \prod_{i=1}^d p_i$, the dimension of the parameter space of model (8) is reduced substantially to

$$\prod_{i=1}^{2d} r_i + \sum_{i=1}^{d} r_i(p_i - r_i) + \sum_{i=1}^{d} r_{d+i}(p_i - r_{d+i}), \quad (11)$$

which is computed by subtracting the number of constraints due to the orthonormality of the $U_i$'s and the all-orthogonal property of $\mathcal{G}$ from the total number of parameters.

By the VAR representation in (10), we immediately have the following necessary and sufficient condition for the existence of a unique strictly stationary solution to model (8).

Assumption 1. The spectral radius of $\mathcal{A}_{[S_2]}$ is strictly less than one.

The proposed model implies an interesting low-dimensional tensor dynamical structure. To be specific, by (3), (4) and the orthonormality of the $U_i$'s, it can be shown that (8) together with (9) implies

$$\underbrace{\mathcal{Y}_t\times_{i=1}^{d} U_{d+i}^\top}_{r_{d+1}\times r_{d+2}\times\cdots\times r_{2d}} = \big\langle\mathcal{G},\; \underbrace{\mathcal{Y}_{t-1}\times_{i=1}^{d} U_i^\top}_{r_1\times r_2\times\cdots\times r_d}\big\rangle + \mathcal{E}_t\times_{i=1}^{d} U_{d+i}^\top. \quad (12)$$

Note that in (12), $\mathcal{Y}_t$ and $\mathcal{E}_t$ are both projected onto a low-dimensional space via the $U_{d+i}$'s, while $\mathcal{Y}_{t-1}$ is projected onto another low-dimensional space via the $U_i$'s, with $1 \le i \le d$. Hence, (12) is essentially a low-dimensional tensor regression defined on the projections of $\mathcal{Y}_t$ and $\mathcal{Y}_{t-1}$. Element-wise, the low-dimensional tensor $\mathcal{Y}_t\times_{i=1}^{d} U_{d+i}^\top$ can be interpreted as $\prod_{i=1}^d r_{d+i}$ multilinear response factors, $\mathcal{Y}_{t-1}\times_{i=1}^{d} U_i^\top$ as $\prod_{i=1}^d r_i$ multilinear predictor factors, and $\mathcal{E}_t\times_{i=1}^{d} U_{d+i}^\top$ as multilinear error factors. For this reason, we call $U_{d+1},\dots,U_{2d}$ the response factor matrices, and $U_1,\dots,U_d$ the predictor factor matrices.

We discuss some special cases of the proposed model below.

Example 1. For simplicity, we first consider the case with $d = 2$, so that $\mathcal{Y}_t \equiv Y_t$ and $\mathcal{E}_t \equiv E_t \in \mathbb{R}^{p_1\times p_2}$ are matrices. Then the VAR representation in (10) becomes

$$\mathrm{vec}(Y_t) = (U_4\otimes U_3)\,\mathcal{G}_{[\{3,4\}]}\,(U_2\otimes U_1)^\top\,\mathrm{vec}(Y_{t-1}) + \mathrm{vec}(E_t), \quad (13)$$

and the low-dimensional representation in (12) becomes

$$U_3^\top Y_t U_4 = \big\langle\mathcal{G},\; U_1^\top Y_{t-1} U_2\big\rangle + U_3^\top E_t U_4,$$

where $\mathcal{G} \in \mathbb{R}^{r_1\times r_2\times r_3\times r_4}$. It is interesting to compare this model with the matrix autoregressive (MAR) model in Chen et al. (2020b) and Hoff (2015), which is defined by

$$Y_t = B_1 Y_{t-1} B_2^\top + E_t, \quad (14)$$

where $B_1 \in \mathbb{R}^{p_1\times p_1}$ and $B_2 \in \mathbb{R}^{p_2\times p_2}$, and whose vector form is

$$\mathrm{vec}(Y_t) = (B_2\otimes B_1)\,\mathrm{vec}(Y_{t-1}) + \mathrm{vec}(E_t). \quad (15)$$

It can easily be seen that if $r_1 = r_3 = p_1$, $r_2 = r_4 = p_2$, $U_3 = I_{p_1}$, $U_4 = I_{p_2}$, and $\mathcal{G}_{[\{3,4\}]} = (B_2\otimes B_1)(U_2\otimes U_1)$, then (13) becomes exactly (15). Thus, the MAR model in (14) can be viewed as a special case of the proposed model without reducing the dimensions $p_i$ to $r_i$ and without transforming $\mathcal{Y}_t$; see Figure 1 for an illustration. The above comparison also applies to the general case with $d \ge 2$.
The tensor version of the MAR model is considered in Hoff (2015) and is defined as

$$\mathcal{Y}_t = \mathcal{Y}_{t-1}\times_{i=1}^{d} B_i + \mathcal{E}_t, \quad (16)$$

where $B_i \in \mathbb{R}^{p_i\times p_i}$ for $i = 1,\dots,d$. We call (16) the multilinear tensor autoregressive (MTAR) model. Note that its vector form is

$$\mathrm{vec}(\mathcal{Y}_t) = (B_d\otimes\cdots\otimes B_1)\,\mathrm{vec}(\mathcal{Y}_{t-1}) + \mathrm{vec}(\mathcal{E}_t). \quad (17)$$

Similarly, (17) is a special case of (10) with $r_i = r_{d+i} = p_i$ and $U_{d+i} = I_{p_i}$, for $i = 1,\dots,d$, and $\mathcal{G}_{[S_2]} = (\otimes_{i\in S_1} B_i)(\otimes_{i\in S_1} U_i)$. Obviously, the number of unknown parameters in the MTAR model, $\sum_{i=1}^d p_i^2$, is much larger than that of the proposed model as given in (11). Also note that Chen et al. (2020b) focuses on low-dimensional estimation and its asymptotic theory, while Hoff (2015) considers a Bayesian estimation method.

[Figure 1: Illustration of the MAR model and the proposed LRTAR model in the case of d = 2.]

Example 2. In the special case where $U_{d+i} = U_i$ and $r_{d+i} = r_i$ for $i = 1,\dots,d$, the proposed model may be understood from the perspective of dynamic factor modeling (Stock and Watson, 2011; Bai and Wang, 2016) for tensor-valued time series. Specifically, consider the following model:

$$\mathcal{Y}_t = \mathcal{F}_t\times_{i=1}^{d} U_i, \qquad \mathcal{F}_t = \langle\mathcal{G}, \mathcal{F}_{t-1}\rangle + \mathcal{H}_t, \quad (18)$$

where $\mathcal{Y}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is the observed tensor-valued time series, $\mathcal{F}_t \in \mathbb{R}^{r_1\times\cdots\times r_d}$ represents $\prod_{i=1}^d r_i$ factors, and the $U_i \in \mathbb{R}^{p_i\times r_i}$ are orthonormal matrices for $i = 1,\dots,d$. Here $\mathcal{F}_t$ follows the tensor autoregression (TAR) with transition tensor $\mathcal{G} \in \mathbb{R}^{r_1\times\cdots\times r_d\times r_1\times\cdots\times r_d}$ and random error $\mathcal{H}_t$. Note that (18) can be rewritten as

$$\mathcal{Y}_t = \big\langle\mathcal{G}\times_{i=1}^{d} U_i\times_{i=d+1}^{2d} U_i,\; \mathcal{Y}_{t-1}\big\rangle + \mathcal{H}_t\times_{i=1}^{d} U_i,$$

with the convention $U_{d+i} = U_i$. Thus, model (18) is a special case of the proposed model with $U_{d+i} = U_i$ and $r_{d+i} = r_i$ for $i = 1,\dots,d$, and $\mathcal{E}_t = \mathcal{H}_t\times_{i=1}^{d} U_i$. Chen et al. (2020c) introduces the tensor factor model in the form of $\mathcal{Y}_t = \mathcal{F}_t\times_{i=1}^{d} U_i + \mathcal{E}_t$ without an explicit modeling of the latent factors $\mathcal{F}_t$. Hence, model (18) may be regarded as a special tensor factor model with autoregressive dynamic factors, but without any random error in the model equation of $\mathcal{Y}_t$.

4 Low-Dimensional Estimation

We consider the parameter estimation for the low-dimensional case where the sample size $T$ is sufficiently large such that the dimension of the parameter space can be assumed fixed. Throughout this section, we assume that the data are generated from the proposed model in Section 3, where the true transition tensor $\mathcal{A}$ is exactly low-Tucker-rank with multilinear ranks $(r_1,\dots,r_{2d})$. This assumption will be relaxed under the high-dimensional setup in the next section.
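Before turning to estimation, here is a minimal simulation sketch for model (8) in the $d = 2$ case (an illustration, not the authors' code; all dimensions are arbitrary): it builds a random low-Tucker-rank transition matrix as in (10), rescales it so that the spectral radius condition of Assumption 1 holds, and then generates the series.

```python
import numpy as np

rng = np.random.default_rng(42)
p1, p2, r = 4, 3, 1                     # d = 2; all Tucker ranks equal to r
p = p1 * p2

# Random orthonormal factor matrices U_1,...,U_4 and core matricization G_[S2].
U = [np.linalg.qr(rng.standard_normal((pi, r)))[0] for pi in (p1, p2, p1, p2)]
G_mat = rng.standard_normal((r * r, r * r))

# Transition matrix A_[S2] = (U4 kron U3) G_[S2] (U2 kron U1)^T, as in (10).
A_mat = np.kron(U[3], U[2]) @ G_mat @ np.kron(U[1], U[0]).T

# Enforce Assumption 1: rescale so the spectral radius is below one.
A_mat *= 0.9 / max(abs(np.linalg.eigvals(A_mat)))

# Generate the vectorized series through the VAR representation (10).
T = 500
y = np.zeros((T, p))
for t in range(1, T):
    y[t] = A_mat @ y[t - 1] + rng.standard_normal(p)
Y = y.reshape(T, p2, p1).transpose(0, 2, 1)   # undo column-major vec: p1 x p2
```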
Suppose that the true ranks $(r_1,\dots,r_{2d})$ of the exactly low-rank tensor $\mathcal{A}$ are known; we relegate the rank selection to Section 5.3. Then the parameters can be estimated by

$$\hat{\mathcal{A}}_{\mathrm{LTR}} = \hat{\mathcal{G}}\times_{i=1}^{2d}\hat{U}_i = \arg\min_{\mathcal{G}, U_i} \sum_{t=1}^{T} \big\|\mathcal{Y}_t - \langle\mathcal{G}\times_{i=1}^{2d} U_i,\; \mathcal{Y}_{t-1}\rangle\big\|_{\mathrm{F}}^2. \quad (19)$$

We call $\hat{\mathcal{A}}_{\mathrm{LTR}}$ the low-Tucker-rank (LTR) estimator. Note that the minimization in (19) is unconstrained, so the Tucker decomposition of $\hat{\mathcal{A}}_{\mathrm{LTR}}$ is not unique. Indeed, there is more than one solution of $\hat{\mathcal{G}}$ and the $\hat{U}_i$'s corresponding to the same $\hat{\mathcal{A}}_{\mathrm{LTR}}$. Due to this lack of identifiability of the Tucker decomposition, standard asymptotics for maximum likelihood estimation cannot be applied directly. Nevertheless, we can still derive the asymptotic distribution of $\hat{\mathcal{A}}_{\mathrm{LTR}}$ using the asymptotic theory for overparameterized models in Shapiro (1986).

Recall that $S_1 = \{1,2,\dots,d\}$, $S_2 = \{d+1,d+2,\dots,2d\}$, $y_t = \mathrm{vec}(\mathcal{Y}_t)$, and $e_t = \mathrm{vec}(\mathcal{E}_t)$. Let $\theta = (\mathrm{vec}(\mathcal{G}_{[S_2]})^\top, \mathrm{vec}(U_1)^\top, \cdots, \mathrm{vec}(U_{2d})^\top)^\top$ be the parameter vector, and let $h = h(\theta) = \mathrm{vec}(\mathcal{A}_{[S_2]}) = \mathrm{vec}\big((\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1}U_i)^\top\big)$ be the vectorization of the transition matrix. Denote $\Sigma_y = \mathrm{var}(y_t)$, $\Sigma_e = \mathrm{var}(e_t)$, and $J = \Sigma_e^{-1}\otimes\Sigma_y$. In addition, for $i = 1,\dots,2d$, let $P^{(i)}_{[S_2]}$ be the $p^2\times p^2$ permutation matrix such that $\mathrm{vec}(\mathcal{A}_{[S_2]}) = P^{(i)}_{[S_2]}\,\mathrm{vec}(\mathcal{A}_{(i)})$. Then it can be shown that the Jacobian matrix $H := \partial h(\theta)/\partial\theta$ is given by

$$H = \Big( (\otimes_{i\in S_1}U_i)\otimes(\otimes_{i\in S_2}U_i),\; P^{(1)}_{[S_2]}\big\{\big[(\otimes_{i=1,i\neq 1}^{2d}U_i)\,\mathcal{G}_{(1)}^\top\big]\otimes I_{p_1}\big\},\; P^{(2)}_{[S_2]}\big\{\big[(\otimes_{i=1,i\neq 2}^{2d}U_i)\,\mathcal{G}_{(2)}^\top\big]\otimes I_{p_2}\big\},\;\dots,\; P^{(2d)}_{[S_2]}\big\{\big[(\otimes_{i=1,i\neq 2d}^{2d}U_i)\,\mathcal{G}_{(2d)}^\top\big]\otimes I_{p_{2d}}\big\} \Big), \quad (20)$$

where $p_{d+i} = p_i$ for $i = 1,\dots,d$.

Theorem 1. Suppose that the time series $\{\mathcal{Y}_t\}$ is generated by model (8) with $E\|e_t\|^4 < \infty$, and Assumption 1 holds. Then $\sqrt{T}\,\mathrm{vec}\big((\hat{\mathcal{A}}_{\mathrm{LTR}} - \mathcal{A})_{[S_2]}\big) \to N(0, \Sigma_{\mathrm{LTR}})$ in distribution as $T\to\infty$, where $\Sigma_{\mathrm{LTR}} = H(H^\top J H)^\dagger H^\top$, and $\dagger$ denotes the Moore–Penrose inverse.

Since the asymptotic theory for overparameterized models in Shapiro (1986) allows for unidentifiability of the components $\mathcal{G}$ and the $U_i$'s in the decomposition of $\mathcal{A}$, Theorem 1 does not require that the Tucker decomposition of $\mathcal{A}$ be unique; see the proof of Theorem 1 in Appendix A for more details.

Next we compare the result in Theorem 1 to those of two other consistent estimators of the proposed model in the low-dimensional setup. Note that the rank of $\mathcal{A}_{[S_2]}$ in (10) is at most $s := \min(\prod_{i=1}^d r_i, \prod_{i=d+1}^{2d} r_i)$. Thus, $\mathcal{A}_{[S_2]}$ can be estimated by both the reduced-rank regression (RRR) and ordinary least squares (OLS) methods,

$$\hat{\mathcal{A}}_{\mathrm{RRR}} = \arg\min_{\mathrm{rank}(\mathcal{B}_{[S_2]})\le s} \frac{1}{T}\sum_{t=1}^{T} \|\mathcal{Y}_t - \langle\mathcal{B}, \mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \quad (21)$$

and

$$\hat{\mathcal{A}}_{\mathrm{OLS}} = \arg\min_{\mathcal{B}} \frac{1}{T}\sum_{t=1}^{T} \|\mathcal{Y}_t - \langle\mathcal{B}, \mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2.$$

Naturally, under model (8), $\hat{\mathcal{A}}_{\mathrm{LTR}}$ is asymptotically more efficient than $\hat{\mathcal{A}}_{\mathrm{RRR}}$ and $\hat{\mathcal{A}}_{\mathrm{OLS}}$:

Corollary 1. If the conditions of Theorem 1 hold, then $\sqrt{T}\,\mathrm{vec}((\hat{\mathcal{A}}_{\mathrm{OLS}} - \mathcal{A})_{[S_2]}) \to N(0, \Sigma_{\mathrm{OLS}})$ and $\sqrt{T}\,\mathrm{vec}((\hat{\mathcal{A}}_{\mathrm{RRR}} - \mathcal{A})_{[S_2]}) \to N(0, \Sigma_{\mathrm{RRR}})$ in distribution as $T\to\infty$. Moreover, it holds that $\Sigma_{\mathrm{LTR}} \le \Sigma_{\mathrm{RRR}} \le \Sigma_{\mathrm{OLS}}$.

To solve the minimization problem in (19), we propose an alternating least squares (ALS) method. Specifically, by the vector representation of the proposed model in (10), the loss function in (19) can be rewritten as

$$L(\mathcal{G}, U_1, \dots, U_{2d}) = \sum_{t=1}^{T} \big\|y_t - (\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1}U_i)^\top y_{t-1}\big\|^2 = \sum_{t=1}^{T} \|y_t - m_t(\theta)\|^2,$$

where $m_t(\theta) = (\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1}U_i)^\top y_{t-1}$ is the conditional mean of $y_t = \mathrm{vec}(\mathcal{Y}_t)$ and is a function of the parameter vector $\theta = (\theta_0^\top, \theta_1^\top, \dots, \theta_{2d}^\top)^\top$, with $\theta_0 = \mathrm{vec}(\mathcal{G}_{[S_2]})$ and $\theta_i = \mathrm{vec}(U_i)$ for $i = 1,\dots,2d$.
Note that $m_t(\theta)$ is linear in each $\theta_i$ while keeping the other components $\theta_j$ with $0 \le j \neq i \le 2d$ fixed. This is because

$$m_t(\theta) = (\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,P_{k,1}\big\{\big[(\otimes_{i\in S_1,i\neq k}U_i)^\top\,(\mathcal{Y}_{t-1})_{(k)}^\top\big]\otimes I_{r_k}\big\}\,P_{k,3}\,\mathrm{vec}(U_k)$$
$$= P_{k,2}\big\{\big[(\otimes_{i\in S_2,i\neq d+k}U_i)\,(\mathcal{M}_{t-1})_{(k)}^\top\big]\otimes I_{p_k}\big\}\,\mathrm{vec}(U_{d+k})$$
$$= \big\{\big[y_{t-1}^\top(\otimes_{i\in S_1}U_i)\big]\otimes(\otimes_{i\in S_2}U_i)\big\}\,\mathrm{vec}(\mathcal{G}_{[S_2]}),$$

for $k = 1,\dots,d$, where $P_{k,1} \in \mathbb{R}^{\prod_{i=1}^d r_i\times\prod_{i=1}^d r_i}$, $P_{k,2} \in \mathbb{R}^{p\times p}$ and $P_{k,3} \in \mathbb{R}^{r_k p_k\times r_k p_k}$ are permutation matrices defined such that $P_{k,1}\mathrm{vec}(\mathcal{T}_{(k)}) = \mathrm{vec}(\mathcal{T})$ for any $\mathcal{T} \in \mathbb{R}^{r_1\times\cdots\times r_d}$, $P_{k,2}\mathrm{vec}(\mathcal{T}_{(k)}) = \mathrm{vec}(\mathcal{T})$ for any $\mathcal{T} \in \mathbb{R}^{p_1\times\cdots\times p_d}$, and $P_{k,3}\mathrm{vec}(U_k) = \mathrm{vec}(U_k^\top)$, and $\mathcal{M}_{t-1} \in \mathbb{R}^{r_{d+1}\times\cdots\times r_{2d}}$ is defined such that $\mathrm{vec}(\mathcal{M}_{t-1}) = \mathcal{G}_{[S_2]}(\otimes_{i\in S_1}U_i)^\top y_{t-1}$. Therefore, we can update each component parameter vector $\theta_i$, and hence the corresponding $\mathcal{G}$ and $U_i$'s, iteratively by the least squares method. The resulting ALS algorithm is shown in Algorithm 1.

Algorithm 1 ALS algorithm for the LTR estimator
Initialize: $\mathcal{A}^{(0)} = \hat{\mathcal{A}}_{\mathrm{RRR}}$ or $\hat{\mathcal{A}}_{\mathrm{OLS}}$; HOSVD: $\mathcal{A}^{(0)} \approx \mathcal{G}^{(0)}\times_{i=1}^{2d}U_i^{(0)}$ with multilinear ranks $(r_1,\dots,r_{2d})$
repeat ($s = 0, 1, 2, \dots$)
  for $k = 1,\dots,d$ do
    $U_k^{(s+1)} \leftarrow \arg\min_{U_k}\sum_{t=1}^T \big\|y_t - (\otimes_{i\in S_2}U_i^{(s)})\,\mathcal{G}^{(s)}_{[S_2]}\,P_{k,1}\{[(\otimes_{i\in S_1,i\neq k}U_i^{(s)})^\top(\mathcal{Y}_{t-1})_{(k)}^\top]\otimes I_{r_k}\}\,P_{k,3}\,\mathrm{vec}(U_k)\big\|^2$
  end for
  for $k = 1,\dots,d$ do
    $U_{d+k}^{(s+1)} \leftarrow \arg\min_{U_{d+k}}\sum_{t=1}^T \big\|y_t - P_{k,2}\{[(\otimes_{i\in S_2,i\neq d+k}U_i^{(s)})(\mathcal{M}^{(s+1)}_{t-1})_{(k)}^\top]\otimes I_{p_k}\}\,\mathrm{vec}(U_{d+k})\big\|^2$
  end for
  $\mathcal{G}^{(s+1)} \leftarrow \arg\min_{\mathcal{G}}\sum_{t=1}^T \big\|y_t - \{[y_{t-1}^\top(\otimes_{i\in S_1}U_i^{(s+1)})]\otimes(\otimes_{i\in S_2}U_i^{(s+1)})\}\,\mathrm{vec}(\mathcal{G}_{[S_2]})\big\|^2$
  $\mathcal{A}^{(s+1)} \leftarrow \mathcal{G}^{(s+1)}\times_{i=1}^{2d}U_i^{(s+1)}$
until convergence

As mentioned, the minimization in (19) is unconstrained. Accordingly, to obtain $\hat{\mathcal{A}}_{\mathrm{LTR}}$, no orthogonality constraint on $\mathcal{G}$ and the $U_i$'s is needed in Algorithm 1. Instead, we compute the final unique estimates of $\hat{\mathcal{G}}$ and the $\hat{U}_i$'s by the HOSVD operation based on the unconstrained solution $\hat{\mathcal{A}}_{\mathrm{LTR}}$ obtained from Algorithm 1. Specifically, we compute each $\hat{U}_i$ uniquely as the top $r_i$ left singular vectors of $(\hat{\mathcal{A}}_{\mathrm{LTR}})_{(i)}$ such that the first element in each column of $\hat{U}_i$ is positive, and set $\hat{\mathcal{G}} = \hat{\mathcal{A}}_{\mathrm{LTR}}\times_{i=1}^{2d}\hat{U}_i^\top$. Similar alternating algorithms without imposing identification constraints can be found in the literature on tensor decomposition; see, e.g., Zhou et al. (2013) and Li et al. (2018). A numerical sketch of one of these least squares updates is given below.
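The following minimal sketch illustrates only the core-tensor update of Algorithm 1 for $d = 2$, using random placeholder data and arbitrary dimensions (an illustration under these assumptions, not the authors' implementation); the full ALS loop would cycle analogous least squares updates over the factor matrices as well.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r1, r2, r3, r4 = 4, 3, 2, 2, 2, 2
T = 200
Y = rng.standard_normal((T, p1 * p2))      # vectorized series y_t (placeholder)

# Current factor matrices with orthonormal columns, as in Algorithm 1.
U1 = np.linalg.qr(rng.standard_normal((p1, r1)))[0]
U2 = np.linalg.qr(rng.standard_normal((p2, r2)))[0]
U3 = np.linalg.qr(rng.standard_normal((p1, r3)))[0]
U4 = np.linalg.qr(rng.standard_normal((p2, r4)))[0]

# Reverse-order Kronecker products over S1 = {1,2} and S2 = {3,4}.
U_pred = np.kron(U2, U1)                   # (p1 p2) x (r1 r2)
U_resp = np.kron(U4, U3)                   # (p1 p2) x (r3 r4)

# Core update: y_t = [ (y_{t-1}^T U_pred) kron U_resp ] vec(G_[S2]) + e_t,
# so stacking over t gives an ordinary least squares problem in vec(G_[S2]).
X = np.vstack([np.kron(Y[t - 1] @ U_pred, U_resp) for t in range(1, T)])
y = Y[1:].ravel()
g, *_ = np.linalg.lstsq(X, y, rcond=None)
G_mat = g.reshape(r3 * r4, r1 * r2, order="F")   # G_[S2], column-major vec
```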
5 High-Dimensional Regularized Estimation

The methods in the previous section rely on the assumptions that the dimension is fixed and that the true transition tensor is exactly low-rank with known Tucker ranks. We next relax both assumptions. Specifically, under the high-dimensional setup, we consider regularized estimation of model (8) via different nuclear-norm-type penalties and develop the corresponding non-asymptotic theory under only an approximately low-Tucker-rank assumption on the underlying true transition tensor.

5.1 SN and MN Regularized Estimation

In model (8), the exactly low-rank transition tensor $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$ is subject to the constraints $r_i = \mathrm{rank}(\mathcal{A}_{(i)})$, for $i = 1,\dots,2d$. A commonly used convex relaxation of such multilinear rank constraints is the regularization via the sum of nuclear (SN) norms of all the one-mode matricizations,

$$\|\mathcal{A}\|_{\mathrm{SN}} = \sum_{i=1}^{2d} \|\mathcal{A}_{(i)}\|_*. \quad (22)$$

The SN norm has been widely used in the literature (Gandy et al., 2011; Tomioka et al., 2011; Liu et al., 2013; Raskutti et al., 2019) to simultaneously encourage the low-rankness of all modes of a tensor. This leads us to the SN norm regularized estimator

$$\hat{\mathcal{A}}_{\mathrm{SN}} = \arg\min_{\mathcal{A}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 + \lambda_{\mathrm{SN}}\|\mathcal{A}\|_{\mathrm{SN}}\Big\},$$

where $\lambda_{\mathrm{SN}} \ge 0$ is the tuning parameter. Note that if, instead of $\|\mathcal{A}\|_{\mathrm{SN}}$, only one single nuclear norm, say $\|\mathcal{A}_{(1)}\|_*$, is penalized, then the resulting estimator will only enforce the low-rankness for the first mode of $\mathcal{A}$, while failing to do so for all the other $2d - 1$ modes. A sketch of how this penalty can be computed from the matricizations defined in Section 2 is given below.
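The SN norm in (22) is straightforward to compute once the one-mode matricizations are available; the following minimal sketch (illustrative helper names, arbitrary dimensions) repeats the `unfold` helper from Section 2 for self-containedness.

```python
import numpy as np

def unfold(X, s):
    """Mode-s matricization X_(s)."""
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def nuclear_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def sn_norm(A):
    """Sum of nuclear norms of all one-mode matricizations, as in (22)."""
    return sum(nuclear_norm(unfold(A, i)) for i in range(A.ndim))

# Toy 4th-order transition tensor (d = 2), arbitrary dimensions.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3, 4, 3))
print(sn_norm(A))
```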
To derive the non-asymptotic error bound for $\hat{\mathcal{A}}_{\mathrm{SN}}$, we make the following assumption on the random error $e_t = \mathrm{vec}(\mathcal{E}_t)$.

Assumption 2. Let $e_t = \Sigma_e^{1/2}\xi_t$, where $\{\xi_t\}$ is a sequence of i.i.d. random vectors with $E(\xi_t) = 0$ and $\mathrm{var}(\xi_t) = I_p$, and $\Sigma_e = \mathrm{var}(e_t)$ is a positive definite matrix. In addition, the entries $(\xi_{it})_{1\le i\le p}$ of $\xi_t$ are mutually independent and $\kappa$-sub-Gaussian, i.e., $E(e^{\mu\xi_{it}}) \le e^{\kappa^2\mu^2/2}$, for any $\mu \in \mathbb{R}$ and $i = 1,\dots,p$.

The sub-Gaussianity condition in Assumption 2 is milder than the commonly used normality assumption in the literature on high-dimensional stationary vector autoregressive models (Basu and Michailidis, 2015; Raskutti et al., 2019). This relaxation is made possible through establishing a novel martingale-based concentration bound in the proof of the deviation bound; see Lemma B.5 in Appendix B.4. The covariance matrix $\Sigma_e$ captures the contemporaneous dependency in $\mathcal{E}_t$, and the constant $\kappa$ controls the tail heaviness of the marginal distributions.

For any $z \in \mathbb{C}$, let $\mathbf{A}(z) = I_p - \mathcal{A}_{[S_2]}z$ be a matrix polynomial, where $\mathbb{C}$ is the set of complex numbers. Let $\mu_{\min}(\mathbf{A}) = \min_{|z|=1}\lambda_{\min}(\mathbf{A}^*(z)\mathbf{A}(z))$ and $\mu_{\max}(\mathbf{A}) = \max_{|z|=1}\lambda_{\max}(\mathbf{A}^*(z)\mathbf{A}(z))$, where $\mathbf{A}^*(z)$ is the conjugate transpose of $\mathbf{A}(z)$; see Basu and Michailidis (2015) for more discussion of the connection between the spectral density of the VAR process and these two quantities. In addition, let

$$\alpha_{\mathrm{RSC}} = \frac{\lambda_{\min}(\Sigma_e)}{\mu_{\max}(\mathbf{A})}, \quad M_1 = \frac{\lambda_{\max}(\Sigma_e)}{\mu_{\min}(\mathbf{A})}, \quad \text{and} \quad M_2 = \frac{\lambda_{\min}(\Sigma_e)/\mu_{\max}(\mathbf{A})}{\lambda_{\max}(\Sigma_e)/\mu_{\min}(\mathbf{A})}.$$

The exact low-rankness of the true transition tensor $\mathcal{A}$ assumed in Section 4 could be too stringent in real-world applications. In what follows, we relax it to the following approximately low-rank assumption: we assume that all one-mode matricizations of the underlying true transition tensor $\mathcal{A}$ belong to the set of approximately low-rank matrices, namely $\mathcal{A}_{(i)} \in \mathbb{B}_q(r_q^{(i)}; p_i, p_{-i}p)$ for some $q \in [0,1)$, where

$$\mathbb{B}_q(r; d_1, d_2) := \Big\{M \in \mathbb{R}^{d_1\times d_2} : \sum_{i=1}^{\min(d_1,d_2)}\sigma_i(M)^q \le r\Big\},$$

and $p_{-i} = p/p_i = \prod_{j=1,j\neq i}^{d} p_j$. For the convenience of notation, here we adopt the convention $0^0 = 0$. Note that when $q = 0$, $\mathbb{B}_0(r; d_1, d_2)$ is the set of $d_1$-by-$d_2$ matrices with rank at most $r$. For $q > 0$, the restriction $\sum_{i=1}^{\min(d_1,d_2)}\sigma_i(M)^q \le r$ requires that the singular values decay fast, and it is more general and natural than the exactly low-rank assumption.

Theorem 2. Under Assumptions 1 and 2, if $T \gtrsim \max_{1\le i\le 2d} p_{-i}p + \max(\kappa^2,\kappa^4)M_2^{-2}p$, $\lambda_{\mathrm{SN}} \gtrsim \kappa M_1 (2d)^{-1}\sum_{i=1}^{2d}\sqrt{p_{-i}p/T}$, and $\mathcal{A}_{(i)} \in \mathbb{B}_q(r_q^{(i)}; p_i, p_{-i}p)$ for some $q \in [0,1)$ and all $i = 1,\dots,2d$, then with probability at least $1 - \sum_{i=1}^{2d}\exp(-Cp_{-i}p) - \exp[-C\min(\kappa^{-2},\kappa^{-4})M_2^2 p]$,

$$\|\hat{\mathcal{A}}_{\mathrm{SN}} - \mathcal{A}\|_{\mathrm{F}} \lesssim \sqrt{r_q}\,\Big(\frac{2d\cdot\lambda_{\mathrm{SN}}}{\alpha_{\mathrm{RSC}}}\Big)^{1-q/2},$$

where $r_q = (2d)^{-1}\sum_{i=1}^{2d} r_q^{(i)}$.

By Theorem 2, when $\lambda_{\mathrm{SN}} \asymp \kappa M_1(2d)^{-1}\sum_{i=1}^{2d}\sqrt{p_{-i}p/T}$, the estimation error bound scales as $\sqrt{r_q}\,(\max_{1\le i\le 2d}p_{-i}p/T)^{1/2-q/4}$; note that the factor $2d$ in the error bound is canceled by the $(2d)^{-1}$ in the rate of $\lambda_{\mathrm{SN}}$. When $q = 0$, namely when $\mathcal{A}$ is exactly low-rank with Tucker ranks $(r_0^{(1)},\dots,r_0^{(2d)})$, the error bound reduces to $\sqrt{r_0\max_{1\le i\le 2d}p_{-i}p/T}$, which is comparable to that in Tomioka et al. (2011) for i.i.d. tensor regression.

However, recent research (e.g., Mu et al., 2014; Raskutti et al., 2019) has shown that the SN norm regularization approach is suboptimal, mainly because $\mathcal{A}_{(i)}$ is an unbalanced fat-and-short matricization of a higher-order tensor. Technically, in the proof of Theorem 2, an essential intermediate step is to establish the deviation bound, where we need to upper bound the operator norm of a sub-Gaussian random matrix with the same dimensions as $\mathcal{A}_{(i)}$; see Lemma B.5 in Appendix B.3 for details. Undesirably, the order of this operator norm is dominated by the larger of the row and column dimensions of the matrix $\mathcal{A}_{(i)} \in \mathbb{R}^{p_i\times p_{-i}p}$, and hence by the column dimension $p_{-i}p$, which eventually appears in the error bound. This indicates that the imbalance of the matricization leads to the efficiency bottleneck of the SN estimator.

On the other hand, similarly to (21), since the reduced-rank VAR model can be regarded as an overparameterization of the proposed LRTAR model, one may alternatively focus on the approximately low-rank structure of the transition matrix in the VAR representation (10), and adopt the matrix nuclear (MN) norm estimator (Negahban and Wainwright, 2011) to estimate $\mathcal{A}$,

$$\hat{\mathcal{A}}_{\mathrm{MN}} = \arg\min_{\mathcal{A}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 + \lambda_{\mathrm{MN}}\|\mathcal{A}_{[S_1]}\|_*\Big\}. \quad (23)$$

Note that the multi-mode matricization $\mathcal{A}_{[S_1]} = \mathcal{A}_{[S_2]}^\top$ is a $p\times p$ square matrix. Thus, the loss of efficiency due to the unbalanced matricization can be avoided, which is confirmed by the following theorem.

Theorem 3. Under Assumptions 1 and 2, if $T \gtrsim [1 + \max(\kappa^2,\kappa^4)M_2^{-2}]p$, $\lambda_{\mathrm{MN}} \gtrsim \kappa M_1\sqrt{p/T}$, and $\mathcal{A}_{[S_1]} \in \mathbb{B}_q(s_q^{(1)}; p, p)$ for some $q \in [0,1)$, then with probability at least $1 - \exp(-Cp) - \exp[-C\min(\kappa^{-2},\kappa^{-4})M_2^2 p]$,

$$\|\hat{\mathcal{A}}_{\mathrm{MN}} - \mathcal{A}\|_{\mathrm{F}} \lesssim \sqrt{s_q^{(1)}}\,\Big(\frac{\lambda_{\mathrm{MN}}}{\alpha_{\mathrm{RSC}}}\Big)^{1-q/2}.$$

Theorem 3 shows that, with $\lambda_{\mathrm{MN}} \asymp \kappa M_1\sqrt{p/T}$, the estimation error bound for $\hat{\mathcal{A}}_{\mathrm{MN}}$ scales as $\sqrt{s_q^{(1)}}\,(p/T)^{1/2-q/4}$.
This result is comparable to that in Negahban and Wainwright (2011) for reduced-rank VAR models, yet we relax both the constraint $\|A\|_{\mathrm{op}} < 1$ on the transition matrix $A$ and the normality assumption on the random error in their paper. This estimation error bound is clearly smaller than that in Theorem 2, as $(\max_{1\le i\le 2d}p_{-i}p/T)^{1/2-q/4}$ can in general be much larger than $(p/T)^{1/2-q/4}$ when $s_q^{(1)} \asymp r_q$. Therefore, adopting the square matricization can indeed improve the estimation performance.

The idea of using square matricization to improve efficiency was adopted by Mu et al. (2014) in low-rank tensor completion problems. Their proposed method, called the square deal, is to first unfold a general higher-order tensor into a matrix with similar numbers of rows and columns, and then use the MN norm as the regularizer. However, for our estimation problem, despite the advantage of $\hat{\mathcal{A}}_{\mathrm{MN}}$ over $\hat{\mathcal{A}}_{\mathrm{SN}}$, Theorem 3 reveals another drawback of $\hat{\mathcal{A}}_{\mathrm{MN}}$. That is, the error bound for $\hat{\mathcal{A}}_{\mathrm{MN}}$ depends on the $\ell_q$ radius $s_q^{(1)}$ of the singular values of $\mathcal{A}_{[S_1]}$, suggesting that $\hat{\mathcal{A}}_{\mathrm{MN}}$ may perform badly when $s_q^{(1)}$ is relatively large. In other words, unless we have prior knowledge that the $\ell_q$ "norm" of the singular values of the particular matricization $\mathcal{A}_{[S_1]}$ is truly small, $\hat{\mathcal{A}}_{\mathrm{MN}}$ may not be desirable in practice.

On the other hand, although the SN regularizer in (22) suffers from inefficiency due to the imbalance of the one-mode matricizations, it has the attractive feature of simultaneously enforcing the low-rankness across all modes of the tensor $\mathcal{A}$, and thus is more efficient than its counterpart which considers only one single one-mode matricization, say, $\|\mathcal{A}_{(1)}\|_*$. Similarly, if we can enforce the approximate low-rankness across all possible square matricizations of $\mathcal{A}$, the estimation performance may be further improved upon $\hat{\mathcal{A}}_{\mathrm{MN}}$. This motivates us to propose a new regularization approach in the next subsection.

5.2 SSN Regularized Estimation

In this subsection, we propose a novel convex regularizer which improves upon both the SN and MN regularizers in (22) and (23), respectively, by simultaneously encouraging the low-rankness across all possible square matricizations of the transition tensor $\mathcal{A}$.

For any $2d$-th-order tensor $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, its multi-mode matricization $\mathcal{A}_{[I]}$ will be a $p\times p$ square matrix, with $p = \prod_{i=1}^d p_i$, if the index set is chosen as $I = \{\ell_1,\dots,\ell_d\}$, where each index $\ell_i$ is set to either $i$ or $d+i$, for $i = 1,\dots,d$. For instance, $\mathcal{A}_{[S_1]}$ is the square matricization formed by setting $\ell_i = i$ for all $i = 1,\dots,d$. Moreover, if $\mathcal{A}$ has multilinear ranks $(r_1,\dots,r_{2d})$, then the rank of the matricization $\mathcal{A}_{[I]}$ is at most $\min(\prod_{i\in I} r_i, \prod_{i\notin I} r_i)$. Therefore, if we penalize the sum of nuclear norms of all such square matricizations, which we call the sum of square-matrix nuclear (SSN) norm for simplicity, then the resulting estimator would enjoy the efficiency gain from both the use of square matricizations and the simultaneous incorporation of many rank constraints.

Obviously, there are $2^d$ possible choices of the index set $I$ corresponding to a square matricization $\mathcal{A}_{[I]}$. However, since $\mathcal{A}_{[I]} = \mathcal{A}_{[I^\complement]}^\top$, when defining the SSN norm we only need to include one of $I$ and its complement $I^\complement$. A simple way to do so is to choose only the sets containing the index one.
That is, we fix $\ell_1 = 1$ and choose $\ell_i = i$ or $d+i$ for $i = 2,\dots,d$. This results in totally $2^{d-1}$ chosen index sets, denoted by $I_1, I_2, \dots, I_{2^{d-1}}$. Note that $I_1 = S_1 = \{1,\dots,d\}$. For example, when $d = 3$, we have four chosen index sets, $I_1 = \{1,2,3\}$, $I_2 = \{1,2,6\}$, $I_3 = \{1,5,3\}$ and $I_4 = \{1,5,6\}$.

Based on the above choice of the $2^{d-1}$ index sets, we introduce the following SSN norm,

$$\|\mathcal{A}\|_{\mathrm{SSN}} = \sum_{k=1}^{2^{d-1}} \|\mathcal{A}_{[I_k]}\|_*.$$

The corresponding estimator is defined as

$$\hat{\mathcal{A}}_{\mathrm{SSN}} = \arg\min_{\mathcal{A}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 + \lambda_{\mathrm{SSN}}\|\mathcal{A}\|_{\mathrm{SSN}}\Big\}, \quad (24)$$

where $\lambda_{\mathrm{SSN}}$ is the tuning parameter; a small sketch of the construction of the index sets $I_k$ and of the SSN norm is given at the end of this subsection.

If the one-mode matricizations have ranks $\mathrm{rank}(\mathcal{A}_{(i)}) = r_i$, each square matricization $\mathcal{A}_{[I_k]}$ is also low-rank, with $\mathrm{rank}(\mathcal{A}_{[I_k]}) \le \min(\prod_{i\in I_k} r_i, \prod_{i\notin I_k} r_i)$. Similarly, if all the $\mathcal{A}_{(i)}$'s are approximately low-rank, the square matricizations are approximately low-rank as well. In contrast to the SN norm in (22), which directly matches the multilinear ranks $\mathrm{rank}(\mathcal{A}_{(i)})$ for $i = 1,\dots,2d$, the SSN norm encourages the multilinear low-rank structure of $\mathcal{A}$ by simultaneously enforcing the low-rankness of all the square matricizations $\mathcal{A}_{[I_k]}$. The following theorem gives the theoretical results for $\hat{\mathcal{A}}_{\mathrm{SSN}}$.

Theorem 4. Under Assumptions 1 and 2, if $T \gtrsim [1 + \max(\kappa^2,\kappa^4)M_2^{-2}]p$, $\lambda_{\mathrm{SSN}} \gtrsim \kappa M_1 2^{-(d-1)}\sqrt{p/T}$, and $\mathcal{A}_{[I_k]} \in \mathbb{B}_q(s_q^{(k)}; p, p)$ for some $q \in [0,1)$ and all $k = 1,\dots,2^{d-1}$, then with probability at least $1 - \exp[-C(p - 2^d)] - \exp[-C\min(\kappa^{-2},\kappa^{-4})M_2^2 p]$,

$$\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}} \lesssim \sqrt{s_q}\,\Big(\frac{2^{d-1}\lambda_{\mathrm{SSN}}}{\alpha_{\mathrm{RSC}}}\Big)^{1-q/2},$$

where $s_q = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_q^{(k)}$.

By Theorem 4, when $\lambda_{\mathrm{SSN}} \asymp \kappa M_1 2^{-(d-1)}\sqrt{p/T}$, the estimation error bound scales as $\sqrt{s_q}\,(p/T)^{1/2-q/4}$ and reduces to $\sqrt{s_0 p/T}$ in the exactly low-rank setting with $q = 0$. For a clearer comparison among the three estimators $\hat{\mathcal{A}}_{\mathrm{SN}}$, $\hat{\mathcal{A}}_{\mathrm{MN}}$ and $\hat{\mathcal{A}}_{\mathrm{SSN}}$, we summarize the main results of Theorems 2–4 in Table 1. First, both $\hat{\mathcal{A}}_{\mathrm{SSN}}$ and $\hat{\mathcal{A}}_{\mathrm{MN}}$ have much smaller error bounds and less stringent sample size requirements than $\hat{\mathcal{A}}_{\mathrm{SN}}$, due to the diverging dimension $p_{-i}$ in the results of the latter. This reaffirms the advantage of the square matricizations.

                   SN                                                            MN                                      SSN
Sample size        $T \gtrsim (\max_{1\le i\le 2d} p_{-i} + M_2^{-2})p$          $T \gtrsim (1 + M_2^{-2})p$             $T \gtrsim (1 + M_2^{-2})p$
Estimation error   $\sqrt{r_q}(\max_{1\le i\le 2d} p_{-i}p/T)^{1/2-q/4}$         $\sqrt{s_q^{(1)}}(p/T)^{1/2-q/4}$       $\sqrt{s_q}(p/T)^{1/2-q/4}$

Table 1: Summary of the sample size conditions and error upper bounds in Theorems 2–4, where $p_{-i} = \prod_{j=1,j\neq i}^{d} p_j$, $r_q = (2d)^{-1}\sum_{i=1}^{2d} r_q^{(i)}$, and $s_q = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_q^{(k)}$.

Secondly, comparing $\hat{\mathcal{A}}_{\mathrm{SSN}}$ to $\hat{\mathcal{A}}_{\mathrm{MN}}$: since the factor $s_q$ in the error bound of $\hat{\mathcal{A}}_{\mathrm{SSN}}$ is the average of all the $s_q^{(k)}$ for $k = 1,\dots,2^{d-1}$, $\hat{\mathcal{A}}_{\mathrm{SSN}}$ can protect us from the bad scenarios where the $\ell_q$ "norm" of the singular values of $\mathcal{A}_{[S_1]}$ is relatively large. If all the $s_q^{(k)}$'s are of the same order, then the error upper bounds for $\hat{\mathcal{A}}_{\mathrm{SSN}}$ and $\hat{\mathcal{A}}_{\mathrm{MN}}$ in Table 1 will be similar. However, our simulation results in Section 6 show that $\hat{\mathcal{A}}_{\mathrm{SSN}}$ clearly outperforms $\hat{\mathcal{A}}_{\mathrm{MN}}$ under various settings, even when $s_q^{(1)} = \cdots = s_q^{(2^{d-1})}$. Indeed, the error bound for $\hat{\mathcal{A}}_{\mathrm{SSN}}$ in Theorem 4 is likely to be loose, which we believe is caused by taking upper bounds on the dual norm of the SSN norm in the proof of Lemma B.3; see Appendix B.3 for details. By contrast, the error bound for $\hat{\mathcal{A}}_{\mathrm{MN}}$ is minimax-optimal (Han et al., 2015b). Therefore, although our theoretical results are not sharp enough to distinguish clearly between the error rates of $\hat{\mathcal{A}}_{\mathrm{SSN}}$ and $\hat{\mathcal{A}}_{\mathrm{MN}}$, we conjecture that the actual rate of the former is generally smaller than that of the latter. Methodologically, this is also easy to understand because, unlike $\hat{\mathcal{A}}_{\mathrm{MN}}$, $\hat{\mathcal{A}}_{\mathrm{SSN}}$ simultaneously enforces the low-rankness across all square matricizations of $\mathcal{A}$ rather than just on $\mathcal{A}_{[S_1]}$.
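To make the construction of the $2^{d-1}$ index sets and the SSN norm concrete, here is a minimal sketch (illustrative only; dimensions are arbitrary and the helper names are hypothetical). Indices are 0-based, so the set $\{1, 2, 6\}$ above becomes `[0, 1, 5]`.

```python
import numpy as np
from itertools import product

def square_matricization(A, row_modes):
    """Multi-mode matricization A_[I]: modes in I to rows (column-major)."""
    cols = [i for i in range(A.ndim) if i not in row_modes]
    n_rows = int(np.prod([A.shape[i] for i in row_modes]))
    return np.transpose(A, list(row_modes) + cols).reshape(n_rows, -1, order="F")

def ssn_index_sets(d):
    """All 2^(d-1) index sets {l_1,...,l_d} with l_1 = 1 fixed and
    l_i in {i, d+i} for i >= 2, one set per (I, complement) pair."""
    choices = [[i, d + i] for i in range(1, d)]
    return [[0] + list(c) for c in product(*choices)]

def ssn_norm(A):
    d = A.ndim // 2
    return sum(
        np.linalg.svd(square_matricization(A, I), compute_uv=False).sum()
        for I in ssn_index_sets(d)
    )

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2, 2, 3, 2, 2))   # d = 3, so 2^(d-1) = 4 index sets
print(ssn_index_sets(3))                      # [[0,1,2], [0,1,5], [0,4,2], [0,4,5]]
print(ssn_norm(A))
```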
Remark 1. While our SSN regularization is proposed in the time series context, the idea of imposing joint penalties on all (close to) square matricizations of the coefficient tensor can potentially be extended to general higher-order tensor estimation problems. Moreover, it can be refined to accommodate particular structures of the data. For example, if some of the $d$ modes of the tensor-valued time series $\mathcal{Y}_t$, namely $p_1,\dots,p_d$, are equal, then an even greater number of possible square matricizations of $\mathcal{A}$ can be formed, resulting in improved estimation efficiency.

5.3 Truncated Regularized Estimation

While the proposed regularized estimation methods do not require exact low-rankness of the true transition tensor $\mathcal{A}$, sometimes imposing exact low-rankness may be more desirable if one wants to interpret the underlying low-dimensional tensor dynamics. As discussed in Section 3, the Tucker ranks can be interpreted as the numbers of dynamic factors in each mode. In this subsection, we propose a truncation method to consistently estimate the true multilinear ranks $(r_1,\dots,r_{2d})$ under the exact low-rank assumption.

Let $\gamma$ be a threshold parameter to be chosen properly. Given the estimator $\hat{\mathcal{A}}_{\mathrm{SSN}}$, for each $i = 1,\dots,2d$, we calculate the singular value decomposition (SVD) of the mode-$i$ matricization $(\hat{\mathcal{A}}_{\mathrm{SSN}})_{(i)}$ with the singular values arranged in descending order. Next we truncate the SVD by retaining only the singular values greater than $\gamma$, and take their corresponding left singular vectors to form the matrix $\tilde{U}_i$. Then the truncated core tensor is defined as $\tilde{\mathcal{G}} = \hat{\mathcal{A}}_{\mathrm{SSN}}\times_{i=1}^{2d}\tilde{U}_i^\top$, based on which we propose the truncated sum of square-matrix nuclear (TSSN) estimator $\hat{\mathcal{A}}_{\mathrm{TSSN}} = \tilde{\mathcal{G}}\times_{i=1}^{2d}\tilde{U}_i$; a sketch of this truncation step is given below.

To derive the theoretical results on rank selection, we make the following assumption on the exact Tucker ranks and the magnitude of the singular values.

Assumption 3. For all $i = 1,\dots,2d$, $\sigma_r(\mathcal{A}_{(i)}) = 0$ for all $r > r_i$, and there exists a constant $C > 1$ such that $\min_{1\le i\le 2d}\sigma_{r_i}(\mathcal{A}_{(i)}) \ge C\gamma$. As $T\to\infty$, the threshold parameter satisfies $\gamma \gg (\kappa M_1/\alpha_{\mathrm{RSC}})\sqrt{s_0 p/T}$, where $s_0 = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}}\mathrm{rank}(\mathcal{A}_{[I_k]})$.

Assumption 3 requires that $\mathcal{A}$ has exact Tucker ranks $(r_1,\dots,r_{2d})$ which do not diverge too fast. The smallest positive singular value of each $\mathcal{A}_{(i)}$ is assumed to be bounded away from the threshold $\gamma$ when the sample size is sufficiently large. Since Assumption 3 involves unknown quantities, it cannot be used directly for determining $\gamma$ in practice. Instead, we recommend using the data-driven threshold parameter $\gamma$ described at the end of the next subsection.
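The truncation step can be written in a few lines; the following sketch (hypothetical helper names, reusing the `unfold` convention from Section 2; not the authors' code) selects the ranks by thresholding singular values and rebuilds the TSSN estimator.

```python
import numpy as np

def unfold(X, s):
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def tucker_product(G, Us):
    """G x_1 U_1 x_2 U_2 ... : multiply mode i of G by U_i."""
    out = G
    for i, U in enumerate(Us):
        out = np.moveaxis(
            np.tensordot(U, np.moveaxis(out, i, 0), axes=(1, 0)), 0, i
        )
    return out

def tssn_truncate(A_ssn, gamma):
    """Keep, in each mode, the left singular vectors whose singular values
    exceed gamma; then form the truncated core and the TSSN estimator."""
    Us = []
    for i in range(A_ssn.ndim):
        U, s, _ = np.linalg.svd(unfold(A_ssn, i), full_matrices=False)
        r_i = max(int((s > gamma).sum()), 1)      # estimated rank of mode i
        Us.append(U[:, :r_i])
    G = tucker_product(A_ssn, [U.T for U in Us])  # core: A x_i U_i^T
    return tucker_product(G, Us), [U.shape[1] for U in Us]
```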
The rank selection consistency of the truncation method and the asymptotic estimation error rate of $\hat{\mathcal{A}}_{\mathrm{TSSN}}$ are given by the following theorem.

Theorem 5. Under the conditions of Theorem 4 and Assumption 3, if the tuning parameter $\lambda_{\mathrm{SSN}} \asymp \kappa M_1 2^{-(d-1)}\sqrt{p/T}$, then

$$P\big\{\mathrm{rank}\big((\hat{\mathcal{A}}_{\mathrm{TSSN}})_{(i)}\big) = \mathrm{rank}(\mathcal{A}_{(i)}),\ \text{for } i = 1,\dots,2d\big\} \to 1,$$

as $T\to\infty$, and for any fixed $d$,

$$\|\hat{\mathcal{A}}_{\mathrm{TSSN}} - \mathcal{A}\|_{\mathrm{F}} = O_p\big(\sqrt{s_0 p/T}\big),$$

where $s_0$ is defined as in Assumption 3.

5.4 Algorithms and Tuning Parameter Selection

This subsection presents the algorithm for the proposed (T)SSN regularized estimator. The algorithm for $\hat{\mathcal{A}}_{\mathrm{SN}}$ can be developed analogously, while $\hat{\mathcal{A}}_{\mathrm{MN}}$ can be obtained easily as in Negahban and Wainwright (2011).

The objective function for the estimator $\hat{\mathcal{A}}_{\mathrm{SSN}}$ in (24) can be rewritten as

$$L_T(\mathcal{A}) + \lambda_{\mathrm{SSN}}\|\mathcal{A}\|_{\mathrm{SSN}} = L_T(\mathcal{A}) + \lambda_{\mathrm{SSN}}\sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*, \quad (25)$$

where $L_T(\mathcal{A}) = T^{-1}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2$ is the quadratic loss function. In (25), the regularizer $\|\mathcal{A}\|_{\mathrm{SSN}}$ involves $2^{d-1}$ nuclear norms $\|\mathcal{A}_{[I_k]}\|_*$, which are challenging to handle at the same time. A similar difficulty also occurs in low-rank tensor completion, for which Gandy et al. (2011) applied the alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011) to efficiently separate the different nuclear norms. Borrowing the idea of Gandy et al. (2011), we develop an ADMM algorithm for the minimization of (25).

To separate the $2^{d-1}$ nuclear norms in $\|\mathcal{A}\|_{\mathrm{SSN}}$, for each $\mathcal{A}_{[I_k]}$ we introduce a different dummy variable $\mathcal{W}_k$ as a surrogate for $\mathcal{A}$, where $k = 1,\dots,2^{d-1}$. Then the augmented Lagrangian is

$$L(\mathcal{A}, \mathcal{W}, \mathcal{C}) = L_T(\mathcal{A}) + \sum_{k=1}^{2^{d-1}}\Big[\lambda_{\mathrm{SSN}}\|(\mathcal{W}_k)_{[I_k]}\|_* + \rho\langle\mathcal{C}_k, \mathcal{A} - \mathcal{W}_k\rangle + \frac{\rho}{2}\|\mathcal{A} - \mathcal{W}_k\|_{\mathrm{F}}^2\Big],$$

where the $\mathcal{C}_k$ are the Lagrange multipliers, for $k = 1,\dots,2^{d-1}$, and $\rho > 0$ is the regularization parameter. Then we can iteratively update $\mathcal{A}$, the $\mathcal{W}_k$ and the $\mathcal{C}_k$ by the ADMM, as shown in Algorithm 2.

Algorithm 2 ADMM algorithm for the (T)SSN estimator
Initialize: $\mathcal{C}_k^{(0)}$, $\mathcal{W}_k^{(0)} = \mathcal{A}^{(0)} = \hat{\mathcal{A}}_{\mathrm{MN}}$, for $k = 1,\dots,2^{d-1}$; threshold parameter $\gamma$
for $j \in \{0, 1, \dots, J-1\}$ do
  $\mathcal{A}^{(j+1)} \leftarrow \arg\min_{\mathcal{A}}\big\{L_T(\mathcal{A}) + \sum_{k=1}^{2^{d-1}}\frac{\rho}{2}\|\mathcal{A} - \mathcal{W}_k^{(j)} + \mathcal{C}_k^{(j)}\|_{\mathrm{F}}^2\big\}$
  for $k \in \{1, 2, \dots, 2^{d-1}\}$ do
    $\mathcal{W}_k^{(j+1)} \leftarrow \arg\min_{\mathcal{W}_k}\big\{\frac{\rho}{2}\|\mathcal{A}^{(j+1)} - \mathcal{W}_k + \mathcal{C}_k^{(j)}\|_{\mathrm{F}}^2 + \lambda_{\mathrm{SSN}}\|(\mathcal{W}_k)_{[I_k]}\|_*\big\}$
    $\mathcal{C}_k^{(j+1)} \leftarrow \mathcal{C}_k^{(j)} + \mathcal{A}^{(j+1)} - \mathcal{W}_k^{(j+1)}$
  end for
end for
$\hat{\mathcal{A}}_{\mathrm{SSN}} \leftarrow \mathcal{A}^{(J)}$
for $i \in \{1, 2, \dots, 2d\}$ do
  $\tilde{U}_i \leftarrow \mathrm{TruncatedSVD}\big((\hat{\mathcal{A}}_{\mathrm{SSN}})_{(i)}, \gamma\big)$
end for
$\tilde{\mathcal{G}} \leftarrow \hat{\mathcal{A}}_{\mathrm{SSN}}\times_{i=1}^{2d}\tilde{U}_i^\top$
$\hat{\mathcal{A}}_{\mathrm{TSSN}} \leftarrow \tilde{\mathcal{G}}\times_{i=1}^{2d}\tilde{U}_i$

In Algorithm 2, the $\mathcal{A}$-update step is an $\ell_2$-regularized least squares problem. Similarly to Gandy et al. (2011), the $\mathcal{W}_k$-update step can be solved by applying the explicit soft-thresholding operator to the singular values of $(\mathcal{A}^{(j+1)} + \mathcal{C}_k^{(j)})_{[I_k]}$. Both subproblems have closed-form solutions.
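The closed-form $\mathcal{W}_k$-update is singular value soft-thresholding applied in the $[I_k]$ square matricization; a minimal self-contained sketch follows (hypothetical helper names, not the authors' code).

```python
import numpy as np

def matricize(A, row_modes):
    """Square matricization A_[I] (Fortran order); also return the mode
    permutation so that the operation can be inverted."""
    cols = [i for i in range(A.ndim) if i not in row_modes]
    perm = list(row_modes) + cols
    n = int(np.prod([A.shape[i] for i in row_modes]))
    return np.transpose(A, perm).reshape(n, -1, order="F"), perm

def fold_back(M, perm, shape):
    """Inverse of matricize: reshape and undo the mode permutation."""
    X = M.reshape([shape[i] for i in perm], order="F")
    return np.transpose(X, np.argsort(perm))

def svt(M, tau):
    """Singular value soft-thresholding, the prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def w_update(A_new, C_k, row_modes, lam, rho):
    """W_k-update of Algorithm 2: soft-threshold the singular values of
    (A^(j+1) + C_k^(j)) in the [I_k] square matricization."""
    M, perm = matricize(A_new + C_k, row_modes)
    return fold_back(svt(M, lam / rho), perm, (A_new + C_k).shape)
```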
Thus, the minimization of (25) can be solved efficiently.

For the tuning parameter selection, since the cross-validation method is unsuitable for time series or other intrinsically ordered data, we apply the Bayesian information criterion (BIC) to select the optimal $\lambda_{\mathrm{SSN}}$ from a sequence of tuning parameter values, where the number of degrees of freedom is defined as $2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_k(2p - s_k)$, with $s_k$ the retained rank of the $k$-th square matricization. For the threshold parameter $\gamma$ of the truncated estimator, we recommend a data-driven choice proportional to $2^{d-1}\hat{\lambda}_{\mathrm{SSN}}$, where $\hat{\lambda}_{\mathrm{SSN}}$ is the optimal tuning parameter selected by the BIC.

6 Simulation Studies

We conduct three simulation experiments to examine the finite-sample performance of the proposed low- and high-dimensional estimation methods for the LRTAR model in the previous sections. Throughout, we generate the data from model (8) with $\mathrm{vec}(\mathcal{E}_t) \overset{\text{i.i.d.}}{\sim} N(0, I_p)$. The entries of the core tensor $\mathcal{G}$ are generated independently from $N(0,1)$ and rescaled such that $\|\mathcal{G}\|_{\mathrm{F}} = 5$. The factor matrices $U_i$ are generated by extracting the leading singular vectors of Gaussian random matrices while ensuring the stationarity condition in Assumption 1.

The first experiment focuses on the proposed low-dimensional estimation method considered in Section 4. We consider four cases of data generating processes. In cases (1a) and (1b), we consider $d = 2$ with fixed small multilinear ranks $(r_1, r_2, r_3, r_4)$; in cases (2a) and (2b), we consider $d = 3$ with fixed small multilinear ranks $(r_1, \dots, r_6)$. The cases differ in the dimensions $p_i$: (1a) $p_1 = p_2 = 5$; (1b) $p_1 = p_2 = 10$; (2a) $p_1 = p_2 = p_3 = 5$; and (2b) $p_1 = p_2 = p_3 = 7$. We repeat each data generating process 300 times, and conduct the estimation using the true multilinear ranks. Figure 2 displays the average estimation error against the sample size $T$ (from 1000 onward).

The second experiment examines the error rate of the proposed SSN estimator. When $\mathcal{A}$ is exactly low-rank, we have $\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}}^2 = O_p(s_0 p/T)$, where $s_0 = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_0^{(k)}$ with $s_0^{(k)} = \mathrm{rank}(\mathcal{A}_{[I_k]})$. Note that this rate depends on the overall dimension $p = \prod_{i=1}^d p_i$, but is independent of the individual dimensions $p_i$. To examine the relationship between the estimation error and the overall dimension $p$, sample size $T$, multilinear ranks $r_i$ and individual dimensions $p_i$, we generate the data under the eight settings listed in Table 2, and plot $\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}}^2$ against the varying parameter in each case in Figure 3, including $p$, $1/T$, $s_0$ and different settings of the $p_i$ under a fixed overall dimension $p$. The first three columns of Figure 3 show that $\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}}^2$ roughly scales linearly in $p$, $1/T$ and $s_0$, while the last column suggests that the estimation error is invariant under different settings of the individual $p_i$ as long as $p$ and the other parameters are fixed. This lends support to the theoretical error bound for the SSN estimator.

In the third experiment, we compare the performance of all four high-dimensional estimators discussed in Section 5, namely the SN, MN, SSN and TSSN estimators. We consider four cases of data generating processes. In cases (3a) and (3b), we consider $d = 2$ with fixed small multilinear ranks $(r_1, r_2, r_3, r_4)$; in cases (4a) and (4b), we consider $d = 3$ with fixed small multilinear ranks $(r_1, \dots, r_6)$. The cases differ in the dimensions $p_i$: (3a) $p_1 = p_2 = 5$; (3b) $p_1 = p_2 = 10$; (4a) $p_1 = p_2 = p_3 = 5$; and (4b) $p_1 = p_2 = p_3 = 7$. For each setting, we repeat the simulation 300 times, and conduct the estimation using SN, MN, SSN and TSSN. The tuning parameter and the truncation threshold parameter are selected by the method described in Section 5.4.
In Figure 4, the average estimation error is plotted against $T\in[400,\dots]$ for the cases with $d = 2$ and against $T\in[600,\dots]$ for those with $d = 3$. In addition, the TSSN estimator generally performs better than the SSN, probably because the former yields a more parsimonious model which further improves the estimation efficiency.

7 Real Data Analysis

We first consider the modeling of a matrix-valued time series. The data is the monthly market-adjusted return series of the Fama-French $10\times10$ portfolios from January 1979 to December 2019, obtained from French (2020). The portfolios are constructed as the intersections of 10 portfolios formed by the book-to-market (B/M) ratio and 10 portfolios formed by the size (market value of equity). Hence, $d = 2$, $p_1 = p_2 = 10$ and $T = 492$ months. We remove the market effect for each portfolio by subtracting its average return from the original return series.

Let $\mathcal{Y}_t\in\mathbb{R}^{10\times10}$ be the observed matrix-valued time series, with its rows and columns corresponding to different B/M ratios (sorted from lowest to highest) and sizes (sorted from smallest to largest), respectively, and denote $y_t = \mathrm{vec}(\mathcal{Y}_t)$. For comparison, we consider the following five candidate models:

- Vector autoregression (VAR): $y_t = Ay_{t-1} + e_t$, where $A\in\mathbb{R}^{100\times100}$. The model is estimated by the least squares method.

- Vector factor model (VFM): $y_t = \Lambda f_t + e_t$, where $f_t$ is the low-dimensional vector-valued latent factor, and $\Lambda$ is the loading matrix. The model is estimated by the method in Lam et al. (2012), and for prediction, the estimated factors $\widehat{f}_t$ are then fitted by a VAR(1) model.

- Matrix autoregression (MAR): $\mathcal{Y}_t = B_1\mathcal{Y}_{t-1}B_2^\top + \mathcal{E}_t$, where $B_1, B_2\in\mathbb{R}^{10\times10}$ are coefficient matrices; see Example 1. The model is estimated by the iterated least squares method in Chen et al. (2020b).

- Matrix factor model (MFM): $\mathcal{Y}_t = RF_tC^\top + \mathcal{E}_t$, where $F_t$ is the low-dimensional matrix-valued latent factor, and $R$ and $C$ are the loading matrices. The model is estimated by the method in Wang et al. (2019), and for prediction, the estimated factors $\widehat{F}_t$ are then fitted by a VAR(1) model.

- The proposed LRTAR model: $\mathcal{Y}_t = \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle + \mathcal{E}_t$, with $\mathcal{A} = \mathcal{G}\times_{i=1}^{4}U_i$. The model is estimated using the SSN and TSSN regularized estimation methods in Section 5.

We first focus on the results for the proposed LRTAR model. By the proposed truncation method, we obtain the estimated multilinear ranks $(\widehat{r}_1,\widehat{r}_2,\widehat{r}_3,\widehat{r}_4) = (8,\dots,2,2)$, so that the fitted model implies the dynamic factor structure
$$U_3^\top\mathcal{Y}_tU_4 = \big\langle\mathcal{G}, U_1^\top\mathcal{Y}_{t-1}U_2\big\rangle + U_3^\top\mathcal{E}_tU_4.$$
Thus, $U_1$ and $U_3$ can be viewed as the factor loadings across different B/M ratios, while $U_2$ and $U_4$ represent those across different sizes. By the factor interpretation below (12), this result indicates that the information in $\mathcal{Y}_t$ can be effectively summarized into the $2\times2$ factor matrix $U_3^\top\mathcal{Y}_tU_4$, while the low-rank structure associated with the predictor $\mathcal{Y}_{t-1}$ is not strong. Moreover, the estimated multilinear ranks suggest that the MAR model might be inefficient for the data, since the MAR imposes that $r_1 = r_2 = r_3 = r_4 = 10$ and $U_i = I_{10}$ for all $i$; see Example 1. In addition, by an argument similar to the discussion in Example 2, the MFM can be regarded as a special case of the proposed model with $(U_3,U_4) = (U_1,U_2)$ and $(r_3,r_4) = (r_1,r_2)$. Thus, the fact that $\widehat{r}_3$ and $\widehat{r}_4$ are much smaller than $\widehat{r}_1$ and $\widehat{r}_2$ suggests that the MFM may be too restrictive for the data. The TSSN estimates of the factor matrices are shown in Figure 5.
The patterns of the estimated response factor matrices $\widetilde{U}_3, \widetilde{U}_4\in\mathbb{R}^{10\times2}$ are particularly interesting. First, for both $\widetilde{U}_3$ and $\widetilde{U}_4$, the uniform pattern of the first column indicates that portfolios across different B/M ratios (or sizes) contribute approximately equally to the first B/M (or size) response factor. Thus, this factor represents the component of the market performance which is invariant to the size and B/M ratio of the portfolios. The significance of this factor may be partially because we fit the model to market-adjusted returns, where the average return is subtracted for all portfolios. Meanwhile, for both response factor matrices, the second column has a smoothly increasing pattern, suggesting that part of the return variation in the market has a monotonic relationship with the size and B/M ratio. Moreover, the above interpretations are consistent with the famous Fama-French three-factor model, where the return of a portfolio is expected to be affected by the market premium, the outperformance of small versus big companies, and the outperformance of high-B/M versus low-B/M companies.

The performance of the five models is compared through both average in-sample and out-of-sample forecasting errors. The average in-sample forecasting error is calculated based on the fitted models for the entire data, while the average out-of-sample forecasting error is calculated based on the following rolling forecast procedure. From January 2016 ($t = 445$) to December 2019 ($t = 492$), we fit the models using all the available data up to time $t-1$ and compute the one-step-ahead forecast $\widehat{\mathcal{Y}}_t$. Then, we obtain the average of the rolling forecasting errors for this period.

The average forecasting errors in $\ell_1$, $\ell_2$ and $\ell_\infty$ norms are presented in Table 3. Firstly, the VAR model has the smallest in-sample forecasting error among all models, which is as expected because the VAR model is highly overparameterized. The poor in-sample performance of the MFM agrees with our discussion about its restrictiveness for the data due to the mismatch of the multilinear ranks. It is worth noting that the proposed LRTAR model has competitive in-sample forecasting performance among all models.

The out-of-sample forecasting results provide a fuller picture of the efficiency of the different methods. It can be seen that the VAR and VFM models perform worst among all, as they both completely ignore the matrix structure of the data. Notably, the proposed LRTAR model, fitted by either the SSN or the TSSN method, has the smallest out-of-sample forecasting errors. This suggests that the proposed LRTAR model can indeed efficiently capture the dynamic structural information in the data.
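The rolling forecast comparison above is straightforward to reproduce for any of the candidate models. A minimal Python sketch, with our own function names, is given below; fit and predict are supplied by the user, and estimate_lrtar in the comment is a hypothetical stand-in for the SSN/TSSN fit of Section 5.

import numpy as np

def rolling_forecast_errors(Y, fit, predict, t_start, t_end):
    # Y: array of shape (T, p_1, ..., p_d); fit(Y_past) returns a fitted model;
    # predict(model, Y_last) returns a forecast of the next observation.
    errs_l1, errs_l2 = [], []
    for t in range(t_start, t_end + 1):
        model = fit(Y[:t])                   # all available data up to time t-1
        Y_hat = predict(model, Y[t - 1])     # one-step-ahead forecast of Y_t
        diff = (Y[t] - Y_hat).ravel()
        errs_l1.append(np.sum(np.abs(diff)))
        errs_l2.append(np.sqrt(np.sum(diff ** 2)))
    return np.mean(errs_l1), np.mean(errs_l2)

# For the LRTAR model, the forecast contracts the fitted transition tensor with
# the last observation, since Y_hat_t = <A_hat, Y_{t-1}>; for example (with the
# predictor modes stored last, as in our earlier sketches):
# fit = lambda Y_past: estimate_lrtar(Y_past)        # hypothetical SSN/TSSN fit
# predict = lambda A, Y_last: np.tensordot(A, Y_last, axes=Y_last.ndim)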
Fama and French (2015) extended the classical Fama-French three-factor model to a five-factor model which further incorporates the effects of operating profitability (OP) and investment (Inv). This motivates tensor-valued stock return data formed according to the size (small and big), the B/M ratio (four groups sorted from lowest to highest), the OP (four groups sorted from lowest to highest), and the Inv (four groups sorted from lowest to highest). We consider two datasets retrieved from French (2020). The first dataset consists of the monthly market-adjusted return series of the $4\times4\times2$ OP-BM-Size portfolios, and the second consists of those of the $4\times4\times2$ Inv-BM-Size portfolios. Hence, $d = 3$, $p_1 = p_2 = 4$, $p_3 = 2$ and $T = 678$ months. Similar to the analysis in the previous subsection, five candidate models are considered:

- Vector autoregression (VAR): $y_t = Ay_{t-1} + e_t$, where $A\in\mathbb{R}^{32\times32}$.

- Vector factor model (VFM): $y_t = \Lambda f_t + e_t$, where $f_t$ is the low-dimensional vector-valued latent factor, and $\Lambda$ is the loading matrix.

- Multilinear tensor autoregression (MTAR): $\mathcal{Y}_t = \mathcal{Y}_{t-1}\times_{i=1}^{3}B_i + \mathcal{E}_t$, where $B_1, B_2\in\mathbb{R}^{4\times4}$ and $B_3\in\mathbb{R}^{2\times2}$ are coefficient matrices; see Example 1.

- Tensor factor model (TFM): $\mathcal{Y}_t = \mathcal{F}_t\times_{i=1}^{3}U_i + \mathcal{E}_t$, where $\mathcal{F}_t$ is the low-dimensional tensor-valued latent factor, and the $U_i$'s are the loading matrices.

- The proposed LRTAR model: $\mathcal{Y}_t = \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle + \mathcal{E}_t$, with $\mathcal{A} = \mathcal{G}\times_{i=1}^{6}U_i$.

The TFM is estimated by the method in Chen et al. (2020c), and for prediction, the estimated factors $\widehat{\mathcal{F}}_t$ are fitted by a VAR(1) model. The other four models are fitted by the same methods as in the previous subsection. The multilinear ranks selected by the truncation method are $(\widehat{r}_1,\dots,\widehat{r}_6) = (3,\dots,1)$ and $(2,\dots,1)$ for the OP-BM-Size portfolio return series and the Inv-BM-Size portfolio return series, respectively. Note that, similar to the BM-Size $10\times10$ series in the previous subsection, the low-rank structure for the response in these two tensor-valued datasets is more evident than that for the predictor. Figure 6 shows the TSSN estimates of the factor matrices. We find that the estimated response factors have a uniform pattern similar to that in Figure 5 for the matrix-valued data. The fact that only one response factor is extracted in each direction suggests that there might not be a substantial effect of OP, Inv, B/M ratio or size on the returns for these datasets.

Finally, using the same methods as in the previous subsection, we calculate the average in-sample and out-of-sample forecasting errors for both datasets fitted with the five models. As shown in Table 4, the comparison results for the two datasets are quite similar. It can be clearly observed that the VAR model always has the smallest in-sample forecasting error, yet the largest out-of-sample forecasting error. On the contrary, the proposed LRTAR model, fitted by either the SSN or the TSSN method, has the smallest out-of-sample forecasting error. Moreover, the in-sample forecasting performance of the LRTAR model is competitive even compared to the VAR model. Similarly to the MFM in the previous subsection, the TFM has poor in-sample performance, possibly due to the discrepancy between $(r_1,r_2,r_3)$ and $(r_4,r_5,r_6)$, as reflected by the estimated multilinear ranks. In sum, the results support the efficiency and flexibility of the proposed model and estimation methods for tensor-valued time series data.

8 Conclusion and Discussion

Efficient modeling and forecasting of general structured (tensor-valued), high-dimensional time series data is an important research topic which, however, has been rarely explored in the literature so far. This paper makes the first thorough attempt to address this problem by introducing the low-rank tensor autoregressive model.
By assuming an exactly or approximately low-Tucker-rank structure for the transition tensor, the model exploits the low-dimensional tensor dynamic structure of the high-dimensional time series data, and summarizes the complex temporal dependencies into interpretable dynamic factors.

Asymptotic and non-asymptotic properties are derived for the proposed low- and high-dimensional estimators, respectively. For the latter, we relax the conventional Gaussian assumption in the high-dimensional time series literature to sub-Gaussianity via a new martingale-based concentration technique. Moreover, based on the special structure of the transition tensor, a novel convex regularizer, the SSN, is proposed, gaining efficiency from both the square matricization and the simultaneous penalization across modes. A truncation method, the TSSN, is further introduced to consistently select the multilinear ranks and enhance model interpretability.

We discuss several directions for future research. First, the proposed estimators cannot adapt to the contemporaneous correlation among elements of the random error $\mathcal{E}_t$, which may lead to efficiency loss. This issue can be addressed by considering the generalized least squares loss function, $\mathcal{L}_T(\mathcal{A};\Sigma_e) = T^{-1}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_t-\langle\mathcal{A},\mathcal{Y}_{t-1}\rangle)^\top\Sigma_e^{-1}\mathrm{vec}(\mathcal{Y}_t-\langle\mathcal{A},\mathcal{Y}_{t-1}\rangle)$, where $\Sigma_e = \mathrm{var}(\mathrm{vec}(\mathcal{E}_t))$. Then $\mathcal{A}$ may be estimated jointly with $\Sigma_e$ or by a two-step approach based on a consistent estimator $\widehat{\Sigma}_e$ (Basu and Michailidis, 2015; Davis et al., 2016).

Secondly, the proposed methods can be generalized to the LRTAR model of finite lag order $L$, defined as $\mathcal{Y}_t = \langle\mathcal{A}_1,\mathcal{Y}_{t-1}\rangle + \cdots + \langle\mathcal{A}_L,\mathcal{Y}_{t-L}\rangle + \mathcal{E}_t$, where $\mathcal{A}_1,\dots,\mathcal{A}_L$ are $2d$-th-order multilinear low-rank coefficient tensors. Then, one may consider the SSN regularized estimation by minimizing $T^{-1}\sum_{t=1}^{T}\|\mathcal{Y}_t-\sum_{j=1}^{L}\langle\mathcal{A}_j,\mathcal{Y}_{t-j}\rangle\|_{\mathrm{F}}^2 + \sum_{j=1}^{L}\lambda_j\|\mathcal{A}_j\|_{\mathrm{SSN}}$. Alternatively, the $\mathcal{A}_j$'s can be combined into a $(2d+1)$-th-order tensor $\mathcal{A}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d\times L}$ whose mode-$(2d+1)$ matricization is $\mathcal{A}_{(2d+1)} = (\mathrm{vec}(\mathcal{A}_1),\dots,\mathrm{vec}(\mathcal{A}_L))^\top$; see Wang et al. (2020) for a similar idea. Even though $\mathcal{A}$ may not have exactly square matricizations for $L > 1$, the proposed SSN can still be adapted by employing approximately square matricizations. For instance, consider the $pL\times p$ multi-mode matricizations $\mathcal{A}_{[I_k\cup\{2d+1\}]}$, where the index sets $I_k$ for $k = 1,\dots,2^{d-1}$ are defined as in this paper. Then, a generalized SSN norm can be constructed as $\|\mathcal{A}\|_{\mathrm{GSSN}} = \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k\cup\{2d+1\}]}\|_*$, which will reduce to $\|\mathcal{A}\|_{\mathrm{SSN}}$ when $L = 1$.

Thirdly, the LRTAR model can be readily extended to incorporate exogenous tensor-valued predictors, giving rise to the LRTAR-X model, $\mathcal{Y}_t = \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle + \langle\mathcal{B},\mathcal{X}_t\rangle + \mathcal{E}_t$, where $\mathcal{X}_t$ is an $m$-th-order tensor of exogenous variables, and $\mathcal{B}$ is a $(d+m)$-th-order coefficient tensor.
When the dimension of $\mathcal{X}_t$ is high, a low-dimensional structure, such as sparsity, group sparsity or low-rankness, can be imposed on $\mathcal{B}$ to improve the estimation efficiency.

Moreover, an open question for matrix- or tensor-valued time series models is how to incorporate additional structures or constraints; for instance, transport or trade flow data have unspecified diagonal entries (Chen and Chen, 2019), and realized covariance matrix data are subject to positive definiteness constraints. Beyond the time series context, it is also worth investigating the generalization of the SSN regularization method to higher-order tensor estimation and completion applications, such as neuroimaging analysis (Li et al., 2018), recommender systems (Bi et al., 2018) and natural language processing (Frandsen and Ge, 2019).

Appendix A: Proofs for Low-Dimensional Estimation

Below we give the proofs of Theorem 1 and Corollary 1 in Section 4, which generally follow from Proposition 4.1 in Shapiro (1986) for overparameterized models.

Proof of Theorem 1. The proposed model in (10) can be written in the matrix form
$$\underbrace{(y_1, y_2, \dots, y_T)^\top}_{\mathbb{Y}} = \underbrace{(y_0, y_1, \dots, y_{T-1})^\top}_{\mathbb{X}}\ \mathcal{A}_{[S]}^\top + \underbrace{(e_1, e_2, \dots, e_T)^\top}_{\mathbb{E}}.$$
Let $\widehat{h}_{\mathrm{OLS}} = \mathrm{vec}((\widehat{\mathcal{A}}_{\mathrm{OLS}})_{[S]})$. Under Assumption 1, by the classical asymptotic theory for stationary VAR models, as $T\to\infty$, we have
$$\sqrt{T}\,(\widehat{h}_{\mathrm{OLS}} - h)\overset{d}{\to}N(\mathbf{0},\Sigma_{\mathrm{OLS}}),$$
where $\Sigma_{\mathrm{OLS}} = \Sigma_e\otimes\Sigma_y^{-1}$, with $\Sigma_e = \mathrm{var}(e_t)$ and $\Sigma_y = \mathrm{var}(y_t)$.

Following Shapiro (1986), consider the discrepancy function
$$F(\widehat{h}_{\mathrm{OLS}}, h) = \|\mathrm{vec}(\mathbb{Y}) - (I_p\otimes\mathbb{X})h\|_2^2 - \|\mathrm{vec}(\mathbb{Y}) - (I_p\otimes\mathbb{X})\widehat{h}_{\mathrm{OLS}}\|_2^2.$$
Note that $F(\widehat{h}_{\mathrm{OLS}}, h)$ is nonnegative and twice continuously differentiable, and equals zero if and only if $\widehat{h}_{\mathrm{OLS}} = h$.

The Jacobian matrix $H = \partial h(\theta)/\partial\theta$ in (20) can be verified by noting that, with $S^c = \{d+1,\dots,2d\}$,
$$h = \mathrm{vec}(\mathcal{A}_{[S]}) = \mathrm{vec}\big((\otimes_{i\in S}U_i)\,\mathcal{G}_{[S]}\,(\otimes_{i\in S^c}U_i)^\top\big) = \big((\otimes_{i\in S^c}U_i)\otimes(\otimes_{i\in S}U_i)\big)\,\mathrm{vec}(\mathcal{G}_{[S]}),$$
and that, for any $1\le i\le 2d$,
$$h = \mathrm{vec}(\mathcal{A}_{[S]}) = P^{(i)}_{[S]}\,\mathrm{vec}(\mathcal{A}_{(i)}) = P^{(i)}_{[S]}\,\mathrm{vec}\big(U_i\,\mathcal{G}_{(i)}\,(\otimes_{j=1,j\neq i}^{2d}U_j)^\top\big) = P^{(i)}_{[S]}\big\{\big[(\otimes_{j=1,j\neq i}^{2d}U_j)\,\mathcal{G}_{(i)}^\top\big]\otimes I_{p_i}\big\}\,\mathrm{vec}(U_i).$$
Then, by Proposition 4.1 in Shapiro (1986), we obtain that the minimizer of $F(\widehat{h}_{\mathrm{OLS}},\cdot)$, namely the LTR estimator, has the asymptotic normality $\sqrt{T}(\widehat{h}_{\mathrm{LTR}} - h)\overset{d}{\to}N(\mathbf{0},\Sigma_{\mathrm{LTR}})$ and $\Sigma_{\mathrm{LTR}} = P\Sigma_{\mathrm{OLS}}P^\top$, where $P = H(H^\top JH)^\dagger H^\top J$ is the projection matrix, $J = \Sigma_e^{-1}\otimes\Sigma_y$ is the Fisher information matrix of $h$, and $\dagger$ denotes the Moore-Penrose inverse. Since $\Sigma_{\mathrm{OLS}} = J^{-1}$, we can obtain that $\Sigma_{\mathrm{LTR}} = H(H^\top JH)^\dagger H^\top$.

Proof of Corollary 1. As discussed in the proof of Theorem 1, $\Sigma_{\mathrm{OLS}} = J^{-1}$, and observe that
$$\Sigma_{\mathrm{LTR}} = H(H^\top JH)^\dagger H^\top = J^{-1/2}\,Q_{J^{1/2}H}\,J^{-1/2}, \qquad (\mathrm{A.1})$$
where $Q_{J^{1/2}H}$ is the projection matrix onto $\mathrm{span}(J^{1/2}H)$. On the other hand, under the proposed model, the transition matrix can be decomposed as $\mathcal{A}_{[S]} = V_1V_2^\top$, where $V_1 = (\otimes_{i\in S}U_i)\,\mathcal{G}_{[S]}$, $V_2 = \otimes_{i\in S^c}U_i$, and $\mathrm{rank}(\mathcal{A}_{[S]})\le\bar{s} = \min\big(\prod_{i=1}^{d}r_i, \prod_{i=d+1}^{2d}r_i\big)$.
Thus, we can write $h = \mathrm{vec}(\mathcal{A}_{[S]}) = h(\phi)$, where $\phi = (\mathrm{vec}(V_1)^\top, \mathrm{vec}(V_2)^\top)^\top$ is the parameter vector for the RRR estimator. Then, similarly to the proof of Theorem 1, we denote the Jacobian matrix by $R = \partial h/\partial\phi$. By similar arguments, we have $\sqrt{T}(\widehat{h}_{\mathrm{RRR}} - h)\overset{d}{\to}N(\mathbf{0},\Sigma_{\mathrm{RRR}})$, where $\widehat{h}_{\mathrm{RRR}} = \mathrm{vec}((\widehat{\mathcal{A}}_{\mathrm{RRR}})_{[S]})$ and
$$\Sigma_{\mathrm{RRR}} = R(R^\top JR)^\dagger R^\top = J^{-1/2}\,Q_{J^{1/2}R}\,J^{-1/2}, \qquad (\mathrm{A.2})$$
where $Q_{J^{1/2}R}$ is the projection matrix onto $\mathrm{span}(J^{1/2}R)$. Hence, it is clear that $\Sigma_{\mathrm{RRR}}\preceq J^{-1} = \Sigma_{\mathrm{OLS}}$.

Moreover, since the Tucker decomposition can be viewed as a further decomposition of the low-rank decomposition $V_1V_2^\top$ of $\mathcal{A}_{[S]}$, we have $H = \partial h/\partial\theta = R\cdot\partial\phi/\partial\theta$. By (A.1) and (A.2), we have $\Sigma_{\mathrm{LTR}}\preceq\Sigma_{\mathrm{RRR}}$, since $\mathrm{span}(J^{1/2}H)\subset\mathrm{span}(J^{1/2}R)$.

Appendix B: Proofs for High-Dimensional Estimation

In this appendix, we provide the proofs of Theorems 2-5 in Section 5. We start with a preliminary analysis in Appendix B.1, which lays out the common technical framework for proving the estimation and prediction error bounds of the SN, MN and SSN regularized estimators; four lemmas, Lemmas B.1-B.4, are introduced therein. Then, in Appendix B.2, we give the proofs of Theorems 2-5. The proofs of Lemmas B.1-B.4 are provided in Appendix B.3, and three auxiliary lemmas are collected in Appendix B.4.

B.1 Preliminary Analysis

The technical framework for proving the error bounds in Theorems 2-4 consists of two main steps, a deterministic analysis and a stochastic analysis, given in Sections B.1.1 and B.1.2, respectively. The goal of the first is to derive the error bounds given a deterministic realization of the time series, assuming that the parameters satisfy certain regularity conditions. The goal of the second is to verify that, under stochasticity, these regularity conditions are satisfied with high probability.

B.1.1 Deterministic Analysis

Throughout the appendix, we adopt the following notation. We use $C$ to denote a generic positive constant, which is independent of the dimensions and the sample size. For any matrix $M$ and a compatible subspace $\mathbb{S}$, we denote by $M_{\mathbb{S}}$ the projection of $M$ onto $\mathbb{S}$. In addition, let $\mathrm{col}(M)$ be the column space of $M$, and let $\mathbb{S}^\perp$ be the orthogonal complement of the subspace $\mathbb{S}$. For a generic tensor $\mathcal{W}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, the dual norms of its SSN norm and SN norm, denoted by $\|\mathcal{W}\|_{\mathrm{SSN}*}$ and $\|\mathcal{W}\|_{\mathrm{SN}*}$, respectively, are defined as
$$\|\mathcal{W}\|_{\mathrm{SSN}*} = \sup_{\|\mathcal{T}\|_{\mathrm{SSN}}\le1}\langle\mathcal{W},\mathcal{T}\rangle \quad\text{and}\quad \|\mathcal{W}\|_{\mathrm{SN}*} = \sup_{\|\mathcal{T}\|_{\mathrm{SN}}\le1}\langle\mathcal{W},\mathcal{T}\rangle.$$
Moreover, for any two tensors $\mathcal{X}\in\mathbb{R}^{p_1\times\cdots\times p_m}$ and $\mathcal{Y}\in\mathbb{R}^{p_{m+1}\times\cdots\times p_{m+n}}$, their tensor outer product $\mathcal{X}\circ\mathcal{Y}\in\mathbb{R}^{p_1\times\cdots\times p_m\times p_{m+1}\times\cdots\times p_{m+n}}$ is defined by $(\mathcal{X}\circ\mathcal{Y})_{i_1\dots i_mi_{m+1}\dots i_{m+n}} = \mathcal{X}_{i_1\dots i_m}\mathcal{Y}_{i_{m+1}\dots i_{m+n}}$, for any $1\le i_1\le p_1,\dots,1\le i_{m+n}\le p_{m+n}$.

For the theory of regularized $M$-estimators, restricted error sets and restricted strong convexity are essential definitions. To define the former, we first introduce the following restricted model subspaces.

For $i = 1,\dots,2d$, denote by $\bar{\mathcal{U}}_i$ and $\bar{\mathcal{V}}_i$ the spaces spanned by the first $r_i$ left and right singular vectors in the SVD of $\mathcal{A}_{(i)}$, respectively. Define the collections of subspaces $\mathcal{N} = (\mathcal{N}_1,\dots,\mathcal{N}_{2d})$ and $\bar{\mathcal{N}}^\perp = (\bar{\mathcal{N}}_1^\perp,\dots$
$,\bar{\mathcal{N}}_{2d}^\perp)$, where
$$\mathcal{N}_i = \{M\in\mathbb{R}^{p_i\times p_{-i}p}\,|\,\mathrm{col}(M)\subset\bar{\mathcal{U}}_i,\ \mathrm{col}(M^\top)\subset\bar{\mathcal{V}}_i\},$$
$$\bar{\mathcal{N}}_i^\perp = \{M\in\mathbb{R}^{p_i\times p_{-i}p}\,|\,\mathrm{col}(M)\perp\bar{\mathcal{U}}_i,\ \mathrm{col}(M^\top)\perp\bar{\mathcal{V}}_i\}, \qquad (\mathrm{B.1})$$
for $i = 1,\dots,2d$, and let $\bar{\mathcal{N}}_i = (\bar{\mathcal{N}}_i^\perp)^\perp$. Note that $\mathcal{N}_i\subset\bar{\mathcal{N}}_i$.

Furthermore, for $k = 1,\dots,2^{d-1}$, denote by $\mathcal{U}_k$ and $\mathcal{V}_k$ the spaces spanned by the first $s_k$ left and right singular vectors in the SVD of the square matricization $\mathcal{A}_{[I_k]}$, respectively. Similarly, define the collections of subspaces $\mathcal{M} := (\mathcal{M}_1,\dots,\mathcal{M}_{2^{d-1}})$ and $\bar{\mathcal{M}}^\perp = (\bar{\mathcal{M}}_1^\perp,\dots,\bar{\mathcal{M}}_{2^{d-1}}^\perp)$, where
$$\mathcal{M}_k = \{M\in\mathbb{R}^{p\times p}\,|\,\mathrm{col}(M)\subset\mathcal{U}_k,\ \mathrm{col}(M^\top)\subset\mathcal{V}_k\},$$
$$\bar{\mathcal{M}}_k^\perp = \{M\in\mathbb{R}^{p\times p}\,|\,\mathrm{col}(M)\perp\mathcal{U}_k,\ \mathrm{col}(M^\top)\perp\mathcal{V}_k\}, \qquad (\mathrm{B.2})$$
for $k = 1,\dots,2^{d-1}$, with $\bar{\mathcal{M}}_k = (\bar{\mathcal{M}}_k^\perp)^\perp$. In particular, as described in Section 5.2, $I_1 = S = \{1,\dots,d\}$; thus, $\mathcal{M}_1$ and $\bar{\mathcal{M}}_1^\perp$ are the subspaces associated with the square matricization $\mathcal{A}_{[S]}$.

Then, for simplicity, for any $\mathcal{W}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, we denote
$$\mathcal{W}^{(i)}_{\mathcal{N}} = (\mathcal{W}_{(i)})_{\mathcal{N}_i},\quad \mathcal{W}^{(i)}_{\bar{\mathcal{N}}} = (\mathcal{W}_{(i)})_{\bar{\mathcal{N}}_i},\quad \mathcal{W}^{(i)}_{\bar{\mathcal{N}}^\perp} = (\mathcal{W}_{(i)})_{\bar{\mathcal{N}}_i^\perp},$$
$$\mathcal{W}^{(k)}_{\mathcal{M}} = (\mathcal{W}_{[I_k]})_{\mathcal{M}_k},\quad \mathcal{W}^{(k)}_{\bar{\mathcal{M}}} = (\mathcal{W}_{[I_k]})_{\bar{\mathcal{M}}_k},\quad \mathcal{W}^{(k)}_{\bar{\mathcal{M}}^\perp} = (\mathcal{W}_{[I_k]})_{\bar{\mathcal{M}}_k^\perp},$$
where $i = 1,\dots,2d$ and $k = 1,\dots,2^{d-1}$. Based on the subspaces defined in (B.1) and (B.2), we can define the restricted error sets corresponding to the three regularized estimators as follows.

Definition 1. The restricted error set corresponding to $\mathcal{M}$ is defined as
$$\mathbb{C}_{\mathrm{SSN}}(\mathcal{M}) := \Big\{\boldsymbol{\Delta} : \sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* \le 3\sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_* + 4\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*\Big\}.$$
The restricted error set corresponding to $\mathcal{N}$ is defined as
$$\mathbb{C}_{\mathrm{SN}}(\mathcal{N}) := \Big\{\boldsymbol{\Delta} : \sum_{i=1}^{2d}\|\boldsymbol{\Delta}^{(i)}_{\bar{\mathcal{N}}^\perp}\|_* \le 3\sum_{i=1}^{2d}\|\boldsymbol{\Delta}^{(i)}_{\bar{\mathcal{N}}}\|_* + 4\sum_{i=1}^{2d}\|\mathcal{A}^{(i)}_{\bar{\mathcal{N}}^\perp}\|_*\Big\}.$$
The restricted error set corresponding to $\mathcal{M}_1$ is defined as
$$\mathbb{C}_{\mathrm{MN}}(\mathcal{M}_1) := \big\{\boldsymbol{\Delta} : \|\boldsymbol{\Delta}^{(1)}_{\bar{\mathcal{M}}^\perp}\|_* \le 3\|\boldsymbol{\Delta}^{(1)}_{\bar{\mathcal{M}}}\|_* + 4\|\mathcal{A}^{(1)}_{\bar{\mathcal{M}}^\perp}\|_*\big\}.$$

The first lemma shows that, if the tuning parameter is well chosen for each regularized estimator, the estimation error belongs to the corresponding restricted error set.

Lemma B.1. For the SSN estimator, if the regularization parameter $\lambda_{\mathrm{SSN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$, then the error $\boldsymbol{\Delta}_{\mathrm{SSN}} = \widehat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}$ belongs to the set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$.
For the SN estimator, if the regularization parameter $\lambda_{\mathrm{SN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SN}*}$, then the error $\boldsymbol{\Delta}_{\mathrm{SN}} = \widehat{\mathcal{A}}_{\mathrm{SN}} - \mathcal{A}$ belongs to the set $\mathbb{C}_{\mathrm{SN}}(\mathcal{N})$.
For the MN estimator, if the regularization parameter $\lambda_{\mathrm{MN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_{t-1})\mathrm{vec}(\mathcal{E}_t)^\top\|_{\mathrm{op}}$, then the error $\boldsymbol{\Delta}_{\mathrm{MN}} = \widehat{\mathcal{A}}_{\mathrm{MN}} - \mathcal{A}$ belongs to the set $\mathbb{C}_{\mathrm{MN}}(\mathcal{M}_1)$.

Following Negahban and Wainwright (2012) and Negahban et al. (2012), a restricted strong convexity (RSC) condition for the square loss function can be defined as follows.

Definition 2. The loss function satisfies the RSC condition with curvature $\alpha_{\mathrm{RSC}} > 0$ and restricted error set $\mathbb{C}$ if
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \ge \alpha_{\mathrm{RSC}}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2, \quad \forall\,\boldsymbol{\Delta}\in\mathbb{C}.$$

Based on the restricted error sets and RSC conditions, the estimation errors have the following deterministic upper bounds.

Lemma B.2.
Suppose that $\lambda_{\mathrm{SSN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$, the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$, and $\mathcal{A}_{[I_k]}\in\mathbb{B}_q(s_q^{(k)}; p, p)$ for some $q\in[0,1]$ and all $k = 1,\dots,2^{d-1}$. Then
$$\|\boldsymbol{\Delta}_{\mathrm{SSN}}\|_{\mathrm{F}} \lesssim \sqrt{\bar{s}_q}\left(\frac{2^{d-1}\lambda_{\mathrm{SSN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}, \quad\text{where } \bar{s}_q = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}}s_q^{(k)}.$$
Suppose that $\lambda_{\mathrm{SN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SN}*}$, the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{SN}}(\mathcal{N})$, and $\mathcal{A}_{(i)}\in\mathbb{B}_q(r_q^{(i)}; p_i, p_{-i}p)$ for some $q\in[0,1]$ and all $i = 1,\dots,2d$. Then
$$\|\boldsymbol{\Delta}_{\mathrm{SN}}\|_{\mathrm{F}} \lesssim \sqrt{\bar{r}_q}\left(\frac{2d\cdot\lambda_{\mathrm{SN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}, \quad\text{where } \bar{r}_q = (2d)^{-1}\sum_{i=1}^{2d}r_q^{(i)}.$$
Suppose that $\lambda_{\mathrm{MN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_{t-1})\mathrm{vec}(\mathcal{E}_t)^\top\|_{\mathrm{op}}$, the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{MN}}(\mathcal{M}_1)$, and $\mathcal{A}_{[S]}\in\mathbb{B}_q(s_q^{(1)}; p, p)$ for some $q\in[0,1]$. Then
$$\|\boldsymbol{\Delta}_{\mathrm{MN}}\|_{\mathrm{F}} \lesssim \sqrt{s_q^{(1)}}\left(\frac{\lambda_{\mathrm{MN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}.$$
Note that Lemma B.2 is deterministic, and the radii $\bar{s}_q$, $\bar{r}_q$ and $s_q^{(1)}$ may diverge to infinity.

B.1.2 Stochastic Analysis

We continue with the stochastic analysis to show that the deviation bound and the RSC condition hold simultaneously with high probability.

Lemma B.3 (Deviation bound). Suppose that Assumptions 1 and 2 hold. If $T\gtrsim p$ and $\lambda_{\mathrm{SSN}}\gtrsim\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, then with probability at least $1-\exp[-C(p-d)]$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \le \frac{\lambda_{\mathrm{SSN}}}{2},$$
where $M_1 = \lambda_{\max}(\Sigma_e)/\mu_{\min}^{1/2}(\mathcal{A})$. If $T\gtrsim\max_{1\le i\le2d}p_{-i}p$ and $\lambda_{\mathrm{SN}}\gtrsim\kappa^2M_1(2d)^{-2}\sum_{i=1}^{2d}\sqrt{p_{-i}p/T}$, then with probability at least $1-\sum_{i=1}^{2d}\exp(-Cp_{-i}p)$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SN}*} \le \frac{\lambda_{\mathrm{SN}}}{2}.$$
Moreover, if $T\gtrsim p$ and $\lambda_{\mathrm{MN}}\gtrsim\kappa^2M_1\sqrt{p/T}$, then with probability at least $1-\exp(-Cp)$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_{t-1})\mathrm{vec}(\mathcal{E}_t)^\top\Big\|_{\mathrm{op}} \le \frac{\lambda_{\mathrm{MN}}}{2}.$$

The above deviation bounds require a sample size of order $T\gtrsim p$ for all three estimators. In this case, we can establish a strong convexity condition that is stronger than the RSC condition.

Lemma B.4 (Strong convexity). Under Assumptions 1 and 2, for $T\gtrsim\max(\kappa^4,\kappa^2)M_2^{-2}p$, with probability at least $1-\exp[-C\min(\kappa^{-4},\kappa^{-2})M_2^2\,p]$,
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \ge \alpha_{\mathrm{RSC}}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \quad\text{for all }\boldsymbol{\Delta},$$
where $M_2 = [\lambda_{\min}(\Sigma_e)\mu_{\max}^{-1}(\mathcal{A})]/[\lambda_{\max}(\Sigma_e)\mu_{\min}^{-1}(\mathcal{A})]$ and $\alpha_{\mathrm{RSC}} = \lambda_{\min}(\Sigma_e)/(2\mu_{\max}(\mathcal{A}))$.

B.2 Proofs of Theorems 2-5

Proof of Theorems 2 and 3. Theorems 2 and 3 can be proved based on Lemmas B.2-B.4, following the same lines as the proof of Theorem 4 given below. Therefore, we omit the details here.

Proof of Theorem 4. The proof of Theorem 4 has been split into Lemmas B.2-B.4. By Lemma B.2, for a deterministic realization of sample size $T$ from the tensor autoregressive process, if we choose $\lambda_{\mathrm{SSN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$ and the RSC condition holds for the square loss with the parameter $\alpha_{\mathrm{RSC}}$, then the error upper bound
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}} \lesssim \sqrt{\bar{s}_q}\left(\frac{2^{d-1}\lambda_{\mathrm{SSN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}$$
can be established. Denote the events $\mathcal{E}_1(\beta) = \{\beta\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}\}$ and $\mathcal{E}_2(\alpha) = \{\lambda_{\min}(\mathbb{X}\mathbb{X}^\top/T)\ge\alpha\}$.
If we take $\lambda_{\mathrm{SSN}}\gtrsim\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, it suffices to show that $\mathcal{E}_1\big(C\kappa^2M_12^{-(d-1)}\sqrt{p/T}\big)$ and $\mathcal{E}_2(\alpha_{\mathrm{RSC}})$ occur simultaneously with high probability.

By Lemma B.3, when $T\gtrsim p$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \lesssim \kappa^2M_12^{-(d-1)}\sqrt{\frac{p}{T}}$$
with probability at least $1-\exp[-C(p-d)]$. By Lemma B.4, when $T\gtrsim\max(\kappa^4,\kappa^2)M_2^{-2}p$, for any $\boldsymbol{\Delta}$,
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \ge \frac{\lambda_{\min}(\Sigma_e)}{2\mu_{\max}(\mathcal{A})}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2$$
with probability at least $1-\exp[-C\min(\kappa^{-4},\kappa^{-2})M_2^2\,p]$.

Hence, when $T\gtrsim[1+\max(\kappa^4,\kappa^2)M_2^{-2}]\,p$ and $\lambda\gtrsim\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, with probability at least $1-\exp[-C(p-d)]-\exp[-C\min(\kappa^{-4},\kappa^{-2})M_2^2\,p]$, the condition $\lambda\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$ and the RSC condition with the parameter $\alpha_{\mathrm{RSC}} = \lambda_{\min}(\Sigma_e)/(2\mu_{\max}(\mathcal{A}))$ hold.

Proof of Theorem 5. Theorem 4 gives the Frobenius estimation error bound. For simplicity, we write $\widehat{\mathcal{A}} = \widehat{\mathcal{A}}_{\mathrm{SSN}}$ and $\widetilde{\mathcal{A}} = \widehat{\mathcal{A}}_{\mathrm{TSSN}}$ in this proof. By definition, for any tensor $\mathcal{A}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$,
$$\|\mathcal{A}\|_{\mathrm{F}}^2 = \|\mathcal{A}_{(i)}\|_{\mathrm{F}}^2 = \sum_{j=1}^{p_i}\sigma_j^2(\mathcal{A}_{(i)}), \quad i = 1,2,\dots,2d.$$
In other words, the Frobenius norm of the error tensor is equivalent to the $\ell_2$ norm of the singular values of any one-mode matricization. By Mirsky's singular value inequality (Mirsky, 1960),
$$\sum_{j=1}^{p_i}\big[\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})\big]^2 \le \sum_{j=1}^{p_i}\sigma_j^2(\widehat{\mathcal{A}}_{(i)}-\mathcal{A}_{(i)}) = \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}}^2, \quad i = 1,2,\dots,2d. \qquad (\mathrm{B.3})$$
Obviously, the $\ell_\infty$ error of the singular values is no greater than the $\ell_2$ error, so it obeys the same upper bound. By Theorem 4, when $\lambda_{\mathrm{SSN}}\asymp\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, with probability approaching one,
$$\max_{1\le i\le2d}\max_{1\le j\le p_i}\big|\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})\big| \le \max_{1\le i\le2d}\Big\{\sum_{j=1}^{p_i}\big[\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})\big]^2\Big\}^{1/2} \le \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} \lesssim \frac{\kappa^2M_1}{\alpha_{\mathrm{RSC}}}\sqrt{\frac{s_0\,p}{T}}. \qquad (\mathrm{B.4})$$
Therefore, by Assumption 3, as $T\to\infty$, $\gamma\gg\max_{1\le i\le2d}\max_{1\le j\le p_i}|\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})|$. Then, for any $j > r_i$, since $\sigma_j(\mathcal{A}_{(i)}) = 0$, we have $\gamma\gg\sigma_j(\widehat{\mathcal{A}}_{(i)})$. Thus, for all $i = 1,\dots,2d$, $\sigma_j(\widehat{\mathcal{A}}_{(i)})$ will be truncated for all $j > r_i$. Meanwhile, by Assumption 3 and (B.4), we have $\sigma_{r_i}(\widehat{\mathcal{A}}_{(i)}) > \gamma$ for $T$ sufficiently large, for all $i = 1,\dots,2d$. Therefore, the rank selection consistency of the truncated estimator $\widetilde{\mathcal{A}}$ can be established.

Denote the event $\mathcal{E} = \{\mathrm{rank}(\widetilde{\mathcal{A}}_{(i)}) = r_i, \text{ for } i = 1,\dots,2d\}$. For a generic $2d$-th-order tensor $\mathcal{T}$, denote by $\mathcal{T}_{i_k=j}$ the $p_1\times\cdots\times p_{k-1}\times1\times p_{k+1}\times\cdots$ sub-tensor such that $(\mathcal{T}_{i_k=j})_{i_1\dots i_{k-1}i_{k+1}\dots} = \mathcal{T}_{i_1\dots i_{k-1}ji_{k+1}\dots}$, and by $\mathcal{T}_{i_k>j}$ the $p_1\times\cdots\times p_{k-1}\times(p_k-j)\times p_{k+1}\times\cdots$ sub-tensor such that $(\mathcal{T}_{i_k>j})_{i_1\dots i_{k-1}\ell i_{k+1}\dots} = \mathcal{T}_{i_1\dots i_{k-1}(\ell+j)i_{k+1}\dots}$.

Let the HOSVD of $\widehat{\mathcal{A}}$ be $\widehat{\mathcal{G}}\times_{i=1}^{2d}\widehat{U}_i$. By definition, $\widehat{\mathcal{G}}$ is an all-orthogonal and sorted tensor such that $\|\widehat{\mathcal{G}}_{i_k=1}\|_{\mathrm{F}} \ge \|\widehat{\mathcal{G}}_{i_k=2}\|_{\mathrm{F}} \ge \cdots \ge \|\widehat{\mathcal{G}}_{i_k=p_k}\|_{\mathrm{F}}$, for $k = 1,\dots,2d$.
On the event $\mathcal{E}$, the truncation procedure is equivalent to truncating all the sub-tensors $\widehat{\mathcal{G}}_{i_k>r_k}$ to zero. Thus, $\|\widehat{\mathcal{A}}-\widetilde{\mathcal{A}}\|_{\mathrm{F}}^2 = \|\widehat{\mathcal{G}}-\widetilde{\mathcal{G}}\|_{\mathrm{F}}^2 \le \sum_{k=1}^{2d}\|\widehat{\mathcal{G}}_{i_k>r_k}\|_{\mathrm{F}}^2$. By the definition of the HOSVD, $\|\widehat{\mathcal{G}}_{i_k=j}\|_{\mathrm{F}} = \sigma_j(\widehat{\mathcal{G}}_{(k)}) = \sigma_j(\widehat{\mathcal{A}}_{(k)})$, and then
$$\|\widehat{\mathcal{G}}_{i_k>r_k}\|_{\mathrm{F}}^2 = \sum_{i=r_k+1}^{p_k}\sigma_i^2(\widehat{\mathcal{A}}_{(k)}) = \sum_{i=r_k+1}^{p_k}\big[\sigma_i(\widehat{\mathcal{A}}_{(k)})-\sigma_i(\mathcal{A}_{(k)})\big]^2 \le \sum_{i=1}^{p_k}\big[\sigma_i(\widehat{\mathcal{A}}_{(k)})-\sigma_i(\mathcal{A}_{(k)})\big]^2 \le \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}}^2,$$
where the last inequality follows from (B.3). Finally, on the event $\mathcal{E}$,
$$\|\widetilde{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} \le \|\widetilde{\mathcal{A}}-\widehat{\mathcal{A}}\|_{\mathrm{F}} + \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} \le (1+\sqrt{2d})\,\|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}},$$
where $d$ is fixed. Note that Theorem 4 implies the asymptotic rate $\|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} = O_p(\sqrt{s_0p/T})$, and the first part of this proof shows that $P(\mathcal{E})\to1$ as $T\to\infty$. The proof is complete.

B.3 Proofs of Lemmas B.1-B.4

Proof of Lemma B.1. In this part, we focus on $\widehat{\mathcal{A}}_{\mathrm{SSN}}$ and simplify it to $\widehat{\mathcal{A}}$; the tuning parameter $\lambda_{\mathrm{SSN}}$ is simplified to $\lambda$. The proof can be readily extended to $\widehat{\mathcal{A}}_{\mathrm{SN}}$ and $\widehat{\mathcal{A}}_{\mathrm{MN}}$.

Note that the quadratic loss function can be rewritten as $\mathcal{L}_T(\mathcal{A}) = T^{-1}\sum_{t=1}^{T}\|\mathcal{Y}_t-\langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 = T^{-1}\sum_{t=1}^{T}\|y_t-\mathcal{A}_{[S]}y_{t-1}\|_2^2$, where $y_t = \mathrm{vec}(\mathcal{Y}_t)$. By the optimality of the SSN estimator,
$$\frac{1}{T}\sum_{t=1}^{T}\|y_t-\widehat{\mathcal{A}}_{[S]}y_{t-1}\|_2^2 + \lambda\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}} \le \frac{1}{T}\sum_{t=1}^{T}\|y_t-\mathcal{A}_{[S]}y_{t-1}\|_2^2 + \lambda\|\mathcal{A}\|_{\mathrm{SSN}}$$
$$\Longrightarrow \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le \frac{2}{T}\sum_{t=1}^{T}\langle e_t,\boldsymbol{\Delta}_{[S]}y_{t-1}\rangle + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big)$$
$$\Longrightarrow \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le 2\Big\langle\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t,\boldsymbol{\Delta}\Big\rangle + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big)$$
$$\Longrightarrow \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le 2\|\boldsymbol{\Delta}\|_{\mathrm{SSN}}\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big),$$
where $\|\cdot\|_{\mathrm{SSN}*}$ denotes the dual norm of the SSN norm.

By the triangle inequality and the decomposability of the nuclear norm,
$$\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}-\|\mathcal{A}\|_{\mathrm{SSN}} = \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}+\boldsymbol{\Delta}_{[I_k]}\|_* - \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_* = \sum_{k=1}^{2^{d-1}}\big\|\mathcal{A}^{(k)}_{\mathcal{M}}+\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*$$
$$\ge \sum_{k=1}^{2^{d-1}}\Big[\big\|\mathcal{A}^{(k)}_{\mathcal{M}}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\big\|_*\Big] - \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*,$$
and, since $\|\mathcal{A}^{(k)}_{\mathcal{M}}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* = \|\mathcal{A}^{(k)}_{\mathcal{M}}\|_* + \|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*$ by decomposability and $\|\mathcal{A}_{[I_k]}\|_* \le \|\mathcal{A}^{(k)}_{\mathcal{M}}\|_* + \|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*$,
$$\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}-\|\mathcal{A}\|_{\mathrm{SSN}} \ge \sum_{k=1}^{2^{d-1}}\Big[\big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\big\|_* - 2\big\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_*\Big].$$
If $\lambda\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$, we have
$$0 \le \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le \frac{\lambda}{2}\|\boldsymbol{\Delta}\|_{\mathrm{SSN}} - \lambda\big(\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}-\|\mathcal{A}\|_{\mathrm{SSN}}\big) \le \frac{\lambda}{2}\sum_{k=1}^{2^{d-1}}\Big[3\big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\big\|_* + 4\big\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_*\Big].$$
Hence, the error $\boldsymbol{\Delta}$ lies in the restricted error set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$.

Proof of Lemma B.2. Similar to Lemma B.1, we focus on the SSN estimator; the results for the SN and MN estimators can be derived in a similar way.

Note that $T^{-1}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 = T^{-1}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2$. Following the proof of Lemma B.1, $\boldsymbol{\Delta}\in\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$ and
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \le \frac{\lambda}{2}\|\boldsymbol{\Delta}\|_{\mathrm{SSN}} + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big) \le \frac{3\lambda}{2}\|\boldsymbol{\Delta}\|_{\mathrm{SSN}} = \frac{3\lambda}{2}\sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}_{[I_k]}\|_*$$
$$\le \frac{3\lambda}{2}\sum_{k=1}^{2^{d-1}}\big(\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_* + \|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*\big) \le 6\lambda\sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_* + 6\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*$$
$$\le 6\lambda\sum_{k=1}^{2^{d-1}}\sqrt{2s_k}\,\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_{\mathrm{F}} + 6\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* \lesssim \lambda\sum_{k=1}^{2^{d-1}}\sqrt{s_k}\,\|\boldsymbol{\Delta}\|_{\mathrm{F}} + \lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*,$$
where the third inequality stems from the fact that $\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}$ has matrix rank at most $2s_k$, similar to Lemma 1 in Negahban and Wainwright (2011).

As the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$,
$$\alpha_{\mathrm{RSC}}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \le \frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \lesssim \lambda\sum_{k=1}^{2^{d-1}}\sqrt{s_k}\,\|\boldsymbol{\Delta}\|_{\mathrm{F}} + \lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*.$$
Thus, by the Cauchy-Schwarz inequality,
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \lesssim \frac{\lambda^2\big(\sum_{k=1}^{2^{d-1}}\sqrt{s_k}\big)^2}{\alpha_{\mathrm{RSC}}^2} + \frac{\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*}{\alpha_{\mathrm{RSC}}} \lesssim \frac{\lambda^2\,2^{d-1}\sum_{k=1}^{2^{d-1}}s_k}{\alpha_{\mathrm{RSC}}^2} + \frac{\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*}{\alpha_{\mathrm{RSC}}}.$$
For the approximately low-rank case, consider any threshold $\tau_k > 0$, and take $\mathcal{M}_k$ (and $\bar{\mathcal{M}}_k$) to correspond to the column and row spaces spanned by the first $r^{(k)}$ singular vectors of $\mathcal{A}_{[I_k]}$, where $\sigma_1(\mathcal{A}_{[I_k]})\ge\cdots\ge\sigma_{r^{(k)}}(\mathcal{A}_{[I_k]}) > \tau_k \ge \sigma_{r^{(k)}+1}(\mathcal{A}_{[I_k]})$. By the definition of $\mathbb{B}_q(s_q^{(k)}; p, p)$, we have $s_q^{(k)}\ge r^{(k)}\tau_k^q$ and thus $r^{(k)}\le s_q^{(k)}\tau_k^{-q}$.

Then, the approximation error can be bounded by
$$\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* = \sum_{r=r^{(k)}+1}^{p}\sigma_r(\mathcal{A}_{[I_k]}) = \sum_{r=r^{(k)}+1}^{p}\sigma_r^q(\mathcal{A}_{[I_k]})\,\sigma_r^{1-q}(\mathcal{A}_{[I_k]}) \le s_q^{(k)}\tau_k^{1-q}.$$
Consequently,
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \lesssim \frac{\lambda^2\,2^{d-1}\sum_{k=1}^{2^{d-1}}s_q^{(k)}\tau_k^{-q}}{\alpha_{\mathrm{RSC}}^2} + \frac{\lambda\sum_{k=1}^{2^{d-1}}s_q^{(k)}\tau_k^{1-q}}{\alpha_{\mathrm{RSC}}}.$$
Setting each $\tau_k\asymp\alpha_{\mathrm{RSC}}^{-1}2^{d-1}\lambda$, the upper bound is minimized to
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \lesssim 2^{-(d-1)}\sum_{k=1}^{2^{d-1}}s_q^{(k)}\left(\frac{2^{d-1}\lambda}{\alpha_{\mathrm{RSC}}}\right)^{2-q}.$$
The proof is complete.

Proof of Lemma B.3. First, we derive an upper bound for the dual norm of the SSN norm. By definition, for any tensor $\mathcal{A}$ and collection of index sets $I = \{I_1,\dots$
$,I_{2^{d-1}}\}$, the SSN norm is $\|\mathcal{A}\|_{\mathrm{SSN}} = \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*$, and its dual norm is $\|\mathcal{A}\|_{\mathrm{SSN}*} := \sup\{\langle\mathcal{W},\mathcal{A}\rangle : \|\mathcal{W}\|_{\mathrm{SSN}}\le1\}$. By a method similar to that in Tomioka et al. (2011), it can be shown that
$$\|\mathcal{A}\|_{\mathrm{SSN}*} = \inf_{\sum_{k=1}^{2^{d-1}}\mathcal{X}_k=\mathcal{A}}\ \max_{k=1,\dots,2^{d-1}}\|(\mathcal{X}_k)_{[I_k]}\|_{\mathrm{op}}.$$
Then, we can take $\mathcal{X}_k = \big(\sum_{j=1}^{2^{d-1}}1/c_j\big)^{-1}(\mathcal{A}/c_k)$, where $c_k = \|\mathcal{A}_{[I_k]}\|_{\mathrm{op}}$, and apply Jensen's inequality to obtain
$$\|\mathcal{A}\|_{\mathrm{SSN}*} \le 2^{-2(d-1)}\sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_{\mathrm{op}}.$$
Hence, we have
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \le 2^{-2(d-1)}\sum_{k=1}^{2^{d-1}}\Big\|\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I_k]}\Big\|_{\mathrm{op}}.$$
In other words, the dual norm of the SSN norm can be upper bounded by the sum of the scaled matrix operator norms of the different matricizations of the tensor $T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t$. All of the square matricizations based on the $I_k$'s lead to a square $p$-by-$p$ matrix. Therefore, by the deviation bound in Lemma B.5, we can take a union bound such that
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \le C\kappa^2M_12^{-(d-1)}\sqrt{\frac{p}{T}}$$
with probability at least $1-\exp[-C(p-d)]$.

Next, for the SN estimator, we can obtain a similar upper bound for the dual norm of the SN norm. The SN norm is defined as $\|\mathcal{A}\|_{\mathrm{SN}} = \sum_{i=1}^{2d}\|\mathcal{A}_{(i)}\|_*$, and its dual norm has the equivalent form
$$\|\mathcal{A}\|_{\mathrm{SN}*} = \inf_{\sum_{i=1}^{2d}\mathcal{Y}_i=\mathcal{A}}\ \max_{i=1,\dots,2d}\|(\mathcal{Y}_i)_{(i)}\|_{\mathrm{op}}.$$
Then, we can obtain the upper bound $\|\mathcal{A}\|_{\mathrm{SN}*}\le(2d)^{-2}\sum_{i=1}^{2d}\|\mathcal{A}_{(i)}\|_{\mathrm{op}}$. For each one-mode matricization, the deviation bound in Lemma B.5 applies, and a union bound yields
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SN}*} \le C\kappa^2M_1(2d)^{-2}\sum_{i=1}^{2d}\sqrt{\frac{p_{-i}p}{T}},$$
with probability at least $1-2d\exp(-Cp)$.

Finally, the MN estimator uses a special case of the square matricization, so the corresponding upper bound follows directly from Lemma B.5.

Proof of Lemma B.4. For any $M\in\mathbb{R}^{m\times p}$, denote $R_T(M) = \sum_{t=0}^{T-1}\|My_t\|_2^2$. Note that $R_T(\boldsymbol{\Delta}_{[S]}) \ge \mathbb{E}R_T(\boldsymbol{\Delta}_{[S]}) - \sup_{\boldsymbol{\Delta}}|R_T(\boldsymbol{\Delta}_{[S]})-\mathbb{E}R_T(\boldsymbol{\Delta}_{[S]})|$. Following the proof of Lemma B.5, $\mathbb{E}R_T(\boldsymbol{\Delta}_{[S]}) = \|(I_T\otimes\boldsymbol{\Delta}_{[S]})PD\|_{\mathrm{F}}^2 \ge T\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2\,\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)$.

Similar to Lemma B.6, for any $v\in\mathbb{S}^{p-1}$ and any $t > 0$,
$$P\big[|R_T(v^\top)-\mathbb{E}R_T(v^\top)|\ge t\big] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right).$$
Considering an $\epsilon$-covering net of $\mathbb{S}^{p-1}$, by Lemma B.7 we can easily construct the union bound: for $T\gtrsim p$,
$$P\left[\sup_{v\in\mathbb{S}^{p-1}}|R_T(v^\top)-\mathbb{E}R_T(v^\top)|\ge t\right] \le C\exp\left(p - C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right).$$
Taking $t = T\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)/2$, for $T\gtrsim M_2^{-2}\max(\kappa^4,\kappa^2)\,p$ we have
$$P\left[\sup_{v\in\mathbb{S}^{p-1}}|R_T(v^\top)-\mathbb{E}R_T(v^\top)|\ge T\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)/2\right] \le \exp\big(-CM_2^2\min(\kappa^{-4},\kappa^{-2})\,T\big),$$
where, in terms of $P$, $M_2 = [\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)]/[\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)]$. Therefore, with probability at least $1-\exp(-CM_2^2\min(\kappa^{-4},\kappa^{-2})\,T)$,
$$R_T(\boldsymbol{\Delta}_{[S]}) \ge \frac{T\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)}{2}\,\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2.$$
Finally, since $P$ is related to the VMA($\infty$) representation of the process, by the spectral measure of ARMA processes discussed in Basu and Michailidis (2015), we may replace $\lambda_{\max}(PP^\top)$ and $\lambda_{\min}(PP^\top)$ by $1/\mu_{\min}(\mathcal{A})$ and $1/\mu_{\max}(\mathcal{A})$, respectively.

B.4 Three Auxiliary Lemmas

Three auxiliary lemmas used in the proofs of Lemmas B.3 and B.4 are presented below.

Lemma B.5 (Deviation bound for matricizations). For any index set $I\subset\{1,2,\dots,2d\}$, denote $q = \prod_{i\in I}p_i$ and $q' = \prod_{i\notin I}p_i$. If $T\gtrsim q+q'$, then with probability at least $1-\exp[-C(q+q')]$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}\Big\|_{\mathrm{op}} < C\kappa^2M_1\sqrt{(q+q')/T},$$
where $M_1 = \lambda_{\max}(\Sigma_e)/\mu_{\min}^{1/2}(\mathcal{A})$.

Proof. For any index set $I\subset\{1,2,\dots,2d\}$ and $2d$-th-order tensor $\mathcal{T}$, denote the inverse operation of the multi-mode matricization $T = \mathcal{T}_{[I]}$ by $T^{[I]} = \mathcal{T}$. Denote $\mathcal{W}(r; q, q') = \{W\in\mathbb{R}^{q\times q'} : \mathrm{rank}(W) = r,\ \|W\|_{\mathrm{F}} = 1\}$. By definition,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}\Big\|_{\mathrm{op}} = \sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle = \sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{E}_t)\mathrm{vec}(\mathcal{Y}_{t-1})^\top,\ (W^{[I]})_{[S]}^\top\Big\rangle.$$
For an arbitrary matrix $W\in\mathbb{R}^{q\times q'}$ such that $\|W\|_{\mathrm{F}} = 1$, denote $M = (W^{[I]})_{[S]}^\top$. Then, one can easily check that $\langle(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\rangle = \langle e_t, My_{t-1}\rangle$.

For a fixed $M$, denote $S_t(M) = \sum_{s=1}^{t}\langle e_s, My_{s-1}\rangle$ and $R_t(M) = \sum_{s=0}^{t-1}\|My_s\|_2^2$, for $1\le t\le T$. By the standard Chernoff argument, for any $\alpha, \beta, c > 0$,
$$P[\{S_T(M)\ge\alpha\}\cap\{R_T(M)\le\beta\}] = \inf_{m>0}P[\{\exp(mS_T(M))\ge\exp(m\alpha)\}\cap\{R_T(M)\le\beta\}]$$
$$= \inf_{m>0}P[\exp(mS_T(M))\,\mathbb{1}(R_T(M)\le\beta)\ge\exp(m\alpha)] \le \inf_{m>0}\exp(-m\alpha)\,\mathbb{E}[\exp(mS_T(M))\,\mathbb{1}(R_T(M)\le\beta)]$$
$$= \inf_{m>0}\exp(-m\alpha+cm^2\beta)\,\mathbb{E}[\exp(mS_T(M)-cm^2\beta)\,\mathbb{1}(R_T(M)\le\beta)] \le \inf_{m>0}\exp(-m\alpha+cm^2\beta)\,\mathbb{E}[\exp(mS_T(M)-cm^2R_T(M))].$$
By the tower rule,
$$\mathbb{E}[\exp(mS_T(M)-cm^2R_T(M))] = \mathbb{E}\big[\exp(mS_{T-1}(M)-cm^2R_{T-1}(M))\,\mathbb{E}[\exp(m\langle e_T,My_{T-1}\rangle-cm^2\|My_{T-1}\|_2^2)\,|\,\mathcal{F}_{T-1}]\big].$$
Since $\langle e_T,My_{T-1}\rangle = \langle\xi_T,\Sigma_e^{1/2}My_{T-1}\rangle$, one can easily check that, conditionally on $\mathcal{F}_{T-1}$, $\langle e_T,My_{T-1}\rangle$ is a $\kappa^2\lambda_{\max}(\Sigma_e)\|My_{T-1}\|_2^2$-sub-Gaussian random variable.
In other words, $\mathbb{E}[\exp(m\langle e_T,My_{T-1}\rangle)\,|\,\mathcal{F}_{T-1}] \le \exp(m^2\kappa^2\lambda_{\max}(\Sigma_e)\|My_{T-1}\|_2^2/2)$. Hence, taking $c = \kappa^2\lambda_{\max}(\Sigma_e)/2$, we have
$$\mathbb{E}[\exp(mS_T(M)-m^2\kappa^2\lambda_{\max}(\Sigma_e)R_T(M)/2)] \le \mathbb{E}[\exp(mS_{T-1}(M)-m^2\kappa^2\lambda_{\max}(\Sigma_e)R_{T-1}(M)/2)] \le \cdots \le \mathbb{E}[\exp(mS_1(M)-m^2\kappa^2\lambda_{\max}(\Sigma_e)R_1(M)/2)] \le 1.$$
Hence, we have that, for any $\alpha, \beta > 0$,
$$P[\{S_T(M)\ge\alpha\}\cap\{R_T(M)\le\beta\}] \le \inf_{m>0}\exp(-m\alpha+m^2\kappa^2\lambda_{\max}(\Sigma_e)\beta/2) = \exp\left(-\frac{\alpha^2}{2\kappa^2\lambda_{\max}(\Sigma_e)\beta}\right). \qquad (\mathrm{B.5})$$
By Lemma B.6, we have that, for any $t > 0$,
$$P[|R_T(M)-\mathbb{E}R_T(M)|\ge t] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right).$$
In addition, $\mathbb{E}R_T(M) = \mathrm{tr}(\Sigma_M) = \|(I_T\otimes M)PD\|_{\mathrm{F}}^2 \le T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)$. Letting $t = C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)$, we have
$$P[R_T(M)\ge C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)] \le 2\exp(-CT).$$
Next, consider an $\epsilon$-net $\bar{\mathcal{W}}(1;q,q')$ for $\mathcal{W}(1;q,q')$. For any matrix $W\in\mathcal{W}(1;q,q')$, there exists a matrix $W'\in\bar{\mathcal{W}}(1;q,q')$ such that $\|W-W'\|_{\mathrm{F}}\le\epsilon$. Since the rank of $\boldsymbol{\Delta} = W-W'$ is at most 2, we can split the SVD of $\boldsymbol{\Delta}$ into two parts, $\boldsymbol{\Delta} = \boldsymbol{\Delta}_1+\boldsymbol{\Delta}_2$, where $\mathrm{rank}(\boldsymbol{\Delta}_1) = \mathrm{rank}(\boldsymbol{\Delta}_2) = 1$ and $\langle\boldsymbol{\Delta}_1,\boldsymbol{\Delta}_2\rangle = 0$. Then, for any matrix $N\in\mathbb{R}^{q\times q'}$,
$$\langle N,W\rangle = \langle N,W'\rangle + \langle N,\boldsymbol{\Delta}\rangle = \langle N,W'\rangle + \sum_{i=1}^{2}\langle N,\boldsymbol{\Delta}_i/\|\boldsymbol{\Delta}_i\|_{\mathrm{F}}\rangle\,\|\boldsymbol{\Delta}_i\|_{\mathrm{F}},$$
where $\boldsymbol{\Delta}_i/\|\boldsymbol{\Delta}_i\|_{\mathrm{F}}\in\mathcal{W}(1;q,q')$. Since $\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 = \|\boldsymbol{\Delta}_1\|_{\mathrm{F}}^2+\|\boldsymbol{\Delta}_2\|_{\mathrm{F}}^2$, by the Cauchy-Schwarz inequality, $\|\boldsymbol{\Delta}_1\|_{\mathrm{F}}+\|\boldsymbol{\Delta}_2\|_{\mathrm{F}} \le \sqrt{2}\|\boldsymbol{\Delta}\|_{\mathrm{F}} \le \sqrt{2}\,\epsilon$. Hence,
$$\gamma_N := \sup_{W\in\mathcal{W}(1;q,q')}\langle N,W\rangle \le \max_{W'\in\bar{\mathcal{W}}(1;q,q')}\langle N,W'\rangle + \sqrt{2}\,\gamma_N\,\epsilon;$$
in other words, $\sup_{W\in\mathcal{W}(1;q,q')}\langle N,W\rangle \le (1-\sqrt{2}\epsilon)^{-1}\max_{W'\in\bar{\mathcal{W}}(1;q,q')}\langle N,W'\rangle$. Therefore, for any $x > 0$,
$$P\left[\sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle \ge x\right] \le |\bar{\mathcal{W}}(1;q,q')|\cdot P\left[\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W'\Big\rangle \ge (1-\sqrt{2}\epsilon)x\right]. \qquad (\mathrm{B.6})$$
Note that, by (B.5), for any $x > 0$,
$$P\left[\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W'\Big\rangle \ge (1-\sqrt{2}\epsilon)x\right] \le P[\{S_T(M)\ge T(1-\sqrt{2}\epsilon)x\}\cap\{R_T(M)\le C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)\}]$$
$$+\ P[R_T(M) > C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)] \le \exp\left[-\frac{CTx^2}{\kappa^4\lambda_{\max}^2(\Sigma_e)\lambda_{\max}(PP^\top)}\right] + 2\exp(-CT).$$
By Lemma 3.1 in Candes and Plan (2011), for an $\epsilon$-net of $\mathcal{W}(1;q,q')$, the covering number satisfies $|\bar{\mathcal{W}}(1;q,q')|\le(9/\epsilon)^{q+q'}$. Combining this with (B.6), we have that, when $T\gtrsim q+q'$, for any $x > 0$,
$$P\left[\sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle \ge x\right] \le \exp\left[(q+q')\log(9/\epsilon) - \frac{CTx^2}{\kappa^4\lambda_{\max}^2(\Sigma_e)\lambda_{\max}(PP^\top)}\right] + 2\exp[(q+q')\log(9/\epsilon) - CT].$$
Taking $\epsilon$ to be a small absolute constant and
$x = C\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}^{1/2}(PP^\top)\sqrt{(q+q')/T}$, we have
$$P\left[\sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle \ge C\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}^{1/2}(PP^\top)\sqrt{\frac{q+q'}{T}}\right] \le \exp[-C(q+q')].$$
Finally, since $P$ is related to the VMA($\infty$) representation, by the spectral measure of ARMA processes discussed in Basu and Michailidis (2015), we may replace $\lambda_{\max}(PP^\top)$ by $1/\mu_{\min}(\mathcal{A})$.

Lemma B.6. For any $M\in\mathbb{R}^{p\times p}$ such that $\|M\|_{\mathrm{F}} = 1$, denote $R_T(M) = \sum_{t=0}^{T-1}\|My_t\|_2^2$. Then, for any $t > 0$,
$$P[|R_T(M)-\mathbb{E}R_T(M)|\ge t] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right),$$
where, writing $A = \mathcal{A}_{[S]}$ for brevity, $P$ is defined as
$$P = \begin{pmatrix} I_p & A & A^2 & A^3 & \cdots \\ O & I_p & A & A^2 & \cdots \\ \vdots & & \ddots & \ddots & \\ O & O & \cdots & I_p & \cdots \end{pmatrix}. \qquad (\mathrm{B.7})$$

Proof. Denote $y = (y_{T-1}^\top, y_{T-2}^\top, \dots, y_0^\top)^\top$, $e = (e_{T-1}^\top, e_{T-2}^\top, \dots, e_0^\top, \dots)^\top$, and $\xi = (\xi_{T-1}^\top, \xi_{T-2}^\top, \dots, \xi_0^\top, \dots)^\top$. Based on the moving average representation of the VAR(1) process, we can rewrite $y_t$ as a VMA($\infty$): $y_t = e_t + Ae_{t-1} + A^2e_{t-2} + A^3e_{t-3} + \cdots$. Note that
$$R_T(M) = y^\top(I_T\otimes M^\top M)y = e^\top P^\top(I_T\otimes M^\top M)Pe = \xi^\top DP^\top(I_T\otimes M^\top M)PD\xi := \xi^\top\Sigma_M\xi,$$
where $P$ is defined in (B.7) and $D = \mathrm{diag}(\Sigma_e^{1/2}, \Sigma_e^{1/2}, \dots)$. By the Hanson-Wright inequality, for any $t > 0$,
$$P[|R_T(M)-\mathbb{E}R_T(M)|\ge t] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4\|\Sigma_M\|_{\mathrm{F}}^2},\ \frac{t}{\kappa^2\|\Sigma_M\|_{\mathrm{op}}}\right)\right).$$
As $\|M\|_{\mathrm{F}} = 1$, by the submultiplicative properties of the Frobenius norm and the operator norm, we have $\|\Sigma_M\|_{\mathrm{F}}^2 \le T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)$ and $\|\Sigma_M\|_{\mathrm{op}} \le \lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)$. These imply the asserted bound, and the proof of this lemma is complete.

Lemma B.7 (Covering number of the unit sphere). Let $\bar{\mathcal{N}}$ be an $\varepsilon$-net of the unit sphere $\mathbb{S}^{p-1}$, where $\varepsilon\in(0,1)$. Then $|\bar{\mathcal{N}}|\le(2/\varepsilon+1)^p$.

Proof. This lemma follows directly from Corollary 4.2.13 of Vershynin (2018).

References

Bai, J. and Wang, P. (2016). Econometric analysis of large factor models. Annual Review of Economics, 8:53-80.

Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. Annals of Statistics, 43:1535-1567.

Bi, X., Qu, A., and Shen, X. (2018). Multilayer tensor factorization with applications to recommender systems. Annals of Statistics, 46:3308-3333.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122.

Candes, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57:2342-2359.

Chen, E. Y. and Chen, R. (2019). Modeling dynamic transport network with matrix factor models: with an application to international trade flow. arXiv:1901.00769 [econ.EM].

Chen, E. Y., Tsay, R. S., and Chen, R. (2020a).
Constrained factor models for high-dimensional matrix-variate time series. Journal of the American Statistical Association, 115:775-793.

Chen, H., Raskutti, G., and Yuan, M. (2019). Non-convex projected gradient descent for generalized low-rank tensor regression. Journal of Machine Learning Research, 20(1):172-208.

Chen, R., Xiao, H., and Yang, D. (2020b). Autoregressive models for matrix-valued time series. Journal of Econometrics. To appear.

Chen, R., Yang, D., and Zhang, C.-H. (2020c). Factor models for high-dimensional tensor time series. arXiv:1905.07530v2 [stat.ME].

Davis, R. A., Zang, P., and Zheng, T. (2016). Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25:1077-1096.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21:1253-1278.

Ding, S. and Cook, R. D. (2018). Matrix variate regressions and envelope models. Journal of the Royal Statistical Society: Series B, 80:387-408.

Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116:1-22.

Frandsen, A. and Ge, R. (2019). Understanding composition of word embeddings via tensor decomposition. In International Conference on Learning Representations.

French, K. R. (2020). Data library: U.S. research returns data. Available at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.

Gandy, S., Recht, B., and Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27:025010.

Han, F., Lu, H., and Liu, H. (2015a). A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research, 16:3115-3150.

Han, F., Xu, S., and Liu, H. (2015b). Rate-optimal estimation of a high-dimensional semiparametric time series model. Preprint.

Han, R., Willett, R., and Zhang, A. (2020). An optimal statistical and computational framework for generalized tensor estimation. arXiv preprint arXiv:2002.11255.

Hoff, P. D. (2015). Multilinear tensor regression for longitudinal relational data. Annals of Applied Statistics, 9:1169-1193.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51:455-500.

Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. Annals of Statistics, 40:694-726.

Li, X., Xu, D., Zhou, H., and Li, L. (2018). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10:520-545.

Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:208-220.

Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, 11:50-59.

Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73-81.

Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39:1069-1097.

Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665-1697.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012).
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27:538-557.

Raskutti, G., Yuan, M., and Chen, H. (2019). Convex regularization for high-dimensional multi-response tensor regression. Annals of Statistics, 47:1554-1584.

Shapiro, A. (1986). Asymptotic theory of overparameterized structural models. Journal of the American Statistical Association, 81:142-149.

Stock, J. H. and Watson, M. W. (2011). Dynamic factor models. In Clements, M. P. and Hendry, D. F., editors, Oxford Handbook of Economic Forecasting. Oxford University Press.

Tomioka, R., Suzuki, T., Hayashi, K., and Kashima, H. (2011). Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems (NIPS), pages 972-980.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279-311.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge.

Walden, A. and Serroukh, A. (2002). Wavelet analysis of matrix-valued time series. Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 458:157-179.

Wang, D., Liu, X., and Chen, R. (2019). Factor models for matrix-valued high-dimensional time series. Journal of Econometrics, 208:231-248.

Wang, D., Zheng, Y., Lian, H., and Li, G. (2020). High-dimensional vector autoregressive time series modeling via tensor decomposition. Journal of the American Statistical Association. To appear.

Zheng, Y. and Cheng, G. (2020). Finite time analysis of vector autoregressive models under linear restrictions. Biometrika. To appear.

Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108:540-552.

Figure 2: Average estimation error of the LTR, OLS and RRR estimators for data generated with different $d$, $p_i$'s and multilinear ranks (cases (1a), (1b), (2a) and (2b)).

Table 2: The eight settings (a)-(h) of the second experiment; in each case, all but one of the overall dimension $p$, the sample size $T$, the rank parameter $s_0$ and the individual dimensions $\boldsymbol{p} = (p_1,\dots,p_d)$ are held fixed while the remaining one varies, with multilinear ranks $\boldsymbol{r} = (r_1,\dots,r_{2d})$.

Figure 3: Average squared estimation error for the SSN estimator for the eight cases with different varying parameters.
The error bars in each panel represent ± one standard deviation.

Figure 4: Average estimation error of the TSSN, SSN, MN and SN estimators for data generated with different $d$, $p_i$'s and multilinear ranks (cases (3a), (3b), (4a) and (4b)).

Figure 5: TSSN estimates of the predictor and response factor matrices for the $10\times10$ matrix-valued portfolio return series. From left to right: $\widetilde{U}_1$, $\widetilde{U}_2$, $\widetilde{U}_3$ and $\widetilde{U}_4$.

Table 3: Average in-sample forecasting error and out-of-sample rolling forecasting error ($\ell_1$, $\ell_2$ and $\ell_\infty$ norms) of the VAR, VFM, MAR, MFM and LRTAR (SSN and TSSN) models for the $10\times10$ matrix-valued portfolio return series. The best cases are marked in bold.

Figure 6: TSSN estimates of the predictor and response factor matrices for the $4\times4\times2$ tensor-valued portfolio return series (OP-BM-Size and Inv-BM-Size). From left to right: $\widetilde{U}_1,\dots,\widetilde{U}_6$.

Table 4: Average in-sample forecasting error and out-of-sample rolling forecasting error ($\ell_1$, $\ell_2$ and $\ell_\infty$ norms) of the VAR, VFM, MTAR, TFM and LRTAR (SSN and TSSN) models for the $4\times4\times2$ tensor-valued portfolio return series. The best cases are marked in bold.