High-Dimensional Low-Rank Tensor Autoregressive Time Series Modeling
Di Wang†, Yao Zheng‡ and Guodong Li†
†University of Hong Kong and ‡University of Connecticut
January 13, 2021
Abstract
Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, urging the need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This paper provides the first practical tool to accomplish this task via autoregression (AR). By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both a low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, by leveraging the special balanced structure of the AR transition tensor, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.
Keywords: dimension reduction; high-dimensional time series; nuclear norm; tensor decomposition; tensor-valued data

1 Introduction
Modern technological development has made available enormous quantities of data, many of which are structured and collected over time. Tensor-valued time series data, namely observations on a set of variables structured in tensor form and collected over time, have become increasingly common in a wide variety of areas. Examples include time series of multiple macroeconomic indices for multiple countries (Chen et al., 2020c), dynamic inter-regional transport flow data (Chen and Chen, 2019), and sequential image and video processing (Walden and Serroukh, 2002), among many others.

To model temporal dependencies of tensor-valued time series data, one might naturally resort to vectorization of the tensor-valued observations so that conventional vector-valued time series models can be applied directly. For instance, let $\mathcal{Y}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ be the tensor-valued observation at time $t$, for $t = 1,\dots,T$, where $T$ is the sample size. Then an obvious first step is to conduct the following vector autoregression (VAR) for its vectorization:

$$\mathrm{vec}(\mathcal{Y}_t) = A\,\mathrm{vec}(\mathcal{Y}_{t-1}) + \mathrm{vec}(\mathcal{E}_t), \quad (1)$$

where $A \in \mathbb{R}^{p\times p}$ is the unknown transition matrix, with $p = \prod_{i=1}^d p_i$, and $\mathcal{E}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is the tensor-valued random error at time $t$. Although model (1) is potent in that it takes into account the linear association between every variable in $\mathcal{Y}_t$ and that in $\mathcal{Y}_{t-1}$, it clearly suffers from two major drawbacks:

• The vectorization destroys the important multidimensional structural information inherently embedded in the tensor-valued observations, resulting in a lack of interpretability;

• The number of unknown parameters, $p^2 = (\prod_{i=1}^d p_i)^2$, can be formidable even for small $d$ and $p_i$; e.g., even when $d = 3$ and $p_1 = p_2 = p_3 = 3$, the number of unknown parameters is as large as $(3\times 3\times 3)^2 = 729$, while the sample size $T$ often has a similar magnitude in practice; see, e.g., the real data examples in Section 7.

In this work, motivated by the idea of the Tucker decomposition (Tucker, 1966), we propose the Low-Rank Tensor Autoregressive (LRTAR) model through folding the $p\times p$ transition matrix $A$ in (1) into the $2d$-th-order transition tensor $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, which is assumed to have low multilinear ranks $(r_1,\dots,r_{2d})$. That is, $r_i$ can be much smaller than $p_i$, where $p_{d+i} = p_i$ for $i = 1,\dots,d$. As a result, the aforementioned drawbacks of the vectorization approach are overcome by the proposed model:

• As we will show in Section 3, the proposed model can flexibly retain the distinct structural information across all modes of the observed tensor process, encapsulating an interpretable low-dimensional tensor dynamic relationship between $\mathcal{Y}_t$ and $\mathcal{Y}_{t-1}$.

• As the transition tensor $\mathcal{A}$ is assumed to have low multilinear ranks, the parameter space is simultaneously constrained in $2d$ directions, reducing its dimension drastically from $(\prod_{i=1}^d p_i)^2$ to $\prod_{i=1}^{2d} r_i + \sum_{i=1}^d r_i(p_i - r_i) + \sum_{i=1}^d r_{d+i}(p_i - r_{d+i})$; a small numerical comparison is sketched after this list.
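To make the scale of this dimension reduction concrete, here is a minimal Python sketch (an illustration, not the authors' code) comparing the VAR parameter count $p^2$ with the LRTAR count above; the dimensions and ranks are arbitrary choices for illustration.

```python
import numpy as np

def var_param_count(dims):
    """Number of free parameters in the vectorized VAR(1) model (1)."""
    p = int(np.prod(dims))
    return p ** 2

def lrtar_param_count(dims, ranks):
    """LRTAR parameter count: core tensor size plus the cost of the 2d
    factor matrices, as in the displayed formula above (p_{d+i} = p_i)."""
    d = len(dims)
    r = list(ranks)                     # (r_1, ..., r_{2d})
    core = int(np.prod(r))
    factors = sum(r[i] * (dims[i] - r[i]) for i in range(d)) \
        + sum(r[d + i] * (dims[i] - r[d + i]) for i in range(d))
    return core + factors

dims = (3, 3, 3)                        # hypothetical p_1 = p_2 = p_3 = 3
ranks = (1, 1, 1, 1, 1, 1)              # hypothetical multilinear ranks
print(var_param_count(dims))            # 729
print(lrtar_param_count(dims, ranks))   # 1 + 6 + 6 = 13
```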
Recently there has been rapidly emerging interest in high-dimensional matrix- and tensor-valued time series analysis, particularly through factor models, e.g., the matrix factor model proposed in Wang et al. (2019), the constrained matrix factor model in Chen et al. (2020a), and the tensor factor model in Chen et al. (2020c). Factor models are powerful tools for dimension reduction with great interpretability. However, unlike autoregression, factor models do not seek to explicitly model the temporal dependency and thus by themselves cannot be used directly for forecasting. On the other hand, despite the extensive literature on high-dimensional VAR models in recent years (e.g., Negahban and Wainwright, 2011; Basu and Michailidis, 2015; Han et al., 2015a; Wang et al., 2020; Zheng and Cheng, 2020), counterparts able to meet the particular needs of matrix- and tensor-valued time series modeling have rarely been explored. The most relevant existing work in this direction so far might be the matrix autoregressive (MAR) model in Chen et al. (2020b), where the focus is on low-dimensional modeling; see also Hoff (2015). As we will discuss in Section 3, the proposed model includes the MAR model as a special case, but enjoys greater flexibility and a more drastic reduction of the dimensionality; see also Figure 1 for an illustration.

The estimation of the proposed model is studied under both low- and high-dimensional scaling. When the sample size $T$ is sufficiently large and the transition tensor $\mathcal{A}$ is exactly low-rank, we consider the rank-constrained estimation method and prove the asymptotic normality of the proposed low-Tucker-rank (LTR) estimator. Under the high-dimensional setup, we consider a more general and natural setting where $\mathcal{A}$ can be well approximated by a low-Tucker-rank tensor, and develop regularized estimation methods based on nuclear-norm-type penalties. We first study the Sum of Nuclear (SN) norm regularizer, defined as the sum of nuclear norms of all one-mode matricizations of $\mathcal{A}$, and derive the non-asymptotic estimation error bound for the corresponding estimator. The SN norm regularizer has been widely used in the literature for various low-rank tensor problems (Gandy et al., 2011; Tomioka et al., 2011; Liu et al., 2013; Raskutti et al., 2019). Its major strength lies in the fact that the summation of nuclear norms enforces low-rankness simultaneously across all modes of the tensor. In contrast, if only a single nuclear norm is used, the low-rankness of only one mode is accounted for, obviously leading to a much less efficient estimator.

However, although penalizing multiple one-mode matricizations of $\mathcal{A}$ simultaneously is far more efficient than penalizing only one of them, the SN regularized estimator suffers from serious efficiency loss due to the fat-and-short shape of the one-mode matricizations. Note that low-rankness can in fact also be encouraged by penalizing nuclear norms of multi-mode matricizations. For instance, the conventional Matrix Nuclear (MN) norm regularized estimator (Negahban and Wainwright, 2011) simply penalizes the nuclear norm of the transition matrix $A$ in representation (1), which under the proposed LRTAR model is actually a square-shaped multi-mode matricization, namely a square matricization, of the transition tensor $\mathcal{A}$. As we will show in Theorem 3 and in simulations, even though the MN regularizer incorporates only one single square matricization, it still clearly beats the SN regularizer, since the former avoids the efficiency bottleneck caused by the imbalance of the one-mode matricizations.

Indeed, due to the autoregressive form of the proposed model, the transition tensor $\mathcal{A}$ has a special balanced structure in the sense that its first $d$ modes have exactly the same dimensions as its other $d$ modes. As a result, many different square matricizations of $\mathcal{A}$ can be formed by appropriately pairing up its modes; see Section 5.2 for details.
This naturally motivates us to propose a new regularizer that combines the strengths of both the SN and MN norms. Specifically, for the proposed tensor autoregression, we introduce a novel Sum of Square-matrix Nuclear (SSN) norm regularizer, defined as the sum of nuclear norms of all the $p\times p$ square matricizations of the transition tensor $\mathcal{A}$. The SSN regularizer is expected to be superior to the MN, since it simultaneously encourages the low-rankness across all possible square matricizations of $\mathcal{A}$ rather than only one of them. Moreover, thanks to the use of square matricizations, the SSN regularized estimator is provably more efficient than the SN regularized one; see Theorem 4 and the last simulation experiment in Section 6. Note that the adoption of a more balanced (square) matricization to improve estimation performance was proposed in Mu et al. (2014) for low-rank tensor completion problems, where only a single square matricization was penalized, similarly to the MN regularizer.

This work is also related to the literature on matrix-variate regression and tensor regression for independent data. The matrix-variate regression model in Ding and Cook (2018) has the same basic bilinear form as that of the MAR model mentioned earlier, while an envelope method was introduced to further reduce the dimension. Raskutti et al. (2019) proposed a multi-response tensor regression model, where they mainly studied the third-order coefficient tensor and the SN regularization, which is known to be statistically sub-optimal for higher-order tensor estimation. In contrast, we study the model for general higher-order tensor-valued time series. Moreover, our SSN regularized estimator has a much faster statistical convergence rate than the SN estimator. Recently, Chen et al. (2019) and Han et al. (2020) studied non-convex projected gradient descent methods for tensor regression. While their non-convex approaches require exact low-rankness with known Tucker ranks, our methods can handle both exact and approximate low-rankness and select the unknown ranks consistently in the exactly low-rank case. In addition, the existing literature on tensor regression has considered only independent data and Gaussian time series data, whereas we allow sub-Gaussianity of the time series. This is a non-trivial relaxation, since unlike the Gaussian case, sub-Gaussian time series cannot be linearly transformed into independent samples.

We summarize the most important contributions of this paper as follows:

(i) This paper provides the first practical tool to model and forecast general structured, high-dimensional data with complex temporal dependence via tensor autoregression. By flexibly and efficiently capturing the underlying low-dimensional tensor dynamics, the proposed model delivers significant dimension reduction, meaningful structural interpretation and favorable forecasting performance.

(ii) By exploiting the special balanced structure of the transition tensor $\mathcal{A}$, a novel SSN regularization approach is introduced to simultaneously and efficiently encourage low-rankness across all square matricizations of $\mathcal{A}$, outperforming both the SN and MN methods under both exact and approximate low-rankness. For exactly low-rank $\mathcal{A}$, a truncated estimator is further introduced for consistent rank selection.

(iii) On the technical front, by establishing a novel martingale-based concentration bound, this paper relaxes the conventional Gaussian assumption in the literature on high-dimensional time series to sub-Gaussianity.
This technique is generally applicable to the non-asymptotic estimation theory for high-dimensional time series models with a VAR representation and hence is of independent interest.

The rest of the paper is organized as follows. Section 2 introduces basic notation and tensor algebra. Section 3 presents the proposed LRTAR model. Section 4 studies the low-dimensional least squares estimator and its asymptotic properties. The high-dimensional regularized estimation is covered in Section 5, where we develop the non-asymptotic theory for three regularized estimators and the rank selection consistency of the truncated estimator. Sections 6 and 7 present simulation studies and real data analysis, respectively. Section 8 concludes with a brief discussion. All technical proofs are relegated to the Appendix.

2 Preliminaries: Notation and Tensor Algebra
Tensors, also known as multidimensional arrays, are natural higher-order extensions of matrices. The order of a tensor is the number of its dimensions, also known as ways or modes, so a multidimensional array $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is called a $d$-th-order tensor. We refer readers to Kolda and Bader (2009) for a detailed review of basic tensor algebra.

Throughout this paper, we denote vectors by boldface small letters, e.g. $x$, $y$, matrices by boldface capital letters, e.g. $X$, $Y$, and tensors by boldface Euler capital letters, e.g. $\mathcal{X}$, $\mathcal{Y}$. For any two real-valued sequences $x_k$ and $y_k$, we write $x_k \gtrsim y_k$ if there exists a constant $c > 0$ such that $x_k \ge c y_k$ for all $k$, and write $x_k \gg y_k$ if $\lim_{k\to\infty} y_k/x_k = 0$. In addition, we write $x_k \asymp y_k$ if $x_k \gtrsim y_k$ and $y_k \gtrsim x_k$. We use $C$ to denote a generic positive constant, which is independent of the dimensions and the sample size.

For a generic matrix $X$, we let $X^\top$, $\|X\|_{\mathrm{F}}$, $\|X\|_{\mathrm{op}}$, $\|X\|_*$, $\mathrm{vec}(X)$, and $\sigma_j(X)$ denote its transpose, Frobenius norm, operator norm, nuclear norm, vectorization, and $j$-th largest singular value, respectively. For any matrix $X \in \mathbb{R}^{m\times n}$, recall that the nuclear norm and its dual norm, the operator norm, are defined as

$$\|X\|_* = \sum_{j=1}^{\min(m,n)} \sigma_j(X) \quad \text{and} \quad \|X\|_{\mathrm{op}} = \sigma_1(X).$$

For any square matrix $X$, we let $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote its minimum and maximum eigenvalues. For any real symmetric matrices $X$ and $Y$, we write $X \le Y$ if $Y - X$ is a positive semidefinite matrix.

Matricization, also known as unfolding, is the process of reordering the elements of a third- or higher-order tensor into a matrix. The most commonly used matricization is the one-mode matricization, defined as follows. For any $d$-th-order tensor $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$, its mode-$s$ matricization $\mathcal{X}_{(s)} \in \mathbb{R}^{p_s\times p_{-s}}$, with $p_{-s} = \prod_{i=1,i\neq s}^d p_i$, is the matrix obtained by setting the $s$-th tensor mode as its rows and collapsing all the others into its columns, for $s = 1,\dots,d$. Specifically, the $(i_1,\dots,i_d)$-th element of $\mathcal{X}$ is mapped to the $(i_s, j)$-th element of $\mathcal{X}_{(s)}$, where

$$j = 1 + \sum_{k=1, k\neq s}^{d} (i_k - 1)\,J_k, \quad \text{with} \quad J_k = \prod_{\ell=1, \ell\neq s}^{k-1} p_\ell.$$

The above one-mode matricization can be extended to the multi-mode matricization by combining multiple modes into the rows and combining the rest into the columns of a matrix. For any index subset $S \subset \{1,\dots,d\}$, the multi-mode matricization $\mathcal{X}_{[S]}$ is the $(\prod_{i\in S} p_i)$-by-$(\prod_{i\notin S} p_i)$ matrix whose $(i, j)$-th element is mapped from the $(i_1,\dots,i_d)$-th element of $\mathcal{X}$, where

$$i = 1 + \sum_{k\in S} (i_k - 1)\,I_k \quad \text{and} \quad j = 1 + \sum_{k\notin S} (i_k - 1)\,J_k,$$

with $I_k = \prod_{\ell\in S, \ell<k} p_\ell$ and $J_k = \prod_{\ell\notin S, \ell<k} p_\ell$.

For a tensor $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$ and a matrix $Y \in \mathbb{R}^{q_k\times p_k}$, the mode-$k$ multiplication $\mathcal{X}\times_k Y \in \mathbb{R}^{p_1\times\cdots\times q_k\times\cdots\times p_d}$ is defined by contracting the $k$-th mode of $\mathcal{X}$ with the rows of $Y$, namely $(\mathcal{X}\times_k Y)_{i_1,\dots,j_k,\dots,i_d} = \sum_{i_k=1}^{p_k} \mathcal{X}_{i_1,\dots,i_k,\dots,i_d} Y_{j_k, i_k}$. Moreover, for tensors $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_n}$ and $\mathcal{Y} \in \mathbb{R}^{p_1\times\cdots\times p_m}$ with $m \le n$, we define the generalized tensor inner product $\langle\mathcal{X},\mathcal{Y}\rangle \in \mathbb{R}^{p_{m+1}\times\cdots\times p_n}$ by contracting the first $m$ modes,

$$\langle\mathcal{X},\mathcal{Y}\rangle_{i_{m+1},\dots,i_n} = \sum_{i_1=1}^{p_1}\cdots\sum_{i_m=1}^{p_m} \mathcal{X}_{i_1,\dots,i_m,i_{m+1},\dots,i_n}\,\mathcal{Y}_{i_1,\dots,i_m}, \quad (2)$$

which reduces to the conventional inner product when $m = n$. The following identities will be used frequently. For conformable tensors $\mathcal{X}$, $\mathcal{Z}$ and a matrix $Y \in \mathbb{R}^{q_k\times p_k}$ with $1 \le k \le m$,

$$\langle\mathcal{X}\times_k Y, \mathcal{Z}\rangle = \langle\mathcal{X}, \mathcal{Z}\times_k Y^\top\rangle. \quad (3)$$

If $Z \in \mathbb{R}^{q_{m+j}\times p_{m+j}}$ with $1 \le j \le n - m$, then

$$\langle\mathcal{X},\mathcal{Y}\rangle \times_j Z = \langle\mathcal{X}\times_{m+j} Z, \mathcal{Y}\rangle. \quad (4)$$

Moreover,

$$\mathrm{vec}(\langle\mathcal{X},\mathcal{Y}\rangle) = \mathcal{X}_{[S]}\,\mathrm{vec}(\mathcal{Y}), \quad (5)$$

where $S = \{m+1,\dots,n\}$, and when $m = n$, $\mathcal{X}_{[\varnothing]} = \mathrm{vec}(\mathcal{X})^\top$. A small NumPy sketch of these matricization operations is given below.
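The following minimal sketch (illustrative helper names, not the authors' code) implements the one-mode and multi-mode matricizations just defined; the column index follows the same column-major ordering as in the displayed formulas.

```python
import numpy as np

def unfold(X, s):
    """Mode-s matricization X_(s): mode s to rows, remaining modes collapsed
    into columns in column-major order, matching the index map above."""
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def multi_mode_matricization(X, S):
    """X_[S]: modes in S indexed over the rows, the rest over the columns."""
    cols = [i for i in range(X.ndim) if i not in S]
    n_rows = int(np.prod([X.shape[i] for i in S]))
    return np.transpose(X, list(S) + cols).reshape(n_rows, -1, order="F")

X = np.arange(24, dtype=float).reshape(2, 3, 4)    # a 2 x 3 x 4 tensor
print(unfold(X, 1).shape)                          # (3, 8)
print(multi_mode_matricization(X, [0, 2]).shape)   # (8, 3)
```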
Finally, we summarize some concepts and useful results for the Tucker decomposition (Tucker, 1966; De Lathauwer et al., 2000). For any tensor $\mathcal{X} \in \mathbb{R}^{p_1\times\cdots\times p_d}$, its multilinear ranks $(r_1,\dots,r_d)$ are defined as the matrix ranks of its one-mode matricizations, namely $r_i = \mathrm{rank}(\mathcal{X}_{(i)})$, for $i = 1,\dots,d$. Note that the $r_i$'s are analogous to the row and column ranks of a matrix, but are not necessarily equal for third- and higher-order tensors. Suppose that $\mathcal{X}$ has multilinear ranks $(r_1,\dots,r_d)$. Then $\mathcal{X}$ has the following Tucker decomposition:

$$\mathcal{X} = \mathcal{Y}\times_1 Y_1 \times_2 Y_2 \cdots \times_d Y_d = \mathcal{Y}\times_{i=1}^{d} Y_i, \quad (6)$$

where the $Y_i \in \mathbb{R}^{p_i\times r_i}$ for $i = 1,\dots,d$ are the factor matrices and $\mathcal{Y} \in \mathbb{R}^{r_1\times\cdots\times r_d}$ is the core tensor. Hence, the multilinear ranks are also called Tucker ranks.

If $\mathcal{X}$ has the Tucker decomposition in (6), then we have the following results for its one- and multi-mode matricizations:

$$\mathcal{X}_{(s)} = Y_s\,\mathcal{Y}_{(s)}\,(Y_d\otimes\cdots\otimes Y_{s+1}\otimes Y_{s-1}\otimes\cdots\otimes Y_1)^\top = Y_s\,\mathcal{Y}_{(s)}\,(\otimes_{i\neq s} Y_i)^\top, \quad s = 1,\dots,d,$$

and

$$\mathcal{X}_{[S]} = (\otimes_{i\in S} Y_i)\,\mathcal{Y}_{[S]}\,(\otimes_{i\notin S} Y_i)^\top, \quad S \subset \{1,\dots,d\}, \quad (7)$$

where $\otimes_{i\neq s}$, $\otimes_{i\in S}$ and $\otimes_{i\notin S}$ are matrix Kronecker products operating in reverse order within the corresponding index sets.

Note that for any nonsingular matrices $O_i \in \mathbb{R}^{r_i\times r_i}$ for $i = 1,\dots,d$, it holds that

$$\mathcal{Y}\times_{i=1}^{d} Y_i = (\mathcal{Y}\times_{i=1}^{d} O_i)\times_{i=1}^{d} (Y_i O_i^{-1}).$$

This indicates that the Tucker decomposition in (6) is not unique unless appropriate identifiability constraints are imposed. In this paper, to fix ideas, we will focus on a special Tucker decomposition called the higher-order singular value decomposition (HOSVD). In the HOSVD, the factor matrix $Y_i$ in (6) is defined as the tall orthonormal matrix consisting of the top $r_i$ left singular vectors of $\mathcal{X}_{(i)}$, for $i = 1,\dots,d$. Consequently, the core tensor $\mathcal{Y} = \mathcal{X}\times_{i=1}^{d} Y_i^\top$ has the all-orthogonal property that $\mathcal{Y}_{(i)}\mathcal{Y}_{(i)}^\top$ is a diagonal matrix for $i = 1,\dots,d$. A small numerical sketch of the HOSVD is given below.
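As a quick illustration of the HOSVD just described, the following sketch (illustrative only, with arbitrary ranks) computes the factor matrices from the one-mode matricizations and then forms the core tensor.

```python
import numpy as np

def unfold(X, s):
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def mode_multiply(X, U, mode):
    """Mode product X x_mode U, contracting the second index of U."""
    return np.moveaxis(
        np.tensordot(U, np.moveaxis(X, mode, 0), axes=(1, 0)), 0, mode
    )

def hosvd(X, ranks):
    """HOSVD: Y_i = top r_i left singular vectors of X_(i); the core
    X x_1 Y_1^T x_2 Y_2^T ... is all-orthogonal by construction."""
    factors = []
    for i, r in enumerate(ranks):
        Ui, _, _ = np.linalg.svd(unfold(X, i), full_matrices=False)
        factors.append(Ui[:, :r])
    core = X
    for i, Yi in enumerate(factors):
        core = mode_multiply(core, Yi.T, i)
    return core, factors

X = np.random.default_rng(3).standard_normal((5, 4, 3))
core, factors = hosvd(X, (2, 2, 2))
# For an exactly low-rank X, multiplying the core back by the factors
# recovers X; in general it gives the best multilinear-rank approximation.
```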
3 Low-Rank Tensor Autoregressive Model

For the tensor-valued time series $\{\mathcal{Y}_t\}_{t=1}^T$, we propose the following Low-Rank Tensor Autoregressive (LRTAR) model:

$$\mathcal{Y}_t = \langle\mathcal{A}, \mathcal{Y}_{t-1}\rangle + \mathcal{E}_t, \quad (8)$$

where $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$ is the $2d$-th-order transition tensor, which is assumed to have multilinear ranks $(r_1,\dots,r_{2d})$, with $r_i = \mathrm{rank}(\mathcal{A}_{(i)})$, $\langle\cdot,\cdot\rangle$ is the generalized tensor inner product defined in (2), and $\mathcal{E}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is the mean-zero random error at time $t$ with possible dependencies among its contemporaneous elements. Denote $S_1 = \{1,2,\dots,d\}$ and $S_2 = \{d+1,d+2,\dots,2d\}$. Note that by (5), model (8) can be written in the VAR form (1) with transition matrix $A = \mathcal{A}_{[S_2]}$.

Then we have the higher-order singular value decomposition (HOSVD) of $\mathcal{A}$,

$$\mathcal{A} = \mathcal{G}\times_{i=1}^{2d} U_i, \quad (9)$$

where $\mathcal{G} \in \mathbb{R}^{r_1\times\cdots\times r_{2d}}$ is the core tensor, and each $U_i \in \mathbb{R}^{p_i\times r_i}$ is the orthonormal factor matrix consisting of the top $r_i$ left singular vectors of $\mathcal{A}_{(i)}$, for $1 \le i \le 2d$. Thus, by (7), the VAR representation of model (8) can be written as

$$\underbrace{\mathrm{vec}(\mathcal{Y}_t)}_{y_t} = \underbrace{(\otimes_{i\in S_2} U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1} U_i)^\top}_{\mathcal{A}_{[S_2]}}\,\underbrace{\mathrm{vec}(\mathcal{Y}_{t-1})}_{y_{t-1}} + \underbrace{\mathrm{vec}(\mathcal{E}_t)}_{e_t}, \quad (10)$$

where $y_t = \mathrm{vec}(\mathcal{Y}_t)$ and $e_t = \mathrm{vec}(\mathcal{E}_t)$.

In contrast to the conventional VAR model in (1), which has $p^2$ unknown parameters, where $p = \prod_{i=1}^d p_i$, the dimension of the parameter space of model (8) is reduced substantially to

$$\prod_{i=1}^{2d} r_i + \sum_{i=1}^{d} r_i(p_i - r_i) + \sum_{i=1}^{d} r_{d+i}(p_i - r_{d+i}), \quad (11)$$

which is computed by subtracting the number of constraints due to the orthonormality of the $U_i$'s and the all-orthogonal property of $\mathcal{G}$ from the total number of parameters.

By the VAR representation in (10), we immediately have the following necessary and sufficient condition for the existence of a unique strictly stationary solution to model (8).

Assumption 1. The spectral radius of $\mathcal{A}_{[S_2]}$ is strictly less than one.

The proposed model implies an interesting low-dimensional tensor dynamical structure. To be specific, by (3), (4) and the orthonormality of the $U_i$'s, it can be shown that (8) together with (9) implies

$$\underbrace{\mathcal{Y}_t\times_{i=1}^{d} U_{d+i}^\top}_{r_{d+1}\times r_{d+2}\times\cdots\times r_{2d}} = \big\langle\mathcal{G},\; \underbrace{\mathcal{Y}_{t-1}\times_{i=1}^{d} U_i^\top}_{r_1\times r_2\times\cdots\times r_d}\big\rangle + \mathcal{E}_t\times_{i=1}^{d} U_{d+i}^\top. \quad (12)$$

Note that in (12), $\mathcal{Y}_t$ and $\mathcal{E}_t$ are both projected onto a low-dimensional space via the $U_{d+i}$'s, while $\mathcal{Y}_{t-1}$ is projected onto another low-dimensional space via the $U_i$'s, with $1 \le i \le d$. Hence, (12) is essentially a low-dimensional tensor regression defined on the projections of $\mathcal{Y}_t$ and $\mathcal{Y}_{t-1}$. Element-wise, the low-dimensional tensor $\mathcal{Y}_t\times_{i=1}^{d} U_{d+i}^\top$ can be interpreted as $\prod_{i=1}^d r_{d+i}$ multilinear response factors, $\mathcal{Y}_{t-1}\times_{i=1}^{d} U_i^\top$ as $\prod_{i=1}^d r_i$ multilinear predictor factors, and $\mathcal{E}_t\times_{i=1}^{d} U_{d+i}^\top$ as multilinear error factors. For this reason, we call $U_{d+1},\dots,U_{2d}$ the response factor matrices, and $U_1,\dots,U_d$ the predictor factor matrices.

We discuss some special cases of the proposed model below.

Example 1. For simplicity, we first consider the case with $d = 2$, so that $\mathcal{Y}_t \equiv Y_t$ and $\mathcal{E}_t \equiv E_t \in \mathbb{R}^{p_1\times p_2}$ are matrices. Then the VAR representation in (10) becomes

$$\mathrm{vec}(Y_t) = (U_4\otimes U_3)\,\mathcal{G}_{[\{3,4\}]}\,(U_2\otimes U_1)^\top\,\mathrm{vec}(Y_{t-1}) + \mathrm{vec}(E_t), \quad (13)$$

and the low-dimensional representation in (12) becomes

$$U_3^\top Y_t U_4 = \big\langle\mathcal{G},\; U_1^\top Y_{t-1} U_2\big\rangle + U_3^\top E_t U_4,$$

where $\mathcal{G} \in \mathbb{R}^{r_1\times r_2\times r_3\times r_4}$. It is interesting to compare this model with the matrix autoregressive (MAR) model in Chen et al. (2020b) and Hoff (2015), which is defined by

$$Y_t = B_1 Y_{t-1} B_2^\top + E_t, \quad (14)$$

where $B_1 \in \mathbb{R}^{p_1\times p_1}$ and $B_2 \in \mathbb{R}^{p_2\times p_2}$, and whose vector form is

$$\mathrm{vec}(Y_t) = (B_2\otimes B_1)\,\mathrm{vec}(Y_{t-1}) + \mathrm{vec}(E_t). \quad (15)$$

It can easily be seen that if $r_1 = r_3 = p_1$, $r_2 = r_4 = p_2$, $U_3 = I_{p_1}$, $U_4 = I_{p_2}$, and $\mathcal{G}_{[\{3,4\}]} = (B_2\otimes B_1)(U_2\otimes U_1)$, then (13) becomes exactly (15). Thus, the MAR model in (14) can be viewed as a special case of the proposed model without reducing the dimensions $p_i$ to $r_i$ and without transforming $\mathcal{Y}_t$; see Figure 1 for an illustration. The above comparison also applies to the general case with $d \ge 2$.
The tensor version of the MAR model is considered in Hoff (2015) and is defined as

$$\mathcal{Y}_t = \mathcal{Y}_{t-1}\times_{i=1}^{d} B_i + \mathcal{E}_t, \quad (16)$$

where $B_i \in \mathbb{R}^{p_i\times p_i}$ for $i = 1,\dots,d$. We call (16) the multilinear tensor autoregressive (MTAR) model. Note that its vector form is

$$\mathrm{vec}(\mathcal{Y}_t) = (B_d\otimes\cdots\otimes B_1)\,\mathrm{vec}(\mathcal{Y}_{t-1}) + \mathrm{vec}(\mathcal{E}_t). \quad (17)$$

Similarly, (17) is a special case of (10) with $r_i = r_{d+i} = p_i$ and $U_{d+i} = I_{p_i}$, for $i = 1,\dots,d$, and $\mathcal{G}_{[S_2]} = (\otimes_{i\in S_1} B_i)(\otimes_{i\in S_1} U_i)$. Obviously, the number of unknown parameters in the MTAR model, $\sum_{i=1}^d p_i^2$, is much larger than that of the proposed model as given in (11). Also note that Chen et al. (2020b) focuses on low-dimensional estimation and its asymptotic theory, while Hoff (2015) considers a Bayesian estimation method.

[Figure 1: Illustration of the MAR model and the proposed LRTAR model in the case of d = 2.]

Example 2. In the special case where $U_{d+i} = U_i$ and $r_{d+i} = r_i$ for $i = 1,\dots,d$, the proposed model may be understood from the perspective of dynamic factor modeling (Stock and Watson, 2011; Bai and Wang, 2016) for tensor-valued time series. Specifically, consider the following model:

$$\mathcal{Y}_t = \mathcal{F}_t\times_{i=1}^{d} U_i, \qquad \mathcal{F}_t = \langle\mathcal{G}, \mathcal{F}_{t-1}\rangle + \mathcal{H}_t, \quad (18)$$

where $\mathcal{Y}_t \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is the observed tensor-valued time series, $\mathcal{F}_t \in \mathbb{R}^{r_1\times\cdots\times r_d}$ represents $\prod_{i=1}^d r_i$ factors, and the $U_i \in \mathbb{R}^{p_i\times r_i}$ are orthonormal matrices for $i = 1,\dots,d$. Here $\mathcal{F}_t$ follows the tensor autoregression (TAR) with transition tensor $\mathcal{G} \in \mathbb{R}^{r_1\times\cdots\times r_d\times r_1\times\cdots\times r_d}$ and random error $\mathcal{H}_t$. Note that (18) can be rewritten as

$$\mathcal{Y}_t = \big\langle\mathcal{G}\times_{i=1}^{d} U_i\times_{i=d+1}^{2d} U_i,\; \mathcal{Y}_{t-1}\big\rangle + \mathcal{H}_t\times_{i=1}^{d} U_i,$$

with the convention $U_{d+i} = U_i$. Thus, model (18) is a special case of the proposed model with $U_{d+i} = U_i$ and $r_{d+i} = r_i$ for $i = 1,\dots,d$, and $\mathcal{E}_t = \mathcal{H}_t\times_{i=1}^{d} U_i$. Chen et al. (2020c) introduces the tensor factor model in the form of $\mathcal{Y}_t = \mathcal{F}_t\times_{i=1}^{d} U_i + \mathcal{E}_t$ without an explicit modeling of the latent factors $\mathcal{F}_t$. Hence, model (18) may be regarded as a special tensor factor model with autoregressive dynamic factors, but without any random error in the model equation of $\mathcal{Y}_t$.

4 Low-Dimensional Estimation

We consider the parameter estimation for the low-dimensional case where the sample size $T$ is sufficiently large such that the dimension of the parameter space can be assumed fixed. Throughout this section, we assume that the data are generated from the proposed model in Section 3, where the true transition tensor $\mathcal{A}$ is exactly low-Tucker-rank with multilinear ranks $(r_1,\dots,r_{2d})$. This assumption will be relaxed under the high-dimensional setup in the next section.
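Before turning to estimation, here is a minimal simulation sketch for model (8) in the $d = 2$ case (an illustration, not the authors' code; all dimensions are arbitrary): it builds a random low-Tucker-rank transition matrix as in (10), rescales it so that the spectral radius condition of Assumption 1 holds, and then generates the series.

```python
import numpy as np

rng = np.random.default_rng(42)
p1, p2, r = 4, 3, 1                     # d = 2; all Tucker ranks equal to r
p = p1 * p2

# Random orthonormal factor matrices U_1,...,U_4 and core matricization G_[S2].
U = [np.linalg.qr(rng.standard_normal((pi, r)))[0] for pi in (p1, p2, p1, p2)]
G_mat = rng.standard_normal((r * r, r * r))

# Transition matrix A_[S2] = (U4 kron U3) G_[S2] (U2 kron U1)^T, as in (10).
A_mat = np.kron(U[3], U[2]) @ G_mat @ np.kron(U[1], U[0]).T

# Enforce Assumption 1: rescale so the spectral radius is below one.
A_mat *= 0.9 / max(abs(np.linalg.eigvals(A_mat)))

# Generate the vectorized series through the VAR representation (10).
T = 500
y = np.zeros((T, p))
for t in range(1, T):
    y[t] = A_mat @ y[t - 1] + rng.standard_normal(p)
Y = y.reshape(T, p2, p1).transpose(0, 2, 1)   # undo column-major vec: p1 x p2
```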
Suppose that the true ranks $(r_1,\dots,r_{2d})$ of the exactly low-rank tensor $\mathcal{A}$ are known; we relegate the rank selection to Section 5.3. Then the parameters can be estimated by

$$\hat{\mathcal{A}}_{\mathrm{LTR}} = \hat{\mathcal{G}}\times_{i=1}^{2d}\hat{U}_i = \arg\min_{\mathcal{G}, U_i} \sum_{t=1}^{T} \big\|\mathcal{Y}_t - \langle\mathcal{G}\times_{i=1}^{2d} U_i,\; \mathcal{Y}_{t-1}\rangle\big\|_{\mathrm{F}}^2. \quad (19)$$

We call $\hat{\mathcal{A}}_{\mathrm{LTR}}$ the low-Tucker-rank (LTR) estimator. Note that the minimization in (19) is unconstrained, so the Tucker decomposition of $\hat{\mathcal{A}}_{\mathrm{LTR}}$ is not unique. Indeed, there is more than one solution of $\hat{\mathcal{G}}$ and the $\hat{U}_i$'s corresponding to the same $\hat{\mathcal{A}}_{\mathrm{LTR}}$. Due to this lack of identifiability of the Tucker decomposition, standard asymptotics for maximum likelihood estimation cannot be applied directly. Nevertheless, we can still derive the asymptotic distribution of $\hat{\mathcal{A}}_{\mathrm{LTR}}$ using the asymptotic theory for overparameterized models in Shapiro (1986).

Recall that $S_1 = \{1,2,\dots,d\}$, $S_2 = \{d+1,d+2,\dots,2d\}$, $y_t = \mathrm{vec}(\mathcal{Y}_t)$, and $e_t = \mathrm{vec}(\mathcal{E}_t)$. Let $\theta = (\mathrm{vec}(\mathcal{G}_{[S_2]})^\top, \mathrm{vec}(U_1)^\top, \cdots, \mathrm{vec}(U_{2d})^\top)^\top$ be the parameter vector, and let $h = h(\theta) = \mathrm{vec}(\mathcal{A}_{[S_2]}) = \mathrm{vec}\big((\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1}U_i)^\top\big)$ be the vectorization of the transition matrix. Denote $\Sigma_y = \mathrm{var}(y_t)$, $\Sigma_e = \mathrm{var}(e_t)$, and $J = \Sigma_e^{-1}\otimes\Sigma_y$. In addition, for $i = 1,\dots,2d$, let $P^{(i)}_{[S_2]}$ be the $p^2\times p^2$ permutation matrix such that $\mathrm{vec}(\mathcal{A}_{[S_2]}) = P^{(i)}_{[S_2]}\,\mathrm{vec}(\mathcal{A}_{(i)})$. Then it can be shown that the Jacobian matrix $H := \partial h(\theta)/\partial\theta$ is given by

$$H = \Big( (\otimes_{i\in S_1}U_i)\otimes(\otimes_{i\in S_2}U_i),\; P^{(1)}_{[S_2]}\big\{\big[(\otimes_{i=1,i\neq 1}^{2d}U_i)\,\mathcal{G}_{(1)}^\top\big]\otimes I_{p_1}\big\},\; P^{(2)}_{[S_2]}\big\{\big[(\otimes_{i=1,i\neq 2}^{2d}U_i)\,\mathcal{G}_{(2)}^\top\big]\otimes I_{p_2}\big\},\;\dots,\; P^{(2d)}_{[S_2]}\big\{\big[(\otimes_{i=1,i\neq 2d}^{2d}U_i)\,\mathcal{G}_{(2d)}^\top\big]\otimes I_{p_{2d}}\big\} \Big), \quad (20)$$

where $p_{d+i} = p_i$ for $i = 1,\dots,d$.

Theorem 1. Suppose that the time series $\{\mathcal{Y}_t\}$ is generated by model (8) with $E\|e_t\|^4 < \infty$, and Assumption 1 holds. Then $\sqrt{T}\,\mathrm{vec}\big((\hat{\mathcal{A}}_{\mathrm{LTR}} - \mathcal{A})_{[S_2]}\big) \to N(0, \Sigma_{\mathrm{LTR}})$ in distribution as $T\to\infty$, where $\Sigma_{\mathrm{LTR}} = H(H^\top J H)^\dagger H^\top$, and $\dagger$ denotes the Moore–Penrose inverse.

Since the asymptotic theory for overparameterized models in Shapiro (1986) allows for unidentifiability of the components $\mathcal{G}$ and the $U_i$'s in the decomposition of $\mathcal{A}$, Theorem 1 does not require that the Tucker decomposition of $\mathcal{A}$ be unique; see the proof of Theorem 1 in Appendix A for more details.

Next we compare the result in Theorem 1 to those of two other consistent estimators of the proposed model in the low-dimensional setup. Note that the rank of $\mathcal{A}_{[S_2]}$ in (10) is at most $s := \min(\prod_{i=1}^d r_i, \prod_{i=d+1}^{2d} r_i)$. Thus, $\mathcal{A}_{[S_2]}$ can be estimated by both the reduced-rank regression (RRR) and ordinary least squares (OLS) methods,

$$\hat{\mathcal{A}}_{\mathrm{RRR}} = \arg\min_{\mathrm{rank}(\mathcal{B}_{[S_2]})\le s} \frac{1}{T}\sum_{t=1}^{T} \|\mathcal{Y}_t - \langle\mathcal{B}, \mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \quad (21)$$

and

$$\hat{\mathcal{A}}_{\mathrm{OLS}} = \arg\min_{\mathcal{B}} \frac{1}{T}\sum_{t=1}^{T} \|\mathcal{Y}_t - \langle\mathcal{B}, \mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2.$$

Naturally, under model (8), $\hat{\mathcal{A}}_{\mathrm{LTR}}$ is asymptotically more efficient than $\hat{\mathcal{A}}_{\mathrm{RRR}}$ and $\hat{\mathcal{A}}_{\mathrm{OLS}}$:

Corollary 1. If the conditions of Theorem 1 hold, then $\sqrt{T}\,\mathrm{vec}((\hat{\mathcal{A}}_{\mathrm{OLS}} - \mathcal{A})_{[S_2]}) \to N(0, \Sigma_{\mathrm{OLS}})$ and $\sqrt{T}\,\mathrm{vec}((\hat{\mathcal{A}}_{\mathrm{RRR}} - \mathcal{A})_{[S_2]}) \to N(0, \Sigma_{\mathrm{RRR}})$ in distribution as $T\to\infty$. Moreover, it holds that $\Sigma_{\mathrm{LTR}} \le \Sigma_{\mathrm{RRR}} \le \Sigma_{\mathrm{OLS}}$.

To solve the minimization problem in (19), we propose an alternating least squares (ALS) method. Specifically, by the vector representation of the proposed model in (10), the loss function in (19) can be rewritten as

$$L(\mathcal{G}, U_1, \dots, U_{2d}) = \sum_{t=1}^{T} \big\|y_t - (\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1}U_i)^\top y_{t-1}\big\|^2 = \sum_{t=1}^{T} \|y_t - m_t(\theta)\|^2,$$

where $m_t(\theta) = (\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,(\otimes_{i\in S_1}U_i)^\top y_{t-1}$ is the conditional mean of $y_t = \mathrm{vec}(\mathcal{Y}_t)$ and is a function of the parameter vector $\theta = (\theta_0^\top, \theta_1^\top, \dots, \theta_{2d}^\top)^\top$, with $\theta_0 = \mathrm{vec}(\mathcal{G}_{[S_2]})$ and $\theta_i = \mathrm{vec}(U_i)$ for $i = 1,\dots,2d$.
Note that $m_t(\theta)$ is linear in each $\theta_i$ while keeping the other components $\theta_j$ with $0 \le j \neq i \le 2d$ fixed. This is because

$$m_t(\theta) = (\otimes_{i\in S_2}U_i)\,\mathcal{G}_{[S_2]}\,P_{k,1}\big\{\big[(\otimes_{i\in S_1,i\neq k}U_i)^\top\,(\mathcal{Y}_{t-1})_{(k)}^\top\big]\otimes I_{r_k}\big\}\,P_{k,3}\,\mathrm{vec}(U_k)$$
$$= P_{k,2}\big\{\big[(\otimes_{i\in S_2,i\neq d+k}U_i)\,(\mathcal{M}_{t-1})_{(k)}^\top\big]\otimes I_{p_k}\big\}\,\mathrm{vec}(U_{d+k})$$
$$= \big\{\big[y_{t-1}^\top(\otimes_{i\in S_1}U_i)\big]\otimes(\otimes_{i\in S_2}U_i)\big\}\,\mathrm{vec}(\mathcal{G}_{[S_2]}),$$

for $k = 1,\dots,d$, where $P_{k,1} \in \mathbb{R}^{\prod_{i=1}^d r_i\times\prod_{i=1}^d r_i}$, $P_{k,2} \in \mathbb{R}^{p\times p}$ and $P_{k,3} \in \mathbb{R}^{r_k p_k\times r_k p_k}$ are permutation matrices defined such that $P_{k,1}\mathrm{vec}(\mathcal{T}_{(k)}) = \mathrm{vec}(\mathcal{T})$ for any $\mathcal{T} \in \mathbb{R}^{r_1\times\cdots\times r_d}$, $P_{k,2}\mathrm{vec}(\mathcal{T}_{(k)}) = \mathrm{vec}(\mathcal{T})$ for any $\mathcal{T} \in \mathbb{R}^{p_1\times\cdots\times p_d}$, and $P_{k,3}\mathrm{vec}(U_k) = \mathrm{vec}(U_k^\top)$, and $\mathcal{M}_{t-1} \in \mathbb{R}^{r_{d+1}\times\cdots\times r_{2d}}$ is defined such that $\mathrm{vec}(\mathcal{M}_{t-1}) = \mathcal{G}_{[S_2]}(\otimes_{i\in S_1}U_i)^\top y_{t-1}$. Therefore, we can update each component parameter vector $\theta_i$, and hence the corresponding $\mathcal{G}$ and $U_i$'s, iteratively by the least squares method. The resulting ALS algorithm is shown in Algorithm 1.

Algorithm 1 ALS algorithm for the LTR estimator
Initialize: $\mathcal{A}^{(0)} = \hat{\mathcal{A}}_{\mathrm{RRR}}$ or $\hat{\mathcal{A}}_{\mathrm{OLS}}$; HOSVD: $\mathcal{A}^{(0)} \approx \mathcal{G}^{(0)}\times_{i=1}^{2d}U_i^{(0)}$ with multilinear ranks $(r_1,\dots,r_{2d})$
repeat ($s = 0, 1, 2, \dots$)
  for $k = 1,\dots,d$ do
    $U_k^{(s+1)} \leftarrow \arg\min_{U_k}\sum_{t=1}^T \big\|y_t - (\otimes_{i\in S_2}U_i^{(s)})\,\mathcal{G}^{(s)}_{[S_2]}\,P_{k,1}\{[(\otimes_{i\in S_1,i\neq k}U_i^{(s)})^\top(\mathcal{Y}_{t-1})_{(k)}^\top]\otimes I_{r_k}\}\,P_{k,3}\,\mathrm{vec}(U_k)\big\|^2$
  end for
  for $k = 1,\dots,d$ do
    $U_{d+k}^{(s+1)} \leftarrow \arg\min_{U_{d+k}}\sum_{t=1}^T \big\|y_t - P_{k,2}\{[(\otimes_{i\in S_2,i\neq d+k}U_i^{(s)})(\mathcal{M}^{(s+1)}_{t-1})_{(k)}^\top]\otimes I_{p_k}\}\,\mathrm{vec}(U_{d+k})\big\|^2$
  end for
  $\mathcal{G}^{(s+1)} \leftarrow \arg\min_{\mathcal{G}}\sum_{t=1}^T \big\|y_t - \{[y_{t-1}^\top(\otimes_{i\in S_1}U_i^{(s+1)})]\otimes(\otimes_{i\in S_2}U_i^{(s+1)})\}\,\mathrm{vec}(\mathcal{G}_{[S_2]})\big\|^2$
  $\mathcal{A}^{(s+1)} \leftarrow \mathcal{G}^{(s+1)}\times_{i=1}^{2d}U_i^{(s+1)}$
until convergence

As mentioned, the minimization in (19) is unconstrained. Accordingly, to obtain $\hat{\mathcal{A}}_{\mathrm{LTR}}$, no orthogonality constraint on $\mathcal{G}$ and the $U_i$'s is needed in Algorithm 1. Instead, we compute the final unique estimates of $\hat{\mathcal{G}}$ and the $\hat{U}_i$'s by the HOSVD operation based on the unconstrained solution $\hat{\mathcal{A}}_{\mathrm{LTR}}$ obtained from Algorithm 1. Specifically, we compute each $\hat{U}_i$ uniquely as the top $r_i$ left singular vectors of $(\hat{\mathcal{A}}_{\mathrm{LTR}})_{(i)}$ such that the first element in each column of $\hat{U}_i$ is positive, and set $\hat{\mathcal{G}} = \hat{\mathcal{A}}_{\mathrm{LTR}}\times_{i=1}^{2d}\hat{U}_i^\top$. Similar alternating algorithms without imposing identification constraints can be found in the literature on tensor decomposition; see, e.g., Zhou et al. (2013) and Li et al. (2018). A numerical sketch of one of these least squares updates is given below.
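The following minimal sketch illustrates only the core-tensor update of Algorithm 1 for $d = 2$, using random placeholder data and arbitrary dimensions (an illustration under these assumptions, not the authors' implementation); the full ALS loop would cycle analogous least squares updates over the factor matrices as well.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r1, r2, r3, r4 = 4, 3, 2, 2, 2, 2
T = 200
Y = rng.standard_normal((T, p1 * p2))      # vectorized series y_t (placeholder)

# Current factor matrices with orthonormal columns, as in Algorithm 1.
U1 = np.linalg.qr(rng.standard_normal((p1, r1)))[0]
U2 = np.linalg.qr(rng.standard_normal((p2, r2)))[0]
U3 = np.linalg.qr(rng.standard_normal((p1, r3)))[0]
U4 = np.linalg.qr(rng.standard_normal((p2, r4)))[0]

# Reverse-order Kronecker products over S1 = {1,2} and S2 = {3,4}.
U_pred = np.kron(U2, U1)                   # (p1 p2) x (r1 r2)
U_resp = np.kron(U4, U3)                   # (p1 p2) x (r3 r4)

# Core update: y_t = [ (y_{t-1}^T U_pred) kron U_resp ] vec(G_[S2]) + e_t,
# so stacking over t gives an ordinary least squares problem in vec(G_[S2]).
X = np.vstack([np.kron(Y[t - 1] @ U_pred, U_resp) for t in range(1, T)])
y = Y[1:].ravel()
g, *_ = np.linalg.lstsq(X, y, rcond=None)
G_mat = g.reshape(r3 * r4, r1 * r2, order="F")   # G_[S2], column-major vec
```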
5 High-Dimensional Regularized Estimation

The methods in the previous section rely on the assumptions that the dimension is fixed and that the true transition tensor is exactly low-rank with known Tucker ranks. We next relax both assumptions. Specifically, under the high-dimensional setup, we consider regularized estimation of model (8) via different nuclear-norm-type penalties and develop the corresponding non-asymptotic theory under only an approximately low-Tucker-rank assumption on the underlying true transition tensor.

5.1 SN and MN Regularized Estimation

In model (8), the exactly low-rank transition tensor $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$ is subject to the constraints $r_i = \mathrm{rank}(\mathcal{A}_{(i)})$, for $i = 1,\dots,2d$. A commonly used convex relaxation of such multilinear rank constraints is the regularization via the sum of nuclear (SN) norms of all the one-mode matricizations,

$$\|\mathcal{A}\|_{\mathrm{SN}} = \sum_{i=1}^{2d} \|\mathcal{A}_{(i)}\|_*. \quad (22)$$

The SN norm has been widely used in the literature (Gandy et al., 2011; Tomioka et al., 2011; Liu et al., 2013; Raskutti et al., 2019) to simultaneously encourage the low-rankness of all modes of a tensor. This leads us to the SN norm regularized estimator

$$\hat{\mathcal{A}}_{\mathrm{SN}} = \arg\min_{\mathcal{A}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 + \lambda_{\mathrm{SN}}\|\mathcal{A}\|_{\mathrm{SN}}\Big\},$$

where $\lambda_{\mathrm{SN}} \ge 0$ is the tuning parameter. Note that if, instead of $\|\mathcal{A}\|_{\mathrm{SN}}$, only one single nuclear norm, say $\|\mathcal{A}_{(1)}\|_*$, is penalized, then the resulting estimator will only enforce the low-rankness for the first mode of $\mathcal{A}$, while failing to do so for all the other $2d - 1$ modes. A sketch of how this penalty can be computed from the matricizations defined in Section 2 is given below.
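The SN norm in (22) is straightforward to compute once the one-mode matricizations are available; the following minimal sketch (illustrative helper names, arbitrary dimensions) repeats the `unfold` helper from Section 2 for self-containedness.

```python
import numpy as np

def unfold(X, s):
    """Mode-s matricization X_(s)."""
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def nuclear_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def sn_norm(A):
    """Sum of nuclear norms of all one-mode matricizations, as in (22)."""
    return sum(nuclear_norm(unfold(A, i)) for i in range(A.ndim))

# Toy 4th-order transition tensor (d = 2), arbitrary dimensions.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3, 4, 3))
print(sn_norm(A))
```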
To derive the non-asymptotic error bound for $\hat{\mathcal{A}}_{\mathrm{SN}}$, we make the following assumption on the random error $e_t = \mathrm{vec}(\mathcal{E}_t)$.

Assumption 2. Let $e_t = \Sigma_e^{1/2}\xi_t$, where $\{\xi_t\}$ is a sequence of i.i.d. random vectors with $E(\xi_t) = 0$ and $\mathrm{var}(\xi_t) = I_p$, and $\Sigma_e = \mathrm{var}(e_t)$ is a positive definite matrix. In addition, the entries $(\xi_{it})_{1\le i\le p}$ of $\xi_t$ are mutually independent and $\kappa$-sub-Gaussian, i.e., $E(e^{\mu\xi_{it}}) \le e^{\kappa^2\mu^2/2}$, for any $\mu \in \mathbb{R}$ and $i = 1,\dots,p$.

The sub-Gaussianity condition in Assumption 2 is milder than the commonly used normality assumption in the literature on high-dimensional stationary vector autoregressive models (Basu and Michailidis, 2015; Raskutti et al., 2019). This relaxation is made possible through establishing a novel martingale-based concentration bound in the proof of the deviation bound; see Lemma B.5 in Appendix B.4. The covariance matrix $\Sigma_e$ captures the contemporaneous dependency in $\mathcal{E}_t$, and the constant $\kappa$ controls the tail heaviness of the marginal distributions.

For any $z \in \mathbb{C}$, let $\mathbf{A}(z) = I_p - \mathcal{A}_{[S_2]}z$ be a matrix polynomial, where $\mathbb{C}$ is the set of complex numbers. Let $\mu_{\min}(\mathbf{A}) = \min_{|z|=1}\lambda_{\min}(\mathbf{A}^*(z)\mathbf{A}(z))$ and $\mu_{\max}(\mathbf{A}) = \max_{|z|=1}\lambda_{\max}(\mathbf{A}^*(z)\mathbf{A}(z))$, where $\mathbf{A}^*(z)$ is the conjugate transpose of $\mathbf{A}(z)$; see Basu and Michailidis (2015) for more discussion of the connection between the spectral density of the VAR process and these two quantities. In addition, let

$$\alpha_{\mathrm{RSC}} = \frac{\lambda_{\min}(\Sigma_e)}{\mu_{\max}(\mathbf{A})}, \quad M_1 = \frac{\lambda_{\max}(\Sigma_e)}{\mu_{\min}(\mathbf{A})}, \quad \text{and} \quad M_2 = \frac{\lambda_{\min}(\Sigma_e)/\mu_{\max}(\mathbf{A})}{\lambda_{\max}(\Sigma_e)/\mu_{\min}(\mathbf{A})}.$$

The exact low-rankness of the true transition tensor $\mathcal{A}$ assumed in Section 4 could be too stringent in real-world applications. In what follows, we relax it to the following approximately low-rank assumption: we assume that all one-mode matricizations of the underlying true transition tensor $\mathcal{A}$ belong to the set of approximately low-rank matrices, namely $\mathcal{A}_{(i)} \in \mathbb{B}_q(r_q^{(i)}; p_i, p_{-i}p)$ for some $q \in [0,1)$, where

$$\mathbb{B}_q(r; d_1, d_2) := \Big\{M \in \mathbb{R}^{d_1\times d_2} : \sum_{i=1}^{\min(d_1,d_2)}\sigma_i(M)^q \le r\Big\},$$

and $p_{-i} = p/p_i = \prod_{j=1,j\neq i}^{d} p_j$. For the convenience of notation, here we adopt the convention $0^0 = 0$. Note that when $q = 0$, $\mathbb{B}_0(r; d_1, d_2)$ is the set of $d_1$-by-$d_2$ matrices with rank at most $r$. For $q > 0$, the restriction $\sum_{i=1}^{\min(d_1,d_2)}\sigma_i(M)^q \le r$ requires that the singular values decay fast, and it is more general and natural than the exactly low-rank assumption.

Theorem 2. Under Assumptions 1 and 2, if $T \gtrsim \max_{1\le i\le 2d} p_{-i}p + \max(\kappa^2,\kappa^4)M_2^{-2}p$, $\lambda_{\mathrm{SN}} \gtrsim \kappa M_1 (2d)^{-1}\sum_{i=1}^{2d}\sqrt{p_{-i}p/T}$, and $\mathcal{A}_{(i)} \in \mathbb{B}_q(r_q^{(i)}; p_i, p_{-i}p)$ for some $q \in [0,1)$ and all $i = 1,\dots,2d$, then with probability at least $1 - \sum_{i=1}^{2d}\exp(-Cp_{-i}p) - \exp[-C\min(\kappa^{-2},\kappa^{-4})M_2^2 p]$,

$$\|\hat{\mathcal{A}}_{\mathrm{SN}} - \mathcal{A}\|_{\mathrm{F}} \lesssim \sqrt{r_q}\,\Big(\frac{2d\cdot\lambda_{\mathrm{SN}}}{\alpha_{\mathrm{RSC}}}\Big)^{1-q/2},$$

where $r_q = (2d)^{-1}\sum_{i=1}^{2d} r_q^{(i)}$.

By Theorem 2, when $\lambda_{\mathrm{SN}} \asymp \kappa M_1(2d)^{-1}\sum_{i=1}^{2d}\sqrt{p_{-i}p/T}$, the estimation error bound scales as $\sqrt{r_q}\,(\max_{1\le i\le 2d}p_{-i}p/T)^{1/2-q/4}$; note that the factor $2d$ in the error bound is canceled by the $(2d)^{-1}$ in the rate of $\lambda_{\mathrm{SN}}$. When $q = 0$, namely when $\mathcal{A}$ is exactly low-rank with Tucker ranks $(r_0^{(1)},\dots,r_0^{(2d)})$, the error bound reduces to $\sqrt{r_0\max_{1\le i\le 2d}p_{-i}p/T}$, which is comparable to that in Tomioka et al. (2011) for i.i.d. tensor regression.

However, recent research (e.g., Mu et al., 2014; Raskutti et al., 2019) has shown that the SN norm regularization approach is suboptimal, mainly because $\mathcal{A}_{(i)}$ is an unbalanced fat-and-short matricization of a higher-order tensor. Technically, in the proof of Theorem 2, an essential intermediate step is to establish the deviation bound, where we need to upper bound the operator norm of a sub-Gaussian random matrix with the same dimensions as $\mathcal{A}_{(i)}$; see Lemma B.5 in Appendix B.3 for details. Undesirably, the order of this operator norm is dominated by the larger of the row and column dimensions of the matrix $\mathcal{A}_{(i)} \in \mathbb{R}^{p_i\times p_{-i}p}$, and hence by the column dimension $p_{-i}p$, which eventually appears in the error bound. This indicates that the imbalance of the matricization leads to the efficiency bottleneck of the SN estimator.

On the other hand, similarly to (21), since the reduced-rank VAR model can be regarded as an overparameterization of the proposed LRTAR model, one may alternatively focus on the approximately low-rank structure of the transition matrix in the VAR representation (10), and adopt the matrix nuclear (MN) norm estimator (Negahban and Wainwright, 2011) to estimate $\mathcal{A}$,

$$\hat{\mathcal{A}}_{\mathrm{MN}} = \arg\min_{\mathcal{A}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 + \lambda_{\mathrm{MN}}\|\mathcal{A}_{[S_1]}\|_*\Big\}. \quad (23)$$

Note that the multi-mode matricization $\mathcal{A}_{[S_1]} = \mathcal{A}_{[S_2]}^\top$ is a $p\times p$ square matrix. Thus, the loss of efficiency due to the unbalanced matricization can be avoided, which is confirmed by the following theorem.

Theorem 3. Under Assumptions 1 and 2, if $T \gtrsim [1 + \max(\kappa^2,\kappa^4)M_2^{-2}]p$, $\lambda_{\mathrm{MN}} \gtrsim \kappa M_1\sqrt{p/T}$, and $\mathcal{A}_{[S_1]} \in \mathbb{B}_q(s_q^{(1)}; p, p)$ for some $q \in [0,1)$, then with probability at least $1 - \exp(-Cp) - \exp[-C\min(\kappa^{-2},\kappa^{-4})M_2^2 p]$,

$$\|\hat{\mathcal{A}}_{\mathrm{MN}} - \mathcal{A}\|_{\mathrm{F}} \lesssim \sqrt{s_q^{(1)}}\,\Big(\frac{\lambda_{\mathrm{MN}}}{\alpha_{\mathrm{RSC}}}\Big)^{1-q/2}.$$

Theorem 3 shows that, with $\lambda_{\mathrm{MN}} \asymp \kappa M_1\sqrt{p/T}$, the estimation error bound for $\hat{\mathcal{A}}_{\mathrm{MN}}$ scales as $\sqrt{s_q^{(1)}}\,(p/T)^{1/2-q/4}$.
This result is comparable to that in Negahban and Wainwright (2011) for reduced-rank VAR models, yet we relax both the constraint $\|A\|_{\mathrm{op}} < 1$ on the transition matrix $A$ and the normality assumption on the random error in their paper. This estimation error bound is clearly smaller than that in Theorem 2, as $(\max_{1\le i\le 2d}p_{-i}p/T)^{1/2-q/4}$ can in general be much larger than $(p/T)^{1/2-q/4}$ when $s_q^{(1)} \asymp r_q$. Therefore, adopting the square matricization can indeed improve the estimation performance.

The idea of using square matricization to improve efficiency was adopted by Mu et al. (2014) in low-rank tensor completion problems. Their proposed method, called the square deal, is to first unfold a general higher-order tensor into a matrix with similar numbers of rows and columns, and then use the MN norm as the regularizer. However, for our estimation problem, despite the advantage of $\hat{\mathcal{A}}_{\mathrm{MN}}$ over $\hat{\mathcal{A}}_{\mathrm{SN}}$, Theorem 3 reveals another drawback of $\hat{\mathcal{A}}_{\mathrm{MN}}$. That is, the error bound for $\hat{\mathcal{A}}_{\mathrm{MN}}$ depends on the $\ell_q$ radius $s_q^{(1)}$ of the singular values of $\mathcal{A}_{[S_1]}$, suggesting that $\hat{\mathcal{A}}_{\mathrm{MN}}$ may perform badly when $s_q^{(1)}$ is relatively large. In other words, unless we have prior knowledge that the $\ell_q$ "norm" of the singular values of the particular matricization $\mathcal{A}_{[S_1]}$ is truly small, $\hat{\mathcal{A}}_{\mathrm{MN}}$ may not be desirable in practice.

On the other hand, although the SN regularizer in (22) suffers from inefficiency due to the imbalance of the one-mode matricizations, it has the attractive feature of simultaneously enforcing the low-rankness across all modes of the tensor $\mathcal{A}$, and thus is more efficient than its counterpart which considers only one single one-mode matricization, say, $\|\mathcal{A}_{(1)}\|_*$. Similarly, if we can enforce the approximate low-rankness across all possible square matricizations of $\mathcal{A}$, the estimation performance may be further improved upon $\hat{\mathcal{A}}_{\mathrm{MN}}$. This motivates us to propose a new regularization approach in the next subsection.

5.2 SSN Regularized Estimation

In this subsection, we propose a novel convex regularizer which improves upon both the SN and MN regularizers in (22) and (23), respectively, by simultaneously encouraging the low-rankness across all possible square matricizations of the transition tensor $\mathcal{A}$.

For any $2d$-th-order tensor $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, its multi-mode matricization $\mathcal{A}_{[I]}$ will be a $p\times p$ square matrix, with $p = \prod_{i=1}^d p_i$, if the index set is chosen as $I = \{\ell_1,\dots,\ell_d\}$, where each index $\ell_i$ is set to either $i$ or $d+i$, for $i = 1,\dots,d$. For instance, $\mathcal{A}_{[S_1]}$ is the square matricization formed by setting $\ell_i = i$ for all $i = 1,\dots,d$. Moreover, if $\mathcal{A}$ has multilinear ranks $(r_1,\dots,r_{2d})$, then the rank of the matricization $\mathcal{A}_{[I]}$ is at most $\min(\prod_{i\in I} r_i, \prod_{i\notin I} r_i)$. Therefore, if we penalize the sum of nuclear norms of all such square matricizations, which we call the sum of square-matrix nuclear (SSN) norm for simplicity, then the resulting estimator would enjoy the efficiency gain from both the use of square matricizations and the simultaneous incorporation of many rank constraints.

Obviously, there are $2^d$ possible choices of the index set $I$ corresponding to a square matricization $\mathcal{A}_{[I]}$. However, since $\mathcal{A}_{[I]} = \mathcal{A}_{[I^\complement]}^\top$, when defining the SSN norm we only need to include one of $I$ and its complement $I^\complement$. A simple way to do so is to choose only the sets containing the index one.
That is, we fix $\ell_1 = 1$ and choose $\ell_i = i$ or $d+i$ for $i = 2,\dots,d$. This results in totally $2^{d-1}$ chosen index sets, denoted by $I_1, I_2, \dots, I_{2^{d-1}}$. Note that $I_1 = S_1 = \{1,\dots,d\}$. For example, when $d = 3$, we have four chosen index sets, $I_1 = \{1,2,3\}$, $I_2 = \{1,2,6\}$, $I_3 = \{1,5,3\}$ and $I_4 = \{1,5,6\}$.

Based on the above choice of the $2^{d-1}$ index sets, we introduce the following SSN norm,

$$\|\mathcal{A}\|_{\mathrm{SSN}} = \sum_{k=1}^{2^{d-1}} \|\mathcal{A}_{[I_k]}\|_*.$$

The corresponding estimator is defined as

$$\hat{\mathcal{A}}_{\mathrm{SSN}} = \arg\min_{\mathcal{A}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 + \lambda_{\mathrm{SSN}}\|\mathcal{A}\|_{\mathrm{SSN}}\Big\}, \quad (24)$$

where $\lambda_{\mathrm{SSN}}$ is the tuning parameter; a small sketch of the construction of the index sets $I_k$ and of the SSN norm is given at the end of this subsection.

If the one-mode matricizations have ranks $\mathrm{rank}(\mathcal{A}_{(i)}) = r_i$, each square matricization $\mathcal{A}_{[I_k]}$ is also low-rank, with $\mathrm{rank}(\mathcal{A}_{[I_k]}) \le \min(\prod_{i\in I_k} r_i, \prod_{i\notin I_k} r_i)$. Similarly, if all the $\mathcal{A}_{(i)}$'s are approximately low-rank, the square matricizations are approximately low-rank as well. In contrast to the SN norm in (22), which directly matches the multilinear ranks $\mathrm{rank}(\mathcal{A}_{(i)})$ for $i = 1,\dots,2d$, the SSN norm encourages the multilinear low-rank structure of $\mathcal{A}$ by simultaneously enforcing the low-rankness of all the square matricizations $\mathcal{A}_{[I_k]}$. The following theorem gives the theoretical results for $\hat{\mathcal{A}}_{\mathrm{SSN}}$.

Theorem 4. Under Assumptions 1 and 2, if $T \gtrsim [1 + \max(\kappa^2,\kappa^4)M_2^{-2}]p$, $\lambda_{\mathrm{SSN}} \gtrsim \kappa M_1 2^{-(d-1)}\sqrt{p/T}$, and $\mathcal{A}_{[I_k]} \in \mathbb{B}_q(s_q^{(k)}; p, p)$ for some $q \in [0,1)$ and all $k = 1,\dots,2^{d-1}$, then with probability at least $1 - \exp[-C(p - 2^d)] - \exp[-C\min(\kappa^{-2},\kappa^{-4})M_2^2 p]$,

$$\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}} \lesssim \sqrt{s_q}\,\Big(\frac{2^{d-1}\lambda_{\mathrm{SSN}}}{\alpha_{\mathrm{RSC}}}\Big)^{1-q/2},$$

where $s_q = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_q^{(k)}$.

By Theorem 4, when $\lambda_{\mathrm{SSN}} \asymp \kappa M_1 2^{-(d-1)}\sqrt{p/T}$, the estimation error bound scales as $\sqrt{s_q}\,(p/T)^{1/2-q/4}$ and reduces to $\sqrt{s_0 p/T}$ in the exactly low-rank setting with $q = 0$. For a clearer comparison among the three estimators $\hat{\mathcal{A}}_{\mathrm{SN}}$, $\hat{\mathcal{A}}_{\mathrm{MN}}$ and $\hat{\mathcal{A}}_{\mathrm{SSN}}$, we summarize the main results of Theorems 2–4 in Table 1. First, both $\hat{\mathcal{A}}_{\mathrm{SSN}}$ and $\hat{\mathcal{A}}_{\mathrm{MN}}$ have much smaller error bounds and less stringent sample size requirements than $\hat{\mathcal{A}}_{\mathrm{SN}}$, due to the diverging dimension $p_{-i}$ in the results of the latter. This reaffirms the advantage of the square matricizations.

                   SN                                                            MN                                      SSN
Sample size        $T \gtrsim (\max_{1\le i\le 2d} p_{-i} + M_2^{-2})p$          $T \gtrsim (1 + M_2^{-2})p$             $T \gtrsim (1 + M_2^{-2})p$
Estimation error   $\sqrt{r_q}(\max_{1\le i\le 2d} p_{-i}p/T)^{1/2-q/4}$         $\sqrt{s_q^{(1)}}(p/T)^{1/2-q/4}$       $\sqrt{s_q}(p/T)^{1/2-q/4}$

Table 1: Summary of the sample size conditions and error upper bounds in Theorems 2–4, where $p_{-i} = \prod_{j=1,j\neq i}^{d} p_j$, $r_q = (2d)^{-1}\sum_{i=1}^{2d} r_q^{(i)}$, and $s_q = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_q^{(k)}$.

Secondly, comparing $\hat{\mathcal{A}}_{\mathrm{SSN}}$ to $\hat{\mathcal{A}}_{\mathrm{MN}}$: since the factor $s_q$ in the error bound of $\hat{\mathcal{A}}_{\mathrm{SSN}}$ is the average of all the $s_q^{(k)}$ for $k = 1,\dots,2^{d-1}$, $\hat{\mathcal{A}}_{\mathrm{SSN}}$ can protect us from the bad scenarios where the $\ell_q$ "norm" of the singular values of $\mathcal{A}_{[S_1]}$ is relatively large. If all the $s_q^{(k)}$'s are of the same order, then the error upper bounds for $\hat{\mathcal{A}}_{\mathrm{SSN}}$ and $\hat{\mathcal{A}}_{\mathrm{MN}}$ in Table 1 will be similar. However, our simulation results in Section 6 show that $\hat{\mathcal{A}}_{\mathrm{SSN}}$ clearly outperforms $\hat{\mathcal{A}}_{\mathrm{MN}}$ under various settings, even when $s_q^{(1)} = \cdots = s_q^{(2^{d-1})}$. Indeed, the error bound for $\hat{\mathcal{A}}_{\mathrm{SSN}}$ in Theorem 4 is likely to be loose, which we believe is caused by taking upper bounds on the dual norm of the SSN norm in the proof of Lemma B.3; see Appendix B.3 for details. By contrast, the error bound for $\hat{\mathcal{A}}_{\mathrm{MN}}$ is minimax-optimal (Han et al., 2015b). Therefore, although our theoretical results are not sharp enough to distinguish clearly between the error rates of $\hat{\mathcal{A}}_{\mathrm{SSN}}$ and $\hat{\mathcal{A}}_{\mathrm{MN}}$, we conjecture that the actual rate of the former is generally smaller than that of the latter. Methodologically, this is also easy to understand because, unlike $\hat{\mathcal{A}}_{\mathrm{MN}}$, $\hat{\mathcal{A}}_{\mathrm{SSN}}$ simultaneously enforces the low-rankness across all square matricizations of $\mathcal{A}$ rather than just on $\mathcal{A}_{[S_1]}$.
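To make the construction of the $2^{d-1}$ index sets and the SSN norm concrete, here is a minimal sketch (illustrative only; dimensions are arbitrary and the helper names are hypothetical). Indices are 0-based, so the set $\{1, 2, 6\}$ above becomes `[0, 1, 5]`.

```python
import numpy as np
from itertools import product

def square_matricization(A, row_modes):
    """Multi-mode matricization A_[I]: modes in I to rows (column-major)."""
    cols = [i for i in range(A.ndim) if i not in row_modes]
    n_rows = int(np.prod([A.shape[i] for i in row_modes]))
    return np.transpose(A, list(row_modes) + cols).reshape(n_rows, -1, order="F")

def ssn_index_sets(d):
    """All 2^(d-1) index sets {l_1,...,l_d} with l_1 = 1 fixed and
    l_i in {i, d+i} for i >= 2, one set per (I, complement) pair."""
    choices = [[i, d + i] for i in range(1, d)]
    return [[0] + list(c) for c in product(*choices)]

def ssn_norm(A):
    d = A.ndim // 2
    return sum(
        np.linalg.svd(square_matricization(A, I), compute_uv=False).sum()
        for I in ssn_index_sets(d)
    )

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 2, 2, 3, 2, 2))   # d = 3, so 2^(d-1) = 4 index sets
print(ssn_index_sets(3))                      # [[0,1,2], [0,1,5], [0,4,2], [0,4,5]]
print(ssn_norm(A))
```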
Remark 1. While our SSN regularization is proposed in the time series context, the idea of imposing joint penalties on all (close to) square matricizations of the coefficient tensor can potentially be extended to general higher-order tensor estimation problems. Moreover, it can be refined to accommodate particular structures of the data. For example, if some of the $d$ modes of the tensor-valued time series $\mathcal{Y}_t$, namely $p_1,\dots,p_d$, are equal, then an even greater number of possible square matricizations of $\mathcal{A}$ can be formed, resulting in improved estimation efficiency.

5.3 Truncated Regularized Estimation

While the proposed regularized estimation methods do not require exact low-rankness of the true transition tensor $\mathcal{A}$, sometimes imposing exact low-rankness may be more desirable if one wants to interpret the underlying low-dimensional tensor dynamics. As discussed in Section 3, the Tucker ranks can be interpreted as the numbers of dynamic factors in each mode. In this subsection, we propose a truncation method to consistently estimate the true multilinear ranks $(r_1,\dots,r_{2d})$ under the exact low-rank assumption.

Let $\gamma$ be a threshold parameter to be chosen properly. Given the estimator $\hat{\mathcal{A}}_{\mathrm{SSN}}$, for each $i = 1,\dots,2d$, we calculate the singular value decomposition (SVD) of the mode-$i$ matricization $(\hat{\mathcal{A}}_{\mathrm{SSN}})_{(i)}$ with the singular values arranged in descending order. Next we truncate the SVD by retaining only the singular values greater than $\gamma$, and take their corresponding left singular vectors to form the matrix $\tilde{U}_i$. Then the truncated core tensor is defined as $\tilde{\mathcal{G}} = \hat{\mathcal{A}}_{\mathrm{SSN}}\times_{i=1}^{2d}\tilde{U}_i^\top$, based on which we propose the truncated sum of square-matrix nuclear (TSSN) estimator $\hat{\mathcal{A}}_{\mathrm{TSSN}} = \tilde{\mathcal{G}}\times_{i=1}^{2d}\tilde{U}_i$; a sketch of this truncation step is given below.

To derive the theoretical results on rank selection, we make the following assumption on the exact Tucker ranks and the magnitude of the singular values.

Assumption 3. For all $i = 1,\dots,2d$, $\sigma_r(\mathcal{A}_{(i)}) = 0$ for all $r > r_i$, and there exists a constant $C > 1$ such that $\min_{1\le i\le 2d}\sigma_{r_i}(\mathcal{A}_{(i)}) \ge C\gamma$. As $T\to\infty$, the threshold parameter satisfies $\gamma \gg (\kappa M_1/\alpha_{\mathrm{RSC}})\sqrt{s_0 p/T}$, where $s_0 = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}}\mathrm{rank}(\mathcal{A}_{[I_k]})$.

Assumption 3 requires that $\mathcal{A}$ has exact Tucker ranks $(r_1,\dots,r_{2d})$ which do not diverge too fast. The smallest positive singular value of each $\mathcal{A}_{(i)}$ is assumed to be bounded away from the threshold $\gamma$ when the sample size is sufficiently large. Since Assumption 3 involves unknown quantities, it cannot be used directly for determining $\gamma$ in practice. Instead, we recommend using the data-driven threshold parameter $\gamma$ described at the end of the next subsection.
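The truncation step can be written in a few lines; the following sketch (hypothetical helper names, reusing the `unfold` convention from Section 2; not the authors' code) selects the ranks by thresholding singular values and rebuilds the TSSN estimator.

```python
import numpy as np

def unfold(X, s):
    return np.moveaxis(X, s, 0).reshape(X.shape[s], -1, order="F")

def tucker_product(G, Us):
    """G x_1 U_1 x_2 U_2 ... : multiply mode i of G by U_i."""
    out = G
    for i, U in enumerate(Us):
        out = np.moveaxis(
            np.tensordot(U, np.moveaxis(out, i, 0), axes=(1, 0)), 0, i
        )
    return out

def tssn_truncate(A_ssn, gamma):
    """Keep, in each mode, the left singular vectors whose singular values
    exceed gamma; then form the truncated core and the TSSN estimator."""
    Us = []
    for i in range(A_ssn.ndim):
        U, s, _ = np.linalg.svd(unfold(A_ssn, i), full_matrices=False)
        r_i = max(int((s > gamma).sum()), 1)      # estimated rank of mode i
        Us.append(U[:, :r_i])
    G = tucker_product(A_ssn, [U.T for U in Us])  # core: A x_i U_i^T
    return tucker_product(G, Us), [U.shape[1] for U in Us]
```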
The rank selection consistency of the truncation method and the asymptotic estimation error rate of $\hat{\mathcal{A}}_{\mathrm{TSSN}}$ are given by the following theorem.

Theorem 5. Under the conditions of Theorem 4 and Assumption 3, if the tuning parameter $\lambda_{\mathrm{SSN}} \asymp \kappa M_1 2^{-(d-1)}\sqrt{p/T}$, then

$$P\big\{\mathrm{rank}\big((\hat{\mathcal{A}}_{\mathrm{TSSN}})_{(i)}\big) = \mathrm{rank}(\mathcal{A}_{(i)}),\ \text{for } i = 1,\dots,2d\big\} \to 1,$$

as $T\to\infty$, and for any fixed $d$,

$$\|\hat{\mathcal{A}}_{\mathrm{TSSN}} - \mathcal{A}\|_{\mathrm{F}} = O_p\big(\sqrt{s_0 p/T}\big),$$

where $s_0$ is defined as in Assumption 3.

5.4 Algorithms and Tuning Parameter Selection

This subsection presents the algorithm for the proposed (T)SSN regularized estimator. The algorithm for $\hat{\mathcal{A}}_{\mathrm{SN}}$ can be developed analogously, while $\hat{\mathcal{A}}_{\mathrm{MN}}$ can be obtained easily as in Negahban and Wainwright (2011).

The objective function for the estimator $\hat{\mathcal{A}}_{\mathrm{SSN}}$ in (24) can be rewritten as

$$L_T(\mathcal{A}) + \lambda_{\mathrm{SSN}}\|\mathcal{A}\|_{\mathrm{SSN}} = L_T(\mathcal{A}) + \lambda_{\mathrm{SSN}}\sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*, \quad (25)$$

where $L_T(\mathcal{A}) = T^{-1}\sum_{t=1}^{T}\|\mathcal{Y}_t - \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2$ is the quadratic loss function. In (25), the regularizer $\|\mathcal{A}\|_{\mathrm{SSN}}$ involves $2^{d-1}$ nuclear norms $\|\mathcal{A}_{[I_k]}\|_*$, which are challenging to handle at the same time. A similar difficulty also occurs in low-rank tensor completion, for which Gandy et al. (2011) applied the alternating direction method of multipliers (ADMM) algorithm (Boyd et al., 2011) to efficiently separate the different nuclear norms. Borrowing the idea of Gandy et al. (2011), we develop an ADMM algorithm for the minimization of (25).

To separate the $2^{d-1}$ nuclear norms in $\|\mathcal{A}\|_{\mathrm{SSN}}$, for each $\mathcal{A}_{[I_k]}$ we introduce a different dummy variable $\mathcal{W}_k$ as a surrogate for $\mathcal{A}$, where $k = 1,\dots,2^{d-1}$. Then the augmented Lagrangian is

$$L(\mathcal{A}, \mathcal{W}, \mathcal{C}) = L_T(\mathcal{A}) + \sum_{k=1}^{2^{d-1}}\Big[\lambda_{\mathrm{SSN}}\|(\mathcal{W}_k)_{[I_k]}\|_* + \rho\langle\mathcal{C}_k, \mathcal{A} - \mathcal{W}_k\rangle + \frac{\rho}{2}\|\mathcal{A} - \mathcal{W}_k\|_{\mathrm{F}}^2\Big],$$

where the $\mathcal{C}_k$ are the Lagrange multipliers, for $k = 1,\dots,2^{d-1}$, and $\rho > 0$ is the regularization parameter. Then we can iteratively update $\mathcal{A}$, the $\mathcal{W}_k$ and the $\mathcal{C}_k$ by the ADMM, as shown in Algorithm 2.

Algorithm 2 ADMM algorithm for the (T)SSN estimator
Initialize: $\mathcal{C}_k^{(0)}$, $\mathcal{W}_k^{(0)} = \mathcal{A}^{(0)} = \hat{\mathcal{A}}_{\mathrm{MN}}$, for $k = 1,\dots,2^{d-1}$; threshold parameter $\gamma$
for $j \in \{0, 1, \dots, J-1\}$ do
  $\mathcal{A}^{(j+1)} \leftarrow \arg\min_{\mathcal{A}}\big\{L_T(\mathcal{A}) + \sum_{k=1}^{2^{d-1}}\frac{\rho}{2}\|\mathcal{A} - \mathcal{W}_k^{(j)} + \mathcal{C}_k^{(j)}\|_{\mathrm{F}}^2\big\}$
  for $k \in \{1, 2, \dots, 2^{d-1}\}$ do
    $\mathcal{W}_k^{(j+1)} \leftarrow \arg\min_{\mathcal{W}_k}\big\{\frac{\rho}{2}\|\mathcal{A}^{(j+1)} - \mathcal{W}_k + \mathcal{C}_k^{(j)}\|_{\mathrm{F}}^2 + \lambda_{\mathrm{SSN}}\|(\mathcal{W}_k)_{[I_k]}\|_*\big\}$
    $\mathcal{C}_k^{(j+1)} \leftarrow \mathcal{C}_k^{(j)} + \mathcal{A}^{(j+1)} - \mathcal{W}_k^{(j+1)}$
  end for
end for
$\hat{\mathcal{A}}_{\mathrm{SSN}} \leftarrow \mathcal{A}^{(J)}$
for $i \in \{1, 2, \dots, 2d\}$ do
  $\tilde{U}_i \leftarrow \mathrm{TruncatedSVD}\big((\hat{\mathcal{A}}_{\mathrm{SSN}})_{(i)}, \gamma\big)$
end for
$\tilde{\mathcal{G}} \leftarrow \hat{\mathcal{A}}_{\mathrm{SSN}}\times_{i=1}^{2d}\tilde{U}_i^\top$
$\hat{\mathcal{A}}_{\mathrm{TSSN}} \leftarrow \tilde{\mathcal{G}}\times_{i=1}^{2d}\tilde{U}_i$

In Algorithm 2, the $\mathcal{A}$-update step is an $\ell_2$-regularized least squares problem. Similarly to Gandy et al. (2011), the $\mathcal{W}_k$-update step can be solved by applying the explicit soft-thresholding operator to the singular values of $(\mathcal{A}^{(j+1)} + \mathcal{C}_k^{(j)})_{[I_k]}$. Both subproblems have closed-form solutions.
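The closed-form $\mathcal{W}_k$-update is singular value soft-thresholding applied in the $[I_k]$ square matricization; a minimal self-contained sketch follows (hypothetical helper names, not the authors' code).

```python
import numpy as np

def matricize(A, row_modes):
    """Square matricization A_[I] (Fortran order); also return the mode
    permutation so that the operation can be inverted."""
    cols = [i for i in range(A.ndim) if i not in row_modes]
    perm = list(row_modes) + cols
    n = int(np.prod([A.shape[i] for i in row_modes]))
    return np.transpose(A, perm).reshape(n, -1, order="F"), perm

def fold_back(M, perm, shape):
    """Inverse of matricize: reshape and undo the mode permutation."""
    X = M.reshape([shape[i] for i in perm], order="F")
    return np.transpose(X, np.argsort(perm))

def svt(M, tau):
    """Singular value soft-thresholding, the prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def w_update(A_new, C_k, row_modes, lam, rho):
    """W_k-update of Algorithm 2: soft-threshold the singular values of
    (A^(j+1) + C_k^(j)) in the [I_k] square matricization."""
    M, perm = matricize(A_new + C_k, row_modes)
    return fold_back(svt(M, lam / rho), perm, (A_new + C_k).shape)
```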
Thus, the minimization of (25) can be solved efficiently.

For the tuning parameter selection, since the cross-validation method is unsuitable for time series or other intrinsically ordered data, we apply the Bayesian information criterion (BIC) to select the optimal $\lambda_{\mathrm{SSN}}$ from a sequence of tuning parameter values, where the number of degrees of freedom is defined as $2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_k(2p - s_k)$, with $s_k$ the retained rank of the $k$-th square matricization. For the threshold parameter $\gamma$ of the truncated estimator, we recommend a data-driven choice proportional to $2^{d-1}\hat{\lambda}_{\mathrm{SSN}}$, where $\hat{\lambda}_{\mathrm{SSN}}$ is the optimal tuning parameter selected by the BIC.

6 Simulation Studies

We conduct three simulation experiments to examine the finite-sample performance of the proposed low- and high-dimensional estimation methods for the LRTAR model in the previous sections. Throughout, we generate the data from model (8) with $\mathrm{vec}(\mathcal{E}_t) \overset{\text{i.i.d.}}{\sim} N(0, I_p)$. The entries of the core tensor $\mathcal{G}$ are generated independently from $N(0,1)$ and rescaled such that $\|\mathcal{G}\|_{\mathrm{F}} = 5$. The factor matrices $U_i$ are generated by extracting the leading singular vectors of Gaussian random matrices while ensuring the stationarity condition in Assumption 1.

The first experiment focuses on the proposed low-dimensional estimation method considered in Section 4. We consider four cases of data generating processes. In cases (1a) and (1b), we consider $d = 2$ with fixed small multilinear ranks $(r_1, r_2, r_3, r_4)$; in cases (2a) and (2b), we consider $d = 3$ with fixed small multilinear ranks $(r_1, \dots, r_6)$. The cases differ in the dimensions $p_i$: (1a) $p_1 = p_2 = 5$; (1b) $p_1 = p_2 = 10$; (2a) $p_1 = p_2 = p_3 = 5$; and (2b) $p_1 = p_2 = p_3 = 7$. We repeat each data generating process 300 times, and conduct the estimation using the true multilinear ranks. Figure 2 displays the average estimation error against the sample size $T$ (from 1000 onward).

The second experiment examines the error rate of the proposed SSN estimator. When $\mathcal{A}$ is exactly low-rank, we have $\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}}^2 = O_p(s_0 p/T)$, where $s_0 = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}} s_0^{(k)}$ with $s_0^{(k)} = \mathrm{rank}(\mathcal{A}_{[I_k]})$. Note that this rate depends on the overall dimension $p = \prod_{i=1}^d p_i$, but is independent of the individual dimensions $p_i$. To examine the relationship between the estimation error and the overall dimension $p$, sample size $T$, multilinear ranks $r_i$ and individual dimensions $p_i$, we generate the data under the eight settings listed in Table 2, and plot $\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}}^2$ against the varying parameter in each case in Figure 3, including $p$, $1/T$, $s_0$ and different settings of the $p_i$ under a fixed overall dimension $p$. The first three columns of Figure 3 show that $\|\hat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}\|_{\mathrm{F}}^2$ roughly scales linearly in $p$, $1/T$ and $s_0$, while the last column suggests that the estimation error is invariant under different settings of the individual $p_i$ as long as $p$ and the other parameters are fixed. This lends support to the theoretical error bound for the SSN estimator.

In the third experiment, we compare the performance of all four high-dimensional estimators discussed in Section 5, namely the SN, MN, SSN and TSSN estimators. We consider four cases of data generating processes. In cases (3a) and (3b), we consider $d = 2$ with fixed small multilinear ranks $(r_1, r_2, r_3, r_4)$; in cases (4a) and (4b), we consider $d = 3$ with fixed small multilinear ranks $(r_1, \dots, r_6)$. The cases differ in the dimensions $p_i$: (3a) $p_1 = p_2 = 5$; (3b) $p_1 = p_2 = 10$; (4a) $p_1 = p_2 = p_3 = 5$; and (4b) $p_1 = p_2 = p_3 = 7$. For each setting, we repeat the simulation 300 times, and conduct the estimation using SN, MN, SSN and TSSN. The tuning parameter and the truncation threshold parameter are selected by the method described in Section 5.4.
In Figure 4, the average estimation error is plotted against $T\in[400,\dots]$ for the cases with $d = 2$ and against $T\in[600,\dots]$ for those with $d = 3$. In addition, the TSSN estimator generally performs better than the SSN, probably because the former yields a more parsimonious model which further improves the estimation efficiency.

7 Real Data Analysis

We first consider the modeling of a matrix-valued time series. The data is the monthly market-adjusted return series of the Fama-French $10\times10$ portfolios from January 1979 to December 2019, obtained from French (2020). The portfolios are constructed as the intersections of 10 portfolios formed by the book-to-market (B/M) ratio and 10 portfolios formed by the size (market value of equity). Hence, $d = 2$, $p_1 = p_2 = 10$ and $T = 492$ months. We remove the market effect for each portfolio by subtracting its average return from the original return series.

Let $\mathcal{Y}_t\in\mathbb{R}^{10\times10}$ be the observed matrix-valued time series, with its rows and columns corresponding to different B/M ratios (sorted from lowest to highest) and sizes (sorted from smallest to largest), respectively, and denote $y_t = \mathrm{vec}(\mathcal{Y}_t)$. For comparison, we consider the following five candidate models:

- Vector autoregression (VAR): $y_t = Ay_{t-1} + e_t$, where $A\in\mathbb{R}^{100\times100}$. The model is estimated by the least squares method.

- Vector factor model (VFM): $y_t = \Lambda f_t + e_t$, where $f_t$ is the low-dimensional vector-valued latent factor, and $\Lambda$ is the loading matrix. The model is estimated by the method in Lam et al. (2012), and for prediction, the estimated factors $\widehat{f}_t$ are then fitted by a VAR(1) model.

- Matrix autoregression (MAR): $\mathcal{Y}_t = B_1\mathcal{Y}_{t-1}B_2^\top + \mathcal{E}_t$, where $B_1, B_2\in\mathbb{R}^{10\times10}$ are coefficient matrices; see Example 1. The model is estimated by the iterated least squares method in Chen et al. (2020b).

- Matrix factor model (MFM): $\mathcal{Y}_t = RF_tC^\top + \mathcal{E}_t$, where $F_t$ is the low-dimensional matrix-valued latent factor, and $R$ and $C$ are the loading matrices. The model is estimated by the method in Wang et al. (2019), and for prediction, the estimated factors $\widehat{F}_t$ are then fitted by a VAR(1) model.

- The proposed LRTAR model: $\mathcal{Y}_t = \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle + \mathcal{E}_t$, with $\mathcal{A} = \mathcal{G}\times_{i=1}^{4}U_i$. The model is estimated using the SSN and TSSN regularized estimation methods in Section 5.

We first focus on the results for the proposed LRTAR model. By the proposed truncation method, we obtain the estimated multilinear ranks $(\widehat{r}_1,\widehat{r}_2,\widehat{r}_3,\widehat{r}_4) = (8,\dots,2,2)$, so that the fitted model implies the dynamic factor structure
$$U_3^\top\mathcal{Y}_tU_4 = \big\langle\mathcal{G}, U_1^\top\mathcal{Y}_{t-1}U_2\big\rangle + U_3^\top\mathcal{E}_tU_4.$$
Thus, $U_1$ and $U_3$ can be viewed as the factor loadings across different B/M ratios, while $U_2$ and $U_4$ represent those across different sizes. By the factor interpretation below (12), this result indicates that the information in $\mathcal{Y}_t$ can be effectively summarized into the $2\times2$ factor matrix $U_3^\top\mathcal{Y}_tU_4$, while the low-rank structure associated with the predictor $\mathcal{Y}_{t-1}$ is not strong. Moreover, the estimated multilinear ranks suggest that the MAR model might be inefficient for the data, since the MAR imposes that $r_1 = r_2 = r_3 = r_4 = 10$ and $U_i = I_{10}$ for all $i$; see Example 1. In addition, by an argument similar to the discussion in Example 2, the MFM can be regarded as a special case of the proposed model with $(U_3,U_4) = (U_1,U_2)$ and $(r_3,r_4) = (r_1,r_2)$. Thus, the fact that $\widehat{r}_3$ and $\widehat{r}_4$ are much smaller than $\widehat{r}_1$ and $\widehat{r}_2$ suggests that the MFM may be too restrictive for the data. The TSSN estimates of the factor matrices are shown in Figure 5.
The patterns of the estimated response factor matrices $\widetilde{U}_3, \widetilde{U}_4\in\mathbb{R}^{10\times2}$ are particularly interesting. First, for both $\widetilde{U}_3$ and $\widetilde{U}_4$, the uniform pattern of the first column indicates that portfolios across different B/M ratios (or sizes) contribute approximately equally to the first B/M (or size) response factor. Thus, this factor represents the component of the market performance which is invariant to the size and B/M ratio of the portfolios. The significance of this factor may be partially because we fit the model to market-adjusted returns, where the average return is subtracted for all portfolios. Meanwhile, for both response factor matrices, the second column has a smoothly increasing pattern, suggesting that part of the return variation in the market has a monotonic relationship with the size and B/M ratio. Moreover, the above interpretations are consistent with the famous Fama-French three-factor model, where the return of a portfolio is expected to be affected by the market premium, the outperformance of small versus big companies, and the outperformance of high-B/M versus low-B/M companies.

The performance of the five models is compared through both average in-sample and out-of-sample forecasting errors. The average in-sample forecasting error is calculated based on the fitted models for the entire data, while the average out-of-sample forecasting error is calculated based on the following rolling forecast procedure. From January 2016 ($t = 445$) to December 2019 ($t = 492$), we fit the models using all the available data up to time $t-1$ and compute the one-step-ahead forecast $\widehat{\mathcal{Y}}_t$. Then, we obtain the average of the rolling forecasting errors for this period.

The average forecasting errors in $\ell_1$, $\ell_2$ and $\ell_\infty$ norms are presented in Table 3. Firstly, the VAR model has the smallest in-sample forecasting error among all models, which is as expected because the VAR model is highly overparameterized. The poor in-sample performance of the MFM agrees with our discussion about its restrictiveness for the data due to the mismatch of the multilinear ranks. It is worth noting that the proposed LRTAR model has competitive in-sample forecasting performance among all models.

The out-of-sample forecasting results provide a fuller picture of the efficiency of the different methods. It can be seen that the VAR and VFM models perform worst among all, as they both completely ignore the matrix structure of the data. Notably, the proposed LRTAR model, fitted by either the SSN or the TSSN method, has the smallest out-of-sample forecasting errors. This suggests that the proposed LRTAR model can indeed efficiently capture the dynamic structural information in the data.
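The rolling forecast comparison above is straightforward to reproduce for any of the candidate models. A minimal Python sketch, with our own function names, is given below; fit and predict are supplied by the user, and estimate_lrtar in the comment is a hypothetical stand-in for the SSN/TSSN fit of Section 5.

import numpy as np

def rolling_forecast_errors(Y, fit, predict, t_start, t_end):
    # Y: array of shape (T, p_1, ..., p_d); fit(Y_past) returns a fitted model;
    # predict(model, Y_last) returns a forecast of the next observation.
    errs_l1, errs_l2 = [], []
    for t in range(t_start, t_end + 1):
        model = fit(Y[:t])                   # all available data up to time t-1
        Y_hat = predict(model, Y[t - 1])     # one-step-ahead forecast of Y_t
        diff = (Y[t] - Y_hat).ravel()
        errs_l1.append(np.sum(np.abs(diff)))
        errs_l2.append(np.sqrt(np.sum(diff ** 2)))
    return np.mean(errs_l1), np.mean(errs_l2)

# For the LRTAR model, the forecast contracts the fitted transition tensor with
# the last observation, since Y_hat_t = <A_hat, Y_{t-1}>; for example (with the
# predictor modes stored last, as in our earlier sketches):
# fit = lambda Y_past: estimate_lrtar(Y_past)        # hypothetical SSN/TSSN fit
# predict = lambda A, Y_last: np.tensordot(A, Y_last, axes=Y_last.ndim)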
Fama and French (2015) extended the classical Fama-French three-factor model to a five-factor model which further incorporates the effects of operating profitability (OP) and investment (Inv). This motivates tensor-valued stock return data formed according to the size (small and big), the B/M ratio (four groups sorted from lowest to highest), the OP (four groups sorted from lowest to highest), and the Inv (four groups sorted from lowest to highest). We consider two datasets retrieved from French (2020). The first dataset consists of the monthly market-adjusted return series of the $4\times4\times2$ OP-BM-Size portfolios, and the second consists of those of the $4\times4\times2$ Inv-BM-Size portfolios. Hence, $d = 3$, $p_1 = p_2 = 4$, $p_3 = 2$ and $T = 678$ months. Similar to the analysis in the previous subsection, five candidate models are considered:

- Vector autoregression (VAR): $y_t = Ay_{t-1} + e_t$, where $A\in\mathbb{R}^{32\times32}$.

- Vector factor model (VFM): $y_t = \Lambda f_t + e_t$, where $f_t$ is the low-dimensional vector-valued latent factor, and $\Lambda$ is the loading matrix.

- Multilinear tensor autoregression (MTAR): $\mathcal{Y}_t = \mathcal{Y}_{t-1}\times_{i=1}^{3}B_i + \mathcal{E}_t$, where $B_1, B_2\in\mathbb{R}^{4\times4}$ and $B_3\in\mathbb{R}^{2\times2}$ are coefficient matrices; see Example 1.

- Tensor factor model (TFM): $\mathcal{Y}_t = \mathcal{F}_t\times_{i=1}^{3}U_i + \mathcal{E}_t$, where $\mathcal{F}_t$ is the low-dimensional tensor-valued latent factor, and the $U_i$'s are the loading matrices.

- The proposed LRTAR model: $\mathcal{Y}_t = \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle + \mathcal{E}_t$, with $\mathcal{A} = \mathcal{G}\times_{i=1}^{6}U_i$.

The TFM is estimated by the method in Chen et al. (2020c), and for prediction, the estimated factors $\widehat{\mathcal{F}}_t$ are fitted by a VAR(1) model. The other four models are fitted by the same methods as in the previous subsection. The multilinear ranks selected by the truncation method are $(\widehat{r}_1,\dots,\widehat{r}_6) = (3,\dots,1)$ and $(2,\dots,1)$ for the OP-BM-Size portfolio return series and the Inv-BM-Size portfolio return series, respectively. Note that, similar to the BM-Size $10\times10$ series in the previous subsection, the low-rank structure for the response in these two tensor-valued datasets is more evident than that for the predictor. Figure 6 shows the TSSN estimates of the factor matrices. We find that the estimated response factors have a uniform pattern similar to that in Figure 5 for the matrix-valued data. The fact that only one response factor is extracted in each direction suggests that there might not be a substantial effect of OP, Inv, B/M ratio or size on the returns for these datasets.

Finally, using the same methods as in the previous subsection, we calculate the average in-sample and out-of-sample forecasting errors for both datasets fitted with the five models. As shown in Table 4, the comparison results for the two datasets are quite similar. It can be clearly observed that the VAR model always has the smallest in-sample forecasting error, yet the largest out-of-sample forecasting error. On the contrary, the proposed LRTAR model, fitted by either the SSN or the TSSN method, has the smallest out-of-sample forecasting error. Moreover, the in-sample forecasting performance of the LRTAR model is competitive even compared to the VAR model. Similarly to the MFM in the previous subsection, the TFM has poor in-sample performance, possibly due to the discrepancy between $(r_1,r_2,r_3)$ and $(r_4,r_5,r_6)$, as reflected by the estimated multilinear ranks. In sum, the results support the efficiency and flexibility of the proposed model and estimation methods for tensor-valued time series data.

8 Conclusion and Discussion

Efficient modeling and forecasting of general structured (tensor-valued), high-dimensional time series data is an important research topic which, however, has been rarely explored in the literature so far. This paper makes the first thorough attempt to address this problem by introducing the low-rank tensor autoregressive model.
By assuming an exactly or approximately low-Tucker-rank structure for the transition tensor, the model exploits the low-dimensional tensor dynamic structure of the high-dimensional time series data, and summarizes the complex temporal dependencies into interpretable dynamic factors.

Asymptotic and non-asymptotic properties are derived for the proposed low- and high-dimensional estimators, respectively. For the latter, we relax the conventional Gaussian assumption in the high-dimensional time series literature to sub-Gaussianity via a new martingale-based concentration technique. Moreover, based on the special structure of the transition tensor, a novel convex regularizer, the SSN, is proposed, gaining efficiency from both the square matricization and the simultaneous penalization across modes. A truncation method, the TSSN, is further introduced to consistently select the multilinear ranks and enhance model interpretability.

We discuss several directions for future research. First, the proposed estimators cannot adapt to the contemporaneous correlation among elements of the random error $\mathcal{E}_t$, which may lead to efficiency loss. This issue can be addressed by considering the generalized least squares loss function, $\mathcal{L}_T(\mathcal{A};\Sigma_e) = T^{-1}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_t-\langle\mathcal{A},\mathcal{Y}_{t-1}\rangle)^\top\Sigma_e^{-1}\mathrm{vec}(\mathcal{Y}_t-\langle\mathcal{A},\mathcal{Y}_{t-1}\rangle)$, where $\Sigma_e = \mathrm{var}(\mathrm{vec}(\mathcal{E}_t))$. Then $\mathcal{A}$ may be estimated jointly with $\Sigma_e$ or by a two-step approach based on a consistent estimator $\widehat{\Sigma}_e$ (Basu and Michailidis, 2015; Davis et al., 2016).

Secondly, the proposed methods can be generalized to the LRTAR model of finite lag order $L$, defined as $\mathcal{Y}_t = \langle\mathcal{A}_1,\mathcal{Y}_{t-1}\rangle + \cdots + \langle\mathcal{A}_L,\mathcal{Y}_{t-L}\rangle + \mathcal{E}_t$, where $\mathcal{A}_1,\dots,\mathcal{A}_L$ are $2d$-th-order multilinear low-rank coefficient tensors. Then, one may consider the SSN regularized estimation by minimizing $T^{-1}\sum_{t=1}^{T}\|\mathcal{Y}_t-\sum_{j=1}^{L}\langle\mathcal{A}_j,\mathcal{Y}_{t-j}\rangle\|_{\mathrm{F}}^2 + \sum_{j=1}^{L}\lambda_j\|\mathcal{A}_j\|_{\mathrm{SSN}}$. Alternatively, the $\mathcal{A}_j$'s can be combined into a $(2d+1)$-th-order tensor $\mathcal{A}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d\times L}$ whose mode-$(2d+1)$ matricization is $\mathcal{A}_{(2d+1)} = (\mathrm{vec}(\mathcal{A}_1),\dots,\mathrm{vec}(\mathcal{A}_L))^\top$; see Wang et al. (2020) for a similar idea. Even though $\mathcal{A}$ may not have exactly square matricizations for $L > 1$, the proposed SSN can still be adapted by employing approximately square matricizations. For instance, consider the $pL\times p$ multi-mode matricizations $\mathcal{A}_{[I_k\cup\{2d+1\}]}$, where the index sets $I_k$ for $k = 1,\dots,2^{d-1}$ are defined as in this paper. Then, a generalized SSN norm can be constructed as $\|\mathcal{A}\|_{\mathrm{GSSN}} = \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k\cup\{2d+1\}]}\|_*$, which will reduce to $\|\mathcal{A}\|_{\mathrm{SSN}}$ when $L = 1$.

Thirdly, the LRTAR model can be readily extended to incorporate exogenous tensor-valued predictors, giving rise to the LRTAR-X model, $\mathcal{Y}_t = \langle\mathcal{A},\mathcal{Y}_{t-1}\rangle + \langle\mathcal{B},\mathcal{X}_t\rangle + \mathcal{E}_t$, where $\mathcal{X}_t$ is an $m$-th-order tensor of exogenous variables, and $\mathcal{B}$ is a $(d+m)$-th-order coefficient tensor.
When the dimension of $\mathcal{X}_t$ is high, a low-dimensional structure, such as sparsity, group sparsity or low-rankness, can be imposed on $\mathcal{B}$ to improve the estimation efficiency.

Moreover, an open question for matrix- or tensor-valued time series models is how to incorporate additional structures or constraints; for instance, transport or trade flow data have unspecified diagonal entries (Chen and Chen, 2019), and realized covariance matrix data are subject to positive definiteness constraints. Beyond the time series context, it is also worth investigating the generalization of the SSN regularization method to higher-order tensor estimation and completion applications, such as neuroimaging analysis (Li et al., 2018), recommender systems (Bi et al., 2018) and natural language processing (Frandsen and Ge, 2019).

Appendix A: Proofs for Low-Dimensional Estimation

Below we give the proofs of Theorem 1 and Corollary 1 in Section 4, which generally follow from Proposition 4.1 in Shapiro (1986) for overparameterized models.

Proof of Theorem 1. The proposed model in (10) can be written in the matrix form
$$\underbrace{(y_1, y_2, \dots, y_T)^\top}_{\mathbb{Y}} = \underbrace{(y_0, y_1, \dots, y_{T-1})^\top}_{\mathbb{X}}\ \mathcal{A}_{[S]}^\top + \underbrace{(e_1, e_2, \dots, e_T)^\top}_{\mathbb{E}}.$$
Let $\widehat{h}_{\mathrm{OLS}} = \mathrm{vec}((\widehat{\mathcal{A}}_{\mathrm{OLS}})_{[S]})$. Under Assumption 1, by the classical asymptotic theory for stationary VAR models, as $T\to\infty$, we have
$$\sqrt{T}\,(\widehat{h}_{\mathrm{OLS}} - h)\overset{d}{\to}N(\mathbf{0},\Sigma_{\mathrm{OLS}}),$$
where $\Sigma_{\mathrm{OLS}} = \Sigma_e\otimes\Sigma_y^{-1}$, with $\Sigma_e = \mathrm{var}(e_t)$ and $\Sigma_y = \mathrm{var}(y_t)$.

Following Shapiro (1986), consider the discrepancy function
$$F(\widehat{h}_{\mathrm{OLS}}, h) = \|\mathrm{vec}(\mathbb{Y}) - (I_p\otimes\mathbb{X})h\|_2^2 - \|\mathrm{vec}(\mathbb{Y}) - (I_p\otimes\mathbb{X})\widehat{h}_{\mathrm{OLS}}\|_2^2.$$
Note that $F(\widehat{h}_{\mathrm{OLS}}, h)$ is nonnegative and twice continuously differentiable, and equals zero if and only if $\widehat{h}_{\mathrm{OLS}} = h$.

The Jacobian matrix $H = \partial h(\theta)/\partial\theta$ in (20) can be verified by noting that, with $S^c = \{d+1,\dots,2d\}$,
$$h = \mathrm{vec}(\mathcal{A}_{[S]}) = \mathrm{vec}\big((\otimes_{i\in S}U_i)\,\mathcal{G}_{[S]}\,(\otimes_{i\in S^c}U_i)^\top\big) = \big((\otimes_{i\in S^c}U_i)\otimes(\otimes_{i\in S}U_i)\big)\,\mathrm{vec}(\mathcal{G}_{[S]}),$$
and that, for any $1\le i\le 2d$,
$$h = \mathrm{vec}(\mathcal{A}_{[S]}) = P^{(i)}_{[S]}\,\mathrm{vec}(\mathcal{A}_{(i)}) = P^{(i)}_{[S]}\,\mathrm{vec}\big(U_i\,\mathcal{G}_{(i)}\,(\otimes_{j=1,j\neq i}^{2d}U_j)^\top\big) = P^{(i)}_{[S]}\big\{\big[(\otimes_{j=1,j\neq i}^{2d}U_j)\,\mathcal{G}_{(i)}^\top\big]\otimes I_{p_i}\big\}\,\mathrm{vec}(U_i).$$
Then, by Proposition 4.1 in Shapiro (1986), we obtain that the minimizer of $F(\widehat{h}_{\mathrm{OLS}},\cdot)$, namely the LTR estimator, has the asymptotic normality $\sqrt{T}(\widehat{h}_{\mathrm{LTR}} - h)\overset{d}{\to}N(\mathbf{0},\Sigma_{\mathrm{LTR}})$ and $\Sigma_{\mathrm{LTR}} = P\Sigma_{\mathrm{OLS}}P^\top$, where $P = H(H^\top JH)^\dagger H^\top J$ is the projection matrix, $J = \Sigma_e^{-1}\otimes\Sigma_y$ is the Fisher information matrix of $h$, and $\dagger$ denotes the Moore-Penrose inverse. Since $\Sigma_{\mathrm{OLS}} = J^{-1}$, we can obtain that $\Sigma_{\mathrm{LTR}} = H(H^\top JH)^\dagger H^\top$.

Proof of Corollary 1. As discussed in the proof of Theorem 1, $\Sigma_{\mathrm{OLS}} = J^{-1}$, and observe that
$$\Sigma_{\mathrm{LTR}} = H(H^\top JH)^\dagger H^\top = J^{-1/2}\,Q_{J^{1/2}H}\,J^{-1/2}, \qquad (\mathrm{A.1})$$
where $Q_{J^{1/2}H}$ is the projection matrix onto $\mathrm{span}(J^{1/2}H)$. On the other hand, under the proposed model, the transition matrix can be decomposed as $\mathcal{A}_{[S]} = V_1V_2^\top$, where $V_1 = (\otimes_{i\in S}U_i)\,\mathcal{G}_{[S]}$, $V_2 = \otimes_{i\in S^c}U_i$, and $\mathrm{rank}(\mathcal{A}_{[S]})\le\bar{s} = \min\big(\prod_{i=1}^{d}r_i, \prod_{i=d+1}^{2d}r_i\big)$.
Thus, we can write $h = \mathrm{vec}(\mathcal{A}_{[S]}) = h(\phi)$, where $\phi = (\mathrm{vec}(V_1)^\top, \mathrm{vec}(V_2)^\top)^\top$ is the parameter vector for the RRR estimator. Then, similarly to the proof of Theorem 1, we denote the Jacobian matrix by $R = \partial h/\partial\phi$. By similar arguments, we have $\sqrt{T}(\widehat{h}_{\mathrm{RRR}} - h)\overset{d}{\to}N(\mathbf{0},\Sigma_{\mathrm{RRR}})$, where $\widehat{h}_{\mathrm{RRR}} = \mathrm{vec}((\widehat{\mathcal{A}}_{\mathrm{RRR}})_{[S]})$ and
$$\Sigma_{\mathrm{RRR}} = R(R^\top JR)^\dagger R^\top = J^{-1/2}\,Q_{J^{1/2}R}\,J^{-1/2}, \qquad (\mathrm{A.2})$$
where $Q_{J^{1/2}R}$ is the projection matrix onto $\mathrm{span}(J^{1/2}R)$. Hence, it is clear that $\Sigma_{\mathrm{RRR}}\preceq J^{-1} = \Sigma_{\mathrm{OLS}}$.

Moreover, since the Tucker decomposition can be viewed as a further decomposition of the low-rank decomposition $V_1V_2^\top$ of $\mathcal{A}_{[S]}$, we have $H = \partial h/\partial\theta = R\cdot\partial\phi/\partial\theta$. By (A.1) and (A.2), we have $\Sigma_{\mathrm{LTR}}\preceq\Sigma_{\mathrm{RRR}}$, since $\mathrm{span}(J^{1/2}H)\subset\mathrm{span}(J^{1/2}R)$.

Appendix B: Proofs for High-Dimensional Estimation

In this appendix, we provide the proofs of Theorems 2-5 in Section 5. We start with a preliminary analysis in Appendix B.1, which lays out the common technical framework for proving the estimation and prediction error bounds of the SN, MN and SSN regularized estimators; four lemmas, Lemmas B.1-B.4, are introduced therein. Then, in Appendix B.2, we give the proofs of Theorems 2-5. The proofs of Lemmas B.1-B.4 are provided in Appendix B.3, and three auxiliary lemmas are collected in Appendix B.4.

B.1 Preliminary Analysis

The technical framework for proving the error bounds in Theorems 2-4 consists of two main steps, a deterministic analysis and a stochastic analysis, given in Sections B.1.1 and B.1.2, respectively. The goal of the first is to derive the error bounds given a deterministic realization of the time series, assuming that the parameters satisfy certain regularity conditions. The goal of the second is to verify that, under stochasticity, these regularity conditions are satisfied with high probability.

B.1.1 Deterministic Analysis

Throughout the appendix, we adopt the following notation. We use $C$ to denote a generic positive constant, which is independent of the dimensions and the sample size. For any matrix $M$ and a compatible subspace $\mathbb{S}$, we denote by $M_{\mathbb{S}}$ the projection of $M$ onto $\mathbb{S}$. In addition, let $\mathrm{col}(M)$ be the column space of $M$, and let $\mathbb{S}^\perp$ be the orthogonal complement of the subspace $\mathbb{S}$. For a generic tensor $\mathcal{W}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, the dual norms of its SSN norm and SN norm, denoted by $\|\mathcal{W}\|_{\mathrm{SSN}*}$ and $\|\mathcal{W}\|_{\mathrm{SN}*}$, respectively, are defined as
$$\|\mathcal{W}\|_{\mathrm{SSN}*} = \sup_{\|\mathcal{T}\|_{\mathrm{SSN}}\le1}\langle\mathcal{W},\mathcal{T}\rangle \quad\text{and}\quad \|\mathcal{W}\|_{\mathrm{SN}*} = \sup_{\|\mathcal{T}\|_{\mathrm{SN}}\le1}\langle\mathcal{W},\mathcal{T}\rangle.$$
Moreover, for any two tensors $\mathcal{X}\in\mathbb{R}^{p_1\times\cdots\times p_m}$ and $\mathcal{Y}\in\mathbb{R}^{p_{m+1}\times\cdots\times p_{m+n}}$, their tensor outer product $\mathcal{X}\circ\mathcal{Y}\in\mathbb{R}^{p_1\times\cdots\times p_m\times p_{m+1}\times\cdots\times p_{m+n}}$ is defined by $(\mathcal{X}\circ\mathcal{Y})_{i_1\dots i_mi_{m+1}\dots i_{m+n}} = \mathcal{X}_{i_1\dots i_m}\mathcal{Y}_{i_{m+1}\dots i_{m+n}}$, for any $1\le i_1\le p_1,\dots,1\le i_{m+n}\le p_{m+n}$.

For the theory of regularized $M$-estimators, restricted error sets and restricted strong convexity are essential definitions. To define the former, we first introduce the following restricted model subspaces.

For $i = 1,\dots,2d$, denote by $\bar{\mathcal{U}}_i$ and $\bar{\mathcal{V}}_i$ the spaces spanned by the first $r_i$ left and right singular vectors in the SVD of $\mathcal{A}_{(i)}$, respectively. Define the collections of subspaces $\mathcal{N} = (\mathcal{N}_1,\dots,\mathcal{N}_{2d})$ and $\bar{\mathcal{N}}^\perp = (\bar{\mathcal{N}}_1^\perp,\dots$
$,\bar{\mathcal{N}}_{2d}^\perp)$, where
$$\mathcal{N}_i = \{M\in\mathbb{R}^{p_i\times p_{-i}p}\,|\,\mathrm{col}(M)\subset\bar{\mathcal{U}}_i,\ \mathrm{col}(M^\top)\subset\bar{\mathcal{V}}_i\},$$
$$\bar{\mathcal{N}}_i^\perp = \{M\in\mathbb{R}^{p_i\times p_{-i}p}\,|\,\mathrm{col}(M)\perp\bar{\mathcal{U}}_i,\ \mathrm{col}(M^\top)\perp\bar{\mathcal{V}}_i\}, \qquad (\mathrm{B.1})$$
for $i = 1,\dots,2d$, and let $\bar{\mathcal{N}}_i = (\bar{\mathcal{N}}_i^\perp)^\perp$. Note that $\mathcal{N}_i\subset\bar{\mathcal{N}}_i$.

Furthermore, for $k = 1,\dots,2^{d-1}$, denote by $\mathcal{U}_k$ and $\mathcal{V}_k$ the spaces spanned by the first $s_k$ left and right singular vectors in the SVD of the square matricization $\mathcal{A}_{[I_k]}$, respectively. Similarly, define the collections of subspaces $\mathcal{M} := (\mathcal{M}_1,\dots,\mathcal{M}_{2^{d-1}})$ and $\bar{\mathcal{M}}^\perp = (\bar{\mathcal{M}}_1^\perp,\dots,\bar{\mathcal{M}}_{2^{d-1}}^\perp)$, where
$$\mathcal{M}_k = \{M\in\mathbb{R}^{p\times p}\,|\,\mathrm{col}(M)\subset\mathcal{U}_k,\ \mathrm{col}(M^\top)\subset\mathcal{V}_k\},$$
$$\bar{\mathcal{M}}_k^\perp = \{M\in\mathbb{R}^{p\times p}\,|\,\mathrm{col}(M)\perp\mathcal{U}_k,\ \mathrm{col}(M^\top)\perp\mathcal{V}_k\}, \qquad (\mathrm{B.2})$$
for $k = 1,\dots,2^{d-1}$, with $\bar{\mathcal{M}}_k = (\bar{\mathcal{M}}_k^\perp)^\perp$. In particular, as described in Section 5.2, $I_1 = S = \{1,\dots,d\}$; thus, $\mathcal{M}_1$ and $\bar{\mathcal{M}}_1^\perp$ are the subspaces associated with the square matricization $\mathcal{A}_{[S]}$.

Then, for simplicity, for any $\mathcal{W}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$, we denote
$$\mathcal{W}^{(i)}_{\mathcal{N}} = (\mathcal{W}_{(i)})_{\mathcal{N}_i},\quad \mathcal{W}^{(i)}_{\bar{\mathcal{N}}} = (\mathcal{W}_{(i)})_{\bar{\mathcal{N}}_i},\quad \mathcal{W}^{(i)}_{\bar{\mathcal{N}}^\perp} = (\mathcal{W}_{(i)})_{\bar{\mathcal{N}}_i^\perp},$$
$$\mathcal{W}^{(k)}_{\mathcal{M}} = (\mathcal{W}_{[I_k]})_{\mathcal{M}_k},\quad \mathcal{W}^{(k)}_{\bar{\mathcal{M}}} = (\mathcal{W}_{[I_k]})_{\bar{\mathcal{M}}_k},\quad \mathcal{W}^{(k)}_{\bar{\mathcal{M}}^\perp} = (\mathcal{W}_{[I_k]})_{\bar{\mathcal{M}}_k^\perp},$$
where $i = 1,\dots,2d$ and $k = 1,\dots,2^{d-1}$. Based on the subspaces defined in (B.1) and (B.2), we can define the restricted error sets corresponding to the three regularized estimators as follows.

Definition 1. The restricted error set corresponding to $\mathcal{M}$ is defined as
$$\mathbb{C}_{\mathrm{SSN}}(\mathcal{M}) := \Big\{\boldsymbol{\Delta} : \sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* \le 3\sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_* + 4\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*\Big\}.$$
The restricted error set corresponding to $\mathcal{N}$ is defined as
$$\mathbb{C}_{\mathrm{SN}}(\mathcal{N}) := \Big\{\boldsymbol{\Delta} : \sum_{i=1}^{2d}\|\boldsymbol{\Delta}^{(i)}_{\bar{\mathcal{N}}^\perp}\|_* \le 3\sum_{i=1}^{2d}\|\boldsymbol{\Delta}^{(i)}_{\bar{\mathcal{N}}}\|_* + 4\sum_{i=1}^{2d}\|\mathcal{A}^{(i)}_{\bar{\mathcal{N}}^\perp}\|_*\Big\}.$$
The restricted error set corresponding to $\mathcal{M}_1$ is defined as
$$\mathbb{C}_{\mathrm{MN}}(\mathcal{M}_1) := \big\{\boldsymbol{\Delta} : \|\boldsymbol{\Delta}^{(1)}_{\bar{\mathcal{M}}^\perp}\|_* \le 3\|\boldsymbol{\Delta}^{(1)}_{\bar{\mathcal{M}}}\|_* + 4\|\mathcal{A}^{(1)}_{\bar{\mathcal{M}}^\perp}\|_*\big\}.$$

The first lemma shows that, if the tuning parameter is well chosen for each regularized estimator, the estimation error belongs to the corresponding restricted error set.

Lemma B.1. For the SSN estimator, if the regularization parameter $\lambda_{\mathrm{SSN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$, then the error $\boldsymbol{\Delta}_{\mathrm{SSN}} = \widehat{\mathcal{A}}_{\mathrm{SSN}} - \mathcal{A}$ belongs to the set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$.
For the SN estimator, if the regularization parameter $\lambda_{\mathrm{SN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SN}*}$, then the error $\boldsymbol{\Delta}_{\mathrm{SN}} = \widehat{\mathcal{A}}_{\mathrm{SN}} - \mathcal{A}$ belongs to the set $\mathbb{C}_{\mathrm{SN}}(\mathcal{N})$.
For the MN estimator, if the regularization parameter $\lambda_{\mathrm{MN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_{t-1})\mathrm{vec}(\mathcal{E}_t)^\top\|_{\mathrm{op}}$, then the error $\boldsymbol{\Delta}_{\mathrm{MN}} = \widehat{\mathcal{A}}_{\mathrm{MN}} - \mathcal{A}$ belongs to the set $\mathbb{C}_{\mathrm{MN}}(\mathcal{M}_1)$.

Following Negahban and Wainwright (2012) and Negahban et al. (2012), a restricted strong convexity (RSC) condition for the square loss function can be defined as follows.

Definition 2. The loss function satisfies the RSC condition with curvature $\alpha_{\mathrm{RSC}} > 0$ and restricted error set $\mathbb{C}$ if
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \ge \alpha_{\mathrm{RSC}}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2, \quad \forall\,\boldsymbol{\Delta}\in\mathbb{C}.$$

Based on the restricted error sets and RSC conditions, the estimation errors have the following deterministic upper bounds.

Lemma B.2.
Suppose that $\lambda_{\mathrm{SSN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$, the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$, and $\mathcal{A}_{[I_k]}\in\mathbb{B}_q(s_q^{(k)}; p, p)$ for some $q\in[0,1]$ and all $k = 1,\dots,2^{d-1}$. Then
$$\|\boldsymbol{\Delta}_{\mathrm{SSN}}\|_{\mathrm{F}} \lesssim \sqrt{\bar{s}_q}\left(\frac{2^{d-1}\lambda_{\mathrm{SSN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}, \quad\text{where } \bar{s}_q = 2^{-(d-1)}\sum_{k=1}^{2^{d-1}}s_q^{(k)}.$$
Suppose that $\lambda_{\mathrm{SN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SN}*}$, the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{SN}}(\mathcal{N})$, and $\mathcal{A}_{(i)}\in\mathbb{B}_q(r_q^{(i)}; p_i, p_{-i}p)$ for some $q\in[0,1]$ and all $i = 1,\dots,2d$. Then
$$\|\boldsymbol{\Delta}_{\mathrm{SN}}\|_{\mathrm{F}} \lesssim \sqrt{\bar{r}_q}\left(\frac{2d\cdot\lambda_{\mathrm{SN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}, \quad\text{where } \bar{r}_q = (2d)^{-1}\sum_{i=1}^{2d}r_q^{(i)}.$$
Suppose that $\lambda_{\mathrm{MN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_{t-1})\mathrm{vec}(\mathcal{E}_t)^\top\|_{\mathrm{op}}$, the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{MN}}(\mathcal{M}_1)$, and $\mathcal{A}_{[S]}\in\mathbb{B}_q(s_q^{(1)}; p, p)$ for some $q\in[0,1]$. Then
$$\|\boldsymbol{\Delta}_{\mathrm{MN}}\|_{\mathrm{F}} \lesssim \sqrt{s_q^{(1)}}\left(\frac{\lambda_{\mathrm{MN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}.$$
Note that Lemma B.2 is deterministic, and the radii $\bar{s}_q$, $\bar{r}_q$ and $s_q^{(1)}$ may diverge to infinity.

B.1.2 Stochastic Analysis

We continue with the stochastic analysis to show that the deviation bound and the RSC condition hold simultaneously with high probability.

Lemma B.3 (Deviation bound). Suppose that Assumptions 1 and 2 hold. If $T\gtrsim p$ and $\lambda_{\mathrm{SSN}}\gtrsim\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, then with probability at least $1-\exp[-C(p-d)]$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \le \frac{\lambda_{\mathrm{SSN}}}{2},$$
where $M_1 = \lambda_{\max}(\Sigma_e)/\mu_{\min}^{1/2}(\mathcal{A})$. If $T\gtrsim\max_{1\le i\le2d}p_{-i}p$ and $\lambda_{\mathrm{SN}}\gtrsim\kappa^2M_1(2d)^{-2}\sum_{i=1}^{2d}\sqrt{p_{-i}p/T}$, then with probability at least $1-\sum_{i=1}^{2d}\exp(-Cp_{-i}p)$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SN}*} \le \frac{\lambda_{\mathrm{SN}}}{2}.$$
Moreover, if $T\gtrsim p$ and $\lambda_{\mathrm{MN}}\gtrsim\kappa^2M_1\sqrt{p/T}$, then with probability at least $1-\exp(-Cp)$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{Y}_{t-1})\mathrm{vec}(\mathcal{E}_t)^\top\Big\|_{\mathrm{op}} \le \frac{\lambda_{\mathrm{MN}}}{2}.$$

The above deviation bounds require a sample size of order $T\gtrsim p$ for all three estimators. In this case, we can establish a strong convexity condition that is stronger than the RSC condition.

Lemma B.4 (Strong convexity). Under Assumptions 1 and 2, for $T\gtrsim\max(\kappa^4,\kappa^2)M_2^{-2}p$, with probability at least $1-\exp[-C\min(\kappa^{-4},\kappa^{-2})M_2^2\,p]$,
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \ge \alpha_{\mathrm{RSC}}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \quad\text{for all }\boldsymbol{\Delta},$$
where $M_2 = [\lambda_{\min}(\Sigma_e)\mu_{\max}^{-1}(\mathcal{A})]/[\lambda_{\max}(\Sigma_e)\mu_{\min}^{-1}(\mathcal{A})]$ and $\alpha_{\mathrm{RSC}} = \lambda_{\min}(\Sigma_e)/(2\mu_{\max}(\mathcal{A}))$.

B.2 Proofs of Theorems 2-5

Proof of Theorems 2 and 3. Theorems 2 and 3 can be proved based on Lemmas B.2-B.4, following the same lines as the proof of Theorem 4 given below. Therefore, we omit the details here.

Proof of Theorem 4. The proof of Theorem 4 has been split into Lemmas B.2-B.4. By Lemma B.2, for a deterministic realization of sample size $T$ from the tensor autoregressive process, if we choose $\lambda_{\mathrm{SSN}}\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$ and the RSC condition holds for the square loss with the parameter $\alpha_{\mathrm{RSC}}$, then the error upper bound
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}} \lesssim \sqrt{\bar{s}_q}\left(\frac{2^{d-1}\lambda_{\mathrm{SSN}}}{\alpha_{\mathrm{RSC}}}\right)^{1-q/2}$$
can be established. Denote the events $\mathcal{E}_1(\beta) = \{\beta\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}\}$ and $\mathcal{E}_2(\alpha) = \{\lambda_{\min}(\mathbb{X}\mathbb{X}^\top/T)\ge\alpha\}$.
If we take $\lambda_{\mathrm{SSN}}\gtrsim\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, it suffices to show that $\mathcal{E}_1\big(C\kappa^2M_12^{-(d-1)}\sqrt{p/T}\big)$ and $\mathcal{E}_2(\alpha_{\mathrm{RSC}})$ occur simultaneously with high probability.

By Lemma B.3, when $T\gtrsim p$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \lesssim \kappa^2M_12^{-(d-1)}\sqrt{\frac{p}{T}}$$
with probability at least $1-\exp[-C(p-d)]$. By Lemma B.4, when $T\gtrsim\max(\kappa^4,\kappa^2)M_2^{-2}p$, for any $\boldsymbol{\Delta}$,
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \ge \frac{\lambda_{\min}(\Sigma_e)}{2\mu_{\max}(\mathcal{A})}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2$$
with probability at least $1-\exp[-C\min(\kappa^{-4},\kappa^{-2})M_2^2\,p]$.

Hence, when $T\gtrsim[1+\max(\kappa^4,\kappa^2)M_2^{-2}]\,p$ and $\lambda\gtrsim\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, with probability at least $1-\exp[-C(p-d)]-\exp[-C\min(\kappa^{-4},\kappa^{-2})M_2^2\,p]$, the condition $\lambda\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$ and the RSC condition with the parameter $\alpha_{\mathrm{RSC}} = \lambda_{\min}(\Sigma_e)/(2\mu_{\max}(\mathcal{A}))$ hold.

Proof of Theorem 5. Theorem 4 gives the Frobenius estimation error bound. For simplicity, we write $\widehat{\mathcal{A}} = \widehat{\mathcal{A}}_{\mathrm{SSN}}$ and $\widetilde{\mathcal{A}} = \widehat{\mathcal{A}}_{\mathrm{TSSN}}$ in this proof. By definition, for any tensor $\mathcal{A}\in\mathbb{R}^{p_1\times\cdots\times p_d\times p_1\times\cdots\times p_d}$,
$$\|\mathcal{A}\|_{\mathrm{F}}^2 = \|\mathcal{A}_{(i)}\|_{\mathrm{F}}^2 = \sum_{j=1}^{p_i}\sigma_j^2(\mathcal{A}_{(i)}), \quad i = 1,2,\dots,2d.$$
In other words, the Frobenius norm of the error tensor is equivalent to the $\ell_2$ norm of the singular values of any one-mode matricization. By Mirsky's singular value inequality (Mirsky, 1960),
$$\sum_{j=1}^{p_i}\big[\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})\big]^2 \le \sum_{j=1}^{p_i}\sigma_j^2(\widehat{\mathcal{A}}_{(i)}-\mathcal{A}_{(i)}) = \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}}^2, \quad i = 1,2,\dots,2d. \qquad (\mathrm{B.3})$$
Obviously, the $\ell_\infty$ error of the singular values is no greater than the $\ell_2$ error, so it obeys the same upper bound. By Theorem 4, when $\lambda_{\mathrm{SSN}}\asymp\kappa^2M_12^{-(d-1)}\sqrt{p/T}$, with probability approaching one,
$$\max_{1\le i\le2d}\max_{1\le j\le p_i}\big|\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})\big| \le \max_{1\le i\le2d}\Big\{\sum_{j=1}^{p_i}\big[\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})\big]^2\Big\}^{1/2} \le \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} \lesssim \frac{\kappa^2M_1}{\alpha_{\mathrm{RSC}}}\sqrt{\frac{s_0\,p}{T}}. \qquad (\mathrm{B.4})$$
Therefore, by Assumption 3, as $T\to\infty$, $\gamma\gg\max_{1\le i\le2d}\max_{1\le j\le p_i}|\sigma_j(\widehat{\mathcal{A}}_{(i)})-\sigma_j(\mathcal{A}_{(i)})|$. Then, for any $j > r_i$, since $\sigma_j(\mathcal{A}_{(i)}) = 0$, we have $\gamma\gg\sigma_j(\widehat{\mathcal{A}}_{(i)})$. Thus, for all $i = 1,\dots,2d$, $\sigma_j(\widehat{\mathcal{A}}_{(i)})$ will be truncated for all $j > r_i$. Meanwhile, by Assumption 3 and (B.4), we have $\sigma_{r_i}(\widehat{\mathcal{A}}_{(i)}) > \gamma$ for $T$ sufficiently large, for all $i = 1,\dots,2d$. Therefore, the rank selection consistency of the truncated estimator $\widetilde{\mathcal{A}}$ can be established.

Denote the event $\mathcal{E} = \{\mathrm{rank}(\widetilde{\mathcal{A}}_{(i)}) = r_i, \text{ for } i = 1,\dots,2d\}$. For a generic $2d$-th-order tensor $\mathcal{T}$, denote by $\mathcal{T}_{i_k=j}$ the $p_1\times\cdots\times p_{k-1}\times1\times p_{k+1}\times\cdots$ sub-tensor such that $(\mathcal{T}_{i_k=j})_{i_1\dots i_{k-1}i_{k+1}\dots} = \mathcal{T}_{i_1\dots i_{k-1}ji_{k+1}\dots}$, and by $\mathcal{T}_{i_k>j}$ the $p_1\times\cdots\times p_{k-1}\times(p_k-j)\times p_{k+1}\times\cdots$ sub-tensor such that $(\mathcal{T}_{i_k>j})_{i_1\dots i_{k-1}\ell i_{k+1}\dots} = \mathcal{T}_{i_1\dots i_{k-1}(\ell+j)i_{k+1}\dots}$.

Let the HOSVD of $\widehat{\mathcal{A}}$ be $\widehat{\mathcal{G}}\times_{i=1}^{2d}\widehat{U}_i$. By definition, $\widehat{\mathcal{G}}$ is an all-orthogonal and sorted tensor such that $\|\widehat{\mathcal{G}}_{i_k=1}\|_{\mathrm{F}} \ge \|\widehat{\mathcal{G}}_{i_k=2}\|_{\mathrm{F}} \ge \cdots \ge \|\widehat{\mathcal{G}}_{i_k=p_k}\|_{\mathrm{F}}$, for $k = 1,\dots,2d$.
On the event $\mathcal{E}$, the truncation procedure is equivalent to truncating all the sub-tensors $\widehat{\mathcal{G}}_{i_k>r_k}$ to zero. Thus, $\|\widehat{\mathcal{A}}-\widetilde{\mathcal{A}}\|_{\mathrm{F}}^2 = \|\widehat{\mathcal{G}}-\widetilde{\mathcal{G}}\|_{\mathrm{F}}^2 \le \sum_{k=1}^{2d}\|\widehat{\mathcal{G}}_{i_k>r_k}\|_{\mathrm{F}}^2$. By the definition of the HOSVD, $\|\widehat{\mathcal{G}}_{i_k=j}\|_{\mathrm{F}} = \sigma_j(\widehat{\mathcal{G}}_{(k)}) = \sigma_j(\widehat{\mathcal{A}}_{(k)})$, and then
$$\|\widehat{\mathcal{G}}_{i_k>r_k}\|_{\mathrm{F}}^2 = \sum_{i=r_k+1}^{p_k}\sigma_i^2(\widehat{\mathcal{A}}_{(k)}) = \sum_{i=r_k+1}^{p_k}\big[\sigma_i(\widehat{\mathcal{A}}_{(k)})-\sigma_i(\mathcal{A}_{(k)})\big]^2 \le \sum_{i=1}^{p_k}\big[\sigma_i(\widehat{\mathcal{A}}_{(k)})-\sigma_i(\mathcal{A}_{(k)})\big]^2 \le \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}}^2,$$
where the last inequality follows from (B.3). Finally, on the event $\mathcal{E}$,
$$\|\widetilde{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} \le \|\widetilde{\mathcal{A}}-\widehat{\mathcal{A}}\|_{\mathrm{F}} + \|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} \le (1+\sqrt{2d})\,\|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}},$$
where $d$ is fixed. Note that Theorem 4 implies the asymptotic rate $\|\widehat{\mathcal{A}}-\mathcal{A}\|_{\mathrm{F}} = O_p(\sqrt{s_0p/T})$, and the first part of this proof shows that $P(\mathcal{E})\to1$ as $T\to\infty$. The proof is complete.

B.3 Proofs of Lemmas B.1-B.4

Proof of Lemma B.1. In this part, we focus on $\widehat{\mathcal{A}}_{\mathrm{SSN}}$ and simplify it to $\widehat{\mathcal{A}}$; the tuning parameter $\lambda_{\mathrm{SSN}}$ is simplified to $\lambda$. The proof can be readily extended to $\widehat{\mathcal{A}}_{\mathrm{SN}}$ and $\widehat{\mathcal{A}}_{\mathrm{MN}}$.

Note that the quadratic loss function can be rewritten as $\mathcal{L}_T(\mathcal{A}) = T^{-1}\sum_{t=1}^{T}\|\mathcal{Y}_t-\langle\mathcal{A},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 = T^{-1}\sum_{t=1}^{T}\|y_t-\mathcal{A}_{[S]}y_{t-1}\|_2^2$, where $y_t = \mathrm{vec}(\mathcal{Y}_t)$. By the optimality of the SSN estimator,
$$\frac{1}{T}\sum_{t=1}^{T}\|y_t-\widehat{\mathcal{A}}_{[S]}y_{t-1}\|_2^2 + \lambda\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}} \le \frac{1}{T}\sum_{t=1}^{T}\|y_t-\mathcal{A}_{[S]}y_{t-1}\|_2^2 + \lambda\|\mathcal{A}\|_{\mathrm{SSN}}$$
$$\Longrightarrow \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le \frac{2}{T}\sum_{t=1}^{T}\langle e_t,\boldsymbol{\Delta}_{[S]}y_{t-1}\rangle + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big)$$
$$\Longrightarrow \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le 2\Big\langle\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t,\boldsymbol{\Delta}\Big\rangle + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big)$$
$$\Longrightarrow \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le 2\|\boldsymbol{\Delta}\|_{\mathrm{SSN}}\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big),$$
where $\|\cdot\|_{\mathrm{SSN}*}$ denotes the dual norm of the SSN norm.

By the triangle inequality and the decomposability of the nuclear norm,
$$\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}-\|\mathcal{A}\|_{\mathrm{SSN}} = \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}+\boldsymbol{\Delta}_{[I_k]}\|_* - \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_* = \sum_{k=1}^{2^{d-1}}\big\|\mathcal{A}^{(k)}_{\mathcal{M}}+\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*$$
$$\ge \sum_{k=1}^{2^{d-1}}\Big[\big\|\mathcal{A}^{(k)}_{\mathcal{M}}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\big\|_*\Big] - \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*,$$
and, since $\|\mathcal{A}^{(k)}_{\mathcal{M}}+\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* = \|\mathcal{A}^{(k)}_{\mathcal{M}}\|_* + \|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*$ by decomposability and $\|\mathcal{A}_{[I_k]}\|_* \le \|\mathcal{A}^{(k)}_{\mathcal{M}}\|_* + \|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*$,
$$\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}-\|\mathcal{A}\|_{\mathrm{SSN}} \ge \sum_{k=1}^{2^{d-1}}\Big[\big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\big\|_* - 2\big\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_*\Big].$$
If $\lambda\ge2\|T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\|_{\mathrm{SSN}*}$, we have
$$0 \le \frac{1}{T}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2 \le \frac{\lambda}{2}\|\boldsymbol{\Delta}\|_{\mathrm{SSN}} - \lambda\big(\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}-\|\mathcal{A}\|_{\mathrm{SSN}}\big) \le \frac{\lambda}{2}\sum_{k=1}^{2^{d-1}}\Big[3\big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\big\|_* + 4\big\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_* - \big\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\big\|_*\Big].$$
Hence, the error $\boldsymbol{\Delta}$ lies in the restricted error set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$.

Proof of Lemma B.2. Similar to Lemma B.1, we focus on the SSN estimator; the results for the SN and MN estimators can be derived in a similar way.

Note that $T^{-1}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 = T^{-1}\sum_{t=1}^{T}\|\boldsymbol{\Delta}_{[S]}y_{t-1}\|_2^2$. Following the proof of Lemma B.1, $\boldsymbol{\Delta}\in\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$ and
$$\frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \le \frac{\lambda}{2}\|\boldsymbol{\Delta}\|_{\mathrm{SSN}} + \lambda\big(\|\mathcal{A}\|_{\mathrm{SSN}}-\|\widehat{\mathcal{A}}\|_{\mathrm{SSN}}\big) \le \frac{3\lambda}{2}\|\boldsymbol{\Delta}\|_{\mathrm{SSN}} = \frac{3\lambda}{2}\sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}_{[I_k]}\|_*$$
$$\le \frac{3\lambda}{2}\sum_{k=1}^{2^{d-1}}\big(\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_* + \|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*\big) \le 6\lambda\sum_{k=1}^{2^{d-1}}\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_* + 6\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*$$
$$\le 6\lambda\sum_{k=1}^{2^{d-1}}\sqrt{2s_k}\,\|\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}\|_{\mathrm{F}} + 6\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* \lesssim \lambda\sum_{k=1}^{2^{d-1}}\sqrt{s_k}\,\|\boldsymbol{\Delta}\|_{\mathrm{F}} + \lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*,$$
where the third inequality stems from the fact that $\boldsymbol{\Delta}^{(k)}_{\bar{\mathcal{M}}}$ has matrix rank at most $2s_k$, similar to Lemma 1 in Negahban and Wainwright (2011).

As the RSC condition holds with the parameter $\alpha_{\mathrm{RSC}}$ and restricted error set $\mathbb{C}_{\mathrm{SSN}}(\mathcal{M})$,
$$\alpha_{\mathrm{RSC}}\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \le \frac{1}{T}\sum_{t=1}^{T}\|\langle\boldsymbol{\Delta},\mathcal{Y}_{t-1}\rangle\|_{\mathrm{F}}^2 \lesssim \lambda\sum_{k=1}^{2^{d-1}}\sqrt{s_k}\,\|\boldsymbol{\Delta}\|_{\mathrm{F}} + \lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*.$$
Thus, by the Cauchy-Schwarz inequality,
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \lesssim \frac{\lambda^2\big(\sum_{k=1}^{2^{d-1}}\sqrt{s_k}\big)^2}{\alpha_{\mathrm{RSC}}^2} + \frac{\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*}{\alpha_{\mathrm{RSC}}} \lesssim \frac{\lambda^2\,2^{d-1}\sum_{k=1}^{2^{d-1}}s_k}{\alpha_{\mathrm{RSC}}^2} + \frac{\lambda\sum_{k=1}^{2^{d-1}}\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_*}{\alpha_{\mathrm{RSC}}}.$$
For the approximately low-rank case, consider any threshold $\tau_k > 0$, and take $\mathcal{M}_k$ (and $\bar{\mathcal{M}}_k$) to correspond to the column and row spaces spanned by the first $r^{(k)}$ singular vectors of $\mathcal{A}_{[I_k]}$, where $\sigma_1(\mathcal{A}_{[I_k]})\ge\cdots\ge\sigma_{r^{(k)}}(\mathcal{A}_{[I_k]}) > \tau_k \ge \sigma_{r^{(k)}+1}(\mathcal{A}_{[I_k]})$. By the definition of $\mathbb{B}_q(s_q^{(k)}; p, p)$, we have $s_q^{(k)}\ge r^{(k)}\tau_k^q$ and thus $r^{(k)}\le s_q^{(k)}\tau_k^{-q}$.

Then, the approximation error can be bounded by
$$\|\mathcal{A}^{(k)}_{\bar{\mathcal{M}}^\perp}\|_* = \sum_{r=r^{(k)}+1}^{p}\sigma_r(\mathcal{A}_{[I_k]}) = \sum_{r=r^{(k)}+1}^{p}\sigma_r^q(\mathcal{A}_{[I_k]})\,\sigma_r^{1-q}(\mathcal{A}_{[I_k]}) \le s_q^{(k)}\tau_k^{1-q}.$$
Consequently,
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \lesssim \frac{\lambda^2\,2^{d-1}\sum_{k=1}^{2^{d-1}}s_q^{(k)}\tau_k^{-q}}{\alpha_{\mathrm{RSC}}^2} + \frac{\lambda\sum_{k=1}^{2^{d-1}}s_q^{(k)}\tau_k^{1-q}}{\alpha_{\mathrm{RSC}}}.$$
Setting each $\tau_k\asymp\alpha_{\mathrm{RSC}}^{-1}2^{d-1}\lambda$, the upper bound is minimized to
$$\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 \lesssim 2^{-(d-1)}\sum_{k=1}^{2^{d-1}}s_q^{(k)}\left(\frac{2^{d-1}\lambda}{\alpha_{\mathrm{RSC}}}\right)^{2-q}.$$
The proof is complete.

Proof of Lemma B.3. First, we derive an upper bound for the dual norm of the SSN norm. By definition, for any tensor $\mathcal{A}$ and collection of index sets $I = \{I_1,\dots$
$,I_{2^{d-1}}\}$, the SSN norm is $\|\mathcal{A}\|_{\mathrm{SSN}} = \sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_*$, and its dual norm is $\|\mathcal{A}\|_{\mathrm{SSN}*} := \sup\{\langle\mathcal{W},\mathcal{A}\rangle : \|\mathcal{W}\|_{\mathrm{SSN}}\le1\}$. By a method similar to that in Tomioka et al. (2011), it can be shown that
$$\|\mathcal{A}\|_{\mathrm{SSN}*} = \inf_{\sum_{k=1}^{2^{d-1}}\mathcal{X}_k=\mathcal{A}}\ \max_{k=1,\dots,2^{d-1}}\|(\mathcal{X}_k)_{[I_k]}\|_{\mathrm{op}}.$$
Then, we can take $\mathcal{X}_k = \big(\sum_{j=1}^{2^{d-1}}1/c_j\big)^{-1}(\mathcal{A}/c_k)$, where $c_k = \|\mathcal{A}_{[I_k]}\|_{\mathrm{op}}$, and apply Jensen's inequality to obtain
$$\|\mathcal{A}\|_{\mathrm{SSN}*} \le 2^{-2(d-1)}\sum_{k=1}^{2^{d-1}}\|\mathcal{A}_{[I_k]}\|_{\mathrm{op}}.$$
Hence, we have
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \le 2^{-2(d-1)}\sum_{k=1}^{2^{d-1}}\Big\|\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I_k]}\Big\|_{\mathrm{op}}.$$
In other words, the dual norm of the SSN norm can be upper bounded by the sum of the scaled matrix operator norms of the different matricizations of the tensor $T^{-1}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t$. All of the square matricizations based on the $I_k$'s lead to a square $p$-by-$p$ matrix. Therefore, by the deviation bound in Lemma B.5, we can take a union bound such that
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SSN}*} \le C\kappa^2M_12^{-(d-1)}\sqrt{\frac{p}{T}}$$
with probability at least $1-\exp[-C(p-d)]$.

Next, for the SN estimator, we can obtain a similar upper bound for the dual norm of the SN norm. The SN norm is defined as $\|\mathcal{A}\|_{\mathrm{SN}} = \sum_{i=1}^{2d}\|\mathcal{A}_{(i)}\|_*$, and its dual norm has the equivalent form
$$\|\mathcal{A}\|_{\mathrm{SN}*} = \inf_{\sum_{i=1}^{2d}\mathcal{Y}_i=\mathcal{A}}\ \max_{i=1,\dots,2d}\|(\mathcal{Y}_i)_{(i)}\|_{\mathrm{op}}.$$
Then, we can obtain the upper bound $\|\mathcal{A}\|_{\mathrm{SN}*}\le(2d)^{-2}\sum_{i=1}^{2d}\|\mathcal{A}_{(i)}\|_{\mathrm{op}}$. For each one-mode matricization, the deviation bound in Lemma B.5 applies, and a union bound yields
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}\mathcal{Y}_{t-1}\circ\mathcal{E}_t\Big\|_{\mathrm{SN}*} \le C\kappa^2M_1(2d)^{-2}\sum_{i=1}^{2d}\sqrt{\frac{p_{-i}p}{T}},$$
with probability at least $1-2d\exp(-Cp)$.

Finally, the MN estimator uses a special case of the square matricization, so the corresponding upper bound follows directly from Lemma B.5.

Proof of Lemma B.4. For any $M\in\mathbb{R}^{m\times p}$, denote $R_T(M) = \sum_{t=0}^{T-1}\|My_t\|_2^2$. Note that $R_T(\boldsymbol{\Delta}_{[S]}) \ge \mathbb{E}R_T(\boldsymbol{\Delta}_{[S]}) - \sup_{\boldsymbol{\Delta}}|R_T(\boldsymbol{\Delta}_{[S]})-\mathbb{E}R_T(\boldsymbol{\Delta}_{[S]})|$. Following the proof of Lemma B.5, $\mathbb{E}R_T(\boldsymbol{\Delta}_{[S]}) = \|(I_T\otimes\boldsymbol{\Delta}_{[S]})PD\|_{\mathrm{F}}^2 \ge T\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2\,\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)$.

Similar to Lemma B.6, for any $v\in\mathbb{S}^{p-1}$ and any $t > 0$,
$$P\big[|R_T(v^\top)-\mathbb{E}R_T(v^\top)|\ge t\big] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right).$$
Considering an $\epsilon$-covering net of $\mathbb{S}^{p-1}$, by Lemma B.7 we can easily construct the union bound: for $T\gtrsim p$,
$$P\left[\sup_{v\in\mathbb{S}^{p-1}}|R_T(v^\top)-\mathbb{E}R_T(v^\top)|\ge t\right] \le C\exp\left(p - C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right).$$
Taking $t = T\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)/2$, for $T\gtrsim M_2^{-2}\max(\kappa^4,\kappa^2)\,p$ we have
$$P\left[\sup_{v\in\mathbb{S}^{p-1}}|R_T(v^\top)-\mathbb{E}R_T(v^\top)|\ge T\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)/2\right] \le \exp\big(-CM_2^2\min(\kappa^{-4},\kappa^{-2})\,T\big),$$
where, in terms of $P$, $M_2 = [\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)]/[\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)]$. Therefore, with probability at least $1-\exp(-CM_2^2\min(\kappa^{-4},\kappa^{-2})\,T)$,
$$R_T(\boldsymbol{\Delta}_{[S]}) \ge \frac{T\lambda_{\min}(\Sigma_e)\lambda_{\min}(PP^\top)}{2}\,\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2.$$
Finally, since $P$ is related to the VMA($\infty$) representation of the process, by the spectral measure of ARMA processes discussed in Basu and Michailidis (2015), we may replace $\lambda_{\max}(PP^\top)$ and $\lambda_{\min}(PP^\top)$ by $1/\mu_{\min}(\mathcal{A})$ and $1/\mu_{\max}(\mathcal{A})$, respectively.

B.4 Three Auxiliary Lemmas

Three auxiliary lemmas used in the proofs of Lemmas B.3 and B.4 are presented below.

Lemma B.5 (Deviation bound for matricizations). For any index set $I\subset\{1,2,\dots,2d\}$, denote $q = \prod_{i\in I}p_i$ and $q' = \prod_{i\notin I}p_i$. If $T\gtrsim q+q'$, then with probability at least $1-\exp[-C(q+q')]$,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}\Big\|_{\mathrm{op}} < C\kappa^2M_1\sqrt{(q+q')/T},$$
where $M_1 = \lambda_{\max}(\Sigma_e)/\mu_{\min}^{1/2}(\mathcal{A})$.

Proof. For any index set $I\subset\{1,2,\dots,2d\}$ and $2d$-th-order tensor $\mathcal{T}$, denote the inverse operation of the multi-mode matricization $T = \mathcal{T}_{[I]}$ by $T^{[I]} = \mathcal{T}$. Denote $\mathcal{W}(r; q, q') = \{W\in\mathbb{R}^{q\times q'} : \mathrm{rank}(W) = r,\ \|W\|_{\mathrm{F}} = 1\}$. By definition,
$$\Big\|\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}\Big\|_{\mathrm{op}} = \sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle = \sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}\mathrm{vec}(\mathcal{E}_t)\mathrm{vec}(\mathcal{Y}_{t-1})^\top,\ (W^{[I]})_{[S]}^\top\Big\rangle.$$
For an arbitrary matrix $W\in\mathbb{R}^{q\times q'}$ such that $\|W\|_{\mathrm{F}} = 1$, denote $M = (W^{[I]})_{[S]}^\top$. Then, one can easily check that $\langle(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\rangle = \langle e_t, My_{t-1}\rangle$.

For a fixed $M$, denote $S_t(M) = \sum_{s=1}^{t}\langle e_s, My_{s-1}\rangle$ and $R_t(M) = \sum_{s=0}^{t-1}\|My_s\|_2^2$, for $1\le t\le T$. By the standard Chernoff argument, for any $\alpha, \beta, c > 0$,
$$P[\{S_T(M)\ge\alpha\}\cap\{R_T(M)\le\beta\}] = \inf_{m>0}P[\{\exp(mS_T(M))\ge\exp(m\alpha)\}\cap\{R_T(M)\le\beta\}]$$
$$= \inf_{m>0}P[\exp(mS_T(M))\,\mathbb{1}(R_T(M)\le\beta)\ge\exp(m\alpha)] \le \inf_{m>0}\exp(-m\alpha)\,\mathbb{E}[\exp(mS_T(M))\,\mathbb{1}(R_T(M)\le\beta)]$$
$$= \inf_{m>0}\exp(-m\alpha+cm^2\beta)\,\mathbb{E}[\exp(mS_T(M)-cm^2\beta)\,\mathbb{1}(R_T(M)\le\beta)] \le \inf_{m>0}\exp(-m\alpha+cm^2\beta)\,\mathbb{E}[\exp(mS_T(M)-cm^2R_T(M))].$$
By the tower rule,
$$\mathbb{E}[\exp(mS_T(M)-cm^2R_T(M))] = \mathbb{E}\big[\exp(mS_{T-1}(M)-cm^2R_{T-1}(M))\,\mathbb{E}[\exp(m\langle e_T,My_{T-1}\rangle-cm^2\|My_{T-1}\|_2^2)\,|\,\mathcal{F}_{T-1}]\big].$$
Since $\langle e_T,My_{T-1}\rangle = \langle\xi_T,\Sigma_e^{1/2}My_{T-1}\rangle$, one can easily check that, conditionally on $\mathcal{F}_{T-1}$, $\langle e_T,My_{T-1}\rangle$ is a $\kappa^2\lambda_{\max}(\Sigma_e)\|My_{T-1}\|_2^2$-sub-Gaussian random variable.
In other words, $\mathbb{E}[\exp(m\langle e_T,My_{T-1}\rangle)\,|\,\mathcal{F}_{T-1}] \le \exp(m^2\kappa^2\lambda_{\max}(\Sigma_e)\|My_{T-1}\|_2^2/2)$. Hence, taking $c = \kappa^2\lambda_{\max}(\Sigma_e)/2$, we have
$$\mathbb{E}[\exp(mS_T(M)-m^2\kappa^2\lambda_{\max}(\Sigma_e)R_T(M)/2)] \le \mathbb{E}[\exp(mS_{T-1}(M)-m^2\kappa^2\lambda_{\max}(\Sigma_e)R_{T-1}(M)/2)] \le \cdots \le \mathbb{E}[\exp(mS_1(M)-m^2\kappa^2\lambda_{\max}(\Sigma_e)R_1(M)/2)] \le 1.$$
Hence, we have that, for any $\alpha, \beta > 0$,
$$P[\{S_T(M)\ge\alpha\}\cap\{R_T(M)\le\beta\}] \le \inf_{m>0}\exp(-m\alpha+m^2\kappa^2\lambda_{\max}(\Sigma_e)\beta/2) = \exp\left(-\frac{\alpha^2}{2\kappa^2\lambda_{\max}(\Sigma_e)\beta}\right). \qquad (\mathrm{B.5})$$
By Lemma B.6, we have that, for any $t > 0$,
$$P[|R_T(M)-\mathbb{E}R_T(M)|\ge t] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right).$$
In addition, $\mathbb{E}R_T(M) = \mathrm{tr}(\Sigma_M) = \|(I_T\otimes M)PD\|_{\mathrm{F}}^2 \le T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)$. Letting $t = C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)$, we have
$$P[R_T(M)\ge C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)] \le 2\exp(-CT).$$
Next, consider an $\epsilon$-net $\bar{\mathcal{W}}(1;q,q')$ for $\mathcal{W}(1;q,q')$. For any matrix $W\in\mathcal{W}(1;q,q')$, there exists a matrix $W'\in\bar{\mathcal{W}}(1;q,q')$ such that $\|W-W'\|_{\mathrm{F}}\le\epsilon$. Since the rank of $\boldsymbol{\Delta} = W-W'$ is at most 2, we can split the SVD of $\boldsymbol{\Delta}$ into two parts, $\boldsymbol{\Delta} = \boldsymbol{\Delta}_1+\boldsymbol{\Delta}_2$, where $\mathrm{rank}(\boldsymbol{\Delta}_1) = \mathrm{rank}(\boldsymbol{\Delta}_2) = 1$ and $\langle\boldsymbol{\Delta}_1,\boldsymbol{\Delta}_2\rangle = 0$. Then, for any matrix $N\in\mathbb{R}^{q\times q'}$,
$$\langle N,W\rangle = \langle N,W'\rangle + \langle N,\boldsymbol{\Delta}\rangle = \langle N,W'\rangle + \sum_{i=1}^{2}\langle N,\boldsymbol{\Delta}_i/\|\boldsymbol{\Delta}_i\|_{\mathrm{F}}\rangle\,\|\boldsymbol{\Delta}_i\|_{\mathrm{F}},$$
where $\boldsymbol{\Delta}_i/\|\boldsymbol{\Delta}_i\|_{\mathrm{F}}\in\mathcal{W}(1;q,q')$. Since $\|\boldsymbol{\Delta}\|_{\mathrm{F}}^2 = \|\boldsymbol{\Delta}_1\|_{\mathrm{F}}^2+\|\boldsymbol{\Delta}_2\|_{\mathrm{F}}^2$, by the Cauchy-Schwarz inequality, $\|\boldsymbol{\Delta}_1\|_{\mathrm{F}}+\|\boldsymbol{\Delta}_2\|_{\mathrm{F}} \le \sqrt{2}\|\boldsymbol{\Delta}\|_{\mathrm{F}} \le \sqrt{2}\,\epsilon$. Hence,
$$\gamma_N := \sup_{W\in\mathcal{W}(1;q,q')}\langle N,W\rangle \le \max_{W'\in\bar{\mathcal{W}}(1;q,q')}\langle N,W'\rangle + \sqrt{2}\,\gamma_N\,\epsilon;$$
in other words, $\sup_{W\in\mathcal{W}(1;q,q')}\langle N,W\rangle \le (1-\sqrt{2}\epsilon)^{-1}\max_{W'\in\bar{\mathcal{W}}(1;q,q')}\langle N,W'\rangle$. Therefore, for any $x > 0$,
$$P\left[\sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle \ge x\right] \le |\bar{\mathcal{W}}(1;q,q')|\cdot P\left[\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W'\Big\rangle \ge (1-\sqrt{2}\epsilon)x\right]. \qquad (\mathrm{B.6})$$
Note that, by (B.5), for any $x > 0$,
$$P\left[\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W'\Big\rangle \ge (1-\sqrt{2}\epsilon)x\right] \le P[\{S_T(M)\ge T(1-\sqrt{2}\epsilon)x\}\cap\{R_T(M)\le C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)\}]$$
$$+\ P[R_T(M) > C\kappa^2T\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)] \le \exp\left[-\frac{CTx^2}{\kappa^4\lambda_{\max}^2(\Sigma_e)\lambda_{\max}(PP^\top)}\right] + 2\exp(-CT).$$
By Lemma 3.1 in Candes and Plan (2011), for an $\epsilon$-net of $\mathcal{W}(1;q,q')$, the covering number satisfies $|\bar{\mathcal{W}}(1;q,q')|\le(9/\epsilon)^{q+q'}$. Combining this with (B.6), we have that, when $T\gtrsim q+q'$, for any $x > 0$,
$$P\left[\sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle \ge x\right] \le \exp\left[(q+q')\log(9/\epsilon) - \frac{CTx^2}{\kappa^4\lambda_{\max}^2(\Sigma_e)\lambda_{\max}(PP^\top)}\right] + 2\exp[(q+q')\log(9/\epsilon) - CT].$$
Taking $\epsilon$ to be a small absolute constant and
$x = C\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}^{1/2}(PP^\top)\sqrt{(q+q')/T}$, we have
$$P\left[\sup_{W\in\mathcal{W}(1;q,q')}\Big\langle\frac{1}{T}\sum_{t=1}^{T}(\mathcal{Y}_{t-1}\circ\mathcal{E}_t)_{[I]}, W\Big\rangle \ge C\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}^{1/2}(PP^\top)\sqrt{\frac{q+q'}{T}}\right] \le \exp[-C(q+q')].$$
Finally, since $P$ is related to the VMA($\infty$) representation, by the spectral measure of ARMA processes discussed in Basu and Michailidis (2015), we may replace $\lambda_{\max}(PP^\top)$ by $1/\mu_{\min}(\mathcal{A})$.

Lemma B.6. For any $M\in\mathbb{R}^{p\times p}$ such that $\|M\|_{\mathrm{F}} = 1$, denote $R_T(M) = \sum_{t=0}^{T-1}\|My_t\|_2^2$. Then, for any $t > 0$,
$$P[|R_T(M)-\mathbb{E}R_T(M)|\ge t] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)},\ \frac{t}{\kappa^2\lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)}\right)\right),$$
where, writing $A = \mathcal{A}_{[S]}$ for brevity, $P$ is defined as
$$P = \begin{pmatrix} I_p & A & A^2 & A^3 & \cdots \\ O & I_p & A & A^2 & \cdots \\ \vdots & & \ddots & \ddots & \\ O & O & \cdots & I_p & \cdots \end{pmatrix}. \qquad (\mathrm{B.7})$$

Proof. Denote $y = (y_{T-1}^\top, y_{T-2}^\top, \dots, y_0^\top)^\top$, $e = (e_{T-1}^\top, e_{T-2}^\top, \dots, e_0^\top, \dots)^\top$, and $\xi = (\xi_{T-1}^\top, \xi_{T-2}^\top, \dots, \xi_0^\top, \dots)^\top$. Based on the moving average representation of the VAR(1) process, we can rewrite $y_t$ as a VMA($\infty$): $y_t = e_t + Ae_{t-1} + A^2e_{t-2} + A^3e_{t-3} + \cdots$. Note that
$$R_T(M) = y^\top(I_T\otimes M^\top M)y = e^\top P^\top(I_T\otimes M^\top M)Pe = \xi^\top DP^\top(I_T\otimes M^\top M)PD\xi := \xi^\top\Sigma_M\xi,$$
where $P$ is defined in (B.7) and $D = \mathrm{diag}(\Sigma_e^{1/2}, \Sigma_e^{1/2}, \dots)$. By the Hanson-Wright inequality, for any $t > 0$,
$$P[|R_T(M)-\mathbb{E}R_T(M)|\ge t] \le 2\exp\left(-C\min\left(\frac{t^2}{\kappa^4\|\Sigma_M\|_{\mathrm{F}}^2},\ \frac{t}{\kappa^2\|\Sigma_M\|_{\mathrm{op}}}\right)\right).$$
As $\|M\|_{\mathrm{F}} = 1$, by the submultiplicative properties of the Frobenius norm and the operator norm, we have $\|\Sigma_M\|_{\mathrm{F}}^2 \le T\lambda_{\max}^2(\Sigma_e)\lambda_{\max}^2(PP^\top)$ and $\|\Sigma_M\|_{\mathrm{op}} \le \lambda_{\max}(\Sigma_e)\lambda_{\max}(PP^\top)$. These imply the asserted bound, and the proof of this lemma is complete.

Lemma B.7 (Covering number of the unit sphere). Let $\bar{\mathcal{N}}$ be an $\varepsilon$-net of the unit sphere $\mathbb{S}^{p-1}$, where $\varepsilon\in(0,1)$. Then $|\bar{\mathcal{N}}|\le(2/\varepsilon+1)^p$.

Proof. This lemma follows directly from Corollary 4.2.13 of Vershynin (2018).

References

Bai, J. and Wang, P. (2016). Econometric analysis of large factor models. Annual Review of Economics, 8:53-80.

Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. Annals of Statistics, 43:1535-1567.

Bi, X., Qu, A., and Shen, X. (2018). Multilayer tensor factorization with applications to recommender systems. Annals of Statistics, 46:3308-3333.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122.

Candes, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57:2342-2359.

Chen, E. Y. and Chen, R. (2019). Modeling dynamic transport network with matrix factor models: with an application to international trade flow. arXiv:1901.00769 [econ.EM].

Chen, E. Y., Tsay, R. S., and Chen, R. (2020a).
Constrained factor models for high-dimensional matrix-variate time series. Journal of the American Statistical Association, 115:775-793.

Chen, H., Raskutti, G., and Yuan, M. (2019). Non-convex projected gradient descent for generalized low-rank tensor regression. Journal of Machine Learning Research, 20(1):172-208.

Chen, R., Xiao, H., and Yang, D. (2020b). Autoregressive models for matrix-valued time series. Journal of Econometrics. To appear.

Chen, R., Yang, D., and Zhang, C.-H. (2020c). Factor models for high-dimensional tensor time series. arXiv:1905.07530v2 [stat.ME].

Davis, R. A., Zang, P., and Zheng, T. (2016). Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25:1077-1096.

De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21:1253-1278.

Ding, S. and Cook, R. D. (2018). Matrix variate regressions and envelope models. Journal of the Royal Statistical Society: Series B, 80:387-408.

Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116:1-22.

Frandsen, A. and Ge, R. (2019). Understanding composition of word embeddings via tensor decomposition. In International Conference on Learning Representations.

French, K. R. (2020). Data library: U.S. research returns data. Available at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html.

Gandy, S., Recht, B., and Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27:025010.

Han, F., Lu, H., and Liu, H. (2015a). A direct estimation of high dimensional stationary vector autoregressions. Journal of Machine Learning Research, 16:3115-3150.

Han, F., Xu, S., and Liu, H. (2015b). Rate-optimal estimation of a high-dimensional semiparametric time series model. Preprint.

Han, R., Willett, R., and Zhang, A. (2020). An optimal statistical and computational framework for generalized tensor estimation. arXiv preprint arXiv:2002.11255.

Hoff, P. D. (2015). Multilinear tensor regression for longitudinal relational data. Annals of Applied Statistics, 9:1169-1193.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51:455-500.

Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. Annals of Statistics, 40:694-726.

Li, X., Xu, D., Zhou, H., and Li, L. (2018). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10:520-545.

Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:208-220.

Mirsky, L. (1960). Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, 11:50-59.

Mu, C., Huang, B., Wright, J., and Goldfarb, D. (2014). Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73-81.

Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39:1069-1097.

Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research, 13:1665-1697.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012).
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27:538-557.

Raskutti, G., Yuan, M., and Chen, H. (2019). Convex regularization for high-dimensional multi-response tensor regression. Annals of Statistics, 47:1554-1584.

Shapiro, A. (1986). Asymptotic theory of overparameterized structural models. Journal of the American Statistical Association, 81:142-149.

Stock, J. H. and Watson, M. W. (2011). Dynamic factor models. In Clements, M. P. and Hendry, D. F., editors, Oxford Handbook of Economic Forecasting. Oxford University Press.

Tomioka, R., Suzuki, T., Hayashi, K., and Kashima, H. (2011). Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems (NIPS), pages 972-980.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279-311.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, Cambridge.

Walden, A. and Serroukh, A. (2002). Wavelet analysis of matrix-valued time series. Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 458:157-179.

Wang, D., Liu, X., and Chen, R. (2019). Factor models for matrix-valued high-dimensional time series. Journal of Econometrics, 208:231-248.

Wang, D., Zheng, Y., Lian, H., and Li, G. (2020). High-dimensional vector autoregressive time series modeling via tensor decomposition. Journal of the American Statistical Association. To appear.

Zheng, Y. and Cheng, G. (2020). Finite time analysis of vector autoregressive models under linear restrictions. Biometrika. To appear.

Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108:540-552.

Figure 2: Average estimation error of the LTR, OLS and RRR estimators for data generated with different $d$, $p_i$'s and multilinear ranks (cases (1a), (1b), (2a) and (2b)).

Table 2: The eight settings (a)-(h) of the second experiment; in each case, all but one of the overall dimension $p$, the sample size $T$, the rank parameter $s_0$ and the individual dimensions $\boldsymbol{p} = (p_1,\dots,p_d)$ are held fixed while the remaining one varies, with multilinear ranks $\boldsymbol{r} = (r_1,\dots,r_{2d})$.

Figure 3: Average squared estimation error for the SSN estimator for the eight cases with different varying parameters.
The error bars in each panel represent ± one standard deviation.

Figure 4: Average estimation error of the TSSN, SSN, MN and SN estimators for data generated with different $d$, $p_i$'s and multilinear ranks (cases (3a), (3b), (4a) and (4b)).

Figure 5: TSSN estimates of the predictor and response factor matrices for the $10\times10$ matrix-valued portfolio return series. From left to right: $\widetilde{U}_1$, $\widetilde{U}_2$, $\widetilde{U}_3$ and $\widetilde{U}_4$.

Table 3: Average in-sample forecasting error and out-of-sample rolling forecasting error ($\ell_1$, $\ell_2$ and $\ell_\infty$ norms) of the VAR, VFM, MAR, MFM and LRTAR (SSN and TSSN) models for the $10\times10$ matrix-valued portfolio return series. The best cases are marked in bold.

Figure 6: TSSN estimates of the predictor and response factor matrices for the $4\times4\times2$ tensor-valued portfolio return series (OP-BM-Size and Inv-BM-Size). From left to right: $\widetilde{U}_1,\dots,\widetilde{U}_6$.

Table 4: Average in-sample forecasting error and out-of-sample rolling forecasting error ($\ell_1$, $\ell_2$ and $\ell_\infty$ norms) of the VAR, VFM, MTAR, TFM and LRTAR (SSN and TSSN) models for the $4\times4\times2$ tensor-valued portfolio return series. The best cases are marked in bold.