Tensor Factor Model Estimation by Iterative Projection
TTensor Factor Model Estimation by Iterative Projection
Yuefeng Han, Rong Chen, Dan Yang and Cun-Hui Zhang Rutgers University and The University of Hong Kong
Abstract
Tensor time series, which is a time series consisting of tensorial observations, has becomeubiquitous. It typically exhibits high dimensionality. One approach for dimension reduction isto use a factor model structure, in a form similar to Tucker tensor decomposition, except thatthe time dimension is treated as a dynamic process with a time dependent structure. In thispaper we introduce two approaches to estimate such a tensor factor model by using iterativeorthogonal projections of the original tensor time series. The approaches extend the existingestimation procedures and our theoretical investigation shows that they improve the estimationaccuracy and convergence rate significantly. The developed approaches are similar to higherorder orthogonal projection methods for tensor decomposition, but with significant differencesand theoretical properties. Simulation study is conducted to further illustrate the statisticalproperties of these estimators.
Motivated by a diverse range of modern scientific applications, analysis of tensors, or multi-dimensional arrays, has emerged as one of the most important and active research areas in statis-tics, computer science, and machine learning. Large tensors are encountered in genomics (Alter andGolub, 2005, Omberg et al., 2007), neuroimaging analysis (Sun and Li, 2017, Zhou et al., 2013),recommender systems (Bi et al., 2018), computer vision (Liu et al., 2012), community detection(Anandkumar et al., 2014), among others. High-order tensors often bring about high dimension-ality and impose significant computational challenges. For example, functional MRI produces atime series of 3-dimensional brain images, typically consisting of hundreds of thousands of voxelsobserved over time. Previous work has developed various tensor-based methods for independentand identically distributed (i.i.d.) tensor data or tensor data with i.i.d. noise. However, as far aswe know, the statistical framework for general tensor time series data was not well studied in theliterature.Factor analysis is one of the most useful tools for understanding common dependence amongmulti-dimensional outputs. Over the past decades, vector factor models have been extensivelystudied in the statistics and economics communities. For instance, Chamberlain and Rothschild(1983), Bai and Ng (2002), Bai (2003) and Stock and Watson (2002) developed the static factormodel using principal component analysis (PCA). They assumed that the common factors musthave impact on most of the time series, and weak serial dependence is allowed for the idiosyncratic Yuefeng Han is Postdoctoral Fellow, Department of Statistics, Rutgers University, [email protected] Chen is Professor, Department of Statistics, Rutgers University, Piscataway, NJ 08854. E-mail:[email protected]. Dan Yang is Associate Professor, Faculty of Business and Economics, The Universityof Hong Kong, Hong Kong. E-mail: [email protected]. Cun-Hui Zhang is Professor, Department of Statistics,Rutgers University, Piscataway, NJ 08854. E-mail: [email protected]. Cuh-Hui Zhang is the correspondingauthor. Han’s research is supported in part by National Science Foundation grant IIS-1741390. Chen’s researchis supported in part by National Science Foundation grants DMS-1503409, DMS-1737857 and IIS-1741390. Yang’sresearch is supported in part under NSF grant IIS-1741390. Zhang’s research is supported in part by NSF grantsDMS-1721495, IIS-1741390 and CCF-1934924. a r X i v : . [ s t a t . M E ] J un oise process. Fan et al. (2011, 2013, 2018) established large covariance matrix estimation based onthe static factor model. The static factor model has been further extended to the dynamic factormodel in Forni et al. (2000). The latent factors are assumed to follow a time series process, whichis commonly taken to be a vector autoregressive process. Fan et al. (2016) studied semi-parametricfactor models through projected principal component analysis. Pan and Yao (2008), Lam et al.(2011) and Lam and Yao (2012) adopted another type of factor model. They assumed that thelatent factors should capture all dynamics of the observed process, and thus the idiosyncratic noiseprocess has no serial dependence.Although there have been significant efforts in developing methodologies and theories for vectorfactor models, there is paucity of literature on matrix- or tensor-valued time series. Wang et al.(2019) proposed a matrix factor model for matrix-valued time series, which explores the matrixstructure. Chen et al. (2019a) established a general framework for incorporating domain and priorknowledge in the matrix factor model through linear constraints. Chen and Chen (2019) applied thematrix factor model to the dynamic transport network. Chen et al. (2020) developed an inferentialtheory of the matrix factor model under a different setting from that in Wang et al. (2019) .Recently, Chen et al. (2019b) introduced a factor approach for analyzing high dimensionaldynamic tensor time series in the form X t “ M t ` E t , (1)where X , ..., X T P R d ˆ¨¨¨ˆ d K are the observed tensor time series, M t and E t are the correspondingsignal and noise components of X t , respectively. The goal is to estimate the unknown signal tensor M t from the tensor time series data. Following Lam and Yao (2012), it is assumed that the signalpart accommodates all dynamics, making the idiosyncratic noise E t uncorrelated (white) acrosstime. It is further assumed that M t is in a lower dimensional space and has certain multilineardecomposition. Specifically, we assume that M t satisfies the following Tucker-type decomposition.Then model (1) can be rewritten as X t “ F t ˆ A ˆ . . . ˆ K A K ` E t , (2)where A k is the deterministic loading matrix of size d k ˆ r k and r k ! d k , and the core tensor F t itself is a latent tensor factor process of dimension r ˆ . . . ˆ r K . Here the k -mode product of X P R d ˆ d ˆ¨¨¨ˆ d K with a matrix U P R d k ˆ d k , denoted as X ˆ k U , is an order K -tensor of size d ˆ ¨ ¨ ¨ ˆ d k ´ ˆ d k ˆ d k ` ˆ ¨ ¨ ¨ ˆ d K such that p X ˆ k U q i ,...,i k ´ ,j,i k ` ,...,i K “ d k ÿ i k “ X i ,i ,...,i K U j,i k . The core tensor F t is usually much smaller than X t in dimension. This structure provides aneffective dimension reduction, as all the comovements of individual time series in X t are driven by F t . Without loss of generality, assume that A k is of rank r k ! d k . It should be noted that vectorand matrix factor models can be viewed as special cases of our model since a vector time series is atensor time series composed of a single fiber and a matrix times series is one composed of a singleslice.Chen et al. (2019b) proposed two estimation procedures (TOPUP and TIPUP) for estimatingthe column space spanned by the loading matrix A k , for k “ , . . . , K . Their procedures arebased on two matrix unfolding operations of the auto-cross-moment of the original tensors X t and2 t ´ h , h ą
0, and utilizing the assumption that the noise E t and E t ´ h , h ą d “ d d . . . d K , a potentially very largenumber as d k , k “ , . . . , K , are large. Often they require a large T , the length of the time series,for accurate estimation of the loading spaces.In this paper we propose two extensions of the estimation approaches taken by Chen et al.(2019b), motivated by the following observation. Suppose that the loading matrices A k are or-thonormal with A J k A k “ I , and we are given A , . . . , A K . Let Z t “ X t ˆ A J ˆ . . . ˆ K A J K ; and E ˚ t “ E t ˆ A J ˆ . . . ˆ K A J K ;Then (2) leads to Z t “ F t ˆ A ` E ˚ t (3)where Z t is a d ˆ r ˆ . . . ˆ r k tensor. Since r k ! d k , Z t is a much smaller tensor than X t . Underproper (weak) conditions on the combined noise tensor E ˚ t , estimation of the loading space of A based on Z t can be made significantly more accurate, as the convergence rate now depends on d r . . . r k rather than d d . . . d k .Of course, in practice we do not know A , . . . , A K . Similar to backfitting algorithms, we proposean iterative algorithm. With a proper initial value, we iteratively estimate the loading space of A k at iteration j based on Z p j q t,k “ X t ˆ ˆ A p j qJ ˆ . . . ˆ k ´ ˆ A p j qJ k ´ ˆ k ` ˆ A p j ´ qJ k ` ˆ k ` . . . ˆ K ˆ A p j ´ qJ K , using the estimate ˆ A p j ´ q k , k ă k ď K obtained in the previous iteration and the estimate ˆ A p j q k , ď k ă k obtained in the current iteration. Our theoretical investigation shows that the iterativeprocedures for estimating A can achieve the convergence rate as if all A , . . . , A K are known andwe indeed observe Z t following model (3). We call the procedure iTOPUP and iTIPUP, based on thematrix unfolding mechanism used, corresponding to TOPUP and TIPUP procedures of Chen et al.(2019b). To be more specific, our algorithms have two steps: (i) We first use the estimated columnspace of factor loading matrices of TOPUP (resp. TIPUP) in Chen et al. (2019b) to construct theinitial estimate of factor loading spaces; (ii) We then iteratively perform matrix unfolding of theauto-cross-moments of much smaller tensors Z p j q t,k to obtain the final estimators.We note that the iterative procedure is similar to higher order orthogonal iteration (HOOI) thathas been widely studied in the literature; see, e.g., De Lathauwer et al. (2000), Sheehan and Saad(2007), Liu et al. (2014), Zhang and Xia (2018), among others. However, most of the existing worksare not designed for tensor time series. They do not consider the the special role of the time modenor the covariance structure in the time direction. Often the signal part are treated as fixed ordeterministic. In this paper we focus on the setting that the core tensor F t in (2) is dynamic whichrequires special treatment. Although our iterative procedures in each iteration also consist of powerup and orthogonal projection operations, similar to HOOI, the matrix unfolding operation is onthe auto-cross-moments that pays special attention to the time mode. Although iTOPUP proposedhere can be reformulated as a twist of HOOI on the auto-covariance tensor, iTIPUP is differentand cannot be recast equivalently as a HOOI. More importantly, the theoretical investigation andtheoretical properties of the estimators are fundamentally different from those of HOOI, due to thedynamic structure of tensor time series and the use of auto-cross-moments, instead of covariancematrices. 3n this paper, we establish upper bounds on the estimation errors for both the iTOPUP andthe iTIPUP, which are much sharper than the respective theoretical guarantees for TOPUP andTIPUP in Chen et al. (2019b), demonstrating the benefits of iterative projection. It is also shownthat our algorithms converge within a logarithmic number of iterations. We mainly focus on thecases where the tensor dimensions are large and of similar order. We also cover the cases wherethe dimensions and the ranks of the tensor factor increase with the dimensions of the tensor timeseries. Alternative methods to the iTOPUP and iTIPUP are discussed.Chen et al. (2019b) showed that the TIPUP has a faster rate than the TOPUP, under a mildcondition on the level of signal cancellation. In contrast, the theoretically guaranteed rate ofconvergence for the iTOPUP in this paper is of the same order or even faster than that for theiTIPUP under certain regularity conditions. Our results also suggest an interesting phenomenon.Using the iterative procedures, we find that both the increase in dimension and in sample size canimprove the estimation of the factor loading space of the tensor factor model with the tensor order K ě
2. We believe that such a super convergence rate is new in the literature. Specifically, underthe strong factor conditions, the convergence rate of the iterative procedures for estimating thespace of A k is O P p T ´ { d ´ { ´ k q , where d ´ k “ ś j ‰ k d j , while the traditional rate of the non-iterativeprocedures is O P p T ´ { q for univariate factor model (Lam et al., 2011), matrix/tensor factor model(Chen et al., 2019b, Wang et al., 2019). While the increase in the dimensions d k ( k “ , . . . , K )will not improve the performance of the non-iterative estimators, it significantly improves that ofthe proposed iterative estimators.The paper is organized as follows. Section 2.1 introduces basic notation and preliminariesof tensor analysis. We present the tensor factor model and the procedures for the iTOPUP andiTIPUP in Section 2.2 and 2.3. Theoretical properties of the iTOPUP and iTIPUP are investigatedin Section 3. Numerical comparison of our iterative procedures and other methods is given in Section4. Section 5 provides a brief summary. All technical details are relegated to Appendix. Throughout this paper, for a vector x “ p x , ..., x p q J , define } x } q “ p x q ` ... ` x qp q { q , q ě
1. For a ma-trix A “ p a ij q P R m ˆ n , write the SVD as A “ U Σ V J , where Σ “ diag p σ p A q , σ p A q , ..., σ min t m,n u p A qq ,with the singular values σ p A q ě σ p A q ě ¨ ¨ ¨ ě σ min t m,n u p A q ě } A } F “ p ř ij a ij q { “ p ř min t m,n u i “ σ i p A qq { . Define the spectralnorm } A } S “ max } x } “ , } y } “ } x J Ay } “ σ p A q . Let σ min p A q (resp. σ max p A q ) be the smallest (resp. largest) nontrivial singular value of A . Denotethe projection operator onto the column space of A as P A “ A p A J A q : A J , where p¨q : is the Moore-Penrose pseudo-inverse. Based on the SVD A “ U Σ V J with Σ nonsingular, P A can be equivalentlywritten as P A “ U U J . For any two matrices A P R m ˆ r , B P R m ˆ r , denote the Kroneckerproduct d as A d B P R m m ˆ r r . Let ξ “ p ξ , ..., ξ p q J be a random vector. For two sequencesof real numbers t a n u and t b n u , write a n “ O p b n q (resp. a n — b n ) if there exists a constant C suchthat | a n | ď C | b n | (resp. 1 { C ď a n { b n ď C ) holds for all sufficiently large n , and write a n “ o p b n q iflim n Ñ8 a n { b n “
0. Write a n À b n (resp. a n Á b n ) if there exist a constant C such that a n ď Cb n a n ě Cb n ). Denote a ^ b “ min t a, b u and a _ b “ max t a, b u . We use C, C , c, c , ... to denotegeneric constants, whose actual values may vary from line to line.For any two m ˆ r matrices with orthonormal columns, say, U and p U , suppose the singularvalues of U J p U are σ ě σ ě ¨ ¨ ¨ ě σ r ě
0. A natural measure of distance between the columnspaces of U and p U is then } p U p U J ´ U U J } S “ a ´ σ r , (4)which equals to the sine of the largest principle angle between the column spaces of U and p U .For any two tensors A P R m ˆ m ˆ¨¨¨ˆ m K , B P R r ˆ r ˆ¨¨¨ˆ r N , denote the tensor product b as A b B P R m ˆ¨¨¨ˆ m K ˆ r ˆ¨¨¨ˆ r N , such that p A b B q i ,...,i K ,j ,...,j N “ p A q i ,...,i K p B q j ,...,j N . Let vec p¨q be the vectorization of matrices and tensors. The mode- k unfolding (or matricization)is defined as mat k p A q , which maps a tensor A to a matrix mat k p A q P R m k ˆ ś Kj ‰ k m j . For example,if A P R m ˆ m ˆ m , then p mat p A qq i, p j ` m p k ´ qq “ p mat p A qq j, p k ` m p i ´ qq “ p mat p A qq k, p i ` m p j ´ qq “ A ijk . The tensor Hilbert Schmidt norm for a tensor A P R m ˆ m ˆ¨¨¨ˆ m K is defined as } A } HS “ gffe m ÿ i “ ¨ ¨ ¨ m K ÿ i K “ p A q i ,...,i K . Define the tensor operator norm for an order-4 tensor A P R m ˆ m ˆ m ˆ m , } A } op “ max i ,i ,i ,i u i ,i ¨ u i ,i ¨ p A q i ,i ,i ,i : } U } F “ } U } F “ + , where U “ p u i ,i q P R m ˆ m and U “ p u i ,i q P R m ˆ m . Again, we consider X t “ F t ˆ A ˆ . . . ˆ K A K ` E t . Without loss of generality, assume that A k is of rank r k . As discussed in Chen et al. (2019b), A k is not necessarily orthonormal, which is different from the classical Tucker decomposition (Tucker(1966)). Model (2) is unchanged if we replace p A , ..., A K , F t q by p A H , ..., A K H K , F t ˆ Kk “ H ´ k q for any invertible r k ˆ r k matrix H k . We can always rotate an estimated factor loading matrixwhenever appropriate. Although p A , ..., A K , F t q are not uniquely determined, the factor loadingspace, that is, the linear space spanned by the columns of A k , is uniquely defined. Denote theorthogonal projection to the column space of A k as P k “ P A k “ A k p A J k A k q ´ A J k . (5)We use P k to represent the factor loading space of A k . If A k has SVD A k “ U k Λ k V J k , then P k “ U k U J k . The canonical representation of the tensor times series (2) is written as X t “ F p cano q t ˆ Kk “ U k ` E t , U k is the left singular matrix of A k and the core tensor F p cano q t “ F t ˆ Kk “ p Λ k V J k q absorbsthe diagonal and right singular matrices of A k . In this canonical form, the loading matrices U k areidentifiable up to a rotation in general and up to a permutation and sign changes of the columnsof U k when the singular values are all distinct in the population version of the TOPUP or TIPUPmethods, as we describe in Section 2.3 below. In what follows, we may identify the tensor timeseries as its canonical form, i.e. A k “ U k , without explicit declaration.In this paper, we consider more powerful estimation procedures to achieve sharper convergencerates than the ones in Chen et al. (2019b). It is worth emphasizing that we do not impose anyspecific structure for the dynamics of the core tensor factor process F t P R r ˆ¨¨¨ˆ r K . The estimationprocedures we use also do not require any additional structure on the noise process E t . Once wehave obtained an estimator of A k , say p A k , which is actually p U k in the canonical representation, anatural estimator for the core factor process is p F t “ X t ˆ p A J ˆ ¨ ¨ ¨ ˆ K p A J K . The resulting residuals are p E t “ X t ´ X t ˆ p P ˆ ¨ ¨ ¨ ˆ K p P K , where p P k “ p A k p p A J k p A k q ´ p A J k is the estimated projection to the column space of A k . A parsimonious fitting for p F t may be obtainedby appropriately rotating p F t (see, e.g., Tiao and Tsay (1989)). Such a rotation is equivalent toreplacing each p A k by p A k H k with an appropriate orthogonal matrix H k .The lagged cross-product of the tensor version of the autocovariance is denoted as Σ h , whichcan be viewed as an order-2 K tensor,Σ h “ E ˜ T ÿ t “ h ` X t ´ h b X t T ´ h ¸ “ E ˜ T ÿ t “ h ` M t ´ h b M t T ´ h ¸ P R d ˆ¨¨¨ˆ d K ˆ d ˆ¨¨¨ˆ d K , for h “ , ..., h . As M t “ M t ˆ Kk “ P k for all t ,Σ h “ Σ h ˆ Kk “ P k “ E ˜ T ÿ t “ h ` F t ´ h b F t T ´ h ¸ ˆ Kk “ P k A k , with the notation A k “ A k ´ K and P k “ P k ´ K for all k ą K . Let A k “ U k Λ k V J k be the SVDof A k , where the columns of U k and V k are the left singular vectors and right singular vectors,respectively. Then P k “ U k U J k . The following estimating procedures are based the sample versionof Σ h ( h “ , ..., h ), p Σ h “ T ÿ t “ h ` X t ´ h b X t T ´ h , h “ , ..., h . (6)Here, the term with h “ E t , E p E t b E t q ‰
0, is involved in Σ . 6 .3 Estimating procedures We start with a quick description of the TOPUP and TIPUP procedures of Chen et al. (2019b).They serve as the starting point of our proposed iTOPUP and iTIPUP procedures. (i). Time series Outer-Product Unfolding Procedure (TOPUP) :Let X T “ p X , . . . , X T q . Define an order-5 tensor asTOPUP k p X T q “ ˜ T ÿ t “ h ` mat k p X t ´ h q b mat k p X t q T ´ h , h “ , ..., h ¸ , (7)where h is a predetermined positive integer. Here we note that TOPUP k p¨q is a function mappinga tensor time series to a order-5 tensor.Let d ´ k “ d { d k with d “ ś Kk “ d k . Then TOPUP k p X T q is a tensor of dimension d k ˆ d ´ k ˆ d k ˆ d ´ k ˆ h , and mat p TOPUP p X T qq is a matrix of dimension d k ˆp d h { d k q . In the definition ofTOPUP k p X T q , the information from different time lags is accumulated, which is useful especiallywhen the sample size T is small. A relatively small h is typically used, since the autocorrelationis often at its strongest with small time lags. Larger h strengthens the signal, but also adds morenoise in the calculation of (7); see, e.g., Wang et al. (2019).The TOPUP method in Chen et al. (2019b) performs SVD of the mode-1 matrix unfolding of(7) to obtain the truncated left singular matrices p U T OP UPk,m p X T q “ LSVD m p mat p TOPUP k p X T qqq , (8)where LSVD m stands for the left singular matrix composed of the first m left singular vectorscorresponding to the largest m singular values. Because of the equivalence of SVD and eigendecomposition, we can take an alternative approach by computing the eigen decomposition ofmat p TOPUP k p X T qq mat J p TOPUP k p X T qq to obtain p U T OP UPk,m p X T q .For simplicity, we write UTOPUP k p X T , r k q “ p U T OP UPk,r k p X T q , (9)where r k is the mode- k rank. Again, we emphasize that UTOPUP k p¨q takes input of a tensor timeseries of length T with the target mode- k having dimension d k and rank r k , and produces an outputmatrix of size d k ˆ r k as the estimate of the mode- k loading matrix.As the expectation satisfies E r mat p TOPUP k p X T qqs“ A k mat k ˜ T ÿ t “ h ` E ˆ F t ´ h b F t T ´ h ˙ ˆ k ´ l “ A l ˆ Kl “ k ` A l , h “ , ..., h ¸ , (10)the TOPUP is expected to be consistent in estimating the column space of A k . (ii). Time series Inner-Product Unfolding Procedure (TIPUP) :Similar to (7), define a d k ˆ p d k h q matrix asTIPUP k p X T q “ mat ˜ T ÿ t “ h ` mat k p X t ´ h q mat J k p X t q T ´ h , h “ , ..., h ¸ , (11)7hich replaces the tensor product by the inner product in (7). The TIPUP method performs SVDon TIPUP k p X T q : p U T IP UPk,m p X T q “ LSVD m p TIPUP k p X T qq , for k “ , ..., K . Again, for simplicity, we writeUTIPUP k p X T , r k q “ p U T IP UPk,r k p X T q . (12)where r k is the mode- k rank.Note that E r TIPUP k p X T qs“ mat ` x Σ h , I k,k ` K y t k,k ` K u c , h “ , ..., h ˘ “ A k mat ¨˝C T ÿ t “ h ` E ˆ F t ´ h b F t T ´ h ˙ ˆ l ‰ k, ď l ď K A l , I k,k ` K G t k,k ` K u c , h “ , ..., h ˛‚ , (13)where I k,k ` K is an order-2 K tensor with elements p I k,k ` K q i , j “ I t i ´ k “ j ´ k u , i “ p i , ..., i K q , j “ p j , ..., j K q , i ´ k “ p i , ..., i k ´ , i k ` , ..., i K q , j ´ k “ p j , ..., j k ´ , j k ` , ..., j K q . Here, x¨ , ¨y t k,k ` K u c is defined as an inner product summation over all indices other than t k, k ` K u .The pseudo-code for a generic iterative procedure, under the motivation described in Section 1,is provided in Algorithm 1. It incorporates two estimators/operators UINIT and UITER that mapa tensor time series to an estimate of the loading matrix. UTOPUP and UTIPUP operators in (9)and (12) are examples of the operators.When we use the TOPUP operator (9) for both UINIT and UITER in Algorithm 1, it willbe called iTOPUP procedure. Similarly, iTIPUP uses TIPUP operator (12) for both UINIT andUITER. Besides these two versions, we may also use TIPUP for UINIT and TOPUP for UITER,named as TIPUP-iTOPUP. Similarly, TOPUP-iTIPUP denotes the procedure with TOPUP asUINIT and TIPUP as UITER. These variants are sometimes useful, because TOPUP and TIPUPhave different theoretical properties as the initializer or for iteration, as discussed in Section 3. Otherestimators of the loading spaces based on the tensor time series can also be used in place of UINITand UITER, including those briefly mentioned in Chen et al. (2019b), such as the conventionalhigh order SVD for tensor decomposition, which we refer to as Unfolding Procedure (UP), thatsimply performs SVD of the matricization along the appropriate mode of the K ` p X , . . . , X T q with time dimension as the additional (K+1)-th mode. Remark 1.
As the outer product is taken with TOPUP k in (7), the iTOPUP can be carried outby applying essentially HOOI to the order 2K+1 auto-cross-moment tensor ˆΣ h in (6) with thefollowing iteration: With U p j q K ` (cid:96) “ U p j q (cid:96) for all j ě p U p j q k “ LSVD r k ´ mat k ´ ˆΣ h ˆ (cid:96) P L ´ k p p U p j q (cid:96) q J ˆ (cid:96) P L ` k p p U p j ´ q (cid:96) q J , h “ , . . . , h ¯¯ where L ´ k “ tp p k ´ qq Y pp K ` q : p K ` k ´ qu and L ` k “ tpp k ` q : K q Y pp K ` k ` q : p K qqu .However, for iTIPUP, we need to first apply the projections to the data X t before computingthe auto-covariance tensor in the reduced space. Hence iTIPUP cannot be reduced to a HOOIprocedure. 8 lgorithm 1 A generic iterative algorithm Input: X t P R d ˆ¨¨¨ˆ K d K for t “ , ..., T , r k for all k “ , .., K , the tolerance parameter (cid:15) ą J , and the UINIT and UITER operators. Let j “
0, initiate via applying UINIT on t X T u , for k “ , ..., K , to obtain p U p q k “ UINIT k p X T , r k q . repeat Let j “ j `
1. At the j -th iteration, for k “ , ..., K , given previous estimates p p U p j ´ q k ` , . . . , p U p j ´ q K q and p p U p j q , . . . , p U p j q k ´ q , sequentially calculate, Z p j q t,k “ X t ˆ p p U p j q q J ˆ ¨ ¨ ¨ ˆ k ´ p p U p j q k ´ q J ˆ k ` p p U p j ´ q k ` q J ˆ k ` ¨ ¨ ¨ ˆ K p p U p j ´ q K q J , for t “ , . . . , T . Perform UITER on the new tensor time series Z p j q T,k “ p Z p j q ,k , . . . , Z p j q T,k q . p U p j q k “ UITER k p Z p j q T,k , r k q . until j “ J or max ď k ď K } p U p j q k p p U p j q k q J ´ p U p j ´ q k p p U p j ´ q k q J } S ď (cid:15), Estimate and output: p U iFinal k “ p U p j q k , k “ , ..., K, x P k iFinal “ p U iFinal k p p U iFinal k q J , k “ , ..., K, p F iFinal t “ X t ˆ Kk “ p p U iFinal k q J , t “ , ..., T. In this section, we shall investigate the statistical properties of the proposed iTOPUP and iTIPUPdescribed in the last section. Our theories provide theoretical guarantees for consistency and presentconvergence rates in the estimation of the principle space of the factor loading matrices A k , underproper regularity conditions.We introduce some notations first. Let d “ ś Kk “ d k , d ´ k “ d { d k , r “ ś Kk “ r k and r ´ k “ r { r k .9efine order-4 tensorsΘ k,h “ T ÿ t “ h ` mat k p M t ´ h q b mat k p M t q T ´ h P R d k ˆ d ´ k ˆ d k ˆ d ´ k , (14)Φ k,h “ T ÿ t “ h ` mat k p F t ´ h q b mat k p F t q T ´ h P R r k ˆ r ´ k ˆ r k ˆ r ´ k , Φ p cano q k,h “ T ÿ t “ h ` mat k p M t ´ h ˆ Kk “ U J k q b mat k p M t ´ h ˆ Kk “ U J k q T ´ h P R r k ˆ r ´ k ˆ r k ˆ r ´ k , with U k from the SVD A k “ U k Λ k V J k . We view Φ p cano q k,h as the canonical version of the auto-covariance of the factor process. Let E p¨q “ E p¨|t F , ..., F T uq . The noiseless version of the order-5TOPUP tensor (7) isΘ k, h “ p Θ k,h , h “ , . . . , h q “ E “ TOPUP k p X T q ‰ P R d k ˆ d ´ k ˆ d k ˆ d ´ k ˆ h , (15)and its factor version is Φ k, h “ p Φ k,h , h “ , . . . , h q P R r k ˆ r ´ k ˆ r k ˆ r ´ k ˆ h . Similarly defineΘ ˚ k,h “ T ÿ t “ h ` mat k p M t ´ h q mat J k p M t q T ´ h P R d k ˆ d k , (16)Φ ˚ k,h “ T ÿ t “ h ` mat k p F t ´ h q mat J k p F t q T ´ h P R r k ˆ r k , Φ ˚p cano q k,h “ U J k Θ ˚ k,h U k “ T ÿ t “ h ` mat k p M t ´ h ˆ Kk “ U J k q mat J k p M t ˆ Kk “ U J k q T ´ h P R r k ˆ r k . The noiseless version of (11) without matricization isΘ ˚ k, h “ p Θ ˚ k,h , h “ , . . . , h q “ E “ TIPUP k p X T q ‰ P R d k ˆ d k ˆ h (17)and its factor version is Φ ˚ k, h “ p Φ ˚ k,h , h “ , . . . , h q P R r k ˆ r k ˆ h . Let τ k,m be the m -th singularvalue of the noiseless version of the mode- k TOPUP matrix unfolding, τ k,m “ σ m p E mat p TOPUP k p X T qqq “ σ m ` mat p Θ k, h q ˘ “ σ m ` mat p Φ p cano q k, h q ˘ . Then the signal strength for iTOPUP can be characterized as λ k “ b h ´ { τ k,r k . (18)Similarly, let τ ˚ k,m “ σ m p E p TIPUP k p X T qqq “ σ m ` mat p Θ ˚ k, h q ˘ “ σ m ` mat p Φ ˚p cano q k, h q ˘ . Then the signal strength for iTIPUP can be characterized as λ ˚ k “ b h ´ { τ ˚ k,r k . (19)10e note that λ ˚ k ď } Θ ˚ , } S {p ´ h { T q by (16) and Cauchy-Schwarz inequality.To present theoretical properties of the proposed procedures, we establish the following condi-tions. These assumptions are used in Chen et al. (2019b) for the theoretical properties of TOPUPand TIPUP procedures. As TOPUP and TIPUP are basic operations for the proposed iterativealgorithms in this paper, the assumptions are also used here, though we present them slightlydifferently for better presentation. Assumption 1.
The error process E t are independent Gaussian tensors, condition on the factorprocess t F t , t P Z u . In addition, there exists some constant σ ą
0, such that E p u J vec p E t qq ď σ } u } , u P R d . This assumption is standard in most factor modelling literature. See the discussion in Chenet al. (2019b). Under this assumption the magnitude of the noise can be measured by the dimension d k before the projection and by the ranks r k after the projection. The main theorems (Theorems 1and 2 in Sections 3.2 and 3.3) are based on this assumption on the noise alone, and covers allpossible settings for the signal M t . To make our discussion more clear and to provide furtherinterpretation and simplified version of the main theorems, we provide more detailed conditionsand convergence rates of the iTOPUP and iTIPUP algorithms under the following two differentassumptions on the signal process M t . Assumption 2.
Assume r , ..., r K are fixed. The factor process F t is weakly stationary, and1 T ´ h T ÿ t “ h ` F t ´ h b F t ÝÑ E p F t ´ h b F t q in probability , where the elements of E p F t ´ h b F t q are all finite. In addition, the condition numbers of A J k A k ( k “ , ..., K ) are bounded. Furthermore, assume that h is fixed, and(i) (TOPUP related): E r mat p Φ k, h qs is of rank r k for 1 ď k ď K .(ii) (TIPUP related): E r mat p Φ ˚p cano q k, h qs is of rank r k for 1 ď k ď K .Under this assumption, the factor process has a fixed expected auto-cross-moment tensor withfixed dimensions. The assumption that the condition numbers of A J k A k ( k “ , ..., K ) are boundedcorresponds to the pervasive condition (e.g., Stock and Watson (2002), Bai (2003)). It ensures thatall the singular values of A k are of the same order. Such conditions are commonly imposed in factoranalysis.As our methods are based on auto-cross-moment at nonzero lags, we do not need to assume anyspecific model for the latent process F t , except some rank conditions in Assumption 2(i) and (ii),In order to provide a more concrete understanding of Assumption 2(i) and (ii), consider the caseof k “ K “
2. We write the factor process F t “ p f i,j,t q d ˆ d , and the stationary auto-cross-moments φ i ,j ,i ,j ,h “ E p f i ,j ,t ´ h f i ,j ,t q . Hence E r mat p Φ k, h qs is a r k ˆ p r ´ k r k r ´ k h q matrix,with columns being φ ¨ ,j ,i ,j ,h . Since E r mat p Φ k, h qs E r mat p Φ k, h qs J is a sum of many semi-positive definite r k ˆ r k matrices, if any one of these matrices is full rank, then E r mat p Φ k, h qs isof rank r k . Hence Assumption 2(i) is relatively easy to fulfill. On the other hand, Assumption 2(ii)is quite different. First, the condition is in the canonical form of the model as the inner productin TIPUP related procedures behaves differently. Let F p cano q t “ U J M t U “ p f p cano q i,j,t q d ˆ d , and φ p cano q i ,j ,i ,j ,h “ E p f p cano q i ,j ,t ´ h f p cano q i ,j ,t ). Then } Φ ˚p cano q , h } “ ř h h “ ř i ,i ` ř r j “ φ p cano q i ,j,i ,j,h ˘ . As φ p cano q i ,j,i ,j,h i , i , j, h , the summation ř r j “ φ p cano q i ,j,i ,j,h is subject topotential signal cancellation for h ą
0. Assumption 2(ii) ensures that there is no complete signalcancellation that makes the rank of E r mat p Φ ˚p cano q k, h qs less than r k . Note that the conditions dependon the choice of h . See the discussion in Chen et al. (2019b).In general, the dimensions of the core factor r k ( k “ , ..., K ) may diverge as the dimensions ofsignal tensor d k ( k “ , ..., K ) grow to infinity. The following assumption provides a more concreteset of conditions that can be used to provide some insights of the properties of the iTOPUP andiTIPUP algorithms. Assumption 3.
For a certain δ P r , s , } Θ k, } op — σ d ´ δ r ´ and } Θ ˚ k, } S — σ d ´ δ r r ´ k withprobability approaching one (as T Ñ 8 ). For the singular values, two scenarios are considered.(i) (TOPUP related): There exist some constants δ P r δ , s and c ą T Ñ 8 ), λ k ě c σ d ´ δ r ´ k , for all k “ , ..., K .(ii) (TIPUP related): There exist some constants δ P r δ , s , c ą δ ě T Ñ 8 ), λ ˚ k ě c σ d ´ δ r ´ k r ´ δ ´ k , for all k “ , ..., K .Assumption 3 is similar to the signal strength condition of Lam and Yao (2012), and thepervasive condition on the factor loadings (e.g., Stock and Watson (2002) and Bai (2003)). Thisassumption describes the signal process M t in terms of factor strength. It is more general thanAssumption 2 in the sense that it allows r , ..., r K to diverge and the latent process F t does nothave to be weakly stationary.We take δ , δ as measures of the strength of factors. They roughly indicate how much informa-tion is contained in the signals compared with the amount of noise, with respect to the dimensionsand ranks, d, r and r k . In this sense, they reflect the signal to noise ratio. When δ “ δ “
0, thefactors are called strong factors; otherwise, the factors are called weak factors.
Remark 2 (Signal Strength δ ) . Following Lam and Yao (2012), we may assume } Θ k, } — d p σ d ´ δ q . This assumption can be seen from the following. Note that } Θ k, } is the sum oftotal d terms, each is the square of the time average of the cross-product of the entries in M t .When p d q ´ δ of the d terms are roughly constant (and comparable with σ ), and the rest arealmost zero, then the total sum would be — d p σ d ´ δ q . In this case 1 ´ δ controls how sparsethe signal tensor M t is. If δ “
0, the signal tensor M t is dense with entries comparable with thenoise level σ , hence with strong factors. If the proportion of the nonzero entries in M t increasesat the rate of p d q ´ δ , or the overall average of the d terms decreases at the rate p d q ´ δ , thesignal strength is δ ą
0. If in addition all the nonzero eigenvalues are of the same order for eachof the nonnegative-definite mappings Θ k, : R d Ñ R d and Θ ˚ k, : R d k Ñ R d k , with respective ranks r “ ś Kk “ r k and r k , we would have } Θ k, } op — σ d ´ δ r ´ and } Θ ˚ k, } S — σ d ´ δ r r ´ k due totr p Θ k, q “ tr p Θ ˚ k, q . The assumptions on the ranks of Θ k, and Θ ˚ k, hold when Φ k, and Φ ˚ k, areof full rank and rank p A k q “ r k , which is a typical condition in the factor model setting. Remark 3 (Assumption 3(i) and the role of δ ) . In fact, for TOPUP, Assumption 3(i) holds when(a) } E mat p TOPUP k q} “ ř h h “ } Θ k,h } — h σ d p ´ δ q and (b) all the nonzero singular valuesof E r mat p TOPUP k qs are of the same order. Again, consider the case of k “ K “
2. Wewrite the factor process in the canonical form as F t “ U J M t U “ p f i,j,t q d ˆ d , and φ p cano q i ,j ,i ,j ,h “ ř Tt “ h ` f i ,j ,t ´ h f i ,j ,t {p T ´ h q as the time average cross product between fibers f i ,j , T and f i ,j , T of the factor process. The first condition (a) means ř h h “ } Θ ,h } “ ř h h “ } Φ p cano q ,h } “ i ,j ,i ,j ,h ` φ p cano q i ,j ,i ,j ,h ˘ — h σ d p ´ δ q , which is in the same form as } Θ , } “ } Φ p cano q , } — σ d p ´ δ q . As } Θ ,h } ď } Θ , } {p ´ h { T q by Cauchy-Schwarz inequality, we need to assume δ ě δ in Assumption 3(i). Remark 4 (Assumption 3(ii) and the role of δ ) . However, the same rationale in Remark 3 may notbe directly applicable to TIPUP. Note that } Θ ˚ ,h } “ } Φ ˚p cano q ,h } “ ř i ,i ` ř r j “ φ p cano q i ,j,i ,j,h ˘ . Asthe discussion of Assumption 2(ii), the summation ř r j “ φ p cano q i ,j,i ,j,h is subject to signal cancellationfor h ą
0. To take into account this signal cancellation, we may assume ÿ i ,i ,h ` r ÿ j “ φ p cano q i ,j,i ,j,h ˘ — r ´ δ ÿ i ,i ,j,h ` φ p cano q i ,j,i ,j,h ˘ — r ´ δ ÿ i ,j ,i ,j ,h ` φ p cano q i ,j ,i ,j ,h ˘ by counting the number of entries and measuring signal cancellation by δ as a rate of r . Thiswould lead to } E mat p TIPUP k q} — r ´ δ ´ k } E mat p TOPUP k q} and then λ ˚ k ě c d ´ δ r ´ k r ´ δ ´ k with the flexibility of taking c ‰ c . The additional parameter δ in Assumption 3(ii) measuresthe severity of signal cancellation in the TIPUP related procedures. δ “ Remark 5 (Additional remarks on signal cancellation for TIPUP-related procedures) . Followingthe above discussion on δ for TIPUP-related procedures, it would be reasonable to take δ “ φ p cano q i ,j,i ,j,h are of the same sign for most of p i , i , h q , δ “ { φ p cano q i ,j,i ,j,h behavelike independent mean zero variables, and δ “ 8 when all the signals cancel out by summation φ p cano q i ,j,i ,j,h over j . When h “ ř j φ p cano q i ,j,i ,j,h is prone to canceling out. Increasing h often helpsto reduce the problem of signal cancellation. Of course, if the dynamics (e.g. autocovariance) ofthe factor process is relatively weak, then φ p cano q i ,j,i ,j,h would be small for large h . In this case, usinglarge h does not increase the signal level, but brings in more noise terms in the estimation. Inpractice, we may choose h ě K in iTIPUP to avoid the possibility of extremely serious level ofsignal cancellation. In the case of fixed r k , the convergence rate depends on whether δ “ 8 (severesignal cancellation) or δ being finite. Let us first study the behavior of iTOPUP procedure. By Chen et al. (2019b), the risk E “›› p U p q k p U p qJ k ´ U k U J k ›› S ‰ of the TOPUP estimator for U k , the initialization of iTOPUP, is no larger than R p q k “ λ ´ k σT ´ { !a d k d ´ k r ´ k } Θ ˚ k, } { ` `a d k ` a d ´ k r ˘ } Θ k, } { ` σ a d k d ´ k ` σd k a d ´ k T ´ { ) , (20)where d ´ k “ ś j ‰ k d j and r ´ k “ ś j ‰ k r j . The aim of iTOPUP is to achieve dimension reductionby projecting the data in other modes of the tensor time series from R d j to R r j , j ‰ k . Ideally thiswould reduce the above rate to R p ideal q k “ λ ´ k σT ´ { !a d k r ´ k } Θ ˚ k, } { ` `a d k ` ? r ´ k r ˘ } Θ k, } { ` σ a d k r ´ k ` σd k ? r ´ k T ´ { ) . (21)The following theorem provides conditions under which this ideal rate is indeed achieved.13 heorem 1. Suppose Assumption 1 holds. Let h ă T { and P k , Θ k, , Θ ˚ k, and λ k be as in (5) , (14) , (16) and (18) respectively. Let R p q “ max ď k ď K R p q k , with R p q k in (20) , Suppose that forcertain constants C and C p iter q , C R p q ď , (22) and with the R p ideal q k in (21) and d ˚´ k “ ř j ‰ k d j r j , C p iter q R p ideal q k ` C p iter q λ ´ k σT ´ { ´ } Θ ˚ k, } { ` } Θ k, } { ` σ ` σT ´ { p b d ˚´ k ` a d k r ´ k q ¯ b d ˚´ k ď . (23) Then, there exist a numerical constant C p TOPUP q and a constant C p iter q ,K depending on K only suchthat when C ě p ´ ρ q ´ C p TOPUP q and C p iter q ě C p iter q ,K { ρ , ă ρ ă , ›› p P p m q k ´ P k ›› S ď C p TOPUP q ´ p ´ ρ m qp ´ ρ q ´ R p ideal q ` p ρ m { q R p q ¯ (24) simultaneously for all ď k ď K and m ě in an event with probability at least ´ ř Kk “ e ´ d k ,where R p ideal q “ max ď k ď K R p ideal q k and p P p m q k “ p U p m q k p U p m qJ k with the m -step estimator p U p m q k iniTOPUP algorithm. In particular, after at most J “ t log p max k d ´ k { min k r ´ k q{ log p { ρ q u iterations, E „ max ď k ď K ›› p P p J q k ´ P k ›› S ď C p TOPUP q ´ ρ R p ideal q ` K ÿ k “ e ´ d k . (25)It follows from Theorem 1 of Chen et al. (2019b) that with high probability } p U p q k p p U p q k q J ´ P k } S is small when condition (22) holds with sufficiently large C , i.e. R p q k are sufficiently small, so thatthe initialization in Step 1 of iTOPUP retains a large portion of the signal. Condition (23) has twoterms on the left. The first term is the same as (22) with d ´ k replaced by r ´ k , representing theestimation error after dimension reduction with the true projection with ˆ j ‰ k U j , and the secondterm represents the additional error in the estimation of ˆ j ‰ k U j in the iteration. The essence of ouranalysis of iTOPUP is that under condition (23), each iteration is a contraction of the additionalerror in the estimation of ˆ j ‰ k U j in a small neighborhood of it. The upper bound (24) for the errorof the m -step estimator is comprised of two terms, which correspond to the error upper bound forthe final estimator and the bound for the contracted error of the initial estimator, respectively. Remark 6.
The consistency of the non-iterative TOPUP estimator requires R p q Ñ d k are of the same order, r k are of the same order, and r k À d K ´ _ K ´ K k . Remark 7.
Note that as λ k typically grows with d , . . . , d K under quite general weak conditions.Under these conditions, Theorem 1 allows T ď d k d ´ k for the consistency of the iTOPUP estimatorof the loading spaces. As shown below, in many cases, T is actually allowed to be a constant for K ą
1. This is in sharp contrast of the results of traditional factor analysis which requires T Ñ 8
14o consistently estimate the loading spaces. The main reason is that the other tensor modes provideadditional information and in certain sense serve as additional samples. Roughly speaking, we havetotal d k p d ´ k T q observations in the tensor time series to estimate the d k r k parameters in the loadingspace A k , where r k ! d ´ k T in most of the cases, depending on additional conditions on λ k and λ ˚ k .For a better illustration of the statistical properties of iTOPUP, we provide the following corol-lary as a simplified version of Theorem 1 for fixed ranks r k under some additional side conditions. Corollary 1.
Suppose Assumptions 1 and 2(i) hold. Let λ “ ś Kk “ } A k } S and d min “ min t d , ..., d K u .Let h ď T { and σ fixed. Then, there exist constants C ,K and C ,K depending on K only suchthat when λ ě C ,K σ ˆ dT ` d ? T d min ˙ , (26) the 1-step iTOPUP estimator satisfies E } p P p q k ´ P k } S ď C ,K ˆ σ ? d k λ ? T ` σ ? d k λ ? T ˙ ` K ÿ k “ e ´ d k . (27)Corollary 1 implies that, in order to recover the factor loading space for A k , the signal tonoise ratio needs to satisfy λ { σ ě C p d { T ´ { ` d { d ´ { T ´ { q . Under the assumption, thetheorem indicates that the proposed iTOPUP just needs one-iteration to achieve the ideal rate ofconvergence. The rate is much sharper than that of the non-iterative TOPUP procedure.When the dimensions of the core factor r k ( k “ , ..., K ) diverge as the dimensions of signaltensor d k ( k “ , ..., K ) grow to infinity, we characterize the convergence rate of iTOPUP in termsof d k , r k and T under Assumption 3(i). Corollary 2.
Suppose Assumptions 1 and 3(i) hold. Let h ď T { , d ˚´ k “ ř j ‰ k d j r j and r “ Π Kk “ r k . Suppose that for a sufficiently large C not depending on t σ, d k , r k , k ď K u , T ě C max k ˜ d δ ´ δ r { r k ` d δ r k d k ` d { k r { k d { ´ δ ` d ˚´ k r { d ` δ ´ δ ` d { k p d ˚´ k q { r { d ´ δ ¸ . (28) Then, after J “ O p log d q iterations, we have the following upper bounds for iTOPUP, } p P p J q k ´ P k } S “ O P ˜ d { k r { T { d { ` δ { ´ δ r k ` d { k rT { d ´ δ r { k ¸ . (29) Moreover, (29) holds after at most J “ O p log r q iterations, where r “ Π Kk “ r k , if any one of thefollowing three conditions holds in addition to (28) : (i) d k ( k “ , ..., K ) are of the same order, (ii) λ k ( k “ , ..., K ) are of the same order, (iii) p λ ˚ k q ´ ? d k ( k “ , ..., K ) are of the same order. Note that the second part of Corollary 2 says that when the condition is right, iTOPUP al-gorithm only needs a small number of iterations to converge, as O p log r q is typically very small.The noise level σ does not appear directly in the rate since it is incorporated in the signal to noiseratio in the tensor form in Assumption 3. In Corollary 2, we show that as long as the sample size T satisfies (28), the iTOPUP achieves consistent estimation under regularity conditions. To digest1528), consider that the growth rate of r k is much slower than d k and the factors are strong with δ “ δ “
0. Then (28) becomes T ě C max k p r { r ´ k q .The advantage of using index δ , δ is to link the convergence rates of the estimated factorloading space explicitly to the strength of factors. It is clear that the stronger the factors are, thefaster the convergence rate is. Moreover, the stronger the factors are, the smaller the sample sizeis required. Now, let us consider the statistical performance of iTIPUP procedure. Again, by Chen et al.(2019b) the TIPUP risk in the estimation of P k is bounded by E “›› p P p TIPUP q k ´ P k ›› S ‰ À R ˚p q k “ p λ ˚ k q ´ σT ´ { a d k ´ } Θ ˚ k, } { ` σ a d ´ k ¯ (30)with d ´ k “ ś j ‰ k d j , and the aim of iTIPUP is to achieve the ideal rate R ˚p ideal q k “ p λ ˚ k q ´ σT ´ { a d k ´ } Θ ˚ k, } { ` σ ? r ´ k ¯ (31)through dimension reduction, where r ´ k “ ś j ‰ k r j . The following theorem, which allows the ranks r k to grow to infinity as well as d k when T Ñ 8 , provides sufficient conditions to guarantee thisideal convergence rate for iTIPUP.
Theorem 2.
Suppose Assumption 1 holds. Let P k , Θ ˚ k, and λ ˚ k be as in (5) , (16) and (19) respectively. Let h ă T { , and R ˚p q “ max ď k ď K R ˚p q k ; R ˚p ideal q “ max ď k ď K R ˚p ideal q k . with R ˚p q k in (30) and R ˚p ideal q k in (31) . Suppose that for certain constants C and C p iter q , C R ˚p q ď min ď k ď K λ ˚ k {} Θ ˚ k, } S , (32) C p iter q R ˚p ideal q ` C p iter q max ď k ď K b d ˚´ k { d k R ˚p ideal q k ď , (33) where d ˚´ k “ ř j ‰ k d j r j , Then, there exist a numerical constant C p TIPUP q and a constant C p iter q ,K depending on K only such that when C ě p ´ ρ q ´ C p TIPUP q and C p iter q {p ` { C p iter q q ě C p iter q ,K { ρ , ă ρ ă , ›› p P p m q k ´ P k ›› S ď C p TIPUP q ´ p ´ ρ m qp ´ ρ q ´ R ˚p ideal q ` p ρ m { q R ˚p q ¯ (34) simultaneously for all ď k ď K and m ě in an event with probability at least ´ ř Kk “ e ´ d k ,where p P p m q k “ p U p m q k p U p m qJ k with the m -step estimator p U p m q k in iTIPUP algorithm, i.e. In particular,after at most J “ t log p max k d ´ k { min k r ´ k q{ log p { ρ q u iterations, E „ max ď k ď K ›› p P p J q k ´ P k ›› S ď C p TIPUP q ´ ρ R ˚p ideal q ` K ÿ k “ e ´ d k . (35)16y (16), (19) and the Cauchy-Schwarz inequality, p ´ h { T q λ ˚ k ď } Θ ˚ , } S , so that (32) guaran-tees a sufficiently small R ˚p q , which implies a sufficiently small error in the initialization of iTIPUPby (30). Condition (33) again has two terms respectively reflecting the ideal rate after dimensionreduction by the true U ´ k “ d j ‰ k U j in the estimation of U k and the extra cost of estimating U ´ k .The upper bound (34) for the error of the m -step estimator is also comprised of two terms, thebound for the final estimator and the bound for the contracted error of the initial estimator. Remark 8.
Again, (32) and (33) do not demand a specific relationship between T and the dimen-sions, as λ ˚ k are allowed to depend on the dimensions and ranks, as discussed in Remark 7.We present more explicit error bound in Corollary 3 under the additional Assumption 2(ii),which assumes there is no severe signal cancellation in iTIPUP. Corollary 3.
Suppose Assumptions 1 and 2(ii) hold. Let λ “ ś Kk “ } A k } S and d max “ max t d , ..., d K u .Let h ď T { and σ fixed. Then, there exist constants C ,K and C ,K depending on K only suchthat when λ ě C ,K σ ˜ d max T ` c dT ¸ , (36) the 1-step iTIPUP estimator satisfies E } p P p q k ´ P k } S ď C ,K ˆ σ ? d k λ ? T ` σ ? d k λ ? T ˙ ` K ÿ k “ e ´ d k . (37)The signal to noise ratio assumption λ { σ ě C ,K p a d max { T ` p d { T q { q is required here toguarantee the performance of iTIPUP algorithm. Again, under the conditions, iTIPUP requiresone iteration to achieve the much sharper ideal convergence rate, comparing to that of the non-iterative TIPUP algorithm.When the ranks r k ( k “ , ..., K ) also diverge and there is no severe signal cancellation iniTIPUP, we have the following convergence rate for iTIPUP under Assumption 3(ii). Corollary 4.
Suppose Assumptions 1 and 3(ii) hold. Let h ď T { and d ˚´ k “ ř j ‰ k d j r j . Supposethat for a sufficiently large C not depending on t σ, d k , r k , k ď K u , T ě max k ˜ d k r { ` δ d ` δ ´ δ r δ ` k ` r ` δ d ` δ ´ δ r δ k ` d ˚´ k r { ` δ d ` δ ´ δ r δ k ` d ˚´ k r ` δ d ´ δ r δ k ¸ . (38) Then, after at most J “ O p log d q iterations, the iTIPUP estimator satisfies } p P p J q k ´ P k } S “ O P ˜ d { k r { ` δ T { d { ` δ { ´ δ r δ k ` d { k r { ` δ T { d ´ δ r δ k ¸ . (39) Moreover, (39) holds after at most J “ O p log r q iterations, if any one of the following three con-ditions holds in addition to condition (38) , (i) d k ( k “ , ..., K ) are of the same order, (ii) λ k ( k “ , ..., K ) are of the same order, (iii) p λ ˚ k q ´ ? d k ( k “ , ..., K ) are of the same order. r k is much slower than d k and the factors are strong with δ “ δ “
0, the required samplesize reduces to T ě C max k ´ r δ ` ´ k r { d ´ ´ k ` r δ ´ k d ˚´ k r { d ´ ¯ , where r ´ k “ r { r k and d ´ k “ d { d k .By comparing with Corollary 2, the sample complexity for iTIPUP is smaller, if δ is a smallconstant. As expected, the convergence rate is slower in the presence of weak factors. When thefactors are strong ( δ “ δ “
0) and all r k and δ are fixed finite constants, the rate becomes O P p T ´ { d ´ { ´ k q , the same as that for iTOPUP. As discussed in Section 2.3, we can mixed TOPUP and TIPUP operations for the initiation anditerative operations in Algorithm 1. Based on Theorems 1 and 2, for the mixed TIPUP-iTOPUPprocedure, (24) (with R p q replaced by R ˚p q ) and (25) still hold if condition (22) is replaced by C R ˚p q ď , (40)where R ˚p q “ max ď k ď K R ˚p q k for R ˚p q k in (30). On the other hand, for TIPUP-iTOPUP, con-clusions (34) (with R ˚p q replaced by R p q ) and (35) still hold when condition (32) is replacedby C R p q ď min ď k ď K λ ˚ k {} Θ ˚ , } S , (41)where R p q “ max ď k ď K R p q k with R p q k in (20). Comparison between the non-iterative procedures and iterative procedures
Theorems 1 and 2 show that the convergence rates of the non-iterative estimators TOPUP andTIPUP can be improved by the iterative procedure. Particularly, when the dimensions r k for thefactor process are fixed and the respective signal strength conditions are fulfilled, the proposediTOPUP and iTIPUP just need one-iteration to achieve the much sharper ideal rate R p ideal q in (21)and R ˚p ideal q (31), comparing to the rate (20) of TOPUP and (30) of TIPUP derived in Chen et al.(2019b), respectively. The improvement is achieved, as motivated shown in Section 1, throughreplacing the much larger d ´ k by r ´ k , via orthogonal projection. When the factors are strongwith δ “ δ “ O P p T ´ { q convergence rate forestimating the loading space. In comparison, the convergence rate O P p T ´ { d ´ { ´ k q of both iterativeestimators, iTOPUP and iTIPUP (when there is no severe signal cancellation, with bounded δ ),is much sharper. Intuitively, when the signal is strong, the orthogonal projection operation helpsto consolidate signals while potentially averaging out the noises, when the projection reduces themode- k unfolded matrix from a d k ˆ d ´ k matrix for the tensor X t into a d k ˆ r ´ k matrix for theprojected tensor Z t as the motivation we provided in Section 1. As a result, the increase in thedimension d ´ k improves the iterative estimator.When r k are allowed to diverge, the iTOTUP and iTIPUP algorithms converge after at most O p log p d qq iterations to achieve the ideal rate according to Theorems 1 and 2. The number ofiterations needed can be as few as O p log p r qq when the condition is right. The power iterations are18ecessary in order to refine the reliable initial estimates to achieve sharper guaranteed convergencerate. Moreover, stronger signal λ will result in faster exponential contraction with smaller ρ in (24)and (34), in other words, converging more quickly to the final estimators. Comparison between iTIPUP and iTOPUP:
The inner product operation in (11) for TIPUP-related procedures enjoys significant amount of noise cancellation comparing to the outer productoperation in (7) for TOPUP-related procedures. Compared with iTOPUP, the benefit of noisecancellation of the iTIPUP procedure is still visible through the reduction of r ´ k in (21) to ? r ´ k in(31) in the ideal rates. However, this benefit is much less pronounced compared with the reductionof d ´ k in (20) for TOPUP to a d ´ k in (30) for TIPUP in the non-iterative rates. Meanwhile, thepotential damage of signal cancellation in the TIPUP related schemes persists as λ ˚ k and λ k areunchanged between the initial and ideal rates. When r ´ k are allowed to diverge to infinity, and δ ą
1, then r δ ´ ´ k Ñ 8 . In this case, the iTOPUP has a faster rate than the iTIPUP in view ofCorollary 2 and 4. Otherwise, if r δ ´ ´ k Ñ
0, the rate of iTIPUP is faster.If all the assumptions in Corollary 1 and 3 are satisfied, iTOPUP and iTIPUP will have exactlythe same convergence rate. There are two considerations. First is the quality of the initial estimator.In theory, as long as the initial estimates are reasonably good, both iTOPUP and iTIPUP achieveconsistent estimation. In practice, one would prefer to use TIPUP for initialization when there is nosevere signal cancellation. The second is the signal strength. In general iTIPUP has much weakersignal strength λ ˚ k than iTOPUP λ k , depending on the severity of signal cancellation, though it alsorequires smaller signal to noise ratio. Under extremely serious level of signal cancellation ( δ “ 8 ),the convergence rate of iTIPUP will be much slower.In practice, when the rank r k is small, iTOPUP procedures are recommended, as it has thesame convergence rate as the iTIPUP, but safeguards the signal cancellation cases. If there is aconcern of possible signal cancellation, initiation should also use TOPUP, though TIPUP typicallyprovides more accurate initial estimators when there is no signal cancellation. Comparison of signal to noise ratio requirement with that of HOOI:
The signal tonoise ratio condition mainly depends on the existence of an initial estimator with sufficiently smallestimation error. In the typical factor model setting, we have condition number of A J k A k boundedand ranks r k fixed. By comparing (26) in Corollary 1 with (36) in Corollary 3, iTIPUP is able tohandle tensor factor model estimation under much weaker signal to noise ratio. In the traditionaltensor power iteration method (e.g., Zhang and Xia (2018)), the observations are assumed tobe composed by deterministic signal and i.i.d. noise. Then, averaging the observations beforeorthogonal projection, HOOI requires signal to noise ratio λ { σ ě C d { { T . However, in our tensorfactor models, signal part is random and serial correlated. The time mode should be viewed as anadditional mode of the signal part. Working with the average of cross-products of the observationsleads to T ´ { in the rate of iTOPUP and iTIPUP, instead of T ´ in HOOI. Then, this phenomenonresults in λ { σ ě C ,K d { { T { in (36). Thus, if there is no severe signal cancellation, iTIPUP andTIPIP-iTOPUP have comparable signal to noise ratio condition with HOOI. In this section, we compare the empirical performance of different procedures of estimating theloading matrices of a tensor factor model, under various simulation setups. Specifically, we considerthe following procedures: the non-iterative and iterative methods, and the intermediate output from19he iterative procedures when the number of iteration is 1 after initialization. If TIPUP is usedas UINIT and UITER, the one step procedure will be denoted as 1TIPUP. Simlar for 1UP and1TOPUP. We consider the following combinations of UINIT and UITER. • UP based: (i) UP, (ii) 1UP and (iii) iUP • TIPUP based: (iv) TIPUP, (v) 1TIPUP and (vi) iTIPUP • TOPUP based: (vii) TOPUP, (viii) 1TOPUP and (ix) iTOPUP • mixed iterative: (x) TIPUP-1TOPUP, and (xi) TIPUP-iTOPUP • mixed iterative: (xii) TOPUP-1TIPUP, and (xiii) TOPUP-iTIPUPOur empirical results show that (xii) and (xiii) are always inferior than (v) and (vi) respectively.Hence results used (xii) and (xiii) are not shown here.We demonstrate the performance of all procedures under the setting of a matrix factor model, X t “ A F t A J ` E t “ λU F t U J ` E t . (42)Here, E t is white noise with no autocorrelation, E t K E t ` h , h ą
0, and generated according to E t “ Ψ { Z t Ψ { , where Ψ , Ψ are the column and row covariance matrices with the diagonalelements being 1 and all off diagonal elements being 0 .
2. All of the elements in the d ˆ d matrix Z t are i.i.d N p , q . The elements of the loading matrices U (of size d ˆ r ) and U (of size d ˆ r )are first generated from i.i.d. N(0,1),and then orthonormalized through QR decomposition. We setdifferent λ values for different signal-to-noise ratio. All of the entries f ijt in the factor matrix F t follow independent univariate AR(1) model f ijt “ φ ij f ij p t ´ q ` (cid:15) ijt with standard N p , q innovation.We fix dimensions d “ d “
16 here and consider different choices of sample size T . We use theestimation error for A as criterion: } ˆ P ´ P } S .We considered three experimental configurations:I. We set r “ r “ f t follows AR(1) with AR coefficient φ “ . T and signal strength λ ;II. We fix the signal strength and consider the rank setup r “ r “
2, in which case F t is a 1 ˆ f it follow AR(1) f it “ φ i f i p t ´ q ` (cid:15) it independently with two AR coefficients φ “ . φ “ . φ is negative instead, φ “ ´ .
8. We repeatall the experiments 100 times.Settings I and II satisfy Assumption 2 since the rank r and r are fixed and the factor processis stationary. The λ in (42) is λ “ ś k “ } A k } S in Corollaries 1 and 3. The signal to noise ratio is λ { σ “ λ . Setting III satisfy Assumption 2(i). When h “ h “ h “ r “ T “
256 and λ “ T “
256 and λ “ T “
256 case. The top twopanels show that when signal is weak and the sample size is small, the initial estimates may bepoor, and the iterative methods may need certain accuracy in the initial estimates to producefurther improvement. This reemphasizes the condition on the initial estimate in the theorems. Thebottom two panels show that when signal is stronger, the relatively more accurate initial estimatesenable the iterative methods to improve the estimates. Again, TOPUP initial estimates are not asaccurate as the TIPUP estimates.Under Setting II, Figure 3 shows the boxplot of the logarithm of the estimation errors of8 methods including (iv)-(ix) and mixed (x)-(xi) with TIPUP initiation and TOPUP iteration.Again, the performance of the mixed (xii)-(xiii) procedures with iTIPUP iteration is not as good asthat of iTIPUP hence not shown. Here we use different sample sizes, with the signal strength fixedat λ “ h values: h “ h “
2. The theoretical λ defined in (18) and λ ˚ in (19)under the stationary auto-cross-moments of the factor process are given in the figure. Note thatthey are different for different h . It shows that the mixed TIPUP-1TOPUP method can slightlyimprove 1TOPUP because of the better initialization. With larger sample size T “ h “ h “
1, as the lag-2 autocorrelation is significantly smaller than that oflag 1 for the underlying AR(1) process with φ “ .
6. The extra term adds limited signal, shownby the small differences in λ and λ ˚ , but incorporates extra noise terms in the estimators. To seemore clearly the impact of h , we show the boxplots of the estimated λ ˚ and λ using iTIPUP andiTOPUP, respectively, for h “ , h increases in this no-signal cancellationcase. 21 ll ll llllll llllll l ll ll TIPUP 1TIPUP iTIPUP TOPUP 1TOPUP iTOPUP UP 1UP iUP − . − . − . − . . T=256, lambda=1 l llll l llll lll
TIPUP 1TIPUP iTIPUP TOPUP 1TOPUP iTOPUP UP 1UP iUP − . − . − . − . − . . T=1024, lambda=1 l lllllllll lllll l
TIPUP 1TIPUP iTIPUP TOPUP 1TOPUP iTOPUP UP 1UP iUP − . − . − . − . − . − . . T=256, lambda=2 l l lll l
TIPUP 1TIPUP iTIPUP TOPUP 1TOPUP iTOPUP UP 1UP iUP − − − T=1024, lambda=2 l lll lll lllllllllllllllllll
TIPUP 1TIPUP iTIPUP TOPUP 1TOPUP iTOPUP UP 1UP iUP − − − − T=256, lambda=4 llll ll ll ll ll ll lll
TIPUP 1TIPUP iTIPUP TOPUP 1TOPUP iTOPUP UP 1UP iUP − − − − − T=1024, lambda=4
Figure 1: Boxplot of the logarithm of the estimation error of A under Setting I. 9 methods areconsidered in total. Three rows correspond to three signal-to-noise strengths λ “ , ,
4. Twocolumns correspond to two sample sizes T “ , . − . − . − . . TIPUP−based methods: lambda=1
IndexTIPUP 1TIPUP iTIPUP − . − . − . − . . TOPUP−based methods: lambda=1
Index : TOPUP 1TOPUP iTOPUP − . − . − . . TIPUP−based methods: lambda=2
TIPUP 1TIPUP iTIPUP − . − . − . . TOPUP−based methods: lambda=2 : TOPUP 1TOPUP iTOPUP
Figure 2: Trajectory of the logarithm of the estimation error of A under Setting I with fixedsample size T “ λ “ ,
2. Two columnscorrespond to TIPUP-based and TOPUP-based methods respectively.23 ll llllll llll l llllllllllllllll llllllllllllllll l ll l lll ll ll ll lll lllllllllll llllllll lllllllllllll llllllllllllllll l lllll ll ll l lll ll h =1, l =2.41, ( l ) =2.23 h =2, l =2.16, ( l ) =1.97 T = T = T I P U P T I P U P i T I P U P T O P U P T O P U P i T O P U P T I P U P − T O P U P T I P U P − i T O P U P T I P U P T I P U P i T I P U P T O P U P T O P U P i T O P U P T I P U P − T O P U P T I P U P − i T O P U P −3−2−10−3−2−10 l og r i sk f o r e s t i m a t i ng c o l u m n s pa c e o f A f =0.6 Figure 3: Boxplot of the logarithm of the estimation error of A under Setting II. 8 methodsare considered in total. Two rows correspond to two sample sizes T “ , h . The population signal strengths λ (18) and λ ˚ (19) for different h are provided on the top. 24 l llll ll ll ll ll l ll T=512 T=1024iTIPUP iTOPUP iTIPUP iTOPUP123 ( l * ) and l h0 h0=1h0=2h0=3 f =0.6 Figure 4: Boxplot of the sample estimates of the signal strengths λ (18) and λ ˚ (19) over 100replications for iTIPUP and iTOPUP with three choices of h under Setting II. Two panels cor-respond to two sample sizes T “ , φ “ . φ “ ´ .
8, we can readily check that E p F t F J t ´ q “p φ ` φ q σ “
0. Therefore, in the TIPUP-related procedure for estimating A with h “
1, thesignal completely cancels out. Since the ranks r and r are fixed, we have δ “ 8 for h “
1, andthe corresponding λ ˚ “
0. Figure 5 shows the boxplot of the logarithm of the estimation error of A for 8 methods including (iv)-(ix) and mixed (x)-(xi) with two choices of h “ h “ λ “ h . When h “
1, both initializationTIPUP and TOPUP do not perform well. But 1TOPUP and iTOPUP improve the performance ofTOPUP significantly with TOPUP iteration while 1TIPUP and iTIPUP cannot improve TIPUP.This is because signal cancellation has significant impact on TIPUP based procedures while havingno impact on TOPUP based procedures. To our pleasant surprise, when h “
1, the mixed TIPUP-1TOPUP is better than both 1TIPUP and 1TOPUP, and the mixed TIPUP-iTOPUP is similarto iTOPUP and much better than iTIPUP. When using h “
2, the noise cancellation is mild and p λ ˚ q “ .
78. Since r k are fixed, we have δ ă 8 . Note that in this case the signal using TIPUPonly comes from lag-2 cross product and is weaker than that using TOPUP related procedures. Thedifference does not have impact on the convergence rate, but on the signal to noise ratio. Comparingthe left two subfigures with the right ones of Figure 5, it is seen that using h “ h . When h “
2, the non-iterative TIPUP performs better thanTOPUP, 1TIPUP performs better than 1TOPUP, but after convergence, iTOPUP performs betterthan iTIPUP. Because the initialization TIPUP is better than TOPUP for h “
2, it is of no surpriseto see that TIPUP-1TOPUP behaves better than 1TIPUP and 1TOPUP, and TIPUP-iTOPUP issimilar as iTOPUP and slightly better than iTIPUP.25 lllllllll l lll lllllllllllll lllllllll llllllllllll lllll ll l l lll llll lllllll lll lllllllll llllll lllllll l l l l l h =1, l =3.14, ( l ) =0 h =2, l =2.85, ( l ) =1.78 T = T = T I P U P T I P U P i T I P U P T O P U P T O P U P i T O P U P T I P U P − T O P U P T I P U P − i T O P U P T I P U P T I P U P i T I P U P T O P U P T O P U P i T O P U P T I P U P − T O P U P T I P U P − i T O P U P −3−2−10−3−2−10 l og r i sk f o r e s t i m a t i ng c o l u m n s pa c e o f A f =−0.8 Figure 5: Boxplot of the logarithm of the estimation error of A under Setting III. 8 methodsare considered in total. Two rows correspond to two sample sizes T “ , h . The population signal strengths λ (18) and λ ˚ (19) for different h are provided on the top. 26 l ll l llllll l ll llll l l l T=512 T=1024iTIPUP iTOPUP iTIPUP iTOPUP01234 ( l * ) and l h0 h0=1h0=2h0=3 f =−0.8 Figure 6: Boxplot of the sample estimates of the signal strengths λ (18) and λ ˚ (19) over 100replications for iTIPUP and iTOPUP with three choices of h under Setting III. Two panels cor-respond to two sample sizes T “ , h in this case with noise cancellation, we show theboxplots of the estimated λ ˚ and λ using iTIPUP and iTOPUP, respectively, for h “ , λ under the noise-cancellation case. And λ decreases as h increases. However, iTIPUP is very different. Althoughwhen using h “ λ ˚ significantly overestimates the theoretical value λ ˚ “
0, theyare still much less than those from using h “ λ ˚ as h increases can be potentially used to detect signal cancellation in practice, though thetheoretical property of the estimators of λ ˚ (e.g. standard deviation) is technically challenging toobtain. In practice, when one observes such a reversed order, it is recommended to use iTOPUP asa conservative estimator. Of course, the behaviors of λ k and λ ˚ k depend on the auto-cross-momentstructure of the underlying factor process. For example, if the factor process follows a MA(2) modelwith zero lag-1 autocorrelation ( f t “ e t ` θ e t ´ ), then λ and λ ˚ under h “ h “ h “
3. But we expect that the pattern of λ under different h would be similar to that of λ ˚ under different h , if there is no severe signal cancellation. Severesignal cancellation would make the patterns different. In this paper we propose new estimation procedures for tensor factor model via iterative projection,and focus on two procedures: iTOPUP and iTIPUP. Theoretical analysis shows the asymptoticproperties of the estimators. Simulation study is used to illustrate the finite sample properties of27he estimators. While theoretical results are obtained under very general conditions, specific casesare considered. Specifically, under the typical factor model setting where the condition numbers of A J k A k are bounded and the ranks r k are fixed, the proposed iterative procedures, iTOPUP methodand iTIPUP method (with no severe signal cancellation) lead to a convergence rate O P pp T d ´ k q ´ { q under strong factors settings due to information pooling of the orthogonal projection of the other d ´ k dimensions. This rate is much sharper than the classical rate O P p T ´ { q in non-iterativeestimators for vector, matrix and tensor factor models. It implies that the accuracy can be improvedby increasing the dimensions, and consistent estimation of the loading spaces can be achieved evenwith a fixed finite sample size T , different from the requirement of traditional factor model analysis.The proposed iterative estimation methods not only preserve the tensor structure, but also resultin sharper convergence rate in the estimation of factor loading space.The iterative procedure requires two operators, one for initialization and one for iteration. Undercertain conditions of the signal to noise ratio, we only need the initial estimator to have sufficientlysmall estimation errors but not the consistency of the initial estimator. Often, one iteration issufficient. Under more complicated cases, at most O p log p d qq iterations are needed to achieve idealrate of convergence. Based on the theoretical results and empirical evidence, we suggest to useiTOPUP for iteration when the ranks r k are small. In terms of initiation, we suggest to use TIPUPif no danger of signal cancellation and TOPUP otherwise. Using a slightly large h often solvethe noise cancellation problem, as the empirical results show. By examination of the patterns ofestimated singular values under different lag values h , using iTOPUP and iTIPUP, it is possibleto detect noise cancellation, which has significant impact on iTIPUP estimators.The proposed iterative procedure is similar to HOOI algorithms in spirit, but the detailedoperations and the theoretical challenges are significantly different. References
Orly Alter and Gene H Golub. Reconstructing the pathways of a cellular system from genome-scale signals by using matrix and tensor computations.
Proceedings of the National Academy ofSciences , 102(49):17559–17564, 2005. 1Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach tolearning mixed membership community models.
The Journal of Machine Learning Research , 15(1):2239–2312, 2014. 1Jushan Bai. Inferential theory for factor models of large dimensions.
Econometrica , 71(1):135–171,2003. 1, 11, 12Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models.
Econometrica , 70(1):191–221, 2002. 1Xuan Bi, Annie Qu, and Xiaotong Shen. Multilayer tensor factorization with applications torecommender systems.
The Annals of Statistics , 46(6B):3308–3333, 2018. 1Gary Chamberlain and Michael Rothschild. Arbitrage, factor structure, and mean-variance analysison large asset markets.
Econometrica , 51(5):1281–1304, 1983. 1Elynn Y Chen and Rong Chen. Modeling dynamic transport network with matrix factor models:with an application to international trade flow. arXiv preprint arXiv:1901.00769 , 2019. 228lynn Y Chen, Ruey S Tsay, and Rong Chen. Constrained factor models for high-dimensionalmatrix-variate time series.
Journal of the American Statistical Association , pages 1–37, 2019a. 2Elynn Y Chen, Jianqing Fan, and Ellen Li. Statistical inference for high-dimensional matrix-variatefactor model. arXiv preprint arXiv:2001.01890 , 2020. 2Rong Chen, Dan Yang, and Cun-hui Zhang. Factor models for high-dimensional tensor time series. arXiv preprint arXiv:1905.07530 , 2019b. 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 16, 18, 31, 34, 36, 39,43Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-(r 1, r 2,...,rn) approximation of higher-order tensors.
SIAM journal on Matrix Analysis and Applications ,21(4):1324–1342, 2000. 3Jianqing Fan, Yuan Liao, and Martina Mincheva. High dimensional covariance matrix estimationin approximate factor models.
Annals of statistics , 39(6):3320, 2011. 2Jianqing Fan, Yuan Liao, and Martina Mincheva. Large covariance estimation by thresholdingprincipal orthogonal complements.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 75(4):603–680, 2013. 2Jianqing Fan, Yuan Liao, and Weichen Wang. Projected principal component analysis in factormodels.
Ann. Statist. , 44(1):219–254, 2016. 2Jianqing Fan, Han Liu, and Weichen Wang. Large covariance estimation through elliptical factormodels.
Annals of statistics , 46(4):1383, 2018. 2Mario Forni, Marc Hallin, Marco Lippi, and Lucrezia Reichlin. The generalized dynamic-factormodel: Identification and estimation.
The Review of Economics and Statistics , 82(4):540–554,2000. 2Clifford Lam and Qiwei Yao. Factor modeling for high-dimensional time series: inference for thenumber of factors.
The Annals of Statistics , 40(2):694–726, 2012. 2, 12Clifford Lam, Qiwei Yao, and Neil Bathia. Estimation of latent factors for high-dimensional timeseries.
Biometrika , 98(4):901–918, 2011. 2, 4, 18Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimatingmissing values in visual data.
IEEE transactions on pattern analysis and machine intelligence ,35(1):208–220, 2012. 1Yuanyuan Liu, Fanhua Shang, Wei Fan, James Cheng, and Hong Cheng. Generalized higher-orderorthogonal iteration for tensor decomposition and completion. In
Advances in Neural InformationProcessing Systems , pages 1763–1771, 2014. 3Larsson Omberg, Gene H Golub, and Orly Alter. A tensor higher-order singular value decompositionfor integrative analysis of dna microarray data from different studies.
Proceedings of the NationalAcademy of Sciences , 104(47):18371–18376, 2007. 1Jiazhu Pan and Qiwei Yao. Modelling multiple time series via common factors.
Biometrika , 95(2):365–379, 2008. 2 29ernard N Sheehan and Yousef Saad. Higher order orthogonal iteration of tensors (hooi) and itsrelation to pca and glram. In
Proceedings of the 2007 SIAM International Conference on DataMining , pages 355–365. SIAM, 2007. 3James H. Stock and Mark W. Watson. Forecasting using principal components from a large numberof predictors.
J. Amer. Statist. Assoc. , 97(460):1167–1179, 2002. 1, 11, 12Will Wei Sun and Lexin Li. Store: sparse tensor response regression and neuroimaging analysis.
The Journal of Machine Learning Research , 18(1):4908–4944, 2017. 1George C Tiao and Ruey S Tsay. Model specification in multivariate time series.
Journal of theRoyal Statistical Society: Series B (Methodological) , 51(2):157–195, 1989. 6Ledyard R Tucker. Some mathematical notes on three-mode factor analysis.
Psychometrika , 31(3):279–311, 1966. 5Dong Wang, Xialu Liu, and Rong Chen. Factor models for matrix-valued high-dimensional timeseries.
Journal of econometrics , 208(1):231–248, 2019. 2, 4, 7, 18Per-˚Ake Wedin. Perturbation bounds in connection with singular value decomposition.
BIT Nu-merical Mathematics , 12(1):99–111, 1972. 31, 37Anru Zhang and Dong Xia. Tensor svd: Statistical and computational limits.
IEEE Transactionson Information Theory , 64(11):7311–7338, 2018. 3, 19, 42Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging dataanalysis.
Journal of the American Statistical Association , 108(502):540–552, 2013. 130
Appendix
We focus on the case of K “ k matrix unfolding. In particular,we sometimes give explicit expressions only in the case of k “ K “
2. For K “
2, weobserve a matrix time series with X t “ A F t A J ` E t P R d ˆ d . Under the conditional expectation E , F , ...F T are fixed. Let U , U be the left singular matrices of A and A respectively with r k “ rank p U k q “ rank p A k q .We outline the proof as follows. Let L p m q k be the loss (4) for p U p m q k or equivalently the spectralnorm error for p P p m q k “ p U p m q k p U p m qJ k , k “ , ..., K , and L p m q their maximum, L p m q k “ } p P p m q k ´ P k } S , L p m q “ max k “ , ,...,K L p m q k . (43)From Chen et al. (2019b), E “ L p q k ‰ À R ˚p q k as we mentioned in (30). By applying the Gaussianconcentration inequality for Lipschitz functions and Lemme 2 in their analysis, we have L p q ď C p TIPUP q R ˚p q with R ˚p q “ max ď k ď K R ˚p q k (44)in an event Ω with P p Ω q ě ´ ´ ř Kk “ e ´ d k . This is similar to (51) below.After the initialization with p U p q k , the algorithm iteratively produces estimates p U p m q k from m “ m “ J . For any r U P R d ˆ r , let V ˚ ,h p r U q “ T ÿ t “ h ` X t ´ h r U r U J X J t T ´ h P R d ˆ d , TIPUP p r U q “ mat ` V ˚ , h p r U q ˘ P R d ˆp d h q , as matrix-valued functions of matrix-valued variable r U , where V ˚ , h p r U q “ ` V ˚ ,h p r U q , h “ , ..., h ˘ P R d ˆ d ˆ h . Given p U p m q , the p m ` q -th iteration produces estimates p U p m ` q “ LSVD r ` TIPUP p p U p m q q ˘ , p P p m ` q “ p U p m ` q p U p m ` qJ . The “noiseless” version of this update is given byΘ ˚ ,h p r U q “ T ÿ t “ h ` A F t ´ h A J r U r U J A F J t A J T ´ h , E r TIPUP sp r U q “ mat ` Θ ˚ , h p r U q ˘ (45)with Θ ˚ , h p r U q “ ` Θ ˚ ,h p r U q , h “ , ..., h ˘ as in (17), giving error free “estimates”, U “ LSVD r ` E r TIPUP sp p U p m q q ˘ , P “ U U J , when E r TIPUP sp p U p m q q is of rank r . Thus, by Wedin’s theorem (Wedin (1972)), L p m ` q “ ›› p P p m ` q ´ P ›› S ď } TIPUP p p U p m q q ´ E r TIPUP sp p U p m q q} S σ r p E r TIPUP sp p U p m q qq . (46)31e note that E r TIPUP sp r U q is treated as a matrix-valued function of matrix-valued variable r U and p U p m q is plugged-in after the conditional expectation in E r TIPUP sp p U p m q q . For general 1 ď k ď K , we define TIPUP k p r U ´ k q and E r TIPUP k sp r U ´ k q as matrix-valued functions of r U ´ k “ d j ‰ k r U j .To bound the numerator on the right-hand side of (46), we writeTIPUP p r U q ´ E r TIPUP sp r U q “ ÿ j “ ´ ∆ ˚ j, ,h ` r U r U J ˘ , h “ , . . . , h ¯ P R d ˆp d h q (47)as both TIPUP p r U q and E r TIPUP sp r U q are linear in r U r U J , where for any d ˆ d matrix Ă M ∆ ˚ p Ă M q “ ∆ ˚ , ,h p Ă M q “ ř Tt “ h ` A F t ´ h A J Ă M E J t {p T ´ h q , ∆ ˚ p Ă M q “ ∆ ˚ , ,h p Ă M q “ ř Tt “ h ` E t ´ h Ă M A F J t A J {p T ´ h q , ∆ ˚ p Ă M q “ ∆ ˚ , ,h p Ă M q “ ř Tt “ h ` E t ´ h Ă M E J t {p T ´ h q . As ∆ ˚ j, ,h p Ă M q is linear in Ă M , the numerator on the right-hand of (46) can be bounded by } TIPUP p p U p m q q ´ E r TIPUP sp p U p m q q} S (48) ď } TIPUP p U q ´ E r TIPUP sp U q} S ` L p m q p K ´ q ÿ j “ h { max h ď h } ∆ ˚ j, ,h } ,S,S with an application of Cauchy-Schwarz inequality for the sum over h “ , . . . , h , where } ∆ ˚ j, ,h } ,S,S are norms of the R d ˆ d Ñ R d ˆ d linear mappings ∆ ˚ j, ,h defined as } ∆ ˚ j } ,S,S “ } ∆ ˚ j, ,h } ,S,S “ max } Ă M } S ď , rank p Ă M qď r ›› ∆ ˚ j, ,h p Ă M q ›› S . For general 1 ď k ď K , ∆ ˚ j,k,h is an R d ´ k ˆ d ´ k to R d k ˆ d k mapping. Because it applies to Ă M ´ k “d (cid:96) ‰ k ` p U p m q (cid:96) p U p m qJ (cid:96) ˘ ´ d (cid:96) ‰ k ` U U J ˘ , (70) of Lemma 1 (iii) gives the general version of (48) with ›› ∆ ˚ j,k,h ›› k,S,S “ max } Ă M j } S ď , rank p Ă M j qď r j , @ j ‰ k ›› ∆ ˚ j,k,h ` d j ‰ k Ă M j ˘›› S . We claim that in certain events Ω j , j “ , ,
3, with P p Ω j q ě ´ ´ ř Kk “ e ´ d k , ›› ∆ ˚ j,k,h ›› k,S,S ď ρλ ˚ k L` p K ´ q ˘ , @ ď k ď K. (49)For simplicity, we will only prove this inequality for k “ K “ P } ∆ ˚ j } ,S,S “ max } Ă M } S ď , rank p Ă M qď r ›› ∆ ˚ j, ,h p Ă M q ›› S ě ρλ ˚ L + ď ´ e ´ d (50)as the proof of its counter part for general 1 ď k ď K is similar.Define the ideal version of the ratio in (46) for general 1 ď k ď K as L p ideal q k “ } TIPUP k p U ´ k q ´ E r TIPUP k sp U ´ k q} S σ r k p E r TIPUP k sp U ´ k qq , L p ideal q “ max ď k ď K L p ideal q k . U ´ k “ d j ‰ k U j is true and deterministic, the proof of (44) also implies L p ideal q ď C p TIPUP q R ˚p ideal q with R ˚p ideal q “ max ď k ď K R ˚p ideal q k (51)in an event Ω with P p Ω q ď ´ ř Kk “ e ´ d k , where R ˚p ideal q k is as in (31).We aim to prove that in the event X j “ Ω j L p m ` q k ď L p ideal q k ` p K ´ q L p m q max j,k,h ›› ∆ ˚ j,k,h ›› k,S,S L λ ˚ k ď L p ideal q k ` ρL p m q , (52)which would imply by induction L p m ` q ď p ` . . . ` ρ m q L p ideal q k ` ρ m ` L p q , (53)and then the conclusions would follow from (44) and (51). To this end, we notice that by (43),(46), (48) and (49), (52) holds provided that σ r k p E r TIPUP k sp p U p m q´ k qq , the numerator in (46), is nosmaller than a half of its ideal version as in (19), e.g.2 σ r p E r TIPUP sp p U p m q qq ě σ r p E r TIPUP sp U qq “ h { λ ˚ , (54)in the case of k “ K “
2. We divide the rest of the proof into 4 steps to prove (50) for j “ , , Step 1.
We prove (50) for the ∆ ˚ p Ă M q in (47). By Lemma 1 (ii), there exist Ă M p (cid:96) q P R d ˆ d ofthe form W (cid:96) Q J (cid:96) with W (cid:96) P R d ˆ r , Q (cid:96) P R d ˆ r , 1 ď (cid:96), (cid:96) ď N d r , { “ d r , such that } Ă M p (cid:96) q } S ď
1, rank p Ă M p (cid:96) q q ď r and } ∆ ˚ } ,S,S “ } ∆ ˚ , ,h } ,S,S ď (cid:96),(cid:96) ď N d r , { ››››› T ÿ t “ h ` A F t ´ h A J Ă M p (cid:96) q E J t T ´ h ››››› S . To bound } ∆ ˚ ,k,h } ,S,S for general k ď K , we just need to replace Ă M p (cid:96) q by d j ‰ k Ă M p (cid:96) q j and N d r , { by N d ˚´ k , {p K ´ q with d ˚´ k “ ř j ‰ k d j r j as in Lemma 1 (iii). We apply the Gaussian concentrationineqaulity to the right-hand side above. Elementary calculation shows that ˇˇˇˇˇ››››› T ÿ t “ h ` A F t ´ h A J Ă M p (cid:96) q E J t ››››› S ´ ››››› T ÿ t “ h ` A F t ´ h A J Ă M p (cid:96) q E ˚J t ››››› S ˇˇˇˇˇ ď ››››››› p A F A J , ..., A F T ´ h A J q ¨˚˝ Ă M p (cid:96) q p E J h ` ´ E ˚J h ` q ... Ă M p (cid:96) q p E J T ´ E ˚J T q ˛‹‚››››››› S ď ›› p A F A J , ..., A F T ´ h A J q ›› { ››››››› diag ´ Ă M p (cid:96) q ¯ ¨˚˝ E J h ` ´ E ˚J h ` ... E J T ´ E ˚J T ˛‹‚››››››› S ď ? T } Θ ˚ , } { ›››››››¨˚˝ E J h ` ´ E ˚J h ` ... E J T ´ E ˚J T ˛‹‚››››››› F . ›››ř Tt “ h ` A F t ´ h A J Ă M p (cid:96) q E J t ››› S is a σ ? T } Θ ˚ , } { Lipschitz function in p E , . . . , E T q . Em-ploying similar arguments in the proof of Theorem 2 in Chen et al. (2019b), we have¯ E ››››› T ÿ t “ h ` A F t ´ h A J Ă M p (cid:96) q E J t T ´ h ››››› S ď σ p T d q { T ´ h } Θ ˚ , } { . Then, by Gaussian concentration inequalities for Lipschitz functions, P ˜››››› T ÿ t “ h ` A F t ´ h A J Ă M p (cid:96) q E J t T ´ h ››››› S ´ σ p T d q { T ´ h } Θ ˚ , } { ě σ ? TT ´ h } Θ ˚ , } { x ¸ ď e ´ x . Hence, P ˜ } ∆ ˚ } ,S,S { ě σ p T d q { T ´ h } Θ ˚ , } { ` σ ? TT ´ h } Θ ˚ , } { x ¸ ď N d r , { e ´ x . As T ě h and K “
2, this implies with x — ? d r that in an event with at least probability1 ´ e ´ d { } ∆ ˚ } ,S,S ď C p iter q ,K σT ´ { } Θ ˚ , } { p? d ` ? d r q p K ´ q ď C p iter q ,K p ` a d r { d q λ ˚ R ˚p ideal q , with the R ˚p ideal q k in (31) and a constant C p iter q ,K depending on K only. In this event, (33) gives24 p λ ˚ k q ´ } ∆ ˚ ,k,h } k,S,S ď C p iter q ,K { C p iter q ď ρ . Thus, (50) holds for ∆ ˚ p Ă M q . Step 2.
Inequality (50) for ∆ ˚ p Ă M q follow from the same argument as the above step. Step 3.
Here we prove (50) for the ∆ ˚ p Ă M q in (47). By Lemma 1 (ii), we can find U p (cid:96) q , U p (cid:96) q P R d ˆ r , 1 ď (cid:96), (cid:96) ď N d r , { such that } U p (cid:96) q } S ď } U p (cid:96) q } S ď } ∆ ˚ } ,S,S “ } ∆ ˚ , ,h } ,S,S ď ď (cid:96),(cid:96) ď N d r , { ››››› T ÿ t “ h ` E t ´ h U p (cid:96) q U p (cid:96) qJ E J t T ´ h ››››› S . (55)We split the sum into two terms over the index sets, S “ p h, h s Y p h, h s Y ¨ ¨ ¨ X p h, T s and itscomplement S in p h, T s , so that t E t ´ h , t P S a u is independent of t E t , t P S a u for each a “ ,
2. Let n a “ | S a | r . Define G a “ p E t ´ h U p (cid:96) q , t P S a q P R d ˆ n a and H a “ p E t U p (cid:96) q , t P S a q P R d ˆ n a . Then, G a , H a are two independent Gaussian matrices. Note that ››››› ÿ t P S a E t ´ h U p (cid:96) q U p (cid:96) qJ E J t T ´ h ››››› S “ ›››› G a H J a T ´ h ›››› S . (56)Moreover, by Assumption 1, Var p u J vec p G a qq ď σ and Var p u J vec p H a qq ď σ for all unit vectors u P R d n a , so that by Lemme 2 (i) P ! } G a H J a } S { σ ě d ` a d n a ` x p x ` ? n a ` a d q ) ď e ´ x { , x ą . ř a “ n a “ r p T ´ h q , it follows from (55), (56) and the above inequality that P } ∆ ˚ , ,h } ,S,S σ ě p? d ` x q T ´ h ` ? r p? d ` x q? T ´ h + ď N d r , { e ´ x { . Thus, with h ď T { x — ? d ` a d ˚´ and some constant C p iter q ,K depending on K only, } ∆ ˚ , ,h } ,S,S ď C p iter q ,K p ´ h { T q σ p K ´ q ˆ p? d ` a d ˚´ q? r ´ T { ` p? d ` a d ˚´ q T ´ h ˙ (57)with at least probability 1 ´ e ´ d {
5. As λ ˚ k ď } Θ ˚ , } { {p ´ h { T q { by (19) and (16), R ˚p ideal q ě p λ ˚ q ´ p ´ h { T q σ p T ´ h q ´ { a d ` p λ ˚ q ´ σT ´ { σ a d r ´ by (31). Thus, in the event (57) and for k “ K “ } ∆ ˚ , ,h } ,S,S ď C p iter q ,K ˆ´ ` b d ˚´ { d ¯ λ ˚ R ˚p ideal q ` ´ ` b d ˚´ { d ¯ λ ˚ ` R ˚p ideal q ˘ ˙ ď λ ˚ C p iter q ,K p ` { C p iter q q C p iter q , which is no greater than λ ˚ ρ {
24 by the condition on C p iter q . This yields (50) for ∆ ˚ p Ă M q . Step 4.
Next, we prove (54) in the event X j “ Ω j . Note that, } Θ ˚ ,h p p U p m q q ´ Θ ˚ ,h p U q} S “ ››››› T ÿ t “ h ` A F t ´ h A J p p U p m q p U p m qJ ´ U U J q A F J t A J T ´ h ››››› S “ T ´ h ›››››››› p A F A J , ..., A F T ´ h A J q ¨˚˚˝ p p U p m q p U p m qJ ´ U U J q A F J A J ... p p U p m q p U p m qJ ´ U U J q A F J T ´ h A J ˛‹‹‚›››››››› S ď T ´ h ›› p A F A J , ..., A F T ´ h A J q ›› S ››››››› diag ´ p U p m q p U p m qJ ´ U U J ¯ ¨˚˝ A F J A J ... A F J T ´ h A J ˛‹‚››››››› S ď } p U p m q p U p m qJ ´ U U J } S } Θ ˚ , } S {p ´ h { T qď L p m q } Θ ˚ , } S {p ´ h { T q . Hence, by the Cauchy-Schwarz inequality and (45), } E r TIPUP sp p U p m q q ´ E r TIPUP sp U q} S ď a h L p m q } Θ ˚ , } S {p ´ h { T q . By (19), λ ˚ h { “ σ r ` mat p Θ ˚ , h q ˘ “ σ r ` E r TIPUP sp U q ˘ . Thus, by Weyl’s inequality σ r p E r TIPUP sp p U p m q qq ě λ ˚ h { ´ a h L p m q } Θ ˚ , } S ě σ r p E r TIPUP sp U qq L “ λ ˚ h { { . k ď K λ ˚ k ě L p m q } Θ ˚ , } S . We prove this condition by induction in the event X j “ Ω j .By (32) and (44), 4 L p q } Θ ˚ , } S { min k ď K λ ˚ k ď C p TIPUP q { C ď
1. Given the induction assumption4 L p m q } Θ ˚ , } S { min k ď K λ ˚ k ď
1, (54) holds for the same m , so that (53), (51) and (44), L p m ` q ď C p TIPUP q (cid:32) p ` . . . ` ρ m q R ˚p ideal q ` ρ m ` R ˚p q ( ď C p TIPUP q p ´ ρ q ´ R ˚p q . It then follows from (32) that 4 L p m ` q } Θ ˚ , } S { min k ď K λ ˚ k ď C p TIPUP q p ´ ρ q ´ { C ď
1. Thiscompletes the induction and the proof of the entire theorem.
Remark 9.
Alternatively, if we assume 2 p ´ ρ q ´ R ˚p ideal q ď R ˚p q , then 2 p ` . . . ` ρ m q R ˚p ideal q k ` ρ m ` R ˚p q ď R ˚p q , so that the condition 4 C p TIPUP q { C ď p C p TIPUP q { C q max m ě t p ` . . . ` ρ m q R ˚p ideal q { R ˚p q ` ρ m ` u ď . Again, we focus on the case of K “ k matrix unfolding.In particular, we sometimes give explicit expressions only in the case of k “ K “
2. For K “
2, we observe a matrix time series with X t “ A F t A J ` E t P R d ˆ d . Under the conditionalexpectation E , F , ...F T are fixed. Let U , U be the left singular matrices of A and A respectivelywith r k “ rank p U k q “ rank p A k q . Recall d is kronecker product and b is tensor product.We outline the proof as follows, which has exactly the same structure as the proofs of Theorem2. Recall L p m q k and L p m q in (43). From Chen et al. (2019b), E “ L p q k ‰ À R p q k as we mentioned in(20). By applying the Gaussian concentration inequality for Lipschitz functions and Lemme 2 intheir analysis, we have L p q ď C p TOPUP q R p q with R p q “ max ď k ď K R p q k (58)in an event Ω with P p Ω q ě ´ ´ ř Kk “ e ´ d k . This is similar to (64) below.After the initialization with p U p q k , the algorithm iteratively produces estimates p U p m q k from m “ m “ J . For any r U P R d ˆ r , let V ,h p r U q “ T ÿ t “ h ` X t ´ h r U b X t r U T ´ h P R d ˆ r ˆ d ˆ r , mat p TOPUP qp r U q “ mat ` V , h p r U q ˘ P R d ˆp d r h q , where V , h p r U q “ ` V ,h p r U q , h “ , ..., h ˘ P R d ˆ r ˆ d ˆ r ˆ h . Given p U p m q , the p m ` q -th iterationproduces estimates p U p m ` q “ LSVD r ` mat p TOPUP qp p U p m q q ˘ , p P p m ` q “ p U p m ` q p U p m ` qJ . The “noiseless” version of this update is given byΘ ,h p r U q “ T ÿ t “ h ` A F t ´ h A J r U b A F t A J r U T ´ h , E r mat p TOPUP qsp r U q “ mat ` Θ , h p r U q ˘ (59)36ith Θ , h p r U q “ ` Θ ,h p r U q , h “ , ..., h ˘ as in (15), giving error free “estimates”, U “ LSVD r ` E r mat p TOPUP qsp p U p m q q ˘ , P “ U U J . Thus, by Wedin’s theorem (Wedin (1972)), L p m ` q “ ›› p P p m ` q ´ P ›› S ď } mat p TOPUP qp p U p m q q ´ E r mat p TOPUP qsp p U p m q q} S σ r p E r mat p TOPUP qsp p U p m q qq . (60)To bound the numerator on the right-hand side of (60), for any matrix Ă M “ r U d I d d q U , wewrite∆ p Ă M q “ ∆ , ,h p Ă M q “ T ´ h T ÿ t “ h ` mat p A F t ´ h A J r U b E t q U q , ∆ p Ă M q “ ∆ , ,h p Ă M q “ T ´ h T ÿ t “ h ` mat p E t ´ h r U b A F J t A J q U q , ∆ p Ă M q “ ∆ , ,h p Ă M q “ T ´ h T ÿ t “ h ` mat p E t ´ h r U b E t q U q “ T ´ h T ÿ t “ h ` mat p E t ´ h b E t q Ă M . Let P U “ U U J and I “ P U ` P U K . Note that, for x M “ p U d I d d p U and M “ U d I d d U , } ∆ p x M q} S ď ››››› T ÿ t “ h ` mat p E t ´ h P U p U p m q b E t P U p U p m q q T ´ h ››››› S ` ››››› T ÿ t “ h ` mat p E t ´ h P U p U p m q b E t P U K p U p m q q T ´ h ››››› S ` ››››› T ÿ t “ h ` mat p E t ´ h P U K p U p m q b E t P U p U p m q q T ´ h ››››› S ` ››››› T ÿ t “ h ` mat p E t ´ h P U K p U p m q b E t P U K p U p m q q T ´ h ››››› S ď } ∆ p M q} S ` L p m q } ∆ } ,S,S , where } ∆ j } ,S,S are defined as } ∆ j } ,S,S “ } ∆ j, ,h } ,S,S “ max } r U } S ď , rank p r U q“ r } q U } S ď , rank p q U q“ r ›› ∆ j, ,h p Ă M q ›› S . Then, the numerator on the right-hand of (60) can be bounded by } mat p TOPUP qp p U p m q q ´ E r mat p TOPUP qsp p U p m q q} S (61) ď } mat p TOPUP qp U q ´ E r mat p TOPUP qsp U q} S ` L p m q p K ´ q ÿ j “ h { max h ď h } ∆ j, ,h } ,S,S . For general 1 ď k ď K , ∆ j,k,h , applying P d j ‰ k U j gives the general version of (61) with ›› ∆ j,k,h ›› k,S,S “ max } Ă M j } S ď ›› ∆ j,k,h ` d j ‰ k Ă M j ˘›› S .
37e claim that in certain events Ω j , j “ , ,
3, with P p Ω j q ě ´ ´ ř Kk “ e ´ d k , ›› ∆ j,k,h ›› k,S,S ď ρλ k L` p K ´ q ˘ , @ ď k ď K. (62)For simplicity, we will only prove this inequality for k “ K “ P } ∆ j } ,S,S “ max } Ă M } S ď ›› ∆ j, ,h p Ă M q ›› S ě ρλ L + ď ´ e ´ d . (63)Define the ideal version of the ratio in (60) for general 1 ď k ď K as L p ideal q k “ } mat p TOPUP k qpˆ j ‰ k U j q ´ E r mat p TOPUP k qspˆ j ‰ k U j q} S σ r k p E r mat p TOPUP k qspˆ j ‰ k U j qq , L p ideal q “ max ď k ď K L p ideal q k . The proof of (58) also implies L p ideal q ď C p TOPUP q R p ideal q with R p ideal q “ max ď k ď K R p ideal q k (64)in an event Ω with P p Ω q ď ´ ř Kk “ e ´ d k , where R p ideal q k is as in (21).We aim to prove that in the event X j “ Ω j L p m ` q k ď L p ideal q k ` p K ´ q L p m q max j,k,h ›› ∆ j,k,h ›› k,S,S L λ k ď L p ideal q k ` ρL p m q , (65)which would imply by induction L p m ` q ď p ` . . . ` ρ m q L p ideal q k ` ρ m ` L p q , (66)and then the conclusions would follow from (58) and (64). We notice that by (43), (60), (61) and(62), (65) holds provided that σ r k p E r mat p TOPUP k qsp p U p m q j , j ‰ k qq , the numerator in (60), is nosmaller than a half of its ideal version as in (18), e.g.2 σ r p E r mat p TOPUP qsp p U p m q qq ě σ r p E r mat p TOPUP qsp U qq “ h { λ , (67)in the case of k “ K “
2. Again, we divide the rest of the proof into 4 steps to prove (63)for j “ , , Step 1.
We prove (63) for the ∆ p Ă M q . By Lemma 1 (ii), there exist U p (cid:96) q , U p (cid:96) q P R d ˆ r , 1 ď (cid:96), (cid:96) ď N d r , { “ d r , such that } U p (cid:96) q } S ď } U p (cid:96) q } S ď } ∆ } ,S,S “ } ∆ , ,h } ,S,S ď (cid:96),(cid:96) ď N d r , { ››››› T ÿ t “ h ` mat p A F t ´ h A J U p (cid:96) q b E t U p (cid:96) q q T ´ h ››››› S .
38e apply the Gaussian concentration inequality to the right-hand side above. Elementary calcula-tion shows that ˇˇˇˇˇ››››› T ÿ t “ h ` mat p A F t ´ h A J U p (cid:96) q b E t U p (cid:96) q q ››››› S ´ ››››› T ÿ t “ h ` mat p A F t ´ h A J U p (cid:96) q b E ˚ t U p (cid:96) q q ››››› S ˇˇˇˇˇ ď ››››› T ÿ t “ h ` mat p A F t ´ h A J U p (cid:96) q b p E t ´ E ˚ t q U p (cid:96) q q T ´ h ››››› S ď ›››››››› p mat p A F A J b I d q , ..., mat p A F T ´ h A J b I d qq ¨˚˚˝ U p (cid:96) q d I d d p E J h ` ´ E ˚J h ` q U p (cid:96) q ... U p (cid:96) q d I d d p E J T ´ E ˚J T q U p (cid:96) q ˛‹‹‚›››››››› S ď ? T } Θ ˚ , } { } U p (cid:96) q } S } U p (cid:96) q } S ›››››››¨˚˝ E J h ` ´ E ˚J h ` ... E J T ´ E ˚J T ˛‹‚››››››› F . That is, ›››ř Tt “ h ` mat p A F t ´ h A J U p (cid:96) q b E t U p (cid:96) q q ››› S is a σ ? T } Θ ˚ , } { Lipschitz function in p E , ..., E T q .Employing similar arguments in the proof of Theorem 1 in Chen et al. (2019b), we have E ››››› T ÿ t “ h ` mat p A F t ´ h A J U p (cid:96) q b E t U p (cid:96) q q T ´ h ››››› S ď σ p T q { p? d ` a d r q T ´ h } Θ ˚ , } { Then, by Gaussian concentration inequalities for Lipschitz functions, P ˜››››› T ÿ t “ h ` mat p A F t ´ h A J U p (cid:96) q b E t U p (cid:96) q q T ´ h ››››› S ´ σ p T q { p? d ` a d r q T ´ h } Θ ˚ , } { ě σ ? TT ´ h } Θ ˚ , } { x ¸ ď e ´ x . Hence, P ˜ } ∆ } ,S,S { ě σ p T q { p? d ` a d r q T ´ h } Θ ˚ , } { ` σ ? TT ´ h } Θ ˚ , } { x ¸ ď N d r , { e ´ x . As T ě h , this implies with x — ? d r that in an event with at least probability 1 ´ e ´ d { } ∆ } ,S,S ď C p iter q ,K σT ´ { } Θ ˚ , } { p? d r ` ? d r q p K ´ q ď C p iter q ,K p λ R p ideal q ` σT ´ { } Θ ˚ , } { ? d r q R p ideal q k in (21) and a constant C p iter q ,K depending on K only. In this event, (23) gives36 p λ k q ´ } ∆ ,k,h } k,S,S ď C p iter q ,K { C p iter q ď ρ . Thus, (63) holds for ∆ ˚ p Ă M q . Step 2.
Note that } ∆ } ,S,S “ max r U P R d ˆ r , q U P R d ˆ r , } r U } S ď , } q U } S ď ››››› T ÿ t “ h ` mat p E t ´ h r U b U J A F t A J q U q T ´ h ››››› S . p Ă M q follow from the same argument as the above step. Step 3.
Now we prove (63) for the ∆ p Ă M q . We split the sum into two terms over the index sets, S “ tp h, h s Y p h, h s Y ¨ ¨ ¨ u X p h, T s and its complement S in p h, T s , so that t E t ´ h , t P S a u isindependent of t E t , t P S a u for each a “ ,
2. Let n a “ | S a | .By Lemma 1 (ii), we can find U p (cid:96) q , U p (cid:96) q P R d ˆ r , 1 ď (cid:96), (cid:96) ď N d r , { such that } U p (cid:96) q } S ď } U p (cid:96) q } S ď
1. In this case, } ∆ } ,S,S “ } ∆ , ,h } ,S,S ď ď (cid:96),(cid:96) ď N d r , { ››››› T ÿ t “ h ` mat p E t ´ h U p (cid:96) q b E t U p (cid:96) q q T ´ h ››››› S . (68)Define G a “ p E t ´ h U p (cid:96) q , t P S a q and H a “ p E t U p (cid:96) q , t P S a q . Then, G a , H a are two independentGaussian matrices. By Lemma 2(ii), for any x ą P ˜››››› ÿ t P S a mat p E t ´ h U p (cid:96) q b E t U p (cid:96) q q ››››› S ě d ? r ` r a d n a ` x ` ? nx ` a d r x ¸ ď e ´ x { . As in the derivation of } ∆ ˚ } ,S,S in the proof of Theorem 2, we have, with x — ? d r and someconstant C p iter q ,K depending on K only, P ¨˝ } ∆ , ,h } ,S,S ě C p iter q ,K σ ˆ r ? d ` ? d r T { ` d ? r ` d r ` r ? d d T ˙˛‚ ď e ´ d { . This yields (63) for ∆ p Ă M q as in the end of Step 1 for ∆ p Ă M q . Step 4.
Next, we consider the r -th largest singular value of σ r p E r mat p TOPUP qsp p U p m q qq in theevent X j “ Ω j . By definition, the left singular subspace of E r mat p TOPUP qsp p U p m q q is U . Then, σ r p E r mat p TOPUP qsp p U p m q qq“ σ r ˜ mat ˜ T ÿ t “ h ` A F t ´ h A J p U p m q b A F t A J p U p m q T ´ h , h “ , ..., h ¸¸ “ σ r ˜ mat ˜ T ÿ t “ h ` A F t ´ h A J U U J p U p m q b A F t A J U U J p U p m q T ´ h , h “ , ..., h ¸¸ “ σ r ˜ mat ˜ T ÿ t “ h ` A F t ´ h A J U b A F t A J U T ´ h , h “ , ..., h ¸ ¨ ´ U J p U p m q d I d d U J p U p m q d I h ¯¸ ě σ r ˜ mat ˜ T ÿ t “ h ` A F t ´ h A J U b A F t A J U T ´ h , h “ , ..., h ¸¸ ¨ σ min ´ U J p U p m q d I d d U J p U p m q d I h ¯ ě σ r ˜ mat ˜ T ÿ t “ h ` A F t ´ h A J b A F t A J T ´ h , h “ , ..., h ¸¸ ¨ σ min ´ U J p U p m q ¯ ¨ σ min ´ U J p U p m q ¯ ě a h λ p ´ L p m q q . L p m q ď {
2, then σ r p E r mat p TOPUP qsp p U p m q qq ě a h λ { . By (18), λ h { “ σ r ` mat p Θ , h q ˘ “ σ r ` mat p E r TOPUP sp U q ˘ . We prove this condition inthe event X j “ Ω j . By (22) and (58), L p q ď C p TOPUP q R p q ď C p TOPUP q { C ď {
2. By induction,given L p m q ď {
2, (67) holds for the same m . Applying (58), (64) and (66), L p m ` q ď C p TOPUP q (cid:32) p ` . . . ` ρ m q R p ideal q ` ρ m ` R p q ( ď C p TOPUP q p ´ ρ q ´ R p q ď C p TOPUP q p ´ ρ q ´ { C ď { . This completes the induction and the proof of the entire theorem.
Proposition 1.
Let λ “ ś Kk “ } A k } S . Assume that the condition numbers of A J k A k ( k “ , ..., K )are bounded. Then, for all ď k ď K , we have, } Θ k, } op — λ } Φ k, } op , } Θ ˚ k, } S — λ } Φ ˚ k, } S ,τ k,r k — λ ˆ σ r k p mat p Φ k, h qq ,τ ˚ k,r k — λ ˆ σ r k ´ Φ ˚p cano q k, h { λ ¯ . Proof.
If the condition numbers of A J k A k ( k “ , ..., K ) are bounded, all the singular values of A k are at the same order. Then Proposition 1 immediately follows. Proofs of Corollary 1 and 3.
Employing Proposition 1, under Assumption 2 and E r mat p Φ k, h qs is of rank r k , we can show λ k — λ . When the ranks r k are fixed, condition (23) can be written as C p iter q R p ideal q ď
1. Thus, for C “ { R p q “ C p TOPUP q and C p iter q “ { R p ideal q “ C p iter q ,K { ρ , we have ρ “ C p iter q ,K { C p iter q “ C { C p iter q . For m “
1, this gives the rate R p ideal q by (24). Then Corollary 1 follows from the results of Theorem1. Similarly, Applying Proposition 1, under Assumption 2 and E r Φ ˚p cano q k, h { λ s is of rank r k , we canobtain λ ˚ k — λ . Then Corollary 3 follows from the results of Theorem 2. Proofs of Corollary 2 and 4.
Applying Assumption 3 to Theorem 1 (resp. Theorem 2), then Corol-lary 2 (resp. Corollary 4) follows.
Lemma 1.
Let d, d j , d ˚ , r ď d ^ d j be positive integers, (cid:15) ą and N d,(cid:15) “ t p ` { (cid:15) q d u .(i) For any norm } ¨ } in R d , there exist M j P R d with } M j } ď , j “ , . . . , N d,(cid:15) , such that max } M }ď min ď j ď N d,(cid:15) } M ´ M j } ď (cid:15) . Consequently, for any linear mapping f and norm } ¨ } ˚ , sup M P R d , } M }ď } f p M q} ˚ ď ď j ď N d, { } f p M j q} ˚ . ii) Given (cid:15) ą , there exist U j P R d ˆ r and V j P R d ˆ r with } U j } S _ } V j } S ď such that max M P R d ˆ d , } M } S ď , rank p M qď r min j ď N dr,(cid:15) { ,j ď N d r,(cid:15) { } M ´ U j V J j } S ď (cid:15). Consequently, for any linear mapping f and norm } ¨ } ˚ in the range of f , sup M, Ă M P R d ˆ d , } M ´ Ă M } S ď (cid:15) } M } S _} Ă M } S ď rank p M q_ rank p Ă M qď r } f p M ´ Ă M q} ˚ (cid:15) I r ă d ^ d ď sup } M } S ď rank p M qď r } f p M q} ˚ ď ď j ď Ndr, { ď j Nd r, { } f p U j V J j q} ˚ . (69) (iii) Given (cid:15) ą , there exist U j,k P R d k ˆ r k and V j ,k P R d k ˆ r k with } U j,k } S _ } V j ,k } S ď such that max Mk P R dk ˆ d k , } Mk } S ď rank p Mk qď rk, @ k ď K min jk ď Ndkrk,(cid:15) { j k ď Nd krk,(cid:15) { , @ k ď K ››› d Kk “ M k ´ d Kk “ p U j k ,k V J j k ,k q ››› op ď (cid:15) p K ´ q . For any linear mapping f and norm } ¨ } ˚ in the range of f , sup Mk, Ă Mk P R dk ˆ d k , } Mk ´ Ă Mk } S ď (cid:15) rank p Mk q_ rank p Ă Mk qď rk } Mk } S _} Ă Mk } S ď @ k ď K } f pd Kk “ M k ´ d Kk “ Ă M k q} ˚ (cid:15) p K ´ q ď sup Mk P R dk ˆ d k rank p Mk qď rk } Mk } S ď , @ k ››› f ` d Kk “ M k ˘››› ˚ (70) and sup Mk P R dk ˆ d k , } Mk } S ď rank p Mk qď rk @ k ď K ››› f ` d Kk “ M k ˘››› ˚ ď ď jk ď Ndkrk, {p K ´ q ď j k ď Nd krk, {p K ´ q ››› f ` d Kk “ U j k ,k V J j k ,k ˘››› ˚ . (71) Proof. (i) The covering number N (cid:15) follows from the standard volume comparison argument asthe p ` (cid:15) { q -ball under } ¨ } and centered at the origin contains no more than p ` { (cid:15) q d disjoint p (cid:15) { q -balls centered at M j . The inequality follows from the “subtraction argument”,sup } M }ď } f p M q} ˚ ´ max ď j ď N d, { } f p M j q} ˚ ď sup } M ´ M j }ď { } f p M ´ M j q} ˚ ď sup } M }ď } f p M q} ˚ { . (ii) The covering numbers are given by applying (i) to both U and V in the decomposition M “ U V J as Lemma 7 in Zhang and Xia (2018). The first inequality in (69) follows from the fact that for r ă d ^ d , p M ´ Ă M q{ (cid:15) is a sum of two rank- r matrices with no greater spectrum norm than 1,and the second inequality of (69) again follows from the subtraction argument although we needto split M ´ U j V J j into two rank r matrices to result in an extra factor of 2.(iii) The proof is nearly identical to that of part (ii). The only difference is the factor K ´ } d Kk “ M k ´ d Kk “ Ă M k } op ď p K ´ q max ď k ď K } M k ´ Ă M k } S is applied. Lemma 2. (i) Let G P R d ˆ n and H P R d ˆ n be two centered independent Gaussian matrices suchthat E p u J vec p G qq ď σ @ u P R d n and E p v J vec p H qq ď σ @ v P R d n . Then, } GH J } S ď σ `a d d ` a d n ` a d n ˘ ` σ x p x ` ? n ` a d ` a d q ith at least probability ´ e ´ x { for all x ě .(ii) Let G i P R d ˆ d , H i P R d ˆ d , i “ , . . . , n , be independent centered Gaussian matrices such that E p u J vec p G i qq ď σ @ u P R d d and E p v J vec p H i qq ď σ @ v P R d d . Then, ›››› mat ˆ n ÿ i “ G i b H i ˙›››› S ď σ `a d n ` a d d d ` a nd d d ˘ ` σ x ` x ` ? n ` a d ` a d ` a d d ˘ with at least probability ´ e ´ x { for all x ě .Proof. Assume σ “ x ě G and H , let ζ j P R n , j “ ,
2, be independent standard Gaussian vectors. Asin Chen et al. (2019b), the Sudakov-Fernique inequality provides E ” } GH J } S ˇˇˇ G ı ď E ” max } u } “ u J Gζ ˇˇˇ G ı ` } G } S a d . Thus, by the Gaussian concentration inequality P " } GH J } S ě E ” max } u } “ u J Gζ ˇˇˇ G ı ` } G } S p a d ` x q ˇˇˇˇ G * ď e ´ x { . Applying the Sudakov-Fernique inequality again, we have E ” E ” max } u } “ u J Gζ ˇˇˇ G ı ` } G } S p a d ` x q ı ď a d n ` p a d ` ? n qp a d ` x q . Moreover, as the Lipschitz norm of E “ max } u } “ u J Gζ ˇˇ G ‰ ` } G } S p? d ` x q is bounded by ? n `? d ` x , by the Gaussian concentration inequality E ” max } u } “ u J Gζ ˇˇˇ G ı ` } G } S p a d ` x qď a d n ` p a d ` ? n qp a d ` x q ` x ` ? n ` a d ` x ˘ holds with at least probability 1 ´ e ´ x { .(ii) We treat G “ p G , . . . , G n q P R d ˆ d ˆ n and H “ p H , . . . , H n q P R d ˆ d ˆ n as tensors. Let ξ “ p ξ , . . . , ξ n q P R d ˆ n be a standard Gaussian matrix independent of H . For u P R d and V P R d ˆp d d q , E „›››› mat ˆ n ÿ i “ G i b H i ˙›››› S ˇˇˇˇ H “ E „ sup } u } “ , } V } F “ u J mat p G q vec ` mat p H q V J ˘ˇˇˇˇ H ď a d sup } V } F “ } mat p H q V J } F ` E „ sup } V } F “ p vec p ξ qq J vec ` mat p H q V J ˘ˇˇˇˇ H “ a d } mat p H q} S ` E „ˆ d ÿ j “ d d ÿ k “ ˆ n ÿ i “ ξ i,j vec p H i q k ˙ ˙ { ˇˇˇˇ H a d } mat p H q} S ` a d } vec p H q} By the Gaussian concentration inequality, P "›››› mat ˆ n ÿ i “ G i b H i ˙›››› S ě p a d ` x q} mat p H q} S ` a d } vec p H q} ˇˇˇˇ H * ď e ´ x { . Moreover, as E “ p? d ` x q} mat p H q} S ` ? d } vec p H q} ‰ ď p? d ` x q ` ? n ` ? d d ˘ ` ? d nd d andthe Lipschitz norm of p? d ` x q} mat p H q} S ` ? d } vec p H q} is bounded by ? d ` x ` ? d , p a d ` x q} mat p H q} S ` a d } vec p H q} ď p a d ` x q ` ? n ` a d d ˘ ` a nd d d ` x `a d ` x ` a d ˘ holds with at least probability 1 ´ e ´ x {2