Factor Modelling for Clustering High-dimensional Time Series
Bo Zhang (Department of Statistics & Finance, International Institute of Finance, School of Management, University of Science and Technology of China)
Guangming Pan (School of Physical & Mathematical Sciences, Nanyang Technological University)
Qiwei Yao (Department of Statistics, London School of Economics and Political Science)
Wang Zhou (Department of Statistics & Applied Probability, National University of Singapore)
January 7, 2021
Abstract
We propose a new unsupervised learning method for clustering a large number of time series based on a latent factor structure. Each cluster is characterized by its own cluster-specific factors in addition to some common factors which impact on all the time series concerned. Our setting also offers the flexibility that some time series may not belong to any clusters. The consistency with explicit convergence rates is established for the estimation of the common factors, the cluster-specific factors, and the latent clusters. Numerical illustration with both simulated data and a real data example is also reported. As a spin-off, the proposed new approach also advances significantly the statistical inference for the factor model of Lam and Yao (2012).
Keywords: Autocovariance matrices; clustering time series; eigenanalysis; idiosyncratic components; k-means clustering algorithm; ratio-based estimation; strong and weak factors.

1 Introduction

One of the primary tasks of data mining is clustering. While most clustering methods were originally designed for independent observations, clustering a large number of time series gains increasing momentum (Esling and Agon 2012), due to mining large and complex data recorded over time in business, finance, biology, medicine, climate, energy, environment, psychology, multimedia and other areas (Table 1 of Aghabozorgi et al. 2015, Maharaj et al. 2019 and the references therein). The basic idea is to develop some relevant similarity or distance measures among time series first, and then to apply standard clustering algorithms such as hierarchical clustering or the k-means method. Most existing similarity/distance measures for time series may be loosely divided into two categories: data-based and feature-based. The data-based approaches define the measures directly on the observed time series using, for example, the L₂ or, more generally, Minkowski distance, or various correlation measures. Alonso and Peña (2019) proposed a generalized cross correlation as a similarity measure, which takes into account cross correlation over different time lags. Dynamic time warping can be applied beforehand to cope with time deformation due to, for example, shifting holidays over different years (Keogh and Ratanamahatana, 2005). The feature-based approaches extract relevant features from the observed time series first, and then define similarity/distance measures based on the extracted features. The feature extraction can be carried out by various transformations such as Fourier, wavelet or principal component analysis (Section 2.3 of Roelofsen, 2018). The features from fitted time series models can also be used to define similarity/distance measures (Yao et al. 2014). A related but different problem is the clustering of segments from a single long time series, which is not considered in this paper.

The goal of this study is to propose a new factor-model-based approach to cluster a large number of time series into different and unknown clusters such that the members within each cluster share a similar dynamic structure, while the number of clusters and their sizes are all unknown. We represent the dynamic structures by latent common and cluster-specific factors, which are both unknown and are identified by the difference in factor strength. The common factors are strong factors (Remark 1 of Lam and Yao 2012), as each of them carries information on most (if not all) time series concerned. The cluster-specific factors are weak factors, as they only affect the time series in a specific cluster. The clustering is based on the factor loadings on all the weak factors: we apply a k-means algorithm with a correlation-type similarity measure defined in terms of the loadings.

Though our factor model is similar to that of Ando and Bai (2017), our approach is radically different. First, we estimate the strong factors and all the weak factors in one pass, and then the latent clusters are recovered based on the estimated weak factor loadings; Ando and Bai (2017) adopted an iterative least squares algorithm to estimate the factors/factor loadings and the latent cluster structure recursively. Secondly, our setting allows the flexibility that some time series do not belong to any clusters, which is often the case in practice.
Thirdly, our setting allows dependence between the common factors and the cluster-specific factors, while Ando and Bai (2017) imposed an orthogonality condition between the two; see Remark 1(iv) in Section 2 below.

The methods used for estimating the factors and factor loadings are adapted from Lam and Yao (2012). Nevertheless substantial advances have been made even within the context of Lam and Yao (2012): (i) we remove the artificial condition that the factor loading spaces for strong and weak factors are perpendicular to each other, (ii) we allow weak serial correlations in the idiosyncratic components of the model, which were assumed to be vector white noise by Lam and Yao (2012), and, more significantly, (iii) we propose a new and consistent ratio-based estimator for the number of factors (see Step 1 and also Remark 3(iii) in Section 3 below).

The rest of the paper is organized as follows. Our factor model and the relevant conditions are presented in Section 2. The inference methods are presented in Section 3. The asymptotic properties of the estimation methods are collected in Section 4. Numerical illustration with both simulated data and a real data example is reported in Section 5. All technical proofs are presented in Section 6. A supplementary file contains more simulation results.

All vectors are column vectors. Let ‖a‖ denote the Euclidean norm of a vector a. For any matrix G, let M(G) denote the linear space spanned by the columns of G, ‖G‖ the square root of the largest eigenvalue of G⊤G, ‖G‖_min the square root of the smallest eigenvalue of G⊤G, and |G| the determinant of G when G is square. We write a ≍ b if a = O(b) and b = O(a). We use C to denote a generic positive constant independent of p and n, which may take different values at different places.

2 The model

Let y_t be a weakly stationary p × 1 time series, i.e. E(y_t) is a constant vector independent of t, and all elements of Cov(y_{t+k}, y_t) are finite and depend on k only. Suppose that y_t consists of d + 1 latent segments, i.e.

  y_t⊤ = (y_{t,1}⊤, ···, y_{t,d}⊤, y_{t,d+1}⊤),   (2.1)

where y_{t,1}, ···, y_{t,d+1} are, respectively, p_1, ···, p_{d+1} vector time series with p_1, ···, p_d ≥ 2, p_{d+1} ≥ 0, and p_1 + ··· + p_d = p_0, p_0 + p_{d+1} = p.
Furthermore, we assume the following latent factor model with d clusters:

  y_t = A x_t + (B⊤, 0)⊤ z_t + ε_t,  B = diag(B_1, ···, B_d),  z_t⊤ = (z_{t,1}⊤, ···, z_{t,d}⊤),   (2.2)

where A is a p × r matrix with rank r, x_t is an r × 1 time series representing the r common factors with |Var(x_t)| ≠ 0, B_j is a p_j × r_j matrix with rank r_j, z_{t,j} is an r_j × 1 time series representing the r_j factors for y_{t,j} only with |Var(z_{t,j})| ≠ 0, the 0 in (B⊤, 0)⊤ stands for the p_{d+1} × r₁ matrix with all elements equal to 0, r₁ = r_1 + ··· + r_d is the total number of cluster-specific factors, and ε_t is an idiosyncratic component in the sense of Chamberlain (1983) and Chamberlain and Rothschild (1983) (see below). Note that in the model above we only observe a permuted y_t (i.e. the order of the components of y_t is unknown), while all the terms on the right-hand side of (2.2) are unknown.

By (2.2), the p_0 components of y_t are grouped into the d clusters y_{t,1}, ···, y_{t,d}, while the p_{d+1} components of y_{t,d+1} do not belong to any clusters. The j-th cluster y_{t,j} is characterised by the cluster-specific factor z_{t,j}, in addition to depending on the common factor x_t. The goal is to identify those d latent clusters from the observations y_1, ···, y_n. Note that all p_j, r_j and d are also unknown.

Assumption 1. max{d, r, r₁} < C < ∞, where C is a constant independent of n and p, and p_i ≍ p for i = 1, ···, d.

Assumption 2. A⊤A = I_r, B⊤B = I_{r₁}, and it holds for a constant q₁ ∈ (0, 1) that

  ‖AA⊤ (B⊤, 0)⊤‖ ≤ q₁.   (2.3)

Assumption 1 requires that the number of factors remains finite when the number of component time series diverges to ∞. This substantially simplifies the technical proofs for the asymptotic results. In practice there is only one (fixed) p, and the factor model is only effective when the number of factors is much smaller than p. The assumption that A and B have orthonormal columns can always be fulfilled, as we can replace the original (A, x_t) by (H, V x_t), where A = HV is a QR decomposition of A. The orthogonality for B can be obtained by such a replacement for each (B_j, z_{t,j}). While A and B are not uniquely defined by (2.2), the factor loading spaces M(A) and M(B_j) are. Hence AA⊤ = A(A⊤A)^{−1}A⊤, i.e. the projection matrix onto M(A), is also unique. Condition (2.3) implies that the columns of (B⊤, 0)⊤ do not fall entirely into the space M(A), as otherwise one could not distinguish z_t from x_t.

Intuitively, (almost) all components of y_t carry the information on the common factor x_t, only p_j components of y_t carry the information on the j-th cluster-specific factor z_{t,j} (j = 1, ···, d), and merely a few components of y_t carry the information on each of the idiosyncratic components of ε_t. It is therefore reasonable to assume that x_t, z_t and ε_t are of different factor strengths. Assumption 3 below quantifies explicitly the differences in factor strength between x_t and z_{t,j}. We introduce some notation first:

  Σ_x(k) = Cov(x_{t+k}, x_t),  Σ_z(k) = Cov(z_{t+k}, z_t),  Σ_{x,z}(k) = Cov(x_{t+k}, z_t),  Σ_{z,x}(k) = Cov(z_{t+k}, x_t).
Assumption 3. Let k₀ ≥ 1 be an integer and δ ∈ (0, 1] be a fixed constant. It holds for k = 0, 1, ···, k₀ that

  ‖Σ_x(k)‖ ≍ p ≍ ‖Σ_x(k)‖_min,   (2.4)
  ‖Σ_z(k)‖ ≍ p^{1−δ} ≍ ‖Σ_z(k)‖_min,   (2.5)
  ‖Σ_x(0)^{−1/2} Σ_{x,z}(0) Σ_z(0)^{−1/2}‖ ≤ q₂ < 1,  ‖Σ_z(0)^{−1/2} Σ_{z,x}(0) Σ_x(0)^{−1/2}‖ ≤ q₂ < 1,   (2.6)
  ‖Σ_{x,z}(k)‖ = O(p^{1−δ/2}),  ‖Σ_{z,x}(k)‖ = O(p^{1−δ/2}),   (2.7)
  Cov(x_t, ε_s) = 0,  Cov(z_t, ε_s) = 0 for all t and s.   (2.8)

Remark 1. (i) Following Lam and Yao (2012), we measure the strength of factors by a constant δ ∈ [0, 1]: the smaller δ is, the stronger a factor is. See Remark 1 of Lam and Yao (2012) for the meaning and the implication of δ. Condition (2.4) implies that all the components of x_t are strong factors corresponding to δ = 0. Since almost all the components of y_t carry the information on each component of x_t, those strong factors can be relatively easily recovered from y_t. In contrast, the components of z_t are weak factors with δ ∈ (0, 1], as effectively only about p^{1−δ} components of y_t carry the information on z_t; see (2.5). Hence it is more difficult to recover those weak factors. Note that our primary interest is to recover the cluster-specific factor z_t in order to cluster the components of y_t, for which we also need to estimate x_t.

(ii) For the simplicity of the presentation, we assume that all the cluster-specific factors z_{t,1}, ···, z_{t,d} are of the same strength, reflected by the uniform constant δ in (2.5). In practice those weak factors may have different strengths; see, for example, the real data example in Section 5.2 below. While our approach can be readily extended to the cases with weak factors of different strengths, this would make the theoretical investigation more cumbersome.

(iii) In (2.2), ε_t represents the idiosyncratic component of y_t in the sense that each component of ε_t only affects the corresponding component and a few other components of y_t (i.e. δ = 1), which is implied by Assumption 4 below. The differences in factor strength make the three time series x_t, z_t and ε_t on the right-hand side of (2.2) (asymptotically) identifiable.

(iv) Model (2.2) is similar to that of Ando and Bai (2017). However, we do not require that the common factor x_t and the cluster-specific factor z_t are orthogonal to each other in the sense that n^{−1} Σ_{1≤t≤n} x_t z_t⊤ = 0, which is imposed by Ando and Bai (2017). Furthermore, we allow the idiosyncratic term ε_t to exhibit weak autocorrelations (Assumption 4 below), instead of complete independence as in Ando and Bai (2017).

From now on we always assume in (2.2) that ε_t = W e_t with E(e_t) = 0, where W is a p × p constant matrix, and e_t = (e_{t,1}, ···, e_{t,p})⊤ consists of p independent weakly stationary univariate time series. We specify e_t in Assumption 4 below.
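To fix ideas, here is a minimal, self-contained sketch (our own illustration, not part of the paper) of data generated from model (2.2): d clusters with a block-diagonal weak loading matrix B, strong common factors scaled by p^{1/2} (strength δ = 0), and weak cluster-specific factors scaled by p^{(1−δ)/2}, in the spirit of (2.4) and (2.5). The AR coefficients and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, phi, burn=100):
    """Simulate a univariate AR(1) series of length n (after burn-in)."""
    e = rng.standard_normal(n + burn)
    x = np.empty(n + burn)
    x[0] = e[0]
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + e[t]
    return x[burn:]

n, d = 400, 3                  # sample size, number of clusters
p_j, p_rest = 40, 30           # cluster sizes p_1 = ... = p_d, and p_{d+1}
p = d * p_j + p_rest
r, r_j, delta = 2, 2, 0.3      # common factors, factors per cluster, weak strength

# Strong common factors: every component of y_t loads on x_t (delta = 0).
x = np.column_stack([np.sqrt(p) * ar1(n, 0.7) for _ in range(r)])
A = rng.uniform(-1, 1, size=(p, r)) / np.sqrt(p)

# Weak cluster-specific factors: block-diagonal loading matrix B.
r1 = d * r_j
z = np.column_stack([p ** ((1 - delta) / 2) * ar1(n, 0.6) for _ in range(r1)])
B = np.zeros((p, r1))
for j in range(d):
    B[j * p_j:(j + 1) * p_j, j * r_j:(j + 1) * r_j] = (
        rng.uniform(-1, 1, size=(p_j, r_j)) / np.sqrt(p_j))

eps = rng.standard_normal((n, p))       # idiosyncratic part (white noise here)
Y = x @ A.T + z @ B.T + eps             # n x p data matrix following (2.2)
Y = Y[:, rng.permutation(p)]            # the observed ordering is unknown
```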
Assumption 4. Let p, n → ∞ in the order of n = O(p²) and p^{2δ} log p = o(n). Let Σ_{e,k} be the n × n matrix with E(e_{t+i,k} e_{t+j,k}) as its (i, j)-th element. Suppose that ‖W‖ < C,

  lim_{n,p→∞} ‖ p^{−1} Σ_{k=1}^p Σ_{e,k} ‖ < C,   (2.9)
  max_{1≤i≤p} E(e_{t,i}^4) < C,  Σ_{i=1}^p E(e_{t,i}^2) ≍ p,   (2.10)

and

  E{ (Σ_{t=1}^n e_{t,i}^2 − n E e_{t,i}^2)^2 · 1(Σ_{t=1}^n e_{t,i}^2 − n E e_{t,i}^2 > n) } < C n² / min{p, n/ log n}   (2.11)

for any 1 ≤ i ≤ p, where 1(·) denotes the indicator function.
Remark 2. When

  Σ_{j=0}^∞ | p^{−1} Σ_{i=1}^p Cov(e_{t+j,i}, e_{t,i}) | < C,   (2.12)

Gershgorin's circle theorem ensures (2.9). Condition (2.11) holds when p ≍ n and {e_{t,i}, t = 1, ···, n} are mixing random variables with the mixing coefficients decaying at appropriate rates. When p > n it also holds for appropriate mixing random variables, because one may then evaluate a higher (> 2) order moment in the truncated expectation.
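To spell out the first claim of Remark 2 (a sketch of ours): by stationarity the (i, j)-th entry of p^{−1} Σ_{k=1}^p Σ_{e,k} equals γ̄(i − j), where γ̄(h) = p^{−1} Σ_{k=1}^p Cov(e_{t+h,k}, e_{t,k}). Since the matrix is symmetric, its spectral norm is bounded by the maximal absolute row sum, so Gershgorin's circle theorem gives

\[
\Big\|\frac{1}{p}\sum_{k=1}^{p}\Sigma_{e,k}\Big\|
\;\le\; \max_{1\le i\le n}\sum_{j=1}^{n}\big|\bar\gamma(i-j)\big|
\;\le\; 2\sum_{h=0}^{\infty}\Big|\frac{1}{p}\sum_{k=1}^{p}\mathrm{Cov}(e_{t+h,k},e_{t,k})\Big| \;<\; 2C,
\]

which is (2.9) up to the value of the constant, whenever (2.12) holds.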
To state the conditions on A and B required for the clustering analysis, we partition A according to the latent cluster structure:

  A⊤ = [A_1⊤, ···, A_{d+1}⊤],   (2.13)

where A_i is a p_i × r matrix.

Assumption 5. Let q_p = max_{1≤i≤d+1, j≠i} ‖A_i A_j⊤ B_j‖_F. Then q_p = O(p^{δ/2} n^{−1/2}).
Assumption 6. For any 1 ≤ i ≤ d, the L₂-norm of every row of (I_{p_i} − A_i A_i⊤) B_i is larger than c₀ p^{−1/2} for some constant c₀ > 0.
Assumption 2 implies q_p < 1, which is weaker than Assumption 5. Assumption 6 ensures that the proposed inference procedure can separate the components in the d clusters from those not belonging to any clusters. See also Remark 3(iv) in Section 3 below. Those conditions are automatically fulfilled if A⊤(B⊤, 0)⊤ = 0, which is a condition imposed in Lam and Yao (2012).

3 The estimation procedure

With the available observations y_1, ···, y_n, we propose below an algorithm (in five steps) to identify the d latent clusters. To this end, we introduce some notation first. Let ȳ = n^{−1} Σ_{t=1}^n y_t,

  Σ̂_y(k) = (n − k)^{−1} Σ_{t=1}^{n−k} (y_{t+k} − ȳ)(y_t − ȳ)⊤,  M̂ = Σ_{k=0}^{k₀} Σ̂_y(k) Σ̂_y(k)⊤,   (3.1)

where k₀ ≥ 1 is a prespecified integer.

Step 1 (Estimate the numbers of factors.) For 0 ≤ k ≤ k₀, let λ̂_{k,1} ≥ ··· ≥ λ̂_{k,p} ≥ 0 denote the eigenvalues of Σ̂_y(k) Σ̂_y(k)⊤. For a prespecified positive integer J ≤ p, put R̂_0 = 1 and

  R̂_j = { Σ_{k=0}^{k₀} (1 − k/n) λ̂_{k,j} } / { Σ_{k=0}^{k₀} (1 − k/n) λ̂_{k,j+1} },  1 ≤ j ≤ J.   (3.2)

We say that R̂_s attains a local maximum if R̂_s > max{R̂_{s−1}, R̂_{s+1}}. Let R̂_{τ̂₁} and R̂_{τ̂₂} be the two largest local maxima among R̂_1, ···, R̂_{J−1}. The estimators for the numbers of factors are then defined as

  r̂ = min{τ̂₁, τ̂₂},  r̂ + r̂₁ = max{τ̂₁, τ̂₂}.   (3.3)

Step 2 (Estimate the loadings for the common factors.) Let γ̂₁, ···, γ̂_p be the orthonormal eigenvectors of the matrix M̂, arranged according to the descending order of the corresponding eigenvalues. The estimated loading matrix for the common factors is

  Â = (γ̂₁, ···, γ̂_{r̂}).   (3.4)

Step 3 (Estimate the loadings for the cluster-specific factors.) Replace y_t by (I_p − ÂÂ⊤) y_t in (3.1), and repeat the eigenanalysis as in Step 2 above, but now denote the corresponding orthonormal eigenvectors by ζ̂₁, ···, ζ̂_p. The estimated loading matrix for the cluster-specific factors is

  B̂ = (ζ̂₁, ···, ζ̂_{r̂₁}).   (3.5)

Step 4 (Identify the components not belonging to any clusters.) Let b̂₁, ···, b̂_p denote the row vectors of B̂. Then the identified index set for the components of y_t not belonging to any clusters is

  Ĵ_{d+1} = { j : 1 ≤ j ≤ p, ‖b̂_j‖ ≤ ω_p },   (3.6)

where ω_p > 0 satisfies ω_p = o(p^{−1/2}) and (p^δ n^{−1} + p^{−δ}) / (p ω_p²) = o(1).

Step 5 (Clustering with k-means.) Let p̂ = p − |Ĵ_{d+1}|, and let F̂ be the p̂ × r̂₁ matrix obtained from B̂ by removing the rows with their indices in Ĵ_{d+1}. Let f̂₁, ···, f̂_{p̂} denote the p̂ rows of F̂. Let Ŵ be the p̂ × p̂ matrix with the (ℓ, m)-th element

  ρ̂_{ℓ,m} = | f̂_ℓ⊤ f̂_m | / ( f̂_ℓ⊤ f̂_ℓ · f̂_m⊤ f̂_m )^{1/2},  1 ≤ ℓ, m ≤ p̂.

Perform the k-means clustering (with the L₂ distance) on the p̂ rows of Ŵ, leading to the partition of {1, ···, p̂} into the k clusters Ĵ_{k,1}, ···, Ĵ_{k,k}. Put

  MGF(k) = (1/·) Σ_{1≤i ···} ···   (3.7)

The number of latent clusters d is then estimated based on MGF(·); see Sections 5.1 and 5.2 below.
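The five steps translate directly into a short numerical routine. The sketch below is our own illustration (with scikit-learn's KMeans standing in for the k-means step); it follows (3.1)–(3.6) under the stated assumptions, while the truncation point J and the exact form of the MGF criterion (3.7) are not fully recoverable above, so J defaults to an ad hoc choice and the number of clusters d is supplied by the user.

```python
import numpy as np
from sklearn.cluster import KMeans

def lag_cov(Y, k):
    """Sample autocovariance Sigma_y(k) of the n x p data matrix Y, as in (3.1)."""
    Yc = Y - Y.mean(axis=0)
    n = Yc.shape[0]
    return Yc[k:].T @ Yc[:n - k] / (n - k)

def cluster_time_series(Y, d, k0=5, J=None, omega=None):
    n, p = Y.shape
    J = J if J is not None else p // 4          # ad hoc truncation point for {R_j}
    S = [lag_cov(Y, k) for k in range(k0 + 1)]

    # Step 1: lag-aggregated eigenvalues and the ratio statistics (3.2)-(3.3).
    lam = sum((1 - k / n) * np.sort(np.linalg.eigvalsh(Sk @ Sk.T))[::-1]
              for k, Sk in enumerate(S))
    R = lam[:J] / lam[1:J + 1]                  # R[j-1] corresponds to R_j
    peaks = [j for j in range(1, J - 1) if R[j] > max(R[j - 1], R[j + 1])]
    t1, t2 = sorted(sorted(peaks, key=lambda j: R[j])[-2:])  # two largest maxima
    r_hat, r1_hat = t1 + 1, (t2 + 1) - (t1 + 1)

    # Step 2: eigenvectors of M for the strong-factor loadings (3.4).
    M = sum(Sk @ Sk.T for Sk in S)
    A_hat = np.linalg.eigh(M)[1][:, ::-1][:, :r_hat]

    # Step 3: project the strong factors out and repeat for the weak loadings (3.5).
    P = np.eye(p) - A_hat @ A_hat.T
    Mp = sum((P @ Sk @ P) @ (P @ Sk @ P).T for Sk in S)
    B_hat = np.linalg.eigh(Mp)[1][:, ::-1][:, :r1_hat]

    # Step 4: rows of B with near-zero norm belong to no cluster (3.6);
    # {r1/(p ln p)}^(1/2) is the threshold omega_p2 used in Section 5.
    norms = np.linalg.norm(B_hat, axis=1)
    omega = omega if omega is not None else np.sqrt(r1_hat / (p * np.log(p)))
    keep = np.flatnonzero(norms > omega)

    # Step 5: k-means (L2 distance) on the rows of the similarity matrix W.
    F = B_hat[keep]
    W = np.abs(F @ F.T) / np.outer(norms[keep], norms[keep])
    labels = KMeans(n_clusters=d, n_init=10, random_state=0).fit_predict(W)
    return r_hat, r1_hat, keep, labels
```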
Remark 3. (i) Step 1 relies on the fact that the eigenvalues λ_{k,i} (> 0) of the matrix Σ_y(k) Σ_y(k)⊤, where Σ_y(k) = Cov(y_{t+k}, y_t), satisfy the conditions λ_{k,i}^{−1} = o(λ_{k,j}^{−1}) and λ_{k,j}^{−1} = o(λ_{k,ℓ}^{−1}) for 1 ≤ i ≤ r, r < j ≤ r + r₁ and ℓ > r + r₁. This is implied by the differences in strength among the common factors x_t, the cluster-specific factors z_{t,i}, and the idiosyncratic components ε_t; see Assumptions 3 and 4. Note that we use the ratios of the lag-accumulated eigenvalues in (3.2) in order to add together the information from different lags k. In practice we set k₀ to be a small integer such as k₀ ≤ 5,
as significant autocorrelation occurs typically at small lags. The results do not vary much with respect to the value of k₀ (see the simulation results in Section 5.1 below). We truncate the sequence {R̂_j} at J to alleviate the impact of '0/0'-type ratios; in practice we may set J = [p/·].

(ii) Theorem 3 in Section 4 implies that r̂ → r and r̂₁ → r₁ in probability.

(iii) The existing approaches use the ratios of the ordered eigenvalues of the matrix M̂ instead (Lam and Yao 2012, Chang et al.); the resulting estimator r̃ only fulfills the relation P(r̃ ≥ r) → 1. We estimate the weak factor loadings after removing the estimated strong factors, as Lam and Yao (2012) showed that weak factors can be more accurately estimated by removing the strong factors from the data first.

(iv) Once the numbers of factors are correctly specified, the factor loading spaces are relatively easy to identify. In fact M(Â) is a consistent estimator for M(A). However, M(B̂) is a consistent estimator for M{(I_p − AA⊤)(B⊤, 0)⊤} instead of M{(B⊤, 0)⊤}; see Theorem 2 in Section 4 below. Furthermore, the last p_{d+1} rows of (I_p − AA⊤)(B⊤, 0)⊤ are no longer 0. Nevertheless, when both r and r₁ are small in relation to p, those p_{d+1} zero rows can be recovered from B̂ in Step 4; see Theorems 4–5 in Section 4 below.

(v) Given the block-diagonal structure of B in (2.2), the d clusters could be identified easily if we took the (i, j)-th element of BB⊤ as the similarity measure between the i-th and the j-th components, or simply applied the k-means method to the rows of BB⊤; applying the k-means method directly to the rows of B would not do. Theorem 4 indicates that the block-diagonal structure, though masked by asymptotically diminishing 'noise', is still present in B̂ via a latent permutation. Accordingly, the cluster analysis in Step 5 is based on the correlation-type measures among the rows of F̂, which is an estimator of (a permuted) B.

Example 1. Consider a simple model of the form (2.2) in which ε_t ≡ 0, r = 1, r₁ = 2, and

  x_t = p^{1/2} (u_{1,t} + a₁ u_{1,t−1} + u_{2,t} + a₂ u_{2,t−1}),
  z_{1,t} = p^{1/2−δ/2} (u_{1,t} + a₂ u_{1,t−1}),  z_{2,t} = p^{1/2−δ/2} (u_{2,t} + a₃ u_{2,t−1}),

where a₁, a₂, a₃ are constants, and the u_{i,t}, for different i and t, are independent N(0, 1). Let M = Σ_{0≤k≤1} Σ_y(k) Σ_y(k)⊤, and let λ₁ ≥ λ₂ ≥ λ₃ be the three largest eigenvalues of M. It can be shown that λ₁ ≍ p², λ₂ ≍ p^{2−2δ} and λ₃ ≍ p^{2−4δ} provided (a₂ − a₃)(1 − a₂a₃) ≠ 0. Hence λ₁/λ₂ ≍ λ₂/λ₃ ≍ p^{2δ}. This shows that r (= 1) cannot be estimated stably based on the ratios of the eigenvalues of M̂ in this example.
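Combining Example 1 with Theorem 3 below clarifies why Step 1 works with the lag-aggregated ratios R̂_j. Under the eigenvalue orders established in Section 4 (as reconstructed here; a sketch rather than a verbatim statement of the paper),

\[
\widehat R_j \;\asymp_P\;
\begin{cases}
1, & j<r \ \text{or}\ r<j<r+r_1,\\[2pt]
p^{2\delta}, & j=r,\\[2pt]
p^{2-2\delta}\big/(p^{2}/n+\log p), & j=r+r_1,
\end{cases}
\]

so under Assumption 4 the two diverging ratios are the two largest local maxima with probability tending to one, and (3.3) recovers r and r + r₁; for j > r + r₁ the ratios may grow, but only at the rate O_p(log p) of (4.6).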
4 Asymptotic properties

Since only the factor loading space M(A) is uniquely defined by (2.2) (see the discussion below Assumption 2), we measure the estimation error in terms of its (unique) projection matrix AA⊤.

Theorem 1. Let Assumptions 1–4 hold. For Â defined in (3.4), it holds that

  ‖ÂÂ⊤ − AA⊤‖ = O_p(n^{−1/2} + p^{−δ/2}).   (4.1)

Theorem 1 shows that in the absence of the weak factors z_t, the estimation of the strong factor loading space M(A) achieves the root-n convergence rate in spite of the diverging p.

Assumption 2 ensures that the rank of the matrix B∗ ≡ (I_p − AA⊤)(B⊤, 0)⊤ is r₁. Denote by P_{A⊥B} = B∗(B∗⊤B∗)^{−1}B∗⊤ the projection matrix onto M{(I_p − AA⊤)(B⊤, 0)⊤}, of which M(B̂) is a consistent estimator; see Theorem 2 below, and also Remark 3(iv).
Theorem 2. Let Assumptions 1–4 hold. For B̂ defined in (3.5), it holds that

  ‖B̂B̂⊤ − P_{A⊥B}‖ = O_p(p^{δ/2} n^{−1/2} + p^{−δ/2}).   (4.2)

Theorem 3 below specifies the asymptotic behaviour of the ratios of the lag-accumulated eigenvalues used in estimating the numbers of factors in Step 1 in Section 3 above. Note that (log p)² = o(p^{2δ}) and (log p)² = o(p^{2−2δ}/(p²/n + log p)) under Assumption 4; Theorem 3 then implies that r̂ → r and r̂₁ → r₁ in probability, provided that J > r + r₁ is fixed.

Theorem 3. Let Assumptions 1–4 hold. For R̂_j defined in (3.2), it holds for some constant C > 0 that

  lim_{n,p→∞} P(R̂_j < C) = 1 for j = 1, ···, r − 1,   (4.3)
  R̂_r^{−1} = O_p(p^{−2δ}),  R̂_{r+r₁}^{−1} = O_p( (p²/n + log p) / p^{2−2δ} ),   (4.4)
  lim_{n,p→∞} P(R̂_j < C) = 1 for j = r + 1, ···, r + r₁ − 1, and   (4.5)
  R̂_j = O_p(log p) for j = r + r₁ + 1, ···, r + r₁ + s,   (4.6)

where s is a positive fixed integer.

Our goal is to recover the d latent clusters in model (2.2). Unfortunately B̂ provides merely a consistent estimator for the space M{(I_p − AA⊤)(B⊤, 0)⊤}; see Theorem 2 above. Nevertheless it contains sufficient information for identifying the block-diagonal structure of B as well as the components of y_t not belonging to any clusters. See Theorem 4 below.
Theorem 4. Let Assumptions 1–5 hold. There exists a p × r̂₁ matrix B̌ which is a latent row-permutation of the B̂ defined in (3.5). Write B̌B̌⊤ as the sum of two parts:

  B̌B̌⊤ = H_diag + H_err,

where H_diag is a block-diagonal matrix of the same structure as (B⊤, 0)⊤(B⊤, 0), i.e.

  H_diag = diag(H₁, ···, H_d, 0),   (4.7)

while H_err has all the elements in the first d diagonal blocks equal to 0. Then it holds for some constant C > 0 that

  ‖H_i‖_F ≥ C for i = 1, ···, d, and ‖H_err‖_F = O_p(p^{δ/2} n^{−1/2} + p^{−δ/2}).   (4.8)

Theorem 4 shows that the components of y_t not belonging to any clusters correspond to the rows of B̂ with norms converging to 0; hence Step 4 of the algorithm in Section 3. Theorem 5 below indicates that the misclassification rate converges to 0 when p_{d+1} ≍ p. Let J_{d+1} be the collection of the indices of the components of y_t not belonging to any one of the d clusters, and let J^c_{d+1} = {1, ···, p} − J_{d+1} be the complement of J_{d+1}. Theorem 6 provides the underpinning for Step 5 of the algorithm in Section 3.

Theorem 5. Let Assumptions 1–6 hold. For Ĵ_{d+1} defined in (3.6),

  |J^c_{d+1} ∩ Ĵ_{d+1}| / p = O_p(p^δ n^{−1} + p^{−δ}) = o_p(1), and   (4.9)
  |J_{d+1} ∩ Ĵ_{d+1}| / |J_{d+1}| = 1 + O_p( (p^δ n^{−1} + p^{−δ}) / (p ω_p²) ) = 1 + o_p(1),   (4.10)

provided p_{d+1} ≍ p.
Theorem 6. Let Assumptions 1–6 hold. Let J̃₁, ···, J̃_d be a partition of {1, ···, p̂} such that J̃_j contains the indices of the components of y_t belonging to the j-th cluster, j = 1, ···, d. Then

  Σ_{1≤i ···} ···

5 Numerical illustration

5.1 Simulation
In the simulation, the common and cluster-specific factor processes are generated from ARMA models with N(0, 0.25) innovations. All the AR and MA coefficients are drawn randomly from U{(−·, −·) ∪ (·, ·)}. The standard deviations of the components of p^{−1/2} x_t and p^{δ/2−1/2} z_t are drawn randomly from U(1, ·). Hence all the components of x_t are strong factors with δ = 0, and all the components of z_t are weak factors at strength δ ∈ (0, 1). We set n = 800, p = 450, d = 5, r = r_1 = ··· = r_5 = 2 (and, hence, r₁ = 10), p_1 = ··· = p_5 = 50, and k₀ = 5. Therefore, among the 450 component series of y_t, the first (p₀ =) 250 components form 5 clusters of equal size 50, and the last (p_{d+1} =) 200 components do not belong to any clusters. The factor strength of z_t is set at four increasing levels δ₁ < δ₂ < δ₃ < δ₄ = 0.5.
For each setting, we replicate the experiment 1000 times.

The numbers of factors r and r₁ are estimated by the ratio statistics R̂_j as in (3.3) with k₀ = 1, ···, 5 and J = [p/·], or by the ratios of the eigenvalues of M̂ with k₀ = 0, 1, ···, 5, which is the standard method in the literature (see, e.g., Lam and Yao 2012). The relative frequencies of r̂ = r and r̂ + r̂₁ = r + r₁ over the 1000 replications are reported in Table 1.

Table 1: Relative frequencies of r̂ = r and r̂ + r̂₁ = r + r₁ in a simulation with 1000 replications, where r̂ and r̂₁ are estimated by the ratio-based method (3.3) with k₀ = 1, ···, 5, or by the ratios of the eigenvalues of M̂ with k₀ = 0, 1, ···, 5. The columns δ₁ < δ₂ < δ₃ < δ₄ denote the four increasing levels of factor strength.

                        r̂ = r                    r̂ + r̂₁ = r + r₁
method             δ₁    δ₂    δ₃    δ₄       δ₁    δ₂    δ₃    δ₄
R̂_j (k₀ = 1)     .446  .803  .973  .999       1     1     1     1
R̂_j (k₀ = 2)     .476  .813  .970  .999      .997   1     1     1
R̂_j (k₀ = 3)     .477  .811  .970  .997      .998   1     1     1
R̂_j (k₀ = 4)     .470  .804  .965  .995      .995   1     1    .999
R̂_j (k₀ = 5)     .465  .805  .963  .995      .997   1     1    .998
M̂ (k₀ = 0)       .410  .762  .967   1         1     1     1     1
M̂ (k₀ = 1)       .451  .808  .974   1         1    .999  .866  .339
M̂ (k₀ = 2)       .499  .824  .972  .999       1    .998  .918  .367
M̂ (k₀ = 3)       .520  .822  .970  .997       1    .996  .843  .296
M̂ (k₀ = 4)       .529  .815  .966  .995       1    .992  .783  .250
M̂ (k₀ = 5)       .531  .816  .964  .995       1    .990  .730  .193

Overall the method based on the ratios of the lag-accumulated eigenvalues R̂_j provides accurate and robust performance and is not sensitive to the choice of k₀. The estimation based on the eigenvalues of M̂ with k₀ ≥ 1 is comparable for r, but is considerably poorer for r + r₁ at the larger values of δ.
Using M̂ with k₀ = 0 leads to weaker estimates for r when δ is small. The estimation of r improves as δ increases. This is due to the fact that the larger δ is, the larger the difference in factor strength between the common factor x_t and the cluster-specific factor z_t. Therefore it is easier to tell x_t apart from z_t for larger δ. The performance in estimating r + r₁ based on R̂_j is better than that for r because, in terms of factor strength, the difference between (x_t, z_t) and ε_t is significantly greater than that between x_t and (z_t, ε_t).

Recall that P_{A⊥B} is the projection matrix onto the space M{(I_p − AA⊤)(B⊤, 0)⊤}; see Theorem 2 and also Remark 3(iv). Table 2 contains the means and standard deviations of the estimation errors for the factor loading spaces, ‖ÂÂ⊤ − AA⊤‖_F and ‖B̂B̂⊤ − P_{A⊥B}‖_F, where Â is estimated by the eigenvectors of the matrix M̂ in (3.1) with k₀ = 1, ···, 5 (see Step 2 of the algorithm stated in Section 3, and also Step 3 there for the similar procedure for estimating B). For comparison purposes, we also include the estimates obtained with M̂ replaced by the single-lag matrix Σ̂_y(k)Σ̂_y(k)⊤ for k = 0, 1, ···, 5.

Table 2: The means and standard deviations (in parentheses) of ‖ÂÂ⊤ − AA⊤‖_F and ‖B̂B̂⊤ − P_{A⊥B}‖_F in a simulation with 1000 replications, where Â is estimated by the eigenvectors of M̂ in (3.1) (with k₀ = 1, ···, 5) or of Σ̂_y(k)Σ̂_y(k)⊤ (for k = 0, 1, ···, 5), and B̂ is estimated in the similar manner. Both r and r₁ are assumed to be known.

                            ‖ÂÂ⊤ − AA⊤‖_F                                  ‖B̂B̂⊤ − P_{A⊥B}‖_F
method                δ₁          δ₂          δ₃          δ₄          δ₁          δ₂          δ₃          δ₄
M̂ (k₀ = 1)       .375(.320)  .157(.060)  .103(.028)  .081(.017)  .459(.290)  .354(.033)  .440(.029)  .592(.040)
M̂ (k₀ = 2)       .329(.275)  .153(.056)  .105(.027)  .083(.018)  .419(.247)  .354(.031)  .444(.030)  .597(.041)
M̂ (k₀ = 3)       .318(.267)  .154(.056)  .106(.027)  .085(.018)  .410(.239)  .357(.031)  .448(.030)  .602(.042)
M̂ (k₀ = 4)       .315(.264)  .154(.055)  .108(.028)  .086(.018)  .409(.236)  .359(.031)  .452(.031)  .606(.042)
M̂ (k₀ = 5)       .313(.263)  .155(.055)  .109(.028)  .087(.019)  .409(.235)  .362(.031)  .455(.031)  .610(.043)
Σ̂_y(0)Σ̂_y(0)⊤   .474(.390)  .169(.069)  .105(.028)  .079(.017)  .541(.361)  .345(.040)  .418(.027)  .562(.038)
Σ̂_y(1)Σ̂_y(1)⊤   .351(.265)  .201(.077)  .147(.046)  .121(.033)  .635(.200)  .702(.053)  .907(.066)  1.19(.083)
Σ̂_y(2)Σ̂_y(2)⊤   .372(.176)  .295(.133)  .241(.113)  .210(.094)  2.04(.156)  2.25(.134)  2.46(.120)  2.66(.107)
Σ̂_y(3)Σ̂_y(3)⊤   .605(.336)  .489(.278)  .407(.242)  .368(.220)  2.10(.171)  2.29(.146)  2.49(.124)  2.70(.114)
Σ̂_y(4)Σ̂_y(4)⊤   .810(.406)  .666(.349)  .565(.314)  .547(.323)  2.16(.185)  2.33(.150)  2.52(.138)  2.74(.125)
Σ̂_y(5)Σ̂_y(5)⊤   .946(.411)  .786(.371)  .690(.346)  .661(.342)  2.20(.189)  2.36(.157)  2.55(.143)  2.77(.131)

Table 2 shows clearly that the estimation based on M̂ is accurate and robust with respect to the different values of k₀. Furthermore, using a single-lagged covariance matrix for estimating the factor loading spaces is not recommendable. When δ increases, the error ‖ÂÂ⊤ − AA⊤‖_F decreases, as indicated by Theorem 1. However, the pattern in the error ‖B̂B̂⊤ − P_{A⊥B}‖_F is more complex, as it decreases initially and then increases as δ increases, which is in line with the asymptotic result in Theorem 2.
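The non-monotone pattern is exactly what the rate in Theorem 2 predicts; balancing its two terms (a back-of-the-envelope calculation of ours, ignoring constants) gives

\[
p^{\delta/2}n^{-1/2} \;=\; p^{-\delta/2}
\quad\Longleftrightarrow\quad p^{\delta}=n^{1/2}
\quad\Longleftrightarrow\quad \delta^{*}=\frac{\log n}{2\log p},
\]

so the bound p^{δ/2}n^{−1/2} + p^{−δ/2} decreases in δ for δ < δ* (the p^{−δ/2} term dominates) and increases for δ > δ* (the p^{δ/2}n^{−1/2} term takes over), matching the U-shape of ‖B̂B̂⊤ − P_{A⊥B}‖_F in Table 2 up to constants.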
In the sequel, we only report the results with r̂ and r̂₁ estimated by (3.3), and the factor loading spaces estimated by the eigenvectors of M̂ with k₀ = 5.

To examine the effectiveness of Step 4 of the algorithm, we plot in Figure 1 the sample percentiles at the 5%, 50% and 95% levels of each ‖b̂_j‖ over the 1000 replications, for j = 1, ···, p. The norms of the last (p_{d+1} =) 200 components (which do not belong to any clusters) indeed drop flat and are close to 0. This indicates clearly that it is possible to distinguish the components of y_t not belonging to any clusters from those belonging to one of the d clusters. At the two smallest levels of δ, the 95% percentiles of the last 200 norms are clearly greater than 0, though the 50% percentiles are still much smaller than the 5% percentiles of the first 250 (= p₀) norms. Note that the indices of the components not belonging to any clusters are identified as those in Ĵ_{d+1} in (3.6), which is defined in terms of a threshold ω_p = o(p^{−1/2}). We experiment with three choices of this tuning parameter, namely ω_{p1} = (r̂₁/p)^{1/2}/ln p, ω_{p2} = {r̂₁/(p ln p)}^{1/2} and ω_{p3} = {r̂₁/(p ln ln p)}^{1/2}. Recall that J^c_{d+1} contains all the indices of the components of y_t belonging to one of the d clusters. The means and standard deviations of the two types of misclassification errors, E₁ = |J^c_{d+1} ∩ Ĵ_{d+1}| / |J^c_{d+1}| and E₂ = |J_{d+1} ∩ Ĵ^c_{d+1}| / |J_{d+1}|, over the 1000 replications are reported in Table 3. Among the three choices, ω_{p2} appears to work best, as the two types of errors are both small. The increase in the errors due to the estimation of r and r₁ is not significant at the larger levels of δ, but the increase in E₂ due to unknown r and r₁ is noticeable at the smallest level.

Table 3: The means and standard deviations (in parentheses) of the misclassification errors E₁ = |J^c_{d+1} ∩ Ĵ_{d+1}|/|J^c_{d+1}| and E₂ = |J_{d+1} ∩ Ĵ^c_{d+1}|/|J_{d+1}| in a simulation with 1000 replications, with the three choices of the threshold ω_p in (3.6), and the numbers of factors r and r₁ either known or estimated.

                          r and r₁ known                                 r and r₁ estimated
                  δ₁          δ₂          δ₃          δ₄          δ₁          δ₂          δ₃          δ₄
E₁  ω_{p1}   .004(.004)  .005(.004)  .004(.004)  .003(.003)  .009(.036)  .004(.004)  .004(.004)  .003(.003)
    ω_{p2}   .043(.013)  .044(.012)  .043(.012)  .041(.012)  .047(.073)  .041(.015)  .042(.013)  .041(.012)
    ω_{p3}   .156(.021)  .157(.020)  .156(.021)  .156(.020)  .162(.072)  .155(.020)  .156(.021)  .155(.020)
E₂  ω_{p1}   .147(.196)  .060(.062)  .115(.061)  .327(.089)  .369(.339)  .171(.264)  .141(.145)  .329(.093)
    ω_{p2}   .011(.050)  .000(.000)  .000(.000)  .000(.000)  .112(.128)  .042(.094)  .011(.052)  .001(.014)
    ω_{p3}   .000(.000)  .000(.000)  .000(.000)  .000(.000)  .001(.031)  .000(.000)  .000(.000)  .000(.000)

In the sequel, we only report the results with ω_p = {r̂₁/(p ln p)}^{1/2}.

The number of clusters is estimated based on MGF in (3.7). Figure 2 presents the boxplots of MGF(k) for k = 2, ···, 10.
We calculated MGF(·) with (r, r₁) either known or estimated by (r̂, r̂₁). In either case, the values of MGF(k) increase sharply from k = 5 to k = 6, and keep increasing for k > 6.
Hence we set d̂ = 5. The d̂ clusters are then obtained by performing the k-means clustering (with k = d̂) on the p̂ rows of Ŵ, where p̂ = p − |Ĵ_{d+1}|; see Step 5 of the algorithm in Section 3. As the error rates in estimating J^c_{d+1} have already been reported in Table 3, we now concentrate on the components of y_t with indices in Ĵ^c_{d+1} ∩ J^c_{d+1}, and count the number of them which were misplaced by the k-means clustering. Let S denote the number of misplaced components. Both the means and the standard deviations of the error rates S/|Ĵ^c_{d+1} ∩ J^c_{d+1}| over the 1000 replications are reported in Table 4. It shows clearly that the k-means clustering identifies the latent clusters very accurately, and the difference in performance due to estimating (r, r₁) is also small.

Table 4: The means and standard deviations (STD) of the error rates S/|Ĵ^c_{d+1} ∩ J^c_{d+1}| in a simulation with 1000 replications, with the numbers of factors r and r₁ either known or estimated.

More simulation results are collected in an online supplementary file, exhibiting similar patterns to those reported above under two different settings (i.e. n = 400, p = 300, d = 4, and n = 200, p = 240, d = 6).

Figure 1: Sample percentiles of ‖b̂_j‖ at the levels of 5% (blue), 50% (black) and 95% (red), plotted against j. The four columns from left to right correspond to the four increasing levels of δ; the top row is for (r, r₁) known and the bottom row for (r, r₁) unknown.

Figure 2: Boxplots of MGF(k). The four columns from left to right correspond to the four increasing levels of δ; the top row is for (r, r₁) known and the bottom row for (r, r₁) unknown.

5.2 A real data example

We consider the daily returns of the stocks listed in the S&P 500 during 31 December 2014 – 31 December 2019. After removing those which were not traded on every trading day during this period, there are p = 477 stocks which were traded on all n = 1259 trading days. Those stocks are from 11 industry sectors.
Figure 3: Plot of R̂_j against j for 2 ≤ j ≤ 20.

We compute R̂_j as in (3.2) with k₀ = 5. It turns out that R̂₁ = 32.
53 is much larger than all the others, while the R̂_j for j ≥ 2 are plotted in Figure 3; the two largest local maxima give r̂ = 1 and r̂ + r̂₁ = 4. Note that the estimators of r̂ and r̂ + r̂₁ are unchanged with k₀ = 1, ···, 4.
While the existence of r̂ = 1 strong common factor is reasonable, it is most unlikely that there are merely r̂₁ = 3 cluster-specific weak factors. Note that the estimators in (3.3) are derived under the assumption that all the r₁ cluster-specific (i.e. weak) factors are of the same factor strength; see Remark 1(ii) in Section 2 above. In practice weak factors may have different degrees of strength, implying that we should also take into account the 3rd and the 4th largest local maxima of R̂_j. Hence we take r̂ + r̂₁ = 10 (or perhaps also 13), as Figure 3 suggests that there are 3 factors with factor strength δ₁ > 0,
and a further 6 factors with strength δ ∈ (δ₁, 1). With r̂ = 1 and r̂₁ = 9, we proceed to Steps 2 and 3 of Section 3 and obtain the estimator
Figure 4: MGF with different numbers of clusters when r̂ = 1 and r̂₁ = 9.

B̂ as in (3.5). Setting ω_p = {r̂₁/(p ln p)}^{1/2}, we obtain |Ĵ_{d+1}| = 12, i.e. 12 stocks do not appear to belong to any clusters, where Ĵ_{d+1} is defined as in (3.6) in Step 4. Leaving those 12 stocks out, we perform Step 5, i.e. the k-means clustering on the p̂ = 477 −
12 = 465 rows of the matrix Ŵ. The resulting MGF(·) is plotted in Figure 4. As MGF(9) is substantially greater than MGF(k) for k < 9,
and MGF(k) keeps increasing for k > 9, we take d̂ = 9 as the number of latent clusters.

To present the identified d̂ clusters, we define an 11 × d matrix with n_ij/n_i as its (i, j)-th element, where n_i is the number of the stocks in the i-th industry sector, and n_ij is the number of the stocks in the i-th industry sector which are allocated to the j-th cluster. Thus n_ij/n_i ∈ [0, 1] and Σ_j n_ij/n_i = 1. The heatmap of this 11 × d matrix for d = d̂ = 9 is presented in Figure 5. The first cluster mainly consists of the companies in Consumer Staples, Real Estate and Utilities; Clusters 2 and 3 contain the companies in, respectively, Financials and Health Care; Cluster 4 contains mainly some companies in Communication Services and Information Technology; Cluster 5 consists of the companies in Industrials and Materials; Cluster 6 comprises mainly the companies in Consumer Discretionary; Cluster 7 is a mixture of small numbers of companies from each of 5 or 6 different sectors; Cluster 8 is mainly the companies from Information Technology; and Cluster 9 contains almost all companies in Energy. To examine how stable the clustering is, we also include the results for d = 11 and d = 3 in Figure 5. When d is increased from 9 to 11, the original Cluster 1 is divided into new Clusters 1 and 11, with the former consisting of the Consumer Staples and Utilities sectors and the latter being the Real Estate sector. Furthermore, the original Cluster 7 splits into new Clusters 7 and 10, while the other 7 original clusters are hardly changed. With d = 3, most companies in each of the 11 sectors stay in one cluster.

Figure 5: Heatmaps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over d clusters (corresponding to the d columns), with d = 9, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, r̂ = 1 and r̂₁ = 9.

Figure 6: Heatmaps of the distributions of the stocks in each of the 11 industry sectors (corresponding to the 11 rows) over d clusters (corresponding to the d columns), with d = 9, 11 and 3. The estimated numbers of the common and cluster-specific factors are, respectively, r̂ = 1 and r̂₁ = 12.
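For completeness, here is a small sketch (ours) of how the sector-by-cluster proportion matrix behind Figures 5–6 can be computed, assuming hypothetical arrays `sectors` and `labels` holding the industry sector and the estimated cluster of each remaining stock:

```python
import pandas as pd

def sector_cluster_profile(sectors, labels):
    """Row-normalised contingency table with n_ij / n_i as its (i, j)-th entry."""
    tab = pd.crosstab(pd.Series(sectors, name="sector"),
                      pd.Series(labels, name="cluster"))
    return tab.div(tab.sum(axis=1), axis=0)   # each row sums to 1

# profile = sector_cluster_profile(sectors, labels)
# profile can then be rendered as a heatmap, e.g. via matplotlib's imshow.
```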
If we take r̂ = 1 and r̂₁ = 12, the estimated Ĵ_{d+1} is unchanged. The clustering results with d = 9, 11 and 3 are presented in Figure 6. Comparing with Figure 5, there are some striking similarities. First, the clusterings with d = 3 are almost identical. For d = 9, the profiles of Clusters 2, ···, 6, 8 and 9 are not significantly changed, while Clusters 1 and 7 in Figure 5 are somehow mixed together in Figure 6. With d = 11, the profiles of Clusters 2–6 and 8–10 in the two figures are about the same, while Clusters 7 and 11 are mixed up across the two figures.

The analysis above indicates that the companies in the same industry sector tend to share a similar dynamic structure in the sense that they are driven by the same cluster-specific factors. Our analysis is reasonably stable, as most of the clusters do not change substantially when the number of the weak factors increases from r̂₁ = 9 to r̂₁ = 12.

6 Technical proofs

Write the singular value decomposition of B∗ as

  B∗ = (I_p − AA⊤)(B⊤, 0)⊤ = B̃ Λ_{B̃} V_{B̃}⊤,   (6.1)

where B̃ is a p × r₁ matrix consisting of the left singular vectors and V_{B̃} is an r₁ × r₁ matrix consisting of the right singular vectors, such that B̃⊤B̃ = V_{B̃}⊤V_{B̃} = I_{r₁}. Note that B̃B̃⊤ = P_{A⊥B}.

First we give a lemma about B∗. It ensures that there is a positive constant q₁ < 1 such that ‖Λ_{B̃}‖_min ≥ 1 − q₁.
Lemma 1. Under Assumption 2, rank(B∗) = r₁ and ‖B∗‖_min ≥ 1 − q₁.

Proof of Lemma 1. B⊤B = I_{r₁} implies rank(B∗) ≤ r₁. Moreover, from Weyl's inequality for singular values, the r₁-th largest singular value of B∗, i.e. its smallest non-zero singular value, satisfies

  ‖B∗‖_min ≥ ‖B‖_min − ‖AA⊤(B⊤, 0)⊤‖ = 1 − ‖AA⊤(B⊤, 0)⊤‖ ≥ 1 − q₁.
Definition 6.1. Let

  ẏ_t = A x_t + (B⊤, 0)⊤ z_t,   (6.2)
  Σ̇_y(k) = Cov(ẏ_{t+k}, ẏ_t),   (6.3)

and

  Ṁ = Σ_{k=0}^{k₀} Σ̇_y(k) Σ̇_y(k)⊤ = Ȧ Λ̇_A Ȧ⊤ + Ḃ Λ̇_B Ḃ⊤,   (6.4)

where Ȧ is the p × r matrix consisting of the eigenvectors corresponding to the r largest eigenvalues of Ṁ, and Ḃ is the p × r₁ matrix consisting of the eigenvectors corresponding to the other (non-trivial) eigenvalues of Ṁ.

Let z̃_t = Λ_{B̃} V_{B̃}⊤ z_t and x̃_t = x_t + A⊤(B⊤, 0)⊤ z_t. Then model (2.2) can be equivalently rewritten as

  y_t = A x̃_t + B̃ z̃_t + W e_t.   (6.5)

Note that (6.1) ensures that

  A⊤ B̃ = 0.   (6.6)

We now show that x̃_t and z̃_t have the same properties as x_t and z_t.

Definition 6.2. Σ̃_x(k) = Cov(x̃_{t+k}, x̃_t), Σ̃_z(k) = Cov(z̃_{t+k}, z̃_t), Σ̃_{x,z}(k) = Cov(x̃_{t+k}, z̃_t), Σ̃_{z,x}(k) = Cov(z̃_{t+k}, x̃_t).
Lemma 2. Under Assumptions 2 and 3,

  ‖Σ̃_x(k)‖ ≍ p ≍ ‖Σ̃_x(k)‖_min,   (6.7)
  ‖Σ̃_z(k)‖ ≍ p^{1−δ} ≍ ‖Σ̃_z(k)‖_min,   (6.8)
  ‖Σ̃_x(0)^{−1/2} Σ̃_{x,z}(0) Σ̃_z(0)^{−1/2}‖ ≤ q₂ < 1,  ‖Σ̃_z(0)^{−1/2} Σ̃_{z,x}(0) Σ̃_x(0)^{−1/2}‖ ≤ q₂ < 1,   (6.9)
  ‖Σ̃_{x,z}(k)‖ = O(p^{1−δ/2}),  ‖Σ̃_{z,x}(k)‖ = O(p^{1−δ/2}),   (6.10)

and

  Cov(x̃_t, e_s) = 0,  Cov(z̃_t, e_s) = 0.   (6.11)
Proof of Lemma 2. From (2.5) and (2.7),

  Σ̃_x(k) = Cov( x_{t+k} + A⊤(B⊤, 0)⊤ z_{t+k}, x_t + A⊤(B⊤, 0)⊤ z_t )
         = Σ_x(k) + A⊤(B⊤, 0)⊤ Σ_z(k) (B⊤, 0) A + Σ_{x,z}(k) (B⊤, 0) A + A⊤(B⊤, 0)⊤ Σ_{z,x}(k)
         = Σ_x(k) + o(p),

which, together with (2.4), concludes (6.7). Similarly,

  Σ̃_z(k) = Cov( Λ_{B̃} V_{B̃}⊤ z_{t+k}, Λ_{B̃} V_{B̃}⊤ z_t ) = Λ_{B̃} V_{B̃}⊤ Σ_z(k) V_{B̃} Λ_{B̃}.

This, together with Lemma 1 and (2.5), concludes (6.8). For Σ̃_{x,z}(k), one has

  Σ̃_{x,z}(k) = Cov( x_{t+k} + A⊤(B⊤, 0)⊤ z_{t+k}, Λ_{B̃} V_{B̃}⊤ z_t )
            = A⊤(B⊤, 0)⊤ Σ_z(k) V_{B̃} Λ_{B̃} + Σ_{x,z}(k) V_{B̃} Λ_{B̃}
            = Σ_{x,z}(k) V_{B̃} Λ_{B̃} + O(p^{1−δ}).

This implies (6.9) and (6.10); (6.11) is obvious.

We now give the relation between Ȧ and A.
Lemma 3. Under Assumptions 1–3,

  ‖ȦȦ⊤ − AA⊤‖ = O(p^{−δ/2})   (6.12)

and

  ‖A⊤Ḃ‖ = O(p^{−δ/2}).   (6.13)

Moreover, the orders of magnitude of ‖ȦȦ⊤ − AA⊤‖ and ‖A⊤Ḃ‖ are determined by

  p^{−2} ‖ Σ_{k=0}^{k₀} Σ̃_x(k) Σ̃_{z,x}(k)⊤ + Σ_{k=0}^{k₀} Σ̃_{x,z}(k) Σ̃_z(k)⊤ ‖,

and ‖ȦȦ⊤ − AA⊤‖ = ‖A⊤Ḃ‖ = 0 if and only if

  Σ_{k=0}^{k₀} Σ̃_x(k) Σ̃_{z,x}(k)⊤ + Σ_{k=0}^{k₀} Σ̃_{x,z}(k) Σ̃_z(k)⊤ = 0.   (6.14)
Proof of Lemma 3. From (6.4) we see that Ȧ and Ḃ are the eigenvector matrices corresponding to different eigenvalues, so that Ȧ⊤ Ṁ Ḃ = 0. Recalling (6.2) and (6.5), we have

  ẏ_t = A x̃_t + B̃ z̃_t = A x_t + (B⊤, 0)⊤ z_t.   (6.15)

Hence we can further expand Ȧ⊤ Ṁ Ḃ as

  0 = Ȧ⊤ { Σ_{k=0}^{k₀} Σ̇_y(k) Σ̇_y(k)⊤ } Ḃ
    = Ȧ⊤A { Σ_k Σ̃_x(k)Σ̃_x(k)⊤ } A⊤Ḃ + Ȧ⊤A { Σ_k Σ̃_x(k)Σ̃_{z,x}(k)⊤ } B̃⊤Ḃ
    + Ȧ⊤A { Σ_k Σ̃_{x,z}(k)Σ̃_{x,z}(k)⊤ } A⊤Ḃ + Ȧ⊤A { Σ_k Σ̃_{x,z}(k)Σ̃_z(k)⊤ } B̃⊤Ḃ
    + Ȧ⊤B̃ { Σ_k Σ̃_{z,x}(k)Σ̃_x(k)⊤ } A⊤Ḃ + Ȧ⊤B̃ { Σ_k Σ̃_{z,x}(k)Σ̃_{z,x}(k)⊤ } B̃⊤Ḃ
    + Ȧ⊤B̃ { Σ_k Σ̃_z(k)Σ̃_{x,z}(k)⊤ } A⊤Ḃ + Ȧ⊤B̃ { Σ_k Σ̃_z(k)Σ̃_z(k)⊤ } B̃⊤Ḃ.

This, together with (6.7)–(6.10), implies

  ‖ Ȧ⊤A { Σ_k Σ̃_x(k)Σ̃_x(k)⊤ } A⊤Ḃ ‖ = O(p^{2−δ/2}).

Moreover, (6.7) implies that ‖Σ̃_x(k)Σ̃_x(k)⊤‖_min ≍ p² ≍ ‖Σ̃_x(k)Σ̃_x(k)⊤‖, which further yields ‖Σ_k Σ̃_x(k)Σ̃_x(k)⊤‖_min ≍ p². So we conclude that (6.12)–(6.13) are true. Moreover, if ‖ȦȦ⊤ − AA⊤‖ = ‖A⊤Ḃ‖ = 0, then

  Ȧ⊤A { Σ_k Σ̃_x(k)Σ̃_{z,x}(k)⊤ + Σ_k Σ̃_{x,z}(k)Σ̃_z(k)⊤ } B̃⊤Ḃ = 0,

and hence Σ_k Σ̃_x(k)Σ̃_{z,x}(k)⊤ + Σ_k Σ̃_{x,z}(k)Σ̃_z(k)⊤ = 0. Conversely, if (6.14) holds, then A⊤ Ṁ B̃ = 0, and the smallest eigenvalue of A⊤ṀA is of larger order than the largest eigenvalue of B̃⊤ṀB̃, so that ‖ȦȦ⊤ − AA⊤‖ = ‖A⊤Ḃ‖ = 0.

We next deal with the idiosyncratic series e_t. Let

  Σ̄_e(k) = (n − k)^{−1} Σ_{t=1}^{n−k} e_{t+k} e_t⊤.
Lemma 4. Under Assumption 4,

  ‖Σ̄_e(0)‖ = O_p(p/n + log p) = o_p(p^{1−δ}).

Lemma 4 implies that the order of ‖Σ̄_e(0)‖ is smaller than p^{1−δ}.
Proof of Lemma 4. Let Σ_{e,k} be the n × n matrix whose (i, j)-th element is E(e_{t+i,k} e_{t+j,k}). Define m = E max_{1≤k≤p} Σ_{t=1}^n e_{t,k}². From Theorem 5.48 and Remark 5.49 of Vershynin (2010) we conclude that

  ( E ‖Σ̄_e(0)‖ )^{1/2} ≤ ‖ p^{−1} Σ_{k=1}^p Σ_{e,k} ‖^{1/2} √(p/n) + C √( m log n / n ),

where C is an absolute constant. Recalling (2.9), we have lim_{n,p→∞} ‖p^{−1} Σ_{k=1}^p Σ_{e,k}‖ < C. So we only need to prove m = O(n + p log n). From (2.10),

  E max_{1≤k≤p} Σ_{t=1}^n e_{t,k}² ≤ n max_{1≤k≤p} E e_{t,k}² + E max_{1≤k≤p} ( Σ_{t=1}^n e_{t,k}² − n E e_{t,k}² )
    < Cn + E max_{1≤k≤p} ( Σ_{t=1}^n e_{t,k}² − n E e_{t,k}² ).

Moreover,

  E max_{1≤k≤p} ( Σ_t e_{t,k}² − nE e_{t,k}² )
    ≤ n + Σ_{k=1}^p E{ | Σ_t e_{t,k}² − nE e_{t,k}² | · 1( Σ_t e_{t,k}² − nE e_{t,k}² > n ) }
    ≤ n + n^{−1} Σ_{k=1}^p E{ ( Σ_t e_{t,k}² − nE e_{t,k}² )² · 1( Σ_t e_{t,k}² − nE e_{t,k}² > n ) }.

This, together with (2.11), implies that E max_{1≤k≤p} ( Σ_t e_{t,k}² − nE e_{t,k}² ) = O(n + p log n).
Definition 6.3. Let y̌_t = ẏ_t − n^{−1} Σ_{s=1}^n ẏ_s and ê_t = W e_t − n^{−1} Σ_{s=1}^n W e_s. We further define

  Σ̌_y(k) = (n − k)^{−1} Σ_{t=1}^{n−k} y̌_{t+k} y̌_t⊤,  Σ̂_e(k) = (n − k)^{−1} Σ_{t=1}^{n−k} ê_{t+k} ê_t⊤,
  Σ̌_{y,e}(k) = (n − k)^{−1} Σ_{t=1}^{n−k} y̌_{t+k} ê_t⊤,  Σ̌_{e,y}(k) = (n − k)^{−1} Σ_{t=1}^{n−k} ê_{t+k} y̌_t⊤,
  M̌ = Σ_{k=0}^{k₀} Σ̌_y(k) Σ̌_y(k)⊤.   (6.16)
Proof of Theorem 1. If we can prove

  ‖ÂÂ⊤ − ȦȦ⊤‖ = O_p(n^{−1/2}),   (6.17)

then (4.1) follows from (6.17) and Lemma 3. To prove (6.17), it suffices to show that the difference between the r-th largest eigenvalue of Ṁ and the (r+1)-th largest eigenvalue of M̂ is larger than cp² (c a positive constant), and that ‖M̂ − Ṁ‖ = O_p(p² n^{−1/2}).

Note that M̌ is the sample version of Ṁ and ẏ_t is stationary. So ‖M̌ − Ṁ‖ = O_p(p² n^{−1/2}) = o_p(p²), and

  Σ̂_y(k) = Σ̌_y(k) + Σ̂_e(k) + Σ̌_{y,e}(k) + Σ̌_{e,y}(k).

From Lemmas 2 and 4 we conclude that ‖Σ̂_e(k) + Σ̌_{y,e}(k) + Σ̌_{e,y}(k)‖ = o_p(p). This implies that ‖M̌ − M̂‖ = o_p(p²) and ‖Ṁ − M̂‖ = o_p(p²). Now (6.7)–(6.10) imply that the r-th largest eigenvalue of Ṁ is of order p² while the (r+1)-th largest eigenvalue of Ṁ is o(p²). Hence the difference between the r-th largest eigenvalue of Ṁ and the (r+1)-th largest eigenvalue of M̂ is larger than cp².

Now we consider ‖M̂ − Ṁ‖. Since ‖M̌ − Ṁ‖ = O_p(p² n^{−1/2}), we only need to prove ‖M̌ − M̂‖ = O_p(p² n^{−1/2}). Write

  M̂ − M̌ = Σ_{k=0}^{k₀} Σ̌_y(k) [ Σ̂_e(k)⊤ + Σ̌_{y,e}(k)⊤ + Σ̌_{e,y}(k)⊤ ]
         + Σ_{k=0}^{k₀} [ Σ̂_e(k) + Σ̌_{y,e}(k) + Σ̌_{e,y}(k) ] Σ̌_y(k)⊤
         + O( Σ_{k=0}^{k₀} [ ‖Σ̌_{e,y}(k)‖² + ‖Σ̌_{y,e}(k)‖² + ‖Σ̂_e(k)‖² ] ).   (6.18)

Lemma 4 implies ‖Σ̂_e(k)‖ = O_p(p n^{−1/2}), and (6.11) ensures ‖Σ̌_{y,e}(k)‖ = O_p(p n^{−1/2}) and ‖Σ̌_{e,y}(k)‖ = O_p(p n^{−1/2}). Hence the proof is completed.

Now we consider the estimator of P_{A⊥B} in Step 3 and Theorem 2. Let

  P_{Â}⊥ = I − ÂÂ⊤.   (6.19)

Then B̂ is obtained from the eigenanalysis of Σ_{k=0}^{k₀} P_{Â}⊥ Σ̂_y(k) P_{Â}⊥ Σ̂_y(k)⊤ P_{Â}⊥.
Definition 6.4. Set

  M̂₁ = Σ_{k=0}^{k₀} P_{Â}⊥ Σ̂_y(k) P_{Â}⊥ Σ̂_y(k)⊤ P_{Â}⊥,  Ṁ₁ = Σ_{k=0}^{k₀} P_{Â}⊥ Σ̇_y(k) P_{Â}⊥ Σ̇_y(k)⊤ P_{Â}⊥.   (6.20)
Lemma 5. Under Assumptions 1–4, there exist two constants c₁ and c₂ such that

  lim_{n,p→∞} P( c₁ ≤ ‖P_{Â}⊥ Σ̇_y(k) P_{Â}⊥‖_min / p^{1−δ} ≤ ‖P_{Â}⊥ Σ̇_y(k) P_{Â}⊥‖ / p^{1−δ} ≤ c₂ ) = 1   (6.21)

and

  ‖Ḃ₁Ḃ₁⊤ − ḂḂ⊤‖ = O_p(p^{δ/2} n^{−1/2}),   (6.22)

where Ḃ₁ is the p × r₁ matrix consisting of the eigenvectors corresponding to the first r₁ largest eigenvalues of Ṁ₁.

Proof of Lemma 5. Recalling the definitions of Ȧ and Ḃ, we have

  (AA⊤ − ȦȦ⊤) A = (I − ȦȦ⊤) A = ḂḂ⊤A and (I − ȦȦ⊤) B̃ = ḂḂ⊤B̃.
We can find the order of P_{Â}⊥ ẏ_t as follows. Via (6.15) and (6.19), write

  P_{Â}⊥ A x̃_t + P_{Â}⊥ B̃ z̃_t
    = (ȦȦ⊤ − ÂÂ⊤) A x̃_t + (I − ȦȦ⊤) A x̃_t + (I − ȦȦ⊤) B̃ z̃_t + (ȦȦ⊤ − ÂÂ⊤) B̃ z̃_t
    = ḂḂ⊤ (A x̃_t + B̃ z̃_t) + (ȦȦ⊤ − ÂÂ⊤)(A x̃_t + B̃ z̃_t) =: Π₁ + Π₂,

where Π₁ = ḂḂ⊤(A x̃_t + B̃ z̃_t) and Π₂ = (ȦȦ⊤ − ÂÂ⊤)(A x̃_t + B̃ z̃_t). Now (6.17) implies ‖Π₂‖ = O_p(p^{1/2} n^{−1/2}) = o_p(p^{1/2−δ/2}). This, together with (6.7)–(6.10), implies (6.21). Moreover,

  P_{Â}⊥ ẏ_t = ḂḂ⊤(A x̃_t + B̃ z̃_t) + O_p(p^{1/2} n^{−1/2}) = ḂḂ⊤(A x̃_t + B̃ z̃_t) + o_p(p^{1/2−δ/2}).

This, together with (6.21), implies (6.22).
Proof of Theorem 2.
If we can prove

  ‖B̂B̂⊤ − Ḃ₁Ḃ₁⊤‖ = O_p(p^{δ/2} n^{−1/2}),   (6.23)

then (4.2) follows from (6.23), (6.22) and Lemma 3. Moreover, (6.22) shows that Ḃ₁, the eigenvector matrix corresponding to the first r₁ largest eigenvalues of Ṁ₁, is close enough to Ḃ. It then suffices to prove (6.23). To this end, the aim is to show that the difference between the r₁-th largest eigenvalue of Ṁ₁ and the (r₁+1)-th largest eigenvalue of M̂₁ is larger than cp^{2−2δ}, and that ‖M̂₁ − Ṁ₁‖ = O_p(p^{2−3δ/2} n^{−1/2}).

From Lemma 4 and Assumption 4, we have ‖Σ̂_e(k)‖ = o_p(p^{1−δ}). This, together with Lemma 5, implies that the difference between the r₁-th largest eigenvalue of Ṁ₁ and the (r₁+1)-th largest eigenvalue of M̂₁ is larger than cp^{2−2δ} with probability tending to 1 as n → ∞.

Now we consider M̂₁ − Ṁ₁. We still use (6.18), but replace M̂ and M̌ by M̂₁ and Ṁ₁ + O_p(p^{2−2δ} n^{−1/2}) respectively, and replace Σ̌_y(k), Σ̌_{y,e}(k), Σ̌_{e,y}(k) and Σ̂_e(k) by P_{Â}⊥Σ̌_y(k)P_{Â}⊥, P_{Â}⊥Σ̌_{y,e}(k)P_{Â}⊥, P_{Â}⊥Σ̌_{e,y}(k)P_{Â}⊥ and P_{Â}⊥Σ̂_e(k)P_{Â}⊥ respectively. This, together with Lemmas 4–5, ensures that ‖M̂₁ − Ṁ₁‖ = O_p(p^{2−3δ/2} n^{−1/2}), which implies (6.23).

6.3 Proof of Theorem 3
It suffices to prove the following version of Theorem 3.
Lemma 6.
Under Assumptions 1–4, let λ̂_{k,i} be the i-th largest eigenvalue of Σ̂_y(k)Σ̂_y(k)⊤. Then there exists a positive constant C such that

  lim_{n,p→∞} P( λ̂_{k,i} / λ̂_{k,i+1} ≤ C ) = 1 for 1 ≤ i ≤ r − 1,   (6.24)
  λ̂_{k,r+1} / λ̂_{k,r} = O_p(p^{−2δ}),  λ̂_{k,r+1+r₁} / λ̂_{k,r+r₁} = O_p( (p²/n + log p) / p^{2−2δ} ),   (6.25)
  lim_{n,p→∞} P( λ̂_{k,i} / λ̂_{k,i+1} ≤ C ) = 1 for r + 1 ≤ i ≤ r + r₁ − 1,   (6.26)

and

  λ̂_{k,i} / λ̂_{k,i+1} = O_p(log p) for r + r₁ + 1 ≤ i ≤ r + r₁ + s,   (6.27)

where s is a positive fixed integer.

We begin with two estimators of P_{A⊥B}.
Definition 6.5. Let λ̂_{k,i} be the i-th largest eigenvalue of Σ̂_y(k)Σ̂_y(k)⊤. Write the eigenvalue–eigenvector decomposition of Σ̂_y(k)Σ̂_y(k)⊤ as

  Σ̂_y(k)Σ̂_y(k)⊤ = Â(k) Λ̌_x(k) Â(k)⊤ + B̂(k) Λ̌_z(k) B̂(k)⊤ + Ĉ(k) Λ̌_e(k) Ĉ(k)⊤,   (6.28)

where Λ̌_x(k) = diag{λ̂_{k,1}, ···, λ̂_{k,r}}, Λ̌_z(k) = diag{λ̂_{k,r+1}, ···, λ̂_{k,r+r₁}}, Λ̌_e(k) = diag{λ̂_{k,r+r₁+1}, ···, λ̂_{k,p}}, and Â(k)⊤Â(k) = I_r, B̂(k)⊤B̂(k) = I_{r₁}, Ĉ(k)⊤Ĉ(k) = I_{p−r−r₁}. Then Â(k) is the estimator of A and B̂(k) is the estimator of B̃ in the one-step method. We call this method "one-step" as we get Â(k) and B̂(k) from the same eigen-decomposition.
Definition 6.6. Let P_{Â(k)}⊥ = I_p − Â(k)Â(k)⊤ and write the eigenvalue–eigenvector decomposition of P_{Â(k)}⊥ Σ̂_y(k) P_{Â(k)}⊥ Σ̂_y(k)⊤ P_{Â(k)}⊥ as

  P_{Â(k)}⊥ Σ̂_y(k) P_{Â(k)}⊥ Σ̂_y(k)⊤ P_{Â(k)}⊥ = B̂₂(k) Λ̂_z(k) B̂₂(k)⊤ + Ĉ₂(k) Λ̂_e(k) Ĉ₂(k)⊤.   (6.29)

Then B̂₂(k) is the estimator of B̃ in the two-step method. We call this method "two-step" as we get Â(k) and B̂₂(k) from two different eigen-decompositions. The next lemma compares the two estimators based on Σ̂_y(k)Σ̂_y(k)⊤.
Lemma 7. Under Assumptions 1–4, one has

  ‖B̂₂(k)B̂₂(k)⊤ − P_{A⊥B}‖ = O_p(p^{δ/2} n^{−1/2} + p^{−δ/2}),   (6.30)
  ‖B̂(k)B̂(k)⊤ − B̂₂(k)B̂₂(k)⊤‖ = O_p(p^{−δ/2} n^{−1/2}),   (6.31)
  ‖Λ̌_z(k) − Λ̂_z(k)‖ = o_p(p^{2−2δ}),   (6.32)

and

  ‖Λ̌_e(k) − Λ̂_e(k)‖ = o_p(p^{2−2δ}).   (6.33)

Proof of Lemma 7.
By (6.16), (3.1) and (6.15) we have b Σ y ( k ) b Σ y ( k ) ⊤ − ˇΣ y ( k ) ˇΣ y ( k ) ⊤ = ˇΣ y ( k )( b Σ e ( k ) ⊤ + ˇΣ y,e ( k ) ⊤ + ˇΣ e,y ( k ) ⊤ )+ ( b Σ e ( k ) + ˇΣ y,e ( k ) + ˇΣ e,y ( k )) ˇΣ y ( k ) ⊤ + O ( k ˇΣ e,y ( k ) k + k ˇΣ y,e ( k ) k + k b Σ e ( k ) k ) . From the proof of Theorems 1-2, it’s not hard to obtain the property of b A ( k ) and b B ( k ) as follows: k b A ( k ) b A ( k ) ⊤ − AA ⊤ k = O p ( n − / + p − δ/ ) , k b B ( k ) b B ( k ) ⊤ − P A ⊥ B k = O p ( p δ/ n − / + p − δ/ ) k b Λ z ( k ) k ≍ p − δ ≍ k b Λ z ( k ) k min , (6.34) k b Λ e ( k ) k = O p ( p n − + log p ) . (6.35)So we only need to prove (6.31)-(6.33). From (6.28) and (6.29) one can see that I p = b A ( k ) b A ( k ) ⊤ + b B ( k ) b B ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ and I p = b A ( k ) b A ( k ) ⊤ + b B ( k ) b B ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ . It follows that b B ( k ) b B ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ = b B ( k ) b B ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ . b B ( k ) and b B ( k ). From (6.28) and (6.29) we concludethat b B ( k ) ˇΛ z ( k ) b B ( k ) ⊤ + b C ( k ) ˇΛ e ( k ) b C ( k ) ⊤ = P c b A k b Σ y ( k ) b Σ y ( k ) ⊤ P c b A k . Moreover, P c b A k = I p − b A ( k ) b A ( k ) ⊤ = b B ( k ) b B ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ . So P c b A k b Σ y ( k ) b Σ y ( k ) ⊤ P c b A k = P c b A k b Σ y ( k ) P c b A k b Σ y ( k ) ⊤ P c b A k + P c b A k b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ P c b A k = b B ( k ) b Λ z ( k ) b B ( k ) ⊤ + b C ( k ) b Λ e ( k ) b C ( k ) ⊤ + b B ( k ) b B ( k ) ⊤ b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ b B ( k ) b B ( k ) ⊤ + b B ( k ) b B ( k ) ⊤ b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ b C ( k ) b C ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ b B ( k ) b B ( k ) ⊤ + b C ( k ) b C ( k ) ⊤ b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ b C ( k ) b C ( k ) ⊤ . It then suffices to get the order of b B ( k ) ⊤ b Σ y ( k ) b A ( k ) and b C ( k ) ⊤ b Σ y ( k ) b A ( k ). If we can show k b B ( k ) ⊤ b Σ y ( k ) b A ( k ) k = o p ( p − δ ) and k b C ( k ) ⊤ b Σ y ( k ) b A ( k ) k = o p ( p − δ ), then (6.32)-(6.33) follow.Note that k b A ( k ) b Σ y ( k ) ⊤ b A ( k ) ⊤ k min ≍ p. We study the order of b B ( k ) ⊤ b Σ y ( k ) b A ( k ) as follows based on the definition of eigenvectors. Write = b B ( k ) ⊤ b Σ y ( k ) b Σ y ( k ) ⊤ b A ( k )= b B ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )+ b B ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )+ b B ( k ) ⊤ b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) . Then b B ( k ) ⊤ b Σ y ( k ) b A ( k ) (6.36)= − b B ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )( b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )) − − b B ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )( b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )) − .
31e identify the order of b C ( k ) b C ( k ) ⊤ A˜x t as follows.We replace ˙M by ˙Σ y ( k ) ˙Σ y ( k ) ⊤ and define ˙A ( k ) and ˙B ( k ) as in ˙A and ˙B . Then k AA ⊤ − ˙A ( k ) ˙A ( k ) ⊤ k = O p ( p − δ/ ) , k ˙A ( k ) ˙A ( k ) ⊤ − b A ( k ) b A ( k ) ⊤ k = O p ( n − / )and k ˙B ( k ) ˙B ( k ) ⊤ − b B ( k ) b B ( k ) ⊤ k = O p ( p δ/ n − / ) . Note that k b C ( k ) b C ( k ) ⊤ A˜x t k≤ k b C ( k ) b C ( k ) ⊤ ˙A ( k ) ˙A ( k ) ⊤ A˜x t k + k b C ( k ) b C ( k ) ⊤ ( AA ⊤ − ˙A ( k ) ˙A ( k ) ⊤ ) A˜x t k . The two summands on the right hand side of the above inequality satisfy k b C ( k ) b C ( k ) ⊤ ˙A ( k ) ˙A ( k ) ⊤ A˜x t k≤ k ( I p − b A ( k ) b A ( k ) ⊤ ) ˙A ( k ) ˙A ( k ) ⊤ A˜x t k = k ( ˙A ( k ) ˙A ( k ) ⊤ − b A ( k ) b A ( k ) ⊤ ) ˙A ( k ) ˙A ( k ) ⊤ A˜x t k≤ k ˙A ( k ) ˙A ( k ) ⊤ − b A ( k ) b A ( k ) ⊤ kk ˜x t k = O p ( p / n − / )and k b C ( k ) b C ( k ) ⊤ ( AA ⊤ − ˙A ( k ) ˙A ( k ) ⊤ ) A˜x t k≤ k ( ˙A ( k ) ˙A ( k ) ⊤ − b A ( k ) b A ( k ) ⊤ )( AA ⊤ − ˙A ( k ) ˙A ( k ) ⊤ ) A˜x t k + k ( ˙B ( k ) ˙B ( k ) ⊤ − b B ( k ) b B ( k ) ⊤ )( AA ⊤ − ˙A ( k ) ˙A ( k ) ⊤ ) A˜x t k = O p ( p / n − / ) . It follows that k b C ( k ) b C ( k ) ⊤ A˜x t k = O p ( p / n − / ) . Similarly we can conclude that k b C ( k ) b C ( k ) ⊤ ˜B˜z t k = O p ( p / n − / ) . Then k b C ( k ) ⊤ ˇy t k = O p ( p / n − / ) . k b B ( k ) ⊤ ˇy t k = O p ( p / − δ/ ) . These imply that k b B ( k ) ⊤ ˇΣ y ( k ) b C ( k ) k = O p ( p − δ/ n − / ) , k b B ( k ) ⊤ ˇΣ y,e ( k ) b C ( k ) k = O p ( p / − δ/ ) = O p ( p − δ/ n − / ) , k b B ( k ) ⊤ ˇΣ e,y ( k ) b C ( k ) k = O p ( p / n − / ) = o p ( p − δ/ n − / )and k b B ( k ) ⊤ b Σ e ( k ) b C ( k ) k = O p ( k b Σ e ( k ) k ) = O p ( pn + log n ) = o p ( p − δ/ n − / ) . It follows that k b B ( k ) ⊤ b Σ y ( k ) b C ( k ) k = k b B ( k ) ⊤ ˇΣ y ( k ) b C ( k ) k + O p ( p − δ/ n − / )= O p ( p − δ/ n − / ) . Similarly, we have k b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) k ≤ k b C ( k ) ⊤ ˇΣ y ( k ) ⊤ b A ( k ) k + k b C ( k ) ⊤ ˇΣ y,e ( k ) ⊤ b A ( k ) k + k b C ( k ) ⊤ ˇΣ e,y ( k ) ⊤ b A ( k ) k + k b C ( k ) ⊤ b Σ e ( k ) ⊤ b A ( k ) k = O p ( p − δ/ ) , k b B ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) k = O p ( p − δ ) , and k b B ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) k = O p ( p − δ n − / ) = O p ( p − δ ) . Recalling (6.36), k b B ( k ) ⊤ b Σ y ( k ) b A ( k ) k (6.37) ≤ k b B ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )( b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )) − k + k b B ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )( b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )) − k = O p ( p − δ ) . b C ( k ) ⊤ b Σ y ( k ) b A ( k ) as follows: = b C ( k ) ⊤ b Σ y ( k ) b Σ y ( k ) ⊤ b A ( k )= b C ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )+ b C ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )+ b C ( k ) ⊤ b Σ y ( k ) b A ( k ) b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) . We can find that k b C ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) k = O p ( p − δ n − / ) , and k b C ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k ) k = O p ( p − δ n − / ) . It follows that k b C ( k ) ⊤ b Σ y ( k ) b A ( k ) k (6.38) ≤ k b C ( k ) ⊤ b Σ y ( k ) b B ( k ) b B ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )( b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )) − k + k b C ( k ) ⊤ b Σ y ( k ) b C ( k ) b C ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )( b A ( k ) ⊤ b Σ y ( k ) ⊤ b A ( k )) − k = O p ( p − δ n − / ) . 
This proves (6.32) and (6.33). Furthermore, (6.32)-(6.35) show that $\|\widehat\Lambda_z(k)\|_{\min}\asymp p^{2-2\delta}$ and
$$
\widehat\lambda_{k,r+r_0+1}=\|\widehat\Lambda_e(k)\|+o_p(p^{2-2\delta})=o_p(p^{2-2\delta}).
$$
The bound on $\|\widehat B(k)\widehat B(k)^\top-\check B(k)\check B(k)^\top\|$ then rests on the fact that
$$
\frac{1}{p^{2-2\delta}}\,\big\|\widehat B(k)^\top\widehat\Sigma_y(k)\widehat A(k)\widehat A(k)^\top\widehat\Sigma_y(k)^\top\widehat C(k)\big\|=O_p(p^{-\delta/2}n^{-1/2}),
$$
which ensures (6.31).

Lemma 8.
Under Assumptions 1-4, let $\widehat\lambda_{k,i}$ be the $i$th largest eigenvalue of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$. Then there exist two positive constants $c$ and $C$ such that
$$
\lim_{n,p\to\infty}P\Big(c\le\frac{\widehat\lambda_{k,i}}{p^2}\le C\Big)=1\quad\text{when }1\le i\le r,\qquad(6.39)
$$
$$
\lim_{n,p\to\infty}P\Big(c\le\frac{\widehat\lambda_{k,i}}{p^{2-2\delta}}\le C\Big)=1\quad\text{when }r+1\le i\le r+r_0,\qquad(6.40)
$$
$$
\widehat\lambda_{k,r+r_0+1}=O_p\big(n^{-1}(p+\log p)\big)\qquad(6.41)
$$
and
$$
\frac{n}{p}\,\widehat\lambda_{k,r+r_0+s}=O_p(1)\qquad(6.42)
$$
for any fixed $s$.

Proof of Lemma 8. As in the proof of Theorem 1, it can be shown that $\|\widehat A(k)\widehat A(k)^\top-AA^\top\|=O_p(n^{-1/2}+p^{-\delta/2})$, which implies (6.39). (6.32) and (6.34) lead to (6.40).

We now prove (6.41). We start with the relation
$$
\check\Lambda_e(k)=\widehat C(k)^\top\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top\widehat C(k)
=\widehat C(k)^\top\widehat\Sigma_y(k)\widehat B(k)\widehat B(k)^\top\widehat\Sigma_y(k)^\top\widehat C(k)
+\widehat C(k)^\top\widehat\Sigma_y(k)\widehat A(k)\widehat A(k)^\top\widehat\Sigma_y(k)^\top\widehat C(k)
+\widehat C(k)^\top\widehat\Sigma_y(k)\widehat C(k)\widehat C(k)^\top\widehat\Sigma_y(k)^\top\widehat C(k).
$$
Note that
$$
\Big\|\big(\widehat A(k),\widehat B(k)\big)^\top\widehat\Sigma_y(k)^\top\big(\widehat A(k),\widehat B(k)\big)\Big\|_{\min}\asymp p^{1-\delta}
$$
and
$$
0=\widehat C(k)^\top\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top\big(\widehat A(k),\widehat B(k)\big)
=\widehat C(k)^\top\widehat\Sigma_y(k)\big(\widehat A(k),\widehat B(k)\big)\big(\widehat A(k),\widehat B(k)\big)^\top\widehat\Sigma_y(k)^\top\big(\widehat A(k),\widehat B(k)\big)
+\widehat C(k)^\top\widehat\Sigma_y(k)\widehat C(k)\widehat C(k)^\top\widehat\Sigma_y(k)^\top\big(\widehat A(k),\widehat B(k)\big).
$$
It follows that
$$
\widehat C(k)^\top\widehat\Sigma_y(k)\big(\widehat A(k),\widehat B(k)\big)
=-\widehat C(k)^\top\widehat\Sigma_y(k)\widehat C(k)\widehat C(k)^\top\widehat\Sigma_y(k)^\top\big(\widehat A(k),\widehat B(k)\big)
\Big(\big(\widehat A(k),\widehat B(k)\big)^\top\widehat\Sigma_y(k)^\top\big(\widehat A(k),\widehat B(k)\big)\Big)^{-1},
$$
and hence
$$
\big\|\widehat C(k)^\top\widehat\Sigma_y(k)\big(\widehat A(k)\widehat A(k)^\top+\widehat B(k)\widehat B(k)^\top\big)\widehat\Sigma_y(k)^\top\widehat C(k)\big\|
=O_p\big(\|\widehat C(k)^\top\widehat\Sigma_y(k)\widehat C(k)\widehat C(k)^\top\widehat\Sigma_y(k)^\top\widehat C(k)\|\big).
$$
So we only need the order of $\|\widehat C(k)^\top\widehat\Sigma_y(k)\widehat C(k)\|$. (6.31) implies that
$$
\|\widehat C(k)\widehat C(k)^\top-\check C(k)\check C(k)^\top\|=O_p(p^{-\delta/2}n^{-1/2}).
$$
This, together with (6.35), implies
$$
\|\widehat C(k)^\top\widehat\Sigma_y(k)\widehat C(k)\|
=O_p\big(\|\check C(k)^\top\widehat\Sigma_y(k)\check C(k)\|\big)
=O_p\big(\{n^{-1}(p+\log p)\}^{1/2}\big),
$$
which proves (6.41). From (2.10), Lemma 4 and (6.41), for any fixed $s$,
$$
\sum_{i=r+r_0+1}^{\min\{n,p\}}\widehat\lambda_{k,i}^{1/2}=O_p(p).
$$
This, together with (6.41), implies (6.42). Lemma 6 then follows from Lemma 8.
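The eigenvalue regimes (6.39)-(6.42) are easy to visualise numerically. The sketch below is only an illustration, not part of the proof: the AR(1) factor dynamics, the Gaussian loadings and noise, the block design mimicking the 4-cluster example of Section S.1, and the choice of lag are all our own assumptions. The leading $r$ eigenvalues of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ form a plateau of much larger order than the next $r_0$, which in turn clearly dominate the remaining, noise-level eigenvalues.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, r, d, r_i, p_i = 400, 300, 2, 4, 2, 30     # mimics the 4-cluster design
    r0 = d * r_i

    def ar1(n_obs, dim, phi=0.7):
        # AR(1) factor processes (an assumed, illustrative dynamic)
        x = np.zeros((n_obs, dim))
        for t in range(1, n_obs):
            x[t] = phi * x[t - 1] + rng.normal(size=dim)
        return x

    A = rng.normal(size=(p, r))                       # dense: strong (common) loadings
    B = np.zeros((p, r0))                             # block support: weak loadings
    for i in range(d):
        B[i * p_i:(i + 1) * p_i, i * r_i:(i + 1) * r_i] = rng.normal(size=(p_i, r_i))

    y = ar1(n, r) @ A.T + ar1(n, r0) @ B.T + rng.normal(size=(n, p))

    k = 1
    yc = y - y.mean(0)
    Sig_k = yc[:n - k].T @ yc[k:] / n                 # sample lag-k autocovariance
    lam = np.linalg.eigvalsh(Sig_k @ Sig_k.T)[::-1]   # eigenvalues, decreasing order
    print(np.round(lam[:r + r0 + 3], 1))              # visible gaps after r and r + r0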
Proof of Theorem 4. Observe that
$$
(I_p-AA^\top)\begin{pmatrix}B\\ 0\end{pmatrix}
=\begin{pmatrix}C_{11}&\cdots&C_{1d}\\ \vdots&&\vdots\\ C_{d+1,1}&\cdots&C_{d+1,d}\end{pmatrix},\qquad(6.43)
$$
where $C_{ij}$ is a $p_i\times r_j$ matrix. Hence $C_{ii}=B_i-A_iA_i^\top B_i$ and, when $i\ne j$, $C_{ij}=-A_iA_j^\top B_j$. Note that $B^\top B=I_{r_0}$ and $B_i^\top B_i=I_{r_i}$. Then (6.43), (2.3) and Assumption 5 ensure that $P_{A^\perp B}$ can be rewritten as
$$
P_{A^\perp B}=\widetilde H_{\rm diag}+\widetilde H_{\rm err},
$$
where $\widetilde H_{\rm diag}$ satisfies (4.7) and $\|\widetilde H_{\rm err}\|_F=O(p^{\delta/2}n^{-1/2})$. This, together with (4.2), completes the proof.
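The block structure (6.43) is elementary but easy to get wrong, so the following sketch verifies it numerically on a toy configuration. The block sizes are our own choices, and for brevity we take $p_{d+1}=0$ so that the zero block of $(B^\top,0)^\top$ is absent.

    import numpy as np

    rng = np.random.default_rng(1)
    p_blk, r_blk, r = [30, 30, 40], [2, 2, 2], 2      # toy sizes (assumed)
    p = sum(p_blk)

    A = rng.normal(size=(p, r))                       # common loadings; row blocks A_i
    Bs = [np.linalg.qr(rng.normal(size=(pi, ri)))[0]  # B_i with B_i' B_i = I_{r_i}
          for pi, ri in zip(p_blk, r_blk)]
    B = np.zeros((p, sum(r_blk)))
    row = col = 0
    for Bi in Bs:
        B[row:row + Bi.shape[0], col:col + Bi.shape[1]] = Bi
        row, col = row + Bi.shape[0], col + Bi.shape[1]

    L = (np.eye(p) - A @ A.T) @ B                     # left-hand side of (6.43)

    row = 0
    for i, pi in enumerate(p_blk):
        Ai, col = A[row:row + pi], 0
        for j, (Bj, rj) in enumerate(zip(Bs, r_blk)):
            Aj = A[sum(p_blk[:j]):sum(p_blk[:j]) + p_blk[j]]
            Cij = (Bj if i == j else 0) - Ai @ Aj.T @ Bj   # C_ii or C_ij in (6.43)
            assert np.allclose(L[row:row + pi, col:col + rj], Cij)
            col += rj
        row += pi
    print("block structure of (6.43) verified")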
Proof of Theorem 5. Note that $|J_i|\le c\,p$ for any $1\le i\le d+1$. The fact that $\widehat B^\top\widehat B=I_{r_0}$ implies that
$$
\|\widehat b_i\|^2=\widehat b_i^\top\widehat B^\top\widehat B\,\widehat b_i=\|\widehat b_i^\top\widehat B^\top\|^2.
$$
Recalling the definition of $\widehat b_i$ in Step 4, $\|\widehat b_i\|$ is therefore the norm of the $i$th row of $\widehat B\widehat B^\top$.

We first consider $i\le d$. Theorem 4 and Assumption 6 imply that if $j\in J_i\cap\widehat J_{d+1}$, then the norm of the $j$th row vector of $\widehat B\widehat B^\top-P_{A^\perp B}$ must be larger than $c\,p^{-1/2}$. This, together with (4.2), implies (4.9). Now consider $i=d+1$: if $j\in J_{d+1}\cap\widehat J^c_{d+1}$, then the norm of the $j$th row vector of $\widehat B\widehat B^\top-P_{A^\perp B}$ is no smaller than $\omega_p$. This, together with (4.2), implies (4.10).
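In Step 4, the quantity $\|\widehat b_i\|$ used above is just the norm of the $i$th row of $\widehat B\widehat B^\top$, and the unclustered series are those whose rows fall below the threshold $\omega_p$. A minimal sketch, assuming only that an orthonormal $\widehat B$ is available and borrowing one of the threshold choices examined in the supplementary simulations; the function and variable names are ours.

    import numpy as np

    def screen_unclustered(B_hat, omega_p):
        # A series is assigned to J_hat_{d+1} (no cluster) when the corresponding
        # row of B_hat @ B_hat.T has norm at most omega_p; since B_hat has
        # orthonormal columns, that row norm equals the norm of the row of B_hat.
        row_norms = np.linalg.norm(B_hat @ B_hat.T, axis=1)
        return np.where(row_norms <= omega_p)[0], row_norms

    # toy usage with a placeholder orthonormal loading matrix
    rng = np.random.default_rng(2)
    p, r0 = 300, 8
    B_hat, _ = np.linalg.qr(rng.normal(size=(p, r0)))
    omega_p = np.sqrt(r0 / (p * np.log(p)))       # one threshold from the supplement
    J_hat, _ = screen_unclustered(B_hat, omega_p)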
Proof of Theorem 6. Define the diagonal matrix $F_{\rm diag}$ whose $i$th diagonal element is $\widehat f_i^\top\widehat f_i$. Then $\widehat\rho_{l,m}$ is the $(l,m)$th entry of $F_{\rm diag}^{-1/2}|\widehat F\widehat F^\top|F_{\rm diag}^{-1/2}$. Recalling Step 4, one can see that $\|F_{\rm diag}^{-1}\|\le\omega_p^{-2}$, and the assertion follows.
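For concreteness, the similarity matrix appearing in this proof can be computed directly: $\widehat\rho_{l,m}$ is nothing but the absolute correlation-type measure between the rows $\widehat f_l$ and $\widehat f_m$ of $\widehat F$. A short sketch (the clipping constant eps is our own numerical safeguard, not part of the definition):

    import numpy as np

    def rho_hat(F_hat, eps=1e-12):
        # (l, m) entry of F_diag^{-1/2} |F_hat F_hat'| F_diag^{-1/2}, i.e.
        # |f_l . f_m| / (||f_l|| ||f_m||), with F_diag = diag(f_i' f_i)
        G = np.abs(F_hat @ F_hat.T)
        d = np.sqrt(np.maximum(np.diag(G), eps))  # ||f_i||, guarded away from 0
        return G / np.outer(d, d)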
Supplementary file of “Factor Modelling for Clustering High-dimensional Time Series” by B. Zhang, G. Pan, Q. Yao and W. Zhou
More simulation results
S.1 Example of 4 Clusters
We set $n=400$, $p=300$, $d=4$, $r=r_1=\cdots=r_4=2$ (hence $r_0=8$), $p_1=\cdots=p_4=30$, and $k_0=5$. Therefore, among the 300 component series of $y_t$, the first 120 components form 4 clusters of equal size 30, and the last 180 components do not belong to any cluster. We report tables (Tables 5-8) and figures (Figures 7-8) parallel to those for the example of 5 clusters.

The relative frequencies of $\widehat r=r$ and $\widehat r+\widehat r_0=r+r_0$ are reported in Table 5. Overall, the method based on the ratios of the cumulative eigenvalues $\widehat R_j$ provides accurate and robust performance and is not sensitive to the choice of $k$. The estimation based on the eigenvalues of $\widehat M$ is also accurate for $r$, but is considerably poorer for $r+r_0$ for the larger values of $\delta$, and taking $k=0$ (i.e. using $\widehat\Sigma_y(0)\widehat\Sigma_y(0)^\top$) leads to weaker estimates of $r$ for the smallest $\delta$.
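The exact cumulative-ratio statistic $\widehat R_j$ is defined in (3.3) of the main paper and is not reproduced here. As a rough stand-in, the sketch below applies a simpler eigenvalue-ratio rule, in the spirit of the ratio-based estimation discussed above, to the pooled matrix $\widehat M=\sum_{k=1}^{k_0}\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ of (3.1).

    import numpy as np

    def acov(y, k):
        # sample lag-k autocovariance of the n x p data matrix y
        yc = y - y.mean(0)
        return yc[:len(y) - k].T @ yc[k:] / len(y)

    def n_factors_ratio(y, k0=5, j_max=20):
        # eigenvalue-ratio estimate from M_hat = sum_k acov(k) acov(k)'; a
        # simplified surrogate for the cumulative-ratio rule R_hat_j of (3.3)
        M = sum(acov(y, k) @ acov(y, k).T for k in range(1, k0 + 1))
        lam = np.linalg.eigvalsh(M)[::-1][:j_max + 1]
        return int(np.argmin(lam[1:] / lam[:-1])) + 1

Applied to data generated as in the sketch given after Lemma 8, n_factors_ratio typically returns 2, the number of common factors in that design.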
Table 6 reports the estimation errors $\|\widehat A\widehat A^\top-AA^\top\|_F$ and $\|\widehat B\widehat B^\top-P_{A^\perp B}\|_F$, where $\widehat A$ is estimated by the eigenvectors of the matrix $\widehat M$ in (3.1) with $k=1,\ldots,5$; see Step 2 of the algorithm stated in Section 3, and Step 3 there for the analogous procedure for estimating $B$. For comparison purposes, we also include the estimates obtained with $\widehat M$ replaced by $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ for $k=0,1,\ldots,5$. Table 6 shows clearly that the estimation based on $\widehat M$ is accurate and robust with respect to the different values of $k$; by contrast, using a single-lagged autocovariance matrix for estimating the factor loading spaces is not recommendable. When $\delta$ increases, the error $\|\widehat A\widehat A^\top-AA^\top\|_F$ decreases, as indicated by Theorem 1. The pattern of the error $\|\widehat B\widehat B^\top-P_{A^\perp B}\|_F$ is more complex: it decreases initially and then increases as $\delta$ increases, which is in line with the asymptotic result in Theorem 2.

To examine the effectiveness of Step 4 of the algorithm, we plot in Figure 7 the sample percentiles at the 5%, 50% and 95% levels of each $\|\widehat b_j\|$ over the 1000 replications, for $j=1,\ldots,p$. The norms of the last 180 ($=p_{d+1}$) components (those belonging to no cluster) indeed drop flat and are close to 0. This indicates clearly that it is possible to distinguish the components of $y_t$ not belonging to any cluster from those belonging to one of the $d$ clusters. Note that the indices of the components not belonging to any cluster are identified as those in $\widehat J_{d+1}$ in (3.6), which is defined in terms of a threshold $\omega_p=o(p^{-1/2})$. We experiment with three choices of this tuning parameter, namely $\omega_{p,1}=(\widehat r/p)^{1/2}/\ln p$, $\omega_{p,2}=\{\widehat r/(p\ln p)\}^{1/2}$ and $\omega_{p,3}=\{\widehat r/(p\ln\ln p)\}^{1/2}$. Recall that $J^c_{d+1}$ contains all the indices of the components of $y_t$ belonging to one of the $d$ clusters. The means and standard deviations of the two types of misclassification errors
$$
E_1=|J^c_{d+1}\cap\widehat J_{d+1}|/|J^c_{d+1}|\qquad\text{and}\qquad E_2=|J_{d+1}\cap\widehat J^c_{d+1}|/|J_{d+1}|
$$
over the 1000 replications are reported in Table 7; a sketch of their computation is given below. Among the three choices, $\omega_{p,2}$ appears to work best, as the two types of errors are both small. The increase in the errors due to the estimation of $r$ and $r_0$ is not significant for the larger values of $\delta$, but the increase in $E_2$ due to unknown $r$ and $r_0$ is noticeable for the smallest $\delta$.
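A sketch of how the two error rates and the three thresholds can be computed, given the index sets; the function and variable names are ours.

    import numpy as np

    def omega_candidates(r_hat, p):
        # the three threshold choices examined in Table 7
        return ((r_hat / p) ** 0.5 / np.log(p),
                (r_hat / (p * np.log(p))) ** 0.5,
                (r_hat / (p * np.log(np.log(p)))) ** 0.5)

    def error_rates(J_d1_true, J_d1_hat, p):
        # E1: clustered series wrongly declared unclustered
        # E2: unclustered series wrongly kept among the clusters
        all_idx = np.arange(p)
        Jc_true = np.setdiff1d(all_idx, J_d1_true)
        Jc_hat = np.setdiff1d(all_idx, J_d1_hat)
        E1 = len(np.intersect1d(Jc_true, J_d1_hat)) / max(len(Jc_true), 1)
        E2 = len(np.intersect1d(J_d1_true, Jc_hat)) / max(len(J_d1_true), 1)
        return E1, E2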
For the smaller values of $\delta$, the 95% percentiles of the last 180 norms are clearly greater than 0, though the 50% percentiles are still much smaller than the 5% percentiles of the first 120 ($=p_1+\cdots+p_4$) norms. In the sequel, we only report the results with $\omega_p=\{\widehat r/(p\ln p)\}^{1/2}$.

The number of clusters is estimated based on MGF in (3.7). Figure 8 presents the boxplots of MGF($k$) for $k=2,\ldots,10$. We calculated MGF($\cdot$) with $(r,r_0)$ either known or estimated by $(\widehat r,\widehat r_0)$. In either case, the values of MGF($k$) increase sharply from $k=4$ to $k=5$, and keep increasing for $k>5$. Hence we set $\widehat d=4$. The $\widehat d$ clusters are then obtained by performing the $k$-means clustering (with $k=\widehat d$) on the $\widehat p$ rows of $\widehat W$, where $\widehat p=p-|\widehat J_{d+1}|$; see Step 5 of the algorithm in Section 3, and the sketch at the end of this subsection. As the error rates in estimating $J^c_{d+1}$ have already been reported in Table 7, we now concentrate on the components of $y_t$ with indices in $\widehat J^c_{d+1}\cap J^c_{d+1}$, and count the number of them which are misplaced by the $k$-means clustering. Let $S$ denote the number of misplaced components. Both the means and the standard deviations of the error rates $S/|\widehat J^c_{d+1}\cap J^c_{d+1}|$ over the 1000 replications are reported in Table 8. It shows clearly that the $k$-means clustering identifies the latent clusters very accurately, and the difference in performance due to estimating $(r,r_0)$ is also small.

To summarise, the performance in this example of 4 clusters is very similar to that in Section 5.1, which indicates that our method is stable across different settings.
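A sketch of Step 5 as used here. The exact definition of $\widehat W$ is given in Section 3 of the main paper; below we take its rows to be the row-normalised rows of $\widehat B\widehat B^\top$ restricted to $\widehat J^c_{d+1}$, which is our reading, and we rely on scikit-learn for the $k$-means step.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_step5(B_hat, keep_idx, d_hat, seed=0):
        # k-means with k = d_hat on the kept rows of B_hat @ B_hat.T, each row
        # scaled to unit length; rows in J_hat_{d+1} have been screened out, so
        # the normalisation is well defined
        W = (B_hat @ B_hat.T)[keep_idx]
        W = W / np.linalg.norm(W, axis=1, keepdims=True)
        return KMeans(n_clusters=d_hat, n_init=10, random_state=seed).fit_predict(W)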
S.2 Example of 6 Clusters
We set $n=200$, $p=240$, $d=6$, $r=r_1=\cdots=r_6=2$ (hence $r_0=12$), $p_1=\cdots=p_6=40$, and $k_0=5$. Therefore all 240 component series of $y_t$ form 6 clusters of equal size 40, and no components are left outside the clusters. We report tables (Tables 9-12) and figures (Figures 9-10) parallel to those for the example of 5 clusters.

The relative frequencies of $\widehat r=r$ and $\widehat r+\widehat r_0=r+r_0$ are reported in Table 9. Overall, the method based on the ratios of the cumulative eigenvalues $\widehat R_j$ again provides accurate and robust performance and is not sensitive to the choice of $k$.

Table 5: The relative frequencies of $\widehat r=r$ and $\widehat r+\widehat r_0=r+r_0$ in a simulation with 1000 replications, where $\widehat r$ and $\widehat r_0$ are estimated by the ratio-based method (3.3) using $\widehat R_j$ with $k=1,\ldots,5$, or by the ratios of the eigenvalues of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ with $k=0,1,\ldots,5$. In each half of the table, the four columns correspond to the four values of $\delta$ in increasing order.
                      r-hat = r                    r-hat + r0-hat = r + r0
method                ---------------------------  ---------------------------
R_j (k = 1)           .386   .732   .938   .999    1      1      1      1
R_j (k = 2)           .403   .747   .940   .996    .995   1      1      .997
R_j (k = 3)           .397   .747   .932   .994    .996   1      .995   .988
R_j (k = 4)           .395   .740   .925   .991    .999   .998   .992   .983
R_j (k = 5)           .397   .736   .919   .988    .999   .998   .988   .975
Sigma_y(0) product    .334   .682   .927   .994    1      1      1      1
M (k = 1)             .394   .735   .937   .998    .991   .996   .797   .304
M (k = 2)             .424   .761   .941   .997    1      .990   .780   .312
M (k = 3)             .432   .763   .936   .995    .999   .971   .699   .253
M (k = 4)             .438   .759   .930   .991    .998   .960   .633   .222
M (k = 5)             .439   .759   .921   .988    .995   .942   .583   .189

Table 6: The means and standard deviations (in parentheses) of $\|\widehat A\widehat A^\top-AA^\top\|_F$ and $\|\widehat B\widehat B^\top-P_{A^\perp B}\|_F$ in a simulation with 1000 replications, where $\widehat A$ is estimated by the eigenvectors of $\widehat M$ in (3.1) (with $k=1,\ldots,5$) or of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ (for $k=0,1,\ldots,5$), and $\widehat B$ is estimated in the same manner. Both $r$ and $r_0$ are assumed to be known. The four columns in each half correspond to the four values of $\delta$ in increasing order.

                      error for A                                        error for B
M (k = 1)             .526(.393) .211(.088) .143(.041) .113(.026)        .621(.352) .439(.053) .526(.041) .684(.057)
M (k = 2)             .473(.359) .207(.081) .145(.040) .116(.026)        .577(.318) .441(.049) .534(.043) .694(.059)
M (k = 3)             .457(.345) .207(.081) .147(.040) .118(.026)        .564(.305) .445(.049) .541(.045) .702(.062)
M (k = 4)             .450(.341) .208(.079) .148(.040) .120(.027)        .561(.300) .450(.049) .547(.046) .709(.063)
M (k = 5)             .448(.338) .209(.079) .150(.041) .122(.027)        .560(.297) .453(.050) .551(.047) .713(.064)
Sigma_y(0) product    .619(.438) .230(.108) .146(.043) .111(.026)        .696(.400) .432(.069) .499(.038) .648(.053)
Sigma_y(1) product    .475(.317) .267(.117) .204(.065) .169(.048)        .810(.230) .853(.084) 1.05(.084) 1.31(.098)
Sigma_y(2) product    .495(.234) .389(.190) .336(.159) .290(.137)        1.92(.166) 2.07(.147) 2.25(.126) 2.41(.118)
Sigma_y(3) product    .751(.362) .589(.308) .527(.292) .466(.250)        1.99(.178) 2.12(.154) 2.29(.142) 2.46(.131)
Sigma_y(4) product    .939(.394) .738(.343) .682(.332) .616(.304)        2.05(.196) 2.16(.166) 2.32(.151) 2.50(.140)

Table 7: The means and standard deviations (in parentheses) of the error rates $E_1=|J^c_{d+1}\cap\widehat J_{d+1}|/|J^c_{d+1}|$ and $E_2=|J_{d+1}\cap\widehat J^c_{d+1}|/|J_{d+1}|$ in a simulation with 1000 replications, with the three choices of the threshold $\omega_p$ in (3.6), and the numbers of factors $r$ and $r_0$ either known or estimated. The four columns in each half correspond to the four values of $\delta$ in increasing order.

                      r and r0 known                                     r and r0 estimated
E1  omega_p1          .003(.005) .003(.005) .002(.005) .002(.004)        .010(.035) .004(.017) .002(.005) .002(.009)
    omega_p2          .032(.015) .032(.016) .031(.016) .029(.015)        .049(.087) .031(.037) .030(.018) .030(.022)
    omega_p3          .118(.027) .114(.026) .113(.026) .115(.026)        .138(.101) .114(.043) .113(.027) .116(.031)
E2  omega_p1          .333(.242) .212(.113) .330(.102) .554(.099)        .468(.305) .335(.267) .365(.163) .554(.110)
    omega_p2          .032(.086) .000(.001) .000(.001) .002(.005)        .142(.157) .007(.130) .002(.086) .008(.052)
    omega_p3          .000(.001) .000(.000) .000(.000) .000(.000)        .002(.037) .000(.001) .000(.000) .000(.003)
Figure 7: Sample percentiles of $\|\widehat b_j\|$ at the levels of 5% (blue), 50% (black) and 95% (red), plotted against $j=1,\ldots,300$. The four columns from left to right correspond to the four values of $\delta$ in increasing order (the largest being 0.5); the panels are calculated with $(r,r_0)$ known and with $(r,r_0)$ estimated, respectively.
Table 8: The means and standard deviations (in parentheses) of the error rates $S/|\widehat J^c_{d+1}\cap J^c_{d+1}|$ in a simulation with 1000 replications, with the numbers of factors $r$ and $r_0$ either known or estimated.

Figure 8: Box-plots of MGF($k$) against the number of clusters $k$, calculated with $(r,r_0)$ known and with $(r,r_0)$ estimated. The four columns from left to right correspond to the four values of $\delta$ in increasing order.

The estimation based on the eigenvalues of $\widehat M$ is also accurate for $r$, but is considerably poorer for $r+r_0$ for the larger values of $\delta$, and taking $k=0$ leads to weaker estimates of $r$ for the smallest $\delta$. Table 10 reports the estimation errors $\|\widehat A\widehat A^\top-AA^\top\|_F$ and $\|\widehat B\widehat B^\top-P_{A^\perp B}\|_F$, where $\widehat A$ is estimated by the eigenvectors of the matrix $\widehat M$ in (3.1) with $k=1,\ldots,5$; see Step 2 of the algorithm stated in Section 3, and Step 3 there for the analogous procedure for estimating $B$. For comparison purposes, we also include the estimates obtained with $\widehat M$ replaced by $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ for $k=0,1,\ldots,5$. Table 10 shows clearly that the estimation based on $\widehat M$ is accurate and robust with respect to the different values of $k$; using a single-lagged autocovariance matrix for estimating the factor loading spaces is again not recommendable (a numerical sketch of this comparison is given below). When $\delta$ increases, the error $\|\widehat A\widehat A^\top-AA^\top\|_F$ decreases, as indicated by Theorem 1, while the pattern of $\|\widehat B\widehat B^\top-P_{A^\perp B}\|_F$ is more complex, decreasing initially and then increasing as $\delta$ increases, in line with the asymptotic result in Theorem 2.

To examine the effectiveness of Step 4 of the algorithm, we plot in Figure 9 the sample percentiles at the 5%, 50% and 95% levels of each $\|\widehat b_j\|$ over the 1000 replications, for $j=1,\ldots,p$. Since $p_{d+1}=0$ in this case, Figure 9 shows that all components behave similarly. Note that the indices of the components not belonging to any cluster are identified as those in $\widehat J_{d+1}$ in (3.6), which is defined in terms of a threshold $\omega_p=o(p^{-1/2})$; we experiment with the three choices $\omega_{p,1}=(\widehat r/p)^{1/2}/\ln p$, $\omega_{p,2}=\{\widehat r/(p\ln p)\}^{1/2}$ and $\omega_{p,3}=\{\widehat r/(p\ln\ln p)\}^{1/2}$. Since $p_{d+1}=0$, we only need to investigate $E_1=|J^c_{d+1}\cap\widehat J_{d+1}|/|J^c_{d+1}|$. The means and standard deviations of $E_1$ over the 1000 replications are reported in Table 11. We can see that $\omega_{p,2}$ still performs well, and the increase in the errors due to the estimation of $r$ and $r_0$ is not significant. In the sequel, we only report the results with $\omega_p=\{\widehat r/(p\ln p)\}^{1/2}$.
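The comparison behind the single-lag rows of Tables 6 and 10 can be replicated along the following lines; the loading-space error is measured by $\|\widehat A\widehat A^\top-AA^\top\|_F$ with $\widehat A$ the leading eigenvectors of either the pooled matrix or a single-lag product. The helper names are ours.

    import numpy as np

    def loading_space(y, r, lags):
        # leading r eigenvectors of sum_{k in lags} acov(k) acov(k)';
        # lags=[k] is the single-lag variant, lags=range(1, k0+1) the pooled M_hat
        yc = y - y.mean(0)
        n = len(y)
        acov = lambda k: yc[:n - k].T @ yc[k:] / n
        M = sum(acov(k) @ acov(k).T for k in lags)
        return np.linalg.eigh(M)[1][:, -r:]

    def proj_error(V, A):
        # || V V' - P_A ||_F, with P_A the projector onto the column space of A
        P_A = A @ np.linalg.pinv(A)
        return np.linalg.norm(V @ V.T - P_A)

With $y$ and $A$ from the generator sketched after Lemma 8, proj_error(loading_space(y, r, range(1, 6)), A) is typically noticeably smaller than the single-lag proj_error(loading_space(y, r, [3]), A), in qualitative agreement with the tables.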
The number of clusters is estimated based on MGF in (3.7). Figure 10 presents the boxplots of MGF($k$) for $k=2,\ldots,10$. We calculated MGF($\cdot$) with $(r,r_0)$ either known or estimated by $(\widehat r,\widehat r_0)$. In either case, the values of MGF($k$) increase sharply from $k=6$ to $k=7$, and keep increasing for $k>7$. Hence we set $\widehat d=6$. The $\widehat d$ clusters are then obtained by performing the $k$-means clustering (with $k=\widehat d$) on the $\widehat p$ rows of $\widehat W$, where $\widehat p=p-|\widehat J_{d+1}|$; see Step 5 of the algorithm in Section 3. As the error rates in estimating $J^c_{d+1}$ have already been reported in Table 11, we concentrate on the components of $y_t$ with indices in $\widehat J^c_{d+1}\cap J^c_{d+1}$, and count the number of them which are misplaced by the $k$-means clustering. Let $S$ denote the number of misplaced components. Both the means and the standard deviations of the error rates $S/|\widehat J^c_{d+1}\cap J^c_{d+1}|$ over the 1000 replications are reported in Table 12. It shows clearly that the $k$-means clustering identifies the latent clusters very accurately, and the difference in performance due to estimating $(r,r_0)$ is also small.

To summarise, the performance in this example of 6 clusters is very similar to that in Section 5.1 even though $p_{d+1}=0$. This again indicates that our method is stable across different settings.

Table 9: The relative frequencies of $\widehat r=r$ and $\widehat r+\widehat r_0=r+r_0$ in a simulation with 1000 replications, where $\widehat r$ and $\widehat r_0$ are estimated by the ratio-based method (3.3) using $\widehat R_j$ with $k=1,\ldots,5$, or by the ratios of the eigenvalues of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ with $k=0,1,\ldots,5$. In each half of the table, the four columns correspond to the four values of $\delta$ in increasing order.
                      r-hat = r                    r-hat + r0-hat = r + r0
method                ---------------------------  ---------------------------
R_j (k = 1)           .371   .657   .884   .984    1      1      1      .997
R_j (k = 2)           .379   .665   .889   .981    .990   1      1      .997
R_j (k = 3)           .372   .664   .878   .975    .989   1      .999   .991
R_j (k = 4)           .368   .653   .872   .972    .991   1      .998   .986
R_j (k = 5)           .357   .634   .862   .966    .992   1      .997   .982
Sigma_y(0) product    .337   .641   .876   .991    1      1      1      .998
M (k = 1)             .380   .669   .888   .983    1      .980   .755   .294
M (k = 2)             .413   .697   .891   .980    1      .985   .741   .266
M (k = 3)             .414   .697   .885   .977    .999   .974   .678   .219
M (k = 4)             .415   .703   .880   .973    .999   .964   .638   .199
M (k = 5)             .418   .695   .875   .966    1      .968   .614   .174

Table 10: The means and standard deviations (in parentheses) of $\|\widehat A\widehat A^\top-AA^\top\|_F$ and $\|\widehat B\widehat B^\top-P_{A^\perp B}\|_F$ in a simulation with 1000 replications, where $\widehat A$ is estimated by the eigenvectors of $\widehat M$ in (3.1) (with $k=1,\ldots,5$) or of $\widehat\Sigma_y(k)\widehat\Sigma_y(k)^\top$ (for $k=0,1,\ldots,5$), and $\widehat B$ is estimated in the same manner. Both $r$ and $r_0$ are assumed to be known. The four columns in each half correspond to the four values of $\delta$ in increasing order.

                      error for A                                        error for B
M (k = 1)             .728(.385) .373(.144) .248(.067) .189(.042)        .901(.320) .753(.086) .883(.057) 1.12(.074)
M (k = 2)             .684(.365) .364(.134) .250(.065) .193(.042)        .875(.298) .767(.080) .911(.063) 1.16(.080)
M (k = 3)             .670(.355) .363(.131) .253(.065) .196(.043)        .870(.288) .780(.078) .928(.064) 1.18(.082)
M (k = 4)             .665(.352) .365(.132) .255(.065) .199(.044)        .871(.284) .789(.079) .940(.066) 1.19(.084)
M (k = 5)             .665(.353) .367(.133) .257(.066) .201(.044)        .876(.283) .796(.080) .948(.066) 1.20(.084)
Sigma_y(0) product    .806(.406) .401(.168) .250(.069) .185(.041)        .948(.348) .735(.107) .835(.052) 1.06(.068)
Sigma_y(1) product    .733(.346) .467(.177) .349(.103) .282(.078)        1.36(.216) 1.45(.113) 1.70(.107) 2.00(.112)
Sigma_y(2) product    .795(.305) .636(.251) .543(.216) .466(.200)        2.26(.168) 2.41(.143) 2.61(.132) 2.84(.116)

Table 11: The means and standard deviations (in parentheses) of the error rates $E_1=|J^c_{d+1}\cap\widehat J_{d+1}|/|J^c_{d+1}|$ in a simulation with 1000 replications, with the three choices of the threshold $\omega_p$ in (3.6), and the numbers of factors $r$ and $r_0$ either known or estimated. The four columns in each half correspond to the four values of $\delta$ in increasing order.

                      r and r0 known                                     r and r0 estimated
E1  omega_p1          .004(.005) .004(.005) .003(.003) .001(.002)        .009(.029) .004(.008) .002(.003) .002(.011)
    omega_p2          .075(.019) .079(.017) .075(.016) .067(.015)        .083(.086) .070(.027) .071(.018) .067(.029)
    omega_p3          .298(.023) .298(.022) .296(.022) .293(.022)        .307(.084) .293(.028) .293(.023) .294(.030)
Figure 9: Sample percentiles of $\|\widehat b_j\|$ at the levels of 5% (blue), 50% (black) and 95% (red), plotted against $j=1,\ldots,240$. The four columns from left to right correspond to the four values of $\delta$ in increasing order (the largest being 0.5); the panels are calculated with $(r,r_0)$ known and with $(r,r_0)$ estimated, respectively.
Table 12: The means and standard deviations (in parentheses) of the error rates $S/|\widehat J^c_{d+1}\cap J^c_{d+1}|$ in a simulation with 1000 replications, with the numbers of factors $r$ and $r_0$ either known or estimated.

Figure 10: Box-plots of MGF($k$) against the number of clusters $k$, calculated with $(r,r_0)$ known and with $(r,r_0)$ estimated. The four columns from left to right correspond to the four values of $\delta$ in increasing order (the largest being 0.5).