Statistical Inferences Using Large Estimated Covariances for Panel Data and Factor Models
aa r X i v : . [ m a t h . S T ] N ov Statistical Inferences Using Large EstimatedCovariances for Panel Data and Factor Models
Jushan Bai ∗ Columbia University Yuan Liao † University of Maryland
Abstract
While most of the convergence results in the literature on high dimensional covari-ance matrix are concerned about the accuracy of estimating the covariance matrix (andprecision matrix), relatively less is known about the effect of estimating large covari-ances on statistical inferences. We study two important models: factor analysis andpanel data model with interactive effects, and focus on the statistical inference andestimation efficiency of structural parameters based on large covariance estimators.For efficient estimation, both models call for a weighted principle components (WPC),which relies on a high dimensional weight matrix. This paper derives an efficient andfeasible WPC using the covariance matrix estimator of Fan et al. (2013). However,we demonstrate that existing results on large covariance estimation based on abso-lute convergence are not suitable for statistical inferences of the structural parameters.What is needed is some weighted consistency and the associated rate of convergence,which are obtained in this paper. Finally, the proposed method is applied to the USdivorce rate data. We find that the efficient WPC identifies the significant effects ofdivorce-law reforms on the divorce rate, and it provides more accurate estimation andtighter confidence intervals than existing methods.
Keywords:
High dimensionality, unknown factors, conditional sparsity, thresholding,cross-sectional correlation, heteroskedasticity, optimal weight matrix, interactive effect ∗ Department of Economics, Columbia University, New York, NY 10027. † Department of Mathematics, University of Maryland at College Park, College Park, MD 20742. Introduction
Estimating a high-dimensional covariance matrix has been an active research area in therecent literature. Many methods are proposed for estimating the covariance matrix andthe precision (inverse covariance) matrix, e.g. El Karoui (2008), Bickel and Levina (2008),Rothman et al. (2009), Lam and Fan (2009), Cai and Liu (2011), Fan et al. (2013). Amongmany theoretical results, rates of convergence under various interesting matrix norms havebeen derived. In particular, if we write N to denote the dimension and T to denote thesample size, when the N × N covariance matrix Σ is sparse whose eigenvalues are boundedaway from zero, we can obtain an estimator b Σ that achieves a near- √ T -rate under the operator norm : k b Σ − Σ k = O p ( m N ( log NT ) − q ) = k b Σ − − Σ − k (1.1)where m N and q are parameters that measure the level of sparsity. Cai and Zhou (2012)showed that the rate of convergence (1.1) is minimax optimal. However, there is relatively lessknowledge about the effect of estimating a high-dimensional covariance matrix on statisticalinferences, e.g., the estimation efficiency for a parametric model, and the effect of estimatinglarge covariances on the limiting distributions for estimators of some structural parameters.We find that when a high-dimensional covariance estimator is applied for statistical in-ferences (precisely, deriving limiting distributions of estimated structural parameters), mostof the results in the literature based on absolute convergence like (1.1) are not suitable, evenwith the minimax optimal rate. Instead, a “ weighted convergence ” is needed, which takesthe form k A ( b Σ − − Σ − ) A k , where both A , A are stochastic matrices that weight theestimation error b Σ − − Σ − . The weights A and A further “average down” the estimationerrors, which significantly improve the rate of convergence to make valid statistical infer-ences. However, the weighted convergence cannot be implied by the usual results in theliterature. One of our contributions is to tackle this problem.This paper focuses on two models that are of increasing importance in many statisticalapplications: factor analysis and panel data model with interactive effects. In factor analysis,the notion of sparsity is a natural assumption based on the factor structure, which is provedto be a successful approach (e.g., Boivin and Ng 2006, Phan 2012, Andersen et al. 2011).This paper gives a theoretical justification about how such a sparse structure can be used toimprove the estimation efficiency in two general models. Both problems involve estimatinga large weight matrix, where the problem of proving “weighted convergence” is present.2 .1 Approximate factor model We consider a high-dimensional approximate factor model: y it = λ ′ i f t + u it , i ≤ N, t ≤ T. (1.2)where f t is an r × λ i is a vector of factor loadings, and u it represents the error term, often known as the idiosyncratic component . If we denote Y t =( y t , ..., y Nt ) ′ , Λ = ( λ , ...., λ N ) ′ , and u t = ( u t , ..., u Nt ) ′ , model (1.2) can be written as Y t = Λ f t + u t , t ≤ T. Only Y t is observable in the model. In a data-rich environment, both N and T can be largeand the dimension N might be even much larger than T . The goal is to make efficientinference about λ ′ i f t , λ i , f t or their rotations.Approximate factor models often require the N × N covariance matrix Σ u = cov( u t )be non-diagonal matrix and the diagonal entries may vary over a large range (Chamberlainand Rothschild 1983). The traditional method of principal components (PC) essentiallytreats u it to be homoskedastic and uncorrelated over i . As a result, it is inefficient. In thispaper, we consider a weighted principal components (WPC) method to efficiently estimatethe heteroskedastic approximate factor models. The WPC solves a weighted least squaresproblem: min Λ ,f t T X t =1 ( Y t − Λ f t ) ′ W ( Y t − Λ f t ) (1.3)subject to certain normalization constraints. Here W is an N × N positive definite weightmatrix. We propose a feasible efficient WPC that requires consistently estimating the high-dimensional Σ − u (when N > T ) as the weight matrix, and is shown to be optimal over abroad class of estimators.
A closely related model is the panel data with a factor structure in the error term: y it = X ′ it β + ε it , ε it = λ ′ i f t + u it , i ≤ N, t ≤ T, (1.4)where X it is a d × β is a d × ε it has a factor structure with unknown loadings and factors. In the model,the only observables are ( y it , X it ). The goal is to estimate the structural parameter β , whose3imension is fixed. In this model, the factor component λ ′ i f t is regarded as an interactiveeffect of the individual and time effects. Because the regressor and factor can be correlated,simply regressing y it on X it is not consistent.Similarly, we propose to estimate β via:min β, Λ ,f t T X t =1 ( Y t − X ′ it β − Λ f t ) ′ W ( Y t − X ′ it β − Λ f t ) , (1.5)with a high-dimensional weight matrix W . The method is also WPC because the estimatedfactors are shown to be principal components of the weighted sample covariance matrix. Inparticular, it allows a consistent estimator for Σ − u as the optimal weight matrix even whenΣ − u is non-diagonal and N/T → ∞ . Except for sparsity, the off-diagonal structure of Σ u isunknown. The WPC takes into account both cross-sectional correlation and heteroskedas-ticity of u it over i , while the existing methods in the literature, e.g., Bai 2009, Moon andWeidner 2010, do not. First of all, we develop the inferential theory using a general high-dimensional weight W . This admits many promising choices of the weight matrices that are suitable for specificapplied problems for factor analysis. Especially, in cases where estimating Σ u is difficult,our inferential theory is still useful when suitable weight matrices are chosen to improve theestimation efficiency. Secondly, we show that when W = Σ − u is used, the WPC yields anoptimal estimator in the sense that the estimated common component λ ′ i f t and structuralparameter β have the minimum asymptotic variance over a broad class of estimators.Third, we focus on the effect of estimating large covariance matrices on efficient statisticalinferences. In both pure factor analysis and the large panel data with a factor structure,we employ a consistent estimator for Σ − u recently proposed by Fan et al. (2013), as anoperational weight matrix. Therefore, our optimal estimator is still feasible under N/T → ∞ .However, substituting a consistent estimator Σ − u is highly non-trivial when N > T . Aninteresting phenomenon is observed: most existing results on estimating large covariancesare not suitable for statistical inferences of the models being considered. We develop a newstrategy that investigates the weighted consistency for the estimated optimal weight matrixto address this problem.Fourth, we consistently estimate the asymptotic variances of the proposed estimators un-der both cross-sectional and serial correlations in u it . Hence the new WPC estimator for theinterative effect model is readily used for statistical inferences in practice. In contrast, exist-4ng methods usually require additionally modeling the large error covariance (e.g., assumingdiagonality, parametrizing the off-diagonal structure) in order for practical inferences.Given the popularity of the PC method, why do we need a new estimator to incorporatethe large covariance Σ u ? Even though most of the existing methods for panel data modelsavoid estimating Σ u , to demonstrate the potential efficiency loss for existing methods, wepresent a real-data application in Section 7, which studies the effect of divorce reform lawon the change of divorce rates. The WPC is applied to the year-state divorce rate dataof U.S. during 1956-1985. It illustrates that after incorporating Σ − u in the estimation,WPC captures the significant (negative) effects from nine to twelve years after the law wasreformed, consistent with the previous empirical findings in the social science literature. Incontrast, the existing method (PC) without estimating Σ − u would result in wide confidenceintervals and potentially conservative conclusions. Numerically, we find an average of 46%efficiency gained using WPC, relative to the existing method. In addition, the proposedWPC also enjoys the computational convenience, as it also admits analytical solutions.Realizing the limitation of the regular PC method, some important works have beendeveloped to improve the estimation efficiency for factor analysis, e.g., Breitung and Ten-hofen (2011), Bai and Li (2012) and Doz et al. (2012). They require the cross-sectionaldependences’ structure be specifically modeled. Recently, Choi (2012) specified W = Σ − u ,which essentially requires Σ u be known. Recently, Fan et al. (2013) proposed a thresholdingmethod to estimate Σ − u . They focused on covariance matrix estimations and did not addressthe efficient estimation for the factors, loadings and panel data models. As we discussed,replacing Σ − u with its consistent estimator is technically challenging when N/T → ∞ . Ad-ditional literature on factor analysis and panel data with interactive effects includes, e.g.,Pesaran (2006), Ahn et al. (2001), Su and Chen (2013), Su et al. (2012), Wang (2009), Forniet al. (2000), Hallin and Liˇska (2007), Lam and Yao (2012), Cheng and Hansen (2013), Canerand Han (2012), etc. None of these incorporated Σ − u or studied efficient estimation for paneldata models. We also remark that there is a rapidly growing literature on estimating high-dimensional (inverse) covariance matrices. Besides those mentioned, the list also includes,e.g., d’ Aspremont et al. (2008), Bien and Tibshirani (2011), Luo (2011), Pati et al. (2012),Xue et al. (2012), among many others.We assume the number of factors r = dim( f t ) to be known. When r is unknown, it canbe consistently estimated by certain information criteria as in, e.g., Bai and Ng (2002), aswe shall briefly discuss in Section 5.The rest of the paper is organized as follows. Section 2 describes the general problemof statistical inference based on large covariance matrices. Section 3 formally proposes the5PC method. The large-sample inferential theory of WPC with a general weight matrix ispresented. Section 4 introduces the efficient WPC. Section 5 applies the WPC method tothe panel data model with interactive effects. Section 6 illustrates numerical comparisonsof related methods. Section 7 applies WPC to a real data problem of divorce rate study.Finally, Section 8 concludes. All proofs are given in the supplementary material.Throughout the paper, we use λ min ( A ) and λ max ( A ) to denote the minimum and maxi-mum eigenvalues of matrix A . We also let k A k , k A k and k A k F denote the operator norm, L -norm and Frobenius norm of a matrix, defined as k A k = p λ max ( A ′ A ), k A k = max i P j | A ij | and k A k F = qP i,j A ij respectively. Note that if A is a vector, k A k = k A k F is equal to theEuclidean norm. Finally, for two sequences, we write a T ≪ b T (and equivalently b T ≫ a T )if a T = o ( b T ) as T → ∞ . Consider estimating a low-dimensional structural parameter θ that arises from a modelinvolving a high-dimensional covariance matrix Σ. It is often the case that when Σ wereknown, incorporating it in the estimator may achieve a better estimation accuracy, e.g.,smaller standard errors and tighter confidence intervals. Taking into account Σ, the estimatorcan be written as a function of the data D T and Σ as ( T denotes the sample size): b θ = f ( D T , Σ) , and the limiting distribution may be derived. In practice, we replace Σ by a consistentestimator b Σ and obtain a feasible efficient estimator f ( D T , b Σ).To show that replacing Σ with its consistent estimator does not affect the limiting distri-bution of b θ , one often needs a T ( f ( D T , Σ) − f ( D T , b Σ)) = o p (1) where a − T can be understoodas the rate of convergence of b θ . However, such a simple substitution is technically difficult if N > T . To see this, note that often f ( D T , Σ) depends on the precision matrix Σ − , and theeffect of estimating Σ − is approximately linearly dependent on b Σ − − Σ − . We can oftenwrite a T ( f ( D T , Σ) − f ( D T , b Σ)) = a T A ( b Σ − − Σ − ) A + o p (1)where A , A are typically non-sparse stochastic matrices of dimensions dim( θ ) × N and6 × a T k A ( b Σ − − Σ − ) A k ≤ a T k A kk A kk b Σ − − Σ − k . As both A and A are high-dimensional matrices (vectors), the right hand side of theabove inequality is typically not stochastically negligible even if the “absolute convergence” k b Σ − − Σ − k achieves the optimal convergence rate. The problem arises because k A k and k A k grow fast with the dimensionality, so they accumulate the estimation errors and leadto a crude bound.We further illustrate this issue in two examples, which are to be studied in detail in thepaper. Example 2.1.
Consider the high-dimensional factor model (1.2). The parameter of interestis the common component λ ′ i f t . The efficient estimation crucially depends on √ N Λ ′ ( b Σ − u − Σ − u ) u t , for a sparse covariance estimator b Σ − u . However, the existing results on the optimalconvergence of k b Σ − u − Σ − u k in the literature (e.g., Fan et al. 2013) are not applicabledirectly when N > T , because k Λ k = O ( √ N ) and k u t k = O p ( √ N ), but the minimax ratefor k b Σ − u − Σ − u k is no faster than O p ( T − / ). Applying the absolute convergence for b Σ − u , √ N k Λ kk b Σ − u − Σ − u kk u t k = O p ( q NT ) = o p (1) when N > T . Example 2.2.
Consider the high-dimensional panel data model (1.4). The efficient esti-mation of β requires estimating the inverse covariance Σ − u . Suppose e Σ − u is a consistentestimator. We require 1 √ N T Z ′ [( e Σ − u − Σ − u ) ⊗ I T ] U = o p (1) , where I T is a T -dimensional identity matrix and Z and U are stochastic matrices whosedimensions are N T × dim( β ) and N T × k Z k = O p ( √ N T ) = k U k , it is difficult to apply the absolute convergence k e Σ − u − Σ − u k (whose minimax rate is nofaster than O p ( T − / )) to achieve the desired convergence when N > T . The crude boundgives √ NT k Z kk e Σ − u − Σ − u kk U k = O p ( √ N ) = o p (1). (cid:3) As one of the main contributions of this paper, a new strategy of “weighted convergence”is developed. When analyzing a T A ( b Σ − − Σ − ) A , we should not separate the covarianceestimation error from the weighting matrices A , A . Intuitively, the weights further “averagedown” the estimation errors, to ensure the asymptotic negligibility of the weighted error. Wedemonstrate that the weighted convergence is useful for high-dimensional inferences in panel When Σ is sparse enough, one can obtain a near √ T -rate of convergence for the L -norm k b Σ − − Σ − k ,but this still yields a crude bound for a T A ( b Σ − − Σ − ) A . In model (1.2), the only observables are { Y t } Tt =1 , and both the factors { f t } Tt =1 and loadingsΛ = ( λ , ..., λ N ) ′ are parameters to estimate. We estimate them via the following weightedleast squares: ( b Λ , b f t ) = min Λ ,f t T X t =1 ( Y t − Λ f t ) ′ W T ( Y t − Λ f t ) (3.1)subject to: 1 T T X t =1 b f t b f ′ t = I r ; b Λ ′ W T b Λ is diagonal . (3.2)Here W T is an N × N weight matrix, which can be either stochastic or deterministic. When W T is stochastic, we mean W T to be a consistent estimator of some positive definite W underthe operator norm. We will show in Section 4 that the optimal weight is Σ − u . On the otherhand, keeping a general W T admits other choices of the weight for specific applied problems,especially when it is difficult to estimate the optimal weight matrix.Solving (3.1) subjected to the restriction (3.2) gives the WPC estimators: b λ j and b f t areboth r × T × r matrix b F / √ T = ( b f , ..., b f T ) ′ / √ T arethe eigenvectors corresponding to the largest r eigenvalues of Y ′ W T Y , and b Λ = T − Y b F =( b λ , ..., b λ N ) ′ . We call our method to be weighted principal components (WPC), to distinguishfrom the traditional principal components (PC) method that uses W T = I N . Note that PCdoes not encounter the problem of estimating large covariance matrices, and is not efficientwhen { u it } ’s are cross-sectionally correlated across i .It has been well known that the factors and loadings are not separably identifiable withoutfurther restrictions. The WPC estimates rotated factors and loadings with rotation matrix H W . Let b V be the r × r diagonal matrix of the first r largest eigenvalues of Y W T Y ′ / ( T N ). Let F = ( f , ..., f T ) ′ , then H W = b V − b F ′ F Λ ′ W T Λ / ( N T ). We use the subscript W to emphasizethe dependence of the rotation on W . 8 .2 General conditions We present general results for the proposed WPC with a general weight matrix, whichhold for a broad class of estimators. For the general weight matrix W and its data-dependentversion W T , the following assumption is needed: Assumption 3.1. (i) k W T − W k = o p (min { T − / , N − / , q NT , q TN log N } ) .(ii) k √ N Λ ′ ( W T − W ) u t k = o p (1) . Condition (i) is easy to satisfy by using many “good” covariance estimators given inthe literature. However, the main challenge described in Section 2 arises from provingcondition (ii) in the above assumption. When W T is a consistent estimator for Σ − u , we shallsee in Section 4.2 that this requires a new “weighted convergence”, which is necessary butchallenging to the high-dimensional inference problems being considered.We allow the factors and idiosyncratic components to be weakly serially dependent viathe strong mixing condition. Let F −∞ and F ∞ T denote the σ -algebras generated by { ( f t , u t ) : −∞ ≤ t ≤ } and { ( f t , u t ) : T ≤ t ≤ ∞} respectively. In addition, define the mixingcoefficient α ( T ) = sup A ∈F −∞ ,B ∈F ∞ T | P ( A ) P ( B ) − P ( AB ) | . (3.3) Assumption 3.2. (i) { u t , f t } t ≥ is strictly stationary. In addition, Eu it = Eu it f jt = 0 forall i ≤ p, j ≤ r and t ≤ T. (ii) There exist constants c , c > such that c < λ min (Σ u ) ≤ λ max (Σ u ) < c , max j ≤ N k λ j k
0. WriteΛ ′ W = ( d , ..., d N ), with each d i being an r × j ≤ N k d j k < ∞ . The following assumptions are standard in the literature. Assumption 3.3 requires thefactors be pervasive , which holds when the factors impact a non-vanishing proportion ofindividual time series. Assumption 3.4 extends similar conditions in Stock and Watson(2002) and Bai (2003). When W = I N is used, they reduce to those in the literature of theregular PC. A simple sufficient condition for Assumption 3.4 is that u it is i.i.d. in both i and t . 9 ssumption 3.3. All the eigenvalues of the r × r matrix Λ ′ Λ /N are bounded away fromboth zero and infinity as N → ∞ . Assumption 3.4. (i) E k √ NT P Ts =1 f s ( u ′ s W u t − Eu ′ s W u t ) k = O (1) .(ii) For each i ≤ N , E k √ NT P Tt =1 P Nj =1 d j ( u jt u it − Eu jt u it ) k = O (1) .(iii) For each i ≤ r , E k √ NT P Tt =1 P Nj =1 d j u jt f it k = O (1) .(iv) There is a constant δ ≥ and M > such that for all large N , E | √ N ( u ′ s W u t − Eu ′ s W u t ) | δ < M and E k √ N Λ ′ W u t k δ < M . The factors and loadings are two sets of parameters to estimate. The limiting distribu-tions of their estimators depend on the following asymptotic expansions, to be shown in theappendix: for some positive definite matrix J W , and the rotation matrix H W , √ N ( b f t − H W f t ) = J W Λ ′ W u t √ N + O p ( a T ) √ T ( b λ j − H ′ − W λ j ) = H W √ T T X t =1 f t u jt + O p ( b T ) . (3.4)where the asymptotic normality arises from the leading terms while a T and b T are someremaining stochastic sequences.The limiting distribution of b λ j requires H W to have a limit. We thus need the followingcondition: Assumption 3.5. (i) There is an r × r matrix Σ Λ such that Λ ′ W Λ /N → Σ Λ as N → ∞ .In addition, the eigenvalues of the Σ Λ cov( f t ) are distinct.(ii) For each t ≤ T , (Λ ′ W Σ u W Λ) − / Λ ′ W u t → d N (0 , I r ) . According to the expansions of (3.4), the above condition (ii) is almost a necessarycondition for the asymptotic normality of b f t . Note that √ N Λ ′ W u t = √ N P Ni =1 d i u it . Hencea cross-sectional central limit theorem can indeed apply. Condition (ii) is only for b f t , andthe limiting distribution of the estimated loading b λ j in Theorem 3.1 below does not dependon this condition.We now introduce some notation that are needed to present the limiting distributions.Let V be an r × r diagonal matrix with element as the largest r eigenvalues of Σ / cov( f t )Σ / ,and Γ W be the corresponding eigenvector matrix such that Γ ′ W Γ W = I r . We use the subscript W to indicate that Γ W depends on W via Σ Λ . Recall that Σ Λ is defined in Assumption 3.5.Let Q W = V / Γ ′ W Σ − / . In fact H W → p Q ′ − W . In addition, to account for the serial10orrelation over t , letΦ j = E ( f t f ′ t u jt ) + ∞ X t =1 E [( f f ′ t + f t f ′ ) u j u j, t ] . (3.5) Theorem 3.1.
Assume (log N ) = o ( T ) and T = o ( N ) . Under Assumptions 3.1-3.5(i),for each j ≤ N , √ T ( b λ j − H ′ − W λ j ) → d N (0 , Q ′ − W Φ j Q − W ) . If in addition, N = o ( T ) and Assumption 3.5(ii) holds, N ( V − Q W Λ ′ W Σ u W Λ Q ′ W V − ) − / ( b f t − H W f t ) → d N (0 , I r ) . For the common component, we have b λ ′ i b f t − λ ′ i f t ( λ ′ i Ξ W λ i /N + f ′ t Ω i f t /T ) / → d N (0 , . where Ξ W = Σ − Λ ′ W Σ u W ΛΣ − /N and Ω i = cov( f t ) − Φ i cov( f t ) − . Remark 3.1.
The eigenvalues of ( V − Q W Λ ′ W Σ u W Λ Q ′ W V − ) − / are of order O ( N − / ).Hence Theorem 3.1 implies the √ N -consistency of the estimated factors. If we furtherassume that Λ ′ W Σ u W Λ /N has a limit, say G , then immediately we have √ N ( b f t − H W f t ) → d N (0 , V − Q W GQ ′ W V − ) , where the √ N -consistency is more clearly demonstrated.The uniform convergence of b f t and b λ j are given below. Theorem 3.2.
Let α = max { /r , /r } with r , r defined in Assumption 3.2. Let δ ≥ be as defined in Assumption 3.4. Under Assumptions 3.1-3.4, as N, T → ∞ , max t ≤ T k b f t − H W f t k = O p (cid:18) (log T ) α k W T − W k + T /δ √ N + 1 √ T (cid:19) , (3.6)max j ≤ N k b λ j − H ′ − W λ j k = O p k W T − W k + 1 √ N + r log NT ! . (3.7) Remark 3.2.
The uniform convergence in (3.6) and (3.7) is important under large N and T . For example, in estimating large covariance matrices, it is used to derive the proper levelsof thresholding or shrinkage (e.g., Fan et al. 2013, Ledoit and Wolf 2012).11 .4 Heteroskedastic WPC As a simple choice for W , W = (diag(Σ u )) − . This choice improves the regular PC when cross-sectional heteroskedasticity is present. Thisweight can be easily estimated using the residuals. First apply the regular PC by taking W T = I N , and obtain a consistent estimator b C it of the common component λ ′ i f t for each i ≤ N, t ≤ T. Define W hT = diag { b σ − u, , ..., b σ − u,NN } , where b σ u,ii = 1 T T X t =1 ( y it − b C it ) . Then in the second step, apply the WPC with weight matrix W hT . The heteroskedastic WPC (which we call HWPC) method has been previously suggestedby, e.g., Breitung and Tenhofen (2011). Investigations of its theoretical properties can befound in the appendix. Moreover, numerical studies in Section 6 show that this methodimproves the efficiency relative to the regular PC method.
In the approximate factor models, u it ’s are correlated (over i ). A more efficient estimator(which we call EWPC) should take W = Σ − u as the weight matrix. This estimator has beenrecently suggested by Choi (2012), but Σ − u was assumed to be known.There are two main challenges in practice: on one hand, when N > T , Σ − u is hard toestimate as the sample covariance based on the residual b u it is no longer invertible. On theother hand, as we illustrated in Section 2, even if a consistent estimator for Σ − u is available,it is technically difficult to prove that the effect of covariance estimation is neglibile when N/T → ∞ . We first apply Fan et al. (2013)’s method to estimate Σ − u , and then addressthe second problem in Section 4.2. We apply a thresholded covariance estimator to estimate Σ − u , which is recently proposedby Fan et al. (2013) for factor analysis. Let ( ν j , ξ j ) Nj =1 be the eigenvalues-eignvectors of the12ample covariance S y of Y t , in a decreasing order such that ν ≥ ν ≥ ... ≥ ν N . Let R = S y − r X i =1 ν i ξ i ξ ′ i . Define a general thresholding function s ij ( z ) : R → R as in Rothman et al. (2009) and Caiand Liu (2011) with an entry-dependent threshold τ ij such that:(i) s ij ( z ) = 0 if | z | < τ ij ;(ii) | s ij ( z ) − z | ≤ τ ij . (iii) There are constants a > b > | s ij ( z ) − z | ≤ aτ ij if | z | > bτ ij .Examples of s ij ( z ) include the hard-thresholding: s ij ( z ) = zI ( | z | >τ ij ) ; SCAD (Fan and Li2001), MPC (Zhang 2010) etc. As for the threshold value, we specify τ ij = C p R ii R jj ω T , where ω T = r log NT + 1 √ N (4.1)for some pre-determined universal C >
0, chosen from cross-validation as in Fan et al.(2013). Then estimate Σ u by b Σ u = ( b Σ u,ij ) N × N , b Σ u,ij = R ii , i = js ij ( R ij ) , i = j , where R = ( R ij ) N × N . Intuitively, b Σ u thresholds off the small entries of the residual covariance T P Tt =1 b u t b u ′ t obtainedfrom the regular PC estimate.To apply such a weight estimator, we assume Σ u to be a sparse matrix. In an approximatefactor model, such a special structure is known to be conditionally sparse (given the commonfactors). Consider the notion of generalized sparsity: write Σ u = (Σ u,ij ) N × N . For some q ∈ [0 , / m N = max i ≤ N N X j =1 | Σ u,ij | q . (4.2)In particular, when q = 0, define m N = max i ≤ N P Nj =1 I (Σ u,ij =0) . Mathematically, the condi-tional sparse structure on Σ u assumes, there is q ∈ [0 , / m N = o min ( N / (cid:18) T log N (cid:19) (1 − q ) / , N / − q/ )! . (4.3)In the sparse covariance estimation literature, Condition (4.3) itself is enough to achievea covariance estimator such that k Σ − u − b Σ − u k = o p (1), whose rate of convergence is nearly13 T (e.g., Cai and Zhou 2012, Fan et al. 2013, etc.). But for the “weighted convergence”needed for efficient estimations in factor analysis and large panel data models, this conditionis not sufficient. Therefore, we introduce a more refined description of the sparse structureof Σ u (condition (ii) in Assumption 4.1 below), which is similar to those in Rothman et al.(2008).Let S L and S U denote two disjoint sets and respectively include the indices of small andlarge elements of Σ u in absolute value, and { ( i, j ) : i ≤ N, j ≤ N } = S L ∪ S U . (4.4)We assume ( i, i ) ∈ S U for all i ≤ N. The sparsity condition assumes that most of the indices( i, j ) belong to S L when i = j . A special case arises when Σ u is strictly sparse, in the sensethat its elements with small magnitudes ( S L ) are exactly zero. For the banded matrix as anexample, Σ u,ij = 0 if | i − j | > k for some fixed k ≥ . Then S L = { ( i, j ) : | i − j | > k } and S U = { ( i, j ) : | i − j | ≤ k } . Another example is the block-diagonal matrix.The following assumption mathematically defines the “conditional sparsity” for the ap-proximate factor model.Define ω T = r log NT + 1 √ N .
Assumption 4.1. (i) There is q ∈ [0 , / such that (4.3) holds.(ii) There is a partition { ( i, j ) : i ≤ N, j ≤ N } = S L ∪ S U such that P i = j, ( i,j ) ∈ S U O ( N ) and P ( i,j ) ∈ S L | Σ u,ij | = O (1) . In addition, max ( i,j ) ∈ S L | Σ u,ij | = O ( ω T ) , ω T = O ( min ( i,j ) ∈ S U | Σ u,ij | ) . If for example, Σ u is a block covariance matrix with finite block sizes, this assumption isnaturally satisfied as long as the signal is not too-weak (that is, ω T = o (min ( i,j ) ∈ S U | Σ u,ij | )).Condition (ii) requires the elements in S L and S U be well-separable. The partition { ( i, j ) : i ≤ N, j ≤ N } = S L ∪ S U may not be unique. Most importantly, we do not need to knoweither S L or S U ; hence the block size, the banding length, or the locations of the zero entriescan be completely unknown. Our analysis suffices as long as such a partition exists. We now formally discuss the issue brought by Assumption 3.1. In order for the effect ofestimating Σ − u to be negligible, k √ N Λ ′ ( b Σ − u − Σ − u ) u t k = o p (1) is required, which is a tight14ondition. However, a direct application of the optimal rate of convergence (i.e., Fan et al.2013, Cai and Zhou 2012) k b Σ − u − Σ − u k = O p ( m N ω − qT ) implies k √ N Λ ′ ( b Σ − u − Σ − u ) u t k ≤ √ N k Λ kk b Σ − u − Σ − u kk u t k = O p ( √ N m N ω − qT ) , which is O p (1 + p N (log N ) /T ) even if m N is bounded and q = 0. Hence this leads to acrude bound that does not converge. The problem is present even if Σ − u is estimated withthe optimal rate of convergence.We realize that such a technical problem is common for statistical inferences that involveestimating a high-dimensional covariance. In fact, most of the existing approaches in the lit-erature only produce “absolute convergence” k b Σ − u − Σ − u k . For statistical inference purposeslike the primary interest of this paper, however, the absolute convergence is not sufficientwhen N/T → ∞ .We propose a new technical strategy to solve this problem, by directly investigating the“weighted convergence” of the weighted error: k √ N Λ ′ ( b Σ − u − Σ − u ) u t k . (4.5)Intuitively, the weights Λ ′ and u t “average down” the estimation errors, and improve therate of convergence. Formal analysis requires us to re-investigate the asymptotic behaviorof the thresholded covariance estimator. We require the following technical assumption.Let Λ ′ Σ − u = ( ξ , ..., ξ N ). Assuming k Σ − u k = O (1), we then have max j ≤ N k ξ j k < C forsome C > . In addition, let e t = Σ − u u t , then e t has mean zero and covariance Σ − u . Assumption 4.2.
For each t ≤ T and k ≤ r ,(i) T √ N P Ni =1 P Ts =1 ( u is − Eu is ) ξ i e it = o p (1) (ii) NT √ N P Ni =1 P Ts =1 P Nj =1 ( u js u is − Eu js u is ) λ j λ ′ i e it ξ ik = o p (1) ,(iii) T √ N P i = j, ( i,j ) ∈ S U P Ts =1 ( u is u js − Eu is u js ) ξ i e jt = o p (1) ,(iv) NT √ N P i = j, ( i,j ) ∈ S U P Nv =1 P Ts =1 ( u is u vs − Eu is u vs ) ξ ik e jt λ v λ ′ j = o p (1) . The above conditions are new in the literature and essential to establish the weighedconvergence. The intuition of these conditions is that, the weighted average of the standard-ized sum √ T P Tt =1 ( u it u jt − Eu it u jt ) is o p (1) once averaged across i and j . The extra term N appeared in NT √ N of Conditions (ii) and (iv) is a scaling factor because under the sparsitycondition, the number of summands of P Ni =1 and P i = j, ( i,j ) ∈ S U is at most O ( N ) (e.g., in blockdiagonal and banded matrices).We verify the key assumption 4.2 in the following lemma, when { u t } t ≤ T is serially inde-pendent. We require N = o ( T ) but still allow N/T → ∞ .15 emma 4.1.
Suppose { u it } t ≤ T is independent across t (but can still be correlated across i ),and the sparse condition Assumption 4.1 holds. Then when N = o ( T ) , Assumption 4.2 issatisfied. We have the following weighted consistency for the estimated weight matrix, which as wehave explained, cannot be implied directly by the absolute convergence k b Σ − u − Σ − u k evenwhen Σ u is diagonal. As one of the main contributions of this paper, result of this type ispotentially widely useful for high-dimensional inferences when large covariance estimation isinvolved. Proposition 4.1.
Suppose √ N m N ω − qT = o (1) , and Assumptions 3.2- 3.5, 4.1, 4.2 hold.For q , m N and ω T defined in (4.2) and (4.1), and for each t ≤ T , we have k √ N Λ ′ ( b Σ − u − Σ − u ) u t k = o p (1) . Therefore Assumption 3.1 is satisfied for W = Σ − u . Remark 4.1.
Consider a strictly sparse case where m N = max i ≤ N P Nj =1 I (Σ u,ij = 0) = O (1). The condition in the theorem √ N m N ω − qT = o (1)then holds as long as √ N log N = o ( T ). As always the case, requiring N = o ( T ) is neededfor the asymptotic normality of b f t . We use W T = b Σ − u as the feasible weight matrix. Let the columns of the T × r matrix b F e / √ T = ( b f e , ..., b f eT ) ′ / √ T be the eigenvectors corresponding to the largest r eigenvalues of Y ′ b Σ − u Y , and b Λ e = T − Y b F e = ( b λ e , ..., b λ eN ) ′ . Here the superscript e denotes “efficient” WPC.We denote Σ Λ ,e as the limit of Λ ′ Σ − u Λ /N . Let V e be an r × r diagonal matrix with elementsas the largest r eigenvalues of Σ / ,e cov( f t )Σ / ,e , and Γ e be the corresponding eigenvectormatrix such that Γ ′ e Γ e = I r . In addition, let Q e = V / e Γ ′ e Σ − / ,e . We have the followinglimiting distributions for the estimated factors and loadings. Theorem 4.1.
Under the assumptions of Proposition 4.1, for each t ≤ T and j ≤ N , √ T ( b λ ej − H ′ − e λ j ) → d N (0 , Q ′ − e Φ j Q − e ) . √ N ( b f et − H e f t ) → d N (0 , V − e ) . here Φ j is as defined in (3.5). In addition, for the estimated common component, b λ e ′ i b f et − λ ′ i f t ( λ ′ i Ξ e λ i /N + f ′ t Ω i f t /T ) / → d N (0 , . where Ξ e = (Λ ′ Σ − u Λ /N ) − and Ω i is defined as in Theorem 3.1. For completeness, the following result gives the uniform rate of convergence.
Theorem 4.2.
Suppose N / (2 − q ) log N = o ( T ) and T = o ( N ) . Under the assumptions ofTheorem 4.1, there is an r × r matrix H e such that max t ≤ T k b f et − H e f t k = O p (cid:18) T /δ √ N + (log T ) α m N ω − qT (cid:19) , max j ≤ N k b λ ej − H ′ − e λ j k = O p (cid:0) m N ω − qT (cid:1) . Remark 4.2.
Typically in the strictly sparse case m N = O (1) and q = 0. When N/T → ∞ ,the above rates become:max t ≤ T k b f et − H e f t k = O p (cid:18) T /δ √ N + (log T ) α √ log N √ T (cid:19) , max j ≤ N k b λ ej − H ′ − e λ j k = O p r log NT ! . Regular PC, heteroskedastic WPC and the efficient WPC minimize different objectivefunctions, depending on the choices of the weight matrix. Thus the estimated b F / √ T are theeigenvectors from three different matrices. Table 1 summarizes the main differences of theestimators.A natural question arises: is the consistent estimator for W = Σ − u indeed the optimalchoice over a broad class of positive definite weight matrices? One can answer this questionvia looking at the asymptotic variance of the estimators, as choosing the optimal weightfor GMM (Hansen 1982). However, because WPC estimators are estimating rotated factorsand loadings, the rotation depends on the choice of W. But regardless of the choice W , thecommon component λ ′ i f t is always directly estimated. The following result demonstratesthat W T = b Σ − u yields the minimum asymptotic variance of b λ ′ i b f t among WPC estimators. Theorem 4.3.
Let ( λ ′ i Ξ e λ i /N + f ′ t Ω i f t /T ) denote the asymptotic variance of b λ e ′ i b f et basedon b Σ − u as in Theorem 4.1. For any positive definite matrix W , let ( λ ′ i Ξ W λ i /N + f ′ t Ω i f t /T )17able 1: Three interesting choices of W Objective function Eigenvectors of W regular PC P Tt =1 ( Y t − Λ f t ) ′ ( Y t − Λ f t ) Y ′ Y I r HWPC P Tt =1 ( Y t − Λ f t ) ′ diag( b Σ u ) − ( Y t − Λ f t ) Y ′ diag( b Σ u ) − Y diag(Σ u ) − EWPC P Tt =1 ( Y t − Λ f t ) ′ b Σ − u ( Y t − Λ f t ) Y ′ b Σ − u Y Σ − u The estimated b F / √ T is the eigenvectors of the largest r eigenvalues of Y ′ W T Y , and b Λ = T − Y b F .HWPC represents the heteroskedastic WPC; EWPC represents the efficient WPC. denote the asymptotic variance of b λ ′ i b f t as in Theorem 3.1 based on W . Then for each i ≤ N and t ≤ T , λ ′ i Ξ e λ i /N + f ′ t Ω i f t /T ≤ λ ′ i Ξ W λ i /N + f ′ t Ω i f t /T. In fact, for all large N , Ξ W − Ξ e is semi-positive definite for each positive definite matrix W . We derive consistent estimators for the asymptotic variances that appeared in Theorem4.1. Hence the derived optimal limiting distributions can be used for statistical inferences.These estimators account for the serial and cross-sectional correlations of the data in both i and t .The factor estimator has an asymptotic expansion: √ N ( b f et − H e f t ) = b V − b F e ′ FT Λ ′ Σ − u u t √ N + o p (1)where b V is the r × r diagonal matrix of the first r largest eigenvalues of T N Y b Σ − u Y ′ . Theorem4.1 shows that the asymptotic variance is V − e . Hence, b V − b F e ′ FT Λ ′ Σ − u Λ N F ′ b F e T b V − → p V − e (4.6)The left hand side involves the product F Λ ′ , which can be estimated by b F e b Λ e ′ . A consistentestimator of V − e is then given by (note that T b F e ′ b F e = I r ) b V − e = b V − b F e ′ b F e T b Λ e ′ b Σ − u b Λ e N b F e ′ b F e T b V − = 1 N b V − b Λ e ′ b Σ − u b Λ e b V − . √ T ( b λ j − H ′ − e λ j ) = 1 √ T T X t =1 H e f t u jt + o p (1) . Here H e f t u jt can be estimated by b f et b u jt , where b u jt is a WPC estimator of the error term (e.g., b u jt = y it − b λ e ′ j b f et ). We apply the HAC (heteroskedasticity and autocorrelation consistent)estimator of Newey and West (1987) to estimate Q ′ − e Φ j Q − e , the asymptotic variance of √ T ( b λ j − H ′ − e λ j ), based on the sequence { b f et b u jt } : b Ψ j = 1 T T X t =1 b u jt b f et b f e ′ t + K X l =1 (1 − lK + 1 ) 1 T T X t = l +1 b u jt b u j,t − l ( b f et b f e ′ t − l + b f et − l b f e ′ t ) , where K = K T,N → ∞ is an increasing sequence such that K = o (min { T / , N / } ) . The advantages of using the HAC estimator are: it accounts forthe serial correlations of { f t u ′ t } t ≥ , and it also guarantees the positive semi-definiteness forany given finite sample as shown by Newey and West (1987).The asymptotic variance of the common component in Theorem 4.1 consists of λ ′ i Ξ e λ i and f ′ t Ω i f t , where Ξ e = ( N Λ ′ Σ − u Λ) − andΩ i = cov( f t ) − Φ i cov( f t ) − . We respectively estimate them by b Θ i = 1 N b λ e ′ i b V − b Λ e ′ b Σ − u b Λ e b V − b λ ei , b Θ ,it = b f e ′ t b Ψ i b f et . Theorem 4.4.
Under the assumptions of Theorem 4.2, as T , N → ∞ , and K = K T,N = o (min { T / , N / } ) , b V − e → p V − e , b Ψ j → p Q ′ − e Φ j Q − e , b Θ i → p λ ′ i Ξ e λ i , b Θ ,it → p f ′ t Ω i f t . These covariance estimators can be easily computed.
The factor model we have considered so far is closely related to the following panel datamodel: y it = X ′ it β + ε it , ε it = λ ′ i f t + u it , i ≤ N, t ≤ T (5.1)19he regression noise has a factor structure with unknown λ i and f t , and u it still representsthe idiosyncratic error component. It is assumed that u it is independent of ( X it , f t ). In themodel, the only observables are ( y it , X it ). The goal is to estimate β , the structural parameterof the model.Substituting the second equation to the first one in (5.1), we obtain y it = X ′ it β + λ ′ i f t + u it . (5.2)If we treat λ i as the “individual effect” and f t as the “time effect”, then the factor structure λ ′ i f t represents the interaction between the individual and time effects, so called “interactiveeffect”. This model was previously studied by, e.g., Ahn et al. (2001), Pesaran (2006), Bai(2009), Moon and Weidner (2010).The difficulty of estimating β is that, in many applied problems the regressor X it iscorrelated with the time effect (common factor) f t , individual effect λ i , or both. As a result, X it and ε it are also correlated, so regressing y it directly on X it cannot produce a consistentestimator for β . In addition, existing methods ignore the heteroskedasticity and correlationin { u it } i ≤ N . Hence efficiency is lost, for instance, when Σ u is non-diagonal or its diagonalentries vary over a large range. We shall illustrate the consequence of efficiency loss using areal data application in Section 7. β Let X t = ( X t , ..., X Nt ) ′ , ( N × d ). We estimate β viamin β,f t , Λ T X t =1 ( Y t − Λ f t − X t β ) ′ W ( Y t − Λ f t − X t β ) , (5.3)for some positive definite N × N weight matrix. Similar to the generalized least squaresestimator (GLS) for linear regressions, we choose the weight matrix to be W = Σ − u . This choice produces similar estimators as the efficient WPC. The estimator is feasible oncewe consistently estimate Σ − u , which can be done under the assumption that Σ u is sparse.Suppose e Σ − u is a consistent covariance estimator. The feasible WPC estimates β by:ˆ β = arg min β min f t , Λ T X t =1 ( Y t − Λ f t − X t β ) ′ e Σ − u ( Y t − Λ f t − X t β ) , (5.4)20here the minimization is subjected to the constraint T P Tt =1 f t f ′ t /T = I r and Λ ′ e Σ − u Λ beingdiagonal. The estimated β for each given (Λ , f t ) is simply β (Λ , f t ) = ( T X t =1 X ′ t e Σ − u X t ) − T X t =1 X ′ t e Σ − u ( Y t − Λ f t ) . On the other hand, given β , the variable Y t − X t β has a factor structure. Hence the estimated(Λ , f t ) are the WPC estimators: let X ( ˆ β ) be an N × T matrix X ( ˆ β ) = ( X ˆ β, ..., X T ˆ β ) . Thecolumns of the T × r matrix e F / √ T = ( e f , ..., e f T ) ′ / √ T are the eigenvectors corresponding tothe largest r eigenvalues of ( Y − X ( ˆ β )) ′ e Σ − u ( Y − X ( ˆ β )), and e Λ = T − ( Y − X ( ˆ β )) e F .
Therefore,given (Λ , f t ), we can estimate β , and given β , we can estimate (Λ , f t ). So ˆ β can be simplyobtained by iterations, with an initial value ˆ β . This iteration scheme only requires twomatrix inverses: e Σ − u and ( P Tt =1 X ′ t e Σ − u X t ) − , which do not update during iterations. Basedon our experience of numerical studies, the iterations converge fast.Similar to Fan et al. (2013), the covariance estimator can be constructed based onthresholding. Let ˆ β be a “regular PC estimator” that takes W = I N in (5.3), which isknown to be √ N T -consistent (e.g., Bai 2009, Moon and Weidner 2010). Apply the singularvalue decomposition to 1 T T X t =1 ( Y t − X t ˆ β )( Y t − X t ˆ β ) ′ = N X i =1 ν i g i g ′ i , where ( ν j , g j ) Nj =1 are the eigenvalues-eigenvectors of T P Tt =1 ( Y t − X t ˆ β )( Y t − X t ˆ β ) ′ in adecreasing order such that ν ≥ ν ≥ ... ≥ ν N . Then e Σ u = ( e Σ u,ij ) N × N , e Σ u,ij = e R ii , i = js ij ( e R ij ) , i = j , e R = ( e R ij ) N × N = N X i = r +1 ν i g i g ′ i , where s ij ( · ) is the same thresholding function as defined in Section 4.2 with the same thresh-old τ ij . Rearrange the design matrix Z = ( X , ..., X T , X , ..., X T , ..., X N , ..., X NT ) ′ , N T × d. T × r matrix F , let M F = I T − F ( F ′ F ) − F ′ /T . The following matrices play animportant role in the identification and asymptotic analysis: A F = h Σ − u − Σ − u Λ (cid:0) Λ ′ Σ − u Λ (cid:1) − Λ ′ Σ − u i ⊗ M F ,V ( F ) = 1 N T Z ′ A F Z, (5.5)where (Λ , Σ − u ) in the above represent the true loading matrix and inverse error covariancein the data generating process, and ⊗ denotes the Kronecker product. Our first conditionassumes that V ( F ) is positive definite in the limit uniformly over a class of F . Assumption 5.1.
With probability approaching one, inf F : F ′ F/T = I r λ min ( V ( F )) > . If we write B F = h Σ − / u − Σ − u Λ (Λ ′ Σ − u Λ) − Λ ′ Σ − / u i ⊗ M F , then A F = B F B ′ F . So V ( F )is at least semi-positive definite. Also, summing over N T rows of Z should lead to a strictlypositive definite matrix V ( F ). As a sufficient condition, if X it depends on the factors andloadings through: X it = τ i + θ t + r X k =1 a k λ ik + r X k =1 b k f kt + r X k =1 c k λ ik f kt + η it where a k , b k , c k are constants (can be zero) and η it is i.i.d. over both i and t , then Assumption5.1 is satisfied (see Bai 2009).Let U = ( u , ..., u T , u , ..., u T , ..., u N , ..., u NT ) ′ , and F be the T × r matrix of truefactors. Assumption 5.2.
There is a dim( β ) × dim( β ) positive definite matrix Γ such that V ( F ) → p Γ , √ N T Z ′ A F U → d N (0 , Γ) . This assumption is required for the asymptotic normality of ˆ β , because it can be shownthat, √ N T ( ˆ β − β ) = V ( F ) − √ N T Z ′ A F U + o p (1) . Hence the asymptotic normality depends on that of √ NT Z ′ A F U . Assumption 5.2 isnot stringent because if we write B ′ F U = ( e u , ..., e u T , e u , ..., e u NT ) ′ , and Z ′ B F =( e Z , ..., e Z T , e Z , ..., e Z NT ), then 22 √ NT Z ′ A F U = √ NT P Tt =1 P Ni =1 e Z it e u it is a standardized summation. We can further write √ N T ( ˆ β − β ) = N T T X t =1 N X i =1 e Z it e Z ′ it ! − √ N T T X t =1 N X i =1 e Z it e u it + o p (1) . Hence the second statement of Assumption 5.2 is a central limit theorem for √ NT P Tt =1 P Ni =1 e Z it e u it on both cross-sectional and time domains. In addition, in the ab-sence of serial correlation, the conditional covariance of √ NT Z ′ A F U given Z and F equals NT Z ′ A F (Σ u ⊗ I T ) A F Z = V ( F ). This implies that the asymptotic variance of √ N T ( ˆ β − β )is simply Γ − . The issue described in Section 2 arises in establishing1 √ N T Z ′ [( e Σ − u − Σ − u ) ⊗ I T ] U = o p (1) , (5.6)which is the effect of estimating the large covariance Σ − u . In fact, the first order conditionof ˆ β leads to √ N T ( ˆ β − β ) = V ( F ) − √ N T Z ′ b AU + o p (1) , where b A is as A F with Σ − u replaced with e Σ − u and F replaced with e F . Hence we need1 √ N T Z ′ ( b A − A F ) U = o p (1) . (5.7)This requires the weighted convergence (5.6). However, when N/T → ∞ , achieving (5.6)is technically difficult. Similar to the case described in the approximate factor model, theabsolute convergence of k e Σ − u − Σ − u k is not suitable for inferences.We consider the Gaussian case for simplicty, and the problem is still highly technicallyinvolved. Non-Gaussian case will be even more challenging, and we shall leave it for futureresearch. Assumption 5.3. (i) u t is distributed as N (0 , Σ u ) .(ii) { u t } t ≥ is independent of { f t , X t } t ≥ , and { u t , f t , X t } are serially independent across t . It is possible to relax Condition (ii) to allow for serial correlations, but ˆ β will be asymp-totically biased. 23 .4 Limiting distribution We require the same conditions on the data generating process for the factors, loadingsand the sparsity of Σ u as in Sections 2 and 4. Proposition 5.1.
Under Assumptions 3.2- 3.4, 4.1, 5.1-5.3, as
N/T → ∞ , and m N = o ( T ) , we have the weighted convergence: √ N T Z ′ ( b A − A F ) U = o p (1) . We have the following limiting distribution.
Theorem 5.1.
Under the assumptions of Proposition 5.1, the asymptotic limiting distribu-tion of ˆ β is the same when either W = Σ − u or the feasible weight W T = e Σ − u is used as theweight matrix, and is given by √ N T ( ˆ β − β ) → d N (0 , Γ − ) . The asymptotic variance Γ − is the limit of V ( F ) − . Note that under the same set ofconditions, the regular PC method of Bai (2009) and Moon and Weidner (2010) gives anasymptotic conditional covariance (given Z, F ) of the sandwich-formula: V ≡ ( 1 N T Z ′ GZ ) − N T Z ′ G (Σ u ⊗ I T ) GZ ( 1 N T Z ′ GZ ) − , where G is defined as A F with Σ − u replaced with I N . It is not hard to show that V − V ( F ) − is semi-positive definite. So relative efficiency is gained when WPC is used. In fact, the choice W = e Σ − u is also the optimal weight matrix for WPC in this case.To estimate the asymptotic variance of ˆ β , let e A equal A F with F , Λ and Σ − u replacedwith e F , e Λ and e Σ − u . Define e Γ = NT Z ′ e AZ . The following result enables us to constructconfidence intervals and conduct hypothesis tests for β under large samples. Theorem 5.2.
Under the assumptions of Theorem 5.1, e Γ − → p Γ − . The methods of Section 4 also carry over to derive the limiting distributions of theestimated interactive effects λ ′ i f t . The procedure and corresponding results are very similargiven the √ N T -consistency of ˆ β . Hence we omit repeated discussions.24 .5 Estimation with unknown number of factors For simplicity of presentations, we have assumed the number of factors r to be known.As was shown by many authors, estimation results are often robust to over-estimating r .For instance, Moon and Weidner (2011) have shown that for inference on the regressioncoefficients one does not need to estimate r consistently, as long as the “working number” isnot less than the true value. On the other hand, we can also start with a consistent estimatorˆ r using a similar method of Bai and Ng (2002) and Bai (2009).Specifically, suppose there is a known upper bound ¯ r of the number of factors. For each k ≤ ¯ r , define b σ ( k ) = min β, Λ k ,f t,k N T T X t =1 ( Y t − Λ ′ k f t,k − X t β ) ′ ( Y t − Λ ′ k f t,k − X t β )where each row of Λ k is a k -dimensional loading vector, and f t,k is also k -dimensional.The above minimization is subject to the constraint that T P Tt =1 f t,k f ′ t,k = I k and Λ ′ k Λ k isdiagonal. The iterative algorithm based on principal components can calculate the aboveminimization fast. Under our conditions, Bai (2009) showed that r can be consistentlyestimated by either minimizing CP( k ) or IC( k ), whereCP( k ) = b σ ( k ) + b σ (¯ k )[ k ( N + T ) − k ] log( N T ) N T , and IC( k ) = log b σ ( k ) + [ k ( N + T ) − k ] log( N T ) N T .
We then can apply the estimator ˆ r to construct the WPC estimator, and achieve the samelimiting distributions. Estimation procedure and its theoretical properties can be proved tobe the same as before, so details are not presented to avoid repetition. We conduct numerical experiments to compare the proposed WPC with the popularmethods in the literature . The idiosyncratic error terms are generated as follows: let We have written a Matlab code to implement the proposed WPC for any user-specified weight matrixas well as the optimal WPC for both the factor model and panel data model with interactive effects, availableupon request. ǫ it } i ≤ N,t ≤ T be i.i.d. N (0 ,
1) in both t, i . Let u t = ǫ t , u t = ǫ t + a ǫ t , u t = ǫ t + a ǫ t + b ǫ t ,u i +1 ,t = ǫ i +1 ,t + a i ǫ it + b i − ǫ i − ,t + c i − ǫ i − ,t , where { a i , b i , c i } Ni =1 are i.i.d. N (0 , u is a banded matrix, possessing both cross-sectional correlation and heteroskedasticity. Let the two factors { f t , f t } be i.i.d. N (0 , { λ i, , λ i, } i ≤ N be uniform on [0 , R as suggested by Fan et al. (2013). Design 1
Consider the pure factor model y it = λ i f ,t + λ i, f t + u it , where we estimate the factorloadings { λ i, , λ i, } i ≤ N and factors { f t , f t } . For each estimator, the smallest canonicalcorrelation (the larger the better) between the estimators and parameters are calculated,as an assessment of the estimation accuracy. The simulation is replicated for one hundredtimes, and the average canonical correlations for several competing methods are reported inTable 2. The mean squared error of the estimated common components are also compared.Table 2: Canonical correlations for simulation studyLoadings Factors ( NT P i,t ( b λ ′ i b f t − λ ′ i f t ) ) / T N
PC HWPC EWPC PC HWPC EWPC PC HWPC EWPC(the larger the better) (the larger the better) (the smaller the better)50 75 0.346 0.429 0.487 0.403 0.508 0.566 0.621 0.583 0.54550 100 0.411 0.508 0.553 0.476 0.602 0.666 0.546 0.524 0.49850 150 0.522 0.561 0.602 0.611 0.679 0.746 0.467 0.444 0.427100 80 0.433 0.545 0.631 0.427 0.551 0.652 0.570 0.540 0.496100 150 0.613 0.761 0.807 0.661 0.835 0.902 0.385 0.346 0.307100 200 0.751 0.797 0.822 0.827 0.882 0.924 0.333 0.312 0.284150 100 0.380 0.558 0.738 0.371 0.557 0.749 0.443 0.394 0.334150 200 0.836 0.865 0.885 0.853 0.897 0.942 0.313 0.276 0.240150 300 0.882 0.892 0.901 0.927 0.946 0.973 0.257 0.243 0.222
The columns of loadings and factors report the canonical correlations. PC is the regular principalcomponents method; HWPC represents the heteroskedastic WPC; EWPC uses b Σ − u as the weightmatrix. We see that the estimation becomes more accurate when we increase the dimensionality.26WPC improves the regular PC, while the EWPC gives the best estimation results.
Design 2
Adding a regression term to the model of Design 1, we consider the panel data modelwith interactive effect: y it = X ′ it β + λ i f ,t + λ i, f t + u it , where the true β = (1 , ′ . Theregressors are generated to be dependent on ( f t , λ i ): X it, = 2 . λ i f ,t − . λ i f ,t − η it, , X it, = λ i f ,t − λ i f ,t + 1 + η it, where η it, and η it, are independent i.i.d. standard normal.Both the methods PC (Bai 2009 and Moon and Weidner 2011) and the proposed WPCare carried out to estimate β for the comparison. Also compared is the mean squared errorof the estimated common components. The simulation is replicated for one hundred times;results are summarized in Table 3. We see that both methods are almost unbiased, while theefficient WPC indeed has significantly smaller standard errors than the regular PC methodin the panel model with interactive effects.Table 3: Method comparison for the panel data with interactive effects, simulation β = 1 β = 3Mean Normalized SE Mean Normalized SE T N
WPC PC WPC PC WPC PC WPC PC50 75 1.005 1.013 0.758 1.413 2.998 3.002 0.744 1.47250 100 1.005 1.010 0.662 1.606 2.997 2.998 0.731 1.61650 150 1.004 1.008 0.964 1.913 2.999 2.999 0.951 1.881100 100 1.002 1.010 0.550 1.418 3.000 3.003 0.416 1.353100 150 1.003 1.007 0.681 1.626 2.999 3.000 0.611 1.683100 200 1.002 1.005 0.631 1.800 3.000 3.000 0.774 1.752150 100 1.003 1.006 0.772 1.399 3.000 2.999 0.714 1.458150 150 1.001 1.005 0.359 1.318 3.000 3.001 0.408 1.379150 200 1.001 1.003 0.547 1.566 3.000 3.000 0.602 1.762
WPC (with weight e Σ − u ) and PC (existing method) comparison. “Mean” is the average of theestimators; “Normalized SE” is the standard error of the estimators multiplied by √ N T . Empirical Study : Effects of Divorce Law Reforms
This section shows the advantages of our proposed WPC method in a real data applica-tion. It demonstrates the gain of incorporating the estimated Σ u in the panel data estimationand the efficiency gains compared to the traditional PC. An important question in sociology is the cause of the sharp increase in the U.S. divorcerate in the 1960s and 1970s. The association between divorce rates and divorce law reformshas been considered a potential key, and during 1970s, about three quarters of states in theU.S. liberalized their divorce system, so-called “no-fault revolution”. There is plenty empir-ical research regarding the effects of divorce law reforms on the divorce rates (e.g., Peters1986, Allen 1992), and statistical significance of these effects has been found (e.g., Friedberg1998). In other words, states’ law reforms are found to have significantly contributed to theincrease in state-level divorce rates within the first eight years following reforms.On the other hand, there is a puzzle about longer effects. Empirical evidence also il-lustrates the subsequent decrease of the divorce rates starting from (around) 1975, whichis between nine and fourteen years after the law reforms in most states. So whether lawreforms continue to contribute to the rate decrease has been an interesting question. Wolfers(2006) studied a treatment effect panel data model, and identified negative effects for thesubsequent years. This suggests that, the increase in divorce following reform and the sub-sequent decrease may be two sides of the same treatment: after earlier dissolution of badmatches after law reforms, marital relations were gradually affected and changed. However,it has been argued that Wolfers (2006)’s approach may not capture the complex unobservedheterogeneity. The heterogeneity may exist through an interactive effect, where unobservedcommon factors may change over time.Kim and Oka (2013) pioneered using interactive effect model for the study: y it = K X k =1 X it,k β k + λ ′ i f t + µ i + α t + f ( δ i , t ) + u it , (7.1)where y it is the divorce rate for state i in year t ; X it,k is a binary regressor, representing thetreatment effect 2 k years after the reform. Specifically, we observe the law reform year T i for each state. Then X it,k = 1 if 2 k − ≤ t − T i ≤ k , and zero otherwise. In addition to theinteractive effect λ ′ i f t as being discussed, the model also contains unobserved state and timeeffects ( µ i , α t ) and time trend f ( δ i , t ). For instance, the linear trend defines f ( δ i , t ) = δ i t δ i . Using the regular PC method, Kim and Oka (2013) concludedinsignificant ( β , ..., β ), that is, the divorce rates after eight years and beyond are not affectedby the reforms. However, We argue that using the regular PC method to estimate the modelmay lose efficiency because it ignores the off-diagonal entries. As a result, this can result inwide confidence intervals and possibly conservative conclusions.We re-estimate Kim and Oka (2013)’s model using the new WPC, and compare with theregular PC. As a first step, we rewrite the model to fit in the form being considered in thispaper. Introduce the conventional notation:˙ y it = y it − T T X t =1 y it − N N X i =1 y it + 1 N T N X i =1 T X t =1 y it . Let ˙ X it,k , ˙ u it be defined similarly. If the time trend f ( δ i , t ) is not present, under theconventional normalizations P Ni =1 λ i = P Tt =1 f t = 0 , P Ni =1 µ i = P Tt =1 α t , we have ˙ y it =˙ X ′ it β + λ ′ i f t + ˙ u it . The same data as in Wolfers (2006) and Kim and Oka (2013) are used, which contain thedivorce rates, state-level reform years and binary regressors from 1956 to 1988 ( T = 33) over48 states. We fit the models both with and without linear time trend, and apply regular PCand our proposed WPC in each model to estimate β with confidence intervals. The numberof factors is selected in a data-driven way as in Bai (2009). His IC and CP both suggestedten factors. Moreover, for the WPC, the threshold value in the estimated covariance isobtained using the suggested cross-validation procedure in Fan et al. (2013). The estimated( β , ..., β ) and their confidence intervals are summarized in Table 4.Both models produce similar estimates. Interestingly, WPC confirms that the law reformssignificantly contribute to the subsequent decrease of the divorce rates, more specifically, 9-14 years after the reform in the model with linear time trends, and 11-14 years after in themodel without linear time trends. In contrast, the regular PC reaches a more conservativeconclusion as it does not capture these significant negative effects. Moreover, both methodsshow that the effect on the increase of divorce rates for the first 6 years are significant, whichis consistent with previous findings in this literature.We also report the relative efficiency using WPC, relative to the regular PC. The reportednumbers are var(WPC) / var(PC), where var(A) calculates the estimated variance of the When the time trend is present, we can do a simple projection to eliminate the time trend, and stillestimate the untransformed β from the familiar interactive effect model. For instance, suppose f ( δ i , t ) = δ i t .Let M = (1 , , ..., T ) ′ and P M = I T − M ( M ′ M ) − M ′ . We can define e Y i = P M ( y i , ..., y iT ) ′ , and e X i = P M ( X i , ..., X iT ) ′ , and define ˙ e y it and ˙ e X it accordingly from e y it and e X it . This is the same as in Kim and Oka (2013). We also tried a few larger values for r , and the estimatesare similar, consistent with previous findings that the estimation is robust to over-estimating r .
95% confidence intervals are reported; intervals with * are significant. Relative efficiency isreferred to WPC relative to PC, as estimated var(
W P C ) / var( P C ) . Let us further demonstrate the relative efficiency WPC gains by incorporating the esti-mated Σ − u through simulated data. The true parameters are estimated from the real dataas described above. Specifically, we use the first column from Table 4 (no linear trend) asthe true β , and the corresponding estimated Λ as the true loading matrix. We fix N = 48as before. To pertain the actual cross-sectional dependence, in the simulation, the true errorterms, factors, and regressors are generated as simple random samples (with replacement)of size T from the estimated residuals, factors and regressors from the real data.Simulations are conducted with one hundred replications. The averages and the standarddeviations for each estimated component are reported in Table 5, representing the bias andstandard error. Also reported is the relative efficiency, defined as var(WPC) / var(PC). It isclearly shown in the table that the standard errors of WPC are uniformly smaller than thoseof PC. In addition, most of the time WPC also reduces the finite sample bias. The relativeefficiency varies from 39% to 52%, which illustrates 48%-61% efficiency gain. Overall, afterincorporating the error covariance, the performance of the estimator is significantly improved. The literature on estimating high-dimensional sparse covariance matrices has targeted onthe covariance and inverse covariance directly, and the theoretical results are mostly in anabsolute convergence form. We see that the absolute convergence, even though achieving theminimax optimal rate, is often not suitable for statistical inference. Thus using an estimatedhigh-dimensional covariance matrix as the optimal weight matrix is highly-nontrivial. Westudy a new notion of “weighted convergence” to show that the effect of estimating a high-dimensional covariance matrix is indeed asymptotically negligible.This paper studies in detail two models that are of increasing importance in appliedstatistics: approximate factor model and panel data with unobservable interactive effects.We propose a method of weighted principal components, which uses a high-dimensionalweight matrix. The efficient weight uses the inverse error covariance matrix. The EWPCconsiders both heteroskedasticity and cross-sectional dependence. It is shown that EWPCuses the optimal weight matrix over the class of WPC estimators thus it is the most efficient.The EWPC is applied to the year-state divorce rate data. The new method captures the31able 5: Method comparison in effects of divorce law reform: simulated dataBias Normalized SE RelativeWPC PC WPC PC Efficiency T = 50First 2 years -0.008 -0.013 1.077 1.714 0.3933-4 years -0.023 -0.033 1.911 2.694 0.4945-6 years -0.040 -0.058 2.743 3.821 0.5257-8 years -0.054 -0.080 3.429 4.899 0.5019-10 years -0.068 -0.103 4.017 5.633 0.50111-12 years -0.073 -0.107 4.262 6.221 0.47513-14 years -0.081 -0.124 4.703 6.907 0.46215 years+ -0.090 -0.133 5.094 7.691 0.439 T = 70First 2 years -0.002 -0.000 0.927 1.449 0.4083-4 years -0.008 -0.008 1.623 2.434 0.4385-6 years -0.021 -0.028 2.434 3.420 0.5057-8 years -0.030 -0.039 3.246 4.579 0.5079-10 years -0.043 -0.060 4.115 5.738 0.51311-12 years -0.048 -0.061 4.579 6.492 0.50113-14 years -0.055 -0.076 5.101 7.245 0.49515 years+ -0.062 -0.079 5.564 8.173 0.465 “Normalized SE” is the standard error of the estimator multiplied by √ N T . The relativeefficiency is calculated as the square of the ratio of the third and fourth columns, estimating var(
W P C ) / var( P C )32ignificant (negative) effects from nine to twelve years after the law was reformed, consistentwith the previous empirical findings in the social science literature. The estimator is moreaccurate and produces tighter confidence intervals.All proofs are given in the supplementary material.33 eferences
Ahn, S., Lee, Y. and
Schmidt, P. (2001). GMM estimation of linear panel data modelswith time-varying individual effects.
J. Econometrics . , 219-255. Allen, D. W. (1992). Marriage and divorce: comment.
American Economic Review , ,679-685. Andersen, T., Bollerslev, T., Christoffersen, P. and
Diebold, F. (2011). Finan-cial risk measurement for financial risk management.
Manuscript . Northwestern University.
Bai, J. (2003). Inferential theory for factor models of large dimensions.
Econometrica . Bai, J. (2009). Panel data models with interactive fixed effects.
Econometrica . Bai, J. and
Li, K. (2012). Statistical analysis of factor models of high dimension.
Ann.Statist . , 436-465. Bai, J. and
Ng, S. (2002). Determining the number of factors in approximate factor models.
Econometrica . Bickel, P. and
Levina, E. (2008). Covariance regularization by thresholding.
Ann. Statist. Bien, J. and
Tibshirani, R. (2011). Sparse estimation of a covariance matrix.
Biometrika , Boivin, J. and
Ng, S. (2006). Are More Data Always Better for Factor Analysis?
J.Econometrics . , 169-194. Breitung, J. and
Tenhofen, J. (2011). GLS estimation of dynamic factor models.
J.Amer. Statist. Assoc. , 11501166.
Cai, T. and
Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estimation.
J. Amer. Statist. Assoc. , 672-684.
Cai, T. and
Zhou, H. (2012). Optimal rates of convergence for sparse covariance matrixestimation.
Ann. Statist . , 2389-2420. Caner, M. and
Han, X. (2012). Using bridge estimators to determine number of factorsin multifactor models: case of large panel data.
Manuscript .34 hamberlain, G. and
Rothschild, M. (1983). Arbitrage, factor structure and mean-variance analysis in large asset markets.
Econometrica . Cheng, X. and
Hansen, B. (2013). Forecasting with factor-augmented regression: a fre-quentist model averaging approach. Forthcoming in
J. Econometrics . Choi, I. (2012). Efficient estimation of factor models.
Econometric Theory . Aspremont, A., Banerjee, O. and
Ghaoui, L. (2008). First-order methods for sparsecovariance selection.
SIAM Journal on Matrix Analysis and Applications . , 56-66 Doz, C., Giannone, D. and
Reichlin, L. (2012). A quasi-maximum likelihood approachfor large, approximate dynamic factor models.
The Review of Economics and Statistics . ,1014-1024. El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparsecovariance matices.
Ann. Statist. , 2717-2756. Fan, J. and
Li, R. (2001). Variable selection via nonconcave penalized likelihood and itsoracle properties.
J. Amer. Statist. Assoc. Fan, J., Liao, Y. and
Mincheva, M. (2013). Large covariance estimation by thresholdingprincipal orthogonal complements (with discussion).
J. R. Stat. Soc. Ser. B. . To appear.
Forni, M., Hallin, M., Lippi, M. and
Reichlin, L. (2000). The generalized dynamicfactor model: identification and estimation.
The Review of Economics and Statistics . Friedberg, L. (1998). Did unilateral divorce raise divorce rates? Evidence from paneldata.
American Economic Review , , 608-627 Hallin, M. and
Liˇska, R. (2007). Determining the number of factors in the general dy-namic factor model.
J. Amer. Statist. Assoc. , 603-617.
Hansen, L. (1982). Large sample properties of generalized method of moments estimators,
Econometrica . , 1029-1054. Kim, D. and
Oka, T. (2013). Divorce law reforms and divorce rates in the U.S.: aninteractive fixed effects approach.
J. Appl. Econometrics , Lam, C. and
Fan, J. (2009). Sparsistency and rates of convergence in large covariancematrix estimation.
Ann. Statist. am, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inferencefor the number of factors.
Ann. Statist. , 694-726. Ledoit, O. and
Wolf, M. (2012). Nonlinear shrinkage estimation of large-dimensionalcovariance matrices.
Ann. Statist. , 1024-1060 Luo, X. (2011). High dimensional low rank and sparse covariance matrix estimation viaconvex minimization.
Manuscript . Moon, R. and
Weidner, M. (2010). Dynamic linear panel regression models with inter-active fixed effects. manuscript . Moon, R. and
Weidner, M. (2011). Linear regression for panel with unknown number offactors as interactive fixed effects. manuscript . Newey, W. and
West, K. (1987). A simple, positive semi-definite, heteroskedasticity andautocorrelation consistent covariance matrix.
Econometrica . , 703-708. Pati, D., Bhattacharya, A., Pillai, N. and
Dunson, D. (2012). Posterior contractionin sparse Bayesian factor models for massive covariance matrices. manuscript . Pesaran (2006). Estimation and inference in large heterogeneous panels with a multifactorerror structure.
Econometrica . , 967-1012. Peters, H. E. (1986). Marriage and divorce: informational constraints and private con-tracting,
American Economic Review , , 437-454. Phan, Q. (2012). On the sparsity assumption of the idiosyncratic errors covariance matrix-Support from the FTSE 100 stock returns.
Manuscript . University of Warwick.
Rothman, A., Bickel, P., Levina, E. and
Zhu, J. (2008). Sparse permutation invariantcovariance estimation.
Electron. J. Stat. , 494-515. Rothman, A., Levina, E. and
Zhu, J. (2009). Generalized thresholding of large covari-ance matrices.
J. Amer. Statist. Assoc.
Stock, J. and
Watson, M. (2002). Forecasting using principal components from a largenumber of predictors.
J. Amer. Statist. Assoc. , 1167-1179. Su, L. and
Chen, Q. (2013). Testing homogeneity in panel data models with interactivefixed effects. Forthcoming in
Econometric Theory .36 u, L., Jin S. and
Zhang, Y. (2012). Specification test for panel data models with inter-active fixed effects,
Manuscript . Wang, P. (2009). Large dimensional factor models with a multi-level factor structure: iden-tification, estimation and inference.
Manuscript . Hong Kong University of Science and Tech-nology.
Wolfers, J. (2006). Did unilateral divorce raise divorce rates? A reconciliation and newresults.
American Economic Review , , 1802-1820. Xue, L., Ma, S. and
Zou, H. (2012). Positive-definite l -penalized estimation of largecovariance matrices. J. Amer. Statist. Assoc. , 1480-1491.
Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty,
Ann.Statist. ,38