[PDF] Statistical Inferences Using Large Estimated Covariances for Panel Data and Factor Models

Abstract

While most of the convergence results in the literature on high dimensional covariance matrix are concerned about the accuracy of estimating the covariance matrix (and precision matrix), relatively less is known about the effect of estimating large covariances on statistical inferences. We study two important models: factor analysis and panel data model with interactive effects, and focus on the statistical inference and estimation efficiency of structural parameters based on large covariance estimators. For efficient estimation, both models call for a weighted principle components (WPC), which relies on a high dimensional weight matrix. This paper derives an efficient and feasible WPC using the covariance matrix estimator of Fan et al. (2013). However, we demonstrate that existing results on large covariance estimation based on absolute convergence are not suitable for statistical inferences of the structural parameters. What is needed is some weighted consistency and the associated rate of convergence, which are obtained in this paper. Finally, the proposed method is applied to the US divorce rate data. We find that the efficient WPC identifies the significant effects of divorce-law reforms on the divorce rate, and it provides more accurate estimation and tighter confidence intervals than existing methods.

Full PDF

aa r X i v : . [ m a t h . S T ] N ov Statistical Inferences Using Large EstimatedCovariances for Panel Data and Factor Models

Jushan Bai ∗ Columbia University Yuan Liao † University of Maryland

Abstract

While most of the convergence results in the literature on high dimensional covari-ance matrix are concerned about the accuracy of estimating the covariance matrix (andprecision matrix), relatively less is known about the eﬀect of estimating large covari-ances on statistical inferences. We study two important models: factor analysis andpanel data model with interactive eﬀects, and focus on the statistical inference andestimation eﬃciency of structural parameters based on large covariance estimators.For eﬃcient estimation, both models call for a weighted principle components (WPC),which relies on a high dimensional weight matrix. This paper derives an eﬃcient andfeasible WPC using the covariance matrix estimator of Fan et al. (2013). However,we demonstrate that existing results on large covariance estimation based on abso-lute convergence are not suitable for statistical inferences of the structural parameters.What is needed is some weighted consistency and the associated rate of convergence,which are obtained in this paper. Finally, the proposed method is applied to the USdivorce rate data. We ﬁnd that the eﬃcient WPC identiﬁes the signiﬁcant eﬀects ofdivorce-law reforms on the divorce rate, and it provides more accurate estimation andtighter conﬁdence intervals than existing methods.

Keywords:

High dimensionality, unknown factors, conditional sparsity, thresholding,cross-sectional correlation, heteroskedasticity, optimal weight matrix, interactive eﬀect ∗ Department of Economics, Columbia University, New York, NY 10027. † Department of Mathematics, University of Maryland at College Park, College Park, MD 20742. Introduction

Estimating a high-dimensional covariance matrix has been an active research area in therecent literature. Many methods are proposed for estimating the covariance matrix andthe precision (inverse covariance) matrix, e.g. El Karoui (2008), Bickel and Levina (2008),Rothman et al. (2009), Lam and Fan (2009), Cai and Liu (2011), Fan et al. (2013). Amongmany theoretical results, rates of convergence under various interesting matrix norms havebeen derived. In particular, if we write N to denote the dimension and T to denote thesample size, when the N × N covariance matrix Σ is sparse whose eigenvalues are boundedaway from zero, we can obtain an estimator b Σ that achieves a near- √ T -rate under the operator norm : k b Σ − Σ k = O p ( m N ( log NT ) − q ) = k b Σ − − Σ − k (1.1)where m N and q are parameters that measure the level of sparsity. Cai and Zhou (2012)showed that the rate of convergence (1.1) is minimax optimal. However, there is relatively lessknowledge about the eﬀect of estimating a high-dimensional covariance matrix on statisticalinferences, e.g., the estimation eﬃciency for a parametric model, and the eﬀect of estimatinglarge covariances on the limiting distributions for estimators of some structural parameters.We ﬁnd that when a high-dimensional covariance estimator is applied for statistical in-ferences (precisely, deriving limiting distributions of estimated structural parameters), mostof the results in the literature based on absolute convergence like (1.1) are not suitable, evenwith the minimax optimal rate. Instead, a “ weighted convergence ” is needed, which takesthe form k A ( b Σ − − Σ − ) A k , where both A , A are stochastic matrices that weight theestimation error b Σ − − Σ − . The weights A and A further “average down” the estimationerrors, which signiﬁcantly improve the rate of convergence to make valid statistical infer-ences. However, the weighted convergence cannot be implied by the usual results in theliterature. One of our contributions is to tackle this problem.This paper focuses on two models that are of increasing importance in many statisticalapplications: factor analysis and panel data model with interactive eﬀects. In factor analysis,the notion of sparsity is a natural assumption based on the factor structure, which is provedto be a successful approach (e.g., Boivin and Ng 2006, Phan 2012, Andersen et al. 2011).This paper gives a theoretical justiﬁcation about how such a sparse structure can be used toimprove the estimation eﬃciency in two general models. Both problems involve estimatinga large weight matrix, where the problem of proving “weighted convergence” is present.2 .1 Approximate factor model We consider a high-dimensional approximate factor model: y it = λ ′ i f t + u it , i ≤ N, t ≤ T. (1.2)where f t is an r × λ i is a vector of factor loadings, and u it represents the error term, often known as the idiosyncratic component . If we denote Y t =( y t , ..., y Nt ) ′ , Λ = ( λ , ...., λ N ) ′ , and u t = ( u t , ..., u Nt ) ′ , model (1.2) can be written as Y t = Λ f t + u t , t ≤ T. Only Y t is observable in the model. In a data-rich environment, both N and T can be largeand the dimension N might be even much larger than T . The goal is to make eﬃcientinference about λ ′ i f t , λ i , f t or their rotations.Approximate factor models often require the N × N covariance matrix Σ u = cov( u t )be non-diagonal matrix and the diagonal entries may vary over a large range (Chamberlainand Rothschild 1983). The traditional method of principal components (PC) essentiallytreats u it to be homoskedastic and uncorrelated over i . As a result, it is ineﬃcient. In thispaper, we consider a weighted principal components (WPC) method to eﬃciently estimatethe heteroskedastic approximate factor models. The WPC solves a weighted least squaresproblem: min Λ ,f t T X t =1 ( Y t − Λ f t ) ′ W ( Y t − Λ f t ) (1.3)subject to certain normalization constraints. Here W is an N × N positive deﬁnite weightmatrix. We propose a feasible eﬃcient WPC that requires consistently estimating the high-dimensional Σ − u (when N > T ) as the weight matrix, and is shown to be optimal over abroad class of estimators.

A closely related model is the panel data with a factor structure in the error term: y it = X ′ it β + ε it , ε it = λ ′ i f t + u it , i ≤ N, t ≤ T, (1.4)where X it is a d × β is a d × ε it has a factor structure with unknown loadings and factors. In the model,the only observables are ( y it , X it ). The goal is to estimate the structural parameter β , whose3imension is ﬁxed. In this model, the factor component λ ′ i f t is regarded as an interactiveeﬀect of the individual and time eﬀects. Because the regressor and factor can be correlated,simply regressing y it on X it is not consistent.Similarly, we propose to estimate β via:min β, Λ ,f t T X t =1 ( Y t − X ′ it β − Λ f t ) ′ W ( Y t − X ′ it β − Λ f t ) , (1.5)with a high-dimensional weight matrix W . The method is also WPC because the estimatedfactors are shown to be principal components of the weighted sample covariance matrix. Inparticular, it allows a consistent estimator for Σ − u as the optimal weight matrix even whenΣ − u is non-diagonal and N/T → ∞ . Except for sparsity, the oﬀ-diagonal structure of Σ u isunknown. The WPC takes into account both cross-sectional correlation and heteroskedas-ticity of u it over i , while the existing methods in the literature, e.g., Bai 2009, Moon andWeidner 2010, do not. First of all, we develop the inferential theory using a general high-dimensional weight W . This admits many promising choices of the weight matrices that are suitable for speciﬁcapplied problems for factor analysis. Especially, in cases where estimating Σ u is diﬃcult,our inferential theory is still useful when suitable weight matrices are chosen to improve theestimation eﬃciency. Secondly, we show that when W = Σ − u is used, the WPC yields anoptimal estimator in the sense that the estimated common component λ ′ i f t and structuralparameter β have the minimum asymptotic variance over a broad class of estimators.Third, we focus on the eﬀect of estimating large covariance matrices on eﬃcient statisticalinferences. In both pure factor analysis and the large panel data with a factor structure,we employ a consistent estimator for Σ − u recently proposed by Fan et al. (2013), as anoperational weight matrix. Therefore, our optimal estimator is still feasible under N/T → ∞ .However, substituting a consistent estimator Σ − u is highly non-trivial when N > T . Aninteresting phenomenon is observed: most existing results on estimating large covariancesare not suitable for statistical inferences of the models being considered. We develop a newstrategy that investigates the weighted consistency for the estimated optimal weight matrixto address this problem.Fourth, we consistently estimate the asymptotic variances of the proposed estimators un-der both cross-sectional and serial correlations in u it . Hence the new WPC estimator for theinterative eﬀect model is readily used for statistical inferences in practice. In contrast, exist-4ng methods usually require additionally modeling the large error covariance (e.g., assumingdiagonality, parametrizing the oﬀ-diagonal structure) in order for practical inferences.Given the popularity of the PC method, why do we need a new estimator to incorporatethe large covariance Σ u ? Even though most of the existing methods for panel data modelsavoid estimating Σ u , to demonstrate the potential eﬃciency loss for existing methods, wepresent a real-data application in Section 7, which studies the eﬀect of divorce reform lawon the change of divorce rates. The WPC is applied to the year-state divorce rate dataof U.S. during 1956-1985. It illustrates that after incorporating Σ − u in the estimation,WPC captures the signiﬁcant (negative) eﬀects from nine to twelve years after the law wasreformed, consistent with the previous empirical ﬁndings in the social science literature. Incontrast, the existing method (PC) without estimating Σ − u would result in wide conﬁdenceintervals and potentially conservative conclusions. Numerically, we ﬁnd an average of 46%eﬃciency gained using WPC, relative to the existing method. In addition, the proposedWPC also enjoys the computational convenience, as it also admits analytical solutions.Realizing the limitation of the regular PC method, some important works have beendeveloped to improve the estimation eﬃciency for factor analysis, e.g., Breitung and Ten-hofen (2011), Bai and Li (2012) and Doz et al. (2012). They require the cross-sectionaldependences’ structure be speciﬁcally modeled. Recently, Choi (2012) speciﬁed W = Σ − u ,which essentially requires Σ u be known. Recently, Fan et al. (2013) proposed a thresholdingmethod to estimate Σ − u . They focused on covariance matrix estimations and did not addressthe eﬃcient estimation for the factors, loadings and panel data models. As we discussed,replacing Σ − u with its consistent estimator is technically challenging when N/T → ∞ . Ad-ditional literature on factor analysis and panel data with interactive eﬀects includes, e.g.,Pesaran (2006), Ahn et al. (2001), Su and Chen (2013), Su et al. (2012), Wang (2009), Forniet al. (2000), Hallin and Liˇska (2007), Lam and Yao (2012), Cheng and Hansen (2013), Canerand Han (2012), etc. None of these incorporated Σ − u or studied eﬃcient estimation for paneldata models. We also remark that there is a rapidly growing literature on estimating high-dimensional (inverse) covariance matrices. Besides those mentioned, the list also includes,e.g., d’ Aspremont et al. (2008), Bien and Tibshirani (2011), Luo (2011), Pati et al. (2012),Xue et al. (2012), among many others.We assume the number of factors r = dim( f t ) to be known. When r is unknown, it canbe consistently estimated by certain information criteria as in, e.g., Bai and Ng (2002), aswe shall brieﬂy discuss in Section 5.The rest of the paper is organized as follows. Section 2 describes the general problemof statistical inference based on large covariance matrices. Section 3 formally proposes the5PC method. The large-sample inferential theory of WPC with a general weight matrix ispresented. Section 4 introduces the eﬃcient WPC. Section 5 applies the WPC method tothe panel data model with interactive eﬀects. Section 6 illustrates numerical comparisonsof related methods. Section 7 applies WPC to a real data problem of divorce rate study.Finally, Section 8 concludes. All proofs are given in the supplementary material.Throughout the paper, we use λ min ( A ) and λ max ( A ) to denote the minimum and maxi-mum eigenvalues of matrix A . We also let k A k , k A k and k A k F denote the operator norm, L -norm and Frobenius norm of a matrix, deﬁned as k A k = p λ max ( A ′ A ), k A k = max i P j | A ij | and k A k F = qP i,j A ij respectively. Note that if A is a vector, k A k = k A k F is equal to theEuclidean norm. Finally, for two sequences, we write a T ≪ b T (and equivalently b T ≫ a T )if a T = o ( b T ) as T → ∞ . Consider estimating a low-dimensional structural parameter θ that arises from a modelinvolving a high-dimensional covariance matrix Σ. It is often the case that when Σ wereknown, incorporating it in the estimator may achieve a better estimation accuracy, e.g.,smaller standard errors and tighter conﬁdence intervals. Taking into account Σ, the estimatorcan be written as a function of the data D T and Σ as ( T denotes the sample size): b θ = f ( D T , Σ) , and the limiting distribution may be derived. In practice, we replace Σ by a consistentestimator b Σ and obtain a feasible eﬃcient estimator f ( D T , b Σ).To show that replacing Σ with its consistent estimator does not aﬀect the limiting distri-bution of b θ , one often needs a T ( f ( D T , Σ) − f ( D T , b Σ)) = o p (1) where a − T can be understoodas the rate of convergence of b θ . However, such a simple substitution is technically diﬃcult if N > T . To see this, note that often f ( D T , Σ) depends on the precision matrix Σ − , and theeﬀect of estimating Σ − is approximately linearly dependent on b Σ − − Σ − . We can oftenwrite a T ( f ( D T , Σ) − f ( D T , b Σ)) = a T A ( b Σ − − Σ − ) A + o p (1)where A , A are typically non-sparse stochastic matrices of dimensions dim( θ ) × N and6 × a T k A ( b Σ − − Σ − ) A k ≤ a T k A kk A kk b Σ − − Σ − k . As both A and A are high-dimensional matrices (vectors), the right hand side of theabove inequality is typically not stochastically negligible even if the “absolute convergence” k b Σ − − Σ − k achieves the optimal convergence rate. The problem arises because k A k and k A k grow fast with the dimensionality, so they accumulate the estimation errors and leadto a crude bound.We further illustrate this issue in two examples, which are to be studied in detail in thepaper. Example 2.1.

Consider the high-dimensional factor model (1.2). The parameter of interestis the common component λ ′ i f t . The eﬃcient estimation crucially depends on √ N Λ ′ ( b Σ − u − Σ − u ) u t , for a sparse covariance estimator b Σ − u . However, the existing results on the optimalconvergence of k b Σ − u − Σ − u k in the literature (e.g., Fan et al. 2013) are not applicabledirectly when N > T , because k Λ k = O ( √ N ) and k u t k = O p ( √ N ), but the minimax ratefor k b Σ − u − Σ − u k is no faster than O p ( T − / ). Applying the absolute convergence for b Σ − u , √ N k Λ kk b Σ − u − Σ − u kk u t k = O p ( q NT ) = o p (1) when N > T . Example 2.2.

Consider the high-dimensional panel data model (1.4). The eﬃcient esti-mation of β requires estimating the inverse covariance Σ − u . Suppose e Σ − u is a consistentestimator. We require 1 √ N T Z ′ [( e Σ − u − Σ − u ) ⊗ I T ] U = o p (1) , where I T is a T -dimensional identity matrix and Z and U are stochastic matrices whosedimensions are N T × dim( β ) and N T × k Z k = O p ( √ N T ) = k U k , it is diﬃcult to apply the absolute convergence k e Σ − u − Σ − u k (whose minimax rate is nofaster than O p ( T − / )) to achieve the desired convergence when N > T . The crude boundgives √ NT k Z kk e Σ − u − Σ − u kk U k = O p ( √ N ) = o p (1). (cid:3) As one of the main contributions of this paper, a new strategy of “weighted convergence”is developed. When analyzing a T A ( b Σ − − Σ − ) A , we should not separate the covarianceestimation error from the weighting matrices A , A . Intuitively, the weights further “averagedown” the estimation errors, to ensure the asymptotic negligibility of the weighted error. Wedemonstrate that the weighted convergence is useful for high-dimensional inferences in panel When Σ is sparse enough, one can obtain a near √ T -rate of convergence for the L -norm k b Σ − − Σ − k ,but this still yields a crude bound for a T A ( b Σ − − Σ − ) A . In model (1.2), the only observables are { Y t } Tt =1 , and both the factors { f t } Tt =1 and loadingsΛ = ( λ , ..., λ N ) ′ are parameters to estimate. We estimate them via the following weightedleast squares: ( b Λ , b f t ) = min Λ ,f t T X t =1 ( Y t − Λ f t ) ′ W T ( Y t − Λ f t ) (3.1)subject to: 1 T T X t =1 b f t b f ′ t = I r ; b Λ ′ W T b Λ is diagonal . (3.2)Here W T is an N × N weight matrix, which can be either stochastic or deterministic. When W T is stochastic, we mean W T to be a consistent estimator of some positive deﬁnite W underthe operator norm. We will show in Section 4 that the optimal weight is Σ − u . On the otherhand, keeping a general W T admits other choices of the weight for speciﬁc applied problems,especially when it is diﬃcult to estimate the optimal weight matrix.Solving (3.1) subjected to the restriction (3.2) gives the WPC estimators: b λ j and b f t areboth r × T × r matrix b F / √ T = ( b f , ..., b f T ) ′ / √ T arethe eigenvectors corresponding to the largest r eigenvalues of Y ′ W T Y , and b Λ = T − Y b F =( b λ , ..., b λ N ) ′ . We call our method to be weighted principal components (WPC), to distinguishfrom the traditional principal components (PC) method that uses W T = I N . Note that PCdoes not encounter the problem of estimating large covariance matrices, and is not eﬃcientwhen { u it } ’s are cross-sectionally correlated across i .It has been well known that the factors and loadings are not separably identiﬁable withoutfurther restrictions. The WPC estimates rotated factors and loadings with rotation matrix H W . Let b V be the r × r diagonal matrix of the ﬁrst r largest eigenvalues of Y W T Y ′ / ( T N ). Let F = ( f , ..., f T ) ′ , then H W = b V − b F ′ F Λ ′ W T Λ / ( N T ). We use the subscript W to emphasizethe dependence of the rotation on W . 8 .2 General conditions We present general results for the proposed WPC with a general weight matrix, whichhold for a broad class of estimators. For the general weight matrix W and its data-dependentversion W T , the following assumption is needed: Assumption 3.1. (i) k W T − W k = o p (min { T − / , N − / , q NT , q TN log N } ) .(ii) k √ N Λ ′ ( W T − W ) u t k = o p (1) . Condition (i) is easy to satisfy by using many “good” covariance estimators given inthe literature. However, the main challenge described in Section 2 arises from provingcondition (ii) in the above assumption. When W T is a consistent estimator for Σ − u , we shallsee in Section 4.2 that this requires a new “weighted convergence”, which is necessary butchallenging to the high-dimensional inference problems being considered.We allow the factors and idiosyncratic components to be weakly serially dependent viathe strong mixing condition. Let F −∞ and F ∞ T denote the σ -algebras generated by { ( f t , u t ) : −∞ ≤ t ≤ } and { ( f t , u t ) : T ≤ t ≤ ∞} respectively. In addition, deﬁne the mixingcoeﬃcient α ( T ) = sup A ∈F −∞ ,B ∈F ∞ T | P ( A ) P ( B ) − P ( AB ) | . (3.3) Assumption 3.2. (i) { u t , f t } t ≥ is strictly stationary. In addition, Eu it = Eu it f jt = 0 forall i ≤ p, j ≤ r and t ≤ T. (ii) There exist constants c , c > such that c < λ min (Σ u ) ≤ λ max (Σ u ) < c , max j ≤ N k λ j k and b , b > , such that for any s > , i ≤ p and j ≤ r , P ( | u it | > s ) ≤ exp( − ( s/b ) r ) , and P ( | f jt | > s ) ≤ exp( − ( s/b ) r ) . (iv) Strong mixing: There exists r > and C > satisfying: for all T ∈ Z + , α ( T ) ≤ exp( − CT r ) . We assume that W has bounded row sums, that is, k W k < M for some M >

0. WriteΛ ′ W = ( d , ..., d N ), with each d i being an r × j ≤ N k d j k < ∞ . The following assumptions are standard in the literature. Assumption 3.3 requires thefactors be pervasive , which holds when the factors impact a non-vanishing proportion ofindividual time series. Assumption 3.4 extends similar conditions in Stock and Watson(2002) and Bai (2003). When W = I N is used, they reduce to those in the literature of theregular PC. A simple suﬃcient condition for Assumption 3.4 is that u it is i.i.d. in both i and t . 9 ssumption 3.3. All the eigenvalues of the r × r matrix Λ ′ Λ /N are bounded away fromboth zero and inﬁnity as N → ∞ . Assumption 3.4. (i) E k √ NT P Ts =1 f s ( u ′ s W u t − Eu ′ s W u t ) k = O (1) .(ii) For each i ≤ N , E k √ NT P Tt =1 P Nj =1 d j ( u jt u it − Eu jt u it ) k = O (1) .(iii) For each i ≤ r , E k √ NT P Tt =1 P Nj =1 d j u jt f it k = O (1) .(iv) There is a constant δ ≥ and M > such that for all large N , E | √ N ( u ′ s W u t − Eu ′ s W u t ) | δ < M and E k √ N Λ ′ W u t k δ < M . The factors and loadings are two sets of parameters to estimate. The limiting distribu-tions of their estimators depend on the following asymptotic expansions, to be shown in theappendix: for some positive deﬁnite matrix J W , and the rotation matrix H W , √ N ( b f t − H W f t ) = J W Λ ′ W u t √ N + O p ( a T ) √ T ( b λ j − H ′ − W λ j ) = H W √ T T X t =1 f t u jt + O p ( b T ) . (3.4)where the asymptotic normality arises from the leading terms while a T and b T are someremaining stochastic sequences.The limiting distribution of b λ j requires H W to have a limit. We thus need the followingcondition: Assumption 3.5. (i) There is an r × r matrix Σ Λ such that Λ ′ W Λ /N → Σ Λ as N → ∞ .In addition, the eigenvalues of the Σ Λ cov( f t ) are distinct.(ii) For each t ≤ T , (Λ ′ W Σ u W Λ) − / Λ ′ W u t → d N (0 , I r ) . According to the expansions of (3.4), the above condition (ii) is almost a necessarycondition for the asymptotic normality of b f t . Note that √ N Λ ′ W u t = √ N P Ni =1 d i u it . Hencea cross-sectional central limit theorem can indeed apply. Condition (ii) is only for b f t , andthe limiting distribution of the estimated loading b λ j in Theorem 3.1 below does not dependon this condition.We now introduce some notation that are needed to present the limiting distributions.Let V be an r × r diagonal matrix with element as the largest r eigenvalues of Σ / cov( f t )Σ / ,and Γ W be the corresponding eigenvector matrix such that Γ ′ W Γ W = I r . We use the subscript W to indicate that Γ W depends on W via Σ Λ . Recall that Σ Λ is deﬁned in Assumption 3.5.Let Q W = V / Γ ′ W Σ − / . In fact H W → p Q ′ − W . In addition, to account for the serial10orrelation over t , letΦ j = E ( f t f ′ t u jt ) + ∞ X t =1 E [( f f ′ t + f t f ′ ) u j u j, t ] . (3.5) Theorem 3.1.

Assume (log N ) = o ( T ) and T = o ( N ) . Under Assumptions 3.1-3.5(i),for each j ≤ N , √ T ( b λ j − H ′ − W λ j ) → d N (0 , Q ′ − W Φ j Q − W ) . If in addition, N = o ( T ) and Assumption 3.5(ii) holds, N ( V − Q W Λ ′ W Σ u W Λ Q ′ W V − ) − / ( b f t − H W f t ) → d N (0 , I r ) . For the common component, we have b λ ′ i b f t − λ ′ i f t ( λ ′ i Ξ W λ i /N + f ′ t Ω i f t /T ) / → d N (0 , . where Ξ W = Σ − Λ ′ W Σ u W ΛΣ − /N and Ω i = cov( f t ) − Φ i cov( f t ) − . Remark 3.1.

The eigenvalues of ( V − Q W Λ ′ W Σ u W Λ Q ′ W V − ) − / are of order O ( N − / ).Hence Theorem 3.1 implies the √ N -consistency of the estimated factors. If we furtherassume that Λ ′ W Σ u W Λ /N has a limit, say G , then immediately we have √ N ( b f t − H W f t ) → d N (0 , V − Q W GQ ′ W V − ) , where the √ N -consistency is more clearly demonstrated.The uniform convergence of b f t and b λ j are given below. Theorem 3.2.

Let α = max { /r , /r } with r , r deﬁned in Assumption 3.2. Let δ ≥ be as deﬁned in Assumption 3.4. Under Assumptions 3.1-3.4, as N, T → ∞ , max t ≤ T k b f t − H W f t k = O p (cid:18) (log T ) α k W T − W k + T /δ √ N + 1 √ T (cid:19) , (3.6)max j ≤ N k b λ j − H ′ − W λ j k = O p k W T − W k + 1 √ N + r log NT ! . (3.7) Remark 3.2.

The uniform convergence in (3.6) and (3.7) is important under large N and T . For example, in estimating large covariance matrices, it is used to derive the proper levelsof thresholding or shrinkage (e.g., Fan et al. 2013, Ledoit and Wolf 2012).11 .4 Heteroskedastic WPC As a simple choice for W , W = (diag(Σ u )) − . This choice improves the regular PC when cross-sectional heteroskedasticity is present. Thisweight can be easily estimated using the residuals. First apply the regular PC by taking W T = I N , and obtain a consistent estimator b C it of the common component λ ′ i f t for each i ≤ N, t ≤ T. Deﬁne W hT = diag { b σ − u, , ..., b σ − u,NN } , where b σ u,ii = 1 T T X t =1 ( y it − b C it ) . Then in the second step, apply the WPC with weight matrix W hT . The heteroskedastic WPC (which we call HWPC) method has been previously suggestedby, e.g., Breitung and Tenhofen (2011). Investigations of its theoretical properties can befound in the appendix. Moreover, numerical studies in Section 6 show that this methodimproves the eﬃciency relative to the regular PC method.

In the approximate factor models, u it ’s are correlated (over i ). A more eﬃcient estimator(which we call EWPC) should take W = Σ − u as the weight matrix. This estimator has beenrecently suggested by Choi (2012), but Σ − u was assumed to be known.There are two main challenges in practice: on one hand, when N > T , Σ − u is hard toestimate as the sample covariance based on the residual b u it is no longer invertible. On theother hand, as we illustrated in Section 2, even if a consistent estimator for Σ − u is available,it is technically diﬃcult to prove that the eﬀect of covariance estimation is neglibile when N/T → ∞ . We ﬁrst apply Fan et al. (2013)’s method to estimate Σ − u , and then addressthe second problem in Section 4.2. We apply a thresholded covariance estimator to estimate Σ − u , which is recently proposedby Fan et al. (2013) for factor analysis. Let ( ν j , ξ j ) Nj =1 be the eigenvalues-eignvectors of the12ample covariance S y of Y t , in a decreasing order such that ν ≥ ν ≥ ... ≥ ν N . Let R = S y − r X i =1 ν i ξ i ξ ′ i . Deﬁne a general thresholding function s ij ( z ) : R → R as in Rothman et al. (2009) and Caiand Liu (2011) with an entry-dependent threshold τ ij such that:(i) s ij ( z ) = 0 if | z | < τ ij ;(ii) | s ij ( z ) − z | ≤ τ ij . (iii) There are constants a > b > | s ij ( z ) − z | ≤ aτ ij if | z | > bτ ij .Examples of s ij ( z ) include the hard-thresholding: s ij ( z ) = zI ( | z | >τ ij ) ; SCAD (Fan and Li2001), MPC (Zhang 2010) etc. As for the threshold value, we specify τ ij = C p R ii R jj ω T , where ω T = r log NT + 1 √ N (4.1)for some pre-determined universal C >

0, chosen from cross-validation as in Fan et al.(2013). Then estimate Σ u by b Σ u = ( b Σ u,ij ) N × N , b Σ u,ij =  R ii , i = js ij ( R ij ) , i = j , where R = ( R ij ) N × N . Intuitively, b Σ u thresholds oﬀ the small entries of the residual covariance T P Tt =1 b u t b u ′ t obtainedfrom the regular PC estimate.To apply such a weight estimator, we assume Σ u to be a sparse matrix. In an approximatefactor model, such a special structure is known to be conditionally sparse (given the commonfactors). Consider the notion of generalized sparsity: write Σ u = (Σ u,ij ) N × N . For some q ∈ [0 , / m N = max i ≤ N N X j =1 | Σ u,ij | q . (4.2)In particular, when q = 0, deﬁne m N = max i ≤ N P Nj =1 I (Σ u,ij =0) . Mathematically, the condi-tional sparse structure on Σ u assumes, there is q ∈ [0 , / m N = o min ( N / (cid:18) T log N (cid:19) (1 − q ) / , N / − q/ )! . (4.3)In the sparse covariance estimation literature, Condition (4.3) itself is enough to achievea covariance estimator such that k Σ − u − b Σ − u k = o p (1), whose rate of convergence is nearly13 T (e.g., Cai and Zhou 2012, Fan et al. 2013, etc.). But for the “weighted convergence”needed for eﬃcient estimations in factor analysis and large panel data models, this conditionis not suﬃcient. Therefore, we introduce a more reﬁned description of the sparse structureof Σ u (condition (ii) in Assumption 4.1 below), which is similar to those in Rothman et al.(2008).Let S L and S U denote two disjoint sets and respectively include the indices of small andlarge elements of Σ u in absolute value, and { ( i, j ) : i ≤ N, j ≤ N } = S L ∪ S U . (4.4)We assume ( i, i ) ∈ S U for all i ≤ N. The sparsity condition assumes that most of the indices( i, j ) belong to S L when i = j . A special case arises when Σ u is strictly sparse, in the sensethat its elements with small magnitudes ( S L ) are exactly zero. For the banded matrix as anexample, Σ u,ij = 0 if | i − j | > k for some ﬁxed k ≥ . Then S L = { ( i, j ) : | i − j | > k } and S U = { ( i, j ) : | i − j | ≤ k } . Another example is the block-diagonal matrix.The following assumption mathematically deﬁnes the “conditional sparsity” for the ap-proximate factor model.Deﬁne ω T = r log NT + 1 √ N .

Assumption 4.1. (i) There is q ∈ [0 , / such that (4.3) holds.(ii) There is a partition { ( i, j ) : i ≤ N, j ≤ N } = S L ∪ S U such that P i = j, ( i,j ) ∈ S U O ( N ) and P ( i,j ) ∈ S L | Σ u,ij | = O (1) . In addition, max ( i,j ) ∈ S L | Σ u,ij | = O ( ω T ) , ω T = O ( min ( i,j ) ∈ S U | Σ u,ij | ) . If for example, Σ u is a block covariance matrix with ﬁnite block sizes, this assumption isnaturally satisﬁed as long as the signal is not too-weak (that is, ω T = o (min ( i,j ) ∈ S U | Σ u,ij | )).Condition (ii) requires the elements in S L and S U be well-separable. The partition { ( i, j ) : i ≤ N, j ≤ N } = S L ∪ S U may not be unique. Most importantly, we do not need to knoweither S L or S U ; hence the block size, the banding length, or the locations of the zero entriescan be completely unknown. Our analysis suﬃces as long as such a partition exists. We now formally discuss the issue brought by Assumption 3.1. In order for the eﬀect ofestimating Σ − u to be negligible, k √ N Λ ′ ( b Σ − u − Σ − u ) u t k = o p (1) is required, which is a tight14ondition. However, a direct application of the optimal rate of convergence (i.e., Fan et al.2013, Cai and Zhou 2012) k b Σ − u − Σ − u k = O p ( m N ω − qT ) implies k √ N Λ ′ ( b Σ − u − Σ − u ) u t k ≤ √ N k Λ kk b Σ − u − Σ − u kk u t k = O p ( √ N m N ω − qT ) , which is O p (1 + p N (log N ) /T ) even if m N is bounded and q = 0. Hence this leads to acrude bound that does not converge. The problem is present even if Σ − u is estimated withthe optimal rate of convergence.We realize that such a technical problem is common for statistical inferences that involveestimating a high-dimensional covariance. In fact, most of the existing approaches in the lit-erature only produce “absolute convergence” k b Σ − u − Σ − u k . For statistical inference purposeslike the primary interest of this paper, however, the absolute convergence is not suﬃcientwhen N/T → ∞ .We propose a new technical strategy to solve this problem, by directly investigating the“weighted convergence” of the weighted error: k √ N Λ ′ ( b Σ − u − Σ − u ) u t k . (4.5)Intuitively, the weights Λ ′ and u t “average down” the estimation errors, and improve therate of convergence. Formal analysis requires us to re-investigate the asymptotic behaviorof the thresholded covariance estimator. We require the following technical assumption.Let Λ ′ Σ − u = ( ξ , ..., ξ N ). Assuming k Σ − u k = O (1), we then have max j ≤ N k ξ j k < C forsome C > . In addition, let e t = Σ − u u t , then e t has mean zero and covariance Σ − u . Assumption 4.2.

For each t ≤ T and k ≤ r ,(i) T √ N P Ni =1 P Ts =1 ( u is − Eu is ) ξ i e it = o p (1) (ii) NT √ N P Ni =1 P Ts =1 P Nj =1 ( u js u is − Eu js u is ) λ j λ ′ i e it ξ ik = o p (1) ,(iii) T √ N P i = j, ( i,j ) ∈ S U P Ts =1 ( u is u js − Eu is u js ) ξ i e jt = o p (1) ,(iv) NT √ N P i = j, ( i,j ) ∈ S U P Nv =1 P Ts =1 ( u is u vs − Eu is u vs ) ξ ik e jt λ v λ ′ j = o p (1) . The above conditions are new in the literature and essential to establish the weighedconvergence. The intuition of these conditions is that, the weighted average of the standard-ized sum √ T P Tt =1 ( u it u jt − Eu it u jt ) is o p (1) once averaged across i and j . The extra term N appeared in NT √ N of Conditions (ii) and (iv) is a scaling factor because under the sparsitycondition, the number of summands of P Ni =1 and P i = j, ( i,j ) ∈ S U is at most O ( N ) (e.g., in blockdiagonal and banded matrices).We verify the key assumption 4.2 in the following lemma, when { u t } t ≤ T is serially inde-pendent. We require N = o ( T ) but still allow N/T → ∞ .15 emma 4.1.

Suppose { u it } t ≤ T is independent across t (but can still be correlated across i ),and the sparse condition Assumption 4.1 holds. Then when N = o ( T ) , Assumption 4.2 issatisﬁed. We have the following weighted consistency for the estimated weight matrix, which as wehave explained, cannot be implied directly by the absolute convergence k b Σ − u − Σ − u k evenwhen Σ u is diagonal. As one of the main contributions of this paper, result of this type ispotentially widely useful for high-dimensional inferences when large covariance estimation isinvolved. Proposition 4.1.

Suppose √ N m N ω − qT = o (1) , and Assumptions 3.2- 3.5, 4.1, 4.2 hold.For q , m N and ω T deﬁned in (4.2) and (4.1), and for each t ≤ T , we have k √ N Λ ′ ( b Σ − u − Σ − u ) u t k = o p (1) . Therefore Assumption 3.1 is satisﬁed for W = Σ − u . Remark 4.1.

Consider a strictly sparse case where m N = max i ≤ N P Nj =1 I (Σ u,ij = 0) = O (1). The condition in the theorem √ N m N ω − qT = o (1)then holds as long as √ N log N = o ( T ). As always the case, requiring N = o ( T ) is neededfor the asymptotic normality of b f t . We use W T = b Σ − u as the feasible weight matrix. Let the columns of the T × r matrix b F e / √ T = ( b f e , ..., b f eT ) ′ / √ T be the eigenvectors corresponding to the largest r eigenvalues of Y ′ b Σ − u Y , and b Λ e = T − Y b F e = ( b λ e , ..., b λ eN ) ′ . Here the superscript e denotes “eﬃcient” WPC.We denote Σ Λ ,e as the limit of Λ ′ Σ − u Λ /N . Let V e be an r × r diagonal matrix with elementsas the largest r eigenvalues of Σ / ,e cov( f t )Σ / ,e , and Γ e be the corresponding eigenvectormatrix such that Γ ′ e Γ e = I r . In addition, let Q e = V / e Γ ′ e Σ − / ,e . We have the followinglimiting distributions for the estimated factors and loadings. Theorem 4.1.

Under the assumptions of Proposition 4.1, for each t ≤ T and j ≤ N , √ T ( b λ ej − H ′ − e λ j ) → d N (0 , Q ′ − e Φ j Q − e ) . √ N ( b f et − H e f t ) → d N (0 , V − e ) . here Φ j is as deﬁned in (3.5). In addition, for the estimated common component, b λ e ′ i b f et − λ ′ i f t ( λ ′ i Ξ e λ i /N + f ′ t Ω i f t /T ) / → d N (0 , . where Ξ e = (Λ ′ Σ − u Λ /N ) − and Ω i is deﬁned as in Theorem 3.1. For completeness, the following result gives the uniform rate of convergence.

Theorem 4.2.

Suppose N / (2 − q ) log N = o ( T ) and T = o ( N ) . Under the assumptions ofTheorem 4.1, there is an r × r matrix H e such that max t ≤ T k b f et − H e f t k = O p (cid:18) T /δ √ N + (log T ) α m N ω − qT (cid:19) , max j ≤ N k b λ ej − H ′ − e λ j k = O p (cid:0) m N ω − qT (cid:1) . Remark 4.2.

Typically in the strictly sparse case m N = O (1) and q = 0. When N/T → ∞ ,the above rates become:max t ≤ T k b f et − H e f t k = O p (cid:18) T /δ √ N + (log T ) α √ log N √ T (cid:19) , max j ≤ N k b λ ej − H ′ − e λ j k = O p r log NT ! . Regular PC, heteroskedastic WPC and the eﬃcient WPC minimize diﬀerent objectivefunctions, depending on the choices of the weight matrix. Thus the estimated b F / √ T are theeigenvectors from three diﬀerent matrices. Table 1 summarizes the main diﬀerences of theestimators.A natural question arises: is the consistent estimator for W = Σ − u indeed the optimalchoice over a broad class of positive deﬁnite weight matrices? One can answer this questionvia looking at the asymptotic variance of the estimators, as choosing the optimal weightfor GMM (Hansen 1982). However, because WPC estimators are estimating rotated factorsand loadings, the rotation depends on the choice of W. But regardless of the choice W , thecommon component λ ′ i f t is always directly estimated. The following result demonstratesthat W T = b Σ − u yields the minimum asymptotic variance of b λ ′ i b f t among WPC estimators. Theorem 4.3.

Let ( λ ′ i Ξ e λ i /N + f ′ t Ω i f t /T ) denote the asymptotic variance of b λ e ′ i b f et basedon b Σ − u as in Theorem 4.1. For any positive deﬁnite matrix W , let ( λ ′ i Ξ W λ i /N + f ′ t Ω i f t /T )17able 1: Three interesting choices of W Objective function Eigenvectors of W regular PC P Tt =1 ( Y t − Λ f t ) ′ ( Y t − Λ f t ) Y ′ Y I r HWPC P Tt =1 ( Y t − Λ f t ) ′ diag( b Σ u ) − ( Y t − Λ f t ) Y ′ diag( b Σ u ) − Y diag(Σ u ) − EWPC P Tt =1 ( Y t − Λ f t ) ′ b Σ − u ( Y t − Λ f t ) Y ′ b Σ − u Y Σ − u The estimated b F / √ T is the eigenvectors of the largest r eigenvalues of Y ′ W T Y , and b Λ = T − Y b F .HWPC represents the heteroskedastic WPC; EWPC represents the eﬃcient WPC. denote the asymptotic variance of b λ ′ i b f t as in Theorem 3.1 based on W . Then for each i ≤ N and t ≤ T , λ ′ i Ξ e λ i /N + f ′ t Ω i f t /T ≤ λ ′ i Ξ W λ i /N + f ′ t Ω i f t /T. In fact, for all large N , Ξ W − Ξ e is semi-positive deﬁnite for each positive deﬁnite matrix W . We derive consistent estimators for the asymptotic variances that appeared in Theorem4.1. Hence the derived optimal limiting distributions can be used for statistical inferences.These estimators account for the serial and cross-sectional correlations of the data in both i and t .The factor estimator has an asymptotic expansion: √ N ( b f et − H e f t ) = b V − b F e ′ FT Λ ′ Σ − u u t √ N + o p (1)where b V is the r × r diagonal matrix of the ﬁrst r largest eigenvalues of T N Y b Σ − u Y ′ . Theorem4.1 shows that the asymptotic variance is V − e . Hence, b V − b F e ′ FT Λ ′ Σ − u Λ N F ′ b F e T b V − → p V − e (4.6)The left hand side involves the product F Λ ′ , which can be estimated by b F e b Λ e ′ . A consistentestimator of V − e is then given by (note that T b F e ′ b F e = I r ) b V − e = b V − b F e ′ b F e T b Λ e ′ b Σ − u b Λ e N b F e ′ b F e T b V − = 1 N b V − b Λ e ′ b Σ − u b Λ e b V − . √ T ( b λ j − H ′ − e λ j ) = 1 √ T T X t =1 H e f t u jt + o p (1) . Here H e f t u jt can be estimated by b f et b u jt , where b u jt is a WPC estimator of the error term (e.g., b u jt = y it − b λ e ′ j b f et ). We apply the HAC (heteroskedasticity and autocorrelation consistent)estimator of Newey and West (1987) to estimate Q ′ − e Φ j Q − e , the asymptotic variance of √ T ( b λ j − H ′ − e λ j ), based on the sequence { b f et b u jt } : b Ψ j = 1 T T X t =1 b u jt b f et b f e ′ t + K X l =1 (1 − lK + 1 ) 1 T T X t = l +1 b u jt b u j,t − l ( b f et b f e ′ t − l + b f et − l b f e ′ t ) , where K = K T,N → ∞ is an increasing sequence such that K = o (min { T / , N / } ) . The advantages of using the HAC estimator are: it accounts forthe serial correlations of { f t u ′ t } t ≥ , and it also guarantees the positive semi-deﬁniteness forany given ﬁnite sample as shown by Newey and West (1987).The asymptotic variance of the common component in Theorem 4.1 consists of λ ′ i Ξ e λ i and f ′ t Ω i f t , where Ξ e = ( N Λ ′ Σ − u Λ) − andΩ i = cov( f t ) − Φ i cov( f t ) − . We respectively estimate them by b Θ i = 1 N b λ e ′ i b V − b Λ e ′ b Σ − u b Λ e b V − b λ ei , b Θ ,it = b f e ′ t b Ψ i b f et . Theorem 4.4.

Under the assumptions of Theorem 4.2, as T , N → ∞ , and K = K T,N = o (min { T / , N / } ) , b V − e → p V − e , b Ψ j → p Q ′ − e Φ j Q − e , b Θ i → p λ ′ i Ξ e λ i , b Θ ,it → p f ′ t Ω i f t . These covariance estimators can be easily computed.

The factor model we have considered so far is closely related to the following panel datamodel: y it = X ′ it β + ε it , ε it = λ ′ i f t + u it , i ≤ N, t ≤ T (5.1)19he regression noise has a factor structure with unknown λ i and f t , and u it still representsthe idiosyncratic error component. It is assumed that u it is independent of ( X it , f t ). In themodel, the only observables are ( y it , X it ). The goal is to estimate β , the structural parameterof the model.Substituting the second equation to the ﬁrst one in (5.1), we obtain y it = X ′ it β + λ ′ i f t + u it . (5.2)If we treat λ i as the “individual eﬀect” and f t as the “time eﬀect”, then the factor structure λ ′ i f t represents the interaction between the individual and time eﬀects, so called “interactiveeﬀect”. This model was previously studied by, e.g., Ahn et al. (2001), Pesaran (2006), Bai(2009), Moon and Weidner (2010).The diﬃculty of estimating β is that, in many applied problems the regressor X it iscorrelated with the time eﬀect (common factor) f t , individual eﬀect λ i , or both. As a result, X it and ε it are also correlated, so regressing y it directly on X it cannot produce a consistentestimator for β . In addition, existing methods ignore the heteroskedasticity and correlationin { u it } i ≤ N . Hence eﬃciency is lost, for instance, when Σ u is non-diagonal or its diagonalentries vary over a large range. We shall illustrate the consequence of eﬃciency loss using areal data application in Section 7. β Let X t = ( X t , ..., X Nt ) ′ , ( N × d ). We estimate β viamin β,f t , Λ T X t =1 ( Y t − Λ f t − X t β ) ′ W ( Y t − Λ f t − X t β ) , (5.3)for some positive deﬁnite N × N weight matrix. Similar to the generalized least squaresestimator (GLS) for linear regressions, we choose the weight matrix to be W = Σ − u . This choice produces similar estimators as the eﬃcient WPC. The estimator is feasible oncewe consistently estimate Σ − u , which can be done under the assumption that Σ u is sparse.Suppose e Σ − u is a consistent covariance estimator. The feasible WPC estimates β by:ˆ β = arg min β min f t , Λ T X t =1 ( Y t − Λ f t − X t β ) ′ e Σ − u ( Y t − Λ f t − X t β ) , (5.4)20here the minimization is subjected to the constraint T P Tt =1 f t f ′ t /T = I r and Λ ′ e Σ − u Λ beingdiagonal. The estimated β for each given (Λ , f t ) is simply β (Λ , f t ) = ( T X t =1 X ′ t e Σ − u X t ) − T X t =1 X ′ t e Σ − u ( Y t − Λ f t ) . On the other hand, given β , the variable Y t − X t β has a factor structure. Hence the estimated(Λ , f t ) are the WPC estimators: let X ( ˆ β ) be an N × T matrix X ( ˆ β ) = ( X ˆ β, ..., X T ˆ β ) . Thecolumns of the T × r matrix e F / √ T = ( e f , ..., e f T ) ′ / √ T are the eigenvectors corresponding tothe largest r eigenvalues of ( Y − X ( ˆ β )) ′ e Σ − u ( Y − X ( ˆ β )), and e Λ = T − ( Y − X ( ˆ β )) e F .

Therefore,given (Λ , f t ), we can estimate β , and given β , we can estimate (Λ , f t ). So ˆ β can be simplyobtained by iterations, with an initial value ˆ β . This iteration scheme only requires twomatrix inverses: e Σ − u and ( P Tt =1 X ′ t e Σ − u X t ) − , which do not update during iterations. Basedon our experience of numerical studies, the iterations converge fast.Similar to Fan et al. (2013), the covariance estimator can be constructed based onthresholding. Let ˆ β be a “regular PC estimator” that takes W = I N in (5.3), which isknown to be √ N T -consistent (e.g., Bai 2009, Moon and Weidner 2010). Apply the singularvalue decomposition to 1 T T X t =1 ( Y t − X t ˆ β )( Y t − X t ˆ β ) ′ = N X i =1 ν i g i g ′ i , where ( ν j , g j ) Nj =1 are the eigenvalues-eigenvectors of T P Tt =1 ( Y t − X t ˆ β )( Y t − X t ˆ β ) ′ in adecreasing order such that ν ≥ ν ≥ ... ≥ ν N . Then e Σ u = ( e Σ u,ij ) N × N , e Σ u,ij =  e R ii , i = js ij ( e R ij ) , i = j , e R = ( e R ij ) N × N = N X i = r +1 ν i g i g ′ i , where s ij ( · ) is the same thresholding function as deﬁned in Section 4.2 with the same thresh-old τ ij . Rearrange the design matrix Z = ( X , ..., X T , X , ..., X T , ..., X N , ..., X NT ) ′ , N T × d. T × r matrix F , let M F = I T − F ( F ′ F ) − F ′ /T . The following matrices play animportant role in the identiﬁcation and asymptotic analysis: A F = h Σ − u − Σ − u Λ (cid:0) Λ ′ Σ − u Λ (cid:1) − Λ ′ Σ − u i ⊗ M F ,V ( F ) = 1 N T Z ′ A F Z, (5.5)where (Λ , Σ − u ) in the above represent the true loading matrix and inverse error covariancein the data generating process, and ⊗ denotes the Kronecker product. Our ﬁrst conditionassumes that V ( F ) is positive deﬁnite in the limit uniformly over a class of F . Assumption 5.1.

With probability approaching one, inf F : F ′ F/T = I r λ min ( V ( F )) > . If we write B F = h Σ − / u − Σ − u Λ (Λ ′ Σ − u Λ) − Λ ′ Σ − / u i ⊗ M F , then A F = B F B ′ F . So V ( F )is at least semi-positive deﬁnite. Also, summing over N T rows of Z should lead to a strictlypositive deﬁnite matrix V ( F ). As a suﬃcient condition, if X it depends on the factors andloadings through: X it = τ i + θ t + r X k =1 a k λ ik + r X k =1 b k f kt + r X k =1 c k λ ik f kt + η it where a k , b k , c k are constants (can be zero) and η it is i.i.d. over both i and t , then Assumption5.1 is satisﬁed (see Bai 2009).Let U = ( u , ..., u T , u , ..., u T , ..., u N , ..., u NT ) ′ , and F be the T × r matrix of truefactors. Assumption 5.2.

There is a dim( β ) × dim( β ) positive deﬁnite matrix Γ such that V ( F ) → p Γ , √ N T Z ′ A F U → d N (0 , Γ) . This assumption is required for the asymptotic normality of ˆ β , because it can be shownthat, √ N T ( ˆ β − β ) = V ( F ) − √ N T Z ′ A F U + o p (1) . Hence the asymptotic normality depends on that of √ NT Z ′ A F U . Assumption 5.2 isnot stringent because if we write B ′ F U = ( e u , ..., e u T , e u , ..., e u NT ) ′ , and Z ′ B F =( e Z , ..., e Z T , e Z , ..., e Z NT ), then 22 √ NT Z ′ A F U = √ NT P Tt =1 P Ni =1 e Z it e u it is a standardized summation. We can further write √ N T ( ˆ β − β ) = N T T X t =1 N X i =1 e Z it e Z ′ it ! − √ N T T X t =1 N X i =1 e Z it e u it + o p (1) . Hence the second statement of Assumption 5.2 is a central limit theorem for √ NT P Tt =1 P Ni =1 e Z it e u it on both cross-sectional and time domains. In addition, in the ab-sence of serial correlation, the conditional covariance of √ NT Z ′ A F U given Z and F equals NT Z ′ A F (Σ u ⊗ I T ) A F Z = V ( F ). This implies that the asymptotic variance of √ N T ( ˆ β − β )is simply Γ − . The issue described in Section 2 arises in establishing1 √ N T Z ′ [( e Σ − u − Σ − u ) ⊗ I T ] U = o p (1) , (5.6)which is the eﬀect of estimating the large covariance Σ − u . In fact, the ﬁrst order conditionof ˆ β leads to √ N T ( ˆ β − β ) = V ( F ) − √ N T Z ′ b AU + o p (1) , where b A is as A F with Σ − u replaced with e Σ − u and F replaced with e F . Hence we need1 √ N T Z ′ ( b A − A F ) U = o p (1) . (5.7)This requires the weighted convergence (5.6). However, when N/T → ∞ , achieving (5.6)is technically diﬃcult. Similar to the case described in the approximate factor model, theabsolute convergence of k e Σ − u − Σ − u k is not suitable for inferences.We consider the Gaussian case for simplicty, and the problem is still highly technicallyinvolved. Non-Gaussian case will be even more challenging, and we shall leave it for futureresearch. Assumption 5.3. (i) u t is distributed as N (0 , Σ u ) .(ii) { u t } t ≥ is independent of { f t , X t } t ≥ , and { u t , f t , X t } are serially independent across t . It is possible to relax Condition (ii) to allow for serial correlations, but ˆ β will be asymp-totically biased. 23 .4 Limiting distribution We require the same conditions on the data generating process for the factors, loadingsand the sparsity of Σ u as in Sections 2 and 4. Proposition 5.1.

Under Assumptions 3.2- 3.4, 4.1, 5.1-5.3, as

N/T → ∞ , and m N = o ( T ) , we have the weighted convergence: √ N T Z ′ ( b A − A F ) U = o p (1) . We have the following limiting distribution.

Theorem 5.1.

Under the assumptions of Proposition 5.1, the asymptotic limiting distribu-tion of ˆ β is the same when either W = Σ − u or the feasible weight W T = e Σ − u is used as theweight matrix, and is given by √ N T ( ˆ β − β ) → d N (0 , Γ − ) . The asymptotic variance Γ − is the limit of V ( F ) − . Note that under the same set ofconditions, the regular PC method of Bai (2009) and Moon and Weidner (2010) gives anasymptotic conditional covariance (given Z, F ) of the sandwich-formula: V ≡ ( 1 N T Z ′ GZ ) − N T Z ′ G (Σ u ⊗ I T ) GZ ( 1 N T Z ′ GZ ) − , where G is deﬁned as A F with Σ − u replaced with I N . It is not hard to show that V − V ( F ) − is semi-positive deﬁnite. So relative eﬃciency is gained when WPC is used. In fact, the choice W = e Σ − u is also the optimal weight matrix for WPC in this case.To estimate the asymptotic variance of ˆ β , let e A equal A F with F , Λ and Σ − u replacedwith e F , e Λ and e Σ − u . Deﬁne e Γ = NT Z ′ e AZ . The following result enables us to constructconﬁdence intervals and conduct hypothesis tests for β under large samples. Theorem 5.2.

Under the assumptions of Theorem 5.1, e Γ − → p Γ − . The methods of Section 4 also carry over to derive the limiting distributions of theestimated interactive eﬀects λ ′ i f t . The procedure and corresponding results are very similargiven the √ N T -consistency of ˆ β . Hence we omit repeated discussions.24 .5 Estimation with unknown number of factors For simplicity of presentations, we have assumed the number of factors r to be known.As was shown by many authors, estimation results are often robust to over-estimating r .For instance, Moon and Weidner (2011) have shown that for inference on the regressioncoeﬃcients one does not need to estimate r consistently, as long as the “working number” isnot less than the true value. On the other hand, we can also start with a consistent estimatorˆ r using a similar method of Bai and Ng (2002) and Bai (2009).Speciﬁcally, suppose there is a known upper bound ¯ r of the number of factors. For each k ≤ ¯ r , deﬁne b σ ( k ) = min β, Λ k ,f t,k N T T X t =1 ( Y t − Λ ′ k f t,k − X t β ) ′ ( Y t − Λ ′ k f t,k − X t β )where each row of Λ k is a k -dimensional loading vector, and f t,k is also k -dimensional.The above minimization is subject to the constraint that T P Tt =1 f t,k f ′ t,k = I k and Λ ′ k Λ k isdiagonal. The iterative algorithm based on principal components can calculate the aboveminimization fast. Under our conditions, Bai (2009) showed that r can be consistentlyestimated by either minimizing CP( k ) or IC( k ), whereCP( k ) = b σ ( k ) + b σ (¯ k )[ k ( N + T ) − k ] log( N T ) N T , and IC( k ) = log b σ ( k ) + [ k ( N + T ) − k ] log( N T ) N T .

We then can apply the estimator ˆ r to construct the WPC estimator, and achieve the samelimiting distributions. Estimation procedure and its theoretical properties can be proved tobe the same as before, so details are not presented to avoid repetition. We conduct numerical experiments to compare the proposed WPC with the popularmethods in the literature . The idiosyncratic error terms are generated as follows: let We have written a Matlab code to implement the proposed WPC for any user-speciﬁed weight matrixas well as the optimal WPC for both the factor model and panel data model with interactive eﬀects, availableupon request. ǫ it } i ≤ N,t ≤ T be i.i.d. N (0 ,

1) in both t, i . Let u t = ǫ t , u t = ǫ t + a ǫ t , u t = ǫ t + a ǫ t + b ǫ t ,u i +1 ,t = ǫ i +1 ,t + a i ǫ it + b i − ǫ i − ,t + c i − ǫ i − ,t , where { a i , b i , c i } Ni =1 are i.i.d. N (0 , u is a banded matrix, possessing both cross-sectional correlation and heteroskedasticity. Let the two factors { f t , f t } be i.i.d. N (0 , { λ i, , λ i, } i ≤ N be uniform on [0 , R as suggested by Fan et al. (2013). Design 1

Consider the pure factor model y it = λ i f ,t + λ i, f t + u it , where we estimate the factorloadings { λ i, , λ i, } i ≤ N and factors { f t , f t } . For each estimator, the smallest canonicalcorrelation (the larger the better) between the estimators and parameters are calculated,as an assessment of the estimation accuracy. The simulation is replicated for one hundredtimes, and the average canonical correlations for several competing methods are reported inTable 2. The mean squared error of the estimated common components are also compared.Table 2: Canonical correlations for simulation studyLoadings Factors ( NT P i,t ( b λ ′ i b f t − λ ′ i f t ) ) / T N

PC HWPC EWPC PC HWPC EWPC PC HWPC EWPC(the larger the better) (the larger the better) (the smaller the better)50 75 0.346 0.429 0.487 0.403 0.508 0.566 0.621 0.583 0.54550 100 0.411 0.508 0.553 0.476 0.602 0.666 0.546 0.524 0.49850 150 0.522 0.561 0.602 0.611 0.679 0.746 0.467 0.444 0.427100 80 0.433 0.545 0.631 0.427 0.551 0.652 0.570 0.540 0.496100 150 0.613 0.761 0.807 0.661 0.835 0.902 0.385 0.346 0.307100 200 0.751 0.797 0.822 0.827 0.882 0.924 0.333 0.312 0.284150 100 0.380 0.558 0.738 0.371 0.557 0.749 0.443 0.394 0.334150 200 0.836 0.865 0.885 0.853 0.897 0.942 0.313 0.276 0.240150 300 0.882 0.892 0.901 0.927 0.946 0.973 0.257 0.243 0.222

The columns of loadings and factors report the canonical correlations. PC is the regular principalcomponents method; HWPC represents the heteroskedastic WPC; EWPC uses b Σ − u as the weightmatrix. We see that the estimation becomes more accurate when we increase the dimensionality.26WPC improves the regular PC, while the EWPC gives the best estimation results.

Design 2

Adding a regression term to the model of Design 1, we consider the panel data modelwith interactive eﬀect: y it = X ′ it β + λ i f ,t + λ i, f t + u it , where the true β = (1 , ′ . Theregressors are generated to be dependent on ( f t , λ i ): X it, = 2 . λ i f ,t − . λ i f ,t − η it, , X it, = λ i f ,t − λ i f ,t + 1 + η it, where η it, and η it, are independent i.i.d. standard normal.Both the methods PC (Bai 2009 and Moon and Weidner 2011) and the proposed WPCare carried out to estimate β for the comparison. Also compared is the mean squared errorof the estimated common components. The simulation is replicated for one hundred times;results are summarized in Table 3. We see that both methods are almost unbiased, while theeﬃcient WPC indeed has signiﬁcantly smaller standard errors than the regular PC methodin the panel model with interactive eﬀects.Table 3: Method comparison for the panel data with interactive eﬀects, simulation β = 1 β = 3Mean Normalized SE Mean Normalized SE T N

WPC PC WPC PC WPC PC WPC PC50 75 1.005 1.013 0.758 1.413 2.998 3.002 0.744 1.47250 100 1.005 1.010 0.662 1.606 2.997 2.998 0.731 1.61650 150 1.004 1.008 0.964 1.913 2.999 2.999 0.951 1.881100 100 1.002 1.010 0.550 1.418 3.000 3.003 0.416 1.353100 150 1.003 1.007 0.681 1.626 2.999 3.000 0.611 1.683100 200 1.002 1.005 0.631 1.800 3.000 3.000 0.774 1.752150 100 1.003 1.006 0.772 1.399 3.000 2.999 0.714 1.458150 150 1.001 1.005 0.359 1.318 3.000 3.001 0.408 1.379150 200 1.001 1.003 0.547 1.566 3.000 3.000 0.602 1.762

WPC (with weight e Σ − u ) and PC (existing method) comparison. “Mean” is the average of theestimators; “Normalized SE” is the standard error of the estimators multiplied by √ N T . Empirical Study : Eﬀects of Divorce Law Reforms

This section shows the advantages of our proposed WPC method in a real data applica-tion. It demonstrates the gain of incorporating the estimated Σ u in the panel data estimationand the eﬃciency gains compared to the traditional PC. An important question in sociology is the cause of the sharp increase in the U.S. divorcerate in the 1960s and 1970s. The association between divorce rates and divorce law reformshas been considered a potential key, and during 1970s, about three quarters of states in theU.S. liberalized their divorce system, so-called “no-fault revolution”. There is plenty empir-ical research regarding the eﬀects of divorce law reforms on the divorce rates (e.g., Peters1986, Allen 1992), and statistical signiﬁcance of these eﬀects has been found (e.g., Friedberg1998). In other words, states’ law reforms are found to have signiﬁcantly contributed to theincrease in state-level divorce rates within the ﬁrst eight years following reforms.On the other hand, there is a puzzle about longer eﬀects. Empirical evidence also il-lustrates the subsequent decrease of the divorce rates starting from (around) 1975, whichis between nine and fourteen years after the law reforms in most states. So whether lawreforms continue to contribute to the rate decrease has been an interesting question. Wolfers(2006) studied a treatment eﬀect panel data model, and identiﬁed negative eﬀects for thesubsequent years. This suggests that, the increase in divorce following reform and the sub-sequent decrease may be two sides of the same treatment: after earlier dissolution of badmatches after law reforms, marital relations were gradually aﬀected and changed. However,it has been argued that Wolfers (2006)’s approach may not capture the complex unobservedheterogeneity. The heterogeneity may exist through an interactive eﬀect, where unobservedcommon factors may change over time.Kim and Oka (2013) pioneered using interactive eﬀect model for the study: y it = K X k =1 X it,k β k + λ ′ i f t + µ i + α t + f ( δ i , t ) + u it , (7.1)where y it is the divorce rate for state i in year t ; X it,k is a binary regressor, representing thetreatment eﬀect 2 k years after the reform. Speciﬁcally, we observe the law reform year T i for each state. Then X it,k = 1 if 2 k − ≤ t − T i ≤ k , and zero otherwise. In addition to theinteractive eﬀect λ ′ i f t as being discussed, the model also contains unobserved state and timeeﬀects ( µ i , α t ) and time trend f ( δ i , t ). For instance, the linear trend deﬁnes f ( δ i , t ) = δ i t δ i . Using the regular PC method, Kim and Oka (2013) concludedinsigniﬁcant ( β , ..., β ), that is, the divorce rates after eight years and beyond are not aﬀectedby the reforms. However, We argue that using the regular PC method to estimate the modelmay lose eﬃciency because it ignores the oﬀ-diagonal entries. As a result, this can result inwide conﬁdence intervals and possibly conservative conclusions.We re-estimate Kim and Oka (2013)’s model using the new WPC, and compare with theregular PC. As a ﬁrst step, we rewrite the model to ﬁt in the form being considered in thispaper. Introduce the conventional notation:˙ y it = y it − T T X t =1 y it − N N X i =1 y it + 1 N T N X i =1 T X t =1 y it . Let ˙ X it,k , ˙ u it be deﬁned similarly. If the time trend f ( δ i , t ) is not present, under theconventional normalizations P Ni =1 λ i = P Tt =1 f t = 0 , P Ni =1 µ i = P Tt =1 α t , we have ˙ y it =˙ X ′ it β + λ ′ i f t + ˙ u it . The same data as in Wolfers (2006) and Kim and Oka (2013) are used, which contain thedivorce rates, state-level reform years and binary regressors from 1956 to 1988 ( T = 33) over48 states. We ﬁt the models both with and without linear time trend, and apply regular PCand our proposed WPC in each model to estimate β with conﬁdence intervals. The numberof factors is selected in a data-driven way as in Bai (2009). His IC and CP both suggestedten factors. Moreover, for the WPC, the threshold value in the estimated covariance isobtained using the suggested cross-validation procedure in Fan et al. (2013). The estimated( β , ..., β ) and their conﬁdence intervals are summarized in Table 4.Both models produce similar estimates. Interestingly, WPC conﬁrms that the law reformssigniﬁcantly contribute to the subsequent decrease of the divorce rates, more speciﬁcally, 9-14 years after the reform in the model with linear time trends, and 11-14 years after in themodel without linear time trends. In contrast, the regular PC reaches a more conservativeconclusion as it does not capture these signiﬁcant negative eﬀects. Moreover, both methodsshow that the eﬀect on the increase of divorce rates for the ﬁrst 6 years are signiﬁcant, whichis consistent with previous ﬁndings in this literature.We also report the relative eﬃciency using WPC, relative to the regular PC. The reportednumbers are var(WPC) / var(PC), where var(A) calculates the estimated variance of the When the time trend is present, we can do a simple projection to eliminate the time trend, and stillestimate the untransformed β from the familiar interactive eﬀect model. For instance, suppose f ( δ i , t ) = δ i t .Let M = (1 , , ..., T ) ′ and P M = I T − M ( M ′ M ) − M ′ . We can deﬁne e Y i = P M ( y i , ..., y iT ) ′ , and e X i = P M ( X i , ..., X iT ) ′ , and deﬁne ˙ e y it and ˙ e X it accordingly from e y it and e X it . This is the same as in Kim and Oka (2013). We also tried a few larger values for r , and the estimatesare similar, consistent with previous ﬁndings that the estimation is robust to over-estimating r .

95% conﬁdence intervals are reported; intervals with * are signiﬁcant. Relative eﬃciency isreferred to WPC relative to PC, as estimated var(

W P C ) / var( P C ) . Let us further demonstrate the relative eﬃciency WPC gains by incorporating the esti-mated Σ − u through simulated data. The true parameters are estimated from the real dataas described above. Speciﬁcally, we use the ﬁrst column from Table 4 (no linear trend) asthe true β , and the corresponding estimated Λ as the true loading matrix. We ﬁx N = 48as before. To pertain the actual cross-sectional dependence, in the simulation, the true errorterms, factors, and regressors are generated as simple random samples (with replacement)of size T from the estimated residuals, factors and regressors from the real data.Simulations are conducted with one hundred replications. The averages and the standarddeviations for each estimated component are reported in Table 5, representing the bias andstandard error. Also reported is the relative eﬃciency, deﬁned as var(WPC) / var(PC). It isclearly shown in the table that the standard errors of WPC are uniformly smaller than thoseof PC. In addition, most of the time WPC also reduces the ﬁnite sample bias. The relativeeﬃciency varies from 39% to 52%, which illustrates 48%-61% eﬃciency gain. Overall, afterincorporating the error covariance, the performance of the estimator is signiﬁcantly improved. The literature on estimating high-dimensional sparse covariance matrices has targeted onthe covariance and inverse covariance directly, and the theoretical results are mostly in anabsolute convergence form. We see that the absolute convergence, even though achieving theminimax optimal rate, is often not suitable for statistical inference. Thus using an estimatedhigh-dimensional covariance matrix as the optimal weight matrix is highly-nontrivial. Westudy a new notion of “weighted convergence” to show that the eﬀect of estimating a high-dimensional covariance matrix is indeed asymptotically negligible.This paper studies in detail two models that are of increasing importance in appliedstatistics: approximate factor model and panel data with unobservable interactive eﬀects.We propose a method of weighted principal components, which uses a high-dimensionalweight matrix. The eﬃcient weight uses the inverse error covariance matrix. The EWPCconsiders both heteroskedasticity and cross-sectional dependence. It is shown that EWPCuses the optimal weight matrix over the class of WPC estimators thus it is the most eﬃcient.The EWPC is applied to the year-state divorce rate data. The new method captures the31able 5: Method comparison in eﬀects of divorce law reform: simulated dataBias Normalized SE RelativeWPC PC WPC PC Eﬃciency T = 50First 2 years -0.008 -0.013 1.077 1.714 0.3933-4 years -0.023 -0.033 1.911 2.694 0.4945-6 years -0.040 -0.058 2.743 3.821 0.5257-8 years -0.054 -0.080 3.429 4.899 0.5019-10 years -0.068 -0.103 4.017 5.633 0.50111-12 years -0.073 -0.107 4.262 6.221 0.47513-14 years -0.081 -0.124 4.703 6.907 0.46215 years+ -0.090 -0.133 5.094 7.691 0.439 T = 70First 2 years -0.002 -0.000 0.927 1.449 0.4083-4 years -0.008 -0.008 1.623 2.434 0.4385-6 years -0.021 -0.028 2.434 3.420 0.5057-8 years -0.030 -0.039 3.246 4.579 0.5079-10 years -0.043 -0.060 4.115 5.738 0.51311-12 years -0.048 -0.061 4.579 6.492 0.50113-14 years -0.055 -0.076 5.101 7.245 0.49515 years+ -0.062 -0.079 5.564 8.173 0.465 “Normalized SE” is the standard error of the estimator multiplied by √ N T . The relativeeﬃciency is calculated as the square of the ratio of the third and fourth columns, estimating var(

W P C ) / var( P C )32igniﬁcant (negative) eﬀects from nine to twelve years after the law was reformed, consistentwith the previous empirical ﬁndings in the social science literature. The estimator is moreaccurate and produces tighter conﬁdence intervals.All proofs are given in the supplementary material.33 eferences

Ahn, S., Lee, Y. and

Schmidt, P. (2001). GMM estimation of linear panel data modelswith time-varying individual eﬀects.

J. Econometrics . , 219-255. Allen, D. W. (1992). Marriage and divorce: comment.

American Economic Review , ,679-685. Andersen, T., Bollerslev, T., Christoffersen, P. and

Diebold, F. (2011). Finan-cial risk measurement for ﬁnancial risk management.

Manuscript . Northwestern University.

Bai, J. (2003). Inferential theory for factor models of large dimensions.

Econometrica . Bai, J. (2009). Panel data models with interactive ﬁxed eﬀects.

Econometrica . Bai, J. and

Li, K. (2012). Statistical analysis of factor models of high dimension.

Ann.Statist . , 436-465. Bai, J. and

Ng, S. (2002). Determining the number of factors in approximate factor models.

Econometrica . Bickel, P. and

Levina, E. (2008). Covariance regularization by thresholding.

Ann. Statist. Bien, J. and

Tibshirani, R. (2011). Sparse estimation of a covariance matrix.

Biometrika , Boivin, J. and

Ng, S. (2006). Are More Data Always Better for Factor Analysis?

J.Econometrics . , 169-194. Breitung, J. and

Tenhofen, J. (2011). GLS estimation of dynamic factor models.

J.Amer. Statist. Assoc. , 11501166.

Cai, T. and

Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estimation.

J. Amer. Statist. Assoc. , 672-684.

Cai, T. and

Zhou, H. (2012). Optimal rates of convergence for sparse covariance matrixestimation.

Ann. Statist . , 2389-2420. Caner, M. and

Han, X. (2012). Using bridge estimators to determine number of factorsin multifactor models: case of large panel data.

Manuscript .34 hamberlain, G. and

Rothschild, M. (1983). Arbitrage, factor structure and mean-variance analysis in large asset markets.

Econometrica . Cheng, X. and

Hansen, B. (2013). Forecasting with factor-augmented regression: a fre-quentist model averaging approach. Forthcoming in

J. Econometrics . Choi, I. (2012). Eﬃcient estimation of factor models.

Econometric Theory . Aspremont, A., Banerjee, O. and

Ghaoui, L. (2008). First-order methods for sparsecovariance selection.

SIAM Journal on Matrix Analysis and Applications . , 56-66 Doz, C., Giannone, D. and

Reichlin, L. (2012). A quasi-maximum likelihood approachfor large, approximate dynamic factor models.

The Review of Economics and Statistics . ,1014-1024. El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparsecovariance matices.

Ann. Statist. , 2717-2756. Fan, J. and

Li, R. (2001). Variable selection via nonconcave penalized likelihood and itsoracle properties.

J. Amer. Statist. Assoc. Fan, J., Liao, Y. and

Mincheva, M. (2013). Large covariance estimation by thresholdingprincipal orthogonal complements (with discussion).

J. R. Stat. Soc. Ser. B. . To appear.

Forni, M., Hallin, M., Lippi, M. and

Reichlin, L. (2000). The generalized dynamicfactor model: identiﬁcation and estimation.

The Review of Economics and Statistics . Friedberg, L. (1998). Did unilateral divorce raise divorce rates? Evidence from paneldata.

American Economic Review , , 608-627 Hallin, M. and

Liˇska, R. (2007). Determining the number of factors in the general dy-namic factor model.

J. Amer. Statist. Assoc. , 603-617.

Hansen, L. (1982). Large sample properties of generalized method of moments estimators,

Econometrica . , 1029-1054. Kim, D. and

Oka, T. (2013). Divorce law reforms and divorce rates in the U.S.: aninteractive ﬁxed eﬀects approach.

J. Appl. Econometrics , Lam, C. and

Fan, J. (2009). Sparsistency and rates of convergence in large covariancematrix estimation.

Ann. Statist. am, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inferencefor the number of factors.

Ann. Statist. , 694-726. Ledoit, O. and

Wolf, M. (2012). Nonlinear shrinkage estimation of large-dimensionalcovariance matrices.

Ann. Statist. , 1024-1060 Luo, X. (2011). High dimensional low rank and sparse covariance matrix estimation viaconvex minimization.

Manuscript . Moon, R. and

Weidner, M. (2010). Dynamic linear panel regression models with inter-active ﬁxed eﬀects. manuscript . Moon, R. and

Weidner, M. (2011). Linear regression for panel with unknown number offactors as interactive ﬁxed eﬀects. manuscript . Newey, W. and

West, K. (1987). A simple, positive semi-deﬁnite, heteroskedasticity andautocorrelation consistent covariance matrix.

Econometrica . , 703-708. Pati, D., Bhattacharya, A., Pillai, N. and

Dunson, D. (2012). Posterior contractionin sparse Bayesian factor models for massive covariance matrices. manuscript . Pesaran (2006). Estimation and inference in large heterogeneous panels with a multifactorerror structure.

Econometrica . , 967-1012. Peters, H. E. (1986). Marriage and divorce: informational constraints and private con-tracting,

American Economic Review , , 437-454. Phan, Q. (2012). On the sparsity assumption of the idiosyncratic errors covariance matrix-Support from the FTSE 100 stock returns.

Manuscript . University of Warwick.

Rothman, A., Bickel, P., Levina, E. and

Zhu, J. (2008). Sparse permutation invariantcovariance estimation.

Electron. J. Stat. , 494-515. Rothman, A., Levina, E. and

Zhu, J. (2009). Generalized thresholding of large covari-ance matrices.

J. Amer. Statist. Assoc.

Stock, J. and

Watson, M. (2002). Forecasting using principal components from a largenumber of predictors.

J. Amer. Statist. Assoc. , 1167-1179. Su, L. and

Chen, Q. (2013). Testing homogeneity in panel data models with interactiveﬁxed eﬀects. Forthcoming in

Econometric Theory .36 u, L., Jin S. and

Zhang, Y. (2012). Speciﬁcation test for panel data models with inter-active ﬁxed eﬀects,

Manuscript . Wang, P. (2009). Large dimensional factor models with a multi-level factor structure: iden-tiﬁcation, estimation and inference.

Manuscript . Hong Kong University of Science and Tech-nology.

Wolfers, J. (2006). Did unilateral divorce raise divorce rates? A reconciliation and newresults.

American Economic Review , , 1802-1820. Xue, L., Ma, S. and

Zou, H. (2012). Positive-deﬁnite l -penalized estimation of largecovariance matrices. J. Amer. Statist. Assoc. , 1480-1491.

Zhang, C. (2010). Nearly unbiased variable selection under minimax concave penalty,

Ann.Statist. ,38