Recent Developments on Factor Models and its Applications in Econometric Learning
Jianqing Fan*, Kunpeng Li†, Yuan Liao‡

Abstract
This paper makes a selective survey of the recent development of the factor model and its application to statistical learning. We focus on the perspective of the low-rank structure of factor models, and particularly draw attention to estimating the model from the low-rank recovery point of view. The survey mainly consists of three parts: the first part is a review of new factor estimations based on modern techniques for recovering low-rank structures of high-dimensional models. The second part discusses statistical inferences of several factor-augmented models and applications in econometric learning models. The final part summarizes new developments dealing with unbalanced panels from the matrix completion perspective.

Key words: factor models, spiked low-rank matrix, matrix completion, unbalanced panel, multiple testing, high-dimensional

*Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; email: [email protected]. His research is supported by NSFC grants 71991470 and 71991471.
†International School of Economics and Management, Capital University of Economics and Business, Beijing 100070, China; email: [email protected].
‡Department of Economics, Rutgers University, 75 Hamilton St., New Brunswick, NJ 08901, USA; email: [email protected].
1 Introduction

The recent decade has witnessed a blossom of developments in statistical learning theory and practice, embraced with exciting progress on large-scale optimization and dimension reduction techniques. Factor models, as one of the central machineries for summarizing and extracting information from large-scale datasets, have received much attention in this revolutionary era of data science, and many breakthrough methodologies and applications have been developed in this exciting area.

This paper makes a selective overview of the recent developments of the factor model and its applications in econometric learning. Our review focuses on the perspective of the low-rank structure of factor models, and draws particular attention to estimating the model from the low-rank recovery point of view. A central focus in the progress of this literature is understanding and recovering low-rank structures of high-dimensional models. Many new learning theories and methods have been developed, which have revolutionized the modern understanding of econometric modeling. Meanwhile, the low-rank structure is one of the key properties of factor models. While researchers have long been aware of this structure, studying the factor model from the perspective of low-rank matrix recovery is relatively new, and has led to many exciting new discoveries and understandings.

The survey mainly consists of three parts: the first part is a review of new factor estimation based on modern techniques for recovering low-rank structures of high-dimensional models. The second part discusses statistical inferences of several factor-augmented models and applications in statistical learning models. The final part summarizes new developments dealing with unbalanced panels from the matrix completion perspective.

We concentrate on recent developments in methodologies and applications in econometric learning. For a more comprehensive account of this topic, see Chapters 9-11 of the book by Fan et al. (2020c). Meanwhile, several important topics are not covered in this survey, but have also generated extensive research in the literature. Those include selecting the number of factors, weak factors, identification, continuous-time and time-varying models, nonstationarity and structural breaks, Bayesian methods, bootstrap factors, as well as more sophisticated panel data models. Several excellent reviews have been written with emphasis on these topics. For those reviews, we refer to Stock and Watson (2016) for dynamic factor models with applications in macroeconomics, to Bai and Wang (2016) for time series and panel data models, and to Gagliardini et al. (2019) for a recent review of conditional factor models with applications to finance. Another class of estimation is a hybrid of the PCA-method and the state space approach; see Giannone et al. (2008) and Doz et al. (2011) for more discussions. In addition, the generalized dynamic factor model is another important strand of literature, where factors are often estimated using dynamic principal components, the frequency domain analog of principal components, developed by Brillinger (1964). Forni et al. (2000, 2005) provided rates of convergence of the common component estimated by dynamic principal components. Finally, we refer to the following papers for more detailed developments, among others: Bai and Ng (2002); Ahn and Horenstein (2013); Onatski (2010); Li et al. (2017); Bai and Li (2012, 2016); Onatski (2012); Chudik et al. (2011); Cheng et al. (2016); Massacci (2017); Gagliardini et al.
(2016); Goncalves and Perron (2018); Baltagi et al. (2017); Barigozzi et al. (2018); Aït-Sahalia and Xiu (2017); Chen et al. (2019a); Liao and Yang (2018); Li et al. (2019); Pelger (2019); Su and Wang (2017).

We use the following notation. For a matrix $A$, let $\lambda_i(A)$ denote the $i$th largest singular value of $A$, and use $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ to denote its smallest and largest eigenvalues. We define the Frobenius norm $\|A\|_F = \sqrt{\mathrm{tr}(A'A)}$, the operator norm $\|A\| = \sqrt{\lambda_{\max}(A'A)}$, the element-wise norm $\|A\|_\infty = \max_{ij}|A_{ij}|$, and the matrix $\ell_1$-norm $\|A\|_{\ell_1} := \max_{i\le N}\sum_{j=1}^N |A_{ij}|$. In addition, define the projection matrices $P_A = A(A'A)^{-1}A'$ and $M_A = I - P_A$ when $A'A$ is invertible. Finally, for two (random) sequences $a_T$ and $b_T$, we write $a_T \ll b_T$ (or $b_T \gg a_T$) if $a_T = o_P(b_T)$.

2 Spiked Incoherent Low-Rank Models
Modern high-dimensional factor models can be viewed as a type of spiked incoherent low-rank model, a broad class of models that has drawn active research in the recent decade. A spiked incoherent low-rank model typically refers to a large matrix $\Sigma$ (either observable or not) having the following decomposition:
$$\Sigma = L + S. \qquad (2.1)$$
Such a decomposition satisfies the following three properties:

(i) Low-rank. The rank of $L$ is either bounded or grows very slowly compared to its dimensions.

(ii) Spikedness. The nonzero singular values of $L$ grow fast, while the largest singular value of $S$ is either bounded or grows much more slowly.

(iii) Incoherence (also known as "pervasiveness"). The left and right singular vectors of $L$, corresponding to the nonzero singular values, should have diversified elements; that is, elements of the rescaled singular vectors should be uniformly bounded.

The low-rank structure achieves dimension reduction: suppose the matrix $\Sigma$ is of $N_1\times N_2$ dimensions, while the rank of $L$ is $r$. Then the low-rank structure reduces the dimension from $O(N_1 N_2)$ to $O(N_1 + N_2)r$; the latter is the magnitude of the number of parameters in $L$. Meanwhile, the spikedness helps separate $L$ from $S$ approximately, and ensures that the large "signals" concentrate on $L$, the low-rank component. Finally, the incoherence, a condition that excludes matrices that are simultaneously low-rank and sparse, enables us to estimate the singular vectors well.

We explain these three properties using the matrix form of factor models. Consider
$$y_{it} = b_i' f_t + u_{it}, \quad i\le N,\ t\le T, \qquad (2.2)$$
where $f_t$ is an $r$-dimensional vector of factors, $b_i$ is the loading vector, and $u_{it}$ is the idiosyncratic noise. Specifically, (2.1) applies to two decompositions of this model.

Factor Decomposition. The matrix form of the factor model gives
$$Y = M + U, \quad M := BF',$$
where $Y$ and $U$ are $N\times T$ matrices of $y_{it}$ and $u_{it}$; $B$ is the $N\times r$ matrix of $b_i$ while $F$ is the $T\times r$ matrix of $f_t$. Then, corresponding to the notation of (2.1), $\Sigma = Y$, $L = M$ and $S = U$. In this decomposition, $\Sigma$ is observable. Apparently, $M$ is a low-rank matrix with rank $r$. The nonzero singular values of $M$, under the strong factor assumption, grow much faster than those of $U$, which gives rise to the spikedness property. Now let $\xi$ be the $N\times r$ matrix whose columns are the left singular vectors of $M$, and let $\xi_i'$ denote its $i$th row. Then, under the assumption that the nonzero eigenvalues of $B'B$ grow fast with $N$, for some constant $C >$
$0$,
$$\max_{i\le N}\|\sqrt N\,\xi_i\| \le C\max_{i\le N}\|b_i\|, \qquad (2.3)$$
which gives rise to the incoherent singular vectors. The right singular vectors can be bounded similarly.

Covariance Decomposition.
It is also well known from the factor model (2.2) that the covariance matrix of $y_t = (y_{1t},\cdots,y_{Nt})'$, denoted by $\Sigma_y$, can be decomposed as follows:
$$\Sigma_y = L + \Sigma_u, \quad L := B\,\mathrm{cov}(f_t)\,B', \qquad (2.4)$$
where $\Sigma_u$ denotes the covariance matrix of $u_t$. The above decomposition is well known for portfolio allocation and risk management, where the total volatility is decomposed into the systematic risk $L$ plus the (sparse) idiosyncratic risk $\Sigma_u$. It also leads to the spiked incoherent low-rank model, but $\Sigma_y$ is unknown and needs to be estimated.

There are two general approaches to estimating model (2.1): (i) Principal Components Analysis (PCA), and (ii) low-rank regularization. Here we present a general PCA estimation setting, and defer the discussion of low-rank regularization to Section 3.2. We shall assume $\mathrm{rank}(L) = r$ to be known.

For any matrix $A$, let $A = U_A D_A V_A'$ denote the singular value decomposition (SVD) of $A$. Define the singular value hard-thresholding operator as
$$H_R(A) := U_A \bar D_R V_A', \qquad (2.5)$$
where $\bar D_R$ is a diagonal matrix that keeps the top $R$ diagonal elements of $D_A$ and replaces the remaining elements by zeros. So $H_R(A)$ is the best rank-$R$ matrix approximation to $A$. Suppose an estimator of $\Sigma$, denoted by $\hat\Sigma$, is available, satisfying
$$\|\hat\Sigma - \Sigma\| = O_P(\eta_N), \quad \|\hat\Sigma - \Sigma\|_\infty = O_P(c_N) \qquad (2.6)$$
for some sequences $\eta_N$ and $c_N$. We use $\hat\Sigma$ as the input matrix, which can be the sample covariance matrix or its robustified versions (Fan et al., 2019c). The goal is to estimate $L$ in (2.1) and its $N\times r$ matrix of left singular vectors, denoted by $\xi$ (also let $\zeta$ denote its right singular vectors). We use, respectively, $\hat L := H_R(\hat\Sigma)$ with $R = r$, which is the rank-$r$ projection of $\hat\Sigma$, and the $N\times r$ matrix $\hat\xi$ whose columns are the left singular vectors of $\hat\Sigma$. The following theorem, adapted from Fan et al. (2018), provides deviation bounds for the estimators. To make the paper self-contained, we also provide a simpler proof with slightly different conditions.
Theorem 2.1. Consider the general model (2.1) with bounded $r := \mathrm{rank}(L)$. Suppose that
$$\min_{2\le i\le r+1}|\lambda_{i-1}(L) - \lambda_i(L)| \asymp \max_{2\le i\le r+1}|\lambda_{i-1}(L) - \lambda_i(L)| := g_N$$
and $\eta_N + \|S\| = o_P(g_N)$. Then, under condition (2.6), we have:

(i) $\|\hat L - L\| = O_P(\eta_N + \|S\|)$ and $\|\hat\xi - \xi\| = O_P\left(\frac{\eta_N + \|S\|}{g_N}\right)$.

(ii) If, additionally, $\|S\|_\infty + \|L\|_\infty = O_P(1)$ and $N c_N = o_P(g_N)$, then
$$\|\hat\xi - \xi\|_\infty \le O_P\Big(\frac{N}{\sqrt N} + \sqrt N\Big)(\eta_N + \|S\|)^2 g_N^{-2} + O_P\Big(\frac{c_N N}{\sqrt N} + c_N\sqrt N + \|S\zeta\|_\infty \vee \|S'\xi\|_\infty\Big) g_N^{-1}.$$

Proof.
See the online supplement.

This theorem is relatively general, and is applicable to low-rank models that are not necessarily consequences of factor models. The proof relies on perturbation bounds for singular vectors/values, and the achieved rates are sharp. Result (i) is simple and gives asymptotic bounds under the operator norm. Result (ii) gives an element-wise deviation bound for the singular vectors, which requires more delicate technical arguments.
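To make the preceding construction concrete, the following minimal NumPy sketch (our own illustrative code, not from the surveyed papers) implements the hard-thresholding operator $H_R$ in (2.5) on a simulated spiked low-rank matrix and recovers $\hat L$ and $\hat\xi$.

```python
# A minimal sketch of H_R in (2.5) on a simulated spiked model Sigma = L + S.
import numpy as np

def hard_threshold_svd(A, R):
    """H_R(A): the best rank-R approximation, keeping the top R singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s[R:] = 0.0                      # zero out all but the top R singular values
    return (U * s) @ Vt

rng = np.random.default_rng(0)
N, r = 200, 2
B = rng.normal(size=(N, r))          # pervasive "loadings": incoherent directions
L = 5.0 * B @ B.T                    # spiked low-rank component (rank r)
S = rng.normal(scale=0.5, size=(N, N))
S = (S + S.T) / 2                    # symmetric "noise" with bounded operator norm
Sigma_hat = L + S                    # plays the role of the input matrix

L_hat = hard_threshold_svd(Sigma_hat, R=r)        # estimate of L
xi_hat = np.linalg.svd(Sigma_hat)[0][:, :r]       # top-r left singular vectors
print(np.linalg.norm(L_hat - L, 2) / np.linalg.norm(L, 2))  # small relative error
```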
3 Estimation under Factor Models
We observe an $N\times T$ data matrix $Y$, which can be decomposed as
$$Y = M + U = BF' + U,$$
where $B$ is the $N\times r$ factor loading matrix, $F$ is the $T\times r$ factor matrix, and $U$ is the $N\times T$ matrix of idiosyncratic errors, which are uncorrelated with $M := BF'$. All three parts $B$, $F$ and $U$ are unobserved. The $t$th column of this expression can be written as
$$y_t = Bf_t + u_t. \qquad (3.1)$$
Under the model's specification, we have the covariance structure (2.4).

3.1 PCA and maximum likelihood estimations

One of the most widely used estimation methods for the factor model is principal components analysis (PCA). Define the sample covariance $S_y = \frac{1}{T}\sum_{t=1}^T y_ty_t' = \frac{1}{T}YY'$. Let $\hat\xi_j$ be the eigenvector corresponding to the $j$th largest eigenvalue of $S_y$. PCA estimates $B$ by taking $\hat B = \sqrt N(\hat\xi_1,\cdots,\hat\xi_R)$, which estimates $B$ up to a diagonal transformation. Given $\hat B$, the factors can be estimated via least squares:
$$\hat F = Y'\hat B(\hat B'\hat B)^{-1} = \frac{1}{N}Y'\hat B.$$
This also leads to the estimated low-rank component $\frac{1}{T}\hat B\hat F'\hat F\hat B'$ for $B\,\mathrm{cov}(f_t)\,B'$. PCA is equivalent to singular value hard thresholding with the input matrix $\hat\Sigma = S_y$: indeed, $\frac{1}{T}\hat B\hat F'\hat F\hat B' = H_R(S_y)$. One can then apply Theorem 2.1 to infer the rates of convergence of the PCA estimators, which were obtained by Stock and Watson (2002a). Bai (2003) proved the asymptotic normality of the PCA estimators of the factors and loadings. Results with a general input $\hat\Sigma$ can be found in Chapter 10 of Fan et al. (2020c).

Another popular method for estimating a factor model is the maximum likelihood (ML) method (see, e.g., Lawley and Maxwell (1971), Bai and Li (2012), Doz et al. (2012)). Under the independence and normality assumptions, the log-likelihood function based on $y_t$ is, for some constant $C$,
$$\log L_{ML}(B, \mathrm{cov}(f_t), \mathrm{diag}(\Sigma_u)) = C - \frac{T}{2}\log|\Sigma_y| - \frac{1}{2}\sum_{t=1}^T y_t'\Sigma_y^{-1}y_t.$$
The log-likelihood function is then maximized with respect to the matrix parameters $(B, \mathrm{cov}(f_t), \mathrm{diag}(\Sigma_u))$ under the additional restriction that $\Sigma_u$ is diagonal (Bai and Li, 2012, 2016) or sparse with regularizations (Bai and Liao, 2016; Wang et al., 2019b). Recently, Barigozzi and Luciani (2019) explicitly accounted for autocorrelations of the factors in the likelihood function.

The factors can be estimated by two methods, one of which is the projection method. Under the joint normality assumption on $f_t$ and $u_t$, we have
$$E(f_t|y_t) = B'(BB' + \Sigma_u)^{-1}y_t = (I_r + B'\Sigma_u^{-1}B)^{-1}B'\Sigma_u^{-1}y_t.$$
This provides the basis for estimating the factors. The other approach is generalized least squares (GLS): for given $B$ and $\Sigma_u^{-1}$, the GLS estimator of $f_t$ is
$$\hat f_t = (B'\Sigma_u^{-1}B)^{-1}B'\Sigma_u^{-1}y_t.$$
Replacing the unknown parameters with their ML estimators, one obtains two estimators of the latent factors. Under the large-$N$ setup, the difference between the two methods (PCA and MLE) for estimating factors is asymptotically negligible.

3.2 Low-rank regularization

Alternatively to PCA, one can estimate $M$ directly by taking advantage of its low-rank structure, based on the nuclear-norm regularization, the $\ell_1$-norm of singular values, which encourages sparseness of the singular values and hence low-rankness. For an $n\times m$ matrix $A$, let $\|A\|_n := \sum_{i=1}^{\min\{m,n\}}\psi_i(A)$ be its nuclear norm, where $\psi_i(A)$ is the $i$th largest singular value of $A$.

3.2.1 Singular value thresholding

Given the low-rank structure of $M$ (sparsity of the singular values of $M$), we can estimate the model by solving the following penalized optimization:
$$\hat M = \arg\min_M \frac{1}{2}\|Y - M\|_F^2 + \nu\|M\|_n \qquad (3.2)$$
for some tuning parameter $\nu > 0$. The solution is $\hat M = S_\nu(Y)$, where $S_\nu(\cdot)$ is the singular value thresholding operator (Ma et al., 2011), defined as follows. Let $Y = U_yDV_y'$ be its SVD. Then $S_\nu(Y) := U_yD_\nu V_y'$, where $D_\nu = \mathrm{diag}(\{D_{ii} - \nu\}_+)$ with $D_{ii}$ being the diagonal entries of $D$. So $S_\nu(Y)$ applies "soft-thresholding" to the singular values of $Y$. One can additionally estimate the factors and loadings using the singular vectors.

We note that this method is closely related to the PC-estimator, except that soft-thresholding is replaced by hard-thresholding. Let $R$ denote the "working number of factors", which is the number of principal components one takes when applying the PC-method. We note that the PC-estimator of $M$ with $R$ factors is given by (see Section 2.2)
$$\hat M^{PC} = H_R(Y), \quad H_R(Y) := U_y\bar D_RV_y'.$$
This estimator is the solution to the penalized least squares problem (3.2), except that the nuclear norm is replaced by $\sum_{i=1}^{\min\{N,T\}}p_\nu(\psi_i(M))$, where $p_\nu(\theta) = \nu^2 - (\nu - |\theta|)_+^2$ is the hard-thresholding penalty and $\psi_i(M)$ is the $i$th singular value of $M$.

Therefore, the difference between (3.2) and PCA is more fundamentally that between hard- and soft-thresholding. Despite many good properties, the soft-thresholding estimator possesses shrinkage bias, while hard-thresholding reduces the bias. As a matter of fact, the shrinkage bias is on the singular values, rather than on the singular vectors. Indeed, the singular vectors of the two estimators are the same, and equal the top $R$ singular vectors of $Y$. An important implication is that the factor estimator built on $\hat M$ is numerically equivalent to the PC-estimators of the factors, and hence does not suffer from any shrinkage bias. A formal statement and proof of the unbiasedness of eigenvectors can be found in Fan et al. (2019b).
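This equivalence of the singular vectors can be checked numerically; below is a hedged sketch (function names and the simulated design are our own) contrasting $S_\nu(Y)$ and $H_R(Y)$.

```python
# Soft vs. hard singular value thresholding share the same singular vectors.
import numpy as np

def soft_threshold_svd(Y, nu):
    """S_nu(Y): soft-threshold the singular values of Y at level nu."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - nu, 0.0)) @ Vt

def hard_threshold_svd(Y, R):
    """H_R(Y): keep only the top R singular values of Y."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s[R:] = 0.0
    return (U * s) @ Vt

rng = np.random.default_rng(0)
N, T, r = 100, 120, 2
Y = rng.normal(size=(N, r)) @ rng.normal(size=(T, r)).T \
    + 0.5 * rng.normal(size=(N, T))              # Y = B F' + U

M_soft = soft_threshold_svd(Y, nu=10.0)          # biased (shrunken) singular values
M_hard = hard_threshold_svd(Y, R=r)              # PC-estimator with R = r
u1_soft = np.linalg.svd(M_soft)[0][:, 0]
u1_hard = np.linalg.svd(M_hard)[0][:, 0]
# Leading singular vectors coincide up to sign:
print(min(np.linalg.norm(u1_soft - u1_hard), np.linalg.norm(u1_soft + u1_hard)))
```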
3.2.2 Low-rank plus sparse decomposition

Recall that $\Sigma_y$ and $\Sigma_u$ denote the $N\times N$ covariance matrices of $y_t$ and $u_t$ in model (3.1), and that we have the following decomposition:
$$\Sigma_y = L + \Sigma_u, \quad L := B\,\mathrm{cov}(f_t)\,B'. \qquad (3.3)$$
We now demonstrate that this decomposition also provides a nice structure for estimating the covariance components. A key assumption is conditional sparsity, namely, $\Sigma_u$ is sparse. While the definition of sparsity may differ in different contexts, here we mean that
$$J := \sum_{i\ne j} 1\{E u_{it}u_{jt} \ne 0\}$$
should not grow too fast as $N\to\infty$. This requirement can be weakened to approximate sparsity. In addition, $L$ is a low-rank matrix. Thus we can directly estimate the above covariance decomposition by solving the following penalized optimization:
$$(\hat L, \hat\Sigma_u) := \arg\min_{L,\Sigma_u} \frac{1}{2}\|S_y - (L + \Sigma_u)\|_F^2 + \nu_1\|L\|_n + \nu_2\|\Sigma_u\|_1, \qquad (3.4)$$
where $\nu_1$ and $\nu_2$ are tuning parameters. Note that here we use the notation $\|A\|_1 = \sum_{i,j}|A_{ij}|$ for the matrix 1-norm, distinguished from the usual matrix $\ell_1$-norm $\|A\|_{\ell_1} := \max_{i\le N}\sum_{j=1}^N |A_{ij}|$. The above optimization has been employed by many authors to study the low-rank plus sparse decomposition, while some authors exclude the diagonal elements of $\Sigma_u$ from the penalization, and additionally impose positive-definiteness and other constraints on $L$ and $\Sigma_u$ (Klopp et al., 2017; Agarwal et al., 2012). Finally, given $\hat L$, we can estimate the factors and loadings by extracting its eigenvectors.

The above optimization can be solved by alternating the estimation of $L$ and $\Sigma_u$, and closed-form solutions are available in both iterations. Given $\Sigma_u$, solving for $L$ leads to singular value soft-thresholding: $\hat L = S_{\nu_1}(S_y - \Sigma_u)$; given $L$, solving for $\Sigma_u$ leads to element-wise soft-thresholding: $\hat\Sigma_u = \tilde S_{\nu_2}(S_y - L)$. As both iterations solve convex problems, standard convergence analysis can be applied to show that the iterative algorithm converges in polynomial time.
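A minimal sketch of this alternating scheme is given below (our own implementation outline, not the authors' code); `svt` and `ewt` denote the two closed-form updates.

```python
# Alternating closed-form updates for (3.4).
import numpy as np

def svt(A, nu):
    """Singular value soft-thresholding S_nu(A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - nu, 0.0)) @ Vt

def ewt(A, nu):
    """Element-wise soft-thresholding applied entry by entry."""
    return np.sign(A) * np.maximum(np.abs(A) - nu, 0.0)

def low_rank_plus_sparse(S_y, nu1, nu2, n_iter=50):
    """Alternate the two updates starting from Sigma_u = 0."""
    Sigma_u = np.zeros_like(S_y)
    for _ in range(n_iter):
        L = svt(S_y - Sigma_u, nu1)      # given Sigma_u, update L
        Sigma_u = ewt(S_y - L, nu2)      # given L, update Sigma_u
    return L, Sigma_u
```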
Agarwal et al. (2012) and Klopp et al. (2017) studied the statistical convergence properties of (3.4). Let the columns of $U_{L,0}$ be the singular vectors of the true $L$ corresponding to the zero singular values. Define the projections $P(A) := U_{L,0}U_{L,0}'AU_{L,0}U_{L,0}'$ and $M(A) := A - P(A)$. In addition, let $(A)_J$ and $(A)_{J^c}$ be the submatrices of $A$ whose elements respectively correspond to $Eu_{it}u_{jt}\ne 0$ and $Eu_{it}u_{jt} = 0$. Additionally, define
$$C(\nu_1,\nu_2) := \{(A_1,A_2): \nu_1\|P(A_1)\|_n + \nu_2\|(A_2)_{J^c}\|_1 \le 3\nu_1\|M(A_1)\|_n + 3\nu_2\|(A_2)_J\|_1\}.$$
A key quantity is the restricted strong convexity (RSC) constant, defined as follows:
$$\kappa(\nu_1,\nu_2) := \sup\{c > 0: \|A_1 + A_2\|_F^2 \ge c\|A_1\|_F^2 + c\|A_2\|_F^2 \text{ for all } (A_1,A_2)\in C(\nu_1,\nu_2)\}.$$
We then have the following theorem, adapted from Agarwal et al. (2012). To make the paper self-contained, we also provide a proof with slightly different conditions. See the online supplement.
Theorem 3.1.
Conditioning on the events $\|S_y - \Sigma_y\| \le \nu_1$ and $\|S_y - \Sigma_y\|_\infty \le \nu_2$, there is $C > 0$ that depends only on $\mathrm{rank}(L)$, so that
$$\frac{1}{N^2}\|\hat L - L\|_F^2 + \frac{1}{N^2}\|\hat\Sigma_u - \Sigma_u\|_F^2 \le \frac{C}{\kappa(\nu_1,\nu_2)^2}\cdot\frac{\nu_1^2 + (J+N)\nu_2^2}{N^2}.$$

Proof.
See the online supplement.

The optimal tuning parameters can be set to satisfy $\nu_1 \asymp \frac{N}{\sqrt T}$ and $\nu_2 \asymp \sqrt{\frac{\log N}{T}}$, respectively accounting for the estimation errors under the two matrix norms: $\|S_y - \Sigma_y\| \le \nu_1$ and $\|S_y - \Sigma_y\|_\infty \le \nu_2$; both can be shown to hold with high probability under weak serial dependence and sub-Gaussian conditions. In addition, if $\kappa(\nu_1,\nu_2)$ is bounded away from zero, with this choice of tunings the convergence rate in Theorem 3.1 is $O_P\big((1 + \frac{J\log N}{N^2})\frac{1}{T}\big)$, which is sufficient to guarantee the convergence of the estimated factors and loadings. We refer to Lemma 2 of Agarwal et al. (2012) for a more refined lower bound on $\kappa(\nu_1,\nu_2)$.

The above problem is also called "robust PCA" (Candès et al., 2011). For recent advances and references, see Chen et al. (2020c), where factorization methods are also discussed.

Fan et al. (2013) proposed a nonparametric estimator of $\Sigma_y$, named POET (Principal Orthogonal complEment Thresholding), when the factors are unobservable. It is basically a one-step solution to the optimization (3.4) with initialization $\Sigma_u = 0$. To motivate the estimator, suppose $r = R$. Then, heuristically,
$$L \approx H_R(\Sigma_y), \quad \Sigma_u \approx \Sigma_y - H_R(\Sigma_y).$$
Thus, one estimates $L$ by $H_R(S_y)$ and sets $S_u := S_y - H_R(S_y)$. To account for the sparsity assumption on $\Sigma_u$, Fan et al. (2013) estimate $\Sigma_y$ and $\Sigma_u$ as
$$\hat\Sigma_y = H_R(S_y) + \hat\Sigma_u, \quad \hat\Sigma_u = (h(S_{u,ij}, \lambda_{ij}))_{N\times N}, \qquad (3.5)$$
where $h(x, \lambda_{ij})$ denotes the element-wise thresholding operator with thresholding value $\lambda_{ij}$. Here, we emphasize element-dependent thresholding $\lambda_{ij}$ to adapt to the varying scales of the covariance. For correlation thresholding at level $\lambda$, we take $\lambda_{ij} = \lambda\sqrt{s_{u,ii}s_{u,jj}}$ with $s_{u,ii}$ a diagonal element of $S_u$ (Fan et al., 2013); we can also take other forms, such as the adaptive thresholding in Cai and Liu (2011). In general, the thresholding function should satisfy:

(i) $h(x,\lambda) = 0$ if $|x| < \lambda$;

(ii) $|h(x,\lambda) - x| \le \lambda$;

(iii) there are constants $a, b > 0$ such that $|h(x,\lambda) - x| \le a\lambda^2$ if $|x| > b\lambda$.

Note that condition (iii) requires that the thresholding bias be of higher order. It is not necessary for consistent estimation, but we recommend using nearly unbiased thresholding (Antoniadis and Fan, 2001) for inference applications. One such example is known as SCAD. As noted in Fan et al. (2015), unbiased thresholding is required to avoid size distortions in a large class of high-dimensional testing problems involving a "plug-in" estimator of $\Sigma_u$. In particular, this rules out the popular soft-thresholding function, which does not satisfy (iii) due to its first-order shrinkage bias.
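The following sketch assembles the POET estimator (3.5) with correlation thresholding; we use hard thresholding for $h(\cdot)$, one admissible choice satisfying (i)-(iii). All names are illustrative.

```python
# A sketch of POET: rank-R part of S_y plus a thresholded residual covariance.
import numpy as np

def poet(Y, R, corr_level=0.5):
    """Y: N x T data matrix with rows demeaned. Returns (Sigma_y, Sigma_u) estimates."""
    N, T = Y.shape
    S_y = Y @ Y.T / T
    vals, vecs = np.linalg.eigh(S_y)                 # eigenvalues in ascending order
    top_vecs = vecs[:, -R:]
    L_hat = (top_vecs * vals[-R:]) @ top_vecs.T      # H_R(S_y): rank-R component
    S_u = S_y - L_hat                                # principal orthogonal complement
    d = np.sqrt(np.diag(S_u))
    lam = corr_level * np.outer(d, d)                # lambda_ij = lam * sqrt(s_ii s_jj)
    Sigma_u = np.where(np.abs(S_u) >= lam, S_u, 0.0) # hard thresholding h(x, lam)
    np.fill_diagonal(Sigma_u, np.diag(S_u))          # diagonal left unthresholded
    return L_hat + Sigma_u, Sigma_u
```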
In empirical asset pricing, factor loadings are known to depend on individual-specific observables $X_i$, which represent a set of time-invariant characteristics such as individual stocks' size, momentum, and value. To incorporate the information carried by the observed characteristics, Connor and Linton (2007) and Connor et al. (2012) model the loading matrix explicitly as a function of the covariates $X_i$. Fan et al. (2016) extended the model to allow components in factor loadings that are not explainable by characteristics:
$$b_i = g(X_i) + \gamma_i, \quad E(\gamma_i|X_i) = 0. \qquad (3.6)$$
Here $g(\cdot)$ is a vector of nonparametric functions. With this model, they introduced an improved factor estimator, known as projected PCA.

The basic idea of projected PCA is to smooth the observations $\{y_{it}\}_{i=1}^N$ for each given $t$ against their associated covariates $\{X_i\}_{i=1}^N$ (cross-sectional smoothing), and apply PCA to the smoothed data (fitted values). Let $\{\phi_j(x)\}_{j=1}^J$ be a set of basis functions, which can be either unstructured, such as kernel machines, or structured, such as a basis for additive models (Fan et al., 2020c). Set $\phi(X_i)' = (\phi_1(X_i),\cdots,\phi_J(X_i))$ and $\Phi(X) = (\phi(X_1),\cdots,\phi(X_N))'$, an $N\times J$ matrix. Then the projection matrix onto the characteristics can be taken as $P = \Phi(X)(\Phi(X)'\Phi(X))^{-1}\Phi(X)'$. The projected data $PY$ are the fitted values of regressing $Y$ onto the basis functions. We make the following key assumptions:
Assumption 3.1. (i) Relevance: with probability approaching one, all the eigenvalues of $\frac{1}{N}(PB)'PB$ are bounded away from both zero and infinity as $N\to\infty$. (ii) Orthogonality: $E(u_{it}|X_i) = 0$ for all $i\le N$, $t\le T$.

The above conditions require that the strengths of the loading matrix remain strong after the projection. Condition (ii) implies that if we apply $P$ to both sides of $Y = BF' + U$, then
$$PY \approx PBF' = GF',$$
where $G = PB$ is an $N\times r$ matrix, which $\approx (g(X_i))_{N\times r}$ under the additional assumption $E(\gamma_i|X_i) = 0$ for all $i\le N$. In other words, the noise $U$ is suppressed, while the signals remain. Hence, the scaled sample covariance satisfies $(PY)'PY = Y'PY \approx FG'GF'$. For identification purposes, let us assume $\Xi := G'G$ is a diagonal matrix and $F'F/T = I$. Then from
$$\frac{1}{T}Y'PYF \approx F\Xi,$$
we infer that the columns of $F$ are approximately the eigenvectors of $Y'PY$, scaled by a factor $\sqrt T$. This motivates estimating the factors by the top $R$ eigenvectors of $Y'PY$. Fan et al. (2016) derived the rates of convergence of the projected PCA method. A nice feature is that consistency of the latent factors is achieved even when the sample size $T$ is finite, so long as $N$ goes to infinity. Intuitively, the idiosyncratic noise is removed by the cross-sectional projections, which does not require a long time series.
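A compact sketch of projected PCA is given below, assuming a scalar characteristic $X_i$ and a polynomial sieve basis; both choices, and all names, are our own illustrative assumptions.

```python
# Projected PCA: smooth Y on the characteristics, then eigen-decompose Y' P Y.
import numpy as np

def projected_pca(Y, X, R, J=5):
    """Y: N x T data; X: length-N characteristics; R: working number of factors."""
    N, T = Y.shape
    Phi = np.column_stack([X ** j for j in range(J)])    # N x J sieve basis Phi(X)
    P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)        # projection onto the basis
    vals, vecs = np.linalg.eigh(Y.T @ P @ Y)             # eigenvectors of Y' P Y
    F_hat = np.sqrt(T) * vecs[:, -R:]                    # factors: top eigenvectors * sqrt(T)
    G_hat = (P @ Y) @ F_hat / T                          # estimated loadings g(X_i)
    return F_hat, G_hat
```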
Similarly, in many applications, while we do not know the latent factors $f_t$, we do know that the factors are related to some proxy variables $W_t$. For example, the latent factors are unknown for equity markets, but they are related to the Fama-French factors (Fama and French, 2015); latent factors for disaggregated macroeconomic time series are unknown, but they are related to aggregated ones (McCracken and Ng, 2016). Switching the roles of rows and columns, longitudinal regression of each series $\{y_{it}\}_{t=1}^T$ on $\{W_t\}_{t=1}^T$ yields the projected data matrix, from which latent factors and loadings can be extracted similarly. See Fan et al. (2020a) for details on how latent factor learning is augmented by instruments $\{W_t\}_{t=1}^T$.

Diversified projections. In this section, we continue to denote by $R$ the number of factors we use, and by $r$ the true number of factors. Fan and Liao (2020) proposed a simpler factor estimator that does not rely on eigenvectors, using cross-sectional diversified projections (DP). Let $W = (w_1,\cdots,w_R)$ be a given exogenous (or deterministic) $N\times R$ matrix, where each of its $R$ columns $w_k$ is an $N\times 1$ vector of diversified weights. The DP estimates $f_t$ by simply taking
$$\hat f_t = \frac{1}{N}W'y_t.$$
By substituting $y_t = Bf_t + u_t$ into the definition, we immediately have
$$\hat f_t = Hf_t + \frac{1}{N}W'u_t, \quad H = \frac{1}{N}W'B. \qquad (3.7)$$
Thus $\hat f_t$ (consistently) estimates $f_t$ up to an $R\times r$ affine transform $H$, with estimation error $e_t := \frac{1}{N}W'u_t$. The assumption that $W$ should be diversified ensures that, as $N\to\infty$, $e_t$ is "diversified away" (converging to zero in probability). More specifically, we impose the following assumption.
Assumption 3.2. There is a constant $c > 0$ so that, as $N\to\infty$:

(i) The $R\times R$ matrix $\frac{1}{N}W'W$ satisfies $\lambda_{\min}(\frac{1}{N}W'W) > c$.

(ii) $W$ is independent of $\{u_t: t\le T\}$.

(iii) $R\ge r$, $\mathrm{rank}(H) = r$ and $\psi_{\min}(H) \gg N^{-1/2}$, where $\psi_{\min}(H)$ denotes the minimum nonzero singular value of $H = \frac{1}{N}W'B$.

Conditions (i) and (ii) are requirements on the diversified weights $W$. When $(u_{1t},\cdots,u_{Nt})$ are cross-sectionally weakly dependent, they ensure that $e_t$ is diversified away. Condition (iii) of Assumption 3.2 is a key condition, which requires that $W$ should not diversify away the factor components in the time series. Several choices of $W$ can be recommended to satisfy this condition. For instance, if the factor loadings satisfy (3.6), then fixing $R$ components of sieve basis functions $(\phi_1(\cdot),\cdots,\phi_R(\cdot))$, we can define $W := (w_{i,k})_{N\times R}$, where
$$w_{i,k} = \phi_k(X_i).$$
Alternatively, we can also use transformations of the initial observations at $t = 0$, which was considered by Juodis and Sarafidis (2020). If $y_0$ is independent of $\{u_t: t\ge 1\}$, we can apply $w_{i,k} = \phi_k(y_{i,0})$. These weights are correlated with $B$ through $y_0 = Bf_0 + u_0$.

An important benefit of the DP is that it is robust to over-estimating the number of factors. Theoretical studies of factor models have crucially depended on the assumption that the number of factors, $r$, is consistently estimated. This usually requires strong conditions on the strength of the factors and on the serial dependence. Recently, Barigozzi and Cho (2018) proposed a PCA-based method to estimate factors that is robust to over-estimated $r$; they provided rates of convergence of the estimated common components when $R\ge r$. Fan and Liao (2020) applied DP to several inference problems in factor-augmented models, including post-selection inference, high-dimensional covariance estimation, and factor specification tests. They formally justified the robustness to over-estimating the number of factors in these applications. In particular, DP admits $r = 0$ but $R\ge 1$.
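The DP estimator is essentially one line of linear algebra; the sketch below builds the weights from a sieve basis of an observed characteristic, one of the admissible choices above (names are illustrative).

```python
# Diversified projection: f_hat_t = W' y_t / N with sieve-basis weights.
import numpy as np

def diversified_projection(Y, X, R):
    """Y: N x T data; X: length-N observables; R: working number of factors."""
    N = Y.shape[0]
    W = np.column_stack([X ** k for k in range(1, R + 1)])   # w_{i,k} = phi_k(X_i)
    return W.T @ Y / N                                       # R x T matrix, columns f_hat_t
```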
Robust estimation with heavy-tailed data. To apply either the PCA or the MLE to estimate the model, we need an initial covariance estimator $S_y$, whose application requires the elements of $y_t$ to have sufficient moments. Some technical results for factor estimation even require sub-Gaussian conditions on the data's tail distributions. However, heavy-tailed data are not uncommon in economic applications. For instance, about thirty percent of the 131 disaggregated macroeconomic variables of Ludvigson and Ng (2016) have excess kurtosis greater than six, so their distributions are fatter-tailed than the t-distribution with five degrees of freedom. Indeed, heavy tails are a stylized feature of high-dimensional data, as it is unlikely that all variables have sub-Gaussian tails.

Because the presence of heavy-tailed data invalidates many conditions required for estimating factor models, the recent literature has proposed several methods that are robust to the tail distributions. Here we describe two of them: truncation and robust M-estimation. In the high-dimensional setting, consider estimating multivariate means from an independent triangular array of variables $y_{i1},\ldots,y_{iT}$ with $\max_{i\le N}\mathrm{Var}(y_{it})\le\sigma^2$. Truncate the data:
$$\tilde y_{it} := \mathrm{sgn}(y_{it})\min\{|y_{it}|, \tau_i\}$$
with predetermined $\tau_i > 0$. We then estimate $Ey_{it}$ using the truncated mean $\tilde y_i := \frac{1}{T}\sum_{t=1}^T \tilde y_{it}$. Theorem 3.2 shows that the high-dimensional means can be estimated uniformly well if $E|y_{it}|^q < M$ for some $q\ge 2$. Alternatively, the robust M-estimator is
$$\hat y_i = \arg\min_\mu \sum_{t=1}^T \psi_{\tau_i}(y_{it} - \mu),$$
where $\tau_i$ is a growing sequence, and $\psi_\tau(z)$ is the Huber-type loss
$$\psi_\tau(z) = \begin{cases} z^2\tau^{-1}, & |z| < \tau,\\ 2|z| - \tau, & |z|\ge\tau.\end{cases}$$
The following theorem shows that $\hat y_i$ also estimates $Ey_{it}$ well, provided that $\max_i Ey_{it}^2$ is bounded.

Theorem 3.2.
Suppose $y_{it}$ is i.i.d. across $t$, and $\max_{i\le N}Ey_{it}^2 < \sigma^2$.

(i) The truncation approach: Suppose $\max_{i\le N}E|y_{it}|^q < M$ for some $q\ge 2$. In addition, suppose $\log N \le CT$ for some $C > 0$, and the truncation parameter is set to satisfy $\tau_i \asymp \left(\frac{T}{\log N}\right)^{1/(1+q/2)}(\sigma^2\max_{i\le N}E|y_{it}|^q)^{1/(2+q)}$. Then there is $c > 0$, which does not depend on any moments of $y_{it}$ or on $(N,T)$, such that with probability at least $1 - N^{-1}$,
$$\max_{i\le N}|\tilde y_i - Ey_{it}| \le (cM^{1/(2+q)} + 3)\,\sigma\sqrt{\frac{\log N}{T}}.$$

(ii) The robust M-estimation approach: Suppose $\log N = o(T)$, and the truncation parameter is set to satisfy $\tau_i \asymp \sqrt{\frac{T}{\log N}}$. Then there is $C > 0$, which does not depend on any moments of $y_{it}$ or on $(N,T)$, such that with probability at least $1 - N^{-1}$,
$$\max_{i\le N}|\hat y_i - Ey_{it}| \le C(\sigma^2 + 1)\sqrt{\frac{\log N}{T}}.$$
Proof. See the appendix.
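Below is a hedged sketch of the two robust mean estimators in Theorem 3.2; the Huber minimization is solved by iteratively reweighted least squares, our own implementation choice, and all names are illustrative.

```python
# Truncated mean and Huber-type M-estimation of a mean under heavy tails.
import numpy as np

def truncated_mean(y, tau):
    """Average of sgn(y) * min(|y|, tau): the truncation estimator."""
    return np.mean(np.sign(y) * np.minimum(np.abs(y), tau))

def huber_mean(y, tau, n_iter=100):
    """Minimize sum_t psi_tau(y_t - mu) over mu by reweighted least squares."""
    mu = np.median(y)                                # robust starting value
    for _ in range(n_iter):
        z = y - mu
        w = np.where(np.abs(z) < tau, 1.0, tau / np.maximum(np.abs(z), 1e-12))
        mu = np.sum(w * y) / np.sum(w)               # weighted-mean update
    return mu

rng = np.random.default_rng(0)
T, N = 500, 500
y = rng.standard_t(df=3, size=T) + 1.0               # heavy tails, true mean 1
tau = np.sqrt(T / np.log(N))                         # rate from Theorem 3.2(ii)
print(np.mean(y), truncated_mean(y, tau), huber_mean(y, tau))
```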
The robust mean estimation also applies to estimating the covariance, as its $(i,j)$ element is of the form $Ey_{it}y_{jt}$. When the high-dimensional data have heavy-tailed components, we can replace the sample covariance by its robust version $\hat S_y$ before estimating the factors. By the Gaussian concentration inequality, the robustly estimated covariance $\hat S_y$ satisfies
$$\|\hat S_y - \Sigma_y\|_\infty = O_P\left(\sqrt{\frac{\log N}{T}}\right),$$
so long as $Ey_{it}^2y_{jt}^2$ is uniformly bounded (and serial independence is assumed). Based on the above robust covariance inputs, we can create factor estimators and derive their theoretical properties following the guidance of Section 2.2. See Chapter 10 of Fan et al. (2020c) for further generalizations.

Estimation using autocovariances. When the factors are highly persistent but $Eu_tu_{t-h}' = 0$, the cross-covariance
$$\Sigma_h = Ey_ty_{t-h}' = B(Ef_tf_{t-h}')B', \quad h\ge 1,$$
depends only on the loading matrix $B$ and the factor autocovariances. This motivates estimating the loadings by applying PCA to the aggregated $\{\Sigma_h: h = 1,\cdots\}$, as studied by Lam and Yao (2012). A related idea has been extended to matrix-variate PCA (Wang et al., 2019a; Chen et al., 2020a). Fan and Zhong (2018) also provided a procedure to efficiently aggregate the cross-covariance information with the covariance information at $h = 0$.

Comparisons. Many references have documented comparisons among the various estimation methods. Westerlund and Urbain (2013) made a comparison between PCA and cross-sectional averages in the panel data setting. Meanwhile, PCA and low-rank penalized regressions are practically very similar, so we do not distinguish their use in practice. In general, because of its simplicity of implementation and relatively weak required conditions, PCA still seems to be the most widely used method in applied research. Meanwhile, robust covariance inputs can also be integrated with the surveyed low-rank recovery methods. In addition, when either factors or loadings can be partially explained by observed characteristics, the projected PCA is recommended. This is particularly useful in asset pricing applications, where the explanatory power of asset characteristics has been well documented in the literature.
4 Applications in Econometric Learning

4.1 Factor-augmented forecasts

Forecasting in a data-rich environment has been an important research topic in economics and finance. Typical examples include forecasts of aggregate output or the inflation rate using a large number of categorized macroeconomic variables. Stock and Watson (2002a) and Bai and Ng (2006) considered the factor-augmented regression model for the $h$-step-ahead forecast:
$$y_{t+h} = \alpha'f_t + \beta'w_t + \varepsilon_{t+h}, \qquad (4.1)$$
$$x_t = Bf_t + u_t. \qquad (4.2)$$
Here $w_t$ in (4.1) contains the observed predictors, which may include lagged dependent variables. Equation (4.2) is a high-dimensional factor model that includes a vector of latent factors $f_t$. The forecast can be implemented by regressing $y_{t+h}$ onto $w_t$ and the estimated factors. The factor model (4.2) serves as an important dimension reduction tool.

Sufficient forecasting. Fan et al. (2017b) generalized (4.1) to a nonlinear model with multiple indices. Consider the following forecasting model:
$$y_{t+1} = h(\phi_1'f_t, \ldots, \phi_R'f_t, \varepsilon_{t+1}), \qquad (4.3)$$
where $h(\cdot)$ is an unknown link function, and $\varepsilon_{t+1}$ is an error independent of $f_t$ and $u_t$. The vectors $\phi_1,\ldots,\phi_R$ are $r$-dimensional linearly independent prediction indices. In contrast with linear forecasting, the above model specifies that the predicting function is nonlinear and depends on multiple indices of the extracted factors. If we specify $R < r$, further dimension reduction is achieved.

A prominent result related to model (4.3) is given by Li (1991), which shows that under some regularity conditions, such as $f_t$ being elliptically symmetric, we have
$$E(f_t|y_{t+1}) = \Phi a(y_{t+1}) \qquad (4.4)$$
for an $R$-dimensional vector $a(y_{t+1})$, where $\Phi = [\phi_1, \phi_2, \ldots, \phi_R]$ is an $r\times R$ matrix. In other words, the "inverse regression vector" $E(f_t|y_{t+1})$ falls in the column space spanned by $\Phi$, which can be extracted by PCA. Indeed, since $E(E(f_t|y_{t+1})) = E(f_t) = 0$,
$$\mathrm{cov}(E(f_t|y_{t+1})) = \Phi E[a(y_{t+1})a(y_{t+1})']\Phi'.$$
The above matrix has $R$ nonvanishing eigenvalues if $E[a(y_{t+1})a(y_{t+1})']$ is non-degenerate. Their corresponding eigenvectors have the same linear span as $\phi_1,\ldots,\phi_R$. If one can consistently estimate $\mathrm{cov}(E(f_t|y_{t+1}))$, then the subspace spanned by $\phi_1,\ldots,\phi_R$, which is of our primary interest, can be obtained by extracting the top $R$ eigenvectors of the estimated covariance matrix corresponding to its $R$ largest eigenvalues.

However, it is not an easy task to directly estimate the covariance of $E(f_t|y_{t+1})$. Li (1991) suggested the sliced covariance estimate, a widely used technique for dimension reduction. The sliced covariance matrix also satisfies the fundamental property (4.4); namely, $E(f_t|y_{t+1}\in I_k)$ falls in the column space spanned by $\Phi$ for any given partition of the range of $y_{t+1}$ into $H$ "slices" $I_1, I_2, \ldots, I_H$. Correspondingly, let
$$\widehat{\mathrm{cov}}(E(f_t|y_{t+1})) = \frac{1}{H}\sum_{h=1}^H\left[\frac{1}{\sum_{t=1}^T 1\{y_{t+1}\in I_h\}}\sum_{t=1}^T f_t1\{y_{t+1}\in I_h\}\right]\left[\frac{1}{\sum_{t=1}^T 1\{y_{t+1}\in I_h\}}\sum_{t=1}^T f_t1\{y_{t+1}\in I_h\}\right]', \qquad (4.5)$$
which is a nonparametric covariance estimator. The above sliced covariance estimator is based on observable factors. If the factors are unknown, they are replaced by their estimators, which leads to the following sufficient forecasting algorithm based on the factor model.

Algorithm 4.1.
Sufficient forecasting algorithm based on the factor models.
Step 1. Estimate the factors in model (4.2) for $t = 1,\ldots,T$.

Step 2. Construct the covariance estimator as in (4.5), with $\hat f_t$ in place of $f_t$.

Step 3. Obtain $\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_R$ as the top $R$ eigenvectors of the covariance matrix from Step 2.

Step 4. Construct the predictive indices $\hat\phi_1'\hat f_t, \ldots, \hat\phi_R'\hat f_t$.

Step 5. Nonparametrically estimate $h(\cdot)$ with the indices from Step 4, and forecast $y_{t+1}$.
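A sketch of Steps 2-3 of Algorithm 4.1 is given below, using quantile-based slices (our own choice of partition; all names are illustrative).

```python
# Sliced covariance estimator (4.5) and the top-R predictive directions.
import numpy as np

def sliced_directions(F_hat, y_next, H=10, R=1):
    """F_hat: T x r estimated factors; y_next: length-T targets y_{t+1}."""
    T, r = F_hat.shape
    edges = np.quantile(y_next, np.linspace(0, 1, H + 1))
    bins = np.clip(np.digitize(y_next, edges[1:-1]), 0, H - 1)  # slice index per t
    cov = np.zeros((r, r))
    for h in range(H):
        mask = bins == h
        if mask.any():
            m = F_hat[mask].mean(axis=0)     # within-slice mean of f_t
            cov += np.outer(m, m) / H
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, -R:]                      # top-R eigenvectors: phi_hat_1..phi_hat_R
```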
Implementing the above algorithm requires the number of slices $H$, the number of predictive indices $R$, and the number of factors $r$. In practice, $H$ has little influence on the estimated directions, as pointed out in Li (1991) and explained above, since property (4.4) holds. As regards the choice of $R$, the first $R$ eigenvalues of $\mathrm{cov}(E(f_t|y_{t+1}))$ must be significantly different from zero compared to the estimation error. Several methods, such as those of Li (1991) and Schott (1994), have been proposed to determine $R$. For instance, the average of the smallest $r - R$ eigenvalues would follow a $\chi^2$ distribution if the underlying factors are normally distributed. The number of factors can be determined by a number of methods.

4.2 Inference with high-dimensional controls

Consider a high-dimensional regression model
$$y_t = \beta'g_t + \nu'x_t + \eta_t, \quad g_t = \theta'x_t + \varepsilon_{g,t}, \qquad (4.6)$$
where $g_t$ is a treatment variable whose effect $\beta$ is of main interest. The model contains high-dimensional exogenous control variables $x_t = (x_{1t},\cdots,x_{Nt})$ that determine both the outcome and treatment variables. Having many control variables creates challenges for statistical inference; as such, we assume that $(\nu, \theta)$ are sparse vectors.

Control variables are often strongly correlated due to the presence of confounding factors:
$$x_t = Bf_t + u_t. \qquad (4.7)$$
This invalidates the conditions for using penalized regressions to directly select among $x_t$. Instead, if we substitute (4.7) into (4.6), we reach a factor-adjusted regression model:
$$y_t = \alpha_y'f_t + \gamma'u_t + \varepsilon_{y,t}, \quad g_t = \alpha_g'f_t + \theta'u_t + \varepsilon_{g,t}, \quad \varepsilon_{y,t} = \beta'\varepsilon_{g,t} + \eta_t, \qquad (4.8)$$
where $\alpha_g' = \theta'B$, $\alpha_y' = \beta\alpha_g' + \nu'B$, and $\gamma' = \beta\theta' + \nu'$. Here $(\alpha_y, \alpha_g, \beta)$ are low-dimensional coefficient vectors, while $(\gamma, \theta)$ are high-dimensional sparse vectors. Importantly, the model contains the high-dimensional latent controls $u_t$, which are weakly dependent due to the nature of idiosyncratic noise. The use of $u_t$ instead of $x_t$ validates the conditions for many high-dimensional variable selection methods.

Fan et al. (2020b) and Hansen and Liao (2018) showed that the penalized regression can be successfully applied to (4.8) to select components of $u_t$, which are cross-sectionally weakly correlated. Motivated by Belloni et al. (2014), the algorithm can be summarized as follows. For notational simplicity, we focus on the univariate case $\dim(\beta) = 1$.

Algorithm 4.2.
Estimate $\beta$ as follows.

Step 1.
Estimate $\{(f_t, u_t): t\le T\}$ from (4.7) to obtain $\{(\hat f_t, \hat u_t): t\le T\}$.

Step 2. Run penalized variable selections with $\hat u_t$:
$$(\hat\gamma, \hat\alpha_y) = \arg\min_{\gamma,\alpha_y}\frac{1}{2T}\sum_{t=1}^T(y_t - \alpha_y'\hat f_t - \gamma'\hat u_t)^2 + P_\tau(\gamma),$$
$$(\hat\theta, \hat\alpha_g) = \arg\min_{\theta,\alpha_g}\frac{1}{2T}\sum_{t=1}^T(g_t - \alpha_g'\hat f_t - \theta'\hat u_t)^2 + P_\tau(\theta).$$
Obtain the residuals $\hat\varepsilon_{y,t} = y_t - (\hat\alpha_y'\hat f_t + \hat\gamma'\hat u_t)$ and $\hat\varepsilon_{g,t} = g_t - (\hat\alpha_g'\hat f_t + \hat\theta'\hat u_t)$.

Step 3. Estimate $\beta$ by the residual regression $\hat\beta = (\sum_{t=1}^T\hat\varepsilon_{g,t}^2)^{-1}\sum_{t=1}^T\hat\varepsilon_{g,t}\hat\varepsilon_{y,t}$.
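The following sketch implements Algorithm 4.2 with PCA in Step 1 and scikit-learn's Lasso standing in for the generic penalty $P_\tau$ in Step 2 (our own illustrative choices; for brevity the factor coefficients are penalized too, unlike in the displayed criterion).

```python
# Factor-adjusted double selection for the treatment effect beta.
import numpy as np
from sklearn.linear_model import Lasso

def factor_adjusted_beta(y, g, X, R, alpha=0.1):
    """y, g: length-T outcome/treatment; X: T x N controls; R: working factors."""
    T, N = X.shape
    vals, vecs = np.linalg.eigh(X @ X.T / T)          # Step 1: PCA on controls
    F_hat = np.sqrt(T) * vecs[:, -R:]                 # T x R estimated factors
    U_hat = X - F_hat @ (F_hat.T @ X) / T             # estimated u_t (residual controls)
    Z = np.column_stack([F_hat, U_hat])
    eps_y = y - Lasso(alpha=alpha).fit(Z, y).predict(Z)   # Step 2: residuals
    eps_g = g - Lasso(alpha=alpha).fit(Z, g).predict(Z)
    return np.sum(eps_g * eps_y) / np.sum(eps_g ** 2)     # Step 3: beta_hat
```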
Note that $\gamma \mapsto P_\tau(\gamma)$ is a sparsity-inducing penalty function with a tuning parameter $\tau$. When $\theta$ and $\gamma$ are sufficiently sparse, and the PC-estimator is used in Step 1 with the correct selection of the number of factors, the above procedure is asymptotically valid:
$$\sigma_{\eta,g}^{-1}\sigma_g^2\sqrt T(\hat\beta - \beta) \xrightarrow{d} N(0,1), \qquad (4.9)$$
where $\sigma_g^2$ and $\sigma_{\eta,g}^2$ are the asymptotic variances of $\varepsilon_{g,t}$ and $\eta_t\varepsilon_{g,t}$.

More recently, Fan and Liao (2020) showed that the assumption of correct selection of the number of factors can be relaxed if we use the diversified projection in Step 1 instead, and (4.9) is still valid as long as we select $R\ge r$ factors (over-selection). Importantly, this admits $r = 0$ and $R\ge 1$, in which case $x_t = u_t$ itself is cross-sectionally weakly dependent, but we nevertheless estimate $R\ge 1$ "working factors" from $x_t$. This setting is empirically relevant, as it allows one to avoid pre-testing for the presence of common factors before inference.

Figure 1: Histograms of the standardized estimates (4.9) over 200 replications, superimposed with the standard normal density. The panel of "double selection" corresponds to directly selecting among $x_t$ (corresponding to $R = 0$, no factor adjustment), while all other panels correspond to using diversified projection (DP) estimators with $R$ working factors. The top four panels correspond to $r = 0$ and the bottom four panels correspond to $r = 2$. When $R\ge r$, (4.9) holds, whereas when $R < r$, (4.9) is violated. Figure source: Fan and Liao (2020).

Figure 1, taken from Fan and Liao (2020), plots the histograms of the t-statistics based on the estimated $\beta$ over 200 simulations, superimposed with the standard normal density, where $R$ diversified projections are used to estimate the factors in Step 1. Here the weights are polynomial transformations of the initial observations, so that the $i$th row of $W$ is $(x_{it}, x_{it}^2, \cdots, x_{it}^R)$ at $t = 0$. The "double selection" is the algorithm used in Belloni et al. (2014) that directly selects among $x_t$, corresponding to the case $R = 0$. The factor-augmented algorithm works well even if $r = 0$;
but when $r\ge 1$, factor adjustment with $R\ge 1$ is needed to guard against confounding factors among the control variables.

4.3 Factor-adjusted robust multiple testing
Controlling the false discovery proportion (FDP) in large-scale hypothesis testing based on strongly dependent test statistics has been an important problem in many scientific discoveries across disciplines. See Fan et al. (2019a) and the references therein, and Barras et al. (2010); Harvey et al. (2015); Harvey and Liu (2018); Giglio et al. (2020) for applications in empirical asset pricing.

Suppose we observe realizations of a random vector $\{y_t = (y_{1t},\cdots,y_{Nt})'\}_{t=1}^T$. Let $\alpha = (\alpha_1,\cdots,\alpha_N)'$ denote its mean vector. We are interested in testing the individual hypotheses
$$H_{0i}: \alpha_i = 0, \quad i = 1,\cdots,N.$$
Let $p_i$ denote the $p$-value for testing $H_{0i}$ based on a test statistic, such as the $t$-test, which rejects if $p_i < x$ for some critical value $x$. Define the number of false discoveries (rejections) and the total number of rejections as
$$F(x) = \sum_{i=1}^N 1\{p_i < x \text{ and } H_{0i} \text{ is true}\}, \quad V(x) = \sum_{i=1}^N 1\{p_i < x\}.$$
In large-scale multiple testing problems, researchers often aim to control the false discovery proportion (FDP) and the false discovery rate (FDR), defined by
$$\mathrm{FDP}(x) = \frac{F(x)}{\max\{V(x), 1\}}, \quad \mathrm{FDR}(x) = E\{\mathrm{FDP}(x)\}.$$
The goal is to find the critical value $x$ so that $\mathrm{FDR}(x)\le\tau$ for a desired level $\tau$ (e.g., 0.10), or, more relevantly, $\mathrm{FDP}(x)\le\tau$ with high confidence. While $V(x)$ is known, $F(x)$ is not in practice. A general principle for finding $x$ proceeds in the following two steps.

Algorithm 4.3.
General principle for FDP/FDR control.
Step 1.
Find $\bar F(x)$ such that either it upper bounds $F(x)$ for all $x\in(0,1)$, or it estimates $F(x)$ uniformly well.

Step 2.
Set the critical value to $x^* = \sup\{x\in(0,1): \bar F(x)\le\tau\max\{V(x),1\}\}$.
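As an illustration of this two-step principle, the sketch below implements Step 2 with the bound $\bar F(x) = Nx$ discussed next, which reproduces the Benjamini-Hochberg step-up rule (names are illustrative).

```python
# Critical value x* = sup{x : F_bar(x) <= tau * max(V(x), 1)} with F_bar(x) = N x.
import numpy as np

def bh_critical_value(pvals, tau):
    """Returns x*; reject every hypothesis with p_i <= x*."""
    N = len(pvals)
    p_sorted = np.sort(pvals)
    ok = p_sorted <= tau * np.arange(1, N + 1) / N   # p_(i) <= tau * i / N
    return p_sorted[ok].max() if ok.any() else 0.0
```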
One of the most popular procedures, proposed by Benjamini and Hochberg (1995), proceeds as follows. Denote by $p_{(1)}\le\cdots\le p_{(N)}$ the sorted p-values of the individual tests. Then the critical value is set to
$$x^* = \max\{p_{(i)}: p_{(i)}\le\tau i/N\}.$$
This method fits into Algorithm 4.3 with $\bar F(x) = Nx$, which is an asymptotic upper bound for $F(x)$ when the individual p-values are independent. One limitation of this upper bound is that it is too conservative if the number of true negatives is small compared to $N$. More fundamentally, it requires the test statistics to be weakly dependent, a topic we shall discuss in more detail next. Other methods, such as Storey (2002) and Fan et al. (2012), aim to directly estimate $F(x)$ in Step 1 in the presence of strong dependence among the test statistics, and are also adaptive to the unknown number of true negatives. In addition, instead of Algorithm 4.3, Romano and Wolf (2007) and Romano et al. (2008) provided alternative procedures for FDR control.

The key to the success of FDR control is that the individual test statistics should be either weakly dependent or independent. This makes the FDR and FDP approximately the same and easier to control. On the other hand, suppose the cross-sectional dependence of $y_t$ is generated from a latent factor model:
$$y_t = \alpha + Bf_t + u_t, \quad E(u_t|f_t) = 0, \qquad (4.10)$$
where $Ef_t = 0$ and $\alpha$ is the mean vector. In empirical asset pricing, the model can be used to identify nonzero alphas out of a large number of assets, and has been studied to identify skilled mutual fund managers, e.g., by Barras et al. (2010) and Harvey et al. (2015). The presence of latent factors, however, leads to strong dependence among the t-statistics based on the naive sample means of $y_t$, which invalidates the weak dependence assumptions. As is well documented in the literature, strong dependence creates fundamental challenges for multiple testing, including large standard errors of the estimated $\alpha_i$, unstable FDPs, and conservativeness of the testing procedure. Learning the dependence $Bf_t$ and removing it from the model (4.10) makes the data not only weakly dependent but also less noisy (from $Bf_t + u_t$ to $u_t$). This is the basic idea of the factor-adjusted robust multiple test (FarmTest), which uses the factor-adjusted data $\{y_t - \hat B\hat f_t\}_{t=1}^T$; see (4.10). Furthermore, Fan et al. (2019a) make adjustments so that it is also robust to heavy-tailed data.

Figure 2: Comparison between the sample mean (omitting factors) and the factor-adjusted method, with $T = 200$ and $N = 1000$. The upper panels plot the histograms of estimated individual alphas from a single simulation; the middle panels plot the individual FDPs over 1000 simulations. The bottom left panel plots the cross-sectional histograms of the standard errors of the estimated alphas over 1000 simulations. The bottom right panel plots the sorted p-values from a single simulation. The B-H procedure rejects all the hypotheses whose $p_{(i)}$ is below the B-H threshold line $f(i) := \tau i/N$.

To illustrate the consequences of omitting the adjustment for latent factors, as well as the effectiveness of the factor-adjusted method, let us consider a numerical example of a single-factor model, where the elements of $u_t$, $f_t$ and $B$ are generated from the standard normal distribution. We take the true means to be $\alpha_i = 0.6$ for a fraction of the indices and $\alpha_i = 0$ for the rest. We estimate $\alpha_i$ by two methods: 1) the sample means of $y_t$, without using factor adjustments; 2) the factor-adjusted estimator based on PCA.
We apply the method of Benjamini and Hochberg (1995) for multiple testing at a prespecified FDR level $\tau$. The upper panels of Figure 2 plot the histograms of the estimated $\alpha_i$, separated into those that satisfy the null hypotheses $\alpha_i = 0$ and those that satisfy the alternatives
$\alpha_i = 0.6$. Clearly, there is a large overlap (in the upper left panel) between the sample means from the null and the alternative, making it difficult for tests based on sample means to distinguish the alternatives from the nulls. In contrast, the PCA-based estimator can easily separate the nulls and alternatives, as shown in the upper right panel of Figure 2.

The middle two panels of Figure 2 plot the histograms of the true FDP over 1000 simulations based on the two estimators. It is evident that the distribution of the FDP corresponding to the factor-adjusted estimator concentrates around the nominal level. In contrast, the one based on the sample mean has a noticeably long tail as well as a larger mean and variance, which demonstrates the challenge of controlling the FDP in the presence of common factors, as explained above.

Finally, omitting confounding factors leads to larger standard errors and conservative inference. The bottom two panels in Figure 2 plot the standard errors of the individual estimated alphas and the sorted p-values for the two estimation methods. The sample-mean estimator has far fewer sorted p-values below the B-H threshold line (i.e., fewer rejections) compared to the factor-adjusted estimator. Hence it is recommended to estimate and remove the latent factors before applying standard FDR control algorithms.
Giglio et al. (2020) studied the problem of identifying hedge funds that are able to produce positive alphas (i.e., have "skill") among thousands of existing funds. They considered a linear pricing model, where hedge fund returns are
$$y_{it} = \alpha_i + b_i'\lambda + b_i'(f_t - Ef_t) + u_{it}.$$
In the model, $f_t$ contains both observable and latent factors. The model allows nontradable observable factors, and $\lambda$ is the vector of factor risk premia. At a broad level, their methodology proceeds as the Fama-MacBeth regression integrated with PCA to extract the latent factors:

Algorithm 4.4.
Estimating alphas in the presence of latent and nontradable factors.
Step 1.
Run fund-by-fund time series regressions to estimate fund exposures (betas) to the observable factors.
Step 2.
Apply PCA to the residuals to recover the latent factors and betas.
Step 3.
Implement cross-sectional regressions, à la Fama-MacBeth, to estimate the risk premia of the factors (including both observable and latent factors) and the alphas.

Because of the many negative alphas from unskilled fund managers, the multiple testing problem should be properly formulated with one-sided hypotheses:
$$H_{0i}: \alpha_i\le 0, \quad i = 1,\cdots,N.$$
Hence, rejecting $H_{0i}$ indicates that fund manager $i$ is skilled. On the other hand, the existence of a potentially very large number of negative alphas gives rise to the issue of power loss, as such funds only add noise to the model. The loss of power associated with testing inequalities is well known as the problem of being "deep in the null", and is often seen in the econometric literature. To address this issue, Giglio et al. (2020) proposed to first screen off very bad funds, identified as
$$I = \{i\le N: \hat\alpha_i/\mathrm{se}(\hat\alpha_i) < -c_{NT}\},$$
where $c_{NT} > 0$ is a threshold chosen such that $P(I\subseteq H_0)\to$
$1$, with $H_0$ denoting the set of true nulls. They recommended applying FDR control algorithms to the funds outside $I$. Therefore, two ingredients are recommended for identifying skilled fund managers via multiple testing: (1) adjust for the effect of latent factors, and (2) remove the estimated alphas that are deep in the null. Both play essential roles in gaining good testing power.

4.4 Instrumental variables

The issue of endogeneity is often encountered in real-data applications. Consider the following instrumental variable (IV) regression model:
$$y_t = w_t'\beta + \varepsilon_t = w_{1t}'\beta_1 + w_{2t}'\beta_2 + \varepsilon_t,$$
where $w_{1t}$ is a $k_1$-dimensional vector of exogenous regressors and $w_{2t}$ is a $k_2$-dimensional vector of endogenous regressors. Meanwhile, we have an $N$-dimensional IV $x_t$ which admits a factor structure:
$$x_t = Bf_t + u_t.$$
Below we introduce four estimators of $\beta$, which differ in their choices of instruments.

Use $f_t$ as the instruments. Project $w_{2t}$ on $f_t$:
$$w_{2t} = \phi'f_t + v_t, \quad E(v_t|f_t) = 0,$$
where $\phi$ is an $r\times k_2$ matrix. We need $r\ge k_2$ for identification. Let $z_t = (w_{1t}', f_t')'$ be the set of instruments. As $f_t$ is unobservable, we replace it with some factor estimator and apply the two stage least squares estimator $\hat\beta_f$ with the feasible instruments. Bai and Ng (2010) studied this estimator, and showed that the estimation errors of $\hat\beta_f$ associated with the generated instruments (factor estimates) have no effect on the limiting variance. When $u_t$ and $\varepsilon_t$ are uncorrelated, this only requires $(N,T)\to\infty$, regardless of the relative growth rates. When some weak correlations are present but $\|E\varepsilon_tu_t\| = O(1)$, we would require $\sqrt T = o(N)$ to offset the effect of estimating the factors.

Use $x_t$ as the instruments. Project $w_{2t}$ on $x_t$:
$$w_{2t} = \theta x_t + e_t, \qquad (4.11)$$
where $\theta$ is a $k_2\times N$ coefficient matrix. This projection motivates using $x_t$ directly as a set of high-dimensional IVs. Suppose $\varepsilon_t$ is an i.i.d. process; then the two stage least squares estimator is efficient, and is given by
$$\hat\beta_x = \left(W'X\hat\Sigma_x^{-1}X'W\right)^{-1}W'X\hat\Sigma_x^{-1}X'Y,$$
where $X$ is the $T\times N$ matrix of $x_t$, and $W$ and $Y$ are the matrices of $w_t$ and $y_t$. Note that $\hat\Sigma_x$ is the estimated covariance of $x_t$, which can be constructed using the factor-based covariance estimators described in Section 3. It is interesting to compare the asymptotic behaviors of $\hat\beta_f$ and $\hat\beta_x$. Bai and Ng (2010) showed that when $u_t$ and $w_t$ are uncorrelated, the two have the same asymptotic variance, but $\hat\beta_x$ has an $O(\frac{N}{T})$ bias term. So $\hat\beta_x$ is consistent only if $N = o(T)$.

Use selected $x_t$ as the instruments. We still consider the projection (4.11), but assume that the rows of $\theta$ are sparse vectors, so that we can apply penalized regression to select among the components of $x_t$:
$$\hat\theta_j = \arg\min_{\theta_j\in\mathbb R^N}\frac{1}{2T}\sum_{t=1}^T(w_{t,j} - x_t'\theta_j)^2 + P_\tau(\theta_j), \quad j\le\dim(w_{2t}), \qquad (4.12)$$
where $P_\tau(\theta_j)$ is a sparsity-inducing penalty with tuning $\tau$. Let $x_{t,\mathrm{selec}}$ be the vector of selected components corresponding to the nonzero components of $\{\hat\theta_j: j\le\dim(w_{2t})\}$. Belloni et al. (2012) used $(w_{1t}, x_{t,\mathrm{selec}})$ as the instruments to compute $\hat\beta_{x,\mathrm{selec}}$, the two stage least squares estimator. This method, however, does not work well in the presence of common factors: the strong dependence in $x_t$ invalidates the variable selection procedure (4.12).

Use $f_t$ and selected $u_t$ as the instruments. We are not aware of any applications of this method in the IV literature, but it is still well motivated. Substituting the factor structure into (4.11), we obtain
$$w_{2t} = \delta f_t + \theta u_t + e_t, \quad \delta = \theta B.$$
Hence we can carry out variable selection among $u_t$:
$$(\hat\delta_j, \hat\theta_j) = \arg\min_{\delta_j\in\mathbb R^r,\,\theta_j\in\mathbb R^N}\frac{1}{2T}\sum_{t=1}^T(w_{t,j} - \delta_j'\hat f_t - \hat u_t'\theta_j)^2 + P_\tau(\theta_j), \quad j\le\dim(w_{2t}).$$
Let $\hat u_{t,\mathrm{selec}}$ be the vector of selected components corresponding to the nonzero components of $\{\hat\theta_j: j\le\dim(w_{2t})\}$. We then use $(w_{1t}, \hat f_t, \hat u_{t,\mathrm{selec}})$ as the instruments to compute $\hat\beta_{f,u}$, the two stage least squares estimator. This method is expected to work well because it marginalizes out the strong factors in $x_t$, leaving the remaining components $u_t$ weakly dependent.
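A sketch of the first estimator $\hat\beta_f$ is given below: PCA extracts $\hat f_t$ from the factor-structured instruments $x_t$, and 2SLS then uses $z_t = (w_{1t}', \hat f_t')'$. The other three estimators change only the instrument matrix; all names are our own.

```python
# Two stage least squares with estimated factors as instruments.
import numpy as np

def tsls(Y, W, Z):
    """2SLS of Y on W with instrument matrix Z (rows indexed by t)."""
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)         # projection onto instruments
    return np.linalg.solve(W.T @ PZ @ W, W.T @ PZ @ Y)

def beta_hat_f(Y, W1, W2, X, r):
    """W1: exogenous, W2: endogenous regressors; X: T x N IVs with factor structure."""
    T = X.shape[0]
    vals, vecs = np.linalg.eigh(X @ X.T / T)
    F_hat = np.sqrt(T) * vecs[:, -r:]              # PC estimates of the factors
    W = np.column_stack([W1, W2])
    Z = np.column_stack([W1, F_hat])               # feasible instruments
    return tsls(Y, W, Z)
```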
Let us conduct a simple simulation to study the finite sample behaviors of the aforementioned four estimators. We consider a model $y_t = w_t'\beta + \varepsilon_t$ with a single endogenous regressor $w_t$, generated from (4.11) with $e_t = \varepsilon_t/2$. Here $\theta$ is sparse, with a few nonzero leading entries (the first being 2) and zeros elsewhere, and $x_t$ admits a two-factor structure. The variables $(\varepsilon_t, f_t, B, u_t)$ are independent standard normal. Finally, the variable selections are based on the lasso with an oracle tuning parameter that controls the score of the least squares function; for instance, for problem (4.12) we set $P_\tau(\theta) = \tau\|\theta\|_1$ with $\tau$ proportional to $\|\frac{1}{T}\sum_t x_te_t\|_\infty$.

Table 1: Comparison among the four IV methods
             Bias                                      Standard deviation
 N    T      β̂_f     β̂_x      β̂_x,selec   β̂_f,u      β̂_f     β̂_x     β̂_x,selec   β̂_f,u
 50   100    0.004   -0.993   0.003       0.008       0.215   0.001   0.061       0.058
 200  100    0.008   -0.996   0.007       0.009       0.143   8e-4    0.056       0.054
Reported results are based on 1000 replications. $\hat\beta_f$ uses $\hat f_t$ as IVs; $\hat\beta_x$ uses $x_t$ as IVs; $\hat\beta_{x,\mathrm{selec}}$ selects among $x_t$ as IVs using the lasso; $\hat\beta_{f,u}$ uses $(\hat f_t, \hat u_{t,\mathrm{selec}})$ as IVs, where the components of $\hat u_t$ are selected using the lasso.

Table 1 reports the bias and standard deviation of each estimator calculated from the 1000 replications. First, using only the estimated factors as instruments ($\hat\beta_f$) leads to the largest standard error. This is not surprising, because it excludes the relevant information in $u_t$ while the latter is correlated with $w_t$, so this method is less efficient. Secondly, using $x_t$ as instruments without variable selection ($\hat\beta_x$) has the smallest standard deviation, but is severely biased. Finally, the two instrument-selection based estimators ($\hat\beta_{x,\mathrm{selec}}$ and $\hat\beta_{f,u}$) perform favorably and similarly. But $\hat\beta_{x,\mathrm{selec}}$ is not as stable, as it occasionally selects none of the instruments in our numerical experiments.

4.5 Boosting

Consider the following factor-augmented regression:
$$y_{t+h} = c + \alpha(L)'w_t + \gamma(L)y_t + \beta(L)'f_t + \varepsilon_{t+h}, \qquad (4.13)$$
where $\alpha(L) = \alpha_0 + \alpha_1L + \cdots + \alpha_pL^p$, $\gamma(L) = \gamma_0 + \gamma_1L + \cdots + \gamma_qL^q$ and $\beta(L) = \beta_0 + \beta_1L + \cdots + \beta_lL^l$ are all lag-operator polynomials. Suppose that $w_t$ is a $k$-dimensional vector and $f_t$ is an $r$-dimensional vector. The above predictive regression has $n = 1 + (p+1)k + (q+1) + (l+1)r$ parameters. It is likely that some of the parameters are zero, so model selection devices can be employed to choose a parsimonious model. Here we briefly describe a model selection method, known as boosting, which Bai and Ng (2009) proposed to use in this context.

Boosting is an ensemble meta-algorithm, which sequentially finds a "committee" of base learners and then makes collective decisions using a weighted linear combination of all base learners. The first successful and popular boosting algorithm is AdaBoost (Freund and Schapire, 1997). Friedman (2001) proposed a generic functional gradient descent (FGD) algorithm, which views boosting as a method for function estimation. If the squared loss function is specified, the FGD algorithm reduces to $L_2$-Boosting, which is studied in Friedman (2001) and Bühlmann and Yu (2003). Suppose that $(y_t, z_t)_{t=1}^T$ are the observed target and predictive regressors over the sample period. The $L_2$-Boosting algorithm for estimating the conditional mean $E(y_t|z_t)$ is given as follows.

Algorithm 4.5. $L_2$-Boosting algorithm

Step 1.
4.5 Boosting

Consider the following factor-augmented regression
$$y_{t+h} = c + \alpha(L)'w_t + \gamma(L)y_t + \beta(L)'f_t + \varepsilon_{t+h}, \qquad (4.13)$$
where $\alpha(L) = \alpha_0 + \alpha_1 L + \cdots + \alpha_p L^p$, $\gamma(L) = \gamma_0 + \gamma_1 L + \cdots + \gamma_q L^q$ and $\beta(L) = \beta_0 + \beta_1 L + \cdots + \beta_l L^l$ are all lag operator polynomials. Suppose that $w_t$ is a $k$-dimensional vector and $f_t$ is an $r$-dimensional vector. The above predictive regression then has $n = 1 + (p+1)k + (q+1) + (l+1)r$ parameters. It is likely that some of these parameters are zero, so model selection devices can be employed to choose a parsimonious model. Here we briefly describe a model selection method, known as boosting, whose use in this context was proposed by Bai and Ng (2009).

Boosting is an ensemble meta-algorithm, which sequentially finds a "committee" of base learners and then makes a collective decision by using a weighted linear combination of all base learners. The first successful and popular boosting algorithm is AdaBoost (Freund and Schapire, 1997). Friedman (2001) proposed a generic functional gradient descent (FGD) algorithm, which views boosting as a method for function estimation. If the squared loss function is specified, the FGD algorithm reduces to $L_2$-Boosting, which is studied in Friedman (2001) and Bühlmann and Yu (2003). Suppose that $(y_t, z_t)_{t=1}^T$ are the observed target and predictive regressors over the sample period. The $L_2$-Boosting algorithm for estimating the conditional mean $E(y_t|z_t)$ is given as follows.

Algorithm 4.5. $L_2$-Boosting algorithm

Step 1 Initialize $\hat f^{[0]}(\cdot)$ with an offset value. The default value is $\hat f^{[0]}(\cdot) \equiv \bar y$. Set $m = 0$.

Step 2 Increase $m$ by 1. Compute the residuals $e_t = y_t - \hat f^{[m-1]}(z_t)$ for $t = 1, 2, \ldots, T$.

Step 3 Fit the residual vector $e_1, \ldots, e_T$ to $z_1, \ldots, z_T$ by the real-valued base procedure (e.g., regression):
$$(z_t, e_t)_{t=1}^T \ \xrightarrow{\ \text{base procedure}\ }\ \hat g^{[m]}(\cdot).$$

Step 4 Update $\hat f^{[m]}(\cdot) = \hat f^{[m-1]}(\cdot) + \nu \cdot \hat g^{[m]}(\cdot)$, where $0 < \nu \le 1$ is a step-length parameter.

Step 5 Iterate steps 2 to 4 until $m = m_{\mathrm{stop}}$ for some stopping iteration $m_{\mathrm{stop}}$.
One can apply the above $L_2$-Boosting to the factor-augmented predictive regression (4.13). As seen in Algorithm 4.5, one needs to specify the base procedure in step 3. Bai and Ng (2009) suggest two methods, depending on the way lags are treated, which lead to the component-wise $L_2$-Boosting and the block-wise $L_2$-Boosting. In component-wise $L_2$-Boosting, one treats each lag of each variable as an independent predictor and the base procedure is a simple linear regression. Therefore, step 3 is given as follows.

Algorithm 4.6. Component-wise $L_2$-Boosting

Step 3.1 Let $z_{t,j}$ denote a typical regressor in the regressor pool, with $j = 1, 2, \ldots, n$. Regress the current residual $e_t$ (the residual in the $m$-th iteration) on each $z_{t,j}$ to obtain the coefficient $\hat b_j$. Compute the sum of squared residuals, denoted by $\mathrm{SSR}(j)$.

Step 3.2 Determine $j_m$ by $j_m = \arg\min_{1\le j\le n}\mathrm{SSR}(j)$.

Step 3.3 Set $\hat g^{[m]}(z_t) = \hat b_{j_m}z_{t,j_m}$; that is, the base learner involves only the selected regressor $z_{t,j_m}$, with the coefficients of all the other regressors set to zero.
Another way is to only differentiate the predictors in the current period and treat each predictor and its multiple lags as a block. This gives rise to the block-wise $L_2$-Boosting, in which the base procedure is a multivariate regression with the regressors being one predictor and its lags. See Bai and Ng (2009) for details.

Threshold regressions have been used in economic applications to capture potential structural changes in regression coefficients. The early literature models the threshold effect using some observable scalar variable $q_t$, as in
$$y_t = w_t'\beta + w_t'\delta\,1\{q_t > \gamma\} + \varepsilon_t,$$
where $w_t$ and $q_t$ are adapted to the filtration $\mathcal{F}_{t-1}$; $(\beta, \delta, \gamma)$ is a vector of unknown parameters, and $\varepsilon_t$ satisfies the conditional mean restriction. Hence when $q_t > \gamma$, the regression function becomes $w_t'(\beta + \delta)$; when $q_t \le \gamma$, it reduces to $w_t'\beta$ (Chan, 1993; Hansen, 2000). In practice, it might be controversial to choose which observed variable plays the role of $q_t$. For example, if the two regimes represent the status of two environments of the population, it is arguably difficult to assume that the change of the environment is governed by just a single variable.

Seo and Linton (2007) and Lee et al. (2020) extended the model to a multivariate threshold:
$$y_t = w_t'\beta + w_t'\delta\,1\{\gamma'f_t > 0\} + \varepsilon_t,$$
where $f_t$ is a vector of "factors" and $\gamma$ is the corresponding vector of unknown coefficients. So the model introduces a regime change driven by a single index of factors. Allowing a multivariate threshold is important, because it permits the structural change to be governed by a potentially much larger dataset: $x_t = Bf_t + u_t$, where $\dim(x_t) = N \to \infty$. So $f_t$ can be unobserved factors that can be learned from $x_t$. For the identification purpose, suppose $\frac{1}{T}\sum_t f_tf_t' = I$ and $B'B$ is diagonal; then $\gamma$ and $f_t$ are separately identified. This gives rise to the factor-driven two-regime regression model.

A natural strategy to estimate the model is to rely on least squares:
$$\min_{\beta, \delta, \gamma}\ \sum_{t=1}^T\big(y_t - w_t'\beta - w_t'\delta\,1\{\gamma'\hat f_t > 0\}\big)^2,$$
where $\hat f_t$ is the plugged-in PC-estimator of the factors. Because the least squares problem is neither convex nor smooth in $\gamma$, the computational task is demanding. Lee et al. (2020) recommended using algorithms based on mixed integer optimization (MIO). Introduce integers $d_t := 1\{\gamma'\hat f_t > 0\} \in \{0, 1\}$. The goal is to introduce linear constraints with respect to the variables of optimization. Suppose there are known upper and lower bounds for $\delta_j$: $L_j \le \delta_j \le U_j$, where $\delta_j$ denotes the $j$th element of $\delta$. Define $M_t \equiv \max_{\gamma\in\Gamma}|\gamma'\hat f_t|$, where $\Gamma$ is the parameter space for $\gamma$. Then it can be verified that the least squares problem is numerically equivalent to the following constrained MIO problem:
$$\min_{\beta, \delta, \gamma, d, \ell}\ \sum_{t=1}^T\big(y_t - w_t'\beta - w_t'\ell_t\big)^2 \qquad (4.14)$$
subject to (for any $\epsilon > 0$), for $t = 1, \ldots, T$ and each $j = 1, \ldots, \dim(w_t)$,
$$\gamma \in \Gamma, \quad d_t \in \{0, 1\}, \quad L_j \le \delta_j \le U_j,$$
$$(d_t - 1)(M_t + \epsilon) < \gamma'\hat f_t \le d_tM_t,$$
$$d_tL_j \le \ell_{j,t} \le d_tU_j,$$
$$L_j(1 - d_t) \le \delta_j - \ell_{j,t} \le U_j(1 - d_t). \qquad (4.15)$$
Then, we can apply modern MIO packages (e.g., Gurobi) to solve for the optimal $(\beta, \delta, \gamma)$.
Finally, Lee et al. (2020) also derived the asymptotic distribution of the estimated coefficients and proposed inferences based on bootstraps. Under the condition that $T = O(N)$, they showed that the effect of estimating the factors is negligible for the asymptotic distribution of the estimated $(\beta, \delta)$, but affects both the rate of convergence and the limiting distribution of the estimated $\gamma$.

The stochastic block model has been a popular approach to modeling networks (see Abbe (2017) for a recent review). We observe a graph of $N$ nodes. Let $A = (a_{ij}) \in \mathbb{R}^{N\times N}$ be the adjacency matrix of edges, so that $a_{ij} = 1$ if nodes $i$ and $j$ are connected, and $a_{ij} = 0$ otherwise. Suppose each node belongs to one of $r$ communities, and the community that node $i$ belongs to is denoted by an unknown $\pi_i \in \{1, \cdots, r\}$. In addition, the elements of $A$ are random variables. The stochastic block model then assumes that
$$P(a_{ij} = 1\,|\,\pi_i = k, \pi_j = l) = w_{k,l},$$
where $w_{k,l}$ is an unknown probability. We observe the matrix $A$ and aim to recover the memberships $\pi_i$ and the probabilities $w_{k,l}$ for all $k, l = 1, \cdots, r$.

Let $e_1, \cdots, e_r$ denote the canonical basis of $\mathbb{R}^r$, and let $b_i = e_k$ whenever $\pi_i = k$. Then $b_i$ indicates the community membership of node $i$, and the membership matrix is
$$B = (b_1, \cdots, b_N)', \qquad N\times r,$$
whose rows represent nodes and columns represent communities. Let $W$ denote the $r\times r$ matrix of $(w_{k,l})$ and let $L := EA$. It can easily be seen that $L = BWB'$ is a low-rank matrix, whose rank equals $r$, leading to the following low-rank decomposition:
$$A = L + S, \qquad S = A - EA.$$
Therefore, $A$ has the familiar decomposition (2.4), with $L$ being similar to the systematic risk and $B$ a low-rank loading matrix. Since the elements of $S$ are independent with mean zero (a Wigner-type matrix), the operator norm $\|S\|$ does not grow too fast, compared to that of $L$. We can then apply PCA on $A$ to estimate $B$. Suppose $r$ is known; then the estimator $\hat B$ is defined as $\sqrt N$ times the eigenvectors of $A$ corresponding to the first $r$ eigenvalues.

Theorem 2.1 can be applied to obtain a deviation bound for the estimated membership matrix. If there is a sequence $g_N \to \infty$ and constants $c_1, \cdots, c_r > 0$ such that $\lambda_i(W^{1/2}B'BW^{1/2}) = c_ig_N(1 + o_P(1))$ for all $i \le r$, then there is an $r\times r$ matrix $H$, so that
$$\|\hat B - BH\|_\infty = O_P\big(g_N^{-2}N\|S\| + g_N^{-1}\sqrt{N\log N}\big).$$
Therefore, the elements of a rotated $B$ can be estimated uniformly well. Moreover, because each community contains many nodes, $BH$ has many identical rows, which makes cluster analysis a natural method for community detection. For instance, we can apply either the K-means cluster analysis or the homogeneity pursuit of Ke et al. (2015) to the rows of $\hat B$ to consistently identify the communities.
So far we have been assuming that the factor loadings and covariance matrices are time-invariant. Research on conditional factor models has also grown rapidly in recent years. Suppose
$$y_{it} = b_{i,t}'f_t + u_{it},$$
where $b_{i,t}$ is a time-varying vector of loadings. There have been several approaches to addressing the issue of time-varying loadings. In this section we briefly review three of the most commonly used ones: (1) time-varying characteristics, (2) time-smoothing and (3) continuous-time models.

The first approach models $b_{i,t}$ using a function of observed characteristics $z_{i,t-1}$:
$$b_{i,t} = b_i(z_{i,t-1}),$$
where $b_i(\cdot)$ is either a linear function or an unknown nonparametric function of the characteristics. Therefore, the time variation is mainly captured by the characteristics. An advantage of this approach, over the other two approaches to be reviewed below, is that if $z_{i,t-1}$ is correctly specified and indeed fully captures the degree of time variation of the model, then $b_{i,t}$ allows a large degree of variation and, potentially, structural breaks. On the other hand, the limitation of this approach is the potential misspecification of $z_{i,t-1}$ and omitted variable problems. We refer to Gagliardini et al. (2019) for an excellent review of conditional factor models using this approach, and of their applications in empirical asset pricing.

The second approach assumes that the factor loadings change smoothly over time. Supposing $b_i(\cdot)$ is an unknown smooth function, we assume
$$b_{i,t} = b_i\Big(\frac{t}{T}\Big), \qquad \forall t \le T.$$
Then locally, $b_{i,t} \approx b_{i,r}$ for all $t \approx r$. So in a local window $B(r)$ around each fixed $r$, the model is approximately time-invariant:
$$y_{it} \approx b_{i,r}'f_t + u_{it}, \qquad t \in B(r).$$
Motivated by this assumption, Ang and Kristensen (2012) and Ma et al. (2020) tested the market mean-variance efficiency assumption in the case of known factors. In the unknown factor case, Su and Wang (2017) first applied local smoothing to $y_{it}$, then employed PCA on the smoothed data to estimate the factors and loadings. While this approach does not require the specification of time-varying characteristics, it is restricted to the smoothly varying scenario and thus rules out structural breaks. In addition, slow rates of convergence appear near the boundaries (that is, the beginning and the end of the observation period).
Consider a continuous-time factor model
$$dy_t = \alpha_t\,dt + B_t\,df_t + du_t,$$
where $y_t, f_t, u_t$ are vectors of asset prices, factors and idiosyncratic risks, and $\alpha_t$ is a drift term. The time-varying loading matrix $B_t$ is an $N\times r$ matrix that is assumed to be a continuous and locally bounded Itô semimartingale of the form
$$B_t = B_0 + \int_0^t\tilde\alpha_s\,ds + \int_0^t\sigma_s\,dW_s,$$
where $\tilde\alpha_s$ and $\sigma_s$ are optional and locally bounded processes, and $W_s$ is a Brownian motion. Roughly speaking, by the Burkholder–Davis–Gundy inequality (cf. Chapter 2 of Jacod and Protter (2011)), $B_t$ is also locally time-invariant, which is similar to the treatment of the time-smoothing approach. The major difference, though, is that the use of high-frequency data automatically "smooths" the data. We refer to the following papers for recent developments on high-frequency factor models, among others: Aït-Sahalia and Xiu (2017); Chen et al. (2019a); Liao and Yang (2018); Li et al. (2019); Pelger (2019).

Missing data and unbalanced panels are not uncommon in economic and financial studies. Addressing the missing data issue in statistical modeling belongs to a larger category of problems, known as matrix completion. Low-rank matrix completion refers to the problem of recovering missing entries of low-rank matrices. It is particularly relevant to empirical asset pricing factor models, because many time series of returns have short histories or missing records. In this section we review several methods for matrix completion, which assume that the data are missing at random, except for Cai et al. (2016) and Bai and Ng (2019). Besides, the EM algorithm is also a classical approach to dealing with unbalanced panels. We refer to Stock and Watson (2002b); Su et al. (2019); Zhu et al. (2019) for detailed discussions of related issues.

Recall that the covariance matrix of $y_t$, under the factor model (3.1), has the following decomposition: $\Sigma_y = B\,\mathrm{cov}(f_t)B' + \Sigma_u$, where the columns of $B$ are approximately equal to the eigenvectors of $\Sigma_y$ corresponding to the first $r$ eigenvalues. As such, let $\hat\Sigma_y$ be an input matrix, serving as an estimator for $\Sigma_y$. Then, as described in Section 2.2, we can estimate the space spanned by $B$ using the leading eigenvectors of $\hat\Sigma_y$.

In the presence of missing data with exogenous missingness, let $x_{it} = 1\{y_{it}\text{ is observed}\}$, so that we only observe $y_{it}x_{it}$ for all $(i,t)$, in which the unobserved data are set to zero. Suppose for now that $w_i := P(x_{it} = 1)$ is known. We can construct an unbiased estimator $\hat\Sigma_y = (\hat\sigma_{ij})$ with
$$\hat\sigma_{ij} := \frac{1}{w_iw_jT}\sum_{t=1}^T y_{it}y_{jt}x_{it}x_{jt}.$$
In matrix form, let $Y$ and $X$ be the $N\times T$ matrices of $y_{it}$ and $x_{it}$, so that we only observe $Y\circ X$, where $\circ$ represents the element-wise (Hadamard) matrix product. Also let $W$ be the diagonal matrix with $w_i$ as its $i$th diagonal entry. Then
$$\hat\Sigma_y = \frac{1}{T}ZZ', \qquad Z := W^{-1}(Y\circ X).$$
Therefore, the columns of the loading matrix estimator $\hat B$ equal $\sqrt N$ times the top left singular vectors of $Z$. This method simply replaces the missing entries of $Y$ by zero and applies the inverse probability weighting (IPW) before applying PCA. The IPW has been popularly used in the causal inference literature (e.g., Imbens and Rubin (2015)). Here the same idea is applied to create an unbiased estimator of the covariance matrix.

In practice, we shall replace $w_i$ by a consistent estimator, such as $\hat w_i := \frac{1}{T}\sum_{t=1}^T x_{it}$. But in the case of homogeneous missing, that is, $w_1 = \cdots = w_N$, the IPW is not needed, because $W$ equals the identity matrix up to a constant, which does not affect the PCA on $Y\circ X$. In addition, the factors can be further estimated by least squares, regressing $y_{it}x_{it}$ on the estimated loadings.

Theoretical properties were studied by Abbe et al. (2020) and Su et al. (2019) under the assumption of homogeneous missing. Su et al. (2019) used this estimator as the initial value for their EM algorithm. Xiong and Pelger (2019) allowed heterogeneous missing and proved that the estimators are also asymptotically normal (they estimated $w_iw_j$ directly by $\frac{1}{T}\sum_{t=1}^T x_{it}x_{jt}$). We can also quickly derive the rate of convergence by applying Theorem 2.1. However, the IPW is the least efficient approach among all the methods to be discussed in this section. We shall verify this in a simulation study in Section 5.5.
Regularized matrix completion is a powerful technique to recover missing entries of low-rank matrices. This approach is also much faster than the EM algorithm in handling large panels. Due to these nice properties, it has attracted much attention in the recent econometrics literature, e.g., Athey et al. (2018); Bai and Ng (2017); Moon and Weidner (2018); Giglio et al. (2020). In the matrix form $Y = M + U$, the goal is to recover the factor component $M = BF'$ when $Y$ has missing entries. The nuclear-norm regularization is directly applicable:
$$\hat M := \arg\min_M\ \|(Y - M)\circ X\|_F^2 + \lambda\|M\|_n \qquad (5.1)$$
with tuning parameter $\lambda$. The factors and loadings can be estimated by taking the singular vectors of $\hat M$. Negahban and Wainwright (2011) and Koltchinskii et al. (2011) derived the rate of convergence under the Frobenius norm. Under suitable conditions (e.g., missing at random, restricted strong convexity, sufficiently large noise) it can be proved that
$$\frac{1}{NT}\|\hat M - M\|_F^2 = O_P\Big(\frac{1}{T} + \frac{1}{N}\Big).$$
Chen et al. (2020b) further certify that the convex optimization (5.1) is optimal for all noise levels under the Frobenius norm, the operator norm, and the elementwise infinity norm. The proof is based on a novel technical device that bridges the convex optimization with a nonconvex optimization problem. However, this estimator is not asymptotically normal due to the presence of shrinkage bias, so it is not suitable for statistical inference.
Several recent advances in this literature focus on debiasing the regularized estimator in order to obtain valid confidence intervals, e.g., Chen et al. (2019b); Xia and Yuan (2019); Chernozhukov et al. (2019). When the missing is homogeneous, $P(x_{it} = 1) = p$ for all $(i,t)$, Chen et al. (2019b) proposed the following simple debiased estimator:
$$\hat M^d = H_R\big(\hat M + \hat p^{-1}(Y - \hat M)\circ X\big), \qquad (5.2)$$
where $H_R(\cdot)$ is the best rank-$R$ approximation in (2.5), $\hat M$ is given by (5.1), and $\hat p$ is the sample proportion of observed entries. The idea is very intuitive. Ignoring the weak dependence between $\hat M$ and $X$ and the estimation error in $\hat p$, we have
$$E\big(\hat M + \hat p^{-1}(Y - \hat M)\circ X\big) \approx E\hat M + E(Y - \hat M) = M,$$
which is approximately unbiased. However, the estimator $\hat M + \hat p^{-1}(Y - \hat M)\circ X$ is no longer of rank $R$, which increases the variance. This leads to the use of the projection in (5.2), which is asymptotically efficient in terms of both the rate and the pre-constant.

Alternatively, the debiasing can be achieved through iterative least squares (Chernozhukov et al., 2019). Suppose the true number of factors, $r$, is known.

Algorithm 5.1. Debias using iterative least squares.

Step 1. Obtain $\hat M$ as in (5.1).

Step 2. Let the columns of $\sqrt N\hat B$ be the left singular vectors of $\hat M$ corresponding to the first $r$ singular values.

Step 3. Estimate the latent factors at time $t$ by $\tilde f_t := \big(\sum_{i=1}^N\hat b_i\hat b_i'x_{it}\big)^{-1}\sum_{i=1}^N\hat b_iy_{it}x_{it}$ and let $\tilde F = (\tilde f_1, \cdots, \tilde f_T)'$.

Step 4. Update the loading estimates by $\tilde B = (\tilde b_1, \cdots, \tilde b_N)'$, where
$$\tilde b_i := \Big(\sum_{t=1}^T\tilde f_t\tilde f_t'x_{it}\Big)^{-1}\sum_{t=1}^T\tilde f_ty_{it}x_{it}.$$

Step 5. The asymptotically unbiased estimator of $M$ is $\tilde M := \tilde B\tilde F'$.

A key technical argument is to ensure that the estimation error in $\hat B$ (step 2) has no impact on the factor estimator (step 3); this is achieved by Chen et al. (2019b) using an "auxiliary leave-one-out" argument.
When the missing probability $P(x_{it} = 1)$ varies across $i$, there are two ways to revise the previous algorithm to achieve asymptotic normality. One way is to replace (5.1) with a weighted regularization:
$$\min_M\ \|(\hat W^{-1/2}Y - \hat W^{-1/2}M)\circ X\|_F^2 + \lambda\|M\|_n, \qquad (5.3)$$
where $\hat W$ is a diagonal matrix, whose $i$th diagonal entry equals $\hat w_i := \frac{1}{T}\sum_{t=1}^T x_{it}$. This debiases the least squares part of the loss function, adopting the same idea as inverse probability weighting. The remaining steps of Algorithm 5.1 are the same, and the "auxiliary leave-one-out" technical argument of Chen et al. (2019b) still goes through. The other way is to apply "sample splitting", which evenly splits the columns of $Y$ into two parts: on one part we run the penalized regression as in (5.1) and obtain $\hat B$; on the other part we run the iterative least squares. Then we exchange the two parts and redo the estimation. The final estimator is taken as the average of the two. Supposing $u_{it}$ is serially independent, the sample splitting artificially creates independence among the statistics computed from the two subsamples. See Chernozhukov et al. (2019) for a detailed description of this approach.

5.4 Block-rearrangements

In an attempt to handle endogenous missing, Bai and Ng (2019) proposed a block-rearrangement method. The cost of this generality is that the data matrix $Y$ is required to have a sufficiently large balanced sub-block after elementary rearrangements. See Cai et al. (2016); Fan and Kim (2019) for related ideas.

Specifically, a preliminary step of their estimation is to rearrange the data into a shape in which all the factor loadings can be estimated from one sub-block and all the factors can be estimated from another sub-block. The following example, adapted from Bai and Ng (2019), gives a good illustration of this manipulation for an $N\times T$ matrix of $y_{it}$, where asterisks denote missing entries:
$$\begin{pmatrix} y_{11} & y_{12} & y_{13} & y_{14} & y_{15}\\ y_{21} & y_{22}^* & y_{23} & y_{24} & y_{25}\\ y_{31} & y_{32} & y_{33} & y_{34}^* & y_{35}\\ y_{41}^* & y_{42} & y_{43} & y_{44} & y_{45}\\ y_{51} & y_{52} & y_{53} & y_{54} & y_{55} \end{pmatrix} \Longrightarrow \begin{pmatrix} y_{13} & y_{15} & y_{11} & y_{12} & y_{14}\\ y_{53} & y_{55} & y_{51} & y_{52} & y_{54}\\ y_{23} & y_{25} & y_{21} & y_{22}^* & y_{24}\\ y_{33} & y_{35} & y_{31} & y_{32} & y_{34}^*\\ y_{43} & y_{45} & y_{41}^* & y_{42} & y_{44} \end{pmatrix}.$$
The left matrix is the originally collected data and the right one is the rearranged data. From the column perspective, the 1st, 2nd and 4th columns have missing values and are therefore rearranged as the last three columns in the right panel; from the row perspective, the 2nd, 3rd and 4th rows have missing values and are therefore rearranged as the last three rows. Bai and Ng (2019) name the balanced upper-left block "bal", the block of fully observed columns (spanning all rows) "tall", and the block of fully observed rows (spanning all columns) "wide".

Consider a missing value $y^*$. We want to replace it with its expected value $E(y^*) = b'f$. Note that $y^*$ shares the same factor loading $b$ with the other data points in its row, and shares the same factor $f$ with the other data points in its column. Meanwhile, $b$ can be estimated using the data in the "tall" block, and $f$ can be estimated using the data in the "wide" block. As a result, one might expect to recover $E(y^*)$ with these two estimators. However, we must take into account the rotational indeterminacy inherent in factor models. For a generic missing value $y_{it}$,
$$\hat b_{\mathrm{tall},i} = H_{\mathrm{tall}}'b_i + o_P(1), \qquad \hat f_{\mathrm{wide},t} = H_{\mathrm{wide}}^{-1}f_t + o_P(1).$$
Therefore
$$b_i'f_t = \hat b_{\mathrm{tall},i}'A\hat f_{\mathrm{wide},t} + o_P(1), \qquad A := H_{\mathrm{tall}}^{-1}H_{\mathrm{wide}}.$$
To estimate $A$, by $\hat f_{\mathrm{wide},t} = H_{\mathrm{wide}}^{-1}f_t + o_P(1)$ and $\hat f_{\mathrm{tall},t} = H_{\mathrm{tall}}^{-1}f_t + o_P(1)$, we have
$$\hat f_{\mathrm{tall},t} = A\hat f_{\mathrm{wide},t} + o_P(1).$$
So one can run the regression of $\hat f_{\mathrm{tall},t}$ on $\hat f_{\mathrm{wide},t}$ to consistently estimate $A$. This leads to the following estimation procedure.
Algorithm 5.2. Block-rearrangement algorithm

Step 1 Obtain estimators $(\hat B_{\mathrm{tall}}, \hat F_{\mathrm{tall}})$ using the tall block of $Y$.

Step 2 Obtain estimators $(\hat B_{\mathrm{wide}}, \hat F_{\mathrm{wide}})$ using the wide block of $Y$.

Step 3 Compute $\hat C_{\mathrm{miss}} = \hat B_{\mathrm{tall}}A\hat F_{\mathrm{wide}}'$, where $A$ is obtained by regressing $\hat f_{\mathrm{tall},t}$ on $\hat f_{\mathrm{wide},t}$.

Step 4 Output $\tilde Y$, where $\tilde y_{it} = y_{it}$ if $y_{it}$ is observed, and $\tilde y_{it} = \hat c_{\mathrm{miss},it}$ if $y_{it}$ is missing.

Once $\tilde Y$ is obtained, we apply PCA again to the imputed data $\tilde Y$ to get more efficient estimates of $B$ and $F$. Suppose the size of the "tall" block is $N\times T_1$ and the size of the "wide" block is $N_1\times T$, so that the size of the "bal" block is $N_1\times T_1$, and the whole sample size (including the missing data points) is $N\times T$. Bai and Ng require that
$$\max\{\sqrt N, \sqrt T\} = o(N_1), \qquad\text{and}\qquad \max\{\sqrt N, \sqrt T\} = o(T_1).$$
An implication of this condition is that the missing data points should not be too frequent, in the sense that the balanced sub-block is large enough. Though this condition rules out the case of random missing (e.g., missing entries occurring as outcomes of Bernoulli trials), it is not stringent given the nature of endogenous missing.
We conduct a simulation study to compare five matrix completion approaches, namely:

IPW. The inverse probability weighting.

ReUW. Unweighted regularization: the singular vectors of the estimator (5.1).

ReW. Weighted regularization: the singular vectors of the estimator (5.3).

ReDebias. The debiased regularized estimator from Algorithm 5.1.

EM. The EM algorithm.

We generate a two-factor model in which the loadings, factors and $u_{it}$ are independent standard normal. Under the homogeneous missing, we generate $x_{it}\sim$ Bernoulli$(p)$ with a fixed observation probability $p$; under the heterogeneous missing, $x_{it}\,|\,w_i \sim$ Bernoulli$(w_i)$, with $w_i$ drawn from a uniform distribution on a subinterval of $(0,1)$. As for $\lambda$, the tuning parameter, write the penalized loss function as $\|(W^{-1/2}Y - W^{-1/2}M)\circ X\|_F^2 + \lambda\|M\|_n$, where $W$ is a diagonal weighting matrix. The theory requires that, with high probability, there is $c > 1$ such that
$$c\,\|(W^{-1/2}U)\circ X\| < \lambda.$$
So we set $\lambda$ to be the 0.95 quantile of a constant multiple (slightly above 2) of $\|(W^{-1/2}Z)\circ X\|$, where $Z$ is an $N\times T$ matrix of independent standard normal variables. In practice, one can also simulate $Z$ using the estimated idiosyncratic covariance matrix.

Table 2: Comparison among five matrix completion methods
                          N     T      IPW     ReUW    ReW     ReDebias   EM
  Homogeneous missing     100   200    0.176   0.116   0.114   0.109      0.109
                          200   100    0.252   0.171   0.169   0.161      0.161
  Heterogeneous missing   100   200    0.263   0.211   0.131   0.119      0.119
                          200   100    0.369   0.304   0.222   0.204      0.203

Note: reported is $\|P_{\hat B} - P_B\|$ averaged over 100 replications.

We compare the performance in estimating the loading space, measured through the projection matrix $P_B = B(B'B)^{-1}B'$. Table 2 reports $\|P_{\hat B} - P_B\|$ averaged over 100 replications for each method. In all scenarios, the IPW performs the worst among all estimators. Under the homogeneous missing, the other four methods perform similarly, but the differences are much more noticeable under the heterogeneous missing. The general ranking is that
$$\mathrm{IPW} \prec \mathrm{ReUW} \prec \mathrm{ReW} \prec \mathrm{ReDebias} \approx \mathrm{EM}.$$
This ranking is as expected: IPW is the least efficient method among the five; ReUW uses the nuclear-norm regularized estimation, which accounts for neither the heterogeneous missing probabilities nor the shrinkage bias; ReW accounts for the heterogeneous missing probabilities, and ReDebias further removes the regularization bias.

Finally, it is not surprising to see that ReDebias and EM perform similarly, because both start with an initial low-rank estimator (ReDebias initializes from ReW, while EM initializes from IPW) and then proceed via iterative least squares. But we note that ReDebias operates much faster because it iterates only once, so it is more attractive than EM in handling large-scale problems. We also implemented an "early-stop-EM" (which iterates only twice); it performs only slightly better than IPW and worse than all the other estimators. We therefore conclude that ReDebias is a recommended method for handling large-scale low-rank matrix completion problems.
We have provided a selective overview of the recent developments of the factor model and its applications in statistical learning, focusing on the low-rank structure of factor models and, in particular, on estimating the model from the low-rank recovery point of view. New estimation and inference methods, as well as matrix completion problems, have been discussed.
A Technical details
A.1 Proof of Theorem 2.1
Proof. (i) The proof is an exercise in applying eigen-perturbation theorems. First, by the triangle inequality, for $a_N := 2\eta_N\|L\| + \eta_N^2 + 3\|L\|\|S\|$,
$$\|\hat\Sigma\hat\Sigma' - LL'\| \le \|\Sigma\Sigma' - \hat\Sigma\hat\Sigma'\| + \|\Sigma\Sigma' - LL'\| = O_P(a_N).$$
So by Weyl's theorem (cited in Theorem A.2),
$$\max_{i\le r+1}|\lambda_i(\hat\Sigma)^2 - \lambda_i(L)^2| = O_P(a_N).$$
Also, $g_N := \min_{2\le i\le r+1}|\lambda_{i-1}(L)^2 - \lambda_i(L)^2| \gg a_N$ implies $\lambda_r(L)^2 \gg a_N$ and
$$\min_{2\le i\le r+1}|\lambda_{i-1}(L)^2 - \lambda_i(L)^2| \ge \lambda_r(L)\min_{2\le i\le r+1}|\lambda_{i-1}(L) - \lambda_i(L)| \gg a_N.$$
So with probability approaching one,
$$\max_{i\le r}|\lambda_i(\hat\Sigma) - \lambda_i(L)| \le \frac{\max_{i\le r}|\lambda_i(\hat\Sigma)^2 - \lambda_i(L)^2|}{\min_{i\le r}\sqrt{\lambda_i(L)^2 - |\lambda_i(\hat\Sigma)^2 - \lambda_i(L)^2|}} = O_P\Big(\frac{a_N}{\lambda_r(L)}\Big), \qquad \mathrm{(A.1)}$$
$$\min_{2\le i\le r}|\lambda_{i-1}(\hat\Sigma)^2 - \lambda_i(L)^2| \ge \min_{2\le i\le r}|\lambda_{i-1}(L)^2 - \lambda_i(L)^2| - |\lambda_{i-1}(\hat\Sigma)^2 - \lambda_{i-1}(L)^2| \ge \frac12\,g_N. \qquad \mathrm{(A.2)}$$
Similarly, for all $i \le r - 1$, $|\lambda_i(L)^2 - \lambda_{i+1}(\hat\Sigma)^2| \ge \frac12 g_N$. Now for $i = r$, $\lambda_{r+1}(\hat\Sigma)^2 \le \lambda_{r+1}(L)^2 + O_P(a_N) = O_P(a_N)$, so $|\lambda_r(L)^2 - \lambda_{r+1}(\hat\Sigma)^2| \ge \frac12\lambda_r(L)^2$. Hence by the sin-theta theorem (cited in Theorem A.2), applied to $\hat\Sigma\hat\Sigma'$ and $LL'$,
$$\max_{i\le r}\|\hat\xi_i - \xi_i\| = O_P\Big(\frac{a_N}{g_N}\Big). \qquad \mathrm{(A.3)}$$
The right singular vectors obey the same bound. Then (A.1) and (A.3) together imply
$$\|\hat L - L\| = O_P\Big(\frac{a_N\|L\|}{g_N}\Big).$$
Finally, we note that $\|L\|^2 = \sum_{i=2}^{r+1}\big[\lambda_{i-1}(L)^2 - \lambda_i(L)^2\big] \le rg_N$. So $a_N = o_P(g_N)$ is satisfied as long as $\|S\| + \eta_N = o_P(g_N^{1/2})$, in which case $a_N = O_P(g_N^{1/2}\eta_N + g_N^{1/2}\|S\|)$.

(ii) The element-wise bound is a corollary of the more general bound in Theorem A.3. Using the notation of Theorem A.3, under the assumptions we have $\|\Sigma\|_\infty = O_P(1)$ and $s_N = O_P(\sqrt N)$; also $m_N = O_P(N^{-1/2} + T^{-1/2})$, $\|L\|_\infty = O_P(1)$, and $Nc_N = o_P(g_N)$. Hence
$$b_N \le \Big(\sqrt N + \frac{N}{\sqrt T}\Big)g_N^{-2}(\eta_N + \|S\|) + \Big(c_N\sqrt N + c_N\frac{N}{\sqrt T} + \max_{d\le r}\|S\zeta_d\|_\infty\Big)g_N^{-1}.$$

A.2 Proof of Theorem 3.1
Proof.
Let $Q(L, \Sigma_u)$ denote the loss function. Note that
$$\|S_y - \hat L - \hat\Sigma_u\|_F^2 = \|S_y - \Sigma_y\|_F^2 + \|\hat L + \hat\Sigma_u - L - \Sigma_u\|_F^2 + M_1 + M_2,$$
$$M_1 = 2\,\mathrm{tr}\big((S_y - \Sigma_y)(L - \hat L)'\big), \qquad M_2 = 2\,\mathrm{tr}\big((S_y - \Sigma_y)(\Sigma_u - \hat\Sigma_u)'\big).$$
We now use two types of inequalities to bound $M_1$ and $M_2$. As for $M_1$, noting that $L$ is a low-rank matrix, we use the inequality $|\mathrm{tr}(AB')| \le \|A\|\,\|B\|_n$. We thus have
$$|M_1| \le 2\|S_y - \Sigma_y\|\,\|L - \hat L\|_n \le 0.5\,\nu_1\|L - \hat L\|_n.$$
As for $M_2$, noting that $\Sigma_u$ is a sparse matrix, we use the inequality $|\mathrm{tr}(AB')| \le \|A\|_\infty\|B\|_1$:
$$|M_2| \le 2\|S_y - \Sigma_y\|_\infty\,\|\Sigma_u - \hat\Sigma_u\|_1 \le 0.5\,\nu_2\|\Sigma_u - \hat\Sigma_u\|_1.$$
From Lemma 2.3 of Recht et al. (2010), $\|L + \mathcal{P}(A_1)\|_n = \|L\|_n + \|\mathcal{P}(A_1)\|_n$, where $A_1 = \hat L - L$. Also, using the standard sparsity argument, $\|\Sigma_u + (A_2)_{J^c}\|_1 = \|\Sigma_u\|_1 + \|(A_2)_{J^c}\|_1$, where $A_2 = \hat\Sigma_u - \Sigma_u$. In addition, Lemma 1 of Negahban and Wainwright (2011) shows $\mathrm{rank}(\mathcal{M}(A_1)) \le 2r$, where $r := \mathrm{rank}(L)$. Hence
$$\|\hat L\|_n = \|L + \mathcal{P}(A_1) + \mathcal{M}(A_1)\|_n \ge \|L\|_n + \|\mathcal{P}(A_1)\|_n - \|\mathcal{M}(A_1)\|_n,$$
$$\|\hat\Sigma_u\|_1 = \|\Sigma_u + (A_2)_{J^c} + (A_2)_J\|_1 \ge \|\Sigma_u\|_1 + \|(A_2)_{J^c}\|_1 - \|(A_2)_J\|_1,$$
$$\|\mathcal{M}(A_1)\|_n \le \|\mathcal{M}(A_1)\|_F\sqrt{\mathrm{rank}(\mathcal{M}(A_1))} \le \sqrt{2r}\,\|A_1\|_F, \qquad \|(A_2)_J\|_1 \le \|A_2\|_F\sqrt{|J| + N}.$$
Thus $Q(\hat L, \hat\Sigma_u) \le Q(L, \Sigma_u)$ implies
$$\|A_1 + A_2\|_F^2 + 0.5\,\nu_1\|\mathcal{P}(A_1)\|_n + 0.5\,\nu_2\|(A_2)_{J^c}\|_1 \le 1.5\,\nu_1\|\mathcal{M}(A_1)\|_n + 1.5\,\nu_2\|(A_2)_J\|_1.$$
As such, $(A_1, A_2) \in \mathcal{C}(\nu_1, \nu_2)$, and thus $\|A_1 + A_2\|_F^2 \ge \kappa(\nu_1, \nu_2)(\|A_1\|_F^2 + \|A_2\|_F^2)$. Then
$$\kappa(\nu_1, \nu_2)(\|A_1\|_F^2 + \|A_2\|_F^2) \le 1.5\,\nu_1\|\mathcal{M}(A_1)\|_n + 1.5\,\nu_2\|(A_2)_J\|_1 \le \sqrt{4.5r}\,\nu_1\|A_1\|_F + 1.5\,\nu_2\sqrt{|J| + N}\,\|A_2\|_F.$$
The last inequality then implies $\|A_1\|_F + \|A_2\|_F \le \kappa(\nu_1, \nu_2)^{-1}\big(\sqrt{4.5r}\,\nu_1 + 1.5\,\nu_2\sqrt{|J| + N}\big)$.

A.3 Proof of Theorem 3.2
A.3.1 Proof of Theorem 3.2 (i)
Proof.
First, we have $|E\tilde y_{it} - Ey_{it}| \le E(|y_{it}| - \tau_i)1\{|y_{it}| > \tau_i\} \le \sigma\sqrt{E|y_{it}|^q}\,\tau_i^{-q/2}$ for any $q \ge 2$, and $E|\tilde y_{it}|^q \le \tau_i^{q-2}\sigma^2$. As such we can apply the Bernstein inequality (Theorem A.1) to reach:
$$\max_{i\le N}P\left(|\bar{\tilde y}_i - E\tilde y_{it}| > \sqrt{\frac{2\sigma^2x}{T}} + \frac{\tau_ix}{T}\right) < 2\exp(-x).$$
Then take $x = 2\log N$; by the union bound, with probability at least $1 - 2N^{-1}$,
$$\max_{i\le N}|\bar{\tilde y}_i - E\tilde y_{it}| \le \sqrt{\frac{4\sigma^2\log N}{T}} + 4\max_{i\le N}\tau_i\frac{\log N}{T}.$$
The rest of the proof is conditioned on this event. Together we have
$$\max_{i\le N}|\bar{\tilde y}_i - Ey_{it}| \le \sqrt{\frac{4\sigma^2\log N}{T}} + \max_{i\le N}f(\tau_i), \qquad f(\tau) := 4\tau\frac{\log N}{T} + 2\sigma\sqrt M\,\tau^{-q/2}.$$
Set $\tau_i \asymp \kappa(\sigma^2M)^{1/(2+q)}$, which is proportional to the minimizer of $f(\tau)$. Here $\kappa := \big(\frac{T}{\log N}\big)^{1/(1+q/2)}$. Because $\log N \le CT$, we have $\kappa\frac{\log N}{T} \le C\big(\frac{\log N}{T}\big)^{1/2}$ for some constant $C$, as long as $q \ge 2$. So
$$\max_{i\le N}|\bar{\tilde y}_i - Ey_{it}| \le \big(M^{1/(2+q)}C + 2\big)\sigma\sqrt{\frac{\log N}{T}}.$$
A.3.2 Proof of Theorem 3.2 (ii)
Proof.
Let $\mu_{i,\tau} := \arg\min_\mu E\psi_{\tau_i}(y_{it} - \mu)$. We separately consider the "variance" $\hat y_i - \mu_{i,\tau}$ and the "bias" $\Delta_i := \mu_{i,\tau} - Ey_{it}$. We denote
$$\dot\psi_\tau(x) := \frac{d}{dx}\psi_\tau(x) = \begin{cases} 2x\tau^{-1}, & |x| < \tau,\\ 2\,\mathrm{sgn}(x), & |x| \ge \tau, \end{cases}$$
which is well defined because $\psi_\tau$ is first-order differentiable (but not twice).

Bias. Bounding the bias requires that $\tau_i$ not grow too slowly. Let $g_\tau(z) := z^2 - \tau\psi_\tau(z)$, so that $\frac{d}{dz}g_\tau(z) = 2z - \tau\dot\psi_\tau(z)$, which vanishes for $|z| < \tau$. Take $z_1 = y_{it} - \mu_{i,\tau}$ and $z_2 = y_{it} - Ey_{it}$. There is $\tilde\mu$ between $\mu_{i,\tau}$ and $Ey_{it}$ such that, for $\tilde z = y_{it} - \tilde\mu$ and $q > 2$,
$$E(z_1^2 - z_2^2) = Eg_{\tau_i}(z_1) - Eg_{\tau_i}(z_2) + \tau_iE\psi_{\tau_i}(z_1) - \tau_iE\psi_{\tau_i}(z_2) \le Eg_{\tau_i}(z_1) - Eg_{\tau_i}(z_2)$$
$$= E\,\frac{d}{dz}g_{\tau_i}(\tilde z)\,(Ey_{it} - \mu_{i,\tau}) \le C|\Delta_i|\,E\,1\{|\tilde z| \ge \tau_i\}|\tilde z| \le C|\Delta_i|\,E|\tilde z|^q\tau_i^{-(q-1)}$$
$$\le C|\Delta_i|\,E|e_{it} + Ey_{it} - \tilde\mu|^q\tau_i^{-(q-1)} \le \tau_i^{-(q-1)}C|\Delta_i|\big[E|e_{it}|^q + |\Delta_i|^q\big].$$
On the other hand, $E(z_1^2 - z_2^2) = \Delta_i^2$. Hence $\Delta_i^2 \le \tau_i^{-(q-1)}C|\Delta_i|[E|e_{it}|^q + |\Delta_i|^q]$. Now consider two cases. Case 1: $|\Delta_i|^q > E|e_{it}|^q$. Then $\Delta_i^2 \le C\tau_i^{-(q-1)}|\Delta_i|^{q+1}$, which implies $\tau_i \le C|\Delta_i| \le C(|\mu_{i,\tau}| + |Ey_{it}|)$, contradicting $\tau_i \to \infty$. Case 2: $|\Delta_i|^q \le E|e_{it}|^q$. Then $|\Delta_i| \le C\tau_i^{-(q-1)}E|e_{it}|^q \to 0$, given $E|e_{it}|^q < \infty$.

Variance. Bounding the variance requires that $\tau_i$ not grow too fast. Denote the loss function $Q_i(\mu) := \frac1T\sum_{t=1}^T\psi_{\tau_i}(y_{it} - \mu)$. Fix $m_T = \sqrt{\frac{\log N}{T}}$. We aim to show that there is $\delta > 0$ so that
$$P\Big(\inf_{|\nu| = \delta}\min_{i\le N}\,Q_i(\mu_{i,\tau} + m_T\nu) - Q_i(\mu_{i,\tau}) > 0\Big) > 1 - O(N^{-1}), \qquad \mathrm{(A.4)}$$
which then implies, with probability at least $1 - O(N^{-1})$, that $\max_{i\le N}|\hat y_i - \mu_{i,\tau}| \le m_T\delta$. To prove (A.4), note $E\dot\psi_{\tau_i}(e_{it,\tau}) = 0$, where $e_{it,\tau} := y_{it} - \mu_{i,\tau}$. Now let $e_{it} = y_{it} - Ey_{it}$; then $e_{it} = e_{it,\tau} + \Delta_i$, where $\Delta_i = \mu_{i,\tau} - Ey_{it}$. It can be verified that for any $x, x_1, x_2$,
$$\dot\psi_{\tau_i}(e_{it,\tau} + x) - \dot\psi_{\tau_i}(e_{it,\tau}) = 2x\tau_i^{-1} + a_{it}(x)b_{it}(x),$$
$$\psi_\tau(e_{it,\tau} + x) - \psi_\tau(e_{it,\tau}) = \dot\psi_\tau(e_{it,\tau})x + \int_0^x\big[\dot\psi_\tau(e_{it,\tau} + z) - \dot\psi_\tau(e_{it,\tau})\big]dz,$$
$$|\dot\psi_\tau(x_1) - \dot\psi_\tau(x_2)| \le 2\tau^{-1}|x_1 - x_2|,$$
where $a_{it}(x) = \dot\psi_{\tau_i}(e_{it,\tau} + x) - \dot\psi_{\tau_i}(e_{it,\tau}) - 2x\tau_i^{-1}$ and $b_{it}(x) = 1\{|e_{it,\tau} + x| \vee |e_{it,\tau}| \ge \tau_i\}$; $a\vee b = \max\{a, b\}$. Also, $|a_{it}(x)| \le 4|x|\tau_i^{-1}$. Applying these results with $x = -m_T\nu$, we have
$$Q_i(\mu_{i,\tau} + m_T\nu) - Q_i(\mu_{i,\tau}) = \frac1T\sum_{t=1}^T\big[\psi_{\tau_i}(e_{it,\tau} + x) - \psi_{\tau_i}(e_{it,\tau})\big]$$
$$= \frac1T\sum_{t=1}^T\dot\psi_{\tau_i}(e_{it,\tau})x + \frac1T\sum_{t=1}^T\int_0^x2z\tau_i^{-1}dz + \frac1T\sum_{t=1}^T\int_0^xa_{it}(z)b_{it}(z)dz\,1\{x > 0\} - \frac1T\sum_{t=1}^T\int_x^0a_{it}(z)b_{it}(z)dz\,1\{x \le 0\}$$
$$\ge x^2\tau_i^{-1} - |x|\underbrace{\max_{i\le N}\Big|\frac1T\sum_{t=1}^T\dot\psi_{\tau_i}(e_{it,\tau})\Big|}_{I} - 4\tau_i^{-1}\underbrace{\max_{i\le N}\frac1T\sum_{t=1}^T\int_0^{|x|}z\,b_{it}(z)dz}_{II}.$$
To bound $I$, we apply the Bernstein inequality. Note $|\dot\psi_{\tau_i}(e_{it,\tau})| \le \min\{2(|e_{it}| + |\Delta_i|)\tau_i^{-1},\ 2\}$. Hence $E\dot\psi_{\tau_i}^2(e_{it,\tau}) \le 4\tau_i^{-2}(\sigma^2 + 1)$ and $E|\dot\psi_{\tau_i}(e_{it,\tau})|^k \le 4\tau_i^{-2}(\sigma^2 + 1)2^{k-2}$, where we used $|\Delta_i| < 1$ and the fact that $E|x|^k \le Ex^2C^{k-2}$ if $|x| < C$. Hence by Theorem A.1,
$$\max_iP\left(\Big|\frac1T\sum_{t=1}^T\dot\psi_{\tau_i}(e_{it,\tau})\Big| > \sqrt{\frac{8\tau_i^{-2}(\sigma^2 + 1)h}{T}} + \frac{2h}{\tau_iT}\right) < 2\exp(-h).$$
Take $h = 4\log N$; by the union bound, with probability at least $1 - O(N^{-1})$,
$$I \le \sqrt{\frac{32(\sigma^2 + 1)\log N}{\tau_i^2T}} + \frac{8\log N}{\tau_iT} \le C(\sigma + 1)m_T\tau_i^{-1},$$
where $C$ does not depend on $i$; the last inequality holds for $\tau_i \asymp m_T^{-1}$. To bound $II$, note $b_{it}(z) \le 1\{|e_{it}| > \tau_i/4\} + 1\{|\Delta_i| > \tau_i/4\} + 1\{|z| > \tau_i/4\}$. Also, when $|z| \le |x| \le |\nu|m_T \to 0$, we have $1\{|z| > \tau_i/4\} = 0$ and $1\{|\Delta_i| > \tau_i/4\} = 0$, because $m_T \to 0$. We apply the Hoeffding inequality: with probability at least $1 - O(N^{-1})$,
$$II \le \max_{i\le N}\frac{x^2}{2}\cdot\frac1T\sum_{t=1}^T1\{|e_{it}| > \tau_i/4\} \le \max_{i\le N}\frac{x^2}{2}\left(\sqrt{\frac{\log N}{T}} + P(|e_{it}| > \tau_i/4)\right) \le \max_{i\le N}\frac{x^2}{2}\left(\sqrt{\frac{\log N}{T}} + \frac{16\sigma^2}{\tau_i^2}\right) \le \frac{x^2}{8}.$$
Together,
$$Q_i(\mu_{i,\tau} + m_T\nu) - Q_i(\mu_{i,\tau}) \ge \tau_i^{-1}m_T^2|\nu|\big(|\nu| - C(\sigma + 1)\big) > 0.$$
This inequality holds uniformly over all $|\nu| = 4C(\sigma + 1)$ and $i \le N$. Hence with probability at least $1 - O(N^{-1})$,
$$\max_{i\le N}|\hat y_i - \mu_{i,\tau}| \le 4m_TC(\sigma + 1).$$
Combining the bias and variance parts, for $q \ge 2$ and $E|e_{it}|^q < C$,
$$\max_{i\le N}|\hat y_i - Ey_{it}| \le 4m_TC(\sigma + 1) + CE|e_{it}|^qm_T^{q-1} \le Cm_T(\sigma + 1).$$

A.4 Some inequalities
The following theorem is adapted from Theorem 2.10 of Boucheron et al. (2013).
Theorem A.1 (Bernstein inequality). Let $X_1, \ldots, X_T$ be an independent sequence with $\frac1T\sum_{t=1}^TEX_t^2 < \sigma^2$ and $\frac1T\sum_{t=1}^TE|X_t|^q \le \frac{q!}{2}\sigma^2c^{q-2}$ for all integers $q > 2$, with constants $(\sigma^2, c)$. Then for all $x > 0$,
$$P\left(\Big|\frac1T\sum_{t=1}^T(X_t - EX_t)\Big| > \sqrt{\frac{2\sigma^2x}{T}} + \frac{cx}{T}\right) < 2\exp(-x).$$

Theorem A.2 (Eigen-perturbation bounds). Let $\lambda_1 \ge \cdots \ge \lambda_R$ and $\hat\lambda_1 \ge \cdots \ge \hat\lambda_R$ respectively be the eigenvalues of the $N\times N$ positive semi-definite matrices $A$ and $\hat A$, where $R < N$. Also, let $(\xi_1, \ldots, \xi_R)$ and $(\hat\xi_1, \ldots, \hat\xi_R)$ be the corresponding eigenvectors. Then
(i) Sin-theta theorem:
$$\max_{i\le R}\|\hat\xi_i - \xi_i\| \le \frac{\|\hat A - A\|}{\min_{i\le R}\min\{|\hat\lambda_{i-1} - \lambda_i|,\ |\lambda_i - \hat\lambda_{i+1}|\}}.$$
(ii) Weyl's theorem: $\max_{i\le R}|\hat\lambda_i - \lambda_i| \le \|\hat A - A\|$.

Next, we prove a general element-wise deviation bound for singular vectors. We consider the model as described in Theorem 2.1. Let $\hat\zeta$ and $\zeta$ be the matrices of the first $r$ right singular vectors of $\hat\Sigma$ and $L$, and let $\hat\xi$ and $\xi$ be the corresponding left singular vectors.

Theorem A.3. Let $c_N := \|\hat\Sigma - \Sigma\|_\infty$, $\eta_N := \|\hat\Sigma - \Sigma\|$, $s_N := \max_{i\le N}\sum_{k\le N}|\Sigma_{ik}|$, $m_N := \|\zeta\|_\infty \vee \|\xi\|_\infty$ and $g_N := \min_{2\le i\le r+1}|\lambda_{i-1}(L) - \lambda_i(L)|$. Suppose $Nc_N = o_P(g_N)$. Then for
$$b_N := (s_N + N\|L\|_\infty m_N)g_N^{-2}(\eta_N + \|S\|) + \Big(Nc_Nm_N + \max_{d\le r}\|S\zeta_d\|_\infty\Big)g_N^{-1},$$
we have $\|\hat\xi - \xi\|_\infty + \|\hat\zeta - \zeta\|_\infty \le O_P(b_N)$.

Proof. Let $\hat\zeta_d$ and $\zeta_d$ be the $d$th right singular vectors of $\hat\Sigma$ and $L$, for $d \le r$. By definition, $\xi_d = \lambda_d^{-1}(L)L\zeta_d$ and $\hat\xi_d = \lambda_d^{-1}(\hat\Sigma)\hat\Sigma\hat\zeta_d$. So $\|\hat\xi_d - \xi_d\|_\infty \le I + II + III$, where
$$I := \|\lambda_d^{-1}(\hat\Sigma)\hat\Sigma(\hat\zeta_d - \zeta_d)\|_\infty, \quad II := \|\lambda_d^{-1}(\hat\Sigma)(\hat\Sigma - L)\zeta_d\|_\infty, \quad III := \|(\lambda_d^{-1}(\hat\Sigma) - \lambda_d^{-1}(L))L\zeta_d\|_\infty.$$
We shall use the inequality $\|Ab\|_\infty \le \min\{N\|A\|_\infty\|b\|_\infty,\ \max_{i\le N}\|A_i'\|\,\|b\|,\ \|A\|\,\|b\|\}$ for $b \in \mathbb{R}^N$, where $A_i'$ denotes the $i$th row of $A$. Also, part (i) shows $\|\hat\zeta_d - \zeta_d\| = O_P((\eta_N + \|S\|)/g_N)$, the same bound as for the left singular vectors. Then
$$I \le N\lambda_d^{-1}(\hat\Sigma)\|\hat\Sigma - \Sigma\|_\infty\|\hat\zeta_d - \zeta_d\|_\infty + \|\lambda_d^{-1}(\hat\Sigma)\Sigma(\hat\zeta_d - \zeta_d)\|_\infty$$
$$\le O_P(Ng_N^{-1}c_N)\|\hat\zeta_d - \zeta_d\|_\infty + O_P(g_N^{-1})\|\hat\zeta_d - \zeta_d\|\sqrt{\max_{i\le N}\sum_{k\le N}\Sigma_{ik}^2} = O_P(Ng_N^{-1}c_N)\|\hat\zeta_d - \zeta_d\|_\infty + O_P(g_N^{-2}s_N(\eta_N + \|S\|)),$$
$$II \le \lambda_d^{-1}(\hat\Sigma)N\|\hat\Sigma - \Sigma\|_\infty\|\zeta_d\|_\infty + \lambda_d^{-1}(\hat\Sigma)\|S\zeta_d\|_\infty \le O_P(g_N^{-1}Nc_N\|\zeta_d\|_\infty) + O_P(g_N^{-1}\|S\zeta_d\|_\infty),$$
$$III \le |\lambda_d^{-1}(\hat\Sigma) - \lambda_d^{-1}(L)|\,N\|L\|_\infty\|\zeta_d\|_\infty \le O_P(g_N^{-2}(\eta_N + \|S\|))\,N\|L\|_\infty\|\zeta_d\|_\infty.$$
Hence $\|\hat\xi_d - \xi_d\|_\infty \le O_P(Ng_N^{-1}c_N)\|\hat\zeta_d - \zeta_d\|_\infty + O_P(b_N)$. Similarly, $\|\hat\zeta_d - \zeta_d\|_\infty \le O_P(Ng_N^{-1}c_N)\|\hat\xi_d - \xi_d\|_\infty + O_P(b_N)$. Hence, for $\Delta := \|\hat\xi_d - \xi_d\|_\infty + \|\hat\zeta_d - \zeta_d\|_\infty$, we have
$$\Delta \le O_P(Ng_N^{-1}c_N)\Delta + O_P(b_N).$$
Because $Ng_N^{-1}c_N = o_P(1)$, we conclude $\Delta = O_P(b_N)$.

References
Abbe, E. (2017). Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research.
Abbe, E., Fan, J., Wang, K. and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics.
Agarwal, A., Negahban, S., Wainwright, M. J. et al. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Annals of Statistics.
Ahn, S. and Horenstein, A. (2013). Eigenvalue ratio test for the number of factors. Econometrica.
Aït-Sahalia, Y. and Xiu, D. (2017). Using principal component analysis to estimate a high dimensional factor model with high-frequency data. Journal of Econometrics.
Ang, A. and Kristensen, D. (2012). Testing conditional factor models. Journal of Financial Economics.
Antoniadis, A. and Fan, J. (2001). Regularized wavelet approximations. Journal of the American Statistical Association.
Athey, S., Bayati, M., Doudchenko, N., Imbens, G. and Khosravi, K. (2018). Matrix completion methods for causal panel data models. Tech. rep., National Bureau of Economic Research.
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica.
Bai, J. and Li, K. (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics.
Bai, J. and Li, K. (2016). Maximum likelihood estimation and inference for approximate factor models of high dimension. Review of Economics and Statistics.
Bai, J. and Liao, Y. (2016). Efficient estimation of approximate factor models via penalized maximum likelihood. Journal of Econometrics.
Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica.
Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica.
Bai, J. and Ng, S. (2009). Boosting diffusion indices. Journal of Applied Econometrics.
Bai, J. and Ng, S. (2010). Instrumental variable estimation in a data rich environment. Econometric Theory.
Bai, J. and Ng, S. (2017). Principal components and regularized estimation of factor models. arXiv preprint arXiv:1708.08137.
Bai, J. and Ng, S. (2019). Matrix completion, counterfactuals, and factor analysis of missing data. arXiv preprint arXiv:1910.06677.
Bai, J. and Wang, P. (2016). Econometric analysis of large factor models. Annual Review of Economics.
Baltagi, B. H., Kao, C. and Wang, F. (2017). Identification and estimation of a large factor model with structural instability. Journal of Econometrics.
Barigozzi, M. and Cho, H. (2018). Consistent estimation of high-dimensional factor models when the factor number is over-estimated. arXiv preprint arXiv:1811.00306.
Barigozzi, M., Cho, H. and Fryzlewicz, P. (2018). Simultaneous multiple change-point and factor analysis for high-dimensional time series. Journal of Econometrics.
Barigozzi, M. and Luciani, M. (2019). Quasi maximum likelihood estimation and inference of large approximate dynamic factor models via the EM algorithm. arXiv preprint arXiv:1910.03821.
Barras, L., Scaillet, O. and Wermers, R. (2010). False discoveries in mutual fund performance: Measuring luck in estimated alphas. Journal of Finance.
Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica.
Belloni, A., Chernozhukov, V. and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological).
Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
Brillinger, D. R. (1964). A frequency approach to the techniques of principal components, factor analysis and canonical variates in the case of stationary time series.
Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association.
Cai, T., Cai, T. T. and Zhang, A. (2016). Structured matrix completion with applications to genomic data integration. Journal of the American Statistical Association.
Cai, T. and Liu, W. (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association.
Candès, E. J., Li, X., Ma, Y. and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM).
Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'IHP Probabilités et Statistiques, vol. 48.
Chan, K.-S. (1993). Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. Annals of Statistics.
Chen, D., Mykland, P. A. and Zhang, L. (2019a). The five trolls under the bridge: Principal component analysis with asynchronous and noisy high frequency data. Journal of the American Statistical Association.
Chen, E. Y., Tsay, R. S. and Chen, R. (2020a). Constrained factor models for high-dimensional matrix-variate time series. Journal of the American Statistical Association.
Chen, Y., Chi, Y., Fan, J., Ma, C. and Yan, Y. (2020b). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal on Optimization, to appear.
Chen, Y., Fan, J., Ma, C. and Yan, Y. (2019b). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences.
Chen, Y., Fan, J., Ma, C. and Yan, Y. (2020c). Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data. arXiv preprint arXiv:2001.05484.
Cheng, X., Liao, Z. and Schorfheide, F. (2016). Shrinkage estimation of high-dimensional factor models with structural instabilities. The Review of Economic Studies.
Chernozhukov, V., Hansen, C. B., Liao, Y. and Zhu, Y. (2019). Inference for heterogeneous effects using low-rank estimations. Tech. rep., cemmap working paper.
Chudik, A., Pesaran, M. H. and Tosetti, E. (2011). Weak and strong cross-section dependence and estimation of large panels. The Econometrics Journal, C45-C90.
Connor, G. and Linton, O. (2007). Semiparametric estimation of a characteristic-based factor model of stock returns. Journal of Empirical Finance.
Connor, G., Hagmann, M. and Linton, O. (2012). Efficient semiparametric estimation of the Fama-French model and extensions. Econometrica.
Doz, C., Giannone, D. and Reichlin, L. (2011). A two-step estimator for large approximate dynamic factor models based on Kalman filtering. Journal of Econometrics.
Doz, C., Giannone, D. and Reichlin, L. (2012). A quasi-maximum likelihood approach for large, approximate dynamic factor models. The Review of Economics and Statistics.
Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics.
Fan, J., Han, X. and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association.
Fan, J., Ke, Y. and Liao, Y. (2020a). Augmented factor models with applications to validating market risk factors and forecasting bond risk premia. Journal of Econometrics, forthcoming.
Fan, J., Ke, Y., Sun, Q. and Zhou, W.-X. (2019a). FarmTest: Factor-adjusted robust multiple testing with approximate false discovery control. Journal of the American Statistical Association.
Fan, J., Ke, Y. and Wang, K. (2020b). Factor-adjusted regularized model selection. Journal of Econometrics.
Fan, J. and Kim, D. (2019). Structured volatility matrix estimation for non-synchronized high-frequency financial data. Journal of Econometrics.
Fan, J., Li, Q. and Wang, Y. (2017a). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society, Series B.
Fan, J., Li, R., Zhang, C.-H. and Zou, H. (2020c). Statistical Foundations of Data Science. CRC Press, Taylor and Francis, FL.
Fan, J. and Liao, Y. (2020). Learning latent factors from diversified projections and its applications to over-estimated and weak factors. Available at SSRN 3446097.
Fan, J., Liao, Y. and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements (with discussion). Journal of the Royal Statistical Society, Series B.
Fan, J., Liao, Y. and Wang, W. (2016). Projected principal component analysis in factor models. Annals of Statistics.
Fan, J., Liao, Y. and Yao, J. (2015). Power enhancement in high dimensional cross-sectional tests. Econometrica.
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society, Series B.
Fan, J., Wang, D., Wang, K. and Zhu, Z. (2019b). Distributed estimation of principal eigenspaces. Annals of Statistics.
Fan, J., Wang, W. and Zhong, Y. (2018). An ℓ∞ eigenvector perturbation bound and its application to robust covariance estimation. Journal of Machine Learning Research.
Fan, J., Wang, W. and Zhong, Y. (2019c). Robust covariance estimation for approximate factor models. Journal of Econometrics.
Fan, J., Xue, L. and Yao, J. (2017b). Sufficient forecasting using factor models. Journal of Econometrics.
Fan, J. and Zhong, Y. (2018). Optimal subspace estimation using overidentifying vectors via generalized method of moments. arXiv preprint arXiv:1805.02826.
Forni, M., Hallin, M., Lippi, M. and Reichlin, L. (2000). The generalized dynamic factor model: identification and estimation. The Review of Economics and Statistics.
Forni, M., Hallin, M., Lippi, M. and Reichlin, L. (2005). The generalized dynamic factor model: one-sided estimation and forecasting. Journal of the American Statistical Association.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics.
Gagliardini, P., Ossola, E. and Scaillet, O. (2016). Time-varying risk premium in large cross-sectional equity data sets. Econometrica.
Gagliardini, P., Ossola, E. and Scaillet, O. (2019). Estimation of large dimensional conditional factor models in finance. Swiss Finance Institute Research Paper.
Giannone, D., Reichlin, L. and Small, D. (2008). Nowcasting: The real-time informational content of macroeconomic data. Journal of Monetary Economics.
Giglio, S., Liao, Y. and Xiu, D. (2020). Thousands of alpha tests. Review of Financial Studies, forthcoming.
Goncalves, S. and Perron, B. (2018). Bootstrapping factor models with cross sectional dependence.
Hansen, B. E. (2000). Sample splitting and threshold estimation. Econometrica.
Hansen, C. and Liao, Y. (2018). The factor-lasso and k-step bootstrap approach for inference in high-dimensional economic applications. Econometric Theory.
Harvey, C. R. and Liu, Y. (2018). False (and missed) discoveries in financial economics. Tech. rep., Duke University.
Harvey, C. R., Liu, Y. and Zhu, H. (2015). ... and the cross-section of expected returns. Review of Financial Studies.
Imbens, G. W. and Rubin, D. B. (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
Jacod, J. and Protter, P. (2011). Discretization of Processes, vol. 67. Springer Science & Business Media.
Juodis, A. and Sarafidis, V. (2020). A linear estimator for factor-augmented fixed-T panels with endogenous regressors. Tech. rep., Monash University, Department of Econometrics and Business Statistics.
Karabiyik, H., Urbain, J.-P. and Westerlund, J. (2019). CCE estimation of factor-augmented regression models with more factors than observables. Journal of Applied Econometrics.
Ke, Z. T., Fan, J. and Wu, Y. (2015). Homogeneity pursuit. Journal of the American Statistical Association.
Klopp, O., Lounici, K. and Tsybakov, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields.
Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics.
Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics.
Lawley, D. and Maxwell, A. (1971). Factor Analysis as a Statistical Method. 2nd ed. Butterworths, London.
Lee, S., Liao, Y., Seo, M. H. and Shin, Y. (2020). Factor-driven two-regime regression. Annals of Statistics, forthcoming.
Li, H., Li, Q. and Shi, Y. (2017). Determining the number of factors when the number of factors can increase with sample size. Journal of Econometrics.
Li, J., Todorov, V. and Tauchen, G. (2019). Jump factor models in large cross-sections. Quantitative Economics.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association.
Liao, Y. and Yang, X. (2018). Uniform inference for characteristic effects of large continuous-time linear models. Available at SSRN 3069985.
Ludvigson, S. and Ng, S. (2016). A factor analysis of bond risk premia. 313-371.
Ma, S., Goldfarb, D. and Chen, L. (2011). Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming.
Ma, S., Lan, W., Su, L. and Tsai, C.-L. (2020). Testing alphas in conditional time-varying factor models with high-dimensional assets. Journal of Business & Economic Statistics.
Massacci, D. (2017). Least squares estimation of large dimensional threshold factor models. Journal of Econometrics.
McCracken, M. W. and Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics.
Moon, H. R. and Weidner, M. (2018). Nuclear norm regularized estimation of panel regression models. arXiv preprint arXiv:1810.10987.
Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics.
Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics.
Onatski, A. (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. Journal of Econometrics.
Pelger, M. (2019). Large-dimensional factor modeling based on high-frequency observations. Journal of Econometrics.
Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review.
Romano, J. P., Shaikh, A. M. and Wolf, M. (2008). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test.
Romano, J. P. and Wolf, M. (2007). Control of generalized error rates in multiple testing. The Annals of Statistics.
Schott, J. R. (1994). Determining the dimensionality in sliced inverse regression. Journal of the American Statistical Association.
Seo, M. H. and Linton, O. (2007). A smoothed least squares estimator for threshold regression models. Journal of Econometrics.
Stock, J. and Watson, M. (2002a). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association.
Stock, J. and Watson, M. (2002b). Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics.
Stock, J. H. and Watson, M. W. (2016). Dynamic factor models, factor-augmented vector autoregressions, and structural vector autoregressions in macroeconomics. In Handbook of Macroeconomics, vol. 2. Elsevier, 415-525.
Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Su, L., Miao, K. and Jin, S. (2019). On factor models with random missing: EM estimation, inference, and cross validation. SMU Economics and Statistics Working Paper Series, No. 04-2019.
Su, L. and Wang, X. (2017). On time-varying factor models: Estimation and testing. Journal of Econometrics.
Wang, D., Liu, X. and Chen, R. (2019a). Factor models for matrix-valued high-dimensional time series. Journal of Econometrics.
Wang, S., Yang, H. and Yao, C. (2019b). On the penalized maximum likelihood estimation of high-dimensional approximate factor model. Computational Statistics.
Westerlund, J. and Urbain, J.-P. (2013). On the estimation and inference in factor-augmented panel regressions with correlated loadings. Economics Letters.
Xia, D. and Yuan, M. (2019). Statistical inferences of linear forms for noisy matrix completion. arXiv preprint arXiv:1909.00116.
Xiong, R. and Pelger, M. (2019). Large dimensional latent factor modeling with missing observations and applications to causal inference. arXiv preprint arXiv:1910.08273.
Zhu, Z., Wang, T. and Samworth, R. J. (2019). High-dimensional principal component analysis with heterogeneous missingness. arXiv preprint arXiv:1906.12125.