Bridging factor and sparse models
Jianqing Fan
Department of Operations Research and Financial Engineering, Princeton University. E-mail: [email protected]
Ricardo Masini
Center for Statistics and Machine Learning, Princeton University, and Sao Paulo School of Economics, Getulio Vargas Foundation. E-mail: [email protected]
Marcelo C. Medeiros
Department of Economics, Pontifical Catholic University of Rio de Janeiro. E-mail: [email protected]
February 24, 2021
Abstract
Factor and sparse models are two widely used methods to impose a low-dimensional structure in high dimensions. They are seemingly mutually exclusive. In this paper, we propose a simple lifting method that combines the merits of these two models in a supervised learning methodology that allows us to efficiently explore all the information in high-dimensional datasets. The method is based on a very flexible linear model for panel data, called the factor-augmented regression model, with both observable and latent common factors, as well as idiosyncratic components, as high-dimensional covariates. This model not only includes both factor regression and sparse regression as specific models but also significantly weakens the cross-sectional dependence and hence facilitates model selection and interpretability. The methodology consists of three steps. At each step, the remaining cross-sectional dependence can be inferred by a novel test for covariance structure in high dimensions. We develop asymptotic theory for the factor-augmented sparse regression model and demonstrate the validity of the multiplier bootstrap for testing high-dimensional covariance structure. This is further extended to testing high-dimensional partial covariance structures. The theory and methods are further supported by an extensive simulation study and applications to the construction of a partial covariance network of the financial returns of the constituents of the S&P 500 index and a prediction exercise for a large panel of macroeconomic time series from the FRED-MD database.
JEL Codes: C22, C23, C32, C33.
Keywords: Factor models, sparse regression, high-dimensional, supervised learning, hypothesis testing, covariance structure.
Acknowledgments: Medeiros gratefully acknowledges the partial financial support from CNPq and CAPES. We are deeply grateful to Alexander Giessing, Caio Almeida, Claudio Flores, Gilberto Boareto, Gustavo Bulhões, Henrique Pires, Marcelo Fernandes, Michael Wolf, and Nathalie Gimenes for helpful discussions and comments.

1 Introduction
With the emergence of new and large datasets, the correct characterization of the dependence among variables is of substantial importance. Usually, to achieve this goal, the literature has followed two seemingly orthogonal tracks. On the one hand, factor models have become an essential tool to summarize information in large datasets under the assumption that the remaining dependence structure is negligible. For instance, panel factor models are now applied to a wide variety of important applications, ranging from forecasting (macroeconomic) variables and asset pricing models to causal inference in applied microeconomics and network analysis. On the other hand, there have been major advances in parameter estimation in ultra high dimensions under the assumption of sparsity or weak sparsity, that is, that a variable depends only on a (very) small subset of the other variables.

In this paper, we take an alternative route and combine the best of the two worlds described above in order to better characterize the dependence structure of high-dimensional data. More specifically, we consider that the covariance structure of a large set of variables, organized in a panel data format, is characterized as a combination of a factor structure, where factors can be observed, unobserved, or both, and a weakly sparse idiosyncratic component. This formulation is general enough to accommodate a very large number of data generating processes of interest in economics, finance, and related areas. The proposed methodology has two ingredients: a three-step estimation procedure and a new test for structure in high-dimensional (partial) covariance matrices.

The steps of the estimation procedure are as follows. In the first step, we take the original data and remove the effects of any observed factors. These factors can be deterministic terms, such as seasonal dummies and/or trends, or any other observed covariates. The first step can be parametric or nonparametric, low- or high-dimensional.
A latent factor model is then estimated using the residuals from the first stage. Finally, we model the dependence among the idiosyncratic terms as a weakly sparse regression estimated by the Least Absolute Shrinkage and Selection Operator (LASSO). At each step, the null hypothesis of no remaining cross-sectional dependence can be tested by the proposed test for the (partial) covariance structure in high dimensions.
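Although the implementation details are developed later in the paper, the three steps can be sketched in a few lines. The following is a minimal illustrative sketch (our own, not the authors' code), using plain numpy: the number of latent factors is assumed known, and a bare-bones coordinate-descent LASSO stands in for a properly tuned solver.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Bare-bones coordinate-descent LASSO (stand-in for a tuned solver)."""
    _, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

def farm_steps(Y, X_obs, n_factors, lam=0.1):
    """Three-step procedure on a (T x n) panel Y with (T x k) observed covariates.
    Step 1: project out observed covariates.  Step 2: PCA on the residuals with
    the normalization F'F/T = I_r.  Step 3: LASSO of each idiosyncratic
    component on the remaining ones."""
    T, n = Y.shape
    # Step 1: OLS of each series on the observed covariates -> residuals R
    coef, *_ = np.linalg.lstsq(X_obs, Y, rcond=None)
    R = Y - X_obs @ coef
    # Step 2: principal components; top eigenvectors of R R' scaled by sqrt(T)
    _, eigvec = np.linalg.eigh(R @ R.T)
    F = np.sqrt(T) * eigvec[:, -n_factors:]        # (T x r) estimated factors
    Lam = R.T @ F / T                              # (n x r) estimated loadings
    U = R - F @ Lam.T                              # estimated idiosyncratic terms
    # Step 3: sparse regression of each U_i on U_{-i}
    Theta = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        Theta[i, idx] = lasso_cd(U[:, idx], U[:, i], lam)
    return F, Lam, U, Theta
```

The fitted pieces combine into the factor-augmented prediction discussed below. In practice, each LASSO penalty would be chosen by cross-validation or an information criterion rather than fixed in advance.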
Let $\mathbf{Y}_t := (Y_{1t}, \dots, Y_{nt})'$ be a random vector generated by a factor model as $Y_{it} = \boldsymbol{\lambda}_i' \mathbf{F}_t + U_{it}$, for $i = 1, \dots, n$, $t = 1, \dots, T$, where $\boldsymbol{\Sigma} := \mathrm{E}(\mathbf{U}_t \mathbf{U}_t')$, with $\mathbf{U}_t := (U_{1t}, \dots, U_{nt})'$, is not necessarily diagonal. Fix one component of interest $i \in \{1, \dots, n\}$, which serves as a response variable. Consider the following prediction models:
$$\mathcal{M}_1 : \mathrm{E}(Y_{it} \mid \mathbf{Y}_{-it}), \qquad \mathcal{M}_2 : \mathrm{E}(Y_{it} \mid \mathbf{F}_t), \qquad \mathcal{M}_3 : \mathrm{E}(Y_{it} \mid \mathbf{F}_t, \mathbf{U}_{-it}), \qquad (1.1)$$
where $\mathbf{Y}_{-it}$ and $\mathbf{U}_{-it}$ are, respectively, the vectors with the elements of $\mathbf{Y}_t$ and $\mathbf{U}_t$ excluding the $i$-th entry. Note that model $\mathcal{M}_3$ is indeed the factor-augmented regression model, since it is the same as $\mathrm{E}(Y_{it} \mid \mathbf{F}_t, \mathbf{Y}_{-it})$. As the paper will mainly focus on linear regressions, we will refer more specifically to $\widetilde{\mathcal{M}}_3$ below as the factor-augmented regression model.

Suppose that we observe both $\mathbf{F}_t$ and $\mathbf{U}_{-it}$. Which one of the three models above is best in terms of mean squared error (MSE) for prediction? The comparison between $\mathcal{M}_1$ and $\mathcal{M}_2$ is not clear, since it depends, among other things, on the magnitude of $\boldsymbol{\Sigma}$ relative to $\boldsymbol{\Lambda}' \boldsymbol{\Lambda}$, where $\boldsymbol{\Lambda} := (\boldsymbol{\lambda}_1, \dots, \boldsymbol{\lambda}_n)'$. However, since the $\sigma$-algebras generated by $\mathbf{Y}_{-it}$ and $\mathbf{F}_t$ are both included in the $\sigma$-algebra generated by $(\mathbf{F}_t, \mathbf{U}_{-it})$, it is not surprising that $\mathrm{MSE}(\mathcal{M}_3) \leq \min[\mathrm{MSE}(\mathcal{M}_1), \mathrm{MSE}(\mathcal{M}_2)]$. The same holds true if we replace the models in (1.1) by their best linear projections, which we denote by $\widetilde{\mathcal{M}}_j$ for $j \in \{1, 2, 3\}$, since the linear space $\widetilde{\mathcal{M}}_3$ is the largest. In this case, we can explicitly write the "gains" of $\widetilde{\mathcal{M}}_3$ when compared to $\widetilde{\mathcal{M}}_1$ and $\widetilde{\mathcal{M}}_2$:
$$\mathrm{MSE}(\widetilde{\mathcal{M}}_3) - \mathrm{MSE}(\widetilde{\mathcal{M}}_2) = -\,\boldsymbol{\theta}_i' \boldsymbol{\Sigma}_{-i,-i} \boldsymbol{\theta}_i, \qquad \mathrm{MSE}(\widetilde{\mathcal{M}}_3) - \mathrm{MSE}(\widetilde{\mathcal{M}}_1) = -\,\boldsymbol{\Delta}_{1i}' \boldsymbol{\Delta}_{1i} - \boldsymbol{\Delta}_{2i}' \boldsymbol{\Sigma}_{-i,-i} \boldsymbol{\Delta}_{2i},$$
where $\boldsymbol{\theta}_i$ and $\boldsymbol{\beta}_i$ are the coefficients of the projection of $U_{it}$ onto $\mathbf{U}_{-it}$ and the coefficients of the projection of $X_{it}$ onto $\mathbf{X}_{-it}$, respectively; $\boldsymbol{\Sigma}_{-i,-i}$ is $\boldsymbol{\Sigma}$ excluding the $i$-th row and column; $\boldsymbol{\Delta}_{1i} := \boldsymbol{\lambda}_i - \boldsymbol{\Lambda}_{-i}' \boldsymbol{\beta}_i$ and $\boldsymbol{\Delta}_{2i} := \boldsymbol{\beta}_i - \boldsymbol{\theta}_i$.
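As a quick numerical illustration of the ordering above (our own sanity check, not from the paper), one can simulate a factor model with a non-diagonal idiosyncratic covariance and compare the held-out MSE of the three linear projections:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, r, i0 = 4000, 10, 2, 0
Lam = rng.normal(size=(n, r))            # loadings
F = rng.normal(size=(T, r))              # factors
U = rng.normal(size=(T, n))              # idiosyncratic shocks
U[:, i0] += 0.8 * U[:, 1]                # U_{i0} loads on U_1 -> Sigma not diagonal
Y = F @ Lam.T + U

def oos_mse(Z, y, split):
    """OLS on the first `split` observations, squared error on the rest."""
    beta, *_ = np.linalg.lstsq(Z[:split], y[:split], rcond=None)
    return float(np.mean((y[split:] - Z[split:] @ beta) ** 2))

split, y = T // 2, Y[:, i0]
m1 = oos_mse(np.delete(Y, i0, axis=1), y, split)                   # M1: Y_{-i}
m2 = oos_mse(F, y, split)                                          # M2: F only
m3 = oos_mse(np.hstack([F, np.delete(U, i0, axis=1)]), y, split)   # M3: F and U_{-i}
```

With this design, the factor-augmented projection recovers the extra predictability left in $U_1$, so `m3` comes in below both `m1` and `m2`; the gap relative to `m2` is essentially the $\boldsymbol{\theta}_i' \boldsymbol{\Sigma}_{-i,-i} \boldsymbol{\theta}_i$ term above.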
From the previous expressions, it becomes evident that both $\widetilde{\mathcal{M}}_1$ and $\widetilde{\mathcal{M}}_2$ are restrictions on $\widetilde{\mathcal{M}}_3$. Broadly speaking, whenever one does not expect an exact factor model, there are potential gains from taking into account the contribution of the idiosyncratic components $\mathbf{U}_{-it}$. Therefore, we use $\widetilde{\mathcal{M}}_3$ as the base model for the estimation methodology described in Section 2.2.

The contributions of this paper are multi-fold. First, our methodology bridges the gap between two apparently competing methods for high-dimensional modeling; see, for example, the discussion in Giannone et al. (2018) and Fan et al. (2020). This yields a vast number of potential applications and spin-offs. For instance, in Fan et al. (2020), we apply the methods developed here to evaluate the effects of interventions, and we contribute to the literature on synthetic controls and related methods by combining the approaches of Gobillon and Magnac (2016) and Carvalho et al. (2018). Therefore, in our setup both a common factor structure and weak sparsity can coexist.

Second, our results can also serve as a diagnostic and misspecification tool. For panel data models with interactive fixed effects, as in Moon and Weidner (2015) and Bai and Liao (2017), our test can be directly applied to uncover the dependence structure among cross-sectional units before and after accounting for common factor components. If the factor structure is informative enough, we expect the idiosyncratic covariance matrix to be almost sparse. If this is not the case, we may have underestimated the number of factors. One popular application is in asset pricing, as discussed in Gagliardini et al. (2019) and in the empirical section of this paper. There is a huge number of proposed factors, as described in Feng et al. (2020), Giglio and Xiu (2020), and Gu et al. (2020).
We can apply our methodology not only to test for omitted factors but also to estimate network connections among firms, as in Diebold and Yilmaz (2014) and Brownlees et al. (2020). Finally, as a diagnostic tool, our paper tackles the same problem as Gagliardini et al. (2019). However, we take an alternative solution strategy, which relies on a rather different set of hypotheses.

Third, the methodology proposed here contributes to the forecasting literature. For instance, in the second application considered in this paper, we build forecasting models for a large cross-section of macroeconomic variables. We call this method the
FarmPredict. We show that the combination of factors and a sparse regression strongly outperforms the traditional principal component regression, as in Stock and Watson (2002a,b). Therefore,
FarmPredict can be an additional contribution to the forecasting and machine learning toolkit. The method can be easily extended to a multivariate setting, combining factor-augmented vector autoregressions (FAVAR), as in Bernanke et al. (2005), with sparse vector models, as in Kock and Callot (2015) and Masini et al. (2019).

Fourth, we show consistency of factor estimation based on the residuals of a first-step regression. Our results hold for both parametric (linear or nonlinear) and nonparametric first stages. A high-dimensional first stage is also allowed. Note that current results in the literature consider factors estimated from observed data, whereas our derivations allow a much more flexible and general setup (Bai and Ng, 2002, 2003, 2006). More specifically, our methodology accommodates settings where there are both observed and latent factors, as well as trend-stationary data. In the latter case, the trend can be first removed by a (nonparametric) first-stage regression.

Fifth, we also contribute to the LASSO literature. LASSO cannot be model-selection consistent for highly correlated variables. Through the decomposition of covariates into factors and idiosyncratic components, we decorrelate the variables and make the model selection condition much easier to hold; see, for example, Fan et al. (2020). We show consistency of the estimates based on residuals of the previous steps. Our results are derived under restrictions on the population covariance matrix of the data and not on the estimated one, as is usual in many papers; see, for example, van de Geer and Bühlmann (2009).

Finally, we extend the results in Chernozhukov et al. (2013, 2018) to strong-mixing data in order to construct hypothesis tests for covariance and partial covariance structure in high dimensions. This step is necessary for econometric and financial applications.
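As an aside on the decorrelation point in the fifth contribution (our illustration, not the paper's): with a strong common factor, the raw series are heavily cross-correlated, while the residuals from removing the leading principal component are nearly uncorrelated, which is the regime in which LASSO's selection conditions are plausible.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 500, 20
f = rng.normal(size=T)                    # one strong common factor
loadings = 1.0 + rng.uniform(size=n)      # loadings bounded away from zero
Y = np.outer(f, loadings) + rng.normal(size=(T, n))

def max_abs_offdiag_corr(Z):
    C = np.corrcoef(Z, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return float(np.abs(C).max())

# remove the leading principal component (the estimated common factor)
Yc = Y - Y.mean(axis=0)
_, eigvec = np.linalg.eigh(Yc @ Yc.T)
P = eigvec[:, -1:]                        # top eigenvector in the time dimension
U_hat = Yc - P @ (P.T @ Yc)               # idiosyncratic residuals

raw = max_abs_offdiag_corr(Y)             # large: dominated by the factor
defact = max_abs_offdiag_corr(U_hat)      # much smaller after defactoring
```

The contrast between `raw` and `defact` is exactly the decorrelation effect exploited by the three-step procedure.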
As side results, in order to develop the test, we first show consistency of kernel-based estimation of the high-dimensional long-run covariance matrix of a dependent process. This is a new result with important consequences for the theory of high-dimensional regression with dependent errors. We also establish a new consistency result for an estimator of the partial covariance matrix in high dimensions with strong-mixing data. Our proposed tests can be used to infer, for instance, whether the (partial) covariance matrix of a high-dimensional random vector is diagonal or block-diagonal. More generally, we can test any pre-defined structure. Furthermore, we show that the test remains valid when we use the residuals from a previous estimation step to compute the covariance matrix. This result allows us to apply the test to the three-stage estimation procedure proposed in this paper. Although our results are derived under the assumption that the number of factors is known, simulation results presented in the paper provide evidence that the test has good finite-sample properties even when the number of factors is determined by data-driven methods commonly found in the literature.

Over the past years, a vast number of papers have proposed different methods to test for covariance structure in high dimensions. See, for example, Ledoit and Wolf (2002), Chen et al. (2010), Onatski et al. (2013), Cai and Ma (2013), Li and Qin (2014), Cai et al. (2016), Zheng et al. (2019), and Guo and Tang (2020), among many others. To the best of our knowledge, we complement all the previous papers by simultaneously considering high dimensions, strong-mixing data with mild distributional assumptions, and pre-estimation when constructing tests for both covariance and partial covariance structure. Recently, Giessing and Fan (2020) also extended the results in Chernozhukov et al. (2013).
However, their setup is very different from ours, as the authors only consider the case of independent and identically distributed data. For a nice recent review, see Cai (2017).

1.3 Organization of the Paper

In addition to this Introduction, the paper is organized as follows. We present the model setup and assumptions in Section 2. The theoretical results are presented in Section 3, with practical guides given in Section 4. We depict the results of a simulation experiment in Section 5 and discuss the empirical application in Section 6. Section 7 concludes. All proofs are deferred to the Appendix. The Supplementary Material contains additional numerical results. Tables and figures in the Supplementary Material are referenced with an "S" before the number.
1.4 Notation

All random variables (real-valued scalars, vectors, and matrices) are defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We denote random variables by an upper-case letter, $X$ for instance, and their realizations by a lower-case letter, $X = x$. The expected value operator is with respect to the $\mathbb{P}$ law, such that $\mathrm{E}(X) := \int_{\Omega} X(\omega)\,\mathrm{d}\mathbb{P}(\omega)$. Matrices and vectors are written in bold letters, $\mathbf{X}$. Except for the number of factors, $r$, and the number of covariates, $k$, defined below, all other dimensions are allowed to depend on the sample size ($T$). However, we omit this dependency throughout the paper to avoid cluttering the notation prematurely.

We use $\|\cdot\|_p$ to denote the $\ell_p$ norm for $p \in [1, \infty]$, such that for a $d$-dimensional (possibly random) vector $\mathbf{X} = (X_1, \dots, X_d)'$, we have $\|\mathbf{X}\|_p := (\sum_{i=1}^{d} |X_i|^p)^{1/p}$ for $p \in [1, \infty)$ and $\|\mathbf{X}\|_\infty := \sup_{i \leq d} |X_i|$. If $\mathbf{X}$ is an $(m \times n)$ possibly random matrix, then $\|\mathbf{X}\|_p$ denotes the matrix $\ell_p$-induced norm and $\|\mathbf{X}\|_{\max}$ denotes the maximum entry of $\mathbf{X}$ in absolute terms. Note that whenever $\mathbf{X}$ is random, $\|\mathbf{X}\|_p$ for $p \in [1, \infty]$ and $\|\mathbf{X}\|_{\max}$ are random variables. We also reserve the symbol $\|\cdot\|$ without subscript for the Euclidean norm, $\|\cdot\| := \|\cdot\|_2$, for both vectors and matrices.

For any convex function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ such that $\psi(0) = 0$ and $\psi(x) \to \infty$ as $x \to \infty$, and any (real-valued) random variable $X$, we denote its Orlicz norm by $\|X\|_\psi$, which is defined by $\|X\|_\psi := \inf\{C > 0 : \mathrm{E}[\psi(|X|/C)] \leq 1\}$. Since we are only concerned with polynomial and exponential tails, we consider upper bounds on $\|X\|_{\psi_p}$, where $\psi_p \in \Psi$ and
$$\Psi := \left\{\psi : \mathbb{R}_+ \to \mathbb{R}_+ : \psi(x) = x^{p+\epsilon},\ p \geq 1,\ \epsilon > 0;\ \text{or}\ \psi(x) = e^{x^p} - 1,\ p > 0\right\}. \qquad (1.2)$$
Evidently, as opposed to $\|X\|_p$, $\|X\|_{\psi_p}$ is always a non-negative scalar. We do not abide by any convention to apply the Orlicz norm to vectors or matrices, to avoid confusion. We also use extensively the fact that $\|XY\|_{\psi_p} \leq \|X\|_{\psi_{2p}} \|Y\|_{\psi_{2p}}$, where $X$ and $Y$ are two real-valued, not necessarily independent, random variables. For the polynomial case, this is just the Cauchy-Schwarz inequality. For the exponential bounds we have similar results: for instance, when $p = 1$ and $X$ and $Y$ are sub-Gaussian random variables with $\psi_2(x) = \exp(x^2) - 1$, it is not difficult to show that $XY$ is sub-exponential with $\psi_1(x) = \exp(x) - 1$.

For a vector $\mathbf{X}$, $\mathrm{diag}(\mathbf{X})$ denotes the diagonal matrix whose diagonal entries are the elements of $\mathbf{X}$. $\mathbb{1}(A)$ is the indicator function of the event $A$, i.e., $\mathbb{1}(A) = 1$ if $A$ is true and $0$ otherwise. We adopt the Landau big/small $O, o$ notation and the "in probability" $O_P$ and $o_P$ analogues. We say that $x$ is of the same order as $y$, $x \asymp y$, if both $x = O(y)$ and $y = O(x)$. We write $X \asymp_P Y$ if both $X = O_P(Y)$ and $Y = O_P(X)$. Unless stated otherwise, the asymptotics are taken as $T \to \infty$, where $T$ is the time-series dimension, and the $o(1)$ and $o_P(1)$ are with respect to the limit as $T \to \infty$. We denote convergence in probability and in distribution by "$\xrightarrow{p}$" and "$\Rightarrow$", respectively.

2 Model Setup and Assumptions

We apply the test and the three-stage estimation procedure to a very general panel data model, which is rich enough to nest several important cases in economics, finance, and related areas. More specifically, we define the following Data Generating Process (DGP).
Assumption 1 (DGP). The process $\{Y_{it} : 1 \leq i \leq n,\ t \geq 1\}$ is generated by
$$Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + \boldsymbol{\lambda}_i' \mathbf{F}_t + U_{it} =: \boldsymbol{\gamma}_i' \mathbf{X}_{it} + R_{it}, \qquad (2.1)$$
where $\mathbf{X}_{it}$ is a $k$-dimensional observable (random) vector which may also include a constant term, $\mathbf{F}_t$ is an $r$-dimensional vector of common latent factors, and $U_{it}$ is a zero-mean idiosyncratic shock. The unknown parameters are $\boldsymbol{\gamma}_i \in \mathbb{R}^k$, the factor loadings $\boldsymbol{\lambda}_i$, and the covariance matrix of the idiosyncratic shocks. Finally, we assume that $\mathbf{X}_{it}$, $\mathbf{F}_t$, and $U_{it}$ are mutually uncorrelated.

Remark 1.
In Assumption 1, we consider that $k$, the dimension of $\mathbf{X}_{it}$, is finite and fixed, and that the relation between $Y_{it}$ and $\mathbf{X}_{it}$ is linear. This is for the sake of exposition. However, the theoretical results in this paper are written in terms of the consistency rate of the first-step estimation; therefore, the DGP can be made much more general by just changing the rates. (For simplicity, we assume that all units $i$ have the same number of covariates, $k$. The framework can certainly accommodate situations where $k_i$ is a function of $i$.)

Example 1 (Asset Pricing Models). Suppose $Y_{it}$ is the return of an asset $i$ at time $t$ and let $\mathbf{X}_{it} := \mathbf{X}_t$ be a set of $k$ observable risk factors, such as the market returns and/or the Fama-French factors as in, for example, Fama and French (1993) or Fama and French (2015). $\mathbf{F}_t$ can be a set of additional, unobservable, risk factors. Several asset pricing models, such as the Capital Asset Pricing Model (CAPM) or the Arbitrage Pricing Theory (APT) model, are nested in this general framework.

Example 2 (Networks). Model (2.1) also complements the network specifications discussed in Barigozzi and Hallin (2016), Barigozzi and Hallin (2017b), and Barigozzi and Brownlees (2019). Furthermore, the test proposed here can be used to detect network links as in Diebold and Yilmaz (2014) and Brownlees et al. (2020). For example, $Y_{it}$ can be the (realized) volatility of financial assets and $\mathbf{X}_{it} := \mathbf{X}_t$ can be volatility factors as in Brito et al. (2018) and Andreou and Ghysels (2021).

Example 3 (Panel Data Models with Interactive Fixed Effects). Model (2.1) is the exact definition of the panel model with interactive fixed effects considered in Gobillon and Magnac (2016), where the authors propose an alternative to the Synthetic Control method of Abadie and Gardeazabal (2003) and Abadie et al. (2010) to evaluate the effects of regional policies. Model (2.1) is also at the heart of the
FarmTreat method of Fan et al. (2020) and the model discussed in Moon and Weidner (2015).
Example 4 (FAVAR). In the case where the index $i$ represents a different dependent (endogenous) variable and $U_{it}$ is a dependent process, model (2.1) turns out to be equivalent to the Factor-Augmented Vector Autoregressive (FAVAR) model of Bernanke et al. (2005). In this case, $\mathbf{X}_{it}$ may also include lagged dependent variables.

The method proposed here for estimation, inference, and prediction consists of three stages, where at the end of each stage the covariance structure of the residuals is tested.

1. For each $i \in \{1, \dots, n\}$, run the regression $Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + R_{it}$, $t \in \{1, \dots, T\}$, and compute the residuals $\widehat{R}_{it} := Y_{it} - \widehat{\boldsymbol{\gamma}}_i' \mathbf{X}_{it}$. The first stage may consist of a regression on a constant, a deterministic time trend, and seasonal dummies, for instance, or, as in Example 1, a regression on observed factors. After removing the contribution from the observables, we can use the test for the null hypothesis of no remaining (partial) covariance structure to check whether the (partial) covariance of $R_{it}$ is dense or sparse. If it is dense, we move to Step 2; otherwise, we jump directly to Step 3. This first parametric, low-dimensional step can be replaced by a nonlinear/nonparametric regression or by a high-dimensional model when, for example, the number of observed factors is large. This will be discussed further in the subsequent sections.

2. Write $\mathbf{R}_t := (R_{1t}, \dots, R_{nt})'$ and $\mathbf{R}_t = \boldsymbol{\Lambda} \mathbf{F}_t + \mathbf{U}_t$. The second step consists of estimating $\boldsymbol{\Lambda}$ and $\mathbf{F}_t$ for $t = 1, \dots, T$ using $\widehat{\mathbf{R}}_t$ through principal component analysis (PCA), and computing $\widehat{\mathbf{U}}_t = \widehat{\mathbf{R}}_t - \widehat{\boldsymbol{\Lambda}} \widehat{\mathbf{F}}_t$. After estimating the factors and loadings, we apply our testing procedure to test for remaining covariance structure in $\mathbf{U}_t$. The second-step estimation can also be carried out by dynamic factor models, as in Barigozzi and Hallin (2016, 2017, 2020) or Barigozzi et al. (2020).

3. Now, define $\widehat{\mathbf{U}}_{-it} := (\widehat{U}_{1t}, \dots, \widehat{U}_{i-1,t}, \widehat{U}_{i+1,t}, \dots, \widehat{U}_{nt})'$.
The third estimation step consists of a sparse regression to estimate the following model for each $i \in \{1, \dots, n\}$:
$$\widehat{U}_{it} = \boldsymbol{\theta}_i' \widehat{\mathbf{U}}_{-it} + V_{it}, \qquad t \in \{1, \dots, T\}.$$
At the end of Steps 2 and 3, we can conduct the relevant inference on the structures of the covariance or partial covariance matrices. We can also provide updated predictions of future outcomes. We detail those in the next subsection. Also note that the nonzero estimates of $\boldsymbol{\theta}_i$ shed light on the links among idiosyncratic components.

In a pure prediction exercise, one is usually interested in the linear projection of $Y_{it}$ onto $(\mathbf{X}_{it}, \mathbf{F}_t, \mathbf{U}_{-it})$, which results in the factor-augmented regression model (FARM)
$$Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + \boldsymbol{\lambda}_i' \mathbf{F}_t + \boldsymbol{\theta}_i' \mathbf{U}_{-it} + \varepsilon_{it}, \qquad t \in \{1, \dots, T\}, \qquad (2.2)$$
for each given $i$, and $Y_{it}$ can be predicted by
$$\widehat{Y}_{it} := \widehat{\boldsymbol{\gamma}}_i' \mathbf{X}_{it} + \widehat{\boldsymbol{\lambda}}_i' \widehat{\mathbf{F}}_t + \widehat{\boldsymbol{\theta}}_i' \widehat{\mathbf{U}}_{-it}, \qquad i \in \{1, \dots, n\}. \qquad (2.3)$$
This will be called FarmPredict. Note that model (2.2) is equivalent to using the predictors $\mathbf{X}_{it}$, $\mathbf{Y}_{-it}$, and $\mathbf{F}_t$, which augment the predictors $\mathbf{X}_{it}$, $\mathbf{Y}_{-it}$ by the common factors $\mathbf{F}_t$. The form in (2.2) mitigates the collinearity issues in high dimensions.

Model (2.2) also bridges factor regression ($\boldsymbol{\theta}_i = \mathbf{0}$) on one end and (sparse) regression on the other end, with $\boldsymbol{\lambda}_i = \boldsymbol{\Lambda}_{-i}' \boldsymbol{\theta}_i$, where $\boldsymbol{\Lambda}_{-i}$ is the loading matrix without the $i$-th row. In the latter case, model (2.2) becomes a (sparse) regression model:
$$Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + \boldsymbol{\theta}_i' \mathbf{R}_{-it} + \varepsilon_{it}, \qquad t \in \{1, \dots, T\}. \qquad (2.4)$$
In this case, the FARM specification in (2.2) decorrelates the variables $\mathbf{R}_{-it}$ in (2.4). It makes the model-selection consistency much easier to satisfy and forms the basis of FarmSelect in Fan et al. (2020). In general, for FARM (2.2) with sparsity, FarmPredict chooses additional idiosyncratic components to enhance the prediction of the factor regression.

In other applications, the structure of the idiosyncratic components $\mathbf{U} = (U_1, \dots, U_n)'$ is the object of interest. An estimator for $\boldsymbol{\Sigma} = \mathrm{E}(\mathbf{U}_t \mathbf{U}_t')$ could simply be given by
$$\widehat{\boldsymbol{\Sigma}} := \frac{1}{T} \sum_{t=1}^{T} \widehat{\mathbf{U}}_t \widehat{\mathbf{U}}_t'. \qquad (2.5)$$
(In high dimensions, where $n > T$, there are many possible estimators for $\boldsymbol{\Sigma}$ available in the literature; see the book by Fan et al. (2020).)

In order to properly understand the (linear) relation between a pair $(U_{it}, U_{jt})$ of $\mathbf{U}_t$, a simple covariance estimate is sometimes not enough. In applications, it is often desirable to have a direct measure of how $U_{it}$ and $U_{jt}$ are connected. By direct connection, we mean the relation between those units after removing the contribution of the other variables of $\mathbf{U}_t$. For this purpose, we use the partial covariance between $U_{it}$ and $U_{jt}$, defined for any pair $i, j \in \{1, \dots, n\}$ as $\pi_{ij} := \mathrm{E}(V_{ij,t} V_{ji,t})$, where $V_{ij,t} := U_{it} - \mathrm{Proj}(U_{it} \mid \mathbf{U}_{-ij,t})$ and $\mathrm{Proj}(U_{it} \mid \mathbf{U}_{-ij,t})$ denotes the linear projection of $U_{it}$ onto the elements of $\mathbf{U}_t$ excluding entries $i$ and $j$, which we denote by $\mathbf{U}_{-ij,t}$. We suggest estimating the partial covariance matrix $\boldsymbol{\Pi} := (\pi_{ij})$ by $\widehat{\boldsymbol{\Pi}} := (\widehat{\pi}_{ij})$ with
$$\widehat{\pi}_{ij} := \frac{1}{T} \sum_{t=1}^{T} \widehat{V}_{ij,t} \widehat{V}_{ji,t}, \qquad (2.6)$$
where $\widehat{V}_{ij,t}$ is the residual of the LASSO regression of $\widehat{U}_{it}$ onto $\widehat{\mathbf{U}}_{-ij,t}$ for $i, j \in \{1, \dots, n\}$.

We also would like to conduct a formal test on the population structure of $\mathbf{U}_t$. Specifically, we propose a test for the following null hypothesis on the covariance matrix:
$$H_0^{\Sigma} : \boldsymbol{\Sigma}_D = \boldsymbol{\Sigma}_D^0, \qquad D \subseteq \{1, \dots, n\} \times \{1, \dots, n\}, \qquad (2.7)$$
for a given subset $D$, where $\boldsymbol{\Sigma}_D$ denotes the elements of $\boldsymbol{\Sigma}$ indexed by $D$, and we allow $d := |D|$ to diverge as $n, T \to \infty$. For example, to test whether $\boldsymbol{\Sigma}$ is diagonal, $D$ consists of all off-diagonal elements and $\boldsymbol{\Sigma}_D^0 = \mathbf{0}$. To test whether $\boldsymbol{\Sigma}$ is block-diagonal, $D$ can be taken to be the corresponding off-diagonal blocks.
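A schematic version of the partial-covariance estimator $\widehat{\pi}_{ij}$ in (2.6) can be written as follows. This is an illustrative numpy sketch: the plain coordinate-descent LASSO and the diagonal convention are our simplifications, and in practice the penalty would be tuned.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Bare-bones coordinate-descent LASSO (stand-in for a tuned solver)."""
    _, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

def partial_cov(U, lam=0.05):
    """pi_hat_ij = T^{-1} sum_t Vhat_{ij,t} Vhat_{ji,t}, where Vhat_{ij} is the
    LASSO residual of U_i regressed on U_{-ij} (cf. (2.6)).  The diagonal is
    filled with sample variances purely as a convention for this sketch."""
    T, n = U.shape
    Pi = np.zeros((n, n))
    for i in range(n):
        Pi[i, i] = U[:, i] @ U[:, i] / T
        for j in range(i + 1, n):
            keep = [k for k in range(n) if k not in (i, j)]
            Vij = U[:, i] - U[:, keep] @ lasso_cd(U[:, keep], U[:, i], lam)
            Vji = U[:, j] - U[:, keep] @ lasso_cd(U[:, keep], U[:, j], lam)
            Pi[i, j] = Pi[j, i] = Vij @ Vji / T
    return Pi
```

On a panel where only the first two components are directly linked, the off-diagonal entry for that pair stands out while the remaining pairs shrink toward zero.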
Similarly, for testing the structure of the partial covariance matrix,
$$H_0^{\Pi} : \boldsymbol{\Pi}_D = \boldsymbol{\Pi}_D^0, \qquad D \subseteq \{1, \dots, n\} \times \{1, \dots, n\}. \qquad (2.8)$$
The null hypotheses (2.7) and (2.8) nest several cases of interest in applications. The most common would be to test for a diagonal or a block-diagonal structure in $\boldsymbol{\Sigma}$ and/or $\boldsymbol{\Pi}$, but they also accommodate other structures. (With minor changes, the proposed test can also be used to test the null $\mathbf{M}\,\mathrm{vec}(\boldsymbol{\Sigma}) = \mathbf{m}$ for some $(d \times n^2)$ matrix $\mathbf{M}$ and $d$-dimensional vector $\mathbf{m}$, where $d := d_T$ is also a function of $T$.) The task of estimating $\boldsymbol{\Sigma}$ is well documented in the literature, even in high-dimensional setups; see, for example, Ledoit and Wolf (2004, 2012, 2017, 2020), Fan et al. (2008), Lam and Fan (2009), or Fan et al. (2013), and see Ledoit and Wolf (2021a) for a recent survey.

The challenges for testing (2.7) and (2.8) are similar and can be summarized as follows:

1. As we allow both $n$ and $d$ to diverge to infinity as $T$ grows, sometimes at a faster rate, we have a high-dimensional test where some sort of Gaussian approximation result for dependent data must be deployed, as we also allow the number of covariances to be tested ($d$) to diverge. In this case, a high-dimensional long-run covariance matrix must be estimated if one expects to obtain asymptotically correct test size.

2. We do not directly observe $\{\mathbf{U}_t\}$ or $\{V_{ij,t}\}$. Instead, we have estimates of both from a multi-stage estimation procedure, as we illustrate later in this paper.

We propose to test (2.7) using the statistic
$$S_D^{\Sigma} := \|\sqrt{T}(\widehat{\boldsymbol{\Sigma}}_D - \boldsymbol{\Sigma}_D^0)\|_{\max}. \qquad (2.9)$$
The quantiles of $S_D^{\Sigma}$ are approximated by a Gaussian bootstrap approximation. To describe the procedure, let $\boldsymbol{\Upsilon}_{\Sigma}$ denote the $(d \times d)$ covariance matrix of the vectorized submatrix $(\widetilde{\sigma}_{ij})_{(i,j) \in D}$, where $\widetilde{\sigma}_{ij} := \frac{1}{T}\sum_{t=1}^{T} U_{it} U_{jt}$. Since the process $\{\mathbf{U}_t\}$ might present some form of temporal dependence (refer to Assumption 3(c)), we estimate $\boldsymbol{\Upsilon}_{\Sigma}$ using a Newey-West-type estimator.
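Such a kernel-weighted long-run covariance estimate, together with the Gaussian bootstrap critical value for the sup-statistic, can be sketched as follows. This is our illustration: the Bartlett kernel, the bandwidth, and the small diagonal jitter are arbitrary choices made for the example.

```python
import numpy as np

def longrun_cov(D, h):
    """Newey-West-type long-run covariance of the (T x d) moment array D,
    using the Bartlett kernel k(x) = max(1 - |x|, 0) with bandwidth h."""
    T, _ = D.shape
    Dc = D - D.mean(axis=0)
    Ups = Dc.T @ Dc / T
    for ell in range(1, int(h)):
        M = Dc[ell:].T @ Dc[:-ell] / T
        Ups += (1.0 - ell / h) * (M + M.T)
    return Ups

def bootstrap_crit(Ups, tau=0.95, n_boot=2000, seed=0):
    """tau-quantile of ||Z||_inf with Z ~ N(0, Ups), by simulation."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Ups + 1e-10 * np.eye(Ups.shape[0]))
    Z = rng.normal(size=(n_boot, Ups.shape[0])) @ L.T
    return float(np.quantile(np.abs(Z).max(axis=1), tau))

# Toy check of H0: all off-diagonal covariances are zero (here H0 is true)
rng = np.random.default_rng(3)
T, n = 600, 5
U = rng.normal(size=(T, n))
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
D = np.column_stack([U[:, i] * U[:, j] for (i, j) in pairs])
S = np.sqrt(T) * np.abs(D.mean(axis=0)).max()   # statistic (2.9) with Sigma0_D = 0
crit = bootstrap_crit(longrun_cov(D, h=5.0))
reject = S > crit                                # should usually be False under H0
```

Under the null, `S` falls below the bootstrap critical value with probability close to the nominal level; in the paper's setting the moment array would be built from the estimated $\widehat{U}_{it}$ rather than the true shocks.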
For a given integrable kernel function $k(\cdot)$ with $k(0) = 1$ and bandwidth $h > 0$, $\boldsymbol{\Upsilon}_{\Sigma}$ is estimated by
$$\widehat{\boldsymbol{\Upsilon}}_{\Sigma} := \sum_{|\ell| < T} k(\ell/h)\,\widehat{\mathbf{M}}_{\Sigma,\ell}, \qquad \widehat{\mathbf{M}}_{\Sigma,\ell} := \frac{1}{T} \sum_{t=\ell+1}^{T} \widehat{\mathbf{D}}_{\Sigma,t} \widehat{\mathbf{D}}_{\Sigma,t-\ell}', \qquad (2.10)$$
where $\widehat{\mathbf{D}}_{\Sigma,t}$ is a $d$-dimensional vector with entries given by $\widehat{U}_{it}\widehat{U}_{jt} - \widehat{\sigma}_{ij}$ for $(i,j) \in D$. Finally, let $c_{\Sigma}^{*}(\tau)$ be the $\tau$-quantile of the Gaussian bootstrap statistic $S_D^{*} := \|\mathbf{Z}_{\Sigma}^{*}\|_{\infty}$, where $\mathbf{Z}_{\Sigma}^{*} \mid \mathbf{X}, \mathbf{Y} \sim N(\mathbf{0}, \widehat{\boldsymbol{\Upsilon}}_{\Sigma})$. Theorem 4 demonstrates the validity of the Gaussian bootstrap procedure described above; i.e., it states conditions under which the $\tau$-quantile of the test statistic (2.9) can be approximated by $c_{\Sigma}^{*}(\tau)$ in the appropriate sense.

Similarly, the test statistic for (2.8) is given by
$$S_D^{\Pi} := \|\sqrt{T}(\widehat{\boldsymbol{\Pi}}_D - \boldsymbol{\Pi}_D^0)\|_{\max}. \qquad (2.11)$$
Let $\boldsymbol{\Upsilon}_{\Pi}$ denote the $(d \times d)$ covariance matrix of $(\widetilde{\pi}_{ij})_{(i,j) \in D}$, where $\widetilde{\pi}_{ij} := \frac{1}{T}\sum_{t=1}^{T} V_{ij,t} V_{ji,t}$. For a given kernel $K(\cdot) \in \mathcal{K}$ and bandwidth $h > 0$, where the class $\mathcal{K}$ is described below in (3.7), $\boldsymbol{\Upsilon}_{\Pi}$ is estimated by
$$\widehat{\boldsymbol{\Upsilon}}_{\Pi} := \sum_{|\ell| < T} K(\ell/h)\,\widehat{\mathbf{M}}_{\Pi,\ell}, \qquad \widehat{\mathbf{M}}_{\Pi,\ell} := \frac{1}{T} \sum_{t=\ell+1}^{T} \widehat{\mathbf{D}}_{\Pi,t} \widehat{\mathbf{D}}_{\Pi,t-\ell}', \qquad (2.12)$$
where $\widehat{\mathbf{D}}_{\Pi,t}$ is a $d$-dimensional vector with entries given by $\widehat{V}_{ij,t}\widehat{V}_{ji,t} - \widehat{\pi}_{ij}$ for $(i,j) \in D$. Also, let $c_{\Pi}^{*}(\tau)$ be the $\tau$-quantile of the Gaussian bootstrap statistic $S_D^{*} := \|\mathbf{Z}_{\Pi}^{*}\|_{\infty}$, where $\mathbf{Z}_{\Pi}^{*} \mid \mathbf{X}, \mathbf{Y} \sim N(\mathbf{0}, \widehat{\boldsymbol{\Upsilon}}_{\Pi})$. Theorem 5 demonstrates the validity of the Gaussian bootstrap procedure described above; i.e., it states conditions under which the $\tau$-quantile of the test statistic (2.11) can be approximated by $c_{\Pi}^{*}(\tau)$ in the appropriate sense.

3 Theoretical Results

In this section, we collect all the theoretical guarantees for the estimation of model (2.1) by the proposed three-stage method described above. Specifically, Section 3.1 deals with estimation and Section 3.2 with inference on the (partial) covariance structure of $\boldsymbol{\Pi}$.

To present the results in this section, it is convenient to use a more compact notation. For each $i = 1, \dots, n$, we can stack the periods to define the $T$-dimensional vectors $\mathbf{Y}_i := (Y_{i1}, \dots, Y_{iT})'$ and $\mathbf{U}_i := (U_{i1}, \dots, U_{iT})'$. We also define the $(T \times k)$ matrix of covariates $\mathbf{X}_i := (\mathbf{X}_{i1}, \dots, \mathbf{X}_{iT})'$ for each $i = 1, \dots, n$, and the $(T \times r)$ matrix of factors $\mathbf{F} := (\mathbf{F}_1, \dots, \mathbf{F}_T)'$, such that (2.1) can be represented as
$$\mathbf{Y}_i = \mathbf{X}_i \boldsymbol{\gamma}_i + \mathbf{F} \boldsymbol{\lambda}_i + \mathbf{U}_i = \mathbf{X}_i \boldsymbol{\gamma}_i + \mathbf{R}_i, \qquad i = 1, 2, \dots, n, \qquad (3.1)$$
where $\mathbf{R}_i := \mathbf{F} \boldsymbol{\lambda}_i + \mathbf{U}_i$.

When no confusion is likely to arise, we also define, for each $t = 1, \dots, T$, the $n$-dimensional vectors $\mathbf{Y}_t := (Y_{1t}, \dots, Y_{nt})'$ and $\mathbf{U}_t := (U_{1t}, \dots, U_{nt})'$, and the $nk$-dimensional vector $\mathbf{X}_t := (\mathbf{X}_{1t}', \dots, \mathbf{X}_{nt}')'$. Also define the $(n \times nk)$ block-diagonal matrix $\boldsymbol{\Gamma}$ whose diagonal blocks are given by $(\boldsymbol{\gamma}_1', \dots, \boldsymbol{\gamma}_n')$ and the $(n \times r)$ loading matrix $\boldsymbol{\Lambda} := (\boldsymbol{\lambda}_1, \dots, \boldsymbol{\lambda}_n)'$. Then (2.1) can also be represented as
$$\mathbf{Y}_t = \boldsymbol{\Gamma} \mathbf{X}_t + \boldsymbol{\Lambda} \mathbf{F}_t + \mathbf{U}_t = \boldsymbol{\Gamma} \mathbf{X}_t + \mathbf{R}_t, \qquad t = 1, 2, \dots, T, \qquad (3.2)$$
where $\mathbf{R}_t := \boldsymbol{\Lambda} \mathbf{F}_t + \mathbf{U}_t$.

3.1 Estimation

Assumption 2 (Factor Model). Consider:
(a) $\mathrm{E}(\mathbf{F}_t) = \mathbf{0}$, $\mathrm{E}(\mathbf{F}_t \mathbf{F}_t') = \mathbf{I}_r$, and $\boldsymbol{\Lambda}'\boldsymbol{\Lambda}$ is a diagonal matrix;
(b) All eigenvalues of $\boldsymbol{\Lambda}'\boldsymbol{\Lambda}/n$ are bounded away from zero and infinity as $n \to \infty$;
(c) $\|\boldsymbol{\Sigma} - \boldsymbol{\Lambda}\boldsymbol{\Lambda}'\| = O(1)$; and
(d) $\|\boldsymbol{\Lambda}\|_{\max} \leq C$.

Remark 2.
Assumption 2 is standard in the factor model literature. Note also that the assumption that $\mathrm{E}(\mathbf{F}_t) = \mathbf{0}$ is not restrictive, as our approach considers a first-step estimation which may include a constant in the set of regressors. It is also needed for identifiability.

Assumption 3 (Moments and Dependency). There exist a constant $C < \infty$ and a function $\psi_p \in \Psi$ defined in (1.2) such that for all $i = 1, \dots, n$; $\ell = 1, \dots, k$; $s, t = 1, \dots, T$; and $j = 1, \dots, r$:
(a) $\|X_{it\ell}\|_{\psi_p} \leq C$, $\|U_{it}\|_{\psi_p} \leq C$, $\|F_{jt}\|_{\psi_p} \leq C$;
(b) $\big\| \|(\mathbf{X}_i'\mathbf{X}_i/T)^{-1}\|_{\max} \big\|_{\psi_p} \leq C$;
(c) The process $\{(\mathbf{X}_{S,t}, \mathbf{F}_t, \mathbf{U}_t), t \in \mathbb{Z}\}$ is weakly stationary with strong-mixing coefficient $\alpha$ satisfying $\alpha(m) \leq \exp(-cm)$ for some $c > 0$ and for all $m \in \mathbb{Z}$, where $\mathbf{X}_{S,t}$ denotes the vector $\mathbf{X}_t$ after excluding all deterministic (non-random) components;
(d) $\|n^{-1/2}(\mathbf{U}_s'\mathbf{U}_t - \mathrm{E}(\mathbf{U}_s'\mathbf{U}_t))\|_{\psi_p} \leq C$;
(e) $\|n^{-1/2}\sum_{i=1}^{n} \lambda_{j,i} U_{it}\|_{\psi_p} \leq C$; and
(f) $\log n = o\left(T^{p}/[\log T]\right)$.

A few words about Assumption 3 are in order. Our theory is derived in a general setup with respect to the tail behavior of the random variables in the model. In order to present the results in a unified manner for both fat (polynomially decaying) and thin (exponentially decaying) tails, we place our assumptions in terms of an upper bound on the Orlicz norm. In particular, Assumptions 3(a) and 3(c) allow us to apply a Marcinkiewicz-Zygmund-type inequality for partial sums to deal with polynomial tails (Rio (1994) and Doukhan and Louhichi (1999)) and a Bernstein inequality (Merlevède et al. (2009), Theorem 2) to control exponential tails. Moreover, Assumption 3(c) excludes the deterministic components of $\mathbf{X}_t$ to accommodate possibly non-random, non-stationary (but, by (a), uniformly bounded) covariates.

Assumption 3(d) is only used to prove results for the first-stage estimation in case it is performed by ordinary least squares (Theorem 1).
Assumption 3(d) controls the level of cross-sectional dependence among the units. As we allow the number of units to diverge with $T$, some control on this quantity is necessary, and it is not implied by 3(c). Assumption 3(e) plays a similar role to 3(d), but in terms of linear combinations of the idiosyncratic components. Assumption 3(f) only bounds the growth rate of the number of units $n$ to be sub-exponential with respect to $T$. As a matter of fact, this assumption is only binding in the exponential-tail case; otherwise, the rate conditions imposed in the theorems below imply it.

For each $i = 1,\dots,n$, let $R_i := F\lambda_i + U_i$ denote the unobservable error term in (3.1), $\hat\gamma_i$ the least-squares estimator of $\gamma_i$, and $\hat R_i := Y_i - X_i\hat\gamma_i$ the vector of residuals. Also set $\hat R := (\hat R_1,\dots,\hat R_n)'$ and $R := (R_1,\dots,R_n)'$. We must control the least-squares estimation error in the first step of the proposed methodology. The next result gives a bound for the maximum entry of the $(n\times T)$ matrix $\hat R - R$ when the first stage is conducted by OLS in a linear setup.

Theorem 1.
Under Assumptions 3(a)-(d),
$$\max_{i,t}\|\hat R_{it} - R_{it}\|_{\psi_{p/2}} \le \frac{C_{k,\psi}}{\sqrt T} \qquad\text{and}\qquad \|\hat R - R\|_{\max} = O_P\!\left[\frac{\psi_{p/2}^{-1}(nT)}{\sqrt T}\right],$$
where $C_{k,\psi}$ is a constant depending only on $k$ and $\psi_p$.

Remark 3.
In case the first step of the method involves more complicated estimation, we write $\|\hat R - R\|_{\max} = O_P(\omega)$, where $\omega := \omega_{n,T}$ is a non-negative sequence. This will be used in the next theorems.

Define the $(n\times T)$ matrices $Y := (Y_1,\dots,Y_T)$ and $U := (U_1,\dots,U_T)$, and the $(nk\times T)$ matrix $X := (X_1,\dots,X_T)$. We can write (2.1) in matrix form as
$$Y = \Gamma X + \Lambda F' + U. \tag{3.3}$$
Notice that $\hat R = \Lambda F' + \tilde U$, where $\tilde U := U + (\hat R - R)$, and $(\Lambda, F)$ can be estimated by Principal Component Analysis (PCA), which minimizes
$$q(\Lambda, F) := \|\hat R - \Lambda F'\|_F^2 \tag{3.4}$$
with respect to $\Lambda$ and $F$, subject to the normalization $F'F/T = I_r$. The solution $\hat F$ is the matrix whose columns are $\sqrt T$ times the $r$ eigenvectors of the top $r$ eigenvalues of $\hat R'\hat R$, and $\hat\Lambda = \hat R\hat F/T$. Since we do not directly observe $U$, in the third step of our estimation procedure we use $\hat U := \hat R - \hat\Lambda\hat F'$ instead. Therefore, we must control the estimation error in the factor model, given by the $(n\times T)$ matrix $\hat U - U$, which is the main purpose of Theorem 2 below. Also, it is a well-known fact that the loading matrix $\Lambda$ and the factors $F$ are not separately identified, since $\Lambda F_t = \Lambda H'HF_t$ for any matrix $H$ such that $H'H = I_r$. If we let $H := T^{-1}V^{-1}\hat F'F\Lambda'\Lambda$, where $V$ is the $(r\times r)$ diagonal matrix containing the $r$ largest eigenvalues of $\hat R\hat R'/T$ in decreasing order, then $HF_t$ is identified, as $\Lambda F_t$ is identified.

The result below first appeared in Bai (2003) for the case of fixed $(n,T)$ and was further extended to hold uniformly in $(i \le n, t \le T)$ by Fan et al. (2013). Fan et al. (2020) makes the conditions modular. However, both consider the case when the factor model is estimated using the true data, as opposed to an "estimated" one as in our case. Therefore, the next result is a generalization that takes that pre-estimation error term into account.

Theorem 2.
Let $\omega := \omega_{n,T}$ be a non-negative sequence such that $\|\hat R - R\|_{\max} = O_P(\omega)$. Then, under Assumptions 1-3 and $\psi_p^{-1}(n)/\sqrt T + \psi_p^{-1}(nT)\,\omega = O(1)$, we have
(a) $\max_{t\le T}\|\hat F_t - HF_t\| = O_P\!\left[\dfrac{1}{\sqrt T} + \dfrac{\psi_p^{-1}(T)}{\sqrt n} + \omega\,\psi_{p/2}^{-1}(nT)\right]$,
(b) $\max_{i\le n}\|\hat\lambda_i - H\lambda_i\| = O_P\!\left[\dfrac{\psi_{p/2}^{-1}(n)}{\sqrt T} + \dfrac{1}{\sqrt n} + \omega\right]$,
(c) $\|\hat U - U\|_{\max} = O_P\!\left[\dfrac{\psi_p^{-1}(n)\,\psi_p^{-1}(T)}{\sqrt T} + \dfrac{\psi_p^{-1}(T)}{\sqrt n} + \omega\,\psi_{p/2}^{-1}(nT)\right]$.

By setting $\omega = 0$, i.e., no estimation error in the first step, we recover Theorem 4 and Corollary 1 in Fan et al. (2013) under the sub-Gaussian assumption. It is also important to notice that, in order to have the error $\|\hat U - U\|_{\max}$ vanish in probability, the pre-estimation error $\|\hat R - R\|_{\max}$ must be of order (in probability) smaller than $1/\psi_{p/2}^{-1}(nT)$.

We have decided not to replace $\omega$ in Theorem 2 with the rate obtained in Theorem 1, as the latter only applies to the least-squares estimator. In some applications, however, the first step of the procedure could be carried out with a different type of estimator, for instance a penalized adaptive Huber regression (Fan et al., 2017) if the number of features $k$ is comparable to, or even larger than, $T$ and the tails of the distribution are heavy. By stating Theorem 2 in terms of a generic rate, it is easier to account for the effect of a different estimator. By combining Theorems 1 and 2 we have the following corollary.

Corollary 1.
Under the same assumptions of Theorems 1 and 2, for the OLS used in the first stage to obtain $\hat R$, we have
$$\|\hat U - U\|_{\max} = O_P\!\left[\frac{\psi_{p/2}^{-1}(nT)}{\sqrt T} + \frac{\psi_p^{-1}(T)}{\sqrt n}\right].$$
In particular, for the sub-Gaussian case ($\psi(x) = \exp(x^2) - 1$) we have
$$\|\hat U - U\|_{\max} = O_P\!\left[\frac{\log(nT)}{\sqrt T} + \sqrt{\frac{\log T}{n}}\right],$$
and for polynomial tails ($\psi(x) = x^p$),
$$\|\hat U - U\|_{\max} = O_P\!\left[\frac{n^{2/p}}{T^{1/2 - 2/p}} + \frac{T^{1/p}}{\sqrt n}\right].$$

For notational convenience, for each $i \in \{1,\dots,n\}$, consider the split $U = (U_i, U_{-i})$, where $U_i$ is a $T$-dimensional vector and $U_{-i}$ is a $T\times(n-1)$-dimensional matrix. Analogously, we split $\hat U = (\hat U_i, \hat U_{-i})$. Then, for a penalty parameter $\xi \ge 0$, the LASSO objective function can be written, for each $i \in \{1,\dots,n\}$, as
$$\mathcal{L}(\theta) + \mathrm{Penalty}(\theta) := \frac{1}{T}\|\hat U_i - \hat U_{-i}\theta\|^2 + \xi\|\theta\|_1. \tag{3.5}$$
To ensure consistent estimation of $\theta$, a sort of restricted strong convexity of the objective function is required when $n > T$. This, in turn, is ensured in the case of a quadratic loss by bounding the minimum eigenvalue of $\hat U_{-i}'\hat U_{-i}/T$ away from zero, restricted to a cone (refer to Negahban et al. (2012) or Fan et al. (2020) for a thorough discussion). Here, we adopt the compatibility constant defined in van de Geer and Bühlmann (2009). For an index set $S \subseteq \{1,\dots,n\}$ and any $n$-dimensional vector $v$, let $v_S$ be the vector containing only the elements of $v$ indexed by $S$. Thus, $v_S$ has $|S|$ elements, and $S^c := \{1,\dots,n\}\setminus S$ is the complement of $S$.

Definition 1. For an $n\times n$ matrix $M$, a set $S \subseteq \{1,\dots,n\}$, and a scalar $\zeta \ge 0$, the compatibility constant is given by
$$\kappa(M, S, \zeta) := \inf\left\{\frac{\sqrt{|S|}\,\|x\|_M}{\|x_S\|_1} : x \in \mathbb{R}^n,\; \|x_{S^c}\|_1 \le \zeta\,\|x_S\|_1\right\}, \tag{3.6}$$
where $\|x\|_M^2 = x'Mx$. Moreover, we say that $(M, S, \zeta)$ satisfies the compatibility condition if $\kappa(M, S, \zeta) > 0$.

Notice that the square of the compatibility constant is closely related to the minimum eigenvalue of $\Sigma$ restricted to a cone in $\mathbb{R}^n$.

Theorem 3.
Let $\eta := \eta_{n,T}$ be a non-negative sequence such that $\|\hat U - U\|_{\max} = O_P(\eta)$, and consider Assumption 3. For every $\epsilon > 0$ there is a constant $0 < C < \infty$ such that, if the penalty parameter is set to
$$\xi = C\left[\frac{\psi_{p/2}^{-1}(n)}{\sqrt T} + \eta\,\psi_p^{-1}(T)\right]$$
and $s := \max_{i\le n}|S_{0,i}|$, where $S_{0,i} := \{j : \theta_{i,j} \ne 0\}$, obeys
$$s = O\!\left(\kappa^2\left[\eta\,\big(\psi_p^{-1}(nT) + \eta\big) + \frac{\psi_{p/2}^{-1}(n)}{\sqrt T}\right]^{-1}\right),$$
with $\kappa := \min_{i\le n}\kappa_i$ and $\kappa_i := \kappa\big[E(U_{-i}'U_{-i})/T,\, S_{0,i},\, 3\big]$ defined in (3.6), then, for any minimizer $\hat\theta_i$ of (3.5), with probability at least $1 - \epsilon$,
$$T^{-1}(\hat\theta_i - \theta_i)'U_{-i}'U_{-i}(\hat\theta_i - \theta_i) + \xi\,\|\hat\theta_i - \theta_i\|_1 \le \frac{\xi^2 s}{\kappa^2}, \qquad i \in \{1,\dots,n\},$$
where the right-hand side is taken to be $+\infty$ whenever $\kappa = 0$.

Remark 4. Notice that we apply the compatibility condition to the non-random covariance matrix $E(U_{-i}'U_{-i})/T$ instead of the estimated random matrix $\hat U_{-i}'\hat U_{-i}/T$ or the "unobservable" random matrix $U_{-i}'U_{-i}/T$. A careful review of the proofs reveals that the same is true for the gradient of the objective function that defines our parameter via a first-order condition.

Once again, we purposely avoided replacing $\eta$ in Theorem 3 with the rate of Corollary 1, to make it readily applicable to the case when a different type of factor model is used or, as a matter of fact, any other pre-estimation procedure. By plugging the rate of Corollary 1 into $\eta$ we have the next corollary.

Corollary 2. If $\eta$ defined in Theorem 3 is taken to be the rate given by Corollary 1 and the compatibility condition holds, i.e., $\kappa \ge C > 0$, then, under the conditions of Theorem 3,
$$\max_{i\le n}\|\hat\theta_i - \theta_i\|_1 = O_P\!\left[\left(\frac{\psi_p^{-1}(T)\,\psi_{p/2}^{-1}(nT)}{\sqrt T} + \frac{\psi_{p/2}^{-1}(T)}{\sqrt n}\right) s\right].$$

We now obtain the null distributions of our test statistics for the structures of the covariance and the partial covariance. Recall the setup and notation of Section 2.3. In particular, we consider that the kernel $k(\cdot)$ appearing in the covariance estimator defined by (2.10) belongs to the class defined in Andrews (1991), which we reproduce below for convenience:
$$\mathcal{K} := \left\{f : \mathbb{R} \to [-1,1] \;:\; f(0) = 1,\; f(x) = f(-x)\;\forall x \in \mathbb{R},\; \int f^2(x)\,dx < \infty,\; f \text{ is continuous}\right\}. \tag{3.7}$$
It includes most of the well-known kernels used in the density estimation literature, such as the truncated, Bartlett, Parzen, quadratic spectral, and Tukey-Hanning kernels, among others. To avoid confusion, it is worth pointing out that our tuning parameter $h$, also called the bandwidth parameter by Andrews (1991), is supposed to diverge, as opposed to the bandwidth in the density kernel estimation setup, which is expected to shrink towards zero.

Theorem 4.
Let $\eta := \eta_{n,T}$ and $\nu := \nu_{n,T}$ be non-negative sequences such that $\|\hat U - U\|_{\max} = O_P(\eta)$ and $\max_{i,t}\|\hat R_{it} - R_{it}\|_{\psi_p} = O(\nu)$, and let $K \in \mathcal{K}$. Under Assumptions 1-3, if further:
(a) $\{U_t\}$ is a fourth-order stationary process;
(b) $\|\mathrm{diag}(\Upsilon_\Sigma)\|_{\min} \ge c$ for some $c > 0$;
(c) as $h, n, T \to \infty$:
(c.1) $\dfrac{(\log n)^{1/2}\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \dfrac{\sqrt{\log T}\,(\log n)\,\psi_{p/2}^{-1}(n)\,\psi_{p/2}^{-1}(T^{1/2})}{T^{1/2}} = o(1)$;
(c.2) $(\log n)\,h\left[\eta\,\big(\psi_p^{-1}(nT)\big)^2 + \dfrac{\psi_{p/2}^{-1}(n)}{\sqrt T}\right] = o(1)$;
(c.3) $(\log n)\left[\sqrt T\,\eta + r_1\sqrt T + r_2\sqrt n + r_3\,\nu\right] = o(1)$,
where the rates $r_1, r_2, r_3$ are defined in Lemma B.10 and $h > 0$ is the bandwidth parameter of the covariance estimator defined in (2.10); then
$$\|\hat\Upsilon_\Sigma - \Upsilon_\Sigma\|_{\max} = O_P\!\left(h\left[\eta\,\big(\psi_p^{-1}(nT)\big)^2 + \frac{\psi_{p/2}^{-1}(n)}{\sqrt T}\right]\right) = o_P(1),$$
and
$$\sup_{D}\,\sup_{\tau\in(0,1)} \big|P\big(S_\Sigma^D \le c_\Sigma^*(\tau)\big) - \tau\big| = o(1),$$
where the first supremum is over all null hypotheses of the form (2.7) indexed by $D \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$.

Remark 5.
The rate assumptions (c.1)-(c.3) in Theorem 4 may seem overly complicated. However, they are a direct consequence of having the first- and second-step estimation error rates, $\nu$ and $\eta$ respectively, appear explicitly in the final rate, together with the general tail condition expressed through the $\psi_p(\cdot)$ function. This allows the practitioner to adjust the final rate directly, should he or she prefer to employ different intermediate estimators: for instance, a LASSO estimator in the first step in case the number of covariates $k$ is large enough, or PCA variants to estimate the factor model. If we specialize to the sub-Gaussian case and incorporate the rates obtained in Theorem 1 and Corollary 1, we have the following corollary.

Corollary 3.
For the sub-Gaussian case ($\psi(x) = \exp(x^2) - 1$), under Assumptions 1-3 and conditions (a) and (b) of Theorem 4, if the rates $\nu$ and $\eta$ are set to the rates given by Theorem 1 and Corollary 1, respectively, then the conclusion of Theorem 4 holds provided that, as $h, n, T \to \infty$:
(a) $\log n = o(T^{1/3})$;
(b) $h\left[\dfrac{(\log n)^{3}}{\sqrt T} + \dfrac{(\log n)^{5/2}}{\sqrt n}\right] = o(1)$;
(c) $\dfrac{(\log n)(\log T)\,\sqrt T}{n} = o(1)$.

Remark 6.
A careful review of the proof reveals that (c.1) traces back to the Gaussian approximation of the (unobservable) process $\big\{T^{-1/2}\sum_{t=1}^{T}\big[U_tU_t' - E(U_tU_t')\big]\big\}_{T\ge 1}$, whereas (c.3) controls the difference between $U_t$ and $\hat U_t$ and, therefore, takes into account the estimation error of the first and second steps. Note the presence of $\nu$ and $\eta$ in (c.3), which are absent from (c.1). Finally, (c.2) makes sure that the bootstrap constructed in terms of the estimated covariance matrix is close to the bootstrap based on the true covariance. Note the presence of the bandwidth parameter $h$ in (c.2).

Remark 7.
In order to establish the rate of convergence in the last result of Theorem 4, we need an upper bound on the tails of the pre-estimation error, namely $\|\hat Z - Z\|_{\max}$. In fact, we need to control the tails of the factor model estimation to establish uniform bounds on $\|\hat U_{it} - U_{it}\|_{\psi}$, which translate into bounds on $\max_{j,t}\|\hat F_{jt} - F_{jt}\|_{\psi}$ and $\max_{j,i}\|\hat\lambda_{ji} - \lambda_{ji}\|_{\psi}$.

Theorem 5.
Let $\eta := \eta_{n,T}$ and $\nu := \nu_{n,T}$ be non-negative sequences such that $\|\hat U - U\|_{\max} = O_P(\eta)$ and $\max_{i,t}\|\hat R_{it} - R_{it}\|_{\psi_p} = O(\nu)$, and let $K \in \mathcal{K}$ defined by (3.7). Under Assumptions 1-3, if further:
(a) $\{U_t\}$ is a fourth-order stationary process;
(b) $\|\mathrm{diag}(\Upsilon_\Pi)\|_{\min} \ge c$ for some $c > 0$;
(c) as $h, n, T \to \infty$:
(c.1) $\dfrac{(\log n)^{1/2}\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \dfrac{\sqrt{\log T}\,(\log n)\,\psi_{p/2}^{-1}(n)\,\psi_{p/2}^{-1}(T^{1/2})}{T^{1/2}} = o(1)$;
(c.2) $(\log n)\,h\left[s\,\big[\eta + \xi\,\psi_p^{-1}(n)\big]\big(s\,\psi_p^{-1}(nT)\big) + \dfrac{s\,\psi_{p/2}^{-1}(n)}{\sqrt T}\right] = o(1)$;
(c.3) $(\log n)\,s\left[r_1\sqrt T + r_2\sqrt n + r_3\,\nu + \xi\,\psi_p^{-1}(n) + \sqrt T\,\big(\eta + \xi\,\psi_p^{-1}(n)\big)\right] = o(1)$,
where the rates $r_1, r_2, r_3$ are defined in Lemma B.10 and $h > 0$ is the bandwidth parameter of the covariance estimator defined in (2.12); then
$$\|\hat\Upsilon_\Pi - \Upsilon_\Pi\|_{\max} = O_P\!\left(h\left\{s\,\big[\eta + \xi\,\psi_p^{-1}(n)\big]\big(s\,\psi_p^{-1}(nT)\big) + \frac{s\,\psi_{p/2}^{-1}(n)}{\sqrt T}\right\}\right) = o_P(1),$$
and
$$\sup_{D}\,\sup_{\tau\in(0,1)} \big|P\big(S_\Pi^D \le c_\Pi^*(\tau)\big) - \tau\big| = o(1) \quad\text{under } H_0^\Pi,$$
where the first supremum is over all null hypotheses of the form (2.8) indexed by $D \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$.

Remarks and a corollary analogous to Remarks 5-7 and Corollary 3 after Theorem 4 apply to Theorem 5.
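Both tests above are built on a kernel long-run covariance estimator whose bandwidth $h$ diverges with the sample size. As a minimal numerical sketch (not the paper's implementation; the estimators in (2.10) and (2.12) apply to more elaborate statistics), the following code computes $\sum_{|l|<T} k(l/h)\,\hat\Gamma(l)$ with the Bartlett kernel, which is a member of the class $\mathcal{K}$ in (3.7). The AR(1) sanity check and the bandwidth rule $h = \lfloor T^{1/3}\rfloor$ are our own illustrative choices.

```python
import numpy as np

def bartlett_kernel(x):
    # Bartlett kernel: k(0) = 1, symmetric, supported on |x| <= 1 (a member of K).
    x = np.abs(x)
    return np.where(x <= 1.0, 1.0 - x, 0.0)

def long_run_cov(Z, h):
    """Kernel long-run covariance of a (T x d) array Z:
    sum over lags l of k(l/h) * Gamma_hat(l), with Gamma_hat(l) the lag-l autocovariance."""
    Z = Z - Z.mean(axis=0)
    T = Z.shape[0]
    S = Z.T @ Z / T                        # lag-0 term
    for lag in range(1, T):
        w = float(bartlett_kernel(lag / h))
        if w == 0.0:                       # Bartlett weights vanish beyond h lags
            break
        G = Z[lag:].T @ Z[:-lag] / T       # lag-`lag` autocovariance
        S += w * (G + G.T)
    return S

# sanity check on an AR(1) series: true long-run variance = sigma^2 / (1 - phi)^2 = 4
rng = np.random.default_rng(0)
T, phi = 500, 0.5
e = rng.standard_normal(T)
z = np.empty(T)
z[0] = e[0]
for t in range(1, T):
    z[t] = phi * z[t - 1] + e[t]
S = long_run_cov(z[:, None], h=int(T ** (1 / 3)))
print(float(S[0, 0]))   # should be in the vicinity of 4 (a small h biases it downward)
```

Note that, consistently with the discussion after (3.7), the estimator is consistent only when $h \to \infty$ with $T$, in contrast with density estimation, where the bandwidth shrinks.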
Remark 8.
As opposed to the case of testing the covariance, when testing the partial covariance in a high-dimensional setup the sparsity structure plays a role, through the appearance of $s$ in conditions (c.2) and (c.3). Therefore, these assumptions restrict the cases in which the proposed partial covariance test has the correct asymptotic size. For instance, in the case of a completely dense partial covariance structure, i.e., when all the regressors are active in all LASSO regressions, $s$ is likely to be of order $n$ and, therefore, (c.2) and (c.3) are not expected to hold.

4 Guide to Practice
As described before, the methodology in this paper involves three steps. The first step consists of identifying known covariates that we may want to control for. This first step may involve the removal of deterministic trends and seasonal effects, for instance, and can be done either by parametric or nonparametric regressions. It is important to notice, however, that the convergence rates of the estimators in the subsequent steps will be influenced by the convergence rate of the estimator in the first part of the procedure.

After the data are filtered in the first step, one can test for remaining covariance structure. For instance, if the covariance matrix of the filtered data is (almost) diagonal, there is no need to estimate a latent factor structure, and the practitioner may jump directly to the third step of the method.

On the other hand, if the covariance of the first-step filtered data is dense, a latent factor model should be considered and the number of factors must be determined. There are a number of methods proposed in the literature to achieve this goal. In this paper we consider either the eigenvalue ratio test of Ahn and Horenstein (2013) or the information criteria put forward in Bai and Ng (2002). The factors can be estimated by the usual methods.

The last step involves a sparse regression in order to estimate any remaining links between idiosyncratic components. Before running the last step, the practitioner may test for a diagonal covariance matrix of the idiosyncratic terms. If the null is not rejected, there is no need for additional estimation. In case of rejection, the user can proceed with a LASSO regression. We recommend that the penalty term of the LASSO be selected by the Bayesian Information Criterion (BIC), as advocated by Medeiros and Mendes (2016).

Finally, we would like to include a remark about the estimation of the long-run matrices when constructing the statistics for the tests of no remaining covariance structure.
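The second and third steps just described can be put together in a minimal numerical sketch. Everything below is illustrative: the data-generating process is a toy one-factor design of our own, the first step is taken as already done (the residual panel is observed directly), and the LASSO is solved with a plain ISTA loop rather than a production solver.

```python
import numpy as np

def pca_factors(R, r):
    """Step 2: PCA on the (n x T) residual panel R. F_hat has columns equal to
    sqrt(T) times the top-r eigenvectors of R'R (so that F'F/T = I_r), and the
    loadings are Lambda_hat = R F_hat / T."""
    n, T = R.shape
    _, eigvec = np.linalg.eigh(R.T @ R)              # eigenvalues in ascending order
    F_hat = np.sqrt(T) * eigvec[:, -r:][:, ::-1]     # (T x r), top-r eigenvectors
    L_hat = R @ F_hat / T                            # (n x r)
    return F_hat, L_hat

def lasso_ista(X, y, lam, n_iter=500):
    """Step 3 helper: ISTA for (1/T)||y - X b||^2 + lam * ||b||_1."""
    T = len(y)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X / T).max())
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = -2.0 * X.T @ (y - X @ b) / T             # gradient of the quadratic part
        u = b - step * g
        b = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft-thresholding
    return b

# toy design: one strong factor plus a single sparse idiosyncratic link
rng = np.random.default_rng(1)
n, T, r = 20, 400, 1
F = rng.standard_normal((T, r))
Lam = rng.standard_normal((n, r))
U = 0.5 * rng.standard_normal((n, T))
U[0] += 0.9 * U[1]                                   # unit 0 loads on unit 1's shock
R = Lam @ F.T + U                                    # plays the role of the step-1 residuals
F_hat, L_hat = pca_factors(R, r)
U_hat = R - L_hat @ F_hat.T                          # estimated idiosyncratic components
b = lasso_ista(U_hat[1:].T, U_hat[0], lam=0.1)       # sparse regression for unit 0
print(int(np.argmax(np.abs(b))))                     # position 0 (unit 1) should dominate
```

In practice the penalty `lam` would be chosen by BIC, as recommended above; here it is fixed for simplicity.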
Usual methods discussed in the literature can be used here to select the kernel and the bandwidth. In the paper we use the simple Bartlett kernel with bandwidth given as $\lfloor T^{1/3} \rfloor$.

5 Simulations

In this section we report simulation results to assess the finite-sample performance of the methodology described in this paper. The simulations are divided into two parts. In the first one, we evaluate the finite-sample properties of the test for remaining covariance structure. In the second part, we highlight the informational gains from considering both the common factors and the idiosyncratic components.

We simulate 1,000 replications of the following model for various combinations of sample size ($T$) and number of variables ($n$):
$$Y_{it} = \Lambda_i' F_t + W_{it}, \tag{5.1}$$
$$F_t = 0.5\,I_r\,F_{t-1} + E_t, \tag{5.2}$$
$$W_{it} = \phi\,W_{i,t-1} + U_{it}, \tag{5.3}$$
$$U_{it} = \begin{cases} \theta_1 U_{2t} + \theta_2 U_{3t} + \theta_3 U_{4t} + \theta_4 U_{5t} + O_{it} & \text{if } i = 1, \\ O_{it} & \text{otherwise}, \end{cases} \tag{5.4}$$
where $\{O_{it}\}$ is a sequence of independent Gaussian random variables with zero mean and variance equal to 0.25, $\{E_t\}$ is a sequence of $r$-dimensional independent random vectors, normally distributed with zero mean and identity covariance matrix, and $I_r$ is the $r\times r$ identity matrix. Furthermore, $\{O_{it}\}$ and $\{E_t\}$ are mutually independent for all time periods, factors, and variables. For each Monte Carlo replication, the vector of loadings is sampled from a Gaussian distribution with mean -6 and standard deviation 0.2 for $i = 1,\dots,n$. The value of $\phi$ is either 0 or 0.5. The coefficients $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ are equal to zero or to 0.8, 0.9, -0.7, and 0.5, respectively. We set the true number of factors to $r = 3$.

We start by reporting results for the test of no remaining structure in the covariance matrix of $U_t = (U_{1t},\dots,U_{nt})'$. The null hypothesis considered is that the covariances between the first variable ($i = 1$) and the remaining ones are all zero. For the size simulations we set $\theta_1 = \theta_2 = \theta_3 = \theta_4 = 0$. The number of factors is either assumed known or determined from the data, by the eigenvalue ratio procedure or by one of the following information criteria:
$$IC_1 = \log[S(r)] + r\,\frac{n+T}{nT}\,\log\!\left(\frac{nT}{n+T}\right),$$
$$IC_2 = \log[S(r)] + r\,\frac{n+T}{nT}\,\log C_{nT}^2,$$
$$IC_3 = \log[S(r)] + r\,\frac{\log C_{nT}^2}{C_{nT}^2},$$
$$IC_4 = \log[S(r)] + r\,\frac{(n+T-k)\log(nT)}{nT},$$
where $S(r) = \frac{1}{nT}\|R - \hat\Lambda_r\hat F_r'\|_F^2$ and $C_{nT} := \sqrt{\min(n,T)}$.

Tables 1 and 2 report the empirical size of the test for different significance levels, for $\phi = 0$ and $\phi = 0.5$, respectively. The factors are either assumed known (panel (a)) or estimated, with the true number of factors (panel (b)), with the number of factors selected by the information criterion $IC_1$ (panel (c)), or with the number of factors selected by the eigenvalue ratio procedure (panel (d)). Table ?? in the Supplementary Material shows the results of the test when the number of factors is determined by $IC_2$-$IC_4$.

A number of facts emerge from the inspection of the results in Table 1. First, size distortions are small when the factors are known. In this case, the test is undersized when the pair $(n,T)$ is small. When the factors are not known but the true number of factors is available, the size distortions are high only when $T = 100$ and $n = 50$, due to inaccurate estimation of the factors. However, the distortions disappear as the pair $(T,n)$ grows. In this case, the empirical size is similar to the situation reported in panel (a). The finite-sample performance of the test when the number of factors is selected by the information criterion $IC_1$ is almost indistinguishable from the case reported in panel (b). However, the results with the eigenvalue ratio procedure are much worse when $T = 100$ and $n = 50$. In this case, the procedure selects fewer factors than the true number $r = 3$; for instance, it selects 2 or fewer factors in 36% of the replications. Just as a comparison, for $T = 100$ and $n = 50$, $IC_1$ underdetermines the number of factors in only 3.10% of the cases. For all the other combinations of $T$ and $n$, all the data-driven methods select the correct number of factors in almost all replications.

When the idiosyncratic components are autocorrelated, the size distortions are higher, as reported in Table 2. This is mainly caused by the well-known difficulties in the estimation of the long-run covariance matrix.

Tables 3-4 report the empirical power. To evaluate the power properties we set $\theta_1 = 0.8$, $\theta_2 = 0.9$, $\theta_3 = -0.7$, and $\theta_4 = 0.5$. For $T = 700$ the power is reasonably high, especially when the test is conducted at the 10% significance level. For smaller $T$, the power increases as $n$ grows. The results are similar when data-driven procedures are used to determine the number of factors. Finally, the conclusions are mostly the same for $\phi = 0$ and $\phi = 0.5$.

The goal of the second simulation is to compare, in a prediction environment, the three-stage method developed in the paper, evaluating the informational gains in predicting $Y_t$ by three different methods. First, the predictions are computed from a LASSO regression of $Y_{it}$ on all the other $n-1$ variables; the remaining predictions are based on a pure factor regression and on the FarmPredict methodology. Table 5 presents the results: the average mean squared error (MSE) over 5-fold cross-validation (CV) subsamples. As in the size and power simulations, we consider different combinations of $T$ and $n$. We report results for the case where $\theta_1 = 0.8$, $\theta_2 = 0.9$, $\theta_3 = -0.7$, and $\theta_4 = 0.5$. The gain of FarmPredict is quite remarkable when $T = 500$ or larger.

6 Applications
In this section we consider two applications with actual data to illustrate the benefits of the methodology developed in the paper. The first application deals with the factor structure of asset returns, whereas the second one concerns macroeconomic forecasting in data-rich environments.
We illustrate the methodology developed in this paper by studying the factor structure of assetreturns. We consider monthly close-to-close excess returns from a cross-section of 9,456 firms tradedin the New York Stock Exchange. The data starts on November 1991 and runs until December2018. There are 326 monthly observations in total. In addition to the returns we also consider 16monthly factors: Market (
MKT ), Small-minus-Big (
SMB ), High-minus-Low (
HML ), Conservative-minus-Aggressive (
CMA ), Robust-minus-Weak (
RMW ), earning/price ratio, cash-flow/price ratio,dividend/price ratio, accruals, market beta, net share issues, daily variance, daily idiosyncraticvariance, 1-month momentum, and 36-month momentum. The firms are grouped according to20 industry sectors as in Moskowitz and Grinblatt (1999). The following sectors are considered: Mining (602), Food (208), Apparel (161), Paper (81), Chemical (513), Petroleum (48), Construction(68), Primary Metals (133), Fabricated Metals (186), Machinery (710), Electrical Equipment (782),Transportation Equipment (166), Manufacturing (690), Railroads (25), Other transportation (157),Utilities (411), Department Stores (67), Retail (1018), Financial (3419), and Other (11).
We start the analysis by looking at the correlation matrix of a sample of nine different sectors, namely: Mining, Food, Petroleum, Construction, Manufacturing, Utilities, Department Stores, Retail, and Financial. Figure 1 plots the correlations that are larger than 0.15 in absolute value. We also test the null of a diagonal covariance matrix. The null hypothesis is strongly rejected, with a p-value much lower than 1%. To conduct the test of the covariance matrix we use the simple sample estimator, as described in the paper. However, the correlations plotted in Figure 1 and in the subsequent figures are based on the nonlinear shrinkage estimator proposed by Ledoit and Wolf (2020). The numbers between parentheses indicate the number of firms in our sample that belong to each sector.
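The diagonal-covariance test applied here is the max-type statistic with a multiplier-bootstrap critical value developed in the paper. The sketch below is a deliberately simplified i.i.d. version (no kernel long-run correction and generic variable names of our own), meant only to convey the mechanics of testing $H_0 : \Sigma_{ij} = 0$ over a set $D$ of pairs.

```python
import numpy as np

def max_cov_test(X, pairs, B=500, seed=0):
    """Max statistic S = sqrt(T) * max over (i,j) in `pairs` of |sigma_hat_ij|,
    with a Gaussian multiplier bootstrap for its null distribution (i.i.d. sketch)."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    Xc = X - X.mean(axis=0)
    V = np.stack([Xc[:, i] * Xc[:, j] for i, j in pairs], axis=1)   # (T x |D|) scores
    S = np.sqrt(T) * np.max(np.abs(V.mean(axis=0)))
    Vc = V - V.mean(axis=0)
    boot = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(T)                 # Gaussian multipliers
        boot[b] = np.max(np.abs(Vc.T @ e)) / np.sqrt(T)
    return S, float(np.mean(boot >= S))            # statistic and bootstrap p-value

rng = np.random.default_rng(2)
T, n = 300, 10
X = rng.standard_normal((T, n))
pairs = [(0, j) for j in range(1, n)]              # H0: unit 0 uncorrelated with the rest
S0, p0 = max_cov_test(X, pairs)                    # H0 true here
X[:, 1] += 0.8 * X[:, 0]                           # inject one nonzero covariance
S1, p1 = max_cov_test(X, pairs)                    # H0 now false
print(round(p0, 3), round(p1, 3))
```

With the injected correlation the statistic jumps and the bootstrap p-value collapses toward zero, while under the null it stays moderate.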
We proceed by regressing the monthly returns on the 16 observed factors. These factors explain most of the variation of the returns. Figure 2 shows the empirical distribution of the OLS estimates of the factor loadings over the 9,456 regressions. Figure 3 presents the estimated correlations for the first-stage residuals. We focus on the same nine sectors as before. The first-stage regression is efficient in removing the correlation within specific sectors in some cases. The most notable ones are Financial and Retail, followed by Construction, Petroleum, and Manufacturing. Nevertheless, the tests for a diagonal covariance matrix reject the null even in these specific cases.

The second step is to conduct a principal component analysis on the residuals of the first stage. The eigenvalue ratio procedure selects two factors, while all four information criteria point to a single factor. We proceed with two factors. Note that, by construction, the principal component factors are orthogonal to all the 16 risk factors considered in the first stage. Figure 4 shows the estimated correlations for the residuals of the second stage. The latent factors are not able to reduce the correlations within each sector. However, when we consider the partial correlations, the conclusions are much different. As can be seen from Figure 5, the partial correlation matrices are (almost) diagonal. In addition, we are not able to reject the null of a diagonal covariance matrix at the 5% significance level.

Finally, in order to shed some light on the links among different sectors, we report how often variables from sector $i$ are selected in the third-stage LASSO regression for firms in sector $j$. The numbers are normalized by the total number of firms in each sector and are presented in Figure 6. The most interesting fact is that covariates from the financial sector are the ones most frequently selected for all the other sectors.
This may indicate that there is a "financial factor" that was left unmodeled in the first two stages.

The results presented here can be useful in applications where forecasting future returns is the goal, for instance. The results indicate that the inclusion of the returns of firms belonging to the financial sector may improve the performance of forecasting models. For example, if we run a regression of the residuals of the first-stage regression of firms that do not belong to the financial sector on the first principal component computed with the first-stage residuals only from the financial sector, we find a statistically significant coefficient in 28% of the cases.

6.2 Macroeconomic Forecasting

The second application consists of forecasting a large set of monthly macroeconomic variables. We compare four different models: (1) an autoregressive model (AR); (2) a sparse LASSO regression (SR); (3) a principal component regression (PCR); and (4) a method based on the results in this paper (FarmPredict).

Our data consist of variables from the FRED-MD database, which is a large monthly macroeconomic dataset designed for empirical analysis in data-rich macroeconomic environments. The dataset is updated in real time through the FRED database and is available from Michael McCracken's webpage. For further details, we refer to McCracken and Ng (2016).

We use the vintage as of October 2020. Our sample extends from January 1960 to December 2019 (719 monthly observations), and only variables with all observations in the sample period are used (119 variables). The dataset is divided into eight groups: (i) output and income; (ii) labor market; (iii) housing; (iv) consumption, orders and inventories; (v) money and credit; (vi) interest and exchange rates; (vii) prices; and (viii) stock market. Finally, all series are transformed in order to become approximately stationary, as in McCracken and Ng (2016).
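The FRED-MD transformations mentioned here follow the seven transformation codes (tcodes) documented in McCracken and Ng (2016). The helper below is our own illustration of those codes, not code from the paper; it pads differenced series with NaN so the time index is preserved.

```python
import numpy as np

def fred_transform(x, tcode):
    """Apply a FRED-MD transformation code (McCracken and Ng, 2016):
    1: x_t, 2: diff(x_t), 3: double diff, 4: log x_t, 5: diff of log,
    6: double diff of log, 7: diff of (x_t / x_{t-1} - 1)."""
    x = np.asarray(x, dtype=float)
    d = lambda z: np.concatenate([[np.nan], np.diff(z)])   # NaN-padded difference
    if tcode == 1: return x
    if tcode == 2: return d(x)
    if tcode == 3: return d(d(x))
    if tcode == 4: return np.log(x)
    if tcode == 5: return d(np.log(x))
    if tcode == 6: return d(d(np.log(x)))
    if tcode == 7: return d(np.concatenate([[np.nan], x[1:] / x[:-1] - 1.0]))
    raise ValueError("unknown tcode")

# a price index growing 2% per period: the double log-difference (tcode 6,
# typical for price series) reduces it to (numerically) zero
p = np.array([100.0, 102.0, 104.04, 106.1208])
print(np.round(fred_transform(p, 6), 6))
```

A series flagged with tcode 5 (e.g., many real activity series) would instead be reduced to an approximately constant growth rate.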
In order to highlight the gains from exploring all relevant information in the dataset, we construct one-step-ahead forecasts for each one of the 119 variables in the dataset according to the following models:

1.
Autoregressive model (AR):
$$\hat Y^{(AR)}_{i,t+1|t} = \hat\phi_{i0} + \hat\phi_{i1} Y_{i,t} + \dots + \hat\phi_{ip} Y_{i,t-p+1}, \qquad i = 1,\dots,n,$$
where $\hat\phi_{i0}, \hat\phi_{i1}, \dots, \hat\phi_{ip}$, $i = 1,\dots,n$, are OLS estimates. This will also be the first-stage model in our methodology. (The FRED-MD data are available at https://research.stlouisfed.org/econ/mccracken/fred-databases/.)

2. AR + Sparse Regression (SR):
$$\hat Y^{(SR)}_{i,t+1|t} = \hat Y^{(AR)}_{i,t+1|t} + \hat R_{i,t+1|t}, \qquad \hat R_{i,t+1|t} = \hat\beta_{i0} + \hat\beta_{i1}'\hat R_t + \dots + \hat\beta_{ip}'\hat R_{t-p+1}, \qquad i = 1,\dots,n,$$
where $\hat\beta_{i0}, \hat\beta_{i1}, \dots, \hat\beta_{ip}$, $i = 1,\dots,n$, are LASSO estimates, $\hat R_t = (\hat R_{1,t},\dots,\hat R_{n,t})'$, and $\hat R_{i,t} = Y_{i,t} - \hat Y^{(AR)}_{i,t|t-1}$, $i = 1,\dots,n$. The parameters are estimated equation by equation for each one of the 119 variables in the dataset. The penalty parameter is selected by BIC, as discussed in Section 4.

3. AR + Principal Component Regression (
PCR):
$$\hat Y^{(PCR)}_{i,t+1|t} = \hat Y^{(AR)}_{i,t+1|t} + \hat\lambda_i'\hat F_t,$$
where $\hat F_t$ is the estimate of the $(k\times 1)$ vector of factors $F_t$ given by a principal component analysis of $\hat R_t$, the residuals of the first-stage regression. The parameter $\lambda_i$ is computed by an OLS regression of $\hat R_{i,t}$ on $\hat F_t$ in the in-sample window.

4. AR + Full Information (
FarmPredict):
$$\hat Y^{(FarmPredict)}_{i,t+1|t} = \hat Y^{(PCR)}_{i,t+1|t} + \hat U_{i,t+1|t},$$
where $\hat U_{i,t+1|t} = \hat\theta_{i0} + \hat\theta_{i1}'\hat U_t + \dots + \hat\theta_{ip}'\hat U_{t-p+1}$, $\hat U_t = (\hat U_{1,t},\dots,\hat U_{n,t})'$, and $\hat U_{i,t} = Y_{i,t} - \hat Y^{(PCR)}_{i,t|t-1}$, $i = 1,\dots,n$. The estimates $\hat\theta_{i0}, \hat\theta_{i1},\dots,\hat\theta_{ip}$, $i = 1,\dots,n$, are given by LASSO.

The forecasts are based on a rolling-window framework of fixed length of 480 observations, starting in January 1960. Therefore, the forecasts start in January 1990, and the last forecasts are for December 2019. Note that the AR model only considers information on the own past of the variable of interest. SR and PCR expand the information set by two opposing routes: while SR uses a sparse combination of the set of variables, PCR considers only a factor structure (dense model).
FarmPredict combines these two approaches and uses the full information available.
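The evaluation design above (fixed-length rolling window, refit at each step, one-step-ahead forecasts, models compared out of sample) can be sketched as follows. For brevity this sketch compares only a univariate AR(1), fitted by OLS, against a naive random-walk benchmark on simulated data, rather than the four models of the paper.

```python
import numpy as np

def ar1_forecast(window):
    """One-step-ahead forecast from an AR(1) with intercept, fitted by OLS."""
    X = np.column_stack([np.ones(len(window) - 1), window[:-1]])
    coef, *_ = np.linalg.lstsq(X, window[1:], rcond=None)
    return coef[0] + coef[1] * window[-1]

def rolling_mse(y, window, forecaster):
    """Fixed-length rolling window: refit on y[t-window:t], forecast y[t], report MSE."""
    errors = [y[t] - forecaster(y[t - window:t]) for t in range(window, len(y))]
    return float(np.mean(np.square(errors)))

# simulated AR(1) data standing in for one macro series
rng = np.random.default_rng(3)
T, phi = 1000, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.standard_normal()

mse_ar = rolling_mse(y, 480, ar1_forecast)
mse_rw = rolling_mse(y, 480, lambda w: w[-1])   # random-walk benchmark
print(mse_ar < mse_rw)                          # the AR(1) should win on this DGP
```

In the paper's exercise, the same loop would be run for each of the 119 series and each of the four nested models, with the window length fixed at 480 observations.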
We start by looking at the full sample in order to analyse the structure of dependence among the many series considered. We first estimate an autoregressive model of order 4, AR(4), for each transformed series. Figure 7 reports the empirical distribution of the OLS estimators of the AR coefficients. Figure 8 shows the distribution of the absolute value of the sum of the estimates, which gives an idea of the persistence of each series. Although we report here the results for AR models with a pre-specified order equal to four, in the Supplementary Material we present results for optimal lag selection via the BIC. Only one series has estimated persistence above one. This is the case for NONBORRES: Reserves of Depositary Institutions, which belongs to group (v): money and credit. The reason for such high persistence is a major structural break in the second half of the series. However, 82.35% of the series have estimated persistence below 0.9.

We continue by estimating the number of factors when the full sample is used for PCA. We consider two different situations. In the first, we do not include any lags in the basket of variables used to compute the factors. In the second approach, we include four lags of each variable as well. The eigenvalue ratio procedure selects either two factors (no lags) or a single factor (with lags). The four information criteria of Bai and Ng (2002), as described in Section 5, select, for the case with no lags (with lags), the following numbers of factors: six (one), five (one), nine (one), and one (one), respectively. Note that the factors are estimated from the residuals of the first-step AR filter. If we remove the
NONBORRES variable from the sample, the results do not change for the eigenvalue ratio procedure. On the other hand, the new numbers of factors selected by the information criteria are as follows: seven (one), six (one), eleven (one), and one (one).

Finally, we apply the testing approach developed in this paper to check for remaining (partial) covariance structure in the data. The tests strongly reject the null of a diagonal matrix when applied to the residuals of either the first or the second stage of the methodology. This serves as evidence that
FarmPredict may be a useful modeling approach for this macroeconomic dataset.
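The eigenvalue-ratio selections reported above can be reproduced schematically. The sketch below implements the eigenvalue-ratio criterion (pick $\hat r = \arg\max_{k} \mu_k/\mu_{k+1}$, with $\mu_k$ the ordered eigenvalues) on a toy three-factor panel of our own design, not on the FRED-MD data.

```python
import numpy as np

def eigenvalue_ratio(R, kmax=8):
    """Eigenvalue-ratio criterion: r_hat = argmax_{1<=k<=kmax} mu_k / mu_{k+1},
    where mu_k is the k-th largest eigenvalue of R R' / (nT)."""
    n, T = R.shape
    mu = np.linalg.eigvalsh(R @ R.T / (n * T))[::-1]   # eigenvalues, descending
    ratios = mu[:kmax] / mu[1:kmax + 1]
    return int(np.argmax(ratios)) + 1

# toy panel: three strong factors plus idiosyncratic noise
rng = np.random.default_rng(4)
n, T, r = 100, 200, 3
Lam = rng.standard_normal((n, r))
F = rng.standard_normal((T, r))
R = Lam @ F.T + 0.5 * rng.standard_normal((n, T))
print(eigenvalue_ratio(R))   # expected: 3 in this strong-factor design
```

When factors are weak, as the simulations in Section 5 suggest, the ratio criterion tends to underestimate the number of factors, which is consistent with the size distortions reported there.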
(Conventional unit-root tests also reject the null of a unit root for all but one of the series.)

For each of the four models described above, we report a number of performance metrics in Table 6. The table presents the frequency with which each model has the best performance among the four alternatives. Numbers between parentheses indicate the frequency with which each model is the second, third, or fourth best. We report the results for each one of the eight groups of variables as well as for the set of all 119 variables. We show the results for two methods of determining the number of factors: panel (a) reports the results for the eigenvalue ratio method, while panel (b) presents the results for the information criterion $IC_1$. Criteria $IC_2$, $IC_3$, and $IC_4$ select a very large number of factors, and we relegate them
to the supplementary material. Panels (c) and (d) in the table show the results for the cases where the number of factors is kept fixed.

FarmPredict is the model that is ranked first most frequently when all the series are considered. It is also the best model for the following groups: output and income; labor market; housing; and consumption, orders and inventories. The AR model is best for the money and credit and the stock market groups. The sparse regression is superior for two other groups: interest and exchange rates, and prices.

Conclusions

In this paper we propose a new methodology that bridges the gap between sparse regressions and factor models, and we evaluate the gains of increasing the information set via factor augmentation. Our proposal consists of several steps. In the first, we filter the data for known factors (trends, seasonal adjustments, covariates). In the second step, we estimate a latent factor structure. Finally, in the last part of the procedure, we estimate a sparse regression for the idiosyncratic components. Furthermore, we also propose a new test for remaining structure in both high-dimensional covariance and partial covariance matrices. Our test can be used to evaluate the benefits of adding more structure to the model. Our paper also has a number of important side results. First, we prove consistency of kernel estimation of long-run covariance matrices in high dimensions, where both the number of observations and the number of variables grow. Second, we derive the theoretical properties of factor estimation on the residuals of a first-step procedure. Third, the proposed test can be used as a diagnostic tool for factor models.

We evaluate our methodology with both simulations and real data. The simulations show that the test has good size and power properties even when the true number of factors is unknown and must be determined from the data. However, if the number of factors is underestimated, we observe size distortions.
This is especially the case when the eigenvalue ratio test is used to determine the number of latent factors. The simulations also show that there are major informational gains from combining factor models and sparse regressions in a forecasting exercise. Two applications are considered in the paper.
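The three-step methodology summarized above (filter out known factors, extract latent factors, run a sparse regression on the idiosyncratic components) can be sketched as follows. This is an illustrative sketch under simplifying assumptions — a plain AR(p) filter, a fixed number r of PCA factors, and a small hand-rolled coordinate-descent lasso — not the implementation used in the paper.

```python
import numpy as np

def lasso_cd(Z, y, alpha=0.1, iters=200):
    """Plain coordinate-descent lasso: min (1/2T)||y - Zb||^2 + alpha*||b||_1."""
    T, k = Z.shape
    b = np.zeros(k)
    col_ss = (Z ** 2).sum(axis=0) + 1e-12
    for _ in range(iters):
        for j in range(k):
            r = y - Z @ b + Z[:, j] * b[j]          # partial residual excluding j
            rho = Z[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - alpha * T, 0.0) / col_ss[j]
    return b

def farmpredict_sketch(X, y, p=4, r=2, alpha=0.1):
    """Three-step sketch: (1) AR(p)-filter each series, (2) extract r
    principal-component factors from the filtered panel, (3) sparse (lasso)
    regression of the target on the factors and idiosyncratic components."""
    T, n = X.shape
    # Step 1: AR(p) filter, series by series, by OLS on own lags.
    resid = np.empty((T - p, n))
    for i in range(n):
        lags = np.column_stack([X[p - l - 1:T - l - 1, i] for l in range(p)])
        Z = np.column_stack([np.ones(T - p), lags])
        beta = np.linalg.lstsq(Z, X[p:, i], rcond=None)[0]
        resid[:, i] = X[p:, i] - Z @ beta
    # Step 2: principal-component factors of the filtered panel via SVD.
    U, S, Vt = np.linalg.svd(resid, full_matrices=False)
    factors = U[:, :r] * np.sqrt(T - p)
    loadings = Vt[:r].T * S[:r] / np.sqrt(T - p)
    idios = resid - factors @ loadings.T            # idiosyncratic components
    # Step 3: sparse regression on factors + idiosyncratic components.
    design = np.column_stack([factors, idios])
    coef = lasso_cd(design, y[p:], alpha=alpha)
    return coef, factors, idios
```

Each step leaves residuals that can be handed to the next, which is what allows the covariance test to be applied after either the first or the second stage.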
A Proof of the Theorems
Throughout the proofs we use the equivalence $\|X\|_{\psi_p} < \infty \iff \mathbb{P}(|X| > x) = O(\psi_p(x)^{-1})$ as $x \to \infty$, for any random variable $X$ and $\psi_p \in \Psi$, combined with Lemma 6 in Carvalho et al. (2018) and Lemma 1 in Masini and Medeiros (2019). The key ingredients of those lemmas are a Marcinkiewicz–Zygmund type inequality for strong mixing sequences to deal with polynomial tails (Rio, 1994; Doukhan and Louhichi, 1999) and a Bernstein inequality under strong mixing conditions to control exponential tails (Merlevède et al., 2009, Theorem 2).
A.1 Proof of Theorem 1
We first upper bound } p R it ´ R it } ψ . By subsequent application of H¨older’s inequality we have . | p R it ´ R it | “ |p p γ i ´ γ i q W it |ď } p γ i ´ γ i } } W it } “ } p Σ ´ i p v i } } W it } ď k } p Σ ´ i } max } p v i } } W it } , where p Σ i : “ W i W i { T and p v i : “ W i U i { T . Then by the Cauchy-Schwartz conjugate } p R it ´ R it } ψ p { ď k }} p Σ ´ i } max } ψ p }} p v i } } ψ p { }} W it } } ψ p . The first term is bounded by Assumption 3(b). For the second term we have: } W it(cid:96) U it } ψ p { ď} W it(cid:96) } ψ p } U it } ψ p ď C by Assumption 3(a). Then, t W it(cid:96) U it u t ą is a zero-mean strong mixing withexponential decay sequence (Assumption 3(c)) with bounded ψ p { -norm. Therefore, }} p v i } } ψ p { “ O p {? T q uniformly in i ď n . Finally, the last term is bounded by the maximal inequality (van derVaart and Wellner (1996) - Lemma 2.2.2) and Assumption 3(a). The first result follows.32 .2 Proof of Theorem 2 The proof is an adaption of the proof of Theorem 4 and Corollary 1 in Fan et al. (2013), henceforthFLM, to include the estimation error in the sample covariance matrix. For part (a), we pick upfrom expression (A.1) in Bai (2003) to obtain the following identity p f t ´ HF t “ ˆ V n ˙ ´ « T T ÿ s “ p f s E p U s U t q n ` T T ÿ s “ p f s r ζ st ` T T ÿ s “ p f s r η st ` T T ÿ s “ p f s r ξ st ff , (A.1)where r ζ st , r η st and r ξ st are defined before Lemma B.3.By Assumptions 2(d) and 3(a) and the maximal inequality we have } R } max ď r } Λ } max } F } max `} U } max “ O P p ψ ´ p nT qq . Applying Lemma B.14 we conclude that } p Σ ´ r Σ } max “ O P p ω p ψ ´ p nT q ` ω qq “ O P p q , where the last assumption by the Theorem assumption. Finally ψ ´ p n q{? T “ O p q also by assumption then } V n } ´ “ O P p q by Lemma B.6. 
Using the results (a)-(d) of Lemma B.5we can bound in probability each of the terms in brackets of (A.1) in (cid:96) norm uniformly in t ď T and obtain the result (a).For part (b) we use the fact that p Λ : “ p R p F { T and the normalization p F p F “ I r to write p λ i ´ Hλ i “ T T ÿ t “ HF t r U it ` T T ÿ t “ p R it p p F t ´ HF t q ` H ˜ T T ÿ t “ F t F t ´ I r ¸ λ i . (A.2)The first term can be upper bounded in (cid:96) norm uniformly in i ď n by ? r } H } max i ď n max j ď r ˇˇˇˇˇ T T ÿ t “ F jt r U it ˇˇˇˇˇ “ O P p q O P r ψ ´ p { p n q{? T ` ω s , where the equality follows from Lemma B.6(b) and (e). The (cid:96) norm of the second term is upperbounded uniformly in i ď n by ˜ max i ď n T T ÿ t “ p R it T T ÿ t “ } p F t ´ HF t } ¸ { “ „ O P p q O P p T ` p {? n ` ω q q { , where the first term after the equality follows from Lemma B.6(d) together with the Theoremassumption and the second term from Lemma B.4(e). Finally the last term of (A.2) is upperbounded by } H }} max i ď n λ i }} T T ÿ t “ F t F t ´ I r } “ O P p q O p q O P p {? T q , O P p {? T q by the maximum inequality and Assumption 3 Plug the last threedisplays back into (A.2) yields result (b).For part (c) we use we have } p U ´ U } max “ } Λ F ´ p Λ p F ` p R ´ R } max ď } p Λ p F ´ Λ F } max ` } p R ´ R } max . The last term is O P p ω q by assumption. For the first term we use the decomposition p λ i p F t ´ λ i F t “ p p λ i ´ Hλ i q p p F t ´ HF t q ` p Hλ i q p p F t ´ HF t q` p p λ i ´ Hλ i q HF t ` λ i p H H ´ I r q F t . (A.3)Therefore, we can upper bound the left hand side as | p λ i p F t ´ λ i F t | ď } p λ i ´ Hλ i }} p F t ´ HF t } ` } Hλ i }} p F t ´ HF t }` } p λ i ´ Hλ i }} HF t } ` } λ i }} F t }} H H ´ I r } . Now we bound in probability each of the four term above uniformly in i ď n and t ď T . The firstone is given by part (a) and (b). 
For the second term, $\max_{i\le n}\|H\lambda_i\| \le \|H\|\max_{i\le n}\|\lambda_i\| \le O_P(1)\sqrt{r}\,\|\Lambda\|_{\max} = O_P(1)$ by Lemma B.6(b) and Assumption 2(d); thus the second term is bounded using part (a). Similarly, for the third term, $\max_{t\le T}\|HF_t\| \le \|H\|\max_{t\le T}\|F_t\| = O_P(1)\,O_P(\psi_p^{-1}(T)) = O_P(\psi_p^{-1}(T))$ by Lemma B.6(b) and Assumption 2(a). Finally, $\|H'H - I_r\| = O_P(1/\sqrt{T} + 1/\sqrt{n} + \omega)$ by Lemma B.6(c); hence the last term is $O_P[\psi_p^{-1}(T)(1/\sqrt{T} + 1/\sqrt{n} + \omega)]$ by Assumptions 2(d) and 3(a).

A.3 Proof of Theorem 3
We have that L p p θ ξ q ` ξ } p θ ξ } ď L p θ q ` ξ } θ } for all θ P R n by definition of p θ ξ , where L p θ q : “} p u y ´ θ p U x } { T . Also, since L p θ q is a quadratic function, it implies that p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ď ´ ∇ L p θ q p p θ ξ ´ θ q ` ξ p} θ } ´ } p θ ξ } q . By Holder’s inequality we have | ∇ L p θ q p p θ ξ ´ θ q| ď} ∇ L p θ q} } p θ ξ ´ θ } and by assumption ξ ě } ∇ L p θ q} then we have p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ď ξ { } p θ ξ ´ θ } ` ξ p} θ } ´ } p θ ξ } q . (A.4)For any index set S P r n s , by the decomposability of the (cid:96) norm (refer to Definition 1 in Negahbanet al. (2012)) followed by the triangle inequality we have } p θ ξ } “ } p θ ξ, S } ` } p θ ξ, S c } ě } θ S } ´ } p θ ξ. S ´ θ S } ` } p θ ξ, S c } and } p θ ξ ´ θ } “ } p θ ξ, S ´ θ S } ` } p θ ξ, S c ´ θ S c } ď } p θ ξ, S ´ θ S } ` } p θ ξ, S c ´ θ S c } . Plugging34t back in (A.4) yields2 p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ` ξ } p θ ξ, S c ´ θ S c } ď ξ } p θ ξ, S ´ θ S } ` ξ } θ S c } . (A.5)We then conclude that any minimizer p θ ξ of (3.5) and θ P R n obeys p θ ξ ´ θ P C p S , θ q : “t x P R n : } x S c } ď } x S } ` } θ S c } u . If we take θ “ θ and S “ S : “ t i : θ ,i ‰ u then p θ ξ ´ θ P C : “ C p S , θ q . Note that C is a cone in R n that does not depend on θ as } θ , S c } “ κ : “ κ p p U x p U x { T, S , q we have that } p θ ξ, S ´ θ S } ď p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q a | S |{ κ . Apply this inequality (A.5) and use the fact , 4 ab ă a ` b for non-negative a, b P R to obtain p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ` ξ } p θ ξ ´ θ } ď ξ | S |{ κ. (A.6)Finally, we have by assumption } p U ´ U } max ď C , } U } max ď C and C p C ` C q ď κ | S | which, inturn fulfills the assumptions of Lemma B.14 with ζ “ α “ {
2. Therefore, we conclude that κ is at least half of its population counterpart, which completes the proof.

A.4 Proof of Theorem 4
We use in this proof the following additional notation for short: For every random vector X , wedenote by Σ X its covariance matrix, d X the diagonal of Σ X and σ X : “ } d X } . Also, X G denoteszero-mean Gaussian random vector defined in the same probability space, independent of X andwith the same covariance matrix of X . Finally, for every pair of random vectors X , Y of the samedimension and scalar s ą ρ p X , Y q : “ sup t P R | P p} X } ď t q ´ P p} Y } ď t q| ∆ p X , s q : “ sup t P R P p t ď } X } ď t ` s q Combining equations (83)–(86) in Giessing and Fan (2020) gives us the following basic inequality | P p S ď c ˚ p τ qq ´ τ | ď ρ p r Q , r Q G q ` inf δ ą "a δ log nω max ` P p} p Υ ´ Υ } max ą δ q * ` inf δ ą " δ ? log nω max ` P p} Q ´ r Q } ą δ q * (A.7)35here r Q is defined below.We start by Bounding the first term to the right-hand side of (A.7). Here we adapt the classical”big block-small block” technique proposed by Bernstein in the context of proving CLT undermixing conditions, which was also used in the proof of Theorem E.1 in Chernozhukov et al. (2018).Consider two sequences of non-negative integers a : “ a T and b : “ b T such that b ă a , a ` b ď T , min t a, b u Ñ 8 , a “ o p T q and b “ o p a q as T Ñ 8 . Let m : “ r T {p a ` b qs and define for j P t , . . . , m u consecutive blocks of size a and b with index set A j : “ tpp j ´ qp a ` b q ` , . . . , p j ´ qp a ` b q ` a u and B j : “ tp j ´ qp a ` b q ` a ` , . . . j p a ` b qu . Finally set C : “ t m p a ` b q ` . . . , T u , which mightbe empty. A j : “ ? a ÿ t P A j r D t B j “ ? b ÿ t P B j r D t ; ; C “ a | C | ÿ t P C r D t , such that r Q : “ ? T T ÿ t “ r D t “ c maT ˜ ? m m ÿ j “ A j ¸looooooomooooooon “ : V ` c mbT ˜ ? m m ÿ j “ B j ¸loooooooomoooooooon “ : L ` c T ´ m p a ` b q T C Now let r V : “ ? m ř mj “ r A j where t r A t , ď t ď m u is an independent sequence such that A t and r A t have the same distribution for all 1 ď t ď m . Similarly define r L : “ ? 
m ř mj “ r B j . Lemma B.7give us for any scalar s ą ρ p r Q , r Q G q ď ρ p r V , r V G q ` ρ p c maT r V G , r Q G q ` ∆ p c maT r V G , s q` P p c mbT } r L } ą s q ` ρ p V , r V q ` ρ p L , r L q . (A.8)Notice that we any measurable A Ď R we have | P rp A , A q P A s ´ P r r A , r A , s| ď α b where t α n , n P N u denote the α -mixing coefficient of the sequence p r D t q which is the same of the sequence p U t q . Then the last two terms in (A.8) can be upper bounded by p m ´ q α b and p m ´ q α a respectivelyby induction. Since α n is non-increasing in n and a ě b we have that ρ p V , r V q ` ρ p L , r L q ď p m ´ q α b ď T exp p´ cb q . (A.9)where we use Assumption 3(c) to obtain the last inequality.For the fourth term we have by the maximal inequality followed by Markov’s inequality P p b mbT } r L } ą q ď „ ψ ˆ s ? TC ψ ψ ´ p { p n q? mb ˙ ´ and the anti-concentration inequality for Gaussian random vectors (The-orem 7 in Giessing and Fan (2020) with p “ 8 ) ∆ p a maT r V G , s q À T s ? log nmaσ Ă V . Set s “ C ψ ψ ´ p { p n q? mb ? T ψ ´ p { p T γ q for some γ ą p c maT r V G , s q ` P p c mbT } r L } ą s q À Tma c mbT ? log nψ ´ p { p n q ψ ´ p { p T γ q σ r V ` T γ (A.10)For the second term we have from Rio (2013) that, for every (cid:15) ą |r M (cid:96) s ij | “ | Cov p r D it , r D j,t ´ (cid:96) q| ď α (cid:15) {p ` (cid:15) q (cid:96) } r D it } ` (cid:15) } r D jt ´ (cid:96) } ` (cid:15) . 
Hence, from Assumption 3 we have that } M (cid:96) } max À exp p´ c (cid:15) ` (cid:15) (cid:96) q and } maT Σ r V G ´ Σ r Q G } max ď p ´ maT q} Σ r V } max ` } Σ r Q ´ Σ r V } max ď p ba ` b ` aT q} Σ r V } max ` a ÿ | (cid:96) |ă a | (cid:96) |} M (cid:96) } max ` ÿ a ď| (cid:96) |ă T } M (cid:96) } max À ba ` aT ` a ` T exp p´ c (cid:15) ` (cid:15) a q , where we use the fact that Σ r V G “ Σ r V “ Σ r A j “ Σ A j “ ř | (cid:96) |ă a p ´ | (cid:96) |{ a q M (cid:96) , Σ r Q G “ Σ r Q “ ř | (cid:96) |ă T p ´| (cid:96) |{ T q M (cid:96) , ř | (cid:96) |ă a | (cid:96) |} M (cid:96) } max ď c for some c ă 8 and ř a ď| (cid:96) |ă T } M (cid:96) } max À T exp p´ c (cid:15) ` (cid:15) a q .Finally, we can bound the second term using Theorem 8 in Giessing and Fan (2020). In particularfor p “ 8 it implies that ρ p c maT r V G , r Q G q À log n b } maT Σ r V G ´ Σ r Q G } max a maT σ r V _ σ r Q À b Tma log n b ba ` aT ` a ` T exp p´ c(cid:15) ` (cid:15) a q σ r V _ σ r Q (A.11)For the first term we have that } r D it } ψ p { is uniformly (upper) bounded by Assumption 3(a)then so is } r A it } ψ p { “ } A it } ψ p { “ } ? a ř s P A t r D is } ψ p { . Also p E p max i | r A it |q q { À } max i | r A it |} ψ p { À ψ ´ p { p n q max i } r A it } ψ p { À ψ ´ p { p n q . Since t r A t , ď t ď m u is an iid sequence of random vector Theorem5 in Giessing and Fan (2020) gives us ρ p r V , r V G q À p log n q { ψ ´ p { p n q T { σ r V . (A.12)By the triangle inequality we have that σ r V ě σ r Q ´ } d r Q ´ d r V } max ě c ´ } Σ r Q ´ Σ r V } max Á ´ a ´ T exp p´ c (cid:15) ` (cid:15) a q . By setting a “ r? T s we conclude that σ r V is eventually bounded awayfor zero for large enough T . If we further set b “ r log T { c s and γ “ { ρ p r Q , r Q G q “ O « p log n q { ψ ´ p { p n q T { ` ? log T log nψ ´ p { p n q ψ ´ p { p T { q T { ff . (A.13)Finally, we now bound the last two term appering in (A.7). 
Let $\gamma_1$ and $\gamma_2$ be positive sequences depending on $n$ and $T$ such that $\|\widehat\Upsilon - \Upsilon\|_{\max} = O_P(\gamma_1)$ and $\|Q - \widetilde Q\| = O_P(\gamma_2)$. Suppose we can state conditions under which
$$\log n\,(\gamma_1 \vee \gamma_2) = o(1), \qquad T, n \to \infty. \qquad \text{(A.14)}$$
Then the last two terms vanish in probability if we set $\delta_1 = \gamma_1 \log n$ and $\delta_2 = \gamma_2 \log n$ in (A.7). Lemma B.8 and Lemma B.10 give expressions for $\gamma_1$ and $\gamma_2$, respectively, which combined with the rate assumptions in the theorem imply (A.14).

B Additional Lemmas
Lemma B.1.
Let a j and b j denote the j -th eigenvalue in decreasing order of Σ and ΛΛ respectively.Then, under Assumption 2(b) and p c q :(a) b j — n for ď j ď r (b) max j ď n | a j ´ b j | “ O p q (c) a j — n for ď j ď r .Proof. Result p a q follows from the fact that the r eigenvalues of Λ Λ are also (the only r non-zero)eigenvalues of ΛΛ and Assumption 2(b). Part p b q follows from Weyl’s inequality that implies max j ď n | a j ´ b j | ď } Σ ´ ΛΛ } “ O p q , where the last equality follows from Assumption 2(c). Finallyresult p c q follows from part p a q and p b q and the (reverse) triangle inequality.Recall that Σ be the p n ˆ n q covariance matrix of U t “ Z t ´ Γ W t . Let r Σ : “ T ř Tt “ U t U t and p Σ the same as r Σ but with Γ replaced by the estimator p Γ . Also let p a j denote the j -th eigenvalue indecreasing order of p Σ emma B.2. Let ω be a non-negative sequence of n and T such that } p Σ ´ r Σ } max “ O P p ω q . Then,under the Assumptions 2 and 3:(a) } p Σ ´ Σ } max “ O P r ω ` ψ ´ p { p n q{? T s (b) max j ď n | p a j ´ a j | “ O P r n p ω ` ψ ´ p { p n q{? T qs (c) p a j — P n for j ď r provided that ω ` ψ ´ p { p n qq{? T “ O P p q Proof.
Part (a) follows by triangle inequality followed by the maximum inequality since } p Σ ´ Σ } max ď} p Σ ´ r Σ } max ` } r Σ ´ Σ } max “ O P p ω q ` O P p ψ ´ p { p n q{? T q . Part (b) follows from Weyl’s inequality,the fact that } p Σ ´ Σ } ď n } p Σ ´ Σ } max and part p a q . Part p c q follows from the triangle inequalitycombined with part p b q and Lemma B.1(c).The Lemmas B.3-B.6 below are an adaption of Lemmas 8-10 in Fan et al. (2013), henceforthFLM, to include the estimation error in the sample covariance matrix. To avoid confusion and makeit easier for the read to follow through the changes we use the same notation adopted in FLM. Inparticular, if δ it denotes the p i, t q element of ∆ : “ p R ´ R then r U it “ U it ` δ it for i P r n s and t P r T s .Also, we consider that } ∆ } max “ O P p ω q for some non-negative sequence ω depending on n and T .Define: r ζ st : “ r U s r U t n ´ E p U s U t q n “ ˆ U s U t n ´ E p U s U t q n ˙ ` ˆ U s δ t n ` δ s U t n ` δ s δ t n ˙ “ : ζ st ` ζ ˚ st r η st : “ f s ř ni “ λ i r U it n “ f s ř ni “ λ i U it n ` f s ř ni “ λ i δ it n “ : η st ` η ˚ st r ξ st : “ F t ř ni “ λ i r U is n “ F t ř ni “ λ i U is n ` F t ř ni “ λ i δ is n “ ξ st ` ξ ˚ st . Lemma B.3.
Under Assumption 3:(a) ζ st “ O P p {? n q (b) η st “ O P p {? n q (c) ξ st “ O P p {? n q (d) ζ ˚ st “ O P p ω ` ω q and max s,t ď T ζ ˚ st “ O P p ψ ´ p nT q ω ` ω q (e) η ˚ st “ O P p ω q (f ) ξ ˚ st “ O P p ω q . roof. Parts p a q , p b q and p c q are straightforward. For (d) we have that n U s U t “ O P p q and n δ s δ t ď} ∆ } max “ O P p ω q then the other two terms in parentheses in the definition of ζ ˚ st are O P p ω q by theCauchy-Schwartz inequality. Part p e q and p f q follows by similar arguments. max t ď T T T ÿ s “ p n δ s U t q “ max t ď T n U t ˜ T T ÿ s “ δ s δ s ¸ U t ď } ∆ } max p max t ď T } U t } { n q ζ ˚ st ď } U s } } δ t } ` } U t } } δ s } ` } δ t } } δ s } ď } U } max } ∆ } max ` } ∆ } max Lemma B.4.
Under Assumption 3:(a) T ř Tt “ r nT ř Ts “ p f js E p U s U t qs “ O P p { T q (b) T ř Tt “ r T ř Ts “ p f js r ζ st s “ O P rp {? n ` ω ` ω q s (c) T ř Tt “ r T ř Ts “ p f js r η st s “ O P rp {? n ` ω q s (d) T ř Tt “ r T ř Ts “ p f js ξ st s “ O P rp {? n ` ω q s (e) T ř Tt “ } p f t ´ Hf t } “ O P r { T ` p {? n ` ω ` ω q s Proof.
Part (a) is unaltered by the presence of the pre-estimation step, so it follows directly from Lemma 8(a) in FLM. For part (b), we have that for s, l
P r n s and j P r r s by Cauchy-Schwartz inequality1 T T ÿ t “ r T T ÿ s “ p f js r ζ st s ď »– T T ÿ s,l “ ˜ T T ÿ t “ r ζ st r ζ lt ¸ fifl { Since r ζ st “ ζ st ` ζ ˚ st “ O P p {? n ` ω ` ω q by Lemma B.3, the term in parentheses is O P rp {? n ` ω ` ω q q . The result p b q then follows. For (c), by the triangle inequality and Lemma 8(c) in FLM,we have that } ř ni “ λ ji r u it } ď } ř ni “ λ ji U it } ` } ř ni “ λ ji δ it } “ O P p? n q ` O P p nω q , then we conclude1 T T ÿ t “ r T T ÿ s “ p f s r η st s ď T n T ÿ t “ } n ÿ i “ U it λ i } “ O P p { n ` ω {? n ` ω q . The proof of part (d) is analogous to part (c) therefore is omitted. For (e), let r p f t ´ Hf t s j denotethe j -th entry of the vector p f t ´ Hf t . Since V { n is bounded away for zero by Lemma B.2(c), thefact that p a ` b ` c ` d q ď p a ` b ` c ` d q and using (A.1) we have that max j ď r T ´ ř t r p f t ´ Hf t s j
40s upper bounded by some constant C ă 8 times »– max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js E p U s U t q n ¸ ` max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js r ζ st ¸ ` max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js r η st ¸ ` max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js r ξ st ¸ fifl . The result then follows by applying the bounds from part (a)-(d) to each of the four terms above.
Lemma B.5.
Under Assumption 2:(a) max t ď T } nT ř Ts “ p f s E p U s U t q} “ O P p {? T q (b) max t ď T } T ř Ts “ p f s r ζ st } “ O P p b ψ ´ p { p T q{ n ` ψ ´ p nT q ω ` ω q (c) max t ď T } T ř Ts “ p f s r η st } “ O P p ψ ´ p T q{? n ` ω q (d) max t ď T } T ř Ts “ p f s ξ st } “ O P p ψ ´ p T qp {? n ` ω qq Proof.
Once again, part (a) is unaltered by the presence of a pre-estimation so it follows directlyfrom Lemma 9(a) in FLM. For part (b), from the Cauchy-Schwartz inequality we have max t ď T } T T ÿ s “ p f s r ζ st } ď ˜ T T ÿ s “ } p f s } max t ď T T T ÿ s “ r ζ st ¸ { . The first summation inside the parentheses equal r due to the normalization. For the second summa-tion, by the triangle inequality, we have max t ď T T ř Ts “ r ζ st ď max t ď T T ř Ts “ ζ st ` max t ď T T ř Ts “ ζ st ζ ˚ st ` max t ď T T ř Ts “ ζ ˚ st . For the first term, the maximum inequality followed by Assumption 2(e) yields max t ď T T T ÿ s “ ζ st “ O P „ ψ ´ p { p T q max s,t } ζ } ψ p { “ O P „ ψ ´ p { p T q max s,t } ζ } ψ “ O P « ψ ´ p { p T q n ff . The last one is O P rp ψ ´ p nT q w ` ω q q by Lemma B.3(d). Then by Cauchy Schwartz we have that max t ď T T ř Ts “ r ζ st “ O P rp b ψ ´ p { p T q{ n ` ψ ´ p nT q w ` ω q s and result (b) follows.For (c), by the triangle inequality we have that max t ď T } n ř ni “ λ i r U it } ď max t ď T } n ř ni “ λ i U it } ` max t ď T } n ř ni “ λ i δ it } . For the first term, the maximum inequality followed by Assumption 2(f)yields max t ď T } n Λ U t } “ O P „ ψ ´ p T q? n max t } ? n Λ U t } “ O P p ψ ´ p T q{? n q . } Λ } max } ∆ } max “ O P p ω q by Assumption 2(d). We then obtainthe result since max t ď T } T T ÿ s “ p f s r η st } ď } T T ÿ s “ p f s f s } max t ď T } n n ÿ i “ λ i r U it } “ O P ˆ ψ ´ p T q? n ` ω ˙ . (B.1)By the triangle inequality, } nT ř s ř i λ i r U is p f s } ď } nT ř s ř i λ i U is p f s } ` } nT ř s ř i λ i δ is p f is } . Lemma9(d) of FLM shows that the first term is O P p {? n q . For the second term for each j P r r s : } nT ÿ s ÿ i λ i δ is p f js } ď ˜ T n ÿ s “ } n n ÿ i “ λ i δ is } p f js ¸ ˜ T n ÿ s “ p f js ¸ “ O P p ω q . Thus } nT ř s ř i λ i r U it p f s } “ O P p {? 
n ` ω q and by Cauchy-Schwartz inequality we have max t ď T } T T ÿ s “ p f s ξ st } ď max t ď T } F t }} nT ÿ s ÿ i λ i r U it p f s } “ O P p ψ ´ p T qp {? n ` ω qq . (B.2) Lemma B.6.
Let ω ` ψ ´ p n qq{? T “ O p q where ω is defined in Lemma B.2, then UnderAssumption 3 we have(a) } V ´ } “ O P p { n q (b) } H } “ O P p q (c) } H H ´ I r } F “ O P p {? T ` {? n ` ω q (d) max i ď n T ř Tt “ p R it “ O P p w p ψ ´ p nT q ` ω q ` ψ ´ p { p n q{? T ` q (e) max i ď n max j ď r T ř Tt “ F jt r U it “ O P p ψ ´ p { p n q{? T ` ω q Proof.
We have that V ´ “ diag p { p a , . . . , { p a r q and 1 { p a j — P { n for j ď r by Lemma B.2(c).The result (a) then follows. The normalization tell us } p F } “ ? T , Lemma 11(a) in FLM give us } F } “ O P p? T q , } Λ Λ } “ r a — n by Lemma B.1(a) and from part (b) we have } V ´ } “ O P p { n q .Result (b) then follows since by definition H : “ T ´ V ´ p F F Λ Λ . For (c) we have by the triangleinequality } H H ´ I r } F ď } H H ´ H F F { T H } F ` } H F F { T H ´ I r } F } H p I r ´ F F { T q H } F ď } H } } I r ´ F F { T } F “ O P p q O P p {? T q . The second term is equal to } H F F { T H ´ p F p F { T } F For (d) we have max i ď n T T ÿ t “ p R it ď max i ď n T T ÿ t “ p p R it ´ R it q ` max i ď n T T ÿ t “ R it ´ E p R it q ` max i ď n T T ÿ t “ E p R it qď max i,t | p R it ´ R it | ` max i ď n T T ÿ t “ R it ´ E p R it q ` max i,t E p R it q . The last term is O p q by Assumption 3(a), the middle term O P p ψ ´ p { p n q{? T q . The first term is nolarger then } ∆ } max p } R } max ` } ∆ } max q “ O P p ω p ψ ´ p nT q ` ω qq . The result (d) then follows.For (e) we have for each j ď r : | T ´ ÿ t F jt r U it | ď | T ´ ÿ t F jt U it | ` | T ´ ÿ t F jt δ it |ď | T ´ ÿ t F jt U it | ` p T ´ ÿ t F jt T ´ ÿ t δ it q { The first term is O P p ψ ´ p { p n q{? T by the maximum inequality and Assumption 3 and the second is O P p ω q . Lemma B.7.
For every s ą : ρ p p S, Z q ď ρ p p r T , r Z q ` ∆ p p c mqn r Z, s q ` ρ p p c mqn r Z, Z q ` P p c mrn } r U } p ą s q ` ρ p p T, r T q ` ρ p p U, r U q . Proof.
We start by showing that for every pair of random variables X and Y defined in the sameprobability space taking values in the normed space p S, } ¨ }q and pair of non-negative reals t, s , wehave P p} X } ď t ´ s q ´ P p} Y } ą s q ď P p} X ` Y } ď t q ď P p} X } ď t ` s q ` P p} Y } ą s q . (B.3)Indeed, for the right hand side inequality we use } X ` Y } “ } X ´ p´ Y q} ě } X } ´ } Y } . Hence, for43ny t, s ą P p} X ` Y } ď t q ď P p} X } ď t ` } Y }qď P p} X } ď t ` } Y } , } Y } ď s q ` P p} Y } ą s qď P p} X } ď t ` s q ` P p} Y } ą s q . For the other side we use } X ` Y } ď } X } ` } Y } to write P p} X ` Y } ď t q ě P p} X } ď t ´ } Y }qě P p} X } ď t ´ } Y }q ` P p} Y } ą s q ´ P p} Y } ą s q Now replace X and Y by a mqn T and a mrn U in (B.3), respectively and set } ¨ } “ } ¨ } p . The righthand side of the resulting expression can be upper bounded by P p a mqn } r T } p ď t ` s q ` P p a mrn } r U } ą s q ` ρ p p T, r T q ` ρ p p U, r U q , whereas the left hand side can be lower bounded by P p a mqn } r T } ď t ´ s q ´ P p a mqr } r U } ą s q ´ ρ p p T, r T q ´ ρ p p U, r U q . Therefore P p c mqn } r T } p ď t ´ s q ´ P p c mrn } r U } p ą s q ´ ρ p p T, r T q ´ ρ p p U, r U qď P p} S } p ď t qď P p c mqn } r T } p ď t ` s q ` P p c mrn } r U } p ą s q ` ρ p p T, r T q ` ρ p p U, r U q . Then for the right-hand side P p c mqn } r T } p ď t ` s q ď P p c mqn } r Z } p ď t ` s q ` ρ p p r T , r Z qď P p c mqn } r Z } p ď t q ` ∆ p p c mqn r Z, s q ` ρ p p r T , r Z qď P p} Z } p ď t q ` ρ p p c mqn r Z, Z q ` ∆ p p c mqn r Z, s q ` ρ p p r T , r Z q Similarly for the left-hand side and the proof is completed.By the triangle inequality } p Υ ´ Υ } max ď } p Υ ´ r Υ } max ` } r Υ ´ Υ } max where r Υ is the sample44ovariance matrix of r D t : “ U t U ´ t . The second term is O p ψ ´ p { p n q{? 
T q while for the first } p Υ ´ r Υ } max ď } D ´ r D } max p } r D } max ` } D ´ r D } max q The first term in parentheses is O p ψ ´ ˚ p nT qq and the second can be upper bounded by } p U ´ U } max p } U } max ` } p U ´ U } max q which is show to be O P p η p n, T q ψ ´ p nT qq in the proof of LemmaB.16. Therefore we conclude } p Υ ´ Υ } max “ O P ´ η p n, T q ψ ´ p nT q ψ ´ p { p nT q ` ψ ´ p { p n q{? T ¯ To leverage on the results of Gaussian approximation, in particular on the work of Giessing andFan (2020) we would like to establish some sort of asymptotic linearity namely Q T “ ? T T ÿ t “ D t “ ? T T ÿ t “ r D t ` R T “ : r Q T ` R T . (B.4)such that } R t } vanishes in probability at an appropriate rate as n, T Ñ 8 . Then we can ap-proximate the distribution of S “ } Q } by the distribution of r S : “ } r Q } p , which in turn can beapproximated by the distribution of S ˚ : “ } Q ˚ } with high probability.For some (cid:15) ą δ “ h r η p n, T qp ψ ´ p nT qq ` ψ ´ p { p n q{? T s δ “ η ´ (cid:15) r ψ ´ p n q ` ? T η s Lemma B.8. } p Υ ´ Υ } max “ O P ´ h r η p ψ ´ p p nT qq ` ψ ´ p { p n q{? T s ¯ Proof.
Let i : “ p i , i , i , i q be a multi-index where i , i , i , i P r n s . Define for i and | (cid:96) | ă T : r γ (cid:96) i : “ T T ÿ t “| (cid:96) |` U i ,t U i ,t U i ,t ´| (cid:96) | U i ,t ´| (cid:96) | ; γ (cid:96) i : “ E r γ i , and p γ (cid:96) i as r γ (cid:96) i with U ’s replaced by p U ’s. Also define r υ i : “ ÿ | (cid:96) |ă T k p (cid:96) { h q r γ (cid:96) i υ i : “ ÿ | (cid:96) |ă T γ (cid:96) i , p υ i as r υ i with U ’s replaced by p U ’s. Then we write r υ i ´ υ i “ ÿ | (cid:96) |ă T k p (cid:96) { h qp r γ (cid:96) i ´ γ (cid:96) i q ` ÿ | (cid:96) |ă T p k p (cid:96) { h q ´ q γ (cid:96) i . (B.5)Since } r γ (cid:96) i ´ γ (cid:96) i } ψ p { “ O p a T ´ | (cid:96) |{ T q “ O p {? T q , the ψ p { -Orlicz norm of the first term is boundedby h ÿ | (cid:96) |ă T | h ´ k p (cid:96) { h q|} r γ (cid:96) i ´ γ (cid:96) i } ψ p { “ O ˆ h ? T ż | k p u q| du ˙ “ O p h {? T q , whereas the second term is deterministic and is shown to be O p h {? T q by Andrews (1991). Thus } r υ i ´ υ i } ψ p { “ O p h {? T q uniformly in i P r n s . Thus, by the maximal inequality followed byMarkov’s inequality we conclude that max i | r υ i ´ υ i | “ O P p ψ ´ p { p n q max i } r υ i ´ υ i } ψ p { q “ O P r ψ ´ p { p n q h {? T s . (B.6)We now use the fact that for any x , y P R q we have | ś qi “ x i ´ ś qi “ y i | “ O p ř q ´ i “ } x ´ y } n ´ i } y } i q combined with the fact that } p U ´ U } max “ o p q to obtain max i ,(cid:96) | p γ (cid:96) i ´ r γ (cid:96) i | ď max i ,t,(cid:96) | p U i ,t p U i ,t p U i ,t ´| (cid:96) | p U i ,t ´| (cid:96) | ´ U i ,t U i ,t U i ,t ´| (cid:96) | U i ,t ´| (cid:96) | |“ O p} p U ´ U } max } U } max q“ O P r η r ψ ´ p p nT qs s Therefore we conclude max i | p υ i ´ r υ i | ď max i ,(cid:96) | p γ (cid:96) i ´ r γ (cid:96) i | ÿ | (cid:96) |ă T | k p (cid:96) { h q| “ O P ˆ hη r ψ ´ p p nT qs ż | k p u q| du ˙ “ O P p hη r ψ ´ p p nT qs q . 
(B.7)The result then follows from the triangle inequality } p Υ ´ Υ } max ď max i | p υ i ´ r υ i | ` max i | r υ i ´ υ i | ,expression (B.10) and (B.11). Lemma B.9. If } δ it } ψ p ď C ă 8 where δ it : “ p R it ´ R it then }}p V { n qp F t ´ HF t q} } ψ p “ O p ? T ` ψ ´ p { p T q? n ` ψ ´ p { p T q C q . roof. In this proof we use the fact that for any (possibly random) A st , by Cauchy-Schwartz in-equality and the normalization p F p F { T “ I r , we have } T ř Ts “ p F s A st } ď ? r ´ T ř Ts “ A st ¯ { . Thus g p A st q : “ ››››› } T T ÿ s “ p F s A st } ››››› ψ “ O »–››››››˜ T T ÿ s “ A st ¸ { ›››››› ψ fifl . (a) Set A st “ E p U s U t q{ n , then g p A st q “ O p {? T q .(b) Set A st “ r ζ st : “ p U s U t ´ E p U s U t qq{ n , then by maximal inequality g p A st q “ O p} max s ď T | r ζ st |} ψ q “ O p ψ ´ p T q max s ď T } r ζ st } ψ q . By the triangle inequality } r ζ st } ψ ď } ζ st } ψ ` } ζ ˚ st } ψ . The first term is O p {? n q by Assumption 3(d). The second can be upper bounded by } U s δ t { n } ψ `} δ s U t { n } ψ `} δ s δ t { n } ψ “ O p} U is } ψ p { } δ it } ψ p { q ` O p} δ it } ψ p { q . Thus g p A st q “ O p ψ ´ p T qp {? n ` C ` C qq .(c) Set A st “ r η st : “ F s ř ni “ λ i p U it ` δ it q{ n , then apply Cauchy-Schwartz twice to obtain g p A st q “ O p}p T T ÿ s “ } F s } q { } ψ p { } n ÿ i “ λ i U it ` δ it n } ψ p { q “ O p q O p} n ÿ i “ λ i U it n } ψ p { `} n ÿ i “ λ i δ it n } ψ p { q . The first term in square brackets is O p {? n q by Assumption 2(d) and 3(e); the second is O p C q . Hence g p A st q “ O p ? n ` C q .(d) Set A st “ r ξ st : “ F t ř ni “ λ i p U is ` δ is q{ n , then apply Cauchy-Schwartz twice followed by themaximal inequality to obtain g p A st q “ O p}} F t }} ψ p { }p T T ÿ s “ } n ÿ i “ λ i U is ` δ is n } q { } ψ p { qq“ O p q O p ψ ´ p T qr} n ÿ i “ λ i U is n } ψ p { ` }p n ÿ i “ λ i δ is n } ψ p { sq . The first term in square brackets is O p {? n q by Assumption 2(d) and 3(e); the second is O p C q . 
Hence g p A st q “ O p ψ ´ p { p T qr ? n ` C sq .Finally, use the identity (A.1), the triangle inequality twice and the bounds p a q ´ p d q to obtain theresult. Lemma B.10. If max i,t } δ it } ψ “ O p C q and } p U ´ U } max “ O P p η q then ›››› ? T p p U p U ´ U U q ›››› max “ O P ˆ ? T η ` r ? T ` r ? n ` r C ˙ here r : “ ψ ´ p p n q ψ ´ p { p n q ψ ´ p { p n q r : “ ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p { p n q r : “ ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p p nT q ψ ´ p { p n q . Proof.
By the triangle inequality we have ›››› ? T p p U p U ´ U U q ›››› max ď ›››› ? T p p U ´ U qp p U ´ U q ›››› max ` ›››› ? T U p p U ´ U q ›››› max . For the first term we have ›››› ? T p p U ´ U qp p U ´ U q ›››› max ď ? T } p U ´ U } max “ O P p? T η q . For the second term we use decomposition (A.3) to write1 ? T T ÿ t “ U it p p U jt ´ U jt q “ ? T T ÿ t “ U it p p λ j p F t ´ λ j F t ` p R jt ´ R jt q“ ” p p λ j ´ Hλ j q ` Hλ j ı ? T T ÿ t “ U it p p F t ´ HF t q` ” p p λ j ´ Hλ j q ` p H H ´ I r q λ j ı ? T T ÿ t “ U it F t ` p p γ j ´ γ j q ? T T ÿ t “ U it W jt Apply Cauchy-Schwartz inequality in each term followed by the triangle inequality we obtain ›››› ? T U p p U ´ U q ›››› max ď „ max j ď n } p λ j ´ Hλ j } ` ? r } H }} Λ } max max i ď n ››››› ? T T ÿ t “ U it p p F t ´ HF t q ››››› ` „ max j ď n } p λ j ´ Hλ j } ` ? r } H H ´ I r }} Λ } max max i ď n ››››› ? T T ÿ t “ U it F t ››››› ` max j ď n } p γ j ´ γ j } max i,j ď n ››››› ? T T ÿ t “ U it W jt ››››› . The first term is O P p q O P p ψ ´ p n qr ? T ` ψ ´ p { p T q? n ` ψ ´ p { p T q C sq due to Lemma B.6(a), Lemma B.9and the maximal inequality; the second term is O P p ψ ´ p { p n q? T ` ? n ` ψ ´ p nT q C q O P p ψ ´ p { p n qq since,by the maximal inequality, we might take ω “ ψ ´ p nT q C in Theorem 2(b). The last term is48 P p ψ ´ p n q ψ ´ p { p n q{? T q O P p ψ ´ p { p n qq . Thus, ››› ? T U p p U ´ U q ››› max “ O P p r q where r : “ ψ ´ p p n q ψ ´ p { p n q ψ ´ p { p n q? T ` ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p { p n q? n ` p ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p p nT q ψ ´ p { p n qq C. (B.8)The result then follows. Lemma B.11.
If $\|\widehat{U} - U\|_{\max} = O_P(\eta)$, then $\max_{i,j,t}|\widehat{V}_{ij,t} - V_{ij,t}| = O_P\big(s\,[\eta + \xi\,\psi^{-1}(n)]\big)$.

Proof. By the triangle inequality we have $|\widehat{V}_{ij,t} - V_{ij,t}| \le |\widehat{U}_{i,t} - U_{i,t}| + |\widehat{\theta}_i'\widehat{U}_{-ij,t} - \theta_i'U_{-ij,t}|$. Using Hölder's inequality, the second term can be further bounded as
$$|\widehat{\theta}_i'\widehat{U}_{-ij,t} - \theta_i'U_{-ij,t}| \le |\widehat{\theta}_i'(\widehat{U}_{-ij,t} - U_{-ij,t})| + |(\widehat{\theta}_i - \theta_i)'U_{-ij,t}| \le \|\widehat{\theta}_i\|_1\,\|\widehat{U}_{-ij,t} - U_{-ij,t}\|_\infty + \|\widehat{\theta}_i - \theta_i\|_1\,\|U_{-ij,t}\|_\infty$$
$$\le \big(\|\theta_i\|_1 + \|\widehat{\theta}_i - \theta_i\|_1\big)\|\widehat{U}_{-ij,t} - U_{-ij,t}\|_\infty + \|\widehat{\theta}_i - \theta_i\|_1\,\|U_{-ij,t}\|_\infty.$$
Combining the last two expressions with the facts that $\|\theta_i\|_1 \le s\,\|\theta_i\|_\infty \le Cs$ and $\|\widehat{\theta} - \theta\|_1 = O_P(\xi s) = O_P(1)$, by Assumption 3(f) and the maximal inequality, yields the result.

Lemma B.12.
If $\|\widehat{U} - U\|_{\max} = O_P(\eta)$, then
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\big(\widehat{V}_{ij,t}\widehat{V}_{ji,t} - V_{ij,t}V_{ji,t}\big)\Big| = O_P\Big(s\,\big[\tilde r + \xi\,\psi^{-1}(n)\big] + \sqrt{T}\,s^2\big[\eta + \xi\,\psi^{-1}(n)\big]^2\Big).$$

Proof. By the triangle inequality,
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}\big(\widehat{V}_{ij}'\widehat{V}_{ji} - V_{ij}'V_{ji}\big)\Big| \le \max_{i,j}\Big|\frac{1}{\sqrt{T}}(\widehat{V}_{ij} - V_{ij})'(\widehat{V}_{ji} - V_{ji})\Big| + 2\max_{i,j}\Big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{V}_{ij} - V_{ij})\Big|.$$
The first term can be bounded using Lemma B.11, since
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}(\widehat{V}_{ij} - V_{ij})'(\widehat{V}_{ji} - V_{ji})\Big| \le \sqrt{T}\,\Big[\max_{i,j,t}|\widehat{V}_{ij,t} - V_{ij,t}|\Big]^2 = O_P\Big(\sqrt{T}\,\big[s(\eta + \xi\,\psi^{-1}(n))\big]^2\Big).$$
For the second term,
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{V}_{ij} - V_{ij})\Big| \le \max_{i,j}\Big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{U}_i - U_i)\Big| + \max_{i,j}\|\widehat{\theta}_{ij}\|_1\,\max_{i,j}\Big\|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{U}_{-ij} - U_{-ij})\Big\|_\infty + \max_{i,j}\|\widehat{\theta}_{ij} - \theta_{ij}\|_1\,\max_{i,j}\Big\|\frac{1}{\sqrt{T}}V_{ij}'U_{-ij}\Big\|_\infty.$$
Recall the rate $\tilde r$ appearing in (B.8). Then the first term is $O_P(s\,\tilde r)$, the second is $O_P(s\,\tilde r)$ and the last term is $O_P(\xi s\,\psi^{-1}(n))$. Thus $\max_{i,j}\big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{V}_{ij} - V_{ij})\big| = O_P\big(s\,[\tilde r + \xi\,\psi^{-1}(n)]\big)$. The result then follows.

Lemma B.13.
$$\|\widehat{\Upsilon}_V - \Upsilon_V\|_{\max} = O_P\Big(h\Big[s\,\big(\eta + \xi\,\psi_p^{-1}(n)\big)\big[s\,\psi_p^{-1}(nT)\big]^3 + \frac{s\,\psi_{p/2}^{-1}(n^4)}{\sqrt{T}}\Big]\Big).$$

Proof.
The proof is similar to the proof of Lemma B.8; refer to it for details. It suffices to bound in probability $\|\widehat{V} - V\|_{\max}$ and $\|V\|_{\max}$, where $V$ is the $(n^2 \times T)$ matrix whose entries are $V_{ij,t}$ for $i,j \in [n]$ and $t \in [T]$; similarly for $\widehat{V}$, with $V_{ij,t}$ replaced by $\widehat{V}_{ij,t}$. Lemma B.11 bounds the former; for the latter we have $\|V\|_{\max} \le \max_{i,j}\|\theta_{ij}\|_1\,\|U\|_{\max} = O\big(s\,\psi^{-1}(nT)\big)$.

Let $\mathbf{i} := (i_1, i_2, i_3, i_4)$ be a multi-index, where $i_1, i_2, i_3, i_4 \in [n]$. Define, for $\mathbf{i}$ and $|\ell| < T$,
$$\tilde\gamma_{\mathbf{i}}^{\ell} := \frac{1}{T}\sum_{t=|\ell|+1}^{T} U_{i_1,t}U_{i_2,t}U_{i_3,t-|\ell|}U_{i_4,t-|\ell|}, \qquad \gamma_{\mathbf{i}}^{\ell} := \mathbb{E}\,\tilde\gamma_{\mathbf{i}}^{\ell},$$
and $\widehat\gamma_{\mathbf{i}}^{\ell}$ as $\tilde\gamma_{\mathbf{i}}^{\ell}$ with the $U$'s replaced by $\widehat{U}$'s. Also define
$$\tilde\upsilon_{\mathbf{i}} := \sum_{|\ell| < T} k(\ell/h)\,\tilde\gamma_{\mathbf{i}}^{\ell}, \qquad \upsilon_{\mathbf{i}} := \sum_{|\ell| < T} \gamma_{\mathbf{i}}^{\ell},$$
and $\widehat\upsilon_{\mathbf{i}}$ as $\tilde\upsilon_{\mathbf{i}}$ with the $U$'s replaced by $\widehat{U}$'s. Then we write
$$\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}} = \sum_{|\ell| < T} k(\ell/h)\big(\tilde\gamma_{\mathbf{i}}^{\ell} - \gamma_{\mathbf{i}}^{\ell}\big) + \sum_{|\ell| < T}\big(k(\ell/h) - 1\big)\gamma_{\mathbf{i}}^{\ell}. \quad (B.9)$$
Since $\|\tilde\gamma_{\mathbf{i}}^{\ell} - \gamma_{\mathbf{i}}^{\ell}\|_{\psi_{p/2}} = O\big(\sqrt{T - |\ell|}/T\big) = O(1/\sqrt{T})$, the $\psi_{p/2}$-Orlicz norm of the first term is bounded by
$$h\sum_{|\ell| < T} \frac{1}{h}\,|k(\ell/h)|\,\|\tilde\gamma_{\mathbf{i}}^{\ell} - \gamma_{\mathbf{i}}^{\ell}\|_{\psi_{p/2}} = O\Big(\frac{h}{\sqrt{T}}\int |k(u)|\,du\Big) = O(h/\sqrt{T}),$$
whereas the second term is deterministic and is shown to be $O(h/\sqrt{T})$ by Andrews (1991). Thus $\|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}\|_{\psi_{p/2}} = O(h/\sqrt{T})$ uniformly in $\mathbf{i} \in [n]^4$. Hence, by the maximal inequality followed by Markov's inequality, we conclude that
$$\max_{\mathbf{i}}|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}| = O_P\Big(\psi_{p/2}^{-1}(n^4)\,\max_{\mathbf{i}}\|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}\|_{\psi_{p/2}}\Big) = O_P\big[\psi_{p/2}^{-1}(n^4)\,h/\sqrt{T}\big]. \quad (B.10)$$
We now use the fact that, for any $x, y \in \mathbb{R}^q$, $\big|\prod_{i=1}^q x_i - \prod_{i=1}^q y_i\big| = O\big(\sum_{i=0}^{q-1}\|x - y\|_\infty^{q-i}\,\|y\|_\infty^{i}\big)$, combined with the fact that $\|\widehat{U} - U\|_{\max} = o_P(1)$, to obtain
$$\max_{\mathbf{i},\ell}|\widehat\gamma_{\mathbf{i}}^{\ell} - \tilde\gamma_{\mathbf{i}}^{\ell}| \le \max_{\mathbf{i},t,\ell}\big|\widehat{U}_{i_1,t}\widehat{U}_{i_2,t}\widehat{U}_{i_3,t-|\ell|}\widehat{U}_{i_4,t-|\ell|} - U_{i_1,t}U_{i_2,t}U_{i_3,t-|\ell|}U_{i_4,t-|\ell|}\big| = O_P\big(\|\widehat{U} - U\|_{\max}\,\|U\|_{\max}^3\big) = O_P\big(\eta\,[\psi^{-1}(nT)]^3\big).$$
Therefore we conclude
$$\max_{\mathbf{i}}|\widehat\upsilon_{\mathbf{i}} - \tilde\upsilon_{\mathbf{i}}| \le \max_{\mathbf{i},\ell}|\widehat\gamma_{\mathbf{i}}^{\ell} - \tilde\gamma_{\mathbf{i}}^{\ell}|\sum_{|\ell| < T}|k(\ell/h)| = O_P\Big(h\,\eta\,[\psi^{-1}(nT)]^3\int |k(u)|\,du\Big) = O_P\big(h\,\eta\,[\psi^{-1}(nT)]^3\big). \quad (B.11)$$
The result then follows from the triangle inequality $\|\widehat{\Upsilon} - \Upsilon\|_{\max} \le \max_{\mathbf{i}}|\widehat\upsilon_{\mathbf{i}} - \tilde\upsilon_{\mathbf{i}}| + \max_{\mathbf{i}}|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}|$, together with (B.10) and (B.11).

Lemma B.14.
Let $U, V$ be $T \times n$ matrices such that $\|U - V\|_{\max} \le C_1$ and $\|V\|_{\max} \le C_2$. Then $\|\Sigma_U - \Sigma_V\|_{\max} \le C_3 := C_1(C_1 + 2C_2)$, where $\Sigma_U := U'U/T$ and $\Sigma_V := V'V/T$. Furthermore, if $C_3 \le \alpha\,\kappa^2(\Sigma_V, S_0, \zeta)/\big(|S_0|(1+\zeta)^2\big)$ for $S_0 \subseteq [n]$, $\zeta > 0$ and $\alpha \in [0,1]$, then
$$(1-\alpha)\,\kappa^2(\Sigma_V, S_0, \zeta) \le \kappa^2(\Sigma_U, S_0, \zeta) \le (1+\alpha)\,\kappa^2(\Sigma_V, S_0, \zeta).$$

Proof. By the (reverse) triangle inequality we have $\|U\|_{\max} - \|V\|_{\max} \le \|U - V\|_{\max}$, from which we conclude that $\|U\|_{\max} \le \|U - V\|_{\max} + \|V\|_{\max} \le C_1 + C_2$. Now $\|\Sigma_U - \Sigma_V\|_{\max} = \max_{1\le i,j\le n}\big|T^{-1}\sum_{t=1}^T U_{it}U_{jt} - T^{-1}\sum_{t=1}^T V_{it}V_{jt}\big| \le \max_{i,j,t}|U_{it}U_{jt} - V_{it}V_{jt}|$ and
$$|U_{it}U_{jt} - V_{it}V_{jt}| \le |(U_{it} - V_{it})U_{jt} + (U_{jt} - V_{jt})V_{it}| \le \|U - V\|_{\max}\big(\|U\|_{\max} + \|V\|_{\max}\big) \le C_1(C_1 + 2C_2).$$
For the second part of the lemma, notice that for any $x \in \mathbb{R}^n$ we have $|x'\Sigma_U x - x'\Sigma_V x| = |x'(\Sigma_U - \Sigma_V)x| \le \|\Sigma_U - \Sigma_V\|_{\max}\,\|x\|_1^2 \le C_3\,\|x\|_1^2$ by the first part. Also, if $\|x_{S_0^c}\|_1 \le \zeta\,\|x_{S_0}\|_1$, we have that $\|x\|_1 = \|x_{S_0}\|_1 + \|x_{S_0^c}\|_1 \le (1+\zeta)\|x_{S_0}\|_1 \le (1+\zeta)\sqrt{x'\Sigma_V x\,|S_0|}/\kappa(\Sigma_V, S_0, \zeta)$, where the last inequality follows from the definition of the compatibility constant. Thus
$$|x'\Sigma_U x - x'\Sigma_V x| \le C_3\,(1+\zeta)^2\,\frac{x'\Sigma_V x\,|S_0|}{\kappa^2(\Sigma_V, S_0, \zeta)} \le \alpha\,x'\Sigma_V x,
$$
where the last inequality follows from the assumed bound on $C_3$. Therefore, we have that $(1-\alpha)\,x'\Sigma_V x \le x'\Sigma_U x \le (1+\alpha)\,x'\Sigma_V x$ whenever $\|x_{S_0^c}\|_1 \le \zeta\,\|x_{S_0}\|_1$. Take the infimum to conclude.

Lemma B.15.
Let $W := (U, V)$ and $Z := (X, Y)$ be $T \times (n+1)$ matrices such that $\|W - Z\|_{\max} \le C_1$ and $\|Z\|_{\max} \le C_2$. Then, for any $\delta \in \mathbb{R}^n$, we have
$$\big\|U'(V - U\delta)/T - X'(Y - X\delta)/T\big\|_\infty \le (1 + \|\delta\|_1)\,C_1(C_1 + 2C_2).$$

Proof.
For convenience let $q := V - U\delta \in \mathbb{R}^T$ and $r := Y - X\delta \in \mathbb{R}^T$. Then Hölder's inequality gives us $\|r\|_\infty \le (1 + \|\delta\|_1)\,\|Z\|_{\max} \le (1 + \|\delta\|_1)\,C_2$ and $\|q - r\|_\infty \le (1 + \|\delta\|_1)\,\|W - Z\|_{\max} \le (1 + \|\delta\|_1)\,C_1$. From the (reverse) triangle inequality we obtain $\|q\|_\infty \le \|q - r\|_\infty + \|r\|_\infty \le (1 + \|\delta\|_1)(C_1 + C_2)$. Now, following the same steps as in the proof of the previous lemma, the left-hand side of the display in the statement can be upper bounded by $\|U - X\|_{\max}\,\|q\|_\infty + \|q - r\|_\infty\,\|X\|_{\max}$, which in turn is bounded by the right-hand side of the display.

Lemma B.16.
Under the same conditions as in Theorems 1 and 2,
$$\|\nabla L_T(\theta_0) - \nabla L(\theta_0)\| = O_P\Big[\frac{\psi_{p/2}^{-1}(n)}{\sqrt{T}} + \frac{\psi^{-1}(T)\,\psi^{-1}(nT)\,\psi^{-1}(n)\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \frac{\psi^{-1}(T)\,T^{1/4}}{\sqrt{n}}\Big],$$
$$\|\nabla^2 L_T(\theta_0) - \nabla^2 L(\theta_0)\|_{\max} = O_P\Big[\eta_1(n,T)\big(\psi^{-1}(nT) + \eta_1(n,T)\big) + \frac{\psi_{p/2}^{-1}(n^2)}{\sqrt{T}}\Big],$$
where $\nabla L(\theta) := -\mathbb{E}\big[U_{-i,t}(U_{it} - \theta'U_{-i,t})\big]$ and $\nabla^2 L(\theta) := \mathbb{E}\big[U_{-i,t}U_{-i,t}'\big]$.

Proof. By the triangle inequality we have
$$\|\nabla L_T(\theta_0) - \nabla L(\theta_0)\| = \big\|(\widehat{U}_x - U_x + U_x)'V/T - \mathbb{E}(U_x'V/T)\big\| \le \big\|U_x'V/T - \mathbb{E}(U_x'V/T)\big\| + \|\widehat{U}_x - U_x\|_{\max}\,\|V\|_\infty.$$
Similarly, using Lemma B.14,
$$\|\nabla^2 L_T(\theta_0) - \nabla^2 L(\theta_0)\|_{\max} \le \big\|\widehat{U}_x'\widehat{U}_x/T - U_x'U_x/T\big\|_{\max} + \big\|U_x'U_x/T - \mathbb{E}(U_x'U_x/T)\big\|_{\max}$$
$$\le \|\widehat{U}_x - U_x\|_{\max}\big(2\|U_x\|_{\max} + \|\widehat{U}_x - U_x\|_{\max}\big) + \big\|U_x'U_x/T - \mathbb{E}(U_x'U_x/T)\big\|_{\max}.$$
By Corollary 1 and Assumption 3 we can bound each of those terms in probability:
$$\big\|U_x'V/T - \mathbb{E}(U_x'V/T)\big\| = O_P\Big[\frac{\psi_{p/2}^{-1}(n)}{\sqrt{T}}\Big], \qquad \|\widehat{U}_x - U_x\|_{\max} = O_P\Big[\frac{\psi^{-1}(nT)\,\psi^{-1}(n)\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \frac{T^{1/4}}{\sqrt{n}}\Big] =: O_P[\eta_1(n,T)],$$
$$\|V\|_\infty = O_P\big[\psi^{-1}(T)\big], \qquad \|U_x\|_{\max} = O_P\big[\psi^{-1}(nT)\big], \qquad \big\|U_x'U_x/T - \mathbb{E}(U_x'U_x/T)\big\|_{\max} = O_P\Big[\frac{\psi_{p/2}^{-1}(n^2)}{\sqrt{T}}\Big].$$
Combining the displays yields the two bounds in the statement.

Simulation Results: Size (first value of φ)

The table reports the empirical size of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Panels (c) and (d) present the results when the number of factors is determined, respectively, by the eigenvalue ratio test and the information criterion IC. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10.
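The null hypothesis behind the size results below, no remaining cross-sectional covariance among the (estimated) idiosyncratic components, can be illustrated with a multiplier-bootstrap test for the maximum off-diagonal covariance, in the spirit of Chernozhukov, Chetverikov, and Kato (2013). The sketch below is our own minimal implementation, not the paper's exact procedure; the function name, the Gaussian multipliers, and the absence of studentization are simplifying assumptions:

```python
import numpy as np

def remaining_cov_test(U, B=500, level=0.05, seed=0):
    # U: T x n matrix of (estimated) idiosyncratic components.
    # H0: E[U_it U_jt] = 0 for all i != j, tested with the statistic
    # sqrt(T) * max_{i<j} |sigma_hat_ij| and a Gaussian-multiplier bootstrap.
    T, n = U.shape
    i, j = np.triu_indices(n, k=1)
    X = U[:, i] * U[:, j]                 # T x m cross-products, m = n(n-1)/2
    Xbar = X.mean(axis=0)                 # off-diagonal sample covariances
    stat = np.sqrt(T) * np.max(np.abs(Xbar))
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((B, T))       # multipliers e_1, ..., e_T per replication
    boot = np.sqrt(T) * np.max(np.abs(E @ (X - Xbar) / T), axis=1)
    crit = float(np.quantile(boot, 1 - level))
    return float(stat), crit, bool(stat > crit)
```

Rejection indicates remaining cross-sectional dependence, so a further modelling step for the idiosyncratic components may be warranted.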
The table reports the results for the first value of φ.

[Table omitted: Panels (a) Known factors; (b) Known number of factors; (c) Information criterion (IC); (d) Eigenvalue ratio; entries for three values of T and four values of n proportional to T.]

Simulation Results: Size (second value of φ)

The table reports the empirical size of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Panels (c) and (d) present the results when the number of factors is determined, respectively, by the eigenvalue ratio test and the information criterion IC. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10. The table reports the results for the second value of φ.

[Table omitted: Panels (a)-(d) as above.]

Simulation Results: Power (first value of φ)

The table reports the empirical power of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10.
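The panels that treat the number of factors as unknown require a data-driven selector. A minimal sketch of the two selectors used in the tables, the eigenvalue ratio of Ahn and Horenstein (2013) and a Bai-Ng-type information criterion, under our own simplifications (the penalty form and the cap `rmax` are assumptions, not the paper's exact choices):

```python
import numpy as np

def select_num_factors(X, rmax=8):
    # X: T x n panel. Eigenvalues of the demeaned sample second-moment, scaled by nT.
    T, n = X.shape
    Xc = X - X.mean(axis=0)
    lam = np.linalg.svd(Xc, compute_uv=False) ** 2 / (n * T)
    # eigenvalue ratio: maximize lam_k / lam_{k+1} over k = 1, ..., rmax
    ratios = lam[:rmax] / lam[1:rmax + 1]
    r_er = int(np.argmax(ratios)) + 1
    # Bai-Ng-type IC: log residual variance after k PCs plus a penalty per factor
    penalty = (n + T) / (n * T) * np.log(min(n, T))
    ic = [np.log(lam[k:].sum()) + k * penalty for k in range(rmax + 1)]
    r_ic = int(np.argmin(ic))
    return r_er, r_ic
```

Both selectors operate on the same eigenvalue sequence; the ratio looks for the sharpest drop, while the criterion trades residual variance against a penalty that grows with n and T.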
[Table omitted: Panels (a) Known factors; (b) Known number of factors; (c) Eigenvalue ratio; (d) Information criterion (IC); entries for three values of T and four values of n proportional to T.]

Simulation Results: Power (second value of φ)

The table reports the empirical power of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10.
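The size and power designs contrast a null with no remaining dependence against alternatives in which the idiosyncratic components are cross-correlated. A generic data-generating sketch of this flavor, our own toy design rather than the paper's exact one, with `rho` controlling the remaining cross-sectional dependence:

```python
import numpy as np

def simulate_panel(T, n, r=2, rho=0.0, seed=0):
    # X_t = Lambda F_t + U_t, where U_t has Toeplitz covariance rho^{|i-j|};
    # rho = 0 corresponds to the null of no remaining cross-sectional dependence.
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((T, r))
    Lam = rng.standard_normal((n, r))
    S = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    U = rng.standard_normal((T, n)) @ np.linalg.cholesky(S).T
    return F @ Lam.T + U, F, U
```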
[Table omitted: Panels (a) Known factors; (b) Known number of factors; (c) Eigenvalue ratio; (d) Information criterion (IC); entries for three values of T and four values of n proportional to T.]

Simulation Results: Informational Gains
The table reports the average mean squared error (MSE) of three different prediction models over 5-fold cross-validation subsamples. The goal is to predict the first variable using information from the remaining n − 1. Panel (a) considers the case of Sparse Regression (SR), where Y_t is LASSO-regressed on all the other variables. Panel (b) shows the results of Principal Component Regression (PCR). Finally, Panel (c) presents the results of FarmPredict. "N/A" means "not available". Note that there is no factor selection for Sparse Regression. "Known Number" means that the number of factors is known.
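The three competitors can be sketched as follows. This is a simplified stand-in for the paper's procedures: `lasso_ista` is a bare-bones lasso solver, the factors are plain principal components, the FarmPredict step here is just the PCR fit plus a lasso on the idiosyncratic components, and the tuning parameter `lam` is fixed rather than cross-validated:

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, iters=500):
    # lasso with intercept via ISTA (proximal gradient); minimal, not tuned
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    L = np.linalg.norm(Xc, 2) ** 2 / len(y)   # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        z = b - Xc.T @ (Xc @ b - yc) / (len(y) * L)
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    b0 = y.mean() - X.mean(axis=0) @ b
    return b0, b

def three_predictors(X, y, r=2, lam=0.1):
    # in-sample fits of SR, PCR, and a simplified FarmPredict
    a0, a = lasso_ista(X, y, lam)             # (a) sparse regression on X
    yhat_sr = a0 + X @ a
    Xc = X - X.mean(axis=0)
    Uu, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = Uu[:, :r] * s[:r]                     # (b) estimated factors (PC scores)
    G = np.column_stack([np.ones(len(y)), F])
    c, *_ = np.linalg.lstsq(G, y, rcond=None)
    yhat_pcr = G @ c
    Uid = Xc - F @ Vt[:r]                     # (c) idiosyncratic components
    d0, d = lasso_ista(Uid, y - yhat_pcr, lam)
    yhat_farm = yhat_pcr + d0 + Uid @ d
    return yhat_sr, yhat_pcr, yhat_farm
```

By construction the FarmPredict fit can only improve on PCR in sample, since its extra lasso step starts from the PCR residual; the informational-gains table asks whether that improvement survives out of sample.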
Panel (a): Sparse Regression (SR)
[Table omitted: columns Known Number, Eigenvalue Ratio, and Information Criterion (IC), each for three values of T; rows with n proportional to T.]
Panel (b): Principal Component Regression (PCR)
[Table omitted: same layout as Panel (a).]
Panel (c): FarmPredict
[Table omitted: same layout as Panel (a).]

Forecasting Results.
The table reports the frequency with which each model is ranked first, second, third, and fourth among the four alternatives. Panel (a) considers the case when the factors are selected by the eigenvalue ratio procedure. Panel (b) presents the results when the factors are selected by the information criterion IC. Panels (c) and (d) consider the cases when the number of factors is pre-specified as either one or two. We present the results for each individual group of variables as well as for the full set of macroeconomic variables.

Panel (a): Optimal Factor Selection (eigenvalue ratio)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Panel (b): Optimal Factor Selection (IC)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Panel (c): Fixed Number of Factors (r = 1)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Panel (d): Fixed Number of Factors (r = 2)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Figure 1: Correlations of returns larger than 0.15 in absolute value.
We estimate the correlations between all pairs of returns from a sample of nine specific sectors. The correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.

[Figure 2 panels: histograms of the estimated coefficients on the 16 risk factors MKT, HML, SMB, CMA, RMW, UMD, ACC, CFP, CHCSHO, BETA, DY, EP, MOM1m, MOM36m, IDIOVOL, RETVOL.]
Figure 2: First-stage coefficient estimates.
The figure shows the empirical distribution of the coefficient estimates from the first-stage regression, where each excess return is linearly regressed on 16 risk factors.
Figure 3: Correlations of first-stage residuals larger than 0.15 in absolute value.
We estimate the correlations between all pairs of residuals from the first-stage OLS regression on 16 observed risk factors from a sample of nine specific sectors. The correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.
Figure 4: Correlations of second-stage residuals larger than 0.15 in absolute value.
We estimate the correlations between all pairs of residuals from the second-stage principal component analysis from a sample of nine specific sectors. The correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.
Figure 5: Partial correlations of second-stage residuals larger than 0.15 in absolute value.
We estimate the partial correlations between all pairs of residuals from the second-stage LASSO regression from a sample of nine specific sectors. The partial correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.

[Figure 6 matrix: rows and columns are the sectors mining, food, apparel, paper, chemical, petroleum, construction, primary metals, fabricated metals, machinery, electrical equipment, transportation equipment, manufacturing, railroads, other transportation, utilities, department stores, retail, financial, and other.]

Figure 6: Variable Selection Frequency.
We report how often variables from column sectors are selected in the third-stage LASSO regression for firms in row sectors. The numbers are normalized by the total number of firms in each sector.

Figure 7: AR coefficient estimates.
The figure illustrates the empirical distribution of the ordinary least squares (OLS) estimates of the coefficients of a fourth-order autoregressive, AR(4), model across the 119 macroeconomic time series. Each panel relates to one specific coefficient.

Figure 8: Absolute sum of AR coefficient estimates.
The figure illustrates the empirical distribution of the absolute sum of the ordinary least squares (OLS) estimates of the coefficients of a fourth-order autoregressive, AR(4), model across the 119 macroeconomic time series.
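The AR(4) fits summarized in Figures 7 and 8 amount to an OLS regression of each series on a constant and its first four lags; a minimal sketch (the function name is ours):

```python
import numpy as np

def fit_ar(y, p=4):
    # OLS estimates of an AR(p) model y_t = c + sum_k phi_k y_{t-k} + e_t.
    # Returns (c, phi), where phi has length p.
    y = np.asarray(y, dtype=float)
    T = len(y)
    # design matrix: intercept plus the p lags of y
    X = np.column_stack([np.ones(T - p)] + [y[p - k:T - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]
```

Applying this to each of the 119 series and collecting the four slope estimates (Figure 7) or the sum of their absolute values (Figure 8) reproduces the kind of cross-sectional histograms shown above.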
Figure 9: Estimated number of factors.
The figure illustrates the number of selected factors over the estimation windows. The figure reports the results for the eigenvalue ratio procedure and the four information criteria discussed in the paper.

References

Abadie, A., A. Diamond, and J. Hainmueller (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association 105, 493–505.
Abadie, A. and J. Gardeazabal (2003). The economic costs of conflict: A case study of the Basque country. American Economic Review 93, 113–132.
Ahn, S. C. and A. R. Horenstein (2013). Eigenvalue ratio test for the number of factors. Econometrica 81, 1203–1227.
Andreou, E. and E. Ghysels (2021). Predicting the VIX and the volatility risk premium: The role of short-run funding spreads volatility factors. Journal of Econometrics 220, 366–398.
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–858.
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
Bai, J. and Y. Liao (2017). Inferences in panel data with interactive effects using large covariance matrices. Journal of Econometrics 200, 59–78.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
Bai, J. and S. Ng (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
Bai, J. and S. Ng (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74, 1133–1155.
Barigozzi, M. and C. Brownlees (2019). NETS: Network estimation for time series. Journal of Applied Econometrics 34, 347–364.
Barigozzi, M. and M. Hallin (2016). Generalized dynamic factor models and volatilities: Recovering the market volatility shocks. Econometrics Journal 19, C33–C60.
Barigozzi, M. and M. Hallin (2017a). Generalized dynamic factor models and volatilities: Estimation and forecasting. Journal of Econometrics 201, 307–321.
Barigozzi, M. and M. Hallin (2017b). A network analysis of the volatility of high-dimensional financial series. Journal of the Royal Statistical Society - Series C 66, 581–605.
Barigozzi, M. and M. Hallin (2020). Generalized dynamic factor models and volatilities: Consistency, rates, and prediction intervals. Journal of Econometrics 216, 4–34.
Barigozzi, M., M. Hallin, S. Soccorsi, and R. von Sachs (2020). Time-varying general dynamic factor models and the measurement of financial connectedness. Journal of Econometrics, forthcoming.
Bernanke, B., J. Boivin, and P. Eliasz (2005). Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics 120, 387–422.
Brito, D., M. Medeiros, and R. Ribeiro (2018). Forecasting large realized covariance matrices: The benefits of factor models and shrinkage. Technical Report 3163668, SSRN.
Brownlees, C., G. Gudmundsson, and G. Lugosi (2020). Community detection in partial correlation network models. Journal of Business & Economic Statistics, forthcoming.
Cai, T. (2017). Global testing and large-scale multiple testing for high-dimensional covariance structures. Annual Review of Statistics and its Application 4, 4.1–4.24.
Cai, T. and Z. Ma (2013). Optimal hypothesis testing for high dimensional covariance matrices. Bernoulli 19, 2359–2388.
Cai, T., Z. Ren, and H. Zhou (2016). Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics 10, 1–59.
Carvalho, C., R. Masini, and M. Medeiros (2018). ArCo: An artificial counterfactual approach for high-dimensional panel time-series data. Journal of Econometrics 207, 352–380.
Chen, S., L.-X. Zhang, and P.-S. Zhong (2010). Tests for high-dimensional covariance matrices. Journal of the American Statistical Association 105, 810–819.
Chernozhukov, V., D. Chetverikov, and K. Kato (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Annals of Statistics 41, 2786–2819.
Chernozhukov, V., D. Chetverikov, and K. Kato (2018). Inference on causal and structural parameters using many moment inequalities.
Diebold, F. and K. Yilmaz (2014). On the network topology of variance decompositions: Measuring the connectedness of financial firms. Journal of Econometrics 182, 119–134.
Doukhan, P. and S. Louhichi (1999). A new weak dependence condition and applications to moment inequalities. Stochastic Processes and their Applications 84, 313–342.
Fama, E. and K. French (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33, 3–56.
Fama, E. and K. French (2015). A five-factor asset pricing model. Journal of Financial Economics 116, 1–22.
Fan, J., Y. Fan, and J. Lv (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics 147, 186–197.
Fan, J., Y. Ke, and K. Wang (2020). Factor-adjusted regularized model selection. Journal of Econometrics 216, 71–85.
Fan, J., Q. Li, and Y. Wang (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B 79, 247–265.
Fan, J., R. Li, C.-H. Zhang, and H. Zou (2020). Statistical Foundations of Data Science. CRC Press.
Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B 75, 603–680.
Fan, J., R. Masini, and M. Medeiros (2020). Do we exploit all information for counterfactual analysis? Benefits of factor models and idiosyncratic correction. Working paper, Princeton University.
Feng, G., S. Giglio, and D. Xiu (2020). Taming the factor zoo: A test of new factors. Journal of Finance 75, 1327–1370.
Gagliardini, P., E. Ossola, and O. Scaillet (2019). A diagnostic criterion for approximate factor structure. Journal of Econometrics 212, 503–521.
Giannone, D., M. Lenza, and G. Primiceri (2018). Economic predictions with big data: The illusion of sparsity. Working paper, Northwestern University.
Giessing, A. and J. Fan (2020). Bootstrapping ℓ_p-statistics in high dimensions.
Giglio, S. and D. Xiu (2020). Asset pricing with omitted factors. Journal of Political Economy, forthcoming.
Gobillon, L. and T. Magnac (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics 98, 535–551.
Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. Review of Financial Studies 33, 2223–2273.
Guo, X. and C. Tang (2020). Specification tests for covariance structures in high-dimensional statistical models. Biometrika, forthcoming.
Kock, A. and L. Callot (2015). Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics 186, 325–344.
Lam, C. and J. Fan (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics 37, 4254–4278.
Ledoit, O. and M. Wolf (2002). Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Annals of Statistics 30, 1081–1102.
Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 365–411.
Ledoit, O. and M. Wolf (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. Annals of Statistics 40, 1024–1060.
Ledoit, O. and M. Wolf (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. Review of Financial Studies 30, 4349–4388.
Ledoit, O. and M. Wolf (2020). Analytical nonlinear shrinkage of large-dimensional covariance matrices. Annals of Statistics, forthcoming.
Ledoit, O. and M. Wolf (2021a). The power of (non-)linear shrinking: A review and guide to covariance matrix estimation. Journal of Financial Econometrics, forthcoming.
Ledoit, O. and M. Wolf (2021b). Quadratic shrinkage for large covariance matrices. Bernoulli, forthcoming.
Li, W. and Y. Qin (2014). Hypothesis testing for high-dimensional covariance matrices. Journal of Multivariate Analysis 128, 108–119.
Masini, R. and M. Medeiros (2019). Counterfactual analysis with artificial controls: Inference, high dimensions and nonstationarity. Working Paper 3303308, SSRN.
Masini, R., M. Medeiros, and E. Mendes (2019). Regularized estimation of high-dimensional vector autoregressions with weakly dependent innovations. Technical Report 1912.09002, arXiv.
McCracken, M. and S. Ng (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics 34, 574–589.
Medeiros, M. and E. Mendes (2016). ℓ1-regularization of high-dimensional time-series models with non-Gaussian and heteroskedastic errors. Journal of Econometrics 191, 255–271.
Merlevède, F., M. Peligrad, and E. Rio (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In C. Houdré, V. Koltchinskii, D. Mason, and M. Peligrad (Eds.), High Dimensional Probability V: The Luminy Volume, Volume 5, pp. 273–292. Institute of Mathematical Statistics.
Moon, R. and M. Weidner (2015). Linear regression for panel with unknown number of factors as interactive fixed effects. Econometrica 83, 1543–1579.
Moskowitz, T. and M. Grinblatt (1999). Do industries explain momentum? Journal of Finance 54, 1249–1290.
Negahban, S., P. Ravikumar, M. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27, 538–557.
Onatski, A., M. Moreira, and M. Hallin (2013). Asymptotic power of sphericity tests for high-dimensional data. Annals of Statistics 41, 1204–1231.
Rio, E. (1994). Inégalités de moments pour les suites stationnaires et fortement mélangeantes. Comptes Rendus Acad. Sci. Paris, Série I 318, 355–360.
Stock, J. and M. Watson (2002a). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–1179.
Stock, J. and M. Watson (2002b). Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics 20, 147–162.
van de Geer, S. and P. Bühlmann (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics 3, 1360–1392.
van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
Zheng, S., Z. Chen, H. Cui, and R. Li (2019). Hypothesis testing on linear structures of high-dimensional covariance matrix. Annals of Statistics 47, 3300–3334.
Zheng, S., G. Cheng, J. Guo, and H. Zhu (2019). Test for high-dimensional correlation matrices.