Bridging factor and sparse models
Jianqing Fan
Department of Operations Research and Financial Engineering, Princeton University. E-mail: [email protected]
Ricardo Masini
Center for Statistics and Machine Learning, Princeton University, and Sao Paulo School of Economics, Getulio Vargas Foundation. E-mail: [email protected]
Marcelo C. Medeiros
Department of Economics, Pontifical Catholic University of Rio de Janeiro. E-mail: [email protected]
February 24, 2021
Abstract
Factor and sparse models are two widely used methods to impose a low-dimensional structure in high dimensions. They are seemingly mutually exclusive. In this paper, we propose a simple lifting method that combines the merits of these two models in a supervised learning methodology that allows us to efficiently explore all the information in high-dimensional datasets. The method is based on a very flexible linear model for panel data, called the factor-augmented regression model, with both observable and latent common factors, as well as idiosyncratic components, as high-dimensional covariates. This model not only includes both factor regression and sparse regression as specific models but also significantly weakens the cross-sectional dependence and hence facilitates model selection and interpretability. The methodology consists of three steps. At each step, the remaining cross-sectional dependence can be inferred by a novel test for covariance structure in high dimensions. We develop asymptotic theory for the factor-augmented sparse regression model and demonstrate the validity of the multiplier bootstrap for testing high-dimensional covariance structure. This is further extended to testing high-dimensional partial covariance structures. The theory and methods are further supported by an extensive simulation study and applications to the construction of a partial covariance network of the financial returns of the constituents of the S&P 500 index and a prediction exercise for a large panel of macroeconomic time series from the FRED-MD database.
JEL Codes: C22, C23, C32, C33.
Keywords: Factor models, sparse regression, high-dimensional, supervised learning, hypothesis testing, covariance structure.
Acknowledgments: Medeiros gratefully acknowledges the partial financial support from CNPq and CAPES. We are deeply grateful to Alexander Giessing, Caio Almeida, Claudio Flores, Gilberto Boareto, Gustavo Bulhões, Henrique Pires, Marcelo Fernandes, Michael Wolf, and Nathalie Gimenes for helpful discussions and comments.

1 Introduction
With the emergence of new and large datasets, the correct characterization of the dependence among variables is of substantial importance. Usually, to achieve this goal, the literature has followed two seemingly orthogonal tracks. On the one hand, factor models have become an essential tool to summarize information in large datasets under the assumption that the remaining dependence structure is negligible. For instance, panel factor models are now applied to a wide variety of important applications, ranging from forecasting (macroeconomic) variables and asset pricing models to causal inference in applied microeconomics and network analysis. On the other hand, there have been major advances in parameter estimation in ultra high dimensions under the assumption of sparsity or weak sparsity, that is, that a variable depends only on a (very) small subset of the other variables.

In this paper, we take an alternative route and combine the best of the two worlds described above in order to better characterize the dependence structure of high-dimensional data. More specifically, we consider that the covariance structure of a large set of variables, organized in a panel data format, is characterized as a combination of a factor structure, where factors can be observed, unobserved, or both, and a weakly sparse idiosyncratic component. This formulation is general enough to accommodate a very large number of data generating processes of interest in economics, finance, and related areas. The proposed methodology has two ingredients: a three-step estimation procedure and a new test for structure in high-dimensional (partial) covariance matrices.

The steps of the estimation procedure are as follows. In the first step, we take the original data and remove the effects of any observed factors. These factors can be deterministic terms, such as seasonal dummies and/or trends, or any other observed covariates. The first step can be parametric or nonparametric, low- or high-dimensional.
A latent factor model is then estimated using the residuals from the first stage. Finally, we model the dependence among the idiosyncratic terms as a weakly sparse regression estimated by the Least Absolute Shrinkage and Selection Operator (LASSO). At each step, the null hypothesis of no remaining cross-sectional dependence can be tested by the proposed test for the (partial) covariance structure in high dimensions.
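Although the implementation details are developed later in the paper, the three steps can be sketched in a few lines. The following is a minimal illustrative sketch (our own, not the authors' code), using plain numpy: the number of latent factors is assumed known, and a bare-bones coordinate-descent LASSO stands in for a properly tuned solver.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Bare-bones coordinate-descent LASSO (stand-in for a tuned solver)."""
    _, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

def farm_steps(Y, X_obs, n_factors, lam=0.1):
    """Three-step procedure on a (T x n) panel Y with (T x k) observed covariates.
    Step 1: project out observed covariates.  Step 2: PCA on the residuals with
    the normalization F'F/T = I_r.  Step 3: LASSO of each idiosyncratic
    component on the remaining ones."""
    T, n = Y.shape
    # Step 1: OLS of each series on the observed covariates -> residuals R
    coef, *_ = np.linalg.lstsq(X_obs, Y, rcond=None)
    R = Y - X_obs @ coef
    # Step 2: principal components; top eigenvectors of R R' scaled by sqrt(T)
    _, eigvec = np.linalg.eigh(R @ R.T)
    F = np.sqrt(T) * eigvec[:, -n_factors:]        # (T x r) estimated factors
    Lam = R.T @ F / T                              # (n x r) estimated loadings
    U = R - F @ Lam.T                              # estimated idiosyncratic terms
    # Step 3: sparse regression of each U_i on U_{-i}
    Theta = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]
        Theta[i, idx] = lasso_cd(U[:, idx], U[:, i], lam)
    return F, Lam, U, Theta
```

The fitted pieces combine into the factor-augmented prediction discussed below. In practice, each LASSO penalty would be chosen by cross-validation or an information criterion rather than fixed in advance.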
Let $\mathbf{Y}_t := (Y_{1t}, \dots, Y_{nt})'$ be a random vector generated by a factor model as $Y_{it} = \boldsymbol{\lambda}_i' \mathbf{F}_t + U_{it}$, for $i = 1, \dots, n$, $t = 1, \dots, T$, where $\boldsymbol{\Sigma} := \mathrm{E}(\mathbf{U}_t \mathbf{U}_t')$, with $\mathbf{U}_t := (U_{1t}, \dots, U_{nt})'$, is not necessarily diagonal. Fix one component of interest $i \in \{1, \dots, n\}$, which serves as a response variable. Consider the following prediction models:
$$\mathcal{M}_1 : \mathrm{E}(Y_{it} \mid \mathbf{Y}_{-it}), \qquad \mathcal{M}_2 : \mathrm{E}(Y_{it} \mid \mathbf{F}_t), \qquad \mathcal{M}_3 : \mathrm{E}(Y_{it} \mid \mathbf{F}_t, \mathbf{U}_{-it}), \qquad (1.1)$$
where $\mathbf{Y}_{-it}$ and $\mathbf{U}_{-it}$ are, respectively, the vectors with the elements of $\mathbf{Y}_t$ and $\mathbf{U}_t$ excluding the $i$-th entry. Note that model $\mathcal{M}_3$ is indeed the factor-augmented regression model, since it is the same as $\mathrm{E}(Y_{it} \mid \mathbf{F}_t, \mathbf{Y}_{-it})$. As the paper will mainly focus on linear regressions, we will refer more specifically to $\widetilde{\mathcal{M}}_3$ below as the factor-augmented regression model.

Suppose that we observe both $\mathbf{F}_t$ and $\mathbf{U}_{-it}$. Which one of the three models above is best in terms of mean squared error (MSE) for prediction? The comparison between $\mathcal{M}_1$ and $\mathcal{M}_2$ is not clear, since it depends, among other things, on the magnitude of $\boldsymbol{\Sigma}$ relative to $\boldsymbol{\Lambda}' \boldsymbol{\Lambda}$, where $\boldsymbol{\Lambda} := (\boldsymbol{\lambda}_1, \dots, \boldsymbol{\lambda}_n)'$. However, since the $\sigma$-algebras generated by $\mathbf{Y}_{-it}$ and $\mathbf{F}_t$ are both included in the $\sigma$-algebra generated by $(\mathbf{F}_t, \mathbf{U}_{-it})$, it is not surprising that $\mathrm{MSE}(\mathcal{M}_3) \leq \min[\mathrm{MSE}(\mathcal{M}_1), \mathrm{MSE}(\mathcal{M}_2)]$. The same holds true if we replace the models in (1.1) by their best linear projections, which we denote by $\widetilde{\mathcal{M}}_j$ for $j \in \{1, 2, 3\}$, since the linear space $\widetilde{\mathcal{M}}_3$ is the largest. In this case, we can explicitly write the "gains" of $\widetilde{\mathcal{M}}_3$ when compared to $\widetilde{\mathcal{M}}_1$ and $\widetilde{\mathcal{M}}_2$:
$$\mathrm{MSE}(\widetilde{\mathcal{M}}_3) - \mathrm{MSE}(\widetilde{\mathcal{M}}_2) = -\,\boldsymbol{\theta}_i' \boldsymbol{\Sigma}_{-i,-i} \boldsymbol{\theta}_i, \qquad \mathrm{MSE}(\widetilde{\mathcal{M}}_3) - \mathrm{MSE}(\widetilde{\mathcal{M}}_1) = -\,\boldsymbol{\Delta}_{1i}' \boldsymbol{\Delta}_{1i} - \boldsymbol{\Delta}_{2i}' \boldsymbol{\Sigma}_{-i,-i} \boldsymbol{\Delta}_{2i},$$
where $\boldsymbol{\theta}_i$ and $\boldsymbol{\beta}_i$ are the coefficients of the projection of $U_{it}$ onto $\mathbf{U}_{-it}$ and the coefficients of the projection of $X_{it}$ onto $\mathbf{X}_{-it}$, respectively; $\boldsymbol{\Sigma}_{-i,-i}$ is $\boldsymbol{\Sigma}$ excluding the $i$-th row and column; $\boldsymbol{\Delta}_{1i} := \boldsymbol{\lambda}_i - \boldsymbol{\Lambda}_{-i}' \boldsymbol{\beta}_i$ and $\boldsymbol{\Delta}_{2i} := \boldsymbol{\beta}_i - \boldsymbol{\theta}_i$.
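As a quick numerical illustration of the ordering above (our own sanity check, not from the paper), one can simulate a factor model with a non-diagonal idiosyncratic covariance and compare the held-out MSE of the three linear projections:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, r, i0 = 4000, 10, 2, 0
Lam = rng.normal(size=(n, r))            # loadings
F = rng.normal(size=(T, r))              # factors
U = rng.normal(size=(T, n))              # idiosyncratic shocks
U[:, i0] += 0.8 * U[:, 1]                # U_{i0} loads on U_1 -> Sigma not diagonal
Y = F @ Lam.T + U

def oos_mse(Z, y, split):
    """OLS on the first `split` observations, squared error on the rest."""
    beta, *_ = np.linalg.lstsq(Z[:split], y[:split], rcond=None)
    return float(np.mean((y[split:] - Z[split:] @ beta) ** 2))

split, y = T // 2, Y[:, i0]
m1 = oos_mse(np.delete(Y, i0, axis=1), y, split)                   # M1: Y_{-i}
m2 = oos_mse(F, y, split)                                          # M2: F only
m3 = oos_mse(np.hstack([F, np.delete(U, i0, axis=1)]), y, split)   # M3: F and U_{-i}
```

With this design, the factor-augmented projection recovers the extra predictability left in $U_1$, so `m3` comes in below both `m1` and `m2`; the gap relative to `m2` is essentially the $\boldsymbol{\theta}_i' \boldsymbol{\Sigma}_{-i,-i} \boldsymbol{\theta}_i$ term above.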
From the previous expressions, it becomes evident that both $\widetilde{\mathcal{M}}_1$ and $\widetilde{\mathcal{M}}_2$ are restrictions on $\widetilde{\mathcal{M}}_3$. Broadly speaking, whenever one does not expect an exact factor model, there are potential gains from taking into account the contribution of the idiosyncratic components $\mathbf{U}_{-it}$. Therefore, we use $\widetilde{\mathcal{M}}_3$ as the base model for the estimation methodology described in Section 2.2.

The contributions of this paper are multi-fold. First, our methodology bridges the gap between two apparently competing methods for high-dimensional modeling; see, for example, the discussion in Giannone et al. (2018) and Fan et al. (2020). This yields a vast number of potential applications and spin-offs. For instance, in Fan et al. (2020), we apply the methods developed here to evaluate the effects of interventions, and we contribute to the literature on synthetic controls and related methods by combining the approaches of Gobillon and Magnac (2016) and Carvalho et al. (2018). Therefore, in our setup both a common factor structure and weak sparsity can coexist.

Second, our results can also serve as a diagnostic and misspecification tool. For panel data models with interactive fixed effects, as in Moon and Weidner (2015) and Bai and Liao (2017), our test can be directly applied to uncover the dependence structure among cross-sectional units before and after accounting for common factor components. If the factor structure is informative enough, we expect the idiosyncratic covariance matrix to be almost sparse. If this is not the case, we may have underestimated the number of factors. One popular application is in asset pricing, as discussed in Gagliardini et al. (2019) and in the empirical section of this paper. There is a huge number of proposed factors, as described in Feng et al. (2020), Giglio and Xiu (2020), and Gu et al. (2020).
We can apply our methodology not only to test for omitted factors but also to estimate network connections among firms, as in Diebold and Yilmaz (2014) and Brownlees et al. (2020). Finally, as a diagnostic tool, our paper tackles the same problem as Gagliardini et al. (2019). However, we take an alternative solution strategy, which relies on a rather different set of hypotheses.

Third, the methodology proposed here contributes to the forecasting literature. For instance, in the second application considered in this paper, we build forecasting models for a large cross-section of macroeconomic variables. We call this method the
FarmPredict. We show that the combination of factors and a sparse regression strongly outperforms the traditional principal component regression, as in Stock and Watson (2002a,b). Therefore,
FarmPredict can be an additional contribution to the forecasting and machine learning toolkit. The method can be easily extended to a multivariate setting, combining factor-augmented vector autoregressions (FAVAR), as in Bernanke et al. (2005), with sparse vector models, as in Kock and Callot (2015) and Masini et al. (2019).

Fourth, we show consistency of factor estimation based on the residuals of a first-step regression. Our results hold for both parametric (linear or nonlinear) and nonparametric first stages. A high-dimensional first stage is also allowed. Note that current results in the literature consider factors estimated from observed data, whereas our derivations allow a much more flexible and general setup (Bai and Ng, 2002, 2003, 2006). More specifically, our methodology accommodates settings where there are both observed and latent factors, as well as trend-stationary data. In the latter case, the trend can be first removed by a (nonparametric) first-stage regression.

Fifth, we also contribute to the LASSO literature. LASSO cannot be model-selection consistent for highly correlated variables. Through the decomposition of covariates into factors and idiosyncratic components, we decorrelate the variables and make the model selection condition much easier to hold; see, for example, Fan et al. (2020). We show consistency of the estimates based on residuals of the previous steps. Our results are derived under restrictions on the population covariance matrix of the data and not on the estimated one, as is usual in many papers; see, for example, van de Geer and Bühlmann (2009).

Finally, we extend the results in Chernozhukov et al. (2013, 2018) to strong-mixing data in order to construct hypothesis tests for covariance and partial covariance structure in high dimensions. This step is necessary for econometric and financial applications.
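As an aside on the decorrelation point in the fifth contribution (our illustration, not the paper's): with a strong common factor, the raw series are heavily cross-correlated, while the residuals from removing the leading principal component are nearly uncorrelated, which is the regime in which LASSO's selection conditions are plausible.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 500, 20
f = rng.normal(size=T)                    # one strong common factor
loadings = 1.0 + rng.uniform(size=n)      # loadings bounded away from zero
Y = np.outer(f, loadings) + rng.normal(size=(T, n))

def max_abs_offdiag_corr(Z):
    C = np.corrcoef(Z, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return float(np.abs(C).max())

# remove the leading principal component (the estimated common factor)
Yc = Y - Y.mean(axis=0)
_, eigvec = np.linalg.eigh(Yc @ Yc.T)
P = eigvec[:, -1:]                        # top eigenvector in the time dimension
U_hat = Yc - P @ (P.T @ Yc)               # idiosyncratic residuals

raw = max_abs_offdiag_corr(Y)             # large: dominated by the factor
defact = max_abs_offdiag_corr(U_hat)      # much smaller after defactoring
```

The contrast between `raw` and `defact` is exactly the decorrelation effect exploited by the three-step procedure.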
As side results, in order to develop the test, we first show consistency of kernel-based estimation of the high-dimensional long-run covariance matrix of a dependent process. This is a new result with important consequences for the theory of high-dimensional regression with dependent errors. We also establish a new consistency result for an estimator of the partial covariance matrix in high dimensions with strong-mixing data. Our proposed tests can be used to infer, for instance, whether the (partial) covariance matrix of a high-dimensional random vector is diagonal or block-diagonal. More generally, we can test any pre-defined structure. Furthermore, we show that the test remains valid when we use the residuals from a previous estimation step to compute the covariance matrix. This result allows us to apply the test to the three-stage estimation procedure proposed in this paper. Although our results are derived under the assumption that the number of factors is known, simulation results presented in the paper provide evidence that the test has good finite-sample properties even when the number of factors is determined by data-driven methods commonly found in the literature.

Over the past years, a vast number of papers have proposed different methods to test for covariance structure in high dimensions. See, for example, Ledoit and Wolf (2002), Chen et al. (2010), Onatski et al. (2013), Cai and Ma (2013), Li and Qin (2014), Cai et al. (2016), Zheng et al. (2019), and Guo and Tang (2020), among many others. To the best of our knowledge, we complement all the previous papers by simultaneously considering high dimensions, strong-mixing data with mild distributional assumptions, and pre-estimation when constructing tests for both covariance and partial covariance structure. Recently, Giessing and Fan (2020) also extended the results in Chernozhukov et al. (2013).
However, their setup is very different from ours, as the authors only consider the case of independent and identically distributed data. For a nice recent review, see Cai (2017).

1.3 Organization of the Paper

In addition to this Introduction, the paper is organized as follows. We present the model setup and assumptions in Section 2. The theoretical results are presented in Section 3, with practical guides given in Section 4. We depict the results of a simulation experiment in Section 5 and discuss the empirical application in Section 6. Section 7 concludes. All proofs are deferred to the Appendix. The Supplementary Material contains additional numerical results. Tables and figures in the Supplementary Material are referenced with an "S" before the number.
1.4 Notation

All random variables (real-valued scalars, vectors, and matrices) are defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We denote random variables by an upper-case letter, $X$ for instance, and their realizations by a lower-case letter, $X = x$. The expected value operator is with respect to the $\mathbb{P}$ law, such that $\mathrm{E}(X) := \int_{\Omega} X(\omega)\,\mathrm{d}\mathbb{P}(\omega)$. Matrices and vectors are written in bold letters, $\mathbf{X}$. Except for the number of factors, $r$, and the number of covariates, $k$, defined below, all other dimensions are allowed to depend on the sample size ($T$). However, we omit this dependency throughout the paper to avoid cluttering the notation prematurely.

We use $\|\cdot\|_p$ to denote the $\ell_p$ norm for $p \in [1, \infty]$, such that for a $d$-dimensional (possibly random) vector $\mathbf{X} = (X_1, \dots, X_d)'$, we have $\|\mathbf{X}\|_p := (\sum_{i=1}^{d} |X_i|^p)^{1/p}$ for $p \in [1, \infty)$ and $\|\mathbf{X}\|_\infty := \sup_{i \leq d} |X_i|$. If $\mathbf{X}$ is an $(m \times n)$ possibly random matrix, then $\|\mathbf{X}\|_p$ denotes the matrix $\ell_p$-induced norm and $\|\mathbf{X}\|_{\max}$ denotes the maximum entry of $\mathbf{X}$ in absolute terms. Note that whenever $\mathbf{X}$ is random, $\|\mathbf{X}\|_p$ for $p \in [1, \infty]$ and $\|\mathbf{X}\|_{\max}$ are random variables. We also reserve the symbol $\|\cdot\|$ without subscript for the Euclidean norm, $\|\cdot\| := \|\cdot\|_2$, for both vectors and matrices.

For any convex function $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ such that $\psi(0) = 0$ and $\psi(x) \to \infty$ as $x \to \infty$, and any (real-valued) random variable $X$, we denote its Orlicz norm by $\|X\|_\psi$, which is defined by $\|X\|_\psi := \inf\{C > 0 : \mathrm{E}[\psi(|X|/C)] \leq 1\}$. Since we are only concerned with polynomial and exponential tails, we consider upper bounds on $\|X\|_{\psi_p}$, where $\psi_p \in \Psi$ and
$$\Psi := \left\{\psi : \mathbb{R}_+ \to \mathbb{R}_+ : \psi(x) = x^{p+\epsilon},\ p \geq 1,\ \epsilon > 0;\ \text{or}\ \psi(x) = e^{x^p} - 1,\ p > 0\right\}. \qquad (1.2)$$
Evidently, as opposed to $\|X\|_p$, $\|X\|_{\psi_p}$ is always a non-negative scalar. We do not abide by any convention to apply the Orlicz norm to vectors or matrices, to avoid confusion. We also use extensively the fact that $\|XY\|_{\psi_p} \leq \|X\|_{\psi_{2p}} \|Y\|_{\psi_{2p}}$, where $X$ and $Y$ are two real-valued, not necessarily independent, random variables. For the polynomial case, this is just the Cauchy-Schwarz inequality. For the exponential bounds we have similar results: for instance, when $p = 1$ and $X$ and $Y$ are sub-Gaussian random variables with $\psi_2(x) = \exp(x^2) - 1$, it is not difficult to show that $XY$ is sub-exponential with $\psi_1(x) = \exp(x) - 1$.

For a vector $\mathbf{X}$, $\mathrm{diag}(\mathbf{X})$ denotes the diagonal matrix whose diagonal entries are the elements of $\mathbf{X}$. $\mathbb{1}(A)$ is the indicator function of the event $A$, i.e., $\mathbb{1}(A) = 1$ if $A$ is true and $0$ otherwise. We adopt the Landau big/small $O, o$ notation and the "in probability" $O_P$ and $o_P$ analogues. We say that $x$ is of the same order as $y$, $x \asymp y$, if both $x = O(y)$ and $y = O(x)$. We write $X \asymp_P Y$ if both $X = O_P(Y)$ and $Y = O_P(X)$. Unless stated otherwise, the asymptotics are taken as $T \to \infty$, where $T$ is the time-series dimension, and the $o(1)$ and $o_P(1)$ are with respect to the limit as $T \to \infty$. We denote convergence in probability and in distribution by "$\xrightarrow{p}$" and "$\Rightarrow$", respectively.

2 Model Setup and Assumptions

We apply the test and the three-stage estimation procedure to a very general panel data model, which is rich enough to nest several important cases in economics, finance, and related areas. More specifically, we define the following Data Generating Process (DGP).
Assumption 1 (DGP). The process $\{Y_{it} : 1 \leq i \leq n,\ t \geq 1\}$ is generated by
$$Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + \boldsymbol{\lambda}_i' \mathbf{F}_t + U_{it} =: \boldsymbol{\gamma}_i' \mathbf{X}_{it} + R_{it}, \qquad (2.1)$$
where $\mathbf{X}_{it}$ is a $k$-dimensional observable (random) vector which may also include a constant term, $\mathbf{F}_t$ is an $r$-dimensional vector of common latent factors, and $U_{it}$ is a zero-mean idiosyncratic shock. The unknown parameters are $\boldsymbol{\gamma}_i \in \mathbb{R}^k$, the factor loadings $\boldsymbol{\lambda}_i$, and the covariance matrix of the idiosyncratic shocks. Finally, we assume that $\mathbf{X}_{it}$, $\mathbf{F}_t$, and $U_{it}$ are mutually uncorrelated.

Remark 1.
In Assumption 1, we consider that $k$, the dimension of $\mathbf{X}_{it}$, is finite and fixed, and that the relation between $Y_{it}$ and $\mathbf{X}_{it}$ is linear. This is for the sake of exposition. However, the theoretical results in this paper are written in terms of the consistency rate of the first-step estimation; therefore, the DGP can be made much more general by just changing the rates. (For simplicity, we assume that all units $i$ have the same number of covariates, $k$. The framework can certainly accommodate situations where $k_i$ is a function of $i$.)

Example 1 (Asset Pricing Models). Suppose $Y_{it}$ is the return of an asset $i$ at time $t$ and let $\mathbf{X}_{it} := \mathbf{X}_t$ be a set of $k$ observable risk factors, such as the market returns and/or the Fama-French factors as in, for example, Fama and French (1993) or Fama and French (2015). $\mathbf{F}_t$ can be a set of additional, unobservable, risk factors. Several asset pricing models, such as the Capital Asset Pricing Model (CAPM) or the Arbitrage Pricing Theory (APT) model, are nested in this general framework.

Example 2 (Networks). Model (2.1) also complements the network specifications discussed in Barigozzi and Hallin (2016), Barigozzi and Hallin (2017b), and Barigozzi and Brownlees (2019). Furthermore, the test proposed here can be used to detect network links as in Diebold and Yilmaz (2014) and Brownlees et al. (2020). For example, $Y_{it}$ can be the (realized) volatility of financial assets and $\mathbf{X}_{it} := \mathbf{X}_t$ can be volatility factors as in Brito et al. (2018) and Andreou and Ghysels (2021).

Example 3 (Panel Data Models with Interactive Fixed Effects). Model (2.1) is the exact definition of the panel model with interactive fixed effects considered in Gobillon and Magnac (2016), where the authors propose an alternative to the Synthetic Control method of Abadie and Gardeazabal (2003) and Abadie et al. (2010) to evaluate the effects of regional policies. Model (2.1) is also at the heart of the
FarmTreat method of Fan et al. (2020) and the model discussed in Moon and Weidner (2015).
Example 4 (FAVAR). In the case where the index $i$ represents a different dependent (endogenous) variable and $U_{it}$ is a dependent process, model (2.1) turns out to be equivalent to the Factor-Augmented Vector Autoregressive (FAVAR) model of Bernanke et al. (2005). In this case, $\mathbf{X}_{it}$ may also include lagged dependent variables.

The method proposed here for estimation, inference, and prediction consists of three stages, where at the end of each stage the covariance structure of the residuals is tested.

1. For each $i \in \{1, \dots, n\}$, run the regression $Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + R_{it}$, $t \in \{1, \dots, T\}$, and compute the residuals $\widehat{R}_{it} := Y_{it} - \widehat{\boldsymbol{\gamma}}_i' \mathbf{X}_{it}$. The first stage may consist of a regression on a constant, a deterministic time trend, and seasonal dummies, for instance, or, as in Example 1, a regression on observed factors. After removing the contribution from the observables, we can use the test for the null hypothesis of no remaining (partial) covariance structure to check whether the (partial) covariance of $R_{it}$ is dense or sparse. If it is dense, we move to Step 2; otherwise, we jump directly to Step 3. This first parametric, low-dimensional step can be replaced by a nonlinear/nonparametric regression or by a high-dimensional model when, for example, the number of observed factors is large. This will be discussed further in the subsequent sections.

2. Write $\mathbf{R}_t := (R_{1t}, \dots, R_{nt})'$ and $\mathbf{R}_t = \boldsymbol{\Lambda} \mathbf{F}_t + \mathbf{U}_t$. The second step consists of estimating $\boldsymbol{\Lambda}$ and $\mathbf{F}_t$ for $t = 1, \dots, T$ using $\widehat{\mathbf{R}}_t$ through principal component analysis (PCA), and computing $\widehat{\mathbf{U}}_t = \widehat{\mathbf{R}}_t - \widehat{\boldsymbol{\Lambda}} \widehat{\mathbf{F}}_t$. After estimating the factors and loadings, we apply our testing procedure to test for remaining covariance structure in $\mathbf{U}_t$. The second-step estimation can also be carried out by dynamic factor models, as in Barigozzi and Hallin (2016, 2017, 2020) or Barigozzi et al. (2020).

3. Now, define $\widehat{\mathbf{U}}_{-it} := (\widehat{U}_{1t}, \dots, \widehat{U}_{i-1,t}, \widehat{U}_{i+1,t}, \dots, \widehat{U}_{nt})'$.
The third estimation step consists of a sparse regression to estimate the following model for each $i \in \{1, \dots, n\}$:
$$\widehat{U}_{it} = \boldsymbol{\theta}_i' \widehat{\mathbf{U}}_{-it} + V_{it}, \qquad t \in \{1, \dots, T\}.$$
At the end of Steps 2 and 3, we can conduct the relevant inference on the structures of the covariance or partial covariance matrices. We can also provide updated predictions of future outcomes. We detail those in the next subsection. Also note that the nonzero estimates of $\boldsymbol{\theta}_i$ shed light on the links among idiosyncratic components.

In a pure prediction exercise, one is usually interested in the linear projection of $Y_{it}$ onto $(\mathbf{X}_{it}, \mathbf{F}_t, \mathbf{U}_{-it})$, which results in the factor-augmented regression model (FARM)
$$Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + \boldsymbol{\lambda}_i' \mathbf{F}_t + \boldsymbol{\theta}_i' \mathbf{U}_{-it} + \varepsilon_{it}, \qquad t \in \{1, \dots, T\}, \qquad (2.2)$$
for each given $i$, and $Y_{it}$ can be predicted by
$$\widehat{Y}_{it} := \widehat{\boldsymbol{\gamma}}_i' \mathbf{X}_{it} + \widehat{\boldsymbol{\lambda}}_i' \widehat{\mathbf{F}}_t + \widehat{\boldsymbol{\theta}}_i' \widehat{\mathbf{U}}_{-it}, \qquad i \in \{1, \dots, n\}. \qquad (2.3)$$
This will be called FarmPredict. Note that model (2.2) is equivalent to using the predictors $\mathbf{X}_{it}$, $\mathbf{Y}_{-it}$, and $\mathbf{F}_t$, which augment the predictors $\mathbf{X}_{it}$, $\mathbf{Y}_{-it}$ by the common factors $\mathbf{F}_t$. The form in (2.2) mitigates the collinearity issues in high dimensions.

Model (2.2) also bridges factor regression ($\boldsymbol{\theta}_i = \mathbf{0}$) on one end and (sparse) regression on the other end, with $\boldsymbol{\lambda}_i = \boldsymbol{\Lambda}_{-i}' \boldsymbol{\theta}_i$, where $\boldsymbol{\Lambda}_{-i}$ is the loading matrix without the $i$-th row. In the latter case, model (2.2) becomes a (sparse) regression model:
$$Y_{it} = \boldsymbol{\gamma}_i' \mathbf{X}_{it} + \boldsymbol{\theta}_i' \mathbf{R}_{-it} + \varepsilon_{it}, \qquad t \in \{1, \dots, T\}. \qquad (2.4)$$
In this case, the FARM specification in (2.2) decorrelates the variables $\mathbf{R}_{-it}$ in (2.4). It makes the model-selection consistency much easier to satisfy and forms the basis of FarmSelect in Fan et al. (2020). In general, for FARM (2.2) with sparsity, FarmPredict chooses additional idiosyncratic components to enhance the prediction of the factor regression.

In other applications, the structure of the idiosyncratic components $\mathbf{U} = (U_1, \dots, U_n)'$ is the object of interest. An estimator for $\boldsymbol{\Sigma} = \mathrm{E}(\mathbf{U}_t \mathbf{U}_t')$ could simply be given by
$$\widehat{\boldsymbol{\Sigma}} := \frac{1}{T} \sum_{t=1}^{T} \widehat{\mathbf{U}}_t \widehat{\mathbf{U}}_t'. \qquad (2.5)$$
(In high dimensions, where $n > T$, there are many possible estimators for $\boldsymbol{\Sigma}$ available in the literature; see the book by Fan et al. (2020).)

In order to properly understand the (linear) relation between a pair $(U_{it}, U_{jt})$ of $\mathbf{U}_t$, a simple covariance estimate is sometimes not enough. In applications, it is often desirable to have a direct measure of how $U_{it}$ and $U_{jt}$ are connected. By direct connection, we mean the relation between those units after removing the contribution of the other variables of $\mathbf{U}_t$. For this purpose, we use the partial covariance between $U_{it}$ and $U_{jt}$, defined for any pair $i, j \in \{1, \dots, n\}$ as $\pi_{ij} := \mathrm{E}(V_{ij,t} V_{ji,t})$, where $V_{ij,t} := U_{it} - \mathrm{Proj}(U_{it} \mid \mathbf{U}_{-ij,t})$ and $\mathrm{Proj}(U_{it} \mid \mathbf{U}_{-ij,t})$ denotes the linear projection of $U_{it}$ onto the elements of $\mathbf{U}_t$ excluding entries $i$ and $j$, which we denote by $\mathbf{U}_{-ij,t}$. We suggest estimating the partial covariance matrix $\boldsymbol{\Pi} := (\pi_{ij})$ by $\widehat{\boldsymbol{\Pi}} := (\widehat{\pi}_{ij})$ with
$$\widehat{\pi}_{ij} := \frac{1}{T} \sum_{t=1}^{T} \widehat{V}_{ij,t} \widehat{V}_{ji,t}, \qquad (2.6)$$
where $\widehat{V}_{ij,t}$ is the residual of the LASSO regression of $\widehat{U}_{it}$ onto $\widehat{\mathbf{U}}_{-ij,t}$ for $i, j \in \{1, \dots, n\}$.

We also would like to conduct a formal test on the population structure of $\mathbf{U}_t$. Specifically, we propose a test for the following null hypothesis on the covariance matrix:
$$H_0^{\Sigma} : \boldsymbol{\Sigma}_D = \boldsymbol{\Sigma}_D^0, \qquad D \subseteq \{1, \dots, n\} \times \{1, \dots, n\}, \qquad (2.7)$$
for a given subset $D$, where $\boldsymbol{\Sigma}_D$ denotes the elements of $\boldsymbol{\Sigma}$ indexed by $D$, and we allow $d := |D|$ to diverge as $n, T \to \infty$. For example, to test whether $\boldsymbol{\Sigma}$ is diagonal, $D$ consists of all off-diagonal elements and $\boldsymbol{\Sigma}_D^0 = \mathbf{0}$. To test whether $\boldsymbol{\Sigma}$ is block-diagonal, $D$ can be taken to be the corresponding off-diagonal blocks.
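A schematic version of the partial-covariance estimator $\widehat{\pi}_{ij}$ in (2.6) can be written as follows. This is an illustrative numpy sketch: the plain coordinate-descent LASSO and the diagonal convention are our simplifications, and in practice the penalty would be tuned.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Bare-bones coordinate-descent LASSO (stand-in for a tuned solver)."""
    _, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

def partial_cov(U, lam=0.05):
    """pi_hat_ij = T^{-1} sum_t Vhat_{ij,t} Vhat_{ji,t}, where Vhat_{ij} is the
    LASSO residual of U_i regressed on U_{-ij} (cf. (2.6)).  The diagonal is
    filled with sample variances purely as a convention for this sketch."""
    T, n = U.shape
    Pi = np.zeros((n, n))
    for i in range(n):
        Pi[i, i] = U[:, i] @ U[:, i] / T
        for j in range(i + 1, n):
            keep = [k for k in range(n) if k not in (i, j)]
            Vij = U[:, i] - U[:, keep] @ lasso_cd(U[:, keep], U[:, i], lam)
            Vji = U[:, j] - U[:, keep] @ lasso_cd(U[:, keep], U[:, j], lam)
            Pi[i, j] = Pi[j, i] = Vij @ Vji / T
    return Pi
```

On a panel where only the first two components are directly linked, the off-diagonal entry for that pair stands out while the remaining pairs shrink toward zero.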
Similarly, for testing the structure of the partial covariance matrix,
$$H_0^{\Pi} : \boldsymbol{\Pi}_D = \boldsymbol{\Pi}_D^0, \qquad D \subseteq \{1, \dots, n\} \times \{1, \dots, n\}. \qquad (2.8)$$
The null hypotheses (2.7) and (2.8) nest several cases of interest in applications. The most common would be to test for a diagonal or a block-diagonal structure in $\boldsymbol{\Sigma}$ and/or $\boldsymbol{\Pi}$, but they also accommodate other structures. (With minor changes, the proposed test can also be used to test the null $\mathbf{M}\,\mathrm{vec}(\boldsymbol{\Sigma}) = \mathbf{m}$ for some $(d \times n^2)$ matrix $\mathbf{M}$ and $d$-dimensional vector $\mathbf{m}$, where $d := d_T$ is also a function of $T$.) The task of estimating $\boldsymbol{\Sigma}$ is well documented in the literature, even in high-dimensional setups; see, for example, Ledoit and Wolf (2004, 2012, 2017, 2020), Fan et al. (2008), Lam and Fan (2009), or Fan et al. (2013), and see Ledoit and Wolf (2021a) for a recent survey.

The challenges for testing (2.7) and (2.8) are similar and can be summarized as follows:

1. As we allow both $n$ and $d$ to diverge to infinity as $T$ grows, sometimes at a faster rate, we have a high-dimensional test where some sort of Gaussian approximation result for dependent data must be deployed, as we also allow the number of covariances to be tested ($d$) to diverge. In this case, a high-dimensional long-run covariance matrix must be estimated if one expects to obtain asymptotically correct test size.

2. We do not directly observe $\{\mathbf{U}_t\}$ or $\{V_{ij,t}\}$. Instead, we have estimates of both from a multi-stage estimation procedure, as we illustrate later in this paper.

We propose to test (2.7) using the statistic
$$S_D^{\Sigma} := \|\sqrt{T}(\widehat{\boldsymbol{\Sigma}}_D - \boldsymbol{\Sigma}_D^0)\|_{\max}. \qquad (2.9)$$
The quantiles of $S_D^{\Sigma}$ are approximated by a Gaussian bootstrap approximation. To describe the procedure, let $\boldsymbol{\Upsilon}_{\Sigma}$ denote the $(d \times d)$ covariance matrix of the vectorized submatrix $(\widetilde{\sigma}_{ij})_{(i,j) \in D}$, where $\widetilde{\sigma}_{ij} := \frac{1}{T}\sum_{t=1}^{T} U_{it} U_{jt}$. Since the process $\{\mathbf{U}_t\}$ might present some form of temporal dependence (refer to Assumption 3(c)), we estimate $\boldsymbol{\Upsilon}_{\Sigma}$ using a Newey-West-type estimator.
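Such a kernel-weighted long-run covariance estimate, together with the Gaussian bootstrap critical value for the sup-statistic, can be sketched as follows. This is our illustration: the Bartlett kernel, the bandwidth, and the small diagonal jitter are arbitrary choices made for the example.

```python
import numpy as np

def longrun_cov(D, h):
    """Newey-West-type long-run covariance of the (T x d) moment array D,
    using the Bartlett kernel k(x) = max(1 - |x|, 0) with bandwidth h."""
    T, _ = D.shape
    Dc = D - D.mean(axis=0)
    Ups = Dc.T @ Dc / T
    for ell in range(1, int(h)):
        M = Dc[ell:].T @ Dc[:-ell] / T
        Ups += (1.0 - ell / h) * (M + M.T)
    return Ups

def bootstrap_crit(Ups, tau=0.95, n_boot=2000, seed=0):
    """tau-quantile of ||Z||_inf with Z ~ N(0, Ups), by simulation."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Ups + 1e-10 * np.eye(Ups.shape[0]))
    Z = rng.normal(size=(n_boot, Ups.shape[0])) @ L.T
    return float(np.quantile(np.abs(Z).max(axis=1), tau))

# Toy check of H0: all off-diagonal covariances are zero (here H0 is true)
rng = np.random.default_rng(3)
T, n = 600, 5
U = rng.normal(size=(T, n))
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
D = np.column_stack([U[:, i] * U[:, j] for (i, j) in pairs])
S = np.sqrt(T) * np.abs(D.mean(axis=0)).max()   # statistic (2.9) with Sigma0_D = 0
crit = bootstrap_crit(longrun_cov(D, h=5.0))
reject = S > crit                                # should usually be False under H0
```

Under the null, `S` falls below the bootstrap critical value with probability close to the nominal level; in the paper's setting the moment array would be built from the estimated $\widehat{U}_{it}$ rather than the true shocks.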
For a given integrable kernel function $k(\cdot)$ with $k(0) = 1$ and bandwidth $h > 0$, $\boldsymbol{\Upsilon}_{\Sigma}$ is estimated by
$$\widehat{\boldsymbol{\Upsilon}}_{\Sigma} := \sum_{|\ell| < T} k(\ell/h)\,\widehat{\mathbf{M}}_{\Sigma,\ell}, \qquad \widehat{\mathbf{M}}_{\Sigma,\ell} := \frac{1}{T} \sum_{t=\ell+1}^{T} \widehat{\mathbf{D}}_{\Sigma,t} \widehat{\mathbf{D}}_{\Sigma,t-\ell}', \qquad (2.10)$$
where $\widehat{\mathbf{D}}_{\Sigma,t}$ is a $d$-dimensional vector with entries given by $\widehat{U}_{it}\widehat{U}_{jt} - \widehat{\sigma}_{ij}$ for $(i,j) \in D$. Finally, let $c_{\Sigma}^{*}(\tau)$ be the $\tau$-quantile of the Gaussian bootstrap statistic $S_D^{*} := \|\mathbf{Z}_{\Sigma}^{*}\|_{\infty}$, where $\mathbf{Z}_{\Sigma}^{*} \mid \mathbf{X}, \mathbf{Y} \sim N(\mathbf{0}, \widehat{\boldsymbol{\Upsilon}}_{\Sigma})$. Theorem 4 demonstrates the validity of the Gaussian bootstrap procedure described above; i.e., it states conditions under which the $\tau$-quantile of the test statistic (2.9) can be approximated by $c_{\Sigma}^{*}(\tau)$ in the appropriate sense.

Similarly, the test statistic for (2.8) is given by
$$S_D^{\Pi} := \|\sqrt{T}(\widehat{\boldsymbol{\Pi}}_D - \boldsymbol{\Pi}_D^0)\|_{\max}. \qquad (2.11)$$
Let $\boldsymbol{\Upsilon}_{\Pi}$ denote the $(d \times d)$ covariance matrix of $(\widetilde{\pi}_{ij})_{(i,j) \in D}$, where $\widetilde{\pi}_{ij} := \frac{1}{T}\sum_{t=1}^{T} V_{ij,t} V_{ji,t}$. For a given kernel $K(\cdot) \in \mathcal{K}$ and bandwidth $h > 0$, where the class $\mathcal{K}$ is described below in (3.7), $\boldsymbol{\Upsilon}_{\Pi}$ is estimated by
$$\widehat{\boldsymbol{\Upsilon}}_{\Pi} := \sum_{|\ell| < T} K(\ell/h)\,\widehat{\mathbf{M}}_{\Pi,\ell}, \qquad \widehat{\mathbf{M}}_{\Pi,\ell} := \frac{1}{T} \sum_{t=\ell+1}^{T} \widehat{\mathbf{D}}_{\Pi,t} \widehat{\mathbf{D}}_{\Pi,t-\ell}', \qquad (2.12)$$
where $\widehat{\mathbf{D}}_{\Pi,t}$ is a $d$-dimensional vector with entries given by $\widehat{V}_{ij,t}\widehat{V}_{ji,t} - \widehat{\pi}_{ij}$ for $(i,j) \in D$. Also, let $c_{\Pi}^{*}(\tau)$ be the $\tau$-quantile of the Gaussian bootstrap statistic $S_D^{*} := \|\mathbf{Z}_{\Pi}^{*}\|_{\infty}$, where $\mathbf{Z}_{\Pi}^{*} \mid \mathbf{X}, \mathbf{Y} \sim N(\mathbf{0}, \widehat{\boldsymbol{\Upsilon}}_{\Pi})$. Theorem 5 demonstrates the validity of the Gaussian bootstrap procedure described above; i.e., it states conditions under which the $\tau$-quantile of the test statistic (2.11) can be approximated by $c_{\Pi}^{*}(\tau)$ in the appropriate sense.

3 Theoretical Results

In this section, we collect all the theoretical guarantees for the estimation of model (2.1) by the proposed three-stage method described above. Specifically, Section 3.1 deals with estimation and Section 3.2 with inference on the (partial) covariance structure of $\boldsymbol{\Pi}$.

To present the results in this section, it is convenient to use a more compact notation. For each $i = 1, \dots, n$, we can stack the periods to define the $T$-dimensional vectors $\mathbf{Y}_i := (Y_{i1}, \dots, Y_{iT})'$ and $\mathbf{U}_i := (U_{i1}, \dots, U_{iT})'$. We also define the $(T \times k)$ matrix of covariates $\mathbf{X}_i := (\mathbf{X}_{i1}, \dots, \mathbf{X}_{iT})'$ for each $i = 1, \dots, n$, and the $(T \times r)$ matrix of factors $\mathbf{F} := (\mathbf{F}_1, \dots, \mathbf{F}_T)'$, such that (2.1) can be represented as
$$\mathbf{Y}_i = \mathbf{X}_i \boldsymbol{\gamma}_i + \mathbf{F} \boldsymbol{\lambda}_i + \mathbf{U}_i = \mathbf{X}_i \boldsymbol{\gamma}_i + \mathbf{R}_i, \qquad i = 1, 2, \dots, n, \qquad (3.1)$$
where $\mathbf{R}_i := \mathbf{F} \boldsymbol{\lambda}_i + \mathbf{U}_i$.

When no confusion is likely to arise, we also define, for each $t = 1, \dots, T$, the $n$-dimensional vectors $\mathbf{Y}_t := (Y_{1t}, \dots, Y_{nt})'$ and $\mathbf{U}_t := (U_{1t}, \dots, U_{nt})'$, and the $nk$-dimensional vector $\mathbf{X}_t := (\mathbf{X}_{1t}', \dots, \mathbf{X}_{nt}')'$. Also define the $(n \times nk)$ block-diagonal matrix $\boldsymbol{\Gamma}$ whose diagonal blocks are given by $(\boldsymbol{\gamma}_1', \dots, \boldsymbol{\gamma}_n')$ and the $(n \times r)$ loading matrix $\boldsymbol{\Lambda} := (\boldsymbol{\lambda}_1, \dots, \boldsymbol{\lambda}_n)'$. Then (2.1) can also be represented as
$$\mathbf{Y}_t = \boldsymbol{\Gamma} \mathbf{X}_t + \boldsymbol{\Lambda} \mathbf{F}_t + \mathbf{U}_t = \boldsymbol{\Gamma} \mathbf{X}_t + \mathbf{R}_t, \qquad t = 1, 2, \dots, T, \qquad (3.2)$$
where $\mathbf{R}_t := \boldsymbol{\Lambda} \mathbf{F}_t + \mathbf{U}_t$.

3.1 Estimation

Assumption 2 (Factor Model). Consider:
(a) $\mathrm{E}(\mathbf{F}_t) = \mathbf{0}$, $\mathrm{E}(\mathbf{F}_t \mathbf{F}_t') = \mathbf{I}_r$, and $\boldsymbol{\Lambda}'\boldsymbol{\Lambda}$ is a diagonal matrix;
(b) All eigenvalues of $\boldsymbol{\Lambda}'\boldsymbol{\Lambda}/n$ are bounded away from zero and infinity as $n \to \infty$;
(c) $\|\boldsymbol{\Sigma} - \boldsymbol{\Lambda}\boldsymbol{\Lambda}'\| = O(1)$; and
(d) $\|\boldsymbol{\Lambda}\|_{\max} \leq C$.

Remark 2.
Assumption 2 is standard in the factor model literature. Note also that the assumption that $\mathrm{E}(\mathbf{F}_t) = \mathbf{0}$ is not restrictive, as our approach considers a first-step estimation which may include a constant in the set of regressors. It is also needed for identifiability.

Assumption 3 (Moments and Dependency). There exist a constant $C < \infty$ and a function $\psi_p \in \Psi$ defined in (1.2) such that for all $i = 1, \dots, n$; $\ell = 1, \dots, k$; $s, t = 1, \dots, T$; and $j = 1, \dots, r$:
(a) $\|X_{it\ell}\|_{\psi_p} \leq C$, $\|U_{it}\|_{\psi_p} \leq C$, $\|F_{jt}\|_{\psi_p} \leq C$;
(b) $\big\| \|(\mathbf{X}_i'\mathbf{X}_i/T)^{-1}\|_{\max} \big\|_{\psi_p} \leq C$;
(c) The process $\{(\mathbf{X}_{S,t}, \mathbf{F}_t, \mathbf{U}_t), t \in \mathbb{Z}\}$ is weakly stationary with strong-mixing coefficient $\alpha$ satisfying $\alpha(m) \leq \exp(-cm)$ for some $c > 0$ and for all $m \in \mathbb{Z}$, where $\mathbf{X}_{S,t}$ denotes the vector $\mathbf{X}_t$ after excluding all deterministic (non-random) components;
(d) $\|n^{-1/2}(\mathbf{U}_s'\mathbf{U}_t - \mathrm{E}(\mathbf{U}_s'\mathbf{U}_t))\|_{\psi_p} \leq C$;
(e) $\|n^{-1/2}\sum_{i=1}^{n} \lambda_{j,i} U_{it}\|_{\psi_p} \leq C$; and
(f) $\log n = o\left(T^{p}/[\log T]\right)$.

A few words about Assumption 3 are in order. Our theory is derived in a general setup with respect to the tail behavior of the random variables in the model. In order to present the results in a unified manner for both fat (polynomially decaying) and thin (exponentially decaying) tails, we place our assumptions in terms of an upper bound on the Orlicz norm. In particular, Assumptions 3(a) and 3(c) allow us to apply a Marcinkiewicz-Zygmund-type inequality for partial sums to deal with polynomial tails (Rio (1994) and Doukhan and Louhichi (1999)) and a Bernstein inequality (Merlevède et al. (2009), Theorem 2) to control exponential tails. Moreover, Assumption 3(c) excludes the deterministic components of $\mathbf{X}_t$ to accommodate possibly non-random, non-stationary (but, by (a), uniformly bounded) covariates.

Assumption 3(d) is only used to prove results for the first-stage estimation in case it is performed by ordinary least squares (Theorem 1).
Assumption 3(d) controls the level of cross-sectional dependence among the units. As we allow the number of units to diverge with $T$, some control on this quantity is necessary, and it is not implied by 3(c). Assumption 3(e) plays a similar role to 3(d), but in terms of linear combinations of the idiosyncratic components. Assumption 3(f) only bounds the growth rate of the number of units $n$ to be sub-exponential with respect to $T$. As a matter of fact, this assumption is only binding in the exponential-tail case; otherwise, the rate conditions imposed in the theorems below imply it.

For each $i = 1,\dots,n$, let $R_i := F\lambda_i + U_i$ denote the unobservable error term in (3.1), $\hat\gamma_i$ the least-squares estimator of $\gamma_i$, and $\hat R_i := Y_i - X_i\hat\gamma_i$ the vector of residuals. Also set $\hat R := (\hat R_1,\dots,\hat R_n)'$ and $R := (R_1,\dots,R_n)'$. We must control the least-squares estimation error in the first step of the proposed methodology. The next result gives a bound for the maximum entry of the $(n\times T)$ matrix $\hat R - R$ when the first stage is conducted by OLS in a linear setup.

Theorem 1.
Under Assumptions 3(a)-(d),
$$\max_{i,t}\|\hat R_{it} - R_{it}\|_{\psi_{p/2}} \le \frac{C_{k,\psi}}{\sqrt T} \qquad\text{and}\qquad \|\hat R - R\|_{\max} = O_P\!\left[\frac{\psi_{p/2}^{-1}(nT)}{\sqrt T}\right],$$
where $C_{k,\psi}$ is a constant depending only on $k$ and $\psi_p$.

Remark 3.
In case the first step of the method involves more complicated estimation, we write $\|\hat R - R\|_{\max} = O_P(\omega)$, where $\omega := \omega_{n,T}$ is a non-negative sequence. This will be used in the next theorems.

Define the $(n\times T)$ matrices $Y := (Y_1,\dots,Y_T)$ and $U := (U_1,\dots,U_T)$, and the $(nk\times T)$ matrix $X := (X_1,\dots,X_T)$. We can write (2.1) in matrix form as
$$Y = \Gamma X + \Lambda F' + U. \tag{3.3}$$
Notice that $\hat R = \Lambda F' + \tilde U$, where $\tilde U := U + (\hat R - R)$, and $(\Lambda, F)$ can be estimated by Principal Component Analysis (PCA), which minimizes
$$q(\Lambda, F) := \|\hat R - \Lambda F'\|_F^2 \tag{3.4}$$
with respect to $\Lambda$ and $F$, subject to the normalization $F'F/T = I_r$. The solution $\hat F$ is the matrix whose columns are $\sqrt T$ times the $r$ eigenvectors of the top $r$ eigenvalues of $\hat R'\hat R$, and $\hat\Lambda = \hat R\hat F/T$. Since we do not directly observe $U$, in the third step of our estimation procedure we use $\hat U := \hat R - \hat\Lambda\hat F'$ instead. Therefore, we must control the estimation error in the factor model, given by the $(n\times T)$ matrix $\hat U - U$, which is the main purpose of Theorem 2 below. Also, it is a well-known fact that the loading matrix $\Lambda$ and the factors $F$ are not separately identified, since $\Lambda F_t = \Lambda H'HF_t$ for any matrix $H$ such that $H'H = I_r$. If we let $H := T^{-1}V^{-1}\hat F'F\Lambda'\Lambda$, where $V$ is the $(r\times r)$ diagonal matrix containing the $r$ largest eigenvalues of $\hat R\hat R'/T$ in decreasing order, then $HF_t$ is identified, as $\Lambda F_t$ is identified.

The result below first appeared in Bai (2003) for the case of fixed $(n,T)$ and was further extended to hold uniformly in $(i \le n, t \le T)$ by Fan et al. (2013). Fan et al. (2020) makes the conditions modular. However, both consider the case when the factor model is estimated using the true data, as opposed to an "estimated" one as in our case. Therefore, the next result is a generalization that takes that pre-estimation error term into account.

Theorem 2.
Let $\omega := \omega_{n,T}$ be a non-negative sequence such that $\|\hat R - R\|_{\max} = O_P(\omega)$. Then, under Assumptions 1-3 and $\psi_p^{-1}(n)/\sqrt T + \psi_p^{-1}(nT)\,\omega = O(1)$, we have
(a) $\max_{t\le T}\|\hat F_t - HF_t\| = O_P\!\left[\dfrac{1}{\sqrt T} + \dfrac{\psi_p^{-1}(T)}{\sqrt n} + \omega\,\psi_{p/2}^{-1}(nT)\right]$,
(b) $\max_{i\le n}\|\hat\lambda_i - H\lambda_i\| = O_P\!\left[\dfrac{\psi_{p/2}^{-1}(n)}{\sqrt T} + \dfrac{1}{\sqrt n} + \omega\right]$,
(c) $\|\hat U - U\|_{\max} = O_P\!\left[\dfrac{\psi_p^{-1}(n)\,\psi_p^{-1}(T)}{\sqrt T} + \dfrac{\psi_p^{-1}(T)}{\sqrt n} + \omega\,\psi_{p/2}^{-1}(nT)\right]$.

By setting $\omega = 0$, i.e., no estimation error in the first step, we recover Theorem 4 and Corollary 1 in Fan et al. (2013) under the sub-Gaussian assumption. It is also important to notice that, in order to have the error $\|\hat U - U\|_{\max}$ vanish in probability, the pre-estimation error $\|\hat R - R\|_{\max}$ must be of order (in probability) smaller than $1/\psi_{p/2}^{-1}(nT)$.

We have decided not to replace $\omega$ in Theorem 2 with the rate obtained in Theorem 1, as the latter only applies to the least-squares estimator. In some applications, however, the first step of the procedure could be carried out with a different type of estimator, for instance a penalized adaptive Huber regression (Fan et al., 2017) if the number of features $k$ is comparable to, or even larger than, $T$ and the tails of the distribution are heavy. By stating Theorem 2 in terms of a generic rate, it is easier to account for the effect of a different estimator. By combining Theorems 1 and 2 we have the following corollary.

Corollary 1.
Under the same assumptions of Theorems 1 and 2, for the OLS used in the first stage to obtain $\hat R$, we have
$$\|\hat U - U\|_{\max} = O_P\!\left[\frac{\psi_{p/2}^{-1}(nT)}{\sqrt T} + \frac{\psi_p^{-1}(T)}{\sqrt n}\right].$$
In particular, for the sub-Gaussian case ($\psi(x) = \exp(x^2) - 1$) we have
$$\|\hat U - U\|_{\max} = O_P\!\left[\frac{\log(nT)}{\sqrt T} + \sqrt{\frac{\log T}{n}}\right],$$
and for polynomial tails ($\psi(x) = x^p$),
$$\|\hat U - U\|_{\max} = O_P\!\left[\frac{n^{2/p}}{T^{1/2 - 2/p}} + \frac{T^{1/p}}{\sqrt n}\right].$$

For notational convenience, for each $i \in \{1,\dots,n\}$, consider the split $U = (U_i, U_{-i})$, where $U_i$ is a $T$-dimensional vector and $U_{-i}$ is a $T\times(n-1)$-dimensional matrix. Analogously, we split $\hat U = (\hat U_i, \hat U_{-i})$. Then, for a penalty parameter $\xi \ge 0$, the LASSO objective function can be written, for each $i \in \{1,\dots,n\}$, as
$$\mathcal{L}(\theta) + \mathrm{Penalty}(\theta) := \frac{1}{T}\|\hat U_i - \hat U_{-i}\theta\|^2 + \xi\|\theta\|_1. \tag{3.5}$$
To ensure consistent estimation of $\theta$, a sort of restricted strong convexity of the objective function is required when $n > T$. This, in turn, is ensured in the case of a quadratic loss by bounding the minimum eigenvalue of $\hat U_{-i}'\hat U_{-i}/T$ away from zero, restricted to a cone (refer to Negahban et al. (2012) or Fan et al. (2020) for a thorough discussion). Here, we adopt the compatibility constant defined in van de Geer and Bühlmann (2009). For an index set $S \subseteq \{1,\dots,n\}$ and any $n$-dimensional vector $v$, let $v_S$ be the vector containing only the elements of $v$ indexed by $S$. Thus, $v_S$ has $|S|$ elements, and $S^c := \{1,\dots,n\}\setminus S$ is the complement of $S$.

Definition 1. For an $n\times n$ matrix $M$, a set $S \subseteq \{1,\dots,n\}$, and a scalar $\zeta \ge 0$, the compatibility constant is given by
$$\kappa(M, S, \zeta) := \inf\left\{\frac{\sqrt{|S|}\,\|x\|_M}{\|x_S\|_1} : x \in \mathbb{R}^n,\; \|x_{S^c}\|_1 \le \zeta\,\|x_S\|_1\right\}, \tag{3.6}$$
where $\|x\|_M^2 = x'Mx$. Moreover, we say that $(M, S, \zeta)$ satisfies the compatibility condition if $\kappa(M, S, \zeta) > 0$.

Notice that the square of the compatibility constant is closely related to the minimum eigenvalue of $\Sigma$ restricted to a cone in $\mathbb{R}^n$.

Theorem 3.
Let $\eta := \eta_{n,T}$ be a non-negative sequence such that $\|\hat U - U\|_{\max} = O_P(\eta)$, and consider Assumption 3. For every $\epsilon > 0$ there is a constant $0 < C < \infty$ such that, if the penalty parameter is set to
$$\xi = C\left[\frac{\psi_{p/2}^{-1}(n)}{\sqrt T} + \eta\,\psi_p^{-1}(T)\right]$$
and $s := \max_{i\le n}|S_{0,i}|$, where $S_{0,i} := \{j : \theta_{i,j} \ne 0\}$, obeys
$$s = O\!\left(\kappa^2\left[\eta\,\big(\psi_p^{-1}(nT) + \eta\big) + \frac{\psi_{p/2}^{-1}(n)}{\sqrt T}\right]^{-1}\right),$$
with $\kappa := \min_{i\le n}\kappa_i$ and $\kappa_i := \kappa\big[E(U_{-i}'U_{-i})/T,\, S_{0,i},\, 3\big]$ defined in (3.6), then, for any minimizer $\hat\theta_i$ of (3.5), with probability at least $1 - \epsilon$,
$$T^{-1}(\hat\theta_i - \theta_i)'U_{-i}'U_{-i}(\hat\theta_i - \theta_i) + \xi\,\|\hat\theta_i - \theta_i\|_1 \le \frac{\xi^2 s}{\kappa^2}, \qquad i \in \{1,\dots,n\},$$
where the right-hand side is taken to be $+\infty$ whenever $\kappa = 0$.

Remark 4. Notice that we apply the compatibility condition to the non-random covariance matrix $E(U_{-i}'U_{-i})/T$ instead of the estimated random matrix $\hat U_{-i}'\hat U_{-i}/T$ or the "unobservable" random matrix $U_{-i}'U_{-i}/T$. A careful review of the proofs reveals that the same is true for the gradient of the objective function that defines our parameter via a first-order condition.

Once again, we purposely avoided replacing $\eta$ in Theorem 3 with the rate of Corollary 1, to make it readily applicable to the case when a different type of factor model is used or, as a matter of fact, any other pre-estimation procedure. By plugging the rate of Corollary 1 into $\eta$ we have the next corollary.

Corollary 2. If $\eta$ defined in Theorem 3 is taken to be the rate given by Corollary 1 and the compatibility condition holds, i.e., $\kappa \ge C > 0$, then, under the conditions of Theorem 3,
$$\max_{i\le n}\|\hat\theta_i - \theta_i\|_1 = O_P\!\left[\left(\frac{\psi_p^{-1}(T)\,\psi_{p/2}^{-1}(nT)}{\sqrt T} + \frac{\psi_{p/2}^{-1}(T)}{\sqrt n}\right) s\right].$$

We now obtain the null distributions of our test statistics for the structures of the covariance and the partial covariance. Recall the setup and notation of Section 2.3. In particular, we consider that the kernel $k(\cdot)$ appearing in the covariance estimator defined by (2.10) belongs to the class defined in Andrews (1991), which we reproduce below for convenience:
$$\mathcal{K} := \left\{f : \mathbb{R} \to [-1,1] \;:\; f(0) = 1,\; f(x) = f(-x)\;\forall x \in \mathbb{R},\; \int f^2(x)\,dx < \infty,\; f \text{ is continuous}\right\}. \tag{3.7}$$
It includes most of the well-known kernels used in the density estimation literature, such as the truncated, Bartlett, Parzen, quadratic spectral, and Tukey-Hanning kernels, among others. To avoid confusion, it is worth pointing out that our tuning parameter $h$, also called the bandwidth parameter by Andrews (1991), is supposed to diverge, as opposed to the bandwidth in the density kernel estimation setup, which is expected to shrink towards zero.

Theorem 4.
Let $\eta := \eta_{n,T}$ and $\nu := \nu_{n,T}$ be non-negative sequences such that $\|\hat U - U\|_{\max} = O_P(\eta)$ and $\max_{i,t}\|\hat R_{it} - R_{it}\|_{\psi_p} = O(\nu)$, and let $K \in \mathcal{K}$. Under Assumptions 1-3, if further:
(a) $\{U_t\}$ is a fourth-order stationary process;
(b) $\|\mathrm{diag}(\Upsilon_\Sigma)\|_{\min} \ge c$ for some $c > 0$;
(c) as $h, n, T \to \infty$:
(c.1) $\dfrac{(\log n)^{1/2}\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \dfrac{\sqrt{\log T}\,(\log n)\,\psi_{p/2}^{-1}(n)\,\psi_{p/2}^{-1}(T^{1/2})}{T^{1/2}} = o(1)$;
(c.2) $(\log n)\,h\left[\eta\,\big(\psi_p^{-1}(nT)\big)^2 + \dfrac{\psi_{p/2}^{-1}(n)}{\sqrt T}\right] = o(1)$;
(c.3) $(\log n)\left[\sqrt T\,\eta + r_1\sqrt T + r_2\sqrt n + r_3\,\nu\right] = o(1)$,
where the rates $r_1, r_2, r_3$ are defined in Lemma B.10 and $h > 0$ is the bandwidth parameter of the covariance estimator defined in (2.10); then
$$\|\hat\Upsilon_\Sigma - \Upsilon_\Sigma\|_{\max} = O_P\!\left(h\left[\eta\,\big(\psi_p^{-1}(nT)\big)^2 + \frac{\psi_{p/2}^{-1}(n)}{\sqrt T}\right]\right) = o_P(1),$$
and
$$\sup_{D}\,\sup_{\tau\in(0,1)} \big|P\big(S_\Sigma^D \le c_\Sigma^*(\tau)\big) - \tau\big| = o(1),$$
where the first supremum is over all null hypotheses of the form (2.7) indexed by $D \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$.

Remark 5.
The rate assumptions (c.1)-(c.3) in Theorem 4 may seem overly complicated. However, they are a direct consequence of having the first- and second-step estimation error rates, $\nu$ and $\eta$ respectively, appear explicitly in the final rate, together with the general tail condition expressed through the $\psi_p(\cdot)$ function. This allows the practitioner to adjust the final rate directly, should he or she prefer to employ different intermediate estimators: for instance, a LASSO estimator in the first step in case the number of covariates $k$ is large enough, or PCA variants to estimate the factor model. If we specialize to the sub-Gaussian case and incorporate the rates obtained in Theorem 1 and Corollary 1, we have the following corollary.

Corollary 3.
For the sub-Gaussian case ($\psi(x) = \exp(x^2) - 1$), under Assumptions 1-3 and conditions (a) and (b) of Theorem 4, if the rates $\nu$ and $\eta$ are set to the rates given by Theorem 1 and Corollary 1, respectively, then the conclusion of Theorem 4 holds provided that, as $h, n, T \to \infty$:
(a) $\log n = o(T^{1/3})$;
(b) $h\left[\dfrac{(\log n)^{3}}{\sqrt T} + \dfrac{(\log n)^{5/2}}{\sqrt n}\right] = o(1)$;
(c) $\dfrac{(\log n)(\log T)\,\sqrt T}{n} = o(1)$.

Remark 6.
A careful review of the proof reveals that (c.1) traces back to the Gaussian approximation of the (unobservable) process $\big\{T^{-1/2}\sum_{t=1}^{T}\big[U_tU_t' - E(U_tU_t')\big]\big\}_{T\ge 1}$, whereas (c.3) controls the difference between $U_t$ and $\hat U_t$ and, therefore, takes into account the estimation error of the first and second steps. Note the presence of $\nu$ and $\eta$ in (c.3), which are absent from (c.1). Finally, (c.2) makes sure that the bootstrap constructed in terms of the estimated covariance matrix is close to the bootstrap based on the true covariance. Note the presence of the bandwidth parameter $h$ in (c.2).

Remark 7.
In order to establish the rate of convergence in the last result of Theorem 4, we need an upper bound on the tails of the pre-estimation error, namely $\|\hat Z - Z\|_{\max}$. In fact, we need to control the tails of the factor model estimation to establish uniform bounds on $\|\hat U_{it} - U_{it}\|_{\psi}$, which translate into bounds on $\max_{j,t}\|\hat F_{jt} - F_{jt}\|_{\psi}$ and $\max_{j,i}\|\hat\lambda_{ji} - \lambda_{ji}\|_{\psi}$.

Theorem 5.
Let $\eta := \eta_{n,T}$ and $\nu := \nu_{n,T}$ be non-negative sequences such that $\|\hat U - U\|_{\max} = O_P(\eta)$ and $\max_{i,t}\|\hat R_{it} - R_{it}\|_{\psi_p} = O(\nu)$, and let $K \in \mathcal{K}$ defined by (3.7). Under Assumptions 1-3, if further:
(a) $\{U_t\}$ is a fourth-order stationary process;
(b) $\|\mathrm{diag}(\Upsilon_\Pi)\|_{\min} \ge c$ for some $c > 0$;
(c) as $h, n, T \to \infty$:
(c.1) $\dfrac{(\log n)^{1/2}\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \dfrac{\sqrt{\log T}\,(\log n)\,\psi_{p/2}^{-1}(n)\,\psi_{p/2}^{-1}(T^{1/2})}{T^{1/2}} = o(1)$;
(c.2) $(\log n)\,h\left[s\,\big[\eta + \xi\,\psi_p^{-1}(n)\big]\big(s\,\psi_p^{-1}(nT)\big) + \dfrac{s\,\psi_{p/2}^{-1}(n)}{\sqrt T}\right] = o(1)$;
(c.3) $(\log n)\,s\left[r_1\sqrt T + r_2\sqrt n + r_3\,\nu + \xi\,\psi_p^{-1}(n) + \sqrt T\,\big(\eta + \xi\,\psi_p^{-1}(n)\big)\right] = o(1)$,
where the rates $r_1, r_2, r_3$ are defined in Lemma B.10 and $h > 0$ is the bandwidth parameter of the covariance estimator defined in (2.12); then
$$\|\hat\Upsilon_\Pi - \Upsilon_\Pi\|_{\max} = O_P\!\left(h\left\{s\,\big[\eta + \xi\,\psi_p^{-1}(n)\big]\big(s\,\psi_p^{-1}(nT)\big) + \frac{s\,\psi_{p/2}^{-1}(n)}{\sqrt T}\right\}\right) = o_P(1),$$
and
$$\sup_{D}\,\sup_{\tau\in(0,1)} \big|P\big(S_\Pi^D \le c_\Pi^*(\tau)\big) - \tau\big| = o(1) \quad\text{under } H_0^\Pi,$$
where the first supremum is over all null hypotheses of the form (2.8) indexed by $D \subseteq \{1,\dots,n\}\times\{1,\dots,n\}$.

Remarks and a corollary analogous to Remarks 5-7 and Corollary 3 after Theorem 4 apply to Theorem 5.
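Both tests above are built on a kernel long-run covariance estimator whose bandwidth $h$ diverges with the sample size. As a minimal numerical sketch (not the paper's implementation; the estimators in (2.10) and (2.12) apply to more elaborate statistics), the following code computes $\sum_{|l|<T} k(l/h)\,\hat\Gamma(l)$ with the Bartlett kernel, which is a member of the class $\mathcal{K}$ in (3.7). The AR(1) sanity check and the bandwidth rule $h = \lfloor T^{1/3}\rfloor$ are our own illustrative choices.

```python
import numpy as np

def bartlett_kernel(x):
    # Bartlett kernel: k(0) = 1, symmetric, supported on |x| <= 1 (a member of K).
    x = np.abs(x)
    return np.where(x <= 1.0, 1.0 - x, 0.0)

def long_run_cov(Z, h):
    """Kernel long-run covariance of a (T x d) array Z:
    sum over lags l of k(l/h) * Gamma_hat(l), with Gamma_hat(l) the lag-l autocovariance."""
    Z = Z - Z.mean(axis=0)
    T = Z.shape[0]
    S = Z.T @ Z / T                        # lag-0 term
    for lag in range(1, T):
        w = float(bartlett_kernel(lag / h))
        if w == 0.0:                       # Bartlett weights vanish beyond h lags
            break
        G = Z[lag:].T @ Z[:-lag] / T       # lag-`lag` autocovariance
        S += w * (G + G.T)
    return S

# sanity check on an AR(1) series: true long-run variance = sigma^2 / (1 - phi)^2 = 4
rng = np.random.default_rng(0)
T, phi = 500, 0.5
e = rng.standard_normal(T)
z = np.empty(T)
z[0] = e[0]
for t in range(1, T):
    z[t] = phi * z[t - 1] + e[t]
S = long_run_cov(z[:, None], h=int(T ** (1 / 3)))
print(float(S[0, 0]))   # should be in the vicinity of 4 (a small h biases it downward)
```

Note that, consistently with the discussion after (3.7), the estimator is consistent only when $h \to \infty$ with $T$, in contrast with density estimation, where the bandwidth shrinks.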
Remark 8.
As opposed to the case of testing the covariance, when testing the partial covariance in a high-dimensional setup the sparsity structure plays a role, through the appearance of $s$ in conditions (c.2) and (c.3). Therefore, these assumptions restrict the cases in which the proposed partial covariance test has the correct asymptotic size. For instance, in the case of a completely dense partial covariance structure, i.e., when all the regressors are active in all LASSO regressions, $s$ is likely to be of order $n$ and, therefore, (c.2) and (c.3) are not expected to hold.

4 Guide to Practice
As described before, the methodology in this paper involves three steps. The first step consists of identifying known covariates that we may want to control for. This first step may involve the removal of deterministic trends and seasonal effects, for instance, and can be done either by parametric or nonparametric regressions. It is important to notice, however, that the convergence rates of the estimators in the subsequent steps will be influenced by the convergence rate of the estimator in the first part of the procedure.

After the data are filtered in the first step, one can test for remaining covariance structure. For instance, if the covariance matrix of the filtered data is (almost) diagonal, there is no need to estimate a latent factor structure, and the practitioner may jump directly to the third step of the method.

On the other hand, if the covariance of the first-step filtered data is dense, a latent factor model should be considered and the number of factors must be determined. There are a number of methods proposed in the literature to achieve this goal. In this paper we consider either the eigenvalue ratio test of Ahn and Horenstein (2013) or the information criteria put forward in Bai and Ng (2002). The factors can be estimated by the usual methods.

The last step involves a sparse regression in order to estimate any remaining links between idiosyncratic components. Before running the last step, the practitioner may test for a diagonal covariance matrix of the idiosyncratic terms. If the null is not rejected, there is no need for additional estimation. In case of rejection, the user can proceed with a LASSO regression. We recommend that the penalty term of the LASSO be selected by the Bayesian Information Criterion (BIC), as advocated by Medeiros and Mendes (2016).

Finally, we would like to include a remark about the estimation of the long-run matrices when constructing the statistics for the tests of no remaining covariance structure.
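The second and third steps just described can be put together in a minimal numerical sketch. Everything below is illustrative: the data-generating process is a toy one-factor design of our own, the first step is taken as already done (the residual panel is observed directly), and the LASSO is solved with a plain ISTA loop rather than a production solver.

```python
import numpy as np

def pca_factors(R, r):
    """Step 2: PCA on the (n x T) residual panel R. F_hat has columns equal to
    sqrt(T) times the top-r eigenvectors of R'R (so that F'F/T = I_r), and the
    loadings are Lambda_hat = R F_hat / T."""
    n, T = R.shape
    _, eigvec = np.linalg.eigh(R.T @ R)              # eigenvalues in ascending order
    F_hat = np.sqrt(T) * eigvec[:, -r:][:, ::-1]     # (T x r), top-r eigenvectors
    L_hat = R @ F_hat / T                            # (n x r)
    return F_hat, L_hat

def lasso_ista(X, y, lam, n_iter=500):
    """Step 3 helper: ISTA for (1/T)||y - X b||^2 + lam * ||b||_1."""
    T = len(y)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X / T).max())
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = -2.0 * X.T @ (y - X @ b) / T             # gradient of the quadratic part
        u = b - step * g
        b = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # soft-thresholding
    return b

# toy design: one strong factor plus a single sparse idiosyncratic link
rng = np.random.default_rng(1)
n, T, r = 20, 400, 1
F = rng.standard_normal((T, r))
Lam = rng.standard_normal((n, r))
U = 0.5 * rng.standard_normal((n, T))
U[0] += 0.9 * U[1]                                   # unit 0 loads on unit 1's shock
R = Lam @ F.T + U                                    # plays the role of the step-1 residuals
F_hat, L_hat = pca_factors(R, r)
U_hat = R - L_hat @ F_hat.T                          # estimated idiosyncratic components
b = lasso_ista(U_hat[1:].T, U_hat[0], lam=0.1)       # sparse regression for unit 0
print(int(np.argmax(np.abs(b))))                     # position 0 (unit 1) should dominate
```

In practice the penalty `lam` would be chosen by BIC, as recommended above; here it is fixed for simplicity.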
Usual methods discussed in the literature can be used here to select the kernel and the bandwidth. In the paper we use the simple Bartlett kernel with bandwidth given as $\lfloor T^{1/3} \rfloor$.

5 Simulations

In this section we report simulation results to assess the finite-sample performance of the methodology described in this paper. The simulations are divided into two parts. In the first one, we evaluate the finite-sample properties of the test for remaining covariance structure. In the second part, we highlight the informational gains from considering both the common factors and the idiosyncratic components.

We simulate 1,000 replications of the following model for various combinations of sample size ($T$) and number of variables ($n$):
$$Y_{it} = \Lambda_i' F_t + W_{it}, \tag{5.1}$$
$$F_t = 0.5\,I_r\,F_{t-1} + E_t, \tag{5.2}$$
$$W_{it} = \phi\,W_{i,t-1} + U_{it}, \tag{5.3}$$
$$U_{it} = \begin{cases} \theta_1 U_{2t} + \theta_2 U_{3t} + \theta_3 U_{4t} + \theta_4 U_{5t} + O_{it} & \text{if } i = 1, \\ O_{it} & \text{otherwise}, \end{cases} \tag{5.4}$$
where $\{O_{it}\}$ is a sequence of independent Gaussian random variables with zero mean and variance equal to 0.25, $\{E_t\}$ is a sequence of $r$-dimensional independent random vectors, normally distributed with zero mean and identity covariance matrix, and $I_r$ is the $r\times r$ identity matrix. Furthermore, $\{O_{it}\}$ and $\{E_t\}$ are mutually independent for all time periods, factors, and variables. For each Monte Carlo replication, the vector of loadings is sampled from a Gaussian distribution with mean -6 and standard deviation 0.2 for $i = 1,\dots,n$. The value of $\phi$ is either 0 or 0.5. The coefficients $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$ are equal to zero or to 0.8, 0.9, -0.7, and 0.5, respectively. We set the true number of factors to $r = 3$.

We start by reporting results for the test of no remaining structure in the covariance matrix of $U_t = (U_{1t},\dots,U_{nt})'$. The null hypothesis considered is that the covariances between the first variable ($i = 1$) and the remaining ones are all zero. For the size simulations we set $\theta_1 = \theta_2 = \theta_3 = \theta_4 = 0$. The number of factors is either assumed known or determined from the data, by the eigenvalue ratio procedure or by one of the following information criteria:
$$IC_1 = \log[S(r)] + r\,\frac{n+T}{nT}\,\log\!\left(\frac{nT}{n+T}\right),$$
$$IC_2 = \log[S(r)] + r\,\frac{n+T}{nT}\,\log C_{nT}^2,$$
$$IC_3 = \log[S(r)] + r\,\frac{\log C_{nT}^2}{C_{nT}^2},$$
$$IC_4 = \log[S(r)] + r\,\frac{(n+T-k)\log(nT)}{nT},$$
where $S(r) = \frac{1}{nT}\|R - \hat\Lambda_r\hat F_r'\|_F^2$ and $C_{nT} := \sqrt{\min(n,T)}$.

Tables 1 and 2 report the empirical size of the test for different significance levels, for $\phi = 0$ and $\phi = 0.5$, respectively. The factors are either assumed known (panel (a)) or estimated, with the true number of factors (panel (b)), with the number of factors selected by the information criterion $IC_1$ (panel (c)), or with the number of factors selected by the eigenvalue ratio procedure (panel (d)). Table ?? in the Supplementary Material shows the results of the test when the number of factors is determined by $IC_2$-$IC_4$.

A number of facts emerge from the inspection of the results in Table 1. First, size distortions are small when the factors are known. In this case, the test is undersized when the pair $(n,T)$ is small. When the factors are not known but the true number of factors is available, the size distortions are high only when $T = 100$ and $n = 50$, due to inaccurate estimation of the factors. However, the distortions disappear as the pair $(T,n)$ grows. In this case, the empirical size is similar to the situation reported in panel (a). The finite-sample performance of the test when the number of factors is selected by the information criterion $IC_1$ is almost indistinguishable from the case reported in panel (b). However, the results with the eigenvalue ratio procedure are much worse when $T = 100$ and $n = 50$. In this case, the procedure selects fewer factors than the true number $r = 3$; for instance, it selects 2 or fewer factors in 36% of the replications. Just as a comparison, for $T = 100$ and $n = 50$, $IC_1$ underdetermines the number of factors in only 3.10% of the cases. For all the other combinations of $T$ and $n$, all the data-driven methods select the correct number of factors in almost all replications.

When the idiosyncratic components are autocorrelated, the size distortions are higher, as reported in Table 2. This is mainly caused by the well-known difficulties in the estimation of the long-run covariance matrix.

Tables 3-4 report the empirical power. To evaluate the power properties we set $\theta_1 = 0.8$, $\theta_2 = 0.9$, $\theta_3 = -0.7$, and $\theta_4 = 0.5$. For $T = 700$ the power is reasonably high, especially when the test is conducted at the 10% significance level. For smaller $T$, the power increases as $n$ grows. The results are similar when data-driven procedures are used to determine the number of factors. Finally, the conclusions are mostly the same for $\phi = 0$ and $\phi = 0.5$.

The goal of the second simulation is to compare, in a prediction environment, the three-stage method developed in the paper, evaluating the informational gains in predicting $Y_t$ by three different methods. First, the predictions are computed from a LASSO regression of $Y_{it}$ on all the other $n-1$ variables; the remaining predictions are based on a pure factor regression and on the FarmPredict methodology. Table 5 presents the results: the average mean squared error (MSE) over 5-fold cross-validation (CV) subsamples. As in the size and power simulations, we consider different combinations of $T$ and $n$. We report results for the case where $\theta_1 = 0.8$, $\theta_2 = 0.9$, $\theta_3 = -0.7$, and $\theta_4 = 0.5$. The gain of FarmPredict is quite remarkable when $T = 500$ or larger.

6 Applications
In this section we consider two applications with actual data to illustrate the benefits of the methodology developed in the paper. The first application deals with the factor structure of asset returns, whereas the second one concerns macroeconomic forecasting in data-rich environments.
We illustrate the methodology developed in this paper by studying the factor structure of assetreturns. We consider monthly close-to-close excess returns from a cross-section of 9,456 firms tradedin the New York Stock Exchange. The data starts on November 1991 and runs until December2018. There are 326 monthly observations in total. In addition to the returns we also consider 16monthly factors: Market (
MKT ), Small-minus-Big (
SMB ), High-minus-Low (
HML ), Conservative-minus-Aggressive (
CMA ), Robust-minus-Weak (
RMW ), earning/price ratio, cash-flow/price ratio,dividend/price ratio, accruals, market beta, net share issues, daily variance, daily idiosyncraticvariance, 1-month momentum, and 36-month momentum. The firms are grouped according to20 industry sectors as in Moskowitz and Grinblatt (1999). The following sectors are considered: Mining (602), Food (208), Apparel (161), Paper (81), Chemical (513), Petroleum (48), Construction(68), Primary Metals (133), Fabricated Metals (186), Machinery (710), Electrical Equipment (782),Transportation Equipment (166), Manufacturing (690), Railroads (25), Other transportation (157),Utilities (411), Department Stores (67), Retail (1018), Financial (3419), and Other (11).
We start the analysis by looking at the correlation matrix of a sample of nine different sectors, namely: Mining, Food, Petroleum, Construction, Manufacturing, Utilities, Department Stores, Retail, and Financial. Figure 1 plots the correlations that are larger than 0.15 in absolute value. We also test the null of a diagonal covariance matrix. The null hypothesis is strongly rejected, with a p-value much lower than 1%. To conduct the test of the covariance matrix we use the simple sample estimator, as described in the paper. However, the correlations plotted in Figure 1 and in the subsequent figures are based on the nonlinear shrinkage estimator proposed by Ledoit and Wolf (2020). The numbers between parentheses indicate the number of firms in our sample that belong to each sector.
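The diagonal-covariance test applied here is the max-type statistic with a multiplier-bootstrap critical value developed in the paper. The sketch below is a deliberately simplified i.i.d. version (no kernel long-run correction and generic variable names of our own), meant only to convey the mechanics of testing $H_0 : \Sigma_{ij} = 0$ over a set $D$ of pairs.

```python
import numpy as np

def max_cov_test(X, pairs, B=500, seed=0):
    """Max statistic S = sqrt(T) * max over (i,j) in `pairs` of |sigma_hat_ij|,
    with a Gaussian multiplier bootstrap for its null distribution (i.i.d. sketch)."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    Xc = X - X.mean(axis=0)
    V = np.stack([Xc[:, i] * Xc[:, j] for i, j in pairs], axis=1)   # (T x |D|) scores
    S = np.sqrt(T) * np.max(np.abs(V.mean(axis=0)))
    Vc = V - V.mean(axis=0)
    boot = np.empty(B)
    for b in range(B):
        e = rng.standard_normal(T)                 # Gaussian multipliers
        boot[b] = np.max(np.abs(Vc.T @ e)) / np.sqrt(T)
    return S, float(np.mean(boot >= S))            # statistic and bootstrap p-value

rng = np.random.default_rng(2)
T, n = 300, 10
X = rng.standard_normal((T, n))
pairs = [(0, j) for j in range(1, n)]              # H0: unit 0 uncorrelated with the rest
S0, p0 = max_cov_test(X, pairs)                    # H0 true here
X[:, 1] += 0.8 * X[:, 0]                           # inject one nonzero covariance
S1, p1 = max_cov_test(X, pairs)                    # H0 now false
print(round(p0, 3), round(p1, 3))
```

With the injected correlation the statistic jumps and the bootstrap p-value collapses toward zero, while under the null it stays moderate.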
We proceed by regressing the monthly returns on the 16 observed factors. These factors explain most of the variation of the returns. Figure 2 shows the empirical distribution of the OLS estimates of the factor loadings over the 9,456 regressions. Figure 3 presents the estimated correlations for the first-stage residuals. We focus on the same nine sectors as before. The first-stage regression is efficient in removing the correlation within specific sectors in some cases. The most notable ones are Financial and Retail, followed by Construction, Petroleum, and Manufacturing. Nevertheless, the tests for a diagonal covariance matrix reject the null even in these specific cases.

The second step is to conduct a principal component analysis on the residuals of the first stage. The eigenvalue ratio procedure selects two factors, while all four information criteria point to a single factor. We proceed with two factors. Note that, by construction, the principal component factors are orthogonal to all the 16 risk factors considered in the first stage. Figure 4 shows the estimated correlations for the residuals of the second stage. The latent factors are not able to reduce the correlations within each sector. However, when we consider the partial correlations, the conclusions are much different. As can be seen from Figure 5, the partial correlation matrices are (almost) diagonal. In addition, we are not able to reject the null of a diagonal covariance matrix at the 5% significance level.

Finally, in order to shed some light on the links among different sectors, we report how often variables from sector $i$ are selected in the third-stage LASSO regression for firms in sector $j$. The numbers are normalized by the total number of firms in each sector and are presented in Figure 6. The most interesting fact is that covariates from the financial sector are the ones most frequently selected for all the other sectors.
This may indicate that there is a "financial factor" that was left unmodeled in the first two stages.

The results presented here can be useful in applications where forecasting future returns is the goal, for instance. The results indicate that the inclusion of the returns of firms belonging to the financial sector may improve the performance of forecasting models. For example, if we run a regression of the residuals of the first-stage regression of firms that do not belong to the financial sector on the first principal component computed with the first-stage residuals only from the financial sector, we find a statistically significant coefficient in 28% of the cases.

6.2 Macroeconomic Forecasting

The second application consists of forecasting a large set of monthly macroeconomic variables. We compare four different models: (1) an autoregressive model (AR); (2) a sparse LASSO regression (SR); (3) a principal component regression (PCR); and (4) a method based on the results in this paper (FarmPredict).

Our data consist of variables from the FRED-MD database, which is a large monthly macroeconomic dataset designed for empirical analysis in data-rich macroeconomic environments. The dataset is updated in real time through the FRED database and is available from Michael McCracken's webpage. For further details, we refer to McCracken and Ng (2016).

We use the vintage as of October 2020. Our sample extends from January 1960 to December 2019 (719 monthly observations), and only variables with all observations in the sample period are used (119 variables). The dataset is divided into eight groups: (i) output and income; (ii) labor market; (iii) housing; (iv) consumption, orders and inventories; (v) money and credit; (vi) interest and exchange rates; (vii) prices; and (viii) stock market. Finally, all series are transformed in order to become approximately stationary, as in McCracken and Ng (2016).
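The FRED-MD transformations mentioned here follow the seven transformation codes (tcodes) documented in McCracken and Ng (2016). The helper below is our own illustration of those codes, not code from the paper; it pads differenced series with NaN so the time index is preserved.

```python
import numpy as np

def fred_transform(x, tcode):
    """Apply a FRED-MD transformation code (McCracken and Ng, 2016):
    1: x_t, 2: diff(x_t), 3: double diff, 4: log x_t, 5: diff of log,
    6: double diff of log, 7: diff of (x_t / x_{t-1} - 1)."""
    x = np.asarray(x, dtype=float)
    d = lambda z: np.concatenate([[np.nan], np.diff(z)])   # NaN-padded difference
    if tcode == 1: return x
    if tcode == 2: return d(x)
    if tcode == 3: return d(d(x))
    if tcode == 4: return np.log(x)
    if tcode == 5: return d(np.log(x))
    if tcode == 6: return d(d(np.log(x)))
    if tcode == 7: return d(np.concatenate([[np.nan], x[1:] / x[:-1] - 1.0]))
    raise ValueError("unknown tcode")

# a price index growing 2% per period: the double log-difference (tcode 6,
# typical for price series) reduces it to (numerically) zero
p = np.array([100.0, 102.0, 104.04, 106.1208])
print(np.round(fred_transform(p, 6), 6))
```

A series flagged with tcode 5 (e.g., many real activity series) would instead be reduced to an approximately constant growth rate.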
In order to highlight the gains from exploring all relevant information in the dataset, we construct one-step-ahead forecasts for each one of the 119 variables in the dataset according to the following models:

1.
Autoregressive model (AR):
$$\hat Y^{(AR)}_{i,t+1|t} = \hat\phi_{i0} + \hat\phi_{i1} Y_{i,t} + \dots + \hat\phi_{ip} Y_{i,t-p+1}, \qquad i = 1,\dots,n,$$
where $\hat\phi_{i0}, \hat\phi_{i1}, \dots, \hat\phi_{ip}$, $i = 1,\dots,n$, are OLS estimates. This will also be the first-stage model in our methodology. (The FRED-MD data are available at https://research.stlouisfed.org/econ/mccracken/fred-databases/.)

2. AR + Sparse Regression (SR):
$$\hat Y^{(SR)}_{i,t+1|t} = \hat Y^{(AR)}_{i,t+1|t} + \hat R_{i,t+1|t}, \qquad \hat R_{i,t+1|t} = \hat\beta_{i0} + \hat\beta_{i1}'\hat R_t + \dots + \hat\beta_{ip}'\hat R_{t-p+1}, \qquad i = 1,\dots,n,$$
where $\hat\beta_{i0}, \hat\beta_{i1}, \dots, \hat\beta_{ip}$, $i = 1,\dots,n$, are LASSO estimates, $\hat R_t = (\hat R_{1,t},\dots,\hat R_{n,t})'$, and $\hat R_{i,t} = Y_{i,t} - \hat Y^{(AR)}_{i,t|t-1}$, $i = 1,\dots,n$. The parameters are estimated equation by equation for each one of the 119 variables in the dataset. The penalty parameter is selected by BIC, as discussed in Section 4.

3. AR + Principal Component Regression (
PCR):
$$\hat Y^{(PCR)}_{i,t+1|t} = \hat Y^{(AR)}_{i,t+1|t} + \hat\lambda_i'\hat F_t,$$
where $\hat F_t$ is the estimate of the $(k\times 1)$ vector of factors $F_t$ given by a principal component analysis of $\hat R_t$, the residuals of the first-stage regression. The parameter $\lambda_i$ is computed by an OLS regression of $\hat R_{i,t}$ on $\hat F_t$ in the in-sample window.

4. AR + Full Information (
FarmPredict):
$$\hat Y^{(FarmPredict)}_{i,t+1|t} = \hat Y^{(PCR)}_{i,t+1|t} + \hat U_{i,t+1|t},$$
where $\hat U_{i,t+1|t} = \hat\theta_{i0} + \hat\theta_{i1}'\hat U_t + \dots + \hat\theta_{ip}'\hat U_{t-p+1}$, $\hat U_t = (\hat U_{1,t},\dots,\hat U_{n,t})'$, and $\hat U_{i,t} = Y_{i,t} - \hat Y^{(PCR)}_{i,t|t-1}$, $i = 1,\dots,n$. The estimates $\hat\theta_{i0}, \hat\theta_{i1},\dots,\hat\theta_{ip}$, $i = 1,\dots,n$, are given by LASSO.

The forecasts are based on a rolling-window framework of fixed length of 480 observations, starting in January 1960. Therefore, the forecasts start in January 1990, and the last forecasts are for December 2019. Note that the AR model only considers information on the own past of the variable of interest. SR and PCR expand the information set by two opposing routes: while SR uses a sparse combination of the set of variables, PCR considers only a factor structure (dense model).
FarmPredict combines these two approaches and uses the full information available.
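The evaluation design above (fixed-length rolling window, refit at each step, one-step-ahead forecasts, models compared out of sample) can be sketched as follows. For brevity this sketch compares only a univariate AR(1), fitted by OLS, against a naive random-walk benchmark on simulated data, rather than the four models of the paper.

```python
import numpy as np

def ar1_forecast(window):
    """One-step-ahead forecast from an AR(1) with intercept, fitted by OLS."""
    X = np.column_stack([np.ones(len(window) - 1), window[:-1]])
    coef, *_ = np.linalg.lstsq(X, window[1:], rcond=None)
    return coef[0] + coef[1] * window[-1]

def rolling_mse(y, window, forecaster):
    """Fixed-length rolling window: refit on y[t-window:t], forecast y[t], report MSE."""
    errors = [y[t] - forecaster(y[t - window:t]) for t in range(window, len(y))]
    return float(np.mean(np.square(errors)))

# simulated AR(1) data standing in for one macro series
rng = np.random.default_rng(3)
T, phi = 1000, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.standard_normal()

mse_ar = rolling_mse(y, 480, ar1_forecast)
mse_rw = rolling_mse(y, 480, lambda w: w[-1])   # random-walk benchmark
print(mse_ar < mse_rw)                          # the AR(1) should win on this DGP
```

In the paper's exercise, the same loop would be run for each of the 119 series and each of the four nested models, with the window length fixed at 480 observations.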
We start by looking at the full sample in order to analyse the structure of dependence among the many series considered. We first estimate an autoregressive model of order 4, AR(4), for each transformed series. Figure 7 reports the empirical distribution of the OLS estimators of the AR coefficients. Figure 8 shows the distribution of the absolute value of the sum of the estimates, which gives an idea of the persistence of each series. Although we report here the results for AR models with a pre-specified order equal to four, in the Supplementary Material we present results for optimal lag selection via the BIC. Only one series has estimated persistence above one. This is the case for NONBORRES: Reserves of Depositary Institutions, which belongs to group (v): money and credit. The reason for such high persistence is a major structural break in the second half of the series. However, 82.35% of the series have estimated persistence below 0.9.

We continue by estimating the number of factors when the full sample is used for PCA. We consider two different situations. In the first, we do not include any lags in the basket of variables used to compute the factors. In the second approach, we include four lags of each variable as well. The eigenvalue ratio procedure selects either two factors (no lags) or a single factor (with lags). The four information criteria of Bai and Ng (2002), as described in Section 5, select, for the case with no lags (with lags), the following numbers of factors: six (one), five (one), nine (one), and one (one), respectively. Note that the factors are estimated from the residuals of the first-step AR filter. If we remove the
NONBORRES variable from the sample, the results do not change for the eigenvalue ratio procedure. On the other hand, the new numbers of factors selected by the information criteria are as follows: seven (one), six (one), eleven (one), and one (one).

Finally, we apply the testing approach developed in this paper to check for remaining (partial) covariance structure in the data. The tests strongly reject the null of a diagonal matrix when applied to the residuals of either the first or the second stage of the methodology. This serves as evidence that
FarmPredict may be a useful modeling approach for this macroeconomic dataset.
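The eigenvalue-ratio selections reported above can be reproduced schematically. The sketch below implements the eigenvalue-ratio criterion (pick $\hat r = \arg\max_{k} \mu_k/\mu_{k+1}$, with $\mu_k$ the ordered eigenvalues) on a toy three-factor panel of our own design, not on the FRED-MD data.

```python
import numpy as np

def eigenvalue_ratio(R, kmax=8):
    """Eigenvalue-ratio criterion: r_hat = argmax_{1<=k<=kmax} mu_k / mu_{k+1},
    where mu_k is the k-th largest eigenvalue of R R' / (nT)."""
    n, T = R.shape
    mu = np.linalg.eigvalsh(R @ R.T / (n * T))[::-1]   # eigenvalues, descending
    ratios = mu[:kmax] / mu[1:kmax + 1]
    return int(np.argmax(ratios)) + 1

# toy panel: three strong factors plus idiosyncratic noise
rng = np.random.default_rng(4)
n, T, r = 100, 200, 3
Lam = rng.standard_normal((n, r))
F = rng.standard_normal((T, r))
R = Lam @ F.T + 0.5 * rng.standard_normal((n, T))
print(eigenvalue_ratio(R))   # expected: 3 in this strong-factor design
```

When factors are weak, as the simulations in Section 5 suggest, the ratio criterion tends to underestimate the number of factors, which is consistent with the size distortions reported there.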
(Conventional unit-root tests also reject the null of a unit root for all but one of the series.)

For each of the four models described above, we report a number of performance metrics in Table 6. The table presents the frequency with which each model has the best performance among the four alternatives. Numbers between parentheses indicate the frequency with which each model is the second, third, or fourth best. We report the results for each one of the eight groups of variables as well as for the set of all 119 variables. We show the results for two methods of determining the number of factors: panel (a) reports the results for the eigenvalue ratio method, while panel (b) presents the results for the information criterion $IC_1$. Criteria $IC_2$, $IC_3$, and $IC_4$ select a very large number of factors, and we relegate them
to the supplementary material. Panels (c) and (d) in the table show the results for the cases where the number of factors is kept fixed.

FarmPredict is the model that is ranked first most frequently when all the series are considered. It is also the best model for the following groups: output and income; labor market; housing; and consumption, orders and inventories. The AR model is best for the money and credit and the stock market groups. The sparse regression is superior for two other groups: interest and exchange rates, and prices.

Conclusions

In this paper we propose a new methodology that bridges the gap between sparse regressions and factor models, and we evaluate the gains of increasing the information set via factor augmentation. Our proposal consists of several steps. In the first, we filter the data for known factors (trends, seasonal adjustments, covariates). In the second step, we estimate a latent factor structure. Finally, in the last part of the procedure, we estimate a sparse regression for the idiosyncratic components. Furthermore, we also propose a new test for remaining structure in both high-dimensional covariance and partial covariance matrices. Our test can be used to evaluate the benefits of adding more structure to the model. Our paper also has a number of important side results. First, we prove consistency of kernel estimation of long-run covariance matrices in high dimensions, where both the number of observations and the number of variables grow. Second, we derive the theoretical properties of factor estimation on the residuals of a first-step procedure. Third, the proposed test can be used as a diagnostic tool for factor models.

We evaluate our methodology with both simulations and real data. The simulations show that the test has good size and power properties even when the true number of factors is unknown and must be determined from the data. However, if the number of factors is underestimated, we observe size distortions.
This is especially the case when the eigenvalue ratio test is used to determine the number of latent factors. The simulations also show that there are major informational gains from combining factor models and sparse regressions in a forecasting exercise. Two applications are considered in the paper.
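The three-step methodology summarized above (filter out known factors, extract latent factors, run a sparse regression on the idiosyncratic components) can be sketched as follows. This is an illustrative sketch under simplifying assumptions — a plain AR(p) filter, a fixed number r of PCA factors, and a small hand-rolled coordinate-descent lasso — not the implementation used in the paper.

```python
import numpy as np

def lasso_cd(Z, y, alpha=0.1, iters=200):
    """Plain coordinate-descent lasso: min (1/2T)||y - Zb||^2 + alpha*||b||_1."""
    T, k = Z.shape
    b = np.zeros(k)
    col_ss = (Z ** 2).sum(axis=0) + 1e-12
    for _ in range(iters):
        for j in range(k):
            r = y - Z @ b + Z[:, j] * b[j]          # partial residual excluding j
            rho = Z[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - alpha * T, 0.0) / col_ss[j]
    return b

def farmpredict_sketch(X, y, p=4, r=2, alpha=0.1):
    """Three-step sketch: (1) AR(p)-filter each series, (2) extract r
    principal-component factors from the filtered panel, (3) sparse (lasso)
    regression of the target on the factors and idiosyncratic components."""
    T, n = X.shape
    # Step 1: AR(p) filter, series by series, by OLS on own lags.
    resid = np.empty((T - p, n))
    for i in range(n):
        lags = np.column_stack([X[p - l - 1:T - l - 1, i] for l in range(p)])
        Z = np.column_stack([np.ones(T - p), lags])
        beta = np.linalg.lstsq(Z, X[p:, i], rcond=None)[0]
        resid[:, i] = X[p:, i] - Z @ beta
    # Step 2: principal-component factors of the filtered panel via SVD.
    U, S, Vt = np.linalg.svd(resid, full_matrices=False)
    factors = U[:, :r] * np.sqrt(T - p)
    loadings = Vt[:r].T * S[:r] / np.sqrt(T - p)
    idios = resid - factors @ loadings.T            # idiosyncratic components
    # Step 3: sparse regression on factors + idiosyncratic components.
    design = np.column_stack([factors, idios])
    coef = lasso_cd(design, y[p:], alpha=alpha)
    return coef, factors, idios
```

Each step leaves residuals that can be handed to the next, which is what allows the covariance test to be applied after either the first or the second stage.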
A Proof of the Theorems
Throughout the proofs we use the equivalence $\|X\|_{\psi_p} < \infty \iff \mathbb{P}(|X| > x) = O(\psi_p(x)^{-1})$ as $x \to \infty$, for any random variable $X$ and $\psi_p \in \Psi$, combined with Lemma 6 in Carvalho et al. (2018) and Lemma 1 in Masini and Medeiros (2019). The key ingredients of those lemmas are a Marcinkiewicz–Zygmund type inequality for strong mixing sequences to deal with polynomial tails (Rio, 1994; Doukhan and Louhichi, 1999) and a Bernstein inequality under strong mixing conditions to control exponential tails (Merlevède et al., 2009, Theorem 2).
A.1 Proof of Theorem 1
We first upper bound } p R it ´ R it } ψ . By subsequent application of H¨older’s inequality we have . | p R it ´ R it | “ |p p γ i ´ γ i q W it |ď } p γ i ´ γ i } } W it } “ } p Σ ´ i p v i } } W it } ď k } p Σ ´ i } max } p v i } } W it } , where p Σ i : “ W i W i { T and p v i : “ W i U i { T . Then by the Cauchy-Schwartz conjugate } p R it ´ R it } ψ p { ď k }} p Σ ´ i } max } ψ p }} p v i } } ψ p { }} W it } } ψ p . The first term is bounded by Assumption 3(b). For the second term we have: } W it(cid:96) U it } ψ p { ď} W it(cid:96) } ψ p } U it } ψ p ď C by Assumption 3(a). Then, t W it(cid:96) U it u t ą is a zero-mean strong mixing withexponential decay sequence (Assumption 3(c)) with bounded ψ p { -norm. Therefore, }} p v i } } ψ p { “ O p {? T q uniformly in i ď n . Finally, the last term is bounded by the maximal inequality (van derVaart and Wellner (1996) - Lemma 2.2.2) and Assumption 3(a). The first result follows.32 .2 Proof of Theorem 2 The proof is an adaption of the proof of Theorem 4 and Corollary 1 in Fan et al. (2013), henceforthFLM, to include the estimation error in the sample covariance matrix. For part (a), we pick upfrom expression (A.1) in Bai (2003) to obtain the following identity p f t ´ HF t “ ˆ V n ˙ ´ « T T ÿ s “ p f s E p U s U t q n ` T T ÿ s “ p f s r ζ st ` T T ÿ s “ p f s r η st ` T T ÿ s “ p f s r ξ st ff , (A.1)where r ζ st , r η st and r ξ st are defined before Lemma B.3.By Assumptions 2(d) and 3(a) and the maximal inequality we have } R } max ď r } Λ } max } F } max `} U } max “ O P p ψ ´ p nT qq . Applying Lemma B.14 we conclude that } p Σ ´ r Σ } max “ O P p ω p ψ ´ p nT q ` ω qq “ O P p q , where the last assumption by the Theorem assumption. Finally ψ ´ p n q{? T “ O p q also by assumption then } V n } ´ “ O P p q by Lemma B.6. 
Using the results (a)-(d) of Lemma B.5we can bound in probability each of the terms in brackets of (A.1) in (cid:96) norm uniformly in t ď T and obtain the result (a).For part (b) we use the fact that p Λ : “ p R p F { T and the normalization p F p F “ I r to write p λ i ´ Hλ i “ T T ÿ t “ HF t r U it ` T T ÿ t “ p R it p p F t ´ HF t q ` H ˜ T T ÿ t “ F t F t ´ I r ¸ λ i . (A.2)The first term can be upper bounded in (cid:96) norm uniformly in i ď n by ? r } H } max i ď n max j ď r ˇˇˇˇˇ T T ÿ t “ F jt r U it ˇˇˇˇˇ “ O P p q O P r ψ ´ p { p n q{? T ` ω s , where the equality follows from Lemma B.6(b) and (e). The (cid:96) norm of the second term is upperbounded uniformly in i ď n by ˜ max i ď n T T ÿ t “ p R it T T ÿ t “ } p F t ´ HF t } ¸ { “ „ O P p q O P p T ` p {? n ` ω q q { , where the first term after the equality follows from Lemma B.6(d) together with the Theoremassumption and the second term from Lemma B.4(e). Finally the last term of (A.2) is upperbounded by } H }} max i ď n λ i }} T T ÿ t “ F t F t ´ I r } “ O P p q O p q O P p {? T q , O P p {? T q by the maximum inequality and Assumption 3 Plug the last threedisplays back into (A.2) yields result (b).For part (c) we use we have } p U ´ U } max “ } Λ F ´ p Λ p F ` p R ´ R } max ď } p Λ p F ´ Λ F } max ` } p R ´ R } max . The last term is O P p ω q by assumption. For the first term we use the decomposition p λ i p F t ´ λ i F t “ p p λ i ´ Hλ i q p p F t ´ HF t q ` p Hλ i q p p F t ´ HF t q` p p λ i ´ Hλ i q HF t ` λ i p H H ´ I r q F t . (A.3)Therefore, we can upper bound the left hand side as | p λ i p F t ´ λ i F t | ď } p λ i ´ Hλ i }} p F t ´ HF t } ` } Hλ i }} p F t ´ HF t }` } p λ i ´ Hλ i }} HF t } ` } λ i }} F t }} H H ´ I r } . Now we bound in probability each of the four term above uniformly in i ď n and t ď T . The firstone is given by part (a) and (b). 
For the second term, $\max_{i\le n}\|H\lambda_i\| \le \|H\|\max_{i\le n}\|\lambda_i\| \le O_P(1)\sqrt{r}\,\|\Lambda\|_{\max} = O_P(1)$ by Lemma B.6(b) and Assumption 2(d); thus the second term is bounded using part (a). Similarly, for the third term, $\max_{t\le T}\|HF_t\| \le \|H\|\max_{t\le T}\|F_t\| = O_P(1)\,O_P(\psi_p^{-1}(T)) = O_P(\psi_p^{-1}(T))$ by Lemma B.6(b) and Assumption 2(a). Finally, $\|H'H - I_r\| = O_P(1/\sqrt{T} + 1/\sqrt{n} + \omega)$ by Lemma B.6(c); hence the last term is $O_P[\psi_p^{-1}(T)(1/\sqrt{T} + 1/\sqrt{n} + \omega)]$ by Assumptions 2(d) and 3(a).

A.3 Proof of Theorem 3
We have that L p p θ ξ q ` ξ } p θ ξ } ď L p θ q ` ξ } θ } for all θ P R n by definition of p θ ξ , where L p θ q : “} p u y ´ θ p U x } { T . Also, since L p θ q is a quadratic function, it implies that p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ď ´ ∇ L p θ q p p θ ξ ´ θ q ` ξ p} θ } ´ } p θ ξ } q . By Holder’s inequality we have | ∇ L p θ q p p θ ξ ´ θ q| ď} ∇ L p θ q} } p θ ξ ´ θ } and by assumption ξ ě } ∇ L p θ q} then we have p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ď ξ { } p θ ξ ´ θ } ` ξ p} θ } ´ } p θ ξ } q . (A.4)For any index set S P r n s , by the decomposability of the (cid:96) norm (refer to Definition 1 in Negahbanet al. (2012)) followed by the triangle inequality we have } p θ ξ } “ } p θ ξ, S } ` } p θ ξ, S c } ě } θ S } ´ } p θ ξ. S ´ θ S } ` } p θ ξ, S c } and } p θ ξ ´ θ } “ } p θ ξ, S ´ θ S } ` } p θ ξ, S c ´ θ S c } ď } p θ ξ, S ´ θ S } ` } p θ ξ, S c ´ θ S c } . Plugging34t back in (A.4) yields2 p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ` ξ } p θ ξ, S c ´ θ S c } ď ξ } p θ ξ, S ´ θ S } ` ξ } θ S c } . (A.5)We then conclude that any minimizer p θ ξ of (3.5) and θ P R n obeys p θ ξ ´ θ P C p S , θ q : “t x P R n : } x S c } ď } x S } ` } θ S c } u . If we take θ “ θ and S “ S : “ t i : θ ,i ‰ u then p θ ξ ´ θ P C : “ C p S , θ q . Note that C is a cone in R n that does not depend on θ as } θ , S c } “ κ : “ κ p p U x p U x { T, S , q we have that } p θ ξ, S ´ θ S } ď p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q a | S |{ κ . Apply this inequality (A.5) and use the fact , 4 ab ă a ` b for non-negative a, b P R to obtain p p θ ξ ´ θ q ∇ L p θ qp p θ ξ ´ θ q ` ξ } p θ ξ ´ θ } ď ξ | S |{ κ. (A.6)Finally, we have by assumption } p U ´ U } max ď C , } U } max ď C and C p C ` C q ď κ | S | which, inturn fulfills the assumptions of Lemma B.14 with ζ “ α “ {
2. Therefore, we conclude that κ is at least half of its population counterpart, which completes the proof.

A.4 Proof of Theorem 4
We use in this proof the following additional notation for short: For every random vector X , wedenote by Σ X its covariance matrix, d X the diagonal of Σ X and σ X : “ } d X } . Also, X G denoteszero-mean Gaussian random vector defined in the same probability space, independent of X andwith the same covariance matrix of X . Finally, for every pair of random vectors X , Y of the samedimension and scalar s ą ρ p X , Y q : “ sup t P R | P p} X } ď t q ´ P p} Y } ď t q| ∆ p X , s q : “ sup t P R P p t ď } X } ď t ` s q Combining equations (83)–(86) in Giessing and Fan (2020) gives us the following basic inequality | P p S ď c ˚ p τ qq ´ τ | ď ρ p r Q , r Q G q ` inf δ ą "a δ log nω max ` P p} p Υ ´ Υ } max ą δ q * ` inf δ ą " δ ? log nω max ` P p} Q ´ r Q } ą δ q * (A.7)35here r Q is defined below.We start by Bounding the first term to the right-hand side of (A.7). Here we adapt the classical”big block-small block” technique proposed by Bernstein in the context of proving CLT undermixing conditions, which was also used in the proof of Theorem E.1 in Chernozhukov et al. (2018).Consider two sequences of non-negative integers a : “ a T and b : “ b T such that b ă a , a ` b ď T , min t a, b u Ñ 8 , a “ o p T q and b “ o p a q as T Ñ 8 . Let m : “ r T {p a ` b qs and define for j P t , . . . , m u consecutive blocks of size a and b with index set A j : “ tpp j ´ qp a ` b q ` , . . . , p j ´ qp a ` b q ` a u and B j : “ tp j ´ qp a ` b q ` a ` , . . . j p a ` b qu . Finally set C : “ t m p a ` b q ` . . . , T u , which mightbe empty. A j : “ ? a ÿ t P A j r D t B j “ ? b ÿ t P B j r D t ; ; C “ a | C | ÿ t P C r D t , such that r Q : “ ? T T ÿ t “ r D t “ c maT ˜ ? m m ÿ j “ A j ¸looooooomooooooon “ : V ` c mbT ˜ ? m m ÿ j “ B j ¸loooooooomoooooooon “ : L ` c T ´ m p a ` b q T C Now let r V : “ ? m ř mj “ r A j where t r A t , ď t ď m u is an independent sequence such that A t and r A t have the same distribution for all 1 ď t ď m . Similarly define r L : “ ? 
m ř mj “ r B j . Lemma B.7give us for any scalar s ą ρ p r Q , r Q G q ď ρ p r V , r V G q ` ρ p c maT r V G , r Q G q ` ∆ p c maT r V G , s q` P p c mbT } r L } ą s q ` ρ p V , r V q ` ρ p L , r L q . (A.8)Notice that we any measurable A Ď R we have | P rp A , A q P A s ´ P r r A , r A , s| ď α b where t α n , n P N u denote the α -mixing coefficient of the sequence p r D t q which is the same of the sequence p U t q . Then the last two terms in (A.8) can be upper bounded by p m ´ q α b and p m ´ q α a respectivelyby induction. Since α n is non-increasing in n and a ě b we have that ρ p V , r V q ` ρ p L , r L q ď p m ´ q α b ď T exp p´ cb q . (A.9)where we use Assumption 3(c) to obtain the last inequality.For the fourth term we have by the maximal inequality followed by Markov’s inequality P p b mbT } r L } ą q ď „ ψ ˆ s ? TC ψ ψ ´ p { p n q? mb ˙ ´ and the anti-concentration inequality for Gaussian random vectors (The-orem 7 in Giessing and Fan (2020) with p “ 8 ) ∆ p a maT r V G , s q À T s ? log nmaσ Ă V . Set s “ C ψ ψ ´ p { p n q? mb ? T ψ ´ p { p T γ q for some γ ą p c maT r V G , s q ` P p c mbT } r L } ą s q À Tma c mbT ? log nψ ´ p { p n q ψ ´ p { p T γ q σ r V ` T γ (A.10)For the second term we have from Rio (2013) that, for every (cid:15) ą |r M (cid:96) s ij | “ | Cov p r D it , r D j,t ´ (cid:96) q| ď α (cid:15) {p ` (cid:15) q (cid:96) } r D it } ` (cid:15) } r D jt ´ (cid:96) } ` (cid:15) . 
Hence, from Assumption 3 we have that } M (cid:96) } max À exp p´ c (cid:15) ` (cid:15) (cid:96) q and } maT Σ r V G ´ Σ r Q G } max ď p ´ maT q} Σ r V } max ` } Σ r Q ´ Σ r V } max ď p ba ` b ` aT q} Σ r V } max ` a ÿ | (cid:96) |ă a | (cid:96) |} M (cid:96) } max ` ÿ a ď| (cid:96) |ă T } M (cid:96) } max À ba ` aT ` a ` T exp p´ c (cid:15) ` (cid:15) a q , where we use the fact that Σ r V G “ Σ r V “ Σ r A j “ Σ A j “ ř | (cid:96) |ă a p ´ | (cid:96) |{ a q M (cid:96) , Σ r Q G “ Σ r Q “ ř | (cid:96) |ă T p ´| (cid:96) |{ T q M (cid:96) , ř | (cid:96) |ă a | (cid:96) |} M (cid:96) } max ď c for some c ă 8 and ř a ď| (cid:96) |ă T } M (cid:96) } max À T exp p´ c (cid:15) ` (cid:15) a q .Finally, we can bound the second term using Theorem 8 in Giessing and Fan (2020). In particularfor p “ 8 it implies that ρ p c maT r V G , r Q G q À log n b } maT Σ r V G ´ Σ r Q G } max a maT σ r V _ σ r Q À b Tma log n b ba ` aT ` a ` T exp p´ c(cid:15) ` (cid:15) a q σ r V _ σ r Q (A.11)For the first term we have that } r D it } ψ p { is uniformly (upper) bounded by Assumption 3(a)then so is } r A it } ψ p { “ } A it } ψ p { “ } ? a ř s P A t r D is } ψ p { . Also p E p max i | r A it |q q { À } max i | r A it |} ψ p { À ψ ´ p { p n q max i } r A it } ψ p { À ψ ´ p { p n q . Since t r A t , ď t ď m u is an iid sequence of random vector Theorem5 in Giessing and Fan (2020) gives us ρ p r V , r V G q À p log n q { ψ ´ p { p n q T { σ r V . (A.12)By the triangle inequality we have that σ r V ě σ r Q ´ } d r Q ´ d r V } max ě c ´ } Σ r Q ´ Σ r V } max Á ´ a ´ T exp p´ c (cid:15) ` (cid:15) a q . By setting a “ r? T s we conclude that σ r V is eventually bounded awayfor zero for large enough T . If we further set b “ r log T { c s and γ “ { ρ p r Q , r Q G q “ O « p log n q { ψ ´ p { p n q T { ` ? log T log nψ ´ p { p n q ψ ´ p { p T { q T { ff . (A.13)Finally, we now bound the last two term appering in (A.7). 
Let $\gamma_1$ and $\gamma_2$ be positive sequences depending on $n$ and $T$ such that $\|\widehat\Upsilon - \Upsilon\|_{\max} = O_P(\gamma_1)$ and $\|Q - \widetilde Q\| = O_P(\gamma_2)$. Suppose we can state conditions under which
$$\log n\,(\gamma_1 \vee \gamma_2) = o(1), \qquad T, n \to \infty. \qquad \text{(A.14)}$$
Then the last two terms vanish in probability if we set $\delta_1 = \gamma_1 \log n$ and $\delta_2 = \gamma_2 \log n$ in (A.7). Lemma B.8 and Lemma B.10 give expressions for $\gamma_1$ and $\gamma_2$, respectively, which combined with the rate assumptions in the theorem imply (A.14).

B Additional Lemmas
Lemma B.1.
Let a j and b j denote the j -th eigenvalue in decreasing order of Σ and ΛΛ respectively.Then, under Assumption 2(b) and p c q :(a) b j — n for ď j ď r (b) max j ď n | a j ´ b j | “ O p q (c) a j — n for ď j ď r .Proof. Result p a q follows from the fact that the r eigenvalues of Λ Λ are also (the only r non-zero)eigenvalues of ΛΛ and Assumption 2(b). Part p b q follows from Weyl’s inequality that implies max j ď n | a j ´ b j | ď } Σ ´ ΛΛ } “ O p q , where the last equality follows from Assumption 2(c). Finallyresult p c q follows from part p a q and p b q and the (reverse) triangle inequality.Recall that Σ be the p n ˆ n q covariance matrix of U t “ Z t ´ Γ W t . Let r Σ : “ T ř Tt “ U t U t and p Σ the same as r Σ but with Γ replaced by the estimator p Γ . Also let p a j denote the j -th eigenvalue indecreasing order of p Σ emma B.2. Let ω be a non-negative sequence of n and T such that } p Σ ´ r Σ } max “ O P p ω q . Then,under the Assumptions 2 and 3:(a) } p Σ ´ Σ } max “ O P r ω ` ψ ´ p { p n q{? T s (b) max j ď n | p a j ´ a j | “ O P r n p ω ` ψ ´ p { p n q{? T qs (c) p a j — P n for j ď r provided that ω ` ψ ´ p { p n qq{? T “ O P p q Proof.
Part (a) follows by triangle inequality followed by the maximum inequality since } p Σ ´ Σ } max ď} p Σ ´ r Σ } max ` } r Σ ´ Σ } max “ O P p ω q ` O P p ψ ´ p { p n q{? T q . Part (b) follows from Weyl’s inequality,the fact that } p Σ ´ Σ } ď n } p Σ ´ Σ } max and part p a q . Part p c q follows from the triangle inequalitycombined with part p b q and Lemma B.1(c).The Lemmas B.3-B.6 below are an adaption of Lemmas 8-10 in Fan et al. (2013), henceforthFLM, to include the estimation error in the sample covariance matrix. To avoid confusion and makeit easier for the read to follow through the changes we use the same notation adopted in FLM. Inparticular, if δ it denotes the p i, t q element of ∆ : “ p R ´ R then r U it “ U it ` δ it for i P r n s and t P r T s .Also, we consider that } ∆ } max “ O P p ω q for some non-negative sequence ω depending on n and T .Define: r ζ st : “ r U s r U t n ´ E p U s U t q n “ ˆ U s U t n ´ E p U s U t q n ˙ ` ˆ U s δ t n ` δ s U t n ` δ s δ t n ˙ “ : ζ st ` ζ ˚ st r η st : “ f s ř ni “ λ i r U it n “ f s ř ni “ λ i U it n ` f s ř ni “ λ i δ it n “ : η st ` η ˚ st r ξ st : “ F t ř ni “ λ i r U is n “ F t ř ni “ λ i U is n ` F t ř ni “ λ i δ is n “ ξ st ` ξ ˚ st . Lemma B.3.
Under Assumption 3:(a) ζ st “ O P p {? n q (b) η st “ O P p {? n q (c) ξ st “ O P p {? n q (d) ζ ˚ st “ O P p ω ` ω q and max s,t ď T ζ ˚ st “ O P p ψ ´ p nT q ω ` ω q (e) η ˚ st “ O P p ω q (f ) ξ ˚ st “ O P p ω q . roof. Parts p a q , p b q and p c q are straightforward. For (d) we have that n U s U t “ O P p q and n δ s δ t ď} ∆ } max “ O P p ω q then the other two terms in parentheses in the definition of ζ ˚ st are O P p ω q by theCauchy-Schwartz inequality. Part p e q and p f q follows by similar arguments. max t ď T T T ÿ s “ p n δ s U t q “ max t ď T n U t ˜ T T ÿ s “ δ s δ s ¸ U t ď } ∆ } max p max t ď T } U t } { n q ζ ˚ st ď } U s } } δ t } ` } U t } } δ s } ` } δ t } } δ s } ď } U } max } ∆ } max ` } ∆ } max Lemma B.4.
Under Assumption 3:(a) T ř Tt “ r nT ř Ts “ p f js E p U s U t qs “ O P p { T q (b) T ř Tt “ r T ř Ts “ p f js r ζ st s “ O P rp {? n ` ω ` ω q s (c) T ř Tt “ r T ř Ts “ p f js r η st s “ O P rp {? n ` ω q s (d) T ř Tt “ r T ř Ts “ p f js ξ st s “ O P rp {? n ` ω q s (e) T ř Tt “ } p f t ´ Hf t } “ O P r { T ` p {? n ` ω ` ω q s Proof.
Part (a) is unaltered by the presence of the pre-estimation step, so it follows directly from Lemma 8(a) in FLM. For part (b), we have that for s, l
P r n s and j P r r s by Cauchy-Schwartz inequality1 T T ÿ t “ r T T ÿ s “ p f js r ζ st s ď »– T T ÿ s,l “ ˜ T T ÿ t “ r ζ st r ζ lt ¸ fifl { Since r ζ st “ ζ st ` ζ ˚ st “ O P p {? n ` ω ` ω q by Lemma B.3, the term in parentheses is O P rp {? n ` ω ` ω q q . The result p b q then follows. For (c), by the triangle inequality and Lemma 8(c) in FLM,we have that } ř ni “ λ ji r u it } ď } ř ni “ λ ji U it } ` } ř ni “ λ ji δ it } “ O P p? n q ` O P p nω q , then we conclude1 T T ÿ t “ r T T ÿ s “ p f s r η st s ď T n T ÿ t “ } n ÿ i “ U it λ i } “ O P p { n ` ω {? n ` ω q . The proof of part (d) is analogous to part (c) therefore is omitted. For (e), let r p f t ´ Hf t s j denotethe j -th entry of the vector p f t ´ Hf t . Since V { n is bounded away for zero by Lemma B.2(c), thefact that p a ` b ` c ` d q ď p a ` b ` c ` d q and using (A.1) we have that max j ď r T ´ ř t r p f t ´ Hf t s j
40s upper bounded by some constant C ă 8 times »– max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js E p U s U t q n ¸ ` max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js r ζ st ¸ ` max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js r η st ¸ ` max j ď r T T ÿ t “ ˜ T T ÿ s “ p f js r ξ st ¸ fifl . The result then follows by applying the bounds from part (a)-(d) to each of the four terms above.
Lemma B.5.
Under Assumption 2:(a) max t ď T } nT ř Ts “ p f s E p U s U t q} “ O P p {? T q (b) max t ď T } T ř Ts “ p f s r ζ st } “ O P p b ψ ´ p { p T q{ n ` ψ ´ p nT q ω ` ω q (c) max t ď T } T ř Ts “ p f s r η st } “ O P p ψ ´ p T q{? n ` ω q (d) max t ď T } T ř Ts “ p f s ξ st } “ O P p ψ ´ p T qp {? n ` ω qq Proof.
Once again, part (a) is unaltered by the presence of a pre-estimation so it follows directlyfrom Lemma 9(a) in FLM. For part (b), from the Cauchy-Schwartz inequality we have max t ď T } T T ÿ s “ p f s r ζ st } ď ˜ T T ÿ s “ } p f s } max t ď T T T ÿ s “ r ζ st ¸ { . The first summation inside the parentheses equal r due to the normalization. For the second summa-tion, by the triangle inequality, we have max t ď T T ř Ts “ r ζ st ď max t ď T T ř Ts “ ζ st ` max t ď T T ř Ts “ ζ st ζ ˚ st ` max t ď T T ř Ts “ ζ ˚ st . For the first term, the maximum inequality followed by Assumption 2(e) yields max t ď T T T ÿ s “ ζ st “ O P „ ψ ´ p { p T q max s,t } ζ } ψ p { “ O P „ ψ ´ p { p T q max s,t } ζ } ψ “ O P « ψ ´ p { p T q n ff . The last one is O P rp ψ ´ p nT q w ` ω q q by Lemma B.3(d). Then by Cauchy Schwartz we have that max t ď T T ř Ts “ r ζ st “ O P rp b ψ ´ p { p T q{ n ` ψ ´ p nT q w ` ω q s and result (b) follows.For (c), by the triangle inequality we have that max t ď T } n ř ni “ λ i r U it } ď max t ď T } n ř ni “ λ i U it } ` max t ď T } n ř ni “ λ i δ it } . For the first term, the maximum inequality followed by Assumption 2(f)yields max t ď T } n Λ U t } “ O P „ ψ ´ p T q? n max t } ? n Λ U t } “ O P p ψ ´ p T q{? n q . } Λ } max } ∆ } max “ O P p ω q by Assumption 2(d). We then obtainthe result since max t ď T } T T ÿ s “ p f s r η st } ď } T T ÿ s “ p f s f s } max t ď T } n n ÿ i “ λ i r U it } “ O P ˆ ψ ´ p T q? n ` ω ˙ . (B.1)By the triangle inequality, } nT ř s ř i λ i r U is p f s } ď } nT ř s ř i λ i U is p f s } ` } nT ř s ř i λ i δ is p f is } . Lemma9(d) of FLM shows that the first term is O P p {? n q . For the second term for each j P r r s : } nT ÿ s ÿ i λ i δ is p f js } ď ˜ T n ÿ s “ } n n ÿ i “ λ i δ is } p f js ¸ ˜ T n ÿ s “ p f js ¸ “ O P p ω q . Thus } nT ř s ř i λ i r U it p f s } “ O P p {? 
n ` ω q and by Cauchy-Schwartz inequality we have max t ď T } T T ÿ s “ p f s ξ st } ď max t ď T } F t }} nT ÿ s ÿ i λ i r U it p f s } “ O P p ψ ´ p T qp {? n ` ω qq . (B.2) Lemma B.6.
Let ω ` ψ ´ p n qq{? T “ O p q where ω is defined in Lemma B.2, then UnderAssumption 3 we have(a) } V ´ } “ O P p { n q (b) } H } “ O P p q (c) } H H ´ I r } F “ O P p {? T ` {? n ` ω q (d) max i ď n T ř Tt “ p R it “ O P p w p ψ ´ p nT q ` ω q ` ψ ´ p { p n q{? T ` q (e) max i ď n max j ď r T ř Tt “ F jt r U it “ O P p ψ ´ p { p n q{? T ` ω q Proof.
We have that V ´ “ diag p { p a , . . . , { p a r q and 1 { p a j — P { n for j ď r by Lemma B.2(c).The result (a) then follows. The normalization tell us } p F } “ ? T , Lemma 11(a) in FLM give us } F } “ O P p? T q , } Λ Λ } “ r a — n by Lemma B.1(a) and from part (b) we have } V ´ } “ O P p { n q .Result (b) then follows since by definition H : “ T ´ V ´ p F F Λ Λ . For (c) we have by the triangleinequality } H H ´ I r } F ď } H H ´ H F F { T H } F ` } H F F { T H ´ I r } F } H p I r ´ F F { T q H } F ď } H } } I r ´ F F { T } F “ O P p q O P p {? T q . The second term is equal to } H F F { T H ´ p F p F { T } F For (d) we have max i ď n T T ÿ t “ p R it ď max i ď n T T ÿ t “ p p R it ´ R it q ` max i ď n T T ÿ t “ R it ´ E p R it q ` max i ď n T T ÿ t “ E p R it qď max i,t | p R it ´ R it | ` max i ď n T T ÿ t “ R it ´ E p R it q ` max i,t E p R it q . The last term is O p q by Assumption 3(a), the middle term O P p ψ ´ p { p n q{? T q . The first term is nolarger then } ∆ } max p } R } max ` } ∆ } max q “ O P p ω p ψ ´ p nT q ` ω qq . The result (d) then follows.For (e) we have for each j ď r : | T ´ ÿ t F jt r U it | ď | T ´ ÿ t F jt U it | ` | T ´ ÿ t F jt δ it |ď | T ´ ÿ t F jt U it | ` p T ´ ÿ t F jt T ´ ÿ t δ it q { The first term is O P p ψ ´ p { p n q{? T by the maximum inequality and Assumption 3 and the second is O P p ω q . Lemma B.7.
For every s ą : ρ p p S, Z q ď ρ p p r T , r Z q ` ∆ p p c mqn r Z, s q ` ρ p p c mqn r Z, Z q ` P p c mrn } r U } p ą s q ` ρ p p T, r T q ` ρ p p U, r U q . Proof.
We start by showing that for every pair of random variables X and Y defined in the sameprobability space taking values in the normed space p S, } ¨ }q and pair of non-negative reals t, s , wehave P p} X } ď t ´ s q ´ P p} Y } ą s q ď P p} X ` Y } ď t q ď P p} X } ď t ` s q ` P p} Y } ą s q . (B.3)Indeed, for the right hand side inequality we use } X ` Y } “ } X ´ p´ Y q} ě } X } ´ } Y } . Hence, for43ny t, s ą P p} X ` Y } ď t q ď P p} X } ď t ` } Y }qď P p} X } ď t ` } Y } , } Y } ď s q ` P p} Y } ą s qď P p} X } ď t ` s q ` P p} Y } ą s q . For the other side we use } X ` Y } ď } X } ` } Y } to write P p} X ` Y } ď t q ě P p} X } ď t ´ } Y }qě P p} X } ď t ´ } Y }q ` P p} Y } ą s q ´ P p} Y } ą s q Now replace X and Y by a mqn T and a mrn U in (B.3), respectively and set } ¨ } “ } ¨ } p . The righthand side of the resulting expression can be upper bounded by P p a mqn } r T } p ď t ` s q ` P p a mrn } r U } ą s q ` ρ p p T, r T q ` ρ p p U, r U q , whereas the left hand side can be lower bounded by P p a mqn } r T } ď t ´ s q ´ P p a mqr } r U } ą s q ´ ρ p p T, r T q ´ ρ p p U, r U q . Therefore P p c mqn } r T } p ď t ´ s q ´ P p c mrn } r U } p ą s q ´ ρ p p T, r T q ´ ρ p p U, r U qď P p} S } p ď t qď P p c mqn } r T } p ď t ` s q ` P p c mrn } r U } p ą s q ` ρ p p T, r T q ` ρ p p U, r U q . Then for the right-hand side P p c mqn } r T } p ď t ` s q ď P p c mqn } r Z } p ď t ` s q ` ρ p p r T , r Z qď P p c mqn } r Z } p ď t q ` ∆ p p c mqn r Z, s q ` ρ p p r T , r Z qď P p} Z } p ď t q ` ρ p p c mqn r Z, Z q ` ∆ p p c mqn r Z, s q ` ρ p p r T , r Z q Similarly for the left-hand side and the proof is completed.By the triangle inequality } p Υ ´ Υ } max ď } p Υ ´ r Υ } max ` } r Υ ´ Υ } max where r Υ is the sample44ovariance matrix of r D t : “ U t U ´ t . The second term is O p ψ ´ p { p n q{? 
T q while for the first } p Υ ´ r Υ } max ď } D ´ r D } max p } r D } max ` } D ´ r D } max q The first term in parentheses is O p ψ ´ ˚ p nT qq and the second can be upper bounded by } p U ´ U } max p } U } max ` } p U ´ U } max q which is show to be O P p η p n, T q ψ ´ p nT qq in the proof of LemmaB.16. Therefore we conclude } p Υ ´ Υ } max “ O P ´ η p n, T q ψ ´ p nT q ψ ´ p { p nT q ` ψ ´ p { p n q{? T ¯ To leverage on the results of Gaussian approximation, in particular on the work of Giessing andFan (2020) we would like to establish some sort of asymptotic linearity namely Q T “ ? T T ÿ t “ D t “ ? T T ÿ t “ r D t ` R T “ : r Q T ` R T . (B.4)such that } R t } vanishes in probability at an appropriate rate as n, T Ñ 8 . Then we can ap-proximate the distribution of S “ } Q } by the distribution of r S : “ } r Q } p , which in turn can beapproximated by the distribution of S ˚ : “ } Q ˚ } with high probability.For some (cid:15) ą δ “ h r η p n, T qp ψ ´ p nT qq ` ψ ´ p { p n q{? T s δ “ η ´ (cid:15) r ψ ´ p n q ` ? T η s Lemma B.8. } p Υ ´ Υ } max “ O P ´ h r η p ψ ´ p p nT qq ` ψ ´ p { p n q{? T s ¯ Proof.
Let i : “ p i , i , i , i q be a multi-index where i , i , i , i P r n s . Define for i and | (cid:96) | ă T : r γ (cid:96) i : “ T T ÿ t “| (cid:96) |` U i ,t U i ,t U i ,t ´| (cid:96) | U i ,t ´| (cid:96) | ; γ (cid:96) i : “ E r γ i , and p γ (cid:96) i as r γ (cid:96) i with U ’s replaced by p U ’s. Also define r υ i : “ ÿ | (cid:96) |ă T k p (cid:96) { h q r γ (cid:96) i υ i : “ ÿ | (cid:96) |ă T γ (cid:96) i , p υ i as r υ i with U ’s replaced by p U ’s. Then we write r υ i ´ υ i “ ÿ | (cid:96) |ă T k p (cid:96) { h qp r γ (cid:96) i ´ γ (cid:96) i q ` ÿ | (cid:96) |ă T p k p (cid:96) { h q ´ q γ (cid:96) i . (B.5)Since } r γ (cid:96) i ´ γ (cid:96) i } ψ p { “ O p a T ´ | (cid:96) |{ T q “ O p {? T q , the ψ p { -Orlicz norm of the first term is boundedby h ÿ | (cid:96) |ă T | h ´ k p (cid:96) { h q|} r γ (cid:96) i ´ γ (cid:96) i } ψ p { “ O ˆ h ? T ż | k p u q| du ˙ “ O p h {? T q , whereas the second term is deterministic and is shown to be O p h {? T q by Andrews (1991). Thus } r υ i ´ υ i } ψ p { “ O p h {? T q uniformly in i P r n s . Thus, by the maximal inequality followed byMarkov’s inequality we conclude that max i | r υ i ´ υ i | “ O P p ψ ´ p { p n q max i } r υ i ´ υ i } ψ p { q “ O P r ψ ´ p { p n q h {? T s . (B.6)We now use the fact that for any x , y P R q we have | ś qi “ x i ´ ś qi “ y i | “ O p ř q ´ i “ } x ´ y } n ´ i } y } i q combined with the fact that } p U ´ U } max “ o p q to obtain max i ,(cid:96) | p γ (cid:96) i ´ r γ (cid:96) i | ď max i ,t,(cid:96) | p U i ,t p U i ,t p U i ,t ´| (cid:96) | p U i ,t ´| (cid:96) | ´ U i ,t U i ,t U i ,t ´| (cid:96) | U i ,t ´| (cid:96) | |“ O p} p U ´ U } max } U } max q“ O P r η r ψ ´ p p nT qs s Therefore we conclude max i | p υ i ´ r υ i | ď max i ,(cid:96) | p γ (cid:96) i ´ r γ (cid:96) i | ÿ | (cid:96) |ă T | k p (cid:96) { h q| “ O P ˆ hη r ψ ´ p p nT qs ż | k p u q| du ˙ “ O P p hη r ψ ´ p p nT qs q . 
(B.7)The result then follows from the triangle inequality } p Υ ´ Υ } max ď max i | p υ i ´ r υ i | ` max i | r υ i ´ υ i | ,expression (B.10) and (B.11). Lemma B.9. If } δ it } ψ p ď C ă 8 where δ it : “ p R it ´ R it then }}p V { n qp F t ´ HF t q} } ψ p “ O p ? T ` ψ ´ p { p T q? n ` ψ ´ p { p T q C q . roof. In this proof we use the fact that for any (possibly random) A st , by Cauchy-Schwartz in-equality and the normalization p F p F { T “ I r , we have } T ř Ts “ p F s A st } ď ? r ´ T ř Ts “ A st ¯ { . Thus g p A st q : “ ››››› } T T ÿ s “ p F s A st } ››››› ψ “ O »–››››››˜ T T ÿ s “ A st ¸ { ›››››› ψ fifl . (a) Set A st “ E p U s U t q{ n , then g p A st q “ O p {? T q .(b) Set A st “ r ζ st : “ p U s U t ´ E p U s U t qq{ n , then by maximal inequality g p A st q “ O p} max s ď T | r ζ st |} ψ q “ O p ψ ´ p T q max s ď T } r ζ st } ψ q . By the triangle inequality } r ζ st } ψ ď } ζ st } ψ ` } ζ ˚ st } ψ . The first term is O p {? n q by Assumption 3(d). The second can be upper bounded by } U s δ t { n } ψ `} δ s U t { n } ψ `} δ s δ t { n } ψ “ O p} U is } ψ p { } δ it } ψ p { q ` O p} δ it } ψ p { q . Thus g p A st q “ O p ψ ´ p T qp {? n ` C ` C qq .(c) Set A st “ r η st : “ F s ř ni “ λ i p U it ` δ it q{ n , then apply Cauchy-Schwartz twice to obtain g p A st q “ O p}p T T ÿ s “ } F s } q { } ψ p { } n ÿ i “ λ i U it ` δ it n } ψ p { q “ O p q O p} n ÿ i “ λ i U it n } ψ p { `} n ÿ i “ λ i δ it n } ψ p { q . The first term in square brackets is O p {? n q by Assumption 2(d) and 3(e); the second is O p C q . Hence g p A st q “ O p ? n ` C q .(d) Set A st “ r ξ st : “ F t ř ni “ λ i p U is ` δ is q{ n , then apply Cauchy-Schwartz twice followed by themaximal inequality to obtain g p A st q “ O p}} F t }} ψ p { }p T T ÿ s “ } n ÿ i “ λ i U is ` δ is n } q { } ψ p { qq“ O p q O p ψ ´ p T qr} n ÿ i “ λ i U is n } ψ p { ` }p n ÿ i “ λ i δ is n } ψ p { sq . The first term in square brackets is O p {? n q by Assumption 2(d) and 3(e); the second is O p C q . 
Hence g p A st q “ O p ψ ´ p { p T qr ? n ` C sq .Finally, use the identity (A.1), the triangle inequality twice and the bounds p a q ´ p d q to obtain theresult. Lemma B.10. If max i,t } δ it } ψ “ O p C q and } p U ´ U } max “ O P p η q then ›››› ? T p p U p U ´ U U q ›››› max “ O P ˆ ? T η ` r ? T ` r ? n ` r C ˙ here r : “ ψ ´ p p n q ψ ´ p { p n q ψ ´ p { p n q r : “ ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p { p n q r : “ ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p p nT q ψ ´ p { p n q . Proof.
By the triangle inequality we have ›››› ? T p p U p U ´ U U q ›››› max ď ›››› ? T p p U ´ U qp p U ´ U q ›››› max ` ›››› ? T U p p U ´ U q ›››› max . For the first term we have ›››› ? T p p U ´ U qp p U ´ U q ›››› max ď ? T } p U ´ U } max “ O P p? T η q . For the second term we use decomposition (A.3) to write1 ? T T ÿ t “ U it p p U jt ´ U jt q “ ? T T ÿ t “ U it p p λ j p F t ´ λ j F t ` p R jt ´ R jt q“ ” p p λ j ´ Hλ j q ` Hλ j ı ? T T ÿ t “ U it p p F t ´ HF t q` ” p p λ j ´ Hλ j q ` p H H ´ I r q λ j ı ? T T ÿ t “ U it F t ` p p γ j ´ γ j q ? T T ÿ t “ U it W jt Apply Cauchy-Schwartz inequality in each term followed by the triangle inequality we obtain ›››› ? T U p p U ´ U q ›››› max ď „ max j ď n } p λ j ´ Hλ j } ` ? r } H }} Λ } max max i ď n ››››› ? T T ÿ t “ U it p p F t ´ HF t q ››››› ` „ max j ď n } p λ j ´ Hλ j } ` ? r } H H ´ I r }} Λ } max max i ď n ››››› ? T T ÿ t “ U it F t ››››› ` max j ď n } p γ j ´ γ j } max i,j ď n ››››› ? T T ÿ t “ U it W jt ››››› . The first term is O P p q O P p ψ ´ p n qr ? T ` ψ ´ p { p T q? n ` ψ ´ p { p T q C sq due to Lemma B.6(a), Lemma B.9and the maximal inequality; the second term is O P p ψ ´ p { p n q? T ` ? n ` ψ ´ p nT q C q O P p ψ ´ p { p n qq since,by the maximal inequality, we might take ω “ ψ ´ p nT q C in Theorem 2(b). The last term is48 P p ψ ´ p n q ψ ´ p { p n q{? T q O P p ψ ´ p { p n qq . Thus, ››› ? T U p p U ´ U q ››› max “ O P p r q where r : “ ψ ´ p p n q ψ ´ p { p n q ψ ´ p { p n q? T ` ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p { p n q? n ` p ψ ´ p p n q ψ ´ p { p T q ` ψ ´ p p nT q ψ ´ p { p n qq C. (B.8)The result then follows. Lemma B.11.
If $\|\widehat{U} - U\|_{\max} = O_P(\eta)$, then $\max_{i,j,t}|\widehat{V}_{ij,t} - V_{ij,t}| = O_P\big(s\,[\eta + \xi\,\psi^{-1}(n)]\big)$.

Proof. By the triangle inequality we have $|\widehat{V}_{ij,t} - V_{ij,t}| \le |\widehat{U}_{i,t} - U_{i,t}| + |\widehat{\theta}_i'\widehat{U}_{-ij,t} - \theta_i'U_{-ij,t}|$. Using Hölder's inequality, the second term can be further bounded as
$$|\widehat{\theta}_i'\widehat{U}_{-ij,t} - \theta_i'U_{-ij,t}| \le |\widehat{\theta}_i'(\widehat{U}_{-ij,t} - U_{-ij,t})| + |(\widehat{\theta}_i - \theta_i)'U_{-ij,t}| \le \|\widehat{\theta}_i\|_1\,\|\widehat{U}_{-ij,t} - U_{-ij,t}\|_\infty + \|\widehat{\theta}_i - \theta_i\|_1\,\|U_{-ij,t}\|_\infty$$
$$\le \big(\|\theta_i\|_1 + \|\widehat{\theta}_i - \theta_i\|_1\big)\|\widehat{U}_{-ij,t} - U_{-ij,t}\|_\infty + \|\widehat{\theta}_i - \theta_i\|_1\,\|U_{-ij,t}\|_\infty.$$
Combining the last two expressions with the facts that $\|\theta_i\|_1 \le s\,\|\theta_i\|_\infty \le Cs$ and $\|\widehat{\theta} - \theta\|_1 = O_P(\xi s) = O_P(1)$, by Assumption 3(f) and the maximal inequality, yields the result.

Lemma B.12.
If $\|\widehat{U} - U\|_{\max} = O_P(\eta)$, then
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}\sum_{t=1}^{T}\big(\widehat{V}_{ij,t}\widehat{V}_{ji,t} - V_{ij,t}V_{ji,t}\big)\Big| = O_P\Big(s\,\big[\tilde r + \xi\,\psi^{-1}(n)\big] + \sqrt{T}\,s^2\big[\eta + \xi\,\psi^{-1}(n)\big]^2\Big).$$

Proof. By the triangle inequality,
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}\big(\widehat{V}_{ij}'\widehat{V}_{ji} - V_{ij}'V_{ji}\big)\Big| \le \max_{i,j}\Big|\frac{1}{\sqrt{T}}(\widehat{V}_{ij} - V_{ij})'(\widehat{V}_{ji} - V_{ji})\Big| + 2\max_{i,j}\Big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{V}_{ij} - V_{ij})\Big|.$$
The first term can be bounded using Lemma B.11, since
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}(\widehat{V}_{ij} - V_{ij})'(\widehat{V}_{ji} - V_{ji})\Big| \le \sqrt{T}\,\Big[\max_{i,j,t}|\widehat{V}_{ij,t} - V_{ij,t}|\Big]^2 = O_P\Big(\sqrt{T}\,\big[s(\eta + \xi\,\psi^{-1}(n))\big]^2\Big).$$
For the second term,
$$\max_{i,j}\Big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{V}_{ij} - V_{ij})\Big| \le \max_{i,j}\Big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{U}_i - U_i)\Big| + \max_{i,j}\|\widehat{\theta}_{ij}\|_1\,\max_{i,j}\Big\|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{U}_{-ij} - U_{-ij})\Big\|_\infty + \max_{i,j}\|\widehat{\theta}_{ij} - \theta_{ij}\|_1\,\max_{i,j}\Big\|\frac{1}{\sqrt{T}}V_{ij}'U_{-ij}\Big\|_\infty.$$
Recall the rate $\tilde r$ appearing in (B.8). Then the first term is $O_P(s\,\tilde r)$, the second is $O_P(s\,\tilde r)$ and the last term is $O_P(\xi s\,\psi^{-1}(n))$. Thus $\max_{i,j}\big|\frac{1}{\sqrt{T}}V_{ij}'(\widehat{V}_{ij} - V_{ij})\big| = O_P\big(s\,[\tilde r + \xi\,\psi^{-1}(n)]\big)$. The result then follows.

Lemma B.13.
$$\|\widehat{\Upsilon}_V - \Upsilon_V\|_{\max} = O_P\Big(h\Big[s\,\big(\eta + \xi\,\psi_p^{-1}(n)\big)\big[s\,\psi_p^{-1}(nT)\big]^3 + \frac{s\,\psi_{p/2}^{-1}(n^4)}{\sqrt{T}}\Big]\Big).$$

Proof.
The proof is similar to the proof of Lemma B.8; refer to it for details. It suffices to bound in probability $\|\widehat{V} - V\|_{\max}$ and $\|V\|_{\max}$, where $V$ is the $(n^2 \times T)$ matrix whose entries are $V_{ij,t}$ for $i,j \in [n]$ and $t \in [T]$; similarly for $\widehat{V}$, with $V_{ij,t}$ replaced by $\widehat{V}_{ij,t}$. Lemma B.11 bounds the former; for the latter we have $\|V\|_{\max} \le \max_{i,j}\|\theta_{ij}\|_1\,\|U\|_{\max} = O\big(s\,\psi^{-1}(nT)\big)$.

Let $\mathbf{i} := (i_1, i_2, i_3, i_4)$ be a multi-index, where $i_1, i_2, i_3, i_4 \in [n]$. Define, for $\mathbf{i}$ and $|\ell| < T$,
$$\tilde\gamma_{\mathbf{i}}^{\ell} := \frac{1}{T}\sum_{t=|\ell|+1}^{T} U_{i_1,t}U_{i_2,t}U_{i_3,t-|\ell|}U_{i_4,t-|\ell|}, \qquad \gamma_{\mathbf{i}}^{\ell} := \mathbb{E}\,\tilde\gamma_{\mathbf{i}}^{\ell},$$
and $\widehat\gamma_{\mathbf{i}}^{\ell}$ as $\tilde\gamma_{\mathbf{i}}^{\ell}$ with the $U$'s replaced by $\widehat{U}$'s. Also define
$$\tilde\upsilon_{\mathbf{i}} := \sum_{|\ell| < T} k(\ell/h)\,\tilde\gamma_{\mathbf{i}}^{\ell}, \qquad \upsilon_{\mathbf{i}} := \sum_{|\ell| < T} \gamma_{\mathbf{i}}^{\ell},$$
and $\widehat\upsilon_{\mathbf{i}}$ as $\tilde\upsilon_{\mathbf{i}}$ with the $U$'s replaced by $\widehat{U}$'s. Then we write
$$\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}} = \sum_{|\ell| < T} k(\ell/h)\big(\tilde\gamma_{\mathbf{i}}^{\ell} - \gamma_{\mathbf{i}}^{\ell}\big) + \sum_{|\ell| < T}\big(k(\ell/h) - 1\big)\gamma_{\mathbf{i}}^{\ell}. \quad (B.9)$$
Since $\|\tilde\gamma_{\mathbf{i}}^{\ell} - \gamma_{\mathbf{i}}^{\ell}\|_{\psi_{p/2}} = O\big(\sqrt{T - |\ell|}/T\big) = O(1/\sqrt{T})$, the $\psi_{p/2}$-Orlicz norm of the first term is bounded by
$$h\sum_{|\ell| < T} \frac{1}{h}\,|k(\ell/h)|\,\|\tilde\gamma_{\mathbf{i}}^{\ell} - \gamma_{\mathbf{i}}^{\ell}\|_{\psi_{p/2}} = O\Big(\frac{h}{\sqrt{T}}\int |k(u)|\,du\Big) = O(h/\sqrt{T}),$$
whereas the second term is deterministic and is shown to be $O(h/\sqrt{T})$ by Andrews (1991). Thus $\|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}\|_{\psi_{p/2}} = O(h/\sqrt{T})$ uniformly in $\mathbf{i} \in [n]^4$. Hence, by the maximal inequality followed by Markov's inequality, we conclude that
$$\max_{\mathbf{i}}|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}| = O_P\Big(\psi_{p/2}^{-1}(n^4)\,\max_{\mathbf{i}}\|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}\|_{\psi_{p/2}}\Big) = O_P\big[\psi_{p/2}^{-1}(n^4)\,h/\sqrt{T}\big]. \quad (B.10)$$
We now use the fact that, for any $x, y \in \mathbb{R}^q$, $\big|\prod_{i=1}^q x_i - \prod_{i=1}^q y_i\big| = O\big(\sum_{i=0}^{q-1}\|x - y\|_\infty^{q-i}\,\|y\|_\infty^{i}\big)$, combined with the fact that $\|\widehat{U} - U\|_{\max} = o_P(1)$, to obtain
$$\max_{\mathbf{i},\ell}|\widehat\gamma_{\mathbf{i}}^{\ell} - \tilde\gamma_{\mathbf{i}}^{\ell}| \le \max_{\mathbf{i},t,\ell}\big|\widehat{U}_{i_1,t}\widehat{U}_{i_2,t}\widehat{U}_{i_3,t-|\ell|}\widehat{U}_{i_4,t-|\ell|} - U_{i_1,t}U_{i_2,t}U_{i_3,t-|\ell|}U_{i_4,t-|\ell|}\big| = O_P\big(\|\widehat{U} - U\|_{\max}\,\|U\|_{\max}^3\big) = O_P\big(\eta\,[\psi^{-1}(nT)]^3\big).$$
Therefore we conclude
$$\max_{\mathbf{i}}|\widehat\upsilon_{\mathbf{i}} - \tilde\upsilon_{\mathbf{i}}| \le \max_{\mathbf{i},\ell}|\widehat\gamma_{\mathbf{i}}^{\ell} - \tilde\gamma_{\mathbf{i}}^{\ell}|\sum_{|\ell| < T}|k(\ell/h)| = O_P\Big(h\,\eta\,[\psi^{-1}(nT)]^3\int |k(u)|\,du\Big) = O_P\big(h\,\eta\,[\psi^{-1}(nT)]^3\big). \quad (B.11)$$
The result then follows from the triangle inequality $\|\widehat{\Upsilon} - \Upsilon\|_{\max} \le \max_{\mathbf{i}}|\widehat\upsilon_{\mathbf{i}} - \tilde\upsilon_{\mathbf{i}}| + \max_{\mathbf{i}}|\tilde\upsilon_{\mathbf{i}} - \upsilon_{\mathbf{i}}|$, together with (B.10) and (B.11).

Lemma B.14.
Let $U, V$ be $T \times n$ matrices such that $\|U - V\|_{\max} \le C_1$ and $\|V\|_{\max} \le C_2$. Then $\|\Sigma_U - \Sigma_V\|_{\max} \le C_3 := C_1(C_1 + 2C_2)$, where $\Sigma_U := U'U/T$ and $\Sigma_V := V'V/T$. Furthermore, if $C_3 \le \alpha\,\kappa^2(\Sigma_V, S_0, \zeta)/\big(|S_0|(1+\zeta)^2\big)$ for $S_0 \subseteq [n]$, $\zeta > 0$ and $\alpha \in [0,1]$, then
$$(1-\alpha)\,\kappa^2(\Sigma_V, S_0, \zeta) \le \kappa^2(\Sigma_U, S_0, \zeta) \le (1+\alpha)\,\kappa^2(\Sigma_V, S_0, \zeta).$$

Proof. By the (reverse) triangle inequality we have $\|U\|_{\max} - \|V\|_{\max} \le \|U - V\|_{\max}$, from which we conclude that $\|U\|_{\max} \le \|U - V\|_{\max} + \|V\|_{\max} \le C_1 + C_2$. Now $\|\Sigma_U - \Sigma_V\|_{\max} = \max_{1\le i,j\le n}\big|T^{-1}\sum_{t=1}^T U_{it}U_{jt} - T^{-1}\sum_{t=1}^T V_{it}V_{jt}\big| \le \max_{i,j,t}|U_{it}U_{jt} - V_{it}V_{jt}|$ and
$$|U_{it}U_{jt} - V_{it}V_{jt}| \le |(U_{it} - V_{it})U_{jt} + (U_{jt} - V_{jt})V_{it}| \le \|U - V\|_{\max}\big(\|U\|_{\max} + \|V\|_{\max}\big) \le C_1(C_1 + 2C_2).$$
For the second part of the lemma, notice that for any $x \in \mathbb{R}^n$ we have $|x'\Sigma_U x - x'\Sigma_V x| = |x'(\Sigma_U - \Sigma_V)x| \le \|\Sigma_U - \Sigma_V\|_{\max}\,\|x\|_1^2 \le C_3\,\|x\|_1^2$ by the first part. Also, if $\|x_{S_0^c}\|_1 \le \zeta\,\|x_{S_0}\|_1$, we have that $\|x\|_1 = \|x_{S_0}\|_1 + \|x_{S_0^c}\|_1 \le (1+\zeta)\|x_{S_0}\|_1 \le (1+\zeta)\sqrt{x'\Sigma_V x\,|S_0|}/\kappa(\Sigma_V, S_0, \zeta)$, where the last inequality follows from the definition of the compatibility constant. Thus
$$|x'\Sigma_U x - x'\Sigma_V x| \le C_3\,(1+\zeta)^2\,\frac{x'\Sigma_V x\,|S_0|}{\kappa^2(\Sigma_V, S_0, \zeta)} \le \alpha\,x'\Sigma_V x,
$$
where the last inequality follows from the assumed bound on $C_3$. Therefore, we have that $(1-\alpha)\,x'\Sigma_V x \le x'\Sigma_U x \le (1+\alpha)\,x'\Sigma_V x$ whenever $\|x_{S_0^c}\|_1 \le \zeta\,\|x_{S_0}\|_1$. Take the infimum to conclude.

Lemma B.15.
Let $W := (U, V)$ and $Z := (X, Y)$ be $T \times (n+1)$ matrices such that $\|W - Z\|_{\max} \le C_1$ and $\|Z\|_{\max} \le C_2$. Then, for any $\delta \in \mathbb{R}^n$, we have
$$\big\|U'(V - U\delta)/T - X'(Y - X\delta)/T\big\|_\infty \le (1 + \|\delta\|_1)\,C_1(C_1 + 2C_2).$$

Proof.
For convenience let $q := V - U\delta \in \mathbb{R}^T$ and $r := Y - X\delta \in \mathbb{R}^T$. Then Hölder's inequality gives us $\|r\|_\infty \le (1 + \|\delta\|_1)\,\|Z\|_{\max} \le (1 + \|\delta\|_1)\,C_2$ and $\|q - r\|_\infty \le (1 + \|\delta\|_1)\,\|W - Z\|_{\max} \le (1 + \|\delta\|_1)\,C_1$. From the (reverse) triangle inequality we obtain $\|q\|_\infty \le \|q - r\|_\infty + \|r\|_\infty \le (1 + \|\delta\|_1)(C_1 + C_2)$. Now, following the same steps as in the proof of the previous lemma, the left-hand side of the display in the statement can be upper bounded by $\|U - X\|_{\max}\,\|q\|_\infty + \|q - r\|_\infty\,\|X\|_{\max}$, which in turn is bounded by the right-hand side of the display.

Lemma B.16.
Under the same conditions as in Theorems 1 and 2,
$$\|\nabla L_T(\theta_0) - \nabla L(\theta_0)\| = O_P\Big[\frac{\psi_{p/2}^{-1}(n)}{\sqrt{T}} + \frac{\psi^{-1}(T)\,\psi^{-1}(nT)\,\psi^{-1}(n)\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \frac{\psi^{-1}(T)\,T^{1/4}}{\sqrt{n}}\Big],$$
$$\|\nabla^2 L_T(\theta_0) - \nabla^2 L(\theta_0)\|_{\max} = O_P\Big[\eta_1(n,T)\big(\psi^{-1}(nT) + \eta_1(n,T)\big) + \frac{\psi_{p/2}^{-1}(n^2)}{\sqrt{T}}\Big],$$
where $\nabla L(\theta) := -\mathbb{E}\big[U_{-i,t}(U_{it} - \theta'U_{-i,t})\big]$ and $\nabla^2 L(\theta) := \mathbb{E}\big[U_{-i,t}U_{-i,t}'\big]$.

Proof. By the triangle inequality we have
$$\|\nabla L_T(\theta_0) - \nabla L(\theta_0)\| = \big\|(\widehat{U}_x - U_x + U_x)'V/T - \mathbb{E}(U_x'V/T)\big\| \le \big\|U_x'V/T - \mathbb{E}(U_x'V/T)\big\| + \|\widehat{U}_x - U_x\|_{\max}\,\|V\|_\infty.$$
Similarly, using Lemma B.14,
$$\|\nabla^2 L_T(\theta_0) - \nabla^2 L(\theta_0)\|_{\max} \le \big\|\widehat{U}_x'\widehat{U}_x/T - U_x'U_x/T\big\|_{\max} + \big\|U_x'U_x/T - \mathbb{E}(U_x'U_x/T)\big\|_{\max}$$
$$\le \|\widehat{U}_x - U_x\|_{\max}\big(2\|U_x\|_{\max} + \|\widehat{U}_x - U_x\|_{\max}\big) + \big\|U_x'U_x/T - \mathbb{E}(U_x'U_x/T)\big\|_{\max}.$$
By Corollary 1 and Assumption 3 we can bound each of those terms in probability:
$$\big\|U_x'V/T - \mathbb{E}(U_x'V/T)\big\| = O_P\Big[\frac{\psi_{p/2}^{-1}(n)}{\sqrt{T}}\Big], \qquad \|\widehat{U}_x - U_x\|_{\max} = O_P\Big[\frac{\psi^{-1}(nT)\,\psi^{-1}(n)\,\psi_{p/2}^{-1}(n)}{T^{1/2}} + \frac{T^{1/4}}{\sqrt{n}}\Big] =: O_P[\eta_1(n,T)],$$
$$\|V\|_\infty = O_P\big[\psi^{-1}(T)\big], \qquad \|U_x\|_{\max} = O_P\big[\psi^{-1}(nT)\big], \qquad \big\|U_x'U_x/T - \mathbb{E}(U_x'U_x/T)\big\|_{\max} = O_P\Big[\frac{\psi_{p/2}^{-1}(n^2)}{\sqrt{T}}\Big].$$
Combining the displays yields the two bounds in the statement.

Simulation Results: Size (first value of φ)

The table reports the empirical size of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Panels (c) and (d) present the results when the number of factors is determined, respectively, by the eigenvalue ratio test and the information criterion IC. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10.
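The null hypothesis behind the size results below, no remaining cross-sectional covariance among the (estimated) idiosyncratic components, can be illustrated with a multiplier-bootstrap test for the maximum off-diagonal covariance, in the spirit of Chernozhukov, Chetverikov, and Kato (2013). The sketch below is our own minimal implementation, not the paper's exact procedure; the function name, the Gaussian multipliers, and the absence of studentization are simplifying assumptions:

```python
import numpy as np

def remaining_cov_test(U, B=500, level=0.05, seed=0):
    # U: T x n matrix of (estimated) idiosyncratic components.
    # H0: E[U_it U_jt] = 0 for all i != j, tested with the statistic
    # sqrt(T) * max_{i<j} |sigma_hat_ij| and a Gaussian-multiplier bootstrap.
    T, n = U.shape
    i, j = np.triu_indices(n, k=1)
    X = U[:, i] * U[:, j]                 # T x m cross-products, m = n(n-1)/2
    Xbar = X.mean(axis=0)                 # off-diagonal sample covariances
    stat = np.sqrt(T) * np.max(np.abs(Xbar))
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((B, T))       # multipliers e_1, ..., e_T per replication
    boot = np.sqrt(T) * np.max(np.abs(E @ (X - Xbar) / T), axis=1)
    crit = float(np.quantile(boot, 1 - level))
    return float(stat), crit, bool(stat > crit)
```

Rejection indicates remaining cross-sectional dependence, so a further modelling step for the idiosyncratic components may be warranted.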
The table reports the results for the first value of φ.

[Table omitted: Panels (a) Known factors; (b) Known number of factors; (c) Information criterion (IC); (d) Eigenvalue ratio; entries for three values of T and four values of n proportional to T.]

Simulation Results: Size (second value of φ)

The table reports the empirical size of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Panels (c) and (d) present the results when the number of factors is determined, respectively, by the eigenvalue ratio test and the information criterion IC. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10. The table reports the results for the second value of φ.

[Table omitted: Panels (a)-(d) as above.]

Simulation Results: Power (first value of φ)

The table reports the empirical power of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10.
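The panels that treat the number of factors as unknown require a data-driven selector. A minimal sketch of the two selectors used in the tables, the eigenvalue ratio of Ahn and Horenstein (2013) and a Bai-Ng-type information criterion, under our own simplifications (the penalty form and the cap `rmax` are assumptions, not the paper's exact choices):

```python
import numpy as np

def select_num_factors(X, rmax=8):
    # X: T x n panel. Eigenvalues of the demeaned sample second-moment, scaled by nT.
    T, n = X.shape
    Xc = X - X.mean(axis=0)
    lam = np.linalg.svd(Xc, compute_uv=False) ** 2 / (n * T)
    # eigenvalue ratio: maximize lam_k / lam_{k+1} over k = 1, ..., rmax
    ratios = lam[:rmax] / lam[1:rmax + 1]
    r_er = int(np.argmax(ratios)) + 1
    # Bai-Ng-type IC: log residual variance after k PCs plus a penalty per factor
    penalty = (n + T) / (n * T) * np.log(min(n, T))
    ic = [np.log(lam[k:].sum()) + k * penalty for k in range(rmax + 1)]
    r_ic = int(np.argmin(ic))
    return r_er, r_ic
```

Both selectors operate on the same eigenvalue sequence; the ratio looks for the sharpest drop, while the criterion trades residual variance against a penalty that grows with n and T.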
[Table omitted: Panels (a) Known factors; (b) Known number of factors; (c) Eigenvalue ratio; (d) Information criterion (IC); entries for three values of T and four values of n proportional to T.]

Simulation Results: Power (second value of φ)

The table reports the empirical power of the test of remaining covariance structure. Panel (a) reports the case where the factors are known, whereas Panel (b) considers the case where the factors are unknown but the number of factors is known. Factors are estimated by the usual principal component algorithm. Three nominal significance levels are considered: 0.01, 0.05, and 0.10.
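The size and power designs contrast a null with no remaining dependence against alternatives in which the idiosyncratic components are cross-correlated. A generic data-generating sketch of this flavor, our own toy design rather than the paper's exact one, with `rho` controlling the remaining cross-sectional dependence:

```python
import numpy as np

def simulate_panel(T, n, r=2, rho=0.0, seed=0):
    # X_t = Lambda F_t + U_t, where U_t has Toeplitz covariance rho^{|i-j|};
    # rho = 0 corresponds to the null of no remaining cross-sectional dependence.
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((T, r))
    Lam = rng.standard_normal((n, r))
    S = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    U = rng.standard_normal((T, n)) @ np.linalg.cholesky(S).T
    return F @ Lam.T + U, F, U
```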
[Table omitted: Panels (a) Known factors; (b) Known number of factors; (c) Eigenvalue ratio; (d) Information criterion (IC); entries for three values of T and four values of n proportional to T.]

Simulation Results: Informational Gains
The table reports the average mean squared error (MSE) of three different prediction models over 5-fold cross-validation subsamples. The goal is to predict the first variable using information from the remaining n − 1. Panel (a) considers the case of Sparse Regression (SR), where Y_t is LASSO-regressed on all the other variables. Panel (b) shows the results of Principal Component Regression (PCR). Finally, Panel (c) presents the results of FarmPredict. "N/A" means "not available". Note that there is no factor selection for Sparse Regression. "Known Number" means that the number of factors is known.
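The three competitors can be sketched as follows. This is a simplified stand-in for the paper's procedures: `lasso_ista` is a bare-bones lasso solver, the factors are plain principal components, the FarmPredict step here is just the PCR fit plus a lasso on the idiosyncratic components, and the tuning parameter `lam` is fixed rather than cross-validated:

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, iters=500):
    # lasso with intercept via ISTA (proximal gradient); minimal, not tuned
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    L = np.linalg.norm(Xc, 2) ** 2 / len(y)   # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        z = b - Xc.T @ (Xc @ b - yc) / (len(y) * L)
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    b0 = y.mean() - X.mean(axis=0) @ b
    return b0, b

def three_predictors(X, y, r=2, lam=0.1):
    # in-sample fits of SR, PCR, and a simplified FarmPredict
    a0, a = lasso_ista(X, y, lam)             # (a) sparse regression on X
    yhat_sr = a0 + X @ a
    Xc = X - X.mean(axis=0)
    Uu, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = Uu[:, :r] * s[:r]                     # (b) estimated factors (PC scores)
    G = np.column_stack([np.ones(len(y)), F])
    c, *_ = np.linalg.lstsq(G, y, rcond=None)
    yhat_pcr = G @ c
    Uid = Xc - F @ Vt[:r]                     # (c) idiosyncratic components
    d0, d = lasso_ista(Uid, y - yhat_pcr, lam)
    yhat_farm = yhat_pcr + d0 + Uid @ d
    return yhat_sr, yhat_pcr, yhat_farm
```

By construction the FarmPredict fit can only improve on PCR in sample, since its extra lasso step starts from the PCR residual; the informational-gains table asks whether that improvement survives out of sample.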
Panel (a): Sparse Regression (SR)
[Table omitted: columns Known Number, Eigenvalue Ratio, and Information Criterion (IC), each for three values of T; rows with n proportional to T.]
Panel (b): Principal Component Regression (PCR)
[Table omitted: same layout as Panel (a).]
Panel (c): FarmPredict
[Table omitted: same layout as Panel (a).]

Forecasting Results.
The table reports the frequency with which each model is ranked first, second, third, and fourth among the four alternatives. Panel (a) considers the case when the factors are selected by the eigenvalue ratio procedure. Panel (b) presents the results when the factors are selected by the information criterion IC. Panels (c) and (d) consider the cases when the number of factors is pre-specified as either one or two. We present the results for each individual group of variables as well as for the full set of macroeconomic variables.

Panel (a): Optimal Factor Selection (eigenvalue ratio)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Panel (b): Optimal Factor Selection (IC)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Panel (c): Fixed Number of Factors (r = 1)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Panel (d): Fixed Number of Factors (r = 2)
[Table omitted: columns AR, SR, PCR, FarmPredict; rows by variable group.]
Figure 1: Correlations of returns larger than 0.15 in absolute value.
We estimate the correlations between all pairs of returns from a sample of nine specific sectors. The correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.

[Figure 2 panels: histograms of the estimated coefficients on the 16 risk factors MKT, HML, SMB, CMA, RMW, UMD, ACC, CFP, CHCSHO, BETA, DY, EP, MOM1m, MOM36m, IDIOVOL, RETVOL.]
Figure 2: First-stage coefficient estimates.
The figure shows the empirical distribution of the coefficient estimates from the first-stage regression, where each excess return is linearly regressed on 16 risk factors.
Figure 3: Correlations of first-stage residuals larger than 0.15 in absolute value.
We estimate the correlations between all pairs of residuals from the first-stage OLS regression on 16 observed risk factors from a sample of nine specific sectors. The correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.
Figure 4: Correlations of second-stage residuals larger than 0.15 in absolute value.
We estimate the correlations between all pairs of residuals from the second-stage principal component analysis from a sample of nine specific sectors. The correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.
Figure 5: Partial correlations of second-stage residuals larger than 0.15 in absolute value.
We estimate the partial correlations between all pairs of residuals from the second-stage LASSO regression from a sample of nine specific sectors. The partial correlations that are higher than 0.15 in absolute value are shown as black dots in the figure. We consider the following sectors: mining, food, petroleum, construction, manufacturing, utilities, department stores, retail, and financial.

[Figure 6 matrix: rows and columns are the sectors mining, food, apparel, paper, chemical, petroleum, construction, primary metals, fabricated metals, machinery, electrical equipment, transportation equipment, manufacturing, railroads, other transportation, utilities, department stores, retail, financial, and other.]

Figure 6: Variable Selection Frequency.
We report how often variables from column sectors are selected in the third-stage LASSO regression for firms in row sectors. The numbers are normalized by the total number of firms in each sector.

Figure 7: AR coefficient estimates.
The figure illustrates the empirical distribution of the ordinary least squares (OLS) estimates of the coefficients of a fourth-order autoregressive, AR(4), model across the 119 macroeconomic time series. Each panel relates to one specific coefficient.

Figure 8: Absolute sum of AR coefficient estimates.
The figure illustrates the empirical distribution of the absolute sum of the ordinary least squares (OLS) estimates of the coefficients of a fourth-order autoregressive, AR(4), model across the 119 macroeconomic time series.
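The AR(4) fits summarized in Figures 7 and 8 amount to an OLS regression of each series on a constant and its first four lags; a minimal sketch (the function name is ours):

```python
import numpy as np

def fit_ar(y, p=4):
    # OLS estimates of an AR(p) model y_t = c + sum_k phi_k y_{t-k} + e_t.
    # Returns (c, phi), where phi has length p.
    y = np.asarray(y, dtype=float)
    T = len(y)
    # design matrix: intercept plus the p lags of y
    X = np.column_stack([np.ones(T - p)] + [y[p - k:T - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]
```

Applying this to each of the 119 series and collecting the four slope estimates (Figure 7) or the sum of their absolute values (Figure 8) reproduces the kind of cross-sectional histograms shown above.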
Figure 9: Estimated number of factors.
The figure illustrates the number of selected factors over the estimation windows. The figure reports the results for the eigenvalue ratio procedure and the four information criteria discussed in the paper.

References

Abadie, A., A. Diamond, and J. Hainmueller (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association 105, 493–505.
Abadie, A. and J. Gardeazabal (2003). The economic costs of conflict: A case study of the Basque country. American Economic Review 93, 113–132.
Ahn, S. C. and A. R. Horenstein (2013). Eigenvalue ratio test for the number of factors. Econometrica 81, 1203–1227.
Andreou, E. and E. Ghysels (2021). Predicting the VIX and the volatility risk premium: The role of short-run funding spreads volatility factors. Journal of Econometrics 220, 366–398.
Andrews, D. W. K. (1991). Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59, 817–858.
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
Bai, J. and Y. Liao (2017). Inferences in panel data with interactive effects using large covariance matrices. Journal of Econometrics 200, 59–78.
Bai, J. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
Bai, J. and S. Ng (2003). Inferential theory for factor models of large dimensions. Econometrica 71, 135–171.
Bai, J. and S. Ng (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74, 1133–1155.
Barigozzi, M. and C. Brownlees (2019). NETS: Network estimation for time series. Journal of Applied Econometrics 34, 347–364.
Barigozzi, M. and M. Hallin (2016). Generalized dynamic factor models and volatilities: Recovering the market volatility shocks. Econometrics Journal 19, C33–C60.
Barigozzi, M. and M. Hallin (2017a). Generalized dynamic factor models and volatilities: Estimation and forecasting. Journal of Econometrics 201, 307–321.
Barigozzi, M. and M. Hallin (2017b). A network analysis of the volatility of high-dimensional financial series. Journal of the Royal Statistical Society - Series C 66, 581–605.
Barigozzi, M. and M. Hallin (2020). Generalized dynamic factor models and volatilities: Consistency, rates, and prediction intervals. Journal of Econometrics 216, 4–34.
Barigozzi, M., M. Hallin, S. Soccorsi, and R. von Sachs (2020). Time-varying general dynamic factor models and the measurement of financial connectedness. Journal of Econometrics, forthcoming.
Bernanke, B., J. Boivin, and P. Eliasz (2005). Measuring the effects of monetary policy: A factor-augmented vector autoregressive (FAVAR) approach. The Quarterly Journal of Economics 120, 387–422.
Brito, D., M. Medeiros, and R. Ribeiro (2018). Forecasting large realized covariance matrices: The benefits of factor models and shrinkage. Technical Report 3163668, SSRN.
Brownlees, C., G. Gudmundsson, and G. Lugosi (2020). Community detection in partial correlation network models. Journal of Business & Economic Statistics, forthcoming.
Cai, T. (2017). Global testing and large-scale multiple testing for high-dimensional covariance structures. Annual Review of Statistics and its Application 4, 4.1–4.24.
Cai, T. and Z. Ma (2013). Optimal hypothesis testing for high dimensional covariance matrices. Bernoulli 19, 2359–2388.
Cai, T., Z. Ren, and H. Zhou (2016). Estimating structured high-dimensional covariance and precision matrices: Optimal rates and adaptive estimation. Electronic Journal of Statistics 10, 1–59.
Carvalho, C., R. Masini, and M. Medeiros (2018). ArCo: An artificial counterfactual approach for high-dimensional panel time-series data. Journal of Econometrics 207, 352–380.
Chen, S., L.-X. Zhang, and P.-S. Zhong (2010). Tests for high-dimensional covariance matrices. Journal of the American Statistical Association 105, 810–819.
Chernozhukov, V., D. Chetverikov, and K. Kato (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Annals of Statistics 41, 2786–2819.
Chernozhukov, V., D. Chetverikov, and K. Kato (2018). Inference on causal and structural parameters using many moment inequalities.
Diebold, F. and K. Yilmaz (2014). On the network topology of variance decompositions: Measuring the connectedness of financial firms. Journal of Econometrics 182, 119–134.
Doukhan, P. and S. Louhichi (1999). A new weak dependence condition and applications to moment inequalities. Stochastic Processes and their Applications 84, 313–342.
Fama, E. and K. French (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33, 3–56.
Fama, E. and K. French (2015). A five-factor asset pricing model. Journal of Financial Economics 116, 1–22.
Fan, J., Y. Fan, and J. Lv (2008). High dimensional covariance matrix estimation using a factor model. Journal of Econometrics 147, 186–197.
Fan, J., Y. Ke, and K. Wang (2020). Factor-adjusted regularized model selection. Journal of Econometrics 216, 71–85.
Fan, J., Q. Li, and Y. Wang (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B 79, 247–265.
Fan, J., R. Li, C.-H. Zhang, and H. Zou (2020). Statistical Foundations of Data Science. CRC Press.
Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B 75, 603–680.
Fan, J., R. Masini, and M. Medeiros (2020). Do we exploit all information for counterfactual analysis? Benefits of factor models and idiosyncratic correction. Working paper, Princeton University.
Feng, G., S. Giglio, and D. Xiu (2020). Taming the factor zoo: A test of new factors. Journal of Finance 75, 1327–1370.
Gagliardini, P., E. Ossola, and O. Scaillet (2019). A diagnostic criterion for approximate factor structure. Journal of Econometrics 212, 503–521.
Giannone, D., M. Lenza, and G. Primiceri (2018). Economic predictions with big data: The illusion of sparsity. Working paper, Northwestern University.
Giessing, A. and J. Fan (2020). Bootstrapping ℓ_p-statistics in high dimensions.
Giglio, S. and D. Xiu (2020). Asset pricing with omitted factors. Journal of Political Economy, forthcoming.
Gobillon, L. and T. Magnac (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics 98, 535–551.
Gu, S., B. Kelly, and D. Xiu (2020). Empirical asset pricing via machine learning. Review of Financial Studies 33, 2223–2273.
Guo, X. and C. Tang (2020). Specification tests for covariance structures in high-dimensional statistical models. Biometrika, forthcoming.
Kock, A. and L. Callot (2015). Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics 186, 325–344.
Lam, C. and J. Fan (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Annals of Statistics 37, 4254–4278.
Ledoit, O. and M. Wolf (2002). Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. Annals of Statistics 30, 1081–1102.
Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 365–411.
Ledoit, O. and M. Wolf (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. Annals of Statistics 40, 1024–1060.
Ledoit, O. and M. Wolf (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. Review of Financial Studies 30, 4349–4388.
Ledoit, O. and M. Wolf (2020). Analytical nonlinear shrinkage of large-dimensional covariance matrices. Annals of Statistics, forthcoming.
Ledoit, O. and M. Wolf (2021a). The power of (non-)linear shrinking: A review and guide to covariance matrix estimation. Journal of Financial Econometrics, forthcoming.
Ledoit, O. and M. Wolf (2021b). Quadratic shrinkage for large covariance matrices. Bernoulli, forthcoming.
Li, W. and Y. Qin (2014). Hypothesis testing for high-dimensional covariance matrices. Journal of Multivariate Analysis 128, 108–119.
Masini, R. and M. Medeiros (2019). Counterfactual analysis with artificial controls: Inference, high dimensions and nonstationarity. Working Paper 3303308, SSRN.
Masini, R., M. Medeiros, and E. Mendes (2019). Regularized estimation of high-dimensional vector autoregressions with weakly dependent innovations. Technical Report 1912.09002, arXiv.
McCracken, M. and S. Ng (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business & Economic Statistics 34, 574–589.
Medeiros, M. and E. Mendes (2016). ℓ1-regularization of high-dimensional time-series models with non-Gaussian and heteroskedastic errors. Journal of Econometrics 191, 255–271.
Merlevède, F., M. Peligrad, and E. Rio (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In C. Houdré, V. Koltchinskii, D. Mason, and M. Peligrad (Eds.), High Dimensional Probability V: The Luminy Volume, Volume 5, pp. 273–292. Institute of Mathematical Statistics.
Moon, R. and M. Weidner (2015). Linear regression for panel with unknown number of factors as interactive fixed effects. Econometrica 83, 1543–1579.
Moskowitz, T. and M. Grinblatt (1999). Do industries explain momentum? Journal of Finance 54, 1249–1290.
Negahban, S., P. Ravikumar, M. Wainwright, and B. Yu (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27, 538–557.
Onatski, A., M. Moreira, and M. Hallin (2013). Asymptotic power of sphericity tests for high-dimensional data. Annals of Statistics 41, 1204–1231.
Rio, E. (1994). Inégalités de moments pour les suites stationnaires et fortement mélangeantes. Comptes Rendus Acad. Sci. Paris, Série I 318, 355–360.
Stock, J. and M. Watson (2002a). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97, 1167–1179.
Stock, J. and M. Watson (2002b). Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics 20, 147–162.
van de Geer, S. and P. Bühlmann (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics 3, 1360–1392.
van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.
Zheng, S., Z. Chen, H. Cui, and R. Li (2019). Hypothesis testing on linear structures of high-dimensional covariance matrix. Annals of Statistics 47, 3300–3334.
Zheng, S., G. Cheng, J. Guo, and H. Zhu (2019). Test for high-dimensional correlation matrices.