Optimal Portfolio Using Factor Graphical Lasso
Tae-Hwy Lee∗ and Ekaterina Seregina†
First version: September 8, 2020. Second version: November 5, 2020.‡

Abstract
Graphical models are a powerful tool to estimate a high-dimensional inverse covariance (precision) matrix, and they have been applied to the portfolio allocation problem. The assumption made by these models is sparsity of the precision matrix. However, when the stock returns are driven by common factors, this assumption does not hold. Our paper develops a framework for estimating a high-dimensional precision matrix which combines the benefits of exploring the factor structure of the stock returns and the sparsity of the precision matrix of the factor-adjusted returns. The proposed algorithm is called Factor Graphical Lasso (FGL). We study a high-dimensional portfolio allocation problem when the asset returns admit an approximate factor model. In high dimensions, when the number of assets is large relative to the sample size, the sample covariance matrix of the excess returns is subject to large estimation uncertainty, which leads to unstable solutions for the portfolio weights. To resolve this issue, we consider a decomposition into low-rank and sparse components. This strategy allows us to consistently estimate the optimal portfolio in high dimensions, even when the covariance matrix is ill-behaved. We establish consistency of the portfolio weights in a high-dimensional setting without assuming sparsity of the covariance or precision matrix of the stock returns. Our theoretical results and simulations demonstrate that FGL is robust to heavy-tailed distributions, which makes our method suitable for financial applications. The empirical application uses daily and monthly data for the constituents of the S&P500 to demonstrate superior performance of FGL compared to the equal-weighted portfolio, the index, and several prominent precision- and covariance-based estimators.
Keywords: High-dimensionality, Portfolio optimization, Graphical Lasso, Approximate Factor Model, Sharpe Ratio, Elliptical Distributions
JEL Classifications: C13, C55, C58, G11, G17

∗ Department of Economics, University of California, Riverside. Email: [email protected].
† Department of Economics, University of California, Riverside. Email: [email protected].
‡ In the second version, a part of Theorem 4 is corrected, with Lemma 11(e) added. Figure 1 is revised.

Introduction
Estimating the inverse covariance matrix, or precision matrix, of excess stock returns is crucial for constructing the weights of financial assets in a portfolio and for estimating the out-of-sample Sharpe Ratio. In a high-dimensional setting, when the number of assets, p, is greater than or equal to the sample size, T, using an estimator of the covariance matrix to obtain portfolio weights leads to the Markowitz curse: a higher number of assets increases the correlation between the investments, which calls for a more diversified portfolio, and yet unstable corner solutions for the weights become more likely. The reason behind this curse is the need to invert a high-dimensional covariance matrix to obtain the optimal weights from the quadratic optimization problem: when p ≥ T, the condition number of the covariance matrix (i.e., the absolute value of the ratio between the maximal and minimal eigenvalues of the covariance matrix) is high. Hence, the inverted covariance matrix yields an unstable estimator of the precision matrix. To circumvent this issue, one can estimate the precision matrix directly, rather than inverting the covariance matrix.

Graphical models have been shown to provide consistent estimates of the precision matrix (Cai et al. (2011); Friedman et al. (2008); Meinshausen and Bühlmann (2006)). Goto and Xu (2015) estimated a sparse precision matrix for portfolio hedging using graphical models. They found that their portfolio achieves significant out-of-sample risk reduction and higher return, as compared to portfolios based on equal weights, a shrunk covariance matrix, industry factor models, and no-short-sale constraints. Awoye (2016) used Graphical Lasso (Friedman et al. (2008)) to estimate a sparse covariance matrix for the Markowitz mean-variance portfolio problem, improving covariance estimation in terms of lower realized portfolio risk. Millington and Niranjan (2017) conducted an empirical study that applies Graphical Lasso to the estimation of the covariance matrix for portfolio allocation.
Their empirical findings suggest that portfolios that use Graphical Lasso for covariance estimation enjoy lower risk and higher returns compared to the empirical covariance matrix. They show that the results are robust to missing observations. Millington and Niranjan (2017) also construct a financial network using the estimated precision matrix to explore the relationships between the companies and show how the constructed network helps to make investment decisions. Callot et al. (2019) use the nodewise-regression method of Meinshausen and Bühlmann (2006) to establish consistency of the estimated variance, weights and risk of a high-dimensional financial portfolio. Their empirical application demonstrates that the precision matrix estimator based on nodewise regression outperforms the principal orthogonal complement thresholding estimator (POET) (Fan et al. (2013)) and linear shrinkage (Ledoit and Wolf (2004)). Cai et al. (2020) use constrained ℓ1-minimization for inverse matrix estimation (CLIME) of the precision matrix (Cai et al. (2011)) to develop a consistent estimator of the minimum variance of a high-dimensional global minimum-variance portfolio. It is important to note that all the aforementioned methods impose some sparsity assumption on the precision matrix of excess returns.

An alternative strategy to handle the high-dimensional setting uses factor models to acknowledge common variation in the stock prices, which has been documented in many empirical studies (see Campbell et al. (1997) among many others). A common approach decomposes the covariance matrix of excess returns into low-rank and sparse parts; the latter is further regularized since, after the common factors are accounted for, the remaining covariance matrix of the idiosyncratic components is still high-dimensional (Fan et al. (2011, 2013, 2016b, 2018)). This stream of literature, however, focuses on the estimation of a covariance matrix.
The accuracy of precision matrices obtained from inverting the factor-based covariance matrix was investigated by Fan et al. (2016a) and Ait-Sahalia and Xiu (2017), but they did not study a high-dimensional case. Factor models are generally treated as competitors to graphical models: as an example, Callot et al. (2019) find evidence of superior performance of the nodewise-regression estimator of the precision matrix over a factor-based estimator, POET (Fan et al. (2013)), in terms of the out-of-sample Sharpe Ratio and risk of a financial portfolio. The root cause of factor models and graphical models being treated separately is the sparsity assumption on the precision matrix made by the latter. Specifically, as pointed out in Koike (2020), when asset returns have common factors, the precision matrix cannot be sparse, because all pairs of assets are partially correlated, conditional on the other assets, through the common factors.

In this paper we develop a new precision matrix estimator for the excess returns under the approximate factor model that combines the benefits of graphical models and factor structure. We call our algorithm the
Factor Graphical Lasso (FGL). We use a factor model to remove the co-movements induced by the factors, and then we apply the Weighted Graphical Lasso to the estimation of the precision matrix of the idiosyncratic terms. We prove consistency of FGL in the spectral and ℓ1 matrix norms. In addition, we prove consistency of the estimated portfolio weights for three formulations of the optimal portfolio allocation.

Our empirical application uses daily and monthly data for the constituents of the S&P500: we demonstrate that FGL outperforms the equal-weighted portfolio, the index, and portfolios based on other estimators of the precision matrix (CLIME, Cai et al. (2011)) and of the covariance matrix (POET, Fan et al. (2013), and the shrinkage estimator adjusted to allow for the factor structure (Ledoit and Wolf (2004))) in terms of the out-of-sample Sharpe Ratio. Furthermore, we find strong empirical evidence that relaxing the constraint that portfolio weights sum up to one leads to a large increase in the out-of-sample Sharpe Ratio, which, to the best of our knowledge, has not been previously well-studied in the empirical finance literature.

From the theoretical perspective, our paper makes several important contributions to the existing literature on graphical models and factor models. First, to the best of our knowledge, there are no equivalent theoretical results that establish consistency of the portfolio weights in a high-dimensional setting without assuming sparsity of the covariance or precision matrix of stock returns. Second, we extend the theoretical results of POET (Fan et al. (2013)) to allow the number of factors to grow with the number of assets. Concretely, we establish uniform consistency of the factors and factor loadings estimated using PCA. Third, we are not aware of any other papers that provide convergence results for estimating a high-dimensional precision matrix using the Weighted Graphical Lasso under the approximate factor model with unobserved factors.
Furthermore, all theoretical results established in this paper hold for a wide range of distributions: the sub-Gaussian family (including Gaussian) and the elliptical family. Our simulations demonstrate that FGL is robust to very heavy-tailed distributions, which makes our method suitable for financial applications.

This paper is organized as follows: Section 2 reviews the basics of the Markowitz mean-variance portfolio theory and provides several formulations of the optimal portfolio allocation. Section 3 provides a brief summary of graphical models and introduces the Factor Graphical Lasso. Section 4 contains theoretical results, and Section 5 validates these results using simulations. Section 6 provides the empirical application. Section 7 concludes.

Notation. For the convenience of the reader, we summarize the notation to be used throughout the paper. Let S_p denote the set of all p × p symmetric matrices, S⁺_p the set of all p × p positive semi-definite matrices, and S⁺⁺_p the set of all p × p positive definite matrices. Given a vector u ∈ ℝᵈ and a parameter a ∈ [1, ∞), let ‖u‖_a denote the ℓ_a-norm. Given a matrix U ∈ S_p, let Λ_max(U) ≡ Λ₁(U) ≥ Λ₂(U) ≥ … ≥ Λ_min(U) ≡ Λ_p(U) be the eigenvalues of U, and let eig_K(U) ∈ ℝ^{K×p} denote the first K ≤ p normalized eigenvectors corresponding to Λ₁(U), …, Λ_K(U). Given parameters a, b ∈ [1, ∞), let |||U|||_{a,b} denote the induced matrix-operator norm max_{‖y‖_a = 1} ‖Uy‖_b. The special cases are: |||U|||₁ ≡ max_{1 ≤ j ≤ p} Σ_{i=1}^p |U_{i,j}| for the ℓ₁/ℓ₁-operator norm; the operator norm (ℓ₂-matrix norm) |||U|||₂ ≡ Λ_max^{1/2}(UU′), which is equal to the maximal singular value of U; and |||U|||_∞ ≡ max_{1 ≤ j ≤ p} Σ_{i=1}^p |U_{j,i}| for the ℓ_∞/ℓ_∞-operator norm.
Finally, ‖U‖_max denotes the element-wise maximum max_{i,j} |U_{i,j}|, and |||U|||_F ≡ (Σ_{i,j} u²_{i,j})^{1/2} denotes the Frobenius matrix norm.

The importance of the minimum-variance portfolio introduced by Markowitz (1952) as a risk-management tool has been studied by many researchers. In this section we review the basics of Markowitz mean-variance portfolio theory and provide several formulations of the optimal portfolio allocation.

Suppose we observe p assets (indexed by i) over T periods of time (indexed by t). Let r_t = (r_{1t}, r_{2t}, …, r_{pt})′ ∼ D(m, Σ) be a p × 1 vector of excess returns drawn from a distribution D. The goal of the Markowitz theory is to choose the asset weights in a portfolio optimally. We will study two optimization problems: the well-known Markowitz weight-constrained (MWC) optimization problem, and the Markowitz risk-constrained (MRC) optimization, which relaxes the constraint on the portfolio weights.

The first optimization problem searches for asset weights such that the portfolio achieves a desired expected rate of return with minimum risk, under the restriction that all weights sum up to one. This can be formulated as the following quadratic optimization problem:

min_w w′Σw, s.t. w′ι = 1 and m′w ≥ µ,   (2.1)

where w is a p × 1 vector of portfolio weights, ι is a p × 1 vector of ones, and µ is a desired expected rate of portfolio return. Let Θ ≡ Σ⁻¹ be the precision matrix. If m′w_GMV > µ, then the solution to (2.1) yields the global minimum-variance (GMV) portfolio weights w_GMV:

w_GMV = (ι′Θι)⁻¹ Θι.   (2.2)

If, in addition to the constraint that the weights sum up to unity, short-sales are not allowed, then the combination of portfolio weights forms a convex hull. We do not impose any short-selling constraints in this paper.
If m′w = µ, the solution to (2.1) is the well-known two-fund separation theorem introduced by Tobin (1958):

w_MWC = (1 − a) w_GMV + a w_M,   (2.3)
w_M = (ι′Θm)⁻¹ Θm,   (2.4)
a = [µ(m′Θι)(ι′Θι) − (m′Θι)²] / [(m′Θm)(ι′Θι) − (m′Θι)²],   (2.5)

where w_MWC denotes the portfolio allocation with the constraint that the weights need to sum up to one, and w_M captures all mean-related market information.

The MRC problem has the same objective as in (2.1), but the portfolio weights are not required to sum up to one:

min_w w′Σw, s.t. m′w ≥ µ.   (2.6)

It can be easily shown that the solution to (2.6) is:

w* = µ Θm / (m′Θm).   (2.7)

Alternatively, instead of searching for a portfolio with a specified desired expected rate of return, one can maximize the expected portfolio return given a maximum risk-tolerance level:

max_w w′m, s.t. w′Σw ≤ σ².   (2.8)

In this case, the solution to (2.8) yields:

w* = (σ²/(w′m)) Θm = (σ²/µ) Θm.   (2.9)

To get the second equality in (2.9) we use the definition of µ from (2.1) and (2.6). It follows that if µ = σ√θ, where θ ≡ m′Θm is the squared Sharpe Ratio of the portfolio, then the solutions to (2.6) and (2.8) admit the following expression:

w_MRC = (σ/√(m′Θm)) Θm = (σ/√θ) α,   (2.10)

where α ≡ Θm. Equation (2.10) tells us that once an investor specifies the desired return, µ, and the maximum risk-tolerance level, σ², this pins down the Sharpe Ratio of the portfolio, which makes the optimization problems of minimizing risk in (2.6) and maximizing the expected return of the portfolio in (2.8) identical.

This brings us to three alternative portfolio allocations commonly used in the existing literature: the Global Minimum-Variance portfolio in (2.2), the Markowitz Weight-Constrained portfolio in (2.3), and the Markowitz Risk-Constrained portfolio in (2.10). It is clear that all formulations require an estimate of the precision matrix Θ.
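As a concrete illustration, the three closed-form solutions above take only a few lines of linear algebra once Θ is available. The sketch below uses simulated, hypothetical inputs (m, Σ, µ, σ are placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5

# Toy inputs (hypothetical): mean vector m and a well-conditioned covariance Sigma.
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)
m = rng.normal(loc=0.05, scale=0.02, size=p)

Theta = np.linalg.inv(Sigma)        # precision matrix
iota = np.ones(p)

# (2.2) Global minimum-variance (GMV) weights: sum to one by construction.
w_gmv = Theta @ iota / (iota @ Theta @ iota)

# (2.3)-(2.5) Markowitz weight-constrained (MWC) weights for target return mu.
mu = 0.05
w_m = Theta @ m / (iota @ Theta @ m)
mTi = m @ Theta @ iota
a = (mu * mTi * (iota @ Theta @ iota) - mTi ** 2) / (
    (m @ Theta @ m) * (iota @ Theta @ iota) - mTi ** 2)
w_mwc = (1 - a) * w_gmv + a * w_m

# (2.10) Markowitz risk-constrained (MRC) weights for risk tolerance sigma.
sigma = 0.02
theta_sq = m @ Theta @ m            # squared Sharpe Ratio
w_mrc = sigma / np.sqrt(theta_sq) * (Theta @ m)

print(w_gmv.sum(), w_mwc.sum())     # both sum to one
print(w_mrc @ Sigma @ w_mrc)        # MRC portfolio variance equals sigma**2
```

Note that the MWC weights hit the target return exactly (m′w_MWC = µ), while the MRC weights hit the risk budget exactly (w′Σw = σ²), which matches the equivalence between (2.6) and (2.8) discussed above.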
In this paper we develop a novel method for estimating the precision matrix for the above-mentioned financial portfolios which accounts for the fact that the returns follow an approximate factor structure. The next section reviews graphical methods for estimating the precision matrix and introduces the Factor Graphical Lasso for constructing financial portfolios.
In this section we first provide a brief review of the terminology used in the literature on graphical models and of the approaches to estimate a precision matrix. After that, we propose an estimator of the precision matrix which accounts for the common factors in the excess returns.

The review of Gaussian graphical models below is based on Hastie et al. (2001) and Bishop (2006). A graph consists of a set of vertices (nodes) and a set of edges (arcs) that join some pairs of the vertices. In graphical models, each vertex represents a random variable, and the graph visualizes the joint distribution of the entire set of random variables. The edges in a graph are parameterized by potentials (values) that encode the strength of the conditional dependence between the random variables at the corresponding vertices.
Sparse graphs have a relatively small number of edges. The main challenges in working with graphical models are choosing the structure of the graph (model selection) and estimating the edge parameters from the data.
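Both challenges, selecting the graph and estimating its parameters, are commonly addressed with ℓ1-penalized likelihood. A sketch (an assumed toy setup, not the paper's code) simulates a Gaussian chain graph, in which only adjacent vertices share an edge, and recovers its edge set with scikit-learn's GraphicalLasso:

```python
# In a Gaussian graphical model, zeros of the precision matrix Theta = inv(Sigma)
# correspond to missing edges: non-adjacent variables in a chain 1-2-...-p are
# conditionally independent even though they are marginally correlated.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p, T = 10, 5000

# True precision: tridiagonal -> chain graph (positive definite by construction).
Theta_true = np.eye(p) + 0.45 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma_true = np.linalg.inv(Theta_true)
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=T)

gl = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = gl.precision_

edges_true = np.abs(Theta_true) > 1e-8
edges_hat = np.abs(Theta_hat) > 1e-2       # threshold tiny numerical values
print((edges_true == edges_hat).mean())    # fraction of correctly recovered entries
```

The penalty level `alpha = 0.05` is an arbitrary illustrative choice; in practice it is tuned, e.g., by cross-validation.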
Define x_t to be a p × 1 random vector, t = 1, …, T. Let x_t ∼ D(m, Σ), where D belongs to either the sub-Gaussian or the elliptical family. When D = N, the precision matrix Σ⁻¹ ≡ Θ contains information about the conditional dependence between the variables. For instance, if Θ_ij, the ij-th element of the precision matrix, is zero, then the variables i and j are conditionally independent given the other variables.

Given a sample {x_t}, t = 1, …, T, let S = (1/T) Σ_{t=1}^T (x_t − x̄)(x_t − x̄)′ denote the sample covariance matrix. We can write down the Gaussian log-likelihood (up to constants): l(Θ) = log det(Θ) − trace(SΘ). The maximum likelihood (ML) estimate of Θ is Θ̂ = S⁻¹. In high-dimensional settings it is necessary to regularize the precision matrix, which means that some edges will be set to zero.

One approach to induce sparsity in the estimation of the precision matrix is to add a penalty to the maximum likelihood and use the connection between the precision matrix and regression coefficients. Let D̂ ≡ diag(S). Janková and van de Geer (2018) propose to use the weighted Graphical Lasso, which maximizes the following weighted penalized log-likelihood:

Θ̂ = argmin_{Θ ∈ S⁺⁺_p} trace(SΘ) − log det(Θ) + λ Σ_{i≠j} √(D̂_ii D̂_jj) |Θ_ij|,   (3.1)

over symmetric positive definite matrices, where λ ≥ 0 is the regularization parameter. When λ = 0, the MLEs of Σ and Θ in (3.1) are the sample covariance matrix S and its inverse S⁻¹, respectively. When λ > 0, the solution to (3.1) yields the penalized MLEs of the covariance and precision matrices, denoted as Σ̂ and Θ̂ = Σ̂⁻¹. Ravikumar et al. (2011) showed that solving min_{Θ ∈ S⁺⁺_p} trace(ŜΘ) − log det(Θ) + Σ_{i=1}^p Σ_{j=1}^p p_λ(|Θ_ij|), where p_λ(·) is a generic penalty function, corresponds to minimizing a penalized log-determinant Bregman divergence.

One of the most popular and fast algorithms to solve the optimization problem in (3.1) is called the Graphical Lasso (GLasso); it was introduced by Friedman et al. (2008). Graphical Lasso combines the neighborhood method of Meinshausen and Bühlmann (2006) and the block-coordinate descent of Banerjee et al. (2008). A brief summary of the procedure to estimate the precision matrix using GLasso is presented in Algorithm 1.

Algorithm 1
Graphical Lasso (Friedman et al. (2008))

Let W be the estimate of Σ. Initialize W = S + λI. The diagonal of W remains the same in what follows. Repeat for j = 1, …, p, 1, …, p, … until convergence:
• Partition W into part 1, all but the j-th row and column, and part 2, the j-th row and column.
• Solve the score equations using cyclical coordinate descent: W₁₁β − s₁₂ + λ · Sign(β) = 0. This gives a (p − 1) × 1 vector β̂.
• Update ŵ₁₂ = W₁₁β̂.
In the final cycle (for j = 1, …, p) solve for θ̂₂₂ = (w₂₂ − β̂′ŵ₁₂)⁻¹ and θ̂₁₂ = −θ̂₂₂β̂.

3.2 Factor Graphical Lasso

The arbitrage pricing theory (APT), developed by Ross (1976), postulates that the expected returns on securities should be related only to their covariance with the common components, or factors. The goal of the APT is to model the tendency of asset returns to move together via a factor decomposition. Let r_t = (r_{1t}, r_{2t}, …, r_{pt})′ ∼ D(m, Σ) be a p × 1 vector of excess returns drawn from a distribution D, where m is the unconditional mean of the returns. Assume that the return generating process (r_t) follows a K-factor model:

r_t = B f_t + ε_t,  t = 1, …, T,   (3.2)

where r_t is p × 1, f_t = (f_{1t}, …, f_{Kt})′ is a K × 1 vector of factors, B is a p × K matrix of factor loadings, and ε_t is the idiosyncratic component that cannot be explained by the common factors. Factors in (3.2) can be either observable, such as in Fama and French (1993, 2015), or estimated using statistical factor models. Unobservable factors and loadings are usually estimated by principal component analysis (PCA), as studied in Bai (2003); Bai and Ng (2002); Connor and Korajczyk (1988); Stock and Watson (2002).
Strict factor structure assumes that the idiosyncratic disturbances, ε_t, are uncorrelated with each other, whereas approximate factor structure allows correlation of the idiosyncratic disturbances (see Bai (2003); Chamberlain and Rothschild (1983) among others).

In this subsection we examine how to solve the Markowitz mean-variance portfolio allocation problems using the factor structure in the returns. We also develop the Factor Graphical Lasso, which uses the estimated common factors to obtain a sparse precision matrix of the idiosyncratic component. The resulting estimator is used to obtain the precision matrix of the asset returns necessary to form portfolio weights. In this paper our main interest lies in establishing asymptotic properties of the precision matrix and portfolio weights for the high-dimensional case. We assume that the number of common factors, K = K_{p,T}, can grow: K → ∞ as p → ∞, or T → ∞, or both p, T → ∞, but we require that max{K/p, K/T} → 0 as p, T → ∞.

Our setup is similar to the one studied in Fan et al. (2013): we consider a spiked covariance model in which the first K principal eigenvalues of Σ grow with p, while the remaining p − K eigenvalues are bounded and grow slower than p.

Rewrite equation (3.2) in matrix form:

R = B F + E,   (3.3)

where R is p × T, B is p × K, F is K × T, and E is p × T. Recall that the factors and loadings in (3.3) are estimated by solving the following minimization problem: (B̂, F̂) = argmin_{B,F} ‖R − BF‖²_F s.t. (1/T) FF′ = I_K and B′B diagonal. The constraints are needed to identify the factors (Fan et al. (2018)). It was shown (Stock and Watson (2002)) that F̂ = √T eig_K(R′R) and B̂ = T⁻¹RF̂′. Given F̂ and B̂, define Ê = R − B̂F̂.

Having introduced the generating process for stock returns, we move to the portfolio construction exercise.
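The PCA estimator of the factors and loadings can be sketched directly from its closed form, F̂ = √T eig_K(R′R) and B̂ = T⁻¹RF̂′. The simulation below uses hypothetical dimensions and a toy factor model:

```python
# Sketch of PCA estimation of latent factors under the normalization (1/T) F F' = I_K.
import numpy as np

rng = np.random.default_rng(2)
p, T, K = 50, 200, 3

# Simulate a K-factor model R = B F + E (all quantities hypothetical).
B = rng.normal(size=(p, K))
F = rng.normal(size=(K, T))
E = 0.5 * rng.normal(size=(p, T))
R = B @ F + E

# Eigenvectors of the T x T matrix R'R; eigh returns ascending order, so reverse.
vals, vecs = np.linalg.eigh(R.T @ R)
F_hat = np.sqrt(T) * vecs[:, ::-1][:, :K].T    # K x T: rows are sqrt(T) * eigenvectors
B_hat = R @ F_hat.T / T                        # p x K loadings
E_hat = R - B_hat @ F_hat                      # estimated idiosyncratic component

# The normalization (1/T) F_hat F_hat' = I_K holds by construction.
print(np.round(F_hat @ F_hat.T / T, 8))
```

Note that B̂F̂ is the best rank-K approximation of R, so the common component BF is recovered well even though B and F are only identified up to a rotation H.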
Since our interest is in constructing portfolio weights, our goal is to estimate the precision matrix of the excess returns. However, as pointed out by Koike (2020), when common factors are present across the excess returns, the precision matrix cannot be sparse, because all pairs of the returns are partially correlated, given the other excess returns, through the common factors. Therefore, we impose a sparsity assumption on the precision matrix of the idiosyncratic errors, Θ_ε, which is obtained using the estimated residuals after removing the co-movements induced by the factors (see Barigozzi et al. (2018); Brownlees et al. (2018); Koike (2020)).

We use the weighted Graphical Lasso as a shrinkage technique to estimate the precision matrix Θ_ε of the idiosyncratic errors. Once the precision matrix Θ_f of the low-rank component is also obtained, similarly to Fan et al. (2011), we use the Sherman-Morrison-Woodbury formula to estimate the precision matrix of the excess returns:

Θ = Θ_ε − Θ_ε B [Θ_f + B′Θ_ε B]⁻¹ B′Θ_ε.   (3.4)

To obtain Θ̂_f = Σ̂_f⁻¹, we use the inverse of the sample covariance matrix of the estimated factors, Σ̂_f = T⁻¹F̂F̂′. To get Θ̂_ε, we use the weighted GLasso of Algorithm 1, with the initial estimate of the covariance matrix of the idiosyncratic errors calculated as Σ̂_ε = T⁻¹ÊÊ′. Once we have estimated Θ̂_f and Θ̂_ε, we can get Θ̂ using the sample analogue of (3.4).

We call the proposed procedure the Factor Graphical Lasso and summarize it in Algorithm 2.
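Before turning to the algorithm, the identity (3.4) can be verified numerically on toy inputs; the point is that it replaces the inversion of the p × p matrix Σ with a K × K inversion. The matrices below are hypothetical placeholders:

```python
# Numerical check of the Sherman-Morrison-Woodbury identity (3.4):
# inv(B Sigma_f B' + Sigma_eps) equals the combination of Theta_f and Theta_eps.
import numpy as np

rng = np.random.default_rng(3)
p, K = 30, 3

B = rng.normal(size=(p, K))                    # factor loadings
Sigma_f = np.diag(rng.uniform(1.0, 2.0, K))    # factor covariance (K x K)
Sigma_eps = np.diag(rng.uniform(0.5, 1.5, p))  # sparse idiosyncratic covariance
Sigma = B @ Sigma_f @ B.T + Sigma_eps          # covariance of returns (p x p)

Theta_f = np.linalg.inv(Sigma_f)
Theta_eps = np.linalg.inv(Sigma_eps)

# Equation (3.4): only a K x K matrix is inverted.
middle = np.linalg.inv(Theta_f + B.T @ Theta_eps @ B)
Theta = Theta_eps - Theta_eps @ B @ middle @ B.T @ Theta_eps

print(np.allclose(Theta, np.linalg.inv(Sigma)))  # True
```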
Algorithm 2
Factor Graphical Lasso (FGL)

Estimate the residuals ε̂_t = r_t − B̂f̂_t using PCA. Get Σ̂_ε = (1/T) Σ_{t=1}^T (ε̂_t − ε̄)(ε̂_t − ε̄)′.
Estimate a sparse Θ̂_ε using the weighted Graphical Lasso: initialize Algorithm 1 with W = Σ̂_ε + λI.
Estimate Θ̂ using the Sherman-Morrison-Woodbury formula in (3.4).

Now we can use Θ̂, obtained from (3.4) using Algorithm 2, to estimate the portfolio weights in (2.2), (2.3) and (2.10).

Remark 1.
In practice, the number of common factors, K, is unknown and needs to be estimated. One of the standard and commonly used approaches is to determine K in a data-driven way (Bai and Ng (2002); Kapetanios (2010)). As an example, Fan et al. (2013) adopt the approach from Bai and Ng (2002). However, all of the aforementioned papers deal with a fixed number of factors. Therefore, we need to adopt a different criterion, since K is allowed to grow in our setup. For this reason, we use the methodology of Li et al. (2017): following their notation, let

V(K) = min_{B_K, F_K} (1/(pT)) Σ_{i=1}^p Σ_{t=1}^T (r_it − √K b′_{i,K} f_{t,K})²,   (3.5)

where the minimum is taken over 1 ≤ K ≤ K_max, subject to the normalization B′_K B_K / p = I_K. Hence, F̄′_K = √K R′B̄_K / p. Define F̃′_K = F̄′_K (F̄_K F̄′_K / T)^{−1/2}, which is a rescaled estimator of the factors that is used to determine the number of factors when K grows with the sample size. We then apply the following procedure described in Li et al. (2017) to estimate K:

K̂ = argmin_{1 ≤ K ≤ K_max} ln(V(K, F̃_K)) + K g(p, T),   (3.6)

where 1 ≤ K ≤ K_max = o(min{p^{1/2}, T^{1/2}}) and g(p, T) is a penalty function of (p, T) such that (i) K_max · g(p, T) → 0 and (ii) C⁻²_{p,T,K_max} · g(p, T) → ∞ with C_{p,T,K_max} = O_p(max{K_max/√p, K_max^{3/2}/√T}). The choice of the penalty functions is similar to Bai and Ng (2002). Throughout the paper we let K̂ be the solution to (3.6).

In this section we establish consistency of the Factor Graphical Lasso in Algorithm 2. After that, we study consistency of the estimators of the weights in (2.2), (2.3) and (2.10) and the implications for the out-of-sample Sharpe Ratio.

Let A ∈ S_p. Define the following set for j = 1, …
, p:

D_j(A) ≡ {i : A_ij ≠ 0, i ≠ j},  d_j(A) ≡ card(D_j(A)),  d(A) ≡ max_{j=1,…,p} d_j(A),   (4.1)

where d_j(A) is the number of edges adjacent to the vertex j (i.e., the degree of vertex j), and d(A) measures the maximum vertex degree. Define S(A) ≡ ∪_{j=1}^p D_j(A) to be the overall off-diagonal sparsity pattern, and s(A) ≡ Σ_{j=1}^p d_j(A) the overall number of edges contained in the graph. Note that card(S(A)) ≤ s(A); when s(A) = p(p − 1)/2, the graph is fully connected.

We now list the assumptions on the model (3.2):

(A.1) (Spiked covariance model) As p → ∞, Λ₁ > Λ₂ > … > Λ_K ≫ Λ_{K+1} ≥ … ≥ Λ_p ≥ 0, where Λ_j = O(p) for j ≤ K, while the non-spiked eigenvalues are bounded, Λ_j = o(p) for j > K.

(A.2) (Pervasive factors) There exists a positive definite K × K matrix B̆ such that |||p⁻¹B′B − B̆|||₂ → 0 and Λ_min(B̆)⁻¹ = O(1) as p → ∞.

(A.3) (a) {ε_t, f_t}, t ≥ 1, is strictly stationary. Also, E[ε_it] = E[ε_it f_jt] = 0 for all i ≤ p, j ≤ K and t ≤ T.
(b) There are constants c₁, c₂ > 0 such that Λ_min(Σ_ε) > c₁, |||Σ_ε|||₁ < c₂ and min_{i≤p, j≤p} var(ε_it ε_jt) > c₁.
(c) There are r₁, r₂ > 0 and b₁, b₂ > 0 such that for any s > 0, i ≤ p, j ≤ K, Pr(|ε_it| > s) ≤ exp{−(s/b₁)^{r₁}} and Pr(|f_jt| > s) ≤ exp{−(s/b₂)^{r₂}}.

We also impose a strong mixing condition. Let F⁰_{−∞} and F^∞_T denote the σ-algebras generated by {(f_t, ε_t) : t ≤ 0} and {(f_t, ε_t) : t ≥ T}, respectively. Define the mixing coefficient

α(T) = sup_{A ∈ F⁰_{−∞}, B ∈ F^∞_T} |Pr(A) Pr(B) − Pr(AB)|.   (4.2)

(A.4) (Strong mixing) There exists r₃ > 0 such that 3r₁⁻¹ + 1.5r₂⁻¹ + r₃⁻¹ > 1, and C > 0 such that, for all T ∈ Z⁺, α(T) ≤ exp(−CT^{r₃}).

(A.5) (Regularity conditions) There exists M > 0 such that for all i ≤ p, t ≤ T and s ≤ T:
(a) ‖b_i‖_max < M;
(b) E[p^{−1/2}{ε′_s ε_t − E[ε′_s ε_t]}]⁴ < M; and
(c) E‖p^{−1/2} Σ_{i=1}^p b_i ε_it‖⁴ < K²M.

Some comments regarding the aforementioned assumptions are in order. Assumptions (A.1)-(A.4) are the same as in Fan et al. (2013), and Assumption (A.5) is modified to account for the increasing number of factors. Assumption (A.1) divides the eigenvalues into diverging and bounded ones. Without loss of generality, we assume that the K largest eigenvalues have multiplicity 1. The assumption of a spiked covariance model is common in the literature on approximate factor models. However, we note that the model studied in this paper can be characterized as a "very spiked model": the gap between the first K eigenvalues and the rest is increasing with p. As pointed out by Fan et al. (2018), (A.1) is typically satisfied by a factor model with pervasive factors, which brings us to Assumption (A.2): the factors impact a non-vanishing proportion of individual time series. Assumption (A.3)(a) is slightly stronger than in Bai (2003), since it requires strict stationarity and non-correlation between {ε_t} and {f_t} to simplify technical calculations. In (A.3)(b) we require |||Σ_ε|||₁ < c₂ instead of λ_max(Σ_ε) = O(1) in order to estimate K consistently. When K is known, as in Fan et al. (2011); Koike (2020), this condition can be relaxed. (A.3)(c) requires exponential-type tails to apply the large deviation theory to (1/T) Σ_{t=1}^T ε_it ε_jt − σ_{ε,ij} and (1/T) Σ_{t=1}^T f_jt ε_it. However, at the end of Section 4 we discuss the extension of our results to the setting with the elliptical distribution family, which is more appropriate for financial applications.
Specifically, we discuss the appropriate modifications to the initial estimator of the covariance matrix of returns such that the bounds derived in this paper continue to hold. (A.4)-(A.5) are technical conditions which are needed to consistently estimate the common factors and loadings. Conditions (A.5)(a-b) are weaker than those in Bai (2003), since our goal is to estimate a precision matrix, and (A.5)(c) differs from Bai (2003) and Bai and Ng (2006) in that the number of factors is assumed to grow slowly with p.

In addition, the following structural assumptions on the model are imposed:

(B.1) ‖Σ‖_max = O(1) and ‖B‖_max = O(1).
(B.2) s(Θ_ε) = O_p(s_T) for some sequence s_T ∈ (0, ∞), T = 1, 2, …
(B.3) d(Θ_ε) = O_p(d_T) for some sequence d_T ∈ (0, ∞), T = 1, 2, …

(B.1) is a natural structural assumption on the population quantities, and (B.2)-(B.3) are sparsity assumptions on the precision matrix of the residual process. Specifically, (B.2) states that the sparsity of Θ_ε is controlled by the deterministic sequence s_T; we will impose restrictions on the growth rate of s_T. (B.3) is another sparsity assumption on Θ_ε: it is weaker than (B.2), since it is always satisfied when s_T = d_T; however, d_T can generally be smaller than s_T. Note that, in contrast to Fan et al. (2013), we do not impose sparsity on the covariance matrix of the idiosyncratic component; instead, it is more realistic to impose conditional sparsity on the precision matrix after the common factors are accounted for.

Recall the definition of the Weighted Graphical Lasso estimator in (3.1) for the precision matrix of the idiosyncratic components:

Θ̂_ε = argmin_{Θ_ε ∈ S⁺⁺_p} trace(Σ̂_ε Θ_ε) − log det(Θ_ε) + λ Σ_{i≠j} √(D̂_ε,ii D̂_ε,jj) |Θ_ε,ij|.   (4.3)

Also, recall that to estimate Θ we used equation (3.4).
Therefore, in order to obtain the FGL estimator $\widehat{\Theta}$, we take the following steps: (1) estimate the unknown factors and factor loadings to get an estimator of $\Sigma_\varepsilon$; (2) use $\widehat{\Sigma}_\varepsilon$ to get an estimator of $\Theta_\varepsilon$ in (4.3); (3) use $\widehat{\Theta}_\varepsilon$ together with the estimators of the factors and factor loadings from Step 1 to obtain the final precision matrix estimator $\widehat{\Theta}$. Subsection 4.3 examines the theoretical foundations of the first step, and Subsection 4.4 is devoted to Steps 2 and 3.

As pointed out in Bai (2003) and Fan et al. (2013), the $K \times 1$ vectors $\{b_i\}_{i=1}^p$, which are the rows of the factor loadings matrix $B$, and the $K \times 1$ vectors $\{f_t\}_{t=1}^T$, which are the columns of $F$, are not separately identifiable. Concretely, for any $K \times K$ matrix $H$ such that $H'H = \mathbf{I}_K$, $Bf_t = BH'Hf_t$; therefore, we cannot identify the tuple $(B, f_t)$ from $(BH', Hf_t)$. Let $\widehat{K} \in \{1, \ldots, K_{\max}\}$ denote the estimated number of factors, where $K_{\max}$ is allowed to increase at a slower speed than $\min\{p, T\}$ such that $K_{\max} = o(\min\{p^{1/2}, T\})$ (see Li et al. (2017) for a discussion of the rate).

Define $V$ to be a $\widehat{K} \times \widehat{K}$ diagonal matrix of the first $\widehat{K}$ largest eigenvalues of the sample covariance matrix in decreasing order. Further, define a $\widehat{K} \times \widehat{K}$ matrix $H = (1/T)V^{-1}\widehat{F}'FB'B$. For $t \le T$, $Hf_t = T^{-1}V^{-1}\widehat{F}'(Bf_1, \ldots, Bf_T)'Bf_t$, which depends only on the data $V^{-1}\widehat{F}'$ and an identifiable part of the parameters, $\{Bf_t\}_{t=1}^T$. Hence, $Hf_t$ does not have an identifiability problem regardless of the imposed identifiability condition.

Let $\gamma^{-1} = 3r_1^{-1} + 1.5r_2^{-1} + r_3^{-1} + 1$. The following theorem is an extension of the results in Fan et al. (2013) to the case when the number of factors is unknown and allowed to grow.

Theorem 1.
Suppose that $K_{\max} = o(\min\{p^{1/2}, T\})$, $K\log(p) = o(T^{\gamma/6})$, $KT = o(p^2)$ and Assumptions (A.1)-(A.5) hold. Let $\omega_{1T} \equiv K^{1/2}\sqrt{\log p/T} + K/\sqrt{p}$ and $\omega_{2T} \equiv K/\sqrt{T} + KT^{1/4}/\sqrt{p}$. Then
$$\max_{i \le p}\big\|\widehat{b}_i - Hb_i\big\| = O_p(\omega_{1T}) \quad \text{and} \quad \max_{t \le T}\big\|\widehat{f}_t - Hf_t\big\| = O_p(\omega_{2T}).$$

The conditions $K\log(p) = o(T^{\gamma/6})$ and $KT = o(p^2)$ are similar to Fan et al. (2013); the difference arises because we do not fix $K$, and hence, in addition to the factor loadings, there are $KT$ factor values to estimate. Therefore, the number of parameters introduced by the unknown growing factors should not be "too large", so that we can estimate them consistently and uniformly. The growth rate of the number of factors is controlled by $K_{\max} = o(\min\{p^{1/2}, T\})$.

The bounds derived in Theorem 1 help us establish the convergence properties of the estimated idiosyncratic covariance matrix $\widehat{\Sigma}_\varepsilon$ and precision matrix $\widehat{\Theta}_\varepsilon$, which are presented in the next theorem:

Theorem 2.
Let $\omega_T \equiv K\sqrt{\log p/T} + K^{3/2}/\sqrt{p}$. Under the assumptions of Theorem 1, the estimator $\widehat{\Sigma}_\varepsilon$ obtained by estimating the factor model in (3.3) satisfies $\big\|\widehat{\Sigma}_\varepsilon - \Sigma_\varepsilon\big\|_{\max} = O_p(\omega_T)$. Assume, in addition, (B.1)-(B.2). Let $\lambda_T$ be a sequence of positive-valued random variables such that $\lambda_T^{-1}\omega_T \xrightarrow{p} 0$. If $s_T\lambda_T \xrightarrow{p} 0$, then $\lambda_T^{-1}\vert\vert\vert\widehat{\Theta}_\varepsilon - \Theta_\varepsilon\vert\vert\vert_l = O_p(s_T)$ as $T \to \infty$, for any $l \in [1, \infty]$.

Note that the term containing $K^{3/2}/\sqrt{p}$ arises due to the need to estimate the unknown factors: Fan et al. (2011) obtained a similar rate for the case when the factors are observable (in their work, $\omega_T = K\sqrt{\log p/T}$). The second part of Theorem 2 is based on the relationship between the convergence rates of the estimated covariance and precision matrices established in Janková and van de Geer (2018) (Theorem 14.1.3). Koike (2020) obtained the convergence rate when factors are observable: the rate obtained in our paper is slower because the factors need to be estimated (concretely, under observable factors it would suffice that $\lambda_T^{-1}\sqrt{K\log p/T} \xrightarrow{p} 0$, yielding the rate $O_p\big(d(\Theta_\varepsilon)\sqrt{K\log p/T}\big)$, which can be faster than the rate obtained in Theorem 2 if $d(\Theta_\varepsilon) < s_T$). Using penalized nodewise regression could help achieve this faster rate. However, our empirical application to monthly stock returns demonstrated superior performance of the Weighted Graphical Lasso compared to the nodewise regression in terms of the out-of-sample Sharpe Ratio and portfolio risk. Hence, in order not to divert the focus of this paper, we leave the theoretical properties of the nodewise regression for future research.
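For concreteness, the three-step procedure described above can be sketched in a few lines of numpy. Step 1 uses PCA; as a simplification that is only sensible when $p$ is small relative to $T$, a plain regularized inverse stands in for the Weighted Graphical Lasso of Step 2; Step 3 applies the Sherman-Morrison-Woodbury identity, assuming the factors are normalized to have identity covariance. Function names and the ridge constant are ours.

```python
import numpy as np

def fgl_pipeline_sketch(R, K):
    """Three-step FGL-style sketch for a T x p return matrix R with K factors.

    Step 2 uses a ridge-regularized inverse instead of the Weighted Graphical
    Lasso, so this only illustrates the pipeline, not FGL itself.
    """
    T, p = R.shape
    S = np.cov(R, rowvar=False, bias=True)          # sample covariance

    # Step 1: PCA estimates of loadings/factors and of Sigma_eps
    w, V = np.linalg.eigh(S)
    idx = np.argsort(w)[::-1][:K]
    B_hat = V[:, idx] * np.sqrt(w[idx])             # p x K loadings
    F_hat = R @ V[:, idx] / np.sqrt(w[idx])         # T x K factors, cov ~ I_K
    eps = R - F_hat @ B_hat.T                       # factor-adjusted returns
    Sigma_eps = np.cov(eps, rowvar=False, bias=True)

    # Step 2: precision of the idiosyncratic part (glasso in the paper)
    Theta_eps = np.linalg.inv(Sigma_eps + 1e-6 * np.eye(p))

    # Step 3: Woodbury identity combining Theta_eps with the loadings
    M = np.linalg.inv(np.eye(K) + B_hat.T @ Theta_eps @ B_hat)
    Theta = Theta_eps - Theta_eps @ B_hat @ M @ B_hat.T @ Theta_eps
    return Theta
```

Because $\widehat{B}\widehat{B}' + \widehat{\Sigma}_\varepsilon$ reproduces the sample covariance exactly under this PCA split, the sketch returns (up to the small ridge) its inverse; the point of FGL is that replacing Step 2 with the penalized estimator keeps the procedure well-behaved when $p$ is large.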
Having established the convergence properties of $\widehat{\Sigma}_\varepsilon$ and $\widehat{\Theta}_\varepsilon$, we now move to the estimation of the precision matrix of the factor-adjusted returns in equation (3.4).

Theorem 3.
Under the assumptions of Theorem 2, we additionally assume (B.3). If $d_T s_T \lambda_T \xrightarrow{p} 0$, then
$$\lambda_T^{-1}\vert\vert\vert\widehat{\Theta} - \Theta\vert\vert\vert_2 = O_p\big(s_T + 1/(p\sqrt{K})\big) \quad \text{and} \quad \lambda_T^{-1}\vert\vert\vert\widehat{\Theta} - \Theta\vert\vert\vert_1 = O_p\big(d_T K(s_T + 1/p)\big).$$

Note that since, by construction, the precision matrix obtained using the Factor Graphical Lasso is symmetric, the bound on $\vert\vert\vert\widehat{\Theta} - \Theta\vert\vert\vert_\infty$ can be trivially obtained from the above theorem. Using Theorem 3, we can establish the properties of the estimated weights of portfolios based on the Factor Graphical Lasso.

Theorem 4.
Under the assumptions of Theorem 3, we additionally assume $\vert\vert\vert\Theta\vert\vert\vert_2 = O(1)$ (this additional requirement essentially imposes $\Lambda_p > 0$ in (A.1)) and $\lambda_T d_T s_T = o(1)$. Algorithm 2 consistently estimates the portfolio weights in (2.2), (2.3) and (2.10):
$$\|\widehat{w}_{GMV} - w_{GMV}\|_1 = O_p\big(\lambda_T d_T K(s_T + 1/p)\big) = o_p(1),$$
$$\|\widehat{w}_{MWC} - w_{MWC}\|_1 = O_p\big(\lambda_T d_T K(s_T + 1/p)\big) = o_p(1), \quad \text{and}$$
$$\|\widehat{w}_{MRC} - w_{MRC}\|_1 = O_p\big(d_T K\,[\lambda_T(s_T + 1/p)]^{1/2}\big) = o_p(1).$$

We now comment on the rates in Theorem 4. First, the rates obtained by Callot et al. (2019) for the GMV and MWC formulations, when no factor structure of stock returns is assumed, require $s(\Theta)\sqrt{\log(p)/T} = o(1)$, where the authors imposed sparsity on the precision matrix of stock returns, $\Theta$. Therefore, if the precision matrix of stock returns is not sparse, portfolio weights can be consistently estimated only if $p$ grows more slowly than $T^{1/2}$ (since $(p-1)\sqrt{\log(p)/T} = o(1)$ is then required to ensure consistent estimation of the portfolio weights). Our result in Theorem 4 improves this rate and shows that as long as $d_T s_T K\sqrt{\log(p)/T} = o(1)$ we can consistently estimate the weights of the financial portfolio. Specifically, when the precision matrix of the factor-adjusted returns is sparse, we can consistently estimate portfolio weights even when $p > T$, without assuming sparsity on $\Sigma$ or $\Theta$. Second, note that the GMV and MWC weights converge slightly more slowly than the MRC weights. This result is further supported by our simulations presented in the next section.

Having examined the properties of the portfolio weights, it is natural to comment on the portfolio risk estimation error. It is determined by the errors in two components: the estimated covariance matrix and the estimated portfolio weights. We focus on the effect of the second component and compare the portfolio risk estimation error for three alternative portfolio formulations.
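For reference, the GMV weights that appear in Theorem 4 have the standard closed form $w_{GMV} = \Theta\mathbf{1}_p/(\mathbf{1}_p'\Theta\mathbf{1}_p)$; the paper's equations (2.2)-(2.10) are not reproduced in this excerpt, so the snippet below shows only this textbook formula, into which any precision matrix estimate can be plugged.

```python
import numpy as np

def gmv_weights(Theta):
    """Global minimum-variance weights: w = Theta 1 / (1' Theta 1),
    the solution of min_w w' Sigma w subject to w'1 = 1, with Theta = inv(Sigma)."""
    ones = np.ones(Theta.shape[0])
    numer = Theta @ ones
    return numer / (ones @ numer)
```

Consistency of $\widehat{\Theta}$ then transfers directly to $\widehat{w}_{GMV}$ through this map, which is the content of the first rate in Theorem 4.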
First, we note that for any estimators of the covariance matrix and portfolio weights, we have:
$$\big|\widehat{w}'\widehat{\Sigma}\widehat{w} - w'\widehat{\Sigma}w\big| \le \|\widehat{w} - w\|_1\,\big\|\widehat{\Sigma}(\widehat{w} + w)\big\|_{\max}. \tag{4.4}$$
The estimation error in portfolio risk is thus bounded by the estimation error in portfolio weights. Hence, combining equation (4.4) and Theorem 4, we conclude that FGL consistently estimates the risk of a financial portfolio. The empirical application in Section 6 reveals that the portfolios constructed using the MRC formulation have higher risk compared with the GMV and MWC alternatives: using monthly and daily returns of the components of the S&P500 index, MRC portfolios exhibit higher out-of-sample risk and return compared to the alternative formulations. Furthermore, the empirical exercise demonstrates that the higher return of MRC portfolios outweighs the higher risk for the monthly data, as evidenced by the increased out-of-sample Sharpe Ratio.

So far, the consistency of the Factor Graphical Lasso in Theorem 4 has relied on the assumption of exponential-type tails in (A.3)(c). Since this tail behavior may be too restrictive for financial portfolios, we comment on the possibility of relaxing it. First, recall where (A.3)(c) was used: we required this assumption in order to establish the convergence of the unknown factors and loadings in Theorem 1, which was further used to obtain the convergence properties of $\widehat{\Sigma}_\varepsilon$ in Theorem 2. Hence, when Assumption (A.3)(c) is relaxed, one needs another way to consistently estimate $\Sigma_\varepsilon$. We achieve this using the tools developed in Fan et al. (2018). Specifically, let $\Sigma = \Gamma_p\Lambda_p\Gamma_p'$, where $\Sigma$ is the covariance matrix of returns that follow the factor structure described in equation (3.2). Define $\widehat{\Sigma}$, $\widehat{\Lambda}_K$ and $\widehat{\Gamma}_K$ to be the estimators of $\Sigma$, $\Lambda_p$ and $\Gamma_p$. We further let $\widehat{\Lambda}_K = \mathrm{diag}(\hat\lambda_1, \ldots, \hat\lambda_K)$ and $\widehat{\Gamma}_K = (\hat v_1, \ldots, \hat v_K)$ be constructed from the first $K$ leading empirical eigenvalues and the corresponding eigenvectors of $\widehat{\Sigma}$, with $\widehat{B}\widehat{B}' = \widehat{\Gamma}_K\widehat{\Lambda}_K\widehat{\Gamma}_K'$. Similarly to Fan et al. (2018), we require the following bounds on the componentwise maximums of the estimators:

(C.1) $\big\|\widehat{\Sigma} - \Sigma\big\|_{\max} = O_p(\sqrt{\log p/T})$,
(C.2) $\big\|(\widehat{\Lambda}_K - \Lambda_p)\Lambda_p^{-1}\big\|_{\max} = O_p(K\sqrt{\log p/T})$,
(C.3) $\big\|\widehat{\Gamma}_K - \Gamma_p\big\|_{\max} = O_p(K^{1/2}\sqrt{\log p/(Tp)})$.

Let $\widehat{\Sigma}^{SG}$ be the sample covariance matrix, with $\widehat{\Lambda}_K^{SG}$ and $\widehat{\Gamma}_K^{SG}$ constructed from the first $K$ leading empirical eigenvalues and eigenvectors of $\widehat{\Sigma}^{SG}$, respectively. Also, let $\widehat{\Sigma}^{EL,1} = \widehat{D}\widehat{R}_1\widehat{D}$, where $\widehat{R}_1$ is obtained using Kendall's tau correlation coefficients and $\widehat{D}$ is a robust estimator of the variances constructed using the Huber loss. Furthermore, let $\widehat{\Sigma}^{EL,2} = \widehat{D}\widehat{R}_2\widehat{D}$, where $\widehat{R}_2$ is obtained using the spatial Kendall's tau estimator. Define $\widehat{\Lambda}_K^{EL}$ to be the matrix of the first $K$ leading empirical eigenvalues of $\widehat{\Sigma}^{EL,1}$, and $\widehat{\Gamma}_K^{EL}$ to be the matrix of the first $K$ leading empirical eigenvectors of $\widehat{\Sigma}^{EL,2}$. For more details on constructing $\widehat{\Sigma}^{SG}$, $\widehat{\Sigma}^{EL,1}$ and $\widehat{\Sigma}^{EL,2}$, see Fan et al. (2018), Sections 3 and 4.

Proposition 1.
For sub-Gaussian distributions, $\widehat{\Sigma}^{SG}$, $\widehat{\Lambda}_K^{SG}$ and $\widehat{\Gamma}_K^{SG}$ satisfy (C.1)-(C.3). For elliptical distributions, $\widehat{\Sigma}^{EL,1}$, $\widehat{\Lambda}_K^{EL}$ and $\widehat{\Gamma}_K^{EL}$ satisfy (C.1)-(C.3). When (C.1)-(C.3) are satisfied, the bounds obtained in Theorems 2-4 continue to hold.

Proposition 1 is essentially a rephrasing of the results obtained in Fan et al. (2018), Sections 3 and 4. The difference arises because we allow $K$ to increase, which is reflected in the modified rates in (C.2)-(C.3). As evidenced by the above proposition, $\widehat{\Sigma}^{EL,2}$ is only used for estimating the eigenvectors. This is necessary because, in contrast with $\widehat{\Sigma}^{EL,2}$, the theoretical properties of the eigenvectors of $\widehat{\Sigma}^{EL,1}$ are mathematically involved because of the sine function. The FGL for elliptical distributions will be called the Robust FGL.

5 Monte Carlo
In order to validate our theoretical results, we perform several simulation studies, which are divided into three parts. The first set of results computes the empirical convergence rates and compares them with the theoretical expressions derived in Theorems 3-4. The second set compares the performance of FGL with several alternative models for estimating the covariance and precision matrix: the linear shrinkage estimator of covariance that incorporates the factor structure through the Sherman-Morrison inversion formula (Ledoit and Wolf (2004), further referred to as LW), POET (Fan et al. (2013)), CLIME (Cai et al. (2011)), and the standard Graphical Lasso without a factor structure. The third set examines the performance of FGL and Robust FGL (described in Subsection 4.6) when the dependent variable follows an elliptical distribution. All three exercises use 100 Monte Carlo simulations.

We first consider the following low-dimensional setup: let $p = T^{\delta}$, $\delta = 0.85$, $K = 2(\log T)^{0.5}$ and $T = [2^h]$ for $h = 7, 7.5, 8, \ldots, 9.5$. A sparse precision matrix of the idiosyncratic components is constructed as follows: we first generate the adjacency matrix using a random graph structure. Define a $p \times p$ adjacency matrix $A_\varepsilon$ which is used to represent the structure of the graph:
$$A_{\varepsilon,ij} = \begin{cases} 1, & \text{for } i \ne j \text{ with probability } q, \\ 0, & \text{otherwise.} \end{cases} \tag{5.1}$$
Let $A_{\varepsilon,ij}$ denote the $(i,j)$-th element of the adjacency matrix $A_\varepsilon$; we set $A_{\varepsilon,ij} = A_{\varepsilon,ji} = 1$ for $i \ne j$ with probability $q$, and 0 otherwise. Such a structure results in $s_T = p(p-1)q/2$; choosing $q = 1/(pT^{0.8})$ makes $s_T = O(T^{0.05})$. The adjacency matrix has all diagonal elements equal to zero. Hence, to obtain a positive definite precision matrix we apply the procedure described in Zhao et al. (2012): using their notation, $\Theta_\varepsilon = A_\varepsilon \cdot v + (|\tau| + 0.1 + u)\mathbf{I}$, where $v > 0$ controls the magnitude of the partial correlations, $u > 0$, and $\tau$ is the smallest eigenvalue of $A_\varepsilon \cdot v$. In our simulations we use $u = 0.1$ and $v = 0.3$; see Zhao et al. (2012) for further details on the generation process. The data generating process is
$$f_t = \phi_f f_{t-1} + \zeta_t, \tag{5.2}$$
$$\underbrace{r_t}_{p \times 1} = B\underbrace{f_t}_{K \times 1} + \varepsilon_t, \quad t = 1, \ldots, T, \tag{5.3}$$
where $\varepsilon_t$ is a $p \times 1$ vector distributed as $\mathcal{N}(\mathbf{0}, \Sigma_\varepsilon)$, with sparse $\Theta_\varepsilon$ that has the random graph structure described above; $f_t$ is a $K \times 1$ vector of factors; $\phi_f$ is an autoregressive parameter in the factors, which is a scalar for simplicity; $B$ is a $p \times K$ matrix of factor loadings; and $\zeta_t$ is a $K \times 1$ vector distributed as $\mathcal{N}(\mathbf{0}, \sigma^2_\zeta\mathbf{I}_K)$. To create $B$ in (5.3) we take the first $K$ rows of an upper triangular matrix from a Cholesky decomposition of the $p \times p$ Toeplitz matrix parameterized by $\rho$. For the first set of results we set $\rho = 0.5$, $\phi_f = 0.2$ and $\sigma_\zeta = 1$. The specification in (5.3) leads to the low-rank plus sparse decomposition of the covariance matrix of the stock returns $r_t$.

To compare the empirical rates with the theoretical expressions derived in Theorems 3-4, we use the facts from Theorem 2 that $\omega_T \equiv K\sqrt{\log p/T} + K^{3/2}/\sqrt{p}$ and $\lambda_T^{-1}\omega_T \xrightarrow{p} 0$, and plot
$$f_{\vert\vert\vert\cdot\vert\vert\vert_2} = C_1 + C_2\log\Big((\delta x)^{0.05}\sqrt{\delta x/2^x} + (\delta x)^{0.05}/(2^x)^{\delta/2}\Big) + 0.05x, \tag{5.4}$$
$$g_{\vert\vert\vert\cdot\vert\vert\vert_1} = C_3 + C_4\log\Big((\delta x)^{0.05}\sqrt{\delta x/2^x} + (\delta x)^{0.05}/(2^x)^{\delta/2}\Big) + 0.05x + 1.5x, \tag{5.5}$$
$$h_{GMV} = C_5 + C_6\log\Big((\delta x)^{0.05}\sqrt{\delta x/2^x} + (\delta x)^{0.05}/(2^x)^{\delta/2}\Big) + 0.05x + 3\log x, \tag{5.6}$$
$$h_{MRC} = C_7 + C_8\log\Big((\delta x)^{0.05}\sqrt{\delta x/2^x} + (\delta x)^{0.05}/(2^x)^{\delta/2}\Big) + 0.05x + 3\log x, \tag{5.7}$$
where $C_1, \ldots, C_8$ are constants with $C_8 < C_6$ (by Theorem 4), and $x = \log_2 T$.

Figure 1 shows the averaged (over Monte Carlo simulations) errors of the estimators of the precision matrix $\Theta$ and of the portfolio weights versus the sample size $T$, on a logarithmic scale (base 2). In order to confirm the theoretical findings from Theorems 3-4, we also plot the theoretical rates of convergence given by the functions in (5.4)-(5.7). Figure 1 verifies that the empirical and theoretical rates match. Since the convergence rates for the GMV and MWC portfolio weights are very similar, we only report the former. Note that, as predicted by Theorem 3, the rate of convergence of the precision matrix in the $\vert\vert\vert\cdot\vert\vert\vert_2$-norm is faster than the rate in the $\vert\vert\vert\cdot\vert\vert\vert_1$-norm. Furthermore, the convergence rates of the GMV, MWC and MRC portfolio weights are close to the rate of the precision matrix $\Theta$ in the $\vert\vert\vert\cdot\vert\vert\vert_1$-norm, which is confirmed by Theorem 4. Finally, as evidenced by Figure 1, the convergence rate of the MRC portfolio is faster than the rate of GMV and MWC. This finding is in accordance with Theorem 4.

As a second exercise, we compare the performance of FGL with the alternative models listed at the beginning of this section. We consider two cases: Case 1 is the same as for the first set of simulations ("low-dimensional"): $p = T^{\delta}$, $\delta = 0.85$, $K = 2(\log T)^{0.5}$, $s_T = O(T^{0.05})$. Case 2 is "high-dimensional", with $p = 3 \cdot T^{\delta}$, $\delta = 0.85$, all else equal. The results for Cases 1 and 2 are reported in Figures 2-3 and Figures 4-5, respectively. As evidenced by the figures, FGL demonstrates superior performance in both cases, exhibiting consistency in both the low-dimensional and high-dimensional settings. The only instance in which FGL is strictly dominated occurs in Figure 2: POET outperforms FGL in terms of the convergence of the precision matrix in the spectral norm. However, this changes for Case 2 in Figure 4.

As a final exercise, we examine the performance of FGL and Robust FGL (described in Subsection 4.6) when the dependent variable follows an elliptical distribution. The data generating process (DGP) is similar to Fan et al. (2018): let $(f_t, \varepsilon_t)$ from (3.2) jointly follow a multivariate t-distribution with $\nu$ degrees of freedom. When $\nu = \infty$, this corresponds to the multivariate normal distribution; smaller values of $\nu$ are associated with thicker tails. We draw $T$ independent samples of $(f_t, \varepsilon_t)$ from the multivariate t-distribution with zero mean and covariance matrix $\Sigma = \mathrm{diag}(\Sigma_f, \Sigma_\varepsilon)$, where $\Sigma_f = \mathbf{I}_K$. To construct $\Sigma_\varepsilon$ we use a Toeplitz structure parameterized by $\rho = 0.5$, which leads to a sparse $\Theta_\varepsilon = \Sigma_\varepsilon^{-1}$. The rows of $B$ are drawn from $\mathcal{N}(\mathbf{0}, \mathbf{I}_K)$. We let $p = T^{0.85}$, $K = 2(\log T)^{0.5}$ and $T = [2^h]$ for $h = 7, 7.5, 8, \ldots, 9.5$. Figures 6-7 report the averaged (over Monte Carlo simulations) estimation errors (on a logarithmic scale, base 2) for $\Theta$ and for two sets of portfolio weights (GMV and MRC) using FGL and Robust FGL for $\nu = 4.2$. Noticeably, the performance of FGL for estimating the precision matrix is comparable with that of Robust FGL: this suggests that our FGL algorithm is robust to heavy-tailed distributions even without additional modifications. Furthermore, FGL outperforms its robust counterpart in terms of estimating the portfolio weights, as evidenced by Figure 7. We further compare the performance of FGL and Robust FGL for different degrees of freedom: Figure 8 reports the log-ratios (base 2) of the averaged (over Monte Carlo simulations) estimation errors for $\nu = 4.2$, $\nu = 7$ and $\nu = \infty$. The results for the estimation of $\Theta$ presented in Figure 8 are consistent with the findings in Fan et al. (2018): Robust FGL outperforms the non-robust counterpart for thicker tails.

6 Empirical Application
In this section we examine the performance of the Factor Graphical Lasso for constructing a financial portfolio using daily and monthly data. We first describe the data and the estimation methodology, then we list four metrics commonly reported in the finance literature, and, finally, we present the results.
We use monthly and daily returns of the constituents of the S&P500 index. The data on historical S&P500 constituents and stock returns are fetched from CRSP and Compustat using the SAS interface. The full sample for the monthly data has 480 observations on 355 stocks from January 1, 1980 to December 1, 2019. We use January 1, 1980 - December 1, 1994 (180 obs) as the training (estimation) period and January 1, 1995 - December 1, 2019 (300 obs) as the out-of-sample test period. For the daily data, the full sample has 5040 observations on 420 stocks from January 20, 2000 to January 31, 2020. We use January 20, 2000 - January 24, 2002 (504 obs) as the training (estimation) period and January 25, 2002 - January 31, 2020 (4536 obs) as the out-of-sample test period. We roll the estimation window (training period) over the test sample to rebalance the portfolios monthly. At the end of each month, prior to portfolio construction, we remove stocks with less than 15 years (for monthly returns) or 2 years (for daily returns) of historical stock return data. We examine the performance of the Factor Graphical Lasso for three alternative portfolio allocations, (2.2), (2.3) and (2.10), and compare it with the equal-weighted portfolio, the index portfolio, CLIME, LW (as in the simulations, we use a linear shrinkage estimator of covariance that incorporates the factor structure through the Sherman-Morrison inversion formula) and POET. The index is the composite S&P500 index listed as ∧GSPC. We take the risk-free rate and Fama/French factors from Kenneth R. French's data library.
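A stylized version of this rolling-window scheme (ignoring the short-history screen and transaction costs; function names are ours) can be written as:

```python
import numpy as np

def rolling_backtest(R, window, estimator, rebalance=21):
    """Roll a `window`-length estimation sample over the T x p return matrix R,
    re-estimating weights every `rebalance` periods (~monthly for daily data),
    and record the out-of-sample portfolio returns."""
    T, p = R.shape
    port_ret, weights = [], []
    w = np.full(p, 1.0 / p)                      # placeholder until first estimate
    for t in range(window, T):
        if (t - window) % rebalance == 0:
            w = estimator(R[t - window:t])       # weights from the training window
        port_ret.append(float(w @ R[t]))
        weights.append(w)
    return np.array(port_ret), np.array(weights)
```

Any of the estimators compared in this section (FGL, CLIME, LW, POET, or equal weights) can be plugged in as `estimator`.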
Similarly to Callot et al. (2019), we consider four metrics commonly reported in the finance literature: the Sharpe Ratio, the portfolio turnover, the average return and the risk of a portfolio (defined as the square root of the out-of-sample variance of the portfolio). (Using ∧SPX and ∧GSPC yields very similar results; we use the latter due to better data availability.) Let $T$ denote the total number of observations; the training sample consists of $m$ observations, and the test sample size is $n = T - m$. When transaction costs are not taken into account, the out-of-sample average portfolio return, variance and Sharpe Ratio (SR) are
$$\hat\mu_{\text{test}} = \frac{1}{n}\sum_{t=m}^{T-1}\widehat{w}_t'r_{t+1}, \quad \hat\sigma^2_{\text{test}} = \frac{1}{n-1}\sum_{t=m}^{T-1}\big(\widehat{w}_t'r_{t+1} - \hat\mu_{\text{test}}\big)^2, \quad \mathrm{SR} = \hat\mu_{\text{test}}/\hat\sigma_{\text{test}}. \tag{6.1}$$
When transaction costs are considered, we follow Ban et al. (2018), Callot et al. (2019), DeMiguel et al. (2009) and Li (2015) to account for the transaction costs, further denoted tc. In line with the aforementioned papers, we set tc = 50 bps. Define the excess portfolio return at time $t+1$ with transaction costs (tc) as
$$r_{t+1,\text{portfolio}} = \widehat{w}_t'r_{t+1} - \mathrm{tc}\,(1 + \widehat{w}_t'r_{t+1})\sum_{j=1}^{p}\big|\hat w_{t+1,j} - \hat w^+_{t,j}\big|, \tag{6.2}$$
where
$$\hat w^+_{t,j} = \hat w_{t,j}\,\frac{r_{t+1,j} + r^f_{t+1}}{r_{t+1,\text{portfolio}} + r^f_{t+1}}, \tag{6.3}$$
with $r_{t+1,j} + r^f_{t+1}$ being the sum of the excess return of the $j$-th asset and the risk-free rate, and $r_{t+1,\text{portfolio}} + r^f_{t+1}$ the sum of the excess return of the portfolio and the risk-free rate. The out-of-sample average portfolio return, variance, Sharpe Ratio and turnover are defined accordingly:
$$\hat\mu_{\text{test,tc}} = \frac{1}{n}\sum_{t=m}^{T-1}r_{t,\text{portfolio}}, \quad \hat\sigma^2_{\text{test,tc}} = \frac{1}{n-1}\sum_{t=m}^{T-1}\big(r_{t,\text{portfolio}} - \hat\mu_{\text{test,tc}}\big)^2, \quad \mathrm{SR}_{\text{tc}} = \hat\mu_{\text{test,tc}}/\hat\sigma_{\text{test,tc}}, \tag{6.4}$$
$$\text{Turnover} = \frac{1}{n}\sum_{t=m}^{T-1}\sum_{j=1}^{p}\big|\hat w_{t+1,j} - \hat w^+_{t,j}\big|. \tag{6.5}$$

This section explores the performance of the Factor Graphical Lasso for the financial portfolio using monthly and daily data. We consider two scenarios: when the factors are unknown and estimated using standard PCA (statistical factors), and when the factors are known. For the statistical factors we consider up to three PCs. For the scenario with known factors we include up to five Fama-French factors: FF1 includes the excess return on the market; FF3 includes FF1 plus the size factor (Small Minus Big, SMB) and the value factor (High Minus Low, HML); and FF5 includes FF3 plus the profitability factor (Robust Minus Weak, RMW) and the risk factor (Conservative Minus Aggressive, CMA). In Tables 1-2 we report the monthly and daily portfolio performance for the three alternative portfolio allocations in (2.2), (2.3) and (2.10). Following Callot et al. (2019), we set return targets $\mu$ for the monthly and daily data that are both equivalents of a 10% yearly return when compounded. The target level of risk for the weight-constrained and risk-constrained Markowitz portfolios (MWC and MRC) is set at the standard deviation of the monthly and daily excess returns of the S&P500 index in the first training set. Following Ao et al. (2019) and Callot et al. (2019), transaction costs for each individual stock are set to the constant given above. Some comments on Table 1 (monthly data) are in order. (1): MRC produces portfolio return and Sharpe Ratio that are mostly higher than those for the weight-constrained allocations MWC and GMV. This means that relaxing the constraint that the portfolio weights sum to one leads to a large increase in the out-of-sample Sharpe Ratio and portfolio return, which, to the best of our knowledge, has not been previously well-studied in the empirical finance literature. The increase in the Sharpe Ratio and return, however, comes at the cost of higher risk and higher portfolio turnover: for MRC portfolios the risk constraint is often violated. (2):
FGL outperforms all the competitors, including the equal-weighted portfolio (EW) and the Index. Specifically, our method has the lowest risk and turnover (compared to CLIME, LW and POET), and the highest out-of-sample Sharpe Ratio compared with all alternative methods. (3): the implementation of POET for MRC resulted in erratic behavior of this method for estimating portfolio weights; concretely, many entries in the weight matrix were "NaN". One explanation for such behavior is an underestimation of the number of factors ($\widehat{K}$); however, adjusting this quantity to $\widehat{K} + 1$ did not fix the problem. (4): using the observable Fama-French factors in the FGL, in general, produces portfolios with higher return and higher out-of-sample Sharpe Ratio compared to the portfolios based on statistical factors. Interestingly, this increase in return is not accompanied by higher risk.

Table 2 reports the results for the daily data. Some comments are in order. (1): MRC portfolios produce higher return and higher risk compared to MWC and GMV, which is consistent with the monthly results from Table 1. However, the out-of-sample Sharpe Ratio for MRC is lower than that of MWC and GMV, which implies that the higher risk of MRC portfolios is not fully compensated by the higher return. (2): similarly to the results in Table 1, FGL outperforms the competitors, including EW and the Index, in terms of the out-of-sample Sharpe Ratio and turnover. (3): similarly to the results in Table 1, the observable Fama-French factors produce FGL portfolios with higher return and higher out-of-sample Sharpe Ratio compared to the FGL portfolios based on statistical factors. Again, this increase in return is not accompanied by higher risk.

Table 3 compares the performance of FGL and the alternative methods on the daily data over several periods of interest, in terms of the cumulative excess return (CER) and risk.
To demonstrate the performance of all methods during periods of recession and expansion, we chose four periods and recorded the CER for the whole year in each period of interest. Two years, 2002 and 2008, correspond to recession periods, which is why we refer to them as "Surge". We note that the references to the Argentine Great Depression and the Financial Crisis are not intended to limit these economic downturns to only one year; they merely provide context for the recessions. The other two years, 2017 and 2019, correspond to years that were relatively favorable to the stock market ("Boom"). Table 3 reveals some interesting findings. (1): the conclusions from Tables 1-2 are supported: MRC portfolios yield higher CER and are characterized by higher risk. (2):
MRC is the only type of portfolio that produces positive CER during both recessions. Note that all models that used MWC and GMV during that time experienced large negative CER. (3): when EW and the Index have positive CER (during Boom periods), all portfolio formulations also produce positive CER. However, the return accumulated by MRC is mostly higher than that of the MWC and GMV portfolio formulations. (4):
FGL mostly outperforms the competitors, including EW and the Index, in terms of CER and risk.
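The paper's exact definition of CER is not reproduced in this excerpt; under the common convention of compounding per-period portfolio excess returns over the evaluation year, it can be computed as:

```python
import numpy as np

def cumulative_excess_return(port_ret):
    """Compound a sequence of per-period portfolio excess returns."""
    return float(np.prod(1.0 + np.asarray(port_ret)) - 1.0)
```

For example, two periods with excess returns of +10% and -5% compound to a CER of 4.5%.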
7 Conclusion

In this paper, we propose a new precision matrix estimator for excess returns under the approximate factor model with unobserved factors that combines the benefits of graphical models and the factor structure. We establish consistency of FGL in the spectral and $\ell_1$ matrix norms. In addition, we prove consistency of the portfolio weights for three formulations of the optimal portfolio allocation without assuming sparsity on the covariance or precision matrix of stock returns. All theoretical results established in this paper hold for a wide range of distributions: the sub-Gaussian family (including Gaussian) and the elliptical family. Our simulations demonstrate that FGL is robust to very heavy-tailed distributions, which makes our method suitable for financial applications.

The empirical exercise uses the constituents of the S&P500 and demonstrates superior performance of FGL compared to several alternative models for estimating the precision (CLIME) and covariance (LW, POET) matrices, the Equal-Weighted (EW) portfolio and the Index portfolio, in terms of the out-of-sample Sharpe Ratio and risk. This finding is robust to both monthly and daily data. We examine three different portfolio formulations and discover that the only portfolios that produce positive cumulative excess return (CER) during recessions are the ones that relax the constraint requiring the portfolio weights to sum to one. To the best of our knowledge, this finding has not been previously well-studied in the empirical finance literature.

There are several avenues for potential extensions. First, having examined the empirical performance of FGL, we notice that some of the estimated portfolio weights are very close to zero. This means that an investor needs to buy a certain amount of each security even if there are many small weights.
However, investors are often interested in managing only a few assets, which significantly reduces monitoring and transaction costs and was shown to outperform equal-weighted and index portfolios in terms of the Sharpe Ratio and cumulative return (see Fan et al. (2019), Ao et al. (2019), Li (2015) and Brodie et al. (2009), among others). Therefore, our model can be extended to create a sparse portfolio. Second, it is possible to make the FGL estimator of the precision matrix time-varying, such that the model could also capture the dynamic nature of the relationships between stock returns. Third, one can incorporate stock-specific characteristics (e.g. company fundamentals, such as current earnings, book value, and growth in net operating assets and financing) into the FGL framework, which would integrate fundamental analysis with portfolio optimization (see Lyle and Yohn (2020)). We are currently working on all of these extensions.

References

Ait-Sahalia, Y. and Xiu, D. (2017). Using principal component analysis to estimate a high dimensional factor model with high-frequency data.
Journal of Econometrics, 201(2):384–399.
Ao, M., Yingying, L., and Zheng, X. (2019). Approaching mean-variance efficiency for large portfolios. The Review of Financial Studies, 32(7):2890–2919.
Awoye, O. A. (2016). Markowitz Minimum Variance Portfolio Optimization Using New Machine Learning Methods. PhD thesis, University College London.
Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171.
Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221.
Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica, 74(4):1133–1150.
Ban, G.-Y., El Karoui, N., and Lim, A. E. (2018). Machine learning and portfolio optimization. Management Science, 64(3):1136–1154.
Banerjee, O., El Ghaoui, L., and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516.
Barigozzi, M., Brownlees, C., and Lugosi, G. (2018). Power-law partial correlation network models. Electronic Journal of Statistics, 12(2):2905–2929.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. (2009). Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272.
Brownlees, C., Nualart, E., and Sun, Y. (2018). Realized networks. Journal of Applied Econometrics, 33(7):986–1006.
Cai, T., Liu, W., and Luo, X. (2011). A constrained l1-minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.
Cai, T. T., Hu, J., Li, Y., and Zheng, X. (2020). High-dimensional minimum variance portfolio estimation based on high-frequency data. Journal of Econometrics, 214(2):482–494.
Callot, L., Caner, M., Önder, A. Ö., and Ulaşan, E. (2019). A nodewise regression approach to estimating large portfolios. Journal of Business & Economic Statistics, 0(0):1–12.
Campbell, J. Y., Lo, A. W., and MacKinlay, A. C. (1997). The Econometrics of Financial Markets. Princeton University Press.
Chamberlain, G. and Rothschild, M. (1983). Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51(5):1281–1304.
Connor, G. and Korajczyk, R. A. (1988). Risk and return in an equilibrium APT: Application of a new test methodology. Journal of Financial Economics, 21(2):255–289.
DeMiguel, V., Garlappi, L., and Uppal, R. (2009). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953.
Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1):3–56.
Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial Economics, 116(1):1–22.
Fan, J., Furger, A., and Xiu, D. (2016a). Incorporating global industrial classification standard into portfolio allocation: A simple factor-based large covariance matrix estimator with high-frequency data. Journal of Business & Economic Statistics, 34(4):489–503.
Fan, J., Liao, Y., and Mincheva, M. (2011). High-dimensional covariance matrix estimation in approximate factor models. The Annals of Statistics, 39(6):3320–3356.
Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):603–680.
Fan, J., Liao, Y., and Wang, W. (2016b). Projected principal component analysis in factor models. The Annals of Statistics, 44(1):219–254.
Fan, J., Liu, H., and Wang, W. (2018). Large covariance estimation through elliptical factor models. The Annals of Statistics, 46(4):1383–1414.
Fan, J., Weng, H., and Zhou, Y. (2019). Optimal estimation of functionals of high-dimensional mean and covariance matrix. arXiv:1908.07460.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the Graphical Lasso. Biostatistics, 9(3):432–441.
Goto, S. and Xu, Y. (2015). Improving mean variance optimization through sparse hedging restrictions. Journal of Financial and Quantitative Analysis, 50(6):1415–1441.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
Janková, J. and van de Geer, S. (2018). Inference in high-dimensional graphical models. Handbook of Graphical Models, Chapter 14, pages 325–351. CRC Press.
Kapetanios, G. (2010). A testing procedure for determining the number of factors in approximate factor models with large datasets. Journal of Business & Economic Statistics, 28(3):397–409.
Koike, Y. (2020). De-biased graphical lasso for high-frequency data. Entropy, 22(4):456.
Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411.
Li, H., Li, Q., and Shi, Y. (2017). Determining the number of factors when the number of factors can increase with sample size. Journal of Econometrics, 197(1):76–86.
Li, J. (2015). Sparse and stable portfolio selection with parameter uncertainty. Journal of Business & Economic Statistics, 33(3):381–392.
Lyle, M. R. and Yohn, T. L. (2020). Fundamental analysis and mean-variance optimal portfolios. Kelley School of Business Research Paper.
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1):77–91.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462.
Millington, T. and Niranjan, M. (2017). Robust portfolio risk minimization using the graphical lasso. In Neural Information Processing, pages 863–872, Cham. Springer International Publishing.
Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. (2011). High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5.
Ross, S. A. (1976). The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13(3):341–360.
Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167–1179.
Tobin, J. (1958). Liquidity preference as behavior towards risk. The Review of Economic Studies, 25(2):65–86.
Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2012). The HUGE package for high-dimensional undirected graph estimation in R. Journal of Machine Learning Research, 13(1):1059–1062.

Figure 1:
Averaged empirical errors (solid lines) and theoretical rates of convergence (dashed lines) on logarithmic scale.

Figure 2: Averaged errors of the estimators of Θ for Case 1 on logarithmic scale.

Figure 3: Averaged errors of the estimators of w_GMV (left) and w_MRC (right) for Case 1 on logarithmic scale.

Figure 4: Averaged errors of the estimators of Θ for Case 2 on logarithmic scale.

Figure 5: Averaged errors of the estimators of w_GMV (left) and w_MRC (right) for Case 2 on logarithmic scale.

Figure 6: Averaged errors of the estimators of Θ on logarithmic scale (heavy-tailed elliptical case).

Figure 7: Averaged errors of the estimators of w_GMV (left) and w_MRC (right) on logarithmic scale (heavy-tailed elliptical case).
Figure 8: Log ratios (base 2) of the averaged errors of the FGL and the Robust FGL estimators of Θ: $\log_2\big(|||\widehat\Theta - \Theta|||\,/\,|||\widehat\Theta_R - \Theta|||\big)$ (left and right panels).

Table: Monthly portfolio returns, risk, Sharpe Ratio (SR) and turnover for the EW, Index, FGL, CLIME, LW, POET and FGL(FF) strategies under the Markowitz risk-constrained, Markowitz weight-constrained and Global Minimum-Variance formulations, with and without transaction costs (TC). Transaction costs are set to 50 basis points; the targeted risk is set at the standard deviation of the monthly excess returns on the S&P 500 index over the first training period (ending in 1995); the monthly targeted return is equivalent to a 10% yearly return when compounded.
Table: Daily portfolio returns, risk, Sharpe Ratio (SR) and turnover for the EW, Index, FGL, CLIME, LW, POET and FGL(FF) strategies under the Markowitz risk-constrained, Markowitz weight-constrained and Global Minimum-Variance formulations, with and without transaction costs (TC). Transaction costs are set to 50 basis points; the targeted risk is set at the standard deviation of the daily excess returns on the S&P 500 index over the first training period (ending in 2002); the daily targeted return is equivalent to a 10% yearly return when compounded.
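For reference, the per-period statistics reported in these tables can be computed from an out-of-sample series of portfolio excess returns. The helper below is an illustrative sketch (not the authors' code): transaction costs are charged at a given rate on the turnover of the weights, and the function name and interface are assumptions.

```python
import numpy as np

def portfolio_stats(excess_returns, weights=None, tc=0.005):
    """Return, risk, Sharpe Ratio (SR) and cumulative excess return (CER)
    of a portfolio excess-return series; `tc` (e.g. 50 bps = 0.005) is
    charged on turnover when a sequence of weights is supplied."""
    r = np.asarray(excess_returns, dtype=float)
    if weights is not None:
        w = np.asarray(weights, dtype=float)
        turnover = np.abs(np.diff(w, axis=0)).sum(axis=1)  # per-period turnover
        r = r.copy()
        r[1:] -= tc * turnover                             # net of transaction costs
    sr = r.mean() / r.std(ddof=1)
    cer = np.prod(1.0 + r) - 1.0                           # cumulative excess return
    return {"return": r.mean(), "risk": r.std(ddof=1), "SR": sr, "CER": cer}
```

For example, the CER over three periods with excess returns 1%, 2% and -1% is (1.01)(1.02)(0.99) - 1.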
Table: Cumulative excess return (CER) and risk of portfolios using daily data over recession and boom periods, including the Argentine Great Depression and the Financial Crisis, for the Equal-Weighted (EW) and Index portfolios and for the FGL, CLIME, LW and POET estimators under the Markowitz risk-constrained (MRC), Markowitz weight-constrained (MWC) and Global Minimum-Variance (GMV) formulations. Transaction costs are set to 50 basis points; the targeted risk is set at the standard deviation of the daily excess returns on the S&P 500 index over the first training period (ending in 2002); the daily targeted return is equivalent to a 10% yearly return when compounded.

Supplemental Appendix

A.1 Lemmas for Theorem 1
Lemma 1.
Under the assumptions of Theorem 1,
(a) $\max_{i,j\le K}\big|(1/T)\sum_{t=1}^{T} f_{it}f_{jt} - E[f_{it}f_{jt}]\big| = O_p(\sqrt{1/T})$,
(b) $\max_{i,j\le p}\big|(1/T)\sum_{t=1}^{T} \varepsilon_{it}\varepsilon_{jt} - E[\varepsilon_{it}\varepsilon_{jt}]\big| = O_p(\sqrt{\log p/T})$,
(c) $\max_{i\le K,\,j\le p}\big|(1/T)\sum_{t=1}^{T} f_{it}\varepsilon_{jt}\big| = O_p(\sqrt{\log p/T})$.

Proof. The proof of Lemma 1 can be found in Fan et al. (2011) (Lemma B.1).
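The uniform rates in Lemma 1 can be illustrated numerically: for i.i.d. Gaussian data, the maximal entrywise deviation of the sample second-moment matrix from its population counterpart shrinks at roughly $\sqrt{\log p/T}$. The short simulation below is an illustration under assumed Gaussian data, not part of the proof; it checks that the empirical error stays within a modest multiple of this rate.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 2000, 50
X = rng.standard_normal((T, p))          # plays the role of epsilon_t ~ N(0, I_p)
S = X.T @ X / T                          # sample second-moment matrix
max_err = np.abs(S - np.eye(p)).max()    # max_{i,j} |(1/T) sum_t e_it e_jt - E[e_it e_jt]|
rate = np.sqrt(np.log(p) / T)            # Lemma 1(b) rate sqrt(log p / T)
print(max_err, rate, max_err / rate)
```

With these dimensions the ratio of the empirical error to the theoretical rate is of order one, consistent with the $O_p(\cdot)$ statement.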
Lemma 2.
Under Assumption (A.4), $\max_{t\le T}\sum_{s=1}^{T}\big|E[\varepsilon_s'\varepsilon_t]\big|/p = O(1)$.

Proof. The proof of Lemma 2 can be found in Fan et al. (2013) (Lemma A.6).
Lemma 3.
For $\widehat K$ defined in expression (3.6), $\Pr\big(\widehat K = K\big) \to 1$.

Proof.
The proof of Lemma 3 can be found in Li et al. (2017) (Theorem 1 and Corollary 1).

Using the expressions (A.1) in Bai (2003) and (C.2) in Fan et al. (2013), we have the following identity:
$$\hat f_t - Hf_t = \Big(\frac{V}{p}\Big)^{-1}\Big[\frac1T\sum_{s=1}^{T}\hat f_s\,\frac{E[\varepsilon_s'\varepsilon_t]}{p} + \frac1T\sum_{s=1}^{T}\hat f_s\,\zeta_{st} + \frac1T\sum_{s=1}^{T}\hat f_s\,\eta_{st} + \frac1T\sum_{s=1}^{T}\hat f_s\,\xi_{st}\Big], \qquad (A.1)$$
where $\zeta_{st} = \varepsilon_s'\varepsilon_t/p - E[\varepsilon_s'\varepsilon_t]/p$, $\eta_{st} = f_s'\sum_{i=1}^{p} b_i\varepsilon_{it}/p$, and $\xi_{st} = f_t'\sum_{i=1}^{p} b_i\varepsilon_{is}/p$.

Lemma 4.
For all $i\le\widehat K$,
(a) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat f_{is}\,E[\varepsilon_s'\varepsilon_t]/p\big]^2 = O_p(T^{-1})$,
(b) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat f_{is}\,\zeta_{st}\big]^2 = O_p(p^{-1})$,
(c) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat f_{is}\,\eta_{st}\big]^2 = O_p(K^2/p)$,
(d) $(1/T)\sum_{t=1}^{T}\big[(1/T)\sum_{s=1}^{T}\hat f_{is}\,\xi_{st}\big]^2 = O_p(K^2/p)$.

Proof. We only prove (c) and (d); the proof of (a) and (b) can be found in Fan et al. (2013) (Lemma 8).

(c) Recall, $\eta_{st} = f_s'\sum_{i=1}^{p} b_i\varepsilon_{it}/p$. Using Assumption (A.5), we get $E\big[(1/T)\sum_{t=1}^{T}\|\sum_{i=1}^{p}b_i\varepsilon_{it}\|^2\big] = E\big[\|\sum_{i=1}^{p}b_i\varepsilon_{it}\|^2\big] = O(pK)$. Therefore, by the Cauchy–Schwarz inequality and the facts that $(1/T)\sum_{t=1}^{T}\|f_t\|^2 = O_p(K)$ and, for all $i$, $\sum_{s=1}^{T}\hat f_{is}^2 = T$,
$$\frac1T\sum_{t=1}^{T}\Big(\frac1T\sum_{s=1}^{T}\hat f_{is}\eta_{st}\Big)^2 \le \Big(\frac1T\sum_{s=1}^{T}\hat f_{is}^2\|f_s\|^2\Big)\,\frac1T\sum_{t=1}^{T}\frac1{p^2}\Big\|\sum_{j=1}^{p}b_j\varepsilon_{jt}\Big\|^2 = O_p\Big(\frac{K}{p}\cdot K\Big) = O_p\Big(\frac{K^2}{p}\Big).$$
(d) Using a similar approach as in part (c):
$$\frac1T\sum_{t=1}^{T}\Big(\frac1T\sum_{s=1}^{T}\hat f_{is}\xi_{st}\Big)^2 \le \Big(\frac1T\sum_{t=1}^{T}\|f_t\|^2\Big)\,\frac1T\sum_{s=1}^{T}\Big\|\frac1p\sum_{j=1}^{p}b_j\varepsilon_{js}\,\hat f_{is}\Big\|^2 = O_p\Big(K\cdot\frac{K}{p}\Big) = O_p\Big(\frac{K^2}{p}\Big).$$

Lemma 5.
(a) $\max_{t\le T}\big\|(1/T)\sum_{s=1}^{T}\hat f_s\,E[\varepsilon_s'\varepsilon_t]/p\big\| = O_p(K/\sqrt T)$,
(b) $\max_{t\le T}\big\|(1/T)\sum_{s=1}^{T}\hat f_s\,\zeta_{st}\big\| = O_p(\sqrt K\,T^{1/4}/\sqrt p)$,
(c) $\max_{t\le T}\big\|(1/T)\sum_{s=1}^{T}\hat f_s\,\eta_{st}\big\| = O_p(K\,T^{1/4}/\sqrt p)$,
(d) $\max_{t\le T}\big\|(1/T)\sum_{s=1}^{T}\hat f_s\,\xi_{st}\big\| = O_p(K\,T^{1/4}/\sqrt p)$.

Proof. Our proof is similar to the proof in Fan et al. (2013).
However, we relax the assumption of fixed $K$.

(a) Using the Cauchy–Schwarz inequality, Lemma 2, and the fact that $(1/T)\sum_{t=1}^{T}\|\hat f_t\|^2 = O_p(K)$, we get
$$\max_{t\le T}\Big\|\frac1T\sum_{s=1}^{T}\hat f_s\,\frac{E[\varepsilon_s'\varepsilon_t]}{p}\Big\| \le \max_{t\le T}\Big[\frac1T\sum_{s=1}^{T}\|\hat f_s\|^2\;\frac1T\sum_{s=1}^{T}\Big(\frac{E[\varepsilon_s'\varepsilon_t]}{p}\Big)^2\Big]^{1/2} \le O_p(K)\max_{t\le T}\Big[\frac1T\sum_{s=1}^{T}\Big(\frac{E[\varepsilon_s'\varepsilon_t]}{p}\Big)^2\Big]^{1/2}$$
$$\le O_p(K)\,\max_{s,t}\sqrt{\Big|\frac{E[\varepsilon_s'\varepsilon_t]}{p}\Big|}\;\max_{t\le T}\Big[\frac1T\sum_{s=1}^{T}\Big|\frac{E[\varepsilon_s'\varepsilon_t]}{p}\Big|\Big]^{1/2} = O_p\Big(K\cdot 1\cdot\frac{1}{\sqrt T}\Big) = O_p\Big(\frac{K}{\sqrt T}\Big).$$

(b)
$$\max_{t\le T}\Big\|\frac1T\sum_{s=1}^{T}\hat f_s\,\zeta_{st}\Big\| \le \max_{t\le T}\frac1T\Big(\sum_{s=1}^{T}\|\hat f_s\|^2\sum_{s=1}^{T}\zeta_{st}^2\Big)^{1/2} \le \Big(O_p(K)\,\max_t\frac1T\sum_{s=1}^{T}\zeta_{st}^2\Big)^{1/2} = O_p\big(\sqrt K\,T^{1/4}/\sqrt p\big).$$
To obtain the last inequality we used Assumption (A.5)(b) to get $E\big[(1/T)\sum_{s=1}^{T}\zeta_{st}^2\big] \le \max_{s,t\le T}E[\zeta_{st}^2] = O(1/p)$, and then applied the Chebyshev inequality and Bonferroni's method, which yield $\max_t(1/T)\sum_{s=1}^{T}\zeta_{st}^2 = O_p(\sqrt T/p)$.

(c) Using the definition of $\eta_{st}$ we get
$$\max_{t\le T}\Big\|\frac1T\sum_{s=1}^{T}\hat f_s\,\eta_{st}\Big\| \le \Big\|\frac1T\sum_{s=1}^{T}\hat f_s f_s'\Big\|\;\max_{t}\Big\|\frac1p\sum_{i=1}^{p}b_i\varepsilon_{it}\Big\| = O_p\big(K\cdot T^{1/4}/\sqrt p\big).$$
To obtain the last rate we used Assumption (A.5)(c) together with the Chebyshev inequality and Bonferroni's method to get $\max_{t\le T}\|\sum_{i=1}^{p}b_i\varepsilon_{it}\| = O_p(T^{1/4}\sqrt p)$.

(d) In the proof of Lemma 4 we showed that $\big\|(1/T)\sum_{s=1}^{T}\frac1p\sum_{i=1}^{p}b_i\varepsilon_{is}\,\hat f_s\big\| = O_p\big(\sqrt{K/p}\big)$. Furthermore, Assumption (A.3) implies $E\big[K^{-2}\|f_t\|^4\big] < M$; therefore, $\max_{t\le T}\|f_t\| = O_p(T^{1/4}\sqrt K)$. Using these bounds we get
$$\max_{t\le T}\Big\|\frac1T\sum_{s=1}^{T}\hat f_s\,\xi_{st}\Big\| \le \max_{t\le T}\|f_t\|\cdot\Big\|\frac1T\sum_{s=1}^{T}\frac1p\sum_{i=1}^{p}b_i\varepsilon_{is}\,\hat f_s\Big\| = O_p\big(T^{1/4}\sqrt K\cdot\sqrt{K/p}\big) = O_p\big(T^{1/4}K/\sqrt p\big).$$

Lemma 6.
(a) $\max_{i\le K}(1/T)\sum_{t=1}^{T}(\hat f_t - Hf_t)_i^2 = O_p(1/T + K^2/p)$.
(b) $(1/T)\sum_{t=1}^{T}\|\hat f_t - Hf_t\|^2 = O_p(K/T + K^3/p)$.
(c) $\max_{t\le T}\|\hat f_t - Hf_t\| = O_p(K/\sqrt T + K\,T^{1/4}/\sqrt p)$.

Proof.
Similarly to Fan et al. (2013), we prove this lemma conditioning on the event $\widehat K = K$. Since $\Pr(\widehat K \ne K) = o(1)$, the unconditional arguments are implied.

(a) Using (A.1), for some constant $C > 0$,
$$\max_{i\le K}\frac1T\sum_{t=1}^{T}(\hat f_t - Hf_t)_i^2 \le C\max_{i\le K}\frac1T\sum_{t=1}^{T}\Big(\frac1T\sum_{s=1}^{T}\hat f_{is}\frac{E[\varepsilon_s'\varepsilon_t]}{p}\Big)^2 + C\max_{i\le K}\frac1T\sum_{t=1}^{T}\Big(\frac1T\sum_{s=1}^{T}\hat f_{is}\zeta_{st}\Big)^2$$
$$+\,C\max_{i\le K}\frac1T\sum_{t=1}^{T}\Big(\frac1T\sum_{s=1}^{T}\hat f_{is}\eta_{st}\Big)^2 + C\max_{i\le K}\frac1T\sum_{t=1}^{T}\Big(\frac1T\sum_{s=1}^{T}\hat f_{is}\xi_{st}\Big)^2 = O_p\Big(\frac1T + \frac1p + \frac{K^2}{p} + \frac{K^2}{p}\Big) = O_p\Big(\frac1T + \frac{K^2}{p}\Big).$$

(b)
$$\frac1T\sum_{t=1}^{T}\|\hat f_t - Hf_t\|^2 \le K\max_{i\le K}\frac1T\sum_{t=1}^{T}(\hat f_t - Hf_t)_i^2.$$

(c) Part (c) is a direct consequence of (A.1) and Lemma 5.

Lemma 7.
(a) $HH' = I_{\widehat K} + O_p(K^2/\sqrt T + K^2/\sqrt p)$.
(b) $H'H = I_K + O_p(K^2/\sqrt T + K^2/\sqrt p)$.

Proof. Similarly to Lemma 6, we first condition on $\widehat K = K$.

(a) The key observation here is that, according to the definition of $H$, its rank grows with $K$; in particular, $\|H\| = O_p(\sqrt K)$. Let $\widehat{\mathrm{cov}}(Hf_t) = (1/T)\sum_{t=1}^{T}Hf_t(Hf_t)'$. Using the triangle inequality we get
$$\|HH' - I_{\widehat K}\|_F \le \|HH' - \widehat{\mathrm{cov}}(Hf_t)\|_F + \|\widehat{\mathrm{cov}}(Hf_t) - I_{\widehat K}\|_F. \qquad (A.2)$$
To bound the first term in (A.2), we use Lemma 1:
$$\|HH' - \widehat{\mathrm{cov}}(Hf_t)\|_F \le \|H\|^2\,\Big\|I_K - \frac1T\sum_{t=1}^{T}f_tf_t'\Big\|_F = O_p\big(K^2/\sqrt T\big).$$
To bound the second term in (A.2), we use the Cauchy–Schwarz inequality and Lemma 6:
$$\Big\|\frac1T\sum_{t=1}^{T}Hf_t(Hf_t)' - \frac1T\sum_{t=1}^{T}\hat f_t\hat f_t'\Big\|_F \le \Big\|\frac1T\sum_{t=1}^{T}(Hf_t - \hat f_t)(Hf_t)'\Big\|_F + \Big\|\frac1T\sum_{t=1}^{T}\hat f_t(\hat f_t - Hf_t)'\Big\|_F$$
$$\le \Big(\frac1T\sum_{t=1}^{T}\|Hf_t - \hat f_t\|^2\;\frac1T\sum_{t=1}^{T}\|Hf_t\|^2\Big)^{1/2} + \Big(\frac1T\sum_{t=1}^{T}\|Hf_t - \hat f_t\|^2\;\frac1T\sum_{t=1}^{T}\|\hat f_t\|^2\Big)^{1/2} = O_p\Big(\Big(\Big(\frac{K}{T}+\frac{K^3}{p}\Big)K\Big)^{1/2}\Big) = O_p\Big(\frac{K}{\sqrt T}+\frac{K^2}{\sqrt p}\Big).$$

(b) Part (b) follows from $\Pr(\widehat K = K) \to 1$.

A.2 Proof of Theorem 1
The second part of Theorem 1 was proved in Lemma 6. We now proceed to the convergence rate of the first part. Using the definitions $\hat b_i = (1/T)\sum_{t=1}^{T}r_{it}\hat f_t$ and $(1/T)\sum_{t=1}^{T}\hat f_t\hat f_t' = I_K$, we obtain
$$\hat b_i - Hb_i = \frac1T\sum_{t=1}^{T}Hf_t\varepsilon_{it} + \frac1T\sum_{t=1}^{T}r_{it}(\hat f_t - Hf_t) + H\Big(\frac1T\sum_{t=1}^{T}f_tf_t' - I_K\Big)b_i. \qquad (A.3)$$
Let us bound each term on the right-hand side of (A.3). First,
$$\max_{i\le p}\Big\|\frac1T\sum_{t=1}^{T}Hf_t\varepsilon_{it}\Big\| \le \|H\|\max_i\sqrt{\sum_{k=1}^{K}\Big(\frac1T\sum_{t=1}^{T}f_{kt}\varepsilon_{it}\Big)^2} \le \|H\|\sqrt K\max_{i\le p,\,j\le K}\Big|\frac1T\sum_{t=1}^{T}f_{jt}\varepsilon_{it}\Big| = O_p\big(K\sqrt{\log p/T}\big),$$
where we used Lemmas 1 and 7 together with Bonferroni's method. Second,
$$\max_i\Big\|\frac1T\sum_{t=1}^{T}r_{it}(\hat f_t - Hf_t)\Big\| \le \max_i\Big(\frac1T\sum_{t=1}^{T}r_{it}^2\;\frac1T\sum_{t=1}^{T}\|\hat f_t - Hf_t\|^2\Big)^{1/2} = O_p\Big(\Big(\frac{K}{T}+\frac{K^3}{p}\Big)^{1/2}\Big),$$
where we used Lemma 6 and the fact that $\max_i T^{-1}\sum_{t=1}^{T}r_{it}^2 = O_p(1)$ since $E[r_{it}^2] = O(1)$. Finally, the third term is $O_p(K^{3/2}T^{-1/2})$ since $\|(1/T)\sum_{t=1}^{T}f_tf_t' - I_K\|_F = O_p(KT^{-1/2})$, $\|H\| = O_p(\sqrt K)$, and $\max_i\|b_i\| = O(1)$ by Assumption (B.1).

A.3 Corollary 1
As a consequence of Theorem 1, we get the following corollary:
Corollary 1.
Under the assumptions of Theorem 1,
$$\max_{i\le p,\,t\le T}\big|\hat b_i'\hat f_t - b_i'f_t\big| = O_p\big((\log T)^{1/r}K\sqrt{\log p/T} + K^{3/2}T^{1/4}/\sqrt p\big).$$

Proof.
Using Assumption (A.4) and Bonferroni's method, we have $\max_{t\le T}\|f_t\| = O_p(\sqrt K(\log T)^{1/r})$. By Theorem 1, uniformly in $i$ and $t$,
$$\big|\hat b_i'\hat f_t - b_i'f_t\big| \le \|\hat b_i - Hb_i\|\,\|\hat f_t - Hf_t\| + \|Hb_i\|\,\|\hat f_t - Hf_t\| + \|\hat b_i - Hb_i\|\,\|Hf_t\| + \|b_i\|\,\|f_t\|\,\|H'H - I_K\|.$$
Substituting the rates from Theorem 1, Lemma 6(c), Lemma 7, and the bound on $\max_{t\le T}\|f_t\|$ above yields
$$\max_{i\le p,\,t\le T}\big|\hat b_i'\hat f_t - b_i'f_t\big| = O_p\big((\log T)^{1/r}K\sqrt{\log p/T} + K^{3/2}T^{1/4}/\sqrt p\big).$$

A.4 Proof of Theorem 2

Using the definition of the idiosyncratic components, we have
$$\varepsilon_{it} - \hat\varepsilon_{it} = b_i'H'(\hat f_t - Hf_t) + (\hat b_i - Hb_i)'\hat f_t + b_i'(H'H - I_K)f_t.$$
We bound the maximum element-wise difference as follows:
$$\max_{i\le p}\frac1T\sum_{t=1}^{T}(\varepsilon_{it} - \hat\varepsilon_{it})^2 \le 4\max_i\|Hb_i\|^2\,\frac1T\sum_{t=1}^{T}\|\hat f_t - Hf_t\|^2 + 4\max_i\|\hat b_i - Hb_i\|^2\,\frac1T\sum_{t=1}^{T}\|\hat f_t\|^2 + 4\max_i\|b_i\|^2\,\frac1T\sum_{t=1}^{T}\|f_t\|^2\,\|H'H - I_K\|_F^2$$
$$= O_p\Big(\frac{K^2\log p}{T} + \frac{K^3}{p}\Big).$$
Let $\omega_T \equiv K\sqrt{\log p/T} + K^{3/2}/\sqrt p$, so that $\max_{i\le p}(1/T)\sum_{t=1}^{T}(\varepsilon_{it} - \hat\varepsilon_{it})^2 = O_p(\omega_T^2)$. Then $\max_{i,t}|\varepsilon_{it} - \hat\varepsilon_{it}| = o_p(1)$, where the last step is implied by Corollary 1.

As pointed out in the main text, the second part of Theorem 2 is based on the relationship between the convergence rates of the estimated covariance and precision matrices established in Janková and van de Geer (2018) (Theorem 14.1.3).

A.5 Lemmas for Theorem 3
Lemma 8.
Under the assumptions of Theorem 1, we have the following results:
(a) $\|B\| = \|BH'\| = O(\sqrt p)$.
(b) $\lambda_T^{-1}\max_{1\le i\le p}\|\hat b_i - H'b_i\| = o_p(1/\sqrt K)$ and $\max_{1\le i\le p}\|\hat b_i\| = O_p(\sqrt K)$.
(c) $\lambda_T^{-1}\,\|\widehat B - BH'\| = o_p\big(\sqrt{p/K}\big)$ and $\|\widehat B\| = O_p(\sqrt p)$.

Proof. Part (c) is a direct consequence of (a)–(b); therefore, we only prove the latter two parts in what follows.

(a) Part (a) easily follows from (B.1): $\mathrm{tr}(\Sigma - BB') = \mathrm{tr}(\Sigma) - \|B\|_F^2 \ge$
0; since $\mathrm{tr}(\Sigma) = O(p)$ by (B.1), we get $\|B\|^2 = O(p)$. The equality $\|B\| = \|BH'\|$ follows from the fact that the linear space spanned by the rows of $B$ is the same as that spanned by the rows of $BH'$; hence, in practice, it does not matter which one is used.

(b) From Theorem 1, we have $\max_{i\le p}\|\hat b_i - Hb_i\| = O_p(\omega_T)$. Using the definition of $\lambda_T$ from Theorem 2, it follows that $\lambda_T^{-1}\max_{1\le i\le p}\|\hat b_i - Hb_i\| = o_p(\tilde z_T)$ for a suitable sequence $\tilde z_T$. The latter holds for any $z_T \ge \tilde z_T$, with the tightest bound obtained when $z_T = \tilde z_T$; for ease of presentation, we use $z_T = 1/\sqrt K$ instead of $\tilde z_T$. The second result in Part (b) is obtained using the fact that $\max_{1\le i\le p}\|\hat b_i\| \le \sqrt K\,\|B\|_{\max}$, where $\|B\|_{\max} = O(1)$ by (B.1).

Lemma 9. Let $\Pi \equiv \big[\Theta_f + (BH')'\Theta_\varepsilon(BH')\big]^{-1}$ and $\widehat\Pi \equiv \big[\widehat\Theta_f + \widehat B'\widehat\Theta_\varepsilon\widehat B\big]^{-1}$. Also, define $\Sigma_f = (1/T)\sum_{t=1}^{T}Hf_t(Hf_t)'$, $\Theta_f = \Sigma_f^{-1}$, $\widehat\Sigma_f \equiv (1/T)\sum_{t=1}^{T}\hat f_t\hat f_t'$, and $\widehat\Theta_f = \widehat\Sigma_f^{-1}$.
Under the assumptions of Theorem 2, we have the following results:
(a) $\Lambda_{\min}(B'B)^{-1} = O(1/p)$.
(b) $|||\Pi||| = O(1/p)$.
(c) $\lambda_T^{-1}\,|||\widehat\Theta_f - \Theta_f||| = o_p\big(1/\sqrt K\big)$.
(d) $\lambda_T^{-1}\,|||\widehat\Pi - \Pi||| = O_p\big(s_T/p + 1/(p\sqrt K)\big)$ and $|||\widehat\Pi||| = O_p(1/p)$.

Proof.
(a) Using Assumption (A.2) we have $\big|\Lambda_{\min}(p^{-1}B'B) - \Lambda_{\min}(\breve B)\big| \le |||p^{-1}B'B - \breve B|||$, which implies Part (a).

(b) First, notice that $|||\Pi||| = \Lambda_{\min}\big(\Theta_f + (BH')'\Theta_\varepsilon(BH')\big)^{-1}$. Therefore, we get
$$|||\Pi||| \le \Lambda_{\min}\big((BH')'\Theta_\varepsilon(BH')\big)^{-1} \le \Lambda_{\min}(B'B)^{-1}\Lambda_{\min}(\Theta_\varepsilon)^{-1} = \Lambda_{\min}(B'B)^{-1}\Lambda_{\max}(\Sigma_\varepsilon),$$
where the second inequality is due to the fact that the linear space spanned by the rows of $B$ is the same as that spanned by the rows of $BH'$. Therefore, the result in Part (b) follows from Part (a) and Assumptions (A.1) and (A.2).

(c) From Lemma 7 we obtained
$$\Big\|\frac1T\sum_{t=1}^{T}Hf_t(Hf_t)' - \frac1T\sum_{t=1}^{T}\hat f_t\hat f_t'\Big\|_F = O_p\Big(\frac{K}{\sqrt T} + \frac{K^2}{\sqrt p}\Big).$$
Since $|||\Theta_f(\widehat\Sigma_f - \Sigma_f)||| < 1$, we have
$$|||\widehat\Theta_f - \Theta_f||| \le \frac{|||\Theta_f|||\;|||\Theta_f(\widehat\Sigma_f - \Sigma_f)|||}{1 - |||\Theta_f(\widehat\Sigma_f - \Sigma_f)|||} = O_p\Big(\frac{K}{\sqrt T} + \frac{K^2}{\sqrt p}\Big).$$
Let $\omega_T = K/\sqrt T + K^2/\sqrt p$. Using the definition of $\lambda_T$ from Theorem 2, it follows that $\lambda_T^{-1}\,|||\widehat\Theta_f - \Theta_f||| = o_p(\tilde\gamma_T)$ for a suitable sequence $\tilde\gamma_T$. The latter holds for any $\gamma_T \ge \tilde\gamma_T$, with the tightest bound obtained when $\gamma_T = \tilde\gamma_T$; for ease of presentation, we use $\gamma_T = 1/\sqrt K$ instead of $\tilde\gamma_T$.

(d) We will bound each term in the definition of $\widehat\Pi - \Pi$.
First, we have
\[
\begin{aligned}
|||\widehat{B}'\widehat{\Theta}_\varepsilon\widehat{B} - (BH')'\Theta_\varepsilon(BH')|||_2
&\le |||\widehat{B} - BH'|||_2\,|||\widehat{\Theta}_\varepsilon|||_2\,|||\widehat{B}|||_2 + |||BH'|||_2\,|||\widehat{\Theta}_\varepsilon - \Theta_\varepsilon|||_2\,|||\widehat{B}|||_2 \\
&\quad + |||BH'|||_2\,|||\Theta_\varepsilon|||_2\,|||\widehat{B} - BH'|||_2 = O_p\big(p\, s_T\, \lambda_T\big). \qquad\text{(A.4)}
\end{aligned}
\]
Now we combine (A.4) with the results from Parts (b)-(c):
\[
\lambda_T^{-1}\,|||\Pi\big(\widehat{\Pi}^{-1} - \Pi^{-1}\big)|||_2 = O_p\Big(s_T + \frac{1}{p\sqrt{K}}\Big).
\]
Finally, since $|||\Pi(\widehat{\Pi}^{-1} - \Pi^{-1})|||_2 < 1$, we have
\[
\lambda_T^{-1}\,|||\widehat{\Pi} - \Pi|||_2 \le \frac{\lambda_T^{-1}\,|||\Pi|||_2\,|||\Pi(\widehat{\Pi}^{-1} - \Pi^{-1})|||_2}{1 - |||\Pi(\widehat{\Pi}^{-1} - \Pi^{-1})|||_2} = O_p\Big(\frac{1}{p}\Big(s_T + \frac{1}{p\sqrt{K}}\Big)\Big).
\]

A.6 Proof of Theorem 3
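The decomposition that opens this proof rests on the Sherman-Morrison-Woodbury identity applied to the factor structure $\Sigma = B\Sigma_f B' + \Sigma_\varepsilon$, namely $\Theta = \Theta_\varepsilon - \Theta_\varepsilon B\,[\Theta_f + B'\Theta_\varepsilon B]^{-1} B'\Theta_\varepsilon$. As an editorial sanity check (not part of the original argument; the dimensions $p$, $K$ and the diagonal covariance choices are illustrative, and the rotation $H$ is taken to be the identity), the identity can be verified numerically:

```python
import numpy as np

# Verify the Sherman-Morrison-Woodbury identity for a factor covariance:
# Sigma = B Sigma_f B' + Sigma_eps, with Theta_f = inv(Sigma_f),
# Theta_eps = inv(Sigma_eps), and Pi = inv(Theta_f + B' Theta_eps B).
rng = np.random.default_rng(0)
p, K = 8, 2                                        # illustrative dimensions
B = rng.standard_normal((p, K))                    # factor loadings
Sigma_f = np.diag(rng.uniform(1.0, 2.0, K))        # factor covariance
Sigma_eps = np.diag(rng.uniform(0.5, 1.5, p))      # idiosyncratic covariance
Sigma = B @ Sigma_f @ B.T + Sigma_eps

Theta_f = np.linalg.inv(Sigma_f)
Theta_eps = np.linalg.inv(Sigma_eps)
Pi = np.linalg.inv(Theta_f + B.T @ Theta_eps @ B)  # the K x K matrix Pi
Theta = Theta_eps - Theta_eps @ B @ Pi @ B.T @ Theta_eps

# The Woodbury expression matches the direct inverse.
assert np.allclose(Theta, np.linalg.inv(Sigma))
```

The $K \times K$ matrix $\Pi$ computed here is the only inverse of a non-trivial dense matrix required, which is what makes the factor-based estimator tractable when $p$ is large relative to $K$.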
Using the Sherman-Morrison-Woodbury formula, we have
\[
\begin{aligned}
|||\widehat{\Theta} - \Theta|||_l
&\le |||\widehat{\Theta}_\varepsilon - \Theta_\varepsilon|||_l
+ |||(\widehat{\Theta}_\varepsilon - \Theta_\varepsilon)\widehat{B}\widehat{\Pi}\widehat{B}'\widehat{\Theta}_\varepsilon|||_l
+ |||\Theta_\varepsilon(\widehat{B} - BH')\widehat{\Pi}\widehat{B}'\widehat{\Theta}_\varepsilon|||_l
+ |||\Theta_\varepsilon BH'(\widehat{\Pi} - \Pi)\widehat{B}'\widehat{\Theta}_\varepsilon|||_l \\
&\quad + |||\Theta_\varepsilon BH'\Pi(\widehat{B} - BH')'\widehat{\Theta}_\varepsilon|||_l
+ |||\Theta_\varepsilon BH'\Pi(BH')'(\widehat{\Theta}_\varepsilon - \Theta_\varepsilon)|||_l
= \Delta_1 + \Delta_2 + \Delta_3 + \Delta_4 + \Delta_5 + \Delta_6. \qquad\text{(A.5)}
\end{aligned}
\]
We now bound the terms in (A.5) for $l = 2$ and $l = \infty$. We start with $l = 2$. First, note that $\lambda_T^{-1}\Delta_1 = O_p(s_T)$ by Theorem 2. Second, using Lemmas 8-9 together with Theorem 2, we have $\lambda_T^{-1}(\Delta_2 + \Delta_6) = O_p\big(s_T \cdot \sqrt{p} \cdot (1/p) \cdot \sqrt{p} \cdot 1\big) = O_p(s_T)$. Third, $\lambda_T^{-1}(\Delta_3 + \Delta_5)$ is negligible according to Lemma 8(c). Finally, $\lambda_T^{-1}\Delta_4 = O_p\big(1 \cdot \sqrt{p} \cdot \big(s_T/p + 1/(p^2\sqrt{K})\big) \cdot \sqrt{p} \cdot 1\big) = O_p\big(s_T + 1/(p\sqrt{K})\big)$ by Lemmas 8-9 and Theorem 2.
Now consider $l = \infty$. First, similarly to the previous case, $\lambda_T^{-1}\Delta_1 = O_p(s_T)$. Second, $\lambda_T^{-1}(\Delta_2 + \Delta_6) = O_p\big(s_T \cdot \sqrt{pK} \cdot (\sqrt{K}/p) \cdot \sqrt{pK} \cdot \sqrt{d_T}\big) = O_p\big(s_T K^{3/2}\sqrt{d_T}\big)$, where we used the fact that for any $A \in \mathcal{S}^p$ we have $|||A|||_1 = |||A|||_\infty \le \sqrt{d(A)}\,|||A|||_2$, where $d(A)$ measures the maximum vertex degree as described at the beginning of Section 4. Third, the term $\lambda_T^{-1}(\Delta_3 + \Delta_5)$ is negligible according to Lemma 8(c). Finally, $\lambda_T^{-1}\Delta_4 = O_p\big(\sqrt{d_T} \cdot \sqrt{pK} \cdot \sqrt{K}(s_T + 1/p)/p \cdot \sqrt{pK} \cdot \sqrt{d_T}\big) = O_p\big(d_T K^{3/2}(s_T + 1/p)\big)$.

A.7 Lemmas for Theorem 4
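The $\ell_1/\ell_\infty$-norm arguments above and in the lemmas below repeatedly use the degree-based inequality $|||A|||_1 = |||A|||_\infty \le \sqrt{d(A)}\,|||A|||_2$ for symmetric $A$, where $d(A)$ is the maximum number of nonzero entries in any row (the maximum vertex degree of the sparsity graph). A small numerical illustration (an editorial sketch; the sparse symmetric matrix below is arbitrary):

```python
import numpy as np

# Check |||A|||_1 = |||A|||_inf <= sqrt(d(A)) * |||A|||_2 for a symmetric A.
rng = np.random.default_rng(1)
p = 10
A = rng.standard_normal((p, p))
A = (A + A.T) / 2
A[np.abs(A) < 0.8] = 0.0                   # sparsify so that d(A) < p

norm_inf = np.max(np.abs(A).sum(axis=1))   # |||A|||_inf (max absolute row sum)
norm_1 = np.max(np.abs(A).sum(axis=0))     # |||A|||_1 (max absolute column sum)
norm_2 = np.linalg.norm(A, 2)              # spectral norm
d_A = np.max((A != 0).sum(axis=1))         # maximum vertex degree

assert np.isclose(norm_inf, norm_1)        # equal by symmetry
assert norm_inf <= np.sqrt(d_A) * norm_2 + 1e-12
```

The inequality follows from Cauchy-Schwarz applied row by row: each absolute row sum involves at most $d(A)$ nonzero entries, and each row's Euclidean norm is bounded by $|||A|||_2$.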
Lemma 10. Under the assumptions of Theorem 4,
(a) $\|\widehat{m} - m\|_{\max} = O_p\big(\sqrt{\log(p)/T}\big)$, where $m$ is the unconditional mean of stock returns defined in Subsection 3.3, and $\widehat{m}$ is the sample mean.
(b) $|||\Theta|||_1 = O(d_T K^{3/2})$, where $d_T$ was defined in Assumption B.3.

Proof. (a) The proof of Part (a) is provided in Chang et al. (2018) (Lemma 1).
(b) To prove Part (b) we use the Sherman-Morrison-Woodbury formula:
\[
|||\Theta|||_1 \le |||\Theta_\varepsilon|||_1 + |||\Theta_\varepsilon B[\Theta_f + B'\Theta_\varepsilon B]^{-1}B'\Theta_\varepsilon|||_1 = O\big(\sqrt{d_T}\big) + O\big(\sqrt{d_T}\cdot p\cdot(\sqrt{K}/p)\cdot K\cdot\sqrt{d_T}\big) = O\big(d_T K^{3/2}\big). \qquad\text{(A.6)}
\]
The last equality in (A.6) is obtained under the assumptions of Theorem 4. This result is important in several respects: it shows that the sparsity of the precision matrix of stock returns is controlled by the sparsity of the precision matrix of the idiosyncratic returns. Hence, one does not need to impose an unrealistic sparsity assumption on the precision matrix of returns a priori when the latter follow a factor structure: sparsity of the precision matrix once the common movements have been taken into account suffices.

Lemma 11.
Define $a \equiv \iota_p'\Theta\iota_p/p$, $b \equiv \iota_p'\Theta m/p$, $d \equiv m'\Theta m/p$, $g \equiv \sqrt{m'\Theta m/p}$, and $\widehat{a} \equiv \iota_p'\widehat{\Theta}\iota_p/p$, $\widehat{b} \equiv \iota_p'\widehat{\Theta}\widehat{m}/p$, $\widehat{d} \equiv \widehat{m}'\widehat{\Theta}\widehat{m}/p$, $\widehat{g} \equiv \sqrt{\widehat{m}'\widehat{\Theta}\widehat{m}/p}$. Under the assumptions of Theorem 4, and assuming $(ad - b^2) > 0$,
(a) $a \ge C > 0$, $b = O(1)$, $d = O(1)$.
(b) $|\widehat{a} - a| = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1)$.
(c) $|\widehat{b} - b| = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1)$.
(d) $|\widehat{d} - d| = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1)$.
(e) $|\widehat{g} - g| = O_p\big([\lambda_T d_T K^{3/2}(s_T + 1/p)]^{1/2}\big) = o_p(1)$.
(f) $\big|(\widehat{a}\widehat{d} - \widehat{b}^2) - (ad - b^2)\big| = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1)$.
(g) $|ad - b^2| = O(1)$.

Proof. (a) Part (a) is trivial and follows directly from $|||\Theta|||_2 = O(1)$.
(b) Using H\"older's inequality, we have
\[
|\widehat{a} - a| = \Big|\frac{\iota_p'(\widehat{\Theta} - \Theta)\iota_p}{p}\Big| \le \frac{\|(\widehat{\Theta} - \Theta)\iota_p\|_1\,\|\iota_p\|_{\max}}{p} \le |||\widehat{\Theta} - \Theta|||_1 = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1),
\]
where the last rate is obtained using the assumptions of Theorem 3.
(c) First, rewrite the expression of interest:
\[
\widehat{b} - b = [\iota_p'(\widehat{\Theta} - \Theta)(\widehat{m} - m)]/p + [\iota_p'(\widehat{\Theta} - \Theta)m]/p + [\iota_p'\Theta(\widehat{m} - m)]/p. \qquad\text{(A.7)}
\]
We now bound each of the terms in (A.7) using the expressions derived in Callot et al. (2019) (see their Proof of Lemma A.2) and the fact that $\log(p)/T = o(1)$:
\[
|\iota_p'(\widehat{\Theta} - \Theta)(\widehat{m} - m)|/p \le |||\widehat{\Theta} - \Theta|||_1\,\|\widehat{m} - m\|_{\max} = O_p\Big(\lambda_T d_T K^{3/2}(s_T + 1/p)\cdot\sqrt{\frac{\log(p)}{T}}\Big). \qquad\text{(A.8)}
\]
\[
|\iota_p'(\widehat{\Theta} - \Theta)m|/p \le |||\widehat{\Theta} - \Theta|||_1\,\|m\|_{\max} = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big). \qquad\text{(A.9)}
\]
\[
|\iota_p'\Theta(\widehat{m} - m)|/p \le |||\Theta|||_1\,\|\widehat{m} - m\|_{\max} = O_p\Big(d_T K^{3/2}\cdot\sqrt{\frac{\log(p)}{T}}\Big). \qquad\text{(A.10)}
\]
(d) First, rewrite the expression of interest:
\[
\begin{aligned}
\widehat{d} - d &= [(\widehat{m} - m)'(\widehat{\Theta} - \Theta)(\widehat{m} - m)]/p + [(\widehat{m} - m)'\Theta(\widehat{m} - m)]/p + [2(\widehat{m} - m)'\Theta m]/p \\
&\quad + [2m'(\widehat{\Theta} - \Theta)(\widehat{m} - m)]/p + [m'(\widehat{\Theta} - \Theta)m]/p. \qquad\text{(A.11)}
\end{aligned}
\]
We now bound each of the terms in (A.11) using the expressions derived in Callot et al. (2019) (see their Proof of Lemma A.3) and the fact that $\log(p)/T = o(1)$:
\[
|(\widehat{m}-m)'(\widehat{\Theta}-\Theta)(\widehat{m}-m)|/p \le \|\widehat{m}-m\|_{\max}^2\,|||\widehat{\Theta}-\Theta|||_1 = O_p\Big(\frac{\log(p)}{T}\cdot\lambda_T d_T K^{3/2}(s_T + 1/p)\Big). \qquad\text{(A.12)}
\]
\[
|(\widehat{m}-m)'\Theta(\widehat{m}-m)|/p \le \|\widehat{m}-m\|_{\max}^2\,|||\Theta|||_1 = O_p\Big(\frac{\log(p)}{T}\cdot d_T K^{3/2}\Big). \qquad\text{(A.13)}
\]
\[
|(\widehat{m}-m)'\Theta m|/p \le \|\widehat{m}-m\|_{\max}\,|||\Theta|||_1\,\|m\|_{\max} = O_p\Big(\sqrt{\frac{\log(p)}{T}}\cdot d_T K^{3/2}\Big). \qquad\text{(A.14)}
\]
\[
|m'(\widehat{\Theta}-\Theta)(\widehat{m}-m)|/p \le \|\widehat{m}-m\|_{\max}\,|||\widehat{\Theta}-\Theta|||_1\,\|m\|_{\max} = O_p\Big(\sqrt{\frac{\log(p)}{T}}\cdot\lambda_T d_T K^{3/2}(s_T + 1/p)\Big). \qquad\text{(A.15)}
\]
\[
|m'(\widehat{\Theta}-\Theta)m|/p \le |||\widehat{\Theta}-\Theta|||_1\,\|m\|_{\max}^2 = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big). \qquad\text{(A.16)}
\]
(e) This is a direct consequence of Part (d) and the fact that $|\sqrt{\widehat{d}} - \sqrt{d}| \le \sqrt{|\widehat{d} - d|}$.
(f) First, rewrite the expression of interest:
\[
\widehat{a}\widehat{d} - \widehat{b}^2 = [(\widehat{a} - a) + a][(\widehat{d} - d) + d] - [(\widehat{b} - b) + b]^2;
\]
therefore, using Parts (a)-(d), we have
\[
\big|(\widehat{a}\widehat{d} - \widehat{b}^2) - (ad - b^2)\big| \le |\widehat{a} - a|\,|\widehat{d} - d| + |\widehat{a} - a|\,d + a\,|\widehat{d} - d| + (\widehat{b} - b)^2 + 2|b|\,|\widehat{b} - b| = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1).
\]
(g) This is a direct consequence of Part (a): $ad - b^2 \le ad = O(1)$.

A.8 Proof of Theorem 4

Let us derive the convergence rates for each of the portfolio weight formulations one by one. We start with the GMV formulation:
\[
\|\widehat{w}_{GMV} - w_{GMV}\|_1 \le \frac{a\,\|(\widehat{\Theta} - \Theta)\iota_p\|_1/p + |a - \widehat{a}|\,\|\Theta\iota_p\|_1/p}{|\widehat{a}|\,a} = O_p\big(\lambda_T d_T^2 K^3(s_T + 1/p)\big) = o_p(1),
\]
where the first inequality was shown in Callot et al. (2019) (see their expression A.50), and the rate follows from Lemmas 10 and 11.
We now proceed with the MWC weight formulation. First, let us simplify the weight expression as follows: $w_{MWC} = \kappa_1(\Theta\iota_p/p) + \kappa_2(\Theta m/p)$, where $\kappa_1 = \frac{d - \mu b}{ad - b^2}$ and $\kappa_2 = \frac{\mu a - b}{ad - b^2}$. Let $\widehat{w}_{MWC} = \widehat{\kappa}_1(\widehat{\Theta}\iota_p/p) + \widehat{\kappa}_2(\widehat{\Theta}\widehat{m}/p)$, where $\widehat{\kappa}_1$ and $\widehat{\kappa}_2$ are the estimators of $\kappa_1$ and $\kappa_2$ respectively. As shown in Callot et al.
(2019) (see their equation A.57), we can bound the quantity of interest as follows:
\[
\begin{aligned}
\|\widehat{w}_{MWC} - w_{MWC}\|_1
&\le |\widehat{\kappa}_1 - \kappa_1|\,\|(\widehat{\Theta} - \Theta)\iota_p\|_1/p + |\widehat{\kappa}_1 - \kappa_1|\,\|\Theta\iota_p\|_1/p + |\kappa_1|\,\|(\widehat{\Theta} - \Theta)\iota_p\|_1/p \\
&\quad + |\widehat{\kappa}_2 - \kappa_2|\,\|(\widehat{\Theta} - \Theta)(\widehat{m} - m)\|_1/p + |\widehat{\kappa}_2 - \kappa_2|\,\|\Theta(\widehat{m} - m)\|_1/p + |\widehat{\kappa}_2 - \kappa_2|\,\|(\widehat{\Theta} - \Theta)m\|_1/p \\
&\quad + |\widehat{\kappa}_2 - \kappa_2|\,\|\Theta m\|_1/p + |\kappa_2|\,\|(\widehat{\Theta} - \Theta)(\widehat{m} - m)\|_1/p + |\kappa_2|\,\|(\widehat{\Theta} - \Theta)m\|_1/p. \qquad\text{(A.17)}
\end{aligned}
\]
For ease of representation, denote $y \equiv ad - b^2$. Then, using a similar technique as in Callot et al. (2019), we get
\[
|\widehat{\kappa}_1 - \kappa_1| \le \frac{y\,|\widehat{d} - d| + y\mu\,|\widehat{b} - b| + |\widehat{y} - y|\,|d - \mu b|}{\widehat{y}\,y} = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1),
\]
where the rate follows from Lemma 11. Similarly, we get $|\widehat{\kappa}_2 - \kappa_2| = O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) = o_p(1)$. Callot et al. (2019) showed that $|\kappa_1| = O(1)$ and $|\kappa_2| = O(1)$. Therefore, we can get the rate of (A.17):
\[
\|\widehat{w}_{MWC} - w_{MWC}\|_1 = O_p\big(\lambda_T d_T^2 K^3(s_T + 1/p)\big) = o_p(1).
\]
We now proceed with the MRC weight formulation:
\[
\begin{aligned}
\|\widehat{w}_{MRC} - w_{MRC}\|_1
&\le \frac{g\big[\|(\widehat{\Theta} - \Theta)(\widehat{m} - m)\|_1 + \|(\widehat{\Theta} - \Theta)m\|_1 + \|\Theta(\widehat{m} - m)\|_1\big] + |\widehat{g} - g|\,\|\Theta m\|_1}{|\widehat{g}|\,g\,p} \\
&\le \frac{g\big[p\,|||\widehat{\Theta} - \Theta|||_1\|\widehat{m} - m\|_{\max} + p\,|||\widehat{\Theta} - \Theta|||_1\|m\|_{\max} + p\,|||\Theta|||_1\|\widehat{m} - m\|_{\max}\big] + p\,|\widehat{g} - g|\,|||\Theta|||_1\|m\|_{\max}}{|\widehat{g}|\,g\,p} \\
&= O_p\Big(\lambda_T d_T K^{3/2}(s_T + 1/p)\cdot\sqrt{\frac{\log(p)}{T}}\Big) + O_p\big(\lambda_T d_T K^{3/2}(s_T + 1/p)\big) + O_p\Big(d_T K^{3/2}\cdot\sqrt{\frac{\log(p)}{T}}\Big) \\
&\quad + O_p\Big([\lambda_T d_T K^{3/2}(s_T + 1/p)]^{1/2}\cdot d_T K^{3/2}\Big) = o_p(1).
\end{aligned}
\]
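The plug-in weights analysed above are straightforward to compute once an estimate of the precision matrix and the mean vector is available. A minimal sketch (not the authors' code): $\Theta$ and $m$ below are stand-ins (an inverse sample covariance of simulated data and simulated means, rather than the FGL estimate), and the target return $\mu = 0.03$ is an illustrative choice. The check verifies the defining constraints of the GMV and MWC portfolios.

```python
import numpy as np

# Plug-in GMV and MWC weights from a precision matrix Theta and mean vector m,
# using the scalars a, b, d defined in Lemma 11.
rng = np.random.default_rng(2)
p = 6
X = rng.standard_normal((200, p))
Sigma = np.cov(X, rowvar=False) + 0.1 * np.eye(p)  # well-conditioned toy covariance
Theta = np.linalg.inv(Sigma)                       # stand-in for the FGL estimate
m = rng.uniform(0.01, 0.05, p)                     # stand-in for mean returns
iota = np.ones(p)

a = iota @ Theta @ iota / p
b = iota @ Theta @ m / p
d = m @ Theta @ m / p

# Global minimum-variance weights: w = Theta iota / (iota' Theta iota).
w_gmv = Theta @ iota / (iota @ Theta @ iota)

# Markowitz weight-constrained (MWC) weights:
# w = kappa1 * Theta iota / p + kappa2 * Theta m / p.
mu = 0.03                                          # illustrative target return
k1 = (d - mu * b) / (a * d - b**2)
k2 = (mu * a - b) / (a * d - b**2)
w_mwc = k1 * Theta @ iota / p + k2 * Theta @ m / p

assert np.isclose(w_gmv.sum(), 1.0)                       # weights sum to one
assert np.isclose(w_mwc.sum(), 1.0)                       # weights sum to one
assert np.isclose(w_mwc @ m, mu)                          # attains target return
```

The two assertions on `w_mwc` follow algebraically from the definitions of $\kappa_1$ and $\kappa_2$: $w'\iota_p = \kappa_1 a + \kappa_2 b = 1$ and $w'm = \kappa_1 b + \kappa_2 d = \mu$; by the Cauchy-Schwarz inequality $ad - b^2 > 0$ whenever $m$ is not proportional to $\iota_p$, so the division is well defined.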