Residual-Based Nodewise Regression in Factor Models with Ultra-High Dimensions: Analysis of Mean-Variance Portfolio Efficiency and Estimation of Out-of-Sample and Constrained Maximum Sharpe Ratios
Mehmet Caner ∗ Marcelo Medeiros † Gabriel F. R. Vasconcelos ‡ June 30, 2020
Abstract
In this paper, we analyze the maximum Sharpe ratio when the number of assets in a portfolio is larger than its time span. One obstacle in this high-dimensional setup is the singularity of the sample covariance matrix of the excess asset returns. To resolve this issue, we benefit from a technique called nodewise regression, which was developed by Meinshausen and Bühlmann (2006). It provides a sparse/weakly sparse and consistent estimate of the precision matrix using the lasso method. One of the key results in our paper is the mean-variance efficiency of the portfolios in high dimensions. Tied to that result, we also show that the maximum out-of-sample Sharpe ratio can be consistently estimated in this large portfolio of assets. Furthermore, we provide convergence rates and show that the number of assets slows the convergence up to a logarithmic factor. We also provide consistency of the maximum Sharpe ratio when the portfolio weights sum to one, and a new formula for the constrained maximum Sharpe ratio. Finally, we obtain consistent estimates of the Sharpe ratios of the global minimum-variance portfolio and Markowitz's (1952) mean-variance portfolio. In terms of assumptions, we allow for dependent data. Simulations and out-of-sample forecasting exercises show that our new method performs well compared to factor- and shrinkage-based techniques.

∗ North Carolina State University, Nelson Hall, Department of Economics, NC 27695. Email: [email protected].
† Department of Economics, Pontifical Catholic University of Rio de Janeiro - Brazil. Email: [email protected].
‡ Department of Economics, University of California, Irvine. SSPB 3201, Irvine, CA 92697. Email: [email protected].

We thank Vanderbilt Economics Department seminar guests for comments. We are grateful for the comments by Harold Chiang, Srini Krishnamurthy, and Michael Wolf.

Introduction
One of the key issues in finance is the trade-off between the return and the risk of a portfolio. To obtain a better risk-adjusted return, we maximize the Sharpe ratio: the weights of the portfolio are chosen in such a way that the return-to-risk ratio is maximized. We contribute to this literature by studying the case of a large number of assets p, which may be greater than the time span of the portfolio n. Our analysis also involves time-series data for excess asset returns. To obtain the maximum Sharpe ratio, we make use of the assets' precision matrix. However, the sample covariance matrix is not invertible when p > n. Therefore, we need another way to estimate the precision matrix. To do so, we use a concept promoted by Meinshausen and Bühlmann (2006) called nodewise regression. To obtain the Sharpe ratio, we estimate the precision matrix by a nodewise regression-based inverse as in van de Geer (2016). This method consists of running a lasso regression of a given excess asset return on the remaining assets to form the rows of the precision matrix. This type of method assumes sparsity, or weak sparsity, of the rows of the precision matrix when p ≥ n. Weak sparsity allows a non-sparse precision matrix, as long as the ℓ-th power (0 < ℓ < 1) sum of the absolute values of the coefficients in each row does not diverge too fast; for this issue, see Section 2.10 of van de Geer (2016). Note that we do not assume the sample covariance matrix to be sparse.

This assumption of the sparsity of the precision matrix can be interpreted as an asset being potentially correlated with a number, but not all, of the assets in a portfolio. Asset A may be linked to Asset B, and Asset B may be linked to Asset C, but there is no direct link between Asset A and Asset C. This is not a strong assumption, as we show in our empirical out-of-sample exercise in Section 7. Figure 2 shows that there are not too many large correlations for US assets in the two subsamples that we use in our study.

The related literature on nodewise regression is as follows. Chang et al. (2019) extend nodewise regression to time-series data and build confidence intervals for the cells in the precision matrix. Callot et al. (2019) provide the variance, risk, and weight estimation of the portfolio via nodewise regression. Caner and Kock (2018) establish uniform confidence intervals for high-dimensional parameters in heteroskedastic setups using nodewise regression. Meinshausen and Bühlmann (2006) already provide an optimality result for nodewise regression in terms of predicting a certain excess asset return with other excess asset returns when the returns are normally distributed.

In this paper, we analyze three important aspects of the maximum Sharpe ratio when p ≥ n. First, we analyze the maximum out-of-sample Sharpe ratio and the mean-variance efficiency of a large portfolio. Our technique, and hence its contribution, is complementary to the existing papers. One difference is that we analyze p ≥ n when both the number of assets and the time span go to infinity in a time-series framework. Recently, important contributions have been made in this area using shrinkage and factor models.
Ledoit and Wolf (2017) propose a nonlinear shrinkage estimator in which small eigenvalues of the sample covariance matrix are increased and large eigenvalues are decreased by a shrinkage formula. Their main contribution is the optimal shrinkage function, which they find by minimizing a loss function. The maximum out-of-sample Sharpe ratio is an inverse function of this loss. Their results cover the iid case with p/n → c ∈ (0, 1) ∪ (1, +∞). For the analysis of mean-variance efficiency, Ao et al. (2019) make a novel contribution in which they take a constrained optimization, maximizing returns subject to the risk of the portfolio, and show that it is equivalent to an unconstrained objective function, where they minimize a scaled return of the portfolio error by choosing optimal weights. To obtain these weights, they use lasso regression and hence assume a sparse number of nonzero weights of the portfolio, and they analyze p/n → r ∈ (0, 1). We, instead, allow p > n when both dimensions are growing. Relatedly, the consistency of our nodewise-based maximum out-of-sample Sharpe ratio estimate is established. We also provide the rate of convergence and see that the number of assets slows the rate of convergence up to a logarithmic factor in p; hence, consistent estimation of the Sharpe ratio of large portfolios is possible.

Second, we consider the rate of convergence and consistency of the maximum Sharpe ratio when the weights of the portfolio are normalized to sum to one and p > n. Maller and Turkington (2002) and Maller et al. (2016) analyze the limit with a fixed number of assets and extend that approach to a large number of assets, but a number less than the time span of the portfolio. Their papers make a key discovery: in the case of weight constraints (summing to one), the formula for the maximum Sharpe ratio depends on a technical term, unlike the unconstrained maximum Sharpe ratio case. Practitioners could obtain the minimum Sharpe ratio instead of the maximum if they are using the unconstrained formula.
Our paper extends their work by analyzing two issues: first, the case of p > n, with both quantities growing to infinity, and second, handling the uncertainty created by this technical term, which we can estimate and use to obtain a new, consistent constrained maximum Sharpe ratio.

Our third contribution is that we consider the Sharpe ratios of the global minimum-variance portfolio and the Markowitz mean-variance portfolio. Our analysis uncovers consistent estimators even when p > n. We show that our method performs well in simulations and empirical applications. The good performance is due to the correlation structure of the excess asset returns. The test (out-of-sample) periods that we analyze have a small number of large correlations and are hence in line with our sparsity assumptions, as can be seen in Figure 1. In Figure 1, Subsample 1 and Subsample 2 correspond to two out-of-sample data periods in Section 7, where we cover January 2005-December 2017 and January 2000-December 2017, respectively. Additionally, in the same figure, we superimpose a simulation design that comes from a widely used factor model design in Section 6. The factor design does not conform with the two subperiods that we analyze via real-life data. The factor design misses all negatively correlated assets and concentrates heavily on the mean, so in that sense, it reflects a highly restricted sparse model.

Regarding other papers, Ledoit and Wolf (2003) and Ledoit and Wolf (2004) propose a linear shrinkage estimator of the covariance matrix and apply it to portfolio optimization. Ledoit and Wolf (2017) show that nonlinear shrinkage performs better in out-of-sample forecasts. Lai et al. (2011) and Garlappi et al. (2007) approach the same problem from a Bayesian perspective by aiming to maximize a utility function tied to portfolio optimization.
Another avenue of the literature improves the performance of portfolios by introducing constraints on the weights, in the case of the global minimum-variance portfolio. Examples of works investigating this problem include Jagannathan and Ma (2003) and Fan et al. (2012). We also see combinations of different portfolios proposed by Kan and Zhou (2007) and Tu and Zhou (2011).

This paper is organized as follows. Section 2 considers our assumptions and precision matrix estimation. Section 3 addresses the maximum out-of-sample Sharpe ratio and mean-variance efficiency. Section 4 handles the case of the maximum Sharpe ratio when the weights are normalized to sum to one. Section 5 concerns the global minimum-variance and Markowitz mean-variance portfolio Sharpe ratios. Section 6 provides simulations that compare several methods. Section 7 presents an out-of-sample forecasting exercise. The main proofs are in the Appendix, and the Supplementary Appendix has some benchmark results used in the main proofs. Let ‖ν‖_1, ‖ν‖_2, ‖ν‖_∞ be the ℓ_1, ℓ_2, ℓ_∞ norms of a generic vector ν. For matrices, ‖A‖_∞ denotes the sup norm, the maximal absolute element.

Figure 1: Correlation densities for Subsample 1, Subsample 2, and the factor DGP.
Precision Matrix and Its Estimate
Define r_t := (r_{t,1}, ..., r_{t,p})′ as the excess asset returns for a p-asset portfolio, which is a p × 1 vector, and µ := (µ_1, ..., µ_p)′ as the target excess asset return of the portfolio, which is also p × 1. The population covariance matrix is Σ := E(r_t − µ)(r_t − µ)′, and we define the sample covariance matrix of excess asset returns

Σ̂ := (1/n) Σ_{t=1}^n (r_t − r̄)(r_t − r̄)′.

Denote r̄ := n^{-1} Σ_{t=1}^n r_t, which is a p × 1 vector. The demeaned excess returns are collected in r*, which is an n × p matrix. To make things clearer, set r*_{t,j} := r_{t,j} − r̄_j, which is the demeaned t-th period, j-th asset's excess return, and r̄_j := n^{-1} Σ_{t=1}^n r_{t,j}. Moreover, set r*_j as the j-th asset's demeaned excess return (an n × 1 vector) for j = 1, 2, ..., p. Set r*_{−j} as the matrix of demeaned excess returns (n × (p − 1)) excluding the j-th one. Let r*_{t,−j} represent the (p − 1) × 1 vector of demeaned excess returns at time t excluding the j-th one. Furthermore, set µ̂ := r̄.

To understand the assumptions, we define a model that will link us to the nodewise regression concept in the next section. For t = 1, ..., n and j = 1, ..., p,

r*_{t,j} = γ′_j r*_{t,−j} + η_{t,j},    (1)

where η_{t,j} is the unobserved error. This is equation (5) in Chang et al. (2019). For the iid case, see equation (B.30) of Caner and Kock (2018). Here, we provide the assumptions.

Assumption 1.
There exist constants that are independent of p and n, with K_1 > 0, K_2 > 0, 0 < α_1 ≤ 2, and 0 < α_2 ≤ 2, such that for t = 1, ..., n,

max_{1≤j≤p} E exp(K_1 |r*_{t,j}|^{α_1}) ≤ K_2,    max_{1≤j≤p} E exp(K_1 |η_{t,j}|^{α_2}) ≤ K_2.

Assumption 2. (i) The minimum eigenvalue of Σ^{-1} satisfies Eigmin(Σ^{-1}) ≥ c > 0, where c is a positive constant, and the maximum eigenvalue of Σ^{-1} satisfies Eigmax(Σ^{-1}) ≤ K < ∞, where K is a positive constant. (ii) Moreover, for all j = 1, ..., p: 0 < c_l ≤ |µ_j| and |µ_j| ≤ c_u < ∞, where c_l, c_u are positive constants.

Assumption 3.
The matrix of demeaned excess asset returns r* has strictly stationary β-mixing rows with β-mixing coefficients satisfying β_k ≤ exp(−K_3 k^{α_3}) for all k ≥ 1, with constants K_3 > 0, α_3 > 0 that are independent of p and n. Set ρ := min([α_1^{-1} + α_2^{-1} + α_3^{-1}]^{-1}, [2α_1^{-1} + α_3^{-1}]^{-1}). Additionally, ln p = o(n^{ρ/(2−ρ)}). With ρ ≤ 1, we have that ln p = o(n).

Assumptions 1, 2(i), and 3 are from Chang et al. (2019). Assumption 1 allows us to apply the exponential tail inequalities used by Chang et al. (2019). Assumption 2(ii) does not allow a zero return for any asset, and all returns should also be finite. For technical and practical reasons, we also do not allow local-to-zero returns. Assumption 2 prevents the case of a zero maximum Sharpe ratio. Assumption 3 allows for weak dependence in the data. Chang et al. (2019) show that causal ARMA processes with continuous error distributions are β-mixing with exponentially decaying β_k. Stationary GARCH models with finite second moments and continuous error distributions satisfy Assumption 3, as do some stationary Markov chains. Note that we benefit from the first and fourth results of Lemma 1 on pp. 70-71 of Chang et al. (2019), so our ρ condition is a subset of theirs.

In this subsection, we provide a precision matrix formula. This subsection is taken from Callot et al. (2019), and we repeat it so that it becomes clear how the precision matrix estimate is derived in the next subsection, which shows how this is related to the concept of nodewise regression. We show how a formula for Θ := Σ^{-1} can be obtained under strictly stationary time-series excess asset returns. This is an extension of the iid case in Caner and Kock (2018). Let Σ_{−j,−j} represent the (p − 1) × (p − 1) submatrix of Σ in which the j-th row and column have been removed. Additionally, Σ_{j,−j} is the j-th row of Σ with the j-th element removed, and Σ_{−j,j} represents the j-th column of Σ with its j-th element removed.
From the inverse formula for block matrices, we have the following for the j-th main diagonal term:

Θ_{j,j} = (Σ_{j,j} − Σ_{j,−j} Σ_{−j,−j}^{-1} Σ_{−j,j})^{-1},    (2)

and for the j-th row of Θ with the j-th element removed,

Θ_{j,−j} = −(Σ_{j,j} − Σ_{j,−j} Σ_{−j,−j}^{-1} Σ_{−j,j})^{-1} Σ_{j,−j} Σ_{−j,−j}^{-1} = −Θ_{j,j} Σ_{j,−j} Σ_{−j,−j}^{-1}.    (3)

We now relate (2)-(3) to the linear regression described below in (7). Define γ_j as the (p − 1) × 1 vector γ that minimizes E[r*_{t,j} − (r*_{t,−j})′γ]^2, t = 1, ..., n. We can obtain the solution

γ_j = Σ_{−j,−j}^{-1} Σ_{−j,j},    (4)

by using strict stationarity of the data. Using the symmetry of Σ and (4), we can write (3) as

Θ_{j,−j} = −Θ_{j,j} γ′_j.    (5)

Define Σ_{−j,j} := E r*_{t,−j} r*_{t,j} and Σ_{−j,−j} := E r*_{t,−j} r*′_{t,−j}. By (1), η_{t,j} := r*_{t,j} − (r*_{t,−j})′γ_j. Then, it is easy to see by (4) that

E r*_{t,−j} η_{t,j} = E r*_{t,−j} r*_{t,j} − [E r*_{t,−j} (r*_{t,−j})′] γ_j = Σ_{−j,j} − Σ_{−j,−j} Σ_{−j,−j}^{-1} Σ_{−j,j} = 0.    (6)

This means that we can formulate (1) as a regression model with covariates orthogonal to the errors:

r*_{t,j} = (r*_{t,−j})′ γ_j + η_{t,j},    (7)

for t = 1, ..., n. Comparing (5) and (7), we see that Θ_{j,−j}, and hence the whole row Θ_j, is sparse if and only if γ_j is sparse.

To derive a formula for Θ, we see that given (6)-(7),

Σ_{j,j} := E[(r*_{t,j})^2] = γ′_j Σ_{−j,−j} γ_j + E η^2_{t,j} = Σ_{j,−j} Σ_{−j,−j}^{-1} Σ_{−j,j} + E η^2_{t,j},    (8)

where we use (4) in the last equality in (8). Now, define τ^2_j := E η^2_{t,j} for t = 1, ..., n, j = 1, ..., p. By (8),

τ^2_j = Σ_{j,j} − Σ_{j,−j} Σ_{−j,−j}^{-1} Σ_{−j,j} = 1/Θ_{j,j},    (9)

where we use (2) for the second equality. Next, define the p × p matrix C_p with ones on the main diagonal and off-diagonal entries (C_p)_{j,k} = −γ_{j,k} for k ≠ j, where γ_{j,k} is the coefficient on the k-th regressor in γ_j, and define T^{-2} := diag(τ_1^{-2}, ..., τ_p^{-2}), which is a p × p diagonal matrix. We can write

Θ = T^{-2} C_p,    (10)

and to obtain (10), we use (2) and (9):

Θ_{j,j} = 1/τ^2_j,    (11)

and by (5) with (11),

Θ_{j,−j} = −Θ_{j,j} γ′_j = −γ′_j / τ^2_j.
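As a sanity check on this derivation, the identity Θ = T^{-2} C_p in (10) can be verified numerically on a small synthetic covariance matrix. The sketch below is illustrative only: the matrix Sigma and all variable names are our own, not from the paper's code.

```python
import numpy as np

# Numerical sanity check of the identity Theta = T^{-2} C_p in (10),
# using a small synthetic covariance matrix of our own choosing.
rng = np.random.default_rng(0)
p = 5
G = rng.standard_normal((p, p))
Sigma = G @ G.T + p * np.eye(p)          # well-conditioned p x p covariance

C_p = np.eye(p)
tau2 = np.zeros(p)
for j in range(p):
    idx = [k for k in range(p) if k != j]
    # gamma_j = Sigma_{-j,-j}^{-1} Sigma_{-j,j}, equation (4)
    gamma_j = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, j])
    # tau_j^2 = Sigma_{j,j} - Sigma_{j,-j} gamma_j, equation (9)
    tau2[j] = Sigma[j, j] - Sigma[j, idx] @ gamma_j
    C_p[j, idx] = -gamma_j

Theta = np.diag(1.0 / tau2) @ C_p        # equation (10)
assert np.allclose(Theta, np.linalg.inv(Sigma))
```

The final assertion confirms that the row-by-row construction reproduces Σ^{-1} exactly, up to floating-point error.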
As previously mentioned, the idea of nodewise regression was developed by Meinshausen and Bühlmann (2006). Nodewise regression stems from the idea of neighborhood selection. In a portfolio, neighborhood selection (nodewise regression) selects a "neighborhood" of the j-th (excess) asset return: the smallest subset of returns of the other assets in the portfolio on which this j-th asset return is conditionally dependent. All conditionally independent assets receive a zero in the precision matrix. This method carries an optimality property when the asset returns are normally distributed; the normality assumption is used only in this subsection. The best predictor for an excess asset return r*_{t,j} in the portfolio of p assets is its neighborhood. Denote this neighborhood by A_j. Then,

γ*_j = argmin_{γ_j: γ_{j,k}=0 ∀k∉A_j} E[r*_{t,j} − Σ_{k∈Γ_{−j}} γ_{j,k} r*_{t,k}]^2,

where A_j ⊆ Γ_{−j}, Γ_{−j} = Γ − {j}, and Γ = {1, 2, ..., p}. This is equation (2) in Meinshausen and Bühlmann (2006), where a detailed explanation for this result can be found.

A possible way of estimating the precision matrix when the number of assets is larger than the sample size is nodewise regression. In the time-series case, this is developed by Chang et al. (2019); Callot et al. (2019) also use these results for portfolio risk. Here, we summarize the concept as in Callot et al. (2019). It is based on the exact formula for the precision matrix, and we borrow the main concepts from Bühlmann and van de Geer (2011). The precision matrix estimate follows the steps below.

1. Lasso nodewise regression is defined, for each j = 1, 2, ..., p, as

γ̂_j = argmin_{γ∈R^{p−1}} [‖r*_j − r*_{−j} γ‖_2^2 / n + 2 λ_j ‖γ‖_1],    (12)

where λ_j is a positive tuning parameter (sequence) whose choice will be discussed in the simulation section. Let S_j be the set of coefficients that are nonzero in row j of Σ^{-1}, and let s_j = |S_j|
be their cardinality. The maximum number of nonzero coefficients is s̄ := max_{1≤j≤p} s_j. Therefore, we make a sparsity assumption. An alternative, though costly in notation, is weak sparsity, where we allow the absolute ℓ-th power sum of the coefficients in each row of the precision matrix to diverge, but not at a faster rate than the sample size. This, of course, demands a larger tuning parameter than does the sparsity assumption in practice. It is easy to incorporate weak sparsity into the proofs, as seen in Lemma 2.3 of van de Geer (2016). To avoid prolonging the paper, we have not pursued this track and require sparsity.

2. Set up the following matrix, which will be a key input to the precision matrix estimate: Ĉ_p is the p × p matrix with ones on the main diagonal and off-diagonal entries (Ĉ_p)_{j,k} = −γ̂_{j,k} for k ≠ j.
3. Another key input is the diagonal matrix built from the scalars τ̂^2_j, j = 1, ..., p, with

τ̂^2_j := ‖r*_j − r*_{−j} γ̂_j‖_2^2 / n + λ_j ‖γ̂_j‖_1.

Form T̂^2 := diag(τ̂^2_1, ..., τ̂^2_p), which is a p × p matrix.

4. Set the (nodewise) precision matrix estimate as Θ̂ = T̂^{-2} Ĉ_p.

We provide the first result, from Lemma 1 of Chang et al. (2019), in the following theorem. The iid data case with bounded moments is established in Theorem 1 of Caner and Kock (2018).
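The four estimation steps above can be sketched in a few lines of code. This is an illustrative implementation only: the function name is ours, scikit-learn's Lasso is our choice of solver, and a single λ is used for all j for brevity, whereas the paper tunes λ_j per regression.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_precision(R, lam):
    """Steps 1-4: nodewise lasso estimate of the precision matrix.

    R   : (n, p) matrix of demeaned excess returns r*.
    lam : tuning parameter; one value is used for every j here for brevity.
    """
    n, p = R.shape
    C_hat = np.eye(p)
    tau2_hat = np.zeros(p)
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        y, X = R[:, j], R[:, idx]
        # sklearn's Lasso minimizes ||y - X g||_2^2/(2n) + alpha*||g||_1,
        # which is half of the objective in (12) when alpha = lam, so the
        # minimizer gamma_hat_j is the same.
        g = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
        C_hat[j, idx] = -g
        resid = y - X @ g
        # tau_hat_j^2 as in step 3
        tau2_hat[j] = resid @ resid / n + lam * np.abs(g).sum()
    return np.diag(1.0 / tau2_hat) @ C_hat   # Theta_hat = T_hat^{-2} C_hat
```

Note that the returned p × p matrix is well defined even when p > n, which is the point of the construction: no inversion of the singular sample covariance matrix is required.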
Theorem 1. Under Assumptions 1-3,

(i) max_{1≤j≤p} ‖Θ̂_j − Θ_j‖_1 = O_p(s̄ √(ln p / n)).

(ii) ‖µ̂ − µ‖_∞ = O_p(√(ln p) / √n).

Note that Lemma 1 of Chang et al. (2019) applies to the estimation of the sample covariance, whereas our theorem also covers the estimation of the sample mean. From the proof of Lemma 1 for the sample covariance in Chang et al. (2019), sample mean estimation can also be shown. We provide the following assumption for the sparsity of the coefficients in the nodewise regression estimate.

Assumption 4.
We have the following sparsity condition: s̄ √(ln p) / √n = o(1).

This is standard in the high-dimensional econometrics literature. By Assumption 4 and Theorem 1, it is easy to see that the rows of the precision matrix are estimated consistently. The sparsity of the precision matrix does not imply that the covariance matrix is also sparse. It is possible, for example, to have a Toeplitz structure in the covariance matrix that is non-sparse while the precision matrix is sparse.
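The last point can be illustrated numerically with an AR(1)-type Toeplitz design of our own choosing: the covariance matrix is fully dense, while its inverse is exactly tridiagonal.

```python
import numpy as np

# An AR(1)-type Toeplitz covariance matrix Sigma_{jk} = phi^{|j-k|}
# (a toy example of ours) is fully dense, yet its inverse -- the
# precision matrix -- is exactly tridiagonal, hence sparse.
phi, p = 0.5, 6
J = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Sigma = phi ** J                          # every entry is nonzero
Theta = np.linalg.inv(Sigma)

off_band = J > 1                          # entries beyond the first off-diagonal
assert np.all(np.abs(Sigma) > 0)          # dense covariance matrix
assert np.allclose(Theta[off_band], 0.0)  # sparse (tridiagonal) precision matrix
```

The tridiagonal structure reflects the Markov property of an AR(1) process: conditional on its two neighbors, each coordinate is independent of the rest.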
In finance, our method considers the more complicated cases of p > n and p/n → ∞ when both p, n → ∞. We also allow the p = n case, which is a hindrance to the technical analysis in some shrinkage papers, such as the illuminating and very useful Ledoit and Wolf (2017). Our theorems also allow for non-iid data. Our technique should be seen as a complement to existing factor and shrinkage models, and it carries a certain optimality property, as outlined in subsection 2.3. Additionally, with our technique, one can obtain mean-variance efficiency even when p > n in the case of the maximum out-of-sample Sharpe ratio.

This section analyzes the maximum out-of-sample Sharpe ratio that is considered in Ao et al. (2019). To obtain that formula, we need the optimal weights of the portfolio. The optimization of the portfolio weights is formulated as

argmax_w w′µ subject to w′Σw ≤ σ^2,    (13)

where we maximize the return subject to a specified positive and finite risk constraint σ^2 > 0. When p < n, the maximum out-of-sample Sharpe ratio can be estimated with the inverse of the sample covariance matrix used as the precision matrix estimate:

ŜR_moscov := µ′Σ̂^{-1}µ̂ / [µ̂′Σ̂^{-1} Σ Σ̂^{-1} µ̂]^{1/2},    SR* := [µ′Σ^{-1}µ]^{1/2}.

Equation (1.1) of Ao et al. (2019) shows that this estimate is inconsistent when p/n → r ∈ (0, 1), and it is infeasible when p > n since Σ̂ is singular. Our maximum out-of-sample Sharpe ratio estimate using the nodewise estimate Θ̂ is:

ŜR_mosnw := µ′Θ̂µ̂ / [µ̂′Θ̂ Σ Θ̂ µ̂]^{1/2}.

Theorem 2.
Under Assumptions 1-4,

|ŜR_mosnw / SR* − 1| = O_p(s̄ √(ln p / n)) = o_p(1).

Remarks. 1. Note that p. 4353 of Ledoit and Wolf (2017) shows that maximizing the out-of-sample Sharpe ratio is equivalent to minimizing a certain loss function of the portfolio. The limit of the loss function is derived under an optimal shrinkage function in their Theorem 1. After that, they provide a shrinkage function even in the cases of p/n → c ∈ (0, 1) ∪ (1, +∞). Their proofs allow for iid data.

This subsection formally shows that we can obtain mean-variance efficiency in an out-of-sample context when the number of assets in the portfolio is larger than the sample size, a novel result in the literature. Ao et al. (2019) show that this is possible when p ≤ n, when both p and n are large. That article is a very important contribution since they also demonstrate that earlier methods could not obtain that result, and it is a difficult issue to address. Given a risk level of σ^2 > 0 and Θ := Σ^{-1}, the formula for the weights is

w_oos = σ Θµ / √(µ′Θµ),

with estimate

ŵ_oos = σ Θ̂µ̂ / √(µ̂′Θ̂µ̂).

We are interested in the maximized out-of-sample expected return µ′w_oos and its estimate µ′ŵ_oos. Additionally, we are interested in the out-of-sample variance of the portfolio returns, w′_oos Σ w_oos, and its estimate ŵ′_oos Σ ŵ_oos. Note also that by the formula for the weights, w′_oos Σ w_oos = σ^2, given Θ := Σ^{-1}. Below, we show that our estimates based on nodewise regression are consistent, and furthermore, we provide rates of convergence.

Theorem 3. Under Assumptions 1-4,

(i) |µ′ŵ_oos / (µ′w_oos) − 1| = O_p(s̄ √(ln p / n)) = o_p(1).

(ii) |ŵ′_oos Σ ŵ_oos − σ^2| = O_p(s̄ √(ln p / n)) = o_p(1).

Remarks. 1. From the results, we allow p > n, and still there is consistency.
Additionally, the sparsity level s̄, the maximum number of nonzero elements in a row of the precision matrix, can grow to infinity, but at a rate not larger than s̄ = o((n/ln p)^{1/2}) for the case in (i).

2. Therefore, we can allow p = exp(n^κ), with 0 < κ < 1, and s̄ can be a slowly varying function in n. This clearly shows that it is possible to have p/n → ∞ in that scenario. In Theorem 3(i), we can have p = n^2 and s̄ = o((n/ln n)^{1/2}), with p/n → ∞. The case of p = 2n is also possible, with s̄ = o((n/ln n)^{1/2}) and p/n = 2.

3. From the convergence rates, it is clear that we are penalized by the number of assets, but only in a logarithmic fashion; hence, our method is feasible in large-portfolio cases.

4. Ao et al. (2019) provide new results on the mean-variance efficiency of a large portfolio when p ≤ n and the returns of the assets are normally distributed. They provide a novel way of estimating return and risk, which involves lasso-sparse estimation of the weights of the portfolio.

In this section, we define the maximum Sharpe ratio when the weights of the portfolio are normalized to sum to one. This in turn depends on a critical term that determines the formula below. The maximum Sharpe ratio is defined as follows, with w as the p × 1 vector of portfolio weights:

max_w w′µ / √(w′Σw),    s.t. 1′_p w = 1,

where 1_p is a vector of ones. This maximum Sharpe ratio is constrained to have portfolio weights that sum to one. Maller et al. (2016) show that, depending on a scalar, it has two solutions. When 1′_p Σ^{-1} µ ≥
0, we have the square of the maximum Sharpe ratio:
MSR^2 = µ′Σ^{-1}µ.    (14)

When 1′_p Σ^{-1} µ <
0, we have
MSR^2_c = µ′Σ^{-1}µ − (1′_p Σ^{-1} µ)^2 / (1′_p Σ^{-1} 1_p).    (15)

This is equation (6.1) of Maller et al. (2016). Equation (14) is used in the literature, and it is the formula when the weights do not necessarily sum to one given a return constraint, as in Ao et al. (2019).

These equations can be estimated by their sample counterparts, but in the case of p > n, Σ̂ is not invertible, so we need to use new tools from high-dimensional statistics. We use the nodewise regression precision matrix estimate of Meinshausen and Bühlmann (2006), denoted Θ̂, and analyze the asymptotic behavior of the resulting estimate of the squared maximum Sharpe ratio. We will also introduce a maximum Sharpe ratio that addresses the uncertainty regarding whether we should analyze MSR^2 or MSR^2_c. This is

(MSR*)^2 = MSR^2 1{1′_p Σ^{-1} µ ≥ 0} + MSR^2_c 1{1′_p Σ^{-1} µ < 0}.

The estimators of
MSR^2, MSR^2_c, and (MSR*)^2 will be introduced in the next subsection. First, when 1′_p Σ^{-1} µ ≥
0, we have the square of the maximum Sharpe ratio as in (14). To obtain an estimate by nodewise regression, we replace Σ^{-1} with Θ̂. Namely, the estimate of the squared maximum Sharpe ratio is:

M̂SR^2 = µ̂′Θ̂µ̂.    (16)

Using the result in Theorem 1, we can obtain the consistency of the squared maximum Sharpe ratio.

Theorem 4. Under Assumptions 1-4 with 1′_p Σ^{-1} µ ≥ 0,

|M̂SR^2 / MSR^2 − 1| = O_p(s̄ √(ln p / n)) = o_p(1).

Remark. To the best of our knowledge, no existing result deals with the MSR when p > n and p can grow exponentially in n. We also allow for time-series data and establish a rate of convergence. The rate shows that precision matrix non-sparsity can badly affect the estimation error. The number of assets, on the other hand, increases the error only on a logarithmic scale.

Note that the maximum Sharpe ratio above relies on 1′_p Σ^{-1} µ ≥
0, where 1_p is a column vector of ones. This was recently pointed out in equation (6.1) of Maller et al. (2016). If 1′_p Σ^{-1} µ <
0, the Sharpe ratio is minimized, as shown on p. 503 of Maller and Turkington (2002). The estimate of the new maximal squared Sharpe ratio in the case of 1′_p Σ^{-1} µ < 0 is

M̂SR^2_c = µ̂′Θ̂µ̂ − (1′_p Θ̂ µ̂)^2 / (1′_p Θ̂ 1_p).    (17)

The optimal portfolio allocation for such a case is given in (2.10) of Maller and Turkington (2002). The limit for such estimators when the number of assets p is fixed is given in Theorems 3.1b-c of Maller et al. (2016).

We set up some notation for the next theorem. Set A := 1′_p Σ^{-1} 1_p / p, B := 1′_p Σ^{-1} µ / p, and D := µ′Σ^{-1}µ / p.

Theorem 5. If 1′_p Σ^{-1} µ < 0, then under Assumptions 1-4 with AD − B^2 ≥ C > 0, where C is a positive constant,

|M̂SR^2_c / MSR^2_c − 1| = O_p(s̄ √(ln p / n)) = o_p(1).

Remarks. 1. The condition AD − B^2 ≥ C > 0 rules out a degenerate case. 2. Our result allows p > n, and time-series data are allowed, unlike the iid or normal-return cases in the literature when dealing with large p, n. Theorem 5 is new and will help us establish a new MSR result in the following theorem.

We provide an estimate that takes into account the uncertainty about the term 1′_p Σ^{-1} µ. Note that the term can be consistently estimated, as shown in Lemma A.3 in the Supplementary Appendix. A practical estimate of the maximum squared Sharpe ratio that will be consistent is:

(M̂SR*)^2 = M̂SR^2 1{1′_p Θ̂ µ̂ > 0} + M̂SR^2_c 1{1′_p Θ̂ µ̂ < 0},

where we exclude the case of 1′_p Θ̂ µ̂ = 0 in the estimator; that specific scenario is very restrictive in terms of returns and variance. Note that under a mild assumption on the term, we show by (A.44), (A.45), (A.48), and (A.49) that when 1′_p Σ^{-1} µ >
0, we have 1′_p Θ̂ µ̂ >
0, and when 1′_p Σ^{-1} µ <
0, we have 1′_p Θ̂ µ̂ < 0.

Theorem 6.
Under Assumptions 1-4 with AD − B^2 ≥ C > 0, where C is a positive constant, and assuming |1′_p Σ^{-1} µ| / p ≥ C > ǫ > 0, with a sufficiently small positive ǫ and C being a positive constant,

|(M̂SR*)^2 / (MSR*)^2 − 1| = O_p(s̄ √(ln p / n)) = o_p(1).

Remarks. 1. The condition |1′_p Σ^{-1} µ| / p ≥ C > ǫ > 0 resembles the β_min condition used in high-dimensional statistics to achieve model selection. Note further that since Θ = Σ^{-1},

|1′_p Θ µ / p| = |Σ_{j=1}^p Σ_{k=1}^p Θ_{j,k} µ_k / p|,

which is, roughly, a sum measure of theoretical means divided by standard deviations. It is difficult to see how this double sum in p could be a small number, unless the terms in the sum cancel one another out; our assumption excludes that type of case. Additionally, ǫ is not arbitrary: from the proof, it is the upper bound on |B̂ − B| in Lemma A.3 in the Supplementary Appendix, and it is of order

ǫ = O(s̄ √(ln p / n)) = o(1),

where the asymptotic smallness follows from Assumption 4.

2. In the case of p > n, we only consider consistency, since standard central limit theorems (apart from those on rectangles or sparse convex sets) do not apply, and ideas such as the multiplier bootstrap and the empirical bootstrap with self-normalized moderate deviation results do not extend to this specific Sharpe ratio formulation.

3. This is a new result under the assumption that all portfolio weights sum to one and under uncertainty about the term 1′_p Σ^{-1} µ. We allow p > n and time-series data.

4. When the precision matrix is non-sparse, i.e., s̄ = p, the rate of convergence is p √(ln p / n). To have the estimation error converge to zero, we need p √(ln p) = o(n^{1/2}). In the non-sparse precision matrix case, we clearly allow only p << n.

Here, we provide consistent estimates of the Sharpe ratios of the global minimum-variance and Markowitz mean-variance portfolios when p > n.
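Before turning to those portfolios, the switching estimator (M̂SR*)^2 from the previous section can be sketched as follows. This is an illustrative implementation only; the function and argument names are ours.

```python
import numpy as np

def msr_star_squared(Theta_hat, mu_hat):
    """Plug-in estimate of (MSR*)^2, switching between (16) and (17) on the
    sign of 1_p' Theta_hat mu_hat. The knife-edge case
    1_p' Theta_hat mu_hat = 0 is excluded in the paper."""
    ones = np.ones(mu_hat.shape[0])
    B_hat = ones @ Theta_hat @ mu_hat
    msr2 = mu_hat @ Theta_hat @ mu_hat               # (16)
    if B_hat > 0:
        return msr2
    # (17): subtract the correction term when 1_p' Theta_hat mu_hat < 0
    return msr2 - B_hat ** 2 / (ones @ Theta_hat @ ones)
```

For example, with Theta_hat the 3 × 3 identity matrix and mu_hat = (1, 2, 3)′ this returns 14; flipping the sign of mu_hat makes 1′_p Θ̂ µ̂ negative and triggers the corrected formula (17).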
In this part, we analyze not the maximum Sharpe ratio under the constraint that the portfolio weights add up to one, but the Sharpe ratio of the global minimum-variance portfolio. This is the portfolio whose weights are chosen to minimize the variance of the portfolio subject to the weights summing to one. Specifically,

w_u = argmin_{w∈R^p} w′Σw, such that w′1_p = 1.

In the main, this is similar to the maximum Sharpe ratio problem, but we minimize the square of the denominator in the Sharpe ratio definition subject to the same constraint as in the maximum Sharpe ratio case above. The solution to this problem is well known and is given by

w_u = Σ^{-1} 1_p / (1′_p Σ^{-1} 1_p).

Next, substitute these weights into the Sharpe ratio formula, normalized by the number of assets:

SR = w′_u µ / √(w′_u Σ w_u) = √p (1′_p Σ^{-1} µ / p)(1′_p Σ^{-1} 1_p / p)^{-1/2}.    (18)

We estimate (18) by nodewise regression:

ŜR_nw = √p (1′_p Θ̂ µ̂ / p)(1′_p Θ̂ 1_p / p)^{-1/2}.    (19)

To the best of our knowledge, the following theorem is a novel result in the literature when p > n; it establishes both consistency and the rate of convergence of the Sharpe ratio estimate for the global minimum-variance portfolio.

Theorem 7. Under Assumptions 1-4 with |1′_p Σ^{-1} µ| / p ≥ C > ǫ > 0,

|ŜR_nw / SR − 1| = O_p(s̄ √(ln p / n)) = o_p(1).

Remark. We see that a large p affects the error only by a logarithmic factor. The estimation error increases with the non-sparsity of the precision matrix.

Markowitz (1952) portfolio selection is defined as finding the smallest variance given a desired expected return ρ. The decision problem is

w_MV = argmin_{w∈R^p} (w′Σw) such that w′1_p = 1, w′µ = ρ.
The formula for the optimal weight is

w_MV = [ (μ' Σ^{-1} μ) − ρ_1 (1_p' Σ^{-1} μ) ] / [ (1_p' Σ^{-1} 1_p)(μ' Σ^{-1} μ) − (1_p' Σ^{-1} μ)² ] (Σ^{-1} 1_p)
     + [ ρ_1 (1_p' Σ^{-1} 1_p) − (1_p' Σ^{-1} μ) ] / [ (1_p' Σ^{-1} 1_p)(μ' Σ^{-1} μ) − (1_p' Σ^{-1} μ)² ] (Σ^{-1} μ),

which can be rewritten as

w_MV = [ (D − ρ_1 B) / (AD − B²) ] (Σ^{-1} 1_p / p) + [ (ρ_1 A − B) / (AD − B²) ] (Σ^{-1} μ / p),  (20)

where we use the A, B, D formulas: A := 1_p' Σ^{-1} 1_p / p, B := 1_p' Σ^{-1} μ / p, D := μ' Σ^{-1} μ / p. We define the estimators of these terms as Â := 1_p' Θ̂ 1_p / p, B̂ := 1_p' Θ̂ μ̂ / p, D̂ := μ̂' Θ̂ μ̂ / p.

The optimal variance of the portfolio in this scenario, normalized by the number of assets, is

V = (1/p) [ (A ρ_1² − 2 B ρ_1 + D) / (AD − B²) ].

The estimate of that variance is

V̂ = (1/p) [ (Â ρ_1² − 2 B̂ ρ_1 + D̂) / (Â D̂ − B̂²) ].

By our constraint, we obtain w_MV' μ = ρ_1. Using the variance V above,

SR_MV = ρ_1 √( p (AD − B²) / (A ρ_1² − 2 B ρ_1 + D) ),

and its estimate is

SR-hat_MV = ρ_1 √( p (Â D̂ − B̂²) / (Â ρ_1² − 2 B̂ ρ_1 + D̂) ).

We provide the consistency of the maximum Sharpe ratio (squared) in this framework when the number of assets is larger than the sample size. This is a novel result in the literature.
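The estimators in (18)-(20) can be sketched numerically as follows; this is an illustrative implementation (not from the paper), with `Theta_hat` and `mu_hat` standing for any precision-matrix and mean estimates:

```python
import numpy as np

def _abd(Theta_hat, mu_hat):
    """The A, B, D scalars: 1'Theta1/p, 1'Theta mu/p, mu'Theta mu/p."""
    p = len(mu_hat)
    ones = np.ones(p)
    A = ones @ Theta_hat @ ones / p
    B = ones @ Theta_hat @ mu_hat / p
    D = mu_hat @ Theta_hat @ mu_hat / p
    return A, B, D

def gmv_sharpe(Theta_hat, mu_hat):
    """Eq. (19): sqrt(p) * (1'Theta mu / p) * (1'Theta 1 / p)^(-1/2)."""
    p = len(mu_hat)
    A, B, _ = _abd(Theta_hat, mu_hat)
    return float(np.sqrt(p) * B / np.sqrt(A))

def markowitz_weights(Theta_hat, mu_hat, rho):
    """Eq. (20): weights meeting w'1 = 1 and w'mu = rho."""
    p = len(mu_hat)
    ones = np.ones(p)
    A, B, D = _abd(Theta_hat, mu_hat)
    denom = A * D - B ** 2
    return ((D - rho * B) / denom) * (Theta_hat @ ones / p) \
         + ((rho * A - B) / denom) * (Theta_hat @ mu_hat / p)

def markowitz_sharpe(Theta_hat, mu_hat, rho):
    """SR_MV = rho * sqrt(p (AD - B^2) / (A rho^2 - 2 B rho + D))."""
    p = len(mu_hat)
    A, B, D = _abd(Theta_hat, mu_hat)
    return float(rho * np.sqrt(p * (A * D - B ** 2) / (A * rho ** 2 - 2 * B * rho + D)))
```

A quick sanity check of the algebra: the weights returned by `markowitz_weights` satisfy both constraints exactly, and when `Theta_hat` is the exact inverse of Σ, `markowitz_sharpe` agrees with ρ_1 / √(w' Σ w).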
Theorem 8.
Under Assumptions 1-4, with the conditions |1_p' Σ^{-1} μ|/p ≥ C_2 > ε > 0, AD − B² ≥ C_1 > 0 and A ρ_1² − 2 B ρ_1 + D ≥ C_3 > 0, and with ρ_1 uniformly bounded away from zero and infinity, we have

| SR-hat_MV² / SR_MV² − 1 | = O_p(s̄ √(ln p / n)) = o_p(1).

Remarks. 1. The conditions AD − B² ≥ C_1 > 0 and A ρ_1² − 2 B ρ_1 + D ≥ C_3 > 0 keep the denominators in the Sharpe ratio formula bounded away from zero. 2. The number of assets p affects the error only in a logarithmic way, and the non-sparsity of the precision matrix affects the error in a linear way.

In this section, we compare the nodewise regression with several models in a simulation exercise. The two aims of the exercise are to determine whether our method achieves consistency under a sparse setup and to check, under two different setups, how our method performs compared to others in the estimation of the constrained maximum Sharpe ratio, the out-of-sample maximum Sharpe ratio, and the Sharpe ratios of the global minimum-variance and Markowitz mean-variance portfolios.

The other methods, which are widely used in the literature and benefit from high-dimensional techniques, are the principal orthogonal complement thresholding (POET) of Fan et al. (2013), the nonlinear shrinkage (NL-LW) and single-factor nonlinear shrinkage (SF-NL-LW) of Ledoit and Wolf (2017), and the maximum Sharpe ratio estimation and sparse regression (MAXSER) of Ao et al. (2019). All models except MAXSER are plug-in estimators, where the first step is to estimate the precision/covariance matrix and the second step is to plug the estimate into the desired equation.

The POET uses principal components to estimate the covariance matrix and allows some eigenvalues of Σ to be spiked and grow at a rate O(p), which allows common and idiosyncratic components to be identified; principal components analysis can then consistently estimate the space spanned by the eigenvectors of Σ. However, Fan et al.
(2013) point out that the absolute convergence rate of the model is not satisfactory for estimating Σ, and consistency can only be achieved in terms of the relative error matrix.

Nonlinear shrinkage is a method that individually determines the amount of shrinkage of each eigenvalue in the covariance matrix with respect to a particular loss function. The main aim is to increase the value of the lowest eigenvalues and decrease the largest eigenvalues in order to stabilize the high-dimensional covariance matrix. Ledoit and Wolf (2017) propose a loss function that captures the objective of an investor using portfolio selection. As a result, they obtain an estimator of the covariance matrix that is optimal for portfolio selection with a large number of assets. The SF-NL-LW method extracts a single-factor structure from the data prior to the estimation of the covariance matrix; the factor is simply an equal-weighted portfolio of all assets.

Finally, the MAXSER starts with the estimation of the adjusted squared maximum Sharpe ratio, which is then used in a penalized regression to obtain the portfolio weights. Of all the discussed models, the MAXSER is the only one that does not use an estimate of the precision matrix in a plug-in estimator of the maximum Sharpe ratio.

Regarding implementation, the POET and both models from Ledoit and Wolf (2017) are available in the R packages POET (Fan et al., 2016) and nlshrink (Ramprasad, 2016). The SF-NL-LW needed some minor adjustments following the procedures described in Ledoit and Wolf (2017). For the MAXSER, we followed the steps for the non-factor case in Ao et al. (2019) and used the package lars (Hastie and Efron, 2013) for the penalized regression estimation. We estimated the nodewise regression following the steps in Section 2.4, using the glmnet package (Friedman et al., 2010) for the penalized regressions.
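As a rough illustration of the nodewise construction used here (Meinshausen and Bühlmann, 2006): each asset's return series is lasso-regressed on all the others, and the precision matrix is assembled row by row from the coefficients γ̂_j and the scalings τ̂_j². The sketch below is an assumption-laden stand-in, not the paper's implementation: it uses scikit-learn's `Lasso` in place of glmnet and a single fixed penalty `lam` for every regression rather than the CV/GIC choices the paper employs.

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_precision(X, lam):
    """Nodewise-regression estimate of the precision matrix.

    For each column j, lasso-regress X[:, j] on the remaining columns,
    then fill row j of Theta_hat using
        tau_j^2 = ||X_j - X_{-j} gamma_j||^2 / n + lam * ||gamma_j||_1,
        Theta[j, j] = 1 / tau_j^2,  Theta[j, k] = -gamma_{j,k} / tau_j^2.
    """
    n, p = X.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, j])
        gamma = fit.coef_
        resid = X[:, j] - X[:, others] @ gamma
        tau2 = resid @ resid / n + lam * np.abs(gamma).sum()
        Theta[j, j] = 1.0 / tau2
        Theta[j, others] = -gamma / tau2
    return Theta
```

For nearly uncorrelated columns the lasso sets most off-diagonal entries exactly to zero, and the diagonal entries approximate the inverse residual variances.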
We used two alternatives to select the regularization parameter λ: 10-fold cross-validation (CV) and the generalized information criterion (GIC) of Zhang et al. (2010).

The GIC procedure starts by fitting γ̂_j in Subsection 2.4 for a range of λ_j that goes from the intercept-only model to the largest feasible model; this is done automatically by the glmnet package. Then, for each λ_j in the range of all possible tuning parameters, we calculate the information criterion

GIC_j(λ_j) = SSR(λ_j)/n + q(λ_j) log(p − 1) log(log(n))/n,  (21)

where SSR(λ_j) is the sum of squared residuals for a given λ_j, q(λ_j) is the number of nonzero coefficients in the model given λ_j, and p is the number of assets. The last step is to select the model with the smallest GIC. Once this is done for all assets j = 1, . . . , p, we can proceed to obtain Θ̂_GIC.

For the CV procedure, we split the sample into k subsamples and fit the model for a range of λ_j as in the GIC procedure. However, we fit the models in the subsamples: we always estimate the models in k − 1 subsamples and keep the remaining subsample as a test set. We finally compute the average MSE across all subsamples and select the λ_j for each asset j that yields the smallest average MSE. We can then use the estimated γ̂_j to obtain Θ̂_CV.

We used two DGPs to test the nodewise regression. The first DGP consists of a Toeplitz covariance matrix of excess asset returns, where Σ_{i,j} = ρ^{|i−j|}.
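The GIC rule in (21) can be sketched as follows; `ssr_path` and `q_path` are assumed to come from a lasso path already fitted for asset j (glmnet in the paper), so this fragment only shows the selection step:

```python
import numpy as np

def gic(ssr, q, n, p):
    """Eq. (21): GIC(lambda) = SSR(lambda)/n + q(lambda) * log(p-1) * log(log(n)) / n."""
    return ssr / n + q * np.log(p - 1) * np.log(np.log(n)) / n

def select_by_gic(ssr_path, q_path, n, p):
    """Return the index of the lambda with the smallest GIC along the path."""
    vals = [gic(s, q, n, p) for s, q in zip(ssr_path, q_path)]
    return int(np.argmin(vals))
```

The criterion trades off fit (SSR) against the number of active coefficients, with a penalty that grows with the dimension p; larger models are only selected when the drop in SSR outweighs the penalty.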
The values of ρ considered are 0.25, 0.5 and 0.75, and the vector μ is sampled from a normal distribution. The second DGP is based on a factor model for individual stock returns:

r_j = α_j + sum_{k=1}^{K} β_{j,k} f_k + e_j,  (22)

where f_k are the factor returns, β_{j,k} are the individual stock sensitivities to the factors, and α_j + e_j represents the idiosyncratic component of each stock. We adopted the Fama-French three-factor (FF3) monthly returns as factors, with μ_f and Σ_f being the factors' sample mean and covariance matrix. The βs, αs and Σ̂_e were estimated by simple least squares regressions using returns from the S&P 500 stocks that were part of the index over the entire period from 2008 to 2017. In each simulation, we randomly selected p stocks from the pool with replacement, because our simulations require more than the total number of available stocks. We then used the selected stocks to generate individual returns with Σ_e = γ diag(ê_j), where γ is set to 1, 2 and 4. The factors are book-to-market, market capitalization and the excess return of the market portfolio.

In the results tables, the first four columns for each sample size are for p = 0.5n, and the last four columns are for p = 1.5n. MSR, MSR-OOS, GMV-SR and MKW-SR denote the constrained maximum Sharpe ratio, the out-of-sample maximum Sharpe ratio, the Sharpe ratio of the global minimum-variance portfolio and the Sharpe ratio of the Markowitz portfolio with target return set to 1%, respectively. Therefore, there are four categories in which to evaluate the different estimates. The MAXSER risk constraint was set to 0.04 following Ao et al. (2019). We ran 100 iterations in each simulation setup. All boldface entries in the tables show category champions.

Starting with Table 1, we clearly see that our method performs very well in a sparse Toeplitz scenario. When the correlation is 0.5 or 0.75, our method has the smallest error of all those tested for MSR and MSR-OOS. We also see that in the GMV-SR and MKW-SR categories, the SF-NL-LW method generally performs best. To give a specific example, with n = 400, p = 600 and ρ = 0.75, our OOS-MSR error is 0.118 (GIC-based nodewise), the second best is our CV-based nodewise with 0.259, and the third is SF-NL-LW with a 0.868 error. On the other hand, in the GMV-SR category, the best is SF-NL-LW with a 0.551 error, whereas our best method is GIC-based nodewise with a 0.664 error, third among all methods.

Additionally, we see that consistency is achieved with our methods, as our theorems suggest under sparse scenarios such as that of Table 1. To see this, with n = 100, p = 150, our error in the OOS-MSR category is 0.336 (GIC-based nodewise) and declines to 0.118 at n = 400, p = 600 for ρ = 0.75. Similar results hold in all other categories for our method in Table 1.

Table 2 paints a different picture under a factor model scenario: both NL-LW and SF-NL-LW perform best in minimizing the errors for the constrained maximum Sharpe ratio and for the global minimum-variance and Markowitz mean-variance portfolios. We also note that MAXSER generally obtains the best results in estimating the out-of-sample maximum Sharpe ratio when p = n/2.

Table 1: Simulation Results – Toeplitz DGP
[Table 1: average absolute estimation errors for MSR, OOS-MSR, GMV-SR and MKW-SR by method (NW-GIC, NW-CV, POET, NL-LW, SF-NL-LW, MAXSER), sample size (n = 100, 200, 400) and dimension (p = n/2, p = 1.5n), in blocks for ρ = 0.25, 0.50 and 0.75.]

The table shows the simulation results for the Toeplitz DGP. Each simulation was done with 100 iterations. We used sample sizes n of 100, 200 and 400, and the number of stocks was either n/2 or 1.5n for the low-dimensional and the high-dimensional case, respectively. Each block of rows shows the results for a different value of ρ in the Toeplitz DGP. The values in each cell show the average absolute estimation error, across iterations, for the square of the Sharpe ratio in the case of the global minimum-variance and Markowitz mean-variance portfolios in Section 5 and for the maximum Sharpe ratio in the case of out-of-sample forecasting and constrained portfolio optimization in Sections 3-4.

Table 2: Simulation Results – Factor DGP

[Table 2: average absolute estimation errors in the same four categories for the factor DGP, in blocks for γ = 1, 2 and 4, with columns for n = 100, 200, 400 and p = n/2, p = 1.5n.]

The table shows the simulation results for the factor DGP. Each simulation was done with 100 iterations. We used sample sizes n of 100, 200 and 400, and the number of stocks was either n/2 or 1.5n for the low-dimensional and the high-dimensional case, respectively. Each block of rows shows the results for a different value of γ in the factor DGP. The values in each cell show the average absolute estimation error, across iterations, for the square of the Sharpe ratio in the case of the global minimum-variance and Markowitz mean-variance portfolios in Section 5 and for the maximum Sharpe ratio in the case of out-of-sample forecasting and constrained portfolio optimization in Sections 3-4.

Empirical Application
For the empirical application, we use two subsamples. The first subsample uses all data from January 1995 to December 2017, with an out-of-sample period from January 2005 to December 2017. We selected all stocks that were in the S&P 500 index for at least one month in the out-of-sample period and have data for the entire 1995-2017 period, which resulted in 383 stocks. The second subsample starts in January 1990 and ends in December 2017, with an out-of-sample period from January 2000 to December 2017. Using the same criterion as in the first subsample, the number of stocks was 323. Given that this is an out-of-sample competition between models, we only estimated GMV and Markowitz portfolios for the plug-in estimators. The first out-of-sample period includes only the recession of 2008; the second includes the recessions of 2000 and 2008. Both out-of-sample periods reflect recent history.

The Markowitz return constraint ρ_1 is 0.8% per month, and the MAXSER risk constraint is 4%. In the low-dimensional experiment, we randomly select 50 stocks from the pool to estimate the models. In the high-dimensional case, we use all stocks.

We use a rolling-window setup for the out-of-sample estimation of the Sharpe ratio following Callot et al. (2019). Specifically, samples of size n are divided into in-sample (1 : n_I) and out-of-sample (n_I + 1 : n) portions. We start by estimating the portfolio ŵ_{n_I} in the in-sample period and the out-of-sample portfolio return ŵ_{n_I}' r_{n_I+1}. Then, we roll the window forward by one element (2 : n_I + 1) and form a new in-sample portfolio ŵ_{n_I+1} and out-of-sample portfolio return ŵ_{n_I+1}' r_{n_I+2}. This procedure is repeated until the end of the sample.

The out-of-sample average return and variance without transaction costs are

μ̂_os = (1/(n − n_I)) sum_{t=n_I}^{n−1} ŵ_t' r_{t+1},   σ̂²_os = (1/(n − n_I − 1)) sum_{t=n_I}^{n−1} (ŵ_t' r_{t+1} − μ̂_os)².

We estimate the Sharpe ratios with and without transaction costs.
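The rolling-window construction of out-of-sample returns can be sketched as follows; `fit_weights` is a placeholder for whichever portfolio estimator is being evaluated (GMV, Markowitz, etc.), and the variance uses the (n − n_I − 1) divisor:

```python
import numpy as np

def rolling_oos_returns(returns, n_I, fit_weights):
    """Roll a window of length n_I through a (T x p) return matrix;
    fit_weights maps an in-sample block of returns to portfolio weights."""
    n = len(returns)
    out = []
    for start in range(n - n_I):
        w = fit_weights(returns[start:start + n_I])
        out.append(w @ returns[start + n_I])   # next-period realized return
    return np.array(out)

def oos_sharpe(port_returns):
    """SR = mu_os / sigma_os with the sample (ddof = 1) standard deviation."""
    return port_returns.mean() / port_returns.std(ddof=1)
```

With a sample of length n, the loop produces exactly n − n_I out-of-sample portfolio returns, matching the averaging in the formulas above.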
The transaction cost, c, is set to 50 basis points following DeMiguel et al. (2007). Let r_{P,t+1} = ŵ_t' r_{t+1} be the return of the portfolio in period t + 1; in the presence of transaction costs, the returns are defined as

r^{Net}_{P,t+1} = r_{P,t+1} − c (1 + r_{P,t+1}) sum_{j=1}^{p} |ŵ_{t+1,j} − ŵ⁺_{t,j}|,

where ŵ⁺_{t,j} = ŵ_{t,j} (1 + R_{t+1,j}) / (1 + R_{t+1,P}), and R_{t,j} and R_{t,P} are the excess returns of asset j and of the portfolio P added to the risk-free rate. The adjustment in ŵ⁺_{t,j} is needed because the portfolio at the end of the period has changed relative to the portfolio at the beginning of the period. The Sharpe ratio is calculated from the average return and the variance of the portfolio in the out-of-sample period:

SR = μ̂_os / σ̂_os.

The portfolio returns are replaced by the returns with transaction costs when we calculate the Sharpe ratio with transaction costs.

We use the same test as Ao et al. (2019) to compare the models. Specifically,

H_0 : SR_Best ≤ SR_m   vs   H_a : SR_Best > SR_m,  (23)

where SR_Best is the Sharpe ratio of the model with the largest Sharpe ratio, which is tested against each remaining model m. This is the Jobson and Korkie (1981) test with the Memmel (2003) correction. We also considered the method of Ledoit and Wolf (2008) for testing the significance of the winner and using the equally weighted portfolio as a benchmark; the results were very similar and hence are not reported.

We also included the equally weighted portfolio (EW). GMV-NW-GIC and GMV-NW-CV denote the nodewise method with the GIC and cross-validation tuning parameter choices, respectively, in the global minimum-variance portfolio (GMV). GMV-POET, GMV-NL-LW and GMV-SF-NL-LW denote the POET, nonlinear shrinkage and single-factor nonlinear shrinkage methods, respectively, described in the simulation section and also used in the global minimum-variance portfolio. The MAXSER is also used and explained in the simulation section.
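The transaction-cost adjustment described above can be sketched as follows; this is an illustrative fragment, with c defaulting to the 50-basis-point cost used in the paper:

```python
import numpy as np

def drifted_weights(w, R_assets, R_port):
    """w+_{t,j} = w_{t,j} * (1 + R_{t+1,j}) / (1 + R_{t+1,P}): the holdings
    after returns have moved the portfolio away from its initial weights."""
    return w * (1.0 + R_assets) / (1.0 + R_port)

def net_return(r_p, w_new, w_plus, c=0.005):
    """r_net = r_p - c * (1 + r_p) * sum_j |w_{t+1,j} - w+_{t,j}|."""
    return r_p - c * (1.0 + r_p) * np.abs(w_new - w_plus).sum()
```

If no rebalancing is needed (the new weights equal the drifted ones), the net return equals the gross return; any trading reduces it in proportion to the total traded weight.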
MW denotes the Markowitz mean-variance portfolio, and MW-NW-GIC denotes the nodewise method with GIC tuning parameter selection in the Markowitz portfolio. All the other methods with MW headers are analogous and thus self-explanatory.

The results are presented in Tables 3 and 4. Table 3 shows the results for the 2005-2017 out-of-sample period. Nodewise methods dominate in terms of the Sharpe ratio in Table 3. For example, with transaction costs in the high-dimensional portfolio category, in terms of the Sharpe ratio (SR) (averaged over the out-of-sample time period), GMV-NW-GIC is the best model, with an SR of 0.208; GMV-POET, GMV-NL-LW and GMV-SF-NL-LW have SRs of 0.175, 0.144 and 0.140, respectively. If we analyze only the Markowitz portfolios in Table 3, with transaction costs in high dimensions, MW-NW-GIC has the highest SR, 0.205. Therefore, even in subcategories, the nodewise method dominates. Although statistical significance is not established, it is not clear that these significance tests have high power in our high-dimensional cases.

Table 4 shows the results for the January 2000-2017 out-of-sample subsample. We see that nodewise methods dominate in all scenarios except the low-dimensional case with no transaction costs. In the case of high dimensionality with transaction costs, MW-NW-GIC (Markowitz nodewise-GIC) has an SR of 0.224, and the closest competitor is EW with 0.207.

Table 3: Empirical Results – Out-of-Sample Period from Jan. 2005 to Dec. 2017
Columns: Without TC (Low Dim., High Dim.) and With TC (Low Dim., High Dim.); within each, SR, Avg., SD, p-value.
EW: 0.221 0.010 0.047 0.074 | 0.200 0.010 0.049 0.283 | 0.215 0.010 0.047 0.178 | 0.194 0.009 0.049 0.520
GMV-NW-GIC: 0.260 0.009 0.036 0.200 | 0.215 0.009 0.040 0.215 | 0.247 0.009 0.036 0.454 | [...]
[remaining rows of Table 3 omitted]
In Table 5, we analyze the turnover, leverage and maximum leverage (equations (24), (25) and (26), respectively) of the portfolios in Tables 3-4. The definitions are as follows. Turnover:

turnover = sum_{j=1}^{p} |ŵ_{t+1,j} − ŵ⁺_{t,j}|,  (24)

leverage:

leverage = | sum_{j=1}^{p} min{ŵ_{t+1,j}, 0} |,  (25)

and maximum leverage:

max leverage = max_j | min{ŵ_{t+1,j}, 0} |.  (26)

Our method performs very well compared to the others in terms of turnover, leverage and maximum leverage; the nodewise-based methods even come closest to the EW (equally weighted) portfolio. To provide some perspective, in Table 5, in high dimensions from January 2005 to December 2017, the nodewise GMV-NW-GIC has a turnover of 0.057, which is much smaller than those of GMV-POET, GMV-NL-LW and GMV-SF-NL-LW of 0.092, 0.328 and 0.323, respectively.

Table 4: Empirical Results – Out-of-Sample Period from Jan. 2000 to Dec. 2017
Columns: Without TC (Low Dim., High Dim.) and With TC (Low Dim., High Dim.); within each, SR, Avg., SD, p-value.
EW: 0.203 0.010 0.052 0.430 | 0.214 0.010 0.047 0.274 | 0.197 0.010 0.052 0.178 | 0.207 0.010 0.047 0.405
GMV-NW-GIC: 0.228 0.009 0.040 0.607 | 0.227 0.009 0.039 0.215 | 0.217 0.009 0.040 0.151 | 0.219 0.008 0.039 0.482
GMV-NW-CV: 0.227 0.009 0.041 0.606 | 0.227 0.009 0.038 0.262 | 0.213 0.009 0.041 0.103 | 0.211 0.008 0.038 0.153
GMV-POET: 0.201 0.007 0.035 0.337 | 0.191 0.006 0.033 0.513 | 0.187 0.006 0.035 0.344 | 0.174 0.006 0.033 0.462
GMV-NL-LW: 0.248 0.008 0.033 0.708 | 0.201 0.006 0.030 0.617 | 0.218 0.007 0.033 0.742 | 0.146 0.004 0.030 0.265
GMV-SF-NL-LW: 0.247 0.008 0.032 0.812 | 0.204 0.006 0.029 0.663 | 0.221 0.007 0.032 0.805 | 0.148 0.004 0.029 0.297
MW-NW-GIC: 0.249 0.009 0.038 0.948 | 0.236 0.009 0.038 0.951 | [...]
[remaining rows of Table 4 omitted]
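The three diagnostics in (24)-(26) can be computed directly from the weight vectors; a minimal sketch:

```python
import numpy as np

def turnover(w_new, w_plus):
    """Eq. (24): sum_j |w_{t+1,j} - w+_{t,j}| -- total traded weight."""
    return np.abs(w_new - w_plus).sum()

def leverage(w_new):
    """Eq. (25): |sum_j min(w_{t+1,j}, 0)| -- total short position."""
    return abs(np.minimum(w_new, 0.0).sum())

def max_leverage(w_new):
    """Eq. (26): max_j |min(w_{t+1,j}, 0)| -- largest single short position."""
    return np.abs(np.minimum(w_new, 0.0)).max()
```

A long-only portfolio has zero leverage and zero maximum leverage, which is why the equally weighted portfolio shows zeros in those columns of Table 5.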
Our leverage and maximum leverage are also very small compared to those of the other methods.

To better understand why we perform well in the out-of-sample exercise, we show the correlation matrices for the two out-of-sample periods that we analyzed in Figure 2. Subsample 1 corresponds to January 2005-December 2017, and Subsample 2 corresponds to January 2000-December 2017. In Figures 2a and 2b, we color the correlations of the assets: blue (dark in black and white) marks anything above a 0.3 positive correlation (which is the average), yellow (light gray in black and white) marks anything between a 0 and 0.3 positive correlation, and red (dark in black and white) marks the very few negative correlations. Figures 2a and 2b clearly show that dark blue areas do not predominate. This accords with our assumptions, under which large correlations between assets should not dominate the correlation matrix of assets.
We provided a nodewise regression method that can control for risk and obtain the maximum expected return of a large portfolio. Our result is novel and holds even when p > n. We also showed that the maximum out-of-sample Sharpe ratio can be estimated consistently. Furthermore, we developed a formula for the maximum Sharpe ratio when the weights of the portfolio sum to one, and a consistent estimate for this constrained case is also shown. We then extended our results to the consistent estimation of the Sharpe ratios of two widely used portfolios in the literature. It will be important to extend our results to portfolios with further restrictions.

Table 5: Turnover and Leverage
Columns: Low Dimension (Turnover, Leverage, Max Leverage) | High Dimension (Turnover, Leverage, Max Leverage).

EW 0.048 0.000 0.000 | 0.054 0.000 0.000
GMV-NW-GIC 0.087 0.009 0.005 | 0.057 0.001 0.001
GMV-NW-CV 0.100 0.001 0.001 | 0.122 0.003 0.002
GMV-POET 0.074 0.193 0.037 | 0.092 0.301 0.007
GMV-NL-LW 0.164 0.403 0.065 | 0.328 0.806 0.023
GMV-SF-NL-LW 0.146 0.369 0.051 | 0.323 0.872 0.026
MW-NW-GIC 0.132 0.032 0.012 | 0.079 0.006 0.001
MW-NW-CV 0.141 0.018 0.008 | 0.144 0.009 0.002
MW-POET 0.109 0.217 0.037 | 0.109 0.305 0.007
MW-NL-LW 0.184 0.425 0.068 | 0.330 0.809 0.023
MW-SF-NL-LW 0.166 0.392 0.053 | 0.325 0.874 0.025
MAXSER 1.529 0.313 0.153 |

EW 0.060 0.000 0.000 | 0.057 0.000 0.000
GMV-NW-GIC 0.081 0.007 0.004 | 0.059 0.001 0.000
GMV-NW-CV 0.106 0.004 0.004 | 0.122 0.003 0.002
GMV-POET 0.088 0.216 0.042 | 0.105 0.326 0.008
GMV-NL-LW 0.188 0.403 0.073 | 0.308 0.784 0.028
GMV-SF-NL-LW 0.154 0.366 0.062 | 0.300 0.829 0.026
MW-NW-GIC 0.110 0.016 0.007 | 0.084 0.005 0.001
MW-NW-CV 0.130 0.010 0.006 | 0.147 0.010 0.003
MW-POET 0.105 0.221 0.042 | 0.122 0.334 0.009
MW-NL-LW 0.201 0.415 0.072 | 0.311 0.787 0.028
MW-SF-NL-LW 0.167 0.372 0.061 | 0.303 0.832 0.025
MAXSER 1.652 0.346 0.197 |

The table shows the average turnover, average leverage and average max leverage for all portfolios across all out-of-sample windows. The top panel shows the results for the 2000-2017 out-of-sample period, and the second panel shows the results for the 2005-2017 out-of-sample period.

Figure 2: Data Correlation Matrices
Appendix
This appendix contains the proofs of the main results. The Supplementary Appendix contains additional proofs, which serve as building blocks for the proofs in this appendix but are independent of the results here.
Proof of Theorem 2. Equation (A.2) of Ao et al. (2019) shows that the squared ratio of the estimated maximum out-of-sample Sharpe ratio to the theoretical one can be written as

[SR-hat_mos,nw / SR*]² = (μ' Θ̂ μ̂)² / [(μ̂' Θ̂' Σ Θ̂ μ̂)(μ' Σ^{-1} μ)] = [μ' Θ̂ μ̂ / μ' Σ^{-1} μ]² [μ̂' Θ̂' Σ Θ̂ μ̂ / μ' Σ^{-1} μ]^{-1}.  (A.1)

The proof considers the numerator and the denominator of this squared ratio in turn. We start with the numerator; using Θ := Σ^{-1},

μ' Θ̂ μ̂ / μ' Θ μ = (μ' Θ̂ μ̂ − μ' Θ μ) / μ' Θ μ + 1.  (A.2)

Consider the fraction on the right-hand side, starting with its numerator:

|μ' Θ̂ μ̂ − μ' Θ μ|/p = |μ' Θ̂ μ̂ − μ' Θ μ̂ + μ' Θ μ̂ − μ' Θ μ|/p
≤ |μ' (Θ̂ − Θ) μ̂|/p + |μ' Θ (μ̂ − μ)|/p
≤ |μ' (Θ̂ − Θ)(μ̂ − μ)|/p + |μ' (Θ̂ − Θ) μ|/p + |μ' Θ (μ̂ − μ)|/p
= O_p(s̄ ln p / n) + O_p(s̄ √(ln p / n)) + O_p(s̄^{1/2} √(ln p / n))
= O_p(s̄ √(ln p / n)),  (A.3)

where we use (A.87)-(A.89) for the rates, and the dominant rate in the last equality follows from Assumption 4. Next, we analyze the denominator in (A.2). By Assumption 2, since Σ^{-1} = Θ, by definition

μ' Σ^{-1} μ / p ≥ Eigmin(Σ^{-1}) ‖μ‖² / p ≥ c c_l² > 0,  (A.4)

where 0 < c_l ≤ |μ_j| by Assumption 2, and Eigmin(Σ^{-1}) ≥ c > 0, where c is a positive constant. Then, by (A.3)-(A.4),

(μ' Θ̂ μ̂ / p) / (μ' Θ μ / p) ≤ (|μ' Θ̂ μ̂ − μ' Θ μ| / p) / (μ' Θ μ / p) + 1 = O_p(s̄ √(ln p / n)) + 1.  (A.5)

We now show that, for the denominator,

μ̂' Θ̂ Σ Θ̂ μ̂ / μ' Σ^{-1} μ → 1 in probability.  (A.6)

In that respect, bearing in mind that Θ = Σ^{-1} is symmetric,

μ̂' Θ̂ Σ Θ̂ μ̂ / μ' Θ Σ Θ μ = (μ̂' Θ̂ Σ Θ̂ μ̂ − μ' Θ Σ Θ μ) / μ' Θ Σ Θ μ + 1 ≥ 1 − |(μ̂' Θ̂ Σ Θ̂ μ̂ − μ' Θ Σ Θ μ) / μ' Θ Σ Θ μ|.  (A.7)

We can write

Θ̂ μ̂ − Θ μ = (Θ̂ − Θ) μ̂ + Θ (μ̂ − μ).  (A.8)

Using (A.8),

|μ̂' Θ̂ Σ Θ̂ μ̂ − μ' Θ Σ Θ μ| = |[(Θ̂ μ̂ − Θ μ) + Θ μ]' Σ [(Θ̂ μ̂ − Θ μ) + Θ μ] − μ' Θ Σ Θ μ|
≤ |[(Θ̂ − Θ) μ̂]' Σ [(Θ̂ − Θ) μ̂]|  (A.9)
+ 2 |[(Θ̂ − Θ) μ̂]' Σ Θ (μ̂ − μ)|  (A.10)
+ 2 |[(Θ̂ − Θ) μ̂]' Σ Θ μ|  (A.11)
+ |[Θ (μ̂ − μ)]' Σ [Θ (μ̂ − μ)]|  (A.12)
+ 2 |[Θ (μ̂ − μ)]' Σ Θ μ|.  (A.13)

First, we consider (A.9):

|μ̂' (Θ̂ − Θ)' Σ (Θ̂ − Θ) μ̂| ≤ Eigmax(Σ) ‖(Θ̂ − Θ) μ̂‖²
= Eigmax(Σ) sum_{j=1}^{p} [(Θ̂_j − Θ_j)' μ̂]²
≤ Eigmax(Σ) p max_{1≤j≤p} [(Θ̂_j − Θ_j)' μ̂]²
≤ Eigmax(Σ) p (max_{1≤j≤p} ‖Θ̂_j − Θ_j‖₁)² ‖μ̂‖²_∞
= O(1) p O_p(s̄² ln p / n) O_p(1),  (A.14)

where we use Hölder's inequality for the third inequality, and Theorem 1(i)-(ii) and Assumption 2 for the rate. Now consider (A.10); by the definition Θ := Σ^{-1},

|[(Θ̂ − Θ) μ̂]' Σ Θ (μ̂ − μ)| = |μ̂' (Θ̂ − Θ)' (μ̂ − μ)|
≤ |(μ̂ − μ)' (Θ̂ − Θ)' (μ̂ − μ)| + |μ' (Θ̂ − Θ)' (μ̂ − μ)|
= p [O_p(s̄ (ln p / n)^{3/2}) + O_p(s̄ (ln p / n))] = p O_p(s̄ (ln p / n)),  (A.15)

by (A.85) and (A.88) for the second equality; the dominant rate in the third equality follows from Assumption 4. Next, consider (A.11), recalling that Θ := Σ^{-1}:

|[(Θ̂ − Θ) μ̂]' Σ Θ μ| = |μ̂' (Θ̂ − Θ) μ|
≤ |(μ̂ − μ)' (Θ̂ − Θ) μ| + |μ' (Θ̂ − Θ)' μ|
= p [O_p(s̄ ln p / n) + O_p(s̄ √(ln p / n))] = p O_p(s̄ √(ln p / n)),  (A.16)

where we use (A.88)-(A.89) for the second equality, and the dominant rate in the third equality can be seen from Assumption 4.
Consider now (A.12); by the symmetry of Θ = Σ^{-1},

|[Θ (μ̂ − μ)]' Σ Θ (μ̂ − μ)| = |(μ̂ − μ)' Θ (μ̂ − μ)| = p O_p(s̄^{1/2} ln p / n),  (A.17)

by (A.86). Next, analyze (A.13); by the symmetry of Θ = Σ^{-1},

|[Θ (μ̂ − μ)]' Σ Θ μ| = |(μ̂ − μ)' Θ μ| = p O_p(s̄^{1/2} √(ln p / n)),  (A.18)

by (A.87). Combining the rates in (A.14)-(A.18) for the terms (A.9)-(A.13), we obtain

|μ̂' Θ̂ Σ Θ̂ μ̂ − μ' Θ Σ Θ μ| = p O_p(s̄ √(ln p / n)),  (A.19)

by the dominant rate in (A.16), as seen from Assumption 4. See that, by Θ = Σ^{-1} and (A.4),

μ' Θ Σ Θ μ = μ' Σ^{-1} μ ≥ Eigmin(Σ^{-1}) ‖μ‖² ≥ c ‖μ‖² ≥ c c_l² p,  (A.20)

by Assumption 2. Combining (A.19)-(A.20) in the second term on the right-hand side of (A.7), we have, from Assumptions 2 and 4,

(|μ̂' Θ̂' Σ Θ̂ μ̂ − μ' Θ Σ Θ μ| / p) / (μ' Θ Σ Θ μ / p) ≤ O_p(s̄ √(ln p / n)) / (c c_l²) = O_p(s̄ √(ln p / n)) = o_p(1).  (A.21)

Therefore, we have shown (A.6). Then, combining (A.5) and (A.6) in (A.1) gives the desired result. Q.E.D.

Proof of Theorem 3. (i) Start with the definitions of the weights and their estimates. The ratio of the expected return of the estimated portfolio to that of the theoretical one is

[σ μ' Θ̂ μ̂ / √(μ̂' Θ̂ μ̂)] / [σ μ' Θ μ / √(μ' Θ μ)] = [μ' Θ̂ μ̂ / μ' Θ μ] [μ' Θ μ / μ̂' Θ̂ μ̂]^{1/2}.  (A.22)

Then

| [μ' Θ̂ μ̂ / μ' Θ μ] [μ' Θ μ / μ̂' Θ̂ μ̂]^{1/2} − 1 |
≤ |μ' Θ̂ μ̂ / μ' Θ μ − 1| |(μ' Θ μ / μ̂' Θ̂ μ̂)^{1/2} − 1| + |μ' Θ̂ μ̂ / μ' Θ μ − 1| + |(μ' Θ μ / μ̂' Θ̂ μ̂)^{1/2} − 1|.  (A.23)

By (A.5),

|μ' Θ̂ μ̂ / μ' Θ μ − 1| = O_p(s̄ √(ln p / n)).  (A.24)

Next, we have

μ' Θ μ / μ̂' Θ̂ μ̂ = (μ' Θ μ − μ̂' Θ̂ μ̂) / μ̂' Θ̂ μ̂ + 1 ≤ (|μ' Θ μ / p − μ̂' Θ̂ μ̂ / p|) / (μ' Θ μ / p − |μ̂' Θ̂ μ̂ / p − μ' Θ μ / p|) + 1,  (A.25)

where we divided both the numerator and the denominator by p and used μ̂' Θ̂ μ̂ / p ≥ μ' Θ μ / p − |μ̂' Θ̂ μ̂ / p − μ' Θ μ / p|. By (A.4), (A.25), Lemma A.4 in the Supplementary Appendix, and Assumption 2, with μ' Θ μ / p ≥ c c_l²,

μ' Θ μ / μ̂' Θ̂ μ̂ ≤ O_p(s̄ √(ln p / n)) / (c c_l² − O_p(s̄ √(ln p / n))) + 1 = O_p(s̄ √(ln p / n)) + 1.  (A.26)

Then,

|(μ' Θ μ / μ̂' Θ̂ μ̂)^{1/2} − 1| = |[1 + O_p(s̄ √(ln p / n))]^{1/2} − 1|.  (A.27)

Now, use Assumption 4 in (A.24), (A.27) and (A.23) to obtain the desired result. Q.E.D.

(ii) Now, we analyze the risk. See that

ŵ_oos' Σ ŵ_oos − σ² = σ² [ μ̂' Θ̂' Σ Θ̂ μ̂ / μ̂' Θ̂ μ̂ − 1 ] = σ² [ (μ̂' Θ̂' Σ Θ̂ μ̂ / μ' Θ μ) / (μ̂' Θ̂ μ̂ / μ' Θ μ) − 1 ],

where we multiplied and divided by μ' Θ μ, which is positive by Assumption 2. By (A.6) and (A.21),

|μ̂' Θ̂' Σ Θ̂ μ̂ / μ' Θ μ − 1| = O_p(s̄ √(ln p / n)).  (A.28)

Additionally, by Lemma A.4 in the Supplementary Appendix and Assumptions 2 and 4,

|μ̂' Θ̂ μ̂ / μ' Θ μ − 1| = o_p(1).  (A.29)

By (A.28)-(A.29) and Assumption 4,

|ŵ_oos' Σ ŵ_oos − σ²| = O_p(s̄ √(ln p / n)) = o_p(1). Q.E.D.

Proof of Theorem 4. See that, by Assumption 2,

| (MSR-hat²/p) / (MSR²/p) − 1 | = | (μ̂' Θ̂ μ̂ / p) / (μ' Σ^{-1} μ / p) − 1 | = |μ̂' Θ̂ μ̂ / p − μ' Σ^{-1} μ / p| / (μ' Σ^{-1} μ / p).

Lemma A.4 in the Supplementary Appendix shows that, under Assumptions 1-3,

|μ̂' Θ̂ μ̂ / p − μ' Σ^{-1} μ / p| = O_p(s̄ √(ln p / n)).  (A.30)

Combining (A.4) and (A.30) with Assumption 4,

| (μ̂' Θ̂ μ̂ / p) / (μ' Σ^{-1} μ / p) − 1 | = O_p(s̄ √(ln p / n)) = o_p(1). Q.E.D.

Proof of Theorem 5. Note that, by the definition of MSR_c² in (15) and the A, B, D terms,

MSR_c² / p = D − (B² / A),

and the estimate is

MSR_c²-hat / p = D̂ − (B̂² / Â),

where Â = 1_p' Θ̂ 1_p / p, B̂ = 1_p' Θ̂ μ̂ / p, D̂ = μ̂' Θ̂ μ̂ / p. Then, clearly,

(MSR_c²-hat / p) / (MSR_c² / p) = [(Â D̂ − B̂²) / (AD − B²)] [A / Â].  (A.31)

We start with

|Â − A| = O_p(s̄ √(ln p / n)) = o_p(1),  (A.32)

by Assumption 4 and Lemma A.2 in the Supplementary Appendix. Then A ≥ Eigmin(Σ^{-1}) ≥ c > 0, with c a positive constant, by Assumption 2. Thus, since |Â| ≥ A − |Â − A|, we clearly obtain

|A / Â − 1| = |(A − Â) / Â| ≤ |Â − A| / (A − |Â − A|),

which implies

|A / Â − 1| = O_p(s̄ √(ln p / n)) = o_p(1).  (A.33)

Next, Lemma A.6 in the Supplementary Appendix establishes that, under Assumptions 1-4,

|(Â D̂ − B̂²) − (AD − B²)| = O_p(s̄ √(ln p / n)) = o_p(1).

Using the condition AD − B² ≥ C_1 > 0 and combining the results above, we obtain

|(Â D̂ − B̂²) / (AD − B²) − 1| = O_p(s̄ √(ln p / n)) = o_p(1).  (A.34)

Since

(MSR_c²-hat / p) / (MSR_c² / p) = [((Â D̂ − B̂²)/(AD − B²) − 1) + 1] [(A/Â − 1) + 1],

combining (A.33)-(A.34) in (A.31) yields

| (MSR_c²-hat / p) / (MSR_c² / p) − 1 | ≤ |(Â D̂ − B̂²)/(AD − B²) − 1| |A/Â − 1| + |A/Â − 1| + |(Â D̂ − B̂²)/(AD − B²) − 1|  (A.35)
= O_p(s̄ √(ln p / n)) = o_p(1),  (A.36)

where the rate is the slowest among the three right-hand-side terms. Q.E.D.

Proof of Theorem 6. We start with

| (MSR*-hat)²/p / ((MSR*)²/p) − 1 | = | (MSR*-hat)²/p − (MSR*)²/p | / ((MSR*)²/p).  (A.37)

As a first step, we analyze the denominator in (A.37). First consider the case 1_p' Σ^{-1} μ / p ≥ C_2 > ε > 0; then

MSR² / p = μ' Σ^{-1} μ / p ≥ Eigmin(Σ^{-1}) ‖μ‖² / p ≥ c c_l² > 0,

by Assumption 2. Next consider the case 1_p' Σ^{-1} μ / p ≤ −C_2 < −ε < 0. Then

MSR_c² / p = D − (B² / A) = (AD − B²) / A ≥ C_1 / K > 0,

since AD − B² ≥ C_1 > 0 and A = 1_p' Σ^{-1} 1_p / p ≤ Eigmax(Σ^{-1}) ≤ K < ∞, with K a positive constant, by Assumption 2. Then, clearly, combining the results,

(MSR*)² / p = (MSR²/p) 1{1_p' Σ^{-1} μ ≥ 0} + (MSR_c²/p) 1{1_p' Σ^{-1} μ < 0} ≥ min(c c_l², C_1/K) > 0.  (A.38)

Next, we consider the numerator. We need to show that

p^{-1} |(MSR*-hat)² − (MSR*)²|
= p^{-1} |MSR²-hat 1{1_p' Θ̂ μ̂ > 0} − MSR² 1{1_p' Σ^{-1} μ ≥ 0} + [MSR_c²-hat 1{1_p' Θ̂ μ̂ < 0} − MSR_c² 1{1_p' Σ^{-1} μ < 0}]|
= O_p(s̄ √(ln p / n)) = o_p(1).  (A.39)

First, see that on the right-hand side of (A.39),

p^{-1} |MSR²-hat 1{1_p' Θ̂ μ̂ > 0} − MSR² 1{1_p' Σ^{-1} μ ≥ 0}|
≤ p^{-1} |MSR²-hat 1{1_p' Θ̂ μ̂/p > 0} − MSR² 1{1_p' Θ̂ μ̂/p > 0}| + p^{-1} |MSR² 1{1_p' Θ̂ μ̂/p > 0} − MSR² 1{1_p' Σ^{-1} μ/p ≥ 0}|,  (A.40)

where division by p inside the indicator functions does not change the results, since the indicators depend only on signs. Then, in (A.40),

p^{-1} |MSR²-hat 1{1_p' Θ̂ μ̂/p > 0} − MSR² 1{1_p' Θ̂ μ̂/p > 0}| ≤ p^{-1} |MSR²-hat − MSR²| |1{1_p' Θ̂ μ̂/p > 0}| ≤ p^{-1} |MSR²-hat − MSR²| = O_p(s̄ √(ln p / n)) = o_p(1),  (A.41)

by (A.30) and Assumption 4 for the rate. In (A.40) above, consider

p^{-1} |MSR² 1{1_p' Θ̂ μ̂/p > 0} − MSR² 1{1_p' Σ^{-1} μ/p ≥ 0}| ≤ p^{-1} MSR² |1{1_p' Θ̂ μ̂/p > 0} − 1{1_p' Σ^{-1} μ/p ≥ 0}|.  (A.42)

Note that, by the definition of MSR²/p,

MSR² / p = μ' Σ^{-1} μ / p ≤ Eigmax(Σ^{-1}) ‖μ‖² / p ≤ K c_u² < ∞,  (A.43)

where we use Assumption 2. Define the event E := {|1_p' Θ̂ μ̂ / p − 1_p' Σ^{-1} μ / p| ≤ ε}, where ε > 0. Start with the condition 1_p' Σ^{-1} μ / p ≥ C_2 > ε > 0; then, on the event E,

1_p' Θ̂ μ̂ / p = 1_p' Θ̂ μ̂ / p − 1_p' Σ^{-1} μ / p + 1_p' Σ^{-1} μ / p
≥ 1_p' Σ^{-1} μ / p − |1_p' Θ̂ μ̂ / p − 1_p' Σ^{-1} μ / p|
≥ 1_p' Σ^{-1} μ / p − ε ≥ C_2 − ε > 2ε − ε = ε > 0,  (A.44)

where we use E in the second inequality and the condition for the third inequality. This clearly shows that, on the event E, when the condition 1_p' Σ^{-1} μ / p ≥ C_2 > ε > 0 holds, we have 1_p' Θ̂ μ̂ / p > ε > 0. Since E occurs with a probability approaching one under our Assumptions 1-4,

|1{1_p' Θ̂ μ̂/p > 0} − 1{1_p' Σ^{-1} μ/p ≥ 0}| = O_p(s̄ √(ln p / n)) = o_p(1),  (A.45)

where we use (A.44) and 1_p' Σ^{-1} μ/p ≥ C_2 > ε >
0, implying 1 ′ p Σ − µ/p ≥ p − | ( M SR ) { ′ p ˆΘˆ µ/p> } − ( M SR ) { ′ p Σ − µ/p ≥ } | = O p (¯ s p lnp/n ) = o p (1) . (A.46)By (A.41)(A.46), we have in (A.40) p − | ( \ M SR )1 { ′ p ˆΘˆ µ/p> } − ( M SR )1 { ′ p Σ − µ/p ≥ } = O p (¯ s p lnp/n ) = o p (1) . (A.47)The proof for p − | ( \ M SR c ) { ′ p ˆΘˆ µ/p< } − ( M SR c ) { ′ p Σ − µ/p< } | is identical to that in (A.47)given Theorem 5, except that we have to show that | { ′ p ˆΘˆ µ/p< } − { ′ p Σ − µ/p< } | = O p (¯ s p lnp/n ) = o p (1) , (A.48)instead of (A.45). Assume that we use event E :1 ′ p Σ − µp = 1 ′ p Σ − µp − ′ p ˆΘˆ µp + 1 ′ p ˆΘˆ µp ≥ ′ p ˆΘˆ µp − | ′ p Σ − µp − ′ p ˆΘˆ µp |≥ ′ p ˆΘˆ µp − ǫ. (A.49)Then, in (A.49), using the condition 1 ′ p Σ − µ/p ≤ − C < − ǫ < ′ p Σ − µ/p <
0) 0 > − ǫ > − C ≥ ′ p Σ − µ/p ≥ ′ p ˆΘˆ µ/p − ǫ, which implies that, with C > ǫ , adding ǫ to all sides above yields0 > − ǫ > − ( C − ǫ ) ≥ ′ p ˆΘˆ µ/p, which clearly shows that when 1 ′ p Σ − µ/p <
0, we will have 1 ′ p ˆΘˆ µ/p <
0. Note that event E occurs with probability approaching one by Lemma A.3 in the Supplementary Appendix, so wehave proven (A.48). This implies with the result of Theorem 5 that p − | ( \ M SR c ) { ′ p ˆΘˆ µ/p< } − ( M SR c ) { ′ p Σ − µ/p< } | = O p (¯ s p lnp/n ) = o p (1) . (A.50)By now combining (A.47)(A.50), we proved (A.39) via the triangle inequality. With (A.38) and(A.39), the desired result follows (A.37). Q.E.D.Proof of Theorem 7 . First, we start with definitions of ˆ A := 1 ′ p ˆΘ1 p /p , ˆ B := 1 ′ p ˆΘˆ µ/p , A := 1 ′ p Σ − p /p , B := 1 ′ p Σ − µ/p . 37 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) c SR nw SR − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) p (1 ′ p ˆΘˆ µ/p ) (1 ′ p ˆΘ1 p /p ) − p (1 ′ p Σ − µ/p ) (1 ′ p Σ − p /p ) − − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ B AB ˆ A − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ B A − B ˆ AB ˆ A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (A.51)We analyze the denominator in (A.51). To that effect, by Assumption 2, A = 1 ′ p Σ − p /p ≥ Eigmin (Σ − ) ≥ c > . By the condition in the statement of Theorem 7, | B | = | ′ p Σ − µp | ≥ C > ǫ > . Then, by Lemma A.2 and Lemma A.5 in the Supplementary Appendix | B ˆ A | = | B ( ˆ A − A ) + B A | ≥ B A − B | ˆ A − A | ≥ C c + o p (1) > . (A.52)Now consider the numerator in (A.51): | ˆ B A − B ˆ A | = | ˆ B A − ˆ B ˆ A + ˆ B ˆ A − B ˆ A |≤ | ˆ B ( ˆ A − A ) | + | ( ˆ B − B ) ˆ A |≤ | ˆ B ( ˆ A − A ) | + | ˆ B − B || ˆ B + B || ˆ A | . (A.53)Analyze the first term on the right side of (A.53):ˆ B = | ˆ B − B + B |≤ | ˆ B − B | + B ≤ | ˆ B − B || ˆ B + B | + B . (A.54)Then, by Lemma A.3 in the Supplementary Appendix, | ˆ B − B | = O p (¯ s r lnpn ) = o p (1) . 
(A.55)Then, | ˆ B + B | ≤ | ˆ B | + | B |≤ | ˆ B − B | + 2 | B | = o p (1) + 2 | B | = O p (1) , (A.56)where we use (A.55) and Lemma A.5 in the Supplementary Appendix.38y (A.55)(A.56) in (A.54), we have ˆ B = O p (1) . (A.57)Then, by Lemma A.2 in the Supplementary Appendix and (A.57), | ˆ B ( ˆ A − A ) | ≤ ˆ B | ˆ A − A | = O p (¯ s r lnpn ) = o p (1) . (A.58)Then, the second term on the right side of (A.53) is | ˆ B − B || ˆ B + B || ˆ A | = O p (¯ s r lnpn ) O p (1) O p (1) = o p (1) , (A.59)by (A.55)(A.56) and Lemma A.2, Lemma A.5 in the Supplementary Appendix. Use (A.58)(A.59)in (A.53) | ˆ B A − B ˆ A | = O p (¯ s r lnpn ) = o p (1) . (A.60)Combine (A.52) with (A.60) in (A.51) to obtain the desired result. Q.E.D.Proof of Theorem 8 . To ease the notation in the proofs, set AD − B = z , Aρ − Bρ + D = v .The estimates will be ˆ z = ˆ A ˆ D − ˆ B , ˆ v = ˆ Aρ − Bρ + ˆ D . Then, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) c SR MV SR MV − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) ˆ z/ ˆ vz/v − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) ˆ z ˆ v vz − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) ˆ zv − ˆ vz ˆ vz (cid:12)(cid:12)(cid:12)(cid:12) . (A.61)First, analyze the denominator of (A.61). | ˆ vz | = | (ˆ v − v ) z + vz | . ≥ | vz | − | (ˆ v − v ) z |≥ | vz | − | ˆ v − v || z | . (A.62)Then, by Lemma A.2-A.4 in the Supplementary Appendix, triangle inequality and ρ being boundedaway from zero and finite, by Assumption 4, | ˆ v − v | = | ( ˆ A − A ) ρ −
2( ˆ B − B ) ρ + ( ˆ D − D ) | = O p (¯ s r lnpn ) = o p (1) . (A.63)We also know that by the conditions in theorem statement z = AD − B ≥ C >
0, and v = Aρ − Bρ + D ≥ C >
0. Then, see that by Lemma A.5 in the Supplementary Appendix | z | = | AD − B | ≤ AD = O (1) . (A.64)Thus, by (A.63)(A.64) and z ≥ C > , v ≥ C > | ˆ vz | = o p (1) + C > . (A.65)39onsider the numerator in (A.61): | ˆ zv − ˆ vz | = | ˆ zv − vz + vz − ˆ vz | ≤ | ˆ z − z || v | + | z || ˆ v − v | . (A.66)By Lemma A.6 in the Supplementary Appendix, | ˆ z − z | = | ( ˆ A ˆ D − ˆ B ) − ( AD − B ) | = O p (¯ s r lnpn ) = o p (1) . (A.67)Clearly, by Lemma A.5 in the Supplementary Appendix and triangle inequality with ρ being finite, | v | = | Aρ − Bρ + D | = O (1) . (A.68)Then, use (A.63)(A.64)(A.67)(A.68) in (A.66) by Assumption 4 | ˆ zv − ˆ vz | = O p (¯ s r lnpn ) = o p (1) . (A.69)Use (A.65)(A.69) in (A.61) to obtain the desired result. Q.E.D. upplementary Appendix Here, we provide supplemental results. We provide a matrix norm inequality. Let x be a genericvector, which is p × M is a square matrix of dimension p , where M ′ j is the j th row of dimension1 × p , and M j is the transpose of this row vector. Lemma A.1. k M x k ≤ p max ≤ j ≤ p k M j k k x k ∞ . Proof of Lemma A.1 . k M x k = | M ′ x | + | M ′ x | + · · · + | M ′ p x |≤ k M k k x k ∞ + k M k k x k ∞ + · · · + k M p k k x k ∞ = [ p X j =1 k M j k ] k x k ∞ ≤ p max j k M j k k x k ∞ , (A.70)where we use Holder’s inequality to obtain each inequality. Q.E.D.
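The inequality in Lemma A.1 is easy to verify numerically. The following pure-Python sketch (with hand-made illustrative values, not from the paper) compares both sides for a small matrix:

```python
# Numerical sanity check of Lemma A.1: ||M x||_1 <= p * max_j ||M_j||_1 * ||x||_inf,
# where M_j is the j-th row of M. Values below are illustrative assumptions.

def l1(v):
    # l1 norm of a vector
    return sum(abs(a) for a in v)

def linf(v):
    # sup norm of a vector
    return max(abs(a) for a in v)

M = [[1.0, -2.0, 0.5],
     [0.0,  3.0, -1.0],
     [2.0, -0.5, 0.25]]
x = [0.3, -1.2, 0.7]
p = len(M)

Mx = [sum(M[i][j] * x[j] for j in range(p)) for i in range(p)]
lhs = l1(Mx)                                 # ||M x||_1
rhs = p * max(l1(row) for row in M) * linf(x)  # p * max_j ||M_j||_1 * ||x||_inf
assert lhs <= rhs
```

Each coordinate of $Mx$ is bounded via Hölder's inequality, exactly as in the proof of (A.70).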
The following lemmata are all from Callot et al. (2019) and are repeated for the benefit of readers. Recall the definitions $A := 1_p'\Sigma^{-1}1_p/p$ and $\hat{A} := 1_p'\hat{\Theta}1_p/p$.

Lemma A.2. Under Assumptions 1-4, uniformly in $j \in \{1, \dots, p\}$, for $\lambda_j = O(\sqrt{\ln p/n})$,
\[
|\hat{A} - A| = O_p\left(\bar{s}\sqrt{\frac{\ln p}{n}}\right) = o_p(1).
\]

Proof of Lemma A.2. First, see that
\[
\hat{A} - A = (1_p'\hat{\Theta}1_p - 1_p'\Theta 1_p)/p = (1_p'(\hat{\Theta} - \Theta)1_p)/p. \tag{A.71}
\]
Now, consider the right-hand side of (A.71):
\[
|1_p'(\hat{\Theta} - \Theta)1_p|/p \le \|(\hat{\Theta} - \Theta)1_p\|_1\|1_p\|_\infty/p \le \max_{1\le j\le p}\|\hat{\Theta}_j - \Theta_j\|_1 = O_p\left(\bar{s}\sqrt{\ln p/n}\right) = o_p(1), \tag{A.72}
\]
where Hölder's inequality is used in the first inequality, Lemma A.1 is used for the second inequality, and the last equality is obtained by using Theorem 1 and imposing Assumption 4. Q.E.D.
Before the next lemma, we define $\hat{B} := 1_p'\hat{\Theta}\hat{\mu}/p$ and $B := 1_p'\Theta\mu/p$.

Lemma A.3. Under Assumptions 1-4, uniformly in $j \in \{1, \dots, p\}$, for $\lambda_j = O(\sqrt{\ln p/n})$,
\[
|\hat{B} - B| = O_p\left(\bar{s}\sqrt{\frac{\ln p}{n}}\right) = o_p(1).
\]

Proof of Lemma A.3. We can decompose $\hat{B} - B$ by simple addition and subtraction into
\[
\hat{B} - B = [1_p'(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)]/p \tag{A.73}
\]
\[
+\, [1_p'(\hat{\Theta} - \Theta)\mu]/p \tag{A.74}
\]
\[
+\, [1_p'\Theta(\hat{\mu} - \mu)]/p. \tag{A.75}
\]
Now, we analyze each of the terms above. Since $\hat{\mu} = n^{-1}\sum_{t=1}^n r_t$,
\[
|1_p'(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)|/p \le \|(\hat{\Theta} - \Theta)1_p\|_1\|\hat{\mu} - \mu\|_\infty/p \le \left[\max_{1\le j\le p}\|\hat{\Theta}_j - \Theta_j\|_1\right]\|\hat{\mu} - \mu\|_\infty = O_p\left(\bar{s}\sqrt{\ln p/n}\right)O_p\left(\sqrt{\ln p/n}\right), \tag{A.76}
\]
where we use Hölder's inequality in the first inequality and Lemma A.1 with $M = \hat{\Theta} - \Theta$, $x = 1_p$ in the second inequality, and the rates are from Theorem 1.

Next, we consider (A.74). Since, by Assumption 2, $\|\mu\|_\infty \le c_u < \infty$, where $c_u$ is a positive constant,
\[
|1_p'(\hat{\Theta} - \Theta)\mu|/p \le \|(\hat{\Theta} - \Theta)1_p\|_1\|\mu\|_\infty/p \le c_u\left[\max_{1\le j\le p}\|\hat{\Theta}_j - \Theta_j\|_1\right] = c_u O_p\left(\bar{s}\sqrt{\ln p/n}\right), \tag{A.77}
\]
where we use Hölder's inequality in the first inequality and Lemma A.1 with $M = \hat{\Theta} - \Theta$, $x = 1_p$ in the second inequality, and the rate is from Theorem 1.

Now consider (A.75):
\[
|1_p'\Theta(\hat{\mu} - \mu)|/p \le \|\Theta 1_p\|_1\|\hat{\mu} - \mu\|_\infty/p \le \left[\max_{1\le j\le p}\|\Theta_j\|_1\right]\|\hat{\mu} - \mu\|_\infty = O(\sqrt{\bar{s}})O_p\left(\sqrt{\ln p/n}\right), \tag{A.78}
\]
where we use Hölder's inequality in the first inequality and Lemma A.1 with $M = \Theta$, $x = 1_p$ in the second inequality; the rate follows from Theorem 1 and (B.55) of Caner and Kock (2018), which gives $\max_{1\le j\le p}\|\Theta_j\|_1 = O(\sqrt{\bar{s}})$.

Combine (A.76)-(A.78) in (A.73)-(A.75), and note that the largest rate comes from (A.77). Therefore, use Assumption 4, $\bar{s}\sqrt{\ln p/n} = o(1)$, to obtain
\[
|\hat{B} - B| = O_p\left(\bar{s}\sqrt{\ln p/n}\right) = o_p(1). \tag{A.79}
\]
Q.E.D.
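The decomposition (A.73)-(A.75) is an exact algebraic identity, not just an approximation, which a small pure-Python check makes concrete (all matrices and vectors below are hand-made illustrative values, not from the paper):

```python
# Check that 1'(ThetaHat muHat)/p - 1'(Theta mu)/p equals the sum of the three
# cross terms in (A.73)-(A.75). Illustrative 2x2 example.

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

p = 2
ones = [1.0] * p
Theta    = [[2.0, -0.5], [-0.5, 1.5]]   # population precision matrix (assumed)
ThetaHat = [[2.1, -0.4], [-0.4, 1.6]]   # its estimate (assumed)
mu       = [0.05, 0.03]                 # population mean (assumed)
muHat    = [0.06, 0.02]                 # sample mean (assumed)

B_hat = dot(ones, mat_vec(ThetaHat, muHat)) / p
B     = dot(ones, mat_vec(Theta, mu)) / p

dTheta = [[ThetaHat[i][j] - Theta[i][j] for j in range(p)] for i in range(p)]
dmu = [muHat[i] - mu[i] for i in range(p)]

term1 = dot(ones, mat_vec(dTheta, dmu)) / p   # (A.73)
term2 = dot(ones, mat_vec(dTheta, mu)) / p    # (A.74)
term3 = dot(ones, mat_vec(Theta, dmu)) / p    # (A.75)

assert abs((B_hat - B) - (term1 + term2 + term3)) < 1e-12
```

The proof then bounds each of the three terms separately, with (A.74) dominating.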
Note that $D := \mu'\Theta\mu/p$, and its estimator is $\hat{D} := \hat{\mu}'\hat{\Theta}\hat{\mu}/p$.

Lemma A.4. Under Assumptions 1-4, uniformly in $j \in \{1, \dots, p\}$, for $\lambda_j = O(\sqrt{\ln p/n})$,
\[
|\hat{D} - D| = O_p\left(\bar{s}\sqrt{\frac{\ln p}{n}}\right) = o_p(1).
\]

Proof of Lemma A.4. By simple addition and subtraction,
\[
\hat{D} - D = [(\hat{\mu} - \mu)'(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)]/p \tag{A.80}
\]
\[
+\, [(\hat{\mu} - \mu)'\Theta(\hat{\mu} - \mu)]/p \tag{A.81}
\]
\[
+\, [2(\hat{\mu} - \mu)'\Theta\mu]/p \tag{A.82}
\]
\[
+\, [2\mu'(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)]/p \tag{A.83}
\]
\[
+\, [\mu'(\hat{\Theta} - \Theta)\mu]/p. \tag{A.84}
\]
We start with (A.80):
\[
|(\hat{\mu} - \mu)'(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)|/p \le \|(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)\|_1\|\hat{\mu} - \mu\|_\infty/p \le [\|\hat{\mu} - \mu\|_\infty^2]\left[\max_j\|\hat{\Theta}_j - \Theta_j\|_1\right] = O_p(\ln p/n)\,O_p\left(\bar{s}\sqrt{\ln p/n}\right) = O_p\left(\bar{s}(\ln p/n)^{3/2}\right), \tag{A.85}
\]
where Hölder's inequality is used for the first inequality above, and Lemma A.1, with $M = \hat{\Theta} - \Theta$ and $x = \hat{\mu} - \mu$, for the second inequality; for the rates we use Theorem 1.

We continue with (A.81):
\[
|(\hat{\mu} - \mu)'\Theta(\hat{\mu} - \mu)|/p \le \|\Theta(\hat{\mu} - \mu)\|_1\|\hat{\mu} - \mu\|_\infty/p \le [\|\hat{\mu} - \mu\|_\infty^2]\left[\max_j\|\Theta_j\|_1\right] = O_p(\ln p/n)\,O(\sqrt{\bar{s}}) = O_p\left(\sqrt{\bar{s}}(\ln p/n)\right), \tag{A.86}
\]
where Hölder's inequality is used for the first inequality above, and Lemma A.1, with $M = \Theta$ and $x = \hat{\mu} - \mu$, for the second inequality; for the rates, we use Theorem 1 and (B.55) of Caner and Kock (2018).

Then, we consider (A.82), using $\|\mu\|_\infty \le c_u$:
\[
|(\hat{\mu} - \mu)'\Theta\mu|/p \le \|\Theta(\hat{\mu} - \mu)\|_1\|\mu\|_\infty/p \le c_u[\|\hat{\mu} - \mu\|_\infty]\left[\max_j\|\Theta_j\|_1\right] = O_p\left(\sqrt{\ln p/n}\right)O(\sqrt{\bar{s}}) = O_p\left(\sqrt{\bar{s}}\sqrt{\ln p/n}\right), \tag{A.87}
\]
where Hölder's inequality is used for the first inequality above, and Lemma A.1, with $M = \Theta$ and $x = \hat{\mu} - \mu$, for the second inequality; for the rates, we use Theorem 1 and (B.55) of Caner and Kock (2018).

Then, we consider (A.83):
\[
|\mu'(\hat{\Theta} - \Theta)(\hat{\mu} - \mu)|/p \le \|(\hat{\Theta} - \Theta)\mu\|_1\|\hat{\mu} - \mu\|_\infty/p \le \|\mu\|_\infty\left[\max_j\|\hat{\Theta}_j - \Theta_j\|_1\right]\|\hat{\mu} - \mu\|_\infty \le c_u\left[\max_j\|\hat{\Theta}_j - \Theta_j\|_1\right]\|\hat{\mu} - \mu\|_\infty = O_p\left(\bar{s}\sqrt{\ln p/n}\right)O_p\left(\sqrt{\ln p/n}\right) = O_p(\bar{s}\ln p/n), \tag{A.88}
\]
where Hölder's inequality is used for the first inequality above, Lemma A.1, with $M = \hat{\Theta} - \Theta$ and $x = \mu$, for the second inequality, $\|\mu\|_\infty \le c_u$ for the third inequality, and Theorem 1 for the rates.

Then, we consider (A.84):
\[
|\mu'(\hat{\Theta} - \Theta)\mu|/p \le \|(\hat{\Theta} - \Theta)\mu\|_1\|\mu\|_\infty/p \le [\|\mu\|_\infty^2]\max_j\|\hat{\Theta}_j - \Theta_j\|_1 \le c_u^2\left[\max_j\|\hat{\Theta}_j - \Theta_j\|_1\right] = O_p\left(\bar{s}\sqrt{\ln p/n}\right), \tag{A.89}
\]
where Hölder's inequality is used for the first inequality above, Lemma A.1, with $M = \hat{\Theta} - \Theta$ and $x = \mu$, for the second inequality, $\|\mu\|_\infty \le c_u$ for the third inequality, and Theorem 1 for the rate. Note that the rate in (A.89) determines our result, since it is the largest rate by Assumption 4.

Combine (A.85)-(A.89) in (A.80)-(A.84) and use the rate in (A.89) to obtain
\[
|\hat{D} - D| = O_p\left(\bar{s}\sqrt{\ln p/n}\right) = o_p(1). \tag{A.90}
\]
Q.E.D.
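To make the scalars $A$, $B$, $D$ and the constrained maximum Sharpe ratio $MSR_c^2/p = D - B^2/A$ concrete, here is a minimal pure-Python illustration with a hand-made two-asset covariance matrix and mean vector (the numbers are assumptions for illustration, not from the paper's data):

```python
# Toy illustration of A = 1'Theta 1/p, B = 1'Theta mu/p, D = mu'Theta mu/p,
# and MSR_c^2/p = D - B^2/A, with Theta = Sigma^{-1}. Two-asset example.

mu = [0.05, 0.03]                      # mean excess returns (assumed)
Sigma = [[0.04, 0.01], [0.01, 0.09]]   # covariance matrix (assumed)

# Theta = Sigma^{-1}, closed form for the 2x2 case
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
Theta = [[ Sigma[1][1] / det, -Sigma[0][1] / det],
         [-Sigma[1][0] / det,  Sigma[0][0] / det]]

p = 2
ones = [1.0, 1.0]

def quad(u, M, v):
    # quadratic/bilinear form u' M v
    return sum(u[i] * M[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

A = quad(ones, Theta, ones) / p   # 1' Theta 1 / p
B = quad(ones, Theta, mu) / p     # 1' Theta mu / p
D = quad(mu, Theta, mu) / p       # mu' Theta mu / p

msr_c_sq_over_p = D - B ** 2 / A  # constrained maximum Sharpe ratio (squared), over p
assert A > 0 and D > 0            # positive definiteness of Theta
assert msr_c_sq_over_p >= 0       # AD - B^2 >= 0 by Cauchy-Schwarz
```

With a consistent estimate $\hat{\Theta}$ in place of $\Sigma^{-1}$, the same three scalars become $\hat{A}$, $\hat{B}$, $\hat{D}$, whose rates Lemmas A.2-A.4 control.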
The following lemma establishes the orders of the terms $A$, $B$, and $D$ that appear in the optimal weights. Note that both $A$ and $D$ are positive and uniformly bounded away from zero by Assumption 2.
Lemma A.5. Under Assumption 2, $A = O(1)$, $|B| = O(1)$, and $D = O(1)$.

Proof of Lemma A.5. We provide the proof for the term $D = \mu'\Theta\mu/p$; the proof for $A = 1_p'\Theta 1_p/p$ is the same. We have
\[
D = \mu'\Theta\mu/p \le Eigmax(\Theta)\|\mu\|_2^2/p = O(1),
\]
where we use the fact that each $\mu_j$ is bounded by a constant, as in Assumption 2, and that the maximal eigenvalue of $\Theta = \Sigma^{-1}$ is finite by Assumption 2. For the term $B$, the proof can be obtained by first applying the Cauchy-Schwarz inequality and then the same analysis as for the terms $A$ and $D$. Q.E.D.
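For completeness, the Cauchy-Schwarz step referred to in the proof for the term $B$ can be written out as follows (a sketch, using that $\langle u, v\rangle_\Theta := u'\Theta v$ is an inner product since $\Theta$ is positive definite):

```latex
|B| = \frac{|1_p'\Theta\mu|}{p}
    \le \frac{\sqrt{(1_p'\Theta 1_p)(\mu'\Theta\mu)}}{p}
    = \sqrt{\frac{1_p'\Theta 1_p}{p}}\sqrt{\frac{\mu'\Theta\mu}{p}}
    = \sqrt{AD} = O(1),
```

so the bounds on $A$ and $D$ deliver the bound on $|B|$.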
Next, we need the following technical lemma, which provides the rate of convergence for the denominator term in the optimal portfolio.

Lemma A.6. Under Assumptions 1-4, uniformly over $j$, for $\lambda_j = O(\sqrt{\ln p/n})$,
\[
|(\hat{A}\hat{D} - \hat{B}^2) - (AD - B^2)| = O_p\left(\bar{s}\sqrt{\frac{\ln p}{n}}\right) = o_p(1).
\]

Proof of Lemma A.6. Note that by simple addition and subtraction,
\[
\hat{A}\hat{D} - \hat{B}^2 = [(\hat{A} - A) + A][(\hat{D} - D) + D] - [(\hat{B} - B) + B]^2.
\]
Then, using this last expression and simplifying, with $A$ and $D$ both positive,
\[
|(\hat{A}\hat{D} - \hat{B}^2) - (AD - B^2)| \le |\hat{A} - A||\hat{D} - D| + |\hat{A} - A|D + A|\hat{D} - D| + (\hat{B} - B)^2 + 2|B||\hat{B} - B| = O_p\left(\bar{s}\sqrt{\ln p/n}\right) = o_p(1), \tag{A.91}
\]
where we use (A.72), (A.79), (A.90), Lemma A.5, and Assumption 4: $\bar{s}\sqrt{\ln p/n} = o(1)$. Q.E.D.
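The triangle-inequality bound (A.91) can be sanity-checked numerically; the sketch below uses hand-made scalars (illustrative assumptions, not from the paper) for the population quantities and their estimation errors:

```python
# Numeric check of the bound (A.91): with a = Ahat - A, b = Bhat - B, d = Dhat - D,
# |Ahat*Dhat - Bhat^2 - (A*D - B^2)| <= |a||d| + |a|D + A|d| + b^2 + 2|B||b|.

A, B, D = 1.5, 0.4, 0.9          # population scalars, A > 0, D > 0 (assumed)
a, b, d = 0.05, -0.03, 0.02      # estimation errors (assumed)
Ahat, Bhat, Dhat = A + a, B + b, D + d

lhs = abs(Ahat * Dhat - Bhat ** 2 - (A * D - B ** 2))
rhs = abs(a) * abs(d) + abs(a) * D + A * abs(d) + b ** 2 + 2 * abs(B) * abs(b)
assert lhs <= rhs
```

Each term on the right-hand side is $o_p(1)$ by Lemmas A.2-A.4 and A.5, which is what drives the rate in (A.91).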
References
Ao, M., Y. Li, and X. Zheng (2019). Approaching mean-variance efficiency for large portfolios. Review of Financial Studies, forthcoming.

Bühlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data. Springer-Verlag.

Callot, L., M. Caner, O. Onder, and E. Ulasan (2019). A nodewise regression approach to estimating large portfolios. Journal of Business and Economic Statistics, forthcoming.

Caner, M. and A. Kock (2018). Asymptotically honest confidence regions for high dimensional parameters by the desparsified conservative lasso. Journal of Econometrics 203, 143-168.

Chang, J., Y. Qiu, Q. Yao, and T. Zou (2019). Confidence regions for entries of a large precision matrix. Journal of Econometrics, forthcoming.

DeMiguel, V., L. Garlappi, and R. Uppal (2007). Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies 22(5), 1915-1953.

Fan, J., Y. Li, and K. Yu (2012). Vast volatility matrix estimation using high frequency data for portfolio selection. Journal of the American Statistical Association 107, 412-428.

Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(4), 603-680.

Fan, J., Y. Liao, and M. Mincheva (2016). POET: Principal Orthogonal Complement Thresholding (POET) Method. R package version 2.0.

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1.

Garlappi, L., R. Uppal, and T. Wang (2007). Portfolio selection with parameter and model uncertainty: A multi-prior approach. Review of Financial Studies 20, 41-81.

Hastie, T. and B. Efron (2013). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 1.2.

Jagannathan, R. and T. Ma (2003). Risk reduction in large portfolios: Why imposing the wrong constraints helps. The Journal of Finance 58, 1651-1684.

Jobson, J. D. and B. M. Korkie (1981). Performance hypothesis testing with the Sharpe and Treynor measures. The Journal of Finance 36(4), 889-908.

Kan, R. and G. Zhou (2007). Optimal portfolio choice with parameter uncertainty. Journal of Financial and Quantitative Analysis 42.

Lai, T., H. Xing, and Z. Chen (2011). Mean-variance portfolio optimization when means and covariances are unknown. The Annals of Applied Statistics 5, 798-823.

Ledoit, O. and M. Wolf (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance 10, 603-621.

Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 365-411.

Ledoit, O. and M. Wolf (2008). Robust performance hypothesis testing with the Sharpe ratio. Journal of Empirical Finance 15(5), 850-859.

Ledoit, O. and M. Wolf (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. Review of Financial Studies 30, 4349-4388.

Maller, R., S. Roberts, and R. Tourky (2016). The large sample distribution of the maximum Sharpe ratio with and without short sales. Journal of Econometrics 194, 138-152.

Maller, R. and D. Turkington (2002). New light on the portfolio allocation problem. Mathematical Methods of Operations Research 56, 501-511.

Markowitz, H. (1952). Portfolio selection. Journal of Finance 7, 77-91.

Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 1436-1462.

Memmel, C. (2003). Performance hypothesis testing with the Sharpe ratio. Finance Letters 1(1).

Ramprasad, P. (2016). nlshrink: Non-Linear Shrinkage Estimation of Population Eigenvalues and Covariance Matrices. R package version 1.0.1.

Tu, J. and G. Zhou (2011). Markowitz meets Talmud: A combination of sophisticated and naive diversification strategies. Journal of Financial Economics 99, 204-215.

van de Geer, S. (2016). Estimation and Testing under Sparsity. Springer-Verlag.

Zhang, Y., R. Li, and C.-L. Tsai (2010). Regularization parameter selections via generalized information criterion.