An Upper Bound for Functions of Estimators in High Dimensions
Mehmet Caner ∗ Xu Han † August 7, 2020
Abstract
We provide an upper bound, as a random variable, for the functions of estimators in high dimensions. This upper bound may help establish the rate of convergence of functions in high dimensions. The upper bound random variable may converge faster, slower, or at the same rate as the estimators, depending on the behavior of the partial derivative of the function. We illustrate this via three examples. The first two examples use the upper bound for testing in high dimensions, and the third example derives the estimated out-of-sample variance of large portfolios. All our results allow for a larger number of parameters, $p$, than the sample size, $n$.

Keywords and phrases: Many assets, many restrictions, Lasso.

∗ North Carolina State University, Nelson Hall, Department of Economics, NC 27695. Email: [email protected].
† City University of Hong Kong, 83 Tat Chee Avenue, KL, Hong Kong S.A.R. Email: [email protected].

1 Introduction
The delta method is one of the most widely used theorems in econometrics and statistics. It is a very simple and useful idea. It can provide limits for complicated functions of estimators as long as the function is differentiable. The limit of the function of estimators can be obtained from the limit of the estimators, with the same rate of convergence. In the case of finite-dimensional parameter estimation, since the derivative at the parameter value is finite, the rates of convergence of the estimators and of functions of the estimators are the same.

In the case of high dimensional parameter estimation, we show that this is not the case, and the rates of convergence may change. We show that the structure of the derivative of the function is the key. An upper bound random variable is provided for the functions of estimators in high dimensions. We show that this upper bound on functions of estimators may converge faster, slower, or at the same rate as the estimators. Even though a new delta theorem is not provided, this upper bound can deliver the rate of convergence of functions of estimators. For example, the variance of a portfolio is a quadratic function of the portfolio weight. Our theorem implies that the convergence rate of the portfolio variance is slower than that of the estimated weight when the number of assets is diverging (see Example 3 in Section 3). From now on we denote the number of assets as $p_n$, since it grows with the sample size.

Our result is useful when the number of parameters $p_n$ is larger than the sample size $n$, where $p_n$ grows with $n$. The reason is that proofs in high dimensional problems usually depend on knowing the rate of convergence of a bound, whether in lasso-type estimation, oracle inequalities, or problems in high dimensional portfolio analysis.

After the main theorem, we illustrate our point in three examples: first by examining a linear function of estimators that is heavily used in econometrics, second by considering a new debiased lasso type of estimator, and third by analyzing the out-of-sample variance of a large portfolio of assets in finance.

Section 2 provides our theorem. Section 3 has three examples. Section 4 provides a discussion of how the results may be tied to nonparametric analysis and many weak instrument asymptotics. The Appendix shows the proof and provides more examples.

2 Theorem

Let $\beta = (\beta_1, \cdots, \beta_{p_n})'$ be a $p_n \times 1$ parameter vector with an estimator $\hat{\beta} = (\hat{\beta}_1, \cdots, \hat{\beta}_{p_n})'$. Define a function $f(\cdot)$, $f: K \subset R^{p_n} \to R^m$. To be specific, $m$ is a constant unless noted otherwise.

We provide two conditions for our theorem. First, denote a column vector of $p_n$ zeros by $0_{p_n}$. Let $f_d(\cdot)$ represent the $m \times p_n$ matrix of derivatives.

Condition C1. For all $h \neq 0_{p_n}$, where $h$ is a $p_n \times 1$ vector,
$$\lim_{h \to 0} \frac{\| f(\beta + h) - f(\beta) - f_d(\beta) h \|_2}{\| h \|_2} = 0,$$
where $\| \cdot \|_2$ is the Euclidean norm for a generic vector, and $f_d(\beta)$ is an $m \times p_n$ matrix whose $(i,j)$th cell consists of $\partial f_i / \partial \beta_j$ evaluated at $\beta$, for $i = 1, \cdots, m$, $j = 1, \cdots, p_n$.

Second, define the rate of convergence of an estimator as $r_n$, a positive sequence in $n$, with $r_n \to \infty$ as $n \to \infty$.

Condition C2. $r_n \| \hat{\beta} - \beta \|_2 = O_p(1)$.

Condition C1 is a high-level assumption that shapes our function of interest. It restricts the function $f$ to be differentiable. Hence, C1 rules out continuous functions that are non-differentiable at $\beta$. Condition C2 gives a convergence rate for the estimator of interest. Several examples of $\hat{\beta}$ and $r_n$ that satisfy C2 are provided in Section 3.
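As a quick numerical check of Condition C1 (a sketch of ours, not code from the paper), the snippet below evaluates the first-order remainder of the quadratic map $f(\beta) = \beta'\Sigma\beta$, which reappears in Example 3; its derivative is $f_d(\beta) = 2\beta'\Sigma$. The dimension and the simulated $\Sigma$ are illustrative assumptions.

```python
# A minimal numerical check of Condition C1 (our sketch, not the paper's code):
# for f(beta) = beta' Sigma beta, the remainder ratio vanishes as ||h|| -> 0.
import numpy as np

rng = np.random.default_rng(0)
p_n = 50
A = rng.standard_normal((p_n, p_n))
Sigma = A @ A.T / p_n + np.eye(p_n)   # an illustrative positive definite matrix
beta = rng.standard_normal(p_n)

def f(b):
    return b @ Sigma @ b

f_d = 2.0 * (beta @ Sigma)            # 1 x p_n derivative row: f_d(beta) = 2 beta' Sigma

for scale in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = scale * rng.standard_normal(p_n)
    remainder = abs(f(beta + h) - f(beta) - f_d @ h)   # equals |h' Sigma h|
    print(scale, remainder / np.linalg.norm(h))        # ratio shrinks with ||h||
```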
The following theorem is the main theoretical result. Given the convergence rate of the high-dimensional estimator, it provides an upper bound (a random variable) for the estimation error of functions of the high-dimensional estimator. This is useful in econometric theory, since it gives us an idea about what the rate of convergence of functions of estimators might be. Let $k_n$ and $d_n$ be positive sequences in $n$ such that both $k_n \to \infty$ and $d_n \to \infty$ as $n \to \infty$. Take a generic matrix, $D$, of dimensions $m \times p_n$: $\|D\|_F$ denotes the Frobenius norm of the matrix $D$:
$$\|D\|_F = \sqrt{\sum_{i=1}^{m} \|d_i\|_2^2},$$
where $d_i$ is a $p_n \times 1$ vector, and its transpose $d_i'$ is the $i$th row of $D$.

Theorem 2.1. Let Conditions C1 and C2 hold, let $C > 0$ be a universal constant, and assume that $\|f_d(\beta)\|_F > 0$.

a) If $\|f_d(\beta)\|_F = C$, then
$$\| f(\hat{\beta}) - f(\beta) \|_2 \le L_n = O_p\left(\frac{1}{r_n}\right). \quad (2.1)$$

b) If $\|f_d(\beta)\|_F = C k_n$, then
$$\| f(\hat{\beta}) - f(\beta) \|_2 \le L_n = O_p\left(\frac{k_n}{r_n}\right). \quad (2.2)$$

c) If $\|f_d(\beta)\|_F = C / d_n$, then
$$\| f(\hat{\beta}) - f(\beta) \|_2 \le L_n = O_p\left(\frac{1}{d_n r_n}\right) + o_p\left(\frac{1}{r_n}\right) = o_p\left(\frac{1}{r_n}\right). \quad (2.3)$$

Remarks. 1. The theorem does not provide a limit for functions of estimators, so this is not the delta theorem.

2. Part (a) shows under what conditions we can get the same rate of convergence for the functions of estimators as the rate of convergence of $\hat{\beta} - \beta$. Example 2 illustrates this point.

3. Part (b) shows that the norm of the matrix of partial derivatives may change with the dimension of the parameter vector. When $r_n > k_n$ and $k_n \to \infty$, the upper bound $L_n$ converges at a slower rate than the estimators of the parameters. When $r_n / k_n \to 0$, the upper bound is diverging, which tells us that the function of estimators may diverge, too.

4. Part (c) shows that the function of the estimators converges to zero in probability faster than the rate of convergence $r_n$ of the estimators.

5. Theorem 2.1 also holds in $l_1$ and $l_\infty$ norms. These new norm results can be shown when Conditions C1 and C2 hold in $l_1$ and $l_\infty$ norms. We discuss this in Part A of the Appendix.
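Before turning to the examples, the following small sketch (ours; all constants are illustrative assumptions) shows the mechanics of Theorem 2.1 for a linear $f(\beta) = D\beta$: the error $\|f(\hat{\beta}) - f(\beta)\|_2$ is controlled by $\|D\|_F \, \|\hat{\beta} - \beta\|_2$, so rescaling the Frobenius norm of the derivative reproduces the three regimes (a)-(c).

```python
# An illustrative sketch (ours) of the three regimes in Theorem 2.1 for linear f.
import numpy as np

rng = np.random.default_rng(1)
n, p_n, m = 400, 600, 5
r_n = np.sqrt(n)

beta = np.zeros(p_n)
eps = rng.standard_normal(p_n)
eps /= np.linalg.norm(eps)              # unit-norm estimation error direction
beta_hat = beta + eps / r_n             # r_n * ||beta_hat - beta||_2 = 1 (Condition C2)

for label, target in [("(a) constant Frobenius norm", 1.0),
                      ("(b) norm growing like k_n", np.sqrt(p_n)),
                      ("(c) norm shrinking like 1/d_n", 1.0 / np.sqrt(p_n))]:
    D = rng.standard_normal((m, p_n))             # f(b) = D b, so f_d(beta) = D
    D *= target / np.sqrt((D ** 2).sum())         # rescale ||D||_F to the target
    err = np.linalg.norm(D @ (beta_hat - beta))
    bound = np.sqrt((D ** 2).sum()) * np.linalg.norm(beta_hat - beta)
    print(label, err, "<=", bound)                # err <= ||D||_F ||beta_hat - beta||
```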
3 Examples

We now provide three examples that highlight the contribution. The first one is related to linear functions of estimators, the second one considers the Debiased Conservative Lasso (DCL) of Caner and Kock (2018), and the third one is related to the out-of-sample variance of large portfolios.

Example 1.

This example considers the lasso, which is one of the benchmark methods in machine learning. It is a penalized least squares estimator with an $l_1$ penalty. The penalty induces sparsity in the model, which can prevent overfitting (Hastie, Tibshirani and Friedman, 2009; Tibshirani, 1996).

Let us denote by $\beta$ the true value of the $p_n \times 1$ vector of coefficients. The number of true nonzero coefficients is denoted by $s$, with $s > 0$. A simple linear model is:
$$y_t = x_t'\beta + u_t, \quad (3.1)$$
for $t = 1, \cdots, n$. For simplicity, we assume that the $u_t$ are iid errors with zero mean and finite variance and that $x_t$ is a set of $p_n$ deterministic regressors. The iid assumption on $u_t$ is to keep our illustration simple; the following result still holds under more general assumptions on $u_t$ and $x_t$, as shown in Assumption 1 of Caner and Kock (2018).

The lasso estimator is defined as
$$\hat{\beta} = \operatorname{argmin}_{\beta \in R^{p_n}} \sum_{t=1}^{n} \frac{(y_t - x_t'\beta)^2}{n} + 2\lambda_n \sum_{j=1}^{p_n} |\beta_j|, \quad (3.2)$$
where $\beta_j$ is the $j$th element of $\beta$ and $\lambda_n$ is a positive tuning parameter, with $\lambda_n = O(\sqrt{\log p_n / n})$. Corollary 6.14 or Lemma 6.10 of Buhlmann and van de Geer (2011) shows that, for the lasso estimator $\hat{\beta}$, with $p_n > n$,
$$r_n \| \hat{\beta} - \beta \|_2 = O_p(1), \quad (3.3)$$
where
$$r_n = \frac{\sqrt{n / \log p_n}}{\sqrt{s}}. \quad (3.4)$$

Given (3.3), we may be interested in the large sample behavior of $D(\hat{\beta} - \beta)$, where $D$ is an $m \times p_n$ matrix. The $D$ matrix can be thought of as putting restrictions on $\beta$. We want to see whether $D(\hat{\beta} - \beta)$ has a different rate of convergence from $\hat{\beta} - \beta$. From our Theorem 2.1(a), it is clear that $f_d(\beta) = D$. Basically, in the case of inference, this matrix and its row vectors show how many elements of $\beta$ will be involved in the restrictions. If we want to use $s$ elements in each row of $D$ to test $m$ restrictions, then $\|D\|_F = O(\sqrt{s})$. Note that this corresponds to using $s$ elements of $\beta$ for testing $m$ restrictions. In other words, if $k_n = \sqrt{s}$ and $s \to \infty$ as $n \to \infty$, then Theorem 2.1(b) implies that $L_n = O_p(\sqrt{s}/r_n)$. Hence, when testing the $m$ restrictions, the upper bound random variable has a slower rate of convergence than the estimators, so it is possible that functions of estimators also converge more slowly to a limit.

Remark. In high dimensions, a common assumption is to impose $\|d_i\|_2 = 1$. See, for example, Caner, Han and Lee (2018) and Caner and Kock (2018). In that case,
$$\|D\|_F = \sqrt{m},$$
where $m$ is fixed but $p_n$ is growing with $n$. The rate of convergence of $D(\hat{\beta} - \beta)$ will still be $r_n$, which is given in (3.4). Thus, there will be no slowdown of the rate of convergence, and this is a sharp rate, since $\|f_d(\beta)\|_F = \|D\|_F = \sqrt{m}$ is not an upper bound on $f_d(\beta)$.
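A minimal simulation sketch of Example 1 follows (ours, with an assumed Gaussian design and unit noise). Note that scikit-learn's Lasso minimizes $\|y - Xb\|_2^2/(2n) + \alpha\|b\|_1$, which matches (3.2) up to a factor of two, so $\alpha$ plays the role of $\lambda_n$.

```python
# A minimal sketch of Example 1 (design constants are our own assumptions).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p_n, s = 100, 200, 5
X = rng.standard_normal((n, p_n))
beta = np.r_[np.ones(s), np.zeros(p_n - s)]
y = X @ beta + rng.standard_normal(n)

lam = np.sqrt(np.log(p_n) / n)        # lambda_n = O(sqrt(log p_n / n))
beta_hat = Lasso(alpha=lam).fit(X, y).coef_

# Restriction matrix testing the first s coefficients: ||D||_F = sqrt(s).
D = np.hstack([np.eye(s), np.zeros((s, p_n - s))])
err = np.linalg.norm(D @ (beta_hat - beta))
bound = np.sqrt(s) * np.linalg.norm(beta_hat - beta)
print(err, "<=", bound)               # the upper bound of Theorem 2.1(b) holds
```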
Example 2.

Another estimator that has recently been analyzed in the context of $p_n > n$ is the DCL of Caner and Kock (2018). Consider the model in Example 1 above in matrix form:
$$Y = X\beta + u,$$
where $Y$ is an $n \times 1$ vector, $X$ is an $n \times p_n$ matrix, $\beta$ is a $p_n \times 1$ vector that consists of $s$ nonzero parameters, and $u$ is an $n \times 1$ error vector. $\hat{\beta}_{CL}$ is the conservative lasso estimator, defined as
$$\hat{\beta}_{CL} = \operatorname{argmin}_{\beta \in R^{p_n}} \left\{ \|Y - X\beta\|_2^2 / n + 2\lambda_n \sum_{j=1}^{p_n} \hat{w}_j |\beta_j| \right\},$$
where $\hat{w}_j = \lambda_n / \max(|\hat{\beta}_j|, \lambda_n)$, $\max(a, b)$ chooses the maximum of the two elements $a$ and $b$, and $\hat{\beta}_j$ is the lasso estimator defined in Example 1. The DCL is uniformly consistent, it has a standard normal asymptotic limit, and it has an asymptotically valid uniform confidence band, unlike the lasso and the conservative lasso. These results are established in Theorem 3 of Caner and Kock (2018). The formula for the DCL estimator $\hat{b}$ is:
$$\hat{b} = \hat{\beta}_{CL} + \hat{\Theta} X'(Y - X\hat{\beta}_{CL}) / n,$$
where $\hat{\Theta}$ is an approximate estimate of the precision matrix, which will be abstracted away in this paper. Information and detailed properties are described in Section 3.2 of Caner and Kock (2018). Specifically, Caner and Kock (2018) derive the rate of convergence for Wald and $\chi^2$ type tests. They also show that the confidence bands on the DCL are contracting at the optimal rate of $n^{-1/2}$. By Theorem 2 of Caner and Kock (2018), we have $n^{1/2}(\hat{b}_j - \beta_j) = O_p(1)$, for $j = 1, \cdots, p_n$. If we want to test $m$ restrictions on $\beta$, with $H = [h_1, \ldots, h_m]'$ representing the restriction matrix of dimension $m \times p_n$, then
$$\|H\|_F = \sqrt{\sum_{i=1}^{m} \|h_i\|_2^2} = \sqrt{m},$$
given $\|h_i\|_2 = 1$ as in Theorem 2 of Caner and Kock (2018). Conditions C1 and C2 are satisfied. If $m$ is a fixed number, then $H(\hat{b} - \beta)$ converges to the limit at rate $n^{1/2}$, and a $\chi^2$ type test converges to the limit at rate $n$, as in (21) of Caner and Kock (2018).

Example 3.

One of the main issues in finance is the analysis of portfolio variance. If we denote the portfolio allocation vector by $w$ ($p_n \times 1$) and the covariance matrix of asset returns by $\Sigma$, then the portfolio variance is $w'\Sigma w$. The out-of-sample estimate of this portfolio variance is $\hat{w}'\Sigma\hat{w}$. This estimate can be seen in Ledoit and Wolf (2017) and Ao et al. (2019). The number of assets, $p_n$, grows with $n$, which is the time span of the portfolio, and $p_n > n$. Let $\mathrm{Eigmax}(A)$ denote the maximum eigenvalue of a matrix $A$, and let $C > 0$ be a positive constant.

We analyze the global minimum variance portfolio weights. Define $1_{p_n}$ as the $p_n \times 1$ vector of ones. The weights are computed as
$$w = \frac{\Sigma^{-1} 1_{p_n}}{1_{p_n}' \Sigma^{-1} 1_{p_n}}.$$
From Theorem 3.3 of Callot et al. (2019), the estimated weights are:
$$\hat{w} = \frac{\hat{\Theta} 1_{p_n}}{1_{p_n}' \hat{\Theta} 1_{p_n}},$$
where $\hat{\Theta}$ is the nodewise regression estimate of $\Sigma^{-1}$. Take $\beta = w$ and $\hat{\beta} = \hat{w}$. So our parameter is of dimension $p_n$, which is growing with $n$ and larger than $n$. Our interest centers on the out-of-sample portfolio variance estimation, given that we can estimate the weights consistently with a known rate of convergence. First, we verify Condition C1. Note that $f(\hat{\beta}) = \hat{\beta}'\Sigma\hat{\beta}$ and $f(\beta) = \beta'\Sigma\beta$. Condition C2 also holds for this example, and we will show that after verifying Condition C1. For C1, with $f_d(\beta) = 2\beta'\Sigma$, we have
$$| f(\beta + h) - f(\beta) - f_d(\beta) h | = | (\beta + h)'\Sigma(\beta + h) - \beta'\Sigma\beta - 2\beta'\Sigma h | = | h'\Sigma h |.$$
Next,
$$\frac{| h'\Sigma h |}{\| h \|_1} \le \frac{\| h \|_2^2 \, \mathrm{Eigmax}(\Sigma)}{\| h \|_1} \le \frac{\| h \|_1^2 \, \mathrm{Eigmax}(\Sigma)}{\| h \|_1} = \| h \|_1 \, \mathrm{Eigmax}(\Sigma) \to 0 \text{ as } \| h \|_1 \to 0,$$
where the second inequality follows from $\| h \|_1 \ge \| h \|_2$, and for the convergence to zero we use the assumption $\mathrm{Eigmax}(\Sigma) \le C < \infty$. Thus, Condition C1 is satisfied. Then Condition C2 is satisfied since, by Theorem 3.3 of Callot et al. (2019), we have
$$r_n \| \hat{w} - w \|_1 = O_p(1), \quad \text{where } r_n = \frac{\sqrt{n}}{\sqrt{\log p_n}\, \bar{s}^{3/2}},$$
with $\bar{s} = \max_{1 \le j \le p_n} s_j$, and $s_j$ is the number of nonzero cells in the $j$th row of the precision matrix. Now, we derive the rate of convergence of the out-of-sample variance estimator. Note that
$$\| f_d(\beta) \|_1 = \| 2\beta'\Sigma \|_1 = 2 \| \Sigma w \|_1,$$
since $\Sigma$ is a $p_n \times p_n$ symmetric matrix and $\beta = w$. Define $\sigma_{i,j}$ as the $(i,j)$th element of the $\Sigma$ matrix. By (7.3),
$$2 \| \Sigma w \|_1 \le 2 \left[ \max_{1 \le j \le p_n} \sum_{i=1}^{p_n} |\sigma_{i,j}| \right] \| w \|_1 = \left[ \max_{1 \le j \le p_n} \sum_{i=1}^{p_n} |\sigma_{i,j}| \right] O(\sqrt{\bar{s}}) = O(\sqrt{\bar{s}}),$$
where the second step uses $\| w \|_1 = O(\sqrt{\bar{s}})$ by Theorem 3.3 of Callot et al. (2019), and the last step uses the assumption $\max_{1 \le j \le p_n} \sum_{i=1}^{p_n} |\sigma_{i,j}| \le C < \infty$. Clearly, we can define $k_n := O(2\|\Sigma w\|_1)$, which implies $k_n = O(\sqrt{\bar{s}})$ by the inequality above. Applying Theorem 2.1(b), we obtain a slower rate for the out-of-sample variance estimator than for the weight estimation:
$$\frac{\sqrt{n}}{\sqrt{\log p_n}\, \bar{s}^2} \, | \hat{w}'\Sigma\hat{w} - w'\Sigma w | = O_p(1),$$
when $\bar{s}$ is growing with $n$.

4 Discussion

Here we provide a brief discussion of our results in some econometric problems. We see three areas related to our technique that may be beneficial to researchers.

The first area is the many weak instruments literature. In the case of many weak instruments, as in Newey and Windmeijer (2009) and Caner (2014), we can derive the rate of convergence for the estimation of the sample moments from the estimation of the parameters in the structural equation. If $\hat{\beta}$ is the generalized empirical likelihood estimator, with $r_n \| \hat{\beta} - \beta \|_2 = O_p(1)$, then we can have
$$r_n d_n \| \hat{g}(\hat{\beta}) - \hat{g}(\beta) \|_2 = O_p(1), \quad \text{with } \hat{g}(\hat{\beta}) = \frac{1}{n} \sum_{i=1}^{n} Z_i (y_i - x_i'\hat{\beta}),$$
where $Z_i$ is an $m \times 1$ vector of instruments and $m$ is growing with $n$. Let $y_i$ be the outcome variable, and let $x_i$ represent the control and endogenous variables (a $p_n \times 1$ vector). This satisfies our Theorem 2.1(c). The details of this example are provided in the Appendix. It extends the results of Example 1 in Section 2 of Caner (2014) and the linear model of Newey and Windmeijer (2009, p. 690-698).

The second area is portfolio analysis, where we illustrate our results through an example in the main text. There are many functions that are nonlinear in the parameters of interest, and it is neither obvious nor trivial to get the rates of these functions in modern portfolio theory when the number of assets, $p_n$, is larger than the time span of the portfolio. Our technique can help: it analyzes the partial derivative of the function at the parameter and automatically finds the rate of convergence through that.

The third area is nonparametric estimation. Our theorem can be applied to obtain the convergence rate of the estimate of a nonparametric function. Consider, for example, the series estimation of the model $y_i = g(x_i) + \varepsilon_i$, with $E(\varepsilon_i | x_i) = 0$, where $g(x)$ can be approximated by a linear combination of basis functions $h^{p_n}(x)$ and $p_n$ is increasing with $n$. Let $\mathcal{S}$ be the compact support of $x$. Assume that
$$\sup_{x \in \mathcal{S}} | g(x) - h^{p_n}(x)'\beta | = O(p_n^{-\alpha}) \text{ for some } \alpha > 0. \quad (4.1)$$
Define $\beta_{p_n} \equiv \arg\min_{\beta} \sup_{x \in \mathcal{S}} | g(x) - h^{p_n}(x)'\beta |$. The estimator of $\beta_{p_n}$, denoted by $\hat{\beta}$, is obtained by a linear regression of $y_i$ on $h^{p_n}(x_i)$. Newey (1997) shows that $\| \hat{\beta} - \beta_{p_n} \|_2 = O_p(\sqrt{p_n}/\sqrt{n} + p_n^{-\alpha})$, which gives the rate for Condition C2. For Condition C1, consider the linear function $f(\beta) = h^{p_n}(x)'\beta$ for a given $x \in \mathcal{S}$, so $f(\beta_{p_n})$ is the approximation of $g(x)$ and $f(\hat{\beta})$ is the series estimator $h^{p_n}(x)'\hat{\beta}$. We are interested in the convergence rate of $| f(\hat{\beta}) - f(\beta_{p_n}) |$. Since $f_d(\beta_{p_n}) = h^{p_n}(x)'$, Condition C1 holds automatically. By our Theorem 2.1(b), we have
$$| f(\hat{\beta}) - f(\beta_{p_n}) | \le \zeta(p_n) \cdot O_p\left( \sqrt{\frac{p_n}{n}} + \frac{1}{p_n^{\alpha}} \right) = O_p\left( \zeta(p_n)(\sqrt{p_n}/\sqrt{n} + p_n^{-\alpha}) \right), \quad (4.2)$$
where $\zeta(p_n) = \sup_{x \in \mathcal{S}} \| h^{p_n}(x) \|_2$. Together with (4.1), the rate in (4.2) implies
$$| h^{p_n}(x)'\hat{\beta} - g(x) | \le | f(\hat{\beta}) - f(\beta_{p_n}) | + \sup_{x \in \mathcal{S}} | h^{p_n}(x)'\beta_{p_n} - g(x) | = O_p\left( \zeta(p_n)(\sqrt{p_n}/\sqrt{n} + p_n^{-\alpha}) \right),$$
which is the well-known convergence rate for the series estimator. Note that our theorem actually gives a sharp upper bound on the convergence rate of the series estimator. Under some regularity conditions, $\zeta(p_n) = O(p_n)$ for power series and $\zeta(p_n) = O(\sqrt{p_n})$ for spline series (Corollary 15.1, Li and Racine, 2007).
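As a toy illustration of the series setup (ours; the regression function $g(x) = \sin(2\pi x)$, the noise level, and the truncation $p_n$ are all assumptions), the sketch below fits a power-series basis, for which $f_d(\beta_{p_n}) = h^{p_n}(x)'$:

```python
# A toy sketch (ours) of the series estimator behind (4.2), with a power-series
# basis h^{p_n}(x) = (1, x, ..., x^{p_n - 1}) on S = [0, 1].
import numpy as np

rng = np.random.default_rng(3)
n, p_n = 500, 6
x = rng.uniform(0.0, 1.0, n)
g = np.sin(2 * np.pi * x)                    # hypothetical true regression function
y = g + 0.1 * rng.standard_normal(n)

H = np.vander(x, p_n, increasing=True)       # n x p_n power-series design matrix
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

x0 = 0.3
h_x0 = x0 ** np.arange(p_n)                  # f_d(beta_{p_n}) = h^{p_n}(x0)'
print("series estimate:", h_x0 @ beta_hat, "truth:", np.sin(2 * np.pi * x0))
```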
5 Simulation

In this section, we study the degree of conservativeness of the bound derived in Theorem 2.1 via simulation. We consider the lasso estimator discussed in Example 1. The model is generated using (3.1), where $x_i \sim \text{iid } N(0, I_{p_n})$, $u_i \sim \text{iid } N(0, s)$, and $\beta = (1_{1 \times s}, 0_{1 \times (p_n - s)})'$. We set two values of $s$, $p_n \in \{50, 100, 200, 300\}$, and three values of $n$. Since $\lambda_n = O(\sqrt{\log(p_n)/n})$, we select the tuning parameter from the set $\{ \lambda_n = c\sqrt{\log(p_n)/n} \}$, where $c$ ranges over a grid of positive constants, by minimizing the following information criterion:
$$\lambda^* = \arg\min_{\lambda_n} \left[ \log \hat{\sigma}^2(\lambda_n) + \frac{\hat{s}(\lambda_n)}{n} \log(n) \log\log(p_n) \right],$$
where $\hat{s}(\lambda_n)$ is the number of nonzero entries in the lasso estimator given by (3.2) using tuning parameter $\lambda_n$, and $\hat{\sigma}^2(\lambda_n)$ is the corresponding mean squared residual. The term $\log\log(p_n)$ follows the design of Caner, Han and Lee (2018) to deal with the high dimensionality.

We focus on inference on the first $s$ parameters in $\beta$, so $f(\beta) = D\beta$ with $D = (I_s, 0_{s \times (p_n - s)})$. We compute the ratio
$$\frac{\| f_d(\beta) \|_F \, \| \hat{\beta} - \beta \|_2}{\| f(\hat{\beta}) - f(\beta) \|_2} = \frac{\sqrt{s}\, \| \hat{\beta} - \beta \|_2}{\| D\hat{\beta} - D\beta \|_2}$$
to see the tightness of the bound. Table 1 reports the average of this ratio over 1000 replications. It is evident that the averaged ratio is very close to $\sqrt{s}$, which implies that most of the zero elements in $\beta$ are estimated as zero by the lasso. In the (infeasible) oracle case, where all zero parameters are exactly estimated as zero, the ratio would be equal to $\sqrt{s}$. We clearly see that our upper bound grows with sparsity, and it will not be that tight unless the model is very sparse. However, the main point is that in high dimensional econometrics it is complicated to get closed-form solutions; hence we rely on upper bounds. For example, the oracle inequalities and $l_1$ norm results are all upper bounds.

Table 1: The Ratio of the Upper Bound to the Function of the Lasso Estimator (rows indexed by $(n, s)$; columns: $p_n = 50, 100, 200, 300$).
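A compact sketch of this simulation design is given below (ours; the grid of constants $c$, the values of $n$ and $s$, and the reduced number of replications are our own illustrative choices). When the fitted model is sparse, the printed average ratio should land near $\sqrt{s}$.

```python
# A sketch of the simulation design (grid and sample sizes are our assumptions).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p_n, s = 100, 100, 4
reps, ratios = 200, []

for _ in range(reps):
    X = rng.standard_normal((n, p_n))
    beta = np.r_[np.ones(s), np.zeros(p_n - s)]
    y = X @ beta + rng.standard_normal(n) * np.sqrt(s)   # u_i ~ N(0, s)

    best_ic, beta_hat = np.inf, None
    for c in [0.1, 0.5, 1.0, 2.0, 4.0]:                  # assumed grid of c values
        lam = c * np.sqrt(np.log(p_n) / n)
        b = Lasso(alpha=lam).fit(X, y).coef_
        sigma2 = np.mean((y - X @ b) ** 2)
        s_hat = np.count_nonzero(b)
        ic = np.log(sigma2) + s_hat / n * np.log(n) * np.log(np.log(p_n))
        if ic < best_ic:
            best_ic, beta_hat = ic, b

    num = np.sqrt(s) * np.linalg.norm(beta_hat - beta)
    den = np.linalg.norm(beta_hat[:s] - beta[:s])        # ||D beta_hat - D beta||_2
    ratios.append(num / den)

print("average ratio:", np.mean(ratios))                 # near sqrt(s) if fit is sparse
```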
6 Conclusion

We provide an upper bound for functions of estimators in high dimensions. We also show three examples to illustrate the main theorem. It is possible to extend our result to more relevant econometric issues in the age of big data. In summary, our method can be useful for obtaining the rate of convergence of functionals of estimators, since it uses the partial derivative of the function at $\beta$. Our bound can be beneficial in high dimensional scenarios where it may be difficult to give a direct proof of the rate of convergence.

References

Abadir, K. and J.R. Magnus (2005). Matrix Algebra. Cambridge University Press, Cambridge.

Ao, M., Y. Li and X. Zheng (2019). Approaching mean-variance efficiency for large portfolios. Review of Financial Studies, forthcoming.

Buhlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data. Springer Verlag, Berlin.

Callot, L., M. Caner, A.O. Onder and E. Ulasan (2019). A nodewise regression approach to estimating large portfolios. Journal of Business and Economic Statistics, forthcoming.

Caner, M. (2014). Near exogeneity and weak identification in generalized empirical likelihood estimators. Journal of Econometrics, 182, 247-288.

Caner, M., X. Han and Y. Lee (2018). Adaptive elastic net GMM estimation with many invalid moment conditions: simultaneous model and moment selection. Journal of Business and Economic Statistics, 36, 24-46.

Caner, M. and A.B. Kock (2018). Asymptotically honest confidence regions for high dimensional parameters by the desparsified conservative lasso. Journal of Econometrics, 203, 143-168.

Fan, J., Y. Liao and X. Shi (2015). Risks of large portfolios. Journal of Econometrics, 186, 367-387.

Hastie, T., R. Tibshirani and J. Friedman (2009). The Elements of Statistical Learning. Springer Verlag, New York.

Horn, R.A. and C. Johnson (2013). Matrix Analysis, Second Edition. Cambridge University Press, Cambridge.

Ledoit, O. and M. Wolf (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection: Markowitz meets Goldilocks. Review of Financial Studies, 30, 4349-4388.

Li, Q. and J. Racine (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton.

Newey, W. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1), 147-168.

Newey, W. and F. Windmeijer (2009). Generalized method of moments with many weak moment conditions. Econometrica, 77, 687-719.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

van de Geer, S., P. Buhlmann, Y. Ritov and R. Dezeure (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42, 1166-1202.

van der Vaart, A.W. (2000). Asymptotic Statistics. Cambridge University Press, Cambridge.
7 Appendix

The appendix has three parts. In Part A, we introduce the matrix inequalities that are used in the proofs. In Part B, we provide the proof of our theorem and show why a classical proof of the delta method does not work in high dimensions. In Part C, we give more examples that are tied to our main theorem.

PART A.

Take a generic matrix, $A$, of dimension $m \times p_n$. Denote the Frobenius norm of a matrix by $\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{p_n} a_{ij}^2}$. Note that in some of the literature, such as Horn and Johnson (2013), this definition is not considered a matrix norm, due to the lack of submultiplicativity. However, our results do not change regardless of the matrix norm definition. If we use the Horn and Johnson (2013) definitions, our results can be summarized in an algebraic form, rather than in the matrix norm format. Define
$$A = \begin{pmatrix} a_1' \\ \vdots \\ a_m' \end{pmatrix},$$
where $a_i$ is a $p_n \times 1$ vector, and its transpose is $a_i'$, $i = 1, \cdots, m$. Then, for a generic $p_n \times 1$ vector $x$,
$$\|Ax\|_2 = \sqrt{\sum_{i=1}^{m} (a_i'x)^2} \le \sqrt{\sum_{i=1}^{m} \|a_i\|_2^2 \, \|x\|_2^2} = \|A\|_F \, \|x\|_2, \quad (7.1)$$
where the inequality is obtained by the Cauchy-Schwarz inequality. Note that if we apply Horn and Johnson's (2013) norm definition, this inequality still holds, but we cannot call it a matrix norm inequality. In that case we have
$$\|Ax\|_2 \le \sqrt{\sum_{i=1}^{m} \|a_i\|_2^2} \; \|x\|_2. \quad (7.2)$$
Also, the results (2.1)-(2.3) can be obtained in other matrix norms. A simple Holder inequality provides
$$\|Ax\|_1 \le \|A\|_{\ell_1} \, \|x\|_1, \quad (7.3)$$
where we define the maximum column sum matrix norm $\|A\|_{\ell_1} = \max_{1 \le j \le p_n} \sum_{i=1}^{m} |a_{ij}|$, and $a_{ij}$ is the $(i,j)$th element of the matrix $A$. The theorem can then be written in the $l_1$ norm, replacing the $l_2$ norm. We can also extend these results to another norm. A simple inequality provides
$$\|Ax\|_\infty \le \|A\|_{\ell_\infty} \, \|x\|_\infty, \quad (7.4)$$
where we define the maximum row sum matrix norm $\|A\|_{\ell_\infty} = \max_{1 \le i \le m} \sum_{j=1}^{p_n} |a_{ij}|$, and Theorem 2.1 can be written in the $l_\infty$ norm as well.

PART B.

First, we show why the classical proof of the delta method does not work in high dimensions. This is not a negative result, however, since it guides us towards the solution.

By Condition C1, via p. 352 of Abadir and Magnus (2005), with $l(\cdot): D \subset R^{p_n} \to R^m$ a vector function,
$$\| f(\hat{\beta}) - f(\beta) - f_d(\beta)[\hat{\beta} - \beta] \|_2 = \| l(\hat{\beta} - \beta) \|_2, \quad (7.5)$$
and
$$\| l(\hat{\beta} - \beta) \|_2 = o_p(\| \hat{\beta} - \beta \|_2), \quad (7.6)$$
where we use Lemma 2.12 of van der Vaart (2000). Since we are given $r_n \| \hat{\beta} - \beta \|_2 = O_p(1)$, which is Condition C2, by (7.6)
$$r_n \| l(\hat{\beta} - \beta) \|_2 = o_p(1). \quad (7.7)$$
By (7.5)-(7.7),
$$\| r_n [f(\hat{\beta}) - f(\beta)] - r_n f_d(\beta)[\hat{\beta} - \beta] \|_2 = o_p(1). \quad (7.8)$$
But this is the same result as in the regular delta method. So far, (7.8) is mainly a simple extension of Theorem 3.1 in van der Vaart (2000) to Euclidean spaces. However, the main caveat comes from the derivative matrix $f_d(\beta)$, which is of dimension $m \times p_n$. The rate of this matrix plays a role when $p_n \to \infty$ as $n \to \infty$. For example, both $r_n f_d(\beta)[\hat{\beta} - \beta]$ and $r_n [f(\hat{\beta}) - f(\beta)]$ may be diverging even though $r_n \| \hat{\beta} - \beta \|_2 = O_p(1)$. Hence the delta method is not that useful if our interest centers on getting rates for estimators as well as for functions of estimators that converge. In the fixed $p$ case this is not an issue, since the derivative matrix will not affect the rate of convergence at all, as long as it is bounded away from zero and bounded from above away from infinity. Note that these boundedness assumptions may not be intact when $p_n \to \infty$ as $n \to \infty$. The next part shows how to correct this problem.
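The caveat above is easy to see numerically. In the worst-case sketch below (ours), the estimation error satisfies $r_n \| \hat{\beta} - \beta \|_2 = 1$ exactly, yet for $f(b) = 1_{p_n}'b$, with $\|f_d(\beta)\|_F = \sqrt{p_n}$, the scaled error of the function diverges:

```python
# A worst-case sketch (ours): C2 holds exactly, yet r_n |f(beta_hat) - f(beta)|
# grows like sqrt(p_n) because the derivative norm is unbounded.
import numpy as np

for n in [100, 400, 1600]:
    p_n = 2 * n
    r_n = np.sqrt(n)
    u = np.ones(p_n) / np.sqrt(p_n)     # unit vector: worst-case error direction
    beta = np.zeros(p_n)
    beta_hat = beta + u / r_n           # r_n * ||beta_hat - beta||_2 = 1 (C2)
    d = np.ones(p_n)                    # f(b) = d'b, so ||f_d||_F = sqrt(p_n)
    print(n,
          r_n * np.linalg.norm(beta_hat - beta),   # stays 1
          r_n * abs(d @ (beta_hat - beta)))        # equals sqrt(p_n), diverges
```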
Proof of Theorem 2.1. From Condition C1, using p. 352 of Abadir and Magnus (2005), or the proof of Theorem 3.1 in van der Vaart (2000),
$$f(\hat{\beta}) - f(\beta) = f_d(\beta)[\hat{\beta} - \beta] + l(\hat{\beta} - \beta).$$
Using the Euclidean norm and the triangle inequality, we have
$$\| f(\hat{\beta}) - f(\beta) \|_2 = \| f_d(\beta)[\hat{\beta} - \beta] + l(\hat{\beta} - \beta) \|_2 \le \| f_d(\beta)[\hat{\beta} - \beta] \|_2 + \| l(\hat{\beta} - \beta) \|_2.$$
Next, multiply each side by $r_n$ and use (7.7):
$$r_n \| f(\hat{\beta}) - f(\beta) \|_2 \le r_n \| f_d(\beta)[\hat{\beta} - \beta] \|_2 + o_p(1). \quad (7.9)$$
Then apply the matrix norm inequality in (7.1) to the first term on the right side of (7.9):
$$r_n \| f_d(\beta)[\hat{\beta} - \beta] \|_2 \le r_n \left[ \| f_d(\beta) \|_F \right] \left[ \| \hat{\beta} - \beta \|_2 \right]. \quad (7.10)$$
Substitute (7.10) into (7.9) to obtain
$$r_n \| f(\hat{\beta}) - f(\beta) \|_2 \le r_n \left[ \| f_d(\beta) \|_F \right] \left[ \| \hat{\beta} - \beta \|_2 \right] + o_p(1). \quad (7.11)$$
On the right side of (7.11), note that by Condition C2 we have $r_n \| \hat{\beta} - \beta \|_2 = O_p(1)$, so
$$r_n \| f(\hat{\beta}) - f(\beta) \|_2 \le \left[ \| f_d(\beta) \|_F \right] O_p(1) + o_p(1). \quad (7.12)$$
Next, divide each side by $r_n$ to obtain
$$\| f(\hat{\beta}) - f(\beta) \|_2 \le \left[ \| f_d(\beta) \|_F \right] O_p\left(\frac{1}{r_n}\right) + o_p\left(\frac{1}{r_n}\right). \quad (7.13)$$
Then parts a)-c) follow by substituting the different specifications for $\| f_d(\beta) \|_F$ on the right side.

Note that we can prove Theorem 2.1 in the $l_1$ and $l_\infty$ norms, too. To see this, (7.6)-(7.7) also hold in the $l_1$ and $l_\infty$ norms, since Lemma 2.12 of van der Vaart (2000) also holds in those norms. Hence, the proof goes through with the $l_1$ and $l_\infty$ versions of C1 and C2, using (7.3) and (7.4). Q.E.D.

Remark. Note that the proof of Theorem 2.1 mainly uses the triangle and Cauchy-Schwarz inequalities. Under some special conditions, these inequalities hold with equality. For example, $|a'b| = \|a\|_2 \|b\|_2$ and $\|a + b\|_2 = \|a\|_2 + \|b\|_2$ when $a$ and $b$ are on the same ray. However, these conditions are very restrictive, and they do not hold in general. Thus, we only have inequalities in most cases.

PART C.

Example AC.1. Here we use the setup on p. 690 and p. 698 of Newey and Windmeijer (2009), which is also used in Section 2 of Caner (2014). For $i = 1, \cdots, n$,
$$y_i = x_i'\beta + \epsilon_i, \quad x_i = \Psi_i + \eta_i,$$
and we have $E[\epsilon_i | Z_i, \Psi_i] = 0$ and $E[\eta_i | Z_i, \Psi_i] = 0$; $\eta_i$ and $\epsilon_i$ are correlated. Also, $y_i$ is a scalar, $x_i$ is a $p \times 1$ vector of control and endogenous variables, and $p$ is fixed here. $Z_i$ is an $m \times 1$ vector of instrumental variables. $\Psi_i$ is a $p \times 1$ vector of reduced form values. The moment function is an $m \times 1$ vector, where $m$ grows with $n$, for $\beta$ in a compact subset of $R^p$:
$$\hat{g}(\beta) = \frac{1}{n} \sum_{i=1}^{n} Z_i (y_i - x_i'\beta).$$
First, define the sequence $\mu_n$ such that $\mu_n = o(n^{1/2})$. This indicates less than strong identification. Suppose the reduced form is
$$x_i = (z_{1i}', x_{2i}')', \quad x_{2i} = \pi_{21} z_{1i} + \frac{\mu_n}{n^{1/2}} z_{2i} + \eta_{2i}, \quad Z_i = (z_{1i}', Z_{2i}')',$$
where $z_{1i}$ is a $p_1$ vector of included exogenous variables (controls), $z_{2i}$ is a $(p - p_1) \times 1$ vector of excluded exogenous variables, and $Z_{2i}$ is an $(m - p_1) \times 1$ vector of instruments. If $z_{2i}$ is not observed, it is approximated by $Z_{2i}$, which can be a power series or splines. Define the $p \times p$ matrix
$$\tilde{S}_n = \begin{pmatrix} I_{p_1} & 0 \\ \pi_{21} & I_{p - p_1} \end{pmatrix}.$$
Define $S_n = \tilde{S}_n \, \mathrm{diag}(\mu_{1,n}, \cdots, \mu_{p_1,n}, \mu_{p_1+1,n}, \cdots, \mu_{p,n})$, where $\mathrm{diag}(\cdot)$ is a $p \times p$ diagonal matrix whose first $p_1$ elements are $n^{1/2}$ and whose remaining elements are $\mu_n$. So $\mu_{j,n} = n^{1/2}$ for $j = 1, \cdots, p_1$ and $\mu_{j,n} = \mu_n$ for $j = p_1 + 1, \cdots, p$. With $z_i = (z_{1i}', z_{2i}')'$, the reduced form can be written as:
$$\Psi_i = \begin{pmatrix} z_{1i} \\ \pi_{21} z_{1i} + \dfrac{\mu_n}{n^{1/2}} z_{2i} \end{pmatrix} = S_n \frac{z_i}{n^{1/2}}.$$
Note that
$$\hat{g}(\hat{\beta}) - \hat{g}(\beta) = -\frac{1}{n} \sum_{i=1}^{n} Z_i x_i' (\hat{\beta} - \beta),$$
as in the supplement (Appendix, p. 11) of the proof of Theorem 2 in Newey and Windmeijer (2009). Condition C1 is not needed, since the system is linear in $\beta$; there is no need for (7.6), and we benefit from a direct proof of Theorem 2.1. By Theorem 3 of Newey and Windmeijer (2009),
$$\| S_n' (\hat{\beta} - \beta) \|_2 = O_p(1), \quad (7.14)$$
which is Condition C2. Next, $f_d(\beta) = \hat{G} = -\frac{1}{n} \sum_{i=1}^{n} Z_i x_i'$, so clearly $\| \hat{g}(\hat{\beta}) - \hat{g}(\beta) \|_2 = \| \hat{G}(\hat{\beta} - \beta) \|_2$. Then
$$n^{1/2} \| \hat{g}(\hat{\beta}) - \hat{g}(\beta) \|_2 = \| \hat{G} n^{1/2} S_n^{-1\prime} S_n'(\hat{\beta} - \beta) \|_2 \le \| \hat{G} n^{1/2} S_n^{-1\prime} \|_F \, \| S_n'(\hat{\beta} - \beta) \|_2 = O_p(1) O_p(1),$$
since by the proof of Lemma A.7 in Newey and Windmeijer (2009) we have $\| \hat{G} n^{1/2} S_n^{-1\prime} \|_F = O_p(1)$, and the other rate is given by (7.14). Since $S_n$ has mixed rates, $n^{1/2}$ and $\mu_n$, and $n^{1/2}$ is faster than or equal to both of them, this satisfies Theorem 2.1.
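Since the moment map in this example is exactly linear, the identity $\hat{g}(\hat{\beta}) - \hat{g}(\beta) = \hat{G}(\hat{\beta} - \beta)$ can be verified directly; the sketch below (ours, with assumed Gaussian data) does so:

```python
# A small sketch (ours) of the linear moment map in Example AC.1:
# g_hat(beta_hat) - g_hat(beta) = -(1/n) sum_i Z_i x_i' (beta_hat - beta).
import numpy as np

rng = np.random.default_rng(6)
n, p, m = 500, 3, 25
Z = rng.standard_normal((n, m))
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

def g_hat(b):
    return Z.T @ (y - X @ b) / n

beta_hat = beta + rng.standard_normal(p) / np.sqrt(n)
G_hat = -Z.T @ X / n                           # f_d(beta) for this linear map
lhs = g_hat(beta_hat) - g_hat(beta)
rhs = G_hat @ (beta_hat - beta)
print(np.allclose(lhs, rhs))                   # True: C1 holds with zero remainder
```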
Example AC.2.

Example 3 in the main text considers the out-of-sample portfolio variance using the global minimum variance portfolio weights. Alternatively, we can apply Theorem 2.1 to study the convergence rate of the estimated portfolio variance using an estimated covariance matrix $\hat{\Sigma}$ and a given weight vector $w$. Hence, in this example we are interested in the parameter $\beta = \mathrm{vech}(\Sigma)$ and the function
$$f(\beta) = w'\Sigma w = (w' \otimes w') D_p \mathrm{vech}(\Sigma) = (w' \otimes w') D_p \beta,$$
where $\Sigma$ is $q \times q$, $w$ is $q \times 1$, $\beta$ is $p_n \times 1$ with $p_n = q(q+1)/2$, and $D_p$ is a duplication matrix. Let $q$ (and thus $p_n$) grow with $n$. Since $f(\beta)$ is linear in $\beta$, Condition C1 is always satisfied. Recall that $\|A\|_\infty = \max_i \max_j |A_{ij}|$ for a matrix $A$ (note that this differs from $\|A\|_{\ell_\infty}$, the maximum row sum matrix norm), $\|A\|_F$ is the Frobenius norm of a matrix $A$, and $\|v\|_2$ is the $l_2$ norm of a vector $v$.

For Condition C2, by van de Geer et al. (2014) we have
$$\| \hat{\Sigma} - \Sigma \|_\infty = O_p\left( \sqrt{\frac{\log q}{n}} \right),$$
so, by the symmetry of $\hat{\Sigma} - \Sigma$ and p. 365 of Horn and Johnson (2013),
$$\| \hat{\Sigma} - \Sigma \|_F \le q \, \| \hat{\Sigma} - \Sigma \|_\infty = O_p\left( q \sqrt{\frac{\log q}{n}} \right),$$
which implies
$$\| \hat{\beta} - \beta \|_2 \le \| \hat{\Sigma} - \Sigma \|_F \le O_p\left( q \sqrt{\frac{\log q}{n}} \right). \quad (7.15)$$
Hence, we require that $q$ diverge at a slower rate than $n$ for C2 to hold. Next,
$$\| f_d \|_F = \| (w' \otimes w') D_p \|_F = \mathrm{trace}[(w' \otimes w') D_p D_p' (w \otimes w)]^{1/2} \le \mathrm{Eigmax}(D_p D_p')^{1/2} \, \| w' \otimes w' \|_2 = \sqrt{2} \, \| w \otimes w \|_2 = \sqrt{2} \, \| w \|_2^2 \le \sqrt{2} \, \| w \|_1^2,$$
where the inequality uses the fact that $\mathrm{Eigmax}(D_p D_p') = \mathrm{Eigmax}(D_p' D_p) = 2$, because $D_p' D_p$ is a diagonal matrix with elements 1 and 2. If we assume that $\| w \|_1^2 = k_n$, i.e., the gross exposure is of order $k_n^{1/2}$, then
$$f(\hat{\beta}) - f(\beta) = w'\hat{\Sigma}w - w'\Sigma w = O_p\left( k_n q \sqrt{\frac{\log q}{n}} \right)$$
by Theorem 2.1(b).

Note that by using tools of modern high dimensional portfolio analysis, as in Example 3, with a longer proof, we can get a rate better than the one in (7.15). Here, we compare the estimator's rate with that of the function of the estimator, which is the portfolio variance estimate with fixed weights. In this example, it turns out that the gross exposure is the key to the rate.

Now we provide some perspective, in the following remark, on why our proof technique is desirable.
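The $\mathrm{vech}$ algebra above is easy to check numerically. The sketch below (ours; the duplication-matrix construction is a standard textbook identity, not code from the paper) verifies $w'\Sigma w = (w' \otimes w') D_p \mathrm{vech}(\Sigma)$ for a small $q$:

```python
# A sketch (ours) of the vech parameterization in Example AC.2.
import numpy as np

def duplication_matrix(q):
    """D_p with vec(S) = D_p vech(S) for a symmetric q x q matrix S."""
    rows = []
    for j in range(q):                     # vec is column-major: entry S[i, j]
        for i in range(q):
            col = np.zeros(q * (q + 1) // 2)
            a, b = max(i, j), min(i, j)    # vech stores the lower triangle
            col[b * q - b * (b - 1) // 2 + (a - b)] = 1.0
            rows.append(col)
    return np.array(rows)

def vech(S):
    q = S.shape[0]
    return np.concatenate([S[j:, j] for j in range(q)])

rng = np.random.default_rng(7)
q = 4
A = rng.standard_normal((q, q))
Sigma = A @ A.T                            # symmetric positive definite
w = rng.standard_normal(q)

Dp = duplication_matrix(q)
lhs = w @ Sigma @ w
rhs = np.kron(w, w) @ Dp @ vech(Sigma)     # (w' ⊗ w') D_p vech(Sigma)
print(np.isclose(lhs, rhs))                # True
```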
Remark. This remark gives a direct derivation of the bound for Example 3 in the main text, without using the technique we propose. Compared with the proposed upper bound technique, the following alternative method involves more steps and more comparisons between terms and matrix inequalities.

First, rearranging the terms in the out-of-sample variance yields
$$| f(\hat{\beta}) - f(\beta) | = | \hat{\beta}'\Sigma\hat{\beta} - \beta'\Sigma\beta | = | \hat{w}'\Sigma\hat{w} - w'\Sigma w | = | \hat{w}'\Sigma\hat{w} - w'\Sigma\hat{w} + w'\Sigma\hat{w} - w'\Sigma w |$$
$$= | (\hat{w} - w)'\Sigma\hat{w} + w'\Sigma(\hat{w} - w) | = | (\hat{w} - w)'\Sigma\hat{w} - (\hat{w} - w)'\Sigma w + (\hat{w} - w)'\Sigma w + w'\Sigma(\hat{w} - w) |$$
$$= | (\hat{w} - w)'\Sigma(\hat{w} - w) + 2(\hat{w} - w)'\Sigma w |,$$
where we add and subtract $w'\Sigma\hat{w}$ and $(\hat{w} - w)'\Sigma w$ to get the third and fifth equalities, respectively, and we use the symmetry of $\Sigma$ in the last equality.

Using the above equality, we have
$$| f(\hat{\beta}) - f(\beta) | = | (\hat{w} - w)'\Sigma(\hat{w} - w) + 2(\hat{w} - w)'\Sigma w |$$
$$\le | (\hat{w} - w)'\Sigma(\hat{w} - w) | + 2 | (\hat{w} - w)'\Sigma w |$$
$$\le \| \hat{w} - w \|_1 \| \Sigma(\hat{w} - w) \|_\infty + 2 \| \hat{w} - w \|_1 \| \Sigma w \|_\infty \quad (7.16)$$
$$\le \| \hat{w} - w \|_1 \| \Sigma \|_\infty \| \hat{w} - w \|_1 + 2 \| \hat{w} - w \|_1 \| \Sigma \|_\infty \| w \|_1 \quad (7.17)$$
$$= \| \hat{w} - w \|_1^2 \| \Sigma \|_\infty + 2 \| \hat{w} - w \|_1 \| \Sigma \|_\infty \| w \|_1, \quad (7.18)$$
where the second line uses the triangle inequality, the third line uses Holder's inequality, and the fourth line follows from $\| Ax \|_\infty \le \| A \|_\infty \| x \|_1$ for a generic matrix $A$ and a generic vector $x$, by our definition of $\| A \|_\infty$.

Next, we evaluate the first term in (7.18). By Theorem 3.3 of Callot et al. (2019), we have
$$\| \hat{w} - w \|_1 = O_p\left( \frac{\sqrt{\log p_n}}{\sqrt{n}} \bar{s}^{3/2} \right).$$
We assume that $\| \Sigma \|_\infty = O(1)$. So the first term's rate is:
$$\| \hat{w} - w \|_1^2 \| \Sigma \|_\infty = O_p\left( \frac{\log p_n}{n} \bar{s}^3 \right). \quad (7.19)$$
Then, since $\| w \|_1 = O(\bar{s}^{1/2})$ by the remark on Theorem 3.3 of Callot et al. (2019), the second term on the right side of (7.18) has the rate
$$\| \hat{w} - w \|_1 \| \Sigma \|_\infty \| w \|_1 = O_p\left( \frac{\sqrt{\log p_n}}{\sqrt{n}} \bar{s}^{3/2} \right) O_p(\bar{s}^{1/2}) = O_p\left( \frac{\sqrt{\log p_n}}{\sqrt{n}} \bar{s}^2 \right). \quad (7.20)$$
The rate in (7.20) is slower than the rate in (7.19), hence it is the rate of convergence of the out-of-sample variance estimator. This means that $r_n = \frac{\sqrt{n}}{\sqrt{\log p_n}\, \bar{s}^{3/2}}$ and $k_n = \bar{s}^{1/2}$, so by Theorem 2.1(b), $r_n / k_n = \frac{\sqrt{n}}{\sqrt{\log p_n}\, \bar{s}^2}$. Thus, we reach the same conclusion as in Example 3, but with additional inequalities and comparisons. Of course, there is one caveat: the assumptions are slightly different. In Example 3 we assume that $\| \Sigma \|_{\ell_1}$ is finite, whereas in this remark we assume that $\| \Sigma \|_\infty$ is finite. In higher dimensions, the former can be stronger than the latter, depending on the sparsity of the columns of $\Sigma$.

Next we provide a remark that measures the divergence between the two norms that are used.

Remark. The convergence rate implied by Theorem 2.1 might be conservative in some cases. Whether the rate is sharp or conservative depends on the specific setup of the estimator of interest. To see whether our theorem gives a conservative rate in Example 3, we can develop a measure of divergence between our upper bound and the one derived in (7.20) from the direct proof in the previous remark. Let $div$ represent the divergence between the two norms of $\Sigma$:
$$div = \frac{\| \Sigma \|_{\ell_1}}{\| \Sigma \|_\infty} = \frac{\max_{1 \le j \le p_n} \sum_{i=1}^{p_n} |\sigma_{i,j}|}{\max_{1 \le j \le p_n} \max_{1 \le i \le p_n} |\sigma_{i,j}|}.$$
We can see from this divergence measure that our upper bound can be quite large in certain cases, such as when the $\sigma_{i,j}$ are constants bounded away from zero and infinity. But, as said earlier, in the case of $\| \Sigma \|_{\ell_1} \le C < \infty$, the divergence stays bounded, and the two approaches deliver the same rate.
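As a final illustration (our sketch, with two stylized covariance matrices chosen by us), the divergence measure is immediate to compute: for a well-behaved $\Sigma$ the two norms agree, while for a dense constant matrix the divergence grows with the dimension:

```python
# A tiny sketch (ours) of the divergence measure between the two Sigma norms.
import numpy as np

def div(Sigma):
    col_sum = np.abs(Sigma).sum(axis=0).max()   # max column sum norm of Sigma
    elem_max = np.abs(Sigma).max()              # elementwise max norm of Sigma
    return col_sum / elem_max

p = 100
print(div(np.eye(p)))            # 1.0: the bounds agree
print(div(np.ones((p, p))))      # p: the column-sum-based bound is conservative
```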