Robust Inference Under Heteroskedasticity via the Hadamard Estimator
Edgar Dobriban∗ and Weijie J. Su†

July 3, 2018

∗ Wharton Statistics Department, University of Pennsylvania. E-mail: [email protected].
† Wharton Statistics Department, University of Pennsylvania. E-mail: [email protected].
Abstract
Drawing statistical inferences from large datasets in a model-robust way is an important problem in statistics and data science. In this paper, we propose methods that are robust to large and unequal noise in different observational units (i.e., heteroskedasticity) for statistical inference in linear regression. We leverage the Hadamard estimator, which is unbiased for the variances of ordinary least-squares regression. This is in contrast to the popular White's sandwich estimator, which can be substantially biased in high dimensions. We propose to estimate the signal strength, noise level, signal-to-noise ratio, and mean squared error via the Hadamard estimator. We develop a new degrees of freedom adjustment that gives more accurate confidence intervals than variants of White's sandwich estimator. Moreover, we provide conditions ensuring the estimator is well-defined, by studying a new random matrix ensemble in which the entries of a random orthogonal projection matrix are squared. We also show approximate normality, using the second-order Poincaré inequality. Our work provides improved statistical theory and methods for linear regression in high dimensions.
Drawing statistical inferences from large datasets in a way that is robust to model assumptions is an important problem in statistics and data science. In this paper, we study a central question in this area: performing statistical inference for the unknown regression parameters in linear models.
The linear regression model
$$Y = X\beta + \varepsilon \qquad (1)$$
is widely used and fundamental in many areas. The goal is to understand the dependence of an outcome variable $Y$ on some $p$ covariates $x = (x_1, \ldots, x_p)^\top$. We observe $n$ such data points, arranging their outcomes into the $n \times 1$ vector $Y$, and their covariates into the $n \times p$ matrix $X$. We assume that $Y$ depends linearly on $X$, via some unknown $p \times 1$ parameter vector $\beta$.

A fundamental practical problem is that the structure of the noise $\varepsilon$ affects the accuracy of inferences about the regression coefficient $\beta$. If the noise level in an observation is very high, that observation contributes little useful information. Such an observation could bias our inferences, and we should discard or down-weight it. The practical meaning of large noise is that our model underfits the specific observation. However, we usually do not know the noise level of each observation. Therefore, we must design procedures that adapt to unknown noise levels, for instance by constructing preliminary estimators of the noise. This problem of unknown and unequal noise levels, i.e., heteroskedasticity, has long been recognized as a central problem in many applied areas, especially in finance and econometrics.

In applied data analysis, and especially in the fields mentioned above, it is common practice to use the ordinary least-squares (OLS) estimator $\hat\beta = (X^\top X)^{-1} X^\top Y$ as the estimator of the unknown regression coefficients, despite the potential of heteroskedasticity. The OLS estimator is still unbiased, and has other desirable properties, such as consistency, under mild conditions. For statistical inference about $\beta$, the common practice is to use heteroskedasticity-robust confidence intervals.

Specifically, in the classical low-dimensional case when the dimension $p$ is fixed and the sample size $n$ grows, the OLS estimator is asymptotically normal with asymptotic covariance matrix $C_\infty = \lim_{n\to\infty} nC$, with
$$C = \operatorname{Cov}(\hat\beta) = (X^\top X)^{-1} X^\top \Sigma X (X^\top X)^{-1}. \qquad (2)$$
Here the covariance matrix of the noise is a diagonal matrix, $\operatorname{Cov}(\varepsilon) = \Sigma$. To form confidence intervals for individual components of $\beta$, we need to estimate the diagonal entries of $C$. White (1980), in one of the highest cited papers in econometrics, studied the following plug-in estimator of $C$, which simply estimates the unknown noise variances by the squared residuals:
$$\widehat C_W = (X^\top X)^{-1} X^\top \operatorname{diag}(\hat\varepsilon_1^2, \ldots, \hat\varepsilon_n^2)\, X (X^\top X)^{-1}. \qquad (3)$$
Here $\hat\varepsilon = Y - X\hat\beta$ is the vector containing the residuals from the OLS fit. This is also known as the sandwich estimator, the Huber-White, or the
Eicker-Huber-White estimator. White showed that this estimator is consistent for the true covariance matrix of $\hat\beta$, when the sample size grows to infinity, $n \to \infty$, with fixed dimension $p$. Earlier closely related work was done by Eicker (1967); Huber (1967). In theory, these works considered more general problems, but White's estimator was explicit and directly applicable to the central problem of inference in OLS. This may explain why White's work has achieved such a large practical impact, with more than 26,000 citations at the time of writing.

However, it was quickly realized that White's estimator is substantially biased when the sample size $n$ is not too large, for instance when we only have twice as many samples as the dimension. This is a problem, because it can lead to incorrect statistical inferences. MacKinnon and White (1985) proposed a bias-correction that is unbiased under homoskedasticity. However, the question of forming confidence intervals has remained challenging. Despite the unbiasedness of the MacKinnon-White estimate in special cases, confidence intervals based on it have below-nominal probability of covering the true parameters in low dimensions (see e.g., Kauermann and Carroll, 2001). It is not clear if this continues to hold in the high-dimensional case. In fact, in our simulations we observe that these CIs can be anti-conservative in high dimensions. Thus, constructing accurate CIs in high dimensions remains a challenging open problem.

In this paper, we propose to construct confidence intervals via a variance estimator that is unbiased even under heteroskedasticity. Since the estimator (described later) is based on Hadamard products, we call it the Hadamard estimator. This remarkable estimator has been discovered several times (Hartley et al., 1969; Chew, 1970; Cattaneo et al., 2018), and the latter two works do not appear to be aware of the earlier ones. The estimator is also not widely known by researchers in finance and econometrics, and does not appear in standard econometrics textbooks such as Greene (2003), or in recent review papers such as Imbens and Kolesar (2016). We also re-discovered the Hadamard estimator in 2017 while studying the bias of White's estimator, and were surprised to find out about the interesting earlier works. We emphasize that the three papers did not study many of the important properties of this estimator, and it is not even clear based on these works under what conditions this estimator exists.

In our paper, we start by showing how to solve five important problems in the linear regression model using the Hadamard estimator: constructing confidence intervals, estimating the signal-to-noise ratio (SNR), signal strength, noise level, and mean squared error (MSE) in a robust way under heteroskedasticity (Section 2.1). To use the Hadamard estimator, we need to show the fundamental result that it is well-defined (Section 2.2). We prove matching upper and lower bounds on the relation between the dimension and sample size guaranteeing that the Hadamard estimator is generically well-defined. We also prove well-conditioning. For this, we study a new random matrix ensemble in which the entries of a random partial orthogonal projection matrix are squared. Specifically, we prove sharp bounds on the smallest and largest eigenvalues of this matrix.
This mathematical contribution should be of independent interest. Next, we develop a new degrees of freedom correction for the Hadamard estimator, which gives more accurate confidence intervals than several variants of the sandwich estimator (Section 2.3). Finally, we also establish the rate of convergence and approximate normality of the estimator, using the second-order Poincaré inequality (Section 4). We also perform numerical experiments to validate our theoretical results (Section 5). Software implementing our method, and reproducing our results, is available from the authors' GitHub page, http://github.com/dobriban/Hadamard.

Under heteroskedasticity, some fundamental estimation and inference tasks in the linear model are more challenging than under homoskedasticity. As we will see, the difficulty often arises from a lack of a good estimator of the variance of the OLS estimator. For the moment, assume that there is an unbiased estimator of the coordinate-wise variances of the OLS estimator. That is, we consider a vector $\widehat V$ satisfying $\mathbb{E}\widehat V = V$ under heteroskedasticity, where $V = \operatorname{diag} C = \operatorname{diag}\operatorname{Cov}(\hat\beta)$ is defined through equation (2). To define this unbiased estimator, we collect some useful notation as follows, though the estimator itself shall be introduced in detail in Section 2.2. Let $S = (X^\top X)^{-1} X^\top$ be the matrix used in defining the ordinary least-squares estimate, and $Q = I_n - X(X^\top X)^{-1} X^\top$ be the projection onto the orthocomplement of the column space of $X$. Here $I_n$ is the identity matrix. Let us denote by $M \odot M$ the Hadamard, or elementwise, product of a matrix or vector $M$ with itself.

Among others, the following five important applications demonstrate the usefulness of the unbiased variance estimator $\widehat V$.

Constructing confidence intervals.
A first fundamental problem is inference for the individual regression coefficients. Assuming the noise $\varepsilon$ in the linear model (1) follows a heteroskedastic normal distribution $\varepsilon \sim \mathcal{N}(0, \Sigma)$ for a diagonal covariance matrix $\Sigma$, the random variable $(\hat\beta_j - \beta_j)/\sqrt{V_j}$ follows the standard normal distribution. We replace the unknown variance $V_j$ of the OLS estimator by its approximation $\widehat V_j$ and focus on the distribution of the following approximate pivotal quantity
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\widehat V_j}}. \qquad (4)$$
The distribution of this random variable is approximated by a $t$ distribution in Section 2.3, and this plays a pivotal role in constructing confidence intervals and conducting hypothesis testing for the coefficients.

Estimating SNR.
The signal-to-noise ratio (SNR)
$$\mathrm{SNR} = \frac{\|\beta\|^2}{\mathbb{E}\|\varepsilon\|^2} = \frac{\|\beta\|^2}{\operatorname{tr}(\Sigma)}$$
of the linear model (1) is a fundamental measure that quantifies the fraction of an observational unit's variability explained by its covariates. Here $\|x\| = (\sum_i x_i^2)^{1/2}$ is the usual Euclidean norm of a vector $x$. In genetics, the SNR corresponds to heritability if the response $y$ denotes the phenotype of a genetic trait (Visscher et al., 2008). Existing work on estimating this important ratio in linear models, however, largely focuses on the relatively simple case of homoskedasticity (see, for example, Dicker (2014); Janson et al. (2017)). Without appropriately accounting for heteroskedasticity, the estimated SNR may be unreliable.

As an application of the estimator $\widehat V$, we propose to estimate the SNR using
$$\widehat{\mathrm{SNR}} = \frac{\|\hat\beta\|^2 - \mathbf{1}_p^\top \widehat V}{\mathbf{1}_n^\top (Q \odot Q)^{-1}(\hat\varepsilon \odot \hat\varepsilon)}, \qquad (5)$$
where recall that $\hat\varepsilon$ is the vector of residuals in the linear model, and $\mathbf{1}_p$ denotes a column vector with all $p$ entries being ones. Above, $(Q \odot Q)^{-1}$ denotes the inverse of the Hadamard product $Q \odot Q$ of $Q = I_n - X(X^\top X)^{-1}X^\top$ with itself (we will later study this invertibility in detail). The numerator and denominator of the fraction in (5) are unbiased for the signal part and noise part, respectively, as we show in the next two examples.
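As a concrete illustration, the following minimal sketch (our own code, not the released package mentioned above; all variable names are ours) computes the Hadamard variance estimates, using the explicit form $\widehat V = (S \odot S)(Q \odot Q)^{-1}(\hat\varepsilon \odot \hat\varepsilon)$ introduced in Section 2.2, and then the SNR estimator (5) on simulated heteroskedastic data.

```python
# Sketch: Hadamard variance estimates and the SNR estimator (5) on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
sigma2 = rng.uniform(0.5, 2.0, size=n)          # heteroskedastic noise variances
y = X @ beta + rng.standard_normal(n) * np.sqrt(sigma2)

XtX_inv = np.linalg.inv(X.T @ X)
S = XtX_inv @ X.T                                # p x n matrix defining OLS
Q = np.eye(n) - X @ S                            # projection onto the residual space
beta_hat = S @ y
eps_hat = Q @ y                                  # OLS residuals

# Hadamard estimator of the coordinate-wise variances of beta_hat
QQ_inv = np.linalg.inv(Q * Q)                    # (Q ⊙ Q)^{-1}
V_hat = (S * S) @ (QQ_inv @ eps_hat**2)

# SNR estimator (5): unbiased signal part over unbiased noise part
signal_hat = beta_hat @ beta_hat - V_hat.sum()
noise_hat = np.ones(n) @ (QQ_inv @ eps_hat**2)   # estimates tr(Sigma)
print(signal_hat / noise_hat, (beta @ beta) / sigma2.sum())  # estimate vs. true SNR
```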
Estimating signal squared magnitude. A further fundamental problem is estimating the magnitude of the regression coefficient, $\|\beta\|^2$. From the identity
$$\mathbb{E}\|\hat\beta\|^2 = \|\beta\|^2 + \operatorname{tr}\bigl(\operatorname{Cov}(\hat\beta)\bigr),$$
it follows that an unbiased estimator of $\operatorname{tr}\bigl(\operatorname{Cov}(\hat\beta)\bigr)$ is $\mathbf{1}_p^\top \widehat V$. Thus, an unbiased estimator of the squared signal magnitude is given as $\|\hat\beta\|^2 - \mathbf{1}_p^\top \widehat V$.
Estimating total noise level.
As an intermediate step in the derivation of the unbiased estimator $\widehat V$, we obtain the identity
$$\operatorname{diag}(\Sigma) = (Q \odot Q)^{-1}\, \mathbb{E}(\hat\varepsilon \odot \hat\varepsilon).$$
That is, the vector $\operatorname{diag}(\Sigma)$ of the entries of $\Sigma$ can be written as a matrix-vector product in the appropriate way. As a consequence of this, we can use $\mathbf{1}_n^\top (Q \odot Q)^{-1}(\hat\varepsilon \odot \hat\varepsilon)$ to estimate the total noise level $\operatorname{tr}(\Sigma) = \sum_{i=1}^{n} \operatorname{Var}(\varepsilon_i)$ in an unbiased way.
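A quick Monte Carlo check of the displayed identity is straightforward (this is our own illustrative sketch, not part of the paper): averaging the plug-in quantity $(Q \odot Q)^{-1}(\hat\varepsilon \odot \hat\varepsilon)$ over repeated noise draws should recover the per-observation variances, and its sum recovers $\operatorname{tr}(\Sigma)$.

```python
# Monte Carlo check of diag(Sigma) = (Q ⊙ Q)^{-1} E(eps_hat ⊙ eps_hat).
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 100, 20, 5000
X = rng.standard_normal((n, p))
sigma2 = rng.uniform(0.5, 2.0, size=n)

Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
QQ_inv = np.linalg.inv(Q * Q)

est = np.zeros(n)
for _ in range(reps):
    eps = rng.standard_normal(n) * np.sqrt(sigma2)
    eps_hat = Q @ eps                   # residuals depend only on the noise, not on beta
    est += QQ_inv @ eps_hat**2
est /= reps

print(np.max(np.abs(est - sigma2)))     # small for large reps
print(est.sum(), sigma2.sum())          # estimate of tr(Sigma) versus the truth
```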
Estimating MSE. An important problem concerning the least-squares method is estimating its mean squared error (MSE). Let
$\mathrm{MSE} = \mathbb{E}\|\hat\beta - \beta\|^2$ be the MSE. Consider the estimator
$$\widehat{\mathrm{MSE}} = \sum_{i=1}^{p} \widehat V_i.$$
As in the part "Estimating signal squared magnitude," it follows that
$\widehat{\mathrm{MSE}}$ is an unbiased estimator of the MSE. Later in Section 5 we will show in simulations that this estimator is more accurate than the corresponding estimators based on White's and MacKinnon-White's covariance estimators.
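A small sketch of the MSE estimator follows (our own illustration; the helper name `hadamard_variances` is hypothetical, not from the paper or its software). The target can be computed exactly here because the true noise variances are known in the simulation: $\operatorname{tr}(S\Sigma S^\top) = \sum_j \sum_i S_{ji}^2 \sigma_i^2$.

```python
# Sketch: MSE estimate as the sum of the Hadamard variance estimates.
import numpy as np

def hadamard_variances(X, y):
    """Unbiased estimates of Var(beta_hat_j) under heteroskedasticity (sketch)."""
    n = X.shape[0]
    S = np.linalg.solve(X.T @ X, X.T)
    Q = np.eye(n) - X @ S
    eps_hat = Q @ y
    return (S * S) @ np.linalg.solve(Q * Q, eps_hat**2)

rng = np.random.default_rng(2)
n, p = 200, 60
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
sigma2 = rng.uniform(0.2, 3.0, size=n)
y = X @ beta + rng.standard_normal(n) * np.sqrt(sigma2)

mse_hat = hadamard_variances(X, y).sum()
S = np.linalg.solve(X.T @ X, X.T)
mse_true = ((S * S) @ sigma2).sum()      # exact E||beta_hat - beta||^2 for this design
print(mse_hat, mse_true)
```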
This section specifies the variance estimator $\widehat V$. This estimator has appeared in Hartley et al. (1969); Chew (1970); Cattaneo et al. (2018), and takes the following form of matrix-vector product
$$\widehat V = A\,(\hat\varepsilon \odot \hat\varepsilon),$$
where the matrix $A$ is the product of two matrices,
$$A = (S \odot S)(Q \odot Q)^{-1}.$$
To clarify, note that $(Q \odot Q)^{-1}$ is the usual matrix inverse of $Q \odot Q$, and recall that both $Q \odot Q$ and $\hat\varepsilon \odot \hat\varepsilon$ denote Hadamard products. As such, $\widehat V$ is henceforth referred to as the Hadamard estimator. In short, this is a method of moments estimator, using linear combinations of the squared residuals.

While the Hadamard estimator enjoys a simple expression, there is little work on a fundamental question: whether this estimator exists or not. More precisely, in order for the Hadamard estimator to be well-defined, the matrix $Q \odot Q$ must be invertible. Without this knowledge, all five important applications in Section 2.1 would suffer from a lack of theoretical foundation. While the invertibility can be checked for a given dataset, knowing that it should hold under general conditions gives us confidence that the method can work broadly.

As a major thrust of this paper, we provide a deep understanding of under what conditions $Q \odot Q$ should be expected to be invertible. The problem is theoretically nontrivial, because there are no general statements about the invertibility of matrices whose entries are squared values of some other matrix. In fact, $Q = I_n - X(X^\top X)^{-1}X^\top$ is an $n \times n$ rank-deficient projection matrix of rank $n - p < n$. Therefore, $Q$ itself is not invertible, and it is not clear how its rank behaves when the entries are squared. However, we have the following lower bound on $n$ for this invertibility to hold.

Proposition 2.1 (Lower bound). If the Hadamard product $Q \odot Q$ is invertible, then the sample size $n$ must be at least
$$n \geq p + \frac{1}{2} + \sqrt{2p + \frac{1}{4}}. \qquad (6)$$

This result reveals that the Hadamard estimator simply does not exist if $n$ is only slightly greater than $p$ (say $n = p + 1$), though the OLS estimator exists in this regime. The proof of Proposition 2.1 comes from a well-known property of the Hadamard product, that is, if a matrix $B$ is of rank $r$, then the rank of $B \odot B$ is at most $r(r+1)/2$ (e.g., Horn and Johnson, 1994). For completeness, a proof of this property is given in Section A.2. Using this property, the invertibility of $Q \odot Q$ readily implies
$$n \leq \frac{(n-p)(n-p+1)}{2},$$
which is equivalent to (6).

In light of the above, it is tempting to ask whether (6) is sufficient for the existence of the Hadamard estimator. In general, this is not the case. For example, let
$$X = \begin{pmatrix} R \\ 0 \end{pmatrix}$$
for any orthogonal matrix $R \in \mathbb{R}^{p \times p}$. Then, $Q \odot Q$ is not invertible, as $Q$ is a diagonal matrix whose first $p$ diagonal entries are 0 and the remaining are 1. This holds no matter how large $n$ is compared to $p$. However, such design matrices $X$ that lead to a degenerate $Q \odot Q$ are very "rare" in the sense of the following theorem. Recall that $Q = I_n - X(X^\top X)^{-1}X^\top$.
Theorem 1. The set $\{X \in \mathbb{R}^{n \times p} : Q \odot Q \text{ does not have full rank}\}$ has Lebesgue measure zero in $\mathbb{R}^{np}$ if the inequality (6) is satisfied.

Therefore, the lower bound in Proposition 2.1 is sharp. Roughly speaking, $n \geq p + O(\sqrt{p})$ is sufficient for the invertibility of $Q \odot Q$. The proof of this result is new in the vast literature on the Hadamard matrix product. In short, our proof uses certain algebraic properties of the determinant of $Q \odot Q$ and employs a novel induction step. Section 3 is devoted to developing the proof of Theorem 1 in detail. To be complete, Cattaneo et al. (2018) show high-probability invertibility for Gaussian designs. As a comparison, our invertibility result is qualitatively stronger, as it applies to almost every design matrix under more widely applicable distribution-free models.
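The following small experiment (our own sketch, with illustrative dimensions) checks this numerically for a Gaussian design: just above the lower bound (6), $Q \odot Q$ is typically invertible, but, as discussed next, it can be severely ill-conditioned until $n$ is well above $p$.

```python
# Empirical check of Theorem 1 / Proposition 2.1 for a Gaussian design.
import numpy as np

rng = np.random.default_rng(3)
p = 30
n_min = int(np.ceil(p + 0.5 + np.sqrt(2 * p + 0.25)))   # smallest n allowed by (6)

for n in [n_min, n_min + 5, 3 * p]:
    X = rng.standard_normal((n, p))
    Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    T = Q * Q
    # near the lower bound T is generically invertible but can be very ill-conditioned;
    # for n substantially larger than 2p it is well-conditioned (see Theorem 2 below)
    print(n, np.linalg.matrix_rank(T), f"cond = {np.linalg.cond(T):.2e}")
```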
To better appreciate this main theoretical contribution of the paper, we consider a random matrix $X$ in the following corollary, which ensures that the Hadamard estimator is well-defined almost surely for many popular random matrix ensembles of $X$, such as the Wishart ensemble.

Corollary 2.2. Under the same conditions as in Theorem 1, if $X$ is sampled from a distribution that is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^{n \times p}$ (put simply, $X$ has a density), then $Q \odot Q$ is invertible almost surely.

Although $Q \odot Q$ is invertible under very general conditions, our simulations reveal that the condition number of this matrix can be very large for $p$ close to $n$, due to very small eigenvalues. This is problematic, because the estimator can then amplify the error. Our next result shows that $Q \odot Q$ is well-conditioned under some conditions if $n > 2p$. We will show that this holds for certain random design matrices $X$.

Suppose for instance that the entries of $X$ are iid standard normal, $X_{ij} \sim \mathcal{N}(0, 1)$. Then, each diagonal entry of $Q = I_n - X(X^\top X)^{-1}X^\top$ is relatively large, of unit order. The off-diagonal entries are of order $1/n^{1/2}$. When we square the entries, the off-diagonal entries become of order $1/n$, while the diagonal ones are still of unit order. Thus, it is possible that the matrix is diagonally dominant, so the diagonal entries are larger than the sum of the off-diagonal ones. This would ensure well-conditioning. We will show rigorously that this is true under some additional conditions.

Specifically, we will consider a high-dimensional asymptotic setting, where the dimension $p$ and the sample size $n$ are both large. We assume that they grow proportionally to each other, $n, p \to \infty$ with $p/n \to \gamma > 0$. This is a modern setting for high-dimensional statistics, and it has many connections to random matrix theory (see e.g., Bai and Silverstein, 2009; Paul and Aue, 2014; Yao et al., 2015).

We will provide bounds on the largest and smallest eigenvalues. We can handle correlated designs $X$, where each row is sampled iid from a distribution with covariance matrix $\Gamma$. Let $\Gamma^{1/2}$ be the symmetric square root of $\Gamma$.

Theorem 2 (Eigenvalue bounds). Suppose the rows $x_i$ of $X$ are iid and have the form $x_i = \Gamma^{1/2} z_i$, where the $z_i$ have iid entries with mean zero and variance $1/p$, and uniformly bounded $(8 + \delta)$-th moment. Suppose that $\Gamma$ is invertible. Then, as $n, p \to \infty$ such that $p/n \to \gamma < 1/2$, the matrix $T = Q \odot Q$ satisfies the following eigenvalue bounds almost surely:
$$(1 - \gamma)(1 - 2\gamma) \leq \liminf \lambda_{\min}(T) \leq \limsup \lambda_{\max}(T) \leq 1 - \gamma.$$

Practically speaking, the condition number of $T$ is at most $1/(1 - 2\gamma)$ with high probability. See Section A.3 for a proof. We note that our invertibility results are stronger than those of Cattaneo et al. (2018). Specifically, we show generic invertibility for finite-dimensional designs with probability one, and condition number bounds for non-Gaussian correlated designs that go much beyond those considered in their work. They consider only Gaussian designs without correlations.

To obtain a confidence interval for $\beta_j$, we propose to approximate the distribution of the approximate pivot in (4) by a $t$-distribution. The key is to find a good approximation to the degrees of freedom. Let us denote by $V_j = \operatorname{Var}\hat\beta_j$ the expected value of $\widehat V_j$. Suppose the degrees of freedom of $\widehat V_j$ are $d_j$. Using the 4-th moment properties of the $\chi^2_{d_j}$ variable, these degrees of freedom should obey
$$\mathbb{E}\widehat V_j^2 \approx \frac{V_j^2}{d_j^2}\,\mathbb{E}\bigl(\chi^2_{d_j}\bigr)^2 = V_j^2\,(1 + 2/d_j).$$
Consequently, we formally define
$$d_j = \frac{2\,V_j^2}{\mathbb{E}\widehat V_j^2 - V_j^2}. \qquad (7)$$
To proceed, we need to evaluate $\mathbb{E}[\widehat V \odot \widehat V] \in \mathbb{R}^p$. The following proposition gives a closed-form expression for this vector assuming homoskedasticity.
Let us denote $E = \operatorname{diag}\bigl[(X^\top X)^{-1}\bigr] \odot \operatorname{diag}\bigl[(X^\top X)^{-1}\bigr]$.

Proposition 2.3 (Degrees of freedom). Under homoskedasticity, we have that the vector of degrees of freedom of $\widehat V$, defined in equation (7), has the form
$$d = \frac{2E}{\operatorname{diag}\bigl[(S \odot S)\mathbf{1}_n\mathbf{1}_n^\top(S \odot S)^\top\bigr] + 2\operatorname{diag}\bigl[(S \odot S)(Q \odot Q)^{-1}(S \odot S)^\top\bigr] - E}, \qquad (8)$$
where the division is understood to be entrywise. See Section A.6 for a proof.
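The degrees of freedom in (8) depend only on the design matrix $X$, so they can be computed once per dataset. A minimal sketch follows (our own illustration, not the authors' released code; the interval construction in the final comment is a hypothetical usage example).

```python
# Sketch: degrees-of-freedom correction (8), computed from X alone.
import numpy as np

rng = np.random.default_rng(4)
n, p = 150, 50
X = rng.standard_normal((n, p))

XtX_inv = np.linalg.inv(X.T @ X)
S = XtX_inv @ X.T
Q = np.eye(n) - X @ S
SS = S * S                                        # S ⊙ S, a p x n matrix
QQ_inv = np.linalg.inv(Q * Q)

E = np.diag(XtX_inv) ** 2                         # diag[(X'X)^{-1}] ⊙ diag[(X'X)^{-1}]
term1 = (SS @ np.ones(n)) ** 2                    # diag[(S⊙S) 1 1' (S⊙S)'], entrywise
term2 = np.einsum('ij,jk,ik->i', SS, QQ_inv, SS)  # diag[(S⊙S)(Q⊙Q)^{-1}(S⊙S)']
d = 2 * E / (term1 + 2 * term2 - E)               # degrees of freedom, entrywise
print(d.min(), d.max())

# Usage (with V_hat and beta_hat as in the earlier sketch): a nominal 95% interval is
#   beta_hat[j] +/- scipy.stats.t.ppf(0.975, d[j]) * np.sqrt(V_hat[j])
```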
This result also leads to a useful degrees of freedom heuristic. If the degrees of freedom $d_i$ are large, this suggests that inferences for $\beta_i$ are based on a large amount of information. On the other hand, if the degrees of freedom are small, this suggests that the inferences are based on little information, and may thus be unstable.

In our case, the $t$-distribution is still a heuristic, because the numerator and denominator are not independent under heteroskedasticity. However, the degree of dependence can be bounded as follows:
$$\|\operatorname{Cov}(\hat\beta, \hat\varepsilon)\|_{\mathrm{op}} = \|S\Sigma(S^\top X^\top - I)\|_{\mathrm{op}} = \|S(\Sigma - cI)(S^\top X^\top - I)\|_{\mathrm{op}} \leq \|S\|_{\mathrm{op}}\,\|\Sigma - cI\|_{\mathrm{op}}\,\|S^\top X^\top - I\|_{\mathrm{op}} \leq \frac{|\Sigma_{\max} - \Sigma_{\min}|}{2\,\sigma_{\min}(X)}.$$
In the last line, we have chosen $c = (\Sigma_{\max} + \Sigma_{\min})/2$, where $\Sigma_{\max}$ and $\Sigma_{\min}$ denote the maximal and minimal entries of $\Sigma$, respectively. Now, for designs $X$ of aspect ratios $p/n$ that are not close to 1, and with iid entries with sufficiently many moments, it is known that $\sigma_{\min}(X)$ is of the order $n^{1/2}$. This suggests that the covariance between $\hat\beta$ and $\hat\varepsilon$ is small. Hence, this heuristic suggests that the $t$-approximation should be accurate. Moreover, when $\widehat V_j - V_j \to 0$ in probability, and under the conditions in Section 4.1, we also have that the limiting distribution is standard normal.

The case $p = 1$. As a simple example, consider the case of one covariate, when $p = 1$. In this case, we have $Y = X\beta + \varepsilon$, where $Y, X, \varepsilon$ are $n$-vectors. Assuming without loss of generality that $X^\top X = 1$, the OLS estimator takes the form $\hat\beta = X^\top Y$. Its variance equals $V = \sum_{j=1}^n X_j^2 \Sigma_j$, where $\Sigma_j$ is the variance of $\varepsilon_j$, and the $X_j$ are the entries of $X$.

The Hadamard estimator takes the form
$$\widehat V = \frac{\sum_{j=1}^n \frac{X_j^2}{1 - 2X_j^2}\,\hat\varepsilon_j^2}{1 + \sum_{j=1}^n \frac{X_j^4}{1 - 2X_j^2}},$$
which is well-defined if all coordinates $X_j$ are small enough that $1 - 2X_j^2 > 0$. See Section A.7 for the argument. The unbiased estimator is not always nonnegative. To ensure nonnegativity, we need $X_j^2 < 1/2$ in this case. In practice, we may enforce non-negativity by using $\max(\widehat V, 0)$ instead of $\widehat V$, but see below for a more thorough discussion.

For comparison, White's variance estimator is
$$\widehat V_W = \sum_{j=1}^n X_j^2\,\hat\varepsilon_j^2,$$
while MacKinnon-White's variance estimator (MacKinnon and White, 1985) can be seen to take the form
$$\widehat V_{MW} = \sum_{j=1}^n \frac{X_j^2}{1 - X_j^2}\,\hat\varepsilon_j^2 = \sum_{j=1}^n \frac{X_j^2}{\sum_{i=1, i\neq j}^n X_i^2}\,\hat\varepsilon_j^2.$$
We observe that each variance estimator is a weighted linear combination of the squared residuals, where the weights are some functions of the squares of the entries of the feature vector $X$. For White's estimator, the weights are simply the squared entries. For MacKinnon-White's variance estimator, the weights are scaled up by a factor $1/(1 - X_j^2) > 1$. As we know, this ensures the estimator is unbiased under homoskedasticity. For the Hadamard estimator, the weights are scaled up more aggressively, by $1/(1 - 2X_j^2) > 1/(1 - X_j^2)$, and there is an additional normalization step. In general, these weights do not have to be larger, or smaller, than those of the other two weighting schemes.

A critical issue is that the Hadamard estimator may not always be non-negative. It is well known that unbiased estimators may fall outside of the parameter space (Lehmann and Casella, 1998). When $p = 1$, almost sure non-negativity is ensured when the coordinates of $X$ are sufficiently small. It would be desirable, but seems non-obvious, to obtain such results for general dimension $p$.

In addition, the degrees of freedom from (8) simplify to
$$d = 1 + \frac{1}{\sum_{j=1}^n \frac{X_j^4}{1 - 2X_j^2}}.$$
This can be as large as $n - 1$; for instance, $d = n - 1$ when all $X_i^2 = 1/n$. The degrees of freedom can only be small if the distribution of the $X_i$ is very skewed.

As a byproduct of our analysis, we also obtain explicit formulas for the bias of the two classical estimators of the variances of the ordinary least-squares estimator, namely the White and MacKinnon-White estimators. This can in principle enable us to understand when the bias is small or large. The estimator proposed by MacKinnon and White (1985), which we will call the
MW estimator, is:
$$\widehat C_{MW} = (X^\top X)^{-1}\bigl[X^\top \widehat\Sigma_{MW} X\bigr](X^\top X)^{-1}, \qquad (9)$$
where $\widehat\Sigma_{MW} = \operatorname{diag}(Q)^{-1}\operatorname{diag}(\hat\varepsilon \odot \hat\varepsilon)$. This estimator is unbiased under homoskedasticity, when $\Sigma = \sigma^2 I_n$. It is denoted as HC2 in the paper MacKinnon and White (1985). The same estimator was also proposed by Wu (1986), eq. (2.6).

Proposition 2.4 (Bias of classical estimators). Consider White's covariance estimator defined in (3) and MacKinnon-White's estimator defined in (9). Their bias for estimating the coordinate-wise variances of the OLS estimator equals, respectively,
$$b_W = (S \odot S)\bigl[(Q \odot Q) - I_n\bigr]\Sigma_{vec}$$
for White's covariance estimator, and
$$b_{MW} = (S \odot S)\bigl[\operatorname{diag}(Q)^{-1}(Q \odot Q) - I_n\bigr]\Sigma_{vec}$$
for MacKinnon-White's estimator. Here $\Sigma_{vec}$ is the vector of diagonal entries of $\Sigma$, the covariance of the noise.

See Section A.8 for a proof. In particular, MacKinnon-White's estimator is known to be unbiased under homoskedasticity, that is, when $\Sigma = \sigma^2 I_n$ (MacKinnon and White, 1985). This can be checked easily using our explicit formula for the bias. Specifically, suppose that $\Sigma = I_n$. Then $\Sigma_{vec} = \mathbf{1}_n$, the vector of all ones. Therefore, $(Q \odot Q)\Sigma_{vec} = \operatorname{vec}(\|q_j\|^2)$, the vector of squared Euclidean norms of the rows of $Q$. Since $Q$ is a projection matrix, $Q^2 = Q$, so $\|q_j\|^2 = Q_{jj}$. Therefore we see that
$$\bigl[\operatorname{diag}(Q)^{-1}(Q \odot Q) - I_n\bigr]\Sigma_{vec} = \operatorname{diag}(Q)^{-1}\operatorname{vec}(Q_{jj}) - \mathbf{1}_n = 0,$$
so that MacKinnon-White's estimator is unbiased under homoskedasticity.

There has been a lot of related work on inference in linear models under heteroskedasticity. Here we can only mention a few of the most closely related works, and refer to Imbens and Kolesar (2016) for a review. In the low-dimensional case, Bera et al. (2002) compared the Hadamard and White-type estimators and discovered that the Hadamard estimator leads to more accurate coverage, while the White estimators have better mean squared error.

As a heuristic to improve the performance of the MacKinnon-White (MW) confidence intervals in high dimensions, Bell and McCaffrey (2002) have a similar approach to ours, with a $t$ degrees of freedom correction. Simulations in the very recent review paper by Imbens and Kolesar (2016) suggest this method is the state of the art for heteroskedasticity-consistent inference, and performs well under many settings. However, this correction is computationally more burdensome than the MW method, because it requires a separate large-matrix computation for each regression coefficient. In contrast, our method requires only one such computation. In addition, the accuracy of their method typically does not increase substantially compared to the MW method. We think that this could be due to the bias of the MW method under heteroskedasticity.

In this work, we have used the term "robust" informally to mean insensitivity to assumptions about the covariance of the noise. Robust statistics is a much larger field which classically studies robustness to outliers in the data distribution (e.g., Huber and Ronchetti, 2011). Recent work has focused, among many other topics, on high-dimensional regression and covariance estimation (e.g., El Karoui et al., 2013; Chen et al., 2016; Donoho and Montanari, 2016; Zhou et al., 2017; Diakonikolas et al., 2017, etc.).

In this section we develop the novel proof of the existence of the Hadamard estimator. We begin by observing that Theorem 1 is equivalent to the proposition below.
This is because the Lebesgue measure admits an orthogonal decomposition using the SVD.
Proposition 3.1.
Assume $r(r+1)/2 \geq n$. Denote by $\mathcal{Q}$ the set of all $n \times n$ projection matrices of rank $r$, and let $dQ$ be the Lebesgue measure on $\mathcal{Q}$. Then, the set $\{Q \in \mathcal{Q} : \operatorname{rank}(Q \odot Q) < n\}$ has zero $dQ$-measure.

We take the following lemma as given for the moment.
Lemma 3.2.
Under the same assumptions as Proposition 3.1, there exists a $Q^* \in \mathcal{Q}$ such that $\operatorname{rank}(Q^* \odot Q^*) = n$.

A proof of Proposition 3.1 using Lemma 3.2 is readily given as follows.

Proof of Proposition 3.1. Let $p = n - r$. Consider the map from $\mathbb{R}^{n \times p}$ (ignoring the zero-Lebesgue-measure set where $X$ is not of rank $p$) to $\mathcal{Q}$:
$$X \in \mathbb{R}^{n \times p} \longrightarrow Q = I - X(X^\top X)^{-1}X^\top \in \mathcal{Q}.$$
It is easy to see that the map is a surjection, and the preimages of this map for the various $Q \in \mathcal{Q}$ are rotationally equivalent to each other. Hence, it suffices to show that the set of $X$ where the Hadamard product of $I - X(X^\top X)^{-1}X^\top$ with itself is degenerate has measure zero.

We observe that the determinant takes the form
$$\det\Bigl(\bigl(I - X(X^\top X)^{-1}X^\top\bigr) \odot \bigl(I - X(X^\top X)^{-1}X^\top\bigr)\Bigr) = \frac{f_1(X)}{f_2(X)},$$
where $f_1(X)$ and $f_2(X)$ are polynomials in the $np$ variables $X_{ij}$, $1 \leq i \leq n$, $1 \leq j \leq p$. As a fundamental property of polynomials, one and exactly one of the following two cases holds:
(a) The polynomial $f_1(X) \equiv 0$ for all $X$.
(b) The set of roots of $f_1(X)$ has zero Lebesgue measure.
Lemma 3.2 falsifies case (a). Therefore, case (b) must hold. Recognizing that the set of $X$ where the Hadamard product of $Q(X)$ with itself is not of full rank is a subset of the roots of $f_1(X)$, case (b) confirms the claim of the present proposition.

Now we turn to prove Lemma 3.2. For convenience, we adopt the following definition.

Definition 3.3.
For a set of vectors $u_1, \ldots, u_r \in \mathbb{R}^n$, write $\operatorname{rank}_\odot(u_1, \ldots, u_r)$ for the rank of the $r(r+1)/2$ vectors each taking the form $u_i \odot u_j$ for $1 \leq i \leq j \leq r$.

First, we give two simple lemmas.

Lemma 3.4.
Suppose two sets of vectors $\{u_1, u_2, \ldots, u_r\}$ and $\{u'_1, u'_2, \ldots, u'_{r'}\}$ are linearly equivalent, meaning that one can be linearly represented by the other. Then,
$$\operatorname{rank}_\odot(u_1, \ldots, u_r) = \operatorname{rank}_\odot(u'_1, \ldots, u'_{r'}).$$

Lemma 3.5.
For any matrix $P$ that takes the form $P = u_1 u_1^\top + \cdots + u_r u_r^\top$ for some vectors $u_1, \ldots, u_r$, we have $\operatorname{rank}(P \odot P) = \operatorname{rank}_\odot(u_1, \ldots, u_r)$.

Making use of the two lemmas above, Lemma 3.2 is validated once we show the following.
Lemma 3.6.
There exist (not necessarily normalized or orthogonalized) $u_1, \ldots, u_r$ such that $\operatorname{rank}_\odot(u_1, \ldots, u_r) = n$ if $r(r+1)/2 \geq n$.

To see this point, we apply the Gram–Schmidt orthonormalization to the $u_1, \ldots, u_r$ considered in Lemma 3.6, and get orthonormal vectors $v_1, \ldots, v_r$. Write $Q^* = v_1 v_1^\top + \cdots + v_r v_r^\top$, which belongs to $\mathcal{Q}$. Since $u_1, \ldots, u_r$ and $v_1, \ldots, v_r$ are linearly equivalent, Lemmas 3.4 and 3.5 reveal that
$$\operatorname{rank}(Q^* \odot Q^*) = \operatorname{rank}_\odot(v_1, \ldots, v_r) = \operatorname{rank}_\odot(u_1, \ldots, u_r) = n.$$
Now we aim to prove Lemma 3.6.

Proof of Lemma 3.6. We consider a stronger form of Lemma 3.6: for generic $u_1, \ldots, u_r$, any combination of $n$ vectors from the $u_i \odot u_j$ for $1 \leq i \leq j \leq r$ has full rank. Here generic means that this statement fails only for a set of zero Lebesgue measure.

We induct on $n$. The statement is true for $n = 1$. Suppose it has been proven true for $n - 1$. Let $\mathcal{U}$ denote an arbitrary subset of $\{(i, j) : 1 \leq i \leq j \leq r\}$ with cardinality $n$. Write $P = (u_i \odot u_j)_{(i,j) \in \mathcal{U}}$. It is sufficient to show that $\det(P)$ is generically nonzero. As earlier in the proof of Proposition 3.1, it suffices to show that $\det(P)$ is not always zero. Without loss of generality, let $(i_1, j_1) \in \mathcal{U}$ index the first column of $P$. Expressing the determinant of $P$ in terms of its minors along the first column, we see that $\det(P)$ is an affine function of $u_{i_1}(1)\, u_{j_1}(1)$, with the leading coefficient being the determinant of an $(n-1) \times (n-1)$ minor matrix that results from $P$ by removing the first row and the first column. The induction step is complete if we show that this minor matrix, denoted by $P_1$, has a generically nonzero determinant. Write $u_i^{(-1)}$ for the vector in $\mathbb{R}^{n-1}$ formed by removing the first entry from $u_i$, for $i = 1, \ldots, r$. Then, each of the $n - 1$ columns of $P_1$ takes the form $u_i^{(-1)} \odot u_j^{(-1)}$ for some $(i, j) \in \mathcal{U} \setminus \{(i_1, j_1)\}$. Since the statement has been validated for $n - 1$, it follows that the determinant of $P_1$ is nonzero in the generic sense.

To complete this section, we prove below Lemmas 3.4 and 3.5.

Proof of Lemma 3.4.
Since $\{u'_1, u'_2, \ldots, u'_{r'}\}$ can be linearly represented by $\{u_1, u_2, \ldots, u_r\}$, each $u'_j$ can be written as
$$u'_j = \sum_{l=1}^r a_{jl}\, u_l$$
for constants $a_{jl}$. Using this representation, the Hadamard product between two vectors reads
$$u'_i \odot u'_j = \Bigl(\sum_{l=1}^r a_{il}\, u_l\Bigr) \odot \Bigl(\sum_{l=1}^r a_{jl}\, u_l\Bigr) = \sum_{l_1, l_2} a_{il_1} a_{jl_2}\, u_{l_1} \odot u_{l_2}.$$
This expression for $u'_i \odot u'_j$ shows that $u'_i \odot u'_j$ is in the linear span of the $u_{l_1} \odot u_{l_2}$ for $1 \leq l_1 \leq l_2 \leq r$. As a consequence of this, it must hold that
$$\operatorname{rank}_\odot(u'_1, u'_2, \ldots, u'_{r'}) \equiv \operatorname{rank}\bigl(\{u'_i \odot u'_j : 1 \leq i \leq j \leq r'\}\bigr) \leq \operatorname{rank}\bigl(\{u_{l_1} \odot u_{l_2} : 1 \leq l_1 \leq l_2 \leq r\}\bigr) = \operatorname{rank}_\odot(u_1, u_2, \ldots, u_r).$$
Likewise, we have $\operatorname{rank}_\odot(u'_1, u'_2, \ldots, u'_{r'}) \geq \operatorname{rank}_\odot(u_1, u_2, \ldots, u_r)$. Taking the two inequalities together leads to an identity between the two ranks.

Proof of Lemma 3.5.
As earlier in this section, we can write $P \odot P$ as
$$P \odot P = \sum_{1 \leq i, j \leq r} (u_i \odot u_j)(u_i \odot u_j)^\top.$$
Let $R$ be an $n \times r^2$ matrix formed by the $r^2$ columns $u_i \odot u_j$ for $1 \leq i, j \leq r$. Apparently, $\operatorname{rank}(P \odot P) = \operatorname{rank}(R)$ since $P \odot P = RR^\top$. The (column) rank of $R$ is just $\operatorname{rank}_\odot(u_1, \ldots, u_r)$ by Definition 3.3 (recognize that $u_i \odot u_j = u_j \odot u_i$). Hence, $\operatorname{rank}(P \odot P) = \operatorname{rank}_\odot(u_1, \ldots, u_r)$.
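Lemma 3.6 is easy to probe numerically (the following is our own sketch, not part of the paper): for generic vectors, the $r(r+1)/2$ Hadamard products $u_i \odot u_j$ with $i \leq j$ span $\mathbb{R}^n$ as soon as $r(r+1)/2 \geq n$.

```python
# Numerical illustration of Lemma 3.6 for random (hence generic) vectors.
import numpy as np

rng = np.random.default_rng(5)
n = 20
r = int(np.ceil((-1 + np.sqrt(1 + 8 * n)) / 2))   # smallest r with r(r+1)/2 >= n
U = rng.standard_normal((n, r))

cols = [U[:, i] * U[:, j] for i in range(r) for j in range(i, r)]
M = np.column_stack(cols)                          # n x r(r+1)/2 matrix of products
print(M.shape, np.linalg.matrix_rank(M) == n)      # full row rank for generic U
```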
Rate of Convergence

We next give two fundamental results characterizing the sampling properties of the Hadamard estimator. The first result bounds the relative error for estimating the vector of variances of all the entries of the OLS estimator. The result is completely explicit. It shows that the estimation error is small when the aspect ratio $\gamma$ is small. The relative error converges to zero when $\gamma$ goes to zero. This shows another desirable property of the Hadamard estimator.

Theorem 3 (Rate of convergence). Under the conditions of Theorem 2, assume in addition that the kurtosis of the entries $\varepsilon_i$ of the noise is zero. Let also $V = \operatorname{Var}\hat\beta$ be the vector of variances of the entries of the OLS estimator. Then, under high-dimensional asymptotics as $n, p \to \infty$ such that $p/n \to \gamma < 1/2$, we have
$$\mathbb{P}\!\left(\frac{\|\widehat V - V\|}{\|\Sigma_{vec}\|} \geq \frac{t}{n}\right) \leq \frac{2}{t^{2}\,(1 - 2\gamma)\,(1 - \gamma)^{2}} \quad \text{a.s.},$$
for any constant $t > 0$.

See Section A.9 for a proof. The theorem assumes that the kurtosis of the entries of the noise is zero, but this can be relaxed. Assuming that the fourth moment of the entries is less than a constant $C \geq 3$ times the variance squared of the entries, the result still holds, with the constant 2 in the bound changed to a larger constant $C - 1 \geq 2$.

We already know that the estimators $\widehat V_i$ are unbiased for the variances of the coordinates of the OLS estimator, $V_i = \operatorname{Var}\hat\beta_i$, and in the previous section we have seen an inequality bounding the error $\|\widehat V - V\|$. In this section, we give a deeper result on the distribution of each $\widehat V_i$.

To study this problem, for simplicity we will assume Gaussian noise, though much of it generalizes to distributions where the noise is approximately Gaussian. Under normality, we can express the residuals as $\hat\varepsilon = Q\Sigma^{1/2}Z$, where $Z$ is a vector of standard normal random variables, $Z \sim \mathcal{N}(0, I_n)$. Thus, we see that the estimator $\widehat V_i$, a linear combination of squared entries of $\hat\varepsilon$, can be written as a symmetric quadratic form in $Z$. Therefore, its exact distribution can be obtained as a weighted linear combination of chi-squared random variables. The mean of that distribution is $V_i = \operatorname{Var}\hat\beta_i$. We will bound the deviation from normality of the coordinates $\widehat V_i$. Since they are linear combinations of chi-squared random variables, this should be true if none of the weights is too large. This is true in fact, and is formalized by a so-called second order Poincaré inequality (Chatterjee, 2009). We will use this result to get our approximate normality result.

Theorem 4 (Approximate normality). Consider the linear model $y = X\beta + \varepsilon$, where the noise is normally distributed, so that $\varepsilon \sim \mathcal{N}(0, \Sigma)$. Let $B_i$ be a normal random variable with the same mean and variance as the entry $\widehat V_i$ of the Hadamard estimator. Then we have the total variation error bound
$$d_{TV}(\widehat V_i, B_i) \leq \frac{C\,\lambda_{\max}}{\bigl(\sum_j \lambda_j^2\bigr)^{1/2}},$$
where $C$ is a numerical constant, and the $\lambda_j$ are the eigenvalues of $W_i = \Sigma^{1/2} Q \operatorname{diag}(A_i) Q \Sigma^{1/2}$. Here $A_i^\top$ is the $i$-th row of $A = (S \odot S)(Q \odot Q)^{-1}$. Moreover, $\lambda_{\max}$ is the largest eigenvalue of $W_i$.

See Section A.10 for a proof. In principle, this result could justify using normal confidence intervals for inference on $V_i$ as soon as the upper bound provided is small. Moreover, the upper bound in this result can be simplified as follows.
First, we can upper bound $\lambda_{\max}(W_i) \leq \Sigma_{\max}\,\lambda_{\max}\bigl(Q\operatorname{diag}(A_i)Q\bigr)$. Second, we can lower bound
$$\sum_j \lambda_j^2 = \|W_i\|_F^2 \geq \Sigma_{\min}^2\,\|Q\operatorname{diag}(A_i)Q\|_F^2 = \Sigma_{\min}^2\, A_i^\top (Q \odot Q)\, A_i.$$
Therefore, defining $\kappa := \kappa(\Sigma)$ as the condition number of $\Sigma$, we obtain the simplified upper bound
$$C \cdot \kappa(\Sigma)\,\frac{\lambda_{\max}\bigl(Q\operatorname{diag}(A_i)Q\bigr)}{\bigl(A_i^\top (Q \odot Q)\, A_i\bigr)^{1/2}}.$$
The improvement over the upper bound stated in the theorem is that this bound decouples simply as the product of a term depending on the unknown covariance matrix $\Sigma$ and a term depending only on the known design matrix $X$. Therefore, in practice one can evaluate the second term. Then, for any guess of the condition number of $\Sigma$, one gets an upper bound on the deviation from normality.
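The design-dependent factor in the simplified bound can be evaluated directly. The following sketch (our own illustration; the Frobenius norm in the denominator equals the quadratic form above) computes it for every coordinate; multiplying by a guess for $\kappa(\Sigma)$ and the numerical constant $C$ then bounds the deviation from normality.

```python
# Sketch: design-dependent factor lambda_max(Q diag(A_i) Q) / ||Q diag(A_i) Q||_F.
import numpy as np

rng = np.random.default_rng(6)
n, p = 120, 30
X = rng.standard_normal((n, p))
S = np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - X @ S
A = (S * S) @ np.linalg.inv(Q * Q)        # rows A_i of the Hadamard estimator matrix

ratios = []
for i in range(p):
    W = Q @ np.diag(A[i]) @ Q             # the Sigma-free part of W_i
    lam_max = np.linalg.eigvalsh(W)[-1]   # largest eigenvalue
    ratios.append(lam_max / np.linalg.norm(W, 'fro'))
print(max(ratios))                        # small values suggest a good normal approximation
```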
In this section, we present several numerical simulations supporting our theoretical results.

In Figure 1, we show the mean type I error of the normal confidence intervals based on the White, MacKinnon-White, and Hadamard methods over all coordinates of the OLS estimator. We take $X$ to have iid standard normal entries, and the noise to be $\varepsilon = \Sigma^{1/2}Z$, where $Z$ has iid standard normal entries. The noise covariance matrix $\Sigma$ is the diagonal matrix of eigenvalues of an AR-1 covariance matrix $T$, with $T_{ij} = \rho^{|i-j|}$. We take $n = 100$ and three aspect ratios $\gamma = p/n$, varying $p$. We consider $\rho = 0$ (homoskedasticity) and a positive value of $\rho$ (heteroskedasticity). We draw one instance of $X$, and draw 1000 Monte Carlo samples of $\varepsilon$.

We observe that the CIs based on White's covariance matrix estimator are inaccurate for the aspect ratios considered. They have inflated type I error rates. All other estimators are more accurate. The MW confidence intervals are quite accurate for each configuration. The Hadamard estimator using the degrees of freedom correction is comparable, and noticeably better if the dimension is high.

[Figure 1: Mean type I error over all coordinates, for $\rho = 0$ and $\rho > 0$ and the three aspect ratios $\gamma$.]

The situation is more nuanced, however, when we look at individual coordinates. In Table 1, we report the empirical type I error of the methods for the first coordinate, where the average is over the Monte Carlo trials. In this case, the MW estimator can be both liberal and conservative, while the Hadamard estimator is closer to having the right coverage.

Table 1: Empirical type I error for the first coordinate.

γ = p/n | White | MW    | Hadamard | Hadamard-t
0.5     | 0.172 | 0.045 | 0.042    | 0.039
0.75    | 0.347 | 0.059 | 0.053    | 0.047

In Figure 2, we show the bias in estimating the MSE of the OLS estimator for the three methods. For each method, we use the estimator which equals the sum of the variances of the individual components. We use the same setup as above. The results are in line with those from the previous sections. Both MacKinnon-White's and the Hadamard estimator perform much better than the White estimator. Moreover, when $\gamma = 1/2$, the Hadamard estimator is significantly more accurate than MacKinnon-White's.

[Figure 2: Bias in estimating MSE, at two aspect ratios $p/n$.]

In Figure 3, we show the distribution of $z$-scores of a fixed coordinate of the Hadamard estimator. We use a similar setup to the previous sections, but we choose a larger sample size $n = 1{,}000$, and also smaller aspect ratios $p/n$. We observe a relatively good fit to the normal distribution, but it is also clear that a chi-squared approximation may lead to a better fit.

[Figure 3: Distribution of $z$-scores of a fixed coordinate of the Hadamard estimator, at two small aspect ratios $p/n$.]
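A compact version of the type I error experiment can be reproduced as follows. This sketch is our own simplification of the study described above: the specific choices $\gamma = 0.5$ and $\rho = 0.9$ are illustrative assumptions, and only the White and Hadamard-t intervals are compared.

```python
# Compact simulation sketch: mean type I error of nominal 5% intervals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, gamma, rho, n_mc = 100, 0.5, 0.9, 1000     # illustrative parameter choices
p = int(gamma * n)

X = rng.standard_normal((n, p))                # one fixed design draw
T_ar1 = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Sigma = np.linalg.eigvalsh(T_ar1)              # diagonal noise variances: AR-1 eigenvalues
beta = np.zeros(p)                             # null coefficients; type I error only

XtX_inv = np.linalg.inv(X.T @ X)
S = XtX_inv @ X.T
Q = np.eye(n) - X @ S
SS, QQ_inv = S * S, np.linalg.inv(Q * Q)

# degrees of freedom (8), computed once from X
E = np.diag(XtX_inv) ** 2
d = 2 * E / ((SS @ np.ones(n)) ** 2 + 2 * np.einsum('ij,jk,ik->i', SS, QQ_inv, SS) - E)
z975, t975 = stats.norm.ppf(0.975), stats.t.ppf(0.975, d)

err_white = err_had = 0.0
for _ in range(n_mc):
    y = X @ beta + rng.standard_normal(n) * np.sqrt(Sigma)
    beta_hat, eps_hat = S @ y, Q @ y
    v_white = SS @ eps_hat**2                  # White variance estimates
    v_had = SS @ (QQ_inv @ eps_hat**2)         # Hadamard variance estimates
    err_white += np.mean(np.abs(beta_hat) > z975 * np.sqrt(v_white))
    err_had += np.mean(np.abs(beta_hat) > t975 * np.sqrt(np.maximum(v_had, 0)))
print(err_white / n_mc, err_had / n_mc)        # mean type I error over coordinates
```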
Discussion and Future Work

In this paper we have developed a new method for constructing confidence intervals for the OLS estimator under heteroskedasticity. We have also provided several fundamental theoretical results. In particular, we have shown that the estimator is well-defined and well-conditioned for certain random design models.

There are several important directions for future research. A few came up during our investigations. Is it possible to establish the non-negativity of the Hadamard estimator, possibly with some regularization? Is it possible to show approximate coverage results for our $t$-confidence intervals based on the degrees of freedom correction as given in (8)? Such results have been obtained in the low-dimensional case by Kauermann and Carroll (2001), for instance. However, establishing such results in high dimensions seems to require different techniques.

Beyond our current investigations, an important direction is the development of tests for heteroskedasticity. White's original paper proposed such a test based on comparing his covariance estimator to the usual one under homoskedasticity. There are many other well-known proposals (Dette and Munk, 1998; Azzalini and Bowman, 1993; Cook and Weisberg, 1983; Breusch and Pagan, 1979; Wang et al., 2017). Perhaps most closely related to our work, Li and Yao (2015) have proposed tests for heteroskedasticity with good properties in low and high dimensional settings. Their tests rely on computing measures of variability of the estimated residuals, including the ratio of the arithmetic and geometric means, as well as the coefficient of variation. Their work and follow-ups such as Bai et al. (2016, 2017) show central limit theorems for these test statistics. They also show an improved empirical power compared to some classical tests for heteroskedasticity. It would be of interest to see if our covariance matrix estimator could be used to develop new tests for heteroskedasticity.

An important extension of the heteroskedastic model is the clustered observations model. Liang and Zeger (1986) proposed estimating equations for such longitudinal/clustered data. They allowed arbitrarily correlated observations for any fixed individual (i.e., within each cluster), and proposed a consistent covariance estimator in the low-dimensional setting. Can one extend our ideas to the clustered case?

Another important direction is to develop covariance estimators that have good performance in the presence of both heteroskedasticity and autocorrelation. The most well-known example is possibly the popular Newey-West estimator (West and Newey, 1987), which is a sum of symmetrized lagged autocovariance matrices with decaying weights. Is it possible to develop new methods inspired by our ideas suitable for this setting?

Our paper does not touch on the interesting but challenging regime where $n < p$. In that setting, Bühlmann, Dezeure, and Zhang (Dezeure et al., 2016) proposed bootstrap methods for inference with the lasso under heteroskedasticity, in the limited ultra-sparse regime, where the sparsity $s$ of the regression parameter is $s \ll n^{1/2}$. These methods are limited as they apply only to the lasso, and because they only concern the ultra-sparse regime. It would be interesting to understand this regime better.

The authors thank Jason Klusowski for valuable discussions and feedback on an earlier version of the manuscript.
References
A. Azzalini and A. Bowman. On the use of nonparametric regression for checking linear relationships.
Journal of the Royal Statistical Society. Series B (Methodological) , 55(2):549–557, 1993.Z. Bai and Y. Yin. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix.
The Annals of Probability , 21(3):1275–1294, 1993. . Bai and J. W. Silverstein. Spectral analysis of large dimensional random matrices . Springer Series inStatistics. Springer, 2009.Z. Bai, G. Pan, and Y. Yin. Homoscedasticity tests for both low and high-dimensional fixed designregressions. arXiv preprint arXiv:1603.03830 , 2016.Z. Bai, G. Pan, and Y. Yin. A central limit theorem for sums of functions of residuals in a high-dimensional regression model with an application to variance homoscedasticity test.
TEST , pages1–25, 2017.R. M. Bell and D. F. McCaffrey. Bias reduction in standard errors for linear regression with multi-stagesamples.
Survey Methodology , 28(2):169–182, 2002.A. K. Bera, T. Suprayitno, and G. Premaratne. On some heteroskedasticity-robust estimators of variance–covariance matrix of the least-squares estimators.
Journal of Statistical Planning and Inference , 108(1-2):121–136, 2002.T. S. Breusch and A. R. Pagan. A simple test for heteroscedasticity and random coefficient variation.
Econometrica: Journal of the Econometric Society , 47(5):1287–1294, 1979.M. D. Cattaneo, M. Jansson, and W. K. Newey. Inference in linear regression models with many covari-ates and heteroscedasticity.
Journal of the American Statistical Association , 0(0):1–12, 2018.S. Chatterjee. Fluctuations of eigenvalues and second order poincar´e inequalities.
Probability Theoryand Related Fields , 143(1-2):1–40, 2009.M. Chen, C. Gao, Z. Ren, et al. A general decision theory for hubers epsilon-contamination model.
Electronic Journal of Statistics , 10(2):3752–3774, 2016.V. Chew. Covariance matrix estimation in linear models.
Journal of the American Statistical Association ,65(329):173–181, 1970.R. D. Cook and S. Weisberg. Diagnostics for heteroscedasticity in regression.
Biometrika , 70(1):1–10,1983.H. Dette and A. Munk. Testing heteroscedasticity in nonparametric regression.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 60(4):693–708, 1998.R. Dezeure, P. B¨uhlmann, and C.-H. Zhang. High-dimensional simultaneous inference with the boot-strap. arXiv preprint arXiv:1606.03940 , 2016.I. Diakonikolas, G. Kamath, D. M. Kane, J. Li, A. Moitra, and A. Stewart. Being robust (in highdimensions) can be practical. arXiv preprint arXiv:1703.00893 , 2017.L. H. Dicker. Variance estimation in high-dimensional linear models.
Biometrika , 101(2):269–284,2014.D. Donoho and A. Montanari. High dimensional robust m-estimation: Asymptotic variance via approx-imate message passing.
Probability Theory and Related Fields , 166(3-4):935–969, 2016.F. Eicker. Limit theorems for regressions with unequal and dependent errors. In
Proceedings of the fifthBerkeley symposium on mathematical statistics and probability , volume 1, pages 59–82, 1967.N. El Karoui, D. Bean, P. J. Bickel, C. Lim, and B. Yu. On robust regression with high-dimensionalpredictors.
Proc. Natl. Acad. Sci. USA , 110(36):14557–14562, 2013.W. H. Greene.
Econometric analysis . Pearson, 2003.H. Hartley, J. Rao, and G. Kiefer. Variance estimation with one unit per stratum.
Journal of the AmericanStatistical Association , 64(327):841–851, 1969.R. A. Horn and C. R. Johnson.
Matrix analysis . Cambridge university press, 1990.R. A. Horn and C. R. Johnson. Topics in matrix analysis. 1994.P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In
Pro-ceedings of the fifth Berkeley symposium on mathematical statistics and probability , volume 1, pages221–233. Berkeley, CA, 1967.P. J. Huber and E. M. Ronchetti. Robust statistics. 2011.G. W. Imbens and M. Kolesar. Robust standard errors in small samples: Some practical advice.
Reviewof Economics and Statistics , 98(4):701–712, 2016. . Janson, R. F. Barber, and E. Cand`es. Eigenprism: inference for high dimensional signal-to-noiseratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 79(4):1037–1065,2017.G. Kauermann and R. J. Carroll. A note on the efficiency of sandwich covariance matrix estimation.
Journal of the American Statistical Association , 96(456):1387–1396, 2001.E. Lehmann and G. Casella. Theory of point estimation.
Springer Texts in Statistics , 1998.Z. Li and J. Yao. Testing for heteroscedasticity in high-dimensional regressions. arXiv preprintarXiv:1510.00097 , 2015.K.-Y. Liang and S. L. Zeger. Longitudinal data analysis using generalized linear models.
Biometrika , 73(1):13–22, 1986.J. G. MacKinnon and H. White. Some heteroskedasticity-consistent covariance matrix estimators withimproved finite sample properties.
Journal of econometrics , 29(3):305–325, 1985.D. Paul and A. Aue. Random matrix theory in statistics: A review.
Journal of Statistical Planning andInference , 150:1–29, 2014.P. M. Visscher, W. G. Hill, and N. R. Wray. Heritability in the genomics era—concepts and misconcep-tions.
Nature reviews genetics , 9(4):255, 2008.H. Wang, P.-S. Zhong, and Y. Cui. Empirical likelihood ratio tests for coefficients in high dimensionalheteroscedastic linear models.
Statistica Sinica , 2017.K. D. West and W. K. Newey. A simple, positive semi-definite, heteroskedasticity and autocorrelationconsistent covariance matrix.
Econometrica , 55(3):703–708, 1987.H. White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedas-ticity.
Econometrica , 48(4):817–838, 1980.C.-F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis.
The Annals ofStatistics , 14(4):1261–1295, 1986.J. Yao, Z. Bai, and S. Zheng.
Large Sample Covariance Matrices and High-Dimensional Data Analysis .Cambridge University Press, 2015.W.-X. Zhou, K. Bose, J. Fan, and H. Liu. A new perspective on robust m -estimation: Finite sampletheory and applications to dependence-adjusted multiple testing. arXiv preprint arXiv:1711.05381 ,2017. A Proof details
A.1 Proof of unbiasedness of the Hadamard estimator
We consider estimators of the vector of variances of ˆ β of the form (cid:98) V = A · ( (cid:98) ε (cid:12) (cid:98) ε ) where A is a p × n matrix, and M (cid:12) M is the element-wise (or Hadamard) product of the vector or ma-trix M with itself. Our goal is to find A such that E (cid:98) V = V , where V = diag Cov( ˆ β ) . Here the diag op-erator returns the vector of diagonal entries of the matrix M , that is diag M = ( M , M , . . . , M nn ) (cid:62) .Recall that S = ( X (cid:62) X ) − X (cid:62) is a p × n matrix. We have that ˆ β = Sy = Sε + β . Since Cov( ε ) = Σ ,we have that Cov( ˆ β ) = S Σ S (cid:62) . Thus, our goal is to find unbiased estimates of the diagonal of this matrix. The following key lemmare-expresses that diagonal in terms of Hadamard products:
Lemma A.1.
Let $v$ be a zero-mean random vector, and $M$ be a fixed matrix. Then,
$$\mathbb{E}(M \odot M)(v \odot v) = \operatorname{diag}\bigl[M \operatorname{diag}\operatorname{Cov}(v)\, M^\top\bigr].$$
In particular, let $\Sigma$ be a diagonal matrix, and let $\Sigma_{vec}$ be the vector of diagonal entries of $\Sigma$. Then
$$(M \odot M)\Sigma_{vec} = \operatorname{diag}\bigl[M \Sigma M^\top\bigr].$$
Alternatively, let $u$ be a vector. Then
$$(M \odot M)u = \operatorname{diag}\bigl[M \operatorname{diag}(u)\, M^\top\bigr].$$

Proof.
Let $m_i$ be the rows of $M$. Let also $\Sigma = \operatorname{diag}\operatorname{Cov}(v)$. Then, the $i$-th entry of the left hand side equals
$$\mathbb{E}(m_i \odot m_i)^\top (v \odot v) = \mathbb{E}\sum_j m_{ij}^2 v_j^2 = \sum_j m_{ij}^2 \Sigma_{jj}.$$
The $i$-th entry of the right hand side equals
$$m_i^\top \Sigma m_i = \sum_j m_{ij}^2 \Sigma_{jj}.$$
Thus, the two sides are equal, which proves the first claim of the lemma. The second claim follows directly from the first claim, from the special case when the covariance of $v$ is diagonal. The third claim is simply a restatement of the second one.

1. Let us use the lemma for $v = \varepsilon$ and $M = S$. Notice that we have $\operatorname{Cov}(v) = \Sigma$ diagonal, so the right hand side (RHS) of the lemma is $\operatorname{diag} S\Sigma S^\top = \operatorname{diag}\operatorname{Cov}(\hat\beta)$, where the equality follows from our calculation before the lemma. Moreover, the left hand side (LHS) is $\mathbb{E}(S \odot S)(\varepsilon \odot \varepsilon) = (S \odot S)\Sigma_{vec}$, where we vectorize $\Sigma$, writing $\Sigma_{vec} = (\Sigma_{11}, \ldots, \Sigma_{nn})^\top$. The equality follows because $\operatorname{Cov}(\varepsilon) = \Sigma$ is diagonal. Thus, by the lemma, we have
$$V = \operatorname{diag}\operatorname{Cov}(\hat\beta) = (S \odot S)\Sigma_{vec}.$$
2. Let us now use the lemma for a second time, with M = I and v = (cid:98) ε . This shows that E ( (cid:98) ε (cid:12) (cid:98) ε ) =diag Cov( (cid:98) ε ) . By linearity of expectation, we obtain E (cid:98) V = A · E ( (cid:98) ε (cid:12) (cid:98) ε ) = A · diag Cov( (cid:98) ε ) .
3. Finally, let us use the lemma for the third time, with M = Q and v = ε . As in the first case, theLHS equals E ( (cid:98) ε (cid:12) (cid:98) ε ) = ( Q (cid:12) Q )Σ vec . The RHS equals diag[ M diag Cov( v ) M (cid:62) ] = diag Q Σ Q ,where we used that Q is a symmetric matrix. Now, Cov( (cid:98) ε ) = Cov( Qε ) = Q Σ Q . Thus, theconclusion of using the lemma for the third time is diag Cov( (cid:98) ε ) = diag Q Σ Q = ( Q (cid:12) Q )Σ vec Putting together the above three equations, we obtain that (cid:98) V is unbiased, namely E (cid:98) V = diag Cov( ˆ β ) ,if A ( Q (cid:12) Q )Σ vec = ( S (cid:12) S )Σ vec . This is a system of linear equations. The equation holds for any Σ if and only if A ( Q (cid:12) Q ) = ( S (cid:12) S ) . If Q (cid:12) Q is invertible, then we can write A = ( S (cid:12) S )( Q (cid:12) Q ) − . This shows that the original estimator (cid:98) V has the required form, finishing the proof. .2 Proof of Proposition 2.1 To prove the lower bound, we first claim that for any symmetric matrix A , rank A (cid:12) A ≤ (cid:18) rank A + 12 (cid:19) . Therefore, in order for Q (cid:12) Q to be invertible, we need n ≤ (cid:18) n − p + 12 (cid:19) . By solving the quadratic inequality, this is equivalent to p ≤ [2 n + 1 − (8 n + 1) / ] / .To prove the claim about ranks, let A = (cid:80) ri =1 v i v (cid:62) i be the eigendecomposition of A . Here v i areorthogonal, but not necessarily of unit norm. Then, A (cid:12) A = ( r (cid:88) i =1 v i v (cid:62) i ) (cid:12) ( r (cid:88) i =1 v i v (cid:62) i ) = r (cid:88) i =1 ( v i (cid:12) v i )( v i (cid:12) v i ) (cid:62) + 2 r (cid:88) ≤ i Our first step is to reduce to the case Γ = I p . Indeed, we notice that we can write X = Z Γ / , where Z is the matrix with rows z i . Hence, Q = X ( X (cid:62) X ) − X (cid:62) = Z ( Z (cid:62) Z ) − Z (cid:62) . Therefore, we can work with Γ = I p .The next step is to reduce the bounds on eigenvalues to bounds on certain quadratic forms. Let usdefine the matrices R i = X (cid:62) X − x i x (cid:62) i = (cid:80) j (cid:54) = i x j x (cid:62) j . See Section A.4 for a proof. Lemma A.2 (Reduction to quadratic forms) . We have the following two bounds on the eigenvalues of T : λ max ( T ) ≤ max i 11 + x (cid:62) i R − i x i , and λ min ( T ) ≥ min i − x (cid:62) i R − i x i (1 + x (cid:62) i R − i x i ) . To bound these expression, we will use the following well-known statement about concentration ofquadratic forms. Lemma A.3 (Concentration of quadratic forms, consequence of Lemma B.26 in Bai and Silverstein(2009)) . Let x ∈ R p be a random vector with i.i.d. entries and E [ x ] = 0 , for which E (cid:2) ( √ px i ) (cid:3) = σ and sup i E (cid:2) ( √ px i ) η (cid:3) < C for some η > and C < ∞ . Moreover, let A p be a sequence of random p × p symmetric matrices independent of x , with uniformly bounded eigenvalues. Then the quadraticforms x (cid:62) A p x concentrate around their means at the following rate P ( | x (cid:62) A p x − p − σ tr A p | η > C ) ≤ Cp − (1+ η/ . emma A.3 requires a small proof, see Section A.5.By assumption, the rows of our matrix X satisfy the above assumptions, for σ = 1 , and some η > .In particular, x ij are iid random variables of zero mean and variance /p . By taking η = 4 + δ for some δ > , we obtain by the Borel-Cantelli lemma that uniformly over all ix (cid:62) i R − i x i − p − tr R − i → a.s. . 
A.3 Proof of Theorem 2

Our first step is to reduce to the case $\Gamma = I_p$. Indeed, we notice that we can write $X = Z\Gamma^{1/2}$, where $Z$ is the matrix with rows $z_i$. Hence $X(X^\top X)^{-1}X^\top = Z(Z^\top Z)^{-1}Z^\top$, so that $Q = I - X(X^\top X)^{-1}X^\top$, and hence $T = Q \odot Q$, do not depend on $\Gamma$. Therefore, we can work with $\Gamma = I_p$.

The next step is to reduce the bounds on the eigenvalues to bounds on certain quadratic forms. Let us define the matrices $R_i = X^\top X - x_i x_i^\top = \sum_{j \ne i} x_j x_j^\top$.

Lemma A.2 (Reduction to quadratic forms). We have the following two bounds on the eigenvalues of $T$:
$$\lambda_{\max}(T) \le \max_i \frac{1}{1 + x_i^\top R_i^{-1} x_i}, \qquad \lambda_{\min}(T) \ge \min_i \frac{1 - x_i^\top R_i^{-1} x_i}{\big(1 + x_i^\top R_i^{-1} x_i\big)^2}.$$
See Section A.4 for a proof.

To bound these expressions, we will use the following well-known statement about the concentration of quadratic forms.

Lemma A.3 (Concentration of quadratic forms, consequence of Lemma B.26 in Bai and Silverstein (2009)). Let $x \in \mathbb{R}^p$ be a random vector with i.i.d. entries and $\mathbb{E}[x] = 0$, for which $\mathbb{E}[(\sqrt{p}\,x_i)^2] = \sigma^2$ and $\sup_i \mathbb{E}[(\sqrt{p}\,x_i)^{4+\eta}] < C$ for some $\eta > 0$ and $C < \infty$. Moreover, let $A_p$ be a sequence of random $p \times p$ symmetric matrices independent of $x$, with uniformly bounded eigenvalues. Then the quadratic forms $x^\top A_p x$ concentrate around their means at the following rate: for any fixed $t > 0$,
$$\mathbb{P}\big(|x^\top A_p x - p^{-1}\sigma^2 \operatorname{tr} A_p| > t\big) \le C\, p^{-(1+\eta/4)},$$
where the constant $C$ may depend on $t$ but not on $p$. Lemma A.3 requires a small proof; see Section A.5.

By assumption, the rows of our matrix $X$ satisfy the above assumptions, for $\sigma^2 = 1$ and some $\eta > 0$. In particular, the $x_{ij}$ are i.i.d. random variables with zero mean and variance $1/p$. By taking $\eta = 4 + \delta$ for some $\delta > 0$, we obtain by the Borel–Cantelli lemma that, uniformly over all $i$,
$$x_i^\top R_i^{-1} x_i - p^{-1}\operatorname{tr} R_i^{-1} \to 0 \quad \text{a.s.}$$

Therefore, from our earlier result we obtain
$$\limsup \lambda_{\max}(T) \le \limsup \max_i \frac{1}{1 + p^{-1}\operatorname{tr} R_i^{-1}}, \qquad \liminf \lambda_{\min}(T) \ge \liminf \min_i \frac{1 - p^{-1}\operatorname{tr} R_i^{-1}}{\big(1 + p^{-1}\operatorname{tr} R_i^{-1}\big)^2}.$$

Now, by the Marchenko–Pastur (MP) theorem (Bai and Silverstein, 2009, Theorem 3.6), the empirical spectral distribution of each $\gamma R_i$ converges to the standard MP law with parameter $\gamma < 1$. The reason for the normalization by $\gamma = p/n$ is that $\mathbb{E}\,x_{ij}^2 = 1/p$, whereas the MP law refers to matrices of the form $n^{-1}\sum_{i=1}^n z_i z_i^\top$, for $z_i$ with unit-variance entries. Thus, $p^{-1}\operatorname{tr} R_i^{-1} = \gamma\, p^{-1}\operatorname{tr}\big[(\gamma R_i)^{-1}\big] \to \gamma\,\mathbb{E}\,T_\gamma^{-1}$ a.s., where $T_\gamma$ is distributed according to the MP law with parameter $\gamma$. It is also well known that $\mathbb{E}\,T_\gamma^{-1} = 1/(1-\gamma)$ (see, e.g., Bai and Silverstein, 2009; Yao et al., 2015), so the limit equals $\gamma/(1-\gamma)$. Moreover, the difference between $\operatorname{tr} R_i^{-1}$ and $\operatorname{tr} R_j^{-1}$ can be bounded using the formula $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$. The details are omitted for brevity. It follows that we have the uniform convergence $\max_i \big|p^{-1}\operatorname{tr} R_i^{-1} - \gamma/(1-\gamma)\big| \to 0$ a.s.

Hence we obtain
$$\limsup \lambda_{\max}(T) \le \frac{1}{1 + \gamma/(1-\gamma)} = 1 - \gamma,$$
and also the lower bound
$$\liminf \lambda_{\min}(T) \ge \frac{1 - \gamma/(1-\gamma)}{\big(1 + \gamma/(1-\gamma)\big)^2} = (1 - 2\gamma)(1 - \gamma).$$
This finishes the argument.

A.4 Proof of Lemma A.2

We need to bound the smallest and largest eigenvalues of $T$. Now $T_{ij} = Q_{ij}^2 = (\delta_{ij} - x_i^\top R^{-1} x_j)^2$, where $R = X^\top X$. We will use the following well-known rank-one perturbation formula: for an invertible matrix $B$ and a vector $u$,
$$(uu^\top + B)^{-1} = B^{-1} - \frac{B^{-1} u u^\top B^{-1}}{1 + u^\top B^{-1} u}.$$
We will also use a "leave-one-out" argument which has roots in random matrix theory (see, e.g., Bai and Silverstein, 2009; Paul and Aue, 2014; Yao et al., 2015). Let
$$R_i = X^\top X - x_i x_i^\top = \sum_{j \ne i} x_j x_j^\top.$$
Then,
$$R^{-1} = R_i^{-1} - \frac{R_i^{-1} x_i x_i^\top R_i^{-1}}{1 + x_i^\top R_i^{-1} x_i}.$$
We get that the quantity that is squared in the $(i,j)$-th entry of $T$, for $j \ne i$, is
$$x_i^\top R^{-1} x_j = x_i^\top R_i^{-1} x_j - \frac{x_i^\top R_i^{-1} x_i \cdot x_i^\top R_i^{-1} x_j}{1 + x_i^\top R_i^{-1} x_i} = \frac{x_i^\top R_i^{-1} x_j}{1 + x_i^\top R_i^{-1} x_i}.$$
Also, working on the diagonal, we have
$$x_i^\top R^{-1} x_i = \frac{x_i^\top R_i^{-1} x_i}{1 + x_i^\top R_i^{-1} x_i}.$$
So, the diagonal terms are
$$T_{ii} = (1 - x_i^\top R^{-1} x_i)^2 = \frac{1}{\big(1 + x_i^\top R_i^{-1} x_i\big)^2}.$$

By the Gershgorin disk theorem (Horn and Johnson, 1990, Thm. 6.1.1), with $T = Q \odot Q$, we have
$$\lambda_{\max}(T) \le \max_i \Big(T_{ii} + \sum_{j \ne i} |T_{ij}|\Big).$$
Thus, an upper bound on the operator norm of $T$ is the maximum over all $i$ of
$$\frac{1 + \sum_{j \ne i} (x_i^\top R_i^{-1} x_j)^2}{\big(1 + x_i^\top R_i^{-1} x_i\big)^2}.$$
Now, the sum in the numerator can be written as
$$\sum_{j \ne i} (x_i^\top R_i^{-1} x_j)^2 = x_i^\top R_i^{-1}\Big(\sum_{j \ne i} x_j x_j^\top\Big) R_i^{-1} x_i = x_i^\top R_i^{-1} x_i.$$
Therefore, there is an unexpected cancellation, which simplifies the analysis a great deal. Thus,
$$\lambda_{\max}(T) \le \max_i \frac{1 + x_i^\top R_i^{-1} x_i}{\big(1 + x_i^\top R_i^{-1} x_i\big)^2} = \max_i \frac{1}{1 + x_i^\top R_i^{-1} x_i}.$$
Similarly, for the smallest eigenvalue, by the Gershgorin disk theorem (Horn and Johnson, 1990, Thm. 6.1.1), we have
$$\lambda_{\min}(T) \ge \min_i \Big(T_{ii} - \sum_{j \ne i} |T_{ij}|\Big).$$
We can express
$$a_i = T_{ii} - \sum_{j \ne i} |T_{ij}| = \frac{1 - x_i^\top R_i^{-1} x_i}{\big(1 + x_i^\top R_i^{-1} x_i\big)^2}.$$
This shows that
$$\lambda_{\min}(T) \ge \min_i \frac{1 - x_i^\top R_i^{-1} x_i}{\big(1 + x_i^\top R_i^{-1} x_i\big)^2}.$$
This finishes the proof.
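As a quick numerical illustration of these eigenvalue bounds (again our own sketch, not from the paper; the sizes and seed are arbitrary assumptions), one can draw a design with i.i.d. entries of variance $1/p$ and compare the extreme eigenvalues of $T = Q \odot Q$ with the asymptotic bounds $1-\gamma$ and $(1-2\gamma)(1-\gamma)$ derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 200                    # aspect ratio gamma = p/n = 0.2 < 1/2
gamma = p / n
X = rng.standard_normal((n, p)) / np.sqrt(p)   # i.i.d. entries with variance 1/p, Gamma = I_p

Q = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
T = Q * Q                           # T = Q o Q (entrywise square of Q)
evals = np.linalg.eigvalsh(T)

print(evals[-1], 1 - gamma)                       # largest eigenvalue vs. asymptotic upper bound 1 - gamma
print(evals[0], (1 - 2 * gamma) * (1 - gamma))    # smallest eigenvalue vs. asymptotic lower bound
```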
A.5 Proof of Lemma A.3

We will use the following Trace Lemma quoted from Bai and Silverstein (2009).

Lemma A.4 (Trace Lemma, Lemma B.26 of Bai and Silverstein (2009)). Let $y$ be a $p$-dimensional random vector of i.i.d. elements with mean 0. Suppose that $\mathbb{E}[y_i^2] = 1$, and let $A_p$ be a fixed $p \times p$ matrix. Then
$$\mathbb{E}\big[|y^\top A_p y - \operatorname{tr} A_p|^q\big] \le C_q\Big\{\big(\mathbb{E}[y_1^4]\operatorname{tr}[A_p A_p^\top]\big)^{q/2} + \mathbb{E}\big[y_1^{2q}\big]\operatorname{tr}\big[(A_p A_p^\top)^{q/2}\big]\Big\},$$
for some constant $C_q$ that depends only on $q$.

Proof of Lemma A.3. Under the conditions of Lemma A.3, the operator norms $\|A_p\|_{\mathrm{op}}$ are bounded by a constant $C$, thus $\operatorname{tr}[(A_p A_p^\top)^{q/2}] \le p C^q$ and $\operatorname{tr}[A_p A_p^\top] \le p C^2$. Consider now a random vector $x$ with the properties assumed in the present lemma. For $y = \sqrt{p}\,x/\sigma$ and $q = 2 + \eta/2$, using that $\mathbb{E}[y_i^{2q}] \le C$ and the other conditions in Lemma A.3, Lemma A.4 thus yields
$$\frac{p^q}{\sigma^{2q}}\,\mathbb{E}\Big[\Big|x^\top A_p x - \frac{\sigma^2}{p}\operatorname{tr} A_p\Big|^q\Big] \le C\Big\{(pC)^{q/2} + p C^q\Big\},$$
or equivalently
$$\mathbb{E}\Big[\Big|x^\top A_p x - \frac{\sigma^2}{p}\operatorname{tr} A_p\Big|^{2+\eta/2}\Big] \le C p^{-(1+\eta/4)}.$$
By Markov's inequality applied to the $(2+\eta/2)$-th moment of $\varepsilon_p = x^\top A_p x - \frac{\sigma^2}{p}\operatorname{tr} A_p$, we obtain, as required, that for any fixed $t > 0$,
$$\mathbb{P}(|\varepsilon_p| > t) \le C\, t^{-(2+\eta/2)}\, p^{-(1+\eta/4)}.$$

A.6 Proof of Proposition 2.3

We need to evaluate $\mathbb{E}\,\hat V^{\odot 2} = \mathbb{E}\,\hat V \odot \hat V \in \mathbb{R}^p$. Note that this vector is the diagonal of $\mathbb{E}\,\hat V \hat V^\top$, which is equal to
$$\mathbb{E}\,\hat V \hat V^\top = \mathbb{E}\,A(\hat\varepsilon \odot \hat\varepsilon)(\hat\varepsilon \odot \hat\varepsilon)^\top A^\top = \mathbb{E}\,A\big[(\hat\varepsilon\hat\varepsilon^\top) \odot (\hat\varepsilon\hat\varepsilon^\top)\big] A^\top = A\,\mathbb{E}\big[(\hat\varepsilon\hat\varepsilon^\top) \odot (\hat\varepsilon\hat\varepsilon^\top)\big] A^\top.$$
Note that $\hat\varepsilon\hat\varepsilon^\top = Q\varepsilon\varepsilon^\top Q$ since the residuals satisfy $\hat\varepsilon = Q\varepsilon$. Using this expression and recognizing that $\varepsilon$ has i.i.d. $N(0, \sigma^2)$ entries, the $(i,j)$-element of $\mathbb{E}\,(\hat\varepsilon\hat\varepsilon^\top)^{\odot 2}$ is
$$\mathbb{E}\,\hat\varepsilon_i^2\hat\varepsilon_j^2 = \mathbb{E}\Big(\sum_{l=1}^n Q_{il}\varepsilon_l\Big)^2\Big(\sum_{k=1}^n Q_{jk}\varepsilon_k\Big)^2 = \sum_{l \ne k}\big(Q_{il}^2 Q_{jk}^2 + 2 Q_{il} Q_{jk} Q_{ik} Q_{jl}\big)\mathbb{E}\,\varepsilon_l^2\varepsilon_k^2 + \sum_{l=1}^n Q_{il}^2 Q_{jl}^2\,\mathbb{E}\,\varepsilon_l^4$$
$$= \sigma^4\sum_{l \ne k}\big(Q_{il}^2 Q_{jk}^2 + 2 Q_{il} Q_{jk} Q_{ik} Q_{jl}\big) + 3\sigma^4\sum_{l=1}^n Q_{il}^2 Q_{jl}^2 = \sigma^4\sum_{1 \le l,k \le n}\big(Q_{il}^2 Q_{jk}^2 + 2 Q_{il} Q_{jk} Q_{ik} Q_{jl}\big)$$
$$= \sigma^4\sum_{1 \le l,k \le n} Q_{il}^2 Q_{jk}^2 + 2\sigma^4\Big(\sum_{l=1}^n Q_{il} Q_{jl}\Big)^2.$$
To proceed, we recognize that $\sum_{1 \le l,k \le n} Q_{il}^2 Q_{jk}^2$ is the $(i,j)$-element of
$$\big[(Q \odot Q)1_n\big]\big[(Q \odot Q)1_n\big]^\top = (Q \odot Q)\,1_n 1_n^\top\,(Q \odot Q),$$
and $\big(\sum_{l=1}^n Q_{il} Q_{jl}\big)^2$ is the $(i,j)$-element of $(Q^2) \odot (Q^2) = Q \odot Q$, since $Q$ is idempotent. Summarizing the calculation above, we obtain
$$\mathbb{E}\,(\hat\varepsilon\hat\varepsilon^\top)^{\odot 2} = \sigma^4 (Q \odot Q)\,1_n 1_n^\top\,(Q \odot Q) + 2\sigma^4\, Q \odot Q,$$
from which it follows that
$$\mathbb{E}\,\hat V \odot \hat V = \operatorname{diag}\big[A\big(\sigma^4 (Q \odot Q)1_n 1_n^\top (Q \odot Q) + 2\sigma^4\, Q \odot Q\big)A^\top\big] = \sigma^4\operatorname{diag}\big[(S \odot S)1_n 1_n^\top (S \odot S)^\top\big] + 2\sigma^4\operatorname{diag}\big[(S \odot S)(Q \odot Q)^{-1}(S \odot S)^\top\big].$$
Note that $V = \sigma^2\operatorname{diag}\big[(X^\top X)^{-1}\big]$ due to the assumption of homoskedasticity. Denoting $E = \operatorname{diag}\big[(X^\top X)^{-1}\big] \odot \operatorname{diag}\big[(X^\top X)^{-1}\big]$, we get that the vector of degrees of freedom, over all $j$, is
$$d = \frac{2E}{\operatorname{diag}\big[(S \odot S)1_n 1_n^\top (S \odot S)^\top\big] + 2\operatorname{diag}\big[(S \odot S)(Q \odot Q)^{-1}(S \odot S)^\top\big] - E},$$
where the division is understood to be entrywise. This finishes the proof.
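For concreteness, the degrees-of-freedom vector $d$ is straightforward to evaluate numerically from the quantities above; the sketch below (our own code, assuming a Gaussian design, arbitrary sizes, and variable names that are not from the paper) implements the homoskedastic formula just derived, in which $\sigma^4$ cancels.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 12
X = rng.standard_normal((n, p))

S = np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - X @ S
SS, QQ = S * S, Q * Q                    # entrywise squares of S and Q
ones = np.ones(n)

# Degrees-of-freedom vector d (homoskedastic case; sigma^4 cancels in the ratio)
E = np.diag(np.linalg.inv(X.T @ X)) ** 2                       # diag[(X^T X)^{-1}] squared entrywise
term1 = (SS @ ones) ** 2                                       # diag[(S o S) 1 1^T (S o S)^T]
term2 = np.einsum('ij,ij->i', SS @ np.linalg.inv(QQ), SS)      # diag[(S o S)(Q o Q)^{-1}(S o S)^T]
d = 2 * E / (term1 + 2 * term2 - E)
print(d)
```

Each entry of $d$ can then serve as the reference degrees of freedom for the corresponding coordinate, in the spirit of the adjustment discussed in the main text.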
A.7 Calculation for the case when p = 1

We compute each part of the unbiased estimator in turn. We start by noticing that $S = (X^\top X)^{-1}X^\top = X^\top$ is a $1 \times n$ vector. We continue by calculating $Q \odot Q$, where $Q = I - X(X^\top X)^{-1}X^\top = I - XX^\top$. Thus,
$$Q_{ij} = \begin{cases} -X_i X_j, & i \ne j,\\ 1 - X_i^2, & \text{else.} \end{cases}$$
Denoting $u = X \odot X$ and $D = I - 2\operatorname{diag}(X \odot X)$, so that $D_{jj} = 1 - 2X_j^2$, we can write $Q \odot Q = D + uu^\top$. Now, the estimator takes the form $\hat V = (S \odot S)(Q \odot Q)^{-1}(\hat\varepsilon \odot \hat\varepsilon)$. Hence, we need to calculate $(S \odot S)(Q \odot Q)^{-1} = u^\top(D + uu^\top)^{-1}$. We use the rank-one perturbation formula
$$u^\top(D + uu^\top)^{-1} = \frac{u^\top D^{-1}}{1 + u^\top D^{-1} u}.$$
In our case,
$$u^\top D^{-1} u = \sum_{j=1}^n \frac{u_j^2}{D_{jj}} = \sum_{j=1}^n \frac{X_j^4}{1 - 2X_j^2},$$
and $u^\top D^{-1}$ has entries $X_j^2/(1 - 2X_j^2)$. This leads to the desired final answer:
$$\hat V = u^\top(D + uu^\top)^{-1}(\hat\varepsilon \odot \hat\varepsilon) = \frac{\sum_{j=1}^n \frac{X_j^2}{1 - 2X_j^2}\,\hat\varepsilon_j^2}{1 + \sum_{j=1}^n \frac{X_j^4}{1 - 2X_j^2}}.$$
Next, we find $E = \operatorname{diag}[(X^\top X)^{-1}] \odot \operatorname{diag}[(X^\top X)^{-1}]$. Since $X^\top X = 1$, we have $E = 1$. Finally, we need to find
$$d = \frac{2E}{\operatorname{diag}\big[(S \odot S)1_n 1_n^\top (S \odot S)^\top\big] + 2\operatorname{diag}\big[(S \odot S)(Q \odot Q)^{-1}(S \odot S)^\top\big] - E}.$$
Since $S = X^\top$, $u = X \odot X$, and $Q \odot Q = D + uu^\top$, so that $u^\top 1_n = 1$, this simplifies to
$$d = \frac{1}{u^\top(D + uu^\top)^{-1} u} = 1 + \frac{1}{u^\top D^{-1} u} = 1 + \frac{1}{\sum_{j=1}^n \frac{X_j^4}{1 - 2X_j^2}},$$
as desired.

A.8 Proof of Proposition 2.4

To compute the bias of White's estimator defined in (3), we proceed as follows. First we need to compute its expectation,
$$\mathbb{E}\,\hat C_W = (X^\top X)^{-1}\big[X^\top\,\mathbb{E}\operatorname{diag}(\hat\varepsilon \odot \hat\varepsilon)\,X\big](X^\top X)^{-1}.$$
As we saw, $\mathbb{E}(\hat\varepsilon \odot \hat\varepsilon) = \operatorname{diag}\operatorname{Cov}(\hat\varepsilon) = \operatorname{diag} Q\Sigma Q = (Q \odot Q)\Sigma_{\mathrm{vec}}$. Thus,
$$\operatorname{diag}\mathbb{E}\,\hat C_W = \operatorname{diag}\big[S\operatorname{diag}\big[(Q \odot Q)\Sigma_{\mathrm{vec}}\big]S^\top\big] = (S \odot S)(Q \odot Q)\Sigma_{\mathrm{vec}}.$$
Again, as we saw, $V = \operatorname{diag}\operatorname{Cov}(\hat\beta) = (S \odot S)\Sigma_{\mathrm{vec}}$. Therefore, the bias of White's estimator is
$$b_W = (S \odot S)\big[(Q \odot Q) - I_n\big]\Sigma_{\mathrm{vec}}.$$
This is the desired result.

To compute the bias of the MacKinnon–White estimator, we proceed similarly, starting with its expectation:
$$\mathbb{E}\,\hat C_{MW} = (X^\top X)^{-1}\big[X^\top\,\mathbb{E}\operatorname{diag}(Q)^{-1}\operatorname{diag}(\hat\varepsilon \odot \hat\varepsilon)\,X\big](X^\top X)^{-1}.$$
In this equation, the expression $\operatorname{diag}(Q)$ is interpreted as the diagonal matrix whose entries are those on the diagonal of $Q$. Thus,
$$\operatorname{diag}\mathbb{E}\,\hat C_{MW} = \operatorname{diag}\big[S\operatorname{diag}(Q)^{-1}\operatorname{diag}\big[(Q \odot Q)\Sigma_{\mathrm{vec}}\big]S^\top\big] = (S \odot S)\operatorname{diag}(Q)^{-1}(Q \odot Q)\Sigma_{\mathrm{vec}}.$$
Thus the bias is
$$b_{MW} = (S \odot S)\big[\operatorname{diag}(Q)^{-1}(Q \odot Q) - I_n\big]\Sigma_{\mathrm{vec}}.$$
This is the desired result, finishing the proof.
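The bias formulas lend themselves to a direct numerical comparison. The following sketch (our own illustration, with an arbitrary heteroskedastic variance profile and hypothetical variable names) evaluates $b_W$ and $b_{MW}$ for a fixed design and confirms that the corresponding expression for the Hadamard estimator vanishes up to round-off.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 20
X = rng.standard_normal((n, p))
Sigma_vec = rng.uniform(0.5, 3.0, n)      # an arbitrary heteroskedastic variance profile

S = np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - X @ S
SS, QQ = S * S, Q * Q

V = SS @ Sigma_vec                                                      # diag Cov(beta_hat)
b_W  = SS @ ((QQ - np.eye(n)) @ Sigma_vec)                              # bias of White's estimator
b_MW = SS @ ((np.diag(1 / np.diag(Q)) @ QQ - np.eye(n)) @ Sigma_vec)    # bias of the MacKinnon-White variant
b_H  = SS @ np.linalg.inv(QQ) @ (QQ @ Sigma_vec) - V                    # Hadamard estimator: zero up to round-off

print(np.linalg.norm(b_W) / np.linalg.norm(V),
      np.linalg.norm(b_MW) / np.linalg.norm(V),
      np.linalg.norm(b_H) / np.linalg.norm(V))
```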
A.9 Proof of Theorem 3

We would like to bound $\|\hat V - V\|$, where $\|\cdot\|$ denotes the usual Euclidean vector norm. Recall that $V = (S \odot S)\Sigma_{\mathrm{vec}}$ and $\hat V = (S \odot S)(Q \odot Q)^{-1}(\hat\varepsilon \odot \hat\varepsilon)$. So, by the definition of operator norms, we can bound
$$\|\hat V - V\| \le \|S \odot S\|_{\mathrm{op}}\,\|(Q \odot Q)^{-1}\|_{\mathrm{op}}\,\|(\hat\varepsilon \odot \hat\varepsilon) - (Q \odot Q)\Sigma_{\mathrm{vec}}\|.$$
We will find upper bounds for each term in the above product.

1. Bounding $\|S \odot S\|_{\mathrm{op}}$. Schur's inequality (e.g., Horn and Johnson, 1994, Thm. 5.5.1) states that $\|S \odot S\|_{\mathrm{op}} \le \|S\|_{\mathrm{op}}^2$. Moreover, $\|S\|_{\mathrm{op}} = 1/\sigma_{\min}(X)$. By the Bai–Yin law (Bai and Yin, 1993), $\sigma_{\min}(X) = n^{1/2} - p^{1/2} + o(n^{1/2})$ almost surely (a.s.); the meaning of constants can change from line to line. Assuming that there is a constant $c_0 < 1$ such that $p/n < c_0$, we also get $\sigma_{\min}(X) \ge c'(n^{1/2} - p^{1/2})$ eventually a.s., for any constant $c' < 1$. Thus, we get the bound
$$n\,\|S \odot S\|_{\mathrm{op}} \le \frac{n\,c}{(n^{1/2} - p^{1/2})^2} \le \frac{c}{(1 - \gamma^{1/2})^2}$$
a.s., for any constant $c > 1$.

2. Bounding $\|(Q \odot Q)^{-1}\|_{\mathrm{op}}$. This follows from Theorem 2; see Section A.3. That argument shows that
$$\|(Q \odot Q)^{-1}\|_{\mathrm{op}} \le \frac{c}{(1 - 2\gamma)(1 - \gamma)}$$
a.s., for any constant $c > 1$, under high-dimensional asymptotics.

3. Bounding $\alpha^2 = \|(\hat\varepsilon \odot \hat\varepsilon) - (Q \odot Q)\Sigma_{\mathrm{vec}}\|^2$. We can express $\alpha^2 = \sum_i \alpha_i^2$, where $\alpha_i = \hat\varepsilon_i^2 - (q_i \odot q_i)^\top\Sigma_{\mathrm{vec}}$ and $q_i$ is the $i$-th row of $Q$. Since $\mathbb{E}\,\hat\varepsilon_i^2 = (q_i \odot q_i)^\top\Sigma_{\mathrm{vec}}$, which follows from the earlier unbiasedness argument, we have $\mathbb{E}\,\alpha_i^2 = \operatorname{Var}\hat\varepsilon_i^2$. An easy calculation shows that, with $\Gamma_k = \mathbb{E}\,\varepsilon_k^4$, we have
$$\operatorname{Var}\hat\varepsilon_i^2 = \sum_k q_{ik}^4\,(\Gamma_k - 3\sigma_k^4) + 2\big[(q_i \odot q_i)^\top\Sigma_{\mathrm{vec}}\big]^2.$$
Now the excess kurtosis is zero by assumption, so $\Gamma_k - 3\sigma_k^4 = 0$. Therefore, we can bound by Markov's inequality:
$$\mathbb{P}(\alpha^2 \ge t) \le \frac{\sum_i \mathbb{E}\,\alpha_i^2}{t} = \frac{2\sum_i\big[(q_i \odot q_i)^\top\Sigma_{\mathrm{vec}}\big]^2}{t} = \frac{2\,\|(Q \odot Q)\Sigma_{\mathrm{vec}}\|^2}{t}.$$
Using a similar approach to the above, the bound for $\|Q \odot Q\|_{\mathrm{op}}$ follows from Theorem 2; see Section A.3. That argument shows that $\|Q \odot Q\|_{\mathrm{op}} \le c(1 - \gamma)$ a.s., for any constant $c > 1$, under high-dimensional asymptotics. Hence, a.s. under high-dimensional asymptotics,
$$\mathbb{P}(\alpha^2 \ge t) \le \frac{c\,(1 - \gamma)^2\,\|\Sigma_{\mathrm{vec}}\|^2}{t}.$$

In conclusion, under high-dimensional asymptotics,
$$\mathbb{P}\left(\frac{\|\hat V - V\|}{\|\Sigma_{\mathrm{vec}}\|} \ge \frac{t}{n}\right) \le \frac{c}{t^2\,(1 - \gamma^{1/2})^4\,(1 - 2\gamma)^2}$$
a.s., for a constant $c > 0$. This proves the required result.
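To see this concentration numerically, one can track the scaled error $n\|\hat V - V\|/\|\Sigma_{\mathrm{vec}}\|$ across growing problem sizes. The sketch below is our own simulation, not from the paper; it assumes a design with i.i.d. unit-variance Gaussian entries (matching the Bai–Yin normalization used in step 1) and Gaussian noise, so that the zero-excess-kurtosis condition holds.

```python
import numpy as np

rng = np.random.default_rng(4)

def scaled_error(n, p, reps=200):
    """Median of n * ||V_hat - V|| / ||Sigma_vec|| over Monte Carlo noise draws, for one design."""
    X = rng.standard_normal((n, p))                 # unit-variance entries (assumed normalization)
    Sigma_vec = rng.uniform(0.5, 2.0, n)
    S = np.linalg.solve(X.T @ X, X.T)
    Q = np.eye(n) - X @ S
    A = (S * S) @ np.linalg.inv(Q * Q)
    V = (S * S) @ Sigma_vec
    errs = []
    for _ in range(reps):
        eps = rng.standard_normal(n) * np.sqrt(Sigma_vec)     # Gaussian noise: zero excess kurtosis
        V_hat = A @ ((Q @ eps) ** 2)
        errs.append(n * np.linalg.norm(V_hat - V) / np.linalg.norm(Sigma_vec))
    return np.median(errs)

for n in (200, 400, 800):
    print(n, scaled_error(n, p=n // 5))   # gamma = 0.2; the scaled error should not blow up with n
```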
A.10 Proof of Theorem 4

Since we assumed Gaussian noise, we have $\hat\varepsilon = Q\varepsilon \sim N(0, Q\Sigma Q)$. So, we can write $\hat\varepsilon = Q\Sigma^{1/2}Z$, where $Z \sim N(0, I_n)$. Let us denote $M = Q\Sigma^{1/2}$. Now, we have $\hat V_i = A_i^\top(\hat\varepsilon \odot \hat\varepsilon)$, where $A_i^\top$ is the $i$-th row of $A = (S \odot S)(Q \odot Q)^{-1}$. So,
$$\hat V_i = \sum_j A_{ij}\,\hat\varepsilon_j^2 = \sum_j A_{ij}\Big(\sum_k M_{jk}Z_k\Big)^2 = \sum_{k,l} Z_k Z_l\Big(\sum_j A_{ij}M_{jk}M_{jl}\Big).$$
This shows that $\hat V_i = Z^\top W_i Z$, where $W_i$ is the $n \times n$ matrix $W_i = M^\top\operatorname{diag}(A_i)\,M$. Letting $\Lambda_i$ be the diagonal matrix of eigenvalues $\lambda_1, \dots, \lambda_n$ of $W_i$, we obtain that the distribution of $\hat V_i$ is a weighted mixture of chi-squared random variables with weights $\lambda_j$, $j = 1, \dots, n$.

We will use the second-order Poincaré inequality; see Chatterjee (2009), Theorem 2.2. This states that the total variation distance we need to bound is at most
$$d_{TV}(\hat V_i, B_i) \le 2\cdot 5^{1/2}\cdot\frac{\kappa_1\kappa_2}{\sigma_g^2},$$
where $\kappa_1 = \big[\mathbb{E}\,\|\nabla g(Z)\|^4\big]^{1/4}$ and $\kappa_2 = \big[\mathbb{E}\,\|\nabla^2 g(Z)\|_{\mathrm{op}}^4\big]^{1/4}$, while $g(x) = x^\top W_i x$ is the function mapping the normal random vector $Z$ into $\hat V_i$, so that $\hat V_i = g(Z)$. In addition, $\sigma_g^2$ is the variance of $g(Z)$.

Now, it can be checked that $\nabla g(Z) = 2W_iZ$, so that $\|\nabla g(Z)\|^2 = 4\,Z^\top W_i^2 Z$, which is distributed as $4\sum_{j=1}^n \lambda_j^2 Z_j'^2$ for another standard normal random vector $Z'$. Denoting $L = \sum_{j=1}^n(\lambda_j Z'_j)^2$,
$$2^{-4}\,\mathbb{E}\,\|\nabla g(Z)\|^4 = \mathbb{E}\Big[\Big(\sum_{j=1}^n(\lambda_j Z'_j)^2\Big)^2\Big] = \operatorname{Var}L + (\mathbb{E}\,L)^2.$$
Next,
$$\operatorname{Var}L = \sum_{j=1}^n\operatorname{Var}\big[(\lambda_j Z'_j)^2\big] = 2\sum_{j=1}^n\lambda_j^4.$$
Meanwhile, $\mathbb{E}\,L = \sum_{j=1}^n\lambda_j^2$, and thus
$$2^{-4}\,\mathbb{E}\,\|\nabla g(Z)\|^4 = 2\sum_{j=1}^n\lambda_j^4 + \Big(\sum_{j=1}^n\lambda_j^2\Big)^2 \le 3\Big(\sum_{j=1}^n\lambda_j^2\Big)^2.$$
We obtain that $\kappa_1 \le 2\cdot 3^{1/4}\big(\sum_j\lambda_j^2\big)^{1/2}$. Continuing, $\nabla^2 g(Z) = 2W_i$ is non-random, hence $\kappa_2 = 2\,\|W_i\|_{\mathrm{op}} = 2\lambda_{\max}$, where $\lambda_{\max}$ denotes the largest eigenvalue of $W_i$ in absolute value. Finally, we can calculate $\sigma_g^2$. Since $\hat V_i = Z^\top W_iZ$, as we have already noticed, the distribution of $\hat V_i$ is a weighted mixture of chi-squared random variables with weights $\lambda_j$, $j = 1, \dots, n$. Hence
$$\sigma_g^2 = \operatorname{Var}\hat V_i = 2\sum_{j=1}^n\lambda_j^2.$$
Putting everything together, we obtain that
$$d_{TV}(\hat V_i, B_i) \le 2\cdot 5^{1/2}\cdot\frac{2\cdot 3^{1/4}\big(\sum_j\lambda_j^2\big)^{1/2}\cdot 2\lambda_{\max}}{2\sum_{j=1}^n\lambda_j^2},$$
which simplifies to the desired result. This finishes the proof.
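As a closing illustration (again our own sketch, not code from the paper; the sizes, seed, and variable names are assumptions), the quantities entering the total-variation bound are easy to compute for a given design: form $W_i = M^\top\operatorname{diag}(A_i)M$, extract its eigenvalues, and evaluate the ratio $\lambda_{\max}/\big(\sum_j\lambda_j^2\big)^{1/2}$, which is small when the normal approximation for $\hat V_i$ is accurate.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 300, 60
X = rng.standard_normal((n, p))
Sigma_vec = rng.uniform(0.5, 2.0, n)

S = np.linalg.solve(X.T @ X, X.T)
Q = np.eye(n) - X @ S
A = (S * S) @ np.linalg.inv(Q * Q)
M = Q @ np.diag(np.sqrt(Sigma_vec))          # M = Q Sigma^{1/2}, so eps_hat = M Z with Z ~ N(0, I)

i = 0                                        # examine the first coordinate
W_i = M.T @ np.diag(A[i]) @ M                # V_hat_i = Z^T W_i Z
lam = np.linalg.eigvalsh(W_i)

# Ratio driving the total-variation bound: small values indicate a good normal approximation
print(np.max(np.abs(lam)) / np.sqrt(np.sum(lam ** 2)))
```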