Improved Estimators for Semi-supervised High-dimensional Regression Model
Ilan Livne, David Azriel, Yair Goldberg
Technion - Israel Institute of Technology
February 16, 2021
Abstract
We study a linear high-dimensional regression model in a semi-supervised setting, where for many observations only the vector of covariates X is given with no response Y. We do not make any sparsity assumptions on the vector of coefficients, and aim at estimating Var(Y | X). We propose an estimator, which is unbiased, consistent, and asymptotically normal. This estimator can be improved by adding zero-estimators arising from the unlabelled data. Adding zero-estimators does not affect the bias and potentially can reduce variance. In order to achieve optimal improvement, many zero-estimators should be used, but this raises the problem of estimating many parameters. Therefore, we introduce covariate selection algorithms that identify which zero-estimators should be used in order to improve the above estimator. We further illustrate our approach for other estimators, and present an algorithm that improves estimation for any given variance estimator. Our theoretical results are demonstrated in a simulation study.

Key words and phrases:
Regression error, Semi-supervised setting, U-statistics, Variance estimation, Zero estimators.
Introduction

High-dimensional data analysis, where the number of predictors is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level τ² and the noise level σ², i.e., to quantify how much variation in the response variable can be explained by the predictors, versus how much of the variation is left unexplained. For example, in disease classification using DNA microarray data, where the number of potential predictors, say the genotypes, is enormous for each individual, one may wish to understand how disease risk is associated with genotype versus environmental factors.

Estimating the signal and noise levels is important even in a low-dimensional setting. In particular, a statistical model partitions the total variability of the response variable into two components: the variance of the fitted model τ², and the variance of the residuals σ². This partition is at the heart of techniques such as ANOVA and linear regression, where τ² and σ² might also be commonly referred to as explained versus unexplained variation, or between-treatments versus within-treatments variation. Moreover, in model selection problems, τ² and σ² may be required for computing popular statistics, such as Cp, AIC, BIC and R². Both τ² and σ² are also closely related to other important statistical problems, such as genetic heritability and signal detection. Hence, developing good estimators for these quantities is a desirable goal.

When the number of covariates p is much smaller than the number of observations n, and a linear model is assumed, the ordinary least squares (henceforth, OLS) method provides straightforward estimators for τ² and σ². However, when p > n, it becomes more challenging to perform inference on τ² and σ² without further assumptions, such as sparsity of the coefficients. In practice, the sparsity assumption may be unrealistic for some areas of interest. In this case, considering only a small number of significant coefficients can lead to biases and inaccuracies. One relevant example is the problem of missing heritability, i.e., the gap between heritability estimates from genome-wide association studies (GWAS) and the corresponding estimates from twin studies. For example, by 2010, GWAS studies had identified a relatively small number of covariates that collectively explained only a small fraction of the total variation in the trait height, far less than the fraction of the total variation explained by twin studies (Eichler et al., 2010). Identifying all the GWAS covariates affecting a trait, and measuring how much variation they capture, is believed to bridge a significant fraction of the heritability gap. With that in mind, methods that heavily rely on the sparsity assumption may underestimate τ² by their nature. We show in this work that in the semi-supervised setting, in which for many observations only the covariates X are given with no responses Y, one may consistently estimate the heritability without sparsity assumptions.

Estimating τ² and σ² in a high-dimensional regression setting is generally a challenging problem. As mentioned above, the sparsity assumption, which means that only a relatively small number of predictors are relevant, plays an important role in this context. Fan et al. (2012) introduced a refitted cross-validation method for estimating σ². Their method includes a two-staged procedure where a variable-selection technique is performed in the first stage, and OLS is used to estimate σ² in the second stage.
Sun and Zhang (2012) introduced the scaled lasso algorithm that jointly estimates the noise level and the regression coefficients by an iterative lasso procedure. Both works provide asymptotic distributional results for their estimators and prove consistency under several assumptions including sparsity. In the context of heritability estimation, Gorfine et al. (2017) presented the HERRA estimator, which is based on the above methods and is also applicable to time-to-event outcomes, in addition to continuous or dichotomous outcomes. Another recent related work is Cai and Guo (2020), which considers, as we do here, a semi-supervised learning setting. In their work, Cai and Guo proposed the CHIVE estimator of τ², which integrates both labelled and unlabelled data and works well when assuming sparsity. They characterize its limiting distribution and calculate confidence intervals for τ². For more related works, see the literature review of Cai and Guo (2020).

Rather than assuming sparsity, or other structural assumptions on the coefficient vector β, a different approach for high-dimensional inference is to assume some knowledge about the covariate distribution. Dicker (2014) uses the method of moments to develop several asymptotically normal estimators of τ² and σ², when the covariates are assumed to be Gaussian. Schwartzman et al. (2019) proposed the GWASH estimator for estimating heritability, which is essentially a modification of one of Dicker's estimators where the columns of X are standardized. Unlike Dicker's estimators, the GWASH estimator can also be computed from typical summary statistics, without accessing the original data. Janson et al. (2017) proposed the EigenPrism procedure to estimate τ² and σ². Their method, which is based on singular value decomposition and convex optimization techniques, provides estimates and confidence intervals for normal covariates.

In this paper we introduce a naive estimator of τ² and show that it is asymptotically equivalent to one of Dicker's estimators. The naive estimator is also a U-statistic and asymptotically normal. U-statistics can typically be used to obtain uniformly minimum variance unbiased estimators (UMVUE). However, when moment restrictions exist, U-statistics are no longer UMVUE, as shown by Hoeffding (1977). Since the semi-supervised setting practically means that the distribution of X is known (and hence, moments of X are known), the naive estimator is not a UMVUE and it potentially can be improved. In this work, we demonstrate how its variance can be further reduced by using zero-estimators that incorporate the additional information that exists in the unlabelled data.

The contribution of this paper is threefold. First, we propose a novel approach for improving initial estimators of the signal level τ² in the semi-supervised setting without assuming sparsity. The key idea of this approach is to use zero-estimators that are correlated with the initial estimator of τ² in order to reduce variance without introducing extra bias. Second, we define a new notion of optimality with respect to a linear family of zero-estimators. This allows us to suggest a necessary and sufficient condition for identifying optimal oracle estimators. We use the term oracle to point out that the specific coefficients that compose the optimal linear combination of zero-estimators depend on the unknown parameters.
Third, we suggest an algorithm that approximates these optimal oracle coefficients and successfully improves initial estimators of τ², given that a subset of relevant covariates is identified. We discuss in detail the improvement of the naive estimator and also apply our approach to other estimators. Thus, in fact, we provide an algorithm that has the potential to improve any given estimator of τ².

The rest of this work is organized as follows. In Section 2 we describe our setting and introduce the naive estimator. In Section 3 we introduce the zero-estimator approach and suggest a new notion of optimality with respect to linear families of zero-estimators. An optimal oracle estimator of τ² is also presented. In Section 4 we apply the zero-estimator approach to improve the naive estimator. We then study the theoretical properties of the improved estimator and some practical aspects when the sample size is small. Simulation results are given in Section 5. Section 6 demonstrates how the zero-estimator approach can be generalized to other estimators. A discussion is given in Section 7, while the proofs are provided in the Appendix.

We begin with describing our setting and assumptions. Let (X_1, Y_1), ..., (X_n, Y_n) be i.i.d. observations drawn from some unknown distribution, where X_i ∈ R^p and Y_i ∈ R. We consider the semi-supervised setting, where we have access to additional N − n i.i.d. observations of covariates X_{n+1}, ..., X_N, but the responses Y_{n+1}, ..., Y_N are not given. We assume in this work that N ≫ n and, therefore, the distribution of the covariates is treated as known. We consider the linear model Y_i = β^T X_i + ε_i, i = 1, ..., n, and assume:

(A1) Gaussian covariates: X_i ~ N(µ, Σ) i.i.d., i = 1, ..., N, where µ and Σ are known.
(A2) Invertibility: Σ is invertible.
(A3) Linearity and homoscedasticity: E(ε_i | X_i) = 0 and E(ε_i² | X_i) = σ².

We also assume that the intercept term is zero, which can be achieved in practice by centering the Y's. Let (X, Y) denote a generic observation and let σ²_Y denote the variance of Y. Notice that it can be decomposed into signal and noise components,

σ²_Y = Var(X^T β + ε) = β^T Var(X) β + Var(ε) = β^T Σ β + σ²,   (1)

where Var(ε) = E(ε²) = σ² and Cov(X) = Σ. The signal component τ² ≡ β^T Σ β can be thought of as the total variance explained by the best linear function of the covariates, while the noise component σ² can be thought of as the variance left unexplained. Since µ and Σ are treated as known, we can apply the linear transformation X → Σ^{−1/2}(X − µ) and assume w.l.o.g. that µ = 0 and Σ = I. It follows by (1) that σ²_Y = ‖β‖² + σ², which implies that in order to evaluate σ², it is enough to estimate both σ²_Y and ‖β‖². The former can be easily evaluated from the sample, and the main challenge is to derive an estimator for ‖β‖² in the high-dimensional setting.

In order to find an unbiased estimator for ‖β‖² = Σ_{j=1}^p β_j², we first consider the estimation of β_j² for each j. A straightforward approach is given as follows. Let W_ij ≡ X_ij Y_i for i = 1, ..., n and j = 1, ..., p. Notice that E(W_ij) = E(X_ij Y_i) = E[X_ij (β^T X_i + ε_i)] = β_j, where in the last equality we used Σ = I and assumption (A3). Now, since β_j² = {E(W_ij)}² = E(W_ij²) − Var(W_ij), a natural unbiased estimator for β_j² is

β̂_j² ≡ (1/n) Σ_{i=1}^n W_ij² − (1/(n−1)) Σ_{i=1}^n (W_ij − W̄_j)² = (1/(n(n−1))) Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j},   (2)

where W̄_j = (1/n) Σ_{i=1}^n W_ij.
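For concreteness, the following minimal Python sketch (an illustration added here, not part of the formal development) simulates data with µ = 0 and Σ = I and computes W_ij and the estimates β̂_j² of (2); the sample sizes, coefficient values, and variable names are arbitrary choices.

```python
# Minimal sketch of W_ij = X_ij * Y_i and the unbiased estimator of beta_j^2 in (2).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 300, 1.0
beta = np.zeros(p)
beta[:5] = 0.3                      # a few non-negligible coefficients (illustrative)
X = rng.standard_normal((n, p))     # covariates already standardized: mu = 0, Sigma = I
Y = X @ beta + sigma * rng.standard_normal(n)

W = X * Y[:, None]                  # W_ij = X_ij * Y_i, shape (n, p)
# beta_j^2 estimate: mean of W_ij^2 minus the unbiased sample variance of W_ij,
# which equals the pairwise U-statistic form in (2).
beta_sq_hat = (W ** 2).mean(axis=0) - W.var(axis=0, ddof=1)
print(beta_sq_hat[:5], beta_sq_hat[5:].mean())
```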
Thus, unbiased estimators of τ² ≡ ‖β‖² and σ² are given by

τ̂² = Σ_{j=1}^p β̂_j² = (1/(n(n−1))) Σ_{j=1}^p Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j},   σ̂² = σ̂²_Y − τ̂²,   (3)

where σ̂²_Y = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)². We use the term naive to describe τ̂² since its construction is relatively simple and straightforward. Note that the derivation of the naive estimator did not require assumption (A1). The naive estimator was also discussed by Kong and Valiant (2018). A similar estimator was proposed by Dicker (2014). Specifically, let

τ̂²_Dicker = (‖X^T Y‖² − p‖Y‖²) / (n(n+1)),

where X is the n × p design matrix and Y = (Y_1, ..., Y_n)^T. The following lemma shows that τ̂² and τ̂²_Dicker are asymptotically equivalent.
Lemma 1.
Assume (A1)-(A3) and that ε_1, ..., ε_n ~ N(0, σ²). When τ² + σ² is bounded and p/n converges to a constant, then √n (τ̂² − τ̂²_Dicker) converges to zero in probability.

Here and throughout, asymptotic statements are taken as n and p go to ∞ together. Using Corollary 1 from Dicker (2014), which computes the asymptotic variance of τ̂²_Dicker, and the above lemma, we obtain the following corollary.
Corollary 1.
Under the assumptions of Lemma 1,

√n (τ̂² − τ²) / ψ → N(0, 1) in distribution,   where ψ² = 2{(1 + p/n)(σ² + τ²)² − σ⁴ + 3τ⁴}.

In this section we introduce the zero-estimator approach and study how it can be used to improve the naive estimator. In Section 3.1 we present the zero-estimator approach, and an illustration of this approach is given in Section 3.2. In Section 3.3 we suggest a new notion of optimality with respect to linear families of zero-estimators. We then find an optimal oracle estimator of τ² and calculate its improvement over the naive estimator.

We are now ready to suggest our novel approach, which incorporates the additional data that exist in the semi-supervised setting in order to improve the estimation of τ². We first describe the approach in general terms. Consider a random variable V ~ P, where P belongs to a family of distributions 𝒫. Let g(V) be a zero-estimator, i.e., E_P[g(V)] = 0 for all P ∈ 𝒫. Let T(V) be an unbiased estimator of a certain quantity of interest θ. Then the statistic U_c(V), defined by U_c(V) = T(V) + c g(V) for a fixed constant c, is also an unbiased estimator of θ. The variance of U_c(V) is

Var[U_c(V)] = Var[T(V)] + c² Var[g(V)] + 2c · Cov[T(V), g(V)].   (4)

Minimizing Var[U_c(V)] with respect to c yields the minimizer

c* = − Cov[T(V), g(V)] / Var[g(V)].   (5)

Notice that Cov[T(V), g(V)] ≠ 0 implies Var[U_{c*}(V)] < Var[T(V)]. In other words, by combining a correlated unbiased estimator of zero with the initial unbiased estimator of θ, one can lower the variance. Note that plugging c* into (4) reveals how much variance can potentially be reduced:

Var[U_{c*}(V)] = Var[T(V)] − {Cov[T(V), g(V)]}² / Var[g(V)] = (1 − ρ²) Var[T(V)],   (6)

where ρ is the correlation coefficient between T(V) and g(V). Therefore, it is best to find an unbiased zero-estimator g(V) which is highly correlated with T(V), the initial unbiased estimator of θ. It is important to notice that c* is an unknown quantity and, therefore, U_{c*} is not a statistic. However, in practice, one can estimate c* by some ĉ* and use the approximation U_{ĉ*} instead.

The following example illustrates how the zero-estimator approach can be applied to improve the naive estimator τ̂² in the simple linear model setting.

Example 1 (p = 1). Consider the simple linear model Y = βX + ε, where X ~ N(0, 1), ε has mean zero and finite variance σ², and ε is independent of X. By (6), we wish to find a zero-estimator g(X) which is correlated with τ̂². Consider the estimator U_c = τ̂² + c g(X), where g(X) ≡ (1/n) Σ_{i=1}^n (X_i² − 1) and c is a fixed constant. The variance of U_c is minimized by c* = −2β², and one can verify that Var(U_{c*}) = Var(τ̂²) − (8/n)β⁴.

The above example illustrates the potential of using additional information that exists in the semi-supervised setting to lower the variance of the initial naive estimator τ̂². However, it also raises the question: can we achieve a lower variance by adding different zero-estimators?
One might attempt to reduce the variance by adding further zero-estimators such as g_k(X) ≡ (1/n) Σ_{i=1}^n [X_i^k − E(X_i^k)] for k > 2. Surprisingly, this attempt will fail. Hence, the unbiased oracle estimator of τ², R ≡ τ̂² − 2β² g(X), is optimal with respect to zero-estimators of the form g_k(X). This unanticipated result motivated us to extend the idea of the optimal zero-estimator to a general regression setting with p covariates.

We now define a new oracle unbiased estimator of τ² and prove that under some regularity assumptions this estimator is optimal with respect to a family of zero-estimators. Here, optimality means that the variance cannot be further reduced by including additional zero-estimators from that given family. We now specifically define our notion of optimality in a general setting.

Definition 1.
Let T be an unbiased estimator of θ and let g_1, g_2, ... be a sequence of zero-estimators, i.e., E_θ(g_i) = 0 for i ∈ N. Let G = {Σ_{k=1}^m c_k g_k : c_k ∈ R, m ∈ N} be a family of zero-estimators. For a zero-estimator g* ∈ G, we say that R* ≡ T + g* is an optimal oracle estimator (OOE) of θ with respect to G if

Var_θ[R*] ≡ Var_θ[T + g*] ≤ Var_θ[T + g]   for all g ∈ G and for all θ.
We use the term oracle since g* ≡ Σ_{k=1}^m c*_k g_k for some optimal coefficients c*_1, ..., c*_m, which are a function of the unknown parameter θ. The following theorem suggests a necessary and sufficient condition for obtaining an OOE.

Theorem 1.
Let g_m = (g_1, ..., g_m)^T be a vector of zero-estimators and assume the covariance matrix M ≡ Var[g_m] is positive definite for every m. Then R* is an optimal oracle estimator (OOE) with respect to the family of zero-estimators G if and only if R* is uncorrelated with every zero-estimator g ∈ G, i.e., Cov_θ[R*, g] = 0 for all g ∈ G and for all θ.

Note that Theorem 1 is general and does not require assumptions (A1) to (A3). Returning to our setting, define the following oracle estimator

T_oracle = τ̂² − 2 Σ_{j=1}^p Σ_{j'=1}^p ψ_{jj'},   (7)

where ψ_{jj'} = β_j β_{j'} h_{jj'} and h_{jj'} = (1/n) Σ_{i=1}^n [X_ij X_ij' − E(X_ij X_ij')], and let G be the family of zero-estimators of the form

g_{k_1...k_p} = (1/n) Σ_{i=1}^n [X_{i1}^{k_1} · ... · X_{ip}^{k_p} − E(X_{i1}^{k_1} · ... · X_{ip}^{k_p})],

where (k_1, ..., k_p) ∈ {0, 1, 2, 3, ...}^p. The following theorem states that the oracle estimator above is an optimal improvement over the naive estimator in a setting with p covariates.

Theorem 2 (General p). Assume (A1)-(A3). Then the oracle estimator T_oracle defined in (7) is an OOE of τ² with respect to G.

Remark 1. For the proof of Theorem 2, assumptions (A1)-(A3) can be relaxed. More specifically, normality of the covariates is not required. Instead, it is enough to assume that µ = 0 and Σ = I, and that X has moments of all orders. Homoscedasticity of ε is also not required.

Proposition 1.
Assume (A1)-(A3). Then

Var(τ̂²) = (4/n) [ ((n−2)/(n−1)) (σ²_Y τ² + τ⁴) + (1/(2(n−1))) (p σ⁴_Y + 4 σ²_Y τ² + 3 τ⁴) ]   (8)

and

Var(T_oracle) = Var(τ̂²) − (8/n) τ⁴.   (9)

When p = n and τ² = σ², the optimal oracle estimator T_oracle reduces the variance of the naive estimator by 40%. Moreover, when p/n converges to zero, the reduction approaches 2/3. See Remark 3 in the Appendix for more details about the relative improvement of the optimal oracle estimator.
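The following small Monte Carlo sketch (an illustration we add here, not taken from the paper) checks the p = 1 case numerically: it compares the variance of the naive estimator τ̂² with that of the oracle-corrected estimator τ̂² − 2β² g(X) of Example 1, whose variance should be smaller by roughly 8β⁴/n by (9); the number of replications and the parameter values are arbitrary.

```python
# Monte Carlo sketch of the p = 1 oracle improvement of Example 1.
import numpy as np

rng = np.random.default_rng(1)
n, beta, sigma, reps = 100, 1.0, 1.0, 20000
naive, oracle = [], []
for _ in range(reps):
    x = rng.standard_normal(n)
    y = beta * x + sigma * rng.standard_normal(n)
    w = x * y
    tau2_hat = (w.sum() ** 2 - (w ** 2).sum()) / (n * (n - 1))  # eq. (3) with p = 1
    g = (x ** 2 - 1.0).mean()                                   # zero-estimator g(X)
    naive.append(tau2_hat)
    oracle.append(tau2_hat - 2 * beta ** 2 * g)                 # c* = -2 beta^2
# empirical variances versus the theoretical reduction 8 * beta^4 / n
print(np.var(naive), np.var(oracle), np.var(naive) - 8 * beta ** 4 / n)
```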
The Proposed Estimator

In this section we show how to use the zero-estimator approach to derive improved estimators over τ̂². In Section 4.1 we show that estimating all p² optimal coefficients given in (7) may introduce too much variance. Therefore, Section 4.2 introduces covariate selection methods to reduce the number of zero-estimators used in estimation. In Section 4.3 we suggest our estimator for τ² and present its properties. Section 4.4 discusses some practical considerations that need to be taken into account when the sample size is small.

The optimal oracle estimator defined in (7) is based on adding p² zero-estimators. Therefore, it is reasonable to suggest and study the following estimator instead of the oracle one:

T = τ̂² − 2 Σ_{j=1}^p Σ_{j'=1}^p ψ̂_{jj'},   where   ψ̂_{jj'} = (1/(n(n−1)(n−2))) Σ_{i1 ≠ i2 ≠ i3} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} − E(X_{i3 j} X_{i3 j'})]

is an estimator of ψ_{jj'} ≡ β_j β_{j'} h_{jj'}, and the sum runs over distinct indices i1, i2, i3. Notice that E(ψ̂_{jj'}) = 0 and that for i1 ≠ i2 we have E(W_{i1 j} W_{i2 j'}) = β_j β_{j'}; thus, T is an unbiased estimator. The variance of the estimator T is given in the following proposition.

Proposition 2.
Assume (A1)-(A3) and p/n = O(1). Then

Var(T) = Var(T_oracle) + 8 p² σ⁴_Y / n³ + O(n⁻²) = Var(τ̂²) − (8/n) τ⁴ + 8 p² σ⁴_Y / n³ + O(n⁻²),   (10)

where σ²_Y = τ² + σ², as before.

Note that (10) follows from (9). For p = n, the last equation can be rewritten as

Var(T) = Var(τ̂²) + (8/n)(2τ²σ² + σ⁴) + O(n⁻²),

and a similar result holds for p/n → c for some positive constant c. Therefore, the estimator T fails to improve the naive estimator τ̂². The third term in (10) reflects the additional variability that comes with the attempt to estimate all p² optimal coefficients. To overcome the problem that arises from estimating all p² parameters, we consider estimating only a small number of coefficients of the optimal estimator T_oracle. Specifically, let B ⊂ {1, ..., p} be a fixed set of some indices such that |B| ≪ p and consider the estimator T_B = τ̂² − 2 Σ_{j, j' ∈ B} ψ̂_{jj'}. By the same argument as in Proposition 2 we have

Var(T_B) = Var(τ̂²) − (8/n) τ⁴_B + O(n⁻²),   (11)

where τ²_B = Σ_{j ∈ B} β_j². Thus, if τ²_B is sufficiently large, then one can expect a significant improvement over the naive estimator by using a small number of zero-estimators. For example, when p = n, τ² = σ² = 1, and τ²_B = 0.5, then T_B reduces the variance of the naive estimator by 10%. See Remark 4 in the Appendix for more details. In practice the estimator T_B is an oracle estimator since the set of indices B is unknown. A desirable set B should contain a relatively small number of covariates that capture a significant amount of the signal level τ²_B. Choosing a "good" set of indices is discussed in the following section.
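As an illustration of how T_B can be computed for a given index set B, the sketch below implements the third-order U-statistic ψ̂_{jj'} and the corrected estimator T_B in Python. The function names are ours, the set B is assumed to be supplied, and E(X_j X_j') = 1{j = j'} is used since µ = 0 and Σ = I are treated as known; the sum over distinct triples is evaluated by inclusion-exclusion to avoid an O(n³) loop.

```python
# Sketch of the zero-estimator terms psi_hat_{jj'} and of T_B.
import numpy as np
from itertools import product

def psi_hat(W, X, j, jp):
    """Third-order U-statistic with mean zero, estimating beta_j * beta_j' * h_{jj'}."""
    n = W.shape[0]
    a, b = W[:, j], W[:, jp]
    c = X[:, j] * X[:, jp] - (1.0 if j == jp else 0.0)   # E(X_j X_j') known: 1{j = j'}
    # sum of a_{i1} b_{i2} c_{i3} over distinct indices, via inclusion-exclusion
    s = (a.sum() * b.sum() * c.sum()
         - (a * b).sum() * c.sum() - (a * c).sum() * b.sum() - (b * c).sum() * a.sum()
         + 2 * (a * b * c).sum())
    return s / (n * (n - 1) * (n - 2))

def T_B(X, Y, B):
    n = len(Y)
    W = X * Y[:, None]
    tau2_hat = (W.sum(axis=0) ** 2 - (W ** 2).sum(axis=0)).sum() / (n * (n - 1))
    correction = sum(psi_hat(W, X, j, jp) for j, jp in product(B, repeat=2))
    return tau2_hat - 2 * correction
```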
We now discuss covariate selection methods. We assume there is a set of "important" covariates B ≡ {j : β_j² > b}, where b is a positive constant, and assume that |B| = p_0, where p_0 is a fixed constant. In order to construct our improved estimator, we wish to identify the covariates that belong to the set B.

Definition 2.
We call δ a covariate selection algorithm if for every dataset (X_{n×p}, Y_{n×1}) it chooses a subset of indices from {1, ..., p}. We say that δ is a consistent selection algorithm if it satisfies P(B_δ = B) → 1 as n → ∞, where B_δ is the set of indices selected by δ.

Notice that the set B may depend on p, and the selection algorithm δ may depend on both n and p, but this is suppressed in the notation. The subject of high-dimensional covariate selection has already been extensively studied; see, e.g., Zambom and Kim (2018) and Oda et al. (2020) and references therein. Different covariate selection methods have different pros and cons, and there is no simple answer to the question of which method to use. Since we do not wish to limit ourselves to a specific method, we assume that we have in hand a consistent selection algorithm γ for the rest of the discussion.

Proposed Estimator and Properties

Let γ be a consistent selection algorithm. The proposed estimator for τ² is given in Algorithm 1.

Algorithm 1:
Proposed Estimator
Input:
A dataset (X_{n×p}, Y_{n×1}) and a consistent selection algorithm γ.

1. Calculate the naive estimator τ̂² = (1/(n(n−1))) Σ_{j=1}^p Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j}, where W_ij = X_ij Y_i.
2. Apply algorithm γ with ( X , Y ) to construct B γ .
3. Calculate the zero-estimator terms

ψ̂_{jj'} ≡ (1/(n(n−1)(n−2))) Σ_{i1 ≠ i2 ≠ i3} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} − E(X_{i3 j} X_{i3 j'})],

for all j, j' ∈ B_γ.

Result:
Return T_γ = τ̂² − 2 Σ_{j, j' ∈ B_γ} ψ̂_{jj'}.
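A self-contained sketch of Algorithm 1 is given below. It is only illustrative: the largest-gap selection rule of Algorithm 3 in Section 5 is used as a stand-in for the generic consistent selection algorithm γ, and all function names are ours.

```python
# Sketch of Algorithm 1 under mu = 0 and Sigma = I (known).
import numpy as np

def naive_tau2(W):
    n = W.shape[0]
    return (W.sum(axis=0) ** 2 - (W ** 2).sum(axis=0)).sum() / (n * (n - 1))

def psi_hat(W, X, j, jp):
    n = W.shape[0]
    a, b = W[:, j], W[:, jp]
    c = X[:, j] * X[:, jp] - (1.0 if j == jp else 0.0)
    s = (a.sum() * b.sum() * c.sum()
         - (a * b).sum() * c.sum() - (a * c).sum() * b.sum() - (b * c).sum() * a.sum()
         + 2 * (a * b * c).sum())
    return s / (n * (n - 1) * (n - 2))

def largest_gap_selection(W):
    """Stand-in for gamma: keep coefficients above the largest gap in the ordered beta_j^2 estimates."""
    beta_sq = (W ** 2).mean(axis=0) - W.var(axis=0, ddof=1)
    order = np.argsort(beta_sq)
    cut = np.argmax(np.diff(beta_sq[order]))   # position of the largest gap
    return order[cut + 1:]

def algorithm1(X, Y):
    W = X * Y[:, None]                         # Step 1: naive-estimator ingredients
    tau2 = naive_tau2(W)
    B = largest_gap_selection(W)               # Step 2: selected covariates B_gamma
    corr = sum(psi_hat(W, X, j, jp) for j in B for jp in B)   # Step 3
    return tau2 - 2 * corr                     # Result: T_gamma
```

In practice any consistent selection algorithm can be plugged in for the selection step; the zero-estimator correction itself does not depend on how B_γ was obtained.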
Some asymptotic properties of T_γ are given by the following proposition.

Proposition 3. Assume that lim_{n→∞} n [P(B_γ ≠ B)]^{1/2} = 0, and that the first four moments of T_γ and T_B are bounded. Then

1. lim_{n→∞} n [Bias(T_γ, τ²)] = 0, and
2. lim_{n→∞} n [Var(T_γ) − Var(T_B)] = 0,

where Bias(T_γ, τ²) ≡ E(T_γ) − τ² and Var(T_B) is given by (11).

Notice that the requirement lim_{n→∞} n [P(B_γ ≠ B)]^{1/2} = 0 is stronger than just consistency. The following claim is a direct corollary of Proposition 3.

Corollary 2.
Under the assumptions of Proposition 3, lim_{n,p→∞} n [MSE(T_γ) − MSE(T_B)] = 0.

We now suggest estimators for the variances of the naive and the proposed estimators. Let

Var̂(τ̂²) = (4/n) [ ((n−2)/(n−1)) (σ̂²_Y τ̂² + τ̂⁴) + (1/(2(n−1))) (p σ̂⁴_Y + 4 σ̂²_Y τ̂² + 3 τ̂⁴) ]

and let

Var̂(T_γ) = Var̂(τ̂²) − (8/n) (τ̂²_{B_γ})²,   (12)

where σ̂²_Y = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)² and τ̂²_{B_γ} = Σ_{j ∈ B_γ} β̂_j². The following proposition shows that Var̂(τ̂²) and Var̂(T_γ) are consistent.

Proposition 4. Assume (A1)-(A3), p/n = O(1), and the regularity conditions of Proposition 3. Then

1. n [Var̂(τ̂²) − Var(τ̂²)] → 0 in probability, and
2. n [Var̂(T_γ) − Var(T_γ)] → 0 in probability,

where Var(τ̂²) is given by (8).

Remark 2. Some caution regarding the estimator T_γ is needed in practice. When n is insufficiently large, B_γ might differ from B and Proposition 4.2 no longer holds. Specifically, let S ∩ B_γ and B ∩ S_γ be the sets of false positive and false negative errors, respectively, where S = {1, ..., p} \ B and S_γ = {1, ..., p} \ B_γ. While false negatives merely result in not including some potential zero-estimator terms in our proposed estimator, false positives can lead to a substantial bias. This is true since the expected value of a post-selected zero-estimator is not necessarily zero anymore.

A common approach to overcome this problem is to randomly split the data into two parts, where the first part is used for covariate selection and the second part is used for evaluation of the zero-estimator terms. With this in mind, we suggest in Algorithm 2 a modified version of the proposed estimator.
Algorithm 2:
Modified Estimator
Input:
A dataset (X_{n×p}, Y_{n×1}) and a consistent selection algorithm γ.

1. Calculate the naive estimator τ̂² = (1/(n(n−1))) Σ_{j=1}^p Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j}, where W_ij = X_ij Y_i.

Refitting Step:
For 1 ≤ m ≤ M:

• Randomly split the sample into two equal subsets, (X, Y)^m_1 and (X, Y)^m_2.
• Apply algorithm γ to (X, Y)^m_1 to construct B^m_γ.
• Use (X, Y)^m_2 to calculate the zero-estimator terms

ψ̂^m_{jj'} ≡ (1/(n_2(n_2−1)(n_2−2))) Σ_{i1 ≠ i2 ≠ i3} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} − E(X_{i3 j} X_{i3 j'})],

where the distinct indices i1, i2, i3 run over the rows of (X, Y)^m_2, n_2 = n/2, and j, j' ∈ B^m_γ.
• Calculate U_m = τ̂² − 2 Σ_{j, j' ∈ B^m_γ} ψ̂^m_{jj'}.

Result:
Return U* = (1/M) Σ_{m=1}^M U_m.

Algorithm 2 uses half of the data for covariate selection and the second half for estimation of the zero-estimator terms. The split step is repeated multiple times and then averaged before returning the final output. While splitting helps to avoid post-selection biases, averaging over multiple splits reduces the noise that comes with estimating the zero-estimator terms.

We now provide a simulation study to illustrate our estimator's performance. Here, we choose the following simple selection algorithm γ:

Algorithm 3:
Covariate selection γ

Input:
A dataset (X_{n×p}, Y_{n×1}).

1. Calculate β̂_1², ..., β̂_p², where β̂_j² is given in (2), for j = 1, ..., p.
2. Calculate the differences λ_j = β̂²_(j) − β̂²_(j−1) for j = 2, ..., p, where β̂²_(1) < β̂²_(2) < ... < β̂²_(p) denote the order statistics.

3. Select the covariates B_γ = {j : β̂_j² > β̂²_(j*)}, where j* = arg max_j λ_j.

Result:
Return B_γ.

The algorithm above finds the largest gap between the ordered estimated squared coefficients and then uses this gap as a threshold to select the set of large coefficients B_γ. The algorithm works well in scenarios where a relatively large gap truly separates the set of large coefficients B from the set of small coefficients S.

We fix β_j² = τ²_B/5 for j = 1, ..., 5, and β_j² = (τ² − τ²_B)/(p − 5) for j = 6, ..., p, where τ² and τ²_B vary among different scenarios. The number of observations and covariates is n = p = 100, 300, and 500, and the residual variance is σ² = 1. For each scenario, we generated 1000 independent datasets and estimated τ² using the different estimators. We report the results of five estimators: naive, oracle, proposed, modified and PSI. The first four are given by (3), (7), Algorithm 1 and Algorithm 2, respectively; the PSI (Post Selective Inference) estimator was calculated using the estimateSigma function from the selectiveInference R package. The modified estimator was calculated based on M = 10 splits. Boxplots of the estimates are plotted in Figure 1 and results for the RMSE are given in Table 1. Code for reproducing the results is available at https://git.io/Jt6bC.

Figure 1 demonstrates that:

• As n and p increase, the bias of the proposed estimator decreases, which in turn causes it to outperform the modified and the naive estimators for large n and p. This is compatible with the fact that the selection algorithm γ tends to be more successful in identifying B and S as n gets larger.
• The modified estimator demonstrates an improvement over the naive estimator in all the scenarios.
• As τ² increases, the improvements of all estimators over the naive are more substantial. For example, for n = p = 500, the relative improvement in RMSE of the oracle estimator over the naive increases as τ² changes from τ² = 1 to τ² = 2. More generally, from other simulation results that are not shown here, we found that the larger the signal-to-noise ratio τ²/σ², the larger the improvement.
• The PSI estimator is biased. This results in a larger RMSE compared to the other estimators for large n and p. This is not surprising since the PSI estimator is based on the LASSO method, which assumes a sparse model and ignores small coefficients.
Figure 1: Boxplots representing the distribution of the different estimators of τ². The x-axis stands for the sample size. The red dashed line is the true value of τ².

Table 1: Summary statistics for the different estimators of τ²; σ̂(RMSE) was calculated using the delta method. The signal that is attributed to the large coefficients, calculated by τ²_B/τ², was set to 0.5. The estimator with the lowest RMSE (excluding the oracle) is in bold. Columns: τ², n, p, Estimator, Mean, Bias, SE, RMSE, σ̂(RMSE). [Table 1 entries not preserved in this extraction.]
In the next scenario the signal that is attributed to the large coefficients, calculated by τ²_B/τ², was fixed at 0.99, which corresponds to a sparse setting. The simulation results are presented in Figure 2 and in Table 2. It is demonstrated that for large n and p the proposed estimator performs almost as well as the oracle estimator, which confirms the theoretical result given in Proposition 3. Notice that in this setting the PSI estimator outperforms all other estimators. This is not surprising since the PSI estimator is based on the LASSO method, which is known to work well when the true model that generates the data is sparse, as in this scenario.
Figure 2: Boxplots representing the distribution of the different estimators of τ² (τ² = 1, τ²_B = 0.99). The x-axis stands for the sample size. The red dashed line is the true value of τ².

Table 2: Summary statistics for the sparse scenario (τ²_B/τ² = 0.99), with the same columns as in Table 1. [Table 2 entries not preserved in this extraction.]
The suggested methodology in this paper is not limited to improving only the naive estimator, but can also be generalized to other estimators. The key is to add zero-estimators that are highly correlated with our initial estimator of τ²; see Eq. (6). Unlike the naive estimator, which is represented by a closed-form expression, other common estimators, such as the EigenPrism estimator (Janson et al., 2017), are computed numerically and do not have a closed-form representation. This makes the task of finding optimal zero-estimators somewhat more challenging since the zero-estimators' coefficients also need to be computed numerically. A comprehensive theory that generalizes the zero-estimator approach to estimators other than the naive one is beyond the scope of this work. However, here we present a general algorithm that achieves improvement without claiming optimality. The algorithm below calculates an initial estimator of τ² and applies a consistent selection algorithm to choose the set of relevant covariates. Then, it approximates the optimal oracle coefficients from bootstrap samples and, similarly to Algorithm 1, returns a new estimator that is composed of both the initial estimator of τ² and a zero-estimator.

Algorithm 4:
Empirical Estimator
Input:
A dataset (X, Y), an initial estimator τ̃², and a consistent selection algorithm γ.

1. Calculate an initial estimator τ̃² of τ².

2. Apply algorithm γ to construct B_γ.

3. Bootstrap step:
• Resample with replacement n observations from (X, Y).
• Calculate the initial estimator τ̃² on the resampled data.
• Calculate the zero-estimators h_{jj'} = (1/n) Σ_{i=1}^n [X_ij X_ij' − E(X_ij X_ij')] on the resampled data, for j, j' ∈ B_γ.
This procedure is repeated B times in order to produce (τ̃²)*_1, ..., (τ̃²)*_B and h*_{1,jj'}, ..., h*_{B,jj'} for all j, j' ∈ B_γ.
4. Approximate the coefficients by

ĉ_{jj'} = Cov̂(τ̃², h_{jj'}) / Var(h_{jj'}),

where Cov̂(·) denotes the empirical covariance computed from the bootstrap samples, and Var(h_{jj'}) is known from the semi-supervised setting.

Result:
Return the empirical estimator U_emp = τ̃² − Σ_{j, j' ∈ B_γ} ĉ_{jj'} h_{jj'}.

We now demonstrate the performance of the empirical estimator given by Algorithm 4 together with two initial estimators: the EigenPrism (Janson et al., 2017) and the PSI, which is described in Taylor and Tibshirani (2018) and was used in Section 5. We considered the same sparse setting as in Section 5 and an additional setting where τ² = 5. For computational reasons, Step 3 of Algorithm 4 was only calculated once for all 1000 simulations. Results are given in Tables 3-4 and the code for reproducing the results is available at https://git.io/Jt6bC.

Tables 3-4 demonstrate that the standard error of the empirical estimators is equal to or lower than that of the initial estimators, and that as τ² increases, the improvement over the initial estimators is more substantial. This highlights the fact that the zero-estimator approach is not limited to improving only the naive estimator but rather has the potential to improve other estimators as well.
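For completeness, here is a rough Python sketch of the bootstrap step of Algorithm 4. It is illustrative only: initial_est stands for any initial estimator of τ², the set B is assumed to come from a selection algorithm, the number of bootstrap replications is an arbitrary choice, and Var(h_{jj'}) = (1 + 1{j = j'})/n is used, which holds when the covariates are standard normal.

```python
# Sketch of the bootstrap approximation of the zero-estimator coefficients in Algorithm 4.
import numpy as np

def empirical_estimator(X, Y, B, initial_est, n_boot=200, seed=None):
    rng = np.random.default_rng(seed)
    n = len(Y)
    pairs = [(j, jp) for j in B for jp in B]
    tau_boot = np.empty(n_boot)
    h_boot = np.empty((n_boot, len(pairs)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample with replacement
        Xb, Yb = X[idx], Y[idx]
        tau_boot[b] = initial_est(Xb, Yb)
        h_boot[b] = [(Xb[:, j] * Xb[:, jp]).mean() - (1.0 if j == jp else 0.0)
                     for j, jp in pairs]
    # zero-estimators on the original covariates
    h = np.array([(X[:, j] * X[:, jp]).mean() - (1.0 if j == jp else 0.0)
                  for j, jp in pairs])
    correction = 0.0
    for k, (j, jp) in enumerate(pairs):
        cov = np.cov(tau_boot, h_boot[:, k])[0, 1]   # empirical Cov(tau_tilde, h_{jj'})
        var_h = (2.0 if j == jp else 1.0) / n        # known Var(h_{jj'}) under N(0, I)
        correction += (cov / var_h) * h[k]
    return initial_est(X, Y) - correction
```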
Table 3: Summary statistics equivalent to Table 1 for the PSI estimator. [Table 3 entries not preserved in this extraction.]

Table 4: Summary statistics equivalent to Table 1 for the EigenPrism estimator. [Table 4 entries not preserved in this extraction.]

Discussion
This paper presents a new approach for improving estimation of the explained variance τ² of a high-dimensional regression model in a semi-supervised setting without assuming sparsity. The key idea is to use zero-estimators that are correlated with the initial unbiased estimator of τ² in order to lower its variance without introducing additional bias. The semi-supervised setting, where the number of observations is much greater than the number of responses, allows us to construct such zero-estimators. We introduced a new notion of optimality with respect to zero-estimators and presented an oracle estimator that achieves this type of optimality. Based on the oracle estimator's form, we presented an improved (non-oracle) estimator that achieves a significant, but not optimal, reduction in the asymptotic variance of the naive estimator. Our simulations showed that our approach can be generalized to initial estimators other than the naive estimator.

Many open questions remain for future research. While our proposed estimator improves the naive estimator, it does not achieve the optimal improvement of the oracle estimator. Thus, it remains unclear if and how one can achieve optimal improvement. Moreover, in this work strong distributional assumptions about the data are made. For example, we assume that the covariates are normally distributed. This assumption can be relaxed since we used normality only to calculate the first few moments of X. Thus, generalizing the suggested approach by relaxing the normality assumption is a natural direction for future work. A more ambitious future goal would be to extend the suggested approach to generalized linear models (GLM), and specifically to logistic regression. In this case, the concepts of signal and noise levels are less clear and more challenging to define.

References
Cai, T. and Z. Guo (2020). Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika 101(2), 269–284.

Eichler, E. E., J. Flint, G. Gibson, A. Kong, S. M. Leal, J. H. Moore, and J. H. Nadeau (2010). Missing heritability and strategies for finding the underlying causes of complex disease. Nature Reviews Genetics 11(6), 446–450.

Fan, J., S. Guo, and N. Hao (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74(1), 37–65.

Gorfine, M., S. I. Berndt, J. Chang-Claude, M. Hoffmeister, L. Le Marchand, J. Potter, M. L. Slattery, N. Keret, U. Peters, and L. Hsu (2017). Heritability estimation using a regularized regression approach (HERRA): Applicable to continuous, dichotomous or age-at-onset outcome. PloS One 12(8), e0181269.

Hoeffding, W. (1977). Some incomplete and boundedly complete families of distributions. The Annals of Statistics, 278–291.

Janson, L., R. F. Barber, and E. Candes (2017). EigenPrism: inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(4), 1037–1065.

Kong, W. and G. Valiant (2018). Estimating learnability in the sublinear data regime. Advances in Neural Information Processing Systems 31, 5455–5464.

Oda, R., H. Yanagihara, et al. (2020). A fast and consistent variable selection method for high-dimensional multivariate linear regression with a large number of explanatory variables. Electronic Journal of Statistics 14(1), 1386–1412.

Schwartzman, A., A. J. Schork, R. Zablocki, W. K. Thompson, et al. (2019). A simple, consistent estimator of SNP heritability from genome-wide association studies. The Annals of Applied Statistics 13(4), 2509–2538.

Sun, T. and C.-H. Zhang (2012). Scaled sparse linear regression. Biometrika 99(4), 879–898.

Taylor, J. and R. Tibshirani (2018). Post-selection inference for ℓ1-penalized likelihood models. Canadian Journal of Statistics 46(1), 41–61.

van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.

Zambom, A. Z. and J. Kim (2018). Consistent significance controlled variable selection in high-dimensional regression. Stat 7(1), e210.

Appendix
Proof of Lemma 1 :Notice that X T Y = Å n P i =1 W i , ..., n P i =1 W ip ã T where X is the n × p design matrix and Y = ( Y , ..., Y n ) T . Let U be thecentered Naive estimator, i.e., U ≡ ˆ τ − τ = 1 n ( n − X i = i p X j =1 W i j W i j − τ = (cid:13)(cid:13) X T Y (cid:13)(cid:13) − p P j =1 n P i =1 W ij n ( n − − τ . The Dicker estimate for τ ia given by ˆ τ Dicker ≡ k X T Y k − p k Y k n ( n +1) . Let U Dicker ≡ ˆ τ Dicker − τ be the centeredDicker estimator (Dicker, 2014). We need to prove that root- n times the difference between the estimators converges inprobability to zero, i.e., √ n ( U Dicker − U ) p → . We have, √ n ( U Dicker − U ) = √ n Ü (cid:13)(cid:13) X T Y (cid:13)(cid:13) − p k Y k n ( n + 1) − (cid:13)(cid:13) X T Y (cid:13)(cid:13) − p P j =1 n P i =1 W ij n ( n − ê = √ n Ü p P j =1 n P i =1 W ij n ( n − − p k Y k n ( n + 1) − (cid:13)(cid:13) X T Y (cid:13)(cid:13) n ( n −
1) ( n + 1) ê . (13)It is enough to prove that:1. n − . Ç p P j =1 n P i =1 W ij − p k Y k å p → , n − . Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä p → . We start with the first term, n − . p X j =1 n X i =1 W ij − p k Y k ! = n − . p X j =1 n X i =1 Y i X ij − p n X i =1 Y i ! = n − . n X i =1 Y i p X j =1 X ij − p n X i =1 Y i ! = n − . n X i =1 Y i " n p X j =1 Ä X ij − ä ≡ n − . n X i =1 ω i (14)where ω i = Y i ñ n P j (cid:8) X ij − (cid:9) ô . Notice that ω i depends on n but this is suppressed in the notation. In order to showthat n − . n P i =1 ω i p → , it is enough to show that E Å n − . n P i =1 ω i ã → and Var Å n − . n P i =1 ω i ã → . Moreover, since E Å n − . n P i =1 ω i ã = √ nE ( ω i ) and Var Å n − . n P i =1 ω i ã = Var ( ω i ) = E (cid:0) ω i (cid:1) − [ E ( ω i )] , it is enough to show that √ nE ( ω i ) and E (cid:0) ω i (cid:1) converge to zero.Consider now √ nE ( ω i ) . By (14) we have n X i =1 ω i = 1 n " p X j =1 n X i =1 W ij − p k Y k . aking expectation of both sides, n X i =1 E ( ω i ) = 1 n " p X j =1 n X i =1 E Ä W ij ä − pE Ä k Y k ä . Now, notice that E ( W ij ) = E [ X ij ( β T X + ǫ ) ] = k β k + σ + β j [ E ( X ij ) −
1] = τ + σ + 2 β j . (15)Also notice that Y i / (cid:0) σ ε + τ (cid:1) ∼ χ , and hence E Ä k Y k ä = n (cid:0) τ + σ (cid:1) . Therefore, nE ( ω i ) = 1 n " p X j =1 n X i =1 Ä τ + σ + 2 β j ä − pn Ä τ + σ ä = 1 n " n X i =1 î p Ä τ + σ ä + 2 τ ó − pn Ä τ + σ ä = 2 τ which implies that √ nE ( ω i ) = τ √ n → . Consider now E ( ω i ) . By Cauchy-Schwartz, E Ä ω i ä = E Y i " n − p X j =1 ¶ X ij − © ! ≤ ¶ E Ä Y i ä© / ( E " n − p X j =1 ¶ X ij − © !) / . Notice that Y i ∼ N (cid:0) , τ + σ (cid:1) by construction and therefore E ( Y i ) = O (1) as n and p go to infinity. Let V j = X ij − and notice that E ( V j ) = 0 . We have E " n − p X j =1 ¶ X ij − © ! = E " n − p X j =1 V j ! = n − X j ,j ,j ,j E ( V j V j V j V j ) . The expectation P j ,j ,j ,j E ( V j V j V j V j ) is not 0 when j = j and j = j (up to permutations) or when all terms areequal. In the first case we have X j = j ′ E Ä V j V j ′ ä = X j = j ′ î E Ä V j äó = p ( p − h E n Ä X ij − ä oi ≤ C p , for a positive constant C . In the second case we have p P j =1 E (cid:0) V j (cid:1) = pE î (cid:0) X ij − (cid:1) ó ≤ C p, for a positive constant C .Hence, as p and n have the same order of magnitude, we have ( E " n − p X j =1 ¶ X ij − © ! / = ( n − X j ,j ,j ,j E ( V j V j V j V j ) ) / ≤ ¶ n − · O Ä p ä© / ≤ K/n, which implies E ( ω i ) ≤ K /n → , where K and K are positive constants. This completes the proof that n − . Ç p P j =1 n P i =1 W ij − p k Y k å p → .We now move to prove that n − . Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä p → . By Markov’s inequality, for ǫ > P Å n − . (cid:13)(cid:13)(cid:13) X T Y (cid:13)(cid:13)(cid:13) > ε ã ≤ n − . E Å (cid:13)(cid:13)(cid:13) X T Y (cid:13)(cid:13)(cid:13) ã /ε. Thus, it is enough to show that n − E Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä is bounded. Notice that E Å (cid:13)(cid:13)(cid:13) X T Y (cid:13)(cid:13)(cid:13) ã = X i ,i p X j =1 E ( W i j W i j ) = n X i =1 p X j =1 E Ä W ij ä + X i = i p X j =1 E ( W i j W i j )= n X i =1 p X j =1 Ä τ + σ + 2 β j ä + X i = i p X j =1 β j = n î p Ä τ + σ ä + 2 τ ó + n ( n − τ = n î p Ä τ + σ ä + ( n + 1) τ ó , here we used (15) in the third equality. Therefore, n − E Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä = n − (cid:2) p (cid:0) τ + σ (cid:1) + ( n + 1) τ (cid:3) . Since p and n have the same order of magnitude and τ + σ is bounded by assumption, then n − E Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä is also bounded. Thiscompletes the proof of n − . Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä p → and hence √ n (cid:0) ˆ τ Dicker − ˆ τ (cid:1) p → . (cid:4) Proof of Corollary 1 :According to Corollary 1 in Dicker (2014), we have √ nU Dicker ψ D → N (0 , , given that p/n converges to a constant. Write √ nUψ = 1 ψ (cid:2) √ n ( U − U Dicker ) + √ nU Dicker (cid:3) , and obtain √ nUψ D → N (0 , by Slutsky’s theorem. (cid:4) Calculations for Example 1 : Cov [ˆ τ , g ( X )] ≡ Cov n ( n − X i for all non-zero b ∈ R m . Thus, (18) issatisfied only if b ≡ , i.e., Cov [ R ∗ , g m ] = which also implies Cov [ R ∗ , P mk =1 c k g k ] = 0 for any c , ..., c m ∈ R . Therefore,Cov [ R ∗ , g ] = 0 for all g ∈ G .2. We now prove the other direction: if R ∗ is uncorrelated with all zero-estimators of a given family G then it is an OOE.Let R ∗ = T + g ∗ and R ≡ T + g be unbiased estimators of θ , where g ∗ , g ∈ G . Define ˜ g ≡ R ∗ − R = g ∗ − g and noticethat ˜ g ∈ G . Since by assumption R ∗ is uncorrelated with ˜ g , Cov [ R ∗ , ˜ g ] ≡ Cov [ R ∗ , R ∗ − R ] = Var [ R ∗ ] − Cov [ R ∗ , R ] , and hence Var [ R ∗ ] = Cov [ R ∗ , R ] . 
By the Cauchy–Schwarz inequality, ( Cov [ R ∗ , R ]) ≤ Var [ R ∗ ] Var [ R ] , we conclude thatVar [ R ∗ ] ≤ Var [ R ] = Var [ T + g ] for all g ∈ G . (cid:4) Proof of Theorem 2:
We start by proving Theorem 2 for the special case of p = 2 and then generalize for p > . By Theorem 1 we need toshow that Cov ( T oracle , g k k ) = 0 for all ( k , k ) ∈ N where g k k = n n P i =1 î X k i X k i − E Ä X k i X k i äó . Write,Cov ( T oracle , g k k ) = Cov Ñ ˆ τ − X j =1 2 X j ′ =1 ψ jj ′ , g k k é = Cov Ä ˆ τ , g k k ä − X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) . Thus, we need to show that Cov Ä ˆ τ , g k k ä = 2 X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) . (19) e start with calculating the LHS of (19), namely Cov (cid:0) ˆ τ , g k k (cid:1) . Recall that ˆ τ ≡ ˆ β + ˆ β and thereforeCov [ˆ τ , g k k ] = Cov ( ˆ β , g k k ) + Cov ( ˆ β , g k k ) . Now, for all ( k , k ) ∈ N we haveCov [ ˆ β , g k k ] ≡ Cov n ( n − X i
A, B, C and D are functions of ( k , k ) but this is suppressed in the notation. Write, E [ X k +111 X k Y ] = E [ X k +111 X k ( β X + β X + ǫ )]= β E [ X k +211 X k ] + β E [ X k +111 X k +112 ] = β A + β B. Thus, rewrite (20) and obtain Cov [ ˆ β , g k k ] = 2 n Ä [ β A + β B ] β − β C ä . (21)Similarly, by symmetry, Cov [ ˆ β , g k k ] = 2 n Ä [ β D + β B ] β − β C ä . (22)Using (21) and (22) we getCov [ˆ τ , g k k ] = Cov ( ˆ β , g k k ) + Cov ( ˆ β , g k k )= 2 n Ä [ β A + β B ] β − β C + [ β D + β B ] β − β C ä = 2 n L z }| { β A + β D + L z }| { β β B − L z }| { C Ä β + β ä = 2 n ( L + L − L ) . (23)We now move to calculate the RHS of (19), namely P j =1 2 P j ′ =1 Cov ( ψ jj ′ , g k k ) . First, recall that h jj ≡ n n P i =1 [ X ij X ij ′ − E ( X ij X ij ′ )] and g k k ≡ n n P i =1 î X k i X k i − E Ä X k i X k i äó where ( k , k ) ∈ N . Hence, h ≡ n n P i =1 (cid:0) X i − (cid:1) which by definition is lso equal to g . Similarly, we have h = h ≡ n n P i =1 ( X i X i ) = g and h ≡ n n P i =1 (cid:0) X i − (cid:1) = g . Thus, X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) = X j =1 2 X j ′ =1 β j β j ′ Cov ( h jj ′ , g k k )= β Cov ( h , g k k ) + 2 β β Cov ( h , g k k ) + β Cov ( h , g k k )= β Cov ( g , g k k ) + 2 β β Cov ( g , g k k ) + β Cov ( g , g k k ) . (24)Now, observe that for every ( k , k , d , d ) ∈ N , Cov [ g k k , g d d ] = Cov (cid:16) n n X i =1 [ X k i X k i − E ( X k i X k i )] , n n X i =1 [ X d i X d i − E ( X d i X d i )] (cid:17) = 1 n n X i =1 n X i =1 ( E [ X k i X k i X d i X d i ] − E [ X k i X k i ] E [ X d i X d i ])= 1 n (cid:16) E [ X k + d X k + d ] − E [ X k X k ] E [ X d X d ] (cid:17) , (25)where the third equality holds since the terms with i = i vanish. It follows from (25) thatCov [ g k k , g ] = n Ä E [ X k +211 X k ] − E [ X k X k ] ä = n ( A − C ) , Cov [ g k k , g ] = n E [ X k +111 X k +112 ] = Bn , Cov [ g k k , g ] = n Ä E [ X k X k +212 ] − E [ X k X k ] ä = n ( D − C ) Therefore, rewrite (24) to get X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) = 2 n L z }| { β A + β D + L z }| { β β B − L z }| { C Ä β + β ä = 2 n ( L + L − L ) (26)which is exactly the same expression as in (23). Hence, equation (19) follows which completes the proof of Theorm 2 for p = 2 . We now generalize the proof for p > . Similarly to (19) we want to show thatCov Ä ˆ τ , g k ...k p ä = 2 p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) . (27)We begin by calculating the LHS of (27), i.e., the covariance between ˆ τ and g k ...k p . By the same type of calculationsas in (20), for all ( k , ..., k p ) ∈ N p we haveCov î ˆ β j , g k ,...,k p ó =2 n β j E Ñ X k j +21 j Y m = j X k m m é + X j = j ′ β j ′ E Ñ X k j +11 j X k j ′ +11 j Y m = j,j ′ X k m m é β j − β j E p Y m =1 X k m m ! umming the above expressions for j = 1 , . . . , p , yieldsCov î ˆ τ , g k ,....,k p ó = p X j =1 Cov î ˆ β j , g k ,....,k p ó = 2 n p X j =1 β j E Ñ X k j +21 j Y m = j X k m m é + 2 n X j = j ′ β j β j ′ E Ñ X k j +11 j X k j ′ +11 j ′ Y m = j,j ′ X k m m é − n p X j =1 β j E p Y m =1 X k m m ! ≡ n (cid:0) L + L − L (cid:1) , (28)where L , L and L are just a generalization of the notation given in (23). Again, notice that L , L and L are functionsof k , ..., k p but this is suppressed in the notation.We now move to calculate the RHS of (27), namely p P j =1 p P j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) . Since ψ jj ′ = β j β j ′ h jj ′ we have, p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) = p X j =1 p X j ′ =1 β j β j ′ Cov (cid:0) h jj ′ , g k ...k p (cid:1) . 
(29)Again, notice the relationship between h jj ′ and g k ...k p : when j = j ′ we have h jj ≡ n n P i =1 (cid:0) X ij − (cid:1) = g ... ... , (i.e., the j -th entry is 2 and all others are 0), and for j = j ′ we have h jj ′ ≡ n n P i =1 X ij X ij ′ = g ... ... ... , (i.e., the j -th and j ’-thentries are 1 and all other entries are 0). Hence, p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) = p X j =1 p X j ′ =1 β j β j ′ Cov (cid:0) h jj ′ , g k ...k p (cid:1) = p X j =1 β j Cov (cid:0) g ... ... , g k ...k p (cid:1) + X j = j ′ β j β j ′ Cov (cid:0) g ... ... ... , g k ...k p (cid:1) . (30)Now, similar to (25), for all pairs of index vectors ( k , ..., k p ) ∈ N p , and ( k ′ , ..., k ′ p ) ∈ N p Cov Ä g k ,...,k p , g k ′ ,...,k ′ p ä = 1 n ( E p Y j =1 X k j + k ′ j j ! − E p Y j =1 X k j j ! E p Y j =1 X k ′ j j !) (31)This implies that Cov (cid:2) g ... ... , g k ,...,k p (cid:3) = 1 n E Ñ X k j +21 j Y m = j X k m m é − E p Y m =1 X k m m ! and Cov (cid:2) g ... ... ... , g k ,...,k p (cid:3) = 1 n E Ñ X k j +11 j X k j ′ +11 j ′ Y m = j,j ′ X k m m é . Hence, rewrite (30) to see that p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) = 2 p X j =1 β j Cov (cid:0) g ... ... , g k ...k p (cid:1) + 2 X j = j ′ β j β j ′ Cov (cid:0) g ... ... ... , g k ...k p (cid:1) = 2 n p X j =1 β j E Ñ X k j +21 j Y m = j X k m m é − E p Y m =1 X k m m ! +2 n X j = j ′ β j β j ′ E Ñ X k j +11 j X k j ′ +11 j ′ Y m = j,j ′ X k m m é = 2 n (cid:0) L − L + L (cid:1) , (32) hich is exactly the same expression as in (28). Hence, equation (27) follows which completes the proof of Theorem 2. (cid:4) Proof of Proposition 1 :Let W i = ( W i , ..., W ip ) T and notice that ˆ τ = n ( n − n P i = i p P j =1 W i j W i j is a U-statistic of order 2 with the kernel h ( w , w ) = w T w = p P j =1 w j w j , where w i ∈ R p . By Chapter 12 in van der Vaart (2000), the variance of a U-statisticof order r is given by Var ( U ) = r X c =1 r ! c ! ( r − c ) ! ( n − r ) ( n − r − ... ( n − r + c + 1) n ( n − ... ( n − r + 1) ζ c where ζ c is the covariance of the kernels if c variables are in common. In our case r = 2 , and thus ζ = Cov î h ( W , W ) , h Ä W , › W äó and ζ = Cov [ h ( W , W ) , h ( W , W )] where › W is an independent copy of W . Therefore, the variance of the naiveestimator has the following form Var Ä ˆ τ ä = 4 ( n − n ( n − ζ + 2 n ( n − ζ . (33)Now, let A = E (cid:0) W i W Ti (cid:1) be a p × p matrix and notice that ζ = Cov î h ( W , W ) , h Ä W , › W äó = p X j,j ′ Cov Ä W j W j , W j ′ f W j ′ ä = p X j,j ′ Ä β j β j ′ E [ W j W j ′ ] − β j β j ′ ä = β T A β − k β k and ζ = Cov [ h ( W , W ) , h ( W , W )]= X j,j ′ Cov ( W j W j , W j ′ W j ′ ) = X j,j ′ Ä ( E [ W j W j ′ ]) − β j β j ′ ä = k A k F − k β k , where k A k F is the Frobenius norm of A . Thus, by rewriting (33) the variance of the naive estimator is given byVar Ä ˆ τ ä = 4 ( n − n ( n − î β T A β − k β k ó + 2 n ( n − î k A k F − k β k ó . (34)Lastly, since we assume standard Gaussian covariates, one can verify that ζ = β T A β − k β k = σ Y τ + τ and ζ = k A k F − k β k = pσ Y + 4 σ Y τ + 3 τ , where σ Y = σ + τ by (1) and hence (8) follows.We now move to prove that Var ( T oracle ) = Var (cid:0) ˆ τ (cid:1) − n τ . Write,Var ( T oracle ) = Var Ñ ˆ τ − X j,j ′ ψ jj ′ é = Var Ä ˆ τ ä − X j,j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä + 4 Var Ñ X j,j ′ ψ jj ′ é . (35)Consider P j,j ′ β j β j ′ Cov (cid:0) ˆ τ , h jj ′ (cid:1) . 
We have X j,j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä = p X j =1 β j Cov Ä ˆ τ , h jj ä + X j = j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä = p X j =1 β j Cov Ä ˆ τ , g ... ... ä + X j = j ′ β j β j ′ Cov Ä ˆ τ , g ... ... ... ä = p X j =1 β j ï n β j ò + X j = j ′ β j β j ′ ï n β j β j ′ ò = 4 n τ , here the second and third equality are justified by (30) and (28) respectively. Consider now Var Ç P j,j ′ ψ jj ′ å . Write,Var Ñ X j,j ′ ψ jj ′ é = Cov Ñ X j,j ′ β j β j ′ h jj ′ , X j,j ′ β j β j ′ h jj ′ é = X j ,j ,j j β j β j β j β j Cov ( h j j , h j j )= 1 n X j ,j ,j ,j β j β j β j β j X i ,i Cov ( X i j X i j , X i j X i j )= 1 n X j ,j ,j ,j β j β j β j β j X i ,i [ E ( X i j X i j X i j X i j ) − E ( X i j X i j ) E ( X i j X i j )]= 1 n X j ,j ,j ,j β j β j β j β j n X i =1 [ E ( X ij X ij X ij X ij ) − E ( X ij X ij ) E ( X ij X ij )]= 1 n X j ,j ,j ,j β j β j β j β j [ E ( X j X j X j X j ) − E ( X j X j ) E ( X j X j )] , where the fifth equality holds since the summand is 0 for all i = i . The summation is not zero in only three cases:1) j = j = j = j j = j = j = j j = j = j = j . For the first two cases the summation equals n P j = j ′ β j β j ′ . For the third case the summation equals to n p P j =1 β j . Overallwe have Var Ñ X j,j ′ ψ jj ′ é = case z }| { n X j = j ′ β j β j ′ + case z }| { n X j = j ′ β j β j ′ + case z }| { n p X j =1 β j = 2 n k β k = 2 n τ . Rewrite (35) to get Var ( T oracle ) = Var Ä ˆ τ ä − X j,j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä + 4 Var Ñ X j,j ′ ψ jj ′ é = Var Ä ˆ τ ä − ï τ n ò + 4 ï τ n ò = Var Ä ˆ τ ä − τ n . (cid:4) Remark . We now calculate the asymptotic improvement of T oracle over the naive estimator. For simplicity, considerthe case when τ = σ . Recall the variance of ˆ τ and T oracle given in (8) and (2), respectively. Write, lim n,p →∞ Var (cid:0) ˆ τ (cid:1) − Var ( T oracle ) Var (cid:0) ˆ τ (cid:1) = lim n,p →∞ τ /n n î ( n − n − ( σ Y τ + τ ) + n − ( pσ Y + 4 σ Y τ + 3 τ ) ó = 2 τ τ + pτ +4 σ Y τ +3 τ n = 23 + 2 pn , where we used (8) in the first equality, and the fact that σ Y = 2 τ in the second equality. Now, notice that when p = n then the reduction is = 40% and when p/n converges to zero, the reduction is . Proof of Proposition 2 :Write, Var ( T ) = Var ˆ τ − p X j =1 p X j ′ =1 ˆ ψ jj ′ = Var Ä ˆ τ ä − Cov Ñ ˆ τ , p X j =1 p X j ′ =1 ˆ ψ jj ′ é + 4 Var Ñ p X j =1 p X j ′ =1 ˆ ψ jj ′ é . (36) e start with calculating the middle term. Let p n ( k ) ≡ n ( n − n − · · · ( n − k ) . Write,Cov (cid:0) ˆ τ , p X j =1 p X j ′ =1 ˆ ψ jj ′ (cid:1) = Cov Ñ n ( n − X i = i p X j =1 W i j W i j , n ( n −
1) ( n − X j,j ′ X i = i = i W i j W i j ′ [ X i j X i j ′ − E ( X i j X i j ′ )] é = C n X I X J Cov ( W i j W i j , W i j W i j [ X i j X i j − E ( X i j X i j )]) , (37)where C n ≡ p n (1) · p n (2) , I is the set of all quintuples of indices ( i , i , i , i , i ) such that i = i and i = i = i , and J is the set of all triples of indices ( j , j , j ) . For the set I , there are (cid:0) (cid:1) · different cases to consider when one of { i , i } is equal to one of { i , i , i } , and an additional (cid:0) (cid:1) ·
3! = 6 cases to consider when two of { i , i } are equal to twoof { i , i , i } . Similarly, for the set J we have three cases to consider: (1) when only two indices of { j , j , j } are equalto each other, (e.g., j = j = j ) ; (2) when no pair of indices is equal to each other and; (3) when all three indices areequal. For example, let I = { ( i , . . . , i ) : i = i = i = i = i } and J = { ( j , j , j ) : j = j = j } . Write, C n X I X J Cov ( W i j W i j , W i j W i j [ X i j X i j − E ( X i j X i j )])= C n X I p X j =1 Cov Ä W i j W i j , W i j W i j î X i j − óä = C n X I p X j =1 E ( W i j ) E ( W i j ) E ( W i j ) E Ä W i j î X i j − óä = C n X I p X j =1 β j E Ä W ij î X ij − óä . (38)Now, notice that E î W ij Ä X ij − äó = E î X ij Y i Ä X ij − äó = E î X ij Ä β T X + ε i äó − β j = β j E Ä X ij ä − β j = 2 β j , (39)where the third equality holds since the covariates are standard normal. Rewrite (38) to get C n X I p X j =1 β j E î W ij Ä X ij − äó = C n X I p X j =1 β j (2 β j ) = 2 p n (3) p n (1) · p n (2) p X j =1 β j = 2( n − n ( n − p X j =1 β j = 2 n p X j =1 β j + O ( n − ) , where we used (39) to justify the first equality. By the same type of calculation, one can compute the covariance in (37)over all combinations of subsets I and J and obtain thatCov (cid:0) ˆ τ , p X j =1 p X j ′ =1 ˆ ψ jj ′ (cid:1) = 4 n τ + O Ä n − ä . (40)We now move to calculate the last term of (36). Recall that ˆ ψ jj ′ = 1 n ( n −
We now move to calculate the last term of (36). Recall that
\[
\hat\psi_{jj'} = \frac{1}{n(n-1)(n-2)} \sum_{i_1 \neq i_2 \neq i_3} W_{i_1 j}W_{i_2 j'}\big[X_{i_3 j}X_{i_3 j'} - E(X_{i_3 j}X_{i_3 j'})\big].
\]
Therefore,
\[
\mathrm{Var}\Big(\sum_{j=1}^{p}\sum_{j'=1}^{p} \hat\psi_{jj'}\Big) = \sum_{J} \mathrm{Cov}\big(\hat\psi_{j_1 j_2}, \hat\psi_{j_3 j_4}\big) \tag{41}
\]
\[
= \frac{1}{[n(n-1)(n-2)]^2} \sum_{J} \mathrm{Cov}\Big(\sum_{i_1 \neq i_2 \neq i_3} W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; \sum_{i_4 \neq i_5 \neq i_6} W_{i_4 j_3}W_{i_5 j_4}X_{i_6 j_3}X_{i_6 j_4}\Big)
\]
\[
= p_n(2)^{-2} \sum_{J}\sum_{I} \mathrm{Cov}\big(W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; W_{i_4 j_3}W_{i_5 j_4}X_{i_6 j_3}X_{i_6 j_4}\big),
\]
where $J$ is now defined to be the set of all quadruples $(j_1,j_2,j_3,j_4)$, and $I$ is now defined to be the set of all sextuples $(i_1,\ldots,i_6)$ such that $i_1 \neq i_2 \neq i_3$ and $i_4 \neq i_5 \neq i_6$. For the set $I$, there are three different cases to consider: (1) when one of $\{i_1,i_2,i_3\}$ is equal to one of $\{i_4,i_5,i_6\}$; (2) when two of $\{i_1,i_2,i_3\}$ are equal to two of $\{i_4,i_5,i_6\}$; and (3) when $\{i_1,i_2,i_3\}$ are equal to $\{i_4,i_5,i_6\}$. There are $\binom{3}{1}\cdot\binom{3}{1} = 9$ options for the first case, $\binom{3}{2}\cdot 3! = 18$ for the second case, and $\binom{3}{3}\cdot 3! = 6$ options for the third case. For the set $J$, there are five different cases to consider: (1) when there is only one pair of equal indices (e.g., $j_1 = j_2 \neq j_3 \neq j_4$); (2) when there are two pairs of equal indices (e.g., $j_1 = j_3 \neq j_2 = j_4$); (3) when only three indices are equal (e.g., $j_1 = j_2 = j_3 \neq j_4$); (4) when all four indices are equal; and (5) when all four indices are different from each other. Note that there are $6$ combinations for the first case, $6$ for the second case, $\binom{4}{3} = 4$ combinations for the third case, and a single combination for each of the last two cases. For example, let $I_2 = \{(i_1,\ldots,i_6): i_1 = i_4,\, i_2 = i_5,\, i_3 = i_6\}$ and $J_2 = \{(j_1,j_2,j_3,j_4): j_1 = j_3 \neq j_2 = j_4\}$. In view of (41),
\[
p_n(2)^{-2} \sum_{J_2}\sum_{I_2} \mathrm{Cov}\big(W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; W_{i_4 j_3}W_{i_5 j_4}X_{i_6 j_3}X_{i_6 j_4}\big)
= p_n(2)^{-2} \sum_{J_2}\sum_{I_2} \mathrm{Cov}\big(W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2}\big)
\]
\[
= p_n(2)^{-2} \sum_{J_2}\sum_{I_2} E\big(W_{i_1 j_1}^2\big)E\big(W_{i_2 j_2}^2\big)E\big(X_{i_3 j_1}^2\big)E\big(X_{i_3 j_2}^2\big)
= p_n(2)^{-2} \sum_{J_2}\sum_{I_2} \big(\sigma_Y^2 + 2\beta_{j_1}^2\big)\big(\sigma_Y^2 + 2\beta_{j_2}^2\big)
\]
\[
= p_n(2)^{-1} \sum_{j \neq j'} \big[\sigma_Y^4 + 2\sigma_Y^2\big(\beta_j^2 + \beta_{j'}^2\big) + 4\beta_j^2\beta_{j'}^2\big]
= p_n(2)^{-1}\Bigg(\sum_{j=1}^{p}\sum_{j'=1}^{p} \big[\sigma_Y^4 + 2\sigma_Y^2\big(\beta_j^2 + \beta_{j'}^2\big) + 4\beta_j^2\beta_{j'}^2\big] - \sum_{j=1}^{p}\big[\sigma_Y^4 + 4\sigma_Y^2\beta_j^2 + 4\beta_j^4\big]\Bigg)
\]
\[
= p_n(2)^{-1}\Big(\big[p^2\sigma_Y^4 + 4\sigma_Y^2 p\tau^2 + 4\tau^4\big] - \big[p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 4\sum_{j=1}^{p}\beta_j^4\big]\Big)
= p_n(2)^{-1}\Big(p(p-1)\sigma_Y^4 + 4\sigma_Y^2\tau^2(p-1) + 4\sum_{j \neq j'}\beta_j^2\beta_{j'}^2\Big),
\]
where in the third equality we use $E(W_{ij}^2) = \sigma_Y^2 + 2\beta_j^2$, which is given by (15). Since we assume $p/n = O(1)$, the above expression can be further simplified to $\frac{p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big)$. By the same type of calculation, one can compute the covariance in (41) over all combinations of subsets $I$ and $J$ and obtain that
\[
\mathrm{Var}\Big(\sum_{j=1}^{p}\sum_{j'=1}^{p} \hat\psi_{jj'}\Big) = \frac{2\tau^4}{n} + \frac{2p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big). \tag{42}
\]
Lastly, plug (40) and (42) into (36) to get
\[
\mathrm{Var}(T) = \mathrm{Var}\big(\hat\tau^2\big) - 4\,\mathrm{Cov}\Big(\hat\tau^2, \sum_{j=1}^{p}\sum_{j'=1}^{p}\hat\psi_{jj'}\Big) + 4\,\mathrm{Var}\Big(\sum_{j=1}^{p}\sum_{j'=1}^{p}\hat\psi_{jj'}\Big)
= \mathrm{Var}\big(\hat\tau^2\big) - 4\Big(\frac{4\tau^4}{n}\Big) + 4\Big(\frac{2\tau^4}{n} + \frac{2p^2\sigma_Y^4}{n^3}\Big) + O\big(n^{-2}\big)
\]
\[
= \mathrm{Var}\big(\hat\tau^2\big) - \frac{8\tau^4}{n} + \frac{8p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big)
= \mathrm{Var}(T_{\mathrm{oracle}}) + \frac{8p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big),
\]
where the last equality holds by (9). $\blacksquare$
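Proposition 2 thus quantifies the variance cost of plugging in all $p^2$ estimated zero-estimators $\hat\psi_{jj'}$. The sketch below (ours, not from the paper) computes $\sum_{j,j'}\hat\psi_{jj'}$ without forming the $p \times p$ array, assuming $\mathrm{Cov}(X) = I$ so that $E(X_{ij}X_{ij'}) = 1\{j = j'\}$; substituting 2 * sum_psi_hat(X, y) for the oracle correction in the earlier simulation exhibits the inflation relative to $T_{\mathrm{oracle}}$.

import numpy as np

def sum_psi_hat(X, y):
    # sum over all (j, j') of psi_hat_{jj'}; uses the identity
    # sum_{j,j'} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} - 1{j=j'}] = A[i1,i3] A[i2,i3] - G[i1,i2]
    # with A = W X^T and G = W W^T, summed over pairwise-distinct (i1, i2, i3)
    n = len(y)
    W = X * y[:, None]
    A = W @ X.T
    G = W @ W.T
    col, diag = A.sum(axis=0), np.diag(A)
    s1 = np.sum((col - diag) ** 2 - ((A ** 2).sum(axis=0) - diag ** 2))
    s2 = (n - 2) * (G.sum() - np.trace(G))
    return (s1 - s2) / (n * (n - 1) * (n - 2))

# one draw, just to show the call; T = tau2_hat - 2 * sum_psi_hat(X, y)
rng = np.random.default_rng(2)
n, p = 100, 100
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
print("2 * sum of psi_hat:", 2 * sum_psi_hat(X, y))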
Remark. We now calculate the asymptotic improvement of $T_B$ over the naive estimator. For simplicity, consider the case when $\tau^2 = \sigma^2 = 1$. Recall the variance of $\hat\tau^2$ and $T_B$ given in (8) and (11), respectively. Write,
\[
\lim_{n,p\to\infty} \frac{\mathrm{Var}\big(\hat\tau^2\big) - \mathrm{Var}(T_B)}{\mathrm{Var}\big(\hat\tau^2\big)}
= \lim_{n,p\to\infty} \frac{8\tau_B^4/n}{\frac{4}{n}\Big[\frac{n-2}{n-1}\big(\sigma_Y^2\tau^2 + \tau^4\big) + \frac{1}{2(n-1)}\big(p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 3\tau^4\big)\Big]}
= \frac{2\tau_B^4}{\sigma_Y^2\tau^2 + \tau^4 + \frac{p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 3\tau^4}{2n}}
= \frac{2\tau_B^4}{3 + 2p/n},
\]
where we used (8) in the first equality, and the fact that $\sigma_Y^2 = 2\tau^2 = 2$ in the second equality. Now, notice that when $p = n$ and $\tau_B^2 = 0.5$ the reduction is $0.5/5 = 10\%$, and when $p/n$ converges to zero, the reduction is $0.5/3 \approx 17\%$.

Proof of Proposition 3: We need to prove that
\[
\lim_{n\to\infty} n\big[E(T_\gamma) - \tau^2\big] = 0, \tag{43}
\]
\[
\lim_{n\to\infty} n\big[\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma)\big] = 0. \tag{44}
\]
We start with the first equation. Let $A$ denote the event that the selection algorithm $\gamma$ perfectly identifies the set of large coefficients, i.e., $A = \{B_\gamma = B\}$. Let $p_A \equiv P(A)$ denote the probability that $A$ occurs, and let $1_A$ denote the indicator of $A$. Write,
\[
n\big[E(T_\gamma) - \tau^2\big] = n\big(E[T_\gamma(1 - 1_A)] + E[T_\gamma 1_A] - \tau^2\big) = n\,E[T_\gamma(1 - 1_A)] + n\big[E(T_B 1_A) - \tau^2\big], \tag{45}
\]
where the last equality holds since $T_\gamma 1_A = T_B 1_A$. For convenience of notation, let $C$ be an upper bound on the maximum over all first four moments of $T_\gamma$ and $T_B$, and consider the first term of (45). By the Cauchy–Schwarz inequality,
\[
n\,E[T_\gamma(1 - 1_A)] \le n\,\big\{E\big[T_\gamma^2\big]\big\}^{1/2}\big\{E\big[(1 - 1_A)^2\big]\big\}^{1/2} \le n\,C^{1/2}\{1 - p_A\}^{1/2} \xrightarrow[n\to\infty]{} 0, \tag{46}
\]
where the convergence holds since $\lim_{n\to\infty} n(1 - p_A)^{1/2} = 0$ by assumption. We now consider the second term of (45). Write,
\[
n\big[E(T_B 1_A) - \tau^2\big] = n\,E(T_B 1_A - T_B) = -n\,E[T_B(1 - 1_A)], \tag{47}
\]
and notice that by the same type of argument as in (46) we have $n\,E[T_B(1 - 1_A)] \to 0$ as $n \to \infty$. Hence, by (45) we have $\lim_{n\to\infty} n[E(T_\gamma) - \tau^2] = 0$, which completes the proof of (43).

We now move to show that $\lim_{n\to\infty} n[\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma)] = 0$. Write,
\[
\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma) = E\big(T_B^2\big) - E\big(T_\gamma^2\big) - \big([E(T_B)]^2 - [E(T_\gamma)]^2\big).
\]
Thus, it is enough to show that
\[
\lim_{n\to\infty} n\big[E\big(T_B^2\big) - E\big(T_\gamma^2\big)\big] = 0, \tag{48}
\]
\[
\lim_{n\to\infty} n\big([E(T_B)]^2 - [E(T_\gamma)]^2\big) = 0. \tag{49}
\]
The first equation follows by similar arguments as in (46), with a slight modification of using the existence of the fourth moments of $T_\gamma$ and $T_B$, rather than the second moments. For the second equation, notice that $E(T_B) = \tau^2$ and hence,
\[
\lim_{n\to\infty} n\big([E(T_B)]^2 - [E(T_\gamma)]^2\big) = \lim_{n\to\infty} n\big(\big[\tau^2 + E(T_\gamma)\big]\big[\tau^2 - E(T_\gamma)\big]\big) \le \big(\tau^2 + C\big)\lim_{n\to\infty} n\big(\tau^2 - E(T_\gamma)\big) = 0,
\]
where the last equality follows from (45), and hence (44) follows. $\blacksquare$
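Proposition 4, proved next, shows that $n$ times the error of the plug-in variance estimators vanishes in probability. A minimal sketch of these plug-ins (ours), assuming our reconstruction of displays (8) and (12), and treating unbiased estimates of the $\beta_j^2$ over the selected set as a given input (the paper's construction of those estimates is not reproduced here):

import numpy as np

def naive_tau2(X, y):
    # the pairwise U-statistic for tau^2, as in the earlier sketch
    n = len(y)
    W = X * y[:, None]
    s = W.sum(axis=0)
    return (s @ s - (W * W).sum()) / (n * (n - 1))

def var_hat_naive(X, y):
    # plug-in version of display (8): replace tau^2 and sigma_Y^2 by estimates
    n, p = X.shape
    t2 = naive_tau2(X, y)
    sY2 = np.var(y, ddof=1)
    return (4 / n) * ((n - 2) / (n - 1) * (sY2 * t2 + t2 ** 2)
                      + (p * sY2 ** 2 + 4 * sY2 * t2 + 3 * t2 ** 2) / (2 * (n - 1)))

def var_hat_T_gamma(X, y, beta_sq_hat_B):
    # plug-in version of display (12): Var_hat(T_gamma) = Var_hat(tau2_hat) - (8/n) * tau_hat_{B_gamma}^4,
    # where tau_hat_{B_gamma}^2 is the sum of the (hypothetical) estimates beta_sq_hat_B over the selected set
    n = len(y)
    tau2_B = float(np.sum(beta_sq_hat_B))
    return var_hat_naive(X, y) - (8 / n) * tau2_B ** 2

# example call with a hypothetical selected set carrying part of the signal:
# var_hat_T_gamma(X, y, beta_sq_hat_B=[0.3, 0.2])

The sample variance with ddof=1 is used for $\hat\sigma_Y^2$; any consistent estimator serves the purpose of Proposition 4.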
Proof of Proposition 4: We start by proving that
\[
n\Big[\widehat{\mathrm{Var}}\big(\hat\tau^2\big) - \mathrm{Var}\big(\hat\tau^2\big)\Big] \xrightarrow{p} 0. \tag{50}
\]
Recall by (8) that
\[
\mathrm{Var}\big(\hat\tau^2\big) = \frac{4}{n}\left[\frac{n-2}{n-1}\big(\sigma_Y^2\tau^2 + \tau^4\big) + \frac{1}{2(n-1)}\big(p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 3\tau^4\big)\right].
\]
Hence, it is enough to prove the consistency of $\hat\tau^2$ and $\hat\sigma_Y^2$. Consistency of the sample variance $\hat\sigma_Y^2$ is a standard result, and since $\hat\tau^2$ is an unbiased estimator, it is enough to show that its variance converges to zero as $n \to \infty$. Since we assume $p/n = O(1)$, we have by (8) that $\mathrm{Var}(\hat\tau^2) \to 0$ as $n \to \infty$, and (50) follows.

We now move to prove that
\[
n\Big[\widehat{\mathrm{Var}}(T_\gamma) - \mathrm{Var}(T_\gamma)\Big] \xrightarrow{p} 0. \tag{51}
\]
Recall that by (44) we have $\lim_{n\to\infty} n[\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma)] = 0$. Hence, it is enough to show that
\[
n\Big[\widehat{\mathrm{Var}}(T_\gamma) - \mathrm{Var}(T_B)\Big] \xrightarrow{p} 0.
\]
Now, recall that by (12) and (11) we have
\[
\widehat{\mathrm{Var}}(T_\gamma) = \widehat{\mathrm{Var}}\big(\hat\tau^2\big) - \frac{8}{n}\hat\tau_{B_\gamma}^4,
\qquad
\mathrm{Var}(T_B) = \mathrm{Var}\big(\hat\tau^2\big) - \frac{8}{n}\tau_B^4 + O\big(n^{-2}\big).
\]
Thus, by (50) it is enough to show that $\hat\tau_{B_\gamma}^4 - \tau_B^4 \xrightarrow{p} 0$. Now, since $\gamma$ is a consistent selection algorithm by assumption, it is enough to show that $\hat\tau_B^4 - \tau_B^4 \xrightarrow{p} 0$, and, by the continuous mapping theorem, it suffices that $\hat\tau_B^2 - \tau_B^2 \xrightarrow{p} 0$. Recall that $E(\hat\beta_j^2) = \beta_j^2$ for $j = 1,\ldots,p$ and notice that $\mathrm{Var}(\hat\beta_j^2) \to 0$ as $n \to \infty$ by similar arguments to those used to derive (8). Hence, we have $\hat\beta_j^2 - \beta_j^2 \xrightarrow{p} 0$. Since we assumed that $B$ is finite, we have
\[
\hat\tau_B^2 - \tau_B^2 = \sum_{j \in B}\big(\hat\beta_j^2 - \beta_j^2\big) \xrightarrow{p} 0,
\]
and (51) follows. $\blacksquare$

Ilan Livne, Faculty of Industrial Engineering and Management, The Technion. E-mail: [email protected]
David Azriel, Faculty of Industrial Engineering and Management, The Technion. E-mail: [email protected]
Yair Goldberg, Faculty of Industrial Engineering and Management, The Technion. E-mail: [email protected]

[Figure: simulation results; estimator value versus sample size, grouped by estimator.]