Improved Estimators for Semi-supervised High-dimensional Regression Model
Ilan Livne, David Azriel, Yair Goldberg
Technion - Israel Institute of Technology
February 16, 2021
Abstract
We study a linear high-dimensional regression model in a semi-supervised setting, where for many observations only the vector of covariates X is given with no response Y. We do not make any sparsity assumptions on the vector of coefficients, and aim at estimating Var(Y | X). We propose an estimator, which is unbiased, consistent, and asymptotically normal. This estimator can be improved by adding zero-estimators arising from the unlabelled data. Adding zero-estimators does not affect the bias and potentially can reduce variance. In order to achieve optimal improvement, many zero-estimators should be used, but this raises the problem of estimating many parameters. Therefore, we introduce covariate selection algorithms that identify which zero-estimators should be used in order to improve the above estimator. We further illustrate our approach for other estimators, and present an algorithm that improves estimation for any given variance estimator. Our theoretical results are demonstrated in a simulation study.

Key words and phrases:
Regression error, Semi-supervised setting, U-statistics, Variance estimation, Zero estimators.
Introduction

High-dimensional data analysis, where the number of predictors is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level τ² and the noise level σ², i.e., to quantify how much variation in the response variable can be explained by the predictors, versus how much of the variation is left unexplained. For example, in disease classification using DNA microarray data, where the number of potential predictors, say the genotypes, is enormous for each individual, one may wish to understand how disease risk is associated with genotype versus environmental factors.

Estimating the signal and noise levels is important even in a low-dimensional setting. In particular, a statistical model partitions the total variability of the response variable into two components: the variance of the fitted model τ², and the variance of the residuals σ². This partition is at the heart of techniques such as ANOVA and linear regression, where τ² and σ² might also be commonly referred to as explained versus unexplained variation, or between-treatments versus within-treatments variation. Moreover, in model selection problems, τ² and σ² may be required for computing popular statistics, such as Cp, AIC, BIC and R². Both τ² and σ² are also closely related to other important statistical problems, such as genetic heritability and signal detection. Hence, developing good estimators for these quantities is a desirable goal.

When the number of covariates p is much smaller than the number of observations n, and a linear model is assumed, the ordinary least squares (henceforth, OLS) method provides straightforward estimators for τ² and σ². However, when p > n, it becomes more challenging to perform inference on τ² and σ² without further assumptions, such as sparsity of the coefficients. In practice, the sparsity assumption may be unrealistic for some areas of interest. In this case, considering only a small number of significant coefficients can lead to biases and inaccuracies. One relevant example is the problem of missing heritability, i.e., the gap between heritability estimates from genome-wide association studies (GWAS) and the corresponding estimates from twin studies. For example, by 2010, GWAS studies had identified a relatively small number of covariates that collectively explained only a small fraction of the total variation in the trait height, far less than the fraction of the total variation explained by twin studies (Eichler et al., 2010). Identifying all the GWAS covariates affecting a trait, and measuring how much variation they capture, is believed to bridge a significant fraction of the heritability gap. With that in mind, methods that heavily rely on the sparsity assumption may underestimate τ² by their nature. We show in this work that in the semi-supervised setting, in which for many observations only the covariates X are given with no responses Y, one may consistently estimate the heritability without sparsity assumptions.

Estimating τ² and σ² in a high-dimensional regression setting is generally a challenging problem. As mentioned above, the sparsity assumption, which means that only a relatively small number of predictors are relevant, plays an important role in this context. Fan et al. (2012) introduced a refitted cross-validation method for estimating σ². Their method includes a two-staged procedure where a variable-selection technique is performed in the first stage, and OLS is used to estimate σ² in the second stage.
Sun and Zhang (2012) introduced the scaled lasso algorithm that jointly estimates the noise level and the regression coefficients by an iterative lasso procedure. Both works provide asymptotic distributional results for their estimators and prove consistency under several assumptions including sparsity. In the context of heritability estimation, Gorfine et al. (2017) presented the HERRA estimator, which is based on the above methods and is also applicable to time-to-event outcomes, in addition to continuous or dichotomous outcomes. Another recent related work is Cai and Guo (2020), which considers, as we do here, a semi-supervised learning setting. In their work, Cai and Guo proposed the CHIVE estimator of τ², which integrates both labelled and unlabelled data and works well when assuming sparsity. They characterize its limiting distribution and calculate confidence intervals for τ². For more related works, see the literature review of Cai and Guo (2020).

Rather than assuming sparsity, or other structural assumptions on the coefficient vector β, a different approach for high-dimensional inference is to assume some knowledge about the covariate distribution. Dicker (2014) uses the method of moments to develop several asymptotically normal estimators of τ² and σ², when the covariates are assumed to be Gaussian. Schwartzman et al. (2019) proposed the GWASH estimator for estimating heritability, which is essentially a modification of one of Dicker's estimators where the columns of X are standardized. Unlike Dicker's estimators, the GWASH estimator can also be computed from typical summary statistics, without accessing the original data. Janson et al. (2017) proposed the EigenPrism procedure to estimate τ² and σ². Their method, which is based on singular value decomposition and convex optimization techniques, provides estimates and confidence intervals for normal covariates.

In this paper we introduce a naive estimator of τ² and show that it is asymptotically equivalent to one of Dicker's estimators. The naive estimator is also a U-statistic and asymptotically normal. U-statistics can typically be used to obtain uniformly minimum variance unbiased estimators (UMVUE). However, when moment restrictions exist, U-statistics are no longer UMVUE, as shown by Hoeffding (1977). Since the semi-supervised setting practically means that the distribution of X is known (and hence, moments of X are known), the naive estimator is not a UMVUE and it potentially can be improved. In this work, we demonstrate how its variance can be further reduced by using zero-estimators that incorporate the additional information that exists in the unlabelled data.

The contribution of this paper is threefold. First, we propose a novel approach for improving initial estimators of the signal level τ² in the semi-supervised setting without assuming sparsity. The key idea of this approach is to use zero-estimators that are correlated with the initial estimator of τ² in order to reduce variance without introducing extra bias. Second, we define a new notion of optimality with respect to a linear family of zero-estimators. This allows us to suggest a necessary and sufficient condition for identifying optimal oracle estimators. We use the term oracle to point out that the specific coefficients that compose the optimal linear combination of zero-estimators depend on the unknown parameters.
Third, we suggest an algorithm that approximates these optimal oracle coefficients and successfully improves initial estimators of τ², given that a subset of relevant covariates is identified. We discuss in detail the improvement of the naive estimator and also apply our approach to other estimators. Thus, in fact, we provide an algorithm that has the potential to improve any given estimator of τ².

The rest of this work is organized as follows. In Section 2 we describe our setting and introduce the naive estimator. In Section 3 we introduce the zero-estimator approach and suggest a new notion of optimality with respect to linear families of zero-estimators. An optimal oracle estimator of τ² is also presented. In Section 4 we apply the zero-estimator approach to improve the naive estimator. We then study the theoretical properties of the improved estimator and some practical aspects when the sample size is small. Simulation results are given in Section 5. Section 6 demonstrates how the zero-estimator approach can be generalized to other estimators. A discussion is given in Section 7, while the proofs are provided in the Appendix.

We begin with describing our setting and assumptions. Let (X_1, Y_1), ..., (X_n, Y_n) be i.i.d. observations drawn from some unknown distribution, where X_i ∈ R^p and Y_i ∈ R. We consider the semi-supervised setting, where we have access to additional N − n i.i.d. observations of covariates X_{n+1}, ..., X_N, but the responses Y_{n+1}, ..., Y_N are not given. We assume in this work that N ≫ n and, therefore, the distribution of the covariates is treated as known. We consider the linear model Y_i = β^T X_i + ε_i, i = 1, ..., n, and assume:

(A1) Gaussian covariates: X_i ~ N(µ, Σ) i.i.d., i = 1, ..., N, where µ and Σ are known.
(A2) Invertibility: Σ is invertible.
(A3) Linearity and homoscedasticity: E(ε_i | X_i) = 0 and E(ε_i² | X_i) = σ².

We also assume that the intercept term is zero, which can be achieved in practice by centering the Y's. Let (X, Y) denote a generic observation and let σ²_Y denote the variance of Y. Notice that it can be decomposed into signal and noise components,

σ²_Y = Var(X^T β + ε) = β^T Var(X) β + Var(ε) = β^T Σ β + σ²,   (1)

where Var(ε) = E(ε²) = σ² and Cov(X) = Σ. The signal component τ² ≡ β^T Σ β can be thought of as the total variance explained by the best linear function of the covariates, while the noise component σ² can be thought of as the variance left unexplained. Since µ and Σ are treated as known, we can apply the linear transformation X → Σ^{−1/2}(X − µ) and assume w.l.o.g. that µ = 0 and Σ = I. It follows by (1) that σ²_Y = ‖β‖² + σ², which implies that in order to evaluate σ², it is enough to estimate both σ²_Y and ‖β‖². The former can be easily evaluated from the sample, and the main challenge is to derive an estimator for ‖β‖² in the high-dimensional setting.

In order to find an unbiased estimator for ‖β‖² = Σ_{j=1}^p β_j², we first consider the estimation of β_j² for each j. A straightforward approach is given as follows. Let W_ij ≡ X_ij Y_i for i = 1, ..., n and j = 1, ..., p. Notice that E(W_ij) = E(X_ij Y_i) = E[X_ij (β^T X_i + ε_i)] = β_j, where in the last equality we used Σ = I and assumption (A3). Now, since β_j² = {E(W_ij)}² = E(W_ij²) − Var(W_ij), a natural unbiased estimator for β_j² is

β̂_j² ≡ (1/n) Σ_{i=1}^n W_ij² − (1/(n−1)) Σ_{i=1}^n (W_ij − W̄_j)² = (1/(n(n−1))) Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j},   (2)

where W̄_j = (1/n) Σ_{i=1}^n W_ij.
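For concreteness, the following minimal Python sketch (an illustration added here, not part of the formal development) simulates data with µ = 0 and Σ = I and computes W_ij and the estimates β̂_j² of (2); the sample sizes, coefficient values, and variable names are arbitrary choices.

```python
# Minimal sketch of W_ij = X_ij * Y_i and the unbiased estimator of beta_j^2 in (2).
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 300, 1.0
beta = np.zeros(p)
beta[:5] = 0.3                      # a few non-negligible coefficients (illustrative)
X = rng.standard_normal((n, p))     # covariates already standardized: mu = 0, Sigma = I
Y = X @ beta + sigma * rng.standard_normal(n)

W = X * Y[:, None]                  # W_ij = X_ij * Y_i, shape (n, p)
# beta_j^2 estimate: mean of W_ij^2 minus the unbiased sample variance of W_ij,
# which equals the pairwise U-statistic form in (2).
beta_sq_hat = (W ** 2).mean(axis=0) - W.var(axis=0, ddof=1)
print(beta_sq_hat[:5], beta_sq_hat[5:].mean())
```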
Thus, unbiased estimators of τ² ≡ ‖β‖² and σ² are given by

τ̂² = Σ_{j=1}^p β̂_j² = (1/(n(n−1))) Σ_{j=1}^p Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j},   σ̂² = σ̂²_Y − τ̂²,   (3)

where σ̂²_Y = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)². We use the term naive to describe τ̂² since its construction is relatively simple and straightforward. Note that the derivation of the naive estimator did not require assumption (A1). The naive estimator was also discussed by Kong and Valiant (2018). A similar estimator was proposed by Dicker (2014). Specifically, let

τ̂²_Dicker = (‖X^T Y‖² − p‖Y‖²) / (n(n+1)),

where X is the n × p design matrix and Y = (Y_1, ..., Y_n)^T. The following lemma shows that τ̂² and τ̂²_Dicker are asymptotically equivalent.
Lemma 1.
Assume (A1)-(A3) and that ε_1, ..., ε_n ~ N(0, σ²). When τ² + σ² is bounded and p/n converges to a constant, then √n (τ̂² − τ̂²_Dicker) converges to zero in probability.

Here and throughout, asymptotic statements are taken as n and p go to ∞ together. Using Corollary 1 from Dicker (2014), which computes the asymptotic variance of τ̂²_Dicker, and the above lemma, we obtain the following corollary.
Corollary 1.
Under the assumptions of Lemma 1,

√n (τ̂² − τ²) / ψ → N(0, 1) in distribution,   where ψ² = 2{(1 + p/n)(σ² + τ²)² − σ⁴ + 3τ⁴}.

In this section we introduce the zero-estimator approach and study how it can be used to improve the naive estimator. In Section 3.1 we present the zero-estimator approach, and an illustration of this approach is given in Section 3.2. In Section 3.3 we suggest a new notion of optimality with respect to linear families of zero-estimators. We then find an optimal oracle estimator of τ² and calculate its improvement over the naive estimator.

We are now ready to suggest our novel approach, which incorporates the additional data that exist in the semi-supervised setting in order to improve the estimation of τ². We first describe the approach in general terms. Consider a random variable V ~ P, where P belongs to a family of distributions 𝒫. Let g(V) be a zero-estimator, i.e., E_P[g(V)] = 0 for all P ∈ 𝒫. Let T(V) be an unbiased estimator of a certain quantity of interest θ. Then the statistic U_c(V), defined by U_c(V) = T(V) + c g(V) for a fixed constant c, is also an unbiased estimator of θ. The variance of U_c(V) is

Var[U_c(V)] = Var[T(V)] + c² Var[g(V)] + 2c · Cov[T(V), g(V)].   (4)

Minimizing Var[U_c(V)] with respect to c yields the minimizer

c* = − Cov[T(V), g(V)] / Var[g(V)].   (5)

Notice that Cov[T(V), g(V)] ≠ 0 implies Var[U_{c*}(V)] < Var[T(V)]. In other words, by combining a correlated unbiased estimator of zero with the initial unbiased estimator of θ, one can lower the variance. Note that plugging c* into (4) reveals how much variance can potentially be reduced:

Var[U_{c*}(V)] = Var[T(V)] − {Cov[T(V), g(V)]}² / Var[g(V)] = (1 − ρ²) Var[T(V)],   (6)

where ρ is the correlation coefficient between T(V) and g(V). Therefore, it is best to find an unbiased zero-estimator g(V) which is highly correlated with T(V), the initial unbiased estimator of θ. It is important to notice that c* is an unknown quantity and, therefore, U_{c*} is not a statistic. However, in practice, one can estimate c* by some ĉ* and use the approximation U_{ĉ*} instead.

The following example illustrates how the zero-estimator approach can be applied to improve the naive estimator τ̂² in the simple linear model setting.

Example 1 (p = 1). Consider the simple linear model Y = βX + ε, where X ~ N(0, 1), ε has mean zero and finite variance σ², and ε is independent of X. By (6), we wish to find a zero-estimator g(X) which is correlated with τ̂². Consider the estimator U_c = τ̂² + c g(X), where g(X) ≡ (1/n) Σ_{i=1}^n (X_i² − 1) and c is a fixed constant. The variance of U_c is minimized by c* = −2β², and one can verify that Var(U_{c*}) = Var(τ̂²) − (8/n)β⁴.

The above example illustrates the potential of using additional information that exists in the semi-supervised setting to lower the variance of the initial naive estimator τ̂². However, it also raises the question: can we achieve a lower variance by adding different zero-estimators?
One might attempt to reduce the variance by adding further zero-estimators such as g_k(X) ≡ (1/n) Σ_{i=1}^n [X_i^k − E(X_i^k)] for k > 2. Surprisingly, this attempt will fail. Hence, the unbiased oracle estimator of τ², R ≡ τ̂² − 2β² g(X), is optimal with respect to zero-estimators of the form g_k(X). This unanticipated result motivated us to extend the idea of the optimal zero-estimator to a general regression setting with p covariates.

We now define a new oracle unbiased estimator of τ² and prove that under some regularity assumptions this estimator is optimal with respect to a family of zero-estimators. Here, optimality means that the variance cannot be further reduced by including additional zero-estimators from that given family. We now specifically define our notion of optimality in a general setting.

Definition 1.
Let T be an unbiased estimator of θ and let g_1, g_2, ... be a sequence of zero-estimators, i.e., E_θ(g_i) = 0 for i ∈ N. Let G = {Σ_{k=1}^m c_k g_k : c_k ∈ R, m ∈ N} be a family of zero-estimators. For a zero-estimator g* ∈ G, we say that R* ≡ T + g* is an optimal oracle estimator (OOE) of θ with respect to G if

Var_θ[R*] ≡ Var_θ[T + g*] ≤ Var_θ[T + g]   for all g ∈ G and for all θ.
We use the term oracle since g* ≡ Σ_{k=1}^m c*_k g_k for some optimal coefficients c*_1, ..., c*_m, which are a function of the unknown parameter θ. The following theorem suggests a necessary and sufficient condition for obtaining an OOE.

Theorem 1.
Let g_m = (g_1, ..., g_m)^T be a vector of zero-estimators and assume the covariance matrix M ≡ Var[g_m] is positive definite for every m. Then R* is an optimal oracle estimator (OOE) with respect to the family of zero-estimators G if and only if R* is uncorrelated with every zero-estimator g ∈ G, i.e., Cov_θ[R*, g] = 0 for all g ∈ G and for all θ.

Note that Theorem 1 is general and does not require assumptions (A1) to (A3). Returning to our setting, define the following oracle estimator

T_oracle = τ̂² − 2 Σ_{j=1}^p Σ_{j'=1}^p ψ_{jj'},   (7)

where ψ_{jj'} = β_j β_{j'} h_{jj'} and h_{jj'} = (1/n) Σ_{i=1}^n [X_ij X_ij' − E(X_ij X_ij')], and let G be the family of zero-estimators of the form

g_{k_1...k_p} = (1/n) Σ_{i=1}^n [X_{i1}^{k_1} · ... · X_{ip}^{k_p} − E(X_{i1}^{k_1} · ... · X_{ip}^{k_p})],

where (k_1, ..., k_p) ∈ {0, 1, 2, 3, ...}^p. The following theorem states that the oracle estimator above is an optimal improvement over the naive estimator in a setting with p covariates.

Theorem 2 (General p). Assume (A1)-(A3). Then the oracle estimator T_oracle defined in (7) is an OOE of τ² with respect to G.

Remark 1. For the proof of Theorem 2, assumptions (A1)-(A3) can be relaxed. More specifically, normality of the covariates is not required. Instead, it is enough to assume that µ = 0 and Σ = I, and that X has moments of all orders. Homoscedasticity of ε is also not required.

Proposition 1.
Assume (A1)-(A3). Then

Var(τ̂²) = (4/n) [ ((n−2)/(n−1)) (σ²_Y τ² + τ⁴) + (1/(2(n−1))) (p σ⁴_Y + 4 σ²_Y τ² + 3 τ⁴) ]   (8)

and

Var(T_oracle) = Var(τ̂²) − (8/n) τ⁴.   (9)

When p = n and τ² = σ², the optimal oracle estimator T_oracle reduces the variance of the naive estimator by 40%. Moreover, when p/n converges to zero, the reduction approaches 2/3. See Remark 3 in the Appendix for more details about the relative improvement of the optimal oracle estimator.
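The following small Monte Carlo sketch (an illustration we add here, not taken from the paper) checks the p = 1 case numerically: it compares the variance of the naive estimator τ̂² with that of the oracle-corrected estimator τ̂² − 2β² g(X) of Example 1, whose variance should be smaller by roughly 8β⁴/n by (9); the number of replications and the parameter values are arbitrary.

```python
# Monte Carlo sketch of the p = 1 oracle improvement of Example 1.
import numpy as np

rng = np.random.default_rng(1)
n, beta, sigma, reps = 100, 1.0, 1.0, 20000
naive, oracle = [], []
for _ in range(reps):
    x = rng.standard_normal(n)
    y = beta * x + sigma * rng.standard_normal(n)
    w = x * y
    tau2_hat = (w.sum() ** 2 - (w ** 2).sum()) / (n * (n - 1))  # eq. (3) with p = 1
    g = (x ** 2 - 1.0).mean()                                   # zero-estimator g(X)
    naive.append(tau2_hat)
    oracle.append(tau2_hat - 2 * beta ** 2 * g)                 # c* = -2 beta^2
# empirical variances versus the theoretical reduction 8 * beta^4 / n
print(np.var(naive), np.var(oracle), np.var(naive) - 8 * beta ** 4 / n)
```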
The Proposed Estimator

In this section we show how to use the zero-estimator approach to derive improved estimators over τ̂². In Section 4.1 we show that estimating all p² optimal coefficients given in (7) may introduce too much variance. Therefore, Section 4.2 introduces covariate selection methods to reduce the number of zero-estimators used in estimation. In Section 4.3 we suggest our estimator for τ² and present its properties. Section 4.4 discusses some practical considerations that need to be taken into account when the sample size is small.

The optimal oracle estimator defined in (7) is based on adding p² zero-estimators. Therefore, it is reasonable to suggest and study the following estimator instead of the oracle one:

T = τ̂² − 2 Σ_{j=1}^p Σ_{j'=1}^p ψ̂_{jj'},   where   ψ̂_{jj'} = (1/(n(n−1)(n−2))) Σ_{i1 ≠ i2 ≠ i3} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} − E(X_{i3 j} X_{i3 j'})]

is an estimator of ψ_{jj'} ≡ β_j β_{j'} h_{jj'}, and the sum runs over distinct indices i1, i2, i3. Notice that E(ψ̂_{jj'}) = 0 and that for i1 ≠ i2 we have E(W_{i1 j} W_{i2 j'}) = β_j β_{j'}; thus, T is an unbiased estimator. The variance of the estimator T is given in the following proposition.

Proposition 2.
Assume (A1)-(A3) and p/n = O(1). Then

Var(T) = Var(T_oracle) + 8 p² σ⁴_Y / n³ + O(n⁻²) = Var(τ̂²) − (8/n) τ⁴ + 8 p² σ⁴_Y / n³ + O(n⁻²),   (10)

where σ²_Y = τ² + σ², as before.

Note that (10) follows from (9). For p = n, the last equation can be rewritten as

Var(T) = Var(τ̂²) + (8/n)(2τ²σ² + σ⁴) + O(n⁻²),

and a similar result holds for p/n → c for some positive constant c. Therefore, the estimator T fails to improve the naive estimator τ̂². The third term in (10) reflects the additional variability that comes with the attempt to estimate all p² optimal coefficients. To overcome the problem that arises from estimating all p² parameters, we consider estimating only a small number of coefficients of the optimal estimator T_oracle. Specifically, let B ⊂ {1, ..., p} be a fixed set of some indices such that |B| ≪ p and consider the estimator T_B = τ̂² − 2 Σ_{j, j' ∈ B} ψ̂_{jj'}. By the same argument as in Proposition 2 we have

Var(T_B) = Var(τ̂²) − (8/n) τ⁴_B + O(n⁻²),   (11)

where τ²_B = Σ_{j ∈ B} β_j². Thus, if τ²_B is sufficiently large, then one can expect a significant improvement over the naive estimator by using a small number of zero-estimators. For example, when p = n, τ² = σ² = 1, and τ²_B = 0.5, then T_B reduces the variance of the naive estimator by 10%. See Remark 4 in the Appendix for more details. In practice the estimator T_B is an oracle estimator since the set of indices B is unknown. A desirable set B should contain a relatively small number of covariates that capture a significant amount of the signal level τ²_B. Choosing a "good" set of indices is discussed in the following section.
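As an illustration of how T_B can be computed for a given index set B, the sketch below implements the third-order U-statistic ψ̂_{jj'} and the corrected estimator T_B in Python. The function names are ours, the set B is assumed to be supplied, and E(X_j X_j') = 1{j = j'} is used since µ = 0 and Σ = I are treated as known; the sum over distinct triples is evaluated by inclusion-exclusion to avoid an O(n³) loop.

```python
# Sketch of the zero-estimator terms psi_hat_{jj'} and of T_B.
import numpy as np
from itertools import product

def psi_hat(W, X, j, jp):
    """Third-order U-statistic with mean zero, estimating beta_j * beta_j' * h_{jj'}."""
    n = W.shape[0]
    a, b = W[:, j], W[:, jp]
    c = X[:, j] * X[:, jp] - (1.0 if j == jp else 0.0)   # E(X_j X_j') known: 1{j = j'}
    # sum of a_{i1} b_{i2} c_{i3} over distinct indices, via inclusion-exclusion
    s = (a.sum() * b.sum() * c.sum()
         - (a * b).sum() * c.sum() - (a * c).sum() * b.sum() - (b * c).sum() * a.sum()
         + 2 * (a * b * c).sum())
    return s / (n * (n - 1) * (n - 2))

def T_B(X, Y, B):
    n = len(Y)
    W = X * Y[:, None]
    tau2_hat = (W.sum(axis=0) ** 2 - (W ** 2).sum(axis=0)).sum() / (n * (n - 1))
    correction = sum(psi_hat(W, X, j, jp) for j, jp in product(B, repeat=2))
    return tau2_hat - 2 * correction
```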
We now discuss covariate selection methods. We assume there is a set of "important" covariates B ≡ {j : β_j² > b}, where b is a positive constant, and assume that |B| = p_0, where p_0 is a fixed constant. In order to construct our improved estimator, we wish to identify the covariates that belong to the set B.

Definition 2.
We call δ a covariate selection algorithm if for every dataset (X_{n×p}, Y_{n×1}) it chooses a subset of indices from {1, ..., p}. We say that δ is a consistent selection algorithm if it satisfies P(B_δ = B) → 1 as n → ∞, where B_δ is the set of indices selected by δ.

Notice that the set B may depend on p, and the selection algorithm δ may depend on both n and p, but this is suppressed in the notation. The subject of high-dimensional covariate selection has already been extensively studied; see, e.g., Zambom and Kim (2018) and Oda et al. (2020) and references therein. Different covariate selection methods have different pros and cons, and there is no simple answer to the question of which method to use. Since we do not wish to limit ourselves to a specific method, we assume that we have in hand a consistent selection algorithm γ for the rest of the discussion.

Proposed Estimator and Properties

Let γ be a consistent selection algorithm. The proposed estimator for τ² is given in Algorithm 1.

Algorithm 1:
Proposed Estimator
Input:
A dataset (X_{n×p}, Y_{n×1}) and a consistent selection algorithm γ.

1. Calculate the naive estimator τ̂² = (1/(n(n−1))) Σ_{j=1}^p Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j}, where W_ij = X_ij Y_i.
2. Apply algorithm γ with ( X , Y ) to construct B γ .
3. Calculate the zero-estimator terms

ψ̂_{jj'} ≡ (1/(n(n−1)(n−2))) Σ_{i1 ≠ i2 ≠ i3} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} − E(X_{i3 j} X_{i3 j'})],

for all j, j' ∈ B_γ.

Result:
Return T_γ = τ̂² − 2 Σ_{j, j' ∈ B_γ} ψ̂_{jj'}.
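A self-contained sketch of Algorithm 1 is given below. It is only illustrative: the largest-gap selection rule of Algorithm 3 in Section 5 is used as a stand-in for the generic consistent selection algorithm γ, and all function names are ours.

```python
# Sketch of Algorithm 1 under mu = 0 and Sigma = I (known).
import numpy as np

def naive_tau2(W):
    n = W.shape[0]
    return (W.sum(axis=0) ** 2 - (W ** 2).sum(axis=0)).sum() / (n * (n - 1))

def psi_hat(W, X, j, jp):
    n = W.shape[0]
    a, b = W[:, j], W[:, jp]
    c = X[:, j] * X[:, jp] - (1.0 if j == jp else 0.0)
    s = (a.sum() * b.sum() * c.sum()
         - (a * b).sum() * c.sum() - (a * c).sum() * b.sum() - (b * c).sum() * a.sum()
         + 2 * (a * b * c).sum())
    return s / (n * (n - 1) * (n - 2))

def largest_gap_selection(W):
    """Stand-in for gamma: keep coefficients above the largest gap in the ordered beta_j^2 estimates."""
    beta_sq = (W ** 2).mean(axis=0) - W.var(axis=0, ddof=1)
    order = np.argsort(beta_sq)
    cut = np.argmax(np.diff(beta_sq[order]))   # position of the largest gap
    return order[cut + 1:]

def algorithm1(X, Y):
    W = X * Y[:, None]                         # Step 1: naive-estimator ingredients
    tau2 = naive_tau2(W)
    B = largest_gap_selection(W)               # Step 2: selected covariates B_gamma
    corr = sum(psi_hat(W, X, j, jp) for j in B for jp in B)   # Step 3
    return tau2 - 2 * corr                     # Result: T_gamma
```

In practice any consistent selection algorithm can be plugged in for the selection step; the zero-estimator correction itself does not depend on how B_γ was obtained.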
Some asymptotic properties of T_γ are given by the following proposition.

Proposition 3. Assume that lim_{n→∞} n [P(B_γ ≠ B)]^{1/2} = 0, and that the first four moments of T_γ and T_B are bounded. Then

1. lim_{n→∞} n [Bias(T_γ, τ²)] = 0, and
2. lim_{n→∞} n [Var(T_γ) − Var(T_B)] = 0,

where Bias(T_γ, τ²) ≡ E(T_γ) − τ² and Var(T_B) is given by (11).

Notice that the requirement lim_{n→∞} n [P(B_γ ≠ B)]^{1/2} = 0 is stronger than just consistency. The following claim is a direct corollary of Proposition 3.

Corollary 2.
Under the assumptions of Proposition 3, lim_{n,p→∞} n [MSE(T_γ) − MSE(T_B)] = 0.

We now suggest estimators for the variances of the naive and the proposed estimators. Let

Var̂(τ̂²) = (4/n) [ ((n−2)/(n−1)) (σ̂²_Y τ̂² + τ̂⁴) + (1/(2(n−1))) (p σ̂⁴_Y + 4 σ̂²_Y τ̂² + 3 τ̂⁴) ]

and let

Var̂(T_γ) = Var̂(τ̂²) − (8/n) (τ̂²_{B_γ})²,   (12)

where σ̂²_Y = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)² and τ̂²_{B_γ} = Σ_{j ∈ B_γ} β̂_j². The following proposition shows that Var̂(τ̂²) and Var̂(T_γ) are consistent.

Proposition 4. Assume (A1)-(A3), p/n = O(1), and the regularity conditions of Proposition 3. Then

1. n [Var̂(τ̂²) − Var(τ̂²)] → 0 in probability, and
2. n [Var̂(T_γ) − Var(T_γ)] → 0 in probability,

where Var(τ̂²) is given by (8).

Remark 2. Some caution regarding the estimator T_γ is needed in practice. When n is insufficiently large, B_γ might differ from B and Proposition 4.2 no longer holds. Specifically, let S ∩ B_γ and B ∩ S_γ be the sets of false positive and false negative errors, respectively, where S = {1, ..., p} \ B and S_γ = {1, ..., p} \ B_γ. While false negatives merely result in not including some potential zero-estimator terms in our proposed estimator, false positives can lead to a substantial bias. This is true since the expected value of a post-selected zero-estimator is not necessarily zero anymore.

A common approach to overcome this problem is to randomly split the data into two parts, where the first part is used for covariate selection and the second part is used for evaluation of the zero-estimator terms. With this in mind, we suggest in Algorithm 2 a modified version of the proposed estimator.
Algorithm 2:
Modified Estimator
Input:
A dataset (X_{n×p}, Y_{n×1}) and a consistent selection algorithm γ.

1. Calculate the naive estimator τ̂² = (1/(n(n−1))) Σ_{j=1}^p Σ_{i1 ≠ i2} W_{i1 j} W_{i2 j}, where W_ij = X_ij Y_i.

Refitting Step:
For 1 ≤ m ≤ M:

• Randomly split the sample into two equal subsets, (X, Y)^m_1 and (X, Y)^m_2.
• Apply algorithm γ to (X, Y)^m_1 to construct B^m_γ.
• Use (X, Y)^m_2 to calculate the zero-estimator terms

ψ̂^m_{jj'} ≡ (1/(n_2(n_2−1)(n_2−2))) Σ_{i1 ≠ i2 ≠ i3} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} − E(X_{i3 j} X_{i3 j'})],

where the distinct indices i1, i2, i3 run over the rows of (X, Y)^m_2, n_2 = n/2, and j, j' ∈ B^m_γ.
• Calculate U_m = τ̂² − 2 Σ_{j, j' ∈ B^m_γ} ψ̂^m_{jj'}.

Result:
Return U* = (1/M) Σ_{m=1}^M U_m.

Algorithm 2 uses half of the data for covariate selection and the second half for estimation of the zero-estimator terms. The split step is repeated multiple times and then averaged before returning the final output. While splitting helps to avoid post-selection biases, averaging over multiple splits reduces the noise that comes with estimating the zero-estimator terms.

We now provide a simulation study to illustrate our estimator's performance. Here, we choose the following simple selection algorithm γ:

Algorithm 3:
Covariate selection γ

Input:
A dataset (X_{n×p}, Y_{n×1}).

1. Calculate β̂_1², ..., β̂_p², where β̂_j² is given in (2), for j = 1, ..., p.
2. Calculate the differences λ_j = β̂²_(j) − β̂²_(j−1) for j = 2, ..., p, where β̂²_(1) < β̂²_(2) < ... < β̂²_(p) denote the order statistics.

3. Select the covariates B_γ = {j : β̂_j² > β̂²_(j*)}, where j* = arg max_j λ_j.

Result:
Return B_γ.

The algorithm above finds the largest gap between the ordered estimated squared coefficients and then uses this gap as a threshold to select the set of large coefficients B_γ. The algorithm works well in scenarios where a relatively large gap truly separates the set of large coefficients B from the set of small coefficients S.

We fix β_j² = τ²_B/5 for j = 1, ..., 5, and β_j² = (τ² − τ²_B)/(p − 5) for j = 6, ..., p, where τ² and τ²_B vary among different scenarios. The number of observations and covariates is n = p = 100, 300, and 500, and the residual variance is σ² = 1. For each scenario, we generated 1000 independent datasets and estimated τ² using the different estimators. We report the results of five estimators: naive, oracle, proposed, modified and PSI. The first four are given by (3), (7), Algorithm 1 and Algorithm 2, respectively; the PSI (Post Selective Inference) estimator was calculated using the estimateSigma function from the selectiveInference R package. The modified estimator was calculated based on M = 10 splits. Boxplots of the estimates are plotted in Figure 1 and results for the RMSE are given in Table 1. Code for reproducing the results is available at https://git.io/Jt6bC.

Figure 1 demonstrates that:

• As n and p increase, the bias of the proposed estimator decreases, which in turn causes it to outperform the modified and the naive estimators for large n and p. This is compatible with the fact that the selection algorithm γ tends to be more successful in identifying B and S as n gets larger.
• The modified estimator demonstrates an improvement over the naive estimator in all the scenarios.
• As τ² increases, the improvements of all estimators over the naive are more substantial. For example, for n = p = 500, the relative improvement in RMSE of the oracle estimator over the naive increases as τ² changes from τ² = 1 to τ² = 2. More generally, from other simulation results that are not shown here, we found that the larger the signal-to-noise ratio τ²/σ², the larger the improvement.
• The PSI estimator is biased. This results in a larger RMSE compared to the other estimators for large n and p. This is not surprising since the PSI estimator is based on the LASSO method, which assumes a sparse model and ignores small coefficients.
Figure 1: Boxplots representing the distribution of the different estimators of τ². The x-axis stands for the sample size. The red dashed line is the true value of τ².

Table 1: Summary statistics for the different estimators of τ²; σ̂(RMSE) was calculated using the delta method. The signal that is attributed to the large coefficients, calculated by τ²_B/τ², was set to 0.5. The estimator with the lowest RMSE (excluding the oracle) is in bold. Columns: τ², n, p, Estimator, Mean, Bias, SE, RMSE, σ̂(RMSE). [Table 1 entries not preserved in this extraction.]
In the next scenario the signal that is attributed to the large coefficients, calculated by τ²_B/τ², was fixed at 0.99, which corresponds to a sparse setting. The simulation results are presented in Figure 2 and in Table 2. It is demonstrated that for large n and p the proposed estimator performs almost as well as the oracle estimator, which confirms the theoretical result given in Proposition 3. Notice that in this setting the PSI estimator outperforms all other estimators. This is not surprising since the PSI estimator is based on the LASSO method, which is known to work well when the true model that generates the data is sparse, as in this scenario.
Figure 2: Boxplots representing the distribution of the different estimators of τ² (τ² = 1, τ²_B = 0.99). The x-axis stands for the sample size. The red dashed line is the true value of τ².

Table 2: Summary statistics for the sparse scenario (τ²_B/τ² = 0.99), with the same columns as in Table 1. [Table 2 entries not preserved in this extraction.]
The suggested methodology in this paper is not limited to improving only the naive estimator, but can also be generalized to other estimators. The key is to add zero-estimators that are highly correlated with our initial estimator of τ²; see Eq. (6). Unlike the naive estimator, which is represented by a closed-form expression, other common estimators, such as the EigenPrism estimator (Janson et al., 2017), are computed numerically and do not have a closed-form representation. This makes the task of finding optimal zero-estimators somewhat more challenging since the zero-estimators' coefficients also need to be computed numerically. A comprehensive theory that generalizes the zero-estimator approach to estimators other than the naive one is beyond the scope of this work. However, here we present a general algorithm that achieves improvement without claiming optimality. The algorithm below calculates an initial estimator of τ² and applies a consistent selection algorithm to choose the set of relevant covariates. Then, it approximates the optimal oracle coefficients from bootstrap samples and, similarly to Algorithm 1, returns a new estimator that is composed of both the initial estimator of τ² and a zero-estimator.

Algorithm 4:
Empirical Estimator
Input:
A dataset (X, Y), an initial estimator τ̃², and a consistent selection algorithm γ.

1. Calculate an initial estimator τ̃² of τ².

2. Apply algorithm γ to construct B_γ.

3. Bootstrap step:
• Resample with replacement n observations from (X, Y).
• Calculate the initial estimator τ̃² on the resampled data.
• Calculate the zero-estimators h_{jj'} = (1/n) Σ_{i=1}^n [X_ij X_ij' − E(X_ij X_ij')] on the resampled data, for j, j' ∈ B_γ.
This procedure is repeated B times in order to produce (τ̃²)*_1, ..., (τ̃²)*_B and h*_{1,jj'}, ..., h*_{B,jj'} for all j, j' ∈ B_γ.
4. Approximate the coefficients by

ĉ_{jj'} = Cov̂(τ̃², h_{jj'}) / Var(h_{jj'}),

where Cov̂(·) denotes the empirical covariance computed from the bootstrap samples, and Var(h_{jj'}) is known from the semi-supervised setting.

Result:
Return the empirical estimator U_emp = τ̃² − Σ_{j, j' ∈ B_γ} ĉ_{jj'} h_{jj'}.

We now demonstrate the performance of the empirical estimator given by Algorithm 4 together with two initial estimators: the EigenPrism (Janson et al., 2017) and the PSI, which is described in Taylor and Tibshirani (2018) and was used in Section 5. We considered the same sparse setting as in Section 5 and an additional setting where τ² = 5. For computational reasons, Step 3 of Algorithm 4 was only calculated once for all 1000 simulations. Results are given in Tables 3-4 and the code for reproducing the results is available at https://git.io/Jt6bC.

Tables 3-4 demonstrate that the standard error of the empirical estimators is equal to or lower than that of the initial estimators, and that as τ² increases, the improvement over the initial estimators is more substantial. This highlights the fact that the zero-estimator approach is not limited to improving only the naive estimator but rather has the potential to improve other estimators as well.
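For completeness, here is a rough Python sketch of the bootstrap step of Algorithm 4. It is illustrative only: initial_est stands for any initial estimator of τ², the set B is assumed to come from a selection algorithm, the number of bootstrap replications is an arbitrary choice, and Var(h_{jj'}) = (1 + 1{j = j'})/n is used, which holds when the covariates are standard normal.

```python
# Sketch of the bootstrap approximation of the zero-estimator coefficients in Algorithm 4.
import numpy as np

def empirical_estimator(X, Y, B, initial_est, n_boot=200, seed=None):
    rng = np.random.default_rng(seed)
    n = len(Y)
    pairs = [(j, jp) for j in B for jp in B]
    tau_boot = np.empty(n_boot)
    h_boot = np.empty((n_boot, len(pairs)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample with replacement
        Xb, Yb = X[idx], Y[idx]
        tau_boot[b] = initial_est(Xb, Yb)
        h_boot[b] = [(Xb[:, j] * Xb[:, jp]).mean() - (1.0 if j == jp else 0.0)
                     for j, jp in pairs]
    # zero-estimators on the original covariates
    h = np.array([(X[:, j] * X[:, jp]).mean() - (1.0 if j == jp else 0.0)
                  for j, jp in pairs])
    correction = 0.0
    for k, (j, jp) in enumerate(pairs):
        cov = np.cov(tau_boot, h_boot[:, k])[0, 1]   # empirical Cov(tau_tilde, h_{jj'})
        var_h = (2.0 if j == jp else 1.0) / n        # known Var(h_{jj'}) under N(0, I)
        correction += (cov / var_h) * h[k]
    return initial_est(X, Y) - correction
```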
Table 3: Summary statistics equivalent to Table 1 for the PSI estimator. [Table 3 entries not preserved in this extraction.]

Table 4: Summary statistics equivalent to Table 1 for the EigenPrism estimator. [Table 4 entries not preserved in this extraction.]

Discussion
This paper presents a new approach for improving estimation of the explained variance τ² of a high-dimensional regression model in a semi-supervised setting without assuming sparsity. The key idea is to use zero-estimators that are correlated with the initial unbiased estimator of τ² in order to lower its variance without introducing additional bias. The semi-supervised setting, where the number of observations is much greater than the number of responses, allows us to construct such zero-estimators. We introduced a new notion of optimality with respect to zero-estimators and presented an oracle estimator that achieves this type of optimality. Based on the oracle estimator's form, we presented an improved (non-oracle) estimator that achieves a significant, but not optimal, reduction in the asymptotic variance of the naive estimator. Our simulations showed that our approach can be generalized to initial estimators other than the naive estimator.

Many open questions remain for future research. While our proposed estimator improves the naive estimator, it does not achieve the optimal improvement of the oracle estimator. Thus, it remains unclear if and how one can achieve optimal improvement. Moreover, in this work strong distributional assumptions about the data are made. For example, we assume that the covariates are normally distributed. This assumption can be relaxed since we used normality only to calculate the first few moments of X. Thus, generalizing the suggested approach by relaxing the normality assumption is a natural direction for future work. A more ambitious future goal would be to extend the suggested approach to generalized linear models (GLM), and specifically to logistic regression. In this case, the concepts of signal and noise levels are less clear and more challenging to define.

References
Cai, T. and Z. Guo (2020). Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika 101(2), 269–284.

Eichler, E. E., J. Flint, G. Gibson, A. Kong, S. M. Leal, J. H. Moore, and J. H. Nadeau (2010). Missing heritability and strategies for finding the underlying causes of complex disease. Nature Reviews Genetics 11(6), 446–450.

Fan, J., S. Guo, and N. Hao (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74(1), 37–65.

Gorfine, M., S. I. Berndt, J. Chang-Claude, M. Hoffmeister, L. Le Marchand, J. Potter, M. L. Slattery, N. Keret, U. Peters, and L. Hsu (2017). Heritability estimation using a regularized regression approach (HERRA): Applicable to continuous, dichotomous or age-at-onset outcome. PloS One 12(8), e0181269.

Hoeffding, W. (1977). Some incomplete and boundedly complete families of distributions. The Annals of Statistics, 278–291.

Janson, L., R. F. Barber, and E. Candes (2017). EigenPrism: inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(4), 1037–1065.

Kong, W. and G. Valiant (2018). Estimating learnability in the sublinear data regime. Advances in Neural Information Processing Systems 31, 5455–5464.

Oda, R., H. Yanagihara, et al. (2020). A fast and consistent variable selection method for high-dimensional multivariate linear regression with a large number of explanatory variables. Electronic Journal of Statistics 14(1), 1386–1412.

Schwartzman, A., A. J. Schork, R. Zablocki, W. K. Thompson, et al. (2019). A simple, consistent estimator of SNP heritability from genome-wide association studies. The Annals of Applied Statistics 13(4), 2509–2538.

Sun, T. and C.-H. Zhang (2012). Scaled sparse linear regression. Biometrika 99(4), 879–898.

Taylor, J. and R. Tibshirani (2018). Post-selection inference for ℓ1-penalized likelihood models. Canadian Journal of Statistics 46(1), 41–61.

van der Vaart, A. W. (2000). Asymptotic Statistics. Cambridge University Press.

Zambom, A. Z. and J. Kim (2018). Consistent significance controlled variable selection in high-dimensional regression. Stat 7(1), e210.

Appendix
Proof of Lemma 1 :Notice that X T Y = Å n P i =1 W i , ..., n P i =1 W ip ã T where X is the n × p design matrix and Y = ( Y , ..., Y n ) T . Let U be thecentered Naive estimator, i.e., U ≡ ˆ τ − τ = 1 n ( n − X i = i p X j =1 W i j W i j − τ = (cid:13)(cid:13) X T Y (cid:13)(cid:13) − p P j =1 n P i =1 W ij n ( n − − τ . The Dicker estimate for τ ia given by ˆ τ Dicker ≡ k X T Y k − p k Y k n ( n +1) . Let U Dicker ≡ ˆ τ Dicker − τ be the centeredDicker estimator (Dicker, 2014). We need to prove that root- n times the difference between the estimators converges inprobability to zero, i.e., √ n ( U Dicker − U ) p → . We have, √ n ( U Dicker − U ) = √ n Ü (cid:13)(cid:13) X T Y (cid:13)(cid:13) − p k Y k n ( n + 1) − (cid:13)(cid:13) X T Y (cid:13)(cid:13) − p P j =1 n P i =1 W ij n ( n − ê = √ n Ü p P j =1 n P i =1 W ij n ( n − − p k Y k n ( n + 1) − (cid:13)(cid:13) X T Y (cid:13)(cid:13) n ( n −
1) ( n + 1) ê . (13)It is enough to prove that:1. n − . Ç p P j =1 n P i =1 W ij − p k Y k å p → , n − . Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä p → . We start with the first term, n − . p X j =1 n X i =1 W ij − p k Y k ! = n − . p X j =1 n X i =1 Y i X ij − p n X i =1 Y i ! = n − . n X i =1 Y i p X j =1 X ij − p n X i =1 Y i ! = n − . n X i =1 Y i " n p X j =1 Ä X ij − ä ≡ n − . n X i =1 ω i (14)where ω i = Y i ñ n P j (cid:8) X ij − (cid:9) ô . Notice that ω i depends on n but this is suppressed in the notation. In order to showthat n − . n P i =1 ω i p → , it is enough to show that E Å n − . n P i =1 ω i ã → and Var Å n − . n P i =1 ω i ã → . Moreover, since E Å n − . n P i =1 ω i ã = √ nE ( ω i ) and Var Å n − . n P i =1 ω i ã = Var ( ω i ) = E (cid:0) ω i (cid:1) − [ E ( ω i )] , it is enough to show that √ nE ( ω i ) and E (cid:0) ω i (cid:1) converge to zero.Consider now √ nE ( ω i ) . By (14) we have n X i =1 ω i = 1 n " p X j =1 n X i =1 W ij − p k Y k . aking expectation of both sides, n X i =1 E ( ω i ) = 1 n " p X j =1 n X i =1 E Ä W ij ä − pE Ä k Y k ä . Now, notice that E ( W ij ) = E [ X ij ( β T X + ǫ ) ] = k β k + σ + β j [ E ( X ij ) −
1] = τ + σ + 2 β j . (15)Also notice that Y i / (cid:0) σ ε + τ (cid:1) ∼ χ , and hence E Ä k Y k ä = n (cid:0) τ + σ (cid:1) . Therefore, nE ( ω i ) = 1 n " p X j =1 n X i =1 Ä τ + σ + 2 β j ä − pn Ä τ + σ ä = 1 n " n X i =1 î p Ä τ + σ ä + 2 τ ó − pn Ä τ + σ ä = 2 τ which implies that √ nE ( ω i ) = τ √ n → . Consider now E ( ω i ) . By Cauchy-Schwartz, E Ä ω i ä = E Y i " n − p X j =1 ¶ X ij − © ! ≤ ¶ E Ä Y i ä© / ( E " n − p X j =1 ¶ X ij − © !) / . Notice that Y i ∼ N (cid:0) , τ + σ (cid:1) by construction and therefore E ( Y i ) = O (1) as n and p go to infinity. Let V j = X ij − and notice that E ( V j ) = 0 . We have E " n − p X j =1 ¶ X ij − © ! = E " n − p X j =1 V j ! = n − X j ,j ,j ,j E ( V j V j V j V j ) . The expectation P j ,j ,j ,j E ( V j V j V j V j ) is not 0 when j = j and j = j (up to permutations) or when all terms areequal. In the first case we have X j = j ′ E Ä V j V j ′ ä = X j = j ′ î E Ä V j äó = p ( p − h E n Ä X ij − ä oi ≤ C p , for a positive constant C . In the second case we have p P j =1 E (cid:0) V j (cid:1) = pE î (cid:0) X ij − (cid:1) ó ≤ C p, for a positive constant C .Hence, as p and n have the same order of magnitude, we have ( E " n − p X j =1 ¶ X ij − © ! / = ( n − X j ,j ,j ,j E ( V j V j V j V j ) ) / ≤ ¶ n − · O Ä p ä© / ≤ K/n, which implies E ( ω i ) ≤ K /n → , where K and K are positive constants. This completes the proof that n − . Ç p P j =1 n P i =1 W ij − p k Y k å p → .We now move to prove that n − . Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä p → . By Markov’s inequality, for ǫ > P Å n − . (cid:13)(cid:13)(cid:13) X T Y (cid:13)(cid:13)(cid:13) > ε ã ≤ n − . E Å (cid:13)(cid:13)(cid:13) X T Y (cid:13)(cid:13)(cid:13) ã /ε. Thus, it is enough to show that n − E Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä is bounded. Notice that E Å (cid:13)(cid:13)(cid:13) X T Y (cid:13)(cid:13)(cid:13) ã = X i ,i p X j =1 E ( W i j W i j ) = n X i =1 p X j =1 E Ä W ij ä + X i = i p X j =1 E ( W i j W i j )= n X i =1 p X j =1 Ä τ + σ + 2 β j ä + X i = i p X j =1 β j = n î p Ä τ + σ ä + 2 τ ó + n ( n − τ = n î p Ä τ + σ ä + ( n + 1) τ ó , here we used (15) in the third equality. Therefore, n − E Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä = n − (cid:2) p (cid:0) τ + σ (cid:1) + ( n + 1) τ (cid:3) . Since p and n have the same order of magnitude and τ + σ is bounded by assumption, then n − E Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä is also bounded. Thiscompletes the proof of n − . Ä (cid:13)(cid:13) X T Y (cid:13)(cid:13) ä p → and hence √ n (cid:0) ˆ τ Dicker − ˆ τ (cid:1) p → . (cid:4) Proof of Corollary 1 :According to Corollary 1 in Dicker (2014), we have √ nU Dicker ψ D → N (0 , , given that p/n converges to a constant. Write √ nUψ = 1 ψ (cid:2) √ n ( U − U Dicker ) + √ nU Dicker (cid:3) , and obtain √ nUψ D → N (0 , by Slutsky’s theorem. (cid:4) Calculations for Example 1 : Cov [ˆ τ , g ( X )] ≡ Cov n ( n − X i for all non-zero b ∈ R m . Thus, (18) issatisfied only if b ≡ , i.e., Cov [ R ∗ , g m ] = which also implies Cov [ R ∗ , P mk =1 c k g k ] = 0 for any c , ..., c m ∈ R . Therefore,Cov [ R ∗ , g ] = 0 for all g ∈ G .2. We now prove the other direction: if R ∗ is uncorrelated with all zero-estimators of a given family G then it is an OOE.Let R ∗ = T + g ∗ and R ≡ T + g be unbiased estimators of θ , where g ∗ , g ∈ G . Define ˜ g ≡ R ∗ − R = g ∗ − g and noticethat ˜ g ∈ G . Since by assumption R ∗ is uncorrelated with ˜ g , Cov [ R ∗ , ˜ g ] ≡ Cov [ R ∗ , R ∗ − R ] = Var [ R ∗ ] − Cov [ R ∗ , R ] , and hence Var [ R ∗ ] = Cov [ R ∗ , R ] . 
By the Cauchy–Schwarz inequality, ( Cov [ R ∗ , R ]) ≤ Var [ R ∗ ] Var [ R ] , we conclude thatVar [ R ∗ ] ≤ Var [ R ] = Var [ T + g ] for all g ∈ G . (cid:4) Proof of Theorem 2:
We start by proving Theorem 2 for the special case of p = 2 and then generalize for p > . By Theorem 1 we need toshow that Cov ( T oracle , g k k ) = 0 for all ( k , k ) ∈ N where g k k = n n P i =1 î X k i X k i − E Ä X k i X k i äó . Write,Cov ( T oracle , g k k ) = Cov Ñ ˆ τ − X j =1 2 X j ′ =1 ψ jj ′ , g k k é = Cov Ä ˆ τ , g k k ä − X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) . Thus, we need to show that Cov Ä ˆ τ , g k k ä = 2 X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) . (19) e start with calculating the LHS of (19), namely Cov (cid:0) ˆ τ , g k k (cid:1) . Recall that ˆ τ ≡ ˆ β + ˆ β and thereforeCov [ˆ τ , g k k ] = Cov ( ˆ β , g k k ) + Cov ( ˆ β , g k k ) . Now, for all ( k , k ) ∈ N we haveCov [ ˆ β , g k k ] ≡ Cov n ( n − X i
A, B, C and D are functions of ( k , k ) but this is suppressed in the notation. Write, E [ X k +111 X k Y ] = E [ X k +111 X k ( β X + β X + ǫ )]= β E [ X k +211 X k ] + β E [ X k +111 X k +112 ] = β A + β B. Thus, rewrite (20) and obtain Cov [ ˆ β , g k k ] = 2 n Ä [ β A + β B ] β − β C ä . (21)Similarly, by symmetry, Cov [ ˆ β , g k k ] = 2 n Ä [ β D + β B ] β − β C ä . (22)Using (21) and (22) we getCov [ˆ τ , g k k ] = Cov ( ˆ β , g k k ) + Cov ( ˆ β , g k k )= 2 n Ä [ β A + β B ] β − β C + [ β D + β B ] β − β C ä = 2 n L z }| { β A + β D + L z }| { β β B − L z }| { C Ä β + β ä = 2 n ( L + L − L ) . (23)We now move to calculate the RHS of (19), namely P j =1 2 P j ′ =1 Cov ( ψ jj ′ , g k k ) . First, recall that h jj ≡ n n P i =1 [ X ij X ij ′ − E ( X ij X ij ′ )] and g k k ≡ n n P i =1 î X k i X k i − E Ä X k i X k i äó where ( k , k ) ∈ N . Hence, h ≡ n n P i =1 (cid:0) X i − (cid:1) which by definition is lso equal to g . Similarly, we have h = h ≡ n n P i =1 ( X i X i ) = g and h ≡ n n P i =1 (cid:0) X i − (cid:1) = g . Thus, X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) = X j =1 2 X j ′ =1 β j β j ′ Cov ( h jj ′ , g k k )= β Cov ( h , g k k ) + 2 β β Cov ( h , g k k ) + β Cov ( h , g k k )= β Cov ( g , g k k ) + 2 β β Cov ( g , g k k ) + β Cov ( g , g k k ) . (24)Now, observe that for every ( k , k , d , d ) ∈ N , Cov [ g k k , g d d ] = Cov (cid:16) n n X i =1 [ X k i X k i − E ( X k i X k i )] , n n X i =1 [ X d i X d i − E ( X d i X d i )] (cid:17) = 1 n n X i =1 n X i =1 ( E [ X k i X k i X d i X d i ] − E [ X k i X k i ] E [ X d i X d i ])= 1 n (cid:16) E [ X k + d X k + d ] − E [ X k X k ] E [ X d X d ] (cid:17) , (25)where the third equality holds since the terms with i = i vanish. It follows from (25) thatCov [ g k k , g ] = n Ä E [ X k +211 X k ] − E [ X k X k ] ä = n ( A − C ) , Cov [ g k k , g ] = n E [ X k +111 X k +112 ] = Bn , Cov [ g k k , g ] = n Ä E [ X k X k +212 ] − E [ X k X k ] ä = n ( D − C ) Therefore, rewrite (24) to get X j =1 2 X j ′ =1 Cov ( ψ jj ′ , g k k ) = 2 n L z }| { β A + β D + L z }| { β β B − L z }| { C Ä β + β ä = 2 n ( L + L − L ) (26)which is exactly the same expression as in (23). Hence, equation (19) follows which completes the proof of Theorm 2 for p = 2 . We now generalize the proof for p > . Similarly to (19) we want to show thatCov Ä ˆ τ , g k ...k p ä = 2 p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) . (27)We begin by calculating the LHS of (27), i.e., the covariance between ˆ τ and g k ...k p . By the same type of calculationsas in (20), for all ( k , ..., k p ) ∈ N p we haveCov î ˆ β j , g k ,...,k p ó =2 n β j E Ñ X k j +21 j Y m = j X k m m é + X j = j ′ β j ′ E Ñ X k j +11 j X k j ′ +11 j Y m = j,j ′ X k m m é β j − β j E p Y m =1 X k m m ! umming the above expressions for j = 1 , . . . , p , yieldsCov î ˆ τ , g k ,....,k p ó = p X j =1 Cov î ˆ β j , g k ,....,k p ó = 2 n p X j =1 β j E Ñ X k j +21 j Y m = j X k m m é + 2 n X j = j ′ β j β j ′ E Ñ X k j +11 j X k j ′ +11 j ′ Y m = j,j ′ X k m m é − n p X j =1 β j E p Y m =1 X k m m ! ≡ n (cid:0) L + L − L (cid:1) , (28)where L , L and L are just a generalization of the notation given in (23). Again, notice that L , L and L are functionsof k , ..., k p but this is suppressed in the notation.We now move to calculate the RHS of (27), namely p P j =1 p P j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) . Since ψ jj ′ = β j β j ′ h jj ′ we have, p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) = p X j =1 p X j ′ =1 β j β j ′ Cov (cid:0) h jj ′ , g k ...k p (cid:1) . 
(29)Again, notice the relationship between h jj ′ and g k ...k p : when j = j ′ we have h jj ≡ n n P i =1 (cid:0) X ij − (cid:1) = g ... ... , (i.e., the j -th entry is 2 and all others are 0), and for j = j ′ we have h jj ′ ≡ n n P i =1 X ij X ij ′ = g ... ... ... , (i.e., the j -th and j ’-thentries are 1 and all other entries are 0). Hence, p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) = p X j =1 p X j ′ =1 β j β j ′ Cov (cid:0) h jj ′ , g k ...k p (cid:1) = p X j =1 β j Cov (cid:0) g ... ... , g k ...k p (cid:1) + X j = j ′ β j β j ′ Cov (cid:0) g ... ... ... , g k ...k p (cid:1) . (30)Now, similar to (25), for all pairs of index vectors ( k , ..., k p ) ∈ N p , and ( k ′ , ..., k ′ p ) ∈ N p Cov Ä g k ,...,k p , g k ′ ,...,k ′ p ä = 1 n ( E p Y j =1 X k j + k ′ j j ! − E p Y j =1 X k j j ! E p Y j =1 X k ′ j j !) (31)This implies that Cov (cid:2) g ... ... , g k ,...,k p (cid:3) = 1 n E Ñ X k j +21 j Y m = j X k m m é − E p Y m =1 X k m m ! and Cov (cid:2) g ... ... ... , g k ,...,k p (cid:3) = 1 n E Ñ X k j +11 j X k j ′ +11 j ′ Y m = j,j ′ X k m m é . Hence, rewrite (30) to see that p X j =1 p X j ′ =1 Cov (cid:0) ψ jj ′ , g k ...k p (cid:1) = 2 p X j =1 β j Cov (cid:0) g ... ... , g k ...k p (cid:1) + 2 X j = j ′ β j β j ′ Cov (cid:0) g ... ... ... , g k ...k p (cid:1) = 2 n p X j =1 β j E Ñ X k j +21 j Y m = j X k m m é − E p Y m =1 X k m m ! +2 n X j = j ′ β j β j ′ E Ñ X k j +11 j X k j ′ +11 j ′ Y m = j,j ′ X k m m é = 2 n (cid:0) L − L + L (cid:1) , (32) hich is exactly the same expression as in (28). Hence, equation (27) follows which completes the proof of Theorem 2. (cid:4) Proof of Proposition 1 :Let W i = ( W i , ..., W ip ) T and notice that ˆ τ = n ( n − n P i = i p P j =1 W i j W i j is a U-statistic of order 2 with the kernel h ( w , w ) = w T w = p P j =1 w j w j , where w i ∈ R p . By Chapter 12 in van der Vaart (2000), the variance of a U-statisticof order r is given by Var ( U ) = r X c =1 r ! c ! ( r − c ) ! ( n − r ) ( n − r − ... ( n − r + c + 1) n ( n − ... ( n − r + 1) ζ c where ζ c is the covariance of the kernels if c variables are in common. In our case r = 2 , and thus ζ = Cov î h ( W , W ) , h Ä W , › W äó and ζ = Cov [ h ( W , W ) , h ( W , W )] where › W is an independent copy of W . Therefore, the variance of the naiveestimator has the following form Var Ä ˆ τ ä = 4 ( n − n ( n − ζ + 2 n ( n − ζ . (33)Now, let A = E (cid:0) W i W Ti (cid:1) be a p × p matrix and notice that ζ = Cov î h ( W , W ) , h Ä W , › W äó = p X j,j ′ Cov Ä W j W j , W j ′ f W j ′ ä = p X j,j ′ Ä β j β j ′ E [ W j W j ′ ] − β j β j ′ ä = β T A β − k β k and ζ = Cov [ h ( W , W ) , h ( W , W )]= X j,j ′ Cov ( W j W j , W j ′ W j ′ ) = X j,j ′ Ä ( E [ W j W j ′ ]) − β j β j ′ ä = k A k F − k β k , where k A k F is the Frobenius norm of A . Thus, by rewriting (33) the variance of the naive estimator is given byVar Ä ˆ τ ä = 4 ( n − n ( n − î β T A β − k β k ó + 2 n ( n − î k A k F − k β k ó . (34)Lastly, since we assume standard Gaussian covariates, one can verify that ζ = β T A β − k β k = σ Y τ + τ and ζ = k A k F − k β k = pσ Y + 4 σ Y τ + 3 τ , where σ Y = σ + τ by (1) and hence (8) follows.We now move to prove that Var ( T oracle ) = Var (cid:0) ˆ τ (cid:1) − n τ . Write,Var ( T oracle ) = Var Ñ ˆ τ − X j,j ′ ψ jj ′ é = Var Ä ˆ τ ä − X j,j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä + 4 Var Ñ X j,j ′ ψ jj ′ é . (35)Consider P j,j ′ β j β j ′ Cov (cid:0) ˆ τ , h jj ′ (cid:1) . 
We have X j,j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä = p X j =1 β j Cov Ä ˆ τ , h jj ä + X j = j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä = p X j =1 β j Cov Ä ˆ τ , g ... ... ä + X j = j ′ β j β j ′ Cov Ä ˆ τ , g ... ... ... ä = p X j =1 β j ï n β j ò + X j = j ′ β j β j ′ ï n β j β j ′ ò = 4 n τ , here the second and third equality are justified by (30) and (28) respectively. Consider now Var Ç P j,j ′ ψ jj ′ å . Write,Var Ñ X j,j ′ ψ jj ′ é = Cov Ñ X j,j ′ β j β j ′ h jj ′ , X j,j ′ β j β j ′ h jj ′ é = X j ,j ,j j β j β j β j β j Cov ( h j j , h j j )= 1 n X j ,j ,j ,j β j β j β j β j X i ,i Cov ( X i j X i j , X i j X i j )= 1 n X j ,j ,j ,j β j β j β j β j X i ,i [ E ( X i j X i j X i j X i j ) − E ( X i j X i j ) E ( X i j X i j )]= 1 n X j ,j ,j ,j β j β j β j β j n X i =1 [ E ( X ij X ij X ij X ij ) − E ( X ij X ij ) E ( X ij X ij )]= 1 n X j ,j ,j ,j β j β j β j β j [ E ( X j X j X j X j ) − E ( X j X j ) E ( X j X j )] , where the fifth equality holds since the summand is 0 for all i = i . The summation is not zero in only three cases:1) j = j = j = j j = j = j = j j = j = j = j . For the first two cases the summation equals n P j = j ′ β j β j ′ . For the third case the summation equals to n p P j =1 β j . Overallwe have Var Ñ X j,j ′ ψ jj ′ é = case z }| { n X j = j ′ β j β j ′ + case z }| { n X j = j ′ β j β j ′ + case z }| { n p X j =1 β j = 2 n k β k = 2 n τ . Rewrite (35) to get Var ( T oracle ) = Var Ä ˆ τ ä − X j,j ′ β j β j ′ Cov Ä ˆ τ , h jj ′ ä + 4 Var Ñ X j,j ′ ψ jj ′ é = Var Ä ˆ τ ä − ï τ n ò + 4 ï τ n ò = Var Ä ˆ τ ä − τ n . (cid:4) Remark . We now calculate the asymptotic improvement of T oracle over the naive estimator. For simplicity, considerthe case when τ = σ . Recall the variance of ˆ τ and T oracle given in (8) and (2), respectively. Write, lim n,p →∞ Var (cid:0) ˆ τ (cid:1) − Var ( T oracle ) Var (cid:0) ˆ τ (cid:1) = lim n,p →∞ τ /n n î ( n − n − ( σ Y τ + τ ) + n − ( pσ Y + 4 σ Y τ + 3 τ ) ó = 2 τ τ + pτ +4 σ Y τ +3 τ n = 23 + 2 pn , where we used (8) in the first equality, and the fact that σ Y = 2 τ in the second equality. Now, notice that when p = n then the reduction is = 40% and when p/n converges to zero, the reduction is . Proof of Proposition 2 :Write, Var ( T ) = Var ˆ τ − p X j =1 p X j ′ =1 ˆ ψ jj ′ = Var Ä ˆ τ ä − Cov Ñ ˆ τ , p X j =1 p X j ′ =1 ˆ ψ jj ′ é + 4 Var Ñ p X j =1 p X j ′ =1 ˆ ψ jj ′ é . (36) e start with calculating the middle term. Let p n ( k ) ≡ n ( n − n − · · · ( n − k ) . Write,Cov (cid:0) ˆ τ , p X j =1 p X j ′ =1 ˆ ψ jj ′ (cid:1) = Cov Ñ n ( n − X i = i p X j =1 W i j W i j , n ( n −
1) ( n − X j,j ′ X i = i = i W i j W i j ′ [ X i j X i j ′ − E ( X i j X i j ′ )] é = C n X I X J Cov ( W i j W i j , W i j W i j [ X i j X i j − E ( X i j X i j )]) , (37)where C n ≡ p n (1) · p n (2) , I is the set of all quintuples of indices ( i , i , i , i , i ) such that i = i and i = i = i , and J is the set of all triples of indices ( j , j , j ) . For the set I , there are (cid:0) (cid:1) · different cases to consider when one of { i , i } is equal to one of { i , i , i } , and an additional (cid:0) (cid:1) ·
3! = 6 cases to consider when two of { i , i } are equal to twoof { i , i , i } . Similarly, for the set J we have three cases to consider: (1) when only two indices of { j , j , j } are equalto each other, (e.g., j = j = j ) ; (2) when no pair of indices is equal to each other and; (3) when all three indices areequal. For example, let I = { ( i , . . . , i ) : i = i = i = i = i } and J = { ( j , j , j ) : j = j = j } . Write, C n X I X J Cov ( W i j W i j , W i j W i j [ X i j X i j − E ( X i j X i j )])= C n X I p X j =1 Cov Ä W i j W i j , W i j W i j î X i j − óä = C n X I p X j =1 E ( W i j ) E ( W i j ) E ( W i j ) E Ä W i j î X i j − óä = C n X I p X j =1 β j E Ä W ij î X ij − óä . (38)Now, notice that E î W ij Ä X ij − äó = E î X ij Y i Ä X ij − äó = E î X ij Ä β T X + ε i äó − β j = β j E Ä X ij ä − β j = 2 β j , (39)where the third equality holds since the covariates are standard normal. Rewrite (38) to get C n X I p X j =1 β j E î W ij Ä X ij − äó = C n X I p X j =1 β j (2 β j ) = 2 p n (3) p n (1) · p n (2) p X j =1 β j = 2( n − n ( n − p X j =1 β j = 2 n p X j =1 β j + O ( n − ) , where we used (39) to justify the first equality. By the same type of calculation, one can compute the covariance in (37)over all combinations of subsets I and J and obtain thatCov (cid:0) ˆ τ , p X j =1 p X j ′ =1 ˆ ψ jj ′ (cid:1) = 4 n τ + O Ä n − ä . (40)We now move to calculate the last term of (36). Recall that ˆ ψ jj ′ = 1 n ( n −
We now move to calculate the last term of (36). Recall that
\[
\hat\psi_{jj'} = \frac{1}{n(n-1)(n-2)} \sum_{i_1 \neq i_2 \neq i_3} W_{i_1 j}W_{i_2 j'}\big[X_{i_3 j}X_{i_3 j'} - E(X_{i_3 j}X_{i_3 j'})\big].
\]
Therefore,
\[
\mathrm{Var}\Big(\sum_{j=1}^{p}\sum_{j'=1}^{p} \hat\psi_{jj'}\Big) = \sum_{J} \mathrm{Cov}\big(\hat\psi_{j_1 j_2}, \hat\psi_{j_3 j_4}\big) \tag{41}
\]
\[
= \frac{1}{[n(n-1)(n-2)]^2} \sum_{J} \mathrm{Cov}\Big(\sum_{i_1 \neq i_2 \neq i_3} W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; \sum_{i_4 \neq i_5 \neq i_6} W_{i_4 j_3}W_{i_5 j_4}X_{i_6 j_3}X_{i_6 j_4}\Big)
\]
\[
= p_n(2)^{-2} \sum_{J}\sum_{I} \mathrm{Cov}\big(W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; W_{i_4 j_3}W_{i_5 j_4}X_{i_6 j_3}X_{i_6 j_4}\big),
\]
where $J$ is now defined to be the set of all quadruples $(j_1,j_2,j_3,j_4)$, and $I$ is now defined to be the set of all sextuples $(i_1,\ldots,i_6)$ such that $i_1 \neq i_2 \neq i_3$ and $i_4 \neq i_5 \neq i_6$. For the set $I$, there are three different cases to consider: (1) when one of $\{i_1,i_2,i_3\}$ is equal to one of $\{i_4,i_5,i_6\}$; (2) when two of $\{i_1,i_2,i_3\}$ are equal to two of $\{i_4,i_5,i_6\}$; and (3) when $\{i_1,i_2,i_3\}$ are equal to $\{i_4,i_5,i_6\}$. There are $\binom{3}{1}\cdot\binom{3}{1} = 9$ options for the first case, $\binom{3}{2}\cdot 3! = 18$ for the second case, and $\binom{3}{3}\cdot 3! = 6$ options for the third case. For the set $J$, there are five different cases to consider: (1) when there is only one pair of equal indices (e.g., $j_1 = j_2 \neq j_3 \neq j_4$); (2) when there are two pairs of equal indices (e.g., $j_1 = j_3 \neq j_2 = j_4$); (3) when only three indices are equal (e.g., $j_1 = j_2 = j_3 \neq j_4$); (4) when all four indices are equal; and (5) when all four indices are different from each other. Note that there are $6$ combinations for the first case, $6$ for the second case, $\binom{4}{3} = 4$ combinations for the third case, and a single combination for each of the last two cases. For example, let $I_2 = \{(i_1,\ldots,i_6): i_1 = i_4,\, i_2 = i_5,\, i_3 = i_6\}$ and $J_2 = \{(j_1,j_2,j_3,j_4): j_1 = j_3 \neq j_2 = j_4\}$. In view of (41),
\[
p_n(2)^{-2} \sum_{J_2}\sum_{I_2} \mathrm{Cov}\big(W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; W_{i_4 j_3}W_{i_5 j_4}X_{i_6 j_3}X_{i_6 j_4}\big)
= p_n(2)^{-2} \sum_{J_2}\sum_{I_2} \mathrm{Cov}\big(W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2},\; W_{i_1 j_1}W_{i_2 j_2}X_{i_3 j_1}X_{i_3 j_2}\big)
\]
\[
= p_n(2)^{-2} \sum_{J_2}\sum_{I_2} E\big(W_{i_1 j_1}^2\big)E\big(W_{i_2 j_2}^2\big)E\big(X_{i_3 j_1}^2\big)E\big(X_{i_3 j_2}^2\big)
= p_n(2)^{-2} \sum_{J_2}\sum_{I_2} \big(\sigma_Y^2 + 2\beta_{j_1}^2\big)\big(\sigma_Y^2 + 2\beta_{j_2}^2\big)
\]
\[
= p_n(2)^{-1} \sum_{j \neq j'} \big[\sigma_Y^4 + 2\sigma_Y^2\big(\beta_j^2 + \beta_{j'}^2\big) + 4\beta_j^2\beta_{j'}^2\big]
= p_n(2)^{-1}\Bigg(\sum_{j=1}^{p}\sum_{j'=1}^{p} \big[\sigma_Y^4 + 2\sigma_Y^2\big(\beta_j^2 + \beta_{j'}^2\big) + 4\beta_j^2\beta_{j'}^2\big] - \sum_{j=1}^{p}\big[\sigma_Y^4 + 4\sigma_Y^2\beta_j^2 + 4\beta_j^4\big]\Bigg)
\]
\[
= p_n(2)^{-1}\Big(\big[p^2\sigma_Y^4 + 4\sigma_Y^2 p\tau^2 + 4\tau^4\big] - \big[p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 4\sum_{j=1}^{p}\beta_j^4\big]\Big)
= p_n(2)^{-1}\Big(p(p-1)\sigma_Y^4 + 4\sigma_Y^2\tau^2(p-1) + 4\sum_{j \neq j'}\beta_j^2\beta_{j'}^2\Big),
\]
where in the third equality we use $E(W_{ij}^2) = \sigma_Y^2 + 2\beta_j^2$, which is given by (15). Since we assume $p/n = O(1)$, the above expression can be further simplified to $\frac{p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big)$. By the same type of calculation, one can compute the covariance in (41) over all combinations of subsets $I$ and $J$ and obtain that
\[
\mathrm{Var}\Big(\sum_{j=1}^{p}\sum_{j'=1}^{p} \hat\psi_{jj'}\Big) = \frac{2\tau^4}{n} + \frac{2p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big). \tag{42}
\]
Lastly, plug (40) and (42) into (36) to get
\[
\mathrm{Var}(T) = \mathrm{Var}\big(\hat\tau^2\big) - 4\,\mathrm{Cov}\Big(\hat\tau^2, \sum_{j=1}^{p}\sum_{j'=1}^{p}\hat\psi_{jj'}\Big) + 4\,\mathrm{Var}\Big(\sum_{j=1}^{p}\sum_{j'=1}^{p}\hat\psi_{jj'}\Big)
= \mathrm{Var}\big(\hat\tau^2\big) - 4\Big(\frac{4\tau^4}{n}\Big) + 4\Big(\frac{2\tau^4}{n} + \frac{2p^2\sigma_Y^4}{n^3}\Big) + O\big(n^{-2}\big)
\]
\[
= \mathrm{Var}\big(\hat\tau^2\big) - \frac{8\tau^4}{n} + \frac{8p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big)
= \mathrm{Var}(T_{\mathrm{oracle}}) + \frac{8p^2\sigma_Y^4}{n^3} + O\big(n^{-2}\big),
\]
where the last equality holds by (9). $\blacksquare$
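Proposition 2 thus quantifies the variance cost of plugging in all $p^2$ estimated zero-estimators $\hat\psi_{jj'}$. The sketch below (ours, not from the paper) computes $\sum_{j,j'}\hat\psi_{jj'}$ without forming the $p \times p$ array, assuming $\mathrm{Cov}(X) = I$ so that $E(X_{ij}X_{ij'}) = 1\{j = j'\}$; substituting 2 * sum_psi_hat(X, y) for the oracle correction in the earlier simulation exhibits the inflation relative to $T_{\mathrm{oracle}}$.

import numpy as np

def sum_psi_hat(X, y):
    # sum over all (j, j') of psi_hat_{jj'}; uses the identity
    # sum_{j,j'} W_{i1 j} W_{i2 j'} [X_{i3 j} X_{i3 j'} - 1{j=j'}] = A[i1,i3] A[i2,i3] - G[i1,i2]
    # with A = W X^T and G = W W^T, summed over pairwise-distinct (i1, i2, i3)
    n = len(y)
    W = X * y[:, None]
    A = W @ X.T
    G = W @ W.T
    col, diag = A.sum(axis=0), np.diag(A)
    s1 = np.sum((col - diag) ** 2 - ((A ** 2).sum(axis=0) - diag ** 2))
    s2 = (n - 2) * (G.sum() - np.trace(G))
    return (s1 - s2) / (n * (n - 1) * (n - 2))

# one draw, just to show the call; T = tau2_hat - 2 * sum_psi_hat(X, y)
rng = np.random.default_rng(2)
n, p = 100, 100
beta = rng.normal(size=p)
beta /= np.linalg.norm(beta)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
print("2 * sum of psi_hat:", 2 * sum_psi_hat(X, y))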
Remark. We now calculate the asymptotic improvement of $T_B$ over the naive estimator. For simplicity, consider the case when $\tau^2 = \sigma^2 = 1$. Recall the variance of $\hat\tau^2$ and $T_B$ given in (8) and (11), respectively. Write,
\[
\lim_{n,p\to\infty} \frac{\mathrm{Var}\big(\hat\tau^2\big) - \mathrm{Var}(T_B)}{\mathrm{Var}\big(\hat\tau^2\big)}
= \lim_{n,p\to\infty} \frac{8\tau_B^4/n}{\frac{4}{n}\Big[\frac{n-2}{n-1}\big(\sigma_Y^2\tau^2 + \tau^4\big) + \frac{1}{2(n-1)}\big(p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 3\tau^4\big)\Big]}
= \frac{2\tau_B^4}{\sigma_Y^2\tau^2 + \tau^4 + \frac{p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 3\tau^4}{2n}}
= \frac{2\tau_B^4}{3 + 2p/n},
\]
where we used (8) in the first equality, and the fact that $\sigma_Y^2 = 2\tau^2 = 2$ in the second equality. Now, notice that when $p = n$ and $\tau_B^2 = 0.5$ the reduction is $0.5/5 = 10\%$, and when $p/n$ converges to zero, the reduction is $0.5/3 \approx 17\%$.

Proof of Proposition 3: We need to prove that
\[
\lim_{n\to\infty} n\big[E(T_\gamma) - \tau^2\big] = 0, \tag{43}
\]
\[
\lim_{n\to\infty} n\big[\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma)\big] = 0. \tag{44}
\]
We start with the first equation. Let $A$ denote the event that the selection algorithm $\gamma$ perfectly identifies the set of large coefficients, i.e., $A = \{B_\gamma = B\}$. Let $p_A \equiv P(A)$ denote the probability that $A$ occurs, and let $1_A$ denote the indicator of $A$. Write,
\[
n\big[E(T_\gamma) - \tau^2\big] = n\big(E[T_\gamma(1 - 1_A)] + E[T_\gamma 1_A] - \tau^2\big) = n\,E[T_\gamma(1 - 1_A)] + n\big[E(T_B 1_A) - \tau^2\big], \tag{45}
\]
where the last equality holds since $T_\gamma 1_A = T_B 1_A$. For convenience of notation, let $C$ be an upper bound on the maximum over all first four moments of $T_\gamma$ and $T_B$, and consider the first term of (45). By the Cauchy–Schwarz inequality,
\[
n\,E[T_\gamma(1 - 1_A)] \le n\,\big\{E\big[T_\gamma^2\big]\big\}^{1/2}\big\{E\big[(1 - 1_A)^2\big]\big\}^{1/2} \le n\,C^{1/2}\{1 - p_A\}^{1/2} \xrightarrow[n\to\infty]{} 0, \tag{46}
\]
where the convergence holds since $\lim_{n\to\infty} n(1 - p_A)^{1/2} = 0$ by assumption. We now consider the second term of (45). Write,
\[
n\big[E(T_B 1_A) - \tau^2\big] = n\,E(T_B 1_A - T_B) = -n\,E[T_B(1 - 1_A)], \tag{47}
\]
and notice that by the same type of argument as in (46) we have $n\,E[T_B(1 - 1_A)] \to 0$ as $n \to \infty$. Hence, by (45) we have $\lim_{n\to\infty} n[E(T_\gamma) - \tau^2] = 0$, which completes the proof of (43).

We now move to show that $\lim_{n\to\infty} n[\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma)] = 0$. Write,
\[
\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma) = E\big(T_B^2\big) - E\big(T_\gamma^2\big) - \big([E(T_B)]^2 - [E(T_\gamma)]^2\big).
\]
Thus, it is enough to show that
\[
\lim_{n\to\infty} n\big[E\big(T_B^2\big) - E\big(T_\gamma^2\big)\big] = 0, \tag{48}
\]
\[
\lim_{n\to\infty} n\big([E(T_B)]^2 - [E(T_\gamma)]^2\big) = 0. \tag{49}
\]
The first equation follows by similar arguments as in (46), with a slight modification of using the existence of the fourth moments of $T_\gamma$ and $T_B$, rather than the second moments. For the second equation, notice that $E(T_B) = \tau^2$ and hence,
\[
\lim_{n\to\infty} n\big([E(T_B)]^2 - [E(T_\gamma)]^2\big) = \lim_{n\to\infty} n\big(\big[\tau^2 + E(T_\gamma)\big]\big[\tau^2 - E(T_\gamma)\big]\big) \le \big(\tau^2 + C\big)\lim_{n\to\infty} n\big(\tau^2 - E(T_\gamma)\big) = 0,
\]
where the last equality follows from (45), and hence (44) follows. $\blacksquare$
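Proposition 4, proved next, shows that $n$ times the error of the plug-in variance estimators vanishes in probability. A minimal sketch of these plug-ins (ours), assuming our reconstruction of displays (8) and (12), and treating unbiased estimates of the $\beta_j^2$ over the selected set as a given input (the paper's construction of those estimates is not reproduced here):

import numpy as np

def naive_tau2(X, y):
    # the pairwise U-statistic for tau^2, as in the earlier sketch
    n = len(y)
    W = X * y[:, None]
    s = W.sum(axis=0)
    return (s @ s - (W * W).sum()) / (n * (n - 1))

def var_hat_naive(X, y):
    # plug-in version of display (8): replace tau^2 and sigma_Y^2 by estimates
    n, p = X.shape
    t2 = naive_tau2(X, y)
    sY2 = np.var(y, ddof=1)
    return (4 / n) * ((n - 2) / (n - 1) * (sY2 * t2 + t2 ** 2)
                      + (p * sY2 ** 2 + 4 * sY2 * t2 + 3 * t2 ** 2) / (2 * (n - 1)))

def var_hat_T_gamma(X, y, beta_sq_hat_B):
    # plug-in version of display (12): Var_hat(T_gamma) = Var_hat(tau2_hat) - (8/n) * tau_hat_{B_gamma}^4,
    # where tau_hat_{B_gamma}^2 is the sum of the (hypothetical) estimates beta_sq_hat_B over the selected set
    n = len(y)
    tau2_B = float(np.sum(beta_sq_hat_B))
    return var_hat_naive(X, y) - (8 / n) * tau2_B ** 2

# example call with a hypothetical selected set carrying part of the signal:
# var_hat_T_gamma(X, y, beta_sq_hat_B=[0.3, 0.2])

The sample variance with ddof=1 is used for $\hat\sigma_Y^2$; any consistent estimator serves the purpose of Proposition 4.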
Proof of Proposition 4: We start by proving that
\[
n\Big[\widehat{\mathrm{Var}}\big(\hat\tau^2\big) - \mathrm{Var}\big(\hat\tau^2\big)\Big] \xrightarrow{p} 0. \tag{50}
\]
Recall by (8) that
\[
\mathrm{Var}\big(\hat\tau^2\big) = \frac{4}{n}\left[\frac{n-2}{n-1}\big(\sigma_Y^2\tau^2 + \tau^4\big) + \frac{1}{2(n-1)}\big(p\sigma_Y^4 + 4\sigma_Y^2\tau^2 + 3\tau^4\big)\right].
\]
Hence, it is enough to prove the consistency of $\hat\tau^2$ and $\hat\sigma_Y^2$. Consistency of the sample variance $\hat\sigma_Y^2$ is a standard result, and since $\hat\tau^2$ is an unbiased estimator, it is enough to show that its variance converges to zero as $n \to \infty$. Since we assume $p/n = O(1)$, we have by (8) that $\mathrm{Var}(\hat\tau^2) \to 0$ as $n \to \infty$, and (50) follows.

We now move to prove that
\[
n\Big[\widehat{\mathrm{Var}}(T_\gamma) - \mathrm{Var}(T_\gamma)\Big] \xrightarrow{p} 0. \tag{51}
\]
Recall that by (44) we have $\lim_{n\to\infty} n[\mathrm{Var}(T_B) - \mathrm{Var}(T_\gamma)] = 0$. Hence, it is enough to show that
\[
n\Big[\widehat{\mathrm{Var}}(T_\gamma) - \mathrm{Var}(T_B)\Big] \xrightarrow{p} 0.
\]
Now, recall that by (12) and (11) we have
\[
\widehat{\mathrm{Var}}(T_\gamma) = \widehat{\mathrm{Var}}\big(\hat\tau^2\big) - \frac{8}{n}\hat\tau_{B_\gamma}^4,
\qquad
\mathrm{Var}(T_B) = \mathrm{Var}\big(\hat\tau^2\big) - \frac{8}{n}\tau_B^4 + O\big(n^{-2}\big).
\]
Thus, by (50) it is enough to show that $\hat\tau_{B_\gamma}^4 - \tau_B^4 \xrightarrow{p} 0$. Now, since $\gamma$ is a consistent selection algorithm by assumption, it is enough to show that $\hat\tau_B^4 - \tau_B^4 \xrightarrow{p} 0$, and, by the continuous mapping theorem, it suffices that $\hat\tau_B^2 - \tau_B^2 \xrightarrow{p} 0$. Recall that $E(\hat\beta_j^2) = \beta_j^2$ for $j = 1,\ldots,p$ and notice that $\mathrm{Var}(\hat\beta_j^2) \to 0$ as $n \to \infty$ by similar arguments to those used to derive (8). Hence, we have $\hat\beta_j^2 - \beta_j^2 \xrightarrow{p} 0$. Since we assumed that $B$ is finite, we have
\[
\hat\tau_B^2 - \tau_B^2 = \sum_{j \in B}\big(\hat\beta_j^2 - \beta_j^2\big) \xrightarrow{p} 0,
\]
and (51) follows. $\blacksquare$

Ilan Livne, Faculty of Industrial Engineering and Management, The Technion. E-mail: [email protected]
David Azriel, Faculty of Industrial Engineering and Management, The Technion. E-mail: [email protected]
Yair Goldberg, Faculty of Industrial Engineering and Management, The Technion. E-mail: [email protected]

[Figure: simulation results; estimator value versus sample size, grouped by estimator.]