A semiparametric efficient estimator in case-control studies
aa r X i v : . [ m a t h . S T ] O c t Bernoulli (2), 2010, 585–603DOI: 10.3150/09-BEJ210 A semiparametric efficient estimator incase-control studies
YANYUAN MA
Department of Statistics, Texas A&M University, College Station, TX 77843, USA.E-mail: [email protected]
We construct a semiparametric estimator in case-control studies where the gene and the envi-ronment are assumed to be independent. A discrete or continuous parametric distribution of thegenes is assumed in the model. A discrete distribution of the genes can be used to model the mu-tation or presence of certain group of genes. A continuous distribution allows the distribution ofthe gene effects to be in a finite-dimensional parametric family and can hence be used to modelthe gene expression levels. We leave the distribution of the environment totally unspecified. Theestimator is derived through calculating the efficiency score function in a hypothetical settingwhere a close approximation to the samples is random. The resulting estimator is proved to beefficient in the hypothetical situation. The efficiency of the estimator is further demonstrated tohold in the case-control setting as well.
Keywords: case-control study; gene-environment interaction; logistic regression; semiparametricefficiency
1. Introduction
Case-control designs are frequently implemented in clinical studies where, instead of tak-ing a random sample of a mixed population of both cases and non-cases, a fixed numberof cases and a fixed number of controls are randomly sampled from the respective pop-ulations of cases and non-cases. Because the resulting samples are no longer random orindependently and identically distributed (i.i.d.), the classical large-sample asymptotictheories could fail to apply. In the literature, two main approaches are taken in orderto adapt the large-sample theory to the case-control setting. The first approach is high-lighted in Breslow et al. (2000), where a modified design of the usual case-control studyis proposed. The resulting random sample is then linked to the true case-control samplethrough using results from McNeney (1998), where the similarity between random andnon-random sample asymptotic properties is developed by almost establishing the wholeasymptotic theory under non-i.i.d. samples. The second approach is somewhat more di-rect and is implicitly used by Rabinowitz (2000). Instead of treating the indicator ( D ) This is an electronic reprint of the original article published by the ISI/BS in
Bernoulli ,2010, Vol. 16, No. 2, 585–603. This reprint differs from the original in pagination andtypographic detail. (cid:13) Y. Ma of case/control as a random variable, D is assumed to be known and all the calculationsare performed conditionally on D . Although it does result in the conditional random-ness of the case-control samples, the resulting data is not really identically distributed.Specifically, two different distributions are involved and the large-sample theory is stillnot available. Strictly speaking, the asymptotic theory for non-i.i.d. data rederived inMcNeney (1998) also needs to be applied in order to treat such a combination of twosample cases.In addition to the complexity arising from a case-control design, the problem consid-ered in this article is also a semiparametric model problem, whose efficient estimator hasnot yet been explored even in the i.i.d. data situation. Specifically, the problem is asfollows. Suppose that in the general population, the occurrence of a disease ( D = 1) fol-lows a logistic model logit { Pr( D = 1) } = m ( G, E ), where G represents a person’s geneticcharacter and E represents the environmental elements. Further, suppose that G and E are independent of each other and that we are interested in the effect of gene, environ-ment and their interaction on the disease status. Thus, m ( g, e ) = β c + β g + β e + β ge .The parametric form of the distribution of gene g is assumed to be known as q ( g, β ),where β is an unknown finite-dimensional parameter. The distribution of the environ-ment, η ( e ), is unspecified. A special version of this problem is considered in Chatterjeeand Carroll (2005), where q ( g, β ) is assumed to be a discrete distribution. There, theauthors derived a profile maximum likelihood estimator for β = ( β c , β , β , β , β ) T andshowed that it is root- N consistent, where N is the size of the combined samples. Theestimator is later extended to a more general framework in Spinka et al. (2005). How-ever, it is not investigated whether the estimator achieves the optimal semiparametricefficiency.In this paper, we first establish in Section 2 that the classical semiparametric theoryof Bickel et al. (1993) is applicable in general case-control studies, without having torederive the theory in parallel or having to resort to the results from McNeney (1998).Such first order asymptotic equivalence between case-control sampling and random sam-pling is a new result. We then proceed to compute the semiparametric efficient score andconstruct a semiparametric estimator for β in Section 3. The computation is carried outin a hypothetical population described in Section 2. This differs from the real populationfrom which the cases and controls are drawn. Hence, the derivation has its own interestand novelty. In this section, we also prove that although the estimation of the nuisanceparameter η is bypassed in our estimator, the resulting semiparametric estimator stillachieves the optimal efficiency. The proof and treatment is rather non-standard. Numer-ical examples are included in Section 4 to demonstrate the performance of the proposedestimator. The performance of the method in the discrete gene model is close to that ofthe method in Chatterjee and Carroll (2005) and we pointed out the possible equivalencebetween the two methods in Section 5. Some analytical derivations and technical detailsare included in the Appendix. emiparametric efficiency in case-control studies
2. Case-control data versus i.i.d. data
The samples from a case control study are not random because the disease status is notrandom. In general, the design randomly samples N individuals from the case populationand N from the non-case population. However, let us consider a hypothetical populationof interest with infinite population size, in which the disease to non-disease ratio is fixedat π = N /N . Here, the reason for introducing the notion of hypothetical population isto be able to use the classical semiparametric theory for i.i.d. data, developed in Bickel et al. (1993). If the sample of size N = N + N from a case-control study happens tobe a random sample from the hypothetical population of interest, then we have a size- N i.i.d. random sample and the usual semiparametric analysis will apply. The asymptoticresults hold when N → ∞ and π stays fixed.Of course, the problem is that a random sample of size N from the hypothetical pop-ulation of interest does not have to have exactly N controls and N cases, hence wecannot immediately equate a case-control sample and a random sample from the hypo-thetical population. In general, the number of controls/cases of a random sample fromthe hypothetical population will have a binomial distribution N rd ∼ Binomial(
N, N d /N ), d = 0 ,
1, which is very close to a normal distribution when N is large, that is,( N rd − N d ) / p N π (1 − π ) → Normal(0 ,
1) in distribution when N → ∞ . Here, the super-script r stands for ‘random.’ Furthermore, the probability of having | N rd − N d | > N / goes to zero when N → ∞ . Thus, we could think of the case-control sample as obtainedby randomly picking a size- N sample from the hypothetical population of interest, thendeleting a random o p ( N / ) cases (controls) and adding a random o p ( N / ) controls(cases). Or, alternatively, we can think of the case-control sample as a random sam-ple of size N , but with a randomly chosen o p ( N / ) data contaminated in a particularway. This “particular” contamination implies the following three properties: (i) the con-tamination happens only to o p ( N ) of the observations (in the case-control samples, thecontamination in fact only happens to o p ( N / ) observations, but, in general, o p ( N ) is al-ready sufficient for our further analysis); (ii) the contaminated data is still of order O(1),that is, | X ci − X i | is bounded in probability for i = 1 , . . . , N ; (iii) the zero expectationholds for the contaminated observations, that is, if an estimating equation for β of theform P Ni =1 f ( X i ; β ) = 0 satisfies E { f ( X i ; β ) } = 0, then E { f ( X ci ; β ) } = 0 as well. Here, X i , i = 1 , . . . , N , are i.i.d. random samples, the superscript c stands for ‘contaminated’and the subscript represents the true parameter value.When the case-control sample is viewed as a contaminated random sample from thehypothetical population of interest, the first two “particular” properties certainly hold.For the estimator we will construct, we shall demonstrate that the third property alsoholds. Thus, if we can show that the same first order asymptotics apply to both thei.i.d. sample of size N and its contaminated version as long as the three properties hold,then we can treat the case-control sample as an i.i.d. sample.The argument is as follows. Assume that we mistakenly treated the contaminated dataas i.i.d. and obtained an efficient estimator: N X i =1 S eff ( X ci ; β ) = 0 . (2.1)88 Y. Ma
Here, S eff is the efficient score function and its derivation is model-dependent. One obvi-ous aspect of S eff worth emphasizing is that the construction of S eff does not depend onthe observations. Regardless of the method of derivation, the efficient score function S eff has the property E { S eff ( X i ; β ) } = 0. If we had the uncontaminated data, our subsequentestimator for β would have been P Ni =1 S eff ( X i ; β ) = 0. Working with the contaminateddata, (2.1) is the estimating equation we really have. Suppose that ˆ β solves (2.1). Wethen have 0 = N X i =1 S eff ( X ci ; ˆ β ) = N X i =1 S eff ( X ci ; β ) + N X i =1 ∂S eff ( X ci ; β ∗ ) ∂β T ( ˆ β − β ) , therefore, − N − ( N X i =1 ∂S eff ( X ci ; β ∗ ) ∂β T ) √ N ( ˆ β − β ) = N − / N X i =1 S eff ( X ci ; β ) , (2.2)where β ∗ lies on the line connecting β and ˆ β . Note that in our “particular” contaminationrequirement, only o p ( N ) terms yield a different X i from X ci (requirement (i)) and, foreach X ci = X i , the difference is O p (1) (requirement (ii)), so we have N − ( N X i =1 ∂S eff ( X ci ; β ∗ ) ∂β T ) = N − ( N X i =1 ∂S eff ( X i ; β ∗ ) ∂β T ) + o p (1) (2.3)= E ( ∂S eff ( X i ; β ) ∂β T ) + o p (1) . From the third “particular” property, we have E { S eff ( X ci ; β ) } = 0 (we will prove thatthis property holds for the case-control data in Section 3). In conjunction with the factthat only o p ( N ) of the terms S eff ( X ci ; β ) − S eff ( X i ; β ) are non-zero, we can furtherobtain N − / N X i =1 S eff ( X ci ; β ) = N − / N X i =1 S eff ( X i ; β ) + o p (1) . (2.4)The detailed argument of (2.4) is the following. Suppose for the first l = o p ( N ) observa-tions, X ci = X i . Then we have N − / N X i =1 S eff ( X ci ; β )= N − / N X i =1 S eff ( X i ; β ) + N − / l X i =1 { S eff ( X ci ; β ) − S eff ( X i ; β ) } emiparametric efficiency in case-control studies N − / N X i =1 S eff ( X i ; β ) + ( N/l ) − / l − / l X i =1 { S eff ( X ci ; β ) − S eff ( X i ; β ) } . Note that S eff ( X ci ; β ) − S eff ( X i ; β ) has mean zero, hence l − / P li =1 { S eff ( X ci ; β ) − S eff ( X i ; β ) } = O p (1). From l = o p ( N ), we obtain the result in (2.4) immediately. Thus,plugging (2.3) and (2.4) into (2.2), we obtain − E (cid:26) ∂S eff ( X i ; β ) ∂β T (cid:27) √ N ( ˆ β − β ) = N − / N X i =1 S eff ( X i ; β ) + o p (1) . The above display is exactly the first order asymptotic expansion of the estimator for β if we had performed the estimation procedure on the uncontaminated data. Thus,we have demonstrated that the estimator obtained from contaminated data performs aswell as the one obtained from uncontaminated data in terms of first order asymptoticproperties. Note that the efficient estimator can be replaced by a consistent estimator,say, a general S instead of S eff , as long as E ( S | D = d ) = 0 holds for d = 0 ,
1. Thisensures that E { S ( X ci ) } = 0 as long as E { S ( X i ) } = 0 (shown in Section 3), so the abovederivation will still carry through. Hence, the asymptotic property of the estimator usingthe contaminated data is indeed the same as if we had the uncontaminated data. Thus,the case-control data can be treated as i.i.d. data and we can achieve the same efficiencyas when the data was indeed i.i.d. In other words, a semiparametric estimator usingcontaminated data is at least as efficient as one using the uncontaminated data.One question still remains: can we do even better than in the i.i.d. data case? Infact, since case-control sampling is designed to be an efficient way to collect covariateinformation, it seems to contain more information than a random sample. However, weclaim that for asymptotically linear estimators of the form √ N ( ˆ β − β ) = 1 √ N N X i =1 ψ ( X ci ; β ) + o p (1) , where E { ψ ( X ci ; β ) | d } = 0, the efficiency in parameter estimation cannot be further im-proved by taking into account the special sampling procedure. This is because otherwise,we could have obtained a better estimator for the i.i.d. sample as well, by replacing X ci with X i . The detailed derivation is the same as in the above paragraph, where thecondition E { ψ ( X ci ; β ) | d } = 0 implies E { ψ ( X i ; β ) | d } = 0 for case-control data, whichensures E { ψ ( X ci ; β ) } = E { ψ ( X i ; β ) } = 0. Of course, if the condition E ( ψ | d ) = 0 is notsatisfied, the argument does not work. However, we now show that if ψ achieves theoptimal variance for the case-control data X ci , then it has to satisfy E { ψ ( X ci ; β ) | d } = 0.First, E { ∂E ( ψ | D ) /∂β } = ∂E ( ψ ) /∂β = 0 because the probability density function(p.d.f.) of D does not contain β . If we let ˜ ψ ( X ci ) = ψ ( X ci ) − E { ψ ( X ci ) | d } , then E { ˜ ψ ( X ci ) } = 0 and E { ∂ ˜ ψ ( X ci ) /∂β } = E { ∂ψ ( X ci ) /∂β } . If E { ψ ( X ci ) | d } 6 = 0, then we canobtainvar { ψ ( X ci ) } = E [var { ψ ( X ci ) | D } ] + var[ E { ψ ( X ci ) | D } ] = var { ˜ ψ ( X ci ) } + var[ E { ψ ( X ci ) | D } ]90 Y. Ma > var { ˜ ψ ( X ci ) } , which, together with E { ∂ ˜ ψ ( X ci ) /∂β } = E { ∂ψ ( X ci ) /∂β } , contradicts the fact that ψ ( X ci )is optimal.In summary, we have shown that the case control samples can be treated as if theywere i.i.d. and all the first order asymptotic results for i.i.d. data will be inherited forcase-control data as well. We can see that the above establishment is similar to the devel-opment in Breslow et al. (2000). However, one prominent difference is that in Breslow etal. (2000), the case-control sample is viewed as the result of a biased sampling procedurewith fixed subsample size, hence they cannot use the classical semiparametric theory fori.i.d. data, but have to refer to McNeney (1998) for the theoretical properties, where thewhole semiparametric theory for fixed-size subsamples is established in parallel to thei.i.d. framework. Here, through introducing the notion of hypothetical population and byanalyzing the first order equivalence between a random sample and a sample with fixed-size subsamples, we can easily contain the case-control problem in the usual i.i.d. modelframework. The derivation is much simpler and more elegant. Thus, in the remainder ofthe paper, we ignore the case-control nature of the data and proceed with our analysis bypretending the data is i.i.d. from the aforementioned hypothetical population of interest.
3. A semiparametric efficient estimator
A random sample from the hypothetical population of interest has p.d.f. p ( g, e, d ; β, η ) = p D ( d ) p G,E | D ( g, e | d ) = p D ( d ) p tG,E | D ( g, e | d )= p D ( d ) p tG ( g ) p tE ( e ) p tD | G,E ( d | g, e ) /p tD ( d ) (3.1)= N d N q ( g ) η ( e ) H ( d, g, e ) p tD ( d ) . Here, the superscript t stands for the p.d.f. in the true population, whereas expressionswithout superscripts, including various p.d.f.’s and expectation E , are quantities in thehypothetical population of interest; η ( e ) = p tE ( e ) is the unknown infinite-dimensionalparameter and H ( d, g, e ; β ) = exp[ d { m ( g, e ) } ] / [1 + exp { m ( g, e ) } ]= exp { d ( β c + β g + β e + β ge ) } / { β c + β g + β e + β ge ) } ; p tD ( d ; β, η ) = Z q ( g, β ) η ( e ) H ( d, g, e ; β ) d µ ( g ) d µ ( e ) . We recognize that estimating the finite-dimensional parameter β in the presence of aninfinite-dimensional nuisance parameter η , using an i.i.d. sample of size N = N + N from emiparametric efficiency in case-control studies et al. (1993). Because the general approachand related concepts have been nicely described in several recent papers including Tsiatisand Ma (2004), Allen et al. (2005), Ma et al. (2005) and Ma and Tsiatis (2006), here,we only briefly outline the general approach and the definition of the relevant concepts,referring the reader to these papers for more detailed descriptions.In general semiparametric problems, one approach to construct estimators for β is toobtain some influence function φ ( X i ; β, η ) which is subsequently used to form estimatingequations for β in the form of P Ni =1 φ ( X i ; β, η ) = 0. Here, X i = ( G i , E i , D i ) , i = 1 , . . . , N ,are i.i.d. observations. The solution of the estimating equation, ˆ β , is subsequently asemiparametric estimator and its variance has been established to be equal to the varianceof φ ( X i ; β, η ). Consequently, the optimal estimator among the class of all such estimatorsis the one whose influence function has the smallest variance. This is usually referred toas the semiparametric efficient estimator .The geometric approach considers the space in which all influence functions belong.Specifically, one considers a Hilbert space H which consists of all zero-mean measurablefunctions with finite variance and the same dimension as β . The inner product in H isdefined as the covariance. The Hilbert space H is further decomposed into two spaces,the nuisance tangent space Λ and its orthogonal complement Λ ⊥ .To understand the nuisance tangent space Λ, consider first the case where the nui-sance parameter, denoted γ , is finite-dimensional. Then, the nuisance score function, S γ = ∂ log p ( X i ; β, γ ) /∂γ , spans a linear space, which is denoted Λ. In the case of theinfinite-dimensional nuisance parameter η , the corresponding Λ is defined as the meansquared closure of the span of all the nuisance score functions S γ , where p ( X i ; β, γ ) is anyparametric submodel of p ( X i ; β, η ). The orthogonal complement of Λ in H is subsequentlydefined as Λ ⊥ .Any function in Λ ⊥ can be properly normalized to obtain a valid influence function.On the other hand, every influence function is a function in Λ ⊥ . Among all these func-tions, the projection of the score function S β = ∂ log p ( X i ; β, γ ) /∂β results in the efficientinfluence function. If we denote the projection by S eff , then the corresponding optimalvariance is var( S eff ) − . The projection S eff is usually called the efficient score function .Hence, the geometric approach converts the problem of searching for efficient semi-parametric estimators to the problem of calculating S eff . Following the description in Section 3.1, we obtain the efficient score function S eff . View-ing the sample as random from the hypothetical population, the p.d.f. in (3.1) is nolonger in a simple multiplicative form, in that the nuisance parameter appears both inthe numerator and in the integral in the denominator. Since this implies that the nui-sance tangent space is not automatically orthogonal to the score functions, the related92 Y. Ma computation for the nuisance tangent space and associated objects is more involved. Inaddition, one needs to be aware that the calculation should be carried out with respect tothe hypothetical population, hence quantities such as p tG , p tE , p tD need to be treated withextra care and not confused with p G , p E , p D . The main steps of the derivation are as fol-lows. We first calculate the score function S β by taking the derivative of log p ( g, e, d ; β, η )with respect to β . This results in S β = S − E ( S | d ), where S = (cid:26) ( m ′ β c m ′ β m ′ β m ′ β ) (cid:18) d − m (cid:19) q ′ β ( g, β ) T q ( g, β ) (cid:27) T . We then calculate the two spaces Λ , Λ ⊥ by replacing η in (3.1) with a finite-dimensionalparameter γ , taking the derivative of log p ( g, e, d ; β, γ ) with respect to γ to obtain S γ ,hypothesizing a space of all such S γ and proving that Λ is equivalent to this space. Theresults areΛ = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e ) such that E t ( h ) = 0] = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e )] , Λ ⊥ = [ h ( g, e, d ): E ( h | e ) = E { E ( h | d ) | e } ] . We finally project the score vector S β onto Λ ⊥ to obtain S eff = S β − f ( e ) + E ( f | d ) = S − E ( S | d ) − f ( e ) + E ( f | d ), where f ( e ) − E ( f | d ) represents the projection of S β onto Λ.The details of the derivation can be found in the Appendix. Note that this form of S eff implies that E { S eff ( X ) | d } = 0. When X is replaced by X c , the non-random case-controlsample, we still have E { S eff ( X c ) | d } = 0 because the design itself guarantees that theonly non-random component is d , which is held constant. Thus, viewing X c as a specialcontaminated version of X , we still have E { S eff ( X c ) } = 0, which is required in Section 2.From the Appendix, we can further write S eff = S − E ( S | e ) + ( − d { a (0) − a (1) } w ( e, − d ) , (3.2)where a (0) − a (1) = E ( f | D = 0) − E ( S | D = 0) − E ( f | D = 1) + E ( S | D = 1).In terms of the calculation of S eff , note that S , E ( S | e ) and w , as given in (A.1), are allfunctions with parameters β and p tD ( d ) only. Hence, as long as we can calculate p tD ( d ),we will have the ability to evaluate S , E ( S | e ) and w . The computation of a (0) − a (1)requires further arguments.In the following, we first obtain an approximation of p tD ( d ), then pursue the estimationof a (0) − a (1). To estimate p tD ( d ), using p E ( e ) to denote the probability density functionof e in the hypothetical population, we observe that N d = N p D ( d ) = Z N p
D,E ( d, e ) d µ ( e ) = Z N p E ( e ) p D,G | E ( d, g | e ) d µ ( g ) d µ ( e )= Z N p E ( e ) R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e )= E e (cid:26) N R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) (cid:27) . emiparametric efficiency in case-control studies E e with its sample moment through averaging across differentobserved e i ’s, we obtain N d ≈ N X i =1 R N d q ( g, β ) H ( d, g, e i ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e i ) d µ ( g ) /p tD ( d ) for d = 0 , . (3.3)Note that the above two equations are not independent – one determines the other. But,in combination with p tD (0) + p tD (1) = 1, we can estimate p tD ( d ) completely. Because theonly approximation involved in estimating p tD ( d ) is replacing the mean with a samplemean, the calculation will produce a root- N -consistent estimator for p tD (0) and p tD (1).We denote the estimators by ˆ p tD (0) and ˆ p tD (1). In calculating N d , we write p ( g, e, d ) as p E ( e ) p D,G | E ( d, g | e ), instead of directly using the form in (3.1). Since p E ( e ) is the p.d.f. ofthe environment variable in the hypothetical population, this enables us to replace theexpectation E e with the average of the samples.The estimation of a (0) − a (1) is much more tedious, and involves an almost bruteforce calculation of E ( S | d ) and E ( f | d ). If we let b = E ( S | D = 0) , b = E ( S | D = 1) , c = E ( f | D = 0) and c = E ( f | D = 1), then a (0) − a (1) = b − b + c − c . The calculation of b and b follows from b d = R Sp D,G,E ( d, g, e ) d µ ( g ) d µ ( e ) R p D,G,E ( d, g, e ) d µ ( g ) d µ ( e ) = R Sp E ( e ) p D,G | E ( d, g | e ) d µ ( g ) d µ ( e ) R p E ( e ) p D,G | E ( d, g | e ) d µ ( g ) d µ ( e )= Z p E ( e ) R SN d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e ) . Z p E ( e ) R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e ) . Since S can be calculated directly, we simply obtain the approximation of b d , d = 0 ,
1, byreplacing the mean with sample mean and plugging in the estimated p tD ( d ):ˆ b = N X i =1 R S (0 , g, e i ) q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) (3.4) . N X i =1 R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) , ˆ b = N X i =1 R S (1 , g, e i ) q ( g ) H (1 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) / ˆ p tD ( d ) (3.5) . N X i =1 R q ( g ) H (1 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) / ˆ p tD ( d ) . Y. Ma
The calculations of c and c are a bit more tricky. Since f = E ( S | e ) + ( c − b ) w ( e,
0) + ( c − b ) { − w ( e, } , taking expectation conditional on, say D = 0, we have c = E { E ( S | e ) | D = 0 } + ( c − b ) E { w ( e, | D = 0 } + ( c − b )[1 − E { w ( e, | D = 0 } ]or, equivalently, we obtain c − c = E { E ( S | e ) | D = 0 } − b E { w ( e, | D = 0 } − b [1 − E { w ( e, | D = 0 } ]1 − E { w ( e, | D = 0 } . Hence, replacing mean by sample mean and using ˆ p tD ( d ), c − c is estimated byˆ c − ˆ c = ˆ E { E ( S | e ) | D = 0 } − ˆ b ˆ E { w ( e, | D = 0 } − ˆ b [1 − ˆ E { w ( e, | D = 0 } ]1 − ˆ E { w ( e, | D = 0 } , (3.6)where ˆ E { w ( e, | D = 0 } = N X i =1 w ( e i , R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) (3.7) . N X i =1 R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d )and ˆ E { E ( S | e ) | D = 0 } = N X i =1 E ( S | e i ) R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) (3.8) . N X i =1 R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) . Similarly to the estimation of p tD ( d ), the only approximation involved in obtaining b (0) , b (1) and c (0) − c (1) is replacing mean by sample mean, so a (0) − a (1) is estimatedusing ˆ a (0) − ˆ a (1) = ˆ b − ˆ b + ˆ c − ˆ c at the root- N rate.We would like to emphasize that in all of the above calculations, when we replace theexpectation with the sample average, we use the result that the case-control sample canbe treated as a random sample from the hypothetical population. Hence, for any function u ( e ), the approximation N − P Ni =1 u ( e i ) can only be used to replace R u ( e ) p E ( e ) d µ ( e ),not R u ( e ) η ( e ) d µ ( e ). emiparametric efficiency in case-control studies β in all of the above expressions, in fact, p tD (0) , p tD (1) , a (0) − a (1) are all functions of β . However, if we replace β with ˜ β , an initial estimator of β , we will still obtain ˆ p tD ( d ; ˜ β ) , ˆ a (0; ˜ β ) − ˆ a (1; ˜ β ) that are root- N -consistent, as long as˜ β − β = O p ( N − / ). The final estimating equation of β has the form N X i =1 ˆ S eff ( x i ; β ) = N X i =1 S eff { x i ; β, ˆ p tD ( d ; ˜ β ) , ˆ a (0; ˆ p tD , ˜ β ) − ˆ a (1; ˆ p tD , ˜ β ) } = 0 , (3.9)where x i denotes the i th observation ( d i , g i , e i ).To summarize the description of the estimator, we outline the algorithm here:Step 1. Pick a starting value ˜ β that is root- N consistent.Step 2. Solve for ˆ p tD (0) and ˆ p tD (1) = 1 − ˆ p tD (0) from (3.3).Step 3. Obtain ˆ b and ˆ b from (3.4) and (3.5).Step 4. Obtain ˆ c − ˆ c from (3.6) and (3.7), (3.8).Step 5. Calculate S eff using (3.2) and obtain ˆ β from solving (3.9).It is worth pointing out that in order to carry out Step 1, we have used a vital assump-tion that a root- N starting value ˜ β exists. Fortunately, the existence of ˜ β is equivalent tothe identifiability of β and is already well established in Chatterjee and Carroll (2005).The starting value used there, or in Spinka et al. (2005), can be used to obtain the initialestimator ˜ β . Our algorithm here does not require an iteration of Steps 2–5 upon eachupdate of β . However, in practice, a more accurate ˜ β can improve the final estimation ˆ β significantly, hence iterations are almost always implemented. If we could use the exact p tD ( d ; β ) and a (0; β ) − a (1; β ) in (3.9), then, according toSection 3.1, the resulting estimator for β would be an efficient estimator, with es-timation variance V = E ( S eff S Teff ) − . To first order, V can be approximated using N { P Ni =1 ˆ S eff ( x i ; ˆ β ) ˆ S Teff ( x i ; ˆ β ) } − , where ˆ β solves (3.9).We claim that using the estimated ˆ S eff as in (3.9), we obtain an estimating equationthat yields the same estimator for β as using S eff , in terms of its first order asymptoticproperties. Theorem 1.
The algorithm in Section 3.2 yields a semiparametric efficient estimatorfor β . That is, √ N ( ˆ β − β ) → Normal { , var( S eff ) − } in distribution when N → ∞ and N /N is fixed. The proof of the theorem contains two main steps. In the first step, we show thesemiparametric efficiency of the estimator if the observations had been i.i.d. In the second96
Y. Ma
Table 1.
Simulation results for the two experiments, each with two different sets of parame-ter values, representing uncommon (upper-left) and common (upper-right) gene mutation, andhomogeneous (lower-left) and diversified (lower-right) gene expression levels. ‘true’ is the truevalue of β , ‘est’ is the average of the estimated β , ‘sd’ is the sample standard deviation and ‘ b sd’is the average of the estimated standard deviation β c β β β β β c β β β β Experiment 1true − . . . . . − . . . . . − . . . . . − . . . . . . . . . . . . . . . b sd 1 . . . . . . . . . . − . . . . . − . . . . . − . . . . . − . . . . . . . . . . . . . . . b sd 0 . . . . . . . . . . step, we proceed to show the efficiency in the case-control study using results in Section 2.Rather complex algebra needs to be employed in the first step. The proof also involvesa split of the data in the final estimation of β , and in estimating p tD ( d ) and a (0) − a (1),mainly for technical convenience. The details of the proof appear in the Appendix.
4. Numerical examples
We conducted a small simulation study to demonstrate the performance of the estima-tor. In the first experiment, we generated 500 cases and 500 controls, where the trueenvironment element E is min(10 , X ) and X is generated from a log-normal distributionwith mean 0 and variance 1. A dichotomous model of the gene is used, where G = 1 withprobability β and G = 0 with probability 1 − β . This kind of model for q ( g, β ) canrepresent the presence/absence of a certain gene mutation. We used two different sets ofvalues for β : the first set is β = ( − . , . , . , . , . T , where β = 0 .
26 representsa relatively common mutation; the second set is β = ( − . , . , . , . , . T , where β = 0 .
065 represents a very rare mutation. In both sets, the true parameters are chosenso that the model results in a population disease rate p tD (1) ≈ q ( g, β ). Here, wemodel q ( g, β ) with a Laplace distribution with variance β . This kind of model is typi-cally used to model the gene expression level. To maintain an approximate 5% disease ratein the population, we used β = ( − . , . , . , . , . T and β = ( − . , . , . , . , T as the true parameter values. Again, in the first set, β = 0 . emiparametric efficiency in case-control studies β = 1 represents a largervariation, so the population is more diversified. The simulation results are presented inthe lower half of Table 1. In both experiments, 1000 simulations are implemented.From Table 1, it is clear that the estimator for β is consistent in all four situationsand the estimated standard deviation approximates the true one rather well. It is worthnoting that the first experiment is a repetition of the same setting as in Chatterjee andCarroll (2005) and we observe very similar results. Specifically, for β , β , β , β in theupper-left table, their results for “sd” are 0.322, 0.037, 0.128, 0.0122, respectively, andthose in the upper-right table are 0.198, 0.043, 0.075 and 0.0273, respectively. Althoughsome numerical improvement can be observed in certain parameters (for example β ),it can be a result of finite-sample performance and numerical issues. We conjecture thatthe estimator in Chatterjee and Carroll (2005) is equivalent to the method proposedhere, hence is also efficient, although a rigorous proof is beyond the scope of this paper.It is also worth noting that the estimation of β c is more difficult than the remainingcomponents of β , in that the estimation has large variability. This is especially prominentin the discrete model setting for q ( g ). Indeed, the estimation result for β c has not beenreported elsewhere and, without the gene-environment independence, β c is known to beunidentifiable (Prentice and Pyke (1979)). This provides an intuitive explanation for theperformance of ˆ β c we observe. The set of estimating equations is solved using a standardNewton–Raphson algorithm.
5. Conclusion
Semiparametric modeling and estimation to study the occurrence of a disease in rela-tion to gene and environment has attracted much interest recently. However, despitethe various estimators proposed in the literature, very little is understood in terms ofthe efficiency of the estimators. This is partly due to the fact that most estimators areconstructed in rather ingenious ways, instead of following the standard lines of semi-parametric theory. The other reason is that most such problems are set in a case-controldesign, which violates the i.i.d. assumption for standard semiparametric theory.Instead of rederiving the whole semiparametric theory under non-i.i.d. samples, weargue that case-control data can be treated as if they were i.i.d. data and the stan-dard semiparametric efficiency theory will still apply. The equivalence of the first orderasymptotic theory shown in this article is a new contribution. The argument is based onrather elementary statistics without involving advanced knowledge or highly specializedtechniques.The establishment of the equivalence of the semiparametric efficiency betweeni.i.d. data and case-control data allows us to carry out the estimation using standard,well-established semiparametric theory. However, these standard analyses are performedunder a hypothetical population of interest, hence the detailed derivation often requiresspecial treatment, something which has not previously appeared in the literature. Underthe gene-environment independence assumption, we are able to explicitly construct a98
Y. Ma novel semiparametric estimator and show its efficiency. A special feature of this estima-tor is that we never attempted to estimate the infinite-dimensional nuisance parameter η itself, neither did we posit a model, true or false, for it. Rather, we avoided its estimationand instead approximated quantities that rely on it. On the one hand, this enables usto carry out the estimation rather easily; on the other hand, some asymptotic propertieshave to be rederived because any result that relies on the convergence properties of thenuisance parameter itself can no longer be used.Finally, our simulation results support the theory we developed, in both discrete andcontinuous gene distribution cases. Our simulation results in the discrete gene model arevery similar to those of Chatterjee and Carroll (2005), which leads us to believe thattheir estimator is also efficient. A demonstration of this aspect would be an interestingdirection for future work. The programming of the method in Chatterjee and Carrollmay seem easier. However, if the two methods are indeed equivalent, then the projectionstep in the current method should be equivalent to the profiling step in Chatterjee andCarroll, hence the computational effort and intensity should be equivalent. Although wedid not further expand our estimator to stratified case-control data, the method is clearlyapplicable there as well. Appendix
The derivation of S eff We will use S eff to construct our estimating equation. We calculate S eff by projectingthe score functions with respect to the parameters of interest β c , β , β , β , β onto theorthogonal complement of the nuisance tangent space. We first derive the score functions S β ≡ ∂ log p ( g, e, d ; β, η ) /∂β . Straightforward calculation shows that the score function S β = ( S T1 , S T2 ) T , where S T1 = ( m ′ β c m ′ β m ′ β m ′ β ) (cid:18) d − m (cid:19) − E (cid:26) ( m ′ β c m ′ β m ′ β m ′ β ) (cid:18) d − m (cid:19)(cid:12)(cid:12)(cid:12) d (cid:27) ,S = q ′ β ( g, β ) q ( g, β ) − E (cid:26) q ′ β ( g, β ) q ( g, β ) (cid:12)(cid:12)(cid:12) d (cid:27) . Here, m ′∗ and q ′∗ represent partial derivatives with respect to ∗ . Note that, in general, S β can be written as S β = S − E ( S | d ).We next derive the nuisance tangent space Λ and its orthogonal complement Λ ⊥ .Inserting the form of p tD ( d ; β, η ) into (3.1), replacing η ( e ) by an arbitrary submodel p tE ( e ; γ ) and taking the derivative of log p ( g, e, d ; β, γ ) with respect to γ , we ob-tain ∂ log p ( g, e, d ; β, γ ) /∂γ = ∂ log p tE ( e ; γ ) /∂γ − E { ∂ log p tE ( e ; γ ) /∂γ | d } . Now, recogniz-ing that ∂ log p tE ( e ; γ ) /∂γ for an arbitrary submodel can yield an arbitrary function of e with mean zero calculated under the true η ( e ), we obtain the nuisance tangent space:Λ = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e ) such that E t ( h ) = 0] = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e )] , Λ ⊥ = [ h ( g, e, d ): E ( h | e ) = E { E ( h | d ) | e } ] . emiparametric efficiency in case-control studies E t stands for an expectation calculated with respect to the true population distri-bution. The second expression for Λ is more convenient because it allows h ( e ) to be anarbitrary function of e , hence this is the form of Λ that we will use.Having obtained S β and the spaces Λ and Λ ⊥ , we can proceed to derive the efficientscore function S eff ≡ Π( S β | Λ ⊥ ). If we let Π( S β | Λ) = f ( e ) − E ( f | d ), then S eff = S β − f ( e ) + E ( f | d ) = S − E ( S | d ) − f ( e ) + E ( f | d ).We now modify the expression of S eff to facilitate its actual computation. Letting a ( d ) = E ( f | d ) − E ( S | d ), we can thus write S eff = S − f + a ( d ). Note that S does not de-pend on η and a ( d ) is either a (1) or a (0). In addition, we have E ( S eff | e ) = E { E ( S eff | d ) | e } .This is equivalent to E ( S β | e ) − f ( e ) + E { E ( f | d ) | e } = E [ E { S − E ( S | d ) | d } − E { f − E ( f | d ) | d }| e ] = 0 , which, in turn, is equivalent to E ( S | e ) = f + E { E ( S | d ) | e } − E { E ( f | d ) | e } = f − E { a ( d ) | e } = f − P d R a ( d ) N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) . Let v ( e, d ) = N d Z q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) = p E,D ( e, d ) N η − ( e )and w ( e, d ) = v ( e, d ) / { v ( e,
0) + v ( e, } . (A.1)We have E ( S | e ) = f − a (0) v ( e, / { v ( e,
0) + v ( e, } − a (1) v ( e, / { v ( e,
0) + v ( e, } = f − a (0) w ( e, − a (1) w ( e, f = E ( S | e ) + a (0) w ( e,
0) + a (1) w ( e, S eff = S − E ( S | e ) − a (0) w ( e, − a (1) w ( e,
1) + a ( d )= S − E ( S | e ) + ( − d { a (0) − a (1) } w ( e, − d ) . Proof of Theorem 1
To simplify notation, we denote α = p tD (0) /p tD (1), ˆ α = ˆ p tD (0) / ˆ p tD (1), δ ( α ) = a { p tD ( d ) }− a { p tD ( d ) } , δ (ˆ α ) = a {
0; ˆ p tD ( d ) } − a {
1; ˆ p tD ( d ) } and ˆ δ (ˆ α ) = ˆ a {
0; ˆ p tD ( d ) } − ˆ a {
1; ˆ p tD ( d ) } .00 Y. Ma
Suppose we randomly partition the data into two groups: group 1 has m observationsand group 2 has n observations. Here, m = N . , n = N − m . We use the first group toobtain ˆ α , and ˆ δ (ˆ α ), then use the second group to form the following estimating equationto estimate β : n X i =1 S eff { x i ; ˆ β, ˆ α, ˆ δ (ˆ α ) } = 0 . We will first show that the resulting estimator satisfies n / ( ˆ β − β ) → N (0 , V ) in distri-bution when N → ∞ .The proof splits into several steps: First, obviously, ˆ α − α = O p ( m − / ) and ˆ δ (ˆ α ) − δ (ˆ α ) = O p ( m − / ), as long as a root- N -consistent ˜ β is inserted in the calculation of thesequantities. A standard expansion yields0 = n X i =1 S eff { x i ; ˆ β, ˆ α, ˆ δ (ˆ α ) } = n X i =1 S eff { x i ; β , ˆ α, ˆ δ (ˆ α ) } + n X i =1 ∂∂β T S eff { x i ; β ∗ , ˆ α, ˆ δ (ˆ α ) } ( ˆ β − β )= n X i =1 S eff { x i ; β , ˆ α, ˆ δ (ˆ α ) } + n (cid:26) E (cid:18) ∂S eff ∂β T (cid:19) + o p (1) (cid:27) ( ˆ β − β ) , which can be rewritten as (cid:26) E (cid:18) ∂S eff ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β )= − n − / n X i =1 S eff { x i ; β , ˆ α, ˆ δ (ˆ α ) } = − n − / n X i =1 [ S eff { x i ; β , ˆ α, δ (ˆ α ) } + ( − d i { ˆ δ (ˆ α ) − δ (ˆ α ) } w ( e i , − d i , ˆ α )] . The last equality uses the form of S eff in (3.2) and the fact that S , E ( S | e ) and w do notdepend on α . Because ˆ δ (ˆ α ) − δ (ˆ α ) = O p ( m − / ) = o p (1) and E { ( − d i w ( e i , − d i , ˆ α ) } = Z X d =0 , ( − d p E,D ( e, − d ; ˆ α ) η − ( e ) v ( e,
0; ˆ α ) + v ( e,
1; ˆ α ) p E,D ( e, d ; ˆ α ) d µ ( e ) = 0 , we actually have (cid:26) E (cid:18) ∂S eff ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β ) emiparametric efficiency in case-control studies − n − / n X i =1 S eff { x i ; β , ˆ α, δ (ˆ α ) } + o p (1)= − n − / n X i =1 (cid:26) S eff ( x i ) + ∂S eff ( x i ; β , α ) ∂α (ˆ α − α ) + ∂ S eff ( x i ; β , α ∗ ) ∂α (ˆ α − α ) (cid:27) + o p (1) . In addition, (ˆ α − α ) = O p ( m − ) = o p ( n − / ), so (cid:26) E (cid:18) ∂S eff ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β ) = − n − / n X i =1 (cid:26) S eff ( x i ) + ∂S eff ( x i ) ∂α (ˆ α − α ) (cid:27) + o p (1) . We now proceed to examine ∂S eff ( x i ) ∂α by examining each term in (3.2). S is free of α .As a function of α , we already have b ( e ; α ) ≡ E ( S | e ; α ) = P d R SN d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d )= R SN qH d µ ( g ) + α R SN qH d µ ( g ) R N qH d µ ( g ) + α R N qH d µ ( g ) = u ( e,
0) + αu ( e, u ( e,
0) + αu ( e, , where we define u ( e, d ) = R N d q ( g, β ) H ( d, g, e ) d µ ( g ) and u ( e, d ) = R SN d q ( g, β ) H ( d,g, e ) d µ ( g ). Using this notation, ∂b ∂α = u ( e, u ( e, − u ( e, u ( e, { u ( e,
0) + αu ( e, } ,w ( e,
0) = u ( e, u ( e,
0) + αu ( e, ,w ( e,
1) = αu ( e, u ( e,
0) + αu ( e, . Similarly to the calculation of b , b , we also have that for any function u , E ( u | d ; α ) = Z p E ( e ) R uN d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e ) . Z p E ( e ) R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e )= Z p E ( e ) R uN d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d u ( e, d ) /p tD ( d ) d µ ( e ) . Z p E ( e ) u ( e, d ) /p tD ( d ) P d u ( e, d ) /p tD ( d ) d µ ( e ) , Y. Ma thus E ( u | α ) = Z p E ( e ) R uN q ( g ) H (0 , g, e ) d µ ( g ) u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) ,E ( u | α ) = Z p E ( e ) R uN q ( g ) H (1 , g, e ) d µ ( g ) u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) . These relations lead to b = Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) ,b = Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) ,b ≡ E { E ( S | e ) | α } = Z p E ( e ) R E ( S | e ) N q ( g ) H (0 , g, e ) d µ ( g ) u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e )= Z p E ( e ) E ( S | e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e )= Z p E ( e ) u ( e, { u ( e,
0) + u ( e, α }{ u ( e,
0) + u ( e, α } d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) ,b ≡ E { w ( e, | D = 0 } = Z p E ( e ) R w ( e, N q ( g ) H (0 , g, e ) d µ ( g ) u ( e,
0) + u ( e, α d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e )= Z p E ( e ) u ( e, { u ( e,
0) + u ( e, α } d µ ( e ) . Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) . Consequently, we obtain S eff (0) = S − b ( e ) + (cid:26) b − b + b − b b − b (1 − b )1 − b (cid:27) αu ( e, u ( e,
0) + αu ( e, S − b ( e ) + (cid:18) b − b − b (cid:19) αu ( e, u ( e,
0) + αu ( e, ,S eff (1) = S − b ( e ) − (cid:18) b − b − b (cid:19) u ( e, u ( e,
0) + αu ( e, ,∂S eff (0) ∂α = − b ( e ) ′ + (cid:18) b − b − b (cid:19) ′ αu ( e, u ( e,
0) + αu ( e,
1) + (cid:18) b − b − b (cid:19) u ( e, u ( e, { u ( e,
0) + αu ( e, } ,∂S eff (1) ∂α = − b ( e ) ′ − (cid:18) b − b − b (cid:19) ′ u ( e, u ( e,
0) + αu ( e,
1) + (cid:18) b − b − b (cid:19) u ( e, u ( e, { u ( e,
0) + αu ( e, } . emiparametric efficiency in case-control studies S does not contain α , ∂S eff ∂α is a function of ( e, d ) only. Because p E,D ( e, d ) = η ( e ) u ( e, d ) / { N p tD ( d ) } , we have p E,D ( e,
0) = (1 + α ) η ( e ) u ( e, / ( N α ), p E,D ( e,
1) = (1 + α ) η ( e ) u ( e, /N and p E ( e ) = (1 + α ) η ( e ) { u ( e,
0) + αu ( e, } / ( N α ). Combining theseresults, we have E (cid:18) ∂S eff ∂α (cid:19) = E (cid:20) − b ′ ( e ) + (cid:18) b − b − b (cid:19) u ( e, u ( e, { u ( e,
0) + αu ( e, } (cid:21) = E (cid:20) − u ( e, u ( e,
0) + u ( e, u ( e, { u ( e,
0) + αu ( e, } (cid:21) + (cid:18) b − b − b (cid:19) E (cid:20) u ( e, u ( e, { u ( e,
0) + αu ( e, } (cid:21) . Plugging in the expressions for b , b , b , we obtain b − b − b = (cid:20)Z p E ( e ) u ( e, { u ( e,
0) + u ( e, α }{ u ( e,
0) + u ( e, α } d µ ( e ) − Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) (cid:21).(cid:20)Z p E ( e ) u ( e, u ( e,
0) + u ( e, α d µ ( e ) − Z p E ( e ) u ( e, { u ( e,
0) + u ( e, α } d µ ( e ) (cid:21) = Z αp E ( e ) { u ( e, u ( e, − u ( e, u ( e, }{ u ( e,
0) + u ( e, α } d µ ( e ) ."Z αp E ( e ) u ( e, u ( e, { u ( e,
0) + u ( e, α } d µ ( e ) = E (cid:20) { u ( e, u ( e, − u ( e, u ( e, }{ u ( e,
0) + u ( e, α } (cid:21). E (cid:20) u ( e, u ( e, { u ( e,
0) + u ( e, α } (cid:21) , thus, we have E ( ∂S eff /∂α ) = 0.The fact that E ( ∂S eff /∂α ) = 0, in combination with ˆ α − α = o p (1), yields (cid:26) E (cid:18) ∂S eff ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β ) = − n − / n X i =1 S eff ( x i ) + o p (1) . Thus, we indeed have n / ( ˆ β − β ) ∼ N (0 , V ) asymptotically.In fact, the classical N / ( ˆ β − β ) ∼ N (0 , V ) also holds. This is because N / ( ˆ β − β ) − n / ( ˆ β − β ) = mN / + n / ( ˆ β − β ) → N . N n / ( ˆ β − β ) → N → ∞ . Thus, our estimator is semiparametric efficient. Because of the equivalenceresult developed in Section 2, the estimator is also semiparametric efficient for case-control data. We split the data set into two groups with sizes m and n for simplicity ofthe asymptotic analysis. In reality, one can certainly use the whole data set in each stageof the estimation.04 Y. Ma
Acknowledgments
This work was supported by NSF Grant DMS-0906341.
References
Allen, A.S., Satten, G.A. and Tsiatis, A.A. (2005). Locally-efficient robust estimation ofhaplotype-disease association in family-based studies.
Biometrika Efficient and Adaptive Esti-mation for Semiparametric Models . Baltimore: The Johns Hopkins Univ. Press. MR1245941Breslow, N.E., Robins, J.M. and Wellner, J.A. (2000). On the semi-parametric efficiency oflogistic regression under case-control sampling.
Bernoulli Biometrika J. Amer. Statist. Assoc.
Statist. Sinica Biometrika Statist. Sinica Genetic Epidemiology Biometrika835–848. MR2126036