[PDF] A semiparametric efficient estimator in case-control studies

Abstract

We construct a semiparametric estimator in case-control studies where the gene and the environment are assumed to be independent. A discrete or continuous parametric distribution of the genes is assumed in the model. A discrete distribution of the genes can be used to model the mutation or presence of certain group of genes. A continuous distribution allows the distribution of the gene effects to be in a finite-dimensional parametric family and can hence be used to model the gene expression levels. We leave the distribution of the environment totally unspecified. The estimator is derived through calculating the efficiency score function in a hypothetical setting where a close approximation to the samples is random. The resulting estimator is proved to be efficient in the hypothetical situation. The efficiency of the estimator is further demonstrated to hold in the case-control setting as well.

Full PDF

aa r X i v : . [ m a t h . S T ] O c t Bernoulli (2), 2010, 585–603DOI: 10.3150/09-BEJ210 A semiparametric eﬃcient estimator incase-control studies

YANYUAN MA

Department of Statistics, Texas A&M University, College Station, TX 77843, USA.E-mail: [email protected]

We construct a semiparametric estimator in case-control studies where the gene and the envi-ronment are assumed to be independent. A discrete or continuous parametric distribution of thegenes is assumed in the model. A discrete distribution of the genes can be used to model the mu-tation or presence of certain group of genes. A continuous distribution allows the distribution ofthe gene eﬀects to be in a ﬁnite-dimensional parametric family and can hence be used to modelthe gene expression levels. We leave the distribution of the environment totally unspeciﬁed. Theestimator is derived through calculating the eﬃciency score function in a hypothetical settingwhere a close approximation to the samples is random. The resulting estimator is proved to beeﬃcient in the hypothetical situation. The eﬃciency of the estimator is further demonstrated tohold in the case-control setting as well.

Keywords: case-control study; gene-environment interaction; logistic regression; semiparametriceﬃciency

1. Introduction

Case-control designs are frequently implemented in clinical studies where, instead of tak-ing a random sample of a mixed population of both cases and non-cases, a ﬁxed numberof cases and a ﬁxed number of controls are randomly sampled from the respective pop-ulations of cases and non-cases. Because the resulting samples are no longer random orindependently and identically distributed (i.i.d.), the classical large-sample asymptotictheories could fail to apply. In the literature, two main approaches are taken in orderto adapt the large-sample theory to the case-control setting. The ﬁrst approach is high-lighted in Breslow et al. (2000), where a modiﬁed design of the usual case-control studyis proposed. The resulting random sample is then linked to the true case-control samplethrough using results from McNeney (1998), where the similarity between random andnon-random sample asymptotic properties is developed by almost establishing the wholeasymptotic theory under non-i.i.d. samples. The second approach is somewhat more di-rect and is implicitly used by Rabinowitz (2000). Instead of treating the indicator ( D ) This is an electronic reprint of the original article published by the ISI/BS in

Bernoulli ,2010, Vol. 16, No. 2, 585–603. This reprint diﬀers from the original in pagination andtypographic detail. (cid:13) Y. Ma of case/control as a random variable, D is assumed to be known and all the calculationsare performed conditionally on D . Although it does result in the conditional random-ness of the case-control samples, the resulting data is not really identically distributed.Speciﬁcally, two diﬀerent distributions are involved and the large-sample theory is stillnot available. Strictly speaking, the asymptotic theory for non-i.i.d. data rederived inMcNeney (1998) also needs to be applied in order to treat such a combination of twosample cases.In addition to the complexity arising from a case-control design, the problem consid-ered in this article is also a semiparametric model problem, whose eﬃcient estimator hasnot yet been explored even in the i.i.d. data situation. Speciﬁcally, the problem is asfollows. Suppose that in the general population, the occurrence of a disease ( D = 1) fol-lows a logistic model logit { Pr( D = 1) } = m ( G, E ), where G represents a person’s geneticcharacter and E represents the environmental elements. Further, suppose that G and E are independent of each other and that we are interested in the eﬀect of gene, environ-ment and their interaction on the disease status. Thus, m ( g, e ) = β c + β g + β e + β ge .The parametric form of the distribution of gene g is assumed to be known as q ( g, β ),where β is an unknown ﬁnite-dimensional parameter. The distribution of the environ-ment, η ( e ), is unspeciﬁed. A special version of this problem is considered in Chatterjeeand Carroll (2005), where q ( g, β ) is assumed to be a discrete distribution. There, theauthors derived a proﬁle maximum likelihood estimator for β = ( β c , β , β , β , β ) T andshowed that it is root- N consistent, where N is the size of the combined samples. Theestimator is later extended to a more general framework in Spinka et al. (2005). How-ever, it is not investigated whether the estimator achieves the optimal semiparametriceﬃciency.In this paper, we ﬁrst establish in Section 2 that the classical semiparametric theoryof Bickel et al. (1993) is applicable in general case-control studies, without having torederive the theory in parallel or having to resort to the results from McNeney (1998).Such ﬁrst order asymptotic equivalence between case-control sampling and random sam-pling is a new result. We then proceed to compute the semiparametric eﬃcient score andconstruct a semiparametric estimator for β in Section 3. The computation is carried outin a hypothetical population described in Section 2. This diﬀers from the real populationfrom which the cases and controls are drawn. Hence, the derivation has its own interestand novelty. In this section, we also prove that although the estimation of the nuisanceparameter η is bypassed in our estimator, the resulting semiparametric estimator stillachieves the optimal eﬃciency. The proof and treatment is rather non-standard. Numer-ical examples are included in Section 4 to demonstrate the performance of the proposedestimator. The performance of the method in the discrete gene model is close to that ofthe method in Chatterjee and Carroll (2005) and we pointed out the possible equivalencebetween the two methods in Section 5. Some analytical derivations and technical detailsare included in the Appendix. emiparametric eﬃciency in case-control studies

2. Case-control data versus i.i.d. data

The samples from a case control study are not random because the disease status is notrandom. In general, the design randomly samples N individuals from the case populationand N from the non-case population. However, let us consider a hypothetical populationof interest with inﬁnite population size, in which the disease to non-disease ratio is ﬁxedat π = N /N . Here, the reason for introducing the notion of hypothetical population isto be able to use the classical semiparametric theory for i.i.d. data, developed in Bickel et al. (1993). If the sample of size N = N + N from a case-control study happens tobe a random sample from the hypothetical population of interest, then we have a size- N i.i.d. random sample and the usual semiparametric analysis will apply. The asymptoticresults hold when N → ∞ and π stays ﬁxed.Of course, the problem is that a random sample of size N from the hypothetical pop-ulation of interest does not have to have exactly N controls and N cases, hence wecannot immediately equate a case-control sample and a random sample from the hypo-thetical population. In general, the number of controls/cases of a random sample fromthe hypothetical population will have a binomial distribution N rd ∼ Binomial(

N, N d /N ), d = 0 ,

1, which is very close to a normal distribution when N is large, that is,( N rd − N d ) / p N π (1 − π ) → Normal(0 ,

1) in distribution when N → ∞ . Here, the super-script r stands for ‘random.’ Furthermore, the probability of having | N rd − N d | > N / goes to zero when N → ∞ . Thus, we could think of the case-control sample as obtainedby randomly picking a size- N sample from the hypothetical population of interest, thendeleting a random o p ( N / ) cases (controls) and adding a random o p ( N / ) controls(cases). Or, alternatively, we can think of the case-control sample as a random sam-ple of size N , but with a randomly chosen o p ( N / ) data contaminated in a particularway. This “particular” contamination implies the following three properties: (i) the con-tamination happens only to o p ( N ) of the observations (in the case-control samples, thecontamination in fact only happens to o p ( N / ) observations, but, in general, o p ( N ) is al-ready suﬃcient for our further analysis); (ii) the contaminated data is still of order O(1),that is, | X ci − X i | is bounded in probability for i = 1 , . . . , N ; (iii) the zero expectationholds for the contaminated observations, that is, if an estimating equation for β of theform P Ni =1 f ( X i ; β ) = 0 satisﬁes E { f ( X i ; β ) } = 0, then E { f ( X ci ; β ) } = 0 as well. Here, X i , i = 1 , . . . , N , are i.i.d. random samples, the superscript c stands for ‘contaminated’and the subscript represents the true parameter value.When the case-control sample is viewed as a contaminated random sample from thehypothetical population of interest, the ﬁrst two “particular” properties certainly hold.For the estimator we will construct, we shall demonstrate that the third property alsoholds. Thus, if we can show that the same ﬁrst order asymptotics apply to both thei.i.d. sample of size N and its contaminated version as long as the three properties hold,then we can treat the case-control sample as an i.i.d. sample.The argument is as follows. Assume that we mistakenly treated the contaminated dataas i.i.d. and obtained an eﬃcient estimator: N X i =1 S eﬀ ( X ci ; β ) = 0 . (2.1)88 Y. Ma

Here, S eﬀ is the eﬃcient score function and its derivation is model-dependent. One obvi-ous aspect of S eﬀ worth emphasizing is that the construction of S eﬀ does not depend onthe observations. Regardless of the method of derivation, the eﬃcient score function S eﬀ has the property E { S eﬀ ( X i ; β ) } = 0. If we had the uncontaminated data, our subsequentestimator for β would have been P Ni =1 S eﬀ ( X i ; β ) = 0. Working with the contaminateddata, (2.1) is the estimating equation we really have. Suppose that ˆ β solves (2.1). Wethen have 0 = N X i =1 S eﬀ ( X ci ; ˆ β ) = N X i =1 S eﬀ ( X ci ; β ) + N X i =1 ∂S eﬀ ( X ci ; β ∗ ) ∂β T ( ˆ β − β ) , therefore, − N − ( N X i =1 ∂S eﬀ ( X ci ; β ∗ ) ∂β T ) √ N ( ˆ β − β ) = N − / N X i =1 S eﬀ ( X ci ; β ) , (2.2)where β ∗ lies on the line connecting β and ˆ β . Note that in our “particular” contaminationrequirement, only o p ( N ) terms yield a diﬀerent X i from X ci (requirement (i)) and, foreach X ci = X i , the diﬀerence is O p (1) (requirement (ii)), so we have N − ( N X i =1 ∂S eﬀ ( X ci ; β ∗ ) ∂β T ) = N − ( N X i =1 ∂S eﬀ ( X i ; β ∗ ) ∂β T ) + o p (1) (2.3)= E ( ∂S eﬀ ( X i ; β ) ∂β T ) + o p (1) . From the third “particular” property, we have E { S eﬀ ( X ci ; β ) } = 0 (we will prove thatthis property holds for the case-control data in Section 3). In conjunction with the factthat only o p ( N ) of the terms S eﬀ ( X ci ; β ) − S eﬀ ( X i ; β ) are non-zero, we can furtherobtain N − / N X i =1 S eﬀ ( X ci ; β ) = N − / N X i =1 S eﬀ ( X i ; β ) + o p (1) . (2.4)The detailed argument of (2.4) is the following. Suppose for the ﬁrst l = o p ( N ) observa-tions, X ci = X i . Then we have N − / N X i =1 S eﬀ ( X ci ; β )= N − / N X i =1 S eﬀ ( X i ; β ) + N − / l X i =1 { S eﬀ ( X ci ; β ) − S eﬀ ( X i ; β ) } emiparametric eﬃciency in case-control studies N − / N X i =1 S eﬀ ( X i ; β ) + ( N/l ) − / l − / l X i =1 { S eﬀ ( X ci ; β ) − S eﬀ ( X i ; β ) } . Note that S eﬀ ( X ci ; β ) − S eﬀ ( X i ; β ) has mean zero, hence l − / P li =1 { S eﬀ ( X ci ; β ) − S eﬀ ( X i ; β ) } = O p (1). From l = o p ( N ), we obtain the result in (2.4) immediately. Thus,plugging (2.3) and (2.4) into (2.2), we obtain − E (cid:26) ∂S eﬀ ( X i ; β ) ∂β T (cid:27) √ N ( ˆ β − β ) = N − / N X i =1 S eﬀ ( X i ; β ) + o p (1) . The above display is exactly the ﬁrst order asymptotic expansion of the estimator for β if we had performed the estimation procedure on the uncontaminated data. Thus,we have demonstrated that the estimator obtained from contaminated data performs aswell as the one obtained from uncontaminated data in terms of ﬁrst order asymptoticproperties. Note that the eﬃcient estimator can be replaced by a consistent estimator,say, a general S instead of S eﬀ , as long as E ( S | D = d ) = 0 holds for d = 0 ,

1. Thisensures that E { S ( X ci ) } = 0 as long as E { S ( X i ) } = 0 (shown in Section 3), so the abovederivation will still carry through. Hence, the asymptotic property of the estimator usingthe contaminated data is indeed the same as if we had the uncontaminated data. Thus,the case-control data can be treated as i.i.d. data and we can achieve the same eﬃciencyas when the data was indeed i.i.d. In other words, a semiparametric estimator usingcontaminated data is at least as eﬃcient as one using the uncontaminated data.One question still remains: can we do even better than in the i.i.d. data case? Infact, since case-control sampling is designed to be an eﬃcient way to collect covariateinformation, it seems to contain more information than a random sample. However, weclaim that for asymptotically linear estimators of the form √ N ( ˆ β − β ) = 1 √ N N X i =1 ψ ( X ci ; β ) + o p (1) , where E { ψ ( X ci ; β ) | d } = 0, the eﬃciency in parameter estimation cannot be further im-proved by taking into account the special sampling procedure. This is because otherwise,we could have obtained a better estimator for the i.i.d. sample as well, by replacing X ci with X i . The detailed derivation is the same as in the above paragraph, where thecondition E { ψ ( X ci ; β ) | d } = 0 implies E { ψ ( X i ; β ) | d } = 0 for case-control data, whichensures E { ψ ( X ci ; β ) } = E { ψ ( X i ; β ) } = 0. Of course, if the condition E ( ψ | d ) = 0 is notsatisﬁed, the argument does not work. However, we now show that if ψ achieves theoptimal variance for the case-control data X ci , then it has to satisfy E { ψ ( X ci ; β ) | d } = 0.First, E { ∂E ( ψ | D ) /∂β } = ∂E ( ψ ) /∂β = 0 because the probability density function(p.d.f.) of D does not contain β . If we let ˜ ψ ( X ci ) = ψ ( X ci ) − E { ψ ( X ci ) | d } , then E { ˜ ψ ( X ci ) } = 0 and E { ∂ ˜ ψ ( X ci ) /∂β } = E { ∂ψ ( X ci ) /∂β } . If E { ψ ( X ci ) | d } 6 = 0, then we canobtainvar { ψ ( X ci ) } = E [var { ψ ( X ci ) | D } ] + var[ E { ψ ( X ci ) | D } ] = var { ˜ ψ ( X ci ) } + var[ E { ψ ( X ci ) | D } ]90 Y. Ma > var { ˜ ψ ( X ci ) } , which, together with E { ∂ ˜ ψ ( X ci ) /∂β } = E { ∂ψ ( X ci ) /∂β } , contradicts the fact that ψ ( X ci )is optimal.In summary, we have shown that the case control samples can be treated as if theywere i.i.d. and all the ﬁrst order asymptotic results for i.i.d. data will be inherited forcase-control data as well. We can see that the above establishment is similar to the devel-opment in Breslow et al. (2000). However, one prominent diﬀerence is that in Breslow etal. (2000), the case-control sample is viewed as the result of a biased sampling procedurewith ﬁxed subsample size, hence they cannot use the classical semiparametric theory fori.i.d. data, but have to refer to McNeney (1998) for the theoretical properties, where thewhole semiparametric theory for ﬁxed-size subsamples is established in parallel to thei.i.d. framework. Here, through introducing the notion of hypothetical population and byanalyzing the ﬁrst order equivalence between a random sample and a sample with ﬁxed-size subsamples, we can easily contain the case-control problem in the usual i.i.d. modelframework. The derivation is much simpler and more elegant. Thus, in the remainder ofthe paper, we ignore the case-control nature of the data and proceed with our analysis bypretending the data is i.i.d. from the aforementioned hypothetical population of interest.

3. A semiparametric eﬃcient estimator

A random sample from the hypothetical population of interest has p.d.f. p ( g, e, d ; β, η ) = p D ( d ) p G,E | D ( g, e | d ) = p D ( d ) p tG,E | D ( g, e | d )= p D ( d ) p tG ( g ) p tE ( e ) p tD | G,E ( d | g, e ) /p tD ( d ) (3.1)= N d N q ( g ) η ( e ) H ( d, g, e ) p tD ( d ) . Here, the superscript t stands for the p.d.f. in the true population, whereas expressionswithout superscripts, including various p.d.f.’s and expectation E , are quantities in thehypothetical population of interest; η ( e ) = p tE ( e ) is the unknown inﬁnite-dimensionalparameter and H ( d, g, e ; β ) = exp[ d { m ( g, e ) } ] / [1 + exp { m ( g, e ) } ]= exp { d ( β c + β g + β e + β ge ) } / { β c + β g + β e + β ge ) } ; p tD ( d ; β, η ) = Z q ( g, β ) η ( e ) H ( d, g, e ; β ) d µ ( g ) d µ ( e ) . We recognize that estimating the ﬁnite-dimensional parameter β in the presence of aninﬁnite-dimensional nuisance parameter η , using an i.i.d. sample of size N = N + N from emiparametric eﬃciency in case-control studies et al. (1993). Because the general approachand related concepts have been nicely described in several recent papers including Tsiatisand Ma (2004), Allen et al. (2005), Ma et al. (2005) and Ma and Tsiatis (2006), here,we only brieﬂy outline the general approach and the deﬁnition of the relevant concepts,referring the reader to these papers for more detailed descriptions.In general semiparametric problems, one approach to construct estimators for β is toobtain some inﬂuence function φ ( X i ; β, η ) which is subsequently used to form estimatingequations for β in the form of P Ni =1 φ ( X i ; β, η ) = 0. Here, X i = ( G i , E i , D i ) , i = 1 , . . . , N ,are i.i.d. observations. The solution of the estimating equation, ˆ β , is subsequently asemiparametric estimator and its variance has been established to be equal to the varianceof φ ( X i ; β, η ). Consequently, the optimal estimator among the class of all such estimatorsis the one whose inﬂuence function has the smallest variance. This is usually referred toas the semiparametric eﬃcient estimator .The geometric approach considers the space in which all inﬂuence functions belong.Speciﬁcally, one considers a Hilbert space H which consists of all zero-mean measurablefunctions with ﬁnite variance and the same dimension as β . The inner product in H isdeﬁned as the covariance. The Hilbert space H is further decomposed into two spaces,the nuisance tangent space Λ and its orthogonal complement Λ ⊥ .To understand the nuisance tangent space Λ, consider ﬁrst the case where the nui-sance parameter, denoted γ , is ﬁnite-dimensional. Then, the nuisance score function, S γ = ∂ log p ( X i ; β, γ ) /∂γ , spans a linear space, which is denoted Λ. In the case of theinﬁnite-dimensional nuisance parameter η , the corresponding Λ is deﬁned as the meansquared closure of the span of all the nuisance score functions S γ , where p ( X i ; β, γ ) is anyparametric submodel of p ( X i ; β, η ). The orthogonal complement of Λ in H is subsequentlydeﬁned as Λ ⊥ .Any function in Λ ⊥ can be properly normalized to obtain a valid inﬂuence function.On the other hand, every inﬂuence function is a function in Λ ⊥ . Among all these func-tions, the projection of the score function S β = ∂ log p ( X i ; β, γ ) /∂β results in the eﬃcientinﬂuence function. If we denote the projection by S eﬀ , then the corresponding optimalvariance is var( S eﬀ ) − . The projection S eﬀ is usually called the eﬃcient score function .Hence, the geometric approach converts the problem of searching for eﬃcient semi-parametric estimators to the problem of calculating S eﬀ . Following the description in Section 3.1, we obtain the eﬃcient score function S eﬀ . View-ing the sample as random from the hypothetical population, the p.d.f. in (3.1) is nolonger in a simple multiplicative form, in that the nuisance parameter appears both inthe numerator and in the integral in the denominator. Since this implies that the nui-sance tangent space is not automatically orthogonal to the score functions, the related92 Y. Ma computation for the nuisance tangent space and associated objects is more involved. Inaddition, one needs to be aware that the calculation should be carried out with respect tothe hypothetical population, hence quantities such as p tG , p tE , p tD need to be treated withextra care and not confused with p G , p E , p D . The main steps of the derivation are as fol-lows. We ﬁrst calculate the score function S β by taking the derivative of log p ( g, e, d ; β, η )with respect to β . This results in S β = S − E ( S | d ), where S = (cid:26) ( m ′ β c m ′ β m ′ β m ′ β ) (cid:18) d − m (cid:19) q ′ β ( g, β ) T q ( g, β ) (cid:27) T . We then calculate the two spaces Λ , Λ ⊥ by replacing η in (3.1) with a ﬁnite-dimensionalparameter γ , taking the derivative of log p ( g, e, d ; β, γ ) with respect to γ to obtain S γ ,hypothesizing a space of all such S γ and proving that Λ is equivalent to this space. Theresults areΛ = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e ) such that E t ( h ) = 0] = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e )] , Λ ⊥ = [ h ( g, e, d ): E ( h | e ) = E { E ( h | d ) | e } ] . We ﬁnally project the score vector S β onto Λ ⊥ to obtain S eﬀ = S β − f ( e ) + E ( f | d ) = S − E ( S | d ) − f ( e ) + E ( f | d ), where f ( e ) − E ( f | d ) represents the projection of S β onto Λ.The details of the derivation can be found in the Appendix. Note that this form of S eﬀ implies that E { S eﬀ ( X ) | d } = 0. When X is replaced by X c , the non-random case-controlsample, we still have E { S eﬀ ( X c ) | d } = 0 because the design itself guarantees that theonly non-random component is d , which is held constant. Thus, viewing X c as a specialcontaminated version of X , we still have E { S eﬀ ( X c ) } = 0, which is required in Section 2.From the Appendix, we can further write S eﬀ = S − E ( S | e ) + ( − d { a (0) − a (1) } w ( e, − d ) , (3.2)where a (0) − a (1) = E ( f | D = 0) − E ( S | D = 0) − E ( f | D = 1) + E ( S | D = 1).In terms of the calculation of S eﬀ , note that S , E ( S | e ) and w , as given in (A.1), are allfunctions with parameters β and p tD ( d ) only. Hence, as long as we can calculate p tD ( d ),we will have the ability to evaluate S , E ( S | e ) and w . The computation of a (0) − a (1)requires further arguments.In the following, we ﬁrst obtain an approximation of p tD ( d ), then pursue the estimationof a (0) − a (1). To estimate p tD ( d ), using p E ( e ) to denote the probability density functionof e in the hypothetical population, we observe that N d = N p D ( d ) = Z N p

D,E ( d, e ) d µ ( e ) = Z N p E ( e ) p D,G | E ( d, g | e ) d µ ( g ) d µ ( e )= Z N p E ( e ) R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e )= E e (cid:26) N R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) (cid:27) . emiparametric eﬃciency in case-control studies E e with its sample moment through averaging across diﬀerentobserved e i ’s, we obtain N d ≈ N X i =1 R N d q ( g, β ) H ( d, g, e i ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e i ) d µ ( g ) /p tD ( d ) for d = 0 , . (3.3)Note that the above two equations are not independent – one determines the other. But,in combination with p tD (0) + p tD (1) = 1, we can estimate p tD ( d ) completely. Because theonly approximation involved in estimating p tD ( d ) is replacing the mean with a samplemean, the calculation will produce a root- N -consistent estimator for p tD (0) and p tD (1).We denote the estimators by ˆ p tD (0) and ˆ p tD (1). In calculating N d , we write p ( g, e, d ) as p E ( e ) p D,G | E ( d, g | e ), instead of directly using the form in (3.1). Since p E ( e ) is the p.d.f. ofthe environment variable in the hypothetical population, this enables us to replace theexpectation E e with the average of the samples.The estimation of a (0) − a (1) is much more tedious, and involves an almost bruteforce calculation of E ( S | d ) and E ( f | d ). If we let b = E ( S | D = 0) , b = E ( S | D = 1) , c = E ( f | D = 0) and c = E ( f | D = 1), then a (0) − a (1) = b − b + c − c . The calculation of b and b follows from b d = R Sp D,G,E ( d, g, e ) d µ ( g ) d µ ( e ) R p D,G,E ( d, g, e ) d µ ( g ) d µ ( e ) = R Sp E ( e ) p D,G | E ( d, g | e ) d µ ( g ) d µ ( e ) R p E ( e ) p D,G | E ( d, g | e ) d µ ( g ) d µ ( e )= Z p E ( e ) R SN d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e ) . Z p E ( e ) R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e ) . Since S can be calculated directly, we simply obtain the approximation of b d , d = 0 ,

1, byreplacing the mean with sample mean and plugging in the estimated p tD ( d ):ˆ b = N X i =1 R S (0 , g, e i ) q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) (3.4) . N X i =1 R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) , ˆ b = N X i =1 R S (1 , g, e i ) q ( g ) H (1 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) / ˆ p tD ( d ) (3.5) . N X i =1 R q ( g ) H (1 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) / ˆ p tD ( d ) . Y. Ma

The calculations of c and c are a bit more tricky. Since f = E ( S | e ) + ( c − b ) w ( e,

0) + ( c − b ) { − w ( e, } , taking expectation conditional on, say D = 0, we have c = E { E ( S | e ) | D = 0 } + ( c − b ) E { w ( e, | D = 0 } + ( c − b )[1 − E { w ( e, | D = 0 } ]or, equivalently, we obtain c − c = E { E ( S | e ) | D = 0 } − b E { w ( e, | D = 0 } − b [1 − E { w ( e, | D = 0 } ]1 − E { w ( e, | D = 0 } . Hence, replacing mean by sample mean and using ˆ p tD ( d ), c − c is estimated byˆ c − ˆ c = ˆ E { E ( S | e ) | D = 0 } − ˆ b ˆ E { w ( e, | D = 0 } − ˆ b [1 − ˆ E { w ( e, | D = 0 } ]1 − ˆ E { w ( e, | D = 0 } , (3.6)where ˆ E { w ( e, | D = 0 } = N X i =1 w ( e i , R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) (3.7) . N X i =1 R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d )and ˆ E { E ( S | e ) | D = 0 } = N X i =1 E ( S | e i ) R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) (3.8) . N X i =1 R q ( g ) H (0 , g, e i ) d µ ( g ) P d R N d q ( g ) H ( d, g, e i ) d µ ( g ) / ˆ p tD ( d ) . Similarly to the estimation of p tD ( d ), the only approximation involved in obtaining b (0) , b (1) and c (0) − c (1) is replacing mean by sample mean, so a (0) − a (1) is estimatedusing ˆ a (0) − ˆ a (1) = ˆ b − ˆ b + ˆ c − ˆ c at the root- N rate.We would like to emphasize that in all of the above calculations, when we replace theexpectation with the sample average, we use the result that the case-control sample canbe treated as a random sample from the hypothetical population. Hence, for any function u ( e ), the approximation N − P Ni =1 u ( e i ) can only be used to replace R u ( e ) p E ( e ) d µ ( e ),not R u ( e ) η ( e ) d µ ( e ). emiparametric eﬃciency in case-control studies β in all of the above expressions, in fact, p tD (0) , p tD (1) , a (0) − a (1) are all functions of β . However, if we replace β with ˜ β , an initial estimator of β , we will still obtain ˆ p tD ( d ; ˜ β ) , ˆ a (0; ˜ β ) − ˆ a (1; ˜ β ) that are root- N -consistent, as long as˜ β − β = O p ( N − / ). The ﬁnal estimating equation of β has the form N X i =1 ˆ S eﬀ ( x i ; β ) = N X i =1 S eﬀ { x i ; β, ˆ p tD ( d ; ˜ β ) , ˆ a (0; ˆ p tD , ˜ β ) − ˆ a (1; ˆ p tD , ˜ β ) } = 0 , (3.9)where x i denotes the i th observation ( d i , g i , e i ).To summarize the description of the estimator, we outline the algorithm here:Step 1. Pick a starting value ˜ β that is root- N consistent.Step 2. Solve for ˆ p tD (0) and ˆ p tD (1) = 1 − ˆ p tD (0) from (3.3).Step 3. Obtain ˆ b and ˆ b from (3.4) and (3.5).Step 4. Obtain ˆ c − ˆ c from (3.6) and (3.7), (3.8).Step 5. Calculate S eﬀ using (3.2) and obtain ˆ β from solving (3.9).It is worth pointing out that in order to carry out Step 1, we have used a vital assump-tion that a root- N starting value ˜ β exists. Fortunately, the existence of ˜ β is equivalent tothe identiﬁability of β and is already well established in Chatterjee and Carroll (2005).The starting value used there, or in Spinka et al. (2005), can be used to obtain the initialestimator ˜ β . Our algorithm here does not require an iteration of Steps 2–5 upon eachupdate of β . However, in practice, a more accurate ˜ β can improve the ﬁnal estimation ˆ β signiﬁcantly, hence iterations are almost always implemented. If we could use the exact p tD ( d ; β ) and a (0; β ) − a (1; β ) in (3.9), then, according toSection 3.1, the resulting estimator for β would be an eﬃcient estimator, with es-timation variance V = E ( S eﬀ S Teﬀ ) − . To ﬁrst order, V can be approximated using N { P Ni =1 ˆ S eﬀ ( x i ; ˆ β ) ˆ S Teﬀ ( x i ; ˆ β ) } − , where ˆ β solves (3.9).We claim that using the estimated ˆ S eﬀ as in (3.9), we obtain an estimating equationthat yields the same estimator for β as using S eﬀ , in terms of its ﬁrst order asymptoticproperties. Theorem 1.

The algorithm in Section 3.2 yields a semiparametric eﬃcient estimatorfor β . That is, √ N ( ˆ β − β ) → Normal { , var( S eﬀ ) − } in distribution when N → ∞ and N /N is ﬁxed. The proof of the theorem contains two main steps. In the ﬁrst step, we show thesemiparametric eﬃciency of the estimator if the observations had been i.i.d. In the second96

Y. Ma

Table 1.

Simulation results for the two experiments, each with two diﬀerent sets of parame-ter values, representing uncommon (upper-left) and common (upper-right) gene mutation, andhomogeneous (lower-left) and diversiﬁed (lower-right) gene expression levels. ‘true’ is the truevalue of β , ‘est’ is the average of the estimated β , ‘sd’ is the sample standard deviation and ‘ b sd’is the average of the estimated standard deviation β c β β β β β c β β β β Experiment 1true − . . . . . − . . . . . − . . . . . − . . . . . . . . . . . . . . . b sd 1 . . . . . . . . . . − . . . . . − . . . . . − . . . . . − . . . . . . . . . . . . . . . b sd 0 . . . . . . . . . . step, we proceed to show the eﬃciency in the case-control study using results in Section 2.Rather complex algebra needs to be employed in the ﬁrst step. The proof also involvesa split of the data in the ﬁnal estimation of β , and in estimating p tD ( d ) and a (0) − a (1),mainly for technical convenience. The details of the proof appear in the Appendix.

4. Numerical examples

We conducted a small simulation study to demonstrate the performance of the estima-tor. In the ﬁrst experiment, we generated 500 cases and 500 controls, where the trueenvironment element E is min(10 , X ) and X is generated from a log-normal distributionwith mean 0 and variance 1. A dichotomous model of the gene is used, where G = 1 withprobability β and G = 0 with probability 1 − β . This kind of model for q ( g, β ) canrepresent the presence/absence of a certain gene mutation. We used two diﬀerent sets ofvalues for β : the ﬁrst set is β = ( − . , . , . , . , . T , where β = 0 .

26 representsa relatively common mutation; the second set is β = ( − . , . , . , . , . T , where β = 0 .

065 represents a very rare mutation. In both sets, the true parameters are chosenso that the model results in a population disease rate p tD (1) ≈ q ( g, β ). Here, wemodel q ( g, β ) with a Laplace distribution with variance β . This kind of model is typi-cally used to model the gene expression level. To maintain an approximate 5% disease ratein the population, we used β = ( − . , . , . , . , . T and β = ( − . , . , . , . , T as the true parameter values. Again, in the ﬁrst set, β = 0 . emiparametric eﬃciency in case-control studies β = 1 represents a largervariation, so the population is more diversiﬁed. The simulation results are presented inthe lower half of Table 1. In both experiments, 1000 simulations are implemented.From Table 1, it is clear that the estimator for β is consistent in all four situationsand the estimated standard deviation approximates the true one rather well. It is worthnoting that the ﬁrst experiment is a repetition of the same setting as in Chatterjee andCarroll (2005) and we observe very similar results. Speciﬁcally, for β , β , β , β in theupper-left table, their results for “sd” are 0.322, 0.037, 0.128, 0.0122, respectively, andthose in the upper-right table are 0.198, 0.043, 0.075 and 0.0273, respectively. Althoughsome numerical improvement can be observed in certain parameters (for example β ),it can be a result of ﬁnite-sample performance and numerical issues. We conjecture thatthe estimator in Chatterjee and Carroll (2005) is equivalent to the method proposedhere, hence is also eﬃcient, although a rigorous proof is beyond the scope of this paper.It is also worth noting that the estimation of β c is more diﬃcult than the remainingcomponents of β , in that the estimation has large variability. This is especially prominentin the discrete model setting for q ( g ). Indeed, the estimation result for β c has not beenreported elsewhere and, without the gene-environment independence, β c is known to beunidentiﬁable (Prentice and Pyke (1979)). This provides an intuitive explanation for theperformance of ˆ β c we observe. The set of estimating equations is solved using a standardNewton–Raphson algorithm.

5. Conclusion

Semiparametric modeling and estimation to study the occurrence of a disease in rela-tion to gene and environment has attracted much interest recently. However, despitethe various estimators proposed in the literature, very little is understood in terms ofthe eﬃciency of the estimators. This is partly due to the fact that most estimators areconstructed in rather ingenious ways, instead of following the standard lines of semi-parametric theory. The other reason is that most such problems are set in a case-controldesign, which violates the i.i.d. assumption for standard semiparametric theory.Instead of rederiving the whole semiparametric theory under non-i.i.d. samples, weargue that case-control data can be treated as if they were i.i.d. data and the stan-dard semiparametric eﬃciency theory will still apply. The equivalence of the ﬁrst orderasymptotic theory shown in this article is a new contribution. The argument is based onrather elementary statistics without involving advanced knowledge or highly specializedtechniques.The establishment of the equivalence of the semiparametric eﬃciency betweeni.i.d. data and case-control data allows us to carry out the estimation using standard,well-established semiparametric theory. However, these standard analyses are performedunder a hypothetical population of interest, hence the detailed derivation often requiresspecial treatment, something which has not previously appeared in the literature. Underthe gene-environment independence assumption, we are able to explicitly construct a98

Y. Ma novel semiparametric estimator and show its eﬃciency. A special feature of this estima-tor is that we never attempted to estimate the inﬁnite-dimensional nuisance parameter η itself, neither did we posit a model, true or false, for it. Rather, we avoided its estimationand instead approximated quantities that rely on it. On the one hand, this enables usto carry out the estimation rather easily; on the other hand, some asymptotic propertieshave to be rederived because any result that relies on the convergence properties of thenuisance parameter itself can no longer be used.Finally, our simulation results support the theory we developed, in both discrete andcontinuous gene distribution cases. Our simulation results in the discrete gene model arevery similar to those of Chatterjee and Carroll (2005), which leads us to believe thattheir estimator is also eﬃcient. A demonstration of this aspect would be an interestingdirection for future work. The programming of the method in Chatterjee and Carrollmay seem easier. However, if the two methods are indeed equivalent, then the projectionstep in the current method should be equivalent to the proﬁling step in Chatterjee andCarroll, hence the computational eﬀort and intensity should be equivalent. Although wedid not further expand our estimator to stratiﬁed case-control data, the method is clearlyapplicable there as well. Appendix

The derivation of S eﬀ We will use S eﬀ to construct our estimating equation. We calculate S eﬀ by projectingthe score functions with respect to the parameters of interest β c , β , β , β , β onto theorthogonal complement of the nuisance tangent space. We ﬁrst derive the score functions S β ≡ ∂ log p ( g, e, d ; β, η ) /∂β . Straightforward calculation shows that the score function S β = ( S T1 , S T2 ) T , where S T1 = ( m ′ β c m ′ β m ′ β m ′ β ) (cid:18) d − m (cid:19) − E (cid:26) ( m ′ β c m ′ β m ′ β m ′ β ) (cid:18) d − m (cid:19)(cid:12)(cid:12)(cid:12) d (cid:27) ,S = q ′ β ( g, β ) q ( g, β ) − E (cid:26) q ′ β ( g, β ) q ( g, β ) (cid:12)(cid:12)(cid:12) d (cid:27) . Here, m ′∗ and q ′∗ represent partial derivatives with respect to ∗ . Note that, in general, S β can be written as S β = S − E ( S | d ).We next derive the nuisance tangent space Λ and its orthogonal complement Λ ⊥ .Inserting the form of p tD ( d ; β, η ) into (3.1), replacing η ( e ) by an arbitrary submodel p tE ( e ; γ ) and taking the derivative of log p ( g, e, d ; β, γ ) with respect to γ , we ob-tain ∂ log p ( g, e, d ; β, γ ) /∂γ = ∂ log p tE ( e ; γ ) /∂γ − E { ∂ log p tE ( e ; γ ) /∂γ | d } . Now, recogniz-ing that ∂ log p tE ( e ; γ ) /∂γ for an arbitrary submodel can yield an arbitrary function of e with mean zero calculated under the true η ( e ), we obtain the nuisance tangent space:Λ = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e ) such that E t ( h ) = 0] = [ h ( e ) − E { h ( e ) | d } : ∀ h ( e )] , Λ ⊥ = [ h ( g, e, d ): E ( h | e ) = E { E ( h | d ) | e } ] . emiparametric eﬃciency in case-control studies E t stands for an expectation calculated with respect to the true population distri-bution. The second expression for Λ is more convenient because it allows h ( e ) to be anarbitrary function of e , hence this is the form of Λ that we will use.Having obtained S β and the spaces Λ and Λ ⊥ , we can proceed to derive the eﬃcientscore function S eﬀ ≡ Π( S β | Λ ⊥ ). If we let Π( S β | Λ) = f ( e ) − E ( f | d ), then S eﬀ = S β − f ( e ) + E ( f | d ) = S − E ( S | d ) − f ( e ) + E ( f | d ).We now modify the expression of S eﬀ to facilitate its actual computation. Letting a ( d ) = E ( f | d ) − E ( S | d ), we can thus write S eﬀ = S − f + a ( d ). Note that S does not de-pend on η and a ( d ) is either a (1) or a (0). In addition, we have E ( S eﬀ | e ) = E { E ( S eﬀ | d ) | e } .This is equivalent to E ( S β | e ) − f ( e ) + E { E ( f | d ) | e } = E [ E { S − E ( S | d ) | d } − E { f − E ( f | d ) | d }| e ] = 0 , which, in turn, is equivalent to E ( S | e ) = f + E { E ( S | d ) | e } − E { E ( f | d ) | e } = f − E { a ( d ) | e } = f − P d R a ( d ) N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) . Let v ( e, d ) = N d Z q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) = p E,D ( e, d ) N η − ( e )and w ( e, d ) = v ( e, d ) / { v ( e,

0) + v ( e, } . (A.1)We have E ( S | e ) = f − a (0) v ( e, / { v ( e,

0) + v ( e, } − a (1) v ( e, / { v ( e,

0) + v ( e, } = f − a (0) w ( e, − a (1) w ( e, f = E ( S | e ) + a (0) w ( e,

0) + a (1) w ( e, S eﬀ = S − E ( S | e ) − a (0) w ( e, − a (1) w ( e,

1) + a ( d )= S − E ( S | e ) + ( − d { a (0) − a (1) } w ( e, − d ) . Proof of Theorem 1

To simplify notation, we denote α = p tD (0) /p tD (1), ˆ α = ˆ p tD (0) / ˆ p tD (1), δ ( α ) = a { p tD ( d ) }− a { p tD ( d ) } , δ (ˆ α ) = a {

0; ˆ p tD ( d ) } − a {

1; ˆ p tD ( d ) } and ˆ δ (ˆ α ) = ˆ a {

0; ˆ p tD ( d ) } − ˆ a {

1; ˆ p tD ( d ) } .00 Y. Ma

Suppose we randomly partition the data into two groups: group 1 has m observationsand group 2 has n observations. Here, m = N . , n = N − m . We use the ﬁrst group toobtain ˆ α , and ˆ δ (ˆ α ), then use the second group to form the following estimating equationto estimate β : n X i =1 S eﬀ { x i ; ˆ β, ˆ α, ˆ δ (ˆ α ) } = 0 . We will ﬁrst show that the resulting estimator satisﬁes n / ( ˆ β − β ) → N (0 , V ) in distri-bution when N → ∞ .The proof splits into several steps: First, obviously, ˆ α − α = O p ( m − / ) and ˆ δ (ˆ α ) − δ (ˆ α ) = O p ( m − / ), as long as a root- N -consistent ˜ β is inserted in the calculation of thesequantities. A standard expansion yields0 = n X i =1 S eﬀ { x i ; ˆ β, ˆ α, ˆ δ (ˆ α ) } = n X i =1 S eﬀ { x i ; β , ˆ α, ˆ δ (ˆ α ) } + n X i =1 ∂∂β T S eﬀ { x i ; β ∗ , ˆ α, ˆ δ (ˆ α ) } ( ˆ β − β )= n X i =1 S eﬀ { x i ; β , ˆ α, ˆ δ (ˆ α ) } + n (cid:26) E (cid:18) ∂S eﬀ ∂β T (cid:19) + o p (1) (cid:27) ( ˆ β − β ) , which can be rewritten as (cid:26) E (cid:18) ∂S eﬀ ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β )= − n − / n X i =1 S eﬀ { x i ; β , ˆ α, ˆ δ (ˆ α ) } = − n − / n X i =1 [ S eﬀ { x i ; β , ˆ α, δ (ˆ α ) } + ( − d i { ˆ δ (ˆ α ) − δ (ˆ α ) } w ( e i , − d i , ˆ α )] . The last equality uses the form of S eﬀ in (3.2) and the fact that S , E ( S | e ) and w do notdepend on α . Because ˆ δ (ˆ α ) − δ (ˆ α ) = O p ( m − / ) = o p (1) and E { ( − d i w ( e i , − d i , ˆ α ) } = Z X d =0 , ( − d p E,D ( e, − d ; ˆ α ) η − ( e ) v ( e,

0; ˆ α ) + v ( e,

1; ˆ α ) p E,D ( e, d ; ˆ α ) d µ ( e ) = 0 , we actually have (cid:26) E (cid:18) ∂S eﬀ ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β ) emiparametric eﬃciency in case-control studies − n − / n X i =1 S eﬀ { x i ; β , ˆ α, δ (ˆ α ) } + o p (1)= − n − / n X i =1 (cid:26) S eﬀ ( x i ) + ∂S eﬀ ( x i ; β , α ) ∂α (ˆ α − α ) + ∂ S eﬀ ( x i ; β , α ∗ ) ∂α (ˆ α − α ) (cid:27) + o p (1) . In addition, (ˆ α − α ) = O p ( m − ) = o p ( n − / ), so (cid:26) E (cid:18) ∂S eﬀ ∂β T (cid:19) + o p (1) (cid:27) n / ( ˆ β − β ) = − n − / n X i =1 (cid:26) S eﬀ ( x i ) + ∂S eﬀ ( x i ) ∂α (ˆ α − α ) (cid:27) + o p (1) . We now proceed to examine ∂S eff ( x i ) ∂α by examining each term in (3.2). S is free of α .As a function of α , we already have b ( e ; α ) ≡ E ( S | e ; α ) = P d R SN d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g, β ) H ( d, g, e ) d µ ( g ) /p tD ( d )= R SN qH d µ ( g ) + α R SN qH d µ ( g ) R N qH d µ ( g ) + α R N qH d µ ( g ) = u ( e,

0) + αu ( e, u ( e,

0) + αu ( e, , where we deﬁne u ( e, d ) = R N d q ( g, β ) H ( d, g, e ) d µ ( g ) and u ( e, d ) = R SN d q ( g, β ) H ( d,g, e ) d µ ( g ). Using this notation, ∂b ∂α = u ( e, u ( e, − u ( e, u ( e, { u ( e,

0) + αu ( e, } ,w ( e,

0) = u ( e, u ( e,

0) + αu ( e, ,w ( e,

1) = αu ( e, u ( e,

0) + αu ( e, . Similarly to the calculation of b , b , we also have that for any function u , E ( u | d ; α ) = Z p E ( e ) R uN d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e ) . Z p E ( e ) R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d R N d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) d µ ( e )= Z p E ( e ) R uN d q ( g ) H ( d, g, e ) d µ ( g ) /p tD ( d ) P d u ( e, d ) /p tD ( d ) d µ ( e ) . Z p E ( e ) u ( e, d ) /p tD ( d ) P d u ( e, d ) /p tD ( d ) d µ ( e ) , Y. Ma thus E ( u | α ) = Z p E ( e ) R uN q ( g ) H (0 , g, e ) d µ ( g ) u ( e,