Estimating linear functionals in nonlinear regression with responses missing at random
Institute of Mathematical Statistics, 2009
By Ursula U. Müller
Texas A&M University
We consider regression models with parametric (linear or nonlinear) regression function and allow responses to be "missing at random." We assume that the errors have mean zero and are independent of the covariates. In order to estimate expectations of functions of covariate and response we use a fully imputed estimator, namely an empirical estimator based on estimators of conditional expectations given the covariate. We exploit the independence of covariates and errors by writing the conditional expectations as unconditional expectations, which can now be estimated by empirical plug-in estimators. The mean zero constraint on the error distribution is exploited by adding suitable residual-based weights. We prove that the estimator is efficient (in the sense of Hájek and Le Cam) if an efficient estimator of the parameter is used. Our results give rise to new efficient estimators of smooth transformations of expectations. Estimation of the mean response is discussed as a special (degenerate) case.
1. Introduction.
Consider a regression model $Y = r_\vartheta(X) + \varepsilon$ with linear or nonlinear regression function $r_\vartheta$ depending on a finite-dimensional parameter $\vartheta$ in some open set. Assume that the covariate vector $X$ and the error variable $\varepsilon$ are independent and that $E\varepsilon = 0$. Note that we do not make any further model assumptions on the distributions of the variables. We are interested in the situation where the response $Y$ is missing at random; in other words, we always observe $X$ but only observe $Y$ in those cases where some indicator $Z$ equals one, and the indicator $Z$ is conditionally independent of $Y$ given $X$.

We want to estimate the expectation $Eh(X,Y)$ of some known square-integrable function $h$ from a sample $(X_i, Z_iY_i, Z_i)$, $i = 1, \dots, n$, for example, the mean response, higher moments of $Y$ or $X$, or mixed moments. If all

Received December 2007; revised July 2008.
AMS 2000 subject classifications. Primary 62J02; secondary 62N01, 62F12, 62G20.

Key words and phrases. Semiparametric regression, weighted empirical estimator, empirical likelihood, influence function, gradient, confidence interval.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2009, Vol. 37, No. 5A, 2245–2277. This reprint differs from the original in pagination and typographic detail.
indicators $Z_i$ were 1, a simple consistent estimator would be the empirical estimator $n^{-1}\sum_{i=1}^n h(X_i, Y_i)$. A related estimator for the missing data situation considered here would be
$$\frac{1}{n}\sum_{i=1}^n \frac{Z_i}{\hat\pi(X_i)}\, h(X_i, Y_i)$$
with $\hat\pi(X)$ denoting an estimator of the conditional probability $\pi(X) = P(Z=1\mid X) = E(Z\mid X)$. Another estimator is the partially imputed estimator
$$\frac{1}{n}\sum_{i=1}^n \{Z_i h(X_i, Y_i) + (1 - Z_i)\hat\chi(X_i)\},$$
where $\hat\chi(X)$ is a (semiparametric) estimator of the conditional expectation $\chi(X) = E\{h(X,Y)\mid X\}$. An alternative to this estimator is the fully imputed estimator $n^{-1}\sum_{i=1}^n \hat\chi(X_i)$.

If a nonparametric estimator $\hat\chi$ is used, we expect all three estimators to be asymptotically equivalent. For $h(X,Y)=Y$ and the last two estimators, this is sketched in Cheng (1994). Here we assume a specific form of the conditional distribution of $Y$ given $X$, and we can construct better estimators than the nonparametric ones. We then expect the fully imputed estimator $n^{-1}\sum_{i=1}^n \hat\chi(X_i)$ to be better than the partially imputed one, which in turn should be better than the first estimator. For parametric models this is shown for $h(X,Y)=Y$ by Tamhane (1978) and Matloff (1981). Müller, Schick and Wefelmeyer (2006) show for several regression models (not including the present one) and arbitrary $h$ that the fully imputed estimator is usually better than the partially imputed estimator. That the same holds for the nonlinear regression model considered here is intuitively clear: our model $E(Y\mid X) = r_\vartheta(X)$ constitutes a structural constraint. The fully imputed estimator, based on estimators $\hat\chi(X)$ that use the structure, will therefore be better than the partially imputed estimator, which uses this information only at data points where responses are missing.

In this article we study the fully imputed estimator based on suitable estimators for $\chi(X)$ and show that it is efficient.
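As a concrete illustration of the three estimators just described, the small simulation below compares them on the mean functional $h(X,Y)=Y$. The linear model, the logistic missingness mechanism and all variable names are my own illustrative choices, not from the paper; for simplicity the true $\pi$ is plugged into the inverse-probability-weighted estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model Y = 2X + eps with responses missing at random
n = 2000
X = rng.uniform(1.0, 3.0, n)
Y = 2.0 * X + rng.normal(0.0, 1.0, n)
pi = 1.0 / (1.0 + np.exp(-(X - 2.0)))        # pi(x) = P(Z = 1 | X = x)
Z = (rng.uniform(size=n) < pi).astype(float)

def h(x, y):                                 # estimate E h(X, Y) = EY
    return y

# inverse-probability-weighted estimator (true pi used for simplicity)
ipw = np.mean(Z / pi * h(X, Y))

# parametric estimator of chi(x) = E{h(X,Y) | X = x} = theta*x for h(x,y) = y
theta_hat = np.sum(Z * X * Y) / np.sum(Z * X * X)   # complete-case LSE
chi_hat = theta_hat * X

# partially imputed: impute only where the response is missing
partial = np.mean(Z * h(X, Y) + (1 - Z) * chi_hat)

# fully imputed: impute every response
full = np.mean(chi_hat)

print(ipw, partial, full)    # all close to EY = 2 * EX = 4
```

All three are consistent here; the point of the paper is the efficiency comparison among them.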
The construction is as follows: in a first step we exploit the independence of covariates and errors and the structure of the regression model and write the conditional expectation $\chi(x) = \chi(x,\vartheta)$ as an unconditional expectation of the error distribution,
$$\chi(x,\vartheta) = E\{h(X,Y)\mid X=x\} = Eh\{x, r_\vartheta(x)+\varepsilon\} = Eh\{x, r_\vartheta(x)+Y-r_\vartheta(X)\}.$$
This representation suggests an empirical plug-in estimator based on the observed data, namely
$$\hat\chi(x,\hat\vartheta) = \sum_{j=1}^n Z_j\, h\{x, r_{\hat\vartheta}(x)+Y_j-r_{\hat\vartheta}(X_j)\} \Big/ \sum_{j=1}^n Z_j,$$
where $\hat\vartheta$ is an estimator of $\vartheta$. The corresponding fully imputed estimator is
$$\frac1n\sum_{i=1}^n \hat\chi(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n \frac{\sum_{j=1}^n Z_j\, h\{X_i, r_{\hat\vartheta}(X_i)+Y_j-r_{\hat\vartheta}(X_j)\}}{\sum_{j=1}^n Z_j}. \tag{1.1}$$
It is straightforward to check that $\hat\chi(x,\vartheta)$ is consistent for $Eh\{x,r_\vartheta(x)+\varepsilon\}$ [which yields consistency of $n^{-1}\sum_{i=1}^n\hat\chi(X_i,\hat\vartheta)$, with $\hat\vartheta$ consistent]; note that $\hat\chi(x,\vartheta)$ tends in probability to $E[Zh\{x,r_\vartheta(x)+\varepsilon\}]/EZ$ with $EZ = E\{E(Z\mid X)\} = E\pi(X)$. Now use the missing at random assumption and the independence of $X$ and $\varepsilon$ to rewrite the numerator,
$$E(E[Zh\{x,r_\vartheta(x)+\varepsilon\}\mid X]) = E(E(Z\mid X)\,E[h\{x,r_\vartheta(x)+\varepsilon\}\mid X]) = E[\pi(X)\,Eh\{x,r_\vartheta(x)+\varepsilon\}] = E\pi(X)\,Eh\{x,r_\vartheta(x)+\varepsilon\}.$$
The limit of $\hat\chi(x,\vartheta)$ is therefore $\chi(x,\vartheta) = Eh\{x,r_\vartheta(x)+\varepsilon\}$.

The estimator (1.1) is well thought out and consistent. However, it is not yet efficient, even if an efficient estimator for $\vartheta$ is used (which is relatively elaborate in the model considered here; see Section 5): we focus on the common situation where the errors have mean zero; this information must also be incorporated in order to obtain efficiency.

Motivated by Owen's empirical likelihood approach, we improve the above estimator by introducing weights which use the mean zero constraint on the error distribution.
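The plug-in construction (1.1) can be sketched directly; the function name and argument conventions below are mine, and a $\sqrt n$-consistent estimate of $\vartheta$ is assumed to be available.

```python
import numpy as np

def fully_imputed(X, Y, Z, r, theta_hat, h):
    """Unweighted fully imputed estimator (1.1): for every covariate X_i,
    chi_hat(X_i) = sum_j Z_j h(X_i, r(X_i) + Y_j - r(X_j)) / sum_j Z_j,
    then average over i. Only observed responses (Z_j = 1) contribute."""
    res = Y - r(X, theta_hat)    # residuals; entries with Z_j = 0 are masked
    m = Z.sum()
    total = 0.0
    for xi in X:
        total += np.sum(Z * h(xi, r(xi, theta_hat) + res)) / m
    return total / len(X)
```

For $h(x,y)=y$ the imputed value at $X_i$ reduces to $r_{\hat\vartheta}(X_i)$ plus the average observed residual, so the routine essentially averages the fitted regression values.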
However, and in contrast to the original approach, we cannot observe the errors and must use residuals. This clearly complicates the situation: since we have missing responses the residuals are partially incomplete and, moreover, they involve parameter estimates $\hat\vartheta$. Formally, we choose weights $\hat w_j$ based on residuals $\hat\varepsilon_j = Y_j - r_{\hat\vartheta}(X_j)$ such that $\sum_{j=1}^n \hat w_j Z_j\hat\varepsilon_j = 0$. (See Section 3 for more details.)

Our final estimator now is a weighted version of the above fully imputed estimator, namely
$$\frac1n\sum_{i=1}^n\hat\chi_w(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n \frac{\sum_{j=1}^n \hat w_j Z_j\, h\{X_i, r_{\hat\vartheta}(X_i)+Y_j-r_{\hat\vartheta}(X_j)\}}{\sum_{j=1}^n Z_j}. \tag{1.2}$$
The combination of full imputation methods (involving estimators of unconditional expectations of the error distribution) with empirical likelihood ideas provides a new methodology which has not appeared in the literature before. We show in this article that $n^{-1}\sum_{i=1}^n\hat\chi_w(X_i,\hat\vartheta)$ is efficient if an efficient estimator $\hat\vartheta$ for $\vartheta$ is used. The partially imputed estimator will in general not be efficient, even if $\hat\vartheta$ is efficient for $\vartheta$.

For estimation of the mean response, that is, if $h(X,Y)=Y$, which is of particular interest and typically considered in the literature, the estimator simplifies to the straightforward estimator $n^{-1}\sum_{i=1}^n r_{\hat\vartheta}(X_i)$. That the unweighted estimator (1.1) for $EY$ cannot be efficient is immediately apparent: consider the case where all responses are observed. Here (1.1) reduces to the empirical estimator $n^{-1}\sum_{i=1}^n Y_i$, which does not use the regression structure at all. It will be seen that its influence function is not the efficient one. (See Section 6 for details.)

Our efficiency results are based on the Hájek–Le Cam theory for locally asymptotically normal families. As a consequence, our proposed estimators have a limiting normal distribution with the asymptotic variance determined by the influence function.
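A self-contained sketch of the weighted estimator (1.2) follows. The bisection solver for the Lagrange multiplier is my own simplified stand-in for the computational methods Owen describes; the weight formula $\hat w_j = 1/(1+\hat\lambda Z_j\hat\varepsilon_j)$ is derived in Section 3.

```python
import numpy as np

def el_lambda(e, tol=1e-12):
    """Solve sum_j e_j / (1 + lam * e_j) = 0 for lam by bisection
    (e_j = Z_j * residual_j; returns 0 if no sign change, as in Section 3)."""
    if e.min() >= 0.0 or e.max() <= 0.0:
        return 0.0
    lo, hi = -1.0 / e.max() + 1e-10, -1.0 / e.min() - 1e-10
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.sum(e / (1.0 + mid * e)) > 0.0:   # constraint is decreasing in lam
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def weighted_fully_imputed(X, Y, Z, r, theta_hat, h):
    """Weighted fully imputed estimator (1.2) with empirical-likelihood-type
    weights w_j = 1/(1 + lam * Z_j * eps_j) enforcing sum_j w_j Z_j eps_j = 0."""
    res = Y - r(X, theta_hat)
    e = Z * res
    lam = el_lambda(e)
    w = 1.0 / (1.0 + lam * e)
    m = Z.sum()
    return np.mean([np.sum(w * Z * h(xi, r(xi, theta_hat) + res)) / m
                    for xi in X])
```

For $h(x,y)=y$ the constraint makes the correction term vanish, so the routine returns $n^{-1}\sum_i r_{\hat\vartheta}(X_i)$ up to rounding.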
It is therefore straightforward to construct asymptotic confidence intervals for $Eh(X,Y)$ (see Section 6.3).

In addition, estimators for smooth (continuously differentiable) transformations of expectations $Eh(X,Y)$ are also now available, with the variance of the response, $\operatorname{Var} Y = EY^2 - (EY)^2$, as an important example. Since efficiency is preserved by smooth transformations, plugging in efficient estimators yields an efficient estimator of the transformation. The transformation for $\operatorname{Var} Y$ in terms of the first two moments is $(EY, EY^2) \mapsto EY^2 - (EY)^2$. Plugging in $n^{-1}\sum_{i=1}^n r_{\hat\vartheta}(X_i)$ for $EY$ and the weighted fully imputed estimator for $EY^2$ (which is straightforward to compute and is also given in Section 6) gives an efficient estimator of the variance.

To our knowledge, our estimator (1.2) is the first efficient estimator for arbitrary linear functionals $Eh(X,Y)$ (including the mean functional $EY$) in the nonlinear regression model (including the linear regression model $Y = \vartheta^\top X + \varepsilon$) with independent centered errors when responses are missing at random. Matloff (1981) considers estimation of the mean $EY$ in a model related to ours, the (parametric) conditional mean model, $E(Y\mid X) = r_\vartheta(X)$, which can (but need not) also be written in the form $Y = r_\vartheta(X) + \varepsilon$ with conditionally centered errors, $E(\varepsilon\mid X) = 0$. He shows that the average of the estimated regression function values (with his estimator $\hat\vartheta$ of $\vartheta$) improves upon the partially imputed estimator. Wang and Rao (2001) consider linearly constrained covariates and develop an empirical likelihood approach for inference about the mean in linear regression (with independent errors) based on partial linear regression imputation. In Wang and Rao (2002) they present an empirical likelihood approach for inference about the mean response in nonparametric regression, based on partial kernel regression imputation as suggested by Cheng (1994).
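The plug-in construction for $\operatorname{Var} Y$ described above can be illustrated as follows. The simulation setup is an assumption of this sketch, not from the paper, and for simplicity the second moment is imputed with the unweighted estimator (1.1) rather than the weighted one.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative model (my own choice): Y = 2X + eps, responses missing
# completely at random, a special case of missing at random.
n = 2000
X = rng.uniform(1.0, 3.0, n)
Y = 2.0 * X + rng.normal(0.0, 1.0, n)
Z = (rng.uniform(size=n) < 0.7).astype(float)

theta_hat = np.sum(Z * X * Y) / np.sum(Z * X * X)   # complete-case LSE
res = Y - theta_hat * X                             # residuals (used where Z=1)
m = Z.sum()

m1 = np.mean(theta_hat * X)                         # estimator of EY
# fully imputed estimator of EY^2: for each X_i, average h = y^2 over the
# imputed responses r(X_i) + residual_j, j with Z_j = 1
m2 = np.mean([np.sum(Z * (theta_hat * xi + res) ** 2) / m for xi in X])

var_hat = m2 - m1 ** 2       # plug-in transformation (EY, EY^2) -> Var Y
print(var_hat)               # near Var Y = 4 Var X + 1 = 4/3 + 1
```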
A different empirical likelihood method for this setting is proposed by Qin and Zhang (2007). Wang (2004) assumes a parametric model for the conditional density of $Y$ given $X$, with constraints on the covariate distribution, and introduces a weighted partial imputation estimator for the mean, utilizing empirical likelihood techniques. Wang, Linton and Härdle (2004) consider a partially linear regression model for the conditional mean function and derive inference tools for the mean response based on a class of asymptotically equivalent (partially and fully imputed) estimators. A related article is Liang, Wang and Carroll (2007), who additionally assume that covariates are measured with error. Chen, Fan, Li and Zhou (2006) consider partially imputed estimators for the mean response in a quasi-likelihood setting. Maity, Ma and Carroll (2007) estimate expectations in semiparametric regression models, with and without missing responses. They consider a general regression function involving a parametric and a nonparametric part, thus covering the partly linear model, and assume that the likelihood function given the covariates is known.

For estimating expectations, little attention has been given to the fully imputed estimator. We anticipate that in many situations, in particular in models with structural assumptions, improved estimators can be obtained by using appropriate full imputation instead of partial imputation estimates. Inference for missing data has been studied by many authors, also recently. Chen and Wang (2009) study estimation of parameters which are defined by model constraints. They introduce an empirical likelihood approach involving estimating equations, where missing variables are replaced using a nonparametric imputation approach. Chen, Hong and Tarozzi (2008) consider parameter estimation as well.
They introduce efficient estimators for parameters in GMM models with missing data, and assume that the missingness can be explained by auxiliary variables. More references to recent literature can be found, for example, in Wang, Linton and Härdle (2004) and in the monograph by Tsiatis (2006). For an introduction, see Tsiatis (2006) and the books by Little and Rubin (2002) and Gelman et al. (1995).

This paper is organized as follows. In Section 2 we derive a stochastic expansion of the unweighted estimator. The expansion of the weighted estimator is given in Section 3, utilizing the results of Section 2. Section 4 characterizes efficient estimators of arbitrary functionals of the joint distribution and gives the efficient influence function of the functional $Eh(X,Y)$ in the nonlinear regression model. In Section 5 we characterize efficient estimators for the parameter vector $\vartheta$ and briefly sketch the construction of such an estimator. In this section we also show our main result, that the weighted estimator with an efficient estimator $\hat\vartheta$ for $\vartheta$ plugged in is efficient for $Eh(X,Y)$. Section 6 contains a short discussion of special cases such as estimation of the mean response. We also compare, using computer simulations, the efficient (weighted fully imputed) estimator with the other approaches, with convincing results. For these studies we considered a linear and a nonlinear regression function and estimation of two simple functionals, namely the response mean and second moment, for which the efficient (weighted fully imputed) estimator simplifies, and estimation of a more complicated expectation. We also briefly sketch the construction of confidence intervals.
2. Expansion of the unweighted estimator.
In this section we derive an expansion of the unweighted estimator $n^{-1}\sum_{i=1}^n\hat\chi(X_i,\hat\vartheta)$, which is a special case of the weighted estimator $n^{-1}\sum_{i=1}^n\hat\chi_w(X_i,\hat\vartheta)$ with all weights equal to one, $w_j = 1$. This can be regarded as a result of independent interest since the estimator (with an appropriate estimator $\hat\vartheta$) would be relevant for regression models where the errors cannot be assumed to have mean zero. Also, we will see in the next section that the weighted estimator can be written as the sum of the unweighted estimator and an additional correction term. Hence we can utilize the results later when we derive an expansion of the weighted estimator.

Throughout this paper we will assume that $Y$ is square integrable and that the error variance $E\varepsilon^2 = \sigma^2$ is nonzero and finite. We also suppose that the error distribution has a Lebesgue density $f$ and finite Fisher information, $E\ell^2(\varepsilon) < \infty$, where $\ell$ denotes the score function for location, $\ell(\varepsilon) = -f'(\varepsilon)/f(\varepsilon)$. The degenerate case that we (almost surely) never observe a response $Y$ is excluded by assuming $P(Z=1) = EZ > 0$. The following assumptions will also be required.
Assumption 1.
The regression function $\tau \mapsto r_\tau(x)$ is differentiable at $\tau = \vartheta$ with a $p$-dimensional square integrable gradient $\dot r_\vartheta(x)$ which satisfies the Lipschitz condition
$$|\dot r_\tau(x) - \dot r_\vartheta(x)| \le |\tau - \vartheta|\, a(x), \qquad a(X) \text{ square integrable}.$$

Later we will also need that the covariance matrix of an efficient parameter estimator $\hat\vartheta$ [which involves the covariance matrix of $\dot r_\vartheta(X)$ and the Fisher information] is invertible.

Now use a Taylor expansion to see that
$$\bigg|\sum_{i=1}^n \{r_\tau(X_i) - r_\vartheta(X_i) - \dot r_\vartheta(X_i)^\top(\tau-\vartheta)\}\bigg| = \bigg|\sum_{i=1}^n \int_0^1 \{\dot r_{\vartheta+u(\tau-\vartheta)}(X_i) - \dot r_\vartheta(X_i)\}^\top(\tau-\vartheta)\,du\bigg| \le |\tau-\vartheta| \sum_{i=1}^n \int_0^1 |\dot r_{\vartheta+u(\tau-\vartheta)}(X_i) - \dot r_\vartheta(X_i)|\,du \le |\tau-\vartheta|^2 \sum_{i=1}^n a(X_i).$$
Assumption 1 therefore guarantees that the function $\tau \mapsto r_\tau(X)$ is stochastically differentiable, that is, for each constant $C$,
$$\sup_{|\tau-\vartheta|\le Cn^{-1/2}} \bigg|\sum_{i=1}^n \{r_\tau(X_i) - r_\vartheta(X_i) - \dot r_\vartheta(X_i)^\top(\tau-\vartheta)\}\bigg| = o_p(1). \tag{2.1}$$

We will not need the first partial derivative of $h(x,y)$, $\partial/\partial x\, h(x,y)$. Therefore we will write $h'$ for the second partial derivative, $h'(x,y) = \partial_2 h(x,y) = \partial/\partial y\, h(x,y)$.

Assumption 2.
The function $h(x,y)$ is differentiable in $y$ with a square integrable partial derivative $h'(x,y) = \partial/\partial y\, h(x,y)$ which satisfies the Lipschitz condition
$$|h'(x,z) - h'(x,y)| \le |z-y|\, b(x,y), \qquad b(X,Y) \text{ square integrable}.$$

In the following $\bar Z$ will denote the average of the indicators $Z_i$, $\bar Z = n^{-1}\sum_{i=1}^n Z_i$. The next lemma gives the expansion of the estimator around the true parameter $\vartheta$.

Lemma 2.1.
Assume that Assumptions 1 and 2 hold and that $\hat\vartheta$ is a $\sqrt n$-consistent estimator of $\vartheta$. Then the unweighted estimator has the expansion
$$\frac1n\sum_{i=1}^n \hat\chi(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n \hat\chi(X_i,\vartheta) + D^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}) \tag{2.2}$$
with $D = E(h(X,Y)[\dot r_\vartheta(X) - E\{\dot r_\vartheta(X)\mid Z=1\}]\,\ell(\varepsilon))$.

Proof.
For reasons of clarity we introduce the notation $f_{ij}(\vartheta) = h\{X_i, r_\vartheta(X_i) + Y_j - r_\vartheta(X_j)\}$ and write $\dot f_{ij}$ for the gradient. Then
$$\frac1n\sum_{i=1}^n\hat\chi(X_i,\hat\vartheta) = \frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j=1}^n Z_j\, h\{X_i, r_{\hat\vartheta}(X_i) + Y_j - r_{\hat\vartheta}(X_j)\}$$
$$= \frac{1}{\bar Z n^2}\sum_{i=1}^n\bigg[\sum_{j\ne i} Z_j\, h\{X_i, r_{\hat\vartheta}(X_i) + Y_j - r_{\hat\vartheta}(X_j)\} + Z_i h(X_i, Y_i)\bigg] \tag{2.3}$$
$$= \frac{1}{\bar Z n^2}\sum_{i=1}^n\bigg\{\sum_{j\ne i} Z_j f_{ij}(\vartheta) + Z_i h(X_i, Y_i)\bigg\} + \frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} Z_j\{f_{ij}(\hat\vartheta) - f_{ij}(\vartheta)\}$$
$$= \frac1n\sum_{i=1}^n\hat\chi(X_i,\vartheta) + \frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} Z_j\{f_{ij}(\hat\vartheta) - f_{ij}(\vartheta)\}.$$
Below we will show that
$$\frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} Z_j\{f_{ij}(\hat\vartheta) - f_{ij}(\vartheta)\} = D^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}) \tag{2.4}$$
with $D = (EZ)^{-1}E[Z_2\, h'\{X_1, r_\vartheta(X_1)+Y_2-r_\vartheta(X_2)\}\{\dot r_\vartheta(X_1) - \dot r_\vartheta(X_2)\}]$. That this $D$ is indeed of the form given in the lemma can be seen as follows. Consider
$$D = E[h'\{X_1, r_\vartheta(X_1)+\varepsilon_2\}\,\dot r_\vartheta(X_1)] - (EZ)^{-1}E[Z_2\, h'\{X_1, r_\vartheta(X_1)+\varepsilon_2\}\,\dot r_\vartheta(X_2)].$$
The first term can be written $E(E[h'\{X_1, r_\vartheta(X_1)+\varepsilon_2\}\mid X_1]\,\dot r_\vartheta(X_1))$. Integration by parts of the inner integral gives $E[h'\{X_1, r_\vartheta(X_1)+\varepsilon_2\}\mid X_1] = E[h\{X_1, r_\vartheta(X_1)+\varepsilon_2\}\,\ell(\varepsilon_2)\mid X_1]$. The second term is $E[h'\{X_1, r_\vartheta(X_1)+\varepsilon_2\}]\, E\{\dot r_\vartheta(X)\mid Z=1\}$. We proceed analogously and, in conclusion, obtain
$$D = E(h(X,Y)[\dot r_\vartheta(X) - E\{\dot r_\vartheta(X)\mid Z=1\}]\,\ell(\varepsilon)). \tag{2.5}$$
The result now follows from (2.3), (2.4) and (2.5). It remains to verify (2.4). The proof consists of two parts,
$$\frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} Z_j\{f_{ij}(\hat\vartheta) - f_{ij}(\vartheta) - \dot f_{ij}(\vartheta)^\top(\hat\vartheta-\vartheta)\} = o_p(n^{-1/2}), \tag{2.6}$$
$$\frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} Z_j\, \dot f_{ij}(\vartheta)^\top(\hat\vartheta-\vartheta) = D^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}). \tag{2.7}$$
Statement (2.7) can be quickly proved: since $\hat\vartheta$ is $\sqrt n$-consistent we can replace the gradient by its expectation,
$$\frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} Z_j\,\dot f_{ij}(\vartheta)^\top(\hat\vartheta-\vartheta) = \frac{1}{\bar Z n^2}\sum_{i=1}^n\sum_{j\ne i} E\{Z_j\,\dot f_{ij}(\vartheta)\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}) = \frac{1}{EZ}E\{Z_2\,\dot f_{12}(\vartheta)\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2})$$
with $(EZ)^{-1}E\{Z_2\,\dot f_{12}(\vartheta)\} = D$ as given in (2.4). For the proof of (2.6) it suffices to show that
$$\sum_{i=1}^n\bigg|\frac{1}{\sqrt n}\sum_{j\ne i} Z_j\{f_{ij}(\hat\vartheta) - f_{ij}(\vartheta) - \dot f_{ij}(\vartheta)^\top(\hat\vartheta-\vartheta)\}\bigg| = O_p(1).
This holds by the following arguments. Rewrite the above expression and apply the Cauchy–Schwarz inequality to obtain
$$\sum_{i=1}^n\bigg|\frac{1}{\sqrt n}\sum_{j\ne i} Z_j \int_0^1 [\dot f_{ij}\{\vartheta + u(\hat\vartheta-\vartheta)\} - \dot f_{ij}(\vartheta)]^\top(\hat\vartheta-\vartheta)\,du\bigg| \le \frac{1}{\sqrt n}\sum_{i=1}^n\sum_{j\ne i} Z_j\,|\hat\vartheta-\vartheta| \int_0^1 |\dot f_{ij}\{\vartheta + u(\hat\vartheta-\vartheta)\} - \dot f_{ij}(\vartheta)|\,du.$$
The difference $|\dot f_{ij}\{\vartheta + u(\hat\vartheta-\vartheta)\} - \dot f_{ij}(\vartheta)|$ is bounded by $|\hat\vartheta-\vartheta|$ times a square integrable function $A_{ij}$. This holds due to Assumptions 1 and 2, namely the Lipschitz conditions on $\dot r_\vartheta$ and $h'$, and since $a(X)$, $b(X,Y)$, $\dot r_\vartheta(X)$ and $h'(X,Y)$ are square integrable. Summing up, the expression is bounded by $n^{-1/2}|\hat\vartheta-\vartheta|^2\sum_{i=1}^n\sum_{j\ne i} A_{ij}$, which is stochastically bounded since $\hat\vartheta$ is $\sqrt n$-consistent. □

We will now replace the estimated conditional expectation $\hat\chi$ on the right-hand side of (2.2) by the true one. Set
$$S = \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\ne i} \frac{Z_j}{EZ}\, h\{X_i, r_\vartheta(X_i)+Y_j-r_\vartheta(X_j)\}.$$
We have
$$\frac1n\sum_{i=1}^n\hat\chi(X_i,\vartheta) = \frac{EZ}{\bar Z}S + O_p(n^{-1}) = S - \frac{\bar Z - EZ}{EZ}\,ES + o_p(n^{-1/2})$$
and, by the Hoeffding decomposition,
$$S = ES + \frac1n\sum_{i=1}^n\{\chi(X_i,\vartheta) - ES\} + \frac1n\sum_{i=1}^n\bigg\{\frac{Z_i\,\bar h(\varepsilon_i)}{EZ} - ES\bigg\} + o_p(n^{-1/2})$$
with $\bar h(\varepsilon) = E\{h(X,Y)\mid\varepsilon\}$, $ES = Eh(X,Y) = E\bar h(\varepsilon)$. Combining the above yields
$$\frac1n\sum_{i=1}^n\hat\chi(X_i,\vartheta) = \frac1n\sum_{i=1}^n\chi(X_i,\vartheta) + \frac1n\sum_{i=1}^n\frac{Z_i}{EZ}\{\bar h(\varepsilon_i) - E\bar h(\varepsilon)\} + o_p(n^{-1/2}).$$
This and Lemma 2.1 give our expansion for the unweighted estimator, which we formulate as a corollary.
Corollary 2.2.
Assume that Assumptions 1 and 2 hold and that $\hat\vartheta$ is a $\sqrt n$-consistent estimator of $\vartheta$. Then, with $D = E(h(X,Y)[\dot r_\vartheta(X) - E\{\dot r_\vartheta(X)\mid Z=1\}]\,\ell(\varepsilon))$ and $\bar h(\varepsilon) = E\{h(X,Y)\mid\varepsilon\}$, the unweighted estimator has the expansion
$$\frac1n\sum_{i=1}^n\hat\chi(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n\bigg[\chi(X_i,\vartheta) + \frac{Z_i}{EZ}\{\bar h(\varepsilon_i) - E\bar h(\varepsilon)\}\bigg] + D^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}).$$
3. Expansion of the weighted estimator.
In this section we study the weighted estimator, which uses residual-based weights $\hat w_j$ constructed by adapting empirical likelihood techniques. The approach is to maximize $\prod_{j=1}^n \hat w_j$ subject to the mean zero constraint on the error distribution, $\sum_{j=1}^n \hat w_j Z_j\hat\varepsilon_j = 0$, with $\hat w_j \ge 0$ and $\sum_{j=1}^n \hat w_j = n$. The weights solving this optimization problem are given by $\hat w_j = 1/(1+\hat\lambda Z_j\hat\varepsilon_j)$, where $\hat\lambda$ denotes the Lagrange multiplier, provided $\hat\lambda$ exists. As shown by Owen (1988, 2001), this is the case if not all residuals have the same sign, that is, on the event $\min_{1\le j\le n}\hat\varepsilon_j < 0 < \max_{1\le j\le n}\hat\varepsilon_j$, which has probability tending to one since the residuals $\hat\varepsilon_j$ are uniformly close to the centered errors $\varepsilon_j$ [see (A.1) in the Appendix]. If $\hat\lambda$ does not exist, we set $\hat\lambda = 0$. Note that the weights equal one if $Z_j = 0$ or $\hat\lambda = 0$. For computational issues we refer to Section 2.9 of Owen's book (2001).

The formula for the weights can be written as an identity, $\hat w_j = 1 - \hat\lambda\hat w_j Z_j\hat\varepsilon_j$. This enables us to decompose the estimator into the unweighted estimator and an additional correction term,
$$\frac1n\sum_{i=1}^n\hat\chi_w(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n\hat\chi(X_i,\hat\vartheta) - \frac{\hat\lambda}{n^2}\sum_{i=1}^n\sum_{j=1}^n \hat w_j\hat\varepsilon_j\,\frac{Z_j}{\bar Z}\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\}. \tag{3.1}$$
Since we have already derived an expansion of the unweighted estimator (see Corollary 2.2) we only need to study the second term on the right-hand side. In Lemma 3.1 we will derive an expansion of the estimated Lagrange multiplier $\hat\lambda$ and use this result in Lemma 3.2, where we determine an approximation of the extra term. For the proof of Lemma 3.1 we proceed analogously to Owen (2001), pages 219–221 [compare also Müller, Schick and Wefelmeyer (2005)].
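The two identities behind this decomposition, $\sum_j \hat w_j Z_j\hat\varepsilon_j = 0$ at the solved multiplier and $\sum_j \hat w_j = n$ (the latter follows from $\hat w_j = 1 - \hat\lambda\hat w_j Z_j\hat\varepsilon_j$), can be verified numerically. The bisection solver and the stand-in residuals below are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
eps_hat = rng.normal(0.0, 1.0, n)             # stand-in residuals
Z = (rng.uniform(size=n) < 0.6).astype(float)
e = Z * eps_hat                               # Z_j * eps_hat_j

# bisection for the Lagrange multiplier on the interval where 1 + lam*e > 0
lo, hi = -1.0 / e.max() + 1e-10, -1.0 / e.min() - 1e-10
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if np.sum(e / (1.0 + mid * e)) > 0.0 else (lo, mid)
lam = 0.5 * (lo + hi)
w = 1.0 / (1.0 + lam * e)                     # weights; w_j = 1 where Z_j = 0

print(np.sum(w * e), w.sum() - n)             # both approximately zero
```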
This requires some auxiliary results, which are proved in the Appendix, namely
$$\max_{1\le i\le n}|Z_i\hat\varepsilon_i| = o_p(n^{1/2}), \tag{3.2}$$
$$\frac1n\sum_{i=1}^n Z_i\hat\varepsilon_i = \frac1n\sum_{i=1}^n Z_i\varepsilon_i - EZ\,E\{\dot r_\vartheta(X)\mid Z=1\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}) = O_p(n^{-1/2}), \tag{3.3}$$
$$\frac1n\sum_{i=1}^n Z_i\hat\varepsilon_i^2 = \frac1n\sum_{i=1}^n Z_i\varepsilon_i^2 + o_p(1) = EZ\,\sigma^2 + o_p(1), \tag{3.4}$$
where $\hat\vartheta$ is a $\sqrt n$-consistent estimator of $\vartheta$ and $\sigma^2 > 0$.

Lemma 3.1.
Suppose that Assumption 1 is satisfied and let $\hat\vartheta$ be a $\sqrt n$-consistent estimator of $\vartheta$. Then $\max_{1\le j\le n}|\hat w_j - 1| = o_p(1)$ and
$$\hat\lambda = \frac{1}{\sigma^2}\,\frac1n\sum_{j=1}^n \frac{Z_j}{EZ}\varepsilon_j - \frac{1}{\sigma^2}E\{\dot r_\vartheta(X)\mid Z=1\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}) = O_p(n^{-1/2}). \tag{3.5}$$

Proof.
We first derive the order of $\hat\lambda$. Recall that $\hat w_j = 1/(1+\hat\lambda Z_j\hat\varepsilon_j)$, that $\hat w_j + \hat\lambda\hat w_j Z_j\hat\varepsilon_j = 1$ and that $\sum_{j=1}^n \hat w_j Z_j\hat\varepsilon_j = 0$ by construction. Also note that the $Z_j$'s are binary and that therefore $Z_j^2 = Z_j$. This allows us to write
$$\frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j = \frac1n\sum_{j=1}^n (\hat w_j + \hat\lambda\hat w_j Z_j\hat\varepsilon_j)Z_j\hat\varepsilon_j = \frac{\hat\lambda}{n}\sum_{j=1}^n \hat w_j Z_j\hat\varepsilon_j^2 = \frac{\hat\lambda}{n}\sum_{j=1}^n \frac{Z_j\hat\varepsilon_j^2}{1+\hat\lambda Z_j\hat\varepsilon_j}. \tag{3.6}$$
Note that $1 + \hat\lambda Z_j\hat\varepsilon_j > 0$ and
$$|\hat\lambda|\,\frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j^2 = |\hat\lambda|\,\frac1n\sum_{j=1}^n \frac{Z_j\hat\varepsilon_j^2}{1+\hat\lambda Z_j\hat\varepsilon_j}(1+\hat\lambda Z_j\hat\varepsilon_j) \le |\hat\lambda|\,\frac1n\sum_{j=1}^n \frac{Z_j\hat\varepsilon_j^2}{1+\hat\lambda Z_j\hat\varepsilon_j}\Big(1+|\hat\lambda|\max_{1\le j\le n}|Z_j\hat\varepsilon_j|\Big) = \bigg|\frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j\bigg|\Big(1+|\hat\lambda|\max_{1\le j\le n}|Z_j\hat\varepsilon_j|\Big).$$
The last equality holds due to (3.6). Applying (3.2), (3.3) and (3.4) to the first and last terms of the inequality we obtain $|\hat\lambda|\cdot O_p(1) = O_p(n^{-1/2}) + |\hat\lambda|\,o_p(1)$, which implies $\hat\lambda = O_p(n^{-1/2})$. This and (3.2) give $\max_{1\le j\le n}|\hat\lambda Z_j\hat\varepsilon_j| = o_p(1)$ and therefore our first statement,
$$\max_{1\le j\le n}|\hat w_j - 1| = \max_{1\le j\le n}\bigg|\frac{\hat\lambda Z_j\hat\varepsilon_j}{1+\hat\lambda Z_j\hat\varepsilon_j}\bigg| = o_p(1).$$
We now again make use of (3.6) and write
$$\frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j = \hat\lambda\bigg\{\frac1n\sum_{j=1}^n (\hat w_j - 1)Z_j\hat\varepsilon_j^2 + \frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j^2\bigg\} = \hat\lambda\,\frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j^2 + o_p(n^{-1/2}).$$
For the last statement we utilized (3.4), $\max_{1\le j\le n}|\hat w_j - 1| = o_p(1)$ and $\hat\lambda = O_p(n^{-1/2})$. This and (3.4) give
$$\hat\lambda = \frac{\sum_{j=1}^n Z_j\hat\varepsilon_j}{\sum_{j=1}^n Z_j\hat\varepsilon_j^2} + o_p(n^{-1/2}) = \frac{1}{EZ\,\sigma^2}\,\frac1n\sum_{j=1}^n Z_j\hat\varepsilon_j + o_p(n^{-1/2}).$$
Inserting approximation (3.3) for $n^{-1}\sum_{j=1}^n Z_j\hat\varepsilon_j$ finally yields the desired approximation of $\hat\lambda$. □

Lemma 3.2.
Suppose that Assumptions 1 and 2 are satisfied and let $\hat\vartheta$ be a $\sqrt n$-consistent estimator of $\vartheta$. Then, with $\bar h(\varepsilon) = E\{h(X,Y)\mid\varepsilon\}$,
$$\frac{\hat\lambda}{n^2}\sum_{i=1}^n\sum_{j=1}^n \hat w_j\hat\varepsilon_j\,\frac{Z_j}{\bar Z}\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\} = \frac{1}{\sigma^2}\,\frac1n\sum_{i=1}^n \frac{Z_i}{EZ}\varepsilon_i\, E\{\varepsilon\bar h(\varepsilon)\} - \frac{1}{\sigma^2}E\{\varepsilon\bar h(\varepsilon)\}\, E\{\dot r_\vartheta(X)\mid Z=1\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}).$$

Proof.
Since $\hat\lambda = O_p(n^{-1/2})$ and $\max_{1\le j\le n}|\hat w_j - 1| = o_p(1)$ by the previous lemma, and since $\max_{1\le i\le n}|Z_i\hat\varepsilon_i| = o_p(n^{1/2})$ by (3.2), it is clear that the terms of the sum with $j=i$, that is, $h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_i\} = h(X_i, Y_i)$, can be ignored. It therefore suffices to prove the statement for
$$\frac{\hat\lambda}{n^2}\sum_i\sum_{j\ne i} \hat w_j\hat\varepsilon_j\,\frac{Z_j}{\bar Z}\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\} = \hat\lambda\,\frac{EZ}{\bar Z}\,\psi_w(\hat\vartheta)$$
with
$$\psi_w(\hat\vartheta) = \frac{1}{n^2}\sum_i\sum_{j\ne i} \hat w_j\hat\varepsilon_j\,\frac{Z_j}{EZ}\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\} = \psi(\hat\vartheta) + \frac{1}{n^2}\sum_i\sum_{j\ne i} (\hat w_j - 1)\hat\varepsilon_j\,\frac{Z_j}{EZ}\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\},$$
where $\psi$ is $\psi_w$ with $\hat w_j = 1$. The second part, involving the difference $\hat w_j - 1$, is $o_p(n^{-1/2})$, which can be seen as follows: using $\hat\lambda = O_p(n^{-1/2})$ and $\max_{1\le j\le n}|\hat w_j - 1| = o_p(1)$ we obtain
$$\bigg|\hat\lambda\,\frac{EZ}{\bar Z}\,\frac{1}{n^2}\sum_i\sum_{j\ne i} (\hat w_j - 1)\hat\varepsilon_j\,\frac{Z_j}{EZ}\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\}\bigg| \le \frac{|\hat\lambda|}{\bar Z}\max_{1\le j\le n}|\hat w_j - 1|\,\frac{1}{n^2}\sum_i\sum_{j\ne i} |\hat\varepsilon_j\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\}| = o_p(n^{-1/2})\cdot\frac{1}{n^2}\sum_i\sum_{j\ne i} |\hat\varepsilon_j\, h\{X_i, r_{\hat\vartheta}(X_i)+\hat\varepsilon_j\}|.$$
This gives the claimed rate $o_p(n^{-1/2})$ since the sum is bounded in probability, which follows from the $\sqrt n$-consistency of $\hat\vartheta$ and from Assumptions 1 and 2 applied to the terms of the product $\{Y_2 - r_\tau(X_2)\}\, h\{X_1, r_\tau(X_1)+Y_2-r_\tau(X_2)\}$.

It remains to consider $\hat\lambda\,(EZ/\bar Z)\,\psi(\hat\vartheta)$. Using $\hat\lambda = O_p(n^{-1/2})$ we can replace $\psi(\hat\vartheta)$ by $\psi(\vartheta)$ since $\psi(\hat\vartheta) - \psi(\vartheta) = o_p(1)$, which again follows from Assumptions 1 and 2 and the consistency of $\hat\vartheta$. Further, by the law of large numbers, $EZ/\bar Z = 1 + o_p(1)$ and $\psi(\vartheta) - E\psi(\vartheta) = o_p(1)$. These arguments yield
$$\hat\lambda\,\frac{EZ}{\bar Z}\,\psi(\hat\vartheta) = \hat\lambda\, E\psi(\vartheta) + o_p(n^{-1/2}).$$
The expected value of $\psi(\vartheta)$ is
$$\frac{n-1}{n}\, E\bigg[\varepsilon_2\,\frac{Z_2}{EZ}\, h\{X_1, r_\vartheta(X_1)+\varepsilon_2\}\bigg] = \frac{n-1}{n}\, E\{\varepsilon h(X,Y)\} = \frac{n-1}{n}\, E\{\varepsilon\bar h(\varepsilon)\}.$$
Summing up,
$$\hat\lambda\,\frac{EZ}{\bar Z}\,\psi_w(\hat\vartheta) = \hat\lambda\, E\psi(\vartheta) + o_p(n^{-1/2}) = \hat\lambda\, E\{\varepsilon\bar h(\varepsilon)\} + o_p(n^{-1/2}).$$
Inserting expansion (3.5) for $\hat\lambda$ into the above completes the proof. □

Combining the previous lemma and the approximation of the unweighted estimator from Section 2 gives an expansion for the weighted estimator.
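The first-order approximation $\hat\lambda = \sum_j Z_j\hat\varepsilon_j / \sum_j Z_j\hat\varepsilon_j^2 + o_p(n^{-1/2})$ obtained in the proof of Lemma 3.1 can be checked numerically. The stand-in residuals and the bisection solver below are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
eps_hat = rng.normal(0.0, 1.0, n)              # stand-in residuals
Z = (rng.uniform(size=n) < 0.5).astype(float)
e = Z * eps_hat

def g(lam):                                    # constraint sum e/(1 + lam*e)
    return np.sum(e / (1.0 + lam * e))

# exact multiplier by bisection on the admissible interval
lo, hi = -1.0 / e.max() + 1e-10, -1.0 / e.min() - 1e-10
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0.0 else (lo, mid)
lam_exact = 0.5 * (lo + hi)

lam_approx = np.sum(e) / np.sum(e ** 2)        # first-order approximation

print(lam_exact, lam_approx)                   # agree up to higher-order terms
```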
Theorem 3.3.
Suppose that Assumptions 1 and 2 are satisfied and that $\hat\vartheta$ is a $\sqrt n$-consistent estimator of $\vartheta$. Let $\bar h(\varepsilon) = E\{h(X,Y)\mid\varepsilon\}$. Then
$$\frac1n\sum_{i=1}^n\hat\chi_w(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n\bigg(\chi(X_i,\vartheta) + \frac{Z_i}{EZ}\bigg[\bar h(\varepsilon_i) - E\bar h(\varepsilon) - \frac{E\{\varepsilon\bar h(\varepsilon)\}}{\sigma^2}\,\varepsilon_i\bigg]\bigg) + D_w^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}),$$
where
$$D_w = E(h(X,Y)[\dot r_\vartheta(X) - E\{\dot r_\vartheta(X)\mid Z=1\}]\,\ell(\varepsilon)) + \sigma^{-2}E\{\varepsilon\bar h(\varepsilon)\}\, E\{\dot r_\vartheta(X)\mid Z=1\}.$$

Proof.
Consider the two terms of representation (3.1) and replace them by their approximations given in Corollary 2.2 and Lemma 3.2. This yields
$$\frac1n\sum_{i=1}^n\hat\chi_w(X_i,\hat\vartheta) = \frac1n\sum_{i=1}^n\bigg(\chi(X_i,\vartheta) + \frac{Z_i}{EZ}\bigg[\bar h(\varepsilon_i) - E\bar h(\varepsilon) - \frac{E\{\varepsilon\bar h(\varepsilon)\}}{\sigma^2}\,\varepsilon_i\bigg]\bigg) + \bigg[D^\top + \frac{1}{\sigma^2}E\{\varepsilon\bar h(\varepsilon)\}\, E\{\dot r_\vartheta(X)\mid Z=1\}^\top\bigg](\hat\vartheta-\vartheta) + o_p(n^{-1/2})$$
with $D + \sigma^{-2}E\{\varepsilon\bar h(\varepsilon)\}E\{\dot r_\vartheta(X)\mid Z=1\} = D_w$, by definition of $D$ (see Corollary 2.2). Inserting this into the above gives the desired representation. □
4. Efficiency.
We are interested in efficient estimation of $Eh(X,Y)$ based on observations $(X, ZY, Z)$. Our estimator requires an efficient estimator of $\vartheta$. In this section we determine the influence function of an efficient estimator of $Eh(X,Y)$. In the next section, where the influence function of an efficient estimator $\hat\vartheta$ of $\vartheta$ is determined, this allows us to show that the fully imputed estimator with an efficient $\hat\vartheta$ plugged in is efficient. Throughout we will suppose that the assumptions made earlier are satisfied.

We first calculate the efficient influence function for estimating an arbitrary functional $\kappa$ of the joint distribution $P(dx,dy,dz)$. The joint distribution depends on the marginal distribution $G(dx)$ of $X$, the conditional probability $\pi(x)$ of $Z=1$ given $X=x$, and the conditional distribution $Q(x,dy)$ of $Y$ given $X=x$,
$$P(dx,dy,dz) = G(dx)\, B_{\pi(x)}(dz)\, \{zQ(x,dy) + (1-z)\delta_0(dy)\}.$$
Here $B_p = p\delta_1 + (1-p)\delta_0$ denotes the Bernoulli distribution with parameter $p$ and $\delta_t$ the Dirac measure at $t$. In a first step we consider a nonparametric model for $P$, that is, we allow for arbitrary models for $G$, $Q$ and $\pi$. For this general setting a characterization of efficient estimators of $\kappa(G,Q,\pi)$ is in Müller, Schick and Wefelmeyer (2006), Section 2. In the following we
summarize their key arguments and apply them to the special case of nonlinear regression (which is not considered in that article). We then calculate the efficient influence functions for estimating $Eh(X,Y)$ in the nonlinear regression model and, in the next section, for estimating $\vartheta$.

For the characterization of efficient estimators it is essential to first introduce the notion of tangent spaces. The tangent space of a model is the set of possible perturbations of $P$ within the model. An estimator of a certain functional is, roughly speaking, efficient if its influence function equals the so-called canonical gradient of the functional, which is an element of the tangent space. Hence, in order to characterize the efficient influence function, we first need to determine the tangent space.

Consider (Hellinger differentiable) perturbations of $G$, $Q$ and $\pi$,
$$G_{nu}(dx) \doteq G(dx)\{1 + n^{-1/2}u(x)\},$$
$$Q_{nv}(x,dy) \doteq Q(x,dy)\{1 + n^{-1/2}v(x,y)\},$$
$$B_{\pi_{nw}(x)}(dz) \doteq B_{\pi(x)}(dz)[1 + n^{-1/2}\{z-\pi(x)\}w(x)].$$
To guarantee that the perturbed distributions are probability distributions requires that the (Hellinger) derivative $u$ belongs to
$$L_{2,0}(G) = \bigg\{u \in L_2(G): \int u\,dG = 0\bigg\},$$
that $v$ belongs to
$$V = \bigg\{v \in L_2(M): \int v(x,y)\, Q(x,dy) = 0\bigg\}$$
with $M(dx,dy) = Q(x,dy)G(dx)$, and that $w$ belongs to $L_2(G_\pi)$, where $G_\pi(dx) = \pi(x)\{1-\pi(x)\}G(dx)$. The perturbed joint distribution $P_{nuvw}$ then has derivative $t_{uvw}(x,zy,z) = u(x) + zv(x,y) + \{z-\pi(x)\}w(x)$. Note that models for $G$, $Q$ and $\pi$ will result in further restrictions on the perturbations, which must satisfy the model assumptions. Then $u$, $v$ and $w$ must be restricted to subspaces $U$ of $L_{2,0}(G)$, $V_0$ of $V$ and $W$ of $L_2(G_\pi)$.

In this article we make no model assumptions on $G$ and $\pi$ and thus have $U = L_{2,0}(G)$ and $W = L_2(G_\pi)$.
Since we are considering nonlinear regression we do, however, have a model for the conditional distribution, namely Q(x, dy) = f{y − r_ϑ(x)} dy, with f denoting the (mean zero) density of the error distribution. Perturbations v of Q must therefore satisfy ∫ v(x, y) f{y − r_ϑ(x)} dy = 0. In order to derive an explicit form of V, we introduce perturbations s and t of the two parameters f and ϑ. Write F for the distribution function of f, and remember that we assume that f has finite Fisher information for location, Eℓ²(ε) < ∞, where ℓ = −f′/f is the score function. The perturbed distribution Q now depends on s and t,

    Q_{nv}(x, dy) = Q_{nst}(x, dy) = f_{ns}{y − r_{ϑ_{nt}}(x)} dy

with ϑ_{nt} = ϑ + n^{-1/2} t, t ∈ R^p, f_{ns}(y) = f(y){1 + n^{-1/2} s(y)} and s ∈ S, where

    S = {s ∈ L_2(F) : ∫ s(y) f(y) dy = 0, ∫ y s(y) f(y) dy = 0}.

Note that the space S is determined by two constraints: the perturbed error density f_{ns} must integrate to one, ∫ f_{ns}(y) dy = 1, and must be centered at zero, ∫ y f_{ns}(y) dy = 0. As in Schick (1993), Section 3, we have

    f_{ns}{y − r_{ϑ_{nt}}(x)}
      = f{y − r_{ϑ_{nt}}(x)}[1 + n^{-1/2} s{y − r_{ϑ_{nt}}(x)}]
      ≐ [f{y − r_ϑ(x)} − n^{-1/2} f′{y − r_ϑ(x)} ṙ_ϑ(x)^⊤ t][1 + n^{-1/2} s{y − r_ϑ(x)}]
      ≐ f{y − r_ϑ(x)} (1 + n^{-1/2} [s{y − r_ϑ(x)} − (f′/f){y − r_ϑ(x)} ṙ_ϑ(x)^⊤ t])
      = f{y − r_ϑ(x)} (1 + n^{-1/2} [s{y − r_ϑ(x)} + ℓ{y − r_ϑ(x)} ṙ_ϑ(x)^⊤ t]).

Therefore

    Q_{nst}(x, dy) ≐ f{y − r_ϑ(x)} dy × (1 + n^{-1/2} [s{y − r_ϑ(x)} + ℓ{y − r_ϑ(x)} ṙ_ϑ(x)^⊤ t])

and the subspace V of 𝒱 is

    V = {v(x, y) = s{y − r_ϑ(x)} + ℓ{y − r_ϑ(x)} ṙ_ϑ(x)^⊤ t : s ∈ S, t ∈ R^p}.    (4.1)

We now briefly review some definitions.
We will do this for arbitrary subspaces U, V and W of L_{2,0}(G), 𝒱 and L_2(G_π), and then return to our specific situation.

Let T denote the tangent space consisting of all derivatives t_{uvw}. A functional κ of G, Q and π is called differentiable with gradient g ∈ L_2(P) if, for all u ∈ U, v ∈ V and w ∈ W,

    n^{1/2} {κ(G_{nu}, Q_{nv}, π_{nw}) − κ(G, Q, π)}
      → E{g(X, ZY, Z) t_{uvw}(X, ZY, Z)}.    (4.2)

The (unique) canonical gradient g* = g*(X, ZY, Z) is the projection of g(X, ZY, Z) onto the tangent space T. It is easy to check that T can be written as an orthogonal sum of three subspaces,

    T = {u(X) : u ∈ U} ⊕ {Z v(X, Y) : v ∈ V} ⊕ {{Z − π(X)} w(X) : w ∈ W}.

The random variable g*(X, ZY, Z) is therefore the sum u*(X) + Z v*(X, Y) + {Z − π(X)} w*(X), where u*(X), Z v*(X, Y) and {Z − π(X)} w*(X) are the projections of g(X, ZY, Z) onto these subspaces.
An estimator κ̂ of κ is regular with limit L if L is a random variable such that, for all u ∈ U, v ∈ V and w ∈ W,

    n^{1/2} {κ̂ − κ(G_{nu}, Q_{nv}, π_{nw})} ⇒ L under P_{nuvw}.

The Hájek–Le Cam convolution theorem says that L is distributed as the sum of a normal random variable N, with mean zero and variance E g*², and some independent random variable. This justifies calling an estimator κ̂ efficient if it is regular with limit L = N. As a consequence, a regular estimator is efficient if and only if it is asymptotically linear with influence function g*, that is,

    n^{1/2} {κ̂ − κ(G, Q, π)} = n^{-1/2} Σ_{i=1}^n g*(X_i, Z_i Y_i, Z_i) + o_p(1).

A reference for the convolution theorem and the characterization is Bickel et al. (1998).

Let us now specify the canonical gradient for the functional Eh(X, Y). The canonical gradient is, in particular, a gradient and thus specified by (4.2). Moreover, it is characterized by g*(X, ZY, Z) = u*(X) + Z v*(X, Y) + {Z − π(X)} w*(X), with the terms of the sum being projections as stated above. The canonical gradient for arbitrary κ is therefore determined by

    E{u*(X) u(X)} + E{Z v*(X, Y) v(X, Y)} + E[{Z − π(X)} w*(X) w(X)]
      = lim_{n→∞} n^{1/2} {κ(G_{nu}, Q_{nv}, π_{nw}) − κ(G, Q, π)}.    (4.3)

In the nonlinear regression model we have, as defined earlier, U = L_{2,0}(G), W = L_2(G_π) and Q_{nv} = Q_{nst} with v ∈ V, that is, v(X, Y) = s(ε) + ℓ(ε) ṙ_ϑ(X)^⊤ t [see (4.1)]. Since Eh(X, Y) does not depend on π, we have Eh(X, Y) = κ(G, Q, π) = κ(G, Q) and

    Eh(X, Y) = ∫ h dM = ∫∫ h(x, y) Q(x, dy) G(dx)
             = ∫∫ h(x, y) f{y − r_ϑ(x)} dy G(dx).

Let M_{nuv}(dx, dy) = Q_{nv}(x, dy) G_{nu}(dx) with Q_{nv} = Q_{nst} = f_{ns}{y − r_{ϑ_{nt}}(x)} dy and perturbations G_{nu}, f_{ns} and ϑ_{nt} as defined earlier.
Using the previous approximations we see that the right-hand side of (4.3) is

    lim_{n→∞} n^{1/2} (∫ h dM_{nuv} − ∫ h dM) = E[h(X, Y){u(X) + v(X, Y)}]

with v(X, Y) = s(ε) + ℓ(ε) ṙ_ϑ(X)^⊤ t. The canonical gradient g* of Eh(X, Y) is therefore determined by

    E{u*(X) u(X)} + E{Z v*(X, Y) v(X, Y)} + E[{Z − π(X)} w*(X) w(X)]
      = E[h(X, Y){u(X) + v(X, Y)}]    (4.4)

for all u ∈ U, v ∈ V and w ∈ W, with v of the above form.

In order to specify g* we set u = 0 and v = 0 in (4.4) and see that w* must be zero. Setting v = 0, we see that u*(X) is the projection of h(X, Y) onto U = L_{2,0}(G), that is, u*(X) = χ(X, ϑ) − E{χ(X, ϑ)} with χ(X, ϑ) = E{h(X, Y) | X}. Hence we have

    g*(X, ZY, Z) = χ(X, ϑ) − E{χ(X, ϑ)} + Z v*(X, Y)    (4.5)

and are left to determine v*. Taking u = 0 in (4.4), we see that the projection of Z v*(X, Y) onto Ṽ = {v(X, Y) : v ∈ V} must equal the projection of h(X, Y) onto Ṽ, that is, onto

    Ṽ = {s(ε) + ℓ(ε) ṙ_ϑ(X)^⊤ t : s ∈ S, t ∈ R^p}.

There are two possible ways to obtain v*. One method would be to make an educated guess: in Theorem 3.3 we derived an approximation of an estimator of Eh(X, Y) which we expect to be efficient since it uses all information about the model. The approximation still involves ϑ̂ − ϑ but, combined with the efficient influence function for estimating ϑ (which is relatively easy to derive; see Section 5), it will suggest a candidate for v*. Whether this candidate is the correct v* can be checked with characterization (4.4), that is, with

    E[Z v*(X, Y){s(ε) + ℓ(ε) ṙ_ϑ(X)^⊤ t}] = E[h(X, Y){s(ε) + ℓ(ε) ṙ_ϑ(X)^⊤ t}].    (4.6)

The other method uses the structure of the tangent space. The canonical gradient v* is characterized in terms of projections onto Ṽ. Its derivation as a projection onto Ṽ is simplified by decomposing Ṽ.
Let ℓ_s denote the projection of ℓ onto S,

    ℓ_s(ε) = ℓ(ε) − σ^{-2} ε,

and note that ℓ_s = 0 is possible, namely when the error density f is normal. We now introduce the notation

    ζ = [ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε) + E{ṙ_ϑ(X) | Z = 1} ε/σ²

and, for s ∈ S and t ∈ R^p, write

    s(ε) + ṙ_ϑ(X)^⊤ t ℓ(ε)
      = s(ε) + t^⊤ [ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε)
        + t^⊤ E{ṙ_ϑ(X) | Z = 1} {ℓ(ε) − ε/σ²} + t^⊤ E{ṙ_ϑ(X) | Z = 1} ε/σ²
      = t^⊤ ζ + s(ε) + t^⊤ E{ṙ_ϑ(X) | Z = 1} ℓ_s(ε)

with s(ε) + t^⊤ E{ṙ_ϑ(X) | Z = 1} ℓ_s(ε) ∈ S. Any element of Ṽ can therefore be written t^⊤ ζ + s(ε) for some t ∈ R^p and s ∈ S. Since the canonical gradient v* is in Ṽ by definition, it must be of the form v*(X, Y) = s*(ε) + t*^⊤ ζ with s* ∈ S and t* ∈ R^p to be determined such that (4.6) holds, that is, after our above considerations,

    E[Z {s*(ε) + t*^⊤ ζ}{s(ε) + t^⊤ ζ}] = E[h(X, Y){s(ε) + t^⊤ ζ}]

for all t ∈ R^p and s ∈ S.

We first consider t = 0 and then s = 0 and, in both cases, use the fact that Zζ is orthogonal to S. The above characterization of s* and t* then reduces to two equations, namely

    E{Z s*(ε) s(ε)} = E{h(X, Y) s(ε)} for all s ∈ S,    (4.7)
    E{Z t*^⊤ ζ t^⊤ ζ} = E{h(X, Y) t^⊤ ζ} for all t ∈ R^p.    (4.8)

Consider (4.7) and again use the notation h̄(ε) for the conditional expectation E{h(X, Y) | ε}. Then (4.7) can be written E{Z s*(ε) s(ε)} = E{h̄(ε) s(ε)}, so that h̄(ε)/EZ is an obvious candidate for s*. However, it is not (yet) in S: the desired s* is obtained as its centered version with a correction term chosen such that s* ∈ S,

    s*(ε) = (1/EZ) [h̄(ε) − Eh̄(ε) − σ^{-2} E{ε h̄(ε)} ε].

The vector t* is obtained by solving (4.8),

    t*^⊤ E(Z ζ ζ^⊤) t = E{h(X, Y) ζ^⊤} t for all t ∈ R^p.
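For completeness, one can check directly that the s* displayed above lies in S and solves (4.7). Using the independence of Z and ε, Eε = 0, Eε² = σ², and the two constraints defining S, we get

\[
E\{\varepsilon s^*(\varepsilon)\}
  = \frac{1}{EZ}\bigl[E\{\varepsilon \bar h(\varepsilon)\}
    - E\bar h(\varepsilon)\,E\varepsilon
    - \sigma^{-2}E\{\varepsilon \bar h(\varepsilon)\}\,E\varepsilon^{2}\bigr] = 0,
\]

and similarly \(E\,s^*(\varepsilon) = 0\), so \(s^* \in S\). Moreover, for any \(s \in S\),

\[
E\{Z s^*(\varepsilon) s(\varepsilon)\}
  = EZ \cdot E\{s^*(\varepsilon) s(\varepsilon)\}
  = E\{\bar h(\varepsilon) s(\varepsilon)\}
  = E\{h(X, Y) s(\varepsilon)\},
\]

where the centering term and the correction term drop out because \(\int s f\,dy = 0\) and \(\int y\,s(y) f(y)\,dy = 0\).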
Now use the definition of ζ from above and the definition of the vector D_w from the end of the previous section,

    D_w = E(h(X, Y)[ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε)) + σ^{-2} E{ε h̄(ε)} E{ṙ_ϑ(X) | Z = 1},

and assume that E(Z ζ ζ^⊤) is invertible to obtain

    t*^⊤ = E{h(X, Y) ζ^⊤} E(Z ζ ζ^⊤)^{-1}
         = E(h(X, Y)([ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}]^⊤ ℓ(ε) + E{ṙ_ϑ(X) | Z = 1}^⊤ ε/σ²)) E(Z ζ ζ^⊤)^{-1}
         = D_w^⊤ E(Z ζ ζ^⊤)^{-1}.
This completes the derivation of v*(X, Y) = s*(ε) + t*^⊤ ζ:

    v*(X, Y) = (1/EZ)[h̄(ε) − Eh̄(ε) − σ^{-2} E{ε h̄(ε)} ε] + D_w^⊤ E(Z ζ ζ^⊤)^{-1} ζ.    (4.9)

Equations (4.5) and (4.9) together finally yield the canonical gradient g*, which is given in the following lemma. Note that we now have the additional assumption that E(Z ζ ζ^⊤) is invertible, where E(Z ζ ζ^⊤) involves the covariance matrix of Z ṙ_ϑ(X) and the Fisher information Eℓ²(ε).

Lemma 4.1.
Let h̄(ε) = E{h(X, Y) | ε},

    ζ = [ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε) + σ^{-2} E{ṙ_ϑ(X) | Z = 1} ε

and

    D_w = E(h(X, Y)[ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε)) + σ^{-2} E{ε h̄(ε)} E{ṙ_ϑ(X) | Z = 1}
        = E{h(X, Y) ζ}.

Suppose, additionally to the model assumptions from Section 2, that E(Z ζ ζ^⊤) is invertible. Then the canonical gradient of the functional Eh(X, Y) is

    χ(X, ϑ) − E{χ(X, ϑ)}
      + (Z/EZ)[h̄(ε) − Eh̄(ε) − σ^{-2} E{ε h̄(ε)} ε]
      + D_w^⊤ E(Z ζ ζ^⊤)^{-1} Z ζ.    (4.10)
5. Estimation of the parameter and main result.
In this section we show that the weighted estimator of Eh(X, Y) with an efficient estimator ϑ̂ of ϑ plugged in is asymptotically linear with influence function equal to the canonical gradient, that is, it is efficient. Let us compare the expansion of the weighted estimator from Theorem 3.3 and the efficient influence function, which is given by the canonical gradient (4.10) in Lemma 4.1. The approximation of n^{-1/2} Σ_{i=1}^n [χ̂_w(X_i, ϑ̂) − E{χ(X, ϑ)}] which we derived in Section 3 is

    n^{-1/2} Σ_{i=1}^n (χ(X_i, ϑ) − E{χ(X, ϑ)} + (Z_i/EZ)[h̄(ε_i) − Eh̄(ε) − σ^{-2} E{ε h̄(ε)} ε_i])
      + D_w^⊤ n^{1/2} (ϑ̂ − ϑ),

where D_w = E(h(X, Y)[ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε)) + σ^{-2} E{ε h̄(ε)} E{ṙ_ϑ(X) | Z = 1}. The efficient influence function determined by the canonical gradient is

    χ(X, ϑ) − E{χ(X, ϑ)} + (Z/EZ)[h̄(ε) − Eh̄(ε) − σ^{-2} E{ε h̄(ε)} ε] + D_w^⊤ E{Z ζ ζ^⊤}^{-1} Z ζ

with ζ = [ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε) + σ^{-2} E{ṙ_ϑ(X) | Z = 1} ε. Using an estimator ϑ̂ with influence function E(Z ζ ζ^⊤)^{-1} Z ζ would therefore yield an efficient estimator of Eh(X, Y). In fact, it is easy to check (this will be done in the following lemma) that this influence function is the canonical gradient of the functional κ(G, Q, π) = ϑ. This means that our estimator of Eh(X, Y) requires an efficient estimator ϑ̂ of ϑ to be plugged in in order to be efficient.

Lemma 5.1.
Let ζ = [ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε) + σ^{-2} E{ṙ_ϑ(X) | Z = 1} ε and suppose that E(Z ζ ζ^⊤) is invertible. An asymptotically linear estimator ϑ̂ of ϑ with influence function E(Z ζ ζ^⊤)^{-1} Z ζ, that is,

    n^{1/2} (ϑ̂ − ϑ) = n^{-1/2} Σ_{i=1}^n E(Z ζ ζ^⊤)^{-1} Z_i [{ṙ_ϑ(X_i) − E[ṙ_ϑ(X) | Z = 1]} ℓ(ε_i) + E{ṙ_ϑ(X) | Z = 1} ε_i/σ²] + o_p(1),

is efficient for ϑ.

Proof.
We have a semiparametric model for the conditional distribution, namely Q(x, dy) = f{y − r_ϑ(x)} dy, and nonparametric models for G and π. The functional ϑ ∈ R^p is therefore a functional of Q, κ(G, Q, π) = κ(Q) = ϑ. By the discussion of the previous section we must show that the influence function of the estimator equals the canonical gradient, which is, for arbitrary functionals κ, determined by (4.3). For the functional ϑ the right-hand side of (4.3) is simply n^{1/2}{(ϑ + n^{-1/2} t) − ϑ} = t. From Section 4 we also know that in the nonlinear regression model any v in Ṽ is of the form v(X, Y) = s(ε) + t^⊤ ζ, where s ∈ S and t ∈ R^p. The canonical gradient u*(X) + Z v*(X, Y) + {Z − π(X)} w*(X) is therefore characterized by

    E{u*(X) u(X)} + E[Z v*(X, Y){s(ε) + ζ^⊤ t}] + E[{Z − π(X)} w*(X) w(X)] = t.

Taking s = 0, t = 0 and w = 0 we see that u* = 0. Analogously one obtains that w* must be zero. The canonical gradient thus reduces to Z v*(X, Y). Again, since v* ∈ Ṽ, we write Z v*(X, Y) = Z s*(ε) + Z ζ^⊤ t* with s* and t* to be determined. Taking t = 0 we see that Z v* must be orthogonal to S, that is, s* = 0, which yields Z v*(X, Y) = Z ζ^⊤ t*. The above characterization therefore reduces to

    t = E[Z ζ^⊤ t*{s(ε) + ζ^⊤ t}] = t*^⊤ E(Z ζ ζ^⊤) t for all t ∈ R^p.

This gives t* = E(Z ζ ζ^⊤)^{-1} and the proof is complete: the canonical gradient of the parameter ϑ is Z v*(X, Y) = Z t*^⊤ ζ = E(Z ζ ζ^⊤)^{-1} Z ζ. □
Note that the asymptotic variance of ϑ̂ is E(Z ζ ζ^⊤)^{-1}. The assumption that E(Z ζ ζ^⊤) be invertible is therefore a condition on the covariance matrix of an efficient estimator of ϑ, which we require to have full rank. Lemma 5.1 combined with the previous discussion yields our main result, which is given in the following theorem. Note that the asymptotic variance of the fully imputed estimator of Eh(X, Y) is E g*², where g* is the canonical gradient from (4.10). This variance is also given in the theorem below and is easily verified by taking into account that the three terms of g* are orthogonal.

Theorem 5.2.
Assume that the assumptions stated earlier hold and that the covariance matrices of ṙ_ϑ(X) and of Z ṙ_ϑ(X) are invertible. Let ϑ̂ be an asymptotically linear estimator of ϑ with influence function E(Z ζ ζ^⊤)^{-1} Z ζ, where ζ = [ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε) + σ^{-2} E{ṙ_ϑ(X) | Z = 1} ε. Then the estimator n^{-1} Σ_{i=1}^n χ̂_w(X_i, ϑ̂) with

    χ̂_w(X_i, ϑ̂) = Σ_{j=1}^n ŵ_j Z_j h{X_i, r_ϑ̂(X_i) + Y_j − r_ϑ̂(X_j)} / Σ_{j=1}^n Z_j

has the expansion

    n^{-1} Σ_{i=1}^n (χ(X_i, ϑ) + (Z_i/EZ)[h̄(ε_i) − Eh̄(ε) − σ^{-2} E{ε h̄(ε)} ε_i]
      + D_w^⊤ E(Z ζ ζ^⊤)^{-1} Z_i ([ṙ_ϑ(X_i) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε_i) + E{ṙ_ϑ(X) | Z = 1} ε_i/σ²))
      + o_p(n^{-1/2}),

where D_w = E(h(X, Y)[ṙ_ϑ(X) − E{ṙ_ϑ(X) | Z = 1}] ℓ(ε)) + σ^{-2} E{ε h̄(ε)} E{ṙ_ϑ(X) | Z = 1} and h̄(ε) = E{h(X, Y) | ε}. In particular, it is an efficient estimator of Eh(X, Y) and asymptotically normally distributed with asymptotic variance

    Eχ²(X, ϑ) − {Eh(X, Y)}²
      + (1/EZ)[E h̄²(ε) − {Eh(X, Y)}² − σ^{-2} (E{ε h̄(ε)})²]
      + D_w^⊤ E(Z ζ ζ^⊤)^{-1} D_w.

In the linear regression model without missing responses, efficient estimators of ϑ have been constructed by Bickel (1982), Koul and Susarla (1983) and Schick (1987, 1993). Schick (1993) considers general regression models with arbitrary sets of identifiability assumptions and discusses the mean zero constraint on the error distribution as an important example. His construction of an efficient estimator requires a preliminary estimate of ϑ and a direct estimator of the influence function. The influence function for the nonlinear regression model with mean zero errors [see Schick (1993), Section 4.1 and Remark 3.13] is E(ξ ξ^⊤)^{-1} ξ with

    ξ = [ṙ_ϑ(X) − E{ṙ_ϑ(X)}] ℓ(ε) + E{ṙ_ϑ(X)} ε/σ²

and therefore consistent with our findings. A further developed efficient estimator, which requires weaker conditions, is in Forrester et al. (2003). In the model with missing responses an efficient estimator can be constructed analogously, using only the (available) full observations. Note that the only difference in the construction is that the data are incomplete, that is, the presence of the indicators Z_i. In the following we briefly sketch this "one-step improvement" construction of the estimator and refer to Forrester et al. (2003) for details.

Let ϑ̄ denote a √n consistent and discretized estimator of ϑ, that is, with values on a rectangular grid with side lengths of order n^{-1/2}. Write μ(ϑ) for E{ṙ_ϑ(X) | Z = 1}, ε(ϑ) for the error variable ε(ϑ) = Y − r_ϑ(X), and ζ_ϑ{X, ε(ϑ)} for ζ, that is,

    ζ = ζ_ϑ{X, ε(ϑ)} = {ṙ_ϑ(X) − μ(ϑ)} ℓ{ε(ϑ)} + μ(ϑ) ε(ϑ)/σ².

In order to estimate the influence function one replaces the unknown quantities by estimators.
The estimator of ϑ is then of the form

    ϑ̄ + [Σ_{j=1}^n Z_j ζ̂_{ϑ̄}{X_j, ε_j(ϑ̄)} ζ̂_{ϑ̄}{X_j, ε_j(ϑ̄)}^⊤]^{-1} Σ_{j=1}^n Z_j ζ̂_{ϑ̄}{X_j, ε_j(ϑ̄)},

where

    ζ̂_{ϑ̄}{X, ε(ϑ̄)} = [ṙ_{ϑ̄}(X) − μ̂(ϑ̄)] ℓ̂{ε(ϑ̄)} + μ̂(ϑ̄) ε(ϑ̄)/σ̂²(ϑ̄)

with

    μ̂(ϑ̄) = Σ_{j=1}^n Z_j ṙ_{ϑ̄}(X_j) / Σ_{j=1}^n Z_j,    σ̂²(ϑ̄) = Σ_{j=1}^n Z_j ε_j²(ϑ̄) / Σ_{j=1}^n Z_j

and an estimator ℓ̂ of the score function. To describe this estimator let k be a kernel that satisfies the assumptions given in Section 8 of Forrester et al. (2003), for example, a logistic density. For a bandwidth a_n → 0 set k_n(x) = k(x/a_n)/a_n. The estimator of the score function ℓ is a kernel estimator based on the available residuals ε_j(ϑ̄),

    ℓ̂_{ϑ̄}(x) = −f̂′_n(x) / {b_n + f̂_n(x)}

with f̂_n(x) = n^{-1} Σ_{j=1}^n Z_j k_n{x − ε_j(ϑ̄)}, where b_n is a sequence of positive numbers converging to zero. The required orders of a_n and b_n are given in Forrester et al. (2003).

There are other simple estimators of ϑ available which, however, in contrast to the estimators proposed by Schick (1987, 1993) and Forrester et al. (2003), are not efficient for ϑ and which, if used for plug-in, would yield inefficient estimators of Eh(X, Y). One could, for example, estimate ϑ by a weighted least squares estimator, that is, by the solution t = ϑ̂ of an estimating equation Σ_{i=1}^n Z_i w_t(X_i){Y_i − r_t(X_i)} = 0. Such an estimator would be appropriate in a regression model where independence of errors and covariates cannot be assumed. There one could even obtain efficiency for suitably chosen weights [see Müller (2007), for nonlinear regression without missing responses]. The estimating equation can be regarded as an empirical version of the equation E[Z w_t(X){Y − r_t(X)}] = 0. If a solution t = ϑ of this equation exists, the solution ϑ̂ of the empirical version will, in general, be consistent for ϑ.
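The one-step construction just described can be sketched numerically. The following is a minimal illustration and not the construction of Forrester et al. (2003): it treats a one-dimensional ϑ, omits the discretization of ϑ̄, and uses fixed illustrative bandwidths `a_n` and `b_n`; the arguments `r` and `r_dot` stand for the regression function and its derivative in ϑ, and all function names are ours.

```python
import numpy as np

def one_step(theta_bar, X, Y, Z, r, r_dot, a_n=0.5, b_n=0.05):
    """One-step improvement of a root-n consistent theta_bar (scalar case).

    Sketch only: the discretization of theta_bar is omitted and the
    bandwidths a_n, b_n are fixed illustrative values.
    """
    eps = Y - r(X, theta_bar)               # residuals eps_j(theta_bar)
    e = eps[Z == 1]                         # residuals of the complete cases

    def k(u):                               # logistic density as kernel
        s = 1.0 / (1.0 + np.exp(-u))
        return s * (1.0 - s)

    def k_prime(u):                         # derivative of the kernel
        s = 1.0 / (1.0 + np.exp(-u))
        return s * (1.0 - s) * (1.0 - 2.0 * s)

    def score_hat(t):                       # estimated score -f'/f, trimmed by b_n
        f_hat = np.mean(k((t - e) / a_n)) / a_n
        f1_hat = np.mean(k_prime((t - e) / a_n)) / a_n ** 2
        return -f1_hat / (b_n + f_hat)

    mu_hat = np.mean(r_dot(X[Z == 1], theta_bar))   # estimates E{r_dot | Z = 1}
    sigma2_hat = np.mean(e ** 2)                    # estimates sigma^2
    zeta = np.array([(r_dot(x, theta_bar) - mu_hat) * score_hat(ei)
                     + mu_hat * ei / sigma2_hat
                     for x, ei in zip(X[Z == 1], e)])
    # one-step update: theta_bar + (sum Z zeta zeta^T)^{-1} sum Z zeta
    return theta_bar + zeta.sum() / (zeta ** 2).sum()
```

With normal errors the estimated score is close to ε/σ², so the one-step estimator tracks the least squares estimator; for other error densities the kernel score supplies the additional information.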
If one is not interested in efficiency, the estimator n^{-1} Σ_{i=1}^n χ̂_w(X_i, ϑ̂) with a least squares estimator ϑ̂ plugged in would yield a consistent estimator of Eh(X, Y) (but not an efficient one, since the independence structure is not used). Alternatively, the least squares estimator can be used as a preliminary estimator for the one-step improvement approach sketched above.
6. Special cases, simulations and inference.
Sometimes the estimator simplifies considerably, especially if we study simple special cases such as estimation of expectations Eh(X, Y) where h has a simple form. The main result from Theorem 5.2 is therefore useful in proving efficiency of existing approaches for specific applications, or in improving them, and for comparisons of competing methods. Theorem 5.2 further provides the limiting distribution of the efficient estimator, which facilitates the construction of confidence intervals. We will address this and aspects of the construction of estimators in the following, and illustrate the results with simulations.

6.1. Special cases.
We have shown that the fully imputed weighted estimator n^{-1} Σ_{i=1}^n χ̂_w(X_i, ϑ̂) with

    χ̂_w(x, ϑ̂) = Σ_{j=1}^n ŵ_j Z_j h{x, r_ϑ̂(x) + Y_j − r_ϑ̂(X_j)} / Σ_{j=1}^n Z_j

is efficient for Eh(X, Y), where h(X, Y) is a known square-integrable function. The literature usually deals with estimation of the mean response, that is, h(x, y) = y. Other important examples are estimation of higher moments of the response variable Y and estimation of the covariance and of mixed moments of X and Y. In all these cases h(x, y) is a polynomial in x and y, and the estimator often simplifies. This holds for the mean response and, more generally, when h is of the form h(x, y) = a(x) y. Then the estimator reduces to an unweighted empirical estimator, which can be seen as follows. Recall that the weights must be chosen such that Σ_{j=1}^n ŵ_j Z_j ε̂_j = 0 and that ŵ_j = 1 − λ̂ ŵ_j Z_j ε̂_j, which gives Σ_{j=1}^n ŵ_j Z_j / Σ_{j=1}^n Z_j = 1. Hence the estimator of E{a(X) Y} is

    n^{-1} Σ_{i=1}^n χ̂_w(X_i, ϑ̂)
      = n^{-1} Σ_{i=1}^n Σ_{j=1}^n ŵ_j Z_j a(X_i){r_ϑ̂(X_i) + ε̂_j} / Σ_{j=1}^n Z_j
      = n^{-1} Σ_{i=1}^n a(X_i) {r_ϑ̂(X_i) Σ_{j=1}^n ŵ_j Z_j / Σ_{j=1}^n Z_j + Σ_{j=1}^n ŵ_j Z_j ε̂_j / Σ_{j=1}^n Z_j}
      = n^{-1} Σ_{i=1}^n a(X_i) r_ϑ̂(X_i).

In these cases it is therefore not necessary to determine weights: the above intuitive estimator, with an efficient estimator ϑ̂ of ϑ plugged in, is efficient for E{a(X) Y}.

An interesting special case is estimation of the mean response, a(X) = 1, when possibly all responses are observed, which we mentioned in the Introduction. Regardless of whether there are missing responses or not, n^{-1} Σ_{i=1}^n r_ϑ̂(X_i) is efficient for EY, provided ϑ̂ is efficient for ϑ. The difference between the two situations is the construction of ϑ̂, which will be based on either complete data pairs or on missing response data.
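The reduction above says that, for h(x, y) = a(x)y, the weighted fully imputed estimator collapses to n^{-1} Σ a(X_i) r_ϑ̂(X_i), so no weights need to be computed. A minimal numerical sketch follows; the helper names are ours, and the complete-case least squares estimator stands in for an efficient ϑ̂ (which is appropriate in the normal-error linear model).

```python
import numpy as np

def estimate_aXY(X, a, r, theta_hat):
    """Estimator of E{a(X) Y} for h(x, y) = a(x) y: the weighted fully
    imputed estimator reduces to the plug-in mean of a(X) r(X, theta_hat)."""
    return np.mean(a(X) * r(X, theta_hat))

rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(-1, 1, n)
Y = 2.0 * X + rng.standard_normal(n)          # r_theta(x) = theta * x, theta = 2
Z = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X))).astype(int)

obs = Z == 1
theta_hat = np.sum(X[obs] * Y[obs]) / np.sum(X[obs] ** 2)   # complete-case OLS

# mean response: a(x) = 1, so the estimator is the mean of fitted values
ey_hat = estimate_aXY(X, lambda x: np.ones_like(x), lambda x, t: t * x, theta_hat)
```

Here EY = E(2X) = 0, and `ey_hat` is the fully imputed estimate n^{-1} Σ r_ϑ̂(X_i), which uses the covariates of all observations, including those with missing responses.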
Let us stay with this example and consider, for a comparison, the unweighted estimator (1.1) from the Introduction, that is, with all weights equal to one. It involves the term Σ_{j=1}^n Z_j ε̂_j / Σ_{j=1}^n Z_j, which is nonzero. If all responses are observable, the unweighted estimator further simplifies, namely to

    n^{-1} Σ_{i=1}^n r_ϑ̂(X_i) + n^{-1} Σ_{i=1}^n ε̂_i = n^{-1} Σ_{i=1}^n Y_i

[whereas the weighted estimator is n^{-1} Σ_{i=1}^n r_ϑ̂(X_i)]. Its influence function is Y − EY, which is clearly not the efficient one: our efficient estimator of EY (with an efficient estimator ϑ̂) has the expansion

    n^{-1} Σ_{i=1}^n r_ϑ̂(X_i) ≐ n^{-1} Σ_{i=1}^n r_ϑ(X_i) + (ϑ̂ − ϑ)^⊤ E ṙ_ϑ(X).

We recognize this as the expansion from Theorem 3.3 with D_w = E ṙ_ϑ(X). Even without inserting the expansion of ϑ̂ − ϑ from the previous section, it is clear that this is, in general, not the influence function of n^{-1} Σ_{i=1}^n Y_i, which shows that the latter cannot be efficient. Note that n^{-1} Σ_{i=1}^n Y_i also coincides with the (inefficient) partially imputed estimator if all responses were observed.

6.2. Simulations.
For an illustration with computer simulations we consider a linear regression function, r_ϑ(X) = ϑX with ϑ = 2, and a nonlinear regression function, r_ϑ(X) = cos(ϑX), also with ϑ = 2. The probabilities π(X) = P(Z = 1 | X) = E(Z | X) are chosen as values of a logistic distribution function, π(X) = 1/(1 + e^{−X}), so that on average one half of the simulated responses are missing. We generate covariates X from a uniform distribution on the interval (−1, 1) and error variables ε from a standard normal distribution. If the errors are in fact normally distributed, then ℓ(ε) = ε/σ² and the efficient one-step improvement estimator of ϑ from the previous section is asymptotically equivalent to the ordinary least squares estimator. The following considerations can therefore be based on this straightforward estimation approach.

In a first example we consider estimation of the mean response EY and compare the efficient (fully imputed weighted) estimator, which, as seen above, here simplifies to n^{-1} Σ_{i=1}^n r_ϑ̂(X_i), with the partially imputed estimator n^{-1} Σ_{i=1}^n {Z_i Y_i + (1 − Z_i) r_ϑ̂(X_i)}. We also study the performance of these estimators when the parameter estimates are replaced by their true values, and when all responses are observed, π(·) = 1. Further we calculate the first simple estimator from the Introduction, n^{-1} Σ_{i=1}^n Z_i Y_i / π̂(X_i), with, for reasons of simplicity, the estimated probabilities π̂ replaced by the true ones. The values of the simulated mean squared errors are given in Table 1.

In both the linear and the nonlinear regression model, the fully imputed estimator performs considerably better than the partially imputed estimator. The simple estimator in the last column is clearly outperformed by the imputation approaches.
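A simulation in the spirit of this design can be set up in a few lines. The following is a hedged sketch covering the linear case only, with complete-case ordinary least squares as the (here efficient) plug-in ϑ̂; the function name and default arguments are ours.

```python
import numpy as np

def simulate_mse(n=100, reps=500, theta=2.0, seed=2):
    """Compare simulated MSEs of the fully imputed (FI) and partially
    imputed (PI) estimators of EY in the model Y = theta*X + eps,
    X ~ U(-1, 1), eps ~ N(0, 1), pi(x) = 1/(1 + exp(-x)); true EY = 0."""
    rng = np.random.default_rng(seed)
    fi, pi = [], []
    for _ in range(reps):
        X = rng.uniform(-1, 1, n)
        Y = theta * X + rng.standard_normal(n)
        Z = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X))).astype(int)
        obs = Z == 1
        theta_hat = np.sum(X[obs] * Y[obs]) / np.sum(X[obs] ** 2)
        fi.append(np.mean(theta_hat * X))                    # fully imputed
        pi.append(np.mean(Z * Y + (1 - Z) * theta_hat * X))  # partially imputed
    ey = 0.0                                                 # true mean response
    return (np.mean((np.array(fi) - ey) ** 2),
            np.mean((np.array(pi) - ey) ** 2))

mse_fi, mse_pi = simulate_mse()
```

With n = 100 the resulting values should be in the neighborhood of the entries reported for the linear model in Table 1, with the fully imputed estimator showing the smaller mean squared error.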
Comparing the columns for the fully imputed estimator with and without parameter estimation (and analogously for the partially imputed estimator), we see that the estimator of the slope ϑ in the linear regression r_ϑ(X) = ϑX is, as a plug-in for estimating EY, better than the estimator of the frequency parameter ϑ in the nonlinear regression model r_ϑ(X) = cos(ϑX): in the linear regression model the mean squared errors of the approaches based on ϑ and ϑ̂ are very similar, in contrast to the nonlinear model, where the differences are quite large. Let us also compare the (a) and (b) sections of the linear regression and the nonlinear regression example, which refer to the situations where (a) responses are missing at random and (b) all responses are available. For the fully imputed estimator n^{-1} Σ_{i=1}^n r_ϑ̂(X_i) we observe the expected improved performance when more (response) data for the estimation of ϑ are available. The situation is different for the partially imputed estimator. Indeed we expect that, similarly, performance will improve as the proportion of observed responses increases. In this case ϑ̂ improves as an estimator of ϑ but, at the same time, the partially imputed estimator will discard more and more information about the structure of the regression function. [In the extreme case π(·) = 1 it equals the empirical estimator n^{-1} Σ_{i=1}^n Y_i.] Our example demonstrates that both scenarios are possible: for the linear regression model the estimator of ϑ performs well and the simulated mean

Table 1
Simulated mean squared errors of estimators of the mean response EY

Linear regression: r_ϑ(X) = ϑX (ϑ = 2)

    π(X)              n     F̂I        FI        P̂I        PI        N
    1/(1 + e^{−X})    50    0.027520  0.026639  0.036231  0.036368  0.104962
                      100   0.013502  0.013298  0.018074  0.018364  0.052680
                      1000  0.001328  0.001325  0.001794  0.001835  0.005270
    1                 50    0.026990  0.026639  0.046322  0.046322  0.046322
                      100   0.013415  0.013298  0.023479  0.023479  0.023479
                      1000  0.001327  0.001325  0.002345  0.002345  0.002345

Nonlinear regression: r_ϑ(X) = cos(ϑX) (ϑ = 2)

    π(X)              n     F̂I        FI        P̂I        PI        N
    1/(1 + e^{−X})    50    0.027858  0.003957  0.031163  0.013272  0.053038
                      100   0.015462  0.002001  0.017147  0.007020  0.028154
                      1000  0.001492  0.000199  0.001671  0.000696  0.002810
    1                 50    0.016512  0.003957  0.023369  0.023369  0.023369
                      100   0.008581  0.002001  0.012043  0.012043  0.012043
                      1000  0.000852  0.000199  0.001207  0.001207  0.001207

Notes. The table entries are the simulated mean squared errors of estimators of EY = E r_ϑ(X) with partially missing responses, π(X) = 1/(1 + e^{−X}), and completely observed data pairs, π(X) = 1. In the first two columns we study the efficient fully imputed weighted estimator with the ordinary least squares estimator ϑ̂ plugged in (F̂I) and its corresponding version using the true parameter, ϑ = 2 (FI). The next two columns refer to the partially imputed estimator using ϑ̂ (P̂I) and the version based on ϑ = 2 (PI). The last column considers the simple estimator n^{-1} Σ_{i=1}^n Z_i Y_i / π(X_i) (N), which does not use imputation. Note that in the sections with π(X) = 1 the columns for P̂I, PI and N are identical: since all the indicators are 1, these estimators coincide with the empirical estimator n^{-1} Σ_{i=1}^n Y_i.

squared error of the partially imputed estimator in (a) is smaller than in (b). In the nonlinear regression model the estimator of ϑ is not as good, and the mean squared error in (a) is larger than the mean squared error of the empirical estimator in (b). Note that this observation about the performance of the partially imputed estimator is only of secondary interest since, in any case, the fully imputed estimator has the smaller mean squared error.

The situation is slightly more complicated when h is of the form h(x, y) = a(x) b(y) with a nonlinear function b, for example, when higher mixed moments of X and Y or just higher moments of Y are estimated. Simplified estimators are available when b has a simple form. For an illustration we consider, in a second example, estimation of the second moment EY² = E r_ϑ²(X) + σ². The fully imputed estimator is

    n^{-1} Σ_{i=1}^n Σ_{j=1}^n ŵ_j Z_j {r_ϑ̂(X_i) + ε̂_j}² / Σ_{j=1}^n Z_j
      = n^{-1} Σ_{i=1}^n r_ϑ̂²(X_i) + Σ_{j=1}^n ŵ_j Z_j ε̂_j² / Σ_{j=1}^n Z_j.
The mean squared errors for the fully imputed and the partially imputed estimator (with and without parameter estimation) are given in Table 2. Consider the lower section on nonlinear regression first. We see that, as expected, the fully imputed estimator outperforms the partially imputed estimator and that, in part (a) with missing responses, both estimators are far better than the simple estimator in the last column. Using an estimator ϑ̂ of ϑ, or the true value ϑ = 2, does not have much impact on the mean squared error here. The upper half of Table 2 on linear regression, however, shows a different picture: although the mean squared errors of the fully imputed and the partially imputed estimator based on the true ϑ are considerably different (which is what we would expect), the values for the estimators based on the ordinary least squares parameter estimator ϑ̂ suggest that the two approaches are asymptotically equivalent. For the extreme case (b), where π(·) = 1, this would mean that the fully imputed estimator n^{-1} Σ_{i=1}^n r_ϑ̂²(X_i) + n^{-1} Σ_{i=1}^n ŵ_i ε̂_i² and the empirical estimator n^{-1} Σ_{i=1}^n Y_i² are asymptotically equivalent. This may be surprising but, in fact, it is easy to see that this is exactly what is happening: we consider the special example of linear regression with normal errors and the ordinary least squares estimator ϑ̂ = Σ_{i=1}^n X_i Y_i / Σ_{i=1}^n X_i². Rewriting the

Table 2
Simulated mean squared errors of estimators of EY²

Linear regression: r_ϑ(X) = ϑX (ϑ = 2)

    π(X)              n     F̂I        FI        P̂I        PI        N
    1/(1 + e^{−X})    50    0.312670  0.116360  0.310263  0.161374  0.528146
                      100   0.158512  0.055343  0.157402  0.079863  0.267601
                      1000  0.016215  0.005470  0.016189  0.008113  0.027298
    1                 50    0.174683  0.070048  0.173817  0.173817  0.173817
                      100   0.088960  0.034685  0.088455  0.088455  0.088455
                      1000  0.008630  0.003359  0.008623  0.008623  0.008623

Nonlinear regression: r_ϑ(X) = cos(ϑX) (ϑ = 2)

    π(X)              n     F̂I        FI        P̂I        PI        N
    1/(1 + e^{−X})    50    0.086350  0.087286  0.092361  0.093401  0.176124
                      100   0.042671  0.042747  0.047054  0.047219  0.092478
                      1000  0.004260  0.004179  0.005032  0.004961  0.010153
    1                 50    0.043774  0.043873  0.066100  0.066100  0.066100
                      100   0.021578  0.021574  0.035573  0.035573  0.035573
                      1000  0.002159  0.002116  0.003713  0.003713  0.003713

Notes. Here we study estimation of EY². The first two columns refer to the fully imputed estimator with the ordinary least squares estimator ϑ̂ plugged in (F̂I) and to its version using ϑ = 2 (FI). In the next two columns we consider the partially imputed estimator based on ϑ̂ (P̂I) and on ϑ = 2 (PI). In the last column the mean squared errors of n^{-1} Σ_{i=1}^n Z_i Y_i² / π(X_i) (N) are listed.

empirical estimator gives

    n^{-1} Σ_{i=1}^n Y_i² = n^{-1} Σ_{i=1}^n r_ϑ̂²(X_i) + n^{-1} Σ_{i=1}^n ε̂_i² + 2 ϑ̂ n^{-1} Σ_{i=1}^n ε̂_i X_i.

The last term vanishes for the least squares estimator ϑ̂, so that n^{-1} Σ_{i=1}^n Y_i² = n^{-1} Σ_{i=1}^n r_ϑ̂²(X_i) + n^{-1} Σ_{i=1}^n ε̂_i². Finally, by our results from Section 3, the estimators n^{-1} Σ_{i=1}^n ŵ_i ε̂_i² and n^{-1} Σ_{i=1}^n ε̂_i² of the error variance σ² are asymptotically equivalent.

In the next example we restrict our attention to linear regression, r_ϑ(X) = ϑX, ϑ = 2, and consider estimation of a more complicated expectation, namely of Eh(X, Y) = E(X e^{XY}). In contrast to the previous examples, the (weighted) fully imputed estimator cannot be reduced. The mean squared errors of this estimator and of the partially imputed estimator are given in Table 3. For each estimator we study the two cases with and without parameter estimation. Again we observe that the performance of the estimators is not much affected by the plug-in parameter estimator. Comparing the fully and the partially imputed estimators, we see that the fully imputed estimator clearly outperforms the partially imputed estimator. In addition we also calculate the simulated mean squared error of the unweighted (inefficient) version of our fully imputed estimator. The performance of this estimator turns out to lie between the fully and the partially imputed one.
In particular, the simulations in case (b), where all data are observed and where the partially imputed estimator equals the empirical estimator, confirm our theoretical observation that incorporating the information about the location of the errors, for example in the form of weights as done in this article, is important.

In order to study the behavior of the fully imputed estimator for multidimensional ϑ we again studied estimation of E(Xe^{XY}). For our simulations

Table 3
Simulated mean squared errors of estimators of E{X exp(XY)} in linear regression

  π(X)           n      FI-hat    FI        U-hat     PI-hat    PI
  1/(1+e^{-X})   50     0.32563   0.29024   0.36187   0.48164   0.47769
                 100    0.15017   0.14085   0.18147   0.24192   0.24698
                 1000   0.01384   0.0137    0.01992   0.02577   0.02703
  1              50     0.28988   0.27262   0.32220   0.58566   0.58566
                 100    0.13804   0.13413   0.16520   0.29948   0.29948
                 1000   0.01332   0.01329   0.01663   0.02997   0.02997

Notes.
We consider estimation of Eh(X,Y) = E(Xe^{XY}) in the linear regression model r_ϑ(X) = ϑX, ϑ = 2. The first two columns give the mean squared errors of the fully imputed estimator with the least squares estimator ϑ̂ plugged in (FI-hat), and of its version using ϑ = 2 (FI). The third column contains the mean squared errors of the unweighted version U-hat of FI-hat. The last two columns refer to the partially imputed estimator using ϑ̂ (PI-hat) and ϑ = 2 (PI). Note that if π(X) = 1 the partially imputed estimator again equals the empirical estimator, PI = PI-hat = $n^{-1}\sum_{i=1}^n X_i\exp(X_iY_i)$.
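To illustrate the estimators being compared, the following is a minimal Python sketch of the unweighted fully imputed and the partially imputed plug-in estimators of E(Xe^{XY}). The data-generating choices (uniform covariates on (0, 1), normal errors with standard deviation 0.25, ϑ = 2) are assumptions for the example, not the paper's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model (distributional choices are assumptions, not the
# paper's): Y = theta*X + eps with theta = 2, responses missing at
# random with P(Z = 1 | X = x) = pi(x) = 1/(1 + exp(-x)).
n, theta = 200, 2.0
x = rng.uniform(0.0, 1.0, n)
eps = rng.normal(0.0, 0.25, n)
y = theta * x + eps
z = rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-x))  # missingness indicator

def h(a, b):
    """The functional of interest: h(x, y) = x * exp(x*y)."""
    return a * np.exp(a * b)

# Least squares through the origin, using complete cases only.
theta_hat = np.sum(z * x * y) / np.sum(z * x * x)
resid_obs = (y - theta_hat * x)[z]  # residuals of the observed responses

def chi_hat(x0):
    # Plug-in estimator of chi(x0) = E{h(x0, r(x0) + eps)}: because the
    # errors are independent of the covariates, the conditional
    # expectation given X = x0 is an unconditional expectation over the
    # error distribution, estimated by averaging over observed residuals.
    return np.mean(h(x0, theta_hat * x0 + resid_obs))

chi = np.array([chi_hat(xi) for xi in x])
fi = chi.mean()                               # fully imputed (unweighted)
pi_est = np.mean(np.where(z, h(x, y), chi))   # partially imputed
```

Here `chi_hat` imputes every response (full imputation), while the partially imputed version keeps h(X_i, Y_i) whenever Y_i is observed. The weighted version studied in the article would additionally reweight the residuals to exploit the mean-zero constraint on the errors.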
Table 4
Simulated mean squared errors of estimators of E{X exp(XY)} with ϑ ∈ R^p (p = 2, 3)

  r_ϑ(X): ϑ₁X + ϑ₂U;  ϑ₁ + ϑ₂X + ϑ₃U;  ϑ₁X + ϑ₂U + ϑ₃V
  Columns: FI-hat, FI, U-hat, PI-hat, PI

Notes.
The three rows refer to two regression functions with different parametrizations: the first two rows both represent the function 2X − U, with ϑ₁ = 2, ϑ₂ = −1 in the first row and ϑ₁ = 0, ϑ₂ = 2, ϑ₃ = −1 in the second. In all cases n = 100 and π(X) = 1/(1 + e^{−X}). The covariates X, U and V are independent and uniformly distributed.

Table 5
Simulated mean squared errors of estimators of Eh(X,Y) with ϑ̂ inefficient

                        EY                  EY                  E(Xe^{XY})
                        r_ϑ(X) = cos(ϑX)    r_ϑ(X) = ϑX         r_ϑ(X) = ϑX
  π(X)           n      FI-hat    PI-hat    FI-hat    PI-hat    FI-hat    PI-hat
  1/(1+e^{-X})   50     0.03124   0.03545   0.02742   0.03868   0.50275   0.72944
                 100    0.01841   0.02057   0.01375   0.01938   0.24148   0.48759
  1              50     0.02000   0.02864   0.02689   0.05181   0.41476   0.79949
                 100    0.01016   0.01448   0.01359   0.02589   0.25796   0.63103

Notes.
We compare the fully and the partially imputed estimators of EY and E(Xe^{XY}), keeping the previous notation. Again, ϑ̂ is the least squares estimator, but now the errors are from a t-distribution with 10 degrees of freedom.

we restricted our attention to missing data and to samples of size n = 100, and considered three different regression models, which are given in Table 4. Note that the second regression function, ϑ₁ + ϑ₂X + ϑ₃U with ϑ₁ = 0, ϑ₂ = 2 and ϑ₃ = −1, equals the first one, namely 2X − U, but it involves a three-dimensional parameter. As expected, the increase in dimension impairs the performance of the fully imputed (weighted and unweighted) and of the partially imputed estimators. Note that the weighted and unweighted fully imputed estimators (FI-hat and U-hat) in the second regression model are the same: we consider the least squares estimator in a regression model with an intercept term ϑ₁. In this model the least squares estimator solves, by construction, $\sum_{j=1}^n Z_j\hat\varepsilon_j = 0$ (which implies that all weights $\hat w_j$ equal one). Again we observe that the fully imputed estimator consistently outperforms the partially imputed estimator.

We conclude this section with a small simulation study to examine the behavior of the fully imputed estimator when ϑ̂ is inefficient. The simplest setting is to choose the ordinary least squares estimator, as we did before, but in a model with non-normal errors. In Table 5 we consider estimation of the mean response and of E(Xe^{XY}) for linear and nonlinear regression, and for errors from a t-distribution. The results are similar to the previous ones: again the fully imputed estimator performs best, though not as well as when the errors are, in fact, from a normal distribution (cf. Tables 1–3). Simulations with a logistic error density turned out similarly, confirming the better performance of the imputation method. At least in these examples, with moderate sample sizes n = 50 and n = 100, the construction of ϑ̂ does not seem to be as important as the choice between the full and the partial imputation approaches.

6.3. Confidence intervals.
By Theorem 5.2 the fully imputed weighted estimator $n^{-1}\sum_{i=1}^n \hat\chi_w(X_i,\hat\vartheta)$ is asymptotically normally distributed, with asymptotic variance
$$\sigma_*^2 = E\chi^2(X,\vartheta) + (EZ)^{-1}E\bar h^2(\varepsilon) - \{1+(EZ)^{-1}\}\{Eh(X,Y)\}^2 - [E\{\varepsilon\bar h(\varepsilon)\}]^2/(\sigma^2 EZ) + D_w^\top\{E(Z\zeta\zeta^\top)\}^{-1}D_w$$
(see Theorem 5.2 for the notation). An asymptotic confidence interval for Eh(X,Y) with confidence level 1 − α is
$$\Biggl(\frac1n\sum_{i=1}^n \hat\chi_w(X_i,\hat\vartheta) - z_{\alpha/2}\sqrt{\frac{\hat\sigma_*^2}{n}},\; \frac1n\sum_{i=1}^n \hat\chi_w(X_i,\hat\vartheta) + z_{\alpha/2}\sqrt{\frac{\hat\sigma_*^2}{n}}\Biggr),$$
where $z_{\alpha/2}$ denotes the upper α/2 quantile of the standard normal distribution and $\hat\sigma_*^2$ is a consistent estimator of $\sigma_*^2$. Consider, for example, estimation of EY with r_ϑ(X) depending on a scalar parameter ϑ, which covers our previous simple examples r_ϑ(X) = ϑX and r_ϑ(X) = cos(ϑX). Here the confidence interval is $n^{-1}\sum_{i=1}^n r_{\hat\vartheta}(X_i) \pm z_{\alpha/2}(\hat\sigma_*^2/n)^{1/2}$. The asymptotic variance of $n^{-1}\sum_{i=1}^n r_{\hat\vartheta}(X_i)$ is
$$\sigma_*^2 = \operatorname{Var} r_\vartheta(X) + \frac{[E\{\dot r_\vartheta(X)\}]^2}{EZ\operatorname{Var}\{\dot r_\vartheta(X)\mid Z=1\}E\{\ell^2(\varepsilon)\}}.$$
The expectations in the formula can be estimated by empirical methods, with a consistent estimator ϑ̂ of the parameter ϑ plugged in. Consider, for example, $\operatorname{Var}\{\dot r_\vartheta(X)\mid Z=1\} = E\{\dot r_\vartheta^2(X)\mid Z=1\} - [E\{\dot r_\vartheta(X)\mid Z=1\}]^2$. The first expectation is estimated by $(\sum_{i=1}^n Z_i)^{-1}\sum_{i=1}^n Z_i\{\dot r_{\hat\vartheta}(X_i)\}^2$, and the second one analogously.

In order to confirm the theoretical results we also performed some simulation studies, generating confidence intervals for the above examples with the described estimation method. As expected, for α = 0.05 we obtained the desired coverage probability 0.95.

APPENDIX
Lemma A.1.
Suppose that Assumption 1 is satisfied. Then, for a √n-consistent estimator ϑ̂ of ϑ, the statements (3.2)–(3.4) hold.
Proof.
In order to prove (3.2)–(3.4) we first show
$$\max_{1\le i\le n}|Z_i\hat\varepsilon_i - Z_i\varepsilon_i| = o_p(1), \quad\text{(A.1)}$$
$$\sum_{i=1}^n Z_i(\hat\varepsilon_i-\varepsilon_i^*)^2 = o_p(1) \quad\text{with } \varepsilon_i^* = \varepsilon_i - \dot r_\vartheta(X_i)^\top(\hat\vartheta-\vartheta). \quad\text{(A.2)}$$
Result (A.2) immediately follows from the √n-consistency of ϑ̂ and the stochastic differentiability of r_ϑ [implication (2.1) of Assumption 1]:
$$\sum_{i=1}^n Z_i(\hat\varepsilon_i-\varepsilon_i^*)^2 = \sum_{i=1}^n Z_i[\hat\varepsilon_i - \{\varepsilon_i - \dot r_\vartheta(X_i)^\top(\hat\vartheta-\vartheta)\}]^2 \le \sum_{i=1}^n \{r_{\hat\vartheta}(X_i) - r_\vartheta(X_i) - \dot r_\vartheta(X_i)^\top(\hat\vartheta-\vartheta)\}^2 = o_p(1).$$
This gives $\max_{1\le i\le n}|Z_i(\hat\varepsilon_i-\varepsilon_i^*)| = o_p(1)$. In order to establish (A.1) it therefore suffices to show $\max_{1\le i\le n}|Z_i(\varepsilon_i^*-\varepsilon_i)| = o_p(1)$. We have
$$\max_{1\le i\le n}|Z_i(\varepsilon_i^*-\varepsilon_i)| \le \max_{1\le i\le n}|\varepsilon_i^*-\varepsilon_i| \le |\hat\vartheta-\vartheta|\cdot\max_{1\le i\le n}|\dot r_\vartheta(X_i)|.$$
Since ϑ̂ is √n-consistent we only need $n^{-1/2}\max_{1\le i\le n}|\dot r_\vartheta(X_i)| = o_p(1)$. But this holds by Owen (2001), Lemma 11.2, since the variables $|\dot r_\vartheta(X_i)|$, i = 1, ..., n, are i.i.d. and, by Assumption 1, have finite second moments. This shows $\max_{1\le i\le n}|Z_i(\varepsilon_i^*-\varepsilon_i)| = o_p(1)$.

Equation (3.2), $\max_{1\le i\le n}|Z_i\hat\varepsilon_i| = o_p(n^{1/2})$, can be seen as follows: we can bound $\max_{1\le i\le n}|Z_i\hat\varepsilon_i|$ by $\max_{1\le i\le n}|Z_i\hat\varepsilon_i - Z_i\varepsilon_i| + \max_{1\le i\le n}|Z_i\varepsilon_i|$. The first term is $o_p(1)$ by (A.1) and the second term is $o_p(n^{1/2})$ by Owen's Lemma 11.2, since the $Z_i\varepsilon_i$ are i.i.d. with finite variance. We now show (3.3), that is,
$$\frac1n\sum_{i=1}^n Z_i\hat\varepsilon_i = \frac1n\sum_{i=1}^n Z_i\varepsilon_i - EZ\,E\{\dot r_\vartheta(X)\mid Z=1\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}).$$
In view of (A.2), $n^{-1}\sum_{i=1}^n Z_i\hat\varepsilon_i = n^{-1}\sum_{i=1}^n Z_i\varepsilon_i^* + o_p(n^{-1/2})$. By the law of large numbers we obtain
$$\frac1n\sum_{i=1}^n Z_i\varepsilon_i^* = \frac1n\sum_{i=1}^n Z_i\varepsilon_i - \frac1n\sum_{i=1}^n Z_i\dot r_\vartheta(X_i)^\top(\hat\vartheta-\vartheta) = \frac1n\sum_{i=1}^n Z_i\varepsilon_i - E\{Z\dot r_\vartheta(X)\}^\top(\hat\vartheta-\vartheta) + o_p(n^{-1/2}).$$
Since $E\{Z\dot r_\vartheta(X)\} = EZ\,E\{\dot r_\vartheta(X)\mid Z=1\}$ we have established (3.3). Our last auxiliary result to prove is (3.4),
$$\frac1n\sum_{i=1}^n Z_i\hat\varepsilon_i^2 = \frac1n\sum_{i=1}^n Z_i\varepsilon_i^2 + o_p(1) = EZ\sigma^2 + o_p(1).$$
The second equality is just a consequence of the law of large numbers. To see that the first equality holds, consider
$$\frac1n\sum_{i=1}^n Z_i\hat\varepsilon_i^2 - \frac1n\sum_{i=1}^n Z_i\varepsilon_i^2 = \frac1n\sum_{i=1}^n Z_i(\hat\varepsilon_i-\varepsilon_i)^2 + 2\,\frac1n\sum_{i=1}^n Z_i(\hat\varepsilon_i-\varepsilon_i)\varepsilon_i.$$
The second term on the right-hand side is $o_p(1)$ by (A.1). To show that the first expression is $o_p(1)$ it suffices, in view of (A.2), to consider
$$\frac1n\sum_{i=1}^n Z_i(\varepsilon_i^*-\varepsilon_i)^2 = \frac1n\sum_{i=1}^n Z_i\{\dot r_\vartheta(X_i)^\top(\hat\vartheta-\vartheta)\}^2.$$
This term is $o_p(1)$ since ϑ̂ is √n-consistent and since $\dot r_\vartheta(X)$ is in $L_2(P)$. □

Acknowledgments.
Many thanks to Anton Schick for important suggestions on specifying and constructing efficient parameter estimators in Section 5, to Wolfgang Wefelmeyer for valuable discussions and advice, and to Raymond Carroll for constructive criticism of an earlier draft. Thanks also to two referees for helpful comments.

REFERENCES
Bickel, P. J. (1982). On adaptive estimation. Ann. Statist.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York. MR1623559
Chen, J., Fan, J., Li, K. H. and Zhou, H. (2006). Local quasi-likelihood estimation with data missing at random. Statist. Sinica
Chen, S. X. and Wang, D. (2009). Empirical likelihood for estimating equations with missing values. Ann. Statist.
Chen, X., Hong, H. and Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data. Ann. Statist.
Cheng, P. E. (1994). Nonparametric estimation of mean functionals with data missing at random. J. Amer. Statist. Assoc.
Forrester, J., Hooper, W., Peng, H. and Schick, A. (2003). On the construction of efficient estimators in semiparametric models. Statist. Decisions
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (1995). Bayesian Data Analysis. Chapman & Hall, London. MR1385925
Koul, H. L. and Susarla, V. (1983). Adaptive estimation in linear regression. Statist. Decisions
Liang, H., Wang, S. and Carroll, R. J. (2007). Partially linear models with missing response variables and error-prone covariates. Biometrika
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis With Missing Data, 2nd ed. Wiley, New York. MR1925014
Maity, A., Ma, Y. and Carroll, R. J. (2007). Efficient estimation of population-level summaries in general semiparametric regression models. J. Amer. Statist. Assoc.
Matloff, N. S. (1981). Use of regression functions for improved estimation of means. Biometrika
Müller, U. U. (2007). Weighted least squares estimators in possibly misspecified nonlinear regression. Metrika
Müller, U. U., Schick, A. and Wefelmeyer, W. (2005). Weighted residual-based density estimators for nonlinear autoregressive models. Statist. Sinica
Müller, U. U., Schick, A. and Wefelmeyer, W. (2006). Imputing responses that are not missing. In Probability, Statistics and Modelling in Public Health (M. Nikulin, D. Commenges and C. Huber, eds.) 350–363. Springer, New York. MR2230741
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika
Owen, A. B. (2001). Empirical Likelihood. Monographs on Statistics and Applied Probability. Chapman & Hall, London.
Qin, J. and Zhang, B. (2007). Empirical-likelihood-based inference in missing response problems and its application in observational studies. J. Roy. Statist. Soc. Ser. B
Schick, A. (1987). A note on the construction of asymptotically linear estimators. J. Statist. Plann. Inference
Schick, A. (1993). On efficient estimation in regression models. Ann. Statist. Correction: (1995) 1862–1863. MR1241276
Tamhane, A. C. (1978). Inference based on regression estimator in double sampling. Biometrika
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer, New York. MR2233926
Wang, Q. (2004). Likelihood-based imputation inference for mean functionals in the presence of missing responses. Ann. Inst. Statist. Math.
Wang, Q., Linton, O. and Härdle, W. (2004). Semiparametric regression analysis with missing response at random. J. Amer. Statist. Assoc.
Wang, Q. and Rao, J. N. K. (2001). Empirical likelihood for linear regression models under imputation for missing responses. Canad. J. Statist.
Wang, Q. and Rao, J. N. K. (2002). Empirical likelihood-based inference under imputation for missing response data. Ann. Statist.

Department of Statistics
Texas A&M University
College Station, Texas 77843-3143
USA
E-mail: [email protected]