Empirical Likelihood Weighted Estimation of Average Treatment Effects
Yuanyao Tan, Xialing Wen, Wei Liang and Ying Yan*
School of Mathematics, Sun Yat-sen University, Guangzhou, China
*Corresponding author. Email: [email protected]
Abstract
There has been growing attention on how to effectively and objectively use covariate information when the primary goal is to estimate the average treatment effect (ATE) in randomized clinical trials (RCTs). In this paper, we propose an effective weighting approach to extract covariate information based on the empirical likelihood (EL) method. The resulting two-sample empirical likelihood weighted (ELW) estimator includes two classes of weights, which are obtained from a constrained empirical likelihood estimation procedure, where the covariate information is effectively incorporated in the form of general estimating equations. Furthermore, this ELW approach separates the estimation of ATE from the analysis of the covariate-outcome relationship, which implies that our approach maintains objectivity. In theory, we show that the proposed ELW estimator is semiparametric efficient. We extend our estimator to tackle the scenarios where the outcomes are missing at random (MAR), and prove the double robustness and multiple robustness properties of our estimator. Furthermore, we derive the semiparametric efficiency bound of all regular and asymptotically linear semiparametric ATE estimators under the MAR mechanism and prove that our proposed estimator attains this bound. We conduct simulations to make comparisons with other existing estimators, which confirm the efficiency and multiple robustness property of our proposed ELW estimator. An application to the AIDS Clinical Trials Group Protocol 175 (ACTG 175) data is conducted.
Keywords:
Missing outcomes, missing at random, double robustness, multiple robustness, semiparametric efficiency bound
1 Introduction

RCTs aim to compare various treatments when the subjects are randomized into different treatment groups. The ATE is commonly used in RCTs as it measures the difference in the mean outcomes between two treatment groups. A natural estimator of ATE is the difference in the empirical average outcomes between the treatment group and the control group; it is unbiased due to randomization. When there exists a possible association between the primary outcome and the extensively collected baseline covariates in RCTs, the precision of the ATE estimator may be improved by adjusting for the effect of covariates. There exists a voluminous literature dealing with covariate adjustment [1, 2, 3, 4, 5, 6] to improve the precision of the estimator and increase statistical power. However, there is also considerable debate regarding the appropriateness of covariate adjustment [7, 8]. Concerns mainly focus on the potential bias in treatment effect estimation, which is caused by post hoc selection of covariates and by allowing investigators to go on a "fishing expedition" to find models with the most significant estimate of treatment effect. To address such concerns, a number of approaches have been proposed to maintain objectivity when adjusting for covariates in randomized trials. By utilizing semiparametric theory, Tsiatis et al. [6] proposed a systematic method to objectively incorporate covariate effects while exploiting the relationship between covariates and response outcomes, by positing two separate working regression models for the data from the two treatment groups, leading to an increase in precision. Besides, Shen et al. [9] and Williamson et al. [10] put forward two two-stage estimation procedures for covariate adjustment based on the inverse probability weighting (IPW) method.
They tried to adjust for covariates by estimating the propensity score without using outcome data to ensure objectivity.

The empirical likelihood (EL) method is also an appealing method to adjust for baseline covariates in the estimation of ATE [11, 12]. Since Owen [13] first proposed the EL method as a nonparametric likelihood procedure to construct confidence intervals for the mean and other parameters, there have been numerous advances bringing the application of EL to many research areas. We refer interested readers to Owen's 2001 monograph [14] for further details. An important work by Qin and Lawless [15] showed that the EL method can effectively incorporate side information in the form of general estimating equations (GEE) into inference through constrained maximization of the empirical likelihood function. Their work inspired some researchers to utilize EL to make covariate adjustments in RCTs and related clinical designs. Zhang [11] considered two unbiased estimating functions that automatically decouple the estimation of ATE from the regression modeling of the covariate-outcome relationship, and the resulting estimator can reach the same efficiency as the existing efficient adjusted estimators do [6]. Considering the estimation of ATE in pretest-posttest studies, Huang et al. [12] proposed an empirical likelihood-based estimation procedure that can incorporate the common baseline covariate information to improve efficiency.

When the outcome is missing for some of the observations in RCTs, great uncertainty and possible bias in the estimation of ATE may exist. Here, we mainly focus on situations with data missing at random (MAR), i.e., conditional on the covariates, the missingness does not depend on the outcomes [16]. In order to correct for the bias caused by missingness, various methods have been proposed, including the weighting methods originated by Horvitz and Thompson [17]. In the context of the pretest-posttest study with missing data, Davidian et al.
[5] studied a class of consistent semiparametric estimators for the treatment effect and identified the most efficient one based on semiparametric theory. However, the construction of the semiparametric efficient estimator depends on whether the underlying relationship between the outcome and covariates is correctly specified. This estimator can be much less efficient if the "working regression model" and the true regression model are not close to each other, especially when the dimension of the covariates is high. Recently, empirical likelihood methods have received growing attention in missing data problems for their attractive data-driven feature and nice robustness properties. Qin and Zhang [18] proposed an empirical likelihood-based approach to estimate the mean response under the MAR assumption; the resulting estimators enjoy the double robustness property, i.e., the estimator of the mean response is asymptotically unbiased if either the underlying propensity score or the underlying regression function is correctly specified. Huang et al. [12] applied the EL method to estimate the treatment effect in the pretest-posttest setting with missing data; they considered counterfactual missing data to estimate the EL weights, which was not considered by Qin and Zhang [18]. Chen et al. [19] proposed an imputation-based empirical likelihood approach to adjust for baseline information and dealt with responses in pretest-posttest studies that are missing by design. However, none of these works defines the estimator of ATE as the difference of two weighted outcomes with two separate classes of weights obtained from constrained maximization of the empirical likelihood function.

In this article, we propose a new approach to incorporate covariate information into the estimation of ATE using the EL method.
Inspired by the work of Wu and Yan [20], we construct our estimator by separately weighting the outcomes of the two samples, where the weights are estimated to carry covariate information through moment constraints which implicitly utilize the randomization inherent in RCTs. These constraints focus solely on covariates and treatment assignments, not on the outcomes. To exploit the relationship between the covariates and the outcomes, we posit two models, one for each treatment group, through parametric regression or identity functions, and then use them in the moment constraints. Therefore, we separate the modeling of the covariate-outcome relationship from the ATE estimation, making the covariate adjustment procedure objective. Also, we extend our approach to the scenarios where the outcomes are partly missing. In this case, we prove the double robustness, multiple robustness and semiparametric efficiency of our proposed estimator.

Zhang's [11] recent work focused on estimating the ATE by adding the parameter of interest and the covariate information to the estimating functions and deriving the asymptotic form of the ATE estimator using empirical likelihood theory. In contrast, we first construct the two-sample ELW estimator for ATE with the estimated weights, which are designed to carry the covariate information based on the EL method; then we discuss the asymptotic properties of the proposed estimators. Furthermore, Zhang's method did not consider the possible missingness of the outcome data and the corresponding robustness properties in this case, which we take into account in this paper.

When dealing with missing outcomes, we follow the work of Qin and Zhang [18] by adding two moment constraints to take the missing mechanism into account. However, we propose to use the combined information from the treatment group and the control group to construct the two moment constraints for the propensity scores, whereas Qin and Zhang treated the two constraints separately.
Intuitively, our estimator is more efficient. In fact, we prove that our estimator is semiparametric efficient. Furthermore, we prove that our estimator is doubly robust and multiply robust [21, 22].

In Section 2, we introduce the proposed weighted empirical likelihood estimator and show the extensions of our method to incorporate missing outcomes and enhance multiple robustness. We show the details of the practical implementation of the proposed method in Section 3. In Section 4, the performance of our method is evaluated by a series of simulations and an application to the ACTG 175 data. We draw conclusions in Section 5. Proofs are presented in the supplementary material.
2 Proposed Methodology
In Section 2.1, we describe our method in the standard RCTs where there is no missingness in the outcomes. In Section 2.2, we consider the scenario where outcomes are partly missing under the missing at random mechanism. Furthermore, we apply multiple working models to enhance robustness in the estimation, which leads to the multiple robustness property described in Section 2.3.
Consider a two-arm randomized clinical trial comparing the treatment group and the control group. Let $W$ be a binary variable with $W = 1$ if treated and $W = 0$ if controlled. Define $\delta = P(W = 1)$ to be the probability of being treated and assume $0 < \delta < 1$. Let $Y_0$ ($Y_1$) be the outcome of a subject from the control (treatment) group. We define the outcome for each subject in a unified way as $Y = WY_1 + (1 - W)Y_0$. Denote $X$ to be an $l$-dimensional vector of baseline covariates. Under randomization in the RCTs, treatment assignment and baseline covariates are independent, i.e., $W \perp\!\!\!\perp X$. Therefore, $X \mid W = 1$ and $X \mid W = 0$ have the same distribution as that of the covariate $X$ in the entire sample, i.e., $\rho_1(x) = \rho_0(x) = \rho(x)$, where we define $\rho_1(x)$ and $\rho_0(x)$ as the probability density functions of the covariate $X$ in the treatment group and the control group, respectively, and $\rho(x)$ as that of the covariate $X$ in the entire sample. The observed data of the treatment group $\{(X_{1i}, Y_{1i}), i = 1, \cdots, m\}$ are independent and identically distributed (i.i.d.). Likewise, the observed data of the control group $\{(X_{0j}, Y_{0j}), j = 1, \cdots, n\}$ are i.i.d. Let $N = m + n$ be the total size of the two samples. Denote $\mu_1 = E(Y_1)$ and $\mu_0 = E(Y_0)$. We are interested in estimating the ATE, given by $\theta = \mu_1 - \mu_0 = E(Y_1) - E(Y_0)$, from the observed data.

We introduce an empirical likelihood method to effectively incorporate covariate information when estimating the ATE. Let $f_1(x, y)$ be the joint density function of $(X, Y_1)$ and $f_0(x, y)$ be the joint density function of $(X, Y_0)$. Let $p_i = f_1(X_{1i}, Y_{1i})$ for $i = 1, \cdots, m$, and $q_j = f_0(X_{0j}, Y_{0j})$ for $j = 1, \cdots, n$, be the probability mass at the points $(X_{1i}, Y_{1i})$ and $(X_{0j}, Y_{0j})$, respectively. The nonparametric likelihood for the observed data is
\[
\prod_{i=1}^{m} p_i \prod_{j=1}^{n} q_j. \tag{1}
\]
We propose to obtain the estimators of the $p_i$'s and $q_j$'s by maximizing the likelihood (1) subject to the following constraints:
\[
\sum_{i=1}^{m} p_i = 1, \quad p_i \geq 0, \quad i = 1, \cdots, m, \tag{2}
\]
\[
\sum_{j=1}^{n} q_j = 1, \quad q_j \geq 0, \quad j = 1, \cdots, n, \tag{3}
\]
\[
\sum_{i=1}^{m} p_i\, g(X_{1i}) = \bar{g}, \tag{4}
\]
\[
\sum_{j=1}^{n} q_j\, h(X_{0j}) = \bar{h}, \tag{5}
\]
where $\bar{g} = \frac{1}{N}\{\sum_{i=1}^{m} g(X_{1i}) + \sum_{j=1}^{n} g(X_{0j})\}$ and $\bar{h} = \frac{1}{N}\{\sum_{i=1}^{m} h(X_{1i}) + \sum_{j=1}^{n} h(X_{0j})\}$. The $g(x)$ and $h(x)$ are arbitrary $r_1$-dimensional and $r_0$-dimensional functions, respectively; we take $r_1 \geq 1$ and $r_0 \geq 1$. The first two constraints (2) and (3) ensure that the $p_i$'s and $q_j$'s are empirical probabilities. The latter two constraints (4) and (5) are the empirical versions of the two equations $E\{g(X) \mid W = 1\} = E\{g(X)\}$ and $E\{h(X) \mid W = 0\} = E\{h(X)\}$, which utilize the fact that the two groups have identical baseline covariate distributions due to the randomization procedure in the RCTs. Since the constraints for the $p_i$'s do not involve any of the $q_j$'s and vice versa, we can estimate the $p_i$'s and the $q_j$'s separately as two optimization problems. Note that $g(x)$ and $h(x)$ are known functions; for instance, they can be identity functions, linear functions of the covariates, etc.

Since the above optimization problem is a strictly convex problem, there exists a unique global maximum under some mild conditions, including the convex hull condition that $\bar{g}$ and $\bar{h}$ are inside the convex hulls of $\{g(X_{1i}), i = 1, \cdots, m\}$ and $\{h(X_{0j}), j = 1, \cdots, n\}$, respectively [14]. The solutions can be obtained by using the method of Lagrange multipliers (details are shown in Section 3):
\[
\hat{p}_i = \frac{1}{m} \cdot \frac{1}{1 + \lambda_1^{\top}\{g(X_{1i}) - \bar{g}\}}, \quad i = 1, \cdots, m,
\]
\[
\hat{q}_j = \frac{1}{n} \cdot \frac{1}{1 + \lambda_0^{\top}\{h(X_{0j}) - \bar{h}\}}, \quad j = 1, \cdots, n,
\]
where $\lambda_1$ and $\lambda_0$ are the Lagrange multipliers determined by
\[
\frac{1}{m} \sum_{i=1}^{m} \frac{g(X_{1i}) - \bar{g}}{1 + \lambda_1^{\top}\{g(X_{1i}) - \bar{g}\}} = 0, \quad
\frac{1}{n} \sum_{j=1}^{n} \frac{h(X_{0j}) - \bar{h}}{1 + \lambda_0^{\top}\{h(X_{0j}) - \bar{h}\}} = 0,
\]
respectively. Our proposed two-sample empirical likelihood weighted (ELW) estimator is
\[
\hat{\theta} = \sum_{i=1}^{m} \hat{p}_i Y_{1i} - \sum_{j=1}^{n} \hat{q}_j Y_{0j},
\]
which is consistent for the ATE under suitable regularity conditions due to the following theorem.

Theorem 1. As $N \to \infty$, $m/N \to \delta > 0$ and $n/N \to 1 - \delta > 0$, $\hat{\theta}$ is a consistent estimator for $\theta$.

The regularity conditions and the proofs of Theorem 1 and the other theorems in the article are provided in the supplementary material.

Usually, we take $g(x)$ and $h(x)$ as two parametric outcome regression models $\tilde{g}(x; \beta_1)$ and $\tilde{h}(x; \beta_0)$ to approximate $\tilde{g}(x) = E(Y \mid W = 1, X = x)$ and $\tilde{h}(x) = E(Y \mid W = 0, X = x)$. Note that taking $g(x)$ and $h(x)$ as the identity functions can be seen as adding multiple moment constraints for multiple parametric outcome regression models, each of which only involves one covariate. In practice, we estimate $\beta_1$ and $\beta_0$ by their corresponding estimators $\hat{\beta}_1$ and $\hat{\beta}_0$, which are obtained by fitting the two parametric outcome regression models $\tilde{g}(x; \beta_1)$ and $\tilde{h}(x; \beta_0)$ separately using the least squares method. According to White [23], under suitable regularity conditions, $\hat{\beta}_1 \to \beta_1^*$ and $\hat{\beta}_0 \to \beta_0^*$ in probability as $N \to \infty$, where the asterisks denote the values of the parameters that minimize the Kullback-Leibler distance from the probability distribution based on the postulated model to the true one that generates the data. Generally, $\tilde{g}(x; \beta_1^*) \neq E(Y \mid W = 1, X = x)$ unless $\tilde{g}(x; \beta_1)$ is correctly specified, and $\tilde{h}(x; \beta_0^*) \neq E(Y \mid W = 0, X = x)$ unless $\tilde{h}(x; \beta_0)$ is correctly specified. In addition, we have $\bar{g}(\hat{\beta}_1) \to E\{\tilde{g}(X; \beta_1^*)\}$ and $\bar{h}(\hat{\beta}_0) \to E\{\tilde{h}(X; \beta_0^*)\}$ in probability as $N \to \infty$.
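To make the construction concrete, the following is a small self-contained numerical sketch (our own illustration, not code from the paper) of the ELW weights and the estimator $\hat{\theta}$, taking $g$ and $h$ to be the identity function and solving for the Lagrange multipliers with a damped Newton search; all function and variable names are ours, and the data-generating values are arbitrary.

```python
import numpy as np

def elw_weights(G, target, tol=1e-10, max_iter=200):
    """EL weights w_i = (1/m) / (1 + lam'{G_i - target}) satisfying
    sum_i w_i = 1 and sum_i w_i G_i = target (damped Newton search for lam).
    G: (m, r) matrix of constraint functions evaluated at one sample."""
    m, r = G.shape
    U = G - target                         # hat{U}_i = g(X_i) - gbar
    lam = np.zeros(r)
    for _ in range(max_iter):
        d = 1.0 + U @ lam
        grad = (U / d[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(U / d[:, None] ** 2).T @ U / m
        step = np.linalg.solve(hess, grad)
        gamma = 1.0
        while np.any(1.0 + U @ (lam - gamma * step) <= 0):
            gamma /= 2.0                    # damp to keep all weights positive
        lam = lam - gamma * step
    return 1.0 / (m * (1.0 + U @ lam))

rng = np.random.default_rng(0)
N, delta = 400, 0.5
X = rng.normal(size=N)
W = rng.binomial(1, delta, size=N)
Y = 2.0 * X + 1.5 * W + rng.normal(size=N)   # simulated trial; true ATE is 1.5
gbar = np.array([X.mean()])                   # pooled mean of g(x) = x over both arms
p = elw_weights(X[W == 1][:, None], gbar)     # weights for the treatment arm
q = elw_weights(X[W == 0][:, None], gbar)     # weights for the control arm
theta_hat = np.sum(p * Y[W == 1]) - np.sum(q * Y[W == 0])
```

At the solution, the weights automatically sum to one: $\sum_i \hat{p}_i (1 + \lambda_1^{\top}\hat{U}_i) = 1$ together with $\sum_i \hat{p}_i \hat{U}_i = 0$ forces $\sum_i \hat{p}_i = 1$, so only the moment constraint needs an explicit multiplier.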
Here, we set $g(x)$ and $h(x)$ in (4) and (5) to be $g(x) = \tilde{g}(x; \hat{\beta}_1)$ and $h(x) = \tilde{h}(x; \hat{\beta}_0)$. The following theorem gives the asymptotic distribution of $\hat{\theta}$ in this case.

Theorem 2. As $N \to \infty$, $N^{1/2}(\hat{\theta} - \theta)$ follows an asymptotically normal distribution with mean $0$ and variance $\mathrm{var}\{\varphi(Y, X, W)\}$, with the influence function
\[
\varphi(Y, X, W) = \frac{W}{\delta}(Y - \mu_1) - \frac{W - \delta}{\delta} C_1^{\top} D_1^{-1}\left[\tilde{g}(X; \beta_1^*) - E\{\tilde{g}(X; \beta_1^*)\}\right] - \frac{1 - W}{1 - \delta}(Y - \mu_0) + \frac{W - \delta}{1 - \delta} C_0^{\top} D_0^{-1}\left[\tilde{h}(X; \beta_0^*) - E\{\tilde{h}(X; \beta_0^*)\}\right],
\]
where
\[
C_1 = E\left(\frac{W}{\delta}(Y - \mu_1)\left[\tilde{g}(X; \beta_1^*) - E\{\tilde{g}(X; \beta_1^*)\}\right]\right), \quad
C_0 = E\left(-\frac{1 - W}{1 - \delta}(Y - \mu_0)\left[\tilde{h}(X; \beta_0^*) - E\{\tilde{h}(X; \beta_0^*)\}\right]\right),
\]
\[
D_1 = E\left(\left[\tilde{g}(X; \beta_1^*) - E\{\tilde{g}(X; \beta_1^*)\}\right]^{\otimes 2}\right), \quad
D_0 = E\left(\left[\tilde{h}(X; \beta_0^*) - E\{\tilde{h}(X; \beta_0^*)\}\right]^{\otimes 2}\right).
\]

When $\tilde{g}(x; \beta_1)$ and $\tilde{h}(x; \beta_0)$ are correctly specified, namely $\tilde{g}(x; \beta_1^*) = \tilde{g}(x) = E(Y \mid W = 1, X = x)$ and $\tilde{h}(x; \beta_0^*) = \tilde{h}(x) = E(Y \mid W = 0, X = x)$, we have
\[
\varphi_{\mathrm{opt}}(Y, X, W) = \frac{W}{\delta}(Y - \mu_1) - \frac{W - \delta}{\delta}\{E(Y \mid X, W = 1) - \mu_1\} - \frac{1 - W}{1 - \delta}(Y - \mu_0) - \frac{W - \delta}{1 - \delta}\{E(Y \mid X, W = 0) - \mu_0\},
\]
which is the efficient influence function for regular and asymptotically linear (RAL) estimators of $\theta$ in RCTs described by Tsiatis et al. [24]. In this case, $\mathrm{var}\{\varphi_{\mathrm{opt}}(Y, X, W)\}$ is the asymptotic variance of $N^{1/2}(\hat{\theta} - \theta)$, which equals the semiparametric efficiency bound. This observation leads to the following theorem on the efficiency of $\hat{\theta}$.

Theorem 3. When $\tilde{g}(x; \beta_1)$ is correctly specified for $E(Y \mid W = 1, X = x)$ and $\tilde{h}(x; \beta_0)$ is correctly specified for $E(Y \mid W = 0, X = x)$, the asymptotic variance of $\hat{\theta}$ attains the semiparametric efficiency bound.

According to Theorem 2, $\hat{\theta}$ is still consistent even if $\tilde{g}(x; \beta_1)$ and $\tilde{h}(x; \beta_0)$ are not correctly specified, but it is then not semiparametric efficient due to Theorem 3. Details of the proofs are shown in the supplementary material.

In this section, we follow the work of Qin and Zhang [18] to take missing outcomes into account. However, their work tackled the one-sample case; we extend their work to the RCT data and take randomization into account when we construct our empirical likelihood estimator. Suppose $Y$ is missing for some subjects, and the baseline covariates $X$ are always observed. Let $R_1$ ($R_0$) be the missing indicator for the treatment group (control group) that takes value $0$ if $Y_1$ ($Y_0$) is missing and $1$ otherwise. The observed data are $\{(R_{1i}Y_{1i}, R_{1i}, X_{1i}), i = 1, \cdots, m; (R_{0j}Y_{0j}, R_{0j}, X_{0j}), j = 1, \cdots, n\}$. We reformulate the data into a two-sample setting as: $\{(Y_{1i}, X_{1i}); i = 1, \cdots, m_1\}$, where $Y_{1i}$ is observed in the treatment group; $\{(?, X_{1i}); i = m_1 + 1, \cdots, m\}$, where $Y_{1i}$ is missing in the treatment group; $\{(Y_{0j}, X_{0j}); j = 1, \cdots, n_1\}$, where $Y_{0j}$ is observed in the control group; and $\{(?, X_{0j}); j = n_1 + 1, \cdots, n\}$, where $Y_{0j}$ is missing in the control group. For unified notation, we define $R = WR_1 + (1 - W)R_0$ and $Y = WRY_1 + (1 - W)RY_0$. Then the observed data can be written as $\{(Y_k, X_k, R_k, W_k), k = 1, \cdots, N\}$. We impose the common MAR mechanism [16]; that is, $P(R = 1 \mid Y, X, W) = P(R = 1 \mid X, W)$. Denote the response probabilities for the treatment and control groups as $\pi_1(x) = P(R = 1 \mid W = 1, X = x)$ and $\pi_0(x) = P(R = 1 \mid W = 0, X = x)$, respectively. We specify $\pi_1(x; \alpha_1)$ as a parametric model to approximate $\pi_1(x)$; likewise, $\pi_0(x; \alpha_0)$ is a parametric model to approximate $\pi_0(x)$.
The $\alpha_1$ and $\alpha_0$ are unknown vector parameters. In practice, we usually model the propensity scores $\pi_w(x)$, $w = 0,$
$1$, with logistic regression models.

Our interest is still to estimate $\theta = E(Y_1) - E(Y_0)$ in the presence of missingness in the outcomes. Here our proposed estimator is $\hat{\theta}_{\mathrm{mis}} = \sum_{i=1}^{m_1} \hat{p}_i Y_{1i} - \sum_{j=1}^{n_1} \hat{q}_j Y_{0j}$, where the $\hat{p}_i$'s and $\hat{q}_j$'s are obtained by maximizing the following nonparametric likelihood
\[
\prod_{i=1}^{m_1} p_i \prod_{j=1}^{n_1} q_j \tag{6}
\]
subject to
\[
\sum_{i=1}^{m_1} p_i = 1, \quad p_i \geq 0, \quad i = 1, \cdots, m_1,
\]
\[
\sum_{j=1}^{n_1} q_j = 1, \quad q_j \geq 0, \quad j = 1, \cdots, n_1,
\]
\[
\sum_{i=1}^{m_1} p_i\, \pi_1(X_{1i}; \hat{\alpha}_1) = \bar{\pi}_1, \tag{7}
\]
\[
\sum_{j=1}^{n_1} q_j\, \pi_0(X_{0j}; \hat{\alpha}_0) = \bar{\pi}_0, \tag{8}
\]
\[
\sum_{i=1}^{m_1} p_i\, g(X_{1i}) = \bar{g}, \tag{9}
\]
\[
\sum_{j=1}^{n_1} q_j\, h(X_{0j}) = \bar{h}, \tag{10}
\]
where $\bar{\pi}_1 = \frac{1}{N}\{\sum_{i=1}^{m} \pi_1(X_{1i}; \hat{\alpha}_1) + \sum_{j=1}^{n} \pi_1(X_{0j}; \hat{\alpha}_1)\}$ and $\bar{\pi}_0 = \frac{1}{N}\{\sum_{i=1}^{m} \pi_0(X_{1i}; \hat{\alpha}_0) + \sum_{j=1}^{n} \pi_0(X_{0j}; \hat{\alpha}_0)\}$. The first two constraints guarantee that the $p_i$'s and $q_j$'s are empirical probabilities. The constraints (7) and (8) reflect the selection bias according to Qin and Zhang [18]. Similarly, the latter two constraints (9) and (10) utilize covariate information through the functions $g(x)$ and $h(x)$. As described in Section 2.1, we set $g(x) = \tilde{g}(x; \hat{\beta}_1)$ and $h(x) = \tilde{h}(x; \hat{\beta}_0)$, which are the parametric estimates of $\tilde{g}(x) = E(Y \mid W = 1, X = x)$ and $\tilde{h}(x) = E(Y \mid W = 0, X = x)$, respectively. Here, we have the following result on the consistency of $\hat{\theta}_{\mathrm{mis}}$.

Theorem 4. $\hat{\theta}_{\mathrm{mis}}$ is consistent for $\theta$ as $N \to \infty$ if both of the following conditions are satisfied: i) either $\pi_1(x; \alpha_1)$ is correctly specified for $\pi_1(x)$ or $\tilde{g}(x; \beta_1)$ is correctly specified for $E(Y \mid W = 1, X = x)$; ii) either $\pi_0(x; \alpha_0)$ is correctly specified for $\pi_0(x)$ or $\tilde{h}(x; \beta_0)$ is correctly specified for $E(Y \mid W = 0, X = x)$.

The property indicated by Theorem 4 is known as double robustness [25]. Since double robustness is a special case of the multiple robustness that we discuss in Section 2.3, the proof of double robustness is given in the supplementary material, where we prove multiple robustness.
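As a concrete, hedged illustration of how the constraints (7) and (9) operate, the sketch below (our own code; the data-generating values are arbitrary, and a known-form logistic propensity stands in for the fitted $\pi_1(x; \hat{\alpha}_1)$) calibrates the respondents of the treatment arm to the pooled sample and estimates $\mu_1$:

```python
import numpy as np

def el_weights(U, tol=1e-10, max_iter=200):
    """Weights w_i = (1/m)/(1 + lam'U_i) with sum_i w_i U_i = 0 (damped Newton)."""
    m, r = U.shape
    lam = np.zeros(r)
    for _ in range(max_iter):
        d = 1.0 + U @ lam
        grad = (U / d[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(U / d[:, None] ** 2).T @ U / m
        step = np.linalg.solve(hess, grad)
        gamma = 1.0
        while np.any(1.0 + U @ (lam - gamma * step) <= 0):
            gamma /= 2.0                     # damp to keep all weights positive
        lam = lam - gamma * step
    return 1.0 / (m * (1.0 + U @ lam))

rng = np.random.default_rng(1)
N, delta = 2000, 0.5
X = rng.normal(size=N)
W = rng.binomial(1, delta, size=N)
Y = 1.0 + 2.0 * X + W + rng.normal(size=N)       # so that mu_1 = E(Y_1) = 2
pi1 = 1.0 / (1.0 + np.exp(-(0.5 + 0.8 * X)))     # stand-in for pi_1(x; alpha_1-hat)
R = rng.binomial(1, np.where(W == 1, pi1, 0.9))  # MAR response indicator
pibar1 = pi1.mean()                               # pooled average over both arms
gbar = X.mean()                                   # pooled average of g(x) = x
obs1 = (W == 1) & (R == 1)                        # respondents in the treatment arm
U1 = np.column_stack([pi1[obs1] - pibar1,         # constraint (7)
                      X[obs1] - gbar])            # constraint (9)
p = el_weights(U1)
mu1_hat = np.sum(p * Y[obs1])                     # ELW estimate of mu_1
```

Because the respondents over-represent high-propensity subjects, their unweighted mean of $Y$ is biased; the calibration to the pooled averages $\bar{\pi}_1$ and $\bar{g}$ removes this selection bias.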
Furthermore, $\hat{\theta}_{\mathrm{mis}}$ is asymptotically normally distributed if both $\pi_1(x; \alpha_1)$ and $\pi_0(x; \alpha_0)$ are correctly specified. The asymptotic distribution of $\hat{\theta}_{\mathrm{mis}}$ is shown in the next section, where we describe our method in a more general way by allowing multiple models for each of $\pi_1(x)$, $\pi_0(x)$, $E(Y \mid W = 1, X = x)$ and $E(Y \mid W = 0, X = x)$, rather than only one model for each.

For comparison, we consider an alternative estimator for $\theta$, which is $\tilde{\theta}_{\mathrm{qz}} = \sum_{i=1}^{m_1} \tilde{p}_i Y_{1i} - \sum_{j=1}^{n_1} \tilde{q}_j Y_{0j}$, where the $\tilde{p}_i$'s and $\tilde{q}_j$'s are obtained from the same optimization problem mentioned above except that the constraints (7) - (10) are replaced by
\[
\sum_{i=1}^{m_1} p_i\, \pi_1(X_{1i}; \hat{\alpha}_1) = \frac{1}{m} \sum_{i=1}^{m} \pi_1(X_{1i}; \hat{\alpha}_1), \tag{11}
\]
\[
\sum_{j=1}^{n_1} q_j\, \pi_0(X_{0j}; \hat{\alpha}_0) = \frac{1}{n} \sum_{j=1}^{n} \pi_0(X_{0j}; \hat{\alpha}_0), \tag{12}
\]
\[
\sum_{i=1}^{m_1} p_i\, g(X_{1i}) = \frac{1}{m} \sum_{i=1}^{m} g(X_{1i}), \tag{13}
\]
\[
\sum_{j=1}^{n_1} q_j\, h(X_{0j}) = \frac{1}{n} \sum_{j=1}^{n} h(X_{0j}). \tag{14}
\]
Clearly, $\tilde{\theta}_{\mathrm{qz}}$ is obtained by directly applying the method proposed in Qin and Zhang [18], which was originally designed for the one-sample case. Since our method considers the randomization procedure for the two samples in each constraint of (7) - (10), while each one of (11) - (14) only focuses on the information from one of the two samples, $\hat{\theta}_{\mathrm{mis}}$ is intuitively more efficient, which is confirmed by our simulation results in Section 4.

2.3 Multiple robustness

Following the work of Han and Wang [21] and Han [26], we postulate multiple working parametric models $\mathcal{P}_1 = \{\pi_1^c(x; \alpha_1^c); c = 1, \ldots, C\}$ for $\pi_1(x)$, $\mathcal{P}_0 = \{\pi_0^d(x; \alpha_0^d); d = 1, \ldots, D\}$ for $\pi_0(x)$, $\mathcal{G} = \{\tilde{g}^e(x; \beta_1^e); e = 1, \ldots, E\}$ for $E(Y \mid W = 1, X = x)$ and $\mathcal{H} = \{\tilde{h}^f(x; \beta_0^f); f = 1, \ldots, F\}$ for $E(Y \mid W = 0, X = x)$. The $\alpha_1^c$, $\alpha_0^d$, $\beta_1^e$, $\beta_0^f$ are the corresponding parameters, and their estimators are denoted as $\hat{\alpha}_1^c$, $\hat{\alpha}_0^d$, $\hat{\beta}_1^e$, $\hat{\beta}_0^f$.
Usually, the estimators of the parameters in the postulated propensity score models, i.e., $\hat{\alpha}_1^c$ for $c = 1, \cdots, C$ and $\hat{\alpha}_0^d$ for $d = 1, \cdots, D$, are taken to be the maximizers of the corresponding binomial likelihoods
\[
\prod_{i=1}^{m} \{\pi_1^c(\alpha_1^c; X_{1i})\}^{R_{1i}} \{1 - \pi_1^c(\alpha_1^c; X_{1i})\}^{1 - R_{1i}}, \quad c = 1, \ldots, C, \tag{15}
\]
\[
\prod_{j=1}^{n} \{\pi_0^d(\alpha_0^d; X_{0j})\}^{R_{0j}} \{1 - \pi_0^d(\alpha_0^d; X_{0j})\}^{1 - R_{0j}}, \quad d = 1, \ldots, D. \tag{16}
\]
Our proposed estimator is $\hat{\theta}_{\mathrm{mr}} = \sum_{i=1}^{m_1} \hat{p}_i Y_{1i} - \sum_{j=1}^{n_1} \hat{q}_j Y_{0j}$, with the estimated weights $\{(\hat{p}_i, \hat{q}_j); i = 1, \cdots, m_1, j = 1, \cdots, n_1\}$ obtained by maximizing the empirical likelihood $\prod_{i=1}^{m_1} p_i \prod_{j=1}^{n_1} q_j$ in (6) with the same constraints except that (7) - (10) are changed to
\[
\sum_{i=1}^{m_1} p_i\, \pi_1^c(X_{1i}; \hat{\alpha}_1^c) = \bar{\pi}_1^c \quad (c = 1, \cdots, C), \tag{17}
\]
\[
\sum_{j=1}^{n_1} q_j\, \pi_0^d(X_{0j}; \hat{\alpha}_0^d) = \bar{\pi}_0^d \quad (d = 1, \cdots, D), \tag{18}
\]
\[
\sum_{i=1}^{m_1} p_i\, \tilde{g}^e(X_{1i}; \hat{\beta}_1^e) = \bar{g}^e \quad (e = 1, \cdots, E), \tag{19}
\]
\[
\sum_{j=1}^{n_1} q_j\, \tilde{h}^f(X_{0j}; \hat{\beta}_0^f) = \bar{h}^f \quad (f = 1, \cdots, F), \tag{20}
\]
where $\bar{\pi}_1^c = \frac{1}{N}\{\sum_{i=1}^{m} \pi_1^c(X_{1i}; \hat{\alpha}_1^c) + \sum_{j=1}^{n} \pi_1^c(X_{0j}; \hat{\alpha}_1^c)\}$, $\bar{\pi}_0^d = \frac{1}{N}\{\sum_{i=1}^{m} \pi_0^d(X_{1i}; \hat{\alpha}_0^d) + \sum_{j=1}^{n} \pi_0^d(X_{0j}; \hat{\alpha}_0^d)\}$, $\bar{g}^e = \frac{1}{N}\{\sum_{i=1}^{m} \tilde{g}^e(X_{1i}; \hat{\beta}_1^e) + \sum_{j=1}^{n} \tilde{g}^e(X_{0j}; \hat{\beta}_1^e)\}$ and $\bar{h}^f = \frac{1}{N}\{\sum_{i=1}^{m} \tilde{h}^f(X_{1i}; \hat{\beta}_0^f) + \sum_{j=1}^{n} \tilde{h}^f(X_{0j}; \hat{\beta}_0^f)\}$, with $c = 1, \cdots, C$, $d = 1, \cdots, D$, $e = 1, \cdots, E$, $f = 1, \cdots, F$. The first two constraints ensure that the $p_i$'s and $q_j$'s are empirical probabilities, as mentioned in Section 2.2. The latter four constraints calibrate the weighted average of each postulated parametric function, which is evaluated on one biased sample with missing outcomes, to the corresponding empirical average over the two entire samples, which consistently estimates the population mean. Unlike the setting in Section 2.2, there is more than one postulated model for each of $\pi_1(x)$, $\pi_0(x)$, $E(Y \mid W = 1, X = x)$ and $E(Y \mid W = 0, X = x)$ to incorporate information from the covariates.
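The stacked calibration constraints (17) and (19) for the treated respondents can be assembled into a single constraint matrix with one column per postulated model. The sketch below is our own illustration; the helper `build_constraints` and the candidate model functions are hypothetical stand-ins for fitted $\pi_1^c(x; \hat{\alpha}_1^c)$ and $\tilde{g}^e(x; \hat{\beta}_1^e)$:

```python
import numpy as np

def build_constraints(X_resp, X_pool, prop_models, reg_models):
    """One column per postulated model: pi_1^c(X_i) - pibar_1^c and
    g^e(X_i) - gbar^e, evaluated at the treated respondents X_resp and
    centered at pooled averages over X_pool (both arms combined)."""
    cols = [f(X_resp) - f(X_pool).mean() for f in prop_models + reg_models]
    return np.column_stack(cols)

# two candidate propensity models and two candidate outcome models (illustrative)
props = [lambda x: 1.0 / (1.0 + np.exp(-x)),
         lambda x: 1.0 / (1.0 + np.exp(-0.5 * x))]
regs = [lambda x: x, lambda x: x ** 2]

rng = np.random.default_rng(4)
X1, X0 = rng.normal(size=120), rng.normal(size=80)   # treated / control covariates
X_pool = np.concatenate([X1, X0])                    # pooled sample of size N
U = build_constraints(X1[:90], X_pool, props, regs)  # first 90 treated units respond
```

The resulting matrix has $C + E$ columns; by construction, evaluating the same columns on the pooled sample gives exactly mean zero, which is the population version of the calibration.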
In this case, we have the following theorem on the consistency of $\hat{\theta}_{\mathrm{mr}}$.

Theorem 5. $\hat{\theta}_{\mathrm{mr}}$ is consistent for $\theta$ as $N \to \infty$ if the following two conditions are satisfied: i) $\mathcal{P}_1$ contains a correctly specified model for $\pi_1(x)$ or $\mathcal{G}$ contains a correctly specified model for $E(Y \mid W = 1, X = x)$; ii) $\mathcal{P}_0$ contains a correctly specified model for $\pi_0(x)$ or $\mathcal{H}$ contains a correctly specified model for $E(Y \mid W = 0, X = x)$.

Therefore, $\hat{\theta}_{\mathrm{mr}}$ is a multiply robust estimator of $\theta$. Next, we introduce the asymptotic distribution and efficiency of $\hat{\theta}_{\mathrm{mr}}$. The following theorem gives the asymptotic distribution of $\hat{\theta}_{\mathrm{mr}}$.

Theorem 6.
When $\pi_1(x; \alpha_1)$ is a correctly specified model for $\pi_1(x)$ and $\pi_0(x; \alpha_0)$ is a correctly specified model for $\pi_0(x)$, $N^{1/2}(\hat{\theta}_{\mathrm{mr}} - \theta)$ is asymptotically normally distributed with mean $0$ and variance $\mathrm{var}\{\varphi(Y, X, R, W)\}$, with the influence function for $\hat{\theta}_{\mathrm{mr}}$
\[
\varphi(Y, X, R, W) = Z_1 - Z_0 - \frac{W}{\delta} E(Z_1 S_1^{\top}) \{E(S_1^{\otimes 2})\}^{-1} S_1 - \frac{1 - W}{1 - \delta} E(Z_0 S_0^{\top}) \{E(S_0^{\otimes 2})\}^{-1} S_0,
\]
where
\[
Z_1 = \frac{W R}{\delta \pi_1(X)}(Y - \mu_1) - \frac{W R - \delta \pi_1(X)}{\delta \pi_1(X)} L_1^{\top} G_1^{-1} U_1(X),
\]
\[
Z_0 = \frac{(1 - W) R}{(1 - \delta) \pi_0(X)}(Y - \mu_0) - \frac{(1 - W) R - (1 - \delta) \pi_0(X)}{(1 - \delta) \pi_0(X)} L_0^{\top} G_0^{-1} U_0(X),
\]
\[
S_1(X, R, \alpha_1^*) = \frac{R - \pi_1(\alpha_1^*; X)}{\pi_1(\alpha_1^*; X)\{1 - \pi_1(\alpha_1^*; X)\}} \frac{\partial \pi_1(\alpha_1^*; X)}{\partial \alpha_1},
\]
\[
S_0(X, R, \alpha_0^*) = \frac{R - \pi_0(\alpha_0^*; X)}{\pi_0(\alpha_0^*; X)\{1 - \pi_0(\alpha_0^*; X)\}} \frac{\partial \pi_0(\alpha_0^*; X)}{\partial \alpha_0}.
\]
Here, $S_1(X, R, \alpha_1^*)$ and $S_0(X, R, \alpha_0^*)$ are the corresponding score functions of the binomial likelihoods in (15) and (16), respectively.

To show that our proposed estimator $\hat{\theta}_{\mathrm{mr}}$ attains the semiparametric efficiency bound, we derive the semiparametric efficiency bound for ATE estimators in RCTs with missing outcomes, which is given by the following theorem.

Theorem 7.
The efficient influence function for the RAL estimators of $\theta$ in RCTs with missing outcomes is given by
\[
\varphi_{\mathrm{opt}}(Y, X, R, W) = \frac{W R}{\delta \pi_1(X)}\{Y - E(Y \mid W = 1, X)\} - \frac{(1 - W) R}{(1 - \delta) \pi_0(X)}\{Y - E(Y \mid W = 0, X)\} + E(Y \mid W = 1, X) - E(Y \mid W = 0, X) - \theta,
\]
which leads to the semiparametric efficiency bound $\mathrm{var}\{\varphi_{\mathrm{opt}}(Y, X, R, W)\}$.

Following the techniques used in Han and Wang [21], we prove that the asymptotic variance $\mathrm{var}\{\varphi(Y, X, R, W)\}$ in Theorem 6 can reach the semiparametric efficiency bound defined in Theorem 7, which leads to the following result on the efficiency of $\hat{\theta}_{\mathrm{mr}}$ (proofs are given in the supplementary material).

Theorem 8.
When $\mathcal{P}_1$ contains a correctly specified model for $\pi_1(x)$, $\mathcal{P}_0$ contains a correctly specified model for $\pi_0(x)$, $\mathcal{G}$ contains a correctly specified model for $E(Y \mid W = 1, X = x)$ and $\mathcal{H}$ contains a correctly specified model for $E(Y \mid W = 0, X = x)$, the asymptotic variance of $\hat{\theta}_{\mathrm{mr}}$ attains the semiparametric efficiency bound.

For comparison, the alternative estimator, which is based on the work of Han and Wang [21], is denoted as $\tilde{\theta}_{\mathrm{hw}} = \sum_{i=1}^{m_1} \tilde{p}_i Y_{1i} - \sum_{j=1}^{n_1} \tilde{q}_j Y_{0j}$, where the $\tilde{p}_i$'s and $\tilde{q}_j$'s are obtained from the same optimization problem with the same constraints as in Section 2.2 except that (17) - (20) are replaced by
\[
\sum_{i=1}^{m_1} p_i\, \pi_1^c(X_{1i}; \hat{\alpha}_1^c) = \frac{1}{m} \sum_{i=1}^{m} \pi_1^c(X_{1i}; \hat{\alpha}_1^c) \quad (c = 1, \cdots, C), \tag{21}
\]
\[
\sum_{j=1}^{n_1} q_j\, \pi_0^d(X_{0j}; \hat{\alpha}_0^d) = \frac{1}{n} \sum_{j=1}^{n} \pi_0^d(X_{0j}; \hat{\alpha}_0^d) \quad (d = 1, \cdots, D), \tag{22}
\]
\[
\sum_{i=1}^{m_1} p_i\, \tilde{g}^e(X_{1i}; \hat{\beta}_1^e) = \frac{1}{m} \sum_{i=1}^{m} \tilde{g}^e(X_{1i}; \hat{\beta}_1^e) \quad (e = 1, \cdots, E), \tag{23}
\]
\[
\sum_{j=1}^{n_1} q_j\, \tilde{h}^f(X_{0j}; \hat{\beta}_0^f) = \frac{1}{n} \sum_{j=1}^{n} \tilde{h}^f(X_{0j}; \hat{\beta}_0^f) \quad (f = 1, \cdots, F). \tag{24}
\]
As mentioned in Section 2.2, our method takes the randomization of the two samples into account, as indicated by each of the constraints (17) - (20), while each of the constraints (21) - (24) only involves one of the two samples. Therefore, $\hat{\theta}_{\mathrm{mr}}$ is intuitively more efficient than $\tilde{\theta}_{\mathrm{hw}}$, which is confirmed by our simulation results in Section 4.

In Section 3.1, we introduce the computational details of solving the aforementioned optimization problems to obtain our proposed estimators, based on data with or without missing outcomes. Besides, we illustrate how to tackle the convex hull constraint problem in Section 3.2.

3.1 Numerical implementation
As mentioned in Section 2, the proposed optimization problem can actually be split into two optimization problems to estimate the $p_i$'s and $q_j$'s separately. We now demonstrate the method to estimate the $p_i$'s; the estimation of the $q_j$'s follows the same procedure. We only need to maximize
\[
\prod_{i=1}^{m} p_i, \tag{25}
\]
subject to (2) and (4). To simplify the notation, we write $\hat{U}_i = g(X_{1i}) - \bar{g}$. Applying the standard Lagrange multiplier method, the solution for $p_i$ can be written as
\[
\hat{p}_i = \frac{1}{m} \cdot \frac{1}{1 + \hat{\lambda}^{\top} \hat{U}_i}, \tag{26}
\]
where $\hat{\lambda}$ is the $r_1$-dimensional vector of Lagrange multipliers satisfying
\[
\frac{1}{m} \sum_{i=1}^{m} \frac{\hat{U}_i}{1 + \lambda^{\top} \hat{U}_i} = 0. \tag{27}
\]
In order to search for the solution of $\lambda$, we define
\[
\tilde{l}(\lambda) = \sum_{i=1}^{m} \log\left(1 + \lambda^{\top} \hat{U}_i\right) \tag{28}
\]
as our objective function over $\lambda$, which is strictly concave. The maximum point $\hat{\lambda}$ of (28) satisfies (27), and the $\hat{p}_i$ given by (26) then satisfies (2). Note that the existence of the solution of $\lambda$ requires some conditions, including the convex hull constraint that the convex hull of $\{\hat{U}_i\}_{i=1}^{m}$ contains the zero point. Here, we use a modified Newton-Raphson algorithm to carry out the numerical search for $\lambda$, which is similar to the method discussed by Chen et al. [27].

Algorithm 1: Modified Newton-Raphson algorithm
Step 0. Let $\lambda_1^{(0)} = 0$. Set $t = 0$, $\gamma^{(0)} = 1$ and a small convergence tolerance $\varepsilon$.
Step 1. Calculate $\Delta_1(\lambda_1^{(t)}) = \partial \tilde{l}/\partial \lambda$ and $\Delta_2(\lambda_1^{(t)}) = \{\partial^2 \tilde{l}/(\partial \lambda \partial \lambda^{\top})\}^{-1} \Delta_1(\lambda_1^{(t)})$; that is,
\[
\Delta_1(\lambda) = \sum_{i=1}^{m} \frac{\hat{U}_i}{1 + \lambda^{\top} \hat{U}_i}, \quad
\Delta_2(\lambda) = \left\{-\sum_{i=1}^{m} \frac{\hat{U}_i \hat{U}_i^{\top}}{(1 + \lambda^{\top} \hat{U}_i)^2}\right\}^{-1} \Delta_1(\lambda).
\]
If $\|\Delta_1(\lambda_1^{(t)})\| < \varepsilon$, stop the algorithm and report $\lambda_1^{(t)}$; otherwise go to Step 2.
Step 2. Calculate $\delta^{(t)} = \gamma^{(t)} \Delta_2(\lambda_1^{(t)})$. If $1 + (\lambda_1^{(t)} - \delta^{(t)})^{\top} \hat{U}_i \leq 0$ for some $i$, or $\tilde{l}(\lambda_1^{(t)} - \delta^{(t)}) < \tilde{l}(\lambda_1^{(t)})$, let $\gamma^{(t)} = \gamma^{(t)}/2$ and repeat Step 2; otherwise set $\lambda_1^{(t+1)} = \lambda_1^{(t)} - \delta^{(t)}$, $t = t + 1$ and $\gamma^{(t+1)} = (t + 1)^{-1/2}$. Go to Step 1.

Similarly, to obtain the $\hat{p}_i$'s by solving the optimization problems described in Sections 2.2 and 2.3, we only have to take $m = m_1$ in (25) and define $\hat{U}_i = \{\pi_1(X_{1i}; \hat{\alpha}_1) - \bar{\pi}_1, g(X_{1i}; \hat{\beta}_1) - \bar{g}\}^{\top}$ or $\hat{U}_i = \{\pi_1^1(X_{1i}; \hat{\alpha}_1^1) - \bar{\pi}_1^1, \ldots, \pi_1^C(X_{1i}; \hat{\alpha}_1^C) - \bar{\pi}_1^C, \tilde{g}^1(X_{1i}; \hat{\beta}_1^1) - \bar{g}^1, \ldots, \tilde{g}^E(X_{1i}; \hat{\beta}_1^E) - \bar{g}^E\}^{\top}$, respectively.

3.2 Convex hull constraint problem

When we try to solve the constrained maximization problem described in Section 2, a major problem encountered frequently in practice is that the convex hull condition, i.e., that the zero vector is an interior point of the convex hull spanned by $\{\hat{U}_i\}_{i=1}^{m}$, may not be satisfied. When the convex hull condition is violated, the solution for the Lagrange multipliers may not exist, leading to non-convergence of the algorithm. This convex hull constraint may be easily violated when the samples are small or the constraints are high-dimensional. Some significant efforts have been made to solve this problem.
For instance, Emerson and Owen [28] proposed a balanced augmented empirical likelihood (BAEL) method, which augments the sample with two artificial data points so that the expanded convex hull contains the zero vector while the mean of the augmented data is unchanged. Nguyen et al. [29] extended Emerson and Owen's method [28] to general estimating equations. Following their work, we define the two artificial points added to $\{\hat U_i\}_{i=1}^{m}$ as

$$\hat U_{(m+1)} = -s c_u^* \bar u, \qquad \hat U_{(m+2)} = 2\bar U + s c_u^* \bar u,$$

where $\bar U = \frac{1}{m}\sum_{i=1}^{m}\hat U_i$, $\bar u = \bar U/\|\bar U\|$ is the unit vector in the direction of $\bar U$, $c_u^* = (\bar u^\top S^{-1}\bar u)^{-1/2}$ is the inverse Mahalanobis distance of a unit vector from $\bar U$, $S$ is the sample covariance matrix, and $s$ is an additional parameter set to tune the calibration of the resulting statistic. Note that the sample mean of the $\hat U_i$ is maintained by adding these two points, i.e., $\frac{1}{m}\sum_{i=1}^{m}\hat U_i = \frac{1}{m+2}\sum_{i=1}^{m+2}\hat U_i = \bar U$.

After augmenting the samples as $\{\hat U_i\}_{i=1}^{m+2}$ and $\{\hat U_j\}_{j=1}^{n+2}$, the empirical likelihood function for the estimation of $\theta$ is adjusted to

$$\prod_{i=1}^{m+2} p_i \prod_{j=1}^{n+2} q_j$$

subject to

$$\sum_{i=1}^{m+2} p_i = 1,\; p_i \ge 0,\; i = 1,\ldots,m+2, \qquad \sum_{j=1}^{n+2} q_j = 1,\; q_j \ge 0,\; j = 1,\ldots,n+2,$$
$$\sum_{i=1}^{m+2} p_i \hat U_i = 0, \qquad \sum_{j=1}^{n+2} q_j \hat U_j = 0.$$

In this case, the solution for the weights is given by

$$\hat p_i^* = \frac{1}{m+2}\,\frac{1}{1+\hat\lambda^{*\top}\hat U_i},$$

and $\hat\lambda^*$ is obtained by solving

$$\frac{1}{m+2}\sum_{i=1}^{m+2}\frac{\hat U_i}{1+\lambda^\top \hat U_i} = 0.$$

The objective maximized over $\lambda$ then becomes

$$\tilde l^*(\lambda) = \sum_{i=1}^{m+2}\log\bigl(1+\lambda^\top \hat U_i\bigr).$$

Therefore, we provide another modified Newton–Raphson algorithm with an augmented sample (Algorithm 2) to avoid violation of the convex hull constraint when searching for $\hat\lambda^*$.

Algorithm 2: Modified Newton–Raphson algorithm with an augmented sample

Step 0. Let $\lambda^{(0)} = 0$. Set $t = 0$, $\gamma^{(0)} = 1$ and a small tolerance $\varepsilon$.

Step 1.
Generate the two artificial points:

$$\hat U_{(m+1)} = -s c_u^* \bar u, \qquad \hat U_{(m+2)} = 2\bar U + s c_u^* \bar u.$$

Step 2. Calculate $\Delta_1(\lambda^{(t)}) = \partial\tilde l^*/\partial\lambda$ and $\Delta_2(\lambda^{(t)}) = \{\partial^2\tilde l^*/(\partial\lambda\,\partial\lambda^\top)\}^{-1}\Delta_1(\lambda^{(t)})$; that is,

$$\Delta_1(\lambda) = \sum_{i=1}^{m+2}\frac{\hat U_i}{1+\lambda^\top\hat U_i}, \qquad \Delta_2(\lambda) = -\left\{\sum_{i=1}^{m+2}\frac{\hat U_i\hat U_i^\top}{(1+\lambda^\top\hat U_i)^2}\right\}^{-1}\Delta_1(\lambda).$$

If $\|\Delta_1(\lambda^{(t)})\| < \varepsilon$, stop the algorithm and report $\lambda^{(t)}$; otherwise go to Step 3.

Step 3. Calculate $\delta^{(t)} = \gamma^{(t)}\Delta_2(\lambda^{(t)})$. If $1+(\lambda^{(t)}-\delta^{(t)})^\top\hat U_i \le 0$ for some $i$, or $\tilde l^*(\lambda^{(t)}-\delta^{(t)}) < \tilde l^*(\lambda^{(t)})$, let $\gamma^{(t)} = \gamma^{(t)}/2$ and return to the beginning of Step 3. Otherwise, set $\lambda^{(t+1)} = \lambda^{(t)} - \delta^{(t)}$, $t = t+1$ and $\gamma^{(t+1)} = (t+1)^{-1/2}$. Go to Step 2.

Since Algorithm 2 has only one more step than Algorithm 1, namely generating the two artificial points that build the augmented sample, the two algorithms have almost the same computational speed.

In the simulations implemented in Section 4.1, we use Algorithm 2 only in Simulation 3, where we apply our method to the simulated missing data by solving the optimization problem in Section 2.2. Recall that this optimization problem has two more moment constraints involving propensity score models, which can easily cause a high-dimensionality problem, especially when we take the functions $g(x)$ and $h(x)$ to be identity functions.

In this section, we report the results of several simulation experiments and a real data analysis of the ACTG175 data to evaluate the performance of our proposed estimators.
We present four simulation studies, each based on 1000 Monte Carlo data sets, to demonstrate the performance of our proposed method.
Simulation 1.
Similar to the simulation studies reported by Tsiatis et al. [6], we conduct a simulation experiment based on the ACTG175 data analysis of Section 4.2. In each simulated data set, we generate five continuous baseline covariates from a multivariate normal distribution with the empirical mean and covariance matrix of the same variables in the ACTG175 data. In addition, we generate each binary covariate from an independent Bernoulli distribution whose parameter is the corresponding proportion in the ACTG175 data. Independently of all the other variables, the treatment indicator $W$ is drawn from Bernoulli($\delta$), with $\delta$ the treatment assignment probability. Finally, given the covariates and the treatment assignment, the outcome variable, CD4 count at $20\pm 5$ weeks (CD420), is generated. We compare several estimators of $\theta$, including the "Unadjusted" estimator $\bar Y_1 - \bar Y_0$, the "Change score" estimator $\bar Y_1 - \bar Y_0 - (\bar X_1 - \bar X_0)$, the two semiparametric estimators proposed by Tsiatis et al. [6] with a forward variable selection procedure ("Forward-1" and "Forward-2"), and two classical estimators, the "ANCOVA" estimator [3] and the "KOCH" estimator [2]. Details of these competing estimators are given in the supplementary material.

Table 1 shows the results for two cases, $N = 2139$ and $N = 400$, the latter with $\delta = 0.$
5. ELW-Identity and ELW-Linear are both our proposed two-sample ELW estimators: the former takes $g(x)$ and $h(x)$ to be identity functions, while the latter takes $g(x)$ and $h(x)$ to be linear regression functions fitted separately to the data from each treatment group. A "benchmark" estimator of $\theta$, which uses the true treatment-specific regression models, is also included for comparison. Table 1 shows that all adjusted estimators, including our proposed ones, outperform the unadjusted estimator on every evaluation metric; for example, they all have smaller bootstrap standard errors, which implies that incorporating covariate information leads to an efficiency improvement. Furthermore, the results indicate that our proposed ELW estimators achieve a significant efficiency gain, as they enjoy the smallest bootstrap standard error and mean squared error among all the estimators.

Table 1: Results for the simulation based on the ACTG175 data (columns: Estimator, Bias, Ave.Boot.SE, Cov.prob.boot., MSE; cases $n = 2139$ and $n = 400$).

Bias is the mean difference between $\hat\theta$ and the true value of $\theta$; Ave.Boot.SE is the average bootstrap standard error, calculated as the average of 1000 bootstrap standard error estimates, each of which involves 500 bootstrap replicates; Cov.prob.boot. is the coverage probability of a 95% Wald confidence interval using the average bootstrap standard error as the standard error; MSE is the mean squared error, calculated as the mean squared difference between $\hat\theta$ and the true value of $\theta$. Details of each competing estimator are given in the supplementary material.

Simulation 2.
The above simulation design assumes a linear relationship between the outcome variable and the covariates, which may not hold in general. Next, we consider a nonlinear case to check the performance of our proposed method. This simulation uses three continuous covariates, $X = (X_1, X_2, X_3)^\top \sim \mathrm{Normal}(\mu, \Sigma)$. We generate the outcome for each treatment group as $Y_n = \beta_{n0}^{(w)} + \beta_{n1}^{(w)}\sin(X_1) + \beta_{n2}^{(w)}X_2 + \beta_{n3}^{(w)}X_3 + \epsilon_1^{(w)}$, where $w$ is the treatment assignment indicator, equal to 1 for the treatment group and 0 for the control group. For comparison, we generate a similar linear outcome variable $Y_l$, where $Y_l = \beta_{l0}^{(w)} + \beta_{l1}^{(w)}X_1 + \beta_{l2}^{(w)}X_2 + \beta_{l3}^{(w)}X_3 + \epsilon_2^{(w)}$, $w = 0,$
1. The only difference between the two cases lies in the relationship between $Y$ and $X_1$. Let $(\epsilon_1^{(1)}, \epsilon_1^{(0)}, \epsilon_2^{(1)}, \epsilon_2^{(0)})^\top \sim \mathrm{Normal}(0, \Sigma_\epsilon)$, where $\Sigma_\epsilon$ is a diagonal matrix. The coefficient vectors $\beta_n^{(1)}$, $\beta_n^{(0)}$, $\beta_l^{(1)}$ and $\beta_l^{(0)}$ are chosen so that the true treatment effect $\theta$ between the two treatment groups equals 10. The sample size $N$ for this simulation is set to 400.

Table 2: Results for the simulation with linear and nonlinear outcomes. All entries are as in Table 1.
As shown in Table 2, all estimators perform better in the linear case than in the nonlinear case: the mean squared error of each estimator in the nonlinear case is nearly twice that in the linear case, except for the unadjusted and Forward-2 estimators. Although all estimators give very close results in the nonlinear case in terms of mean squared error, our proposed ELW estimators still achieve better precision than the other estimators, with the exception of the Forward-2 estimator.
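To make the design concrete, a data-generating process of this form can be sketched as below. All coefficients here are illustrative placeholders (only the arm-specific intercept differs, so the two arms differ by exactly $\theta = 10$ in expectation), whereas the simulation above uses arm-specific coefficient vectors.

```python
import numpy as np

def simulate_sim2(n, delta, rng, nonlinear=True):
    """Sketch of a Simulation-2-style design: trivariate normal covariates,
    Bernoulli(delta) treatment assignment, and an outcome that is either
    nonlinear (through sin(X1)) or linear in X1.  Coefficients are
    placeholders chosen so that the true treatment effect is theta = 10."""
    X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=n)
    W = rng.binomial(1, delta, size=n)
    f1 = np.sin(X[:, 0]) if nonlinear else X[:, 0]   # the only difference
    Y = 10.0 * W + 2.0 * f1 + X[:, 1] + X[:, 2] + rng.normal(size=n)
    return X, W, Y
```

Swapping `nonlinear` between `True` and `False` reproduces the intended contrast: the same covariates and noise, with only the $Y$-versus-$X_1$ relationship changing.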
Simulation 3.
To evaluate the performance of our proposed estimator $\hat\theta_{\mathrm{mis}}$ of Section 2.1, which handles missing outcomes, we design a simulation experiment to compare it with $\tilde\theta_{\mathrm{qz}}$, the estimator proposed by Qin and Zhang [18]. This simulation involves four mutually independent covariates: $X_1$, $X_2$ and $X_3$ are normal with means 1, 2 and 3, respectively, and $X_4$ is Bernoulli. The outcome is generated as $Y_m = \beta_{m0}^{(w)} + \beta_{m1}^{(w)}X_1 + \beta_{m2}^{(w)}X_2 + \beta_{m3}^{(w)}X_3 + \beta_{m4}^{(w)}X_4 + \epsilon_3^{(w)}$, $w = 0, 1$. We set $(\epsilon_3^{(0)}, \epsilon_3^{(1)})^\top \sim \mathrm{Normal}(0, \Sigma_3)$, where $\Sigma_3$ is a $2\times 2$ diagonal matrix. The true treatment effect is fixed at 10 through the choice of $\beta_m^{(0)}$ and $\beta_m^{(1)}$. The missingness of the outcome is governed by the logistic propensity score models $\mathrm{logit}\{\pi_w(X; \alpha^{(w)})\} = \alpha_0^{(w)} + \alpha_1^{(w)}X_1 + \alpha_2^{(w)}X_2 + \alpha_3^{(w)}X_3 + \alpha_4^{(w)}X_4$, $w = 0, 1$. We use different sets of $\alpha^{(w)}$ to vary the missing proportion of the outcomes; for example, one choice of $\alpha^{(1)}$ and $\alpha^{(0)}$ yields a missing proportion of approximately 10%.

Table 3 reports the results over 1000 Monte Carlo data sets, in which we set $N = 400$ and $\delta = 0.$
5. The bootstrap standard error in each Monte Carlo data set is based on 500 replicates. For each data set, we estimate $\theta$ using $\hat\theta_{\mathrm{mis}}$ and $\tilde\theta_{\mathrm{qz}}$ for comparison. Results for estimators using the true model are included as the "benchmark" estimator. The evaluation metrics in Table 3 are the same as those in the previous experiments; the suffix ".qz" indicates that a metric refers to Qin and Zhang's method [18].

As shown in Table 3, all estimators perform worse as the missing proportion increases, but our proposed estimators remain significantly better than Qin and Zhang's. We note that $\hat\theta_{\mathrm{mis}}$ and $\tilde\theta_{\mathrm{qz}}$ have similar efficiency, judging from their close average bootstrap standard errors and mean squared errors, when the missing proportion is low. However, when the missing proportion is large, the performance of both $\hat\theta_{\mathrm{mis}}$ and $\tilde\theta_{\mathrm{qz}}$ using identity functions deteriorates dramatically, while the versions using a linear regression model, which is the correctly specified model here, maintain good performance.
This demonstrates that, for both our proposed method and Qin and Zhang's, sensitivity to the specified model grows with the missing proportion.

Table 3: Results for the simulation with different missing proportions

Metric            Estimator   Mean missing proportion
                               0.138    0.242    0.333    0.417    0.501
Bias              Identity    -0.174   -0.187   -0.223   -0.341   -1.150
                  Linear      -0.140   -0.120   -0.115   -0.088   -0.018
                  Benchmark   -0.140   -0.123   -0.111   -0.090   -0.022
Bias.qz           Identity    -0.170   -0.186   -0.226   -0.387   -1.186
                  Linear      -0.139   -0.119   -0.109   -0.090   -0.013
                  Benchmark   -0.139   -0.121   -0.111   -0.091   -0.020
Ave.Boot.SE       Identity     0.603    0.647    0.771    1.965    5.960
                  Linear       0.574    0.625    0.689    0.762    0.862
                  Benchmark    0.576    0.625    0.687    0.760    0.857
Ave.Boot.SE.qz    Identity     2.238    2.244    2.299    3.154    6.422
                  Linear       2.243    2.251    2.264    2.283    2.312
                  Benchmark    2.247    2.253    2.266    2.283    2.311
Cov.prob.boot     Identity     0.939    0.934    0.946    0.988    0.984
                  Linear       0.930    0.943    0.934    0.938    0.943
                  Benchmark    0.928    0.945    0.937    0.945    0.944
Cov.prob.boot.qz  Identity     0.957    0.958    0.958    0.977    0.981
                  Linear       0.954    0.958    0.954    0.955    0.958
                  Benchmark    0.959    0.961    0.960    0.957    0.960
MSE               Identity     0.383    0.443    0.546    1.054   18.427
                  Linear       0.368    0.418    0.514    0.581    0.743
                  Benchmark    0.365    0.418    0.502    0.573    0.717
MSE.qz            Identity     4.722    4.738    4.892    5.689   23.610
                  Linear       4.714    4.739    4.874    5.015    5.039
                  Benchmark    4.711    4.734    4.899    5.009    5.005

All metrics are as in Table 1, except that metrics with no suffix are for our proposed estimator while those with ".qz" are for Qin and Zhang's method.
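The Ave.Boot.SE metric reported throughout these tables averages, over the Monte Carlo data sets, a standard nonparametric bootstrap standard error, which can be sketched as follows (the helper name `boot_se` is ours):

```python
import numpy as np

def boot_se(estimator, data, B=500, rng=None):
    """Nonparametric bootstrap standard error: resample the rows of `data`
    with replacement B times and take the standard deviation of the
    replicated estimates.  `estimator` maps an array of observations to a
    scalar; B = 500 matches the number of replicates used in the text."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = len(data)
    reps = [estimator(data[rng.integers(0, n, size=n)]) for _ in range(B)]
    return float(np.std(reps, ddof=1))
```

In the simulations, this quantity is computed once per Monte Carlo data set and then averaged over the 1000 data sets to produce Ave.Boot.SE.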
Simulation 4.
Tables 4 and 5 summarize the performance of $\hat\theta_{\mathrm{mr}}$, described in Section 2.3, based on data without and with missing outcomes, respectively.

When considering data without missing outcomes, we estimate the ATE under a setting similar to the previous simulation. The outcome variable is generated as $Y_c = \beta_{c0}^{(w)} + \beta_{c1}^{(w)}X_1 + \beta_{c2}^{(w)}X_2 + \beta_{c3}^{(w)}X_3 + \beta_{c4}^{(w)}X_4 + \epsilon_3^{(w)}$, $w = 0, 1$. The four mutually independent covariates $X_1, X_2, X_3, X_4$ have the same distributions as in the previous simulation. We choose $\beta_c^{(1)}$ and $\beta_c^{(0)}$ so that the true value of $\theta$ is 10 and the true linear model relating the outcome to the covariates includes only one of the covariates. In this way, the series of identity functions used in the estimation can be regarded as multiple models, one of which correctly specifies the true model; this corresponds to the first row of Table 4. The second row corresponds to another estimator using two linear regression models, each of which involves all four covariates. The third estimator, based on two linear regression models that both include only the single relevant covariate, uses the exactly correctly specified model. The results show very close performance for these three estimators, which indicates the multiple robustness of the proposed estimator.

When we consider data with missing outcomes, we use a simulation setting similar to that of Han [22], which was originally designed for estimating the parameters of regression models. Among the four mutually independent covariates, $X_1$ is normal with mean 5, $X_2$ is Bernoulli, and $X_3$ and $X_4$ are normal with mean 0. The outcome is generated as $Y_r = \beta_{r0}^{(w)} + \beta_{r1}^{(w)}X_1 + \beta_{r2}^{(w)}X_2 + \beta_{r3}^{(w)}X_3 + \beta_{r4}^{(w)}X_4 + \epsilon_Y^{(w)}$, $w = 0, 1$, where $\beta_r^{(1)}$ and $\beta_r^{(0)}$ are chosen so that the true value of $\theta$ is 10. Three auxiliary variables are involved: $S_1$ is a linear function of the covariates plus a normal error, $S_2$ is an indicator of a noisy threshold of $S_1$, and $S_3$ is an exponential transformation of $S_1$ plus a normal error. Here, $I(\cdot)$ denotes the indicator function, and $(\epsilon_Y, \epsilon_1, \epsilon_2, \epsilon_3)^\top \sim \mathrm{Normal}(0, \Sigma)$, where $\Sigma$ is a $4\times 4$ diagonal matrix. The missingness mechanism is set through logistic models $\mathrm{logit}\{\pi_w(X, S)\}$ depending on the auxiliary variables, $w = 0, 1$, resulting in approximately 37% of the outcomes $Y_r$ being missing.

Following the above data setting, in addition to four correctly specified working models, namely two propensity score models $\pi(X, \alpha)$ and two outcome regression functions $g(X, \beta)$ and $h(X, \beta)$ that involve the appropriate auxiliary variables, we also define an incorrectly specified counterpart for each, in which the relevant variables are replaced by others, to test the multiple robustness of our proposed estimator.

All eight models are then used to estimate $\theta$ in the optimization problem with the constraints described in Section 2.3. We set the sample size to $N = 400$, and the results are summarized over 1000 replications. To distinguish the estimators built from different model subsets, we assign each a name of the form "ELW-00000000", where the eight digits, from left to right, indicate whether each of the eight working models (the correct and incorrect versions of the two propensity score models and the two outcome regression functions, in the order listed above) is used in the estimation, a digit of 1 meaning that the corresponding model is included.

For implementation, $\tilde\theta_{\mathrm{hw}}$ is obtained using the R package MultiRobust: we take the difference of the two sample mean estimators computed with the MR.mean function. Our proposed estimators are obtained by applying Algorithm 1. According to the results in Table 5, the multiple robustness of all the estimators except "ELW-01010101" is well demonstrated, since they all have negligible bias. The efficiency of our proposed estimators is consistently better than that of $\tilde\theta_{\mathrm{hw}}$.
We find that the "ELW-10111011" and "ELW-11101110" estimators already have efficiency very similar to that of the "ELW-10101010" estimator, in which all the models are correctly specified.

Table 4: Results for multiple robustness given data without missing outcomes

Estimator         Bias     Ave.BootSE   Cov.prob.boot   MSE
Identity         -0.029    0.573        0.937           0.331
Linear           -0.029    0.572        0.944           0.330
Linear(correct)  -0.029    0.570        0.945           0.331

Table 5: Results for multiple robustness given data with missing outcomes

Estimator       Bias     Bias.hw   MSE     MSE.hw
ELW-10101010   -0.007    0.033     0.087   2.261
ELW-01010101    0.110    0.134     6.966   6.912
ELW-11111111   -0.009    0.033     0.090   2.266
ELW-10011001    0.119    0.122     6.693   6.576
ELW-10101001    0.006    0.030     2.554   4.009
ELW-10011010    0.106    0.126     3.169   4.689
ELW-10111011   -0.008    0.034     0.090   2.260
ELW-01100110    0.003    0.043     0.088   2.260
ELW-10100110   -0.002    0.038     0.090   2.272
ELW-01101010   -0.001    0.039     0.085   2.250
ELW-11101110    0.003    0.043     0.088   2.260

All metrics are as in Table 1, except that metrics with no suffix are for our proposed estimator while those with ".hw" are for Han and Wang's method.
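As a small utility, the eight-digit labels can be decoded mechanically. The slot names below are our own shorthand for the alternating correct/incorrect model ordering described above, and we assume, as the results in Table 5 suggest, that a digit of 1 marks a model as included:

```python
def decode_elw_label(label):
    """Map an 'ELW-xxxxxxxx' label to the working models it uses.
    Slots alternate correct and incorrect versions of the two propensity
    score models and the two outcome regressions (shorthand names)."""
    slots = ["pi1 correct", "pi1 incorrect",
             "g correct",   "g incorrect",
             "pi0 correct", "pi0 incorrect",
             "h correct",   "h incorrect"]
    bits = label.split("-", 1)[1]
    return [s for s, b in zip(slots, bits) if b == "1"]
```

For example, "ELW-10101010" decodes to the four correctly specified models, matching its role as the all-correct reference estimator in Table 5.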
Firstly, we demonstrate and compare our proposed method with the five competing methods by applying all of them to the ACTG 175 data, which were collected from 2139 HIV-infected individuals, equally randomized to four different antiretroviral regimens: zidovudine (ZDV) monotherapy, ZDV + didanosine (ddI), ZDV + zalcitabine, and ddI monotherapy.

Simplifying the experimental setting as Tsiatis et al. [6] did, we regard the $m = 532$ individuals receiving ZDV monotherapy as the treatment group, while the remaining $n = 1607$ individuals receiving any other antiretroviral regimen are classified as the control group. Accordingly, we have $\delta = m/(m+n) \approx 0.25$. The outcome of interest is the CD4 count at $20 \pm 5$ weeks (CD420), denoted by $Y$, and we estimate the difference $\theta$ in its mean between the above two groups. For potential use in covariate adjustment, we consider the following five continuous baseline variables: $X_1$ = CD4 count (cells/mm$^3$), $X_2$ = CD8 count (cells/mm$^3$), $X_3$ = age (years), $X_4$ = weight (kg), $X_5$ = Karnofsky score (scale of 0–100), and seven indicator variables: $X_6$ = hemophilia, $X_7$ = homosexual activity, $X_8$ = history of intravenous drug use, $X_9$ = race (0 = white, 1 = nonwhite), $X_{10}$ = gender (0 = female, 1 = male), $X_{11}$ = antiretroviral history (0 = naive, 1 = experienced), and $X_{12}$ = symptomatic status (0 = asymptomatic, 1 = symptomatic).

Now we apply the optimization algorithm in Section 3 to obtain the proposed ELW estimators. We assume $g(X)$ and $h(X)$ to be linear regression functions or identity functions of the covariates in two different scenarios. In the first scenario, we develop two treatment-specific linear models for $E(Y \mid W = w, X)$, $w = 0, 1$, with the 12 baseline covariates, by fitting separate linear models to the observed data in each treatment arm. The fitted treatment-specific linear regression models are

$$g(X; \hat\beta_1) = 98.900 + \cdots, \qquad h(X; \hat\beta_0) = 126.771 + \cdots, \quad (29)$$

where the remaining terms are the fitted slopes for $X_1, \ldots, X_{12}$, with estimated treatment-specific standard deviations of approximately 96 ($W = 1$) and 116 ($W = 0$) and treatment-specific coefficients of determination $R^2$ computed for each arm. Applying these models in the optimization procedure proposed in Section 2, we obtain the proposed ELW-Linear estimator. In the second scenario, we replace the linear functions with identity functions in the above models to obtain the ELW-Identity estimator. Here, $X$ denotes the $l \times 1$ vector of covariates with $l = 12$.

Table 6: Estimates of $\theta$ for the ACTG 175 data based on CD420

Estimator      Estimate   Boot.SE   Test stat.   Rel. eff.
Unadjusted      46.810     7.055     6.924        1.000
Change scores   50.409     5.693     9.150        1.506
Forward-1       49.895     5.439     9.716        1.733
Forward-2       51.589     5.700    10.183        1.780
ANCOVA          49.694     5.451     9.680        1.734
KOCH            49.758     5.458     9.641        1.716
ELW-Identity    50.006     5.288    10.057        1.849
ELW-Linear      49.824     5.200     9.776        1.760

Boot.SE is the bootstrap-based standard error; Test stat. is the Wald test statistic; and Rel. eff. = (SE for the unadjusted estimator)$^2$/(SE for the indicated estimator)$^2$.

Given the results in Table 6, all estimators indicate the same evidence of a treatment difference. The performance of all methods appears similar, except that the unadjusted estimator has a lower estimate, due to a mild imbalance in baseline CD4 between the two treatment groups [6]. However, the bootstrap standard errors of our proposed ELW estimators are both smaller than those of the others, which indicates better performance of our proposed method.

Table 7 shows the results for the estimates of $\theta$ based on the missing outcome CD496, approximately 37% of which is missing. Here, we calculate the standard error only by the bootstrap method. As shown in
Table 7, our proposed ELW estimators have higher estimates of $\theta$ but consistently smaller bootstrap-based standard errors than those based on Qin and Zhang's method [18], which indicates better efficiency of our proposed ELW estimators.

Table 7: Estimates of $\theta$ for the ACTG 175 data based on CD496

Estimator      Estimate   Boot.SE   Test stat.
ELW-Identity    64.623     9.082     7.116
ELW-Linear      64.038     9.065     7.064
Qz-Identity     61.223    10.316     5.935
Qz-Linear       60.981    10.159     6.003

ELW-Identity and ELW-Linear are our proposed estimators using identity functions and linear functions, respectively. Similarly, Qz-Identity and Qz-Linear are the corresponding estimators based on Qin and Zhang's method [18].
We have proposed a two-sample empirical likelihood weighted estimator that effectively incorporates covariate information into the estimation of the average treatment effect in randomized clinical trials. Namely, we obtain two classes of estimated weights through constrained empirical likelihood estimation, where the constraints are designed to carry side information from the covariates. Moreover, our proposed estimator maintains objectivity, since it separates the estimation of the ATE from the analysis of the covariate–outcome relationship.

Furthermore, we apply the proposed estimator to the common problem of missing outcome data in RCTs under the assumption of missing at random. Theoretically, we have proved that our proposed estimator enjoys the double robustness and multiple robustness properties.

To evaluate efficiency, we demonstrate that the proposed estimator is semiparametric efficient for data both without and with missingness. Various simulation experiments and an application to the ACTG175 data have been conducted to compare our proposed estimator with existing ones, and the results indicate better performance of our proposed method.
References

[1] S. J. Senn. Covariate imbalance and random allocation in clinical trials. Statistics in Medicine, 8(4):467–475, 1989.

[2] Gary G. Koch, Catherine M. Tangen, Jin-Whan Jung, and Ingrid A. Amara. Issues for covariance analysis of dichotomous and ordered categorical data from randomized clinical trials and non-parametric strategies for addressing them. Statistics in Medicine, 17(15-16):1863–1892, 1998.

[3] Emmanuel Lesaffre and Stephen Senn. A note on non-parametric ANCOVA for covariate adjustment in randomized clinical trials. Statistics in Medicine, 22(23):3583–3596, 2003.

[4] Selene Leon, Anastasios A. Tsiatis, and Marie Davidian. Semiparametric estimation of treatment effect in a pretest–posttest study. Biometrics, 59(4):1046–1055, 2003.

[5] Marie Davidian, Anastasios A. Tsiatis, and Selene Leon. Semiparametric estimation of treatment effect in a pretest–posttest study with missing data. Statistical Science, 20(3):261, 2005.

[6] Anastasios A. Tsiatis, Marie Davidian, Min Zhang, and Xiaomin Lu. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine, 27(23):4658–4677, 2008.

[7] Emmanuel Lesaffre, Kris Bogaerts, Xin Li, and Erich Bluhmki. On the variability of covariate adjustment: experience with Koch's method for evaluating the absolute difference in proportions in randomized clinical trials. Controlled Clinical Trials, 23(2):127–142, 2002.

[8] Stuart J. Pocock, Susan E. Assmann, Laura E. Enos, and Linda E. Kasten. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Statistics in Medicine, 21(19):2917–2930, 2002.

[9] Changyu Shen, Xiaochun Li, and Lingling Li. Inverse probability weighting for covariate adjustment in randomized studies. Statistics in Medicine, 33(4):555–568, 2014.

[10] Elizabeth J. Williamson, Andrew Forbes, and Ian R. White. Variance reduction in randomised trials by inverse probability weighting using the propensity score. Statistics in Medicine, 33(5):721–737, 2014.

[11] Biao Zhang. Empirical likelihood inference in randomized clinical trials. Statistical Methods in Medical Research, 27(12):3770–3784, 2018.

[12] Chiung-Yu Huang, Jing Qin, and Dean A. Follmann. Empirical likelihood-based estimation of the treatment effect in a pretest–posttest study. Journal of the American Statistical Association, 103(483):1270–1280, 2008.

[13] Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 1988.

[14] Art B. Owen. Empirical Likelihood. Chapman and Hall/CRC, 2001.

[15] Jing Qin and Jerry Lawless. Empirical likelihood and general estimating equations. The Annals of Statistics, pages 300–325, 1994.

[16] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 2002.

[17] Daniel G. Horvitz and Donovan J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.

[18] Jing Qin and Biao Zhang. Empirical-likelihood-based inference in missing response problems and its application in observational studies. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):101–122, 2007.

[19] Min Chen, Changbao Wu, and Mary E. Thompson. An imputation based empirical likelihood approach to pretest–posttest studies. Canadian Journal of Statistics, 43(3):378–402, 2015.

[20] Changbao Wu and Ying Yan. Empirical likelihood inference for two-sample problems. Statistics and Its Interface, 5(3):345–354, 2012.

[21] Peisong Han and Lu Wang. Estimation with missing data: beyond double robustness. Biometrika, 100(2):417–430, 2013.

[22] Peisong Han. Multiply robust estimation in regression analysis with missing data. Journal of the American Statistical Association, 109(507):1159–1173, 2014.

[23] Halbert White. Maximum likelihood estimation of misspecified models. Econometrica, pages 1–25, 1982.

[24] Anastasios Tsiatis. Semiparametric Theory and Missing Data. Springer Science & Business Media, 2007.

[25] James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

[26] Peisong Han. A further study of the multiply robust estimator in missing data analysis. Journal of Statistical Planning and Inference, 148:101–110, 2014.

[27] Jiahua Chen and Changbao Wu. Estimation of distribution function and quantiles using the model-calibrated pseudo empirical likelihood method. Statistica Sinica, pages 1223–1239, 2002.

[28] Sarah C. Emerson and Art B. Owen. Calibration of the empirical likelihood method for a vector mean. Electronic Journal of Statistics, 3:1161–1192, 2009.

[29] Minh Khoa Nguyen, Steve Phelps, and Wing Lon Ng. Simulation based calibration using extended balanced augmented empirical likelihood.