On IPW-based estimation of conditional average treatment effect
Niwen Zhou$^a$, Lixing Zhu$^{a,b,*}$

$^a$ School of Statistics, Beijing Normal University, Beijing, China
$^b$ Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
Abstract
This paper provides a systematic investigation of the asymptotic behaviours of four inverse probability weighting (IPW)-based estimators of the conditional average treatment effect, with nonparametrically, semiparametrically, parametrically estimated and true propensity score, respectively. To this end, we first pay particular attention to a semiparametric dimension reduction structure, under which the semiparametric-based estimator alleviates the curse of dimensionality and largely avoids model misspecification. We also derive further properties of the existing estimator with nonparametrically estimated propensity score. Based on their asymptotic variance functions, the study reveals the general ranking of their asymptotic efficiencies; the scenarios in which asymptotic equivalence can hold; and the critical roles of the affiliation of the given covariates in the set of arguments of the propensity score, and of the bandwidth and kernel selections. The results show an essential difference from IPW-based estimation of the (unconditional) average treatment effect (ATE). Numerical studies are carried out to examine the performances of the estimators; they indicate that in high-dimensional paradigms the semiparametric-based estimator performs well in general, whereas the nonparametric-based estimator, and sometimes even the parametric-based estimator, is more affected by dimensionality. A real data example is analysed for illustration.

$^*$ The authors gratefully acknowledge two grants from the University Grants Council of Hong Kong and a NSFC grant (NSFC11671042).
$^{**}$ Corresponding author
Email address: [email protected] (Lixing Zhu)
Preprint submitted to Elsevier

Keywords:
Dimension reduction, Heterogeneous treatment effect, Propensity score
1. Introduction
Treatment effects have been widely analyzed by economists and statisticians in diverse fields. In this paper, we focus on estimating treatment effects under the potential outcomes framework and the unconfoundedness assumption with a binary treatment. Let $D = 0, 1$ mean that the individual does not receive or receives the treatment, and let the response $Y$ be the corresponding potential outcome $Y(0)$ or $Y(1)$. To conveniently identify the quantities measuring treatment effects, the unconfoundedness assumption in Rosenbaum and Rubin (1983) is generally considered; that is, the assignment to treatment is independent of the potential outcomes given a $k$-dimensional vector $X$ of covariates, i.e.
\[
(Y(0), Y(1)) \perp D \mid X. \tag{1}
\]
Further, we consider the dimension of $X$ to be fixed throughout this paper, although in some cases it can be high. As $Y(0)$ and $Y(1)$ cannot be simultaneously observed for any individual, the observed outcome can be written as $Y = DY(1) + (1-D)Y(0)$. Since estimating the $i$-th individual treatment effect $Y_i(1) - Y_i(0)$ is unrealistic, an important trend in the literature is to estimate the average treatment effect (ATE): $\mu = E(Y(1) - Y(0))$. See for instance Rosenbaum and Rubin (1983) and Hirano et al. (2003). Recently, there has been increasing interest in estimating conditional (or heterogeneous) average treatment effects:
\[
CATE(X) = E(Y(1) - Y(0) \mid X),
\]
which is designed to reflect how treatment effects vary across different subpopulations.\footnote{Although in recent years the phrase ``high dimension'' is usually associated with $k$ diverging with the sample size, when we say $X$ is of high dimension in this paper we only mean that $X$ contains many, but a fixed number of, covariates. For ease of explanation, we still use the phrase ``high dimension'' whenever no confusion will be caused.} Note that even when $ATE = 0$, the treatment can still be effective for a subpopulation defined by specific observable characteristics, i.e. for some $x$ such that $CATE(x) \neq 0$. Thus heterogeneous treatment effects are more informative and can play important roles in personalized medicine or policy intervention. Most existing estimation methods for heterogeneous treatment effects condition on the full set of variables $X$, see e.g. Crump et al. (2008) and Wager and Athey (2018), where the multivariate $X$ is designed to make the unconfoundedness assumption plausible. After 2015, researchers have considered estimating more general conditional/heterogeneous treatment effects, in which the conditioning covariates $Z$ form a subset of the covariates, i.e. $X = (Z^\top, U^\top)^\top \in \mathbb{R}^l \times \mathbb{R}^m$, $k = l + m < \infty$. See e.g. Abrevaya et al. (2015) and Lee et al. (2017). Note that treatment effects conditioning on a subset of $X$, rather than on the high dimensional $X$ itself, provide desirable flexibility and can help in making policy decisions. Based on the assumption (1), Abrevaya et al. (2015) used the inverse probability weighting (IPW)-based method, which is popularly used in the literature (Robins et al., 1994), to estimate
\[
CATE(Z) = E[Y(1) - Y(0) \mid Z]
\]
when the propensity score function is estimated parametrically (IPW-P) and nonparametrically (IPW-N). Abrevaya et al. (2015) gave a deep investigation of the asymptotic properties of these estimators. There are two main conclusions in Abrevaya et al. (2015): one is that IPW-N can be asymptotically more efficient than IPW-P, in the sense that the asymptotic variance function of IPW-N can be uniformly smaller than that of IPW-P; the other is that the asymptotic variance function of IPW-P equals that of IPW-O, which is defined as the oracle estimator with the true propensity score. It is noteworthy that the latter conclusion differs from that for IPW-type ATE estimators, because the IPW-type ATE estimator based on a parametrically estimated propensity score can be more efficient than the one with the true propensity score.

As is known, to make the unconfoundedness assumption plausible, we often need to include many covariates in the analysis. Thus we say $X \in \mathbb{R}^k$ is of high dimension with $k < \infty$. In this case, on the one hand, it is often not easy to choose a parametric specification that sufficiently captures all the important nonlinear and interaction effects required for IPW-P. On the other hand, any nonparametric estimation of the propensity score clearly suffers from the curse of dimensionality, and then IPW-N no longer works.

Therefore, in this paper we suggest a semiparametric IPW-based $CATE(Z)$ estimation procedure to simultaneously alleviate the propensity score misspecification problem and, particularly, the curse of dimensionality. To this end, we consider a semiparametric dimension reduction structure for the propensity score, under which the unconfoundedness assumption (1) has a dimension reduction version.
It is worth pointing out that the general nonparametric structure can be regarded as a special case of the dimension reduction structure we consider, with an orthonormal projection matrix of full rank. We call the estimator IPW-S and give the details of the model setting and the estimation procedure in the next section.

For theoretical development, we will give the asymptotically linear representation and asymptotic normality of IPW-S. We will also give some further properties of the existing IPW-N in Abrevaya et al. (2015). Based on these theoretical studies, we give a systematic comparison of the asymptotic efficiencies amongst IPW-O, IPW-P, IPW-S and IPW-N. Combining the results of Abrevaya et al. (2015) and the further properties of IPW-N we derive in this paper, the comparison reveals some very interesting and important phenomena. Specifically, letting $A \preceq B$ mean that the asymptotic variance of estimator $A$ is not greater than that of estimator $B$, and $A \cong B$ stand for $A$ having the same asymptotic variance function as $B$, we have the following observations in theory.

First, in general IPW-N $\preceq$ IPW-S $\preceq$ IPW-P $\cong$ IPW-O.

Second, the affiliation of $Z$ to the set of arguments of the propensity score plays an important role in the asymptotic efficiency of IPW-S and IPW-N. That is, when $Z$ is a subset of the arguments of the propensity score, IPW-S $\preceq$ IPW-P $\cong$ IPW-O and IPW-N $\preceq$ IPW-P $\cong$ IPW-O; otherwise, IPW-N $\cong$ IPW-S $\cong$ IPW-P $\cong$ IPW-O. This newly found phenomenon provides a deep insight into the performances of IPW-S and IPW-N, which is also useful in practice.

Third, when the propensity score function is smooth enough, even in general cases we can have the asymptotic equivalence by carefully choosing the bandwidths and using high order kernel functions: IPW-N $\cong$ IPW-S $\cong$ IPW-P $\cong$ IPW-O.
This also gives us a better understanding of the asymptotic performances of the different estimators. Of course, this part mainly serves as a theoretical exploration. For practical use, we would have no interest in wilfully choosing those kernel functions and bandwidths, which are very difficult to implement and can worsen the estimator's performance. But it reminds researchers that a ``good'' estimator of the propensity score need not be helpful for the performance of the $CATE$ estimator.

Fourth, owing to the dimension reduction structure of $p(X)$, the requirements on the bandwidths and the order of the kernel function for IPW-S are much milder than those for IPW-N. Thus when the dimension is high, even though IPW-N has superior efficiency in theory, IPW-S is preferable.

The rest of the paper is organized as follows. In Section 2, we first introduce the estimation procedure for IPW-S; we also investigate its asymptotic properties and give the theoretical comparisons between the four $CATE$ estimators. Section 3 contains some numerical studies to examine the performance of the
$CATE$ estimators. In Section 4, we apply the $CATE$ estimators to analyse a real data set for illustration. Section 5 contains some conclusions and a further discussion. The regularity conditions are listed in the Appendix and all technical proofs are relegated to the Supplementary Materials to save space.
2. Semiparametric estimation procedure and asymptotic properties
Assume that the covariates $X = (Z^\top, U^\top)^\top$ are absolutely continuous. Under the unconfoundedness assumption (1), recall that the $CATE$ function $\tau(z)$ can be rewritten as
\[
\tau(z) = E\left[\frac{DY}{p(X)} - \frac{(1-D)Y}{1-p(X)} \,\Big|\, Z = z\right], \quad Z \in \mathbb{R}^l. \tag{2}
\]
If $p(X)$ is given, we can estimate $\tau(z)$ immediately via the Nadaraya-Watson kernel method by regarding $DY/p(X) - (1-D)Y/\{1-p(X)\}$ as the response:
\[
\hat{\tau}_O(z) = \sum_{i=1}^{n}\left[\frac{D_i Y_i}{p(X_i)} - \frac{(1-D_i)Y_i}{1-p(X_i)}\right] K_h(Z_i - z) \Big/ \sum_{i=1}^{n} K_h(Z_i - z).
\]
Here $K(\cdot)$ is a multivariate kernel function, $K_h(u) = h^{-l}K(u/h)$ and $l = \dim(Z)$. This $CATE$ estimator is
the IPW-O we mentioned before. Based on existing results for nonparametric estimation, it is easy to derive the asymptotic distribution of IPW-O, which will be used as the benchmark for comparisons among all estimators studied in this paper. Proposition 1.
Suppose the conditions (C1)-(C4) in the Appendix are satisfied. Then the following statement holds for each point $z$ in the support of $Z$:
\[
\sqrt{nh^l}\,(\hat{\tau}_O(z) - \tau(z)) \xrightarrow{D} N\left(0, \frac{\|K\|_2^2\,\sigma_O^2(z)}{f(z)}\right).
\]
Here
\[
\sigma_O^2(z) = E\left(\Big[\frac{DY}{p(X)} - \frac{(1-D)Y}{1-p(X)} - \tau(z)\Big]^2 \,\Big|\, Z = z\right).
\]
When $p(X)$ is an unknown function, we first estimate $p(X)$ and then define a final $CATE$ estimator $\hat{\tau}(z)$. We propose the estimator under the semiparametric structure below.
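Before introducing the semiparametric structure, the oracle estimator $\hat{\tau}_O(z)$ above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the data arrays, the Gaussian product kernel and the scalar bandwidth are our own assumptions:

```python
import numpy as np

def gauss_kernel(u):
    """Product Gaussian kernel K(u) for u of shape (n, l)."""
    l = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (l / 2)

def ipw_o(z, Z, d, y, prop, h):
    """Oracle IPW-based CATE estimate at z with the true propensity score.

    Z : (n,) or (n, l) conditioning covariates; d : (n,) treatment indicators;
    y : (n,) observed outcomes; prop : (n,) true scores p(X_i); h : bandwidth."""
    Z = np.atleast_2d(Z.T).T                       # ensure shape (n, l)
    psi = d * y / prop - (1 - d) * y / (1 - prop)  # DY/p - (1-D)Y/(1-p)
    w = gauss_kernel((Z - z) / h)                  # K_h weights up to the h^{-l} factor
    return np.sum(psi * w) / np.sum(w)             # Nadaraya-Watson ratio
```

Note that the $h^{-l}$ factor of $K_h$ cancels in the Nadaraya-Watson ratio, so it is omitted.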
Assume the propensity score has a semiparametric dimension reduction structure:
\[
p(X) = q(V^\top X), \tag{3}
\]
where both the link function $q(\cdot)$ and the $r$ projection directions in $V$ are unknown, with $V$ being a $k \times r$ orthonormal matrix. This structure is general: it covers the structures of some important semiparametric models such as single-index models. From the definition of the propensity score, (3) implies that the indicator $D$ depends on $X$ only through the projected variable $V^\top X$. Thus, we can use the following conditional independence to present the above semiparametric structure:
\[
D \perp X \mid V^\top X. \tag{4}
\]
It follows that $(Y(0), Y(1)) \perp D \mid V^\top X$. We call the intersection of all subspaces spanned by matrices $V$ satisfying the above independence the central subspace; see Li (1991). Usually $V$ can only be identified up to a rotation matrix $C$; that is, $V^* = V \times C$ can be identified. As this identification issue does not affect the related estimation of $p(X)$, we still use $V$ without confusion. Relevant references are Luo et al. (2017) and Ma et al. (2019). This is a dimension reduction framework, so the corresponding estimation is less affected by the curse of dimensionality. For such a dimension reduction structure, we could also consider variable selection as Ma et al. (2019) did; but as this is not a focus of this paper, we just work on this model and assume the existence of consistent estimation later on.

If we postulate that the information about $D$ from $X$ can be completely captured by $r$ linear combinations $V^\top X$ of $X$ with $r \ll k$, the propensity score can be estimated by replacing the original $X$ with $V^\top X$. That is, we can use a lower dimensional kernel function $H(u)$ to get a nonparametric estimator $\hat{q}(\hat{V}^\top X)$ of $q(V^\top X) = E(D \mid V^\top X)$:
\[
\hat{q}(\hat{V}^\top X_i) = \frac{\sum_{j \neq i} D_j H_{h_1}(\hat{V}^\top X_j - \hat{V}^\top X_i)}{\sum_{j \neq i} H_{h_1}(\hat{V}^\top X_j - \hat{V}^\top X_i)}, \tag{5}
\]
where $h_1$ is the bandwidth, $H_{h_1}(u) = h_1^{-r} H(u/h_1)$ and $\hat{V}$ is a consistent estimator derived by a sufficient dimension reduction method. Several methods are available in the literature, such as the inverse regression methods in Cook and Li (2002) and minimum average variance estimation (MAVE) in Xia et al. (2002) and Xia (2007).

Recall that $CATE$ can be rewritten as (2). Thus, based on $\hat{q}(\hat{V}^\top X_i)$, the IPW-S estimator of $\tau(z)$ is defined as
\[
\hat{\tau}_S(z) = \sum_{i=1}^{n}\left[\frac{D_i Y_i}{\hat{q}(\hat{V}^\top X_i)} - \frac{(1-D_i)Y_i}{1-\hat{q}(\hat{V}^\top X_i)}\right] K_h(Z_i - z) \Big/ \sum_{i=1}^{n} K_h(Z_i - z). \tag{6}
\]
Since both $Z$ and $V^\top X$ are low-dimensional random vectors, $\hat{\tau}_S(z)$ can alleviate the propensity score misspecification problem and the curse of dimensionality simultaneously. In the next subsection, we investigate the asymptotic properties of $\hat{\tau}_S(z)$ and derive some further properties of the existing IPW-N under certain regularity conditions.

2.3. Asymptotic properties for IPW-S

Denote by $|A|$ the cardinality of a set $A$. We first give some notations.

• $W = (X, D, Y)$, and the observations $W_i = (X_i, D_i, Y_i)$, $i = 1, \ldots, n$, are independent copies of $W$;

• $m_j(V^\top X) = E[Y(j) \mid V^\top X]$, $j = 0, 1$, and $K_i = K((Z_i - z)/h)$;

• $\psi(q(V^\top X), W) = DY/q(V^\top X) - (1-D)Y/\{1-q(V^\top X)\}$;

• $\psi^*(q(V^\top X), W) = [D\{Y - m_1(V^\top X)\}]/q(V^\top X) - [(1-D)\{Y - m_0(V^\top X)\}]/\{1-q(V^\top X)\} + m_1(V^\top X) - m_0(V^\top X)$;

• for two vectors $A$ and $B$, we use the intersection notation $A \cap B$ to denote, without confusion, all components contained in both $A$ and $B$; $|A \cap B| = t$ stands for the number of components in the intersection of $A$ and $B$. In particular, $t = 0$ means $A \cap B = \emptyset$, and $t = |A|$ implies $A \cap B = A$.

Both $\psi(q(V^\top X), W)$ and $\psi^*(q(V^\top X), W)$ are the central parts of the influence functions for IPW-S. Theorem 1.
Suppose all the conditions in the Appendix are satisfied. Then the following statements hold for each point $z$ in the support of $Z$.

(1) When $|Z \cap V^\top X| = t < l$ with $s_1\{2 - l/(l-t)\} + l > 0$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_S(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi(q(V^\top X_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_S(z)$ is $\sqrt{nh^l}(\hat{\tau}_S(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_S(z))$.

(2) When $|Z \cap V^\top X| = l$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_S(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi^*(q(V^\top X_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_S(z)$ is $\sqrt{nh^l}(\hat{\tau}_S(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_S^*(z))$.

Here $s_1$ is the order of $H(\cdot)$, $\Sigma_S(z) = \|K\|_2^2\,\sigma_S^2(z)/f(z)$ and $\Sigma_S^*(z) = \|K\|_2^2\,\sigma_S^{*2}(z)/f(z)$, with $\sigma_S^2(z) = E[\{\psi(q(V^\top X), W) - \tau(z)\}^2 \mid Z = z]$ and $\sigma_S^{*2}(z) = E[\{\psi^*(q(V^\top X), W) - \tau(z)\}^2 \mid Z = z]$. Remark 1.
These results show a very interesting and somewhat unexpected phenomenon: the asymptotic behaviour of $\hat{\tau}_S(z)$ also depends on whether some elements of $Z$ belong to $V^\top X$. Recall that $|Z \cap V^\top X| = t$ means that $t$ elements of $Z$ are also $t$ of the linear combinations in $V^\top X$; i.e. we can rewrite $V^\top X = (Z_1, \cdots, Z_t, (\tilde{V}^\top X)^\top)^\top$ with
\[
V^\top = \begin{pmatrix} \tilde{I}_{t \times k} \\ \tilde{V}^\top \end{pmatrix}, \quad \tilde{I}_{t \times k} = (I_{t \times t}, \mathbf{0}_{t \times (k-t)}).
\]
Here $I_{t \times t}$ is an identity matrix and $\tilde{V}^\top$ is an $(r-t) \times k$ matrix. The asymptotic behaviours with $t = l$ and with any $0 \le t < l$ are very different. A natural question is whether we can, if possible, choose a dimension reduced vector $V^\top X$ such that IPW-S works best. As the question is related to IPW-P and IPW-N, we will have a detailed discussion in Subsection 3.3 below.

Next, we present the estimators of $\Sigma_S(z)$ and $\Sigma_S^*(z)$ under $|Z \cap V^\top X| < l$ and $|Z \cap V^\top X| = l$, respectively, as
\[
\hat{\Sigma}_S(z) = \frac{\|K\|_2^2\,\hat{\sigma}_S^2(z)}{\hat{f}(z)} \quad \text{and} \quad \hat{\Sigma}_S^*(z) = \frac{\|K\|_2^2\,\hat{\sigma}_S^{*2}(z)}{\hat{f}(z)}, \tag{7}
\]
where $\hat{\sigma}_S^2(z)$ and $\hat{\sigma}_S^{*2}(z)$ are estimators of $\sigma_S^2(z)$ and $\sigma_S^{*2}(z)$ with
\[
\hat{\sigma}_S^2(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi(\hat{q}, W_i) - \hat{\tau}_S(z))^2 K_i}{\hat{f}(z)}, \quad \hat{\sigma}_S^{*2}(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi^*(\hat{q}, W_i) - \hat{\tau}_S(z))^2 K_i}{\hat{f}(z)},
\]
$\hat{f}(z) = \sum_{i=1}^{n} K_h(Z_i - z)/n$ a kernel-based estimator of $f(z)$,
\[
\psi(\hat{q}, W_i) = \frac{D_i Y_i}{\hat{q}(\hat{V}^\top X_i)} - \frac{(1-D_i)Y_i}{1-\hat{q}(\hat{V}^\top X_i)},
\]
and
\[
\psi^*(\hat{q}, W_i) = \frac{D_i\{Y_i - \hat{m}_1(\hat{V}^\top X_i)\}}{\hat{q}(\hat{V}^\top X_i)} - \frac{(1-D_i)\{Y_i - \hat{m}_0(\hat{V}^\top X_i)\}}{1-\hat{q}(\hat{V}^\top X_i)} + \hat{m}_1(\hat{V}^\top X_i) - \hat{m}_0(\hat{V}^\top X_i),
\]
with
\[
\hat{m}_1(\hat{V}^\top x) = \frac{\sum_{\{t: D_t = 1\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x) Y_t}{\sum_{\{t: D_t = 1\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x)}, \quad \hat{m}_0(\hat{V}^\top x) = \frac{\sum_{\{t: D_t = 0\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x) Y_t}{\sum_{\{t: D_t = 0\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x)}
\]
being the estimators of $m_1(V^\top X)$ and $m_0(V^\top X)$. Further, we can state the consistency of the proposed estimators in the following theorem. Theorem 2.
Suppose all the conditions in the Appendix are satisfied. Then
\[
\hat{\Sigma}_S(z) = \Sigma_S(z) + o_p(1) \quad \text{and} \quad \hat{\Sigma}_S^*(z) = \Sigma_S^*(z) + o_p(1).
\]
By Theorem 2, we can obtain a pointwise consistent estimator of the standard error of $\sqrt{nh^l}(\hat{\tau}_S(z) - \tau(z))$, so that we are able to construct a $(1-\alpha)100\%$ pointwise confidence interval for $\tau(z)$, i.e.
\[
\hat{\tau}_S(z) \pm (nh^l)^{-1/2}\, c_{\alpha/2}\, \big(\hat{\Sigma}_S(z)\big)^{1/2} \tag{8}
\]
or
\[
\hat{\tau}_S(z) \pm (nh^l)^{-1/2}\, c_{\alpha/2}\, \big(\hat{\Sigma}_S^*(z)\big)^{1/2}, \tag{9}
\]
with $c_{\alpha/2}$ being the $(1-\alpha/2)$ quantile of the standard normal distribution. Note that the specific formula of the confidence interval depends on whether $|Z \cap V^\top X| < l$ or $|Z \cap V^\top X| = l$. One possible way to choose between (8) and (9) is based on the value of $|Z \cap \hat{V}^\top X|$.

To be specific, taking MAVE (Xia et al., 2002) as an example dimension reduction method, we propose an estimation and inference procedure for $\tau(z)$ based on IPW-S by carrying out the following steps.

Step 1: Obtain the estimator of $V$ by solving the minimization problem
\[
\min_{V, a, b} \sum_{i,j=1}^{n} \{D_i - a_j - b_j^\top V^\top (X_i - X_j)\}^2 \omega_{ij}.
\]
Here $\omega_{ij} = H_{h_1}\{V^\top (X_i - X_j)\} \big/ \sum_{l=1}^{n} H_{h_1}\{V^\top (X_l - X_j)\}$, $a = (a_1, \ldots, a_n)$, $b = (b_1, \ldots, b_n)$. Denote the resulting estimator by $\hat{V}$.

Step 2: Given $\hat{V}$, estimate the propensity score $E(D \mid \hat{V}^\top X)$ via (5).

Step 3: Obtain the semiparametric CATE estimator $\hat{\tau}_S(z)$ via (6).

Step 4: Given $Z = z$, a $(1-\alpha)$ pointwise confidence interval for the true CATE $\tau(z)$ can be constructed as follows. If $|Z \cap \hat{V}^\top X| \approx l$, the confidence interval of $\tau(z)$ is constructed in the form of (9), i.e. $\hat{\tau}_S(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_S^*(z))^{1/2}$; otherwise, the pointwise confidence interval of $\tau(z)$ is constructed in the form of (8), that is, $\hat{\tau}_S(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_S(z))^{1/2}$.

Note that the first step can be implemented using the R package MAVE.
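The four-step procedure above can be sketched as follows. This is a hedged illustration for the case $r = 1$ and scalar $Z$ ($l = 1$): since MAVE is available in R rather than Python, Step 1 is replaced by a simple sliced-inverse-regression direction as a stand-in, and the bandwidths and clipping levels are arbitrary assumptions:

```python
import numpy as np
from statistics import NormalDist

def sir_direction(X, d):
    """One sufficient-dimension-reduction direction for binary D via sliced
    inverse regression (a stand-in for MAVE, which the paper uses in R)."""
    Sigma = np.cov(X.T)
    mu_diff = X[d == 1].mean(axis=0) - X[d == 0].mean(axis=0)
    v = np.linalg.solve(Sigma, mu_diff)
    return v / np.linalg.norm(v)

def kern(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def ipw_s(z, Z, X, d, y, h1, h, alpha=0.05):
    """IPW-S estimate of tau(z) with a pointwise normal confidence interval.

    Sketch for r = 1 and scalar Z (l = 1); h1 is the propensity bandwidth,
    h the CATE bandwidth.  Returns (tau_hat, (lower, upper))."""
    n = len(d)
    v = sir_direction(X, d)
    t = X @ v                                    # estimated index V^T X
    # Step 2: leave-one-out kernel estimate of q(V^T X_i) = E(D | V^T X_i)
    Kt = kern((t[:, None] - t[None, :]) / h1)
    np.fill_diagonal(Kt, 0.0)
    q = Kt @ d / Kt.sum(axis=1)
    q = np.clip(q, 0.05, 0.95)                   # guard against extreme weights
    # Step 3: Nadaraya-Watson smoothing of the IPW pseudo-response
    psi = d * y / q - (1 - d) * y / (1 - q)
    w = kern((Z - z) / h)
    tau = np.sum(psi * w) / np.sum(w)
    # Step 4: plug-in variance ||K||^2 sigma^2(z)/f(z); ||K||^2 = 1/(2 sqrt(pi))
    fz = w.mean() / h                            # kernel density estimate of f(z)
    sigma2 = np.sum((psi - tau) ** 2 * w) / (n * h) / fz
    Sigma = sigma2 / (2.0 * np.sqrt(np.pi)) / fz
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(Sigma / (n * h))
    return tau, (tau - half, tau + half)
```

The variance here corresponds to (7) with the Gaussian kernel, for which $\|K\|_2^2 = 1/(2\sqrt{\pi})$; the switch between the forms (8) and (9) according to $|Z \cap \hat{V}^\top X|$ is omitted in this sketch.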
Based on this estimation and inference procedure, the empirical analysis in Section 4 can be implemented.
Recall that the IPW-N estimator proposed by Abrevaya et al. (2015) is
\[
\hat{\tau}_N(z) = \sum_{i=1}^{n}\left[\frac{D_i Y_i}{\hat{p}(X_i)} - \frac{(1-D_i)Y_i}{1-\hat{p}(X_i)}\right] K_h(Z_i - z) \Big/ \sum_{i=1}^{n} K_h(Z_i - z), \tag{10}
\]
with $\hat{p}(X_i) = \sum_{j \neq i} D_j L_{h_1}(X_j - X_i) \big/ \sum_{j \neq i} L_{h_1}(X_j - X_i)$. Here $L(\cdot)$ is also a multivariate kernel function with $L_{h_1}(\cdot) = h_1^{-k} L(\cdot/h_1)$, and $h_1$ is the corresponding bandwidth.

As the asymptotic properties of IPW-S are influenced by the affiliation of $Z$, we analyse in this paper the asymptotic properties of IPW-N in different scenarios, similarly to Theorem 1. Suppose
\[
D \perp X \mid \tilde{X}, \quad \tilde{X} \subseteq X, \quad \tilde{k} = \dim(\tilde{X}) \le k. \tag{11}
\]
To extend the asymptotic results on IPW-N in Abrevaya et al. (2015), we derive the following theorem, which also confirms the influence of the affiliation of $Z$ to $\tilde{X}$ on the asymptotic properties of IPW-N. Abrevaya et al. (2015) only considered a special situation of the following Theorem 3: $|Z \cap \tilde{X}| = l$ and $\tilde{X} = X$.

Before stating the theorem, let us define some important quantities:
\[
\psi(p(\tilde{X}), W) = \frac{DY}{p(\tilde{X})} - \frac{(1-D)Y}{1-p(\tilde{X})},
\]
\[
\psi^*(p(\tilde{X}), W) = \frac{D\{Y - m_1(\tilde{X})\}}{p(\tilde{X})} - \frac{(1-D)\{Y - m_0(\tilde{X})\}}{1-p(\tilde{X})} + m_1(\tilde{X}) - m_0(\tilde{X}).
\]

Theorem 3. Suppose all the conditions in the Appendix are satisfied. Then the following statements hold for each point $z$ in the support of $Z$.

(1) When $|Z \cap \tilde{X}| = t < l$ with $s_2\{2 - l/(l-t)\} + l > 0$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_N(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi(p(\tilde{X}_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_N(z)$ is $\sqrt{nh^l}(\hat{\tau}_N(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_N(z))$.

(2) When $|Z \cap \tilde{X}| = l$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_N(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi^*(p(\tilde{X}_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_N(z)$ is $\sqrt{nh^l}(\hat{\tau}_N(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_N^*(z))$.
Here $s_2$ is the order of $L(\cdot)$, $\Sigma_N(z) = \|K\|_2^2\,\sigma_N^2(z)/f(z)$, $\Sigma_N^*(z) = \|K\|_2^2\,\sigma_N^{*2}(z)/f(z)$, $\sigma_N^2(z) = E[\{\psi(p(\tilde{X}), W) - \tau(z)\}^2 \mid Z = z]$ and $\sigma_N^{*2}(z) = E[\{\psi^*(p(\tilde{X}), W) - \tau(z)\}^2 \mid Z = z]$.

Similarly to IPW-S, we also propose the estimators of $\Sigma_N(z)$ and $\Sigma_N^*(z)$ under $|Z \cap \tilde{X}| < l$ and $|Z \cap \tilde{X}| = l$, respectively, as
\[
\hat{\Sigma}_N(z) = \frac{\|K\|_2^2\,\hat{\sigma}_N^2(z)}{\hat{f}(z)} \quad \text{and} \quad \hat{\Sigma}_N^*(z) = \frac{\|K\|_2^2\,\hat{\sigma}_N^{*2}(z)}{\hat{f}(z)}, \tag{12}
\]
where $\hat{\sigma}_N^2(z)$ and $\hat{\sigma}_N^{*2}(z)$ are estimators of $\sigma_N^2(z)$ and $\sigma_N^{*2}(z)$ with
\[
\hat{\sigma}_N^2(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi(\hat{p}, W_i) - \hat{\tau}_N(z))^2 K_i}{\hat{f}(z)}, \quad \hat{\sigma}_N^{*2}(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi^*(\hat{p}, W_i) - \hat{\tau}_N(z))^2 K_i}{\hat{f}(z)},
\]
\[
\psi(\hat{p}, W_i) = \frac{D_i Y_i}{\hat{p}(\tilde{X}_i)} - \frac{(1-D_i)Y_i}{1-\hat{p}(\tilde{X}_i)},
\]
and
\[
\psi^*(\hat{p}, W_i) = \frac{D_i\{Y_i - \hat{m}_1(\tilde{X}_i)\}}{\hat{p}(\tilde{X}_i)} - \frac{(1-D_i)\{Y_i - \hat{m}_0(\tilde{X}_i)\}}{1-\hat{p}(\tilde{X}_i)} + \hat{m}_1(\tilde{X}_i) - \hat{m}_0(\tilde{X}_i),
\]
with $\hat{m}_1(\tilde{x}) = \sum_{\{t: D_t = 1\}} L_{h_1}(\tilde{X}_t - \tilde{x}) Y_t \big/ \sum_{\{t: D_t = 1\}} L_{h_1}(\tilde{X}_t - \tilde{x})$ and $\hat{m}_0(\tilde{x}) = \sum_{\{t: D_t = 0\}} L_{h_1}(\tilde{X}_t - \tilde{x}) Y_t \big/ \sum_{\{t: D_t = 0\}} L_{h_1}(\tilde{X}_t - \tilde{x})$ being the estimators of $m_1(\tilde{X})$ and $m_0(\tilde{X})$. Further, we can show the consistency of the proposed asymptotic variance function estimators in the following theorem. Theorem 4.
Suppose all the conditions in the Appendix are satisfied. Then $\hat{\Sigma}_N(z) = \Sigma_N(z) + o_p(1)$ and $\hat{\Sigma}_N^*(z) = \Sigma_N^*(z) + o_p(1)$. Remark 2.
Based on Theorem 4, we can also get a consistent estimator of the standard error of $\sqrt{nh^l}(\hat{\tau}_N(z) - \tau(z))$ and construct a pointwise confidence interval of $\tau(z)$ based on $\hat{\tau}_N(z)$. However, to decide the proper form of the confidence interval, we first need to estimate the true active arguments $\tilde{X}$ of the propensity score, denoting the corresponding estimator by $\hat{X}$, which can be done by a variable selection method. To be specific, if $|Z \cap \hat{X}| \approx l$, the pointwise confidence interval can be constructed as $\hat{\tau}_N(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_N^*(z))^{1/2}$; otherwise, we would construct it as $\hat{\tau}_N(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_N(z))^{1/2}$.

When $\tilde{X} = X$, as proved by Abrevaya et al. (2015), IPW-N can be asymptotically more efficient than IPW-P:
\[
\sigma_P^2(z) = \sigma_N^{*2}(z) + E\left[\, p(X)\{1 - p(X)\} \left\{\frac{m_1(X)}{p(X)} + \frac{m_0(X)}{1 - p(X)}\right\}^2 \Big|\, Z = z\right],
\]
and IPW-P $\cong$ IPW-O. Here $m_j(X) = E\{Y(j) \mid X\}$. Thus, with $p(X) = p(\tilde{X}) = q(V^\top X)$, we can give the efficiency ranking of the four estimators in the following corollary. Corollary 1.
Suppose all the assumptions and conditions in the Appendix are satisfied and $p(X) = p(\tilde{X}) = q(V^\top X)$. Then the following statements hold for each point $z$ in the support of $Z$.

Case 1: When $|Z \cap \tilde{X}| = l$ with $\tilde{X} = X$ and $|Z \cap V^\top X| = l$,
\[
\text{IPW-N} \preceq \text{IPW-S} \preceq \text{IPW-P} \cong \text{IPW-O},
\]
with
\[
\sigma_P^2(z) = \sigma_S^{*2}(z) + E\left[\, q(V^\top X)\{1 - q(V^\top X)\} \left\{\frac{m_1(V^\top X)}{q(V^\top X)} + \frac{m_0(V^\top X)}{1 - q(V^\top X)}\right\}^2 \Big|\, Z = z\right],
\]
\[
\sigma_S^{*2}(z) = \sigma_N^{*2}(z) + E\left[\, q(V^\top X)\{1 - q(V^\top X)\} \left\{\frac{\Delta m_1}{q(V^\top X)} + \frac{\Delta m_0}{1 - q(V^\top X)}\right\}^2 \Big|\, Z = z\right],
\]
where $\Delta m_j = m_j(X) - m_j(V^\top X)$. Case 2:
When $|Z \cap \tilde{X}| = l$ with $\tilde{X} = X$ but $|Z \cap V^\top X| = t$ with $0 \le t < l$,
\[
\text{IPW-N} \preceq \text{IPW-S} \cong \text{IPW-P} \cong \text{IPW-O},
\]
with $\sigma_S^2(z) = \sigma_P^2(z) = \sigma_O^2(z)$. Case 3:
When $|Z \cap \tilde{X}| = t$ with $\tilde{X} \subsetneq X$ and $|Z \cap V^\top X| = t$ with $0 \le t < l$,
\[
\text{IPW-N} \cong \text{IPW-S} \cong \text{IPW-P} \cong \text{IPW-O},
\]
with $\sigma_N^2(z) = \sigma_S^2(z) = \sigma_P^2(z) = \sigma_O^2(z)$.

Remark 3. In Case 1, equality holds in the first inequality when both $m_1(V^\top X)$ and $m_0(V^\top X)$ equal zero, and equality holds in the second inequality when $m_j(X) = m_j(V^\top X)$ for $j = 0, 1$. A sufficient condition for $m_j(X) = m_j(V^\top X)$ is that $E(Y(j) \mid X)$ depends on $X$ only through $V^\top X$, meaning that $Y(1)$ and $Y(0)$ share the same central mean subspace. Remark 4.
Here, we discuss another special case of Corollary 1: $V^\top X = Z$, so that $q(V^\top X) = p(Z)$. It follows that IPW-S $\preceq$ IPW-P with
\[
\sigma_P^2(z) = \sigma_S^{*2}(z) + p(z)\{1 - p(z)\}\left[\frac{m_1(z)}{p(z)} + \frac{m_0(z)}{1 - p(z)}\right]^2.
\]
Similarly,
IPW-N $\preceq$ IPW-P if $\tilde{X} = Z$:
\[
\sigma_P^2(z) = \sigma_N^{*2}(z) + p(z)\{1 - p(z)\}\left[\frac{m_1(z)}{p(z)} + \frac{m_0(z)}{1 - p(z)}\right]^2.
\]
Thus, if $Z = V^\top X = \tilde{X}$, we have $\sigma_S^{*2}(z) = \sigma_N^{*2}(z) \le \sigma_P^2(z)$. Remark 5.
Although IPW-S cannot be more efficient than IPW-N in theory, it has an obvious advantage due to its dimension reduction structure. This can be very useful in practice: when $X$ is of high dimension, IPW-N is hard to use, as it has to adopt a very high order kernel function and delicately chosen bandwidths. The numerical studies in the next section show that even in the low-dimensional settings, IPW-S can perform better than IPW-N in some cases. Thus, in the numerical studies with the high dimension $k = 20$, we do not consider IPW-N.

Another issue is also relevant. Generally speaking, combining the results in Subsections 3.1 and 3.3, when the dimension reduced vector $V^\top X$ cannot fully cover the given covariates $Z$, IPW-S is less efficient. It seems that we could add $Z$ into the covariates, using $(Z, V^\top X)$, to enhance the estimation efficiency in theory. However, this makes the estimation procedure much more complicated (with a higher order kernel and more delicately selected bandwidths) and less accurate, owing to the dimension increase described above. Thus, balancing the theoretical merit and practical usefulness, we still prefer using IPW-S without adding more covariates.

From the above discussion, we find that the asymptotic efficiency comparison for IPW-type CATE estimators differs from that for IPW-type ATE estimators, because for the ATE the estimator using a nonparametrically estimated $p(X)$ $\preceq$ that using a parametrically estimated $p(X)$ $\preceq$ that using the true $p(X)$. Thus it is worthwhile to explore the reasons further. From our study, we find that this is mainly because of the different convergence rates of the estimated propensity scores under different scenarios. In the following corollary, we show that when the convergence rate of the nonparametrically estimated propensity score is fast enough, IPW-N and IPW-S can also be asymptotically equivalent to IPW-P, and hence to IPW-O.
This is the case when the propensity score function is sufficiently smooth and the kernel and bandwidths are chosen delicately to meet the conditions in Corollary 2. Corollary 2.
Suppose all the conditions in the Appendix are satisfied, and let $h_1$ denote the bandwidth and $s_1$ (resp. $s_2$) the order of the kernel $H(\cdot)$ (resp. $L(\cdot)$) used in estimating the propensity score.

(1) When $\sqrt{nh^l}\big(h_1^{s_1} + \sqrt{\log(n)/(nh_1^{r})}\big) = o(1)$, it follows that IPW-S $\cong$ IPW-P $\cong$ IPW-O.

(2) When $\sqrt{nh^l}\big(h_1^{s_2} + \sqrt{\log(n)/(nh_1^{\tilde{k}})}\big) = o(1)$, it follows that IPW-N $\cong$ IPW-P $\cong$ IPW-O.
(3) When $\sqrt{nh^l}\big(h_1^{s_1} + \sqrt{\log(n)/(nh_1^{r})}\big) = o(1)$ and $\sqrt{nh^l}\big(h_1^{s_2} + \sqrt{\log(n)/(nh_1^{\tilde{k}})}\big) = o(1)$, it follows that IPW-N $\cong$ IPW-S $\cong$ IPW-P $\cong$ IPW-O.

Remark 6. Corollary 2 implies that when the convergence rate of the estimated propensity score is fast enough, the corresponding CATE estimator is asymptotically equivalent to IPW-O, which is based on the true propensity score, regardless of the value of $|Z \cap V^\top X|$ in Theorem 1 or $|Z \cap \tilde{X}|$ in Theorem 3. In this sense, we can say that the convergence rate of the estimated propensity score dominates the role of the affiliation of $Z$ in the set of arguments of the propensity score when comparing the asymptotic efficiencies of the CATE estimators. It is well known that the convergence rate of a nonparametric estimator can be close to $n^{-1/2}$ if the estimated function is very smooth and a higher order kernel function is utilized; see Li and Racine (2007). Thus the conditions $\sqrt{nh^l}\big(h_1^{s_1} + \sqrt{\log(n)/(nh_1^{r})}\big) = o(1)$ and $\sqrt{nh^l}\big(h_1^{s_2} + \sqrt{\log(n)/(nh_1^{\tilde{k}})}\big) = o(1)$ could hold. As the choices of such kernels and bandwidths often make no sense for practical use, this investigation only serves as a theoretical exploration, with the reminder that a ``good'' estimator of the propensity score may not be helpful for constructing a ``good'' CATE estimator.
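As a quick numerical reading of the rate conditions in Corollary 2, one can compare the polynomial-in-$n$ exponents of the two terms, ignoring the slowly varying $\log n$ factor. The bandwidth exponents below are arbitrary assumptions chosen for illustration:

```python
def corollary2_exponents(l, r, s1, eta, eta1):
    """Polynomial-in-n exponents of the two terms of
    sqrt(n h^l) * (h1^{s1} + sqrt(log n / (n h1^r)))
    with h = n^{-eta} and h1 = n^{-eta1} (log factor ignored).
    Both exponents must be negative for the condition in Corollary 2(1) to hold."""
    lead = (1 - eta * l) / 2         # exponent contributed by sqrt(n h^l)
    bias = lead - s1 * eta1          # ... times the bias term h1^{s1}
    var = lead - (1 - r * eta1) / 2  # ... times the stochastic term
    return bias, var

# With l = r = 1, a sixth-order kernel H and an undersmoothed h1 = n^{-1/10},
# both exponents are negative, so the equivalence in Corollary 2(1) can hold;
# a second-order kernel (s1 = 2) would fail the bias requirement.
print(corollary2_exponents(l=1, r=1, s1=6, eta=1/5, eta1=1/10))
```

This mirrors the message of Remark 6: the equivalence is attainable only with high order kernels and carefully tuned bandwidths.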
3. Simulation study
To evaluate the finite sample performances of
IPW-S, we compare it with IPW-P, IPW-N and IPW-O. To save space, we only present the simulations with scalar $Z \in \mathbb{R}$. To make the comparisons more convincing, we consider settings with two low dimensions of $X = (Z, U_1, \cdots, U_{k-1})$, namely $k = 2$ and $k = 4$, and a higher dimension $k = 20$. In the latter, IPW-N is not included, as a very high order kernel and very delicately selected bandwidths are required and it is then very difficult to implement. Several criteria are used to evaluate the estimation efficiency: the bias (Bias); the estimated standard deviation (Est SD); and the mean squared error (MSE). As the asymptotic distributions are standard normal, we also report the proportions outside the critical values $\pm 1.96$: $P_{\pm 1.96}$. Further, to make the efficiency ranking in the finite sample setting more visible, we report, as relative efficiency, the Est SD results via dividing each
Est SD by the Est SD of IPW-O, which is used as the benchmark. Thus, when the ratio is smaller than 1, the corresponding estimator is more efficient than IPW-O.

In the low-dimensional settings, the covariates $X = (Z, U_1, \cdots, U_{k-1})$ are generated by the following procedure. When $k = 2$, $X = (Z, U)$ with $Z = \epsilon_1$ and $U = (1 + 2Z)(1 - Z) + \epsilon_2$. When $k = 4$, $X = (Z, U_1, U_2, U_3)$ are given by $Z = \epsilon_1$, $U_1 = (1 + 2Z) + \epsilon_2$, $U_2 = (1 + 2Z) + \epsilon_3$, $U_3 = (1 - Z) + \epsilon_4$, where the $\epsilon_i \sim \mathrm{unif}[-0.5, 0.5]$ are mutually independent. To easily compare the theoretical results under the parametric, nonparametric and semiparametric structures, we consider four models:

• Model 1 ($k = 2$, $r = 1$ with $|Z \cap X| = 1$ but $|Z \cap V^\top X| = 0$): $Y(1) = \beta_1^\top X + \gamma_1 ZU + \nu$, $Y(0) = 0$ and $p(X) = \Lambda\big((Z + U)/\sqrt{2}\big)$.

• Model 2 ($k = 2$, $r = 1$ with $|Z \cap \tilde{X}| = |Z \cap U| = 0$ and $|Z \cap V^\top X| = 0$): $Y(1) = \beta_1^\top X + \gamma_1 ZU + \nu$, $Y(0) = 0$ and $p(X) = \Lambda(U)$.

• Model 3 ($k = 4$, $r = 1$ with $|Z \cap X| = 1$ but $|Z \cap V^\top X| = 0$): $Y(1) = \beta_2^\top X + \gamma_2 ZU_1U_2U_3 + \nu$, $Y(0) = 0$ and $p(X) = \Lambda\big(0.5(Z + U_1 + U_2 + U_3)\big)$.

• Model 4 ($k = 4$, $r = 2$ with $|Z \cap X| = 1$ and $|Z \cap V^\top X| = 1$): $Y(1) = \beta_2^\top X + \gamma_2 ZU_1U_2U_3 + \nu$, $Y(0) = 0$ and $p(X) = \Lambda\big(\sqrt{2}\,Z \cdot (U_1 + U_2 + U_3)/\sqrt{3}\big)$.

Here $\nu$ is a centred normal random error and $\Lambda(\cdot)$ is the c.d.f. of the logistic distribution. Given that the matrix $V$ satisfies $E(D \mid X) \perp X \mid V^\top X$, we consider these four types of propensity score models to satisfy the conditions in different scenarios. Under Models 1 and 3, $V$ is proportional to $(1, \cdots, 1)^\top$ with $\dim(V) = r = 1$, $|Z \cap X| = 1$ but $|Z \cap V^\top X| = 0$. Thus, we aim to examine whether IPW-N $\preceq$ IPW-S $\cong$ IPW-P $\cong$ IPW-O. In order to examine the theoretical results in Case 3 of Corollary 1 in the finite sample scenario, we consider $p(X)$ in Model 2. In this setting, $D \perp X \mid \tilde{X} = U$, and for $V^\top = (0, 1)$, $p(X) \perp X \mid V^\top X$.
Obviously, |Z ∩ X̃| = |Z ∩ U1| = 0 and |Z ∩ V⊤X| = 0 in Model 2, so it can be used to examine whether IPW-N ≅ IPW-S ≅ IPW-P ≅ IPW-O. The p(X) in Model 4 is also set to verify the results in Corollary 1. This propensity score function has Z itself as an individual argument. Namely, p(X) ⊥ X | V⊤X, where V⊤ is a 2 × k matrix whose first row extracts Z and whose second row is proportional to (0, 1, 1, 1), so that dim(V) = r = 2, |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1. We will examine whether IPW-S can be more efficient than IPW-P and IPW-O.

As for the parameters βi, γi, i = 1, 2, we consider two scenarios:

• Scenario I: γ1 = 1 (a nonlinear outcome model for k = 2, with the first component of β1 equal to zero) and γ2 = 0 (a linear outcome model for k = 4);
• Scenario II: γ1 = 0 and γ2 = 1, reversing the two cases.

Obviously, when γi = 0 the linear model is being considered, while when γi = 1 the nonlinear model is taken into account, i = 1, 2.

Next, we determine the orders of the kernels L(·), H(·) and K(·) so as to guarantee regularity condition (C5) in the Appendix. As there is no data-driven or optimal selection method available for IPW-N and IPW-S, we use the rule of thumb suggested by Abrevaya et al. (2015) for fair comparison. The principle is to select bandwidths with proper rates of convergence of the form h = a·n^{−η} for a > 0 and η > 0. Since IPW-S can be regarded as a low-dimensional version of IPW-N, its bandwidths can be chosen by replacing k̃ with r as follows:

h1 = a1·n^{−1/(2r+δ_r+δ1)},   h = a·n^{−1/(l+4+2r+2δ_r−δ2)},   (13)

where a, a1, δ1, δ2 are positive. Since δ1 and δ2 can be taken as small as desired, we set them to zero in the simulations for simplicity. Further, the order of H is s1 = r + δ_r with δ_r = 0 for even r and δ_r = 1 for odd r, and s2 = s1 + 2 = r + δ_r + 2. Owing to the semiparametric structure, we construct a root-n consistent estimator V̂ via MAVE (Xia et al., 2002; Xia, 2007), which can be implemented in the R package MAVE. Further, to examine the performances fairly, the parameters s and h for K(u) are the same for all four CATE estimators. Since r ≤ k̃, the choices of s and h for IPW-N can be used for all the CATE estimators. Taking all this into account, the corresponding bandwidths are summarized in Table 1.
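The rates in (13) can be sketched as follows. This is a minimal illustration, not the authors' code; the default constants a and a1 and the choice δ1 = δ2 = 0 are illustrative assumptions following the simplification described above.

```python
def rule_of_thumb_bandwidths(n, r, l=1, a=1.0, a1=1.0, delta1=0.0, delta2=0.0):
    """Rule-of-thumb bandwidths following the rates in (13).

    n  : sample size
    r  : dimension of the reduced index V'X
    l  : dimension of the conditioning covariate Z (l = 1 in the simulations)
    a, a1, delta1, delta2 : positive tuning constants; the defaults are
    illustrative only (delta1 = delta2 = 0 as in the simulations).
    """
    delta_r = 0 if r % 2 == 0 else 1  # delta_r = 0 for even r, 1 for odd r
    # index-smoothing bandwidth and the bandwidth for smoothing over Z
    h1 = a1 * n ** (-1.0 / (2 * r + delta_r + delta1))
    h = a * n ** (-1.0 / (l + 4 + 2 * r + 2 * delta_r - delta2))
    s1 = r + delta_r      # order of the kernel H
    s2 = s1 + 2           # order of the kernel K
    return h, h1, s1, s2
```

Both bandwidths shrink as n grows, and a larger reduced dimension r slows the rates, which reflects why the semiparametric estimator depends on r rather than on the full dimension k̃.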
Table 1: The order of bandwidths in the simulations (δ1 = δ2 = 0). For each k̃ ∈ {1, 2, 4}, h and h1 follow the rates in (13), with the exponents determined by k̃ for IPW-N and by r for IPW-S, and h2 = a2·n^{−1/(2r+δ_r)}.

As for the tuning constants a, a1, a2, we consider two groups of values, Group 1 and Group 2. Specifically, we estimate τ(Z) at Z ∈ {−0.4, −0.2, 0, 0.2, 0.4}. The sample sizes are n = 500 and n = 1000. We choose the Gaussian kernel and higher-order kernels derived from it throughout this section. Further, we point out that the estimated propensity score is trimmed to lie in an interval bounded away from 0 and 1, as Abrevaya et al. (2015) did. We summarize the observations from the simulation results reported in Tables 2-9. To save space, we report only the relative efficiency results for Group 1 under Scenario I; see Figure 1. We have the following observations.

Observation 1.
As expected, a larger sample size leads to smaller bias and standard deviation in most cases. When k = 4 and the sample size goes from n = 500 up to n = 1000, the bias and variance reductions are more significant, and the empirical values of P1.96 and P−1.96 are closer to the nominal level 0.025. This implies that the normal approximation works well.

Observation 2.
When the dimension k increases to 4, the Bias, Est SD and MSE also increase. Further, the dimension does have an impact on the performance of IPW-N. In the tables under Model 1 with k = 2, IPW-N is uniformly more efficient than all the others. But under the models with k = 4, especially with k = 4 and r = 2, the superiority of IPW-N becomes less pronounced. We can see that IPW-S can even be more efficient than IPW-N sometimes, mainly due to its dimension reduction structure.

Observation 3.
Taking into account all the simulation results in Tables 2-9, the estimated standard deviation (Est SD) of all the CATE estimators increases as z gets close to the boundary of the support of Z. This phenomenon is mainly caused by the nonparametric estimation of the CATE function with respect to Z. Note that IPW-O also involves nonparametric estimation of the conditional expectation over Z, so the boundary effect occurs for it as well. Further, empirically, the Est SD of IPW-O often increases relatively more quickly than that of IPW-P or IPW-S when z is close to the boundary, in the cases we discuss in Observation 4 below. Figure 1, on the relative efficiency compared with IPW-O, shows this, although for different models IPW-O has different relative efficiency against IPW-P at the boundary. Recalling that, for ATE, estimators with estimated unknowns can in general be more efficient than the one using the true propensity score, the lower relative efficiency of IPW-O in finite samples is understandable even in cases where the asymptotic efficiencies are equivalent. On the other hand, in terms of the original values, the differences between IPW-O and IPW-P are not significant.

Observation 4. We also check the effect caused by the inclusion of the given covariate Z in the set of arguments of the propensity score for IPW-S and IPW-N. Under Model 1 or Model 3 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0, Figure 1 shows that the Est SD of IPW-N is uniformly smaller than those of the other CATE estimators, while IPW-S and IPW-P have similar performance. This coincides with the theory. In contrast, under Model 2 where |Z ∩ X̃| = |Z ∩ U1| = 0, IPW-N loses its efficiency superiority and performs similarly to IPW-S and IPW-P. Under Model 4 where |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1, IPW-S outperforms IPW-O and IPW-P, and can even be comparable with IPW-N sometimes. These results also coincide with the theory in Corollary 1.

Figure 1: The relative efficiency in Est SD against that of IPW-O for the methods IPW-S, IPW-N, IPW-P and IPW-O (panels: Models 1-4; rows: n = 500, 1000) under Group 1 and Scenario I.

Consider now models with much higher-dimensional X: k = dim(X) = 20. As IPW-N obviously suffers from the curse of dimensionality and does not work at all here, we focus only on IPW-O, IPW-P and IPW-S. To examine the corresponding finite-sample performances, we consider model settings similar to Models 3 and 4 with uniform Z, but with more zero coefficients for ease of comparison.

Given X = (Z, U1, ..., U_{k−1}), X is generated by a uniform Z on a symmetric interval around zero, U1 = (1 + 2Z) + e1, U2 = (1 + 2Z) + e2 and U3 = (−Z) + e3, with mutually independent uniform errors e_j, j = 1, 2, 3. The other variables U_j are generated as U_j = |Z/(11 − j)| − |ε/j| for 3 < j ≤ 9 and U_j = |Z/(21 − j)| − |ε/j| for 9 < j ≤ 19, where ε is a uniform error. We consider the following models in the high-dimensional setting.

• Model 5 (r = 1, uniform Z, |Z ∩ V⊤X| = 0): Y(1) = β⊤X + γZU1U2U3 + ν, Y(0) = 0 and p(X) = Λ(1 + V⊤X).
• Model 6 (r = 2, uniform Z, |Z ∩ V⊤X| = 1): Y(1) = β⊤X + γZU1U2U3 + ν, Y(0) = 0, and p(X) = Λ(g(Ṽ⊤X)).

As for the propensity score, V⊤ is a fixed unit vector whose entries consist of three constant blocks (negative, positive and zero entries), α̃ is defined analogously with a leading zero entry, and g(Ṽ⊤X) = (1 + α̃⊤X)/(1 + Z) with dim(Ṽ⊤X) = r = 2, while |Z ∩ V⊤X| = 1. In the high-dimensional setting we only consider the nonlinear outcome model, with the first component of β equal to zero and γ = 1. The sample size is taken to be n = 500. We estimate τ(Z) at Z ∈ {−0.4, −0.2, 0, 0.2, 0.4} with 500 simulation replications. For the bandwidths we adopt the same rule as in (13) of Experiment 1, giving h = a·n^{−1/(l+4+2r+2δ_r)} and h2 = a2·n^{−1/(2r+δ_r)}.
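The high-dimensional covariate generation described above can be sketched as below. This is a hedged illustration: the endpoints of the uniform supports are garbled in the source and are assumed here to be ±0.5, and the formulas for U1, U2, U3 are taken literally from the text.

```python
import numpy as np

def gen_covariates(n, k=20, seed=0):
    """Generate X = (Z, U_1, ..., U_{k-1}) for the high-dimensional designs
    (4 <= k <= 20). The Unif(-0.5, 0.5) supports are an illustrative
    assumption; the source text elides the actual endpoints.
    """
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-0.5, 0.5, n)
    e = rng.uniform(-0.5, 0.5, (3, n))
    eps = rng.uniform(-0.5, 0.5, n)
    U = np.empty((k - 1, n))
    U[0] = (1 + 2 * Z) + e[0]
    U[1] = (1 + 2 * Z) + e[1]
    U[2] = (-Z) + e[2]
    for j in range(4, k):              # remaining coordinates U_4, ..., U_{k-1}
        if j <= 9:
            U[j - 1] = np.abs(Z / (11 - j)) - np.abs(eps / j)
        else:                          # 9 < j <= 19
            U[j - 1] = np.abs(Z / (21 - j)) - np.abs(eps / j)
    return np.column_stack([Z, U.T])   # n x k design matrix
```

Only Z and U1, U2, U3 carry signal in the outcome model; the remaining coordinates are weakly dependent noise directions, which is what makes the design a stress test for the curse of dimensionality.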
Consider two groups of the tuning constants {a, a2}, Group 1 and Group 2. For the kernel function in the estimated propensity score, we also use the Gaussian kernel and higher-order kernels derived from it, since the distribution of X is bounded. All the original simulation results are reported in Table S.5 in the Supplement, and the relative efficiency results are plotted in Figure 2. From the simulation results, we have the following findings.

1). The high dimensionality of X has relatively weak influence on IPW-S. All the values of Bias, Est SD and MSE are rather stable, and the values of P±1.96 stay close to the nominal value as the dimension of X goes from 4 up to 20, especially in the case of |Z ∩ V⊤X| = 1. This is very informative because it implies that IPW-S can largely avoid the curse of dimensionality due to its dimension reduction structure.

2). IPW-S not only shows its superiority in dealing with the curse of dimensionality, but also inherits the efficiency advantage that IPW-N enjoys in low-dimensional cases. Under such high-dimensional scenarios, the values of Est SD and MSE of IPW-S are smaller than those of the parametric competitors in some cases even when |Z ∩ V⊤X| = 0. In Model 6, IPW-S is uniformly more efficient than IPW-P. This is consistent with the theoretical results in Corollary 1, since in this model |Z ∩ V⊤X| = 1 and IPW-S is then asymptotically more efficient than IPW-P.

Figure 2: The relative efficiency in Est SD against that of IPW-O under the high-dimensional setting for the methods IPW-S, IPW-P and IPW-O (left: Model 5, k = 20, r = 1; right: Model 6, k = 20, r = 2; rows: Groups 1 and 2).
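The "higher-order kernels derived from the Gaussian kernel" used throughout can be illustrated by the standard fourth-order Gaussian-based kernel K(u) = (3 − u²)φ(u)/2, where φ is the standard normal density: it integrates to one, its moments of orders 1-3 vanish, and its fourth moment is nonzero. This particular construction is a common textbook choice and is shown here as an assumption, not necessarily the authors' exact kernel.

```python
import numpy as np

def gauss_order4(u):
    """Fourth-order kernel built from the standard Gaussian density phi:
    integrates to 1, has zero moments of orders 1-3, nonzero fourth moment."""
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - u ** 2) * phi

# numerical check of the moment conditions on a fine grid
u = np.linspace(-10.0, 10.0, 200001)
w = u[1] - u[0]
m0 = np.sum(gauss_order4(u)) * w            # total mass, approximately 1
m2 = np.sum(u ** 2 * gauss_order4(u)) * w   # second moment, approximately 0
```

Such kernels take negative values, which is why the higher-order-kernel estimate of the propensity score can leave [0, 1] and is trimmed in the simulations.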
4. Data Analysis
In this section, we consider a dataset collected by Ichino et al. (2008), which is publicly available at http://qed.econ.queensu.ca/jae/2008-v23.3/ichino-mealli-nannicini/. We apply the proposed method to estimate the CATE of temporary work agency (TWA) assignment on permanent employment over worker's age.

We first introduce some details about the dataset. Restricting the sample to workers in Tuscany aged 17-39, the resulting sample size is n = 901, part of whom were on a TWA assignment during the first semester of 2001. The binary treatment variable D indicates whether (D = 1) or not (D = 0) the individual was on a TWA assignment during the first six months of 2001. The outcome Y is a dummy variable: Y = 1 if the subject is permanently employed at the end of 2002, and Y = 0 otherwise. We choose the worker's age as Z and the set of 25 covariates adopted by Ichino et al. (2008) as X to guarantee the unconfoundedness assumption. The covariates describe demographic characteristics, family background, educational achievements and work experience (see Table 1 in Ichino et al. (2008)). This dataset was first analysed by Ichino et al. (2008), who estimated the average treatment effect on the treated, ATT = E(Y(1) − Y(0) | D = 1), and showed that a TWA can increase the probability of obtaining permanent employment. They also pointed out that the TWA effect is heterogeneous between individuals older than 30 and younger than 30.

In order to capture the heterogeneity of the TWA effect across age more specifically, we estimate the CATE function τ(Z) over the interval between ages 20 and 35. As the number of covariates is large (25), we use a semiparametric single-index model to estimate the propensity score, so that the dimensionality problem and the model misspecification problem can be greatly alleviated. Given that D ⊥ X | α⊤X, we obtain IPW-S and the pointwise confidence band of τ(Z) by carrying out the estimation procedure proposed in Subsection 2.3. For the nonparametric estimation, we use the Gaussian kernel and choose the bandwidths by the rule of thumb: proportional to σ̂·n^{−η} for smoothing over Z and to σ̂_d·n^{−η} for the index, with the latter much smaller than the former, where σ̂ = √var(Z) and σ̂_d = √var(α̂⊤X). We also compute IPW-P as a benchmark to analyse the TWA effect over worker's age.

Figure 3 presents IPW-S and IPW-P as functions of worker's age over the range 20 to 35, which can be regarded as an extension of Ichino et al. (2008) in a certain sense. The 95% pointwise confidence bands of IPW-S and IPW-P are also reported in Figure 3. There are several points we want to highlight. 1). Both IPW-S and IPW-P suggest that, from age 20 to 35, a TWA assignment uniformly increases the probability of finding a stable job; that is, a worker with TWA experience is more likely to get a permanent job. This finding is in accordance with, and extends, the conclusion of Ichino et al. (2008). 2). The trend of τ(Z) varies with worker's age and has two peaks with a trough in between. This implies that the TWA experience has different effects for workers on either side of the trough, which was also similarly discussed by Ichino et al. (2008). However, comparing the details of the two curves, the effect of TWA on finding a stable job for the younger subpopulation is greater than for the older one along the IPW-S curve, while the opposite holds along the IPW-P curve. The IPW-S curve seems to provide a more reasonable explanation of the TWA effect: younger individuals receiving TWA could have a better chance of getting a stable job than older individuals who need to receive TWA.
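The estimation procedure used above can be sketched in a simplified form: the outcome is transformed by inverse probability weighting and then smoothed over Z with a kernel, in the spirit of Abrevaya et al. (2015). This is a schematic illustration assuming a known (or already-estimated) propensity score and a plain second-order Gaussian kernel; the paper's actual IPW-S estimator additionally estimates p(X) semiparametrically through the index α⊤X and uses higher-order kernels.

```python
import numpy as np

def ipw_cate(z_grid, Z, D, Y, pscore, h, trim=0.01):
    """IPW-based CATE estimate tau(z) on a grid of z values.

    Z, D, Y : arrays of the covariate of interest, treatment (0/1), outcome;
    pscore  : array of propensity scores p(X_i), true or estimated;
    h       : bandwidth for Nadaraya-Watson smoothing over Z;
    trim    : propensity scores are clipped away from 0 and 1 (the exact
              trimming interval used in the paper is not reproduced here).
    """
    p = np.clip(pscore, trim, 1.0 - trim)
    # Horvitz-Thompson transform: E[Y* | Z = z] = tau(z) under unconfoundedness
    y_star = D * Y / p - (1 - D) * Y / (1 - p)
    tau = np.empty(len(z_grid))
    for i, z in enumerate(z_grid):
        w = np.exp(-0.5 * ((Z - z) / h) ** 2)    # Gaussian kernel weights
        tau[i] = np.sum(w * y_star) / np.sum(w)  # local constant (NW) estimate
    return tau
```

With the true propensity score this corresponds to IPW-O; plugging in a parametric, fully nonparametric or semiparametric estimate of p(X) yields IPW-P, IPW-N or IPW-S, respectively.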
Figure 3: The curves of conditional average treatment effects (CATE) over worker's age, estimated by IPW-S (left) and IPW-P (right), with 95% pointwise confidence bands.

5. Conclusion

In this paper, we propose an estimator (IPW-S) of the conditional average treatment effect with a semiparametrically estimated propensity score and investigate its asymptotic properties, which can be used to construct pointwise confidence intervals. We give a relatively complete picture of the asymptotic efficiencies of the estimators with nonparametrically estimated, parametrically estimated and true propensity scores when the model is correctly specified. Further, when the dimension of the covariates is high, the numerical studies demonstrate the advantages of IPW-S in alleviating the curse of dimensionality while inheriting the theoretical efficiency superiority of IPW-N. A challenging topic is how to develop a good uniform confidence band for the whole function τ(z), although a Bonferroni-type band could be applied. Another research topic concerns the situation in which not all of the covariates are important for the propensity score. By incorporating variable selection, we could simultaneously identify important confounders and guarantee the unconfoundedness assumption; dimension reduction with variable selection has been investigated by, for example, Ma et al. (2019) under a sparsity structure. We will try to develop a computationally inexpensive algorithm for this purpose and study its asymptotic behaviour. Another topic is model misspecification even when the semiparametric model is used. We will study the relevant asymptotic behaviours in the near future.

Acknowledgement
The authors' research was supported by NSFC grants (NSFC11671042, NSFC11601227) and grants from the University Grants Council of Hong Kong.
Supplementary material
The supplementary file contains the detailed proofs of the theorems and corollaries.

Appendix: Technical conditions
The following regularity conditions are required for the theoretical results.

(C1) (Strong ignorability)
(i) Unconfoundedness: (Y(0), Y(1)) ⊥ D | X.
(ii) Common support: for some small c > 0, c < p(X) < 1 − c.

(C2) (On distributions)
(i) The support χ of the k-dimensional covariate vector X is a Cartesian product of compact intervals.
(ii) The density function f(Z) of Z and the density function of X are bounded away from zero and infinity and are s ≥ r times continuously differentiable.

(C3) (Conditional moments and smoothness)
(i) sup_{x∈χ} E[Y(j) | X = x] < ∞ for j = 0, 1.
(ii) The functions m_j(V⊤X) = E[Y(j) | V⊤X], j = 0, 1, are s ≥ r times continuously differentiable.

(C4) (On kernel functions)
(i) L(u) is a kernel of order s, is symmetric around zero, has compact support [−1, 1]^k̃, and is continuously differentiable.
(ii) H(u) is a kernel of order s1, is symmetric around zero, has compact support [−1, 1]^r, and is continuously differentiable.
(iii) K(u) is of order s, is symmetric around zero, and is s times continuously differentiable.

(C5) (On bandwidths)
(i) h → 0, nh^l → ∞, nh^{2s+l} → 0.
(ii) h1, h2 → 0, log(n)/(nh1^{r+s1}) → 0, and log(n)/(nh2^{k̃+s2}) → 0.
(iii) h_i^{s_i} h^{−s_i−l} → 0 and nh^l h_i^{2s_i} → 0, i = 1, 2.

(C6) (On the dimension reduction structure) The dimension r of V is given, and V̂ − V = O_p(n^{−1/2}).

Recall the definition of a higher-order kernel. A function g: R^r → R is a kernel of order s if it integrates to one over R^r, ∫ u1^{p1} ··· u_r^{p_r} g(u) du = 0 for all nonnegative integers p1, ..., p_r such that 0 < Σ_i p_i < s, and this integral is nonzero when Σ_i p_i = s.

References
P. R. Rosenbaum, D. B. Rubin, The central role of the propensity score in observational studies for causal effects, Biometrika 70 (1983) 41-55.
K. Hirano, G. W. Imbens, G. Ridder, Efficient estimation of average treatment effects using the estimated propensity score, Econometrica 71 (2003) 1161-1189.
R. K. Crump, V. J. Hotz, G. W. Imbens, O. A. Mitnik, Nonparametric tests for treatment effect heterogeneity, Rev. Econom. Statist. 90 (2008) 389-405.
S. Wager, S. Athey, Estimation and inference of heterogeneous treatment effects using random forests, J. Amer. Statist. Assoc. 113 (2018) 1228-1242.
J. Abrevaya, Y.-C. Hsu, R. P. Lieli, Estimating conditional average treatment effects, J. Bus. Econom. Statist. 33 (2015) 485-505.
S. Lee, R. Okui, Y.-J. Whang, Doubly robust uniform confidence band for the conditional average treatment effect function, J. Appl. Econometrics 32 (2017) 1207-1225.
J. M. Robins, A. Rotnitzky, L. P. Zhao, Estimation of regression coefficients when some regressors are not always observed, J. Amer. Statist. Assoc. 89 (1994) 846-866.
K.-C. Li, Sliced inverse regression for dimension reduction, J. Amer. Statist. Assoc. 86 (1991) 316-327.
W. Luo, Y. Zhu, D. Ghosh, On estimating regression-based causal effects using sufficient dimension reduction, Biometrika 104 (2017) 51-65.
S. Ma, L. Zhu, Z. Zhang, C.-L. Tsai, R. J. Carroll, A robust and efficient approach to causal inference based on sparse sufficient dimension reduction, Ann. Statist. 47 (2019) 1505-1535.
R. D. Cook, B. Li, Dimension reduction for conditional mean in regression, Ann. Statist. 30 (2002) 455-474.
Y. Xia, H. Tong, W. K. Li, L.-X. Zhu, An adaptive estimation of dimension reduction space, J. R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002) 363-410.
Y. Xia, A constructive approach to the estimation of dimension reduction directions, Ann. Statist. 35 (2007) 2654-2690.
Q. Li, J. S. Racine, Nonparametric Econometrics: Theory and Practice, Princeton University Press, Princeton, NJ, 2007.
A. Ichino, F. Mealli, T. Nannicini, From temporary help jobs to permanent employment: what can we learn from matching estimators and their sensitivity?, J. Appl. Econometrics 23 (2008) 305-327.

Table 2: Simulation results (Bias) under Model 1 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 1 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0262, 0.0607, 0.0553, -0.0278, -0.1072; IPW-P: -0.0262, 0.0607, 0.0553, -0.0276, -0.1068; IPW-N: -0.0263, 0.0608, 0.0566, -0.0233, -0.0991; IPW-S: -0.0269, 0.0600, 0.0555, -0.0257, -0.1035.
n = 1000: IPW-N: -0.0254, 0.0551, 0.0499, -0.0186, -0.0888; IPW-S: -0.0261, 0.0543, 0.0492, -0.0206, -0.0931.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0267, 0.0600, 0.0554, -0.0258, -0.1037; n = 1000: -0.0259, 0.0543, 0.0491, -0.0207, -0.0932.

Table 3: Simulation results (Bias) under Model 1 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 0 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0027, -0.0006, 0.0174, 0.0147, -0.0504; IPW-P: 0.0021, -0.0010, 0.0173, 0.0148, -0.0503; IPW-N: 0.0059, 0.0005, 0.0169, 0.0135, -0.0518; IPW-S: 0.0000, -0.0030, 0.0156, 0.0136, -0.0512.
n = 1000: IPW-N: 0.0061, 0.0008, 0.0157, 0.0147, -0.0458; IPW-S: 0.0007, -0.0020, 0.0147, 0.0147, -0.0453.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0006, -0.0034, 0.0155, 0.0136, -0.0512; n = 1000: 0.0003, -0.0022, 0.0146, 0.0147, -0.0453.

Table 4: Simulation results (Bias) under Model 2 with |Z ∩ U1| = 0 and |Z ∩ V⊤X| = 0 and γ = 1 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0269, 0.0603, 0.0550, -0.0282, -0.1080; IPW-P: -0.0269, 0.0603, 0.0551, -0.0278, -0.1072; IPW-N: -0.0272, 0.0601, 0.0557, -0.0263, -0.1052; IPW-S: -0.0277, 0.0596, 0.0554, -0.0263, -0.1049.
n = 1000: IPW-N: -0.0264, 0.0541, 0.0492, -0.0206, -0.0935; IPW-S: -0.0267, 0.0538, 0.0490, -0.0210, -0.0944.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0275, 0.0597, 0.0553, -0.0264, -0.1049; n = 1000: -0.0266, 0.0538, 0.0490, -0.0210, -0.0944.

Table 5: Simulation results (Bias) under Model 2 with |Z ∩ U1| = 0 and |Z ∩ V⊤X| = 0 and γ = 0 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0015, -0.0014, 0.0172, 0.0148, -0.0504; IPW-P: 0.0015, -0.0014, 0.0171, 0.0147, -0.0504; IPW-N: 0.0010, -0.0022, 0.0161, 0.0137, -0.0513; IPW-S: -0.0005, -0.0033, 0.0154, 0.0133, -0.0515.
n = 1000: IPW-N: 0.0017, -0.0015, 0.0148, 0.0147, -0.0453; IPW-S: 0.0000, -0.0026, 0.0142, 0.0144, -0.0454.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0010, -0.0035, 0.0154, 0.0133, -0.0514; n = 1000: -0.0003, -0.0027, 0.0142, 0.0144, -0.0453.

Table 6: Simulation results (Bias) under Model 3 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 1 (k = 4, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0271, 0.0610, 0.0556, -0.0280, -0.1086; IPW-P: -0.0271, 0.0611, 0.0557, -0.0279, -0.1083; IPW-N: -0.0280, 0.0604, 0.0563, -0.0247, -0.1026; IPW-S: -0.0270, 0.0614, 0.0577, -0.0213, -0.0967.
n = 1000: IPW-N: -0.0270, 0.0563, 0.0517, -0.0201, -0.0940; IPW-S: -0.0260, 0.0570, 0.0521, -0.0199, -0.0935.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0270, 0.0614, 0.0577, -0.0215, -0.0969; n = 1000: -0.0259, 0.0570, 0.0521, -0.0199, -0.0936.

Table 7: Simulation results (Bias) under Model 3 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 0 (k = 4, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0333, 0.0042, -0.0019, -0.0067, -0.0228; IPW-P: 0.0332, 0.0043, -0.0018, -0.0066, -0.0228; IPW-N: 0.0338, 0.0041, -0.0026, -0.0072, -0.0232; IPW-S: 0.0384, 0.0069, -0.0015, -0.0071, -0.0232.
n = 1000: IPW-N: 0.0316, 0.0018, -0.0039, -0.0070, -0.0221; IPW-S: 0.0331, 0.0030, -0.0032, -0.0066, -0.0218.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: 0.0381, 0.0068, -0.0015, -0.0071, -0.0232; n = 1000: 0.0329, 0.0029, -0.0032, -0.0066, -0.0218.

Table 8: Simulation results (Bias) under Model 4 with |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1 and γ = 1 (k = 4, r = 2).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0260, 0.0621, 0.0565, -0.0277, -0.1087; IPW-P: -0.0258, 0.0623, 0.0567, -0.0274, -0.1082; IPW-N: -0.0254, 0.0626, 0.0573, -0.0254, -0.1046; IPW-S: -0.0247, 0.0631, 0.0575, -0.0254, -0.1046.
n = 1000: IPW-N: -0.0255, 0.0572, 0.0515, -0.0218, -0.0966; IPW-S: -0.0248, 0.0578, 0.0521, -0.0205, -0.0943.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0247, 0.0631, 0.0575, -0.0254, -0.1045; n = 1000: -0.0249, 0.0577, 0.0521, -0.0205, -0.0943.

Table 9: Simulation results (Bias) under Model 4 with |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1 and γ = 0 (k = 4, r = 2).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0339, 0.0050, -0.0009, -0.0057, -0.0224; IPW-P: 0.0343, 0.0053, -0.0008, -0.0055, -0.0222; IPW-N: 0.0330, 0.0040, -0.0019, -0.0064, -0.0228; IPW-S: 0.0372, 0.0064, -0.0010, -0.0060, -0.0224.
n = 1000: IPW-N: 0.0303, 0.0011, -0.0038, -0.0066, -0.0219; IPW-S: 0.0340, 0.0033, -0.0032, -0.0068, -0.0223.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: 0.0370, 0.0063, -0.0011, -0.0061, -0.0224; n = 1000: 0.0337, 0.0031, -0.0033, -0.0068, -0.0223.

Table 10: Simulation results (Bias) under the high-dimensional setting, dim(X) = 20, n = 500, Group 1.
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4:
|Z ∩ V⊤X| = 0 (r = 1): IPW-O: -0.0275, 0.0598, 0.0554, -0.0271, -0.1064; IPW-P: -0.0273, 0.0601, 0.0556, -0.0270, -0.1062; IPW-S: -0.0254, 0.0612, 0.0538, -0.0349, -0.1209.
|Z ∩ V⊤X| = 1 (r = 2): IPW-O: -0.0274, 0.0599, 0.0553, -0.0272, -0.1065; IPW-P: -0.0270, 0.0603, 0.0559, -0.0261, -0.1046; IPW-S: -0.0245, 0.0617, 0.0536, -0.0353, -0.1208.