On IPW-based estimation of conditional average treatment effect
Niwen Zhou$^a$, Lixing Zhu$^{a,b,*}$

$^a$ School of Statistics, Beijing Normal University, Beijing, China
$^b$ Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
Abstract
This paper provides a systematic investigation of the asymptotic behaviours of four inverse probability weighting (IPW)-based estimators of the conditional average treatment effect, with nonparametrically, semiparametrically, parametrically estimated and true propensity score, respectively. To this end, we first pay particular attention to a semiparametric dimension reduction structure, under which the semiparametric-based estimator alleviates the curse of dimensionality and largely avoids model misspecification. We also derive further properties of the existing estimator with nonparametrically estimated propensity score. Based on their asymptotic variance functions, the study reveals the general ranking of their asymptotic efficiencies; the scenarios in which asymptotic equivalence can hold; and the critical roles of the affiliation of the given covariates in the set of arguments of the propensity score, and of the bandwidth and kernel selections. The results show an essential difference from IPW-based estimation of the (unconditional) average treatment effect (ATE). Numerical studies are carried out to examine the performances of the estimators; they indicate that in high-dimensional paradigms the semiparametric-based estimator performs well in general, whereas the nonparametric-based estimator, and sometimes even the parametric-based estimator, is more affected by dimensionality. A real data example is analysed for illustration.

$^*$ The authors gratefully acknowledge two grants from the University Grants Council of Hong Kong and a NSFC grant (NSFC11671042).
$^{**}$ Corresponding author
Email address: [email protected] (Lixing Zhu)
Preprint submitted to Elsevier

Keywords:
Dimension reduction, Heterogeneous treatment effect, Propensity score
1. Introduction
Treatment effects have been widely analyzed by economists and statisticians in diverse fields. In this paper, we focus on estimating treatment effects under the potential outcomes framework and the unconfoundedness assumption with a binary treatment. Let $D = 0, 1$ mean that the individual does not receive or receives the treatment, and let the response $Y$ be the corresponding potential outcome $Y(0)$ or $Y(1)$. To conveniently identify the quantities measuring treatment effects, the unconfoundedness assumption in Rosenbaum and Rubin (1983) is generally considered; that is, the assignment to treatment is independent of the potential outcomes given a $k$-dimensional vector $X$ of covariates, i.e.
\[
(Y(0), Y(1)) \perp D \mid X. \tag{1}
\]
Further, we consider the dimension of $X$ to be fixed throughout this paper, although in some cases it can be high. As $Y(0)$ and $Y(1)$ cannot be simultaneously observed for any individual, the observed outcome can be written as $Y = DY(1) + (1-D)Y(0)$. Since estimating the $i$-th individual treatment effect $Y_i(1) - Y_i(0)$ is unrealistic, an important trend in the literature is to estimate the average treatment effect (ATE): $\mu = E(Y(1) - Y(0))$. See for instance Rosenbaum and Rubin (1983) and Hirano et al. (2003). Recently, there has been increasing interest in estimating conditional (or heterogeneous) average treatment effects:
\[
CATE(X) = E(Y(1) - Y(0) \mid X),
\]
which is designed to reflect how treatment effects vary across different subpopulations.\footnote{Although in recent years the phrase ``high dimension'' is usually associated with $k$ diverging with the sample size, when we say $X$ is of high dimension in this paper we only mean that $X$ contains many, but a fixed number of, covariates. For ease of explanation, we still use the phrase ``high dimension'' whenever no confusion will be caused.} Note that even when $ATE = 0$, the treatment can still be effective for a subpopulation defined by specific observable characteristics, i.e. for some $x$ such that $CATE(x) \neq 0$. Thus heterogeneous treatment effects are more informative and can play important roles in personalized medicine or policy intervention. Most existing estimation methods for heterogeneous treatment effects condition on the full set of variables $X$, see e.g. Crump et al. (2008) and Wager and Athey (2018), where the multivariate $X$ is designed to make the unconfoundedness assumption plausible. After 2015, researchers have considered estimating more general conditional/heterogeneous treatment effects, in which the conditioning covariates $Z$ form a subset of the covariates, i.e. $X = (Z^\top, U^\top)^\top \in \mathbb{R}^l \times \mathbb{R}^m$, $k = l + m < \infty$. See e.g. Abrevaya et al. (2015) and Lee et al. (2017). Note that treatment effects conditioning on a subset of $X$, rather than on the high dimensional $X$ itself, provide desirable flexibility and can help in making policy decisions. Based on the assumption (1), Abrevaya et al. (2015) used the inverse probability weighting (IPW)-based method, which is popularly used in the literature (Robins et al., 1994), to estimate
\[
CATE(Z) = E[Y(1) - Y(0) \mid Z]
\]
when the propensity score function is estimated parametrically (IPW-P) and nonparametrically (IPW-N). Abrevaya et al. (2015) gave a deep investigation of the asymptotic properties of these estimators. There are two main conclusions in Abrevaya et al. (2015): one is that IPW-N can be asymptotically more efficient than IPW-P, in the sense that the asymptotic variance function of IPW-N can be uniformly smaller than that of IPW-P; the other is that the asymptotic variance function of IPW-P equals that of IPW-O, which is defined as the oracle estimator with the true propensity score. It is noteworthy that the latter conclusion differs from that for IPW-type ATE estimators, because the IPW-type ATE estimator based on a parametrically estimated propensity score can be more efficient than the one with the true propensity score.

As is known, to make the unconfoundedness assumption plausible, we often need to include many covariates in the analysis. Thus we say $X \in \mathbb{R}^k$ is of high dimension with $k < \infty$. In this case, on the one hand, it is often not easy to choose a parametric specification that sufficiently captures all the important nonlinear and interaction effects required for IPW-P. On the other hand, any nonparametric estimation of the propensity score clearly suffers from the curse of dimensionality, and then IPW-N no longer works.

Therefore, in this paper we suggest a semiparametric IPW-based $CATE(Z)$ estimation procedure to simultaneously alleviate the propensity score misspecification problem and, particularly, the curse of dimensionality. To this end, we consider a semiparametric dimension reduction structure for the propensity score, under which the unconfoundedness assumption (1) has a dimension reduction version.
It is worth pointing out that the general nonparametric structure can be regarded as a special case of the dimension reduction structure we consider, with an orthonormal projection matrix of full rank. We call the estimator IPW-S and give the details of the model setting and the estimation procedure in the next section.

For theoretical development, we will give the asymptotically linear representation and asymptotic normality of IPW-S. We will also give some further properties of the existing IPW-N in Abrevaya et al. (2015). Based on these theoretical studies, we give a systematic comparison of the asymptotic efficiencies amongst IPW-O, IPW-P, IPW-S and IPW-N. Combining the results of Abrevaya et al. (2015) and the further properties of IPW-N we derive in this paper, the comparison reveals some very interesting and important phenomena. Specifically, letting $A \preceq B$ mean that the asymptotic variance of estimator $A$ is not greater than that of estimator $B$, and $A \cong B$ stand for $A$ having the same asymptotic variance function as $B$, we have the following observations in theory.

First, in general IPW-N $\preceq$ IPW-S $\preceq$ IPW-P $\cong$ IPW-O.

Second, the affiliation of $Z$ to the set of arguments of the propensity score plays an important role in the asymptotic efficiency of IPW-S and IPW-N. That is, when $Z$ is a subset of the arguments of the propensity score, IPW-S $\preceq$ IPW-P $\cong$ IPW-O and IPW-N $\preceq$ IPW-P $\cong$ IPW-O; otherwise, IPW-N $\cong$ IPW-S $\cong$ IPW-P $\cong$ IPW-O. This newly found phenomenon provides a deep insight into the performances of IPW-S and IPW-N, which is also useful in practice.

Third, when the propensity score function is smooth enough, even in general cases we can have the asymptotic equivalence by carefully choosing the bandwidths and using high order kernel functions: IPW-N $\cong$ IPW-S $\cong$ IPW-P $\cong$ IPW-O.
This also gives us a better understanding of the asymptotic performances of the different estimators. Of course, this part mainly serves as a theoretical exploration. For practical use, we would have no interest in wilfully choosing those kernel functions and bandwidths, which are very difficult to implement and can worsen the estimator's performance. But it reminds researchers that a ``good'' estimator of the propensity score need not be helpful for the performance of the $CATE$ estimator.

Fourth, owing to the dimension reduction structure of $p(X)$, the requirements on the bandwidths and the order of the kernel function for IPW-S are much milder than those for IPW-N. Thus when the dimension is high, even though IPW-N has superior efficiency in theory, IPW-S is preferable.

The rest of the paper is organized as follows. In Section 2, we first introduce the estimation procedure for IPW-S; we also investigate its asymptotic properties and give the theoretical comparisons between the four $CATE$ estimators. Section 3 contains some numerical studies to examine the performance of the
$CATE$ estimators. In Section 4, we apply the $CATE$ estimators to analyse a real data set for illustration. Section 5 contains some conclusions and a further discussion. The regularity conditions are listed in the Appendix and all technical proofs are relegated to the Supplementary Materials to save space.
2. Semiparametric estimation procedure and asymptotic properties
Assume that the covariates $X = (Z^\top, U^\top)^\top$ are absolutely continuous. Under the unconfoundedness assumption (1), recall that the $CATE$ function $\tau(z)$ can be rewritten as
\[
\tau(z) = E\left[\frac{DY}{p(X)} - \frac{(1-D)Y}{1-p(X)} \,\Big|\, Z = z\right], \quad Z \in \mathbb{R}^l. \tag{2}
\]
If $p(X)$ is given, we can estimate $\tau(z)$ immediately via the Nadaraya-Watson kernel method by regarding $DY/p(X) - (1-D)Y/\{1-p(X)\}$ as the response:
\[
\hat{\tau}_O(z) = \sum_{i=1}^{n}\left[\frac{D_i Y_i}{p(X_i)} - \frac{(1-D_i)Y_i}{1-p(X_i)}\right] K_h(Z_i - z) \Big/ \sum_{i=1}^{n} K_h(Z_i - z).
\]
Here $K(\cdot)$ is a multivariate kernel function, $K_h(u) = h^{-l}K(u/h)$ and $l = \dim(Z)$. This $CATE$ estimator is
the IPW-O we mentioned before. Based on existing results for nonparametric estimation, it is easy to derive the asymptotic distribution of IPW-O, which will be used as the benchmark for comparisons among all estimators studied in this paper. Proposition 1.
Suppose the conditions (C1)-(C4) in the Appendix are satisfied. Then the following statement holds for each point $z$ in the support of $Z$:
\[
\sqrt{nh^l}\,(\hat{\tau}_O(z) - \tau(z)) \xrightarrow{D} N\left(0, \frac{\|K\|_2^2\,\sigma_O^2(z)}{f(z)}\right).
\]
Here
\[
\sigma_O^2(z) = E\left(\Big[\frac{DY}{p(X)} - \frac{(1-D)Y}{1-p(X)} - \tau(z)\Big]^2 \,\Big|\, Z = z\right).
\]
When $p(X)$ is an unknown function, we first estimate $p(X)$ and then define a final $CATE$ estimator $\hat{\tau}(z)$. We propose the estimator under the semiparametric structure below.
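Before introducing the semiparametric structure, the oracle estimator $\hat{\tau}_O(z)$ above can be sketched in a few lines. This is a minimal illustration, not the authors' code; the data arrays, the Gaussian product kernel and the scalar bandwidth are our own assumptions:

```python
import numpy as np

def gauss_kernel(u):
    """Product Gaussian kernel K(u) for u of shape (n, l)."""
    l = u.shape[-1]
    return np.exp(-0.5 * np.sum(u**2, axis=-1)) / (2.0 * np.pi) ** (l / 2)

def ipw_o(z, Z, d, y, prop, h):
    """Oracle IPW-based CATE estimate at z with the true propensity score.

    Z : (n,) or (n, l) conditioning covariates; d : (n,) treatment indicators;
    y : (n,) observed outcomes; prop : (n,) true scores p(X_i); h : bandwidth."""
    Z = np.atleast_2d(Z.T).T                       # ensure shape (n, l)
    psi = d * y / prop - (1 - d) * y / (1 - prop)  # DY/p - (1-D)Y/(1-p)
    w = gauss_kernel((Z - z) / h)                  # K_h weights up to the h^{-l} factor
    return np.sum(psi * w) / np.sum(w)             # Nadaraya-Watson ratio
```

Note that the $h^{-l}$ factor of $K_h$ cancels in the Nadaraya-Watson ratio, so it is omitted.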
Assume the propensity score has a semiparametric dimension reduction structure:
\[
p(X) = q(V^\top X), \tag{3}
\]
where both the link function $q(\cdot)$ and the $r$ projection directions in $V$ are unknown, with $V$ being a $k \times r$ orthonormal matrix. This structure is general: it covers the structures of some important semiparametric models such as single-index models. From the definition of the propensity score, (3) implies that the indicator $D$ depends on $X$ only through the projected variable $V^\top X$. Thus, we can use the following conditional independence to present the above semiparametric structure:
\[
D \perp X \mid V^\top X. \tag{4}
\]
It follows that $(Y(0), Y(1)) \perp D \mid V^\top X$. We call the intersection of all subspaces spanned by matrices $V$ satisfying the above independence the central subspace; see Li (1991). Usually $V$ can only be identified up to a rotation matrix $C$; that is, $V^* = V \times C$ can be identified. As this identification issue does not affect the related estimation of $p(X)$, we still use $V$ without confusion. Relevant references are Luo et al. (2017) and Ma et al. (2019). This is a dimension reduction framework, so the corresponding estimation is less affected by the curse of dimensionality. For such a dimension reduction structure, we could also consider variable selection as Ma et al. (2019) did; but as this is not a focus of this paper, we just work on this model and assume the existence of consistent estimation later on.

If we postulate that the information about $D$ from $X$ can be completely captured by $r$ linear combinations $V^\top X$ of $X$ with $r \ll k$, the propensity score can be estimated by replacing the original $X$ with $V^\top X$. That is, we can use a lower dimensional kernel function $H(u)$ to get a nonparametric estimator $\hat{q}(\hat{V}^\top X)$ of $q(V^\top X) = E(D \mid V^\top X)$:
\[
\hat{q}(\hat{V}^\top X_i) = \frac{\sum_{j \neq i} D_j H_{h_1}(\hat{V}^\top X_j - \hat{V}^\top X_i)}{\sum_{j \neq i} H_{h_1}(\hat{V}^\top X_j - \hat{V}^\top X_i)}, \tag{5}
\]
where $h_1$ is the bandwidth, $H_{h_1}(u) = h_1^{-r} H(u/h_1)$ and $\hat{V}$ is a consistent estimator derived by a sufficient dimension reduction method. Several methods are available in the literature, such as the inverse regression methods in Cook and Li (2002) and minimum average variance estimation (MAVE) in Xia et al. (2002) and Xia (2007).

Recall that $CATE$ can be rewritten as (2). Thus, based on $\hat{q}(\hat{V}^\top X_i)$, the IPW-S estimator of $\tau(z)$ is defined as
\[
\hat{\tau}_S(z) = \sum_{i=1}^{n}\left[\frac{D_i Y_i}{\hat{q}(\hat{V}^\top X_i)} - \frac{(1-D_i)Y_i}{1-\hat{q}(\hat{V}^\top X_i)}\right] K_h(Z_i - z) \Big/ \sum_{i=1}^{n} K_h(Z_i - z). \tag{6}
\]
Since both $Z$ and $V^\top X$ are low-dimensional random vectors, $\hat{\tau}_S(z)$ can alleviate the propensity score misspecification problem and the curse of dimensionality simultaneously. In the next subsection, we investigate the asymptotic properties of $\hat{\tau}_S(z)$ and derive some further properties of the existing IPW-N under certain regularity conditions.

2.3. Asymptotic properties for IPW-S

Denote by $|A|$ the cardinality of a set $A$. We first give some notations.

• $W = (X, D, Y)$, and the observations $W_i = (X_i, D_i, Y_i)$, $i = 1, \ldots, n$, are independent copies of $W$;

• $m_j(V^\top X) = E[Y(j) \mid V^\top X]$, $j = 0, 1$, and $K_i = K((Z_i - z)/h)$;

• $\psi(q(V^\top X), W) = DY/q(V^\top X) - (1-D)Y/\{1-q(V^\top X)\}$;

• $\psi^*(q(V^\top X), W) = [D\{Y - m_1(V^\top X)\}]/q(V^\top X) - [(1-D)\{Y - m_0(V^\top X)\}]/\{1-q(V^\top X)\} + m_1(V^\top X) - m_0(V^\top X)$;

• for two vectors $A$ and $B$, we use the intersection notation $A \cap B$ to denote, without confusion, all components contained in both $A$ and $B$; $|A \cap B| = t$ stands for the number of components in the intersection of $A$ and $B$. In particular, $t = 0$ means $A \cap B = \emptyset$, and $t = |A|$ implies $A \cap B = A$.

Both $\psi(q(V^\top X), W)$ and $\psi^*(q(V^\top X), W)$ are the central parts of the influence functions for IPW-S. Theorem 1.
Suppose all the conditions in the Appendix are satisfied. Then the following statements hold for each point $z$ in the support of $Z$.

(1) When $|Z \cap V^\top X| = t < l$ with $s_1\{2 - l/(l-t)\} + l > 0$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_S(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi(q(V^\top X_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_S(z)$ is $\sqrt{nh^l}(\hat{\tau}_S(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_S(z))$.

(2) When $|Z \cap V^\top X| = l$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_S(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi^*(q(V^\top X_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_S(z)$ is $\sqrt{nh^l}(\hat{\tau}_S(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_S^*(z))$.

Here $s_1$ is the order of $H(\cdot)$, $\Sigma_S(z) = \|K\|_2^2\,\sigma_S^2(z)/f(z)$ and $\Sigma_S^*(z) = \|K\|_2^2\,\sigma_S^{*2}(z)/f(z)$, with $\sigma_S^2(z) = E[\{\psi(q(V^\top X), W) - \tau(z)\}^2 \mid Z = z]$ and $\sigma_S^{*2}(z) = E[\{\psi^*(q(V^\top X), W) - \tau(z)\}^2 \mid Z = z]$. Remark 1.
These results show a very interesting and somewhat unexpected phenomenon: the asymptotic behaviour of $\hat{\tau}_S(z)$ also depends on whether some elements of $Z$ belong to $V^\top X$. Recall that $|Z \cap V^\top X| = t$ means that $t$ elements of $Z$ are also $t$ of the linear combinations in $V^\top X$; i.e. we can rewrite $V^\top X = (Z_1, \cdots, Z_t, (\tilde{V}^\top X)^\top)^\top$ with
\[
V^\top = \begin{pmatrix} \tilde{I}_{t \times k} \\ \tilde{V}^\top \end{pmatrix}, \quad \tilde{I}_{t \times k} = (I_{t \times t}, \mathbf{0}_{t \times (k-t)}).
\]
Here $I_{t \times t}$ is an identity matrix and $\tilde{V}^\top$ is an $(r-t) \times k$ matrix. The asymptotic behaviours with $t = l$ and with any $0 \le t < l$ are very different. A natural question is whether we can, if possible, choose a dimension reduced vector $V^\top X$ such that IPW-S works best. As the question is related to IPW-P and IPW-N, we will have a detailed discussion in Subsection 3.3 below.

Next, we present the estimators of $\Sigma_S(z)$ and $\Sigma_S^*(z)$ under $|Z \cap V^\top X| < l$ and $|Z \cap V^\top X| = l$, respectively, as
\[
\hat{\Sigma}_S(z) = \frac{\|K\|_2^2\,\hat{\sigma}_S^2(z)}{\hat{f}(z)} \quad \text{and} \quad \hat{\Sigma}_S^*(z) = \frac{\|K\|_2^2\,\hat{\sigma}_S^{*2}(z)}{\hat{f}(z)}, \tag{7}
\]
where $\hat{\sigma}_S^2(z)$ and $\hat{\sigma}_S^{*2}(z)$ are estimators of $\sigma_S^2(z)$ and $\sigma_S^{*2}(z)$ with
\[
\hat{\sigma}_S^2(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi(\hat{q}, W_i) - \hat{\tau}_S(z))^2 K_i}{\hat{f}(z)}, \quad \hat{\sigma}_S^{*2}(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi^*(\hat{q}, W_i) - \hat{\tau}_S(z))^2 K_i}{\hat{f}(z)},
\]
$\hat{f}(z) = \sum_{i=1}^{n} K_h(Z_i - z)/n$ a kernel-based estimator of $f(z)$,
\[
\psi(\hat{q}, W_i) = \frac{D_i Y_i}{\hat{q}(\hat{V}^\top X_i)} - \frac{(1-D_i)Y_i}{1-\hat{q}(\hat{V}^\top X_i)},
\]
and
\[
\psi^*(\hat{q}, W_i) = \frac{D_i\{Y_i - \hat{m}_1(\hat{V}^\top X_i)\}}{\hat{q}(\hat{V}^\top X_i)} - \frac{(1-D_i)\{Y_i - \hat{m}_0(\hat{V}^\top X_i)\}}{1-\hat{q}(\hat{V}^\top X_i)} + \hat{m}_1(\hat{V}^\top X_i) - \hat{m}_0(\hat{V}^\top X_i),
\]
with
\[
\hat{m}_1(\hat{V}^\top x) = \frac{\sum_{\{t: D_t = 1\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x) Y_t}{\sum_{\{t: D_t = 1\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x)}, \quad \hat{m}_0(\hat{V}^\top x) = \frac{\sum_{\{t: D_t = 0\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x) Y_t}{\sum_{\{t: D_t = 0\}} H_{h_1}(\hat{V}^\top X_t - \hat{V}^\top x)}
\]
being the estimators of $m_1(V^\top X)$ and $m_0(V^\top X)$. Further, we can state the consistency of the proposed estimators in the following theorem. Theorem 2.
Suppose all the conditions in the Appendix are satisfied. Then
\[
\hat{\Sigma}_S(z) = \Sigma_S(z) + o_p(1) \quad \text{and} \quad \hat{\Sigma}_S^*(z) = \Sigma_S^*(z) + o_p(1).
\]
By Theorem 2, we can obtain a pointwise consistent estimator of the standard error of $\sqrt{nh^l}(\hat{\tau}_S(z) - \tau(z))$, so that we are able to construct a $(1-\alpha)100\%$ pointwise confidence interval for $\tau(z)$, i.e.
\[
\hat{\tau}_S(z) \pm (nh^l)^{-1/2}\, c_{\alpha/2}\, \big(\hat{\Sigma}_S(z)\big)^{1/2} \tag{8}
\]
or
\[
\hat{\tau}_S(z) \pm (nh^l)^{-1/2}\, c_{\alpha/2}\, \big(\hat{\Sigma}_S^*(z)\big)^{1/2}, \tag{9}
\]
with $c_{\alpha/2}$ being the $(1-\alpha/2)$ quantile of the standard normal distribution. Note that the specific formula of the confidence interval depends on whether $|Z \cap V^\top X| < l$ or $|Z \cap V^\top X| = l$. One possible way to choose between (8) and (9) is based on the value of $|Z \cap \hat{V}^\top X|$.

To be specific, taking MAVE (Xia et al., 2002) as an example dimension reduction method, we propose an estimation and inference procedure for $\tau(z)$ based on IPW-S by carrying out the following steps.

Step 1: Obtain the estimator of $V$ by solving the minimization problem
\[
\min_{V, a, b} \sum_{i,j=1}^{n} \{D_i - a_j - b_j^\top V^\top (X_i - X_j)\}^2 \omega_{ij}.
\]
Here $\omega_{ij} = H_{h_1}\{V^\top (X_i - X_j)\} \big/ \sum_{l=1}^{n} H_{h_1}\{V^\top (X_l - X_j)\}$, $a = (a_1, \ldots, a_n)$, $b = (b_1, \ldots, b_n)$. Denote the resulting estimator by $\hat{V}$.

Step 2: Given $\hat{V}$, estimate the propensity score $E(D \mid \hat{V}^\top X)$ via (5).

Step 3: Obtain the semiparametric CATE estimator $\hat{\tau}_S(z)$ via (6).

Step 4: Given $Z = z$, a $(1-\alpha)$ pointwise confidence interval for the true CATE $\tau(z)$ can be constructed as follows. If $|Z \cap \hat{V}^\top X| \approx l$, the confidence interval of $\tau(z)$ is constructed in the form of (9), i.e. $\hat{\tau}_S(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_S^*(z))^{1/2}$; otherwise, the pointwise confidence interval of $\tau(z)$ is constructed in the form of (8), that is, $\hat{\tau}_S(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_S(z))^{1/2}$.

Note that the first step can be implemented using the R package MAVE.
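The four-step procedure above can be sketched as follows. This is a hedged illustration for the case $r = 1$ and scalar $Z$ ($l = 1$): since MAVE is available in R rather than Python, Step 1 is replaced by a simple sliced-inverse-regression direction as a stand-in, and the bandwidths and clipping levels are arbitrary assumptions:

```python
import numpy as np
from statistics import NormalDist

def sir_direction(X, d):
    """One sufficient-dimension-reduction direction for binary D via sliced
    inverse regression (a stand-in for MAVE, which the paper uses in R)."""
    Sigma = np.cov(X.T)
    mu_diff = X[d == 1].mean(axis=0) - X[d == 0].mean(axis=0)
    v = np.linalg.solve(Sigma, mu_diff)
    return v / np.linalg.norm(v)

def kern(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def ipw_s(z, Z, X, d, y, h1, h, alpha=0.05):
    """IPW-S estimate of tau(z) with a pointwise normal confidence interval.

    Sketch for r = 1 and scalar Z (l = 1); h1 is the propensity bandwidth,
    h the CATE bandwidth.  Returns (tau_hat, (lower, upper))."""
    n = len(d)
    v = sir_direction(X, d)
    t = X @ v                                    # estimated index V^T X
    # Step 2: leave-one-out kernel estimate of q(V^T X_i) = E(D | V^T X_i)
    Kt = kern((t[:, None] - t[None, :]) / h1)
    np.fill_diagonal(Kt, 0.0)
    q = Kt @ d / Kt.sum(axis=1)
    q = np.clip(q, 0.05, 0.95)                   # guard against extreme weights
    # Step 3: Nadaraya-Watson smoothing of the IPW pseudo-response
    psi = d * y / q - (1 - d) * y / (1 - q)
    w = kern((Z - z) / h)
    tau = np.sum(psi * w) / np.sum(w)
    # Step 4: plug-in variance ||K||^2 sigma^2(z)/f(z); ||K||^2 = 1/(2 sqrt(pi))
    fz = w.mean() / h                            # kernel density estimate of f(z)
    sigma2 = np.sum((psi - tau) ** 2 * w) / (n * h) / fz
    Sigma = sigma2 / (2.0 * np.sqrt(np.pi)) / fz
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(Sigma / (n * h))
    return tau, (tau - half, tau + half)
```

The variance here corresponds to (7) with the Gaussian kernel, for which $\|K\|_2^2 = 1/(2\sqrt{\pi})$; the switch between the forms (8) and (9) according to $|Z \cap \hat{V}^\top X|$ is omitted in this sketch.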
Based on this estimation and inference procedure, the empirical analysis in Section 4 can be implemented.
Recall that the IPW-N estimator proposed by Abrevaya et al. (2015) is
\[
\hat{\tau}_N(z) = \sum_{i=1}^{n}\left[\frac{D_i Y_i}{\hat{p}(X_i)} - \frac{(1-D_i)Y_i}{1-\hat{p}(X_i)}\right] K_h(Z_i - z) \Big/ \sum_{i=1}^{n} K_h(Z_i - z), \tag{10}
\]
with $\hat{p}(X_i) = \sum_{j \neq i} D_j L_{h_1}(X_j - X_i) \big/ \sum_{j \neq i} L_{h_1}(X_j - X_i)$. Here $L(\cdot)$ is also a multivariate kernel function with $L_{h_1}(\cdot) = h_1^{-k} L(\cdot/h_1)$, and $h_1$ is the corresponding bandwidth.

As the asymptotic properties of IPW-S are influenced by the affiliation of $Z$, we analyse in this paper the asymptotic properties of IPW-N in different scenarios, similarly to Theorem 1. Suppose
\[
D \perp X \mid \tilde{X}, \quad \tilde{X} \subseteq X, \quad \tilde{k} = \dim(\tilde{X}) \le k. \tag{11}
\]
To extend the asymptotic results on IPW-N in Abrevaya et al. (2015), we derive the following theorem, which also confirms the influence of the affiliation of $Z$ to $\tilde{X}$ on the asymptotic properties of IPW-N. Abrevaya et al. (2015) only considered a special situation of the following Theorem 3: $|Z \cap \tilde{X}| = l$ and $\tilde{X} = X$.

Before stating the theorem, let us define some important quantities:
\[
\psi(p(\tilde{X}), W) = \frac{DY}{p(\tilde{X})} - \frac{(1-D)Y}{1-p(\tilde{X})},
\]
\[
\psi^*(p(\tilde{X}), W) = \frac{D\{Y - m_1(\tilde{X})\}}{p(\tilde{X})} - \frac{(1-D)\{Y - m_0(\tilde{X})\}}{1-p(\tilde{X})} + m_1(\tilde{X}) - m_0(\tilde{X}).
\]

Theorem 3. Suppose all the conditions in the Appendix are satisfied. Then the following statements hold for each point $z$ in the support of $Z$.

(1) When $|Z \cap \tilde{X}| = t < l$ with $s_2\{2 - l/(l-t)\} + l > 0$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_N(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi(p(\tilde{X}_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_N(z)$ is $\sqrt{nh^l}(\hat{\tau}_N(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_N(z))$.

(2) When $|Z \cap \tilde{X}| = l$, the asymptotically linear representation is
\[
\sqrt{nh^l}\,(\hat{\tau}_N(z) - \tau(z)) = \frac{1}{\sqrt{nh^l}\,f(z)} \sum_{i=1}^{n} [\psi^*(p(\tilde{X}_i), W_i) - \tau(z)] K_i + o_p(1),
\]
and the asymptotic distribution of $\hat{\tau}_N(z)$ is $\sqrt{nh^l}(\hat{\tau}_N(z) - \tau(z)) \xrightarrow{D} N(0, \Sigma_N^*(z))$.
Here $s_2$ is the order of $L(\cdot)$, $\Sigma_N(z) = \|K\|_2^2\,\sigma_N^2(z)/f(z)$, $\Sigma_N^*(z) = \|K\|_2^2\,\sigma_N^{*2}(z)/f(z)$, $\sigma_N^2(z) = E[\{\psi(p(\tilde{X}), W) - \tau(z)\}^2 \mid Z = z]$ and $\sigma_N^{*2}(z) = E[\{\psi^*(p(\tilde{X}), W) - \tau(z)\}^2 \mid Z = z]$.

Similarly to IPW-S, we also propose the estimators of $\Sigma_N(z)$ and $\Sigma_N^*(z)$ under $|Z \cap \tilde{X}| < l$ and $|Z \cap \tilde{X}| = l$, respectively, as
\[
\hat{\Sigma}_N(z) = \frac{\|K\|_2^2\,\hat{\sigma}_N^2(z)}{\hat{f}(z)} \quad \text{and} \quad \hat{\Sigma}_N^*(z) = \frac{\|K\|_2^2\,\hat{\sigma}_N^{*2}(z)}{\hat{f}(z)}, \tag{12}
\]
where $\hat{\sigma}_N^2(z)$ and $\hat{\sigma}_N^{*2}(z)$ are estimators of $\sigma_N^2(z)$ and $\sigma_N^{*2}(z)$ with
\[
\hat{\sigma}_N^2(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi(\hat{p}, W_i) - \hat{\tau}_N(z))^2 K_i}{\hat{f}(z)}, \quad \hat{\sigma}_N^{*2}(z) = \frac{1}{nh^l}\sum_{i=1}^{n} \frac{(\psi^*(\hat{p}, W_i) - \hat{\tau}_N(z))^2 K_i}{\hat{f}(z)},
\]
\[
\psi(\hat{p}, W_i) = \frac{D_i Y_i}{\hat{p}(\tilde{X}_i)} - \frac{(1-D_i)Y_i}{1-\hat{p}(\tilde{X}_i)},
\]
and
\[
\psi^*(\hat{p}, W_i) = \frac{D_i\{Y_i - \hat{m}_1(\tilde{X}_i)\}}{\hat{p}(\tilde{X}_i)} - \frac{(1-D_i)\{Y_i - \hat{m}_0(\tilde{X}_i)\}}{1-\hat{p}(\tilde{X}_i)} + \hat{m}_1(\tilde{X}_i) - \hat{m}_0(\tilde{X}_i),
\]
with $\hat{m}_1(\tilde{x}) = \sum_{\{t: D_t = 1\}} L_{h_1}(\tilde{X}_t - \tilde{x}) Y_t \big/ \sum_{\{t: D_t = 1\}} L_{h_1}(\tilde{X}_t - \tilde{x})$ and $\hat{m}_0(\tilde{x}) = \sum_{\{t: D_t = 0\}} L_{h_1}(\tilde{X}_t - \tilde{x}) Y_t \big/ \sum_{\{t: D_t = 0\}} L_{h_1}(\tilde{X}_t - \tilde{x})$ being the estimators of $m_1(\tilde{X})$ and $m_0(\tilde{X})$. Further, we can show the consistency of the proposed asymptotic variance function estimators in the following theorem. Theorem 4.
Suppose all the conditions in the Appendix are satisfied. Then $\hat{\Sigma}_N(z) = \Sigma_N(z) + o_p(1)$ and $\hat{\Sigma}_N^*(z) = \Sigma_N^*(z) + o_p(1)$. Remark 2.
Based on Theorem 4, we can also get a consistent estimator of the standard error of $\sqrt{nh^l}(\hat{\tau}_N(z) - \tau(z))$ and construct a pointwise confidence interval of $\tau(z)$ based on $\hat{\tau}_N(z)$. However, to decide the proper form of the confidence interval, we first need to estimate the true active arguments $\tilde{X}$ of the propensity score, denoting the corresponding estimator by $\hat{X}$, which can be done by a variable selection method. To be specific, if $|Z \cap \hat{X}| \approx l$, the pointwise confidence interval can be constructed as $\hat{\tau}_N(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_N^*(z))^{1/2}$; otherwise, we would construct it as $\hat{\tau}_N(z) \pm (nh^l)^{-1/2} c_{\alpha/2} (\hat{\Sigma}_N(z))^{1/2}$.

When $\tilde{X} = X$, as proved by Abrevaya et al. (2015), IPW-N can be asymptotically more efficient than IPW-P:
\[
\sigma_P^2(z) = \sigma_N^{*2}(z) + E\left[\, p(X)\{1 - p(X)\} \left\{\frac{m_1(X)}{p(X)} + \frac{m_0(X)}{1 - p(X)}\right\}^2 \Big|\, Z = z\right],
\]
and IPW-P $\cong$ IPW-O. Here $m_j(X) = E\{Y(j) \mid X\}$. Thus, with $p(X) = p(\tilde{X}) = q(V^\top X)$, we can give the efficiency ranking of the four estimators in the following corollary. Corollary 1.
Suppose all the assumptions and conditions in the Appendix are satisfied and $p(X) = p(\tilde{X}) = q(V^\top X)$. Then the following statements hold for each point $z$ in the support of $Z$.

Case 1: When $|Z \cap \tilde{X}| = l$ with $\tilde{X} = X$ and $|Z \cap V^\top X| = l$,
\[
\text{IPW-N} \preceq \text{IPW-S} \preceq \text{IPW-P} \cong \text{IPW-O},
\]
with
\[
\sigma_P^2(z) = \sigma_S^{*2}(z) + E\left[\, q(V^\top X)\{1 - q(V^\top X)\} \left\{\frac{m_1(V^\top X)}{q(V^\top X)} + \frac{m_0(V^\top X)}{1 - q(V^\top X)}\right\}^2 \Big|\, Z = z\right],
\]
\[
\sigma_S^{*2}(z) = \sigma_N^{*2}(z) + E\left[\, q(V^\top X)\{1 - q(V^\top X)\} \left\{\frac{\Delta m_1}{q(V^\top X)} + \frac{\Delta m_0}{1 - q(V^\top X)}\right\}^2 \Big|\, Z = z\right],
\]
where $\Delta m_j = m_j(X) - m_j(V^\top X)$. Case 2:
When $|Z \cap \tilde{X}| = l$ with $\tilde{X} = X$ but $|Z \cap V^\top X| = t$ with $0 \le t < l$,
\[
\text{IPW-N} \preceq \text{IPW-S} \cong \text{IPW-P} \cong \text{IPW-O},
\]
with $\sigma_S^2(z) = \sigma_P^2(z) = \sigma_O^2(z)$. Case 3:
When $|Z \cap \tilde{X}| = t$ with $\tilde{X} \subsetneq X$ and $|Z \cap V^\top X| = t$ with $0 \le t < l$,
\[
\text{IPW-N} \cong \text{IPW-S} \cong \text{IPW-P} \cong \text{IPW-O},
\]
with $\sigma_N^2(z) = \sigma_S^2(z) = \sigma_P^2(z) = \sigma_O^2(z)$.

Remark 3. In Case 1, equality holds in the first inequality when both $m_1(V^\top X)$ and $m_0(V^\top X)$ equal zero, and equality holds in the second inequality when $m_j(X) = m_j(V^\top X)$ for $j = 0, 1$. A sufficient condition for $m_j(X) = m_j(V^\top X)$ is that $E(Y(j) \mid X)$ depends on $X$ only through $V^\top X$, meaning that $Y(1)$ and $Y(0)$ share the same central mean subspace. Remark 4.
Here, we discuss another special case of Corollary 1: $V^\top X = Z$, so that $q(V^\top X) = p(Z)$. It follows that IPW-S $\preceq$ IPW-P with
\[
\sigma_P^2(z) = \sigma_S^{*2}(z) + p(z)\{1 - p(z)\}\left[\frac{m_1(z)}{p(z)} + \frac{m_0(z)}{1 - p(z)}\right]^2.
\]
Similarly,
IPW-N $\preceq$ IPW-P if $\tilde{X} = Z$:
\[
\sigma_P^2(z) = \sigma_N^{*2}(z) + p(z)\{1 - p(z)\}\left[\frac{m_1(z)}{p(z)} + \frac{m_0(z)}{1 - p(z)}\right]^2.
\]
Thus, if $Z = V^\top X = \tilde{X}$, we have $\sigma_S^{*2}(z) = \sigma_N^{*2}(z) \le \sigma_P^2(z)$. Remark 5.
Although IPW-S cannot be more efficient than IPW-N in theory, it has an obvious advantage due to its dimension reduction structure. This can be very useful in practice: when $X$ is of high dimension, IPW-N is hard to use, as it has to adopt a very high order kernel function and delicately chosen bandwidths. The numerical studies in the next section show that even in the low-dimensional settings, IPW-S can perform better than IPW-N in some cases. Thus, in the numerical studies with the high dimension $k = 20$, we do not consider IPW-N.

Another issue is also relevant. Generally speaking, combining the results in Subsections 3.1 and 3.3, when the dimension reduced vector $V^\top X$ cannot fully cover the given covariates $Z$, IPW-S is less efficient. It seems that we could add $Z$ into the covariates, using $(Z, V^\top X)$, to enhance the estimation efficiency in theory. However, this makes the estimation procedure much more complicated (with a higher order kernel and more delicately selected bandwidths) and less accurate, owing to the dimension increase described above. Thus, balancing the theoretical merit and practical usefulness, we still prefer using IPW-S without adding more covariates.

From the above discussion, we find that the asymptotic efficiency comparison for IPW-type CATE estimators differs from that for IPW-type ATE estimators, because for the ATE the estimator using a nonparametrically estimated $p(X)$ $\preceq$ that using a parametrically estimated $p(X)$ $\preceq$ that using the true $p(X)$. Thus it is worthwhile to explore the reasons further. From our study, we find that this is mainly because of the different convergence rates of the estimated propensity scores under different scenarios. In the following corollary, we show that when the convergence rate of the nonparametrically estimated propensity score is fast enough, IPW-N and IPW-S can also be asymptotically equivalent to IPW-P, and hence to IPW-O.
This is the case when the propensity score function is sufficiently smooth and the kernel and bandwidths are chosen delicately to meet the conditions in Corollary 2. Corollary 2.
Suppose all the conditions in the Appendix are satisfied, and let $h_1$ denote the bandwidth and $s_1$ (resp. $s_2$) the order of the kernel $H(\cdot)$ (resp. $L(\cdot)$) used in estimating the propensity score.

(1) When $\sqrt{nh^l}\big(h_1^{s_1} + \sqrt{\log(n)/(nh_1^{r})}\big) = o(1)$, it follows that IPW-S $\cong$ IPW-P $\cong$ IPW-O.

(2) When $\sqrt{nh^l}\big(h_1^{s_2} + \sqrt{\log(n)/(nh_1^{\tilde{k}})}\big) = o(1)$, it follows that IPW-N $\cong$ IPW-P $\cong$ IPW-O.
(3) When $\sqrt{nh^l}\big(h_1^{s_1} + \sqrt{\log(n)/(nh_1^{r})}\big) = o(1)$ and $\sqrt{nh^l}\big(h_1^{s_2} + \sqrt{\log(n)/(nh_1^{\tilde{k}})}\big) = o(1)$, it follows that IPW-N $\cong$ IPW-S $\cong$ IPW-P $\cong$ IPW-O.

Remark 6. Corollary 2 implies that when the convergence rate of the estimated propensity score is fast enough, the corresponding CATE estimator is asymptotically equivalent to IPW-O, which is based on the true propensity score, regardless of the value of $|Z \cap V^\top X|$ in Theorem 1 or $|Z \cap \tilde{X}|$ in Theorem 3. In this sense, we can say that the convergence rate of the estimated propensity score dominates the role of the affiliation of $Z$ in the set of arguments of the propensity score when comparing the asymptotic efficiencies of the CATE estimators. It is well known that the convergence rate of a nonparametric estimator can be close to $n^{-1/2}$ if the estimated function is very smooth and a higher order kernel function is utilized; see Li and Racine (2007). Thus the conditions $\sqrt{nh^l}\big(h_1^{s_1} + \sqrt{\log(n)/(nh_1^{r})}\big) = o(1)$ and $\sqrt{nh^l}\big(h_1^{s_2} + \sqrt{\log(n)/(nh_1^{\tilde{k}})}\big) = o(1)$ could hold. As the choices of such kernels and bandwidths often make no sense for practical use, this investigation only serves as a theoretical exploration, with the reminder that a ``good'' estimator of the propensity score may not be helpful for constructing a ``good'' CATE estimator.
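As a quick numerical reading of the rate conditions in Corollary 2, one can compare the polynomial-in-$n$ exponents of the two terms, ignoring the slowly varying $\log n$ factor. The bandwidth exponents below are arbitrary assumptions chosen for illustration:

```python
def corollary2_exponents(l, r, s1, eta, eta1):
    """Polynomial-in-n exponents of the two terms of
    sqrt(n h^l) * (h1^{s1} + sqrt(log n / (n h1^r)))
    with h = n^{-eta} and h1 = n^{-eta1} (log factor ignored).
    Both exponents must be negative for the condition in Corollary 2(1) to hold."""
    lead = (1 - eta * l) / 2         # exponent contributed by sqrt(n h^l)
    bias = lead - s1 * eta1          # ... times the bias term h1^{s1}
    var = lead - (1 - r * eta1) / 2  # ... times the stochastic term
    return bias, var

# With l = r = 1, a sixth-order kernel H and an undersmoothed h1 = n^{-1/10},
# both exponents are negative, so the equivalence in Corollary 2(1) can hold;
# a second-order kernel (s1 = 2) would fail the bias requirement.
print(corollary2_exponents(l=1, r=1, s1=6, eta=1/5, eta1=1/10))
```

This mirrors the message of Remark 6: the equivalence is attainable only with high order kernels and carefully tuned bandwidths.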
3. Simulation study
To evaluate the finite sample performances of
IPW-S, we compare it with IPW-P, IPW-N and IPW-O. To save space, we only present the simulations with scalar $Z \in \mathbb{R}$. To make the comparisons more convincing, we consider settings with two low dimensions of $X = (Z, U_1, \cdots, U_{k-1})$, namely $k = 2$ and $k = 4$, and a higher dimension $k = 20$. In the latter, IPW-N is not included, as a very high order kernel and very delicately selected bandwidths are required and it is then very difficult to implement. Several criteria are used to evaluate the estimation efficiency: the bias (Bias); the estimated standard deviation (Est SD); and the mean squared error (MSE). As the asymptotic distributions are standard normal, we also report the proportions outside the critical values $\pm 1.96$: $P_{\pm 1.96}$. Further, to make the efficiency ranking in the finite sample setting more visible, we report, as relative efficiency, the Est SD results via dividing each
Est SD by the Est SD of IPW-O, which is used as the benchmark. Thus, when the ratio is smaller than 1, the corresponding estimator is more efficient than IPW-O.

In the low-dimensional settings, the covariates $X = (Z, U_1, \cdots, U_{k-1})$ are generated by the following procedure. When $k = 2$, $X = (Z, U)$ with $Z = \epsilon_1$ and $U = (1 + 2Z)(1 - Z) + \epsilon_2$. When $k = 4$, $X = (Z, U_1, U_2, U_3)$ are given by $Z = \epsilon_1$, $U_1 = (1 + 2Z) + \epsilon_2$, $U_2 = (1 + 2Z) + \epsilon_3$, $U_3 = (1 - Z) + \epsilon_4$, where the $\epsilon_i \sim \mathrm{unif}[-0.5, 0.5]$ are mutually independent. To easily compare the theoretical results under the parametric, nonparametric and semiparametric structures, we consider four models:

• Model 1 ($k = 2$, $r = 1$ with $|Z \cap X| = 1$ but $|Z \cap V^\top X| = 0$): $Y(1) = \beta_1^\top X + \gamma_1 ZU + \nu$, $Y(0) = 0$ and $p(X) = \Lambda\big((Z + U)/\sqrt{2}\big)$.

• Model 2 ($k = 2$, $r = 1$ with $|Z \cap \tilde{X}| = |Z \cap U| = 0$ and $|Z \cap V^\top X| = 0$): $Y(1) = \beta_1^\top X + \gamma_1 ZU + \nu$, $Y(0) = 0$ and $p(X) = \Lambda(U)$.

• Model 3 ($k = 4$, $r = 1$ with $|Z \cap X| = 1$ but $|Z \cap V^\top X| = 0$): $Y(1) = \beta_2^\top X + \gamma_2 ZU_1U_2U_3 + \nu$, $Y(0) = 0$ and $p(X) = \Lambda\big(0.5(Z + U_1 + U_2 + U_3)\big)$.

• Model 4 ($k = 4$, $r = 2$ with $|Z \cap X| = 1$ and $|Z \cap V^\top X| = 1$): $Y(1) = \beta_2^\top X + \gamma_2 ZU_1U_2U_3 + \nu$, $Y(0) = 0$ and $p(X) = \Lambda\big(\sqrt{2}\,Z \cdot (U_1 + U_2 + U_3)/\sqrt{3}\big)$.

Here $\nu$ is a centred normal random error and $\Lambda(\cdot)$ is the c.d.f. of the logistic distribution. Given that the matrix $V$ satisfies $E(D \mid X) \perp X \mid V^\top X$, we consider these four types of propensity score models to satisfy the conditions in different scenarios. Under Models 1 and 3, $V$ is proportional to $(1, \cdots, 1)^\top$ with $\dim(V) = r = 1$, $|Z \cap X| = 1$ but $|Z \cap V^\top X| = 0$. Thus, we aim to examine whether IPW-N $\preceq$ IPW-S $\cong$ IPW-P $\cong$ IPW-O. In order to examine the theoretical results in Case 3 of Corollary 1 in the finite sample scenario, we consider $p(X)$ in Model 2. In this setting, $D \perp X \mid \tilde{X} = U$, and for $V^\top = (0, 1)$, $p(X) \perp X \mid V^\top X$.
Obviously, |Z ∩ X̃| = |Z ∩ U1| = 0 and |Z ∩ V⊤X| = 0 in Model 2, so it can be used to examine whether IPW-N ≅ IPW-S ≅ IPW-P ≅ IPW-O. The p(X) in Model 4 is also set to verify the results in Corollary 1. This propensity score function has Z itself as an individual argument. Namely, p(X) ⊥ X | V⊤X, where V⊤ is a 2 × k matrix whose first row extracts Z and whose second row is proportional to (0, 1, 1, 1), so that dim(V) = r = 2, |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1. We will examine whether IPW-S can be more efficient than IPW-P and IPW-O.

As for the parameters βi, γi, i = 1, 2, we consider two scenarios:

• Scenario I: γ1 = 1 (a nonlinear outcome model for k = 2, with the first component of β1 equal to zero) and γ2 = 0 (a linear outcome model for k = 4);
• Scenario II: γ1 = 0 and γ2 = 1, reversing the two cases.

Obviously, when γi = 0 the linear model is being considered, while when γi = 1 the nonlinear model is taken into account, i = 1, 2.

Next, we determine the orders of the kernels L(·), H(·) and K(·) so as to guarantee regularity condition (C5) in the Appendix. As there is no data-driven or optimal selection method available for IPW-N and IPW-S, we use the rule of thumb suggested by Abrevaya et al. (2015) for fair comparison. The principle is to select bandwidths with proper rates of convergence of the form h = a·n^{−η} for a > 0 and η > 0. Since IPW-S can be regarded as a low-dimensional version of IPW-N, its bandwidths can be chosen by replacing k̃ with r as follows:

h1 = a1·n^{−1/(2r+δ_r+δ1)},   h = a·n^{−1/(l+4+2r+2δ_r−δ2)},   (13)

where a, a1, δ1, δ2 are positive. Since δ1 and δ2 can be taken as small as desired, we set them to zero in the simulations for simplicity. Further, the order of H is s1 = r + δ_r with δ_r = 0 for even r and δ_r = 1 for odd r, and s2 = s1 + 2 = r + δ_r + 2. Owing to the semiparametric structure, we construct a root-n consistent estimator V̂ via MAVE (Xia et al., 2002; Xia, 2007), which can be implemented in the R package MAVE. Further, to examine the performances fairly, the parameters s and h for K(u) are the same for all four CATE estimators. Since r ≤ k̃, the choices of s and h for IPW-N can be used for all the CATE estimators. Taking all this into account, the corresponding bandwidths are summarized in Table 1.
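The rates in (13) can be sketched as follows. This is a minimal illustration, not the authors' code; the default constants a and a1 and the choice δ1 = δ2 = 0 are illustrative assumptions following the simplification described above.

```python
def rule_of_thumb_bandwidths(n, r, l=1, a=1.0, a1=1.0, delta1=0.0, delta2=0.0):
    """Rule-of-thumb bandwidths following the rates in (13).

    n  : sample size
    r  : dimension of the reduced index V'X
    l  : dimension of the conditioning covariate Z (l = 1 in the simulations)
    a, a1, delta1, delta2 : positive tuning constants; the defaults are
    illustrative only (delta1 = delta2 = 0 as in the simulations).
    """
    delta_r = 0 if r % 2 == 0 else 1  # delta_r = 0 for even r, 1 for odd r
    # index-smoothing bandwidth and the bandwidth for smoothing over Z
    h1 = a1 * n ** (-1.0 / (2 * r + delta_r + delta1))
    h = a * n ** (-1.0 / (l + 4 + 2 * r + 2 * delta_r - delta2))
    s1 = r + delta_r      # order of the kernel H
    s2 = s1 + 2           # order of the kernel K
    return h, h1, s1, s2
```

Both bandwidths shrink as n grows, and a larger reduced dimension r slows the rates, which reflects why the semiparametric estimator depends on r rather than on the full dimension k̃.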
Table 1: The order of bandwidths in the simulations (δ1 = δ2 = 0). For each k̃ ∈ {1, 2, 4}, h and h1 follow the rates in (13), with the exponents determined by k̃ for IPW-N and by r for IPW-S, and h2 = a2·n^{−1/(2r+δ_r)}.

As for the tuning constants a, a1, a2, we consider two groups of values, Group 1 and Group 2. Specifically, we estimate τ(Z) at Z ∈ {−0.4, −0.2, 0, 0.2, 0.4}. The sample sizes are n = 500 and n = 1000. We choose the Gaussian kernel and higher-order kernels derived from it throughout this section. Further, we point out that the estimated propensity score is trimmed to lie in an interval bounded away from 0 and 1, as Abrevaya et al. (2015) did. We summarize the observations from the simulation results reported in Tables 2-9. To save space, we report only the relative efficiency results for Group 1 under Scenario I; see Figure 1. We have the following observations.

Observation 1.
As expected, a larger sample size leads to smaller bias and standard deviation in most cases. When k = 4 and the sample size goes from n = 500 up to n = 1000, the bias and variance reductions are more significant, and the empirical values of P1.96 and P−1.96 are closer to the nominal level 0.025. This implies that the normal approximation works well.

Observation 2.
When the dimension k increases to 4, the Bias, Est SD and MSE also increase. Further, the dimension does have an impact on the performance of IPW-N. In the tables under Model 1 with k = 2, IPW-N is uniformly more efficient than all the others. But under the models with k = 4, especially with k = 4 and r = 2, the superiority of IPW-N becomes less pronounced. We can see that IPW-S can even be more efficient than IPW-N sometimes, mainly due to its dimension reduction structure.

Observation 3.
Taking into account all the simulation results in Tables 2-9, the estimated standard deviation (Est SD) of all the CATE estimators increases as z gets close to the boundary of the support of Z. This phenomenon is mainly caused by the nonparametric estimation of the CATE function with respect to Z. Note that IPW-O also involves nonparametric estimation of the conditional expectation over Z, so the boundary effect occurs for it as well. Further, empirically, the Est SD of IPW-O often increases relatively more quickly than that of IPW-P or IPW-S when z is close to the boundary, in the cases we discuss in Observation 4 below. Figure 1, on the relative efficiency compared with IPW-O, shows this, although for different models IPW-O has different relative efficiency against IPW-P at the boundary. Recalling that, for ATE, estimators with estimated unknowns can in general be more efficient than the one using the true propensity score, the lower relative efficiency of IPW-O in finite samples is understandable even in cases where the asymptotic efficiencies are equivalent. On the other hand, in terms of the original values, the differences between IPW-O and IPW-P are not significant.

Observation 4. We also check the effect caused by the inclusion of the given covariate Z in the set of arguments of the propensity score for IPW-S and IPW-N. Under Model 1 or Model 3 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0, Figure 1 shows that the Est SD of IPW-N is uniformly smaller than those of the other CATE estimators, while IPW-S and IPW-P have similar performance. This coincides with the theory. In contrast, under Model 2 where |Z ∩ X̃| = |Z ∩ U1| = 0, IPW-N loses its efficiency superiority and performs similarly to IPW-S and IPW-P. Under Model 4 where |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1, IPW-S outperforms IPW-O and IPW-P, and can even be comparable with IPW-N sometimes. These results also coincide with the theory in Corollary 1.

Figure 1: The relative efficiency in Est SD against that of IPW-O for the methods IPW-S, IPW-N, IPW-P and IPW-O (panels: Models 1-4; rows: n = 500, 1000) under Group 1 and Scenario I.

Consider now models with much higher-dimensional X: k = dim(X) = 20. As IPW-N obviously suffers from the curse of dimensionality and does not work at all here, we focus only on IPW-O, IPW-P and IPW-S. To examine the corresponding finite-sample performances, we consider model settings similar to Models 3 and 4 with uniform Z, but with more zero coefficients for ease of comparison.

Given X = (Z, U1, ..., U_{k−1}), X is generated by a uniform Z on a symmetric interval around zero, U1 = (1 + 2Z) + e1, U2 = (1 + 2Z) + e2 and U3 = (−Z) + e3, with mutually independent uniform errors e_j, j = 1, 2, 3. The other variables U_j are generated as U_j = |Z/(11 − j)| − |ε/j| for 3 < j ≤ 9 and U_j = |Z/(21 − j)| − |ε/j| for 9 < j ≤ 19, where ε is a uniform error. We consider the following models in the high-dimensional setting.

• Model 5 (r = 1, uniform Z, |Z ∩ V⊤X| = 0): Y(1) = β⊤X + γZU1U2U3 + ν, Y(0) = 0 and p(X) = Λ(1 + V⊤X).
• Model 6 (r = 2, uniform Z, |Z ∩ V⊤X| = 1): Y(1) = β⊤X + γZU1U2U3 + ν, Y(0) = 0, and p(X) = Λ(g(Ṽ⊤X)).

As for the propensity score, V⊤ is a fixed unit vector whose entries consist of three constant blocks (negative, positive and zero entries), α̃ is defined analogously with a leading zero entry, and g(Ṽ⊤X) = (1 + α̃⊤X)/(1 + Z) with dim(Ṽ⊤X) = r = 2, while |Z ∩ V⊤X| = 1. In the high-dimensional setting we only consider the nonlinear outcome model, with the first component of β equal to zero and γ = 1. The sample size is taken to be n = 500. We estimate τ(Z) at Z ∈ {−0.4, −0.2, 0, 0.2, 0.4} with 500 simulation replications. For the bandwidths we adopt the same rule as in (13) of Experiment 1, giving h = a·n^{−1/(l+4+2r+2δ_r)} and h2 = a2·n^{−1/(2r+δ_r)}.
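The high-dimensional covariate generation described above can be sketched as below. This is a hedged illustration: the endpoints of the uniform supports are garbled in the source and are assumed here to be ±0.5, and the formulas for U1, U2, U3 are taken literally from the text.

```python
import numpy as np

def gen_covariates(n, k=20, seed=0):
    """Generate X = (Z, U_1, ..., U_{k-1}) for the high-dimensional designs
    (4 <= k <= 20). The Unif(-0.5, 0.5) supports are an illustrative
    assumption; the source text elides the actual endpoints.
    """
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-0.5, 0.5, n)
    e = rng.uniform(-0.5, 0.5, (3, n))
    eps = rng.uniform(-0.5, 0.5, n)
    U = np.empty((k - 1, n))
    U[0] = (1 + 2 * Z) + e[0]
    U[1] = (1 + 2 * Z) + e[1]
    U[2] = (-Z) + e[2]
    for j in range(4, k):              # remaining coordinates U_4, ..., U_{k-1}
        if j <= 9:
            U[j - 1] = np.abs(Z / (11 - j)) - np.abs(eps / j)
        else:                          # 9 < j <= 19
            U[j - 1] = np.abs(Z / (21 - j)) - np.abs(eps / j)
    return np.column_stack([Z, U.T])   # n x k design matrix
```

Only Z and U1, U2, U3 carry signal in the outcome model; the remaining coordinates are weakly dependent noise directions, which is what makes the design a stress test for the curse of dimensionality.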
Consider two groups of the tuning constants {a, a2}, Group 1 and Group 2. For the kernel function in the estimated propensity score, we also use the Gaussian kernel and higher-order kernels derived from it, since the distribution of X is bounded. All the original simulation results are reported in Table S.5 in the Supplement, and the relative efficiency results are plotted in Figure 2. From the simulation results, we have the following findings.

1). The high dimensionality of X has relatively weak influence on IPW-S. All the values of Bias, Est SD and MSE are rather stable, and the values of P±1.96 stay close to the nominal value as the dimension of X goes from 4 up to 20, especially in the case of |Z ∩ V⊤X| = 1. This is very informative because it implies that IPW-S can largely avoid the curse of dimensionality due to its dimension reduction structure.

2). IPW-S not only shows its superiority in dealing with the curse of dimensionality, but also inherits the efficiency advantage that IPW-N enjoys in low-dimensional cases. Under such high-dimensional scenarios, the values of Est SD and MSE of IPW-S are smaller than those of the parametric competitors in some cases even when |Z ∩ V⊤X| = 0. In Model 6, IPW-S is uniformly more efficient than IPW-P. This is consistent with the theoretical results in Corollary 1, since in this model |Z ∩ V⊤X| = 1 and IPW-S is then asymptotically more efficient than IPW-P.

Figure 2: The relative efficiency in Est SD against that of IPW-O under the high-dimensional setting for the methods IPW-S, IPW-P and IPW-O (left: Model 5, k = 20, r = 1; right: Model 6, k = 20, r = 2; rows: Groups 1 and 2).
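The "higher-order kernels derived from the Gaussian kernel" used throughout can be illustrated by the standard fourth-order Gaussian-based kernel K(u) = (3 − u²)φ(u)/2, where φ is the standard normal density: it integrates to one, its moments of orders 1-3 vanish, and its fourth moment is nonzero. This particular construction is a common textbook choice and is shown here as an assumption, not necessarily the authors' exact kernel.

```python
import numpy as np

def gauss_order4(u):
    """Fourth-order kernel built from the standard Gaussian density phi:
    integrates to 1, has zero moments of orders 1-3, nonzero fourth moment."""
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return 0.5 * (3.0 - u ** 2) * phi

# numerical check of the moment conditions on a fine grid
u = np.linspace(-10.0, 10.0, 200001)
w = u[1] - u[0]
m0 = np.sum(gauss_order4(u)) * w            # total mass, approximately 1
m2 = np.sum(u ** 2 * gauss_order4(u)) * w   # second moment, approximately 0
```

Such kernels take negative values, which is why the higher-order-kernel estimate of the propensity score can leave [0, 1] and is trimmed in the simulations.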
4. Data Analysis
In this section, we consider a dataset collected by Ichino et al. (2008), which is publicly available at http://qed.econ.queensu.ca/jae/2008-v23.3/ichino-mealli-nannicini/. We apply the proposed method to estimate the CATE of temporary work agency (TWA) assignment on permanent employment over worker's age.

We first introduce some details about the dataset. Restricting the sample to workers in Tuscany aged 17-39, the resulting sample size is n = 901, part of whom were on a TWA assignment during the first semester of 2001. The binary treatment variable D indicates whether (D = 1) or not (D = 0) the individual was on a TWA assignment during the first six months of 2001. The outcome Y is a dummy variable: Y = 1 if the subject is permanently employed at the end of 2002, and Y = 0 otherwise. We choose the worker's age as Z and the set of 25 covariates adopted by Ichino et al. (2008) as X to guarantee the unconfoundedness assumption. The covariates describe demographic characteristics, family background, educational achievements and work experience (see Table 1 in Ichino et al. (2008)). This dataset was first analysed by Ichino et al. (2008), who estimated the average treatment effect on the treated, ATT = E(Y(1) − Y(0) | D = 1), and showed that a TWA can increase the probability of obtaining permanent employment. They also pointed out that the TWA effect is heterogeneous between individuals older than 30 and younger than 30.

In order to capture the heterogeneity of the TWA effect across age more specifically, we estimate the CATE function τ(Z) over the interval between ages 20 and 35. As the number of covariates is large (25), we use a semiparametric single-index model to estimate the propensity score, so that the dimensionality problem and the model misspecification problem can be greatly alleviated. Given that D ⊥ X | α⊤X, we obtain IPW-S and the pointwise confidence band of τ(Z) by carrying out the estimation procedure proposed in Subsection 2.3. For the nonparametric estimation, we use the Gaussian kernel and choose the bandwidths by the rule of thumb: proportional to σ̂·n^{−η} for smoothing over Z and to σ̂_d·n^{−η} for the index, with the latter much smaller than the former, where σ̂ = √var(Z) and σ̂_d = √var(α̂⊤X). We also compute IPW-P as a benchmark to analyse the TWA effect over worker's age.

Figure 3 presents IPW-S and IPW-P as functions of worker's age over the range 20 to 35, which can be regarded as an extension of Ichino et al. (2008) in a certain sense. The 95% pointwise confidence bands of IPW-S and IPW-P are also reported in Figure 3. There are several points we want to highlight. 1). Both IPW-S and IPW-P suggest that, from age 20 to 35, a TWA assignment uniformly increases the probability of finding a stable job; that is, a worker with TWA experience is more likely to get a permanent job. This finding is in accordance with, and extends, the conclusion of Ichino et al. (2008). 2). The trend of τ(Z) varies with worker's age and has two peaks with a trough in between. This implies that the TWA experience has different effects for workers on either side of the trough, which was also similarly discussed by Ichino et al. (2008). However, comparing the details of the two curves, the effect of TWA on finding a stable job for the younger subpopulation is greater than for the older one along the IPW-S curve, while the opposite holds along the IPW-P curve. The IPW-S curve seems to provide a more reasonable explanation of the TWA effect: younger individuals receiving TWA could have a better chance of getting a stable job than older individuals who need to receive TWA.
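The estimation procedure used above can be sketched in a simplified form: the outcome is transformed by inverse probability weighting and then smoothed over Z with a kernel, in the spirit of Abrevaya et al. (2015). This is a schematic illustration assuming a known (or already-estimated) propensity score and a plain second-order Gaussian kernel; the paper's actual IPW-S estimator additionally estimates p(X) semiparametrically through the index α⊤X and uses higher-order kernels.

```python
import numpy as np

def ipw_cate(z_grid, Z, D, Y, pscore, h, trim=0.01):
    """IPW-based CATE estimate tau(z) on a grid of z values.

    Z, D, Y : arrays of the covariate of interest, treatment (0/1), outcome;
    pscore  : array of propensity scores p(X_i), true or estimated;
    h       : bandwidth for Nadaraya-Watson smoothing over Z;
    trim    : propensity scores are clipped away from 0 and 1 (the exact
              trimming interval used in the paper is not reproduced here).
    """
    p = np.clip(pscore, trim, 1.0 - trim)
    # Horvitz-Thompson transform: E[Y* | Z = z] = tau(z) under unconfoundedness
    y_star = D * Y / p - (1 - D) * Y / (1 - p)
    tau = np.empty(len(z_grid))
    for i, z in enumerate(z_grid):
        w = np.exp(-0.5 * ((Z - z) / h) ** 2)    # Gaussian kernel weights
        tau[i] = np.sum(w * y_star) / np.sum(w)  # local constant (NW) estimate
    return tau
```

With the true propensity score this corresponds to IPW-O; plugging in a parametric, fully nonparametric or semiparametric estimate of p(X) yields IPW-P, IPW-N or IPW-S, respectively.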
Figure 3: The curves of conditional average treatment effects (CATE) over worker's age, estimated by IPW-S (left) and IPW-P (right), with 95% pointwise confidence bands.

5. Conclusion

In this paper, we propose an estimator (IPW-S) of the conditional average treatment effect with a semiparametrically estimated propensity score and investigate its asymptotic properties, which can be used to construct pointwise confidence intervals. We give a relatively complete picture of the asymptotic efficiencies of the estimators with nonparametrically estimated, parametrically estimated and true propensity scores when the model is correctly specified. Further, when the dimension of the covariates is high, the numerical studies demonstrate the advantages of IPW-S in alleviating the curse of dimensionality while inheriting the theoretical efficiency superiority of IPW-N. A challenging topic is how to develop a good uniform confidence band for the whole function τ(z), although a Bonferroni-type band could be applied. Another research topic concerns the situation in which not all of the covariates are important for the propensity score. By incorporating variable selection, we could simultaneously identify important confounders and guarantee the unconfoundedness assumption; dimension reduction with variable selection has been investigated by, for example, Ma et al. (2019) under a sparsity structure. We will try to develop a computationally inexpensive algorithm for this purpose and study its asymptotic behaviour. Another topic is model misspecification even when the semiparametric model is used. We will study the relevant asymptotic behaviours in the near future.

Acknowledgement
The authors' research was supported by NSFC grants (NSFC11671042, NSFC11601227) and grants from the University Grants Council of Hong Kong.
Supplementary material
The supplementary file contains the detailed proofs of the theorems and corollaries.

Appendix: Technical conditions
The following regularity conditions are required for the theoretical results.

(C1) (Strong ignorability)
(i) Unconfoundedness: (Y(0), Y(1)) ⊥ D | X.
(ii) Common support: for some small c > 0, c < p(X) < 1 − c.

(C2) (On distributions)
(i) The support χ of the k-dimensional covariate vector X is a Cartesian product of compact intervals.
(ii) The density function f(Z) of Z and the density function of X are bounded away from zero and infinity and are s ≥ r times continuously differentiable.

(C3) (Conditional moments and smoothness)
(i) sup_{x∈χ} E[Y(j) | X = x] < ∞ for j = 0, 1.
(ii) The functions m_j(V⊤X) = E[Y(j) | V⊤X], j = 0, 1, are s ≥ r times continuously differentiable.

(C4) (On kernel functions)
(i) L(u) is a kernel of order s, is symmetric around zero, has compact support [−1, 1]^k̃, and is continuously differentiable.
(ii) H(u) is a kernel of order s1, is symmetric around zero, has compact support [−1, 1]^r, and is continuously differentiable.
(iii) K(u) is of order s, is symmetric around zero, and is s times continuously differentiable.

(C5) (On bandwidths)
(i) h → 0, nh^l → ∞, nh^{2s+l} → 0.
(ii) h1, h2 → 0, log(n)/(nh1^{r+s1}) → 0, and log(n)/(nh2^{k̃+s2}) → 0.
(iii) h_i^{s_i} h^{−s_i−l} → 0 and nh^l h_i^{2s_i} → 0, i = 1, 2.

(C6) (On the dimension reduction structure) The dimension r of V is given, and V̂ − V = O_p(n^{−1/2}).

Recall the definition of a higher-order kernel. A function g: R^r → R is a kernel of order s if it integrates to one over R^r, ∫ u1^{p1} ··· u_r^{p_r} g(u) du = 0 for all nonnegative integers p1, ..., p_r such that 0 < Σ_i p_i < s, and this integral is nonzero when Σ_i p_i = s.

References
P. R. Rosenbaum, D. B. Rubin, The central role of the propensity score in observational studies for causal effects, Biometrika 70 (1983) 41-55.
K. Hirano, G. W. Imbens, G. Ridder, Efficient estimation of average treatment effects using the estimated propensity score, Econometrica 71 (2003) 1161-1189.
R. K. Crump, V. J. Hotz, G. W. Imbens, O. A. Mitnik, Nonparametric tests for treatment effect heterogeneity, Rev. Econom. Statist. 90 (2008) 389-405.
S. Wager, S. Athey, Estimation and inference of heterogeneous treatment effects using random forests, J. Amer. Statist. Assoc. 113 (2018) 1228-1242.
J. Abrevaya, Y.-C. Hsu, R. P. Lieli, Estimating conditional average treatment effects, J. Bus. Econom. Statist. 33 (2015) 485-505.
S. Lee, R. Okui, Y.-J. Whang, Doubly robust uniform confidence band for the conditional average treatment effect function, J. Appl. Econometrics 32 (2017) 1207-1225.
J. M. Robins, A. Rotnitzky, L. P. Zhao, Estimation of regression coefficients when some regressors are not always observed, J. Amer. Statist. Assoc. 89 (1994) 846-866.
K.-C. Li, Sliced inverse regression for dimension reduction, J. Amer. Statist. Assoc. 86 (1991) 316-327.
W. Luo, Y. Zhu, D. Ghosh, On estimating regression-based causal effects using sufficient dimension reduction, Biometrika 104 (2017) 51-65.
S. Ma, L. Zhu, Z. Zhang, C.-L. Tsai, R. J. Carroll, A robust and efficient approach to causal inference based on sparse sufficient dimension reduction, Ann. Statist. 47 (2019) 1505-1535.
R. D. Cook, B. Li, Dimension reduction for conditional mean in regression, Ann. Statist. 30 (2002) 455-474.
Y. Xia, H. Tong, W. K. Li, L.-X. Zhu, An adaptive estimation of dimension reduction space, J. R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002) 363-410.
Y. Xia, A constructive approach to the estimation of dimension reduction directions, Ann. Statist. 35 (2007) 2654-2690.
Q. Li, J. S. Racine, Nonparametric Econometrics: Theory and Practice, Princeton University Press, Princeton, NJ, 2007.
A. Ichino, F. Mealli, T. Nannicini, From temporary help jobs to permanent employment: what can we learn from matching estimators and their sensitivity?, J. Appl. Econometrics 23 (2008) 305-327.

Table 2: Simulation results (Bias) under Model 1 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 1 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0262, 0.0607, 0.0553, -0.0278, -0.1072; IPW-P: -0.0262, 0.0607, 0.0553, -0.0276, -0.1068; IPW-N: -0.0263, 0.0608, 0.0566, -0.0233, -0.0991; IPW-S: -0.0269, 0.0600, 0.0555, -0.0257, -0.1035.
n = 1000: IPW-N: -0.0254, 0.0551, 0.0499, -0.0186, -0.0888; IPW-S: -0.0261, 0.0543, 0.0492, -0.0206, -0.0931.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0267, 0.0600, 0.0554, -0.0258, -0.1037; n = 1000: -0.0259, 0.0543, 0.0491, -0.0207, -0.0932.

Table 3: Simulation results (Bias) under Model 1 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 0 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0027, -0.0006, 0.0174, 0.0147, -0.0504; IPW-P: 0.0021, -0.0010, 0.0173, 0.0148, -0.0503; IPW-N: 0.0059, 0.0005, 0.0169, 0.0135, -0.0518; IPW-S: 0.0000, -0.0030, 0.0156, 0.0136, -0.0512.
n = 1000: IPW-N: 0.0061, 0.0008, 0.0157, 0.0147, -0.0458; IPW-S: 0.0007, -0.0020, 0.0147, 0.0147, -0.0453.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0006, -0.0034, 0.0155, 0.0136, -0.0512; n = 1000: 0.0003, -0.0022, 0.0146, 0.0147, -0.0453.

Table 4: Simulation results (Bias) under Model 2 with |Z ∩ U1| = 0 and |Z ∩ V⊤X| = 0 and γ = 1 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0269, 0.0603, 0.0550, -0.0282, -0.1080; IPW-P: -0.0269, 0.0603, 0.0551, -0.0278, -0.1072; IPW-N: -0.0272, 0.0601, 0.0557, -0.0263, -0.1052; IPW-S: -0.0277, 0.0596, 0.0554, -0.0263, -0.1049.
n = 1000: IPW-N: -0.0264, 0.0541, 0.0492, -0.0206, -0.0935; IPW-S: -0.0267, 0.0538, 0.0490, -0.0210, -0.0944.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0275, 0.0597, 0.0553, -0.0264, -0.1049; n = 1000: -0.0266, 0.0538, 0.0490, -0.0210, -0.0944.

Table 5: Simulation results (Bias) under Model 2 with |Z ∩ U1| = 0 and |Z ∩ V⊤X| = 0 and γ = 0 (k = 2, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0015, -0.0014, 0.0172, 0.0148, -0.0504; IPW-P: 0.0015, -0.0014, 0.0171, 0.0147, -0.0504; IPW-N: 0.0010, -0.0022, 0.0161, 0.0137, -0.0513; IPW-S: -0.0005, -0.0033, 0.0154, 0.0133, -0.0515.
n = 1000: IPW-N: 0.0017, -0.0015, 0.0148, 0.0147, -0.0453; IPW-S: 0.0000, -0.0026, 0.0142, 0.0144, -0.0454.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0010, -0.0035, 0.0154, 0.0133, -0.0514; n = 1000: -0.0003, -0.0027, 0.0142, 0.0144, -0.0453.

Table 6: Simulation results (Bias) under Model 3 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 1 (k = 4, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0271, 0.0610, 0.0556, -0.0280, -0.1086; IPW-P: -0.0271, 0.0611, 0.0557, -0.0279, -0.1083; IPW-N: -0.0280, 0.0604, 0.0563, -0.0247, -0.1026; IPW-S: -0.0270, 0.0614, 0.0577, -0.0213, -0.0967.
n = 1000: IPW-N: -0.0270, 0.0563, 0.0517, -0.0201, -0.0940; IPW-S: -0.0260, 0.0570, 0.0521, -0.0199, -0.0935.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0270, 0.0614, 0.0577, -0.0215, -0.0969; n = 1000: -0.0259, 0.0570, 0.0521, -0.0199, -0.0936.

Table 7: Simulation results (Bias) under Model 3 with |Z ∩ X̃| = 1 but |Z ∩ V⊤X| = 0 and γ = 0 (k = 4, r = 1).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0333, 0.0042, -0.0019, -0.0067, -0.0228; IPW-P: 0.0332, 0.0043, -0.0018, -0.0066, -0.0228; IPW-N: 0.0338, 0.0041, -0.0026, -0.0072, -0.0232; IPW-S: 0.0384, 0.0069, -0.0015, -0.0071, -0.0232.
n = 1000: IPW-N: 0.0316, 0.0018, -0.0039, -0.0070, -0.0221; IPW-S: 0.0331, 0.0030, -0.0032, -0.0066, -0.0218.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: 0.0381, 0.0068, -0.0015, -0.0071, -0.0232; n = 1000: 0.0329, 0.0029, -0.0032, -0.0066, -0.0218.

Table 8: Simulation results (Bias) under Model 4 with |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1 and γ = 1 (k = 4, r = 2).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: -0.0260, 0.0621, 0.0565, -0.0277, -0.1087; IPW-P: -0.0258, 0.0623, 0.0567, -0.0274, -0.1082; IPW-N: -0.0254, 0.0626, 0.0573, -0.0254, -0.1046; IPW-S: -0.0247, 0.0631, 0.0575, -0.0254, -0.1046.
n = 1000: IPW-N: -0.0255, 0.0572, 0.0515, -0.0218, -0.0966; IPW-S: -0.0248, 0.0578, 0.0521, -0.0205, -0.0943.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: -0.0247, 0.0631, 0.0575, -0.0254, -0.1045; n = 1000: -0.0249, 0.0577, 0.0521, -0.0205, -0.0943.

Table 9: Simulation results (Bias) under Model 4 with |Z ∩ X̃| = 1 and |Z ∩ V⊤X| = 1 and γ = 0 (k = 4, r = 2).
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4, under Group 1:
n = 500: IPW-O: 0.0339, 0.0050, -0.0009, -0.0057, -0.0224; IPW-P: 0.0343, 0.0053, -0.0008, -0.0055, -0.0222; IPW-N: 0.0330, 0.0040, -0.0019, -0.0064, -0.0228; IPW-S: 0.0372, 0.0064, -0.0010, -0.0060, -0.0224.
n = 1000: IPW-N: 0.0303, 0.0011, -0.0038, -0.0066, -0.0219; IPW-S: 0.0340, 0.0033, -0.0032, -0.0068, -0.0223.
Under Group 2, the Bias rows coincide with those under Group 1 except for IPW-S: n = 500: 0.0370, 0.0063, -0.0011, -0.0061, -0.0224; n = 1000: 0.0337, 0.0031, -0.0033, -0.0068, -0.0223.

Table 10: Simulation results (Bias) under the high-dimensional setting, dim(X) = 20, n = 500, Group 1.
Bias at Z = −0.4, −0.2, 0, 0.2, 0.4:
|Z ∩ V⊤X| = 0 (r = 1): IPW-O: -0.0275, 0.0598, 0.0554, -0.0271, -0.1064; IPW-P: -0.0273, 0.0601, 0.0556, -0.0270, -0.1062; IPW-S: -0.0254, 0.0612, 0.0538, -0.0349, -0.1209.
|Z ∩ V⊤X| = 1 (r = 2): IPW-O: -0.0274, 0.0599, 0.0553, -0.0272, -0.1065; IPW-P: -0.0270, 0.0603, 0.0559, -0.0261, -0.1046; IPW-S: -0.0245, 0.0617, 0.0536, -0.0353, -0.1208.