Doubly robust estimation for conditional treatment effect: a study on asymptotics
Statistica Sinica
Chuyun Ye, Keli Guo and Lixing Zhu
Beijing Normal University, Beijing, China, and Hong Kong Baptist University, Hong Kong
Abstract:
In this paper, we apply the doubly robust approach to estimate the conditional average treatment effect, given some covariates, under parametric, semiparametric and nonparametric structures of the nuisance propensity score and outcome regression models. We then conduct a systematic study of the asymptotic distributions of nine estimators with different combinations of estimated propensity score and outcome regressions. The study covers the asymptotic properties when all models are correctly specified; when either the propensity score or the outcome regressions are locally or globally misspecified; and when all models are locally or globally misspecified. The asymptotic variances are compared, and the asymptotic bias correction under model misspecification is discussed. The phenomenon that the asymptotic variance under model misspecification can sometimes be even smaller than that with all models correctly specified is explored. We also conduct a numerical study to examine the theoretical results.

Key words and phrases: Asymptotic variance, Conditional average treatment effect, Doubly robust estimation.
1. Introduction
To explore the heterogeneity of treatment effects under Rubin's potential outcome framework (Rosenbaum and Rubin (1983)) and to reveal the causality of a treatment, the conditional average treatment effect (CATE), which conditions on some covariates of interest, is useful. See Abrevaya et al. (2015) as an example. Shi et al. (2019) showed that the existence of an optimal individualized treatment regime (OITR) has a close connection with CATE. To estimate CATE, several standard approaches are available in the literature. When the propensity score function, the outcome regression functions, or both are unknown, we need to estimate them first so that we can then estimate the CATE function; we regard these functions as nuisance models. Abrevaya et al. (2015) used propensity score-based (PS-based) estimation under parametric (P-IPW) and nonparametric (N-IPW) structures, and showed that N-IPW is asymptotically more efficient than P-IPW. Zhou and Zhu (2020) suggested PS-based estimation under a semiparametric dimension reduction structure (S-IPW) to show the advantage of semiparametric estimation, and Li et al. (2020) considered outcome regression-based (OR-based) estimation under parametric (P-OR), semiparametric (S-OR) and nonparametric (N-OR) structures to derive their asymptotic properties, also recommending the semiparametric method. Together, these works give an estimation efficiency comparison between PS-based and OR-based estimators. A clear asymptotic efficiency ranking was shown by Li et al. (2020) when the propensity score and outcome regression models are all correctly specified and the underlying nonparametric models are sufficiently smooth such that, with delicately selected bandwidths and kernel functions, the nonparametric estimation can achieve sufficiently fast rates of convergence:
O-OR ≅ P-OR ⪯ S-OR ⪯ N-OR (OR-based estimators) ≅ N-IPW ⪯ S-IPW ⪯ P-IPW ≅ O-IPW (PS-based estimators),    (1.1)

where A ⪯ B denotes the asymptotic efficiency advantage, with smaller asymptotic variance, of A over B, A ≅ B denotes efficiency equivalence, and O-OR and O-IPW stand for the OR-based and PS-based estimators, respectively, with the nuisance models known and no need of estimation.

As is well known, the doubly robust (DR) method was first suggested as the augmented inverse probability weighting (AIPW) estimation proposed by Robins et al. (1994). Later developments established the estimation consistency (Scharfstein et al. (1999)) of more general doubly robust estimation, not restricted to AIPW, even when one of the two involved models is misspecified. For further discussion of and an introduction to DR estimation, readers can refer to, for example, Seaman and Vansteelandt (2018). Like Abrevaya et al. (2015), Lee et al. (2017) brought up a two-step AIPW estimator of CATE, also under a parametric structure. For cases with high-dimensional covariates, Fan et al. (2019) and Zimmert and Lechner (2019) combined such an estimator with statistical learning.

In the current paper, we focus on investigating the asymptotic efficiency comparisons among nine doubly robust estimators under parametric, semiparametric dimension reduction and nonparametric structures. To this end, we give a systematic study to provide insight into which combinations may have merit in an asymptotic sense and which ones are worthy of recommendation in practice. We further consider the asymptotic efficiency when the nuisance models are globally or locally misspecified, which will be defined later. Roughly speaking, local misspecification means that the misspecified model converges, at a certain rate, to the corresponding correctly specified model as the sample size n goes to infinity, while a globally misspecified model does not. Denote by c_n, d_{1n} and d_{0n} the departure degrees of the used models from the corresponding correctly specified models, and by V_i(x), i = 1, 2, 3, 4, the asymptotic variance functions of x of the nine estimators in different scenarios, which will be clarified in Theorems 1, 2, 3 and 5, respectively.
Here V_1(x) is the asymptotic variance when all models are correctly specified, which is regarded as the benchmark for comparisons. We have V_1(x) ≤ V_3(x), but V_2(x) and V_4(x) are not necessarily larger than V_1(x). Here we display the main findings of this paper.

• When all nuisance models are correctly specified, and the tuning parameters, including the bandwidths in the nonparametric estimations, are delicately selected, the asymptotic variances are all equal to V_1(x). Write all DR estimators as DRCATE. Together with (1.1), the asymptotic efficiency ranking is:
O-OR ≅ P-OR ⪯ S-OR ⪯ N-OR (OR-based estimators) ≅ DRCATE ≅ N-IPW ⪯ S-IPW ⪯ P-IPW ≅ O-IPW (PS-based CATE estimators).

• If only one of the nuisance models, either the propensity score or the outcome regressions, is misspecified, the estimators remain unbiased, as expected. But globally misspecified outcome regressions or propensity score lead to changes of the asymptotic variance. We give examples for the propensity score showing that the variance can even be smaller than that with correctly specified models. Further, when the nuisance models are locally misspecified, the asymptotic efficiency remains the same as that with no misspecification.
• Further, when all nuisance models are globally misspecified, we need to take care of the estimation bias. When the misspecifications are all local, and the convergence rates of c_n d_{1n} and c_n d_{0n} are faster than the convergence rate of the nonparametric estimation specified later, the asymptotic distributions remain unchanged.

To give quick access to the results about the asymptotic variances, we present a summary in Table 1. Denote by PS(P), PS(N) and PS(S) the estimators with parametrically, nonparametrically and semiparametrically estimated PS function, respectively, and by OR(P), OR(N) and OR(S) the estimators with parametrically, nonparametrically and semiparametrically estimated OR functions, respectively. Cells marked "--" mean no such combinations.

Table 1: Asymptotic variance result summary

Combination    All correctly  Globally misspecified PS     Locally misspec. PS  Globally misspec. OR  Locally misspec. OR
               specified
PS(P)+OR(P)    V_1(x)         V_2(x) (not nec. enlarged)   V_1(x)               V_3(x) (enlarged)     V_1(x)
PS(P)+OR(N)    V_1(x)         V_1(x)                       V_1(x)               --                    --
PS(N)+OR(P)    V_1(x)         --                           --                   V_1(x)                V_1(x)
PS(N)+OR(N)    V_1(x)         --                           --                   --                    --
PS(P)+OR(S)    V_1(x)         V_2(x) (not nec. enlarged)   V_1(x)               --                    --
PS(S)+OR(P)    V_1(x)         --                           --                   V_3(x) (enlarged)     V_1(x)
PS(S)+OR(N)    V_1(x)         --                           --                   --                    --
PS(N)+OR(S)    V_1(x)         --                           --                   --                    --
PS(S)+OR(S)    V_1(x)         --                           --                   --                    --

Combination    All globally misspecified                   All locally misspec.  Glob. misspec. PS + loc. misspec. OR  Loc. misspec. PS + glob. misspec. OR
PS(P)+OR(P)    Biased + V_4(x) (not nec. enlarged var.)    V_1(x)                V_2(x) (not nec. enlarged)            V_3(x) (enlarged)

The remaining parts of this article are organized as follows. We first describe Rubin's potential outcome framework and the relevant notation in Section 2. Section 3 contains a general two-step estimation of CATE, and Section 4 describes the corresponding asymptotic properties under different situations. Section 5 presents the results of Monte Carlo simulations, and Section 6 includes some concluding remarks. We would like to point out that such comparisons do not mean that the estimations with an asymptotic efficiency advantage are always worthwhile to recommend, because, particularly, the nonparametric-based estimations may have severe difficulties in handling high- or even moderate-dimensional models in practice. But the comparisons provide good insight into the nature of the various estimations, so that practitioners can have a relatively complete picture of them and an idea of when and how to use these estimations.
2. Framework and Notation
For any individual, the datum W = (X⊤, Y, D)⊤ is observable, including the observed outcome Y, the treatment status D, and the p-dimensional covariates X. D = 1 implies that the individual is treated, and D = 0 means untreated. Denote Y(1) and Y(0) as the potential outcomes with and without treatment, respectively. The observed outcome Y can be expressed as Y = DY(1) + (1 − D)Y(0). Denote

p(X) = P(D = 1 | X),   m_1(X) = E(Y(1) | X),   m_0(X) = E(Y(0) | X)

as the propensity score function and the outcome regression functions. The following conditions are commonly used when discussing the potential outcome framework.

(C1) (Sampling distribution) {W_i}_{i=1}^n is a set of independently and identically distributed samples.

(C2) (Ignorability condition)
(i) (Unconfoundedness) (Y(1), Y(0)) ⊥ D | X.
(ii) Denote X as the support of X, where X is a Cartesian product of compact intervals. For any x ∈ X, p(x) is bounded away from 0 and 1.

Denote τ(x) as the CATE:

τ(x) = E[Y(1) − Y(0) | X_1 = x],

where X_1 is a strict subvector of X. That is, X_1 is a k-dimensional covariate with k < p. Also denote f(x) as the density function of X_1.
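The representation of τ(x) used in the next section rests on a one-line identification argument, spelled out here for completeness as a reading aid (a standard consequence of (C1)–(C2), not stated explicitly in the paper): since DY = DY(1) and, by unconfoundedness, D is conditionally independent of Y(1) given X,

E[DY/p(X) | X] = E[D | X] E[Y(1) | X] / p(X) = m_1(X),

and symmetrically E[(1 − D)Y/(1 − p(X)) | X] = m_0(X). Conditioning further on X_1 = x turns these identities into the PS-based expression for τ(x) in (3.2) below.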
3. Doubly Robust Estimation
Rewrite τ(x) as

τ(x) = E{ m_1(X) − m_0(X) | X_1 = x }
     = E{ DY/p(X) − (1 − D)Y/(1 − p(X)) | X_1 = x }
     = E{ D[Y − m_1(X)]/p(X) − (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_1(X) − m_0(X) | X_1 = x }.    (3.2)

The first two expressions in (3.2) show how the OR and PS methods work for estimating CATE. The third is the essential expression for constructing a doubly robust estimator of τ(x). Based on it, we propose a two-step estimation. In the first step, we estimate the function inside the conditional expectation in (3.2):

D[Y − m_1(X)]/p(X) − (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_1(X) − m_0(X).

To study the influence of estimating the nuisance functions p(X), m_1(X) and m_0(X) under the parametric, nonparametric, and semiparametric dimension reduction frameworks, we construct the corresponding estimators below. After this, we estimate the conditional expectation given X_1 = x. This is a standard nonparametric estimation, and we utilize a Nadaraya-Watson type estimator to define the resulting estimator:

τ̂(x) = { (1/(nh^k)) Σ_{i=1}^n [ D_i(Y_i − m̂_{1i})/p̂_i − (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{1i} − m̂_{0i} ] K((X_{1i} − x)/h) } / { (1/(nh^k)) Σ_{i=1}^n K((X_{1i} − x)/h) },

where K(u) is a kernel function of order s_1, which is s* times continuously differentiable, h is the corresponding bandwidth, and p̂_i, m̂_{1i}, m̂_{0i} denote the estimators of p(X_i), m_1(X_i), m_0(X_i), respectively; these are general notations that take different forms under different model structures.

We now consider the estimation of the nuisance functions. Under the parametric structure, let p̃(x; β), m̃_1(x; γ_1) and m̃_0(x; γ_0) be the specified parametric models of p(x), m_1(x) and m_0(x), respectively, where β, γ_1 and γ_0 are unknown parameters. By maximum likelihood estimation, we obtain β̂, γ̂_1 and γ̂_0, so as to have p̃(X_i; β̂), m̃_1(X_i; γ̂_1) and m̃_0(X_i; γ̂_0) as the parametric estimators. Note that the specified models are not necessarily equal to the true data generating mechanism. We further distinguish the correctly specified, globally misspecified and locally misspecified cases. For all x ∈ X, there exist β, γ_1, γ_0 such that the true models relate to the specified models via

p(x) = p̃(x; β)[1 + c_n a(x)],
m_1(x) = m̃_1(x; γ_1) + d_{1n} b_1(x),                  (3.3)
m_0(x) = m̃_0(x; γ_0) + d_{0n} b_0(x).

Take the propensity score function as an example. If c_n = 0, the parametric propensity score model p̃(x; β) is correctly specified; otherwise it is not. If c_n converges to 0 as n goes to infinity, the parametric model is locally misspecified. If c_n remains a nonzero constant, the model is globally misspecified. Similarly for the models with d_{1n} and d_{0n}. Recall that β̂, γ̂_1 and γ̂_0 are the maximum likelihood estimators of the corresponding unknown parameters.
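To make the two-step construction concrete, the following is a minimal Python sketch (the function and variable names are ours, not the paper's): it forms the doubly robust pseudo-outcome of (3.2) from given nuisance estimates and smooths it with a Nadaraya-Watson step. For simplicity it assumes a Gaussian kernel and a univariate conditioning covariate (k = 1).

```python
import numpy as np

def dr_cate(x_grid, X1, Y, D, p_hat, m1_hat, m0_hat, h):
    """Two-step doubly robust CATE estimate at the points in x_grid.

    X1     : (n,) conditioning covariate X_1 (k = 1 here)
    p_hat  : (n,) estimated propensity scores p(X_i)
             (in practice trimmed away from 0 and 1, as in Section 5)
    m1_hat : (n,) estimated outcome regressions m_1(X_i)
    m0_hat : (n,) estimated outcome regressions m_0(X_i)
    h      : bandwidth of the second-step kernel smoother
    """
    # Doubly robust pseudo-outcome: the integrand of the third expression in (3.2)
    psi = (D * (Y - m1_hat) / p_hat
           - (1 - D) * (Y - m0_hat) / (1 - p_hat)
           + m1_hat - m0_hat)
    # Nadaraya-Watson smoothing of psi against X_1
    u = (X1[None, :] - np.asarray(x_grid)[:, None]) / h
    K = np.exp(-0.5 * u ** 2)          # Gaussian kernel weights
    return (K * psi[None, :]).sum(axis=1) / K.sum(axis=1)
```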
Denote β*, γ_1* and γ_0* as the limits of β̂, γ̂_1 and γ̂_0 as n goes to infinity.

Under the nonparametric structure, we utilize the kernel-based nonparametric estimators

p̂(X_i) = Σ_{j=1}^n D_j K_2((X_j − X_i)/h_2) / Σ_{t=1}^n K_2((X_t − X_i)/h_2),
m̂_1(X_i) = Σ_{j=1}^n D_j Y_j K_3((X_j − X_i)/h_3) / Σ_{t=1}^n D_t K_3((X_t − X_i)/h_3),
m̂_0(X_i) = Σ_{j=1}^n (1 − D_j) Y_j K_4((X_j − X_i)/h_4) / Σ_{t=1}^n (1 − D_t) K_4((X_t − X_i)/h_4),

where K_2(u), K_3(u) and K_4(u) are kernels of order s_2 ≥ p, s_3 ≥ p and s_4 ≥ p, with the corresponding bandwidths h_2, h_3 and h_4. The conditions on the kernel functions and bandwidths are listed in the supplement.

Under the semiparametric structure on the baseline covariate X for the propensity score and the outcome regressions, we have the following dimension reduction framework. Denote a matrix A ∈ R^{p×p(2)} such that

p(X) ⊥ X | A⊤X,     (3.4)

where p(2) ≤ p. The space spanned by A, S_{E(D|X)}, is called the central mean subspace if it is the intersection of all subspaces spanned by matrices A satisfying the above conditional independence. The dimension of S_{E(D|X)} is called the structural dimension, which is often smaller than or equal to p; without confusion, we still write it as p(2). Formula (3.4) implies that p(X) = E(D | X) = E(D | A⊤X) := g(A⊤X). Note that a nonparametric estimation of p(X) may have a very slow rate of convergence when p is large. However, under (3.4) we can estimate the matrix A first to reduce the dimension from p to p(2), so that the nonparametric estimation of E(D | A⊤X) can achieve a faster rate of convergence. When A is root-n consistently estimated by an estimator Â, the semiparametric estimator of p(X_i) is defined as

ĝ(Â⊤X_i) = Σ_{j=1}^n D_j K_5((Â⊤X_j − Â⊤X_i)/h_5) / Σ_{t=1}^n K_5((Â⊤X_t − Â⊤X_i)/h_5).

Similarly, for the regression models, denote matrices B_1 ∈ R^{p×p(1)} and B_0 ∈ R^{p×p(0)} such that

E(Y(1) | X) ⊥ X | B_1⊤X,   E(Y(0) | X) ⊥ X | B_0⊤X.     (3.5)

The corresponding dimension reduction subspaces are called the central mean subspaces (see Cook and Li (2002)). Thus, m_1(X) = E(Y(1) | X) = E(Y(1) | B_1⊤X) := r_1(B_1⊤X) and m_0(X) = E(Y(0) | X) = E(Y(0) | B_0⊤X) := r_0(B_0⊤X). With B̂_1 and B̂_0 being the estimators of B_1 and B_0, the semiparametric estimators of m_1(X_i) and m_0(X_i) are defined as

r̂_1(B̂_1⊤X_i) = Σ_{j=1}^n D_j Y_j K_6((B̂_1⊤X_j − B̂_1⊤X_i)/h_6) / Σ_{t=1}^n D_t K_6((B̂_1⊤X_t − B̂_1⊤X_i)/h_6),
r̂_0(B̂_0⊤X_i) = Σ_{j=1}^n (1 − D_j) Y_j K_7((B̂_0⊤X_j − B̂_0⊤X_i)/h_7) / Σ_{t=1}^n (1 − D_t) K_7((B̂_0⊤X_t − B̂_0⊤X_i)/h_7),

where K_5(u), K_6(u) and K_7(u) are kernels of order s_5 ≥ p(2), s_6 ≥ p(1) and s_7 ≥ p(0), with the corresponding bandwidths h_5, h_6 and h_7.
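As an illustration of the kernel-based nuisance estimators above, here is a minimal Python sketch (our own simplified version: it uses a product Gaussian kernel, whereas the theory requires compactly supported higher-order kernels as in condition (A1)):

```python
import numpy as np

def gaussian_kernel(u):
    # Product Gaussian kernel for multivariate u of shape (..., p)
    return np.exp(-0.5 * (u ** 2).sum(axis=-1))

def kernel_nuisance(X, Y, D, h):
    """Kernel estimates of p(X_i), m_1(X_i), m_0(X_i) at the sample points.

    X : (n, p) covariates; Y : (n,) outcomes; D : (n,) treatment indicators.
    """
    U = (X[None, :, :] - X[:, None, :]) / h       # (n, n, p) pairwise scaled differences
    K = gaussian_kernel(U)                        # (n, n) kernel weights
    p_hat = (K * D).sum(1) / K.sum(1)             # \hat p(X_i)
    m1_hat = (K * D * Y).sum(1) / np.maximum((K * D).sum(1), 1e-12)
    m0_hat = (K * (1 - D) * Y).sum(1) / np.maximum((K * (1 - D)).sum(1), 1e-12)
    return p_hat, m1_hat, m0_hat
```

The semiparametric estimators have the same form after replacing X by Â⊤X, B̂_1⊤X or B̂_0⊤X, which is what makes the dimension reduction step pay off.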
4. Asymptotic Properties

Define the following functions:

Ψ_1(X, Y, D) := D[Y − m_1(X)]/p(X) − (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_1(X) − m_0(X),
Ψ_2(X, Y, D) := D[Y − m_1(X)]/p̃(X; β*) − (1 − D)[Y − m_0(X)]/(1 − p̃(X; β*)) + m_1(X) − m_0(X),
Ψ_3(X, Y, D) := D[Y − m̃_1(X; γ_1*)]/p(X) − (1 − D)[Y − m̃_0(X; γ_0*)]/(1 − p(X)) + m̃_1(X; γ_1*) − m̃_0(X; γ_0*),
Ψ_4(X, Y, D) := D[Y − m̃_1(X; γ_1*)]/p̃(X; β*) − (1 − D)[Y − m̃_0(X; γ_0*)]/(1 − p̃(X; β*)) + m̃_1(X; γ_1*) − m̃_0(X; γ_0*).

The following theorem shows that all asymptotic distributions of the estimators are identical.
Theorem 1.
Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_2 ≥ p, s* ≥ s_3 ≥ p, s* ≥ s_4 ≥ p, s* ≥ s_5 ≥ p(2), s* ≥ s_6 ≥ p(1), s* ≥ s_7 ≥ p(0), and formulas (3.4) and (3.5) hold. Then, for each point x, we have

√(nh^k)[τ̂(x) − τ(x)] = (1/(√(nh^k) f(x))) Σ_{i=1}^n [Ψ_1(X_i, Y_i, D_i) − τ(x)] K((X_{1i} − x)/h) + o_p(1),

and

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)),

where V_1(x) = σ_1²(x) ∫K²(u)du / f(x) and σ_1²(x) = E{ [Ψ_1(X, Y, D) − τ(x)]² | X_1 = x }.
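Theorem 1 can be used for pointwise inference by plugging in estimates of σ_1²(x) and f(x). The following sketch is our own illustration of such a plug-in interval (with a second-order Gaussian kernel and k = 1); it is not a construction given in the paper:

```python
import numpy as np

def dr_cate_ci(x0, X1, psi, tau_hat_x0, h, z=1.96):
    """Plug-in 95% CI for tau(x0), based on the limit N(0, V_1(x)) in Theorem 1.

    psi        : (n,) doubly robust pseudo-outcomes Psi_1(X_i, Y_i, D_i)
    tau_hat_x0 : the DR-CATE estimate at x0
    """
    n = len(X1)
    u = (X1 - x0) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel
    f_hat = K.mean() / h                               # kernel density estimate of f(x0)
    sigma2_hat = (K * (psi - tau_hat_x0) ** 2).sum() / K.sum()  # sigma_1^2(x0)
    int_K2 = 1.0 / (2 * np.sqrt(np.pi))                # integral of the squared Gaussian kernel
    V_hat = sigma2_hat * int_K2 / f_hat                # V_1(x0)
    se = np.sqrt(V_hat / (n * h))                      # sqrt(V_1 / (n h^k)) with k = 1
    return tau_hat_x0 - z * se, tau_hat_x0 + z * se
```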
4.2 The Cases With Misspecified Models

Now we discuss the asymptotic behaviour of the proposed estimators when either the outcome regression models or the propensity score model is misspecified. The following results show how global misspecification affects the asymptotic properties.

Theorem 2. Assume that the propensity score model is globally misspecified, that is, c_n = C is a nonzero constant. Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_3 ≥ p, s* ≥ s_4 ≥ p, s* ≥ s_6 ≥ p(1), s* ≥ s_7 ≥ p(0), s_6 < (2s_1 + k)(p − p(1)) and s_7 < (2s_1 + k)(p − p(0)).

1). When the outcome regression functions are estimated nonparametrically, then, for each value x, we have

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)).

2). When the outcome regression functions are correctly specified (d_{1n} = d_{0n} = 0) with parametric or semiparametric estimation, then, for each value x, the asymptotic distributions are identical:

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_2(x)),

where V_2(x) = σ_2²(x) ∫K²(u)du / f(x) and σ_2²(x) = E{ [Ψ_2(X, Y, D) − τ(x)]² | X_1 = x }.

Now we consider the cases with global misspecification of the outcome regression models.
Theorem 3.
Assume that the outcome regression models are globally misspecified with fixed nonzero constants d_{1n} = d_1 and d_{0n} = d_0. Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_2 ≥ p, s* ≥ s_5 ≥ p(2) and s_5 < (2s_1 + k)(p − p(2)).

1). When the propensity score is estimated nonparametrically, then, for each x,

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)).

2). When the propensity score model is correctly specified (c_n = 0) with parametric or semiparametric estimation, then, for each value x, the asymptotic distributions are identical:

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_3(x)),

where V_3(x) = σ_3²(x) ∫K²(u)du / f(x) and σ_3²(x) = E{ [Ψ_3(X, Y, D) − τ(x)]² | X_1 = x }.

Remark 1.
By some calculations, presented in Proposition 4 in Section 4.4 below, we can obtain that σ_1²(x) ≤ σ_3²(x), while the analogous relation does not hold between σ_1²(x) and σ_2²(x). That is, the asymptotic variance of the proposed estimator inflates when the outcome regression models are misspecified and the propensity score model is parametrically (correctly specified) or semiparametrically estimated. However, whether the asymptotic variance gets larger with a misspecified propensity score model is model-dependent. We show the following example. Suppose that the outcome regression models are correctly specified, while the propensity score model is globally misspecified. Consider a situation where p(x) = p_1 and p̃(x; β*) = p_2, with p_1, p_2 free of x and p_1 ≠ p_2. We have

σ_2²(x) − σ_1²(x) = E{ [p²(X) − p̃²(X; β*)] / [p̃²(X; β*) p(X)] · Var(Y | X, D = 1) | X_1 = x }
                  + E{ [(1 − p(X))² − (1 − p̃(X; β*))²] / [(1 − p̃(X; β*))²(1 − p(X))] · Var(Y | X, D = 0) | X_1 = x }
                  = [(p_1² − p_2²)/(p_2² p_1)] E[Var(Y | X, D = 1) | X_1 = x]
                  + {[(1 − p_1)² − (1 − p_2)²]/[(1 − p_2)²(1 − p_1)]} E[Var(Y | X, D = 0) | X_1 = x].

To give a clear picture, we further assume that the outcome regression models are homoscedastic, with Var(Y | X, D = 1) = Var(Y | X, D = 0) = ξ², free of X. Then we have

σ_2²(x) − σ_1²(x) = ξ² { (p_1² − p_2²)/(p_2² p_1) + [(1 − p_1)² − (1 − p_2)²]/[(1 − p_2)²(1 − p_1)] }.

Define the function vd(p_1, p_2) = (p_1² − p_2²)/(p_2² p_1) + [(1 − p_1)² − (1 − p_2)²]/[(1 − p_2)²(1 − p_1)]. A negative vd(p_1, p_2) implies variance shrinkage. Consider three true propensity score values p(x) = p_1 = 0.3, 0.5, 0.7. The three curves of vd(p_1, p_2) in Figure 1 show how the variance inflation or shrinkage occurs.

[Figure 1: Curves of vd(p_1, p_2) against p_2 for different p_1; panels (a) p_1 = 0.3, (b) p_1 = 0.5, (c) p_1 = 0.7.]

When p_1 = 0.3 and 0.7, an appropriately over- or under-estimated propensity score (respectively) may result in asymptotic variance shrinkage in some cases. When p_1 = 0.5, which means that every individual has a 0.5 probability of receiving the treatment, vd(p_1, p_2) is always nonnegative. In practice, Var(Y | X, D = 1) and Var(Y | X, D = 0) are not necessarily equal. Such simple examples show that, when only the propensity score is misspecified, both inflation and shrinkage of the asymptotic variance are possible.
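A quick numerical check of vd (a sketch with our own naming; the curves in Figure 1 can be reproduced this way):

```python
import numpy as np

def vd(p1, p2):
    """Sign of sigma_2^2(x) - sigma_1^2(x) (up to the factor xi^2) when the
    true propensity is p1 and the misspecified model returns p2."""
    return ((p1**2 - p2**2) / (p1 * p2**2)
            + ((1 - p1)**2 - (1 - p2)**2) / ((1 - p1) * (1 - p2)**2))

p2 = np.linspace(0.25, 0.75, 11)
for p1 in (0.3, 0.5, 0.7):
    print(p1, np.round(vd(p1, p2), 3))
# For p1 = 0.5, vd >= 0 everywhere (only inflation is possible); for
# p1 = 0.3 (resp. 0.7), moderate over- (resp. under-) estimation of the
# propensity score gives vd < 0, i.e. the asymptotic variance shrinks.
```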
Remark 2.
Another interesting phenomenon is that once the propensity score model is misspecified and the outcome regressions are nonparametrically estimated, or vice versa, the asymptotic performance of the proposed estimator is identical to that when all models are correctly specified. As nonparametric estimation takes no risk of misspecification, such an estimation procedure "absorbs" the influence brought by model misspecification, due to the doubly robust property. But it is clear that in high-dimensional scenarios a purely nonparametric estimation is not worthwhile to recommend. Thus, this property mainly serves as an investigation of theoretical interest unless the dimension of the covariates is small.

The results with local misspecification are stated in the following.
Theorem 4.
Assume that the propensity score model is locally misspecified with c_n → 0. Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_3 ≥ p, s* ≥ s_4 ≥ p, s* ≥ s_6 ≥ p(1), s* ≥ s_7 ≥ p(0), s_6 < (2s_1 + k)(p − p(1)) and s_7 < (2s_1 + k)(p − p(0)). Then, for each value x, we have

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)).

Similarly, assume that the outcome regression functions are locally misspecified with d_{1n} → 0 and d_{0n} → 0. Under the same conditions as in Theorem 4, with s* ≥ s_2 ≥ p, s* ≥ s_5 ≥ p(2) and s_5 < (2s_1 + k)(p − p(2)), for each value x the asymptotic distribution of τ̂(x) is identical to the above.

4.3 A Further Study: All Models are Misspecified

We study this case because the estimator then has a non-ignorable bias in general, which goes to zero only when the rate of convergence of the local misspecification is sufficiently fast. Recall the definitions of γ_1*, γ_0* and β* below (3.3).

Theorem 5.
Suppose that all models are globally misspecified with nonzero constants c_n, d_{1n} and d_{0n}. Assume that Conditions (C1)–(C6) are satisfied. Then, for each value x, we have

√(nh^k)[τ̂(x) − τ(x) − bias(x)] →_d N(0, V_4(x)),

where

bias(x) = E{ [m_1(X) − m̃_1(X; γ_1*)][p(X) − p̃(X; β*)]/p̃(X; β*) − [m_0(X) − m̃_0(X; γ_0*)][p̃(X; β*) − p(X)]/[1 − p̃(X; β*)] | X_1 = x },

V_4(x) = σ_4²(x) ∫K²(u)du / f(x),   σ_4²(x) = E{ [Ψ_4(X, Y, D) − τ̃(x)]² | X_1 = x },

and

τ̃(x) = E{ D[Y − m̃_1(X; γ_1*)]/p̃(X; β*) − (1 − D)[Y − m̃_0(X; γ_0*)]/[1 − p̃(X; β*)] + m̃_1(X; γ_1*) − m̃_0(X; γ_0*) | X_1 = x }.

The following results show the importance of the convergence rates of c_n, d_{1n} and d_{0n} to zero for the bias reduction and the variance change.

Theorem 6.
Under the conditions in Theorem 5, when

c_n d_{1n} = o(1/√(nh^k)),   c_n d_{0n} = o(1/√(nh^k)),

then, for each x, we have

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_4(x)).

Remark 3. This theorem shows that, to make the bias vanish, c_n d_{1n} and c_n d_{0n} need to tend to zero at rates faster than the nonparametric convergence rate O(1/√(nh^k)). Recall that Theorems 2 and 3 show that when c_n = o(1) the variance is V_3(x), and when d_{1n} = o(1) and d_{0n} = o(1) the variance is V_2(x). Altogether, when all misspecifications are local, the asymptotic variance reduces to V_1(x). We can then further discuss four cases:

1) All nuisance models are globally misspecified; 2) all nuisance models are locally misspecified; 3) the propensity score function is globally misspecified and the outcome regression functions are locally misspecified; 4) the propensity score function is locally misspecified and the outcome regression functions are globally misspecified.

The first is exactly the case described in Theorem 5. The second shows that if c_n d_{1n} = o(1/√(nh^k)) and c_n d_{0n} = o(1/√(nh^k)), the bias term is negligible, which is the situation in Theorem 6; otherwise, the estimator is biased. Cases 3 and 4 can be regarded as combinations of those in Theorems 5 and 6. In Case 3, once d_{1n} = o(1/√(nh^k)) and d_{0n} = o(1/√(nh^k)), the bias goes to 0 and the variance goes to ||K||₂² σ_2²(x)/f(x); in other words, if d_{1n} and d_{0n} go to 0 at rates faster than O(1/√(nh^k)), Case 3 turns into the case in Theorem 2. We can also derive that if c_n = o(1/√(nh^k)), Case 4 is similar to the case in Theorem 3.

4.4 A Summary of the Comparison Among the Asymptotic Variances

We summarize the comparison among the four variances V_j(x), j = 1, 2, 3, 4. Since V_j(x) = ||K||₂² σ_j²(x)/f(x) for j = 1, 2, 3, 4, it suffices to compare σ_j²(x) for j = 1, 2, 3, 4.

Remark 4. For any x:

1). σ_1²(x) is not necessarily smaller than σ_2²(x); as shown in the example in Remark 1, σ_1²(x) can be larger than σ_2²(x) for some x;
2). σ_1²(x) ≤ σ_3²(x);
3). We have no definitive answer as to whether σ_4²(x) is necessarily smaller than σ_1²(x).
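To illustrate the rate interplay in Theorem 6 and Remark 3 concretely (a worked example of ours, not from the paper): let c_n = n^{−α} and d_{1n} = d_{0n} = n^{−β} with α, β ≥ 0, and h ∝ n^{−η} with 0 < ηk < 1 so that nh^k → ∞. Then

√(nh^k) c_n d_{1n} ≍ n^{(1−ηk)/2 − (α+β)} → 0  ⟺  α + β > (1 − ηk)/2;

with k = 1 and η = 1/5, the product c_n d_{1n} must vanish faster than n^{−2/5} for the bias in Theorem 5 to be asymptotically negligible.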
5. Numerical Study
In this section, we present Monte Carlo simulations to examine the finite-sample performance of the estimators.
5.1 Data-Generating Process

We consider two data-generating processes (DGPs) similar to those in Abrevaya et al. (2015), with covariate dimensions d = 2 and d = 4. Here we only consider a univariate conditioning covariate X_1, i.e. k = 1, so in the simulations τ(x) = E[Y(1) − Y(0) | X_1 = x].

Model 1. This DGP features a 2-dimensional unconfounded covariate X = (X_1, X_2)⊤, that is, d = 2, with

X_1 = ρ_1,   X_2 = (1 + 2X_1)²(1 − X_1) + ρ_2,

where ρ_1, ρ_2 are independent and identically U(−0.5, 0.5) distributed. The potential outcomes and the propensity score function are given as

Y(1) = X_1 X_2 + ε,   Y(0) = 0,   p(X) = exp(X_1 + X_2)/[1 + exp(X_1 + X_2)],

where ε is a centered normal error. The true CATE conditional on X_1 can be derived as τ(x) = x(1 + 2x)²(1 − x). Since the misspecification effect is a concern, we use the misspecified parametric models

m̃_1(X; γ_1) = (1, X⊤)γ_1,   p̃(X; β) = exp((1, X_1)β)/[1 + exp((1, X_1)β)],

where γ_1 ∈ R³ and β ∈ R².
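A minimal Python sketch of Model 1's DGP (assuming the reconstruction above; the error standard deviation below is our own placeholder, and all names are ours), which can be paired with the estimator sketches of Section 3:

```python
import numpy as np

def simulate_model1(n, eps_sd=0.5, rng=None):
    """Draw one sample of size n from Model 1 as reconstructed above.

    Returns covariates X (n, 2), treatment D (n,), observed outcome Y (n,).
    eps_sd is an assumed error standard deviation, not the paper's value.
    """
    rng = np.random.default_rng(rng)
    x1 = rng.uniform(-0.5, 0.5, n)
    x2 = (1 + 2 * x1) ** 2 * (1 - x1) + rng.uniform(-0.5, 0.5, n)
    p = np.exp(x1 + x2) / (1 + np.exp(x1 + x2))   # true propensity score
    D = rng.binomial(1, p)
    y1 = x1 * x2 + rng.normal(0, eps_sd, n)       # potential outcome Y(1)
    Y = D * y1                                    # Y(0) = 0, so Y = D * Y(1)
    return np.column_stack([x1, x2]), D, Y

def tau_true(x):
    # True CATE conditional on X_1 = x
    return x * (1 + 2 * x) ** 2 * (1 - x)
```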
Model 2. The other DGP features a 4-dimensional unconfounded covariate, for the purpose of a further investigation of higher-dimensional cases. Write X = (X_1, X_2, X_3, X_4)⊤, with

X_1 = ρ_1,   X_2 = 1 + 2X_1 + ρ_2,   X_3 = 1 + 2X_1 + ρ_3,   X_4 = (1 − X_1) + ρ_4,

where ρ_1, ρ_2, ρ_3, ρ_4 are independent and identically U(−0.5, 0.5) distributed. The potential outcomes and the propensity score function are defined as

Y(1) = X_1 X_2 X_3 X_4 + ε,   Y(0) = 0,
p(X) = exp(X_1 + X_2 + X_3 + X_4)/[1 + exp(X_1 + X_2 + X_3 + X_4)],

where ε is again a centered normal error. The true CATE conditional on X_1 remains τ(x) = x(1 + 2x)²(1 − x). Still, we use the misspecified parametric models

m̃_1(X; γ_1) = (1, X⊤)γ_1,   p̃(X; β) = exp((1, X_1)β)/[1 + exp((1, X_1)β)],

where γ_1 ∈ R⁵ and β ∈ R².

5.2 Kernel Functions and Bandwidths

As the selections of the kernel functions and bandwidths (listed in the supplementary material) have a great influence on the asymptotic properties when the nuisance models are nonparametrically or semiparametrically estimated, we first discuss this issue. Let h = a_1 n^{−η_1} for some η_1 > 0.
Together with condition (A2), determining the value of η_1 becomes a linear programming problem. For model 1 (d = 2), we consider a kernel function of order 4 (s_1 = 4) as the kernel K in the second step of the Nadaraya-Watson estimation, and write h = a_1 n^{−η_1}. For the other bandwidths, take h_2 as an example: the results in Section 4 require s* ≥ s_2 ≥ d, so we choose s_2 = 2 and let h_2 = a_2 n^{−η_2}. The exponents (η_1, η_2) are then fixed accordingly, and the other bandwidths are determined similarly as h_j = a_j n^{−η_j} (j = 2, ..., 7) with s_j = 2 (j = 2, ..., 7), choosing the h_j jointly so as to meet the interaction condition in (A2). The constants a_j (j = 2, ..., 7) are set to fixed values.
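As a worked illustration of the linear programming step (our own numbers, derived only from condition (A2)(i) as reconstructed in the supplement): for model 1 we have k = 1 and s_1 = 4, and h = a_1 n^{−η_1} must satisfy

nh → ∞  ⟺  η_1 < 1,   and   nh^{2s_1+k} = a_1⁹ n^{1−9η_1} → 0  ⟺  η_1 > 1/9,

so any η_1 ∈ (1/9, 1) is admissible for the second-step bandwidth; the remaining exponents η_2, ..., η_7 are then pinned down by the analogous inequalities in (A2)(ii)–(xiv).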
For model 2 (d = 4), we consider s_1 = 6 and s_j = 4 (j = 2, ..., 7), with h = a_1 n^{−η_1} and h_j = a_j n^{−η_j} (j = 2, ..., 7) chosen in the same manner. In the simulations we also tried many other values of the constants and found the chosen ones recommendable, as the values around them make the estimators relatively stable.

We take a Gaussian kernel K of order s_1 satisfying condition (A1)(i). For the other kernel functions, we use Epanechnikov kernels of the corresponding orders satisfying conditions (A1)(ii) and (iii).

5.3 Simulation Results

As there are many estimators τ̂(x) with differently estimated nuisance models, we list them and the corresponding notation in Table 2 for convenience. To guarantee the regularity conditions and the estimation stability, all estimated propensity scores are trimmed to lie within a fixed interval bounded away from 0 and 1. We estimate τ(x) for x ∈ {−0.4, −0.2, 0, 0.2, 0.4}.
The sample sizes are n = 500 and n = 5,000, respectively, to examine the asymptotic behaviour. The experiments are repeated 2,000 times. Denote T(x) = √(nh)[τ̂(x) − τ(x)]. We evaluate the estimators based on the following criteria: the bias of τ̂(x); the sample standard deviation (sam-SD) of T(x); the mean squared error (MSE) of T(x). We also report the proportions (P_{0.05}, P_{0.95}) of the standardized T(x) below the 5% quantile and above the 95% quantile of N(0, 1) to verify the asymptotic normality. We display the efficiency comparisons among the different estimators under models 1 and 2 in Figures 2 and 3, and the detailed results under model 1 in Tables 3, 4 and 5. To save space, the simulation results for model 2 are reported in the supplementary material.

Table 2: Estimators involved in the simulation

DRCATE    p(x)                                        m(x)
(O, O)    oracle                                      oracle
(cP, cP)  parametrically estimated (correct)          parametrically estimated (correct)
(N, N)    nonparametrically estimated                 nonparametrically estimated
(S, S)    semiparametrically estimated                semiparametrically estimated
(mP, cP)  parametrically estimated (misspecified)     parametrically estimated (correct)
(mP, N)   parametrically estimated (misspecified)     nonparametrically estimated
(mP, S)   parametrically estimated (misspecified)     semiparametrically estimated
(cP, mP)  parametrically estimated (correct)          parametrically estimated (misspecified)
(N, mP)   nonparametrically estimated                 parametrically estimated (misspecified)
(S, mP)   semiparametrically estimated                parametrically estimated (misspecified)

Table 3: The simulation results under model 1 (part 1)

                         n = 500                                  n = 5000
DRCATE    x      bias    sam-SD  MSE     P0.05  P0.95     bias    sam-SD  MSE     P0.05  P0.95
(O,O)    -0.4   0.0001   0.2776  0.0770  0.052  0.046    0.0004   0.2724  0.0742  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.044   -0.0005   0.2333  0.0544  0.049  0.050
          0    -0.0002   0.2088  0.0436  0.049  0.050    0.0003   0.2014  0.0405  0.047  0.048
          0.2   0.0003   0.1997  0.0399  0.052  0.047    0.0002   0.1999  0.0400  0.050  0.054
          0.4   0.0027   0.2003  0.0403  0.045  0.058    0.0004   0.2006  0.0403  0.048  0.054
(cP,cP)  -0.4   0.0000   0.2797  0.0782  0.053  0.048    0.0004   0.2725  0.0743  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.042   -0.0005   0.2333  0.0544  0.051  0.048
          0    -0.0002   0.2089  0.0436  0.048  0.050    0.0003   0.2014  0.0405  0.047  0.047
          0.2   0.0003   0.1994  0.0397  0.051  0.048    0.0002   0.2001  0.0400  0.051  0.054
          0.4   0.0027   0.2003  0.0403  0.044  0.058    0.0004   0.2007  0.0403  0.047  0.054
(N,N)    -0.4   0.0008   0.2716  0.0738  0.050  0.053    0.0001   0.2845  0.0809  0.050  0.049
         -0.2   0.0015   0.2366  0.0560  0.042  0.058   -0.0001   0.2344  0.0549  0.050  0.050
          0     0.0002   0.2046  0.0419  0.043  0.052   -0.0005   0.1996  0.0399  0.057  0.041
          0.2   0.0010   0.2000  0.0400  0.044  0.051   -0.0001   0.1941  0.0377  0.052  0.056
          0.4   0.0014   0.2081  0.0433  0.045  0.054    0.0009   0.2012  0.0406  0.045  0.056
(S,S)    -0.4  -0.0022   0.2815  0.0794  0.051  0.044    0.0002   0.2862  0.0819  0.045  0.050
         -0.2   0.0004   0.2365  0.0559  0.046  0.052   -0.0004   0.2302  0.0530  0.046  0.048
          0     0.0005   0.2082  0.0433  0.053  0.052    0.0003   0.2059  0.0424  0.052  0.052
          0.2  -0.0015   0.1992  0.0397  0.061  0.041   -0.0002   0.2011  0.0404  0.053  0.051
          0.4   0.0002   0.2021  0.0408  0.050  0.046    0.0012   0.2048  0.0422  0.043  0.059

Table 4: The simulation results under model 1 (part 2)

                         n = 500                                  n = 5000
DRCATE    x      bias    sam-SD  MSE     P0.05  P0.95     bias    sam-SD  MSE     P0.05  P0.95
(O,O)    -0.4   0.0001   0.2776  0.0770  0.052  0.046    0.0004   0.2724  0.0742  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.044   -0.0005   0.2333  0.0544  0.049  0.050
          0    -0.0002   0.2088  0.0436  0.049  0.050    0.0003   0.2014  0.0405  0.047  0.048
          0.2   0.0003   0.1997  0.0399  0.052  0.047    0.0002   0.1999  0.0400  0.050  0.054
          0.4   0.0027   0.2003  0.0403  0.045  0.058    0.0004   0.2006  0.0403  0.048  0.054
(mP,cP)  -0.4   0.0000   0.2599  0.0675  0.052  0.049    0.0004   0.2530  0.0640  0.044  0.052
         -0.2  -0.0022   0.2363  0.0559  0.056  0.041   -0.0005   0.2323  0.0540  0.050  0.050
          0    -0.0002   0.2203  0.0485  0.049  0.048    0.0003   0.2116  0.0448  0.047  0.052
          0.2   0.0003   0.2041  0.0417  0.051  0.046    0.0002   0.2048  0.0419  0.050  0.053
          0.4   0.0027   0.1953  0.0383  0.044  0.058    0.0004   0.1955  0.0382  0.046  0.054
(mP,N)   -0.4  -0.0046   0.2666  0.0716  0.064  0.040   -0.0011   0.2629  0.0693  0.054  0.044
         -0.2  -0.0035   0.2373  0.0566  0.059  0.044   -0.0029   0.2383  0.0584  0.074  0.037
          0    -0.0068   0.2152  0.0474  0.072  0.032   -0.0027   0.2107  0.0458  0.072  0.034
          0.2  -0.0011   0.2041  0.0417  0.052  0.047   -0.0004   0.1952  0.0381  0.050  0.045
          0.4  -0.0008   0.2003  0.0401  0.049  0.049    0.0007   0.2002  0.0402  0.043  0.056
(mP,S)   -0.4  -0.0143   0.2701  0.0781  0.082  0.029   -0.0115   0.2722  0.0996  0.146  0.010
         -0.2  -0.0094   0.2453  0.0624  0.070  0.032   -0.0073   0.2302  0.0634  0.114  0.016
          0    -0.0046   0.2116  0.0453  0.064  0.043   -0.0038   0.2099  0.0469  0.083  0.032
          0.2  -0.0019   0.2041  0.0417  0.050  0.046   -0.0006   0.1970  0.0388  0.054  0.047
          0.4   0.0022   0.2002  0.0402  0.046  0.058    0.0017   0.1968  0.0393  0.037  0.062

Table 5: The simulation results under model 1 (part 3)

                         n = 500                                  n = 5000
DRCATE    x      bias    sam-SD  MSE     P0.05  P0.95     bias    sam-SD  MSE     P0.05  P0.95
(O,O)    -0.4   0.0001   0.2776  0.0770  0.052  0.046    0.0004   0.2724  0.0742  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.044   -0.0005   0.2333  0.0544  0.049  0.050
          0    -0.0002   0.2088  0.0436  0.049  0.050    0.0003   0.2014  0.0405  0.047  0.048
          0.2   0.0003   0.1997  0.0399  0.052  0.047    0.0002   0.1999  0.0400  0.050  0.054
          0.4   0.0027   0.2003  0.0403  0.045  0.058    0.0004   0.2006  0.0403  0.048  0.054
(cP,mP)  -0.4  -0.0012   0.3230  0.1044  0.051  0.048    0.0001   0.3201  0.1024  0.050  0.049
         -0.2  -0.0021   0.2400  0.0577  0.052  0.042   -0.0005   0.2362  0.0558  0.054  0.044
          0     0.0004   0.2147  0.0461  0.052  0.049    0.0003   0.2050  0.0420  0.049  0.049
          0.2   0.0004   0.2012  0.0405  0.054  0.046    0.0001   0.2016  0.0406  0.048  0.049
          0.4   0.0028   0.2059  0.0426  0.043  0.061    0.0004   0.2039  0.0416  0.045  0.053
(N,mP)   -0.4  -0.0105   0.2840  0.0834  0.075  0.040   -0.0013   0.2970  0.0885  0.060  0.045
         -0.2   0.0014   0.2353  0.0554  0.047  0.050    0.0007   0.2288  0.0525  0.040  0.053
          0     0.0013   0.2104  0.0443  0.048  0.054    0.0002   0.2065  0.0426  0.047  0.044
          0.2  -0.0014   0.1995  0.0398  0.056  0.048   -0.0004   0.2022  0.0409  0.052  0.044
          0.4   0.0008   0.2034  0.0414  0.046  0.046    0.0000   0.2077  0.0431  0.048  0.050
(S,mP)   -0.4  -0.0051   0.2964  0.0884  0.055  0.046   -0.0005   0.3089  0.0955  0.050  0.045
         -0.2  -0.0002   0.2421  0.0586  0.049  0.050    0.0001   0.2394  0.0573  0.048  0.051
          0     0.0005   0.2076  0.0431  0.050  0.050   -0.0001   0.2051  0.0421  0.048  0.049
          0.2  -0.0008   0.2082  0.0433  0.049  0.049   -0.0001   0.1966  0.0386  0.054  0.048
          0.4   0.0005   0.2104  0.0443  0.044  0.052    0.0006   0.2085  0.0435  0.048  0.054

[Figure 2: Relative sample variance of each estimator against DRCATE(O,O) in model 1, for n = 500 (panel a) and n = 5000 (panel b); estimators grouped as {(cP,cP), (N,N), (S,S)}, {(mP,cP), (mP,N), (mP,S)} and {(cP,mP), (N,mP), (S,mP)}.]

Here we present some observations from the simulation results.
First, as the sample size grows, the bias and the standard deviation of τ̂(x) reasonably tend to be smaller, due to the estimation consistency. The reported proportions P_{0.05} and P_{0.95} are controlled around 0.05, which implies that the normal approximation for the proposed estimators is valid.

Second, from Figures 2 and 3, the efficiency comparisons among the estimators (O,O), (cP,cP), (N,N) and (S,S) show that their distributions are close to each other. When only the propensity score function is misspecified, both variance inflation and shrinkage are possible. With misspecified outcome regression functions, only variance inflation is possible.

[Figure 3: Relative sample variance of each estimator against DRCATE(O,O) in model 2, for n = 500 (panel a) and n = 5000 (panel b), with the same estimator groupings as in Figure 2.]
Third, the bias and standard deviation of τ̂(x) increase when the covariate dimension grows; see the comparisons between Figures 2 and 3. A possible explanation for this phenomenon is that the standard deviations of the nuisance model estimates increase with a higher covariate dimension.
6. Conclusion
In this paper, we investigate the asymptotic behaviour of nine doubly robust (DR) estimators under different combinations of model structures, to provide a relatively complete picture of this methodology. When all models are correctly specified, the asymptotic equivalence among all the defined estimators holds, unsurprisingly. When models are misspecified, we consider local and global misspecification, and some interesting phenomena have been discovered, such as asymptotic variance shrinkage in some cases due to misspecification. Further, we would recommend semiparametric estimation under the dimension reduction structure. This is because nonparametric estimation severely suffers from the curse of dimensionality, whereas parametric estimation may not be sufficiently robust against model structure misspecification.
Acknowledgements
The research described herein was supported by an NNSF grant of China and a grant from the University Grants Council of Hong Kong, Hong Kong, China.

7. Supplementary Material
The supplementary material contains the detailed proofs of the theorems and propositions, and the additional simulation results.
7.1 Technical Conditions

Here we present the conditions needed to derive the theoretical results. Together with (C1) and (C2) in the main context, the following conditions in the (C) group are regularity conditions that guarantee the asymptotic properties regardless of the different ways of estimating the nuisance models.

(C3) The density functions involved in this article satisfy the following conditions:
(i) For any x ∈ X, the density function θ(x) of X is bounded away from 0.
(ii) For any x, the density function f(x) of X_1 is bounded away from zero and is s_1 times continuously differentiable.
(iii) Denote the density functions of A⊤X, B_1⊤X and B_0⊤X as θ_A(·), θ_{B_1}(·) and θ_{B_0}(·). For any x ∈ X, all these density functions are bounded away from 0.

(C4) Denote C as the parameter space of β. For any x ∈ X and β ∈ C, p̃(x; β) is bounded away from 0 and 1.

(C5) sup_x E[Y(j)² | X = x] < ∞ for j = 0, 1.

(C6) E|Ψ_1(X, Y, D) − τ(x)|^{κ_1} < ∞, E|Ψ_2(X, Y, D) − τ(x)|^{κ_2} < ∞, E|Ψ_3(X, Y, D) − τ(x)|^{κ_3} < ∞, E|Ψ_4(X, Y, D) − τ̃(x)|^{κ_4} < ∞, and ∫|K(u)|^δ du < ∞ for some constants κ_1, κ_2, κ_3, κ_4, δ ≥ 2.

Bounded propensity scores or specified propensity score models, density functions and the corresponding conditional moments are required in these conditions; these are common restrictions in the literature, and they play important roles in deriving the asymptotic linear expressions of the proposed estimators. (C6) ensures the applicability of Lyapunov's central limit theorem here.

We also assume some conditions on the kernel functions and bandwidths in the nonparametric estimation:

(A1) The kernel functions satisfy the following conditions:
(i) K(u) is a kernel function of order s_1, which is symmetric around zero and s* times continuously differentiable.
(ii) K_2(u), K_3(u) and K_4(u) are kernels of order s_2 ≥ p, s_3 ≥ p and s_4 ≥ p, which are symmetric around zero, equal to zero outside Π_{i=1}^p [−1, 1], and have continuous (s_2 + 1), (s_3 + 1) and (s_4 + 1) order derivatives, respectively.
(iii) K_5(u), K_6(u) and K_7(u) are kernels of order s_5 ≥ p(2), s_6 ≥ p(1) and s_7 ≥ p(0), which are symmetric around zero, equal to zero outside Π_{i=1}^{p(2)} [−1, 1], Π_{i=1}^{p(1)} [−1, 1] and Π_{i=1}^{p(0)} [−1, 1], and have continuous (s_5 + 1), (s_6 + 1) and (s_7 + 1) order derivatives, respectively.

(A2) As different scenarios require different bandwidths, we put them together in the following. As n → ∞:
(i) h → 0, nh^k → ∞, nh^{2s_1+k} → 0;
(ii)–(vii) for j = 2, ..., 7, h_j → 0 and (ln n)/(n h_j^{p_j + 2s_j}) → 0, where p_j is the dimension of the argument of K_j (p_j = p for j = 2, 3, 4 and p_j = p(2), p(1), p(0) for j = 5, 6, 7);
(viii)–(xiii) for j = 2, ..., 7, h_j^{s_j} h^{−s_1−k} → 0 and n h^k h_j^{2s_j} → 0;
(xiv) n h^k h_j^{s_j} h_l^{s_l} → 0 for the pairs (j, l) of bandwidths appearing in the same remainder term.

The bounded supports of K_j(u), j = 2, ..., 7, and the bandwidths h_j, j = 2, ..., 7, are required for the uniform consistency of the kernel estimators; Abrevaya et al. (2015) stated that this restriction to a bounded support can be relaxed to exponential tails.

(A2)(ii)–(xiv) place restrictions on the convergence rates of the different bandwidths to make the remainders of the linear expansions negligible. (A2)(xiv), involving more than two bandwidths, can be regarded as an interaction term, which makes it handleable to determine those convergence rates. Here we provide a simple idea to accomplish this task based on linear programming: assume the bandwidths converge to 0 as h_j = a_j n^{−η_j}, j = 1, ..., 7, with η_j > 0; with the orders s_j, j = 1, ..., 7, predetermined, the conditions in (A2) become linear inequalities in the η_j. For a more detailed example, readers can refer to Section 5.

Lastly, we give a condition ensuring the desired convergence rates of the estimators under the semiparametric dimension reduction structure, which is helpful when pursuing the asymptotic properties of τ̂(x):

(B1) Â − A = O_p(n^{−1/2}), B̂_1 − B_1 = O_p(n^{−1/2}), B̂_0 − B_0 = O_p(n^{−1/2}).

These rates can be achieved by standard estimation methods in the literature; see relevant references such as Li (1991). In summary, these conditions are rather standard.

7.2 Proof of Theorem 1
Recall that

τ̂(x) = Σ_i [ D_i(Y_i − m̂_{1i})/p̂_i − (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{1i} − m̂_{0i} ] K((X_{1i} − x)/h) / Σ_t K((X_{1t} − x)/h)
      = Σ_i [ D_i(Y_i − m̂_{1i})/p̂_i + m̂_{1i} ] K((X_{1i} − x)/h) / Σ_t K((X_{1t} − x)/h)
        − Σ_i [ (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{0i} ] K((X_{1i} − x)/h) / Σ_t K((X_{1t} − x)/h)
      =: τ̂_1(x) − τ̂_0(x).

Let

τ_1(x) = E{ D[Y − m_1(X)]/p(X) + m_1(X) | X_1 = x },   τ_0(x) = E{ (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_0(X) | X_1 = x },

so that τ(x) = τ_1(x) − τ_0(x). As the very first move, we look for the asymptotic linear expression of √(nh^k)[τ̂(x) − τ(x)]. Note that

√(nh^k)[τ̂(x) − τ(x)] = √(nh^k){ [τ̂_1(x) − τ_1(x)] − [τ̂_0(x) − τ_0(x)] }
 = (1/f̂(x)) (1/√(nh^k)) Σ_i [ D_i(Y_i − m̂_{1i})/p̂_i + m̂_{1i} − τ_1(x) ] K((X_{1i} − x)/h)
 − (1/f̂(x)) (1/√(nh^k)) Σ_i [ (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{0i} − τ_0(x) ] K((X_{1i} − x)/h).

Since f̂(x) →_P f(x), we can use Slutsky's theorem later, so we first consider the asymptotic linear expression of

J(x) = (1/√(nh^k)) Σ_{i=1}^n [ D_i(Y_i − m̂_{1i})/p̂_i + m̂_{1i} − τ_1(x) ] K((X_{1i} − x)/h);    (7.1)

the τ̂_0 part is handled symmetrically. We consider the following combinations of estimated nuisance functions:

Scenario 1. p(x) parametrically estimated (correctly specified), m_1(x) parametrically estimated (correctly specified);
Scenario 2. p(x) nonparametrically estimated, m_1(x) nonparametrically estimated;
Scenario 3. p(x) semiparametrically estimated, m_1(x) semiparametrically estimated;
Scenario 4. p(x) parametrically estimated (correctly specified), m_1(x) nonparametrically estimated;
Scenario 5. p(x) parametrically estimated (correctly specified), m_1(x) semiparametrically estimated;
Scenario 6. p(x) nonparametrically estimated, m_1(x) parametrically estimated (correctly specified);
Scenario 7. p(x) nonparametrically estimated, m_1(x) semiparametrically estimated;
Scenario 8. p(x) semiparametrically estimated, m_1(x) parametrically estimated (correctly specified);
Scenario 9. p(x) semiparametrically estimated, m_1(x) nonparametrically estimated.

Scenario 1: p(x) and m_1(x) are parametrically estimated. From the standard parametric estimation argument,

sup_{x∈X} |p̃(x; β̂) − p(x)| = sup_{x∈X} |p̃(x; β̂) − p̃(x; β*)| = O_p(1/√n),
sup_{x∈X} |m̃_1(x; γ̂_1) − m_1(x)| = sup_{x∈X} |m̃_1(x; γ̂_1) − m̃_1(x; γ_1*)| = O_p(1/√n).
We start from (7.1):

J(x) = (1/√(nh^k)) Σ_i [ D_i(Y_i − m̃_1(X_i; γ̂_1))/p̃(X_i; β̂) + m̃_1(X_i; γ̂_1) − τ_1(x) ] K((X_{1i} − x)/h)
 = (1/√(nh^k)) Σ_i [ D_iY_i/p(X_i) − (D_i − p(X_i)) m_1(X_i)/p(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [ D_i(m_{1i}^+ − Y_i)/(p_i^+)² ] [p̃(X_i; β̂) − p(X_i)] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [ (p_i^+ − D_i)/p_i^+ ] [m̃_1(X_i; γ̂_1) − m_1(X_i)] K((X_{1i} − x)/h)
 =: J_1(x) + J_2(x) + J_3(x),    (7.2)

where p_i^+ lies between p(X_i) and p̃(X_i; β̂), and m_{1i}^+ lies between m_1(X_i) and m̃_1(X_i; γ̂_1). Bound J_2(x) as

|J_2(x)| ≤ √(nh^k) sup_{x∈X} |p̃(x; β̂) − p(x)| × (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| |D_i(m_{1i}^+ − Y_i)/(p_i^+)²|,

where sup_{x∈X} |p̃(x; β̂) − p(x)| = O_p(1/√n), |D_i(m_{1i}^+ − Y_i)/(p_i^+)²| is bounded due to conditions (C2)(ii), (C4) and (C5), and (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| = O_p(1) by the standard nonparametric estimation argument. Thus |J_2(x)| = o_p(1) · O_p(1) = o_p(1). With similar arguments, we can also bound the last term as |J_3(x)| = o_p(1). So far, we have proved that J_2(x) and J_3(x) converge to 0 in probability. Hence, according to Slutsky's theorem, together with (7.2), we have

√(nh^k)[τ̂_1(x) − τ_1(x)] = J(x)/f̂(x) = J_1(x)/f̂(x) + o_p(1)
 = (1/f̂(x)) (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h) + o_p(1).    (7.3)

Scenario 2: p(x) and m_1(x) are nonparametrically estimated.
From the standard nonparametric estimation arguments, under conditions (A1)(i)–(iii) we have

sup_{x∈X} |p̂(x) − p(x)| = O_p( h_2^{s_2} + √(ln n/(n h_2^p)) ) = O_p(h_2^{s_2}),
sup_{x∈X} |m̂_1(x) − m_1(x)| = O_p( h_3^{s_3} + √(ln n/(n h_3^p)) ) = O_p(h_3^{s_3}),

and the bandwidth conditions in (A2) make √(nh^k) times these suprema o_p(1). Rewrite (7.1):

J(x) = (1/√(nh^k)) Σ_i [ D_i(Y_i − m̂_1(X_i))/p̂(X_i) + m̂_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 = (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i[m_1(X_i) − Y_i]/p²(X_i) [p̂(X_i) − p(X_i)] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [p(X_i) − D_i]/p(X_i) [m̂_1(X_i) − m_1(X_i)] K((X_{1i} − x)/h)
 + 0
 + (1/√(nh^k)) Σ_i D_i[Y_i − m_{1i}^+]/(p_i^+)³ [p̂(X_i) − p(X_i)]² K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i/(p_i^+)² [p̂(X_i) − p(X_i)][m̂_1(X_i) − m_1(X_i)] K((X_{1i} − x)/h)
 =: J_1(x) + J_2(x) + J_3(x) + 0 + J_4(x) + J_5(x),    (7.4)

where p_i^+ lies between p(X_i) and p̂(X_i), and m_{1i}^+ lies between m_1(X_i) and m̂_1(X_i). Rewrite J_2(x) as

J_2(x) = (1/√n) Σ_i { D_i[m_1(X_i) − Y_i]/p²(X_i) } (1/√(h^k)) [p̂(X_i) − p(X_i)] K((X_{1i} − x)/h).

Here E{ D_i[m_1(X_i) − Y_i]/p²(X_i) | X_i } = 0, so each summand is conditionally centered given X_i; by the uniform rate above and condition (A2), (1/√(h^k))[p̂(X_i) − p(X_i)] K((X_{1i} − x)/h) = o_p(1), and an application of the CLT then gives J_2(x) = o_p(1). Similarly, |J_3(x)| = o_p(1). We deal with J_4(x) via the bound

|J_4(x)| ≤ √(nh^k) sup_{x∈X} [p̂(x) − p(x)]² × (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| |D_i[Y_i − m_{1i}^+]/(p_i^+)³|,

in which √(nh^k) sup_{x∈X} [p̂(x) − p(x)]² = o_p(1) under condition (A2), and |D_i[Y_i − m_{1i}^+]/(p_i^+)³| is bounded under conditions (C2)(ii) and (C5). Again, by the standard argument for nonparametric estimation, (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| = O_p(1). Thus |J_4(x)| = o_p(1) · O_p(1) = o_p(1). In a similar way, |J_5(x)| = o_p(1) can also be proved. We have now shown that J_2(x), J_3(x), J_4(x) and J_5(x) can all be bounded as o_p(1).
Together with (7.4), we can obtain that

√(nh^k)[τ̂_1(x) − τ_1(x)] = J(x)/f̂(x) = J_1(x)/f̂(x) + o_p(1)
 = (1/f̂(x)) (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h) + o_p(1).    (7.5)

Scenario 3: p(x) and m_1(x) are semiparametrically estimated. Under conditions (A2)(v)–(vii),

sup_{x∈X} |ĝ(A⊤x) − g(A⊤x)| = O_p( h_5^{s_5} + √(ln n/(n h_5^{p(2)})) ),
sup_{x∈X} |r̂_1(B_1⊤x) − r_1(B_1⊤x)| = O_p( h_6^{s_6} + √(ln n/(n h_6^{p(1)})) ).

Note that, under condition (B1), we can first derive the asymptotic distribution by assuming that the projection matrices A, B_1 and B_0 are given. Then

J(x) = (1/√(nh^k)) Σ_i [ D_i(Y_i − r̂_1(B_1⊤X_i))/ĝ(A⊤X_i) + r̂_1(B_1⊤X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 = (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i[m_1(X_i) − Y_i]/p²(X_i) [ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [p(X_i) − D_i]/p(X_i) [r̂_1(B_1⊤X_i) − r_1(B_1⊤X_i)] K((X_{1i} − x)/h)
 + 0
 + (1/√(nh^k)) Σ_i D_i[Y_i − r_{1i}^+]/(g_i^+)³ [ĝ(A⊤X_i) − g(A⊤X_i)]² K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i/(g_i^+)² [r̂_1(B_1⊤X_i) − r_1(B_1⊤X_i)][ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h)
 =: J_1(x) + J_2(x) + J_3(x) + 0 + J_4(x) + J_5(x),    (7.6)

where g_i^+ lies between g(A⊤X_i) and ĝ(A⊤X_i), and r_{1i}^+ lies between r_1(B_1⊤X_i) and r̂_1(B_1⊤X_i). We then deal with all the terms one by one. Consider J_2(x) and J_3(x). We have

J_2(x) = (1/√n) Σ_i { D_i[r_1(B_1⊤X_i) − Y_i]/g²(A⊤X_i) } (1/√(h^k)) [ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h).

Again E{ D_i[m_1(X_i) − Y_i]/p²(X_i) | X_i } = 0, so the summands are conditionally centered given X_i; by the uniform rate above and condition (A2)(xi), (1/√(h^k))[ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h) = o_p(1), and an application of the CLT yields J_2(x) = o_p(1). We can prove J_3(x) = o_p(1) similarly. Next we deal with J_4(x) and J_5(x).
We have
\[
|J_4(x)|\le\sqrt{nh^{k}}\,\sup_{x\in\mathcal X}\{\hat g(A^{\top}x)-g(A^{\top}x)\}^{2}\times\frac{1}{nh^{k}}\sum_{i=1}^n|K_i|\,\Big|\frac{D_i\{Y_i-r^{+1}_i\}}{(g^{+}_i)^{3}}\Big|,
\]
and condition (A2)(xi) implies that $\sqrt{nh^{k}}\sup_{x\in\mathcal X}\{\hat g(A^{\top}x)-g(A^{\top}x)\}^{2}=o_p(1)$. Under conditions (C2)(ii) and (C5), $|D_i\{Y_i-r^{+1}_i\}/(g^{+}_i)^{3}|$ is bounded and, again, $(nh^{k})^{-1}\sum_{i=1}^n|K_i|=O_p(1)$. We can then achieve $|J_4(x)|=o_p(1)\cdot O_p(1)=o_p(1)$; the same argument proves $|J_5(x)|=o_p(1)$. In this way, the asymptotic negligibility of $J_2(x)$, $J_3(x)$, $J_4(x)$ and $J_5(x)$ has been proved. Together with (7.6), it can be derived that
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}J(x)=\frac{1}{\hat f(x)}J_1(x)+o_p(1)
=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{p(X_i)}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1).
\tag{7.7}
\]
Equations (7.3), (7.5) and (7.7) imply that the asymptotic linear expressions of $\sqrt{nh^{k}}[\hat\tau(x)-\tau(x)]$ are identical among scenarios 1, 2 and 3. Therefore, under the conditions of Theorem 1, the asymptotic linear expression, and hence the asymptotic distribution, is the same in every scenario. With the asymptotic linear expression we can further derive the asymptotic distribution. First, we have the decomposition
\[
\begin{aligned}
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]
&=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1)\\
&=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i
+\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i+o_p(1)\\
&=:\frac{1}{\hat f(x)}I_1(x)+\frac{1}{\hat f(x)}I_2(x)+o_p(1).
\end{aligned}
\tag{7.8}
\]
Consider $I_1(x)$ first. Note that $\tau(X_i)=E[\Psi(X_i,Y_i,D_i)\mid X=X_i]$, so $\Psi(X_i,Y_i,D_i)-\tau(X_i)$ is conditionally mean zero given $X_i$, while $K_i$ depends only on $n$ and $X_i$. Thus, by the tower property,
\[
E\Big\{[\Psi(X,Y,D)-\tau(X)]K\Big(\frac{X-x}{h}\Big)\Big\}
=E\Big\{E[\Psi(X,Y,D)-\tau(X)\mid X]\,K\Big(\frac{X-x}{h}\Big)\Big\}=0.
\]
Also, for each fixed $n$ and $x$, $\{[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i\}_{i=1}^n$ are independently and identically distributed.
We now check the condition of Lyapunov's CLT: for some $\kappa>2$,
\[
\sum_{i=1}^n E\Big|\frac{1}{\sqrt{nh^{k}}}[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i\Big|^{\kappa}\to 0\qquad(n\to\infty).
\]
Under condition (C6), letting $C=\sup_{x\in\mathcal X}E\{|\Psi(X,Y,D)-\tau(X)|^{\kappa}\mid X=x\}<\infty$, we have
\[
\sum_{i=1}^n E\Big|\frac{1}{\sqrt{nh^{k}}}[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i\Big|^{\kappa}
=(nh^{k})^{1-\kappa/2}\,\frac{1}{h^{k}}E\Big\{|\Psi(X,Y,D)-\tau(X)|^{\kappa}\Big|K\Big(\frac{X-x}{h}\Big)\Big|^{\kappa}\Big\}
\le(nh^{k})^{1-\kappa/2}\,C\,\frac{1}{h^{k}}E\Big|K\Big(\frac{X-x}{h}\Big)\Big|^{\kappa},
\]
where
\[
\frac{1}{h^{k}}E\Big|K\Big(\frac{X-x}{h}\Big)\Big|^{\kappa}=\int K^{\kappa}(u)f(x+hu)\,du\to f(x)\int K^{\kappa}(u)\,du<\infty.
\]
Since $nh^{k}\to\infty$ and $\kappa>2$, $(nh^{k})^{1-\kappa/2}\to 0$, so the whole sum tends to zero as $n\to\infty$. The Lyapunov condition is satisfied and then
\[
I_1(x)\xrightarrow{d}N(0,V),
\tag{7.9}
\]
where $V=\lim_{n\to\infty}Var\big\{(nh^{k})^{-1/2}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i\big\}$. To compute the variance $V$, we can see that
\[
\begin{aligned}
Var\Big\{\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i\Big\}
&=\frac{1}{h^{k}}E\Big\{E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X\big]K^{2}\Big(\frac{X-x}{h}\Big)\Big\}+o(1)\\
&=\frac{1}{h^{k}}\int K^{2}\Big(\frac{t-x}{h}\Big)E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X=t\big]f(t)\,dt+o(1)\\
&=\int K^{2}(u)\,E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X=x+hu\big]f(x+hu)\,du+o(1)\\
&=\sigma^{2}(x)f(x)\int K^{2}(u)\,du+o(1),
\end{aligned}
\]
where $\sigma^{2}(x)=E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X=x\big]$. Consider $I_2(x)$ next. We have
\[
I_2(x)=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i,
\]
whose mean satisfies
\[
E\Big\{\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i\Big\}
=\sqrt{nh^{k}}\int[\tau(x+hu)-\tau(x)]K(u)f(x+hu)\,du=\sqrt{nh^{k}}\,O(h^{s})=o(1)
\]
by the smoothness and bandwidth conditions. Note that its variance is
\[
\begin{aligned}
Var\Big\{\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i\Big\}
&=\frac{1}{h^{k}}E\Big\{[\tau(X)-\tau(x)]^{2}K^{2}\Big(\frac{X-x}{h}\Big)\Big\}
-\frac{1}{h^{k}}\Big[E\Big\{[\tau(X)-\tau(x)]K\Big(\frac{X-x}{h}\Big)\Big\}\Big]^{2}\\
&=\int[\tau(x+hu)-\tau(x)]^{2}K^{2}(u)f(x+hu)\,du
-h^{k}\Big[\int[\tau(x+hu)-\tau(x)]K(u)f(x+hu)\,du\Big]^{2}=o(1),
\end{aligned}
\]
so $I_2(x)=o_p(1)$. Combining (7.8), (7.9) and $I_2(x)=o_p(1)$, we can obtain that
\[
I_1(x)+I_2(x)\xrightarrow{d}N\Big(0,\ \sigma^{2}(x)f(x)\int K^{2}(u)\,du\Big)
\]
and, since $\hat f(x)\to f(x)$ in probability,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}[I_1(x)+I_2(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\]
Now we consider the cases with unknown $A$, $B_1$ and $B_0$. Note that under condition (B1), $\hat A$, $\hat B_1$ and $\hat B_0$ converge in probability to $A$, $B_1$ and $B_0$ respectively at the rate $O_p(1/\sqrt n)$. Following arguments similar to those in Hu et al. (2014), it is easy to see that the asymptotic distribution is retained; we omit the details to save space. We can now conclude that, under the conditions of Theorem 1, regardless of which estimation method (parametric, nonparametric, or semiparametric dimension reduction) is used to estimate the nuisance models,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\]
The proof is done. $\Box$
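As a quick sanity check on this limit, the sketch below (ours, not the paper's) simulates the kernel-localized doubly robust estimator with known nuisance functions, studentizes it by the asymptotic standard deviation $\sqrt{\sigma^{2}(x)\int K^{2}(u)\,du/f(x)}$, and verifies that the replications look standard normal. The data-generating process, the Gaussian kernel and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, x0, p0 = 2000, 500, 0.0, 0.5
h = n ** (-1 / 5)                                    # CATE bandwidth, k = 1
K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
int_K2, f_x0 = 1 / (2 * np.sqrt(np.pi)), 0.5         # int K^2 du; X ~ U(-1,1)

stats = []
for _ in range(M):
    X = rng.uniform(-1, 1, n)
    D = rng.binomial(1, p0, n)
    Y = X + D * (1 + X) + rng.normal(0, 1, n)        # m1 = 1 + 2x, m0 = x, tau = 1 + x
    m1, m0 = 1 + 2 * X, X
    psi = D * (Y - m1) / p0 + m1 - (1 - D) * (Y - m0) / (1 - p0) - m0
    w = K((X - x0) / h)
    tau_hat = np.sum(w * psi) / np.sum(w)
    # sigma^2(x0) = Var(psi | X = x0) = (1/p0 + 1/(1-p0)) * Var(eps) = 4 here
    avar = (1 / p0 + 1 / (1 - p0)) * int_K2 / f_x0
    stats.append(np.sqrt(n * h) * (tau_hat - (1 + x0)) / np.sqrt(avar))

print(np.mean(stats), np.std(stats))                 # should be close to 0 and 1
```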
7.3 Proofs of Theorems 2 and 3

We now consider the global misspecification cases. Similarly to the proof of Theorem 1, we first derive the asymptotic linear expression of $J(x)$.

Scenario 1: $m_1(x)$ is nonparametrically estimated. In this case, we have
\[
\sup_{x\in\mathcal X}\big|\tilde p(x;\hat\beta)-\tilde p(x;\beta^{*})\big|=O_p\Big(\frac{1}{\sqrt n}\Big),\qquad
\sup_{x\in\mathcal X}|\hat m_1(x)-m_1(x)|=O_p\!\Big(h_1^{s}+\sqrt{\tfrac{\ln n}{nh_1^{p}}}\Big)=o_p(1).
\]
We can further rewrite (7.1) as
\[
\begin{aligned}
J(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\hat\beta)}\{Y_i-\hat m_1(X_i)\}+\hat m_1(X_i)-\tau(x)\Big]K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{m_1(X_i)-Y_i\}}{\tilde p^{2}(X_i;\beta^{*})}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat m_1(X_i)-m_1(X_i)\}K_i+0\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{Y_i-m^{+1}_i\}}{(p^{+}_i)^{3}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}^{2}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i}{(p^{+}_i)^{2}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=:J_1(x)+J_2(x)+J_3(x)+0+J_4(x)+J_5(x),
\end{aligned}
\tag{7.10}
\]
where $p^{+}_i$ lies between $\tilde p(X_i;\beta^{*})$ and $\tilde p(X_i;\hat\beta)$, and $m^{+1}_i$ lies between $m_1(X_i)$ and $\hat m_1(X_i)$. As $J_2(x)$ and $J_4(x)$ can be proved to be $o_p(1)$ in the same way as in scenario 1 of Subsection 7.2, the details are omitted. For $J_5(x)$, obviously
\[
|J_5(x)|\le\sqrt{nh^{k}}\,\sup_{x\in\mathcal X}\big|\tilde p(x;\hat\beta)-\tilde p(x;\beta^{*})\big|\,\sup_{x\in\mathcal X}|\hat m_1(x)-m_1(x)|\times\frac{1}{nh^{k}}\sum_{i=1}^n|K_i|\,\Big|\frac{D_i}{(p^{+}_i)^{2}}\Big|=o_p(1).
\]
Consider $J_3(x)$. Denote
\[
\lambda(X_i)=E\Big[\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\,\Big|\,X_i\Big],\quad
\mu_i=\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}-\lambda(X_i),\quad
\epsilon_i=Y_i-m_1(X_i),\quad
\rho_{ij}=\frac{K\big(\frac{X_i-X_j}{h_1}\big)}{\sum_{t=1}^n D_tK\big(\frac{X_i-X_t}{h_1}\big)}.
\]
We first give a lemma on the asymmetry of the weights $\rho_{ij}$, which is useful for the proof of the theorem.
Lemma 1.
Under condition (A1)(ii), the weights $\rho_{ij}$ of the nonparametric outcome regression estimator satisfy
\[
|\rho_{ij}-\rho_{ji}|\le\frac{C_n}{nh_1^{p}}K\Big(\frac{X_i-X_j}{h_1}\Big),
\]
where $C_n=O_p(h_1)$ and does not depend on $i$ or $j$.

Proof.
Note that $\rho_{ij}=\rho_{ji}=0$ if $\|X_i-X_j\|_{\infty}>h_1$, by the compact support of the kernel. We now consider the event $\|X_i-X_j\|_{\infty}\le h_1$. For all $i$,
\[
\frac{1}{nh_1^{p}}\sum_{t=1}^n D_tK\Big(\frac{X_i-X_t}{h_1}\Big)
=\frac{\sum_{t=1}^n D_tK\big(\frac{X_i-X_t}{h_1}\big)}{\sum_{t=1}^n K\big(\frac{X_i-X_t}{h_1}\big)}\cdot\frac{1}{nh_1^{p}}\sum_{t=1}^n K\Big(\frac{X_i-X_t}{h_1}\Big)
=\hat p(X_i)\,\hat\theta(X_i),
\]
where $\hat\theta$ is the kernel estimator of the density $\theta$ of $X$. Then
\[
|\rho_{ij}-\rho_{ji}|=\frac{1}{nh_1^{p}}\Big|K\Big(\frac{X_i-X_j}{h_1}\Big)\Big|\,\Big|\frac{1}{\hat p(X_i)\hat\theta(X_i)}-\frac{1}{\hat p(X_j)\hat\theta(X_j)}\Big|
\le\frac{1}{nh_1^{p}}\Big|K\Big(\frac{X_i-X_j}{h_1}\Big)\Big|\Big\{\Big|\frac{1}{\hat p(X_i)\hat\theta(X_i)}-\frac{1}{p(X_i)\theta(X_i)}\Big|
+\Big|\frac{1}{p(X_i)\theta(X_i)}-\frac{1}{p(X_j)\theta(X_j)}\Big|
+\Big|\frac{1}{p(X_j)\theta(X_j)}-\frac{1}{\hat p(X_j)\hat\theta(X_j)}\Big|\Big\}.
\]
Again by the standard arguments for dealing with nonparametric estimation used before, and since $s\ge p$,
\[
\sup_{x\in\mathcal X}|\hat p(x)-p(x)|=O_p\!\Big(h_1^{s}+\sqrt{\tfrac{\ln n}{nh_1^{p}}}\Big)=o_p(h_1),\qquad
\sup_{x\in\mathcal X}|\hat\theta(x)-\theta(x)|=O_p\!\Big(h_1^{s}+\sqrt{\tfrac{\ln n}{nh_1^{p}}}\Big)=o_p(h_1).
\]
Recall conditions (C2)(ii) and (C3)(i), under which $p(x)$ and $\theta(x)$ are bounded away from 0. The two rates above imply that $\hat p(x)$ and $\hat\theta(x)$ converge uniformly to $p(x)$ and $\theta(x)$ respectively, so that $\hat p(x)$ and $\hat\theta(x)$ are also bounded away from 0 in probability for $n$ large enough. Then
\[
\sup_{x\in\mathcal X}\Big|\frac{1}{\hat p(x)\hat\theta(x)}-\frac{1}{p(x)\theta(x)}\Big|
=\sup_{x\in\mathcal X}\frac{\big|\hat p(x)\hat\theta(x)-p(x)\theta(x)\big|}{p(x)\theta(x)\hat p(x)\hat\theta(x)}
\le\sup_{x\in\mathcal X}\frac{\hat p(x)|\hat\theta(x)-\theta(x)|+|\hat p(x)-p(x)|\,\theta(x)}{p(x)\theta(x)\hat p(x)\hat\theta(x)}=o_p(h_1),
\]
so the first and third terms in the curly brace are $o_p(h_1)$ uniformly over all $X_i$, $X_j$. By the Lipschitz continuity of $1/(p\theta)$ and $\|X_i-X_j\|_{\infty}\le h_1$, the middle term is $O(h_1)$ uniformly over all $X_i$, $X_j$. Altogether, the summation in the curly brace is $O_p(h_1)$. Therefore, there exists $C_n=O_p(h_1)$, free of $i$ and $j$, such that
\[
|\rho_{ij}-\rho_{ji}|\le\frac{C_n}{nh_1^{p}}K\Big(\frac{X_i-X_j}{h_1}\Big).
\]
The proof is completed. $\Box$
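The inequality can be eyeballed numerically. The sketch below is ours: the Epanechnikov kernel, $p=1$ and the logistic propensity are all assumed for illustration. It computes the smallest constant compatible with the bound and checks that it shrinks roughly like $h_1$.

```python
import numpy as np

def implied_Cn(n, h, rng):
    # rho_ij = K((X_i-X_j)/h) / sum_t D_t K((X_i-X_t)/h), compactly supported K, p = 1.
    X = rng.uniform(-1, 1, n)
    D = rng.binomial(1, 1 / (1 + np.exp(-X)))
    U = (X[:, None] - X[None, :]) / h
    Kmat = 0.75 * np.maximum(1 - U**2, 0.0)            # Epanechnikov kernel matrix
    denom = np.clip(Kmat @ D, 1e-12, None)             # guard against empty windows
    rho = Kmat / denom[:, None]
    # Lemma 1 bound: |rho_ij - rho_ji| <= C_n K_ij / (n h^p); solve for C_n pairwise.
    mask = Kmat > 0
    ratio = np.zeros_like(Kmat)
    ratio[mask] = n * h * np.abs(rho - rho.T)[mask] / Kmat[mask]
    return ratio.max()

rng = np.random.default_rng(2)
for n in (500, 2000, 8000):
    h = n ** (-1 / 5)
    print(n, round(h, 3), implied_Cn(n, h, rng))       # should shrink roughly like h
```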
Now we come back to handle the term $J_3(x)$, which can be decomposed, using $\hat m_1(X_i)=\sum_{j=1}^n\rho_{ij}D_jY_j=\sum_{j=1}^n\rho_{ij}D_j\{\epsilon_j+m_1(X_j)\}$, as
\[
\begin{aligned}
J_3(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\lambda(X_i)\{\hat m_1(X_i)-m_1(X_i)\}K_i
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\rho_{ij}D_j\{\epsilon_j+m_1(X_j)\}-m_1(X_i)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\frac{D_i\epsilon_i}{p(X_i)}
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\epsilon_i}{p(X_i)}\Big[p(X_i)\sum_{j=1}^n\rho_{ji}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\rho_{ij}D_jm_1(X_j)-m_1(X_i)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=:J_{31}(x)+J_{32}(x)+J_{33}(x)+J_{34}(x).
\end{aligned}
\tag{7.11}
\]
We first prove that $J_{3k}(x)=o_p(1)$ for $k=2,3,4$.
Consider $J_{32}(x)$ by using the following decomposition:
\[
\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\rho_{ji}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]
=\frac{1}{\sqrt{h^{k}}}p(X_i)\sum_{j=1}^n(\rho_{ij}-\rho_{ji})K_j\lambda(X_j)
+\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\rho_{ij}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]
=:L_1+L_2.
\]
$L_1$ can be bounded as
\[
|L_1|\le\frac{1}{\sqrt{h^{k}}}\sup_i\Big|p(X_i)\sum_{j=1}^n(\rho_{ij}-\rho_{ji})K_j\lambda(X_j)\Big|
\le\frac{1}{\sqrt{h^{k}}}\sup_i\sum_{j=1}^n|\rho_{ij}-\rho_{ji}|\,|K_j\lambda(X_j)|
\le\frac{MC_n}{\sqrt{h^{k}}}\,\sup_i\frac{1}{nh_1^{p}}\sum_{j=1}^n\Big|K\Big(\frac{X_i-X_j}{h_1}\Big)\Big|\,|K_j\lambda(X_j)|,
\]
where $M$ bounds $p(\cdot)$ and Lemma 1 has been applied. Since $C_n/h_1=O_p(1)$, $h_1h^{-k/2}=o(1)$ by the bandwidth conditions, and the remaining supremum is $O_p(1)$, we obtain $|L_1|=O_p(1)\cdot o(1)\cdot O_p(1)=o_p(1)$. Then $L_2$ can be handled by inserting the normalization $\sum_{s}K\big(\frac{X_i-X_s}{h_1}\big)/\sum_{t}D_tK\big(\frac{X_i-X_t}{h_1}\big)$ and adding and subtracting the corresponding population quantities: every resulting piece is a Nadaraya–Watson-type bias term of uniform order $O_p(h_1^{s})$, so that
\[
L_2=\frac{1}{\sqrt{h^{k}}}\,O_p(h_1^{s})=O_p\big(h_1^{s}h^{-k/2}\big).
\]
Then under condition (A2)(ix), $L_2=o_p(1)$ and, together with the bound for $L_1$, we have
\[
\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\rho_{ji}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]=o_p(1).
\]
Since $\{\epsilon_i\}_{i=1}^n$ are mutually independent with conditional mean zero given $\{X_i\}_{i=1}^n$, it follows that $J_{32}(x)=o_p(1)$. Second, bound $J_{33}(x)$ by noting that
\[
|J_{33}(x)|\le\sqrt{nh^{k}}\,\sup_{x\in\mathcal X}\Big|\sum_{j=1}^n\rho_{ij}D_jm_1(X_j)-m_1(X_i)\Big|\times\frac{1}{nh^{k}}\sum_{i=1}^n|K_i\lambda(X_i)|
=\sqrt{nh^{k}}\,O_p(h_1^{s})\,O_p(1)=o_p(1),
\]
the supremum being the uniform smoothing bias of the outcome estimator. A similar argument applied to $J_{34}(x)$ leads to $J_{34}(x)=o_p(1)$.
Altogether, combining (7.11), we have $J_3(x)=J_{31}(x)+o_p(1)$. Recalling that $J_2(x)$, $J_4(x)$ and $J_5(x)$ have all been proved to be $o_p(1)$, together with (7.10) we can conclude that
\[
\begin{aligned}
J(x)&=J_1(x)+J_{31}(x)+o_p(1)\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\frac{D_i\epsilon_i}{p(X_i)}+o_p(1)\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{p(X_i)}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1),
\end{aligned}
\tag{7.12}
\]
where the last equality holds because $\lambda(x)=\{\tilde p(x;\beta^{*})-p(x)\}/\tilde p(x;\beta^{*})$, so that $D_i\epsilon_i/\tilde p(X_i;\beta^{*})+\lambda(X_i)D_i\epsilon_i/p(X_i)=D_i\epsilon_i/p(X_i)$: the correction term $J_{31}(x)$ exactly restores the influence function with the true propensity score. The proof is finished. $\Box$

Scenario 2: $m_1(x)$ is semiparametrically estimated. First, we have
\[
\sup_{x\in\mathcal X}\big|\tilde p(x;\hat\beta)-\tilde p(x;\beta^{*})\big|=O_p\Big(\frac{1}{\sqrt n}\Big),\qquad
\sup_{x\in\mathcal X}\big|\hat r_1(B_1^{\top}x)-r_1(B_1^{\top}x)\big|=O_p\!\Big(h_6^{s}+\sqrt{\tfrac{\ln n}{nh_6^{p^{(1)}}}}\Big)=o_p(1).
\]
We can further decompose the term in (7.1) as
\[
\begin{aligned}
J(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\hat\beta)}\{Y_i-\hat r_1(B_1^{\top}X_i)\}+\hat r_1(B_1^{\top}X_i)-\tau(x)\Big]K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{r_1(B_1^{\top}X_i)-Y_i\}}{\tilde p^{2}(X_i;\beta^{*})}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i+0\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{Y_i-r^{+1}_i\}}{(p^{+}_i)^{3}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}^{2}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i}{(p^{+}_i)^{2}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=:J_1(x)+J_2(x)+J_3(x)+0+J_4(x)+J_5(x),
\end{aligned}
\tag{7.13}
\]
where $p^{+}_i$ lies between $\tilde p(X_i;\beta^{*})$ and $\tilde p(X_i;\hat\beta)$, and $r^{+1}_i$ lies between $r_1(B_1^{\top}X_i)$ and $\hat r_1(B_1^{\top}X_i)$. Due to the similarity with the arguments above, we omit the details proving that $J_2(x)$, $J_4(x)$ and $J_5(x)$ are $o_p(1)$.
Now consider $J_3(x)$. Denote
\[
\lambda(X_i)=E\Big[\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\,\Big|\,X_i\Big],\quad
\mu_i=\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}-\lambda(X_i),\quad
\epsilon_i=Y_i-r_1(B_1^{\top}X_i),\quad
\nu_{ij}=\frac{K\big(\frac{B_1^{\top}X_i-B_1^{\top}X_j}{h_6}\big)}{\sum_{t=1}^n D_tK\big(\frac{B_1^{\top}X_i-B_1^{\top}X_t}{h_6}\big)}.
\]
Using $\hat r_1(B_1^{\top}X_i)=\sum_{j=1}^n\nu_{ij}D_jY_j$, $J_3(x)$ can be rewritten as
\[
\begin{aligned}
J_3(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\lambda(X_i)\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\nu_{ij}D_j\{\epsilon_j+r_1(B_1^{\top}X_j)\}-r_1(B_1^{\top}X_i)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\epsilon_i}{p(X_i)}\Big[p(X_i)\sum_{j=1}^n\nu_{ji}K_j\lambda(X_j)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\nu_{ij}D_jr_1(B_1^{\top}X_j)-r_1(B_1^{\top}X_i)\Big]\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=:J_{31}(x)+J_{32}(x)+J_{33}(x).
\end{aligned}
\tag{7.14}
\]
By the same arguments as for $J_{33}(x)$ and $J_{34}(x)$ in scenario 1, it is obvious that $J_{32}(x)=o_p(1)$ and $J_{33}(x)=o_p(1)$. To derive that $J_{31}(x)=o_p(1)$, we start by writing
\[
\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\nu_{ji}K_j\lambda(X_j)\Big]
=\frac{1}{\sqrt{h^{k}}}p(X_i)\sum_{j=1}^n(\nu_{ij}-\nu_{ji})K_j\lambda(X_j)
+\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\nu_{ij}K_j\lambda(X_j)\Big]
=:L_1+L_2.
\]
Similarly to the proof of $L_1=o_p(1)$ above (using the analogue of Lemma 1 for $\nu_{ij}$), we can show $L_1=o_p(1)$, and we omit the details. To handle $L_2$, denote by $q_B(\cdot)$ the density of $B_1^{\top}X$, let $\theta_B(B_1^{\top}X)=E[D\mid B_1^{\top}X]$, and let $\hat\theta_B$ and $\hat q_B$ be their kernel estimators based on the index $B_1^{\top}X$ with bandwidth $h_6$, so that $\sum_{t=1}^nD_tK\big(\frac{B_1^{\top}X_i-B_1^{\top}X_t}{h_6}\big)=nh_6^{p^{(1)}}\hat\theta_B(B_1^{\top}X_i)\hat q_B(B_1^{\top}X_i)$.
To deal with $L_2$, consider the conditional expectation, which can be derived as
\[
\begin{aligned}
E\Big\{p(X_i)\sum_{j=1}^n\nu_{ij}K_j\lambda(X_j)\,\Big|\,X_i\Big\}
&=E\Bigg\{\frac{p(X_i)}{nh_6^{p^{(1)}}\hat\theta_B(B_1^{\top}X_i)\hat q_B(B_1^{\top}X_i)}\sum_{j=1}^nK\Big(\frac{B_1^{\top}(X_j-X_i)}{h_6}\Big)K_j\lambda(X_j)\,\Bigg|\,X_i\Bigg\}\\
&=[1+o_p(1)]\,\frac{p(X_i)}{h_6^{p^{(1)}}\theta_B(B_1^{\top}X_i)q_B(B_1^{\top}X_i)}\int K\Big(\frac{B_1^{\top}(u-X_i)}{h_6}\Big)K\Big(\frac{u-x}{h}\Big)\lambda(u)\theta(u)\,du\\
&=h_6^{\,p-p^{(1)}}[1+o_p(1)]\,\frac{p(X_i)\,\theta(X_i)}{\theta_B(B_1^{\top}X_i)q_B(B_1^{\top}X_i)}K\Big(\frac{X_i-x}{h}\Big)\lambda(X_i)+O_p\big(h_6^{\,p-p^{(1)}+s}h^{-s}\big)\\
&=O_p\big(h_6^{\,p-p^{(1)}}\big)+O_p\big(h_6^{\,p-p^{(1)}+s}h^{-s}\big).
\end{aligned}
\]
Then, when $s<(2s+k)(p-p^{(1)})$,
\[
L_2=O_p\big(h_6^{\,p-p^{(1)}}h^{-k/2}+h_6^{\,p-p^{(1)}+s}h^{s-k/2}\big)=o_p(1),
\]
and hence $J_{31}(x)=o_p(1)$. Together with (7.14), $J_3(x)=o_p(1)$. Recall that we have proved $J_2(x)$, $J_4(x)$ and $J_5(x)$ to be $o_p(1)$. With (7.13), we can eventually derive the asymptotically linear representation
\[
J(x)=J_1(x)+o_p(1)=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1).
\tag{7.15}
\]
The proof is completed. $\Box$

Scenario 3: $m_1(x)$ is parametrically estimated (correctly specified). With an argument similar to that for scenario 1 in Theorem 1, we can easily derive that
\[
J(x)=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1).
\tag{7.16}
\]
It is now easy to deduce the asymptotically linear expression of the proposed estimator. When the outcome regression functions are nonparametrically estimated, recalling the relation between $\sqrt{nh^{k}}[\hat\tau(x)-\tau(x)]$ and $J(x)$ defined in (7.1), equation (7.12) gives
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1),
\]
with $\Psi$ the doubly robust score built from the true propensity score. According to (7.15) and (7.16), when the outcome regression functions are semiparametrically or parametrically estimated, we have the similar representation
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tilde\Psi_2(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1),
\]
where $\tilde\Psi_2$ replaces $p(\cdot)$ by $\tilde p(\cdot;\beta^{*})$ in the score. Arguing as in the proof of Theorem 1, under the conditions of Theorem 2, when the outcome regression functions are estimated nonparametrically,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big),
\]
and when they are estimated semiparametrically or parametrically,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma_2^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big),
\]
with $\sigma_2^{2}(x)=E[\{\tilde\Psi_2(X,Y,D)-\tau(x)\}^{2}\mid X=x]$. The proof of Theorem 2 is concluded. $\Box$
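The semiparametric scenarios rest on smoothing over a low-dimensional index $B_1^{\top}X$ rather than the full covariate vector. A minimal sketch of that idea follows; it is ours, with the direction $B_1$, the model and the bandwidth all assumed for illustration (condition (B1) is what justifies treating $B_1$ as known).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_dim = 2000, 5
B1 = np.array([1.0, 0.5, 0.0, 0.0, 0.0])           # assumed index direction
X = rng.normal(size=(n, p_dim))
idx = X @ B1                                        # B_1^T X, a 1-dim index
D = rng.binomial(1, 1 / (1 + np.exp(-idx)))
Y = np.sin(idx) + D + rng.normal(0, 0.5, n)         # m_1(x) = r_1(B_1^T x) = sin + 1

def nw_index(query_idx, data_idx, vals, h):
    # Nadaraya-Watson on the scalar index: smoothing happens in p^(1) = 1
    # dimension instead of p = 5, which is the dimension-reduction payoff.
    w = np.exp(-0.5 * ((query_idx[:, None] - data_idx[None, :]) / h) ** 2)
    return (w @ vals) / w.sum(axis=1)

treated = D == 1
h6 = n ** (-1 / 5)                                  # bandwidth for \hat r_1
r1_hat = nw_index(idx, idx[treated], Y[treated], h6)
print(np.mean(np.abs(r1_hat - (np.sin(idx) + 1))))  # average error; shrinks with n
```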
As for Theorem 3, the proof is very similar to that of Theorem 2. Here we only give a crucial lemma used in it and omit the remaining details.
Lemma 2.
Under condition (A2)(viii), the weights $\omega_{ij}$ of the nonparametric propensity score estimator satisfy
\[
|\omega_{ij}-\omega_{ji}|\le\frac{E_n}{nh_1^{p}}K\Big(\frac{X_i-X_j}{h_1}\Big),
\]
where $E_n=O_p(h_1)$ is free of $i$ and $j$.

7.4 Proof of Theorem 4

This is the case with local misspecification. To check the asymptotic efficiency through the variance comparison, we now compute the difference between $\sigma_2^{2}(x)$ and $\sigma^{2}(x)$:
\[
\begin{aligned}
\sigma_2^{2}(x)-\sigma^{2}(x)
&=E\Big\{\frac{p(X)-\tilde p(X;\beta^{*})}{[\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=1,X)
+\frac{\tilde p(X;\beta^{*})-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=0,X)\,\Big|\,X=x\Big\}\\
&=E\Big\{\frac{p(X)-\tilde p(X;\beta)+\tilde p(X;\beta)-\tilde p(X;\beta^{*})}{[\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=1,X)\\
&\qquad+\frac{\tilde p(X;\beta^{*})-\tilde p(X;\beta)+\tilde p(X;\beta)-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=0,X)\,\Big|\,X=x\Big\},
\end{aligned}
\tag{7.17}
\]
and the difference between $\sigma_3^{2}(x)$ and $\sigma^{2}(x)$:
\[
\begin{aligned}
\sigma_3^{2}(x)-\sigma^{2}(x)
&=E\Big\{\Big[\Big(1-\frac{D}{p(X)}\Big)\{\tilde m_1(X;\gamma_1^{*})-m_1(X)\}
-\Big(1-\frac{1-D}{1-p(X)}\Big)\{\tilde m_0(X;\gamma_0^{*})-m_0(X)\}\Big]^{2}\,\Big|\,X=x\Big\}\\
&=E\Big\{\frac{1-p(X)}{p(X)}\,[\tilde m_1(X;\gamma_1^{*})-\tilde m_1(X;\gamma_1)+\tilde m_1(X;\gamma_1)-m_1(X)]^{2}\\
&\qquad+\frac{p(X)}{1-p(X)}\,[\tilde m_0(X;\gamma_0^{*})-\tilde m_0(X;\gamma_0)+\tilde m_0(X;\gamma_0)-m_0(X)]^{2}\\
&\qquad+2\,[\tilde m_1(X;\gamma_1^{*})-\tilde m_1(X;\gamma_1)+\tilde m_1(X;\gamma_1)-m_1(X)]\\
&\qquad\quad\times[\tilde m_0(X;\gamma_0^{*})-\tilde m_0(X;\gamma_0)+\tilde m_0(X;\gamma_0)-m_0(X)]\,\Big|\,X=x\Big\}.
\end{aligned}
\tag{7.18}
\]
Recall that, by the definitions of local misspecification, for all $x\in\mathcal X$ there exist $\beta$, $\gamma_1$, $\gamma_0$ such that
\[
p(x)=\tilde p(x;\beta)[1+c_na(x)],\qquad
m_1(x)=\tilde m_1(x;\gamma_1)+d_{1n}b_1(x),\qquad
m_0(x)=\tilde m_0(x;\gamma_0)+d_{0n}b_0(x).
\]
That is, $p(x)-\tilde p(x;\beta)=O(c_n)$, $m_1(x)-\tilde m_1(x;\gamma_1)=O(d_{1n})$ and $m_0(x)-\tilde m_0(x;\gamma_0)=O(d_{0n})$. So now we only need to consider $\tilde p(x;\beta)-\tilde p(x;\beta^{*})$, $\tilde m_1(x;\gamma_1)-\tilde m_1(x;\gamma_1^{*})$ and $\tilde m_0(x;\gamma_0)-\tilde m_0(x;\gamma_0^{*})$. Note that $\beta^{*}$, $\gamma_1^{*}$, $\gamma_0^{*}$ are the limits of the (quasi-)maximum likelihood estimators $\hat\beta$, $\hat\gamma_1$, $\hat\gamma_0$ respectively. Discuss $\beta^{*}$ first. Given the propensity score function, $D$ is Bernoulli distributed, so when the propensity score model is misspecified the quasi-likelihood and quasi-log-likelihood functions of the unknown parameter $\beta$ are
\[
\tilde L(\beta)=\prod_{i=1}^n\tilde p(X_i;\beta)^{D_i}[1-\tilde p(X_i;\beta)]^{1-D_i}f(X_i),\qquad
\tilde l(\beta)=\sum_{i=1}^n\big\{D_i\ln\tilde p(X_i;\beta)+(1-D_i)\ln[1-\tilde p(X_i;\beta)]+\ln f(X_i)\big\}.
\]
Then $\hat\beta$ and $\beta^{*}$ satisfy
\[
\hat\beta=\operatorname*{argmax}_{\beta}\frac{1}{n}\tilde l(\beta),\qquad
\beta^{*}=\operatorname*{argmax}_{\beta}E[g(W;\beta)],
\]
where $g(W;\beta)=D\ln\tilde p(X;\beta)+(1-D)\ln[1-\tilde p(X;\beta)]+\ln f(X)$.
Since $\beta^{*}$ maximizes $E[g(W;\beta)]$, $E[\partial g(W,\beta)/\partial\beta\,|_{\beta=\beta^{*}}]=0$. By the mean value theorem,
\[
E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]
=E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]-E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta^{*}}\Big]
=E\Big[\frac{\partial^{2}g(W,\beta)}{\partial\beta\,\partial\beta^{\top}}\Big|_{\beta=\tilde\beta}\Big](\beta-\beta^{*}),
\]
where $\tilde\beta$ takes a value between $\beta$ and $\beta^{*}$. Note that
\[
\begin{aligned}
E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]
&=E\Big[\frac{D\,[1+c_na(X)]}{p(X)}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}
-\frac{1-D}{1-\tilde p(X;\beta)}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]\\
&=E\Big[\Big\{1+c_na(X)-\frac{1-p(X)}{1-p(X)/[1+c_na(X)]}\Big\}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]\\
&=E\Big[\frac{c_na(X)+c_n^{2}a^{2}(X)}{1+c_na(X)-p(X)}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]=O(c_n).
\end{aligned}
\]
Assuming that $E[\partial^{2}g(W,\beta)/\partial\beta\,\partial\beta^{\top}]$ is nonsingular for any $\beta$, we have
\[
\beta^{*}-\beta=-\Big\{E\Big[\frac{\partial^{2}g(W,\beta)}{\partial\beta\,\partial\beta^{\top}}\Big|_{\beta=\tilde\beta}\Big]\Big\}^{-1}
E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]=O(c_n).
\]
The application of a Taylor expansion yields $\tilde p(x;\beta)-\tilde p(x;\beta^{*})=O(c_n)$. A similar argument derives $\tilde m_1(x;\gamma_1)-\tilde m_1(x;\gamma_1^{*})=O(d_{1n})$ and $\tilde m_0(x;\gamma_0)-\tilde m_0(x;\gamma_0^{*})=O(d_{0n})$. Together with these results, we continue to calculate the quantities in (7.17) and (7.18) to derive
\[
\sigma_2^{2}(x)-\sigma^{2}(x)=O(c_n),\qquad
\sigma_3^{2}(x)-\sigma^{2}(x)=O(d_{1n}^{2})+O(d_{0n}^{2})+O(d_{1n}d_{0n}).
\]
These differences show that when only the propensity score function, or only the outcome regression functions, are locally misspecified, the asymptotic distribution remains the same as that without misspecification. $\Box$
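The drift $\beta^{*}-\beta=O(c_n)$ can be seen empirically. The hypothetical check below is ours: the departure direction $a(x)=x/2$, the logistic working model and the constants are all assumptions chosen so that the perturbed propensity stays inside $(0,1)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

def beta_drift(c_n, n=200_000, beta0=1.0):
    # True propensity: locally misspecified logistic, p(x) = ptilde(x; beta0)(1 + c_n a(x)).
    X = rng.uniform(-1, 1, n)
    a = 0.5 * X                                   # assumed bounded departure a(x)
    p = 1 / (1 + np.exp(-beta0 * X)) * (1 + c_n * a)
    D = rng.binomial(1, np.clip(p, 0.01, 0.99))
    def nll(b):
        # quasi-log-likelihood of the (mis)specified logistic model, negated
        q = 1 / (1 + np.exp(-b * X))
        return -np.mean(D * np.log(q) + (1 - D) * np.log(1 - q))
    return minimize(nll, x0=np.array([0.0])).x[0] - beta0

for c_n in (0.4, 0.2, 0.1, 0.05):
    print(c_n, beta_drift(c_n))   # drift beta* - beta0 shrinks proportionally to c_n
```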
7.5 Proofs of Theorems 5 and 6

Consider the cases with all models misspecified. The proof of Theorem 5 is very similar to that of scenario 1 in Theorem 6, except that the asymptotic linear expression becomes
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\sqrt{nh^{k}}\,f(x)}\sum_{i=1}^n[\tilde\Psi(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1),
\]
where $\tilde\Psi$ denotes the doubly robust score with all nuisance models replaced by their misspecified limits $\tilde p(\cdot;\beta^{*})$, $\tilde m_1(\cdot;\gamma_1^{*})$ and $\tilde m_0(\cdot;\gamma_0^{*})$. As the unbiasedness no longer holds, we then compute the bias term. A decomposition is as follows:
\[
\begin{aligned}
E\big\{\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\big\}
&=\sqrt{nh^{k}}\,E\Big\{\Big(\frac{D}{\tilde p(X;\beta^{*})}-\frac{D}{p(X)}\Big)[Y-m_1(X)]
+\Big(1-\frac{D}{\tilde p(X;\beta^{*})}\Big)[\tilde m_1(X;\gamma_1^{*})-m_1(X)]\\
&\qquad-\Big(\frac{1-D}{1-\tilde p(X;\beta^{*})}-\frac{1-D}{1-p(X)}\Big)[Y-m_0(X)]
-\Big(1-\frac{1-D}{1-\tilde p(X;\beta^{*})}\Big)[\tilde m_0(X;\gamma_0^{*})-m_0(X)]\,\Big|\,X=x\Big\}\\
&=\sqrt{nh^{k}}\,E\Big\{\frac{[m_1(X)-\tilde m_1(X;\gamma_1^{*})][p(X)-\tilde p(X;\beta^{*})]}{\tilde p(X;\beta^{*})}
-\frac{[m_0(X)-\tilde m_0(X;\gamma_0^{*})][\tilde p(X;\beta^{*})-p(X)]}{1-\tilde p(X;\beta^{*})}\,\Big|\,X=x\Big\}\\
&=:\sqrt{nh^{k}}\,\mathrm{bias}(x).
\end{aligned}
\]
Let
\[
\tilde\tau(x)=E\Big\{\frac{D}{\tilde p(X;\beta^{*})}[Y-\tilde m_1(X;\gamma_1^{*})]
-\frac{1-D}{1-\tilde p(X;\beta^{*})}[Y-\tilde m_0(X;\gamma_0^{*})]
+\tilde m_1(X;\gamma_1^{*})-\tilde m_0(X;\gamma_0^{*})\,\Big|\,X=x\Big\},
\]
so that $\tilde\tau(x)=\tau(x)+\mathrm{bias}(x)$. The variance term of $\sqrt{nh^{k}}[\hat\tau(x)-\tau(x)-\mathrm{bias}(x)]$ can be derived as
\[
\begin{aligned}
Var\Big\{\frac{1}{\sqrt{nh^{k}}\,f(x)}\sum_{i=1}^n[\tilde\Psi(X_i,Y_i,D_i)-\tau(x)-\mathrm{bias}(x)]K_i\Big\}
&=\frac{1}{h^{k}f^{2}(x)}E\Big\{[\tilde\Psi(X,Y,D)-\tilde\tau(x)]^{2}K^{2}\Big(\frac{X-x}{h}\Big)\Big\}+o(1)\\
&=\frac{1}{f^{2}(x)}\int K^{2}(u)\,E\big\{[\tilde\Psi(X,Y,D)-\tilde\tau(x)]^{2}\,\big|\,X=x+hu\big\}f(x+hu)\,du+o(1)\\
&=\frac{\sigma_4^{2}(x)\int K^{2}(u)\,du}{f(x)}+o(1),
\end{aligned}
\]
where $\sigma_4^{2}(x)=E\big\{[\tilde\Psi(X,Y,D)-\tilde\tau(x)]^{2}\,\big|\,X=x\big\}$. With the same argument used to derive the asymptotic distribution in Theorem 1, we can obtain
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)-\mathrm{bias}(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma_4^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\tag{7.19}
\]
The proof of Theorem 5 is completed. $\Box$

Note that Theorem 6 is a variant of Theorem 5 in which all nuisance models are locally misspecified; to derive the asymptotic distribution, we only need to reexamine the bias and variance terms in (7.19). From the definitions of the locally misspecified models above, the bias term can be bounded as
\[
\mathrm{bias}(x)=O(c_nd_{1n})+O(c_nd_{0n}).
\]
This result implies that if $c_nd_{1n}$ and $c_nd_{0n}$ converge to zero faster than $1/\sqrt{nh^{k}}$, the bias term vanishes asymptotically after the $\sqrt{nh^{k}}$ scaling. By the central limit theorem we can also derive the asymptotic normality with the variance term. Moreover, since $c_n$, $d_{1n}$ and $d_{0n}$ all converge to 0,
\[
\frac{\sigma_4^{2}(x)\int K^{2}(u)\,du}{f(x)}=\frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}+o(1).
\]
By Slutsky's theorem, we can conclude that when all nuisance models are locally misspecified with
\[
c_nd_{1n}=o\Big(\frac{1}{\sqrt{nh^{k}}}\Big)\qquad\text{and}\qquad c_nd_{0n}=o\Big(\frac{1}{\sqrt{nh^{k}}}\Big),
\]
we have
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\]
The proof is then completed. $\Box$
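To make the rate requirement concrete, consider a worked example with illustrative choices (ours, not the paper's). Take $k=1$ and the common bandwidth order $h\asymp n^{-1/5}$, so that
\[
\sqrt{nh^{k}}\asymp\sqrt{n\cdot n^{-1/5}}=n^{2/5},
\]
and the condition reads $c_nd_{1n}=o(n^{-2/5})$. If, say, $c_n=d_{1n}=n^{-1/4}$, then $c_nd_{1n}=n^{-1/2}=o(n^{-2/5})$ and the bias term is asymptotically negligible; if instead $c_n=d_{1n}=n^{-1/5}$, then $c_nd_{1n}=n^{-2/5}$, which is not $o(n^{-2/5})$, and the bias survives in the limit.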
7.6 A simple justification for Remark 5

As we showed in the proof of Theorem 4,
\[
\sigma_2^{2}(x)-\sigma^{2}(x)
=E\Big\{\frac{p(X)-\tilde p(X;\beta^{*})}{[\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=1,X)\,\Big|\,X=x\Big\}
+E\Big\{\frac{\tilde p(X;\beta^{*})-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=0,X)\,\Big|\,X=x\Big\}.
\]
This difference can be shown to be neither positive nor negative for all $x$; the example in Remark 2 confirms this. For instance, when the two conditional variances both equal $v$, the difference reduces to $v\,[p(x)-\tilde p(x;\beta^{*})]\{[\tilde p(x;\beta^{*})]^{-2}-[1-\tilde p(x;\beta^{*})]^{-2}\}$, which is negative when, say, $p(x)=0.8$ and $\tilde p(x;\beta^{*})=0.6$, so misspecifying the propensity score can even reduce the asymptotic variance. For $\sigma_3^{2}(x)$ we have
\[
\sigma_3^{2}(x)-\sigma^{2}(x)
=E\Big\{\Big[\Big(1-\frac{D}{p(X)}\Big)\{\tilde m_1(X;\gamma_1^{*})-m_1(X)\}
-\Big(1-\frac{1-D}{1-p(X)}\Big)\{\tilde m_0(X;\gamma_0^{*})-m_0(X)\}\Big]^{2}\,\Big|\,X=x\Big\}\ge 0.
\]
In other words, $\sigma_3^{2}(x)$ is never smaller than the variance of the estimators with all models correctly specified. Further,
\[
\begin{aligned}
\sigma_4^{2}(x)-\sigma^{2}(x)
&=Var\big(\tilde\Psi(X,Y,D)\mid X=x\big)-Var\big(\Psi(X,Y,D)\mid X=x\big)\\
&=E\Big\{\Big(\frac{p(X)}{[\tilde p(X;\beta^{*})]^{2}}-\frac{1}{p(X)}\Big)Var(Y\mid X,D=1)\,\Big|\,X=x\Big\}
+E\Big\{\Big(\frac{1-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}-\frac{1}{1-p(X)}\Big)Var(Y\mid X,D=0)\,\Big|\,X=x\Big\}\\
&\quad+E\Big\{\frac{p(X)}{[\tilde p(X;\beta^{*})]^{2}}[m_1(X)-\tilde m_1(X;\gamma_1^{*})]^{2}
+\frac{1-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}[m_0(X)-\tilde m_0(X;\gamma_0^{*})]^{2}\,\Big|\,X=x\Big\}\\
&\quad+2E\big\{[\tilde m_1(X;\gamma_1^{*})-\tilde m_0(X;\gamma_0^{*})][m_1(X)-m_0(X)-\tilde m_1(X;\gamma_1^{*})+\tilde m_0(X;\gamma_0^{*})]\,\big|\,X=x\big\}
+\tau^{2}(x)-\tilde\tau^{2}(x).
\end{aligned}
\]
Again, whether $\sigma_4^{2}(x)$ is larger than $\sigma^{2}(x)$ or not cannot be easily judged. $\Box$

7.7 Additional Simulation Results

Table 6: Simulation results under model 2 (part 1). For each estimator and each value of $x$, the left block reports, for $n=500$, the bias, the sampling standard deviation (sam-SD), the MSE and the proportions $P_1$ and $P_2$; the right block reports the same quantities for $n=5000$.

              |            n=500                      |            n=5000
  x           |  bias    sam-SD  MSE    P1     P2     |  bias    sam-SD  MSE    P1     P2
DRCATE(O,O)
 -0.4         |  0.0012  0.2077  0.0431  0.058  0.045 |  0.0006  0.2034  0.0414  0.045  0.047
 -0.2         |  0.0010  0.2139  0.0457  0.040  0.052 |  0.0001  0.1988  0.0395  0.050  0.050
  0           | -0.0005  0.2046  0.0418  0.048  0.059 |  0.0007  0.1846  0.0341  0.050  0.050
  0.2         | -0.0001  0.2226  0.0495  0.045  0.050 |  0.0008  0.2034  0.0415  0.038  0.057
  0.4         |  0.0024  0.3312  0.1097  0.048  0.049 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(cP,cP)
 -0.4         |  0.0011  0.2077  0.0431  0.056  0.048 |  0.0006  0.2035  0.0415  0.046  0.046
 -0.2         |  0.0009  0.2137  0.0456  0.045  0.053 |  0.0001  0.1988  0.0395  0.049  0.050
  0           | -0.0006  0.2044  0.0417  0.047  0.055 |  0.0007  0.1846  0.0341  0.048  0.052
  0.2         | -0.0001  0.2228  0.0496  0.046  0.049 |  0.0007  0.2035  0.0415  0.038  0.057
  0.4         |  0.0024  0.3316  0.1100  0.047  0.047 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(N,N)
 -0.4         |  0.0002  0.2653  0.0703  0.017  0.029 |  0.0004  0.2136  0.0456  0.057  0.052
 -0.2         |  0.0011  0.2300  0.0529  0.042  0.048 |  0.0004  0.1990  0.0396  0.041  0.045
  0           |  0.0007  0.1962  0.0385  0.048  0.051 |  0.0003  0.1917  0.0367  0.041  0.052
  0.2         |  0.0011  0.2299  0.0528  0.043  0.058 |  0.0006  0.2122  0.0451  0.046  0.052
  0.4         |  0.0041  0.3373  0.1141  0.054  0.057 |  0.0003  0.3125  0.0976  0.050  0.052
DRCATE(S,S)
 -0.4         | -0.0018  0.2058  0.0424  0.051  0.046 |  0.0002  0.2501  0.0625  0.028  0.040
 -0.2         | -0.0021  0.2093  0.0439  0.056  0.039 | -0.0008  0.2087  0.0436  0.046  0.047
  0           |  0.0000  0.2040  0.0416  0.055  0.051 |  0.0011  0.1868  0.0351  0.044  0.056
  0.2         |  0.0060  0.2257  0.0518  0.031  0.068 |  0.0014  0.2093  0.0441  0.047  0.059
  0.4         |  0.0089  0.3409  0.1181  0.039  0.064 |  0.0010  0.3298  0.1089  0.043  0.062

Table 7: Simulation results under model 2 (part 2). Columns as in Table 6.
              |            n=500                      |            n=5000
  x           |  bias    sam-SD  MSE    P1     P2     |  bias    sam-SD  MSE    P1     P2
DRCATE(O,O)
 -0.4         |  0.0012  0.2077  0.0431  0.058  0.045 |  0.0006  0.2034  0.0414  0.045  0.047
 -0.2         |  0.0010  0.2139  0.0457  0.040  0.052 |  0.0001  0.1988  0.0395  0.050  0.050
  0           | -0.0005  0.2046  0.0418  0.048  0.059 |  0.0007  0.1846  0.0341  0.050  0.050
  0.2         | -0.0001  0.2226  0.0495  0.045  0.050 |  0.0008  0.2034  0.0415  0.038  0.057
  0.4         |  0.0024  0.3312  0.1097  0.048  0.049 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(mP,cP)
 -0.4         |  0.0011  0.2082  0.0433  0.058  0.041 |  0.0006  0.2042  0.0417  0.050  0.045
 -0.2         |  0.0009  0.2123  0.0451  0.044  0.054 |  0.0001  0.1974  0.0389  0.051  0.053
  0           | -0.0005  0.2025  0.0410  0.048  0.058 |  0.0006  0.1834  0.0337  0.050  0.052
  0.2         | -0.0002  0.2222  0.0493  0.045  0.052 |  0.0007  0.2030  0.0413  0.037  0.056
  0.4         |  0.0025  0.3315  0.1099  0.047  0.051 |  0.0011  0.3116  0.0972  0.043  0.053
DRCATE(mP,N)
 -0.4         | -0.0011  0.2156  0.0464  0.056  0.042 | -0.0005  0.2082  0.0434  0.048  0.043
 -0.2         | -0.0019  0.2086  0.0436  0.061  0.036 | -0.0013  0.2062  0.0428  0.058  0.036
  0           | -0.0028  0.2003  0.0403  0.057  0.034 | -0.0011  0.1888  0.0358  0.058  0.040
  0.2         |  0.0021  0.2108  0.0445  0.052  0.058 | -0.0004  0.2099  0.0440  0.054  0.044
  0.4         |  0.0060  0.3258  0.1069  0.045  0.059 |  0.0033  0.3276  0.1093  0.033  0.069
DRCATE(mP,S)
 -0.4         | -0.0034  0.2215  0.0493  0.054  0.050 | -0.0010  0.2119  0.0451  0.053  0.043
 -0.2         | -0.0055  0.2235  0.0507  0.060  0.041 | -0.0034  0.2115  0.0469  0.073  0.029
  0           | -0.0023  0.2049  0.0421  0.051  0.043 | -0.0025  0.1895  0.0371  0.061  0.032
  0.2         | -0.0003  0.2149  0.0462  0.043  0.045 | -0.0003  0.1982  0.0393  0.052  0.043
  0.4         |  0.0102  0.3351  0.1148  0.032  0.068 |  0.0034  0.3122  0.0997  0.039  0.058

Table 8: Simulation results under model 2 (part 3). Columns as in Table 6.

              |            n=500                      |            n=5000
  x           |  bias    sam-SD  MSE    P1     P2     |  bias    sam-SD  MSE    P1     P2
DRCATE(O,O)
 -0.4         |  0.0012  0.2077  0.0431  0.058  0.045 |  0.0006  0.2034  0.0414  0.045  0.047
 -0.2         |  0.0010  0.2139  0.0457  0.040  0.052 |  0.0001  0.1988  0.0395  0.050  0.050
  0           | -0.0005  0.2046  0.0418  0.048  0.059 |  0.0007  0.1846  0.0341  0.050  0.050
  0.2         | -0.0001  0.2226  0.0495  0.045  0.050 |  0.0008  0.2034  0.0415  0.038  0.057
  0.4         |  0.0024  0.3312  0.1097  0.048  0.049 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(cP,mP)
 -0.4         |  0.0008  0.2179  0.0474  0.051  0.043 |  0.0004  0.2204  0.0485  0.048  0.045
 -0.2         |  0.0012  0.2233  0.0498  0.049  0.052 |  0.0003  0.2069  0.0428  0.049  0.054
  0           | -0.0004  0.2104  0.0442  0.051  0.060 |  0.0008  0.1890  0.0358  0.050  0.053
  0.2         | -0.0002  0.2226  0.0495  0.048  0.049 |  0.0007  0.2039  0.0416  0.036  0.054
  0.4         |  0.0028  0.3407  0.1162  0.047  0.050 |  0.0010  0.3170  0.1006  0.043  0.053
DRCATE(N,mP)
 -0.4         | -0.0050  0.2225  0.0501  0.060  0.036 | -0.0006  0.2227  0.0496  0.051  0.050
 -0.2         | -0.0015  0.2185  0.0477  0.056  0.043 | -0.0011  0.1931  0.0375  0.054  0.039
  0           | -0.0020  0.2039  0.0416  0.072  0.032 | -0.0013  0.1857  0.0348  0.056  0.038
  0.2         |  0.0024  0.2178  0.0475  0.042  0.051 |  0.0005  0.2064  0.0426  0.039  0.056
  0.4         |  0.0046  0.3259  0.1066  0.035  0.064 |  0.0023  0.3324  0.1115  0.044  0.050
DRCATE(S,mP)
 -0.4         | -0.0115  0.2117  0.0481  0.075  0.027 | -0.0024  0.3260  0.1073  0.020  0.017
 -0.2         | -0.0021  0.2083  0.0434  0.051  0.051 | -0.0021  0.2010  0.0412  0.065  0.033
  0           | -0.0018  0.2002  0.0401  0.044  0.053 | -0.0005  0.2045  0.0418  0.045  0.038
  0.2         |  0.0035  0.2290  0.0527  0.044  0.071 |  0.0001  0.2155  0.0464  0.054  0.054
  0.4         |  0.0017  0.3460  0.1196  0.040  0.064 |  0.0015  0.3519  0.1241  0.031  0.054
References
Abrevaya, J., Hsu, Y.-C. and Lieli, R. P. (2015). Estimating conditional average treatment effects. Journal of Business & Economic Statistics 33(4), 485-505.

Fan, Q., Hsu, Y.-C., Lieli, R. P. and Zhang, Y. (2019). Estimation of conditional average treatment effects with high-dimensional data. arXiv preprint arXiv:1908.02399.

Hu, Z., Follmann, D. A. and Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika 101(3), 613-624.

Lee, S., Okui, R. and Whang, Y.-J. (2017). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics 32(7), 1207-1225.

Li, L., Zhou, N. and Zhu, L. (2020). Outcome regression-based estimation of conditional average treatment effect. Submitted.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846-866.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-55.

Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448), 1096-1120.

Seaman, S. R. and Vansteelandt, S. (2018). Introduction to double robust methods for incomplete data. Statistical Science 33(2), 184.

Shi, C., Lu, W. and Song, R. (2019). A sparse random projection-based test for overall qualitative treatment effects. Journal of the American Statistical Association, 1-41.

Zhou, N. and Zhu, L. (2020). On IPW-based estimation of conditional average treatment effect. Submitted.

Zimmert, M. and Lechner, M. (2019). Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779.