Doubly robust estimation for conditional treatment effect: a study on asymptotics
Statistica Sinica
Chuyun Ye, Keli Guo and Lixing Zhu
Beijing Normal University, Beijing, China, and Hong Kong Baptist University, Hong Kong
Abstract:
In this paper, we apply the doubly robust approach to estimate the conditional average treatment effect, given some covariates, under parametric, semiparametric and nonparametric structures of the nuisance propensity score and outcome regression models. We then conduct a systematic study of the asymptotic distributions of nine estimators with different combinations of estimated propensity score and outcome regressions. The study covers the asymptotic properties when all models are correctly specified; when either the propensity score or the outcome regressions are locally or globally misspecified; and when all models are locally or globally misspecified. The asymptotic variances are compared, and the asymptotic bias correction under model misspecification is discussed. The phenomenon that the asymptotic variance under model misspecification can sometimes be even smaller than that with all models correctly specified is explored. We also conduct a numerical study to examine the theoretical results.

Key words and phrases: Asymptotic variance, Conditional average treatment effect, Doubly robust estimation.
1. Introduction
To explore the heterogeneity of treatment effects under Rubin's potential outcome framework (Rosenbaum and Rubin (1983)) and to reveal the causality of a treatment, the conditional average treatment effect (CATE), which conditions on some covariates of interest, is useful. See Abrevaya et al. (2015) as an example. Shi et al. (2019) showed that the existence of an optimal individualized treatment regime (OITR) has a close connection with CATE. To estimate CATE, several standard approaches are available in the literature. When the propensity score function, the outcome regression functions, or both are unknown, we need to estimate them first so that we can then estimate the CATE function; we regard these functions as nuisance models. Abrevaya et al. (2015) used propensity score-based (PS-based) estimation under parametric (P-IPW) and nonparametric (N-IPW) structures, and showed that N-IPW is asymptotically more efficient than P-IPW. Zhou and Zhu (2020) suggested PS-based estimation under a semiparametric dimension reduction structure (S-IPW) to show the advantage of semiparametric estimation, and Li et al. (2020) considered outcome regression-based (OR-based) estimation under parametric (P-OR), semiparametric (S-OR) and nonparametric (N-OR) structures to derive their asymptotic properties, also recommending the semiparametric method. Together, these works give an estimation efficiency comparison between PS-based and OR-based estimators. A clear asymptotic efficiency ranking was shown by Li et al. (2020) when the propensity score and outcome regression models are all correctly specified and the underlying nonparametric models are sufficiently smooth such that, with delicately selected bandwidths and kernel functions, the nonparametric estimation can achieve sufficiently fast rates of convergence:
O-OR ≅ P-OR ⪯ S-OR ⪯ N-OR (OR-based estimators) ≅ N-IPW ⪯ S-IPW ⪯ P-IPW ≅ O-IPW (PS-based estimators),    (1.1)

where A ⪯ B denotes the asymptotic efficiency advantage, with smaller asymptotic variance, of A over B, A ≅ B denotes efficiency equivalence, and O-OR and O-IPW stand for the OR-based and PS-based estimators, respectively, with the nuisance models known and no need of estimation.

As is well known, the doubly robust (DR) method was first suggested as the augmented inverse probability weighting (AIPW) estimation proposed by Robins et al. (1994). Later developments established the estimation consistency (Scharfstein et al. (1999)) of more general doubly robust estimation, not restricted to AIPW, even when one of the two involved models is misspecified. For further discussion of and an introduction to DR estimation, readers can refer to, for example, Seaman and Vansteelandt (2018). Like Abrevaya et al. (2015), Lee et al. (2017) brought up a two-step AIPW estimator of CATE, also under a parametric structure. For cases with high-dimensional covariates, Fan et al. (2019) and Zimmert and Lechner (2019) combined such an estimator with statistical learning.

In the current paper, we focus on investigating the asymptotic efficiency comparisons among nine doubly robust estimators under parametric, semiparametric dimension reduction and nonparametric structures. To this end, we give a systematic study to provide insight into which combinations may have merit in an asymptotic sense and which ones are worthy of recommendation in practice. We further consider the asymptotic efficiency when the nuisance models are globally or locally misspecified, which will be defined later. Roughly speaking, local misspecification means that the misspecified model converges, at a certain rate, to the corresponding correctly specified model as the sample size n goes to infinity, while a globally misspecified model does not. Denote by c_n, d_{1n} and d_{0n} the departure degrees of the used models from the corresponding correctly specified models, and by V_i(x), i = 1, 2, 3, 4, the asymptotic variance functions of x of the nine estimators in different scenarios, which will be clarified in Theorems 1, 2, 3 and 5, respectively.
Here V_1(x) is the asymptotic variance when all models are correctly specified, which is regarded as the benchmark for comparisons. We have V_1(x) ≤ V_3(x), but V_2(x) and V_4(x) are not necessarily larger than V_1(x). Here we display the main findings of this paper.

• When all nuisance models are correctly specified, and the tuning parameters, including the bandwidths in the nonparametric estimations, are delicately selected, the asymptotic variances are all equal to V_1(x). Write all DR estimators as DRCATE. Together with (1.1), the asymptotic efficiency ranking is:
O-OR ≅ P-OR ⪯ S-OR ⪯ N-OR (OR-based estimators) ≅ DRCATE ≅ N-IPW ⪯ S-IPW ⪯ P-IPW ≅ O-IPW (PS-based CATE estimators).

• If only one of the nuisance models, either the propensity score or the outcome regressions, is misspecified, the estimators remain unbiased, as expected. But globally misspecified outcome regressions or propensity score lead to changes of the asymptotic variance. We give examples for the propensity score showing that the variance can even be smaller than that with correctly specified models. Further, when the nuisance models are locally misspecified, the asymptotic efficiency remains the same as that with no misspecification.
• Further, when all nuisance models are globally misspecified, we need to take care of the estimation bias. When the misspecifications are all local, and the convergence rates of c_n d_{1n} and c_n d_{0n} are faster than the convergence rate of the nonparametric estimation specified later, the asymptotic distributions remain unchanged.

To give quick access to the results about the asymptotic variances, we present a summary in Table 1. Denote by PS(P), PS(N) and PS(S) the estimators with parametrically, nonparametrically and semiparametrically estimated PS function, respectively, and by OR(P), OR(N) and OR(S) the estimators with parametrically, nonparametrically and semiparametrically estimated OR functions, respectively. Cells marked "--" mean no such combinations.

Table 1: Asymptotic variance result summary

Combination    All correctly  Globally misspecified PS     Locally misspec. PS  Globally misspec. OR  Locally misspec. OR
               specified
PS(P)+OR(P)    V_1(x)         V_2(x) (not nec. enlarged)   V_1(x)               V_3(x) (enlarged)     V_1(x)
PS(P)+OR(N)    V_1(x)         V_1(x)                       V_1(x)               --                    --
PS(N)+OR(P)    V_1(x)         --                           --                   V_1(x)                V_1(x)
PS(N)+OR(N)    V_1(x)         --                           --                   --                    --
PS(P)+OR(S)    V_1(x)         V_2(x) (not nec. enlarged)   V_1(x)               --                    --
PS(S)+OR(P)    V_1(x)         --                           --                   V_3(x) (enlarged)     V_1(x)
PS(S)+OR(N)    V_1(x)         --                           --                   --                    --
PS(N)+OR(S)    V_1(x)         --                           --                   --                    --
PS(S)+OR(S)    V_1(x)         --                           --                   --                    --

Combination    All globally misspecified                   All locally misspec.  Glob. misspec. PS + loc. misspec. OR  Loc. misspec. PS + glob. misspec. OR
PS(P)+OR(P)    Biased + V_4(x) (not nec. enlarged var.)    V_1(x)                V_2(x) (not nec. enlarged)            V_3(x) (enlarged)

The remaining parts of this article are organized as follows. We first describe Rubin's potential outcome framework and the relevant notation in Section 2. Section 3 contains a general two-step estimation of CATE, and Section 4 describes the corresponding asymptotic properties under different situations. Section 5 presents the results of Monte Carlo simulations, and Section 6 includes some concluding remarks. We would like to point out that such comparisons do not mean that the estimations with an asymptotic efficiency advantage are always worthwhile to recommend, because, particularly, the nonparametric-based estimations may have severe difficulties in handling high- or even moderate-dimensional models in practice. But the comparisons provide good insight into the nature of the various estimations, so that practitioners can have a relatively complete picture of them and an idea of when and how to use these estimations.
2. Framework and Notation
For any individual, the datum W = (X⊤, Y, D)⊤ is observable, including the observed outcome Y, the treatment status D, and the p-dimensional covariates X. D = 1 implies that the individual is treated, and D = 0 means untreated. Denote Y(1) and Y(0) as the potential outcomes with and without treatment, respectively. The observed outcome Y can be expressed as Y = DY(1) + (1 − D)Y(0). Denote

p(X) = P(D = 1 | X),   m_1(X) = E(Y(1) | X),   m_0(X) = E(Y(0) | X)

as the propensity score function and the outcome regression functions. The following conditions are commonly used when discussing the potential outcome framework.

(C1) (Sampling distribution) {W_i}_{i=1}^n is a set of independently and identically distributed samples.

(C2) (Ignorability condition)
(i) (Unconfoundedness) (Y(1), Y(0)) ⊥ D | X.
(ii) Denote X as the support of X, where X is a Cartesian product of compact intervals. For any x ∈ X, p(x) is bounded away from 0 and 1.

Denote τ(x) as the CATE:

τ(x) = E[Y(1) − Y(0) | X_1 = x],

where X_1 is a strict subvector of X. That is, X_1 is a k-dimensional covariate with k < p. Also denote f(x) as the density function of X_1.
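The representation of τ(x) used in the next section rests on a one-line identification argument, spelled out here for completeness as a reading aid (a standard consequence of (C1)–(C2), not stated explicitly in the paper): since DY = DY(1) and, by unconfoundedness, D is conditionally independent of Y(1) given X,

E[DY/p(X) | X] = E[D | X] E[Y(1) | X] / p(X) = m_1(X),

and symmetrically E[(1 − D)Y/(1 − p(X)) | X] = m_0(X). Conditioning further on X_1 = x turns these identities into the PS-based expression for τ(x) in (3.2) below.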
3. Doubly Robust Estimation
Rewrite τ(x) as

τ(x) = E{ m_1(X) − m_0(X) | X_1 = x }
     = E{ DY/p(X) − (1 − D)Y/(1 − p(X)) | X_1 = x }
     = E{ D[Y − m_1(X)]/p(X) − (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_1(X) − m_0(X) | X_1 = x }.    (3.2)

The first two expressions in (3.2) show how the OR and PS methods work for estimating CATE. The third is the essential expression for constructing a doubly robust estimator of τ(x). Based on it, we propose a two-step estimation. In the first step, we estimate the function inside the conditional expectation in (3.2):

D[Y − m_1(X)]/p(X) − (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_1(X) − m_0(X).

To study the influence of estimating the nuisance functions p(X), m_1(X) and m_0(X) under the parametric, nonparametric, and semiparametric dimension reduction frameworks, we construct the corresponding estimators below. After this, we estimate the conditional expectation given X_1 = x. This is a standard nonparametric estimation, and we utilize a Nadaraya-Watson type estimator to define the resulting estimator:

τ̂(x) = { (1/(nh^k)) Σ_{i=1}^n [ D_i(Y_i − m̂_{1i})/p̂_i − (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{1i} − m̂_{0i} ] K((X_{1i} − x)/h) } / { (1/(nh^k)) Σ_{i=1}^n K((X_{1i} − x)/h) },

where K(u) is a kernel function of order s_1, which is s* times continuously differentiable, h is the corresponding bandwidth, and p̂_i, m̂_{1i}, m̂_{0i} denote the estimators of p(X_i), m_1(X_i), m_0(X_i), respectively; these are general notations that take different forms under different model structures.

We now consider the estimation of the nuisance functions. Under the parametric structure, let p̃(x; β), m̃_1(x; γ_1) and m̃_0(x; γ_0) be the specified parametric models of p(x), m_1(x) and m_0(x), respectively, where β, γ_1 and γ_0 are unknown parameters. By maximum likelihood estimation, we obtain β̂, γ̂_1 and γ̂_0, so as to have p̃(X_i; β̂), m̃_1(X_i; γ̂_1) and m̃_0(X_i; γ̂_0) as the parametric estimators. Note that the specified models are not necessarily equal to the true data generating mechanism. We further distinguish the correctly specified, globally misspecified and locally misspecified cases. For all x ∈ X, there exist β, γ_1, γ_0 such that the true models relate to the specified models via

p(x) = p̃(x; β)[1 + c_n a(x)],
m_1(x) = m̃_1(x; γ_1) + d_{1n} b_1(x),                  (3.3)
m_0(x) = m̃_0(x; γ_0) + d_{0n} b_0(x).

Take the propensity score function as an example. If c_n = 0, the parametric propensity score model p̃(x; β) is correctly specified; otherwise it is not. If c_n converges to 0 as n goes to infinity, the parametric model is locally misspecified. If c_n remains a nonzero constant, the model is globally misspecified. Similarly for the models with d_{1n} and d_{0n}. Recall that β̂, γ̂_1 and γ̂_0 are the maximum likelihood estimators of the corresponding unknown parameters.
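To make the two-step construction concrete, the following is a minimal Python sketch (the function and variable names are ours, not the paper's): it forms the doubly robust pseudo-outcome of (3.2) from given nuisance estimates and smooths it with a Nadaraya-Watson step. For simplicity it assumes a Gaussian kernel and a univariate conditioning covariate (k = 1).

```python
import numpy as np

def dr_cate(x_grid, X1, Y, D, p_hat, m1_hat, m0_hat, h):
    """Two-step doubly robust CATE estimate at the points in x_grid.

    X1     : (n,) conditioning covariate X_1 (k = 1 here)
    p_hat  : (n,) estimated propensity scores p(X_i)
             (in practice trimmed away from 0 and 1, as in Section 5)
    m1_hat : (n,) estimated outcome regressions m_1(X_i)
    m0_hat : (n,) estimated outcome regressions m_0(X_i)
    h      : bandwidth of the second-step kernel smoother
    """
    # Doubly robust pseudo-outcome: the integrand of the third expression in (3.2)
    psi = (D * (Y - m1_hat) / p_hat
           - (1 - D) * (Y - m0_hat) / (1 - p_hat)
           + m1_hat - m0_hat)
    # Nadaraya-Watson smoothing of psi against X_1
    u = (X1[None, :] - np.asarray(x_grid)[:, None]) / h
    K = np.exp(-0.5 * u ** 2)          # Gaussian kernel weights
    return (K * psi[None, :]).sum(axis=1) / K.sum(axis=1)
```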
Denote β*, γ_1* and γ_0* as the limits of β̂, γ̂_1 and γ̂_0 as n goes to infinity.

Under the nonparametric structure, we utilize the kernel-based nonparametric estimators

p̂(X_i) = Σ_{j=1}^n D_j K_2((X_j − X_i)/h_2) / Σ_{t=1}^n K_2((X_t − X_i)/h_2),
m̂_1(X_i) = Σ_{j=1}^n D_j Y_j K_3((X_j − X_i)/h_3) / Σ_{t=1}^n D_t K_3((X_t − X_i)/h_3),
m̂_0(X_i) = Σ_{j=1}^n (1 − D_j) Y_j K_4((X_j − X_i)/h_4) / Σ_{t=1}^n (1 − D_t) K_4((X_t − X_i)/h_4),

where K_2(u), K_3(u) and K_4(u) are kernels of order s_2 ≥ p, s_3 ≥ p and s_4 ≥ p, with the corresponding bandwidths h_2, h_3 and h_4. The conditions on the kernel functions and bandwidths are listed in the supplement.

Under the semiparametric structure on the baseline covariate X for the propensity score and the outcome regressions, we have the following dimension reduction framework. Denote a matrix A ∈ R^{p×p(2)} such that

p(X) ⊥ X | A⊤X,     (3.4)

where p(2) ≤ p. The space spanned by A, S_{E(D|X)}, is called the central mean subspace if it is the intersection of all subspaces spanned by matrices A satisfying the above conditional independence. The dimension of S_{E(D|X)} is called the structural dimension, which is often smaller than or equal to p; without confusion, we still write it as p(2). Formula (3.4) implies that p(X) = E(D | X) = E(D | A⊤X) := g(A⊤X). Note that a nonparametric estimation of p(X) may have a very slow rate of convergence when p is large. However, under (3.4) we can estimate the matrix A first to reduce the dimension from p to p(2), so that the nonparametric estimation of E(D | A⊤X) can achieve a faster rate of convergence. When A is root-n consistently estimated by an estimator Â, the semiparametric estimator of p(X_i) is defined as

ĝ(Â⊤X_i) = Σ_{j=1}^n D_j K_5((Â⊤X_j − Â⊤X_i)/h_5) / Σ_{t=1}^n K_5((Â⊤X_t − Â⊤X_i)/h_5).

Similarly, for the regression models, denote matrices B_1 ∈ R^{p×p(1)} and B_0 ∈ R^{p×p(0)} such that

E(Y(1) | X) ⊥ X | B_1⊤X,   E(Y(0) | X) ⊥ X | B_0⊤X.     (3.5)

The corresponding dimension reduction subspaces are called the central mean subspaces (see Cook and Li (2002)). Thus, m_1(X) = E(Y(1) | X) = E(Y(1) | B_1⊤X) := r_1(B_1⊤X) and m_0(X) = E(Y(0) | X) = E(Y(0) | B_0⊤X) := r_0(B_0⊤X). With B̂_1 and B̂_0 being the estimators of B_1 and B_0, the semiparametric estimators of m_1(X_i) and m_0(X_i) are defined as

r̂_1(B̂_1⊤X_i) = Σ_{j=1}^n D_j Y_j K_6((B̂_1⊤X_j − B̂_1⊤X_i)/h_6) / Σ_{t=1}^n D_t K_6((B̂_1⊤X_t − B̂_1⊤X_i)/h_6),
r̂_0(B̂_0⊤X_i) = Σ_{j=1}^n (1 − D_j) Y_j K_7((B̂_0⊤X_j − B̂_0⊤X_i)/h_7) / Σ_{t=1}^n (1 − D_t) K_7((B̂_0⊤X_t − B̂_0⊤X_i)/h_7),

where K_5(u), K_6(u) and K_7(u) are kernels of order s_5 ≥ p(2), s_6 ≥ p(1) and s_7 ≥ p(0), with the corresponding bandwidths h_5, h_6 and h_7.
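As an illustration of the kernel-based nuisance estimators above, here is a minimal Python sketch (our own simplified version: it uses a product Gaussian kernel, whereas the theory requires compactly supported higher-order kernels as in condition (A1)):

```python
import numpy as np

def gaussian_kernel(u):
    # Product Gaussian kernel for multivariate u of shape (..., p)
    return np.exp(-0.5 * (u ** 2).sum(axis=-1))

def kernel_nuisance(X, Y, D, h):
    """Kernel estimates of p(X_i), m_1(X_i), m_0(X_i) at the sample points.

    X : (n, p) covariates; Y : (n,) outcomes; D : (n,) treatment indicators.
    """
    U = (X[None, :, :] - X[:, None, :]) / h       # (n, n, p) pairwise scaled differences
    K = gaussian_kernel(U)                        # (n, n) kernel weights
    p_hat = (K * D).sum(1) / K.sum(1)             # \hat p(X_i)
    m1_hat = (K * D * Y).sum(1) / np.maximum((K * D).sum(1), 1e-12)
    m0_hat = (K * (1 - D) * Y).sum(1) / np.maximum((K * (1 - D)).sum(1), 1e-12)
    return p_hat, m1_hat, m0_hat
```

The semiparametric estimators have the same form after replacing X by Â⊤X, B̂_1⊤X or B̂_0⊤X, which is what makes the dimension reduction step pay off.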
4. Asymptotic Properties

Define the following functions:

Ψ_1(X, Y, D) := D[Y − m_1(X)]/p(X) − (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_1(X) − m_0(X),
Ψ_2(X, Y, D) := D[Y − m_1(X)]/p̃(X; β*) − (1 − D)[Y − m_0(X)]/(1 − p̃(X; β*)) + m_1(X) − m_0(X),
Ψ_3(X, Y, D) := D[Y − m̃_1(X; γ_1*)]/p(X) − (1 − D)[Y − m̃_0(X; γ_0*)]/(1 − p(X)) + m̃_1(X; γ_1*) − m̃_0(X; γ_0*),
Ψ_4(X, Y, D) := D[Y − m̃_1(X; γ_1*)]/p̃(X; β*) − (1 − D)[Y − m̃_0(X; γ_0*)]/(1 − p̃(X; β*)) + m̃_1(X; γ_1*) − m̃_0(X; γ_0*).

The following theorem shows that all asymptotic distributions of the estimators are identical.
Theorem 1.
Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_2 ≥ p, s* ≥ s_3 ≥ p, s* ≥ s_4 ≥ p, s* ≥ s_5 ≥ p(2), s* ≥ s_6 ≥ p(1), s* ≥ s_7 ≥ p(0), and formulas (3.4) and (3.5) hold. Then, for each point x, we have

√(nh^k)[τ̂(x) − τ(x)] = (1/(√(nh^k) f(x))) Σ_{i=1}^n [Ψ_1(X_i, Y_i, D_i) − τ(x)] K((X_{1i} − x)/h) + o_p(1),

and

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)),

where V_1(x) = σ_1²(x) ∫K²(u)du / f(x) and σ_1²(x) = E{ [Ψ_1(X, Y, D) − τ(x)]² | X_1 = x }.
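Theorem 1 can be used for pointwise inference by plugging in estimates of σ_1²(x) and f(x). The following sketch is our own illustration of such a plug-in interval (with a second-order Gaussian kernel and k = 1); it is not a construction given in the paper:

```python
import numpy as np

def dr_cate_ci(x0, X1, psi, tau_hat_x0, h, z=1.96):
    """Plug-in 95% CI for tau(x0), based on the limit N(0, V_1(x)) in Theorem 1.

    psi        : (n,) doubly robust pseudo-outcomes Psi_1(X_i, Y_i, D_i)
    tau_hat_x0 : the DR-CATE estimate at x0
    """
    n = len(X1)
    u = (X1 - x0) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)     # Gaussian kernel
    f_hat = K.mean() / h                               # kernel density estimate of f(x0)
    sigma2_hat = (K * (psi - tau_hat_x0) ** 2).sum() / K.sum()  # sigma_1^2(x0)
    int_K2 = 1.0 / (2 * np.sqrt(np.pi))                # integral of the squared Gaussian kernel
    V_hat = sigma2_hat * int_K2 / f_hat                # V_1(x0)
    se = np.sqrt(V_hat / (n * h))                      # sqrt(V_1 / (n h^k)) with k = 1
    return tau_hat_x0 - z * se, tau_hat_x0 + z * se
```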
4.2 The Cases With Misspecified Models

Now we discuss the asymptotic behaviour of the proposed estimators when either the outcome regression models or the propensity score model is misspecified. The following results show how global misspecification affects the asymptotic properties.

Theorem 2. Assume that the propensity score model is globally misspecified, that is, c_n = C is a nonzero constant. Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_3 ≥ p, s* ≥ s_4 ≥ p, s* ≥ s_6 ≥ p(1), s* ≥ s_7 ≥ p(0), s_6 < (2s_1 + k)(p − p(1)) and s_7 < (2s_1 + k)(p − p(0)).

1). When the outcome regression functions are estimated nonparametrically, then, for each value x, we have

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)).

2). When the outcome regression functions are correctly specified (d_{1n} = d_{0n} = 0) with parametric or semiparametric estimation, then, for each value x, the asymptotic distributions are identical:

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_2(x)),

where V_2(x) = σ_2²(x) ∫K²(u)du / f(x) and σ_2²(x) = E{ [Ψ_2(X, Y, D) − τ(x)]² | X_1 = x }.

Now we consider the cases with global misspecification of the outcome regression models.
Theorem 3.
Assume that the outcome regression models are globally misspecified with fixed nonzero constants d_{1n} = d_1 and d_{0n} = d_0. Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_2 ≥ p, s* ≥ s_5 ≥ p(2) and s_5 < (2s_1 + k)(p − p(2)).

1). When the propensity score is estimated nonparametrically, then, for each x,

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)).

2). When the propensity score model is correctly specified (c_n = 0) with parametric or semiparametric estimation, then, for each value x, the asymptotic distributions are identical:

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_3(x)),

where V_3(x) = σ_3²(x) ∫K²(u)du / f(x) and σ_3²(x) = E{ [Ψ_3(X, Y, D) − τ(x)]² | X_1 = x }.

Remark 1.
By some calculations, presented in Proposition 4 in Section 4.4 below, we can obtain that σ_1²(x) ≤ σ_3²(x), while the analogous relation does not hold between σ_1²(x) and σ_2²(x). That is, the asymptotic variance of the proposed estimator inflates when the outcome regression models are misspecified and the propensity score model is parametrically (correctly specified) or semiparametrically estimated. However, whether the asymptotic variance gets larger with a misspecified propensity score model is model-dependent. We show the following example. Suppose that the outcome regression models are correctly specified, while the propensity score model is globally misspecified. Consider a situation where p(x) = p_1 and p̃(x; β*) = p_2, with p_1, p_2 free of x and p_1 ≠ p_2. We have

σ_2²(x) − σ_1²(x) = E{ [p²(X) − p̃²(X; β*)] / [p̃²(X; β*) p(X)] · Var(Y | X, D = 1) | X_1 = x }
                  + E{ [(1 − p(X))² − (1 − p̃(X; β*))²] / [(1 − p̃(X; β*))²(1 − p(X))] · Var(Y | X, D = 0) | X_1 = x }
                  = [(p_1² − p_2²)/(p_2² p_1)] E[Var(Y | X, D = 1) | X_1 = x]
                  + {[(1 − p_1)² − (1 − p_2)²]/[(1 − p_2)²(1 − p_1)]} E[Var(Y | X, D = 0) | X_1 = x].

To give a clear picture, we further assume that the outcome regression models are homoscedastic, with Var(Y | X, D = 1) = Var(Y | X, D = 0) = ξ², free of X. Then we have

σ_2²(x) − σ_1²(x) = ξ² { (p_1² − p_2²)/(p_2² p_1) + [(1 − p_1)² − (1 − p_2)²]/[(1 − p_2)²(1 − p_1)] }.

Define the function vd(p_1, p_2) = (p_1² − p_2²)/(p_2² p_1) + [(1 − p_1)² − (1 − p_2)²]/[(1 − p_2)²(1 − p_1)]. A negative vd(p_1, p_2) implies variance shrinkage. Consider three true propensity score values p(x) = p_1 = 0.3, 0.5, 0.7. The three curves of vd(p_1, p_2) in Figure 1 show how the variance inflation or shrinkage occurs.

[Figure 1: Curves of vd(p_1, p_2) against p_2 for different p_1; panels (a) p_1 = 0.3, (b) p_1 = 0.5, (c) p_1 = 0.7.]

When p_1 = 0.3 and 0.7, an appropriately over- or under-estimated propensity score (respectively) may result in asymptotic variance shrinkage in some cases. When p_1 = 0.5, which means that every individual has a 0.5 probability of receiving the treatment, vd(p_1, p_2) is always nonnegative. In practice, Var(Y | X, D = 1) and Var(Y | X, D = 0) are not necessarily equal. Such simple examples show that, when only the propensity score is misspecified, both inflation and shrinkage of the asymptotic variance are possible.
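A quick numerical check of vd (a sketch with our own naming; the curves in Figure 1 can be reproduced this way):

```python
import numpy as np

def vd(p1, p2):
    """Sign of sigma_2^2(x) - sigma_1^2(x) (up to the factor xi^2) when the
    true propensity is p1 and the misspecified model returns p2."""
    return ((p1**2 - p2**2) / (p1 * p2**2)
            + ((1 - p1)**2 - (1 - p2)**2) / ((1 - p1) * (1 - p2)**2))

p2 = np.linspace(0.25, 0.75, 11)
for p1 in (0.3, 0.5, 0.7):
    print(p1, np.round(vd(p1, p2), 3))
# For p1 = 0.5, vd >= 0 everywhere (only inflation is possible); for
# p1 = 0.3 (resp. 0.7), moderate over- (resp. under-) estimation of the
# propensity score gives vd < 0, i.e. the asymptotic variance shrinks.
```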
Remark 2.
Another interesting phenomenon is that once the propensity score model is misspecified and the outcome regressions are nonparametrically estimated, or vice versa, the asymptotic performance of the proposed estimator is identical to that when all models are correctly specified. As nonparametric estimation takes no risk of misspecification, such an estimation procedure "absorbs" the influence brought by model misspecification, due to the doubly robust property. But it is clear that in high-dimensional scenarios a purely nonparametric estimation is not worthwhile to recommend. Thus, this property mainly serves as an investigation of theoretical interest unless the dimension of the covariates is small.

The results with local misspecification are stated in the following.
Theorem 4.
Assume that the propensity score model is locally misspecified with c_n → 0. Suppose Conditions (C1)–(C6), (A1), (A2) and (B1) are satisfied for s* ≥ s_3 ≥ p, s* ≥ s_4 ≥ p, s* ≥ s_6 ≥ p(1), s* ≥ s_7 ≥ p(0), s_6 < (2s_1 + k)(p − p(1)) and s_7 < (2s_1 + k)(p − p(0)). Then, for each value x, we have

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_1(x)).

Similarly, assume that the outcome regression functions are locally misspecified with d_{1n} → 0 and d_{0n} → 0. Under the same conditions as in Theorem 4, with s* ≥ s_2 ≥ p, s* ≥ s_5 ≥ p(2) and s_5 < (2s_1 + k)(p − p(2)), for each value x the asymptotic distribution of τ̂(x) is identical to the above.

4.3 A Further Study: All Models are Misspecified

We study this case because the estimator then has a non-ignorable bias in general, which goes to zero only when the rate of convergence of the local misspecification is sufficiently fast. Recall the definitions of γ_1*, γ_0* and β* below (3.3).

Theorem 5.
Suppose that all models are globally misspecified with nonzero constants c_n, d_{1n} and d_{0n}. Assume that Conditions (C1)–(C6) are satisfied. Then, for each value x, we have

√(nh^k)[τ̂(x) − τ(x) − bias(x)] →_d N(0, V_4(x)),

where

bias(x) = E{ [m_1(X) − m̃_1(X; γ_1*)][p(X) − p̃(X; β*)]/p̃(X; β*) − [m_0(X) − m̃_0(X; γ_0*)][p̃(X; β*) − p(X)]/[1 − p̃(X; β*)] | X_1 = x },

V_4(x) = σ_4²(x) ∫K²(u)du / f(x),   σ_4²(x) = E{ [Ψ_4(X, Y, D) − τ̃(x)]² | X_1 = x },

and

τ̃(x) = E{ D[Y − m̃_1(X; γ_1*)]/p̃(X; β*) − (1 − D)[Y − m̃_0(X; γ_0*)]/[1 − p̃(X; β*)] + m̃_1(X; γ_1*) − m̃_0(X; γ_0*) | X_1 = x }.

The following results show the importance of the convergence rates of c_n, d_{1n} and d_{0n} to zero for the bias reduction and the variance change.

Theorem 6.
Under the conditions in Theorem 5, when

c_n d_{1n} = o(1/√(nh^k)),   c_n d_{0n} = o(1/√(nh^k)),

then, for each x, we have

√(nh^k)[τ̂(x) − τ(x)] →_d N(0, V_4(x)).

Remark 3. This theorem shows that, to make the bias vanish, c_n d_{1n} and c_n d_{0n} need to tend to zero at rates faster than the nonparametric convergence rate O(1/√(nh^k)). Recall that Theorems 2 and 3 show that when c_n = o(1) the variance is V_3(x), and when d_{1n} = o(1) and d_{0n} = o(1) the variance is V_2(x). Altogether, when all misspecifications are local, the asymptotic variance reduces to V_1(x). We can then further discuss four cases:

1) All nuisance models are globally misspecified; 2) all nuisance models are locally misspecified; 3) the propensity score function is globally misspecified and the outcome regression functions are locally misspecified; 4) the propensity score function is locally misspecified and the outcome regression functions are globally misspecified.

The first is exactly the case described in Theorem 5. The second shows that if c_n d_{1n} = o(1/√(nh^k)) and c_n d_{0n} = o(1/√(nh^k)), the bias term is negligible, which is the situation in Theorem 6; otherwise, the estimator is biased. Cases 3 and 4 can be regarded as combinations of those in Theorems 5 and 6. In Case 3, once d_{1n} = o(1/√(nh^k)) and d_{0n} = o(1/√(nh^k)), the bias goes to 0 and the variance goes to ||K||₂² σ_2²(x)/f(x); in other words, if d_{1n} and d_{0n} go to 0 at rates faster than O(1/√(nh^k)), Case 3 turns into the case in Theorem 2. We can also derive that if c_n = o(1/√(nh^k)), Case 4 is similar to the case in Theorem 3.

4.4 A Summary of the Comparison Among the Asymptotic Variances

We summarize the comparison among the four variances V_j(x), j = 1, 2, 3, 4. Since V_j(x) = ||K||₂² σ_j²(x)/f(x) for j = 1, 2, 3, 4, it suffices to compare σ_j²(x) for j = 1, 2, 3, 4.

Remark 4. For any x:

1). σ_1²(x) is not necessarily smaller than σ_2²(x); as shown in the example in Remark 1, σ_1²(x) can be larger than σ_2²(x) for some x;
2). σ_1²(x) ≤ σ_3²(x);
3). We have no definitive answer as to whether σ_4²(x) is necessarily smaller than σ_1²(x).
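To illustrate the rate interplay in Theorem 6 and Remark 3 concretely (a worked example of ours, not from the paper): let c_n = n^{−α} and d_{1n} = d_{0n} = n^{−β} with α, β ≥ 0, and h ∝ n^{−η} with 0 < ηk < 1 so that nh^k → ∞. Then

√(nh^k) c_n d_{1n} ≍ n^{(1−ηk)/2 − (α+β)} → 0  ⟺  α + β > (1 − ηk)/2;

with k = 1 and η = 1/5, the product c_n d_{1n} must vanish faster than n^{−2/5} for the bias in Theorem 5 to be asymptotically negligible.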
5. Numerical Study
In this section, we present Monte Carlo simulations to examine the finite-sample performance of the estimators.
5.1 Data-Generating Process

We consider two data-generating processes (DGPs) similar to those in Abrevaya et al. (2015), with covariate dimensions d = 2 and d = 4. Here we only consider a univariate conditioning covariate X_1, i.e. k = 1, so in the simulations τ(x) = E[Y(1) − Y(0) | X_1 = x].

Model 1. This DGP features a 2-dimensional unconfounded covariate X = (X_1, X_2)⊤, that is, d = 2, with

X_1 = ρ_1,   X_2 = (1 + 2X_1)²(1 − X_1) + ρ_2,

where ρ_1, ρ_2 are independent and identically U(−0.5, 0.5) distributed. The potential outcomes and the propensity score function are given as

Y(1) = X_1 X_2 + ε,   Y(0) = 0,   p(X) = exp(X_1 + X_2)/[1 + exp(X_1 + X_2)],

where ε is a centered normal error. The true CATE conditional on X_1 can be derived as τ(x) = x(1 + 2x)²(1 − x). Since the misspecification effect is a concern, we use the misspecified parametric models

m̃_1(X; γ_1) = (1, X⊤)γ_1,   p̃(X; β) = exp((1, X_1)β)/[1 + exp((1, X_1)β)],

where γ_1 ∈ R³ and β ∈ R².
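A minimal Python sketch of Model 1's DGP (assuming the reconstruction above; the error standard deviation below is our own placeholder, and all names are ours), which can be paired with the estimator sketches of Section 3:

```python
import numpy as np

def simulate_model1(n, eps_sd=0.5, rng=None):
    """Draw one sample of size n from Model 1 as reconstructed above.

    Returns covariates X (n, 2), treatment D (n,), observed outcome Y (n,).
    eps_sd is an assumed error standard deviation, not the paper's value.
    """
    rng = np.random.default_rng(rng)
    x1 = rng.uniform(-0.5, 0.5, n)
    x2 = (1 + 2 * x1) ** 2 * (1 - x1) + rng.uniform(-0.5, 0.5, n)
    p = np.exp(x1 + x2) / (1 + np.exp(x1 + x2))   # true propensity score
    D = rng.binomial(1, p)
    y1 = x1 * x2 + rng.normal(0, eps_sd, n)       # potential outcome Y(1)
    Y = D * y1                                    # Y(0) = 0, so Y = D * Y(1)
    return np.column_stack([x1, x2]), D, Y

def tau_true(x):
    # True CATE conditional on X_1 = x
    return x * (1 + 2 * x) ** 2 * (1 - x)
```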
Model 2. The other DGP features a 4-dimensional unconfounded covariate, for the purpose of a further investigation of higher-dimensional cases. Write X = (X_1, X_2, X_3, X_4)⊤, with

X_1 = ρ_1,   X_2 = 1 + 2X_1 + ρ_2,   X_3 = 1 + 2X_1 + ρ_3,   X_4 = (1 − X_1) + ρ_4,

where ρ_1, ρ_2, ρ_3, ρ_4 are independent and identically U(−0.5, 0.5) distributed. The potential outcomes and the propensity score function are defined as

Y(1) = X_1 X_2 X_3 X_4 + ε,   Y(0) = 0,
p(X) = exp(X_1 + X_2 + X_3 + X_4)/[1 + exp(X_1 + X_2 + X_3 + X_4)],

where ε is again a centered normal error. The true CATE conditional on X_1 remains τ(x) = x(1 + 2x)²(1 − x). Still, we use the misspecified parametric models

m̃_1(X; γ_1) = (1, X⊤)γ_1,   p̃(X; β) = exp((1, X_1)β)/[1 + exp((1, X_1)β)],

where γ_1 ∈ R⁵ and β ∈ R².

5.2 Kernel Functions and Bandwidths

As the selections of the kernel functions and bandwidths (listed in the supplementary material) have a great influence on the asymptotic properties when the nuisance models are nonparametrically or semiparametrically estimated, we first discuss this issue. Let h = a_1 n^{−η_1} for some η_1 > 0.
Together with condition (A2), determining the value of η_1 becomes a linear programming problem. For model 1 (d = 2), we consider a kernel function of order 4 (s_1 = 4) as the kernel K in the second step of the Nadaraya-Watson estimation, and write h = a_1 n^{−η_1}. For the other bandwidths, take h_2 as an example: the results in Section 4 require s* ≥ s_2 ≥ d, so we choose s_2 = 2 and let h_2 = a_2 n^{−η_2}. The exponents (η_1, η_2) are then fixed accordingly, and the other bandwidths are determined similarly as h_j = a_j n^{−η_j} (j = 2, ..., 7) with s_j = 2 (j = 2, ..., 7), choosing the h_j jointly so as to meet the interaction condition in (A2). The constants a_j (j = 2, ..., 7) are set to fixed values.
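As a worked illustration of the linear programming step (our own numbers, derived only from condition (A2)(i) as reconstructed in the supplement): for model 1 we have k = 1 and s_1 = 4, and h = a_1 n^{−η_1} must satisfy

nh → ∞  ⟺  η_1 < 1,   and   nh^{2s_1+k} = a_1⁹ n^{1−9η_1} → 0  ⟺  η_1 > 1/9,

so any η_1 ∈ (1/9, 1) is admissible for the second-step bandwidth; the remaining exponents η_2, ..., η_7 are then pinned down by the analogous inequalities in (A2)(ii)–(xiv).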
For model 2 (d = 4), we consider s_1 = 6 and s_j = 4 (j = 2, ..., 7), with h = a_1 n^{−η_1} and h_j = a_j n^{−η_j} (j = 2, ..., 7) chosen in the same manner. In the simulations we also tried many other values of the constants and found the chosen ones recommendable, as the values around them make the estimators relatively stable.

We take a Gaussian kernel K of order s_1 satisfying condition (A1)(i). For the other kernel functions, we use Epanechnikov kernels of the corresponding orders satisfying conditions (A1)(ii) and (iii).

5.3 Simulation Results

As there are many estimators τ̂(x) with differently estimated nuisance models, we list them and the corresponding notation in Table 2 for convenience. To guarantee the regularity conditions and the estimation stability, all estimated propensity scores are trimmed to lie within a fixed interval bounded away from 0 and 1. We estimate τ(x) for x ∈ {−0.4, −0.2, 0, 0.2, 0.4}.
The sample sizes are n = 500 and n = 5,000, respectively, to examine the asymptotic behaviour. The experiments are repeated 2,000 times. Denote T(x) = √(nh)[τ̂(x) − τ(x)]. We evaluate the estimators based on the following criteria: the bias of τ̂(x); the sample standard deviation (sam-SD) of T(x); the mean squared error (MSE) of T(x). We also report the proportions (P_{0.05}, P_{0.95}) of the standardized T(x) below the 5% quantile and above the 95% quantile of N(0, 1) to verify the asymptotic normality. We display the efficiency comparisons among the different estimators under models 1 and 2 in Figures 2 and 3, and the detailed results under model 1 in Tables 3, 4 and 5. To save space, the simulation results for model 2 are reported in the supplementary material.

Table 2: Estimators involved in the simulation

DRCATE    p(x)                                        m(x)
(O, O)    oracle                                      oracle
(cP, cP)  parametrically estimated (correct)          parametrically estimated (correct)
(N, N)    nonparametrically estimated                 nonparametrically estimated
(S, S)    semiparametrically estimated                semiparametrically estimated
(mP, cP)  parametrically estimated (misspecified)     parametrically estimated (correct)
(mP, N)   parametrically estimated (misspecified)     nonparametrically estimated
(mP, S)   parametrically estimated (misspecified)     semiparametrically estimated
(cP, mP)  parametrically estimated (correct)          parametrically estimated (misspecified)
(N, mP)   nonparametrically estimated                 parametrically estimated (misspecified)
(S, mP)   semiparametrically estimated                parametrically estimated (misspecified)

Table 3: The simulation results under model 1 (part 1)

                         n = 500                                  n = 5000
DRCATE    x      bias    sam-SD  MSE     P0.05  P0.95     bias    sam-SD  MSE     P0.05  P0.95
(O,O)    -0.4   0.0001   0.2776  0.0770  0.052  0.046    0.0004   0.2724  0.0742  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.044   -0.0005   0.2333  0.0544  0.049  0.050
          0    -0.0002   0.2088  0.0436  0.049  0.050    0.0003   0.2014  0.0405  0.047  0.048
          0.2   0.0003   0.1997  0.0399  0.052  0.047    0.0002   0.1999  0.0400  0.050  0.054
          0.4   0.0027   0.2003  0.0403  0.045  0.058    0.0004   0.2006  0.0403  0.048  0.054
(cP,cP)  -0.4   0.0000   0.2797  0.0782  0.053  0.048    0.0004   0.2725  0.0743  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.042   -0.0005   0.2333  0.0544  0.051  0.048
          0    -0.0002   0.2089  0.0436  0.048  0.050    0.0003   0.2014  0.0405  0.047  0.047
          0.2   0.0003   0.1994  0.0397  0.051  0.048    0.0002   0.2001  0.0400  0.051  0.054
          0.4   0.0027   0.2003  0.0403  0.044  0.058    0.0004   0.2007  0.0403  0.047  0.054
(N,N)    -0.4   0.0008   0.2716  0.0738  0.050  0.053    0.0001   0.2845  0.0809  0.050  0.049
         -0.2   0.0015   0.2366  0.0560  0.042  0.058   -0.0001   0.2344  0.0549  0.050  0.050
          0     0.0002   0.2046  0.0419  0.043  0.052   -0.0005   0.1996  0.0399  0.057  0.041
          0.2   0.0010   0.2000  0.0400  0.044  0.051   -0.0001   0.1941  0.0377  0.052  0.056
          0.4   0.0014   0.2081  0.0433  0.045  0.054    0.0009   0.2012  0.0406  0.045  0.056
(S,S)    -0.4  -0.0022   0.2815  0.0794  0.051  0.044    0.0002   0.2862  0.0819  0.045  0.050
         -0.2   0.0004   0.2365  0.0559  0.046  0.052   -0.0004   0.2302  0.0530  0.046  0.048
          0     0.0005   0.2082  0.0433  0.053  0.052    0.0003   0.2059  0.0424  0.052  0.052
          0.2  -0.0015   0.1992  0.0397  0.061  0.041   -0.0002   0.2011  0.0404  0.053  0.051
          0.4   0.0002   0.2021  0.0408  0.050  0.046    0.0012   0.2048  0.0422  0.043  0.059

Table 4: The simulation results under model 1 (part 2)

                         n = 500                                  n = 5000
DRCATE    x      bias    sam-SD  MSE     P0.05  P0.95     bias    sam-SD  MSE     P0.05  P0.95
(O,O)    -0.4   0.0001   0.2776  0.0770  0.052  0.046    0.0004   0.2724  0.0742  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.044   -0.0005   0.2333  0.0544  0.049  0.050
          0    -0.0002   0.2088  0.0436  0.049  0.050    0.0003   0.2014  0.0405  0.047  0.048
          0.2   0.0003   0.1997  0.0399  0.052  0.047    0.0002   0.1999  0.0400  0.050  0.054
          0.4   0.0027   0.2003  0.0403  0.045  0.058    0.0004   0.2006  0.0403  0.048  0.054
(mP,cP)  -0.4   0.0000   0.2599  0.0675  0.052  0.049    0.0004   0.2530  0.0640  0.044  0.052
         -0.2  -0.0022   0.2363  0.0559  0.056  0.041   -0.0005   0.2323  0.0540  0.050  0.050
          0    -0.0002   0.2203  0.0485  0.049  0.048    0.0003   0.2116  0.0448  0.047  0.052
          0.2   0.0003   0.2041  0.0417  0.051  0.046    0.0002   0.2048  0.0419  0.050  0.053
          0.4   0.0027   0.1953  0.0383  0.044  0.058    0.0004   0.1955  0.0382  0.046  0.054
(mP,N)   -0.4  -0.0046   0.2666  0.0716  0.064  0.040   -0.0011   0.2629  0.0693  0.054  0.044
         -0.2  -0.0035   0.2373  0.0566  0.059  0.044   -0.0029   0.2383  0.0584  0.074  0.037
          0    -0.0068   0.2152  0.0474  0.072  0.032   -0.0027   0.2107  0.0458  0.072  0.034
          0.2  -0.0011   0.2041  0.0417  0.052  0.047   -0.0004   0.1952  0.0381  0.050  0.045
          0.4  -0.0008   0.2003  0.0401  0.049  0.049    0.0007   0.2002  0.0402  0.043  0.056
(mP,S)   -0.4  -0.0143   0.2701  0.0781  0.082  0.029   -0.0115   0.2722  0.0996  0.146  0.010
         -0.2  -0.0094   0.2453  0.0624  0.070  0.032   -0.0073   0.2302  0.0634  0.114  0.016
          0    -0.0046   0.2116  0.0453  0.064  0.043   -0.0038   0.2099  0.0469  0.083  0.032
          0.2  -0.0019   0.2041  0.0417  0.050  0.046   -0.0006   0.1970  0.0388  0.054  0.047
          0.4   0.0022   0.2002  0.0402  0.046  0.058    0.0017   0.1968  0.0393  0.037  0.062

Table 5: The simulation results under model 1 (part 3)

                         n = 500                                  n = 5000
DRCATE    x      bias    sam-SD  MSE     P0.05  P0.95     bias    sam-SD  MSE     P0.05  P0.95
(O,O)    -0.4   0.0001   0.2776  0.0770  0.052  0.046    0.0004   0.2724  0.0742  0.044  0.052
         -0.2  -0.0023   0.2378  0.0567  0.056  0.044   -0.0005   0.2333  0.0544  0.049  0.050
          0    -0.0002   0.2088  0.0436  0.049  0.050    0.0003   0.2014  0.0405  0.047  0.048
          0.2   0.0003   0.1997  0.0399  0.052  0.047    0.0002   0.1999  0.0400  0.050  0.054
          0.4   0.0027   0.2003  0.0403  0.045  0.058    0.0004   0.2006  0.0403  0.048  0.054
(cP,mP)  -0.4  -0.0012   0.3230  0.1044  0.051  0.048    0.0001   0.3201  0.1024  0.050  0.049
         -0.2  -0.0021   0.2400  0.0577  0.052  0.042   -0.0005   0.2362  0.0558  0.054  0.044
          0     0.0004   0.2147  0.0461  0.052  0.049    0.0003   0.2050  0.0420  0.049  0.049
          0.2   0.0004   0.2012  0.0405  0.054  0.046    0.0001   0.2016  0.0406  0.048  0.049
          0.4   0.0028   0.2059  0.0426  0.043  0.061    0.0004   0.2039  0.0416  0.045  0.053
(N,mP)   -0.4  -0.0105   0.2840  0.0834  0.075  0.040   -0.0013   0.2970  0.0885  0.060  0.045
         -0.2   0.0014   0.2353  0.0554  0.047  0.050    0.0007   0.2288  0.0525  0.040  0.053
          0     0.0013   0.2104  0.0443  0.048  0.054    0.0002   0.2065  0.0426  0.047  0.044
          0.2  -0.0014   0.1995  0.0398  0.056  0.048   -0.0004   0.2022  0.0409  0.052  0.044
          0.4   0.0008   0.2034  0.0414  0.046  0.046    0.0000   0.2077  0.0431  0.048  0.050
(S,mP)   -0.4  -0.0051   0.2964  0.0884  0.055  0.046   -0.0005   0.3089  0.0955  0.050  0.045
         -0.2  -0.0002   0.2421  0.0586  0.049  0.050    0.0001   0.2394  0.0573  0.048  0.051
          0     0.0005   0.2076  0.0431  0.050  0.050   -0.0001   0.2051  0.0421  0.048  0.049
          0.2  -0.0008   0.2082  0.0433  0.049  0.049   -0.0001   0.1966  0.0386  0.054  0.048
          0.4   0.0005   0.2104  0.0443  0.044  0.052    0.0006   0.2085  0.0435  0.048  0.054

[Figure 2: Relative sample variance of each estimator against DRCATE(O,O) in model 1, for n = 500 (panel a) and n = 5000 (panel b); estimators grouped as {(cP,cP), (N,N), (S,S)}, {(mP,cP), (mP,N), (mP,S)} and {(cP,mP), (N,mP), (S,mP)}.]

Here we present some observations from the simulation results.
First, as the sample size grows, the bias and the standard deviation of τ̂(x) reasonably tend to be smaller, due to the estimation consistency. The reported proportions P_{0.05} and P_{0.95} are controlled around 0.05, which implies that the normal approximation for the proposed estimators is valid.

Second, from Figures 2 and 3, the efficiency comparisons among the estimators (O,O), (cP,cP), (N,N) and (S,S) show that their distributions are close to each other. When only the propensity score function is misspecified, both variance inflation and shrinkage are possible. With misspecified outcome regression functions, only variance inflation is possible.

[Figure 3: Relative sample variance of each estimator against DRCATE(O,O) in model 2, for n = 500 (panel a) and n = 5000 (panel b), with the same estimator groupings as in Figure 2.]
Third, the bias and standard deviation of τ̂(x) increase when the covariate dimension grows; see the comparisons between Figures 2 and 3. A possible explanation for this phenomenon is that the standard deviations of the nuisance model estimates increase with a higher covariate dimension.
6. Conclusion
In this paper, we investigate the asymptotic behaviour of nine doubly robust (DR) estimators under different combinations of model structures, to provide a relatively complete picture of this methodology. When all models are correctly specified, the asymptotic equivalence among all the defined estimators holds, unsurprisingly. When models are misspecified, we consider local and global misspecification, and some interesting phenomena have been discovered, such as asymptotic variance shrinkage in some cases due to misspecification. Further, we would recommend semiparametric estimation under the dimension reduction structure. This is because nonparametric estimation severely suffers from the curse of dimensionality, whereas parametric estimation may not be sufficiently robust against model structure misspecification.
Acknowledgements
The research described herein was supported by an NNSF grant of China and a grant from the University Grants Council of Hong Kong, Hong Kong, China.

7. Supplementary Material
The supplementary material contains the detailed proofs of the theorems and propositions, and the additional simulation results.
7.1 Technical Conditions

Here we present the conditions needed to derive the theoretical results. Together with (C1) and (C2) in the main context, the following conditions in the (C) group are regularity conditions that guarantee the asymptotic properties regardless of the different ways of estimating the nuisance models.

(C3) The density functions involved in this article satisfy the following conditions:
(i) For any x ∈ X, the density function θ(x) of X is bounded away from 0.
(ii) For any x, the density function f(x) of X_1 is bounded away from zero and is s_1 times continuously differentiable.
(iii) Denote the density functions of A⊤X, B_1⊤X and B_0⊤X as θ_A(·), θ_{B_1}(·) and θ_{B_0}(·). For any x ∈ X, all these density functions are bounded away from 0.

(C4) Denote C as the parameter space of β. For any x ∈ X and β ∈ C, p̃(x; β) is bounded away from 0 and 1.

(C5) sup_x E[Y(j)² | X = x] < ∞ for j = 0, 1.

(C6) E|Ψ_1(X, Y, D) − τ(x)|^{κ_1} < ∞, E|Ψ_2(X, Y, D) − τ(x)|^{κ_2} < ∞, E|Ψ_3(X, Y, D) − τ(x)|^{κ_3} < ∞, E|Ψ_4(X, Y, D) − τ̃(x)|^{κ_4} < ∞, and ∫|K(u)|^δ du < ∞ for some constants κ_1, κ_2, κ_3, κ_4, δ ≥ 2.

Bounded propensity scores or specified propensity score models, density functions and the corresponding conditional moments are required in these conditions; these are common restrictions in the literature, and they play important roles in deriving the asymptotic linear expressions of the proposed estimators. (C6) ensures the applicability of Lyapunov's central limit theorem here.

We also assume some conditions on the kernel functions and bandwidths in the nonparametric estimation:

(A1) The kernel functions satisfy the following conditions:
(i) K(u) is a kernel function of order s_1, which is symmetric around zero and s* times continuously differentiable.
(ii) K_2(u), K_3(u) and K_4(u) are kernels of order s_2 ≥ p, s_3 ≥ p and s_4 ≥ p, which are symmetric around zero, equal to zero outside Π_{i=1}^p [−1, 1], and have continuous (s_2 + 1), (s_3 + 1) and (s_4 + 1) order derivatives, respectively.
(iii) K_5(u), K_6(u) and K_7(u) are kernels of order s_5 ≥ p(2), s_6 ≥ p(1) and s_7 ≥ p(0), which are symmetric around zero, equal to zero outside Π_{i=1}^{p(2)} [−1, 1], Π_{i=1}^{p(1)} [−1, 1] and Π_{i=1}^{p(0)} [−1, 1], and have continuous (s_5 + 1), (s_6 + 1) and (s_7 + 1) order derivatives, respectively.

(A2) As different scenarios require different bandwidths, we put them together in the following. As n → ∞:
(i) h → 0, nh^k → ∞, nh^{2s_1+k} → 0;
(ii)–(vii) for j = 2, ..., 7, h_j → 0 and (ln n)/(n h_j^{p_j + 2s_j}) → 0, where p_j is the dimension of the argument of K_j (p_j = p for j = 2, 3, 4 and p_j = p(2), p(1), p(0) for j = 5, 6, 7);
(viii)–(xiii) for j = 2, ..., 7, h_j^{s_j} h^{−s_1−k} → 0 and n h^k h_j^{2s_j} → 0;
(xiv) n h^k h_j^{s_j} h_l^{s_l} → 0 for the pairs (j, l) of bandwidths appearing in the same remainder term.

The bounded supports of K_j(u), j = 2, ..., 7, and the bandwidths h_j, j = 2, ..., 7, are required for the uniform consistency of the kernel estimators; Abrevaya et al. (2015) stated that this restriction to a bounded support can be relaxed to exponential tails.

(A2)(ii)–(xiv) place restrictions on the convergence rates of the different bandwidths to make the remainders of the linear expansions negligible. (A2)(xiv), involving more than two bandwidths, can be regarded as an interaction term, which makes it handleable to determine those convergence rates. Here we provide a simple idea to accomplish this task based on linear programming: assume the bandwidths converge to 0 as h_j = a_j n^{−η_j}, j = 1, ..., 7, with η_j > 0; with the orders s_j, j = 1, ..., 7, predetermined, the conditions in (A2) become linear inequalities in the η_j. For a more detailed example, readers can refer to Section 5.

Lastly, we give a condition ensuring the desired convergence rates of the estimators under the semiparametric dimension reduction structure, which is helpful when pursuing the asymptotic properties of τ̂(x):

(B1) Â − A = O_p(n^{−1/2}), B̂_1 − B_1 = O_p(n^{−1/2}), B̂_0 − B_0 = O_p(n^{−1/2}).

These rates can be achieved by standard estimation methods in the literature; see relevant references such as Li (1991). In summary, these conditions are rather standard.

7.2 Proof of Theorem 1
Recall that

τ̂(x) = Σ_i [ D_i(Y_i − m̂_{1i})/p̂_i − (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{1i} − m̂_{0i} ] K((X_{1i} − x)/h) / Σ_t K((X_{1t} − x)/h)
      = Σ_i [ D_i(Y_i − m̂_{1i})/p̂_i + m̂_{1i} ] K((X_{1i} − x)/h) / Σ_t K((X_{1t} − x)/h)
        − Σ_i [ (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{0i} ] K((X_{1i} − x)/h) / Σ_t K((X_{1t} − x)/h)
      =: τ̂_1(x) − τ̂_0(x).

Let

τ_1(x) = E{ D[Y − m_1(X)]/p(X) + m_1(X) | X_1 = x },   τ_0(x) = E{ (1 − D)[Y − m_0(X)]/(1 − p(X)) + m_0(X) | X_1 = x },

so that τ(x) = τ_1(x) − τ_0(x). As the very first move, we look for the asymptotic linear expression of √(nh^k)[τ̂(x) − τ(x)]. Note that

√(nh^k)[τ̂(x) − τ(x)] = √(nh^k){ [τ̂_1(x) − τ_1(x)] − [τ̂_0(x) − τ_0(x)] }
 = (1/f̂(x)) (1/√(nh^k)) Σ_i [ D_i(Y_i − m̂_{1i})/p̂_i + m̂_{1i} − τ_1(x) ] K((X_{1i} − x)/h)
 − (1/f̂(x)) (1/√(nh^k)) Σ_i [ (1 − D_i)(Y_i − m̂_{0i})/(1 − p̂_i) + m̂_{0i} − τ_0(x) ] K((X_{1i} − x)/h).

Since f̂(x) →_P f(x), we can use Slutsky's theorem later, so we first consider the asymptotic linear expression of

J(x) = (1/√(nh^k)) Σ_{i=1}^n [ D_i(Y_i − m̂_{1i})/p̂_i + m̂_{1i} − τ_1(x) ] K((X_{1i} − x)/h);    (7.1)

the τ̂_0 part is handled symmetrically. We consider the following combinations of estimated nuisance functions:

Scenario 1. p(x) parametrically estimated (correctly specified), m_1(x) parametrically estimated (correctly specified);
Scenario 2. p(x) nonparametrically estimated, m_1(x) nonparametrically estimated;
Scenario 3. p(x) semiparametrically estimated, m_1(x) semiparametrically estimated;
Scenario 4. p(x) parametrically estimated (correctly specified), m_1(x) nonparametrically estimated;
Scenario 5. p(x) parametrically estimated (correctly specified), m_1(x) semiparametrically estimated;
Scenario 6. p(x) nonparametrically estimated, m_1(x) parametrically estimated (correctly specified);
Scenario 7. p(x) nonparametrically estimated, m_1(x) semiparametrically estimated;
Scenario 8. p(x) semiparametrically estimated, m_1(x) parametrically estimated (correctly specified);
Scenario 9. p(x) semiparametrically estimated, m_1(x) nonparametrically estimated.

Scenario 1: p(x) and m_1(x) are parametrically estimated. From the standard parametric estimation argument,

sup_{x∈X} |p̃(x; β̂) − p(x)| = sup_{x∈X} |p̃(x; β̂) − p̃(x; β*)| = O_p(1/√n),
sup_{x∈X} |m̃_1(x; γ̂_1) − m_1(x)| = sup_{x∈X} |m̃_1(x; γ̂_1) − m̃_1(x; γ_1*)| = O_p(1/√n).
We start from (7.1):

J(x) = (1/√(nh^k)) Σ_i [ D_i(Y_i − m̃_1(X_i; γ̂_1))/p̃(X_i; β̂) + m̃_1(X_i; γ̂_1) − τ_1(x) ] K((X_{1i} − x)/h)
 = (1/√(nh^k)) Σ_i [ D_iY_i/p(X_i) − (D_i − p(X_i)) m_1(X_i)/p(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [ D_i(m_{1i}^+ − Y_i)/(p_i^+)² ] [p̃(X_i; β̂) − p(X_i)] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [ (p_i^+ − D_i)/p_i^+ ] [m̃_1(X_i; γ̂_1) − m_1(X_i)] K((X_{1i} − x)/h)
 =: J_1(x) + J_2(x) + J_3(x),    (7.2)

where p_i^+ lies between p(X_i) and p̃(X_i; β̂), and m_{1i}^+ lies between m_1(X_i) and m̃_1(X_i; γ̂_1). Bound J_2(x) as

|J_2(x)| ≤ √(nh^k) sup_{x∈X} |p̃(x; β̂) − p(x)| × (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| |D_i(m_{1i}^+ − Y_i)/(p_i^+)²|,

where sup_{x∈X} |p̃(x; β̂) − p(x)| = O_p(1/√n), |D_i(m_{1i}^+ − Y_i)/(p_i^+)²| is bounded due to conditions (C2)(ii), (C4) and (C5), and (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| = O_p(1) by the standard nonparametric estimation argument. Thus |J_2(x)| = o_p(1) · O_p(1) = o_p(1). With similar arguments, we can also bound the last term as |J_3(x)| = o_p(1). So far, we have proved that J_2(x) and J_3(x) converge to 0 in probability. Hence, according to Slutsky's theorem, together with (7.2), we have

√(nh^k)[τ̂_1(x) − τ_1(x)] = J(x)/f̂(x) = J_1(x)/f̂(x) + o_p(1)
 = (1/f̂(x)) (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h) + o_p(1).    (7.3)

Scenario 2: p(x) and m_1(x) are nonparametrically estimated.
From the standard nonparametric estimation arguments, under conditions (A1)(i)–(iii) we have

sup_{x∈X} |p̂(x) − p(x)| = O_p( h_2^{s_2} + √(ln n/(n h_2^p)) ) = O_p(h_2^{s_2}),
sup_{x∈X} |m̂_1(x) − m_1(x)| = O_p( h_3^{s_3} + √(ln n/(n h_3^p)) ) = O_p(h_3^{s_3}),

and the bandwidth conditions in (A2) make √(nh^k) times these suprema o_p(1). Rewrite (7.1):

J(x) = (1/√(nh^k)) Σ_i [ D_i(Y_i − m̂_1(X_i))/p̂(X_i) + m̂_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 = (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i[m_1(X_i) − Y_i]/p²(X_i) [p̂(X_i) − p(X_i)] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [p(X_i) − D_i]/p(X_i) [m̂_1(X_i) − m_1(X_i)] K((X_{1i} − x)/h)
 + 0
 + (1/√(nh^k)) Σ_i D_i[Y_i − m_{1i}^+]/(p_i^+)³ [p̂(X_i) − p(X_i)]² K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i/(p_i^+)² [p̂(X_i) − p(X_i)][m̂_1(X_i) − m_1(X_i)] K((X_{1i} − x)/h)
 =: J_1(x) + J_2(x) + J_3(x) + 0 + J_4(x) + J_5(x),    (7.4)

where p_i^+ lies between p(X_i) and p̂(X_i), and m_{1i}^+ lies between m_1(X_i) and m̂_1(X_i). Rewrite J_2(x) as

J_2(x) = (1/√n) Σ_i { D_i[m_1(X_i) − Y_i]/p²(X_i) } (1/√(h^k)) [p̂(X_i) − p(X_i)] K((X_{1i} − x)/h).

Here E{ D_i[m_1(X_i) − Y_i]/p²(X_i) | X_i } = 0, so each summand is conditionally centered given X_i; by the uniform rate above and condition (A2), (1/√(h^k))[p̂(X_i) − p(X_i)] K((X_{1i} − x)/h) = o_p(1), and an application of the CLT then gives J_2(x) = o_p(1). Similarly, |J_3(x)| = o_p(1). We deal with J_4(x) via the bound

|J_4(x)| ≤ √(nh^k) sup_{x∈X} [p̂(x) − p(x)]² × (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| |D_i[Y_i − m_{1i}^+]/(p_i^+)³|,

in which √(nh^k) sup_{x∈X} [p̂(x) − p(x)]² = o_p(1) under condition (A2), and |D_i[Y_i − m_{1i}^+]/(p_i^+)³| is bounded under conditions (C2)(ii) and (C5). Again, by the standard argument for nonparametric estimation, (1/(nh^k)) Σ_i |K((X_{1i} − x)/h)| = O_p(1). Thus |J_4(x)| = o_p(1) · O_p(1) = o_p(1). In a similar way, |J_5(x)| = o_p(1) can also be proved. We have now shown that J_2(x), J_3(x), J_4(x) and J_5(x) can all be bounded as o_p(1).
Together with (7.4), we can obtain that

√(nh^k)[τ̂_1(x) − τ_1(x)] = J(x)/f̂(x) = J_1(x)/f̂(x) + o_p(1)
 = (1/f̂(x)) (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h) + o_p(1).    (7.5)

Scenario 3: p(x) and m_1(x) are semiparametrically estimated. Under conditions (A2)(v)–(vii),

sup_{x∈X} |ĝ(A⊤x) − g(A⊤x)| = O_p( h_5^{s_5} + √(ln n/(n h_5^{p(2)})) ),
sup_{x∈X} |r̂_1(B_1⊤x) − r_1(B_1⊤x)| = O_p( h_6^{s_6} + √(ln n/(n h_6^{p(1)})) ).

Note that, under condition (B1), we can first derive the asymptotic distribution by assuming that the projection matrices A, B_1 and B_0 are given. Then

J(x) = (1/√(nh^k)) Σ_i [ D_i(Y_i − r̂_1(B_1⊤X_i))/ĝ(A⊤X_i) + r̂_1(B_1⊤X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 = (1/√(nh^k)) Σ_i [ D_i(Y_i − m_1(X_i))/p(X_i) + m_1(X_i) − τ_1(x) ] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i[m_1(X_i) − Y_i]/p²(X_i) [ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i [p(X_i) − D_i]/p(X_i) [r̂_1(B_1⊤X_i) − r_1(B_1⊤X_i)] K((X_{1i} − x)/h)
 + 0
 + (1/√(nh^k)) Σ_i D_i[Y_i − r_{1i}^+]/(g_i^+)³ [ĝ(A⊤X_i) − g(A⊤X_i)]² K((X_{1i} − x)/h)
 + (1/√(nh^k)) Σ_i D_i/(g_i^+)² [r̂_1(B_1⊤X_i) − r_1(B_1⊤X_i)][ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h)
 =: J_1(x) + J_2(x) + J_3(x) + 0 + J_4(x) + J_5(x),    (7.6)

where g_i^+ lies between g(A⊤X_i) and ĝ(A⊤X_i), and r_{1i}^+ lies between r_1(B_1⊤X_i) and r̂_1(B_1⊤X_i). We then deal with all the terms one by one. Consider J_2(x) and J_3(x). We have

J_2(x) = (1/√n) Σ_i { D_i[r_1(B_1⊤X_i) − Y_i]/g²(A⊤X_i) } (1/√(h^k)) [ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h).

Again E{ D_i[m_1(X_i) − Y_i]/p²(X_i) | X_i } = 0, so the summands are conditionally centered given X_i; by the uniform rate above and condition (A2)(xi), (1/√(h^k))[ĝ(A⊤X_i) − g(A⊤X_i)] K((X_{1i} − x)/h) = o_p(1), and an application of the CLT yields J_2(x) = o_p(1). We can prove J_3(x) = o_p(1) similarly. Next we deal with J_4(x) and J_5(x).
We have
\[
|J_4(x)|\le\sqrt{nh^{k}}\,\sup_{x\in\mathcal X}\{\hat g(A^{\top}x)-g(A^{\top}x)\}^{2}\times\frac{1}{nh^{k}}\sum_{i=1}^n|K_i|\,\Big|\frac{D_i\{Y_i-r^{+1}_i\}}{(g^{+}_i)^{3}}\Big|,
\]
and condition (A2)(xi) implies that $\sqrt{nh^{k}}\sup_{x\in\mathcal X}\{\hat g(A^{\top}x)-g(A^{\top}x)\}^{2}=o_p(1)$. Under conditions (C2)(ii) and (C5), $|D_i\{Y_i-r^{+1}_i\}/(g^{+}_i)^{3}|$ is bounded and, again, $(nh^{k})^{-1}\sum_{i=1}^n|K_i|=O_p(1)$. We can then achieve $|J_4(x)|=o_p(1)\cdot O_p(1)=o_p(1)$; the same argument proves $|J_5(x)|=o_p(1)$. In this way, the asymptotic negligibility of $J_2(x)$, $J_3(x)$, $J_4(x)$ and $J_5(x)$ has been proved. Together with (7.6), it can be derived that
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}J(x)=\frac{1}{\hat f(x)}J_1(x)+o_p(1)
=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{p(X_i)}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1).
\tag{7.7}
\]
Equations (7.3), (7.5) and (7.7) imply that the asymptotic linear expressions of $\sqrt{nh^{k}}[\hat\tau(x)-\tau(x)]$ are identical among scenarios 1, 2 and 3. Therefore, under the conditions of Theorem 1, the asymptotic linear expression, and hence the asymptotic distribution, is the same in every scenario. With the asymptotic linear expression we can further derive the asymptotic distribution. First, we have the decomposition
\[
\begin{aligned}
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]
&=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1)\\
&=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i
+\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i+o_p(1)\\
&=:\frac{1}{\hat f(x)}I_1(x)+\frac{1}{\hat f(x)}I_2(x)+o_p(1).
\end{aligned}
\tag{7.8}
\]
Consider $I_1(x)$ first. Note that $\tau(X_i)=E[\Psi(X_i,Y_i,D_i)\mid X=X_i]$, so $\Psi(X_i,Y_i,D_i)-\tau(X_i)$ is conditionally mean zero given $X_i$, while $K_i$ depends only on $n$ and $X_i$. Thus, by the tower property,
\[
E\Big\{[\Psi(X,Y,D)-\tau(X)]K\Big(\frac{X-x}{h}\Big)\Big\}
=E\Big\{E[\Psi(X,Y,D)-\tau(X)\mid X]\,K\Big(\frac{X-x}{h}\Big)\Big\}=0.
\]
Also, for each fixed $n$ and $x$, $\{[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i\}_{i=1}^n$ are independently and identically distributed.
We now check the condition of Lyapunov's CLT: for some $\kappa>2$,
\[
\sum_{i=1}^n E\Big|\frac{1}{\sqrt{nh^{k}}}[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i\Big|^{\kappa}\to 0\qquad(n\to\infty).
\]
Under condition (C6), letting $C=\sup_{x\in\mathcal X}E\{|\Psi(X,Y,D)-\tau(X)|^{\kappa}\mid X=x\}<\infty$, we have
\[
\sum_{i=1}^n E\Big|\frac{1}{\sqrt{nh^{k}}}[\Psi(X_i,Y_i,D_i)-\tau(X_i)]K_i\Big|^{\kappa}
=(nh^{k})^{1-\kappa/2}\,\frac{1}{h^{k}}E\Big\{|\Psi(X,Y,D)-\tau(X)|^{\kappa}\Big|K\Big(\frac{X-x}{h}\Big)\Big|^{\kappa}\Big\}
\le(nh^{k})^{1-\kappa/2}\,C\,\frac{1}{h^{k}}E\Big|K\Big(\frac{X-x}{h}\Big)\Big|^{\kappa},
\]
where
\[
\frac{1}{h^{k}}E\Big|K\Big(\frac{X-x}{h}\Big)\Big|^{\kappa}=\int K^{\kappa}(u)f(x+hu)\,du\to f(x)\int K^{\kappa}(u)\,du<\infty.
\]
Since $nh^{k}\to\infty$ and $\kappa>2$, $(nh^{k})^{1-\kappa/2}\to 0$, so the whole sum tends to zero as $n\to\infty$. The Lyapunov condition is satisfied and then
\[
I_1(x)\xrightarrow{d}N(0,V),
\tag{7.9}
\]
where $V=\lim_{n\to\infty}Var\big\{(nh^{k})^{-1/2}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i\big\}$. To compute the variance $V$, we can see that
\[
\begin{aligned}
Var\Big\{\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i\Big\}
&=\frac{1}{h^{k}}E\Big\{E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X\big]K^{2}\Big(\frac{X-x}{h}\Big)\Big\}+o(1)\\
&=\frac{1}{h^{k}}\int K^{2}\Big(\frac{t-x}{h}\Big)E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X=t\big]f(t)\,dt+o(1)\\
&=\int K^{2}(u)\,E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X=x+hu\big]f(x+hu)\,du+o(1)\\
&=\sigma^{2}(x)f(x)\int K^{2}(u)\,du+o(1),
\end{aligned}
\]
where $\sigma^{2}(x)=E\big[\{\Psi(X,Y,D)-\tau(x)\}^{2}\,\big|\,X=x\big]$. Consider $I_2(x)$ next. We have
\[
I_2(x)=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i,
\]
whose mean satisfies
\[
E\Big\{\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i\Big\}
=\sqrt{nh^{k}}\int[\tau(x+hu)-\tau(x)]K(u)f(x+hu)\,du=\sqrt{nh^{k}}\,O(h^{s})=o(1)
\]
by the smoothness and bandwidth conditions. Note that its variance is
\[
\begin{aligned}
Var\Big\{\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tau(X_i)-\tau(x)]K_i\Big\}
&=\frac{1}{h^{k}}E\Big\{[\tau(X)-\tau(x)]^{2}K^{2}\Big(\frac{X-x}{h}\Big)\Big\}
-\frac{1}{h^{k}}\Big[E\Big\{[\tau(X)-\tau(x)]K\Big(\frac{X-x}{h}\Big)\Big\}\Big]^{2}\\
&=\int[\tau(x+hu)-\tau(x)]^{2}K^{2}(u)f(x+hu)\,du
-h^{k}\Big[\int[\tau(x+hu)-\tau(x)]K(u)f(x+hu)\,du\Big]^{2}=o(1),
\end{aligned}
\]
so $I_2(x)=o_p(1)$. Combining (7.8), (7.9) and $I_2(x)=o_p(1)$, we can obtain that
\[
I_1(x)+I_2(x)\xrightarrow{d}N\Big(0,\ \sigma^{2}(x)f(x)\int K^{2}(u)\,du\Big)
\]
and, since $\hat f(x)\to f(x)$ in probability,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}[I_1(x)+I_2(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\]
Now we consider the cases with unknown $A$, $B_1$ and $B_0$. Note that under condition (B1), $\hat A$, $\hat B_1$ and $\hat B_0$ converge in probability to $A$, $B_1$ and $B_0$ respectively at the rate $O_p(1/\sqrt n)$. Following arguments similar to those in Hu et al. (2014), it is easy to see that the asymptotic distribution is retained; we omit the details to save space. We can now conclude that, under the conditions of Theorem 1, regardless of which estimation method (parametric, nonparametric, or semiparametric dimension reduction) is used to estimate the nuisance models,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\]
The proof is done. $\Box$
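As a quick sanity check on this limit, the sketch below (ours, not the paper's) simulates the kernel-localized doubly robust estimator with known nuisance functions, studentizes it by the asymptotic standard deviation $\sqrt{\sigma^{2}(x)\int K^{2}(u)\,du/f(x)}$, and verifies that the replications look standard normal. The data-generating process, the Gaussian kernel and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M, x0, p0 = 2000, 500, 0.0, 0.5
h = n ** (-1 / 5)                                    # CATE bandwidth, k = 1
K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
int_K2, f_x0 = 1 / (2 * np.sqrt(np.pi)), 0.5         # int K^2 du; X ~ U(-1,1)

stats = []
for _ in range(M):
    X = rng.uniform(-1, 1, n)
    D = rng.binomial(1, p0, n)
    Y = X + D * (1 + X) + rng.normal(0, 1, n)        # m1 = 1 + 2x, m0 = x, tau = 1 + x
    m1, m0 = 1 + 2 * X, X
    psi = D * (Y - m1) / p0 + m1 - (1 - D) * (Y - m0) / (1 - p0) - m0
    w = K((X - x0) / h)
    tau_hat = np.sum(w * psi) / np.sum(w)
    # sigma^2(x0) = Var(psi | X = x0) = (1/p0 + 1/(1-p0)) * Var(eps) = 4 here
    avar = (1 / p0 + 1 / (1 - p0)) * int_K2 / f_x0
    stats.append(np.sqrt(n * h) * (tau_hat - (1 + x0)) / np.sqrt(avar))

print(np.mean(stats), np.std(stats))                 # should be close to 0 and 1
```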
7.3 Proofs of Theorems 2 and 3

We now consider the global misspecification cases. Similarly to the proof of Theorem 1, we first derive the asymptotic linear expression of $J(x)$.

Scenario 1: $m_1(x)$ is nonparametrically estimated. In this case, we have
\[
\sup_{x\in\mathcal X}\big|\tilde p(x;\hat\beta)-\tilde p(x;\beta^{*})\big|=O_p\Big(\frac{1}{\sqrt n}\Big),\qquad
\sup_{x\in\mathcal X}|\hat m_1(x)-m_1(x)|=O_p\!\Big(h_1^{s}+\sqrt{\tfrac{\ln n}{nh_1^{p}}}\Big)=o_p(1).
\]
We can further rewrite (7.1) as
\[
\begin{aligned}
J(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\hat\beta)}\{Y_i-\hat m_1(X_i)\}+\hat m_1(X_i)-\tau(x)\Big]K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{m_1(X_i)-Y_i\}}{\tilde p^{2}(X_i;\beta^{*})}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat m_1(X_i)-m_1(X_i)\}K_i+0\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{Y_i-m^{+1}_i\}}{(p^{+}_i)^{3}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}^{2}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i}{(p^{+}_i)^{2}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=:J_1(x)+J_2(x)+J_3(x)+0+J_4(x)+J_5(x),
\end{aligned}
\tag{7.10}
\]
where $p^{+}_i$ lies between $\tilde p(X_i;\beta^{*})$ and $\tilde p(X_i;\hat\beta)$, and $m^{+1}_i$ lies between $m_1(X_i)$ and $\hat m_1(X_i)$. As $J_2(x)$ and $J_4(x)$ can be proved to be $o_p(1)$ in the same way as in scenario 1 of Subsection 7.2, the details are omitted. For $J_5(x)$, obviously
\[
|J_5(x)|\le\sqrt{nh^{k}}\,\sup_{x\in\mathcal X}\big|\tilde p(x;\hat\beta)-\tilde p(x;\beta^{*})\big|\,\sup_{x\in\mathcal X}|\hat m_1(x)-m_1(x)|\times\frac{1}{nh^{k}}\sum_{i=1}^n|K_i|\,\Big|\frac{D_i}{(p^{+}_i)^{2}}\Big|=o_p(1).
\]
Consider $J_3(x)$. Denote
\[
\lambda(X_i)=E\Big[\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\,\Big|\,X_i\Big],\quad
\mu_i=\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}-\lambda(X_i),\quad
\epsilon_i=Y_i-m_1(X_i),\quad
\rho_{ij}=\frac{K\big(\frac{X_i-X_j}{h_1}\big)}{\sum_{t=1}^n D_tK\big(\frac{X_i-X_t}{h_1}\big)}.
\]
We first give a lemma on the asymmetry of the weights $\rho_{ij}$, which is useful for the proof of the theorem.
Lemma 1.
Under condition (A1)(ii), the weights $\rho_{ij}$ of the nonparametric outcome regression estimator satisfy
\[
|\rho_{ij}-\rho_{ji}|\le\frac{C_n}{nh_1^{p}}K\Big(\frac{X_i-X_j}{h_1}\Big),
\]
where $C_n=O_p(h_1)$ and does not depend on $i$ or $j$.

Proof.
Note that $\rho_{ij}=\rho_{ji}=0$ if $\|X_i-X_j\|_{\infty}>h_1$, by the compact support of the kernel. We now consider the event $\|X_i-X_j\|_{\infty}\le h_1$. For all $i$,
\[
\frac{1}{nh_1^{p}}\sum_{t=1}^n D_tK\Big(\frac{X_i-X_t}{h_1}\Big)
=\frac{\sum_{t=1}^n D_tK\big(\frac{X_i-X_t}{h_1}\big)}{\sum_{t=1}^n K\big(\frac{X_i-X_t}{h_1}\big)}\cdot\frac{1}{nh_1^{p}}\sum_{t=1}^n K\Big(\frac{X_i-X_t}{h_1}\Big)
=\hat p(X_i)\,\hat\theta(X_i),
\]
where $\hat\theta$ is the kernel estimator of the density $\theta$ of $X$. Then
\[
|\rho_{ij}-\rho_{ji}|=\frac{1}{nh_1^{p}}\Big|K\Big(\frac{X_i-X_j}{h_1}\Big)\Big|\,\Big|\frac{1}{\hat p(X_i)\hat\theta(X_i)}-\frac{1}{\hat p(X_j)\hat\theta(X_j)}\Big|
\le\frac{1}{nh_1^{p}}\Big|K\Big(\frac{X_i-X_j}{h_1}\Big)\Big|\Big\{\Big|\frac{1}{\hat p(X_i)\hat\theta(X_i)}-\frac{1}{p(X_i)\theta(X_i)}\Big|
+\Big|\frac{1}{p(X_i)\theta(X_i)}-\frac{1}{p(X_j)\theta(X_j)}\Big|
+\Big|\frac{1}{p(X_j)\theta(X_j)}-\frac{1}{\hat p(X_j)\hat\theta(X_j)}\Big|\Big\}.
\]
Again by the standard arguments for dealing with nonparametric estimation used before, and since $s\ge p$,
\[
\sup_{x\in\mathcal X}|\hat p(x)-p(x)|=O_p\!\Big(h_1^{s}+\sqrt{\tfrac{\ln n}{nh_1^{p}}}\Big)=o_p(h_1),\qquad
\sup_{x\in\mathcal X}|\hat\theta(x)-\theta(x)|=O_p\!\Big(h_1^{s}+\sqrt{\tfrac{\ln n}{nh_1^{p}}}\Big)=o_p(h_1).
\]
Recall conditions (C2)(ii) and (C3)(i), under which $p(x)$ and $\theta(x)$ are bounded away from 0. The two rates above imply that $\hat p(x)$ and $\hat\theta(x)$ converge uniformly to $p(x)$ and $\theta(x)$ respectively, so that $\hat p(x)$ and $\hat\theta(x)$ are also bounded away from 0 in probability for $n$ large enough. Then
\[
\sup_{x\in\mathcal X}\Big|\frac{1}{\hat p(x)\hat\theta(x)}-\frac{1}{p(x)\theta(x)}\Big|
=\sup_{x\in\mathcal X}\frac{\big|\hat p(x)\hat\theta(x)-p(x)\theta(x)\big|}{p(x)\theta(x)\hat p(x)\hat\theta(x)}
\le\sup_{x\in\mathcal X}\frac{\hat p(x)|\hat\theta(x)-\theta(x)|+|\hat p(x)-p(x)|\,\theta(x)}{p(x)\theta(x)\hat p(x)\hat\theta(x)}=o_p(h_1),
\]
so the first and third terms in the curly brace are $o_p(h_1)$ uniformly over all $X_i$, $X_j$. By the Lipschitz continuity of $1/(p\theta)$ and $\|X_i-X_j\|_{\infty}\le h_1$, the middle term is $O(h_1)$ uniformly over all $X_i$, $X_j$. Altogether, the summation in the curly brace is $O_p(h_1)$. Therefore, there exists $C_n=O_p(h_1)$, free of $i$ and $j$, such that
\[
|\rho_{ij}-\rho_{ji}|\le\frac{C_n}{nh_1^{p}}K\Big(\frac{X_i-X_j}{h_1}\Big).
\]
The proof is completed. $\Box$
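The inequality can be eyeballed numerically. The sketch below is ours: the Epanechnikov kernel, $p=1$ and the logistic propensity are all assumed for illustration. It computes the smallest constant compatible with the bound and checks that it shrinks roughly like $h_1$.

```python
import numpy as np

def implied_Cn(n, h, rng):
    # rho_ij = K((X_i-X_j)/h) / sum_t D_t K((X_i-X_t)/h), compactly supported K, p = 1.
    X = rng.uniform(-1, 1, n)
    D = rng.binomial(1, 1 / (1 + np.exp(-X)))
    U = (X[:, None] - X[None, :]) / h
    Kmat = 0.75 * np.maximum(1 - U**2, 0.0)            # Epanechnikov kernel matrix
    denom = np.clip(Kmat @ D, 1e-12, None)             # guard against empty windows
    rho = Kmat / denom[:, None]
    # Lemma 1 bound: |rho_ij - rho_ji| <= C_n K_ij / (n h^p); solve for C_n pairwise.
    mask = Kmat > 0
    ratio = np.zeros_like(Kmat)
    ratio[mask] = n * h * np.abs(rho - rho.T)[mask] / Kmat[mask]
    return ratio.max()

rng = np.random.default_rng(2)
for n in (500, 2000, 8000):
    h = n ** (-1 / 5)
    print(n, round(h, 3), implied_Cn(n, h, rng))       # should shrink roughly like h
```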
Now we come back to handle the term $J_3(x)$, which can be decomposed, using $\hat m_1(X_i)=\sum_{j=1}^n\rho_{ij}D_jY_j=\sum_{j=1}^n\rho_{ij}D_j\{\epsilon_j+m_1(X_j)\}$, as
\[
\begin{aligned}
J_3(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\lambda(X_i)\{\hat m_1(X_i)-m_1(X_i)\}K_i
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\rho_{ij}D_j\{\epsilon_j+m_1(X_j)\}-m_1(X_i)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\frac{D_i\epsilon_i}{p(X_i)}
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\epsilon_i}{p(X_i)}\Big[p(X_i)\sum_{j=1}^n\rho_{ji}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\rho_{ij}D_jm_1(X_j)-m_1(X_i)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat m_1(X_i)-m_1(X_i)\}K_i\\
&=:J_{31}(x)+J_{32}(x)+J_{33}(x)+J_{34}(x).
\end{aligned}
\tag{7.11}
\]
We first prove that $J_{3k}(x)=o_p(1)$ for $k=2,3,4$.
Consider $J_{32}(x)$ by using the following decomposition:
\[
\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\rho_{ji}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]
=\frac{1}{\sqrt{h^{k}}}p(X_i)\sum_{j=1}^n(\rho_{ij}-\rho_{ji})K_j\lambda(X_j)
+\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\rho_{ij}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]
=:L_1+L_2.
\]
$L_1$ can be bounded as
\[
|L_1|\le\frac{1}{\sqrt{h^{k}}}\sup_i\Big|p(X_i)\sum_{j=1}^n(\rho_{ij}-\rho_{ji})K_j\lambda(X_j)\Big|
\le\frac{1}{\sqrt{h^{k}}}\sup_i\sum_{j=1}^n|\rho_{ij}-\rho_{ji}|\,|K_j\lambda(X_j)|
\le\frac{MC_n}{\sqrt{h^{k}}}\,\sup_i\frac{1}{nh_1^{p}}\sum_{j=1}^n\Big|K\Big(\frac{X_i-X_j}{h_1}\Big)\Big|\,|K_j\lambda(X_j)|,
\]
where $M$ bounds $p(\cdot)$ and Lemma 1 has been applied. Since $C_n/h_1=O_p(1)$, $h_1h^{-k/2}=o(1)$ by the bandwidth conditions, and the remaining supremum is $O_p(1)$, we obtain $|L_1|=O_p(1)\cdot o(1)\cdot O_p(1)=o_p(1)$. Then $L_2$ can be handled by inserting the normalization $\sum_{s}K\big(\frac{X_i-X_s}{h_1}\big)/\sum_{t}D_tK\big(\frac{X_i-X_t}{h_1}\big)$ and adding and subtracting the corresponding population quantities: every resulting piece is a Nadaraya–Watson-type bias term of uniform order $O_p(h_1^{s})$, so that
\[
L_2=\frac{1}{\sqrt{h^{k}}}\,O_p(h_1^{s})=O_p\big(h_1^{s}h^{-k/2}\big).
\]
Then under condition (A2)(ix), $L_2=o_p(1)$ and, together with the bound for $L_1$, we have
\[
\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\rho_{ji}K_j\lambda(X_j)-K_i\lambda(X_i)\Big]=o_p(1).
\]
Since $\{\epsilon_i\}_{i=1}^n$ are mutually independent with conditional mean zero given $\{X_i\}_{i=1}^n$, it follows that $J_{32}(x)=o_p(1)$. Second, bound $J_{33}(x)$ by noting that
\[
|J_{33}(x)|\le\sqrt{nh^{k}}\,\sup_{x\in\mathcal X}\Big|\sum_{j=1}^n\rho_{ij}D_jm_1(X_j)-m_1(X_i)\Big|\times\frac{1}{nh^{k}}\sum_{i=1}^n|K_i\lambda(X_i)|
=\sqrt{nh^{k}}\,O_p(h_1^{s})\,O_p(1)=o_p(1),
\]
the supremum being the uniform smoothing bias of the outcome estimator. A similar argument applied to $J_{34}(x)$ leads to $J_{34}(x)=o_p(1)$.
Altogether, combining (7.11), we have $J_3(x)=J_{31}(x)+o_p(1)$. Recalling that $J_2(x)$, $J_4(x)$ and $J_5(x)$ have all been proved to be $o_p(1)$, together with (7.10) we can conclude that
\[
\begin{aligned}
J(x)&=J_1(x)+J_{31}(x)+o_p(1)\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\frac{D_i\epsilon_i}{p(X_i)}+o_p(1)\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{p(X_i)}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1),
\end{aligned}
\tag{7.12}
\]
where the last equality holds because $\lambda(x)=\{\tilde p(x;\beta^{*})-p(x)\}/\tilde p(x;\beta^{*})$, so that $D_i\epsilon_i/\tilde p(X_i;\beta^{*})+\lambda(X_i)D_i\epsilon_i/p(X_i)=D_i\epsilon_i/p(X_i)$: the correction term $J_{31}(x)$ exactly restores the influence function with the true propensity score. The proof is finished. $\Box$

Scenario 2: $m_1(x)$ is semiparametrically estimated. First, we have
\[
\sup_{x\in\mathcal X}\big|\tilde p(x;\hat\beta)-\tilde p(x;\beta^{*})\big|=O_p\Big(\frac{1}{\sqrt n}\Big),\qquad
\sup_{x\in\mathcal X}\big|\hat r_1(B_1^{\top}x)-r_1(B_1^{\top}x)\big|=O_p\!\Big(h_6^{s}+\sqrt{\tfrac{\ln n}{nh_6^{p^{(1)}}}}\Big)=o_p(1).
\]
We can further decompose the term in (7.1) as
\[
\begin{aligned}
J(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\hat\beta)}\{Y_i-\hat r_1(B_1^{\top}X_i)\}+\hat r_1(B_1^{\top}X_i)-\tau(x)\Big]K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{r_1(B_1^{\top}X_i)-Y_i\}}{\tilde p^{2}(X_i;\beta^{*})}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i+0\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\{Y_i-r^{+1}_i\}}{(p^{+}_i)^{3}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}^{2}K_i\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i}{(p^{+}_i)^{2}}\{\tilde p(X_i;\hat\beta)-\tilde p(X_i;\beta^{*})\}\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=:J_1(x)+J_2(x)+J_3(x)+0+J_4(x)+J_5(x),
\end{aligned}
\tag{7.13}
\]
where $p^{+}_i$ lies between $\tilde p(X_i;\beta^{*})$ and $\tilde p(X_i;\hat\beta)$, and $r^{+1}_i$ lies between $r_1(B_1^{\top}X_i)$ and $\hat r_1(B_1^{\top}X_i)$. Due to the similarity with the arguments above, we omit the details proving that $J_2(x)$, $J_4(x)$ and $J_5(x)$ are $o_p(1)$.
Now consider $J_3(x)$. Denote
\[
\lambda(X_i)=E\Big[\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\,\Big|\,X_i\Big],\quad
\mu_i=\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}-\lambda(X_i),\quad
\epsilon_i=Y_i-r_1(B_1^{\top}X_i),\quad
\nu_{ij}=\frac{K\big(\frac{B_1^{\top}X_i-B_1^{\top}X_j}{h_6}\big)}{\sum_{t=1}^n D_tK\big(\frac{B_1^{\top}X_i-B_1^{\top}X_t}{h_6}\big)}.
\]
Using $\hat r_1(B_1^{\top}X_i)=\sum_{j=1}^n\nu_{ij}D_jY_j$, $J_3(x)$ can be rewritten as
\[
\begin{aligned}
J_3(x)&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{\tilde p(X_i;\beta^{*})-D_i}{\tilde p(X_i;\beta^{*})}\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\lambda(X_i)\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\nu_{ij}D_j\{\epsilon_j+r_1(B_1^{\top}X_j)\}-r_1(B_1^{\top}X_i)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\frac{D_i\epsilon_i}{p(X_i)}\Big[p(X_i)\sum_{j=1}^n\nu_{ji}K_j\lambda(X_j)\Big]
+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^nK_i\lambda(X_i)\Big[\sum_{j=1}^n\nu_{ij}D_jr_1(B_1^{\top}X_j)-r_1(B_1^{\top}X_i)\Big]\\
&\quad+\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\mu_i\{\hat r_1(B_1^{\top}X_i)-r_1(B_1^{\top}X_i)\}K_i\\
&=:J_{31}(x)+J_{32}(x)+J_{33}(x).
\end{aligned}
\tag{7.14}
\]
By the same arguments as for $J_{33}(x)$ and $J_{34}(x)$ in scenario 1, it is obvious that $J_{32}(x)=o_p(1)$ and $J_{33}(x)=o_p(1)$. To derive that $J_{31}(x)=o_p(1)$, we start by writing
\[
\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\nu_{ji}K_j\lambda(X_j)\Big]
=\frac{1}{\sqrt{h^{k}}}p(X_i)\sum_{j=1}^n(\nu_{ij}-\nu_{ji})K_j\lambda(X_j)
+\frac{1}{\sqrt{h^{k}}}\Big[p(X_i)\sum_{j=1}^n\nu_{ij}K_j\lambda(X_j)\Big]
=:L_1+L_2.
\]
Similarly to the proof of $L_1=o_p(1)$ above (using the analogue of Lemma 1 for $\nu_{ij}$), we can show $L_1=o_p(1)$, and we omit the details. To handle $L_2$, denote by $q_B(\cdot)$ the density of $B_1^{\top}X$, let $\theta_B(B_1^{\top}X)=E[D\mid B_1^{\top}X]$, and let $\hat\theta_B$ and $\hat q_B$ be their kernel estimators based on the index $B_1^{\top}X$ with bandwidth $h_6$, so that $\sum_{t=1}^nD_tK\big(\frac{B_1^{\top}X_i-B_1^{\top}X_t}{h_6}\big)=nh_6^{p^{(1)}}\hat\theta_B(B_1^{\top}X_i)\hat q_B(B_1^{\top}X_i)$.
To deal with $L_2$, consider the conditional expectation, which can be derived as
\[
\begin{aligned}
E\Big\{p(X_i)\sum_{j=1}^n\nu_{ij}K_j\lambda(X_j)\,\Big|\,X_i\Big\}
&=E\Bigg\{\frac{p(X_i)}{nh_6^{p^{(1)}}\hat\theta_B(B_1^{\top}X_i)\hat q_B(B_1^{\top}X_i)}\sum_{j=1}^nK\Big(\frac{B_1^{\top}(X_j-X_i)}{h_6}\Big)K_j\lambda(X_j)\,\Bigg|\,X_i\Bigg\}\\
&=[1+o_p(1)]\,\frac{p(X_i)}{h_6^{p^{(1)}}\theta_B(B_1^{\top}X_i)q_B(B_1^{\top}X_i)}\int K\Big(\frac{B_1^{\top}(u-X_i)}{h_6}\Big)K\Big(\frac{u-x}{h}\Big)\lambda(u)\theta(u)\,du\\
&=h_6^{\,p-p^{(1)}}[1+o_p(1)]\,\frac{p(X_i)\,\theta(X_i)}{\theta_B(B_1^{\top}X_i)q_B(B_1^{\top}X_i)}K\Big(\frac{X_i-x}{h}\Big)\lambda(X_i)+O_p\big(h_6^{\,p-p^{(1)}+s}h^{-s}\big)\\
&=O_p\big(h_6^{\,p-p^{(1)}}\big)+O_p\big(h_6^{\,p-p^{(1)}+s}h^{-s}\big).
\end{aligned}
\]
Then, when $s<(2s+k)(p-p^{(1)})$,
\[
L_2=O_p\big(h_6^{\,p-p^{(1)}}h^{-k/2}+h_6^{\,p-p^{(1)}+s}h^{s-k/2}\big)=o_p(1),
\]
and hence $J_{31}(x)=o_p(1)$. Together with (7.14), $J_3(x)=o_p(1)$. Recall that we have proved $J_2(x)$, $J_4(x)$ and $J_5(x)$ to be $o_p(1)$. With (7.13), we can eventually derive the asymptotically linear representation
\[
J(x)=J_1(x)+o_p(1)=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1).
\tag{7.15}
\]
The proof is completed. $\Box$

Scenario 3: $m_1(x)$ is parametrically estimated (correctly specified). With an argument similar to that for scenario 1 in Theorem 1, we can easily derive that
\[
J(x)=\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n\Big[\frac{D_i}{\tilde p(X_i;\beta^{*})}\{Y_i-m_1(X_i)\}+m_1(X_i)-\tau(x)\Big]K_i+o_p(1).
\tag{7.16}
\]
It is now easy to deduce the asymptotically linear expression of the proposed estimator. When the outcome regression functions are nonparametrically estimated, recalling the relation between $\sqrt{nh^{k}}[\hat\tau(x)-\tau(x)]$ and $J(x)$ defined in (7.1), equation (7.12) gives
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\Psi(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1),
\]
with $\Psi$ the doubly robust score built from the true propensity score. According to (7.15) and (7.16), when the outcome regression functions are semiparametrically or parametrically estimated, we have the similar representation
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\hat f(x)}\frac{1}{\sqrt{nh^{k}}}\sum_{i=1}^n[\tilde\Psi_2(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1),
\]
where $\tilde\Psi_2$ replaces $p(\cdot)$ by $\tilde p(\cdot;\beta^{*})$ in the score. Arguing as in the proof of Theorem 1, under the conditions of Theorem 2, when the outcome regression functions are estimated nonparametrically,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big),
\]
and when they are estimated semiparametrically or parametrically,
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma_2^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big),
\]
with $\sigma_2^{2}(x)=E[\{\tilde\Psi_2(X,Y,D)-\tau(x)\}^{2}\mid X=x]$. The proof of Theorem 2 is concluded. $\Box$
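The semiparametric scenarios rest on smoothing over a low-dimensional index $B_1^{\top}X$ rather than the full covariate vector. A minimal sketch of that idea follows; it is ours, with the direction $B_1$, the model and the bandwidth all assumed for illustration (condition (B1) is what justifies treating $B_1$ as known).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p_dim = 2000, 5
B1 = np.array([1.0, 0.5, 0.0, 0.0, 0.0])           # assumed index direction
X = rng.normal(size=(n, p_dim))
idx = X @ B1                                        # B_1^T X, a 1-dim index
D = rng.binomial(1, 1 / (1 + np.exp(-idx)))
Y = np.sin(idx) + D + rng.normal(0, 0.5, n)         # m_1(x) = r_1(B_1^T x) = sin + 1

def nw_index(query_idx, data_idx, vals, h):
    # Nadaraya-Watson on the scalar index: smoothing happens in p^(1) = 1
    # dimension instead of p = 5, which is the dimension-reduction payoff.
    w = np.exp(-0.5 * ((query_idx[:, None] - data_idx[None, :]) / h) ** 2)
    return (w @ vals) / w.sum(axis=1)

treated = D == 1
h6 = n ** (-1 / 5)                                  # bandwidth for \hat r_1
r1_hat = nw_index(idx, idx[treated], Y[treated], h6)
print(np.mean(np.abs(r1_hat - (np.sin(idx) + 1))))  # average error; shrinks with n
```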
As for Theorem 3, the proof is very similar to that of Theorem 2. Here we only give a crucial lemma used in it and omit the remaining details.
Lemma 2.
Under condition (A2)(viii), the weights $\omega_{ij}$ of the nonparametric propensity score estimator satisfy
\[
|\omega_{ij}-\omega_{ji}|\le\frac{E_n}{nh_1^{p}}K\Big(\frac{X_i-X_j}{h_1}\Big),
\]
where $E_n=O_p(h_1)$ is free of $i$ and $j$.

7.4 Proof of Theorem 4

This is the case with local misspecification. To check the asymptotic efficiency through the variance comparison, we now compute the difference between $\sigma_2^{2}(x)$ and $\sigma^{2}(x)$:
\[
\begin{aligned}
\sigma_2^{2}(x)-\sigma^{2}(x)
&=E\Big\{\frac{p(X)-\tilde p(X;\beta^{*})}{[\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=1,X)
+\frac{\tilde p(X;\beta^{*})-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=0,X)\,\Big|\,X=x\Big\}\\
&=E\Big\{\frac{p(X)-\tilde p(X;\beta)+\tilde p(X;\beta)-\tilde p(X;\beta^{*})}{[\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=1,X)\\
&\qquad+\frac{\tilde p(X;\beta^{*})-\tilde p(X;\beta)+\tilde p(X;\beta)-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=0,X)\,\Big|\,X=x\Big\},
\end{aligned}
\tag{7.17}
\]
and the difference between $\sigma_3^{2}(x)$ and $\sigma^{2}(x)$:
\[
\begin{aligned}
\sigma_3^{2}(x)-\sigma^{2}(x)
&=E\Big\{\Big[\Big(1-\frac{D}{p(X)}\Big)\{\tilde m_1(X;\gamma_1^{*})-m_1(X)\}
-\Big(1-\frac{1-D}{1-p(X)}\Big)\{\tilde m_0(X;\gamma_0^{*})-m_0(X)\}\Big]^{2}\,\Big|\,X=x\Big\}\\
&=E\Big\{\frac{1-p(X)}{p(X)}\,[\tilde m_1(X;\gamma_1^{*})-\tilde m_1(X;\gamma_1)+\tilde m_1(X;\gamma_1)-m_1(X)]^{2}\\
&\qquad+\frac{p(X)}{1-p(X)}\,[\tilde m_0(X;\gamma_0^{*})-\tilde m_0(X;\gamma_0)+\tilde m_0(X;\gamma_0)-m_0(X)]^{2}\\
&\qquad+2\,[\tilde m_1(X;\gamma_1^{*})-\tilde m_1(X;\gamma_1)+\tilde m_1(X;\gamma_1)-m_1(X)]\\
&\qquad\quad\times[\tilde m_0(X;\gamma_0^{*})-\tilde m_0(X;\gamma_0)+\tilde m_0(X;\gamma_0)-m_0(X)]\,\Big|\,X=x\Big\}.
\end{aligned}
\tag{7.18}
\]
Recall that, by the definitions of local misspecification, for all $x\in\mathcal X$ there exist $\beta$, $\gamma_1$, $\gamma_0$ such that
\[
p(x)=\tilde p(x;\beta)[1+c_na(x)],\qquad
m_1(x)=\tilde m_1(x;\gamma_1)+d_{1n}b_1(x),\qquad
m_0(x)=\tilde m_0(x;\gamma_0)+d_{0n}b_0(x).
\]
That is, $p(x)-\tilde p(x;\beta)=O(c_n)$, $m_1(x)-\tilde m_1(x;\gamma_1)=O(d_{1n})$ and $m_0(x)-\tilde m_0(x;\gamma_0)=O(d_{0n})$. So now we only need to consider $\tilde p(x;\beta)-\tilde p(x;\beta^{*})$, $\tilde m_1(x;\gamma_1)-\tilde m_1(x;\gamma_1^{*})$ and $\tilde m_0(x;\gamma_0)-\tilde m_0(x;\gamma_0^{*})$. Note that $\beta^{*}$, $\gamma_1^{*}$, $\gamma_0^{*}$ are the limits of the (quasi-)maximum likelihood estimators $\hat\beta$, $\hat\gamma_1$, $\hat\gamma_0$ respectively. Discuss $\beta^{*}$ first. Given the propensity score function, $D$ is Bernoulli distributed, so when the propensity score model is misspecified the quasi-likelihood and quasi-log-likelihood functions of the unknown parameter $\beta$ are
\[
\tilde L(\beta)=\prod_{i=1}^n\tilde p(X_i;\beta)^{D_i}[1-\tilde p(X_i;\beta)]^{1-D_i}f(X_i),\qquad
\tilde l(\beta)=\sum_{i=1}^n\big\{D_i\ln\tilde p(X_i;\beta)+(1-D_i)\ln[1-\tilde p(X_i;\beta)]+\ln f(X_i)\big\}.
\]
Then $\hat\beta$ and $\beta^{*}$ satisfy
\[
\hat\beta=\operatorname*{argmax}_{\beta}\frac{1}{n}\tilde l(\beta),\qquad
\beta^{*}=\operatorname*{argmax}_{\beta}E[g(W;\beta)],
\]
where $g(W;\beta)=D\ln\tilde p(X;\beta)+(1-D)\ln[1-\tilde p(X;\beta)]+\ln f(X)$.
Since $\beta^{*}$ maximizes $E[g(W;\beta)]$, $E[\partial g(W,\beta)/\partial\beta\,|_{\beta=\beta^{*}}]=0$. By the mean value theorem,
\[
E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]
=E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]-E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta^{*}}\Big]
=E\Big[\frac{\partial^{2}g(W,\beta)}{\partial\beta\,\partial\beta^{\top}}\Big|_{\beta=\tilde\beta}\Big](\beta-\beta^{*}),
\]
where $\tilde\beta$ takes a value between $\beta$ and $\beta^{*}$. Note that
\[
\begin{aligned}
E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]
&=E\Big[\frac{D\,[1+c_na(X)]}{p(X)}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}
-\frac{1-D}{1-\tilde p(X;\beta)}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]\\
&=E\Big[\Big\{1+c_na(X)-\frac{1-p(X)}{1-p(X)/[1+c_na(X)]}\Big\}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]\\
&=E\Big[\frac{c_na(X)+c_n^{2}a^{2}(X)}{1+c_na(X)-p(X)}\frac{\partial\tilde p(X;\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]=O(c_n).
\end{aligned}
\]
Assuming that $E[\partial^{2}g(W,\beta)/\partial\beta\,\partial\beta^{\top}]$ is nonsingular for any $\beta$, we have
\[
\beta^{*}-\beta=-\Big\{E\Big[\frac{\partial^{2}g(W,\beta)}{\partial\beta\,\partial\beta^{\top}}\Big|_{\beta=\tilde\beta}\Big]\Big\}^{-1}
E\Big[\frac{\partial g(W,\beta)}{\partial\beta}\Big|_{\beta=\beta}\Big]=O(c_n).
\]
The application of a Taylor expansion yields $\tilde p(x;\beta)-\tilde p(x;\beta^{*})=O(c_n)$. A similar argument derives $\tilde m_1(x;\gamma_1)-\tilde m_1(x;\gamma_1^{*})=O(d_{1n})$ and $\tilde m_0(x;\gamma_0)-\tilde m_0(x;\gamma_0^{*})=O(d_{0n})$. Together with these results, we continue to calculate the quantities in (7.17) and (7.18) to derive
\[
\sigma_2^{2}(x)-\sigma^{2}(x)=O(c_n),\qquad
\sigma_3^{2}(x)-\sigma^{2}(x)=O(d_{1n}^{2})+O(d_{0n}^{2})+O(d_{1n}d_{0n}).
\]
These differences show that when only the propensity score function, or only the outcome regression functions, are locally misspecified, the asymptotic distribution remains the same as that without misspecification. $\Box$
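The drift $\beta^{*}-\beta=O(c_n)$ can be seen empirically. The hypothetical check below is ours: the departure direction $a(x)=x/2$, the logistic working model and the constants are all assumptions chosen so that the perturbed propensity stays inside $(0,1)$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

def beta_drift(c_n, n=200_000, beta0=1.0):
    # True propensity: locally misspecified logistic, p(x) = ptilde(x; beta0)(1 + c_n a(x)).
    X = rng.uniform(-1, 1, n)
    a = 0.5 * X                                   # assumed bounded departure a(x)
    p = 1 / (1 + np.exp(-beta0 * X)) * (1 + c_n * a)
    D = rng.binomial(1, np.clip(p, 0.01, 0.99))
    def nll(b):
        # quasi-log-likelihood of the (mis)specified logistic model, negated
        q = 1 / (1 + np.exp(-b * X))
        return -np.mean(D * np.log(q) + (1 - D) * np.log(1 - q))
    return minimize(nll, x0=np.array([0.0])).x[0] - beta0

for c_n in (0.4, 0.2, 0.1, 0.05):
    print(c_n, beta_drift(c_n))   # drift beta* - beta0 shrinks proportionally to c_n
```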
7.5 Proofs of Theorems 5 and 6

Consider the cases with all models misspecified. The proof of Theorem 5 is very similar to that of scenario 1 in Theorem 6, except that the asymptotic linear expression becomes
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]=\frac{1}{\sqrt{nh^{k}}\,f(x)}\sum_{i=1}^n[\tilde\Psi(X_i,Y_i,D_i)-\tau(x)]K_i+o_p(1),
\]
where $\tilde\Psi$ denotes the doubly robust score with all nuisance models replaced by their misspecified limits $\tilde p(\cdot;\beta^{*})$, $\tilde m_1(\cdot;\gamma_1^{*})$ and $\tilde m_0(\cdot;\gamma_0^{*})$. As the unbiasedness no longer holds, we then compute the bias term. A decomposition is as follows:
\[
\begin{aligned}
E\big\{\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\big\}
&=\sqrt{nh^{k}}\,E\Big\{\Big(\frac{D}{\tilde p(X;\beta^{*})}-\frac{D}{p(X)}\Big)[Y-m_1(X)]
+\Big(1-\frac{D}{\tilde p(X;\beta^{*})}\Big)[\tilde m_1(X;\gamma_1^{*})-m_1(X)]\\
&\qquad-\Big(\frac{1-D}{1-\tilde p(X;\beta^{*})}-\frac{1-D}{1-p(X)}\Big)[Y-m_0(X)]
-\Big(1-\frac{1-D}{1-\tilde p(X;\beta^{*})}\Big)[\tilde m_0(X;\gamma_0^{*})-m_0(X)]\,\Big|\,X=x\Big\}\\
&=\sqrt{nh^{k}}\,E\Big\{\frac{[m_1(X)-\tilde m_1(X;\gamma_1^{*})][p(X)-\tilde p(X;\beta^{*})]}{\tilde p(X;\beta^{*})}
-\frac{[m_0(X)-\tilde m_0(X;\gamma_0^{*})][\tilde p(X;\beta^{*})-p(X)]}{1-\tilde p(X;\beta^{*})}\,\Big|\,X=x\Big\}\\
&=:\sqrt{nh^{k}}\,\mathrm{bias}(x).
\end{aligned}
\]
Let
\[
\tilde\tau(x)=E\Big\{\frac{D}{\tilde p(X;\beta^{*})}[Y-\tilde m_1(X;\gamma_1^{*})]
-\frac{1-D}{1-\tilde p(X;\beta^{*})}[Y-\tilde m_0(X;\gamma_0^{*})]
+\tilde m_1(X;\gamma_1^{*})-\tilde m_0(X;\gamma_0^{*})\,\Big|\,X=x\Big\},
\]
so that $\tilde\tau(x)=\tau(x)+\mathrm{bias}(x)$. The variance term of $\sqrt{nh^{k}}[\hat\tau(x)-\tau(x)-\mathrm{bias}(x)]$ can be derived as
\[
\begin{aligned}
Var\Big\{\frac{1}{\sqrt{nh^{k}}\,f(x)}\sum_{i=1}^n[\tilde\Psi(X_i,Y_i,D_i)-\tau(x)-\mathrm{bias}(x)]K_i\Big\}
&=\frac{1}{h^{k}f^{2}(x)}E\Big\{[\tilde\Psi(X,Y,D)-\tilde\tau(x)]^{2}K^{2}\Big(\frac{X-x}{h}\Big)\Big\}+o(1)\\
&=\frac{1}{f^{2}(x)}\int K^{2}(u)\,E\big\{[\tilde\Psi(X,Y,D)-\tilde\tau(x)]^{2}\,\big|\,X=x+hu\big\}f(x+hu)\,du+o(1)\\
&=\frac{\sigma_4^{2}(x)\int K^{2}(u)\,du}{f(x)}+o(1),
\end{aligned}
\]
where $\sigma_4^{2}(x)=E\big\{[\tilde\Psi(X,Y,D)-\tilde\tau(x)]^{2}\,\big|\,X=x\big\}$. With the same argument used to derive the asymptotic distribution in Theorem 1, we can obtain
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)-\mathrm{bias}(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma_4^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\tag{7.19}
\]
The proof of Theorem 5 is completed. $\Box$

Note that Theorem 6 is a variant of Theorem 5 in which all nuisance models are locally misspecified; to derive the asymptotic distribution, we only need to reexamine the bias and variance terms in (7.19). From the definitions of the locally misspecified models above, the bias term can be bounded as
\[
\mathrm{bias}(x)=O(c_nd_{1n})+O(c_nd_{0n}).
\]
This result implies that if $c_nd_{1n}$ and $c_nd_{0n}$ converge to zero faster than $1/\sqrt{nh^{k}}$, the bias term vanishes asymptotically after the $\sqrt{nh^{k}}$ scaling. By the central limit theorem we can also derive the asymptotic normality with the variance term. Moreover, since $c_n$, $d_{1n}$ and $d_{0n}$ all converge to 0,
\[
\frac{\sigma_4^{2}(x)\int K^{2}(u)\,du}{f(x)}=\frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}+o(1).
\]
By Slutsky's theorem, we can conclude that when all nuisance models are locally misspecified with
\[
c_nd_{1n}=o\Big(\frac{1}{\sqrt{nh^{k}}}\Big)\qquad\text{and}\qquad c_nd_{0n}=o\Big(\frac{1}{\sqrt{nh^{k}}}\Big),
\]
we have
\[
\sqrt{nh^{k}}\,[\hat\tau(x)-\tau(x)]\xrightarrow{d}N\Big(0,\ \frac{\sigma^{2}(x)\int K^{2}(u)\,du}{f(x)}\Big).
\]
The proof is then completed. $\Box$
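To make the rate requirement concrete, consider a worked example with illustrative choices (ours, not the paper's). Take $k=1$ and the common bandwidth order $h\asymp n^{-1/5}$, so that
\[
\sqrt{nh^{k}}\asymp\sqrt{n\cdot n^{-1/5}}=n^{2/5},
\]
and the condition reads $c_nd_{1n}=o(n^{-2/5})$. If, say, $c_n=d_{1n}=n^{-1/4}$, then $c_nd_{1n}=n^{-1/2}=o(n^{-2/5})$ and the bias term is asymptotically negligible; if instead $c_n=d_{1n}=n^{-1/5}$, then $c_nd_{1n}=n^{-2/5}$, which is not $o(n^{-2/5})$, and the bias survives in the limit.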
7.6 A simple justification for Remark 5

As we showed in the proof of Theorem 4,
\[
\sigma_2^{2}(x)-\sigma^{2}(x)
=E\Big\{\frac{p(X)-\tilde p(X;\beta^{*})}{[\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=1,X)\,\Big|\,X=x\Big\}
+E\Big\{\frac{\tilde p(X;\beta^{*})-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}\,Var(Y\mid D=0,X)\,\Big|\,X=x\Big\}.
\]
This difference can be shown to be neither positive nor negative for all $x$; the example in Remark 2 confirms this. For instance, when the two conditional variances both equal $v$, the difference reduces to $v\,[p(x)-\tilde p(x;\beta^{*})]\{[\tilde p(x;\beta^{*})]^{-2}-[1-\tilde p(x;\beta^{*})]^{-2}\}$, which is negative when, say, $p(x)=0.8$ and $\tilde p(x;\beta^{*})=0.6$, so misspecifying the propensity score can even reduce the asymptotic variance. For $\sigma_3^{2}(x)$ we have
\[
\sigma_3^{2}(x)-\sigma^{2}(x)
=E\Big\{\Big[\Big(1-\frac{D}{p(X)}\Big)\{\tilde m_1(X;\gamma_1^{*})-m_1(X)\}
-\Big(1-\frac{1-D}{1-p(X)}\Big)\{\tilde m_0(X;\gamma_0^{*})-m_0(X)\}\Big]^{2}\,\Big|\,X=x\Big\}\ge 0.
\]
In other words, $\sigma_3^{2}(x)$ is never smaller than the variance of the estimators with all models correctly specified. Further,
\[
\begin{aligned}
\sigma_4^{2}(x)-\sigma^{2}(x)
&=Var\big(\tilde\Psi(X,Y,D)\mid X=x\big)-Var\big(\Psi(X,Y,D)\mid X=x\big)\\
&=E\Big\{\Big(\frac{p(X)}{[\tilde p(X;\beta^{*})]^{2}}-\frac{1}{p(X)}\Big)Var(Y\mid X,D=1)\,\Big|\,X=x\Big\}
+E\Big\{\Big(\frac{1-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}-\frac{1}{1-p(X)}\Big)Var(Y\mid X,D=0)\,\Big|\,X=x\Big\}\\
&\quad+E\Big\{\frac{p(X)}{[\tilde p(X;\beta^{*})]^{2}}[m_1(X)-\tilde m_1(X;\gamma_1^{*})]^{2}
+\frac{1-p(X)}{[1-\tilde p(X;\beta^{*})]^{2}}[m_0(X)-\tilde m_0(X;\gamma_0^{*})]^{2}\,\Big|\,X=x\Big\}\\
&\quad+2E\big\{[\tilde m_1(X;\gamma_1^{*})-\tilde m_0(X;\gamma_0^{*})][m_1(X)-m_0(X)-\tilde m_1(X;\gamma_1^{*})+\tilde m_0(X;\gamma_0^{*})]\,\big|\,X=x\big\}
+\tau^{2}(x)-\tilde\tau^{2}(x).
\end{aligned}
\]
Again, whether $\sigma_4^{2}(x)$ is larger than $\sigma^{2}(x)$ or not cannot be easily judged. $\Box$

7.7 Additional Simulation Results

Table 6: Simulation results under model 2 (part 1). For each estimator and each value of $x$, the left block reports, for $n=500$, the bias, the sampling standard deviation (sam-SD), the MSE and the proportions $P_1$ and $P_2$; the right block reports the same quantities for $n=5000$.

              |            n=500                      |            n=5000
  x           |  bias    sam-SD  MSE    P1     P2     |  bias    sam-SD  MSE    P1     P2
DRCATE(O,O)
 -0.4         |  0.0012  0.2077  0.0431  0.058  0.045 |  0.0006  0.2034  0.0414  0.045  0.047
 -0.2         |  0.0010  0.2139  0.0457  0.040  0.052 |  0.0001  0.1988  0.0395  0.050  0.050
  0           | -0.0005  0.2046  0.0418  0.048  0.059 |  0.0007  0.1846  0.0341  0.050  0.050
  0.2         | -0.0001  0.2226  0.0495  0.045  0.050 |  0.0008  0.2034  0.0415  0.038  0.057
  0.4         |  0.0024  0.3312  0.1097  0.048  0.049 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(cP,cP)
 -0.4         |  0.0011  0.2077  0.0431  0.056  0.048 |  0.0006  0.2035  0.0415  0.046  0.046
 -0.2         |  0.0009  0.2137  0.0456  0.045  0.053 |  0.0001  0.1988  0.0395  0.049  0.050
  0           | -0.0006  0.2044  0.0417  0.047  0.055 |  0.0007  0.1846  0.0341  0.048  0.052
  0.2         | -0.0001  0.2228  0.0496  0.046  0.049 |  0.0007  0.2035  0.0415  0.038  0.057
  0.4         |  0.0024  0.3316  0.1100  0.047  0.047 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(N,N)
 -0.4         |  0.0002  0.2653  0.0703  0.017  0.029 |  0.0004  0.2136  0.0456  0.057  0.052
 -0.2         |  0.0011  0.2300  0.0529  0.042  0.048 |  0.0004  0.1990  0.0396  0.041  0.045
  0           |  0.0007  0.1962  0.0385  0.048  0.051 |  0.0003  0.1917  0.0367  0.041  0.052
  0.2         |  0.0011  0.2299  0.0528  0.043  0.058 |  0.0006  0.2122  0.0451  0.046  0.052
  0.4         |  0.0041  0.3373  0.1141  0.054  0.057 |  0.0003  0.3125  0.0976  0.050  0.052
DRCATE(S,S)
 -0.4         | -0.0018  0.2058  0.0424  0.051  0.046 |  0.0002  0.2501  0.0625  0.028  0.040
 -0.2         | -0.0021  0.2093  0.0439  0.056  0.039 | -0.0008  0.2087  0.0436  0.046  0.047
  0           |  0.0000  0.2040  0.0416  0.055  0.051 |  0.0011  0.1868  0.0351  0.044  0.056
  0.2         |  0.0060  0.2257  0.0518  0.031  0.068 |  0.0014  0.2093  0.0441  0.047  0.059
  0.4         |  0.0089  0.3409  0.1181  0.039  0.064 |  0.0010  0.3298  0.1089  0.043  0.062

Table 7: Simulation results under model 2 (part 2). Columns as in Table 6.
              |            n=500                      |            n=5000
  x           |  bias    sam-SD  MSE    P1     P2     |  bias    sam-SD  MSE    P1     P2
DRCATE(O,O)
 -0.4         |  0.0012  0.2077  0.0431  0.058  0.045 |  0.0006  0.2034  0.0414  0.045  0.047
 -0.2         |  0.0010  0.2139  0.0457  0.040  0.052 |  0.0001  0.1988  0.0395  0.050  0.050
  0           | -0.0005  0.2046  0.0418  0.048  0.059 |  0.0007  0.1846  0.0341  0.050  0.050
  0.2         | -0.0001  0.2226  0.0495  0.045  0.050 |  0.0008  0.2034  0.0415  0.038  0.057
  0.4         |  0.0024  0.3312  0.1097  0.048  0.049 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(mP,cP)
 -0.4         |  0.0011  0.2082  0.0433  0.058  0.041 |  0.0006  0.2042  0.0417  0.050  0.045
 -0.2         |  0.0009  0.2123  0.0451  0.044  0.054 |  0.0001  0.1974  0.0389  0.051  0.053
  0           | -0.0005  0.2025  0.0410  0.048  0.058 |  0.0006  0.1834  0.0337  0.050  0.052
  0.2         | -0.0002  0.2222  0.0493  0.045  0.052 |  0.0007  0.2030  0.0413  0.037  0.056
  0.4         |  0.0025  0.3315  0.1099  0.047  0.051 |  0.0011  0.3116  0.0972  0.043  0.053
DRCATE(mP,N)
 -0.4         | -0.0011  0.2156  0.0464  0.056  0.042 | -0.0005  0.2082  0.0434  0.048  0.043
 -0.2         | -0.0019  0.2086  0.0436  0.061  0.036 | -0.0013  0.2062  0.0428  0.058  0.036
  0           | -0.0028  0.2003  0.0403  0.057  0.034 | -0.0011  0.1888  0.0358  0.058  0.040
  0.2         |  0.0021  0.2108  0.0445  0.052  0.058 | -0.0004  0.2099  0.0440  0.054  0.044
  0.4         |  0.0060  0.3258  0.1069  0.045  0.059 |  0.0033  0.3276  0.1093  0.033  0.069
DRCATE(mP,S)
 -0.4         | -0.0034  0.2215  0.0493  0.054  0.050 | -0.0010  0.2119  0.0451  0.053  0.043
 -0.2         | -0.0055  0.2235  0.0507  0.060  0.041 | -0.0034  0.2115  0.0469  0.073  0.029
  0           | -0.0023  0.2049  0.0421  0.051  0.043 | -0.0025  0.1895  0.0371  0.061  0.032
  0.2         | -0.0003  0.2149  0.0462  0.043  0.045 | -0.0003  0.1982  0.0393  0.052  0.043
  0.4         |  0.0102  0.3351  0.1148  0.032  0.068 |  0.0034  0.3122  0.0997  0.039  0.058

Table 8: Simulation results under model 2 (part 3). Columns as in Table 6.

              |            n=500                      |            n=5000
  x           |  bias    sam-SD  MSE    P1     P2     |  bias    sam-SD  MSE    P1     P2
DRCATE(O,O)
 -0.4         |  0.0012  0.2077  0.0431  0.058  0.045 |  0.0006  0.2034  0.0414  0.045  0.047
 -0.2         |  0.0010  0.2139  0.0457  0.040  0.052 |  0.0001  0.1988  0.0395  0.050  0.050
  0           | -0.0005  0.2046  0.0418  0.048  0.059 |  0.0007  0.1846  0.0341  0.050  0.050
  0.2         | -0.0001  0.2226  0.0495  0.045  0.050 |  0.0008  0.2034  0.0415  0.038  0.057
  0.4         |  0.0024  0.3312  0.1097  0.048  0.049 |  0.0010  0.3114  0.0971  0.044  0.053
DRCATE(cP,mP)
 -0.4         |  0.0008  0.2179  0.0474  0.051  0.043 |  0.0004  0.2204  0.0485  0.048  0.045
 -0.2         |  0.0012  0.2233  0.0498  0.049  0.052 |  0.0003  0.2069  0.0428  0.049  0.054
  0           | -0.0004  0.2104  0.0442  0.051  0.060 |  0.0008  0.1890  0.0358  0.050  0.053
  0.2         | -0.0002  0.2226  0.0495  0.048  0.049 |  0.0007  0.2039  0.0416  0.036  0.054
  0.4         |  0.0028  0.3407  0.1162  0.047  0.050 |  0.0010  0.3170  0.1006  0.043  0.053
DRCATE(N,mP)
 -0.4         | -0.0050  0.2225  0.0501  0.060  0.036 | -0.0006  0.2227  0.0496  0.051  0.050
 -0.2         | -0.0015  0.2185  0.0477  0.056  0.043 | -0.0011  0.1931  0.0375  0.054  0.039
  0           | -0.0020  0.2039  0.0416  0.072  0.032 | -0.0013  0.1857  0.0348  0.056  0.038
  0.2         |  0.0024  0.2178  0.0475  0.042  0.051 |  0.0005  0.2064  0.0426  0.039  0.056
  0.4         |  0.0046  0.3259  0.1066  0.035  0.064 |  0.0023  0.3324  0.1115  0.044  0.050
DRCATE(S,mP)
 -0.4         | -0.0115  0.2117  0.0481  0.075  0.027 | -0.0024  0.3260  0.1073  0.020  0.017
 -0.2         | -0.0021  0.2083  0.0434  0.051  0.051 | -0.0021  0.2010  0.0412  0.065  0.033
  0           | -0.0018  0.2002  0.0401  0.044  0.053 | -0.0005  0.2045  0.0418  0.045  0.038
  0.2         |  0.0035  0.2290  0.0527  0.044  0.071 |  0.0001  0.2155  0.0464  0.054  0.054
  0.4         |  0.0017  0.3460  0.1196  0.040  0.064 |  0.0015  0.3519  0.1241  0.031  0.054
References
Abrevaya, J., Hsu, Y.-C. and Lieli, R. P. (2015). Estimating conditional average treatment effects. Journal of Business & Economic Statistics 33(4), 485-505.

Fan, Q., Hsu, Y.-C., Lieli, R. P. and Zhang, Y. (2019). Estimation of conditional average treatment effects with high-dimensional data. arXiv preprint arXiv:1908.02399.

Hu, Z., Follmann, D. A. and Wang, N. (2014). Estimation of mean response via the effective balancing score. Biometrika 101(3), 613-624.

Lee, S., Okui, R. and Whang, Y.-J. (2017). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics 32(7), 1207-1225.

Li, L., Zhou, N. and Zhu, L. (2020). Outcome regression-based estimation of conditional average treatment effect. Submitted.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846-866.

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41-55.

Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448), 1096-1120.

Seaman, S. R. and Vansteelandt, S. (2018). Introduction to double robust methods for incomplete data. Statistical Science 33(2), 184.

Shi, C., Lu, W. and Song, R. (2019). A sparse random projection-based test for overall qualitative treatment effects. Journal of the American Statistical Association, 1-41.

Zhou, N. and Zhu, L. (2020). On IPW-based estimation of conditional average treatment effect. Submitted.

Zimmert, M. and Lechner, M. (2019). Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779.