Nonparametric estimation of causal heterogeneity under high-dimensional confounding

Abstract

This paper considers the practically important case of nonparametrically estimating heterogeneous average treatment effects that vary with a limited number of discrete and continuous covariates in a selection-on-observables framework where the number of possible confounders is very large. We propose a two-step estimator for which the first step is estimated by machine learning. We show that this estimator has desirable statistical properties like consistency, asymptotic normality and rate double robustness. In particular, we derive the coupled convergence conditions between the nonparametric and the machine learning steps. We also show that estimating population average treatment effects by averaging the estimated heterogeneous effects is semi-parametrically efficient. The new estimator is illustrated with an empirical example of the effects of mothers' smoking during pregnancy on the resulting birth weight.

Michael Zimmert and Michael Lechner
SEW-HSG, Swiss Institute for Empirical Economic Research, University of St. Gallen, Switzerland

JEL classification: C14, C21

Keywords: causal machine learning, effect heterogeneity, group average treatment effects, semiparametric efficiency, ensemble learning

arXiv [econ.EM]

Introduction

Recently, new machine learning based estimators have shown immense potential for systematically uncovering causal effect heterogeneity, so that there is now a rapidly growing literature on this topic (e.g., see the overviews in Athey and Imbens, 2019; Athey and Imbens, 2017 and Knaus, Lechner, and Strittmatter, 2018). In the context of heterogeneity, the aggregation level at which the heterogeneity is estimated plays an important role. Most papers of this literature focus on a selection-on-observables framework and investigate estimators for the heterogeneity at the lowest aggregation level to uncover possible heterogeneities to the largest extent possible. While this finest level of causal granularity is obviously of interest, Chernozhukov, Fernández-Val, and Luo (2018b) and Lechner (2018) argue for analysing heterogeneity at higher levels, so-called 'Group Average Treatment Effects' (GATEs). Such aggregates can be estimated more precisely, may be far more easily interpretable by researchers in substantive terms, and are more useful for decision makers. In particular, some subgroup heterogeneities are of limited value per se because it is hard to justify a decision or policy based on certain characteristics (race, gender etc.). Therefore, decision-makers are often only interested in effect heterogeneities based on a rather small subset of available covariates. This paper suggests an approach that is based on statistical-learning assisted estimation of the GATEs for the various discrete and continuous variables of interest, and subsequent non-parametric aggregation of the GATEs to obtain 'Average Treatment Effects' (ATEs).

More technically speaking, in effect heterogeneity analysis covariates do not (only) serve the purpose of making identifying assumptions credible. They become part of the outcome analysis by discriminating different subgroups of units for which the effect is of interest.
Further, whenever new observations enter the sample the covariate realizations could be used to predict a causal effect. The set of covariates to be included in the statistical model to explore effect heterogeneity is therefore not a statistical but rather a substantive decision.

The estimation of subgroup specific effects is a tedious task when there is confounding. In such settings, causal effects are typically only identified if the researcher includes the confounding covariates in the statistical model as well. Hence, the identifying assumptions dictate the inclusion of the set of covariates required. In empirical research based on selection-on-observables, the credibility of causal effects estimation often depends on a very large set of possible covariates with very many possible functional forms. Qualitatively assessing which covariates should ultimately enter the model in which specific form in a non-systematic fashion is prone to be flawed.

Footnote: Lechner (2018) also proposed this aggregation idea. However, that paper considered only a version of a Causal Forest while here we are in principle agnostic with respect to the machine learning method used. Furthermore, it considered only GATEs based on discrete variables and thus GATEs were obtained as unweighted within-cell means.

[...] already dates back to Hahn (1998). He suggested estimating a nonparametric outcome regression on the set of covariates that needs to be controlled for. Averaging over the conditional means leads to estimators of ATE that attain the semiparametric efficiency bound. In practice, however, nonparametric regression with many covariates is hardly feasible because the convergence rate of nonparametric methods decreases exponentially with the number of covariates included. Recently, Wager and Athey (2018) follow the same ideas as in Hahn (1998) but use Causal Forests instead of standard nonparametric regression.
Athey, Tibshirani, and Wager (2019) and Lechner (2018) modify the Random Forest algorithm to better adjust for confounding and improve precision. Outcome-based models adjust for confounding and infer heterogeneous effects in a single estimation step. Therefore, in all of these approaches inference for effect heterogeneity relies on a dimension of the covariate space that is fixed. Given the previous discussion, this might be a very strong assumption.

In this paper we follow an alternative approach in the literature. The two distinct roles of the covariates, adjusting for confounding and estimating heterogeneous effects, are explicitly reflected in a two-step estimation procedure for the GATEs. This idea is conceptually not new in the literature. In the context of difference-in-differences estimation, Abadie (2005) shows that propensity score weighted outcomes can be used as a dependent variable in a second stage regression on the covariates that are of interest for heterogeneous effects. Abrevaya, Hsu, and Lieli (2015) use a similar idea in the standard selection-on-observables setting. They provide inferential results for nonparametric and parametric propensity score first stages with nonparametric second stages. In line with other results in the literature on average effects (Hirano, Imbens, and Ridder, 2003, Robins, Rotnitzky, and Zhao, 1994, Lunceford and Davidian, 2004), they show that the variance for Inverse Probability Weighting (IPW) estimators can be substantially decreased when the propensity score is estimated nonparametrically. Since their second stage also relies on nonparametric regression, the validity of their asymptotic results requires jointly choosing two kernel bandwidths which have to be in a rather small feasible interval. Lee, Okui, and Whang (2016) augment the model of Abrevaya et al. (2015) by including outcome projections (Augmented IPW, AIPW) and show that when both the propensity score and the outcome projections are estimated parametrically, one can treat the nuisance parameters as if they were known. Their asymptotic results with parametric first stages are then equivalent to those of Abrevaya et al. (2015) with nonparametric propensity score [...]

Footnote: We avoid the imprecise term 'Conditional Average Treatment Effect' because it is unclear which conditioning set is actually meant.

Footnote: A few days before this work appeared first on arXiv, Fan, Hsu, Lieli, and Zhang (2019) published their independent work on arXiv (up to that moment unknown to us) that uses similar ideas about aggregation and machine learning.

Footnote: While Abrevaya et al. (2015) use local constant nonparametric regression, Lee et al. (2016) show their results with local linear nonparametric regression.

[...] confounder space. This enables our proposed estimator to be robust against functional form misspecification and to remain consistent even if the number of covariates relative to the sample size is large. In particular, we provide a generic statistical framework such that the convergence rate requirements of the first stage nuisance estimation are coupled with the kernel bandwidth second stage nonparametric convergence.

Additionally, we link our identification and estimation result to semiparametric efficiency theory by providing a new estimator for the ATE that can be estimated as a by-product of the GATEs. The estimator aggregates over all point estimates of the GATEs. We show that under certain convergence conditions for the kernel bandwidth, asymptotically it hits the variance lower bound of the semiparametric estimation problem. We therefore also contribute to the small literature on three-step semiparametric ATE estimation.
Specifically, Hahn and Ridder (2013) (for an alternative theoretical development see also Mammen, Rothe, and Schienle, 2012) investigate a related set-up showing that nonparametric regression on an estimated propensity score can lead to efficient estimation of ATE. To the best of our knowledge, this paper is, however, the first that analyses the asymptotic properties of averaging a transformed outcome projection instead of the outcome projection on the propensity score. Like propensity score matching, our three-step estimator might have better finite-sample properties than two-step estimators that share the same first-order asymptotic properties (Robins et al., 1994, Hirano et al., 2003, Chernozhukov et al., 2018a) because the propensity score weights are subject to an additional smoothing step. Unlike propensity score matching, IPW or Hahn's (1998) estimator, the proposed ATE estimator remains feasible when the dimension of the confounders entering the model is high.

Footnote: In general the term 'high-dimensional' refers to the fact that the dimension of the model can grow with the sample size. We will provide specific rate conditions in the main part of the paper.

Suppose that we observe an independent and identically distributed random sample $\{w_i\}_{i=1}^N$ with sample size $N$ where $w_i = (y_i, d_i, x_i, z_i)$. Denote with uppercase letters a variable and with lowercase letters its realizations. Then $Y$ is the outcome variable and $D$ is the binary treatment of interest. To describe causal effects, we use Rubin's (1974) potential outcome notation such that $Y^d$ is the outcome that would have been observed under treatment $D = d$. Further, $X$ is a matrix of observed covariates with support $\mathcal{X}$, and $Z \subseteq X$ is a set of predefined variables, with support $\mathcal{Z}$, for which the researcher is interested in effect heterogeneity. Also let $X \in \mathbb{R}^{\dim X}$ and $Z \in \mathbb{R}^{\dim Z}$ and denote $\lambda_X = \dim X$ and $\lambda_Z = \dim Z$. Potentially we have that $\lambda_X \to \infty$ when $N \to \infty$ whereas $\lambda_Z$ is fixed. Hence, we explicitly allow for models where $X$ is high-dimensional but the dimension of the subset of covariates that is of interest for the heterogeneity analysis does not grow with the sample size. We remain agnostic about the underlying cumulative distribution $F = F(W)$ from which the sample of $W = (Y, D, X, Z)$ is drawn and just assume that it exists with density $f = f(W)$.

The main parameter of interest in this study is the GATE defined as

$$\tau(z) = E\left[Y^1 - Y^0 \mid Z = z\right].$$

Since we want to avoid usually unrealistic parametric assumptions on the underlying DGP, we allow for a flexible function $\psi(W, \cdot)$ such that GATE is identified as

$$\tau(z) = E[\psi(W, \cdot) \mid Z = z].$$

Footnote: The concrete growth rates of $\lambda_X$ in relation to $N$ will be discussed in Section 3.

[...]

$$\theta = E[E[\psi(W, \cdot) \mid Z = z]]$$

implying Hahn's (1998) efficient score function for ATE

$$\psi(W, p, m_1, m_0) = \frac{D(Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X)$$

where $p(X) = E[D \mid X]$ denotes the propensity score and $m_d(X) = E[Y \mid X, D = d]$ for $d \in \{0, 1\}$ denotes the conditional expectations of the outcome in the treatment-specific subpopulations.

For identification of GATE and ATE we make the following assumptions.

Assumption 1 (Conditional independence). $Y^1, Y^0 \perp D \mid X = x \quad \forall x \in \mathcal{X}$

Assumption 2 (Stable Unit Treatment Value Assumption (SUTVA)). $Y = D Y^1 + (1 - D) Y^0$

Assumption 3 (Exogeneity of confounders). $X^1 = X^0$

Assumption 4 (Common support). $c < p(X) < 1 - c$ for some small positive constant $c$.

Assuming that appropriate moments exist, then for GATE we have

$$\begin{aligned}
\tau(z) &= E\left[E\left(Y^1 - Y^0 \mid X\right) \mid Z = z\right] \\
&= E\left[\frac{D(Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) \,\Big|\, Z = z\right] \\
&= E\left[\frac{DY}{p(X)} - \frac{(1 - D)Y}{1 - p(X)} \,\Big|\, Z = z\right] \\
&= E[m_1(X) - m_0(X) \mid Z = z].
\end{aligned}$$

The exposition shows that IPW and outcome based estimands are embedded in the estimand based on $\psi(W, p, m_1, m_0)$. Finally, by noticing that

$$\theta = E[\tau(Z)]$$

identification of ATE trivially follows from these considerations.

The identification results from the preceding section suggest a two-step estimation strategy. The details of our proposed estimator are described in Procedure 1.

Procedure 1. GATE estimation

Introduce the subsample index $l = 1, ..., L$ and denote the corresponding information set by $I_l$ as well as its complement by $I_l^C$.

1. Randomly split the sample in equally sized subsamples $1, ..., L$.
2. for $l = 1$ to $L$ do
   Estimate the propensity score $p(x)$ and the outcome projections $m_1(x)$ and $m_0(x)$ in the sample with $I_l^C$ using any suitable machine learning method or an ensemble of them.
   Predict $\hat{p}(x)$, $\hat{m}_1(x)$ and $\hat{m}_0(x)$ in the sample with $I_l$.
   end
3. Denote $\hat{p} = \hat{p}_{l=1,...,L}$, $\hat{m}_1 = \hat{m}_{1,l=1,...,L}$ and $\hat{m}_0 = \hat{m}_{0,l=1,...,L}$. Then construct the vector with elements $\hat{\psi} = \psi(W_i, \hat{p}, \hat{m}_1, \hat{m}_0)$ for $i = 1, ..., N$ and estimate GATE as

$$\hat{\tau}(z) = \frac{\sum_{i=1}^N K\left(\frac{z_i - z}{h}\right) \psi(W_i, \hat{p}, \hat{m}_1, \hat{m}_0)}{\sum_{i=1}^N K\left(\frac{z_i - z}{h}\right)}$$

where $K = K(u)$ is some kernel function that depends on a bandwidth $h$.

In a first step a sample plug-in version of $\psi(W, p, m_1, m_0)$ can be obtained by estimating the nuisance parameters. In a second step the $\psi$-vector can be projected on $Z$. Our goal is to estimate both stages as flexibly as possible and to avoid parametric assumptions. Further, our estimator can cope with settings where $\lambda_X$ is very large, which precludes classical nonparametric and parametric methods to estimate the first stage nuisances $p(x)$, $m_1(x)$ and $m_0(x)$. However, we can use a large class of supervised machine learning algorithms that have been shown to be very effective predictors for such types of tasks. Following the suggestions of Chernozhukov et al. (2018a) we apply a cross-fitting algorithm for the nuisance parameter estimation step in order to guarantee that the resulting estimator of $\psi(W, p, m_1, m_0)$ consists of independent observations. The requirements for the second stage estimation step are more sophisticated as this estimator should allow for valid inference. To estimate GATE flexibly, we apply nonparametric local constant regression in the second step.

We now investigate the theoretical properties of our proposed estimation procedure. To ease the notational burden, we start with some definitions.
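Before turning to the definitions, the three steps of Procedure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`aipw_score`, `ols_fit_predict`, `cross_fit_scores`, `gate_kernel`) are hypothetical, the least-squares learner is only a placeholder for "any suitable machine learning method", the clipping of the predicted propensity score is an added assumption, and the simulated data with true GATE $\tau(z) = 1 + z$ are made up for illustration.

```python
import numpy as np

def aipw_score(y, d, p_hat, m1_hat, m0_hat):
    """Doubly robust score psi(W, p, m1, m0) at the predicted nuisances."""
    return (d * (y - m1_hat) / p_hat
            - (1 - d) * (y - m0_hat) / (1 - p_hat)
            + m1_hat - m0_hat)

def ols_fit_predict(x_train, t_train, x_test):
    """Placeholder learner (assumption): least squares with an intercept."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    a_test = np.column_stack([np.ones(len(x_test)), x_test])
    coef, *_ = np.linalg.lstsq(a_train, t_train, rcond=None)
    return a_test @ coef

def cross_fit_scores(y, d, x, fit_predict=ols_fit_predict, n_folds=2, seed=0):
    """Steps 1-2: predict nuisances on each fold from fits on its complement."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    p_hat, m1_hat, m0_hat = np.empty(n), np.empty(n), np.empty(n)
    for l in range(n_folds):
        test, train = folds == l, folds != l
        treated, control = train & (d == 1), train & (d == 0)
        p_hat[test] = fit_predict(x[train], d[train], x[test])
        m1_hat[test] = fit_predict(x[treated], y[treated], x[test])
        m0_hat[test] = fit_predict(x[control], y[control], x[test])
    p_hat = np.clip(p_hat, 0.05, 0.95)  # crude common-support guard (assumption)
    return aipw_score(y, d, p_hat, m1_hat, m0_hat)

def gate_kernel(z_eval, z, psi, h):
    """Step 3: local constant regression of psi on scalar Z, Gaussian kernel."""
    u = (z[None, :] - np.atleast_1d(z_eval)[:, None]) / h
    w = np.exp(-0.5 * u ** 2)  # the normalising constant cancels in the ratio
    return (w @ psi) / w.sum(axis=1)

# Simulated illustration: Z is the first column of X and tau(z) = 1 + z.
rng = np.random.default_rng(1)
n = 4000
x = rng.normal(size=(n, 3))
z = x[:, 0]
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-0.5 * x[:, 1]))).astype(float)
y = x @ np.array([1.0, -1.0, 0.5]) + d * (1 + z) + rng.normal(size=n)

psi = cross_fit_scores(y, d, x)
tau_hat = gate_kernel(np.array([-1.0, 0.0, 1.0]), z, psi, h=0.3)
```

In this toy design the outcome regressions are correctly specified while the linear propensity model is not, so the sketch also illustrates why the AIPW score is attractive: the estimated GATE remains close to $1 + z$ despite the crude propensity fit.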

Definition 1 (Norms). Denote by $\|g(X)\|_p$ the $L_p$ norm of the generic function $g(\cdot)$. Further denote the supremum norm by $\sup_{X \in \mathcal{X}} |g(X)| = \|g(X)\|_\infty$.

Definition 2 (Rates). The nuisance parameter first stage estimates $\hat{p}$, $\hat{m}_1$ and $\hat{m}_0$ obtained by the sample splitting procedure described above belong to the realization sets $\mathcal{P}$, $\mathcal{M}_1$ and $\mathcal{M}_0$ with probability $1 - o(1)$. For any realization $p^*$, $m_1^*$ and $m_0^*$ in the sets define the rates

$$\epsilon_{m_d} = \sup_{m_d^* \in \mathcal{M}_d} \|m_d^*(X) - m_d(X)\|_2, \qquad \epsilon_p = \sup_{p^* \in \mathcal{P}} \|p^*(X) - p(X)\|_2, \qquad \epsilon_{\max} = \max\{\epsilon_{m_1}, \epsilon_{m_0}, \epsilon_p\}.$$

Definition 3 (Scaling factor). For any function $g$ define a scaling parameter $\delta_g$ that determines $g = O\left(N^{-\delta_g}\right)$.

We then make the following standard assumptions on the kernel regression step (see for example Pagan and Ullah, 1999).

Assumption 5 (Kernel regression).
1. $Z = z$ is a point in the interior of the support $\mathcal{Z}$.
2. The density function estimator is uniformly bounded away from zero such that $\inf_{z \in \mathcal{Z}} \hat{f}(z) \geq C$ where $C > 0$ is a generic constant.
3. The kernel function $K(u)$ is $r$ times continuously differentiable, symmetric and of order $r$ in the sense $\int u^{r-1} K(u)\,du = 0$ and $\int u^r K(u)\,du = O(1)$ for $r \in \mathbb{N}$.
4. $f(z)$ and $E(\psi(W, p, m_1, m_0) \mid Z = z)$ are $r$ times continuously differentiable.
5. Further the kernel function satisfies (i) $\int K(u)\,du = 1$, (ii) $\int |K(u)|^C\,du = O(1)$ for any $C > 0$, (iii) $|u||K(u)| \to 0$ as $|u| \to \infty$, (iv) $\|K(u)\|_\infty = O(1)$ and (v) $\int K(u)^2\,du = O(1)$.

Assumption 5 comprises the standard nonparametric local constant regression assumptions allowing for multiple covariates and higher-order kernels. For illustrative purposes multivariate regression results are derived assuming the same bandwidth for every regressor. Further, we have to make boundedness assumptions on the second moment of the sample error of the outcome model and on the nuisance prediction errors.

Assumption 6 (Boundedness of conditional variances). The conditional variances of the outcome models are bounded such that they obey $E\left[(DY - m_1(X))^2 \mid X\right] = O(1)$ and $E\left[((1 - D)Y - m_0(X))^2 \mid X\right] = O(1)$.

Assumption 7 (Boundedness of convergence rates). The nuisance parameter prediction errors are bounded such that they obey $\sup_{m_d^* \in \mathcal{M}_d} \|m_d^*(X) - m_d(X)\|_\infty = O(1)$ for $d \in \{0, 1\}$ and $\sup_{p^* \in \mathcal{P}} \|p^*(X) - p(X)\|_\infty = O(1)$.

Additionally, the convergence rates of our first stage nuisance parameter prediction and the second stage nonparametric regression are assumed to be as follows:

Assumption 8 (Coupled convergence (GATE)). The bandwidth $h$ and the sample size $N$ jointly converge such that
(i) $h = o(1)$, $N h^{\lambda_Z} \to \infty$ as $N \to \infty$ and
(ii) $\sqrt{N h^{\lambda_Z}}\, h^r = o(1)$.
Further, $N$ and $h$ satisfy the joint convergence conditions with the nuisance parameter convergence rates
(iii) $h^{-\lambda_Z/2} \epsilon_{\max} = o(1)$,
(iv) $\sqrt{N h^{-\lambda_Z}}\, \epsilon_{m_1} \epsilon_p + \sqrt{N h^{-\lambda_Z}}\, \epsilon_{m_0} \epsilon_p = o(1)$.

Assumption 8 comprises the coupled convergence rate assumptions that are at the centre of our theoretical results. Conditions (i) and (ii) quantify how the bandwidth has to converge to zero in relation to the sample size $N$ and the number of regressors $\lambda_Z$. As usual, the bandwidth has to go to zero but more slowly than the sample size grows to infinity. Also the bandwidth has to be chosen such that the asymptotic bias term vanishes faster than the variance. This allows us to apply the Central Limit Theorem and makes the estimator asymptotically unbiased. In particular, condition (ii) requires undersmoothing in the sense that the bandwidth has to be below the mean squared error (MSE) optimal rate. As discussed for example in Pagan and Ullah (1999), choosing a higher order kernel mitigates the problem.

Conditions (iii) and (iv) state that $h$ has to be chosen such that first stage convergence rates vanish fast enough. In particular, by condition (iv) the joint convergence rates from propensity score and outcome projection estimation have to vanish faster than $\sqrt{N}$ scaled with the kernel bandwidth. Since $h \to 0$ [...] $Z = z$. Thus, since sample observations enter the estimator in a weighted form, the prediction precision needed is for the lower effective sample size around $Z = z$. Hence, the first stage prediction guarantees need to adapt to these smaller sample conditions and therefore achieve a faster joint rate of convergence in terms of the sample size $N$. Condition (iii) additionally prevents the worst rate from becoming arbitrarily slow, especially when $\lambda_Z$ is larger than one.
Still, our estimator has a 'rate' double robustness feature in the sense that joint rates can vanish relatively slowly but all single first stage rates are restricted from converging very slowly. $L_2$ convergence rates of many supervised machine learning methods satisfy these properties under sparsity conditions. For example Belloni and Chernozhukov (2013) show that the predictive error of the Lasso is of order $O\left(\sqrt{\frac{s \log \max(\lambda_X, N)}{N}}\right)$ where $s$ is the unknown number of true coefficients in the oracle model. Suppose that $s$ and $\lambda_X$ are equal in the outcome and the propensity score models; then we require $\frac{s \log \max(\lambda_X, N)}{N h^{\lambda_Z}} \to 0$, so that $\lambda_X$ can grow with the effective sample size $N h^{\lambda_Z}$. Similar rates can be shown for $L_2$ boosting (Luo and Spindler, 2016) and nonlinear models like Random Forests (Wager and Walther, 2016) or forms of Deep Neural Nets (Farrell, Liang, and Misra, 2018).

A natural question is then if a bandwidth exists that satisfies the rate conditions in Assumption 8. Indeed, one can show (for more details see Appendix B) that the theoretical range of possible bandwidth choices can be described by

$$\frac{1}{\lambda_Z + 2r} < \delta_h < \frac{2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1}{\lambda_Z}$$

and we achieve a condition for the order of the kernel

$$r > \frac{\lambda_Z \left(1 - (\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}})\right)}{2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1}.$$

For example if we restrict ourselves to second order kernel functions then for $\lambda_Z = 1$ we require $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}} > 3/5$. Similarly, for $\lambda_Z = 2$ and $\lambda_Z = 3$, $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}} > 2/3$ and $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}} > 5/7$ are required respectively. Thus, for a growing dimension $\lambda_Z$ the joint rate condition for the first stage nuisance parameters approaches the parametric rate.

Footnote: For the concrete dependence of sparsity conditions on the parameters of the predictors see the references mentioned.
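The feasible bandwidth range can be checked numerically with a few lines of Python. The sketch below assumes the range reads $1/(\lambda_Z + 2r) < \delta_h < (2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1)/\lambda_Z$, as implied by conditions (ii) and (iv) of Assumption 8 with $h = O(N^{-\delta_h})$; the function names and the argument `delta_sum` (standing for $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}$) are hypothetical. Exact fractions avoid rounding at the boundary.

```python
from fractions import Fraction

def gate_bandwidth_range(lambda_z, r, delta_sum):
    """Bounds (lower, upper) on delta_h, or None if no bandwidth is feasible.

    delta_sum is the joint first stage rate exponent delta_p + delta_m,
    ideally passed as a Fraction.
    """
    lower = Fraction(1, lambda_z + 2 * r)
    upper = (2 * Fraction(delta_sum) - 1) / lambda_z
    return (lower, upper) if lower < upper else None

def min_joint_rate(lambda_z, r):
    """Boundary value: the range is non-empty iff delta_sum exceeds it.

    Solving 1/(lambda_z + 2r) < (2*delta_sum - 1)/lambda_z for delta_sum
    gives delta_sum > (lambda_z + r) / (lambda_z + 2r).
    """
    return Fraction(lambda_z + r, lambda_z + 2 * r)

# Second-order kernel (r = 2) for lambda_Z = 1, 2, 3.
thresholds = [min_joint_rate(lz, 2) for lz in (1, 2, 3)]
example_range = gate_bandwidth_range(1, 2, Fraction(7, 10))
```

For a fixed kernel order the threshold $(\lambda_Z + r)/(\lambda_Z + 2r)$ increases towards one as $\lambda_Z$ grows, which corresponds to both nuisance estimators converging at the parametric rate.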

Theorem 1. Under Assumptions 1-8 our proposed estimation procedure for GATE obeys

$$\sqrt{N h^{\lambda_Z}}\,(\hat{\tau} - \tau) = \frac{1}{\sqrt{N h^{\lambda_Z}}} \sum_{i=1}^N \frac{K\left(\frac{z_i - z}{h}\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^N K\left(\frac{z_j - z}{h}\right)} \left(\psi(W_i, p, m_1, m_0) - \tau\right) + o(1)$$

and

$$\sqrt{N h^{\lambda_Z}}\,(\hat{\tau} - \tau) \to_d N\left(0, \sigma^2_{GATE}\right)$$

with

$$\sigma^2_{GATE} = \frac{\int K(u)^2\,du \times E\left[(\psi(W_i, p, m_1, m_0) - \tau)^2 \mid Z = z\right]}{f(z)}.$$

Theorem 1 shows that under the assumptions discussed above the speed of convergence is determined only by the nonparametric regression step. In particular, it does not depend on the first stage estimation steps. An equivalent result can also be achieved by using IPW with nonparametric first stages (see Abrevaya et al., 2015). However, this requires an additional bandwidth choice for the first stage propensity score regression and is limited to the case when also $\lambda_X$ is very small. The dimension of $X$ can be increased under functional form assumptions for the first stage. However, as shown by the authors the price to pay is an increase in the asymptotic variance. This is not the case for the estimator proposed in this paper. Also Theorem 1 is valid under generally weaker conditions compared to the results in Lee et al. (2016) for parametric first stages. Heuristically, if the first stage estimators converge at $\sqrt{N}$ then our conditions on the bandwidth are satisfied and our asymptotic results continue to apply. To this extent, our results comprise the result of Lee et al. (2016) as a special case.

Footnote: In contrast to Lee et al. (2016) we use local constant instead of local linear regression and introduce cross-fitting for nuisance parameter estimation. This should, however, not be a concern for the intuitive argument made.

Joint estimation of GATE and ATE

Given the considerations so far, it appears 'natural' to estimate ATE in three steps as an average of GATEs in the sample. The details of the proposed estimator are described in Procedure 2.

Procedure 2. ATE estimation

1. Follow steps 1-3 of Procedure 1.
2. Predict GATE at every observation in the sample as $\hat{\tau}(z_j)$.
3. Estimate ATE as

$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^N \hat{\tau}(z_j).$$

As suggested in Chernozhukov et al. (2018a) one could also directly estimate ATE as the average of the vector $\psi(W, p, m_1, m_0)$ using the first stage nuisance parameter predictions. However, as a by-product of GATE estimation, using an additional kernel smoothing step may lead to an ATE estimator with better finite sample properties. In particular, the propensity score weights do not enter the last step of our estimator directly and our hope is that small misspecification errors of propensity scores close to zero or one are therefore smoothed out. The sensitivity of estimators incorporating inverse propensity score weights directly is the subject of many Monte Carlo experiments (e.g. Huber, Lechner, and Wunsch, 2013, and Frölich, 2004). We notice that similar reasoning is also behind three-step estimators that apply nonparametric regression on an estimated propensity score often used in practice.

To obtain our theoretical results we have to modify Assumption 8 slightly.

Assumption 8′ (Coupled convergence (ATE)). The bandwidth $h$ and the sample size $N$ jointly converge such that
(i) $h = o(1)$, $N h^{\lambda_Z} \to \infty$ as $N \to \infty$,
(ii) $\sqrt{N h^{\lambda_Z}}\, h^r = o(1)$ and
(iii) $\sqrt{N}\, h^{2r} = o(1)$ and $\sqrt{N}\, h^{\lambda_Z} \to \infty$.
Further, $N$ and $h$ satisfy the joint convergence conditions with the nuisance parameter convergence rates
(iv) $h^{-\lambda_Z/2} \epsilon_{\max} = o(1)$ and
(v) $\sqrt{N}\, h^{-\lambda_Z} \epsilon_{m_1} \epsilon_p + \sqrt{N}\, h^{-\lambda_Z} \epsilon_{m_0} \epsilon_p = o(1)$.

We notice that averaging over the estimated projection $E[\psi(W, p, m_1, m_0) \mid X]$ is a partial mean problem in the sense of Newey (1994a,b). While parts (i) and (ii) of Assumption 8 remain unchanged, the additional condition (iii) is necessary in order to guarantee that the MSE of the kernel regression estimator scaled with $\sqrt{N}$ converges to zero. In this way we guarantee the applicability of Newey's (1994b) framework. We could have also assumed uniform convergence rates for the kernel regression step. However, this would involve an unnecessarily strong condition (for a discussion see Newey, 1994b, pp. 1364-1368 and also Newey and McFadden, 1994, p. 2205).

Since we want ATE to converge with a rate of $\sqrt{N}$, the requirements on the first stage convergence rates in condition (v) are more restrictive than those in the respective condition of Assumption 8. The range of theoretically feasible bandwidth choices reduces to

$$\max\left(\frac{1}{4r}, \frac{1}{\lambda_Z + 2r}\right) < \delta_h < \frac{(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - \frac{1}{2}}{\lambda_Z}.$$

Assuming $\frac{1}{4r} < \frac{1}{\lambda_Z + 2r}$ we get a modified condition for the order of the kernel function

$$r > \frac{\lambda_Z \left(\frac{3}{2} - (\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}})\right)}{2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1}.$$

In general this result indicates that one relies on a higher-order kernel function whenever $\lambda_Z > 3$. Under Assumption 8′ we can then derive the following efficiency result.

Theorem 2. Under Assumptions 1-7 and 8′ and the regularity conditions on the nonparametric second step as in Newey (1994b, pp. 1364-1368) our proposed estimation procedure for ATE has the influence function

$$\frac{D(Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \theta$$

and therefore obeys

$$\sqrt{N}\,(\hat{\theta} - \theta) \to_d N\left(0, \sigma^2_{ATE}\right)$$

where $\sigma^2_{ATE}$ is the semiparametric efficiency bound of Hahn (1998).
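Procedure 2 can be sketched in a few lines of Python once a vector of estimated scores $\hat{\psi}$ is available. The sketch below is illustrative only: the function name `smoothed_ate` is hypothetical, the kernel is a product Gaussian kernel with the same bandwidth for every regressor, and the simulated scores merely stand in for cross-fitted AIPW scores.

```python
import numpy as np

def smoothed_ate(z, psi, h):
    """Average of local constant GATE predictions at all sample points.

    z   : array of shape (N,) or (N, lambda_Z) with the heterogeneity variables
    psi : array of shape (N,) with estimated scores psi(W_i, p, m1, m0)
    h   : common bandwidth for all columns of z
    """
    z = np.asarray(z, dtype=float).reshape(len(psi), -1)
    diff = (z[:, None, :] - z[None, :, :]) / h
    w = np.exp(-0.5 * (diff ** 2).sum(axis=2))  # product Gaussian kernel
    tau_hat = (w @ psi) / w.sum(axis=1)         # step 2: GATE at every z_j
    return tau_hat.mean(), tau_hat              # step 3: ATE as their average

# Simulated stand-in for cross-fitted scores (assumption, not real data).
rng = np.random.default_rng(0)
n = 400
z = rng.normal(size=n)
psi = 2.0 + z + rng.normal(size=n)

theta_smoothed, tau_hat = smoothed_ate(z, psi, h=0.3)
theta_plain = psi.mean()  # the usual AIPW average without the smoothing step
```

Comparing `theta_smoothed` with `theta_plain` on simulated data is one simple way to see that the extra smoothing step changes the point estimate only slightly, in line with the asymptotic equivalence stated in Theorem 2.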

Conceptually, Theorem 2 underpins our intuition from semiparametric theory outlined in Section 2.2. The result shows that indeed every estimator that involves a nonparametric projection of the AIPW modified outcome on any low-dimensional subset of $X$ is consistent, asymptotically normal and achieves the semiparametric efficiency bound. This asymptotic result has also been shown for other estimators already discussed in Section 1. In contrast to Hirano, Imbens, and Ridder's (2003) estimator, Hahn's (1998) estimator and matching on the propensity score (Hahn and Ridder, 2013), we do not rely on nonparametrically estimated first stages. Due to the fact that $\lambda_X \to \infty$ these estimators are of no practical use in our setting. Further, unlike AIPW with machine learning nuisance parameter estimation (Chernozhukov et al., 2018a), our estimator involves an additional step. Therefore the inverse propensity score does not directly enter our estimator but is smoothed through the additional nonparametric step. Asymptotically, this does not make any difference as the result in Theorem 2 shows. However, in finite samples this could be a major advantage over the usual AIPW estimator.

We investigate the applicability of our methods using Cattaneo's (2010) dataset on the effect of cigarette smoking on birthweight available from the Stata website. The dataset contains the outcome variable birthweight in grams ($Y$), whether the mother smoked during pregnancy ($D = 1$) and several covariates on the mother's health and socio-economic background ($X$). A detailed description of all covariates in the dataset can be found in Appendix C. Applied studies with different estimation approaches unambiguously find negative average effects (see Abrevaya, 2006, da Veiga and Wilder, 2008, Walker, Tekin, and Wallace, 2009). Conditional average treatment effects were investigated by Abrevaya et al. (2015) and Lee et al. (2016) who find that mother's age is associated with increasingly negative effects of smoking. We replicate their results and compare their estimators with ours. Clearly, this type of analysis is limited in its scope since the true DGP remains unknown. However, the dataset has the particular advantage that some strong hypotheses about the estimation results are plausible. (i) The effect of smoking on birthweight should be either negative or zero. (ii) The effect should be increasingly negative with mother's age. As a second example we consider how the effect changes with the number of prenatal care visits. On the one hand a very low number of care visits could indicate the mother's insufficient access to medical infrastructure and therefore could be associated with particularly negative effects. On the other hand a [...]

Footnote: The original dataset can be retrieved from here.

Figure 1 depicts the main results of our empirical analysis. We estimate the GATEs as described inProcedure 1 using an ensemble learner comprising Lasso, Ridge, Elastic Net and a Random Forest. Theweights of the ensemble are obtained by cross-validating the out-of-sample MSE of the procedure. X inour speciﬁcation is an extended variable set (‘alldata’) and is exactly documented in Appendix C. Forexample in contrast to Lee et al. (2016) we also include the available characteristics for the father of thechild, since they could be a good predictor for the smoking behaviour of the mother. The covariates enterour model very ﬂexibly. For the penalized regression predictors we allow for polynomials up to order fourand all two way interactions. The Random Forest has the particular advantage of being an ensembleof trees itself and is therefore very ﬂexible by construction. The results are generally in line with thehypothesis made. In particular, the eﬀect of smoking is unambiguously negative over the whole supportof mother’s age and prenatal care visits. As expected the eﬀect increases with age. Interestingly, a highernumber of prenatal care visits seems to be associated with higher negative eﬀects.We estimate all our results with second-order Gaussian kernel functions. The same analysis using higherorder Gaussian kernel functions as proposed by Li and Racine (2007) yields similar results (see AppendixC). In practice the biggest challenge is to determine the bandwidth for the nonparametric regression.To achieve undersmoothing, we multiply the bandwidth obtained by leave-one-out cross-validation with0.9. Since this choice is arbitrary, the stability of our results towards this choice is a particular concern.Figure 2 shows that our estimator is relatively robust regarding this choice. A major change in the shapeof the function only appears for massive oversmoothing. 
An equivalent analysis for prenatal care visits yielding the same conclusion is relegated to Appendix C.

Table 1: Smoothed ATE estimators

Smoothed AIPW (age)    Smoothed AIPW (care visits)    Smoothed AIPW (age, care visits)
-238.937               -235.672                       -236.904
(27.257)               (27.257)                       (27.257)

Results for smoothed AIPW ATE estimation as in Procedure 2 using Z = age, Z = care visits and Z = (age, care visits). Results were obtained with a second-order Gaussian kernel function and a 0.9 × LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic standard errors are in parentheses.
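The smoothed ATE estimator of Procedure 2 averages the kernel-regression fits of the scores over all sample points. A minimal sketch under simplifying assumptions (scalar smoothing variable, Gaussian kernel, scores `psi` already computed from cross-fitted nuisance estimates) is given below. The standard error uses the sample variance of the scores, which relies on the first-order equivalence with directly averaging the efficient score. The function name is hypothetical.

```python
import math


def smoothed_ate(z, psi, h):
    """Average of Nadaraya-Watson fits of psi on z, evaluated at every z_i."""
    n = len(z)
    fitted = []
    for zi in z:
        # Kernel normalising constants cancel in the ratio below.
        w = [math.exp(-0.5 * ((zj - zi) / h) ** 2) for zj in z]
        fitted.append(sum(wj * pj for wj, pj in zip(w, psi)) / sum(w))
    theta = sum(fitted) / n
    # Plug-in asymptotic variance based on the influence function psi - theta.
    var = sum((pj - theta) ** 2 for pj in psi) / n
    return theta, math.sqrt(var / n)
```

With a very large bandwidth every fitted value approaches the sample mean of the scores, so the estimator collapses to the usual averaged AIPW estimator.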

Finally, Table 1 shows the results for ATE estimation as described in Procedure 2. In line with the previous literature mentioned above, the average effect of smoking is estimated to be negative. Crucially, the estimated effect turns out to be very robust regarding the choice of the smoothing variable.

Figure 1: AIPW GATE estimator with ensemble first stages
[Figure: effect plotted against mother's age, against care visits, and jointly against mother's age and care visits.]
Results were obtained as described in Procedure 1 with a second-order Gaussian kernel function and a 0.9 × LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic confidence bands are at the 95% level.

Figure 2
[Figure: panels (a) to (f), effect plotted against mother's age for different multiples of the CV bandwidth choice.]
Results were obtained as described in Procedure 1 with a second-order Gaussian kernel function and different multiples of the LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic confidence bands are at the 95% level.
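The bandwidth multiples refer to a leave-one-out cross-validated (LOOCV) bandwidth. As an illustration only, the sketch below selects such a bandwidth from a candidate grid for a kernel regression of the scores on a scalar smoothing variable and then rescales it. The grid search and the function names are assumptions of this sketch, not a description of the authors' implementation.

```python
import math


def loocv_criterion(z, psi, h):
    """Mean squared leave-one-out prediction error of a Gaussian kernel regression."""
    total = 0.0
    for i, zi in enumerate(z):
        num = 0.0
        den = 0.0
        for j, zj in enumerate(z):
            if j == i:
                continue  # leave observation i out
            w = math.exp(-0.5 * ((zj - zi) / h) ** 2)
            num += w * psi[j]
            den += w
        if den == 0.0:
            return float("inf")  # bandwidth too small to use any neighbour
        total += (psi[i] - num / den) ** 2
    return total / len(z)


def loocv_bandwidth(z, psi, grid):
    """Bandwidth on the grid that minimises the leave-one-out criterion."""
    return min(grid, key=lambda h: loocv_criterion(z, psi, h))


def rescale(h_cv, multiple=0.9):
    """Multiples below one undersmooth, multiples above one oversmooth."""
    return multiple * h_cv
```

A sensitivity check in the spirit of Figure 2 would re-estimate the GATE for several values of `multiple` and compare the resulting curves.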

A 'fair' comparison with other estimators is hardly feasible because our approach does not require specific functional form assumptions. In other words, the related estimators of Abrevaya et al. (2015) and Lee et al. (2016) suppose that the true propensity score or outcome projection specifications are known. Since we cannot compare our estimator against every possible parametric specification, we use the specification selected by Lee et al. (2016) as a benchmark. Figure 3 depicts GATE estimation results using the benchmark models. Strikingly, the IPW based estimator gives implausible results. For mother's age, positive effects of smoking can almost nowhere be excluded. For care visits we do not obtain a GATE

We do not consider nonparametric propensity score estimation as suggested in Abrevaya et al. (2015) because it most likely does not allow the inclusion of all potential confounders needed to make Assumption 1 credible.

Figure 3
[Figure: (a) AIPW linear first stages (age); (b) AIPW linear first stages (care visits); (c) IPW linear first stages (age); (d) IPW linear first stages (care visits). Effects are plotted against mother's age and care visits, respectively.]
Results were obtained following the procedures in Abrevaya et al. (2015) and Lee et al. (2016) with a second-order Gaussian kernel function and a 0.9 × LOOCV bandwidth choice. Nuisance parameters were estimated using Logit for the propensity score and OLS for the outcome projections. Asymptotic confidence bands are at the 95% level.

Results for ATE estimation using Inverse Probability Weighting (IPW) and Augmented IPW (AIPW). For the ensemble specifications, nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. For the parametric specifications, nuisance parameters were estimated using Logit for the propensity score and OLS for the outcome projections. Asymptotic standard errors are in parentheses.

Table 2 shows the results for ATE estimation. As expected, the results of Procedure 2 in Table 1 are roughly in line with the standard AIPW based ATE estimator with ensemble first stages. The relatively poor performance of IPW based estimation for the GATEs is also reflected in the estimation of the ATE. In particular, the standard error nearly doubles compared to AIPW based estimators and point estimates are reduced. Interestingly, for average effects there seems to be only little value added by the flexible machine learning based estimators compared to the parametric specification.
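The difference between the two families of ATE estimators can be illustrated with a small sketch. It assumes that propensity scores `p` and outcome predictions `m0`, `m1` are given, and it computes standard errors from the sample variance of the respective scores, which for IPW ignores the effect of estimating the propensity score. The function names are hypothetical and the numbers in the usage note are toy values, not results from the paper.

```python
import math


def ipw_scores(y, d, p):
    """Inverse probability weighting scores."""
    return [di * yi / pi - (1 - di) * yi / (1 - pi)
            for yi, di, pi in zip(y, d, p)]


def aipw_scores(y, d, p, m0, m1):
    """Augmented IPW (doubly robust) scores."""
    return [di * (yi - b) / pi - (1 - di) * (yi - a) / (1 - pi) + b - a
            for yi, di, pi, a, b in zip(y, d, p, m0, m1)]


def ate_with_se(scores):
    """Sample mean of the scores and its plug-in standard error."""
    n = len(scores)
    theta = sum(scores) / n
    var = sum((s - theta) ** 2 for s in scores) / n
    return theta, math.sqrt(var / n)
```

For the toy data `y = [3, 1]`, `d = [1, 0]`, `p = [0.5, 0.5]`, `m0 = [1, 1]`, `m1 = [3, 3]`, both estimators return an ATE of 2, but the IPW scores are 6 and -2 while the AIPW scores are both 2, so only the IPW estimate carries sampling noise. This mirrors, in a stylized way, why the augmentation term tends to reduce the variance.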

In this study we propose new estimators for specific conditional and average causal effects when the dimension of the covariate space is high. In particular, by distinguishing the different roles of covariates (adjusting for confounding vs. measuring causal heterogeneity of interest) in our approach, they can be included very flexibly without relying on any functional form assumptions. Rather, we show coupled convergence conditions for the different steps involved. The procedures suggested are based on semiparametric efficiency theory. In this sense, our proposed three-step estimator for ATE estimation is shown to reach the semiparametric efficiency bound. A widely used empirical example shows that our estimators are useful in practice. Compared to other estimators their desirable theoretical properties and increased

This might also be seen as an implicit test for the credibility of the stronger conditions required for the smoothed estimator compared to the averaged efficient score as in Chernozhukov et al. (2018a).

Z is moderately higher than considered in this paper, while sacrificing only little flexibility.

Finally, it might be worth investigating the finite sample properties of the proposed three-step estimators for ATE compared to averaging the efficient score vector directly. Here, we consider our ATE estimator as a by-product of the GATE procedure underpinning the theoretical motivation of our framework. While the smoothed three-step estimator is first order asymptotically equivalent to directly averaging the efficient score vector, it might possess better finite sample properties since it does not directly rely on propensity score weights. However, the finite sample performance may crucially rely on the bandwidth choice and the set of covariates in Z. We regard this as yet another interesting direction for further research.

References

Abadie, A. (2005). Semiparametric difference-in-differences estimators. Review of Economic Studies 72 (1), 1–19.
Abrevaya, J. (2006). Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21 (4), 489–519.
Abrevaya, J., Hsu, Y.-C., & Lieli, R. P. (2015). Estimating conditional average treatment effects. Journal of Business and Economic Statistics 33 (4), 485–505.
Athey, S. & Imbens, G. (2019). Machine Learning Methods Economists Should Know About. Version 1. arXiv: .
Athey, S. & Imbens, G. W. (2017). The state of applied econometrics: causality and policy evaluation. The Journal of Economic Perspectives 31 (2), 3–32.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics 47 (2), 1148–1178.
Belloni, A. & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2), 521–547.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81 (2), 608–650.
Bickel, P. J., Klaassen, C. A., Ritov, Y., & Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155 (2), 138–154.
Chernozhukov, V. & Semenova, V. (2018). Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment Effect and Other Structural Functions. Version 2. arXiv: .
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), C1–C68.
Chernozhukov, V., Fernández-Val, I., & Luo, Y. (2018b). The sorted effects method: discovering heterogeneous effects beyond their averages. Econometrica 86 (6), 1911–1938.
da Veiga, P. V. & Wilder, R. P. (2008). Maternal smoking during pregnancy and birthweight: a propensity score matching approach. Maternal and Child Health Journal 12 (2), 194–203.
Fan, Q., Hsu, Y.-C., Lieli, R. P., & Zhang, Y. (2019). Estimation of Conditional Average Treatment Effects with High-Dimensional Data. Version 1. arXiv: .
Farrell, M. H., Liang, T., & Misra, S. (2018). Deep Neural Networks for Estimation and Inference: Application to Causal Effects and Other Semiparametric Estimands. Version 2. arXiv: .
Frölich, M. (2004). Finite-sample properties of propensity-score matching and weighting estimators. The Review of Economics and Statistics 86 (1), 77–90.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66 (2), 315–331.
Hahn, J. & Ridder, G. (2013). Asymptotic variance of semiparametric estimators with generated regressors. Econometrica 81 (1), 315–340.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71 (4), 1161–1189.
Huber, M., Lechner, M., & Wunsch, C. (2013). The performance of estimators based on the propensity score. Journal of Econometrics 175 (1), 1–21.
Kennedy, E. H., Ma, Z., McHugh, M. D., & Small, D. S. (2017). Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society Series B Statistical Methodology 79 (4), 1229–1245.
Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence. Version 2. arXiv: .
Lechner, M. (2018). Modified Causal Forests for Estimating Heterogeneous Causal Effects. Version 2. arXiv: .
Lee, S., Okui, R., & Whang, Y.-J. (2016). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics 32 (7), 1207–1225.
Li, Q. & Racine, J. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.
Lunceford, J. & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23 (19), 2937–2960.
Luo, Y. & Spindler, M. (2016). High-Dimensional L2 Boosting: Rate of Convergence. Version 2. arXiv: .
Mammen, E., Rothe, C., & Schienle, M. (2012). Nonparametric regression with nonparametrically generated covariates. The Annals of Statistics 40 (2), 1132–1170.
Newey, W. K. (1994a). Kernel estimation of partial means and a general variance estimator. Econometric Theory 10 (2), 233–253.
Newey, W. K. (1994b). The asymptotic variance of semiparametric estimators. Econometrica 62 (6), 1349–1382.
Newey, W. K. & McFadden, D. (1994). "Large sample estimation and hypothesis testing." Handbook of Econometrics. Vol. 4. Elsevier Science B.V. Chap. 36, 2113–2245.
Pagan, A. & Ullah, A. (1999). Nonparametric Econometrics. Cambridge: Cambridge University Press.
Robins, J. M., Rotnitzky, A., & Zhao, P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89 (427), 846–866.
Rubin, D. & van der Laan, M. J. (2007). A doubly robust censoring unbiased transformation. The International Journal of Biostatistics 3 (1), Article 4.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5), 688–701.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer Science+Business Media.
Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), 1228–1242.
Wager, S. & Walther, G. (2016). Adaptive Concentration of Regression Trees, with Application to Random Forests. Version 3. arXiv: .
Walker, M., Tekin, E., & Wallace, S. (2009). Teen smoking and birth outcomes. Southern Economic Journal 75 (3), 892–907.
Zimmert, M. (2018). Efficient Difference-in-Differences Estimation with High-Dimensional Common Trend Confounding. Version 4. arXiv: .

A Proof of Theorems

A.1 Proof of Theorem 1

We can write
\[
\hat\tau - \tau
= \frac{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \psi(W_i, \hat p, \hat m_0, \hat m_1)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)} - \tau
\]
\[
= \underbrace{\frac{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \left(\psi(W_i, p, m_0, m_1) - \tau\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{i}
+ \underbrace{\frac{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \left(\psi(W_i, \hat p, \hat m_0, \hat m_1) - \psi(W_i, p, m_0, m_1)\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{ii}.
\]

Influence function

Denote $\bar\psi_i = \psi(W_i, \hat p, \hat m_0, \hat m_1) - \psi(W_i, p, m_0, m_1)$. Then the second term can be further expanded as
\[
ii = \underbrace{\frac{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{iia}
+ \underbrace{\frac{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{iib}
\]
and therefore
\[
\left|\sqrt{N h^{\lambda_Z}}\, ii\right| \le \left|\sqrt{N h^{\lambda_Z}}\, iia\right| + \left|\sqrt{N h^{\lambda_Z}}\, iib\right|.
\]

Bounding iia

We first of all notice that
\[
\left|\sqrt{N h^{\lambda_Z}}\, iia\right| \le \left|\hat f(z)^{-1}\right| \times \left|\frac{\sqrt{N}}{\sqrt{h^{\lambda_Z}}}\, E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, \hat p, \hat m_0, \hat m_1) - \psi(W, p, m_0, m_1)\right)\right]\right|,
\]
where $\hat f(z) = \frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)$ is the denominator of the estimator, and
\[
\left|\hat f(z)^{-1}\right| \le \sup_{z \in \mathcal{Z}} \left|\hat f(z)^{-1}\right| = \left|\frac{1}{\inf_{z \in \mathcal{Z}} \hat f(z)}\right| \le C = O(1)
\]
by Assumption 5.

Further, under the sample splitting procedure used
\[
E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, \hat p, \hat m_0, \hat m_1) - \psi(W, p, m_0, m_1)\right)\right]
= E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, \hat p, \hat m_0, \hat m_1) - \psi(W, p, m_0, m_1)\right) \,\middle|\, \{W_i\}_{i \in \mathcal{I}_l^c}\right]
\]
\[
\le \sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right)\right].
\]
Define the Gâteaux derivative of the generic function $g$ in the direction $[p^* - p,\, m_0^* - m_0,\, m_1^* - m_1]$ by $\partial_{[p^* - p,\, m_0^* - m_0,\, m_1^* - m_1]}\, g$. Then using Taylor's expansion we can write
\[
E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right)\right]
= \partial_{[p^* - p,\, m_0^* - m_0,\, m_1^* - m_1]} E\left[K\left(\frac{Z - z}{h}\right) \psi(W, p, m_0, m_1)\right]
+ \frac{1}{2}\, \partial^2_{[p^* - p,\, m_0^* - m_0,\, m_1^* - m_1]} E\left[K\left(\frac{Z - z}{h}\right) \psi(W, p, m_0, m_1)\right] + \dots
\]
For the first order term we get
\[
\partial_{[p^* - p,\, m_0^* - m_0,\, m_1^* - m_1]} E\left[K\left(\frac{Z - z}{h}\right) \psi(W, p, m_0, m_1)\right]
\]
\[
= E\Bigg[K\left(\frac{Z - z}{h}\right) \Bigg( -\left(\frac{D (Y - m_1(X))}{p(X)^2} + \frac{(1 - D)(Y - m_0(X))}{(1 - p(X))^2}\right) (p^*(X) - p(X))
+ \left(\frac{1 - D}{1 - p(X)} - 1\right) (m_0^*(X) - m_0(X)) + \left(1 - \frac{D}{p(X)}\right) (m_1^*(X) - m_1(X)) \Bigg)\Bigg] = 0
\]
by the Law of Iterated Expectations and using the fact that $Z \subseteq X$.

For the second order term we get
\[
\frac{1}{2}\, \partial^2_{[p^* - p,\, m_0^* - m_0,\, m_1^* - m_1]} E\left[K\left(\frac{Z - z}{h}\right) \psi(W, p, m_0, m_1)\right]
\]
\[
= E\Bigg[K\left(\frac{Z - z}{h}\right) \Bigg( \left(\frac{D (Y - m_1(X))}{p(X)^3} - \frac{(1 - D)(Y - m_0(X))}{(1 - p(X))^3}\right) (p^*(X) - p(X))^2
+ \frac{1 - D}{(1 - p(X))^2} (p^*(X) - p(X)) (m_0^*(X) - m_0(X))
+ \frac{D}{p(X)^2} (p^*(X) - p(X)) (m_1^*(X) - m_1(X)) \Bigg)\Bigg]
\]
\[
= E\Bigg[K\left(\frac{Z - z}{h}\right) \Bigg( \frac{1}{1 - p(X)} (p^*(X) - p(X)) (m_0^*(X) - m_0(X)) + \frac{1}{p(X)} (p^*(X) - p(X)) (m_1^*(X) - m_1(X)) \Bigg)\Bigg]
\]
\[
\le \|K(u)\|_\infty \times \left\| E\left[\frac{1}{1 - p(X)} (p^*(X) - p(X)) (m_0^*(X) - m_0(X)) + \frac{1}{p(X)} (p^*(X) - p(X)) (m_1^*(X) - m_1(X)) \,\middle|\, Z\right] \right\|_1
\]
\[
\le C \left\| \frac{1}{1 - p(X)} (p^*(X) - p(X)) (m_0^*(X) - m_0(X)) + \frac{1}{p(X)} (p^*(X) - p(X)) (m_1^*(X) - m_1(X)) \right\|_1
\]
\[
\le C \times \|p^*(X) - p(X)\|_2 \times \left(\|m_0^*(X) - m_0(X)\|_2 + \|m_1^*(X) - m_1(X)\|_2\right),
\]
which follows from Hölder's and Jensen's inequality, $\|K(u)\|_\infty = O(1)$ in Assumption 5 and Assumption 4. All higher order terms can be shown to be dominated by the second order term under the boundedness Assumption 7. Therefore
\[
E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, \hat p, \hat m_0, \hat m_1) - \psi(W, p, m_0, m_1)\right)\right] = O(\epsilon_p \epsilon_{m_0} + \epsilon_p \epsilon_{m_1})
\]
and
\[
\left|\sqrt{N h^{\lambda_Z}}\, iia\right| = O\left(\sqrt{N}\, h^{-\lambda_Z / 2} \times (\epsilon_p \epsilon_{m_0} + \epsilon_p \epsilon_{m_1})\right).
\]

Bounding iib

We can write
\[
\left|\sqrt{N h^{\lambda_Z}}\, iib\right|
= \left|\frac{\frac{1}{\sqrt{N h^{\lambda_Z}}} \sum_{i=1}^{N} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}\right|
\]
\[
\le \frac{1}{\sqrt{h^{\lambda_Z}}} \left|\hat f(z)^{-1}\right| \left|\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right|
\le \frac{1}{\sqrt{h^{\lambda_Z}}}\, C \left|\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right|,
\]
which follows again from Assumption 5. The convergence of the last factor remains to be shown. Since $L$ is a fixed integer that is independent of $N$, it suffices to show that the term converges for any $l \in [L]$. More formally
\[
\left|\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right|
\le \max_{l \in [L]} \left|\frac{\sqrt{nL}}{n} \sum_{i \in \mathcal{I}_l} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right|,
\]
where $\mathcal{I}_l$ is the set of observations in subsample $l$ and $\mathcal{I}_l^c$ is the set of observations not in subsample $l$.

Under the sample splitting procedure we have
\[
E\left[\left|\frac{1}{\sqrt{n}} \sum_{i \in \mathcal{I}_l} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right|^2\right]
= E\left[\left|\frac{1}{\sqrt{n}} \sum_{i \in \mathcal{I}_l} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right|^2 \,\middle|\, \{W_i\}_{i \in \mathcal{I}_l^c}\right]
\]
\[
\le \sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} E\left[\left|K\left(\frac{Z - z}{h}\right) \left(\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right)\right|^2\right]
\]
\[
\le \sup_u |K(u)|^2 \times \sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} \left\| E\left[\left(\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right)^2 \,\middle|\, Z\right] \right\|_1
\]
\[
\le \|K(u)\|_\infty^2 \sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} E\left[\left|\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right|^2\right]
\le C \sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} E\left[\left|\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right|^2\right]
\]
by Hölder's inequality, Jensen's inequality and Assumption 5. Now
\[
\sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} \left(E\left[\left|\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right|^2\right]\right)^{1/2}
\]
\[
= \sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} \Bigg(E\Bigg[\Bigg|\frac{D (Y - m_1^*(X))}{p^*(X)} - \frac{(1 - D)(Y - m_0^*(X))}{1 - p^*(X)} + m_1^*(X) - m_0^*(X)
- \frac{D (Y - m_1(X))}{p(X)} + \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} - m_1(X) + m_0(X)\Bigg|^2\Bigg]\Bigg)^{1/2}
\]
\[
\le \sup \|m_0^*(X) - m_0(X)\|_2 + \sup \|m_1^*(X) - m_1(X)\|_2
+ \sup \left\|\frac{D (Y - m_1^*(X))}{p^*(X)} - \frac{D (Y - m_1(X))}{p(X)}\right\|_2
+ \sup \left\|\frac{(1 - D)(Y - m_0^*(X))}{1 - p^*(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)}\right\|_2,
\]
where every supremum is taken over $p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1$, and by defining $U = D (Y - m_1(X))$
\[
\left\|\frac{D (Y - m_1^*(X))}{p^*(X)} - \frac{D (Y - m_1(X))}{p(X)}\right\|_2
= \left\|\frac{1}{p(X)\, p^*(X)} \left(D (Y - m_1^*(X))\, p(X) - D (Y - m_1(X))\, p^*(X)\right)\right\|_2
\]
\[
\le c^{-2} \left\|D (Y - m_1^*(X))\, p(X) - D (Y - m_1(X))\, p^*(X)\right\|_2
= c^{-2} \left\|D\, p(X) (m_1(X) - m_1^*(X)) + U (p(X) - p^*(X))\right\|_2
\]
\[
\le c^{-2} \|m_1(X) - m_1^*(X)\|_2 + c^{-2} \|U (p(X) - p^*(X))\|_2.
\]
Using
\[
\|U (p(X) - p^*(X))\|_2 = \sqrt{E\left[\left(U (p(X) - p^*(X))\right)^2\right]} = \sqrt{E\left[E[U^2 \mid X]\, (p(X) - p^*(X))^2\right]}
\]
and a similar argument for the other term, by Assumption 6 we get
\[
\sup \|m_0^*(X) - m_0(X)\|_2 = \epsilon_{m_0}, \qquad
\sup \|m_1^*(X) - m_1(X)\|_2 = \epsilon_{m_1},
\]
\[
\sup \left\|\frac{D (Y - m_1^*(X))}{p^*(X)} - \frac{D (Y - m_1(X))}{p(X)}\right\|_2 = O(\epsilon_{m_1} + \epsilon_p), \qquad
\sup \left\|\frac{(1 - D)(Y - m_0^*(X))}{1 - p^*(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)}\right\|_2 = O(\epsilon_{m_0} + \epsilon_p).
\]
It follows that
\[
\sup_{p^* \in \mathcal{P},\, m_0^* \in \mathcal{M}_0,\, m_1^* \in \mathcal{M}_1} \left(E\left[\left|\psi(W, p^*, m_0^*, m_1^*) - \psi(W, p, m_0, m_1)\right|^2\right]\right)^{1/2} = O\left(\max(\epsilon_p, \epsilon_{m_0}, \epsilon_{m_1})\right) = O(\epsilon_{\max}).
\]
By Markov's inequality and the fact that $L$ is a constant independent of $N$ it follows that
\[
\left|\frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(K\left(\frac{z_i - z}{h}\right) \bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right) \bar\psi\right]\right)\right| \le C \times \epsilon_{\max}
\]
and therefore
\[
\left|\sqrt{N h^{\lambda_Z}}\, iib\right| = O\left(h^{-\lambda_Z / 2}\, \epsilon_{\max}\right).
\]
Collecting terms, we can write
\[
\sqrt{N h^{\lambda_Z}}\, (\hat\tau - \tau)
= \frac{\frac{1}{\sqrt{N h^{\lambda_Z}}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \left(\psi(W_i, p, m_0, m_1) - \tau\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}
+ O\left(\sqrt{N}\, h^{-\lambda_Z / 2} \times (\epsilon_p \epsilon_{m_0} + \epsilon_p \epsilon_{m_1}) + h^{-\lambda_Z / 2}\, \epsilon_{\max}\right).
\]
Under the convergence conditions in Assumption 8 the first claim of the theorem is verified.

Asymptotic normality

Notice that under the standard conditions provided in Assumptions 5 and 8 on the nonparametric regression (see for example Pagan and Ullah, 1999, chapter 2)
\[
\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \to_p f(z).
\]
Therefore, we can rewrite the influence function as
\[
\sqrt{N h^{\lambda_Z}}\, (\hat\tau - \tau) = \frac{1}{\sqrt{N h^{\lambda_Z}}\, f(z)} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \left(\psi(W_i, p, m_0, m_1) - \tau\right) + o(1)
\]
\[
= \underbrace{\frac{1}{\sqrt{N h^{\lambda_Z}}\, f(z)} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \left(\psi(W_i, p, m_0, m_1) - E[\psi(W_i, p, m_0, m_1) \mid Z = z_i]\right)}_{ia}
+ \underbrace{\frac{1}{\sqrt{N h^{\lambda_Z}}\, f(z)} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \left(E[\psi(W_i, p, m_0, m_1) \mid Z = z_i] - \tau\right)}_{ib} + o(1).
\]
The second term is the bias of the nonparametric regression estimator scaled with the convergence rate. Thus, Assumption 8 implies $ib = O\left(\sqrt{N h^{\lambda_Z}}\, h^r\right) = o(1)$. Under the usual assumptions on the existence of higher order moments in Assumption 5, we can apply the Lyapunov Central Limit Theorem to $ia$ as in Pagan and Ullah (1999, chapter 3.4). Then
\[
\sqrt{N h^{\lambda_Z}}\, (\hat\tau - \tau) \to_d N\left(0,\; \frac{\int K(u)^2\, du \times E\left[\left(\psi(W_i, p, m_0, m_1) - \tau\right)^2 \mid Z = z\right]}{f(z)}\right).
\]
q.e.d.

A.2 Proof of Theorem 2

Similar to the proof of Theorem 1 we can write
\[
\hat\theta - \theta = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} K\left(\frac{z_j - z_i}{h}\right) \psi(W_j, \hat p, \hat m_0, \hat m_1)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)} - \theta
\]
\[
= \underbrace{\frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} K\left(\frac{z_j - z_i}{h}\right) \psi(W_j, p, m_0, m_1)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)} - \theta}_{i}
+ \underbrace{\frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} K\left(\frac{z_j - z_i}{h}\right) \left(\psi(W_j, \hat p, \hat m_0, \hat m_1) - \psi(W_j, p, m_0, m_1)\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}}_{ii}.
\]

Bounding ii

Using the notation from the proof of Theorem 1 again leads to
\[
ii = \underbrace{\frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}}_{iia}
+ \underbrace{\frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} \left(K\left(\frac{z_j - z_i}{h}\right) \bar\psi_j - E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}}_{iib}.
\]
Then
\[
|iia| = \left|\frac{1}{N} \sum_{i=1}^{N} \frac{\frac{1}{h^{\lambda_Z}} E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}\right|
\le \sup_i \left|\frac{\frac{1}{h^{\lambda_Z}} E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}\right|
\le \frac{1}{h^{\lambda_Z}} \sup_{z \in \mathcal{Z}} \left|\hat f(z)^{-1}\right| \times \sup_i \left|E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]\right|.
\]
By the same steps as in the proof of Theorem 1 we obtain
\[
E\left[K\left(\frac{Z - z}{h}\right) \left(\psi(W, \hat p, \hat m_0, \hat m_1) - \psi(W, p, m_0, m_1)\right)\right] = O(\epsilon_p \epsilon_{m_0} + \epsilon_p \epsilon_{m_1})
\]
and therefore
\[
\left|\sqrt{N}\, iia\right| = O\left(\frac{\sqrt{N}}{h^{\lambda_Z}}\, \epsilon_p \times (\epsilon_{m_0} + \epsilon_{m_1})\right).
\]
Also for $iib$ we find that
\[
\left|\sqrt{N}\, iib\right| = \left|\frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{\sqrt{N}\, h^{\lambda_Z}} \left(K\left(\frac{z_j - z_i}{h}\right) \bar\psi_j - E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}\right|
\]
\[
\le \frac{1}{h^{\lambda_Z}} \sup_{z \in \mathcal{Z}} \left|\hat f(z)^{-1}\right| \times \sup_i \left|\frac{1}{\sqrt{N}} \sum_{j=1}^{N} \left(K\left(\frac{z_j - z_i}{h}\right) \bar\psi_j - E\left[K\left(\frac{Z - z_i}{h}\right) \bar\psi\right]\right)\right|
\le C\, \frac{\epsilon_{\max}}{h^{\lambda_Z}}.
\]
Hence, for the overall term we have
\[
\left|\sqrt{N}\, ii\right| = O\left(\frac{\sqrt{N}}{h^{\lambda_Z}}\, \epsilon_p \times (\epsilon_{m_0} + \epsilon_{m_1}) + \frac{\epsilon_{\max}}{h^{\lambda_Z}}\right) = o(1)
\]
under the coupled convergence conditions of Assumption 8′.

Bounding i

For $i$ notice that $\sqrt{N}\, i = \sqrt{N} (\tilde\theta - \theta)$, where
\[
\tilde\theta = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} K\left(\frac{z_j - z_i}{h}\right) \psi(W_j, p, m_0, m_1)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}.
\]
Thus, term $i$ gives the contribution of estimating the nonparametric projection of the vector $\psi = \psi(W, p, m_0, m_1)$ with population nuisance parameters on $Z \in \mathcal{Z}$. To derive the influence function of the estimator $\tilde\theta$, we follow Newey's (1994b) Proposition 4, which holds under the condition that the first stage nonparametric estimator is bounded by any norm (and some further regularity conditions). For example, using Assumption 8′ we have
\[
N^{1/4} \left\| \frac{\sum_{j=1}^{N} \frac{1}{N h^{\lambda_Z}} K\left(\frac{z_j - z}{h}\right) \psi(W_j, p, m_0, m_1)}{\frac{1}{N h^{\lambda_Z}} \sum_{j=1}^{N} K\left(\frac{z_j - z}{h}\right)} - \tau(z) \right\|_2 = o(1),
\]
such that Assumption 5.1 in Newey (1994b) is satisfied for the $L_2$ norm. In particular, we notice that the influence function $\phi$ is composed of the moment condition and an adjustment term. The moment condition of the problem is given by
\[
E[\tilde\tau(Z) - \theta] = 0
\]
with $\tilde\tau(Z) = E[\psi \mid Z]$.

Denote the general family of distributions of $W = (Y, D, X, Z)$ as $\mathcal{F} = \{F(W)\}$. Further, denote by $F_\beta(W) \in \mathcal{F}$ a subfamily of $\mathcal{F}$ that is a path in $\mathcal{F}$ indexed by $\beta$. Also let $F_0$ be the true distribution of $W$. Accordingly, $W$ is realized with density $f_\beta(W)$ when $\beta = 0$. Additionally, define $E_\beta[g(W)] = \int g(W) f_\beta(W)\, dw$ for the generic function $g(\cdot)$ and $\tilde\tau(Z, \beta) = E_\beta[\psi \mid Z]$. Following the steps of Proposition 4 in Newey (1994b) indicates that one should evaluate the derivative
\[
\frac{\partial}{\partial\beta} E[\tilde\tau(Z, \beta)]
\]
at $\beta = 0$. By the Chain Rule we have
\[
\frac{\partial}{\partial\beta} E_\beta[\tilde\tau(Z, \beta)] = \frac{\partial}{\partial\beta} E_\beta[\tilde\tau(Z)] + \frac{\partial}{\partial\beta} E[\tilde\tau(Z, \beta)]
\]
at $\beta = 0$.
Furthermore, for any $\bar\tau(Z)$ and for some path $\beta$ we have the mean-square projection optimization problem
\[
\tilde\tau(Z, \beta) = \arg\min_{\bar\tau} E_\beta\left[\left(\frac{D (Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \bar\tau(Z)\right)^2\right],
\]
giving the first order condition
\[
E_\beta\left[\frac{D (Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \tilde\tau(Z, \beta)\right] = 0.
\]
Define $S(W) = \frac{\partial}{\partial\beta} \ln f_\beta(W)$ at $\beta = 0$. Then combining the two previous results gives
\[
\frac{\partial}{\partial\beta} E[\tilde\tau(Z, \beta)] = \frac{\partial}{\partial\beta} E_\beta[\tilde\tau(Z, \beta)] - \frac{\partial}{\partial\beta} E_\beta[\tilde\tau(Z)]
\]
\[
= \frac{\partial}{\partial\beta} E_\beta\left[\frac{D (Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \tilde\tau(Z)\right]
\]
\[
= E\left[\left(\frac{D (Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \tilde\tau(Z)\right) S(W)\right]
\]
at $\beta = 0$. It follows that the adjustment term is given by $\psi(W, p, m_0, m_1) - \tilde\tau(Z)$ and the influence function has the form
\[
\phi = \tilde\tau(Z) - \theta + \psi(W, p, m_0, m_1) - \tilde\tau(Z)
= \frac{D (Y - m_1(X))}{p(X)} - \frac{(1 - D)(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \theta.
\]
Combining $i$ and $ii$ gives
\[
\sqrt{N} (\hat\theta - \theta) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left(\frac{d_i (y_i - m_1(x_i))}{p(x_i)} - \frac{(1 - d_i)(y_i - m_0(x_i))}{1 - p(x_i)} + m_1(x_i) - m_0(x_i) - \theta\right) + o(1)
\]
such that by the Central Limit Theorem we obtain
\[
\sqrt{N} (\hat\theta - \theta) \to_d N\left(0,\; E\left[\frac{Var(Y \mid D = 1, X)}{p(X)} + \frac{Var(Y \mid D = 0, X)}{1 - p(X)} + \left(m_1(X) - m_0(X) - \theta\right)^2\right]\right).
\]
q.e.d.

B Details on the bandwidth ranges

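Using the notation implied by Definition 2, notice that the convergence conditions in Assumption 8 imply the following system of inequalities.

$$\begin{aligned}
\tfrac{1}{2} \lambda_Z \delta_h - \delta_{\epsilon \max} &< 0 \\
\tfrac{1}{2} + \tfrac{1}{2} \lambda_Z \delta_h - (\delta_p + \delta_{m_d}) &< 0 \\
1 - \lambda_Z \delta_h - 2 r \delta_h &< 0 \\
-\delta_h &< 0 \\
1 - \delta_h \lambda_Z &> 0
\end{aligned}$$

The third and fourth inequalities give

$$\delta_h > \frac{1}{\lambda_Z + 2r} > 0.$$

Further, the other inequalities imply

$$\delta_h < \frac{2(\delta_p + \delta_{m_d}) - 1}{\lambda_Z} < \frac{2 \delta_{\epsilon \max}}{\lambda_Z} < \frac{1}{\lambda_Z}.$$

It therefore follows that the possible range of the bandwidth can be described by

$$\frac{1}{\lambda_Z + 2r} < \delta_h < \frac{2(\delta_p + \delta_{m_d}) - 1}{\lambda_Z}.$$

The range for ATE follows similarly from the convergence conditions in Assumption 8′.

C Details on the empirical example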

C.1 Covariates in the dataset

Table 3: Description of covariates in the dataset

| variable | description | newly created | smalldata | alldata | mean |
|---|---|---|---|---|---|
| Y | infant birth weight in grams | no | yes | yes | 3361.68 |
| D | =1 if mother smoked during pregnancy | no | yes | yes | 0.19 |
| mmarried | =1 if mother is married | no | no | yes | 0.70 |
| mhisp | =1 if mother is hispanic | no | yes | yes | 0.03 |
| fhisp | =1 if father is hispanic | no | no | yes | 0.04 |
| foreign | =1 if mother born abroad | no | no | yes | 0.05 |
| alcohol | =1 if alcohol consumed during pregnancy | no | yes | yes | 0.03 |
| deadkids | =1 if previous birth where newborn died | no | yes | yes | 0.26 |
| mage | mother's age | no | yes | yes | 26.50 |
| medu | mother's educational attainment | no | yes | yes | 12.69 |
| fage | father's age | no | no | yes | 27.27 |
| fedu | father's educational attainment | no | no | yes | 12.31 |
| nprenatal | number of prenatal care visits | no | yes | yes | 10.76 |
| mrace | =1 if mother is white | no | yes | yes | 0.84 |
| frace | =1 if father is white | no | no | yes | 0.81 |
| prenatal1 | =1 if first prenatal visit in first trimester | no | yes | yes | 0.80 |
| prenatal2 | =1 if first prenatal visit in second trimester | yes | no | yes | 0.15 |
| prenatal3 | =1 if first prenatal visit in third trimester | yes | no | yes | 0.05 |
| order1 | =1 if first infant | yes | yes | yes | 0.44 |
| order2 | =1 if second infant | yes | no | yes | 0.34 |
| order3 | =1 if j th infant with j ≥ | | | | |

Sample with N = 4642 observations with 864 treated and 3778 non-treated. ‘smalldata’ indicates that the variable was also used in Lee et al. (2016). ‘newly created’ indicates that the variable was additionally created from the original dataset by the authors. ‘alldata’ contains the specification used for the estimation results in Section 4.

C.2 Additional sensitivity analysis

Figure 4: Sensitivity to kernel order (age). Panels: (a) Fourth order kernel function; (b) Sixth order kernel function. Horizontal axis: mother's age.

Results were obtained as described in Procedure 1 with a fourth and sixth order Gaussian kernel function and 0 . × LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic confidence bands are at the 95% level.

Panels: (a) 0 . × CV choice; (b) 0 . × CV choice; (c) 0 . × CV choice; (d) 0 . × CV choice; (e) 1 . × CV choice; (f) 1 . × CV choice. Horizontal axis: care visits.
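The sensitivity checks above vary the kernel order and rescale a cross-validated bandwidth. The sketch below illustrates both ingredients under stated assumptions: it uses the standard fourth and sixth order Gaussian kernels (polynomial corrections to the normal density), which may differ from the exact family used by the authors, and the data, the reference bandwidth `h_ref` and the multipliers `0.8`, `1.0`, `1.2` are hypothetical stand-ins rather than values from the paper.

```python
import numpy as np

def gaussian_kernel(u, order=2):
    """Gaussian kernel of order 2, 4 or 6 (standard higher-order forms)."""
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    if order == 2:
        return phi
    if order == 4:
        return 0.5 * (3.0 - u**2) * phi
    if order == 6:
        return 0.125 * (15.0 - 10.0 * u**2 + u**4) * phi
    raise ValueError("order must be 2, 4 or 6")

def kernel_moment(order, power):
    """Riemann-sum approximation of the integral of u**power * K(u)."""
    grid = np.linspace(-12.0, 12.0, 240001)
    step = grid[1] - grid[0]
    return float(np.sum(grid**power * gaussian_kernel(grid, order)) * step)

def nw_estimate(psi, z, z_eval, h, order=4):
    """Kernel regression of psi on scalar z at the points z_eval."""
    u = (z[None, :] - z_eval[:, None]) / h
    k = gaussian_kernel(u, order)       # weights may be negative for order > 2
    return (k @ psi) / k.sum(axis=1)

# Hypothetical pseudo-outcomes psi regressed on a hypothetical covariate z.
rng = np.random.default_rng(1)
n = 400
z = rng.uniform(0.0, 10.0, n)
psi = 1.0 + 0.5 * z + rng.normal(size=n)
z_eval = np.linspace(3.0, 7.0, 5)

h_ref = 1.0                             # stand-in for a cross-validated bandwidth
estimates = {
    (order, c): nw_estimate(psi, z, z_eval, c * h_ref, order)
    for order in (4, 6)
    for c in (0.8, 1.0, 1.2)            # illustrative multipliers only
}
```

A kernel of order $r$ integrates to one and has vanishing moments of orders $1, \dots, r-1$, which `kernel_moment` can be used to check numerically. Because higher-order kernels take negative values, the denominator in `nw_estimate` can become small in sparse regions, so evaluation points are kept in the interior of the support here.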