Nonparametric estimation of causal heterogeneity under high-dimensional confounding
Michael Zimmert † and Michael Lechner ‡
SEW-HSG, Swiss Institute for Empirical Economic Research, University of St.Gallen, Switzerland
Abstract
This paper considers the practically important case of nonparametrically estimating heterogeneous average treatment effects that vary with a limited number of discrete and continuous covariates in a selection-on-observables framework where the number of possible confounders is very large. We propose a two-step estimator for which the first step is estimated by machine learning. We show that this estimator has desirable statistical properties like consistency, asymptotic normality and rate double robustness. In particular, we derive the coupled convergence conditions between the nonparametric and the machine learning steps. We also show that estimating population average treatment effects by averaging the estimated heterogeneous effects is semi-parametrically efficient. The new estimator is illustrated with an empirical example of the effects of mothers’ smoking during pregnancy on the resulting birth weight.
JEL classification:
C14, C21
Keywords: causal machine learning, effect heterogeneity, group average treatment effects, semiparametric efficiency, ensemble learning

† Email: [email protected]

Introduction
Recently, new machine learning based estimators have shown immense potential for systematically uncovering causal effect heterogeneity, so that there is now a rapidly growing literature on this topic (e.g., see the overviews in Athey and Imbens, 2019; Athey and Imbens, 2017 and Knaus, Lechner, and Strittmatter, 2018). In the context of heterogeneity, the aggregation level at which the heterogeneity is estimated plays an important role. Most papers in this literature focus on a selection-on-observables framework and investigate estimators for the heterogeneity at the lowest aggregation level to uncover possible heterogeneities to the largest extent possible. While this finest level of causal granularity is obviously of interest, Chernozhukov, Fernández-Val, and Luo (2018b) and Lechner (2018) argue for analysing heterogeneity at higher levels, so-called ‘Group Average Treatment Effects’ (GATEs). Such aggregates can be estimated more precisely, may be far more easily interpretable by researchers in substantive terms, and are more useful for decision makers. In particular, some subgroup heterogeneities are of limited value per se because it is hard to justify a decision or policy based on certain characteristics (race, gender etc.). Therefore, decision makers are often only interested in effect heterogeneities based on a rather small subset of available covariates. This paper suggests an approach that is based on statistical-learning assisted estimation of the GATEs for the various discrete and continuous variables of interest, and subsequent nonparametric aggregation of the GATEs to obtain ‘Average Treatment Effects’ (ATEs).

More technically speaking, in effect heterogeneity analysis covariates do not (only) serve the purpose of making identifying assumptions credible. They become part of the outcome analysis by discriminating different subgroups of units for which the effect is of interest.
Further, whenever new observations enter the sample, the covariate realizations could be used to predict a causal effect. The set of covariates to be included in the statistical model to explore effect heterogeneity is therefore not a statistical but rather a substantive decision.

The estimation of subgroup-specific effects is a tedious task when there is confounding. In such settings, causal effects are typically only identified if the researcher includes the confounding covariates in the statistical model as well. Hence, the identifying assumptions dictate which covariates have to be included. In empirical research based on selection-on-observables, the credibility of causal effect estimation often depends on a very large set of possible covariates with very many possible functional forms. Assessing qualitatively, in a non-systematic fashion, which covariates should ultimately enter the model and in which specific form is prone to be flawed.

Lechner (2018) also proposed this aggregation idea. However, that paper considered only a version of a Causal Forest, while here we are in principle agnostic with respect to the machine learning method used. Furthermore, it considered only GATEs based on discrete variables and thus GATEs were obtained as unweighted within-cell means.

[...] already dates back to Hahn (1998). He suggested estimating a nonparametric outcome regression on the set of covariates that needs to be controlled for. Averaging over the conditional means leads to estimators of ATE that attain the semiparametric efficiency bound. In practice, however, nonparametric regression with many covariates is hardly feasible because the convergence rate of nonparametric methods decreases exponentially with the number of covariates included. Recently, Wager and Athey (2018) follow the same ideas as in Hahn (1998) but use Causal Forests instead of standard nonparametric regression.
Athey, Tibshirani, and Wager (2019) and Lechner (2018) modify the Random Forest algorithm to better adjust for confounding and improve precision. Outcome-based models adjust for confounding and infer heterogeneous effects in a single estimation step. Therefore, in all of these approaches inference for effect heterogeneity relies on a dimension of the covariate space that is fixed. Given the previous discussion, this might be a very strong assumption.

In this paper we follow an alternative approach in the literature. The two distinct roles of the covariates (adjusting for confounding and estimating heterogeneous effects) are explicitly reflected in a two-step estimation procedure for the GATEs. This idea is conceptually not new in the literature. In the context of difference-in-differences estimation, Abadie (2005) shows that propensity-score-weighted outcomes can be used as a dependent variable in a second stage regression on the covariates that are of interest for heterogeneous effects. Abrevaya, Hsu, and Lieli (2015) use a similar idea in the standard selection-on-observables setting. They provide inferential results for nonparametric and parametric propensity score first stages with nonparametric second stages. In line with other results in the literature on average effects (Hirano, Imbens, and Ridder, 2003, Robins, Rotnitzky, and Zhao, 1994, Lunceford and Davidian, 2004), they show that the variance for Inverse Probability Weighting (IPW) estimators can be substantially decreased when the propensity score is estimated nonparametrically. Since their second stage also relies on nonparametric regression, the validity of their asymptotic results requires jointly choosing two kernel bandwidths which have to be in a rather small feasible interval. Lee, Okui, and Whang (2016) augment the model of Abrevaya et al.
(2015) by including outcome projections (Augmented IPW, AIPW) and show that when both the propensity score and the outcome projections are estimated parametrically, one can treat the nuisance parameters as if they were known. Their asymptotic results with parametric first stages are then equivalent to those of Abrevaya et al. (2015) with nonparametric propensity score [...]

We avoid the imprecise term ‘Conditional Average Treatment Effect’ because it is unclear which conditioning set is actually meant.

A few days before this work appeared first on arXiv, Fan, Hsu, Lieli, and Zhang (2019) published their independent work on arXiv (up to that moment unknown to us) that uses similar ideas about aggregation and machine learning.

While Abrevaya et al. (2015) use local constant nonparametric regression, Lee et al. (2016) show their results with local linear nonparametric regression.

[...] confounder space. This enables our proposed estimator to be robust against functional form misspecification and to remain consistent even if the number of covariates relative to the sample size is large. In particular, we provide a generic statistical framework such that the convergence rate requirements of the first stage nuisance estimation are coupled with the kernel bandwidth second stage nonparametric convergence.

Additionally, we link our identification and estimation result to semiparametric efficiency theory by providing a new estimator for the ATE that can be estimated as a by-product of the GATEs. The estimator aggregates over all point estimates of the GATEs. We show that under certain convergence conditions for the kernel bandwidth, asymptotically it hits the variance lower bound of the semiparametric estimation problem. We therefore also contribute to the small literature on three-step semiparametric ATE estimation.
Specifically, Hahn and Ridder (2013) (for an alternative theoretical development see also Mammen, Rothe, and Schienle, 2012) investigate a related set-up showing that nonparametric regression on an estimated propensity score can lead to efficient estimation of ATE. To the best of our knowledge, this paper is, however, the first that analyses the asymptotic properties of averaging a transformed outcome projection instead of the outcome projection on the propensity score. Like propensity score matching, our three-step estimator might have better finite-sample properties than two-step estimators that share the same first-order asymptotic properties (Robins et al., 1994, Hirano et al., 2003, Chernozhukov et al., 2018a) because the propensity score weights are subject to an additional smoothing step. Unlike propensity score matching, IPW or Hahn’s (1998) estimator, the proposed ATE estimator remains feasible when the dimension of the confounders entering the model is high.

In general the term ‘high-dimensional’ refers to the fact that the dimension of the model can grow with the sample size. We will provide specific rate conditions in the main part of the paper.
Suppose that we observe an independent and identically distributed random sample $\{w_i\}_{i=1}^{N}$ with sample size $N$ where $w_i = (y_i, d_i, x_i, z_i)$. Denote with uppercase letters a variable and with lowercase letters its realizations. Then $Y$ is the outcome variable and $D$ is the binary treatment of interest. To describe causal effects, we use Rubin’s (1974) potential outcome notation such that $Y^d$ is the outcome that would have been observed under treatment $D = d$. Further, $X$ is a matrix of observed covariates with support $\mathcal{X}$ and $Z \subseteq X$ is a set of predefined variables, with support $\mathcal{Z}$, for which the researcher is interested in effect heterogeneity. Also let $X \in \mathbb{R}^{\dim X}$ and $Z \in \mathbb{R}^{\dim Z}$ and denote $\lambda_X = \dim X$ and $\lambda_Z = \dim Z$. Potentially we have that $\lambda_X \to \infty$ when $N \to \infty$ whereas $\lambda_Z$ is fixed. Hence, we explicitly allow for models where $X$ is high-dimensional but the dimension of the subset of covariates that is of interest for the heterogeneity analysis does not grow with the sample size. We remain agnostic about the underlying cumulative distribution $F = F(W)$ from which the sample of $W = (Y, D, X, Z)$ is drawn and just assume that it exists with density $f = f(W)$.

The main parameter of interest in this study is the GATE defined as

$$\tau(z) = E\left[Y^1 - Y^0 \mid Z = z\right].$$

Since we want to avoid usually unrealistic parametric assumptions on the underlying DGP, we allow for a flexible function $\psi(W, \cdot)$ such that GATE is identified as

$$\tau(z) = E\left[\psi(W, \cdot) \mid Z = z\right].$$

The concrete growth rates of $\lambda_X$ in relation to $N$ will be discussed in Section 3.

[...]

$$\theta = E\left[E\left[\psi(W, \cdot) \mid Z = z\right]\right]$$

implying Hahn’s (1998) efficient score function for ATE

$$\psi(W, p, m_1, m_0) = \frac{D\,(Y - m_1(X))}{p(X)} - \frac{(1 - D)\,(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X)$$

where $p(X) = E[D \mid X]$ denotes the propensity score and $m_d(X) = E[Y \mid X, D = d]$ for $d \in \{0, 1\}$ denotes the conditional expectations of the outcome in the treatment-specific subpopulations.
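As a small illustration of the score function above, the following sketch evaluates $\psi(W, p, m_1, m_0)$ observation by observation. The function name `aipw_score` and the array-based interface are assumptions made for this example and do not come from the paper; the nuisance values are taken as given.

```python
import numpy as np


def aipw_score(y, d, p, m1, m0):
    """Evaluate psi(W, p, m1, m0) for each observation.

    y  : outcomes
    d  : binary treatment indicators (0 or 1)
    p  : propensity scores p(x_i), assumed to lie strictly between 0 and 1
    m1 : outcome projections m_1(x_i) for the treated
    m0 : outcome projections m_0(x_i) for the untreated
    """
    y, d, p, m1, m0 = (np.asarray(a, dtype=float) for a in (y, d, p, m1, m0))
    treated_part = d * (y - m1) / p                  # weighted residual, treated
    control_part = (1.0 - d) * (y - m0) / (1.0 - p)  # weighted residual, untreated
    return treated_part - control_part + m1 - m0
```

For a treated unit the score is the outcome-model contrast plus the inverse-probability-weighted residual of the treated outcome model; for an untreated unit the weighted residual of the untreated model is subtracted instead.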
For identification of GATE and ATE we make the following assumptions.
Assumption 1 (Conditional independence). $Y^1, Y^0 \perp D \mid X = x \quad \forall x \in \mathcal{X}$

Assumption 2 (Stable Unit Treatment Value Assumption (SUTVA)). $Y = D Y^1 + (1 - D) Y^0$

Assumption 3 (Exogeneity of confounders). $X^1 = X^0$

Assumption 4 (Common support). $c < p(X) < 1 - c$ for some small positive constant $c$.

Assuming that appropriate moments exist, then for GATE we have

$$\begin{aligned}
\tau(z) &= E\left[E\left(Y^1 - Y^0 \mid X\right) \mid Z = z\right] \\
&= E\left[\frac{D\,(Y - m_1(X))}{p(X)} - \frac{(1 - D)\,(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) \,\Big|\, Z = z\right] \\
&= E\left[\frac{D Y}{p(X)} - \frac{(1 - D)\,Y}{1 - p(X)} \,\Big|\, Z = z\right] \\
&= E\left[m_1(X) - m_0(X) \mid Z = z\right].
\end{aligned}$$

The exposition shows that IPW and outcome-based estimands are embedded in the estimand based on $\psi(W, p, m_1, m_0)$. Finally, by noticing that

$$\theta = E[\tau(Z)]$$

identification of ATE trivially follows from these considerations.

The identification results from the preceding section suggest a two-step estimation strategy. The details of our proposed estimator are described in Procedure 1. In a first step, a sample plug-in version of $\psi(W, p, m_1, m_0)$ can be obtained by estimating the nuisance parameters. In a second step, the $\psi$-vector can be projected on $Z$. Our goal is to estimate both stages as flexibly as possible and to avoid parametric assumptions. Further, our estimator can cope with settings where $\lambda_X$ is very large, which precludes the use of classical nonparametric and parametric methods to estimate the first stage nuisances $p(x)$, $m_1(x)$ and $m_0(x)$. However, we can use a large class of supervised machine learning algorithms that have been shown to be very effective predictors for such types of tasks. Following the suggestions of Chernozhukov et al. (2018a) we apply a cross-fitting algorithm for the nuisance parameter estimation step in order to guarantee that the resulting estimator of $\psi(W, p, m_1, m_0)$ consists of independent observations. The requirements for the second stage estimation step are more demanding as this estimator should allow for valid inference. To estimate GATE flexibly, we apply nonparametric local constant regression in the second step.

Procedure 1. GATE estimation

Introduce the subsample index $l = 1, ..., L$ and denote the corresponding information set by $I_l$ as well as its complement by $I_l^C$.

1. Randomly split the sample in equally sized subsamples $1, ..., L$.
2. for $l = 1$ to $L$ do
   Estimate the propensity score $p(x)$ and the outcome projections $m_1(x)$ and $m_0(x)$ in the sample with $I_l^C$ using any suitable machine learning method or an ensemble of them.
   Predict $\hat{p}(x)$, $\hat{m}_1(x)$ and $\hat{m}_0(x)$ in the sample with $I_l$.
   end
3. Denote $\hat{p} = \hat{p}_{l=1,...,L}$, $\hat{m}_1 = \hat{m}_{1,l=1,...,L}$ and $\hat{m}_0 = \hat{m}_{0,l=1,...,L}$. Then construct the vector with elements $\hat{\psi} = \psi(W_i, \hat{p}, \hat{m}_1, \hat{m}_0)$ for $i = 1, ..., N$ and estimate GATE as

$$\hat{\tau}(z) = \frac{\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \psi(W_i, \hat{p}, \hat{m}_1, \hat{m}_0)}{\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}$$

where $K = K(u)$ is some kernel function that depends on a bandwidth $h$.

We now investigate the theoretical properties of our proposed estimation procedure. To ease the notational burden, we start with some definitions.
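Before the formal definitions, the two steps of Procedure 1 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cross_fit_scores`, `gate_local_constant` and `ols_fit_predict` are hypothetical, $Z$ is taken to be univariate, the kernel is a second-order Gaussian, and ordinary least squares stands in for "any suitable machine learning method" purely to keep the example self-contained.

```python
import numpy as np


def ols_fit_predict(x_train, target_train, x_test):
    """Stand-in learner (an assumption of this sketch): least squares with intercept."""
    a_train = np.column_stack([np.ones(len(x_train)), x_train])
    coef, *_ = np.linalg.lstsq(a_train, target_train, rcond=None)
    return np.column_stack([np.ones(len(x_test)), x_test]) @ coef


def cross_fit_scores(y, d, x, fit_predict_p, fit_predict_m, n_folds=2, seed=0):
    """Steps 1 and 2: cross-fitted nuisance predictions, then the score vector."""
    y, d, x = np.asarray(y, float), np.asarray(d, float), np.asarray(x, float)
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds      # random, (nearly) equally sized folds
    p_hat, m1_hat, m0_hat = np.empty(n), np.empty(n), np.empty(n)
    for l in range(n_folds):
        test = folds == l                     # I_l: predict here
        train = ~test                         # complement of I_l: estimate here
        p_hat[test] = fit_predict_p(x[train], d[train], x[test])
        treated, control = train & (d == 1), train & (d == 0)
        m1_hat[test] = fit_predict_m(x[treated], y[treated], x[test])
        m0_hat[test] = fit_predict_m(x[control], y[control], x[test])
    return (d * (y - m1_hat) / p_hat
            - (1.0 - d) * (y - m0_hat) / (1.0 - p_hat)
            + m1_hat - m0_hat)


def gate_local_constant(psi, z, z_grid, h):
    """Step 3: local constant (Nadaraya-Watson) regression of psi on univariate z."""
    u = (np.asarray(z)[None, :] - np.asarray(z_grid)[:, None]) / h
    w = np.exp(-0.5 * u**2)                   # Gaussian kernel; constants cancel
    return (w @ psi) / w.sum(axis=1)
```

In use, `fit_predict_p` and `fit_predict_m` would be replaced by callables wrapping the learner of choice, with the predicted propensity scores kept away from zero and one in line with the common support assumption.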
Definition 1 (Norms). Denote by $\|g(X)\|_p$ the $L_p$ norm of the generic function $g(\cdot)$. Further denote the supremum norm by $\sup_{X \in \mathcal{X}} |g(X)| = \|g(X)\|_\infty$.

Definition 2 (Rates). The nuisance parameter first stage estimates $\hat{p}$, $\hat{m}_1$ and $\hat{m}_0$ obtained by the sample splitting procedure described above belong to the realization sets $\mathcal{P}$, $\mathcal{M}_1$ and $\mathcal{M}_0$ with probability $1 - o(1)$. For any realization $p^*$, $m_1^*$ and $m_0^*$ in the sets define the rates

$$\epsilon_{m_d} = \sup_{m_d^* \in \mathcal{M}_d} \|m_d^*(X) - m_d(X)\|_2, \qquad \epsilon_p = \sup_{p^* \in \mathcal{P}} \|p^*(X) - p(X)\|_2, \qquad \epsilon_{\max} = \max\{\epsilon_{m_1}, \epsilon_{m_0}, \epsilon_p\}.$$

Definition 3 (Scaling factor). For any function $g$ define a scaling parameter $\delta_g$ that determines $g = O\left(N^{-\delta_g}\right)$.

We then make the following standard assumptions on the kernel regression step (see for example Pagan and Ullah, 1999).
Assumption 5 (Kernel regression).
1. $Z = z$ is a point in the interior of the support $\mathcal{Z}$.
2. The density function estimator is uniformly bounded away from zero such that $\inf_{z \in \mathcal{Z}} \hat{f}(z) \geq C$ where $C > 0$ is a generic constant.
3. The kernel function $K(u)$ is $r$ times continuously differentiable, symmetric and of order $r$ in the sense $\int u^{r-1} K(u)\,du = 0$ and $\int u^{r} K(u)\,du = O(1)$ for $r \in \mathbb{N}$.
4. $f(z)$ and $E(\psi(W, p, m_1, m_0) \mid Z = z)$ are $r$ times continuously differentiable.
5. Further the kernel function satisfies (i) $\int K(u)\,du = 1$, (ii) $\int |K(u)|^{C}\,du = O(1)$ for any $C > 0$, (iii) $|u||K(u)| \to 0$ as $|u| \to \infty$, (iv) $\|K(u)\|_\infty = O(1)$ and (v) $\int K^2(u)\,du = O(1)$.

Assumption 5 comprises the standard nonparametric local constant regression assumptions allowing for multiple covariates and higher-order kernels. For illustrative purposes multivariate regression results are derived assuming the same bandwidth for every regressor. Further, we have to make boundedness assumptions on the second moment of the sample error of the outcome model and on the nuisance prediction errors.

Assumption 6 (Boundedness of conditional variances). The conditional variances of the outcome models are bounded such that they obey $E\left[(DY - m_1(X))^2 \mid X\right] = O(1)$ and $E\left[((1 - D)Y - m_0(X))^2 \mid X\right] = O(1)$.

Assumption 7 (Boundedness of convergence rates). The nuisance parameter prediction errors are bounded such that they obey $\sup_{m_d^* \in \mathcal{M}_d} \|m_d^*(X) - m_d(X)\|_\infty = O(1)$ for $d \in \{0, 1\}$ and $\sup_{p^* \in \mathcal{P}} \|p^*(X) - p(X)\|_\infty = O(1)$.

Additionally, the convergence rates of our first stage nuisance parameter prediction and the second stage nonparametric regression are assumed to be as follows:
Assumption 8 (Coupled convergence (GATE)). The bandwidth $h$ and the sample size $N$ jointly converge such that
(i) $h = o(1)$, $N h^{\lambda_Z} \to \infty$ as $N \to \infty$ and
(ii) $\sqrt{N h^{\lambda_Z}}\, h^{r} = o(1)$.
Further, $N$ and $h$ satisfy the joint convergence conditions with the nuisance parameter convergence rates
(iii) $\sqrt{h^{-\lambda_Z}}\, \epsilon_{\max} = o(1)$
(iv) $\sqrt{N h^{-\lambda_Z}}\, \epsilon_{m_1} \epsilon_p + \sqrt{N h^{-\lambda_Z}}\, \epsilon_{m_0} \epsilon_p = o(1)$.

Assumption 8 comprises the coupled convergence rate assumptions that are at the centre of our theoretical results. Conditions (i) and (ii) quantify how the bandwidth has to converge to zero in relation to the sample size $N$ and the number of regressors $\lambda_Z$. As usual, the bandwidth has to go to zero but more slowly than the sample size grows to infinity. Also the bandwidth has to be chosen such that the asymptotic bias term vanishes faster than the variance. This allows the Central Limit Theorem to be applied and makes the estimator asymptotically unbiased. In particular, condition (ii) requires undersmoothing in the sense that the bandwidth has to shrink faster than the mean squared error (MSE) optimal rate. As discussed for example in Pagan and Ullah (1999), choosing a higher-order kernel mitigates the problem.

Conditions (iii) and (iv) state that $h$ has to be chosen such that first stage convergence rates vanish fast enough. In particular, by condition (iv) the joint convergence rates from propensity score and outcome projection estimation have to vanish faster than $\sqrt{N}$ scaled with the kernel bandwidth. Since $h \to$ [...] $Z = z$. Thus, since sample observations enter the estimator in a weighted form, the prediction precision needed is that for the lower effective sample size around $Z = z$. Hence, the first stage prediction guarantees need to adapt to these smaller-sample conditions and therefore achieve a faster joint rate of convergence in terms of the sample size $N$. Condition (iii) additionally prevents the worst rate from becoming arbitrarily slow, especially when $\lambda_Z$ is larger than one.
Still, our estimator has a ‘rate’ double robustness feature in the sense that joint rates can vanish relatively slowly but all single first stage rates are restricted from converging very slowly. $L_2$ convergence rates of many supervised machine learning methods satisfy these properties under sparsity conditions. For example Belloni and Chernozhukov (2013) show that the predictive error of the Lasso is of order $O\left(\sqrt{\frac{s \log \max(\lambda_X, N)}{N}}\right)$ where $s$ is the unknown number of true coefficients in the oracle model. Suppose that $s$ and $\lambda_X$ are equal in the outcome and the propensity score models; then we require $\frac{s \log \max(\lambda_X, N)}{\sqrt{N h^{\lambda_Z}}} \to 0$, so that $\lambda_X$ can grow with the effective sample size $N h^{\lambda_Z}$. Similar rates can be shown for $L_2$ boosting (Luo and Spindler, 2016) and nonlinear models like Random Forests (Wager and Walther, 2016) or forms of Deep Neural Nets (Farrell, Liang, and Misra, 2018).

A natural question is then whether a bandwidth exists that satisfies the rate conditions in Assumption 8. Indeed, one can show (for more details see Appendix B) that the theoretical range of possible bandwidth choices can be described by

$$\frac{1}{\lambda_Z + 2r} < \delta_h < \frac{2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1}{\lambda_Z}$$

and we obtain a condition for the order of the kernel

$$r > \frac{\lambda_Z \left(1 - (\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}})\right)}{2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1}.$$

For example, if we restrict ourselves to second-order kernel functions then for $\lambda_Z = 1$ we require $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}} > \frac{3}{5}$. Similarly, for $\lambda_Z = 2$ and $\lambda_Z = 3$, $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}} > \frac{2}{3}$ and $\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}} > \frac{5}{7}$ are required respectively. Thus, for a growing dimension $\lambda_Z$ the joint rate condition for the first stage nuisance parameters approaches the parametric rate.

For the concrete dependence of sparsity conditions on the parameters of the predictors see the references mentioned.
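The bandwidth range can be checked numerically. The sketch below assumes the bounds read $1/(\lambda_Z + 2r) < \delta_h < (2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1)/\lambda_Z$, which is how the display is read here; the function names are hypothetical. Requiring the interval to be non-empty and solving for the joint first stage rate gives the threshold $(\lambda_Z + r)/(\lambda_Z + 2r)$, which is an algebraic rearrangement made for this example rather than a formula quoted from the paper.

```python
from fractions import Fraction


def bandwidth_exponent_range(lam_z, r, joint_rate):
    """Return (lower, upper) bounds on delta_h, or None if no bandwidth is feasible.

    lam_z      : dimension of Z
    r          : order of the kernel
    joint_rate : delta_{eps_p} + delta_{eps_m_d}, the joint first stage rate exponent
    """
    lower = Fraction(1, lam_z + 2 * r)
    upper = (2 * Fraction(joint_rate) - 1) / lam_z
    return (lower, upper) if lower < upper else None


def minimal_joint_rate(lam_z, r):
    """Joint rate exponent that must be strictly exceeded for the range to be non-empty."""
    return Fraction(lam_z + r, lam_z + 2 * r)
```

With a second-order kernel, `minimal_joint_rate(lam_z, 2)` traces out how the required joint rate rises towards one, the parametric case, as the dimension of $Z$ grows.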
Theorem 1.
Under Assumptions 1-8 our proposed estimation procedure for GATE obeys

$$\sqrt{N h^{\lambda_Z}}\,(\hat{\tau} - \tau) = \frac{1}{\sqrt{N h^{\lambda_Z}}} \sum_{i=1}^{N} \frac{K\left(\frac{z_i - z}{h}\right)}{\frac{1}{N h^{\lambda_Z}} \sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)} \left(\psi(W_i, p, m_1, m_0) - \tau\right) + o(1)$$

and

$$\sqrt{N h^{\lambda_Z}}\,(\hat{\tau} - \tau) \to_d N(0, \sigma^2_{GATE})$$

with

$$\sigma^2_{GATE} = \frac{\int K^2(u)\,du \times E\left[(\psi(W_i, p, m_1, m_0) - \tau)^2 \mid Z = z\right]}{f(z)}.$$

Theorem 1 shows that under the assumptions discussed above the speed of convergence is determined only by the nonparametric regression step. In particular, it does not depend on the first stage estimation steps. An equivalent result can also be achieved by using IPW with nonparametric first stages (see Abrevaya et al., 2015). However, this requires an additional bandwidth choice for the first stage propensity score regression and is limited to the case where $\lambda_X$ is also very small. The dimension of $X$ can be increased under functional form assumptions for the first stage. However, as shown by the authors, the price to pay is an increase in the asymptotic variance. This is not the case for the estimator proposed in this paper. Also, Theorem 1 is valid under generally weaker conditions compared to the results in Lee et al. (2016) for parametric first stages. Heuristically, if the first stage estimators converge at rate $\sqrt{N}$ then our conditions on the bandwidth are satisfied and our asymptotic results continue to apply. To this extent, our results comprise the result of Lee et al. (2016) as a special case.

In contrast to Lee et al. (2016), we use local constant instead of local linear regression and introduce cross-fitting for nuisance parameter estimation. This should, however, not be a concern for the intuitive argument made.

Joint estimation of GATE and ATE

Given the considerations so far, it appears ‘natural’ to estimate ATE in three steps as an average of GATEs in the sample. The details of the proposed estimator are described in Procedure 2. As suggested in Chernozhukov et al. (2018a) one could also directly estimate ATE as the average of the vector $\psi(W, p, m_1, m_0)$ using the first stage nuisance parameter predictions. However, as a by-product of GATE estimation, using an additional kernel smoothing step may lead to an ATE estimator with better finite sample properties. In particular, the propensity score weights do not enter the last step of our estimator directly and our hope is that small misspecification errors of propensity scores close to zero or one are therefore smoothed out. The sensitivity of estimators incorporating inverse propensity score weights directly is the subject of many Monte Carlo experiments (e.g. Huber, Lechner, and Wunsch, 2013, and Frölich, 2004). We notice that similar reasoning is also behind three-step estimators, often used in practice, that apply nonparametric regression on an estimated propensity score.

Procedure 2. ATE estimation

1. Follow steps 1-3 of Procedure 1.
2. Predict GATE at every observation in the sample as $\hat{\tau}(z_j)$.
3. Estimate ATE as

$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} \hat{\tau}(z_j).$$

To obtain our theoretical results we have to modify Assumption 8 slightly.
Assumption 8′ (Coupled convergence (ATE)). The bandwidth $h$ and the sample size $N$ jointly converge such that
(i) $h = o(1)$, $N h^{\lambda_Z} \to \infty$ as $N \to \infty$,
(ii) $\sqrt{N h^{\lambda_Z}}\, h^{r} = o(1)$ and
(iii) $\sqrt{N}\, h^{2r} = o(1)$ and $\sqrt{N}\, h^{\lambda_Z} \to \infty$.
Further, $N$ and $h$ satisfy the joint convergence conditions with the nuisance parameter convergence rates
(iv) $\sqrt{h^{-\lambda_Z}}\, \epsilon_{\max} = o(1)$ and
(v) $\sqrt{N}\, h^{-\lambda_Z} \epsilon_{m_1} \epsilon_p + \sqrt{N}\, h^{-\lambda_Z} \epsilon_{m_0} \epsilon_p = o(1)$.

We notice that averaging over the estimated projection $E[\psi(W, p, m_1, m_0) \mid X]$ is a partial mean problem in the sense of Newey (1994a,b). While parts (i) and (ii) of Assumption 8 remain unchanged, the additional condition (iii) is necessary in order to guarantee that the MSE of the kernel regression estimator scaled with $\sqrt{N}$ converges to zero. In this way we guarantee the applicability of Newey’s (1994b) framework. We could have also assumed uniform convergence rates for the kernel regression step. However, this would involve an unnecessarily strong condition (for a discussion see Newey, 1994b, pp. 1364-1368 and also Newey and McFadden, 1994, p. 2205).

Since we want ATE to converge with a rate of $\sqrt{N}$, the requirements on the first stage convergence rates in condition (v) are more restrictive than those in the respective condition of Assumption 8. The range of theoretically feasible bandwidth choices reduces to

$$\max\left(\frac{1}{4r}, \frac{1}{\lambda_Z + 2r}\right) < \delta_h < \frac{(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - \frac{1}{2}}{\lambda_Z}.$$

Assuming $\frac{1}{4r} < \frac{1}{\lambda_Z + 2r}$ we get a modified condition for the order of the kernel function

$$r > \frac{\lambda_Z \left(\frac{3}{2} - (\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}})\right)}{2(\delta_{\epsilon_p} + \delta_{\epsilon_{m_d}}) - 1}.$$

In general this result indicates that one relies on a higher-order kernel function whenever $\lambda_Z >$ [...]. Under Assumption 8′ we can then derive the following efficiency result.

Theorem 2.
Under Assumptions 1-7 and 8′ and the regularity conditions on the nonparametric second step as in Newey (1994b, pp. 1364-1368) our proposed estimation procedure for ATE has the influence function

$$\frac{D\,(Y - m_1(X))}{p(X)} - \frac{(1 - D)\,(Y - m_0(X))}{1 - p(X)} + m_1(X) - m_0(X) - \theta$$

and therefore obeys

$$\sqrt{N}\,(\hat{\theta} - \theta) \to_d N(0, \sigma^2_{ATE})$$

where $\sigma^2_{ATE}$ is the semiparametric efficiency bound of Hahn (1998).
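Procedure 2 and the inference implied by Theorem 2 can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function name `smoothed_ate` is hypothetical, $Z$ is univariate, the kernel is Gaussian, and the standard error uses the sample variance of the estimated scores as a plug-in for the variance of the influence function, which is a standard choice but is not spelled out in the text.

```python
import numpy as np


def smoothed_ate(psi, z, h):
    """Average the local constant GATE predictions over all sample points.

    psi : cross-fitted scores psi(W_i, p_hat, m1_hat, m0_hat)
    z   : univariate heterogeneity variable
    h   : kernel bandwidth
    Returns (theta_hat, standard_error).
    """
    psi, z = np.asarray(psi, float), np.asarray(z, float)
    u = (z[None, :] - z[:, None]) / h       # u[j, i] = (z_i - z_j) / h
    w = np.exp(-0.5 * u**2)                 # Gaussian kernel; constants cancel
    tau_hat = (w @ psi) / w.sum(axis=1)     # step 2: GATE at every z_j
    theta_hat = tau_hat.mean()              # step 3: average of the GATEs
    # Plug-in standard error based on the influence function psi - theta.
    se = psi.std(ddof=1) / np.sqrt(len(psi))
    return theta_hat, se
```

The weight matrix has $N \times N$ entries, so for large samples the kernel weights would need to be computed in blocks.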
Conceptually, Theorem 2 underpins our intuition from semiparametric theory outlined in Section 2.2. The result shows that indeed every estimator that involves a nonparametric projection of the AIPW modified outcome on any low-dimensional subset of $X$ is consistent, asymptotically normal and achieves the semiparametric efficiency bound. This asymptotic result has also been shown for other estimators already discussed in Section 1. In contrast to Hirano, Imbens, and Ridder’s (2003) estimator, Hahn’s (1998) estimator and matching on the propensity score (Hahn and Ridder, 2013), we do not rely on nonparametrically estimated first stages. Due to the fact that $\lambda_X \to \infty$ these estimators are of no practical use in our setting. Further, unlike AIPW with machine learning nuisance parameter estimation (Chernozhukov et al., 2018a), our estimator involves an additional step. Therefore the inverse propensity score does not directly enter our estimator but is smoothed through the additional nonparametric step. Asymptotically, this does not make any difference as the result in Theorem 2 shows. However, in finite samples this could be a major advantage over the usual AIPW estimator.

We investigate the applicability of our methods using Cattaneo’s (2010) dataset on the effect of cigarette smoking on birthweight available from the Stata website. The dataset contains the outcome variable birthweight in grams ($Y$), whether the mother smoked during pregnancy ($D = 1$) and several covariates on the mother’s health and socio-economic background ($X$). A detailed description of all covariates in the dataset can be found in Appendix C. Applied studies with different estimation approaches unambiguously find negative average effects (see Abrevaya, 2006, da Veiga and Wilder, 2008, Walker, Tekin, and Wallace, 2009). Conditional average treatment effects were investigated by Abrevaya et al. (2015) and Lee et al. (2016) who find that mother’s age is associated with increasingly negative effects of smoking.
We replicate their results and compare their estimators with ours. Clearly, this type of analysis is limited in its scope since the true DGP remains unknown. However, the dataset has the particular advantage that some strong hypotheses about the estimation results are plausible. (i) The effect of smoking on birthweight should be either negative or zero. (ii) The effect should be increasingly negative with mother’s age. As a second example we consider how the effect changes with the number of prenatal care visits. On the one hand a very low number of care visits could indicate the mother’s insufficient access to medical infrastructure and therefore could be associated with particularly negative effects. On the other hand a [...]

The original dataset can be retrieved from here.
Figure 1 depicts the main results of our empirical analysis. We estimate the GATEs as described in Procedure 1 using an ensemble learner comprising Lasso, Ridge, Elastic Net and a Random Forest. The weights of the ensemble are obtained by cross-validating the out-of-sample MSE of the procedure. $X$ in our specification is an extended variable set (‘alldata’) and is documented in full in Appendix C. For example, in contrast to Lee et al. (2016) we also include the available characteristics for the father of the child, since they could be a good predictor for the smoking behaviour of the mother. The covariates enter our model very flexibly. For the penalized regression predictors we allow for polynomials up to order four and all two-way interactions. The Random Forest has the particular advantage of being an ensemble of trees itself and is therefore very flexible by construction. The results are generally in line with the hypotheses made. In particular, the effect of smoking is unambiguously negative over the whole support of mother’s age and prenatal care visits. As expected, the effect increases in magnitude with age. Interestingly, a higher number of prenatal care visits seems to be associated with larger negative effects.

We estimate all our results with second-order Gaussian kernel functions. The same analysis using higher-order Gaussian kernel functions as proposed by Li and Racine (2007) yields similar results (see Appendix C). In practice the biggest challenge is to determine the bandwidth for the nonparametric regression. To achieve undersmoothing, we multiply the bandwidth obtained by leave-one-out cross-validation by 0.9. Since this choice is arbitrary, the stability of our results with respect to this choice is a particular concern. Figure 2 shows that our estimator is relatively robust regarding this choice. A major change in the shape of the function only appears for massive oversmoothing.
An equivalent analysis for prenatal care visits yielding the same conclusion is relegated to Appendix C.

Table 1: Smoothed ATE estimators

Smoothed AIPW (age)    Smoothed AIPW (care visits)    Smoothed AIPW (age, care visits)
-238.937               -235.672                       -236.904
(27.257)               (27.257)                       (27.257)

Results for smoothed AIPW ATE estimation as in Procedure 2 using $Z$ = age, $Z$ = care visits and $Z$ = (age, care visits). Results were obtained with a second-order Gaussian kernel function and a 0.9 × LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic standard errors are in parentheses.
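The 0.9 × LOOCV bandwidth rule mentioned in the table notes can be sketched as below. The grid search over candidate bandwidths, the Gaussian kernel, the univariate $Z$ and the function names are all assumptions of this illustration; the text does not describe how the cross-validation criterion was minimized.

```python
import numpy as np


def loocv_bandwidth(psi, z, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out squared error."""
    psi, z = np.asarray(psi, float), np.asarray(z, float)
    best_h, best_err = None, np.inf
    for h in candidates:
        w = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)
        np.fill_diagonal(w, 0.0)            # leave observation i out of its own fit
        denom = w.sum(axis=1)
        if np.any(denom <= 0.0):            # some point has no neighbours at this h
            continue
        err = np.mean((psi - (w @ psi) / denom) ** 2)
        if err < best_err:
            best_h, best_err = h, err
    return best_h


def undersmoothed_bandwidth(psi, z, candidates, factor=0.9):
    """Scale the LOOCV choice by a factor below one to undersmooth."""
    return factor * loocv_bandwidth(psi, z, candidates)
```

Re-running the GATE estimation with several values of `factor` gives a simple robustness check on the bandwidth choice.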
Figure 1: AIPW GATE estimator with ensemble first stages. Panels: effect by mother's age; effect by care visits.
Results were obtained as described in Procedure 1 with a second-order Gaussian kernel function and a 0.9 × LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic confidence bands are at the 95% level.

Figure 2, panels (a) to (f): effect by mother's age for different multiples of the CV bandwidth choice.
Results were obtained as described in Procedure 1 with a second-order Gaussian kernel function and different multiples of the LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic confidence bands are at the 95% level.

Finally, Table 1 shows the results for ATE estimation as described in Procedure 2. In line with the previous literature mentioned above, the average effect of smoking is estimated to be negative. Crucially, the estimated effect turns out to be very robust regarding the choice of the smoothing variable.
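The smoothed ATE estimator can be read as an average of kernel-regression fits evaluated at every sample point. The sketch below illustrates that reading under stated assumptions: `smoothed_ate` is a hypothetical name, the AIPW scores `psi` are taken as given, Z is univariate, and the standard error uses the plug-in variance of the scores around the estimate (the efficient-score variance). That the smoothing variable does not enter this variance would be one way to see why the standard errors in Table 1 coincide across columns, though this is an interpretation rather than something the text states.

```python
import math

def smoothed_ate(z, psi, h):
    """Average of Nadaraya-Watson fits of psi on z over all sample points.

    Returns (theta_hat, standard_error). The Gaussian normalizing constant
    cancels in the ratio, so unnormalized weights are used.
    """
    n = len(z)
    fitted = []
    for zi in z:
        weights = [math.exp(-0.5 * ((zj - zi) / h) ** 2) for zj in z]
        total = sum(weights)  # always >= 1 because the own weight equals 1
        fitted.append(sum(w * p for w, p in zip(weights, psi)) / total)
    theta = sum(fitted) / n
    # Plug-in variance of the efficient score psi - theta (an assumption of
    # this sketch; the smoothing step does not enter at first order).
    variance = sum((p - theta) ** 2 for p in psi) / n
    return theta, math.sqrt(variance / n)
```

For example, with two tight clusters `z = [0, 0, 10, 10]`, scores `psi = [1, 3, 5, 7]` and `h = 0.1`, the fits are the cluster means 2 and 6, so the estimate is 4.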
A ‘fair’ comparison with other estimators is hardly feasible because our approach does not require specific functional form assumptions. In other words, the related estimators of Abrevaya et al. (2015) and Lee et al. (2016) suppose that they know the true propensity score or outcome projection specifications. Since we cannot compare our estimator against every possible parametric specification, we use the specification selected by Lee et al. (2016) as a benchmark. Figure 3 depicts GATE estimation results using the benchmark models. Strikingly, the IPW based estimator gives implausible results. For mother’s age positive effects of smoking can almost nowhere be excluded. For care visits we do not obtain a GATE

Footnote: We do not consider nonparametric propensity score estimation as suggested in Abrevaya et al. (2015) because it most likely does not allow all potential confounders to be included, which is needed to make Assumption 1 credible.

Figure 3, panels: (a) AIPW linear first stages (age); (b) AIPW linear first stages (care visits); (c) IPW linear first stages (age); (d) IPW linear first stages (care visits).
Results were obtained following the procedures in Abrevaya et al. (2015) and Lee et al. (2016) with a second-order Gaussian kernel function and a 0.9 × LOOCV bandwidth choice. Nuisance parameters were estimated using Logit for the propensity score and OLS for the outcome projections. Asymptotic confidence bands are at the 95% level.
Results for ATE estimation using Inverse Probability Weighting (IPW) and Augmented IPW (AIPW). For the ensemble specification, nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. For the parametric specifications, nuisance parameters were estimated using Logit for the propensity score and OLS for the outcome projections. Asymptotic standard errors are in parentheses.
Table 2 shows the results for ATE estimation. As expected, the results of Procedure 2 in Table 1 are roughly in line with the standard AIPW based ATE estimator with ensemble first stages. The relatively bad performance of IPW based estimation for the GATEs is also reflected in the estimation of the ATE. In particular, the standard error nearly doubles compared to AIPW based estimators and point estimates are reduced. Interestingly, for average effects there seems to be only little value added by the flexible machine learning based estimators compared to the parametric specification.
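The difference between IPW and AIPW based ATE estimation discussed above can be illustrated with a small sketch. It assumes the usual textbook forms of both scores; the helper names (`ipw_scores`, `aipw_scores`, `ate_with_se`) are hypothetical, the toy numbers are made up and unrelated to the birth weight data, and the plug-in standard error treats the nuisance predictions as known, which is a simplification for IPW in particular.

```python
import math

def ipw_scores(y, d, p):
    """Inverse probability weighting scores, one per unit."""
    return [di * yi / pi - (1 - di) * yi / (1 - pi)
            for yi, di, pi in zip(y, d, p)]

def aipw_scores(y, d, p, m1, m0):
    """Augmented IPW (doubly robust) scores, one per unit."""
    return [di * (yi - a1) / pi - (1 - di) * (yi - a0) / (1 - pi) + a1 - a0
            for yi, di, pi, a1, a0 in zip(y, d, p, m1, m0)]

def ate_with_se(scores):
    """Sample mean of the scores and its plug-in standard error."""
    n = len(scores)
    theta = sum(scores) / n
    variance = sum((s - theta) ** 2 for s in scores) / n
    return theta, math.sqrt(variance / n)

# Toy data (illustrative only): the outcome predictions fit perfectly,
# so the AIPW residual terms vanish and the scores show no dispersion.
y = [10.0, 12.0, 9.0, 11.0]
d = [1, 0, 1, 0]
p = [0.5, 0.5, 0.5, 0.5]
m1 = [10.0, 13.0, 9.0, 12.0]
m0 = [9.0, 12.0, 8.0, 11.0]

ipw_theta, ipw_se = ate_with_se(ipw_scores(y, d, p))
aipw_theta, aipw_se = ate_with_se(aipw_scores(y, d, p, m1, m0))
```

In this toy case the IPW scores swing between large positive and negative values, while the AIPW scores are constant. This shows the general mechanism by which the outcome regression adjustment can reduce variance.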
In this study we propose new estimators for specific conditional and average causal effects when the dimension of the covariate space is high. In particular, by discriminating the different roles of covariates (adjusting for confounding vs. measuring causal heterogeneity of interest) in our approach, they can be included very flexibly, without relying on any functional form assumptions. Rather, we show coupled convergence conditions for the different steps involved. The procedures suggested are based on semiparametric efficiency theory. In this sense, our proposed three-step estimator for ATE estimation is shown to reach the semiparametric efficiency bound. A widely used empirical example shows that our estimators are useful in practice. Compared to other estimators their desirable theoretical properties and increased

Footnote: This might also be seen as an implicit test for the credibility of the stronger conditions required for the smoothed estimator compared to the averaged efficient score as in Chernozhukov et al. (2018a).

Z is moderately higher than considered in this paper, while sacrificing only little flexibility.

Finally, it might be worth investigating the finite sample properties of the proposed three-step estimators for ATE compared to averaging the efficient score vector directly. Here, we consider our ATE estimator as a by-product of the GATE procedure underpinning the theoretical motivation of our framework. While the smoothed three-step estimator is first order asymptotically equivalent to directly averaging the efficient score vector, it might possess better finite sample properties since it does not directly rely on propensity score weights. However, the finite sample performance may crucially rely on the bandwidth choice and the set of covariates in Z. We regard this as yet another interesting direction for further research.

References

Abadie, A. (2005). Semiparametric difference-in-differences estimators.
Review of Economic Studies 72 (1), 1–19.
Abrevaya, J. (2006). Estimating the effect of smoking on birth outcomes using a matched panel data approach. Journal of Applied Econometrics 21 (4), 489–519.
Abrevaya, J., Hsu, Y.-C., & Lieli, R. P. (2015). Estimating conditional average treatment effects. Journal of Business and Economic Statistics 33 (4), 485–505.
Athey, S. & Imbens, G. (2019). Machine Learning Methods Economists Should Know About. Version 1. arXiv.
Athey, S. & Imbens, G. W. (2017). The state of applied econometrics: causality and policy evaluation. The Journal of Economic Perspectives 31 (2), 3–32.
Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. The Annals of Statistics 47 (2), 1148–1178.
Belloni, A. & Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli 19 (2), 521–547.
Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies 81 (2), 608–650.
Bickel, P. J., Klaassen, C. A., Ritov, Y., & Wellner, J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Springer.
Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics 155 (2), 138–154.
Chernozhukov, V. & Semenova, V. (2018). Simultaneous Inference for Best Linear Predictor of the Conditional Average Treatment Effect and Other Structural Functions. Version 2. arXiv.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018a). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), C1–C68.
Chernozhukov, V., Fernández-Val, I., & Luo, Y. (2018b). The sorted effects method: discovering heterogeneous effects beyond their averages. Econometrica 86 (6), 1911–1938.
da Veiga, P. V. & Wilder, R. P. (2008). Maternal smoking during pregnancy and birthweight: a propensity score matching approach. Maternal and Child Health Journal 12 (2), 194–203.
Fan, Q., Hsu, Y.-C., Lieli, R. P., & Zhang, Y. (2019). Estimation of Conditional Average Treatment Effects with High-Dimensional Data. Version 1. arXiv.
Farrell, M. H., Liang, T., & Misra, S. (2018). Deep Neural Networks for Estimation and Inference: Application to Causal Effects and Other Semiparametric Estimands. Version 2. arXiv.
Frölich, M. (2004). Finite-sample properties of propensity-score matching and weighting estimators. The Review of Economics and Statistics 86 (1), 77–90.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66 (2), 315–331.
Hahn, J. & Ridder, G. (2013). Asymptotic variance of semiparametric estimators with generated regressors. Econometrica 81 (1), 315–340.
Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71 (4), 1161–1189.
Huber, M., Lechner, M., & Wunsch, C. (2013). The performance of estimators based on the propensity score. Journal of Econometrics 175 (1), 1–21.
Kennedy, E. H., Ma, Z., McHugh, M. D., & Small, D. S. (2017). Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (4), 1229–1245.
Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine Learning Estimation of Heterogeneous Causal Effects: Empirical Monte Carlo Evidence. Version 2. arXiv.
Lechner, M. (2018). Modified Causal Forests for Estimating Heterogeneous Causal Effects. Version 2. arXiv.
Lee, S., Okui, R., & Whang, Y.-J. (2016). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics 32 (7), 1207–1225.
Li, Q. & Racine, J. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.
Lunceford, J. & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine 23 (19), 2937–2960.
Luo, Y. & Spindler, M. (2016). High-Dimensional L2 Boosting: Rate of Convergence. Version 2. arXiv.
Mammen, E., Rothe, C., & Schienle, M. (2012). Nonparametric regression with nonparametrically generated covariates. The Annals of Statistics 40 (2), 1132–1170.
Newey, W. K. (1994a). Kernel estimation of partial means and a general variance estimator. Econometric Theory 10 (2), 233–253.
Newey, W. K. (1994b). The asymptotic variance of semiparametric estimators. Econometrica 62 (6), 1349–1382.
Newey, W. K. & McFadden, D. (1994). “Large sample estimation and hypothesis testing.” Handbook of Econometrics. Vol. 4. Elsevier Science B.V. Chap. 36, 2113–2245.
Pagan, A. & Ullah, A. (1999). Nonparametric Econometrics. Cambridge: Cambridge University Press.
Robins, J. M., Rotnitzky, A., & Zhao, P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89 (427), 846–866.
Rubin, D. & van der Laan, M. J. (2007). A doubly robust censoring unbiased transformation. The International Journal of Biostatistics 3 (1), Article 4.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5), 688–701.
Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer Science+Business Media.
Wager, S. & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), 1228–1242.
Wager, S. & Walther, G. (2016). Adaptive Concentration of Regression Trees, with Application to Random Forests. Version 3. arXiv.
Walker, M., Tekin, E., & Wallace, S. (2009). Teen smoking and birth outcomes. Southern Economic Journal 75 (3), 892–907.
Zimmert, M. (2018). Efficient Difference-in-Differences Estimation with High-Dimensional Common Trend Confounding. Version 4. arXiv.
A Proof of Theorems
A.1 Proof of Theorem 1
We can write
$$
\hat\tau - \tau
= \frac{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(\psi(W_i,\hat p,\hat m_1,\hat m_0) - \tau\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}
$$
$$
= \underbrace{\frac{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(\psi(W_i,p,m_1,m_0) - \tau\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{i}
+ \underbrace{\frac{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(\psi(W_i,\hat p,\hat m_1,\hat m_0) - \psi(W_i,p,m_1,m_0)\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{ii}.
$$

Influence function
Denote $\bar\psi_i = \psi(W_i,\hat p,\hat m_1,\hat m_0) - \psi(W_i,p,m_1,m_0)$. Then the second term can be further expanded as
$$
ii = \underbrace{\frac{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{iia}
+ \underbrace{\frac{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}}_{iib}
$$
and therefore
$$
\left|\sqrt{Nh^{\lambda_Z}}\,ii\right| \le \left|\sqrt{Nh^{\lambda_Z}}\,iia\right| + \left|\sqrt{Nh^{\lambda_Z}}\,iib\right|.
$$

Bounding iia
We first of all notice that
$$
\left|\sqrt{Nh^{\lambda_Z}}\,iia\right| \le \left|\hat f(z)^{-1}\right| \times \left|\frac{\sqrt{N}}{\sqrt{h^{\lambda_Z}}}\,E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,\hat p,\hat m_1,\hat m_0) - \psi(W,p,m_1,m_0)\right)\right]\right|
$$
and
$$
\left|\hat f(z)^{-1}\right| \le \sup_{z\in\mathcal{Z}}\left|\hat f(z)^{-1}\right| = \left|\frac{1}{\inf_{z\in\mathcal{Z}}\hat f(z)}\right| \le C = O(1)
$$
by Assumption 5.

Further, under the sample splitting procedure used
$$
E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,\hat p,\hat m_1,\hat m_0) - \psi(W,p,m_1,m_0)\right)\right]
= E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,\hat p,\hat m_1,\hat m_0) - \psi(W,p,m_1,m_0)\right)\,\Big|\,(W_i)_{i\in\mathcal{I}_l^c}\right]
$$
$$
\le \sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0} E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right)\right].
$$

Define the Gâteaux derivative of the generic function $g$ in the direction $[p^*-p,\,m_1^*-m_1,\,m_0^*-m_0]$ by $\partial_{[p^*-p,\,m_1^*-m_1,\,m_0^*-m_0]}\,g$. Then using Taylor's expansion we can write
$$
E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right)\right]
= \partial_{[p^*-p,\,m_1^*-m_1,\,m_0^*-m_0]}\,E\left[K\left(\frac{Z - z}{h}\right)\psi(W,p,m_1,m_0)\right]
+ \frac{1}{2}\,\partial^2_{[p^*-p,\,m_1^*-m_1,\,m_0^*-m_0]}\,E\left[K\left(\frac{Z - z}{h}\right)\psi(W,p,m_1,m_0)\right] + \dots
$$

For the first order term we get
$$
\partial_{[p^*-p,\,m_1^*-m_1,\,m_0^*-m_0]}\,E\left[K\left(\frac{Z - z}{h}\right)\psi(W,p,m_1,m_0)\right]
$$
$$
= E\Bigg[K\left(\frac{Z - z}{h}\right)\Bigg(-\left(\frac{D(Y - m_1(X))}{p(X)^2} + \frac{(1-D)(Y - m_0(X))}{(1-p(X))^2}\right)(p^*(X) - p(X))
$$
$$
\qquad + \left(\frac{1-D}{1-p(X)} - 1\right)(m_0^*(X) - m_0(X)) + \left(1 - \frac{D}{p(X)}\right)(m_1^*(X) - m_1(X))\Bigg)\Bigg] = 0
$$
by the Law of Iterated Expectations and using the fact that $Z \subseteq X$.
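The vanishing first order term can be checked numerically in a toy setting. The sketch below is only an illustration under stated assumptions: X is discrete with three support points, the true nuisance functions and the perturbation directions are made up, and all names (`expected_score`, `bias`) are hypothetical. Since the score is linear in Y, its expectation can be computed exactly from the conditional means, so no simulation noise is involved. The weights may also be read as probability mass times a kernel weight in Z, since the argument works conditionally on X.

```python
def expected_score(xs, weights, p_true, m1_true, m0_true, p, m1, m0):
    """Exact weighted expectation of the AIPW score evaluated at candidate
    nuisances (p, m1, m0) when the data follow (p_true, m1_true, m0_true)."""
    total = 0.0
    for x, w in zip(xs, weights):
        pt = p_true(x)
        # E[D (Y - m1(X)) | X = x] = pt * (m1_true(x) - m1(x)), similarly for D = 0.
        treated = pt * (m1_true(x) - m1(x)) / p(x)
        control = (1.0 - pt) * (m0_true(x) - m0(x)) / (1.0 - p(x))
        total += w * (treated - control + m1(x) - m0(x))
    return total

# Toy design (all values are illustrative assumptions).
xs = [0.0, 1.0, 2.0]
weights = [0.2, 0.5, 0.3]
p_true = lambda x: 0.3 + 0.2 * x
m1_true = lambda x: 2.0 + x
m0_true = lambda x: 1.0 + 0.5 * x

def bias(t):
    """Bias of the expected score when every nuisance is perturbed by order t."""
    p = lambda x: p_true(x) + 0.1 * t
    m1 = lambda x: m1_true(x) + 1.0 * t
    m0 = lambda x: m0_true(x) - 0.5 * x * t
    target = expected_score(xs, weights, p_true, m1_true, m0_true,
                            p_true, m1_true, m0_true)
    perturbed = expected_score(xs, weights, p_true, m1_true, m0_true, p, m1, m0)
    return perturbed - target
```

If the first order term is zero, shrinking the perturbation by a factor of ten should shrink the bias by roughly a factor of one hundred, so `bias(0.01) / bias(0.001)` should be close to 100.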
For the second order term we get
$$
\frac{1}{2}\,\partial^2_{[p^*-p,\,m_1^*-m_1,\,m_0^*-m_0]}\,E\left[K\left(\frac{Z - z}{h}\right)\psi(W,p,m_1,m_0)\right]
$$
$$
= E\Bigg[K\left(\frac{Z - z}{h}\right)\Bigg(\left(\frac{D(Y - m_1(X))}{p(X)^3} - \frac{(1-D)(Y - m_0(X))}{(1-p(X))^3}\right)(p^*(X) - p(X))^2
$$
$$
\qquad + \frac{1-D}{(1-p(X))^2}(p^*(X) - p(X))(m_0^*(X) - m_0(X)) + \frac{D}{p(X)^2}(p^*(X) - p(X))(m_1^*(X) - m_1(X))\Bigg)\Bigg]
$$
$$
= E\Bigg[K\left(\frac{Z - z}{h}\right)\Bigg(\frac{1}{1-p(X)}(p^*(X) - p(X))(m_0^*(X) - m_0(X)) + \frac{1}{p(X)}(p^*(X) - p(X))(m_1^*(X) - m_1(X))\Bigg)\Bigg]
$$
$$
\le \|K(u)\|_\infty \times \left\|E\left[\frac{1}{1-p(X)}(p^*(X) - p(X))(m_0^*(X) - m_0(X)) + \frac{1}{p(X)}(p^*(X) - p(X))(m_1^*(X) - m_1(X))\,\Big|\,Z\right]\right\|_1
$$
$$
\le C\left\|\frac{1}{1-p(X)}(p^*(X) - p(X))(m_0^*(X) - m_0(X)) + \frac{1}{p(X)}(p^*(X) - p(X))(m_1^*(X) - m_1(X))\right\|_1
$$
$$
\le C \times \|p^*(X) - p(X)\|_2 \times \left(\|m_1^*(X) - m_1(X)\|_2 + \|m_0^*(X) - m_0(X)\|_2\right),
$$
which follows from Hölder's and Jensen's inequalities, $\|K(u)\|_\infty = O(1)$ in Assumption 5, and Assumption 4. All higher order terms can be shown to be dominated by the second order term under the boundedness Assumption 7. Therefore
$$
E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,\hat p,\hat m_1,\hat m_0) - \psi(W,p,m_1,m_0)\right)\right] = O\left(\epsilon_p\epsilon_{m_1} + \epsilon_p\epsilon_{m_0}\right)
$$
and
$$
\left|\sqrt{Nh^{\lambda_Z}}\,iia\right| = O\left(\sqrt{N}\,h^{-\lambda_Z/2} \times \left(\epsilon_p\epsilon_{m_1} + \epsilon_p\epsilon_{m_0}\right)\right).
$$

Bounding iib
We can write
$$
\left|\sqrt{Nh^{\lambda_Z}}\,iib\right|
= \left|\frac{\frac{1}{\sqrt{Nh^{\lambda_Z}}}\sum_{i=1}^{N}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}\right|
$$
$$
\le \frac{1}{\sqrt{h^{\lambda_Z}}}\left|\hat f(z)^{-1}\right|\left|\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right|
\le \frac{1}{\sqrt{h^{\lambda_Z}}}\,C\left|\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right|,
$$
which follows again from Assumption 5. The convergence of the last factor remains to be shown. Since $L$ is a fixed integer that is independent of $N$, it suffices to show that for any $l \in [L]$ the term converges. More formally
$$
\left|\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right|
\le \max_{l\in[L]}\left|\frac{\sqrt{nL}}{n}\sum_{i\in\mathcal{I}_l}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right|,
$$
where $\mathcal{I}_l$ is the set of observations in subsample $l$ and $\mathcal{I}_l^c$ is the set of observations not in subsample $l$.

Under the sample splitting procedure we have
$$
E\left|\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{I}_l}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right|^2
= E\left[\left|\frac{1}{\sqrt{n}}\sum_{i\in\mathcal{I}_l}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right|^2\,\Big|\,(W_i)_{i\in\mathcal{I}_l^c}\right]
$$
$$
\le \sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0} E\left[\left|K\left(\frac{z_i - z}{h}\right)\left(\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right)\right|^2\right]
$$
$$
\le \sup_u\left|K(u)\right|^2 \times \sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0}\left\|E\left[\left(\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right)^2\,\Big|\,Z\right]\right\|_1
$$
$$
\le \|K(u)\|_\infty^2 \sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0} E\left[\left|\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right|^2\right]
\le C \sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0} E\left[\left|\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right|^2\right]
$$
by Hölder's inequality, Jensen's inequality and Assumption 5. Now
$$
\sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0}\left(E\left[\left|\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right|^2\right]\right)^{1/2}
$$
$$
= \sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0}\Bigg(E\Bigg[\Bigg|\frac{D(Y - m_1^*(X))}{p^*(X)} - \frac{(1-D)(Y - m_0^*(X))}{1-p^*(X)} + m_1^*(X) - m_0^*(X)
$$
$$
\qquad - \frac{D(Y - m_1(X))}{p(X)} + \frac{(1-D)(Y - m_0(X))}{1-p(X)} - m_1(X) + m_0(X)\Bigg|^2\Bigg]\Bigg)^{1/2}
$$
$$
\le \sup\left\|m_1^*(X) - m_1(X)\right\|_2 + \sup\left\|m_0^*(X) - m_0(X)\right\|_2
+ \sup\left\|\frac{D(Y - m_1^*(X))}{p^*(X)} - \frac{D(Y - m_1(X))}{p(X)}\right\|_2
+ \sup\left\|\frac{(1-D)(Y - m_0^*(X))}{1-p^*(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)}\right\|_2,
$$
where each supremum is taken over $p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0$. By defining $U = D(Y - m_1(X))$,
$$
\left\|\frac{D(Y - m_1^*(X))}{p^*(X)} - \frac{D(Y - m_1(X))}{p(X)}\right\|_2
= \left\|\frac{1}{p(X)p^*(X)}\left(D(Y - m_1^*(X))p(X) - D(Y - m_1(X))p^*(X)\right)\right\|_2
$$
$$
\le c^{-2}\left\|D(Y - m_1^*(X))p(X) - D(Y - m_1(X))p^*(X)\right\|_2
= c^{-2}\left\|D\,p(X)(m_1(X) - m_1^*(X)) + U(p(X) - p^*(X))\right\|_2
$$
$$
\le c^{-2}\left\|m_1(X) - m_1^*(X)\right\|_2 + c^{-2}\left\|U(p(X) - p^*(X))\right\|_2.
$$
Using
$$
\left\|U(p(X) - p^*(X))\right\|_2 = \sqrt{E\left[\left(U(p(X) - p^*(X))\right)^2\right]} = \sqrt{E\left[E\left[U^2\,|\,X\right](p(X) - p^*(X))^2\right]}
$$
and a similar argument for the other term, by Assumption 6 we get
$$
\sup\left\|m_1^*(X) - m_1(X)\right\|_2 = \epsilon_{m_1},\qquad
\sup\left\|m_0^*(X) - m_0(X)\right\|_2 = \epsilon_{m_0},
$$
$$
\sup\left\|\frac{D(Y - m_1^*(X))}{p^*(X)} - \frac{D(Y - m_1(X))}{p(X)}\right\|_2 = \epsilon_{m_1} + \epsilon_p,\qquad
\sup\left\|\frac{(1-D)(Y - m_0^*(X))}{1-p^*(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)}\right\|_2 = \epsilon_{m_0} + \epsilon_p.
$$
It follows that
$$
\sup_{p^*\in\mathcal{P},\,m_1^*\in\mathcal{M}_1,\,m_0^*\in\mathcal{M}_0} E\left[\left|\psi(W,p^*,m_1^*,m_0^*) - \psi(W,p,m_1,m_0)\right|^2\right] \le \max\left(\epsilon_p^2,\epsilon_{m_1}^2,\epsilon_{m_0}^2\right) = \epsilon_{\max}^2.
$$
By Markov's inequality and the fact that $L$ is a constant independent of $N$, it follows that
$$
\left|\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(K\left(\frac{z_i - z}{h}\right)\bar\psi_i - E\left[K\left(\frac{Z - z}{h}\right)\bar\psi\right]\right)\right| \le C \times \epsilon_{\max}
$$
and therefore
$$
\left|\sqrt{Nh^{\lambda_Z}}\,iib\right| = O\left(h^{-\lambda_Z/2}\,\epsilon_{\max}\right).
$$
Collecting terms, we can write
$$
\sqrt{Nh^{\lambda_Z}}\,(\hat\tau - \tau) = \frac{\frac{1}{\sqrt{Nh^{\lambda_Z}}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(\psi(W_i,p,m_1,m_0) - \tau\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)}
+ O\left(\sqrt{N}\,h^{-\lambda_Z/2} \times \left(\epsilon_p\epsilon_{m_1} + \epsilon_p\epsilon_{m_0}\right) + h^{-\lambda_Z/2}\,\epsilon_{\max}\right).
$$
Under the convergence conditions in Assumption 8 the first claim of the theorem is verified.

Asymptotic normality
Notice that under the standard conditions provided in Assumptions 5 and 8 on the nonparametric regression (see for example Pagan and Ullah, 1999, chapter 2)
$$
\frac{1}{Nh^{\lambda_Z}}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right) \to_p f(z).
$$
Therefore, we can rewrite the influence function as
$$
\sqrt{Nh^{\lambda_Z}}\,(\hat\tau - \tau) = \frac{1}{\sqrt{Nh^{\lambda_Z}}\,f(z)}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(\psi(W_i,p,m_1,m_0) - \tau\right) + o(1)
$$
$$
= \underbrace{\frac{1}{\sqrt{Nh^{\lambda_Z}}\,f(z)}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(\psi(W_i,p,m_1,m_0) - E\left[\psi(W_i,p,m_1,m_0)\,|\,Z = z_i\right]\right)}_{ia}
+ \underbrace{\frac{1}{\sqrt{Nh^{\lambda_Z}}\,f(z)}\sum_{i=1}^{N} K\left(\frac{z_i - z}{h}\right)\left(E\left[\psi(W_i,p,m_1,m_0)\,|\,Z = z_i\right] - \tau\right)}_{ib} + o(1).
$$
The second term is the bias of the nonparametric regression estimator scaled with the convergence rate. Thus, Assumption 8 implies $ib = O\left(\sqrt{Nh^{\lambda_Z}}\,h^r\right) = o(1)$. Under the usual assumptions on the existence of higher order moments in Assumption 5, we can apply the Lyapunov Central Limit Theorem to $ia$ as in Pagan and Ullah (1999, chapter 3.4). Then
$$
\sqrt{Nh^{\lambda_Z}}\,(\hat\tau - \tau) \to_d N\left(0,\ \frac{\int K(u)^2\,du \times E\left[\left(\psi(W_i,p,m_1,m_0) - \tau\right)^2\,|\,Z = z\right]}{f(z)}\right).
$$
q.e.d.

A.2 Proof of Theorem 2
Similar to the proof of Theorem 1 we can write
$$
\hat\theta - \theta = \frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}K\left(\frac{z_j - z_i}{h}\right)\psi(W_j,\hat p,\hat m_1,\hat m_0)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)} - \theta
$$
$$
= \underbrace{\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}K\left(\frac{z_j - z_i}{h}\right)\psi(W_j,p,m_1,m_0)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)} - \theta}_{i}
+ \underbrace{\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}K\left(\frac{z_j - z_i}{h}\right)\left(\psi(W_j,\hat p,\hat m_1,\hat m_0) - \psi(W_j,p,m_1,m_0)\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}}_{ii}.
$$

Bounding ii
Using the notation from the proof of Theorem 1 again leads to
$$
ii = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}}_{iia}
+ \underbrace{\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}\left(K\left(\frac{z_j - z_i}{h}\right)\bar\psi_j - E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}}_{iib}.
$$
Then
$$
|iia| = \left|\frac{1}{N}\sum_{i=1}^{N}\frac{\frac{1}{h^{\lambda_Z}}E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}\right|
\le \sup_i\left|\frac{\frac{1}{h^{\lambda_Z}}E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}\right|
\le \frac{1}{h^{\lambda_Z}}\sup_{z\in\mathcal{Z}}\left|\hat f(z)^{-1}\right| \times \sup_i\left|E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]\right|.
$$
By the same steps as in the proof of Theorem 1 we obtain
$$
E\left[K\left(\frac{Z - z}{h}\right)\left(\psi(W,\hat p,\hat m_1,\hat m_0) - \psi(W,p,m_1,m_0)\right)\right] = O\left(\epsilon_p\epsilon_{m_1} + \epsilon_p\epsilon_{m_0}\right)
$$
and therefore
$$
\left|\sqrt{N}\,iia\right| = O\left(\frac{\sqrt{N}}{h^{\lambda_Z}}\,\epsilon_p \times \left(\epsilon_{m_1} + \epsilon_{m_0}\right)\right).
$$
Also for $iib$ we find that
$$
\left|\sqrt{N}\,iib\right| = \left|\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{\sqrt{N}h^{\lambda_Z}}\left(K\left(\frac{z_j - z_i}{h}\right)\bar\psi_j - E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]\right)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}\right|
$$
$$
\le \frac{1}{h^{\lambda_Z}}\sup_{z\in\mathcal{Z}}\left|\hat f(z)^{-1}\right| \times \sup_i\left|\frac{1}{\sqrt{N}}\sum_{j=1}^{N}\left(K\left(\frac{z_j - z_i}{h}\right)\bar\psi_j - E\left[K\left(\frac{Z - z_i}{h}\right)\bar\psi\right]\right)\right|
\le C\,\frac{\epsilon_{\max}}{h^{\lambda_Z}}.
$$
Hence, for the overall term we have
$$
\left|\sqrt{N}\,ii\right| = O\left(\frac{\sqrt{N}}{h^{\lambda_Z}}\,\epsilon_p \times \left(\epsilon_{m_1} + \epsilon_{m_0}\right) + \frac{\epsilon_{\max}}{h^{\lambda_Z}}\right) = o(1)
$$
under the coupled convergence conditions of Assumption 8′.

Bounding i
For $i$ notice that
$$
\sqrt{N}\,i = \sqrt{N}\left(\tilde\theta - \theta\right),
$$
where
$$
\tilde\theta = \frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}K\left(\frac{z_j - z_i}{h}\right)\psi(W_j,p,m_1,m_0)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z_i}{h}\right)}.
$$
Thus, term $i$ gives the contribution of estimating the nonparametric projection of the vector $\psi = \psi(W,p,m_1,m_0)$ with population nuisance parameters on $Z \in \mathcal{Z}$. To derive the influence function of the estimator $\tilde\theta$, we follow Newey's (1994b) Proposition 4, which holds under the condition that the first stage nonparametric estimator is bounded by any norm (and some further regularity conditions). For example, using Assumption 8′ we have
$$
\sqrt{N}\left\|\frac{\sum_{j=1}^{N}\frac{1}{Nh^{\lambda_Z}}K\left(\frac{z_j - z}{h}\right)\psi(W_j,p,m_1,m_0)}{\frac{1}{Nh^{\lambda_Z}}\sum_{j=1}^{N} K\left(\frac{z_j - z}{h}\right)} - \tau(z)\right\|_2^2 = o(1),
$$
such that Assumption 5.1 in Newey (1994b) is satisfied for the $L_2$ norm. In particular, we notice that the influence function $\phi$ is composed of the moment condition and an adjustment term. The moment condition of the problem is given by
$$
E\left[\tilde\tau(Z) - \theta\right] = 0
$$
with $\tilde\tau(Z) = E[\psi\,|\,Z]$.

Denote the general family of distributions of $W = (Y, D, X, Z)$ as $\mathcal{F} = \{F(W)\}$. Further, denote by $F_\beta(W) \in \mathcal{F}$ a subfamily of $\mathcal{F}$ that is a path in $\mathcal{F}$ indexed by $\beta$. Also let $F_0$ be the true distribution of $W$. Accordingly, $W$ has density $f_\beta(W)$, with the true density obtained at $\beta = 0$. Additionally, define $E_\beta[g(W)] = \int g(W) f_\beta(W)\,dw$ for the generic function $g(\cdot)$ and $\tilde\tau(Z,\beta) = E_\beta[\psi\,|\,Z]$. Following the steps of Proposition 4 in Newey (1994b) indicates that one should evaluate the derivative
$$
\frac{\partial}{\partial\beta}E\left[\tilde\tau(Z,\beta)\right]
$$
at $\beta = 0$. By the chain rule we have
$$
\frac{\partial}{\partial\beta}E_\beta\left[\tilde\tau(Z,\beta)\right] = \frac{\partial}{\partial\beta}E_\beta\left[\tilde\tau(Z)\right] + \frac{\partial}{\partial\beta}E\left[\tilde\tau(Z,\beta)\right]
$$
at $\beta = 0$.

Furthermore, for any $\bar\tau(Z)$ and for some path $\beta$ we have the mean-square projection optimization problem
$$
\tilde\tau(Z,\beta) = \arg\min_{\bar\tau} E_\beta\left[\left(\frac{D(Y - m_1(X))}{p(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)} + m_1(X) - m_0(X) - \bar\tau(Z)\right)^2\right],
$$
giving the first order condition
$$
E_\beta\left[\frac{D(Y - m_1(X))}{p(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)} + m_1(X) - m_0(X) - \tilde\tau(Z,\beta)\right] = 0.
$$
Define $S(W) = \frac{\partial}{\partial\beta}\ln f_\beta(W)$ at $\beta = 0$. Then combining the two previous results gives
$$
\frac{\partial}{\partial\beta}E\left[\tilde\tau(Z,\beta)\right] = \frac{\partial}{\partial\beta}E_\beta\left[\tilde\tau(Z,\beta)\right] - \frac{\partial}{\partial\beta}E_\beta\left[\tilde\tau(Z)\right]
$$
$$
= \frac{\partial}{\partial\beta}E_\beta\left[\frac{D(Y - m_1(X))}{p(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)} + m_1(X) - m_0(X) - \tilde\tau(Z)\right]
$$
$$
= E\left[\left(\frac{D(Y - m_1(X))}{p(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)} + m_1(X) - m_0(X) - \tilde\tau(Z)\right)S(W)\right]
$$
at $\beta = 0$. It follows that the adjustment term is given by $\psi(W,p,m_1,m_0) - \tilde\tau(Z)$ and the influence function has the form
$$
\phi = \tilde\tau(Z) - \theta + \psi(W,p,m_1,m_0) - \tilde\tau(Z)
= \frac{D(Y - m_1(X))}{p(X)} - \frac{(1-D)(Y - m_0(X))}{1-p(X)} + m_1(X) - m_0(X) - \theta.
$$
Combining $i$ and $ii$ gives
$$
\sqrt{N}(\hat\theta - \theta) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(\frac{d_i(y_i - m_1(x_i))}{p(x_i)} - \frac{(1-d_i)(y_i - m_0(x_i))}{1-p(x_i)} + m_1(x_i) - m_0(x_i) - \theta\right) + o(1)
$$
such that by the Central Limit Theorem we obtain
$$
\sqrt{N}(\hat\theta - \theta) \to_d N\left(0,\ E\left[\frac{Var(Y\,|\,D=1,X)}{p(X)} + \frac{Var(Y\,|\,D=0,X)}{1-p(X)} + \left(m_1(X) - m_0(X) - \theta\right)^2\right]\right).
$$
q.e.d.

B Details on the bandwidth ranges
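Using the notation implied by Definition 2, notice that the convergence conditions in Assumption 8 imply the following system of inequalities.
$$
\begin{aligned}
\tfrac{1}{2}\lambda_Z\delta_h - \delta_{\epsilon_{\max}} &< 0\\
\tfrac{1}{2} + \tfrac{1}{2}\lambda_Z\delta_h - (\delta_p + \delta_{m_d}) &< 0\\
\tfrac{1}{2} - \tfrac{1}{2}\lambda_Z\delta_h - r\delta_h &< 0\\
-\delta_h &< 0\\
1 - \delta_h\lambda_Z &> 0
\end{aligned}
$$
The third inequality implies $\delta_h > \frac{1}{\lambda_Z + 2r} > 0$. Further, the other inequalities imply $\delta_h < \frac{2(\delta_p + \delta_{m_d}) - 1}{\lambda_Z} < \frac{2\delta_{\epsilon_{\max}}}{\lambda_Z} < \frac{1}{\lambda_Z}$. It therefore follows that the possible range of the bandwidth can be described by $\frac{1}{\lambda_Z + 2r} < \delta_h < \frac{2(\delta_p + \delta_{m_d}) - 1}{\lambda_Z}$.

The range for ATE follows similarly from the convergence conditions in Assumption 8′.

C Details on the empirical example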
C.1 Covariates in the dataset
Table 3: Description of covariates in the dataset

variable   description                                      newly created  smalldata  alldata  mean
Y          infant birth weight in grams                     no             yes        yes      3361.68
D          =1 if mother smoked during pregnancy             no             yes        yes      0.19
mmarried   =1 if mother is married                          no             no         yes      0.70
mhisp      =1 if mother is hispanic                         no             yes        yes      0.03
fhisp      =1 if father is hispanic                         no             no         yes      0.04
foreign    =1 if mother born abroad                         no             no         yes      0.05
alcohol    =1 if alcohol consumed during pregnancy          no             yes        yes      0.03
deadkids   =1 if previous birth where newborn died          no             yes        yes      0.26
mage       mother's age                                     no             yes        yes      26.50
medu       mother's educational attainment                  no             yes        yes      12.69
fage       father's age                                     no             no         yes      27.27
fedu       father's educational attainment                  no             no         yes      12.31
nprenatal  number of prenatal care visits                   no             yes        yes      10.76
mrace      =1 if mother is white                            no             yes        yes      0.84
frace      =1 if father is white                            no             no         yes      0.81
prenatal1  =1 if first prenatal visit in first trimester    no             yes        yes      0.80
prenatal2  =1 if first prenatal visit in second trimester   yes            no         yes      0.15
prenatal3  =1 if first prenatal visit in third trimester    yes            no         yes      0.05
order1     =1 if first infant                               yes            yes        yes      0.44
order2     =1 if second infant                              yes            no         yes      0.34
order3     =1 if j th infant with j ≥

Sample with N = 4642 observations with 864 treated and 3778 non-treated. ‘smalldata’ indicates that the variable was also used in Lee et al. (2016). ‘newly created’ indicates that the variable was additionally created from the original dataset by the authors. ‘alldata’ contains the specification used for the estimation results in Section 4.

C.2 Additional sensitivity analysis

Figure 4: Sensitivity to kernel order (age). Panels: (a) Fourth order kernel function; (b) Sixth order kernel function. Horizontal axis: mother's age.
Results were obtained as described in Procedure 1 with a fourth and sixth order Gaussian kernel function and 0 . × LOOCV bandwidth choice. Nuisance parameters were estimated using an ensemble learner comprising Lasso, Elastic Net, Ridge and Random Forest. For Lasso, Ridge and Elastic Net the penalty term was chosen such that the cross-validation criterion was minimized. The ensemble weights were chosen by minimizing out-of-sample MSE. Asymptotic confidence bands are at the 95% level.

[Figure panels: (a) 0 . × CV choice; (b) 0 . × CV choice; (c) 0 . × CV choice; (d) 0 . × CV choice; (e) 1 . × CV choice; (f) 1 . × CV choice. Horizontal axis: care visits.]
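The fourth and sixth order Gaussian kernel functions mentioned in the note to Figure 4 can be written as polynomial adjustments of the standard normal density. The sketch below uses the standard textbook forms of these kernels and combines them with a local constant (Nadaraya-Watson) smoother of the doubly robust scores on a single continuous variable. The smoother is an assumption made for illustration and is not necessarily identical to Procedure 1; the names `gaussian_kernel` and `kernel_gate` are hypothetical.

```python
import numpy as np

def gaussian_kernel(u, order=2):
    """Gaussian kernel of order 2, 4 or 6 evaluated at u."""
    u = np.asarray(u, dtype=float)
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    if order == 2:
        return phi
    if order == 4:
        return 0.5 * (3.0 - u**2) * phi
    if order == 6:
        return 0.125 * (15.0 - 10.0 * u**2 + u**4) * phi
    raise ValueError("order must be 2, 4 or 6")

def kernel_gate(z0, z, psi, h, order=4):
    """Local constant smooth of the scores psi at the point z0.

    z   : observed values of the heterogeneity variable
    psi : doubly robust scores, one per observation
    h   : bandwidth (for example a multiple of a cross-validated choice)
    """
    z = np.asarray(z, dtype=float)
    psi = np.asarray(psi, dtype=float)
    weights = gaussian_kernel((z - z0) / h, order)
    # Higher order kernels take negative values, so weights can be negative.
    return np.sum(weights * psi) / np.sum(weights)

# Illustrative call on made-up inputs: smooth constant scores at z0 = 0.5.
z_grid = np.linspace(0.0, 1.0, 11)
estimate = kernel_gate(0.5, z_grid, np.full(11, 3.0), h=0.2, order=4)
```

Higher order kernels reduce smoothing bias at the cost of negative weights, which is one reason the estimated curves can look less regular as the kernel order increases.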