Double machine learning for sample selection models
Michela Bia*, Martin Huber**, and Lukáš Lafférs+
*Luxembourg Institute of Socio-Economic Research, **University of Fribourg, Dept. of Economics, +Matej Bel University, Dept. of Mathematics
Abstract:
This paper considers treatment evaluation when outcomes are only observed for a subpopulation due to sample selection or outcome attrition/non-response. For identification, we combine a selection-on-observables assumption for treatment assignment with either selection-on-observables or instrumental variable assumptions concerning the outcome attrition/sample selection process. To control in a data-driven way for potentially high-dimensional pre-treatment covariates that motivate the selection-on-observables assumptions, we adapt the double machine learning framework to sample selection problems. That is, we make use of (a) Neyman-orthogonal and doubly robust score functions, which imply the robustness of treatment effect estimation to moderate regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models, and (b) sample splitting (or cross-fitting) to prevent overfitting bias. We demonstrate that the proposed estimators are asymptotically normal and root-n consistent under specific regularity conditions concerning the machine learners and investigate their finite sample properties in a simulation study. The estimator is available in the causalweight package for the statistical software R.

Keywords: sample selection, double machine learning, doubly robust estimation, efficient score.
JEL classification: C21.

Addresses for correspondence: Michela Bia, Luxembourg Institute of Socio-Economic Research, 11 Porte des Sciences, Maison des Sciences Humaines, 4366 Esch-sur-Alzette/Belval, Luxembourg, [email protected]; Martin Huber, University of Fribourg, Bd. de Pérolles 90, 1700 Fribourg, Switzerland, [email protected]; Lukáš Lafférs, Matej Bel University, Tajovského 40, 97411 Banská Bystrica, Slovakia, [email protected]. Lafférs acknowledges support provided by the Slovak Research and Development Agency under contract no. APVV-17-0329 and VEGA-1/0692/20.
Introduction
In many studies aiming at evaluating the causal effect of a treatment or policy intervention, the empirical analysis is complicated by non-random outcome attrition or sample selection. Examples include the estimation of the returns to education when wages are only known for the selective subpopulation of working individuals, or the effect of school vouchers on college admissions tests when students non-randomly abstain from the test. Furthermore, in observational studies, treatment assignment is typically not random, implying that the researcher faces a double selection problem, namely selection into the treatment and observability of the outcome. A large literature addresses treatment selection by imposing a selection-on-observables assumption, implying that treatment is as good as randomly assigned conditional on observed pre-treatment covariates, see for instance the reviews by Imbens (2004) and Imbens and Wooldridge (2009). Furthermore, a growing number of studies addresses the question of how to control for the crucial confounders in a potentially high-dimensional vector of covariates in a data-driven way based on machine learning algorithms, see for instance the double machine learning framework of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018).

In this paper, we adapt the double machine learning framework to treatment evaluation in the presence of sample selection or outcome attrition. In terms of identifying assumptions, we combine a selection-on-observables assumption for the treatment assignment with either selection-on-observables or instrumental variable assumptions concerning the outcome attrition/sample selection process. Such assumptions have previously been considered in Huber (2012) and Huber (2014) for the estimation of the average treatment effect (ATE) based on inverse probability weighting, however, for pre-selected (or fixed) covariates. As methodological advancement, we derive doubly robust score functions for evaluating treatment effects under double selection and demonstrate that they satisfy so-called Neyman (1959) orthogonality. The latter property permits controlling for covariates in a data-driven way by machine learning-based estimation of the treatment, outcome, and attrition models under specific conditions. Therefore, the subset of important confounders need not be known a priori (but must be contained in the total set of covariates), which is particularly useful in high-dimensional data with a vast number of covariates that could potentially serve as control variables.

Following Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018), we show that treatment effect estimation based on our score functions is root-n consistent and asymptotically normal under particular regularity conditions, in particular the n^(-1/4)-convergence of the machine learners. A further condition in the double machine learning framework is the prevention of overfitting bias due to correlations between the various estimation steps. This is obtained by estimating the treatment, outcome, and selection models on the one hand and the treatment effect on the other hand in different parts of the data. As in Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018), we subsequently swap the roles of the data parts and average over treatment effects in order to prevent asymptotic efficiency losses, a procedure known as cross-fitting.

Our paper is related to a range of studies tackling sample selection and selective outcome attrition.
One strand of the literature models the attrition process based on a selection-on-observables assumption, also known as the missing at random (MAR) condition. The latter imposes conditional independence of sample selection and the outcome given observed information like the covariates and the treatment. Examples include Rubin (1976), Little and Rubin (1987), Carroll, Ruppert, and Stefanski (1995), Shah, Laird, and Schoenfeld (1997), Fitzgerald, Gottschalk, and Moffitt (1998), Abowd, Crepon, and Kramarz (2001), Wooldridge (2002), and Wooldridge (2007). Robins, Rotnitzky, and Zhao (1994), Robins, Rotnitzky, and Zhao (1995), and Bang and Robins (2005) discuss doubly robust estimators of the outcome that are consistent under MAR when either the outcome or the attrition model is correctly specified. This approach satisfies Neyman orthogonality as required for double machine learning. However, their framework does not consider the (double) selection into treatment and the observability of the outcome at the same time, as we do in this paper.

In contrast to MAR-based identification, so-called sample selection or nonignorable nonresponse models allow for unobserved confounders of the attrition process and the outcome. Unless strong functional form assumptions as in Heckman (1976), Heckman (1979), Hausman and Wise (1979), and Little (1995) are imposed, identification requires an instrumental variable (IV) for sample selection. We refer to Das, Newey, and Vella (2003), Newey (2007), Huber (2012), and Huber (2014) for nonparametric estimation approaches in this context. To the best of our knowledge, this study is the first one to propose a doubly robust treatment effect estimator under nonignorable outcome attrition and to consider machine learning techniques to control for (possibly high-dimensional) covariates in this context. Our estimators are available in the causalweight package for R by Bodory and Huber (2018).

This paper proceeds as follows. Using the potential outcome framework, Section 2 discusses the identification of the average treatment effect when outcomes are assumed to be missing at random (i.e. selection is on observables, as for the treatment). Section 3 considers identification when outcome attrition is related to unobservables, known as nonignorable nonresponse, and an instrument is available for tackling this issue. Section 4 proposes an estimator based on double machine learning and shows root-n consistency and asymptotic normality under specific regularity conditions. Section 5 provides a simulation study. Section 6 concludes.

Identification under missingness at random

Our target parameter is the average treatment effect (ATE) of a binary treatment variable D on an outcome variable Y. To define the effect of interest, we use the potential outcome framework, see Rubin (1974). Let Y(d) denote the potential outcome under hypothetical treatment assignment d ∈ {0, 1}, such that the ATE is given by ∆ = E[Y(1) − Y(0)]. Furthermore, let Y denote the outcome realized under the treatment (f)actually assigned to a subject, i.e. Y = D · Y(1) + (1 − D) · Y(0). Therefore, Y corresponds to the potential outcome under the treatment received, while the potential outcome under the counterfactual treatment assignment remains unknown. A further complication in our evaluation framework is that Y is assumed to be only observed for a subpopulation, i.e.
conditional on S = 1, where S is a binary variable indicating whether Y is observed/selected, or not. Empirical examples with partially observed outcomes include wage regressions, with S being an employment indicator, see for instance Gronau (1974), or the evaluation of the effects of policy interventions in education on test scores, with S being participation in the test, see Angrist, Bettinger, and Kremer (2006). Throughout our discussion, S is permitted to be a function of D and X, i.e. S = S(D, X). However, S must neither be affected by, nor affect, Y. Therefore, selection per se does not causally influence the outcome. The following nonparametric outcome and selection models satisfy this framework:

Y = φ(D, X, U), S = ψ(D, X, V), (1)

where U, V are unobserved characteristics and φ, ψ are general functions. (See for instance Imai (2009) for alternative assumptions, which imply that selection is associated with the outcome but is independent of the treatment conditional on the outcome and other observable variables. Note that Y(d) = φ(d, X, U), which means that fixing the treatment yields the potential outcome.) Throughout the paper,
we assume that the stable unit treatment value assumption (SUTVA, Rubin (1980)) holds, such that Pr(D = d ⟹ Y = Y(d)) = 1. This rules out interaction or general equilibrium effects and implies that the treatment is uniquely defined. We subsequently formalize the assumptions that permit identifying the average treatment effect when both selection into the treatment and outcome attrition are related to observed characteristics.

Assumption 1 (conditional independence of the treatment):
Y(d) ⊥ D | X = x for all d ∈ {0, 1} and x in the support of X.

By Assumption 1, there are no unobservables jointly affecting the treatment and the outcome conditional on covariates X. For model (1), this implies that U is not associated with unobserved terms affecting D given X. In observational studies, the plausibility of this assumption crucially hinges on the richness of the data, while in experiments, it is satisfied if the treatment is randomized within strata defined by X or randomized independently of X.

Assumption 2 (conditional independence of selection):
Y ⊥ S | D = d, X = x for all d ∈ {0, 1} and x in the support of X.

By Assumption 2, there are no unobservables jointly affecting selection and the outcome conditional on D, X, such that outcomes are missing at random (MAR) in the denomination of Rubin (1976). Put differently, selection is assumed to be selective w.r.t. observed characteristics only. For model (1), this implies that U and V are conditionally independent given D, X.

Assumption 3 (common support):
(a) Pr(D = d | X = x) > 0 and (b) Pr(S = 1 | D = d, X = x) > 0 for all d ∈ {0, 1} and x in the support of X.

Assumption 3(a) is a common support restriction requiring that the conditional probability to receive a specific treatment given X, henceforth referred to as treatment propensity score, is larger than zero in either treatment state. Assumption 3(b) requires that for any combination of D, X, the conditional probability to be observed, henceforth referred to as selection propensity score, is larger than zero. Otherwise, the outcome is not observed for some specific combinations of these variables, implying yet another common support issue.

The identifying assumptions imply that

E[Y(d) | X] = E[Y | D = d, X] = E[Y | D = d, S = 1, X], (2)

where the first equality follows from Assumption 1 and the second equality from Assumption 2. Therefore, the mean potential outcome is identified by

E[Y(d)] = E[E[Y | D = d, S = 1, X]], (3)

or, using the fact that E[Y | D = d, S = 1, X] = E[I{D = d} · S · Y | X] / Pr(D = d, S = 1 | X), by

E[Y(d)] = E[ E[Y · I{D = d} · S | X] / Pr(D = d, S = 1 | X) ] = E[ I{D = d} · S · Y / (Pr(D = d | X) · Pr(S = 1 | D = d, X)) ], (4)

where the second equality follows from the law of iterated expectations. I{·} denotes the indicator function, which is equal to one if its argument is satisfied and zero otherwise. For the sake of brevity, we henceforth denote by μ(D, S, X) = E[Y | D, S, X] the conditional mean outcome and by p_d(X) = Pr(D = d | X) and π(D, X) = Pr(S = 1 | D, X) the propensity scores. Expressions (3) and (4) suggest that the mean potential outcomes (and thus, the ATE) are identified either based on conditional mean outcomes or by inverse probability weighting using the treatment and selection propensity scores.
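As an illustration of the inverse probability weighting representation in (4), consider the following minimal R sketch for d = 1. It is not the estimator proposed below, but merely a sample analogue of (4) with simple probit plug-ins standing in for machine learners; the data frame dat and the covariate names x1, x2 are purely illustrative:

# Sample analogue of (4) for E[Y(1)]: 'dat' contains outcome y (NA if unobserved),
# treatment d, selection indicator s, and covariates x1, x2 (illustrative names).
pd  <- fitted(glm(d ~ x1 + x2, family = binomial(link = "probit"), data = dat))
pis <- fitted(glm(s ~ d + x1 + x2, family = binomial(link = "probit"), data = dat))
ys  <- ifelse(dat$s == 1, dat$y, 0)                      # Y * S: missing outcomes enter as zero
EY1_ipw <- mean((dat$d == 1) * dat$s * ys / (pd * pis))  # IPW estimate of E[Y(1)]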
Following the literature on doubly robust methods, see e.g. Robins, Mark, and Newey (1992), Robins, Rotnitzky, and Zhao (1994), and Robins, Rotnitzky, and Zhao (1995), we combine both approaches to obtain the following identification result:

E[Y(d)] = E[ψ_d], where ψ_d = I{D = d} · S · [Y − μ(d, 1, X)] / (p_d(X) · π(d, X)) + μ(d, 1, X). (5)

The identification result in (5) is based on the so-called efficient score function, which is formally derived in Appendix A.4. By applying the law of iterated expectations to replace [Y − μ(d, 1, X)] with E[Y − μ(d, 1, X) | D = d, S = 1, X] and noting that the latter expression is zero as E[Y | D = d, S = 1, X] = μ(d, 1, X), it is easy to see that (5) is equivalent to (3) and thus, (4). In contrast to (3) and (4), however, expression (5) is doubly robust in the sense that it identifies E[Y(d)] if either the conditional mean outcome μ(d, 1, X) or the propensity scores p_d(X) and π(d, X) are correctly specified. Furthermore, it satisfies so-called Neyman (1959) orthogonality, i.e. it is first-order insensitive to perturbations in μ(D, S, X), p_d(X), and π(D, X), see Appendix A.1. This entails desirable robustness properties when using machine learning to estimate the outcome, treatment, and selection models in a data-driven way.
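Continuing the sketch above, the doubly robust score ψ_d in (5) for d = 1 can be formed as follows (again with simple parametric plug-ins rather than the machine learners and cross-fitting introduced in Section 4; all names remain illustrative):

# Doubly robust score psi_1 from (5): outcome regression fitted on the
# treated and selected subsample, combined with the inverse probability weights.
fit_mu <- lm(y ~ x1 + x2, data = dat, subset = (d == 1 & s == 1))
mu1    <- predict(fit_mu, newdata = dat)     # mu(1, 1, X) predicted for all units
psi1   <- (dat$d == 1) * dat$s * (ys - mu1) / (pd * pis) + mu1
EY1_dr <- mean(psi1)                         # doubly robust estimate of E[Y(1)]

By (5), averaging psi1 remains consistent if either mu1 or the product pd * pis is based on a correctly specified model.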
Identification under nonignorable nonresponse

When sample selection or outcome attrition is related to unobservables even conditional on observables, identification generally requires an instrument for S. We therefore replace Assumptions 2 and 3, but maintain Assumption 1 (i.e. selection into treatment is on observables).

Assumption 4 (instrument for selection):
(a) There exists an instrument Z that may be a function of D, i.e. Z = Z(D), is conditionally correlated with S, i.e. E[Z · S | D, X] ≠ 0, and satisfies (i) Y(d, z) = Y(d) and (ii) Y ⊥ Z | D = d, X = x for all d ∈ {0, 1} and x in the support of X,
(b) S = I{V ≤ χ(D, X, Z)}, where χ is a general function and V is a scalar (index of) unobservable(s) with a strictly monotonic cumulative distribution function conditional on X,
(c) V ⊥ (D, Z) | X.

Assumption 4 no longer imposes the conditional independence of Y and S given D, X. As the unobservable V in the selection equation is allowed to be associated with unobservables affecting the outcome, Assumptions 1 and 2 generally do not hold conditional on S = 1 due to the endogeneity of the post-treatment variable S. In fact, S = 1 implies that χ(D, X, Z) > V, such that conditional on X, the distribution of V generally differs across values of D. This entails a violation of the conditional independence of D and Y(d) given S = 1 and X if the potential outcome distributions differ across values of V. We therefore require an instrumental variable denoted by Z, which must not affect Y or be associated with unobservables affecting Y conditional on D and X, as invoked in Assumption 4(a). We apply a control function approach based on this instrument, which requires further assumptions. (As an alternative set of IV restrictions in the context of selection, d'Haultfoeuille (2010) permits the instrument to be associated with the outcome, but assumes conditional independence of the instrument and selection given the outcome. Control function approaches have been applied in semi- and nonparametric sample selection models, e.g. Ahn and Powell (1993), Das, Newey, and Vella (2003), Newey (2007), Huber (2012), and Huber (2014), as well as in nonparametric instrumental variable models, see for example Newey, Powell, and Vella (1999), Blundell and Powell (2004), and Imbens and Newey (2009).)
By the threshold crossing model postulated in Assumption 4(b), Pr(S = 1 | D, X, Z) = Pr(V ≤ χ(D, X, Z)) = F_V(χ(D, X, Z)), where F_V(v) denotes the cumulative distribution function of V evaluated at v. We will henceforth use the notation Π = π(D, X, Z) = Pr(S = 1 | D, X, Z) for the sake of brevity. Again by Assumption 4(b), the selection probability Π increases strictly monotonically in χ, such that there is a one-to-one correspondence between the distribution function F_V and specific values v given X. By Assumption 4(c), V is independent of (D, Z) given X, implying that the distribution function of V given X is (nonparametrically) identified. By comparing individuals with the same Π, we control for F_V and thus for the confounding associations of V with D and Y(d) that occur conditional on S = 1, X. In other words, Π serves as a control function where the exogenous variation comes from Z. Controlling for the distribution of V based on the instrument is thus a feasible alternative to the (infeasible) approach of directly controlling for levels of V.

Furthermore, identification requires the following common support assumption, which is similar to Assumption 3(a), but in contrast to the latter also includes Π as a conditioning variable.

Assumption 5 (common support):
Pr(D = d | X = x, Π = π, S = 1) > 0 for all d ∈ {0, 1} and x, z in the support of X, Z.

This means that in fully nonparametric contexts, the instrument Z must in general be continuous and strong enough to importantly shift the selection probability Π conditional on D, X in the selected population. Assumptions 1, 4, and 5 are sufficient for the identification of mean potential outcomes and the ATE in the selected population, denoted as ∆_{S=1} = E[Y(1) − Y(0) | S = 1]. To see this, note that the identifying assumptions imply

E[Y(d) | S = 1, X, F_V] = E[Y(d) | S = 1, X, Π] = E[Y | D = d, S = 1, X, Π]. (6)

The first equality follows from Π = F_V under Assumption 4, the second from the fact that when controlling for F_V, conditioning on S = 1 does not result in an association between Y(d) and D given X, such that Y(d) ⊥ D | X, Π, S = 1 holds by Assumptions 1 and 4. Therefore,

E[Y(d) | S = 1] = E[ E[Y | D = d, S = 1, X, Π] | S = 1 ]. (7)

Denoting by p_d(X, Π) = Pr(D = d | X, Π) and μ(D, S, X,
Π) = E[Y | D, S, X, π(D, X, Z)], an alternative expression for the mean potential outcome among the selected is obtained by

E[Y(d) | S = 1] = E[φ_{d,S=1} | S = 1], where φ_{d,S=1} = I{D = d} · [Y − μ(d, 1, X, Π)] / p_d(X, Π) + μ(d, 1, X, Π). (8)

By applying the law of iterated expectations to replace [Y − μ(d, 1, X, Π)] with E[Y − μ(d, 1, X, Π) | D = d, S = 1, X, Π] and noting that the latter expression is zero, one can see that (8) is equivalent to (7). But in contrast to the latter, the identification result in (8) satisfies Neyman orthogonality and is based on the doubly robust efficient influence function, see Appendix A.4.

The identification of the ATE in the total (rather than the selected) population is not feasible without further assumptions. The reason is that effects among selected observations cannot be extrapolated to the non-selected population if the effect of D interacts with unobservables affecting the outcome, i.e. U in (1), as the latter are in general distributed differently across (S = 1, X, Π) or (D, X, Π). To see this, note that conditional on Π = Pr(V ≤ χ(D, X, Z)), the distribution of V differs across the selected (satisfying V ≤ χ(D, X, Z)) and the non-selected (satisfying V > χ(D, X, Z)), such that the distribution of U differs, too, if V and U are associated. This generally implies that E[Y(1) − Y(0) | S = 1, X, Π] ≠ E[Y(1) − Y(0) | S = 0, X, Π]. While the control function Π ensures (together with X) that the treatment is unconfounded in the selected subpopulation, it does not permit extrapolating effects to the non-selected population with unobserved outcomes, see also Huber and Melly (2015) for further discussion.

Assumption 6 therefore imposes homogeneity in the average treatment effect across the selected and non-selected populations conditional on X, V. A sufficient condition for effect homogeneity is the separability of observed and unobserved components in the outcome equation, i.e. Y = η(D, X) + ν(U), where η, ν are general functions. Furthermore, common support as postulated in Assumption 5 needs to be strengthened to hold in the entire population. In addition, the selection probability Π must be larger than zero for any d, x, z in their support. Otherwise, outcomes are not observed for some values of D, X. Assumption 7 formalizes these common support restrictions.
Assumption 6 (conditional effect homogeneity):
E[Y(1) − Y(0) | S = 1, X = x, V = v] = E[Y(1) − Y(0) | X = x, V = v] for all x, v in the support of X, V.

Assumption 7 (common support):
(a) Pr(D = d | X = x, Π = π) > 0 and (b) π(d, x, z) > 0 for all d ∈ {0, 1} and x, z in the support of X, Z.

Under the identifying assumptions, it follows that

μ(1, 1, X, Π) − μ(0, 1, X, Π) = E[Y(1) − Y(0) | S = 1, X, V] = E[Y(1) − Y(0) | X, V], (9)

where the first equality follows from Assumptions 1 and 4, see (6), and the second one from Assumption 6. Therefore, the ATE is identified by

∆ = E[μ(1, 1, X, Π) − μ(0, 1, X, Π)]. (10)

An alternative expression for the ATE that is based on the efficient influence function, see the derivation in Appendix A.4, and respects Neyman orthogonality is given by

∆ = E[φ_1 − φ_0], where φ_d = I{D = d} · S · [Y − μ(d, 1, X, Π)] / (p_d(X, Π) · π(d, X, Z)) + μ(d, 1, X, Π). (11)

Analogously to (8), it follows that (11) is equivalent to (10) by noting that E[Y − μ(d, 1, X, Π) | D = d, S = 1, X, Π] = 0.
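To illustrate the control function logic behind (11) for d = 1, a minimal R sketch (with purely illustrative parametric plug-ins and variable names, including the instrument z; the actual estimation uses machine learning and cross-fitting, see Section 4) first estimates Π and then conditions on the estimate in the outcome and treatment models:

# Score phi_1 from (11): Pi = Pr(S = 1 | D, X, Z) serves as control function.
Pi   <- fitted(glm(s ~ d + x1 + x2 + z, family = binomial(link = "probit"), data = dat))
dat2 <- transform(dat, Pi = Pi)
pdPi <- fitted(glm(d ~ x1 + x2 + Pi, family = binomial(link = "probit"), data = dat2))
muPi <- predict(lm(y ~ x1 + x2 + Pi, data = dat2, subset = (d == 1 & s == 1)),
                newdata = dat2)              # mu(1, 1, X, Pi)
ys   <- ifelse(dat2$s == 1, dat2$y, 0)
phi1 <- (dat2$d == 1) * dat2$s * (ys - muPi) / (pdPi * Pi) + muPi
EY1_iv <- mean(phi1)                         # estimate of E[Y(1)] based on (11)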
Estimation of the counterfactual with K-fold Cross-Fitting
We subsequently propose an estimation strategy for the counterfactual E[Y(d)] under MAR as discussed in Section 2 based on (5) and show its root-n consistency under specific regularity conditions. To this end, let W = {W_i | 1 ≤ i ≤ n} with W_i = (Y_i · S_i, D_i, S_i, X_i) for all i denote the set of observations in an i.i.d. sample of size n. η denotes the plug-in (or nuisance) parameters, i.e. the conditional mean outcome as well as the treatment and selection propensity scores. Their respective estimates are referred to by η̂ = {μ̂(D, 1, X), p̂_d(X), π̂(D, X)} and the true parameters by η = {μ(D, 1, X), p_d(X), π(D, X)}. Finally, Ψ_d = E[Y(d)] denotes the true counterfactual.

We estimate Ψ_d by the following algorithm, which combines the estimation of Neyman-orthogonal scores with sample splitting or cross-fitting and is root-n consistent under conditions outlined further below.

Algorithm 1: Estimation of E[Y(d)] based on equation (5)

1. Split W in K subsamples. For each subsample k, let n_k denote its size, W_k the set of observations in the sample, and W_k^C the complement set of all observations not in k.
2. For each k, use W_k^C to estimate the model parameters of the plug-ins μ(D, S = 1, X), p_d(X), π(D, X) in order to predict these plug-ins in W_k, where the predictions are denoted by μ̂^k(D, 1, X), p̂_d^k(X), and π̂^k(D, X).
3. For each k, obtain an estimate of the score function (see ψ_d in (5)) for each observation i in W_k, denoted by ψ̂_{d,i}^k:

ψ̂_{d,i}^k = I{D_i = d} · S_i · [Y_i − μ̂^k(d, 1, X_i)] / (p̂_d^k(X_i) · π̂^k(d, X_i)) + μ̂^k(d, 1, X_i). (12)

4. Average the estimated scores ψ̂_{d,i}^k over all observations across all K subsamples to obtain an estimate of Ψ_d = E[Y(d)] in the total sample, denoted by Ψ̂_d = (1/n) Σ_{k=1}^K Σ_{i=1}^{n_k} ψ̂_{d,i}^k.

In order to obtain root-n consistency for counterfactual estimation, we make the following assumption on the prediction qualities of machine learning for estimating the nuisance parameters. Following Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018), we introduce some further notation: let (δ_n)_{n=1}^∞ and (∆_n)_{n=1}^∞ denote sequences of positive constants with lim_{n→∞} δ_n = 0 and lim_{n→∞} ∆_n = 0. Furthermore, let c, ε, C, and q be positive constants such that q > 2, and let K ≥ 2. For any random vector R = (R_1, ..., R_l), let ||R||_q = max_{1≤j≤l} ||R_j||_q, where ||R_j||_q = (E[|R_j|^q])^{1/q}. In order to ease notation, we assume that n/K is an integer. For the sake of brevity, we omit the dependence of the probability Pr_P, the expectation E_P(·), and the norm ||·||_{P,q} on the probability measure P.
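A compact R sketch of Algorithm 1 for d = 1 could look as follows, under the assumption of simple probit/linear plug-ins standing in for the machine learners (the paper's simulations use lasso via SuperLearner instead) and K = 3; the standard error in the last line anticipates the asymptotic variance given in Theorem 1 below:

# Cross-fitted estimation of Psi_1 = E[Y(1)] along the lines of Algorithm 1.
set.seed(1); K <- 3; n <- nrow(dat)
fold <- sample(rep(1:K, length.out = n))          # step 1: split W into K subsamples
psi  <- numeric(n)
for (k in 1:K) {
  tr <- dat[fold != k, ]; te <- dat[fold == k, ]  # W_k^C and W_k
  pd  <- predict(glm(d ~ x1 + x2, binomial("probit"), data = tr),
                 newdata = te, type = "response")                    # p_1(X)
  pis <- predict(glm(s ~ d + x1 + x2, binomial("probit"), data = tr),
                 newdata = transform(te, d = 1), type = "response")  # pi(1, X)
  mu  <- predict(lm(y ~ x1 + x2, data = tr, subset = (d == 1 & s == 1)),
                 newdata = te)                                       # mu(1, 1, X)
  ys  <- ifelse(te$s == 1, te$y, 0)
  psi[fold == k] <- (te$d == 1) * te$s * (ys - mu) / (pd * pis) + mu # step 3: scores
}
Psi1 <- mean(psi)                 # step 4: average scores across all folds
se1  <- sd(psi) / sqrt(n)         # standard error implied by Theorem 1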
Assumption 8 (regularity conditions and quality of plug-in parameter estimates):
For all probability laws P ∈ P, where P is the set of all possible probability laws, the following conditions hold for the random vector (Y, D, S, X) for d ∈ {0, 1}:

(a) ||Y||_q ≤ C and ||E[Y | D = d, S = 1, X]||_∞ ≤ C,
(b) Pr(ε ≤ p_d(X) ≤ 1 − ε) = 1 and Pr(ε ≤ π(d, X)) = 1,
(c) ||Y − μ(d, 1, X)||_2 = (E[(Y − μ(d, 1, X))²])^(1/2) ≥ c,
(d) Given a random subset I of [n] of size n_k = n/K, the nuisance parameter estimator η̂ = η̂((W_i)_{i∈I^C}) satisfies the following conditions. With P-probability no less than 1 − ∆_n:

||η̂ − η||_q ≤ C, ||η̂ − η||_2 ≤ δ_n, ||p̂_d(X) − 1/2||_∞ ≤ 1/2 − ε, ||π̂(D, X) − 1/2||_∞ ≤ 1/2 − ε,
||μ̂(D, S, X) − μ(D, S, X)||_2 × ||p̂_d(X) − p_d(X)||_2 ≤ δ_n n^(−1/2),
||μ̂(D, S, X) − μ(D, S, X)||_2 × ||π̂(D, X) − π(D, X)||_2 ≤ δ_n n^(−1/2).

The only non-primitive condition is condition (d), which puts restrictions on the quality of the nuisance parameter estimators. Condition (a) states that the distribution of the outcome does not have unbounded moments. Condition (b) refines the common support condition such that the treatment propensity score is bounded away from zero and one, and the selection propensity score is bounded away from zero. Condition (c) states that covariates X do not perfectly predict the conditional mean outcome.

For demonstrating the root-n consistency of our estimator of the mean potential outcome, we show that it satisfies the requirements of the DML framework in Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) by first verifying linearity and Neyman orthogonality of the score (see Appendix A.1). As ψ_d(W, η, Ψ_d) is smooth in (η, Ψ_d), it then suffices that the plug-in estimators converge with rate n^(−1/4) for achieving n^(−1/2)-convergence in the estimation of Ψ̂_d, see Theorem 1. A rate of n^(−1/4) is achievable by many commonly used machine learners under specific conditions, such as lasso, random forests, boosting, and neural nets, see for instance Belloni, Chernozhukov, and Hansen (2014), Luo and Spindler (2016), Wager and Athey (2018), and Farrell, Liang, and Misra (2018).

Theorem 1
Under Assumptions 1-3 and 8, it holds for estimating Ψ_d = E[Y(d)] based on Algorithm 1: √n(Ψ̂_d − Ψ_d) → N(0, σ²_{ψ_d}), where σ²_{ψ_d} = E[(ψ_d − Ψ_d)²].

The proof is provided in Appendix A.1.

We subsequently discuss the estimation of Ψ_d based on (11). We note that in this case, one needs to estimate the nested nuisance parameters μ(d, 1, X, Π) and p_d(X, Π), because they require the first-step estimation of Π = π(D, X, Z). To avoid overfitting in the nested estimation procedure, the models for Π on the one hand and μ(d, 1, X, Π), p_d(X, Π) on the other hand are estimated in different subsamples. The plug-in estimates are now denoted by η̂* = {μ̂(D, 1, X, Π), p̂_d(X, Π), π̂(D, X, Z)} and the true plug-ins by η* = {μ(D, 1, X, Π), p_d(X, Π), π(D, X, Z)}.

Algorithm 2: Estimation of E[Y(d)] based on equation (11)

1. Split W in K subsamples. For each subsample k, let n_k denote its size, W_k the set of observations in the sample, and W_k^C the complement set of all observations not in k.
2. Split W_k^C into 2 nonoverlapping subsamples and estimate the model parameters of π(D, X, Z) in one subsample and the model parameters of μ(D, 1, X, Π) and p_d(X, Π) in the other subsample. Predict the plug-in models in W_k, where the predictions are denoted by Π̂^k, p̂_d^k(X, Π̂^k), and μ̂^k(D, 1, X, Π̂^k); see the R sketch following this subsection.
3. For each k, obtain an estimate of the efficient score function (see φ_d in (11)) for each observation i in W_k, denoted by φ̂_{d,i}^k:

φ̂_{d,i}^k = I{D_i = d} · S_i · [Y_i − μ̂^k(d, 1, X_i, Π̂_i)] / (p̂_d^k(X_i, Π̂_i) · π̂^k(d, X_i, Z_i)) + μ̂^k(d, 1, X_i, Π̂_i). (13)

4. Average the estimated scores φ̂_{d,i}^k over all observations across all K subsamples to obtain an estimate of Ψ_d = E[Y(d)] in the total sample, denoted by Φ̂_d = (1/n) Σ_{k=1}^K Σ_{i=1}^{n_k} φ̂_{d,i}^k.

An estimator of Ψ_d^{S=1} = E[Y(d) | S = 1] based on (8) is obtained by two modifications in Algorithm 2. First, rather than relying on the total sample of size n, one merely uses the subsample with observed outcomes, which is of size Σ_{i=1}^n S_i, to split it into K subsamples. Second, in step 3, φ̂_{d,i}^k is to be replaced by

φ̂_{d,S=1,i}^k = I{D_i = d} · [Y_i − μ̂^k(d, 1, X_i, Π̂_i)] / p̂_d^k(X_i, Π̂_i) + μ̂^k(d, 1, X_i, Π̂_i) (14)

to estimate Ψ_d^{S=1} in step 4 by Φ̂_d^{S=1} = (1/Σ_{i=1}^n S_i) Σ_{k=1}^K Σ_{i=1}^{n_k} φ̂_{d,S=1,i}^k. As Σ_{i=1}^n S_i is an asymptotically fixed proportion of n, this approach, too, can be shown to be root-n consistent under particular regularity conditions outlined in Assumption 9, which are analogous to those in Assumption 8, but now adapted to our IV-based identifying assumptions.
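Step 2 of Algorithm 2 can be sketched in R as follows, reusing the fold objects tr (= W_k^C) and te (= W_k) from the previous sketch and again with illustrative parametric plug-ins (including the instrument column z): the complement sample is split into two nonoverlapping halves, Π is estimated on the first, and the models involving the estimated Π̂ on the second, before all plug-ins are predicted in W_k.

# Nested sample splitting within W_k^C to avoid overfitting when the estimated
# Pi enters mu and p_d as a regressor (step 2 of Algorithm 2).
half <- sample(nrow(tr)) <= nrow(tr) / 2
trA  <- tr[half, ]; trB <- tr[!half, ]                             # two halves
fitP <- glm(s ~ d + x1 + x2 + z, binomial("probit"), data = trA)   # model for Pi
trB$Pi <- predict(fitP, newdata = trB, type = "response")
te$Pi  <- predict(fitP, newdata = te,  type = "response")          # Pi-hat in W_k
pdk <- predict(glm(d ~ x1 + x2 + Pi, binomial("probit"), data = trB),
               newdata = te, type = "response")                    # p_1(X, Pi-hat)
muk <- predict(lm(y ~ x1 + x2 + Pi, data = trB, subset = (d == 1 & s == 1)),
               newdata = te)                                       # mu(1, 1, X, Pi-hat)
pik <- predict(fitP, newdata = transform(te, d = 1), type = "response")  # pi(1, X, Z)

The predictions pdk, muk, and pik are then plugged into the score (13) exactly as in step 3 of Algorithm 1.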
Assumption 9 (regularity conditions and quality of plug-in parameter estimates):
For all probability laws P ∈ P, where P is the set of all possible probability laws, the following conditions hold for the random vector (Y, D, S, X, Z) for d ∈ {0, 1}:

(a) ||Y||_q ≤ C and ||E[Y | D = d, S = 1, X, Π]||_∞ ≤ C,
(b) Pr(ε ≤ p_d(X, Π) ≤ 1 − ε) = 1 and Pr(ε ≤ π(d, X, Z)) = 1,
(c) ||Y − μ(d, 1, X, Π)||_2 = (E[(Y − μ(d, 1, X, Π))²])^(1/2) ≥ c,
(d) Given a random subset I of [n] of size n_k = n/K, the nuisance parameter estimator η̂* = η̂*((W_i)_{i∈I^C}) satisfies the following conditions. With P-probability no less than 1 − ∆_n:

||η̂* − η*||_q ≤ C, ||η̂* − η*||_2 ≤ δ_n, ||p̂_d(X, Π̂) − 1/2||_∞ ≤ 1/2 − ε, ||π̂(D, X, Z) − 1/2||_∞ ≤ 1/2 − ε,
||μ̂(D, S, X, Π̂) − μ(D, S, X, Π)||_2 × ||p̂_d(X, Π̂) − p_d(X, Π)||_2 ≤ δ_n n^(−1/2),
||μ̂(D, S, X, Π̂) − μ(D, S, X, Π)||_2 × ||π̂(D, X, Z) − π(D, X, Z)||_2 ≤ δ_n n^(−1/2).

Theorems 2 and 3 postulate the root-n consistency and asymptotic normality of the estimators of the mean potential outcomes in the total and selected populations, respectively.
Theorem 2
Under Assumptions 1, 4, 6, 7, and 9, it holds for estimating Ψ_d = E[Y(d)] based on Algorithm 2: √n(Φ̂_d − Ψ_d) → N(0, σ²_{φ_d}), where σ²_{φ_d} = E[(φ_d − Ψ_d)²].

Theorem 3
Under Assumptions 1, 4, 5, and 9, it holds for estimating Ψ_d^{S=1} = E[Y(d) | S = 1] based on Algorithm 2: √n(Φ̂_d^{S=1} − Ψ_d^{S=1}) → N(0, σ²_{φ_{d,S=1}}), where σ²_{φ_{d,S=1}} = E[(φ_{d,S=1} − Ψ_d^{S=1})²].

The proofs are provided in Appendices A.2 and A.3.

Simulation study

This section provides a simulation study to investigate the finite sample behavior of our estimation approach based on the following data generating process:

Y = D + X′β + U, with Y being observed if S = 1,
S = I{D + γZ + X′β + V > 0},
D = I{X′β + W > 0},
X ∼ N(0, Σ_X), Z ∼ N(0, 1), (U, V) ∼ N(0, Σ_{U,V}), W ∼ N(0, 1).

Y is a linear function of D (whose treatment effect is one), the covariates X (for β ≠ 0), and the unobservable U, and is only observed if the selection indicator S is equal to one. Selection is a function of D, X, the unobservable V, and of the instrument Z if γ ≠ 0. The treatment D is a function of X and the unobservable W. Both Z and W are random, standard normally distributed variables that are uncorrelated with X or (U, V). The correlation between the mean zero and normally distributed covariates in X is determined by the covariance matrix Σ_X. Similarly, Σ_{U,V} determines the correlation between the mean zero and normally distributed unobservables in the outcome and selection equations. In this setup, MAR is violated if the covariance between U and V is non-zero. We consider the performance of our estimators in 1000 simulations with two sample sizes of n = 2000 and 8000.

In our simulations, we set the number of covariates p to 100. Σ_X is defined by setting the covariance of the ith and jth covariate in X to 0.5^|i−j|. β gauges the impact of the covariates on Y, S, and D, respectively, and thus, the magnitude of confounding. The ith element in the coefficient vector β is set to 0.4/i² for i = 1, ..., p, implying a squared decay of covariate importance in terms of confounding. In our first simulation design, we set γ = 0 and Σ_{U,V} equal to the identity matrix, such that MAR holds. We consider the performance of DML based on Theorem 1 (henceforth DML MAR), which does not make use of the instrument Z, as well as based on Theorem 2 (DML IV), which exploits the instrument despite the satisfaction of MAR.

The nuisance parameters, i.e. the linear and probit specifications of the outcome, selection, and treatment equations, are estimated by lasso regressions using the default options of the SuperLearner package provided by van der Laan, Polley, and Hubbard (2007) for the statistical software R. We use 3-fold cross-fitting for the estimation of the treatment effects.
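The data generating process can be reproduced in R along the following lines (a sketch; the coefficient and covariance values mirror the design described above, and rho switches between the MAR design, rho = 0 with gamma = 0, and the nonignorable design discussed further below):

# Simulation design: n observations, p = 100 covariates with Toeplitz
# correlation, squared-decay confounding, and a selection equation with instrument Z.
library(MASS)                               # for mvrnorm
set.seed(1); n <- 2000; p <- 100
gamma <- 0; rho <- 0                        # MAR design; gamma = 1, rho = 0.8 otherwise
beta  <- 0.4 / (1:p)^2                      # squared decay of covariate importance
SigX  <- 0.5^abs(outer(1:p, 1:p, "-"))      # covariance of i-th and j-th covariate
X  <- mvrnorm(n, mu = rep(0, p), Sigma = SigX)
UV <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
Z  <- rnorm(n); W <- rnorm(n)
D  <- as.numeric(X %*% beta + W > 0)
S  <- as.numeric(D + gamma * Z + X %*% beta + UV[, 2] > 0)
Y  <- ifelse(S == 1, D + X %*% beta + UV[, 1], NA)   # ATE equals one; Y observed if S = 1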
We drop observations whose products of estimated treatment and selection propensity scores are close to zero, namely smaller than a trimming threshold of 0.01 (or 1%). This avoids an explosion of the propensity score-based weights and, thus, of the variance when estimating the mean potential outcomes or the ATE by the sample analogues of (5) and (11), where the product of the propensity scores enters the respective denominators for reweighting the outcome. Our estimation procedure is available in the causalweight package for R by Bodory and Huber (2018).

Table 1 presents the simulation results. The biases (bias) of both DML MAR and DML IV are rather close to zero independent of the sample size. Furthermore, the estimators have virtually the same variance, despite the fact that DML IV unnecessarily relies on the control function approach and an irrelevant instrument. Both estimators appear to converge to the true effect at the √n-rate, as the root mean squared error (RMSE) is roughly cut by half when quadrupling the sample size. The average standard error across simulations (meanSE) based on the asymptotic variance approximation comes close to the respective estimator's standard deviation (sd). Finally, the coverage rate (coverage), i.e. the share of simulations in which the 95% confidence interval includes the true effect, is only slightly below the nominal level of 95%.

Table 1: Simulation results under MAR

            true    bias      sd    RMSE  meanSE  coverage
n = 2000
DML MAR    1.000   0.003   0.060   0.060   0.063     0.939
DML IV     1.000   0.003   0.060   0.060   0.063     0.939
n = 8000
DML MAR    1.000   0.012   0.031   0.033   0.034     0.934
DML IV     1.000   0.012   0.031   0.033   0.034     0.939

Notes: column 'true' shows the true effect, 'bias' the bias of the respective estimator, 'sd' the standard deviation, and 'RMSE' the root mean squared error. Column 'meanSE' displays the average standard error based on the asymptotic approximation across all simulations, 'coverage' the coverage rate of the true effect based on 95% confidence intervals.

Table 2: Simulation results under nonignorable selection

            true    bias      sd    RMSE  meanSE  coverage
n = 2000
DML MAR    1.000  -0.120   0.055   0.132   0.052     0.374
DML IV     1.000  -0.020   0.071   0.074   0.065     0.907
n = 8000
DML MAR    1.000  -0.116   0.028   0.119   0.027     0.009
DML IV     1.000   0.006   0.040   0.040   0.036     0.915

Notes: column 'true' shows the true effect, 'bias' the bias of the respective estimator, 'sd' the standard deviation, and 'RMSE' the root mean squared error. Column 'meanSE' displays the average standard error based on the asymptotic approximation across all simulations, 'coverage' the coverage rate of the true effect based on 95% confidence intervals.
In a second simulation design, we set γ = 1 and Σ_{U,V} with unit variances and a covariance of 0.8 between U and V, such that selection is nonignorable, i.e. related to unobservables, due to a strong correlation of U and V. Table 2 presents the results. DML MAR is no longer unbiased, while the bias of DML IV appears to approach zero as the sample size increases, at the price of a somewhat higher standard deviation than DML MAR. However, DML IV dominates DML MAR under either sample size in terms of having a lower RMSE and thus has a more favorable bias-variance trade-off in the scenario considered. While coverage is quite satisfactory for DML IV, the 95% confidence interval mostly fails to include the true effect in the case of DML MAR, in particular under the larger sample size.

Conclusion

In this paper, we discussed the evaluation of average treatment effects in the presence of sample selection or outcome attrition based on double machine learning. In terms of identifying assumptions, we imposed a selection-on-observables assumption on treatment assignment, which was combined with either selection-on-observables or instrumental variable assumptions concerning the outcome attrition/sample selection process. We proposed doubly robust score functions and formally showed the satisfaction of Neyman orthogonality, implying that estimators based on these score functions are robust to moderate (local) regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models. Furthermore, we demonstrated the root-n consistency and asymptotic normality of our double machine learning approach to average treatment effect estimation under specific regularity conditions. Our estimation procedure is provided in the causalweight package for the statistical software R.
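For reference, estimation with the causalweight package might be invoked as follows; the function name treatselDML and its argument list reflect our reading of the package documentation and may differ across package versions, so they should be checked against the help file of the installed version before use:

# Hedged usage sketch of the implementation in the causalweight package,
# reusing Y, D, X, S, Z from the simulation sketch above.
library(causalweight)
res_mar <- treatselDML(y = Y, d = D, x = X, s = S, trim = 0.01)         # MAR-based
res_iv  <- treatselDML(y = Y, d = D, x = X, s = S, z = Z, trim = 0.01)  # IV-based
res_iv$effect; res_iv$se          # ATE estimate and its standard error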
References

Abowd, J., B. Crepon, and F. Kramarz (2001): "Moment Estimation With Attrition: An Application to Economic Models," Journal of the American Statistical Association, 96, 1223-1230.

Ahn, H., and J. Powell (1993): "Semiparametric Estimation of Censored Selection Models with a Nonparametric Selection Mechanism," Journal of Econometrics, 58, 3-29.

Angrist, J., E. Bettinger, and M. Kremer (2006): "Long-Term Educational Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia," American Economic Review, 96, 847-862.

Bang, H., and J. Robins (2005): "Doubly Robust Estimation in Missing Data and Causal Inference Models," Biometrics, 61, 962-972.

Belloni, A., V. Chernozhukov, and C. Hansen (2014): "Inference on Treatment Effects after Selection among High-Dimensional Controls," The Review of Economic Studies, 81, 608-650.

Blundell, R. W., and J. L. Powell (2004): "Endogeneity in Semiparametric Binary Response Models," The Review of Economic Studies, 71, 655-679.

Bodory, H., and M. Huber (2018): "The causalweight package for causal inference in R," SES Working Paper 493, University of Fribourg.

Carroll, R., D. Ruppert, and L. Stefanski (1995): Measurement Error in Nonlinear Models. Chapman and Hall, London.

Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018): "Double/debiased machine learning for treatment and structural parameters," The Econometrics Journal, 21, C1-C68.

Das, M., W. K. Newey, and F. Vella (2003): "Nonparametric Estimation of Sample Selection Models," Review of Economic Studies, 70, 33-58.

d'Haultfoeuille, X. (2010): "A new instrumental method for dealing with endogenous selection," Journal of Econometrics, 154, 1-15.

Farrell, M. H., T. Liang, and S. Misra (2018): "Deep Neural Networks for Estimation and Inference: Application to Causal Effects and Other Semiparametric Estimands," working paper, University of Chicago.

Fitzgerald, J., P. Gottschalk, and R. Moffitt (1998): "An Analysis of Sample Attrition in Panel Data: The Michigan Panel Study of Income Dynamics," Journal of Human Resources, 33, 251-299.

Gronau, R. (1974): "Wage comparisons - a selectivity bias," Journal of Political Economy, 82, 1119-1143.

Hausman, J., and D. Wise (1979): "Attrition Bias In Experimental and Panel Data: The Gary Income Maintenance Experiment," Econometrica, 47(2), 455-473.

Heckman, J. (1976): "The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables, and a Simple Estimator for such Models," Annals of Economic and Social Measurement, 5, 475-492.

Heckman, J. (1979): "Sample Selection Bias as a Specification Error," Econometrica, 47, 153-161.

Huber, M. (2012): "Identification of average treatment effects in social experiments under alternative forms of attrition," Journal of Educational and Behavioral Statistics, 37, 443-474.

Huber, M. (2014): "Treatment evaluation in the presence of sample selection," Econometric Reviews, 33, 869-905.

Huber, M., and B. Melly (2015): "A Test of the Conditional Independence Assumption in Sample Selection Models," Journal of Applied Econometrics, 30, 1144-1168.

Imai, K. (2009): "Statistical analysis of randomized experiments with non-ignorable missing binary outcomes: an application to a voting experiment," Journal of the Royal Statistical Society Series C, 58, 83-104.

Imbens, G. W. (2004): "Nonparametric estimation of average treatment effects under exogeneity: a review," The Review of Economics and Statistics, 86, 4-29.

Imbens, G. W., and W. K. Newey (2009): "Identification and Estimation of Triangular Simultaneous Equations Models Without Additivity," Econometrica, 77, 1481-1512.

Imbens, G. W., and J. M. Wooldridge (2009): "Recent Developments in the Econometrics of Program Evaluation," Journal of Economic Literature, 47, 5-86.

Levy, J. (2019): "Tutorial: Deriving The Efficient Influence Curve for Large Models," arXiv preprint arXiv:1903.01706.

Little, R., and D. Rubin (1987): Statistical Analysis with Missing Data. Wiley, New York.

Little, R. J. A. (1995): "Modeling the Drop-Out Mechanism in Repeated-Measures Studies," Journal of the American Statistical Association, 90, 1112-1121.

Luo, Y., and M. Spindler (2016): "High-Dimensional L2 Boosting: Rate of Convergence," working paper.

Newey, W., J. Powell, and F. Vella (1999): "Nonparametric Estimation of Triangular Simultaneous Equations Models," Econometrica, 67, 565-603.

Newey, W. K. (2007): "Nonparametric continuous/discrete choice models," International Economic Review, 48, 1429-1439.

Neyman, J. (1959): Optimal asymptotic tests of composite statistical hypotheses, pp. 416-444. Wiley.

Robins, J., A. Rotnitzky, and L. Zhao (1995): "Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data," Journal of the American Statistical Association, 90, 106-121.

Robins, J. M., S. D. Mark, and W. K. Newey (1992): "Estimating exposure effects by modelling the expectation of exposure conditional on confounders," Biometrics, 48, 479-495.

Robins, J. M., A. Rotnitzky, and L. Zhao (1994): "Estimation of Regression Coefficients When Some Regressors Are not Always Observed," Journal of the American Statistical Association, 89, 846-866.

Rubin, D. (1980): "Comment on 'Randomization Analysis of Experimental Data: The Fisher Randomization Test' by D. Basu," Journal of the American Statistical Association, 75, 591-593.

Rubin, D. B. (1974): "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," Journal of Educational Psychology, 66, 688-701.

Rubin, D. B. (1976): "Inference and Missing Data," Biometrika, 63, 581-592.

Shah, A., N. Laird, and D. Schoenfeld (1997): "A Random-Effects Model for Multiple Characteristics With Possibly Missing Data," Journal of the American Statistical Association, 92, 775-779.

van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007): "Super Learner," Statistical Applications in Genetics and Molecular Biology, 6.

Wager, S., and S. Athey (2018): "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests," Journal of the American Statistical Association, 113, 1228-1242.

Wooldridge, J. (2002): "Inverse Probability Weighted M-Estimators for Sample Selection, Attrition and Stratification," Portuguese Economic Journal, 1, 141-162.

Wooldridge, J. (2007): "Inverse probability weighted estimation for general missing data problems," Journal of Econometrics, 141, 1281-1301.

Appendices
A Proofs
For the proofs of Theorems 1, 2, and 3, it is sufficient to verify the conditions of Assumptions 3.1 and 3.2 required by Theorems 3.1 and 3.2 as well as Corollary 3.2 in Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018). All bounds hold uniformly over P ∈ P, where P is the set of all possible probability laws, and we omit P for brevity.

A.1 Proof of Theorem 1
Define the nuisance parameters to be the vector of functions η = (p_d(X), π(D, X), μ(D, S, X)), with p_d(X) = Pr(D = d | X), π(D, X) = Pr(S = 1 | D, X), and μ(D, S, X) = E[Y | D, S, X], and denote by η₀ = (p₀_d, π₀, μ₀) the vector of true nuisance parameters. The Neyman-orthogonal score function for the counterfactual Ψ_d = E[Y(d)] is given by the following expression, with W = (Y · S, D, S, X):

ψ_d(W, η, Ψ_d) = I{D = d} · S · [Y − μ(d, 1, X)] / (p_d(X) · π(d, X)) + μ(d, 1, X) − Ψ_d. (A.1)

Let T_n be the set of all η = (p_d, π, μ) consisting of P-square-integrable functions p_d, π, and μ such that

||η − η₀||_q ≤ C, ||η − η₀||_2 ≤ δ_n, ||p_d(X) − 1/2||_∞ ≤ 1/2 − ε, ||π(D, X) − 1/2||_∞ ≤ 1/2 − ε, (A.2)
||μ(D, S, X) − μ₀(D, S, X)||_2 × ||p_d(X) − p₀_d(X)||_2 ≤ δ_n n^(−1/2),
||μ(D, S, X) − μ₀(D, S, X)||_2 × ||π(D, X) − π₀(D, X)||_2 ≤ δ_n n^(−1/2).

We furthermore replace the sequence (δ_n)_{n≥1} by (δ'_n)_{n≥1}, where δ'_n = C_ε max(δ_n, n^(−1/2)) and C_ε is a sufficiently large constant that only depends on C and ε.

Assumption 3.1: Linear scores and Neyman orthogonality

Assumption 3.1(a), moment condition: The moment condition E[ψ_d(W, η₀, Ψ_d)] = 0 holds:

E[ψ_d(W, η₀, Ψ_d)] = E[ I{D = d} · S · E[Y − μ₀(d, 1, X) | D = d, S = 1, X] / (p₀_d(X) · π₀(d, X)) + μ₀(d, 1, X) − Ψ_d ] = E[μ₀(d, 1, X)] − Ψ_d = 0,

where the first equality follows from the law of iterated expectations and the inner conditional expectation E[Y − μ₀(d, 1, X) | D = d, S = 1, X] is zero.

Assumption 3.1(b), linearity:
The score ψ_d(W, η, Ψ_d) is linear in Ψ_d: ψ_d(W, η, Ψ_d) = ψ_d^a(W, η) · Ψ_d + ψ_d^b(W, η), with

ψ_d^a(W, η) = −1, ψ_d^b(W, η) = I{D = d} · S · [Y − μ(d, 1, X)] / (p_d(X) · π(d, X)) + μ(d, 1, X).

Assumption 3.1(c), continuity:
The expression for the second Gateaux derivative of the map η ↦ E[ψ_d(W, η, Ψ_d)], given in (A.5), is continuous.

Assumption 3.1(d), Neyman orthogonality: For any η ∈ T_n, the Gateaux derivative in the direction η − η₀ = (p_d(X) − p₀_d(X), π(D, X) − π₀(D, X), μ(D, S, X) − μ₀(D, S, X)) is given by:

∂E[ψ_d(W, η, Ψ_d)][η − η₀]
= −E[ I{D = d} · S · [μ(d, 1, X) − μ₀(d, 1, X)] / (p₀_d(X) · π₀(d, X)) ] (∗)
+ E[ μ(d, 1, X) − μ₀(d, 1, X) ] (∗∗)
− E[ I{D = d} · S · [Y − μ₀(d, 1, X)] / (p₀_d(X) · π₀(d, X)) · [p_d(X) − p₀_d(X)] / p₀_d(X) ]
− E[ I{D = d} · S · [Y − μ₀(d, 1, X)] / (p₀_d(X) · π₀(d, X)) · [π(d, X) − π₀(d, X)] / π₀(d, X) ]
= 0,

where the third and fourth terms are zero because E[Y − μ₀(d, 1, X) | D = d, S = 1, X] = 0. The Gateaux derivative is zero because expressions (∗) and (∗∗) cancel out. To see this, note that by the law of iterated expectations, (∗) corresponds to

−E[ E[ I{D = d}/p₀_d(X) · E[ S · [μ(d, 1, X) − μ₀(d, 1, X)] / π₀(d, X) | D = d, X ] | X ] ]
= −E[ E[ I{D = d}/p₀_d(X) · E[S | D = d, X] · [μ(d, 1, X) − μ₀(d, 1, X)] / π₀(d, X) | X ] ]
= −E[ E[I{D = d} | X]/p₀_d(X) · [μ(d, 1, X) − μ₀(d, 1, X)] ]
= −E[ μ(d, 1, X) − μ₀(d, 1, X) ],

using E[S | D = d, X] = π₀(d, X) and E[I{D = d} | X] = p₀_d(X). Therefore, ∂E[ψ_d(W, η, Ψ_d)][η − η₀] = 0, proving that the score function is orthogonal.

Assumption 3.2: Score regularity and quality of nuisance parameter estimators

Assumption 3.2(a):
This assumption directly follows from the construction of the set T_n and the regularity conditions (Assumption 8).

Assumption 3.2(b), bound on m_n: Consider the following inequality:

||μ(D, S, X)||_q^q = E[|μ(D, S, X)|^q] = Σ_{d∈{0,1}, s∈{0,1}} E[|μ(d, s, X)|^q · Pr(D = d, S = s | X)] ≥ ε · max_{d,s∈{0,1}} E[|μ(d, s, X)|^q] = ε · (max_{d,s∈{0,1}} ||μ(d, s, X)||_q)^q,

where the equalities follow from the definition of the norm and the law of total probability, and the inequality from the fact that Pr(D = d, S = 1 | X) = p₀_d(X) · π₀(d, X) ≥ ε and Pr(D = d, S = 0 | X) = p₀_d(X) · (1 − π₀(d, X)) ≥ ε. Furthermore, by Jensen's inequality, ||μ(D, S, X)||_q ≤ ||Y||_q, and hence ||μ₀(d, 1, X)||_q ≤ C/ε^(1/q) by conditions (A.2). Using similar steps, for any η ∈ T_n: ||μ(d, 1, X) − μ₀(d, 1, X)||_q ≤ C/ε^(1/q), because ||μ(D, S, X) − μ₀(D, S, X)||_q ≤ C. Consider

ψ_d(W, η, Ψ_d) = I{D = d} · S / (p_d(X) · π(d, X)) · Y (= I₁) + (1 − I{D = d} · S / (p_d(X) · π(d, X))) · μ(d, 1, X) (= I₂) − Ψ_d,

and thus ||ψ_d(W, η, Ψ_d)||_q ≤ ||I₁||_q + ||I₂||_q + |Ψ_d|, which is bounded by a constant depending only on C and ε by the triangle inequality and because the following inequalities hold:

||μ(d, 1, X)||_q ≤ ||μ(d, 1, X) − μ₀(d, 1, X)||_q + ||μ₀(d, 1, X)||_q ≤ 2C/ε^(1/q), (A.3)
|Ψ_d| = |E[μ₀(d, 1, X)]| ≤ E[|μ₀(d, 1, X)|] = ||μ₀(d, 1, X)||_1 ≤ ||μ₀(d, 1, X)||_2 ≤ ||Y||_2/ε ≤ ||Y||_q/ε ≤ C/ε (using q > 2).

This gives the upper bound on m_n in Assumption 3.2(b) of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018).

Bound on m'_n: Notice that (E[|ψ_d^a(W, η)|^q])^(1/q) = 1, which gives the upper bound on m'_n in Assumption 3.2(b).

Assumption 3.2(c), bound on r_n: For any η = (p_d, π, μ), we have |E[ψ_d^a(W, η) − ψ_d^a(W, η₀)]| = |−1 − (−1)| = 0 ≤ δ'_n, and thus we have the bound on r_n from Assumption 3.2(c).

In the following, we omit arguments for the sake of brevity and use p_d = p_d(X), π = π(d, X), μ = μ(d, 1, X), and similarly p₀_d, π₀, μ₀ for the true parameters.

Bound on r'_n:

||ψ_d(W, η, Ψ_d) − ψ_d(W, η₀, Ψ_d)||_2
≤ ||I{D = d} · S · Y · (1/(p_d π) − 1/(p₀_d π₀))||_2 + ||I{D = d} · S · (μ/(p_d π) − μ₀/(p₀_d π₀))||_2 + ||μ − μ₀||_2 (A.4)
≤ ||Y · (1/(p_d π) − 1/(p₀_d π₀))||_2 + ||μ/(p_d π) − μ₀/(p₀_d π₀)||_2 + ||μ − μ₀||_2
≤ δ_n · K(C, ε) ≤ δ'_n,

where K(C, ε) is a constant that only depends on C and ε, as long as C_ε in the definition of δ'_n is sufficiently large.
This gives the bound on r'_n from Assumption 3.2(c). Here we made use of the facts that ||μ − μ₀||_2 = ||μ(d, 1, X) − μ₀(d, 1, X)||_2 ≤ δ_n/ε and ||π − π₀||_2 = ||π(d, X) − π₀(d, X)||_2 ≤ δ_n/ε, which follow by similar steps as in Assumption 3.2(b). The last inequality in (A.4) holds because for the first term we have

||Y · (1/(p_d π) − 1/(p₀_d π₀))||_2 ≤ C ||1/(p_d π) − 1/(p₀_d π₀)||_2 ≤ (C/ε⁴) ||p₀_d π₀ − p_d π||_2
= (C/ε⁴) ||p₀_d π₀ − p₀_d π + p₀_d π − p_d π||_2 ≤ (C/ε⁴) (||p₀_d (π₀ − π)||_2 + ||π (p₀_d − p_d)||_2)
≤ (C/ε⁴) (||π − π₀||_2 + ||p_d − p₀_d||_2) ≤ (C/ε⁴) · 2δ_n/ε,

where the second inequality uses that p_d π and p₀_d π₀ are bounded from below by ε². The second term in (A.4) is bounded by

||μ/(p_d π) − μ₀/(p₀_d π₀)||_2 ≤ (1/ε⁴) ||p₀_d π₀ μ − p_d π μ₀||_2 = (1/ε⁴) ||p₀_d π₀ μ − p₀_d π₀ μ₀ + p₀_d π₀ μ₀ − p_d π μ₀||_2
≤ (1/ε⁴) (||p₀_d π₀ (μ − μ₀)||_2 + ||μ₀ (p₀_d π₀ − p_d π)||_2) ≤ (1/ε⁴) (||μ − μ₀||_2 + C ||p₀_d π₀ − p_d π||_2),

where the last inequality uses that E[Y² | D = d, S = 1, X] ≥ (E[Y | D = d, S = 1, X])² = μ₀(d, 1, X)² by the conditional Jensen inequality and therefore ||μ₀(d, 1, X)||_∞ ≤ C.

Bound on λ'_n: Now consider f(r) := E[ψ_d(W; Ψ_d, η₀ + r(η − η₀))] for any r ∈ (0,
1):

∂²f(r)/∂r² = E[ 2 · I{D = d} · S · (Y − μ₀ − r(μ − μ₀)) · (p_d − p₀_d)² / ((p₀_d + r(p_d − p₀_d))³ (π₀ + r(π − π₀))) ] (A.5)
+ E[ 2 · I{D = d} · S · (Y − μ₀ − r(μ − μ₀)) · (π − π₀)² / ((p₀_d + r(p_d − p₀_d)) (π₀ + r(π − π₀))³) ]
+ E[ 2 · I{D = d} · S · (Y − μ₀ − r(μ − μ₀)) · (p_d − p₀_d)(π − π₀) / ((p₀_d + r(p_d − p₀_d))² (π₀ + r(π − π₀))²) ]
+ E[ 2 · I{D = d} · S · (μ − μ₀) · (p_d − p₀_d) / ((p₀_d + r(p_d − p₀_d))² (π₀ + r(π − π₀))) ]
+ E[ 2 · I{D = d} · S · (μ − μ₀) · (π − π₀) / ((p₀_d + r(p_d − p₀_d)) (π₀ + r(π − π₀))²) ].

Note that because E[Y − μ₀(d, 1, X) | D = d, S = 1, X] = 0, |p_d − p₀_d| ≤ 1, |π − π₀| ≤ 1, ||μ||_q ≤ ||Y||_q/ε^(1/q) ≤ C/ε^(1/q), ||μ − μ₀||_2 × ||p_d − p₀_d||_2 ≤ δ_n n^(−1/2)/ε, and ||μ − μ₀||_2 × ||π − π₀||_2 ≤ δ_n n^(−1/2)/ε, we get that for some constant C''_ε that only depends on C and ε:

|∂²f(r)/∂r²| ≤ C''_ε δ_n n^(−1/2) ≤ δ'_n n^(−1/2),

and this gives the upper bound on λ'_n in Assumption 3.2(c) of Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins (2018) as long as C_ε ≥ C''_ε. We used the following inequalities: ||μ − μ₀||_2 = ||μ(d, 1, X) − μ₀(d, 1, X)||_2 ≤ ||μ(D, S, X) − μ₀(D, S, X)||_2/ε and ||π − π₀||_2 = ||π(d, X) − π₀(d, X)||_2 ≤ ||π(D, X) − π₀(D, X)||_2/ε, which can be shown using similar steps as in Assumption 3.2(b).

To observe that |∂²f(r)/∂r²| ≤ C''_ε δ_n n^(−1/2) holds, note that by the triangle inequality it is sufficient to bound the absolute value of each of the five terms in (A.5) separately. We illustrate this for the first, the fourth, and the last term. For the first term:

| E[ 2 · I{D = d} · S · (Y − μ₀ − r(μ − μ₀)) · (p_d − p₀_d)² / ((p₀_d + r(p_d − p₀_d))³ (π₀ + r(π − π₀))) ] |
≤ (2/ε⁴) | E[ I{D = d} · S · (Y − μ₀ − r(μ − μ₀)) · (p_d − p₀_d)² ] |
≤ (2/ε⁴) | E[ I{D = d} · S · (Y − μ₀) · (p_d − p₀_d)² ] | + (2/ε⁴) | E[ r(μ − μ₀)(p_d − p₀_d)² ] |
≤ (2/ε⁴) | E[ (μ − μ₀)(p_d − p₀_d) ] | ≤ (2/ε⁴) · (δ_n/ε) · n^(−1/2),

where the first inequality uses 1 ≥ p₀_d + r(p_d − p₀_d) = (1 − r)p₀_d + r p_d ≥ (1 − r)ε + rε = ε (and similarly for π), the third uses that the term involving (Y − μ₀) is zero by iterated expectations and that |p_d − p₀_d| ≤ 1, and the last step applies Hölder's inequality. Bounding the second and third terms follows similarly. For the fourth term, we get

| E[ 2 · I{D = d} · S · (μ − μ₀) · (p_d − p₀_d) / ((p₀_d + r(p_d − p₀_d))² (π₀ + r(π − π₀))) ] | ≤ (2/ε³) | E[ (μ − μ₀)(p_d − p₀_d) ] | ≤ (2/ε³) · (δ_n/ε) · n^(−1/2),

where in addition we made use of conditions (A.2). The last term is bounded similarly.

Assumption 3.2(d):

E[(ψ_d(W, η₀, Ψ_d))²] = E[(I₁ + I₂)²] = E[I₁²] + E[I₂²] ≥ E[I₁²] = E[ E[(Y − μ₀(d, 1, X))² | D = d, S = 1, X] / (p₀_d(X) · π₀(d, X)) ] ≥ c² > 0,

with I₁ = I{D = d} · S · [Y − μ₀(d, 1, X)] / (p₀_d(X) · π₀(d, X)) and I₂ = μ₀(d, 1, X) − Ψ_d, because Pr(D = d, S = 1 | X) = p₀_d(X) · π₀(d, X) ≤ 1 and condition (c) of Assumption 8 bounds E[(Y − μ₀(d, 1, X))²] from below by c². The cross term vanishes since

E[I₁ · I₂] = E[ I{D = d} · S / (p₀_d(X) · π₀(d, X)) · E[Y − μ₀(d, 1, X) | D = d, S = 1, X] · [μ₀(d, 1, X) − Ψ_d] ] = 0.

A.2 Proof of Theorem 2
Π)),with Π = π ( D, X, Z ) = Pr( S = 1 | D, X, Z ), p d ( X, Π) = Pr( D = d | X, π ( D, X, Z )), and µ ( D, S, X,
Π) = E [ Y | D, S, X, π ( D, X, Z )].The shrinking neighbourhood T ∗ n of nuisance parameter vector η ∗ = ( π, p d , µ ) is defined analogouslyto T n from (A.2) in the proof of theorem 1.The score function for the counterfactual Ψ S =1 d = E [ Y ( d ) | S = 1] is given by: φ d,S =1 ( W, η ∗ , Ψ S =1 d ) = I { D = d } · [ Y − µ ( d, , X, Π)] p d ( X ) + µ ( d, , X, Π) − Ψ S =1 d . (A.6) Assumption 3.1: Linear scores and Neyman orthogonalityAssumption 3.1(a)Moment Condition:
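To make the nested structure of these nuisance parameters concrete, the following R sketch (a minimal example of ours, with logit/OLS standing in for the machine learners; the variable names are hypothetical and this is not the causalweight implementation) first estimates the selection probability and then feeds it as a regressor into the treatment and outcome models, here for $d = 1$ and with $\Pi$ evaluated at $D = 1$ as in the proofs:
\begin{verbatim}
# Illustrative sketch of the nested nuisance estimation
# (logit/OLS as stand-ins for generic ML learners; not the causalweight code).
set.seed(42)
n <- 5000
X <- rnorm(n); Z <- rnorm(n)                   # covariate and instrument
D <- rbinom(n, 1, plogis(0.5 * X))             # treatment
S <- rbinom(n, 1, plogis(0.5 * D + X + Z))     # selection depends on Z
Y <- ifelse(S == 1, D + X + rnorm(n), NA)      # outcome observed only if S = 1

# Step 1: selection probability pi(1, X, Z), predicted at D = 1
sel_fit <- glm(S ~ D + X + Z, family = binomial())
Pi1_hat <- predict(sel_fit, newdata = data.frame(D = 1, X = X, Z = Z),
                   type = "response")

# Step 2: treatment propensity p_1(X, Pi), with the estimated Pi as regressor
p1_hat <- predict(glm(D ~ X + Pi1_hat, family = binomial()), type = "response")

# Step 3: outcome regression mu(1, 1, X, Pi) = E[Y | D = 1, S = 1, X, Pi],
# fitted on treated selected units and predicted for all observations
mu_fit  <- lm(Y ~ X + Pi1_hat, subset = (D == 1 & S == 1))
mu1_hat <- predict(mu_fit, newdata = data.frame(X = X, Pi1_hat = Pi1_hat))
\end{verbatim}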
Assumption 3.1: Linear scores and Neyman orthogonality

Assumption 3.1(a) Moment condition: The moment condition $E[\phi_{d,S=1}(W, \eta^*_0, \Psi^{S=1}_d) \mid S = 1] = 0$ holds:
\[
E\left[ \phi_{d,S=1}(W, \eta^*_0, \Psi^{S=1}_d) \,\middle|\, S = 1 \right]
= E\Bigg[ \frac{I\{D = d\} \cdot \overbrace{E[Y - \mu(d,1,X,\Pi) \mid D = d, S = 1, X, \Pi]}^{=0}}{p_d(X,\Pi)} + \mu(d,1,X,\Pi) - \Psi^{S=1}_d \,\Bigg|\, S = 1 \Bigg]
= E[\mu(d,1,X,\Pi) \mid S = 1] - \Psi^{S=1}_d = 0,
\]
where the first equality follows from the law of iterated expectations and the last equality from the identification result of Theorem 2.
Assumption 3.1(b) Linearity: The score $\phi_{d,S=1}(W, \eta^*, \Psi^{S=1}_d)$ is linear in $\Psi^{S=1}_d$:
$\phi_{d,S=1}(W, \eta^*, \Psi^{S=1}_d) = \phi^a_{d,S=1}(W, \eta^*) \cdot \Psi^{S=1}_d + \phi^b_{d,S=1}(W, \eta^*)$, with
\[
\phi^a_{d,S=1}(W, \eta^*) = -1, \qquad
\phi^b_{d,S=1}(W, \eta^*) = \frac{I\{D = d\} \cdot [Y - \mu(d,1,X,\Pi)]}{p_d(X,\Pi)} + \mu(d,1,X,\Pi).
\]
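Because $\phi^a_{d,S=1} = -1$, the empirical analogue of the moment condition has a closed-form solution. As a worked illustration (implied by the linearity, though not spelled out in the proof), setting the sample average of the score over the selected subsample to zero yields the cross-fitted estimator
\[
\hat\Psi^{S=1}_d = \frac{1}{\sum_{i=1}^n S_i} \sum_{i:\, S_i = 1} \left\{ \frac{I\{D_i = d\} \cdot [Y_i - \hat\mu(d,1,X_i,\hat\Pi_i)]}{\hat p_d(X_i, \hat\Pi_i)} + \hat\mu(d,1,X_i,\hat\Pi_i) \right\},
\]
with $\hat\mu$, $\hat p_d$, $\hat\Pi_i$ denoting the cross-fitted nuisance estimates.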
Assumption 3.1(c) Continuity: The expression for the second Gateaux derivative of the map $\eta^* \mapsto E[\phi_{d,S=1}(W, \eta^*, \Psi^{S=1}_d)]$, given in (A.6), is continuous.

Assumption 3.1(d) Neyman orthogonality: For any $\eta^* \in T^*_n$, the Gateaux derivative in the direction $\eta^* - \eta^*_0 = (\hat\pi(D,X,Z) - \pi(D,X,Z),\, \hat p_d(X,\Pi) - p_d(X,\Pi),\, \hat\mu(D,S,X,\Pi) - \mu(D,S,X,\Pi))$ is given by:
\[
\begin{aligned}
\partial E&\left[ \phi_{d,S=1}(W, \eta^*_0, \Psi^{S=1}_d) \mid S = 1 \right] \left[ \eta^* - \eta^*_0 \right] \\
&= - E\left[ \frac{I\{D = d\} \cdot [\hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z))]}{p_d(X,\pi(d,X,Z))} \,\middle|\, S = 1 \right] \qquad (*) \\
&\quad + E\left[ \hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z)) \mid S = 1 \right] \qquad (**) \\
&\quad - E\Bigg[ \frac{I\{D = d\} \cdot \overbrace{[Y - \mu(d,1,X,\pi(d,X,Z))]}^{E[\,\cdot\, \mid D=d,\, S=1,\, X,\, \pi(d,X,Z)] = 0} \cdot [\hat p_d(X,\pi(d,X,Z)) - p_d(X,\pi(d,X,Z))]}{p_d(X,\pi(d,X,Z))^2} \,\Bigg|\, S = 1 \Bigg] \\
&\quad - E\left[ \frac{I\{D = d\} \cdot \partial_\pi \mu(d,1,X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)]}{p_d(X,\pi(d,X,Z))} \,\middle|\, S = 1 \right] \qquad (***) \\
&\quad - E\Bigg[ \frac{I\{D = d\} \cdot \overbrace{[Y - \mu(d,1,X,\pi(d,X,Z))]}^{E[\,\cdot\, \mid D=d,\, S=1,\, X,\, \pi(d,X,Z)] = 0} \cdot \partial_\pi p_d(X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)]}{p_d(X,\pi(d,X,Z))^2} \,\Bigg|\, S = 1 \Bigg] \\
&\quad + E\left[ \partial_\pi \mu(d,1,X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)] \mid S = 1 \right] \qquad (****) \\
&= 0,
\end{aligned}
\]
where $\partial_\pi$ denotes the derivative with respect to the selection probability entering as the last argument. The Gateaux derivative is zero because expressions $(*)$ and $(**)$ as well as $(***)$ and $(****)$, respectively, cancel out, while the remaining terms are zero by the conditional mean independence indicated by the overbraces. To see this, note that by the law of iterated expectations and the fact that conditioning on $D, X, Z$ is equivalent to conditioning on $D, X, \Pi$ (because $\Pi$ is deterministic in $Z$ conditional on $D, X$), $(*)$ corresponds to
\[
- E\Bigg[ \frac{\overbrace{E[I\{D = d\} \mid X, \Pi]}^{=\, p_d(X,\pi(d,X,Z))}}{p_d(X,\pi(d,X,Z))} \cdot [\hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z))] \,\Bigg|\, S = 1 \Bigg]
= - E\left[ \hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z)) \mid S = 1 \right],
\]
which cancels out with $(**)$. In an analogous way, it can be shown that $(***)$ corresponds to
\[
- E\left[ \partial_\pi \mu(d,1,X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)] \mid S = 1 \right],
\]
which cancels out with $(****)$. Therefore, $\partial E[\phi_{d,S=1}(W, \eta^*_0, \Psi^{S=1}_d) \mid S = 1][\eta^* - \eta^*_0] = 0$, proving that the score function is orthogonal.

Assumption 3.2: Score regularity and quality of nuisance parameter estimators

This proof follows in a similar manner to the proof of Theorem 1 and is omitted for brevity.
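This cancellation can also be illustrated numerically. The following R snippet (a toy data-generating process of our own, in which selection depends only on $(X,Z)$ so that the true nuisance parameters are available in closed form) perturbs the $\mu$ and $p_d$ components of the score in (A.6) and compares the derivative of the perturbed moment with that of a naive plug-in moment; the direction through $\Pi$ would require re-evaluating the dependence of $\mu$ on $\Pi$ and is omitted for brevity:
\begin{verbatim}
# Simulation check of Neyman orthogonality for the score (A.6)
# (toy DGP; selection depends on (X, Z) only, so true nuisances are closed-form).
set.seed(123)
n <- 1e6
X <- rnorm(n); Z <- rnorm(n)
D  <- rbinom(n, 1, plogis(0.3 * X))        # treatment, independent of Z given X
Pi <- plogis(X + Z)                        # true selection probability
S  <- rbinom(n, 1, Pi)                     # selection indicator
Y  <- 1 + X + 2 * Pi + rnorm(n)            # outcome: mu(1,1,X,Pi) = 1 + X + 2*Pi

mu0 <- 1 + X + 2 * Pi                      # true outcome regression
p0  <- plogis(0.3 * X)                     # true p_1(X, Pi)
a   <- 0.3 * sin(X); b <- 0.1 * cos(X)     # fixed perturbation directions

f <- function(r) {                         # orthogonal moment, perturbed nuisances
  mu_r <- mu0 + r * a; p_r <- p0 + r * b
  mean(((D == 1) * (Y - mu_r) / p_r + mu_r)[S == 1])
}
g <- function(r) mean((mu0 + r * a)[S == 1])  # naive plug-in moment
h <- 1e-3
c(orthogonal = (f(h) - f(-h)) / (2 * h),   # approx. 0 (up to simulation noise)
  naive      = (g(h) - g(-h)) / (2 * h))   # approx. E[a | S = 1], clearly nonzero
\end{verbatim}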
A.3 Proof of Theorem 3

The score function for the counterfactual $\Psi_d = E[Y(d)]$ is given by:
\[
\phi_d(W, \eta^*, \Psi_d) = \frac{I\{D = d\} \cdot S \cdot [Y - \mu(d,1,X,\Pi)]}{p_d(X,\Pi) \cdot \pi(d,X,Z)} + \mu(d,1,X,\Pi) - \Psi_d. \qquad \text{(A.7)}
\]
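Transcribed into R, the score in (A.7) reads as follows (a hypothetical helper of ours, with nuisance estimates passed in as vectors; this is not the causalweight implementation):
\begin{verbatim}
# Score (A.7) for Psi_d = E[Y(d)], with (cross-fitted) nuisance estimates as inputs.
score_psi_d <- function(Y, D, S, d, mu_d, p_d, pi_d, psi_d) {
  # mu_d : estimates of mu(d, 1, X, Pi);   p_d : estimates of p_d(X, Pi)
  # pi_d : estimates of pi(d, X, Z);       psi_d: candidate parameter value
  resid <- ifelse(S == 1, Y - mu_d, 0)     # Y is unobserved when S = 0
  (D == d) * S * resid / (p_d * pi_d) + mu_d - psi_d
}
# Since the score is linear with coefficient -1 on psi_d (see Assumption 3.1(b)
# below), the estimator is the sample mean of the score evaluated at psi_d = 0:
# psi_hat <- mean(score_psi_d(Y, D, S, 1, mu1_hat, p1_hat, pi1_hat, psi_d = 0))
\end{verbatim}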
Assumption 3.1: Linear scores and Neyman orthogonality

Assumption 3.1(a) Moment condition: The moment condition $E[\phi_d(W, \eta^*_0, \Psi_d)] = 0$ holds:
\[
E\left[ \phi_d(W, \eta^*_0, \Psi_d) \right]
= E\Bigg[ \frac{I\{D = d\} \cdot S \cdot \overbrace{E[Y - \mu(d,1,X,\Pi) \mid D = d, S = 1, X, \Pi]}^{=0}}{p_d(X,\Pi) \cdot \pi(d,X,Z)} + \mu(d,1,X,\Pi) - \Psi_d \Bigg]
= E[\mu(d,1,X,\Pi)] - \Psi_d = 0,
\]
where the first equality follows from the law of iterated expectations and the last equality from the identification result of Theorem 3.
Assumption 3.1(b) Linearity: The score $\phi_d(W, \eta^*, \Psi_d)$ is linear in $\Psi_d$:
$\phi_d(W, \eta^*, \Psi_d) = \phi^a_d(W, \eta^*) \cdot \Psi_d + \phi^b_d(W, \eta^*)$, with
\[
\phi^a_d(W, \eta^*) = -1, \qquad
\phi^b_d(W, \eta^*) = \frac{I\{D = d\} \cdot S \cdot [Y - \mu(d,1,X,\Pi)]}{p_d(X,\Pi) \cdot \pi(d,X,Z)} + \mu(d,1,X,\Pi).
\]
Assumption 3.1(c) Continuity: The expression for the second Gateaux derivative of the map $\eta^* \mapsto E[\phi_d(W, \eta^*, \Psi_d)]$, given in (A.7), is continuous.

Assumption 3.1(d) Neyman orthogonality: For any $\eta^* \in T^*_n$, the Gateaux derivative in the direction $\eta^* - \eta^*_0 = (\hat\pi(D,X,Z) - \pi(D,X,Z),\, \hat p_d(X,\Pi) - p_d(X,\Pi),\, \hat\mu(D,S,X,\Pi) - \mu(D,S,X,\Pi))$ is given by:
\[
\begin{aligned}
\partial E&\left[ \phi_d(W, \eta^*_0, \Psi_d) \right] \left[ \eta^* - \eta^*_0 \right] \\
&= - E\left[ \frac{I\{D = d\} \cdot S \cdot [\hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z))]}{p_d(X,\pi(d,X,Z)) \cdot \pi(d,X,Z)} \right] \qquad (*) \\
&\quad + E\left[ \hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z)) \right] \qquad (**) \\
&\quad - E\Bigg[ \frac{I\{D = d\} \cdot S \cdot \overbrace{[Y - \mu(d,1,X,\pi(d,X,Z))]}^{E[\,\cdot\, \mid D=d,\, S=1,\, X,\, \pi(d,X,Z)] = 0} \cdot [\hat p_d(X,\pi(d,X,Z)) - p_d(X,\pi(d,X,Z))]}{p_d(X,\pi(d,X,Z))^2 \cdot \pi(d,X,Z)} \Bigg] \\
&\quad - E\left[ \frac{I\{D = d\} \cdot S \cdot \partial_\pi \mu(d,1,X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)]}{p_d(X,\pi(d,X,Z)) \cdot \pi(d,X,Z)} \right] \qquad (***) \\
&\quad - E\Bigg[ \frac{I\{D = d\} \cdot S \cdot \overbrace{[Y - \mu(d,1,X,\pi(d,X,Z))]}^{E[\,\cdot\, \mid D=d,\, S=1,\, X,\, \pi(d,X,Z)] = 0} \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)]}{p_d(X,\pi(d,X,Z)) \cdot \pi(d,X,Z)^2} \Bigg] \\
&\quad - E\Bigg[ \frac{I\{D = d\} \cdot S \cdot \overbrace{[Y - \mu(d,1,X,\pi(d,X,Z))]}^{E[\,\cdot\, \mid D=d,\, S=1,\, X,\, \pi(d,X,Z)] = 0} \cdot \partial_\pi p_d(X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)]}{p_d(X,\pi(d,X,Z))^2 \cdot \pi(d,X,Z)} \Bigg] \\
&\quad + E\left[ \partial_\pi \mu(d,1,X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)] \right] \qquad (****) \\
&= 0,
\end{aligned}
\]
with $\partial_\pi$ as defined in the proof of Theorem 2. The Gateaux derivative is zero because expressions $(*)$ and $(**)$ as well as $(***)$ and $(****)$, respectively, cancel out, while the remaining terms are zero by the conditional mean independence indicated by the overbraces. To see this, note that by the law of iterated expectations and the fact that conditioning on $D, X, Z$ is equivalent to conditioning on $D, X, \Pi$ (because $\Pi$ is deterministic in $Z$ conditional on $D, X$), $(*)$ corresponds to
\[
\begin{aligned}
&- E\Bigg[ E\Bigg[ \frac{I\{D = d\}}{p_d(X,\pi(d,X,Z))} \cdot E\Bigg[ \frac{S \cdot [\hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z))]}{\pi(d,X,Z)} \,\Bigg|\, D = d, X, Z \Bigg] \,\Bigg|\, X, \Pi \Bigg] \Bigg] \\
&= - E\Bigg[ E\Bigg[ \frac{I\{D = d\}}{p_d(X,\pi(d,X,Z))} \cdot \frac{\overbrace{E[S \mid D = d, X, Z]}^{=\, \pi(d,X,Z)} \cdot [\hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z))]}{\pi(d,X,Z)} \,\Bigg|\, X, \Pi \Bigg] \Bigg] \\
&= - E\Bigg[ \frac{\overbrace{E[I\{D = d\} \mid X, \pi(d,X,Z)]}^{=\, p_d(X,\pi(d,X,Z))}}{p_d(X,\pi(d,X,Z))} \cdot [\hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z))] \Bigg] \\
&= - E\left[ \hat\mu(d,1,X,\pi(d,X,Z)) - \mu(d,1,X,\pi(d,X,Z)) \right],
\end{aligned}
\]
which cancels out with $(**)$. In an analogous way, it can be shown that $(***)$ corresponds to
\[
- E\left[ \partial_\pi \mu(d,1,X,\pi(d,X,Z)) \cdot [\hat\pi(d,X,Z) - \pi(d,X,Z)] \right],
\]
which cancels out with $(****)$. Therefore, $\partial E[\phi_d(W, \eta^*_0, \Psi_d)][\eta^* - \eta^*_0] = 0$, proving that the score function is orthogonal.

Assumption 3.2: Score regularity and quality of nuisance parameter estimators
This proof follows in a similar manner to the proof of Theorem 1 and is omitted for brevity.
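For completeness, a minimal sketch of the cross-fitting procedure that is combined with these scores, reusing the hypothetical score_psi_d helper and the simple parametric stand-in learners from above (illustrative only; not the causalweight implementation):
\begin{verbatim}
# Illustrative K-fold cross-fitting for Psi_1 = E[Y(1)] based on score (A.7).
dml_psi1 <- function(Y, D, S, X, Z, K = 2) {
  n <- length(Y)
  fold <- sample(rep(1:K, length.out = n))     # random fold assignment
  phi <- numeric(n)
  dat <- data.frame(Y = Y, D = D, S = S, X = X, Z = Z)
  for (k in 1:K) {
    dtr <- dat[fold != k, ]; dte <- dat[fold == k, ]
    # nuisance models on the training folds, Pi predicted at D = 1
    sel <- glm(S ~ D + X + Z, family = binomial(), data = dtr)
    dtr$Pi1 <- predict(sel, newdata = transform(dtr, D = 1), type = "response")
    dte$Pi1 <- predict(sel, newdata = transform(dte, D = 1), type = "response")
    ps <- glm(D ~ X + Pi1, family = binomial(), data = dtr)
    om <- lm(Y ~ X + Pi1, data = subset(dtr, D == 1 & S == 1))
    # score on the held-out fold (psi_d = 0 returns phi^b, see Assumption 3.1(b))
    phi[fold == k] <- score_psi_d(dte$Y, dte$D, dte$S, d = 1,
                                  mu_d = predict(om, newdata = dte),
                                  p_d  = predict(ps, newdata = dte, type = "response"),
                                  pi_d = dte$Pi1, psi_d = 0)
  }
  c(estimate = mean(phi), std_error = sd(phi) / sqrt(n))
}
\end{verbatim}
Averaging the held-out scores over all folds implements the cross-fitting step, so that no observation is used both for estimating the nuisance parameters and for evaluating its own score.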
A.4 Derivation of efficient influence functions