[PDF] Causal mediation analysis for stochastic interventions

Abstract

Mediation analysis in causal inference has traditionally focused on binary exposures and deterministic interventions, and a decomposition of the average treatment effect in terms of direct and indirect effects. In this paper we present an analogous decomposition of the \textit{population intervention effect}, defined through stochastic interventions on the exposure. Population intervention effects provide a generalized framework in which a variety of interesting causal contrasts can be defined, including effects for continuous and categorical exposures. We show that identification of direct and indirect effects for the population intervention effect requires weaker assumptions than its average treatment effect counterpart, under the assumption of no mediator-outcome confounders affected by exposure. In particular, identification of direct effects is guaranteed in experiments that randomize the exposure and the mediator. We discuss various estimators of the direct and indirect effects, including substitution, re-weighted, and efficient estimators based on flexible regression techniques, allowing for multivariate mediators. Our efficient estimator is asymptotically linear under a condition requiring n 1/4 -consistency of certain regression functions. We perform a simulation study in which we assess the finite-sample properties of our proposed estimators. We present the results of an illustrative study where we assess the effect of participation in a sports team on BMI among children, using mediators such as exercise habits, daily consumption of snacks, and overweight status.

Full PDF

CCausal mediation analysis for stochasticinterventions

Iv´an D´ıaz ∗ and Nima S. Hejazi Division of Biostatistics, Weill Cornell Medicine. Graduate Group in Biostatistics, University of California, Berkeley. Center for Computational Biology, University of California, Berkeley.

June 25, 2019

Abstract

Mediation analysis in causal inference has traditionally focused on binary expo-sures and deterministic interventions, and a decomposition of the average treatmenteffect in terms of direct and indirect effects. In this paper we present an analogousdecomposition of the population intervention effect , deﬁned through stochastic in-terventions on the exposure. Population intervention effects provide a generalizedframework in which a variety of interesting causal contrasts can be deﬁned, includ-ing effects for continuous and categorical exposures. We show that identiﬁcationof direct and indirect effects for the population intervention effect requires weakerassumptions than its average treatment effect counterpart, under the assumption ofno mediator-outcome confounders affected by exposure. In particular, identiﬁcationof direct effects is guaranteed in experiments that randomize the exposure and themediator. We discuss various estimators of the direct and indirect effects, includingsubstitution, re-weighted, and efﬁcient estimators based on ﬂexible regression tech-niques, allowing for multivariate mediators. Our efﬁcient estimator is asymptoticallylinear under a condition requiring n / -consistency of certain regression functions.We perform a simulation study in which we assess the ﬁnite-sample properties of ourproposed estimators. We present the results of an illustrative study where we assessthe effect of participation in a sports team on BMI among children, using mediatorssuch as exercise habits, daily consumption of snacks, and overweight status. ∗ corresponding author: [email protected] a r X i v : . [ s t a t . M E ] J un Introduction

Mediation analysis is a powerful analytical tool that allows scientists to unveil the mecha-nisms through which causal effects operate. The development of tools for mediation anal-ysis has a long history in the statistical sciences, starting with the early work of Wright(1921, 1934) on path analysis, which provided the foundations for the later developmentof mediation analysis using structural equation models (Goldberger, 1972). Indeed, one ofthe most widely used mediation analysis methods is based on structural equations (Baronand Kenny, 1986). Recent decades have seen a revolution in the ﬁeld of causal infer-ence from observational and randomized studies, starting with the seminal work of Ru-bin (1974) on the potential outcomes framework, which is itself rooted in ideas datingback to Neyman (1923). More recently, Pearl (1995, 2000) has developed a causal infer-ence framework using non-parametric structural equation models, directed acyclic graphs,and the so-called do-calculus. Related approaches have been proposed by Robins (1986),Spirtes et al. (2000), Dawid (2000), and Richardson and Robins (2013). These frameworksallow researchers to deﬁne causal effects non-parametrically, and to assess the conditionsunder which causal effects can be identiﬁed from data. In particular, novel tools have un-covered important limitations of the earlier work on parametric structural equation modelsfor mediation analysis (Pearl, 1998; Imai et al., 2010). Essentially, structural equationmodels impose implausible assumptions on the data generating mechanism, and are thusof limited applicability to complex phenomena in biology, health, economics, and the so-cial sciences. For example, modern causal models have revealed the incorrectness of thewidely popular method of Baron and Kenny (1986) in several important cases, such asin the presence of confounders of the mediator-outcome relationship (Cole and Hern´an,2002).Using the potential outcomes framework, Robins and Greenland (1992) introduced anon-parametric decomposition of the causal effect of a binary exposure into so-called nat-ural indirect and direct effects. The indirect effect quantiﬁes the effect on the outcomethrough the mediator and the direct effect quantiﬁes the effect through all other mecha-nisms. Pearl (2001) arrived at an equivalent effect decomposition using non-parametricstructural equation models. The identiﬁcation of these natural (in)direct effects relies onso-called cross-world counterfactual independencies, i.e., independencies on counterfac-tual variables indexed by distinct hypothetical interventions. An important consequenceof this deﬁnition is that the natural (in)direct effect is not identiﬁable in a randomized trial,which is problematic as it implies that scientiﬁc claims obtained from these models are notfalsiﬁable through experimentation (Popper, 1934; Dawid, 2000; Robins and Richardson,2010).In an attempt to solve these problems, several authors have proposed methods that do2way with cross-world counterfactual independencies. These methods can be divided intwo types: identiﬁcation of bounds (Robins and Richardson, 2010; Tchetgen and Phiri,2014; Miles et al., 2015), and alternative deﬁnitions of the (in)direct effect (Petersen et al.,2006; van der Laan and Petersen, 2008; Vansteelandt and VanderWeele, 2012; Vander-Weele et al., 2014). Here, we take the second approach, deﬁning the (in)direct effect interms of a decomposition of the total effect of a stochastic intervention on the populationexposure.Most causal inference problems consider deterministic interventions that set each unit’sexposure to some ﬁxed value that could be a function of the unit’s baseline variables.Stochastic interventions are a generalization of this framework, and are loosely deﬁnedas interventions which yield an exposure that is a random variable after conditioning onbaseline variables. Estimation of total effects of stochastic interventions was ﬁrst consid-ered by Stock (1989) and has been the subject of recent study (Robins et al., 2004; Didelezet al., 2006; Tian, 2008; Pearl, 2009; Taubman et al., 2009; Stitelman et al., 2010; D´ıazand van der Laan, 2013; Dud´ık et al., 2014; Haneuse and Rotnitzky, 2013; Young et al.,2014). Particularly relevant to this work are the methods of D´ıaz and van der Laan (2012);Haneuse and Rotnitzky (2013) who deﬁne total effects for modiﬁed treatment policies,and Kennedy (2018a), who study identiﬁcation and estimation of the total the effect ofpropensity score interventions that shift a binary exposure distribution. These papers donot address decomposition of the effects of stochastic interventions on the exposure intodirect and indirect effects, which is the central theme of our manuscript.Our methods are also related to a family of new direct and indirect effects (Didelezet al., 2006; VanderWeele et al., 2014; Lok, 2016; Vansteelandt and Daniel, 2017; Zhengand van der Laan, 2017; Rudolph et al., 2017; Lok, 2019), which have been collectivelytermed interventional effects (Nguyen et al., 2019). This family of effects deals with bi-nary exposures and deterministic interventions on the exposure, and is thus not entirelyrelated to our approach, which deals with both continuous and categorical exposures andstochastic interventions on the exposure. Like the effects on the treated of Vansteelandtand VanderWeele (2012), interventional effects share the no-cross-world-independenceproperty of our methods. The interested reader is referred to Nguyen et al. (2019) for ataxonomy of the several mediation analyses proposed in the causal inference literature upto date.Stochastic interventions have analytical advantages compared to their deterministiccounterparts, such as allowing the seamless deﬁnition of causal effects for continuousexposures with an interpretation that is familiar to regular users of linear regression ad-justment. For example, Haneuse and Rotnitzky (2013) assess the effect of an interventionthat reduces a patient’s operating time (i.e., the time spent in surgery) on the risk of post-operative outcomes among patients undergoing surgical resection non-small-cell lung can-3er. D´ıaz and van der Laan (2012) study the effect of increasing the amount of leisure timephysical activity in the elderly on subsequent all-cause mortality. D´ıaz and van der Laan(2013) study the effect of a (hypothetical) policy that enforces pollution levels below acertain cutoff point. Kennedy (2018a) shows that stochastic interventions can also be usedin longitudinal studies to deﬁne and estimate total effects without relying on the positivityassumption.In this article, we propose a decomposition of the effect of a stochastic interventioninto a direct and an indirect effect, with interpretation analogous to that originally pro-posed by Robins and Greenland (1992) and Pearl (2001). We show that the identiﬁcationof (in)direct effects based on stochastic interventions does not require cross-world coun-terfactual independencies, therefore yielding scientiﬁc results that can be tested throughexperimentation on both the exposure and mediator. Of high practical relevance, our pro-posal also allows the deﬁnition and estimation of non-parametric mediated effects for con-tinuous exposures, a problem for which no methods or software exist. Parametric media-tion methods such as those discussed by Vansteelandt et al. (2012) induce unquantiﬁableamounts of bias by imposing untestable and implausible parametric assumptions on thedistribution of cross-world counterfactuals.We develop a one-step non-parametric estimator based on the efﬁcient inﬂuence func-tion, incorporating ﬂexible regression tools from the machine learning literature, and pro-vide n / -rate convergence and asymptotic linearity results. We propose methods to usethese asymptotic distributions to construct conﬁdence regions and to test the null hypoth-esis of no direct effect. Our estimator has roots in semiparametric estimation theory (e.g.,Pfanzagl and Wefelmeyer, 1985; Begun et al., 1983; van der Vaart, 1991; Newey, 1994;Bickel et al., 1997), and in the targeted learning framework of van der Laan and Rubin(2006); van der Laan and Rose (2011, 2018). In particular, we use cross-ﬁtting in or-der to obtain n / -convergence of our estimators while avoiding entropy conditions thatmay be violated by the data adaptive estimators we use (Zheng and van der Laan, 2011;Chernozhukov et al., 2018). Our estimators use a re-parameterization of certain integralsas conditional expectations in order to accommodate multivariate mediators. Softwareimplementing our methods is provided in the form of an open source R package freelyavailable on GitHub. Let A denote a continuous or categorical exposure variable, let Y denote a continuousor binary outcome, let Z denote a multivariate mediator, and let W denote a vector ofobserved covariates. Let O = ( W, A, Z, Y ) represent a random variable with distribution4 . We use P n to denote the empirical distribution of a sample of n i.i.d. observations O , . . . , O n . We let P f = (cid:82) f ( o )d P ( o ) for a given function f ( o ) , and use E to denoteexpectations with respect to P . We assume P ∈ M , where M is the nonparametricstatistical model deﬁned as all continuous densities on O with respect to a dominatingmeasure ν . Let p denote the corresponding probability density function. We use g ( a | w ) to denote the probability density function or the probability mass function of A conditionalon W = w ; m ( a, z, w ) and b ( a, w ) to denote the outcome regression functions E ( Y | A = a, Z = z, W = w ) and E ( Y | A = a, W = w ) , respectively; and e ( a | z, w ) to denotethe conditional density or probability mass function of A conditional on ( Z, W ) . Let g ( a | w ) be dominated by a measure κ ( a ) (e.g., the counting measure for binary A andthe Lebesgue measure for continuous A ). We use q ( z | a, w ) and r ( z | w ) to denotethe corresponding conditional densities of Z . The parametrization e = gq/r will provefundamental in the construction of our estimators, since it will allow us to avoid estimationof multivariate conditional densities. A similar parameterization is used by Zheng andvan der Laan (2012) to estimate mediated effects under deterministic interventions. Weuse W , A , Z and Y to denote the support of the corresponding random variables.We formalize the deﬁnition of our counterfactual variables using the following non-parametric structural equation model (NPSEM), but note that equivalent methods may bedeveloped by taking the counterfactual variables as primitives. Assume W = f W ( U W ); A = f A ( W, U A ); Z = f Z ( W, A, U M ); Y = f Y ( W, A, ZU Y ) . (1)This set of equations represents a mechanistic model assumed to generate the observeddata O ; furthermore, it encodes several fundamental assumptions. First, an implicit tem-poral ordering is assumed — that is, Y occurs after Z , A and W ; Z occurs after A and W ; and A occurs after W . Second, each variable (i.e., { W, A, Z, Y } ) is assumed to begenerated from the corresponding deterministic function (i.e., { f W , f A , f Z , f Y } ) of theobserved variables that precede it temporally, plus an exogenous variable, denoted by U .Each exogenous variable is assumed to contain all unobserved causes of the correspond-ing observed variable. Independence assumptions on U = ( U W , U A , U Z , U Y ) necessaryfor identiﬁcation will be clariﬁed in Section 2.1. Furthermore, we note that we have explic-itly excluded outcome-mediator confounders which are affected by exposure. Mediationanalysis in the presence of a such variables is notoriously hard (Avin et al., 2005); theadaptation of our methods to this problem is possible but it requires a new set of toolswhich is out of the scope of this paper.Causal effects are deﬁned in terms of hypothetical interventions on the NPSEM (1). Inparticular, consider an intervention in which the equation corresponding to A is removed,and the exposure is drawn from a user-speciﬁed distribution g δ ( a | w ) , which may dependon g and is indexed by a user-speciﬁed parameter δ . We assume without loss of generality5hat g δ =0 = g . Let A δ denote a draw from g δ ( a | w ) . Alternatively, such modiﬁcations cansometimes be described in terms of an intervention in which the equation correspondingto A is removed and the exposure is set equal to a hypothetical regime d ( A, W ) . Regime d depends on the natural (that is, under no intervention) exposure level A and covariates W . The latter intervention is sometimes referred to as depending on the natural value ofexposure , or as a modiﬁed treatment policy (Haneuse and Rotnitzky, 2013). Young et al.(2014) provide a discussion of the differences and similarities in the interpretation andidentiﬁcation of these two interventions. Below, we discuss two examples of stochasticinterventions: modiﬁed treatment policies, and exponential tilting. Example 1 (Modiﬁed treatment policy (Haneuse and Rotnitzky, 2013)) . Let A denote acontinuous exposure, such as operating time in non-small-cell lung cancer. Assume thedistribution of A conditional on W = w is supported in the interval ( l ( w ) , u ( w )) . Thatis, the minimum possible operating time for an individual with covariates W = w is l ( w ) .Then one may deﬁne a hypothetical post-intervention exposure A δ = d ( A, W ) , where d ( a, w ) = (cid:40) a − δ if a > l ( w ) + δa if a ≤ l ( w ) + δ, (2)where < δ < u ( w ) is an arbitrary user-given value. Interesting modiﬁcations to thisregime may be obtained by allowing δ to be a function of w , therefore allowing the re-searcher to specify a different change in operating time as a function of covariates such ascomorbidities, age, etc. This intervention was ﬁrst introduced by D´ıaz and van der Laan(2012), and has been further discussed in D´ıaz and van der Laan (2018) and Haneuse andRotnitzky (2013). Example 2 (Exponential tilting) . We can alternatively deﬁne a tilted intervention distribu-tion as g δ ( a | w ) = exp( δa ) g ( a | w ) (cid:82) exp( δa ) g ( a | w )d κ ( a ) , (3)for δ ∈ R , and let the hypothetical post-intervention exposure A δ be a random draw from g δ , conditional on the natural value of the observed covariates W . For binary A , Kennedy(2018a) proposed evaluating the total effect of a binary exposure A in terms of incrementalpropensity score interventions that replace the propensity score g (1 | w ) with a shifted ver-sion based on multiplying the odds of exposure by a user-given parameter δ (cid:48) . In particular,the post-intervention propensity score is given by g δ (cid:48) (1 | w ) = δ (cid:48) g (1 | w ) δ (cid:48) g (1 | w ) + 1 − g (1 | w ) , (4)6or < δ (cid:48) < ∞ . The proposal of Kennedy (2018a) is thus a case of exponential tilting (3)under the parameterization δ (cid:48) = exp( δ ) . This choice of parameterization is motivated bythe fact that δ (cid:48) can be interpreted as an odds ratio indicating how the intervention changesthe odds of exposure. The extremes of δ (cid:48) = 0 and δ (cid:48) = ∞ correspond to the standardinterventions A = 0 and A = 1 considered in the deﬁnition of the average treatmenteffect.We now turn our attention to deﬁning the population intervention effect (PIE) of A on Y . To proceed, for any values ( a, z ) , consider the counterfactual outcome Y ( a, z ) = f Y ( W, a, z, U Y ) . , and the counterfactual mediator Z ( a ) = f Z ( W, a, U Z ) . The counter-factual Y ( a, z ) is the outcome in a hypothetical world in which ( A, Z ) = ( a, z ) is ﬁxedexternally. The PIE is deﬁned as a contrast comparing the expectation of the outcomeunder no intervention with the expectation of the counterfactual outcome obtained underan intervention A δ : ψ ( δ ) = E { Y ( A δ ) − Y } . Note that the interpretation of the PIE depends on the stochastic intervention considered.For example, for the modiﬁed treatment policies of Example 1, the PIE describes thedifference in outcomes obtained by a reduction of δ in operating time. In the case of theincremental propensity score intervention (4), the PIE is interpreted as the difference inoutcomes obtained by an intervention under which the odds of exposure is δ (cid:48) times highercompared to current practice.Since A is a cause of Z , an intervention that changes the exposure to A δ also inducesa counterfactual mediator Z ( A δ ) . As a consequence of the consistency implied by theNPSEM, we have Y ( A, Z ) = Y . Similarly, the law of composition (Pearl, 2000) allowsus to write Y ( A δ , Z ( A δ )) = Y ( A δ ) . Thus, the PIE may be decomposed in terms of a population intervention direct effect (PIDE) and a population intervention indirect effect(PIIE) : ψ ( δ ) = PIIE (cid:122) (cid:125)(cid:124) (cid:123) E { Y ( A δ , Z ( A δ )) − Y ( A δ , Z ) } + PIDE (cid:122) (cid:125)(cid:124) (cid:123) E { Y ( A δ , Z ) − Y ( A, Z ) } . (5)This decomposition of the PIE as the sum of direct and indirect effects has an interpre-tation analogous to the corresponding standard decomposition of the average treatmenteffect (Pearl, 2001). In particular, the direct effect represents the effect of an interventionthat changes the distribution of the exposure while keeping the distribution of the media-tors ﬁxed at the value that it would have taken under no intervention. The indirect effectmeasures the effect of an indirect intervention on the mediators generated by interveningon the exposure, while holding the intervention on the exposure constant.7he intervention in Example 1 arises naturally as a modiﬁed treatment policy. In con-trast, the intervention in Example 2 arises directly as a stochastic intervention that modiﬁesthe distribution of the variables — it is unclear as of yet whether this quantity may be in-terpreted as a modiﬁed treatment policy. Drawing on the work of Haneuse and Rotnitzky(2013), we make the following assumption for modiﬁed treatment policies, which ensuresthat we can use the change of variable formula when computing integrals over A . This isuseful for studying properties of the parameter and estimators we propose. A1 (Piecewise smooth invertibility) . For each w ∈ W , assume that the interval I ( w ) =( l ( w, ) , u ( w )) may be partitioned into subintervals I δ,j ( w ) : j = 1 , . . . , J ( w ) such that d ( a, w ) is equal to some d j ( a, w ) in I δ,j ( w ) and d j ( · , w ) has inverse function h j ( · , w ) withderivative h (cid:48) j ( · , w ) .Under this assumption, the distribution of a modiﬁed treatment policy A δ = d ( A, W ) may be recovered through (see Haneuse and Rotnitzky, 2013): g δ ( a | w ) = J ( w ) (cid:88) j =1 I δ,j { h j ( a, w ) , w } g { h j ( a, w ) | w } h (cid:48) j ( a, w ) , (6)where I δ,j { u, w } = 1 if u ∈ I δ,j ( w ) and I δ,j { u, w } = 0 otherwise. In Example 1, thestochastic intervention becomes g δ ( a | w ) = g ( a | w ) { l ( w ) ≤ a ≤ l ( w ) + δ } + g ( a + δ | w ) { l ( w ) ≤ a ≤ u ( w ) − δ } . Therefore, under A1, a modiﬁed treatment policy may also be represented as a changeby which the equation f A is removed from the NPSEM and A is replaced by a draw A δ from the distribution g δ ( a | w ) . As a result of these two representations, the interven-tion may be interpreted in two different ways: (i) a change in the probabilistic mechanismused to assign exposure level, and (ii) a subject-speciﬁc change in exposure from A to A δ = d ( A, W ) , where only interpretation (i) requires A1. Note, however, that the popula-tion distribution of the exposure is the same under both interventions (Young et al., 2014);thus, both representations lead to exactly the same marginal counterfactual outcome dis-tributions.Several estimators of the functional ψ ( δ ) have previously been proposed. For the caseof a continuous exposure, D´ıaz and van der Laan (2012) developed inverse probabilityweighted, outcome regression, and doubly robust estimators based on the framework oftargeted minimum loss-based estimation (TMLE) (van der Laan and Rose, 2011), us-ing data adaptive estimators of the relevant nuisance parameters. D´ıaz and van der Laan(2018) improved on the previous methodology by constructing a TMLE algorithm with8ower computational complexity that preserves the desirable asymptotic properties of theoriginal approach. Haneuse and Rotnitzky (2013) propose estimators that rely on cor-rectly speciﬁed parametric models. Such methods are of limited applicability since theyare reliable only in situations where the nuisance parameters involve only few categoricalvariables, where correctly speciﬁed (that is, saturated) parametric models can conscien-tiously be constructed. For the binary case with g δ as in Example 2, Kennedy (2018a)proposed an estimator for ψ ( δ ) . This estimator is efﬁcient, asymptotically linear, and itallows incorporation of data adaptive estimators of the nuisance parameters.Since E ( Y ) is trivially estimated by the empirical mean in the sample, our optimalitytheory and estimators focus on θ ( δ ) = E { Y ( A δ , Z ) } . We present two types of results:for general modiﬁed treatment policies satisfying (A1), and for the particular stochasticintervention of Example 2. We compare the assumptions required for both. In this section we introduce the counterfactual variable Y ( a, z ) , deﬁned as the outcomethat would be observed in a hypothetical world in which P { ( A, Z ) = ( a, z ) } = 1 . This isthe same counterfactual variable that is often used to perform mediation analyses on theaverage treatment effect (Robins and Greenland, 1992; Pearl, 2001).We introduce the following identiﬁcation assumptions: A2 (Common support) . Assume supp { g δ ( · | w ) } ⊆ supp { g ( · | w ) } for all w ∈ W . A3 (Conditional exchangeability of exposure and mediator assignment) . Assume E { Y ( a, z ) | A, W, Z } = E { Y ( a, z ) | W, Z } for all ( a, z ) ∈ A × Z . Assumption A2 is standard in the analysis of causal effects, and simply states that the δ -speciﬁc intervention of interest is supported in the data. This assumption holds for all δ in the interventions described in Examples 1 and 2 (D´ıaz and van der Laan, 2012; Kennedy,2018a). Assumption A3 is related to the assumption that Vansteelandt and VanderWeele(2012) used for identiﬁcation of mediated effects among the treated. In that proposalthe authors assume Y ( a, z ) ⊥⊥ ( A, Z ) | W , which would imply the stronger assumption E { Y ( a, z ) | A, W, Z } = E { Y ( a, z ) | W } . This assumption would be satisﬁed for anypre-exposure variable W in a randomized experiment in which exposure and mediator arerandomized. Thus, the direct effect for a population intervention corresponds to contrastsbetween treatment regimes of a randomized experiment via interventions on A and Z ,unlike the natural direct effect for the average treatment effect (Robins and Richardson,2010). This claim is made rigorous in the identiﬁcation result of Theorem 1 presentedbelow. A proof is available in the Supplementary Materials, together with the assumptionson the NPSEM exogenous errors U which are compatible with A3.9 heorem 1 (Identiﬁcation) . Under A2 and A3, θ ( δ ) is identiﬁed and is given by θ ( δ ) = (cid:90) m ( a, z, w ) g δ ( a | w ) p ( z, w )d ν ( a, z, w ) . (7) Remark 1 (Mediator-outcome confounder not affected by exposure) . Note that, like thenatural direct effect of Pearl (2001), we require that all confounders of the mediator-outcome relation are measured. This assumption is implicit in A3. To see why, considerthe DAG in Figure 1. Conditioning on the collider Z opens a pathway from A to Y ( a, z ) through the outcome-mediator confounder V . If V is not measured and adjusted for (i.e., V ⊆ W ), then A3 fails. Z VA Y ( a, z ) Figure 1: Directed acyclic sub-graph of the variables involved in the case of an unmeasuredmediator-outcome confounder.

Remark 2 (Mediator-outcome confounded by exposure) . The methods presented herecannot be used if the mediator-outcome confounder V is affected by exposure. This isdue to the introduction of a new counterfactual variable V ( a ) . In particular, consider theDAG in Figure 2, where we have included only the relevant factual and counterfactualvariables. In this case, conditioning on the collider V would open a path A → V ← U V → V ( a ) → Y ( a, z ) , and would make A3 invalid. However, conditioning on V is nec-essary for A3 in order to close the path A → Z ← V → U V → V ( a ) → Y ( a, z ) , whichgets open when we condition on the collider Z . A comprehensive discussion of issues inidentiﬁcation of path effects that includes this issue as a particular problem may be foundin Avin et al. (2005). VanderWeele et al. (2014) propose a solution to this problem whichinvolves a stochastic intervention on the mediator Z . We note that this is intrinsically dif-ferent from the problem treated here, since we are interested in stochastic interventions on A (not on Z ) and do not address mediator-outcome confounders affected by exposure. U V V V ( a ) A Z Y ( a, z ) Figure 2: Directed acyclic sub-graph of the variables involved in the case of an outcome-mediator confounder affected by exposure. 10

Optimality theory for estimation of the direct effect

Thus far we have discussed the decomposition of the effect of a stochastic interventioninto direct and indirect effects, and have provided identiﬁcation results under weaker as-sumptions in comparison to the natural direct effect. In the sequel, we turn our attentionto a discussion of efﬁciency theory for the estimation of θ ( δ ) in the nonparametric model M . The efﬁcient inﬂuence function (EIF) is a key object in semi-parametric estimationtheory, as it characterizes the asymptotic behavior of all regular and efﬁcient estimators(Bickel et al., 1997; van der Vaart, 2002). Knowledge of the EIF has important practi-cal implications. First, the EIF is often useful in constructing locally efﬁcient estimators.There are three common approaches for this: (i) using the EIF as an estimating equation(e.g., van der Laan and Robins, 2003), (ii) using the EIF in a one-step bias correction (e.g.,Pfanzagl and Wefelmeyer, 1985), and targeted minimum loss-based estimation (van derLaan and Rubin, 2006; van der Laan and Rose, 2011, 2018). Second, the EIF estimatingequation often enjoys desirable properties such as multiple robustness, which allows forsome components of the data distribution to be inconsistently estimated while preservingconsistency of the estimator. Third, the asymptotic analysis of estimators constructed us-ing the EIF often yields second-order bias terms, which require slow convergence rates(e.g., n − / ) for the nuisance parameters involved, thereby enabling the use of ﬂexibleregression techniques in estimating these quantities.In Theorem 2 we present the EIF for a general stochastic intervention. Although thecomponents of the EIF associated with Y and ( Z, W ) are the same, the component as-sociated with the model for the distribution of A must be computed on a case-by-casebasis, that is, for each intervention of interest. Proofs for all results are available in theSupplementary Materials. Theorem 2 (Efﬁcient inﬂuence function) . Let η = ( g, m, e ) . The efﬁcient inﬂuence func-tion for θ ( δ ) in the nonparametric model M is D Yη,δ ( o ) + D Aη,δ ( o ) + D Z,Wη,δ ( o ) − θ ( δ ) ,where D Yη,δ ( o ) = g δ ( a | w ) e ( a | z, w ) { y − m ( a, z, w ) } D Z,Wη,δ ( o ) = (cid:90) m ( a, z, w ) g δ ( a | w )d κ ( a ) , and D Aη,δ ( o ) is the efﬁcient score corresponding to the non-parametric model for g . An immediate consequence of Theorem 2 is that, in a randomized trial, we have D Aη,δ ( o ) = 0 . Lemmas 1 and 2 below present the D Aη,δ ( o ) components for modiﬁed treat-ment policies satisfying A1 and for the exponential tilting of Example 2, respectively.11 emma 1 (Modiﬁed treatment policies) . Deﬁne the nuisance parameter φ ( a, w ) = (cid:90) m ( d ( a, w ) , z, w ) r ( z | w ) dν ( z ) (8) = E (cid:26) g ( A | W ) e ( A | Z, W ) m ( d ( A, W ) , Z, W ) | A = a, W = w (cid:27) , (9) and augment η as η = ( g, m, e, φ ) . If the modiﬁed treatment policy d ( A, W ) satisﬁes A1,then D Aη,δ ( o ) = φ ( a, w ) − (cid:90) φ ( a, w ) g ( a | w )d κ ( a ) . Lemma 2 (Exponential tilt) . Deﬁne the nuisance parameter φ ( a, w ) = (cid:90) m ( a, z, w ) r ( z | w ) dν ( z ) (10) = E (cid:26) g ( A | W ) e ( A | Z, W ) m ( A, Z, W ) | A = a, W = w (cid:27) , (11) and augment η as η = ( g, m, e, φ ) . If the stochastic intervention is the exponential tilt (3),then D Aη,δ ( o ) = g δ ( a | w ) g ( a | w ) (cid:26) φ ( a, w ) − (cid:90) φ ( a, w ) g δ ( a | w )d κ ( a ) (cid:27) . Expressions (8) and (10) show that estimators based on the respective inﬂuence func-tions require integration with respect to the mediator Z , as well as estimation of the possi-bly multivariate conditional density r ( z | w ) , which may pose an estimation challenge dueto the curse of dimensionality. To solve the issue, we propose an alternative parametriza-tion (9) and (11) of the EIF based on a sequential regression φ , rather than using the densityof Z conditional on ( A, W ) and W . This choice has important consequences for the pur-pose of estimation, as it helps to bypass estimation of the (possibly high-dimensional)conditional density of the mediators, instead allowing for regression methods, which arefar more commonly found in the statistics literature and software, to be used for estimationof the relevant quantity. In particular, if r ( z | w ) is hard to estimate, estimators of φ may becomputed by ﬁrst estimating g , m , and e , computing the pseudo-outcomes deﬁned in thelemmas, and applying regression techniques to estimate the outer conditional expectation.For binary exposures, the EIF corresponding to the incremental propensity score inter-vention may be simpliﬁed as in the following corollary. Corollary 1 (Efﬁcient inﬂuence function for incremental propensity score interventions) . Let A take values on { , } , and let the exponentially tilted intervention g δ, (1 | W ) be s in (4). Then, the EIF of Lemma 2 may be simpliﬁed as follows. Deﬁne the nuisanceparameter φ ( w ) = E { m (1 , Z, W ) − m (0 , Z, W ) | W = w } , and let η = ( g, m, e, φ ) . Then D Aη,δ ( o ) = δφ ( w ) { a − g (1 | w ) }{ δg (1 | w ) + 1 − g (1 | w ) } . Note that in Lemmas 1, 2, and Corollary 1, we have used φ to represent differentparameters. We have allowed this abuse of notation because the nature of this auxiliaryparameter is the same for all three cases, and having one symbol will allow us to stateour estimation results in some generality. In the sequel, the difference will always beclear from context. Note also that g ( a | w ) could be pulled out of the expectation in thedeﬁnition of φ ( a, w ) . We decided to leave it inside the expectation as we conjecture that itmay act as a stabilizing factor for the inverse probability weights { e ( a | z, w ) } − .In contrast to the efﬁcient inﬂuence function for the natural direct effect (Tchetgen Tch-etgen and Shpitser, 2012), the contribution of the exposure process to the EIF for the PIEmediated effect is non-zero. This is a direct consequence of the fact that the parameterof interest depends on g ; moreover, this implies that, unlike the natural direct effect, theefﬁciency bound in observational studies differs from the efﬁciency bound in randomizedstudies. As we see in the lemmas below, this also implies that it is not generally possibleto obtain estimating equations that are robust to inconsistent estimation of g . However,such robustness will be possible if the stochastic intervention is also a modiﬁed treatmentpolicy satisfying A1. Lemma 3 (Multiple robustness for modiﬁed treatment policies) . Let the modiﬁed treat-ment policy satisfy A1, and let η = ( g , e , m , φ ) be such that one of the two followingconditions hold:(i) g = g and either e = e or m = m ,(ii) m = m and φ = φ .Then P D η ,δ = θ ( δ ) , with D η,δ as deﬁned in Theorem 2 and Lemma 1. The above lemma implies that it is possible to construct consistent estimators for θ under consistent estimation of at least two of the nuisance parameters in η , in the conﬁgu-rations described in the lemma. This lemma is a direct consequence of Theorem 5, foundin the Supplementary Materials. We note, however, that part (ii) of the lemma may beuninteresting if the parameterization (9) is used to estimate φ . In that case φ = φ will13enerally require g = g , e = e , and m = m , as well as consistency of the estimatorfor the outer expectation. In contrast, if the parameterization (8) is used to estimate φ ,then the case m = m and φ = φ would be trivially satisﬁed if m = m and r = r ,where r is the density used to compute φ . To some readers it may seem surprising thatestimation of θ ( δ ) may be robust to estimation of g , even when the parameter deﬁnitionin (7) is explicitly dependent on g . We offer some intuition into this surprising result bynoting that assumption A1 allows us to use the change of variable formula to obtain θ ( δ ) = E (cid:26)(cid:90) m ( d ( A, W ) , z, W ) r ( z | , W )d ν ( z ) (cid:27) . Estimation of this parameter without relying on g may be carried out by consistently esti-mating m , r , and using the empirical distribution as an estimator of the outer expectation.This behavior has been previously observed for the total effect ψ ( δ ) under A1 (D´ıaz andvan der Laan, 2012; Haneuse and Rotnitzky, 2013).The robustness properties of the EIF for an exponential tilt are presented below. Lemma 4 (Robustness for exponential tilting) . Let g δ be deﬁned as in (3). Let η =( g , e , m , φ ) be such that g = g and either e = e or m = m . Then P D η ,δ = θ ( δ ) ,with D η,δ as deﬁned in Theorem 2 and Lemma 2. Lemma 4 is a direct consequence of Theorem 6 in the Supplementary Materials. Thecorresponding proof reveals that the EIF for the binary distribution is not multiply robust— that is, the intervention fails to satisfy assumption A1 and integrals over the range of A cannot be computed using change of variable formula. This behavior has been previouslyobserved for other interventions that do not satisfy A1 (D´ıaz and van der Laan, 2013).Even though this lemma implies that consistent estimation of g is required, the bias termsare still second-order, so an estimator of g converging at rate n / or faster is sufﬁcient, aswe will see in the sequel. We start this section describing two simple estimators, the substitution and re-weightedestimators. These estimators are motivated by the fact that θ ( δ ) has the two followingalternative representations: θ ( δ ) = E (cid:26)(cid:90) m ( a, Z, W ) g δ ( a | W )d κ ( a ) (cid:27) (12) = E (cid:26) g δ ( A | W ) e ( A | Z, W ) Y (cid:27) , (13)14here we remind the reader that e ( a | z, w ) denotes the probability density function of A conditional on ( Z, W ) . Equation (13) follows from noting that gq/r = e . This parameter-ization has the advantage that only the univariate conditional density e ( a | z, w ) has to beestimated, instead of the conditional densities of the possibly high-dimensional mediator Z . A similar result was also used by Zheng and van der Laan (2012) to develop a targetedminimum loss-based estimator of natural direct effects under a binary exposure variable.The substitution estimator is simply deﬁned by plugging in estimators of m and g δ intothe identiﬁcation result given in (12). Consistency of this estimator requires consistentestimation of the outcome regression m and the intervention distribution g δ . The secondestimator is a re-weighting estimator based on the alternative representation of the iden-tiﬁcation result given in (13), which requires consistent estimation of g δ and e . In theremainder of this section, we discuss an efﬁcient estimator that combines ideas from theprevious two estimators as well as the efﬁcient inﬂuence function derived in the previoussection, in order to build an estimator that is both efﬁcient and robust to model misspec-iﬁcation. We discuss an asymptotic linearity result for the doubly robust estimator thatallows computation of asymptotically correct conﬁdence intervals and hypothesis tests.In the sequel, we assume that preliminary estimators ˆ m , ˆ g δ , ˆ φ and ˆ e of m , g δ , φ , and e , respectively, are available. These estimators may be obtained from ﬂexible regressiontechniques such as support vector machines, regression trees, boosting, neural networks,splines, or ensembles thereof (Breiman, 1996; van der Laan et al., 2007). As previouslydiscussed, the consistency of these estimators will determine the consistency of our esti-mators of the population mediation intervention mean θ . First, we discuss a substitution estimator based on (12), computed as ˆ θ sub ( δ ) = (cid:90) n n (cid:88) i =1 ˆ m ( a, Z i , W i )ˆ g δ ( a | W i )d κ ( a ) , (14)where we have substituted estimators of m and g δ in (12), and have estimated the expec-tation with respect to the joint density p ( z, w ) by the empirical mean. The re-weightedestimator is based on (13), and is deﬁned by ˆ θ re ( δ ) = 1 n n (cid:88) i =1 ˆ g δ ( A i | W i )ˆ e ( A i | , Z i , W i ) Y i (15)If ˆ m , ˆ g δ , and ˆ e are estimated within parametric models, then, by the delta method,both ˆ θ sub ( δ ) and ˆ θ re ( δ ) are asymptotically linear and n / -consistent. The bootstrap or an15nﬂuence function-based estimator may be used to construct asymptotically correct conﬁ-dence intervals. However, if either the mediators or confounders are high-dimensional, therequired consistency of ˆ m , ˆ g δ , and ˆ e will hardly be achievable within parametric models.This issue may be alleviated through the use of data adaptive estimators. Unfortunately, n / -consistency of ˆ θ sub ( δ ) and ˆ θ re ( δ ) will generally require that ˆ m , ˆ g δ , and ˆ e are consis-tent in L ( P ) -norm at parametric rate, which is generally not possible within data adaptiveestimation of high-dimensional regressions. Thus, the asymptotic distribution will gener-ally be unknown, rendering the construction of conﬁdence intervals and hypothesis testsimpossible. In the following section, we use the efﬁcient inﬂuence function to propose anestimator that is n / -consistent and efﬁcient under a weaker assumption, requiring only n / -convergence of second-order regression bias terms. We propose using the efﬁcient inﬂuence function D η,δ to construct a robust and efﬁcientestimator, constructed as the solution to the estimating equation P n D ˆ η,δ = 0 in θ , fora preliminary estimator ˆ η of η . In order to avoid imposing entropy conditions on theinitial estimators, we advocate for the use of cross-ﬁtting (Zheng and van der Laan, 2011;Chernozhukov et al., 2016) in the estimation procedure. Let V , . . . , V J denote a randompartition of the index set { , . . . , n } into J prediction sets of approximately the same size.That is, V j ⊂ { , . . . , n } ; (cid:83) Jj =1 V j = { , . . . , n } ; and V j ∩ V j (cid:48) = ∅ . In addition, foreach j , the associated training sample is given by T j = { , . . . , n } \ V j . Denote by ˆ η j theestimator of η = ( g, m, e, φ ) , obtained by training the corresponding prediction algorithmusing only data in the sample T j . Further, let j ( i ) denote the index of the validation setwhich contains observation i . The estimator is thus deﬁned as: ˆ θ ( δ ) = 1 n n (cid:88) i =1 D ˆ η j ( i ) ,δ ( O i ) = 1 n n (cid:88) i =1 (cid:110) D Y ˆ η j ( i ) ,δ ( O i ) + D A ˆ η j ( i ) ,δ ( O i ) + D Z,W ˆ η j ( i ) ,δ ( O i ) (cid:111) . (16)In a randomized trial the estimator may also be computed by setting D A ˆ η j ( i ) ,δ ( O i ) = 0 . M -estimation theory may be used to derive the asymptotic distribution of ˆ θ ( δ ) . Asymp-totic linearity and efﬁciency of the estimator for modiﬁed treatment policies is detailed inthe following theorem: Theorem 3 (Pointwise weak convergence for modiﬁed treatment policies) . Let (cid:107)·(cid:107) denotethe L ( P ) -norm deﬁned as (cid:107) f (cid:107) = (cid:82) f d P . Assume(i) (cid:107) ˆ m − m (cid:107) {(cid:107) ˆ g − g (cid:107) + (cid:107) ˆ e − e (cid:107)} + (cid:107) ˆ g − g (cid:107) (cid:107) ˆ φ − φ (cid:107) = o P ( n − / ) , and(ii) P {| D η,δ ( O ) | ≤ C } = P {| D ˆ η,δ ( O ) | ≤ C } = 1 for some C < ∞ , and iii) The modiﬁed treatment policy d ( a, w ) is piecewise smooth invertible (A1).Then √ n { ˆ θ ( δ ) − θ ( δ ) } (cid:32) N (0 , σ ( δ )) , where σ ( δ ) = Var { D η,δ ( O ) } is the efﬁciency bound. Theorem 3 establishes the weak convergence of ˆ θ ( δ ) pointwise in δ . This convergenceis useful to derive conﬁdence intervals in situations where the modiﬁed treatment policyhas a suitable scientiﬁc interpretation for a given δ , such as in our Example 1. Under theassumptions of the theorem, an estimator ˆ σ ( δ ) of σ ( δ ) may be obtained as the empiricalvariance of D ˆ η j ( i ) ,δ ( O i ) , and a Wald-type conﬁdence interval may be constructed as ˆ θ ( δ ) ± z − α/ ˆ σ ( δ ) / √ n .For the remainder of this section, we turn our attention to a discussion of uniformconvergence of ˆ θ ( δ ) . Such a convergence result will prove useful in the following section,where we establish a hypothesis test of no direct effect. Such a test is constructed byrejecting the hypothesis if the direct effect is non-signiﬁcant (at level α ), uniformly in δ . Tobuild such a testing procedure, we focus on the intervention deﬁned in terms of exponentialtilting (3). Results for modiﬁed treatment policies are possible as well; however, theserequire smoothness assumptions on the map δ (cid:55)→ g δ ( a | w ) . Inspection of (6) reveals thatthis may in turn require smoothness assumptions on a (cid:55)→ g ( a | w ) , which may not bejustiﬁable in a number of applications. We thus focus on exponential tilting, which yieldssmooth maps δ (cid:55)→ g δ ( a | w ) by construction. This discussion, together with Lemmas 1 and2, thus reveals a trade-off between smoothness and robustness in estimation of modiﬁedtreatment policies and exponential tilting. Theorem 4 (Uniform weak convergence for exponential tilting) . Let g δ be the exponentialtilting intervention distribution (3) and let ∆ = [ δ l , δ u ] denote an interval with < δ l ≤ δ u < ∞ . Deﬁne c ( w ) = { (cid:82) a exp( δa ) g ( a | w ) } − . Assume || ˆ c − c || = o P ( n − / ) as wellas (i) and (ii) stated in Theorem 3. Then √ n { ˆ θ ( δ ) − θ ( δ ) } (cid:32) G ( δ ) in (cid:96) ∞ (∆) , where for any δ , δ ∈ ∆ , G ( · ) is a mean-zero Gaussian process with covariancefunction E { G ( δ ) G ( δ ) } = E { D η,δ ( O ) D η,δ ( O ) } . In this section, we consider estimation of the direct effect β ( δ ) = θ ( δ ) − E ( Y ) . Deﬁne thecorresponding (uncentered) inﬂuence function S η,δ ( o ) = D η,δ ( o ) − y . A straightforward17xtension of Theorem 4 shows that ˆ β ( δ ) = ˆ θ ( δ ) − ¯ Y converges weakly to a process G ( δ ) with covariance function E { G ( δ ) G ( δ ) } = E { S η,δ ( O ) S η,δ ( O ) } .We now present an approach to constructing uniform conﬁdence bands on the function β ( δ ) , allowing testing of the null hypothesis of no direct effect H : sup δ ∈ ∆ β ( δ ) = 0 . Thishypothesis test is useful for checking the existence of a direct effect even if the interpre-tation of the exponential tilt g δ (e.g., as the odds ratio comparing post vs pre-interventionodds of exposure) does not answer a particularly meaningful question in a given applica-tion. Let ˆ σ ( δ ) denote the empirical variance of S ˆ η j ( i ) ,δ ( O i ) . Our goal will be achieved byﬁnding a value c α such that ˆ ρ ( c α ) = 1 − α , where ˆ ρ is a function such that ˆ ρ ( t ) = P (cid:32) sup δ ∈ ∆ (cid:12)(cid:12)(cid:12)(cid:12) ˆ β ( δ ) − β ( δ )ˆ σ ( δ ) / √ n (cid:12)(cid:12)(cid:12)(cid:12) ≤ t (cid:33) + o P (1) . (17)Conﬁdence bands may be computed as ˆ θ ± n − / c α ˆ σ ( δ ) , and p-values for H can be com-puted by evaluating − ˆ ρ ( t ) at the observed value of the supremum test statistic. Thefunction ˆ ρ ( t ) may be obtained by approximating the distribution of sup δ ∈ ∆ G ( δ ) , where G ( δ ) is the Gaussian process deﬁned above. In this paper we take the approach proposedby Kennedy (2018a), using the multiplier bootstrap (Gin´e and Zinn, 1984; van der Vaartand Wellner, 1996; Chernozhukov et al., 2013; Belloni et al., 2015). We omit the relevantproofs as they are identical to those presented by Kennedy (2018a). In comparison withthe nonparametric bootstrap, the multiplier bootstrap has the computational advantage thatthe nuisance estimators ˆ η need not be re-estimated. In comparison with directly sampling sup δ ∈ ∆ G ( δ ) , the proposed procedure does not require the evaluation of potentially largecovariance matrices; therefore, it is far more computationally efﬁcient and convenient.The multiplier bootstrap approximates the distribution of sup δ ∈ ∆ G ( δ ) with the supre-mum of the process M ( δ ) = 1 √ n n (cid:88) i =1 ξ i { S ˆ η j ( i ) ,δ ( O i ) − ˆ β ( δ ) } ˆ σ ( δ ) , where randomness is introduced through sampling the multipliers ( ξ , . . . , ξ n ) , despite theprocess being conditional on the observed data O , . . . , O n . The multiplier variables arei.i.d. with mean zero and unit variance, and are drawn independently from the sample.Typical choices are Rademacher ( P ( ξ = −

1) = P ( ξ = 1) = 0 . ) or Gaussian multipliers.Under the assumptions of Theorem 4, plus uniform consistency of ˆ σ ( δ ) , it can be shownthat (17) holds for ˆ ρ ( t ) = P (cid:18) sup δ ∈ ∆ (cid:12)(cid:12) M ( δ ) (cid:12)(cid:12) ≤ t (cid:12)(cid:12)(cid:12)(cid:12) O , . . . , O n (cid:19) .

18s a consequence, computation of the critical value, p-values, and conﬁdence intervalsonly requires simulation of a large number of realizations of the multipliers over a ﬁnegrid over ∆ . We now turn to comparing the three estimators of the direct effect, previously consideredin Section 4. In particular, we investigate the performance of the substitution (12, 14),re-weighted (13, 15), and efﬁcient (16) estimators in the case of an incremental propensityscore (IPS) intervention on a binary intervention variable of interest. The estimators areevaluated on data simulated from the following data-generating mechanism: W ∼ Bern(0 . W ∼ Bern(0 . W ∼ Bern(0 . A ∼ Bern (cid:32) · (cid:88) j =1 W j + 0 . (cid:33) Z ∼ Bern (cid:18) − expit (cid:20) A + W A + W + 0 . (cid:21)(cid:19) Z ∼ Bern (cid:18) expit (cid:20) ( A −

1) + W W + 3 (cid:21)(cid:19) Z ∼ Bern (cid:18) expit (cid:20) ( A −

1) + 2 · W − · W + 0 . (cid:21)(cid:19) Y = Z + Z − Z + A − . · (cid:32) (cid:88) j =1 W j (cid:33) + (cid:15), where Bern( p ) is the Bernoulli distribution with parameter p , expit( x ) = { − x ) } − is the CDF of the logistic distribution (as implemented in the plogis function in the R programming language), and (cid:15) ∼ N (0 , . . The data available on a single observationalunit is denoted by the random variable O = ( W , W , W , A, Z , Z , Z , Y ) , where, in anygiven simulation, we consider observing n i.i.d. copies of O for one of seven sample sizes n ∈ { , , , , , , } .Under the above data-generating mechanism, we seek to estimate the direct effect un-der an incremental propensity score intervention δ = 0 . , for which the true value of thenatural direct effect is approximately . . We approximated this effect by using thealternative representation of θ ( δ ) as θ ( δ ) = E (cid:26)(cid:90) m ( a, Z, W ) g δ ( a | W )d ν ( a ) (cid:27) , a , . . . , a m of uniformly distributed numbers in the range of A , and the outer ex-pectation is approximated through the law of large numbers by drawing a large sample ( W , Z ) , . . . , ( W k , Z k ) from the joint distribution of ( W, Z ) . Each of the estimators isevaluated by contrasting regimes in which the appropriate nuisance parameters are ﬁt viaa well-speciﬁed nonparametric regression or misspeciﬁed by ﬁtting an intercept model.The enumerated set of estimators and regimes is summarized in Table 1 and Figure 3.In order to ensure a well-speciﬁed nonparametric regression for the nuisance parameters,we rely on the highly adaptive lasso (HAL) estimator, a recently proposed nonparametricregression function with properties guaranteeing convergence of estimated nuisance com-ponents at the n / -rates required by our theorems (Benkeser and van der Laan, 2016; vander Laan, 2017; van der Laan and Benkeser, 2018). n Estimator

400 900 1600 2500 3600 4900 6400

Substitution 0.083 0.086 0.084 0.077 0.072 0.074 0.075Reweighted (IPW) 0.105 0.120 0.111 0.116 0.107 0.112 0.109Efﬁcient 0.092 0.086 0.071 0.072 0.068 0.067 0.065Efﬁcient (E mis.) 0.060 0.060 0.060 0.059 0.054 0.059 0.058Efﬁcient (M mis.) 0.165 0.130 0.110 0.107 0.099 0.103 0.097Efﬁcient (G mis.) 0.436 0.829 1.255 1.912 2.662 3.543 4.519Table 1: Mean-squared errors (MSE), scaled by n , of the three key estimators of the directeffect under an IPS intervention δ = 0 . , across Monte Carlo simulations for eachof seven sample sizes. Substitution and reweighted estimators are computed using HALfor g , m , and e . “E mis.” denotes that e was inconsistently estimated via an intercept-onlylogistic regression model, “M mis.” and “G mis.” denote analogous estimators.20 l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l Efficient Eff. ( E - mis. ) Eff. ( M - mis. ) Reweighted Substitution | B i a s | n | B i a s | n s n M SE Eff. ( G - mis. ) n n Figure 3: Statistics for the three key estimators (and variations thereof) of the direct effectunder an IPS intervention δ = 0 . , across Monte Carlo simulations for each of sevensample sizes.Inspection of the mean-squared error, after scaling by √ n , reveals that the substitutionestimator and the efﬁcient one-step estimator both display excellent, essentially equivalentperformance when nuisance components are estimated using the highly adaptive lasso.The one-step estimator has slightly better performance, which seems to be driven by abetter bias-variance trade-off. In contrast to the substitution estimator, the efﬁcient one-step estimator carries the advantage of being double robust, allowing misspeciﬁcation ofeither the outcome regression (denoted “M”) or the mediator-inclusive propensity score(denoted “E”). The robustness of the efﬁcient estimator to the misspeciﬁcation of thesenuisance components — and the lack of robustness to the mediator-exclusive propensityscore (denoted “G”) — are demonstrated in the last three rows of Table 1. Figure 3 visu-alizes the performance of the estimators, and their misspeciﬁed variants, in terms of boththe MSE (as presented in Table 1) and its individual components, the bias and standarderror. This comparison of the estimators reveals that the correctly-speciﬁed one-step ef-ﬁcient estimator displays excellent performance in terms of both bias and variance while21ts non-robust misspeciﬁed variant displays an asymptotic bias that grows with samplesize. Interestingly, the one-step estimator with e inconsistently estimated displayed bet-ter performance than the fully efﬁcient version. This is possibly an idiosyncrasy of thissimulation due to the fact that misspeciﬁcation through an intercept-only model generatessmaller variable weights. Altogether, these numerical investigations demonstrate the util-ity of the proposed estimators in settings where the nonparametric estimation of nuisancecomponents is viable; moreover, in applied data analytic settings where this proceduremay be of interest, the one-step efﬁcient estimator is clearly preferable on account of itsmultiple robustness. All numerical studies of the estimators were performed using theimplementations available in the medshift software package (Hejazi and D´ıaz, 2019) forthe R language and environment for statistical computing (R Core Team, 2019). We now turn to considering a scenario in which the decomposition proposed in equa-tion 5 and the proposed efﬁcient estimator (16) may be used to estimate direct and in-direct effects. To proceed, we take as example a simple data set from an observationalstudy of the relationship between BMI and children’s behavior, distributed as part of the mma R package, available via the Comprehensive R Archive Network ( https://CRAN.R-project.org/package=mma ). The documentation of this data set describes it as a“database obtained from the Louisiana State University Health Sciences Center, New Or-leans, by Dr. Richard Scribner [who] explored the relationship between BMI and kidsbehavior through a survey at children, teachers and parents in Grenada in 2014. Thisdata set includes observations and variables.” In particular, we consider a modi-ﬁed version of this data set with all missing values removed, as these are irrelevant to thedemonstration of the proposed methodology. In standard data analytic practice, we advo-cate for the use of the proposed methodology in tandem with a correction for missing data,such as imputation or weighting by inverse probability of censoring. (Carpenter et al.,2006; Vansteelandt et al., 2010; Seaman et al., 2012).To demonstrate the assessment of the direct and indirect effect with this observationaldata set, we consider the effect of participation in a sports team on the BMI of children,taking several related covariates as mediators (including snacking, exercising, and over-weight status) and all other collected covariates as potential confounders. As the inter-vention variable is binary, we frame our proposal in terms of an incremental propensityscore intervention (Kennedy, 2018a), wherein the odds of participating in a sports team isincreased by a factor of δ = 2 for each individual. Such a stochastic exposure regime maybe interpreted as the introduction of a school program or policy that motivates children to22pt in to participating in a sports team, doubling the odds of such voluntary participation. As noted in equation 5, the population intervention effect admits a decomposition in termsof components that allow estimation of the direct and indirect effects. We compute each ofthe components of the direct and indirect effects using appropriate estimators as follows • for E { Y ( A, Z ) } = E Y , the natural value of the outcome under no intervention, theempirical mean in the sample serves as an efﬁcient estimator; • for E { Y ( A δ , Z ) } = θ ( δ ) , the mean outcome under an intervention altering the ex-posure mechanism but not the mediation mechanism, a one-step efﬁcient estimator,denoted ˆ θ ( δ ) , is proposed as equation 16 and made available via the medshift R package (Hejazi and D´ıaz, 2019); • for E { Y ( A δ ) } = ψ ( δ ) , the mean outcome under an intervention altering both theexposure and mediation mechanisms, a one-step efﬁcient estimator, denoted ˆ ψ ( δ ) inthe sequel, is easily estimable using the npcausal R package (Kennedy, 2018b).In the construction of estimators for θ ( δ ) and ψ ( δ ) , data adaptive nonparametric regres-sion procedures are incorporated to allow the relevant nuisance parameters of each estima-tor to be computed in a ﬂexible manner using various R packages. The npcausal pack-age allows the estimator ˆ ψ ( δ ) to be constructed using the ranger algorithm (Wright andZiegler, 2015), an efﬁcient and fast implementation of random forests (Breiman, 2001).In constructing ˆ θ ( δ ) , the medshift package provides facilities for estimating nuisance pa-rameters data adaptively via the Super Learner algorithm (van der Laan et al., 2007) forconstructing ensemble learners through cross-validation, using its implementation in the sl3 package (Coyle et al., 2018). In particular, the Super Learner procedure was used tocreate a weighted ensemble of algorithms from a library including extreme gradient boost-ing via the xgboost package (Chen and Guestrin, 2016), with variants including 50, 100,and 300 boosting iterations; variants of random forests using 50, 100, and 500 trees; L -penalized lasso and L -penalized ridge GLMs via the glmnet package (Friedman et al.,2009); an elastic net GLM with equally weighted L and L penalization terms (also via glmnet ); a main terms GLM; an intercept model; and the highly adaptive lasso (Benkeserand van der Laan, 2016), with 5–fold cross-validation and up to either 3-way or 5-wayinteraction terms, using the hal9001 package (Coyle and Hejazi, 2018).23 .2 Estimating the Direct and Indirect Effects From the decomposition given in equation 5, the direct effect may be denoted β ( δ ) = θ ( δ ) − E Y . An estimator of the direct effect, ˆ β ( δ ) may be expressed as a composition ofestimators of its constituent parameters: ˆ β ( δ ) = ˆ θ ( δ ) − n n (cid:88) i =1 Y i . Using the estimation strategies previously outlined, we may construct an estimate of the di-rect effect through a straightforward application of the delta method, yielding both a pointestimate and associated standard errors under our proposed stochastic intervention policy.Similarly, the indirect effect ψ ( δ ) − θ ( δ ) may be estimated as ˆ ψ ( δ ) − ˆ θ ( δ ) . We provideboth point estimates and associated inference under our proposed stochastic interventionpolicy in Table 2. Parameter

Lower 95% CI Estimate Upper 95% CIDirect Effect -0.458 0.011 0.479Indirect Effect -0.672 -0.157 0.357Table 2: Point estimates and 95% conﬁdence intervals for the direct effect and indirecteffect for an IPS intervention of δ = 2 applied to the data set from the mma R package.From the estimates in Table 2, the conclusion may be easily drawn that there is littletotal effect of doubling the odds of participation in a sports team on the BMI of chil-dren, based on the data collected in the observational study made available in the mma R package. For reference, the marginal odds of participating in a sports team in the ob-served data are . , whereas the odds under the intervention considered are . . Basedon the 95% conﬁdence intervals around the point estimates, we cannot conclude that theproposed incremental propensity score intervention is sufﬁciently efﬁcacious to decreasechildren’s BMI. However, the magnitude of the effects seem to be in the correct direction,with increased participation in a sports team causing a reduction of BMI of . throughchanges in behaviors such as snacking and exercise. Using an approach similar to thatdemonstrated with this data set, the direct and indirect effects attributable to interventionswith higher odds of participating in a sports team are easily estimable.24 Discussion

We have proposed a novel mediation analysis based on the decomposition of the causal ef-fect of a stochastic intervention on the population, focusing on two types of interventions:modiﬁed treatment policies and exponential tilting. Unlike the natural direct effect of Pearl(2001), identiﬁcation of the (in)direct effect proposed here does not require cross-worldcounterfactual independencies, and is therefore achievable in an experimental setting ran-domizing both the exposure and the mediator.We present results for stochastic interventions deﬁned as a modiﬁed treatment policyand explicitly deﬁned in terms of the post-intervention distribution (exponential tilting).In addition to the considerations about robustness and smoothness discussed in the presentarticle, the choice between these two options may also be guided by the fact that modi-ﬁed treatment policies are more useful in practical settings as they can be used to informfeasible interventions.Note that our effect decomposition and estimators allow for multivariate mediators.The interpretation of the (joint) indirect effect in this case is entirely context-dependent.For example, the multivariate mediators may represent an innately multivariate construct(e.g., a psychological construct such as personality, behavior, etc.) in this case the indirecteffect could be interpreted as an effect through the construct (e.g., personality). Nonethe-less, our approach does not require that the multivariate mediators are part of a singleconstruct; the interpretation in these cases requires more care.We assume that there is no mediator-outcome confounder affected by exposure. Pointidentiﬁcation of natural (in)direct effects in the presence of such variables is not generallypossible, and its partial identiﬁcation is an area of active research (Robins and Richardson,2010; Tchetgen and Phiri, 2014; Miles et al., 2015).Lastly, for simplicity we focus on estimators constructed as solutions to the efﬁcient in-ﬂuence function estimating equation; moreover, we have made implementations of each ofthe proposed estimators available in the free and open source medshift software package(Hejazi and D´ıaz, 2019) for the R language and environment for statistical computing (RCore Team, 2019). Alternative estimation strategies, such as targeted minimum loss-basedestimation (van der Laan and Rubin, 2006; van der Laan and Rose, 2011, 2018), may havebetter performance than our propsoed estimators in ﬁnite samples. The development ofsuch estimators will be the subject of future research and software development.25 Proofs of results in the main document

Proof

We prove the result separately for exponential tilting and for modiﬁed treatmentpolicies. First, let A δ denote a variable drawn from the exponentially tilted distribution g δ ( a | w ) . We have E { Y ( A δ , Z ) | A δ = a, Z = z, W = w } = E { Y ( a, z ) | A δ = a, Z = z, W = w } = E { Y ( a, z ) | Z = z, W = w } = E { Y ( a, z ) | A = a, Z = z, W = w } = m ( z, a, w ) . The ﬁrst equality follows from the deﬁnition of Y ( A δ , Z ) , the second equality followsbecause, by deﬁnition, A δ ⊥⊥ Y ( a, z ) | ( W, Z ) , and the third equality follows from A3.Finally, the fourth equality follows from the consistency implied by he NPSEM: ( A, Z ) =( a, z ) → Y ( a, z ) = Y .Note that, by deﬁnition, A δ ⊥⊥ Z | W . Thus E { Y ( A δ , Z ) } = (cid:90) supp( g δ ) × supp( q ) × supp( p ) m ( z, a, w ) r ( z | w ) g δ ( a | w ) p ( w ) dν ( a, z, w ) , where A2 ensures that m ( z, a, w ) is deﬁned in the integration set.If the intervention is a modiﬁed treatment policies, such as our Example 1 where A δ = d ( A, W ) , then the proof proceeds as follows. First of all, we have A ⊥⊥ Y ( A δ , Z ) | ( A δ , Z, W ) , so that E { Y ( A δ , Z ) | A δ = a, A = a (cid:48) , Z = z, W = w } = m ( z, a, w ) . Integrating the above expression with respect to the joint density of ( A δ , A, Z, W ) , andusing r ( z | w ) = (cid:90) supp( g ) q ( z | a (cid:48) , w ) g ( a (cid:48) | w ) dν ( a (cid:48) ) yields the desired result. Lemma 5.

Deﬁne the non-parametric structural equation model in (1). If ( U A ⊥⊥ U Y and U W ⊥⊥ U Z and U Y ⊥⊥ U Z ) and either U Y ⊥⊥ U W or U A ⊥⊥ U W , then A3 holds. roof For ﬁxed ( a, z ) , let Y ( a, z ) = f Y ( W, a, z, U Y ) . ( W, Z ) d-separates Y ( a, z ) from A in Figures 4 and 5, concluding the proof of the lemma. U W WZU Z AU A Y a,z U Y Figure 4: Directed Acyclic Graph for U A ⊥⊥ U Y and U W ⊥⊥ U Z and U Y ⊥⊥ U Z and U A ⊥⊥ U W . U W WZU Z AU A Y a,z U Y Figure 5: Directed Acyclic Graph for U A ⊥⊥ U Y and U W ⊥⊥ U Z and U Y ⊥⊥ U Z and U Y ⊥⊥ U W . Proof

In this proof we will use Θ( P ) to denote a parameter as a functional that mapsthe distribution P in the model to a real number. We will assume that the measure v isdiscrete so that integrals can be written as sums. The resulting inﬂuence function willalso correspond to the inﬂuence function of a general measure ν . For example, the trueparameter value is given by θ ( δ ) = Θ( P ) = (cid:88) y,z,a,w m ( a, z, w ) g δ ( a | w ) p ( z, w ) . g δ is known. Then the non-parametric MLE of θ ( δ ) is given by Θ( P n ) = (cid:88) y,z,a,w y P n ( y | a, z, w ) g δ ( a | w ) P n ( z, w )= (cid:88) y,z,a,w y P n f y,a,z,w P n f a,z,w g δ ( a | w ) P n f z,w , (18)where we remind the reader of the notation P f = (cid:82) f dP . Here f y,a,z,w = I ( Y = y, A = a, Z = z, W = w ) , f a,z,w = I ( A = a, Z = z, W = w ) , f a,w = I ( A = a, W = w ) , f z,w = I ( Z = z, W = w ) , and I ( · ) denotes the indicator function.We will use the fact that the efﬁcient inﬂuence function in a non-parametric modelcorresponds with the inﬂuence curve of the NPMLE. This is true because the inﬂuencecurve of any regular estimator is also a gradient, and a non-parametric model has onlyone gradient. Appendix 18 of van der Laan and Rose (2011) shows that if ˆΘ( P n ) is asubstitution estimator such that θ ( δ ) = ˆΘ( P ) , and ˆΘ( P n ) can be written as ˆΘ ∗ ( P n f : f ∈F ) for some class of functions F and some mapping B ∗ , the inﬂuence curve of ˆΘ( P n ) isequal to IC ( P )( O ) = (cid:88) f ∈F d ˆΘ ∗ ( P ) d P f { f ( O ) − P f } . Applying this result to (18) with F = { f y,a,z,w , f y,a,w , f a,w , f z,w , f w } gives an efﬁcientinﬂuence function equal to D Yη,δ ( o ) + D Z,Wη,δ ( o ) − θ ( δ ) . It remains to ﬁnd the component D Aη,δ ( o ) for each speciﬁc intervention. This component may be found as the IF of theestimator Θ( P n ) = (cid:88) y,z,a,w y P ( y | a, z, w )ˆ g δ ( a | w ) P ( z, w ) , where ˆ g δ is the MLE of g δ , obtained by substitution of the MLE of g .The algebraic derivations described here are lengthy and not particularly illuminating,and are therefore omitted from the proof. Proof

In this proof we use the notation π ( w ) = g (1 | w ) . From the parameterization e ( a | z, w ) = g ( a | w ) q ( z | a, w ) /r ( z | w ) , note that φ ( a, w ) = (cid:90) m ( a, z, w )d P ( z | w ) . D Aη,δ ( o ) from Lemma 2 may be written as follows D Aη,δ ( o ) = (cid:88) t ∈{ , } g δ ( a | w ) g ( a | w ) (cid:90) m ( t, z, w )d P ( z | w ) { I ( a = t ) − g δ ( t | w ) } = (cid:88) t ∈{ , } (cid:90) m ( t, z, w )d P ( z | w ) (cid:20) a π δ ( w ) π ( w ) { t − g δ ( t | w ) } +(1 − a ) 1 − π δ ( w )1 − π ( w ) { − t − g δ ( t | w ) } (cid:21) Note that π δ ( w ) { t − g δ ( t | w ) } = −{ − π δ ( w ) }{ − t − g δ ( t | w ) } = (2 t − π δ ( w ) { − π δ ( w ) } . Thus, D Aη,δ ( o ) = (cid:88) t ∈{ , } (cid:90) m ( t, z, w )d P ( z | w )(2 t − π δ ( w ) { − π δ ( w ) } (cid:20) aπ ( w ) − − a − π ( w ) (cid:21) = π δ ( w ) { − π δ ( w ) } π ( w ) { − π ( w ) } { a − π ( w ) } (cid:88) t ∈{ , } (cid:90) m ( t, z, w )d P ( z | w )(2 t − since π δ ( w ) { − π δ ( w ) } π ( w ) { − π ( w ) } = δ { δπ ( w ) + 1 − π ( w ) } , expanding the sum in t concludes the proof. Proof

Let P n,j denote the empirical distribution of the prediction set V j , and let G n,j denote the associated empirical process (cid:112) n/J ( P n,j − P ) . Note that ˆ θ ( δ ) = 1 J J (cid:88) j =1 P n,j D ˆ η j ,δ , θ ( δ ) = P D η . Thus, √ n { ˆ θ ( δ ) − θ ( δ ) } = G n { D η,δ − θ ( δ ) } + R n, ( δ ) + R n, ( δ ) , R n, ( δ ) = 1 √ J J (cid:88) j =1 G n,j ( D ˆ η j ,δ − D η,δ ) , R n, ( δ ) = √ nJ J (cid:88) j =1 P { D ˆ η j ,δ − θ ( δ ) } . It remains to show that R n, ( δ ) and R n, ( δ ) are o P (1) . Theorem 5 together with theCauchy-Schwartz inequality and assumption (i) of the theorem shows that || R n, || ∆ = o P (1) . For || R n, || ∆ we use empirical process theory to argue conditional on the trainingsample T j . In particular, Lemma 19.33 of van der Vaart (1998) applied to the class offunctions F = { D ˆ η j ,δ − D η,δ } (which consists of one element) yields E (cid:26)(cid:12)(cid:12) G n,j ( D ˆ η j ,δ − D η,δ ) (cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) T j (cid:27) (cid:46) C log 2 n / + || D ˆ η j ,δ − D η,δ || (log 2) / By assumption (i), the left hand side is o P (1) . Lemma 6.1 of Chernozhukov et al. (2018)may now be used to argue that conditional convergence implies unconditional conver-gence, concluding the proof. Let || f || ∆ = sup δ ∈ ∆ | f ( δ ) | . Let P n,j denote the empirical distribution of the prediction set V j , and let G n,j denote the associated empirical process (cid:112) n/J ( P n,j − P ) . Note that ˆ θ ( δ ) = 1 J J (cid:88) j =1 P n,j D ˆ η j ,δ , θ ( δ ) = P D η . Thus, √ n { ˆ θ ( δ ) − θ ( δ ) } = G n { D η,δ − θ ( δ ) } + R n, ( δ ) + R n, ( δ ) , where R n, ( δ ) = 1 √ J J (cid:88) j =1 G n,j ( D ˆ η j ,δ − D η,δ ) , R n, ( δ ) = √ nJ J (cid:88) j =1 P { D ˆ η j ,δ − θ ( δ ) } . The map δ (cid:55)→ ¯ D η,δ is Lipschitz, which implies that the class F = { ¯ D η,δ : δ ∈ ∆ } has bounded bracketing numbers (Theorem 2.7.11 of van der Vaart and Wellner, 1996).Therefore, F is Donsker and G n { D η,δ − θ ( δ ) } (cid:32) G ( δ ) in (cid:96) ∞ (∆) .30t remains to show that || R n, || ∆ and || R n, || ∆ are o P (1) . Theorem 6 together with theassumptions of the theorem and the Cauchy-Schwartz inequality, show that || R n, || ∆ = o P (1) . For || R n, || ∆ we use empirical process theory to argue conditional on the trainingsample T j . Let F jn = { D ˆ η j ,δ − D η,δ : δ ∈ ∆ } . Because the function ˆ η j is ﬁxed given thetraining data, we can apply Theorem 2.14.2 of van der Vaart and Wellner (1996) to obtain E (cid:40) sup f ∈F jn | G n,j f | (cid:12)(cid:12)(cid:12)(cid:12) T j (cid:41) (cid:46) || F jn || (cid:90) (cid:113) N [ ] ( (cid:15) || F jn || , F jn , L ( P ))d (cid:15), where N [ ] ( (cid:15) || F jn || , F jn , L ( P )) is the bracketing number and we take F jn = sup δ ∈ ∆ | D ˆ η j ,δ − D η,δ | as an envelope for the class F jn . Theorem 2.7.2 of van der Vaart and Wellner (1996)shows log N [ ] ( (cid:15) || F jn || , F jn , L ( P )) (cid:46) (cid:15) || F jn || . This shows || F jn || (cid:90) (cid:113) N [ ] ( (cid:15) || F jn || , F jn , L ( P ))d (cid:15) (cid:46) (cid:90) (cid:115) || F jn || + || F jn || (cid:15) d (cid:15) ≤ || F jn || + || F jn || / (cid:90) (cid:15) / d (cid:15) ≤ || F jn || + 2 || F jn || / . Since || F jn || = o P (1) , this shows sup f ∈F jn G n,j f = o P (1) for each j , conditional on T j .and thus || R n, || ∆ = o P (1) , concluding the proof of the theorem. Theorem 5.

Let d ( A, W ) satisfy assumption A1. Denote m d ( z, a, w ) = m ( z, d ( a, w ) , w ) .Let q ( z | a, w ) denote any density compatible with φ ( a, w ) . That is, let q be such that φ ( a, w ) = (cid:90) g ( a | w ) e ( a | z, w ) m ,d ( z, a, w ) q ( z | a, w ) dν ( z ) , nd deﬁne r = g q /e . In this theorem we denote d ξ ( w ) = d P ( w ) . We have P D η ,δ − θ ( δ ) = (cid:90) g δ (cid:18) ee − (cid:19) ( m − m ) r d κ d ξ + (cid:90) ee ( g δ, − g δ )( m − m ) r d κ d ξ + (cid:90) m ,d ( r − r )( g − g )d κ d ξ Proof

Note that P D Yη ,δ + P D Z,Wη,δ − θ ( δ ) = (cid:90) g δ e ( m − m ) er d κ d ξ − (cid:90) g δ ( m − m ) r d κ d ξ + (cid:90) ee ( g ,δ − g δ )( m − m ) r d κ d ξ + (cid:90) m ( g ,δ − g δ ) r d κ d ξ = (cid:90) g δ (cid:18) ee − (cid:19) ( m − m ) r d κ d ξ (19) + (cid:90) ee ( g ,δ − g δ )( m − m ) r d κ d ξ + (cid:90) ( g ,δ − g δ ) m r d κ d ξ. We have P D Aη ,δ = P (cid:18) φ − (cid:90) φ g dν (cid:19) = (cid:90) ( g − g ) m ,d g q e d κ d ξ. (20)Under A1, we can change variables in the following integral to obtain (cid:90) ( g ,δ − g δ ) m r d κ d ξ = (cid:90) ( g − g ) m ,d r d κ d ξ = (cid:90) ( g − g ) m ,d gqe d κ d ξ, where we used the fact that r ( z | w ) = g ( a | w ) q ( z | a, w ) /e ( a | z, w ) . Adding this32uantity in both sides of (20) we get P D Aη ,δ + (cid:90) ( g − g ) m ,d r d κ d ξ = (cid:90) e q g m ,d ( g − g )d κ d ξ − (cid:90) eqg m d ( g − g )d κ d ξ + (cid:90) eqg ( m d − m ,d )( g − g )d κ d ξ = (cid:90) ( φ − φ )( g − g )d κ d ξ + (cid:90) eqg ( m d − m ,d )( g − g )d κ d ξ. Theorem 6.

Deﬁne c ( w ) = { (cid:82) a exp( δa ) g ( a | w ) } − , and let c ( w ) be deﬁned analo-gously. Let b ( a ) = exp( δa ) . Using the same notation as in Theorem 5, we have P D η ,δ − θ ( δ ) = (cid:90) g δ (cid:18) ee − (cid:19) ( m − m ) r d κ d ξ + (cid:90) ee ( g ,δ − g δ )( m − m ) r d κ d ξ + (cid:90) ( g ,δ − g δ ) { ( m − m ) r − ( φ − φ ) } d κ d ξ − (cid:90) (cid:26) ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ (cid:27) d ξ + (cid:90) (cid:26) ( c − c ) (cid:90) bφ ( g − g )d κ (cid:27) d ξ Proof

Let q ( z | a, w ) be any density compatible with φ . That is φ ( a, w ) = (cid:90) g ( a | w ) e ( a | z, w ) m ( a, z, w ) q ( z | a, w )d ν ( z ) Note that display (19) is also valid here. Note also that (cid:90) ( g ,δ − g δ ) m r d κ d ξ = (cid:90) ( g ,δ − g δ ) φ d κ d ξ + (cid:90) ( g ,δ − g δ ) { ( m − m ) r − ( φ − φ ) } d κ d ξ. g ,δ ( a | w ) = c ( w ) b ( a ) g ( a | w ) . We have P D Aη + (cid:90) ( g δ, − g δ ) φ d ξ = (cid:90) (cid:26)(cid:90) g ,δ g φg d κ − (cid:90) g ,δ g g d κ (cid:90) φg ,δ d κ + (cid:90) ( g ,δ − g δ ) φ d κ (cid:27) d ξ = (cid:90) (cid:26) g ,δ g gφ d κ − (cid:90) g δ φ d κ + (cid:90) g ,δ φ d κ (cid:20) − (cid:90) g ,δ g g d κ (cid:21)(cid:27) d ξ = (cid:90) (cid:26) c (cid:90) bφg d κ − c (cid:90) bφg d κ + c (cid:90) bgφ d κ (cid:90) ( c − c ) bg d κ (cid:27) d ξ = (cid:90) ( c − c ) (cid:26)(cid:90) bφg d κ − c (cid:90) bg φ d κ (cid:90) bg d κ (cid:27) d ξ = (cid:90) ( c − c ) (cid:26)(cid:90) bφg d κ − c (cid:90) bg φ d κ (cid:90) bg d κ − ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ (cid:27) d ξ = (cid:90) (cid:26) − ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ + ( c − c ) (cid:20)(cid:90) bφg d κ − (cid:90) bg φ d κ (cid:21)(cid:27) d ξ (21) = (cid:90) (cid:26) − ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ + ( c − c ) (cid:90) bφ ( g − g )d κ (cid:27) d ξ, where (21) follows from c (cid:82) bg d κ = 1 References

Chen Avin, Ilya Shpitser, and Judea Pearl. Identiﬁability of path-speciﬁc effects. In

IJCAIInternational Joint Conference on Artiﬁcial Intelligence , pages 357–363, 2005.Reuben M Baron and David A Kenny. The moderator–mediator variable distinction in so-cial psychological research: Conceptual, strategic, and statistical considerations.

Jour-nal of personality and social psychology , 51(6):1173, 1986.Janet M Begun, WJ Hall, Wei-Min Huang, Jon A Wellner, et al. Information and asymp-totic efﬁciency in parametric-nonparametric models.

The Annals of Statistics , 11(2):432–452, 1983. 34lexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Ying Wei. Uni-formly valid post-regularization conﬁdence regions for many functional parameters inz-estimation framework. arXiv preprint arXiv:1512.07619 , 2015.David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In , pages689–696. IEEE, 2016.Peter J Bickel, Chris AJ Klaassen, YA’Acov Ritov, and Jon A Wellner.

Efﬁcient andAdaptive Estimation for Semiparametric Models . Springer-Verlag, 1997.Leo Breiman. Stacked regressions.

Machine learning , 24(1):49–64, 1996.Leo Breiman. Random forests.

Machine learning , 45(1):5–32, 2001.James R Carpenter, Michael G Kenward, and Stijn Vansteelandt. A comparison of multipleimputation and doubly robust estimation for analyses with missing data.

Journal of theRoyal Statistical Society: Series A (Statistics in Society) , 169(3):571–584, 2006.Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In

Proceed-ings of the 22nd acm sigkdd international conference on knowledge discovery and datamining , pages 785–794. ACM, 2016.Victor Chernozhukov, Denis Chetverikov, Kengo Kato, et al. Gaussian approximationsand multiplier bootstrap for maxima of sums of high-dimensional random vectors.

TheAnnals of Statistics , 41(6):2786–2819, 2013.Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duﬂo, Christian Hansen,et al. Double machine learning for treatment and causal parameters. arXiv preprintarXiv:1608.00060 , 2016.Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duﬂo, Christian Hansen,Whitney Newey, and James Robins. Double/debiased machine learning for treatmentand structural parameters.

The Econometrics Journal , 21(1):C1–C68, 2018.Stephen R Cole and Miguel A Hern´an. Fallibility in estimating direct effects.

Internationaljournal of epidemiology , 31(1):163–165, 2002.Jeremy R Coyle and Nima S Hejazi. hal9001: The Scalable Highly Adaptive LASSO ,2018. URL https://github.com/tlverse/hal9001 . R package version 0.2.1.35eremy R Coyle, Nima S Hejazi, Ivana Malenica, and Oleg Sofrygin. sl3: Modernpipelines for machine learning and Super Learning. https://github.com/tlverse/sl3 , 2018. URL https://doi.org/10.5281/zenodo.1342294 . R package version1.1.0.A Philip Dawid. Causal inference without counterfactuals.

Journal of the American Sta-tistical Association , 95(450):407–424, 2000.Iv´an D´ıaz and Mark J van der Laan. Population intervention causal effects based onstochastic interventions.

Biometrics , 68(2):541–549, 2012.Iv´an D´ıaz and Mark J van der Laan. Assessing the causal effect of policies: an exampleusing stochastic interventions.

The international journal of biostatistics , 9(2):161–174,2013.Iv´an D´ıaz and Mark J van der Laan. Stochastic treatment regimes. In

Targeted Learningin Data Science , pages 219–232. Springer, 2018.Vanessa Didelez, A Philip Dawid, and Sara Geneletti. Direct and indirect effects of se-quential treatments. In

Proceedings of the Twenty-Second Conference on Uncertaintyin Artiﬁcial Intelligence , pages 138–146. AUAI Press, 2006.Miroslav Dud´ık, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policyevaluation and optimization.

Statistical Science , 29(4):485–511, 2014.Jerome Friedman, Trevor Hastie, and Rob Tibshirani. glmnet: Lasso and elastic-net regu-larized generalized linear models , 2009.Evarist Gin´e and Joel Zinn. Some limit theorems for empirical processes.

The Annals ofProbability , pages 929–989, 1984.Arthur S Goldberger. Structural equation methods in the social sciences.

Econometrica:Journal of the Econometric Society , pages 979–1001, 1972.Sebastian Haneuse and Andrea Rotnitzky. Estimation of the effect of interventions thatmodify the received treatment.

Statistics in Medicine , 2013.Nima S Hejazi and Iv´an D´ıaz. medshift: Causal mediation analysis for stochastic in-terventions in R , 2019. URL https://github.com/nhejazi/medshift . R packageversion 0.0.8.Kosuke Imai, Luke Keele, and Dustin Tingley. A general approach to causal mediationanalysis.

Psychological methods , 15(4):309, 2010.36dward H Kennedy. Nonparametric causal effects based on incremental propensity scoreinterventions.

Journal of the American Statistical Association , pages 1–12, 2018a.Edward H Kennedy. npcausal: Nonparametric causal inference methods , 2018b. URL https://github.com/ehkennedy/npcausal . R package version 0.1.0.Judith J Lok. Deﬁning and estimating causal direct and indirect effects when setting themediator to speciﬁc values is not feasible.

Statistics in medicine , 35(22):4008–4020,2016.Judith J Lok. Causal organic direct and indirect effects: closer to baron and kenny. arXivpreprint arXiv:1903.04697 , 2019.Caleb H Miles, Phyllis Kanki, Seema Meloni, and Eric J Tchetgen Tchetgen. On partialidentiﬁcation of the pure direct effect. arXiv preprint arXiv:1509.01652 , 2015.Whitney K Newey. The asymptotic variance of semiparametric estimators.

Econometrica:Journal of the Econometric Society , pages 1349–1382, 1994.J. Neyman. Sur les applications de la thar des probabilites aux experiences agaricales:Essay des principle (1923). excerpts reprinted (1990) in english (d. dabrowska and t.speed), trans.

Statistical Science , 5:463–472, 1923.Trang Quynh Nguyen, Ian Schmid, and Elizabeth A Stuart. Clarifying causal mediationanalysis for the applied researcher: Deﬁning effects based on what we want to learn. arXiv preprint arXiv:1904.08515 , 2019.Judea Pearl. Causal diagrams for empirical research.

Biometrika , 82(4):669–688, 1995.Judea Pearl. Graphs, causality, and structural equation models.

Sociological Methods &Research , 27(2):226–284, 1998.Judea Pearl.

Causality: Models, Reasoning, and Inference . Cambridge University Press,Cambridge, 2000.Judea Pearl. Direct & indirect effects. In

Proceedings of the 17th Conference inUncertainty in Artiﬁcial Intelligence , UAI ’01, pages 411–420, San Francisco, CA,USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1. URL http://dl.acm.org/citation.cfm?id=647235.720084 .Judea Pearl. Myth, Confusion, and Science in Causal Analysis. Technical Report R-348,Cognitive Systems Laboratory, Computer Science Department University of California,Los Angeles, Los Angeles, CA, May 2009.37aya L Petersen, Sandra E Sinisi, and Mark J van der Laan. Estimation of direct causaleffects.

Epidemiology , pages 276–284, 2006.J Pfanzagl and W Wefelmeyer. Contributions to a general asymptotic statistical theory.

Statistics & Risk Modeling , 3(3-4):379–388, 1985.K. R. Popper.

The Logic of Scientiﬁc Discovery . Hutchinson, London, 1934.R Core Team.

R: A Language and Environment for Statistical Computing . R Foundationfor Statistical Computing, Vienna, Austria, 2019. URL .Thomas S Richardson and James M Robins. Single world intervention graphs (swigs): Auniﬁcation of the counterfactual and graphical approaches to causality.

Center for theStatistics and the Social Sciences, University of Washington Series. Working Paper , 128(30):2013, 2013.James M Robins. A new approach to causal inference in mortality studies with sustainedexposure periods - application to control of the healthy worker survivor effect.

Mathe-matical Modelling , 7:1393–1512, 1986.James M Robins and Sander Greenland. Identiﬁability and exchangeability for direct andindirect effects.

Epidemiology , 3(0):143–155, 1992.James M Robins and Thomas S Richardson. Alternative graphical causal models and theidentiﬁcation of direct effects.

Causality and psychopathology: Finding the determi-nants of disorders and their cures , pages 103–158, 2010.James M Robins, Miguel A Hern´an, and UWE SiEBERT. Effects of multiple interven-tions.

Comparative quantiﬁcation of health risks: global and regional burden of diseaseattributable to selected major risk factors , 1:2191–2230, 2004.D. B. Rubin. Estimating Causal Effects of Treatments in Randomized & NonrandomizedStudies.

Journal of Educational Psychology , 1974. URL .Kara E Rudolph, Oleg Sofrygin, Wenjing Zheng, and Mark J Van Der Laan. Robust andﬂexible estimation of stochastic mediation effects: a proposed method and example in arandomized trial setting.

Epidemiologic Methods , 7(1), 2017.Shaun R Seaman, Ian R White, Andrew J Copas, and Leah Li. Combining multiple impu-tation and inverse-probability weighting.

Biometrics , 68(1):129–137, 2012.38eter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek,Gregory Cooper, and Thomas Richardson.

Causation, prediction, and search . MITpress, 2000.Ori M Stitelman, Alan E Hubbard, and Nicholas P Jewell. The impact of coarseningthe explanatory variable of interest in making causal inferences: Implicit assumptionsbehind dichotomizing variables. 2010.James H Stock. Nonparametric policy analysis.

Journal of the American Statistical Asso-ciation , 84(406):567–575, 1989.Sarah L Taubman, James M Robins, Murray A Mittleman, and Miguel A Hern´an. In-tervening on risk factors for coronary heart disease: an application of the parametricg-formula.

International journal of epidemiology , 38(6):1599–1611, 2009.Eric J. Tchetgen Tchetgen and Kelesitse Phiri. Bounds for pure direct effect.

Epidemiology(Cambridge, Mass.) , 25(5):775, 2014.Eric J Tchetgen Tchetgen and Ilya Shpitser. Semiparametric theory for causal mediationanalysis: efﬁciency bounds, multiple robustness, and sensitivity analysis.

Annals ofStatistics , 40(3):1816, 2012.Jin Tian. Identifying dynamic sequential plans. In

Proceedings of the Twenty-FourthConference on Uncertainty in Artiﬁcial Intelligence , pages 554–561. AUAI Press, 2008.Mark J van der Laan. A generally efﬁcient targeted minimum loss based estimator basedon the highly adaptive lasso.

The International Journal of Biostatistics , 13(2), 2017.Mark J van der Laan and David Benkeser. Highly adaptive lasso (hal). In

Targeted Learn-ing in Data Science , pages 77–94. Springer, 2018.Mark J van der Laan and Maya L Petersen. Direct effect models.

The international journalof biostatistics , 4(1), 2008.Mark J van der Laan and James M Robins.

Uniﬁed Methods for Censored LongitudinalData and Causality . Springer, New York, 2003.Mark J van der Laan and Sherri Rose.

Targeted Learning: Causal Inference for Observa-tional and Experimental Data . Springer, New York, 2011.Mark J van der Laan and Sherri Rose.

Targeted Learning in Data Science: Causal Infer-ence for Complex longitudinal Studies . Springer, New York, 2018.39ark J van der Laan and Daniel Rubin. Targeted maximum likelihood learning.

TheInternational Journal of Biostatistics , 2(1), 2006.M.J. van der Laan, E. Polley, and A. Hubbard. Super learner.

Statistical Applications inGenetics & Molecular Biology , 6(25):Article 25, 2007.A. W. van der Vaart.

Asymptotic Statistics . Cambridge University Press, 1998.Aad van der Vaart. Semiparameric statistics.

Lectures on Probability Theory and Statistics ,pages 331–457, 2002.Aad W van der Vaart. On differentiable functionals.

Annals of Statistics , 19:178–204,1991.Aad W van der Vaart and Jon A Wellner.

Weak Convergence and Emprical Processes .Springer-Verlag New York, 1996.Tyler J VanderWeele, Stijn Vansteelandt, and James M Robins. Effect decompositionin the presence of an exposure-induced mediator-outcome confounder.

Epidemiology(Cambridge, Mass.) , 25(2):300, 2014.Stijn Vansteelandt and Rhian M Daniel. Interventional effects for mediation analysis withmultiple mediators.

Epidemiology (Cambridge, Mass.) , 28(2):258, 2017.Stijn Vansteelandt and Tyler J VanderWeele. Natural direct and indirect effects on theexposed: effect decomposition under weaker assumptions.

Biometrics , 68(4):1019–1027, 2012.Stijn Vansteelandt, James Carpenter, and Michael G Kenward. Analysis of incompletedata using inverse probability weighting and doubly robust estimators.

Methodology ,2010.Stijn Vansteelandt, Maarten Bekaert, and Theis Lange. Imputation strategies for the es-timation of natural direct and indirect effects.

Epidemiologic Methods , 1(1):131–158,2012.Marvin N Wright and Andreas Ziegler. Ranger: a fast implementation of random forestsfor high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409 , 2015.Sewall Wright. Correlation and causation.

Journal of agricultural research , 20(7):557–585, 1921. 40ewall Wright. The method of path coefﬁcients.

The annals of mathematical statistics , 5(3):161–215, 1934.Jessica G Young, Miguel A Hern´an, and James M Robins. Identiﬁcation, estimation andapproximation of risk under interventions that depend on the natural value of treatmentusing observational data.

Epidemiologic methods , 3(1):1–19, 2014.Wenjing Zheng and Mark van der Laan. Longitudinal mediation analysis with time-varying mediators and exposures, with application to survival outcomes.

Journal ofcausal inference , 5(2), 2017.Wenjing Zheng and Mark J van der Laan. Cross-validated targeted minimum-loss-basedestimation. In

Targeted Learning , pages 459–474. Springer, 2011.Wenjing Zheng and Mark J van der Laan. Targeted maximum likelihood estimation ofnatural direct effects.