Causal mediation analysis for stochastic interventions
CCausal mediation analysis for stochasticinterventions
Iv´an D´ıaz ∗ and Nima S. Hejazi Division of Biostatistics, Weill Cornell Medicine. Graduate Group in Biostatistics, University of California, Berkeley. Center for Computational Biology, University of California, Berkeley.
June 25, 2019
Abstract
Mediation analysis in causal inference has traditionally focused on binary expo-sures and deterministic interventions, and a decomposition of the average treatmenteffect in terms of direct and indirect effects. In this paper we present an analogousdecomposition of the population intervention effect , defined through stochastic in-terventions on the exposure. Population intervention effects provide a generalizedframework in which a variety of interesting causal contrasts can be defined, includ-ing effects for continuous and categorical exposures. We show that identificationof direct and indirect effects for the population intervention effect requires weakerassumptions than its average treatment effect counterpart, under the assumption ofno mediator-outcome confounders affected by exposure. In particular, identificationof direct effects is guaranteed in experiments that randomize the exposure and themediator. We discuss various estimators of the direct and indirect effects, includingsubstitution, re-weighted, and efficient estimators based on flexible regression tech-niques, allowing for multivariate mediators. Our efficient estimator is asymptoticallylinear under a condition requiring n / -consistency of certain regression functions.We perform a simulation study in which we assess the finite-sample properties of ourproposed estimators. We present the results of an illustrative study where we assessthe effect of participation in a sports team on BMI among children, using mediatorssuch as exercise habits, daily consumption of snacks, and overweight status. ∗ corresponding author: [email protected] a r X i v : . [ s t a t . M E ] J un Introduction
Mediation analysis is a powerful analytical tool that allows scientists to unveil the mecha-nisms through which causal effects operate. The development of tools for mediation anal-ysis has a long history in the statistical sciences, starting with the early work of Wright(1921, 1934) on path analysis, which provided the foundations for the later developmentof mediation analysis using structural equation models (Goldberger, 1972). Indeed, one ofthe most widely used mediation analysis methods is based on structural equations (Baronand Kenny, 1986). Recent decades have seen a revolution in the field of causal infer-ence from observational and randomized studies, starting with the seminal work of Ru-bin (1974) on the potential outcomes framework, which is itself rooted in ideas datingback to Neyman (1923). More recently, Pearl (1995, 2000) has developed a causal infer-ence framework using non-parametric structural equation models, directed acyclic graphs,and the so-called do-calculus. Related approaches have been proposed by Robins (1986),Spirtes et al. (2000), Dawid (2000), and Richardson and Robins (2013). These frameworksallow researchers to define causal effects non-parametrically, and to assess the conditionsunder which causal effects can be identified from data. In particular, novel tools have un-covered important limitations of the earlier work on parametric structural equation modelsfor mediation analysis (Pearl, 1998; Imai et al., 2010). Essentially, structural equationmodels impose implausible assumptions on the data generating mechanism, and are thusof limited applicability to complex phenomena in biology, health, economics, and the so-cial sciences. For example, modern causal models have revealed the incorrectness of thewidely popular method of Baron and Kenny (1986) in several important cases, such asin the presence of confounders of the mediator-outcome relationship (Cole and Hern´an,2002).Using the potential outcomes framework, Robins and Greenland (1992) introduced anon-parametric decomposition of the causal effect of a binary exposure into so-called nat-ural indirect and direct effects. The indirect effect quantifies the effect on the outcomethrough the mediator and the direct effect quantifies the effect through all other mecha-nisms. Pearl (2001) arrived at an equivalent effect decomposition using non-parametricstructural equation models. The identification of these natural (in)direct effects relies onso-called cross-world counterfactual independencies, i.e., independencies on counterfac-tual variables indexed by distinct hypothetical interventions. An important consequenceof this definition is that the natural (in)direct effect is not identifiable in a randomized trial,which is problematic as it implies that scientific claims obtained from these models are notfalsifiable through experimentation (Popper, 1934; Dawid, 2000; Robins and Richardson,2010).In an attempt to solve these problems, several authors have proposed methods that do2way with cross-world counterfactual independencies. These methods can be divided intwo types: identification of bounds (Robins and Richardson, 2010; Tchetgen and Phiri,2014; Miles et al., 2015), and alternative definitions of the (in)direct effect (Petersen et al.,2006; van der Laan and Petersen, 2008; Vansteelandt and VanderWeele, 2012; Vander-Weele et al., 2014). Here, we take the second approach, defining the (in)direct effect interms of a decomposition of the total effect of a stochastic intervention on the populationexposure.Most causal inference problems consider deterministic interventions that set each unit’sexposure to some fixed value that could be a function of the unit’s baseline variables.Stochastic interventions are a generalization of this framework, and are loosely definedas interventions which yield an exposure that is a random variable after conditioning onbaseline variables. Estimation of total effects of stochastic interventions was first consid-ered by Stock (1989) and has been the subject of recent study (Robins et al., 2004; Didelezet al., 2006; Tian, 2008; Pearl, 2009; Taubman et al., 2009; Stitelman et al., 2010; D´ıazand van der Laan, 2013; Dud´ık et al., 2014; Haneuse and Rotnitzky, 2013; Young et al.,2014). Particularly relevant to this work are the methods of D´ıaz and van der Laan (2012);Haneuse and Rotnitzky (2013) who define total effects for modified treatment policies,and Kennedy (2018a), who study identification and estimation of the total the effect ofpropensity score interventions that shift a binary exposure distribution. These papers donot address decomposition of the effects of stochastic interventions on the exposure intodirect and indirect effects, which is the central theme of our manuscript.Our methods are also related to a family of new direct and indirect effects (Didelezet al., 2006; VanderWeele et al., 2014; Lok, 2016; Vansteelandt and Daniel, 2017; Zhengand van der Laan, 2017; Rudolph et al., 2017; Lok, 2019), which have been collectivelytermed interventional effects (Nguyen et al., 2019). This family of effects deals with bi-nary exposures and deterministic interventions on the exposure, and is thus not entirelyrelated to our approach, which deals with both continuous and categorical exposures andstochastic interventions on the exposure. Like the effects on the treated of Vansteelandtand VanderWeele (2012), interventional effects share the no-cross-world-independenceproperty of our methods. The interested reader is referred to Nguyen et al. (2019) for ataxonomy of the several mediation analyses proposed in the causal inference literature upto date.Stochastic interventions have analytical advantages compared to their deterministiccounterparts, such as allowing the seamless definition of causal effects for continuousexposures with an interpretation that is familiar to regular users of linear regression ad-justment. For example, Haneuse and Rotnitzky (2013) assess the effect of an interventionthat reduces a patient’s operating time (i.e., the time spent in surgery) on the risk of post-operative outcomes among patients undergoing surgical resection non-small-cell lung can-3er. D´ıaz and van der Laan (2012) study the effect of increasing the amount of leisure timephysical activity in the elderly on subsequent all-cause mortality. D´ıaz and van der Laan(2013) study the effect of a (hypothetical) policy that enforces pollution levels below acertain cutoff point. Kennedy (2018a) shows that stochastic interventions can also be usedin longitudinal studies to define and estimate total effects without relying on the positivityassumption.In this article, we propose a decomposition of the effect of a stochastic interventioninto a direct and an indirect effect, with interpretation analogous to that originally pro-posed by Robins and Greenland (1992) and Pearl (2001). We show that the identificationof (in)direct effects based on stochastic interventions does not require cross-world coun-terfactual independencies, therefore yielding scientific results that can be tested throughexperimentation on both the exposure and mediator. Of high practical relevance, our pro-posal also allows the definition and estimation of non-parametric mediated effects for con-tinuous exposures, a problem for which no methods or software exist. Parametric media-tion methods such as those discussed by Vansteelandt et al. (2012) induce unquantifiableamounts of bias by imposing untestable and implausible parametric assumptions on thedistribution of cross-world counterfactuals.We develop a one-step non-parametric estimator based on the efficient influence func-tion, incorporating flexible regression tools from the machine learning literature, and pro-vide n / -rate convergence and asymptotic linearity results. We propose methods to usethese asymptotic distributions to construct confidence regions and to test the null hypoth-esis of no direct effect. Our estimator has roots in semiparametric estimation theory (e.g.,Pfanzagl and Wefelmeyer, 1985; Begun et al., 1983; van der Vaart, 1991; Newey, 1994;Bickel et al., 1997), and in the targeted learning framework of van der Laan and Rubin(2006); van der Laan and Rose (2011, 2018). In particular, we use cross-fitting in or-der to obtain n / -convergence of our estimators while avoiding entropy conditions thatmay be violated by the data adaptive estimators we use (Zheng and van der Laan, 2011;Chernozhukov et al., 2018). Our estimators use a re-parameterization of certain integralsas conditional expectations in order to accommodate multivariate mediators. Softwareimplementing our methods is provided in the form of an open source R package freelyavailable on GitHub. Let A denote a continuous or categorical exposure variable, let Y denote a continuousor binary outcome, let Z denote a multivariate mediator, and let W denote a vector ofobserved covariates. Let O = ( W, A, Z, Y ) represent a random variable with distribution4 . We use P n to denote the empirical distribution of a sample of n i.i.d. observations O , . . . , O n . We let P f = (cid:82) f ( o )d P ( o ) for a given function f ( o ) , and use E to denoteexpectations with respect to P . We assume P ∈ M , where M is the nonparametricstatistical model defined as all continuous densities on O with respect to a dominatingmeasure ν . Let p denote the corresponding probability density function. We use g ( a | w ) to denote the probability density function or the probability mass function of A conditionalon W = w ; m ( a, z, w ) and b ( a, w ) to denote the outcome regression functions E ( Y | A = a, Z = z, W = w ) and E ( Y | A = a, W = w ) , respectively; and e ( a | z, w ) to denotethe conditional density or probability mass function of A conditional on ( Z, W ) . Let g ( a | w ) be dominated by a measure κ ( a ) (e.g., the counting measure for binary A andthe Lebesgue measure for continuous A ). We use q ( z | a, w ) and r ( z | w ) to denotethe corresponding conditional densities of Z . The parametrization e = gq/r will provefundamental in the construction of our estimators, since it will allow us to avoid estimationof multivariate conditional densities. A similar parameterization is used by Zheng andvan der Laan (2012) to estimate mediated effects under deterministic interventions. Weuse W , A , Z and Y to denote the support of the corresponding random variables.We formalize the definition of our counterfactual variables using the following non-parametric structural equation model (NPSEM), but note that equivalent methods may bedeveloped by taking the counterfactual variables as primitives. Assume W = f W ( U W ); A = f A ( W, U A ); Z = f Z ( W, A, U M ); Y = f Y ( W, A, ZU Y ) . (1)This set of equations represents a mechanistic model assumed to generate the observeddata O ; furthermore, it encodes several fundamental assumptions. First, an implicit tem-poral ordering is assumed — that is, Y occurs after Z , A and W ; Z occurs after A and W ; and A occurs after W . Second, each variable (i.e., { W, A, Z, Y } ) is assumed to begenerated from the corresponding deterministic function (i.e., { f W , f A , f Z , f Y } ) of theobserved variables that precede it temporally, plus an exogenous variable, denoted by U .Each exogenous variable is assumed to contain all unobserved causes of the correspond-ing observed variable. Independence assumptions on U = ( U W , U A , U Z , U Y ) necessaryfor identification will be clarified in Section 2.1. Furthermore, we note that we have explic-itly excluded outcome-mediator confounders which are affected by exposure. Mediationanalysis in the presence of a such variables is notoriously hard (Avin et al., 2005); theadaptation of our methods to this problem is possible but it requires a new set of toolswhich is out of the scope of this paper.Causal effects are defined in terms of hypothetical interventions on the NPSEM (1). Inparticular, consider an intervention in which the equation corresponding to A is removed,and the exposure is drawn from a user-specified distribution g δ ( a | w ) , which may dependon g and is indexed by a user-specified parameter δ . We assume without loss of generality5hat g δ =0 = g . Let A δ denote a draw from g δ ( a | w ) . Alternatively, such modifications cansometimes be described in terms of an intervention in which the equation correspondingto A is removed and the exposure is set equal to a hypothetical regime d ( A, W ) . Regime d depends on the natural (that is, under no intervention) exposure level A and covariates W . The latter intervention is sometimes referred to as depending on the natural value ofexposure , or as a modified treatment policy (Haneuse and Rotnitzky, 2013). Young et al.(2014) provide a discussion of the differences and similarities in the interpretation andidentification of these two interventions. Below, we discuss two examples of stochasticinterventions: modified treatment policies, and exponential tilting. Example 1 (Modified treatment policy (Haneuse and Rotnitzky, 2013)) . Let A denote acontinuous exposure, such as operating time in non-small-cell lung cancer. Assume thedistribution of A conditional on W = w is supported in the interval ( l ( w ) , u ( w )) . Thatis, the minimum possible operating time for an individual with covariates W = w is l ( w ) .Then one may define a hypothetical post-intervention exposure A δ = d ( A, W ) , where d ( a, w ) = (cid:40) a − δ if a > l ( w ) + δa if a ≤ l ( w ) + δ, (2)where < δ < u ( w ) is an arbitrary user-given value. Interesting modifications to thisregime may be obtained by allowing δ to be a function of w , therefore allowing the re-searcher to specify a different change in operating time as a function of covariates such ascomorbidities, age, etc. This intervention was first introduced by D´ıaz and van der Laan(2012), and has been further discussed in D´ıaz and van der Laan (2018) and Haneuse andRotnitzky (2013). Example 2 (Exponential tilting) . We can alternatively define a tilted intervention distribu-tion as g δ ( a | w ) = exp( δa ) g ( a | w ) (cid:82) exp( δa ) g ( a | w )d κ ( a ) , (3)for δ ∈ R , and let the hypothetical post-intervention exposure A δ be a random draw from g δ , conditional on the natural value of the observed covariates W . For binary A , Kennedy(2018a) proposed evaluating the total effect of a binary exposure A in terms of incrementalpropensity score interventions that replace the propensity score g (1 | w ) with a shifted ver-sion based on multiplying the odds of exposure by a user-given parameter δ (cid:48) . In particular,the post-intervention propensity score is given by g δ (cid:48) (1 | w ) = δ (cid:48) g (1 | w ) δ (cid:48) g (1 | w ) + 1 − g (1 | w ) , (4)6or < δ (cid:48) < ∞ . The proposal of Kennedy (2018a) is thus a case of exponential tilting (3)under the parameterization δ (cid:48) = exp( δ ) . This choice of parameterization is motivated bythe fact that δ (cid:48) can be interpreted as an odds ratio indicating how the intervention changesthe odds of exposure. The extremes of δ (cid:48) = 0 and δ (cid:48) = ∞ correspond to the standardinterventions A = 0 and A = 1 considered in the definition of the average treatmenteffect.We now turn our attention to defining the population intervention effect (PIE) of A on Y . To proceed, for any values ( a, z ) , consider the counterfactual outcome Y ( a, z ) = f Y ( W, a, z, U Y ) . , and the counterfactual mediator Z ( a ) = f Z ( W, a, U Z ) . The counter-factual Y ( a, z ) is the outcome in a hypothetical world in which ( A, Z ) = ( a, z ) is fixedexternally. The PIE is defined as a contrast comparing the expectation of the outcomeunder no intervention with the expectation of the counterfactual outcome obtained underan intervention A δ : ψ ( δ ) = E { Y ( A δ ) − Y } . Note that the interpretation of the PIE depends on the stochastic intervention considered.For example, for the modified treatment policies of Example 1, the PIE describes thedifference in outcomes obtained by a reduction of δ in operating time. In the case of theincremental propensity score intervention (4), the PIE is interpreted as the difference inoutcomes obtained by an intervention under which the odds of exposure is δ (cid:48) times highercompared to current practice.Since A is a cause of Z , an intervention that changes the exposure to A δ also inducesa counterfactual mediator Z ( A δ ) . As a consequence of the consistency implied by theNPSEM, we have Y ( A, Z ) = Y . Similarly, the law of composition (Pearl, 2000) allowsus to write Y ( A δ , Z ( A δ )) = Y ( A δ ) . Thus, the PIE may be decomposed in terms of a population intervention direct effect (PIDE) and a population intervention indirect effect(PIIE) : ψ ( δ ) = PIIE (cid:122) (cid:125)(cid:124) (cid:123) E { Y ( A δ , Z ( A δ )) − Y ( A δ , Z ) } + PIDE (cid:122) (cid:125)(cid:124) (cid:123) E { Y ( A δ , Z ) − Y ( A, Z ) } . (5)This decomposition of the PIE as the sum of direct and indirect effects has an interpre-tation analogous to the corresponding standard decomposition of the average treatmenteffect (Pearl, 2001). In particular, the direct effect represents the effect of an interventionthat changes the distribution of the exposure while keeping the distribution of the media-tors fixed at the value that it would have taken under no intervention. The indirect effectmeasures the effect of an indirect intervention on the mediators generated by interveningon the exposure, while holding the intervention on the exposure constant.7he intervention in Example 1 arises naturally as a modified treatment policy. In con-trast, the intervention in Example 2 arises directly as a stochastic intervention that modifiesthe distribution of the variables — it is unclear as of yet whether this quantity may be in-terpreted as a modified treatment policy. Drawing on the work of Haneuse and Rotnitzky(2013), we make the following assumption for modified treatment policies, which ensuresthat we can use the change of variable formula when computing integrals over A . This isuseful for studying properties of the parameter and estimators we propose. A1 (Piecewise smooth invertibility) . For each w ∈ W , assume that the interval I ( w ) =( l ( w, ) , u ( w )) may be partitioned into subintervals I δ,j ( w ) : j = 1 , . . . , J ( w ) such that d ( a, w ) is equal to some d j ( a, w ) in I δ,j ( w ) and d j ( · , w ) has inverse function h j ( · , w ) withderivative h (cid:48) j ( · , w ) .Under this assumption, the distribution of a modified treatment policy A δ = d ( A, W ) may be recovered through (see Haneuse and Rotnitzky, 2013): g δ ( a | w ) = J ( w ) (cid:88) j =1 I δ,j { h j ( a, w ) , w } g { h j ( a, w ) | w } h (cid:48) j ( a, w ) , (6)where I δ,j { u, w } = 1 if u ∈ I δ,j ( w ) and I δ,j { u, w } = 0 otherwise. In Example 1, thestochastic intervention becomes g δ ( a | w ) = g ( a | w ) { l ( w ) ≤ a ≤ l ( w ) + δ } + g ( a + δ | w ) { l ( w ) ≤ a ≤ u ( w ) − δ } . Therefore, under A1, a modified treatment policy may also be represented as a changeby which the equation f A is removed from the NPSEM and A is replaced by a draw A δ from the distribution g δ ( a | w ) . As a result of these two representations, the interven-tion may be interpreted in two different ways: (i) a change in the probabilistic mechanismused to assign exposure level, and (ii) a subject-specific change in exposure from A to A δ = d ( A, W ) , where only interpretation (i) requires A1. Note, however, that the popula-tion distribution of the exposure is the same under both interventions (Young et al., 2014);thus, both representations lead to exactly the same marginal counterfactual outcome dis-tributions.Several estimators of the functional ψ ( δ ) have previously been proposed. For the caseof a continuous exposure, D´ıaz and van der Laan (2012) developed inverse probabilityweighted, outcome regression, and doubly robust estimators based on the framework oftargeted minimum loss-based estimation (TMLE) (van der Laan and Rose, 2011), us-ing data adaptive estimators of the relevant nuisance parameters. D´ıaz and van der Laan(2018) improved on the previous methodology by constructing a TMLE algorithm with8ower computational complexity that preserves the desirable asymptotic properties of theoriginal approach. Haneuse and Rotnitzky (2013) propose estimators that rely on cor-rectly specified parametric models. Such methods are of limited applicability since theyare reliable only in situations where the nuisance parameters involve only few categoricalvariables, where correctly specified (that is, saturated) parametric models can conscien-tiously be constructed. For the binary case with g δ as in Example 2, Kennedy (2018a)proposed an estimator for ψ ( δ ) . This estimator is efficient, asymptotically linear, and itallows incorporation of data adaptive estimators of the nuisance parameters.Since E ( Y ) is trivially estimated by the empirical mean in the sample, our optimalitytheory and estimators focus on θ ( δ ) = E { Y ( A δ , Z ) } . We present two types of results:for general modified treatment policies satisfying (A1), and for the particular stochasticintervention of Example 2. We compare the assumptions required for both. In this section we introduce the counterfactual variable Y ( a, z ) , defined as the outcomethat would be observed in a hypothetical world in which P { ( A, Z ) = ( a, z ) } = 1 . This isthe same counterfactual variable that is often used to perform mediation analyses on theaverage treatment effect (Robins and Greenland, 1992; Pearl, 2001).We introduce the following identification assumptions: A2 (Common support) . Assume supp { g δ ( · | w ) } ⊆ supp { g ( · | w ) } for all w ∈ W . A3 (Conditional exchangeability of exposure and mediator assignment) . Assume E { Y ( a, z ) | A, W, Z } = E { Y ( a, z ) | W, Z } for all ( a, z ) ∈ A × Z . Assumption A2 is standard in the analysis of causal effects, and simply states that the δ -specific intervention of interest is supported in the data. This assumption holds for all δ in the interventions described in Examples 1 and 2 (D´ıaz and van der Laan, 2012; Kennedy,2018a). Assumption A3 is related to the assumption that Vansteelandt and VanderWeele(2012) used for identification of mediated effects among the treated. In that proposalthe authors assume Y ( a, z ) ⊥⊥ ( A, Z ) | W , which would imply the stronger assumption E { Y ( a, z ) | A, W, Z } = E { Y ( a, z ) | W } . This assumption would be satisfied for anypre-exposure variable W in a randomized experiment in which exposure and mediator arerandomized. Thus, the direct effect for a population intervention corresponds to contrastsbetween treatment regimes of a randomized experiment via interventions on A and Z ,unlike the natural direct effect for the average treatment effect (Robins and Richardson,2010). This claim is made rigorous in the identification result of Theorem 1 presentedbelow. A proof is available in the Supplementary Materials, together with the assumptionson the NPSEM exogenous errors U which are compatible with A3.9 heorem 1 (Identification) . Under A2 and A3, θ ( δ ) is identified and is given by θ ( δ ) = (cid:90) m ( a, z, w ) g δ ( a | w ) p ( z, w )d ν ( a, z, w ) . (7) Remark 1 (Mediator-outcome confounder not affected by exposure) . Note that, like thenatural direct effect of Pearl (2001), we require that all confounders of the mediator-outcome relation are measured. This assumption is implicit in A3. To see why, considerthe DAG in Figure 1. Conditioning on the collider Z opens a pathway from A to Y ( a, z ) through the outcome-mediator confounder V . If V is not measured and adjusted for (i.e., V ⊆ W ), then A3 fails. Z VA Y ( a, z ) Figure 1: Directed acyclic sub-graph of the variables involved in the case of an unmeasuredmediator-outcome confounder.
Remark 2 (Mediator-outcome confounded by exposure) . The methods presented herecannot be used if the mediator-outcome confounder V is affected by exposure. This isdue to the introduction of a new counterfactual variable V ( a ) . In particular, consider theDAG in Figure 2, where we have included only the relevant factual and counterfactualvariables. In this case, conditioning on the collider V would open a path A → V ← U V → V ( a ) → Y ( a, z ) , and would make A3 invalid. However, conditioning on V is nec-essary for A3 in order to close the path A → Z ← V → U V → V ( a ) → Y ( a, z ) , whichgets open when we condition on the collider Z . A comprehensive discussion of issues inidentification of path effects that includes this issue as a particular problem may be foundin Avin et al. (2005). VanderWeele et al. (2014) propose a solution to this problem whichinvolves a stochastic intervention on the mediator Z . We note that this is intrinsically dif-ferent from the problem treated here, since we are interested in stochastic interventions on A (not on Z ) and do not address mediator-outcome confounders affected by exposure. U V V V ( a ) A Z Y ( a, z ) Figure 2: Directed acyclic sub-graph of the variables involved in the case of an outcome-mediator confounder affected by exposure. 10
Optimality theory for estimation of the direct effect
Thus far we have discussed the decomposition of the effect of a stochastic interventioninto direct and indirect effects, and have provided identification results under weaker as-sumptions in comparison to the natural direct effect. In the sequel, we turn our attentionto a discussion of efficiency theory for the estimation of θ ( δ ) in the nonparametric model M . The efficient influence function (EIF) is a key object in semi-parametric estimationtheory, as it characterizes the asymptotic behavior of all regular and efficient estimators(Bickel et al., 1997; van der Vaart, 2002). Knowledge of the EIF has important practi-cal implications. First, the EIF is often useful in constructing locally efficient estimators.There are three common approaches for this: (i) using the EIF as an estimating equation(e.g., van der Laan and Robins, 2003), (ii) using the EIF in a one-step bias correction (e.g.,Pfanzagl and Wefelmeyer, 1985), and targeted minimum loss-based estimation (van derLaan and Rubin, 2006; van der Laan and Rose, 2011, 2018). Second, the EIF estimatingequation often enjoys desirable properties such as multiple robustness, which allows forsome components of the data distribution to be inconsistently estimated while preservingconsistency of the estimator. Third, the asymptotic analysis of estimators constructed us-ing the EIF often yields second-order bias terms, which require slow convergence rates(e.g., n − / ) for the nuisance parameters involved, thereby enabling the use of flexibleregression techniques in estimating these quantities.In Theorem 2 we present the EIF for a general stochastic intervention. Although thecomponents of the EIF associated with Y and ( Z, W ) are the same, the component as-sociated with the model for the distribution of A must be computed on a case-by-casebasis, that is, for each intervention of interest. Proofs for all results are available in theSupplementary Materials. Theorem 2 (Efficient influence function) . Let η = ( g, m, e ) . The efficient influence func-tion for θ ( δ ) in the nonparametric model M is D Yη,δ ( o ) + D Aη,δ ( o ) + D Z,Wη,δ ( o ) − θ ( δ ) ,where D Yη,δ ( o ) = g δ ( a | w ) e ( a | z, w ) { y − m ( a, z, w ) } D Z,Wη,δ ( o ) = (cid:90) m ( a, z, w ) g δ ( a | w )d κ ( a ) , and D Aη,δ ( o ) is the efficient score corresponding to the non-parametric model for g . An immediate consequence of Theorem 2 is that, in a randomized trial, we have D Aη,δ ( o ) = 0 . Lemmas 1 and 2 below present the D Aη,δ ( o ) components for modified treat-ment policies satisfying A1 and for the exponential tilting of Example 2, respectively.11 emma 1 (Modified treatment policies) . Define the nuisance parameter φ ( a, w ) = (cid:90) m ( d ( a, w ) , z, w ) r ( z | w ) dν ( z ) (8) = E (cid:26) g ( A | W ) e ( A | Z, W ) m ( d ( A, W ) , Z, W ) | A = a, W = w (cid:27) , (9) and augment η as η = ( g, m, e, φ ) . If the modified treatment policy d ( A, W ) satisfies A1,then D Aη,δ ( o ) = φ ( a, w ) − (cid:90) φ ( a, w ) g ( a | w )d κ ( a ) . Lemma 2 (Exponential tilt) . Define the nuisance parameter φ ( a, w ) = (cid:90) m ( a, z, w ) r ( z | w ) dν ( z ) (10) = E (cid:26) g ( A | W ) e ( A | Z, W ) m ( A, Z, W ) | A = a, W = w (cid:27) , (11) and augment η as η = ( g, m, e, φ ) . If the stochastic intervention is the exponential tilt (3),then D Aη,δ ( o ) = g δ ( a | w ) g ( a | w ) (cid:26) φ ( a, w ) − (cid:90) φ ( a, w ) g δ ( a | w )d κ ( a ) (cid:27) . Expressions (8) and (10) show that estimators based on the respective influence func-tions require integration with respect to the mediator Z , as well as estimation of the possi-bly multivariate conditional density r ( z | w ) , which may pose an estimation challenge dueto the curse of dimensionality. To solve the issue, we propose an alternative parametriza-tion (9) and (11) of the EIF based on a sequential regression φ , rather than using the densityof Z conditional on ( A, W ) and W . This choice has important consequences for the pur-pose of estimation, as it helps to bypass estimation of the (possibly high-dimensional)conditional density of the mediators, instead allowing for regression methods, which arefar more commonly found in the statistics literature and software, to be used for estimationof the relevant quantity. In particular, if r ( z | w ) is hard to estimate, estimators of φ may becomputed by first estimating g , m , and e , computing the pseudo-outcomes defined in thelemmas, and applying regression techniques to estimate the outer conditional expectation.For binary exposures, the EIF corresponding to the incremental propensity score inter-vention may be simplified as in the following corollary. Corollary 1 (Efficient influence function for incremental propensity score interventions) . Let A take values on { , } , and let the exponentially tilted intervention g δ, (1 | W ) be s in (4). Then, the EIF of Lemma 2 may be simplified as follows. Define the nuisanceparameter φ ( w ) = E { m (1 , Z, W ) − m (0 , Z, W ) | W = w } , and let η = ( g, m, e, φ ) . Then D Aη,δ ( o ) = δφ ( w ) { a − g (1 | w ) }{ δg (1 | w ) + 1 − g (1 | w ) } . Note that in Lemmas 1, 2, and Corollary 1, we have used φ to represent differentparameters. We have allowed this abuse of notation because the nature of this auxiliaryparameter is the same for all three cases, and having one symbol will allow us to stateour estimation results in some generality. In the sequel, the difference will always beclear from context. Note also that g ( a | w ) could be pulled out of the expectation in thedefinition of φ ( a, w ) . We decided to leave it inside the expectation as we conjecture that itmay act as a stabilizing factor for the inverse probability weights { e ( a | z, w ) } − .In contrast to the efficient influence function for the natural direct effect (Tchetgen Tch-etgen and Shpitser, 2012), the contribution of the exposure process to the EIF for the PIEmediated effect is non-zero. This is a direct consequence of the fact that the parameterof interest depends on g ; moreover, this implies that, unlike the natural direct effect, theefficiency bound in observational studies differs from the efficiency bound in randomizedstudies. As we see in the lemmas below, this also implies that it is not generally possibleto obtain estimating equations that are robust to inconsistent estimation of g . However,such robustness will be possible if the stochastic intervention is also a modified treatmentpolicy satisfying A1. Lemma 3 (Multiple robustness for modified treatment policies) . Let the modified treat-ment policy satisfy A1, and let η = ( g , e , m , φ ) be such that one of the two followingconditions hold:(i) g = g and either e = e or m = m ,(ii) m = m and φ = φ .Then P D η ,δ = θ ( δ ) , with D η,δ as defined in Theorem 2 and Lemma 1. The above lemma implies that it is possible to construct consistent estimators for θ under consistent estimation of at least two of the nuisance parameters in η , in the configu-rations described in the lemma. This lemma is a direct consequence of Theorem 5, foundin the Supplementary Materials. We note, however, that part (ii) of the lemma may beuninteresting if the parameterization (9) is used to estimate φ . In that case φ = φ will13enerally require g = g , e = e , and m = m , as well as consistency of the estimatorfor the outer expectation. In contrast, if the parameterization (8) is used to estimate φ ,then the case m = m and φ = φ would be trivially satisfied if m = m and r = r ,where r is the density used to compute φ . To some readers it may seem surprising thatestimation of θ ( δ ) may be robust to estimation of g , even when the parameter definitionin (7) is explicitly dependent on g . We offer some intuition into this surprising result bynoting that assumption A1 allows us to use the change of variable formula to obtain θ ( δ ) = E (cid:26)(cid:90) m ( d ( A, W ) , z, W ) r ( z | , W )d ν ( z ) (cid:27) . Estimation of this parameter without relying on g may be carried out by consistently esti-mating m , r , and using the empirical distribution as an estimator of the outer expectation.This behavior has been previously observed for the total effect ψ ( δ ) under A1 (D´ıaz andvan der Laan, 2012; Haneuse and Rotnitzky, 2013).The robustness properties of the EIF for an exponential tilt are presented below. Lemma 4 (Robustness for exponential tilting) . Let g δ be defined as in (3). Let η =( g , e , m , φ ) be such that g = g and either e = e or m = m . Then P D η ,δ = θ ( δ ) ,with D η,δ as defined in Theorem 2 and Lemma 2. Lemma 4 is a direct consequence of Theorem 6 in the Supplementary Materials. Thecorresponding proof reveals that the EIF for the binary distribution is not multiply robust— that is, the intervention fails to satisfy assumption A1 and integrals over the range of A cannot be computed using change of variable formula. This behavior has been previouslyobserved for other interventions that do not satisfy A1 (D´ıaz and van der Laan, 2013).Even though this lemma implies that consistent estimation of g is required, the bias termsare still second-order, so an estimator of g converging at rate n / or faster is sufficient, aswe will see in the sequel. We start this section describing two simple estimators, the substitution and re-weightedestimators. These estimators are motivated by the fact that θ ( δ ) has the two followingalternative representations: θ ( δ ) = E (cid:26)(cid:90) m ( a, Z, W ) g δ ( a | W )d κ ( a ) (cid:27) (12) = E (cid:26) g δ ( A | W ) e ( A | Z, W ) Y (cid:27) , (13)14here we remind the reader that e ( a | z, w ) denotes the probability density function of A conditional on ( Z, W ) . Equation (13) follows from noting that gq/r = e . This parameter-ization has the advantage that only the univariate conditional density e ( a | z, w ) has to beestimated, instead of the conditional densities of the possibly high-dimensional mediator Z . A similar result was also used by Zheng and van der Laan (2012) to develop a targetedminimum loss-based estimator of natural direct effects under a binary exposure variable.The substitution estimator is simply defined by plugging in estimators of m and g δ intothe identification result given in (12). Consistency of this estimator requires consistentestimation of the outcome regression m and the intervention distribution g δ . The secondestimator is a re-weighting estimator based on the alternative representation of the iden-tification result given in (13), which requires consistent estimation of g δ and e . In theremainder of this section, we discuss an efficient estimator that combines ideas from theprevious two estimators as well as the efficient influence function derived in the previoussection, in order to build an estimator that is both efficient and robust to model misspec-ification. We discuss an asymptotic linearity result for the doubly robust estimator thatallows computation of asymptotically correct confidence intervals and hypothesis tests.In the sequel, we assume that preliminary estimators ˆ m , ˆ g δ , ˆ φ and ˆ e of m , g δ , φ , and e , respectively, are available. These estimators may be obtained from flexible regressiontechniques such as support vector machines, regression trees, boosting, neural networks,splines, or ensembles thereof (Breiman, 1996; van der Laan et al., 2007). As previouslydiscussed, the consistency of these estimators will determine the consistency of our esti-mators of the population mediation intervention mean θ . First, we discuss a substitution estimator based on (12), computed as ˆ θ sub ( δ ) = (cid:90) n n (cid:88) i =1 ˆ m ( a, Z i , W i )ˆ g δ ( a | W i )d κ ( a ) , (14)where we have substituted estimators of m and g δ in (12), and have estimated the expec-tation with respect to the joint density p ( z, w ) by the empirical mean. The re-weightedestimator is based on (13), and is defined by ˆ θ re ( δ ) = 1 n n (cid:88) i =1 ˆ g δ ( A i | W i )ˆ e ( A i | , Z i , W i ) Y i (15)If ˆ m , ˆ g δ , and ˆ e are estimated within parametric models, then, by the delta method,both ˆ θ sub ( δ ) and ˆ θ re ( δ ) are asymptotically linear and n / -consistent. The bootstrap or an15nfluence function-based estimator may be used to construct asymptotically correct confi-dence intervals. However, if either the mediators or confounders are high-dimensional, therequired consistency of ˆ m , ˆ g δ , and ˆ e will hardly be achievable within parametric models.This issue may be alleviated through the use of data adaptive estimators. Unfortunately, n / -consistency of ˆ θ sub ( δ ) and ˆ θ re ( δ ) will generally require that ˆ m , ˆ g δ , and ˆ e are consis-tent in L ( P ) -norm at parametric rate, which is generally not possible within data adaptiveestimation of high-dimensional regressions. Thus, the asymptotic distribution will gener-ally be unknown, rendering the construction of confidence intervals and hypothesis testsimpossible. In the following section, we use the efficient influence function to propose anestimator that is n / -consistent and efficient under a weaker assumption, requiring only n / -convergence of second-order regression bias terms. We propose using the efficient influence function D η,δ to construct a robust and efficientestimator, constructed as the solution to the estimating equation P n D ˆ η,δ = 0 in θ , fora preliminary estimator ˆ η of η . In order to avoid imposing entropy conditions on theinitial estimators, we advocate for the use of cross-fitting (Zheng and van der Laan, 2011;Chernozhukov et al., 2016) in the estimation procedure. Let V , . . . , V J denote a randompartition of the index set { , . . . , n } into J prediction sets of approximately the same size.That is, V j ⊂ { , . . . , n } ; (cid:83) Jj =1 V j = { , . . . , n } ; and V j ∩ V j (cid:48) = ∅ . In addition, foreach j , the associated training sample is given by T j = { , . . . , n } \ V j . Denote by ˆ η j theestimator of η = ( g, m, e, φ ) , obtained by training the corresponding prediction algorithmusing only data in the sample T j . Further, let j ( i ) denote the index of the validation setwhich contains observation i . The estimator is thus defined as: ˆ θ ( δ ) = 1 n n (cid:88) i =1 D ˆ η j ( i ) ,δ ( O i ) = 1 n n (cid:88) i =1 (cid:110) D Y ˆ η j ( i ) ,δ ( O i ) + D A ˆ η j ( i ) ,δ ( O i ) + D Z,W ˆ η j ( i ) ,δ ( O i ) (cid:111) . (16)In a randomized trial the estimator may also be computed by setting D A ˆ η j ( i ) ,δ ( O i ) = 0 . M -estimation theory may be used to derive the asymptotic distribution of ˆ θ ( δ ) . Asymp-totic linearity and efficiency of the estimator for modified treatment policies is detailed inthe following theorem: Theorem 3 (Pointwise weak convergence for modified treatment policies) . Let (cid:107)·(cid:107) denotethe L ( P ) -norm defined as (cid:107) f (cid:107) = (cid:82) f d P . Assume(i) (cid:107) ˆ m − m (cid:107) {(cid:107) ˆ g − g (cid:107) + (cid:107) ˆ e − e (cid:107)} + (cid:107) ˆ g − g (cid:107) (cid:107) ˆ φ − φ (cid:107) = o P ( n − / ) , and(ii) P {| D η,δ ( O ) | ≤ C } = P {| D ˆ η,δ ( O ) | ≤ C } = 1 for some C < ∞ , and iii) The modified treatment policy d ( a, w ) is piecewise smooth invertible (A1).Then √ n { ˆ θ ( δ ) − θ ( δ ) } (cid:32) N (0 , σ ( δ )) , where σ ( δ ) = Var { D η,δ ( O ) } is the efficiency bound. Theorem 3 establishes the weak convergence of ˆ θ ( δ ) pointwise in δ . This convergenceis useful to derive confidence intervals in situations where the modified treatment policyhas a suitable scientific interpretation for a given δ , such as in our Example 1. Under theassumptions of the theorem, an estimator ˆ σ ( δ ) of σ ( δ ) may be obtained as the empiricalvariance of D ˆ η j ( i ) ,δ ( O i ) , and a Wald-type confidence interval may be constructed as ˆ θ ( δ ) ± z − α/ ˆ σ ( δ ) / √ n .For the remainder of this section, we turn our attention to a discussion of uniformconvergence of ˆ θ ( δ ) . Such a convergence result will prove useful in the following section,where we establish a hypothesis test of no direct effect. Such a test is constructed byrejecting the hypothesis if the direct effect is non-significant (at level α ), uniformly in δ . Tobuild such a testing procedure, we focus on the intervention defined in terms of exponentialtilting (3). Results for modified treatment policies are possible as well; however, theserequire smoothness assumptions on the map δ (cid:55)→ g δ ( a | w ) . Inspection of (6) reveals thatthis may in turn require smoothness assumptions on a (cid:55)→ g ( a | w ) , which may not bejustifiable in a number of applications. We thus focus on exponential tilting, which yieldssmooth maps δ (cid:55)→ g δ ( a | w ) by construction. This discussion, together with Lemmas 1 and2, thus reveals a trade-off between smoothness and robustness in estimation of modifiedtreatment policies and exponential tilting. Theorem 4 (Uniform weak convergence for exponential tilting) . Let g δ be the exponentialtilting intervention distribution (3) and let ∆ = [ δ l , δ u ] denote an interval with < δ l ≤ δ u < ∞ . Define c ( w ) = { (cid:82) a exp( δa ) g ( a | w ) } − . Assume || ˆ c − c || = o P ( n − / ) as wellas (i) and (ii) stated in Theorem 3. Then √ n { ˆ θ ( δ ) − θ ( δ ) } (cid:32) G ( δ ) in (cid:96) ∞ (∆) , where for any δ , δ ∈ ∆ , G ( · ) is a mean-zero Gaussian process with covariancefunction E { G ( δ ) G ( δ ) } = E { D η,δ ( O ) D η,δ ( O ) } . In this section, we consider estimation of the direct effect β ( δ ) = θ ( δ ) − E ( Y ) . Define thecorresponding (uncentered) influence function S η,δ ( o ) = D η,δ ( o ) − y . A straightforward17xtension of Theorem 4 shows that ˆ β ( δ ) = ˆ θ ( δ ) − ¯ Y converges weakly to a process G ( δ ) with covariance function E { G ( δ ) G ( δ ) } = E { S η,δ ( O ) S η,δ ( O ) } .We now present an approach to constructing uniform confidence bands on the function β ( δ ) , allowing testing of the null hypothesis of no direct effect H : sup δ ∈ ∆ β ( δ ) = 0 . Thishypothesis test is useful for checking the existence of a direct effect even if the interpre-tation of the exponential tilt g δ (e.g., as the odds ratio comparing post vs pre-interventionodds of exposure) does not answer a particularly meaningful question in a given applica-tion. Let ˆ σ ( δ ) denote the empirical variance of S ˆ η j ( i ) ,δ ( O i ) . Our goal will be achieved byfinding a value c α such that ˆ ρ ( c α ) = 1 − α , where ˆ ρ is a function such that ˆ ρ ( t ) = P (cid:32) sup δ ∈ ∆ (cid:12)(cid:12)(cid:12)(cid:12) ˆ β ( δ ) − β ( δ )ˆ σ ( δ ) / √ n (cid:12)(cid:12)(cid:12)(cid:12) ≤ t (cid:33) + o P (1) . (17)Confidence bands may be computed as ˆ θ ± n − / c α ˆ σ ( δ ) , and p-values for H can be com-puted by evaluating − ˆ ρ ( t ) at the observed value of the supremum test statistic. Thefunction ˆ ρ ( t ) may be obtained by approximating the distribution of sup δ ∈ ∆ G ( δ ) , where G ( δ ) is the Gaussian process defined above. In this paper we take the approach proposedby Kennedy (2018a), using the multiplier bootstrap (Gin´e and Zinn, 1984; van der Vaartand Wellner, 1996; Chernozhukov et al., 2013; Belloni et al., 2015). We omit the relevantproofs as they are identical to those presented by Kennedy (2018a). In comparison withthe nonparametric bootstrap, the multiplier bootstrap has the computational advantage thatthe nuisance estimators ˆ η need not be re-estimated. In comparison with directly sampling sup δ ∈ ∆ G ( δ ) , the proposed procedure does not require the evaluation of potentially largecovariance matrices; therefore, it is far more computationally efficient and convenient.The multiplier bootstrap approximates the distribution of sup δ ∈ ∆ G ( δ ) with the supre-mum of the process M ( δ ) = 1 √ n n (cid:88) i =1 ξ i { S ˆ η j ( i ) ,δ ( O i ) − ˆ β ( δ ) } ˆ σ ( δ ) , where randomness is introduced through sampling the multipliers ( ξ , . . . , ξ n ) , despite theprocess being conditional on the observed data O , . . . , O n . The multiplier variables arei.i.d. with mean zero and unit variance, and are drawn independently from the sample.Typical choices are Rademacher ( P ( ξ = −
1) = P ( ξ = 1) = 0 . ) or Gaussian multipliers.Under the assumptions of Theorem 4, plus uniform consistency of ˆ σ ( δ ) , it can be shownthat (17) holds for ˆ ρ ( t ) = P (cid:18) sup δ ∈ ∆ (cid:12)(cid:12) M ( δ ) (cid:12)(cid:12) ≤ t (cid:12)(cid:12)(cid:12)(cid:12) O , . . . , O n (cid:19) .
18s a consequence, computation of the critical value, p-values, and confidence intervalsonly requires simulation of a large number of realizations of the multipliers over a finegrid over ∆ . We now turn to comparing the three estimators of the direct effect, previously consideredin Section 4. In particular, we investigate the performance of the substitution (12, 14),re-weighted (13, 15), and efficient (16) estimators in the case of an incremental propensityscore (IPS) intervention on a binary intervention variable of interest. The estimators areevaluated on data simulated from the following data-generating mechanism: W ∼ Bern(0 . W ∼ Bern(0 . W ∼ Bern(0 . A ∼ Bern (cid:32) · (cid:88) j =1 W j + 0 . (cid:33) Z ∼ Bern (cid:18) − expit (cid:20) A + W A + W + 0 . (cid:21)(cid:19) Z ∼ Bern (cid:18) expit (cid:20) ( A −
1) + W W + 3 (cid:21)(cid:19) Z ∼ Bern (cid:18) expit (cid:20) ( A −
1) + 2 · W − · W + 0 . (cid:21)(cid:19) Y = Z + Z − Z + A − . · (cid:32) (cid:88) j =1 W j (cid:33) + (cid:15), where Bern( p ) is the Bernoulli distribution with parameter p , expit( x ) = { − x ) } − is the CDF of the logistic distribution (as implemented in the plogis function in the R programming language), and (cid:15) ∼ N (0 , . . The data available on a single observationalunit is denoted by the random variable O = ( W , W , W , A, Z , Z , Z , Y ) , where, in anygiven simulation, we consider observing n i.i.d. copies of O for one of seven sample sizes n ∈ { , , , , , , } .Under the above data-generating mechanism, we seek to estimate the direct effect un-der an incremental propensity score intervention δ = 0 . , for which the true value of thenatural direct effect is approximately . . We approximated this effect by using thealternative representation of θ ( δ ) as θ ( δ ) = E (cid:26)(cid:90) m ( a, Z, W ) g δ ( a | W )d ν ( a ) (cid:27) , a , . . . , a m of uniformly distributed numbers in the range of A , and the outer ex-pectation is approximated through the law of large numbers by drawing a large sample ( W , Z ) , . . . , ( W k , Z k ) from the joint distribution of ( W, Z ) . Each of the estimators isevaluated by contrasting regimes in which the appropriate nuisance parameters are fit viaa well-specified nonparametric regression or misspecified by fitting an intercept model.The enumerated set of estimators and regimes is summarized in Table 1 and Figure 3.In order to ensure a well-specified nonparametric regression for the nuisance parameters,we rely on the highly adaptive lasso (HAL) estimator, a recently proposed nonparametricregression function with properties guaranteeing convergence of estimated nuisance com-ponents at the n / -rates required by our theorems (Benkeser and van der Laan, 2016; vander Laan, 2017; van der Laan and Benkeser, 2018). n Estimator
400 900 1600 2500 3600 4900 6400
Substitution 0.083 0.086 0.084 0.077 0.072 0.074 0.075Reweighted (IPW) 0.105 0.120 0.111 0.116 0.107 0.112 0.109Efficient 0.092 0.086 0.071 0.072 0.068 0.067 0.065Efficient (E mis.) 0.060 0.060 0.060 0.059 0.054 0.059 0.058Efficient (M mis.) 0.165 0.130 0.110 0.107 0.099 0.103 0.097Efficient (G mis.) 0.436 0.829 1.255 1.912 2.662 3.543 4.519Table 1: Mean-squared errors (MSE), scaled by n , of the three key estimators of the directeffect under an IPS intervention δ = 0 . , across Monte Carlo simulations for eachof seven sample sizes. Substitution and reweighted estimators are computed using HALfor g , m , and e . “E mis.” denotes that e was inconsistently estimated via an intercept-onlylogistic regression model, “M mis.” and “G mis.” denote analogous estimators.20 l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l l l l l l l ll l l l l l ll l l l l l ll l l l l l l Efficient Eff. ( E - mis. ) Eff. ( M - mis. ) Reweighted Substitution | B i a s | n | B i a s | n s n M SE Eff. ( G - mis. ) n n Figure 3: Statistics for the three key estimators (and variations thereof) of the direct effectunder an IPS intervention δ = 0 . , across Monte Carlo simulations for each of sevensample sizes.Inspection of the mean-squared error, after scaling by √ n , reveals that the substitutionestimator and the efficient one-step estimator both display excellent, essentially equivalentperformance when nuisance components are estimated using the highly adaptive lasso.The one-step estimator has slightly better performance, which seems to be driven by abetter bias-variance trade-off. In contrast to the substitution estimator, the efficient one-step estimator carries the advantage of being double robust, allowing misspecification ofeither the outcome regression (denoted “M”) or the mediator-inclusive propensity score(denoted “E”). The robustness of the efficient estimator to the misspecification of thesenuisance components — and the lack of robustness to the mediator-exclusive propensityscore (denoted “G”) — are demonstrated in the last three rows of Table 1. Figure 3 visu-alizes the performance of the estimators, and their misspecified variants, in terms of boththe MSE (as presented in Table 1) and its individual components, the bias and standarderror. This comparison of the estimators reveals that the correctly-specified one-step ef-ficient estimator displays excellent performance in terms of both bias and variance while21ts non-robust misspecified variant displays an asymptotic bias that grows with samplesize. Interestingly, the one-step estimator with e inconsistently estimated displayed bet-ter performance than the fully efficient version. This is possibly an idiosyncrasy of thissimulation due to the fact that misspecification through an intercept-only model generatessmaller variable weights. Altogether, these numerical investigations demonstrate the util-ity of the proposed estimators in settings where the nonparametric estimation of nuisancecomponents is viable; moreover, in applied data analytic settings where this proceduremay be of interest, the one-step efficient estimator is clearly preferable on account of itsmultiple robustness. All numerical studies of the estimators were performed using theimplementations available in the medshift software package (Hejazi and D´ıaz, 2019) forthe R language and environment for statistical computing (R Core Team, 2019). We now turn to considering a scenario in which the decomposition proposed in equa-tion 5 and the proposed efficient estimator (16) may be used to estimate direct and in-direct effects. To proceed, we take as example a simple data set from an observationalstudy of the relationship between BMI and children’s behavior, distributed as part of the mma R package, available via the Comprehensive R Archive Network ( https://CRAN.R-project.org/package=mma ). The documentation of this data set describes it as a“database obtained from the Louisiana State University Health Sciences Center, New Or-leans, by Dr. Richard Scribner [who] explored the relationship between BMI and kidsbehavior through a survey at children, teachers and parents in Grenada in 2014. Thisdata set includes observations and variables.” In particular, we consider a modi-fied version of this data set with all missing values removed, as these are irrelevant to thedemonstration of the proposed methodology. In standard data analytic practice, we advo-cate for the use of the proposed methodology in tandem with a correction for missing data,such as imputation or weighting by inverse probability of censoring. (Carpenter et al.,2006; Vansteelandt et al., 2010; Seaman et al., 2012).To demonstrate the assessment of the direct and indirect effect with this observationaldata set, we consider the effect of participation in a sports team on the BMI of children,taking several related covariates as mediators (including snacking, exercising, and over-weight status) and all other collected covariates as potential confounders. As the inter-vention variable is binary, we frame our proposal in terms of an incremental propensityscore intervention (Kennedy, 2018a), wherein the odds of participating in a sports team isincreased by a factor of δ = 2 for each individual. Such a stochastic exposure regime maybe interpreted as the introduction of a school program or policy that motivates children to22pt in to participating in a sports team, doubling the odds of such voluntary participation. As noted in equation 5, the population intervention effect admits a decomposition in termsof components that allow estimation of the direct and indirect effects. We compute each ofthe components of the direct and indirect effects using appropriate estimators as follows • for E { Y ( A, Z ) } = E Y , the natural value of the outcome under no intervention, theempirical mean in the sample serves as an efficient estimator; • for E { Y ( A δ , Z ) } = θ ( δ ) , the mean outcome under an intervention altering the ex-posure mechanism but not the mediation mechanism, a one-step efficient estimator,denoted ˆ θ ( δ ) , is proposed as equation 16 and made available via the medshift R package (Hejazi and D´ıaz, 2019); • for E { Y ( A δ ) } = ψ ( δ ) , the mean outcome under an intervention altering both theexposure and mediation mechanisms, a one-step efficient estimator, denoted ˆ ψ ( δ ) inthe sequel, is easily estimable using the npcausal R package (Kennedy, 2018b).In the construction of estimators for θ ( δ ) and ψ ( δ ) , data adaptive nonparametric regres-sion procedures are incorporated to allow the relevant nuisance parameters of each estima-tor to be computed in a flexible manner using various R packages. The npcausal pack-age allows the estimator ˆ ψ ( δ ) to be constructed using the ranger algorithm (Wright andZiegler, 2015), an efficient and fast implementation of random forests (Breiman, 2001).In constructing ˆ θ ( δ ) , the medshift package provides facilities for estimating nuisance pa-rameters data adaptively via the Super Learner algorithm (van der Laan et al., 2007) forconstructing ensemble learners through cross-validation, using its implementation in the sl3 package (Coyle et al., 2018). In particular, the Super Learner procedure was used tocreate a weighted ensemble of algorithms from a library including extreme gradient boost-ing via the xgboost package (Chen and Guestrin, 2016), with variants including 50, 100,and 300 boosting iterations; variants of random forests using 50, 100, and 500 trees; L -penalized lasso and L -penalized ridge GLMs via the glmnet package (Friedman et al.,2009); an elastic net GLM with equally weighted L and L penalization terms (also via glmnet ); a main terms GLM; an intercept model; and the highly adaptive lasso (Benkeserand van der Laan, 2016), with 5–fold cross-validation and up to either 3-way or 5-wayinteraction terms, using the hal9001 package (Coyle and Hejazi, 2018).23 .2 Estimating the Direct and Indirect Effects From the decomposition given in equation 5, the direct effect may be denoted β ( δ ) = θ ( δ ) − E Y . An estimator of the direct effect, ˆ β ( δ ) may be expressed as a composition ofestimators of its constituent parameters: ˆ β ( δ ) = ˆ θ ( δ ) − n n (cid:88) i =1 Y i . Using the estimation strategies previously outlined, we may construct an estimate of the di-rect effect through a straightforward application of the delta method, yielding both a pointestimate and associated standard errors under our proposed stochastic intervention policy.Similarly, the indirect effect ψ ( δ ) − θ ( δ ) may be estimated as ˆ ψ ( δ ) − ˆ θ ( δ ) . We provideboth point estimates and associated inference under our proposed stochastic interventionpolicy in Table 2. Parameter
Lower 95% CI Estimate Upper 95% CIDirect Effect -0.458 0.011 0.479Indirect Effect -0.672 -0.157 0.357Table 2: Point estimates and 95% confidence intervals for the direct effect and indirecteffect for an IPS intervention of δ = 2 applied to the data set from the mma R package.From the estimates in Table 2, the conclusion may be easily drawn that there is littletotal effect of doubling the odds of participation in a sports team on the BMI of chil-dren, based on the data collected in the observational study made available in the mma R package. For reference, the marginal odds of participating in a sports team in the ob-served data are . , whereas the odds under the intervention considered are . . Basedon the 95% confidence intervals around the point estimates, we cannot conclude that theproposed incremental propensity score intervention is sufficiently efficacious to decreasechildren’s BMI. However, the magnitude of the effects seem to be in the correct direction,with increased participation in a sports team causing a reduction of BMI of . throughchanges in behaviors such as snacking and exercise. Using an approach similar to thatdemonstrated with this data set, the direct and indirect effects attributable to interventionswith higher odds of participating in a sports team are easily estimable.24 Discussion
We have proposed a novel mediation analysis based on the decomposition of the causal ef-fect of a stochastic intervention on the population, focusing on two types of interventions:modified treatment policies and exponential tilting. Unlike the natural direct effect of Pearl(2001), identification of the (in)direct effect proposed here does not require cross-worldcounterfactual independencies, and is therefore achievable in an experimental setting ran-domizing both the exposure and the mediator.We present results for stochastic interventions defined as a modified treatment policyand explicitly defined in terms of the post-intervention distribution (exponential tilting).In addition to the considerations about robustness and smoothness discussed in the presentarticle, the choice between these two options may also be guided by the fact that modi-fied treatment policies are more useful in practical settings as they can be used to informfeasible interventions.Note that our effect decomposition and estimators allow for multivariate mediators.The interpretation of the (joint) indirect effect in this case is entirely context-dependent.For example, the multivariate mediators may represent an innately multivariate construct(e.g., a psychological construct such as personality, behavior, etc.) in this case the indirecteffect could be interpreted as an effect through the construct (e.g., personality). Nonethe-less, our approach does not require that the multivariate mediators are part of a singleconstruct; the interpretation in these cases requires more care.We assume that there is no mediator-outcome confounder affected by exposure. Pointidentification of natural (in)direct effects in the presence of such variables is not generallypossible, and its partial identification is an area of active research (Robins and Richardson,2010; Tchetgen and Phiri, 2014; Miles et al., 2015).Lastly, for simplicity we focus on estimators constructed as solutions to the efficient in-fluence function estimating equation; moreover, we have made implementations of each ofthe proposed estimators available in the free and open source medshift software package(Hejazi and D´ıaz, 2019) for the R language and environment for statistical computing (RCore Team, 2019). Alternative estimation strategies, such as targeted minimum loss-basedestimation (van der Laan and Rubin, 2006; van der Laan and Rose, 2011, 2018), may havebetter performance than our propsoed estimators in finite samples. The development ofsuch estimators will be the subject of future research and software development.25 Proofs of results in the main document
Proof
We prove the result separately for exponential tilting and for modified treatmentpolicies. First, let A δ denote a variable drawn from the exponentially tilted distribution g δ ( a | w ) . We have E { Y ( A δ , Z ) | A δ = a, Z = z, W = w } = E { Y ( a, z ) | A δ = a, Z = z, W = w } = E { Y ( a, z ) | Z = z, W = w } = E { Y ( a, z ) | A = a, Z = z, W = w } = m ( z, a, w ) . The first equality follows from the definition of Y ( A δ , Z ) , the second equality followsbecause, by definition, A δ ⊥⊥ Y ( a, z ) | ( W, Z ) , and the third equality follows from A3.Finally, the fourth equality follows from the consistency implied by he NPSEM: ( A, Z ) =( a, z ) → Y ( a, z ) = Y .Note that, by definition, A δ ⊥⊥ Z | W . Thus E { Y ( A δ , Z ) } = (cid:90) supp( g δ ) × supp( q ) × supp( p ) m ( z, a, w ) r ( z | w ) g δ ( a | w ) p ( w ) dν ( a, z, w ) , where A2 ensures that m ( z, a, w ) is defined in the integration set.If the intervention is a modified treatment policies, such as our Example 1 where A δ = d ( A, W ) , then the proof proceeds as follows. First of all, we have A ⊥⊥ Y ( A δ , Z ) | ( A δ , Z, W ) , so that E { Y ( A δ , Z ) | A δ = a, A = a (cid:48) , Z = z, W = w } = m ( z, a, w ) . Integrating the above expression with respect to the joint density of ( A δ , A, Z, W ) , andusing r ( z | w ) = (cid:90) supp( g ) q ( z | a (cid:48) , w ) g ( a (cid:48) | w ) dν ( a (cid:48) ) yields the desired result. Lemma 5.
Define the non-parametric structural equation model in (1). If ( U A ⊥⊥ U Y and U W ⊥⊥ U Z and U Y ⊥⊥ U Z ) and either U Y ⊥⊥ U W or U A ⊥⊥ U W , then A3 holds. roof For fixed ( a, z ) , let Y ( a, z ) = f Y ( W, a, z, U Y ) . ( W, Z ) d-separates Y ( a, z ) from A in Figures 4 and 5, concluding the proof of the lemma. U W WZU Z AU A Y a,z U Y Figure 4: Directed Acyclic Graph for U A ⊥⊥ U Y and U W ⊥⊥ U Z and U Y ⊥⊥ U Z and U A ⊥⊥ U W . U W WZU Z AU A Y a,z U Y Figure 5: Directed Acyclic Graph for U A ⊥⊥ U Y and U W ⊥⊥ U Z and U Y ⊥⊥ U Z and U Y ⊥⊥ U W . Proof
In this proof we will use Θ( P ) to denote a parameter as a functional that mapsthe distribution P in the model to a real number. We will assume that the measure v isdiscrete so that integrals can be written as sums. The resulting influence function willalso correspond to the influence function of a general measure ν . For example, the trueparameter value is given by θ ( δ ) = Θ( P ) = (cid:88) y,z,a,w m ( a, z, w ) g δ ( a | w ) p ( z, w ) . g δ is known. Then the non-parametric MLE of θ ( δ ) is given by Θ( P n ) = (cid:88) y,z,a,w y P n ( y | a, z, w ) g δ ( a | w ) P n ( z, w )= (cid:88) y,z,a,w y P n f y,a,z,w P n f a,z,w g δ ( a | w ) P n f z,w , (18)where we remind the reader of the notation P f = (cid:82) f dP . Here f y,a,z,w = I ( Y = y, A = a, Z = z, W = w ) , f a,z,w = I ( A = a, Z = z, W = w ) , f a,w = I ( A = a, W = w ) , f z,w = I ( Z = z, W = w ) , and I ( · ) denotes the indicator function.We will use the fact that the efficient influence function in a non-parametric modelcorresponds with the influence curve of the NPMLE. This is true because the influencecurve of any regular estimator is also a gradient, and a non-parametric model has onlyone gradient. Appendix 18 of van der Laan and Rose (2011) shows that if ˆΘ( P n ) is asubstitution estimator such that θ ( δ ) = ˆΘ( P ) , and ˆΘ( P n ) can be written as ˆΘ ∗ ( P n f : f ∈F ) for some class of functions F and some mapping B ∗ , the influence curve of ˆΘ( P n ) isequal to IC ( P )( O ) = (cid:88) f ∈F d ˆΘ ∗ ( P ) d P f { f ( O ) − P f } . Applying this result to (18) with F = { f y,a,z,w , f y,a,w , f a,w , f z,w , f w } gives an efficientinfluence function equal to D Yη,δ ( o ) + D Z,Wη,δ ( o ) − θ ( δ ) . It remains to find the component D Aη,δ ( o ) for each specific intervention. This component may be found as the IF of theestimator Θ( P n ) = (cid:88) y,z,a,w y P ( y | a, z, w )ˆ g δ ( a | w ) P ( z, w ) , where ˆ g δ is the MLE of g δ , obtained by substitution of the MLE of g .The algebraic derivations described here are lengthy and not particularly illuminating,and are therefore omitted from the proof. Proof
In this proof we use the notation π ( w ) = g (1 | w ) . From the parameterization e ( a | z, w ) = g ( a | w ) q ( z | a, w ) /r ( z | w ) , note that φ ( a, w ) = (cid:90) m ( a, z, w )d P ( z | w ) . D Aη,δ ( o ) from Lemma 2 may be written as follows D Aη,δ ( o ) = (cid:88) t ∈{ , } g δ ( a | w ) g ( a | w ) (cid:90) m ( t, z, w )d P ( z | w ) { I ( a = t ) − g δ ( t | w ) } = (cid:88) t ∈{ , } (cid:90) m ( t, z, w )d P ( z | w ) (cid:20) a π δ ( w ) π ( w ) { t − g δ ( t | w ) } +(1 − a ) 1 − π δ ( w )1 − π ( w ) { − t − g δ ( t | w ) } (cid:21) Note that π δ ( w ) { t − g δ ( t | w ) } = −{ − π δ ( w ) }{ − t − g δ ( t | w ) } = (2 t − π δ ( w ) { − π δ ( w ) } . Thus, D Aη,δ ( o ) = (cid:88) t ∈{ , } (cid:90) m ( t, z, w )d P ( z | w )(2 t − π δ ( w ) { − π δ ( w ) } (cid:20) aπ ( w ) − − a − π ( w ) (cid:21) = π δ ( w ) { − π δ ( w ) } π ( w ) { − π ( w ) } { a − π ( w ) } (cid:88) t ∈{ , } (cid:90) m ( t, z, w )d P ( z | w )(2 t − since π δ ( w ) { − π δ ( w ) } π ( w ) { − π ( w ) } = δ { δπ ( w ) + 1 − π ( w ) } , expanding the sum in t concludes the proof. Proof
Let P n,j denote the empirical distribution of the prediction set V j , and let G n,j denote the associated empirical process (cid:112) n/J ( P n,j − P ) . Note that ˆ θ ( δ ) = 1 J J (cid:88) j =1 P n,j D ˆ η j ,δ , θ ( δ ) = P D η . Thus, √ n { ˆ θ ( δ ) − θ ( δ ) } = G n { D η,δ − θ ( δ ) } + R n, ( δ ) + R n, ( δ ) , R n, ( δ ) = 1 √ J J (cid:88) j =1 G n,j ( D ˆ η j ,δ − D η,δ ) , R n, ( δ ) = √ nJ J (cid:88) j =1 P { D ˆ η j ,δ − θ ( δ ) } . It remains to show that R n, ( δ ) and R n, ( δ ) are o P (1) . Theorem 5 together with theCauchy-Schwartz inequality and assumption (i) of the theorem shows that || R n, || ∆ = o P (1) . For || R n, || ∆ we use empirical process theory to argue conditional on the trainingsample T j . In particular, Lemma 19.33 of van der Vaart (1998) applied to the class offunctions F = { D ˆ η j ,δ − D η,δ } (which consists of one element) yields E (cid:26)(cid:12)(cid:12) G n,j ( D ˆ η j ,δ − D η,δ ) (cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) T j (cid:27) (cid:46) C log 2 n / + || D ˆ η j ,δ − D η,δ || (log 2) / By assumption (i), the left hand side is o P (1) . Lemma 6.1 of Chernozhukov et al. (2018)may now be used to argue that conditional convergence implies unconditional conver-gence, concluding the proof. Let || f || ∆ = sup δ ∈ ∆ | f ( δ ) | . Let P n,j denote the empirical distribution of the prediction set V j , and let G n,j denote the associated empirical process (cid:112) n/J ( P n,j − P ) . Note that ˆ θ ( δ ) = 1 J J (cid:88) j =1 P n,j D ˆ η j ,δ , θ ( δ ) = P D η . Thus, √ n { ˆ θ ( δ ) − θ ( δ ) } = G n { D η,δ − θ ( δ ) } + R n, ( δ ) + R n, ( δ ) , where R n, ( δ ) = 1 √ J J (cid:88) j =1 G n,j ( D ˆ η j ,δ − D η,δ ) , R n, ( δ ) = √ nJ J (cid:88) j =1 P { D ˆ η j ,δ − θ ( δ ) } . The map δ (cid:55)→ ¯ D η,δ is Lipschitz, which implies that the class F = { ¯ D η,δ : δ ∈ ∆ } has bounded bracketing numbers (Theorem 2.7.11 of van der Vaart and Wellner, 1996).Therefore, F is Donsker and G n { D η,δ − θ ( δ ) } (cid:32) G ( δ ) in (cid:96) ∞ (∆) .30t remains to show that || R n, || ∆ and || R n, || ∆ are o P (1) . Theorem 6 together with theassumptions of the theorem and the Cauchy-Schwartz inequality, show that || R n, || ∆ = o P (1) . For || R n, || ∆ we use empirical process theory to argue conditional on the trainingsample T j . Let F jn = { D ˆ η j ,δ − D η,δ : δ ∈ ∆ } . Because the function ˆ η j is fixed given thetraining data, we can apply Theorem 2.14.2 of van der Vaart and Wellner (1996) to obtain E (cid:40) sup f ∈F jn | G n,j f | (cid:12)(cid:12)(cid:12)(cid:12) T j (cid:41) (cid:46) || F jn || (cid:90) (cid:113) N [ ] ( (cid:15) || F jn || , F jn , L ( P ))d (cid:15), where N [ ] ( (cid:15) || F jn || , F jn , L ( P )) is the bracketing number and we take F jn = sup δ ∈ ∆ | D ˆ η j ,δ − D η,δ | as an envelope for the class F jn . Theorem 2.7.2 of van der Vaart and Wellner (1996)shows log N [ ] ( (cid:15) || F jn || , F jn , L ( P )) (cid:46) (cid:15) || F jn || . This shows || F jn || (cid:90) (cid:113) N [ ] ( (cid:15) || F jn || , F jn , L ( P ))d (cid:15) (cid:46) (cid:90) (cid:115) || F jn || + || F jn || (cid:15) d (cid:15) ≤ || F jn || + || F jn || / (cid:90) (cid:15) / d (cid:15) ≤ || F jn || + 2 || F jn || / . Since || F jn || = o P (1) , this shows sup f ∈F jn G n,j f = o P (1) for each j , conditional on T j .and thus || R n, || ∆ = o P (1) , concluding the proof of the theorem. Theorem 5.
Let d ( A, W ) satisfy assumption A1. Denote m d ( z, a, w ) = m ( z, d ( a, w ) , w ) .Let q ( z | a, w ) denote any density compatible with φ ( a, w ) . That is, let q be such that φ ( a, w ) = (cid:90) g ( a | w ) e ( a | z, w ) m ,d ( z, a, w ) q ( z | a, w ) dν ( z ) , nd define r = g q /e . In this theorem we denote d ξ ( w ) = d P ( w ) . We have P D η ,δ − θ ( δ ) = (cid:90) g δ (cid:18) ee − (cid:19) ( m − m ) r d κ d ξ + (cid:90) ee ( g δ, − g δ )( m − m ) r d κ d ξ + (cid:90) m ,d ( r − r )( g − g )d κ d ξ Proof
Note that P D Yη ,δ + P D Z,Wη,δ − θ ( δ ) = (cid:90) g δ e ( m − m ) er d κ d ξ − (cid:90) g δ ( m − m ) r d κ d ξ + (cid:90) ee ( g ,δ − g δ )( m − m ) r d κ d ξ + (cid:90) m ( g ,δ − g δ ) r d κ d ξ = (cid:90) g δ (cid:18) ee − (cid:19) ( m − m ) r d κ d ξ (19) + (cid:90) ee ( g ,δ − g δ )( m − m ) r d κ d ξ + (cid:90) ( g ,δ − g δ ) m r d κ d ξ. We have P D Aη ,δ = P (cid:18) φ − (cid:90) φ g dν (cid:19) = (cid:90) ( g − g ) m ,d g q e d κ d ξ. (20)Under A1, we can change variables in the following integral to obtain (cid:90) ( g ,δ − g δ ) m r d κ d ξ = (cid:90) ( g − g ) m ,d r d κ d ξ = (cid:90) ( g − g ) m ,d gqe d κ d ξ, where we used the fact that r ( z | w ) = g ( a | w ) q ( z | a, w ) /e ( a | z, w ) . Adding this32uantity in both sides of (20) we get P D Aη ,δ + (cid:90) ( g − g ) m ,d r d κ d ξ = (cid:90) e q g m ,d ( g − g )d κ d ξ − (cid:90) eqg m d ( g − g )d κ d ξ + (cid:90) eqg ( m d − m ,d )( g − g )d κ d ξ = (cid:90) ( φ − φ )( g − g )d κ d ξ + (cid:90) eqg ( m d − m ,d )( g − g )d κ d ξ. Theorem 6.
Define c ( w ) = { (cid:82) a exp( δa ) g ( a | w ) } − , and let c ( w ) be defined analo-gously. Let b ( a ) = exp( δa ) . Using the same notation as in Theorem 5, we have P D η ,δ − θ ( δ ) = (cid:90) g δ (cid:18) ee − (cid:19) ( m − m ) r d κ d ξ + (cid:90) ee ( g ,δ − g δ )( m − m ) r d κ d ξ + (cid:90) ( g ,δ − g δ ) { ( m − m ) r − ( φ − φ ) } d κ d ξ − (cid:90) (cid:26) ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ (cid:27) d ξ + (cid:90) (cid:26) ( c − c ) (cid:90) bφ ( g − g )d κ (cid:27) d ξ Proof
Let q ( z | a, w ) be any density compatible with φ . That is φ ( a, w ) = (cid:90) g ( a | w ) e ( a | z, w ) m ( a, z, w ) q ( z | a, w )d ν ( z ) Note that display (19) is also valid here. Note also that (cid:90) ( g ,δ − g δ ) m r d κ d ξ = (cid:90) ( g ,δ − g δ ) φ d κ d ξ + (cid:90) ( g ,δ − g δ ) { ( m − m ) r − ( φ − φ ) } d κ d ξ. g ,δ ( a | w ) = c ( w ) b ( a ) g ( a | w ) . We have P D Aη + (cid:90) ( g δ, − g δ ) φ d ξ = (cid:90) (cid:26)(cid:90) g ,δ g φg d κ − (cid:90) g ,δ g g d κ (cid:90) φg ,δ d κ + (cid:90) ( g ,δ − g δ ) φ d κ (cid:27) d ξ = (cid:90) (cid:26) g ,δ g gφ d κ − (cid:90) g δ φ d κ + (cid:90) g ,δ φ d κ (cid:20) − (cid:90) g ,δ g g d κ (cid:21)(cid:27) d ξ = (cid:90) (cid:26) c (cid:90) bφg d κ − c (cid:90) bφg d κ + c (cid:90) bgφ d κ (cid:90) ( c − c ) bg d κ (cid:27) d ξ = (cid:90) ( c − c ) (cid:26)(cid:90) bφg d κ − c (cid:90) bg φ d κ (cid:90) bg d κ (cid:27) d ξ = (cid:90) ( c − c ) (cid:26)(cid:90) bφg d κ − c (cid:90) bg φ d κ (cid:90) bg d κ − ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ (cid:27) d ξ = (cid:90) (cid:26) − ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ + ( c − c ) (cid:20)(cid:90) bφg d κ − (cid:90) bg φ d κ (cid:21)(cid:27) d ξ (21) = (cid:90) (cid:26) − ( c − c ) (cid:90) bg φ d κ (cid:90) bg d κ + ( c − c ) (cid:90) bφ ( g − g )d κ (cid:27) d ξ, where (21) follows from c (cid:82) bg d κ = 1 References
Chen Avin, Ilya Shpitser, and Judea Pearl. Identifiability of path-specific effects. In
IJCAIInternational Joint Conference on Artificial Intelligence , pages 357–363, 2005.Reuben M Baron and David A Kenny. The moderator–mediator variable distinction in so-cial psychological research: Conceptual, strategic, and statistical considerations.
Jour-nal of personality and social psychology , 51(6):1173, 1986.Janet M Begun, WJ Hall, Wei-Min Huang, Jon A Wellner, et al. Information and asymp-totic efficiency in parametric-nonparametric models.
The Annals of Statistics , 11(2):432–452, 1983. 34lexandre Belloni, Victor Chernozhukov, Denis Chetverikov, and Ying Wei. Uni-formly valid post-regularization confidence regions for many functional parameters inz-estimation framework. arXiv preprint arXiv:1512.07619 , 2015.David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In , pages689–696. IEEE, 2016.Peter J Bickel, Chris AJ Klaassen, YA’Acov Ritov, and Jon A Wellner.
Efficient andAdaptive Estimation for Semiparametric Models . Springer-Verlag, 1997.Leo Breiman. Stacked regressions.
Machine learning , 24(1):49–64, 1996.Leo Breiman. Random forests.
Machine learning , 45(1):5–32, 2001.James R Carpenter, Michael G Kenward, and Stijn Vansteelandt. A comparison of multipleimputation and doubly robust estimation for analyses with missing data.
Journal of theRoyal Statistical Society: Series A (Statistics in Society) , 169(3):571–584, 2006.Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In
Proceed-ings of the 22nd acm sigkdd international conference on knowledge discovery and datamining , pages 785–794. ACM, 2016.Victor Chernozhukov, Denis Chetverikov, Kengo Kato, et al. Gaussian approximationsand multiplier bootstrap for maxima of sums of high-dimensional random vectors.
TheAnnals of Statistics , 41(6):2786–2819, 2013.Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,et al. Double machine learning for treatment and causal parameters. arXiv preprintarXiv:1608.00060 , 2016.Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen,Whitney Newey, and James Robins. Double/debiased machine learning for treatmentand structural parameters.
The Econometrics Journal , 21(1):C1–C68, 2018.Stephen R Cole and Miguel A Hern´an. Fallibility in estimating direct effects.
Internationaljournal of epidemiology , 31(1):163–165, 2002.Jeremy R Coyle and Nima S Hejazi. hal9001: The Scalable Highly Adaptive LASSO ,2018. URL https://github.com/tlverse/hal9001 . R package version 0.2.1.35eremy R Coyle, Nima S Hejazi, Ivana Malenica, and Oleg Sofrygin. sl3: Modernpipelines for machine learning and Super Learning. https://github.com/tlverse/sl3 , 2018. URL https://doi.org/10.5281/zenodo.1342294 . R package version1.1.0.A Philip Dawid. Causal inference without counterfactuals.
Journal of the American Sta-tistical Association , 95(450):407–424, 2000.Iv´an D´ıaz and Mark J van der Laan. Population intervention causal effects based onstochastic interventions.
Biometrics , 68(2):541–549, 2012.Iv´an D´ıaz and Mark J van der Laan. Assessing the causal effect of policies: an exampleusing stochastic interventions.
The international journal of biostatistics , 9(2):161–174,2013.Iv´an D´ıaz and Mark J van der Laan. Stochastic treatment regimes. In
Targeted Learningin Data Science , pages 219–232. Springer, 2018.Vanessa Didelez, A Philip Dawid, and Sara Geneletti. Direct and indirect effects of se-quential treatments. In
Proceedings of the Twenty-Second Conference on Uncertaintyin Artificial Intelligence , pages 138–146. AUAI Press, 2006.Miroslav Dud´ık, Dumitru Erhan, John Langford, Lihong Li, et al. Doubly robust policyevaluation and optimization.
Statistical Science , 29(4):485–511, 2014.Jerome Friedman, Trevor Hastie, and Rob Tibshirani. glmnet: Lasso and elastic-net regu-larized generalized linear models , 2009.Evarist Gin´e and Joel Zinn. Some limit theorems for empirical processes.
The Annals ofProbability , pages 929–989, 1984.Arthur S Goldberger. Structural equation methods in the social sciences.
Econometrica:Journal of the Econometric Society , pages 979–1001, 1972.Sebastian Haneuse and Andrea Rotnitzky. Estimation of the effect of interventions thatmodify the received treatment.
Statistics in Medicine , 2013.Nima S Hejazi and Iv´an D´ıaz. medshift: Causal mediation analysis for stochastic in-terventions in R , 2019. URL https://github.com/nhejazi/medshift . R packageversion 0.0.8.Kosuke Imai, Luke Keele, and Dustin Tingley. A general approach to causal mediationanalysis.
Psychological methods , 15(4):309, 2010.36dward H Kennedy. Nonparametric causal effects based on incremental propensity scoreinterventions.
Journal of the American Statistical Association , pages 1–12, 2018a.Edward H Kennedy. npcausal: Nonparametric causal inference methods , 2018b. URL https://github.com/ehkennedy/npcausal . R package version 0.1.0.Judith J Lok. Defining and estimating causal direct and indirect effects when setting themediator to specific values is not feasible.
Statistics in medicine , 35(22):4008–4020,2016.Judith J Lok. Causal organic direct and indirect effects: closer to baron and kenny. arXivpreprint arXiv:1903.04697 , 2019.Caleb H Miles, Phyllis Kanki, Seema Meloni, and Eric J Tchetgen Tchetgen. On partialidentification of the pure direct effect. arXiv preprint arXiv:1509.01652 , 2015.Whitney K Newey. The asymptotic variance of semiparametric estimators.
Econometrica:Journal of the Econometric Society , pages 1349–1382, 1994.J. Neyman. Sur les applications de la thar des probabilites aux experiences agaricales:Essay des principle (1923). excerpts reprinted (1990) in english (d. dabrowska and t.speed), trans.
Statistical Science , 5:463–472, 1923.Trang Quynh Nguyen, Ian Schmid, and Elizabeth A Stuart. Clarifying causal mediationanalysis for the applied researcher: Defining effects based on what we want to learn. arXiv preprint arXiv:1904.08515 , 2019.Judea Pearl. Causal diagrams for empirical research.
Biometrika , 82(4):669–688, 1995.Judea Pearl. Graphs, causality, and structural equation models.
Sociological Methods &Research , 27(2):226–284, 1998.Judea Pearl.
Causality: Models, Reasoning, and Inference . Cambridge University Press,Cambridge, 2000.Judea Pearl. Direct & indirect effects. In
Proceedings of the 17th Conference inUncertainty in Artificial Intelligence , UAI ’01, pages 411–420, San Francisco, CA,USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1. URL http://dl.acm.org/citation.cfm?id=647235.720084 .Judea Pearl. Myth, Confusion, and Science in Causal Analysis. Technical Report R-348,Cognitive Systems Laboratory, Computer Science Department University of California,Los Angeles, Los Angeles, CA, May 2009.37aya L Petersen, Sandra E Sinisi, and Mark J van der Laan. Estimation of direct causaleffects.
Epidemiology , pages 276–284, 2006.J Pfanzagl and W Wefelmeyer. Contributions to a general asymptotic statistical theory.
Statistics & Risk Modeling , 3(3-4):379–388, 1985.K. R. Popper.
The Logic of Scientific Discovery . Hutchinson, London, 1934.R Core Team.
R: A Language and Environment for Statistical Computing . R Foundationfor Statistical Computing, Vienna, Austria, 2019. URL .Thomas S Richardson and James M Robins. Single world intervention graphs (swigs): Aunification of the counterfactual and graphical approaches to causality.
Center for theStatistics and the Social Sciences, University of Washington Series. Working Paper , 128(30):2013, 2013.James M Robins. A new approach to causal inference in mortality studies with sustainedexposure periods - application to control of the healthy worker survivor effect.
Mathe-matical Modelling , 7:1393–1512, 1986.James M Robins and Sander Greenland. Identifiability and exchangeability for direct andindirect effects.
Epidemiology , 3(0):143–155, 1992.James M Robins and Thomas S Richardson. Alternative graphical causal models and theidentification of direct effects.
Causality and psychopathology: Finding the determi-nants of disorders and their cures , pages 103–158, 2010.James M Robins, Miguel A Hern´an, and UWE SiEBERT. Effects of multiple interven-tions.
Comparative quantification of health risks: global and regional burden of diseaseattributable to selected major risk factors , 1:2191–2230, 2004.D. B. Rubin. Estimating Causal Effects of Treatments in Randomized & NonrandomizedStudies.
Journal of Educational Psychology , 1974. URL .Kara E Rudolph, Oleg Sofrygin, Wenjing Zheng, and Mark J Van Der Laan. Robust andflexible estimation of stochastic mediation effects: a proposed method and example in arandomized trial setting.
Epidemiologic Methods , 7(1), 2017.Shaun R Seaman, Ian R White, Andrew J Copas, and Leah Li. Combining multiple impu-tation and inverse-probability weighting.
Biometrics , 68(1):129–137, 2012.38eter Spirtes, Clark N Glymour, Richard Scheines, David Heckerman, Christopher Meek,Gregory Cooper, and Thomas Richardson.
Causation, prediction, and search . MITpress, 2000.Ori M Stitelman, Alan E Hubbard, and Nicholas P Jewell. The impact of coarseningthe explanatory variable of interest in making causal inferences: Implicit assumptionsbehind dichotomizing variables. 2010.James H Stock. Nonparametric policy analysis.
Journal of the American Statistical Asso-ciation , 84(406):567–575, 1989.Sarah L Taubman, James M Robins, Murray A Mittleman, and Miguel A Hern´an. In-tervening on risk factors for coronary heart disease: an application of the parametricg-formula.
International journal of epidemiology , 38(6):1599–1611, 2009.Eric J. Tchetgen Tchetgen and Kelesitse Phiri. Bounds for pure direct effect.
Epidemiology(Cambridge, Mass.) , 25(5):775, 2014.Eric J Tchetgen Tchetgen and Ilya Shpitser. Semiparametric theory for causal mediationanalysis: efficiency bounds, multiple robustness, and sensitivity analysis.
Annals ofStatistics , 40(3):1816, 2012.Jin Tian. Identifying dynamic sequential plans. In
Proceedings of the Twenty-FourthConference on Uncertainty in Artificial Intelligence , pages 554–561. AUAI Press, 2008.Mark J van der Laan. A generally efficient targeted minimum loss based estimator basedon the highly adaptive lasso.
The International Journal of Biostatistics , 13(2), 2017.Mark J van der Laan and David Benkeser. Highly adaptive lasso (hal). In
Targeted Learn-ing in Data Science , pages 77–94. Springer, 2018.Mark J van der Laan and Maya L Petersen. Direct effect models.
The international journalof biostatistics , 4(1), 2008.Mark J van der Laan and James M Robins.
Unified Methods for Censored LongitudinalData and Causality . Springer, New York, 2003.Mark J van der Laan and Sherri Rose.
Targeted Learning: Causal Inference for Observa-tional and Experimental Data . Springer, New York, 2011.Mark J van der Laan and Sherri Rose.
Targeted Learning in Data Science: Causal Infer-ence for Complex longitudinal Studies . Springer, New York, 2018.39ark J van der Laan and Daniel Rubin. Targeted maximum likelihood learning.
TheInternational Journal of Biostatistics , 2(1), 2006.M.J. van der Laan, E. Polley, and A. Hubbard. Super learner.
Statistical Applications inGenetics & Molecular Biology , 6(25):Article 25, 2007.A. W. van der Vaart.
Asymptotic Statistics . Cambridge University Press, 1998.Aad van der Vaart. Semiparameric statistics.
Lectures on Probability Theory and Statistics ,pages 331–457, 2002.Aad W van der Vaart. On differentiable functionals.
Annals of Statistics , 19:178–204,1991.Aad W van der Vaart and Jon A Wellner.
Weak Convergence and Emprical Processes .Springer-Verlag New York, 1996.Tyler J VanderWeele, Stijn Vansteelandt, and James M Robins. Effect decompositionin the presence of an exposure-induced mediator-outcome confounder.
Epidemiology(Cambridge, Mass.) , 25(2):300, 2014.Stijn Vansteelandt and Rhian M Daniel. Interventional effects for mediation analysis withmultiple mediators.
Epidemiology (Cambridge, Mass.) , 28(2):258, 2017.Stijn Vansteelandt and Tyler J VanderWeele. Natural direct and indirect effects on theexposed: effect decomposition under weaker assumptions.
Biometrics , 68(4):1019–1027, 2012.Stijn Vansteelandt, James Carpenter, and Michael G Kenward. Analysis of incompletedata using inverse probability weighting and doubly robust estimators.
Methodology ,2010.Stijn Vansteelandt, Maarten Bekaert, and Theis Lange. Imputation strategies for the es-timation of natural direct and indirect effects.
Epidemiologic Methods , 1(1):131–158,2012.Marvin N Wright and Andreas Ziegler. Ranger: a fast implementation of random forestsfor high dimensional data in c++ and r. arXiv preprint arXiv:1508.04409 , 2015.Sewall Wright. Correlation and causation.
Journal of agricultural research , 20(7):557–585, 1921. 40ewall Wright. The method of path coefficients.
The annals of mathematical statistics , 5(3):161–215, 1934.Jessica G Young, Miguel A Hern´an, and James M Robins. Identification, estimation andapproximation of risk under interventions that depend on the natural value of treatmentusing observational data.
Epidemiologic methods , 3(1):1–19, 2014.Wenjing Zheng and Mark van der Laan. Longitudinal mediation analysis with time-varying mediators and exposures, with application to survival outcomes.
Journal ofcausal inference , 5(2), 2017.Wenjing Zheng and Mark J van der Laan. Cross-validated targeted minimum-loss-basedestimation. In
Targeted Learning , pages 459–474. Springer, 2011.Wenjing Zheng and Mark J van der Laan. Targeted maximum likelihood estimation ofnatural direct effects.