A Bayesian view of doubly robust causal inference
O. Saarela∗, L. R. Belzile† and D. A. Stephens‡

Abstract
In causal inference confounding may be controlled either through regression adjustment in an outcome model, or through propensity score adjustment or inverse probability of treatment weighting, or both. The latter approaches, which are based on modelling of the treatment assignment mechanism, and their doubly robust extensions, have been difficult to motivate using formal Bayesian arguments; in principle, for likelihood-based inferences, the treatment assignment model can play no part in inferences concerning the expected outcomes if the models are assumed to be correctly specified. On the other hand, forcing dependency between the outcome and treatment assignment models by allowing the former to be misspecified results in loss of the balancing property of the propensity scores and the loss of any double robustness. In this paper, we explain in the framework of misspecified models why doubly robust inferences cannot arise from purely likelihood-based arguments, and demonstrate this through simulations. As an alternative to Bayesian propensity score analysis, we propose a Bayesian posterior predictive approach for constructing doubly robust estimation procedures. Our approach appropriately decouples the outcome and treatment assignment models by incorporating the inverse treatment assignment probabilities in Bayesian causal inferences as importance sampling weights in Monte Carlo integration.

A revised version of this article has been accepted for publication in Biometrika, published by Oxford University Press: Saarela, O., Belzile, L. R. and Stephens, D. A. A Bayesian view of doubly robust causal inference, Biometrika (2016), 103(3): 667–681. doi:10.1093/biomet/asw025. This is the author's original version, dated October 29th, 2015.
1. INTRODUCTION
In causal inference contexts, confounding is most often controlled by one of two approaches: first, by conditioning on the confounders in an outcome regression model; secondly, by modelling the treatment assignment mechanism to obtain so-called propensity score values, and then using these scores to construct strata within the observed sample, or a pseudo-sample from a hypothetical population, within which treatment assignment is not confounded. This pseudo-sample can be obtained via inverse probability of treatment weighting of the original sample, analogously to survey sampling procedures. The outcome regression adjustment method requires correct specification of the regression function in order to obtain consistent inference; this may be achieved in practice using flexible regression strategies and complex functions of typically a large number of covariates. The propensity score adjustment methods focus principally on the specification of the treatment assignment model, which may be similarly flexible or complex. Under either approach, sufficient control of confounding is therefore dependent on possibly unverifiable modelling assumptions. This has motivated the development of doubly robust methods in which both model components are specified, but only one of them needs to be correctly specified to sufficiently control for confounding.

Adjustments that depend on the propensity score using regression or reweighting are not easy to interpret from the Bayesian perspective, since Bayesian inferences are naturally based on modelling of the outcome, with modelling of the treatment assignment playing no role in inference relating to the outcome/treatment relationship (Robins and Ritov, 1997). Gustafson (2012) attempted a Bayesian interpretation as a compromise between a saturated outcome model and a parametric one; however, the treatment assignment model did not feature in this interpretation. Scharfstein et al.
∗ Dalla Lana School of Public Health, University of Toronto, 155 College Street, Toronto, Ontario M5T 3M7, Canada. [email protected]
† École Polytechnique Fédérale de Lausanne, EPFL-SB-MATHAA-STAT, Station 8, CH-1015 Lausanne, Switzerland. [email protected]
‡ Department of Mathematics and Statistics, McGill University, Montreal, Quebec H3A 2K6, Canada. [email protected]

(1999) and Bang and
Robins (2005) pointed out a connection between a doubly robust estimator and a model-based estimator; the two are equivalent if the outcome model in the latter features the so-called clever covariate, a particular function of the propensity score.

Separately from the above developments, there has been a body of work studying Bayesian versions of propensity score adjustment to control for confounding (Hoshino, 2008; McCandless et al., 2009a,b, 2010, 2012; An, 2010; Kaplan and Chen, 2012; Zigler et al., 2013; Chen and Kaplan, 2015; Graham et al., 2015). These approaches require one of two kinds of compromises: they either force a parametrization which makes the outcome and treatment assignment models dependent, thus losing the balancing property of the propensity score, or cut such feedback, in which case the inference procedures are no longer Bayesian. Because of these difficulties, some authors (e.g. Achy-Brou et al., 2010, p. 828) have been content to fix the propensity scores to their best estimates in model-based inferences, without attempting to jointly estimate the two model components. In an alternative approach, Wang et al. (2012) and Zigler and Dominici (2014) have suggested connecting the outcome and treatment assignment models through the prior distribution in order to incorporate the uncertainty in confounder selection. Herein we do not consider model uncertainty, but rather concentrate on inferences with a priori specified outcome and treatment assignment models.

The purpose of this paper is to clarify the theoretical and practical motivations for Bayesian propensity score adjustment, and the relationships between the different methods proposed for this, which have not been fully explored previously. We address these in the context of double adjustment for both the potential confounders and the propensity score, and argue that the problem cannot be properly understood without considering it in the framework of misspecified models. To provide an alternative to Bayesian propensity score adjustment, we propose deriving Bayesian versions of various inverse probability of treatment weighted estimators, including inverse probability of treatment weighted outcome regression and the semi-parametric doubly robust estimator, through posterior predictive expectations, with the weights introduced as importance sampling weights in Monte Carlo integration.
2. PRELIMINARIES
2.1. Notation and assumptions
For simplicity, we consider the case of a single binary treatment, and defer discussion of the longitudinal, multiple exposure and continuous cases to the Discussion. Let the random vectors $X_i$ represent a set of pre-treatment covariate measurements, $Z_i$ a binary treatment allocation indicator, and $Y_i$ an outcome for individual $i$, measured after sufficient time has passed since administering the treatment. We adopt, for convenience, the standard potential outcome (or counterfactual) framework: for individual $i$, the observed outcome is related to the two possible potential outcomes $(Y_i^0, Y_i^1)$ by $Y_i = (1 - Z_i) Y_i^0 + Z_i Y_i^1$. We assume that $X_i$ includes a sufficient set of covariates to control for confounding in the sense that $Z_i \perp\!\!\!\perp (Y_i^0, Y_i^1) \mid X_i$ (cf. ignorable treatment assignment, Rosenbaum and Rubin, 1983, p. 43). The propensity score $e(X_i) \equiv \mathrm{pr}(Z_i = 1 \mid X_i)$ has the balancing property $Z_i \perp\!\!\!\perp X_i \mid e(X_i)$, which also implies that $(Y_i^0, Y_i^1) \perp\!\!\!\perp Z_i \mid e(X_i)$; this scalar score is therefore useful in controlling for confounding.

If the covariate vector $X_i$ is high-dimensional, in practice the task of controlling for confounding often involves some covariate selection; here one can select either the features that predict the outcome, or those that predict the treatment assignment. To represent this, let $S_i$ and $B_i$ represent some a priori selected subsets of all the observed features $X_i$, so that the latter can be partitioned as $X_i = (S_i, R_i)$ or $X_i = (B_i, C_i)$, where possibly $R_i = \emptyset$ and $C_i = \emptyset$. If the selected set of features $S_i$ captures all relevant prognostic information, then $Y_i \perp\!\!\!\perp X_i \mid S_i$ (Hansen, 2008). For the remainder of the paper, we consider the stronger condition (i) $(Y_i^0, Y_i^1) \perp\!\!\!\perp X_i \mid S_i$, which requires that $S_i$ also captures all relevant information about possible effect modification. In this case $S_i$ is sufficient to control for confounding, since from the properties of conditional independence (Dawid, 1979) it follows that $(Y_i^0, Y_i^1) \perp\!\!\!\perp Z_i \mid S_i$. If, on the other hand, the selected set of features $B_i$ has the balancing property (ii) $Z_i \perp\!\!\!\perp X_i \mid B_i$, it is likewise sufficient to control for confounding. We are interested in estimation procedures which are valid when either $(Y_i^0, Y_i^1) \perp\!\!\!\perp X_i \mid S_i$ (but possibly $Z_i \not\perp\!\!\!\perp X_i \mid B_i$) or $Z_i \perp\!\!\!\perp X_i \mid B_i$ (but possibly $(Y_i^0, Y_i^1) \not\perp\!\!\!\perp X_i \mid S_i$).

Our parameter of interest is an average causal contrast such as $E(Y_i^1) - E(Y_i^0) = E_{X_i}\{E(Y_i^1 \mid X_i)\} - E_{X_i}\{E(Y_i^0 \mid X_i)\}$. Interest then lies in the identifiability of the average potential outcomes based on the observed data.
Under the assumptions of Section 2.1, this contrast is identified from the observed data as
$$E(Y_i^1) - E(Y_i^0) = \int_{x_i} \{ E(Y_i \mid Z_i = 1, x_i) - E(Y_i \mid Z_i = 0, x_i) \} \, p(x_i) \,\mathrm{d} x_i.$$
2.2. Bayesian model formulation for outcome regression
Any Bayesian model specification is constructed via de Finetti's representation for exchangeable random sequences $v_i \equiv (x_i, y_i, z_i)$ (see for example Bernardo and Smith, 1994, Chap. 4). The (subjective) joint distribution for observations $v \equiv (v_1, \ldots, v_n)$ may then be represented by
$$p(v) = \int_{\phi, \gamma, \psi} \left\{ \prod_{i=1}^n p(y_i \mid z_i, x_i; \phi)\, p(z_i \mid x_i; \gamma)\, p(x_i; \psi) \right\} \pi(\phi, \gamma, \psi) \,\mathrm{d}\phi \,\mathrm{d}\gamma \,\mathrm{d}\psi, \quad (1)$$
implying the existence of the parametric models and the prior density $\pi(\phi, \gamma, \psi)$. Since the representation theorem is not constructive, that is, does not specify the models implicit in (1), in practice inference about a given finite-dimensional parametrization involves the often implicit assumption that $p(y_i \mid z_i, x_i; \phi_0) = f(y_i \mid z_i, x_i)$, $p(z_i \mid x_i; \gamma_0) = f(z_i \mid x_i)$ and $p(x_i; \psi_0) = f(x_i)$, where $(\phi_0, \gamma_0, \psi_0)$ is the limiting value of the posterior $\pi_n(\phi, \gamma, \psi) \equiv p(\phi, \gamma, \psi \mid v)$ in the sense of e.g. van der Vaart (1998, p. 139), and where the $f$s represent the 'true' limiting (sampling) distributions. We might further assume that the parameters are a priori independent, so that $\pi(\phi, \gamma, \psi) = \pi(\phi)\, \pi(\gamma)\, \pi(\psi)$, in which case it also follows that the posterior factorizes as $\pi_n(\phi, \gamma, \psi) = \pi_n(\phi)\, \pi_n(\gamma)\, \pi_n(\psi)$ (e.g. Gelman et al., 2004, p. 354–355).

The marginal distribution $p(x_i; \psi)$ can in practice be specified nonparametrically and estimated using the empirical covariate distribution, leading to a Bayesian estimator of the average causal contrast,
$$\int_{\phi} \frac{1}{n} \sum_{i=1}^n \{ m(1, x_i; \phi) - m(0, x_i; \phi) \} \, \pi_n(\phi) \,\mathrm{d}\phi, \quad (2)$$
where $\pi_n(\phi) \propto \prod_{i=1}^n p(y_i \mid z_i, x_i; \phi)\, \pi(\phi)$ and $m(z, x; \phi) \equiv \int y\, p(y \mid z, x; \phi) \,\mathrm{d} y$. In Supplementary Appendix 1, we show that (2) can be motivated without the use of potential outcomes notation through posterior predictive expectations for a new observation under a hypothetical completely randomized setting.

The estimator (2) is the Bayesian version of the well-known direct standardization or $g$-formula. We return to it in Section 6, but note here that it does not feature the treatment assignment model; rather, the posterior predictive approach for estimating the marginal causal contrast depends entirely on correct specification of the distribution $p(y_i \mid z_i, x_i; \phi)$, or in a moment-based representation, $m(z, x; \phi)$. Before discussing Bayesian alternatives to (2), we briefly review some commonly used frequentist approaches for combining outcome regression and propensity score adjustment.
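To fix ideas, the following sketch illustrates the plug-in (maximum likelihood) counterpart of the Bayesian estimator (2): fit an outcome regression and standardize its predictions over the empirical covariate distribution. The simulated data-generating mechanism and all variable names are our own illustration, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated observational data: x confounds both treatment z and outcome y;
# the true marginal contrast E(Y^1) - E(Y^0) equals 2 by construction.
x = rng.normal(size=n)
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 2.0 * z + x + rng.normal(size=n)

# Outcome regression E(Y | Z, X) = phi0 + phi1*z + phi2*x, fitted by least squares.
design = np.column_stack([np.ones(n), z, x])
phi, *_ = np.linalg.lstsq(design, y, rcond=None)

# Direct standardization (g-formula): average the model predictions over the
# empirical covariate distribution with z set to 1 and to 0.
m1 = phi[0] + phi[1] * 1 + phi[2] * x
m0 = phi[0] + phi[1] * 0 + phi[2] * x
ate_g = np.mean(m1 - m0)
```

In this additive model the standardized contrast reduces exactly to the coefficient on $z$; with a model that includes effect modification the averaging over covariates would matter.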
3. FREQUENTIST APPROACHES FOR COMBINING OUTCOME REGRESSION AND PROPENSITY SCORE ADJUSTMENT
3.1. Including propensity scores into outcome regression
Because of the balancing property of the propensity score, it is tempting to specify a propensity score $e(b_i; \gamma) \equiv \mathrm{pr}(Z_i = 1 \mid b_i; \gamma)$ and use a statistical model such as $p\{y_i \mid z_i, e(b_i), s_i; \phi\}$, in the hope that, if the prognostic model is misspecified, adjusting for the propensity score would still sufficiently control for any residual confounding. For simplicity, we take the parameters $\phi$ to specify also the functional dependence between the propensity score and the outcome; to model this dependency, it is advisable to use flexible formulations such as splines (e.g. Zhang and Little, 2009). Using such an outcome model, the marginal causal contrast would then be estimated by
$$\frac{1}{n} \sum_{i=1}^n \left[ m\{1, e(b_i; \hat\gamma), s_i; \hat\phi\} - m\{0, e(b_i; \hat\gamma), s_i; \hat\phi\} \right], \quad (3)$$
where $m\{z, e(b; \gamma), s; \phi\} \equiv \int y\, p\{y \mid z, e(b; \gamma), s; \phi\} \,\mathrm{d} y$, and where $\hat\phi$ and $\hat\gamma$ are maximum likelihood estimators for the parameters in the outcome regression and treatment assignment model, respectively. The motivation for such a double adjustment is that it is sufficient to control for confounding if either
condition (i) or (ii) of Section 2.1 applies. We summarize this property in the following theorem (a proof is given in Supplementary Appendix 2; in the notation, all convergences are in probability unless otherwise stated).

THEOREM 1. Estimator (3) is consistent if the outcome model is correctly specified in the sense that $m\{z, e(b; \gamma_0), s; \phi_0\} = \int y\, f(y \mid z, e(b), s) \,\mathrm{d} y$, the parameters in this can be consistently estimated so that $\hat\phi \to \phi_0$, the treatment assignment model is correctly specified in the sense that $p(z_i \mid b_i; \gamma_0) = f(z_i \mid b_i)$ and $\hat\gamma \to \gamma_0$, and either (i) holds true, or (ii) holds true.

The estimator (3) may be considered 'doubly robust' in terms of the covariate selection in the sense that only one of the sets $S_i$ and $B_i$ needs to be correctly specified, although it still always relies on a correct parametric specification of the model for the expected outcome, conditional on $\{Z_i, e(B_i), S_i\}$.
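A small numerical sketch of the double adjustment behind estimator (3): the outcome model below deliberately omits the confounder and adjusts only for an estimated propensity score, relying on condition (ii). The simulation design, the Newton-Raphson logistic fit, and all names are our own illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

x = rng.normal(size=n)
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 2.0 * z + x + rng.normal(size=n)   # true marginal contrast = 2

# Step 1: estimate the propensity score e(x; gamma) by logistic regression,
# written out as a few Newton-Raphson iterations to stay self-contained.
B = np.column_stack([np.ones(n), x])
gamma = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-B @ gamma))
    info = B.T @ (B * (p * (1 - p))[:, None])       # Fisher information
    gamma += np.linalg.solve(info, B.T @ (z - p))   # Newton step
e_hat = 1.0 / (1.0 + np.exp(-B @ gamma))

# Step 2: outcome model that deliberately OMITS the confounder x and adjusts
# for the estimated propensity score instead: m(z, e) = phi0 + phi1*z + phi2*e.
D = np.column_stack([np.ones(n), z, e_hat])
phi, *_ = np.linalg.lstsq(D, y, rcond=None)

# Estimator (3): standardize the double-adjusted predictions; in this
# additive specification the contrast reduces to the coefficient on z.
ate_ps = np.mean((phi[0] + phi[1] + phi[2] * e_hat)
                 - (phi[0] + phi[2] * e_hat))
```

Because the propensity score enters the regression as a covariate, the residual confounding through $x$ is removed even though $x$ itself is not in the outcome model, consistent with Theorem 1 under condition (ii).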
3.2. The clever covariate and augmented outcome regression
The estimator discussed in the previous section did not specify which function of the propensity score $e(B_i)$ should be added to the regression model. Scharfstein et al. (1999, p. 1141–1142) and Bang and Robins (2005, p. 964–965) drew a connection between propensity score regression adjustment and doubly robust estimators of the form
$$\frac{1}{n} \sum_{i=1}^n \frac{y_i z_i - \{z_i - e(b_i; \hat\gamma)\}\, m(1, s_i; \hat\phi)}{e(b_i; \hat\gamma)} - \frac{1}{n} \sum_{i=1}^n \frac{y_i (1 - z_i) - [(1 - z_i) - \{1 - e(b_i; \hat\gamma)\}]\, m(0, s_i; \hat\phi)}{1 - e(b_i; \hat\gamma)},$$
which can be equivalently represented as
$$\frac{1}{n} \sum_{i=1}^n \left\{ \frac{z_i}{e(b_i; \hat\gamma)} - \frac{1 - z_i}{1 - e(b_i; \hat\gamma)} \right\} \left\{ y_i - m(z_i, s_i; \hat\phi) \right\} + \frac{1}{n} \sum_{i=1}^n \left\{ m(1, s_i; \hat\phi) - m(0, s_i; \hat\phi) \right\}. \quad (4)$$
On considering the score equation derived from a regression of $Y_i$ on $Z_i$ and $S_i$ with mean function $m(z, s; \phi)$, this form suggests incorporating the derived covariate $c(z_i, b_i) = z_i / e(b_i) - (1 - z_i)/\{1 - e(b_i)\}$ (termed the 'clever covariate' by Rose and van der Laan, 2008, p. 8) additively into the outcome regression, that is, for example
$$m(z, s; \phi) = \phi_0 + \phi_1 z + \phi_2^\top s + \phi_3\, c(z, b). \quad (5)$$
The first term in (4) is then zero through the maximum likelihood score equation, leaving only the last term, which is the model-based estimator of the marginal treatment effect. Thus with the clever covariate in the outcome model, the doubly robust estimator is equivalent to the model-based estimator. In the special case of model (5), this becomes
$$\frac{1}{n} \sum_{i=1}^n \left\{ m(1, x_i; \hat\phi) - m(0, x_i; \hat\phi) \right\} = \hat\phi_1 + \hat\phi_3\, \frac{1}{n} \sum_{i=1}^n \left\{ \frac{1}{e(b_i; \hat\gamma)} + \frac{1}{1 - e(b_i; \hat\gamma)} \right\}.$$
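The equivalence between (4) and the clever-covariate regression (5) can be checked numerically: with least-squares fitting, the normal equations force the first term of (4) to vanish. The sketch below uses simulated data with a known propensity score for simplicity; the setup is our own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))          # true propensity score, taken as known here
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)  # true marginal contrast = 2

# Clever covariate c(z, x) = z/e(x) - (1 - z)/{1 - e(x)}.
c = z / e - (1 - z) / (1 - e)

# Outcome regression (5): m(z, x; phi) = phi0 + phi1*z + phi2*x + phi3*c(z, x).
D = np.column_stack([np.ones(n), z, x, c])
phi, *_ = np.linalg.lstsq(D, y, rcond=None)
m_obs = D @ phi                                          # m(z_i, x_i; phi-hat)
m1 = phi[0] + phi[1] + phi[2] * x + phi[3] / e           # c(1, x) = 1/e
m0 = phi[0] + phi[2] * x - phi[3] / (1.0 - e)            # c(0, x) = -1/(1-e)

# Model-based (plug-in) estimator and the doubly robust estimator (4).
ate_model = np.mean(m1 - m0)
ate_aipw = np.mean(c * (y - m_obs)) + np.mean(m1 - m0)

# Least-squares orthogonality makes sum(c * residual) = 0, so the two coincide.
```

The first term of (4) is the mean of `c * (y - m_obs)`, which is zero up to floating-point error because `c` is a regressor in the least-squares fit.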
A potential drawback of using this covariate is that it may lead to extreme variability for the resulting mean difference estimator, even compared to inverse probability of treatment weighted estimators of the form
$$\frac{1}{n} \sum_{i=1}^n \left\{ \frac{y_i z_i}{e(b_i; \hat\gamma)} - \frac{y_i (1 - z_i)}{1 - e(b_i; \hat\gamma)} \right\}. \quad (6)$$
To see why this is the case, note that the distribution of $c(z_i, b_i)$ can in itself be very skewed to the right, but this becomes even more pronounced in the model-based estimator, where the clever covariate has to be evaluated at both $c(1, b_i)$ and $c(0, b_i)$ for each $i = 1, \ldots, n$. In contrast, the inverse probability of treatment weighted estimator (6) involves only the probabilities of the treatments that were actually assigned.

Eq. (4) is doubly robust in a stronger, semi-parametric, sense than (3); it does not require correct specification of the outcome model, if the treatment assignment model is correctly specified. The approximate Bayesian doubly robust approach proposed by Graham et al. (2015) involved replacing $m(z, x; \phi)$ …
3.3. Inverse probability of treatment weighted outcome regression
Yet another estimator for the marginal causal contrast is
$$\frac{1}{n} \sum_{i=1}^n \left\{ E(Y_i^1 \mid s_i; \hat\phi) - E(Y_i^0 \mid s_i; \hat\phi) \right\}, \quad (7)$$
where the parameters $\phi$ in the model for the potential outcomes $(Y_i^0, Y_i^1)$ are estimated using an inverse probability of treatment weighted estimating function $l(\phi) = \sum_{i=1}^n l_i(\phi)$, where
$$l_i(\phi) = \sum_{a=0}^1 \frac{\mathbf{1}\{z_i = a\}\, \log p(y_i^a \mid s_i; \phi)}{\mathrm{pr}(Z_i = a \mid b_i)}.$$
The corresponding estimating equation is $u(\phi) = \sum_{i=1}^n u_i(\phi) = 0$, where the pseudo-score function is
$$u_i(\phi) = \sum_{a=0}^1 \frac{\mathbf{1}\{z_i = a\}}{\mathrm{pr}(Z_i = a \mid b_i)} \frac{\partial}{\partial \phi} \log p(y_i^a \mid s_i; \phi). \quad (8)$$
Here the treatment assignment probabilities $\mathrm{pr}(Z_i = a \mid b_i)$ would in practice be replaced with estimates $\mathrm{pr}(Z_i = a \mid b_i; \hat\gamma)$. To motivate the use of (8) for the parameter estimation, we present the following theorem (a proof is given in Supplementary Appendix 2).

THEOREM 2. The estimating equation $u(\phi) = 0$ is unbiased if the model for the potential outcomes is correctly specified in the sense that $p(y_i^a \mid s_i; \phi_0) = f(y_i^a \mid s_i)$, and either (i) or (ii) holds true.

If we further assume the consistency of the estimator for $\phi$, as well as consistency of $\hat\gamma$ when the weights are correctly specified, it also follows that (7) consistently estimates the marginal causal contrast. In Section 6.1 we demonstrate that an estimator of the form (7) can also be motivated from Bayesian arguments.
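The weighted estimating equation (8) can be illustrated with a deliberately crude potential-outcome model, $E(Y^a \mid s) = \phi_a$ with $s$ empty, so that consistency rests entirely on the weights (condition (ii)). For a normal-mean model, solving (8) gives inverse-probability-weighted means per arm. The simulation and names are our own; the true assignment probabilities are used in place of estimates for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))           # treatment assignment probabilities, known here
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)   # true marginal contrast = 2

# Solving the weighted pseudo-score (8) for a normal-mean model per arm,
# with weights 1{z_i = a} / pr(Z_i = a | b_i), yields weighted arm means.
w1 = z / e
w0 = (1 - z) / (1 - e)
phi1 = np.sum(w1 * y) / np.sum(w1)     # weighted estimate of E(Y^1)
phi0 = np.sum(w0 * y) / np.sum(w0)     # weighted estimate of E(Y^0)
ate_wreg = phi1 - phi0                 # estimator (7) for this model

# The unweighted version of the same misspecified model is confounded:
ate_naive = y[z == 1].mean() - y[z == 0].mean()
```

The weighting corrects the confounded naive comparison even though the outcome model ignores the confounder entirely.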
4. TWO-STEP ESTIMATION WHEN THE PROPENSITY SCORE IS UNKNOWN
In observational settings the function $e(B_i)$ is unknown and has to be estimated. When using an estimator of the form (3), a central question from a Bayesian perspective is then how the uncertainty in the estimation of the parameters $\gamma$ is incorporated in the inference of the marginal causal contrast. A 'Bayesian' approach could be motivated by writing the posterior predictive expectation as
$$\int_{\psi, \gamma, \phi} \int_{x_i} m\{a, e(b_i; \gamma), s_i; \phi\}\, p(x_i; \psi)\, \pi_n(\phi \mid \gamma)\, \pi_n(\gamma)\, \pi_n(\psi) \,\mathrm{d} x_i \,\mathrm{d}\phi \,\mathrm{d}\gamma \,\mathrm{d}\psi, \quad (9)$$
where
$$\pi_n(\phi \mid \gamma) \propto \prod_{i=1}^n p\{y_i \mid z_i, e(b_i; \gamma), s_i; \phi\}\, \pi(\phi) \quad \text{and} \quad \pi_n(\gamma) \propto \prod_{i=1}^n p(z_i \mid b_i; \gamma)\, \pi(\gamma).$$
Integrals of the form (9) could be evaluated by Monte Carlo integration, by forward sampling first from $\pi_n(\gamma)$ and, given the current value $\gamma$, from the conditional posterior $\pi_n(\phi \mid \gamma)$. However, a concern related to such an approach is that the product of the posterior distributions $\pi_n(\phi \mid \gamma)$ and $\pi_n(\gamma)$ in (9) does not necessarily correspond to any well defined joint posterior, except in the case when the outcome model is correctly specified in the sense (i), in which case it does not depend on the propensity score. In contrast, for the correctly specified models in (1) we indeed have that $p(\phi \mid v)\, p(\gamma \mid x, z) = p(\phi, \gamma \mid v)$. As a result, the two-step approach outlined above is not properly Bayesian, and would have to be evaluated on its frequency-based properties.

To give an example of a situation where the two-step Bayesian approach does not result in correct frequency-based inferences, we can consider the special case of the outcome model $m\{z, e(b; \gamma), s; \phi\} = \phi_0 + \phi_1 z + \phi_2^\top s + \phi_3^\top g\{e(b; \gamma)\}$, where $g$ is, for example, some appropriate spline basis transformation of the propensity score. With this model the estimator based on (9) for the average causal contrast $E(Y_i^1) - E(Y_i^0)$ reduces to $\int_{\phi, \gamma} \phi_1\, \pi_n(\phi \mid \gamma)\, \pi_n(\gamma) \,\mathrm{d}\phi \,\mathrm{d}\gamma$, that is, to an estimator of the posterior mean $E_{\Gamma \mid X, Z}\{E(\Phi_1 \mid v; \gamma)\}$. This estimator in turn can be approximated with $\sum_{j=1}^m \hat\phi_1(\gamma^{(j)})/m$, where $(\gamma^{(1)}, \ldots, \gamma^{(m)})$ is a Monte Carlo sample from $\pi_n(\gamma)$ (cf. Kaplan and Chen, 2012, p. 589). We note first that this estimator has the same asymptotic distribution as $\hat\phi_1(\hat\gamma)$, where the treatment assignment model parameters have been fixed to their maximum likelihood estimates (see Supplementary Appendix 3). In Supplementary Appendix 3 we further show that $\mathrm{avar}\{\hat\phi_1(\hat\gamma)\} \le \mathrm{avar}\{\hat\phi_1(\gamma_0)\}$, where $\hat\phi_1(\gamma_0)$ is the estimator given the 'true' propensity scores. Thus, with the propensity score adjusted outcome model specification, a variance adjustment due to estimating the propensity scores should reduce the asymptotic variance of the resulting treatment effect estimator compared to a hypothetical situation where the true propensity scores are known (cf. Henmi and Eguchi, 2004). In contrast, Kaplan and Chen (2012, p. 592) and Graham et al. (2015, p. 11) propose variance estimators based on the variance decomposition formula
$$\mathrm{var}(\Phi \mid v) = E_{\Gamma \mid X, Z}\{\mathrm{var}(\Phi \mid v; \gamma)\} + \mathrm{var}_{\Gamma \mid X, Z}\{E(\Phi \mid v; \gamma)\}, \quad (10)$$
which appears to add a further variance component. An explanation for the discrepancy is that with the correctly specified models in the representation (1), we have that $p(\phi \mid v; \gamma) = p(\phi \mid v)$, and the second variance component becomes zero. On the other hand, if the models are misspecified, the product form likelihood in the representation (1) does not apply in the first place.
This illustrates the difficulty in applying Bayesian procedures in the context of incompatible models. Even though this is routinely done in the context of multiple imputation (e.g. Rubin, 1996), and often produces reasonable results, in the present context there is little motivation to use an approach which introduces an additional variance component to the posterior variance, given that estimation of the propensity scores should reduce the variance of the treatment effect estimator. We further discuss this discrepancy in the following section.
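The two-step Monte Carlo approximation $\sum_{j=1}^m \hat\phi_1(\gamma^{(j)})/m$ and its closeness to the plug-in estimator $\hat\phi_1(\hat\gamma)$ can be sketched as follows. We use a normal approximation to $\pi_n(\gamma)$ centred at the logistic maximum likelihood estimate (a simplification of ours, in place of exact posterior sampling), and a linear adjustment for the propensity score in the outcome model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3000

x = rng.normal(size=n)
z = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 2.0 * z + x + rng.normal(size=n)

B = np.column_stack([np.ones(n), x])

# Logistic MLE gamma-hat via Newton-Raphson, with the inverse Fisher
# information as an approximate posterior covariance for gamma.
gamma = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-B @ gamma))
    info = B.T @ (B * (p * (1 - p))[:, None])
    gamma += np.linalg.solve(info, B.T @ (z - p))
cov_gamma = np.linalg.inv(info)

def phi1_given(g):
    """Coefficient on z in the outcome model phi0 + phi1*z + phi2*e(x; g)."""
    e = 1.0 / (1.0 + np.exp(-B @ g))
    D = np.column_stack([np.ones(n), z, e])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    return coef[1]

# Two-step 'posterior mean': average phi1-hat over draws gamma^(j) from the
# normal approximation to pi_n(gamma).
draws = rng.multivariate_normal(gamma, cov_gamma, size=200)
phi1_mc = np.mean([phi1_given(g) for g in draws])
phi1_plugin = phi1_given(gamma)   # propensity parameters fixed at the MLE
```

Consistent with the asymptotic argument above, averaging over the posterior draws of $\gamma$ changes the point estimate very little relative to simply plugging in the maximum likelihood estimate.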
5. JOINT ESTIMATION OF OUTCOME AND TREATMENT ASSIGNMENT MODELS
Even though the outcome $Y_i$ can obviously be predictive of the individual treatment assignment $Z_i$, the outcomes are not informative of the treatment assignment mechanism (a proof is given in Supplementary Appendix 2):

THEOREM 3. If the outcome model is correctly specified, the outcomes are non-informative of the parameters characterizing the treatment assignment process.
In such a case the treatment assignment model plays no part in the inferences, since the corresponding posterior predictive estimator is (2). However, the Bayesian propensity score approach proposed by McCandless et al. (2009a) specifies a parametrization making the outcome and treatment assignment models dependent and estimates the parameters jointly. More recently, Zigler et al. (2013) suggested that a similar approach could be used to obtain a Bayesian analogue to doubly robust inferences. Such an approach can be understood by assuming that there exists a de Finetti parametrization $(\phi^*, \gamma^*)$ for which
$$p(v) = \int_{\phi^*, \gamma^*, \psi} \left\{ \prod_{i=1}^n p(y_i, z_i \mid x_i; \phi^*, \gamma^*)\, p(x_i; \psi) \right\} \pi(\phi^*)\, \pi(\gamma^*)\, \pi(\psi) \,\mathrm{d}\phi^* \,\mathrm{d}\gamma^* \,\mathrm{d}\psi,$$
where $p(y_i, z_i \mid x_i; \phi^*, \gamma^*) = p\{y_i \mid z_i, e(b_i; \gamma^*), s_i; \phi^*\}\, p(z_i \mid b_i; \gamma^*)$. Compared to $(\phi, \gamma)$ in (1), neither $\phi^*$ nor $\gamma^*$ retains the original interpretation, but now there is a well defined joint posterior distribution
$$\pi_n(\phi^*, \gamma^*) \propto \prod_{i=1}^n \left[ p\{y_i \mid z_i, e(b_i; \gamma^*), s_i; \phi^*\}\, p(z_i \mid b_i; \gamma^*) \right] \pi(\phi^*)\, \pi(\gamma^*).$$
Inferences could now be based on the posterior predictive expectations
$$\int_{\phi^*, \gamma^*, \psi} \int_{x_i} m\{a, e(b_i; \gamma^*), s_i; \phi^*\}\, p(x_i; \psi)\, \pi_n(\phi^*, \gamma^*)\, \pi_n(\psi) \,\mathrm{d} x_i \,\mathrm{d}\phi^* \,\mathrm{d}\gamma^* \,\mathrm{d}\psi. \quad (11)$$
At first sight, (11) would seem more natural than (9), since the specification (11) does not make use of incompatible models. However, now the quantities $e(b_i; \gamma^*)$ do not possess the balancing properties of the propensity scores.
… $\pi_n(\gamma)$ and $\pi_n(\phi \mid \gamma)$ to approximate the joint posterior of $(\phi^*, \gamma^*)$. However, as discussed in the previous section, these posteriors are incompatible and generally such a sampling procedure is not guaranteed to converge to any well defined joint distribution.
In fact, if the conditional posteriors can be sampled directly, or if the second sampling step is allowed to converge to the corresponding conditional distribution, the inferences based on the formulations (9) and (11) will be equivalent.

To sum up, trying to recover fully probabilistic inferences through sampling from a joint posterior of the outcome and treatment assignment model parameters loses the balancing property of the propensity scores, and consequently the properties of the resulting estimator would be difficult to establish. On the other hand, cutting the feedback in an attempt to recover the balancing property would mean that the inferences are no longer based on well-defined posterior distributions. Thus, in the following section, following the approach outlined in Saarela et al. (2015b), we formulate alternative Bayesian estimators that are not based on Bayesian propensity score adjustment.
6. POSTERIOR PREDICTIVE INFERENCES WITH IMPORTANCE SAMPLING
6.1. Inverse probability of treatment weighted outcome regression
It has been recognized by various authors (e.g. Røysland, 2011; Chakraborty and Moodie, 2013) that inverse probability of treatment weighting can be motivated through a change of probability measure, or equivalently, importance sampling. However, as far as we know, before Saarela et al. (2015b) this approach had not been used to formulate Bayesian causal inferences. Here we argue that it can be used to resolve the paradoxes discussed in Sections 4 and 5. We follow the posterior predictive reasoning of Supplementary Appendix 1, but rather than trying to directly predict a new observation under the experimental setting $E$, we consider first the task of parameter estimation under this regime. For this purpose, a Bayes estimator for the outcome model parameters can be constructed by maximizing the expected utility $E_E\{l(\phi; V_i) \mid v\}$ with respect to $\phi$, where the log-likelihood $l(\phi; v_i) \equiv \log p(y_i \mid z_i, s_i; \phi)$ takes the role of a parametric utility function, and the expectation is over a predicted new observation $v_i = (y_i, z_i, x_i)$, $i \notin \{1, \ldots, n\}$.

Let further $\xi$ be a set of parameters characterizing the entire data-generating mechanism under the observational regime $O$. We can further write the expectation as
$$E_E\{l(\phi; V_i) \mid v\} = E_{\Xi \mid V}[E_E\{l(\phi; V_i) \mid v; \xi\}],$$
where, following Walker (2010, p. 26–27), we can consider the lower-dimensional decision of maximizing the expected utility $E_E\{l(\phi; V_i) \mid v; \xi\}$ with respect to $\phi$ conditional on $\xi$. With a known regime $E$ and the stability assumption discussed in Supplementary Appendix 1, $\arg\max_\phi E_E\{l(\phi; V_i) \mid v; \xi\}$ is a deterministic function of $\xi$. Thus, the uncertainty represented by the posterior distribution $p_O(\xi \mid v)$ then also reflects the uncertainty about $\phi$, providing the means to construct a posterior distribution for $\phi$.
This proceeds as follows; the inner expectation can be written as
$$E_E\{l(\phi; V_i) \mid v; \xi\} = \int_{v_i} l(\phi; v_i)\, p_E(v_i \mid v; \xi)\, dv_i = \int_{v_i} l(\phi; v_i)\, \frac{p_E(v_i \mid v; \xi)}{p_O(v_i \mid v; \xi)}\, p_O(v_i \mid v; \xi)\, dv_i$$
$$= \int_{v_i} l(\phi; v_i)\, \frac{p_E(y_i \mid z_i, x_i, v; \xi)\, p_E(z_i)\, p_E(x_i \mid v; \xi)}{p_O(y_i \mid z_i, x_i, v; \xi)\, p_O(z_i \mid x_i, v; \xi)\, p_O(x_i \mid v; \xi)}\, p_O(v_i \mid v; \xi)\, dv_i = \int_{v_i} l(\phi; v_i)\, \frac{p_E(z_i)}{p_O(z_i \mid x_i, v; \xi)}\, p_O(v_i \mid v; \xi)\, dv_i,$$
where in the last equality we made use of the stability assumption of Supplementary Appendix 1. In the last form we can replace the predictive distribution under $O$ with the Bayesian bootstrap specification $p_O(v_i \mid v; \xi) = \sum_{k=1}^n \xi_k \delta_{v_k}(v_i)$, where $\xi \equiv (\xi_1, \ldots, \xi_n)$ and $\Xi \mid v \sim \textrm{Dirichlet}(1, \ldots, 1)$, as in the weighted likelihood bootstrap of Newton and Raftery (1994). Denoting $w_i(\xi) \equiv p_E(z_i)/p_O(z_i \mid x_i, v; \xi)$,
the expected utility now becomes
$$E_E\{l(\phi; V_i) \mid v; \xi\} = \int_{v_i} l(\phi; v_i)\, w_i(\xi) \sum_{k=1}^n \xi_k \delta_{v_k}(v_i)\, dv_i = \sum_{k=1}^n w_k(\xi)\, \xi_k\, l(\phi; v_k), \quad (12)$$
that is, a weighted log-likelihood, motivating the estimator
$$\hat\phi(\xi) \equiv \arg\max_\phi \sum_{k=1}^n w_k(\xi)\, \xi_k\, l(\phi; v_k).$$
An approximate posterior distribution for $\phi$ under $E$ can now be constructed by repeatedly sampling the weight vectors from $\Xi \mid v \sim \textrm{Dirichlet}(1, \ldots, 1)$ and recalculating $\hat\phi(\xi)$ at each realization. This approach of creating a mapping between the non-parametric specification and a parametrization relevant to the inferences is analogous to Chamberlain and Imbens (2003), but adds the importance sampling weights to the Dirichlet weights in order to make inferences across the observational and experimental regimes.

The weights $w_i(\xi)$ function as importance sampling weights in Monte Carlo integration; they add variability to the estimation, but in the present context provide some protection against misspecification of the outcome model, in the sense of Section 3.3. The above did not yet address how to calculate these weights; in principle they are fully determined by the current realization of $\xi$ under the non-parametric specification, but in practice parametric model specifications are needed for smoothing purposes, and we need a way to link $\xi$ and the treatment assignment model parameters $\gamma$. For this purpose $\gamma$ itself can be estimated through the weighted likelihood bootstrap, since this readily gives the deterministic function linking the two parametrizations; thus in (12) we choose $w_i(\xi) = p_E(z_i)/p_O\{z_i \mid b_i; \hat\gamma(\xi)\}$, where $\hat\gamma(\xi) \equiv \arg\max_\gamma \sum_{k=1}^n \xi_k \log p(z_k \mid b_k; \gamma)$.
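As a numerical illustration of the change of measure underlying (12), consider a toy design of our own (not from the paper) in which the true treatment assignment probabilities are assumed known; the weights $w_i = p_E(z_i)/p_O(z_i \mid x_i)$ transport an observational-regime average to the randomized regime $E$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)                      # a confounder
p_obs = 1 / (1 + np.exp(-x))                # p_O(Z=1 | x), assumed known here
z = rng.binomial(1, p_obs)
y = z + x + rng.normal(size=n)              # true E(Y^1) = 1

# Regime E: completely randomized with pr_E(Z=1) = 1/2
p_E = 0.5
w = p_E / np.where(z == 1, p_obs, 1 - p_obs)   # w_i = p_E(z_i) / p_O(z_i | x_i)

naive1 = y[z == 1].mean()                   # confounded treated-arm mean
mu1 = np.mean(w * y * (z == 1)) / p_E       # importance sampling estimate of E(Y^1)
```

The weighted average recovers the randomized-regime mean that the naive treated-arm average overstates, because treated units in this design tend to have larger $x$.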
The probabilities $p_E(z_i)$ are given by the chosen regime $E$ that is the object of inference; in practice the estimation is most efficient when we choose the target regime to be as close as possible to the observed regime $O$; this can be achieved by fixing $p_E(z_i)$ to the marginal treatment assignment probabilities under $O$, which would result in the usual kind of stabilized inverse probability of treatment weights used in marginal structural modelling (Robins et al., 2000; Hernán et al., 2001; Cole and Hernán, 2008).

The marginal causal contrast may now be estimated through the expectations
$$E_{\Xi \mid V}\{E_E(Y_i \mid Z_i = a, v; \xi)\} = \int_\xi \int_{s_i} m\{a, s_i; \hat\phi(\xi)\} \sum_{k=1}^n \xi_k \delta_{s_k}(s_i)\, p_O(\xi \mid v)\, ds_i\, d\xi = \int_\xi \sum_{k=1}^n \xi_k\, m\{a, s_k; \hat\phi(\xi)\}\, p_O(\xi \mid v)\, d\xi, \quad (13)$$
where we used the non-parametric specification $p(s_i \mid v; \xi) = \sum_{k=1}^n \xi_k \delta_{s_k}(s_i)$, and where again $p_O(\xi \mid v)$ is replaced with the uniform Dirichlet distribution. Since all uncertainty is contained in the parameter vector $\xi$, a posterior distribution for the predictive mean or mean difference can be constructed as before through resampling. The point estimator given by (13) is the direct Bayesian analogue of (7), where the outcome model was estimated using inverse probability of treatment weighted regression. In fact, if we fix $\xi_k = 1/n$, $k = 1, \ldots, n$, instead of considering these as unknown parameters, the two estimators are equivalent. Thus, we conjecture that the estimator given by (13) has a similar 'double robustness' property as (7). We demonstrate this through simulations in Section 7, but before that, we show how the semi-parametric doubly robust estimator (4) can be motivated as a posterior predictive expectation.
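To make the procedure concrete, the following sketch (our own toy design and variable names, with a single confounder and linear-logistic working models, not the simulation of Section 7) draws Dirichlet weight vectors $\xi$, computes $\hat\gamma(\xi)$ by a weighted logistic fit, forms the stabilized weights $w_k(\xi)$, obtains $\hat\phi(\xi)$ by weighted least squares, and evaluates the contrast in (13) at each draw:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1500
s = rng.normal(size=n)                       # a single confounder
z = rng.binomial(1, 1 / (1 + np.exp(-s)))    # treatment depends on s
y = z + s + rng.normal(size=n)               # true marginal contrast = 1

def weighted_logistic(z, X, w, iters=15):
    # Newton iterations maximizing sum_k w_k log p(z_k | x_k; gamma)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ beta)))
        H = (X * (w * p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(H, X.T @ (w * (z - p)))
    return beta

def weighted_ls(y, X, w):
    # argmin_b sum_k w_k (y_k - x_k' b)^2 via square-root weights
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

Xt = np.column_stack([np.ones(n), s])        # treatment model design (b_i)
Xo = np.column_stack([np.ones(n), z, s])     # outcome model design (z_i, s_i)
draws = []
for _ in range(100):                         # Dirichlet resampling
    xi = rng.dirichlet(np.ones(n))
    gam = weighted_logistic(z, Xt, xi)       # hat-gamma(xi)
    e = 1 / (1 + np.exp(-(Xt @ gam)))
    p_E = np.where(z == 1, z.mean(), 1 - z.mean())   # stabilized pr_E(z_i)
    w = p_E / np.where(z == 1, e, 1 - e)     # w_i(xi)
    phi = weighted_ls(y, Xo, xi * w)         # hat-phi(xi), maximizing (12)
    # contrast (13): sum_k xi_k {m(1, s_k) - m(0, s_k)} = phi_1 for this model
    draws.append(phi[1])
draws = np.array(draws)
```

The spread of `draws` then approximates the posterior uncertainty about the marginal contrast under the regime $E$.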
6.2. Doubly robust estimation
In Supplementary Appendix 4 we show that under the non-parametric specification in terms of $\xi$, the posterior predictive causal contrast may be written as
$$E_E(Y_i \mid Z_i = 1, v; \xi) - E_E(Y_i \mid Z_i = 0, v; \xi) = \sum_{k=1}^n \xi_k \{y_k - m(z_k, x_k; \xi)\} \left\{ \frac{z_k}{\textrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)} - \frac{1 - z_k}{\textrm{pr}_O(Z_k = 0 \mid x_k, v; \xi)} \right\} + \sum_{k=1}^n \xi_k \{m(1, x_k; \xi) - m(0, x_k; \xi)\}. \quad (14)$$
In practice, $m(z_k, x_k; \xi)$ and $\textrm{pr}_O(Z_k = a \mid x_k, v; \xi)$ would have to be replaced with the parametric versions $m\{z_k, s_k; \hat\phi(\xi)\}$ and $\textrm{pr}_O\{Z_k = a \mid b_k; \hat\gamma(\xi)\}$, estimated using the weighted likelihood bootstrap, as in the previous section. It is straightforward to see that if the outcome model is correctly specified, the expression (14) reduces to the model-based estimator (the second additive term). This again reflects the fact that if we believe in our outcome model, the treatment assignment model does not play a part in the inferences. However, a Bayesian might want to use an estimator of the form (14) if restricted by two parametric constraints, in terms of $\phi$ and $\gamma$, without knowing which one of these is correct. If either one of the parametric specifications is correct, (14) will still reduce to the posterior predictive mean difference, the natural Bayesian estimator. We summarize this property in the following theorem (a proof is given in Supplementary Appendix 2).

THEOREM 4. The estimator obtained by substituting the parametric specifications $m\{z_k, s_k; \hat\phi(\xi)\}$ and $\textrm{pr}_O\{Z_k = a \mid b_k; \hat\gamma(\xi)\}$ into expression (14) is equivalent to the posterior predictive mean difference if either one of these models is correctly specified.
A posterior distribution for the mean difference can be generated as in the previous section throughresampling of the parameter vectors ξ and recalculating (14) at each realization; we will illustrate this inthe following section.
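As an illustration of this resampling scheme (again a toy design of our own, not the Section 7 simulation), the sketch below recalculates (14) at each Dirichlet realization with a deliberately misspecified outcome model (omitting the confounder) and a correctly specified treatment assignment model; for simplicity the outcome model is here fitted with the Dirichlet weights only:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 3000
x = rng.normal(size=n)                       # confounder
z = rng.binomial(1, 1 / (1 + np.exp(-x)))    # correct treatment model: logistic in x
y = z + x + rng.normal(size=n)               # true contrast = 1

def weighted_logistic(z, X, w, iters=15):
    # Newton iterations maximizing sum_k w_k log p(z_k | x_k; gamma)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ beta)))
        H = (X * (w * p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(H, X.T @ (w * (z - p)))
    return beta

Xt = np.column_stack([np.ones(n), x])        # treatment model design
Xo = np.column_stack([np.ones(n), z])        # outcome model omitting x: misspecified
draws = []
for _ in range(100):
    xi = rng.dirichlet(np.ones(n))
    gam = weighted_logistic(z, Xt, xi)
    e = 1 / (1 + np.exp(-(Xt @ gam)))        # pr_O{Z=1 | b; hat-gamma(xi)}
    sw = np.sqrt(xi)
    phi = np.linalg.lstsq(Xo * sw[:, None], y * sw, rcond=None)[0]
    m_obs = Xo @ phi                         # m{z_k; hat-phi(xi)}
    # (14): augmentation term plus model-based term (m1 - m0 = phi_1 here)
    aug = np.sum(xi * (y - m_obs) * (z / e - (1 - z) / (1 - e)))
    draws.append(aug + phi[1])
draws = np.array(draws)

naive = y[z == 1].mean() - y[z == 0].mean()  # confounded comparison
```

Despite the misspecified outcome model, the correction term pulls the posterior draws back towards the true contrast, while the naive comparison remains badly biased.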
7. SIMULATION STUDY
Above we have made a distinction between model misspecification due to omission of relevant covariates, and misspecification of the parametric functional relationship between the outcome and the covariates, and noted that all the estimators discussed in Section 3 should be 'doubly robust' against the former type of misspecification. However, in practice the consequences of these two types of misspecification will often be similar: they result in residual confounding. Therefore, herein we investigate how the different estimators perform when the covariate sets $S_i$ and $B_i$ are created not only by a partitioning, but also by a transformation of the $x_i$'s. For this purpose, we simulated $X_{ij} \sim N(0, 1)$, $j = 1, 2, 3$, and transformed these as $c_{ij} = \{|x_{ij}| - (2/\pi)^{1/2}\}/(1 - 2/\pi)^{1/2}$. The true treatment assignment and outcome mechanisms were specified as $Z_i \mid x_i \sim \textrm{Bernoulli}[\textrm{expit}\{0.5\, c_{i1} + 0.5\, x_{i2} + 0.5\, x_{i3}\}]$ and $Y_i \mid z_i, x_i \sim N(z_i - c_{i1} - x_{i2} - x_{i3}, 1)$, respectively. For estimation, we considered two scenarios: (I) $s_i \equiv (x_{i1}, x_{i2}, x_{i3})$ and $b_i \equiv (c_{i1}, x_{i2}, x_{i3})$ (misspecified outcome model and correctly specified treatment assignment model), and (II) $s_i \equiv (c_{i1}, x_{i2}, x_{i3})$ and $b_i \equiv (x_{i1}, x_{i2}, x_{i3})$ (correctly specified outcome model and misspecified treatment assignment model).

We are interested in the marginal causal contrast $E(Y_i^1) - E(Y_i^0) = 1$. To estimate this, we applied the Bayesian estimators discussed in Sections 4, 5, and 6. In the two-step estimation we attempted both forward sampling from the posterior distributions, and the variance decomposition formula (10). In the former, instead of Markov chain Monte Carlo, we applied normal approximations for the posterior distributions, of the form $\Phi \mid v; \gamma \sim N\{\hat\phi(\gamma), S\}$, where $\hat\phi(\gamma)$ is the maximum likelihood estimate and $S$ its estimated variance-covariance matrix. The posterior distribution $\Gamma \mid x, z$ was approximated using the weighted likelihood bootstrap.
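One replicate of this data-generating design can be sketched as follows (the coefficient values 0.5 follow our reconstruction of the garbled source and should be treated as illustrative). The transformation standardizes $|x|$, since $E|X| = (2/\pi)^{1/2}$ and $\textrm{var}|X| = 1 - 2/\pi$ for $X \sim N(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000
x = rng.normal(size=(n, 3))
# Standardize |x|: E|X| = sqrt(2/pi), var|X| = 1 - 2/pi for X ~ N(0, 1)
c = (np.abs(x) - np.sqrt(2 / np.pi)) / np.sqrt(1 - 2 / np.pi)

# Treatment depends on the transformed c_{i1}; coefficients are illustrative
lin = 0.5 * c[:, 0] + 0.5 * x[:, 1] + 0.5 * x[:, 2]
z = rng.binomial(1, 1 / (1 + np.exp(-lin)))
y = z - c[:, 0] - x[:, 1] - x[:, 2] + rng.normal(size=n)

# Scenario (I): fit outcome on (x_i1, x_i2, x_i3)  -> misspecified outcome model
# Scenario (II): fit outcome on (c_i1, x_i2, x_i3) -> correctly specified
```

Because $c_{i1}$ is a nonlinear transformation of $x_{i1}$, conditioning on the wrong version leaves residual confounding even though the same information is nominally available.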
In the joint estimation, we again used a normal approximation, centered at the joint maximum likelihood estimates $(\hat\phi, \hat\gamma)$, with the variance-covariance matrix given by the inverse of the observed information at the maximum likelihood point. In both two-step and joint estimation, the fitted models were specified as $m\{z_i, e(b_i; \gamma), s_i; \phi\} = \phi_0 + \phi_1 z_i + \phi_2^\top s_i + \phi_3^\top g\{e(b_i; \gamma)\}$, where $g$ is a cubic polynomial basis, and $e(b_i; \gamma) = \textrm{expit}(\gamma_0 + \gamma_1^\top b_i)$. In the importance sampling (IS) estimator proposed in Section 6.1, and in the importance sampling/doubly robust estimator (IS/DR) of Section 6.2, the fitted treatment assignment model was the same, with the outcome model specified through $m(z_i, s_i; \phi) = \phi_0 + \phi_1 z_i + \phi_2^\top s_i$.

For comparison to the Bayesian estimators, we also calculated the naive unadjusted comparison (naive), outcome regression adjusted for covariates $s_i$ (adjusted), the standard inverse probability of treatment weighted estimator (6) (IPTW), the semi-parametric doubly robust estimator (4) (DR), the clever covariate version of this (CC), the inverse probability of treatment weighted outcome regression based estimator (7) (OR/IPTW), as well as the propensity score adjusted outcome regression based estimator (3) (OR/PS). For IPTW, DR, CC, and OR/IPTW, the standard errors were estimated through the standard frequentist nonparametric bootstrap (Efron, 1979). For OR/PS, to demonstrate the variance estimation
issues discussed in Section 4, we calculated both observed information based standard errors, and the adjusted sandwich-type standard errors discussed in Supplementary Appendix 3.

The results over 1000 replications are shown in Table 1. Under scenario (I), all estimators except naive and adjusted can correct for confounding, although the joint estimation approach produces a slight bias. In terms of efficiency, the estimators based on propensity score adjusted outcome regression are the best, with the inverse probability of treatment weighting based estimators losing slightly. As discussed in Section 3.2, the clever covariate approach results in higher variability compared to the other doubly adjusted estimators. In terms of variance estimation, the comparison between the unadjusted and adjusted standard errors for OR/PS suggests that under this simulation setting estimation of the propensity scores substantially reduces the variance, and not adjusting for this results in overcoverage. The resampling based variance estimators adjust for this automatically. However, the two-step approach to variance estimation performs poorly; as demonstrated in Supplementary Appendix 3, the two-step point estimator has the same asymptotic variance as the other OR/PS estimators, but the two-step variance estimators unnecessarily add a further variance component. Thus, the simulation results support the discussion in Sections 4 and 5; the two-step and joint estimation approaches are difficult to justify from Bayesian arguments, and do not seem to provide practical advantages in terms of their frequency-based properties. On the other hand, the importance sampling based Bayesian estimators produce very similar results to the OR/IPTW and DR estimators, respectively.

Under scenario (II), all the estimators except IPTW are unbiased, which is expected based on their previously discussed theoretical properties.
When the outcome model is correctly specified, there is also very little difference in the efficiency of the various estimators.

Table 1: Estimates for the marginal causal contrast (true value = 1) over 1000 simulation rounds
[Table 1: columns Scenario; Estimator; Point estimate; Relative bias (%); SD; SE; MC error (%); Coverage. First entry: (I) Naive 0.347; remaining entries not recoverable.]
8. DISCUSSION
In this paper we reviewed previously proposed Bayesian approaches for propensity score adjusted inferences, focusing on the assumptions concerning correct model specification. Here it is important to make a distinction between misspecification due to omission of relevant covariates from the outcome model, and misspecification of the functional form of the dependency between the covariates and the outcome. The frequentist propensity score adjusted outcome regression is robust against the former type of model misspecification, but this property is lost in Bayesian estimation if the misspecified outcome model is allowed to feed back to the estimation of the propensity scores. While the feedback issue has been well documented in the literature (e.g. McCandless et al., 2009b; Zigler et al., 2013), and the reasons behind it were already stated by Robins and Ritov (1997), here we attempted to make the assumptions underlying the Bayesian propensity score approach more explicit. On the other hand, we point out that cutting this feedback in a two-step Bayesian estimation procedure unnecessarily inflates the posterior variance estimates.

As reaching double robustness through Bayesian propensity score adjustment looks difficult, herein we attempted a completely different approach through posterior predictive inferences. Our proposed approach decouples the outcome regression and treatment assignment models by introducing the inverse probability of treatment weights as importance sampling weights in the Monte Carlo integration evaluating posterior predictive expectations. A similar procedure was used in a marginal structural modelling context by Saarela et al. (2015b), improved to its present form in Saarela et al. (2015c).
While they used the importance sampling approach for estimating marginal outcome models in a longitudinal setting, herein we showed that in a point treatment setting the combination of importance sampling and posterior predictive inferences can be used to motivate weighted outcome regression or semi-parametric doubly robust inferences. Such a possibility was mentioned, but not formally justified, by Saarela et al. (2015a), who applied the importance sampling procedure in the context of estimating optimal treatment regimes. The disadvantage of the importance sampling approach is the same as in the corresponding frequentist inverse probability of treatment weighted inference procedures: the importance sampling weights add variability to the point estimator. In order to control this, a standard approach would be to truncate the weights (e.g. Xiao et al., 2013), which would also be possible in the importance sampling context (Ionides, 2008). Recently, Vehtari & Gelman (2015, arXiv:1507.02646v2) suggested probabilistic truncation of importance sampling weights; studying this possibility in the present context is a topic for further research.

ACKNOWLEDGEMENT
The authors acknowledge the support of the Natural Sciences and Engineering Research Council (NSERC) of Canada.

SUPPLEMENTARY MATERIAL
The supplementary material available includes Supplementary Appendices 1–4 referred to herein, containing proofs of theorems and other technical material.

APPENDIX 1. Causal inference as a prediction problem
The estimator (2) can be motivated without the use of potential outcomes notation as a posterior predictive expectation for a new observation under a hypothetical completely randomized setting $E$ where $Z_i \perp\!\!\!\perp_E X_i$ and the probabilities $\textrm{pr}_E(Z_i = a)$ are known constants (cf. the randomized trial measure discussed by Røysland, 2011). The data are observed under a setting $O$, where $Z_i \not\perp\!\!\!\perp_O X_i$, and causal
inference then corresponds to inference across these regimes. We can now write, for $i \notin \{1, \ldots, n\}$,
$$E_E(Y_i \mid Z_i = a, v) = \frac{\int_{\phi,\psi} \int_{x_i} \int_{y_i} y_i\, p(y_i \mid Z_i = a, x_i; \phi)\, \textrm{pr}_E(Z_i = a)\, p(x_i; \psi)\, \pi_n(\phi)\, \pi_n(\psi)\, dy_i\, dx_i\, d\phi\, d\psi}{\int_{\phi,\psi} \int_{x_i} \int_{y_i} p(y_i \mid Z_i = a, x_i; \phi)\, \textrm{pr}_E(Z_i = a)\, p(x_i; \psi)\, \pi_n(\phi)\, \pi_n(\psi)\, dy_i\, dx_i\, d\phi\, d\psi}$$
$$= \int_{\phi,\psi} \int_{x_i} m(a, x_i; \phi)\, p(x_i; \psi)\, \pi_n(\phi)\, \pi_n(\psi)\, dx_i\, d\phi\, d\psi \quad (15)$$
$$= \int_\phi \frac{1}{n} \sum_{k=1}^n m(a, x_k; \phi)\, \pi_n(\phi)\, d\phi, \quad (16)$$
where the last form was obtained by choosing the non-parametric specification $\int_\psi p(x_i; \psi)\, \pi_n(\psi)\, d\psi = p(x_i \mid x_1, \ldots, x_n) = \sum_{k=1}^n \delta_{x_k}(x_i)/n$. Alternatively, in (15) one could use the Bayesian bootstrap (Rubin, 1981) specification $p(x_i \mid x_1, \ldots, x_n; \xi) = \sum_{k=1}^n \xi_k \delta_{x_k}(x_i)$, where $\xi = (\xi_1, \ldots, \xi_n)$, with $\pi_n(\xi)$ taken to be a uniform Dirichlet distribution (see Section 6). Obtaining (16) also required assuming that $p_E(y_i \mid z_i, x_i; \phi) = p_O(y_i \mid z_i, x_i; \phi) \equiv p(y_i \mid z_i, x_i; \phi)$ and $p_E(x_i; \psi) = p_O(x_i; \psi) \equiv p(x_i; \psi)$, which corresponds to the stability assumption discussed by Dawid and Didelez (2010).

APPENDIX 2. Proofs of theorems
Proof of Theorem 1.
If (i) holds true, then also $(Y_i^0, Y_i^1) \perp\!\!\!\perp e(B_i) \mid S_i$, and the propensity score adjustment does not add information. If on the other hand (ii) holds true, $\{e(B_i), S_i\}$ jointly has the balancing property $Z_i \perp\!\!\!\perp X_i \mid \{e(B_i), S_i\}$. This follows from Theorem 2 of Rosenbaum and Rubin (1983, p. 44) and also implies that $(Y_i^0, Y_i^1) \perp\!\!\!\perp Z_i \mid \{e(B_i), S_i\}$ (Rosenbaum and Rubin, 1983, Theorem 3). Now
$$E\{Y_i \mid Z_i, e(B_i), S_i\} = E\{(1 - Z_i) Y_i^0 \mid Z_i, e(B_i), S_i\} + E\{Z_i Y_i^1 \mid Z_i, e(B_i), S_i\} = (1 - Z_i) E\{Y_i^0 \mid Z_i, e(B_i), S_i\} + Z_i E\{Y_i^1 \mid Z_i, e(B_i), S_i\} = (1 - Z_i) E\{Y_i^0 \mid e(B_i), S_i\} + Z_i E\{Y_i^1 \mid e(B_i), S_i\},$$
and further,
$$\int_{x_i} E\{Y_i \mid Z_i = 1, e(b_i), s_i\}\, p(x_i)\, dx_i - \int_{x_i} E\{Y_i \mid Z_i = 0, e(b_i), s_i\}\, p(x_i)\, dx_i \quad (17)$$
$$= \int_{x_i} E\{Y_i^1 \mid e(b_i), s_i\}\, p(x_i)\, dx_i - \int_{x_i} E\{Y_i^0 \mid e(b_i), s_i\}\, p(x_i)\, dx_i = E(Y_i^1) - E(Y_i^0).$$
The consistency of the estimator (3) then relies on being able to consistently estimate the quantities in (17). □
Proof of Theorem 2.
Consider first the expectation of (8) under the assumption that (i) holds true. Now
$$E\left\{ \frac{1_{\{z_i = a\}}\, \partial \log p(y_i^a \mid s_i; \phi)/\partial\phi}{\textrm{pr}(Z_i = a \mid b_i)} \right\} = \int_{x_i} \int_{y_i^a} \sum_{z_i} \frac{1_{\{z_i = a\}}\, \partial \log p(y_i^a \mid s_i; \phi)/\partial\phi}{\textrm{pr}(Z_i = a \mid b_i)}\, f(y_i^a \mid z_i, x_i)\, f(z_i \mid x_i)\, f(x_i)\, dy_i^a\, dx_i$$
$$= \int_{x_i} \left\{ \int_{y_i^a} \frac{\partial p(y_i^a \mid s_i; \phi)/\partial\phi}{p(y_i^a \mid s_i; \phi)}\, f(y_i^a \mid x_i)\, dy_i^a \right\} \frac{\textrm{pr}(Z_i = a \mid x_i)}{\textrm{pr}(Z_i = a \mid b_i)}\, f(x_i)\, dx_i = \int_{x_i} \left\{ \frac{\partial}{\partial\phi} \int_{y_i^a} p(y_i^a \mid s_i; \phi)\, dy_i^a \right\} \frac{\textrm{pr}(Z_i = a \mid x_i)}{\textrm{pr}(Z_i = a \mid b_i)}\, f(x_i)\, dx_i = 0,$$
where the last two equalities used $p(y_i^a \mid s_i; \phi) = f(y_i^a \mid x_i)$ at the true parameter value. Thus, the misspecified weights do not influence the estimation (in terms of bias) as long as the outcome model is correctly specified.

Under the assumption that (ii) holds true, we have in turn that
$$E\left\{ \frac{1_{\{z_i = a\}}\, \partial \log p(y_i^a \mid s_i; \phi)/\partial\phi}{\textrm{pr}(Z_i = a \mid b_i)} \right\} = \int_{x_i} \int_{y_i^a} \sum_{z_i} \frac{1_{\{z_i = a\}}\, \partial \log p(y_i^a \mid s_i; \phi)/\partial\phi}{\textrm{pr}(Z_i = a \mid b_i)}\, f(y_i^a \mid z_i, x_i)\, f(z_i \mid x_i)\, f(x_i)\, dy_i^a\, dx_i = \int_{x_i} \int_{y_i^a} \frac{\partial \log p(y_i^a \mid s_i; \phi)}{\partial\phi}\, f(y_i^a \mid x_i)\, f(x_i)\, dy_i^a\, dx_i,$$
since now $p(z_i \mid b_i) = f(z_i \mid x_i)$. Using the partitioning $x_i = (s_i, r_i)$, we can write the last form above as
$$\int_{x_i} \int_{y_i^a} \frac{\partial \log p(y_i^a \mid s_i; \phi)}{\partial\phi}\, f(y_i^a \mid x_i)\, f(x_i)\, dy_i^a\, dx_i = \int_{s_i} \int_{r_i} \int_{y_i^a} \frac{\partial \log p(y_i^a \mid s_i; \phi)}{\partial\phi}\, f(y_i^a, s_i, r_i)\, dy_i^a\, ds_i\, dr_i$$
$$= \int_{s_i} \int_{y_i^a} \frac{\partial p(y_i^a \mid s_i; \phi)/\partial\phi}{p(y_i^a \mid s_i; \phi)}\, f(y_i^a \mid s_i)\, f(s_i)\, dy_i^a\, ds_i = \int_{s_i} \left\{ \frac{\partial}{\partial\phi} \int_{y_i^a} p(y_i^a \mid s_i; \phi)\, dy_i^a \right\} f(s_i)\, ds_i = 0.$$
Thus, even though the outcome regression does not include a sufficient set of confounders, through the inverse probability of treatment weighting we can still obtain valid estimates of the conditional distributions $p(y_i^a \mid s_i; \phi)$. □

Proof of Theorem 3.
Now the marginal posterior distribution of the parameters $\gamma$ becomes
$$p(\gamma \mid v) = \int_{\phi,\psi} p(\phi, \gamma, \psi \mid v)\, d\phi\, d\psi \propto \int_{\phi,\psi} \left\{ \prod_{i=1}^n p(y_i \mid z_i, x_i; \phi)\, p(z_i \mid x_i; \gamma)\, p(x_i; \psi) \right\} \pi(\phi)\, \pi(\psi)\, \pi(\gamma)\, d\phi\, d\psi \propto \prod_{i=1}^n p(z_i \mid x_i; \gamma)\, \pi(\gamma) \propto p(\gamma \mid x, z). \;\; \square$$

Proof of Theorem 4.
The estimator obtained through substituting in the parametric models is
$$\sum_{k=1}^n \xi_k \left[ y_k - m\{z_k, s_k; \hat\phi(\xi)\} \right] \left[ \frac{z_k}{\textrm{pr}_O\{Z_k = 1 \mid b_k; \hat\gamma(\xi)\}} - \frac{1 - z_k}{\textrm{pr}_O\{Z_k = 0 \mid b_k; \hat\gamma(\xi)\}} \right] + \sum_{k=1}^n \xi_k \left[ m\{1, s_k; \hat\phi(\xi)\} - m\{0, s_k; \hat\phi(\xi)\} \right]. \quad (18)$$
First, if the outcome model is correctly specified in the sense that $m\{z_k, s_k; \hat\phi(\xi)\} = m(z_k, x_k; \xi)$, we
get
$$\textrm{pr}_E(Z_k = a) \sum_{k=1}^n \frac{\xi_k 1_{\{z_k = a\}}\, m\{z_k, s_k; \hat\phi(\xi)\}}{\textrm{pr}_O\{Z_k = a \mid b_k; \hat\gamma(\xi)\}} = \sum_{k=1}^n \xi_k 1_{\{z_k = a\}}\, m\{z_k, s_k; \hat\phi(\xi)\}\, \frac{p_E(z_k)}{p_O\{z_k \mid b_k; \hat\gamma(\xi)\}}$$
$$= \int_{z_i, x_i} 1_{\{z_i = a\}}\, m\{z_i, s_i; \hat\phi(\xi)\}\, \frac{p_E(z_i)}{p_O\{z_i \mid b_i; \hat\gamma(\xi)\}} \sum_{k=1}^n \xi_k \delta_{(z_k, x_k)}(z_i, x_i)\, dz_i\, dx_i$$
$$= \int_{y_i, z_i, x_i} 1_{\{z_i = a\}}\, y_i\, \frac{p_E(z_i)}{p_O\{z_i \mid b_i; \hat\gamma(\xi)\}}\, p_O(y_i \mid z_i, x_i, v; \xi)\, p_O(z_i, x_i \mid v; \xi)\, dy_i\, dz_i\, dx_i$$
$$= \int_{v_i} 1_{\{z_i = a\}}\, y_i\, \frac{p_E(z_i)}{p_O\{z_i \mid b_i; \hat\gamma(\xi)\}}\, p_O(v_i \mid v; \xi)\, dv_i = \textrm{pr}_E(Z_k = a) \sum_{k=1}^n \frac{\xi_k 1_{\{z_k = a\}}\, y_k}{\textrm{pr}_O\{Z_k = a \mid b_k; \hat\gamma(\xi)\}},$$
because the second to last form is equivalent to (19) in Supplementary Appendix 4. Thus, the first summation term in (18) cancels out, leaving only the model-based estimator, which itself is now equivalent to the posterior predictive mean difference.

On the other hand, if the treatment assignment model is correctly specified in the sense that $\textrm{pr}_O\{Z_k = a \mid b_k; \hat\gamma(\xi)\} = \textrm{pr}_O(Z_k = a \mid x_k, v; \xi)$, we get
$$\textrm{pr}_E(Z_k = a) \sum_{k=1}^n \frac{\xi_k 1_{\{z_k = a\}}\, m\{z_k, s_k; \hat\phi(\xi)\}}{\textrm{pr}_O\{Z_k = a \mid b_k; \hat\gamma(\xi)\}} = \sum_{k=1}^n \xi_k 1_{\{z_k = a\}}\, m\{z_k, s_k; \hat\phi(\xi)\}\, \frac{p_E(z_k)}{p_O\{z_k \mid b_k; \hat\gamma(\xi)\}}$$
$$= \int_{z_i, x_i} 1_{\{z_i = a\}}\, m\{z_i, s_i; \hat\phi(\xi)\}\, \frac{p_E(z_i)}{p_O\{z_i \mid b_i; \hat\gamma(\xi)\}} \sum_{k=1}^n \xi_k \delta_{(z_k, x_k)}(z_i, x_i)\, dz_i\, dx_i$$
$$= \int_{z_i, x_i} 1_{\{z_i = a\}}\, m\{z_i, s_i; \hat\phi(\xi)\}\, \frac{p_E(z_i)}{p_O\{z_i \mid b_i; \hat\gamma(\xi)\}}\, p_O(z_i \mid x_i, v; \xi)\, p_O(x_i \mid v; \xi)\, dz_i\, dx_i$$
$$= \int_{z_i, x_i} 1_{\{z_i = a\}}\, m\{z_i, s_i; \hat\phi(\xi)\}\, p_E(z_i)\, p_O(x_i \mid v; \xi)\, dz_i\, dx_i = \textrm{pr}_E(Z_i = a) \int_{x_i} m\{a, s_i; \hat\phi(\xi)\} \sum_{k=1}^n \xi_k \delta_{x_k}(x_i)\, dx_i = \textrm{pr}_E(Z_k = a) \sum_{k=1}^n \xi_k\, m\{a, s_k; \hat\phi(\xi)\}.$$
Therefore, the estimator (18) now reduces to
$$\sum_{k=1}^n \xi_k\, y_k \left[ \frac{z_k}{\textrm{pr}_O\{Z_k = 1 \mid b_k; \hat\gamma(\xi)\}} - \frac{1 - z_k}{\textrm{pr}_O\{Z_k = 0 \mid b_k; \hat\gamma(\xi)\}} \right],$$
which is again equivalent to the posterior predictive mean difference (see Supplementary Appendix 4). □

APPENDIX 3. On the frequency-based properties of the two-step approach
Trivially, if the outcome model is correctly specified, then $\Phi \perp\!\!\!\perp \Gamma \mid V$ and (9) reduces to (16). The interesting situations are naturally those where this is not the case. We denote the log-likelihood terms by $q_i(\phi; \gamma) = \log p\{y_i \mid z_i, e(b_i; \gamma), s_i; \phi\}$ and $q(\phi; \gamma) \equiv \sum_{i=1}^n q_i(\phi; \gamma)$, and consider the quasi-maximum likelihood estimator $\hat\phi(\hat\gamma) \equiv \arg\max_\phi q(\phi; \hat\gamma)$. If the treatment assignment model is correctly specified, $\hat\gamma \to \gamma_0$. In addition, we assume that for any fixed value of $\gamma$, $\hat\phi(\gamma) \to \bar\phi(\gamma)$, where $\bar\phi(\gamma)$ is the parameter vector which minimizes the Kullback–Leibler divergence to the true outcome model (e.g. White, 1982, p. 4). Thus, by the law of large numbers and continuous mapping, we can write in the usual way that
$$\frac{1}{n} \sum_{i=1}^n [q_i(\phi; \hat\gamma) - q_i\{\bar\phi(\gamma_0); \gamma_0\}] \to E[q_i(\phi; \gamma_0) - q_i\{\bar\phi(\gamma_0); \gamma_0\}],$$
where the right-hand side is maximized at zero when $\phi = \bar\phi(\gamma_0)$, at which point $E\{Y_i \mid Z_i = a, e(b_i; \gamma_0), s_i; \bar\phi(\gamma_0)\} = E\{Y_i^a \mid e(b_i; \gamma_0), s_i; \bar\phi(\gamma_0)\}$. Since we also have that the posterior $p(\gamma \mid x, z) \to \delta_{\gamma_0}(\gamma)$ in distribution, we can then conjecture that posterior predictive inferences based on (9) will be asymptotically unconfounded.

With the definitions
$$U^\phi(\phi; \gamma) \equiv \partial q(\phi; \gamma)/\partial\phi, \quad U^{\phi\phi}(\phi; \gamma) \equiv \partial^2 q(\phi; \gamma)/\partial\phi^2, \quad U^{\phi\gamma}(\phi; \gamma) \equiv \partial^2 q(\phi; \gamma)/\partial\phi\, \partial\gamma,$$
$$U^\gamma(\gamma) \equiv \partial \sum_{i=1}^n \log p(z_i \mid b_i; \gamma)/\partial\gamma, \quad U^{\gamma\gamma}(\gamma) \equiv \partial^2 \sum_{i=1}^n \log p(z_i \mid b_i; \gamma)/\partial\gamma^2,$$
and noting that $U^\phi\{\hat\phi(\gamma^{(j)}); \gamma^{(j)}\} = 0$ for each $\gamma^{(j)}$, $j = 1, \ldots, m$, we can consider the first order Taylor expansion of $U^\phi(\hat\phi; \gamma^{(j)})$ around the true parameter values $(\phi_0, \gamma_0)$, which becomes
$$\frac{1}{n} U^\phi(\hat\phi; \gamma^{(j)}) \approx \frac{1}{n} U^\phi(\phi_0; \gamma_0) + \frac{1}{n} U^{\phi\phi}(\phi_0; \gamma_0)\{\hat\phi(\gamma^{(j)}) - \phi_0\} + \frac{1}{n} U^{\phi\gamma}(\phi_0; \gamma_0)(\gamma^{(j)} - \gamma_0)$$
$$\approx \frac{1}{n} U^\phi(\phi_0; \gamma_0) + E\{U_i^{\phi\phi}(\phi_0; \gamma_0)\}\{\hat\phi(\gamma^{(j)}) - \phi_0\} + E\{U_i^{\phi\gamma}(\phi_0; \gamma_0)\}(\gamma^{(j)} - \gamma_0),$$
and further,
$$\frac{1}{m} \sum_{j=1}^m \frac{1}{n} U^\phi(\hat\phi; \gamma^{(j)}) \approx \frac{1}{n} U^\phi(\phi_0; \gamma_0) + E\{U_i^{\phi\phi}(\phi_0; \gamma_0)\} \left\{ \frac{1}{m} \sum_{j=1}^m \hat\phi(\gamma^{(j)}) - \phi_0 \right\} + E\{U_i^{\phi\gamma}(\phi_0; \gamma_0)\}(\hat\gamma - \gamma_0),$$
if $m^{-1} \sum_{j=1}^m \gamma^{(j)} \approx \hat\gamma$. Hence,
$$\sqrt{n} \left\{ \frac{1}{m} \sum_{j=1}^m \hat\phi(\gamma^{(j)}) - \phi_0 \right\} \approx E\{-U_i^{\phi\phi}(\phi_0; \gamma_0)\}^{-1} \left[ \frac{1}{\sqrt{n}} U^\phi(\phi_0; \gamma_0) + E\{U_i^{\phi\gamma}(\phi_0; \gamma_0)\}\, \sqrt{n}(\hat\gamma - \gamma_0) \right].$$
Here we have, by another first order expansion around $\gamma_0$, that $\sqrt{n}(\hat\gamma - \gamma_0) \approx E\{-U_i^{\gamma\gamma}(\gamma_0)\}^{-1}\, n^{-1/2}\, U^\gamma(\gamma_0)$, so finally
$$\sqrt{n} \left\{ \frac{1}{m} \sum_{j=1}^m \hat\phi(\gamma^{(j)}) - \phi_0 \right\} \approx E\{-U_i^{\phi\phi}(\phi_0; \gamma_0)\}^{-1} \left( \frac{1}{\sqrt{n}} \sum_{i=1}^n \left[ U_i^\phi(\phi_0; \gamma_0) + E\{U_i^{\phi\gamma}(\phi_0; \gamma_0)\}\, E\{-U_i^{\gamma\gamma}(\gamma_0)\}^{-1} U_i^\gamma(\gamma_0) \right] \right).$$
We may similarly expand $U_\phi(\hat\phi; \hat\gamma)$, where the parameters $\gamma$ have been fixed at their maximum likelihood estimates, around $(\phi^*, \gamma^*)$ as
$$\frac{1}{n} U_\phi(\hat\phi; \hat\gamma) \approx \frac{1}{n} U_\phi(\phi^*; \gamma^*) + E\{U_{\phi\phi i}(\phi^*; \gamma^*)\}\{\hat\phi(\hat\gamma) - \phi^*\} + E\{U_{\phi\gamma i}(\phi^*; \gamma^*)\}(\hat\gamma - \gamma^*)$$
to find that
$$\sqrt{n}\{\hat\phi(\hat\gamma) - \phi^*\} \approx E\{-U_{\phi\phi i}(\phi^*; \gamma^*)\}^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n B_i(\phi^*; \gamma^*),$$
where $B_i(\phi^*; \gamma^*) \equiv U_{\phi i}(\phi^*; \gamma^*) + E\{U_{\phi\gamma i}(\phi^*; \gamma^*)\} E\{-U_{\gamma\gamma i}(\gamma^*)\}^{-1} U_{\gamma i}(\gamma^*)$. Since the two estimators $m^{-1} \sum_{j=1}^m \hat\phi(\gamma^{(j)})$ and $\hat\phi(\hat\gamma)$ have the same linear approximation, which is a sum of independent terms, we conclude that they have the same asymptotic distribution,
$$\sqrt{n}(\hat\phi - \phi^*) \to N\left[0, \; E\{-U_{\phi\phi i}(\phi^*; \gamma^*)\}^{-1} \mathrm{var}\{B_i(\phi^*; \gamma^*)\} E\{-U_{\phi\phi i}(\phi^*; \gamma^*)^\top\}^{-1}\right].$$

Fitting the regression model $y_i = \phi_0 + \phi_1 z_i + \phi_2^\top s_i + \phi_3^\top g\{e(b_i; \hat\gamma)\} + \varepsilon_i$ to estimate the parameter of interest $\phi_1$ is numerically equivalent to fitting the sequence of regressions $y_i = \nu^\top s^*_i(\hat\gamma) + \varepsilon_i$, $z_i = \alpha^\top s^*_i(\hat\gamma) + \varepsilon_i$ and $\{y_i - \hat\nu^\top s^*_i(\hat\gamma)\} = \phi_1\{z_i - \hat\alpha^\top s^*_i(\hat\gamma)\} + \varepsilon_i$, where $s^*_i(\gamma) \equiv [s_i, g\{e(b_i; \gamma)\}]$. Denoting the estimating function corresponding to the last regression as
$$U^{\phi_1}\{\phi_1, \hat\gamma, \hat\nu(\hat\gamma), \hat\alpha(\hat\gamma)\} \equiv \sum_{i=1}^n \{z_i - \hat\alpha^\top s^*_i(\hat\gamma)\} \left[ \{y_i - \hat\nu^\top s^*_i(\hat\gamma)\} - \phi_1 \{z_i - \hat\alpha^\top s^*_i(\hat\gamma)\} \right],$$
and the partial derivatives of this as e.g.
$U^{\phi_1}_\gamma \equiv \partial U^{\phi_1}/\partial \gamma$, we can expand this around $(\phi_1^*, \gamma^*, \nu^*, \alpha^*)$, where $\nu^* \equiv \nu^*(\gamma^*)$ and $\alpha^* \equiv \alpha^*(\gamma^*)$ are the limiting values of the nuisance parameters, as
$$\frac{1}{n} U^{\phi_1}\{\phi_1^*, \hat\gamma, \hat\nu(\hat\gamma), \hat\alpha(\hat\gamma)\} \approx \frac{1}{n} U^{\phi_1}(\phi_1^*, \gamma^*, \nu^*, \alpha^*) + E\{U^{\phi_1}_{\gamma i}(\phi_1^*, \gamma^*, \nu^*, \alpha^*)\}(\hat\gamma - \gamma^*) + E\{U^{\phi_1}_{\nu i}(\phi_1^*, \gamma^*, \nu^*, \alpha^*)\}(\hat\nu - \nu^*) + E\{U^{\phi_1}_{\alpha i}(\phi_1^*, \gamma^*, \nu^*, \alpha^*)\}(\hat\alpha - \alpha^*)$$
$$= \frac{1}{n} U^{\phi_1}(\phi_1^*, \gamma^*, \nu^*, \alpha^*) + E\{U^{\phi_1}_{\gamma i}(\phi_1^*, \gamma^*, \nu^*, \alpha^*)\}(\hat\gamma - \gamma^*) \approx \frac{1}{n} U^{\phi_1}(\phi_1^*, \hat\gamma, \nu^*, \alpha^*),$$
since here $E\{U^{\phi_1}_{\nu i}(\phi_1^*, \gamma^*, \nu^*, \alpha^*)\} = E\{U^{\phi_1}_{\alpha i}(\phi_1^*, \gamma^*, \nu^*, \alpha^*)\} = 0$. We can now see that Theorem 1 of Henmi and Eguchi (2004, p. 935) applies to the last form here, implying that $\mathrm{avar}(\hat\phi_1) \le \mathrm{avar}(\tilde\phi_1)$, where $\hat\phi_1$ is the solution to $U^{\phi_1}(\phi_1, \hat\gamma, \nu^*, \alpha^*) = 0$ and $\tilde\phi_1$ is the solution to $U^{\phi_1}(\phi_1, \gamma^*, \nu^*, \alpha^*) = 0$.

APPENDIX

The doubly robust estimator as a posterior predictive expectation
We first note that because
$$\int_{v_i} \mathbf{1}\{z_i = a\} \, y_i \, p_E(v_i \mid v; \xi) \, \mathrm{d}v_i = \int_{y_i, x_i} y_i \, p_E(y_i \mid Z_i = a, x_i, v; \xi) \, \mathrm{pr}_E(Z_i = a) \, p_E(x_i \mid v; \xi) \, \mathrm{d}y_i \, \mathrm{d}x_i,$$
we have that
$$E_E(Y_i \mid Z_i = 1, v; \xi) = E_E\left\{ \frac{Z_i Y_i}{\mathrm{pr}_E(Z_i = 1)} \,\middle|\, v; \xi \right\}$$
and
$$E_E(Y_i \mid Z_i = 0, v; \xi) = E_E\left\{ \frac{(1 - Z_i) Y_i}{\mathrm{pr}_E(Z_i = 0)} \,\middle|\, v; \xi \right\}.$$
The usual IPT-weighted estimator for a marginal causal contrast may be derived through a posterior predictive argument as follows. First,
$$E_E(Z_i Y_i \mid v; \xi) = \int_{v_i} z_i y_i \, p_E(v_i \mid v; \xi) \, \mathrm{d}v_i = \int_{v_i} z_i y_i \, \frac{p_E(v_i \mid v; \xi)}{p_O(v_i \mid v; \xi)} \, p_O(v_i \mid v; \xi) \, \mathrm{d}v_i$$
$$= \int_{v_i} z_i y_i \, \frac{p_E(y_i \mid z_i, x_i, v; \xi) \, p_E(z_i) \, p_E(x_i \mid v; \xi)}{p_O(y_i \mid z_i, x_i, v; \xi) \, p_O(z_i \mid x_i, v; \xi) \, p_O(x_i \mid v; \xi)} \, p_O(v_i \mid v; \xi) \, \mathrm{d}v_i$$
$$= \int_{v_i} z_i y_i \, \frac{p_E(z_i)}{p_O(z_i \mid x_i, v; \xi)} \, p_O(v_i \mid v; \xi) \, \mathrm{d}v_i \qquad (19)$$
$$= \int_{v_i} z_i y_i \, \frac{p_E(z_i)}{p_O(z_i \mid x_i, v; \xi)} \sum_{k=1}^n \xi_k \, \delta_{v_k}(v_i) \, \mathrm{d}v_i = \sum_{k=1}^n \xi_k \, \frac{z_k y_k \, p_E(z_k)}{p_O(z_k \mid x_k, v; \xi)} = \mathrm{pr}_E(Z_k = 1) \sum_{k=1}^n \xi_k \, \frac{z_k y_k}{\mathrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)},$$
and thus,
$$E_E\left\{ \frac{Z_i Y_i}{\mathrm{pr}_E(Z_i = 1)} \,\middle|\, v; \xi \right\} = \sum_{k=1}^n \xi_k \, \frac{z_k y_k}{\mathrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)}.$$
Similarly,
$$E_E\left\{ \frac{(1 - Z_i) Y_i}{\mathrm{pr}_E(Z_i = 0)} \,\middle|\, v; \xi \right\} = \sum_{k=1}^n \xi_k \, \frac{(1 - z_k) y_k}{\mathrm{pr}_O(Z_k = 0 \mid x_k, v; \xi)},$$
and
$$E_E\left\{ \frac{Z_i Y_i}{\mathrm{pr}_E(Z_i = 1)} - \frac{(1 - Z_i) Y_i}{\mathrm{pr}_E(Z_i = 0)} \,\middle|\, v; \xi \right\} = \sum_{k=1}^n \xi_k y_k \left\{ \frac{z_k}{\mathrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)} - \frac{1 - z_k}{\mathrm{pr}_O(Z_k = 0 \mid x_k, v; \xi)} \right\}.$$
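With Dirichlet$(1, \ldots, 1)$ weights $\xi$ over the observed units (the Bayesian bootstrap), the right-hand side above is an IPT-weighted average of the outcomes, and each weight draw gives one posterior predictive draw of the marginal contrast. A minimal sketch, with a hypothetical simulated data set and the propensity $\mathrm{pr}_O(Z_k = 1 \mid x_k)$ taken as known:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical observational data: confounder x, treatment z, outcome y
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))              # pr_O(Z_k = 1 | x_k), taken as known here
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)  # true marginal contrast = 2

def ipt_contrast(xi):
    """sum_k xi_k y_k { z_k / pr_O(Z_k=1|x_k) - (1-z_k) / pr_O(Z_k=0|x_k) }."""
    return np.sum(xi * y * (z / e - (1 - z) / (1 - e)))

# Posterior of the contrast under the Bayesian bootstrap: xi ~ Dirichlet(1,...,1)
draws = np.array([ipt_contrast(rng.dirichlet(np.ones(n))) for _ in range(400)])
post_mean = draws.mean()
```

Averaging the draws gives the posterior predictive mean of the contrast, and the spread of the draws quantifies its uncertainty.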
On the other hand, the usual outcome model based estimator may be motivated similarly as in Supplementary Appendix 1 through
$$E_E(Y_i \mid Z_i = a, v; \xi) = \int_{x_i} \left\{ \int_{y_i} y_i \, p_O(y_i \mid Z_i = a, x_i, v; \xi) \, \mathrm{d}y_i \right\} p_O(x_i \mid v; \xi) \, \mathrm{d}x_i = \int_{x_i} m(a, x_i; \xi) \sum_{k=1}^n \xi_k \, \delta_{x_k}(x_i) \, \mathrm{d}x_i = \sum_{k=1}^n \xi_k \, m(a, x_k; \xi),$$
and
$$E_E(Y_i \mid Z_i = 1, v; \xi) - E_E(Y_i \mid Z_i = 0, v; \xi) = \sum_{k=1}^n \xi_k \{ m(1, x_k; \xi) - m(0, x_k; \xi) \}.$$
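The outcome-model standardization above admits the same Bayesian bootstrap treatment: draw $\xi$, then average the fitted contrasts $m(1, x_k; \xi) - m(0, x_k; \xi)$ with weights $\xi_k$. A hypothetical sketch with a least-squares outcome model; the treatment-by-covariate interaction in the data-generating design makes the weighting non-trivial:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-x)))
y = 1.5 * z + x + 0.5 * z * x + rng.normal(size=n)  # average effect = 1.5

# Outcome regression m(z, x) = b0 + b1 z + b2 x + b3 z x, fitted by least squares
X = np.column_stack([np.ones(n), z, x, z * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
m = lambda zz: beta[0] + beta[1] * zz + beta[2] * x + beta[3] * zz * x

xi = rng.dirichlet(np.ones(n))        # one draw of the weights xi
# sum_k xi_k { m(1, x_k) - m(0, x_k) }
contrast = np.sum(xi * (m(1) - m(0)))
```

Repeating over many draws of $\xi$ again yields a posterior distribution for the standardized contrast.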
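The IPT-weighted contrast and the outcome-model standardization derived above are combined into the doubly robust form obtained next, which weights the outcome residuals rather than the outcomes themselves. A hypothetical numerical sketch, with both working models taken as correctly specified and $\xi$ again Dirichlet:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))              # propensity pr_O(Z=1|x), known here
z = rng.binomial(1, e)
y = 1.0 * z + x + rng.normal(size=n)  # true marginal contrast = 1

m = lambda zz: 1.0 * zz + x           # outcome model m(z, x); true values used

def dr_draw(xi):
    """Doubly robust posterior predictive contrast for one weight draw xi."""
    resid = (y - m(z)) * (z / e - (1 - z) / (1 - e))  # IPT-weighted residuals
    return np.sum(xi * (resid + m(1) - m(0)))         # plus standardized contrast

draws = np.array([dr_draw(rng.dirichlet(np.ones(n))) for _ in range(400)])
post_mean = draws.mean()
```

Because the residual term has mean zero under a correct outcome model, and the standardization term is consistent under a correct treatment model, misspecifying either single component leaves the posterior mean consistent.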
Finally, we note that we can write (19) alternatively as
$$E_E(Z_i Y_i \mid v; \xi) = \int_{v_i} z_i y_i \, \frac{p_E(z_i)}{p_O(z_i \mid x_i, v; \xi)} \, p_O(v_i \mid v; \xi) \, \mathrm{d}v_i$$
$$= \int_{y_i, z_i, x_i} z_i y_i \, \frac{p_E(z_i)}{p_O(z_i \mid x_i, v; \xi)} \, p_O(y_i \mid z_i, x_i, v; \xi) \, p_O(z_i, x_i \mid v; \xi) \, \mathrm{d}y_i \, \mathrm{d}z_i \, \mathrm{d}x_i$$
$$= \int_{z_i, x_i} z_i \, m(z_i, x_i; \xi) \, \frac{p_E(z_i)}{p_O(z_i \mid x_i, v; \xi)} \sum_{k=1}^n \xi_k \, \delta_{(z_k, x_k)}(z_i, x_i) \, \mathrm{d}z_i \, \mathrm{d}x_i$$
$$= \sum_{k=1}^n \xi_k \, \frac{z_k \, m(z_k, x_k; \xi) \, p_E(z_k)}{p_O(z_k \mid x_k, v; \xi)} = \mathrm{pr}_E(Z_k = 1) \sum_{k=1}^n \xi_k \, \frac{z_k \, m(z_k, x_k; \xi)}{\mathrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)},$$
and therefore
$$E_E\left\{ \frac{Z_i Y_i}{\mathrm{pr}_E(Z_i = 1)} - \frac{(1 - Z_i) Y_i}{\mathrm{pr}_E(Z_i = 0)} \,\middle|\, v; \xi \right\} = \sum_{k=1}^n \xi_k \, m(z_k, x_k; \xi) \left\{ \frac{z_k}{\mathrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)} - \frac{1 - z_k}{\mathrm{pr}_O(Z_k = 0 \mid x_k, v; \xi)} \right\}.$$
Thus, the posterior predictive mean difference can be written as
$$E_E(Y_i \mid Z_i = 1, v; \xi) - E_E(Y_i \mid Z_i = 0, v; \xi)$$
$$= E_E\left\{ \frac{Z_i Y_i}{\mathrm{pr}_E(Z_i = 1)} - \frac{(1 - Z_i) Y_i}{\mathrm{pr}_E(Z_i = 0)} \,\middle|\, v; \xi \right\} + E_E(Y_i \mid Z_i = 1, v; \xi) - E_E\left\{ \frac{Z_i Y_i}{\mathrm{pr}_E(Z_i = 1)} - \frac{(1 - Z_i) Y_i}{\mathrm{pr}_E(Z_i = 0)} \,\middle|\, v; \xi \right\} - E_E(Y_i \mid Z_i = 0, v; \xi)$$
$$= \sum_{k=1}^n \xi_k \{ y_k - m(z_k, x_k; \xi) \} \left\{ \frac{z_k}{\mathrm{pr}_O(Z_k = 1 \mid x_k, v; \xi)} - \frac{1 - z_k}{\mathrm{pr}_O(Z_k = 0 \mid x_k, v; \xi)} \right\} + \sum_{k=1}^n \xi_k \{ m(1, x_k; \xi) - m(0, x_k; \xi) \}. \qquad (20)$$

REFERENCES
Achy-Brou, A. C., Frangakis, C. E., and Griswold, M. (2010). Estimating treatment effects of longitudinal designs using regression models on propensity scores. Biometrics, 66:824–833.
An, W. (2010). Bayesian propensity score estimators: incorporating uncertainties in propensity scores into causal inference. Sociological Methodology, 40:151–189.
Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Wiley, Chichester.
Chakraborty, B. and Moodie, E. E. M. (2013). Statistical Methods for Dynamic Treatment Regimes. Springer-Verlag, New York.
Chamberlain, G. and Imbens, G. W. (2003). Nonparametric applications of Bayesian inference. Journal of Business & Economic Statistics, 21:12–18.
Chen, J. and Kaplan, D. (2015). Covariate balance in Bayesian propensity score approaches for observational studies. Journal of Research on Educational Effectiveness, 8:280–302.
Cole, S. R. and Hernán, M. A. (2008). Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168:656–664.
Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B, 41:1–31.
Dawid, A. P. and Didelez, V. (2010). Identifying the consequences of dynamic treatment strategies: a decision-theoretic overview. Statistical Surveys, 4:184–231.
Efron, B. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7:1–26.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC, Boca Raton, FL.
Graham, D. J., McCoy, E. J., and Stephens, D. A. (2015). Approximate Bayesian inference for doubly robust estimation. Bayesian Analysis. doi:10.1214/14-BA928.
Gustafson, P. (2012). Double-robust estimators: slightly more Bayesian than meets the eye? The International Journal of Biostatistics, 8. doi:10.2202/1557-4679.1349.
Hansen, B. B. (2008). The prognostic analogue of the propensity score. Biometrika, 95:481–488.
Henmi, M. and Eguchi, S. (2004). A paradox concerning nuisance parameters and projected estimating functions. Biometrika, 91:929–941.
Hernán, M. A., Brumback, B., and Robins, J. M. (2001). Marginal structural models to estimate the joint causal effect of nonrandomized treatments. Journal of the American Statistical Association, 96:440–448.
Hernán, M. A. and Robins, J. M. (2006). Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60:578–586.
Hoshino, A. (2008). A Bayesian propensity score adjustment for latent variable modeling and MCMC algorithm. Computational Statistics & Data Analysis, 52:1413–1429.
Ionides, E. L. (2008). Truncated importance sampling. Journal of Computational and Graphical Statistics, 17:295–311.
Kaplan, D. and Chen, J. (2012). Two-step Bayesian approach for propensity score analysis: simulations and case study. Psychometrika, 77:581–609.
Lunn, D., Best, N., Spiegelhalter, D., Graham, G., and Neuenschwander, B. (2009). Combining MCMC with "sequential" PKPD modelling. Journal of Pharmacokinetics and Pharmacodynamics, 36:19–39.
McCandless, L. C., Douglas, I. J., Evans, S. J., and Smeeth, L. (2010). Cutting feedback in Bayesian regression adjustment for the propensity score. The International Journal of Biostatistics, 6.
McCandless, L. C., Gustafson, P., and Austin, P. C. (2009a). Bayesian propensity score analysis for observational data. Statistics in Medicine, 28:94–112.
McCandless, L. C., Gustafson, P., Austin, P. C., and Levy, A. R. (2009b). Covariate balance in a Bayesian propensity score analysis of beta blocker therapy in heart failure patients. Epidemiologic Perspectives & Innovations, 6.
McCandless, L. C., Richardson, S., and Best, N. (2012). Adjustment for missing confounders using external validation data and propensity scores. Journal of the American Statistical Association, 107:40–51.
Newton, M. A. and Raftery, A. E. (1994). Approximating Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B, 56:3–48.
Robins, J. M., Hernán, M. Á., and Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11:550–560.
Robins, J. M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine, 16:285–319.
Rose, S. and van der Laan, M. J. (2008). Simple optimal weighting of cases and controls in case-control studies. The International Journal of Biostatistics, 4. doi:10.2202/1557-4679.1115.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.
Røysland, K. (2011). A martingale approach to continuous-time marginal structural models. Bernoulli, 17:895–915.
Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics, 9:130–134.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91:473–489.
Saarela, O., Arjas, E., Stephens, D. A., and Moodie, E. E. M. (2015a). Predictive Bayesian inference and dynamic treatment regimes. Biometrical Journal.
Saarela, O., Moodie, E. E. M., Stephens, D. A., and Klein, M. B. (2015b). On Bayesian estimation of marginal structural models. Biometrics. doi:10.1111/biom.12269.
Saarela, O., Moodie, E. E. M., Stephens, D. A., and Klein, M. B. (2015c). Rejoinder: On Bayesian estimation of marginal structural models. Biometrics. doi:10.1111/biom.12274.
Scharfstein, D. O., Rotnitzky, A., and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94:1096–1146.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, New York, NY.
Walker, S. G. (2010). Bayesian nonparametric methods: motivation and ideas. In Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G., editors, Bayesian Nonparametrics. Cambridge University Press, Cambridge, UK.
Wang, C., Parmigiani, G., and Dominici, F. (2012). Bayesian effect estimation accounting for adjustment uncertainty. Biometrics, 68:661–671.
White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica, 50:1–25.
Xiao, Y., Moodie, E. E. M., and Abrahamowicz, M. (2013). Comparison of approaches to weight truncation for marginal structural Cox models. Epidemiologic Methods, 2. doi:10.1515/em-2012-0006.
Zhang, G. and Little, R. (2009). Extensions of the penalized spline of propensity prediction method of imputation. Biometrics, 65:911–918.
Zigler, C. M. and Dominici, F. (2014). Uncertainty in propensity score estimation: Bayesian methods for variable selection and model-averaged causal effects. Journal of the American Statistical Association, 109:95–107.
Zigler, C. M., Watts, K., Yeh, R. W., Wang, Y., Coull, B. A., and Dominici, F. (2013). Model feedback in Bayesian propensity score estimation. Biometrics, 69:263–273.