Flexible sensitivity analysis for observational studies without observable implications∗
Alex Franks† (UCSB) · Alex D’Amour† (UC Berkeley) · Avi Feller (UC Berkeley)

This draft: January 15, 2019
Abstract
A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey’s factorization, to the causal inference setting. Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit an unidentified relationship between the treatment assignment indicator and unobserved potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data.
Keywords: Observational studies; sensitivity analysis; Tukey’s factorization; latent confounder; Bayesian inference.

∗ Alexander M. Franks is an Assistant Professor of Statistics at the University of California, Santa Barbara ([email protected]). Alexander D’Amour is now a Research Scientist at Google AI, Cambridge, MA ([email protected]); most of this work was completed when he was a Neyman Visiting Assistant Professor of Statistics at the University of California, Berkeley. Avi Feller is an Assistant Professor of Public Policy and Statistics at the University of California, Berkeley ([email protected]). We thank Edo Airoldi, Carlos Cinelli, Peng Ding, Sharad Goel, Chad Hazlett, Jennifer Hill, Jared Murray, Sam Pimentel, Don Rubin, and Ravi Shroff as well as participants at ACIC 2018 for thoughtful comments and discussion.
† These authors contributed equally to this work.

1 Introduction
Causal inference generally requires two distinct elements: modeling observed potential outcomes and making assumptions about missing potential outcomes. While researchers can investigate the first element with standard model-checking techniques, the data are uninformative about the second element, which can only be probed via sensitivity analysis. In principle, a sensitivity analysis quantifies how results change under different assumptions about unobserved potential outcomes, without affecting the observed data model (Linero and Daniels, 2017; Gustafson et al., 2018). Unfortunately, such “clean” sensitivity analyses are often difficult to construct under common sensitivity analysis frameworks, sometimes leading to analyses where assumptions about the sensitivity analysis inadvertently impose constraints on the observed data model. This tension puts sensitivity analysis at odds with modern causal inference approaches that incorporate flexible observed data models, such as recent approaches that feature Bayesian nonparametric regression or other machine learning methods (Hill, 2012; Athey and Wager, 2017; Hahn et al., 2017).

In this paper, we propose a sensitivity analysis framework for observational studies that cleanly separates model checking from sensitivity analysis. Specifically, we propose a factorization of the joint distribution of potential outcomes, known as
Tukey’s factorization, that separates factors into those that are nonparametrically identified and those that are completely unidentified from the data. This factorization expresses the unidentified factors in terms of selection functions, which are easily interpreted, and a copula that characterizes the dependence between potential outcomes, which we can ignore for a broad class of common estimands. Taken together, we can then interpret the implied distribution on missing potential outcomes as an extrapolation from the observed data distribution.

Our main contribution is to develop a practical workflow for this factorization in causal inference, thus extending Tukey’s factorization, which has a long history in the missing data literature, to the observational study setting. First suggested by John Tukey (recorded in Holland, 1986), variants of the approach have been known by a number of names, including exponential tilting (Birmingham et al., 2003; Rotnitzky et al., 2001; Scharfstein et al., 1999), non-parametric (just) identified (NPI) models (Robins et al., 2000), the extrapolation factorization (Linero and Daniels, 2017), and Tukey’s factorization (Franks et al., 2016). In this paper we use the term “Tukey’s factorization” since our approach explicitly adapts the method proposed by Franks et al. (2016) to observational causal inference. Tukey’s factorization also bears close resemblance to missing data methods based on weighting, including importance weight methodology (Riddles et al., 2016) and inverse probability weighting models for sensitivity analysis (e.g., Zhao et al., 2017). While Robins et al. (2000) also proposed to apply Tukey’s approach to unobserved confounding in observational studies, applications of their proposal appear to be limited to the context of clinical trials with dropout (Rotnitzky et al., 2001; Scharfstein et al., 2003).

The workflow we propose has a number of practical strengths for assessing sensitivity to unmeasured confounding.
First, the framework naturally accommodates a large class of commonly-applied observed data models, including models for complex data. We demonstrate this flexibility by conducting sensitivity analysis with a range of modern statistical models, including Bayesian Additive Regression Trees (BART) for response surface estimation (Dorie et al., 2016) and Dirichlet process mixture (DPM) models for flexible residual distributions. We also demonstrate the use of the factorization with semi-continuous data using zero-inflated models. Second, the framework is computationally cheap. Because the factorization separates observed data modeling from sensitivity parameters, an investigator only needs to fit the observed data model once, and can apply our sensitivity analysis post hoc. Finally, the sensitivity parameters defined in our framework are easily interpreted, facilitating model specification and parameter calibration. In particular, the sensitivity parameters can be calibrated against variation explained by covariates in a standard propensity score model, which is already familiar to many investigators.

The paper proceeds as follows. Section 2 introduces the formal setup for sensitivity analysis and highlights key issues involving identifiability of sensitivity parameters. Section 3 introduces Tukey’s factorization for causal inference and provides theoretical justification for applying the approach to a common set of estimands. Section 4 defines a flexible, convenient model specification, the logistic-mixture exponential family model, and explores its technical properties. Section 5 discusses heuristics for interpreting and calibrating the sensitivity parameters in this model. Section 6 applies Tukey’s factorization to two examples, demonstrating the flexibility of the approach on a range of estimands with several Bayesian nonparametric estimation approaches. Finally, Section 7 discusses open issues and possible extensions.
The appendix contains technical details and some additional results from the applied analyses.
We describe our approach using the potential outcomes framework (Neyman, 1923; Rubin, 1974). For outcome Y_i ∈ Y for unit i and binary treatment, let Y_i(0) and Y_i(1) denote that unit’s potential outcomes if assigned to control or treatment, respectively. Let T_i denote a binary treatment indicator and X_i denote observed covariates. For compactness, we often write Y_i(t), t ∈ {0, 1}, to denote the outcome for treatment level t. Assuming SUTVA (Rubin, 1980), we can then write the observed outcome as Y_i^obs = T_i Y_i(1) + (1 − T_i) Y_i(0). Finally, we write the propensity score as e(x) = P(T_i = 1 | X_i = x). We assume an infinite population of iid units, from which we sample triplets ([Y_i(0), Y_i(1)], T_i, X_i) and observe triplets (Y_i^obs, T_i, X_i). Given this, we suppress the subscript i unless otherwise noted.

We focus on two classes of population estimands. First, we consider average treatment effects on the whole population as well as on the treated and control populations:

    τ_ATE := E[Y(1) − Y(0)] = E[Y(1)] − E[Y(0)]
    τ_ATT := E[Y(1) − Y(0) | T = 1] = E[Y(1) | T = 1] − E[Y(0) | T = 1]
    τ_ATC := E[Y(1) − Y(0) | T = 0] = E[Y(1) | T = 0] − E[Y(0) | T = 0].

Second, we consider quantile treatment effects,

    τ_q = Q_q(Y(1)) − Q_q(Y(0)),

the difference in the q-th treatment quantile, Q_q(Y(1)), and control quantile, Q_q(Y(0)).

Each of these estimands is a contrast between the marginal complete-data outcome distributions f(Y(1)) and f(Y(0)). For each t, the complete-data distribution for each potential outcome can be written as a mixture of the distributions of observed and missing outcomes:

    f(Y(t) | X) = f(T = t | X) f_t^obs(Y(t) | T = t, X) + f(T = 1 − t | X) f_t^mis(Y(t) | T = 1 − t, X).
All factors in this expression are identified except for f_t^mis, which is completely uninformed by the data. Identifying these estimands thus requires untestable assumptions that characterize f_t^mis. Sensitivity analysis probes the robustness of estimates to these assumptions. Specifically, we consider sensitivity analyses for observational studies where the investigator wishes to test robustness to violations of the unconfoundedness assumption.

Assumption 1 (Unconfoundedness). [Y(0), Y(1)] ⊥⊥ T | X.

Unconfoundedness implies that f_t^obs = f_t^mis, and is thus sufficient to identify our estimands of interest. Broadly, sensitivity analysis for violations of unconfoundedness proceeds by parameterizing the conditional dependence between partially-observable potential outcomes [Y(0), Y(1)] and T given covariates X. The parameters of this dependence are called sensitivity parameters. Investigators can then report how causal effect estimates change when the sensitivity parameters are allowed to vary within a selected range of plausible values. The plausibility of the specific values of sensitivity parameters is ultimately determined externally to the data analysis, e.g., by domain expertise.

We propose a method for model-based sensitivity analysis that explicitly factorizes the joint distribution of ([Y(0), Y(1)], T | X) in terms of the observed outcome distributions f_t^obs. Specifically, we parameterize the missing outcome distributions as a tilt of the observed outcome distribution:

    f_{ψ,t}^mis(Y(t) | T = 1 − t, X) ∝ [ f_ψ(T = 1 − t | Y(t), X) / f_ψ(T = t | Y(t), X) ] · f_t^obs(Y(t) | T = t, X).    (1)

We explore key features of this factorization in subsequent sections. This specification defines the complete-data distributions f_ψ(Y(t)) implicitly.
The fraction can be interpreted as importance weights that transform the observed outcome distribution f_t^obs into the missing outcome distribution f_{ψ,t}^mis, parameterized by ψ. Notably, the observed data distributions in this approach, f_t^obs, are free of the sensitivity parameter ψ, which implies a clean separation of model checking and sensitivity analysis. In practice, sensitivity analysis with Equation (1) involves a number of non-trivial implementation details, including imposing constraints on normalizing constants and setting and interpreting sensitivity parameters.

There is an extensive literature on assessing sensitivity to departures from unconfoundedness, dating back at least to the foundational work of Cornfield et al. (1959) on the link between smoking and lung cancer. The approach we take in this paper fits broadly within the model-based sensitivity analysis framework, largely building on Rosenbaum and Rubin (1983). Examples of alternative sensitivity analysis approaches include Rosenbaum and Silber (2009), Díaz and van der Laan (2013), Ding and VanderWeele (2016), and Zhao et al. (2017).

Our approach most closely follows recent proposals to enable model-based sensitivity analysis with flexible outcome models, especially Dorie et al. (2016), who use Bayesian Additive Regression Trees (BART) models for flexible outcome modeling (see also Carnegie et al., 2016). We contrast their approach with ours in Section 6.1. Similarly, Jung et al. (2018) combine flexible, black-box modeling with sensitivity analysis in the setting of algorithmic decision making. These methods extend the so-called latent confounder approach to sensitivity analysis (Rosenbaum and Rubin, 1983) by specifying a parametric but flexible latent variable model for the complete data. The latent confounder approach is highly intuitive, and in cases where there is strong scientific prior knowledge, may be preferable to the approach we propose here.
For example, when the parameters in the model are scientifically meaningful, this formulation may be necessary to elicit priors for these parameters. As we discuss next, however, such models can introduce issues around identified sensitivity parameters. While some modern approaches incorporate parameterizations that mitigate these identification concerns, the degree to which sensitivity parameters are decoupled from the observed data model can be difficult to assess in practice.

A number of previous methods have also sought to separate model checking from sensitivity analysis in the context of missing data. Specifically, our approach is most similar to the non-parametric (just) identified (NPI) models proposed by Robins et al. (2000), so called because they put no constraints on the observed data model, but identify a complete-data distribution when the sensitivity parameters are fixed. NPI models were proposed using a variant of Tukey’s factorization, but have been applied primarily to longitudinal missing data problems and clinical trials with dropout; see, e.g., Rotnitzky et al. (2001) and Scharfstein et al. (2003). Robins et al. (2000) suggested that NPI models could be applied to observational studies with unobserved confounding, but to the best of our knowledge, did not pursue this method in applications. Our paper builds a practical sensitivity analysis method based on this theoretical suggestion, in part by proposing a specification that reduces some of the technical hurdles in the original Robins et al. (2000) proposal; see Section 3.

Our method is also related to a line of Bayesian missing data methods summarized by Linero and Daniels (2017). These methods specify a Bayesian model for observed data, and explore a set of “identifying restrictions” that map the observed data model to complete data models. Tukey’s factorization in (1) is an example of an identifying restriction, although the authors generally choose alternative restrictions in their applications.
Similar to the NPI literature, these methods do not appear to have been applied to observational studies, with their causal applications restricted to clinical trials with dropout (Linero and Daniels, 2015; Linero, 2017). Finally, in a Bayesian approach that aligns well with ours, Hahn et al. (2016) propose flexible, nonparametric models that separate the observed data model from sensitivity parameters, also in the context of missing data.

[Footnote: For example, the model in Jung et al. (2018) has sensitivity parameters that depend only weakly on the observed data because they focus exclusively on the setting with a binary outcome and employ sensitivity parameters that vary smoothly across covariates.]

2.4 Separating Sensitivity Analysis from Model Checking

A prime motivation for our sensitivity analysis framework is to separate sensitivity analysis from model checking, that is, from assessing the fit of different modeling assumptions to the observed data. In practice, many popular approaches to sensitivity analysis blur the line between sensitivity analysis and model checking by introducing sensitivity parameters that are informed by the observed data. Sensitivity analyses of this type can be difficult to interpret, however. Such methods typically relax the unconfoundedness assumption by introducing a parametric specification, which can in turn fundamentally change the relationship between confounding and identification. This can also introduce new types of sensitivity that go unexamined by the analysis itself, resulting in conclusions whose credibility depends on strong prior knowledge about the parametric specification. We demonstrate some of these pathologies in the context of a latent confounder model in the following example.
Example 1 (Normal Outcome, Binary Confounder). Suppose we have a study with a continuous outcome and no covariates. We make the assumption that treatment was randomly assigned according to a Bernoulli design, but it is plausible that there exists a latent class that confounds the study. To test the robustness of our conclusions to the presence of such a latent class, we propose a sensitivity analysis by introducing a binary latent confounder. The model is parameterized as follows:

    U ∼ Bern(ξ_u)
    T | U ∼ Bern(g(α + ψ_T U))
    Y(t) | U ∼ N(µ_t + ψ_Y U, σ²).

When (ψ_T, ψ_Y) = (0, 0), the model reduces to random assignment to treatment. Importantly, the observed data distribution depends on the sensitivity parameters ψ. To see this, let h_{ψ_T}(t) := f(U = 1 | T = t). The distribution of observed outcomes is a two-component mixture of normals for t ∈ {0, 1}:

    Y(t) | T = t ∼ h_{ψ_T}(t) N(µ_t + ψ_Y, σ²) + (1 − h_{ψ_T}(t)) N(µ_t, σ²).    (2)

The observed and missing potential outcome distributions, f(Y(t) | T = t, X) and f(Y(t) | T = 1 − t, X), respectively, have the same mixture components but different mixture weights, where ψ_Y determines the difference between the component means. Importantly, the mixture weights h_{ψ_T}(t) and component means are identifiable under relatively weak assumptions (see Everitt, 1985).

Figure 1: Illustrations of sensitivity parameter identification in Example 1, via variation in posterior predictive distributions at different sensitivity parameter settings. We plot the true (blue) and inferred (red) observed control outcome densities, f(Y(0) | T = 0), implied by the normal outcome-binary latent confounder model with sensitivity parameters ψ_T and ψ_Y. ∆ψ_T and ∆ψ_Y represent the difference between the true and assumed sensitivity parameters. The observed data densities are only correctly inferred when we assume the true values of the sensitivity parameters (i.e.,
∆ψ_T = ∆ψ_Y = 0). Note that the true potential outcome distributions Y(0) are a mixture of normal distributions with the same component means and different variances.

In Figure 1, we demonstrate a range of data fits obtained from this model. As is common practice, we fit the model multiple times for a range of sensitivity parameters and plot the true (blue) and inferred (red) observed potential outcome densities for the control potential outcomes. Most settings of the sensitivity parameters result in severe misfit of the observed data, and would likely be rejected by a competent analyst if they were considered the “main” model for the observed data. In this case, the sensitivity analysis operates as a model checking exercise more than an exploration of relaxed identification assumptions. In fact, based on this parametric specification, the investigator could reject the hypothesis that the study satisfies unconfoundedness (Assumption 1) based on model fit. In the case where the investigator had strong scientific basis to believe the parametric form of the latent confounder model, this would be useful information. However, in the absence of a priori knowledge about this specification, the unconfoundedness assumption is not tested in full generality.

More elaborate forms of this model appear throughout the literature, including extensions that incorporate covariates and nonparametric regression models (Imbens, 2003; Dorie et al., 2016). Under these specifications, the residuals of the observed outcomes Y(t) − E[Y(t) | X, T = t] are distributed as a mixture of normals, and the identification issue noted in this example remains.

[Footnote: In this simulation we set ξ_u = 1/ , α = 0, µ_t = 0, ψ_T = − , ψ_Y = 5 and σ = 1.]

Relatedly, a post hoc sensitivity analysis raises the spectre of data snooping. In a valid sensitivity analysis, the investigator can tune an observed data model in a held-out sample.
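The identification pathology in Example 1 is easy to see in simulation. The following is a minimal sketch with illustrative parameter values (not necessarily those used for Figure 1, some of which are elided above): the observed control-arm distribution shifts as ψ_T varies, so the observed data are informative about the sensitivity parameter.

```python
import numpy as np

rng = np.random.default_rng(2)

def observed_control_mean(psi_t, psi_y=5.0, xi_u=0.5, n=200_000):
    """Mean of Y(0) | T = 0 under the binary-latent-confounder model
    of Example 1 (illustrative values: alpha = 0, mu_0 = 0, sigma = 1)."""
    u = rng.binomial(1, xi_u, size=n)
    p = 1.0 / (1.0 + np.exp(-psi_t * u))   # g = inverse logit, alpha = 0
    t = rng.binomial(1, p)
    y0 = rng.normal(psi_y * u, 1.0)
    return y0[t == 0].mean()

m_null = observed_control_mean(psi_t=0.0)   # unconfounded: mean near 2.5
m_conf = observed_control_mean(psi_t=-3.0)  # control arm over-represents U = 1
```

With ψ_T = 0 the control arm mixes U = 0 and U = 1 in their population proportions; with ψ_T = −3 units with U = 1 are pushed into the control arm, visibly shifting the observed control mean. This is exactly the sense in which the sensitivity parameters here are identified by the observed data.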
By contrast, in a sensitivity analysis with identified sensitivity parameters, the investigator can instead choose the model fit that is the most favorable to their conclusions, and declare it to be the “main” model whose robustness is being checked (Rosenbaum, 2017, p. 172, “What Is Not a Sensitivity Analysis?”). Finally, in the context of Bayesian sensitivity analysis, Gustafson et al. (2018) note that identified sensitivity parameters introduce a tension between the prior information used to calibrate the sensitivity parameters and the information from the observed data that makes posterior uncertainties difficult to interpret. For example, in the context of a clinical trial with dropout, Scharfstein et al. (2003) consider and reject a parametric complete-data model with an identified sensitivity parameter because, under this model, the observed data fit implies implausibly large selection bias.

We now turn to applying Tukey’s factorization to assessing sensitivity to unobserved confounding, paying special attention to the structural differences between this new setting and the more common missing data setting. We show that these structural differences imply a distinct Tukey factorization for causal inference, which includes a copula that characterizes the dependence between potential outcomes. Importantly, we show that there is a broad class of estimands, including the ATE and QTE, whose sensitivity does not depend on this copula. For these estimands, sensitivity analysis can proceed by simply treating the observational study as though it were two separate missing data problems. In short, this result can be summarized as: “causal inference is missing data twice”.
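Before developing the one-arm case formally, the tilt in Equation (1) can be checked numerically in a toy missing data problem (a sketch with assumed values, not a model from the paper): when the joint distribution of (Y(1), T) is known by construction, re-weighting draws of Y(1) | T = 1 by the selection ratio f(T = 0 | y)/f(T = 1 | y) recovers the distribution of Y(1) | T = 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy joint: Y(1) ~ N(0, 1), P(T = 1 | Y(1) = y) = 1/(1 + exp(-y)).
n = 400_000
y = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-y)))

y_obs = y[t == 1]        # draws from f_obs(Y(1) | T = 1)
w = np.exp(-y_obs)       # f(T = 0 | y) / f(T = 1 | y) under logistic selection

# Self-normalized tilted mean vs. the directly simulated missing-arm mean.
tilted_mean = np.sum(w * y_obs) / np.sum(w)
direct_mean = y[t == 0].mean()
```

Here the “missing” arm can be simulated directly only because the joint is synthetic; in a real analysis the tilt is the only route to f^mis, which is precisely why the selection ratio carries all of the unidentified structure.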
We begin by demonstrating Tukey’s factorization on only one arm of an observational study. Specifically, we examine the joint distribution of one potential outcome Y(t) and the treatment indicator T. This case is analogous to a missing data problem, where Tukey’s factorization has been applied previously (Birmingham et al., 2003; Scharfstein et al., 1999; Franks et al., 2016; Linero and Daniels, 2017). Tukey’s factorization of this joint distribution is

    f_ψ(Y(t), T | X) = f^obs(Y(t) | T = t, X) f(T = t | X) · [ f_ψ(T | Y(t), X) / f_ψ(T = t | Y(t), X) ].    (3)

Here, the first two factors constitute the observed data density, which is nonparametrically identified, while the final factor is determined by the selection function, which is unidentified but easily interpreted. In our approach, we parameterize the selection function with sensitivity parameters ψ. Thus, the unidentified selection function fully determines the relationship between the observed outcome distribution and the distribution of missing outcomes.

The validity of this factorization requires two technical conditions. First, following Robins et al. (2000), we need to impose some mild restrictions on the selection factors.

Condition 1 (Integral constraints). For each t, the normalizing constant in (1) satisfies:

    ∫_Y f^obs(Y(t) | T = t, X) · [ f_ψ(T = 1 − t | Y(t), X) / f_ψ(T = t | Y(t), X) ] dY(t) = f(T = 1 − t | X) / f(T = t | X).    (4)

This condition arises by integrating both sides of (3) with respect to Y(t). This condition restricts the class of selection functions that can be used within our framework.
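Condition 1 can be made concrete with a logistic selection function (a toy numerical sketch under assumed values; logistic specifications are developed properly in Section 4): fixing the selection slope γ, the constraint (4) determines the intercept α uniquely, and solving its sample analogue recovers the intercept used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed data-generating process: Y(1) ~ N(0, 1),
# P(T = 1 | y) = 1/(1 + exp(-(alpha_true + gamma * y))), no covariates.
alpha_true, gamma = -0.5, 1.0
n = 400_000
y = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_true + gamma * y))))

y_obs = y[t == 1]                               # f_obs(Y(1) | T = 1) draws
odds_ctrl = (t == 0).mean() / (t == 1).mean()   # f(T = 0) / f(T = 1)

# Under logistic selection the constraint (4) reads
#   exp(-alpha) * E_obs[exp(-gamma * Y)] = f(T = 0) / f(T = 1),
# which has a unique solution in alpha given gamma:
alpha_hat = np.log(np.mean(np.exp(-gamma * y_obs))) - np.log(odds_ctrl)
```

Because the constraint holds in population for the true intercept, the plug-in solution `alpha_hat` should land near `alpha_true` in a large sample, illustrating how (4) ties the unidentified selection function to identified marginal quantities.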
While this restriction has led to technical challenges explored in earlier work (see Robins et al., 2000, for additional discussion), the parameterizations we propose in Section 4 satisfy this constraint automatically.

Second, we need an outcome overlap condition, which ensures that the missing data distribution can be expressed as an extrapolation of the observed data distribution.

Condition 2 (Outcome Overlap Condition). The support of the missing potential outcomes is a subset of the support of the observed potential outcomes. That is, P(Y(t) ∈ A | T = 1 − t, X) > 0 ⇒ P(Y(t) ∈ A | T = t, X) > 0 for all sets A in the outcome sample space Y.

As opposed to Condition 1, this condition is an assumption about the true data-generating process and cannot be enforced by model specification. We discuss this assumption further in Section 7.

We now extend Tukey’s factorization to the observational study setting. In particular, we show that the joint distribution of potential outcomes (Y(0), Y(1)) and treatment T can be uniquely specified by supplementing the distribution of the observed data with three unidentified models: a model for treatment given Y(1) alone; a model for treatment given Y(0) alone; and a copula that specifies the dependence between Y(0) and Y(1) given T.

Under Conditions 1 and 2, the joint density f(T, [Y(0), Y(1)] | X) can be decomposed into two univariate complete-data densities and a copula.
The derivation follows as a consequence of applying Tukey’s factorization to the marginal densities f(T, Y(0) | X) and f(T, Y(1) | X):

    f(T, [Y(0), Y(1)] | X)
      = f(Y(0) | T, X) · f(Y(1) | T, X) · f(T | X) · [ f(Y(0), Y(1) | T, X) / ( f(Y(0) | T, X) f(Y(1) | T, X) ) ]
      = [ f(Y(0), T | X) / f(T | X) ] · [ f(Y(1), T | X) / f(T | X) ] · [ f(Y(0), Y(1) | T, X) / ( f(Y(0) | T, X) f(Y(1) | T, X) ) ] · f(T | X)
      = f(Y(0) | T = 0, X) f(T = 0 | X) · [ f(T | Y(0), X) / f(T = 0 | Y(0), X) ]
        · f(Y(1) | T = 1, X) f(T = 1 | X) · [ f(T | Y(1), X) / f(T = 1 | Y(1), X) ]
        · [ 1 / f(T | X) ] · c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X),    (5)

where we use F to represent a cumulative distribution function. We now define several important terms. First, we define the marginal selection factors f(T | Y(1), X) and f(T | Y(0), X), which specify the non-ignorable selection mechanism in each arm. Second, we define the conditional copula

    c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X) = f(Y(0), Y(1) | T, X) / ( f(Y(0) | T, X) f(Y(1) | T, X) )
as the copula density that characterizes the residual dependence between potential outcomes conditional on the assigned treatment. Apart from the marginal selection factors and the conditional copula, all other terms in (5) are identifiable.

Equation (5) implies that the density of the missing potential outcomes, conditional on the observed potential outcomes, is

    f^mis(Y(t) | T = 1 − t, Y(1 − t), X) ∝ f^obs(Y(t) | T = t, X) · [ f(T = 1 − t | Y(t), X) / f(T = t | Y(t), X) ] · c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X).    (6)

We can then use this factorization to estimate the complete joint distribution, f(T, [Y(0), Y(1)] | X), and, in turn, estimate causal estimands of interest.

So far, we have described the full set of unidentified factors necessary to specify the joint density of f(T, [Y(0), Y(1)] | X). We now show that for a broad set of estimands, one need not specify the conditional copula to construct a well-defined sensitivity analysis. These results give us license to conduct sensitivity analysis for such estimands as though the observational study were two independent missing data problems. We refer to the general class of estimands for which this invariance holds as marginal contrasts.

Definition 1.
A causal estimand τ is a marginal contrast estimand iff it can be completely characterized as a function of the marginal distributions of the potential outcomes f(Y(0) | X) and f(Y(1) | X). Formally, letting τ be a functional of the joint distribution f(T, [Y(0), Y(1)] | X), with slight abuse of notation, a marginal contrast satisfies τ(f(T, [Y(0), Y(1)] | X)) = τ(f(Y(0) | X), f(Y(1) | X)).

This class includes common estimands such as the Average Treatment Effect, E[Y(1)] − E[Y(0)], and the Quantile Treatment Effect, τ_q = Q_q(Y(1)) − Q_q(Y(0)). This class, however, excludes estimands that depend on the joint distribution of potential outcomes, such as the proportion of units with positive treatment effects, the variance of the treatment effect, and unit-specific treatment effects (see, for example, Heckman et al., 1997; Ding et al., 2018).

Theorem 1.
Suppose the joint distribution f(T, [Y(0), Y(1)] | X) admits Tukey’s factorization in (5). Further, suppose that τ is a marginal contrast estimand. Then τ is uniquely defined by the marginal selection factors f(T | Y(1), X) and f(T | Y(0), X).

Proof. See appendix.

Stated differently, marginal contrast estimands are invariant to the conditional copula c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X). In likelihood-based estimation, the invariance of the estimand translates to invariance in estimation. In the special case of Bayesian inference, the following corollary establishes when the posterior distribution for marginal contrast estimands exhibits invariance to the specification of the copula. This invariance holds under a distinct parameters condition that is common in Bayesian formulations of ignorability (Gelman et al., 2013).

Corollary 1.1.
In the setting of Theorem 1, suppose, in addition, that the parameters of the conditional copula c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X) are distinct, i.e., a priori independent, of the parameters of all other factors in (5). Then the posterior distributions for marginal contrast estimands are invariant to the specification of the conditional copula in the model likelihood.

This establishes that the selection factors in each treatment arm are the only unidentifiable factors that need to be specified when estimating marginal contrast estimands by Bayesian inference. Obviating the need for the copula is useful in practice because the dependence between potential outcomes is unobservable, even in experiments, and thus assumptions about this dependence are difficult to calibrate. However, one would need to wrestle with this complication for some of the more general sets of estimands mentioned above (e.g., unit-specific treatment effects). These estimands, which are not marginal contrasts, arise for instance when estimating optimal treatment regimes (Klausch et al., 2018). We leave sensitivity analysis for estimands of this type to future work.
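Theorem 1 can be illustrated numerically with a toy sketch (assumed marginals, unrelated to any particular application): two joint distributions of (Y(0), Y(1)) with identical N(0, 1) and N(1, 1) marginals but different dependence structures yield the same ATE and QTE, while a non-marginal estimand such as Var(Y(1) − Y(0)) changes with the copula.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Same marginals, two copulas: perfectly dependent vs. independent.
z = rng.normal(size=n)
y0_dep, y1_dep = z, z + 1.0
y0_ind, y1_ind = rng.normal(size=n), rng.normal(size=n) + 1.0

# Marginal contrasts: identical (up to Monte Carlo error) across copulas.
ate_dep = y1_dep.mean() - y0_dep.mean()
ate_ind = y1_ind.mean() - y0_ind.mean()
qte_dep = np.quantile(y1_dep, 0.9) - np.quantile(y0_dep, 0.9)
qte_ind = np.quantile(y1_ind, 0.9) - np.quantile(y0_ind, 0.9)

# Not a marginal contrast: the treatment-effect variance differs sharply.
var_te_dep = (y1_dep - y0_dep).var()   # near 0 under perfect dependence
var_te_ind = (y1_ind - y0_ind).var()   # near 2 under independence
```

This is the practical content of the theorem: for ATE- and QTE-type estimands, the analyst can leave the copula unspecified without affecting the sensitivity analysis.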
We now propose a simple but widely applicable class of models that addresses several practical challenges with implementing Tukey’s factorization for causal inference. Specifically, we focus on logistic selection with mixtures of exponential families (logistic-mEF models). In these models, the marginal selection functions in each arm are specified as logistic in the potential outcomes, and the observed data is modeled with a mixture of exponential family distributions. Building on previous work, we show that this class of models automatically satisfies Condition 1 and yields missing potential outcome distributions that are analytically tractable. Importantly, logistic-mEF models include non-parametric observed data models like Dirichlet process mixtures (DPM) and BART. Thus, beyond being analytically convenient, the results in this section directly apply to sensitivity analysis with many flexible modeling tools already used in causal inference. We start by describing the setting in which the observed data densities belong to a single exponential family distribution, and then demonstrate how our formulation extends to mixtures of exponential families. We leave it to future work to explore specifications outside of the logistic-mEF family.
Under a logistic selection specification, we posit that the log-odds of receiving treatment are linear in some sufficient statistics of the potential outcomes, s_t(Y(t)):

f(T = 1 | Y(t), X) = logit^{-1}{α_t(X) + γ_t′ s_t(Y(t))},   (7)

where logit^{-1}(x) = (1 + exp(−x))^{-1}. This specification has sensitivity parameters γ = (γ_0, γ_1), which describe how treatment assignment depends marginally on each potential outcome, and a parameter α_t(X) in each arm that is identified by the observed data once γ_t is specified (discussed below). Logistic selection is commonly used in latent confounder approaches to model the probability of treatment given the unobserved confounder. Here, we take a similar approach but instead assume that the treatment probabilities are logistic in (sufficient statistics of) the potential outcomes. Throughout this paper we assume the sensitivity parameters are independent of covariates. We return to this point in the discussion.

Beyond their interpretability, logistic selection specifications have several desirable technical properties, some of which have been explored previously in different settings. Here, we adapt these results to our setting and notation. First, for any specification of γ_t that implies a proper missing data distribution, Condition 1 is automatically satisfied. In particular, Robins et al. (2000) show that in each arm t, for each γ_t, there exists a unique α_t(X) that satisfies Condition 1. Second, Rotnitzky et al. (2001) and Scharfstein et al. (2003) show that the missing data distribution implied by a logistic selection specification in each arm t is free of α_t(X). Thus, having specified γ_t, it is not necessary to solve for α_t(X) explicitly.
Specifically,

f_mis(Y(t) | T = 1 − t, X) = f_obs(Y(t) | T = t, X) exp(γ_t′ s_t(Y(t))) / C(γ_t, X),   (8)

where the normalizing constant C(γ_t, X) = ∫ exp(γ_t′ s_t(Y(t))) f_obs(Y(t) | T = t, X) dY(t). Finally, in the observational causal inference setting, because Condition 1 is satisfied independently for valid values of γ_t in each arm, the sensitivity parameter vectors γ_0 and γ_1 are variation independent.

The primary practical difficulty in applying the logistic selection specification is computing the normalizing constant C(γ_t, X) in (8) for each specification of γ_t. The normalizing constant is necessary for computing causal effects at each value of γ_t, as well as for calibrating γ_t, as we discuss in Section 5. Dealing with this normalizing constant in practice can require either computationally expensive calculation or strong restrictions on the observed data model. For example, Scharfstein et al. (2003) address the normalizing constant by either modeling the observed data with its empirical CDF, for which the integration is trivial, or, in the context of a more complex model, by using Markov chain Monte Carlo on the joint space of sensitivity parameters and parameters from the observed data model. In the next section, we show that logistic-mEF models largely avoid this difficulty.

We now show that sensitivity analysis with logistic specifications is especially convenient under the weak assumption that the observed data model belongs to the class of exponential family mixtures. In particular, we review results of Franks et al. (2016) showing that the mixture of exponential families assumption makes the normalizing constant in (8) analytically tractable.
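To make the tilt in Equation (8) concrete, it can be computed numerically for any observed-data density on a grid. The sketch below (function and variable names are ours, not from the paper) uses s(y) = y and tilts a standard normal observed density; completing the square shows the exact answer is a N(γ, 1) density, so the numerically tilted density should have mean γ and variance 1.

```python
import numpy as np

def tilted_missing_density(y, f_obs, gamma):
    """Equation (8) with s(y) = y: f_mis(y) is proportional to
    f_obs(y) * exp(gamma * y), normalized numerically on a uniform grid."""
    w = f_obs * np.exp(gamma * y)
    dy = y[1] - y[0]
    return w / (w.sum() * dy)

# Tilt a standard normal observed density with gamma = 0.5.
y = np.linspace(-10.0, 10.0, 40001)
dy = y[1] - y[0]
f_obs = np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)
f_mis = tilted_missing_density(y, f_obs, gamma=0.5)

mean_mis = (y * f_mis).sum() * dy
var_mis = ((y - mean_mis) ** 2 * f_mis).sum() * dy
print(round(mean_mis, 3), round(var_mis, 3))  # prints: 0.5 1.0
```

For densities without a closed-form tilt, this grid computation is exactly the expensive normalization step the logistic-mEF results below are designed to avoid.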
We first model the observed data as an exponential family distribution with natural parameter η_t(X), possibly depending on X:

f_obs_t(Y(t) | T = t, X) = h(Y(t)) g(η_t(X)) e^{s(Y(t))′ η_t(X)}.   (9)

A key result from Franks et al. (2016) shows that when the selection function is assumed to have the logistic form in (7), the missing data distribution belongs to the same exponential family as the observed data.

Proposition 1.
Assume that the observed data is an exponential family with density f_obs_t(Y(t) | T = t, X) = h(Y(t)) g(η_t(X)) e^{s(Y(t))′ η_t(X)}, with sufficient statistic s(Y(t)) and natural parameter η_t(X). Further, assume the probability of selection is logistic in that statistic, f(T = 1 | Y(t), X) = logit^{-1}{α_t(X) + γ_t′ s_t(Y(t))}, such that η*_t(X) = η_t(X) + γ_t lies in the natural parameter space of the exponential family of f_obs_t. Then the distribution of the missing potential outcomes in arm t is in the same exponential family as f_obs_t, and has density

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) = h(Y(t)) g(η*_t(X)) e^{s(Y(t))′ η*_t(X)}.   (10)

Proposition 1 follows because, with an exponential family outcome model, the normalizing constant in Equation (8) is analytically tractable, with C(γ_t, X) = g(η_t(X)) / g(γ_t + η_t(X)). In addition, the constraint that η*_t(X) lie in the natural parameter space ensures that the missing data distribution is proper. Given the missing data distribution in (10), the implied complete data density f_{γ_t,t}(Y(t)) can then be expressed as a simple mixture of the observed and missing components.

To provide some intuition for how the factorization operates under logistic selection with exponential families, we present two simple examples.

Example 2 (Binary observed outcomes). Let Y(t) ∈ {0, 1} and

f_obs_t(Y(t) | T = t, X) ∼ Bern(logit^{-1}(µ_t(X))),
f(T | X, Y(t)) ∝ Bern(logit^{-1}(α_t(X) + γ_t Y(t))).

By Proposition 1 the unobserved potential outcome distribution is:

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) ∼ Bern(logit^{-1}(µ_t(X) + γ_t)).

Thus, the extrapolation factorization with logistic treatment assignment applied to Bernoulli data implies an additive shift in the log-odds of the unobserved potential outcomes, relative to the observed potential outcome distribution.
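Example 2 can be verified directly: tilting a Bernoulli mass function by exp(γ y) on {0, 1} and renormalizing, as in Equation (8), reproduces the additive log-odds shift exactly. A minimal check (names are ours):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tilt_bernoulli(mu, gamma):
    """Tilt Bern(logit^{-1}(mu)) by exp(gamma * y), y in {0, 1},
    and renormalize, as in Equation (8)."""
    p = logistic(mu)
    num = p * math.exp(gamma)       # the mass at y = 1 gets weight e^gamma
    return num / (num + (1.0 - p))  # the mass at y = 0 is unweighted

# Example 2: the tilted success probability equals logit^{-1}(mu + gamma),
# since the odds p/(1-p) = e^mu are multiplied by e^gamma.
mu, gamma = 0.3, 1.2
print(tilt_bernoulli(mu, gamma), logistic(mu + gamma))
```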
Example 3 (Normal observed outcomes). Let Y(t) follow a Normal distribution. Because the Normal distribution is a two-parameter exponential family distribution with sufficient statistics s(Y(t)) = (Y(t), Y(t)²), we consider a treatment assignment mechanism that is quadratic in the potential outcomes:

f_obs_t(Y(t) | T = t, X) ∼ N(µ_t(X), σ_t²),
f(T | Y(t)) ∝ Bern(logit^{-1}(α_t(X) + γ_t Y(t) + ψ_t Y(t)²)).

By Proposition 1 the unobserved potential outcome distribution is:

f_mis_{(γ_t,ψ_t),t}(Y(t) | T = 1 − t, X) ∼ N( (µ_t(X) + γ_t σ_t²) / (1 − 2ψ_t σ_t²), σ_t² / (1 − 2ψ_t σ_t²) ).   (11)

This model has two sensitivity parameters for each treatment arm. Assuming that the logistic function is linear in Y(t), i.e. ψ_t = 0, implies that the standard deviations of the observed and missing potential outcome distributions are identical. In Section 6.1, we estimate µ_t(X) from the observed data using BART, which allows us to conduct a sensitivity analysis on potential outcomes with flexibly estimated response surfaces.

We now extend the formulation to the full class of logistic-mEF models, which include outcome models that are mixtures of exponential family distributions. Let

f_obs_t(Y(t) | η_t, T = t) = Σ_k π_k h_k(Y(t)) g_k(η_tk) e^{s(Y(t))′ η_tk}

be a mixture of exponential family distributions with mixture weights π_k and common sufficient statistics s(Y(t)). Then, the normalizing constant in Equation (8) is analytically tractable, with C(γ_t, X) = Σ_k π_k g_k(η_tk) / g_k(γ_t + η_tk). This leads to the following proposition:

Proposition 2.
Assume that the observed data follows a K-component mixture of exponential family distributions with density f_obs_t(Y(t) | η_t, T = t) = Σ_k π_k h_k(Y(t)) g_k(η_tk) e^{s(Y(t))′ η_tk}, with sufficient statistic s(Y(t)) common across all components. Further, assume the probability of selection is logistic in the sufficient statistic, f(T = 1 | Y(t), X) = logit^{-1}{α_t(X) + γ_t′ s_t(Y(t))}, such that, for each component k, η*_tk(X) = η_tk(X) + γ_t lies in the natural parameter space of the exponential family. Then the distribution of the missing potential outcomes in arm t is also a K-component mixture of the same exponential family distributions and has density

f_mis_{γ_t,t}(Y(t) | η_t, γ_t, T = 1 − t) = Σ_k π*_k h_k(Y(t)) g_k(η*_tk) e^{s(Y(t))′ η*_tk},   (12)

where η*_tk = η_tk + γ_t, and

π*_k = [π_k g_k(η_tk) / g_k(η*_tk)] / [Σ_{j=1}^{K} π_j g_j(η_tj) / g_j(η*_tj)].   (13)

Consider the following simple example.

Example 4 (Normal mixture observed outcomes). We extend the Normal model above to a finite mixture of Normal distributions, with ψ_t = 0:

f_obs_t(Y(t) | T = t, X) ∼ Σ_k π_k N(µ_tk(X), σ_tk²).

Then, by Proposition 1 and Equation (13),

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) ∼ Σ_k π*_k N(µ_tk(X) + γ_t σ_tk², σ_tk²),   (14)

with π*_k ∝ π_k exp(γ_t µ_tk(X) + γ_t² σ_tk² / 2).

Footnote: Note that in the quadratic model (ψ_t ≠ 0) we require ψ_t < 1/(2σ_t²) to ensure propriety of the missing potential outcome distribution (Condition 1).
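The Normal-mixture special case of Proposition 2 can be checked numerically: the analytic component shifts and reweighting must agree with brute-force tilting of the mixture density on a grid, per Equation (8). A sketch (all mixture values are hypothetical; the reweighting constants C_k = exp(γ µ_k + γ² σ_k²/2) follow from the Normal normalizer):

```python
import numpy as np

def normal_mix_pdf(y, pis, mus, sds):
    """Density of a finite normal mixture evaluated on a grid."""
    comps = [p * np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
             for p, m, s in zip(pis, mus, sds)]
    return np.sum(comps, axis=0)

def tilt_normal_mixture(pis, mus, sds, gamma):
    """Example 4: tilting by exp(gamma * y) shifts each component mean by
    gamma * sd_k^2 and reweights components by C_k = exp(gamma * mu_k +
    gamma^2 * sd_k^2 / 2), per Equation (13)."""
    pis, mus, sds = map(np.asarray, (pis, mus, sds))
    new_mus = mus + gamma * sds ** 2
    log_c = gamma * mus + 0.5 * gamma ** 2 * sds ** 2
    w = pis * np.exp(log_c - log_c.max())  # stabilized reweighting
    return w / w.sum(), new_mus, sds

pis, mus, sds, gamma = [0.4, 0.6], [-1.0, 2.0], [0.5, 1.0], 0.7
w, m, s = tilt_normal_mixture(pis, mus, sds, gamma)

# Brute-force check of Equation (8) on a grid.
y = np.linspace(-15.0, 15.0, 60001)
dy = y[1] - y[0]
f_num = normal_mix_pdf(y, pis, mus, sds) * np.exp(gamma * y)
f_num /= f_num.sum() * dy
f_ana = normal_mix_pdf(y, w, m, s)
print(float(np.max(np.abs(f_num - f_ana))))  # should be numerically tiny
```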
In this model, the sensitivity parameters γ_t affect both the mixture weights and the component means: the mixture weight for component k increases with γ_t µ_tk(X) + γ_t² σ_tk² / 2, so components whose means are large in the direction of γ_t are up-weighted.

In general, we can apply Tukey's factorization with logistic selection to observed data densities modeled with non-parametric Bayesian methods like Dirichlet process mixtures of any exponential family distribution, and can easily adapt the mixture results to model structured semi-continuous data as well. In Section 6.2, we demonstrate both of these features by modeling zero-inflated income data, and use a DPM model for the continuous component.

Since sensitivity parameters are not identified from the data, calibrating the magnitude of the parameters requires reasoning about plausible values using prior knowledge and domain expertise. We first discuss how to interpret the sign of the sensitivity parameters. We then turn to the magnitude of the sensitivity parameters and introduce a method to calibrate this quantity using information from observed covariates.
To interpret the sign of the sensitivity parameters, we return to the logistic selection model and consider the case where s(Y(t)) = Y(t); that is, the sufficient statistic in the logistic selection model is simply Y(t) itself. In this setting, γ_0 and γ_1 are scalars and the probability of assignment to treatment is f(T = 1 | Y(t), X) = logit^{-1}(α_t(X) + γ_t Y(t)).

With this setup, the sensitivity parameters (γ_0, γ_1) have a relatively straightforward interpretation: they specify how sufficient statistics of the potential outcomes are over- or under-represented among observed control and treated units. For example, γ_1 > 0 implies that units with large Y(1) are over-represented among treated units, and thus the observed treated unit average will be larger than the average of the unobserved treatment outcomes. Likewise, γ_0 > 0 implies that units with large values of Y(0) are more likely to be assigned to treatment, so the observed control unit average will be smaller than the average of the unobserved control outcomes. We give several examples to illustrate interpretation in practice.

Example 5 (Same Sign). Consider a study of a medical treatment with health outcome Y and unmeasured income variable U. Suppose that larger values of Y correspond to good outcomes, that treatment is expensive, so P(T = 1 | U) is increasing in U, and that higher income induces better outcomes, such that Y(t) = α + τt + ψU + ε with ψ > 0. In this study, both Y(0) and Y(1) are positively correlated with T, so large values of Y(1) are over-represented among the observed treated units, and large values of Y(0) are under-represented among control units, corresponding to positive γ_0 and γ_1. When γ_0 and γ_1 have the same sign, the treatment and control group means are biased in opposite directions, so the ATE changes rapidly as the magnitude of the sensitivity parameters increases.

Example 6 (Opposite Signs).
Consider the canonical "perfect doctor" example (Rubin, 2003), where a particular medical treatment is prescribed by a doctor who is able to perfectly predict Y(0) and Y(1) for each patient, and assigns the patient to whichever arm gives them the higher outcome. Suppose that in this population Y(1) − Y(0) is independent of [Y(1) + Y(0)]/2; that is, the individual treatment effect for each unit is unrelated to each unit's expected outcome under random assignment to treatment. For example, suppose that [Y(0), Y(1)] are independently normal. Then the control arm and the treated arm both over-represent high outcomes, so γ_0 < 0 and γ_1 > 0. When γ_0 and γ_1 have opposite signs, the treated and control group means are biased in the same direction, so the ATE changes slowly as the magnitude of the sensitivity parameters increases.

Example 7 (Single-Arm Confounding). Consider a study evaluating a job training program with wage outcome Y. Prior to the study, a subset of units is randomly given access to an alternative program that they can attend if they do not enroll in the treatment. Let U = 1 indicate access to this alternative program, and suppose that units with access to the alternative program are less likely to enroll in treatment, so P(T = 1 | U) is decreasing in U. Suppose that the alternative program is beneficial on average; for example, suppose the potential outcomes are given by Y(0) = α(X) + ψU + ε with ψ > 0 and Y(1) = α(X) + τ + ε. Then, the observed outcomes under treatment are representative of the distribution of Y(1), so γ_1 = 0, but, under control, units with higher wage outcomes are over-represented, so γ_0 < 0.
When only one of the sensitivity parameters is nonzero, only one of the group means is biased, so the ATE changes moderately when the magnitude of the nonzero sensitivity parameter increases.

5.2 Calibrating the magnitude of sensitivity parameters

We now turn to the more challenging task of calibrating the magnitude of sensitivity parameters. Our primary proposal is to calibrate the magnitude of sensitivity parameters to the amount of variation in the treatment assignment T that is explained by Y(t), above and beyond what is accounted for by X. To instantiate this idea, we adopt a version of the "implicit R²" measure from Imbens (2003), which generalizes variance-explained measures to the case of logistic regression. This approach is consistent with corresponding proposals for calibration in some latent confounder models (Imbens, 2003; Cinelli and Hazlett, 2018), with the important difference that our parameterization involves calibrating the variance explained by the potential outcomes rather than by a latent confounder.

To fix ideas, we will again focus on the simple case where the sufficient statistic is s(Y(t)) = Y(t) and γ_0 and γ_1 are scalars, so that, following Equation (7), the probability of assignment to treatment is f(T = 1 | Y(t), X) = logit^{-1}(α_t(X) + γ_t Y(t)). In the discussion that follows, we work on the logit scale and let m(X) = logit(e(X)) be the logit of the propensity score given variables X. To characterize variance explained, we first note that the logistic treatment model can be expressed using a latent variable formulation as

Z = m(X) + ε, with ε ∼ logistic(0, 1),
T = 0 if Z < 0, T = 1 if Z ≥ 0.

Since T is a deterministic function of the latent Z, it is sufficient to characterize how well Y(t) and X explain Z.
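The latent-variable representation can be checked by simulation: drawing ε from a standard logistic distribution and thresholding Z = m(X) + ε at zero recovers logit^{-1}(m(X)) as the treatment probability. A minimal sketch (the fixed value of m below is hypothetical, standing in for m(X) at one covariate value):

```python
import numpy as np

rng = np.random.default_rng(0)

m = 0.8                  # a fixed value of m(X), for illustration
n = 400_000
eps = rng.logistic(loc=0.0, scale=1.0, size=n)  # standard logistic noise
T = (m + eps >= 0.0)     # T = 1{Z >= 0}, with Z = m + eps

p_hat = T.mean()
p_theory = 1.0 / (1.0 + np.exp(-m))  # logit^{-1}(m)
print(round(p_hat, 3), round(p_theory, 3))  # the two should nearly agree
```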
Because the variance of a standard logistic random variable is π²/3, the total variance of Z is simply Var(m(X)) + π²/3. We then define the variance explained by X, and the partial variance explained by Y(t) given X, respectively, as

ρ_X = Var(m(X)) / (Var(m(X)) + π²/3),   0 ≤ ρ_X ≤ 1,   (15)

ρ_{Y(t)|X} = (ρ_{X,Y(t)} − ρ_X) / (1 − ρ_X),   0 ≤ ρ_{Y(t)|X} ≤ 1.   (16)

The partial variance explained by Y(t) given X, ρ_{Y(t)|X}, represents the fraction of previously unexplained variance in T that can now be explained by adding Y(t) to the set of predictors X.

With our method, we propose a target value, ρ*, for the unidentified ρ_{Y(t)|X}. Although the decision ultimately falls to domain expertise, we suggest using observed predictors to set the target ρ* by analogy. For example, for each covariate X_j, we can compute a partial variance explained by X_j given all other covariates X_{−j}, and set the target ρ* = ρ_{X_j|X_{−j}} for an appropriate covariate based on expert knowledge. We interpret this to mean that the information gained by adding Y(t) to X as a predictor of treatment assignment is comparable to the information gained by adding X_j to X_{−j}. We give concrete examples of this approach in the analysis in Section 6.1. In some cases, alternative calibration schemes may be more appropriate. For example, one could calibrate ρ_{Y(t)|X_A} for some core subset of covariates X_A ⊂ X using (18), replacing X with X_A.

Footnote: One evident approach, which we do not endorse here, is to reason about the magnitude of γ_t directly, calibrating it against regression coefficients inferred from observed covariates (Dorie et al., 2016; Blackwell, 2014; Middleton et al., 2016). The problem with this approach is that it can be difficult to interpret coefficients due to collinearity between and among X and Y(t); see, for example, Oster (2017). In contrast, our approach is robust to multicollinearity among the predictors. See Cinelli and Hazlett (2018) for additional discussion.
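Under the linear logistic selection above, these variance-explained quantities admit the closed-form calibration identity formalized in Proposition 3 below. A small numeric round trip (all values hypothetical) illustrates how a chosen γ maps to a partial variance explained and back:

```python
import math

PI2_OVER_3 = math.pi ** 2 / 3.0  # variance of the standard logistic

def rho_partial(var_m, sigma_r, gamma):
    """Partial variance in T explained by Y(t) given X, as in
    Equation (17): sigma_r^2 gamma^2 / (Var(m(X)) + pi^2/3 + sigma_r^2 gamma^2)."""
    num = (sigma_r * gamma) ** 2
    return num / (var_m + PI2_OVER_3 + num)

def gamma_from_rho(var_m, sigma_r, rho):
    """Invert the calibration identity, Equation (18), for |gamma_t|."""
    return math.sqrt(rho / (1.0 - rho) * (var_m + PI2_OVER_3)) / sigma_r

# Round trip: pick gamma, compute rho via (17), recover |gamma| via (18).
var_m, sigma_r, gamma = 1.5, 2.0, 0.35
rho = rho_partial(var_m, sigma_r, gamma)
print(round(rho, 4), round(gamma_from_rho(var_m, sigma_r, rho), 4))
# prints: 0.0928 0.35
```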
In Section 6.2, we take this approach and apply a calibration scheme setting X_A to be the empty set. See Cinelli and Hazlett (2018) for a discussion of related alternatives in the context of linear modeling.

Once a value of ρ* has been chosen, we must then identify the magnitude of the corresponding sensitivity parameter, γ_t. Since ρ_{Y(t)|X} is monotone in the magnitude of γ_t, there is a one-to-one mapping between the two quantities that can be used to identify γ_t. We formalize this in the following proposition:

Proposition 3 (Calibration Identity). Suppose that logit(e(X, Y(t))) is linear in Y(t), with the form in (7). Further, suppose that logit(e(X)) = m(X) and σ_rt = sqrt(E[Var(Y(t) | X)]). Then,

ρ_{Y(t)|X} = σ_rt² γ_t² / (Var(m(X)) + π²/3 + σ_rt² γ_t²),   (17)

and the inverse of this equation identifies the magnitude of γ_t:

|γ_t| = (1/σ_rt) sqrt( ρ_{Y(t)|X} / (1 − ρ_{Y(t)|X}) · (Var(m(X)) + π²/3) ).   (18)

Footnote: In homoscedastic models, σ_rt is simply the residual standard deviation of the potential outcome, sd(Y(t) | X), and is independent of X.

Proof.
See appendix.

We use (18) to translate any fixed target ρ* to a value for |γ_t|. Note that (18) only requires estimates of Var(m(X)) and σ_rt. Estimating Var(m(X)) is trivial using any propensity score model, but estimating the residual standard deviation σ_rt is slightly more nuanced. This is because σ_rt is a property of the complete distribution of the potential outcome, Y(t), which itself depends on γ_t. We write σ_rt(γ_t) to emphasize that the residual standard deviation depends on γ_t; σ_rt(γ_t) is available in analytic form for the mixture of exponential family models considered in Section 4. We then numerically solve

σ_rt(γ_t) |γ_t| = sqrt( ρ_{Y(t)|X} / (1 − ρ_{Y(t)|X}) · (Var(m(X)) + π²/3) )

for γ_t. In many of the observed-data model specifications discussed in this paper, the residual standard deviations in the observed and missing data densities are identical, so this recursive identification of γ_t is not necessary (e.g., as in Equation 11 for ψ_t = 0).

We now apply our approach to two examples. In the first example, we conduct sensitivity analysis for the effects of blood pressure medication. In this example, we use a nonparametric estimate of the response surface but assume normally distributed residuals. In the second example, we conduct sensitivity analysis on the effect of a job training program on income and employment. In this example, the outcome is zero-inflated, and we focus on quantile treatment effects rather than average treatment effects. In addition to demonstrating the flexibility in modeling structured data, we also show how Tukey's factorization can be applied to nonparametric residual models, by modeling the continuous component of the observed data densities using a Dirichlet process mixture of normal distributions. In both examples, we demonstrate the flexibility of our approach by estimating the posterior distribution of quantile treatment effects for a range of quantiles.
6.1 Analysis of NHANES data
We consider a study aimed at estimating the effect of 'taking two or more anti-hypertensives' on average diastolic blood pressure using data from the Third National Health and Nutrition Examination Survey (NHANES III) (Centers for Disease Control and Prevention (CDC), 1997), a comprehensive survey of Americans' health and nutritional status. We follow the same setup as Dorie et al. (2016), and utilize pre-treatment covariates like race, gender, age, income, body mass index (BMI), and whether the patient was insured, among others. We let Y(t) be the average diastolic blood pressure for a subject in treatment arm t, where t = 1 means the subject was taking two or more anti-hypertensive medications and t = 0 means the subject was not.

In this application, we show that our method works well with a nonparametric model for the response surfaces. First, we assume the following data generating model for the potential outcomes:

f_obs_t(Y(t) | T = t, X) ∼ N(µ_t(X), σ_t²),   (19)
(µ_t(X), σ_t²) ∼ BART(X | T = t),   (20)
f(T | Y(t), X) ∼ Bern(logit^{-1}(α_t(X) + γ_t Y(t))).   (21)

As shown in Section 4, the missing data distribution is therefore:

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) ∼ N(µ_t(X) + γ_t σ_t², σ_t²).   (22)

Again following Dorie et al. (2016), we use Bayesian Additive Regression Trees (BART; Chipman et al., 2010) to flexibly model µ_t(X) and σ_t². In contrast to their approach, we focus on using BART to estimate only the observed potential outcomes and do not incorporate latent confounders into the BART model. We use independent BART prior distributions for the mean surface, µ_t(X), for each treatment arm and similarly model separate residual variances for each arm (20).
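Given fitted surfaces, the Equation-(22) shift makes the complete-data mean, and hence the ATE, available in closed form for any (γ_0, γ_1). The sketch below uses hypothetical stand-in values for the fitted BART quantities; none of these numbers come from the actual NHANES fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for fitted BART values (NOT real NHANES estimates):
n = 1000
T = rng.integers(0, 2, size=n)             # observed treatment indicator
mu0 = rng.normal(75.0, 5.0, size=n)        # fitted control surface, mmHg
mu1 = mu0 + rng.normal(-1.0, 1.0, size=n)  # fitted treated surface
sd = {0: 6.0, 1: 7.0}                      # residual sds by arm

def complete_mean(t, mu_t, gamma_t):
    """E[Y(t)]: observed-arm units contribute mu_t(X_i); units in the other
    arm contribute the Equation-(22) missing mean mu_t(X_j) + gamma_t*sd_t^2."""
    shift = np.where(T == t, 0.0, gamma_t * sd[t] ** 2)
    return float(np.mean(mu_t + shift))

def ate(gamma0, gamma1):
    """ATE as a function of the sensitivity parameters."""
    return complete_mean(1, mu1, gamma1) - complete_mean(0, mu0, gamma0)

# Under gamma0 = gamma1 = 0 this reduces to the plug-in difference of the
# fitted mean surfaces; nonzero gammas shift each arm's missing component.
print(round(ate(0.0, 0.0), 3), round(ate(0.02, 0.02), 3))
```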
The ATE and QTE can then be estimated as functions of the estimated marginal complete data distributions for the potential outcomes (averaged over covariates), where:

f(Y(t)) ∼ (1/N) Σ_{i=1}^{N_t} N(µ_t(X_i), σ_t²) + (1/N) Σ_{j=1}^{N_{1−t}} N(µ_t(X_j) + γ_t σ_t², σ_t²).   (23)

Calibration.
We calibrate the magnitude of the sensitivity parameters using the approach outlined in Section 5. We illustrate this approach with body mass index (BMI), one of the most important predictors in terms of partial variance explained, with ρ_{X_BMI | X_{−BMI}} ≈ . . To map this value to sensitivity parameters, we use the estimated residual standard deviation of the potential outcomes in each arm, σ̂_r0 ≈ . and σ̂_r1 ≈ . . We can then apply the formula in Equation (18) to obtain |γ_0| ≈ . and |γ_1| ≈ . . In general, we limit our sensitivity analysis to unmeasured confounders up to this magnitude.

Figure 2: Average and quantile treatment effects of diuretics on diastolic blood pressure. Treatment effects are measured in units of millimeters of mercury (mmHg). NS denotes "not significant". a) Estimated ATEs for the BART model. Under unconfoundedness, the ATE is negative but not significant. b) Quantile treatment effects, τ_q. τ_q is increasing in q because the treatment potential outcomes have higher variance than the control potential outcomes. Colored boxes in (a) correspond to QTE distributions in (b).
Results.
In Figure 2a we visualize ATE estimates for a grid of sensitivity parameters. Here "NS" denotes "not significant", by which we mean that the 95% posterior credible interval of the ATE contains 0. Although the notion of "significance" vs. "non-significance" is fragile, it still provides a measure of the uncertainty associated with the estimated effects; see Dorie et al. (2016).

Under unconfoundedness (γ_0 = γ_1 = 0) the posterior mean for the ATE is approximately −0.7 mmHg, but there is enough posterior uncertainty that the effect is not significantly different from 0 (light blue box in Figure 2a). The ATE changes the most along the diagonal γ_0 = γ_1 = γ, with the ATE no longer significantly different from 0 when γ = 0. . Figure 2 also highlights the sensitivity patterns discussed in Examples 5–7, where the ATE is far more sensitive to the magnitudes of γ_0 and γ_1 when they have the same sign. For example, for the pink box, with γ_0 = 0. and γ_1 = 0. , the posterior mean ATE is approximately −. mmHg, which is roughly half a (marginal) standard deviation larger than the estimate under unconfoundedness. This setting corresponds to ρ_{Y(0)|X} ≈ ρ_{X_ins | X_{−ins}}; the variance explained by the treatment potential outcome is comparable to the variance explained by BMI, that is, ρ_{Y(1)|X} ≈ ρ_{X_BMI | X_{−BMI}} (see Figure 4).

In Figure 2b, we plot the 95% posterior credible band for the quantile treatment effects, τ_q, for several settings of the sensitivity parameters (colors match the squares in Figure 2a). Interestingly, τ_q is increasing with q for all combinations of γ, which occurs because the estimated residual variance for the treated potential outcomes is slightly larger than for the control potential outcomes. Similarly, for the complete data, the difference between the largest and smallest treatment potential outcomes is larger than the difference in the control potential outcomes.

Footnote: Specifically, the formula in (18) implies σ_rt γ_t ≈ . . In turn, |γ_0| ≈ . / . ≈ . and |γ_1| ≈ . / . ≈ . .
This is consistent with a situation in which the treatment effect varies across individuals beyond what is explained by covariates (Ding et al., 2018).

In the appendix, we also demonstrate the separation between the observed data model and the sensitivity analysis in our framework. Specifically, we show how results change when we use different observed data models with the same treatment selection specifications as above. We provide results for two additional observed data models: an alternative BART parameterization and the Bayesian Causal Forest (BCF) model (Hahn et al., 2017). Broadly, our results are consistent with Dorie et al. (2016), who show that both effect size and significance in this example can be sensitive to changes in the outcome model (testable) and treatment selection (untestable). Finally, while we use a flexible model to estimate the response surfaces, the quantile effects are still sensitive to the assumption of normality on the residuals. We turn to this next.

6.2 Analysis of JTPA data

In this example, we conduct a sensitivity analysis for quantile treatment effects for zero-inflated income data. Zero-inflated outcomes are common in a range of settings; we focus on the context of evaluating job training programs. In these studies, the primary outcome of interest, income, is zero for individuals who are unemployed, and thus the average treatment effect misses important variation (Heckman et al., 1997; Bitler et al., 2006). As a result, several studies have instead focused on estimating quantile treatment effects in these settings. Specifically, we consider the well-known study from Abadie et al. (2002), who estimate quantile treatment effects in the context of the Job Training and Partnership Act (JTPA) evaluation, a large randomized trial estimating the impact of select workforce development programs on wages.
In this analysis we focus only on individuals randomly assigned to treatment, and compare outcomes between those who chose to participate in the program (T = 1) versus those who did not (T = 0). We choose this artificial setup, comparing participants and non-participants among those randomly assigned to treatment, specifically because selection bias is a clear concern.

This analysis is designed to showcase that our sensitivity framework allows investigators to conduct sensitivity analysis even when they employ flexible models of complex data. Following previous work, we develop a two-part model for the semi-continuous data (Duan et al., 1983; Olsen and Schafer, 2001; Javaras and Van Dyk, 2003). Let Y(t) be income and W(t) ∈ {0, 1} an indicator for employment status, with W(t) = 0 ⟹ Y(t) = 0 and W(t) = 1 ⟹ Y(t) > 0. For simplicity, we exclude covariates in this analysis and focus on nonparametric estimation of the observed potential outcome distributions. Specifically, we flexibly model the observed (log) income among the employed using a Dirichlet process mixture of normal distributions (Neal, 2000):

f_obs_t(log(Y_i(t)) | T = t, W = 1) ∼ N(µ_it, σ_it²),   (24)
(µ_it, σ_it²) ∼ G,   (25)
G ∼ DP(αG_0),   (26)

where G_0 is the conjugate normal-inverse gamma prior distribution for the normal likelihood. To estimate the observed data density, we use the "dirichletprocess" R package, which implements a truncated stick-breaking process to approximate the infinite mixture weights (Ross and Markwick, 2018).

We then propose the following treatment selection specification, with separate selection functions for employed and unemployed individuals:

f(T = t | log(Y(t)), W(t) = 1) ∼ logit^{-1}(β_t + γ_t log(Y(t))),   (27)
f(T = t | W(t) = 0) ∼ logit^{-1}(α_t + ω_t 1{W(t) = 0}).   (28)

In words, for employed individuals, the participation probability is logistic in the log income.
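Treating the two parts separately, as the separate selection functions above suggest, each part transforms in a familiar way under the generic tilt of Equation (8): the odds of the zero state scale by e^ω (Example 2 logic), and the log-income mixture components shift and reweight (Example 4 logic). A sketch with hypothetical stand-in values, using a two-component normal mixture in place of the fitted Dirichlet process mixture (we elide the joint renormalization across the two parts):

```python
import numpy as np

def two_part_tilt(p_zero, pis, mus, sds, gamma, omega):
    """Apply the exponential tilt of Equation (8) within each part of a
    two-part outcome: a point mass at zero (W = 0) tilted by exp(omega),
    and a normal mixture for log income (W = 1) tilted by exp(gamma*log y)."""
    pis, mus, sds = map(np.asarray, (pis, mus, sds))
    # Binary part: odds of the zero state scale multiplicatively by e^omega.
    odds_zero = p_zero / (1.0 - p_zero) * np.exp(omega)
    p_zero_new = odds_zero / (1.0 + odds_zero)
    # Continuous part: shift component means, reweight by tilting constants
    # C_k = exp(gamma * mu_k + gamma^2 * sd_k^2 / 2), as in Example 4.
    new_mus = mus + gamma * sds ** 2
    w = pis * np.exp(gamma * mus + 0.5 * gamma ** 2 * sds ** 2)
    return p_zero_new, w / w.sum(), new_mus, sds

# Hypothetical stand-ins (not fitted JTPA quantities):
p_zero, pis = 0.30, [0.5, 0.5]
mus, sds = [9.0, 10.0], [0.6, 0.4]
p0, w, m, s = two_part_tilt(p_zero, pis, mus, sds, gamma=-0.1, omega=-0.5)
print(round(p0, 3))  # prints: 0.206 (omega = 0 would leave it at 0.3)
```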
Analogous to the Normal mixture example in Section 4 (Example 4), under the logistic selection model, the missing potential log-incomes log(Y(t)) also follow a Dirichlet process mixture of normals, but with different component means and mixture weights. For unemployed individuals, the odds of participation scale multiplicatively by exp(ω_t). Analogous to the Bernoulli example (Example 2), under the logistic selection model, the missing potential employment outcome W(t) is also Bernoulli, but with an additive shift in the log-odds.

Figure 3: Example model specifications for non-participators (a, b) and quantile treatment effects (c) for the JTPA data set. a) Participation probability as a function of the non-participator potential outcomes, (Y(0), W(0)). The left bar plot shows the participation probability given employment status, f(T = 1 | W(0)), and the right depicts participation versus log income, f(T = 1 | log(Y(0))), for three different pairs of sensitivity parameters. b) The probability of unemployment, E[W(0) | T = 1], for the missing control units (left) and the log income density for the employed among the missing control outcomes, f(log(Y(0)) | T = 1) (right). c) 95% posterior confidence bands for τ_q, for five pairs of sensitivity parameters.

Calibration. Reflecting the zero-inflated data structure, we handle calibration separately for employed and unemployed individuals. By way of demonstration, we use covariates to calibrate our sensitivity analysis but exclude them from treatment effect estimation.
The covariates, X, that we use for calibration are race, marital status, gender, age, and high school diploma or equivalent.

First, we calibrate the magnitude of γ_t, the sensitivity parameter for (log) income. In this analysis, we focus on the subset of the sensitivity parameter space in which γ_0 = γ_1 = γ. We then calibrate γ by fixing target values of ρ_{Y(t)} to be approximately ρ_X, the partial variance explained by the vector of observed covariates. We find that for the subset of employed individuals, the variance in T explained by X is ρ_X ≈ . . By Equation (18), we find that |γ_t| ≈ . when ρ_{Y(t)} = 0. , where we use the fact that Var(m(X)) = Var(m(∅)) = 0.

Second, we calibrate the magnitude of ω_t, the sensitivity parameter for (binary) employment, W(t). We follow the same calibration strategy by setting the target for ρ_{W(t)} ≈ ρ_X. We find that for the subset of unemployed individuals, the variance in T explained by X is ρ_X ≈ . . Applying Equation (18), this value of ρ_{W(t)} corresponds to a value of |ω_t| ≈ . .

Figure 3a visualizes the different user-specified choices for the unidentified selection model (27–28), or probability of assignment to treatment, implied under three different settings of (γ, ω). Here, ω determines the height of the bar in the left panel, and γ determines the steepness of the curve in the right panel.

Footnote: This is not a viable approach in a real analysis without covariates, since there is no X from which we can estimate ρ_X. Here, ρ_{Y(t)} = ρ_{Y(t)|∅} is the marginal variance explained.

Results.
In this section, we summarize our results, as implied by this choice of treatment assignment function and the observed data density estimated with the Dirichlet Process mixture model (24–26), in Figures 3b and 3c. First, in Figure 3b we display the distribution of missing control outcomes for those units assigned to treatment, f^{(γ,ω)}_mis(Y(0), W(0) | T = 1). From this figure, it is clear that the observed data imply very different counterfactual employment and income distributions under different sensitivity parameter settings. The thickness of the lower tail of the income distribution for the employed appears to be particularly sensitive to selection effects. The fraction of unemployed among the missing control outcomes assigned to treatment is roughly twice as large for ω = −0.5 as it is for ω = 0.5.

(Two footnotes to the calibration discussion: calibrating against ρ²_X is not a viable approach in a real analysis without covariates, since there is no X from which we can estimate ρ²_X; and ρ²_{Y(t)} = ρ²_{Y(t)|∅} is the marginal variance explained.)

Figure 3c summarizes sensitivity in the QTEs, which are the primary objects of interest. Consistent with the distributional differences in Figure 3b, the QTEs are sensitive in terms of both employment and income levels. First, under all sensitivity settings we explore, the QTEs at the lowest quantiles are identically zero, since both the control and treated outcome distributions have a point mass at zero. The quantiles at which the QTEs depart from zero are related to the proportion unemployed in the treatment and control populations. For example, for ω = 0 or −0.5, the proportion of units who are unemployed under treatment is smaller than the proportion of units who are unemployed under control, f(W(1) = 0) < f(W(0) = 0). Here, the QTE increases away from 0 at q = f(W(1) = 1). By contrast, for ω = 0.5
, the proportion unemployed under treatment is actually larger than the proportion unemployed under control, f(W(1) = 0) > f(W(0) = 0), and thus the QTE decreases away from 0 (briefly) at q = f(W(0) = 1).

Differences in income effects are more straightforward to read from the plot. For the sensitivity parameter settings used in this analysis, the distribution of income under treatment, Y(1), generally has larger quantiles than the distribution of income under control, Y(0). Only when selection effects are very large, e.g. (γ, ω) = (0.1, 0.5), do the effects lose significance and shift toward slightly negative values. In the end, the shapes of the estimated QTEs are similar to those in Abadie et al. (2002), though the impacts are larger due to the artificial setup.

Discussion

In this paper, we proposed a framework for sensitivity analysis in causal inference employing Tukey's factorization. The framework has a number of advantages. First, it cleanly separates the identified and unidentified portions of the data-generating process. This guarantees that the sensitivity parameters are unidentified, in contrast to many latent confounder models, and decouples model checking from sensitivity analysis. Second, it requires the data to be fit only once, reducing computational burden and enabling post facto sensitivity analysis for a wide range of models that previously assumed unconfoundedness. Third, it supports intuitive sensitivity parameterizations that investigators can calibrate against selection on observed covariates.

Extensions of our framework could fit particularly well with modern causal inference workflows that employ a similar separation of observed data modeling and causal reasoning. In these workflows, the analyst first focuses on optimizing a fit to the observed data distribution, often employing heuristics such as cross validation to select or combine models.
The analyst then plugs predictions from this model into an estimation step tailored to the estimand of interest (Van der Laan and Rose, 2011; Chernozhukov et al., 2016; Xu et al., 2018). A sensitivity analysis based on Tukey's factorization would allow investigators to assess sensitivity in this workflow without putting constraints on the flexible model used in the first stage. In particular, following (1), such a sensitivity analysis could be implemented by adding a weighting step, parameterized by the sensitivity parameters, between the first and second stages.

Although we highlighted many different use cases, there are several extensions we did not explore. First, we focused here on sensitivity specifications that are independent of the covariates, which enabled us to limit the number of sensitivity parameters. A natural extension would generalize these specifications to infer causal effects under covariate-varying values of the sensitivity parameter, i.e., γ_t(X) (see Jung et al., 2018). Such an extension would increase the number of sensitivity parameters, introducing several challenges in calibration and reporting results. A second extension would generalize our approach to multiple or multi-level treatments (Imai and Van Dyk, 2004). This would require generalizing the factorization in Equation (5). In some special cases, for example, where the unobserved confounding can be represented as latent factors of the observed distribution of treatments (Wang and Blei, 2018), more parsimonious sensitivity factorizations may be possible, e.g., following D'Amour (2019). A final direction would extend our sensitivity analysis to observational studies with missing data. Specifically, we could combine approaches for dealing with informative dropout, previously applied in experimental settings (Daniels and Hogan, 2000; Scharfstein et al., 2003), with the models for confoundedness in observational studies described in this paper.
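The weighting step described above can be sketched as follows. This is a minimal illustration under a logistic selection function logit f(T = 1 | Y(t)) = α_t + γ_t Y(t); the function names and the self-normalized resampling scheme are ours, and the intercept α_t cancels after normalization.

```python
import numpy as np

def selection_weights(y_draws, gamma_t, t):
    """Importance weights proportional to f(T = 1 - t | Y(t)) / f(T = t | Y(t))
    under a logistic selection function with slope gamma_t on Y(t)."""
    sign = -1.0 if t == 1 else 1.0  # tilt toward the unobserved arm
    w = np.exp(sign * gamma_t * np.asarray(y_draws, dtype=float))
    return w / w.sum()

def resample_missing_outcomes(y_draws, gamma_t, t, size, rng=None):
    """Approximate draws from the missing-outcome density f(Y(t) | T = 1 - t)
    by reweighting draws from the observed arm f(Y(t) | T = t)."""
    rng = np.random.default_rng(rng)
    return rng.choice(y_draws, size=size, p=selection_weights(y_draws, gamma_t, t))
```

With γ_t = 0 the weights are uniform and the observed arm is returned unchanged; larger |γ_t| tilts the resampled draws further into the tail of the first-stage fit.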
We leave it to future work to explore these and other classes of useful extrapolation models, including models that facilitate specification of covariate-varying sensitivity parameters and multiple treatment regimes.

There are also several open technical questions about our application of Tukey's factorization to observational studies. One important consideration is the validity of the outcome overlap condition (Condition 2, Section 3). This condition says that the support of the missing potential outcomes must be a subset of the support of the observed potential outcomes. As discussed in Franks et al. (2016), even when the outcome overlap condition is technically satisfied, the inferred missing data distribution can be sensitive to the estimated tails of the observed data density if the distance between the observed and missing data is large. This is particularly evident from the importance weighting perspective, since the ratio f(T = 1 - t | Y(t)) / f(T = t | Y(t)) increases in Y(t) in regions where the missing data density is far from the observed data density. In this case, the inferred missing data density is largely determined by parametric assumptions about the tail behavior of the observed data densities, which often carry limited information in practice. Additionally, the outcome overlap condition may become less plausible when the covariates explain most of the variance in the observed outcome Y(t). In such a case, an unobserved confounder would only need to induce a small amount of variation in Y(t) to violate the outcome overlap assumption.

In the end, the methods described in this paper are quite general, can be extended to a range of models, and are easy to interpret and implement, even for complex data-generating models. We therefore believe Tukey's factorization is a powerful framework for assessing sensitivity to unobserved confounders in observational causal inference.

A Theory
Proof of Theorem 1: τ_ATE, τ_ATT, τ_ATC, and τ_OR are all functions only of f(Y(t)), t ∈ {0, 1} (or f(Y(t) | X) for conditional treatment effects). Thus it suffices to show that f(Y(t) | T) is independent of any copula parameters. Note that in the extrapolation factorization we model f(Y(t) | T = t) directly, and thus this conditional distribution is independent of copula parameters by definition. It therefore suffices to show that f(Y(t) | T = 1 - t) is independent of copula parameters:

f(Y(t) | T = 1 - t)
  = ∫ f(Y(t), Y(1 - t) | T = 1 - t) dY(1 - t)
  = ∫ f(Y(t) | Y(1 - t), T = 1 - t) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  ∝ ∫ f(Y(t) | T = t) [f(T = 1 - t | Y(t)) / f(T = t | Y(t))] c(F(Y(t) | T), F(Y(1 - t) | T) | T) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = f(Y(t) | T = t) [f(T = 1 - t | Y(t)) / f(T = t | Y(t))] ∫ c(F(Y(t) | T), F(Y(1 - t) | T) | T) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = f(Y(t) | T = t) f(T = 1 - t | Y(t)) / f(T = t | Y(t)),

where the last equality holds by the definition of the copula density:

∫ c(F(Y(t) | T), F(Y(1 - t) | T) | T) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = ∫ [f(Y(t), Y(1 - t) | T = 1 - t) / (f(Y(t) | T = 1 - t) f(Y(1 - t) | T = 1 - t))] f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = ∫ f(Y(1 - t) | Y(t), T = 1 - t) dY(1 - t)
  = 1.

Proof of Proposition 3:
We seek to find the value of γ_t such that the model (17) implies that ρ²_{Y|X} achieves a particular value, ρ*. Write

m(X, Y(t)) := logit(e(X, Y(t))) = α_t(X) + γ_t Y(t)    (29)
  = α_t(X) + γ_t μ_t(X) + γ_t (Y(t) - μ_t(X))    (30)
  = α*_t(X) + γ̃_t R̃(t)    (31)
  = m(X, R̃(t)),    (32)

where R̃(t) = R(t)/σ_rt is the unit-scaled complete-data residual, γ̃_t := σ_rt γ_t, and σ_rt = sqrt(E[Var(Y(t) | X)]). We define α*_t(X) := α_t(X) + γ_t μ_t(X). Importantly, since m(X, Y(t)) = m(X, R̃(t)), the above implies that ρ²_{Y(t),X} = ρ²_{R̃(t),X}. Since R̃(t) is orthogonal to α*_t(X) and has unit variance, we have Var(m(X, R̃(t))) = Var(m(X) + γ̃_t R̃(t)) = Var(m(X)) + γ̃_t². Thus,

ρ²_{X,Y(t)} = ρ²_{X,R̃(t)} = (Var(m(X)) + γ̃_t²) / (Var(m(X)) + γ̃_t² + π²/3).    (33)

Using the definition of "implicit R-squared" from Section 5, we have

ρ²_{Y(t)|X} = (ρ²_{X,Y(t)} - ρ²_X) / (1 - ρ²_X)    (34)
  = [(Var(m(X)) + γ̃_t²) / (Var(m(X)) + γ̃_t² + π²/3) - Var(m(X)) / (Var(m(X)) + π²/3)] / [(π²/3) / (Var(m(X)) + π²/3)]    (35)
  = 1 - (Var(m(X)) + π²/3) / (Var(m(X)) + γ̃_t² + π²/3).    (36)

Solving the above equation for the γ̃_t such that ρ²_{Y(t)|X} = ρ* yields

|γ̃_t| = sqrt( [ρ* / (1 - ρ*)] (Var(m(X)) + π²/3) ).    (37)

We complete the result by using the fact that γ̃_t = σ_rt γ_t.

B Additional Results from Section 5.1
In Section 5.1, we focus on one particular potential outcomes model, although many plausible models are possible. In this section, we provide results for two variations of the observed potential outcome model.
Figure 4: γ_t vs. ρ²_{Y|X}, calibration for the NHANES data. The magnitude of the sensitivity parameter γ_t is increasing in the residual coefficient of determination, ρ²_{Y|X}. For comparison, we plot the partial coefficients of determination for the covariates, ρ²_{X_j|X_{-j}}, for the most important predictors: age, BMI, insurance, and pulse. We calibrate the magnitude of γ_t in Section 6.1 based on BMI.

This plot highlights that the ATE estimates vary as a function of both model specification (model checking) and the strength of confounding in both treatment arms (sensitivity analysis).

First, we posit a pooled model for the mean surface and residual variance, (μ_t(X), σ_t) ~ BART(X, T), with μ_t(X) = μ(t, X) and σ_t = σ_{1-t}. In Figure 5a we show the results for this model, which has the largest estimated effect size under unconfoundedness of any of the models considered. Under unconfoundedness, the posterior mean ATE is approximately -2.5 mmHg under this model and, unlike under the model proposed in Section 6.1, the effect appears significantly different from 0.

We also show results for the Bayesian Causal Forest (BCF) model recently introduced by Hahn et al. (2017). In this model, the observed propensity score is included as a covariate, and independent BART prior distributions are specified for the control outcome surface and for the heterogeneous treatment effect. Under unconfoundedness, the posterior mean for the ATE is approximately -1.73 mmHg but, in contrast to the other observed data models, the posterior uncertainty is large.
Figure 5: Average treatment effect measured in units of millimeters of mercury (mmHg). NS denotes "not significant". a) Average treatment effect in the pooled model. Under unconfoundedness, the effect size is significantly negative. This model has the smallest posterior uncertainty. b) Average treatment effect in the Bayesian Causal Forest model. Although the effect sizes are comparable, the posterior uncertainty is significantly larger.

References
Abadie, A., J. Angrist, and G. Imbens (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica 70(1), 91–117.
Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
Birmingham, J., A. Rotnitzky, and G. M. Fitzmaurice (2003). Pattern-mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society, Series B: Statistical Methodology 65, 275–297.
Bitler, M. P., J. B. Gelbach, and H. W. Hoynes (2006). What mean impacts miss: Distributional effects of welfare reform experiments. American Economic Review 96(4), 988–1012.
Blackwell, M. (2014). A selection bias approach to sensitivity analysis for causal effects. Political Analysis 22(2), 169–182.
Carnegie, N. B., M. Harada, and J. L. Hill (2016). Assessing sensitivity to unmeasured confounding using a simulated potential confounder. Journal of Research on Educational Effectiveness 9(3), 395–420.
Centers for Disease Control and Prevention (CDC) (1997). National Health and Nutrition Examination Survey Data III. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, Hyattsville, MD.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. K. Newey (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.
Chipman, H. A., E. I. George, R. E. McCulloch, et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
Cinelli, C. and C. Hazlett (2018). Making sense of sensitivity: Extending omitted variable bias.
Cornfield, J., W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute 22, 173–203.
D'Amour, A. (2019). On multi-cause causal inference: Impossibility, sensitivity, and the promise of proxies. In International Conference on Artificial Intelligence and Statistics, pp. Forthcoming.
Daniels, M. J. and J. W. Hogan (2000). Reparameterizing the pattern mixture model for sensitivity analyses under informative dropout. Biometrics 56(4), 1241–1248.
Díaz, I. and M. J. van der Laan (2013). Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems. The International Journal of Biostatistics 9(2), 149–160.
Ding, P., A. Feller, and L. Miratrix (2018). Decomposing treatment effect variation. Journal of the American Statistical Association, 1–14.
Ding, P. and T. J. VanderWeele (2016). Sensitivity analysis without assumptions. Epidemiology (Cambridge, Mass.) 27(3), 368.
Dorie, V., M. Harada, N. B. Carnegie, and J. Hill (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine 35(20), 3453–3470.
Duan, N., W. G. Manning, C. N. Morris, and J. P. Newhouse (1983). A comparison of alternative models for the demand for medical care. Journal of Business & Economic Statistics 1(2), 115–126.
Everitt, B. S. (1985). Mixture Distributions—I. Wiley Online Library.
Franks, A. M., E. M. Airoldi, and D. B. Rubin (2016). Non-standard conditionally specified models for non-ignorable missing data. arXiv preprint arXiv:1603.06045.
Gelman, A., H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
Gustafson, P., L. C. McCandless, et al. (2018). When is a sensitivity parameter exactly that? Statistical Science 33(1), 86–95.
Hahn, P. R., J. S. Murray, and C. M. Carvalho (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects.
Hahn, P. R., J. S. Murray, and I. Manolopoulou (2016). A Bayesian partial identification approach to inferring the prevalence of accounting misconduct. Journal of the American Statistical Association 111(513), 14–26.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica 47(1), 153–161.
Heckman, J. J., J. Smith, and N. Clements (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies 64(4), 487–535.
Hill, J. L. (2012). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics.
Holland, P. (1986). Discussion 4: Mixture modeling versus selection modeling with nonignorable nonresponse. In H. Wainer (Ed.), Drawing Inferences from Self-Selected Samples, pp. 143–149. Routledge.
Imai, K. and D. A. Van Dyk (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99(467), 854–866.
Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review 93(2), 126–132.
Javaras, K. N. and D. A. Van Dyk (2003). Multiple imputation for incomplete data with semicontinuous variables. Journal of the American Statistical Association 98(463), 703–715.
Jung, J., R. Shroff, A. Feller, and S. Goel (2018). Algorithmic decision making in the presence of unmeasured confounding. arXiv preprint arXiv:1805.01868.
Kenward, M. G. (1998). Selection models for repeated measurements with non-random dropout: an illustration of sensitivity. Statistics in Medicine.
Klausch, T., P. van de Ven, T. van de Brug, R. H. Brakenhoff, M. A. van de Wiel, and J. Berkhof (2018). Estimating Bayesian optimal treatment regimes for dichotomous outcomes using observational data. arXiv preprint arXiv:1809.06679.
Linero, A. R. (2017). Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness. Biometrika 104(2), 327–341.
Linero, A. R. and M. J. Daniels (2015). A flexible Bayesian approach to monotone missing data in longitudinal studies with nonignorable missingness with application to an acute schizophrenia clinical trial. Journal of the American Statistical Association 110(509), 45–55.
Linero, A. R. and M. J. Daniels (2017). Bayesian approaches for missing not at random outcome data: The role of identifying restrictions.
Little, R. J. and D. B. Rubin (2015). Statistical Analysis with Missing Data. John Wiley & Sons.
Middleton, J. A., M. A. Scott, R. Diakow, and J. L. Hill (2016). Bias amplification and bias unmasking. Political Analysis 24(3), 307–323.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9(2), 249–265.
Neyman, J. (1990 [1923]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
Olsen, M. K. and J. L. Schafer (2001). A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association 96(454), 730–745.
Oster, E. (2017). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 1–18.
Riddles, M. K., J. K. Kim, and J. Im (2016). A propensity-score-adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology 4(2), 215–245.
Robins, J. M., A. Rotnitzky, and D. O. Scharfstein (2000). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 1–94. Springer.
Rosenbaum, P. (2017). Observation and Experiment: An Introduction to Causal Inference. Harvard University Press.
Rosenbaum, P. R. and D. B. Rubin (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B (Methodological), 212–218.
Rosenbaum, P. R. and J. H. Silber (2009). Amplification of sensitivity analysis in matched observational studies. Journal of the American Statistical Association 104(488), 1398–1405.
Ross, G. J. and D. Markwick (2018). dirichletprocess: Build Dirichlet Process Objects for Bayesian Modelling. R package version 0.2.1.
Rotnitzky, A., D. Scharfstein, T.-L. Su, and J. Robins (2001). Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring. Biometrics 57(1), 103–113.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688.
Rubin, D. B. (1980). Comment. Journal of the American Statistical Association 75(371), 591–593.
Rubin, D. B. (2003). Basic concepts of statistical inference for causal effects in experiments and observational studies. Cambridge, MA: Harvard University, Department of Statistics.
Scharfstein, D. O., M. J. Daniels, and J. M. Robins (2003). Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics 4(4), 495–512.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999, December). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448), 1096–1120.
Van der Laan, M. J. and S. Rose (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.
Wang, Y. and D. M. Blei (2018). The blessings of multiple causes. arXiv preprint arXiv:1805.06826.
Xu, D., M. J. Daniels, and A. G. Winterstein (2018). A Bayesian nonparametric approach to causal inference on quantiles. Biometrics.
Zhao, Q., D. S. Small, and B. B. Bhattacharya (2017). Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. arXiv preprint arXiv:1711.11286.