Flexible sensitivity analysis for observational studies without observable implications∗
Alex Franks† (UCSB) · Alex D’Amour† (UC Berkeley) · Avi Feller (UC Berkeley)

This draft: January 15, 2019
Abstract
A fundamental challenge in observational causal inference is that assumptions about unconfoundedness are not testable from data. Assessing sensitivity to such assumptions is therefore important in practice. Unfortunately, some existing sensitivity analysis approaches inadvertently impose restrictions that are at odds with modern causal inference methods, which emphasize flexible models for observed data. To address this issue, we propose a framework that allows (1) flexible models for the observed data and (2) clean separation of the identified and unidentified parts of the sensitivity model. Our framework extends an approach from the missing data literature, known as Tukey’s factorization, to the causal inference setting. Under this factorization, we can represent the distributions of unobserved potential outcomes in terms of unidentified selection functions that posit an unidentified relationship between the treatment assignment indicator and unobserved potential outcomes. The sensitivity parameters in this framework are easily interpreted, and we provide heuristics for calibrating these parameters against observable quantities. We demonstrate the flexibility of this approach in two examples, where we estimate both average treatment effects and quantile treatment effects using Bayesian nonparametric models for the observed data.
Keywords: Observational studies; sensitivity analysis; Tukey’s factorization; latent confounder; Bayesian inference.

∗ Alexander M. Franks is an Assistant Professor of Statistics at the University of California, Santa Barbara ([email protected]). Alexander D’Amour is now a Research Scientist at Google AI, Cambridge, MA ([email protected]); most of this work was completed when he was a Neyman Visiting Assistant Professor of Statistics at the University of California, Berkeley. Avi Feller is an Assistant Professor of Public Policy and Statistics at the University of California, Berkeley ([email protected]). We thank Edo Airoldi, Carlos Cinelli, Peng Ding, Sharad Goel, Chad Hazlett, Jennifer Hill, Jared Murray, Sam Pimentel, Don Rubin, and Ravi Shroff as well as participants at ACIC 2018 for thoughtful comments and discussion.
† These authors contributed equally to this work.

1 Introduction
Causal inference generally requires two distinct elements: modeling observed potential outcomes and making assumptions about missing potential outcomes. While researchers can investigate the first element with standard model-checking techniques, the data are uninformative about the second element, which can only be probed via sensitivity analysis. In principle, a sensitivity analysis quantifies how results change under different assumptions about unobserved potential outcomes, without affecting the observed data model (Linero and Daniels, 2017; Gustafson et al., 2018). Unfortunately, such “clean” sensitivity analyses are often difficult to construct under common sensitivity analysis frameworks, sometimes leading to analyses where assumptions about the sensitivity analysis inadvertently impose constraints on the observed data model. This tension puts sensitivity analysis at odds with modern causal inference approaches that incorporate flexible observed data models, such as recent approaches that feature Bayesian nonparametric regression or other machine learning methods (Hill, 2012; Athey and Wager, 2017; Hahn et al., 2017).

In this paper, we propose a sensitivity analysis framework for observational studies that cleanly separates model checking from sensitivity analysis. Specifically, we propose a factorization of the joint distribution of potential outcomes, known as
Tukey’s factorization, that separates factors into those that are nonparametrically identified and those that are completely unidentified from the data. This factorization expresses the unidentified factors in terms of selection functions, which are easily interpreted, and a copula that characterizes the dependence between potential outcomes, which we can ignore for a broad class of common estimands. Taken together, we can then interpret the implied distribution on missing potential outcomes as an extrapolation from the observed data distribution.

Our main contribution is to develop a practical workflow for this factorization in causal inference, thus extending Tukey’s factorization, which has a long history in the missing data literature, to the observational study setting. First suggested by John Tukey (recorded in Holland, 1986), variants of the approach have been known by a number of names, including exponential tilting (Birmingham et al., 2003; Rotnitzky et al., 2001; Scharfstein et al., 1999), non-parametric (just) identified (NPI) models (Robins et al., 2000), the extrapolation factorization (Linero and Daniels, 2017), and Tukey’s factorization (Franks et al., 2016). In this paper we use the term “Tukey’s factorization” since our approach explicitly adapts the method proposed by Franks et al. (2016) to observational causal inference. Tukey’s factorization also bears close resemblance to missing data methods based on weighting, including importance weight methodology (Riddles et al., 2016) and inverse probability weighting models for sensitivity analysis (e.g., Zhao et al., 2017). While Robins et al. (2000) also proposed to apply Tukey’s approach to unobserved confounding in observational studies, applications of their proposal appear to be limited to the context of clinical trials with dropout (Rotnitzky et al., 2001; Scharfstein et al., 2003).

The workflow we propose has a number of practical strengths for assessing sensitivity to unmeasured confounding.
First, the framework naturally accommodates a large class of commonly-applied observed data models, including models for complex data. We demonstrate this flexibility by conducting sensitivity analysis with a range of modern statistical models, including Bayesian Additive Regression Trees (BART) for response surface estimation (Dorie et al., 2016) and Dirichlet process mixture (DPM) models for flexible residual distributions. We also demonstrate the use of the factorization with semi-continuous data using zero-inflated models. Second, the framework is computationally cheap. Because the factorization separates observed data modeling from sensitivity parameters, an investigator only needs to fit the observed data model once, and can apply our sensitivity analysis post hoc. Finally, the sensitivity parameters defined in our framework are easily interpreted, facilitating model specification and parameter calibration. In particular, the sensitivity parameters can be calibrated against variation explained by covariates in a standard propensity score model, which is already familiar to many investigators.

The paper proceeds as follows. Section 2 introduces the formal setup for sensitivity analysis and highlights key issues involving identifiability of sensitivity parameters. Section 3 introduces Tukey’s factorization for causal inference and provides theoretical justification for applying the approach to a common set of estimands. Section 4 defines a flexible, convenient model specification, the logistic-mixture exponential family model, and explores its technical properties. Section 5 discusses heuristics for interpreting and calibrating the sensitivity parameters in this model. Section 6 applies Tukey’s factorization to two examples, demonstrating the flexibility of the approach on a range of estimands with several Bayesian nonparametric estimation approaches. Finally, Section 7 discusses open issues and possible extensions.
The appendix contains technical details and some additional results from the applied analyses.
We describe our approach using the potential outcomes framework (Neyman, 1923; Rubin, 1974). For outcome Y_i ∈ Y for unit i and binary treatment, let Y_i(0) and Y_i(1) denote that unit’s potential outcomes if assigned to control or treatment, respectively. Let T_i denote a binary treatment indicator and X_i denote observed covariates. For compactness, we often write Y_i(t), t ∈ {0, 1}, to denote the outcome for treatment level t. Assuming SUTVA (Rubin, 1980), we can then write the observed outcome as Y_i^obs = T_i Y_i(1) + (1 − T_i) Y_i(0). Finally, we write the propensity score as e(x) = P(T_i = 1 | X_i = x). We assume an infinite population of iid units, from which we sample triplets ([Y_i(0), Y_i(1)], T_i, X_i) and observe triplets (Y_i^obs, T_i, X_i). Given this, we suppress the subscript i unless otherwise noted.

We focus on two classes of population estimands. First, we consider average treatment effects on the whole population as well as on the treated and control populations:

    τ_ATE := E[Y(1) − Y(0)] = E[Y(1)] − E[Y(0)]
    τ_ATT := E[Y(1) − Y(0) | T = 1] = E[Y(1) | T = 1] − E[Y(0) | T = 1]
    τ_ATC := E[Y(1) − Y(0) | T = 0] = E[Y(1) | T = 0] − E[Y(0) | T = 0].

Second, we consider quantile treatment effects,

    τ_q = Q_q(Y(1)) − Q_q(Y(0)),

the difference in the q-th treatment quantile, Q_q(Y(1)), and control quantile, Q_q(Y(0)).

Each of these estimands is a contrast between the marginal complete-data outcome distributions f(Y(1)) and f(Y(0)). For each t, the complete-data distribution for each potential outcome can be written as a mixture of the distributions of observed and missing outcomes:

    f(Y(t) | X) = f(T = t | X) f_t^obs(Y(t) | T = t, X) + f(T = 1 − t | X) f_t^mis(Y(t) | T = 1 − t, X).
All factors in this expression are identified except for f_t^mis, which is completely uninformed by the data. Identifying these estimands thus requires untestable assumptions that characterize f_t^mis. Sensitivity analysis probes the robustness of estimates to these assumptions. Specifically, we consider sensitivity analyses for observational studies where the investigator wishes to test robustness to violations of the unconfoundedness assumption.

Assumption 1 (Unconfoundedness). [Y(0), Y(1)] ⊥⊥ T | X.

Unconfoundedness implies that f_t^obs = f_t^mis, and is thus sufficient to identify our estimands of interest. Broadly, sensitivity analysis for violations of unconfoundedness proceeds by parameterizing the conditional dependence between partially-observable potential outcomes [Y(0), Y(1)] and T given covariates X. The parameters of this dependence are called sensitivity parameters. Investigators can then report how causal effect estimates change when the sensitivity parameters are allowed to vary within a selected range of plausible values. The plausibility of the specific values of sensitivity parameters is ultimately determined externally to the data analysis, e.g., by domain expertise.

We propose a method for model-based sensitivity analysis that explicitly factorizes the joint distribution of ([Y(0), Y(1)], T | X) in terms of the observed outcome distributions f_t^obs. Specifically, we parameterize the missing outcome distributions as a tilt of the observed outcome distribution:

    f_{ψ,t}^mis(Y(t) | T = 1 − t, X) ∝ [ f_ψ(T = 1 − t | Y(t), X) / f_ψ(T = t | Y(t), X) ] · f_t^obs(Y(t) | T = t, X).    (1)

We explore key features of this factorization in subsequent sections. This specification defines the complete-data distributions f_ψ(Y(t)) implicitly.
The fraction can be interpreted as importance weights that transform the observed outcome distribution f_t^obs into the missing outcome distribution f_{ψ,t}^mis, parameterized by ψ. Notably, the observed data distributions in this approach, f_t^obs, are free of the sensitivity parameter ψ, which implies a clean separation of model checking and sensitivity analysis. In practice, sensitivity analysis with Equation (1) involves a number of non-trivial implementation details, including imposing constraints on normalizing constants and setting and interpreting sensitivity parameters.

There is an extensive literature on assessing sensitivity to departures from unconfoundedness, dating back at least to the foundational work of Cornfield et al. (1959) on the link between smoking and lung cancer. The approach we take in this paper fits broadly within the model-based sensitivity analysis framework, largely building on Rosenbaum and Rubin (1983). Examples of alternative sensitivity analysis approaches include Rosenbaum and Silber (2009), Díaz and van der Laan (2013), Ding and VanderWeele (2016), and Zhao et al. (2017).

Our approach most closely follows recent proposals to enable model-based sensitivity analysis with flexible outcome models, especially Dorie et al. (2016), who use Bayesian Additive Regression Trees (BART) models for flexible outcome modeling (see also Carnegie et al., 2016). We contrast their approach with ours in Section 6.1. Similarly, Jung et al. (2018) combine flexible, black-box modeling with sensitivity analysis in the setting of algorithmic decision making. These methods extend the so-called latent confounder approach to sensitivity analysis (Rosenbaum and Rubin, 1983) by specifying a parametric but flexible latent variable model for the complete data. The latent confounder approach is highly intuitive, and in cases where there is strong scientific prior knowledge, may be preferable to the approach we propose here.
For example, when the parameters in the model are scientifically meaningful, this formulation may be necessary to elicit priors for these parameters. As we discuss next, however, such models can introduce issues around identified sensitivity parameters. While some modern approaches incorporate parameterizations that mitigate these identification concerns, the degree to which sensitivity parameters are decoupled from the observed data model can be difficult to assess in practice.

A number of previous methods have also sought to separate model checking from sensitivity analysis in the context of missing data. Specifically, our approach is most similar to the non-parametric (just) identified (NPI) models proposed by Robins et al. (2000), so called because they put no constraints on the observed data model, but identify a complete-data distribution when the sensitivity parameters are fixed. NPI models were proposed using a variant of Tukey’s factorization, but have been applied primarily to longitudinal missing data problems and clinical trials with dropout; see, e.g., Rotnitzky et al. (2001) and Scharfstein et al. (2003). Robins et al. (2000) suggested that NPI models could be applied to observational studies with unobserved confounding, but to the best of our knowledge, did not pursue this method in applications. Our paper builds a practical sensitivity analysis method based on this theoretical suggestion, in part by proposing a specification that reduces some of the technical hurdles in the original Robins et al. (2000) proposal; see Section 3.

Our method is also related to a line of Bayesian missing data methods summarized by Linero and Daniels (2017). These methods specify a Bayesian model for observed data, and explore a set of “identifying restrictions” that map the observed data model to complete data models. Tukey’s factorization in (1) is an example of an identifying restriction, although the authors generally choose alternative restrictions in their applications.
Similar to the NPI literature, these methods do not appear to have been applied to observational studies, with their causal applications restricted to clinical trials with dropout (Linero and Daniels, 2015; Linero, 2017). Finally, in a Bayesian approach that aligns well with ours, Hahn et al. (2016) propose flexible, nonparametric models that separate the observed data model from sensitivity parameters, also in the context of missing data.

[Footnote: For example, the model in Jung et al. (2018) has sensitivity parameters that depend only weakly on the observed data because they focus exclusively on the setting with a binary outcome and employ sensitivity parameters that vary smoothly across covariates.]

2.4 Separating Sensitivity Analysis from Model Checking

A prime motivation for our sensitivity analysis framework is to separate sensitivity analysis from model checking, that is, from assessing the fit of different modeling assumptions to the observed data. In practice, many popular approaches to sensitivity analysis blur the line between sensitivity analysis and model checking by introducing sensitivity parameters that are informed by the observed data. Sensitivity analyses of this type can be difficult to interpret, however. Such methods typically relax the unconfoundedness assumption by introducing a parametric specification, which can in turn fundamentally change the relationship between confounding and identification. This can also introduce new types of sensitivity that go unexamined by the analysis itself, resulting in conclusions whose credibility depends on strong prior knowledge about the parametric specification. We demonstrate some of these pathologies in the context of a latent confounder model in the following example.
Example 1 (Normal Outcome, Binary Confounder). Suppose we have a study with a continuous outcome and no covariates. We make the assumption that treatment was randomly assigned according to a Bernoulli design, but it is plausible that there exists a latent class that confounds the study. To test the robustness of our conclusions to the presence of such a latent class, we propose a sensitivity analysis by introducing a binary latent confounder. The model is parameterized as follows:

    U ∼ Bern(ξ_u)
    T | U ∼ Bern(g(α + ψ_T U))
    Y(t) | U ∼ N(µ_t + ψ_Y U, σ²).

When (ψ_T, ψ_Y) = (0, 0), the model reduces to random assignment to treatment. Importantly, the observed data distribution depends on the sensitivity parameters ψ. To see this, let h_{ψ_T}(t) := f(U = 1 | T = t). The distribution of observed outcomes is a two-component mixture of normals for t ∈ {0, 1}:

    Y(t) | T = t ∼ h_{ψ_T}(t) N(µ_t + ψ_Y, σ²) + (1 − h_{ψ_T}(t)) N(µ_t, σ²).    (2)

The observed and missing potential outcome distributions, f(Y(t) | T = t, X) and f(Y(t) | T = 1 − t, X), respectively, have the same mixture components but different mixture weights, where ψ_Y determines the difference between the component means. Importantly, the mixture weights h_{ψ_T}(t) and component means are identifiable under relatively weak assumptions (see Everitt, 1985).

Figure 1: Illustrations of sensitivity parameter identification in Example 1, via variation in posterior predictive distributions at different sensitivity parameter settings. We plot the true (blue) and inferred (red) observed control outcome densities, f(Y(0) | T = 0), implied by the normal outcome-binary latent confounder model with sensitivity parameters ψ_T and ψ_Y. ∆ψ_T and ∆ψ_Y represent the difference between the true and assumed sensitivity parameters. The observed data densities are only correctly inferred when we assume the true values of the sensitivity parameters (i.e.,
∆ψ_T = ∆ψ_Y = 0). Note that the true potential outcome distributions Y(0) are a mixture of normal distributions with the same component means and different variances.

In Figure 1, we demonstrate a range of data fits obtained from this model. As is common practice, we fit the model multiple times for a range of sensitivity parameters and plot the true (blue) and inferred (red) observed potential outcome densities for the control potential outcomes. Most settings of the sensitivity parameters result in severe misfit of the observed data, and would likely be rejected by a competent analyst if they were considered the “main” model for the observed data. In this case, the sensitivity analysis operates as a model checking exercise more than an exploration of relaxed identification assumptions. In fact, based on this parametric specification, the investigator could reject the hypothesis that the study satisfies unconfoundedness (Assumption 1) based on model fit. In the case where the investigator had strong scientific basis to believe the parametric form of the latent confounder model, this would be useful information. However, in the absence of a priori knowledge about this specification, the unconfoundedness assumption is not tested in full generality.

More elaborate forms of this model appear throughout the literature, including extensions that incorporate covariates and nonparametric regression models (Imbens, 2003; Dorie et al., 2016). Under these specifications, the residuals of the observed outcomes Y(t) − E[Y(t) | X, T = t] are distributed as a mixture of normals, and the identification issue noted in this example remains.

[Footnote: In this simulation we set ξ_u = 1/ , α = 0, µ_t = 0, ψ_T = − , ψ_Y = 5 and σ = 1.]

Relatedly, a post hoc sensitivity analysis raises the spectre of data snooping. In a valid sensitivity analysis, the investigator can tune an observed data model in a held-out sample.
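The identification pathology in Example 1 is easy to see in simulation. The following is a minimal sketch with illustrative parameter values (not necessarily those used for Figure 1, some of which are elided above): the observed control-arm distribution shifts as ψ_T varies, so the observed data are informative about the sensitivity parameter.

```python
import numpy as np

rng = np.random.default_rng(2)

def observed_control_mean(psi_t, psi_y=5.0, xi_u=0.5, n=200_000):
    """Mean of Y(0) | T = 0 under the binary-latent-confounder model
    of Example 1 (illustrative values: alpha = 0, mu_0 = 0, sigma = 1)."""
    u = rng.binomial(1, xi_u, size=n)
    p = 1.0 / (1.0 + np.exp(-psi_t * u))   # g = inverse logit, alpha = 0
    t = rng.binomial(1, p)
    y0 = rng.normal(psi_y * u, 1.0)
    return y0[t == 0].mean()

m_null = observed_control_mean(psi_t=0.0)   # unconfounded: mean near 2.5
m_conf = observed_control_mean(psi_t=-3.0)  # control arm over-represents U = 1
```

With ψ_T = 0 the control arm mixes U = 0 and U = 1 in their population proportions; with ψ_T = −3 units with U = 1 are pushed into the control arm, visibly shifting the observed control mean. This is exactly the sense in which the sensitivity parameters here are identified by the observed data.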
By contrast, in a sensitivity analysis with identified sensitivity parameters, the investigator can instead choose the model fit that is the most favorable to their conclusions, and declare it to be the “main” model whose robustness is being checked (Rosenbaum, 2017, p. 172, “What Is Not a Sensitivity Analysis?”). Finally, in the context of Bayesian sensitivity analysis, Gustafson et al. (2018) note that identified sensitivity parameters introduce a tension between the prior information used to calibrate the sensitivity parameters and the information from the observed data that makes posterior uncertainties difficult to interpret. For example, in the context of a clinical trial with dropout, Scharfstein et al. (2003) consider and reject a parametric complete-data model with an identified sensitivity parameter because, under this model, the observed data fit implies implausibly large selection bias.

We now turn to applying Tukey’s factorization to assessing sensitivity to unobserved confounding, paying special attention to the structural differences between this new setting and the more common missing data setting. We show that these structural differences imply a distinct Tukey factorization for causal inference, which includes a copula that characterizes the dependence between potential outcomes. Importantly, we show that there is a broad class of estimands, including the ATE and QTE, whose sensitivity does not depend on this copula. For these estimands, sensitivity analysis can proceed by simply treating the observational study as though it were two separate missing data problems. In short, this result can be summarized as: “causal inference is missing data twice”.
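Before developing the one-arm case formally, the tilt in Equation (1) can be checked numerically in a toy missing data problem (a sketch with assumed values, not a model from the paper): when the joint distribution of (Y(1), T) is known by construction, re-weighting draws of Y(1) | T = 1 by the selection ratio f(T = 0 | y)/f(T = 1 | y) recovers the distribution of Y(1) | T = 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy joint: Y(1) ~ N(0, 1), P(T = 1 | Y(1) = y) = 1/(1 + exp(-y)).
n = 400_000
y = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-y)))

y_obs = y[t == 1]        # draws from f_obs(Y(1) | T = 1)
w = np.exp(-y_obs)       # f(T = 0 | y) / f(T = 1 | y) under logistic selection

# Self-normalized tilted mean vs. the directly simulated missing-arm mean.
tilted_mean = np.sum(w * y_obs) / np.sum(w)
direct_mean = y[t == 0].mean()
```

Here the “missing” arm can be simulated directly only because the joint is synthetic; in a real analysis the tilt is the only route to f^mis, which is precisely why the selection ratio carries all of the unidentified structure.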
We begin by demonstrating Tukey’s factorization on only one arm of an observational study. Specifically, we examine the joint distribution of one potential outcome Y(t) and the treatment indicator T. This case is analogous to a missing data problem, where Tukey’s factorization has been applied previously (Birmingham et al., 2003; Scharfstein et al., 1999; Franks et al., 2016; Linero and Daniels, 2017). Tukey’s factorization of this joint distribution is

    f_ψ(Y(t), T | X) = f^obs(Y(t) | T = t, X) f(T = t | X) · [ f_ψ(T | Y(t), X) / f_ψ(T = t | Y(t), X) ].    (3)

Here, the first two factors constitute the observed data density, which is nonparametrically identified, while the final factor is determined by the selection function, which is unidentified but easily interpreted. In our approach, we parameterize the selection function with sensitivity parameters ψ. Thus, the unidentified selection function fully determines the relationship between the observed outcome distribution and the distribution of missing outcomes.

The validity of this factorization requires two technical conditions. First, following Robins et al. (2000), we need to impose some mild restrictions on the selection factors.

Condition 1 (Integral constraints). For each t, the normalizing constant in (1) satisfies:

    ∫_Y f^obs(Y(t) | T = t, X) · [ f_ψ(T = 1 − t | Y(t), X) / f_ψ(T = t | Y(t), X) ] dY(t) = f(T = 1 − t | X) / f(T = t | X).    (4)

This condition arises by integrating both sides of (3) with respect to Y(t). This condition restricts the class of selection functions that can be used within our framework.
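Condition 1 can be made concrete with a logistic selection function (a toy numerical sketch under assumed values; logistic specifications are developed properly in Section 4): fixing the selection slope γ, the constraint (4) determines the intercept α uniquely, and solving its sample analogue recovers the intercept used to generate the data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed data-generating process: Y(1) ~ N(0, 1),
# P(T = 1 | y) = 1/(1 + exp(-(alpha_true + gamma * y))), no covariates.
alpha_true, gamma = -0.5, 1.0
n = 400_000
y = rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_true + gamma * y))))

y_obs = y[t == 1]                               # f_obs(Y(1) | T = 1) draws
odds_ctrl = (t == 0).mean() / (t == 1).mean()   # f(T = 0) / f(T = 1)

# Under logistic selection the constraint (4) reads
#   exp(-alpha) * E_obs[exp(-gamma * Y)] = f(T = 0) / f(T = 1),
# which has a unique solution in alpha given gamma:
alpha_hat = np.log(np.mean(np.exp(-gamma * y_obs))) - np.log(odds_ctrl)
```

Because the constraint holds in population for the true intercept, the plug-in solution `alpha_hat` should land near `alpha_true` in a large sample, illustrating how (4) ties the unidentified selection function to identified marginal quantities.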
While this restriction has led to technical challenges explored in earlier work (see Robins et al., 2000, for additional discussion), the parameterizations we propose in Section 4 satisfy this constraint automatically.

Second, we need an outcome overlap condition, which ensures that the missing data distribution can be expressed as an extrapolation of the observed data distribution.

Condition 2 (Outcome Overlap Condition). The support of the missing potential outcomes is a subset of the support of the observed potential outcomes. That is, P(Y(t) ∈ A | T = 1 − t, X) > 0 ⇒ P(Y(t) ∈ A | T = t, X) > 0 for all sets A in the outcome sample space Y.

As opposed to Condition 1, this condition is an assumption about the true data-generating process and cannot be enforced by model specification. We discuss this assumption further in Section 7.

We now extend Tukey’s factorization to the observational study setting. In particular, we show that the joint distribution of potential outcomes (Y(0), Y(1)) and treatment T can be uniquely specified by supplementing the distribution of the observed data with three unidentified models: a model for treatment given Y(1) alone; a model for treatment given Y(0) alone; and a copula that specifies the dependence between Y(0) and Y(1) given T.

Under Conditions 1 and 2, the joint density f(T, [Y(0), Y(1)] | X) can be decomposed into two univariate complete-data densities and a copula.
The derivation follows as a consequence of applying Tukey’s factorization to the marginal densities f(T, Y(0) | X) and f(T, Y(1) | X):

    f(T, [Y(0), Y(1)] | X)
      = f(Y(0) | T, X) · f(Y(1) | T, X) · f(T | X) · [ f(Y(0), Y(1) | T, X) / ( f(Y(0) | T, X) f(Y(1) | T, X) ) ]
      = [ f(Y(0), T | X) / f(T | X) ] · [ f(Y(1), T | X) / f(T | X) ] · [ f(Y(0), Y(1) | T, X) / ( f(Y(0) | T, X) f(Y(1) | T, X) ) ] · f(T | X)
      = f(Y(0) | T = 0, X) f(T = 0 | X) · [ f(T | Y(0), X) / f(T = 0 | Y(0), X) ]
        · f(Y(1) | T = 1, X) f(T = 1 | X) · [ f(T | Y(1), X) / f(T = 1 | Y(1), X) ]
        · [ 1 / f(T | X) ] · c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X),    (5)

where we use F to represent a cumulative distribution function. We now define several important terms. First, we define the marginal selection factors f(T | Y(1), X) and f(T | Y(0), X), which specify the non-ignorable selection mechanism in each arm. Second, we define the conditional copula

    c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X) = f(Y(0), Y(1) | T, X) / ( f(Y(0) | T, X) f(Y(1) | T, X) )
as the copula density that characterizes the residual dependence between potential outcomes conditional on the assigned treatment. Apart from the marginal selection factors and the conditional copula, all other terms in (5) are identifiable.

Equation (5) implies that the density of the missing potential outcomes, conditional on the observed potential outcomes, is

    f^mis(Y(t) | T = 1 − t, Y(1 − t), X) ∝ f^obs(Y(t) | T = t, X) · [ f(T = 1 − t | Y(t), X) / f(T = t | Y(t), X) ] · c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X).    (6)

We can then use this factorization to estimate the complete joint distribution, f(T, [Y(0), Y(1)] | X), and, in turn, estimate causal estimands of interest.

So far, we have described the full set of unidentified factors necessary to specify the joint density of f(T, [Y(0), Y(1)] | X). We now show that for a broad set of estimands, one need not specify the conditional copula to construct a well-defined sensitivity analysis. These results give us license to conduct sensitivity analysis for such estimands as though the observational study were two independent missing data problems. We refer to the general class of estimands for which this invariance holds as marginal contrasts.

Definition 1.
A causal estimand τ is a marginal contrast estimand iff it can be completely characterized as a function of the marginal distributions of the potential outcomes f(Y(0) | X) and f(Y(1) | X). Formally, letting τ be a functional of the joint distribution f(T, [Y(0), Y(1)] | X), with slight abuse of notation, a marginal contrast satisfies τ(f(T, [Y(0), Y(1)] | X)) = τ(f(Y(0) | X), f(Y(1) | X)).

This class includes common estimands such as the Average Treatment Effect, E[Y(1)] − E[Y(0)], and the Quantile Treatment Effect, τ_q = Q_q(Y(1)) − Q_q(Y(0)). This class, however, excludes estimands that depend on the joint distribution of potential outcomes, such as the proportion of units with positive treatment effects, the variance of the treatment effect, and unit-specific treatment effects (see, for example, Heckman et al., 1997; Ding et al., 2018).

Theorem 1.
Suppose the joint distribution f(T, [Y(0), Y(1)] | X) admits Tukey’s factorization in (5). Further, suppose that τ is a marginal contrast estimand. Then τ is uniquely defined by the marginal selection factors f(T | Y(1), X) and f(T | Y(0), X).

Proof. See appendix.

Stated differently, marginal contrast estimands are invariant to the conditional copula c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X). In likelihood-based estimation, the invariance of the estimand translates to invariance in estimation. In the special case of Bayesian inference, the following corollary establishes when the posterior distribution for marginal contrast estimands exhibits invariance to the specification of the copula. This invariance holds under a distinct parameters condition that is common in Bayesian formulations of ignorability (Gelman et al., 2013).

Corollary 1.1.
In the setting of Theorem 1, suppose, in addition, that the parameters of the conditional copula c(F(Y(0) | T, X), F(Y(1) | T, X) | T, X) are distinct, i.e., a priori independent, of the parameters of all other factors in (5). Then the posterior distributions for marginal contrast estimands are invariant to the specification of the conditional copula in the model likelihood.

This establishes that the selection factors in each treatment arm are the only unidentifiable factors that need to be specified when estimating marginal contrast estimands by Bayesian inference. Obviating the need for the copula is useful in practice because the dependence between potential outcomes is unobservable, even in experiments, and thus assumptions about this dependence are difficult to calibrate. However, one would need to wrestle with this complication for some of the more general sets of estimands mentioned above (e.g., unit-specific treatment effects). These estimands, which are not marginal contrasts, arise for instance when estimating optimal treatment regimes (Klausch et al., 2018). We leave sensitivity analysis for estimands of this type to future work.
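Theorem 1 can be illustrated numerically with a toy sketch (assumed marginals, unrelated to any particular application): two joint distributions of (Y(0), Y(1)) with identical N(0, 1) and N(1, 1) marginals but different dependence structures yield the same ATE and QTE, while a non-marginal estimand such as Var(Y(1) − Y(0)) changes with the copula.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Same marginals, two copulas: perfectly dependent vs. independent.
z = rng.normal(size=n)
y0_dep, y1_dep = z, z + 1.0
y0_ind, y1_ind = rng.normal(size=n), rng.normal(size=n) + 1.0

# Marginal contrasts: identical (up to Monte Carlo error) across copulas.
ate_dep = y1_dep.mean() - y0_dep.mean()
ate_ind = y1_ind.mean() - y0_ind.mean()
qte_dep = np.quantile(y1_dep, 0.9) - np.quantile(y0_dep, 0.9)
qte_ind = np.quantile(y1_ind, 0.9) - np.quantile(y0_ind, 0.9)

# Not a marginal contrast: the treatment-effect variance differs sharply.
var_te_dep = (y1_dep - y0_dep).var()   # near 0 under perfect dependence
var_te_ind = (y1_ind - y0_ind).var()   # near 2 under independence
```

This is the practical content of the theorem: for ATE- and QTE-type estimands, the analyst can leave the copula unspecified without affecting the sensitivity analysis.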
We now propose a simple but widely applicable class of models that addresses several practical challenges with implementing Tukey’s factorization for causal inference. Specifically, we focus on logistic selection with mixtures of exponential families (logistic-mEF models). In these models, the marginal selection functions in each arm are specified as logistic in the potential outcomes, and the observed data is modeled with a mixture of exponential family distributions. Building on previous work, we show that this class of models automatically satisfies Condition 1 and yields missing potential outcome distributions that are analytically tractable. Importantly, logistic-mEF models include non-parametric observed data models like Dirichlet process mixtures (DPM) and BART. Thus, beyond being analytically convenient, the results in this section directly apply to sensitivity analysis with many flexible modeling tools already used in causal inference. We start by describing the setting in which the observed data densities belong to a single exponential family distribution, and then demonstrate how our formulation extends to mixtures of exponential families. We leave it to future work to explore specifications outside of the logistic-mEF family.
Under a logistic selection specification, we posit that the log-odds of receiving treatment are linear in some sufficient statistics of the potential outcomes, s_t(Y(t)):

f(T = 1 | Y(t), X) = logit^{-1}{α_t(X) + γ_t′ s_t(Y(t))},   (7)

where logit^{-1}(x) = (1 + exp(−x))^{-1}. This specification has sensitivity parameters γ = (γ_0, γ_1), which describe how treatment assignment depends marginally on each potential outcome, and a parameter α_t(X) in each arm that is identified by the observed data once γ_t is specified (discussed below). Logistic selection is commonly used in latent confounder approaches to model the probability of treatment given the unobserved confounder. Here, we take a similar approach but instead assume that the treatment probabilities are logistic in (sufficient statistics of) the potential outcomes. Throughout this paper we assume the sensitivity parameters are independent of covariates. We return to this point in the discussion.

Beyond their interpretability, logistic selection specifications have several desirable technical properties, some of which have been explored previously in different settings. Here, we adapt these results to our setting and notation. First, for any specification of γ_t that implies a proper missing data distribution, Condition 1 is automatically satisfied. In particular, Robins et al. (2000) show that in each arm t, for each γ_t, there exists a unique α_t(X) that satisfies Condition 1. Second, Rotnitzky et al. (2001) and Scharfstein et al. (2003) show that the missing data distribution implied by a logistic selection specification in each arm t is free of α_t(X). Thus, having specified γ_t, it is not necessary to solve for α_t(X) explicitly.
Specifically,

f_mis(Y(t) | T = 1 − t, X) = f_obs(Y(t) | T = t, X) exp(γ_t′ s_t(Y(t))) / C(γ_t, X),   (8)

where the normalizing constant C(γ_t, X) = ∫ exp(γ_t′ s_t(Y(t))) f_obs(Y(t) | T = t, X) dY(t). Finally, in the observational causal inference setting, because Condition 1 is satisfied independently for valid values of γ_t in each arm, the sensitivity parameter vectors γ_0 and γ_1 are variation independent.

The primary practical difficulty in applying the logistic selection specification is computing the normalizing constant C(γ_t, X) in (8) for each specification of γ_t. The normalizing constant is necessary for computing causal effects at each value of γ_t, as well as for calibrating γ_t, as we discuss in Section 5. Dealing with this normalizing constant in practice can require either computationally expensive calculation or strong restrictions on the observed data model. For example, Scharfstein et al. (2003) address the normalizing constant by either modeling the observed data with its empirical CDF, for which the integration is trivial, or, in the context of a more complex model, by using Markov chain Monte Carlo on the joint space of sensitivity parameters and parameters from the observed data model. In the next section, we show that logistic-mEF models largely avoid this difficulty.

We now show that sensitivity analysis with logistic specifications is especially convenient under the weak assumption that the observed data model belongs to the class of exponential family mixtures. In particular, we review results of Franks et al. (2016) showing that the mixture of exponential families assumption makes the normalizing constant in (8) analytically tractable.
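To make the tilt in Equation (8) concrete, it can be computed numerically for any observed-data density on a grid. The sketch below (function and variable names are ours, not from the paper) uses s(y) = y and tilts a standard normal observed density; completing the square shows the exact answer is a N(γ, 1) density, so the numerically tilted density should have mean γ and variance 1.

```python
import numpy as np

def tilted_missing_density(y, f_obs, gamma):
    """Equation (8) with s(y) = y: f_mis(y) is proportional to
    f_obs(y) * exp(gamma * y), normalized numerically on a uniform grid."""
    w = f_obs * np.exp(gamma * y)
    dy = y[1] - y[0]
    return w / (w.sum() * dy)

# Tilt a standard normal observed density with gamma = 0.5.
y = np.linspace(-10.0, 10.0, 40001)
dy = y[1] - y[0]
f_obs = np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)
f_mis = tilted_missing_density(y, f_obs, gamma=0.5)

mean_mis = (y * f_mis).sum() * dy
var_mis = ((y - mean_mis) ** 2 * f_mis).sum() * dy
print(round(mean_mis, 3), round(var_mis, 3))  # prints: 0.5 1.0
```

For densities without a closed-form tilt, this grid computation is exactly the expensive normalization step the logistic-mEF results below are designed to avoid.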
We first model the observed data as an exponential family distribution with natural parameter η_t(X), possibly depending on X:

f_obs_t(Y(t) | T = t, X) = h(Y(t)) g(η_t(X)) e^{s(Y(t))′ η_t(X)}.   (9)

A key result from Franks et al. (2016) shows that when the selection function is assumed to have the logistic form in (7), the missing data distribution belongs to the same exponential family as the observed data.

Proposition 1.
Assume that the observed data is an exponential family with density f_obs_t(Y(t) | T = t, X) = h(Y(t)) g(η_t(X)) e^{s(Y(t))′ η_t(X)}, with sufficient statistic s(Y(t)) and natural parameter η_t(X). Further, assume the probability of selection is logistic in that statistic, f(T = 1 | Y(t), X) = logit^{-1}{α_t(X) + γ_t′ s_t(Y(t))}, such that η*_t(X) = η_t(X) + γ_t lies in the natural parameter space of the exponential family of f_obs_t. Then the distribution of the missing potential outcomes in arm t is in the same exponential family as f_obs_t, and has density

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) = h(Y(t)) g(η*_t(X)) e^{s(Y(t))′ η*_t(X)}.   (10)

Proposition 1 follows because, with an exponential family outcome model, the normalizing constant in Equation (8) is analytically tractable, with C(γ_t, X) = g(η_t(X)) / g(γ_t + η_t(X)). In addition, the constraint that η*_t(X) lie in the natural parameter space ensures that the missing data distribution is proper. Given the missing data distribution in (10), the implied complete data density f_{γ_t,t}(Y(t)) can then be expressed as a simple mixture of the observed and missing components.

To provide some intuition for how the factorization operates under logistic selection with exponential families, we present two simple examples.

Example 2 (Binary observed outcomes). Let Y(t) ∈ {0, 1} and

f_obs_t(Y(t) | T = t, X) ∼ Bern(logit^{-1}(µ_t(X))),
f(T | X, Y(t)) ∝ Bern(logit^{-1}(α_t(X) + γ_t Y(t))).

By Proposition 1 the unobserved potential outcome distribution is:

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) ∼ Bern(logit^{-1}(µ_t(X) + γ_t)).

Thus, the extrapolation factorization with logistic treatment assignment applied to Bernoulli data implies an additive shift in the log-odds of the unobserved potential outcomes, relative to the observed potential outcome distribution.
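Example 2 can be verified directly: tilting a Bernoulli mass function by exp(γ y) on {0, 1} and renormalizing, as in Equation (8), reproduces the additive log-odds shift exactly. A minimal check (names are ours):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def tilt_bernoulli(mu, gamma):
    """Tilt Bern(logit^{-1}(mu)) by exp(gamma * y), y in {0, 1},
    and renormalize, as in Equation (8)."""
    p = logistic(mu)
    num = p * math.exp(gamma)       # the mass at y = 1 gets weight e^gamma
    return num / (num + (1.0 - p))  # the mass at y = 0 is unweighted

# Example 2: the tilted success probability equals logit^{-1}(mu + gamma),
# since the odds p/(1-p) = e^mu are multiplied by e^gamma.
mu, gamma = 0.3, 1.2
print(tilt_bernoulli(mu, gamma), logistic(mu + gamma))
```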
Example 3 (Normal observed outcomes). Let Y(t) follow a Normal distribution. Because the Normal distribution is a two-parameter exponential family distribution with sufficient statistics s(Y(t)) = (Y(t), Y(t)²), we consider a treatment assignment mechanism that is quadratic in the potential outcomes:

f_obs_t(Y(t) | T = t, X) ∼ N(µ_t(X), σ_t²),
f(T | Y(t)) ∝ Bern(logit^{-1}(α_t(X) + γ_t Y(t) + ψ_t Y(t)²)).

By Proposition 1 the unobserved potential outcome distribution is:

f_mis_{(γ_t,ψ_t),t}(Y(t) | T = 1 − t, X) ∼ N( (µ_t(X) + γ_t σ_t²) / (1 − 2ψ_t σ_t²), σ_t² / (1 − 2ψ_t σ_t²) ).   (11)

This model has two sensitivity parameters for each treatment arm. Assuming that the logistic function is linear in Y(t), i.e. ψ_t = 0, implies that the standard deviations of the observed and missing potential outcome distributions are identical. In Section 6.1, we estimate µ_t(X) from the observed data using BART, which allows us to conduct a sensitivity analysis on potential outcomes with flexibly estimated response surfaces.

We now extend the formulation to the full class of logistic-mEF models, which include outcome models that are mixtures of exponential family distributions. Let

f_obs_t(Y(t) | η_t, T = t) = Σ_k π_k h_k(Y(t)) g_k(η_tk) e^{s(Y(t))′ η_tk}

be a mixture of exponential family distributions with mixture weights π_k and common sufficient statistics s(Y(t)). Then, the normalizing constant in Equation (8) is analytically tractable, with C(γ_t, X) = Σ_k π_k g_k(η_tk) / g_k(γ_t + η_tk). This leads to the following proposition:

Proposition 2.
Assume that the observed data follows a K-component mixture of exponential family distributions with density f_obs_t(Y(t) | η_t, T = t) = Σ_k π_k h_k(Y(t)) g_k(η_tk) e^{s(Y(t))′ η_tk}, with sufficient statistic s(Y(t)) common across all components. Further, assume the probability of selection is logistic in the sufficient statistic, f(T = 1 | Y(t), X) = logit^{-1}{α_t(X) + γ_t′ s_t(Y(t))}, such that, for each component k, η*_tk(X) = η_tk(X) + γ_t lies in the natural parameter space of the exponential family. Then the distribution of the missing potential outcomes in arm t is also a K-component mixture of the same exponential family distributions and has density

f_mis_{γ_t,t}(Y(t) | η_t, γ_t, T = 1 − t) = Σ_k π*_k h_k(Y(t)) g_k(η*_tk) e^{s(Y(t))′ η*_tk},   (12)

where η*_tk = η_tk + γ_t, and

π*_k = [π_k g_k(η_tk) / g_k(η*_tk)] / [Σ_{j=1}^{K} π_j g_j(η_tj) / g_j(η*_tj)].   (13)

Consider the following simple example.

Example 4 (Normal mixture observed outcomes). We extend the Normal model above to a finite mixture of Normal distributions, with ψ_t = 0:

f_obs_t(Y(t) | T = t, X) ∼ Σ_k π_k N(µ_tk(X), σ_tk²).

Then, by Proposition 1 and Equation (13),

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) ∼ Σ_k π*_k N(µ_tk(X) + γ_t σ_tk², σ_tk²),   (14)

with π*_k ∝ π_k exp(γ_t µ_tk(X) + γ_t² σ_tk² / 2).

Footnote: Note that in the quadratic model (ψ_t ≠ 0) we require ψ_t < 1/(2σ_t²) to ensure propriety of the missing potential outcome distribution (Condition 1).
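The Normal-mixture special case of Proposition 2 can be checked numerically: the analytic component shifts and reweighting must agree with brute-force tilting of the mixture density on a grid, per Equation (8). A sketch (all mixture values are hypothetical; the reweighting constants C_k = exp(γ µ_k + γ² σ_k²/2) follow from the Normal normalizer):

```python
import numpy as np

def normal_mix_pdf(y, pis, mus, sds):
    """Density of a finite normal mixture evaluated on a grid."""
    comps = [p * np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
             for p, m, s in zip(pis, mus, sds)]
    return np.sum(comps, axis=0)

def tilt_normal_mixture(pis, mus, sds, gamma):
    """Example 4: tilting by exp(gamma * y) shifts each component mean by
    gamma * sd_k^2 and reweights components by C_k = exp(gamma * mu_k +
    gamma^2 * sd_k^2 / 2), per Equation (13)."""
    pis, mus, sds = map(np.asarray, (pis, mus, sds))
    new_mus = mus + gamma * sds ** 2
    log_c = gamma * mus + 0.5 * gamma ** 2 * sds ** 2
    w = pis * np.exp(log_c - log_c.max())  # stabilized reweighting
    return w / w.sum(), new_mus, sds

pis, mus, sds, gamma = [0.4, 0.6], [-1.0, 2.0], [0.5, 1.0], 0.7
w, m, s = tilt_normal_mixture(pis, mus, sds, gamma)

# Brute-force check of Equation (8) on a grid.
y = np.linspace(-15.0, 15.0, 60001)
dy = y[1] - y[0]
f_num = normal_mix_pdf(y, pis, mus, sds) * np.exp(gamma * y)
f_num /= f_num.sum() * dy
f_ana = normal_mix_pdf(y, w, m, s)
print(float(np.max(np.abs(f_num - f_ana))))  # should be numerically tiny
```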
In this model, the sensitivity parameters γ_t affect both the mixture weights and the component means: the mixture weight for component k increases with γ_t µ_tk(X) + γ_t² σ_tk² / 2, so components whose means are large in the direction of γ_t are up-weighted.

In general, we can apply Tukey's factorization with logistic selection to observed data densities modeled with non-parametric Bayesian methods like Dirichlet process mixtures of any exponential family distribution, and can easily adapt the mixture results to model structured semi-continuous data as well. In Section 6.2, we demonstrate both of these features by modeling zero-inflated income data, and use a DPM model for the continuous component.

Since sensitivity parameters are not identified from the data, calibrating the magnitude of the parameters requires reasoning about plausible values using prior knowledge and domain expertise. We first discuss how to interpret the sign of the sensitivity parameters. We then turn to the magnitude of the sensitivity parameters and introduce a method to calibrate this quantity using information from observed covariates.
To interpret the sign of the sensitivity parameters, we return to the logistic selection model and consider the case where s(Y(t)) = Y(t); that is, the sufficient statistic in the logistic selection model is simply Y(t) itself. In this setting, γ_0 and γ_1 are scalars and the probability of assignment to treatment is f(T = 1 | Y(t), X) = logit^{-1}(α_t(X) + γ_t Y(t)).

With this setup, the sensitivity parameters (γ_0, γ_1) have a relatively straightforward interpretation: they specify how sufficient statistics of the potential outcomes are over- or under-represented among observed control and treated units. For example, γ_1 > 0 implies that units with large Y(1) are over-represented among treated units, and thus the observed treated unit average will be larger than the average of the unobserved treatment outcomes. Likewise, γ_0 > 0 implies that units with large values of Y(0) are more likely to be assigned to treatment, so the observed control unit average will be smaller than the average of the unobserved control outcomes. We give several examples to illustrate interpretation in practice.

Example 5 (Same Sign). Consider a study of a medical treatment with health outcome Y and unmeasured income variable U. Suppose that larger values of Y correspond to good outcomes, that treatment is expensive, so P(T = 1 | U) is increasing in U, and that higher income induces better outcomes, such that Y(t) = α + τt + ψU + ε with ψ > 0. In this study, both Y(0) and Y(1) are positively correlated with T, so large values of Y(1) are over-represented among the observed treated units, and large values of Y(0) are under-represented among control units, corresponding to positive γ_0 and γ_1. When γ_0 and γ_1 have the same sign, the treatment and control group means are biased in opposite directions, so the ATE changes rapidly as the magnitude of the sensitivity parameters increases.

Example 6 (Opposite Signs).
Consider the canonical "perfect doctor" example (Rubin, 2003), where a particular medical treatment is prescribed by a doctor who is able to perfectly predict Y(0) and Y(1) for each patient, and assigns the patient to whichever arm gives them the higher outcome. Suppose that in this population Y(1) − Y(0) is independent of [Y(1) + Y(0)]/2; that is, the individual treatment effect for each unit is unrelated to each unit's expected outcome under random assignment to treatment. For example, suppose that [Y(0), Y(1)] are independently normal. Then the control arm and the treated arm both over-represent high outcomes, so γ_0 < 0 and γ_1 > 0. When γ_0 and γ_1 have opposite signs, the treated and control group means are biased in the same direction, so the ATE changes slowly as the magnitude of the sensitivity parameters increases.

Example 7 (Single-Arm Confounding). Consider a study evaluating a job training program with wage outcome Y. Prior to the study, a subset of units is randomly given access to an alternative program that they can attend if they do not enroll in the treatment. Let U = 1 indicate access to this alternative program, and suppose that units with access to the alternative program are less likely to enroll in treatment, so P(T = 1 | U) is decreasing in U. Suppose that the alternative program is beneficial on average; for example, suppose the potential outcomes are given by Y(0) = α(X) + ψU + ε with ψ > 0 and Y(1) = α(X) + τ + ε. Then, the observed outcomes under treatment are representative of the distribution of Y(1), so γ_1 = 0, but, under control, units with higher wage outcomes are over-represented, so γ_0 < 0.
When only one of the sensitivity parameters is nonzero, only one of the group means is biased, so the ATE changes moderately when the magnitude of the nonzero sensitivity parameter increases.

5.2 Calibrating the magnitude of sensitivity parameters

We now turn to the more challenging task of calibrating the magnitude of sensitivity parameters. Our primary proposal is to calibrate the magnitude of sensitivity parameters to the amount of variation in the treatment assignment T that is explained by Y(t), above and beyond what is accounted for by X. To instantiate this idea, we adopt a version of the "implicit R²" measure from Imbens (2003), which generalizes variance-explained measures to the case of logistic regression. This approach is consistent with corresponding proposals for calibration in some latent confounder models (Imbens, 2003; Cinelli and Hazlett, 2018), with the important difference that our parameterization involves calibrating the variance explained by the potential outcomes rather than by a latent confounder.

To fix ideas, we will again focus on the simple case where the sufficient statistic is s(Y(t)) = Y(t) and γ_0 and γ_1 are scalars, so that, following Equation (7), the probability of assignment to treatment is f(T = 1 | Y(t), X) = logit^{-1}(α_t(X) + γ_t Y(t)). In the discussion that follows, we work on the logit scale and let m(X) = logit(e(X)) be the logit of the propensity score given variables X. To characterize variance explained, we first note that the logistic treatment model can be expressed using a latent variable formulation as

Z = m(X) + ε, with ε ∼ logistic(0, 1),
T = 0 if Z < 0, T = 1 if Z ≥ 0.

Since T is a deterministic function of the latent Z, it is sufficient to characterize how well Y(t) and X explain Z.
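The latent-variable representation can be checked by simulation: drawing ε from a standard logistic distribution and thresholding Z = m(X) + ε at zero recovers logit^{-1}(m(X)) as the treatment probability. A minimal sketch (the fixed value of m below is hypothetical, standing in for m(X) at one covariate value):

```python
import numpy as np

rng = np.random.default_rng(0)

m = 0.8                  # a fixed value of m(X), for illustration
n = 400_000
eps = rng.logistic(loc=0.0, scale=1.0, size=n)  # standard logistic noise
T = (m + eps >= 0.0)     # T = 1{Z >= 0}, with Z = m + eps

p_hat = T.mean()
p_theory = 1.0 / (1.0 + np.exp(-m))  # logit^{-1}(m)
print(round(p_hat, 3), round(p_theory, 3))  # the two should nearly agree
```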
Because the variance of a standard logistic random variable is π²/3, the total variance of Z is simply Var(m(X)) + π²/3. We then define the variance explained by X, and the partial variance explained by Y(t) given X, respectively, as

ρ_X = Var(m(X)) / (Var(m(X)) + π²/3),   0 ≤ ρ_X ≤ 1,   (15)

ρ_{Y(t)|X} = (ρ_{X,Y(t)} − ρ_X) / (1 − ρ_X),   0 ≤ ρ_{Y(t)|X} ≤ 1.   (16)

The partial variance explained by Y(t) given X, ρ_{Y(t)|X}, represents the fraction of previously unexplained variance in T that can now be explained by adding Y(t) to the set of predictors X.

With our method, we propose a target value, ρ*, for the unidentified ρ_{Y(t)|X}. Although the decision ultimately falls to domain expertise, we suggest using observed predictors to set the target ρ* by analogy. For example, for each covariate X_j, we can compute a partial variance explained by X_j given all other covariates X_{−j}, and set the target ρ* = ρ_{X_j|X_{−j}} for an appropriate covariate based on expert knowledge. We interpret this to mean that the information gained by adding Y(t) to X as a predictor of treatment assignment is comparable to the information gained by adding X_j to X_{−j}. We give concrete examples of this approach in the analysis in Section 6.1. In some cases, alternative calibration schemes may be more appropriate. For example, one could calibrate ρ_{Y(t)|X_A} for some core subset of covariates X_A ⊂ X using (18), replacing X with X_A.

Footnote: One evident approach, which we do not endorse here, is to reason about the magnitude of γ_t directly, calibrating it against regression coefficients inferred from observed covariates (Dorie et al., 2016; Blackwell, 2014; Middleton et al., 2016). The problem with this approach is that it can be difficult to interpret coefficients due to collinearity between and among X and Y(t); see, for example, Oster (2017). In contrast, our approach is robust to multicollinearity among the predictors. See Cinelli and Hazlett (2018) for additional discussion.
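Under the linear logistic selection above, these variance-explained quantities admit the closed-form calibration identity formalized in Proposition 3 below. A small numeric round trip (all values hypothetical) illustrates how a chosen γ maps to a partial variance explained and back:

```python
import math

PI2_OVER_3 = math.pi ** 2 / 3.0  # variance of the standard logistic

def rho_partial(var_m, sigma_r, gamma):
    """Partial variance in T explained by Y(t) given X, as in
    Equation (17): sigma_r^2 gamma^2 / (Var(m(X)) + pi^2/3 + sigma_r^2 gamma^2)."""
    num = (sigma_r * gamma) ** 2
    return num / (var_m + PI2_OVER_3 + num)

def gamma_from_rho(var_m, sigma_r, rho):
    """Invert the calibration identity, Equation (18), for |gamma_t|."""
    return math.sqrt(rho / (1.0 - rho) * (var_m + PI2_OVER_3)) / sigma_r

# Round trip: pick gamma, compute rho via (17), recover |gamma| via (18).
var_m, sigma_r, gamma = 1.5, 2.0, 0.35
rho = rho_partial(var_m, sigma_r, gamma)
print(round(rho, 4), round(gamma_from_rho(var_m, sigma_r, rho), 4))
# prints: 0.0928 0.35
```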
In Section 6.2, we take this approach and apply a calibration scheme setting X_A to be the empty set. See Cinelli and Hazlett (2018) for a discussion of related alternatives in the context of linear modeling.

Once a value of ρ* has been chosen, we must then identify the magnitude of the corresponding sensitivity parameter, γ_t. Since ρ_{Y(t)|X} is monotone in the magnitude of γ_t, there is a one-to-one mapping between the two quantities that can be used to identify γ_t. We formalize this in the following proposition:

Proposition 3 (Calibration Identity). Suppose that logit(e(X, Y(t))) is linear in Y(t), with the form in (7). Further, suppose that logit(e(X)) = m(X) and σ_rt = sqrt(E[Var(Y(t) | X)]). Then,

ρ_{Y(t)|X} = σ_rt² γ_t² / (Var(m(X)) + π²/3 + σ_rt² γ_t²),   (17)

and the inverse of this equation identifies the magnitude of γ_t:

|γ_t| = (1/σ_rt) sqrt( ρ_{Y(t)|X} / (1 − ρ_{Y(t)|X}) · (Var(m(X)) + π²/3) ).   (18)

Footnote: In homoscedastic models, σ_rt is simply the residual standard deviation of the potential outcome, sd(Y(t) | X), and is independent of X.

Proof.
See appendix.

We use (18) to translate any fixed target ρ* to a value for |γ_t|. Note that (18) only requires estimates of Var(m(X)) and σ_rt. Estimating Var(m(X)) is trivial using any propensity score model, but estimating the residual standard deviation σ_rt is slightly more nuanced. This is because σ_rt is a property of the complete distribution of the potential outcome, Y(t), which itself depends on γ_t. We write σ_rt(γ_t) to emphasize that the residual standard deviation depends on γ_t; σ_rt(γ_t) is available in analytic form for the mixture of exponential family models considered in Section 4. We then numerically solve

σ_rt(γ_t) |γ_t| = sqrt( ρ_{Y(t)|X} / (1 − ρ_{Y(t)|X}) · (Var(m(X)) + π²/3) )

for γ_t. In many of the observed-data model specifications discussed in this paper, the residual standard deviations in the observed and missing data densities are identical, so this recursive identification of γ_t is not necessary (e.g., as in Equation 11 for ψ_t = 0).

We now apply our approach to two examples. In the first example, we conduct sensitivity analysis for the effects of blood pressure medication. In this example, we use a nonparametric estimate of the response surface but assume normally distributed residuals. In the second example, we conduct sensitivity analysis on the effect of a job training program on income and employment. In this example, the outcome is zero-inflated, and we focus on quantile treatment effects rather than average treatment effects. In addition to demonstrating the flexibility in modeling structured data, we also show how Tukey's factorization can be applied to nonparametric residual models, by modeling the continuous component of the observed data densities using a Dirichlet process mixture of normal distributions. In both examples, we demonstrate the flexibility of our approach by estimating the posterior distribution of quantile treatment effects for a range of quantiles.
6.1 Analysis of NHANES data
We consider a study aimed at estimating the effect of 'taking two or more anti-hypertensives' on average diastolic blood pressure using data from the Third National Health and Nutrition Examination Survey (NHANES III) (Centers for Disease Control and Prevention (CDC), 1997), a comprehensive survey of Americans' health and nutritional status. We follow the same setup as Dorie et al. (2016), and utilize pre-treatment covariates like race, gender, age, income, body mass index (BMI), and whether the patient was insured, among others. We let Y(t) be the average diastolic blood pressure for a subject in treatment arm t, where t = 1 means the subject was taking two or more anti-hypertensive medications and t = 0 means the subject was not.

In this application, we show that our method works well with a nonparametric model for the response surfaces. First, we assume the following data generating model for the potential outcomes:

f_obs_t(Y(t) | T = t, X) ∼ N(µ_t(X), σ_t²),   (19)
(µ_t(X), σ_t²) ∼ BART(X | T = t),   (20)
f(T | Y(t), X) ∼ Bern(logit^{-1}(α_t(X) + γ_t Y(t))).   (21)

As shown in Section 4, the missing data distribution is therefore:

f_mis_{γ_t,t}(Y(t) | T = 1 − t, X) ∼ N(µ_t(X) + γ_t σ_t², σ_t²).   (22)

Again following Dorie et al. (2016), we use Bayesian Additive Regression Trees (BART; Chipman et al., 2010) to flexibly model µ_t(X) and σ_t². In contrast to their approach, we focus on using BART to estimate only the observed potential outcomes and do not incorporate latent confounders into the BART model. We use independent BART prior distributions for the mean surface, µ_t(X), for each treatment arm and similarly model separate residual variances for each arm (20).
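Given fitted surfaces, the Equation-(22) shift makes the complete-data mean, and hence the ATE, available in closed form for any (γ_0, γ_1). The sketch below uses hypothetical stand-in values for the fitted BART quantities; none of these numbers come from the actual NHANES fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for fitted BART values (NOT real NHANES estimates):
n = 1000
T = rng.integers(0, 2, size=n)             # observed treatment indicator
mu0 = rng.normal(75.0, 5.0, size=n)        # fitted control surface, mmHg
mu1 = mu0 + rng.normal(-1.0, 1.0, size=n)  # fitted treated surface
sd = {0: 6.0, 1: 7.0}                      # residual sds by arm

def complete_mean(t, mu_t, gamma_t):
    """E[Y(t)]: observed-arm units contribute mu_t(X_i); units in the other
    arm contribute the Equation-(22) missing mean mu_t(X_j) + gamma_t*sd_t^2."""
    shift = np.where(T == t, 0.0, gamma_t * sd[t] ** 2)
    return float(np.mean(mu_t + shift))

def ate(gamma0, gamma1):
    """ATE as a function of the sensitivity parameters."""
    return complete_mean(1, mu1, gamma1) - complete_mean(0, mu0, gamma0)

# Under gamma0 = gamma1 = 0 this reduces to the plug-in difference of the
# fitted mean surfaces; nonzero gammas shift each arm's missing component.
print(round(ate(0.0, 0.0), 3), round(ate(0.02, 0.02), 3))
```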
The ATE and QTE can then be estimated as functions of the estimated marginal complete data distributions for the potential outcomes (averaged over covariates), where:

f(Y(t)) ∼ (1/N) Σ_{i=1}^{N_t} N(µ_t(X_i), σ_t²) + (1/N) Σ_{j=1}^{N_{1−t}} N(µ_t(X_j) + γ_t σ_t², σ_t²).   (23)

Calibration.
We calibrate the magnitude of the sensitivity parameters using the approach outlined in Section 5. We illustrate this approach with body mass index (BMI), one of the most important predictors in terms of partial variance explained, with ρ_{X_BMI | X_{−BMI}} ≈ . . To map this value to sensitivity parameters, we use the estimated residual standard deviation of the potential outcomes in each arm, σ̂_r0 ≈ . and σ̂_r1 ≈ . . We can then apply the formula in Equation (18) to obtain |γ_0| ≈ . and |γ_1| ≈ . . In general, we limit our sensitivity analysis to unmeasured confounders up to this magnitude.

Figure 2: Average and quantile treatment effects of diuretics on diastolic blood pressure. Treatment effects are measured in units of millimeters of mercury (mmHg). NS denotes "not significant". a) Estimated ATEs for the BART model. Under unconfoundedness, the ATE is negative but not significant. b) Quantile treatment effects, τ_q. τ_q is increasing in q because the treatment potential outcomes have higher variance than the control potential outcomes. Colored boxes in (a) correspond to QTE distributions in (b).
Results.
In Figure 2a we visualize ATE estimates for a grid of sensitivity parameters. Here "NS" denotes "not significant", by which we mean that the 95% posterior credible interval of the ATE contains 0. Although the notion of "significance" vs. "non-significance" is fragile, it still provides a measure of the uncertainty associated with the estimated effects; see Dorie et al. (2016).

Under unconfoundedness (γ_0 = γ_1 = 0) the posterior mean for the ATE is approximately −0.7 mmHg, but there is enough posterior uncertainty that the effect is not significantly different from 0 (light blue box in Figure 2a). The ATE changes the most along the diagonal γ_0 = γ_1 = γ, with the ATE no longer significantly different from 0 when γ = 0. . Figure 2 also highlights the sensitivity patterns discussed in Examples 5–7, where the ATE is far more sensitive to the magnitudes of γ_0 and γ_1 when they have the same sign. For example, for the pink box, with γ_0 = 0. and γ_1 = 0. , the posterior mean ATE is approximately −. mmHg, which is roughly half a (marginal) standard deviation larger than the estimate under unconfoundedness. This setting corresponds to ρ_{Y(0)|X} ≈ ρ_{X_ins | X_{−ins}}; the variance explained by the treatment potential outcome is comparable to the variance explained by BMI, that is, ρ_{Y(1)|X} ≈ ρ_{X_BMI | X_{−BMI}} (see Figure 4).

In Figure 2b, we plot the 95% posterior credible band for the quantile treatment effects, τ_q, for several settings of the sensitivity parameters (colors match the squares in Figure 2a). Interestingly, τ_q is increasing with q for all combinations of γ, which occurs because the estimated residual variance for the treated potential outcomes is slightly larger than for the control potential outcomes. Similarly, for the complete data, the difference between the largest and smallest treatment potential outcomes is larger than the difference in the control potential outcomes.

Footnote: Specifically, the formula in (18) implies σ_rt γ_t ≈ . . In turn, |γ_0| ≈ . / . ≈ . and |γ_1| ≈ . / . ≈ . .
This is consistent with a situation in which the treatment effect varies across individuals beyond what is explained by covariates (Ding et al., 2018).

In the appendix, we also demonstrate the separation between the observed data model and the sensitivity analysis in our framework. Specifically, we show how results change when we use different observed data models with the same treatment selection specifications as above. We provide results for two additional observed data models: an alternative BART parameterization and the Bayesian Causal Forest (BCF) model (Hahn et al., 2017). Broadly, our results are consistent with Dorie et al. (2016), who show that both effect size and significance in this example can be sensitive to changes in the outcome model (testable) and treatment selection (untestable). Finally, while we use a flexible model to estimate the response surfaces, the quantile effects are still sensitive to the assumption of normality on the residuals. We turn to this next.

6.2 Analysis of JTPA data

In this example, we conduct a sensitivity analysis for quantile treatment effects for zero-inflated income data. Zero-inflated outcomes are common in a range of settings; we focus on the context of evaluating job training programs. In these studies, the primary outcome of interest, income, is zero for individuals who are unemployed, and thus the average treatment effect misses important variation (Heckman et al., 1997; Bitler et al., 2006). As a result, several studies have instead focused on estimating quantile treatment effects in these settings. Specifically, we consider the well-known study from Abadie et al. (2002), who estimate quantile treatment effects in the context of the Job Training and Partnership Act (JTPA) evaluation, a large randomized trial estimating the impact of select workforce development programs on wages.
In this analysis we focus only on individuals randomly assigned to treatment, and compare outcomes between those who chose to participate in the program (T = 1) versus those who did not (T = 0). We choose this artificial setup, comparing participants and non-participants among those randomly assigned to treatment, specifically because selection bias is a clear concern.

This analysis is designed to showcase that our sensitivity framework allows investigators to conduct sensitivity analysis even when they employ flexible models of complex data. Following previous work, we develop a two-part model for the semi-continuous data (Duan et al., 1983; Olsen and Schafer, 2001; Javaras and Van Dyk, 2003). Let Y(t) be income and W(t) ∈ {0, 1} an indicator for employment status, with W(t) = 0 ⟹ Y(t) = 0 and W(t) = 1 ⟹ Y(t) > 0. For simplicity, we exclude covariates in this analysis and focus on nonparametric estimation of the observed potential outcome distributions. Specifically, we flexibly model the observed (log) income among the employed using a Dirichlet process mixture of normal distributions (Neal, 2000):

f_obs_t(log(Y_i(t)) | T = t, W = 1) ∼ N(µ_it, σ_it²),   (24)
(µ_it, σ_it²) ∼ G,   (25)
G ∼ DP(αG_0),   (26)

where G_0 is the conjugate normal-inverse gamma prior distribution for the normal likelihood. To estimate the observed data density, we use the "dirichletprocess" R package, which implements a truncated stick-breaking process to approximate the infinite mixture weights (Ross and Markwick, 2018).

We then propose the following treatment selection specification, with separate selection functions for employed and unemployed individuals:

f(T = t | log(Y(t)), W(t) = 1) ∼ logit^{-1}(β_t + γ_t log(Y(t))),   (27)
f(T = t | W(t) = 0) ∼ logit^{-1}(α_t + ω_t 1{W(t) = 0}).   (28)

In words, for employed individuals, the participation probability is logistic in the log income.
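Treating the two parts separately, as the separate selection functions above suggest, each part transforms in a familiar way under the generic tilt of Equation (8): the odds of the zero state scale by e^ω (Example 2 logic), and the log-income mixture components shift and reweight (Example 4 logic). A sketch with hypothetical stand-in values, using a two-component normal mixture in place of the fitted Dirichlet process mixture (we elide the joint renormalization across the two parts):

```python
import numpy as np

def two_part_tilt(p_zero, pis, mus, sds, gamma, omega):
    """Apply the exponential tilt of Equation (8) within each part of a
    two-part outcome: a point mass at zero (W = 0) tilted by exp(omega),
    and a normal mixture for log income (W = 1) tilted by exp(gamma*log y)."""
    pis, mus, sds = map(np.asarray, (pis, mus, sds))
    # Binary part: odds of the zero state scale multiplicatively by e^omega.
    odds_zero = p_zero / (1.0 - p_zero) * np.exp(omega)
    p_zero_new = odds_zero / (1.0 + odds_zero)
    # Continuous part: shift component means, reweight by tilting constants
    # C_k = exp(gamma * mu_k + gamma^2 * sd_k^2 / 2), as in Example 4.
    new_mus = mus + gamma * sds ** 2
    w = pis * np.exp(gamma * mus + 0.5 * gamma ** 2 * sds ** 2)
    return p_zero_new, w / w.sum(), new_mus, sds

# Hypothetical stand-ins (not fitted JTPA quantities):
p_zero, pis = 0.30, [0.5, 0.5]
mus, sds = [9.0, 10.0], [0.6, 0.4]
p0, w, m, s = two_part_tilt(p_zero, pis, mus, sds, gamma=-0.1, omega=-0.5)
print(round(p0, 3))  # prints: 0.206 (omega = 0 would leave it at 0.3)
```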
Analogous to the Normal mixture example in Section 4 (Example 4), under the logistic selection model, the missing potential log-incomes log(Y(t)) also follow a Dirichlet process mixture of normals, but with different component means and mixture weights. For unemployed individuals, the odds of participation scale multiplicatively by exp(ω_t). Analogous to the Bernoulli example (Example 2), under the logistic selection model, the missing potential employment outcome W(t) is also Bernoulli, but with an additive shift in the log-odds.

Figure 3: Example model specifications for non-participators (a, b) and quantile treatment effects (c) for the JTPA data set. a) Participation probability as a function of the non-participator potential outcomes, (Y(0), W(0)). The left bar plot shows the participation probability given employment status, f(T = 1 | W(0)), and the right depicts participation versus log income, f(T = 1 | log(Y(0))), for three different pairs of sensitivity parameters. b) The probability of unemployment, E[W(0) | T = 1], for the missing control units (left) and the log income density for the employed among the missing control outcomes, f(log(Y(0)) | T = 1) (right). c) 95% posterior confidence bands for τ_q, for five pairs of sensitivity parameters.

Calibration. Reflecting the zero-inflated data structure, we handle calibration separately for employed and unemployed individuals. By way of demonstration, we use covariates to calibrate our sensitivity analysis but exclude them from treatment effect estimation.
The covariates, X, that we use for calibration are race, marital status, gender, age, and high school diploma or equivalent.

First, we calibrate the magnitude of γ_t, the sensitivity parameter for (log) income. In this analysis, we focus on the subset of the sensitivity parameter space in which γ_0 = γ_1 = γ. We then calibrate γ by fixing target values of ρ_{Y(t)} to be approximately ρ_X, the partial variance explained by the vector of observed covariates. We find that for the subset of employed individuals, the variance in T explained by X is ρ_X ≈ . . By Equation (18), we find that |γ_t| ≈ . when ρ_{Y(t)} = 0. , where we use the fact that Var(m(X)) = Var(m(∅)) = 0.

Second, we calibrate the magnitude of ω_t, the sensitivity parameter for (binary) employment, W(t). We follow the same calibration strategy by setting the target for ρ_{W(t)} ≈ ρ_X. We find that for the subset of unemployed individuals, the variance in T explained by X is ρ_X ≈ . . Applying Equation (18), this value of ρ_{W(t)} corresponds to a value of |ω_t| ≈ . .

Figure 3a visualizes the different user-specified choices for the unidentified selection model (27–28), or probability of assignment to treatment, implied under three different settings of (γ, ω). Here, ω determines the height of the bar in the left panel, and γ determines the steepness of the curve in the right panel.

Footnote: This is not a viable approach in a real analysis without covariates, since there is no X from which we can estimate ρ_X. Here, ρ_{Y(t)} = ρ_{Y(t)|∅} is the marginal variance explained.

Results.
In this section, we summarize our results, as implied by this choice of treatment assignment function and the observed data density estimated with the Dirichlet Process mixture model (24–26), in Figures 3b and 3c. First, in Figure 3b we display the distribution of missing control outcomes for those units assigned to treatment, f^{(γ,ω)}_mis(Y(0), W(0) | T = 1). From this figure, it is clear that the observed data imply very different counterfactual employment and income distributions under different sensitivity parameter settings. The thickness of the lower tail of the income distribution for the employed appears to be particularly sensitive to selection effects. The fraction of unemployed among the missing control outcomes assigned to treatment is roughly twice as large for ω = −0.5 as it is for ω = 0.5.

(Two footnotes to the calibration discussion: calibrating against ρ²_X is not a viable approach in a real analysis without covariates, since there is no X from which we can estimate ρ²_X; and ρ²_{Y(t)} = ρ²_{Y(t)|∅} is the marginal variance explained.)

Figure 3c summarizes sensitivity in the QTEs, which are the primary objects of interest. Consistent with the distributional differences in Figure 3b, the QTEs are sensitive in terms of both employment and income levels. First, under all sensitivity settings we explore, the QTEs at the lowest quantiles are identically zero, since both the control and treated outcome distributions have a point mass at zero. The quantiles at which the QTEs depart from zero are related to the proportion unemployed in the treatment and control populations. For example, for ω = 0 or −0.5, the proportion of units who are unemployed under treatment is smaller than the proportion of units who are unemployed under control, f(W(1) = 0) < f(W(0) = 0). Here, the QTE increases away from 0 at q = f(W(1) = 1). By contrast, for ω = 0.5
, the proportion unemployed under treatment is actually larger than the proportion unemployed under control, f(W(1) = 0) > f(W(0) = 0), and thus the QTE decreases away from 0 (briefly) at q = f(W(0) = 1).

Differences in income effects are more straightforward to read from the plot. For the sensitivity parameter settings used in this analysis, the distribution of income under treatment, Y(1), generally has larger quantiles than the distribution of income under control, Y(0). Only when selection effects are very large, e.g. (γ, ω) = (0.1, 0.5), do the effects lose significance and shift toward slightly negative values. In the end, the shapes of the estimated QTEs are similar to those in Abadie et al. (2002), though the impacts are larger due to the artificial setup.

Discussion

In this paper, we proposed a framework for sensitivity analysis in causal inference employing Tukey's factorization. The framework has a number of advantages. First, it cleanly separates the identified and unidentified portions of the data-generating process. This guarantees that the sensitivity parameters are unidentified, in contrast to many latent confounder models, and decouples model checking from sensitivity analysis. Second, it requires the data to be fit only once, reducing computational burden and enabling post facto sensitivity analysis for a wide range of models that previously assumed unconfoundedness. Third, it supports intuitive sensitivity parameterizations that investigators can calibrate against selection on observed covariates.

Extensions of our framework could fit particularly well with modern causal inference workflows that employ a similar separation of observed data modeling and causal reasoning. In these workflows, the analyst first focuses on optimizing a fit to the observed data distribution, often employing heuristics such as cross validation to select or combine models.
The analyst then plugs predictions from this model into an estimation step tailored to the estimand of interest (Van der Laan and Rose, 2011; Chernozhukov et al., 2016; Xu et al., 2018). A sensitivity analysis based on Tukey's factorization would allow investigators to assess sensitivity in this workflow without putting constraints on the flexible model used in the first stage. In particular, following (1), such a sensitivity analysis could be implemented by adding a weighting step, parameterized by the sensitivity parameters, between the first and second stages.

Although we highlighted many different use cases, there are several extensions we did not explore. First, we focused here on sensitivity specifications that are independent of the covariates, which enabled us to limit the number of sensitivity parameters. A natural extension would generalize these specifications to infer causal effects under covariate-varying values of the sensitivity parameter, i.e., γ_t(X) (see Jung et al., 2018). Such an extension would increase the number of sensitivity parameters, introducing several challenges in calibration and reporting results. A second extension would generalize our approach to multiple or multi-level treatments (Imai and Van Dyk, 2004). This would require generalizing the factorization in Equation (5). In some special cases, for example, where the unobserved confounding can be represented as latent factors of the observed distribution of treatments (Wang and Blei, 2018), more parsimonious sensitivity factorizations may be possible, e.g., following D'Amour (2019). A final direction would extend our sensitivity analysis to observational studies with missing data. Specifically, we could combine approaches for dealing with informative dropout, previously applied in experimental settings (Daniels and Hogan, 2000; Scharfstein et al., 2003), with the models for confoundedness in observational studies described in this paper.
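The weighting step described above can be sketched as follows. This is a minimal illustration under a logistic selection function logit f(T = 1 | Y(t)) = α_t + γ_t Y(t); the function names and the self-normalized resampling scheme are ours, and the intercept α_t cancels after normalization.

```python
import numpy as np

def selection_weights(y_draws, gamma_t, t):
    """Importance weights proportional to f(T = 1 - t | Y(t)) / f(T = t | Y(t))
    under a logistic selection function with slope gamma_t on Y(t)."""
    sign = -1.0 if t == 1 else 1.0  # tilt toward the unobserved arm
    w = np.exp(sign * gamma_t * np.asarray(y_draws, dtype=float))
    return w / w.sum()

def resample_missing_outcomes(y_draws, gamma_t, t, size, rng=None):
    """Approximate draws from the missing-outcome density f(Y(t) | T = 1 - t)
    by reweighting draws from the observed arm f(Y(t) | T = t)."""
    rng = np.random.default_rng(rng)
    return rng.choice(y_draws, size=size, p=selection_weights(y_draws, gamma_t, t))
```

With γ_t = 0 the weights are uniform and the observed arm is returned unchanged; larger |γ_t| tilts the resampled draws further into the tail of the first-stage fit.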
We leave it to future work to explore these and other classes of useful extrapolation models, including models that facilitate specification of covariate-varying sensitivity parameters and multiple treatment regimes.

There are also several open technical questions about our application of Tukey's factorization to observational studies. One important consideration is the validity of the outcome overlap condition (Condition 2, Section 3). This condition says that the support of the missing potential outcomes must be a subset of the support of the observed potential outcomes. As discussed in Franks et al. (2016), even when the outcome overlap condition is technically satisfied, the inferred missing data distribution can be sensitive to the estimated tails of the observed data density if the distance between the observed and missing data is large. This is particularly evident from the importance weighting perspective, since the ratio f(T = 1 - t | Y(t)) / f(T = t | Y(t)) increases in Y(t) in regions where the missing data density is far from the observed data density. In this case, the inferred missing data density is largely determined by parametric assumptions about the tail behavior of the observed data densities, which often carry limited information in practice. Additionally, the outcome overlap condition may become less plausible when the covariates explain most of the variance in the observed outcome Y(t). In such a case, an unobserved confounder would only need to induce a small amount of variation in Y(t) to violate the outcome overlap assumption.

In the end, the methods described in this paper are quite general, can be extended to a range of models, and are easy to interpret and implement, even for complex data-generating models. We therefore believe Tukey's factorization is a powerful framework for assessing sensitivity to unobserved confounders in observational causal inference.

A Theory
Proof of Theorem 1: τ_ATE, τ_ATT, τ_ATC, and τ_OR are all functions only of f(Y(t)), t ∈ {0, 1} (or f(Y(t) | X) for conditional treatment effects). Thus it suffices to show that f(Y(t) | T) is independent of any copula parameters. Note that in the extrapolation factorization we model f(Y(t) | T = t) directly, and thus this conditional distribution is independent of copula parameters by definition. It therefore suffices to show that f(Y(t) | T = 1 - t) is independent of copula parameters:

f(Y(t) | T = 1 - t)
  = ∫ f(Y(t), Y(1 - t) | T = 1 - t) dY(1 - t)
  = ∫ f(Y(t) | Y(1 - t), T = 1 - t) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  ∝ ∫ f(Y(t) | T = t) [f(T = 1 - t | Y(t)) / f(T = t | Y(t))] c(F(Y(t) | T), F(Y(1 - t) | T) | T) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = f(Y(t) | T = t) [f(T = 1 - t | Y(t)) / f(T = t | Y(t))] ∫ c(F(Y(t) | T), F(Y(1 - t) | T) | T) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = f(Y(t) | T = t) f(T = 1 - t | Y(t)) / f(T = t | Y(t)),

where the last equality holds by the definition of the copula density:

∫ c(F(Y(t) | T), F(Y(1 - t) | T) | T) f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = ∫ [f(Y(t), Y(1 - t) | T = 1 - t) / (f(Y(t) | T = 1 - t) f(Y(1 - t) | T = 1 - t))] f(Y(1 - t) | T = 1 - t) dY(1 - t)
  = ∫ f(Y(1 - t) | Y(t), T = 1 - t) dY(1 - t)
  = 1.

Proof of Proposition 3:
We seek to find the value of γ_t such that the model (17) implies that ρ²_{Y|X} achieves a particular value, ρ*. Write

m(X, Y(t)) := logit(e(X, Y(t))) = α_t(X) + γ_t Y(t)    (29)
  = α_t(X) + γ_t μ_t(X) + γ_t (Y(t) - μ_t(X))    (30)
  = α*_t(X) + γ̃_t R̃(t)    (31)
  = m(X, R̃(t)),    (32)

where R̃(t) = R(t)/σ_rt is the unit-scaled complete-data residual, γ̃_t := σ_rt γ_t, and σ_rt = sqrt(E[Var(Y(t) | X)]). We define α*_t(X) := α_t(X) + γ_t μ_t(X). Importantly, since m(X, Y(t)) = m(X, R̃(t)), the above implies that ρ²_{Y(t),X} = ρ²_{R̃(t),X}. Since R̃(t) is orthogonal to α*_t(X) and has unit variance, we have Var(m(X, R̃(t))) = Var(m(X) + γ̃_t R̃(t)) = Var(m(X)) + γ̃_t². Thus,

ρ²_{X,Y(t)} = ρ²_{X,R̃(t)} = (Var(m(X)) + γ̃_t²) / (Var(m(X)) + γ̃_t² + π²/3).    (33)

Using the definition of "implicit R-squared" from Section 5, we have

ρ²_{Y(t)|X} = (ρ²_{X,Y(t)} - ρ²_X) / (1 - ρ²_X)    (34)
  = [(Var(m(X)) + γ̃_t²) / (Var(m(X)) + γ̃_t² + π²/3) - Var(m(X)) / (Var(m(X)) + π²/3)] / [(π²/3) / (Var(m(X)) + π²/3)]    (35)
  = 1 - (Var(m(X)) + π²/3) / (Var(m(X)) + γ̃_t² + π²/3).    (36)

Solving the above equation for the γ̃_t such that ρ²_{Y(t)|X} = ρ* yields

|γ̃_t| = sqrt( [ρ* / (1 - ρ*)] (Var(m(X)) + π²/3) ).    (37)

We complete the result by using the fact that γ̃_t = σ_rt γ_t.

B Additional Results from Section 5.1
In Section 5.1, we focus on one particular potential outcomes model, although many plausible models are possible. In this section, we provide results for two variations of the observed potential outcome model.
Figure 4: γ_t vs. ρ²_{Y|X}, calibration for the NHANES data. The magnitude of the sensitivity parameter γ_t is increasing in the residual coefficient of determination, ρ²_{Y|X}. For comparison, we plot the partial coefficients of determination for the covariates, ρ²_{X_j|X_{-j}}, for the most important predictors: age, BMI, insurance, and pulse. We calibrate the magnitude of γ_t in Section 6.1 based on BMI.

This plot highlights that the ATE estimates vary as a function of both model specification (model checking) and the strength of confounding in both treatment arms (sensitivity analysis).

First, we posit a pooled model for the mean surface and residual variance, (μ_t(X), σ_t) ~ BART(X, T), with μ_t(X) = μ(t, X) and σ_t = σ_{1-t}. In Figure 5a we show the results for this model, which has the largest estimated effect size under unconfoundedness of any of the models considered. Under unconfoundedness, the posterior mean ATE is approximately -2.5 mmHg under this model and, unlike under the model proposed in Section 6.1, the effect appears significantly different from 0.

We also show results for the Bayesian Causal Forest (BCF) model recently introduced by Hahn et al. (2017). In this model, the observed propensity score is included as a covariate, and independent BART prior distributions are specified for the control outcome surface and for the heterogeneous treatment effect. Under unconfoundedness, the posterior mean for the ATE is approximately -1.73 mmHg but, in contrast to the other observed data models, the posterior uncertainty is large.
Figure 5: Average treatment effect measured in units of millimeters of mercury (mmHg). NS denotes "not significant". a) Average treatment effect in the pooled model. Under unconfoundedness, the effect size is significantly negative. This model has the smallest posterior uncertainty. b) Average treatment effect in the Bayesian Causal Forest model. Although the effect sizes are comparable, the posterior uncertainty is significantly larger.

References
Abadie, A., J. Angrist, and G. Imbens (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica 70(1), 91–117.
Athey, S. and S. Wager (2017). Efficient policy learning. arXiv preprint arXiv:1702.02896.
Birmingham, J., A. Rotnitzky, and G. M. Fitzmaurice (2003). Pattern-mixture and selection models for analysing longitudinal data with monotone missing patterns. Journal of the Royal Statistical Society, Series B: Statistical Methodology 65, 275–297.
Bitler, M. P., J. B. Gelbach, and H. W. Hoynes (2006). What mean impacts miss: Distributional effects of welfare reform experiments. American Economic Review 96(4), 988–1012.
Blackwell, M. (2014). A selection bias approach to sensitivity analysis for causal effects. Political Analysis 22(2), 169–182.
Carnegie, N. B., M. Harada, and J. L. Hill (2016). Assessing sensitivity to unmeasured confounding using a simulated potential confounder. Journal of Research on Educational Effectiveness 9(3), 395–420.
Centers for Disease Control and Prevention (CDC) (1997). National Health and Nutrition Examination Survey Data III. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, Hyattsville, MD.
Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, and W. K. Newey (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.
Chipman, H. A., E. I. George, R. E. McCulloch, et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
Cinelli, C. and C. Hazlett (2018). Making sense of sensitivity: Extending omitted variable bias.
Cornfield, J., W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute 22, 173–203.
D'Amour, A. (2019). On multi-cause causal inference: Impossibility, sensitivity, and the promise of proxies. In International Conference on Artificial Intelligence and Statistics, pp. Forthcoming.
Daniels, M. J. and J. W. Hogan (2000). Reparameterizing the pattern mixture model for sensitivity analyses under informative dropout. Biometrics 56(4), 1241–1248.
Díaz, I. and M. J. van der Laan (2013). Sensitivity analysis for causal inference under unmeasured confounding and measurement error problems. The International Journal of Biostatistics 9(2), 149–160.
Ding, P., A. Feller, and L. Miratrix (2018). Decomposing treatment effect variation. Journal of the American Statistical Association, 1–14.
Ding, P. and T. J. VanderWeele (2016). Sensitivity analysis without assumptions. Epidemiology (Cambridge, Mass.) 27(3), 368.
Dorie, V., M. Harada, N. B. Carnegie, and J. Hill (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine 35(20), 3453–3470.
Duan, N., W. G. Manning, C. N. Morris, and J. P. Newhouse (1983). A comparison of alternative models for the demand for medical care. Journal of Business & Economic Statistics 1(2), 115–126.
Everitt, B. S. (1985). Mixture Distributions—I. Wiley Online Library.
Franks, A. M., E. M. Airoldi, and D. B. Rubin (2016). Non-standard conditionally specified models for non-ignorable missing data. arXiv preprint arXiv:1603.06045.
Gelman, A., H. S. Stern, J. B. Carlin, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
Gustafson, P., L. C. McCandless, et al. (2018). When is a sensitivity parameter exactly that? Statistical Science 33(1), 86–95.
Hahn, P. R., J. S. Murray, and C. M. Carvalho (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects.
Hahn, P. R., J. S. Murray, and I. Manolopoulou (2016). A Bayesian partial identification approach to inferring the prevalence of accounting misconduct. Journal of the American Statistical Association 111(513), 14–26.
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica 47(1), 153–161.
Heckman, J. J., J. Smith, and N. Clements (1997). Making the most out of programme evaluations and social experiments: Accounting for heterogeneity in programme impacts. The Review of Economic Studies 64(4), 487–535.
Hill, J. L. (2012). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics.
Holland, P. (1986). Discussion 4: Mixture modeling versus selection modeling with nonignorable nonresponse. In H. Wainer (Ed.), Drawing Inferences from Self-Selected Samples, pp. 143–149. Routledge.
Imai, K. and D. A. Van Dyk (2004). Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 99(467), 854–866.
Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review 93(2), 126–132.
Javaras, K. N. and D. A. Van Dyk (2003). Multiple imputation for incomplete data with semicontinuous variables. Journal of the American Statistical Association 98(463), 703–715.
Jung, J., R. Shroff, A. Feller, and S. Goel (2018). Algorithmic decision making in the presence of unmeasured confounding. arXiv preprint arXiv:1805.01868.
Kenward, M. G. (1998). Selection models for repeated measurements with non-random dropout: an illustration of sensitivity. Statistics in Medicine.
Klausch, T., P. van de Ven, T. van de Brug, R. H. Brakenhoff, M. A. van de Wiel, and J. Berkhof (2018). Estimating Bayesian optimal treatment regimes for dichotomous outcomes using observational data. arXiv preprint arXiv:1809.06679.
Linero, A. R. (2017). Bayesian nonparametric analysis of longitudinal studies in the presence of informative missingness. Biometrika 104(2), 327–341.
Linero, A. R. and M. J. Daniels (2015). A flexible Bayesian approach to monotone missing data in longitudinal studies with nonignorable missingness with application to an acute schizophrenia clinical trial. Journal of the American Statistical Association 110(509), 45–55.
Linero, A. R. and M. J. Daniels (2017). Bayesian approaches for missing not at random outcome data: The role of identifying restrictions.
Little, R. J. and D. B. Rubin (2015). Statistical Analysis with Missing Data. John Wiley & Sons.
Middleton, J. A., M. A. Scott, R. Diakow, and J. L. Hill (2016). Bias amplification and bias unmasking. Political Analysis 24(3), 307–323.
Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9(2), 249–265.
Neyman, J. (1990 [1923]). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science 5(4), 465–472.
Olsen, M. K. and J. L. Schafer (2001). A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association 96(454), 730–745.
Oster, E. (2017). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 1–18.
Riddles, M. K., J. K. Kim, and J. Im (2016). A propensity-score-adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology 4(2), 215–245.
Robins, J. M., A. Rotnitzky, and D. O. Scharfstein (2000). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 1–94. Springer.
Rosenbaum, P. (2017). Observation and Experiment: An Introduction to Causal Inference. Harvard University Press.
Rosenbaum, P. R. and D. B. Rubin (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society, Series B (Methodological), 212–218.
Rosenbaum, P. R. and J. H. Silber (2009). Amplification of sensitivity analysis in matched observational studies. Journal of the American Statistical Association 104(488), 1398–1405.
Ross, G. J. and D. Markwick (2018). dirichletprocess: Build Dirichlet Process Objects for Bayesian Modelling. R package version 0.2.1.
Rotnitzky, A., D. Scharfstein, T.-L. Su, and J. Robins (2001). Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring. Biometrics 57(1), 103–113.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688.
Rubin, D. B. (1980). Comment. Journal of the American Statistical Association 75(371), 591–593.
Rubin, D. B. (2003). Basic concepts of statistical inference for causal effects in experiments and observational studies. Cambridge, MA: Harvard University, Department of Statistics.
Scharfstein, D. O., M. J. Daniels, and J. M. Robins (2003). Incorporating prior beliefs about selection bias into the analysis of randomized trials with missing outcomes. Biostatistics 4(4), 495–512.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999, December). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association 94(448), 1096–1120.
Van der Laan, M. J. and S. Rose (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media.
Wang, Y. and D. M. Blei (2018). The blessings of multiple causes. arXiv preprint arXiv:1805.06826.
Xu, D., M. J. Daniels, and A. G. Winterstein (2018). A Bayesian nonparametric approach to causal inference on quantiles. Biometrics.
Zhao, Q., D. S. Small, and B. B. Bhattacharya (2017). Sensitivity analysis for inverse probability weighting estimators via the percentile bootstrap. arXiv preprint arXiv:1711.11286.