Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding
Jiajing Zheng (UCSB), Alexander D'Amour (Google Research), Alexander Franks (UCSB)
February 19, 2021
Abstract
Recent work has focused on the potential and pitfalls of causal identification in observational studies with multiple simultaneous treatments. On the one hand, a latent variable model fit to the observed treatments can identify essential aspects of the distribution of unobserved confounders. On the other hand, it has been shown that even when the latent confounder distribution is known exactly, causal effects are still not point identifiable. Thus, the practical benefits of latent variable modeling in multi-treatment settings remain unclear. We clarify these issues with a sensitivity analysis method that can be used to characterize the range of causal effects that are compatible with the observed data. Our method is based on a copula factorization of the joint distribution of outcomes, treatments, and confounders, and can be layered on top of arbitrary observed data models. We propose a practical implementation of this approach making use of the Gaussian copula, and establish conditions under which causal effects can be bounded. We also describe approaches for reasoning about effects, including calibrating sensitivity parameters, quantifying robustness of effect estimates, and selecting models which are most consistent with prior hypotheses.
Keywords: Observational studies; multiple treatments; sensitivity analysis; copulas; latent confounders; deconfounder

∗ Jiajing Zheng is a PhD candidate in the Department of Statistics and Applied Probability at the University of California, Santa Barbara ([email protected]). Alexander D'Amour is a Research Scientist at Google Research, Cambridge, MA ([email protected]). Alexander M. Franks is an Assistant Professor of Statistics at the University of California, Santa Barbara ([email protected]). We thank Steve Yadlowsky, Victor Veitch, and Avi Feller for thoughtful comments and discussion.

1 Introduction
Although it is well-established that treatment effects are not generally identifiable in the presence of unobserved confounding, recent work has focused on whether this challenge can be mitigated when there are multiple simultaneous treatments. Intuitively, dependence among multivariate treatments could provide information about latent confounders, which could in turn be leveraged to facilitate causal inference and identification. This intuition has motivated latent variable approaches such as "the deconfounder", a much discussed approach for estimating causal effects for multiple treatments (Wang and Blei, 2018).

Unfortunately, it was shown that this strategy has limited practical applicability for point identification and estimation of causal effects. For example, D'Amour (2019a) and D'Amour (2019b) note the lack of general nonparametric identification in the deconfounder approach, and show that the special cases in which the approach does provide identification correspond to situations where all confounding is already observed. Ogburn et al. (2019) and Ogburn et al. (2020) provide several additional counterexamples and detailed rebuttals to previous theoretical results. Even in the special cases where causal effects are identifiable, Grimmer et al. (2020) demonstrate through a suite of simulations and real-data analyses that the deconfounder cannot consistently outperform naive regression. They conclude by further arguing that the deconfounder assumptions are too strong to be applicable in practice.

These challenges are particularly relevant because similar strategies are used in genomics (Price et al., 2006), computational neuroscience, social science and medicine (Zhang et al., 2019), and time series applications (Bica et al., 2020). Given the practical importance of causal inference with multiple treatments, recent work has focused on stronger identifying assumptions for causal effects in the multi-treatment setting. Miao et al.
(2020) propose identifying assumptions involving instrumental variables and in settings where over half of the treatments are assumed to have a null effect, while Kong et al. (2019) consider identification in a parametric model with binary outcomes.

This literature has revolved around a binary question about point identification: can causal effects be identified or not? Negative answers to this question often run counter to practitioners' intuitions in specific data analyses. In particular, it is intuitive that a latent variable model should provide some helpful information, even if this information is not enough to fully identify causal effects.

To address this issue, we propose that sensitivity analysis—which explores the range of causal effects that are consistent with the observed data in the context of a given problem—can resolve this tension. Specifically, sensitivity analysis can show how much is gained by leveraging latent structure in a given application, even if this (usually) falls short of fully identifying the causal effect of interest. To this end, we propose a sensitivity analysis approach to help practitioners better understand confounding in the multi-treatment setting, focusing on the special case where the conditional distribution of unobserved confounders given treatments is identifiable. To extend sensitivity analysis to the multi-treatment setting, we propose a general copula-based decomposition of standard latent variable–based sensitivity analysis models. This factorization allows us to precisely separate the parts of the model that are, and are not, identified in the multi-treatment setting.

For practical analyses, we propose a specialization of the general decomposition, which specifies a sensitivity model based on invariant Gaussian copulas.
While this Gaussian copula specification only covers a sub-family of sensitivity models expressible in our general formulation, we show that it captures several essential qualitative aspects of confounding in the multi-treatment setting. In this context, we establish that there are important advantages to multi-treatment inference over single-treatment inference for characterizing sensitivity to unobserved confounding. Specifically, under appropriate assumptions, we establish that the number of effective sensitivity parameters is halved in multi-treatment inference and that this implies that the magnitude of causal effects can be bounded.

The paper proceeds as follows. We begin by defining the relevant quantities and notation in Section 2. In Section 3 we describe our basic framework for latent variable sensitivity analysis via a copula factorization. In Section 4 we introduce a special case of the more general approach in which we assume confounder-outcome relationships can be characterized by a Gaussian copula. In Section 5 we provide some theoretical insights into bias and confounding with the Gaussian copula. We discuss sensitivity parameter interpretation, calibration, and measures of robustness in Section 6 and, finally, in Section 7 and Section 8 we demonstrate our approach in simulation and with the movie example analyzed by Wang and Blei (2018) and later reanalyzed by Grimmer et al. (2020).
Let T be a random k-vector of treatment variables, Y be a scalar random outcome of interest, and t and y be realizations of the respective random variables. We let U be a random m-vector denoting unobserved confounders, and x denote any observed pre-treatment variables. We use the do-calculus framework (Pearl, 2009) and let f(y | do(t)) denote the density of y in the population in which we have intervened to assign treatment level t to all units. In general, this is distinct from the observed outcome density, f(y | t), which represents the density of the outcome in the subpopulation that received treatment t. These two densities are the same if and only if there are no confounders (VanderWeele and Shpitser, 2013).

The goal of observational inference with multiple treatments is to quantify the effects of different treatments by comparing the intervention distribution at different levels of treatment t (Lechner, 1999; Lopez et al., 2017). In this work we focus on marginal contrast estimands (Franks et al., 2019) under arbitrary outcome and treatment distributions. An estimand is a "marginal contrast" if it can be expressed as a function of the marginal distributions of the potential outcomes, τ(E[v(y) | do(t_1)], E[v(y) | do(t_2)]), for some functions v and τ. This includes the vast majority of commonly used estimands. For continuous outcomes, our primary estimand is the difference in the population average outcome for treatment T = t_1 and the population average outcome given treatment T = t_2:

PATE_{t_1, t_2} := E(Y | do(t_1)) − E(Y | do(t_2)).  (1)

Here, v(y) = y is the identity function and τ(a, b) = a − b. We also consider the difference in the population average outcome receiving treatment t and the observed population average outcome, which we denote

PATE_{t, •} := E(Y | do(t)) − E(Y),  (2)

where E(Y) = ∫ E(Y | t) f(t) dt and PATE_{t_1, t_2} = PATE_{t_1, •} − PATE_{t_2, •}.
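To make these estimands concrete, the following sketch (our own illustration with assumed toy coefficients, not values from the paper) simulates a linear outcome model and checks by Monte Carlo that PATE_{t_1,t_2} equals both τ'(t_1 − t_2) and the difference PATE_{t_1,•} − PATE_{t_2,•}:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 3, 2, 200_000
tau = np.array([1.0, -0.5, 0.25])      # treatment coefficients (assumed toy values)
gamma = np.array([0.8, -0.3])          # confounder-outcome coefficients (assumed)
B = rng.normal(size=(k, m))            # confounder-treatment loadings (assumed)

def mean_do(t):
    """Monte Carlo estimate of E[Y | do(t)]: U is drawn from its marginal N(0, I_m)."""
    U = rng.normal(size=(n, m))
    return (t @ tau + U @ gamma + rng.normal(size=n)).mean()

# E[Y] under the observational model, where T = BU + eps_t depends on U
U = rng.normal(size=(n, m))
T = U @ B.T + rng.normal(size=(n, k))
Y = T @ tau + U @ gamma + rng.normal(size=n)

t1, t2 = np.array([1.0, 0.0, 0.0]), np.zeros(k)
pate_12 = mean_do(t1) - mean_do(t2)    # Equation (1)
pate_1dot = mean_do(t1) - Y.mean()     # Equation (2)
pate_2dot = mean_do(t2) - Y.mean()
print(pate_12, pate_1dot - pate_2dot, tau @ (t1 - t2))
```

Up to Monte Carlo error, all three printed quantities agree, illustrating the identity PATE_{t_1,t_2} = PATE_{t_1,•} − PATE_{t_2,•}.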
The conditional average treatment effects are defined analogously as CATE_{t_1, t_2 | x} := E(Y | do(t_1), x) − E(Y | do(t_2), x) and CATE_{t, • | x} := E(Y | do(t), x) − E(Y | x).

For binary outcomes, our primary estimand is the causal risk ratio between treatments t_1 and t_2,

RR_{t_1, t_2} = P(Y = 1 | do(t_1)) / P(Y = 1 | do(t_2)),  (3)

where RR_{t, •} is defined analogously to (2), as P(Y = 1 | do(t)) / P(Y = 1), so that we can express RR_{t_1, t_2} = RR_{t_1, •} / RR_{t_2, •}. Here v(y) = I[y = 1] is the indicator function and τ(a, b) = a/b.

In general, it is difficult to infer PATEs or RRs from observational data since the potential presence of unmeasured confounders, which affect both treatment and outcome, can bias naive estimates. If U were to be observed, the following assumptions would be sufficient to identify the intervention distribution, and hence the treatment effect:

Assumption 1 (Latent ignorability). X and U block all backdoor paths between T and Y (Pearl, 2009).

Assumption 2 (Positivity). f(T = t | U = u, x) > 0 for all u and x.

Assumption 3 (SUTVA). There are no hidden versions of the treatments and there is no interference between units (see Rubin, 1980).

Since u is not observed, and x alone does not block all backdoor paths, treatment effects are not identifiable without additional assumptions. In this case, a common solution is to conduct a sensitivity analysis which characterizes how the implied causal effects change under different assumptions about U and its relationship to T and Y.

There is an extensive literature on assessing sensitivity to violations of unconfoundedness in single treatment models, dating back at least to the work of Cornfield et al. (1959) on the link between smoking and lung cancer. Since then, a wide range of strategies have been proposed for assessing sensitivity to unobserved confounding (e.g.
see Greenland, 1996; Gastwirth et al., 1998; Vansteelandt et al., 2006; Imbens, 2003; VanderWeele and Arah, 2011; VanderWeele et al., 2012; Robins et al., 2000; Franks et al., 2019; Cinelli and Hazlett, 2019; Veitch and Zaveri, 2020). A common strategy for sensitivity analysis is to assert, as in Assumption 1, that unconfoundedness would hold if only an additional latent variable U were observed (Rosenbaum and Rubin, 1983). With Assumptions 1 and 2, latent confounder models simultaneously specify the full conditional distributions of treatment and outcomes given both observed confounders, x, and unobserved confounders, U. The dependence of Y and T on U is indexed by a vector of sensitivity parameters ψ = (ψ_Y, ψ_T). Practitioners can then reason about how assumptions about these parameters translate to different causal conclusions. Often, this is done through calibration, by determining reasonable ranges for ψ using analogies about observable associations, and through a robustness assessment, by examining how strong associations with unobserved confounders must be for conclusions to change.

Concretely, in a typical latent confounder analysis, we posit densities f(u | x), the marginal density of the latent confounders; f_{ψ_T}(t | x, u), the conditional density (or PMF) for treatment assignment given all confounders; and f_{ψ_Y}(y | x, u, t), the outcome density in treatment arm t. The sensitivity parameters encode the relationship between both the treatment and unobserved confounders and the outcome and unobserved confounders (e.g. see Imbens, 2003; Dorie et al., 2016). Latent confounder models are usually parameterized so that some specific values of the sensitivity parameters ψ indicate the "no unobserved confounding" case. For example, we can take ψ_T = 0 to imply that f_{ψ_T}(t | x, u) = f_{ψ_T}(t | x) and ψ_Y = 0 to imply that f_{ψ_Y}(y | x, u, t) = f_{ψ_Y}(y | x, t).
Then, when either ψ_T = 0 or ψ_Y = 0, U is not a confounder (VanderWeele and Shpitser, 2013). Without loss of generality, we suppress conditioning on x throughout the remainder of the manuscript, and comment on the role of observed covariates where appropriate.

A key principle of sensitivity analysis is that the observed data densities should be invariant to the sensitivity parameters (Gustafson et al., 2018). Unfortunately, this principle can easily be violated in a latent confounder analysis (Franks et al., 2019). The crux of the problem is that, in general, none of f(u | x), f_{ψ_T}(t | x, u) or f_{ψ_Y}(y | x, u, t) are nonparametrically identifiable themselves, but the observed outcome density, which is a function of these densities, is identifiable. The observed outcome density has the following form:

f(y | T = t) = ∫_U f_{ψ_Y}(y | t, u) f_{ψ_T}(u | t) du,  for all t and ψ,  (4)

where f_{ψ_T}(u | t) is the conditional density of the latent confounders given treatment level t. This contrasts with the density of the intervention distribution, which is:

f_ψ(y | do(t)) = ∫_U f_{ψ_Y}(y | t, u) f(u) du  (5)
             = ∫_U f_{ψ_Y}(y | t, u) [∫ f_{ψ_T}(u | t̃) f(t̃) dt̃] du.  (6)

The intervention distribution is obtained by integrating over the marginal distribution of the unobserved confounder, f_{ψ_T}(u) = ∫ f_{ψ_T}(u | t̃) f(t̃) dt̃, whereas the observed data distribution in Equation 4 is obtained by integrating with respect to the conditional confounder distribution, f_{ψ_T}(u | t).
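The distinction between integrating over f(u | t) (Equation 4) and over the marginal f(u) (Equation 5) can be checked numerically. In the hypothetical one-treatment, one-confounder linear-Gaussian model below (the values of b, τ, γ are our own toy assumptions), the regression slope in the observed data converges to a "naive" coefficient that differs from the interventional slope:

```python
import numpy as np

rng = np.random.default_rng(1)
b, tau, gamma, n = 1.5, 2.0, 1.0, 400_000   # assumed toy values
U = rng.normal(size=n)
T = b * U + rng.normal(size=n)              # treatment depends on the confounder
Y = tau * T + gamma * U + rng.normal(size=n)

# f(y | t) integrates over f(u | t): regressing Y on T recovers tau_naive
slope_obs = np.cov(T, Y)[0, 1] / T.var()
tau_naive = tau + gamma * b / (b**2 + 1.0)  # closed form via E[U | T = t]

# f(y | do(t)) integrates over the marginal f(u): the slope in t is tau itself
t0 = 1.0
y_do = tau * t0 + gamma * rng.normal(size=n) + rng.normal(size=n)
print(slope_obs, tau_naive, y_do.mean(), tau * t0)
```

The observed slope matches tau_naive, not tau, while the interventional mean matches tau * t0; the gap between the two slopes is exactly the confounding bias discussed later in the paper.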
A fundamental challenge of sensitivity analysis is to specify a class of densities f_{ψ_Y}(y | t, u) and f_{ψ_T}(u | t), with interpretable parameters ψ = (ψ_Y, ψ_T), for which the intervention distribution necessarily varies with ψ but for which the observed data distribution does not.

In fact, several authors have proposed latent variable sensitivity models with unidentified sensitivity parameters in single-treatment settings. Many of these are in simple settings with binary or categorical outcomes (Rosenbaum and Rubin, 1983; Robins, 1997; Vansteelandt et al., 2006; Daniels and Hogan, 2008). A more general solution was proposed by Zhang and Tchetgen (2019), who develop a semi-parametric sensitivity model in which the distribution of U is left unrestricted. Franks et al. (2019) propose an alternative framework for sensitivity analysis without introducing latent variables, instead directly parameterizing the effect of the outcome on the treatment assignment. Cinelli and Hazlett (2019) and Cinelli et al. (2019) use moment arguments to derive confounding bias in the linear regression setting and introduce an approach for sensitivity parameter calibration based on the proportion of variance explained by the latent confounder. In the multiple treatment setting, additional observable implications can complicate sensitivity analysis, and thus new strategies are needed. Below, we describe our copula-based approach for addressing this challenge.

The multiple treatment setting presents unique challenges for sensitivity analysis. In particular, the additional structure imposed in studies with multiple treatments introduces new observable implications, muddling issues of identifiability. We focus on the specific case in which the conditional confounder distribution f_{ψ_T}(u | t) may be partially identifiable from the multiple treatments.
To adapt sensitivity analysis to this setting, sensitivity parameters must not only be decoupled from the observed data distribution; parameters describing the treatment-confounder relationships must also be decoupled from parameters describing the outcome-confounder relationships. We tackle this challenge by factorizing the joint distribution f(y, u, t) using a copula, which decomposes the joint distribution of variables into their marginal distributions and joint dependence.

In this section, we introduce our copula-based factorization, discuss how it applies to the multiple treatment setting, and then discuss the Gaussian copula parameterization, a special case that we will use in theoretical analysis and methods development in the remainder of the paper.

Our approach is based on the following factorization. The model for Y conditional on treatments and unobserved confounders can be decomposed into the observed data density and a conditional copula as

f_ψ(y | u, t) = f(y | t) c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t),  (7)

where F_{Y|t} is the CDF of f(y | t) and F_{U|t} is the CDF of f_{ψ_T}(u | t). Here c_{ψ_Y} is the conditional copula density, defined on the unit hypercube and parameterized by ψ_Y, which characterizes the joint density of Y and U conditional on T = t after transforming the marginals to uniform random variables (Nelsen, 2007). By explicitly factoring the observed outcome density, f(y | t), out of the complete data distribution, we ensure that the left hand side of Equation 4 is invariant to ψ, establishing that there are no observable implications of varying the copula parameters.

[Figure 1: Treatment T, outcome Y, unobserved variables U. ε_u, ε_{t|u} and ε_{y|t,u} are respectively the random noises of U, T and Y. Panel (a): univariate case; panel (b): multiple treatments.]
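To see the factorization in Equation 7 in a concrete case, the snippet below (our own check, using a bivariate standard normal for (Y, U) given t) verifies that the conditional density f(y | u, t) equals the observed margin f(y | t) times the Gaussian copula density evaluated at the CDF-transformed arguments:

```python
from math import exp, sqrt
from statistics import NormalDist

std = NormalDist()

def gaussian_copula_density(a, b, rho):
    """Bivariate Gaussian copula density c_rho(a, b) on the unit square."""
    x, y = std.inv_cdf(a), std.inv_cdf(b)
    q = (rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * (1.0 - rho * rho))
    return exp(-q) / sqrt(1.0 - rho * rho)

# With (Y, U) | T = t standard bivariate normal with correlation rho,
# Equation 7 reads f(y | u, t) = f(y | t) * c_rho(F_{Y|t}(y), F_{U|t}(u)).
rho, y, u = 0.6, 0.7, -1.2
lhs = NormalDist(mu=rho * u, sigma=sqrt(1 - rho**2)).pdf(y)   # f(y | u, t)
rhs = std.pdf(y) * gaussian_copula_density(std.cdf(y), std.cdf(u), rho)
print(lhs, rhs)
```

The two sides agree to floating-point precision; the same identity holds for any marginals after the CDF transforms, which is what makes the factorization free of observable implications.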
Moreover, this factorization holds for all densities (or PMFs) f(y | t) and f_{ψ_T}(u | t) and any number of treatments, and thus can be used to characterize the outcome-confounder dependence for any model of the observables.

With Equation 7, we can express the intervention distribution, f_ψ(y | do(t)), as:

f_ψ(y | do(t)) = f(y | t) ∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) f(u) du  (8)
             = f(y | t) ∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) [∫ f_{ψ_T}(u | t̃) f(t̃) dt̃] du.  (9)

The observed outcome distribution f(y | t) and the marginal distribution of treatment f(t) are clearly invariant to the sensitivity parameters by construction. The intervention distribution f(y | do(t)) is parameterized by ψ_Y, the parameter governing the conditional dependence between Y and U given T, and by ψ_T, the parameters governing the conditional distribution of U given T.

Given f(y | t), f_{ψ_T}(u | t) and c_{ψ_Y} we can compute the expected value of any function of the outcome under the intervention distribution, E[v(Y) | do(t)] = ∫ v(y) f(y | do(t)) dy. This can in turn be used to compute any marginal contrast estimand. By applying Equation 8, we write this intervention expectation as

E[v(Y) | do(t)] = ∫ v(y) w_ψ(y, t) f(y | t) dy,  (10)

where w_ψ(y, t) = ∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) f_{ψ_T}(u) du is the importance weight associated with sampling from the observed data distribution instead of the intervention distribution. In practice, we can approximate the marginal distribution of the unobserved confounder with the mixture density f_{ψ_T}(u) ≈ (1/n) Σ_i f_{ψ_T}(u | t_i), where t_i ∈ T is the i-th observed treatment and T is the set of all observed treatment vectors.
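The mixture approximation f_{ψ_T}(u) ≈ (1/n) Σ_i f(u | t_i) can be sanity-checked directly. In a linear-Gaussian treatment model t = Bu + ε (with Σ_u = I and toy values we assume here), the mixture's mean and covariance should recover the marginal N(0, I_m):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, n = 5, 2, 100_000
B = rng.normal(size=(k, m))                  # assumed loadings
sig2 = 1.0                                   # sigma^2_{t|u}, assumed
T = rng.normal(size=(n, m)) @ B.T + rng.normal(size=(n, k)) * np.sqrt(sig2)

# Posterior moments of U | T = t in the linear-Gaussian model (Sigma_u = I):
prec = np.linalg.inv(B @ B.T + sig2 * np.eye(k))
mu_post = T @ (prec @ B)                     # each row is mu_{u|t_i}
Sigma_post = np.eye(m) - B.T @ prec @ B      # Sigma_{u|t}, shared across t

# Mixture (1/n) sum_i f(u | t_i) should match the marginal N(0, I_m):
mix_mean = mu_post.mean(axis=0)
mix_cov = Sigma_post + np.cov(mu_post.T)
print(mix_mean, mix_cov)
```

The mixture mean is near zero and the mixture covariance near the identity, as the law of total variance predicts; this is the sense in which averaging the conditional confounder distributions over observed treatments recovers f(u).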
Thus, the importance weight can be approximated as

w_ψ(y, t) ≈ (1/|T|) Σ_{t_i ∈ T} [∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) f_{ψ_T}(u | t_i) du].  (11)

We use this approximation to derive an importance sampling algorithm for computing the expected value in Equation 10 for any copula and conditional confounder distributions f(u | t) (Appendix, Algorithm 2). This can in turn be used to compute any marginal contrast estimand, τ(E[v(y) | do(t_1)], E[v(y) | do(t_2)]).

Algorithm 2 is fully general and thus can be used to compute marginal contrast estimands in a single treatment setting, multiple treatment setting, or even multiple outcome settings. However, the primary motivation in this paper is to study this factorization in the multiple treatment setting, where there are additional observable implications from latent variable models. The copula factorization elucidates the role of confounding even when aspects of the treatment-confounder relationship are identifiable from multiple treatments. Specifically, with multiple treatments, the conditional distribution of latent confounders given treatments, parameterized by ψ_T, is often identifiable up to a particular equivalence class (e.g. up to rotation and scale).

In this paper, we focus on inference with latent variable methods where ψ_T is identified up to an equivalence class within which the set of possible causal effects compatible with any particular value of ψ_T does not change. We formalize this notion through the following definition and assumption.

Definition 1 (Causal equivalence class).
[ψ_T] is a causal equivalence class of ψ_T if and only if for any ψ̃_T in [ψ_T] and every ψ_Y there exists a ψ̃_Y such that f_{ψ_Y, ψ_T}(y | do(T = t)) = f_{ψ̃_Y, ψ̃_T}(y | do(T = t)) for all y, t.

For the purposes of sensitivity analysis, when ψ_T is identified up to a causal equivalence class, we can assume that ψ_T is point-identified at a particular value within the class [ψ_T] without loss of generality. Identification up to a causal equivalence class is not generally possible in single treatment studies but will often hold in a multi-treatment study when certain identifying conditions for the latent variable model are met.

Crucially, the copula-based formulation enables valid sensitivity analysis without observable implications, even in these cases where ψ_T is restricted by the observed data. In this case, the outcome-confounder copula c_{ψ_Y} remains the lone degree of freedom in the sensitivity model. As we will show, this restriction can induce qualitatively different sensitivity regions in the multi-treatment setting as opposed to the single-treatment setting. For example, sensitivity regions can be bounded, even without additional restrictions on ψ_Y.

In practice, it is infeasible to characterize and interpret the implied causal effects for all possible copula specifications. In this section, we propose a practical sensitivity analysis method based on the special case in which c_{ψ_Y} is a Gaussian copula. This model characterizes the sensitivity of causal effects to monotone dependencies between the outcome and unobserved confounders. As we will discuss in the following sections, the Gaussian copula facilitates interpretation and sensitivity parameter calibration, and, as before, is compatible with arbitrary marginal distributions f(y | t) and f(t). For our method and throughout the remainder of the paper we make the following additional assumptions:

Assumption 4 (Copula invariance).
The conditional copula does not depend on the value of t; that is, the conditional dependence between Y and U is invariant to the level of T.

Assumption 5 (Gaussian copula). The conditional copula between the outcome and m-dimensional latent confounders given treatments, c_ψ(F_{Y|t}(y), F_{U|t}(u) | t), is a Gaussian copula.

These assumptions do not impose constraints on the observed data distributions, only on the relationship between the observed and latent variables. Given Assumptions 4 and 5, the conditional confounder density can be expressed as a multivariate normal density, f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}), where Σ_{u|t} is invariant to the level of t. Together, these assumptions imply the following generative model:

T ∼ F_T  (12)
U | T = t ∼ N(µ_{u|t}, Σ_{u|t})  (13)
Ỹ = γ'(U − µ_{u|t}) + ε_{ỹ|t,u},  ε_{ỹ|t,u} ∼ N(0, σ²_{ỹ|t,u}),  γ'Σ_{u|t}γ + σ²_{ỹ|t,u} = 1  (14)
Y = F⁻¹_{Y|t}(Φ(Ỹ))  (15)

where F_T is the distribution of the treatments and F⁻¹_{Y|t} is the inverse CDF of the conditional distribution of Y given T = t. The Gaussian copula is parameterized by the correlation matrix implied by

Cov([Ỹ, U] | T = t) = [ 1          γ'Σ_{u|t} ]
                      [ Σ_{u|t}γ   Σ_{u|t}  ],  (16)

with parameters ψ_T = {µ_{u|t}, Σ_{u|t}} and ψ_Y = {γ}. In general, µ_{u|t} and Σ_{u|t} will not be point identified, although under many latent variable models they can be identified up to an invertible linear transformation of U. Importantly, the following theorem establishes that the class of ψ_T defined by all invertible linear transformations of U is a causal equivalence class.

Theorem 1.
Assume model 13-15. Let [ψ_T] = {ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊}, where S₊ is the space of symmetric positive definite matrices. Then [ψ_T] is a causal equivalence class.

The gist of the proof is that for any invertible linear transformation, A, of U, the copula parameterized by γ̃ = A⁻¹γ yields equivalent causal effects in the reparameterized coordinates of U as γ does in the original confounder coordinates.

Throughout this work, we will assume that ψ_T is identified up to invertible linear transformations of U, and explore the range of possible causal effects for different γ satisfying γ'Σ_{u|t}γ ≤ 1. In Algorithm 1, we provide a procedure for estimating any marginal contrast estimand, given a sensitivity vector γ and treatment levels t_1 and t_2. At a high level, we compute a Monte Carlo estimate of f(y | do(t)) via the following three step procedure: 1) draw a sample from f(u), 2) compute the conditional density of the Gaussianized outcome f(Ỹ | u, t) via the Gaussian copula, and 3) transform Ỹ back to the original space via the conditional quantile function F⁻¹_{Y|t} (see Figure 3). In the following sections, we introduce some theoretical insights about our approach and also provide a method for calibrating the magnitude of γ and reasoning about its direction.

Algorithm 1 Marginal Contrast Estimation
  function ComputeMean(t, γ)
    for i = 1, 2, ..., n do
      µ_i ← γ'(µ_{u|t_i} − µ_{u|t})
      for j = 1, 2, ..., nSim do
        Sample ỹ_ij from N(µ_i, 1)
        y_ij ← F⁻¹_{Y|t}(Φ(ỹ_ij))
      end for
    end for
    return (1/(n · nSim)) Σ_ij v(y_ij)
  end function
  Return τ(ComputeMean(t_1, γ), ComputeMean(t_2, γ))

As described in the previous section, our method for practical sensitivity analysis with multiple treatments is based on a Gaussian copula parameterization of the confounder-outcome relationship. Here, we start by providing some theoretical insights about how the Gaussian covariance structure relates to confounding bias in the linear-Gaussian model. Specifically, we describe how the causal effects vary as a function of γ and how bounds on these effects depend on both the treatment contrast and the inferred conditional confounder density. In Section 5.2, we generalize some of these results to arbitrary models for f(y | t) and f(t).

We start by illustrating our approach in a simple linear-Gaussian model when (Y, T, U) are jointly multivariate Gaussian and establish the following results:

• For causal inference with a single treatment, the confounding bias for PATE_{t_1, t_2} is unbounded.

• When there are multiple treatments which can be used to identify (up to a causal equivalence class) the conditional confounder distribution, the magnitude of the confounding bias for PATE_{t_1, t_2} is bounded. We characterize how the magnitude of this bound depends on the parameters of the latent confounder model.

• In the multi-cause setting there are many possible treatment contrasts. The confounding bias depends on the contrast. We characterize which treatment contrasts lead to the largest bounds and which treatment contrasts imply identifiable effects.

We demonstrate these results in the following model:

U = ε_u,  ε_u ∼ N_m(0, Σ_u),  (17)
T = BU + ε_{t|u},  ε_{t|u} ∼ N_k(0, σ²_{t|u} I_k),  (18)
Y = τ'T + γ'U + ε_{y|t,u},  ε_{y|t,u} ∼ N(0, σ²_{y|t,u}),  (19)

with τ ∈ R^k and γ ∈ R^m. When either B = 0 or γ = 0, there is no confounding. We also note that Equations 17 and 18 imply that the conditional distribution of the confounder can be expressed as f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}), as in Equation 13, where both µ_{u|t} and Σ_{u|t} are known functions of B and σ²_{t|u}. Under model 17-19, the intervention distribution has density

f(y | do(T = t)) ∼ N(τ't, σ²_{y|t,u} + γ'Σ_u γ).  (20)

For any t_1, t_2, PATE_{t_1, t_2} is characterized entirely by the regression coefficients τ. The observed outcome distribution can be expressed as

f(y | T = t) ∼ N(τ'_naive t, σ²_{y|t}),  (21)

where

τ_naive = τ + (BΣ_uB' + σ²_{t|u} I_k)⁻¹ BΣ_u γ  (22)
σ²_{y|t} = σ²_{y|t,u} + γ'(Σ_u − Σ_uB'(BΣ_uB' + σ²_{t|u} I_k)⁻¹BΣ_u)γ  (23)
        = σ²_{y|t,u} + γ'Σ_{u|t}γ,  (24)

which are both fully identified from observed data. We refer to τ_naive as the naive estimate since it naively neglects the effect of unobserved confounders. Equation 24 shows that the observed residual outcome variance can be decomposed into nonconfounding variation, σ²_{y|t,u}, and confounding variation, γ'Σ_{u|t}γ. We take σ²_{y|t} and τ_naive as fixed and known, and characterize the range of confounding biases by considering different assumptions about the strength of confounding.

We note that the bias of the naive estimator depends only on the difference between the treatment vectors, ∆t = t_1 − t_2, since the population average treatment effect can be expressed as

PATE_∆t = τ'(t_1 − t_2) := τ'∆t.  (25)

The confounding bias, denoted Bias_∆t = τ'_naive ∆t − PATE_∆t, can then be expressed as

Bias_∆t = γ'Σ_uB'(BΣ_uB' + σ²_{t|u} I_k)⁻¹∆t  (26)
        = γ'(E(U | T = t_1) − E(U | T = t_2))  (27)
        := γ'µ_{u|∆t},  (28)

where we use µ_{u|∆t} to denote the difference in confounder means for the treatment contrast, ∆t. We can then succinctly express the PATE in terms of the naive estimate minus the bias as PATE_∆t = τ'_naive ∆t − γ'µ_{u|∆t}.

In the single treatment setting, neither ψ_Y = {γ} nor ψ_T = {µ_{u|∆t}, Σ_{u|t}} are identifiable, which implies that the confounding bias is unbounded. However, with multiple treatments, ψ_T is identifiable up to a causal equivalence class defined by invertible linear transformations of U.
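As a check on Algorithm 1, the sketch below is our own implementation under the linear-Gaussian model 17-19 (with Σ_u = I and σ²_{t|u} = σ²_{y|t,u} = 1, toy coefficients assumed). With a Gaussian observed-data model, F⁻¹_{Y|t}(Φ(ỹ)) reduces to an affine map, and the estimated contrast should recover τ'∆t = τ'_naive ∆t − γ'µ_{u|∆t}:

```python
import numpy as np

rng = np.random.default_rng(5)
k, m, n, nsim = 3, 2, 20_000, 100
B = rng.normal(size=(k, m))            # confounder-treatment loadings (assumed)
gamma = 0.5 * rng.normal(size=m)       # confounder-outcome coefficients (assumed)
tau = np.array([1.0, -1.0, 0.5])       # true effects (assumed)

U = rng.normal(size=(n, m))
T = U @ B.T + rng.normal(size=(n, k))  # observed treatments

Sigma_T = B @ B.T + np.eye(k)
P = np.linalg.inv(Sigma_T)
mu_post = T @ (P @ B)                  # rows are mu_{u|t_i}
Sigma_post = np.eye(m) - B.T @ P @ B   # Sigma_{u|t}
tau_naive = tau + P @ (B @ gamma)      # Eq. 22
sig2_yt = 1.0 + gamma @ Sigma_post @ gamma   # Eq. 24
g = gamma / np.sqrt(sig2_yt)           # copula-scale sensitivity vector

def compute_mean(t):
    """ComputeMean(t, g) from Algorithm 1, with a Gaussian observed-data model f(y | t)."""
    mu_t = B.T @ (P @ t)               # mu_{u|t}
    mu_i = (mu_post - mu_t) @ g        # per-component mean of the Gaussianized outcome
    y_tilde = mu_i[:, None] + rng.normal(size=(n, nsim))
    # For Gaussian f(y | t), F^{-1}_{Y|t}(Phi(y_tilde)) is an affine transform:
    return (tau_naive @ t + np.sqrt(sig2_yt) * y_tilde).mean()

t1, t2 = np.array([1.0, 0.0, 0.0]), np.zeros(k)
print(compute_mean(t1) - compute_mean(t2), tau @ (t1 - t2))
```

Up to Monte Carlo error the two printed numbers agree, i.e. the algorithm removes the bias γ'µ_{u|∆t} from the naive contrast; swapping in a non-Gaussian F⁻¹_{Y|t} only changes the final transform line.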
We make this concrete in the following theorems.

Theorem 2.
Suppose that the observed data is generated by model 17-19. When there are k treatments with 0 < m < k, then ψ_T is identified up to the causal equivalence class [ψ_T] = {ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊}. When there is a single treatment (k = 1) or at least m = k confounders, then ψ_T is not identifiable up to a causal equivalence class.

Proof. See appendix.

One consequence of Theorem 2 is that the distribution of U is only causally relevant up to linear transforms, and as such, without loss of generality, we make the simplifying assumption that U ∼ N(0, I_m) for the remainder of this section.

First, we review the implications of this theorem when there is only a single treatment, i.e. k = 1. As shown in Cinelli and Hazlett (2019), for single treatment inference, the squared confounding bias of the PATE can be expressed as

Bias²_∆t = (R²_{T∼U} / (1 − R²_{T∼U})) R²_{Y∼U|T} (σ²_{y|t} / σ²_T) (∆t)²,  (29)

where σ²_T := BB' + σ²_{t|u} I_k is the marginal variance of the treatment,

0 ≤ R²_{T∼U} = σ²_T (µ_{u|∆t})'µ_{u|∆t} / (∆t)² ≤ 1  (30)

is the unidentified fraction of treatment variance explained by confounders, and

0 ≤ R²_{Y∼U|T} = γ'Σ_{u|t}γ / σ²_{y|t} ≤ 1  (31)

is the fraction of the residual outcome variance explained by confounders. By Theorem 2, neither R²_{T∼U} nor R²_{Y∼U|T} are identifiable. Since R²_{T∼U} / (1 − R²_{T∼U}) can be arbitrarily large, the confounding bias is unbounded in the single treatment setting.

In contrast, Theorem 2 states that with multiple treatments, we can identify an element of the causal equivalence class for parameters governing the conditional confounder distribution. The relationship between Y and U, as parameterized by the m-vector γ, remains an unidentified sensitivity vector. This sensitivity vector can be viewed as parameterizing the Gaussian conditional copula between Y and U given T, c_γ(F_{Y|t}(y), F_{U|t}(u) | t) (Equation 16).

Identification up to a causal equivalence class implies that the confounding bias is bounded.
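The bias decomposition in Equations 29-31 can be verified exactly in the simplest case of a single treatment and a single (scalar) confounder, where all quantities have closed forms; the toy parameter values below are our own assumptions:

```python
# Single treatment (k = 1), single confounder (m = 1), Sigma_u = 1.
b, g = 1.2, 0.8                  # B and gamma (assumed toy values)
sig2_t, sig2_y = 1.0, 0.5        # sigma^2_{t|u} and sigma^2_{y|t,u} (assumed)
sigma2_T = b * b + sig2_t        # marginal treatment variance
Sigma_post = 1.0 - b * b / sigma2_T        # Sigma_{u|t}
sig2_yt = sig2_y + g * g * Sigma_post      # residual outcome variance, Eq. 24

dt = 2.0                         # treatment contrast Delta t
mu_dt = b * dt / sigma2_T        # mu_{u|Delta t} = Sigma_u B' Sigma_T^{-1} Delta t
bias2 = (g * mu_dt) ** 2         # squared bias, Eqs. 26-28

R2_TU = sigma2_T * mu_dt**2 / dt**2        # treatment variance explained, Eq. 30
R2_YU = g * g * Sigma_post / sig2_yt       # residual outcome variance explained, Eq. 31
rhs = R2_TU / (1 - R2_TU) * R2_YU * sig2_yt / sigma2_T * dt**2   # Eq. 29
print(bias2, rhs)
```

Both sides agree exactly, and the check makes the unboundedness argument visible: as b grows with σ²_{t|u} fixed, R²_{T∼U} → 1 and the factor R²_{T∼U}/(1 − R²_{T∼U}) diverges.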
From Equation 26, we can see that the sign and magnitude of the bias depend on both ∆t and γ. Although γ is not identified, its values are constrained, since unobserved confounding cannot explain more than 100% of the residual outcome variance (Equation 31). This constraint on the magnitude of γ implies the following result about the bias of the naive estimator.

Theorem 3.
Suppose that the observed data is generated by model 17-19 with σ²_{t|u} > 0. Then, for all γ satisfying Assumptions 1 and 2,

$$\gamma'\Sigma_{u|t}\gamma \le \sigma^2_{y|t}, \qquad (32)$$

and for any given ∆t, we have

$$\mathrm{Bias}^2_{\Delta t} \le \sigma^2_{y|t}\, R^2_{Y\sim U\mid T}\, \|\Sigma^{-1/2}_{u|t}\mu_{u|\Delta t}\|^2. \qquad (33)$$

The bound is achieved when γ is colinear with Σ^{−1}_{u|t}µ_{u|∆t}.

Proof. See appendix.

This theorem states that the true causal effect lies in the interval τ′_naive∆t ± σ_{y|t}√(R²_{Y∼U|T}) ‖Σ^{−1/2}_{u|t}µ_{u|∆t}‖. When additional assumptions are applied to identify a particular value of γ (e.g., see Miao et al., 2020), the corresponding causal effect estimate will correspond to a single point inside this ignorance region when the Gaussian copula assumption holds. We refer to the right-hand side of (33) as the "worst-case bias" of the naive estimator. In particular, since τ_naive is the midpoint of the ignorance region, it has the minimum worst-case bias over all alternative causal effect estimators. This is consistent with Grimmer et al. (2020), who emphasize that the deconfounder proposed by Wang and Blei (2018) cannot outperform the naive estimator in general.

The worst-case bias of τ_naive is proportional to the norm of the scaled difference in confounder means between the treatment arms, Σ^{−1/2}_{u|t}µ_{u|∆t}. This result provides a useful generalization of existing work demonstrating that overlap is violated when u can be pinpointed by a deterministic function of t (D'Amour, 2019b). In contrast to the original work by Wang and Blei (2018), our result suggests that the more precisely we can pinpoint u given t, the less precisely we can pinpoint PATE_{∆t}. In the following corollary, we state the worst-case bias over all possible treatment contrasts:

Corollary 3.1.
Let d₁ be the largest singular value of B. For all ∆t with ‖∆t‖ = 1, the squared bias is bounded by

$$\mathrm{Bias}^2_{\Delta t} \le \frac{d_1^2}{d_1^2 + \sigma^2_{t|u}}\,\frac{\sigma^2_{y|t}}{\sigma^2_{t|u}}\, R^2_{Y\sim U\mid T}, \qquad (34)$$

with equality when ∆t = u_B, the first left singular vector of B. When ∆t ∈ Null(B′), the naive estimate is unbiased, that is, PATE_{∆t} = τ′_naive∆t.

Proof: See appendix.

The first term in (34), d₁²/(d₁² + σ²_{t|u}), is the fraction of variance in the first principal component of the causes that can be explained by confounding. The first principal component corresponds to the projection of the treatments which is most correlated with confounders, and thus is the causal contrast with the largest ignorance region. We also note that the squared bias depends on R²_{Y∼U|T}, the partial variance explained by confounders given treatments. While the magnitude of the confounding bias is always largest when R²_{Y∼U|T} = 1, we often have reason to believe that the variance explained by confounders is likely smaller. We describe how to leverage this idea to calibrate more plausible bounds and measures of robustness in Section 6.

We illustrate some key insights from these theorems in Figure 2, where we display the worst-case bias as a function of the treatment contrast, ∆t. In this illustration, we assume that ∆t lies in the plane spanned by u_B and n_B, an arbitrary vector in the null space of B. We let θ = arccos(∆t′u_B) be the angle of ∆t relative to u_B (Figure 2a). Figure 2b depicts the bias as a function of θ for different values of R²_{Y∼U|T}. When ∆t is in the null space of B, PATE_{∆t} is identified because the confounder distributions are identical in the two treatment arms, i.e., there is no confounding for this particular contrast. When ∆t is colinear with u_B, the scaled difference in means of u is largest, which implies the largest worst-case bias for the treatment effect, Figure 2c (left).
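Both the per-contrast bound in Theorem 3 and the worst case over all unit contrasts in Corollary 3.1 can be sketched numerically. The snippet below specializes to a single confounder (m = 1), where the top singular value of B is simply ‖B‖ and the conditional moments of U have closed forms; the function names are illustrative:

```python
def sq_bias_bound_for_contrast(B, dt, sigma2_t_u, sigma2_y_t, r2_y_u_t):
    # Theorem 3 bound for a specific unit contrast dt, specialized to m = 1:
    # with U scalar, mu_{u|dt} = B'dt / (||B||^2 + sigma2_{t|u}) and
    # Sigma_{u|t} = sigma2_{t|u} / (||B||^2 + sigma2_{t|u}).
    d1sq = sum(b * b for b in B)
    mu_u_dt = sum(b * x for b, x in zip(B, dt)) / (d1sq + sigma2_t_u)
    Sigma_u_t = sigma2_t_u / (d1sq + sigma2_t_u)
    return sigma2_y_t * r2_y_u_t * mu_u_dt ** 2 / Sigma_u_t

def sq_bias_bound_over_contrasts(B, sigma2_t_u, sigma2_y_t, r2_y_u_t):
    # Corollary 3.1: worst case over all unit contrasts.  For m = 1 the top
    # singular value of B is d1 = ||B||, attained at dt = u_B = B / ||B||.
    d1sq = sum(b * b for b in B)
    return d1sq / (d1sq + sigma2_t_u) * sigma2_y_t / sigma2_t_u * r2_y_u_t
```

For a unit contrast in the null space of B′ the bound vanishes, while for ∆t = B/‖B‖ the per-contrast bound attains the maximum from Corollary 3.1.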
Even when PATE_{∆t} is identified, we emphasize that PATE_{t₁,•} and PATE_{t₂,•} are both biased, since the distribution of confounders in each treatment arm differs from the distribution of confounders in the superpopulation, Figure 2c (right). As noted by others, identification of PATE_{∆t} for ∆t in the null space of B arises due to bias cancellation in the intervention means of the two treatment arms (Grimmer et al., 2020).

Figure 2: Illustration of Corollary 3.1. (a) We parameterize ∆t by θ, the angle between n_B, a vector in the null space of B, and u_B, the first left singular vector of B. (b) The confounding bias of naive estimates of PATE_{∆t} changes with θ and depends on R²_{Y∼U|T}. (c) Confounder densities in different populations. The blue, green, and red densities denote the distribution of U in the observed population, the subpopulation receiving t₁, and the subpopulation receiving t₂, respectively. Observed data estimates of PATE_{∆t} are unbiased when ∆t = n_B, since the confounder distributions are the same in the two treatment arms. However, observed data estimates of PATE_{t₁,•} and PATE_{t₂,•} are biased since, in general, the superpopulation distribution of the confounder is different.

Our theory is invariant to rotations of the treatment vector, which means that under model 17-19 we can always make a change of treatment variables so that each confounder affects a single distinct rotated treatment (called "single cause" confounders in Wang and Blei (2018)). Specifically, let T̃ = RT be a rotation of the original treatment variables, so that T̃ ∼ N(0, ∆ + σ²_{t|u}I_k), where ∆ is a diagonal positive semi-definite matrix.
Then, as before, ∆_{ii}/(∆_{ii} + σ²_{t|u}) is the fraction of variance in the i-th rotated treatment due to confounding, and it provides a bound on the omitted confounder bias of estimates of the PATE for ∆t̃ = e_i. For the k − m contrasts corresponding to zeros on the diagonal of ∆, there is no confounding for these particular treatment contrasts and thus no confounding bias; these correspond to the contrasts in the original space which fall in the null space of B.

Next, we generalize beyond the setting in which (Y, T, U) is jointly multivariate Gaussian. First, when Y is Gaussian with mean µ_{y|t} and variance σ²_{y|t}, even when T has an arbitrary functional relationship with Y, we have that

$$\mathrm{PATE}_{t,\bullet} = (\mu_{y|t} - \mu_y) - \sigma_{y|t}\,\gamma'(\mu_{u|t} - \mu_u), \qquad (35)$$

where µ_u = E[U] is the population mean of U. Thus,

$$\mathrm{PATE}_{t_1,t_2} = (\mu_{y|t_1} - \mu_{y|t_2}) - \sigma_{y|t}\,\gamma'(\mu_{u|t_1} - \mu_{u|t_2}). \qquad (36)$$

This leads to the following generalization of Theorem 3.

Figure 3: Differences between observed and intervention densities as a function of the fraction of outcome variance explained by a single confounder. The black contours depict the conditional Gaussian copula, c_γ(F_{Y|t}(y), F_{U|t}(u) | t), whereas red points represent samples from the joint distribution, f(y, u | do(t)) ∝ f(y|t) c_γ(F_{Y|t}(y), F_{U|t}(u) | t) f(u). Panels (a)-(c) vary the conditional correlation √(R²_{Y∼U|T}), with panel (b) taking the opposite sign. We visualize the shift in the outcome density for different conditional correlations and note that smaller values of R²_{Y∼U|T} imply smaller biases in the outcome despite large imbalance in the distribution of U.

Theorem 4.
Assume model 13-15 with Gaussian outcomes. If Σ_{u|t} is non-invertible, then Bias_{t₁,t₂} is bounded if and only if µ_{u|t₁} − µ_{u|t₂} is in the row space of Σ_{u|t}. When bounded,

$$\mathrm{Bias}^2_{t_1,t_2} \le \sigma^2_{y|t}\, R^2_{Y\sim U\mid T}\, \|(\Sigma^{\dagger}_{u|t})^{1/2}(\mu_{u|t_1} - \mu_{u|t_2})\|^2,$$

where Σ†_{u|t} is the pseudo-inverse of Σ_{u|t}.

Proof: See appendix.

As before, when bounded, the bias is proportional to the norm of the scaled difference in confounder means between the two treatment arms. When there exists an m-vector q such that Var(q′U | t) = 0, then Σ_{u|t} is non-invertible because there exists a projection of the confounders which is point identified. Theorem 4 says that, in this case, the ignorance region for the PATE is bounded if and only if q′(µ_{u|t₁} − µ_{u|t₂}) = 0. In words, if a projection of the confounders can be identified, then the confounding bias is bounded if the identifiable projection of the confounders has the same value in both treatment arms. This result can be viewed in the context of Theorem 7 in Wang and Blei (2018), which assumes consistency and overlap of estimators for the relevant latent confounders for identification.

When the observed outcome distribution is non-Gaussian, we cannot necessarily express PATE_{t₁,t₂} analytically. In particular, for non-Gaussian Y, when f(u|t₁) = f(u|t₂) the average treatment effect among the t₁- and t₂-treated units is unconfounded, but the bias of PATE_{t₁,t₂} may be nonzero since f(u|t₁) ≠ f(u). The causal effects, however, can still be calculated using Algorithm 1. Another particularly important non-Gaussian setting we highlight here is when the outcome is binary. Interestingly, unlike the linear case, RR_{t,•} and RR_{t₁,t₂} are non-monotone in the magnitude of γ. We discuss this in more detail in Appendix B.3 and provide simulation results with binary outcomes in Section 7.
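A minimal sketch of the boundedness condition in Theorem 4, assuming for simplicity that Σ_{u|t} is diagonal, so that the pseudo-inverse simply inverts the nonzero diagonal entries; the helper name is hypothetical:

```python
def theorem4_sq_bias_bound(Sigma_diag, dmu, sigma2_y_t, r2_y_u_t, tol=1e-12):
    # Theorem 4 with a diagonal Sigma_{u|t}: the pseudo-inverse inverts the
    # nonzero diagonal entries.  The bias is bounded iff dmu lies in the row
    # space of Sigma_{u|t}, i.e. dmu is zero wherever the diagonal is zero.
    sq = 0.0
    for s, d in zip(Sigma_diag, dmu):
        if s > tol:
            sq += d * d / s
        elif abs(d) > tol:
            return float("inf")          # outside the row space: unbounded
    return sigma2_y_t * r2_y_u_t * sq    # squared worst-case bias
```

When a projection of the confounders is point identified (a zero diagonal entry) but the confounder means differ along that projection, the sketch returns an infinite bound, matching the theorem's boundedness condition.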
Sensitivity analyses consist of two parts: first, the sensitivity model itself, which specifies a set of causal models indexed by sensitivity parameters; and secondly, exploratory tools for mapping external assumptions to particular causal models in this set. We now turn to discussing the latter in the context of our proposed model.

In the sensitivity analysis literature so far, two exploratory techniques have been particularly popular in single treatment studies: calibration, which maps sensitivity parameter values to interpretable observable or hypothetical quantities; and robustness analysis, which characterizes the "strength" of confounding necessary to change the conclusion of a study. Here, we show how to adapt these techniques to our sensitivity model in the multi-treatment setting. In addition, we introduce a third class of tools that are particularly well-suited to the multi-treatment setting, which we call multiple contrast criteria (MCCs). MCCs specify aggregate properties of the treatment effects for multiple treatment contrasts that are implied by a single causal model, e.g., the L2 norm of PATEs corresponding to contrasts in each individual treatment variable in T. In many multi-treatment settings, assumptions are often expressed in terms of such aggregates (e.g., in genomics, the idea that the effect of most single nucleotide polymorphisms is small), and we show here how these can be used in conjunction with our sensitivity model to characterize candidate causal models that may be of interest in an application.

We begin by describing calibration for γ in our sensitivity model when the focus is on a single treatment contrast, between levels T = t₁ and T = t₂. The goal is to develop heuristics for specifying "reasonable" values or ranges for γ, e.g., to derive bounds on treatment effects by specifying bounds on the strength or direction of confounding.
Following previous work in the single treatment setting, we outline how to calibrate our sensitivity parameter vector γ in terms of the fraction of outcome variance explained by the unobserved confounder. Recall that γ is a vector that parameterizes the residual correlation between the m-dimensional unobserved confounder U and the outcome Y after conditioning on the treatment vector T.

First, we briefly review calibration in single-treatment settings. In latent variable approaches for single treatment sensitivity analysis, the causal effect is identified given two sensitivity parameters: the fraction of outcome variance explained by unobserved confounders, R²_{Y∼U|T}, and the fraction of treatment variance explained by unobserved confounders, R²_{T∼U} (Cinelli and Hazlett, 2019). In a linear model, these two scalar quantities identify the confounding bias (Equation 29). Neither R-squared value is identifiable, and thus many authors have proposed strategies for drawing analogies between these values and other observable or hypothetical quantities (Cinelli et al., 2020; Veitch and Zaveri, 2020; Franks et al., 2019).

We borrow this strategy for calibration in our setting, with some modifications. First, in our setting there is no need to calibrate R²_{T∼U}, because we have restricted ourselves to the case where f(u|t) is identified up to a causal equivalence class. This leaves calibration of the outcome-confounder relationship, which in our setting is more complex because it is parameterized by a vector γ. However, we can reparameterize γ in terms of a direction d and an R-squared value for interpretable calibration:

$$\gamma = \sqrt{R^2_{Y\sim U\mid T}}\;\Sigma^{-1/2}_{u|t}\,d, \qquad (37)$$

where d ∈ S^{m−1} is an m-dimensional unit vector. We discuss strategies for calibrating the magnitude and direction separately.

Calibrating the magnitude of γ. For Gaussian outcomes, the magnitude of γ is characterized entirely by R²_{Y∼U|T}, the partial fraction of outcome variance explained by U given T.
When R²_{Y∼U|T} = 0 there is no unobserved confounding, and when R²_{Y∼U|T} = 1, all of the observed residual variance in Y is due to confounding factors. In order to calibrate this magnitude, we adopt an idea proposed by Cinelli and Hazlett (2019) for inference with single treatments. In their work, they calibrate R²_{Y∼U|T} by comparing it to the partial fraction of variance explained by different observed covariates. We use a closely related strategy that makes use of the presence of multiple treatments rather than observed covariates. (Unlike the single treatment setting, the confounder-outcome relationship cannot be sufficiently summarized in terms of a scalar R²_{Y∼U|T}: each confounder can impact each treatment in different ways.) Specifically, we compute the fraction of variance in Y that can be explained by a specific treatment (or set of treatments), T_j, after controlling for all other treatments T_{−j}, as

$$R^2_{Y\sim T_j\mid T_{-j}} := \frac{R^2_{Y\sim T} - R^2_{Y\sim T_{-j}}}{1 - R^2_{Y\sim T_{-j}}}. \qquad (38)$$

When observed covariates are available, we can still analogously compute the partial fraction of variance explained by an observed covariate, R²_{Y∼X_j|T,X_{−j}}, as done in Cinelli and Hazlett (2019). Veitch and Zaveri (2020) and Cinelli et al. (2020) propose useful graphical summaries for calibration based on these metrics in the single treatment setting.

When the observed outcome is non-Gaussian, we calibrate the "implicit R²" by considering the explained variance of the latent Gaussian outcome, Ỹ, in Equation 14. The implicit R² of T for model 13-15 is defined as

$$R^2_{Y\sim T} = \frac{\mathrm{Var}(E[\tilde Y \mid T])}{\mathrm{Var}(E[\tilde Y \mid T]) + 1}, \qquad (39)$$

and the implicit partial R-squared of treatment T_j, R²_{Y∼T_j|T_{−j}}, is defined analogously to Equation 38. As before, these estimable partial R-squared values can be used to provide a useful comparison for the partial R-squared of potential unobserved confounders, R²_{Y∼U|T}. For more detail, see Imbens (2003) and Franks et al.
(2019), who discuss calibration with implicit R-squared values in logistic regression models.

Choosing the direction of γ. Given a magnitude, we now propose a default method for identifying the direction of γ for a single contrast. The dot product d′Σ^{−1/2}_{u|t}(µ_{u|t₁} − µ_{u|t₂}) corresponds to the projection of the scaled difference in confounder means onto the outcome space. By default, we suggest using the direction which maximizes the squared bias. As shown in Theorem 4, when d is colinear with Σ^{−1/2}_{u|t}(µ_{u|t₁} − µ_{u|t₂}), the confounding bias of the naive estimator for Gaussian outcomes is maximized at

$$|\mathrm{Bias}_{t_1,t_2}| = \sqrt{R^2_{Y\sim U\mid T}}\;\sigma_{y|t}\,\|\Sigma^{-1/2}_{u|t}(\mu_{u|t_1} - \mu_{u|t_2})\|. \qquad (40)$$

Choosing the direction of the sensitivity vector in this way provides conservative bounds for each contrast of interest. For non-Gaussian outcomes or alternative estimands, there may not be an analytic solution for the direction which maximizes the bias, but we can still compute the direction via numerical optimization.

6.2 Robustness for Single Contrasts

We now turn to assessing the robustness of conclusions using our sensitivity model, extending work by Cinelli and Hazlett (2019) and VanderWeele and Ding (2017) in the single treatment setting. Specifically, we propose an extension of the robustness value (RV), which characterizes the minimum strength of confounding needed to change the sign of the treatment effect. As in the previous section, the extension is most straightforward when considering the effect of a single treatment contrast, between levels T = t₁ and T = t₂.

To review briefly, in single treatment settings, Cinelli and Hazlett define the RV as the minimum R-squared needed to reduce the treatment effect to zero, assuming R²_{Y∼U|T} = R²_{T∼U} (we return to this assumption below).
A robustness value close to one means the treatment effect maintains the same sign even if nearly all of the observed residual variance in the outcome is due to confounding. On the other hand, a robustness value close to zero means that even weak confounding would change the sign of the point estimate.

In the multi-treatment setting, we can more precisely characterize the robustness of causal effects when R²_{T∼U} is identifiable. In particular, we can compute an analogue to the RV without assuming R²_{Y∼U|T} = R²_{T∼U}, an assumption that, in single treatment settings, can be consequential: when R²_{Y∼U|T} > R²_{T∼U}, the single-treatment RV will be too conservative, and conversely, when R²_{Y∼U|T} < R²_{T∼U}, it will overestimate the robustness of the effect. With R²_{T∼U} known, we define the multi-treatment RV simply as the minimum value of R²_{Y∼U|T} needed to explain away the treatment effect of interest, assuming the direction of the sensitivity vector is chosen to maximize the bias. When the observed outcomes are Gaussian, the robustness value can be computed in closed form:

$$\mathrm{RV}_{t_1,t_2} = \frac{(\mu_{y|t_1} - \mu_{y|t_2})^2}{\sigma^2_{y|t}\,\|\Sigma^{-1/2}_{u|t}(\mu_{u|t_1} - \mu_{u|t_2})\|^2}. \qquad (41)$$

RV metrics for alternative estimands and/or non-Gaussian data can still be computed using the same principle. For example, when the observed outcome is binary, the RV can be computed numerically by solving RR_{t₁,t₂} = 1, which gives the minimum strength of confounding needed for the observed risk ratio (RR) to equal one. This robustness value is analogous to the "E-value" proposed by VanderWeele and Ding (2017).

In our setting, we can also make stronger statements about robustness than in the single treatment setting: under the latent variable model, it is possible to declare an effect robust to any level of confounding.
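The closed-form robustness value in Equation 41 is straightforward to compute; the sketch below specializes to a one-dimensional confounder (m = 1), with a hypothetical function name:

```python
def robustness_value(naive_effect, sigma2_y_t, Sigma_u_t, dmu_u):
    # Equation 41 with m = 1: the minimum R2_{Y~U|T} needed to drive the
    # naive contrast estimate to zero.  Values above 1 mean that no
    # admissible sensitivity vector can explain the effect away.
    return naive_effect ** 2 * Sigma_u_t / (sigma2_y_t * dmu_u ** 2)
```

Consistent with the discussion above, a returned value greater than one indicates that the ignorance region excludes zero for every admissible level of confounding.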
In particular, when the latent variable model implies R²_{T∼U} < 1 (i.e., we have confounder overlap), then even when R²_{Y∼U|T} = 1, the ignorance region is bounded (Theorem 4). When this ignorance region excludes zero, we declare the effect "robust". This operation is consistent with the result in Miao et al. (2018), showing that hypotheses of zero effect can be tested in this setting, even if the treatment effect cannot be identified.

In addition to the exploratory tools used with single-treatment sensitivity analysis, the multi-treatment setting presents opportunities for exploring sensitivity models in new ways. Here, we propose one such approach using multiple contrast criteria, or MCCs. As opposed to the approaches we have discussed so far, which consider treatment contrasts in isolation, MCCs characterize a choice of sensitivity vector γ by concurrently considering its implications for multiple treatment contrasts in aggregate. Thus, while the sensitivity vector γ that gives the worst-case bias may differ across individual contrasts, an MCC can be used to select a single γ that has implications for many simultaneous treatment effects.

Formally, for a set of treatment contrasts T = {(t₁, t₂)_k}_{k=1}^K and a candidate sensitivity vector γ, let PATE_T(γ) be the vector of PATEs implied by the causal model indexed by γ. An MCC is a scalar summary of this treatment effect vector, which we write as ω(PATE_T(γ)). An MCC is specified by the set of contrasts T and the summary function ω, both of which can be chosen to meet the needs of a given analysis.

MCCs can be used in many ways, but here we consider how they can be used to search for the causal model that yields the minimum norm treatment effect vector, subject to Assumptions 1-5 and a confounding limit R.
Specifically, we take ω to be an L_p norm for some p, and consider sensitivity vectors γ* that satisfy

$$\gamma^* = \operatorname*{argmin}_{\tilde\gamma}\ \omega(\mathrm{PATE}_{\mathcal{T}}(\tilde\gamma)) \quad \text{subject to} \quad R^2_{Y\sim U\mid T}(\tilde\gamma) \le R, \qquad (42)$$

where R²_{Y∼U|T}(γ̃) = γ̃′Σ_{u|t}γ̃ / σ²_{y|t} is the partial fraction of outcome variance explained by confounding for sensitivity vector γ̃. Causal models selected in this way are often highly interpretable, in terms of either "worst case" effect sizes or established prior knowledge. For example, we can choose ω to be the L∞ norm, so that γ* is the sensitivity vector that minimizes the maximum absolute effect across contrasts. Alternatively, we could choose ω to be the L₁ or L₂ norm of the treatment effects to find models that imply small "typical" effect sizes. We demonstrate how this minimization approach can be used to express prior knowledge about small effects in simulated data in Section 7.2, and how it can be used to evaluate robustness on a real data set in Section 8.

In this section, we demonstrate our sensitivity analysis workflow in several numerical simulations. The goal of these simulations is twofold: first, to demonstrate some of the operating characteristics of the approach in settings that are more realistic than the linear Gaussian settings we characterized analytically; and secondly, to show how exploratory tools like calibration, robustness analysis, and MCCs can be used to draw conclusions and choose interesting candidate models.

We consider two broad simulation settings. In the first setting, we construct simulations with non-linear responses to treatment to show how the ignorance regions returned by our method can vary in different scenarios. In the second setting, we construct a simulation that mimics the structure of a Genome Wide Association Study (GWAS).
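Returning briefly to the criterion in Equation 42: for a scalar sensitivity parameter (m = 1) and effects that are linear in γ, the MCC search reduces to a one-dimensional constrained minimization. A grid-search sketch follows; the function name, and the absorption of σ_{y|t} into the inputs, are simplifications for illustration:

```python
import math

def mcc_l1_minimizer(naive, dmu, Sigma_u_t, r2_limit, n_grid=2001):
    # Grid search over a scalar sensitivity parameter gamma (m = 1).
    # For each contrast j, the implied effect is naive[j] - gamma * dmu[j]
    # (the outcome scale sigma_{y|t} is absorbed into dmu here), and the
    # constraint in Equation 42 becomes gamma^2 * Sigma_{u|t} <= R.
    g_max = math.sqrt(r2_limit / Sigma_u_t)
    best_g, best_val = 0.0, sum(abs(v) for v in naive)
    for i in range(n_grid):
        g = -g_max + 2.0 * g_max * i / (n_grid - 1)
        val = sum(abs(nv - g * dm) for nv, dm in zip(naive, dmu))
        if val < best_val:
            best_g, best_val = g, val
    return best_g, best_val
```

With a larger confounding limit R, the feasible range of γ widens and the attainable L1 norm of the implied effects can only shrink, mirroring the behavior described in Section 7.2.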
Here, we examine the behavior of our method when a popular approximate latent variable method, the Variational Auto Encoder (VAE), is used to estimate the effects of latent confounders, and demonstrate how MCCs can be useful tools for using prior information to choose potentially useful causal models from the set that is compatible with the observed data. In both subsections, we simulate data from the following generating process:

$$U := \epsilon_u, \qquad \epsilon_u \sim N_m(0, I_m), \qquad (43)$$
$$\tilde T := BU + \epsilon_{t|u}, \qquad \epsilon_{t|u} \sim N_k(0, \sigma^2_{t|u} I_k), \qquad (44)$$
$$T := h_T(\tilde T), \qquad (45)$$
$$\tilde Y := g(T) + \gamma' U + \epsilon_{y|t,u}, \qquad \epsilon_{y|t,u} \sim N(0, \sigma^2_{y|t,u}), \qquad (46)$$
$$Y := h_{Y|T}(\tilde Y). \qquad (47)$$

The functions h_Y and h_T are chosen to be either the identity for Gaussian data, or an indicator function for binary data.
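A minimal sketch of this generating process, specialized to a scalar confounder (m = 1) with h_T taken to be the identity; the function name and defaults are illustrative:

```python
import random

def simulate(n, B, gamma, g, sd_t_u=1.0, sd_y_tu=1.0, binary_y=False, seed=0):
    """Draw (U, T, Y) from Equations 43-47 with a scalar confounder (m = 1)
    and h_T the identity."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)                                     # Eq. 43
        t = [b * u + rng.gauss(0.0, sd_t_u) for b in B]             # Eqs. 44-45
        y_tilde = g(t) + gamma * u + rng.gauss(0.0, sd_y_tu)        # Eq. 46
        y = (1.0 if y_tilde > 0 else 0.0) if binary_y else y_tilde  # Eq. 47
        data.append((u, t, y))
    return data
```

Setting `binary_y=True` applies the indicator version of h_{Y|T}, producing the binary-outcome variant used below.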
Figure 4: Estimated ignorance regions for contrasts e_i when Ỹ is nonlinear in T. (a) Gaussian outcome: R²_{Y∼U|T} = 0 denotes the treatment effects estimated based on the observed data only, i.e., under the assumption of no unobserved confounding. The two curves with R²_{Y∼U|T} = 1 correspond to the case when all residual variation in Y is due to confounding, respectively denoting the upper and lower bounds of the ignorance region. (b) Binary outcome: even though the estimand is a non-linear function of the latent Gaussian outcome, the width of the ignorance region and the general robustness pattern are largely consistent with the implications of Corollary 3.1.

We start by exploring variation in the size of ignorance regions for different contrasts in a simple simulated example with four treatments where the outcome is a nonlinear function of these treatments. We consider two cases: first, a case where Y is Gaussian with h_Y(Ỹ) = Ỹ; and secondly, a case where Y is binary with h_Y(Ỹ) = I_{Ỹ > 0}. We aim to estimate PATE_{e_i,0} for the Gaussian outcome and RR_{e_i,0} for the binary outcome, where e_i denotes the i-th canonical vector, i.e., the vector with a 1 in the i-th coordinate and 0s elsewhere.

In both examples, we generate the data with a one-dimensional latent confounder (m = 1) and k = 4 treatments, with σ²_{t|u} = 1, σ²_{y|t,u} = 1, h_T(T̃) = T̃, a loading vector B whose first entry, 2, has the largest magnitude, and a nonlinear response function g(T) combining linear, quadratic, and indicator terms in the individual treatments. Based on the choice of g, contrasts along the j-th dimension of T have effects of widely varying magnitude. Based on our choice of B, the worst-case confounding bias also varies significantly across contrasts. For example, the effect of confounding is largest when estimating the effect of T₁, since the first entry of B has the largest magnitude, meaning T₁ is the treatment most correlated with U.
In order to demonstrate this in simulation, we first apply probabilistic PCA (PPCA) to estimate the distribution f(u|t), and then model f(y|t) using Bayesian additive regression trees (BART) with the R package BART (McCulloch et al., 2018).

For Gaussian outcomes, the widths of the ignorance regions are larger for the treatments most correlated with confounders, as characterized by Theorem 4 (see Figure 4). Since B is a vector, the width of the ignorance region of PATE_{t₁,t₂} can be examined by looking at the dot product between B and the treatment contrast: the larger the dot product, the wider the ignorance region. As expected, the ignorance region of the treatment effect is widest when ∆t = e₁ and narrowest when ∆t = e₄, since B′e₁ has the largest magnitude while B′e₄ has the smallest. Despite the fact that ∆t = e₄ has the smallest ignorance region, it is not robust to confounding because the naive effect is already close to zero (RV ≈ 0). For the second and third treatment contrasts, estimates are robust to confounders, as their entire ignorance regions exclude 0.

For the simulation with binary outcomes, we compute ignorance regions for the risk ratio. Although we do not have a theoretical result about the ignorance regions of the risk ratio, the general trends in the size of the ignorance region and the robustness of effects are comparable to the Gaussian case. Most notably, the treatments with the largest ignorance regions are still those which are most correlated with the confounder. On the other hand, because the outcome is non-linear in U, the naive estimate is not at the center of the ignorance region (Figure 4b). In fact, the ignorance region is also non-monotone in R²_{Y∼U|T} because the variance of the intervention distribution also depends on γ. In this case, one of the endpoints of the ignorance region corresponds to R²_{Y∼U|T} = 1 but the other does not.
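The PPCA step admits a closed form (Tipping and Bishop): the factor loading points along the top eigenvector of the sample covariance, and the noise variance is the average of the remaining eigenvalues. The following is a pure-Python sketch for a single latent factor, not the implementation used here:

```python
import math

def ppca_posterior(X, n_iter=200):
    # One-factor probabilistic PCA (m = 1), Tipping-Bishop closed form:
    # B-hat points along the top eigenvector of the sample covariance S,
    # with ||B-hat||^2 = lambda_1 - sigma^2, and sigma^2 equal to the
    # average of the remaining eigenvalues.  Returns the moments of
    # f(u | t): mean B'(BB' + sigma^2 I)^{-1} t and variance Sigma_{u|t}.
    n, k = len(X), len(X[0])
    S = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(k)]
         for a in range(k)]
    v = [1.0] * k                      # power iteration for the top eigenpair
    for _ in range(n_iter):
        w = [sum(S[a][b] * v[b] for b in range(k)) for a in range(k)]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    lam1 = sum(v[a] * sum(S[a][b] * v[b] for b in range(k)) for a in range(k))
    sigma2 = (sum(S[a][a] for a in range(k)) - lam1) / (k - 1)
    B = [math.sqrt(max(lam1 - sigma2, 0.0)) * x for x in v]
    denom = sum(b * b for b in B) + sigma2
    mean_fn = lambda t: sum(b * x for b, x in zip(B, t)) / denom
    return B, sigma2, mean_fn, sigma2 / denom
```

The returned posterior moments correspond to f(u|t) = N(B′(BB′ + σ²I)⁻¹t, Σ_{u|t}), the quantities required by the sensitivity analysis.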
We compute the endpoints of the ignorance region numerically (see Appendix B.3 for more details).

We now explore a slightly more complex setting motivated by applications in biology, particularly in genome-wide association studies (GWAS). GWAS investigate the association between hundreds or thousands of genetic features (i.e., single nucleotide polymorphisms, or SNPs) and observable traits (i.e., phenotypes), such as disease status. Here, we construct a simulated GWAS to demonstrate two properties of our sensitivity analysis method. First, we show that flexible latent variable models can be plugged into our sensitivity model. Secondly, we demonstrate how minimizing multiple contrast criteria (MCCs) can be used to select interesting candidate models that conform to broad hypotheses about the nature of genetic effects.

Despite having "association" in the name, GWAS is a particularly interesting application area for multi-treatment causal inference. In practice, measures of association in GWAS are often adjusted to afford a causal interpretation in which conclusions speak to how a phenotype would change if the genome were intervened upon. For example, most analyses adjust for "population structure", which corresponds to broad genetic patterns induced by population dynamics that are often confounded with geography, ancestry, environment, and other lifestyle factors (Price et al., 2006; Song et al., 2015). Wang and Blei (2018) cite this literature as motivation for their work.

In this simulation, we generate data with high-dimensional binary treatments (SNPs), and set the true causal effects to be mostly small, with a small fraction of treatments having effects of larger magnitude. The simulation is then designed so that unobserved confounding biases estimates of each of these treatment effects, obscuring the difference between large and small effects. To generate data, we follow the template in Equations 43-47.
We generate data with m = 3 latent confounders and k = 500 treatments, T ∈ {0, 1}^k, where T_j = 1 if the j-th site shows a deviation from the baseline sequence (i.e., the presence of at least one minor allele). We set the response function g(T) = τ′T to be linear in the treatments (a common assumption in GWAS), and set the outcome Y to be Gaussian by setting h_Y(Ỹ) = Ỹ. We focus on estimating

$$\frac{1}{n}\sum_{i=1}^n \mathrm{PATE}_{t_{ji},\,t_{-ji}} \quad \text{for all } j = 1, \ldots, k, \qquad (48)$$

where t_{ji} and t_{−ji} correspond to the i-th observed treatment vector with the j-th SNP set to 1 and 0, respectively. Note that since g(T) is linear in T, (1/n)∑_{i=1}^n PATE_{t_{ji},t_{−ji}} = τ_j, the j-th element of τ. We generate τ from a two-component mixture, with 90% of the coefficients drawn from a uniform distribution on a narrow interval (small effects) and 10% drawn from a uniform distribution on a wider interval (large effects).

We consider a model for the observed data with two components, paying special attention to the latent confounder model. In particular, we model the conditional distribution of confounders given treatments, f(u|t), using a variational autoencoder (VAE), which is a popular, flexible neural network-based approximate latent variable model. This model is particularly appropriate because it yields an approximate Gaussian conditional distribution f(u|t), even for discrete T as we have here. (We discuss latent confounder inference with VAEs in more detail in Appendix B.2.) We fit the observed outcome model f(y|t) using a simple linear regression, ignoring confounding, which corresponds to the setting in which R²_{Y∼U|T} = 0.

Worst-Case Ignorance Regions.
With this simulation setup, we first examine whether the ignorance regions contain the true causal effects. Importantly, because the VAE is an approximate latent variable model, and we are currently ignoring estimation uncertainty, it is not immediate that the ignorance regions should be valid. We find that, even using our plug-in approach, the worst-case ignorance regions cover 498 out of 500 of the true treatment effects. In all cases, the worst-case bounds communicate substantial fundamental uncertainty about the true treatment effects (see Appendix Figure 8).

Figure 5: Causal inference with 500 binary treatments and m = 3 latent confounders. Naive estimates of the null and non-null effects are overdispersed due to confounding. (a) Minimum L1-norm treatment effects are shown for fifty randomly chosen small effects ("null" contrasts) and all large effects ("non-null" contrasts) for three different limits on the magnitude of confounding, R. When R = 1, the overall L1 minimizer of the treatment effects is achieved for the sensitivity vector which explains R²_{Y∼U|T} = 74% of the residual outcome variance. (b) We construct a simple non-null classifier from minimum L1 treatment effects with R = 1 and naive effects (R = 0). The blue curve represents the ROC curve from the naive estimates, and the green, yellow, and red curves represent the L1-minimized treatment effect estimates for inferred confounder models with dimensions k̂ ∈ {2, 3, 4}. The area under the curve (AUC) for the naive estimates is 0.54, whereas the AUCs for the L1-minimized estimates are 0.61 (k̂ = 2), 0.73 (k̂ = 3), and 0.64 (k̂ = 4).

Finding Candidate Models with MCCs.
Investigators often have strong hypotheses about the aggregate properties of SNP treatment effects. For example, while some phenotypes can be predominantly explained by only a small number of SNPs, other phenotypes may be more plausibly described by the omnigenic hypothesis, which suggests that some observable effects must be explained by the sum of many small effects across many SNPs (Boyle et al., 2017). Here, we show that some of these aggregate hypotheses can be formalized in terms of MCCs, and in these cases, the MCC minimization procedure from Section 6.3 can be used to find useful candidate causal models that align with these hypotheses while being fully consistent with the observed data.

To motivate candidate model selection, we consider the use case of estimating effect sizes from a single coherent model, under the hypothesis that the median effect size is small. Specifically, we formalize this hypothesis by defining an MCC ω(PATE_T(γ)) to be the L1 norm of the effects of the contrasts {(t_{ji}, t_{-ji}) : i = 1, ..., n} for all treatments j = 1, ..., k. We then select the model that minimizes this criterion by selecting γ subject to different allowed levels of confounding R²_{Y∼U|T}.

In Figure 5a, we plot the resulting coefficient estimates for three values of R²: 0 (the naive effects), an intermediate limit, and 1. Because the true effects are much smaller in magnitude than the naive effects, the RMSE of the estimates decreases as we increase R²_{Y∼U|T}, although all effects are equally compatible with the observed data. In this simulation, the L1 norm of the naive estimates is approximately 2525, while the L1 norm of the true effects is drastically smaller, at approximately 75.

Models selected using this MCC minimization procedure are also useful for the coarser goal of separating small and large effects.
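The MCC minimization itself is a small constrained optimization: choose the sensitivity vector γ to minimize the L1 norm of the bias-adjusted effects, subject to the confounding budget γ'Σ_{u|t}γ ≤ σ²_{y|t}R². A sketch of this idea follows; the matrix A mapping γ to per-treatment bias, and all dimensions and scales, are illustrative placeholders rather than quantities from the paper:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, k = 3, 50
A = rng.normal(size=(m, k))          # hypothetical map from gamma to per-treatment bias
Sigma_ut = 0.3 * np.eye(m)           # conditional confounder covariance Sigma_{u|t}
tau_naive = rng.normal(size=k)       # stand-in naive regression coefficients
sigma2_y, R2 = 1.0, 1.0              # residual outcome variance and confounding limit

def l1_mcc(gamma):
    """MCC: L1 norm of the treatment effects implied by sensitivity vector gamma."""
    return np.abs(tau_naive - A.T @ gamma).sum()

budget = {"type": "ineq", "fun": lambda g: sigma2_y * R2 - g @ Sigma_ut @ g}
res = minimize(l1_mcc, np.zeros(m), method="SLSQP", constraints=[budget])
gamma_star = res.x                   # candidate model: no worse than gamma = 0
```

By construction the selected candidate model has aggregate effects at least as small as the naive ones (γ = 0 is feasible), while remaining observationally equivalent to them.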
From the naive regression, the coefficients are overdispersed relative to the true causal effects, and the true small coefficients are practically indistinguishable from the true large coefficients. Meanwhile, models chosen with the MCC minimization procedure provide more useful signal. To formalize this, we consider a classifier that separates large and small effects using the magnitude of the inferred coefficients as the classification score. In Figure 5 we plot the receiver operating characteristic (ROC) curves for the classifiers based on the naive estimates as well as the overall L1 minimizer of the treatment effects (R² = 1, i.e., no limit on the value of R²_{Y∼U|T}).

Importantly, the difference in conditional confounder means, µ_{u|t_{ji}} − µ_{u|t_{-ji}}, varies between non-null and null contrasts. This leads to a larger reduction in the relative magnitude of the null effects for models chosen through MCC minimization, accentuating the differences between large and small treatment effects (see Appendix Figure 9). For models selected by MCC minimization, the area under the ROC curve (AUC) increases from 0.54 (almost no ability to distinguish small and large treatments) to 0.72 (k̂ = 3, red curve). The selected model achieves nearly a 25% true positive rate without accruing any false positives. Naturally, the classifier performance is best when we fit a latent variable model with the correct number of latent factors, although the classifiers based on latent variable models of dimensions k̂ = 2 and k̂ = 4 still outperform classification from the naive effects. In the Discussion, we note how this approach relates to, and complements, recent identification results for a similar setting in Miao et al. (2020).

In this section, we compare our approach to other recent analyses of the TMDB 5000 Movie Dataset (Kaggle, 2017), which was analyzed extensively by Wang and Blei (2018) and Grimmer et al. (2020).
The dataset consists of 5000 movies and their corresponding revenue, budget, genre, and the identities of the lead cast members. Following Wang and Blei, we focus on estimating the causal effect of an actor's presence on the movie's log revenue. We let Y denote the log revenue and T_i = (T_{i1}, ..., T_{ik}) encode the movie cast, where the binary random variable T_{ij} ∈ {0, 1} indicates whether actor j appeared in movie i, and T_i ∈ 𝒯 = {T_1, ..., T_n}. We also let 𝒯_j denote the set of all movies T_i for which T_{ij} = 1. We define the estimand of interest, η_j, as the total log revenue contributed by actor j:

  η_j := ∑_{t_i ∈ 𝒯_j} PATE_{t_i, t_{-ji}},   (49)

where t_{-ji} corresponds to the observed treatment vector for movie i excluding actor j. This estimand is a non-parametric generalization of the regression coefficient τ_j, which was targeted in the analysis of Wang and Blei (2018). Specifically, under the assumption that log revenue is linear in the cast indicators, η_j reduces to n_j τ_j, the effect of actor j scaled by the number of movies they appeared in, where τ_j is the regression coefficient for actor j. Our estimand is well-defined without this linearity assumption.

We regress the log revenue on cast indicators to estimate actor effects, τ_j^naive, under an assumption of no unobserved confounding. In order to demonstrate the applicability of our sensitivity analysis, we explicitly induce unobserved confounding by excluding observed confounders. We validate our analysis by comparing calibrated effect estimates when the confounders are excluded to estimates when the confounder is included. Most importantly, we exclude the movie's budget, which we estimate to be the largest known source of confounding (computed using Equation 38; see Appendix Figure 10a).

For simplicity, we model the observed outcome distribution with a linear regression, although other more flexible outcome models (e.g., BART) can also be used.
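The reduction of η_j to n_jτ_j under linearity is easy to verify directly: with a linear response y(t) = τ't, each PATE_{t_i, t_{-ji}} equals τ_j, so summing over the movies in 𝒯_j gives n_jτ_j. A small numerical check with hypothetical cast indicators and effects:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 10
T = (rng.uniform(size=(n, k)) < 0.3).astype(float)   # hypothetical cast indicators
tau = rng.normal(size=k)                             # hypothetical linear effects
j = 0

movies_with_j = T[T[:, j] == 1]                      # the set of movies in T_j
drop_j = movies_with_j.copy()
drop_j[:, j] = 0.0                                   # counterfactual casts t_{-ji}

# Under linearity, PATE_{t_i, t_{-ji}} = tau' t_i - tau' t_{-ji} = tau_j for each movie
eta_j = ((movies_with_j - drop_j) @ tau).sum()
assert np.isclose(eta_j, len(movies_with_j) * tau[j])
```

Without the linearity assumption the per-movie contrasts need not be equal, which is why η_j rather than τ_j is taken as the estimand.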
As in the previous section, we use a VAE to infer a Gaussian conditional confounder distribution, f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}) (see Appendix B.2).

Results.
Since our focus is on confounding rather than estimation, in order to limit the influence of estimation uncertainty we subset the data to the k = 327 actors who participated in at least twenty movies. This reduces the total number of movies to 2439. We fit the VAE to the treatments and use cross-validation to identify the appropriate latent confounder dimension, which we inferred to be m̂ = 20 (see Appendix Figure 10b). We then plot the worst-case ignorance region for the causal effect on log revenue as a function of R²_{Y∼U|T} for the 46 actors with significant regression coefficients in the naive regression (Figure 6, top). (For illustrative purposes, we assume that the budget is pre-treatment, meaning that the budget is decided prior to selecting the cast, which may be a dubious assumption in actuality.) Only a few actors have appreciable robustness values, with the majority of the other actors well below them. For reference, the log budget, which was explicitly excluded from our causal analysis, explains about 30% of the variance in log revenue (Appendix Figure 10a). In other words, none of the causal effects are robust at a level which matches the variance explained by the most important excluded confounder.
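Under the Gaussian model, the worst-case ignorance region for a single contrast is an interval centered at the naive estimate with half-width σ_{y|t}√(R² · µ'_{u|∆t}Σ⁻¹_{u|t}µ_{u|∆t}) (Theorem 3). A sketch of the per-actor computation, with all inputs as illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 20                                   # inferred latent confounder dimension
mu_dt = rng.normal(scale=0.2, size=m)    # E[U | contrast] for one actor (illustrative)
Sigma_ut = 0.1 * np.eye(m)               # conditional confounder covariance
tau_naive_j, sigma2_y = 1.5, 1.0         # naive estimate and residual outcome variance

def ignorance_region(R2):
    """Worst-case interval for the effect at confounding level R^2 (Theorem 3)."""
    half = np.sqrt(sigma2_y * R2 * (mu_dt @ np.linalg.solve(Sigma_ut, mu_dt)))
    return tau_naive_j - half, tau_naive_j + half

lo0, hi0 = ignorance_region(0.0)         # collapses to the naive point estimate
lo1, hi1 = ignorance_region(1.0)         # widest interval: no limit on confounding
```

Plotting these intervals as a function of R² reproduces the fan shapes in the top panel of Figure 6; one natural summary (analogous to a robustness value) is the smallest R² at which the interval includes zero.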
Figure 6: Estimated total log revenue contributed by a given actor. Top: worst-case ignorance region for each actor on a case-by-case basis. The blue points correspond to R²_{Y∼U|T} = 0, i.e., the naive estimates. Robustness values can be found in Appendix Table 1. Bottom: treatment effects for candidate models chosen with the L2-minimizing multiple contrast criterion (MCC). The colors correspond to R², the limit on the fraction of outcome variance explained by confounding.

The worst-case ignorance regions depicted in the top panel of Figure 6 correspond to a different choice of γ for each actor. We can also explore the robustness of causal effects under a single model by applying an appropriate MCC. Specifically, we search for a "worst-case" candidate model by finding the sensitivity vector, γ*, that implies the smallest L2 norm of the regression coefficients, τ.
In this conservative model, the minimum L2 norm of the treatment coefficients is 4.4, down from 7.6 for the naive coefficients. In addition, 40 out of 46 actors have coefficients that are smaller in magnitude than the magnitudes of the naive coefficients (Figure 6, bottom). For this candidate model, it turns out that γ*'E[U | T = t] is significantly correlated, albeit weakly, with budget (Spearman's rank correlation = 0.2, with a p-value far below conventional significance thresholds). Thus, the conservative model correctly attributes part of the outcome variation induced by the known excluded confounder to unobserved confounding.

Discussion

In this paper, we introduced a framework for sensitivity analysis with multiple treatments which provides further context to the growing literature on the challenges of inference in this setting. Unlike previous work, we emphasize the importance of carefully defined estimands and show that bounds on the magnitude of confounding bias depend on the particular estimands of interest. Our work also provides a practical solution to characterizing and calibrating the robustness of causal effects across multiple treatments in the presence of unobserved confounding. Code to replicate all analyses is available (Zheng, 2021b) and an R package implementing our methodology is also available and in active development (Zheng, 2021a).

There are several interesting generalizations of our proposed approach, many of which center on assumptions about the copula. For example, in many contexts we may be interested in accounting for potential treatment–confounder interactions, in which case the copula, c(F_{Y|t}(y), F_{U|t}(u) | T = t), will vary with t.
Likewise, as noted, our approach can be applied with non-Gaussian copulas and alternative latent variable models (e.g., latent class models), but calibration and model specification remain a challenge. Generalizations based on joint inference of the treatment and outcome models should also be explored. In practice, causal effect estimates may be overdispersed about the true effects due to a combination of sampling variance and unobserved confounding. Joint inference might be especially useful for accounting for both estimation uncertainty and uncertainty due to unobserved confounding. This is particularly important for the multiple contrast criteria which, as described, do not incorporate parameter uncertainty into the objective function. A simple solution in a Bayesian analysis would be to characterize posterior uncertainty by applying the criteria to each MCMC sample of the naive causal effect estimates. A more thorough exploration of the interplay between shrinkage estimation (in the classical sense) and MCC shrinkage to adjust for confounding bias under the "small effects" hypothesis would be interesting in this context.

Finally, we note that there are many promising directions for incorporating additional constraints into the calibration criteria. One direction of particular interest is to follow the line of research on negative control exposures (NCEs) (Shi et al., 2020). A set of NCEs would be a subset of causes known a priori to have zero (or bounded) causal effect on the outcome, for example genes in a GWAS which are known to be causally unrelated to a particular phenotype. Incorporating these assumptions would further constrain the ignorance regions for particular causal contrasts of interest. These strategies are closely related to the multivariate calibration procedure that we propose. For example, Miao et al.
(2020) describe a procedure for identifying the treatment effects when over half of the treatments are assumed to have no causal effect on the outcome. In Equation 42, this would be analogous to the case in which the criterion ω is the L0 norm, the number of non-zero effects. It is also worth further exploring the relationship between inference with multiple treatments and inference with multiple outcomes. Additional structure in the correlation between outcomes could further constrain the causal ignorance regions, under the right set of assumptions. As with NCEs, negative control outcomes (NCOs) could be applied to calibrate the sensitivity analysis. We leave this to future work.

References
Bica, I., A. M. Alaa, C. Lambert, and M. van der Schaar (2020). From real-world patient data to individualized treatment effects using machine learning: Current and future methods to address underlying challenges. Clinical Pharmacology & Therapeutics.

Boyle, E. A., Y. I. Li, and J. K. Pritchard (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169(7), 1177–1186.

Cinelli, C., J. Ferwerda, and C. Hazlett (2020). sensemakr: Sensitivity analysis tools for OLS in R and Stata. Submitted to the Journal of Statistical Software.

Cinelli, C. and C. Hazlett (2019). Making sense of sensitivity: extending omitted variable bias. Journal of the Royal Statistical Society Series B (Statistical Methodology).

Cinelli, C., D. Kumor, B. Chen, J. Pearl, and E. Bareinboim (2019). Sensitivity analysis of linear structural causal models. In ICML.

Cornfield, J., W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. J. Nat. Cancer Inst 22, 173–203.

D'Amour, A. (2019a). Comment: Reflections on the deconfounder. Journal of the American Statistical Association 114(528), 1597–1601.

D'Amour, A. (2019b). On multi-cause approaches to causal inference with unobserved confounding: Two cautionary failure cases and a promising alternative. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3478–3486.

Daniels, M. J. and J. W. Hogan (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. CRC Press.

Dorie, V., M. Harada, N. B. Carnegie, and J. Hill (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine 35(20), 3453–3470.

Everett, B. (2013). An Introduction to Latent Variable Models. Springer Science & Business Media.

Franks, A., A. D'Amour, and A. Feller (2019). Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association (just-accepted), 1–38.

Gastwirth, J. L., A. M. Krieger, and P. R. Rosenbaum (1998). Dual and simultaneous sensitivity analysis for matched pairs. Biometrika 85(4), 907–920.

Gavish, M. and D. L. Donoho (2014). The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory 60(8), 5040–5053.

Ghosh, P., M. S. M. Sajjadi, A. Vergari, M. Black, and B. Scholkopf (2020). From variational to deterministic autoencoders. In International Conference on Learning Representations.

Gopalan, P., J. M. Hofman, and D. M. Blei (2013). Scalable recommendation with Poisson factorization. arXiv preprint arXiv:1311.1704.

Greenland, S. (1996). Basic methods for sensitivity analysis of biases. International Journal of Epidemiology 25(6), 1107–1116.

Grimmer, J., D. Knox, and B. M. Stewart (2020). Naive regression requires weaker assumptions than factor models to adjust for multiple cause confounding. arXiv preprint arXiv:2007.12702.

Gustafson, P., L. C. McCandless, et al. (2018). When is a sensitivity parameter exactly that? Statistical Science 33(1), 86–95.

Hao, W., M. Song, and J. D. Storey (2015). Probabilistic models of genetic variation in structured populations applied to global human studies. Bioinformatics 32(5), 713–721.

Horn, R. (1985). Matrix Analysis. Cambridge University Press.

Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review 93(2), 126–132.

Kaggle (2017, Sep). TMDB 5000 movie dataset. Data retrieved from Kaggle.

Kong, D., S. Yang, and L. Wang (2019). Multi-cause causal inference with unmeasured confounding and binary outcome. arXiv preprint arXiv:1907.13323.

Lechner, M. (1999, December). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. IZA Discussion Papers 91, Institute of Labor Economics (IZA).

Lopez, M. J., R. Gutman, et al. (2017). Estimation of causal effects with multiple treatments: a review and new ideas. Statistical Science 32(3), 432–454.

Lopez, R., P. Boyeau, N. Yosef, M. I. Jordan, and J. Regier (2020). Decision-making with auto-encoding variational Bayes. arXiv preprint arXiv:2002.07217.

Louizos, C., U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pp. 6446–6456.

Mardia, K. V., J. T. Kent, and J. M. Bibby (1980). Multivariate Analysis. Academic Press.

McCulloch, R., R. Sparapani, R. Gramacy, C. Spanbauer, and M. Pratola (2018). BART: Bayesian Additive Regression Trees. R package version 1.9.

Miao, W., Z. Geng, and E. J. Tchetgen Tchetgen (2018). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika 105(4), 987–993.

Miao, W., W. Hu, E. L. Ogburn, and X. Zhou (2020). Identifying effects of multiple treatments in the presence of unmeasured confounding.

Minka, T. P. (2001). Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems, pp. 598–604.

Nelsen, R. B. (2007). An Introduction to Copulas. Springer Science & Business Media.

Ogburn, E. L., I. Shpitser, and E. J. T. Tchetgen (2019). Comment on "Blessings of multiple causes". Journal of the American Statistical Association 114(528), 1611–1615.

Ogburn, E. L., I. Shpitser, and E. J. T. Tchetgen (2020). Counterexamples to "The blessings of multiple causes" by Wang and Blei. arXiv preprint arXiv:2001.06555.

Pearl, J. (2009). Causality. Cambridge University Press.

Price, A. L., N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

Pu, Y., Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin (2016). Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pp. 2352–2360.

Robins, J. M. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine 16(1), 21–37.

Robins, J. M., A. Rotnitzky, and D. O. Scharfstein (2000). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 1–94. Springer.

Rosenbaum, P. R. and D. B. Rubin (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society. Series B (Methodological), 212–218.

Rubin, D. B. (1980). Comment. Journal of the American Statistical Association 75(371), 591–593.

Shi, X., W. Miao, and E. T. Tchetgen (2020). A selective review of negative control methods in epidemiology. arXiv preprint arXiv:2009.05641.

Song, M., W. Hao, and J. D. Storey (2015). Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47(5), 550–554.

Tipping, M. E. and C. M. Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3), 611–622.

VanderWeele, T. J. and O. A. Arah (2011). Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology, 42–52.

VanderWeele, T. J. and P. Ding (2017, July). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine 167(4), 268.

VanderWeele, T. J., B. Mukherjee, and J. Chen (2012). Sensitivity analysis for interactions under unmeasured confounding. Statistics in Medicine 31(22), 2552–2564.

VanderWeele, T. J. and I. Shpitser (2013). On the definition of a confounder. Annals of Statistics 41(1), 196.

Vansteelandt, S., E. Goetghebeur, M. G. Kenward, and G. Molenberghs (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica 16(3), 953–979.

Veitch, V. and A. Zaveri (2020). Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved confounding. arXiv preprint arXiv:2003.01747.

Wang, Y. and D. M. Blei (2018, May). The blessings of multiple causes. arXiv e-prints, arXiv:1805.06826.

Zhang, B. and E. J. T. Tchetgen (2019). A semiparametric approach to model-based sensitivity analysis in observational studies. arXiv preprint arXiv:1910.14130.

Zhang, L., Y. Wang, A. Ostropolets, J. J. Mulgrave, D. M. Blei, and G. Hripcsak (2019). The medical deconfounder: Assessing treatment effects with electronic health records. arXiv preprint arXiv:1904.02098.

Zheng, J. (2021a). CopSens: Copula-based sensitivity analysis method for unobserved confounding in multi-treatment inference. https://github.com/JiajingZ/CopSens.

Zheng, J. (2021b). Replication code for "Copula-based sensitivity analysis for observational multi-treatment causal inference". https://github.com/JiajingZ/CopulaSensitivity.
A Theory
A.1 General Contrast Estimation Algorithm
Algorithm 2: Marginal Contrast Estimation for Arbitrary Copulas

  function ComputeMean(t, ψ):
      for k = 1, 2, ..., M:
          Sample y_k from f(y | t)
          for i = 1, 2, ..., n:
              for j = 1, 2, ..., N:
                  Sample u_ij from f(u | t_i)
                  Compute c_ij ← c_ψ(y_k, u_ij | t)
          Compute w_k ← (1/(nN)) ∑_{ij} c_ij
      return (1/M) ∑_k ν(y_k) w_k
  end function

  Return τ(ComputeMean(t₁, ψ), ComputeMean(t₂, ψ))

A.2 Derivation of Algorithm 1
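Algorithm 2 can be transcribed directly into a small Monte Carlo routine; the samplers and copula density below are user-supplied stand-ins, and the sanity check uses the independence copula (density identically one), for which the routine reduces to plain Monte Carlo over the outcome model:

```python
import numpy as np

def compute_mean(nu, sample_y, sample_u_given, copula_density, ts, M=2000, N=20):
    """Monte Carlo sketch of ComputeMean in Algorithm 2.

    nu               -- functional of interest, nu(y)
    sample_y()       -- draws from the observed outcome model f(y | t)
    sample_u_given   -- sample_u_given(t_i) draws from f(u | t_i)
    copula_density   -- c_psi(y, u | t), the copula density term
    ts               -- the observed treatment vectors t_1, ..., t_n
    """
    total = 0.0
    for _ in range(M):
        y = sample_y()
        c = [copula_density(y, sample_u_given(ti)) for ti in ts for _ in range(N)]
        total += nu(y) * np.mean(c)      # w_k = (1 / nN) * sum_ij c_ij
    return total / M

# Sanity check: independence copula, standard normal outcome, nu(y) = y^2
rng = np.random.default_rng(7)
est = compute_mean(nu=lambda y: y ** 2,
                   sample_y=lambda: rng.normal(),
                   sample_u_given=lambda t: rng.normal(),
                   copula_density=lambda y, u: 1.0,
                   ts=[0.0], M=4000, N=1)
```

With the independence copula all weights equal one, so est should be close to E[Y²] = 1 for a standard normal outcome model.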
Given Equations 10 and 11, we can write

  E[v(Y) | do(t)] = ∫∫ v(y) f(y | ỹ) w_ψ(ỹ, t) f(ỹ | t) dỹ dy,   (50)

where w_ψ(ỹ, t) ≈ (1/|𝒯|) ∑_{t_i ∈ 𝒯} [ ∫ c_ψ(F_{Ỹ|t}(ỹ), F_{U|t}(u) | t) f(u | t_i) du ]. To verify Algorithm 1, we only need to show that

  ∫ f(ỹ | t, u) f(u | t_i) du ∼ N(γ'(µ_{u|t_i} − µ_{u|t}), 1),   (51)

where f(ỹ | t, u) = f(ỹ | t) c_ψ(F_{Ỹ|t}(ỹ), F_{U|t}(u) | t). According to Equations 13 and 14, we know that

  f(u | t_i) ∼ N(µ_{u|t_i}, Σ_{u|t}),   (52)
  f(ỹ | t, u) ∼ N(γ'(u − µ_{u|t}), σ²_{y|t,u}).   (53)

By integrating out U, we have

  ∫ f(ỹ | t, u) f(u | t_i) du = (2π(σ²_{y|t,u} + γ'Σ_{u|t}γ))^{−1/2} exp{ −(ỹ − γ'(µ_{u|t_i} − µ_{u|t}))² / (2(σ²_{y|t,u} + γ'Σ_{u|t}γ)) },   (54)

where σ²_{y|t,u} + γ'Σ_{u|t}γ = 1.

A.3 Proof of Theorem 1
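The unit-variance Gaussian in Equation 54 can be checked by direct simulation: draw u from f(u | t_i), then ỹ from f(ỹ | t, u), and compare the resulting moments. The parameter values below are arbitrary, subject only to the constraint σ²_{y|t,u} + γ'Σ_{u|t}γ = 1:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_draws = 3, 200_000
gamma = np.array([0.4, -0.3, 0.2])
Sigma_ut = 0.5 * np.eye(m)                       # conditional confounder covariance
sigma2_ytu = 1.0 - gamma @ Sigma_ut @ gamma      # enforce total variance 1
mu_t = np.zeros(m)                               # mu_{u|t}
mu_ti = np.array([1.0, -1.0, 0.5])               # mu_{u|t_i}

# u ~ f(u | t_i) = N(mu_{u|t_i}, Sigma_{u|t}); then y ~ f(y_tilde | t, u)
u = mu_ti + rng.normal(size=(n_draws, m)) @ np.linalg.cholesky(Sigma_ut).T
y = (u - mu_t) @ gamma + rng.normal(scale=np.sqrt(sigma2_ytu), size=n_draws)

# Equation 54: the marginal should be N(gamma'(mu_{u|t_i} - mu_{u|t}), 1)
assert abs(y.mean() - gamma @ (mu_ti - mu_t)) < 0.02
assert abs(y.var() - 1.0) < 0.02
```

The mean of the mixture shifts with the conditional confounder mean while the total variance stays at one, which is exactly why the sensitivity analysis operates through mean shifts γ'(µ_{u|t_i} − µ_{u|t}).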
Theorem 1.
Assume model 13–15. Let [ψ_T] = { ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊ }, where S₊ is the space of symmetric positive definite matrices. Then [ψ_T] is a causal equivalence class.

Proof. The intervention distribution for ỹ is defined as

  f_ψ(ỹ | do(t)) = ∫ [ ∫ f_{ψ_Y}(ỹ | t, u) f_{ψ_T}(u | t̃) du ] f(t̃) dt̃,   (55)

where ψ_Y = γ and ψ_T = {µ_{u|t}, Σ_{u|t}}. Then ∫ f_γ(ỹ | t, u) f_{ψ_T}(u | t̃) du ∼ N(γ'(µ_{u|t̃} − µ_{u|t}), 1) for any γ such that γ'Σ_{u|t}γ ≤ 1 (see Equation 54). Let ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} ∈ [ψ_T], where A ∈ S₊ is a positive definite matrix, and assume ψ̃_Y = γ̃. Then

  ∫ f_γ̃(ỹ | t, u) f_{ψ̃_T}(u | t̃) du ∼ N(γ̃'A(µ_{u|t̃} − µ_{u|t}), 1).   (56)

Let γ̃ = A⁻¹γ, a bijective mapping from γ to γ̃. For any γ and positive definite A, we have γ̃'AΣ_{u|t}Aγ̃ = γ'Σ_{u|t}γ ≤ 1, so that γ̃ is a valid copula parameter. In addition, ∫ f_γ(ỹ | t, u) f_{ψ_T}(u | t̃) du = ∫ f_γ̃(ỹ | t, u) f_{ψ̃_T}(u | t̃) du, which implies f_{γ,ψ_T}(ỹ | do(t)) = f_{γ̃,ψ̃_T}(ỹ | do(t)). Since Y is a deterministic function of Ỹ, this implies f_{γ,ψ_T}(y | do(t)) = f_{γ̃,ψ̃_T}(y | do(t)). Therefore, [ψ_T] is a causal equivalence class.

A.4 Proof of Theorem 2
Theorem 2.
Suppose that the observed data is generated by model 17–19. When there are k treatments with 0 < m < k, then ψ_T is identified up to the causal equivalence class [ψ_T] = { ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊ }. When there is a single treatment (k = 1) or m = k confounders, then ψ_T is not identifiable up to causal equivalence class.

Proof. The sample covariance matrix of the treatments is a consistent estimator for BΣ_uB' + σ²_{t|u}I_k, the covariance matrix of T. As long as 0 < m < k, the k-th (smallest) eigenvalue is a consistent estimator for σ²_{t|u}. BΣ_uB' is identified by the first m eigenvectors and eigenvalues of the sample covariance matrix, and the span of the first m eigenvectors is a consistent estimator for the span of B. Under model 17–19, the conditional distribution of the confounder U is

  f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}),   (57)

where µ_{u|t} := Σ_uB'(BΣ_uB' + σ²_{t|u}I_k)⁻¹t and Σ_{u|t} := Σ_u − Σ_uB'(BΣ_uB' + σ²_{t|u}I_k)⁻¹BΣ_u, and the intervention distribution is

  f_{γ,B,Σ_u}(y | do(T = t)) ∼ N((τ_naive − (BΣ_uB' + σ²_{t|u}I_k)⁻¹BΣ_uγ)'t, σ²_{y|t,u} + γ'Σ_uγ),   (58)

where all m-vectors γ which satisfy γ'Σ_{u|t}γ ≤ σ²_{y|t} are valid sensitivity parameters.

Let B̃ = BA and Σ̃_u = A⁻¹Σ_uA⁻ᵀ for an arbitrary positive definite matrix A, so that B̃Σ̃_uB̃' + σ²_{t|u}I_k = BΣ_uB' + σ²_{t|u}I_k. Then the observed treatments are consistent with T = B̃Ũ + ε_{t|u}, where Ũ = A⁻¹U ∼ N_m(0, Σ̃_u). Hence, the conditional confounder distribution is

  f(ũ | t) ∼ N_m(µ̃_{u|t}, Σ̃_{u|t}),   (59)

where µ̃_{u|t} = A⁻¹µ_{u|t} and Σ̃_{u|t} = A⁻¹Σ_{u|t}A⁻ᵀ.
With B̃ and Σ̃_u, the intervention distribution can be alternatively expressed as

  f_{γ̃,B̃,Σ̃_u}(y | do(T = t)) ∼ N((τ_naive − (B̃Σ̃_uB̃' + σ²_{t|u}I_k)⁻¹B̃Σ̃_uγ̃)'t, σ²_{y|t,u} + γ̃'Σ̃_uγ̃),   (60)

with any γ̃ satisfying γ̃'Σ̃_{u|t}γ̃ ≤ σ²_{y|t} a valid sensitivity parameter. Let γ̃ = A'γ. If γ'Σ_{u|t}γ ≤ σ²_{y|t}, then γ̃'Σ̃_{u|t}γ̃ = γ'Σ_{u|t}γ ≤ σ²_{y|t} and f_{γ̃,B̃,Σ̃_u}(y | do(T = t)) = f_{γ,B,Σ_u}(y | do(T = t)). Therefore, the causal equivalence class characterized by ψ_T = {µ_{u|t}, Σ_{u|t}} is identifiable when 0 < m < k.

A.5 Proof of Theorems 3 and 4
Proof of Theorem 3

Theorem 3.
Suppose that the observed data is generated by model 17–19. Then for all γ satisfying Assumptions 1 and 2,

  γ'Σ_{u|t}γ ≤ σ²_{y|t} R²_{Y∼U|T}.   (61)

For any given ∆t, we have

  Bias²_{∆t} ≤ σ²_{y|t} R²_{Y∼U|T} ‖Σ^{−1/2}_{u|t} µ_{u|∆t}‖²,   (62)

and the bound is achieved when γ is colinear with Σ⁻¹_{u|t} µ_{u|∆t}.

Proof. Under model 17–19, the variance of the observed outcome equals

  σ²_{y|t} := Var(Y | T) = σ²_{y|t,u} + γ'(I_m − B'(BB' + σ²_{t|u}I_k)⁻¹B)γ = σ²_{y|t,u} + γ'Σ_{u|t}γ,   (63)

where γ'Σ_{u|t}γ corresponds to the confounding variation and σ²_{y|t,u} to the non-confounding variation in the residual of the observed outcome. Hence, the fraction of confounding variation in the residual of Y, R²_{Y∼U|T}, can be expressed in terms of Equation 31, which produces a constraint on γ (Equation 32): the confounding variation γ'Σ_{u|t}γ cannot be larger than σ²_{y|t}R²_{Y∼U|T} for a given level of R²_{Y∼U|T}.

Let

  Z := Σ^{1/2}_{u|t} γ;   (64)

then the omitted variable bias in Equation 26 can be written as Bias_{∆t} = Z'Σ^{−1/2}_{u|t}µ_{u|∆t}, where Z'Z ≤ σ²_{y|t}R²_{Y∼U|T}, implied by inequality 32. Therefore,

  Bias²_{∆t} = Z'Σ^{−1/2}_{u|t}µ_{u|∆t} µ'_{u|∆t}Σ^{−1/2}_{u|t}Z   (65)
             ≤ σ²_{y|t} R²_{Y∼U|T} ‖Σ^{−1/2}_{u|t}µ_{u|∆t}‖²,   (66)

where the bound is reached when Z is colinear with Σ^{−1/2}_{u|t}µ_{u|∆t}, i.e., γ is colinear with Σ⁻¹_{u|t}µ_{u|∆t} via the relationship defined in Equation 64.

Corollary 3.1.
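The bound in Theorem 3, and its attainment when γ ∝ Σ⁻¹_{u|t}µ_{u|∆t}, can be verified numerically using Bias_{∆t} = γ'µ_{u|∆t}; the covariance and contrast values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4
Sigma_ut = np.diag([0.5, 0.3, 0.2, 0.1])     # conditional confounder covariance
mu_dt = rng.normal(size=m)                   # mu_{u|dt} for an arbitrary contrast
sigma2_y, R2 = 1.0, 0.6
budget = sigma2_y * R2                       # constraint: gamma' Sigma_{u|t} gamma <= budget

bound_sq = budget * (mu_dt @ np.linalg.solve(Sigma_ut, mu_dt))   # Equation 62

# Feasible gammas on the constraint boundary never exceed the bound ...
for _ in range(1000):
    g = rng.normal(size=m)
    g *= np.sqrt(budget / (g @ Sigma_ut @ g))
    assert (g @ mu_dt) ** 2 <= bound_sq + 1e-9

# ... and the colinear direction gamma* ~ Sigma^{-1} mu attains it exactly
g_star = np.linalg.solve(Sigma_ut, mu_dt)
g_star *= np.sqrt(budget / (g_star @ Sigma_ut @ g_star))
assert np.isclose((g_star @ mu_dt) ** 2, bound_sq)
```

This is the classical Cauchy–Schwarz picture: among all γ with a fixed amount of confounding variation, the bias is maximized in the direction of the standardized conditional mean shift.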
Let d₁ be the largest singular value of B. For all ∆t with ‖∆t‖ = 1, the squared bias is bounded by

  Bias²_{∆t} ≤ (d₁² / (d₁² + σ²_{t|u})) (σ²_{y|t} / σ²_{t|u}) R²_{Y∼U|T},   (67)

with equality when ∆t = u₁, the first left singular vector of B. When ∆t ∈ Null(B'), the naive estimate is unbiased, that is, PATE_{∆t} = τ'_naive ∆t.

Proof. Suppose that the matrix B has the singular value decomposition B = UDV', where the diagonal entries of D are the singular values of B in descending order. Then we can write

  µ_{u|∆t} = VD(D² + σ²_{t|u}I_s)⁻¹U'∆t   (68)

and

  Σ⁻¹_{u|t} = V[I_s + D²/σ²_{t|u}]V'.   (69)

Plugging Equations 68 and 69 into the result of Theorem 3 gives

  Bias²_{∆t} ≤ (σ²_{y|t}/σ²_{t|u}) R²_{Y∼U|T} ‖VD(σ²_{t|u}I_s + D²)^{−1/2}U'∆t‖²,   (70)

where, by the Rayleigh quotient (Horn, 1985), the squared L2 norm reaches its maximum, d₁²/(d₁² + σ²_{t|u}), when ∆t equals the first column of U, i.e., the first left singular vector of B. Therefore,

  Bias²_{∆t} ≤ (d₁²/(d₁² + σ²_{t|u})) (σ²_{y|t}/σ²_{t|u}) R²_{Y∼U|T}.   (71)

Proof of Theorem 4

Theorem 4.
Assume model 13–15 with Gaussian outcomes. If Σ_{u|t} is non-invertible, then Bias_{t₁,t₂} is bounded if and only if µ_{u|t₁} − µ_{u|t₂} is in the row space of Σ_{u|t}. When bounded,

  Bias²_{t₁,t₂} ≤ σ²_{y|t} R²_{Y∼U|T} ‖(Σ†_{u|t})^{1/2}(µ_{u|t₁} − µ_{u|t₂})‖²,

where Σ†_{u|t} is the pseudo-inverse of Σ_{u|t}.

Proof. Under model 13–15, we have γ'Σ_{u|t}γ + σ²_{y|t,u} = 1, where γ'Σ_{u|t}γ corresponds to the confounding variation and σ²_{y|t,u} to the non-confounding variation in the Gaussianized Y. Therefore, the confounding variation γ'Σ_{u|t}γ should not be larger than a given level of R²_{Y∼U|T}:

  γ'Σ_{u|t}γ ≤ R²_{Y∼U|T},   (72)

where R²_{Y∼U|T} denotes the fraction of confounding variation in the residual variance of Ỹ conditional on T. (Here R²_{Y∼U|T} coincides with R²_{Y∼U}, but we use the notation R²_{Y∼U|T} for consistency.) Let

  Z := Σ^{1/2}_{u|t} γ.   (73)
Then the squared omitted variable bias satisfies
$$\mathrm{Bias}^2_{t_1,t_2} \le \sigma^2_{y|t}\,R^2_{Y\sim U|T}\,\big\|(\Sigma^{\dagger}_{u|t})^{1/2}(\mu_{u|t_2} - \mu_{u|t_1})\big\|^2 \tag{78}$$
$$= \sigma^2_{y|t}\,R^2_{Y\sim U|T}\,\Big\|Q(\Lambda^{\dagger})^{1/2}Q^T\sum_{j=1}^{p}a_jq_j\Big\|^2 \tag{79}$$
$$= \sigma^2_{y|t}\,R^2_{Y\sim U|T}\sum_{i=1}^{s}\Big(\sum_{j=1}^{p}a_j\lambda_j^{-1/2}Q_{ij}\Big)^2, \tag{80}$$
which is finite. Here $Q_{ij}$ denotes the element in the $i$th row and $j$th column of $Q$, and $\Lambda^{\dagger}$ is the pseudo-inverse of $\Lambda$, obtained by taking the reciprocal of each non-zero diagonal element and leaving the zeros in place.

On the other hand, suppose that $\mathrm{Bias}_{t_1,t_2}$ is bounded but $\mu_{u|t_2} - \mu_{u|t_1}$ is not in the row space of $\Sigma_{u|t}$, say $\mu_{u|t_2} - \mu_{u|t_1} = q_s$. Since $\lambda_s = 0$, we have $\lambda_s^{-1/2} = \infty$, so the bound on $\mathrm{Bias}_{t_1,t_2}$ is infinite; indeed, $\gamma$ may be taken arbitrarily large in the $q_s$ direction without violating constraint 72, so the bias itself is unbounded. This contradicts the assumption that $\mathrm{Bias}_{t_1,t_2}$ is bounded.

Therefore, $\mathrm{Bias}_{t_1,t_2}$ is bounded if and only if $\mu_{u|t_2} - \mu_{u|t_1}$ is in the row space of $\Sigma_{u|t}$.

B Modeling Choice Details
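Before turning to modeling choices, note that the row-space condition from the proof above is easy to check numerically via the pseudo-inverse projector. The sketch below is purely illustrative: the rank-deficient $\Sigma_{u|t}$ is synthetic, and the values of $\sigma^2_{y|t}$ and $R^2_{Y\sim U|T}$ are arbitrary stand-ins, not fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic rank-2 posterior covariance Sigma_{u|t} in R^3 (non-invertible).
A = rng.normal(size=(3, 2))
Sigma = A @ A.T
Sig_pinv = np.linalg.pinv(Sigma)
P_row = Sigma @ Sig_pinv          # orthogonal projector onto the row space

mu_in = Sigma @ rng.normal(size=3)          # in the row space by construction
null_vec = np.linalg.svd(Sigma)[0][:, -1]   # direction with zero eigenvalue
mu_out = mu_in + null_vec                   # has a component outside the row space

def in_row(v):
    # v lies in the row space iff projecting it changes nothing
    return np.allclose(P_row @ v, v, atol=1e-8)

print(in_row(mu_in), in_row(mu_out))  # -> True False

# When the condition holds, the squared bias bound (78) is finite:
#   sigma^2_{y|t} * R^2_{Y~U|T} * (mu_diff' Sigma^+ mu_diff)
sigma2_y, R2 = 1.0, 0.5   # hypothetical values for illustration
bound_sq = sigma2_y * R2 * (mu_in @ Sig_pinv @ mu_in)
```

For `mu_out`, the projector test fails, matching the theorem: no finite bound exists in that case.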
B.1 Identification and Inference in the Factor Model
Here, we briefly elaborate on identifiability of the probabilistic principal components model, which is a prerequisite for our multi-cause sensitivity analysis. Identifiability under various factor model assumptions is well studied and has a long history in the literature (Mardia et al., 1980; Everett, 2013). In the specific probabilistic principal components model 18, Tipping and Bishop (1999) provide a maximum likelihood solution for inferring the latent confounder parameters conditional on $m$. Many procedures are available for selecting the appropriate value of $m$, using for example Bayesian model selection techniques (Minka, 2001) or large $p$, small $n$ asymptotics (Gavish and Donoho, 2014).

The change of variables described at the end of Subsection 5.1 further elucidates important situations in which we cannot bound the omitted variable bias due to non-identifiability of the factor model. Again, we focus on the rotated treatments $\tilde{T} \sim N(0, \Delta + \sigma^2_{t|u}I_k)$ and highlight two simple situations in which we cannot bound the causal effects. First, when $B$ has rank $k$, i.e., there exist $m = k$ independent confounders, $\Delta$ has no zero entries on the diagonal and thus we can identify neither $\Delta$ nor $\mathrm{Cov}(\epsilon_{t|u}) = \sigma^2_{t|u}I_k$, only their sum. Second, if $\mathrm{Cov}(\epsilon_{t|u})$ is an unknown arbitrary diagonal matrix (as opposed to a matrix proportional to the identity), then $\mathrm{Cov}(\epsilon_{t|u})$ is not distinguishable from $\Delta$. In both of these cases, the worst-case bias is unbounded, since the non-confounding variation in the treatment assignment, $\mathrm{Cov}(\epsilon_{t|u})$, can be arbitrarily small. In such settings, we can still apply approaches used in single-cause sensitivity analysis by specifying both $\Psi_Y$ and $\Psi_T$; when the factor model is not identifiable, $\Psi_T$ must be treated as a sensitivity parameter, e.g. by bounding the fraction of treatment variation due to confounding, $R^2_{T\sim U}$.

B.2 Confounder Inference with Variational Autoencoders
Probabilistic principal component analysis should only be used when the treatments are approximately Gaussian. For binary and other general treatment distributions, more sophisticated probabilistic latent variable models are required; examples include logistic factor analysis (Hao et al., 2015) and Poisson factor analysis methods for count data (Gopalan et al., 2013). Unfortunately, these models imply posteriors which are non-Gaussian and heteroskedastic, violating Assumptions 4 and 5.

As such, for general treatment distributions, our approach is to infer a conditional Gaussian latent variable model using a variational autoencoder (VAE). VAEs have been extremely popular in machine learning, in particular for generating low-dimensional representations of complex inputs like images (Pu et al., 2016), but more recently have been used in scientific and decision-making applications (Lopez et al., 2020) and in applications to causal inference (Louizos et al., 2017). A VAE consists of a prior distribution, $f(u)$, for the low-dimensional latent variables, a stochastic encoder, and a stochastic decoder. In our application, the inferred stochastic decoder, $\hat{f}_\theta(t|u)$, is a non-linear map from latent confounders to a distribution over causes. Together, the prior distribution for $u$ and the decoder imply a posterior confounder distribution, $\hat{f}(u|t)$. In practice, inference for the true posterior is intractable, and so a variational approximation, called the encoder, $q_\phi(u|t)$, is used in place of the true posterior. Typically the encoder is chosen to be a normal distribution with mean and variance which are non-linear functions of the input, $q_\phi = N(\mu_\phi(t), \sigma^2_\phi(t))$. A crucial question is how well the Gaussian encoder approximates the true posterior; improving the variational approximation to the true latent variable posterior is an area of active research.
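One way to correct an imperfect Gaussian encoder is self-normalized importance sampling with the encoder as the proposal. The following is a minimal sketch in a conjugate one-dimensional toy model where the exact posterior is available in closed form; the "decoder," the deliberately crude Gaussian "encoder," and all parameter values are hypothetical illustrations, not the fitted VAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian "decoder": t | u ~ N(b*u, sig2), with prior u ~ N(0, 1).
b, sig2 = 2.0, 0.5
t_obs = 1.5

# Exact posterior moments (available only in this conjugate toy model).
post_var = sig2 / (b**2 + sig2)
post_mean = b * t_obs / (b**2 + sig2)

# Crude Gaussian "encoder" q(u|t): biased mean, inflated variance,
# mimicking an imperfect variational approximation.
q_mean, q_sd = 0.8 * post_mean, 2.0 * np.sqrt(post_var)

def log_joint(u):
    # log f(u) + log f(t_obs | u), up to additive constants
    return -0.5 * u**2 - 0.5 * (t_obs - b * u) ** 2 / sig2

def log_q(u):
    return -0.5 * ((u - q_mean) / q_sd) ** 2 - np.log(q_sd)

# Self-normalized importance sampling with the encoder as proposal.
u = rng.normal(q_mean, q_sd, size=200_000)
log_w = log_joint(u) - log_q(u)
w = np.exp(log_w - log_w.max())
w /= w.sum()
snis_mean = np.sum(w * u)
snis_var = np.sum(w * (u - snis_mean) ** 2)
print(snis_mean, post_mean)  # weighted mean should be close to the exact mean
```

With a proposal that overlaps the posterior, the weighted moments recover the exact posterior mean and variance despite the encoder's bias; in practice the same weights, computed from the VAE prior and decoder, reweight encoder samples of $u$.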
In this work, we follow a common strategy of using the encoder learned by the VAE as the proposal distribution in an importance sampler (Lopez et al., 2020). Specifically, we apply a variant of the Constant-Variance Variational Autoencoder (CV-VAE) (Ghosh et al., 2020) to infer the conditional confounder distribution, $f(u|t) \sim N(\mu_{u|t}, \Sigma_{u|t})$, in which $\Sigma_{u|t}$ does not depend on the level of $t$. We use importance sampling to improve estimates of the conditional mean, $\mu_{u|t}$, and posterior variance, $\Sigma_{u|t}$. While this approach only yields an approximation to the true posterior, we demonstrate its practical effectiveness in Sections 7 and 8.

B.3 Binary Outcomes

For binary outcomes with the risk ratio estimand,
$$RR_{t,\bullet} = \sum_{t_i\in\mathcal{T}}\Phi\big(\Phi^{-1}(\mu_{y|t}) + \gamma^T(\mu_{u|t_i} - \mu_{u|t})\big)\Big/\Pr(Y=1), \tag{81}$$
which implies that
$$RR_{t_1,t_2} = \sum_{t_i\in\mathcal{T}}\Phi\big(\Phi^{-1}(\mu_{y|t_1}) + \gamma^T(\mu_{u|t_i} - \mu_{u|t_1})\big)\Big/\sum_{t_i\in\mathcal{T}}\Phi\big(\Phi^{-1}(\mu_{y|t_2}) + \gamma^T(\mu_{u|t_i} - \mu_{u|t_2})\big), \tag{82}$$
where $\gamma^T\Sigma_{u|t}\gamma \le \sigma^2_{y|t}R^2_{Y\sim U|T}$. We can numerically explore values of $RR_{t_1,t_2}$ within the valid domain of $\gamma$, and calculate the corresponding implicit partial R-squared as $R^2_{Y\sim U|T} = \gamma^T\Sigma_{u|t}\gamma/\sigma^2_{y|t}$. To calculate the robustness value, we only need to find the value of $R^2_{Y\sim U|T}$ for which the corresponding $RR_{t_1,t_2} = 1$. Notably, $RR_{t_1,t_2}$ is not monotone in $R^2_{Y\sim U|T}$, since the variance of the intervention distribution also depends on $\gamma$. This is evident in the simulation in Section 7, where we fit the observed outcome model by probit regression and the valid range for scalar $\gamma$ is $[-\sigma^{-1}_{u|t}, \sigma^{-1}_{u|t}]$. We visualize the non-monotone relationship between $RR_{t_1,t_2}$ and $R^2_{Y\sim U|T}$ in Figure 7.

Figure 7: $RR_{t_1,t_2}$ is non-monotone in $R^2_{Y\sim U|T}$.
Positive values of $R$ indicate that $U$ is positively correlated with $\tilde{Y}$, and negative values of $R$ indicate that $U$ is negatively correlated with $\tilde{Y}$.

C Additional Results
C.1 Additional Results from Simulation in Sparse Effects Setting

[Figure 8: worst-case ignorance regions shown at $R^2_{Y\sim U|T} = 0\%$, $50\%$, and $100\%$.]
Figure 8: Worst-case ignorance regions for 55 randomly chosen null effects (left) and all 45 non-null effects (right), ordered by the magnitude of the true effects in each group. Two red arrows indicate non-null treatments for which the worst-case ignorance region does not cover the true effect. This appears to be due to estimation error in the outcome model, more so than with the VAE.

Figure 9: Change in $E(|\tau_t|)$ for the L1-minimized estimates as a function of $R^2_{Y\sim U|T}$, separated by null and non-null effects. (a) The magnitude of effects decreases with $R^2$, with a larger relative decrease for null contrasts. (b) The relative magnitude of non-null and null effects increases with $R^2$ in general. The magnitude of non-null effects can be as large as 1.9 times that of the null effects when $R^2$ is large.

C.2 Additional Results from the Actor Case Study
Figure 10: (a) Estimated partial $R^2$ for observed confounders using the method described in Section 6.1. Budget is the most dominant variable, explaining significantly more variation in the outcome $Y$. (b) Latent confounder dimension selection, based on the reconstruction loss on the validation set.

Table 1: Robustness Value for Significant Actors

Effect | RV mean (%) | RV limit