Copula-based Sensitivity Analysis for Multi-Treatment Causal Inference with Unobserved Confounding
Jiajing Zheng (UCSB), Alexander D'Amour (Google Research), Alexander Franks (UCSB)
February 19, 2021
Abstract
Recent work has focused on the potential and pitfalls of causal identification in observational studies with multiple simultaneous treatments. On the one hand, a latent variable model fit to the observed treatments can identify essential aspects of the distribution of unobserved confounders. On the other hand, it has been shown that even when the latent confounder distribution is known exactly, causal effects are still not point identifiable. Thus, the practical benefits of latent variable modeling in multi-treatment settings remain unclear. We clarify these issues with a sensitivity analysis method that can be used to characterize the range of causal effects that are compatible with the observed data. Our method is based on a copula factorization of the joint distribution of outcomes, treatments, and confounders, and can be layered on top of arbitrary observed data models. We propose a practical implementation of this approach making use of the Gaussian copula, and establish conditions under which causal effects can be bounded. We also describe approaches for reasoning about effects, including calibrating sensitivity parameters, quantifying robustness of effect estimates, and selecting models which are most consistent with prior hypotheses.
Keywords: Observational studies; multiple treatments; sensitivity analysis; copulas; latent confounders; deconfounder

∗ Jiajing Zheng is a PhD candidate in the Department of Statistics and Applied Probability at the University of California, Santa Barbara ([email protected]). Alexander D'Amour is a Research Scientist at Google Research, Cambridge, MA ([email protected]). Alexander M. Franks is an Assistant Professor of Statistics at the University of California, Santa Barbara ([email protected]). We thank Steve Yadlowsky, Victor Veitch, and Avi Feller for thoughtful comments and discussion.

1 Introduction
Although it is well-established that treatment effects are not generally identifiable in the presence of unobserved confounding, recent work has focused on whether this challenge can be mitigated when there are multiple simultaneous treatments. Intuitively, dependence among multivariate treatments could provide information about latent confounders, which could in turn be leveraged to facilitate causal inference and identification. This intuition has motivated latent variable approaches such as "the deconfounder", a much discussed approach for estimating causal effects for multiple treatments (Wang and Blei, 2018).

Unfortunately, it was shown that this strategy has limited practical applicability for point identification and estimation of causal effects. For example, D'Amour (2019a) and D'Amour (2019b) note the lack of general nonparametric identification in the deconfounder approach, and show that the special cases in which the approach does provide identification correspond to situations where all confounding is already observed. Ogburn et al. (2019) and Ogburn et al. (2020) provide several additional counterexamples and detailed rebuttals to previous theoretical results. Even in the special cases where causal effects are identifiable, Grimmer et al. (2020) demonstrate through a suite of simulations and real-data analyses that the deconfounder cannot consistently outperform naive regression. They conclude by further arguing that the deconfounder assumptions are too strong to be applicable in practice.

These challenges are particularly relevant because similar strategies are used in genomics (Price et al., 2006), computational neuroscience, social science and medicine (Zhang et al., 2019), and time series applications (Bica et al., 2020). Given the practical importance of causal inference with multiple treatments, recent work has focused on stronger identifying assumptions for causal effects in the multi-treatment setting. Miao et al.
(2020) propose identifying assumptions involving instrumental variables and in settings where over half of the treatments are assumed to have a null effect, while Kong et al. (2019) consider identification in a parametric model with binary outcomes.

This literature has revolved around a binary question about point identification: can causal effects be identified or not? Negative answers to this question often run counter to practitioners' intuitions in specific data analyses. In particular, it is intuitive that a latent variable model should provide some helpful information, even if this information is not enough to fully identify causal effects.

To address this issue, we propose that sensitivity analysis—which explores the range of causal effects that are consistent with the observed data in the context of a given problem—can resolve this tension. Specifically, sensitivity analysis can show how much is gained by leveraging latent structure in a given application, even if this (usually) falls short of fully identifying the causal effect of interest. To this end, we propose a sensitivity analysis approach to help practitioners better understand confounding in the multi-treatment setting, focusing on the special case where the conditional distribution of unobserved confounders given treatments is identifiable. To extend sensitivity analysis to the multi-treatment setting, we propose a general copula-based decomposition of standard latent variable–based sensitivity analysis models. This factorization allows us to precisely separate the parts of the model that are, and are not, identified in the multi-treatment setting.

For practical analyses, we propose a specialization of the general decomposition, which specifies a sensitivity model based on invariant Gaussian copulas.
While this Gaussian copula specification only covers a sub-family of sensitivity models expressible in our general formulation, we show that it captures several essential qualitative aspects of confounding in the multi-treatment setting. In this context, we establish that there are important advantages to multi-treatment inference over single-treatment inference for characterizing sensitivity to unobserved confounding. Specifically, under appropriate assumptions, we establish that the number of effective sensitivity parameters is halved in multi-treatment inference and that this implies that the magnitude of causal effects can be bounded.

The paper proceeds as follows. We begin by defining the relevant quantities and notation in Section 2. In Section 3 we describe our basic framework for latent variable sensitivity analysis via a copula factorization. In Section 4 we introduce a special case of the more general approach in which we assume confounder-outcome relationships can be characterized by a Gaussian copula. In Section 5 we provide some theoretical insights into bias and confounding with the Gaussian copula. We discuss sensitivity parameter interpretation, calibration, and measures of robustness in Section 6 and, finally, in Section 7 and Section 8 we demonstrate our approach in simulation and with the movie example analyzed by Wang and Blei (2018) and later reanalyzed by Grimmer et al. (2020).
Let T be a random k-vector of treatment variables, Y be a scalar random outcome of interest, and t and y be realizations of the respective random variables. We let U be a random m-vector denoting unobserved confounders, and x denote any observed pre-treatment variables. We use the do-calculus framework (Pearl, 2009) and let f(y | do(t)) denote the density of y in the population in which we have intervened to assign treatment level t to all units. In general, this is distinct from the observed outcome density, f(y | t), which represents the density of the outcome in the subpopulation that received treatment t. These two densities are the same if and only if there are no confounders (VanderWeele and Shpitser, 2013).

The goal of observational inference with multiple treatments is to quantify the effects of different treatments by comparing the intervention distribution at different levels of treatment t (Lechner, 1999; Lopez et al., 2017). In this work we focus on marginal contrast estimands (Franks et al., 2019) under arbitrary outcome and treatment distributions. An estimand is a "marginal contrast" if it can be expressed as a function of the marginal distributions of the potential outcomes, τ(E[v(y) | do(t_1)], E[v(y) | do(t_2)]), for some functions v and τ. This includes the vast majority of commonly used estimands. For continuous outcomes, our primary estimand is the difference in the population average outcome for treatment T = t_1 and the population average outcome given treatment T = t_2:

PATE_{t_1, t_2} := E(Y | do(t_1)) − E(Y | do(t_2)).  (1)

Here, v(y) = y is the identity function and τ(a, b) = a − b. We also consider the difference in the population average outcome receiving treatment t and the observed population average outcome, which we denote

PATE_{t, •} := E(Y | do(t)) − E(Y),  (2)

where E(Y) = ∫ E(Y | t) f(t) dt and PATE_{t_1, t_2} = PATE_{t_1, •} − PATE_{t_2, •}.
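To make these estimands concrete, the following sketch (our own illustration with assumed toy coefficients, not values from the paper) simulates a linear outcome model and checks by Monte Carlo that PATE_{t_1,t_2} equals both τ'(t_1 − t_2) and the difference PATE_{t_1,•} − PATE_{t_2,•}:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 3, 2, 200_000
tau = np.array([1.0, -0.5, 0.25])      # treatment coefficients (assumed toy values)
gamma = np.array([0.8, -0.3])          # confounder-outcome coefficients (assumed)
B = rng.normal(size=(k, m))            # confounder-treatment loadings (assumed)

def mean_do(t):
    """Monte Carlo estimate of E[Y | do(t)]: U is drawn from its marginal N(0, I_m)."""
    U = rng.normal(size=(n, m))
    return (t @ tau + U @ gamma + rng.normal(size=n)).mean()

# E[Y] under the observational model, where T = BU + eps_t depends on U
U = rng.normal(size=(n, m))
T = U @ B.T + rng.normal(size=(n, k))
Y = T @ tau + U @ gamma + rng.normal(size=n)

t1, t2 = np.array([1.0, 0.0, 0.0]), np.zeros(k)
pate_12 = mean_do(t1) - mean_do(t2)    # Equation (1)
pate_1dot = mean_do(t1) - Y.mean()     # Equation (2)
pate_2dot = mean_do(t2) - Y.mean()
print(pate_12, pate_1dot - pate_2dot, tau @ (t1 - t2))
```

Up to Monte Carlo error, all three printed quantities agree, illustrating the identity PATE_{t_1,t_2} = PATE_{t_1,•} − PATE_{t_2,•}.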
The conditional average treatment effects are defined analogously as CATE_{t_1, t_2 | x} := E(Y | do(t_1), x) − E(Y | do(t_2), x) and CATE_{t, • | x} := E(Y | do(t), x) − E(Y | x).

For binary outcomes, our primary estimand is the causal risk ratio between treatments t_1 and t_2,

RR_{t_1, t_2} = P(Y = 1 | do(t_1)) / P(Y = 1 | do(t_2)),  (3)

where RR_{t, •} is defined analogously to (2), as P(Y = 1 | do(t)) / P(Y = 1), so that we can express RR_{t_1, t_2} = RR_{t_1, •} / RR_{t_2, •}. Here v(y) = I[y = 1] is the indicator function and τ(a, b) = a/b.

In general, it is difficult to infer PATEs or RRs from observational data since the potential presence of unmeasured confounders, which affect both treatment and outcome, can bias naive estimates. If U were to be observed, the following assumptions would be sufficient to identify the intervention distribution, and hence the treatment effect:

Assumption 1 (Latent ignorability). X and U block all backdoor paths between T and Y (Pearl, 2009).

Assumption 2 (Positivity). f(T = t | U = u, x) > 0 for all u and x.

Assumption 3 (SUTVA). There are no hidden versions of the treatments and there is no interference between units (see Rubin, 1980).

Since u is not observed, and x alone does not block all backdoor paths, treatment effects are not identifiable without additional assumptions. In this case, a common solution is to conduct a sensitivity analysis which characterizes how the implied causal effects change under different assumptions about U and its relationship to T and Y.

There is an extensive literature on assessing sensitivity to violations of unconfoundedness in single treatment models, dating back at least to the work of Cornfield et al. (1959) on the link between smoking and lung cancer. Since then, a wide range of strategies have been proposed for assessing sensitivity to unobserved confounding (e.g.
see Greenland, 1996; Gastwirth et al., 1998; Vansteelandt et al., 2006; Imbens, 2003; VanderWeele and Arah, 2011; VanderWeele et al., 2012; Robins et al., 2000; Franks et al., 2019; Cinelli and Hazlett, 2019; Veitch and Zaveri, 2020). A common strategy for sensitivity analysis is to assert, as in Assumption 1, that unconfoundedness would hold if only an additional latent variable U were observed (Rosenbaum and Rubin, 1983). With Assumptions 1 and 2, latent confounder models simultaneously specify the full conditional distributions of treatment and outcomes given both observed confounders, x, and unobserved confounders, U. The dependence of Y and T on U is indexed by a vector of sensitivity parameters ψ = (ψ_Y, ψ_T). Practitioners can then reason about how assumptions about these parameters translate to different causal conclusions. Often, this is done through calibration, by determining reasonable ranges for ψ using analogies about observable associations, and through a robustness assessment, by examining how strong associations with unobserved confounders must be for conclusions to change.

Concretely, in a typical latent confounder analysis, we posit densities f(u | x), the marginal density of the latent confounders; f_{ψ_T}(t | x, u), the conditional density (or PMF) for treatment assignment given all confounders; and f_{ψ_Y}(y | x, u, t), the outcome density in treatment arm t. The sensitivity parameters encode the relationship between both the treatment and unobserved confounders and the outcome and unobserved confounders (e.g. see Imbens, 2003; Dorie et al., 2016). Latent confounder models are usually parameterized so that some specific values of the sensitivity parameters ψ indicate the "no unobserved confounding" case. For example, we can take ψ_T = 0 to imply that f_{ψ_T}(t | x, u) = f_{ψ_T}(t | x) and ψ_Y = 0 to imply that f_{ψ_Y}(y | x, u, t) = f_{ψ_Y}(y | x, t).
Then, when either ψ_T = 0 or ψ_Y = 0, U is not a confounder (VanderWeele and Shpitser, 2013). Without loss of generality, we suppress conditioning on x throughout the remainder of the manuscript, and comment on the role of observed covariates where appropriate.

A key principle of sensitivity analysis is that the observed data densities should be invariant to the sensitivity parameters (Gustafson et al., 2018). Unfortunately, this principle can easily be violated in a latent confounder analysis (Franks et al., 2019). The crux of the problem is that, in general, none of f(u | x), f_{ψ_T}(t | x, u) or f_{ψ_Y}(y | x, u, t) are nonparametrically identifiable themselves, but the observed outcome density, which is a function of these densities, is identifiable. The observed outcome density has the following form:

f(y | T = t) = ∫_U f_{ψ_Y}(y | t, u) f_{ψ_T}(u | t) du,  for all t and ψ,  (4)

where f_{ψ_T}(u | t) is the conditional density of the latent confounders given treatment level t. This contrasts with the density of the intervention distribution, which is:

f_ψ(y | do(t)) = ∫_U f_{ψ_Y}(y | t, u) f(u) du  (5)
             = ∫_U f_{ψ_Y}(y | t, u) [∫ f_{ψ_T}(u | t̃) f(t̃) dt̃] du.  (6)

The intervention distribution is obtained by integrating over the marginal distribution of the unobserved confounder, f_{ψ_T}(u) = ∫ f_{ψ_T}(u | t̃) f(t̃) dt̃, whereas the observed data distribution in Equation 4 is obtained by integrating with respect to the conditional confounder distribution, f_{ψ_T}(u | t).
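The distinction between integrating over f(u | t) (Equation 4) and over the marginal f(u) (Equation 5) can be checked numerically. In the hypothetical one-treatment, one-confounder linear-Gaussian model below (the values of b, τ, γ are our own toy assumptions), the regression slope in the observed data converges to a "naive" coefficient that differs from the interventional slope:

```python
import numpy as np

rng = np.random.default_rng(1)
b, tau, gamma, n = 1.5, 2.0, 1.0, 400_000   # assumed toy values
U = rng.normal(size=n)
T = b * U + rng.normal(size=n)              # treatment depends on the confounder
Y = tau * T + gamma * U + rng.normal(size=n)

# f(y | t) integrates over f(u | t): regressing Y on T recovers tau_naive
slope_obs = np.cov(T, Y)[0, 1] / T.var()
tau_naive = tau + gamma * b / (b**2 + 1.0)  # closed form via E[U | T = t]

# f(y | do(t)) integrates over the marginal f(u): the slope in t is tau itself
t0 = 1.0
y_do = tau * t0 + gamma * rng.normal(size=n) + rng.normal(size=n)
print(slope_obs, tau_naive, y_do.mean(), tau * t0)
```

The observed slope matches tau_naive, not tau, while the interventional mean matches tau * t0; the gap between the two slopes is exactly the confounding bias discussed later in the paper.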
A fundamental challenge of sensitivity analysis is to specify a class of densities f_{ψ_Y}(y | t, u) and f_{ψ_T}(u | t), with interpretable parameters ψ = (ψ_Y, ψ_T), for which the intervention distribution necessarily varies with ψ but for which the observed data distribution does not.

In fact, several authors have proposed latent variable sensitivity models with unidentified sensitivity parameters in single-treatment settings. Many of these are in simple settings with binary or categorical outcomes (Rosenbaum and Rubin, 1983; Robins, 1997; Vansteelandt et al., 2006; Daniels and Hogan, 2008). A more general solution was proposed by Zhang and Tchetgen (2019), who develop a semi-parametric sensitivity model in which the distribution of U is left unrestricted. Franks et al. (2019) propose an alternative framework for sensitivity analysis without introducing latent variables, instead directly parameterizing the effect of the outcome on the treatment assignment. Cinelli and Hazlett (2019) and Cinelli et al. (2019) use moment arguments to derive confounding bias in the linear regression setting and introduce an approach for sensitivity parameter calibration based on the proportion of variance explained by the latent confounder. In the multiple treatment setting, additional observable implications can complicate sensitivity analysis, and thus new strategies are needed. Below, we describe our copula-based approach for addressing this challenge.

The multiple treatment setting presents unique challenges for sensitivity analysis. In particular, the additional structure imposed in studies with multiple treatments introduces new observable implications, muddling issues of identifiability. We focus on the specific case in which the conditional confounder distribution f_{ψ_T}(u | t) may be partially identifiable from the multiple treatments.
To adapt sensitivity analysis to this setting, sensitivity parameters must not only be decoupled from the observed data distribution; parameters describing the treatment-confounder relationships must also be decoupled from parameters describing the outcome-confounder relationships. We tackle this challenge by factorizing the joint distribution f(y, u, t) using a copula, which decomposes the joint distribution of variables into their marginal distributions and joint dependence.

In this section, we introduce our copula-based factorization, discuss how it applies to the multiple treatment setting, and then discuss the Gaussian copula parameterization, a special case that we will use in theoretical analysis and methods development in the remainder of the paper.

Our approach is based on the following factorization. The model for Y conditional on treatments and unobserved confounders can be decomposed into the observed data density and a conditional copula as

f_ψ(y | u, t) = f(y | t) c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t),  (7)

where F_{Y|t} is the CDF of f(y | t) and F_{U|t} is the CDF of f_{ψ_T}(u | t). Here c_{ψ_Y} is the conditional copula density, defined on the unit hypercube and parameterized by ψ_Y, which characterizes the joint density of Y and U conditional on T = t after transforming the marginals to uniform random variables (Nelsen, 2007). By explicitly factoring the observed outcome density, f(y | t), out of the complete data distribution, we ensure that the left hand side of Equation 4 is invariant to ψ, establishing that there are no observable implications of varying the copula parameters.

[Figure 1: Treatment T, outcome Y, unobserved variables U. ε_u, ε_{t|u} and ε_{y|t,u} are respectively the random noises of U, T and Y. Panel (a): univariate case; panel (b): multiple treatments.]
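To see the factorization in Equation 7 in a concrete case, the snippet below (our own check, using a bivariate standard normal for (Y, U) given t) verifies that the conditional density f(y | u, t) equals the observed margin f(y | t) times the Gaussian copula density evaluated at the CDF-transformed arguments:

```python
from math import exp, sqrt
from statistics import NormalDist

std = NormalDist()

def gaussian_copula_density(a, b, rho):
    """Bivariate Gaussian copula density c_rho(a, b) on the unit square."""
    x, y = std.inv_cdf(a), std.inv_cdf(b)
    q = (rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * (1.0 - rho * rho))
    return exp(-q) / sqrt(1.0 - rho * rho)

# With (Y, U) | T = t standard bivariate normal with correlation rho,
# Equation 7 reads f(y | u, t) = f(y | t) * c_rho(F_{Y|t}(y), F_{U|t}(u)).
rho, y, u = 0.6, 0.7, -1.2
lhs = NormalDist(mu=rho * u, sigma=sqrt(1 - rho**2)).pdf(y)   # f(y | u, t)
rhs = std.pdf(y) * gaussian_copula_density(std.cdf(y), std.cdf(u), rho)
print(lhs, rhs)
```

The two sides agree to floating-point precision; the same identity holds for any marginals after the CDF transforms, which is what makes the factorization free of observable implications.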
Moreover, this factorization holds for all densities (or PMFs) f(y | t) and f_{ψ_T}(u | t) and any number of treatments, and thus can be used to characterize the outcome-confounder dependence for any model of the observables.

With Equation 7, we can express the intervention distribution, f_ψ(y | do(t)), as:

f_ψ(y | do(t)) = f(y | t) ∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) f(u) du  (8)
             = f(y | t) ∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) [∫ f_{ψ_T}(u | t̃) f(t̃) dt̃] du.  (9)

The observed outcome distribution f(y | t) and the marginal distribution of treatment f(t) are clearly invariant to the sensitivity parameters by construction. The intervention distribution f(y | do(t)) is parameterized by ψ_Y, the parameter governing the conditional dependence between Y and U given T, and by ψ_T, the parameters governing the conditional distribution of U given T.

Given f(y | t), f_{ψ_T}(u | t) and c_{ψ_Y} we can compute the expected value of any function of the outcome under the intervention distribution, E[v(Y) | do(t)] = ∫ v(y) f(y | do(t)) dy. This can in turn be used to compute any marginal contrast estimand. By applying Equation 8, we write this intervention expectation as

E[v(Y) | do(t)] = ∫ v(y) w_ψ(y, t) f(y | t) dy,  (10)

where w_ψ(y, t) = ∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) f_{ψ_T}(u) du is the importance weight associated with sampling from the observed data distribution instead of the intervention distribution. In practice, we can approximate the marginal distribution of the unobserved confounder with the mixture density f_{ψ_T}(u) ≈ (1/n) Σ_i f_{ψ_T}(u | t_i), where t_i ∈ T is the i-th observed treatment and T is the set of all observed treatment vectors.
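The mixture approximation f_{ψ_T}(u) ≈ (1/n) Σ_i f(u | t_i) can be sanity-checked directly. In a linear-Gaussian treatment model t = Bu + ε (with Σ_u = I and toy values we assume here), the mixture's mean and covariance should recover the marginal N(0, I_m):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m, n = 5, 2, 100_000
B = rng.normal(size=(k, m))                  # assumed loadings
sig2 = 1.0                                   # sigma^2_{t|u}, assumed
T = rng.normal(size=(n, m)) @ B.T + rng.normal(size=(n, k)) * np.sqrt(sig2)

# Posterior moments of U | T = t in the linear-Gaussian model (Sigma_u = I):
prec = np.linalg.inv(B @ B.T + sig2 * np.eye(k))
mu_post = T @ (prec @ B)                     # each row is mu_{u|t_i}
Sigma_post = np.eye(m) - B.T @ prec @ B      # Sigma_{u|t}, shared across t

# Mixture (1/n) sum_i f(u | t_i) should match the marginal N(0, I_m):
mix_mean = mu_post.mean(axis=0)
mix_cov = Sigma_post + np.cov(mu_post.T)
print(mix_mean, mix_cov)
```

The mixture mean is near zero and the mixture covariance near the identity, as the law of total variance predicts; this is the sense in which averaging the conditional confounder distributions over observed treatments recovers f(u).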
Thus, the importance weight can be approximated as

w_ψ(y, t) ≈ (1/|T|) Σ_{t_i ∈ T} [∫ c_{ψ_Y}(F_{Y|t}(y), F^{ψ_T}_{U|t}(u) | t) f_{ψ_T}(u | t_i) du].  (11)

We use this approximation to derive an importance sampling algorithm for computing the expected value in Equation 10 for any copula and conditional confounder distributions f(u | t) (Appendix, Algorithm 2). This can in turn be used to compute any marginal contrast estimand, τ(E[v(y) | do(t_1)], E[v(y) | do(t_2)]).

Algorithm 2 is fully general and thus can be used to compute marginal contrast estimands in a single treatment setting, multiple treatment setting, or even multiple outcome settings. However, the primary motivation in this paper is to study this factorization in the multiple treatment setting, where there are additional observable implications from latent variable models. The copula factorization elucidates the role of confounding even when aspects of the treatment-confounder relationship are identifiable from multiple treatments. Specifically, with multiple treatments, the conditional distribution of latent confounders given treatments, parameterized by ψ_T, is often identifiable up to a particular equivalence class (e.g. up to rotation and scale).

In this paper, we focus on inference with latent variable methods where ψ_T is identified up to an equivalence class within which the set of possible causal effects compatible with any particular value of ψ_T does not change. We formalize this notion through the following definition and assumption.

Definition 1 (Causal equivalence class).
[ψ_T] is a causal equivalence class of ψ_T if and only if for any ψ̃_T in [ψ_T] and every ψ_Y there exists a ψ̃_Y such that f_{ψ_Y, ψ_T}(y | do(T = t)) = f_{ψ̃_Y, ψ̃_T}(y | do(T = t)) for all y, t.

For the purposes of sensitivity analysis, when ψ_T is identified up to a causal equivalence class, we can assume that ψ_T is point-identified at a particular value within the class [ψ_T] without loss of generality. Identification up to a causal equivalence class is not generally possible in single treatment studies but will often hold in a multi-treatment study when certain identifying conditions for the latent variable model are met.

Crucially, the copula-based formulation enables valid sensitivity analysis without observable implications, even in these cases where ψ_T is restricted by the observed data. In this case, the outcome-confounder copula c_{ψ_Y} remains the lone degree of freedom in the sensitivity model. As we will show, this restriction can induce qualitatively different sensitivity regions in the multi-treatment setting as opposed to the single-treatment setting. For example, sensitivity regions can be bounded, even without additional restrictions on ψ_Y.

In practice, it is infeasible to characterize and interpret the implied causal effects for all possible copula specifications. In this section, we propose a practical sensitivity analysis method based on the special case in which c_{ψ_Y} is a Gaussian copula. This model characterizes the sensitivity of causal effects to monotone dependencies between the outcome and unobserved confounders. As we will discuss in the following sections, the Gaussian copula facilitates interpretation and sensitivity parameter calibration, and, as before, is compatible with arbitrary marginal distributions f(y | t) and f(t). For our method and throughout the remainder of the paper we make the following additional assumptions:

Assumption 4 (Copula invariance).
The conditional copula does not depend on the value of t; that is, the conditional dependence between Y and U is invariant to the level of T.

Assumption 5 (Gaussian copula). The conditional copula between the outcome and m-dimensional latent confounders given treatments, c_ψ(F_{Y|t}(y), F_{U|t}(u) | t), is a Gaussian copula.

These assumptions do not impose constraints on the observed data distributions, only on the relationship between the observed and latent variables. Given Assumptions 4 and 5, the conditional confounder density can be expressed as a multivariate normal density, f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}), where Σ_{u|t} is invariant to the level of t. Together, these assumptions imply the following generative model:

T ∼ F_T  (12)
U | T = t ∼ N(µ_{u|t}, Σ_{u|t})  (13)
Ỹ = γ'(U − µ_{u|t}) + ε_{ỹ|t,u},  ε_{ỹ|t,u} ∼ N(0, σ²_{ỹ|t,u}),  γ'Σ_{u|t}γ + σ²_{ỹ|t,u} = 1  (14)
Y = F⁻¹_{Y|t}(Φ(Ỹ))  (15)

where F_T is the distribution of the treatments and F⁻¹_{Y|t} is the inverse CDF of the conditional distribution of Y given T = t. The Gaussian copula is parameterized by the correlation matrix implied by

Cov([Ỹ, U] | T = t) = [ 1          γ'Σ_{u|t} ]
                      [ Σ_{u|t}γ   Σ_{u|t}  ],  (16)

with parameters ψ_T = {µ_{u|t}, Σ_{u|t}} and ψ_Y = {γ}. In general, µ_{u|t} and Σ_{u|t} will not be point identified, although under many latent variable models they can be identified up to an invertible linear transformation of U. Importantly, the following theorem establishes that the class of ψ_T defined by all invertible linear transformations of U is a causal equivalence class.

Theorem 1.
Assume model 13-15. Let [ψ_T] = {ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊}, where S₊ is the space of symmetric positive definite matrices. Then [ψ_T] is a causal equivalence class.

The gist of the proof is that for any invertible linear transformation, A, of U, the copula parameterized by γ̃ = A⁻¹γ yields equivalent causal effects in the reparameterized coordinates of U as γ does in the original confounder coordinates.

Throughout this work, we will assume that ψ_T is identified up to invertible linear transformations of U, and explore the range of possible causal effects for different γ satisfying γ'Σ_{u|t}γ ≤ 1. In Algorithm 1, we provide a procedure for estimating any marginal contrast estimand, given a sensitivity vector γ and treatment levels t_1 and t_2. At a high level, we compute a Monte Carlo estimate of f(y | do(t)) via the following three step procedure: 1) draw a sample from f(u), 2) compute the conditional density of the Gaussianized outcome f(Ỹ | u, t) via the Gaussian copula, and 3) transform Ỹ back to the original space via the conditional quantile function F⁻¹_{Y|t} (see Figure 3). In the following sections, we introduce some theoretical insights about our approach and also provide a method for calibrating the magnitude of γ and reasoning about its direction.

Algorithm 1 Marginal Contrast Estimation
  function ComputeMean(t, γ)
    for i = 1, 2, ..., n do
      µ_i ← γ'(µ_{u|t_i} − µ_{u|t})
      for j = 1, 2, ..., nSim do
        Sample ỹ_ij from N(µ_i, 1)
        y_ij ← F⁻¹_{Y|t}(Φ(ỹ_ij))
      end for
    end for
    return (1/(n · nSim)) Σ_ij v(y_ij)
  end function
  Return τ(ComputeMean(t_1, γ), ComputeMean(t_2, γ))

As described in the previous section, our method for practical sensitivity analysis with multiple treatments is based on a Gaussian copula parameterization of the confounder-outcome relationship. Here, we start by providing some theoretical insights about how the Gaussian covariance structure relates to confounding bias in the linear-Gaussian model. Specifically, we describe how the causal effects vary as a function of γ and how bounds on these effects depend on both the treatment contrast and the inferred conditional confounder density. In Section 5.2, we generalize some of these results to arbitrary models for f(y | t) and f(t).

We start by illustrating our approach in a simple linear-Gaussian model when (Y, T, U) are jointly multivariate Gaussian and establish the following results:

• For causal inference with a single treatment, the confounding bias for PATE_{t_1, t_2} is unbounded.

• When there are multiple treatments which can be used to identify (up to a causal equivalence class) the conditional confounder distribution, the magnitude of the confounding bias for PATE_{t_1, t_2} is bounded. We characterize how the magnitude of this bound depends on the parameters of the latent confounder model.

• In the multi-cause setting there are many possible treatment contrasts. The confounding bias depends on the contrast. We characterize which treatment contrasts lead to the largest bounds and which treatment contrasts imply identifiable effects.

We demonstrate these results in the following model:

U = ε_u,  ε_u ∼ N_m(0, Σ_u),  (17)
T = BU + ε_{t|u},  ε_{t|u} ∼ N_k(0, σ²_{t|u} I_k),  (18)
Y = τ'T + γ'U + ε_{y|t,u},  ε_{y|t,u} ∼ N(0, σ²_{y|t,u}),  (19)

with τ ∈ R^k and γ ∈ R^m. When either B = 0 or γ = 0, there is no confounding. We also note that Equations 17 and 18 imply that the conditional distribution of the confounder can be expressed as f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}), as in Equation 13, where both µ_{u|t} and Σ_{u|t} are known functions of B and σ²_{t|u}. Under model 17-19, the intervention distribution has density

f(y | do(T = t)) ∼ N(τ't, σ²_{y|t,u} + γ'Σ_u γ).  (20)

For any t_1, t_2, PATE_{t_1, t_2} is characterized entirely by the regression coefficients τ. The observed outcome distribution can be expressed as

f(y | T = t) ∼ N(τ'_naive t, σ²_{y|t}),  (21)

where

τ_naive = τ + (BΣ_uB' + σ²_{t|u} I_k)⁻¹ BΣ_u γ  (22)
σ²_{y|t} = σ²_{y|t,u} + γ'(Σ_u − Σ_uB'(BΣ_uB' + σ²_{t|u} I_k)⁻¹BΣ_u)γ  (23)
        = σ²_{y|t,u} + γ'Σ_{u|t}γ,  (24)

which are both fully identified from observed data. We refer to τ_naive as the naive estimate since it naively neglects the effect of unobserved confounders. Equation 24 shows that the observed residual outcome variance can be decomposed into nonconfounding variation, σ²_{y|t,u}, and confounding variation, γ'Σ_{u|t}γ. We take σ²_{y|t} and τ_naive as fixed and known, and characterize the range of confounding biases by considering different assumptions about the strength of confounding.

We note that the bias of the naive estimator depends only on the difference between the treatment vectors, ∆t = t_1 − t_2, since the population average treatment effect can be expressed as

PATE_∆t = τ'(t_1 − t_2) := τ'∆t.  (25)

The confounding bias, denoted Bias_∆t = τ'_naive ∆t − PATE_∆t, can then be expressed as

Bias_∆t = γ'Σ_uB'(BΣ_uB' + σ²_{t|u} I_k)⁻¹∆t  (26)
        = γ'(E(U | T = t_1) − E(U | T = t_2))  (27)
        := γ'µ_{u|∆t},  (28)

where we use µ_{u|∆t} to denote the difference in confounder means for the treatment contrast, ∆t. We can then succinctly express the PATE in terms of the naive estimate minus the bias as PATE_∆t = τ'_naive ∆t − γ'µ_{u|∆t}.

In the single treatment setting, neither ψ_Y = {γ} nor ψ_T = {µ_{u|∆t}, Σ_{u|t}} are identifiable, which implies that the confounding bias is unbounded. However, with multiple treatments, ψ_T is identifiable up to a causal equivalence class defined by invertible linear transformations of U.
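As a check on Algorithm 1, the sketch below is our own implementation under the linear-Gaussian model 17-19 (with Σ_u = I and σ²_{t|u} = σ²_{y|t,u} = 1, toy coefficients assumed). With a Gaussian observed-data model, F⁻¹_{Y|t}(Φ(ỹ)) reduces to an affine map, and the estimated contrast should recover τ'∆t = τ'_naive ∆t − γ'µ_{u|∆t}:

```python
import numpy as np

rng = np.random.default_rng(5)
k, m, n, nsim = 3, 2, 20_000, 100
B = rng.normal(size=(k, m))            # confounder-treatment loadings (assumed)
gamma = 0.5 * rng.normal(size=m)       # confounder-outcome coefficients (assumed)
tau = np.array([1.0, -1.0, 0.5])       # true effects (assumed)

U = rng.normal(size=(n, m))
T = U @ B.T + rng.normal(size=(n, k))  # observed treatments

Sigma_T = B @ B.T + np.eye(k)
P = np.linalg.inv(Sigma_T)
mu_post = T @ (P @ B)                  # rows are mu_{u|t_i}
Sigma_post = np.eye(m) - B.T @ P @ B   # Sigma_{u|t}
tau_naive = tau + P @ (B @ gamma)      # Eq. 22
sig2_yt = 1.0 + gamma @ Sigma_post @ gamma   # Eq. 24
g = gamma / np.sqrt(sig2_yt)           # copula-scale sensitivity vector

def compute_mean(t):
    """ComputeMean(t, g) from Algorithm 1, with a Gaussian observed-data model f(y | t)."""
    mu_t = B.T @ (P @ t)               # mu_{u|t}
    mu_i = (mu_post - mu_t) @ g        # per-component mean of the Gaussianized outcome
    y_tilde = mu_i[:, None] + rng.normal(size=(n, nsim))
    # For Gaussian f(y | t), F^{-1}_{Y|t}(Phi(y_tilde)) is an affine transform:
    return (tau_naive @ t + np.sqrt(sig2_yt) * y_tilde).mean()

t1, t2 = np.array([1.0, 0.0, 0.0]), np.zeros(k)
print(compute_mean(t1) - compute_mean(t2), tau @ (t1 - t2))
```

Up to Monte Carlo error the two printed numbers agree, i.e. the algorithm removes the bias γ'µ_{u|∆t} from the naive contrast; swapping in a non-Gaussian F⁻¹_{Y|t} only changes the final transform line.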
We make this concrete in the following theorems.

Theorem 2.
Suppose that the observed data is generated by model 17-19. When there are k treatments with 0 < m < k, then ψ_T is identified up to the causal equivalence class [ψ_T] = {ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊}. When there is a single treatment (k = 1) or at least m = k confounders, then ψ_T is not identifiable up to a causal equivalence class.

Proof. See appendix.

One consequence of Theorem 2 is that the distribution of U is only causally relevant up to linear transforms, and as such, without loss of generality, we make the simplifying assumption that U ∼ N(0, I_m) for the remainder of this section.

First, we review the implications of this theorem when there is only a single treatment, i.e. k = 1. As shown in Cinelli and Hazlett (2019), for single treatment inference, the squared confounding bias of the PATE can be expressed as

Bias²_∆t = (R²_{T∼U} / (1 − R²_{T∼U})) R²_{Y∼U|T} (σ²_{y|t} / σ²_T) (∆t)²,  (29)

where σ²_T := BB' + σ²_{t|u} I_k is the marginal variance of the treatment,

0 ≤ R²_{T∼U} = σ²_T (µ_{u|∆t})'µ_{u|∆t} / (∆t)² ≤ 1  (30)

is the unidentified fraction of treatment variance explained by confounders, and

0 ≤ R²_{Y∼U|T} = γ'Σ_{u|t}γ / σ²_{y|t} ≤ 1  (31)

is the fraction of the residual outcome variance explained by confounders. By Theorem 2, neither R²_{T∼U} nor R²_{Y∼U|T} are identifiable. Since R²_{T∼U} / (1 − R²_{T∼U}) can be arbitrarily large, the confounding bias is unbounded in the single treatment setting.

In contrast, Theorem 2 states that with multiple treatments, we can identify an element of the causal equivalence class for parameters governing the conditional confounder distribution. The relationship between Y and U, as parameterized by the m-vector γ, remains an unidentified sensitivity vector. This sensitivity vector can be viewed as parameterizing the Gaussian conditional copula between Y and U given T, c_γ(F_{Y|t}(y), F_{U|t}(u) | t) (Equation 16).

Identification up to a causal equivalence class implies that the confounding bias is bounded.
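The bias decomposition in Equations 29-31 can be verified exactly in the simplest case of a single treatment and a single (scalar) confounder, where all quantities have closed forms; the toy parameter values below are our own assumptions:

```python
# Single treatment (k = 1), single confounder (m = 1), Sigma_u = 1.
b, g = 1.2, 0.8                  # B and gamma (assumed toy values)
sig2_t, sig2_y = 1.0, 0.5        # sigma^2_{t|u} and sigma^2_{y|t,u} (assumed)
sigma2_T = b * b + sig2_t        # marginal treatment variance
Sigma_post = 1.0 - b * b / sigma2_T        # Sigma_{u|t}
sig2_yt = sig2_y + g * g * Sigma_post      # residual outcome variance, Eq. 24

dt = 2.0                         # treatment contrast Delta t
mu_dt = b * dt / sigma2_T        # mu_{u|Delta t} = Sigma_u B' Sigma_T^{-1} Delta t
bias2 = (g * mu_dt) ** 2         # squared bias, Eqs. 26-28

R2_TU = sigma2_T * mu_dt**2 / dt**2        # treatment variance explained, Eq. 30
R2_YU = g * g * Sigma_post / sig2_yt       # residual outcome variance explained, Eq. 31
rhs = R2_TU / (1 - R2_TU) * R2_YU * sig2_yt / sigma2_T * dt**2   # Eq. 29
print(bias2, rhs)
```

Both sides agree exactly, and the check makes the unboundedness argument visible: as b grows with σ²_{t|u} fixed, R²_{T∼U} → 1 and the factor R²_{T∼U}/(1 − R²_{T∼U}) diverges.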
From Equation 26, we can see that the sign and magnitude of the bias depend on both ∆t and γ. Although γ is not identified, its values are constrained, since unobserved confounding cannot explain more than 100% of the residual outcome variance (Equation 31). This constraint on the magnitude of γ implies the following result about the bias of the naive estimator.

Theorem 3.
Suppose that the observed data is generated by model 17-19 with σ²_{t|u} > 0. Then, for all γ satisfying Assumptions 1 and 2,

$$\gamma'\Sigma_{u|t}\gamma \le \sigma^2_{y|t}, \qquad (32)$$

and for any given ∆t, we have

$$\mathrm{Bias}^2_{\Delta t} \le \sigma^2_{y|t}\, R^2_{Y\sim U\mid T}\, \|\Sigma^{-1/2}_{u|t}\mu_{u|\Delta t}\|^2. \qquad (33)$$

The bound is achieved when γ is colinear with Σ^{−1}_{u|t}µ_{u|∆t}.

Proof. See appendix.

This theorem states that the true causal effect lies in the interval τ′_naive∆t ± σ_{y|t}√(R²_{Y∼U|T}) ‖Σ^{−1/2}_{u|t}µ_{u|∆t}‖. When additional assumptions are applied to identify a particular value of γ (e.g., see Miao et al., 2020), the corresponding causal effect estimate will correspond to a single point inside this ignorance region when the Gaussian copula assumption holds. We refer to the right-hand side of (33) as the "worst-case bias" of the naive estimator. In particular, since τ_naive is the midpoint of the ignorance region, it has the minimum worst-case bias over all alternative causal effect estimators. This is consistent with Grimmer et al. (2020), who emphasize that the deconfounder proposed by Wang and Blei (2018) cannot outperform the naive estimator in general.

The worst-case bias of τ_naive is proportional to the norm of the scaled difference in confounder means between the treatment arms, Σ^{−1/2}_{u|t}µ_{u|∆t}. This result provides a useful generalization of existing work demonstrating that overlap is violated when u can be pinpointed by a deterministic function of t (D'Amour, 2019b). In contrast to the original work by Wang and Blei (2018), our result suggests that the more precisely we can pinpoint u given t, the less precisely we can pinpoint PATE_{∆t}. In the following corollary, we state the worst-case bias over all possible treatment contrasts:

Corollary 3.1.
Let d₁ be the largest singular value of B. For all ∆t with ‖∆t‖ = 1, the squared bias is bounded by

$$\mathrm{Bias}^2_{\Delta t} \le \frac{d_1^2}{d_1^2 + \sigma^2_{t|u}}\,\frac{\sigma^2_{y|t}}{\sigma^2_{t|u}}\, R^2_{Y\sim U\mid T}, \qquad (34)$$

with equality when ∆t = u_B, the first left singular vector of B. When ∆t ∈ Null(B′), the naive estimate is unbiased, that is, PATE_{∆t} = τ′_naive∆t.

Proof: See appendix.

The first term in (34), d₁²/(d₁² + σ²_{t|u}), is the fraction of variance in the first principal component of the causes that can be explained by confounding. The first principal component corresponds to the projection of the treatments which is most correlated with confounders, and thus is the causal contrast with the largest ignorance region. We also note that the squared bias depends on R²_{Y∼U|T}, the partial variance explained by confounders given treatments. While the magnitude of the confounding bias is always largest when R²_{Y∼U|T} = 1, we often have reason to believe that the variance explained by confounders is likely smaller. We describe how to leverage this idea to calibrate more plausible bounds and measures of robustness in Section 6.

We illustrate some key insights from these theorems in Figure 2, where we display the worst-case bias as a function of the treatment contrast, ∆t. In this illustration, we assume that ∆t lies in the plane spanned by u_B and n_B, an arbitrary vector in the null space of B. We let θ = arccos(∆t′u_B) be the angle of ∆t relative to u_B (Figure 2a). Figure 2b depicts the bias as a function of θ for different values of R²_{Y∼U|T}. When ∆t is in the null space of B, PATE_{∆t} is identified because the confounder distributions are identical in the two treatment arms, i.e., there is no confounding for this particular contrast. When ∆t is colinear with u_B, the scaled difference in means of u is largest, which implies the largest worst-case bias for the treatment effect, Figure 2c (left).
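Both the per-contrast bound in Theorem 3 and the worst case over all unit contrasts in Corollary 3.1 can be sketched numerically. The snippet below specializes to a single confounder (m = 1), where the top singular value of B is simply ‖B‖ and the conditional moments of U have closed forms; the function names are illustrative:

```python
def sq_bias_bound_for_contrast(B, dt, sigma2_t_u, sigma2_y_t, r2_y_u_t):
    # Theorem 3 bound for a specific unit contrast dt, specialized to m = 1:
    # with U scalar, mu_{u|dt} = B'dt / (||B||^2 + sigma2_{t|u}) and
    # Sigma_{u|t} = sigma2_{t|u} / (||B||^2 + sigma2_{t|u}).
    d1sq = sum(b * b for b in B)
    mu_u_dt = sum(b * x for b, x in zip(B, dt)) / (d1sq + sigma2_t_u)
    Sigma_u_t = sigma2_t_u / (d1sq + sigma2_t_u)
    return sigma2_y_t * r2_y_u_t * mu_u_dt ** 2 / Sigma_u_t

def sq_bias_bound_over_contrasts(B, sigma2_t_u, sigma2_y_t, r2_y_u_t):
    # Corollary 3.1: worst case over all unit contrasts.  For m = 1 the top
    # singular value of B is d1 = ||B||, attained at dt = u_B = B / ||B||.
    d1sq = sum(b * b for b in B)
    return d1sq / (d1sq + sigma2_t_u) * sigma2_y_t / sigma2_t_u * r2_y_u_t
```

For a unit contrast in the null space of B′ the bound vanishes, while for ∆t = B/‖B‖ the per-contrast bound attains the maximum from Corollary 3.1.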
Even when PATE_{∆t} is identified, we emphasize that PATE_{t₁,•} and PATE_{t₂,•} are both biased, since the distribution of confounders in each treatment arm differs from the distribution of confounders in the superpopulation, Figure 2c (right). As noted by others, identification of PATE_{∆t} for ∆t in the null space of B arises due to bias cancellation in the intervention means of the two treatment arms (Grimmer et al., 2020).

Figure 2: Illustration of Corollary 3.1. (a) We parameterize ∆t by θ, the angle between n_B, a vector in the null space of B, and u_B, the first left singular vector of B. (b) The confounding bias of naive estimates of PATE_{∆t} changes with θ and depends on R²_{Y∼U|T}. (c) Confounder densities in different populations. The blue, green, and red densities denote the distribution of U in the observed population, the subpopulation receiving t₁, and the subpopulation receiving t₂, respectively. Observed data estimates of PATE_{∆t} are unbiased when ∆t = n_B, since the confounder distributions are the same in the two treatment arms. However, observed data estimates of PATE_{t₁,•} and PATE_{t₂,•} are biased since, in general, the superpopulation distribution of the confounder is different.

Our theory is invariant to rotations of the treatment vector, which means that under model 17-19 we can always make a change of treatment variables so that each confounder affects a single distinct rotated treatment (called "single cause" confounders in Wang and Blei (2018)). Specifically, let T̃ = RT be a rotation of the original treatment variables, so that T̃ ∼ N(0, ∆ + σ²_{t|u}I_k), where ∆ is a diagonal positive semi-definite matrix.
Then, as before, ∆_{ii}/(∆_{ii} + σ²_{t|u}) is the fraction of variance in the i-th rotated treatment due to confounding, and it provides a bound on the omitted confounder bias of estimates of the PATE for ∆t̃ = e_i. For the k − m contrasts corresponding to zeros on the diagonal of ∆, there is no confounding for these particular treatment contrasts and thus no confounding bias; these correspond to the contrasts in the original space which fall in the null space of B.

Next, we generalize beyond the setting in which (Y, T, U) is jointly multivariate Gaussian. First, when Y is Gaussian with mean µ_{y|t} and variance σ²_{y|t}, even when T has an arbitrary functional relationship with Y, we have that

$$\mathrm{PATE}_{t,\bullet} = (\mu_{y|t} - \mu_y) - \sigma_{y|t}\,\gamma'(\mu_{u|t} - \mu_u), \qquad (35)$$

where µ_u = E[U] is the population mean of U. Thus,

$$\mathrm{PATE}_{t_1,t_2} = (\mu_{y|t_1} - \mu_{y|t_2}) - \sigma_{y|t}\,\gamma'(\mu_{u|t_1} - \mu_{u|t_2}). \qquad (36)$$

This leads to the following generalization of Theorem 3.

Figure 3: Differences between observed and intervention densities as a function of the fraction of outcome variance explained by a single confounder. The black contours depict the conditional Gaussian copula, c_γ(F_{Y|t}(y), F_{U|t}(u) | t), whereas red points represent samples from the joint distribution, f(y, u | do(t)) ∝ f(y|t) c_γ(F_{Y|t}(y), F_{U|t}(u) | t) f(u). Panels (a)-(c) vary the conditional correlation √(R²_{Y∼U|T}), with panel (b) taking the opposite sign. We visualize the shift in the outcome density for different conditional correlations and note that smaller values of R²_{Y∼U|T} imply smaller biases in the outcome despite large imbalance in the distribution of U.

Theorem 4.
Assume model 13-15 with Gaussian outcomes. If Σ_{u|t} is non-invertible, then Bias_{t₁,t₂} is bounded if and only if µ_{u|t₁} − µ_{u|t₂} is in the row space of Σ_{u|t}. When bounded,

$$\mathrm{Bias}^2_{t_1,t_2} \le \sigma^2_{y|t}\, R^2_{Y\sim U\mid T}\, \|(\Sigma^{\dagger}_{u|t})^{1/2}(\mu_{u|t_1} - \mu_{u|t_2})\|^2,$$

where Σ†_{u|t} is the pseudo-inverse of Σ_{u|t}.

Proof: See appendix.

As before, when bounded, the bias is proportional to the norm of the scaled difference in confounder means between the two treatment arms. When there exists an m-vector q such that Var(q′U | t) = 0, then Σ_{u|t} is non-invertible because there exists a projection of the confounders which is point identified. Theorem 4 says that, in this case, the ignorance region for the PATE is bounded if and only if q′(µ_{u|t₁} − µ_{u|t₂}) = 0. In words, if a projection of the confounders can be identified, then the confounding bias is bounded if the identifiable projection of the confounders has the same value in both treatment arms. This result can be viewed in the context of Theorem 7 in Wang and Blei (2018), which assumes consistency and overlap of estimators for the relevant latent confounders for identification.

When the observed outcome distribution is non-Gaussian, we cannot necessarily express PATE_{t₁,t₂} analytically. In particular, for non-Gaussian Y, when f(u|t₁) = f(u|t₂) the average treatment effect among the t₁- and t₂-treated units is unconfounded, but the bias of PATE_{t₁,t₂} may be nonzero since f(u|t₁) ≠ f(u). The causal effects, however, can still be calculated using Algorithm 1. Another particularly important non-Gaussian setting we highlight here is when the outcome is binary. Interestingly, unlike the linear case, RR_{t,•} and RR_{t₁,t₂} are non-monotone in the magnitude of γ. We discuss this in more detail in Appendix B.3 and provide simulation results with binary outcomes in Section 7.
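A minimal sketch of the boundedness condition in Theorem 4, assuming for simplicity that Σ_{u|t} is diagonal, so that the pseudo-inverse simply inverts the nonzero diagonal entries; the helper name is hypothetical:

```python
def theorem4_sq_bias_bound(Sigma_diag, dmu, sigma2_y_t, r2_y_u_t, tol=1e-12):
    # Theorem 4 with a diagonal Sigma_{u|t}: the pseudo-inverse inverts the
    # nonzero diagonal entries.  The bias is bounded iff dmu lies in the row
    # space of Sigma_{u|t}, i.e. dmu is zero wherever the diagonal is zero.
    sq = 0.0
    for s, d in zip(Sigma_diag, dmu):
        if s > tol:
            sq += d * d / s
        elif abs(d) > tol:
            return float("inf")          # outside the row space: unbounded
    return sigma2_y_t * r2_y_u_t * sq    # squared worst-case bias
```

When a projection of the confounders is point identified (a zero diagonal entry) but the confounder means differ along that projection, the sketch returns an infinite bound, matching the theorem's boundedness condition.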
Sensitivity analyses consist of two parts: first, the sensitivity model itself, which specifies a set of causal models indexed by sensitivity parameters; and secondly, exploratory tools for mapping external assumptions to particular causal models in this set. We now turn to discussing the latter in the context of our proposed model.

In the sensitivity analysis literature so far, two exploratory techniques have been particularly popular in single treatment studies: calibration, which maps sensitivity parameter values to interpretable observable or hypothetical quantities; and robustness analysis, which characterizes the "strength" of confounding necessary to change the conclusion of a study. Here, we show how to adapt these techniques to our sensitivity model in the multi-treatment setting. In addition, we introduce a third class of tools that are particularly well-suited to the multi-treatment setting, which we call multiple contrast criteria (MCCs). MCCs specify aggregate properties of the treatment effects for multiple treatment contrasts that are implied by a single causal model, e.g., the L2 norm of PATEs corresponding to contrasts in each individual treatment variable in T. In many multi-treatment settings, assumptions are often expressed in terms of such aggregates (e.g., in genomics, the idea that the effect of most single nucleotide polymorphisms is small), and we show here how these can be used in conjunction with our sensitivity model to characterize candidate causal models that may be of interest in an application.

We begin by describing calibration for γ in our sensitivity model when the focus is on a single treatment contrast, between levels T = t₁ and T = t₂. The goal is to develop heuristics for specifying "reasonable" values or ranges for γ, e.g., to derive bounds on treatment effects by specifying bounds on the strength or direction of confounding.
Following previous work in the single treatment setting, we outline how to calibrate our sensitivity parameter vector γ in terms of the fraction of outcome variance explained by the unobserved confounder. Recall that γ is a vector that parameterizes the residual correlation between the m-dimensional unobserved confounder U and the outcome Y after conditioning on the treatment vector T.

First, we briefly review calibration in single-treatment settings. In latent variable approaches for single treatment sensitivity analysis, the causal effect is identified given two sensitivity parameters: the fraction of outcome variance explained by unobserved confounders, R²_{Y∼U|T}, and the fraction of treatment variance explained by unobserved confounders, R²_{T∼U} (Cinelli and Hazlett, 2019). In a linear model, these two scalar quantities identify the confounding bias (Equation 29). Neither R-squared value is identifiable, and thus many authors have proposed strategies for drawing analogies between these values and other observable or hypothetical quantities (Cinelli et al., 2020; Veitch and Zaveri, 2020; Franks et al., 2019).

We borrow this strategy for calibration in our setting, with some modifications. First, in our setting there is no need to calibrate R²_{T∼U}, because we have restricted ourselves to the case where f(u|t) is identified up to a causal equivalence class. This leaves calibration of the outcome-confounder relationship, which in our setting is more complex because it is parameterized by a vector γ. However, we can reparameterize γ in terms of a direction d and an R-squared value for interpretable calibration:

$$\gamma = \sqrt{R^2_{Y\sim U\mid T}}\;\Sigma^{-1/2}_{u|t}\,d, \qquad (37)$$

where d ∈ S^{m−1} is an m-dimensional unit vector. We discuss strategies for calibrating the magnitude and direction separately.

Calibrating the magnitude of γ. For Gaussian outcomes, the magnitude of γ is characterized entirely by R²_{Y∼U|T}, the partial fraction of outcome variance explained by U given T.
When R²_{Y∼U|T} = 0 there is no unobserved confounding, and when R²_{Y∼U|T} = 1, all of the observed residual variance in Y is due to confounding factors. In order to calibrate this magnitude, we adopt an idea proposed by Cinelli and Hazlett (2019) for inference with single treatments. In their work, they calibrate R²_{Y∼U|T} by comparing it to the partial fraction of variance explained by different observed covariates. We use a closely related strategy that makes use of the presence of multiple treatments rather than observed covariates. (Unlike the single treatment setting, the confounder-outcome relationship cannot be sufficiently summarized in terms of a scalar R²_{Y∼U|T}: each confounder can impact each treatment in different ways.) Specifically, we compute the fraction of variance in Y that can be explained by a specific treatment (or set of treatments), T_j, after controlling for all other treatments T_{−j}, as

$$R^2_{Y\sim T_j\mid T_{-j}} := \frac{R^2_{Y\sim T} - R^2_{Y\sim T_{-j}}}{1 - R^2_{Y\sim T_{-j}}}. \qquad (38)$$

When observed covariates are available, we can still analogously compute the partial fraction of variance explained by an observed covariate, R²_{Y∼X_j|T,X_{−j}}, as done in Cinelli and Hazlett (2019). Veitch and Zaveri (2020) and Cinelli et al. (2020) propose useful graphical summaries for calibration based on these metrics in the single treatment setting.

When the observed outcome is non-Gaussian, we calibrate the "implicit R²" by considering the explained variance of the latent Gaussian outcome, Ỹ, in Equation 14. The implicit R² of T for model 13-15 is defined as

$$R^2_{Y\sim T} = \frac{\mathrm{Var}(E[\tilde Y \mid T])}{\mathrm{Var}(E[\tilde Y \mid T]) + 1}, \qquad (39)$$

and the implicit partial R-squared of treatment T_j, R²_{Y∼T_j|T_{−j}}, is defined analogously to Equation 38. As before, these estimable partial R-squared values can be used to provide a useful comparison for the partial R-squared of potential unobserved confounders, R²_{Y∼U|T}. For more detail, see Imbens (2003) and Franks et al.
(2019), who discuss calibration with implicit R-squared values in logistic regression models.

Choosing the direction of γ. Given a magnitude, we now propose a default method for identifying the direction of γ for a single contrast. The dot product d′Σ^{−1/2}_{u|t}(µ_{u|t₁} − µ_{u|t₂}) corresponds to the projection of the scaled difference in confounder means onto the outcome space. By default, we suggest using the direction which maximizes the squared bias. As shown in Theorem 4, when d is colinear with Σ^{−1/2}_{u|t}(µ_{u|t₁} − µ_{u|t₂}), the confounding bias of the naive estimator for Gaussian outcomes is maximized at

$$|\mathrm{Bias}_{t_1,t_2}| = \sqrt{R^2_{Y\sim U\mid T}}\;\sigma_{y|t}\,\|\Sigma^{-1/2}_{u|t}(\mu_{u|t_1} - \mu_{u|t_2})\|. \qquad (40)$$

Choosing the direction of the sensitivity vector in this way provides conservative bounds for each contrast of interest. For non-Gaussian outcomes or alternative estimands, there may not be an analytic solution for the direction which maximizes the bias, but we can still compute the direction via numerical optimization.

6.2 Robustness for Single Contrasts

We now turn to assessing the robustness of conclusions using our sensitivity model, extending work by Cinelli and Hazlett (2019) and VanderWeele and Ding (2017) in the single treatment setting. Specifically, we propose an extension of the robustness value (RV), which characterizes the minimum strength of confounding needed to change the sign of the treatment effect. As in the previous section, the extension is most straightforward when considering the effect of a single treatment contrast, between levels T = t₁ and T = t₂.

To review briefly, in single treatment settings, Cinelli and Hazlett define the RV as the minimum R-squared needed to reduce the treatment effect to zero, assuming R²_{Y∼U|T} = R²_{T∼U} (we return to this assumption below).
A robustness value close to one means the treatment effect maintains the same sign even if nearly all of the observed residual variance in the outcome is due to confounding. On the other hand, a robustness value close to zero means that even weak confounding would change the sign of the point estimate.

In the multi-treatment setting, we can more precisely characterize the robustness of causal effects when R²_{T∼U} is identifiable. In particular, we can compute an analogue to the RV without assuming R²_{Y∼U|T} = R²_{T∼U}, an assumption that, in single treatment settings, can be consequential: when R²_{Y∼U|T} > R²_{T∼U}, the single-treatment RV will be too conservative, and conversely, when R²_{Y∼U|T} < R²_{T∼U}, it will overestimate the robustness of the effect. With R²_{T∼U} known, we define the multi-treatment RV simply as the minimum value of R²_{Y∼U|T} needed to explain away the treatment effect of interest, assuming the direction of the sensitivity vector is chosen to maximize the bias. When the observed outcomes are Gaussian, the robustness value can be computed in closed form:

$$\mathrm{RV}_{t_1,t_2} = \frac{(\mu_{y|t_1} - \mu_{y|t_2})^2}{\sigma^2_{y|t}\,\|\Sigma^{-1/2}_{u|t}(\mu_{u|t_1} - \mu_{u|t_2})\|^2}. \qquad (41)$$

RV metrics for alternative estimands and/or non-Gaussian data can still be computed using the same principle. For example, when the observed outcome is binary, the RV can be computed numerically by solving RR_{t₁,t₂} = 1, which gives the minimum strength of confounding needed for the observed risk ratio (RR) to equal one. This robustness value is analogous to the "E-value" proposed by VanderWeele and Ding (2017).

In our setting, we can also make stronger statements about robustness than in the single treatment setting: under the latent variable model, it is possible to declare an effect robust to any level of confounding.
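The closed-form robustness value in Equation 41 is straightforward to compute; the sketch below specializes to a one-dimensional confounder (m = 1), with a hypothetical function name:

```python
def robustness_value(naive_effect, sigma2_y_t, Sigma_u_t, dmu_u):
    # Equation 41 with m = 1: the minimum R2_{Y~U|T} needed to drive the
    # naive contrast estimate to zero.  Values above 1 mean that no
    # admissible sensitivity vector can explain the effect away.
    return naive_effect ** 2 * Sigma_u_t / (sigma2_y_t * dmu_u ** 2)
```

Consistent with the discussion above, a returned value greater than one indicates that the ignorance region excludes zero for every admissible level of confounding.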
In particular, when the latent variable model implies R²_{T∼U} < 1 (i.e., we have confounder overlap), then even when R²_{Y∼U|T} = 1, the ignorance region is bounded (Theorem 4). When this ignorance region excludes zero, we declare the effect "robust". This operation is consistent with the result in Miao et al. (2018), showing that hypotheses of zero effect can be tested in this setting, even if the treatment effect cannot be identified.

In addition to the exploratory tools used with single-treatment sensitivity analysis, the multi-treatment setting presents opportunities for exploring sensitivity models in new ways. Here, we propose one such approach using multiple contrast criteria, or MCCs. As opposed to the approaches we have discussed so far, which consider treatment contrasts in isolation, MCCs characterize a choice of sensitivity vector γ by concurrently considering its implications for multiple treatment contrasts in aggregate. Thus, while the sensitivity vector γ that gives the worst-case bias may differ across individual contrasts, an MCC can be used to select a single γ that has implications for many simultaneous treatment effects.

Formally, for a set of treatment contrasts T = {(t₁, t₂)_k}_{k=1}^K and a candidate sensitivity vector γ, let PATE_T(γ) be the vector of PATEs implied by the causal model indexed by γ. An MCC is a scalar summary of this treatment effect vector, which we write as ω(PATE_T(γ)). An MCC is specified by the set of contrasts T and the summary function ω, both of which can be chosen to meet the needs of a given analysis.

MCCs can be used in many ways, but here we consider how they can be used to search for the causal model that yields the minimum norm treatment effect vector, subject to Assumptions 1-5 and a confounding limit R.
Specifically, we take ω to be an L_p norm for some p, and consider sensitivity vectors γ* that satisfy

$$\gamma^* = \operatorname*{argmin}_{\tilde\gamma}\ \omega(\mathrm{PATE}_{\mathcal{T}}(\tilde\gamma)) \quad \text{subject to} \quad R^2_{Y\sim U\mid T}(\tilde\gamma) \le R, \qquad (42)$$

where R²_{Y∼U|T}(γ̃) = γ̃′Σ_{u|t}γ̃ / σ²_{y|t} is the partial fraction of outcome variance explained by confounding for sensitivity vector γ̃. Causal models selected in this way are often highly interpretable, in terms of either "worst case" effect sizes or established prior knowledge. For example, we can choose ω to be the L∞ norm, so that γ* is the sensitivity vector that minimizes the maximum absolute effect across contrasts. Alternatively, we could choose ω to be the L₁ or L₂ norm of the treatment effects to find models that imply small "typical" effect sizes. We demonstrate how this minimization approach can be used to express prior knowledge about small effects in simulated data in Section 7.2, and how it can be used to evaluate robustness on a real data set in Section 8.

In this section, we demonstrate our sensitivity analysis workflow in several numerical simulations. The goal of these simulations is twofold: first, to demonstrate some of the operating characteristics of the approach in settings that are more realistic than the linear Gaussian settings we characterized analytically; and secondly, to show how exploratory tools like calibration, robustness analysis, and MCCs can be used to draw conclusions and choose interesting candidate models.

We consider two broad simulation settings. In the first setting, we construct simulations with non-linear responses to treatment to show how the ignorance regions returned by our method can vary in different scenarios. In the second setting, we construct a simulation that mimics the structure of a Genome Wide Association Study (GWAS).
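Returning briefly to the criterion in Equation 42: for a scalar sensitivity parameter (m = 1) and effects that are linear in γ, the MCC search reduces to a one-dimensional constrained minimization. A grid-search sketch follows; the function name, and the absorption of σ_{y|t} into the inputs, are simplifications for illustration:

```python
import math

def mcc_l1_minimizer(naive, dmu, Sigma_u_t, r2_limit, n_grid=2001):
    # Grid search over a scalar sensitivity parameter gamma (m = 1).
    # For each contrast j, the implied effect is naive[j] - gamma * dmu[j]
    # (the outcome scale sigma_{y|t} is absorbed into dmu here), and the
    # constraint in Equation 42 becomes gamma^2 * Sigma_{u|t} <= R.
    g_max = math.sqrt(r2_limit / Sigma_u_t)
    best_g, best_val = 0.0, sum(abs(v) for v in naive)
    for i in range(n_grid):
        g = -g_max + 2.0 * g_max * i / (n_grid - 1)
        val = sum(abs(nv - g * dm) for nv, dm in zip(naive, dmu))
        if val < best_val:
            best_g, best_val = g, val
    return best_g, best_val
```

With a larger confounding limit R, the feasible range of γ widens and the attainable L1 norm of the implied effects can only shrink, mirroring the behavior described in Section 7.2.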
Here, we examine the behavior of our method when a popular approximate latent variable method, the Variational Auto Encoder (VAE), is used to estimate the effects of latent confounders, and demonstrate how MCCs can be useful tools for using prior information to choose potentially useful causal models from the set that is compatible with the observed data. In both subsections, we simulate data from the following generating process:

$$U := \epsilon_u, \qquad \epsilon_u \sim N_m(0, I_m), \qquad (43)$$
$$\tilde T := BU + \epsilon_{t|u}, \qquad \epsilon_{t|u} \sim N_k(0, \sigma^2_{t|u} I_k), \qquad (44)$$
$$T := h_T(\tilde T), \qquad (45)$$
$$\tilde Y := g(T) + \gamma' U + \epsilon_{y|t,u}, \qquad \epsilon_{y|t,u} \sim N(0, \sigma^2_{y|t,u}), \qquad (46)$$
$$Y := h_{Y|T}(\tilde Y). \qquad (47)$$

The functions h_Y and h_T are chosen to be either the identity for Gaussian data, or an indicator function for binary data.
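A minimal sketch of this generating process, specialized to a scalar confounder (m = 1) with h_T taken to be the identity; the function name and defaults are illustrative:

```python
import random

def simulate(n, B, gamma, g, sd_t_u=1.0, sd_y_tu=1.0, binary_y=False, seed=0):
    """Draw (U, T, Y) from Equations 43-47 with a scalar confounder (m = 1)
    and h_T the identity."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)                                     # Eq. 43
        t = [b * u + rng.gauss(0.0, sd_t_u) for b in B]             # Eqs. 44-45
        y_tilde = g(t) + gamma * u + rng.gauss(0.0, sd_y_tu)        # Eq. 46
        y = (1.0 if y_tilde > 0 else 0.0) if binary_y else y_tilde  # Eq. 47
        data.append((u, t, y))
    return data
```

Setting `binary_y=True` applies the indicator version of h_{Y|T}, producing the binary-outcome variant used below.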
Figure 4: Estimated ignorance regions for contrasts e_i when Ỹ is nonlinear in T. (a) Gaussian outcome: R²_{Y∼U|T} = 0 denotes the treatment effects estimated based on the observed data only, i.e., under the assumption of no unobserved confounding. The two curves with R²_{Y∼U|T} = 1 correspond to the case when all residual variation in Y is due to confounding, respectively denoting the upper and lower bounds of the ignorance region. (b) Binary outcome: even though the estimand is a non-linear function of the latent Gaussian outcome, the width of the ignorance region and the general robustness pattern are largely consistent with the implications of Corollary 3.1.

We start by exploring variation in the size of ignorance regions for different contrasts in a simple simulated example with four treatments where the outcome is a nonlinear function of these treatments. We consider two cases: first, a case where Y is Gaussian with h_Y(Ỹ) = Ỹ; and secondly, a case where Y is binary with h_Y(Ỹ) = I_{Ỹ > 0}. We aim to estimate PATE_{e_i,0} for the Gaussian outcome and RR_{e_i,0} for the binary outcome, where e_i denotes the i-th canonical vector, i.e., the vector with a 1 in the i-th coordinate and 0s elsewhere.

In both examples, we generate the data with a one-dimensional latent confounder (m = 1) and k = 4 treatments, with σ²_{t|u} = 1, σ²_{y|t,u} = 1, h_T(T̃) = T̃, a loading vector B whose first entry, 2, has the largest magnitude, and a nonlinear response function g(T) combining linear, quadratic, and indicator terms in the individual treatments. Based on the choice of g, contrasts along the j-th dimension of T have effects of widely varying magnitude. Based on our choice of B, the worst-case confounding bias also varies significantly across contrasts. For example, the effect of confounding is largest when estimating the effect of T₁, since the first entry of B has the largest magnitude, meaning T₁ is the treatment most correlated with U.
In order to demonstrate this in simulation, we first apply probabilistic PCA (PPCA) to estimate the distribution f(u|t), and then model f(y|t) using Bayesian additive regression trees (BART) with the R package BART (McCulloch et al., 2018).

For Gaussian outcomes, the widths of the ignorance regions are larger for the treatments most correlated with confounders, as characterized by Theorem 4 (see Figure 4). Since B is a vector, the width of the ignorance region of PATE_{t₁,t₂} can be examined by looking at the dot product between B and the treatment contrast: the larger the dot product, the wider the ignorance region. As expected, the ignorance region of the treatment effect is widest when ∆t = e₁ and narrowest when ∆t = e₄, since B′e₁ has the largest magnitude while B′e₄ has the smallest. Despite the fact that ∆t = e₄ has the smallest ignorance region, it is not robust to confounding because the naive effect is already close to zero (RV ≈ 0). For the second and third treatment contrasts, estimates are robust to confounders, as their entire ignorance regions exclude 0.

For the simulation with binary outcomes, we compute ignorance regions for the risk ratio. Although we do not have a theoretical result about the ignorance regions of the risk ratio, the general trends in the size of the ignorance region and the robustness of effects are comparable to the Gaussian case. Most notably, the treatments with the largest ignorance regions are still those which are most correlated with the confounder. On the other hand, because the outcome is non-linear in U, the naive estimate is not at the center of the ignorance region (Figure 4b). In fact, the ignorance region is also non-monotone in R²_{Y∼U|T} because the variance of the intervention distribution also depends on γ. In this case, one of the endpoints of the ignorance region corresponds to R²_{Y∼U|T} = 1 but the other does not.
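The PPCA step admits a closed form (Tipping and Bishop): the factor loading points along the top eigenvector of the sample covariance, and the noise variance is the average of the remaining eigenvalues. The following is a pure-Python sketch for a single latent factor, not the implementation used here:

```python
import math

def ppca_posterior(X, n_iter=200):
    # One-factor probabilistic PCA (m = 1), Tipping-Bishop closed form:
    # B-hat points along the top eigenvector of the sample covariance S,
    # with ||B-hat||^2 = lambda_1 - sigma^2, and sigma^2 equal to the
    # average of the remaining eigenvalues.  Returns the moments of
    # f(u | t): mean B'(BB' + sigma^2 I)^{-1} t and variance Sigma_{u|t}.
    n, k = len(X), len(X[0])
    S = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(k)]
         for a in range(k)]
    v = [1.0] * k                      # power iteration for the top eigenpair
    for _ in range(n_iter):
        w = [sum(S[a][b] * v[b] for b in range(k)) for a in range(k)]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    lam1 = sum(v[a] * sum(S[a][b] * v[b] for b in range(k)) for a in range(k))
    sigma2 = (sum(S[a][a] for a in range(k)) - lam1) / (k - 1)
    B = [math.sqrt(max(lam1 - sigma2, 0.0)) * x for x in v]
    denom = sum(b * b for b in B) + sigma2
    mean_fn = lambda t: sum(b * x for b, x in zip(B, t)) / denom
    return B, sigma2, mean_fn, sigma2 / denom
```

The returned posterior moments correspond to f(u|t) = N(B′(BB′ + σ²I)⁻¹t, Σ_{u|t}), the quantities required by the sensitivity analysis.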
We compute the endpoints of the ignorance region numerically (see Appendix B.3 for more details).

We now explore a slightly more complex setting motivated by applications in biology, particularly in genome-wide association studies (GWAS). GWAS investigate the association between hundreds or thousands of genetic features (i.e., single nucleotide polymorphisms, or SNPs) and observable traits (i.e., phenotypes), such as disease status. Here, we construct a simulated GWAS to demonstrate two properties of our sensitivity analysis method. First, we show that flexible latent variable models can be plugged into our sensitivity model. Secondly, we demonstrate how minimizing multiple contrast criteria (MCCs) can be used to select interesting candidate models that conform to broad hypotheses about the nature of genetic effects.

Despite having "association" in the name, GWAS is a particularly interesting application area for multi-treatment causal inference. In practice, measures of association in GWAS are often adjusted to afford a causal interpretation in which conclusions speak to how a phenotype would change if the genome were intervened upon. For example, most analyses adjust for "population structure", which corresponds to broad genetic patterns induced by population dynamics that are often confounded with geography, ancestry, environment, and other lifestyle factors (Price et al., 2006; Song et al., 2015). Wang and Blei (2018) cite this literature as motivation for their work.

In this simulation, we generate data with high-dimensional binary treatments (SNPs), and set the true causal effects to be mostly small, with a small fraction of treatments having effects of larger magnitude. The simulation is then designed so that unobserved confounding biases estimates of each of these treatment effects, obscuring the difference between large and small effects. To generate data, we follow the template in Equations 43-47.
We generate data with m = 3 latent confounders and k = 500 treatments, T ∈ {0, 1}^k, where T_j = 1 if the j-th site shows a deviation from the baseline sequence (i.e., the presence of at least one minor allele). We set the response function g(T) = τ′T to be linear in the treatments (a common assumption in GWAS), and set the outcome Y to be Gaussian by setting h_Y(Ỹ) = Ỹ. We focus on estimating

$$\frac{1}{n}\sum_{i=1}^n \mathrm{PATE}_{t_{ji},\,t_{-ji}} \quad \text{for all } j = 1, \ldots, k, \qquad (48)$$

where t_{ji} and t_{−ji} correspond to the i-th observed treatment vector with the j-th SNP set to 1 and 0, respectively. Note that since g(T) is linear in T, (1/n)∑_{i=1}^n PATE_{t_{ji},t_{−ji}} = τ_j, the j-th element of τ. We generate τ from a two-component mixture, with 90% of the coefficients drawn from a uniform distribution on a narrow interval (small effects) and 10% drawn from a uniform distribution on a wider interval (large effects).

We consider a model for the observed data with two components, paying special attention to the latent confounder model. In particular, we model the conditional distribution of confounders given treatments, f(u|t), using a variational autoencoder (VAE), which is a popular, flexible neural network-based approximate latent variable model. This model is particularly appropriate because it yields an approximate Gaussian conditional distribution f(u|t), even for discrete T as we have here. (We discuss latent confounder inference with VAEs in more detail in Appendix B.2.) We fit the observed outcome model f(y|t) using a simple linear regression, ignoring confounding, which corresponds to the setting in which R²_{Y∼U|T} = 0.

Worst-Case Ignorance Regions.
With this simulation setup, we first examine whether the ignorance regions contain the true causal effects. Importantly, because the VAE is an approximate latent variable model, and we are currently ignoring estimation uncertainty, it is not immediate that the ignorance regions should be valid. We find that, even using our plug-in approach, the worst-case ignorance regions cover 498 out of 500 of the true treatment effects. In all cases, the worst-case bounds communicate substantial fundamental uncertainty about the true treatment effects (see Appendix Figure 8).

Figure 5: Causal inference with 500 binary treatments and m = 3 latent confounders. Naive estimates of the null and non-null effects are overdispersed due to confounding. (a) Minimum L1-norm treatment effects are shown for fifty randomly chosen small effects ("null" contrasts) and all large effects ("non-null" contrasts) for three different limits on the magnitude of confounding, R. When R = 1, the overall L1 minimizer of the treatment effects is achieved for the sensitivity vector which explains R²_{Y∼U|T} = 74% of the residual outcome variance. (b) We construct a simple non-null classifier from minimum L1 treatment effects with R = 1 and naive effects (R = 0). The blue curve represents the ROC curve from the naive estimates, and the green, yellow, and red curves represent the L1-minimized treatment effect estimates for inferred confounder models with dimensions k̂ ∈ {2, 3, 4}. The area under the curve (AUC) for the naive estimates is 0.54, whereas the AUCs for the L1-minimized estimates are 0.61 (k̂ = 2), 0.73 (k̂ = 3), and 0.64 (k̂ = 4).

Finding Candidate Models with MCCs.
Investigators often have strong hypotheses about the aggregate properties of SNP treatment effects. For example, while some phenotypes can be predominantly explained by only a small number of SNPs, other phenotypes may be more plausibly described by the omnigenic hypothesis, which suggests that some observable effects must be explained by the sum of many small effects across many SNPs (Boyle et al., 2017). Here, we show that some of these aggregate hypotheses can be formalized in terms of MCCs, and in these cases, the MCC minimization procedure from Section 6.3 can be used to find useful candidate causal models that align with these hypotheses while being fully consistent with the observed data.

To motivate candidate model selection, we consider the use case of estimating effect sizes from a single coherent model, under the hypothesis that the median effect size is small. Specifically, we formalize this hypothesis by defining an MCC ω(PATE_T(γ)) to be the L1 norm of the effects of the contrasts {(t_{ji}, t_{-ji}) : i = 1, ..., n} for all treatments j = 1, ..., k. We then select the model that minimizes this criterion by selecting γ subject to different allowed levels of confounding R²_{Y∼U|T}.

In Figure 5a, we plot the resulting coefficient estimates for three values of R²: 0 (the naive effects), an intermediate limit, and 1. Because the true effects are much smaller in magnitude than the naive effects, the RMSE of the estimates decreases as we increase R²_{Y∼U|T}, although all effects are equally compatible with the observed data. In this simulation, the L1 norm of the naive estimates is approximately 2525, while the L1 norm of the true effects is drastically smaller, at approximately 75.

Models selected using this MCC minimization procedure are also useful for the coarser goal of separating small and large effects.
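The MCC minimization itself is a small constrained optimization: choose the sensitivity vector γ to minimize the L1 norm of the bias-adjusted effects, subject to the confounding budget γ'Σ_{u|t}γ ≤ σ²_{y|t}R². A sketch of this idea follows; the matrix A mapping γ to per-treatment bias, and all dimensions and scales, are illustrative placeholders rather than quantities from the paper:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, k = 3, 50
A = rng.normal(size=(m, k))          # hypothetical map from gamma to per-treatment bias
Sigma_ut = 0.3 * np.eye(m)           # conditional confounder covariance Sigma_{u|t}
tau_naive = rng.normal(size=k)       # stand-in naive regression coefficients
sigma2_y, R2 = 1.0, 1.0              # residual outcome variance and confounding limit

def l1_mcc(gamma):
    """MCC: L1 norm of the treatment effects implied by sensitivity vector gamma."""
    return np.abs(tau_naive - A.T @ gamma).sum()

budget = {"type": "ineq", "fun": lambda g: sigma2_y * R2 - g @ Sigma_ut @ g}
res = minimize(l1_mcc, np.zeros(m), method="SLSQP", constraints=[budget])
gamma_star = res.x                   # candidate model: no worse than gamma = 0
```

By construction the selected candidate model has aggregate effects at least as small as the naive ones (γ = 0 is feasible), while remaining observationally equivalent to them.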
From the naive regression, the coefficients are overdispersed relative to the true causal effects, and the true small coefficients are practically indistinguishable from the true large coefficients. Meanwhile, models chosen with the MCC minimization procedure provide more useful signal. To formalize this, we consider a classifier that separates large and small effects using the magnitude of the inferred coefficients as the classification score. In Figure 5 we plot the receiver operating characteristic (ROC) curves for the classifiers based on the naive estimates as well as the overall L1 minimizer of the treatment effects (R² = 1, i.e., no limit on the value of R²_{Y∼U|T}).

Importantly, the difference in conditional confounder means, µ_{u|t_{ji}} − µ_{u|t_{-ji}}, varies between non-null and null contrasts. This leads to a larger reduction in the relative magnitude of the null effects for models chosen through MCC minimization, accentuating the differences between large and small treatment effects (see Appendix Figure 9). For models selected by MCC minimization, the area under the ROC curve (AUC) increases from 0.54 (almost no ability to distinguish small and large treatments) to 0.72 (k̂ = 3, red curve). The selected model achieves nearly a 25% true positive rate without accruing any false positives. Naturally, the classifier performance is best when we fit a latent variable model with the correct number of latent factors, although the classifiers based on latent variable models of dimensions k̂ = 2 and k̂ = 4 still outperform classification from the naive effects. In the Discussion, we note how this approach relates to, and complements, recent identification results for a similar setting in Miao et al. (2020).

In this section, we compare our approach to other recent analyses of the TMDB 5000 Movie Dataset (Kaggle, 2017), which was analyzed extensively by Wang and Blei (2018) and Grimmer et al. (2020).
The dataset consists of 5000 movies and their corresponding revenue, budget, genre, and the identities of the lead cast members. Following Wang and Blei, we focus on estimating the causal effect of an actor's presence on the movie's log revenue. We let Y denote the log revenue and T_i = (T_{i1}, ..., T_{ik}) encode the movie cast, where the binary random variable T_{ij} ∈ {0, 1} indicates whether actor j appeared in movie i, and T_i ∈ 𝒯 = {T_1, ..., T_n}. We also let 𝒯_j denote the set of all movies T_i for which T_{ij} = 1. We define the estimand of interest, η_j, as the total log revenue contributed by actor j:

  η_j := ∑_{t_i ∈ 𝒯_j} PATE_{t_i, t_{-ji}},   (49)

where t_{-ji} corresponds to the observed treatment vector for movie i excluding actor j. This estimand is a non-parametric generalization of the regression coefficient τ_j, which was targeted in the analysis of Wang and Blei (2018). Specifically, under the assumption that log revenue is linear in the cast indicators, η_j reduces to n_j τ_j, the effect of actor j scaled by the number of movies they appeared in, where τ_j is the regression coefficient for actor j. Our estimand is well-defined without this linearity assumption.

We regress the log revenue on cast indicators to estimate actor effects, τ_j^naive, under an assumption of no unobserved confounding. In order to demonstrate the applicability of our sensitivity analysis, we explicitly induce unobserved confounding by excluding observed confounders. We validate our analysis by comparing calibrated effect estimates when the confounders are excluded to estimates when the confounder is included. Most importantly, we exclude the movie's budget, which we estimate to be the largest known source of confounding (computed using Equation 38; see Appendix Figure 10a).

For simplicity, we model the observed outcome distribution with a linear regression, although other more flexible outcome models (e.g., BART) can also be used.
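The reduction of η_j to n_jτ_j under linearity is easy to verify directly: with a linear response y(t) = τ't, each PATE_{t_i, t_{-ji}} equals τ_j, so summing over the movies in 𝒯_j gives n_jτ_j. A small numerical check with hypothetical cast indicators and effects:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 10
T = (rng.uniform(size=(n, k)) < 0.3).astype(float)   # hypothetical cast indicators
tau = rng.normal(size=k)                             # hypothetical linear effects
j = 0

movies_with_j = T[T[:, j] == 1]                      # the set of movies in T_j
drop_j = movies_with_j.copy()
drop_j[:, j] = 0.0                                   # counterfactual casts t_{-ji}

# Under linearity, PATE_{t_i, t_{-ji}} = tau' t_i - tau' t_{-ji} = tau_j for each movie
eta_j = ((movies_with_j - drop_j) @ tau).sum()
assert np.isclose(eta_j, len(movies_with_j) * tau[j])
```

Without the linearity assumption the per-movie contrasts need not be equal, which is why η_j rather than τ_j is taken as the estimand.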
As in the previous section, we use a VAE to infer a Gaussian conditional confounder distribution, f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}) (see Appendix B.2).

Results.
Since our focus is on confounding rather than estimation, in order to limit the influence of estimation uncertainty we subset the data to the k = 327 actors who participated in at least twenty movies. This reduces the total number of movies to 2439. We fit the VAE to the treatments and use cross-validation to identify the appropriate latent confounder dimension, which we inferred to be m̂ = 20 (see Appendix Figure 10b). We then plot the worst-case ignorance region for the causal effect on log revenue as a function of R²_{Y∼U|T} for the 46 actors with significant regression coefficients in the naive regression (Figure 6, top). (For illustrative purposes, we assume that the budget is pre-treatment, meaning that the budget is decided prior to selecting the cast, which may be a dubious assumption in actuality.) Only a few actors have appreciable robustness values, with the majority of the other actors well below them. For reference, the log budget, which was explicitly excluded from our causal analysis, explains about 30% of the variance in log revenue (Appendix Figure 10a). In other words, none of the causal effects are robust at a level which matches the variance explained by the most important excluded confounder.
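Under the Gaussian model, the worst-case ignorance region for a single contrast is an interval centered at the naive estimate with half-width σ_{y|t}√(R² · µ'_{u|∆t}Σ⁻¹_{u|t}µ_{u|∆t}) (Theorem 3). A sketch of the per-actor computation, with all inputs as illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 20                                   # inferred latent confounder dimension
mu_dt = rng.normal(scale=0.2, size=m)    # E[U | contrast] for one actor (illustrative)
Sigma_ut = 0.1 * np.eye(m)               # conditional confounder covariance
tau_naive_j, sigma2_y = 1.5, 1.0         # naive estimate and residual outcome variance

def ignorance_region(R2):
    """Worst-case interval for the effect at confounding level R^2 (Theorem 3)."""
    half = np.sqrt(sigma2_y * R2 * (mu_dt @ np.linalg.solve(Sigma_ut, mu_dt)))
    return tau_naive_j - half, tau_naive_j + half

lo0, hi0 = ignorance_region(0.0)         # collapses to the naive point estimate
lo1, hi1 = ignorance_region(1.0)         # widest interval: no limit on confounding
```

Plotting these intervals as a function of R² reproduces the fan shapes in the top panel of Figure 6; one natural summary (analogous to a robustness value) is the smallest R² at which the interval includes zero.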
Figure 6: Estimated total log revenue contributed by a given actor. Top: worst-case ignorance region for each actor on a case-by-case basis. The blue points correspond to R²_{Y∼U|T} = 0, i.e., the naive estimates. Robustness values can be found in Appendix Table 1. Bottom: treatment effects for candidate models chosen with the L2-minimizing multiple contrast criterion (MCC). The colors correspond to R², the limit on the fraction of outcome variance explained by confounding.

The worst-case ignorance regions depicted in the top panel of Figure 6 correspond to a different choice of γ for each actor. We can also explore the robustness of causal effects under a single model by applying an appropriate MCC. Specifically, we search for a "worst-case" candidate model by finding the sensitivity vector, γ*, that implies the smallest L2 norm of the regression coefficients, τ.
In this conservative model, the minimum L2 norm of the treatment coefficients is 4.4, down from 7.6 for the naive coefficients. In addition, 40 out of 46 actors have coefficients that are smaller in magnitude than the magnitudes of the naive coefficients (Figure 6, bottom). For this candidate model, it turns out that γ*'E[U | T = t] is significantly correlated, albeit weakly, with budget (Spearman's rank correlation = 0.2, with a p-value far below conventional significance thresholds). Thus, the conservative model correctly attributes part of the outcome variation induced by the known excluded confounder to unobserved confounding.

Discussion

In this paper, we introduced a framework for sensitivity analysis with multiple treatments which provides further context to the growing literature on the challenges of inference in this setting. Unlike previous work, we emphasize the importance of carefully defined estimands and show that bounds on the magnitude of confounding bias depend on the particular estimands of interest. Our work also provides a practical solution to characterizing and calibrating the robustness of causal effects across multiple treatments in the presence of unobserved confounding. Code to replicate all analyses is available (Zheng, 2021b) and an R package implementing our methodology is also available and in active development (Zheng, 2021a).

There are several interesting generalizations of our proposed approach, many of which center on assumptions about the copula. For example, in many contexts we may be interested in accounting for potential treatment–confounder interactions, in which case the copula, c(F_{Y|t}(y), F_{U|t}(u) | T = t), will vary with t.
Likewise, as noted, our approach can be applied with non-Gaussian copulas and alternative latent variable models (e.g., latent class models), but calibration and model specification remain a challenge. Generalizations based on joint inference of the treatment and outcome models should also be explored. In practice, causal effect estimates may be overdispersed about the true effects due to a combination of sampling variance and unobserved confounding. Joint inference might be especially useful for accounting for both estimation uncertainty and uncertainty due to unobserved confounding. This is particularly important for the multiple contrast criteria which, as described, do not incorporate parameter uncertainty into the objective function. A simple solution in a Bayesian analysis would be to characterize posterior uncertainty by applying the criteria to each MCMC sample of the naive causal effect estimates. A more thorough exploration of the interplay between shrinkage estimation (in the classical sense) and MCC shrinkage to adjust for confounding bias under the "small effects" hypothesis would be interesting in this context.

Finally, we note that there are many promising directions for incorporating additional constraints into the calibration criteria. One direction of particular interest is to follow the line of research on negative control exposures (NCEs) (Shi et al., 2020). A set of NCEs would be a subset of causes known a priori to have zero (or bounded) causal effect on the outcome, for example genes in a GWAS which are known to be causally unrelated to a particular phenotype. Incorporating these assumptions would further constrain the ignorance regions for particular causal contrasts of interest. These strategies are closely related to the multivariate calibration procedure that we propose. For example, Miao et al.
(2020) describe a procedure for identifying the treatment effects when over half of the treatments are assumed to have no causal effect on the outcome. In Equation 42, this would be analogous to the case in which the criterion ω is the L0 norm, the number of non-zero effects. It is also worth further exploring the relationship between inference with multiple treatments and inference with multiple outcomes. Additional structure in the correlation between outcomes could further constrain the causal ignorance regions, under the right set of assumptions. As with NCEs, negative control outcomes (NCOs) could be applied to calibrate the sensitivity analysis. We leave this to future work.

References
Bica, I., A. M. Alaa, C. Lambert, and M. van der Schaar (2020). From real-world patient data to individualized treatment effects using machine learning: Current and future methods to address underlying challenges. Clinical Pharmacology & Therapeutics.

Boyle, E. A., Y. I. Li, and J. K. Pritchard (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169(7), 1177–1186.

Cinelli, C., J. Ferwerda, and C. Hazlett (2020). sensemakr: Sensitivity analysis tools for OLS in R and Stata. Submitted to the Journal of Statistical Software.

Cinelli, C. and C. Hazlett (2019). Making sense of sensitivity: extending omitted variable bias. Journal of the Royal Statistical Society Series B (Statistical Methodology).

Cinelli, C., D. Kumor, B. Chen, J. Pearl, and E. Bareinboim (2019). Sensitivity analysis of linear structural causal models. In ICML.

Cornfield, J., W. Haenszel, E. C. Hammond, A. M. Lilienfeld, M. B. Shimkin, and E. L. Wynder (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. J. Nat. Cancer Inst 22, 173–203.

D'Amour, A. (2019a). Comment: Reflections on the deconfounder. Journal of the American Statistical Association 114(528), 1597–1601.

D'Amour, A. (2019b). On multi-cause approaches to causal inference with unobserved confounding: Two cautionary failure cases and a promising alternative. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3478–3486.

Daniels, M. J. and J. W. Hogan (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. CRC Press.

Dorie, V., M. Harada, N. B. Carnegie, and J. Hill (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine 35(20), 3453–3470.

Everett, B. (2013). An Introduction to Latent Variable Models. Springer Science & Business Media.

Franks, A., A. D'Amour, and A. Feller (2019). Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association (just-accepted), 1–38.

Gastwirth, J. L., A. M. Krieger, and P. R. Rosenbaum (1998). Dual and simultaneous sensitivity analysis for matched pairs. Biometrika 85(4), 907–920.

Gavish, M. and D. L. Donoho (2014). The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory 60(8), 5040–5053.

Ghosh, P., M. S. M. Sajjadi, A. Vergari, M. Black, and B. Scholkopf (2020). From variational to deterministic autoencoders. In International Conference on Learning Representations.

Gopalan, P., J. M. Hofman, and D. M. Blei (2013). Scalable recommendation with Poisson factorization. arXiv preprint arXiv:1311.1704.

Greenland, S. (1996). Basic methods for sensitivity analysis of biases. International Journal of Epidemiology 25(6), 1107–1116.

Grimmer, J., D. Knox, and B. M. Stewart (2020). Naive regression requires weaker assumptions than factor models to adjust for multiple cause confounding. arXiv preprint arXiv:2007.12702.

Gustafson, P., L. C. McCandless, et al. (2018). When is a sensitivity parameter exactly that? Statistical Science 33(1), 86–95.

Hao, W., M. Song, and J. D. Storey (2015). Probabilistic models of genetic variation in structured populations applied to global human studies. Bioinformatics 32(5), 713–721.

Horn, R. (1985). Matrix Analysis. Cambridge University Press.

Imbens, G. W. (2003). Sensitivity to exogeneity assumptions in program evaluation. American Economic Review 93(2), 126–132.

Kaggle (2017, Sep). TMDB 5000 movie dataset. Data retrieved from Kaggle.

Kong, D., S. Yang, and L. Wang (2019). Multi-cause causal inference with unmeasured confounding and binary outcome. arXiv preprint arXiv:1907.13323.

Lechner, M. (1999, December). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. IZA Discussion Papers 91, Institute of Labor Economics (IZA).

Lopez, M. J., R. Gutman, et al. (2017). Estimation of causal effects with multiple treatments: a review and new ideas. Statistical Science 32(3), 432–454.

Lopez, R., P. Boyeau, N. Yosef, M. I. Jordan, and J. Regier (2020). Decision-making with auto-encoding variational Bayes. arXiv preprint arXiv:2002.07217.

Louizos, C., U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling (2017). Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pp. 6446–6456.

Mardia, K. V., J. T. Kent, and J. M. Bibby (1980). Multivariate Analysis. Academic Press.

McCulloch, R., R. Sparapani, R. Gramacy, C. Spanbauer, and M. Pratola (2018). BART: Bayesian Additive Regression Trees. R package version 1.9.

Miao, W., Z. Geng, and E. J. Tchetgen Tchetgen (2018). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika 105(4), 987–993.

Miao, W., W. Hu, E. L. Ogburn, and X. Zhou (2020). Identifying effects of multiple treatments in the presence of unmeasured confounding.

Minka, T. P. (2001). Automatic choice of dimensionality for PCA. In Advances in Neural Information Processing Systems, pp. 598–604.

Nelsen, R. B. (2007). An Introduction to Copulas. Springer Science & Business Media.

Ogburn, E. L., I. Shpitser, and E. J. T. Tchetgen (2019). Comment on "Blessings of multiple causes". Journal of the American Statistical Association 114(528), 1611–1615.

Ogburn, E. L., I. Shpitser, and E. J. T. Tchetgen (2020). Counterexamples to "The blessings of multiple causes" by Wang and Blei. arXiv preprint arXiv:2001.06555.

Pearl, J. (2009). Causality. Cambridge University Press.

Price, A. L., N. J. Patterson, R. M. Plenge, M. E. Weinblatt, N. A. Shadick, and D. Reich (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38(8), 904–909.

Pu, Y., Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin (2016). Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pp. 2352–2360.

Robins, J. M. (1997). Non-response models for the analysis of non-monotone non-ignorable missing data. Statistics in Medicine 16(1), 21–37.

Robins, J. M., A. Rotnitzky, and D. O. Scharfstein (2000). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pp. 1–94. Springer.

Rosenbaum, P. R. and D. B. Rubin (1983). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society. Series B (Methodological), 212–218.

Rubin, D. B. (1980). Comment. Journal of the American Statistical Association 75(371), 591–593.

Shi, X., W. Miao, and E. T. Tchetgen (2020). A selective review of negative control methods in epidemiology. arXiv preprint arXiv:2009.05641.

Song, M., W. Hao, and J. D. Storey (2015). Testing for genetic associations in arbitrarily structured populations. Nature Genetics 47(5), 550–554.

Tipping, M. E. and C. M. Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3), 611–622.

VanderWeele, T. J. and O. A. Arah (2011). Bias formulas for sensitivity analysis of unmeasured confounding for general outcomes, treatments, and confounders. Epidemiology, 42–52.

VanderWeele, T. J. and P. Ding (2017, July). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine 167(4), 268.

VanderWeele, T. J., B. Mukherjee, and J. Chen (2012). Sensitivity analysis for interactions under unmeasured confounding. Statistics in Medicine 31(22), 2552–2564.

VanderWeele, T. J. and I. Shpitser (2013). On the definition of a confounder. Annals of Statistics 41(1), 196.

Vansteelandt, S., E. Goetghebeur, M. G. Kenward, and G. Molenberghs (2006). Ignorance and uncertainty regions as inferential tools in a sensitivity analysis. Statistica Sinica 16(3), 953–979.

Veitch, V. and A. Zaveri (2020). Sense and sensitivity analysis: Simple post-hoc analysis of bias due to unobserved confounding. arXiv preprint arXiv:2003.01747.

Wang, Y. and D. M. Blei (2018, May). The blessings of multiple causes. arXiv e-prints, arXiv:1805.06826.

Zhang, B. and E. J. T. Tchetgen (2019). A semiparametric approach to model-based sensitivity analysis in observational studies. arXiv preprint arXiv:1910.14130.

Zhang, L., Y. Wang, A. Ostropolets, J. J. Mulgrave, D. M. Blei, and G. Hripcsak (2019). The medical deconfounder: Assessing treatment effects with electronic health records. arXiv preprint arXiv:1904.02098.

Zheng, J. (2021a). CopSens: Copula-based sensitivity analysis method for unobserved confounding in multi-treatment inference. https://github.com/JiajingZ/CopSens.

Zheng, J. (2021b). Replication code for "Copula-based sensitivity analysis for observational multi-treatment causal inference". https://github.com/JiajingZ/CopulaSensitivity.
A Theory
A.1 General Contrast Estimation Algorithm
Algorithm 2: Marginal Contrast Estimation for Arbitrary Copulas

  function ComputeMean(t, ψ):
      for k = 1, 2, ..., M:
          Sample y_k from f(y | t)
          for i = 1, 2, ..., n:
              for j = 1, 2, ..., N:
                  Sample u_ij from f(u | t_i)
                  Compute c_ij ← c_ψ(y_k, u_ij | t)
          Compute w_k ← (1/(nN)) ∑_{ij} c_ij
      return (1/M) ∑_k ν(y_k) w_k
  end function

  Return τ(ComputeMean(t₁, ψ), ComputeMean(t₂, ψ))

A.2 Derivation of Algorithm 1
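Algorithm 2 can be transcribed directly into a small Monte Carlo routine; the samplers and copula density below are user-supplied stand-ins, and the sanity check uses the independence copula (density identically one), for which the routine reduces to plain Monte Carlo over the outcome model:

```python
import numpy as np

def compute_mean(nu, sample_y, sample_u_given, copula_density, ts, M=2000, N=20):
    """Monte Carlo sketch of ComputeMean in Algorithm 2.

    nu               -- functional of interest, nu(y)
    sample_y()       -- draws from the observed outcome model f(y | t)
    sample_u_given   -- sample_u_given(t_i) draws from f(u | t_i)
    copula_density   -- c_psi(y, u | t), the copula density term
    ts               -- the observed treatment vectors t_1, ..., t_n
    """
    total = 0.0
    for _ in range(M):
        y = sample_y()
        c = [copula_density(y, sample_u_given(ti)) for ti in ts for _ in range(N)]
        total += nu(y) * np.mean(c)      # w_k = (1 / nN) * sum_ij c_ij
    return total / M

# Sanity check: independence copula, standard normal outcome, nu(y) = y^2
rng = np.random.default_rng(7)
est = compute_mean(nu=lambda y: y ** 2,
                   sample_y=lambda: rng.normal(),
                   sample_u_given=lambda t: rng.normal(),
                   copula_density=lambda y, u: 1.0,
                   ts=[0.0], M=4000, N=1)
```

With the independence copula all weights equal one, so est should be close to E[Y²] = 1 for a standard normal outcome model.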
Given Equations 10 and 11, we can write

  E[v(Y) | do(t)] = ∫∫ v(y) f(y | ỹ) w_ψ(ỹ, t) f(ỹ | t) dỹ dy,   (50)

where w_ψ(ỹ, t) ≈ (1/|𝒯|) ∑_{t_i ∈ 𝒯} [ ∫ c_ψ(F_{Ỹ|t}(ỹ), F_{U|t}(u) | t) f(u | t_i) du ]. To verify Algorithm 1, we only need to show that

  ∫ f(ỹ | t, u) f(u | t_i) du ∼ N(γ'(µ_{u|t_i} − µ_{u|t}), 1),   (51)

where f(ỹ | t, u) = f(ỹ | t) c_ψ(F_{Ỹ|t}(ỹ), F_{U|t}(u) | t). According to Equations 13 and 14, we know that

  f(u | t_i) ∼ N(µ_{u|t_i}, Σ_{u|t}),   (52)
  f(ỹ | t, u) ∼ N(γ'(u − µ_{u|t}), σ²_{y|t,u}).   (53)

By integrating out U, we have

  ∫ f(ỹ | t, u) f(u | t_i) du = (2π(σ²_{y|t,u} + γ'Σ_{u|t}γ))^{−1/2} exp{ −(ỹ − γ'(µ_{u|t_i} − µ_{u|t}))² / (2(σ²_{y|t,u} + γ'Σ_{u|t}γ)) },   (54)

where σ²_{y|t,u} + γ'Σ_{u|t}γ = 1.

A.3 Proof of Theorem 1
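The unit-variance Gaussian in Equation 54 can be checked by direct simulation: draw u from f(u | t_i), then ỹ from f(ỹ | t, u), and compare the resulting moments. The parameter values below are arbitrary, subject only to the constraint σ²_{y|t,u} + γ'Σ_{u|t}γ = 1:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n_draws = 3, 200_000
gamma = np.array([0.4, -0.3, 0.2])
Sigma_ut = 0.5 * np.eye(m)                       # conditional confounder covariance
sigma2_ytu = 1.0 - gamma @ Sigma_ut @ gamma      # enforce total variance 1
mu_t = np.zeros(m)                               # mu_{u|t}
mu_ti = np.array([1.0, -1.0, 0.5])               # mu_{u|t_i}

# u ~ f(u | t_i) = N(mu_{u|t_i}, Sigma_{u|t}); then y ~ f(y_tilde | t, u)
u = mu_ti + rng.normal(size=(n_draws, m)) @ np.linalg.cholesky(Sigma_ut).T
y = (u - mu_t) @ gamma + rng.normal(scale=np.sqrt(sigma2_ytu), size=n_draws)

# Equation 54: the marginal should be N(gamma'(mu_{u|t_i} - mu_{u|t}), 1)
assert abs(y.mean() - gamma @ (mu_ti - mu_t)) < 0.02
assert abs(y.var() - 1.0) < 0.02
```

The mean of the mixture shifts with the conditional confounder mean while the total variance stays at one, which is exactly why the sensitivity analysis operates through mean shifts γ'(µ_{u|t_i} − µ_{u|t}).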
Theorem 1.
Assume model 13–15. Let [ψ_T] = { ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊ }, where S₊ is the space of symmetric positive definite matrices. Then [ψ_T] is a causal equivalence class.

Proof. The intervention distribution for ỹ is defined as

  f_ψ(ỹ | do(t)) = ∫ [ ∫ f_{ψ_Y}(ỹ | t, u) f_{ψ_T}(u | t̃) du ] f(t̃) dt̃,   (55)

where ψ_Y = γ and ψ_T = {µ_{u|t}, Σ_{u|t}}. Then ∫ f_γ(ỹ | t, u) f_{ψ_T}(u | t̃) du ∼ N(γ'(µ_{u|t̃} − µ_{u|t}), 1) for any γ such that γ'Σ_{u|t}γ ≤ 1 (see Equation 54). Let ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} ∈ [ψ_T], where A ∈ S₊ is a positive definite matrix, and assume ψ̃_Y = γ̃. Then

  ∫ f_γ̃(ỹ | t, u) f_{ψ̃_T}(u | t̃) du ∼ N(γ̃'A(µ_{u|t̃} − µ_{u|t}), 1).   (56)

Let γ̃ = A⁻¹γ, a bijective mapping from γ to γ̃. For any γ and positive definite A, we have γ̃'AΣ_{u|t}Aγ̃ = γ'Σ_{u|t}γ ≤ 1, so that γ̃ is a valid copula parameter. In addition, ∫ f_γ(ỹ | t, u) f_{ψ_T}(u | t̃) du = ∫ f_γ̃(ỹ | t, u) f_{ψ̃_T}(u | t̃) du, which implies f_{γ,ψ_T}(ỹ | do(t)) = f_{γ̃,ψ̃_T}(ỹ | do(t)). Since Y is a deterministic function of Ỹ, this implies f_{γ,ψ_T}(y | do(t)) = f_{γ̃,ψ̃_T}(y | do(t)). Therefore, [ψ_T] is a causal equivalence class.

A.4 Proof of Theorem 2
Theorem 2.
Suppose that the observed data is generated by model 17–19. When there are k treatments with 0 < m < k, then ψ_T is identified up to the causal equivalence class [ψ_T] = { ψ̃_T = {Aµ_{u|t}, AΣ_{u|t}A} : A ∈ S₊ }. When there is a single treatment (k = 1) or m = k confounders, then ψ_T is not identifiable up to causal equivalence class.

Proof. The sample covariance matrix of the treatments is a consistent estimator for BΣ_uB' + σ²_{t|u}I_k, the covariance matrix of T. As long as 0 < m < k, the k-th (smallest) eigenvalue is a consistent estimator for σ²_{t|u}. BΣ_uB' is identified by the first m eigenvectors and eigenvalues of the sample covariance matrix, and the span of the first m eigenvectors is a consistent estimator for the span of B. Under model 17–19, the conditional distribution of the confounder U is

  f(u | t) ∼ N(µ_{u|t}, Σ_{u|t}),   (57)

where µ_{u|t} := Σ_uB'(BΣ_uB' + σ²_{t|u}I_k)⁻¹t and Σ_{u|t} := Σ_u − Σ_uB'(BΣ_uB' + σ²_{t|u}I_k)⁻¹BΣ_u, and the intervention distribution is

  f_{γ,B,Σ_u}(y | do(T = t)) ∼ N((τ_naive − (BΣ_uB' + σ²_{t|u}I_k)⁻¹BΣ_uγ)'t, σ²_{y|t,u} + γ'Σ_uγ),   (58)

where all m-vectors γ which satisfy γ'Σ_{u|t}γ ≤ σ²_{y|t} are valid sensitivity parameters.

Let B̃ = BA and Σ̃_u = A⁻¹Σ_uA⁻ᵀ for an arbitrary positive definite matrix A, so that B̃Σ̃_uB̃' + σ²_{t|u}I_k = BΣ_uB' + σ²_{t|u}I_k. Then the observed treatments are consistent with T = B̃Ũ + ε_{t|u}, where Ũ = A⁻¹U ∼ N_m(0, Σ̃_u). Hence, the conditional confounder distribution is

  f(ũ | t) ∼ N_m(µ̃_{u|t}, Σ̃_{u|t}),   (59)

where µ̃_{u|t} = A⁻¹µ_{u|t} and Σ̃_{u|t} = A⁻¹Σ_{u|t}A⁻ᵀ.
With B̃ and Σ̃_u, the intervention distribution can be alternatively expressed as

  f_{γ̃,B̃,Σ̃_u}(y | do(T = t)) ∼ N((τ_naive − (B̃Σ̃_uB̃' + σ²_{t|u}I_k)⁻¹B̃Σ̃_uγ̃)'t, σ²_{y|t,u} + γ̃'Σ̃_uγ̃),   (60)

with any γ̃ satisfying γ̃'Σ̃_{u|t}γ̃ ≤ σ²_{y|t} a valid sensitivity parameter. Let γ̃ = A'γ. If γ'Σ_{u|t}γ ≤ σ²_{y|t}, then γ̃'Σ̃_{u|t}γ̃ = γ'Σ_{u|t}γ ≤ σ²_{y|t} and f_{γ̃,B̃,Σ̃_u}(y | do(T = t)) = f_{γ,B,Σ_u}(y | do(T = t)). Therefore, the causal equivalence class characterized by ψ_T = {µ_{u|t}, Σ_{u|t}} is identifiable when 0 < m < k.

A.5 Proof of Theorems 3 and 4
Proof of Theorem 3

Theorem 3.
Suppose that the observed data is generated by model 17–19. Then for all γ satisfying Assumptions 1 and 2,

  γ'Σ_{u|t}γ ≤ σ²_{y|t} R²_{Y∼U|T}.   (61)

For any given ∆t, we have

  Bias²_{∆t} ≤ σ²_{y|t} R²_{Y∼U|T} ‖Σ^{−1/2}_{u|t} µ_{u|∆t}‖²,   (62)

and the bound is achieved when γ is colinear with Σ⁻¹_{u|t} µ_{u|∆t}.

Proof. Under model 17–19, the variance of the observed outcome equals

  σ²_{y|t} := Var(Y | T) = σ²_{y|t,u} + γ'(I_m − B'(BB' + σ²_{t|u}I_k)⁻¹B)γ = σ²_{y|t,u} + γ'Σ_{u|t}γ,   (63)

where γ'Σ_{u|t}γ corresponds to the confounding variation and σ²_{y|t,u} to the non-confounding variation in the residual of the observed outcome. Hence, the fraction of confounding variation in the residual of Y, R²_{Y∼U|T}, can be expressed in terms of Equation 31, which produces a constraint on γ (Equation 32): the confounding variation γ'Σ_{u|t}γ cannot be larger than σ²_{y|t}R²_{Y∼U|T} for a given level of R²_{Y∼U|T}.

Let

  Z := Σ^{1/2}_{u|t} γ;   (64)

then the omitted variable bias in Equation 26 can be written as Bias_{∆t} = Z'Σ^{−1/2}_{u|t}µ_{u|∆t}, where Z'Z ≤ σ²_{y|t}R²_{Y∼U|T}, implied by inequality 32. Therefore,

  Bias²_{∆t} = Z'Σ^{−1/2}_{u|t}µ_{u|∆t} µ'_{u|∆t}Σ^{−1/2}_{u|t}Z   (65)
             ≤ σ²_{y|t} R²_{Y∼U|T} ‖Σ^{−1/2}_{u|t}µ_{u|∆t}‖²,   (66)

where the bound is reached when Z is colinear with Σ^{−1/2}_{u|t}µ_{u|∆t}, i.e., γ is colinear with Σ⁻¹_{u|t}µ_{u|∆t} via the relationship defined in Equation 64.

Corollary 3.1.
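The bound in Theorem 3, and its attainment when γ ∝ Σ⁻¹_{u|t}µ_{u|∆t}, can be verified numerically using Bias_{∆t} = γ'µ_{u|∆t}; the covariance and contrast values below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4
Sigma_ut = np.diag([0.5, 0.3, 0.2, 0.1])     # conditional confounder covariance
mu_dt = rng.normal(size=m)                   # mu_{u|dt} for an arbitrary contrast
sigma2_y, R2 = 1.0, 0.6
budget = sigma2_y * R2                       # constraint: gamma' Sigma_{u|t} gamma <= budget

bound_sq = budget * (mu_dt @ np.linalg.solve(Sigma_ut, mu_dt))   # Equation 62

# Feasible gammas on the constraint boundary never exceed the bound ...
for _ in range(1000):
    g = rng.normal(size=m)
    g *= np.sqrt(budget / (g @ Sigma_ut @ g))
    assert (g @ mu_dt) ** 2 <= bound_sq + 1e-9

# ... and the colinear direction gamma* ~ Sigma^{-1} mu attains it exactly
g_star = np.linalg.solve(Sigma_ut, mu_dt)
g_star *= np.sqrt(budget / (g_star @ Sigma_ut @ g_star))
assert np.isclose((g_star @ mu_dt) ** 2, bound_sq)
```

This is the classical Cauchy–Schwarz picture: among all γ with a fixed amount of confounding variation, the bias is maximized in the direction of the standardized conditional mean shift.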
Let d₁ be the largest singular value of B. For all ∆t with ‖∆t‖ = 1, the squared bias is bounded by

  Bias²_{∆t} ≤ (d₁² / (d₁² + σ²_{t|u})) (σ²_{y|t} / σ²_{t|u}) R²_{Y∼U|T},   (67)

with equality when ∆t = u₁, the first left singular vector of B. When ∆t ∈ Null(B'), the naive estimate is unbiased, that is, PATE_{∆t} = τ'_naive ∆t.

Proof. Suppose that the matrix B has the singular value decomposition B = UDV', where the diagonal entries of D are the singular values of B in descending order. Then we can write

  µ_{u|∆t} = VD(D² + σ²_{t|u}I_s)⁻¹U'∆t   (68)

and

  Σ⁻¹_{u|t} = V[I_s + D²/σ²_{t|u}]V'.   (69)

Plugging Equations 68 and 69 into the result of Theorem 3 gives

  Bias²_{∆t} ≤ (σ²_{y|t}/σ²_{t|u}) R²_{Y∼U|T} ‖VD(σ²_{t|u}I_s + D²)^{−1/2}U'∆t‖²,   (70)

where, by the Rayleigh quotient (Horn, 1985), the squared L2 norm reaches its maximum, d₁²/(d₁² + σ²_{t|u}), when ∆t equals the first column of U, i.e., the first left singular vector of B. Therefore,

  Bias²_{∆t} ≤ (d₁²/(d₁² + σ²_{t|u})) (σ²_{y|t}/σ²_{t|u}) R²_{Y∼U|T}.   (71)

Proof of Theorem 4

Theorem 4.
Assume model 13–15 with Gaussian outcomes. If Σ_{u|t} is non-invertible, then Bias_{t₁,t₂} is bounded if and only if µ_{u|t₁} − µ_{u|t₂} is in the row space of Σ_{u|t}. When bounded,

  Bias²_{t₁,t₂} ≤ σ²_{y|t} R²_{Y∼U|T} ‖(Σ†_{u|t})^{1/2}(µ_{u|t₁} − µ_{u|t₂})‖²,

where Σ†_{u|t} is the pseudo-inverse of Σ_{u|t}.

Proof. Under model 13–15, we have γ'Σ_{u|t}γ + σ²_{y|t,u} = 1, where γ'Σ_{u|t}γ corresponds to the confounding variation and σ²_{y|t,u} to the non-confounding variation in the Gaussianized Y. Therefore, the confounding variation γ'Σ_{u|t}γ should not be larger than a given level of R²_{Y∼U|T}:

  γ'Σ_{u|t}γ ≤ R²_{Y∼U|T},   (72)

where R²_{Y∼U|T} denotes the fraction of confounding variation in the residual variance of Ỹ conditional on T. (Here R²_{Y∼U|T} coincides with R²_{Y∼U}, but we use the notation R²_{Y∼U|T} for consistency.) Let

  Z := Σ^{1/2}_{u|t} γ.   (73)
Then the squared omitted variable bias satisfies
$$\mathrm{Bias}^2_{t_1,t_2} \le \sigma^2_{y|t}\,R^2_{Y\sim U|T}\,\big\|(\Sigma^{\dagger}_{u|t})^{1/2}(\mu_{u|t_2} - \mu_{u|t_1})\big\|^2 \tag{78}$$
$$= \sigma^2_{y|t}\,R^2_{Y\sim U|T}\,\Big\|Q(\Lambda^{\dagger})^{1/2}Q^T\sum_{j=1}^{p}a_jq_j\Big\|^2 \tag{79}$$
$$= \sigma^2_{y|t}\,R^2_{Y\sim U|T}\sum_{i=1}^{s}\Big(\sum_{j=1}^{p}a_j\lambda_j^{-1/2}Q_{ij}\Big)^2, \tag{80}$$
which is finite. Here $Q_{ij}$ denotes the element in the $i$th row and $j$th column of $Q$, and $\Lambda^{\dagger}$ is the pseudo-inverse of $\Lambda$, obtained by taking the reciprocal of each non-zero diagonal element and leaving the zeros in place.

On the other hand, suppose that $\mathrm{Bias}_{t_1,t_2}$ is bounded but $\mu_{u|t_2} - \mu_{u|t_1}$ is not in the row space of $\Sigma_{u|t}$, say $\mu_{u|t_2} - \mu_{u|t_1} = q_s$. Since $\lambda_s = 0$, we have $\lambda_s^{-1/2} = \infty$, so the bound on $\mathrm{Bias}_{t_1,t_2}$ is infinite; indeed, $\gamma$ may be taken arbitrarily large in the $q_s$ direction without violating constraint 72, so the bias itself is unbounded. This contradicts the assumption that $\mathrm{Bias}_{t_1,t_2}$ is bounded.

Therefore, $\mathrm{Bias}_{t_1,t_2}$ is bounded if and only if $\mu_{u|t_2} - \mu_{u|t_1}$ is in the row space of $\Sigma_{u|t}$.

B Modeling Choice Details
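Before turning to modeling choices, note that the row-space condition from the proof above is easy to check numerically via the pseudo-inverse projector. The sketch below is purely illustrative: the rank-deficient $\Sigma_{u|t}$ is synthetic, and the values of $\sigma^2_{y|t}$ and $R^2_{Y\sim U|T}$ are arbitrary stand-ins, not fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic rank-2 posterior covariance Sigma_{u|t} in R^3 (non-invertible).
A = rng.normal(size=(3, 2))
Sigma = A @ A.T
Sig_pinv = np.linalg.pinv(Sigma)
P_row = Sigma @ Sig_pinv          # orthogonal projector onto the row space

mu_in = Sigma @ rng.normal(size=3)          # in the row space by construction
null_vec = np.linalg.svd(Sigma)[0][:, -1]   # direction with zero eigenvalue
mu_out = mu_in + null_vec                   # has a component outside the row space

def in_row(v):
    # v lies in the row space iff projecting it changes nothing
    return np.allclose(P_row @ v, v, atol=1e-8)

print(in_row(mu_in), in_row(mu_out))  # -> True False

# When the condition holds, the squared bias bound (78) is finite:
#   sigma^2_{y|t} * R^2_{Y~U|T} * (mu_diff' Sigma^+ mu_diff)
sigma2_y, R2 = 1.0, 0.5   # hypothetical values for illustration
bound_sq = sigma2_y * R2 * (mu_in @ Sig_pinv @ mu_in)
```

For `mu_out`, the projector test fails, matching the theorem: no finite bound exists in that case.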
B.1 Identification and Inference in the Factor Model
Here, we briefly elaborate on identifiability of the probabilistic principal components model, which is a prerequisite for our multi-cause sensitivity analysis. Identifiability under various factor model assumptions is well studied and has a long history in the literature (Mardia et al., 1980; Everett, 2013). In the specific probabilistic principal components model 18, Tipping and Bishop (1999) provide a maximum likelihood solution for inferring the latent confounder parameters conditional on $m$. Many procedures are available for selecting the appropriate value of $m$, using for example Bayesian model selection techniques (Minka, 2001) or large $p$, small $n$ asymptotics (Gavish and Donoho, 2014).

The change of variables described at the end of Subsection 5.1 further elucidates important situations in which we cannot bound the omitted variable bias due to non-identifiability of the factor model. Again, we focus on the rotated treatments $\tilde{T} \sim N(0, \Delta + \sigma^2_{t|u}I_k)$ and highlight two simple situations in which we cannot bound the causal effects. First, when $B$ has rank $k$, i.e., there exist $m = k$ independent confounders, $\Delta$ has no zero entries on the diagonal and thus we can identify neither $\Delta$ nor $\mathrm{Cov}(\epsilon_{t|u}) = \sigma^2_{t|u}I_k$, only their sum. Second, if $\mathrm{Cov}(\epsilon_{t|u})$ is an unknown arbitrary diagonal matrix (as opposed to a matrix proportional to the identity), then $\mathrm{Cov}(\epsilon_{t|u})$ is not distinguishable from $\Delta$. In both of these cases, the worst-case bias is unbounded, since the non-confounding variation in the treatment assignment, $\mathrm{Cov}(\epsilon_{t|u})$, can be arbitrarily small. In such settings, we can still apply approaches used in single-cause sensitivity analysis by specifying both $\Psi_Y$ and $\Psi_T$; when the factor model is not identifiable, $\Psi_T$ must be treated as a sensitivity parameter, e.g. by bounding the fraction of treatment variation due to confounding, $R^2_{T\sim U}$.

B.2 Confounder Inference with Variational Autoencoders
Probabilistic principal component analysis should only be used when the treatments are approximately Gaussian. For binary and other general treatment distributions, more sophisticated probabilistic latent variable models are required; examples include logistic factor analysis (Hao et al., 2015) and Poisson factor analysis methods for count data (Gopalan et al., 2013). Unfortunately, these models imply posteriors which are non-Gaussian and heteroskedastic, violating Assumptions 4 and 5.

As such, for general treatment distributions, our approach is to infer a conditional Gaussian latent variable model using a variational autoencoder (VAE). VAEs have been extremely popular in machine learning, in particular for generating low-dimensional representations of complex inputs like images (Pu et al., 2016), but more recently have been used in scientific and decision-making applications (Lopez et al., 2020) and in applications to causal inference (Louizos et al., 2017). A VAE consists of a prior distribution, $f(u)$, for the low-dimensional latent variables, a stochastic encoder, and a stochastic decoder. In our application, the inferred stochastic decoder, $\hat{f}_\theta(t|u)$, is a non-linear map from latent confounders to a distribution over causes. Together, the prior distribution for $u$ and the decoder imply a posterior confounder distribution, $\hat{f}(u|t)$. In practice, inference for the true posterior is intractable, and so a variational approximation, called the encoder, $q_\phi(u|t)$, is used in place of the true posterior. Typically the encoder is chosen to be a normal distribution with mean and variance which are non-linear functions of the input, $q_\phi = N(\mu_\phi(t), \sigma^2_\phi(t))$. A crucial question is how well the Gaussian encoder approximates the true posterior; improving the variational approximation to the true latent variable posterior is an area of active research.
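One way to correct an imperfect Gaussian encoder is self-normalized importance sampling with the encoder as the proposal. The following is a minimal sketch in a conjugate one-dimensional toy model where the exact posterior is available in closed form; the "decoder," the deliberately crude Gaussian "encoder," and all parameter values are hypothetical illustrations, not the fitted VAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-Gaussian "decoder": t | u ~ N(b*u, sig2), with prior u ~ N(0, 1).
b, sig2 = 2.0, 0.5
t_obs = 1.5

# Exact posterior moments (available only in this conjugate toy model).
post_var = sig2 / (b**2 + sig2)
post_mean = b * t_obs / (b**2 + sig2)

# Crude Gaussian "encoder" q(u|t): biased mean, inflated variance,
# mimicking an imperfect variational approximation.
q_mean, q_sd = 0.8 * post_mean, 2.0 * np.sqrt(post_var)

def log_joint(u):
    # log f(u) + log f(t_obs | u), up to additive constants
    return -0.5 * u**2 - 0.5 * (t_obs - b * u) ** 2 / sig2

def log_q(u):
    return -0.5 * ((u - q_mean) / q_sd) ** 2 - np.log(q_sd)

# Self-normalized importance sampling with the encoder as proposal.
u = rng.normal(q_mean, q_sd, size=200_000)
log_w = log_joint(u) - log_q(u)
w = np.exp(log_w - log_w.max())
w /= w.sum()
snis_mean = np.sum(w * u)
snis_var = np.sum(w * (u - snis_mean) ** 2)
print(snis_mean, post_mean)  # weighted mean should be close to the exact mean
```

With a proposal that overlaps the posterior, the weighted moments recover the exact posterior mean and variance despite the encoder's bias; in practice the same weights, computed from the VAE prior and decoder, reweight encoder samples of $u$.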
In this work, we follow a common strategy of using the encoder learned by the VAE as the proposal distribution in an importance sampler (Lopez et al., 2020). Specifically, we apply a variant of the Constant-Variance Variational Autoencoder (CV-VAE) (Ghosh et al., 2020) to infer the conditional confounder distribution, $f(u|t) \sim N(\mu_{u|t}, \Sigma_{u|t})$, in which $\Sigma_{u|t}$ does not depend on the level of $t$. We use importance sampling to improve estimates of the conditional mean, $\mu_{u|t}$, and posterior variance, $\Sigma_{u|t}$. While this approach only yields an approximation to the true posterior, we demonstrate its practical effectiveness in Sections 7 and 8.

B.3 Binary Outcomes

For binary outcomes with the risk ratio estimand,
$$RR_{t,\bullet} = \sum_{t_i\in\mathcal{T}}\Phi\big(\Phi^{-1}(\mu_{y|t}) + \gamma^T(\mu_{u|t_i} - \mu_{u|t})\big)\Big/\Pr(Y=1), \tag{81}$$
which implies that
$$RR_{t_1,t_2} = \sum_{t_i\in\mathcal{T}}\Phi\big(\Phi^{-1}(\mu_{y|t_1}) + \gamma^T(\mu_{u|t_i} - \mu_{u|t_1})\big)\Big/\sum_{t_i\in\mathcal{T}}\Phi\big(\Phi^{-1}(\mu_{y|t_2}) + \gamma^T(\mu_{u|t_i} - \mu_{u|t_2})\big), \tag{82}$$
where $\gamma^T\Sigma_{u|t}\gamma \le \sigma^2_{y|t}R^2_{Y\sim U|T}$. We can numerically explore values of $RR_{t_1,t_2}$ within the valid domain of $\gamma$, and calculate the corresponding implicit partial R-squared as $R^2_{Y\sim U|T} = \gamma^T\Sigma_{u|t}\gamma/\sigma^2_{y|t}$. To calculate the robustness value, we only need to find the value of $R^2_{Y\sim U|T}$ for which the corresponding $RR_{t_1,t_2} = 1$. Notably, $RR_{t_1,t_2}$ is not monotone in $R^2_{Y\sim U|T}$, since the variance of the intervention distribution also depends on $\gamma$. This is evident in the simulation in Section 7, where we fit the observed outcome model by probit regression and the valid range for scalar $\gamma$ is $[-\sigma^{-1}_{u|t}, \sigma^{-1}_{u|t}]$. We visualize the non-monotone relationship between $RR_{t_1,t_2}$ and $R^2_{Y\sim U|T}$ in Figure 7.

Figure 7: $RR_{t_1,t_2}$ is non-monotone in $R^2_{Y\sim U|T}$.
Positive values of $R$ indicate that $U$ is positively correlated with $\tilde{Y}$, and negative values of $R$ indicate that $U$ is negatively correlated with $\tilde{Y}$.

C Additional Results
C.1 Additional Results from Simulation in Sparse Effects Setting

[Figure 8: worst-case ignorance regions shown at $R^2_{Y\sim U|T} = 0\%$, $50\%$, and $100\%$.]
Figure 8: Worst-case ignorance regions for 55 randomly chosen null effects (left) and all 45 non-null effects (right), ordered by the magnitude of the true effects in each group. Two red arrows indicate non-null treatments for which the worst-case ignorance region does not cover the true effect. This appears to be due to estimation error in the outcome model, more so than with the VAE.

Figure 9: Change in $E(|\tau_t|)$ for the L1-minimized estimates as a function of $R^2_{Y\sim U|T}$, separated by null and non-null effects. (a) The magnitude of effects decreases with $R^2$, with a larger relative decrease for null contrasts. (b) The relative magnitude of non-null and null effects increases with $R^2$ in general. The magnitude of non-null effects can be as large as 1.9 times that of the null effects when $R^2$ is large.

C.2 Additional Results from the Actor Case Study
Figure 10: (a) Estimated partial $R^2$ for observed confounders using the method described in Section 6.1. Budget is the most dominant variable, explaining significantly more variation in the outcome $Y$. (b) Latent confounder dimension selection, based on the reconstruction loss on the validation set.

Table 1: Robustness Value for Significant Actors

Effect | RV mean (%) | RV limit