A Class of Algorithms for General Instrumental Variable Models
Niki Kilbertus
MPI for Intelligent Systems
University of Cambridge
Matt J. Kusner
University College London
The Alan Turing Institute
Ricardo Silva
University College London
The Alan Turing Institute
Abstract
Causal treatment effect estimation is a key problem that arises in a variety of real-world settings, from personalized medicine to governmental policy making. There has been a flurry of recent work in machine learning on estimating causal effects when one has access to an instrument. However, to achieve identifiability, these methods in general require one-size-fits-all assumptions such as an additive error model for the outcome. An alternative is partial identification, which provides bounds on the causal effect. Little exists in terms of bounding methods that can deal with the most general case, where the treatment itself can be continuous. Moreover, bounding methods generally do not allow for a continuum of assumptions on the shape of the causal effect that can smoothly trade off stronger background knowledge for more informative bounds. In this work, we provide a method for causal effect bounding in continuous distributions, leveraging recent advances in gradient-based methods for the optimization of computationally intractable objective functions. We demonstrate on a set of synthetic and real-world data that our bounds capture the causal effect when additive methods fail, providing a useful range of answers compatible with observation as opposed to relying on unwarranted structural assumptions.

Code available at https://github.com/nikikilbertus/general-iv-models.

1 Introduction

Machine learning is becoming more and more prevalent in applications that inform actions to be taken in the physical world. To ensure robust and reliable performance, many settings require an understanding of the causal effects an action will have before it is taken. Often, the only available source of training data is observational, where the actions of interest were chosen by unknown criteria: How would a new housing policy impact the livelihood of minority groups? What are the environmental consequences of building a new factory? How well could existing medication work to combat the effects of a new illness? One of the major obstacles to trustworthy causal effect estimation with observational data is the reliance on the strong, untestable assumption of no unobserved confounding. To avoid this, only in very specific settings (e.g., front-door adjustment, linear/additive instrumental variable regression) is it possible to allow for unobserved confounding and still identify the causal effect (Pearl, 2009). Outside of these settings, one can only hope to meaningfully bound the causal effect (Manski, 2007).

In many applications, we have one or few treatment variables X and one outcome variable Y. Nearly all existing approaches to obtain meaningful bounds on the causal effect of X on Y impose constraints on how observed variables are related, in order to mitigate the influence of unobserved confounders. One of the most useful structural constraints is the existence of an observable instrumental variable (IV): a variable Z, not caused by X, whose relationship with Y is entirely mediated by X. It is also possible to have these conditions hold only after adjusting for other observable covariates, which for simplicity we will not make use of in this work. A formal graphical characterization of IVs is given by Pearl (2009). The lack of an edge between Z and Y, and between Z and the unobserved confounders of X and Y, is a necessary condition in the corresponding directed acyclic graph (DAG) representing causal relationships.
The existence of an IV can be used to derive upper (lower) bounds on causal effects of interest by maximizing (minimizing) those effects among all IV models compatible with the observable distribution. In this work, we develop algorithms to compute these bounds on causal effects over all IV models compatible with the data in a general continuous setting.

Eliciting constraints that characterize "the models compatible with data" under a causal directed acyclic graph (DAG) for discrete variables is an active field of study, with contributions from the machine learning, algebraic statistics, economics, and quantum mechanics literature. This has provided complete characterizations of equality (Evans, 2019; Tian & Pearl, 2002) and inequality (Wolfe et al., 2019; Navascues & Wolfe, 2019) constraints. Enumerating all inequality constraints in the observed distribution is in general super-exponential in the number of observed variables, even for discrete causal models. However, this line of work typically solves a harder problem than is strictly required for bounding causal effects: it provides symbolic constraints obtained by eliminating all hidden variables. While the pioneering work of Balke & Pearl (1994) in the discrete setting also provides symbolic constraints via a super-exponential algorithm, it introduces constraints that match the observed marginals of a latent variable model against the observable distribution. While this also requires solving generally intractable summations, our insight is that it provides a connection to non-symbolic, stochastic approaches for evaluating integrals, which we develop in this work.

Our key observation is that we can leverage recent advances in efficient gradient and Monte Carlo-based optimization of computationally intractable objective functions to bound the causal effect directly. This can be done even in the setting where X is continuous, where none of the literature described above applies. We do so by (a) parameterizing the space of causal responses to treatment X such that we can incorporate further assumptions that lead to informative bounds; (b) using a Monte Carlo approximation to the integral over the distribution of possible responses to X, where the distribution itself must be parameterized carefully to incorporate the structural constraints of an IV DAG model. This allows us to optimize over the domain-dependent set of all plausible models that are consistent with observed data to find lower/upper bounds on the target causal effect.

In Section 2, we describe the general problem of using instrumental variables when the treatment X is continuous. Section 3 develops our representation of the causal model. In Section 4 we introduce a class of algorithms for solving the bounding problem and our suggested implementation. Section 5 provides several demonstrations of the advantages of our method.

Balke & Pearl (1994) focused on partial identification (bounding) of causal effects in binary discrete models. Such sensitivity analysis is less well understood in continuous settings, save for Kilbertus et al. (2019) for a specific application in fairness. Angrist et al. (1996) studied identification of effects for a particular latent subclass of individuals, also in the binary case.
Meanwhile, the econometrics literature has focused on problems where the treatment X is continuous (Newey & Powell, 2003; Blundell et al., 2007; Angrist & Pischke, 2008; Wooldridge, 2010; Darolles et al., 2011; Horowitz, 2011; Chen & Christensen, 2018). This problem has recently received attention in machine learning, using techniques from deep learning (Hartford et al., 2017) and kernel machines (Singh et al., 2019; Muandet et al., 2019). Crucially, this literature assumes that the structural equation for Y has a special form, such as having an additive error term e_Y, as in Y = f(X) + e_Y. This error term is not caused by X, but need not be independent of it, introducing unobserved confounding.

Using the notation of Pearl (2009), the expected response under an intervention on X at level x is denoted by E[Y | do(x)], which in the model above boils down to f(x). An average treatment effect (ATE) can be defined as a contrast of this expected response under two treatment levels, e.g., f(x) − f(x′). In the zero-mean additive error case,

    E[Y | z] = ∫ f(x) p(x | z) dx.

Under some regularity conditions, no function other than f(·) satisfies this integral equation. Since E[Y | z] and p(x | z) can both be learned from data, this allows us to learn the ATE from observational data. This is how the vast majority of recent work identifies the causal treatment effect in the IV model (Hartford et al., 2017; Singh et al., 2019; Muandet et al., 2019).

The price paid for simplified identification is that it seriously limits the applicability of these models. Diagnostic tests for the additivity assumption are not possible, as residuals Y − f(X) can be arbitrarily associated with X by assumption. On the other hand, without any restrictions on the structural equations, it is not only impossible to identify the causal effect of the IV model with a continuous treatment, but even bounds on the ATE are vacuous (Pearl, 1995; Bonet, 2001; Gunsilius, 2018, 2019). However, with relatively weak assumptions on the space of allowed structural equations, it is possible to achieve meaningful bounds on the causal effect (Gunsilius, 2019). It suffices that the equations for X and Y have a finite number of discontinuities. Gunsilius provides a theoretical framework for representation and estimation of bounds. Algorithmically, he proposes a truncated wavelet representation for the causal response and builds convex combinations of a sample of response functions to optimize IV bounds. Although it is an important proof of concept for the possibility of bounds for the general IV case with a strong theoretical motivation, we found that the method has frequent stability issues that are not easy to diagnose. We return to this in Appendix A.

Building on top of this work and some classical ideas first outlined by Balke & Pearl (1994), we propose an alternative formulation for finding bounds when both X and Y are continuous. Our technique flexibly parameterizes the causal response functions, while naturally encoding the structural IV constraints for compatibility with the observed data. We then leverage an augmented Lagrangian method that is tailored to non-convex optimization with inequality constraints. We demonstrate that our method matches estimation results of prior work in the additive setting, and gives meaningful bounds on the causal effect in general, non-additive models.
Thereby, we follow a line of recent successes in various domains achieved by replacing previously intractable symbolic-combinatorial algorithms (Balke & Pearl, 1994; Wolfe et al., 2019; Drton et al., 2009) with a continuous program. One of our key contributions is to formulate bounds on true causal effects as well as their compatibility requirements as a smooth, constrained objective, for which we can leverage efficient gradient-based optimization techniques with Monte Carlo approximations.

Guided by this goal, we make a number of assumptions, mostly manifested as approximations, which we describe in detail in the following sections. Our resulting optimization problem is a non-convex constrained optimization problem that we solve using the well-studied augmented Lagrangian framework (Nocedal & Wright, 2006). We demonstrate that our method not only provides appropriate bounds in cases where additive IV methods fail, but also accounts for the allowed distributions of unobserved variables, avoiding overconfident causal effect estimates. We note that we ultimately still end up with a non-convex optimization problem, which we propose to solve with gradient-based local optimization techniques. Therefore, our approach still suffers from all the complications implied by non-convexity, including no guarantees of finding global optima.

Figure 1: (a) An example of a DAG compatible with Z being an instrument for X → Y, with hidden confounder U. (b) An equivalent representation using response function indices for deterministic functions X = g_{R_X}(Z) and Y = f_{R_Y}(X), with two random indexing variables R_X and R_Y. (c) For the purposes of modeling E[Y | do(x)], it is enough to express the model in terms of R := R_Y only.

Following Pearl's Structural Causal Model (SCM) framework (Pearl, 2009), we assume the existence of structural equations and a (possibly infinite dimensional) unobserved exogenous process U,

    X = g(Z, U) and Y = f(X, U).

We illustrate this situation in Figure 1(a). It assumes the usual requirements for the instrument Z to be satisfied, namely

    (a) Z ⊥⊥ U,  (b) Z ⊥̸⊥ X,  and  (c) Z ⊥⊥ Y | {X, U}.

The primary goal is to compute lower and upper bounds on E[Y | do(x*)] for any desired intervention level x*. Bounds on an ATE of interest can also be derived. Intuitively, we want to put bounds on how f(X, U) depends on X by optimizing over "allowed" distributions of U. Which distributions are "allowed" is determined by observations, i.e., we only consider settings where marginalizing U results in p(x, y | z) for all (x, y, z) in the support of the observational distribution. In fact, as pointed out by Palmer et al. (2011), it is enough to consider matching the marginals of the latent variable model to the two conditional densities p(x | z) and p(y | z). Informally, among all possible structural equations {g, f} and distributions over U that reproduce the estimated densities {p̂(x | z), p̂(y | z)}, we want to find estimates of the minimum and maximum expected outcomes under intervention.
Response functions. The main idea of Balke & Pearl (1994) is to express structural equations in terms of response functions: labeling (and possibly clustering) states of U according to the implied functional relationship between the observed variable and its direct causes. These U states are mapped to a particular level of an index variable R. For instance, if Y = f(X, U) = λ_1 X + λ_2 X U_1 + U_2, a two-dimensional U space in a linear, non-additive outcome function, we have that f(x, u) = λ_1 x + λ_2 x for u_1 = 1, u_2 = 0. We can define an implicit arbitrary value r such that f_r(x) = λ_r x with λ_r = λ_1 + λ_2, the value "r" being an alias for (1, 0) in the space of the confounders. The advantage of this representation is that we can think of a distribution over R as a distribution over functions of X alone. Otherwise we would need to deal with interactions between U and X on top of a distribution over U, itself of unclear dimensionality. In contrast, the dimensionality of R is the one implied by the particular function space adopted. Gunsilius (2019) provides a more thorough discussion of the role of response functions corresponding to a possibly infinite-dimensional U. Figure 1(b) shows a graphical representation of a system parameterized by response function indices R_X and R_Y, with a bi-directed edge indicating possible association between the two. In what follows, as there will be no explicit need for R_X, the causal DAG corresponding to our counterfactual model is shown in Figure 1(c). This itself departs from Balke & Pearl (1994) and Gunsilius (2019), having the advantage of not only simplifying the optimization, but also not requiring simultaneous measurements of X and Y (Palmer et al., 2011). Within this framework, we can rewrite the optimization over allowed distributions of U into an optimization over allowed distributions of response functions for Y.

Without restrictions on the function space, non-trivial inference is impossible (Pearl, 1995; Bonet, 2001; Gunsilius, 2018). In our proposed class of solutions, we adopt a parametric response function space: each response type r corresponds to some parameter value θ_r ∈ Θ ⊂ R^K for some finite K. We write f_r(x) := f_{θ_r}(x). Going forward, we will simply use θ to denote a specific response type and drop the index r. While our method works for any differentiable f_θ, we will focus on linear combinations of a set of basis functions {ψ_k : R → R}_{k ∈ [K]} with coefficients θ ∈ Θ (we use the notation [K] := {1, ..., K} for K ∈ N_{>0}):

    f_θ(x) := Σ_{k=1}^{K} θ_k ψ_k(x).   (1)

We propose to optimize over distributions p_M(θ) of the response function parameters θ in the unknown causal model M, subject to the observed marginal of the model,

    ∫ p_M(x, y | z, θ) p_M(θ) dθ,

matching the corresponding (estimated) marginals p(y | z) and p(x | z). Notice that θ ⊥⊥ Z is implied by Z ⊥⊥ U in the original formulation in terms of exogenous variables U. We assume a parametric form for p_M(θ) via parameters η ∈ R^d, denoted by p_η(θ). (We abuse notation slightly by letting η simultaneously signify all parameters specifying this joint distribution, as well as individual parameters specific to certain factors of the joint.) We propose to use function families for p_η(θ) that allow for practically low-variance Monte Carlo gradient estimation via the reparameterization trick (Kingma & Welling, 2014) to learn η; more on this in Section 3.2.
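To make this parameterization concrete, here is a minimal sketch (in JAX, with all function names and example values chosen purely for illustration, not taken from the released code) of a polynomial instance of eq. (1) and of a Monte Carlo estimate of E[Y | do(x*)] for a given batch of sampled response parameters:

```python
import jax
import jax.numpy as jnp

def polynomial_basis(x, K=4):
    """psi_k(x) = x^(k-1) for k = 1..K (here: 1, x, x^2, x^3)."""
    return jnp.stack([x ** k for k in range(K)], axis=-1)

def f_theta(theta, x):
    """Response function of eq. (1): f_theta(x) = sum_k theta_k * psi_k(x)."""
    return polynomial_basis(x, K=theta.shape[-1]) @ theta

def interventional_mean(theta_samples, x_star):
    """Monte Carlo estimate of E[Y | do(x*)] = int f_theta(x*) p_eta(theta) dtheta
    from a batch theta^(j) ~ p_eta(theta) of shape (B, K)."""
    return jnp.mean(jax.vmap(lambda th: f_theta(th, x_star))(theta_samples))

# Illustrative usage: three sampled cubic response types, evaluated at x* = 1.5.
theta_samples = jnp.array([[0.1, 1.0, 0.0, 0.00],
                           [0.0, 0.8, 0.1, 0.00],
                           [0.2, 1.2, 0.0, -0.05]])
print(interventional_mean(theta_samples, jnp.array(1.5)))
```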
Objective. An upper bound for the expected outcome under intervention can be directly written as

    max_η E[Y | do(x*)] = max_η ∫ f_θ(x*) p_η(θ) dθ.   (2)

A lower bound can be found analogously by the minimization problem. When optimizing eq. (2) constrained by p(y | z) and p(x | z) in the sequel, it will be necessary to define p_η(x, θ | z). In particular, ∫ p_η(x, θ | z) dx = p_η(θ | z) = p_η(θ). The last equality will be enforced in the encoding of p_η(x, θ | z), as we need Z ⊥⊥ θ even if Z ⊥̸⊥ θ | X. This encoding is introduced in Section 3.2, which will also allow us to easily match the marginal p(x | z). In Section 3.3, we construct constraints for the optimization so that the marginal of Y given Z in M matches the model-free p(y | z).

3.2 Matching p(x | z) and Enforcing Z ⊥⊥ U

Instead of formulating the criterion of preserving the observed marginal p(x | z) as a constraint in the optimization problem, we bake it directly into our model. To accomplish that, we factor p_η(x, θ | z) as p(x | z) p_η(θ | x, z). The first factor is identified from the observed data and we can thus force our model to match it. The second factor must be constructed so as to enforce marginal independence between θ and Z (as required by Z ⊥⊥ U). We achieve that by parameterizing it by a copula density c_η(·) that takes univariate CDFs F(·), which uniquely define the distributions, as inputs,

    p_η(θ | x, z) := c_η(F(x | z), F_η(θ_1), ..., F_η(θ_K)) Π_{k=1}^{K} p_η(θ_k).   (3)

Here we assume that each component θ_k of θ has a Gaussian marginal density with mean μ_k and variance σ_k², i.e., p_η(θ_k) = N(θ_k; μ_k, σ_k²). Moreover, assuming c_η is a multivariate Gaussian copula density requires a correlation matrix S ∈ R^{(K+1)×(K+1)}, for which we only keep a Cholesky factor L without further constraints, rescaling Lᵀ L to have a diagonal of 1s. Our full set of parameters is

    η := {μ_1, ln(σ_1²), ..., μ_K, ln(σ_K²), L} ∈ R^{K(K+1)/2 + 2K}.   (4)

3.3 Matching p(y | z)

In the continuous output case, our parameterization implies the following set of integral equations

    Pr(Y ≤ y | Z = z) = ∫ 1(f_θ(x) ≤ y) p_η(x, θ | z) dx dθ,   (5)

for all y and z in the respective sample spaces of Y and Z, where 1(·) is the indicator function. These constraints immediately introduce two difficulties. First, we have an infinite number of constraints to satisfy. Second, the right-hand side involves integrating non-continuous indicator functions, which poses a problem for smooth gradient-based optimization with respect to η.

To circumvent these issues, we first choose a finite grid {z^(m)}_{m=1}^{M} of size M ∈ N, instead of conditioning on all values of Z. We compute z^(m) from a uniform grid on the CDF F_Z of Z:

    z^(m) := F_Z^{-1}(m / (M+1)) for m ∈ [M].

Second, to avoid the integration of non-continuous indicator functions, we can express the constraints of eq. (5) in terms of expectations over a dictionary of L basis functions {φ_l}_{l=1}^{L}. This leads to the following constraints for p(y | z):

    E[φ_l(Y) | z^(m)] = ∫ φ_l(f_θ(x)) p_η(x, θ | z^(m)) dx dθ for all l ∈ [L], m ∈ [M].   (6)
This idea borrows from mean embeddings, where one can reconstruct p(y | z) from an infinite dictionary sampled at infinitely many points in the sample space of Z (Singh et al., 2019). In this work, we choose an even simpler approach and only constrain moments like the mean and variance, φ_1(Y) := E[Y], φ_2(Y) := V[Y], and so on. Crucially, we note that our approximations can only relax the constraints, i.e., the optima may result in looser bounds compared to the full constraint set, but not invalid bounds, barring bad local optima as well as Monte Carlo and estimation errors. A full discussion on the construction and implications of such assumptions is given in Appendix B. We discuss discrete outcomes or discrete features, which could also lead to discontinuous f_θ, in Appendix C.

Here we state our final non-convex, yet smooth, constrained optimization problem:

    objective:       o_{x*}(η) := ∫ f_θ(x*) p_η(θ) dθ   (7)
    constraint LHS:  LHS_{m,l} := E[φ_l(Y) | z^(m)]   (8)
    constraint RHS:  RHS_{m,l}(η) := ∫ φ_l(f_θ(x)) p_η(x, θ | z^(m)) dx dθ   (9)
    opt. problem:    min_η / max_η o_{x*}(η) s.t. LHS_{m,l} = RHS_{m,l}(η) for all m ∈ [M], l ∈ [L]   (10)

Here, min and max give the lower and upper bound respectively. In this section we describe how to tackle the optimization with an augmented Lagrangian strategy (Nocedal & Wright, 2006) and how to estimate all quantities from observed data. Algorithm 1 describes the full procedure.

Algorithm 1 Bounding the IV interventional effect at treatment level x*.
Require: dataset D = {(z_i, x_i, y_i)}_{i=1}^{N}; number of z grid points M; constraint functions {φ_l}_{l=1}^{L}; response function family {f_θ}_{θ∈Θ}; batch size B; initial temperature τ^(0) > 0; temperature increase factor α > 1; tolerances ε_abs, ε_rel; initial Lagrange multipliers λ^(0); initial parameters η^(0)
  whiten data D for X, Y, Z: subtract mean and divide by variance
  z^(m) := F̂_Z^{-1}(m / (M+1)) for m ∈ [M]    ▷ F̂_Z: CDF of {z_i}_{i=1}^{N}
  bin(i) := max{argmin_{m∈[M]} |z_i − z^(m)|} for i ∈ [N]    ▷ split data points into "z-bins"
  LHS_{m,l} := (1 / |bin^{-1}(m)|) Σ_{i∈bin^{-1}(m)} φ_l(y_i) for m ∈ [M], l ∈ [L]    ▷ pre-compute LHS
  smoothen LHS_{m,l} across m for each l with spline regression
  b := max{ε_abs, ε_rel · LHS} (element-wise)    ▷ set constraint tolerances
  x̂_j^(m) := F̂_{X|z^(m)}^{-1}((j−1)/(B−1)) for all j ∈ [B], m ∈ [M]    ▷ F̂_{X|z^(m)}: CDF of {x_i}_{i∈bin^{-1}(m)}
  for t = 1 ... T (or until convergence) do    ▷ optimization rounds
      η^(t) := OptimizeSubproblem(η^(t−1), λ^(t−1), τ^(t−1))    ▷ minimize Lagrangian at fixed λ, τ
      λ_l^(t) ← max(0, λ_l^(t−1) − τ^(t−1) c_l(η^(t)))    ▷ update Lagrange multipliers
      τ^(t) ← α τ^(t−1)    ▷ increase temperature parameter
  return o_{x*}(η^(T))

  function OptimizeSubproblem(η, λ, τ)
      We use SGD with auto-differentiation to minimize L; hence we only describe how to evaluate L in a differentiable fashion:
      o_{x*}(η) := (1/B) Σ_{j=1}^{B} f_{θ^(j)}(x*) with θ^(j) ∼ p_η(θ)    ▷ cf. Algorithm 2 for sampling
      RHS_{m,l}(η) := (1/B) Σ_{j=1}^{B} φ_l(f_{θ^(j)}(x̂_j^(m)))    ▷ cf. Algorithm 2 for sampling
      c(η) := b − |LHS − RHS(η)|    ▷ constraint terms
      L(η) := ± o_{x*}(η) + Σ_{l=1}^{M·L} ξ(c_l(η), λ_l, τ)    ▷ Lagrangian (± for lower/upper bound); ξ is the summand of eq. (12)
      return argmin_η L(η)    ▷ optimize with SGD

4.1 Augmented Lagrangian Strategy

We can think of the left-hand side LHS as target values, estimated once up front from observed data. The right-hand side RHS is estimated repeatedly using samples from our model p_η(x, θ | z^(m)) during optimization. Hence, we restrict η such that RHS_{m,l}(η) (approximately) matches the fixed LHS_{m,l} for all m and l. For notational simplicity, we will often "flatten" the indices m and l into a single index l ∈ [M · L]. Since LHS is subject to misspecification and estimation error, we introduce positive tolerance variables b ∈ R^{M·L}_{>0}, relaxing equality constraints into inequality constraints

    c_l(η) := b_l − |LHS_l − RHS_l(η)| ≥ 0, with b_l := max{ε_abs, ε_rel · |LHS_l|},   (11)

for fixed absolute and relative tolerances ε_abs, ε_rel > 0. The constraint c_l(η) is satisfied if RHS_l(η) is either within a fraction ε_rel of LHS_l or within ε_abs of LHS_l in absolute difference. The absolute tolerance is useful when LHS is close to zero. The exact constraints are recovered as ε_abs, ε_rel → 0. Again, the introduced tolerance can only make the obtained bounds looser, not invalid.

We consider an inequality-constrained version of the augmented Lagrangian approach with Lagrange multipliers λ ∈ R^{M·L} (detailed in Section 17.4 of Nocedal & Wright (2006)). Specifically, the Lagrangian we aim to minimize with respect to η is

    L(η, λ, τ) := ± o_{x*}(η) + Σ_{l=1}^{M·L} { −λ_l c_l(η) + (τ/2) c_l(η)²  if τ c_l(η) ≤ λ_l;  −λ_l² / (2τ)  otherwise },   (12)

where +/− is used for the lower/upper bound and τ is a temperature parameter, which is increased throughout the optimization procedure. Given an approximate minimum η of this subproblem, we then update λ and τ according to λ_l ← max{0, λ_l − τ c_l(η)} and τ ← α · τ for all l ∈ [M · L] and a fixed α > 1, given initial values for λ_l and τ. Thus at time step t, we have the current approximate solution η^(t) of eq. (12), the Lagrange multipliers λ_l^(t) and the temperature parameter τ^(t). While the number of optimization parameters grows quickly with the dimensionality of θ, which may render the optimization challenging, in our experiments we did not encounter any issues with up to 54 optimization parameters and 40 constraints. We describe the augmented Lagrangian approach in more detail in Appendix D.
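The following is a minimal JAX sketch of the penalty term of eq. (12) and of one outer round of the augmented Lagrangian updates. Here `objective` and `constraints` stand for user-supplied, differentiable Monte Carlo estimates of o_{x*}(η) and of the constraint vector c(η); the plain gradient-descent inner loop, step sizes, and all names are illustrative rather than the released implementation:

```python
import jax
import jax.numpy as jnp

def xi(c, lam, tau):
    """Penalty for an inequality constraint c(eta) >= 0, i.e. the summand of eq. (12)."""
    return jnp.where(tau * c <= lam,
                     -lam * c + 0.5 * tau * c ** 2,
                     -lam ** 2 / (2.0 * tau))

def make_lagrangian(objective, constraints, sign=+1.0):
    """sign = +1 targets the lower bound (minimize o), sign = -1 the upper bound."""
    def lagrangian(eta, lam, tau):
        c = constraints(eta)                     # vector of c_l(eta), flattened over (m, l)
        return sign * objective(eta) + jnp.sum(xi(c, lam, tau))
    return lagrangian

def outer_step(lagrangian, constraints, eta, lam, tau, alpha=2.0, inner_steps=500, lr=1e-2):
    """One round of Algorithm 1: approximately solve the subproblem, then update lam and tau.
    eta is assumed to be a flat parameter vector."""
    grad_fn = jax.grad(lagrangian)
    for _ in range(inner_steps):                 # plain gradient descent as a stand-in for SGD/Adam
        eta = eta - lr * grad_fn(eta, lam, tau)
    c = constraints(eta)
    lam = jnp.maximum(0.0, lam - tau * c)        # lambda_l <- max(0, lambda_l - tau * c_l(eta))
    tau = alpha * tau                            # increase the temperature parameter
    return eta, lam, tau
```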
For a dataset D = {(z_i, x_i, y_i)}_{i=1}^{N} ⊂ R³, we describe our method in Algorithm 1.

Pre-processing. As a first step, we whiten the data (subtract mean, divide by variance). Then, we interpolate the CDF F̂_Z of {z_i}_{i=1}^{N} to compute the grid points z^(m). Next, we assign each observation to a grid point via

    bin(i) := max{argmin_{m∈[M]} |z_i − z^(m)|} for i ∈ [N],

i.e., each datapoint is assigned to the grid point that is closest to its z-value (higher bin for ties). Given M, L and φ_l, we can estimate LHS_{m,l} from data via

    LHS_{m,l} := E[φ_l(Y) | z^(m)] ≈ (1 / |bin^{-1}(m)|) Σ_{i∈bin^{-1}(m)} φ_l(y_i),   (13)

which remains unchanged throughout the optimization. Since LHS_{m,l} is estimated via empirical averages of φ_l(y_i) for datapoints in a given bin i ∈ bin^{-1}(m), "neighboring" constraints LHS_{m,l} and LHS_{m+1,l} may have substantially different values. Since our model is smooth, it can be hard to match such non-continuities with RHS_{m,l}(η). We expect such jumps to be artifacts of finite sample effects and not important properties of the true data distribution. Hence we apply a spline regression to the values {LHS_{m,l}}_{m=1}^{M} for each l ∈ [L] to smoothen out larger jumps between neighboring values. In practice, we use a cubic univariate spline for each l with a smoothing factor of 0.2. Empirically, our results were not sensitive to the smoothing factor.

Once LHS is fixed, we can fix the tolerances b = max{ε_abs, ε_rel · LHS}. Finally, we obtain a single batch of examples from X | z^(m) of size B ∈ N, which we also reuse throughout the optimization, via inverse CDF sampling

    x̂_j^(m) = F̂_{X|z^(m)}^{-1}((j−1)/(B−1)) for j ∈ [B], m ∈ [M].   (14)

Here, F̂_{X|z^(m)} is the CDF of {x_i}_{i∈bin^{-1}(m)}.
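A rough NumPy/SciPy sketch of these pre-processing steps, assuming one-dimensional Z, X, Y arrays; the constraint functions, smoothing factor, and quantile-based empirical CDFs are illustrative simplifications of the procedure described above:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def preprocess(z, x, y, M=10, B=64, phis=(lambda t: t, lambda t: t ** 2), smooth=0.2):
    # whiten all three variables
    z, x, y = [(v - v.mean()) / v.std() for v in (z, x, y)]
    # grid on the CDF of Z: z^(m) = F_Z^{-1}(m / (M+1))
    z_grid = np.quantile(z, np.arange(1, M + 1) / (M + 1))
    # assign each observation to its closest grid point
    bins = np.argmin(np.abs(z[:, None] - z_grid[None, :]), axis=1)
    # LHS_{m,l}: empirical moments of Y within each z-bin, then spline-smoothed over m
    lhs = np.array([[phi(y[bins == m]).mean() for phi in phis] for m in range(M)])
    for l in range(lhs.shape[1]):
        lhs[:, l] = UnivariateSpline(np.arange(M), lhs[:, l], s=smooth)(np.arange(M))
    # fixed batch of X | z^(m) via inverse-CDF (quantile) sampling, cf. eq. (14)
    qs = np.arange(B) / (B - 1)
    x_hat = np.array([np.quantile(x[bins == m], qs) for m in range(M)])   # shape (M, B)
    return z_grid, bins, lhs, x_hat
```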
Monte Carlo estimation. To minimize the Lagrangian, we use stochastic gradient descent (SGD). Therefore, we need to compute (estimates of) ∇_η o_{x*}(η) and ∇_η c_l(η), where the latter boils down to ∇_η RHS_{m,l}(η). In practice, we compute Monte Carlo estimates of o_{x*}(η) and RHS_{m,l}(η) and use automatic differentiation, e.g., using JAX (Bradbury et al., 2018), to get the gradients. If we had a batch of independent samples θ^(j) ∼ p_η(θ) of size B, we could estimate the objective eq. (2) for a given η via

    E[Y | do(x*)] ≈ (1/B) Σ_{j=1}^{B} f_{θ^(j)}(x*).   (15)

Similarly, with i.i.d. samples θ^(j) ∼ p_η(θ | z^(m)) we can estimate RHS_{m,l} in eq. (6) as

    RHS_{m,l}(η) ≈ (1/B) Σ_{j=1}^{B} φ_l(f_{θ^(j)}(x̂_j^(m))).   (16)

Hence, the last missing piece is to sample from eq. (3) in a fashion that maintains differentiability with respect to η. We follow the standard procedure to sample from a Gaussian copula for the parameters θ^(j), with the additional restriction to preserve the pre-computed sample x̂. Algorithm 2 describes the sampling process from p_η(θ, X | z^(m)) as defined in Section 3.2 in detail. The output is a (K+1) × B matrix, where the first row contains B independent X-samples and the remaining K rows are the components of θ ∈ R^K. We pool samples from all z^(m) to obtain samples from p_η(θ). By change of variables, the parameters η = (μ, σ², L) enter in a differentiable fashion (cf. the reparameterization trick (Kingma & Welling, 2014)). To ensure that the marginal variances σ_k² are positive, we optimize their natural logarithm instead.
Initialization. We initialize the optimization parameters L with ones on the diagonal, zeros in the upper triangle, and sample the lower triangle from a zero-mean Gaussian with small variance. The initialization of μ_k and ln(σ_k²) depends on the chosen response function family. Our guiding principle is to ensure that the initial distribution covers a large set of possible response functions, tending towards larger σ_k.

Algorithm 2 Sampling parameter values θ from p_η(θ, X | z^(m)).
  1. Sample each component of w ∈ R^{K×B} i.i.d. from a standard Gaussian.
  2. Prepend the vector (0, 1/B, ..., 1) as the first row of w, resulting in w ∈ R^{(K+1)×B}.
  3. Allow for dependencies between components by multiplying with the Cholesky factor, w ← L w.
  4. Normalize all values by applying the standard Gaussian CDF component-wise, w ← ϕ_{0,1}(w).
  5. Fix the marginals of the θ_k by applying the inverse CDF of a (μ_k, σ_k²)-Gaussian: θ_k ← ϕ^{-1}_{μ_k, σ_k}(w_{k+1}) for k ∈ [K], where w_{k+1} denotes the (k+1)-st row of w. Recall that the first row of w encodes X and the remaining rows encode θ.
  6. Sampling X via F̂_X^{-1}(w_1) by design simply gives the pre-computed x̂.
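A differentiable sketch of this sampler in JAX; the parameter packing, the row normalization used to obtain a valid correlation factor, and the fixed uniform grid standing in for the pre-computed x̂ quantiles are simplifying assumptions made for exposition:

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import ndtr, ndtri   # standard normal CDF and its inverse

def sample_theta(eta, key, B=64, K=4):
    """Reparameterized draw of theta via a Gaussian copula (cf. Algorithm 2).

    eta = (mu, log_sigma, L_raw): Gaussian marginals of theta and an unconstrained
    lower-triangular (K+1)x(K+1) matrix for the copula correlation. Returns theta of
    shape (K, B); gradients flow back to eta through every step.
    """
    mu, log_sigma, L_raw = eta
    L = jnp.tril(L_raw)
    L = L / jnp.linalg.norm(L, axis=1, keepdims=True)        # unit diagonal of L @ L.T
    w_theta = jax.random.normal(key, (K, B))                  # noise for the theta rows
    x_row = ndtri(jnp.linspace(0.01, 0.99, B))[None, :]       # stand-in for the x-hat grid
    w = jnp.concatenate([x_row, w_theta], axis=0)             # (K+1, B)
    u = ndtr(L @ w)                                            # correlated uniforms in (0, 1)
    theta = ndtri(u[1:]) * jnp.exp(log_sigma)[:, None] + mu[:, None]
    return theta

# Illustrative usage with K = 4 basis coefficients.
key = jax.random.PRNGKey(0)
eta0 = (jnp.zeros(4), jnp.zeros(4), jnp.eye(5))
theta_samples = sample_theta(eta0, key)                        # shape (4, 64)
```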
Response functions. One key advantage of our approach is that it allows us to flexibly trade off assumptions on the response function family against more informative bounds. Due to our simple, yet expressive choice of linear combinations of a set of basis functions, there are many natural and easy to implement options for the response functions. In particular, we consider the following options:

1. Polynomials: ψ_k(x) = x^{k−1} for k ∈ [K]. In this work, we specifically focus on linear (K = 2), quadratic (K = 3), and cubic (K = 4) polynomial functions.

2. Neural basis functions (MLP): We fit a multi-layer perceptron with K neurons in the last hidden layer to the observed data {(x_i, y_i)}_{i∈[N]} and take ψ_k(x) to be the activation of the k-th neuron in the last hidden layer. Note that the network output itself is a linear combination of these last hidden layer activations. Hence, the underlying assumption for this approach to work well is that the true causal effects can also be approximated well by a linear combination of the learned last hidden layer activations, i.e., the true effect is in this sense "similar" to the estimated observed conditional p̂(y | x).

3. Gaussian process basis functions (GP): We fit a Gaussian process to each of K different subsamples {(x_i, y_i)}_{i∈N′} of size N′ ≤ N. We then sample a single function from each Gaussian process as the basis functions ψ_k for k ∈ [K]. We train multiple Gaussian processes on smaller subsets of the data to ensure sufficient variance in the learned functional relation. Similarly to the neural net basis functions, the assumption is that the causal effect can be approximated by a linear combination of these varying samples (see the sketch below).
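For instance, the GP option could be instantiated with scikit-learn roughly as follows; the kernel, subsample size, and interpolation grid are illustrative choices rather than the settings used in the experiments:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_basis(x, y, K=4, subsample=200, seed=0):
    """Return K basis functions psi_k, each a single GP posterior sample fitted
    to a random subsample of the observed (x, y) pairs."""
    rng = np.random.default_rng(seed)
    basis = []
    for k in range(K):
        idx = rng.choice(len(x), size=min(subsample, len(x)), replace=False)
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
        gp.fit(x[idx, None], y[idx])
        # draw one function from the posterior on a fine grid and interpolate it
        grid = np.linspace(x.min(), x.max(), 200)
        sample = gp.sample_y(grid[:, None], n_samples=1, random_state=k).ravel()
        basis.append(lambda t, g=grid, s=sample: np.interp(t, g, s))
    return basis   # psi_k(x) := basis[k](x)
```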
We evaluate our method on a variety of synthetic and real datasets. In all experiments, we report the results of two-stage least squares (2SLS) and kernel instrumental variable regression (KIV) (Singh et al., 2019). Note that both methods assume additive noise and provide point estimates for expected outcomes under a given treatment. The KIV implementation by Singh et al. (2019) comes as an off-the-shelf method with internal heuristics for tuning hyperparameters. For our method, we show lower and upper bounds computed individually for multiple values of x* ∈ R. The transparency of these lines in the figures indicates the tolerances ε_abs, ε_rel, where more transparency corresponds to larger tolerances. Missing bounds at an x* indicate that the constraints could not be satisfied in the optimization. In the synthetic settings, we also show the true causal effect E[Y | do(X = x*)].

Figure 2: Results for synthetic datasets (linear Gaussian and non-linear, non-additive) for a weak and a strong instrument respectively. Columns correspond to different response function families (linear, quadratic, and MLP responses); rows pair each setting with weak instrument/strong confounding and strong instrument/weak confounding. Lines show the data, the true effect E[Y | do(X = x*)], 2SLS, KIV, our lower and upper bounds, and effects from other compatible models.

Finally, we highlight that there are multiple possible causal effects compatible with the data (which our method aims to bound). To do so, we fit a latent variable model of the form shown in Figure 1(a) to the data, with U | Z, X, Y ∼ N(μ(Z, X, Y), σ(Z, X, Y)), where μ, σ as well as E[X | Z, U] are parameterized by neural networks. We ensure that the form of E[Y | X, U] matches our assumptions on the functional form of the response family (i.e., either polynomials of fixed degree in X, or neural networks). We then optimize the evidence lower bound following standard techniques (Kingma & Welling, 2014); see Appendix E. We fit multiple models with different random initializations and compute the implied causal effect of X on Y for each one, shown as multiple thin gray lines in the figures. We use a single set of hyperparameters for all experiments, which we describe in detail in Appendix G.
Linear Gaussian case.
First, we test our method in a synthetic linear Gaussian scenario, where instrument, confounder, and noise variables Z, C, e_X, e_Y are independent standard Gaussian variables. We consider two settings of the form

    X = αZ + βC + e_X and Y = X − C + e_Y,   (17)

where the coefficients (α, β) describe either a weak instrument with strong confounding (α < 1, β = 3) or a strong instrument with weak confounding (α = 3, β < 1). The first two rows of Figure 2 show our bounds in these settings for linear, quadratic and MLP response functions. Because these scenarios satisfy all theoretical assumptions of 2SLS and KIV, 2SLS reliably recovers the true causal effect, which is simply E[Y | do(X = x*)] = x*. For a weak instrument, KIV fails by reverting to its prior mean 0 everywhere, whereas it matches the true effect in data rich regions in the second setting with weak confounding. We observe that the true causal effect is always within our bounds. Moreover, our bounds also contain most of the "other possible models" that could explain the data, showing that they are highly informative, without being more confident than warranted. As expected, our bounds get looser as we increase the flexibility of the response functions (linear, quadratic, MLP from columns 1-3). In particular, allowing for flexible MLP responses (column 3), our bounds are rightfully loose for strong confounding. As confounding weakens and the instrument strengthens (in the second row), the gap between our bounds gets narrower.
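A sketch of this data-generating process together with the textbook 2SLS estimate of the causal slope; the coefficients α = 3 and β = 0.5 are illustrative stand-ins for the strong-instrument, weak-confounding setting rather than the exact values used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 10_000, 3.0, 0.5

z, c, e_x, e_y = rng.standard_normal((4, n))
x = alpha * z + beta * c + e_x                         # treatment, confounded through c
y = x - c + e_y                                        # true effect: E[Y | do(X = x*)] = x*

beta_ols = np.cov(x, y)[0, 1] / np.var(x)              # naive regression, biased by confounding
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]      # 2SLS slope for a single instrument
print(f"OLS slope {beta_ols:.3f}, 2SLS slope {beta_iv:.3f} (true slope 1.0)")
```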
Non-additive, non-linear case. Our next synthetic setting is non-linear and violates the additivity assumption. Again, the treatment is given by X = αZ + βC + e_X with the same settings of the coefficients α, β as in the linear case. The outcome is non-linear and non-additive,

    Y = a X² − b XC + e_Y, for fixed coefficients a, b ∈ (0, 1).   (18)

The bottom two rows of Figure 2 show our results for this setting. Since additivity is violated (due to the XC term) and the effect is non-linear, 2SLS fails. Without additivity, KIV also fails for strong confounding, but captures the true effect well in data rich regions when the instrument is strong and confounding is weak (we provide more details on this failure mode of KIV in Appendix H). The strongly confounded case (row 3) highlights the effect of the choice of response functions. Wrongly assuming linear response functions, our bounds rule out the true effect (row 3, column 1). However, they capture the implied causal effects from possible compatible linear models. As we allow for more flexible response functions capable of describing the true effect, our bounds are extremely conservative (row 3, columns 2 & 3), as they should be, indicated by the effects from other compatible models. In the strong instrument, weak confounding case (row 4), our bounds become narrower to the point of essentially identifying the true effect for adequate response functions (column 2). Here, linear response functions cannot explain the data anymore, indicated by missing bounds (row 4, column 1). We report additional results on the performance in the small data regime in Appendix F.

Figure 3: Bounds for the simulated sigmoidal design. The true causal effect is given by a logistic function, which is well recovered by our method for different response function families (cubic polynomials, GP basis functions, and MLP basis functions).

Figure 4: Results on the expenditure dataset for different response function families (linear, quadratic, and MLP responses).
Sigmoidal design. Next, we evaluate our method on simulated data from a sigmoidal design introduced by Chen & Christensen (2018), adapted from Newey & Powell (2003), and used in previous work on continuous instrumental variable approaches under the additive assumption as a common test case (Hartford et al., 2017; Singh et al., 2019; Muandet et al., 2019). We show the results from KIV and our bounds for response function families consisting of cubic polynomials, GP basis functions, and neural net basis functions in Figure 3. The observed data distribution p̂(y | x) follows the true causal effect rather closely and the instrument is relatively strong in this setting; see Singh et al. (2019) for details. Therefore, the gap between our bounds is relatively narrow for a broad set of different basis functions, as long as they are flexible enough to capture a sigmoidal shape.
Expenditure data. We now turn to a real dataset from a 1995/96 survey on family expenditure in the UK (Office for National Statistics, 2000). This dataset has been used by Gunsilius (2019), and previously (Blundell et al., 2007; Imbens & Newey, 2009) for the 1994/95 data. The outcome of interest is the share of expenditure on food. The treatment is the log of the total expenditure and the instrument is the gross earnings of the head of the household. All three variables are continuous, relations cannot be expected to be linear, and we cannot exclude unobserved confounding (Gunsilius, 2019), making this a good test case for our method. We prepare the data from Office for National Statistics (2000) using the same steps as Gunsilius (2019), closely following Newey & Powell (2003) and Blundell et al. (2007). That is, we restrict the sample to households with married couples who live together and in which the head of the household is between 20 and 55 years old. We further exclude couples with more than 2 children. Finally, we also require the head of the household not to be unemployed; otherwise, the instrument, gross earnings, would not be available. After these restrictions, we end up with 1650 observations in our dataset. The dataset can be downloaded for free for academic purposes after creating an account.

Figure 4 shows that our bounds provide useful information about both the sign and magnitude of the causal effect and gracefully capture the increasing uncertainty as we allow for more flexible response functions. Moreover, they include most of the possible effects from latent variable models, indicating that they are not overly restrictive. The few curves that escaped our bounds correspond to situations where the latent variable model fit was suboptimal in terms of local likelihood and hence may be an artifact of the latent variable model training procedure.

We have proposed a class of algorithms for computing bounds on causal effects by exploiting modern optimization machinery. While this addresses an important source of uncertainty in causal inference (partial identifiability as opposed to full identifiability), there is also statistical uncertainty: confidence or credible intervals for the bounds themselves (Imbens & Manski, 2004). Clearly this is an important matter to be addressed in future work. There are also considerations about the parameterization of p_η(θ | x, z) and how possible pre-treatment covariates can be used in the model. We defer these considerations to Appendix B. Another direction of future research is using the same ideas to test whether an IV model is valid, one of the original motivations for deriving the implied constraints of latent variable causal models (e.g., Wolfe et al., 2019). Throughout this work, we assumed that the model was correct; adapting our methods for testing models instead of deriving bounds is an interesting direction for future work. Finally, we foresee our ideas as ways of liberating causal modeling to accommodate "softer," more general constraints than conditional independence statements. For instance, as described by Silva & Evans (2016), there is no need to assume any sparsity in a causal DAG, as long as we know that some edges are "weak" (in a technical sense) so that, e.g., the edge Z → Y is allowed, but its influence on Y is not arbitrary. How to do that in a computationally feasible way remains a challenge, but the possibility of complementing causal inference based on sparse DAGs, such as the do-calculus of Pearl (2009), with the sledgehammer of modern continuous optimization, is an attractive prospect.

Acknowledgments
We thank Florian Gunsilius for useful discussions, for providing code for his method, and for explaining how to prepare the Family Expenditure Survey dataset. We are grateful to Arthur Gretton, Jiri Hron, Paul Rubenstein, and Rahul Singh for useful discussions and feedback. MK and RS acknowledge support from The Alan Turing Institute under EPSRC grant EP/N510129/1.
References
Acemoglu, D., Johnson, S., and Robinson, J. A. The colonial origins of comparative development: An empirical investigation. American Economic Review, 91(5):1369–1401, 2001.

Angrist, J. D. and Pischke, J.-S. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.

Balke, A. and Pearl, J. Counterfactual probabilities: Computational methods, bounds and applications. In Proceedings of the Tenth International Conference on Uncertainty in Artificial Intelligence, pp. 46–54. Morgan Kaufmann Publishers Inc., 1994.

Blundell, R., Chen, X., and Kristensen, D. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613–1669, 2007.

Bonet, B. Instrumentality tests revisited. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pp. 48–55, 2001.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Chen, X. and Christensen, T. M. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39–84, 2018.

Darolles, S., Fan, Y., Florens, J.-P., and Renault, E. Nonparametric instrumental regression. Econometrica, 79(5):1541–1565, 2011.

Diamond, S. and Boyd, S. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

Drton, M., Sturmfels, B., and Sullivant, S. Lectures on Algebraic Statistics (Oberwolfach Seminars Book 39). Springer, 2009.

Evans, R. Margins of discrete Bayesian networks. Annals of Statistics, 46(6A):2623–2656, 2019.

Gunsilius, F. Testability of instrument validity under continuous endogenous variables. arXiv preprint arXiv:1806.09517, 2018.

Gunsilius, F. Bounds in continuous instrumental variable models. arXiv preprint arXiv:1910.09502, 2019.

Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M. Deep IV: A flexible approach for counterfactual prediction. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1414–1423. JMLR.org, 2017.

Hestenes, M. R. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

Horowitz, J. L. Applied nonparametric instrumental variables estimation. Econometrica, 79(2):347–394, 2011.

Imbens, G. and Manski, C. Confidence intervals for partially identified parameters. Econometrica, 72(6):1845–1857, 2004.

Imbens, G. W. and Newey, W. K. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512, 2009.

Kilbertus, N., Ball, P. J., Kusner, M. J., Weller, A., and Silva, R. The sensitivity of counterfactual fairness to unmeasured confounding. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 213. AUAI Press, 2019.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), volume 1, 2014.

Manski, C. Identification for Prediction and Decision. Harvard University Press, 2007.

Muandet, K., Mehrjou, A., Lee, S. K., and Raj, A. Dual IV: A single stage instrumental variable regression. arXiv preprint arXiv:1910.12358, 2019.

Navascues, M. and Wolfe, E. The inflation technique solves completely the classical inference problem. arXiv:1707.06476, 2019.

Newey, W. K. and Powell, J. L. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.

Nocedal, J. and Wright, S. Numerical Optimization. Springer Science & Business Media, 2006.

Office for National Statistics. Family expenditure survey, 1996-1997, 2000. URL http://doi.org/10.5255/UKDA-SN-3783-1.

Palmer, T., Ramsahai, R., Didelez, V., and Sheehan, N. Nonparametric bounds for the causal effect in a binary instrumental variable model. The Stata Journal, 11(3):345–367, 2011.

Pearl, J. On the testability of causal models with latent and instrumental variables. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 435–443. Morgan Kaufmann Publishers Inc., 1995.

Pearl, J. Causality. Cambridge University Press, 2009.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Silva, R. and Evans, R. Causal inference through a witness protection program. Journal of Machine Learning Research, 17:1–53, 2016.

Singh, R., Sahani, M., and Gretton, A. Kernel instrumental variable regression. In Advances in Neural Information Processing Systems, pp. 4595–4607, 2019.

Tian, J. and Pearl, J. On the testable implications of causal models with hidden variables. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 519–527, 2002.

Wolfe, E., Spekkens, R. W., and Fritz, T. The inflation technique for causal inference with latent variables. Journal of Causal Inference, 7(2), 2019.

Wooldridge, J. M. Econometric Analysis of Cross Section and Panel Data. MIT Press, 2010.
A Gunsilius's Algorithm
Gunsilius (2019) provides a theoretical framework for minimal conditions under which a continuous IV model implies non-trivial bounds (that is, bounds tighter than what can be obtained by just assuming that the density function p(x, y | z) exists). That work also introduces two variations of an algorithm for fitting bounds.

The basic version consists of first sampling l response functions f_{R_X}(·) and f_{R_Y}(·) from a distribution over functions (in the experiments described, a Gaussian process evaluated on a grid in the respective spaces). The final distribution is a reweighted combination of the pre-sampled l response functions, with weights μ playing the role of the decision variables to be optimized. Hence, by construction, the space of distributions in the response function space is absolutely continuous with respect to the pre-defined Gaussian process. The constraints are defined by approximating an estimate of the bivariate CDF F(x, y | z) on a grid of values, which are approximately constrained to match the model-implied CDF in an L2 sense. Large deviation bounds are then used to show the (intuitive) result that this approximation is a probably approximately correct formulation of the original optimization problem.

One issue with this algorithm is that l may be required to be large, as it is a non-adaptive Monte Carlo approximation in a high dimensional space. A variant is described where, every time a solution for μ is found, response function samples with low corresponding values of μ are replaced (again, from the given and non-adaptive Gaussian process). Although this now has the advantage of adapting the Monte Carlo samples to the problem, it has convergence problems that may be severe (and not easy to diagnose).

In contrast, we formulate our adaptation of η as a continuous optimization problem with an estimate of the gradient that has empirically reasonable stability, as expected from the related work in the machine learning literature for gradient estimation. We also parameterize the distribution so that the only constraint that we need to enforce concerns the univariate density p(y | z). Like the algorithm given by Gunsilius, the space of functions is a linear combination of a fixed dictionary of basis functions with a Gaussian distribution on the parameters, although we do not introduce the discrete mixture reweighting on the Monte Carlo samples, which introduces instability in (Gunsilius, 2019) despite its good theoretical properties. Our formulation, like the one in (Gunsilius, 2019), can make use of a more flexible distribution such as a mixture of Gaussian copulas, although there are many computational advantages implied by the (finite) flexible Gaussian process formulation we adopt, as discussed in Appendix B.

The proposed implementation computes F_{Y | do(x*_1)}(y*) − F_{Y | do(x*_2)}(y*), i.e., the difference in effects at two different treatment levels x*_1 and x*_2 for individuals within a fixed quantile y* ∈ [0, 1] of the outcome variable. For example, in the expenditure dataset, such a setting contrasts a high and a low level of total expenditure for individuals at a fixed quantile (e.g., y* = 0.25) of the food-share outcome. The algorithm further involves a penalization parameter λ, which corresponds to the tightness of the constraint. In the proposed implementation, this parameter is fixed throughout the optimization and must be chosen manually. In Figure 5, we show the results of Gunsilius's algorithm for three different levels of y* on the expenditure dataset. Small values of λ result in uninformatively loose bounds and do not always seem to converge. For larger λ, which corresponds to stronger enforcement of the constraint, the bounds get narrower. However, even after a long burn-in period, we still encounter substantial "instantaneous jumps" as well as longer-term drifts in the bounds, which may change the qualitative conclusions (for example in the y* = 0.75 setting). Note that this algorithm works on the empirical CDFs of all variables, i.e., they are all scaled to lie within [0, 1].
Figure 5: We show results of Gunsilius's algorithm for three different settings of y* ∈ {0.25, 0.5, 0.75} and varying penalization parameter λ (bounds tracked over 2000 optimization iterations).

Moreover, even after laboriously improving the performance of the algorithm using acceleration via JAX (Bradbury et al., 2018) and parallelized solving of the quadratic programs with CVXPY (Diamond & Boyd, 2016), producing an upper and lower bound for a single setting of x*_1, x*_2, y*, λ with Gunsilius's algorithm took longer (about 30 minutes on a quad-core Intel Core i7) than a full set of upper and lower bounds at 15 different x* values with our algorithm (about 20 minutes on the same hardware).

B The Shape of p_η(θ | x, z) and Conditional Effects

It is not difficult to show that our parameterization of p_η(θ | x, z) in eq. (3) enforces θ ⊥⊥ Z while allowing for θ ⊥̸⊥ Z | X, as suggested by Figure 1(c). It follows directly by factoring a conditional density in terms of a copula density c(·) and the required univariate marginals. That is, for some (V_1, V_2, V_3) for which we want to define a conditional pdf p(v_1 | v_2, v_3), we have

    p(v_1, v_2 | v_3) := c(F(v_1 | v_3), F(v_2 | v_3)) p(v_1 | v_3) p(v_2 | v_3) ⇒ p(v_1 | v_2, v_3) = c(F(v_1 | v_3), F(v_2 | v_3)) p(v_1 | v_3).

Since ∫ p(v_1, v_2 | v_3) dv_2 = p(v_1 | v_3), a necessary and sufficient condition for V_1 ⊥⊥ V_3 is choosing a model marginal such that p(v_1 | v_3) = p(v_1). If c(F(v_1), F(v_2 | v_3)) cannot be factored in terms of some product h_1(v_1, v_2) h_2(v_2, v_3), which is typically the case, then V_1 ⊥̸⊥ V_3 | V_2.

The main apparent limitation of our p_η(θ_k) (and the related copula) is its reliance on a parametric form. There is a complex relationship between the shape of the response function space and the distribution implied on that space by the unknown model M. For Y = f(X, U), it is always possible to assume without loss of generality that U is a set of variables which are marginally standard Gaussians: just let the transformation U′_i := Φ^{-1}(F_i(U_i)) be absorbed into f(·), where F_i(·) is the marginal CDF of U_i and Φ(·) is the CDF of a standard Gaussian. Moreover, assuming that any dependence among elements of U can be explained by direct causation among them or by other latent parents, we can also assume all members of U are independent.

However, we do not want to assume a one-to-one correspondence between elements of θ and elements of U: that is the whole point of using response functions. Even independent standard Gaussian U's would not translate to marginally Gaussian θ. As an example, suppose Y = U_1² X + λ U_2. All response functions can be written in the form f_θ(x) := θ_1 x + θ_2, where θ_1 = U_1² and θ_2 = λ U_2. Hence, θ_1 follows a chi-squared distribution and θ_2 a zero-mean, but not standard, Gaussian. If Y = U_1² X + λ U_1 U_2, then on top of that θ_1 and θ_2 are not independent.

The solution is conceptually not complicated: just let p_η(·) be as flexible as possible. For instance, let the copula be a Dirichlet process mixture of Gaussian copulas, also defining flexible models for the marginals. The IV conditional independence structure among Z, X, θ is still preserved. The practical issue of course is the optimization. The algorithm of Gunsilius (2019) itself tries to approach the problem by learning the reweighting of a Monte Carlo approximation to a fixed base measure. That alone is already very computationally demanding and has convergence problems.
We set a parametric form for p_η(·) (in our paper, the Gaussian should be seen as a useful illustration, not as a one-size-fits-all solution) as a compromise between flexibility and computational tractability. The practitioner should be invited to sample from the implied function space to visualize whether the distribution of sample paths has the desired level of variability. Getting the "exact" shape of the true distribution is nowhere near as important as having enough variability to avoid overconfident bounds. How to achieve "enough variability" without aiming at a completely flexible distribution of θ may be a compromise between computational costs and domain-dependent judgment.

Another important aspect brought by a parameterization of p_η(·) arises when we have pre-treatment covariates W, used either to reduce confounding, to remove (direct) dependence between Z and U or between Z and Y, or simply to answer questions about conditional expected outcomes, e.g. E[Y | do(x), w], and conditional average causal effects (CATE), E[Y | do(x), w] − E[Y | do(x'), w]. Although a response function can straightforwardly depend on a vector of treatment variables, this makes less sense if the variables W are not direct causes of Y. And even if elements of W are direct causes, we may want to treat them analogously to U: playing a role in the response function only via the distribution of θ, instead of appearing explicitly in the scope of such functions.

Hence, we suggest that a way of incorporating covariates W is by a multilevel approach: define p_{η(w)}(θ | x, z, w), where each element of η may itself be a function of W, e.g. µ = β^⊤ W for some parameter vector β. Here, p(x | z, w) and p(y | z, w) are the marginals to be matched. We will discuss ways of making p_η(·) more flexible in general, including the use of covariates, in future work.

C Discrete Outcomes and Discrete Features

If Y is discrete, f_θ(x) will be discontinuous. Theoretically this poses no problem as long as the number of discontinuities is finite (Gunsilius, 2019). The main practical issue is optimization, as eq. (6) will no longer lend itself to gradient-based methods. The most immediate approximation is to use differentiable surrogates of f_θ(x) that relax the constraints. In the most basic formulation, we have the inequalities

tol_- ≤ E[φ_l(Y) | z^(m)] − ∫ φ_l(f_θ(x)) p_η(x, θ | z^(m)) dx dθ ≤ tol_+,

for some tolerance factors tol_+, tol_-. Given upper and lower bounds φ_l^+(f_θ(x)), φ_l^-(f_θ(x)) on φ_l(f_θ(x)), the relaxed constraints

tol_- ≤ E[φ_l(Y) | z^(m)] − ∫ φ_l^-(f_θ(x)) p_η(x, θ | z^(m)) dx dθ,   E[φ_l(Y) | z^(m)] − ∫ φ_l^+(f_θ(x)) p_η(x, θ | z^(m)) dx dθ ≤ tol_+,

will still result in valid, but looser, bounds (again, up to local optima and Monte Carlo error). If f_θ(x) is non-negative (for instance, if its codomain is {0, 1}) and φ_l(·) is monotonic for non-negative inputs (such as φ_l(x) = x and φ_l(x) = x^2), it is enough to plug in bounds for f_θ(x) itself. We will elaborate on that in future work.
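The relaxation only requires elementwise surrogates f_lower ≤ f_θ ≤ f_upper together with a dictionary function φ_l that is monotone on non-negative inputs. A minimal sketch of the resulting Monte Carlo estimates of the two relaxed constraints for one z-bin follows; the binary step response and the continuous ramp surrogates (standing in for smoother differentiable ones) are illustrative placeholders, not the construction we use.

import numpy as np

def relaxed_constraint_residuals(phi, f_lower, f_upper, x_samps, theta_samps, y_samps):
    # Monte Carlo estimates of the relaxed moment constraints for one z-bin:
    # g_minus should satisfy g_minus >= tol_-, and g_plus should satisfy g_plus <= tol_+.
    # phi is assumed monotone on non-negative inputs, so phi(f_lower) <= phi(f_theta) <= phi(f_upper).
    e_phi_y = np.mean(phi(y_samps))                                  # E[phi_l(Y) | z^(m)]
    g_minus = e_phi_y - np.mean(phi(f_lower(x_samps, theta_samps)))  # compare against tol_-
    g_plus = e_phi_y - np.mean(phi(f_upper(x_samps, theta_samps)))   # compare against tol_+
    return g_minus, g_plus

# Illustrative binary response f_theta(x) = 1(theta_1 + theta_2 * x >= 0) with
# continuous ramp surrogates bounding the step from below and above.
delta = 0.1
score = lambda x, th: th[:, 0] + th[:, 1] * x
f_lo = lambda x, th: np.clip(score(x, th) / delta, 0.0, 1.0)        # <= 1(score >= 0)
f_hi = lambda x, th: np.clip(score(x, th) / delta + 1.0, 0.0, 1.0)  # >= 1(score >= 0)

rng = np.random.default_rng(0)
x_s, th_s, y_s = rng.normal(size=200), rng.normal(size=(200, 2)), rng.integers(0, 2, size=300)
print(relaxed_constraint_residuals(lambda y: y, f_lo, f_hi, x_s, th_s, y_s))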
In this context, we can also formulate an alternative approach to matching p(y | z).

Alternative Approach to Matching p(y | z). Here we describe an alternative approximation of eq. (5) that hinges on smoothly approximating the indicator function to render the integral well behaved. First, instead of evaluating Pr(Y < y | Z = z^(m)) for all y ∈ Y, we take a similar approach for discretizing Y | z^(m) as we took for Z. For a given z^(m), instead of all half-spaces Y < y, we only consider the sets A^(m,l) := (−∞, y^(m,l)] with

y^(m,l) := F^{-1}_{Y | z^(m)}((l − 1)/(L − 1))   for l ∈ [L],

for some fixed L ∈ ℕ. This results in constraints on the L quantiles of the conditional distributions of Y:

(l − 1)/(L − 1) = ∫ 1(f_θ(x) ≤ y^(m,l)) p_η(x, θ | z^(m)) dx dθ   for all m ∈ [M] and l ∈ [L].

In practice, we would evaluate the integral on the right-hand side with a Monte Carlo estimate, sampling from p_η(x, θ | z^(m)), and then differentiate with respect to η for gradient-based optimization. Therefore, the non-differentiable (even discontinuous) indicator function poses an issue for the optimization. We can circumvent this problem by approximating the indicator with a smooth, differentiable function, for example 1(t ≤ t*) ≈ σ_ρ(t − t*) for

σ_ρ(t) := 1 / (1 + e^{ρt})   or   σ_ρ(t) := 1 / (1 + exp(ρ(t + 1/√ρ)))

for ρ > 0.
As ρ → ∞, σ_ρ(t) → 1(t ≤ 0) pointwise on ℝ \ {0}, i.e., we can slowly increase ρ throughout the optimization to gradually approximate the constraints.
Hence, an alternative approach to implement the constraint for matching p(y | z) is

(l − 1)/(L − 1) = ∫ σ_ρ(f_θ(x) − y^(m,l)) p_η(x, θ | z^(m)) dx dθ   (19)

for all m ∈ [M] and l ∈ [L], where ρ is increased from an initial value ρ_0 throughout the optimization. We only report results for the approach using the dictionary functions φ_l described in the main text.
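Although we do not use it for the reported results, a minimal sketch of the smoothed quantile-matching constraint in eq. (19) may still be useful; the linear response function, the samples standing in for draws from p_η(x, θ | z^(m)), and the annealing schedule for ρ are placeholders.

import numpy as np

def sigma_rho(t, rho):
    # Smooth surrogate for the indicator 1(t <= 0); sharpens as rho grows.
    return 1.0 / (1.0 + np.exp(rho * t))

def quantile_constraint_residuals(f_theta, x_samps, theta_samps, y_quantiles, rho):
    # Residuals of eq. (19) for one z-bin: target level (l-1)/(L-1) minus the smoothed
    # Monte Carlo estimate of P(f_theta(X) <= y^(m,l)) under p_eta(x, theta | z^(m)).
    L = len(y_quantiles)
    levels = np.arange(L) / max(L - 1, 1)
    vals = f_theta(x_samps, theta_samps)                        # samples of f_theta(x)
    smoothed = sigma_rho(vals[None, :] - y_quantiles[:, None], rho).mean(axis=1)
    return levels - smoothed                                    # driven towards 0 during training

# Example with a linear response f_theta(x) = theta_1 + theta_2 * x; rho would be
# annealed upwards over the course of the optimization, e.g. rho <- 1.05 * rho.
rng = np.random.default_rng(0)
x_s, th_s = rng.normal(size=500), rng.normal(size=(500, 2))
f_lin = lambda x, th: th[:, 0] + th[:, 1] * x
print(quantile_constraint_residuals(f_lin, x_s, th_s, np.array([-1.0, 0.0, 1.0]), rho=5.0))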
D Augmented Lagrangian Optimization Strategy

The Augmented Lagrangian method (Hestenes, 1969) is a general method for constrained optimization, originally proposed for equality constraints only. Its benefit over penalty methods is that we do not need to take the penalty parameter τ to ∞ in order to solve the original constrained optimization problem, which can cause ill-conditioning (Nocedal & Wright, 2006). However, our problem contains only inequality constraints. Thus, we consider a refinement proposed by Nocedal & Wright (2006) for handling purely inequality-constrained problems with Augmented Lagrangian methods. Specifically, we can write the inequality-constrained optimization problem equivalently as an unconstrained optimization problem with Lagrange multipliers λ:

min_η max_{λ ≥ 0} { o_{x⋆}(η) + λ^⊤ (c(η) − b) }.
To see that this is equivalent, note that the max returns o(η) when η satisfies the constraints (as the maximum is attained at λ = 0), and ∞ otherwise (as the maximum is attained at λ = ∞). However, this is not easy to optimize, as λ jumps from 0 to ∞ when passing through the constraint boundary. To fix this, we add a term that penalizes λ for making large changes from its previous value. Specifically,

min_η max_{λ ≥ 0} { o(η) + λ^⊤ (c(η) − b) − (1/(2τ)) ‖λ − λ'‖^2 },

where λ' are the Lagrange multipliers from the previous iteration and τ is a penalty term that is iteratively increased. Note that the max optimization can be solved in closed form for each Lagrange multiplier λ_l:

λ_l = max{0, λ'_l + τ c_l(η)},   (20)

where c_l(η) is shorthand for the l-th inequality constraint. Plugging these values into the optimization problem, we arrive at

min_η L(η, λ, τ) := o_{x⋆}(η) + Σ_{l=1}^{M·L} ξ(c_l(η), λ_l, τ)   (21)

with

ξ(c_l(η), λ_l, τ) := −λ_l c_l(η) + (τ/2) c_l(η)^2   if τ c_l(η) ≤ λ_l,   and   −λ_l^2 / (2τ)   otherwise,   (22)

where τ is increased throughout the optimization procedure. Given an approximate solution η of this subproblem, we then update λ according to

λ_l ← max{0, λ_l − τ c_l(η)}   (23)

for all l ∈ [M·L] and set τ ← ατ for a fixed α > 1.
For the full optimization, we attach temporal upper indices, i.e., at time step t we have the current approximate solution η^(t), the Lagrange multipliers λ_l^(t), and the temperature parameter τ^(t). See Algorithm 1 for a description of the optimization scheme.
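A minimal sketch of the update rules in eqs. (20)–(23) is given below; the constraint function and gradient oracle stand in for the M·L inequality constraints and automatic differentiation, and the plain gradient step replaces the SGD-with-momentum updates used in our experiments. The initial temperature and learning rate are placeholder values.

import numpy as np

def xi(c, lam, tau):
    # Penalty term xi(c_l(eta), lambda_l, tau) from eq. (22).
    return np.where(tau * c <= lam, -lam * c + 0.5 * tau * c ** 2, -lam ** 2 / (2.0 * tau))

def augmented_lagrangian(constraints, grad_lagrangian, eta0, n_rounds=150,
                         inner_steps=30, tau0=0.1, alpha=1.08, tau_max=10.0, lr=1e-3):
    # constraints(eta) -> array of c_l(eta);
    # grad_lagrangian(eta, lam, tau) -> gradient of eq. (21) w.r.t. eta.
    eta = np.asarray(eta0, dtype=float)
    lam = np.zeros_like(constraints(eta))
    tau = tau0
    for _ in range(n_rounds):
        for _ in range(inner_steps):                       # approximately solve eq. (21)
            eta = eta - lr * grad_lagrangian(eta, lam, tau)
        lam = np.maximum(0.0, lam - tau * constraints(eta))  # multiplier update, eq. (23)
        tau = min(alpha * tau, tau_max)                       # temperature schedule
    return eta, lam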
E Fitting Latent Variable Models

When fitting the latent variable models, we use multi-layer perceptrons (MLPs) with inputs z, x, y for the means and variances of the latent dimensions U, where we use lower indices U_i for the different components. For this encoder, we use 32 neurons in the hidden layer and rectified linear units as the activation function. There are two decoders. The first one is trained to reconstruct E[X | Z, U], i.e., it receives the original Z in addition to the latent vector U as input. It is also parameterized by an MLP with 32 neurons in the hidden layer and ReLU activations. The second decoder reconstructs E[Y | X, U] and is either an MLP of the same architecture (when comparing to MLP response functions), linear in X, i.e., αX + β + Σ_{i=1}^{n_latent} (γ_i X U_i + δ_i U_i) (when comparing to linear response functions), or quadratic in X, i.e., αX^2 + βX + γ + Σ_{i=1}^{n_latent} (δ_i X^2 U_i + ε_i X U_i + ζ_i U_i) (when comparing to quadratic response functions). Thereby, we ensure that the form of the decoder matches our assumptions on the functional form of the response family. We then optimize the evidence lower bound following standard techniques for variational autoencoders (Kingma & Welling, 2014) with an L2 reconstruction loss for X and Y. We fit multiple models with different random initializations and compute the implied causal effect of X on Y for each one, which is obtained from the decoder E[Y | u, x] by averaging over 1000 samples of the latent variable U for a fixed grid of x-values.
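As an illustration of the decoders above, the following sketch implements the linear-in-X decoder for E[Y | X, U] and the Monte Carlo estimate of the implied causal effect; the parameter values and the standard-Gaussian draws for U are placeholders for what a trained encoder and decoder would provide.

import numpy as np

rng = np.random.default_rng(0)
n_latent = 3

# Linear-in-X decoder: E[Y | x, u] = alpha*x + beta + sum_i (gamma_i * x * u_i + delta_i * u_i).
alpha, beta = 1.0, 0.0
gamma = rng.normal(size=n_latent)
delta = rng.normal(size=n_latent)

def decode_y(x, u):
    return alpha * x + beta + (gamma * x[:, None] * u + delta * u).sum(axis=1)

# Implied causal effect E[Y | do(X = x)]: average the decoder over draws of the
# latent U for a fixed grid of x-values.
u_samples = rng.standard_normal((1000, n_latent))
x_grid = np.linspace(-3, 3, 15)
effect = np.array([decode_y(np.full(len(u_samples), x), u_samples).mean() for x in x_grid])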
Figure 6: Performance of our method on smaller datasets with only 500 observations. Shown are the linear Gaussian settings using linear response functions and the non-linear, non-additive settings using quadratic response functions. The left column is the strong confounding, weak instrument case (α = 0. , β = 3) and the right column is the weak confounding, strong instrument case (α = 3, β = 0. ).

F Small Data Regime

Having tested our method on datasets of size 5000 (synthetic) and 1650 (expenditure data), we now evaluate how our method performs on even smaller datasets. To this end, we first look at our synthetic settings using only 500 datapoints and correspondingly reducing the number of z-bins to M = 6 in Figure 6. While the bounds are looser, our method can still provide useful information with relatively little data.

In addition, we ran our method on a classic instrumental variable setting from economics, namely the dataset used by Acemoglu et al. (2001) on using settler mortality as an instrument to estimate the causal effect of the health of institutions on economic performance. This dataset, which is freely available at https://economics.mit.edu/faculty/acemoglu/data/ajr2001, consists of only 70 datapoints. We therefore set the number of z-bins to M = 5 for this dataset. Restricting ourselves to linear response functions, our method still gives informative bounds, which include the effect estimated by 2SLS but do not fully include the KIV results; see Figure 7.

G Hyperparameter Settings
In all experiments, we fix the hyperparameters M = 20, L = 2, B = 1024 and run SGD with momentum and a learning rate of 0.001 for 150 rounds of the augmented Lagrangian, with 30 gradient updates for each subproblem optimization.
We start with an initial temperature parameter τ and increase it by a factor of α = 1.08 in each round, capped at τ_max = 10.
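For convenience, the fixed settings above can be collected in a single configuration; values not fully specified here (the momentum and the initial temperature) are left as placeholders to be chosen by the practitioner.

config = {
    "M": 20,                 # number of z-bins
    "L": 2,                  # dictionary functions per bin
    "batch_size": 1024,      # B
    "optimizer": "sgd",      # SGD with momentum
    "momentum": None,        # placeholder: choose per experiment
    "learning_rate": 1e-3,
    "outer_rounds": 150,     # augmented Lagrangian rounds
    "inner_steps": 30,       # gradient updates per subproblem
    "tau_init": None,        # placeholder: initial temperature
    "tau_factor": 1.08,      # alpha, multiplicative increase per round
    "tau_max": 10.0,
}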
This set of hyperparameters did not require much manual tuning and worked for all datasets and response function families, i.e., also for different dimensionalities of θ. For the synthetic settings, we sample 5000 observations each; see Appendix F for results on small datasets. For the tolerances, we use fixed absolute tolerances ε_abs and vary the relative tolerance ε_rel up to 0.05 in all settings (which corresponds to the increasingly opaque lines).

Figure 7: Results for the small dataset from Acemoglu et al. (2001) with linear response functions and M = 5 z-bins.
For the MLP response functions, we train a 2-hidden-layer MLP with 64 neurons in each layer, rectified linear units as activation functions, and a mean-squared-error loss for 100 epochs with a batch size of 256 using Adam. To obtain data-driven basis functions, we fit Gaussian processes to K different sub-samples {(x_i, y_i)}_{i ∈ N'} with N' ≤ N. We use scikit-learn's GaussianProcessRegressor (Pedregosa et al., 2011) with N' = 200 and a white kernel variance of 0.4.
For the sigmoidal design dataset we sample K = 7 basis functions. When fitting the latent variable models, we use a latent dimension of 3. Further details are described in Appendix E.
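As a sketch of how such basis functions could be drawn, the snippet below fits one GaussianProcessRegressor per random sub-sample and uses its posterior mean as a callable basis function; the RBF kernel (besides the white-noise term) and the sub-sampling scheme are assumptions for illustration, not necessarily the exact configuration used in our experiments.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def sample_basis_functions(x, y, K=7, n_sub=200, white_var=0.4, seed=0):
    # Fit one GP per random sub-sample of (x, y) and return its posterior mean
    # as a callable basis function.
    rng = np.random.default_rng(seed)
    basis = []
    for _ in range(K):
        idx = rng.choice(len(x), size=min(n_sub, len(x)), replace=False)
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=white_var)
        gp = GaussianProcessRegressor(kernel=kernel).fit(x[idx, None], y[idx])
        basis.append(lambda x_new, gp=gp: gp.predict(np.asarray(x_new)[:, None]))
    return basis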
H KIV Heuristic for Tuning Hyperparameters

We have found KIV to fail in the strongly confounded linear Gaussian setting, even though all the assumptions are satisfied; see Figure 2 (row 1). Closer analysis of these cases showed that the heuristic that determines the hyperparameters does not return useful values in this setting. Instead, we performed a grid search over the main hyperparameters λ and ξ (see Singh et al., 2019, for details) and scored them by the out-of-sample mean-squared error for the true causal effect (which is known in our synthetic setting). After manual exploration of the parameter space, we found a good setting, marked by the red cross in the first row on the left of Figure 8. Using these fixed hyperparameters for KIV instead of the internal tuning stage, we get a much better approximation of the true causal effect, shown in the first row on the right of Figure 8. Towards the data-starved regions at large and small x-values, KIV again reverts towards the prior mean of zero, as expected. It is unclear at the moment, however, how to set such hyperparameter values without access to the true causal effect. Our point here is that in principle there is a setting with acceptable results, although even then it is not clear how much of it is a coincidence based on looking at many possible configurations.

We performed a similar manual analysis for the non-linear, non-additive synthetic setting with strong confounding, in which off-the-shelf KIV fails as well; see Figure 2 (row 3). Note that this setting does not satisfy the assumptions of KIV because of the non-additive confounding. Again, we do manage to find hyperparameters that locally minimize the out-of-sample mean-squared error, shown in the second row on the left of Figure 8. However, the resulting regression of the causal effect does not properly capture the true effect.

Figure 8: We show the results of a manual hyperparameter search for KIV in the left column, where we score different settings in the two-dimensional hyperparameter space by the log of the out-of-sample mean squared error, which requires knowledge of the true causal effect. The red cross denotes the setting with the smallest out-of-sample mean squared error. In the right column, we show the KIV regression lines using the hyperparameters found in the manual search. The first row corresponds to the linear Gaussian setting and the second row to the non-linear, non-additive synthetic setting.
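The manual search described above amounts to scoring a grid of (λ, ξ) values by out-of-sample mean-squared error against the known true effect. A minimal sketch follows, where fit_kiv and true_effect are hypothetical stand-ins for a KIV implementation and the synthetic ground truth, respectively.

import itertools
import numpy as np

def grid_search_kiv(fit_kiv, true_effect, x_eval, log_lambdas, log_xis):
    # Score each (lambda, xi) pair by out-of-sample MSE against the true causal effect.
    # fit_kiv(lam, xi) -> callable estimating E[Y | do(X = x)] on an array of x-values;
    # true_effect(x)  -> known ground-truth effect (available only in synthetic settings).
    best, best_mse = None, np.inf
    for log_lam, log_xi in itertools.product(log_lambdas, log_xis):
        estimate = fit_kiv(10.0 ** log_lam, 10.0 ** log_xi)(x_eval)
        mse = np.mean((estimate - true_effect(x_eval)) ** 2)
        if mse < best_mse:
            best, best_mse = (log_lam, log_xi), mse
    return best, best_mse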