Invariant Representation Learning for Treatment Effect Estimation
Claudia Shi, Victor Veitch, and David M. Blei
Columbia University, Google Research
Abstract
The defining challenge for causal inference from observational data is the presence of 'confounders', covariates that affect both treatment assignment and the outcome. To address this challenge, practitioners collect and adjust for the covariates, hoping that they adequately correct for confounding. However, including every observed covariate in the adjustment runs the risk of including 'bad controls', variables that induce bias when they are conditioned on. The problem is that we do not always know which variables in the covariate set are safe to adjust for and which are not. To address this problem, we develop Nearly Invariant Causal Estimation (NICE). NICE uses invariant risk minimization (IRM) [Arj+19] to learn a representation of the covariates that, under some assumptions, strips out bad controls but preserves sufficient information to adjust for confounding. Adjusting for the learned representation, rather than the covariates themselves, avoids the induced bias and provides valid causal inferences. NICE is appropriate in the following setting. i) We observe data from multiple environments that share a common causal mechanism for the outcome, but that differ in other ways. ii) In each environment, the collected covariates are a superset of the causal parents of the outcome, and contain sufficient information for causal identification. iii) But the covariates may also contain bad controls, and it is unknown which covariates are safe to adjust for and which ones induce bias. We evaluate NICE on both synthetic and semi-synthetic data. When the covariates contain unknown collider variables and other bad controls, NICE performs better than existing methods that adjust for all the covariates.
Consider the following motivating causal inference example. We want to estimate the effect of sleeping pills on lung disease using electronic health records, collected from multiple hospitals around the world. For each hospital e and patient i, we observe whether the drug was administered, T^e_i, the patient's outcome, Y^e_i, and their covariates, X^e_i, which include comprehensive health and socioeconomic information. The different hospitals serve different populations, so the distribution of the records X^e differs across the datasets. But the causal mechanism between sleeping pills T^e and lung disease Y^e is the same.

The data in this example are observational. One challenge to causal inference from observational data is the presence of confounding variables that influence both T and Y [RR83; Pea00]. To account for confounding, we try to find them among X and then adjust for them.

To ensure that we have adjusted for all confounding variables, we might include every covariate in the medical record X. However, simply adjusting for the entire record runs the risk of including "bad controls" [BV07; Pea09; CH20], variables that induce a bias when they are adjusted for. A health condition caused by lung disease would be a bad control, as it is causally affected by the outcome.

We want to exclude bad controls from the adjustment set. One approach is to select confounders through a causal graph [Pea09]. When constructing a causal graph, domain experts describe the relationships between a few causally relevant variables, such as "socioeconomic status" and "health status." If we observed these variables directly, we could simply select the adjustment variables as the confounders in the graph. However, as illustrated in Figure 1, in practice the covariates we collect are measurements of convenience ("wage," "education," "BMI," and so forth) [CPE14; CEP16]. These observed covariates are many partial measurements of the causal variables.
Accordingly, even when the covariates suffice for adjustment in aggregate, there may not be a clean separation of the covariates into those required for adjustment and those that are bad controls. In this case, it is not possible to select a set of covariates for the adjustment.

Another approach is to restrict the adjustment set to the pre-treatment covariates [Ros02; Rub09]. However, this approach can lead us to include information that is predictive of treatment assignment but not the outcome. If the record is sufficiently rich, this information can lead to near-perfect prediction of treatment. This creates an apparent violation of overlap, the requirement that each unit had a non-zero probability of receiving treatment [D'A+20]. Practically, near-violations of overlap can lead to unstable or high-variance estimates of treatment effects.

These challenges suggest a new approach for causal estimation: we want a representation of the covariates that contains sufficient information for causal adjustment, excludes bad controls, and helps provide low-variance causal estimates. This paper presents a method to find such a representation.

Problem.
We now state our problem plainly: we want to do causal inference when it is unknown which covariates are safe to adjust for. We are interested in the case where our data is collected from multiple environments, as in the hospitals example above. Our main question is: how can we use the multiple environments to find a representation of the covariates for valid causal estimation?

To address this question, we develop nearly invariant causal estimation (NICE), an estimation procedure for causal inference from observational data where the data comes from multiple datasets. The datasets are drawn from distinct environments, corresponding to distinct distributions.

NICE adapts Invariant Risk Minimization (IRM) [Arj+19] for causal adjustment. IRM is a framework for solving prediction problems; the goal is to produce a predictor that is robust to changes in the deployment domain. The IRM procedure uses data from multiple environments to learn an invariant representation Φ(T, X), a function
Figure 1:
Health status (HS) and socioeconomic status (SE) are (unobserved) causally relevant variables. M_1, M_2, ..., M_n denote (observed) partial measurements of the variables (e.g., BMI, wage, etc.). Solid lines denote causal relationships. Dotted lines denote potential mappings between causal variables and measurements.

Figure 2:
If the composition of X^e is unknown, the treatment effect of T^e on Y^e cannot be identified. X^e_cn denotes confounders, X^e_cl denotes colliders, X^e_pa denotes parents of Y^e, X^e_an denotes non-parent ancestors of Y^e, and X^e_de denotes descendants of Y^e.

such that the outcome Y and the representation Φ(T, X) have the same relationship in each environment. Predictors built on top of this representation will have the desired robustness.

Our main observation is that the IRM invariant representation also suffices for causal adjustment. Informally, a representation is invariant if and only if it is informationally equivalent to the causal parents of the outcome Y [Arj+19]. For example, an invariant representation of the medical records will isolate the causal parents of lung disease. The causal parents of Y constitute an adjustment set that suffices for causal adjustment, minimally impacts overlap, and excludes all bad controls. Hence, adjusting for an invariant representation is a safe way to estimate the causal effect.

Contributions.
This paper develops the idea and algorithmic details of NICE, an estimation procedure that leverages data from multiple environments to do causal inference. It articulates the theoretical conditions under which NICE provides unbiased causal estimates and evaluates the method on synthetic and semi-synthetic causal estimation problems.
We observe multiple datasets. Each dataset is from an environment e, in which we observe a treatment T^e, an outcome Y^e, and other variables X^e, called covariates. Assume each environment involves the same causal mechanism between the causal parents of Y^e and Y^e, but otherwise might differ from the others, e.g., in the distribution of X^e. Assume we have enough information in X^e to estimate the causal effect, but we do not know the status of each covariate in the causal graph. A covariate might be an ancestor, confounder, collider, parent, or descendant. Figure 2 shows the graph and defines these terms. (This statement implicitly assumes no mediators, variables on the causal path between the treatment and outcome. To keep the exposition simple, we defer a discussion of mediators to § 3.)

Each environment e has a distribution P^e. The data from each environment is drawn i.i.d., {X^e_i, T^e_i, Y^e_i} ~ P^e. The causal mechanism relating Y to T and X is assumed to be the same in each environment. In the example from the introduction, different hospitals constitute different environments. All the hospitals share the same causal mechanism for lung disease, but they vary in the population distribution of who they serve, their propensity to prescribe sleeping pills, and other aspects of the distribution.

The goal is to estimate the average treatment effect on the treated (ATT) in each environment,

ψ^e ≜ E[Y^e | do(T^e = 1), T^e = 1] − E[Y^e | do(T^e = 0), T^e = 1].   (2.1)

The ATT is the difference between intervening to assign the treatment and intervening to prevent the treatment, averaged over the people who were actually assigned the treatment. The causal effect for any given individual does not depend on the environment. However, the ATT does depend on the environment because it averages over different populations of individuals. For the moment, consider one environment.
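To make this setup concrete, the following is a minimal sketch of multiple environments that share a single outcome mechanism while the covariate and treatment distributions shift. The coefficients and variable choices here are illustrative assumptions of ours, not the paper's actual simulation.

```python
import numpy as np

def simulate_environment(n, env_shift, rng):
    """One toy environment: the mechanism from (T, confounder) to Y is
    the same in every environment; only the covariate and treatment
    distributions depend on `env_shift`. All coefficients are illustrative."""
    x_cn = rng.normal(loc=env_shift, scale=1.0, size=n)   # confounder, shifts by env
    p_t = 1.0 / (1.0 + np.exp(-0.8 * x_cn))               # env-varying propensity
    t = rng.binomial(1, p_t)
    y = 1.5 * t + 2.0 * x_cn + rng.normal(size=n)         # shared causal mechanism
    x_cl = t + y + rng.normal(size=n)                     # collider: a "bad control"
    return {"x_cn": x_cn, "x_cl": x_cl, "t": t, "y": y}

rng = np.random.default_rng(0)
envs = [simulate_environment(1000, shift, rng) for shift in (0.0, 1.0, 2.0)]
```

Note that adjusting for `x_cn` alone would be valid here, while adjusting for `x_cl` would bias the estimate; the point of the setting is that which column is which is unknown.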
In theory, we can estimate the effect by adjusting for the confounding variables that influence both T and Y [RR83]. Let Z(X) be an admissible subset of X: it contains no descendants of Y and blocks all "backdoor paths" between Y and T [PP14]. An admissible subset in Figure 2 is any that includes X_cn but excludes X_cl and X_de. Using Z(X), the causal effect can be expressed as a function of the observational distribution,

ψ = E_X[ E_Y[Y | T = 1, Z(X)] − E_Y[Y | T = 0, Z(X)] | T = 1].   (2.2)

We estimate ψ in two stages. First, we fit a model Q̂ for the conditional expectation Q(T, Z(X)) = E_Y[Y | T, Z(X)]. Second, we use Monte Carlo to approximate the expectation over X,

ψ̂ = (1 / Σ_i t_i) Σ_{i : t_i = 1} ( Q̂(1, Z(x_i)) − Q̂(0, Z(x_i)) ).   (2.3)

The function Q̂ can come from any model that predicts Y from {T, Z(X)}. If the causal graph is known, then the admissible set Z(X) can easily be selected and the estimation in (2.2) is straightforward. But here we do not know the status of each covariate; if we inadvertently include bad controls in Z(X), then we will bias the estimate. To solve this problem, we develop a method for learning an admissible representation Φ(T, X) from datasets collected in multiple environments. An admissible representation is a function of the full set of covariates, but one that captures the confounding factors and excludes the bad controls, i.e., the descendants of Y. (For simple exposition, we focus on ATT estimation, though this method can also be applied to CATE or ATE estimation.) Given the representation, we estimate the conditional expectation E_Y[Y | Φ(T, X)] and proceed to estimate the causal effect.

IRM is a framework for learning predictors that perform well across many environments. We review the main ideas of IRM and then adapt it to causal estimation. Each environment is a causal structure and probability distribution.
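As a concrete illustration of the two-stage plug-in estimator in (2.3), here is a minimal sketch. The outcome model `q` below is a hypothetical stand-in for a fitted Q̂, with a constant effect of 1.5 built in by assumption.

```python
import numpy as np

def att_plugin(q_hat, t, x):
    """Plug-in ATT estimator of eq. (2.3): average Q-hat(1, z) - Q-hat(0, z)
    over the treated units only. `q_hat(t_value, x)` is any fitted outcome
    model for E[Y | T, Z(X)]; here Z(X) = X for simplicity."""
    x_treated = x[t == 1]
    return float(np.mean(q_hat(1, x_treated) - q_hat(0, x_treated)))

# Toy check with a hypothetical outcome model Q(t, x) = 1.5 t + 2 x:
rng = np.random.default_rng(0)
x = rng.normal(size=500)
t = rng.binomial(1, 0.5, size=500)
q = lambda tv, xv: 1.5 * tv + 2.0 * xv
print(att_plugin(q, t, x))  # 1.5, since the effect is constant in x
```

In practice Q̂ would be a flexible regression fit to the data; the averaging step is unchanged regardless of the model class.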
Informally, for an environment to be valid, it must preserve the causal mechanism relating the outcome and the other variables.
Definition 2.1 (Valid environment [Arj+19]). Consider a causal graph G and a distribution P(X, T, Y) respecting G. Let G^e denote the graph under an intervention and P^e = P(X^e, T^e, Y^e) be the distribution induced by the intervention. An intervention is valid with respect to (G, P) if (i) E_{P^e}[Y^e | Pa(Y)] = E_P[Y | Pa(Y)], and (ii) V(Y^e | Pa(Y)) is finite. An environment is valid with respect to (G, P) if it can be created by a valid intervention.

Given this definition, a natural notion of an invariant representation is one where the conditional expectation of the outcome is the same regardless of the environment.

Definition 2.2 (Invariant representation). A representation Φ(T, X) is invariant with respect to environments E if and only if E[Y^e | Φ(T^e, X^e)] = E[Y^{e'} | Φ(T^{e'}, X^{e'})] for all e, e' ∈ E.

Arjovsky et al. [Arj+19] recast the problem of finding an invariant representation as one about prediction. In this context, the goal of IRM is to learn a representation such that there is a single classifier w that is optimal in all environments. Thus IRM seeks a composition w ∘ Φ(T^e, X^e) that is a good estimate of Y^e in the given set of environments. This estimate is composed of a representation Φ(T, X) and a classifier w that estimates Y from the representation.

Definition 2.3 (Invariant representation via predictor [Arj+19]). A data representation
Φ :
X → H elicits an invariant predictor across environments E if there is a classifier w : H → Y that is simultaneously optimal for all environments. That is,

w ∈ argmin_{w̄ : H → Y} R^e(w̄ ∘ Φ) for all e ∈ E,   (2.4)

where R^e is the risk of the training objective in environment e.

The invariant representations in Definitions 2.2 and 2.3 align if we choose a loss function for which the minimizer of the associated risk in (2.4) is a conditional expectation. (Examples of such loss functions include squared error and cross entropy.) In this case, we can find an invariant predictor Q_inv = w ∘ Φ(T^e, X^e) = E[Y | Φ(T, X)] by solving (2.4) for both w and Φ. However, the general formulation of (2.4) is computationally intractable, so Arjovsky et al. [Arj+19] introduce IRMv1 as a practical alternative. (An admissible representation is analogous to an 'admissible set' [Pea00], where conditioning on an admissible set renders a causal effect identifiable.)

Definition 2.4 (IRMv1 [Arj+19]). IRMv1 is:

Φ̂ = argmin_Φ Σ_{e ∈ E} R^e(1.0 · Φ) + λ ‖∇_{w | w = 1.0} R^e(w · Φ)‖².   (2.5)

Notice that IRMv1 fixes the classifier to the simplest possible choice: multiplication by the scalar constant w = 1.0. The task is then to learn a representation Φ such that w = 1.0 is the optimal classifier in all environments. In effect, Φ becomes the invariant predictor, as Q_inv = 1.0 · Φ. The gradient norm penalizes deviations from the optimal classifier in each environment e, enforcing the invariance. The hyperparameter λ controls the trade-off between invariance and predictive accuracy.

In practice, we parameterize Φ with a neural network that takes T, X as input and outputs a real number. Let l be a loss function such as squared error or cross entropy, and let n_e be the number of units sampled in environment e.
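The objective in (2.5) is easy to evaluate when the loss is squared error, because the gradient with respect to the frozen scalar classifier w has a closed form at w = 1. The following sketch is our own illustration, not the paper's implementation: it operates on precomputed representation outputs φ^e = Φ(t^e, x^e) rather than a neural network, so no autodiff is needed.

```python
import numpy as np

def irmv1_objective(phi_by_env, y_by_env, lam):
    """IRMv1 objective (2.5) for squared-error loss, on precomputed
    representation outputs. For squared error, d/dw R^e(w * phi) at w = 1
    is 2 * mean(phi * (phi - y)) in closed form."""
    total = 0.0
    for phi, y in zip(phi_by_env, y_by_env):
        risk = np.mean((y - phi) ** 2)            # R^e(1.0 * Phi)
        grad_w = 2.0 * np.mean(phi * (phi - y))   # gradient at w = 1
        total += risk + lam * grad_w ** 2         # gradient-norm penalty
    return float(total)
```

A representation with φ = E[Y | φ] in every environment drives the penalty term to zero; in a full implementation Φ would be a network trained by stochastic gradient descent on this objective.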
Then, we learn Φ̂ by solving IRMv1 where each environment risk is replaced with the corresponding empirical risk:

R̂^e(Q) = (1/n_e) Σ_i l(y^e_i, Q(t^e_i, x^e_i)).   (2.6)

Now, Q̂_inv = 1.0 · Φ̂ is an empirical estimate of E[Y | Φ(T, X)]. (For more details about the intuition behind IRMv1, see Section 3.1 of [Arj+19].)

We now introduce nearly invariant causal estimation (NICE). NICE is a causal estimation procedure that uses data collected from multiple environments. NICE exploits invariance across the environments to perform causal adjustment without detailed knowledge of which covariates are bad controls.

Informally, the key connection between causality and invariance is that if a representation is invariant across all valid environments, then the information in that representation is exactly the information in the causal parents of Y. Since the causal structure relevant to the outcome is invariant across environments, a representation capturing only the causal parents will also be invariant. Conversely, Pa(Y) is the minimal information required for invariance: a representation that is invariant over all valid environments will be minimal; hence, an invariant representation must capture only the parents of Y.

NICE is based on two insights. First, as just explained, if Φ(T, X) is invariant over all valid environments, then E[Y | T, Pa(Y) \ {T}] = E[Y | Φ(T, X)]. Second, Pa(Y) \ {T} suffices for causal adjustment. That is, Pa(Y) \ {T} blocks any backdoor paths and does not include bad controls. Following (2.2),

ψ = E[ E[Y | T = 1, Pa(Y) \ {T}] − E[Y | T = 0, Pa(Y) \ {T}] | T = 1 ]   (2.7)
  = E[ E[Y | Φ(1, X)] − E[Y | Φ(0, X)] | T = 1 ].   (2.8)

Recall the invariant predictor Q_inv(T, X) = E[Y | Φ(T, X)]. The NICE procedure is:

1. Input: multiple datasets D^e := {(X^e_i, Y^e_i, T^e_i)}^{n_e}_{i=1}.
2. Estimate the invariant predictor Q̂_inv = 1.0 · Φ̂ using IRMv1.

3. Compute ψ̂^e = (1 / Σ_i t^e_i) Σ_{i : t^e_i = 1} ( Q̂_inv(1, x^e_i) − Q̂_inv(0, x^e_i) ) for each environment e.

Note that Q̂_inv can be any class of predictors. In § 5, we use the TARNet [SJS16] and Dragonnet [SBV19] architectures. We call the procedure 'nearly' invariant because we only ever have access to a limited number of environments, so we cannot be certain that we will achieve invariance across all possible environments.

We now establish the validity of NICE as a causal estimation procedure. All proofs are deferred to the appendix. First, consider the case where we observe data from a sufficiently diverse set of environments that the learned representation is invariant across all possible valid environments. We prove that conditioning on a fully invariant representation is the same as conditioning on the parents of Y.

Lemma 3.1.
Suppose that E[Y | Pa(Y) = a] ≠ E[Y | Pa(Y) = a′] whenever a ≠ a′. Then a representation Φ is invariant across all valid environments if and only if E[Y^e | Φ(T^e, X^e)] = E[Y | Pa(Y)] for all valid environments.

Lemma 3.1 helps show that a representation that elicits an invariant predictor suffices for adjustment.
Theorem 3.2.
Let L be a loss function such that the minimizer of the associated risk is a conditional expectation, and let Φ be a representation that elicits a predictor Q_inv that is invariant for all valid environments. Assuming there are no mediators between the treatment and the outcome, then

ψ^e = E[ Q_inv(1, X^e) − Q_inv(0, X^e) | T^e = 1 ].

Theorem 3.2 shows that the NICE estimand is equal to the ATT as long as invariance over the observed environments guarantees invariance on any valid environment (total invariance). Invariance across a limited set of diverse environments may already suffice. Assuming a linear DGP, Arjovsky et al. [Arj+19] establish sufficient conditions on the number and diversity of the training environments such that the learned representation generalizes to all valid environments. In the non-linear case, there are no known sufficiency results. However, Arjovsky et al. [Arj+19] give empirical evidence that access to even a few environments may suffice.

In addition to identifiability, non-parametric estimation of treatment effects with finite data, i.e., (2.3), requires 'positivity' or 'overlap': both treatment and non-treatment must have a non-zero probability at all levels of the confounders [RR83; Imb04]. Let Φ(X^e) be the covariate representation, i.e., Φ(X^e) = {Φ(T^e = 1, X^e), Φ(T^e = 0, X^e)}. In the following theorem, we establish that if the covariate set X^e is sufficient for overlap, then Φ(X^e) is sufficient for overlap.

Theorem 3.3.
Suppose ε ≤ P(T^e = 1 | X^e) ≤ 1 − ε with probability 1. Then ε ≤ P(T^e = 1 | Φ(X^e)) ≤ 1 − ε with probability 1.

Φ(X^e), by definition, contains no more information than X^e; therefore Φ(X^e) satisfies overlap if X^e satisfies overlap.

Even when total invariance is not achieved, NICE may still improve estimation quality when there are possible colliders in the adjustment set. Invariance across a subset of environments should remove at least some (if not all) collider dependence. Intuitively, conditioning on only a subset of the collider information should reduce bias in the resulting estimate. Theorem 7.1 in the appendix shows that this intuition holds for at least one illustrative causal structure. A fully general statement remains open.

The case of mediators.
So far, we have assumed no mediators between T and Y. What happens to the interpretation of the learned parameter if the adjustment set contains mediators? Intuitively, NICE captures the information in the direct link between T and Y. Concretely, if there are no mediators, the parameter reduces to the ATT. If there are mediators but no confounders, the parameter reduces to the natural direct effect [Pea00]. If there are mediators and confounders, we define the parameter as the natural direct effect on the treated (NDET). We continue the discussion of mediators in the appendix.

To address the motivating causal adjustment question, one possible solution is to restrict the adjustment set to the pre-treatment covariates [Ros02; Rub09]. However, three issues arise. 1) If the record is sufficiently rich, it might predict the treatment perfectly [D'A+20]. This creates a poor-overlap issue: some units have a zero probability of assignment to one of the treatment conditions, whereas sufficient overlap is required for non-parametric estimation [RR83]. 2) In some cases, including non-confounders that are predictive of Y may lower the variance of the causal estimates [RJ91; ML09]. 3) Pre-treatment variables do not necessarily mean pre-treatment measurements. "Micro"-level measurements [CPE14; CEP16] collected after the treatment may still be reflective of pre-treatment variables. In the motivating example, a measurement such as "BMI" that was collected days after the treatment may still be a good measurement of the pre-treatment variable of interest, the health status.

Another possible solution is to select the adjustment set through causal discovery [MM+99; Spi+00; Shi+06; GZS19] or variable selection [SE17; PBM16; HDPM18]. Causal discovery methods aim to recover causal relationships by analyzing purely observational data. Peters et al. [PBM16] and Heinze-Deml et al. [HDPM18] leverage the multiple-environments setup to find a set of covariates that are causal parents of the outcome.
However, these methods are designed to discover robust causal relationships, not to find an adjustment set for downstream inference. The 'robustness' feature may be desirable for discovering important features; it is less desirable for downstream inference, as it may throw out features relevant to the inference. As an illustration, Zhao et al. [Zha+16] showed that slight perturbations of model assumptions might lead to a model's poor performance.

NICE uses the principle of invariance to solve an estimation problem. A thread of

Figure 3:
In (a), X_2 has no relation with the other variables; conditioning on X_2 does not induce additional bias asymptotically. In (b) and (c), X_2 is downstream of Y; conditioning on X_2 induces bias.

related work uses the same principle to tackle different problems. The principle of invariance is: if a relationship between X and Y is causal, then it is invariant to perturbations that change the distributions of X and Y. Conversely, if a relationship is invariant to many different perturbations, it must be causal [Haa43; Büh18]. This principle inspired a line of causality-based domain adaptation and robust prediction work.

Rojas-Carulla et al. [RC+18] apply the idea to causal transfer learning, assuming the conditional distribution of the target variable given some subset of covariates is the same across domains. Magliacane et al. [Mag+18] relax that assumption. Peters et al. [PBM16] and Heinze-Deml et al. [HDPM18] apply this principle to causal variable selection from multiple environments in the linear and non-linear settings. Zhang et al. [Zha+20] recast the problem of domain adaptation as a problem of Bayesian inference on graphical models. Arjovsky et al. [Arj+19] advocate a new generalizable statistical learning principle based on invariance.

These works focus on the problem of robust prediction. NICE focuses on the problem of causal estimation. NICE is complementary, as it studies the idea of using domain adaptation methods for causal estimation. In particular, we focus on the application of IRM to treatment effect estimation.

We perform three experiments to assess the estimation quality of NICE. We find that: i) When total invariance is guaranteed, NICE captures the information relevant for the adjustment. ii) When total invariance is not guaranteed, near invariance still improves the estimation quality. iii) NICE reduces bias when there are bad controls in the adjustment set.
iv) NICE improves finite-sample estimation quality compared to the alternative adjustment schemes suggested in the introduction and related work.
Setup.
Ground truth for counterfactuals is not accessible in real-world data. Therefore, empirical evaluation of causal estimation methods generally relies on synthetic or semi-synthetic datasets. Recovering the parents of Y is known to be achievable with a linear DGP and well-specified models [Arj+19]. Therefore, in the first experiment, we consider three variants of a simple linear DGP, illustrated in Figure 3, as a minimal proof of concept. In the second experiment, we use SpeedDating, a benchmark dataset simulated as part of the 2019 Atlantic Causal Inference Competition (ACIC) [Gru+]. In the third experiment, we construct a non-linear DGP, illustrated in Figure 6, to assess the finite-sample performance of NICE in comparison to alternative adjustment schemes.
We consider two estimands: the sample average treatment effect on the treated (SATT), ψ_s = (1 / Σ_i t_i) Σ_{i : t_i = 1} (Q(1, Z(x_i)) − Q(0, Z(x_i))), and the conditional average treatment effect (CATE), τ(x_i) = Q(1, Z(x_i)) − Q(0, Z(x_i)) [Imb04]. For the SATT, the evaluation metric is the mean absolute error (MAE), ε_att = |ψ̂_s − ψ_s|. For the CATE, the metric is the Precision in Estimation of Heterogeneous Effect (PEHE), ε_PEHE = (1/n) Σ_{i=1}^{n} (τ̂(x_i) − τ(x_i))² [Hil11]. PEHE reflects the ability to capture individual variation in treatment effects.

We simulate data with the three causal graphs in Figure 3. With a slight abuse of notation, each intervention e generates a new environment e with interventional distribution P(X^e, T^e, Y^e). T^e is the binary treatment and Y^e is the outcome. X^e is a 10-dimensional covariate set that differs across DGPs. X^e = (X^e_1, X^e_2), where X^e_1 is a five-dimensional confounder. X^e_2 is either noise, a descendant, or a collider, depending on the DGP.

For evaluation, following [Arj+19], we create three environments. We run 10 simulations; in each simulation, we draw 1000 samples from each environment. We compare against Invariant Causal Prediction (ICP) [PBM16], which selects a subset of the covariates as causal parents and uses that subset for causal adjustment. We also compare against linear regression with separate regressors for the treated and control populations (OLS-2). We examine the models' performance under two types of variation: 1) whether the observed covariates are scrambled versions of the true covariates; 2) whether the treatment effects are heteroskedastic across environments. The result in Figure 4 is under the scrambled and heteroskedastic variant. The results of the other variants and the details of each DGP are in the appendix.
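The two evaluation metrics can be written down directly; this is a small helper sketch in our own notation for ε_att and ε_PEHE.

```python
import numpy as np

def satt_mae(psi_hat, psi_true):
    """MAE for a single SATT estimate: |psi_hat - psi_true|."""
    return abs(psi_hat - psi_true)

def pehe(tau_hat, tau_true):
    """Precision in Estimation of Heterogeneous Effect: the mean
    squared error of the CATE estimates over units. Result tables
    typically report its square root."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    tau_true = np.asarray(tau_true, dtype=float)
    return float(np.mean((tau_hat - tau_true) ** 2))
```

In the simulations, `psi_true` and `tau_true` are available by construction, since the DGP is known.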
Figure 4 shows that when the model is well specified (simulation setting (a)), NICE performs well relative to OLS-2. When the covariate set includes bad controls that are closely related to the outcome, OLS-2 relies on the spurious correlations but NICE discards them. We find that ICP often recovers none or very few of the covariates. We believe this is because 1) the amount of noise in the DGP is non-trivial, and 2) in some settings the observed covariates are scrambled versions of the true covariates. The result suggests that while ICP is a robust causal discovery method, it should not be used for downstream estimation.
We validate NICE in the non-linear case on a benchmark dataset, SpeedDating. SpeedDating was collected to study gender differences in mate selection [Fis+06]. The study recruited university students to participate in speed dating, and collected objective and subjective information such as 'undergraduate institution' and 'perceived attractiveness'. It has 8378 entries and 185 covariates. The ACIC 2019 simulation samples subsets of the covariates to simulate the treatment T and outcome Y. Specifically, it provides four

Figure 4: NICE yields better estimates if X contains bad controls, and performs as well as OLS-2 otherwise. When ICP returns an empty set, we report the estimated causal effect as zero. The figure reports MAE and standard error of the SATT over 10 simulations.

Figure 5:
NICE yields better estimates than adjusting for the minimal adjustment set X_t or adjusting for the pre-treatment variables. NICE performs comparably to adjusting for all X and A. The figure reports MAE and standard deviation of the SATT over 10 simulations.

modified DGPs: Mod1: parametric models; Mod2: complex models; Mod3: parametric models with poor overlap; Mod4: complex models with treatment heterogeneity. Each modification includes three versions (low, med, high), indicating an increasing number of covariates included in the models for T and Y.

Table 1:
NICE performs well relative to the baselines if the adjustment set does not contain bad controls. The left table reports MAE and bootstrap standard deviation of the SATT estimation; the model is trained and evaluated on all three environments. The right table reports PEHE and bootstrap standard deviation of the out-of-distribution CATE estimation; the model is trained on two environments and evaluated on the third.
Within-sample (ε_att) | Out-of-sample (√ε_PEHE)
           Mod1   Mod2   Mod3   Mod4  |  Mod1   Mod2   Mod3   Mod4
TARNet     . ± .  . ± .  . ± .  . ± . |  . ± .  . ± .  . ± .  . ± .
+NICE      . ± .  . ± .  . ± .  . ± . |  . ± .  . ± .  . ± .  . ± .
Dragonnet  . ± .  . ± .  . ± .  . ± . |  . ± .  . ± .  . ± .  . ± .
+NICE      . ± .  . ± .  . ± .  . ± . |  . ± .  . ± .  . ± .  . ± .

We compare NICE against two neural network models similar in structure to TARNet [SJS16] and Dragonnet [SBV19]. TARNet is a two-headed model with a shared representation Z(X) ∈ R^p and two heads for the treated and control representations. The network has 4 layers for the shared representation and 3 layers for each expected-outcome head. The hidden layer size is 250 for the shared representation layers and 100 for the expected-outcome layers. We use Adam [KB14] as the optimizer, set the learning rate to 0.001, and use an l2 regularization rate of 0.0001. For Dragonnet, an additional 'treatment head' for treatment prediction is added.
We intervene on a variable to generate three environments and draw 2000 samples from each environment. We compare the estimation quality of the within-sample SATT and out-of-sample CATE over 10 bootstraps in Table 1. Since the DGPs do not include bad controls in the covariate set, NICE performs as well as existing methods, though factors such as optimization difficulty may lead to variation.

To examine whether NICE helps reduce collider bias, we simulate 20 copies of a collider, X^e_co = T^e + Y^e + N(0, σ_e²), with an environment-specific noise scale σ_e, and include them in the covariate set. As shown in Table 2, NICE reduces collider bias across simulation setups. However, we also observe that while it reduces the collider bias, it does not eliminate it completely. One possible reason is that the learned predictor is not optimal.
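The collider construction just described can be sketched as follows; the specific noise scale passed in is an assumption of this sketch, since the per-environment variances are not reproduced here.

```python
import numpy as np

def add_collider_copies(t, y, n_copies, noise_scale, rng):
    """Build a matrix of noisy collider columns X_co = T + Y + N(0, scale^2),
    one column per copy, to be appended to the covariate set. Mirrors the
    collider construction in the text; `noise_scale` is illustrative."""
    n = t.shape[0]
    noise = rng.normal(scale=noise_scale, size=(n, n_copies))
    return (t + y)[:, None] + noise  # broadcast T + Y across columns

rng = np.random.default_rng(0)
t = rng.binomial(1, 0.5, size=100).astype(float)
y = 1.5 * t + rng.normal(size=100)
colliders = add_collider_copies(t, y, n_copies=20, noise_scale=1.0, rng=rng)
```

Each column is a descendant of both T and Y, so naively adjusting for these columns would induce collider bias; the experiment checks whether NICE discards them.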
NICE reduces estimation bias in the presence of colliders. The table reports the MAE and bootstrap standard deviation of the SATT. The model is trained and evaluated on three environments.
Within-sample ε_ATT

[Table 2 body: for each collider-noise level (low, med, high), rows TARNet and +NICE; columns Mod1–Mod4; numeric entries not preserved in this version.]

In the previous experiments, we considered the setting where the status of no covariate is known, and showed that NICE helps reduce collider bias when it is unclear which variables to adjust for. In this experiment, in addition to the motivating setting, we consider the case where the causal graph and the pre-treatment variables are known. We simulate non-linear data according to Figure 6; the details of the simulation are in the supplementary material. We consider the following adjustment schemes: (1) no adjustment; (2) adjusting for all variables; (3) adjusting for A: {A, A_t, A_y} and X: {X_t, X_y}, the variables that are safe to adjust for; (4) adjusting for the pre-treatment variables {X_t, A_t}; (5) adjusting for the minimal adjustment set, X_t; (6) adjustment via NICE.

Analysis.
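The contrast between safe and unsafe adjustment sets can be illustrated with a simple plug-in outcome regression. The sketch below uses a linear toy DGP of our own choosing, not the paper's simulation: adjusting for the confounder recovers the effect, while additionally adjusting for a descendant of Y biases it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
X_t = rng.normal(size=n)                          # confounder of T and Y
T = (X_t + rng.normal(size=n) > 0).astype(float)
Y = T + 2.0 * X_t + rng.normal(size=n)            # true treatment effect = 1
Z = Y + rng.normal(size=n)                        # bad control: descendant of Y

def satt_ols(covariates):
    """Plug-in SATT from a linear outcome model: the coefficient on T."""
    D = np.column_stack([np.ones(n), T] + covariates)
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return float(beta[1])

satt_good = satt_ols([X_t])      # analogue of scheme (5): minimal adjustment set
satt_bad = satt_ols([X_t, Z])    # adding the descendant Z attenuates the estimate
```

In this linear setup the descendant-adjusted coefficient converges to roughly half the true effect, a concrete instance of the bias that schemes (1) and (2) risk.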
As shown in Figure 5, NICE yields better estimates than adjusting for the minimal adjustment set X_t and than adjusting for all pre-treatment variables. NICE performs comparably to adjusting for all of X and A, the variables that are safe to adjust for. While adjustment schemes (3), (4), (5), and (6) are all asymptotically unbiased, the result suggests that including covariates that are predictive of the outcome improves the estimation quality. This experiment shows that even when we can isolate some covariates that are pre-treatment, performing causal adjustment through NICE may still improve the finite-sample estimation quality. A comparison of Dragonnet with Dragonnet + NICE is in the Appendix.

Discussion
Figure 6:
If the graph and the data distribution are observed, we can identify the treatment effect by adjusting for the minimal adjustment set {X_t}, the pre-treatment set {X_t, A_t}, or {A, A_t, A_y, X_t, X_y}.

This paper develops nearly invariant causal estimation (NICE), connecting the idea of invariant representation learning to the goals of causal inference. NICE helps find an admissible representation for causal adjustment when adjusting for all covariates may otherwise induce bias.

Questions and limitations suggest avenues for future work. (1) IRM requires the data to include all causal parents of the outcome, and provides no guarantees if this assumption is violated. In principle, causal identification only requires that we observe variables that affect both Y and T. Is it possible to address this gap? (2) NICE uses domain adaptation to help filter out bad controls. While IRM is a promising robust predictor, can we adapt others to develop NICE-type causal estimation procedures? (3) There are sensitivity analysis methods for assumptions such as "no unobserved confounding," but not for "sufficiently diverse environments." How can we perform a sensitivity analysis for invariant estimation? (4) NICE does not explicitly use propensity score information. Can we use IRM with propensity scores to develop statistically efficient non-parametric ('doubly robust') estimators?

References

[Arj+19] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. "Invariant risk minimization". arXiv preprint arXiv:1907.02893 (2019).
[BV07] J. Bhattacharya and W. B. Vogt. Do instrumental variables belong in propensity scores? Tech. rep. National Bureau of Economic Research, 2007.
[Büh18] P. Bühlmann. "Invariance, causality and robustness". arXiv preprint arXiv:1812.08233 (2018).
[CEP16] K. Chalupka, F. Eberhardt, and P. Perona. "Multi-level cause-effect systems". In: Artificial Intelligence and Statistics. 2016.
[CPE14] K. Chalupka, P. Perona, and F. Eberhardt. "Visual causal feature learning". arXiv preprint arXiv:1412.2309 (2014).
[CH20] C. Cinelli and C. Hazlett. "Making sense of sensitivity: extending omitted variable bias". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2020).
[D'A+20] A. D'Amour, P. Ding, A. Feller, L. Lei, and J. Sekhon. "Overlap in observational studies with high-dimensional covariates". In: Journal of Econometrics (2020).
[Fis+06] R. Fisman, S. S. Iyengar, E. Kamenica, and I. Simonson. "Gender differences in mate selection: evidence from a speed dating experiment". In: The Quarterly Journal of Economics (2006).
[GZS19] C. Glymour, K. Zhang, and P. Spirtes. "Review of causal discovery methods based on graphical models". In: Frontiers in Genetics (2019).
[Gru+] S. Gruber, G. Lefebvre, A. Piché, and T. Schuster. Data Challenge. url: https://sites.google.com/view/acic2019datachallenge.
[Haa43] T. Haavelmo. "The statistical implications of a system of simultaneous equations". In: Econometrica, Journal of the Econometric Society (1943).
[HDPM18] C. Heinze-Deml, J. Peters, and N. Meinshausen. "Invariant causal prediction for nonlinear models". In: Journal of Causal Inference (2018).
[Hil11] J. L. Hill. "Bayesian nonparametric modeling for causal inference". In: Journal of Computational and Graphical Statistics (2011).
[Imb04] G. W. Imbens. "Nonparametric estimation of average treatment effects under exogeneity: a review". In: Review of Economics and Statistics (2004).
[KB14] D. P. Kingma and J. Ba. "Adam: a method for stochastic optimization". arXiv preprint arXiv:1412.6980 (2014).
[Mag+18] S. Magliacane, T. van Ommen, T. Claassen, S. Bongers, P. Versteeg, and J. M. Mooij. "Domain adaptation by using causal inference to predict invariant conditional distributions". In: Advances in Neural Information Processing Systems. 2018.
[ML09] K. L. Moore and M. J. van der Laan. "Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation". In: Statistics in Medicine (2009).
[MM99] K. Murphy and S. Mian. Modelling gene expression data using dynamic Bayesian networks. Tech. rep. Citeseer, 1999.
[NDO19] T. Q. Nguyen, A. Dafoe, and E. L. Ogburn. "The magnitude and direction of collider bias for binary variables". In: Epidemiologic Methods (2019).
[Pea00] J. Pearl. Causality: models, reasoning and inference. 2000.
[Pea09] J. Pearl. Causality. 2009.
[PP14] J. Pearl and A. Paz. "Confounding equivalence in causal inference". In: Journal of Causal Inference (2014).
[PBM16] J. Peters, P. Bühlmann, and N. Meinshausen. "Causal inference by using invariant prediction: identification and confidence intervals". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2016).
[RJ91] L. D. Robinson and N. P. Jewell. "Some surprising results about covariate adjustment in logistic regression models". In: International Statistical Review / Revue Internationale de Statistique (1991).
[RC+18] M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. "Invariant models for causal transfer learning". In: The Journal of Machine Learning Research (2018).
[Ros02] P. R. Rosenbaum. Observational Studies. 2002.
[RR83] P. R. Rosenbaum and D. B. Rubin. "The central role of the propensity score in observational studies for causal effects". In: Biometrika (1983).
[Rub09] D. B. Rubin. "Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups?" In: Statistics in Medicine (2009).
[SJS16] U. Shalit, F. D. Johansson, and D. Sontag. "Estimating individual treatment effect: generalization bounds and algorithms". arXiv e-prints arXiv:1606.03976 (2016).
[SBV19] C. Shi, D. M. Blei, and V. Veitch. "Adapting neural networks for the estimation of treatment effects". In: Advances in Neural Information Processing Systems (2019).
[Shi+06] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. "A linear non-gaussian acyclic model for causal discovery". In: Journal of Machine Learning Research (2006).
[SE17] S. M. Shortreed and A. Ertefaie. "Outcome-adaptive lasso: variable selection for causal inference". In: Biometrics (2017).
[?] arXiv preprint arXiv:2002.03278 (2020).
[Zha+16] Q. Zhao, C. Zheng, T. Hastie, and R. Tibshirani. "Comment on causal inference using invariant prediction". arXiv preprint arXiv:1501.01332 (2016).
Appendix
Lemma 3.1.
Suppose that E[Y | Pa(Y) = a] ≠ E[Y | Pa(Y) = a′] whenever a ≠ a′. Then a representation Φ is invariant across all valid environments if and only if E[Y^e | Φ(T^e, X^e)] = E[Y | Pa(Y)] for all valid environments.

Proof. The 'if' direction is immediate.

To establish the 'only if' direction, we first show that Φ must contain at least Pa(Y), in the sense that E[Y | Φ(X)] = E[Y | Pa(Y) ∪ Z] for some set Z. We proceed by contradiction. Suppose that conditioning on Φ is equivalent to conditioning on only Pa(Y) \ {P} ∪ Z, where P is a parent of Y. We now create two environments by setting P = p and P = p′. Since P is a parent of Y, the second rule of do-calculus [Pea00] gives

E[Y | Pa(Y) \ {P} ∪ Z; do(P = p)] = E[Y | Pa(Y) \ {P} ∪ Z, P = p] and
E[Y | Pa(Y) \ {P} ∪ Z; do(P = p′)] = E[Y | Pa(Y) \ {P} ∪ Z, P = p′].

The equality E[Y | Pa(Y) \ {P} ∪ Z, P = p] = E[Y | Pa(Y) \ {P} ∪ Z, P = p′] holds only if P is conditionally independent of Y given Pa(Y) \ {P} ∪ Z. Since P is a parent of Y, by the first assumption of the lemma the equality does not hold. It follows that

E[Y | Pa(Y) \ {P} ∪ Z; do(P = p)] ≠ E[Y | Pa(Y) \ {P} ∪ Z; do(P = p′)].

That is, if conditioning on Φ were equivalent to conditioning on less information than Pa(Y) ∪ Z, then Φ would not be invariant across all valid environments.

It remains to show that Φ does not contain any more information than Pa(Y).

First, Φ cannot contain any descendants of the outcome. Suppose that Φ depends on some descendant D of Y, in the sense that there is at least one environment and d ≠ d′ where E[Y | Φ(X \ D, D = d)] ≠ E[Y | Φ(X \ D, D = d′)]. Then construct a new environment e by randomly intervening, setting do(D = d) or do(D = d′), each with probability 1/2. In this new environment, there is no relationship between Y and D. Accordingly, E[Y^e | Φ(X^e \ D^e, D^e = d)] = E[Y^e | Φ(X^e \ D^e, D^e = d′)]. Thus, the conditional expectations are not equal (as functions of d) in the two environments, a contradiction.

Next, Φ need not contain the non-parent ancestors of the outcome, because E[Y | {A} ∪ Pa(Y)] = E[Y | Pa(Y)] by the Markov property of the causal graph, where A is any set of non-parent ancestors. Since Φ contains Pa(Y), it follows that Φ does not depend on any non-parent ancestor A.

Theorem 3.2.
Let L be a loss function such that the minimizer of the associated risk is a conditional expectation, and let Φ be a representation that elicits a predictor Q_inv that is invariant for all valid environments. Assuming there are no mediators between the treatment and the outcome,

ψ^e = E[Q_inv(1, X^e) − Q_inv(0, X^e) | T^e = 1].
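In estimation, the expression in Theorem 3.2 becomes a plug-in average of the invariant predictor's treated/control contrast over the treated units. A generic sketch, where q_fn is a stand-in for any fitted Q_inv (the toy predictor below is our own choice, used only to check the arithmetic):

```python
import numpy as np

def att_plugin(q_fn, X, T):
    """psi_hat = average of Q(1, x) - Q(0, x) over the treated units."""
    treated = np.asarray(T) == 1
    Xt = np.asarray(X)[treated]
    return float(np.mean(q_fn(1, Xt) - q_fn(0, Xt)))

# toy check: for Q(t, x) = 3 t + sum(x), the contrast is exactly 3
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
T = (X[:, 0] > 0).astype(int)
q = lambda t, x: 3.0 * t + x.sum(axis=1)
psi_hat = att_plugin(q, X, T)
```

With a learned Q_inv, the same two lines of averaging give the environment-specific ψ^e.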
Figure 7:
V-structure graph. We denote the bias induced by conditioning on X as V-bias.
Figure 8:
Y-structure graph. We denote the bias induced by conditioning on D as Y-bias.

Proof.
We assume the technical condition of Lemma 3.1, that E[Y | Pa(Y) = a] ≠ E[Y | Pa(Y) = a′] whenever a ≠ a′. This is without loss of generality, because violations of this condition do not lead to different causal effects.

By the assumption on the loss function, the elicited invariant predictor is E[Y | Φ(T, X)]. Lemma 3.1 shows that E[Y | Φ(T, X)] = E[Y | Pa(Y)]. We further observe that the non-treatment parents of Y suffice to block all backdoor paths. It follows that the ATT can be expressed as

Ψ = E[ E[Y | T = 1, Pa(Y) \ {T}] − E[Y | T = 0, Pa(Y) \ {T}] | T = 1]
  = E[ E[Y | Φ(1, X)] − E[Y | Φ(0, X)] | T = 1].

Theorem 3.3.
Suppose that ε ≤ P(T^e = 1 | X^e) ≤ 1 − ε with probability 1. Then ε ≤ P(T^e = 1 | Φ(X^e)) ≤ 1 − ε with probability 1.

Proof. The proof follows directly from Theorem 1 in [D'A+20]. The intuition is that the richer the covariate set, the more accurately the treatment assignment can be predicted [D'A+20]. The covariate representation Φ(X^e) by definition contains less information than X^e; therefore Φ(X^e) satisfies overlap if X^e satisfies overlap.

Consider the DGP with binary variables {X, Y, T} illustrated in Figure 7, where X is causally influenced by Y and T.

Theorem 7.1.
Let cov denote the covariance between two variables. Define the collider bias at X = c as Δ(X = c) = cov(T, Y | X = c) − cov(T, Y), and the collider bias of X as Δ(X) = |P(X = 1)Δ(X = 1) + P(X = 0)Δ(X = 0)|. Let Φ(T, X) be a random variable with P(Φ(T, X) = X) = α ≥ 0.5. Suppose P(X = 1) = 0.5 and that Δ(X = 1) has the same sign as Δ(X = 0). Then conditioning on X induces more collider bias than conditioning on its coarsening Φ(T, X):

Δ(Φ(T, X)) ≤ Δ(X).

Proof.
The proof follows corollary 2.1 in [NDO19].
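As a numeric sanity check on Theorem 7.1, the sketch below simulates a binary V-structure (the collider mechanism P(X=1 | T, Y) is our own choice, made symmetric so that P(X=1) = 0.5) and verifies that a noisy coarsening of the collider induces less bias than the collider itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# binary V-structure: T and Y are independent coins, X is their collider
T = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.5, n)
p_x = 1.0 / (1.0 + np.exp(-(2.0 * T + 2.0 * Y - 2.0)))  # P(X=1 | T, Y), symmetric
X = rng.binomial(1, p_x)

def collider_bias(V):
    """Delta(V) = |P(V=1) Delta(V=1) + P(V=0) Delta(V=0)| as in Theorem 7.1."""
    base = np.cov(T, Y)[0, 1]
    d1 = np.cov(T[V == 1], Y[V == 1])[0, 1] - base
    d0 = np.cov(T[V == 0], Y[V == 0])[0, 1] - base
    return abs(V.mean() * d1 + (1 - V.mean()) * d0)

alpha = 0.8                                      # P(Phi = X); any alpha >= 0.5 works
Phi = np.where(rng.random(n) < alpha, X, 1 - X)  # noisy coarsening of the collider
# conditioning on Phi induces strictly less collider bias than conditioning on X
```

For this setup the bias of Φ should come out close to (2α − 1)² times the bias of X, matching the factor that appears in the derivation below the corollary.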
Corollary 2.1.
We refer to the collider bias in the V substructure embedded in the Y structure as 'embedded V-bias' and denote it Δ(C = c). On the covariance effect scale, the Y-bias Δ(D = d) relates to the embedded V-bias through the formula

Δ(D = d) = [p(D = d | C = 1) − p(D = d | C = 0)] / P(D = d)² · [ p(D = d | C = 1) P(C = 1)² Δ(C = 1) − p(D = d | C = 0) P(C = 0)² Δ(C = 0) ].

With the corollary above, let D denote Φ(T, X) and let C denote the collider X in Figure 7. The bias induced by conditioning on D is less than the bias induced by conditioning on C:

Δ(D = 1) = (2α − 1)/0.25 · (0.25 α Δ(C = 1) − 0.25 (1 − α) Δ(C = 0))
         = (2α − 1)(α Δ(C = 1) − (1 − α) Δ(C = 0))
Δ(D = 0) = (1 − 2α)/0.25 · (0.25 (1 − α) Δ(C = 1) − 0.25 α Δ(C = 0))
         = (1 − 2α)((1 − α) Δ(C = 1) − α Δ(C = 0))
Δ(C) = |0.5 Δ(C = 0) + 0.5 Δ(C = 1)|
Δ(D) = |0.5 Δ(D = 0) + 0.5 Δ(D = 1)|
     = |0.5 (1 − 2α)((1 − α) Δ(C = 1) − α Δ(C = 0)) + 0.5 (2α − 1)(α Δ(C = 1) − (1 − α) Δ(C = 0))|
     = |0.5 (2α − 1)² Δ(C = 1) + 0.5 (2α − 1)² Δ(C = 0)|
     ≤ Δ(C),

where the inequality holds because (2α − 1)² ≤ 1 and Δ(C = 1) and Δ(C = 0) share the same sign.

Throughout most of the paper, we assumed there are no mediators between treatment and outcome. What happens to the interpretation of the learned parameter if the adjustment set contains mediators? Intuitively, NICE retains the direct link between the treatment and the outcome. Specifically, if there are no mediators, the parameter reduces to the ATT. If there are mediators but no confounders, the parameter reduces to the natural direct effect [Pea00]. If there are both mediators and confounders, the NICE estimand is a non-standard causal target that we call the natural direct effect on the treated (NDET).

Conceptually, the NDET describes the expected change in outcome Y for the treated population induced by changing the value of T, while keeping all mediating factors M constant at whatever values they would have obtained under do(t). The main point is that the NDET answers questions such as, "does this treatment have a substantial direct effect on this outcome?". Substantively, the NDET is the natural direct effect, adjusted for confounders.

Formally, the NDET for environment e is

ψ^e = E_{M^e | T^e = 1}[ E[Y^e | M^e; do(T^e = 1)] − E[Y^e | M^e; do(T^e = 0)] | T^e = 1 ].   (7.1)

With adjustment set W^e, this causal effect can be expressed through a parameter of the observational distribution:

ψ^e = E_{M^e, W^e}[ E[Y^e | T^e = 1, M^e, W^e] − E[Y^e | T^e = 0, M^e, W^e] | T^e = 1 ].   (7.2)

Importantly, the mediators M^e and the confounders W^e enter (7.2) in the same way. Accordingly, we do not need to know which observed variables are mediators and which are confounders to compute the parameter.
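Equation (7.2) suggests a direct plug-in estimator: regress Y on (T, M, W), then contrast T = 1 against T = 0 for the treated. The sketch below uses a linear toy DGP of our own choosing, where the regression is correctly specified and the T coefficient equals the NDET; in practice the regression would be the invariant predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
W = rng.normal(size=n)                                 # confounder
T = (W + rng.normal(size=n) > 0).astype(float)
M = T + rng.normal(size=n)                             # mediator
Y = 1.5 * T + 2.0 * M + W + rng.normal(size=n)         # direct effect of T is 1.5

# plug-in for (7.2): OLS of Y on (1, T, M, W); mediators and confounders
# enter the regression in exactly the same way
D = np.column_stack([np.ones(n), T, M, W])
beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
ndet_hat = float(beta[1])
```

Note that the naive contrast without adjusting for M would instead pick up the total effect (direct plus mediated), which is 3.5 here.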
Under the NICE procedure, we condition on all parents of Y^e, including possible mediators. Thus, the NICE estimand is the NDET in each environment.

Experiment 1
We simulate data with the three causal graphs in Figure 3. With a slight abuse of notation, each intervention e generates a new environment e with interventional distribution P(X^e, T^e, Y^e). T^e is the binary treatment and Y^e is the outcome. X^e is a 10-dimensional covariate set that differs across DGPs: X^e = (X₁^e, X₂^e), where X₁^e is a five-dimensional confounder and X₂^e is either noise, a descendant, or a collider, depending on the DGP. We examine the models' performance under two types of variation: 1) whether the observed covariates are scrambled versions of the true covariates; 2) whether the treatment effects are heteroskedastic across environments. The data generating process is:

X₁^e ← N(0, σ_e²)
P^e ← sigmoid(X₁^e · w_xt + N(0, σ_t²))
T^e ← Bern(P^e)
τ ← N(0, σ²)
Y^e ← X₁^e · w_xy + T^e · τ + N(0, σ_e²)

X₂^e equals N(0, σ_e²) in setting (a), σ_e · Y^e + N(0, σ_x²) in setting (b), and σ_e · Y^e + T^e + N(0, σ_x²) in setting (c). For the four variants: in the scrambled setting, the observed covariates are scrambled versions of the true covariates, while in the un-scrambled setting they are observed directly; in the environment-level heteroskedastic setting, τ ← N(0, σ_e²), while in the environment-level homoscedastic setting, τ ← N(0, σ²) with σ fixed across environments. The performance under the four variants is illustrated in Figures 9, 10, 11, and 12.

Experiment 2
We validate NICE in the non-linear case on the benchmark dataset SpeedDating. SpeedDating was collected to study gender differences in mate selection [Fis+06]. The study recruited university students to participate in speed dating, and collected objective and subjective information such as 'undergraduate institution' and 'perceived attractiveness'. It has 8378 entries and 185 covariates. The ACIC 2019 simulation samples

Figure 9:
Model performance under the scrambled and heteroskedastic setting
Figure 10:
Model performance under the scrambled and homoscedastic setting
Figure 11:
Model performance under the unscrambled and heteroskedastic setting
Figure 12:
Model performance under the unscrambled and homoscedastic setting

subsets of the covariates to simulate the treatment T and outcome Y. Specifically, it provides four modified DGPs: Mod1, parametric models; Mod2, complex models; Mod3, parametric models with poor overlap; Mod4, complex models with treatment heterogeneity. Each modification includes three versions (low, med, high), indicating an increasing number of covariates included in the models for T and Y.

We compare the estimation quality of the within-sample SATT and out-of-sample CATE over 10 bootstraps in Table 3 and Table 4. The main paper reports the models' performance under the low setting. We now report results for the med and high settings.

Within-sample ε_ATT
[Table 3 body: for the med and high settings, rows TARNet, +NICE, Dragon, +NICE; columns Mod1–Mod4; numeric entries not preserved in this version.]

Table 3:
NICE performs well relative to the baselines if the adjustment set does not contain bad controls. The table reports the MAE and bootstrap standard deviation of the SATT estimate. The model is trained and evaluated on all three environments.

To examine whether NICE helps reduce collider bias, we simulated 20 copies of a collider, X^e_co = T^e + Y^e + N(0, σ_e²), with a different noise scale σ_e in each environment, and included them in the covariate set. Table 5 compares Dragonnet trained under the standard empirical risk minimization framework and trained under NICE. NICE reduces collider bias across simulation setups.

Experiment 3
In the third experiment, we consider data generated according to Figure 6. Notably, in this setup we observe P(A, X, T, Y, Z), where A = {A, A_t, A_y} and X = {X_t, X_y}. Here A is a 50-dimensional covariate, X a 30-dimensional covariate, and Z a 50-dimensional covariate; Z is causally affected by A and Y.

We compare NICE against a neural network model similar in structure to TARNet. The model architecture is the same as the models in the SpeedDating experiment, except that the hidden layer size is 200 for the shared representation. For the exact data generating process and the detailed implementation of the models, see the associated codebase.

Out-of-sample √ε_PEHE
[Table 4 body: for the med and high settings, rows TARNet, +NICE, Dragon, +NICE; columns Mod1–Mod4; numeric entries not preserved in this version.]

Table 4:
NICE performs well relative to the baselines if the adjustment set does not contain bad controls. The table reports the PEHE and bootstrap standard deviation of the out-of-distribution CATE estimate. The model is trained on two environments and evaluated on the third.
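For reference, the two evaluation metrics used throughout these tables can be computed as follows (a generic sketch; tau_hat denotes the model's CATE estimates and tau_true the simulated ground truth):

```python
import numpy as np

def satt_mae(satt_hat, satt_true):
    """Absolute error of the SATT point estimate (epsilon_ATT)."""
    return abs(satt_hat - satt_true)

def root_pehe(tau_hat, tau_true):
    """sqrt(PEHE): root mean squared error between estimated and true CATEs."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    tau_true = np.asarray(tau_true, dtype=float)
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))
```

The tables report the mean and standard deviation of these quantities over bootstrap resamples.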
Within-sample ε_ATT
[Table 5 body: for the low, med, and high settings, rows Dragon and +NICE; columns Mod1–Mod4; numeric entries not preserved in this version.]

Table 5:
NICE reduces estimation bias in the presence of colliders. The table reports the MAE and bootstrap standard deviation of the SATT. The model is trained and evaluated on three environments.