Estimating Individual Treatment Effects using Non-Parametric Regression Models: a Review
Alberto Caron†
Department of Statistical Science, University College London
Gianluca Baio
Department of Statistical Science, University College London
Ioanna Manolopoulou
Department of Statistical Science, University College London
October 26, 2020
Abstract
Large observational data are increasingly available in disciplines such as health, economic and social sciences, where researchers are interested in causal questions rather than prediction. In this paper, we investigate the problem of estimating heterogeneous treatment effects using non-parametric regression-based methods. Firstly, we introduce the setup and the issues related to conducting causal inference with observational or non-fully randomized data, and how these issues can be tackled with the help of statistical learning tools. Then, we provide a review of state-of-the-art methods, with a particular focus on non-parametric modeling, and we cast them under a unifying taxonomy. After presenting a brief overview of the problem of model selection, we illustrate the performance of some of the methods on three different simulated studies and on a real-world example to investigate the effect of participation in school meal programs on health indicators.
Keywords — Bayesian Non-Parametrics, Causal inference, Heterogeneous Treatment Effects, Machine Learning, Observational Studies, Regression Trees.
The application of advanced statistical learning tools in causal inference has gained popularity in recent years, partly due to the fact that large datasets are becoming available at relatively lower costs (thanks to, e.g., electronic health records, social network data, etc.). One of the increasingly common objectives of causal inference in many disciplines is to draw inferences about individual-level treatment effects, as opposed to inferring treatment effects on average across the entire population. The importance of inferring individual-level treatment effects lies in the fact that treatment effects are very often heterogeneous across units of analysis. Two such examples arise in precision medicine and personalized advertisement, where the ultimate goal is to make decisions at the level of an individual patient or user. For instance, patients with high cholesterol levels respond differently to statin prescriptions based on their demographics. This level of analysis requires causal inference methods that can accurately predict the impact of treatment, as well as quantify its uncertainty, at a fine resolution.

∗ This work was supported by a British Heart Foundation-Turing Cardiovascular Data Science Award (BCDSA/100003).
† Corresponding author: [email protected], 1-19 Torrington Pl, London WC1E 7HB.
To this end, popular statistical learning algorithms that exhibit excellent predictive performance, such as tree ensembles, kernel methods and neural networks, can also be exploited, with due adjustments, in causal settings. This paper is therefore motivated by the growing number of non-parametric regression methods for modelling Individual Treatment Effects (ITE) using large datasets to answer individual-level questions, and reviews some of the most popular ones, laying down the underlying assumptions and pros/cons of each.

Regardless of the precise causal questions of interest, the fundamental challenge in causal inference is that the quantity of interest — namely the effect of treatment — is an unknown model parameter. Moreover, treatment effect and selection into treatment are invariably entangled. As a result, advanced statistical learning tools, capable of exploiting large datasets in supervised learning settings, cannot be directly applied. Instead, the latent treatment effect is inferred by reconstructing counterfactual statements, either through sampling (randomization in the administration of treatment) and/or by adjusting for covariates that affect both the outcome of interest and treatment assignment, and by doing so "confound" the treatment effect (confounding).

Randomized Controlled Trials (RCTs), where treatment is randomly allocated, marginally on confounding factors (such as medical or socio-economic characteristics), are considered to be the gold standard for causal inference. However, fully randomized studies are costly, difficult to access, and sometimes suffer from problems such as non-compliance and other issues of missingness not at random that might invalidate the randomization mechanism, or the external validity of the results. In contrast, data of an observational nature, where treatment administration is not randomized, are more easily accessible and abundantly available in many applied fields.
However, observational studies present several drawbacks, largely attributable to three complex phenomena. The first is selection bias, which manifests when the treatment allocation mechanism is not under the researcher's control, but determined by observable or unobservable factors. This constitutes a potential source of confounding that needs to be controlled for, as it generates structural differences between the treated and the control groups. The second phenomenon is known as partial overlap, which occurs when there are regions in the space of relevant covariates where only treated, or only control, units are present. As a result, units in these particular regions lack an appropriate comparator with similar demographics. These two issues are closely related, as partial overlap may be a direct consequence of selection bias. The third phenomenon relates to the fact that treatment allocations and their corresponding outcomes may not be independent across individuals.

In this paper we review some of the most recent non-parametric regression methods for estimating heterogeneous effects arising from a binary treatment assignment, at an individual level, and introduce a taxonomy that classifies these methods under the same unified framework. In more detail, we provide an overview of the implied assumptions and compare the methods' performance with the help of two simulated studies. In addition, we illustrate the practical application of some of the methods on a real-world dataset to investigate the effect of participation in school meal programs on students' health indicators, which is likely to display heterogeneity patterns depending on students' demographics. We focus on methods which circumvent the potential issues stemming from lack of complete (and controlled) randomization by making a set of assumptions which are common in the literature. The first assumption is unconfoundedness, which ensures that there is no unobserved confounding factor driving selection into treatment.
Unconfoundedness might be, in some cases, a strained assumption to make, but represents less of a threat in settings where a large number of relevant covariates is available. The second assumption is common support, which states that each unit, identified by a given set of demographics, has non-zero probability of being observed in each of the treatment groups. In other words, common support ensures that there is no deterministic component in treatment administration. Finally, the last assumption is known as the Stable Unit Treatment Value Assumption (SUTVA), and states that the response to treatment of one unit is not affected by other units' assignment to treatment, thus ensuring that there is no interference.

The rest of the work is organized as follows. Section 2 formulates the problem of estimating
Conditional Average Treatment Effect (CATE) using the potential outcomes representation of Rubin (1978). Section 3 reviews the most recent methods for estimating CATE and the problem of model selection. Section 4 presents results from two different simulated studies that we conducted to compare the performance of some of the methods. Section 5 provides an example of a real-world social sciences application. Section 6 concludes with a discussion.
We will follow the Neyman-Rubin Causal Model, outlined in Rubin (1978) and Imbens and Rubin (2015), which conceives causal inference as a missing data problem. For each unit of analysis $i \in \{1, \dots, N\}$, given a binary treatment assignment $Z_i \in \{0, 1\}$, where $Z_i = 1$ indicates exposure to the treatment and $Z_i = 0$ indicates no exposure, the framework defines the quantities $\big(Y_i^{(0)}, Y_i^{(1)}\big)$ as potential outcomes. $Y_i^{(1)}$ corresponds to unit $i$'s outcome under exposure to the treatment, while $Y_i^{(0)}$ corresponds to its outcome under no exposure, and only one of them is actually observed. We will consider, throughout this work, a setting where the outcome variable of interest is continuous, i.e., $\big(Y_i^{(0)}, Y_i^{(1)}\big) \in \mathbb{R}^2$, but most of the ideas and methods presented can be generalized. It is worth pointing out that the Neyman-Rubin framework is not the only viable approach to causal inference. For example, Dawid (2000) develops a critique of potential outcomes and proposes a different framework based on Bayesian decision analysis.

Given the framework outlined above, if both potential outcomes were observed, one would also have access to the Individual Treatment Effects (ITE), defined as $Y_i^{(1)} - Y_i^{(0)}$. However, as described earlier, the fundamental problem of causal inference is that, for each individual $i$, only one of the two potential outcomes is observed, corresponding to the realized treatment assignment: $Y_i = Z_i Y_i^{(1)} + (1 - Z_i) Y_i^{(0)}$.
This implies that $Y_i^{(1)} - Y_i^{(0)}$ cannot be estimated as in a usual regression problem.

Given a dataset of either observational or randomized nature $\mathcal{D}_i = \{X_i, Z_i, Y_i\}$, with $i \in \{1, \dots, N\}$, where $X_i \in \mathcal{X}$ denotes a $d$-dimensional set of observed covariates for individual $i$ which are a potential source of confounding to be controlled for, the aim is to estimate the Conditional Average Treatment Effect (CATE), defined as

$$\tau(x_i) = \mathbb{E}\big[Y_i^{(1)} - Y_i^{(0)} \mid X_i = x_i\big]. \quad (1)$$

The two quantities $\mu_z(x_i) = \mathbb{E}\big[Y_i^{(z)} \mid X_i = x_i\big]$ in (1) are the conditional average potential outcomes. The intuition behind the estimation of $\tau(x_i)$ is the following. If both potential outcomes were observed, then $Y_i^{(1)} - Y_i^{(0)}$ (ITE) would be modelled as the response variable in a regression framework where $X_i$ are the $d$ regressors, and where the aim is to estimate the conditional mean of the outcome, namely $\tau(x_i) = \mathbb{E}\big[Y_i^{(1)} - Y_i^{(0)} \mid X_i = x_i\big]$.

Figure 1: Simulated example with one single covariate $X$. Left panel: observed outcomes for the treated (blue dots) and control (red dots) groups with underlying conditional mean function (dashed lines), and unobservable counterfactual outcomes (grey dots). Right panel: unobservable true ITE (grey dots) and corresponding conditional mean function (dashed line).

Figure 1 provides a graphical representation of a simple single-covariate example, where coloured dots show observed values $Y_i = Y_i^{(Z_i)}$ of the response, and grey dots their corresponding (unobservable) counterfactuals $Y_i^{(1 - Z_i)}$. Notice that the example is purely illustrative and serves as a visual aid to introduce the reader to the key concepts in the Rubin-Neyman framework.

An additional quantity of interest under the Neyman-Rubin framework is the propensity score, which is defined, for each unit of analysis $i$, as the probability of being selected into treatment, given a set of observed covariates that we denote by $\tilde{X}_i = \tilde{x}_i$ to stress the fact that it might be different from the set of covariates used for the estimation of $\tau(x_i)$; more formally: $\pi(x_i) = P(Z_i = 1 \mid \tilde{X}_i = \tilde{x}_i)$. Thus, for each unit, the binary treatment assignment $Z_i$ can be seen as the outcome of a Bernoulli experiment where $Z_i \sim \text{Bern}\big(\pi(x_i)\big)$. In the simple example of Figure 1, the propensity score is generated as a monotone function of the covariate $X$. This is why treated units are more frequently observed for higher values of $X$, while control units are more frequent for lower values of $X$. The propensity score distribution in this case is also slightly skewed to the right; as a consequence, there are more units in the control group compared to the treated one.

Under the potential outcomes framework, the untestable unconfoundedness assumption is equivalent to assuming the conditional independence

$$\big(Y_i^{(0)}, Y_i^{(1)}\big) \perp\!\!\!\perp Z_i \mid X_i, \quad (2)$$

which ensures that selection is determined by observed demographics.
A well-known result concerning the propensity score, derived by Rosenbaum and Rubin (1983), is that if (2) holds, then

$$\big(Y_i^{(0)}, Y_i^{(1)}\big) \perp\!\!\!\perp Z_i \mid \pi(X_i). \quad (3)$$

The conditional independence in (3) represents a different way of expressing unconfoundedness, by conditioning only on the propensity score rather than on the full set of covariates. In a context where the number of covariates is high, $\pi(X_i)$ represents a 1-dimensional summary of the $d$-dimensional covariate set $X_i \in \mathcal{X}$ which achieves conditional independence between outcome and treatment assignment. There are many different ways of making effective use of the propensity score for deriving population-level or individual-level causal estimators. In some approaches the propensity score assumes a central role, such as in matching and blocking methods (Rosenbaum and Rubin, 1984, 1985), compared to the regression-based methodologies that we focus on in this work. Imbens (2004) offers a thorough overview of methods based on the propensity score, as well as of other methods (e.g. regression and matching).

The assumption of common support, under the Rubin-Neyman framework, implies that the propensity score is such that $0 < \pi(x_i) < 1$ for each $i$, which equivalently means that treatment assignment is not deterministic, and that an individual with covariate values $X_i = x_i$ can potentially be observed in either treatment group with non-zero probability.

Under the unconfoundedness and common support assumptions, CATE can be estimated from observed data $\mathcal{D}_i = \{X_i, Z_i, Y_i\}$, $i \in \{1, \dots, N\}$. In fact, under unconfoundedness, the conditional average potential outcomes are such that

$$\mu_z(x_i) = \mathbb{E}\big[Y_i^{(z)} \mid X_i = x_i\big] = \mathbb{E}\big[Y_i^{(z)} \mid Z_i = z_i, X_i = x_i\big] = \mathbb{E}\big[Y_i \mid Z_i = z_i, X_i = x_i\big], \quad (4)$$

for $z_i \in \{0, 1\}$, where the second equality follows from the conditional independence between $\big(Y_i^{(0)}, Y_i^{(1)}\big)$ and $Z_i$ given $X_i$, while the last one follows from the identity $Y_i = Z_i Y_i^{(1)} + (1 - Z_i) Y_i^{(0)}$. Hence, as a straightforward implication of (4), one can derive an estimator for CATE as

$$\begin{aligned}
\tau(x_i) &= \mathbb{E}\big[Y_i^{(1)} - Y_i^{(0)} \mid X_i = x_i\big] \\
&= \mathbb{E}\big[Y_i^{(1)} \mid X_i = x_i\big] - \mathbb{E}\big[Y_i^{(0)} \mid X_i = x_i\big] \\
&= \mathbb{E}\big[Y_i^{(1)} \mid Z_i = 1, X_i = x_i\big] - \mathbb{E}\big[Y_i^{(0)} \mid Z_i = 0, X_i = x_i\big] \\
&= \mathbb{E}\big[Y_i \mid Z_i = 1, X_i = x_i\big] - \mathbb{E}\big[Y_i \mid Z_i = 0, X_i = x_i\big], \quad (5)
\end{aligned}$$

where, as in (4), the third equality is given by unconfoundedness, and the last one by the identity $Y_i = Z_i Y_i^{(1)} + (1 - Z_i) Y_i^{(0)}$. The common support assumption is then needed to guarantee that the two conditional average potential outcomes $\mu_z(x_i) = \mathbb{E}\big[Y_i^{(z)} \mid X_i = x_i\big]$ exist for all values $z_i$ and $x_i$ in their supports, and can thus be estimated through the observed quantities in the conditional expectations $\mathbb{E}\big[Y_i \mid Z_i = z_i, X_i = x_i\big]$.
To clarify, suppose that common support does not hold for $X_i = x^*$ and that $\pi(x^*) = 0$ (without loss of generality); then the conditional average potential outcome $\mu_1(x^*) = \mathbb{E}\big[Y^{(1)} \mid X_i = x^*\big]$ is theoretically undefined, and it would not make sense to even attempt to estimate it.

Throughout this work, we will review non-parametric regression approaches where the outcome surface $Y_i$ is modelled as a function of $(X_i, Z_i)$ and some unobservable error term $\varepsilon_i$: $Y_i = g(X_i, Z_i, \varepsilon_i)$. More specifically, the reviewed methods generally assume that the error term is additive and normally distributed, which leads to the following setup:

$$Y_i = f(X_i, Z_i) + \varepsilon_i, \quad \text{where } \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad (6)$$

where $f(X_i, Z_i) = \mathbb{E}\big[Y_i \mid X_i, Z_i\big]$ is learnt from the data. The strength of non-parametric regression models is their robustness to misspecification of the functional form of $f(\cdot)$ (e.g. tree-based methods model $f(\cdot)$ as piece-wise constant, splines as piece-wise polynomial, etc.). The covariates $X_i \in \mathcal{X}$ represent a potential source of confounding, in that they might affect both the treatment and the outcome, and are thus controlled for. In a setting where the number of available covariates is high, one might want to resort to regularization in the estimation of $f(\cdot)$; this is more likely to be of interest in observational studies in socio-economic disciplines rather than in clinical research, where closely controlling for which covariates are included in the model is preferable. However, as explained in both Hahn et al. (2018) and Hahn et al. (2020), regularization should be handled carefully in this context; we will return to this point in Section 3.1.6.

In the regression setup illustrated in (6), some of the methods reviewed in the next section reserve a specific role for the propensity score.
For the remaining methods, which do not explicitly use the propensity score, in the simulated studies that we conduct in Section 4 we follow the suggestion of Hahn et al. (2020) and incorporate it via the following two-stage regression framework:

$$\begin{aligned}
\pi(X_i) &= P\big(Z_i = 1 \mid X_i\big) \\
Y_i &= f\big(\big[X_i, \pi(X_i)\big], Z_i\big) + \varepsilon_i. \quad (7)
\end{aligned}$$

The first stage in (7) involves estimating the propensity score, while the second embeds it as an extra covariate in the covariate set. Any probabilistic classifier is suitable for use in the first-stage regression (e.g. logistic regression, probit Bayesian Additive Regression Trees, Gaussian Process classifier, etc.). As explained in Hahn et al. (2020), and as we will describe later in Section 3.1.6, the addition of $\pi(X_i)$ to the covariate set in (7) represents an effective way to tackle bias deriving from targeted selection. Targeted selection arises when individuals are selected into treatment based on the prediction of an otherwise adverse outcome, i.e., when $\pi(X_i)$ is a strictly monotone function of $\mathbb{E}\big[Y_i^{(0)} \mid X_i = x_i\big]$, and is common in many observational studies (e.g. medical or socio-economic studies).

There are a few different approaches for deriving an estimator for CATE from the representations in (6) and (7), which will be analyzed in the next section.

Given the framework outlined in the previous section, various meta-algorithms designed to derive a CATE estimator have been proposed in the literature. These meta-algorithms are often referred to as "Meta-Learners", in that they are built on top of "base-learners", which are common machine learning algorithms (e.g. tree ensembles, regularized regressions, neural networks, etc.).
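A minimal sketch of the two-stage framework in (7), with crude binned and nearest-neighbour estimators standing in for the flexible learners (the data-generating process, bin width and neighbourhood size are all invented for illustration):

```python
import random
import statistics

random.seed(1)

# Invented DGP with covariate-dependent selection into treatment.
n = 2_000
X, Z, Y = [], [], []
for _ in range(n):
    x = random.uniform(-1, 1)
    pi = 0.2 + 0.6 * (x < 0)                # true propensity score
    z = 1 if random.random() < pi else 0
    y = x + 1.5 * z + random.gauss(0, 0.1)  # true CATE is constant: 1.5
    X.append(x); Z.append(z); Y.append(y)

# Stage 1: estimate pi(x) = P(Z = 1 | X = x); a crude binned classifier stands
# in for, e.g., logistic regression or probit BART.
def pi_hat(x):
    zs = [z for (xi, z) in zip(X, Z) if abs(xi - x) < 0.1]
    return sum(zs) / len(zs)

PI = [pi_hat(x) for x in X]   # propensity estimate, used as an extra covariate

# Stage 2: regress Y on ([x, pi_hat(x)], z); a k-nearest-neighbour average on
# the augmented covariate set stands in for a flexible learner f.
def f_hat(x, z, k=50):
    p = pi_hat(x)
    neigh = sorted(
        (abs(xi - x) + abs(pii - p), yi)
        for (xi, zi, pii, yi) in zip(X, Z, PI, Y) if zi == z
    )[:k]
    return statistics.mean(y for (_, y) in neigh)

tau_hat = f_hat(0.5, 1) - f_hat(0.5, 0)   # CATE estimate at x = 0.5 (true: 1.5)
```

The key design point is simply that the stage-one output enters stage two as an additional regressor; any probabilistic classifier and any flexible regression could be substituted.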
We attempt to build a taxonomy of these "Meta-Learner" approaches in Section 3.1, while in Section 3.2 we present a review of the metrics used for model selection when estimating CATE, which is a notoriously hard, arguably impossible, problem.
As mentioned in the earlier sections, we will partly build on top of the review of Künzel et al. (2019), and we will illustrate the most recent contributions stemming from both the statistics and computer science literature. A concise summary of the presented "Meta-Learners", together with the relevant references, can be found in Table 1.
Table 1:

Meta-Learner      | References                                                       | CATE estimator
S-Learner         | Hill (2011); Foster et al. (2011)                                | $\hat{\tau}(x_i) = \hat{f}(x_i, 1) - \hat{f}(x_i, 0)$
T-Learner         | Athey and Imbens (2016); Lu et al. (2018); Powers et al. (2018)  | $\hat{\tau}(x_i) = \hat{f}_1(x_i) - \hat{f}_0(x_i)$
X-Learner         | Künzel et al. (2019)                                             | $\hat{\tau}(x_i) = \hat{\pi}(x_i)\,\hat{\tau}_0(x_i) + \big(1 - \hat{\pi}(x_i)\big)\hat{\tau}_1(x_i)$
R-Learner         | Nie and Wager (2019)                                             | $\hat{\tau}(x_i) = \arg\min_\tau \big\{\hat{L}_n\big(\tau(\cdot)\big) + \Lambda_n\big(\tau(\cdot)\big)\big\}$
Multitask-Learner | Alaa and van der Schaar (2017, 2018)                             | $\hat{\tau}(x_i) = \hat{f}^\top(x_i)\,\xi$
$\tau$-Learner    | Hahn et al. (2020)                                               | $\tau(x_i)$ as explicit model parameter

"Single-Learners", shortened to S-Learners, have been implicitly proposed in two early contributions (Hill, 2011; Foster et al., 2011) and derive an estimator for CATE by using treatment assignment as "just another covariate" in the covariate space $\mathcal{X}$, which means that CATE is estimated as

$$\hat{\tau}(x_i) = \hat{f}(x_i, 1) - \hat{f}(x_i, 0). \quad (8)$$

An S-Learner fits a single surface $f(\cdot)$ through a base-learner and derives CATE estimates by taking the difference between the two conditional average potential outcomes, which are represented by the fitted $\hat{f}(\cdot)$ with $Z_i = 1$ and $Z_i = 0$ respectively. The underlying assumption is that the group-specific conditional average potential outcomes stem from the same model, with conditional mean function $f(\cdot)$ and error term $\varepsilon_i$. Regression trees are popular base-learners employed in the context of S-Learners. For instance, Hill (2011) advocates the use of Bayesian Additive Regression Trees (BART), while Foster et al. (2011) advocate random forests.

The left panel of Figure 2 shows an S-Learner BART fit for the conditional mean $\hat{f}(\cdot)$ of the single-covariate simulated example already encountered in Figure 1. Notice that the dashed line representing $\hat{f}(\cdot)$ has a unique colour (grey) to emphasize the fact that the S-Learner fits a unique surface. Since an S-Learner fits a single regression, it is quite restrictive, in the sense that it does not incorporate the way in which $f(\cdot)$ changes with respect to $Z$, and it does not take into account the fact that the distribution of the covariates $X_i \in \mathcal{X}$ may vary with $Z$, as a result of selection bias. This is obviously more problematic when working with observational data, while it is less so with randomized studies, where selection bias is less likely to be present. Alaa and van der Schaar (2018) and Hahn et al.
(2020) have both identified that the main drawback of S-Learners is their lack of ability to adapt to variable levels of sparsity and smoothness across the two treatment groups, since they impose the same regularizing conditions for both treated and control groups. An S-Learner will then perform poorly in a situation where the outcome surface complexity is very different across the two groups. Conversely, an S-Learner will perform well when CATE is of small magnitude, as the $\tau(x_i)$ surface complexity then does not heavily depend on $Z_i$, or, in other words, does not vary much across treatment groups. For example, consider the case of an S-Learner employing a tree ensemble base-learner, such as BART. Since a tree ensemble method like BART picks splitting variables at each node in each tree randomly, it might not even choose $Z$ as a splitting variable in some of the trees in the ensemble, so that $Z$ will possibly be included in most of the trees fitting the response $Y$, but not necessarily in all of them. The exclusion of $Z$ from the splitting rules of a tree is more likely to happen as the number of covariates $X_i$ grows larger, in that the model has a larger set of splitting variables to pick from. This intuitively explains why S-Learners turn out to be appropriate in situations where the magnitude of the ground-truth CATE is rather small, relative to the variation in outcome attributable to the covariates. It may happen in real-world applications, e.g. in clinical studies, that the researcher has a prior idea regarding the magnitude of the treatment effect. As will become clearer in the following sections, this is a non-negligible piece of information when choosing a method to perform the analysis.

Figure 2: Simulated one-covariate data from Section 2. Left panel: conditional mean fit from an S-Learner BART (dashed grey line). Right panel: conditional mean fit from a T-Learner BART (blue and red dashed lines).

"Two-Learners", shortened to T-Learners, derive an estimator for CATE by fitting two separate surfaces for the treated and control groups and computing their difference:

$$\hat{\tau}(x_i) = \hat{f}_1(x_i) - \hat{f}_0(x_i). \quad (9)$$

Versions of T-Learners can be found in many contributions in the literature. For instance, Athey and Imbens (2016), Lu et al. (2018) and Powers et al. (2018) offer examples employing decision trees, random forests and gradient boosted trees as base-learners, respectively.
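The S-Learner (8) and T-Learner (9) recipes can be sketched with a deliberately crude piecewise-constant base-learner, echoing the tree intuition above (the data-generating process and binning are invented; in practice BART or forests would be used):

```python
import random
import statistics

random.seed(2)

# Invented randomized DGP: baseline x^2, heterogeneous effect tau(x) = 2 + x.
n = 4_000
data = []
for _ in range(n):
    x = random.uniform(-2, 2)
    z = random.randint(0, 1)
    y = x ** 2 + (2 + x) * z + random.gauss(0, 0.2)
    data.append((x, z, y))

BINS, LO, HI = 20, -2.0, 2.0
WIDTH = (HI - LO) / BINS

def bin_of(x):
    return min(max(int((x - LO) / WIDTH), 0), BINS - 1)

# S-Learner (8): a single surface f(x, z) with z as "just another covariate";
# the fitted constants are indexed by the pair (bin of x, z).
s_cells = {}
for (x, z, y) in data:
    s_cells.setdefault((bin_of(x), z), []).append(y)
s_means = {k: statistics.mean(v) for k, v in s_cells.items()}
tau_S = lambda x: s_means[(bin_of(x), 1)] - s_means[(bin_of(x), 0)]

# T-Learner (9): two separate surfaces f_1 and f_0, one per treatment group.
def fit_piecewise(points):
    groups = {}
    for (x, y) in points:
        groups.setdefault(bin_of(x), []).append(y)
    means = {b: statistics.mean(v) for b, v in groups.items()}
    return lambda x: means[bin_of(x)]

f1 = fit_piecewise([(x, y) for (x, z, y) in data if z == 1])
f0 = fit_piecewise([(x, y) for (x, z, y) in data if z == 0])
tau_T = lambda x: f1(x) - f0(x)
```

Note that with this unregularized bin-mean learner the S- and T-Learner fits coincide exactly; with regularized base-learners (BART, forests) the two generally differ, which is the crux of the comparison in the text.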
In contrast to S-Learners, T-Learners separate the two treatment groups when modelling the response variable $Y$, and assume that the group-specific conditional average potential outcomes are derived from separate regression models, characterized by different conditional means $f_1(\cdot)$ and $f_0(\cdot)$ and independent error terms. This allows one to preserve distributional differences across the two groups that might originate from selection bias, and to take into account different degrees of sparsity and smoothness that vary with $Z$ when regressing $Y$ against $X$. On the other hand, a shortcoming of T-Learners is that, as a result of splitting the sample in two, they do not allow sharing common underlying information between the groups when estimating the two surfaces. This is particularly not ideal in a scenario where individuals in the two groups share the same distributional characteristics, such as in a randomized study.

The right panel of Figure 2 displays a T-Learner BART fit for the conditional means of the two treatment groups, $f_1(\cdot)$ and $f_0(\cdot)$, on the same one-covariate simulated example of Figure 1. Notice that the fitted $\hat{f}_1(\cdot)$ and $\hat{f}_0(\cdot)$ are differentiated by colour (blue and red dashed lines respectively), emphasizing the fact that the T-Learner fits two separate surfaces.

T-Learners work particularly well when the complexity of the response surface is very different across treatment groups, and so CATE itself turns out to be a rather complex function. In addition, as formally derived by Alaa and van der Schaar (2018), T-Learners are expected to do well as the sample size $N$ goes to infinity, and more observations per group are available to estimate $f_1(\cdot)$ and $f_0(\cdot)$. However, this is not usually the case with real-world data. For example, in the presence of imbalanced designs, where one group is larger than the other, splitting the sample in two would leave few observations for the estimation of $f_z$ in the smaller group. On the other hand, T-Learners tend to perform quite poorly in settings where the CATE surface is relatively simple and heterogeneity patterns are not so sophisticated, i.e., situations where an S-Learner usually performs better. Hence, if subject-matter knowledge suggests that the treatment impact is likely to be of significant magnitude, a T-Learner might be the preferred choice.

X-Learners have been proposed by Künzel et al. (2019) as an extension of T-Learners, and derive a CATE estimator in three steps. In the first step, conditional average potential outcomes are fitted as in a T-Learner approach, that is, by using two separate regression models for the conditional means $f_1(x_i)$ and $f_0(x_i)$. This carries the same underlying assumptions, benefits and drawbacks of T-Learners, illustrated in the previous section. Then, in the second step, "imputed treatment effects" are computed for each group separately; these are defined as the differences between the group-specific observed outcome $Y_i^{(Z_i)}$ and the estimated unobservable conditional average potential outcome derived in the first step; more formally:

$$\begin{aligned}
\tilde{D}_i^1 &= Y_i - \hat{Y}_i^{(0)} \quad \text{if } Z_i = 1 \\
\tilde{D}_i^0 &= \hat{Y}_i^{(1)} - Y_i \quad \text{if } Z_i = 0. \quad (10)
\end{aligned}$$

The second step thus attempts to recover the unobservable differences $D_i = Y_i^{(1)} - Y_i^{(0)}$ (ITE) by replacing the unobservable potential outcomes with the corresponding conditional average potential outcome estimates, separately for the treated and control groups.
In the last step, $\tilde{D}_i^1$ and $\tilde{D}_i^0$ are used as response variables in two separate regressions, employing the chosen base-learner (linear regression, random forest, BART, etc.), to obtain the estimates $\hat{\tau}_1(x_i)$ and $\hat{\tau}_0(x_i)$, using the covariates $X_i$ as regressors. These two independent regressions can be depicted as

$$\begin{aligned}
\tilde{D}_i^1 &= \tau_1(X_i) + \eta_i^1 \quad \text{if } Z_i = 1 \\
\tilde{D}_i^0 &= \tau_0(X_i) + \eta_i^0 \quad \text{if } Z_i = 0, \quad (11)
\end{aligned}$$

where the two estimated quantities $\hat{\tau}_1(x_i)$ and $\hat{\tau}_0(x_i)$ from (11) are group-specific CATE estimates. The final CATE estimate is then obtained through a weighted average of the two group-specific CATE estimates,

$$\hat{\tau}(x_i) = g(x_i)\,\hat{\tau}_0(x_i) + \big(1 - g(x_i)\big)\,\hat{\tau}_1(x_i), \quad (12)$$

where $g(x_i) \in [0, 1]$ is a given weighting function. The authors propose to set $g(\cdot)$ equal to a propensity score estimate, $g(x_i) = \hat{\pi}(x_i)$, but it can take different forms (e.g. $g(x_i) = 1$ or $g(x_i) = 0$).

The intuition behind the X-Learner approach is the following. If both potential outcomes were observable for each individual $i$, then $\mathbb{E}\big[Y_i^{(1)} - Y_i^{(0)} \mid X_i = x_i\big]$ (CATE) would be estimated by regressing $Y_i^{(1)} - Y_i^{(0)}$ (ITE) directly on the covariates $X_i$. However, since the difference $Y_i^{(1)} - Y_i^{(0)}$ is not observable, one can estimate it by using the observed potential outcome $Y_i^{(Z_i)}$ and replacing the counterfactual unobservable outcome $Y_i^{(1 - Z_i)}$ with an estimate, derived via T-Learner in the first step.

Figure 3: X-Learner BART applied to the simulated one-covariate example. Left panel: unobservable ITE (grey dots) and imputed treatment effects $\tilde{D}^1$ and $\tilde{D}^0$ (blue and red triangles), estimated as in (10) using T-Learner BART. Right panel: group-specific CATE estimates (blue and red dashed lines) obtained from the two regressions in (11), and final weighted CATE estimates (green dashed line) obtained from the re-balancing step in (12).

The procedure laid out by the X-Learner in the first two steps is comparable to that of a T-Learner, as both methods imply dividing the sample into the two groups. The main difference then lies in the last additional step, where the X-Learner attempts to re-balance the CATE estimates through weighting via the estimated propensity score. X-Learners are in fact shown to perform well in unbalanced settings, where one of the treatment groups (typically the control group) is larger in size compared to the other one, and where a T-Learner might face issues. A final note about the X-Learner balancing step is that it requires careful specification of the propensity score model, as poor propensity score estimates would defy the re-balancing purpose. Propensity score estimation is no different from a standard prediction problem and boils down to performing a probabilistic binary outcome regression of $Z$ on $X$. We will see that other Meta-Learners also envisage an important balancing role for the propensity score, and thus must deal with its estimation.

Figure 3 offers a simple example of an X-Learner, with BART as base-learner, applied to the one-covariate simulated data encountered in Figure 1 and Figure 2. The X-Learner's first step essentially derives, via T-Learner, the same BART estimates seen in the right panel of Figure 2.
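The three X-Learner steps in (10)-(12) can be sketched end-to-end; the data-generating process and the bin-mean base-learner are invented stand-ins (for BART, say), and the known constant propensity score is used as the weighting function $g$:

```python
import random
import statistics

random.seed(3)

# Invented unbalanced DGP: pi = 0.2, so treated units are scarce; tau(x) = 1 + x.
n = 3_000
data = []
for _ in range(n):
    x = random.uniform(-2, 2)
    z = 1 if random.random() < 0.2 else 0
    y = x + (1 + x) * z + random.gauss(0, 0.2)
    data.append((x, z, y))

def fit_piecewise(points, n_bins=10, lo=-2.0, hi=2.0):
    # crude piecewise-constant base-learner (bin means)
    width = (hi - lo) / n_bins
    bins = {}
    for (x, v) in points:
        bins.setdefault(min(int((x - lo) / width), n_bins - 1), []).append(v)
    means = {b: statistics.mean(v) for b, v in bins.items()}
    return lambda x: means[min(max(int((x - lo) / width), 0), n_bins - 1)]

# Step 1: T-Learner fits of the two conditional means.
f1 = fit_piecewise([(x, y) for (x, z, y) in data if z == 1])
f0 = fit_piecewise([(x, y) for (x, z, y) in data if z == 0])

# Step 2 (10): imputed treatment effects.
d1 = [(x, y - f0(x)) for (x, z, y) in data if z == 1]   # D~1 for treated units
d0 = [(x, f1(x) - y) for (x, z, y) in data if z == 0]   # D~0 for control units

# Step 3 (11)-(12): regress the imputed effects on x, then blend with g = pi.
tau1 = fit_piecewise(d1)
tau0 = fit_piecewise(d0)
g = 0.2   # weighting function; here the true (constant) propensity score
tau_X = lambda x: g * tau0(x) + (1 - g) * tau1(x)
```

In a realistic analysis $g$ would itself be an estimated propensity score (e.g. via a probit model), which is exactly where the caveat about careful propensity specification bites.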
The outputof the second step, namely the imputed treatment effects ˜ D i and ˜ D i , are depicted in the left panel plotof Figure 3 (red and blue triangles), together with the true ITE (grey dots). The graph on the rightinstead shows the estimated group-specific CATE ˆ τ ( x i ) (blue dashed line) and ˆ τ ( x i ) (red dashedline), derived from the two regressions in (11), and the final CATE estimate (cid:98) τ ( x i ) (green dashed line),obtained from the weighting step in (12). Propensity score estimates employed for the weighting stepwere retrieved via probit version of BART. Notice that the final CATE estimates (cid:98) τ ( x i ) lie in betweenthe two fitted group-specific ˆ τ ( x i ) and ˆ τ ( x i ). R-Learner was proposed by Nie and Wager (2019) as a two-stage meta-algorithm, which aims atminimizing a loss function, specifically defined on CATE, through parameter tuning. The derivation f the two-step procedure stems from Robinson (1988) decomposition of the outcome model in (6).Define the following two quantities: Y i = µ ( X i ) + τ ( X i ) Z i + ε i m ( X i ) = E ( Y i | X i ) = µ ( X i ) + τ ( X i ) π ( X i ) (13)as the outcome model and the conditional mean outcome model, respectively. Under this setup,unconfoudedness assumption implies that the error term is such that E (cid:2) ε i | X i , Z i (cid:3) = 0. Noticethat under this parametrization τ ( X i ) (CATE) enters explicitly in the outcome regression model. Bycombining the two quantities above, Robinson (1988) noticed that the outcome model can be re-writtenas: Y i − m ( X i ) = (cid:16) Z i − π ( X i ) (cid:17) τ ( X i ) + ε i . 
(14)

Starting from this decomposition, Nie and Wager (2019) derive a loss function that can be used for parameter tuning in the estimation of CATE; the optimal CATE estimates are defined as the minimizer of the following loss function:

τ(·) = arg min_τ E[ ( (Y_i − m(X_i)) − (Z_i − π(X_i)) τ(X_i) )² ] .    (15)

The idea is that a base-learner that relies heavily on parameter tuning (e.g. random forests or gradient boosted trees) can be tuned on the modified parametrization of the outcome model in (14), which involves a version of the outcome net of the impact of the covariates m(X_i) and of propensity score balancing, instead of being tuned on the raw outcome Y_i as one would do in an S- or T-Learner procedure. Since the quantities in (15) are not directly observable, the R-Learner overcomes the issue through the following two-step approach:

1. Fit m̂(x_i) and π̂(x_i) by minimizing the usual prediction error.
2. Plug the first-step estimates in to estimate τ̂(x_i), by minimizing the regularized sample equivalent of (15) via cross-validated parameter tuning, that is:

τ̂(·) = arg min_τ { L̂_n(τ(·)) + Λ_n(τ(·)) } , where

L̂_n(τ(·)) = (1/N) Σ_{i=1}^N ( (Y_i − m̂^{(−i)}(X_i)) − (Z_i − π̂^{(−i)}(X_i)) τ(X_i) )² ,    (16)

and where Λ_n(τ(·)) is a term representing regularization (e.g.
Ridge or LASSO penalization, spline smoothness penalization, etc.), and the superscript (−i) indicates that the i-th observation is held out from the estimation subsample, as in leave-one-out cross-validation (LOO-CV).

The advantage of R-Learners lies in their two-stage procedure: the first step produces an estimate of the propensity score π̂(x_i) and thereby mitigates the confounding generated by the non-randomized selection mechanism, while the second step, exploiting the parametrization in (13), employs a loss function through which parameters can be tuned to estimate τ̂(x_i) directly. R-Learners naturally work best with base-learners that rely heavily on parameter tuning, such as ensemble methods (e.g. gradient boosting, random forests, etc.).

3.1.5 Multitask-Learners

The idea of multitask learning, or multi-output learning, for causal inference was introduced by Alaa and van der Schaar (2017) and Alaa and van der Schaar (2018) in the context of Gaussian Processes. The multitask perspective on CATE estimation consists in viewing the two potential outcomes Y_i(0) and Y_i(1) as the outputs of a function f : X → R², with d-dimensional input space and 2-dimensional output space, where the output space is indexed by Z_i, which acts as a "task identifier". The CATE estimator is defined as the difference between the elements of the 2-dimensional output of f(·), i.e.,

τ̂(x_i) = f̂₁(x_i) − f̂₀(x_i) = f̂(x_i)ᵀ ξ , where ξᵀ = [−1, 1] .    (17)

Equation (17) displays a formulation very similar to a T-Learner's; and as in the T-Learner procedure, the sample is divided into the two subgroups for the estimation.
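One crude way to realize (17) with off-the-shelf tools is to fit a single Gaussian Process on the augmented input (x, z) and difference its predictions at z = 1 and z = 0. This shares all kernel hyperparameters across the two tasks and is only a simplified stand-in for the CMGP/NSGP kernel constructions discussed below, not the authors' method:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def joint_gp_cate(X, z, y):
    """Single GP on the augmented input (x, z); CATE is the difference of the
    predictions at z = 1 and z = 0, mirroring (17)."""
    Xz = np.column_stack([X, z])                      # append the task identifier
    kernel = RBF(length_scale=np.ones(Xz.shape[1])) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(Xz, y)
    ones, zeros = np.ones(len(X)), np.zeros(len(X))
    return gp.predict(np.column_stack([X, ones])) - gp.predict(np.column_stack([X, zeros]))
```

Because the same anisotropic kernel governs both arms, information is pooled across tasks, in the spirit of the joint optimization described next.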
However, the advantage of viewing CATE estimation as a multitask problem is that, instead of estimating the two potential outcomes independently as one would do in a T-Learner or X-Learner, they are estimated "jointly", through the specification of hyperparameters that trigger a joint optimization over the two "tasks": learning f_{Z=1} and f_{Z=0}. Hence, this approach fits separate conditional mean functions (as in a T-Learner), but at the same time attempts to recover common underlying patterns between the two groups (as in an S-Learner) that would otherwise be lost due to the sample split.

In the case of Alaa and van der Schaar (2017), multitask learning is induced through the specification of a particular structure in the stationary kernel function of the Gaussian Process prior. The method is labelled Causal Multitask Gaussian Process (CMGP). Alaa and van der Schaar (2018) then proposed a similar method where multitask learning is induced via a non-stationary version of the GP kernel function (
Non-Stationary Gaussian Process, NSGP). Both causal multitask GPs allow one to compute posterior credible intervals for CATE as a natural by-product of their Bayesian approach. Alaa and van der Schaar (2017) and Alaa and van der Schaar (2018) are two examples of Multitask-Learners employing Gaussian Process regression as a base-learner, but there are other ways of inducing multitask learning with other types of base-learners (e.g. tree ensembles). Given their similarity to T-Learners in how the CATE estimator is derived, we expect Multitask-Learners to perform well when the complexity of the response surfaces f₀, f₁ varies across groups and CATE itself turns out to be a rather complex function.

3.1.6 τ-Learners

The last type of Meta-Learner presented in this work was developed by Hahn et al. (2020), under the name of "Bayesian Causal Forest". The authors tackle the problem of CATE estimation with a Bayesian approach, exploiting the same parametrization seen in the context of R-Learners. In particular, they noticed that the parametrization

Y_i = µ(X_i) + τ(X_i) Z_i + ε_i ,    (18)

can be viewed as a Bayesian regression framework where the prognostic score, defined as the impact of the covariates X_i ∈ X on the outcome Y_i in the absence of treatment, µ(x_i) = E[Y_i | Z_i = 0, X_i = x_i], plays the role of the intercept, while τ(x_i) plays the role of the slope. In this perspective, CATE is an explicit parameter of the model and can thus be treated in a Bayesian fashion through the specification of a prior distribution p(τ(·)), which can be shaped to convey prior knowledge and more targeted regularization, capable of capturing even weak patterns of heterogeneity. The Bayesian Causal Forest of Hahn et al. (2020) is composed of a pair of separate and independent BART priors placed on µ(·) and τ(·) respectively, but the parametrization in (18) can be exploited with other Bayesian regression methods (e.g.
Gaussian Process regression, Dirichlet Process regression, etc.).

In addition to the parametrization shown in (18), Hahn et al. (2020) make use of the two-stage procedure seen in (7), Section 2. The two-stage approach is motivated by the presence of a particular type of confounding, which the authors in Hahn et al. (2018) and Hahn et al. (2020) call Regularization Induced Confounding (RIC). The intuition behind RIC is the following: regularization applied directly to the two curves f₀, f₁ may have unintended consequences on the regularization induced on τ(·), leading to biased estimates of CATE. RIC is shown to have a stronger effect under strong confounding, such as in the presence of targeted selection, that is, when individuals are selected into treatment based on the prediction of an otherwise adverse outcome. Targeted selection implies a potentially strictly monotone relationship between the propensity score π(x_i) and the prognostic score µ(x_i) = E[Y_i | X_i, Z_i = 0], and is rather common in observational studies. The proposed way to tackle confounding from targeted selection is precisely to use the two-stage representation illustrated in (7), where a probabilistic estimate of the propensity score π̂(x_i), obtained from the first-stage regression, is added to the covariates for the estimation of µ(x_i) = E[Y_i | X_i, Z_i = 0] in the second stage, to account for their potential relationship.

We name the above approach τ-Learner, as it involves an explicit parametrization in terms of τ(x_i) and a direct Bayesian approach to CATE estimation. Hahn et al. (2020) specifically make use of BART for the estimation of µ(x_i) and τ(x_i), but any other Bayesian method could potentially work. As a further advantage, the direct Bayesian approach allows one to obtain valid posterior credible intervals for CATE.

3.2 Model Selection

Model selection is a challenging problem in causal inference.
The main reason is that researchers cannot access the counterfactual outcomes Y_i^{(1−Z_i)}, and thus cannot observe the differences Y_i(1) − Y_i(0), ∀i ∈ {1, ..., N}; this distinguishes it from other classical model selection problems. The aim here is to select the model M* ∈ {M₁, ..., M_d} that minimizes a loss function of the estimated CATE τ̂. We review some of the methods that have been proposed, partly following the review in Schuler et al. (2018).

Standard likelihood-based criteria such as AIC and BIC are doomed to fail in a structural missing data context such as the one depicted. Other metrics popular in the ML literature might be used, such as the prediction error in estimating the two conditional average potential outcomes µ_{z_i}(x_i) = E[Y_i | X_i = x_i, Z_i = z_i], which correspond to the two surfaces f₀(·) and f₁(·), that is:

Ê_µ[ℓ(µ̂_z, µ_z)] = (1/N) Σ_{i=1}^N ( µ̂_{z_i}(x_i) − y_i^{(z_i)} )² .    (19)

A slightly modified version of (19) takes into account the observational nature of the data at hand and uses inverse propensity score estimates to weight the prediction error in estimating each µ_z, that is:

Ê_{µ,IPTW}[ℓ(µ̂_z, µ_z)] = (1/N) Σ_{i=1}^N ( µ̂_{z_i}(x_i) − y_i^{(z_i)} )² / π̂_{z_i}(x_i) .    (20)

The main reason why these metrics work poorly is that they do not express a loss function directly on CATE. A CATE squared-loss function of the type ℓ(τ̂, τ) = E[(τ̂ − τ)²] was defined as Precision in Estimating Heterogeneous Treatment Effects (PEHE) by Hill (2011) and takes the following form:

E[ (τ̂_i − τ_i)² | X_i = x_i ] .    (21)

PEHE would be the ideal loss function to use, but it cannot be computed, since one of the two potential outcomes is unobservable.
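The two factual-error metrics in (19) and (20) are straightforward to compute; a minimal sketch (function names are ours):

```python
import numpy as np

def mu_loss(y, z, mu0_hat, mu1_hat):
    """(19): MSE of the fitted conditional means against the factual outcomes."""
    mu_z = np.where(z == 1, mu1_hat, mu0_hat)
    return np.mean((mu_z - y) ** 2)

def mu_loss_iptw(y, z, mu0_hat, mu1_hat, pi_hat):
    """(20): the same factual error, inverse-propensity weighted; pi_hat is the
    estimated probability of treatment P(Z = 1 | X)."""
    mu_z = np.where(z == 1, mu1_hat, mu0_hat)
    pi_z = np.where(z == 1, pi_hat, 1.0 - pi_hat)
    return np.mean((mu_z - y) ** 2 / pi_z)
```

Both quantities are computable from observables alone, which is precisely why they fail to target CATE directly.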
Attempts have been made in the literature to render the estimation of PEHE feasible through plug-in estimates of τ_i, but none of them has so far proved fully successful or become commonly used. These approaches generally propose to use a Meta-Learner to estimate CATE on a training portion of the sample, N_train, and then to derive CATE estimates also on a held-out validation portion, N_val (where N = N_train + N_val). The training-set CATE estimates, denoted by τ̃(x_i), act as the ground-truth CATE, while the validation-set estimates, denoted by τ̂(x_i), are used to validate the model. Given these two quantities, an estimator of PEHE is constructed as:

Ê_τ[ℓ(τ̂, τ)] = (1/N_val) Σ_{i=1}^{N_val} ( τ̂(x_i) − τ̃(x_i) )² .    (22)

A slightly different version of (22) replaces the training-set estimate of CATE τ̃(x_i) with a version of the observed outcome y_i weighted by the inverse propensity score, to correct for the selection bias typically found in observational studies, and reads:

Ê_{τ,IPTW}[ℓ(τ̂, τ)] = (1/N_val) Σ_{i=1}^{N_val} ( τ̂(x_i) − (2z_i − 1) y_i / π̃_{z_i}(x_i) )² ,    (23)

where π̃_{z_i}(x_i) is a propensity score estimate obtained on the training set N_train. The advantage of this version is that, instead of employing the training-set CATE estimate τ̃(x_i) as ground truth, it replaces it with (2z_i − 1) y_i / π̃_{z_i}(x_i), which involves only the observed y_i and z_i and a propensity score estimate that can be obtained through supervised approaches.

However, the evident setback of these plug-in methods is that they exhibit high model-dependent bias, which derives from the fact that the plug-in estimates of the ground-truth CATE τ̃(x_i) (or those of the propensity score), obtained on a training subset of the data, clearly depend on the method used for the estimation.
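The two plug-in validation criteria (22) and (23) can be sketched as follows (a hedged illustration; names are ours):

```python
import numpy as np

def pehe_plugin(tau_val_hat, tau_plugin):
    """(22): squared distance of validation-set CATE estimates from a plug-in
    'ground truth' fitted on the training set."""
    return np.mean((tau_val_hat - tau_plugin) ** 2)

def pehe_iptw(tau_val_hat, y, z, pi_z_hat):
    """(23): the plug-in CATE is replaced by the IPTW-transformed outcome
    (2z - 1) * y / pi_z, computable from observables only."""
    pseudo = (2 * z - 1) * y / pi_z_hat
    return np.mean((tau_val_hat - pseudo) ** 2)
```

Note that the pseudo-outcome in (23) is unbiased for CATE under unconfoundedness, but it is typically very noisy, which limits its usefulness for model comparison.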
In this way, Ê_τ[ℓ(τ̂, τ)] actually measures the error of τ̂(x_i) in estimating the training-set CATE τ̃(x_i), rather than the ground-truth CATE τ(x_i).

Finally, it is worth noticing that the R-Learner presented in the previous section develops an actual loss function for τ(·) in (16). However, this can be exploited exclusively within the context of R-Learners (e.g. for comparing a random forest R-Learner with a BART R-Learner), as it implies assuming Robinson's (1988) parametrization in (13).

In this section we report and comment on results from two different semi-simulated studies, carried out to compare the performance of some of the models presented above in estimating CATE. A third, supplemental semi-simulated study can be found in Appendix Section A. A semi-simulated study consists in simulating only the outcome Y_i in the tuple D_i = {X_i, Z_i, Y_i}, starting from real-world X_i and Z_i. For observational semi-simulations, Hill (2011) introduced a practical way of recreating an observational study from a randomized one: this is essentially done by leaving out a non-random portion of the treated group, so that treatment assignment is no longer random. Recreating an observational study from a purely randomized one has the main advantages of allowing control over the selection mechanism and of ensuring that the common support assumption holds for the treated group.

We provide results based on the analysis of two real-world randomized controlled trials, after transforming them into observational studies. The first semi-simulated setup employs the IHDP dataset, first introduced by Hill (2011) and popular in both the computer science and statistics literature on CATE estimation. The second and third setups instead employ the ACTG-175 dataset, available in the R package speff2trial, and differ in the way the outcome and ITE are generated. Both datasets are publicly available. As mentioned above, we present below results from the IHDP data simulation and from one of the two setups involving the ACTG-175 data, and leave the other ACTG-175 setup to Appendix Section A.

Table 2: Summary of the tested models.

Family                  Name      Description
Linear Models           S-OLS     Linear regression as S-Learner
                        T-OLS     Linear regression as T-Learner
                        R-LASSO   LASSO regression as R-Learner
Naive Non-Parametrics   kNN       k-Nearest Neighbors
Tree-Based Methods      CF        Causal Forest
                        S-BART    BART as S-Learner
                        T-BART    BART as T-Learner
                        BCF       Bayesian Causal Forest
                        R-BOOST   Gradient Boosting as R-Learner
Gaussian Processes      CMGP      Causal Multi-task Gaussian Process
                        NSGP      Non-Stationary Gaussian Process

The models tested include: linear regressions with S-Learner and T-Learner approaches (S-OLS and T-OLS); a penalized LASSO regression, through which we induce variable selection, with the R-Learner approach (R-LASSO); a simple non-parametric k-Nearest Neighbors (kNN) as T-Learner; a Causal Forest (CF), proposed by Wager and Athey (2018) as a modified version of Random Forests, where after partitioning the sample space on the covariates X, the last split in each tree of the ensemble is made on Z; Bayesian Additive Regression Trees as S-Learner and T-Learner (S-BART and T-BART); a Bayesian Causal Forest (BCF), i.e., a τ-Learner; Gradient Boosted trees with the R-Learner procedure (R-BOOST); and finally the two causal multitask versions of Gaussian Processes, one with a stationary kernel (CMGP) and the other with a non-stationary kernel (NSGP), developed by Alaa and van der Schaar (2017) and Alaa and van der Schaar (2018) respectively.
A summary of the tested models is provided in Table 2. For each of the two datasets analysed, in order to compare the methods presented above, we computed √PEHE for each of the B = 1000 Monte Carlo simulations and averaged it over all the simulations. Estimates of PEHE were obtained through its sample equivalent, namely:

P̂EHE_τ = (1/N) Σ_{i=1}^N ( τ(x_i) − τ̂(x_i) )² ,    (24)

where τ̂(x_i) is the CATE estimate obtained under the given method, while τ(x_i) is the ground-truth CATE, which is always unknown in the real world but is here simulated. Data are randomly split into a 70% train set used to train the models and a 30% test set used to evaluate them on unseen data. √PEHE is reported for both train and test data.
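The evaluation loop — a 70/30 split followed by √PEHE on both portions, as in (24) — can be sketched as follows; a gradient-boosting T-Learner stands in for the methods of Table 2, and function names are ours:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def sqrt_pehe(tau_true, tau_hat):
    """Square root of the sample PEHE in (24)."""
    return np.sqrt(np.mean((tau_true - tau_hat) ** 2))

def evaluate_t_learner(X, z, y, tau_true, seed=0):
    """70/30 split; fit on the train portion, report train and test sqrt-PEHE."""
    tr, te = train_test_split(np.arange(len(y)), test_size=0.3, random_state=seed)
    f0 = GradientBoostingRegressor(random_state=seed).fit(X[tr][z[tr] == 0], y[tr][z[tr] == 0])
    f1 = GradientBoostingRegressor(random_state=seed).fit(X[tr][z[tr] == 1], y[tr][z[tr] == 1])
    tau_hat = f1.predict(X) - f0.predict(X)
    return sqrt_pehe(tau_true[tr], tau_hat[tr]), sqrt_pehe(tau_true[te], tau_hat[te])
```

In the semi-simulated studies this loop is repeated over the B Monte Carlo draws of the outcome surface and the resulting √PEHE values are averaged.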
Figure 4: √PEHE distribution on train set (left) and test set (right), IHDP data.
The first semi-simulated setup makes use of the IHDP dataset, popular in the CATE estimation literature and used for the first time in Hill (2011). It includes data from the Infant Health and Development Program (IHDP), a randomized controlled trial carried out in 1985 and aimed at improving the health and cognitive status of premature infants with low birth weight, through follow-ups and parent support groups. The dataset includes 25 covariates, 6 continuous and 19 binary. The data are made observational by leaving out a non-random portion of the treated individuals, namely those with non-white mothers. This leaves 139 observations in the treated group and 608 in the control group, for a total of 747 observations.

ITE is derived as the difference between the simulated potential outcomes, which are generated as:

Y(0) ∼ N( exp((X + W) β_B), 1 ) ,
Y(1) ∼ N( X β_B − ω_B^b, 1 ) ,    (25)

where W is an offset matrix of the same dimension as X with every entry equal to 0.5, and β_B is a 25-dimensional vector of regression coefficients whose entries are sampled from a sparse grid of values ("Response surface B" in Hill (2011)). For each simulation b ∈ {1, ..., B}, ω_B^b is an offset chosen to guarantee that the population-level "Average Treatment effect on the Treated" (ATT) is equal to ATT = E[Y(1) − Y(0) | Z = 1] = 4.

Simulation results are reported in Table 3 and Figure 4. The best models appear to be the multitask Gaussian Processes (CMGP and NSGP) of Alaa and van der Schaar (2017) and Alaa and van der Schaar (2018). This is not unexpected, for two reasons related to the way the potential outcomes are simulated. First, the simulated setting tends to favour Meta-Learners fitting separate surfaces f₀, f₁ (T-Learners, X-Learners, Multitask-Learners), since the two simulated potential outcomes display very different degrees of complexity.
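The structure of the DGP in (25) can be sketched as below. This is an illustration only: the sparse coefficient grid is a hypothetical stand-in (the exact values are in Hill (2011)), and the function name is ours:

```python
import numpy as np

def simulate_response_surface_b(X, z, rng, att_target=4.0):
    """Structure of (25): Y(0) ~ N(exp((X + W) beta), 1), Y(1) ~ N(X beta - omega, 1).
    The coefficient grid and probabilities below are illustrative placeholders."""
    n, d = X.shape
    W = np.full((n, d), 0.5)                               # offset matrix of 0.5s
    beta = rng.choice([0.0, 0.1, 0.2, 0.3, 0.4], size=d,   # sparse coefficients
                      p=[0.6, 0.1, 0.1, 0.1, 0.1])
    mu0 = np.exp((X + W) @ beta)
    mu1 = X @ beta
    omega = np.mean(mu1[z == 1] - mu0[z == 1]) - att_target  # calibrate ATT = 4
    mu1 = mu1 - omega
    y = np.where(z == 1, rng.normal(mu1, 1.0), rng.normal(mu0, 1.0))
    return y, mu1 - mu0                # observed outcome, true CATE surface
```

Note how ω is recomputed for every Monte Carlo draw of β, exactly so that the ATT among the treated equals the target.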
Indeed, T-Learners in Table 3 show better performance than their S-Learner counterparts (e.g. T-BART vs S-BART). The second reason is that the potential outcomes generated in (25) are rather smooth, and Gaussian Process regression tends to adapt well to non-linear continuous surfaces. We also report, in Figure 4, the distribution of √PEHE_τ over the B = 1000 iterations on both train and test data, for each of the models. We also notice that tree-based methods are relatively more prone to overfitting.

Table 3: IHDP and ACTG-175 data. √PEHE_τ estimates ± 95% confidence intervals for each tested model, on the train and test sets respectively.

An important remark about the data generating process described by (25) is that it does not really induce strong confounding, since it is easy for a non-parametric base-learner to distinguish the two underlying polynomials E[Y_i(Z) | X_i = x_i]. Besides, the fact that the noise around the conditional mean is simulated independently for the two potential outcomes produces extra noise around CATE, which often results in unrealistically large variation in the simulated ITE. In the ACTG-175 simulated example illustrated in Section 4.2, we instead follow the parametrization of Nie and Wager (2019) and Hahn et al. (2020) in the data generating process of the outcome surface, in order to induce stronger confounding (believed to be common in observational studies) and to avoid creating unnecessarily high noise around CATE.

The second semi-simulated setup presented here is created using the ACTG-175 clinical trial data, freely available in the R package speff2trial. The data come from a randomized placebo-controlled trial aimed at comparing monotherapy versus a combination of therapies in HIV-1-infected subjects with CD4 cell counts between 200 and 500 (see Hammer et al. (1996) for details). As in the case of the IHDP data, an observational study is recreated by discarding a non-random subset of patients, namely those not showing symptomatic HIV infection. The final dataset consists of 813 observations and 12 variables (3 continuous and 9 binary). We simulate a continuous outcome and treatment effects by trying to match the sign and magnitude of the treatment effect estimated with a simple linear model, as well as the scale of the original outcome (CD4 cell counts, on a log scale).
Variables included in the dataset are shown in Table 4. Unlike the case of the IHDP data, the response surface Y_i is not generated via simulation of the two potential outcomes. Instead, we generate the continuous outcome Y_i according to the parametrization

Y_i = µ(X_i) + τ(X_i) Z_i + ε_i ,    (26)

which allows us to specify CATE directly, instead of starting from the simulation of potential outcomes, and to induce more confounding through the prognostic score function µ(x_i).

Table 4: ACTG-175 data variables.

Variable    Description
age         Numeric
wtkg        Numeric
hemo        Binary (hemophilia = 1)
homo        Binary (homosexual = 1)
drugs       Binary (intravenous drug use = 1)
oprior      Binary (non-zidovudine antiretroviral therapy prior to initiation of study treatment = 1)
z30         Binary (zidovudine use in the 30 days prior to treatment initiation = 1)
preanti     Numeric (number of days of previously received antiretroviral therapy)
race        Binary
gender      Binary
str2        Binary: antiretroviral history (0 = naive, 1 = experienced)
karnof_hi   Binary: Karnofsky score indicator

The prognostic score µ(x_i) and CATE τ(x_i) are generated as in (27): µ(x_i) is a sparse linear combination of hemo, |wtkg − c| (for a fixed centring constant c), gender, age, karnof_hi, z30 and race, while CATE takes the interacted form τ(x_i) = c₀ + c₁ · age · (karnof_hi + 2), so that treatment effect heterogeneity is driven by age and the Karnofsky score indicator. Then noise is added by simulating normally distributed errors ε_i ∼ N(0, σ²), with standard deviation equal to σ = (µ_max − µ_min)/2, where µ_max is the maximum value of the generated prognostic score across the sample, i.e. µ_max = max_j µ(x_j), while µ_min is the minimum. Notice that, unlike the case of the IHDP data, the error term is not simulated independently for the two groups, which avoids imposing too much noise around CATE. This translates into generally lower and less variable estimated PEHE for all models compared to the IHDP case, as shown in both Table 3 and Figure 5.

In this simulated setting CATE has a rather simple and known form. Hence, we expect S-Learners to outperform their T-Learner counterparts, and this is exactly what Table 3 shows: S-Learners all display lower √PEHE than T- and Multitask-Learners, with the only exception of NSGP, which performs quite well. BCF turns out to be the best method in this scenario, thanks to its ability to adjust to stronger confounding and to apply more regularization when learning CATE, which allows it to capture small and weakly heterogeneous treatment effect patterns. Figure 5 again depicts the distribution of √PEHE over the B = 1000 simulations, for both train and test data, for all the tested models.

We also employ the ACTG-175 data in a third semi-simulated setup featuring more complex µ(x_i) and τ(x_i) surfaces than those in (27). Description and results of this third example are provided in Appendix Section A.
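The DGP structure of (26)-(27) can be sketched as follows. All coefficient values below are illustrative placeholders, not those of the paper; only the functional structure (prognostic score plus age/Karnofsky-driven CATE, with σ tied to the spread of µ) follows the text:

```python
import numpy as np

def simulate_actg_style(hemo, wtkg, gender, age, karnof_hi, z30, race, z, rng):
    """DGP structure of (26)-(27): Y = mu(X) + tau(X) Z + eps.
    Coefficients are hypothetical stand-ins for the paper's values."""
    mu = (8.0 - 0.5 * hemo - 0.1 * np.abs(wtkg - 70.0) + 0.5 * gender
          - 0.3 * age + 0.5 * karnof_hi - 0.2 * z30 - 0.2 * race)
    tau = 0.2 + 0.1 * age * (karnof_hi + 2)      # heterogeneity via age, Karnofsky
    sigma = (mu.max() - mu.min()) / 2.0          # noise scale as in the text
    y = mu + tau * z + rng.normal(0.0, sigma, size=mu.shape)
    return y, tau
```

Since the same error draw applies to both arms, no extra noise is injected into the ITE itself, in contrast to the IHDP design in (25).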
Figure 5: √PEHE distribution on train set (left) and test set (right), ACTG-175 data.
Social, economic and political sciences are among the disciplines that rarely benefit from RCT studies, and often have to rely on observational or non-fully randomized data to address causal questions. In this section we provide an example of a real-world social science application of the models reviewed in the previous sections, to demonstrate their usefulness in applied research. To this end, we make use of the nhanes.bmi dataset found in the
ATE R package. The dataset consists of a subset of the 2007-2008 National Health and Nutrition Examination Survey (NHANES), containing the variables shown in Table 5. The data are employed here to investigate whether participation in school meal programs is associated with improved children's health indicators. There is reason to believe that the impact of such programs is heterogeneous across individuals, in that demographics such as age, gender or ethnicity are likely to play a role in how effective participation is (e.g. younger kids might benefit more than older ones). This supports the use of methods for individual treatment effect estimation.

Contrary to a simulated study, in a real-world application it is impossible to know what type of heterogeneity patterns one will encounter. For this reason, it is usually good practice to deploy more than a single model to estimate CATE. In this specific context, we derive and compare CATE estimates from both the Bayesian Causal Forest (BCF) and the non-stationary Multitask Gaussian Process (NSGP), since in the simulation studies they emerged as the methods most flexible in adapting to different types of surfaces.

We find that the CATE estimates obtained from the two models are highly correlated, with Pearson's and Spearman's coefficients equal to 0.93 and 0.92 respectively, but BCF exhibits higher variance in the estimates (sample standard deviation of 0.09) than NSGP (sample standard deviation of 0.03). This means that the BCF estimates capture much more heterogeneity. The left panel of Figure 6 depicts the densities of the CATE estimates derived from each model; NSGP's estimates appear to be far more concentrated around the mean and generally indicate a larger impact. The main source of the difference in dispersion between the two sets of estimates is likely to be the child's age covariate.
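The agreement diagnostics reported above — linear and rank correlation between the two CATE vectors, plus each model's dispersion — can be sketched as follows (function names are ours; the simple rank transform assumes no ties, which is fine for continuous CATE estimates):

```python
import numpy as np

def _ranks(x):
    """Ranks 0..n-1 (no tie handling; adequate for continuous estimates)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def compare_cate(tau_a, tau_b):
    """Pearson and Spearman correlation between two CATE vectors, plus the
    sample standard deviation of each."""
    return {"pearson": np.corrcoef(tau_a, tau_b)[0, 1],
            "spearman": np.corrcoef(_ranks(tau_a), _ranks(tau_b))[0, 1],
            "sd_a": tau_a.std(ddof=1),
            "sd_b": tau_b.std(ddof=1)}
```

High correlation with very different dispersions, as found here, indicates that the two models agree on the ordering of individuals while disagreeing on the strength of the heterogeneity.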
As illustrated in the right panel of Figure 6, BCF attributes a higher degree of treatment heterogeneity to child's age than NSGP, although both indicate a slightly non-linear, monotone relationship between the two quantities.
Table 5: NHANES data variables.

Variable      Description
BMI           Numeric. Body mass index (outcome variable)
school_meal   Binary (treatment indicator)
age           Numeric (child's age)
childSex      Binary (male = 1)
afam          Binary (African American = 1)
hisam         Binary (Hispanic = 1)
povlev_200    Binary (family above 200% federal poverty level = 1)
sup_nutr      Binary (supplementary nutrition program = 1)
stamp_prog    Binary (food stamp program = 1)
food_sec      Binary (food security in household = 1)
ins           Binary (any insurance = 1)
refsex        Binary (adult respondent's gender is male = 1)
refage        Numeric (adult respondent's age)
The impossibility of directly measuring the performance of the methods prevents proper model selection from being carried out, as outlined in detail in Section 3.2. However, both models appear to support the existence of treatment heterogeneity, although with different patterns. In particular, BCF detects a stronger moderating effect of child's age on the treatment effect than NSGP, while the two methods attribute very similar moderating effects to the other covariates (such as gender and ethnicity).
In the previous sections, we covered the most recent developments in the estimation of heterogeneous treatment effects in the context of non-randomized studies. We use the phrase "non-randomized" studies since the methods illustrated are well suited for observational studies, but also for RCTs that might display threats to randomization, such as imperfect compliance and non-random missingness (and obviously also for pure RCTs, sample size permitting).

Our review and simulation studies lead to a few general observations. A first simple but practical suggestion, in applied settings where CATE is to be estimated, is to implement more than one single method, to provide robustness to the results, given that it is impossible to discern which particular scenario the researcher faces (weak vs strong heterogeneity, simple vs complex CATE patterns, etc.). Secondly, it is clear from the semi-simulated exercises that non-parametric regression methods adapt well to non-linearities, and in general to the complexity of the CATE function, whereas parametric ones are naturally less robust to functional form misspecification, unless heterogeneity patterns are very weak (reverting to homogeneity). In addition, Bayesian approaches to non-parametric modelling trade some computational cost for less cumbersome parameter tuning and convenient posterior properties (i.e., posterior credible intervals).

A final remark concerns τ-Learners, and BCF in particular. The Bayesian Causal Forest of Hahn et al. (2020) shows good adaptability and robustness to misspecification of the CATE surface (i.e., smoothness, sparsity, heterogeneity vs homogeneity of the treatment effect) compared to the other Meta-Learners examined.
This ability stems directly from the explicit inclusion of τ(x_i) as a "parameter" in a regression framework, which is designed specifically to detect the moderating impact of the covariates on treatment effects and allows one to place a prior distribution directly on it in a Bayesian fashion. From an empirical perspective, this implies some degree of flexibility in the choice of p(τ(x_i)).

Figure 6: Left panel: kernel density of CATE estimates from the two models (dashed lines are the sample means). Right panel: CATE estimates from the two models as a function of child's age.

In fact, the prior on τ(x_i) can translate into a more or less targeted regularization, based on prior knowledge of the complexity of the CATE surface. In a case where CATE is believed to be complex one might, for example, modify the default options to allow deeper trees in the estimation of τ. On the contrary, in case little or no a priori knowledge is possessed, a more agnostic prior, which prevents overfitting via stronger regularization while leaving some degree of freedom in CATE estimation, can be chosen. This is something extraneous to the other methods found in the literature which, despite their promising performance (the "causal" Gaussian Processes, CMGP and NSGP, in particular), do not allow one to treat CATE as a model parameter and to specify a prior distribution on it, and thus cannot convey a priori knowledge, which is frequently of interest in causal analysis.

References
Alaa, A. and M. van der Schaar (2018). Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In Proc. 35th Inter. Conf. on Mach. Learn., Volume 80, pp. 129–138.

Alaa, A. M. and M. van der Schaar (2017). Bayesian inference of individualized treatment effects using multi-task Gaussian Processes. In Adv. Neural Inf. Process Syst. 30, pp. 3427–3435.

Angrist, J. D., G. W. Imbens, and D. B. Rubin (1996). Identification of causal effects using instrumental variables. J. Am. Stat. Assoc. 91, 444–455.

Athey, S. and G. Imbens (2016). Recursive partitioning for heterogeneous causal effects. Proc. Natl. Acad. Sci. 113, 7353–7360.

Chipman, H. A., E. I. George, and R. E. McCulloch (1998). Bayesian CART model search. J. Am. Stat. Assoc. 93, 935–948.

Chipman, H. A., E. I. George, and R. E. McCulloch (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4, 266–298.

Dawid, A. P. (2000). Causal inference without counterfactuals. J. Am. Stat. Assoc. 95, 407–424.

Foster, J. C., J. M. G. Taylor, and S. J. Ruberg (2011). Subgroup identification from randomized clinical trial data. Stat. Med. 30, 2867–2880.

Hahn, P. R., C. M. Carvalho, D. Puelz, and J. He (2018). Regularization and confounding in linear regression for treatment effect estimation. Bayesian Anal. 13, 163–182.

Hahn, P. R., J. S. Murray, and C. M. Carvalho (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. Bayesian Anal.

Hammer, S. M., D. A. Katzenstein, M. D. Hughes, H. Gundacker, R. T. Schooley, R. H. Haubrich, W. K. Henry, M. M. Lederman, J. P. Phair, M. Niu, M. S. Hirsch, and T. C. Merigan (1996). A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N. Engl. J. Med. 335, 1081–1090.

Hartford, J., G. Lewis, K. Leyton-Brown, and M. Taddy (2017). Deep IV: A flexible approach for counterfactual prediction. In Proc. 34th Inter. Conf. on Mach. Learn., Volume 70, pp. 1414–1423.

Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning. New York: Springer.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica 47, 153–161.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 20, 217–240.

Hudgens, M. G. and M. E. Halloran (2008). Toward causal inference with interference. J. Am. Stat. Assoc. 103, 832–842.

Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econ. Stat. 86, 4–29.

Imbens, G. W. and D. B. Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press.

Johansson, F. D., U. Shalit, and D. Sontag (2016). Learning representations for counterfactual inference. In Proc. 33rd Inter. Conf. on Mach. Learn.

King, G. and R. Nielsen (2019). Why propensity scores should not be used for matching. Political Anal.

Künzel, S. R., J. S. Sekhon, P. J. Bickel, and B. Yu (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proc. Natl. Acad. Sci. 116, 4156–4165.

Linero, A. R. (2018). Bayesian regression trees for high-dimensional prediction and variable selection. J. Am. Stat. Assoc. 113, 626–636.

Lu, M., S. Sadiq, D. J. Feaster, and H. Ishwaran (2018). Estimating individual treatment effect in observational data using random forest methods. J. Comput. Graph. Stat. 27, 209–219.

Morris, T. P., I. R. White, and M. J. Crowther (2019). Using simulation studies to evaluate statistical methods. Stat. Med. 38, 2074–2102.

Nie, X. and S. Wager (2019). Quasi-oracle estimation of heterogeneous treatment effects.

Pearl, J. (2018). Theoretical impediments to machine learning with seven sparks from the causal revolution. pp. 3.

Powers, S., J. Qian, K. Jung, A. Schuler, N. H. Shah, T. Hastie, and R. Tibshirani (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Stat. Med. 37, 1767–1787.

Robinson, P. M. (1988). Root-N-consistent semiparametric regression. Econometrica 56, 931–954.

Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41–55.

Rosenbaum, P. R. and D. B. Rubin (1984). Reducing bias in observational studies using subclassification on the propensity score. J. Am. Stat. Assoc. 79, 516–524.

Rosenbaum, P. R. and D. B. Rubin (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am. Stat. 39, 33–38.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6, 34–58.

Schuler, A., M. Baiocchi, R. Tibshirani, and N. Shah (2018). A comparison of methods for model selection when estimating individual treatment effects.

Starling, J., J. Murray, P. Lohr, A. Aiken, C. Carvalho, and J. Scott (2019). Targeted Smooth Bayesian Causal Forests: An analysis of heterogeneous treatment effects for simultaneous versus interval medical abortion regimens over gestation.

Tchetgen, E. and T. VanderWeele (2010). On causal inference in the presence of interference. Stat. Methods Med. Res. 21, 55–75.

Wager, S. and S. Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. 113, 1228–1242.

Yao, L., S. Li, Y. Li, M. Huai, J. Gao, and A. Zhang (2018). Representation learning for treatment effect estimation from observational data. In Adv. Neural Inf. Process Syst. 31, pp. 2633–2643.

ACTG-175 data: third simulated exercise
In this short appendix we present results obtained from a third semi-simulated exercise involving the ACTG-175 dataset. The structure of the ACTG-175 data used here is exactly the same as in the other example (same number of covariates and sample size); the difference lies in how the prognostic score and CATE functions are simulated. For this third setup we chose slightly more complex surfaces than in the other ACTG-175 simulation, to provide an additional example of the performance of the reviewed methods under a more challenging data generating process (closer to the one seen in the IHDP data example). More specifically, the two surfaces µ(x_i) and τ(x_i) are generated as

    µ(x_i) = 6 + 0. wtkg − sin(age) · (gender + 1) + 0. hemo · race − 0. z,
    τ(x_i) = 1 + (1. wtkg) · (karnof_hi + 1) + 0. age.                        (28)

The surfaces in (28) feature more complex functions and more interaction terms. As in the other ACTG-175 data setup, Gaussian noise was added by simulating ε_i ∼ N(0, σ²), with standard deviation equal to σ = (µ_max − µ_min)/10, where µ_max and µ_min are, respectively, the maximum and minimum values of the generated prognostic score across the sample.

Table 6: Third simulated setup (ACTG-175 data). √PEHE_τ estimates ± 95% confidence intervals for each tested model, on the train and test sets respectively.
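For concreteness, the semi-simulated data generating mechanism described above, together with the √PEHE metric reported in Table 6, can be sketched as follows. This is a minimal illustration, not the exact setup of the paper: the covariate values are randomly generated stand-ins for the real ACTG-175 columns, the surface coefficients are illustrative placeholders for those in Eq. (28), and `sqrt_pehe` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative stand-ins for the ACTG-175 covariates (placeholder values;
# in the paper these come from the real trial data).
n = 1000
wtkg = rng.uniform(40, 110, n)       # weight in kg
age = rng.uniform(15, 70, n)
gender = rng.integers(0, 2, n)
hemo = rng.integers(0, 2, n)
race = rng.integers(0, 2, n)
z_cov = rng.normal(size=n)
karnof_hi = rng.integers(0, 2, n)
treat = rng.integers(0, 2, n)        # randomized treatment assignment

# Surfaces in the spirit of Eq. (28); coefficients are placeholders.
mu = 6 + 0.1 * wtkg - np.sin(age) * (gender + 1) + 0.1 * hemo * race - 0.1 * z_cov
tau = 1 + 0.1 * wtkg * (karnof_hi + 1) + 0.1 * age

# Noise scale tied to the spread of the prognostic score, as in the text:
# sigma = (mu_max - mu_min) / 10.
sigma = (mu.max() - mu.min()) / 10
y = mu + tau * treat + rng.normal(0, sigma, n)

def sqrt_pehe(tau_true, tau_hat):
    """Root Precision in Estimating Heterogeneous Effects (cf. Hill, 2011)."""
    return np.sqrt(np.mean((tau_true - tau_hat) ** 2))

# E.g. score a naive constant estimator (a sample-ATE-style prediction);
# any CATE model's predictions would be plugged in the same way.
print(sqrt_pehe(tau, np.full(n, tau.mean())))
```

Since √PEHE compares individual-level effect predictions against the simulated ground-truth τ(x_i), it can only be computed on (semi-)synthetic data such as this, which is precisely why the review relies on simulated exercises for evaluation.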