Learning Representations for Counterfactual Inference
Fredrik D. Johansson* (FREJOHK@CHALMERS.SE), CSE, Chalmers University of Technology, Göteborg, SE-412 96, Sweden
Uri Shalit* (SHALIT@CS.NYU.EDU)
David Sontag (DSONTAG@CS.NYU.EDU)
CIMS, New York University, 251 Mercer Street, New York, NY 10012 USA
*Equal contribution
Abstract
Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We consider the task of answering counterfactual questions such as, "Would this patient have lower blood sugar had she received a different medication?". We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.
1. Introduction
Inferring causal relations is a fundamental problem in the sciences and commercial applications. The problem of causal inference is often framed in terms of counterfactual questions (Lewis, 1973; Rubin, 1974; Pearl, 2009) such as "Would this patient have lower blood sugar had she received a different medication?", or "Would the user have clicked on this ad had it been in a different color?". In this paper we propose a method to learn representations suited for counterfactual inference, and show its efficacy in both simulated and real-world tasks.

We focus on counterfactual questions raised by what are known as observational studies. Observational studies are studies where interventions and outcomes have been recorded, along with appropriate context. For example, consider an electronic health record dataset collected over several years, where for each patient we have lab tests and past diagnoses, as well as data relating to their diabetic status, and the causal question of interest is which of two existing anti-diabetic medications A or B is better for a given patient. Observational studies are rising in importance due to the widespread accumulation of data in fields such as healthcare, education, employment and ecology. We believe machine learning will be called on more and more to help make better decisions in these fields, and that researchers should be careful to pay attention to the ways in which these studies differ from classic supervised learning, as explained in Section 2 below.

In this work we draw a connection between counterfactual inference and domain adaptation. We then introduce a form of regularization by enforcing similarity between the distributions of representations learned for populations with different interventions, for example, the representations for patients who received medication A versus those who received medication B. This reduces the variance incurred by fitting a model on one distribution and applying it to another. In Section 3 we give several methods for learning such representations. In Section 4 we show that our method approximately minimizes an upper bound on a regret term in the counterfactual regime. The general method is outlined in Figure 1. Our work has commonalities with recent work on learning fair representations (Zemel et al., 2013; Louizos et al., 2015) and learning representations for transfer learning (Ben-David et al., 2007; Ganin et al., 2015). In all these cases the learned representation has some invariance to specific aspects of the data: either the identity of a certain group, such as racial minorities, for fair representations; the identity of the data source for domain adaptation; or, in the case of counterfactual learning, the type of intervention enacted in each population.

In machine learning, counterfactual questions typically arise in problems where there is a learning agent which performs actions and receives feedback or reward for that choice, without knowing what the feedback would have been for other possible choices. This is sometimes referred to as bandit feedback (Beygelzimer et al., 2010). This setup comes up in diverse areas, for example off-policy evaluation in reinforcement learning (Sutton & Barto, 1998), learning from "logged implicit exploration data" (Strehl et al., 2010) or "logged bandit feedback" (Swaminathan & Joachims, 2015), and in understanding and designing complex real-world ad-placement systems (Bottou et al., 2013). Note that while in contextual bandit or robotics applications the researcher typically knows the method underlying the action choice (e.g. the policy in reinforcement learning), in observational studies we usually do not have control over, or even a full understanding of, the mechanism which chooses which actions are performed and which feedback or reward is revealed. For instance, for anti-diabetic medication, more affluent patients might be insensitive to the price of a drug, while less affluent patients could take the price into account in their choice.

Given that we do not know beforehand the particulars determining the choice of action, the question remains: how can we learn from data which course of action would have better outcomes? By bringing together ideas from representation learning and domain adaptation, our method offers a novel way to leverage increasing computational power and the rise of large datasets to tackle consequential questions of causal inference.

The contributions of our paper are as follows. First, we show how to formulate the problem of counterfactual inference as a domain adaptation problem, and more specifically a covariate shift problem. Second, we derive new families of representation algorithms for counterfactual inference: one based on linear models and variable selection, and the other based on deep learning of representations (Bengio et al., 2013). Finally, we show that learning representations that encourage similarity (balance) between the treated and control populations leads to better counterfactual inference; this is in contrast to many methods which attempt to create balance by re-weighting samples (e.g., Bang & Robins, 2005; Dudík et al., 2011; Austin, 2011; Swaminathan & Joachims, 2015). We show the merit of learning balanced representations both theoretically in Theorem 1, and empirically in a set of experiments across two datasets.
2. Problem setup
Let T be the set of potential interventions or actions we wish to consider, X the set of contexts, and Y the set of possible outcomes. For example, for a patient x ∈ X the set T of interventions of interest might be two different treatments, and the set of outcomes might be an interval Y ⊂ R of blood sugar levels measured in mg/dL. For an ad slot on a webpage x, the set of interventions T might be all possible ads in the inventory that fit that slot, while the potential outcomes could be Y = {click, no click}. For a context x (e.g. patient, webpage), and for each potential intervention t ∈ T, let Y_t(x) ∈ Y be the potential outcome for x. The fundamental problem of causal inference is that only one potential outcome is observed for a given context x: even if we give the patient one medication and later the other, the patient is not in exactly the same state. In machine learning this type of partial feedback is often called "bandit feedback". The model described above is known as the Rubin-Neyman causal model (Rubin, 1974; 2011).

We are interested in the case of a binary action set T = {0, 1}, where action 1 is often known as the "treated" and action 0 is the "control". In this case the quantity Y_1(x) − Y_0(x) is of high interest: it is known as the individualized treatment effect (ITE) for context x (van der Laan & Petersen, 2007; Weiss et al., 2015). Knowing this quantity enables choosing the best of the two actions when confronted with the choice, for example choosing the best treatment for a specific patient. However, the fact that we only have access to the outcome of one of the two actions prevents the ITE from being known. Another commonly sought-after quantity is the average treatment effect, ATE = E_{x∼p(x)}[ITE(x)], for a population with distribution p(x). In the binary action setting, we refer to the observed and unobserved outcomes as the factual outcome y^F(x) and the counterfactual outcome y^CF(x), respectively.

A common approach for estimating the ITE is by direct modelling: given n samples {(x_i, t_i, y_i^F)}_{i=1}^n, where y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i), learn a function h : X × T → Y such that h(x_i, t_i) ≈ y_i^F. The estimated transductive ITE is then:

$$\widehat{\mathrm{ITE}}(x_i) = \begin{cases} y_i^F - h(x_i, 1 - t_i), & t_i = 1, \\ h(x_i, 1 - t_i) - y_i^F, & t_i = 0. \end{cases} \qquad (1)$$

While in principle any function fitting model might be used for estimating the ITE (Prentice, 1976; Gelman & Hill, 2006; Chipman et al., 2010; Wager & Athey, 2015; Weiss et al., 2015), it is important to note how this task differs from standard supervised learning. The problem is as follows: the observed sample consists of the set P̂^F = {(x_i, t_i)}_{i=1}^n. However, calculating the ITE requires inferring the outcome on the set P̂^CF = {(x_i, 1 − t_i)}_{i=1}^n. We call the set P̂^F ∼ P^F the empirical factual distribution, and the set P̂^CF ∼ P^CF the empirical counterfactual distribution, respectively. Because P^F and P^CF need not be equal, the problem of causal inference by counterfactual prediction might require inference over a different distribution than the one from which samples are given. In machine learning terms, this means that the feature distribution of the test set differs from that of the train set. This is a case of covariate shift, which is a special case of domain adaptation (Daume III & Marcu, 2006; Jiang, 2008; Mansour et al., 2009). A somewhat similar connection was noted in Schölkopf et al. (2012) with respect to covariate shift, in the context of a very simple causal model.

Figure 1. Contexts x are represented by Φ(x), which are used, with group indicator t, to predict the response y while minimizing the imbalance in distributions measured by disc(Φ_C, Φ_T).

Specifically, we have that P^F(x, t) = P(x) · P(t | x) and P^CF(x, t) = P(x) · P(¬t | x). The difference between the observed (factual) sample and the sample we must perform inference on lies precisely in the treatment assignment mechanism, P(t | x). For example, in a randomized control trial, we typically have that t and x are independent. In the contextual bandit setting, there is typically an algorithm which determines the choice of the action t given the context x. In observational studies, which are the focus of this work, the treatment assignment mechanism is not under our control and in general will not be independent of the context x. Therefore, in general, the counterfactual distribution will be different from the factual distribution.
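To make the direct-modelling estimator of Eq. (1) concrete, here is a minimal numpy sketch; it is our illustration, not the paper's code, and the ridge regressor standing in for h could be replaced by any supervised regressor. All function names are ours.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def transductive_ite(X, t, y_f, lam=1.0):
    # Fit h(x, t) on the factual sample, then apply Eq. (1):
    # ITE_hat(x_i) = y_i^F - h(x_i, 1 - t_i) if t_i = 1,
    #                h(x_i, 1 - t_i) - y_i^F if t_i = 0.
    Xt = np.hstack([X, t[:, None].astype(float)])
    w = fit_ridge(Xt, y_f, lam)
    y_cf_hat = np.hstack([X, (1.0 - t)[:, None]]) @ w
    return np.where(t == 1, y_f - y_cf_hat, y_cf_hat - y_f)
```

Because h is fit on the factual distribution but queried on the counterfactual one, this estimator inherits exactly the covariate shift problem described above.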
3. Balancing counterfactual regression
We propose to perform counterfactual inference by amending the direct modeling approach, taking into account the fact that the learned estimator h must generalize from the factual distribution to the counterfactual distribution. Our method, see Figure 1, learns a representation Φ : X → R^d (either using a deep neural network, or by feature re-weighting and selection), and a function h : R^d × T → R, such that the learned representation trades off three objectives: (1) enabling low-error prediction of the observed outcomes over the factual representation, (2) enabling low-error prediction of unobserved counterfactuals by taking into account relevant factual outcomes, and (3) making the distributions of the treatment populations similar, or balanced.

We accomplish low-error prediction by the usual means of error minimization over a training set and regularization in order to enable good generalization error. We accomplish the second objective by a penalty that encourages counterfactual predictions to be close to the nearest observed outcome from the respective treated or control set. Finally, we accomplish the third objective by minimizing the so-called discrepancy distance, introduced by Mansour et al. (2009), which is a hypothesis-class-dependent distance measure tailored for domain adaptation. For a hypothesis space H, we denote the discrepancy distance by disc_H. See Section 4 for the formal definition and motivation. Other discrepancy measures, such as the Maximum Mean Discrepancy (Gretton et al., 2012), could also be used for this purpose.

Intuitively, representations that reduce the discrepancy between the treated and control populations prevent the learner from using "unreliable" aspects of the data when trying to generalize from the factual to the counterfactual domain. For example, if in our sample almost no men ever received medication A, inferring how men would react to medication A is highly prone to error and a more conservative use of the gender feature might be warranted.

Let X = {x_i}_{i=1}^n, T = {t_i}_{i=1}^n, and Y^F = {y_i^F}_{i=1}^n denote the observed units, treatment assignments and factual outcomes, respectively. We assume X is a metric space with a metric d. Let j(i) ∈ argmin_{j ∈ {1...n} s.t. t_j = 1−t_i} d(x_j, x_i) be the nearest neighbor of x_i among the group that received the opposite treatment from unit i. Note that the nearest neighbor is computed once, in the input space, and does not change with the representation Φ. The objective we minimize over representations Φ and hypotheses h ∈ H is

$$B_{\mathcal{H},\alpha,\gamma}(\Phi, h) = \frac{1}{n}\sum_{i=1}^n \left|h(\Phi(x_i), t_i) - y_i^F\right| + \alpha\, \mathrm{disc}_{\mathcal{H}}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) + \frac{\gamma}{n}\sum_{i=1}^n \left|h(\Phi(x_i), 1 - t_i) - y_{j(i)}^F\right|, \qquad (2)$$

where α, γ > 0 are hyperparameters that control the strength of the imbalance penalties, and disc_H is the discrepancy measure defined in Section 4.1. When the hypothesis class H is the class of linear functions, the term disc_H(P̂_Φ^F, P̂_Φ^CF) has a closed form, given in Section 4.1 below, and h(Φ, t_i) = h^⊤[Φ(x_i), t_i]. For more complex hypothesis spaces there is in general no exact closed form for disc_H(P̂_Φ^F, P̂_Φ^CF).

Once the representation Φ is learned, we fit a final hypothesis minimizing a regularized squared loss objective on the factual data. Our algorithm is summarized in Algorithm 1. Note that our algorithm involves two minimization procedures. In Section 4 we motivate our method by showing that our way of learning representations minimizes an upper bound on a regret term over the counterfactual distribution, using results of Cortes & Mohri (2014).

Algorithm 1 Balancing counterfactual regression
Input: X, T, Y^F; H, N; α, γ, λ
1: Φ*, g* = argmin_{Φ ∈ N, g ∈ H} B_{H,α,γ}(Φ, g)
2: h* = argmin_{h ∈ H} (1/n) Σ_{i=1}^n (h(Φ*(x_i), t_i) − y_i^F)² + λ‖h‖²_H
Output: h*, Φ*
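As a sketch of the objective in Eq. (2) for a linear hypothesis, the following is our illustration (not the paper's implementation): the norm of the mean difference between treated and control representations stands in for the discrepancy term, whose exact closed form for linear hypotheses is derived in Section 4.1. All names are illustrative.

```python
import numpy as np

def balancing_objective(Phi, t, y_f, j_nn, h, alpha, gamma):
    # Eq. (2): factual error + alpha * imbalance + gamma * nearest-neighbour
    # counterfactual fit. Phi: (n, d) representations Phi(x_i); h: weights of
    # a linear hypothesis over [Phi(x), t]; j_nn[i]: index of the nearest
    # neighbour of x_i in the opposite treatment group, precomputed once in
    # the input space.
    pred_f = np.hstack([Phi, t[:, None]]) @ h           # h(Phi(x_i), t_i)
    pred_cf = np.hstack([Phi, 1.0 - t[:, None]]) @ h    # h(Phi(x_i), 1 - t_i)
    factual_err = np.abs(pred_f - y_f).mean()
    nn_err = np.abs(pred_cf - y_f[j_nn]).mean()
    # Mean difference between treated and control representations, a simple
    # stand-in for disc_H; see Eq. (8) for the exact linear-hypothesis form.
    imbalance = np.linalg.norm(Phi[t == 1].mean(0) - Phi[t == 0].mean(0))
    return factual_err + alpha * imbalance + gamma * nn_err
```

In Algorithm 1, this quantity is minimized jointly over the representation and a hypothesis g, after which a final hypothesis h* is refit on the factual data with a squared loss.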
3.1. Balancing variable selection

A naïve way of obtaining a balanced representation is to use only features that are already well balanced, i.e. features which have a similar distribution over both treated and control sets. However, imbalanced features can be highly predictive of the outcome, and should not always be discarded. A middle ground is to restrict the influence of imbalanced features on the predicted outcome. We build on this idea by learning a sparse re-weighting of the features that minimizes the bound in Theorem 1. The re-weighting determines the influence of a feature by trading off its predictive capabilities and its balance.

We implement the re-weighting as a diagonal matrix W, forming the representation Φ(x) = Wx, with diag(W) subject to a simplex constraint to achieve sparsity. Let N = {x ↦ Wx : W = diag(w), w_i ∈ [0, 1], Σ_i w_i = 1} denote the space of such representations. We can now apply Algorithm 1 with H_l the space of linear hypotheses. Because the hypotheses are linear, disc(Φ) is a function of the distance between the weighted population means, see Section 4.1. With p = E[t], c = p − 1/2, n_t = Σ_{i=1}^n t_i, µ_1 = (1/n_t) Σ_{i: t_i=1} x_i, and µ_0 analogously defined,

$$\mathrm{disc}_{\mathcal{H}_l}(XW) = c + \sqrt{c^2 + \left\| W\left(p\,\mu_1 - (1-p)\,\mu_0\right) \right\|_2^2}.$$

To minimize the discrepancy, features k that differ a lot between treatment groups will receive a smaller weight w_k. Minimizing the overall objective B involves a trade-off between maximizing balance and predictive accuracy. We minimize (2) using alternating sub-gradient descent.

3.2. Balancing neural networks

Deep neural networks have been shown to successfully learn good representations of high-dimensional data in many tasks (Bengio et al., 2013). Here we show that they can be used for counterfactual inference and, crucially, for accommodating imbalance penalties. We propose a modification of the standard feed-forward architecture with fully connected layers, see Figure 2. The first d_r hidden layers are used to learn a representation Φ(x) of the input x. The output of the d_r:th layer is used to calculate the discrepancy disc_H(P̂_Φ^F, P̂_Φ^CF). The d_o layers following the first d_r layers take as additional input the treatment assignment t_i and generate a prediction h([Φ(x_i), t_i]) of the outcome.

Figure 2. Neural network architecture.

We note that both in the case of variable re-weighting, and for neural nets with a single linear outcome layer, the hypothesis space H comprises linear functions of [Φ, t] and the discrepancy disc_H(Φ) can be expressed in closed form. A less desirable consequence is that such models cannot capture differences in the individual treatment effect, as they involve no interactions between Φ(x) and t. Such interactions could be introduced by, for example, (polynomial) feature expansion, or, in the case of neural networks, by adding non-linear layers after the concatenation [Φ(x), t]. For both approaches, however, we no longer have a closed-form expression for disc_H(P̂_Φ^F, P̂_Φ^CF).
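The architecture of Figure 2 can be sketched in a few lines of PyTorch. This is a schematic reconstruction under stated assumptions, not the authors' implementation; layer sizes and names are illustrative, and it assumes d_r ≥ 1.

```python
import torch
import torch.nn as nn

class BalancingNet(nn.Module):
    # d_r ReLU layers produce the representation Phi(x); the treatment t is
    # concatenated to Phi(x), and d_o further ReLU layers plus a linear head
    # predict the outcome h([Phi(x), t]).
    def __init__(self, d_in, d_hidden=25, d_r=2, d_o=2):
        super().__init__()
        rep = []
        for i in range(d_r):
            rep += [nn.Linear(d_in if i == 0 else d_hidden, d_hidden), nn.ReLU()]
        self.rep = nn.Sequential(*rep)
        out = []
        for i in range(d_o):
            out += [nn.Linear(d_hidden + 1 if i == 0 else d_hidden, d_hidden),
                    nn.ReLU()]
        out += [nn.Linear(d_hidden if d_o > 0 else d_hidden + 1, 1)]
        self.out = nn.Sequential(*out)

    def forward(self, x, t):
        phi = self.rep(x)                       # Phi(x): used for the discrepancy
        y = self.out(torch.cat([phi, t], dim=1))
        return y.squeeze(1), phi                # prediction and representation
```

In this sketch, BNN-2-2 of Section 6 corresponds to d_r = 2, d_o = 2 and BNN-4-0 to d_r = 4, d_o = 0; during training, the discrepancy penalty of Eq. (2) would be computed on phi and added to the factual prediction loss.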
4. Theory
In this section we derive an upper bound on the relative counterfactual generalization error of a representation function Φ. The bound only uses quantities we can measure directly from the available data. In the previous section we gave several methods for learning representations which approximately minimize the upper bound.

Recall that for an observed context or instance x_i ∈ X with observed treatment t_i ∈ {0, 1}, the two potential outcomes are Y_0(x_i), Y_1(x_i) ∈ Y, of which we observe the factual outcome y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i). Let (x_1, t_1, y_1^F), ..., (x_n, t_n, y_n^F) be a sample from the factual distribution. Similarly, let (x_1, 1 − t_1, y_1^CF), ..., (x_n, 1 − t_n, y_n^CF) be the counterfactual sample. Note that while we know the factual outcomes y_i^F, we do not know the counterfactual outcomes y_i^CF.

Let Φ : X → R^d be a representation function, and let R(Φ) denote its range. Denote by P̂_Φ^F the empirical distribution over the representations and treatment assignments (Φ(x_1), t_1), ..., (Φ(x_n), t_n), and similarly P̂_Φ^CF the empirical distribution over the representations and counterfactual treatment assignments (Φ(x_1), 1 − t_1), ..., (Φ(x_n), 1 − t_n). Let H_l be the hypothesis set of linear functions β : R(Φ) × {0, 1} → Y.

Definition 1 (Mansour et al. 2009). Given a hypothesis set H and a loss function L, the empirical discrepancy between the empirical distributions P̂_Φ^F and P̂_Φ^CF is

$$\mathrm{disc}_{\mathcal{H}}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) = \max_{\beta, \beta' \in \mathcal{H}} \left| \mathbb{E}_{x \sim \hat{P}_\Phi^F}\!\left[L(\beta(x), \beta'(x))\right] - \mathbb{E}_{x \sim \hat{P}_\Phi^{CF}}\!\left[L(\beta(x), \beta'(x))\right] \right|,$$

where L is a loss function L : Y × Y → R with weak Lipschitz constant µ relative to H.¹ Note that the discrepancy is defined with respect to a hypothesis class and a loss function, and is therefore very useful for obtaining generalization bounds involving different distributions. Throughout this section we always have L denote the squared loss. We prove the following, based on Cortes & Mohri (2014):

¹ When L is the squared loss we can show that if ‖Φ(x)‖ ≤ m and |y| ≤ M, and the hypothesis set H is that of linear functions with norm bounded by m/λ, then µ ≤ M(1 + m²/λ).
Theorem 1. For a sample {(x_i, t_i, y_i^F)}_{i=1}^n, x_i ∈ X, t_i ∈ {0, 1} and y_i ∈ Y, and a given representation function Φ : X → R^d, let P̂_Φ^F = (Φ(x_1), t_1), ..., (Φ(x_n), t_n) and P̂_Φ^CF = (Φ(x_1), 1 − t_1), ..., (Φ(x_n), 1 − t_n). We assume that X is a metric space with metric d, and that the potential outcome functions Y_0(x) and Y_1(x) are Lipschitz continuous with constants K_0 and K_1 respectively, such that d(x_a, x_b) ≤ c ⟹ |Y_t(x_a) − Y_t(x_b)| ≤ K_t · c for t = 0, 1.

Let H_l ⊂ R^{d+1} be the space of linear functions β : X × {0, 1} → Y, and for β ∈ H_l, let L_P(β) = E_{(x,t,y)∼P}[L(β(x, t), y)] be the expected loss of β over distribution P. Let r = max(E_{(x,t)∼P^F}[‖[Φ(x), t]‖_2], E_{(x,t)∼P^CF}[‖[Φ(x), t]‖_2]) be the maximum expected radius of the distributions. For λ > 0, let β̂^F(Φ) = argmin_{β ∈ H_l} L_{P̂_Φ^F}(β) + λ‖β‖²_2, and β̂^CF(Φ) similarly for P̂_Φ^CF, i.e. β̂^F(Φ) and β̂^CF(Φ) are the ridge regression solutions for the factual and counterfactual empirical distributions, respectively.

Let ŷ_i^F(Φ, h) = h^⊤[Φ(x_i), t_i] and ŷ_i^CF(Φ, h) = h^⊤[Φ(x_i), 1 − t_i] be the outputs of the hypothesis h ∈ H_l over the representation Φ(x_i) for the factual and counterfactual settings of t_i, respectively. Finally, for each i, j ∈ {1...n}, let d_{i,j} ≡ d(x_i, x_j) and j(i) ∈ argmin_{j ∈ {1...n} s.t. t_j = 1−t_i} d(x_j, x_i) be the nearest neighbor in X of x_i among the group that received the opposite treatment from unit i. Then for both Q = P^F and Q = P^CF we have:

$$\frac{\lambda}{\mu r}\left(\mathcal{L}_Q(\hat{\beta}^F(\Phi)) - \mathcal{L}_Q(\hat{\beta}^{CF}(\Phi))\right)$$
$$\leq \mathrm{disc}_{\mathcal{H}_l}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) \qquad (3)$$
$$\quad + \min_{h \in \mathcal{H}_l} \frac{1}{n}\sum_{i=1}^n \left( |\hat{y}_i^F(\Phi, h) - y_i^F| + |\hat{y}_i^{CF}(\Phi, h) - y_i^{CF}| \right) \qquad (4)$$
$$\leq \mathrm{disc}_{\mathcal{H}_l}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) + \min_{h \in \mathcal{H}_l} \frac{1}{n}\sum_{i=1}^n \left( |\hat{y}_i^F(\Phi, h) - y_i^F| + |\hat{y}_i^{CF}(\Phi, h) - y_{j(i)}^F| \right) \qquad (5)$$
$$\quad + \frac{K_0}{n}\sum_{i: t_i = 1} d_{i,j(i)} + \frac{K_1}{n}\sum_{i: t_i = 0} d_{i,j(i)}. \qquad (6)$$

The proof is in the supplemental material.

Theorem 1 gives, for all fixed representations Φ, a bound on the relative error for a ridge regression model fit on the factual outcomes and evaluated on the counterfactual, as compared with ridge regression had it been fit on the unobserved counterfactual outcomes. It does not take into account how Φ is obtained, and applies even if h(Φ(x), t) is not convex in x, e.g. if Φ is a neural net. Since the bound in the theorem is true for all representations Φ, we can attempt to minimize it over Φ, as done in Algorithm 1.

The term on line (4) of the bound includes the unknown counterfactual outcomes y_i^CF. It measures how well we could, in principle, fit the factual and counterfactual outcomes together using a linear hypothesis over the representation Φ. For example, if the dimension of the representation is greater than the number of samples, and in addition there exist constants b and ε such that |y_i^F − y_i^CF − b| ≤ ε, then this term is upper bounded by ε. In general, however, we cannot directly control its magnitude.

The term on line (3) measures the discrepancy between the factual and counterfactual distributions over the representation Φ.
In Section 4.1 below, we show that this term is closely related to the norm of the difference in means between the representation of the control group and the treated group. A representation for which the means of the treated and control are close (small value of (3)), but which at the same time allows for a good prediction of the factuals and counterfactuals (small value of (4)), is guaranteed to yield structural risk minimizers with similar generalization errors between factual and counterfactual.

We further show that the term on line (4), which cannot be evaluated since we do not know y_i^CF, can be upper bounded by a sum of the terms on lines (5) and (6). The term (5) includes two empirical data fitting terms: |ŷ_i^F(Φ, h) − y_i^F| and |ŷ_i^CF(Φ, h) − y_{j(i)}^F|. The first is simply fitting the observed factual outcomes using a linear function over the representation Φ. The second term is a form of nearest-neighbor regression, where the counterfactual outcomes for a treated (resp. control) instance are fit to the most similar factual outcome among the control (resp. treated) set, where similarity is measured in the original space X. Finally, the term on line (6) is the only quantity which is independent of the representation Φ. It measures the average distance between each treated instance and the nearest control, and vice versa, scaled by the Lipschitz constants of the true treated and control outcome functions. This term will be small when: (a) the true outcome functions Y_0(x) and Y_1(x) are relatively smooth, and (b) there is overlap between the treated and control groups, leading to small average nearest neighbor distance across the groups. It is well known that when there is not much overlap between treated and control, causal inference in general is more difficult, since the extrapolation from treated to control and vice versa is more extreme (Rosenbaum, 2009).

The upper bound in Theorem 1 suggests the following approach for counterfactual regression. First minimize the terms (3) and (5) as functions of the representation Φ. Once Φ is obtained, perform a ridge regression on the factual outcomes using the representations Φ(x) and the treatment assignments as input. The terms in the bound ensure that Φ would have a good fit for the data (term (5)), while removing aspects of the treated and control which create a large discrepancy (term (3)). For example, if there is a feature which is much more strongly associated with the treatment assignment than with the outcome, it might be advisable not to use it (Pearl, 2011).

4.1. The discrepancy for linear hypotheses

A straightforward calculation shows that for a class H_l of linear hypotheses,

$$\mathrm{disc}_{\mathcal{H}_l}(P, Q) = \left\| \mu_2(P) - \mu_2(Q) \right\|_2,$$

where ‖A‖_2 is the spectral norm of A and µ_2(P) = E_{x∼P}[xx^⊤] is the second-order moment of x ∼ P. In the special case of counterfactual inference, P and Q differ only in the treatment assignment. Specifically,

$$\mathrm{disc}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) = \left\| \begin{bmatrix} 0_{d,d} & v \\ v^\top & 2p - 1 \end{bmatrix} \right\|_2 \qquad (7)$$
$$= p - \frac{1}{2} + \sqrt{\frac{(2p - 1)^2}{4} + \|v\|_2^2}, \qquad (8)$$

where v = E_{(x,t)∼P̂_Φ^F}[Φ(x) · t] − E_{(x,t)∼P̂_Φ^F}[Φ(x) · (1 − t)] and p = E[t].

Let µ_1(Φ) = E_{(x,t)∼P̂_Φ^F}[Φ(x) | t = 1] and µ_0(Φ) = E_{(x,t)∼P̂_Φ^F}[Φ(x) | t = 0] be the treated and control means in Φ space. Then v = p · µ_1(Φ) − (1 − p) · µ_0(Φ), exactly the difference in means between the treated and control groups, weighted by their respective sizes. As a consequence, minimizing the discrepancy with linear hypotheses constitutes matching means in feature space.
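In code, the closed form of Eq. (8) amounts to a few lines. The sketch below is our illustration under the definitions above; names are ours.

```python
import numpy as np

def linear_discrepancy(phi, t):
    # Closed form of Eq. (8) for linear hypotheses over [Phi(x), t].
    # phi: (n, d) array of representations; t: (n,) binary treatments.
    p = t.mean()
    v = (phi * t[:, None]).mean(0) - (phi * (1 - t)[:, None]).mean(0)
    return p - 0.5 + np.sqrt((2 * p - 1) ** 2 / 4.0 + v @ v)
```

For the re-weighted representation of Section 3.1, calling this with phi = X @ np.diag(w) recovers disc_{H_l}(XW), since v = W(p µ_1 − (1 − p) µ_0) in that case.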
5. Related work
Counterfactual inference for determining causal effects in observational studies has been studied extensively in statistics, economics, epidemiology and sociology (Morgan & Winship, 2014; Robins et al., 2000; Rubin, 2011; Chernozhukov et al., 2013), as well as in machine learning (Langford et al., 2011; Bottou et al., 2013; Swaminathan & Joachims, 2015).

Non-parametric methods do not attempt to model the relation between the context, intervention, and outcome. These methods include nearest-neighbor matching, propensity score matching, and propensity score re-weighting (Rosenbaum & Rubin, 1983; Rosenbaum, 2002; Austin, 2011). Parametric methods, on the other hand, attempt to concretely model the relation between the context, intervention, and outcome. These methods include any type of regression, including linear and logistic regression (Prentice, 1976; Gelman & Hill, 2006), random forests (Wager & Athey, 2015) and regression trees (Chipman et al., 2010).

Doubly robust methods combine aspects of parametric and non-parametric methods, typically by using a propensity score weighted regression (Bang & Robins, 2005; Dudík et al., 2011). They are especially of use when the treatment assignment probability is known, as is the case for off-policy evaluation or learning from logged bandit data. Once the treatment assignment probability has to be estimated, as is the case in most observational studies, their efficacy might wane considerably (Kang & Schafer, 2007). Tian et al. (2014) presented one of the few methods that achieve balance by transforming or selecting covariates, modeling interactions between treatment and covariates.
6. Experiments
We evaluate the two variants of our algorithm proposed in Section 3 with focus on two questions: 1) What is the effect of imposing imbalance regularization on representations? 2) How do our methods fare against established methods for counterfactual inference? We refer to the variable selection method of Section 3.1 as Balancing Linear Regression (BLR) and the neural network approach as BNN, for Balancing Neural Network.

We report the RMSE of the estimated individual treatment effect, denoted ε_ITE, and the absolute error in the estimated average treatment effect, denoted ε_ATE, see Section 2. Further, following Hill (2011), we report the Precision in Estimation of Heterogeneous Effect (PEHE),

$$\mathrm{PEHE} = \sqrt{\frac{1}{n}\sum_{i=1}^n \left( \hat{y}_1(x_i) - \hat{y}_0(x_i) - \left( Y_1(x_i) - Y_0(x_i) \right) \right)^2}.$$

Unlike for ITE, obtaining a good (small) PEHE requires accurate estimation of both the factual and counterfactual responses, not just the counterfactual. Standard methods for hyperparameter selection, including cross-validation, are unavailable when training counterfactual models on real-world data, as there are no samples from the counterfactual outcome. In our experiments, all outcomes are simulated, and we have access to counterfactual samples. To avoid fitting parameters to the test set, we generate multiple repeated experiments, each with a different outcome function, and pick hyperparameters once, for all models (and baselines), based on a held-out set of experiments. While not possible for real-world data, this approach gives an indication of the robustness of the parameters.
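As a concrete sketch of the three metrics (ours, not the paper's evaluation code), assuming simulated outcomes so that the true ITE is available; argument names are illustrative:

```python
import numpy as np

def evaluation_metrics(y_f, t, y_hat_f, y_hat_cf, ite_true):
    # y_f: observed factual outcomes; y_hat_f / y_hat_cf: model predictions
    # for the factual / counterfactual treatment arm; ite_true: Y_1 - Y_0.
    # Transductive ITE estimate of Eq. (1), using the observed factual outcome.
    ite_hat = np.where(t == 1, y_f - y_hat_cf, y_hat_cf - y_f)
    eps_ite = np.sqrt(np.mean((ite_hat - ite_true) ** 2))    # RMSE of the ITE
    eps_ate = np.abs(np.mean(ite_hat) - np.mean(ite_true))   # absolute ATE error
    # PEHE contrasts *predicted* responses for both arms.
    y1_hat = np.where(t == 1, y_hat_f, y_hat_cf)
    y0_hat = np.where(t == 0, y_hat_f, y_hat_cf)
    pehe = np.sqrt(np.mean((y1_hat - y0_hat - ite_true) ** 2))
    return eps_ite, eps_ate, pehe
```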
The neural network architectures used for all experiments consist of fully-connected ReLU layers trained using RMSProp, with a small ℓ₂ weight decay. We evaluate two architectures. BNN-4-0 consists of 4 ReLU representation-only layers and a single linear output layer, d_r = 4, d_o = 0. BNN-2-2 consists of 2 ReLU representation-only layers, 2 ReLU output layers after the treatment has been added, and a single linear output layer, d_r = 2, d_o = 2, see Figure 2. For the IHDP data we use layers of 25 hidden units each. For the News data, representation layers have 400 units and output layers 200 units. The nearest neighbor term, see Section 3, did not improve empirical performance, and was omitted for the BNN models. For the neural network models, the hypothesis and the representation were fit jointly.

We include several different linear models in our comparison, including ordinary linear regression (OLS) and doubly robust linear regression (DR) (Bang & Robins, 2005). We also include a method where variables are first selected using LASSO and then used to fit a ridge regression (LASSO + RIDGE). Regularization parameters are picked based on a held-out sample. For DR, we estimate propensity scores using logistic regression and clip weights at 100. For the News dataset (see below), we perform the logistic regression on the first 100 principal components of the data. Bayesian Additive Regression Trees (BART) (Chipman et al., 2010) is a non-linear regression model which has been used successfully for counterfactual inference in the past (Hill, 2011). We compare our results to BART using the implementation provided in the BayesTree R package (Chipman & McCulloch, 2016). Like Hill (2011), we do not attempt to tune the parameters, but use the defaults. Finally, we include a standard feed-forward neural network with 4 hidden layers, trained to predict the factual outcome based on x and t, without a penalty for imbalance. We refer to this as NN-4.
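For reference, a generic AIPW-style doubly robust ATE estimate in the spirit of Bang & Robins (2005) can be sketched as below. This is our hedged illustration of the general technique, with logistic-regression propensities and weights clipped at 100 as described above; the paper's DR baseline is a doubly robust linear regression and may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def doubly_robust_ate(X, t, y_f, clip=100.0):
    # Propensity scores via logistic regression, inverse weights clipped.
    e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
    w1, w0 = np.minimum(1.0 / e, clip), np.minimum(1.0 / (1.0 - e), clip)
    # Outcome regressions fit separately on treated and control units.
    m1 = LinearRegression().fit(X[t == 1], y_f[t == 1]).predict(X)
    m0 = LinearRegression().fit(X[t == 0], y_f[t == 0]).predict(X)
    # Augmented inverse-propensity-weighted estimates for both arms.
    aipw1 = m1 + t * w1 * (y_f - m1)
    aipw0 = m0 + (1 - t) * w0 * (y_f - m0)
    return np.mean(aipw1 - aipw0)
```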
Hill (2011) introduced a semi-simulated dataset based on the Infant Health and Development Program (IHDP). The IHDP data has covariates from a real randomized experiment, studying the effect of high-quality child care and home visits on future cognitive test scores. The experiment proposed by Hill (2011) uses a simulated outcome and artificially introduces imbalance between treated and control subjects by removing a subset of the treated population. In total, the dataset consists of 747 subjects (139 treated, 608 control), each represented by 25 covariates measuring properties of the child and their mother. For details, see Hill (2011). We run 100 repeated experiments for hyperparameter selection and 1000 for evaluation, all with the log-linear response surface implemented as setting "A" in the NPCI package (Dorie, 2016).

Figure 3. Visualization of one of the News sets (left). Each dot represents a single news item x. The radius represents the outcome y(x), and the color the treatment t. The two black dots represent the two centroids. Histogram of ITE in News (right).

We introduce a new dataset, simulating the opinions of a media consumer exposed to multiple news items. Each item is consumed either on a mobile device or on desktop. The units are different news items represented by word counts x_i ∈ N^V, and the outcome y^F(x_i) ∈ R is the reader's experience of x_i. The intervention t ∈ {0, 1} represents the viewing device, desktop (t = 0) or mobile (t = 1). We assume that the consumer prefers to read about certain topics on mobile. To model this, we train a topic model on a large set of documents and let z(x) ∈ R^k represent the topic distribution of news item x. We define two centroids in topic space, z_{c1} (mobile) and z_{c2} (desktop), and let the reader's opinion of news item x on device t be determined by the similarity between z(x) and the centroids:

$$y^F(x_i) = C\left( z(x_i)^\top z_{c_2} + t_i \cdot z(x_i)^\top z_{c_1} \right) + \epsilon,$$

where C is a scaling factor and ε is additive Gaussian noise. Here, we let the mobile centroid z_{c1} be the topic distribution of a randomly sampled document, and z_{c2} be the average topic representation of all documents. We further assume that the assignment of a news item x to a device t ∈ {0, 1} is biased towards the device preferred for that item. We model this using the softmax function,

$$p(t = 1 \mid x) = \frac{e^{\kappa \, z(x)^\top z_{c_1}}}{e^{\kappa \, z(x)^\top z_{c_1}} + e^{\kappa \, z(x)^\top z_{c_2}}},$$

where κ ≥ 0 determines the strength of the bias. Note that κ = 0 implies a completely random device assignment. We sample n = 5000 news items and outcomes according to this model, based on 50 LDA topics, trained on documents from the NY Times corpus (downloaded from UCI (Newman, 2008)). The data available to the algorithms are the raw word counts, from a vocabulary of 3477 words, selected as the union of the most probable words in each topic. We set the scaling parameters to C = 50 and κ = 10, and sample 50 realizations for evaluation. Figure 3 shows a visualization of the outcome and device assignments for a sample of 500 documents. Note that the device assignment becomes increasingly random, and the outcome lower, further away from the centroids.
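The News generative model can be sketched in a few lines. This is our illustration under the definitions above; the noise scale (standard normal) is an assumption, as the extracted text does not specify it, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_news(Z, z_mobile, z_desktop, C=50.0, kappa=10.0):
    # Z: (n, k) topic distributions z(x_i); z_mobile / z_desktop: centroids.
    s_mobile, s_desktop = Z @ z_mobile, Z @ z_desktop  # similarities z(x)^T z_c
    # Device assignment biased towards the preferred device (softmax).
    p_mobile = np.exp(kappa * s_mobile) / (np.exp(kappa * s_mobile)
                                           + np.exp(kappa * s_desktop))
    t = rng.binomial(1, p_mobile)
    # Outcome: desktop similarity plus a mobile-similarity bonus when t = 1,
    # with additive Gaussian noise (scale assumed).
    y_f = C * (s_desktop + t * s_mobile) + rng.normal(size=len(Z))
    return t, y_f
```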
Table 1. IHDP. Results and standard errors for 1000 repeated experiments. (Lower is better.) Proposed methods: BLR, BNN-4-0 and BNN-2-2. † (Chipman et al., 2010). Columns: ε_ITE, ε_ATE, PEHE. Methods with a linear outcome: OLS, Doubly Robust, LASSO + RIDGE, BLR, BNN-4-0. Methods with a non-linear outcome: NN-4, BART†, BNN-2-2.
Table 2. News. Results and standard errors for 50 repeated experiments. (Lower is better.) Proposed methods: BLR, BNN-4-0 and BNN-2-2. † (Chipman et al., 2010). Columns: ε_ITE, ε_ATE, PEHE. Methods with a linear outcome: OLS, Doubly Robust, LASSO + RIDGE, BLR, BNN-4-0. Methods with a non-linear outcome: NN-4, BART†, BNN-2-2.

The results of the IHDP and News experiments are presented in Table 1 and Table 2 respectively. We see that, in general, the non-linear methods perform better in terms of individual prediction (ITE, PEHE). Further, we see that our proposed balancing neural network BNN-2-2 performs the best on both datasets in terms of estimating the ITE and PEHE, and is competitive on the average treatment effect, ATE. Particularly noteworthy is the comparison with the network without a balance penalty, NN-4. These results indicate that our proposed regularization can help avoid overfitting the representation to the factual outcome. Figure 4 plots the performance of BNN-2-2 for various imbalance penalties α. The valley in the region α = 1, and the fact that we do not experience a loss in performance for smaller values of α, show that penalizing imbalance in the representation Φ has the desired effect.

For the linear methods, we see that the two variable selection approaches, our proposed BLR method and LASSO + RIDGE, work the best in terms of estimating ITE.
Figure 4. Error in estimated treatment effect (ITE, PEHE) and counterfactual response (RMSE) on the IHDP dataset. Sweep over the imbalance penalty α (log scale) for the BNN-2-2 neural network model. Left panel: PEHE, ITE, and ITE (BART); right panel: factual RMSE, counterfactual RMSE, and counterfactual RMSE (BART).

We would like to emphasize that LASSO + RIDGE is a very strong baseline and it is exciting that our theory-guided method is competitive with this approach. On News, BLR and LASSO + RIDGE perform equally well yet again, although this time with qualitatively different results, as they do not select the same variables. Interestingly, BNN-4-0, BLR and LASSO + RIDGE all perform better on News than the standard neural network, NN-4. The performance of BART on News is likely hurt by the dimensionality of the dataset, and could improve with hyperparameter tuning.
7. Conclusion
As machine learning becomes a major tool for researchers and policy makers across different fields such as healthcare and economics, causal inference becomes a crucial issue for the practice of machine learning. In this paper we focus on counterfactual inference, which is a widely applicable special case of causal inference. We cast counterfactual inference as a type of domain adaptation problem, and derive a novel way of learning representations suited for this problem.

Our models rely on a novel type of regularization criterion: learning balanced representations, that is, representations which have similar distributions among the treated and untreated populations. We show that trading off a balancing criterion with standard data fitting and regularization terms is both practically and theoretically prudent.

Open questions which remain are how to generalize this method to cases where more than one treatment is in question, deriving better optimization algorithms, and using richer discrepancy measures.
Acknowledgements
DS and US were supported by an NSF CAREER award.
References
Austin, Peter C. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.

Bang, Heejung and Robins, James M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Pereira, Fernando, et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.

Bengio, Yoshua, Courville, Aaron, and Vincent, Pierre. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.

Beygelzimer, Alina, Langford, John, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandit algorithms with supervised learning guarantees. arXiv preprint arXiv:1002.4058, 2010.

Bottou, Léon, Peters, Jonas, Quinonero-Candela, Joaquin, Charles, Denis X, Chickering, D Max, Portugaly, Elon, Ray, Dipankar, Simard, Patrice, and Snelson, Ed. Counterfactual reasoning and learning systems: The example of computational advertising. The Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Chernozhukov, Victor, Fernández-Val, Iván, and Melly, Blaise. Inference on counterfactual distributions. Econometrica, 81(6):2205–2268, 2013.

Chipman, Hugh and McCulloch, Robert. BayesTree: Bayesian additive regression trees. https://cran.r-project.org/package=BayesTree/, 2016. Accessed: 2016-01-30.

Chipman, Hugh A, George, Edward I, and McCulloch, Robert E. BART: Bayesian additive regression trees. The Annals of Applied Statistics, pp. 266–298, 2010.

Cortes, Corinna and Mohri, Mehryar. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Daumé III, Hal and Marcu, Daniel. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, pp. 101–126, 2006.

Dorie, Vincent. NPCI: Non-parametrics for causal inference. https://github.com/vdorie/npci, 2016. Accessed: 2016-01-30.

Dudík, Miroslav, Langford, John, and Li, Lihong. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

Ganin, Yaroslav, Ustinova, Evgeniya, Ajakan, Hana, Germain, Pascal, Larochelle, Hugo, Laviolette, François, Marchand, Mario, and Lempitsky, Victor. Domain-adversarial training of neural networks. arXiv preprint arXiv:1505.07818, 2015.

Gelman, Andrew and Hill, Jennifer. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.

Gretton, Arthur, Borgwardt, Karsten M., Rasch, Malte J., Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012. ISSN 1532-4435.

Hill, Jennifer L. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 2011.

Jiang, Jing. A literature survey on domain adaptation of statistical classifiers. Technical report, University of Illinois at Urbana-Champaign, 2008.

Kang, Joseph DY and Schafer, Joseph L. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, pp. 523–539, 2007.

Langford, John, Li, Lihong, and Dudík, Miroslav. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1097–1104, 2011.

Lewis, David. Causation. The Journal of Philosophy, pp. 556–567, 1973.

Louizos, Christos, Swersky, Kevin, Li, Yujia, Welling, Max, and Zemel, Richard. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.

Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

Morgan, Stephen L and Winship, Christopher. Counterfactuals and Causal Inference. Cambridge University Press, 2014.

Newman, David. Bag of words data set. https://archive.ics.uci.edu/ml/datasets/Bag+of+Words, 2008.

Pearl, Judea. Causality. Cambridge University Press, 2009.

Pearl, Judea. Invited commentary: understanding bias amplification. American Journal of Epidemiology, 174(11):1223–1227, 2011.

Prentice, Ross. Use of the logistic model in retrospective studies. Biometrics, pp. 599–606, 1976.

Robins, James M, Hernán, Miguel Ángel, and Brumback, Babette. Marginal structural models and causal inference in epidemiology. Epidemiology, pp. 550–560, 2000.

Rosenbaum, Paul R. Observational Studies. Springer, 2002.

Rosenbaum, Paul R. Design of Observational Studies. Springer Science & Business Media, 2009.

Rosenbaum, Paul R and Rubin, Donald B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Rubin, Donald B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.

Rubin, Donald B. Causal inference using potential outcomes. Journal of the American Statistical Association, 2011.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pp. 1255–1262, New York, NY, USA, 2012. Omnipress.

Strehl, Alex, Langford, John, Li, Lihong, and Kakade, Sham M. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pp. 2217–2225, 2010.

Sutton, Richard S and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

Swaminathan, Adith and Joachims, Thorsten. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.

Tian, Lu, Alizadeh, Ash A, Gentles, Andrew J, and Tibshirani, Robert. A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508):1517–1532, 2014.

van der Laan, Mark J and Petersen, Maya L. Causal effect models for realistic individualized treatment and intention to treat rules. The International Journal of Biostatistics, 3(1), 2007.

Wager, Stefan and Athey, Susan. Estimation and inference of heterogeneous treatment effects using random forests. arXiv preprint arXiv:1510.04342, 2015.

Weiss, Jeremy C, Kuusisto, Finn, Boyd, Kendrick, Lui, Jie, and Page, David C. Machine learning for treatment assignment: Improving individualized risk attribution. American Medical Informatics Association Annual Symposium, 2015.

Zemel, Rich, Wu, Yu, Swersky, Kevin, Pitassi, Toni, and Dwork, Cynthia. Learning fair representations. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 325–333, 2013.
A. Proof of Theorem 1
We use a result implicit in the proof of Theorem 2 of Cortes & Mohri (2014), for the case where H is the set of linear hypotheses over a fixed representation Φ. Cortes & Mohri (2014) state their result for the case of domain adaptation: in our case, the factual distribution is the so-called "source domain", and the counterfactual distribution is the "target domain".

Theorem A1 (Cortes & Mohri, 2014). Using the notation and assumptions of Theorem 1, for both Q = P^F and Q = P^CF:

$$\frac{\lambda}{\mu r}\left(\mathcal{L}_Q(\hat{\beta}^F(\Phi)) - \mathcal{L}_Q(\hat{\beta}^{CF}(\Phi))\right) \leq \mathrm{disc}_{\mathcal{H}_l}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) + \min_{h \in \mathcal{H}_l} \frac{1}{n}\sum_{i=1}^n \left( |\hat{y}_i^F(\Phi, h) - y_i^F| + |\hat{y}_i^{CF}(\Phi, h) - y_i^{CF}| \right) \qquad (9)$$

In their work, Cortes & Mohri (2014) assume that H is a reproducing kernel Hilbert space (RKHS) for a universal kernel, and they do not consider the role of the representation Φ. Since the RKHS hypothesis space they use is much stronger than the linear space H_l, it is often reasonable to assume that the second term in the bound (9) is small. We, however, cannot make this assumption, and therefore we wish to explicitly bound the term min_{h ∈ H_l} (1/n) Σ_{i=1}^n (|ŷ_i^F(Φ, h) − y_i^F| + |ŷ_i^CF(Φ, h) − y_i^CF|), while using the fact that we have control over the representation Φ.
Lemma 1. Let {(x_i, t_i, y_i^F)}_{i=1}^n, x_i ∈ X, t_i ∈ {0, 1} and y_i^F ∈ Y ⊆ R. We assume that X is a metric space with metric d, and that there exist two functions Y_0(x) and Y_1(x) such that y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i), and in addition we define y_i^CF = (1 − t_i) Y_1(x_i) + t_i Y_0(x_i). We further assume that the functions Y_0(x) and Y_1(x) are Lipschitz continuous with constants K_0 and K_1 respectively, such that d(x_a, x_b) ≤ c ⟹ |Y_t(x_a) − Y_t(x_b)| ≤ K_t c. Define j(i) ∈ argmin_{j ∈ {1...n} s.t. t_j = 1−t_i} d(x_j, x_i) to be the nearest neighbor of x_i among the group that received the opposite treatment from unit i, for all i ∈ {1...n}. Let d_{i,j} = d(x_i, x_j). For any b ∈ Y and h ∈ H:

$$|b - y_i^{CF}| \leq |b - y_{j(i)}^F| + K_{1 - t_i} \, d_{i,j(i)}.$$

Proof. By the triangle inequality, we have:

$$|b - y_i^{CF}| \leq |b - y_{j(i)}^F| + |y_{j(i)}^F - y_i^{CF}|.$$

By the Lipschitz assumption on Y_{1−t_i}, and since d(x_i, x_{j(i)}) ≤ d_{i,j(i)}, we obtain that

$$|y_{j(i)}^F - y_i^{CF}| = |Y_{1 - t_i}(x_{j(i)}) - Y_{1 - t_i}(x_i)| \leq d_{i,j(i)} K_{1 - t_i}.$$

By definition, y_i^CF = Y_{1−t_i}(x_i). In addition, by definition of j(i), we have t_{j(i)} = 1 − t_i, and therefore y_{j(i)}^F = Y_{1−t_i}(x_{j(i)}), proving the equality. The inequality is an immediate consequence of the Lipschitz property. ∎

We restate Theorem 1 and prove it.
Theorem 1. For a sample {(x_i, t_i, y_i^F)}_{i=1}^n, x_i ∈ X, t_i ∈ {0, 1} and y_i ∈ Y, recall that y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i), and in addition define y_i^CF = (1 − t_i) Y_1(x_i) + t_i Y_0(x_i). For a given representation function Φ : X → R^d, let P̂_Φ^F = (Φ(x_1), t_1), ..., (Φ(x_n), t_n) and P̂_Φ^CF = (Φ(x_1), 1 − t_1), ..., (Φ(x_n), 1 − t_n). We assume that X is a metric space with metric d, and that the potential outcome functions Y_0(x) and Y_1(x) are Lipschitz continuous with constants K_0 and K_1 respectively, such that d(x_a, x_b) ≤ c ⟹ |Y_t(x_a) − Y_t(x_b)| ≤ K_t c.

Let H_l ⊂ R^{d+1} be the space of linear functions, and for β ∈ H_l, let L_P(β) = E_{(x,t,y)∼P}[L(β(x, t), y)] be the expected loss of β over distribution P. Let r = max(E_{(x,t)∼P^F}[‖[Φ(x), t]‖_2], E_{(x,t)∼P^CF}[‖[Φ(x), t]‖_2]). For λ > 0, let β̂^F(Φ) = argmin_{β ∈ H_l} L_{P̂_Φ^F}(β) + λ‖β‖²_2, and β̂^CF(Φ) similarly for P̂_Φ^CF, i.e. β̂^F(Φ) and β̂^CF(Φ) are the ridge regression solutions for the factual and counterfactual empirical distributions, respectively.

Let ŷ_i^F(Φ, h) = h^⊤[Φ(x_i), t_i] and ŷ_i^CF(Φ, h) = h^⊤[Φ(x_i), 1 − t_i] be the outputs of the hypothesis h ∈ H_l over the representation Φ(x_i) for the factual and counterfactual settings of t_i, respectively. Finally, for each i ∈ {1...n}, let j(i) ∈ argmin_{j ∈ {1...n} s.t. t_j = 1−t_i} d(x_j, x_i) be the nearest neighbor of x_i among the group that received the opposite treatment from unit i. Let d_{i,j} = d(x_i, x_j). Then for both Q = P^F and Q = P^CF we have:

$$\frac{\lambda}{\mu r}\left(\mathcal{L}_Q(\hat{\beta}^F(\Phi)) - \mathcal{L}_Q(\hat{\beta}^{CF}(\Phi))\right) \leq \mathrm{disc}_{\mathcal{H}_l}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) + \min_{h \in \mathcal{H}_l} \frac{1}{n}\sum_{i=1}^n \left( |\hat{y}_i^F(\Phi, h) - y_i^F| + |\hat{y}_i^{CF}(\Phi, h) - y_i^{CF}| \right) \qquad (10)$$
$$\leq \mathrm{disc}_{\mathcal{H}_l}\!\left(\hat{P}_\Phi^F, \hat{P}_\Phi^{CF}\right) + \min_{h \in \mathcal{H}_l} \frac{1}{n}\sum_{i=1}^n \left( |\hat{y}_i^F(\Phi, h) - y_i^F| + |\hat{y}_i^{CF}(\Phi, h) - y_{j(i)}^F| \right) + \frac{K_0}{n}\sum_{i: t_i = 1} d_{i,j(i)} + \frac{K_1}{n}\sum_{i: t_i = 0} d_{i,j(i)}. \qquad (11)$$

Proof. Inequality (10) is immediate by Theorem A1. In order to prove inequality (11), we apply Lemma 1, setting b = ŷ_i^CF(Φ, h), and sum over the indices i. ∎