Causal Estimation with Functional Confounders
Aahlad Puli, Adler J. Perotte, Rajesh Ranganath
[email protected], [email protected], [email protected]
Computer Science, New York University, New York, NY 10011; Biomedical Informatics, Columbia University, New York, NY 10032; Center for Data Science, New York University, New York, NY 10011
Abstract
Causal inference relies on two fundamental assumptions: ignorability and positivity. We study causal inference when the true confounder value can be expressed as a function of the observed data; we call this setting estimation with functional confounders (EFC). In this setting ignorability is satisfied, but positivity is violated, and causal inference is impossible in general. We consider two scenarios where causal effects are estimable. First, we discuss interventions on a part of the treatment, called functional interventions, and a sufficient condition for effect estimation of these interventions, called functional positivity. Second, we develop conditions for nonparametric effect estimation based on the gradient fields of the functional confounder and the true outcome function. To estimate effects under these conditions, we develop Level-set Orthogonal Descent Estimation (LODE). Further, we prove error bounds on LODE's effect estimates, evaluate our methods on simulated and real data, and empirically demonstrate the value of
EFC.

Determining the effect of interventions on outcomes using observational data lies at the core of many fields like medicine, economic policy, and genomics. For example, policy makers estimate effects to decide whether to invest in education or job training programs. In medicine, doctors use effects to design optimal treatment strategies for patients. Geneticists perform genome-wide association studies (GWAS) to relate genotypes and phenotypes. In observational data, there could exist unobserved variables that affect both the intervention and the outcome, called confounders. A necessary condition for the causal effect to be identified is that all confounders are observed; this condition is called ignorability. If ignorability holds, a sufficient condition for causal effect estimation is adequate variation in the intervention after conditioning on the confounders; this condition is called positivity.

The data a priori does not differentiate between confounders and interventions. It is the practitioners who select interventions of interest from all pre-outcome variables (variables that occur before the outcome). Then, assuming knowledge of the data generating mechanism, practitioners can label certain variables amongst the remaining pre-outcome variables as confounders. This corresponds to indexing into the set of pre-outcome variables.

In certain problems the confounders are specified as a function of the pre-outcome variables that does not simply index into the set of pre-outcome variables. For a concrete example, consider GWAS. The goal in GWAS is to estimate the influence of genetic variations on phenotypes like disease risk. In GWAS, population and family structures both result in certain genetic variations and affect phenotypes and, therefore, are confounders [4]. Practitioners specify these confounders by using the genetic similarity between individuals [15, 19, 31], which is a function of the genetic variations. When the confounders are a function of the same pre-outcome variables that define the interventions, positivity is violated. Then, the class of interventions whose effects are estimable is not well-defined.

We study causal effect estimation in such settings, where a function of the pre-outcome variables provides the confounder and these same pre-outcome variables define the intervention. We call this estimation with functional confounders (EFC). In EFC, one column in the observed data is the outcome and all others are pre-outcome variables. We assume access to a function h(·) that takes as input the pre-outcome variables and returns the value of the confounder. Further, we assume these confounders give us ignorability. In settings like GWAS, the function h reflects the practitioner-specified function that captures the genetic variation influenced by the population structure. In traditional observational causal inference (OBS-CI), h(·) reflects the selection of certain variables in the data and labelling them as confounders. In EFC, two different values of the confounder are never observed for the same setting of the pre-outcome variables. This means that positivity is violated and the effects of only certain interventions may be estimable.

We address this issue in two ways. First, we investigate a class of plausible interventions that are functions of the observed pre-outcome variables, called functional interventions. We develop a sufficient condition to estimate the effects of said functional interventions, called functional positivity (F-POSITIVITY). Second, we consider intervening on all pre-outcome variables, called the full intervention. We develop a sufficient condition to estimate the effect of the full intervention, called causal redundancy (C-REDUNDANCY). For an intervention, given a confounder value, C-REDUNDANCY allows us to compute a surrogate intervention such that the conditional effect of the surrogate is equal to that of the original intervention. We also show that such surrogate interventions exist only under a certain condition that we call Effect Connectivity, which is necessary for nonparametric effect estimation in EFC. This condition is satisfied by default in traditional OBS-CI if ignorability and positivity hold. Then, we develop an algorithm for causal estimation assuming C-REDUNDANCY, called Level-set Orthogonal Descent Estimation (LODE), which estimates effects using surrogate interventions. If the surrogate is not estimated well, LODE's estimates are biased. We establish bounds on this bias that capture the mitigating effect of the smoothness of the true outcome function.
Related work The problem of genome-wide association studies (GWAS) is to estimate the effect of genetic variations (also called single nucleotide polymorphisms (SNPs)) on the phenotype [29]. The ancestry of the subjects acts as a confounder in GWAS. In GWAS practice, principal component analysis (PCA) and linear mixed models (LMMs) are used to compute this confounding structure [19, 31]. Lippert et al. [15] suggest estimating the confounders and effects on separate subsets of the SNPs. This separation disregards the confounding that is captured in the interaction of the two subsets of SNPs. GWAS is a special case of effects from multiple treatments (MTE) where the confounder value is specified via optimization as a function of the pre-outcome variables [20, 30]. In all these settings, positivity is violated and not all effects are estimable. We provide an avenue for nonparametric effect estimation of the full intervention under a new condition, C-REDUNDANCY.

Traditional observational causal inference (
OBS-CI) review We set up causal inference with Structural Causal Models [17] and use do(t = t*) to denote making an intervention. Let t be a vector of the interventions, z be the confounder, and y be the outcome. Let η ∼ p(η), with η ⊥ (z, t), be noise. With f as the outcome function, we define the causal model for traditional OBS-CI as:

z ∼ p(z),  t ∼ p(t | z),  y = f(t, z, η).

Let p(y, z, t) denote the joint distribution implied by this data generating process. The effects of interest under the full intervention do(t = t*) are the average and conditional effect:

(average) τ(t*) = E_{z,η}[f(t*, z, η)],   (conditional) φ(t*, z) = E_η[f(t*, z, η)].  (1)

With observed confounders, two assumptions make causal estimation possible: ignorability and positivity. Ignorability means that all confounders z are observed in the data. Conditioning on all the confounders, the outcome under an intervention is distributed as if conditional on the value of the intervention:

p(y = y | do(t = t*), z = z) = p(f(t*, z, η) = y) = p(y = y | t = t*, z = z).

This allows the expression of the average effect as an expectation over the observed outcomes: τ(t*) = E_{z,η}[f(t*, z, η)] = E_z E[y | z, t*]. This conditional expectation only exists for all t* if p(y | z, t = t*) = p(y, z, t = t*) / (p(z) p(t = t* | z)) exists. Positivity guarantees this existence:

(positivity) ∀ t* ∈ supp(t),  p(z = z) > 0 ⟹ p(t = t* | z = z) > 0.  (2)

We focus on the f that generates y from t, z; SCMs generally also specify the function that generates t from z.
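The adjustment formula τ(t*) = E_z E[y | z, t*] can be sketched with an outcome regression averaged over the observed confounders. A minimal sketch on simulated data; the linear model and all constants here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulated OBS-CI data where z confounds t and y.
# True causal model (illustrative): y = 1.0*t + 2.0*z + noise.
z = rng.normal(size=n)
t = z + rng.normal(size=n)
y = 1.0 * t + 2.0 * z + rng.normal(scale=0.1, size=n)

# Outcome regression E[y | t, z]: a linear model fit by least squares.
X = np.column_stack([t, z, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def tau_hat(t_star):
    """Adjustment estimate: average the fitted E[y | t*, z] over the marginal p(z)."""
    return np.mean(coef[0] * t_star + coef[1] * z + coef[2])

# Average effect of do(t = 1) vs do(t = 0); the truth here is the t-coefficient, 1.0.
# A naive regression of y on t alone would instead recover roughly 2.0.
effect = tau_hat(1.0) - tau_hat(0.0)
print(round(effect, 2))
```

The naive estimate is biased because z moves t and y together; averaging the fitted conditional expectation over p(z) removes that bias exactly when ignorability and positivity hold.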
Figure 1: Causal graphs for traditional OBS-CI vs. EFC. (a) Traditional OBS-CI, with confounder z and treatment t both parents of y. (b) EFC, where the confounder is h(t). (c) Intervening in EFC.

In traditional
OBS-CI, causal estimation relied on knowing the confounders. In this section, we consider settings where confounders are known via a function of the pre-outcome variables, h(t) = z. We call this setting estimation with functional confounders (EFC). An example of this is GWAS, where SNPs (the pre-outcome variables) are used to estimate the confounding population structure through methods like PCA [31]. Assuming the confounders are a function of the pre-outcome variables violates positivity in general: for t₁, t₂ ∈ supp(t) such that h(t₁) ≠ h(t₂),

p(z = h(t₁) | t = t₂) = 0 ≠ p(z = h(t₁)) > 0.

A positivity violation precludes nonparametric effect estimation of the full intervention do(t = t*).

Positivity and Regression Identifiability Positivity can be viewed as providing identifiability. To see this, let the confounder be z = h(t) and the outcome be y(t, z, η) = z + h(t). Now consider regressing z and t onto y. Then every function y_{α,β} = αz + βh(t), indexed by α, β such that α + β = 2, agrees with y on the observed (t, z), meaning that the regression is not identifiable. Assuming positivity necessitates sufficient randomness to identify the regression and thus the causal effect. A violation of positivity means that nonparametric estimation of causal effects needs further assumptions.

EFC In EFC, the confounder is provided as a non-bijective function h of the pre-outcome variables t. To reflect this property, we use h(t) to denote the confounder. As an illustrative example, let G be the Gamma distribution and consider z ∈ {−1, 1} with p(z = 1) = 1/2, and t = z * G(1, exp(z)). Note that sign(t) = z, meaning that h(t) = sign(t) is the confounder. Figure 1 shows causal graphs connecting our EFC notation to that in traditional
OBS-CI. With noise η ∼ p(η), η ⊥ t, our causal model samples, in order, the confounder "part" of the pre-outcome variables h(t), the pre-outcome variables t, and the outcome y via the outcome function f:

h(t) ∼ p(h(t)),  t ∼ p(t | h(t)),  y = f(t, h(t), η).

Similar to traditional OBS-CI, for an intervention t₁* the average effect, τ(·), and the conditional effect, φ(·, ·) at h(t₂*), respectively, are defined as:

τ(t*) = E_{h(t),η}[f(t*, h(t), η)],   φ(t₁*, h(t₂*)) = E_η[f(t₁*, h(t₂*), η)].  (3)

As the pre-outcome variables determine the confounder, positivity is violated. Further, the outcome function f(t, h(t), η) could recover the exact value of h(t) from t instead of using its second argument. Thus, two different outcome functions could lead to the same observational data distribution, posing a fundamental obstacle to causal effect estimation. This is the central challenge in EFC.

Without positivity, we can only estimate the effects of certain functions of t. We call such interventions, on some function g(t), functional interventions. The implied causal model for the outcome for functional intervention value g(t*) and confounder value h(t*) is first t ∼ p(t | g(t) = g(t*), h(t) = h(t*)) and then y = f(t, h(t*), η). Then, the functional average effect is

(average) τ(g(t*)) = E_{h(t),η} E_{t | g(t) = g(t*), h(t)} [f(t, h(t), η)].

An example of a functional intervention is intervening on the cumulative dosage of a drug. In contrast, traditional interventions would set each individual dose given at different points in time. We also assume no interference [10] (also called the Stable Unit Treatment Value Assumption [24]), which means that an individual's outcome does not depend on others' treatments. In EFC, when the pairs (t, η) are sampled IID there is no interference. To see this, note that for all i ≠ j, (tᵢ, ηᵢ) ⊥ (tⱼ, ηⱼ) ⟹ (yᵢ, tᵢ) ⊥ (yⱼ, tⱼ) ⟹ yᵢ ⊥ tⱼ. Intervening on g(t) can be interpreted as making a soft intervention [9, 7] that sets t to p(t | z, g(t) = g(t̃)).

F-POSITIVITY and Functional Effect Estimation
For the causal model above to be well-defined for all functional interventions g(t*), the conditional p(t | g(t) = g(t*), h(t) = h(t*)) must exist. To guarantee this existence, we define functional positivity (F-POSITIVITY): for any g(t*),

(F-POSITIVITY) p(h(t) = h(t*)) > 0 ⟹ p(g(t) = g(t*) | h(t) = h(t*)) > 0.  (4)

F-POSITIVITY says that the function of the pre-outcome variables that is being intervened on needs to have sufficient randomness when the function of the pre-outcome variables that defines the confounders is fixed. Further, under F-POSITIVITY, effect estimation for functional interventions is reduced to traditional OBS-CI on the data p(y, g(t), h(t)). With positivity and ignorability satisfied, traditional causal estimators such as propensity scores [23], matching [21], regression [11], and doubly robust methods [22] can be used to estimate the causal effect. Focusing on regression, let f_θ be a flexible function; then minimizing E_{y,t}[(y − f_θ(h(t), g(t)))²] over θ estimates the conditional expectation of interest, E[y | h(t), g(t*)]. With θ, the effect of g(t*) can be estimated by averaging the estimate of the conditional expectation over the marginal distribution p(h(t)):

τ̂(g(t*)) = E_t[f_θ(h(t), g(t*))].  (5)

When positivity is violated, causal effects cannot, in general, be estimated as conditional expectations over the observed data. We give a functional condition, called causal redundancy (C-REDUNDANCY), that allows us to estimate the effect of the full intervention do(t = t*), even when positivity is violated. Specifically, C-REDUNDANCY allows us to construct a surrogate intervention t′(t₁*, h(t₂*)) whose conditional effect at h(t′) matches the conditional effect of interest, φ(t₁*, h(t₂*)). Let t̃ be a fixed value of the full intervention; then C-REDUNDANCY is the following assumption.
Recall the outcome y = f(t̃, h(t̃), η). With ∇_t̃ as the gradient w.r.t. the first argument t̃:

∀ t̃, h(t̃), η:  ∇_t̃ f(t̃, h(t̃), η)ᵀ ∇_t̃ h(t̃) = 0.

C-REDUNDANCY is the condition that the outcome function f uses the value of the confounder from its second argument instead of computing h(t) from the first argument. To compute the conditional effect φ(t₁*, h(t₂*)), we develop Level-set Orthogonal Descent Estimation (LODE). LODE's key step is to construct a surrogate intervention t′(t₁*, h(t₂*)) such that

φ(t₁*, h(t₂*)) = φ(t′(t₁*, h(t₂*)), h(t₂*)),   h(t₂*) = h(t′(t₁*, h(t₂*))).

Figure 2: LODE's traversal.

By definition, a surrogate intervention lives in the conditional effect level-set { t̃ : φ(t̃, h(t₂*)) = φ(t₁*, h(t₂*)) }. So LODE searches this level-set for t′(t₁*, h(t₂*)). See fig. 2, which plots the conditional effect level-sets with the value of h(t) fixed (red) in (supp(t), supp(h(t)))-space. Green corresponds to the observed data, supp(t, h(t)). LODE finds t′(t₁*, h(t₂*)) by traversing the level-sets (black) to account for the confounder-part mismatch h(t₁*) ≠ h(t₂*). C-REDUNDANCY ensures LODE can traverse these level-sets, as it implies ∇_t̃ φ(t̃, h(t̃))ᵀ ∇_t̃ h(t̃) = 0. Under C-REDUNDANCY, surrogate interventions can be constructed by solving a gradient flow equation, which guarantees identification as follows:
Theorem 1.
Assume C-REDUNDANCY holds, and assume the following:

1. Let t′(t₁*, h(t₂*)) be the limiting solution to the gradient flow equation dt̃(s)/ds = −∇_t̃ (h(t̃(s)) − h(t₂*))², initialized at t̃(0) = t₁*; i.e., t′(t₁*, h(t₂*)) = lim_{s→∞} t̃(s). Further, let h(t′(t₁*, h(t₂*))) = h(t₂*) and t′(t₁*, h(t₂*)) ∈ supp(t).

2. f(t̃, h(t̃), η) and h(t̃), as functions of t̃ and h(t̃), are continuous and differentiable, and the derivatives exist for all t̃, η. Let ∇_t̃ f(t̃, h(t̃), η) exist and be bounded and integrable w.r.t. the probability measure corresponding to p(η), for all values of t̃ and h(t̃).

Then the conditional effect (and therefore the average effect) is identified:

φ(t₁*, h(t₂*)) = φ(t′(t₁*, h(t₂*)), h(t′(t₁*, h(t₂*)))) = E[y | t = t′(t₁*, h(t₂*))].  (6)

In words, the key idea is that starting at t̃(0) = t₁* and following ∇_t̃ h(t̃) means t̃(s) always lies in the level-set { t̃ : φ(t̃, h(t₂*)) = φ(t₁*, h(t₂*)) }. See appendix A.2 for the proof. (C-REDUNDANCY is not vacuous: if f transforms its first argument t̃ into h(t̃) as one amongst many different computations, the chain rule implies ∇_t̃ f(t̃, h(t₂*))ᵀ ∇_t̃ h(t̃) has a term ‖∇_t̃ h(t̃)‖², which is non-zero in general.) While C-REDUNDANCY is stated in terms of the gradient of the outcome function, it suffices for theorem 1 to assume a weaker condition about the gradient of the conditional effect: ∇_t̃ E_η[f(t̃, h(t̃), η)]ᵀ ∇_t̃ h(t̃) = 0.

Surrogate Positivity
In theorem 1, we assumed that the surrogate t′(t₁*, h(t₂*)) ∈ supp(t). This condition, which we call surrogate positivity (analogous to positivity), states that for any intervention and confounder, surrogate interventions that are limiting solutions to the gradient flow equation have nonzero density conditional on the confounder value. Formally, for any intervention t = t₁*,

p(h(t) = h(t₂*)) > 0 ⟹ p(t = t′(t₁*, h(t₂*)) | h(t) = h(t₂*)) > 0,  (7)

and t′(t₁*, h(t₂*)) satisfies assumption 1 in theorem 1. Surrogate positivity, along with C-REDUNDANCY, is sufficient for full effect estimation under EFC. Next, we show that the positivity assumption in traditional causal inference is a special case of surrogate positivity.
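The surrogate construction in theorem 1 can be sketched by Euler-integrating the gradient flow on a toy EFC instance. The choices below are ours, for illustration only: h(t) = t₁ + t₂ and conditional effect φ(t, z) = (t₁ − t₂) + 2z, picked so that ∇_t̃ φ = (1, −1) is orthogonal to ∇_t̃ h = (1, 1), i.e. C-REDUNDANCY holds everywhere:

```python
import numpy as np

# Toy EFC instance (illustrative, not from the paper).
def h(t):
    return t[0] + t[1]

def grad_h(t):
    return np.array([1.0, 1.0])

def phi(t, z):
    # Conditional effect; its t-gradient (1, -1) is orthogonal to grad_h = (1, 1).
    return (t[0] - t[1]) + 2.0 * z

def surrogate(t_star, z_target, step=0.05, n_steps=500):
    """Euler integration of dt/ds = -grad (h(t) - z_target)^2, started at t_star."""
    t = np.asarray(t_star, dtype=float).copy()
    for _ in range(n_steps):
        t -= step * 2.0 * (h(t) - z_target) * grad_h(t)
    return t

t_star = np.array([3.0, 1.0])   # intervention of interest; h(t_star) = 4
z_target = 0.5                  # confounder value to condition on; 4 != 0.5 is the mismatch
t_prime = surrogate(t_star, z_target)

# The flow moves only along grad_h, so t1 - t2 (and hence phi at fixed z) is
# unchanged, while h(t_prime) is driven to the target confounder value.
print(np.round(h(t_prime), 3), np.round(phi(t_prime, z_target) - phi(t_star, z_target), 6))
```

Because each Euler step is parallel to ∇h, the quantity t₁ − t₂ is invariant along the trajectory, which is exactly the level-set traversal that C-REDUNDANCY licenses.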
Traditional observational causal inference (OBS-CI) and LODE Let the confounder and intervention of interest in traditional OBS-CI be z and a respectively. Assume both are scalars and that ignorability and positivity hold. This setup can be embedded in EFC by defining the vector of pre-outcome variables as t = [a; z]. In this setting, C-REDUNDANCY and surrogate positivity (eq. (7)) hold by default. Let the outcome be y = f(t, h(t)) = f(a, z), where f only depends on the first element of t, i.e., a. Let e₁ = [1, 0] and e₂ = [0, 1]. In traditional OBS-CI as EFC, ∇_t̃ f(t̃, h(t₂*)) ∝ e₁ and ∇_t̃ h(t̃) ∝ e₂, meaning that ∇_t̃ f(t̃, h(t₂*))ᵀ ∇_t̃ h(t̃) = 0. Thus, C-REDUNDANCY holds by default. Moreover, under positivity of a w.r.t. z, we also have surrogate positivity for traditional OBS-CI as an EFC problem. In this setting, LODE computes t′ = [a*, h(t₂*)] by following −∇_t̃ (h(t̃) − h(t₂*))², which points along e₂ and so only changes the value of h(t̃), not the value of a. Thus, t₁* and t′(t₁*, h(t₂*)) will have the same first element, and t′'s second element will be h(t₂*). As a has positivity w.r.t. z, we have p(a = a*, z = h(t₂*)) > 0, so t′ ∈ supp(t). The estimated conditional effect is E[y | t = t′(t₁*, h(t₂*))] = f([a*, h(t₂*)], h(t₂*)) = E[y | a = a*, z = h(t₂*)], which matches the estimate in traditional OBS-CI.

Implementation of
LODE LODE first estimates the conditional expectation E[y | t]; this can be done with model-based or nonparametric estimators. This is achieved by regressing y on t: f̂ = argmin_{u ∈ F} E_{y,t∼D}[(y − u(t))²], with empirical distribution D. The surrogate intervention t′(t₁*, h(t₂*)) is computed using Euler integration to solve the gradient flow equation. Euler integration in this setting is equivalent to gradient descent with a fixed step size. Other, more efficient schemes like Runge-Kutta numerical integration methods [3] could also be used. The conditional effect estimate is f̂(t′(t₁*, h(t₂*))). See algorithm 1 for a description.

LODE in practice
To compute the surrogate intervention t′, LODE uses the gradients of h(·) in Euler integration. In practice, taking Euler integration steps, instead of solving the gradient flow exactly, could result in errors. Then t′ could lie outside the level-set of the conditional effect φ(t₁*, h(t₂*)) = E_η[f(t₁*, h(t₂*), η)]. Further, if h(t′(t₁*, h(t₂*))) ≠ h(t₂*), LODE incurs error for conditioning on a value of the confounder that is different from h(t₂*). The error due to estimating t′ is decoupled from the error in the estimation of E[y | t], which adds without further amplification. We formalize this error:

Theorem 2. Consider the conditional effect φ(t₁*, h(t₂*)). Let t̂(t₁*, h(t₂*)) be the estimate of the surrogate intervention computed by LODE via Euler integration of the gradient flow dt̃(s)/ds = −∇_t̃ (h(t̃(s)) − h(t₂*))², initialized at t̃(0) = t₁*. Assume the true surrogate t′(t₁*, h(t₂*)) exists and is the limiting solution to the gradient flow equation. (We ignore noise in the outcome for ease of exposition.)

1. Let the finite sample estimator of E[y | t = t̃] be f̂(t̃). Let the error be bounded for all t̃: |f̂(t̃) − E[y | t = t̃]| ≤ c(N), where N is the sample size and lim_{N→∞} c(N) = 0.

2. Assume K Euler integrator steps were taken to find the surrogate estimate t̂(t₁*, h(t₂*)), each of size ℓ. Let the maximum confounder mismatch be max_{i ≤ K} (h(t̃ᵢ) − h(t₂*)) = M.

3. Let L_{z,t̃} be the Lipschitz constant of φ(t̃, h(t̃)) as a function of h(t̃), for fixed t̃. Let L_e be the Lipschitz constant of E[y | t = t̃] = φ(t̃, h(t̃)) as a function of t̃. Assume h has a gradient with bounded norm, ‖∇h(t̃)‖ < L_h. Assume φ's Hessian has bounded eigenvalues: ∀ t̃₁, t̃₂, ‖∇²_t̃ φ(t̃₁, h(t̃₂))‖ ≤ σ_{H_φ}.

Then the conditional effect estimate error, ξ(t₁*, h(t₂*)) = |f̂(t̂) − φ(t₁*, h(t₂*))|, is upper bounded by

c(N) + min( L_e ‖t′ − t̂‖,  2Kℓ (O(ℓ) + M σ_{H_φ} L_h) + L_{z,t̂} ‖h(t̂) − h(t₂*)‖ ).  (8)

See appendix A.3 for the proof. Theorem 2 captures the trade-off between the bias due to conditioning on the wrong confounder value and the bias due to the accumulated error in solving the gradient flow equation. This accumulated-error analysis may be loose in settings where the sum of many gradient steps leads to t̂ ≈ t′, even if each step individually induces a large error.
In such settings, the term that depends on ‖t̂ − t′‖ is a better measure of error. The maximum mismatch M appears because the Euler integrator takes steps whose size depends on the magnitude of the gradient, which in turn depends on the mismatch value (h(t̃ᵢ) − h(t₂*)). If the mismatch is large for some i, the Euler step could lead to a large error for a fixed step size ℓ. We discuss the assumptions in theorems 1 and 2 in appendix A.1.

Existence of the surrogate t′(t₁*, h(t₂*)) The key element in theorem 1 is the surrogate intervention t′ such that its conditional effect given h(t′) equals that of t₁* and h(t₂*). The orthogonality ∇_t̃ fᵀ ∇_t̃ h = 0 is a functional condition that does not guarantee t′(t₁*, h(t₂*)) exists in supp(t); a necessity to compute E[y | t = t′] without additional parametric assumptions. We give a general condition, called Effect Connectivity, that guarantees the surrogate intervention exists. With conditional effect φ(t₁*, h(t₂*)), for any t₁*:

p(h(t) = h(t₂*)) > 0 ⟹ p(φ(t, h(t)) = φ(t₁*, h(t₂*)) | h(t) = h(t₂*)) > 0.  (9)

In words, t has a chance of setting the conditional effect to any possible value in supp(φ(t, h(t))), given any confounder value h(t₂*) ∈ supp(h(t)). An equivalent statement is that every level set of the conditional effect φ(t₁*, h(t₂*)), with h(t₂*) fixed, contains an intervention for each confounder value. That is, for some h(t₂*) define the level set A_c = { t₁* : f(t₁*, h(t₂*)) = c }; then ∀ h(t₂*) ∈ supp(h(t)), p(t ∈ A_c | h(t) = h(t₂*)) > 0.

Theorem 3. Under Effect Connectivity, eq. (9), any surrogate intervention t′(t₁*, h(t₂*)) ∈ supp(t).

We give the proof in appendix A.4. Whether the intervention t′(t₁*, h(t₂*)) can be found via tractable search is problem-specific. If the surrogate t′(t₁*, h(t₂*)) exists for all t₁*, h(t₂*), then eq. (9) holds by definition of the surrogate. Effect Connectivity allows us to reason about values of f anywhere in supp(t) × supp(h(t)) using only samples from p(y, t). Further, it is necessary in EFC:

Theorem 4. Effect Connectivity is necessary for nonparametric effect estimation in EFC.

We prove this in appendix A.5. Effect Connectivity ensures that causal models with different causal effects have different observational distributions. Then, parametric assumptions on the causal model are not necessary to estimate effects.
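Effect Connectivity can be checked numerically in the illustrative sign-confounder example from earlier (z = sign(t), t = z · G(1, exp(z))). The outcome function below, φ(t, z) = |t| + z, is our own choice for illustration, not the paper's; its level sets {t : |t| = c} contain points in both confounder strata:

```python
import numpy as np

rng = np.random.default_rng(1)

# The paper's illustrative confounder: z = sign(t) with t = z * Gamma(1, exp(z)).
z = rng.choice([-1.0, 1.0], size=200000)
t = z * rng.gamma(shape=1.0, scale=np.exp(z))

def phi(t_val, z_val):
    # Illustrative conditional effect depending on t only through |t| (our assumption).
    return np.abs(t_val) + z_val

# Effect Connectivity check: within each confounder stratum, sampled t values reach
# (a neighborhood of) the conditional-effect level set {t : |t| = 2}.
for z_star in (-1.0, 1.0):
    stratum = t[z == z_star]
    frac = np.mean(np.abs(np.abs(stratum) - 2.0) < 0.1)
    print(z_star, frac > 0)
```

Both strata hit the level set with positive empirical frequency, so a surrogate with the required conditional effect exists in the support of each stratum for this toy model.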
We evaluate LODE on simulated data first and show that LODE can correct for confounding. We also investigate the error induced by imperfect estimation of the surrogate intervention in LODE. Further, we run LODE on a GWAS dataset [6] and demonstrate that LODE is able to correct for confounding and recovers genetic variations that have been reported relevant to Celiac disease [8, 25, 14, 1].
We investigate different properties of LODE on simulated data where ground truth is available. Let the dimension of t (the pre-outcome variables) be T = 20 and the outcome noise be η ∼ N(0, 0.1). We consider two EFC causal models, denoted A and B, with different h(t) and f(t, h(t), η):

(A) h(t) = γ Σᵢ tᵢ / √T,  t ∼ N(0, σ² I_{T×T}),  y = Σᵢ (−1)ⁱ tᵢ / √T + α h(t) + (1 + α) h(t)² + η

(B) h(t) = γ Σ_{i even} tᵢ tᵢ₊₁,  t ∼ N(0, σ² I_{T×T}),  y = Σᵢ (−1)ⁱ tᵢ / √T + α h(t) + η

Figure 3: RMSE of estimated conditional effect vs. strength of confounding γ, for (a) causal model A and (b) causal model B. LODE corrects for confounding and produces good effect estimates across different values of γ.

Figure 4: RMSE of estimated conditional effect vs. strength of confounding γ, for different levels of the variance of t, for (a) causal model A and (b) causal model B. Small σ leads to large conditional estimation error.

In both causal models, C-REDUNDANCY is satisfied. The constant γ controls the strength of the confounder and the constant α controls the Lipschitz constant of the outcome as a function of the confounder. We let the variance σ² = 1 unless specified otherwise. In the following, we train on 1000 samples and report the conditional effect root-mean-squared error (RMSE), computed with another 1000 samples. We used a degree-2 kernel ridge regression to fit the outcome model as a function of t. This model is correctly specified, and so the conditional expectation E[y | t = t̃] can be estimated well. We compare against a baseline estimate of the conditional effect that is the same outcome model's estimate of E[y | t = t₁*]. This baseline fails to account for confounding and produces a biased estimate of the conditional effect of do(t = t₁*), conditional on any h(t₂*) ≠ h(t₁*).

First, we investigate how well LODE can correct for confounding in both causal models. We fix α and run Euler integration until the objective E_{t₁*, h(t₂*)}[(h(t̃(s)) − h(t₂*))²] falls below a small fixed fraction of its value at initialization, where E_{t₁*, h(t₂*)} is the expectation over the evaluation set. In fig. 3, we plot the mean and standard deviation of the conditional effect RMSE, averaged over 10 seeds, for different strengths of confounding. We see that LODE is able to estimate effects well across multiple strengths of confounding while the baseline suffers.
Figure 5: RMSE of estimated conditional effect vs. step size in the Euler integrator in causal model B. Accumulated error due to a large Euler step size increases with the strength of confounding.

Second, we investigate LODE's estimation when surrogate positivity holds but the probability p(t ≈ t′(t₁*, h(t₂*))) is very small. This results in estimation error due to poor fitting of the outcome model in low-density regions of supp(t). We run LODE on simulated data where t is generated with different variances σ². For small σ, the outcome model error is large when using surrogate interventions t′(t₁*, h(t₂*)) where either h(t₂*) or t₁* is large. This leads to high-variance effect estimation, as we show in fig. 4 for both causal models. For various variances of t, we plot the mean and standard deviation of the RMSE of the estimated conditional effect over 10 seeds, against different γ.

Figure 6: RMSE of estimated conditional effect vs. degree of confounder mismatch δ, for (a) causal model A and (b) causal model B. Error due to conditioning on a mismatched value of the confounder increases with the strength of confounding but is mitigated by smoothness of the outcome function.

Third, we investigate the bias induced by imperfect estimation of the surrogate intervention in LODE for both causal models. We construct surrogate interventions t′(t₁*, h(t₂*)) by ensuring there is a confounder-value mismatch h(t̃) ≠ h(t₂*). We do this by interrupting Euler integration when the objective E_{t₁*, h(t₂*)}[(h(t′(t₁*, h(t₂*))) − h(t₂*))²] = δ > 0, where the expectation E_{t₁*, h(t₂*)} is over the evaluation set upon which we estimate conditional effects. For different α, we plot in fig. 6 the mean and standard deviation of the RMSE of the estimated conditional effect over 10 seeds, against different degrees of confounder mismatch δ. The error due to confounder mismatch is mitigated by small α, the Lipschitz constant of the outcome as a function of h(t).

Finally, we consider how the step size in Euler integration affects the quality of the estimated effects. Large step sizes may result in biased surrogate estimates; this bias is captured in the accumulation error in section 3.1. We focus on the non-linear case in causal model B, where gradient errors can accumulate (see appendix A.3.1). We demonstrate this error in fig. 5, where we plot the mean and standard deviation of the conditional effect RMSE against the strength of confounding, for different step sizes ℓ. We do not report results for larger step sizes (ℓ > 2) because Euler integration diverged for many surrogate estimates.
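The degree-2 kernel ridge outcome model used in these simulations can be sketched directly in numpy; this is a generic polynomial-kernel ridge fit on stand-in quadratic data, not the paper's exact simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: y is quadratic in t, so a degree-2 kernel model is well-specified.
n, T, lam = 300, 5, 1e-3
t = rng.normal(size=(n, T))
y = t[:, 0] - t[:, 1] + (t.sum(axis=1)) ** 2 / T + rng.normal(scale=0.1, size=n)

def K(A, B):
    """Polynomial kernel of degree 2: k(a, b) = (1 + a.b)^2."""
    return (1.0 + A @ B.T) ** 2

# Kernel ridge: dual weights alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K(t, t) + lam * np.eye(n), y)

def f_hat(t_new):
    """Estimated conditional expectation E[y | t = t_new]."""
    return K(np.atleast_2d(t_new), t) @ alpha

# In-sample residuals should be small for a well-specified quadratic model.
resid = np.abs(f_hat(t) - y).mean()
print(resid < 0.2)
```

LODE would then feed Euler-integrated surrogate interventions t̂ through f_hat to obtain conditional effect estimates, while the baseline evaluates f_hat at t₁* directly.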
Genome-wide association studies (GWAS)

In this experiment, we explore the associations of genetic factors and Celiac disease. We utilize data from the Wellcome Trust Celiac disease GWAS dataset [8, 6], consisting of individuals with celiac disease, called cases ( n = ), and controls ( n = ). We construct our dataset by filtering from the ∼ SNPs. The only preprocessing in our experiments is linkage-disequilibrium pruning of adjacent SNPs (at 0.5 R²) and PLINK [5] quality control. After this, 337,642 SNPs remain for 11,950 people. We imputed missing SNPs for each person by sampling from the marginal distribution of that SNP. No further SNP or person was dropped due to missingness. The objective of this experiment is to show that LODE corrects for confounding and recovers SNPs reported in the literature [8, 25, 14, 1]. To this end, after preprocessing, we included in our data 50 SNPs reported in [8, 25, 14, 1] and 1000 randomly sampled from the rest.

We use outcome models and functional confounders h(·) traditionally employed in the GWAS literature. We choose a linear $h(\tilde{t}) = A^\top \tilde{t}$, where $A$ is the matrix of right singular vectors of a normalized genotype matrix that correspond to the top 10 singular values [19]. The outcome model is selected from logistic Lasso linear models with various regularization strengths, via cross-validation within the training data (60% of the dataset). We defer details about the experimental setup to appendix B.

We then use this outcome model in LODE to compute causal effects on the whole filtered dataset. The effects are computed one SNP at a time. First, for each person $\tilde{t}$, create $\tilde{t}_{i,1}$ and $\tilde{t}_{i,0}$, which correspond to the $i$-th SNP set to 1 and 0 respectively, with all other SNPs the same as in $\tilde{t}$. Randomly sample an $h(t^*)$ from the marginal $p(h(t))$ and, using the outcome model $P_\theta$, compute
$$\phi(\tilde{t}, i) = \log \frac{P_\theta(y = 1 \mid t'(\tilde{t}_{i,1}, h(t^*)))}{P_\theta(y = 1 \mid t'(\tilde{t}_{i,0}, h(t^*)))}.$$
The average effect of SNP $i$ is obtained by averaging across all persons: $\frac{1}{N}\sum_{\tilde{t}} \phi(\tilde{t}, i)$. Any SNP whose effect beats a specified threshold is deemed relevant to Celiac disease by LODE. We use a 60–40% train-test split; outcome model selection is done via 5-fold cross-validation using just the training set. We use Scikit-learn [18] to fit the outcome models and for cross-validation.
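For concreteness, the two ingredients above can be sketched end-to-end on synthetic data. All specifics here are our assumptions for illustration (sizes, random genotypes, a plain logistic regression standing in for the cross-validated logistic Lasso); we also exploit that this $h$ is linear, so the surrogate intervention has a closed-form projection instead of requiring Euler integration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_people, n_snps, n_pcs = 200, 60, 5
G = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)  # synthetic genotypes

# Normalize genotypes and take the top right singular vectors: h(t) = A @ t.
p = G.mean(axis=0) / 2.0
G_hat = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
_, S, Vt = np.linalg.svd(G_hat, full_matrices=False)
A = S[:n_pcs, None] * Vt[:n_pcs]          # (n_pcs, n_snps)
A_pinv = np.linalg.pinv(A)

def surrogate(t, h_star):
    # Because h is linear, projecting onto {t : A @ t = h_star} gives the surrogate.
    return t - A_pinv @ (A @ t - h_star)

# Placeholder outcome model; the paper selects a logistic Lasso by cross-validation.
y = rng.integers(0, 2, size=n_people)
model = LogisticRegression(max_iter=1000).fit(G, y)

def snp_effect(i, h_star):
    # Average log-odds difference for SNP i set to 1 vs 0, at confounder value h_star.
    diffs = []
    for t in G:
        t1, t0 = t.copy(), t.copy()
        t1[i], t0[i] = 1.0, 0.0
        s1, s0 = surrogate(t1, h_star), surrogate(t0, h_star)
        diffs.append(model.decision_function(s1[None])[0]
                     - model.decision_function(s0[None])[0])
    return float(np.mean(diffs))

h_star = A @ G[0]  # a confounder value drawn from the data
print(snp_effect(3, h_star))
```

The surrogates land exactly on the requested confounder level set, so the contrast between the two surrogates isolates the SNP's contribution at a fixed confounder value.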
Results

The best outcome model was a Lasso model, trained with regularization constant 10. We select relevant SNPs by thresholding estimated effects at a magnitude of 0.1. Out of the 1050 SNPs in our data (1000 not reported before), LODE returned 31 SNPs, of which 13 were previously reported as being associated with Celiac disease [8, 25, 14, 1]. In appendix B.2 we plot the true positive and false negative rates of identifying previously reported SNPs, as a function of the effect threshold.
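The selection rule and the rates plotted in appendix B.2 can be sketched as follows; the effect values are synthetic, and the 50/1000 split merely mirrors the SNP counts described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 1050
effects = rng.normal(scale=0.05, size=n_snps)  # synthetic estimated effects
reported = set(range(50))                      # stand-in for previously reported SNPs
effects[:50] += 0.1                            # give reported SNPs larger effects

def rates(threshold):
    # SNPs whose |effect| beats the threshold are deemed relevant.
    selected = {i for i in range(n_snps) if abs(effects[i]) > threshold}
    tpr = len(selected & reported) / len(reported)
    fnr = 1.0 - tpr
    return tpr, fnr

for thr in (0.05, 0.1, 0.2):
    print(thr, rates(thr))
```

Raising the threshold can only shrink the selected set, so the true positive rate is non-increasing in the threshold, which is the trade-off the appendix B.2 plot traces.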
SNP          Effect   Lasso Coef.
rs13151961   0.17     0.32
rs2237236    0.17     0.00
rs1738074    −        −
rs11221332   −        −

Table 1: A few SNPs previously reported as relevant and recovered by LODE, with estimated effects and Lasso coefficients. LODE produces effect estimates that do not rely purely on the coefficients.

In table 1, we list a few SNPs that were both deemed relevant by LODE and reported in the existing literature [8, 25, 14, 1], together with their effects and their Lasso coefficients. The full list is in table 2 in appendix B. If LODE could not adjust for confounding, the Lasso coefficients would dictate the effects; a 0 coefficient would mean a 0 effect. However, the two pairs of SNPs in table 1 show that the effects estimated by LODE do not rely solely on the Lasso coefficients. For the first pair (rs13151961, rs2237236), the effect is the same but the coefficient of one is 0 while the other is positive. We note that rs2237236 was found to be associated with ulcerative colitis [12, 2], an inflammatory bowel disease that has been reported to share some common genetic basis with celiac disease [16]. For the second pair (rs1738074, rs11221332), the magnitude of the effect is smaller for the former, but the coefficient is larger. Thus, LODE adjusts for confounding factors that the outcome model ignored.
When positivity is violated in traditional OBS-CI, not all effects are estimable without further assumptions. In such cases, practitioners have to turn to parametric models to estimate causal effects. However, parametric models can be misspecified when used without underlying causal mechanistic knowledge. We develop a new general setting of observational causal effect estimation called estimation with functional confounders (EFC), where the confounder can be expressed as a function of the data, meaning positivity is violated. Even when positivity is violated, the effects of many functional interventions are estimable. We develop a sufficient condition called functional positivity (F-POSITIVITY) to estimate effects of functional interventions. Such effects can be of independent interest, like the effect of the cumulative dosage of a drug instead of the joint effects of multiple dosages at different times.

Second, we prove a necessary condition for nonparametric estimation of effects of the full intervention. We propose the C-REDUNDANCY condition, under which the effect of the full intervention on t is estimable without parametric restrictions. We develop Level-set Orthogonal Descent Estimation (LODE), which computes surrogate interventions whose effects are estimable and match a conditional effect of interest. Further, we give bounds on the errors (theorem 2) induced by imperfect estimation of the surrogate intervention. Finally, we empirically demonstrate LODE's ability to correct for confounding in both simulated and real data.
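As a compact illustration of the pipeline summarized above, here is a toy end-to-end run under assumptions of our own choosing: a 2-D treatment, the linear confounder $h(t) = t_1 + t_2$, and an outcome $y = (t_1 - t_2) + 2h(t) + \text{noise}$, so that $\nabla_t f = (1, -1)$ is orthogonal to $\nabla h = (1, 1)$ and C-REDUNDANCY holds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
t = rng.normal(size=(n, 2))
h = lambda v: v[0] + v[1]
y = (t[:, 0] - t[:, 1]) + 2.0 * (t[:, 0] + t[:, 1]) + 0.1 * rng.normal(size=n)

# Step 1: regress y on t.
f_hat = LinearRegression().fit(t, y)

# Step 2: Euler-integrate d t/ds = -(h(t) - h_target) * grad h(t) from the
# intervention point until it hits the desired confounder level set.
t_int = np.array([5.0, -1.0])   # intervention of interest
h_target = 1.0                  # confounder value to condition on
grad_h = np.array([1.0, 1.0])
t_surr = t_int.copy()
while abs(h(t_surr) - h_target) > 1e-10:
    t_surr = t_surr - 0.1 * (h(t_surr) - h_target) * grad_h

# Step 3: evaluate the fitted outcome model at the surrogate.
estimate = f_hat.predict(t_surr[None, :])[0]
truth = (t_int[0] - t_int[1]) + 2.0 * h_target  # = 8 by construction
print(estimate, truth)
```

Because the outcome's treatment gradient is orthogonal to the confounder's gradient, moving along the descent path leaves the conditional effect unchanged, and the fitted model evaluated at the surrogate recovers the effect of the original intervention.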
Future. A few directions of improvement remain, which we elaborate next. First, F-POSITIVITY may not hold for all functions g(t) that we want to intervene on. Instead, one could compute a "projection" $g_\Pi$ onto the space of functions that satisfy F-POSITIVITY and inspect the effects defined by $g_\Pi$ instead. A second direction of interest is to let h(t) account for only a part of the confounding, meaning ignorability is violated. This bias could be mitigated under smoothness conditions on the outcome function and its interaction with the degree of violation of ignorability.

Finally, LODE's search strategy is Euler integration, which is equivalent to gradient descent with a fixed step size. Optimization techniques like momentum, rescaling the gradient with an adaptive matrix, and using second-order Hessian information speed up gradient descent. However, if there are many local or global minima of $(h(\tilde{t}) - h(t^*))^2$, such techniques may result in a different solution than Euler integration, which could mean that effect estimates are biased. One extension of LODE would allow for search strategies that use such techniques.

Broader Impact
Our work mainly applies to causal inference where confounders are specified as functions of observed data, such as in problems in genetics and healthcare. We choose to assess the impact of our work through its applications in these fields. A positive impact of the work is that better estimates of causal effects help guide treatment for people and aid in understanding the biological pathways of diseases. However, in healthcare, data collected in hospitals has biases. If, for instance, a certain demographic of people has more complete data collected about them, then this demographic would have better-quality effect estimates, potentially meaning that they receive better treatment. This problem could be characterized by evaluating the positivity of treatment and the completeness of confounders in electronic health record data split by demographics.
Acknowledgements
The authors were partly supported by NIH/NHLBI Award R01HL148248 and by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. The authors would like to thank Xintian Han, Raghav Singhal, Victor Veitch, Fredrik D. Johansson, and the reviewers for thoughtful feedback. The authors would also like to thank Mukund Sudarshan and Prof. Sriram Sankararaman for help with running the GWAS experiments.
References

[1] Svetlana Adamovic, SS Amundsen, BA Lie, AH Gudjonsdottir, H Ascher, J Ek, DA Van Heel, S Nilsson, LM Sollid, and Å Torinsson Naluai. Association study of IL2/IL21 and FcgRIIa: significant association with the IL2/IL21 region in Scandinavian coeliac disease families. Genes and Immunity, 9(4):364, 2008.
[2] Carl A Anderson, Gabrielle Boucher, Charlie W Lees, Andre Franke, Mauro D'Amato, Kent D Taylor, James C Lee, Philippe Goyette, Marcin Imielinski, Anna Latiano, et al. Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nature Genetics, 43(3):246, 2011.
[3] Uri M Ascher and Linda R Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations, volume 61. SIAM, 1998.
[4] William Astle, David J Balding, et al. Population structure and cryptic relatedness in genetic association studies. Statistical Science, 24(4):451–471, 2009.
[5] Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4(1):s13742-015, 2015.
[6] Wellcome Trust Case Control Consortium et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661, 2007.
[7] J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
[8] Patrick CA Dubois, Gosia Trynka, Lude Franke, Karen A Hunt, Jihane Romanos, Alessandra Curtotti, Alexandra Zhernakova, Graham AR Heap, Róza Ádány, Arpo Aromaa, et al. Multiple common variants for celiac disease influencing immune gene expression. Nature Genetics, 42(4):295, 2010.
[9] Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.
[10] Miguel A Hernán and James M Robins. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020.
[11] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011. doi: 10.1198/jcgs.2010.08162.
[12] Lucia A Hindorff, Praveen Sethupathy, Heather A Junkins, Erin M Ramos, Jayashri P Mehta, Francis S Collins, and Teri A Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, 2009.
[13] Morris W Hirsch, Robert L Devaney, and Stephen Smale. Differential Equations, Dynamical Systems, and Linear Algebra, volume 60. Academic Press, 1974.
[14] Karen A Hunt, Alexandra Zhernakova, Graham Turner, Graham AR Heap, Lude Franke, Marcel Bruinenberg, Jihane Romanos, Lotte C Dinesen, Anthony W Ryan, Davinder Panesar, et al. Novel celiac disease genetic determinants related to the immune response. Nature Genetics, 40(4):395, 2008.
[15] Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and David Heckerman. Fast linear mixed models for genome-wide association studies. Nature Methods, 8(10):833, 2011.
[16] Virginia Pascual, Romina Dieli-Crimi, Natalia López-Palacios, Andrés Bodas, Luz María Medrano, and Concepción Núñez. Inflammatory bowel disease and celiac disease: overlaps and differences. World Journal of Gastroenterology, 20(17):4846, 2014.
[17] Judea Pearl et al. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
[18] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[19] Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904, 2006.
[20] Rajesh Ranganath and Adler Perotte. Multiple causal inference with latent confounding. arXiv preprint arXiv:1805.08273, 2018.
[21] Marc Ratkovic. Balancing within the margin: Causal effect estimation with support vector machines. Department of Politics, Princeton University, Princeton, NJ, 2014.
[22] James M Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, volume 1999, pages 6–10. Indianapolis, IN, 2000.
[23] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[24] Donald B Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
[25] Ludvig M Sollid. Coeliac disease: dissecting a complex inflammatory disorder. Nature Reviews Immunology, 2(9):647, 2002.
[26] Michael Spivak. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. CRC Press, 2018.
[27] Gerald Teschl. Ordinary Differential Equations and Dynamical Systems, volume 140. American Mathematical Society, 2012.
[28] Timothy Thornton and Michael Wu. Summer Institute in Statistical Genetics 2015.
[29] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
[30] Yixin Wang and David M Blei. The blessings of multiple causes. Journal of the American Statistical Association, 2019.
[31] Jianming Yu, Gael Pressoir, William H Briggs, Irie Vroh Bi, Masanori Yamasaki, John F Doebley, Michael D McMullen, Brandon S Gaut, Dahlia M Nielsen, James B Holland, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics, 38(2):203, 2006.
A Theoretical details

A.1 A note about the assumptions

In theorem 1, assumption 1 consists of three parts that can all be validated on observed data: 1) that the gradient flow converges, 2) that the confounder value of the surrogate matches the confounder value whose effect is of interest, and 3) that the surrogate intervention lies in the support of the pre-outcome variables. Assumption 2 is required for expectations and their gradients to exist and be finite. In theorem 2, assumption 1 requires a consistent estimator of $\mathbb{E}[y \mid t]$, which can be provided by regression. Assumption 3 lists regularity conditions which help control how the surrogate estimation error propagates to the effect error.

A.2 Proof of Theorem 1
We restate the theorem for completeness:

Theorem 1. Assume C-REDUNDANCY holds, and assume the following:

1. Let $t'(t^*, h(t^*))$ be the limiting solution to the gradient flow equation
$$\frac{d\tilde{t}(s)}{ds} = -\nabla_{\tilde{t}}\, \tfrac{1}{2}\bigl(h(\tilde{t}(s)) - h(t^*)\bigr)^2,$$
initialized at $\tilde{t}(0) = t^*$; i.e., $t'(t^*, h(t^*)) = \lim_{s\to\infty} \tilde{t}(s)$. Further, let $h(t'(t^*, h(t^*))) = h(t^*)$ and $t'(t^*, h(t^*)) \in \mathrm{supp}(t)$.

2. $f(\tilde{t}, h(\tilde{t}), \eta)$ and $h(\tilde{t})$, as functions of $\tilde{t}$ and $h(\tilde{t})$, are continuous and differentiable, and the derivatives exist for all $\tilde{t}, \eta$. Let $\nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)$ exist and be bounded and integrable w.r.t. the probability measure corresponding to $p(\eta)$, for all values of $\tilde{t}$ and $h(\tilde{t})$.

Then the conditional effect (and therefore the average effect) is identified:
$$\phi(t^*, h(t^*)) = \phi\bigl(t'(t^*, h(t^*)),\, h(t'(t^*, h(t^*)))\bigr) = \mathbb{E}[y \mid t = t'(t^*, h(t^*))]. \tag{10}$$

Proof. Recall the definition of the conditional effect, $\phi(\tilde{t}, h(\tilde{t})) = \mathbb{E}_\eta f(\tilde{t}, h(\tilde{t}), \eta)$, and that $\nabla_{\tilde{t}}$ is the gradient with respect to the first argument of $f$, that is $\tilde{t}$. First, by assumption 2, $\mathbb{E}$ and $\nabla$ commute under the dominated convergence theorem. Then, by C-REDUNDANCY,
$$\nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*))^\top \nabla_{\tilde{t}} h(\tilde{t}) = \nabla_{\tilde{t}} \mathbb{E}_\eta f(\tilde{t}, h(t^*), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t}) = \mathbb{E}_\eta\bigl[\nabla_{\tilde{t}} f(\tilde{t}, h(t^*), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t})\bigr] = 0.$$

Recall the gradient flow equation $d\tilde{t}(s)/ds = -\nabla_{\tilde{t}}\, \tfrac12 (h(\tilde{t}) - h(t^*))^2$. We refer to the gradient evaluated at $\tilde{t}$ as $\Delta\tilde{t} = -\nabla_{\tilde{t}}\, \tfrac12 (h(\tilde{t}) - h(t^*))^2 = -(h(\tilde{t}) - h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t})$. We will express $\phi(t'(t^*, h(t^*)), h(t^*))$ in terms of the starting point $\phi(t^*, h(t^*))$ and the gradient flow equation. Let the solution path of the gradient flow equation be $C$, with $t^*$ and $t'(t^*, h(t^*))$ as the starting and ending points respectively. By the gradient theorem [26], $\phi(t^*, h(t^*))$ and $\phi(t'(t^*, h(t^*)), h(t^*))$ are related via the line integral over $C$:
$$\int_C \nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*)) \cdot d\tilde{t} = \phi(t'(t^*, h(t^*)), h(t^*)) - \phi(t^*, h(t^*)).$$
Let $\tilde{t}(s)$ be a parametrization of the solution path $C$ by the scalar time $s \in [0, \infty)$.

Now, to obtain the value of $\phi(t'(t^*, h(t^*)), h(t^*))$, we compute the line integral over the vector field defined by $\nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*))$, which exists by assumption 2 in theorem 1, evaluated along the path $C$ defined by $\Delta\tilde{t}(s)$:
$$\begin{aligned}
\phi(t'(t^*, h(t^*)), h(t^*)) &= \phi(t^*, h(t^*)) + \int_C \nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*)) \cdot d\tilde{t} \\
&= \phi(t^*, h(t^*)) + \int_0^\infty \nabla_{\tilde{t}} \phi(\tilde{t}(s), h(t^*))^\top \frac{d\tilde{t}(s)}{ds}\, ds \\
&= \phi(t^*, h(t^*)) + \int_0^\infty \nabla_{\tilde{t}} \phi(\tilde{t}(s), h(t^*))^\top \Delta\tilde{t}(s)\, ds \\
&= \phi(t^*, h(t^*)) + \int_0^\infty -\bigl(h(\tilde{t}(s)) - h(t^*)\bigr)\, \nabla_{\tilde{t}} \phi(\tilde{t}(s), h(t^*))^\top \nabla_{\tilde{t}} h(\tilde{t}(s))\, ds \\
&= \phi(t^*, h(t^*)) + 0 \quad \{\text{by C-REDUNDANCY}\}. \tag{11}
\end{aligned}$$
Finally, by assumption 1 in theorem 1, $h(t'(t^*, h(t^*))) = h(t^*)$, and so
$$\phi(t^*, h(t^*)) = \phi(t'(t^*, h(t^*)), h(t^*)) = \phi\bigl(t'(t^*, h(t^*)), h(t'(t^*, h(t^*)))\bigr). \tag{12}$$
For clarity, the same equation, using $t'$ and suppressing the dependence on $(t^*, h(t^*))$:
$$\phi(t^*, h(t^*)) = \phi(t', h(t^*)) = \phi(t', h(t')). \tag{13}$$
Under the causal model for EFC, the outcome is $y = f(t, h(t), \eta)$. Then, $\forall \tilde{t} \in \mathrm{supp}(p(t))$,
$$\mathbb{E}[y \mid t = \tilde{t}] = \mathbb{E}_\eta[f(\tilde{t}, h(\tilde{t}), \eta)] = \phi(\tilde{t}, h(\tilde{t})). \tag{14}$$
Using that $t'(t^*, h(t^*)) \in \mathrm{supp}(p(t))$ and eqs. (13) and (14), the conditional effect is identified:
$$\phi(t^*, h(t^*)) = \phi\bigl(t'(t^*, h(t^*)), h(t'(t^*, h(t^*)))\bigr) = \mathbb{E}[y \mid t = t'(t^*, h(t^*))]. \tag{15}$$
Thus, the conditional effect and, consequently, the average effect are identified as $\mathbb{E}[y \mid t = t'(t^*, h(t^*))]$ and $\tau(t^*) = \mathbb{E}_{h(t)}\, \mathbb{E}[y \mid t = t'(t^*, h(t))]$ respectively. ∎

Note about convergence of gradient flow.
Any ODE's solution, if it exists and converges, converges to an ω-limit set [27]. An ω-limit set is nonempty when the solution path lies entirely in a closed and bounded set, and it can consist of limit cycles, equilibrium points, or neither [13, 27]. A gradient flow equation $d\tilde{t}(s)/ds = -\nabla h(\tilde{t})$ (also called a gradient system) has the special property that its ω-limit set consists only of critical points of $h(\tilde{t})$; critical points of $h(\tilde{t})$ are also equilibrium points of the gradient flow equation [13]. Further, if $\nabla h(\tilde{t})$ exists and is bounded and $h(\tilde{t})$ has bounded sublevel sets ($\{\tilde{t} : h(\tilde{t}) \le c\}$), then the solution to the gradient flow equation lies entirely within a bounded set. This is because $h(\tilde{t}(s))$ always decreases along the solution path, so the solution remains in any sublevel set it started in. Thus, if $h(\tilde{t})$ has bounded sublevel sets, the solution of the gradient flow equation converges only to critical points of $h(\tilde{t})$.

A.3 Estimation error in LODE
Theorem 2. Consider the conditional effect $\phi(t^*, h(t^*))$. Let $\hat{t}(t^*, h(t^*))$ be LODE's estimate of the surrogate intervention, computed via Euler integration of the gradient flow
$$\frac{d\tilde{t}(s)}{ds} = -\nabla_{\tilde{t}}\, \tfrac12\bigl(h(\tilde{t}(s)) - h(t^*)\bigr)^2,$$
initialized at $\tilde{t}(0) = t^*$. Assume the true surrogate $t'(t^*, h(t^*))$ exists and is the limiting solution to the gradient flow equation.

1. Let the finite-sample estimator of $\mathbb{E}[y \mid t = \tilde{t}]$ be $\hat{f}(\tilde{t})$. Let the error be bounded for all $\tilde{t}$: $|\hat{f}(\tilde{t}) - \mathbb{E}[y \mid t = \tilde{t}]| \le c(N)$, where $N$ is the sample size and $\lim_{N\to\infty} c(N) = 0$.

2. Assume $K$ Euler integrator steps, each of size $\ell$, were taken to find the surrogate estimate $\hat{t}(t^*, h(t^*))$. Let the maximum confounder mismatch be $\max_{i \le K} (h(\tilde{t}_i) - h(t^*)) = M$.

3. Let $L_{z,\tilde{t}}$ be the Lipschitz constant of $\phi(\tilde{t}, h(\tilde{t}))$ as a function of $h(\tilde{t})$, for fixed $\tilde{t}$. Let $L_e$ be the Lipschitz constant of $\mathbb{E}[y \mid t = \tilde{t}] = \phi(\tilde{t}, h(\tilde{t}))$ as a function of $\tilde{t}$. Assume $h$ has a gradient with bounded norm, $\|\nabla h(\tilde{t})\| < L_h$. Assume $\phi$'s Hessian has bounded eigenvalues: $\forall \tilde{t}_1, \tilde{t}_2$, $\|\nabla^2_t \phi(\tilde{t}_1, h(\tilde{t}_2))\| \le \sigma_{H_\phi}$.

The conditional-effect estimate error, $\xi(t^*, h(t^*)) = |\hat{f}(\hat{t}) - \phi(t^*, h(t^*))|$, is upper bounded by
$$c(N) + \min\Bigl( L_e\|t' - \hat{t}\|, \;\; \tfrac{K\ell^2}{2}\bigl(O(\ell) + M^2 \sigma_{H_\phi} L_h^2\bigr) + L_{z,\hat{t}}\|h(\hat{t}) - h(t^*)\| \Bigr). \tag{16}$$

Proof (of theorem 2). Recall the definition of the conditional effect: $\phi(\tilde{t}, h(\tilde{t})) = \mathbb{E}_\eta f(\tilde{t}, h(\tilde{t}), \eta)$. LODE's estimate of the conditional effect is $\hat{f}(\hat{t}(t^*, h(t^*)))$. We suppress the dependence on $(t^*, h(t^*))$ and use $t'$ and $\hat{t}$ for the true and estimated surrogate interventions respectively. Note $\hat{f}$ is the estimate of the conditional expectation $\mathbb{E}[y \mid t = \tilde{t}]$, learned from $N$ samples. We first split the error into two parts and bound each separately:
$$\begin{aligned}
|\xi(t^*, h(t^*))| = |\hat{f}(\hat{t}) - \phi(t^*, h(t^*))| &\le |\hat{f}(\hat{t}) - \phi(\hat{t}, h(\hat{t}))| + |\phi(\hat{t}, h(\hat{t})) - \phi(t^*, h(t^*))| \\
&\le c(N) + |\phi(\hat{t}, h(\hat{t})) - \phi(\hat{t}, h(t^*))| + |\phi(\hat{t}, h(t^*)) - \phi(t^*, h(t^*))|.
\end{aligned}$$
The first of the two remaining terms is bounded by the Lipschitz property of $\phi$ as a function of $h(\tilde{t})$ with fixed first argument $\tilde{t} = \hat{t}$:
$$|\phi(\hat{t}, h(\hat{t})) - \phi(\hat{t}, h(t^*))| \le L_{z,\hat{t}}\, |h(\hat{t}) - h(t^*)|.$$
We now bound the remaining term. Recall that LODE's computation of the surrogate intervention involved $K$ gradient steps, each of size $\ell$. We work with a constant step size, but the analysis generalizes to non-uniform step sizes. Indexing steps by $i$, let $d_i = h(\tilde{t}_i) - h(t^*)$ be the confounder mismatch at the $i$-th iterate. Then $\hat{t} = t^* - \ell \sum_{i=0}^{K-1} d_i \nabla_{\tilde{t}} h(\tilde{t}_i)$. With $\tilde{t}_K = \hat{t}$ and $\tilde{t}_0 = t^*$, we express the error $\phi(\hat{t}, h(t^*)) - \phi(t^*, h(t^*))$ as a telescoping sum and use the Taylor expansion of $\phi(\tilde{t}, h(t^*))$ in its first argument $\tilde{t}$:
$$\begin{aligned}
\phi(\hat{t}, h(t^*)) - \phi(t^*, h(t^*)) &= \sum_{i=0}^{K-1} \phi(\tilde{t}_{i+1}, h(t^*)) - \phi(\tilde{t}_i, h(t^*)) \\
&= \sum_{i=0}^{K-1} \nabla_{\tilde{t}}\phi(\tilde{t}_i, h(t^*))^\top (\tilde{t}_{i+1} - \tilde{t}_i) + \tfrac12 (\tilde{t}_{i+1} - \tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))(\tilde{t}_{i+1} - \tilde{t}_i) + O(\|\tilde{t}_{i+1} - \tilde{t}_i\|^3) \\
&= \sum_{i=0}^{K-1} -\ell d_i\, \nabla_{\tilde{t}}\phi(\tilde{t}_i, h(t^*))^\top \nabla_{\tilde{t}} h(\tilde{t}_i) + \tfrac{(\ell d_i)^2}{2}\, \nabla_{\tilde{t}} h(\tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t}_i) + O(\ell^3) \\
&= O(K\ell^3) + \sum_{i=0}^{K-1} \tfrac{(\ell d_i)^2}{2}\, \nabla_{\tilde{t}} h(\tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t}_i) \quad \{\text{first-order terms vanish by C-REDUNDANCY}\} \\
&\le O(K\ell^3) + \sum_{i=0}^{K-1} \tfrac{\ell^2 M^2}{2}\, \bigl|\nabla_{\tilde{t}} h(\tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t}_i)\bigr| \\
&\le O(K\ell^3) + \sum_{i=0}^{K-1} \tfrac{\ell^2 M^2}{2}\, \sigma_{H_\phi}\, \|\nabla_{\tilde{t}} h(\tilde{t}_i)\|^2 \;\le\; O(K\ell^3) + \tfrac{K\ell^2 M^2}{2}\, \sigma_{H_\phi} L_h^2 \;=\; \tfrac{K\ell^2}{2}\bigl(O(\ell) + M^2 \sigma_{H_\phi} L_h^2\bigr),
\end{aligned}$$
where the inequalities follow from the maximum value of $(h(\tilde{t}_i) - h(t^*))$, the bounded eigenvalues of the Hessian of $\phi$, and the bounded gradient norm of $h(\tilde{t})$.

Another way to bound the error is via the Lipschitz constant $L_e$ of the conditional expectation as a function of $\tilde{t}$:
$$|\phi(\hat{t}, h(\hat{t})) - \phi(t^*, h(t^*))| = |\phi(\hat{t}, h(\hat{t})) - \phi(t', h(t'))| \le L_e \|t' - \hat{t}\|.$$
The bound follows:
$$|\xi(t^*, h(t^*))| \le c(N) + \min\Bigl( L_e\|t' - \hat{t}\|, \;\; \tfrac{K\ell^2}{2}\bigl(O(\ell) + M^2 \sigma_{H_\phi} L_h^2\bigr) + L_{z,\hat{t}}\|h(\hat{t}) - h(t^*)\| \Bigr). \qquad\blacksquare$$

A.3.1 A note on linear confounder functions and LODE
In the proof above, the error in Euler integration accumulates due to terms of the form $\nabla_{\tilde{t}} h(\tilde{t})^\top \nabla^2_t f(\tilde{t}, h(t^*), \eta)\, \nabla_{\tilde{t}} h(\tilde{t})$. For a linear confounder function, which satisfies $\nabla_{\tilde{t}} h(\tilde{t}) = \beta$, such terms can be expressed as
$$\beta^\top \nabla_{\tilde{t}}\bigl(\nabla_{\tilde{t}} f(\tilde{t}, h(t^*), \eta)^\top \beta\bigr) = \beta^\top \nabla_{\tilde{t}}(0) = 0,$$
since $\nabla_{\tilde{t}} f(\tilde{t}, h(t^*), \eta)^\top \beta = 0$ by C-REDUNDANCY. Thus, this error does not accumulate even with large step sizes. Further, note that the gradient flow equation in LODE for causal model A in section 4 is a linear ODE whose solution has a closed-form expression, so one can estimate the surrogate without numerical integration [27].

A.4 Proof of sufficiency of Effect Connectivity

Theorem 3.
Under Effect Connectivity, eq. (9), any surrogate intervention $t'(t^*, h(t^*)) \in \mathrm{supp}(t)$.

Proof. Recall $\phi(\tilde{t}, h(\tilde{t})) = \mathbb{E}_\eta f(\tilde{t}, h(\tilde{t}), \eta)$. We have, $\forall t^* \in \mathrm{supp}(p(t))$ with $p(h(t) = h(t^*)) > 0$:
$$p\bigl(\phi(t, h(t)) = \phi(t^*, h(t^*)) \mid h(t) = h(t^*)\bigr) > 0 \;\implies\; \exists\, t' \in \mathrm{supp}(t) \text{ with } h(t') = h(t^*) \text{ s.t. } \phi(t', h(t^*)) = \phi(t^*, h(t^*)).$$
Then $\phi(t^*, h(t^*)) = \phi(t', h(t^*)) = \phi(t', h(t')) = \mathbb{E}[y \mid t = t']$. ∎

A.5 Necessity of Effect Connectivity for nonparametric effect estimation in EFC

Theorem 4. Effect Connectivity is necessary for nonparametric effect estimation in EFC.

Proof (of theorem 4). Let the outcome be $y = f(t, h(t))$. Recall the joint distribution $p(t, y)$ and let $h(t)$ be the confounder. Let Effect Connectivity be violated; i.e., there exists a non-measure-zero subset $B \subseteq \mathrm{supp}(t) \times \mathrm{supp}(h(t))$ (non-zero w.r.t. the product measure over $\mathrm{supp}(t) \times \mathrm{supp}(h(t))$) such that
$$\forall (\tilde{t}, h(\tilde{t})) \in B, \quad p\bigl(f(t, h(t)) = f(\tilde{t}, h(\tilde{t})) \mid h(t) = h(\tilde{t})\bigr) = 0.$$
We construct a second outcome $y_2 = f_2(t, h(t))$ and show that the conditional effects for this new outcome are different from the ones defined by $f$ for all $(\tilde{t}, h(\tilde{t})) \in B$. Let
$$f_2(\tilde{t}, h(\tilde{t})) = f(\tilde{t}, h(\tilde{t})) + \mathbb{1}\bigl[(\tilde{t}, h(\tilde{t})) \in B\bigr].$$
We have $f_2(\tilde{t}, h(\tilde{t})) = f(\tilde{t}, h(\tilde{t}))$ for all $\tilde{t} \in \mathrm{supp}(t)$, as the additional term in $f_2$ is only present for $(\tilde{t}, h(\tilde{t})) \in B$; this follows from the fact that $\forall \tilde{t} \in \mathrm{supp}(t)$, $(\tilde{t}, h(\tilde{t})) \notin B$, since $p[f(t, h(t)) = f(\tilde{t}, h(\tilde{t})) \mid h(t) = h(\tilde{t})] > 0$ for such $\tilde{t}$. Hence $p(y_2, t)$ and $p(y, t)$ are equal in distribution, since $B \cap \mathrm{supp}(t, h(t)) = \emptyset$. However, the conditional effects are different for the outcomes $y$ and $y_2$ for all $(\tilde{t}, h(\tilde{t})) \in B$:
$$\mathbb{E}[y \mid do(t = \tilde{t}), h(t) = h(\tilde{t})] \ne \mathbb{E}[y_2 \mid do(t = \tilde{t}), h(t) = h(\tilde{t})].$$
Therefore, for causal models that violate Effect Connectivity, there exist observationally equivalent causal models with different causal effects. Thus, nonparametric effect estimation is impossible, and Effect Connectivity is required for EFC. ∎

A.6 Algorithmic details
We give pseudocode for LODE in algorithm 1.

Algorithm 1: LODE for $do(t = t^*)$
Input: Functional confounder $h(t)$; tolerance $\epsilon$
Output: Conditional effect of $t^*, h(t^*)$
1. Regress $y$ on $t$ and compute $\hat{f}(\cdot) := \arg\min_{u \in \mathcal{F}} \mathbb{E}_{y,t}(y - u(t))^2$.
2. To estimate the effect of $t^*, h(t^*)$, compute the surrogate intervention $t'(t^*, h(t^*))$ by Euler-integrating the gradient flow equation, initialized at $\tilde{t}_0 = t^*$, until $\tfrac12(h(\tilde{t}_s) - h(t^*))^2 < \epsilon$:
$$\frac{d\tilde{t}(s)}{ds} = -\nabla_{\tilde{t}}\, \tfrac12\bigl(h(\tilde{t}_s) - h(t^*)\bigr)^2.$$
3. Return $\hat{f}(t'(t^*, h(t^*)))$.

Extensions of LODE. Consider that we have access to $m(h(t))$ for some bijective differentiable function $m(\cdot)$, instead of $h(t)$. The orthogonality in C-REDUNDANCY still holds:
$$\nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)^\top \nabla_{\tilde{t}} m(h(\tilde{t})) = m'(h(\tilde{t}))\, \nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t}) = 0.$$
Then, using $m(h(\tilde{t}))$ to compute the surrogate $t'(t^*, h(t^*))$, LODE would estimate valid effects. Similarly, LODE can estimate the effect on any differentiable transformation $m(y)$ of the outcome, because
$$\nabla_{\tilde{t}} m(y_{\tilde{t}})^\top \nabla_{\tilde{t}} h(\tilde{t}) = m'(y_{\tilde{t}})\, \nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t}) = 0.$$

B Experimental Details
B.1 Functional confounders in GWAS

Here, we show how $h(t) = At$ and $A$ reflect the traditional PCA-based adjustment in GWAS. Recall that population structure acts as a confounder in GWAS. Price et al. [19] demonstrated that using the principal components of the normalized genetic relationship matrix adjusts for confounding due to population structure in GWAS. Let the genotype matrix be $G$, with people as rows and SNPs as columns, such that each element is one of $0, 1/2, 1$, where $1/2$ and $1$ refer to one and two copies of the allele at the position of the SNP, respectively. With $p_s$ as the allele frequency at SNP $s$ [28], $\Phi$ is the genetic relationship matrix whose elements are defined as
$$\Phi_{i,j} = \frac{1}{S} \sum_{s=1}^{S} \frac{(G_{i,s} - p_s)(G_{j,s} - p_s)}{p_s(1 - p_s)}.$$
Then, Price et al. [19] compute the top $K$ (10 suggested) principal components of $\Phi$ to use as the axes of variation due to population structure. The eigenvectors of $\Phi$ are the left singular vectors of $\hat{G}$, where $\Phi = \hat{G}\hat{G}^\top$; they capture independent axes of variation across individuals.

Price et al. [19] exploit the idea that if a SNP aligns with some of the axes of variation, this is due to population structure. These axes of variation are the top $K$ eigenvectors $U$ of $\Phi = \hat{G}\hat{G}^\top \approx U\Lambda U^\top$, where $U \in \mathbb{R}^{N \times K}$, $\Phi \in \mathbb{R}^{N \times N}$, and $\Lambda \in \mathbb{R}^{K \times K}$. Here, $U$ also contains the left singular vectors of $\hat{G} \approx U\Sigma V^\top$, where $\Sigma \in \mathbb{R}^{K \times K}$ is diagonal and $V \in \mathbb{R}^{S \times K}$. We use $\approx$ to denote that the chosen $K$ eigenvectors explain the variation due to population structure; what remains are random mutations.

Let the $s$-th SNP be $\hat{G}_{\cdot,s} \in \mathbb{R}^N$, a column of $\hat{G}$. In Price et al. [19], population structure in the $s$-th SNP is captured by $\hat{G}_{\cdot,s}^\top U$. In words, projecting the SNP $\hat{G}_{\cdot,s}$ onto the axes of variation in individuals gives the population structure between the $s$-th SNP and the outcome. This projection $\hat{G}_{\cdot,s}^\top U$ is a row of $\hat{G}^\top U \in \mathbb{R}^{S \times K}$. In turn, $\hat{G}^\top U \in \mathbb{R}^{S \times K}$ is the population structure in all SNPs. Projecting this population structure onto the genotype of an individual gives the confounding due to population structure among the SNPs present in the genotype. With $G_{j,\cdot} \in \{0, 1/2, 1\}^S$ as the genotype of an individual $j$, this projection is $(\hat{G}^\top U)^\top G_{j,\cdot}$. However, $\hat{G} \approx U\Sigma V^\top$ implies that $\hat{G}^\top U \approx V\Sigma$. Reflecting this, $h(t) = \Sigma V^\top t$ is the functional confounder for an individual $t$.

B.2 Expanded results

In table 2, we list the 13 SNPs recovered by LODE that have been previously reported as relevant to Celiac disease. In fig. 7, we plot the true positive and false negative rates among SNPs deemed relevant by LODE. The ground truth here is the set of SNPs reported as associated with celiac disease in prior literature.
SNP          Effect   Lasso Coef.
rs3748816    0.12     0.20
rs10903122   0.10     0.17
rs2816316    0.11     0.20
rs13151961   0.17     0.32
rs2237236    0.17     0.00
rs12928822   0.14     0.29
rs2187668    −        −

Table 2: Full list of SNPs previously reported as relevant that were recovered by LODE, with their estimated effects and Lasso coefficients. The effect threshold here is 0.1.

Figure 7: True positive vs. false negative rate as we vary the threshold on average effects that determines which SNPs are deemed relevant.