Causal Estimation with Functional Confounders
Aahlad Puli, Adler J. Perotte, Rajesh Ranganath
[email protected], [email protected], [email protected]
Computer Science, New York University, New York, NY 10011; Biomedical Informatics, Columbia University, New York, NY 10032; Center for Data Science, New York University, New York, NY 10011
Abstract
Causal inference relies on two fundamental assumptions: ignorability and positivity. We study causal inference when the true confounder value can be expressed as a function of the observed data; we call this setting estimation with functional confounders (EFC). In this setting ignorability is satisfied, but positivity is violated, and causal inference is impossible in general. We consider two scenarios where causal effects are estimable. First, we discuss interventions on a part of the treatment, called functional interventions, and a sufficient condition for effect estimation of these interventions, called functional positivity. Second, we develop conditions for nonparametric effect estimation based on the gradient fields of the functional confounder and the true outcome function. To estimate effects under these conditions, we develop Level-set Orthogonal Descent Estimation (LODE). Further, we prove error bounds on LODE's effect estimates, evaluate our methods on simulated and real data, and empirically demonstrate the value of
EFC.

Determining the effect of interventions on outcomes using observational data lies at the core of many fields like medicine, economic policy, and genomics. For example, policy makers estimate effects to decide whether to invest in education or job training programs. In medicine, doctors use effects to design optimal treatment strategies for patients. Geneticists perform genome-wide association studies (GWAS) to relate genotypes and phenotypes. In observational data, there could exist unobserved variables that affect both the intervention and the outcome, called confounders. A necessary condition for the causal effect to be identified is that all confounders are observed; this condition is called ignorability. If ignorability holds, a sufficient condition for causal effect estimation is adequate variation in the intervention after conditioning on the confounders; this condition is called positivity.

The data a priori does not differentiate between confounders and interventions. It is the practitioners who select interventions of interest from all pre-outcome variables (variables that occur before the outcome). Then, assuming knowledge of the data generating mechanism, practitioners can label certain variables amongst the remaining pre-outcome variables as confounders. This corresponds to indexing into the set of pre-outcome variables.

In certain problems the confounders are specified as a function of the pre-outcome variables that does not simply index into the set of pre-outcome variables. For a concrete example, consider GWAS. The goal in GWAS is to estimate the influence of genetic variations on phenotypes like disease risk. In GWAS, population and family structures both result in certain genetic variations and affect phenotypes and, therefore, are confounders [4]. Practitioners specify these confounders by using the genetic similarity between individuals [15, 19, 31], which is a function of the genetic variations. When the confounders are a function of the same pre-outcome variables that define the interventions, positivity is violated. Then, the class of interventions whose effects are estimable is not well-defined.

We study causal effect estimation in such settings, where a function of the pre-outcome variables provides the confounder and these same pre-outcome variables define the intervention. We call this estimation with functional confounders (EFC). In EFC, one column in the observed data is the outcome and all others are pre-outcome variables. We assume access to a function h(·) that takes as input the pre-outcome variables and returns the value of the confounder. Further, we assume these confounders give us ignorability. In settings like GWAS, the function h reflects the practitioner-specified function that captures the genetic variation influenced by the population structure. In traditional observational causal inference (OBS-CI), h(·) reflects the selection of certain variables in the data and labelling them as confounders. In EFC, two different values of the confounder are never observed for the same setting of the pre-outcome variables. This means that positivity is violated and the effects of only certain interventions may be estimable.

We address this issue in two ways. First, we investigate a class of plausible interventions that are functions of the observed pre-outcome variables, called functional interventions. We develop a sufficient condition to estimate the effects of said functional interventions, called functional positivity (F-POSITIVITY). Second, we consider intervening on all pre-outcome variables, called the full intervention. We develop a sufficient condition to estimate the effect of the full intervention, called causal redundancy (C-REDUNDANCY). For an intervention, given a confounder value, C-REDUNDANCY allows us to compute a surrogate intervention such that the conditional effect of the surrogate is equal to that of the original intervention. We also show that such surrogate interventions exist only under a certain condition that we call Effect Connectivity, which is necessary for nonparametric effect estimation in EFC. This condition is satisfied by default in traditional OBS-CI if ignorability and positivity hold. Then, we develop an algorithm for causal estimation assuming C-REDUNDANCY, called Level-set Orthogonal Descent Estimation (LODE), which estimates effects using surrogate interventions. If the surrogate is not estimated well, LODE's estimates are biased. We establish bounds on this bias that capture the mitigating effect of the smoothness of the true outcome function.
Related work The problem of genome-wide association studies (GWAS) is to estimate the effect of genetic variations (also called single nucleotide polymorphisms (SNPs)) on the phenotype [29]. The ancestry of the subjects acts as a confounder in GWAS. In GWAS practice, principal component analysis (PCA) and linear mixed models (LMMs) are used to compute this confounding structure [19, 31]. Lippert et al. [15] suggest estimating the confounders and effects on separate subsets of the SNPs. This separation disregards the confounding that is captured in the interaction of the two subsets of SNPs. GWAS is a special case of effects from multiple treatments (MTE) where the confounder value is specified via optimization as a function of the pre-outcome variables [20, 30]. In all these settings, positivity is violated and not all effects are estimable. We provide an avenue for nonparametric effect estimation of the full intervention under a new condition, C-REDUNDANCY.

Traditional observational causal inference (
OBS-CI) review We set up causal inference with Structural Causal Models [17] and use do(t = t*) to denote making an intervention. Let t be a vector of the interventions, z be the confounder, and y be the outcome. Let η ∼ p(η), with η ⊥ (z, t), be noise. With f as the outcome function, we define the causal model for traditional OBS-CI as:

z ∼ p(z),  t ∼ p(t | z),  y = f(t, z, η).

Let p(y, z, t) denote the joint distribution implied by this data generating process. The effects of interest under the full intervention do(t = t*) are the average and conditional effect:

(average) τ(t*) = E_{z,η}[f(t*, z, η)],   (conditional) φ(t*, z) = E_η[f(t*, z, η)].  (1)

With observed confounders, two assumptions make causal estimation possible: ignorability and positivity. Ignorability means that all confounders z are observed in the data. Conditioning on all the confounders, the outcome under an intervention is distributed as if conditional on the value of the intervention:

p(y = y | do(t = t*), z = z) = p(f(t*, z, η) = y) = p(y = y | t = t*, z = z).

This allows the expression of the average effect as an expectation over the observed outcomes: τ(t*) = E_{z,η}[f(t*, z, η)] = E_z E[y | z, t*]. This conditional expectation only exists for all t* if p(y | z, t = t*) = p(y, z, t = t*) / (p(z) p(t = t* | z)) exists. Positivity guarantees this existence:

(positivity) ∀ t* ∈ supp(t),  p(z = z) > 0 ⟹ p(t = t* | z = z) > 0.  (2)

We focus on the f that generates y from t, z; SCMs generally also specify the function that generates t from z.
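The adjustment formula τ(t*) = E_z E[y | z, t*] can be sketched with an outcome regression averaged over the observed confounders. A minimal sketch on simulated data; the linear model and all constants here are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulated OBS-CI data where z confounds t and y.
# True causal model (illustrative): y = 1.0*t + 2.0*z + noise.
z = rng.normal(size=n)
t = z + rng.normal(size=n)
y = 1.0 * t + 2.0 * z + rng.normal(scale=0.1, size=n)

# Outcome regression E[y | t, z]: a linear model fit by least squares.
X = np.column_stack([t, z, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def tau_hat(t_star):
    """Adjustment estimate: average the fitted E[y | t*, z] over the marginal p(z)."""
    return np.mean(coef[0] * t_star + coef[1] * z + coef[2])

# Average effect of do(t = 1) vs do(t = 0); the truth here is the t-coefficient, 1.0.
# A naive regression of y on t alone would instead recover roughly 2.0.
effect = tau_hat(1.0) - tau_hat(0.0)
print(round(effect, 2))
```

The naive estimate is biased because z moves t and y together; averaging the fitted conditional expectation over p(z) removes that bias exactly when ignorability and positivity hold.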
Figure 1: Causal graphs for traditional OBS-CI vs. EFC. (a) Traditional OBS-CI, with confounder z and treatment t both parents of y. (b) EFC, where the confounder is h(t). (c) Intervening in EFC.

In traditional
OBS-CI, causal estimation relied on knowing the confounders. In this section, we consider settings where confounders are known via a function of the pre-outcome variables, h(t) = z. We call this setting estimation with functional confounders (EFC). An example of this is GWAS, where SNPs (the pre-outcome variables) are used to estimate the confounding population structure through methods like PCA [31]. Assuming the confounders are a function of the pre-outcome variables violates positivity in general: for t₁, t₂ ∈ supp(t) such that h(t₁) ≠ h(t₂),

p(z = h(t₁) | t = t₂) = 0 ≠ p(z = h(t₁)) > 0.

A positivity violation precludes nonparametric effect estimation of the full intervention do(t = t*).

Positivity and Regression Identifiability Positivity can be viewed as providing identifiability. To see this, let the confounder be z = h(t) and the outcome be y(t, z, η) = z + h(t). Now consider regressing z and t onto y. Then every function y_{α,β} = αz + βh(t), indexed by α, β such that α + β = 2, agrees with y on the observed (t, z), meaning that the regression is not identifiable. Assuming positivity necessitates sufficient randomness to identify the regression and thus the causal effect. A violation of positivity means that nonparametric estimation of causal effects needs further assumptions.

EFC In EFC, the confounder is provided as a non-bijective function h of the pre-outcome variables t. To reflect this property, we use h(t) to denote the confounder. As an illustrative example, let G be the Gamma distribution and consider z ∈ {−1, 1} with p(z = 1) = 1/2, and t = z * G(1, exp(z)). Note that sign(t) = z, meaning that h(t) = sign(t) is the confounder. Figure 1 shows causal graphs connecting our EFC notation to that in traditional
OBS-CI. With noise η ∼ p(η), η ⊥ t, our causal model samples, in order, the confounder "part" of the pre-outcome variables h(t), the pre-outcome variables t, and the outcome y via the outcome function f:

h(t) ∼ p(h(t)),  t ∼ p(t | h(t)),  y = f(t, h(t), η).

Similar to traditional OBS-CI, for an intervention t₁* the average effect, τ(·), and the conditional effect, φ(·, ·) at h(t₂*), respectively, are defined as:

τ(t*) = E_{h(t),η}[f(t*, h(t), η)],   φ(t₁*, h(t₂*)) = E_η[f(t₁*, h(t₂*), η)].  (3)

As the pre-outcome variables determine the confounder, positivity is violated. Further, the outcome function f(t, h(t), η) could recover the exact value of h(t) from t instead of using its second argument. Thus, two different outcome functions could lead to the same observational data distribution, posing a fundamental obstacle to causal effect estimation. This is the central challenge in EFC.

Without positivity, we can only estimate the effects of certain functions of t. We call such interventions, on some function g(t), functional interventions. The implied causal model for the outcome for functional intervention value g(t*) and confounder value h(t*) is first t ∼ p(t | g(t) = g(t*), h(t) = h(t*)) and then y = f(t, h(t*), η). Then, the functional average effect is

(average) τ(g(t*)) = E_{h(t),η} E_{t | g(t) = g(t*), h(t)} [f(t, h(t), η)].

An example of a functional intervention is intervening on the cumulative dosage of a drug. In contrast, traditional interventions would set each individual dose given at different points in time. We also assume no interference [10] (also called the Stable Unit Treatment Value Assumption [24]), which means that an individual's outcome does not depend on others' treatments. In EFC, when the pairs (t, η) are sampled IID there is no interference. To see this, note that for all i ≠ j, (tᵢ, ηᵢ) ⊥ (tⱼ, ηⱼ) ⟹ (yᵢ, tᵢ) ⊥ (yⱼ, tⱼ) ⟹ yᵢ ⊥ tⱼ. Intervening on g(t) can be interpreted as making a soft intervention [9, 7] that sets t to p(t | z, g(t) = g(t̃)).

F-POSITIVITY and Functional Effect Estimation
For the causal model above to be well-defined for all functional interventions g(t*), the conditional p(t | g(t) = g(t*), h(t) = h(t*)) must exist. To guarantee this existence, we define functional positivity (F-POSITIVITY): for any g(t*),

(F-POSITIVITY) p(h(t) = h(t*)) > 0 ⟹ p(g(t) = g(t*) | h(t) = h(t*)) > 0.  (4)

F-POSITIVITY says that the function of the pre-outcome variables that is being intervened on needs to have sufficient randomness when the function of the pre-outcome variables that defines the confounders is fixed. Further, under F-POSITIVITY, effect estimation for functional interventions is reduced to traditional OBS-CI on the data p(y, g(t), h(t)). With positivity and ignorability satisfied, traditional causal estimators such as propensity scores [23], matching [21], regression [11], and doubly robust methods [22] can be used to estimate the causal effect. Focusing on regression, let f_θ be a flexible function; then minimizing E_{y,t}[(y − f_θ(h(t), g(t)))²] over θ estimates the conditional expectation of interest, E[y | h(t), g(t*)]. With θ, the effect of g(t*) can be estimated by averaging the estimate of the conditional expectation over the marginal distribution p(h(t)):

τ̂(g(t*)) = E_t[f_θ(h(t), g(t*))].  (5)

When positivity is violated, causal effects cannot, in general, be estimated as conditional expectations over the observed data. We give a functional condition, called causal redundancy (C-REDUNDANCY), that allows us to estimate the effect of the full intervention do(t = t*), even when positivity is violated. Specifically, C-REDUNDANCY allows us to construct a surrogate intervention t′(t₁*, h(t₂*)) whose conditional effect at h(t′) matches the conditional effect of interest, φ(t₁*, h(t₂*)). Let t̃ be a fixed value of the full intervention; then C-REDUNDANCY is the following assumption.
Recall the outcome y = f(t̃, h(t̃), η). With ∇_t̃ as the gradient w.r.t. the first argument t̃:

∀ t̃, h(t̃), η:  ∇_t̃ f(t̃, h(t̃), η)ᵀ ∇_t̃ h(t̃) = 0.

C-REDUNDANCY is the condition that the outcome function f uses the value of the confounder from its second argument instead of computing h(t) from the first argument. To compute the conditional effect φ(t₁*, h(t₂*)), we develop Level-set Orthogonal Descent Estimation (LODE). LODE's key step is to construct a surrogate intervention t′(t₁*, h(t₂*)) such that

φ(t₁*, h(t₂*)) = φ(t′(t₁*, h(t₂*)), h(t₂*)),   h(t₂*) = h(t′(t₁*, h(t₂*))).

Figure 2: LODE's traversal.

By definition, a surrogate intervention lives in the conditional effect level-set { t̃ : φ(t̃, h(t₂*)) = φ(t₁*, h(t₂*)) }. So LODE searches this level-set for t′(t₁*, h(t₂*)). See fig. 2, which plots the conditional effect level-sets with the value of h(t) fixed (red) in (supp(t), supp(h(t)))-space. Green corresponds to the observed data, supp(t, h(t)). LODE finds t′(t₁*, h(t₂*)) by traversing the level-sets (black) to account for the confounder-part mismatch h(t₁*) ≠ h(t₂*). C-REDUNDANCY ensures LODE can traverse these level-sets, as it implies ∇_t̃ φ(t̃, h(t̃))ᵀ ∇_t̃ h(t̃) = 0. Under C-REDUNDANCY, surrogate interventions can be constructed by solving a gradient flow equation, which guarantees identification as follows:
Theorem 1.
Assume C-REDUNDANCY holds, and assume the following:

1. Let t′(t₁*, h(t₂*)) be the limiting solution to the gradient flow equation dt̃(s)/ds = −∇_t̃ (h(t̃(s)) − h(t₂*))², initialized at t̃(0) = t₁*; i.e., t′(t₁*, h(t₂*)) = lim_{s→∞} t̃(s). Further, let h(t′(t₁*, h(t₂*))) = h(t₂*) and t′(t₁*, h(t₂*)) ∈ supp(t).

2. f(t̃, h(t̃), η) and h(t̃), as functions of t̃ and h(t̃), are continuous and differentiable, and the derivatives exist for all t̃, η. Let ∇_t̃ f(t̃, h(t̃), η) exist and be bounded and integrable w.r.t. the probability measure corresponding to p(η), for all values of t̃ and h(t̃).

Then the conditional effect (and therefore the average effect) is identified:

φ(t₁*, h(t₂*)) = φ(t′(t₁*, h(t₂*)), h(t′(t₁*, h(t₂*)))) = E[y | t = t′(t₁*, h(t₂*))].  (6)

In words, the key idea is that starting at t̃(0) = t₁* and following ∇_t̃ h(t̃) means t̃(s) always lies in the level-set { t̃ : φ(t̃, h(t₂*)) = φ(t₁*, h(t₂*)) }. See appendix A.2 for the proof. (C-REDUNDANCY is not vacuous: if f transforms its first argument t̃ into h(t̃) as one amongst many different computations, the chain rule implies ∇_t̃ f(t̃, h(t₂*))ᵀ ∇_t̃ h(t̃) has a term ‖∇_t̃ h(t̃)‖², which is non-zero in general.) While C-REDUNDANCY is stated in terms of the gradient of the outcome function, it suffices for theorem 1 to assume a weaker condition about the gradient of the conditional effect: ∇_t̃ E_η[f(t̃, h(t̃), η)]ᵀ ∇_t̃ h(t̃) = 0.

Surrogate Positivity
In theorem 1, we assumed that the surrogate t′(t₁*, h(t₂*)) ∈ supp(t). This condition, which we call surrogate positivity (analogous to positivity), states that for any intervention and confounder, surrogate interventions that are limiting solutions to the gradient flow equation have nonzero density conditional on the confounder value. Formally, for any intervention t = t₁*,

p(h(t) = h(t₂*)) > 0 ⟹ p(t = t′(t₁*, h(t₂*)) | h(t) = h(t₂*)) > 0,  (7)

and t′(t₁*, h(t₂*)) satisfies assumption 1 in theorem 1. Surrogate positivity, along with C-REDUNDANCY, is sufficient for full effect estimation under EFC. Next, we show that the positivity assumption in traditional causal inference is a special case of surrogate positivity.
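The surrogate construction in theorem 1 can be sketched by Euler-integrating the gradient flow on a toy EFC instance. The choices below are ours, for illustration only: h(t) = t₁ + t₂ and conditional effect φ(t, z) = (t₁ − t₂) + 2z, picked so that ∇_t̃ φ = (1, −1) is orthogonal to ∇_t̃ h = (1, 1), i.e. C-REDUNDANCY holds everywhere:

```python
import numpy as np

# Toy EFC instance (illustrative, not from the paper).
def h(t):
    return t[0] + t[1]

def grad_h(t):
    return np.array([1.0, 1.0])

def phi(t, z):
    # Conditional effect; its t-gradient (1, -1) is orthogonal to grad_h = (1, 1).
    return (t[0] - t[1]) + 2.0 * z

def surrogate(t_star, z_target, step=0.05, n_steps=500):
    """Euler integration of dt/ds = -grad (h(t) - z_target)^2, started at t_star."""
    t = np.asarray(t_star, dtype=float).copy()
    for _ in range(n_steps):
        t -= step * 2.0 * (h(t) - z_target) * grad_h(t)
    return t

t_star = np.array([3.0, 1.0])   # intervention of interest; h(t_star) = 4
z_target = 0.5                  # confounder value to condition on; 4 != 0.5 is the mismatch
t_prime = surrogate(t_star, z_target)

# The flow moves only along grad_h, so t1 - t2 (and hence phi at fixed z) is
# unchanged, while h(t_prime) is driven to the target confounder value.
print(np.round(h(t_prime), 3), np.round(phi(t_prime, z_target) - phi(t_star, z_target), 6))
```

Because each Euler step is parallel to ∇h, the quantity t₁ − t₂ is invariant along the trajectory, which is exactly the level-set traversal that C-REDUNDANCY licenses.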
Traditional observational causal inference (OBS-CI) and LODE Let the confounder and intervention of interest in traditional OBS-CI be z and a respectively. Assume both are scalars and that ignorability and positivity hold. This setup can be embedded in EFC by defining the vector of pre-outcome variables as t = [a; z]. In this setting, C-REDUNDANCY and surrogate positivity (eq. (7)) hold by default. Let the outcome be y = f(t, h(t)) = f(a, z), where f only depends on the first element of t, i.e., a. Let e₁ = [1, 0] and e₂ = [0, 1]. In traditional OBS-CI as EFC, ∇_t̃ f(t̃, h(t₂*)) ∝ e₁ and ∇_t̃ h(t̃) ∝ e₂, meaning that ∇_t̃ f(t̃, h(t₂*))ᵀ ∇_t̃ h(t̃) = 0. Thus, C-REDUNDANCY holds by default. Moreover, under positivity of a w.r.t. z, we also have surrogate positivity for traditional OBS-CI as an EFC problem. In this setting, LODE computes t′ = [a*, h(t₂*)] by following −∇_t̃ (h(t̃) − h(t₂*))², which points along e₂ and so only changes the value of h(t̃), not the value of a. Thus, t₁* and t′(t₁*, h(t₂*)) will have the same first element, and t′'s second element will be h(t₂*). As a has positivity w.r.t. z, we have p(a = a*, z = h(t₂*)) > 0, so t′ ∈ supp(t). The estimated conditional effect is E[y | t = t′(t₁*, h(t₂*))] = f([a*, h(t₂*)], h(t₂*)) = E[y | a = a*, z = h(t₂*)], which matches the estimate in traditional OBS-CI.

Implementation of
LODE LODE first estimates the conditional expectation E[y | t]; this can be done with model-based or nonparametric estimators. This is achieved by regressing y on t: f̂ = argmin_{u ∈ F} E_{y,t∼D}[(y − u(t))²], with empirical distribution D. The surrogate intervention t′(t₁*, h(t₂*)) is computed using Euler integration to solve the gradient flow equation. Euler integration in this setting is equivalent to gradient descent with a fixed step size. Other, more efficient schemes like Runge-Kutta numerical integration methods [3] could also be used. The conditional effect estimate is f̂(t′(t₁*, h(t₂*))). See algorithm 1 for a description.

LODE in practice
To compute the surrogate intervention t′, LODE uses the gradients of h(·) in Euler integration. In practice, taking Euler integration steps, instead of solving the gradient flow exactly, could result in errors. Then t′ could lie outside the level-set of the conditional effect φ(t₁*, h(t₂*)) = E_η[f(t₁*, h(t₂*), η)]. Further, if h(t′(t₁*, h(t₂*))) ≠ h(t₂*), LODE incurs error for conditioning on a value of the confounder that is different from h(t₂*). The error due to estimating t′ is decoupled from the error in the estimation of E[y | t], which adds without further amplification. We formalize this error:

Theorem 2. Consider the conditional effect φ(t₁*, h(t₂*)). Let t̂(t₁*, h(t₂*)) be the estimate of the surrogate intervention computed by LODE via Euler integration of the gradient flow dt̃(s)/ds = −∇_t̃ (h(t̃(s)) − h(t₂*))², initialized at t̃(0) = t₁*. Assume the true surrogate t′(t₁*, h(t₂*)) exists and is the limiting solution to the gradient flow equation. (We ignore noise in the outcome for ease of exposition.)

1. Let the finite sample estimator of E[y | t = t̃] be f̂(t̃). Let the error be bounded for all t̃: |f̂(t̃) − E[y | t = t̃]| ≤ c(N), where N is the sample size and lim_{N→∞} c(N) = 0.

2. Assume K Euler integrator steps were taken to find the surrogate estimate t̂(t₁*, h(t₂*)), each of size ℓ. Let the maximum confounder mismatch be max_{i ≤ K} (h(t̃ᵢ) − h(t₂*)) = M.

3. Let L_{z,t̃} be the Lipschitz constant of φ(t̃, h(t̃)) as a function of h(t̃), for fixed t̃. Let L_e be the Lipschitz constant of E[y | t = t̃] = φ(t̃, h(t̃)) as a function of t̃. Assume h has a gradient with bounded norm, ‖∇h(t̃)‖ < L_h. Assume φ's Hessian has bounded eigenvalues: ∀ t̃₁, t̃₂, ‖∇²_t̃ φ(t̃₁, h(t̃₂))‖ ≤ σ_{H_φ}.

Then the conditional effect estimate error, ξ(t₁*, h(t₂*)) = |f̂(t̂) − φ(t₁*, h(t₂*))|, is upper bounded by

c(N) + min( L_e ‖t′ − t̂‖,  2Kℓ (O(ℓ) + M σ_{H_φ} L_h) + L_{z,t̂} ‖h(t̂) − h(t₂*)‖ ).  (8)

See appendix A.3 for the proof. Theorem 2 captures the trade-off between the bias due to conditioning on the wrong confounder value and the bias due to the accumulated error in solving the gradient flow equation. This accumulated-error analysis may be loose in settings where the sum of many gradient steps leads to t̂ ≈ t′, even if each step individually induces a large error.
In such settings, the term that depends on ‖t̂ − t′‖ is a better measure of error. The maximum mismatch M appears because the Euler integrator takes steps whose size depends on the magnitude of the gradient, which in turn depends on the mismatch value (h(t̃ᵢ) − h(t₂*)). If the mismatch is large for some i, the Euler step could lead to a large error for a fixed step size ℓ. We discuss the assumptions in theorems 1 and 2 in appendix A.1.

Existence of the surrogate t′(t₁*, h(t₂*)) The key element in theorem 1 is the surrogate intervention t′ such that its conditional effect given h(t′) equals that of t₁* and h(t₂*). The orthogonality ∇_t̃ fᵀ ∇_t̃ h = 0 is a functional condition that does not guarantee t′(t₁*, h(t₂*)) exists in supp(t); a necessity to compute E[y | t = t′] without additional parametric assumptions. We give a general condition, called Effect Connectivity, that guarantees the surrogate intervention exists. With conditional effect φ(t₁*, h(t₂*)), for any t₁*:

p(h(t) = h(t₂*)) > 0 ⟹ p(φ(t, h(t)) = φ(t₁*, h(t₂*)) | h(t) = h(t₂*)) > 0.  (9)

In words, t has a chance of setting the conditional effect to any possible value in supp(φ(t, h(t))), given any confounder value h(t₂*) ∈ supp(h(t)). An equivalent statement is that every level set of the conditional effect φ(t₁*, h(t₂*)), with h(t₂*) fixed, contains an intervention for each confounder value. That is, for some h(t₂*) define the level set A_c = { t₁* : f(t₁*, h(t₂*)) = c }; then ∀ h(t₂*) ∈ supp(h(t)), p(t ∈ A_c | h(t) = h(t₂*)) > 0.

Theorem 3. Under Effect Connectivity, eq. (9), any surrogate intervention t′(t₁*, h(t₂*)) ∈ supp(t).

We give the proof in appendix A.4. Whether the intervention t′(t₁*, h(t₂*)) can be found via tractable search is problem-specific. If the surrogate t′(t₁*, h(t₂*)) exists for all t₁*, h(t₂*), then eq. (9) holds by definition of the surrogate. Effect Connectivity allows us to reason about values of f anywhere in supp(t) × supp(h(t)) using only samples from p(y, t). Further, it is necessary in EFC:

Theorem 4. Effect Connectivity is necessary for nonparametric effect estimation in EFC.

We prove this in appendix A.5. Effect Connectivity ensures that causal models with different causal effects have different observational distributions. Then, parametric assumptions on the causal model are not necessary to estimate effects.
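Effect Connectivity can be checked numerically in the illustrative sign-confounder example from earlier (z = sign(t), t = z · G(1, exp(z))). The outcome function below, φ(t, z) = |t| + z, is our own choice for illustration, not the paper's; its level sets {t : |t| = c} contain points in both confounder strata:

```python
import numpy as np

rng = np.random.default_rng(1)

# The paper's illustrative confounder: z = sign(t) with t = z * Gamma(1, exp(z)).
z = rng.choice([-1.0, 1.0], size=200000)
t = z * rng.gamma(shape=1.0, scale=np.exp(z))

def phi(t_val, z_val):
    # Illustrative conditional effect depending on t only through |t| (our assumption).
    return np.abs(t_val) + z_val

# Effect Connectivity check: within each confounder stratum, sampled t values reach
# (a neighborhood of) the conditional-effect level set {t : |t| = 2}.
for z_star in (-1.0, 1.0):
    stratum = t[z == z_star]
    frac = np.mean(np.abs(np.abs(stratum) - 2.0) < 0.1)
    print(z_star, frac > 0)
```

Both strata hit the level set with positive empirical frequency, so a surrogate with the required conditional effect exists in the support of each stratum for this toy model.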
We evaluate LODE on simulated data first and show that LODE can correct for confounding. We also investigate the error induced by imperfect estimation of the surrogate intervention in LODE. Further, we run LODE on a GWAS dataset [6] and demonstrate that LODE is able to correct for confounding and recovers genetic variations that have been reported relevant to Celiac disease [8, 25, 14, 1].
We investigate different properties of LODE on simulated data where ground truth is available. Let the dimension of t (the pre-outcome variables) be T = 20 and the outcome noise be η ∼ N(0, 0.1). We consider two EFC causal models, denoted A and B, with different h(t) and f(t, h(t), η):

(A) h(t) = γ Σᵢ tᵢ / √T,  t ∼ N(0, σ² I_{T×T}),  y = Σᵢ (−1)ⁱ tᵢ / √T + α h(t) + (1 + α) h(t)² + η

(B) h(t) = γ Σ_{i even} tᵢ tᵢ₊₁,  t ∼ N(0, σ² I_{T×T}),  y = Σᵢ (−1)ⁱ tᵢ / √T + α h(t) + η

Figure 3: RMSE of estimated conditional effect vs. strength of confounding γ, for (a) causal model A and (b) causal model B. LODE corrects for confounding and produces good effect estimates across different values of γ.

Figure 4: RMSE of estimated conditional effect vs. strength of confounding γ, for different levels of the variance of t, for (a) causal model A and (b) causal model B. Small σ leads to large conditional estimation error.

In both causal models, C-REDUNDANCY is satisfied. The constant γ controls the strength of the confounder and the constant α controls the Lipschitz constant of the outcome as a function of the confounder. We let the variance σ² = 1 unless specified otherwise. In the following, we train on 1000 samples and report the conditional effect root-mean-squared error (RMSE), computed with another 1000 samples. We used a degree-2 kernel ridge regression to fit the outcome model as a function of t. This model is correctly specified, and so the conditional expectation E[y | t = t̃] can be estimated well. We compare against a baseline estimate of the conditional effect that is the same outcome model's estimate of E[y | t = t₁*]. This baseline fails to account for confounding and produces a biased estimate of the conditional effect of do(t = t₁*), conditional on any h(t₂*) ≠ h(t₁*).

First, we investigate how well LODE can correct for confounding in both causal models. We fix α and run Euler integration until the objective E_{t₁*, h(t₂*)}[(h(t̃(s)) − h(t₂*))²] falls below a small fixed fraction of its value at initialization, where E_{t₁*, h(t₂*)} is the expectation over the evaluation set. In fig. 3, we plot the mean and standard deviation of the conditional effect RMSE, averaged over 10 seeds, for different strengths of confounding. We see that LODE is able to estimate effects well across multiple strengths of confounding while the baseline suffers.
Figure 5: RMSE of estimated conditional effect vs. step size in the Euler integrator in causal model B. Accumulated error due to a large Euler step size increases with the strength of confounding.

Second, we investigate LODE's estimation when surrogate positivity holds but the probability p(t ≈ t′(t₁*, h(t₂*))) is very small. This results in estimation error due to poor fitting of the outcome model in low-density regions of supp(t). We run LODE on simulated data where t is generated with different variances σ². For small σ, the outcome model error is large when using surrogate interventions t′(t₁*, h(t₂*)) where either h(t₂*) or t₁* is large. This leads to high-variance effect estimation, as we show in fig. 4 for both causal models. For various variances of t, we plot the mean and standard deviation of the RMSE of the estimated conditional effect over 10 seeds, against different γ.

Figure 6: RMSE of estimated conditional effect vs. degree of confounder mismatch δ, for (a) causal model A and (b) causal model B. Error due to conditioning on a mismatched value of the confounder increases with the strength of confounding but is mitigated by smoothness of the outcome function.

Third, we investigate the bias induced by imperfect estimation of the surrogate intervention in LODE for both causal models. We construct surrogate interventions t′(t₁*, h(t₂*)) by ensuring there is a confounder-value mismatch h(t̃) ≠ h(t₂*). We do this by interrupting Euler integration when the objective E_{t₁*, h(t₂*)}[(h(t′(t₁*, h(t₂*))) − h(t₂*))²] = δ > 0, where the expectation E_{t₁*, h(t₂*)} is over the evaluation set upon which we estimate conditional effects. For different α, we plot in fig. 6 the mean and standard deviation of the RMSE of the estimated conditional effect over 10 seeds, against different degrees of confounder mismatch δ. The error due to confounder mismatch is mitigated by small α, the Lipschitz constant of the outcome as a function of h(t).

Finally, we consider how the step size in Euler integration affects the quality of the estimated effects. Large step sizes may result in biased surrogate estimates; this bias is captured in the accumulation error in section 3.1. We focus on the non-linear case in causal model B, where gradient errors can accumulate (see appendix A.3.1). We demonstrate this error in fig. 5, where we plot the mean and standard deviation of the conditional effect RMSE against the strength of confounding, for different step sizes ℓ. We do not report results for larger step sizes (ℓ > 2) because Euler integration diverged for many surrogate estimates.
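The degree-2 kernel ridge outcome model used in these simulations can be sketched directly in numpy; this is a generic polynomial-kernel ridge fit on stand-in quadratic data, not the paper's exact simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: y is quadratic in t, so a degree-2 kernel model is well-specified.
n, T, lam = 300, 5, 1e-3
t = rng.normal(size=(n, T))
y = t[:, 0] - t[:, 1] + (t.sum(axis=1)) ** 2 / T + rng.normal(scale=0.1, size=n)

def K(A, B):
    """Polynomial kernel of degree 2: k(a, b) = (1 + a.b)^2."""
    return (1.0 + A @ B.T) ** 2

# Kernel ridge: dual weights alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K(t, t) + lam * np.eye(n), y)

def f_hat(t_new):
    """Estimated conditional expectation E[y | t = t_new]."""
    return K(np.atleast_2d(t_new), t) @ alpha

# In-sample residuals should be small for a well-specified quadratic model.
resid = np.abs(f_hat(t) - y).mean()
print(resid < 0.2)
```

LODE would then feed Euler-integrated surrogate interventions t̂ through f_hat to obtain conditional effect estimates, while the baseline evaluates f_hat at t₁* directly.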
Genome-wide association studies (GWAS)

In this experiment, we explore the associations of genetic factors and Celiac disease. We utilize data from the Wellcome Trust Celiac disease GWAS dataset [8, 6], consisting of individuals with celiac disease, called cases ( n = ), and controls ( n = ). We construct our dataset by filtering from the ∼ SNPs. The only preprocessing in our experiments is linkage-disequilibrium pruning of adjacent SNPs (at 0.5 R²) and PLINK [5] quality control. After this, 337,642 SNPs remain for 11,950 people. We imputed missing SNPs for each person by sampling from the marginal distribution of that SNP. No further SNP or person was dropped due to missingness. The objective of this experiment is to show that LODE corrects for confounding and recovers SNPs reported in the literature [8, 25, 14, 1]. To this end, after preprocessing, we included in our data 50 SNPs reported in [8, 25, 14, 1] and 1000 randomly sampled from the rest.

We use outcome models and functional confounders h(·) traditionally employed in the GWAS literature. We choose a linear $h(\tilde{t}) = A^\top \tilde{t}$, where $A$ is the matrix of right singular vectors of a normalized genotype matrix that correspond to the top 10 singular values [19]. The outcome model is selected from logistic Lasso linear models with various regularization strengths, via cross-validation within the training data (60% of the dataset). We defer details about the experimental setup to appendix B.

We then use this outcome model in LODE to compute causal effects on the whole filtered dataset. The effects are computed one SNP at a time. First, for each person $\tilde{t}$, create $\tilde{t}_{i,1}$ and $\tilde{t}_{i,0}$, which correspond to the $i$-th SNP set to 1 and 0 respectively, with all other SNPs the same as in $\tilde{t}$. Randomly sample an $h(t^*)$ from the marginal $p(h(t))$ and, using the outcome model $P_\theta$, compute
$$\phi(\tilde{t}, i) = \log \frac{P_\theta(y = 1 \mid t'(\tilde{t}_{i,1}, h(t^*)))}{P_\theta(y = 1 \mid t'(\tilde{t}_{i,0}, h(t^*)))}.$$
The average effect of SNP $i$ is obtained by averaging across all persons: $\frac{1}{N}\sum_{\tilde{t}} \phi(\tilde{t}, i)$. Any SNP whose effect beats a specified threshold is deemed relevant to Celiac disease by LODE. We use a 60–40% train-test split; outcome model selection is done via 5-fold cross-validation using just the training set. We use Scikit-learn [18] to fit the outcome models and for cross-validation.
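For concreteness, the two ingredients above can be sketched end-to-end on synthetic data. All specifics here are our assumptions for illustration (sizes, random genotypes, a plain logistic regression standing in for the cross-validated logistic Lasso); we also exploit that this $h$ is linear, so the surrogate intervention has a closed-form projection instead of requiring Euler integration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_people, n_snps, n_pcs = 200, 60, 5
G = rng.integers(0, 3, size=(n_people, n_snps)).astype(float)  # synthetic genotypes

# Normalize genotypes and take the top right singular vectors: h(t) = A @ t.
p = G.mean(axis=0) / 2.0
G_hat = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
_, S, Vt = np.linalg.svd(G_hat, full_matrices=False)
A = S[:n_pcs, None] * Vt[:n_pcs]          # (n_pcs, n_snps)
A_pinv = np.linalg.pinv(A)

def surrogate(t, h_star):
    # Because h is linear, projecting onto {t : A @ t = h_star} gives the surrogate.
    return t - A_pinv @ (A @ t - h_star)

# Placeholder outcome model; the paper selects a logistic Lasso by cross-validation.
y = rng.integers(0, 2, size=n_people)
model = LogisticRegression(max_iter=1000).fit(G, y)

def snp_effect(i, h_star):
    # Average log-odds difference for SNP i set to 1 vs 0, at confounder value h_star.
    diffs = []
    for t in G:
        t1, t0 = t.copy(), t.copy()
        t1[i], t0[i] = 1.0, 0.0
        s1, s0 = surrogate(t1, h_star), surrogate(t0, h_star)
        diffs.append(model.decision_function(s1[None])[0]
                     - model.decision_function(s0[None])[0])
    return float(np.mean(diffs))

h_star = A @ G[0]  # a confounder value drawn from the data
print(snp_effect(3, h_star))
```

The surrogates land exactly on the requested confounder level set, so the contrast between the two surrogates isolates the SNP's contribution at a fixed confounder value.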
Results

The best outcome model was a Lasso model, trained with regularization constant 10. We select relevant SNPs by thresholding estimated effects at a magnitude of 0.1. Out of the 1050 SNPs in our data (1000 not reported before), LODE returned 31 SNPs, of which 13 were previously reported as being associated with Celiac disease [8, 25, 14, 1]. In appendix B.2 we plot the true positive and false negative rates of identifying previously reported SNPs, as a function of the effect threshold.
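The selection rule and the rates plotted in appendix B.2 can be sketched as follows; the effect values are synthetic, and the 50/1000 split merely mirrors the SNP counts described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 1050
effects = rng.normal(scale=0.05, size=n_snps)  # synthetic estimated effects
reported = set(range(50))                      # stand-in for previously reported SNPs
effects[:50] += 0.1                            # give reported SNPs larger effects

def rates(threshold):
    # SNPs whose |effect| beats the threshold are deemed relevant.
    selected = {i for i in range(n_snps) if abs(effects[i]) > threshold}
    tpr = len(selected & reported) / len(reported)
    fnr = 1.0 - tpr
    return tpr, fnr

for thr in (0.05, 0.1, 0.2):
    print(thr, rates(thr))
```

Raising the threshold can only shrink the selected set, so the true positive rate is non-increasing in the threshold, which is the trade-off the appendix B.2 plot traces.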
SNP          Effect   Lasso Coef.
rs13151961   0.17     0.32
rs2237236    0.17     0.00
rs1738074    −        −
rs11221332   −        −

Table 1: A few SNPs previously reported as relevant and recovered by LODE, with estimated effects and Lasso coefficients. LODE produces effect estimates that do not rely purely on the coefficients.

In table 1, we list a few SNPs that were both deemed relevant by LODE and reported in the existing literature [8, 25, 14, 1], together with their effects and their Lasso coefficients. The full list is in table 2 in appendix B. If LODE could not adjust for confounding, the Lasso coefficients would dictate the effects; a 0 coefficient would mean a 0 effect. However, the two pairs of SNPs in table 1 show that the effects estimated by LODE do not rely solely on the Lasso coefficients. For the first pair (rs13151961, rs2237236), the effect is the same but the coefficient of one is 0 while the other is positive. We note that rs2237236 was found to be associated with ulcerative colitis [12, 2], an inflammatory bowel disease that has been reported to share some common genetic basis with celiac disease [16]. For the second pair (rs1738074, rs11221332), the magnitude of the effect is smaller for the former, but the coefficient is larger. Thus, LODE adjusts for confounding factors that the outcome model ignored.
When positivity is violated in traditional OBS-CI, not all effects are estimable without further assumptions. In such cases, practitioners have to turn to parametric models to estimate causal effects. However, parametric models can be misspecified when used without underlying causal mechanistic knowledge. We develop a new general setting of observational causal effect estimation called estimation with functional confounders (EFC), where the confounder can be expressed as a function of the data, meaning positivity is violated. Even when positivity is violated, the effects of many functional interventions are estimable. We develop a sufficient condition called functional positivity (F-POSITIVITY) to estimate effects of functional interventions. Such effects can be of independent interest, like the effect of the cumulative dosage of a drug instead of the joint effects of multiple dosages at different times.

Second, we prove a necessary condition for nonparametric estimation of effects of the full intervention. We propose the C-REDUNDANCY condition, under which the effect of the full intervention on t is estimable without parametric restrictions. We develop Level-set Orthogonal Descent Estimation (LODE), which computes surrogate interventions whose effects are estimable and match a conditional effect of interest. Further, we give bounds on the errors (theorem 2) induced by imperfect estimation of the surrogate intervention. Finally, we empirically demonstrate LODE's ability to correct for confounding in both simulated and real data.
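As a compact illustration of the pipeline summarized above, here is a toy end-to-end run under assumptions of our own choosing: a 2-D treatment, the linear confounder $h(t) = t_1 + t_2$, and an outcome $y = (t_1 - t_2) + 2h(t) + \text{noise}$, so that $\nabla_t f = (1, -1)$ is orthogonal to $\nabla h = (1, 1)$ and C-REDUNDANCY holds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5000
t = rng.normal(size=(n, 2))
h = lambda v: v[0] + v[1]
y = (t[:, 0] - t[:, 1]) + 2.0 * (t[:, 0] + t[:, 1]) + 0.1 * rng.normal(size=n)

# Step 1: regress y on t.
f_hat = LinearRegression().fit(t, y)

# Step 2: Euler-integrate d t/ds = -(h(t) - h_target) * grad h(t) from the
# intervention point until it hits the desired confounder level set.
t_int = np.array([5.0, -1.0])   # intervention of interest
h_target = 1.0                  # confounder value to condition on
grad_h = np.array([1.0, 1.0])
t_surr = t_int.copy()
while abs(h(t_surr) - h_target) > 1e-10:
    t_surr = t_surr - 0.1 * (h(t_surr) - h_target) * grad_h

# Step 3: evaluate the fitted outcome model at the surrogate.
estimate = f_hat.predict(t_surr[None, :])[0]
truth = (t_int[0] - t_int[1]) + 2.0 * h_target  # = 8 by construction
print(estimate, truth)
```

Because the outcome's treatment gradient is orthogonal to the confounder's gradient, moving along the descent path leaves the conditional effect unchanged, and the fitted model evaluated at the surrogate recovers the effect of the original intervention.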
Future. A few directions of improvement remain, which we elaborate next. First, F-POSITIVITY may not hold for all functions g(t) that we want to intervene on. Instead, one could compute a "projection" $g_\Pi$ onto the space of functions that satisfy F-POSITIVITY and inspect the effects defined by $g_\Pi$ instead. A second direction of interest is to let h(t) account for only a part of the confounding, meaning ignorability is violated. This bias could be mitigated under smoothness conditions on the outcome function and its interaction with the degree of violation of ignorability.

Finally, LODE's search strategy is Euler integration, which is equivalent to gradient descent with a fixed step size. Optimization techniques like momentum, rescaling the gradient with an adaptive matrix, and using second-order Hessian information speed up gradient descent. However, if there are many local or global minima of $(h(\tilde{t}) - h(t^*))^2$, such techniques may result in a different solution than Euler integration, which could mean that effect estimates are biased. One extension of LODE would allow for search strategies that use such techniques.

Broader Impact
Our work mainly applies to causal inference where confounders are specified as functions of observed data, such as in problems in genetics and healthcare. We choose to assess the impact of our work through its applications in these fields. A positive impact of the work is that better estimates of causal effects help guide treatment for people and aid in understanding the biological pathways of diseases. However, in healthcare, data collected in hospitals has biases. If, for instance, a certain demographic of people has more complete data collected about them, then this demographic would have better-quality effect estimates, potentially meaning that they receive better treatment. This problem could be characterized by evaluating the positivity of treatment and the completeness of confounders in electronic health record data split by demographics.
Acknowledgements
The authors were partly supported by NIH/NHLBI Award R01HL148248 and by NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. The authors would like to thank Xintian Han, Raghav Singhal, Victor Veitch, Fredrik D. Johansson, and the reviewers for thoughtful feedback. The authors would also like to thank Mukund Sudarshan and Prof. Sriram Sankararaman for help with running the GWAS experiments.
References

[1] Svetlana Adamovic, SS Amundsen, BA Lie, AH Gudjonsdottir, H Ascher, J Ek, DA Van Heel, S Nilsson, LM Sollid, and Å Torinsson Naluai. Association study of IL2/IL21 and FcgRIIa: significant association with the IL2/IL21 region in Scandinavian coeliac disease families. Genes and Immunity, 9(4):364, 2008.
[2] Carl A Anderson, Gabrielle Boucher, Charlie W Lees, Andre Franke, Mauro D'Amato, Kent D Taylor, James C Lee, Philippe Goyette, Marcin Imielinski, Anna Latiano, et al. Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nature Genetics, 43(3):246, 2011.
[3] Uri M Ascher and Linda R Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations, volume 61. SIAM, 1998.
[4] William Astle, David J Balding, et al. Population structure and cryptic relatedness in genetic association studies. Statistical Science, 24(4):451–471, 2009.
[5] Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4(1):s13742-015, 2015.
[6] Wellcome Trust Case Control Consortium et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661, 2007.
[7] J. Correa and E. Bareinboim. A calculus for stochastic interventions: Causal effect identification and surrogate experiments. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, 2020. AAAI Press.
[8] Patrick CA Dubois, Gosia Trynka, Lude Franke, Karen A Hunt, Jihane Romanos, Alessandra Curtotti, Alexandra Zhernakova, Graham AR Heap, Róza Ádány, Arpo Aromaa, et al. Multiple common variants for celiac disease influencing immune gene expression. Nature Genetics, 42(4):295, 2010.
[9] Frederick Eberhardt and Richard Scheines. Interventions and causal inference. Philosophy of Science, 74(5):981–995, 2007.
[10] Miguel A Hernán and James M Robins. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC, 2020.
[11] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011. doi: 10.1198/jcgs.2010.08162.
[12] Lucia A Hindorff, Praveen Sethupathy, Heather A Junkins, Erin M Ramos, Jayashri P Mehta, Francis S Collins, and Teri A Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, 2009.
[13] Morris W Hirsch, Robert L Devaney, and Stephen Smale. Differential Equations, Dynamical Systems, and Linear Algebra, volume 60. Academic Press, 1974.
[14] Karen A Hunt, Alexandra Zhernakova, Graham Turner, Graham AR Heap, Lude Franke, Marcel Bruinenberg, Jihane Romanos, Lotte C Dinesen, Anthony W Ryan, Davinder Panesar, et al. Novel celiac disease genetic determinants related to the immune response. Nature Genetics, 40(4):395, 2008.
[15] Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and David Heckerman. Fast linear mixed models for genome-wide association studies. Nature Methods, 8(10):833, 2011.
[16] Virginia Pascual, Romina Dieli-Crimi, Natalia López-Palacios, Andrés Bodas, Luz María Medrano, and Concepción Núñez. Inflammatory bowel disease and celiac disease: overlaps and differences. World Journal of Gastroenterology, 20(17):4846, 2014.
[17] Judea Pearl et al. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
[18] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[19] Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904, 2006.
[20] Rajesh Ranganath and Adler Perotte. Multiple causal inference with latent confounding. arXiv preprint arXiv:1805.08273, 2018.
[21] Marc Ratkovic. Balancing within the margin: Causal effect estimation with support vector machines. Department of Politics, Princeton University, Princeton, NJ, 2014.
[22] James M Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, volume 1999, pages 6–10. Indianapolis, IN, 2000.
[23] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[24] Donald B Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.
[25] Ludvig M Sollid. Coeliac disease: dissecting a complex inflammatory disorder. Nature Reviews Immunology, 2(9):647, 2002.
[26] Michael Spivak. Calculus on Manifolds: A Modern Approach to Classical Theorems of Advanced Calculus. CRC Press, 2018.
[27] Gerald Teschl. Ordinary Differential Equations and Dynamical Systems, volume 140. American Mathematical Society, 2012.
[28] Timothy Thornton and Michael Wu. Summer Institute in Statistical Genetics 2015.
[29] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
[30] Yixin Wang and David M Blei. The blessings of multiple causes. Journal of the American Statistical Association, 2019.
[31] Jianming Yu, Gael Pressoir, William H Briggs, Irie Vroh Bi, Masanori Yamasaki, John F Doebley, Michael D McMullen, Brandon S Gaut, Dahlia M Nielsen, James B Holland, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics, 38(2):203, 2006.
A Theoretical details

A.1 A note about the assumptions

In theorem 1, assumption 1 consists of three parts that can all be validated on observed data: 1) that the gradient flow converges, 2) that the confounder value of the surrogate matches the confounder value whose effect is of interest, and 3) that the surrogate intervention lies in the support of the pre-outcome variables. Assumption 2 is required for expectations and their gradients to exist and be finite. In theorem 2, assumption 1 requires a consistent estimator of $\mathbb{E}[y \mid t]$, which can be provided by regression. Assumption 3 lists regularity conditions which help control how the surrogate estimation error propagates to the effect error.

A.2 Proof of Theorem 1
We restate the theorem for completeness:

Theorem 1. Assume C-REDUNDANCY holds, and assume the following:

1. Let $t'(t^*, h(t^*))$ be the limiting solution to the gradient flow equation
$$\frac{d\tilde{t}(s)}{ds} = -\nabla_{\tilde{t}}\, \tfrac{1}{2}\bigl(h(\tilde{t}(s)) - h(t^*)\bigr)^2,$$
initialized at $\tilde{t}(0) = t^*$; i.e., $t'(t^*, h(t^*)) = \lim_{s\to\infty} \tilde{t}(s)$. Further, let $h(t'(t^*, h(t^*))) = h(t^*)$ and $t'(t^*, h(t^*)) \in \mathrm{supp}(t)$.

2. $f(\tilde{t}, h(\tilde{t}), \eta)$ and $h(\tilde{t})$, as functions of $\tilde{t}$ and $h(\tilde{t})$, are continuous and differentiable, and the derivatives exist for all $\tilde{t}, \eta$. Let $\nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)$ exist and be bounded and integrable w.r.t. the probability measure corresponding to $p(\eta)$, for all values of $\tilde{t}$ and $h(\tilde{t})$.

Then the conditional effect (and therefore the average effect) is identified:
$$\phi(t^*, h(t^*)) = \phi\bigl(t'(t^*, h(t^*)),\, h(t'(t^*, h(t^*)))\bigr) = \mathbb{E}[y \mid t = t'(t^*, h(t^*))]. \tag{10}$$

Proof. Recall the definition of the conditional effect, $\phi(\tilde{t}, h(\tilde{t})) = \mathbb{E}_\eta f(\tilde{t}, h(\tilde{t}), \eta)$, and that $\nabla_{\tilde{t}}$ is the gradient with respect to the first argument of $f$, that is $\tilde{t}$. First, by assumption 2, $\mathbb{E}$ and $\nabla$ commute under the dominated convergence theorem. Then, by C-REDUNDANCY,
$$\nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*))^\top \nabla_{\tilde{t}} h(\tilde{t}) = \nabla_{\tilde{t}} \mathbb{E}_\eta f(\tilde{t}, h(t^*), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t}) = \mathbb{E}_\eta\bigl[\nabla_{\tilde{t}} f(\tilde{t}, h(t^*), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t})\bigr] = 0.$$

Recall the gradient flow equation $d\tilde{t}(s)/ds = -\nabla_{\tilde{t}}\, \tfrac12 (h(\tilde{t}) - h(t^*))^2$. We refer to the gradient evaluated at $\tilde{t}$ as $\Delta\tilde{t} = -\nabla_{\tilde{t}}\, \tfrac12 (h(\tilde{t}) - h(t^*))^2 = -(h(\tilde{t}) - h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t})$. We will express $\phi(t'(t^*, h(t^*)), h(t^*))$ in terms of the starting point $\phi(t^*, h(t^*))$ and the gradient flow equation. Let the solution path of the gradient flow equation be $C$, with $t^*$ and $t'(t^*, h(t^*))$ as the starting and ending points respectively. By the gradient theorem [26], $\phi(t^*, h(t^*))$ and $\phi(t'(t^*, h(t^*)), h(t^*))$ are related via the line integral over $C$:
$$\int_C \nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*)) \cdot d\tilde{t} = \phi(t'(t^*, h(t^*)), h(t^*)) - \phi(t^*, h(t^*)).$$
Let $\tilde{t}(s)$ be a parametrization of the solution path $C$ by the scalar time $s \in [0, \infty)$.

Now, to obtain the value of $\phi(t'(t^*, h(t^*)), h(t^*))$, we compute the line integral over the vector field defined by $\nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*))$, which exists by assumption 2 in theorem 1, evaluated along the path $C$ defined by $\Delta\tilde{t}(s)$:
$$\begin{aligned}
\phi(t'(t^*, h(t^*)), h(t^*)) &= \phi(t^*, h(t^*)) + \int_C \nabla_{\tilde{t}} \phi(\tilde{t}, h(t^*)) \cdot d\tilde{t} \\
&= \phi(t^*, h(t^*)) + \int_0^\infty \nabla_{\tilde{t}} \phi(\tilde{t}(s), h(t^*))^\top \frac{d\tilde{t}(s)}{ds}\, ds \\
&= \phi(t^*, h(t^*)) + \int_0^\infty \nabla_{\tilde{t}} \phi(\tilde{t}(s), h(t^*))^\top \Delta\tilde{t}(s)\, ds \\
&= \phi(t^*, h(t^*)) + \int_0^\infty -\bigl(h(\tilde{t}(s)) - h(t^*)\bigr)\, \nabla_{\tilde{t}} \phi(\tilde{t}(s), h(t^*))^\top \nabla_{\tilde{t}} h(\tilde{t}(s))\, ds \\
&= \phi(t^*, h(t^*)) + 0 \quad \{\text{by C-REDUNDANCY}\}. \tag{11}
\end{aligned}$$
Finally, by assumption 1 in theorem 1, $h(t'(t^*, h(t^*))) = h(t^*)$, and so
$$\phi(t^*, h(t^*)) = \phi(t'(t^*, h(t^*)), h(t^*)) = \phi\bigl(t'(t^*, h(t^*)), h(t'(t^*, h(t^*)))\bigr). \tag{12}$$
For clarity, the same equation, using $t'$ and suppressing the dependence on $(t^*, h(t^*))$:
$$\phi(t^*, h(t^*)) = \phi(t', h(t^*)) = \phi(t', h(t')). \tag{13}$$
Under the causal model for EFC, the outcome is $y = f(t, h(t), \eta)$. Then, $\forall \tilde{t} \in \mathrm{supp}(p(t))$,
$$\mathbb{E}[y \mid t = \tilde{t}] = \mathbb{E}_\eta[f(\tilde{t}, h(\tilde{t}), \eta)] = \phi(\tilde{t}, h(\tilde{t})). \tag{14}$$
Using that $t'(t^*, h(t^*)) \in \mathrm{supp}(p(t))$ and eqs. (13) and (14), the conditional effect is identified:
$$\phi(t^*, h(t^*)) = \phi\bigl(t'(t^*, h(t^*)), h(t'(t^*, h(t^*)))\bigr) = \mathbb{E}[y \mid t = t'(t^*, h(t^*))]. \tag{15}$$
Thus, the conditional effect and, consequently, the average effect are identified as $\mathbb{E}[y \mid t = t'(t^*, h(t^*))]$ and $\tau(t^*) = \mathbb{E}_{h(t)}\, \mathbb{E}[y \mid t = t'(t^*, h(t))]$ respectively. ∎

Note about convergence of gradient flow.
Any ODE's solution, if it exists and converges, converges to an ω-limit set [27]. An ω-limit set is nonempty when the solution path lies entirely in a closed and bounded set, and it can consist of limit cycles, equilibrium points, or neither [13, 27]. A gradient flow equation $d\tilde{t}(s)/ds = -\nabla h(\tilde{t})$ (also called a gradient system) has the special property that its ω-limit set consists only of critical points of $h(\tilde{t})$; critical points of $h(\tilde{t})$ are also equilibrium points of the gradient flow equation [13]. Further, if $\nabla h(\tilde{t})$ exists and is bounded and $h(\tilde{t})$ has bounded sublevel sets ($\{\tilde{t} : h(\tilde{t}) \le c\}$), then the solution to the gradient flow equation lies entirely within a bounded set. This is because $h(\tilde{t}(s))$ always decreases along the solution path, so the solution remains in any sublevel set it started in. Thus, if $h(\tilde{t})$ has bounded sublevel sets, the solution of the gradient flow equation converges only to critical points of $h(\tilde{t})$.

A.3 Estimation error in LODE
Theorem 2. Consider the conditional effect $\phi(t^*, h(t^*))$. Let $\hat{t}(t^*, h(t^*))$ be LODE's estimate of the surrogate intervention, computed via Euler integration of the gradient flow
$$\frac{d\tilde{t}(s)}{ds} = -\nabla_{\tilde{t}}\, \tfrac12\bigl(h(\tilde{t}(s)) - h(t^*)\bigr)^2,$$
initialized at $\tilde{t}(0) = t^*$. Assume the true surrogate $t'(t^*, h(t^*))$ exists and is the limiting solution to the gradient flow equation.

1. Let the finite-sample estimator of $\mathbb{E}[y \mid t = \tilde{t}]$ be $\hat{f}(\tilde{t})$. Let the error be bounded for all $\tilde{t}$: $|\hat{f}(\tilde{t}) - \mathbb{E}[y \mid t = \tilde{t}]| \le c(N)$, where $N$ is the sample size and $\lim_{N\to\infty} c(N) = 0$.

2. Assume $K$ Euler integrator steps, each of size $\ell$, were taken to find the surrogate estimate $\hat{t}(t^*, h(t^*))$. Let the maximum confounder mismatch be $\max_{i \le K} (h(\tilde{t}_i) - h(t^*)) = M$.

3. Let $L_{z,\tilde{t}}$ be the Lipschitz constant of $\phi(\tilde{t}, h(\tilde{t}))$ as a function of $h(\tilde{t})$, for fixed $\tilde{t}$. Let $L_e$ be the Lipschitz constant of $\mathbb{E}[y \mid t = \tilde{t}] = \phi(\tilde{t}, h(\tilde{t}))$ as a function of $\tilde{t}$. Assume $h$ has a gradient with bounded norm, $\|\nabla h(\tilde{t})\| < L_h$. Assume $\phi$'s Hessian has bounded eigenvalues: $\forall \tilde{t}_1, \tilde{t}_2$, $\|\nabla^2_t \phi(\tilde{t}_1, h(\tilde{t}_2))\| \le \sigma_{H_\phi}$.

The conditional-effect estimate error, $\xi(t^*, h(t^*)) = |\hat{f}(\hat{t}) - \phi(t^*, h(t^*))|$, is upper bounded by
$$c(N) + \min\Bigl( L_e\|t' - \hat{t}\|, \;\; \tfrac{K\ell^2}{2}\bigl(O(\ell) + M^2 \sigma_{H_\phi} L_h^2\bigr) + L_{z,\hat{t}}\|h(\hat{t}) - h(t^*)\| \Bigr). \tag{16}$$

Proof (of theorem 2). Recall the definition of the conditional effect: $\phi(\tilde{t}, h(\tilde{t})) = \mathbb{E}_\eta f(\tilde{t}, h(\tilde{t}), \eta)$. LODE's estimate of the conditional effect is $\hat{f}(\hat{t}(t^*, h(t^*)))$. We suppress the dependence on $(t^*, h(t^*))$ and use $t'$ and $\hat{t}$ for the true and estimated surrogate interventions respectively. Note $\hat{f}$ is the estimate of the conditional expectation $\mathbb{E}[y \mid t = \tilde{t}]$, learned from $N$ samples. We first split the error into two parts and bound each separately:
$$\begin{aligned}
|\xi(t^*, h(t^*))| = |\hat{f}(\hat{t}) - \phi(t^*, h(t^*))| &\le |\hat{f}(\hat{t}) - \phi(\hat{t}, h(\hat{t}))| + |\phi(\hat{t}, h(\hat{t})) - \phi(t^*, h(t^*))| \\
&\le c(N) + |\phi(\hat{t}, h(\hat{t})) - \phi(\hat{t}, h(t^*))| + |\phi(\hat{t}, h(t^*)) - \phi(t^*, h(t^*))|.
\end{aligned}$$
The first of the two remaining terms is bounded by the Lipschitz property of $\phi$ as a function of $h(\tilde{t})$ with fixed first argument $\tilde{t} = \hat{t}$:
$$|\phi(\hat{t}, h(\hat{t})) - \phi(\hat{t}, h(t^*))| \le L_{z,\hat{t}}\, |h(\hat{t}) - h(t^*)|.$$
We now bound the remaining term. Recall that LODE's computation of the surrogate intervention involved $K$ gradient steps, each of size $\ell$. We work with a constant step size, but the analysis generalizes to non-uniform step sizes. Indexing steps by $i$, let $d_i = h(\tilde{t}_i) - h(t^*)$ be the confounder mismatch at the $i$-th iterate. Then $\hat{t} = t^* - \ell \sum_{i=0}^{K-1} d_i \nabla_{\tilde{t}} h(\tilde{t}_i)$. With $\tilde{t}_K = \hat{t}$ and $\tilde{t}_0 = t^*$, we express the error $\phi(\hat{t}, h(t^*)) - \phi(t^*, h(t^*))$ as a telescoping sum and use the Taylor expansion of $\phi(\tilde{t}, h(t^*))$ in its first argument $\tilde{t}$:
$$\begin{aligned}
\phi(\hat{t}, h(t^*)) - \phi(t^*, h(t^*)) &= \sum_{i=0}^{K-1} \phi(\tilde{t}_{i+1}, h(t^*)) - \phi(\tilde{t}_i, h(t^*)) \\
&= \sum_{i=0}^{K-1} \nabla_{\tilde{t}}\phi(\tilde{t}_i, h(t^*))^\top (\tilde{t}_{i+1} - \tilde{t}_i) + \tfrac12 (\tilde{t}_{i+1} - \tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))(\tilde{t}_{i+1} - \tilde{t}_i) + O(\|\tilde{t}_{i+1} - \tilde{t}_i\|^3) \\
&= \sum_{i=0}^{K-1} -\ell d_i\, \nabla_{\tilde{t}}\phi(\tilde{t}_i, h(t^*))^\top \nabla_{\tilde{t}} h(\tilde{t}_i) + \tfrac{(\ell d_i)^2}{2}\, \nabla_{\tilde{t}} h(\tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t}_i) + O(\ell^3) \\
&= O(K\ell^3) + \sum_{i=0}^{K-1} \tfrac{(\ell d_i)^2}{2}\, \nabla_{\tilde{t}} h(\tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t}_i) \quad \{\text{first-order terms vanish by C-REDUNDANCY}\} \\
&\le O(K\ell^3) + \sum_{i=0}^{K-1} \tfrac{\ell^2 M^2}{2}\, \bigl|\nabla_{\tilde{t}} h(\tilde{t}_i)^\top \nabla^2_t \phi(\tilde{t}_i, h(t^*))\, \nabla_{\tilde{t}} h(\tilde{t}_i)\bigr| \\
&\le O(K\ell^3) + \sum_{i=0}^{K-1} \tfrac{\ell^2 M^2}{2}\, \sigma_{H_\phi}\, \|\nabla_{\tilde{t}} h(\tilde{t}_i)\|^2 \;\le\; O(K\ell^3) + \tfrac{K\ell^2 M^2}{2}\, \sigma_{H_\phi} L_h^2 \;=\; \tfrac{K\ell^2}{2}\bigl(O(\ell) + M^2 \sigma_{H_\phi} L_h^2\bigr),
\end{aligned}$$
where the inequalities follow from the maximum value of $(h(\tilde{t}_i) - h(t^*))$, the bounded eigenvalues of the Hessian of $\phi$, and the bounded gradient norm of $h(\tilde{t})$.

Another way to bound the error is via the Lipschitz constant $L_e$ of the conditional expectation as a function of $\tilde{t}$:
$$|\phi(\hat{t}, h(\hat{t})) - \phi(t^*, h(t^*))| = |\phi(\hat{t}, h(\hat{t})) - \phi(t', h(t'))| \le L_e \|t' - \hat{t}\|.$$
The bound follows:
$$|\xi(t^*, h(t^*))| \le c(N) + \min\Bigl( L_e\|t' - \hat{t}\|, \;\; \tfrac{K\ell^2}{2}\bigl(O(\ell) + M^2 \sigma_{H_\phi} L_h^2\bigr) + L_{z,\hat{t}}\|h(\hat{t}) - h(t^*)\| \Bigr). \qquad\blacksquare$$

A.3.1 A note on linear confounder functions and LODE
In the proof above, the error in Euler integration accumulates due to terms of the form $\nabla_{\tilde{t}} h(\tilde{t})^\top \nabla^2_t f(\tilde{t}, h(t^*), \eta)\, \nabla_{\tilde{t}} h(\tilde{t})$. For a linear confounder function, which satisfies $\nabla_{\tilde{t}} h(\tilde{t}) = \beta$, such terms can be expressed as
$$\beta^\top \nabla_{\tilde{t}}\bigl(\nabla_{\tilde{t}} f(\tilde{t}, h(t^*), \eta)^\top \beta\bigr) = \beta^\top \nabla_{\tilde{t}}(0) = 0,$$
since $\nabla_{\tilde{t}} f(\tilde{t}, h(t^*), \eta)^\top \beta = 0$ by C-REDUNDANCY. Thus, this error does not accumulate even with large step sizes. Further, note that the gradient flow equation in LODE for causal model A in section 4 is a linear ODE whose solution has a closed-form expression, so one can estimate the surrogate without numerical integration [27].

A.4 Proof of sufficiency of Effect Connectivity

Theorem 3.
Under Effect Connectivity, eq. (9), any surrogate intervention $t'(t^*, h(t^*)) \in \mathrm{supp}(t)$.

Proof. Recall $\phi(\tilde{t}, h(\tilde{t})) = \mathbb{E}_\eta f(\tilde{t}, h(\tilde{t}), \eta)$. We have, $\forall t^* \in \mathrm{supp}(p(t))$ with $p(h(t) = h(t^*)) > 0$:
$$p\bigl(\phi(t, h(t)) = \phi(t^*, h(t^*)) \mid h(t) = h(t^*)\bigr) > 0 \;\implies\; \exists\, t' \in \mathrm{supp}(t) \text{ with } h(t') = h(t^*) \text{ s.t. } \phi(t', h(t^*)) = \phi(t^*, h(t^*)).$$
Then $\phi(t^*, h(t^*)) = \phi(t', h(t^*)) = \phi(t', h(t')) = \mathbb{E}[y \mid t = t']$. ∎

A.5 Necessity of Effect Connectivity for nonparametric effect estimation in EFC

Theorem 4. Effect Connectivity is necessary for nonparametric effect estimation in EFC.

Proof (of theorem 4). Let the outcome be $y = f(t, h(t))$. Recall the joint distribution $p(t, y)$ and let $h(t)$ be the confounder. Let Effect Connectivity be violated; i.e., there exists a non-measure-zero subset $B \subseteq \mathrm{supp}(t) \times \mathrm{supp}(h(t))$ (non-zero w.r.t. the product measure over $\mathrm{supp}(t) \times \mathrm{supp}(h(t))$) such that
$$\forall (\tilde{t}, h(\tilde{t})) \in B, \quad p\bigl(f(t, h(t)) = f(\tilde{t}, h(\tilde{t})) \mid h(t) = h(\tilde{t})\bigr) = 0.$$
We construct a second outcome $y_2 = f_2(t, h(t))$ and show that the conditional effects for this new outcome are different from the ones defined by $f$ for all $(\tilde{t}, h(\tilde{t})) \in B$. Let
$$f_2(\tilde{t}, h(\tilde{t})) = f(\tilde{t}, h(\tilde{t})) + \mathbb{1}\bigl[(\tilde{t}, h(\tilde{t})) \in B\bigr].$$
We have $f_2(\tilde{t}, h(\tilde{t})) = f(\tilde{t}, h(\tilde{t}))$ for all $\tilde{t} \in \mathrm{supp}(t)$, as the additional term in $f_2$ is only present for $(\tilde{t}, h(\tilde{t})) \in B$; this follows from the fact that $\forall \tilde{t} \in \mathrm{supp}(t)$, $(\tilde{t}, h(\tilde{t})) \notin B$, since $p[f(t, h(t)) = f(\tilde{t}, h(\tilde{t})) \mid h(t) = h(\tilde{t})] > 0$ for such $\tilde{t}$. Hence $p(y_2, t)$ and $p(y, t)$ are equal in distribution, since $B \cap \mathrm{supp}(t, h(t)) = \emptyset$. However, the conditional effects are different for the outcomes $y$ and $y_2$ for all $(\tilde{t}, h(\tilde{t})) \in B$:
$$\mathbb{E}[y \mid do(t = \tilde{t}), h(t) = h(\tilde{t})] \ne \mathbb{E}[y_2 \mid do(t = \tilde{t}), h(t) = h(\tilde{t})].$$
Therefore, for causal models that violate Effect Connectivity, there exist observationally equivalent causal models with different causal effects. Thus, nonparametric effect estimation is impossible, and Effect Connectivity is required for EFC. ∎

A.6 Algorithmic details
We give pseudocode for LODE in algorithm 1.

Algorithm 1: LODE for $do(t = t^*)$
Input: Functional confounder $h(t)$; tolerance $\epsilon$
Output: Conditional effect of $t^*, h(t^*)$
1. Regress $y$ on $t$ and compute $\hat{f}(\cdot) := \arg\min_{u \in \mathcal{F}} \mathbb{E}_{y,t}(y - u(t))^2$.
2. To estimate the effect of $t^*, h(t^*)$, compute the surrogate intervention $t'(t^*, h(t^*))$ by Euler-integrating the gradient flow equation, initialized at $\tilde{t}_0 = t^*$, until $\tfrac12(h(\tilde{t}_s) - h(t^*))^2 < \epsilon$:
$$\frac{d\tilde{t}(s)}{ds} = -\nabla_{\tilde{t}}\, \tfrac12\bigl(h(\tilde{t}_s) - h(t^*)\bigr)^2.$$
3. Return $\hat{f}(t'(t^*, h(t^*)))$.

Extensions of LODE. Consider that we have access to $m(h(t))$ for some bijective differentiable function $m(\cdot)$, instead of $h(t)$. The orthogonality in C-REDUNDANCY still holds:
$$\nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)^\top \nabla_{\tilde{t}} m(h(\tilde{t})) = m'(h(\tilde{t}))\, \nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t}) = 0.$$
Then, using $m(h(\tilde{t}))$ to compute the surrogate $t'(t^*, h(t^*))$, LODE would estimate valid effects. Similarly, LODE can estimate the effect on any differentiable transformation $m(y)$ of the outcome, because
$$\nabla_{\tilde{t}} m(y_{\tilde{t}})^\top \nabla_{\tilde{t}} h(\tilde{t}) = m'(y_{\tilde{t}})\, \nabla_{\tilde{t}} f(\tilde{t}, h(\tilde{t}), \eta)^\top \nabla_{\tilde{t}} h(\tilde{t}) = 0.$$

B Experimental Details
B.1 Functional confounders in GWAS

Here, we show how $h(t) = At$ and $A$ reflect the traditional PCA-based adjustment in GWAS. Recall that population structure acts as a confounder in GWAS. Price et al. [19] demonstrated that using the principal components of the normalized genetic relationship matrix adjusts for confounding due to population structure in GWAS. Let the genotype matrix be $G$, with people as rows and SNPs as columns, such that each element is one of $0, 1/2, 1$, where $1/2$ and $1$ refer to one and two copies of the allele at the position of the SNP, respectively. With $p_s$ as the allele frequency at SNP $s$ [28], $\Phi$ is the genetic relationship matrix whose elements are defined as
$$\Phi_{i,j} = \frac{1}{S} \sum_{s=1}^{S} \frac{(G_{i,s} - p_s)(G_{j,s} - p_s)}{p_s(1 - p_s)}.$$
Then, Price et al. [19] compute the top $K$ (10 suggested) principal components of $\Phi$ to use as the axes of variation due to population structure. The eigenvectors of $\Phi$ are the left singular vectors of $\hat{G}$, where $\Phi = \hat{G}\hat{G}^\top$; they capture independent axes of variation across individuals.

Price et al. [19] exploit the idea that if a SNP aligns with some of the axes of variation, this is due to population structure. These axes of variation are the top $K$ eigenvectors $U$ of $\Phi = \hat{G}\hat{G}^\top \approx U\Lambda U^\top$, where $U \in \mathbb{R}^{N \times K}$, $\Phi \in \mathbb{R}^{N \times N}$, and $\Lambda \in \mathbb{R}^{K \times K}$. Here, $U$ also contains the left singular vectors of $\hat{G} \approx U\Sigma V^\top$, where $\Sigma \in \mathbb{R}^{K \times K}$ is diagonal and $V \in \mathbb{R}^{S \times K}$. We use $\approx$ to denote that the chosen $K$ eigenvectors explain the variation due to population structure; what remains are random mutations.

Let the $s$-th SNP be $\hat{G}_{\cdot,s} \in \mathbb{R}^N$, a column of $\hat{G}$. In Price et al. [19], population structure in the $s$-th SNP is captured by $\hat{G}_{\cdot,s}^\top U$. In words, projecting the SNP $\hat{G}_{\cdot,s}$ onto the axes of variation in individuals gives the population structure between the $s$-th SNP and the outcome. This projection $\hat{G}_{\cdot,s}^\top U$ is a row of $\hat{G}^\top U \in \mathbb{R}^{S \times K}$. In turn, $\hat{G}^\top U \in \mathbb{R}^{S \times K}$ is the population structure in all SNPs. Projecting this population structure onto the genotype of an individual gives the confounding due to population structure among the SNPs present in the genotype. With $G_{j,\cdot} \in \{0, 1/2, 1\}^S$ as the genotype of an individual $j$, this projection is $(\hat{G}^\top U)^\top G_{j,\cdot}$. However, $\hat{G} \approx U\Sigma V^\top$ implies that $\hat{G}^\top U \approx V\Sigma$. Reflecting this, $h(t) = \Sigma V^\top t$ is the functional confounder for an individual $t$.

B.2 Expanded results

In table 2, we list the 13 SNPs recovered by LODE that have been previously reported as relevant to Celiac disease. In fig. 7, we plot the true positive and false negative rates among SNPs deemed relevant by LODE. The ground truth here is the set of SNPs reported as associated with celiac disease in prior literature.
SNP          Effect   Lasso Coef.
rs3748816    0.12     0.20
rs10903122   0.10     0.17
rs2816316    0.11     0.20
rs13151961   0.17     0.32
rs2237236    0.17     0.00
rs12928822   0.14     0.29
rs2187668    −        −

Table 2: Full list of SNPs previously reported as relevant that were recovered by LODE, with their estimated effects and Lasso coefficients. The effect threshold here is 0.1.

Figure 7: True positive vs. false negative rate as we vary the threshold on average effects that determines which SNPs are deemed relevant.