Efficient Least Squares for Estimating Total Effects under Linearity and Causal Sufficiency
By F. Richard Guo and Emilija Perković
Department of Statistics, University of Washington, Seattle
August 31, 2020
Recursive linear structural equation models are widely used to postulate causal mechanisms underlying observational data. In these models, each variable equals a linear combination of a subset of the remaining variables plus an error term. When there is no unobserved confounding or selection bias, the error terms are assumed to be independent. We consider estimating a total causal effect in this setting. The causal structure is assumed to be known only up to a maximally oriented partially directed acyclic graph (MPDAG), a general class of graphs that can represent a Markov equivalence class of directed acyclic graphs (DAGs) with added background knowledge. We propose a simple estimator based on recursive least squares, which can consistently estimate any identified total causal effect, under point or joint intervention. We show that this estimator is the most efficient among all regular estimators that are based on the sample covariance, which includes covariate adjustment and the estimators employed by the joint-IDA algorithm. Notably, our result holds without assuming Gaussian errors.

Keywords and phrases: total effect, structural equation model, least squares, semiparametric efficiency, partially directed acyclic graph, observational studies
1. Introduction.
A linear structural equation model (SEM) specifies a causal mechanism underlying a set of variables [Bollen, 1989]. Each variable equals a linear combination of a subset of the remaining variables plus an error term. An SEM is associated with a mixed graph, also known as a path diagram [Wright, 1921, 1934], which consists of both directed edges and bi-directed edges. A directed edge i → j represents that variable i appears as a covariate in the structural equation defining variable j. The equation for variable j takes the form

(1)  X_j = ∑_{i : i → j} γ_{ij} X_i + ϵ_j,

where ϵ_j is an error term. Often, the errors are assumed to follow a multivariate normal distribution, but that need not be the case. A bi-directed edge i ↔ j indicates that the errors ϵ_i and ϵ_j are dependent, which is assumed when there exists an unobserved (i.e., latent) confounder between i and j. The mixed graph is usually assumed to be acyclic, i.e., the graph does not contain cycles made of directed edges.

We focus on the setting where there is no unobserved confounder or selection bias, a condition also known as causal sufficiency; see Spirtes et al. [2000, Chap. 3] and Pearl [2009, Chap. 6]. In this setting, all the error terms are assumed to be mutually independent and the mixed graph associated with the linear SEM is a directed acyclic graph (DAG), often called a causal DAG. Aside from being a statistical model for observational data, the linear SEM is also a causal model in the sense that it specifies the behavior of the system under interventions (see Section 3.2). Therefore, the total causal effect of one treatment variable (point intervention) or several treatment variables (joint intervention) on some outcome variables can be defined.

The underlying causal DAG is usually unknown. In fact, linear SEMs associated with different DAGs may define the same observed distribution [Drton et al., 2011]. Without further assumptions on the error distributions, the underlying DAG can only be learned from observational data up to its Markov equivalence class, which can be uniquely represented by a completed partially directed acyclic graph (CPDAG) [Meek, 1995, Andersson et al., 1997]. Additional background knowledge, such as knowledge of certain causal relationships [Meek, 1995, Fang and He, 2020] or partial orderings [Scheines et al., 1998], restrictions on the error distributions [Hoyer et al., 2008, Peters and Bühlmann, 2014], and other assumptions [Hauser and Bühlmann, 2012, Wang et al., 2017, Rothenhäusler et al., 2018, Eigenmann et al., 2017] can be used to further refine the Markov equivalence class of DAGs, so that the causal structure is represented by a maximally oriented PDAG (MPDAG), a rather general class of graphs containing directed and undirected edges that subsumes DAGs and CPDAGs [Meek, 1995]. A total causal effect is identified given a graph if it can be expressed as a functional of the observed distribution that is the same for every DAG in the equivalence class. Recently, a necessary and sufficient graphical criterion for identification given an MPDAG was established by Perković [2020]. In general, there may be more than one identifying functional.

Naturally, the next step is to develop estimators for an identified total effect with desirable properties.
When the effect is unidentified, the reader is referred to the IDA-type approaches [Maathuis et al., 2009, Nandy et al., 2017], which are beyond the scope of this paper. Among others, we consider the following desiderata.

Completeness.
Can the estimator consistently estimate every identified effect, under either point or joint interventions?
Efficiency.
Does the estimator achieve the smallest asymptotic (co)variance compared to a reasonably large class of estimators?

To the best of our knowledge, no estimator proposed in the literature fulfills both desiderata. Indeed, the commonly used covariate adjustment estimators [Pearl, 1993, Shpitser et al., 2010, Maathuis and Colombo, 2015, Perković et al., 2015] do not exist for certain total effects under joint interventions [Nandy et al., 2017, Perković et al., 2018, Perković, 2020]. Furthermore, when they exist, even with an optimal adjustment set chosen to maximize efficiency [Henckel et al., 2019, Rotnitzky and Smucler, 2019, Witte et al., 2020], we will show that covariate adjustment can compare less favorably with estimators from a larger class.

We propose an estimator based on recursive least squares that fulfills both desiderata. In particular, our proposed estimator achieves the efficiency bound among all regular estimators that only depend on the sample covariance; see Section 6 for the precise definition of the class of estimators. Remarkably, our result holds regardless of the type of error distribution in the underlying linear SEM. Our method is implemented in the R [R Core Team, 2020] package eff2 (https://github.com/richardkwo/eff2), which stands for "efficient effect" (estimate).

The paper is organized as follows. In Section 2, we review related work on efficient estimation of total causal effects in over-identified settings. In Section 3, we introduce the preliminaries on linear structural equation models, causal graphs and the identification of total causal effects. The concept of bucket decomposition is introduced. In Section 4, we introduce a block-recursive representation for the observational data and identify the total causal effect under such a representation. We first derive the proposed least squares estimator by finding the maximum likelihood estimator (MLE) under the assumption of Gaussian errors in Section 5. We then prove the optimal efficiency of our proposed estimator under arbitrary error distributions in Section 6. Additional preliminaries, proofs and numerical results can be found in the Appendix.
2. Related work.
The statistical performance of an estimator of a total causal effect in over-identified settings has recently received more attention; see, e.g., Kuroki and Miyakawa [2003], Henckel et al. [2019], Witte et al. [2020], Gupta et al. [2020], Rotnitzky and Smucler [2019], Smucler et al. [2020], Kuroki and Nanmo [2020]. Here, "over-identified" [Koopmans and Reiersøl, 1950] refers to the fact that the total causal effect can be expressed as more than one functional of the (population) observed distribution, all of which coincide due to the additional conditional independence constraints obeyed by the observed distribution. For example, in the case where a total causal effect can be identified through covariate adjustment, usually there exists more than one valid adjustment set [Henckel et al., 2019]. This is in contrast to the more traditional setting of causal inference, where the observed data distribution is nonparametric and is not expected to satisfy extra conditional independences.

Intuitively, the conditional independences in over-identified models can be exploited to maximize asymptotic efficiency; see, e.g., Sargan [1958], Hansen [1982] for early work in this direction. Under a linear SEM with independent errors, a total causal effect can be estimated via covariate adjustment as the least squares coefficient from the regression of the outcome on the treatment and adjustment variables. Henckel et al. [2019] recently showed that, under a linear SEM with independent errors, a valid adjustment set that minimizes the asymptotic variance, also referred to as the optimal adjustment set, can be graphically characterized; see also Witte et al. [2020] for a further characterization of such an optimal set. This result was generalized by Rotnitzky and Smucler [2019] beyond linear SEMs: an optimal adjustment set is shown to always exist for point interventions, and a semiparametric efficient estimator is developed for this case. Note that, while valid adjustment sets (called "time-independent" adjustment sets by Rotnitzky and Smucler [2019]) exist for point interventions [Perković, 2020, Proposition 4.2], they may not exist for joint interventions [Nandy et al., 2017, Perković et al., 2018, Perković, 2020].

Less is known about how to efficiently estimate the total causal effect of a joint intervention, at least in a generic fashion. For linear SEMs with independent errors, with knowledge of the parents of the treatment variables in the underlying causal DAG, Nandy et al. [2017] considered two estimators for the joint-IDA algorithm, one based on recursive least squares and one based on a modified Cholesky decomposition. However, the efficiency properties of these estimators were not explored. In Section 7, numerical comparisons will show that our proposed estimator significantly outperforms the estimators of Nandy et al. [2017].

Other results on the linear SEM include explicit calculations and comparisons for typical examples with either a particular structure or only a few variables; see, e.g., Kuroki and Cai [2004], Gupta et al. [2020]. Gaussian errors are also assumed in these calculations.
3. Linear SEMs, causal graphs and effect identification.
3.1. Linear SEMs under causal sufficiency.
A linear SEM postulates a causal mechanism that generates data. Let X denote a vector of variables generated by a linear SEM, where X is indexed by V (X = X_V). Let D be the associated DAG on vertices V. For this |V|-dimensional random vector X, the model in Eq. (1) can be compactly rewritten as

(2)  X = Γ⊤X + ϵ,  Γ = (γ_{ij}),  i → j not in D ⇒ γ_{ij} = 0,

where Γ ∈ R^{|V|×|V|} is a coefficient matrix and ϵ = (ϵ_i) is a |V|-dimensional random vector. DAG D is associated with the linear SEM in Eq. (1) in the sense that the non-zero entries of Γ correspond to the edges in D.

Under causal sufficiency (no latent variables), we assume

(3)  {ϵ_i : i ∈ V} are independent,  E ϵ = 0,  E ϵϵ⊤ ≻ 0,

where for a real, symmetric matrix A, A ≻ 0 means A is positive definite. The errors {ϵ_i : i ∈ V} are not necessarily Gaussian, nor identically distributed.

The law P(X) is called the observed distribution. For a given D, we will use P_D to denote the set of possible laws of X, namely the collection of P(X) as Γ and the error distribution vary subject to Eqs. (2) and (3). The linear SEM poses certain restrictions on the set of laws P_D. Let Pa(i, D) denote the set of parents of vertex i, i.e., {j : j → i is in D}. For any P ∈ P_D, among other constraints, (i) P factorizes according to D, (ii) E[X_i | X_{Pa(i,D)}] is linear in X_{Pa(i,D)} and (iii) var[X_i | X_{Pa(i,D)}] is constant in X_{Pa(i,D)}.

We observe n independent and identically distributed (iid) samples generated by the model above, namely X^(i) = (I − Γ)^{−⊤} ϵ^(i) for i = 1, ..., n. Note that (I − Γ) is invertible because Γ can be permuted into a lower-triangular matrix according to a topological ordering (i.e., causal ordering) of the vertices in D.
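To make the sampling step concrete, here is a minimal R sketch that draws iid samples via X = (I − Γ)^{−⊤}ϵ. The three-variable DAG 1 → 2 → 3, the coefficient values and the Gaussian errors are illustrative choices of ours, not part of the model definition.

    # Minimal sketch: simulate n iid samples from a linear SEM X = Gamma' X + eps
    # for the hypothetical DAG 1 -> 2 -> 3 with independent unit-variance errors.
    set.seed(1)
    p <- 3; n <- 1000
    Gamma <- matrix(0, p, p)
    Gamma[1, 2] <- 0.8    # edge 1 -> 2 with coefficient gamma_12
    Gamma[2, 3] <- -0.5   # edge 2 -> 3 with coefficient gamma_23
    eps <- matrix(rnorm(n * p), n, p)     # independent (here Gaussian) errors
    X <- eps %*% solve(diag(p) - Gamma)   # rows are X^(i) = (I - Gamma)^{-T} eps^(i)
    round(cov(X), 2)                      # sample covariance approximates Sigma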
3.2. Interventions and total causal effects. The assumed linear SEM also dictates the behavior of the system under interventions. Let A ⊆ V be a set of vertices indexing the treatment variables X_A. We use do(X_A = x_A) to denote intervening on the variables X_A and forcing them to take the values x_A [Pearl, 1995]. We call this a point intervention if A is a singleton, and a joint intervention if A consists of several vertices, which corresponds to the case of multiple treatments. While X_A is fixed to x_A, the remaining variables are generated by their corresponding structural equations in Eq. (1), with each X_i for i ∈ A appearing in the equations replaced by the corresponding enforced value x_i [Strotz and Wold, 1960]. This generating mechanism defines the interventional distribution, denoted by P(X | do(X_A = x_A)), where the conditional probability notation is only conventional. More formally, the interventional distribution is expressed as

(4)  P(X | do(X_A = x_A)) = ∏_{j ∈ A} δ_{x_j}(X_j) ∏_{i ∉ A} P(X_i | X_{Pa(i,D)}),

where δ denotes a Dirac measure. The factor P(X_i | X_{Pa(i)}) is defined by the structural equation for X_i. Eq. (4) is known as the truncated factorization formula [Pearl, 2009], the manipulated density formula [Spirtes et al., 2000] or the g-formula [Robins, 1986].

Definition 1. Let X_A be a vector of treatment variables and X_Y with Y ∈ V \ A be an outcome variable. The total causal effect of X_A on X_Y is defined as the vector τ_AY ∈ R^{|A|}, where

(τ_AY)_i = ∂/∂x_{A_i} E[X_Y | do(X_A = x_A)],  i = 1, ..., |A|.

That is, τ_AY is the gradient of the linear map x_A ↦ E[X_Y | do(X_A = x_A)]. When multiple outcomes Y = {Y_1, ..., Y_k}, k > 1, are considered, the total causal effect of X_A on X_{Y_1}, ..., X_{Y_k} can be defined by concatenating τ_{AY_1}, ..., τ_{AY_k}. Therefore, throughout, we assume the outcome variable is a singleton without loss of generality. Each coordinate of the total causal effect τ_AY can be expressed as a sum-product of the underlying linear SEM coefficients along certain causal paths from A to Y in D, that is, certain paths of the form A_i → ··· → Y for A_i ∈ A; see also Wright [1934], Sullivant et al. [2010].
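For a point intervention under a known DAG, the sum-product characterization has a convenient matrix form: since Γ is nilpotent, (I − Γ)^{−1} = I + Γ + Γ² + ···, whose (i, y) entry sums the coefficient products over all directed paths from i to y. A two-line check, reusing p and Gamma from the sketch above:

    # Total effect of X_1 on X_3 under do(X_1 = x): entry (1, 3) of (I - Gamma)^{-1},
    # which sums path coefficient products (here the single path 1 -> 2 -> 3).
    Tmat <- solve(diag(p) - Gamma)
    Tmat[1, 3]    # = Gamma[1, 2] * Gamma[2, 3] = -0.4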
3.3. Causal graphs. Two different linear SEMs on the same set of variables can define the same observed distribution. For example, under Gaussian errors, the linear SEMs associated with DAGs A → Y and A ← Y define the same set of observed distributions, namely the set of centered bivariate Gaussian distributions. Without making additional assumptions on the error distribution, such as non-Gaussianity [Shimizu et al., 2006], partial non-Gaussianity [Hoyer et al., 2008], or equal variance of errors [Peters and Bühlmann, 2014, Chen et al., 2019], the underlying causal DAG can only be learned from the observed distribution up to its Markov equivalence class [Pearl and Verma, 1995, Chickering, 2002].

CPDAGs.
Two DAGs on the same set of vertices are Markov equivalent if they encode the same set of d-separation relations between the vertices. The d-separations between the vertices prescribe conditional independences between the corresponding variables (known as the Markov condition [Lauritzen, 1996]). A Markov equivalence class of DAGs can be uniquely represented by a completed partially directed acyclic graph (CPDAG), also known as an essential graph [Meek, 1995, Andersson et al., 1997]. A CPDAG C is a graph on the same set of vertices that can contain both directed and undirected edges. We use [C] to denote the Markov equivalence class represented by CPDAG C. A directed edge i → j in C implies that i → j is in every D ∈ [C], whereas an undirected edge i − j in C implies that there exist D_1, D_2 ∈ [C] such that i → j in D_1 but i ← j in D_2. Given a DAG D, the CPDAG C representing the Markov equivalence class of D can be drawn by keeping the skeleton of D, adding all the unshielded colliders from D and completing the orientation rules R1–R3 of Meek [1995]; see Fig. C.1 in Appendix C. For example, DAGs A → Y and A ← Y are represented by CPDAG A − Y. To slightly abuse the notation, for a distribution Q, we write Q ∈ [C] if Q factorizes according to some DAG D ∈ [C]; see Lauritzen [1996]. There are various structure learning algorithms that can be used to uncover CPDAG C from observational data. Some well-known examples are the PC algorithm [Spirtes et al., 2000] and the greedy equivalence search [Chickering, 2002]. Choosing an appropriate algorithm for the dataset at hand is beyond the scope of this paper; the reader is referred to Drton and Maathuis [2017, § 4] for a recent overview.
MPDAGs.
Certain background knowledge, if present, can be used to further orient some undirected edges in a CPDAG C. Typically, knowledge of temporal orderings can inform the orientation of certain undirected edges; see Spirtes et al. [2000]. Orienting edges in C according to such background knowledge and completing the orientation rules of Meek [1995] results in a maximally oriented partially directed graph (MPDAG) G. See Fig. C.1 and Algorithm 1 in Appendix C. MPDAGs are a rather general class of graphs that subsumes both DAGs and CPDAGs. An MPDAG G represents a restricted Markov equivalence class of DAGs, which we also denote by [G]. Analogously to the case of a CPDAG, i → j in G implies that i → j is in every D ∈ [G], and i − j in G implies that there exist D_1, D_2 ∈ [G] such that i → j in D_1 but i ← j in D_2. For the rest of the paper, we will assume that we have access to an MPDAG G that represents our structural knowledge about the underlying DAG D. That is,

(5)  causal DAG D ∈ [G],  G is an MPDAG,

where [G] represents a collection of DAGs that are Markov equivalent, but can be strictly smaller than the corresponding Markov equivalence class due to background knowledge.

3.4. Causal effect identification.
Throughout the paper we will use the following notation. Given treatment variables X_A and an outcome variable X_Y such that Y ∉ A, we are interested in learning the total causal effect τ_AY. We assume that we have access to an MPDAG G, and to observational data that are generated as iid samples from a linear SEM defined by Eqs. (2) and (3), where the causal DAG D is in [G]. Before estimation can be performed, we need to make sure that τ_AY can be identified from observational data. That is, we need to ensure that τ_AY can be expressed as a functional of the observed distribution that is the same for every DAG in [G]. We have the following graphical criterion.

Theorem 1. The total causal effect τ_AY of X_A on X_Y is identified given an MPDAG G if and only if there is no proper, possibly causal path from A to Y in G that starts with an undirected edge.

Theorem 1 is Proposition 3.2 of Perković [2020]. This result does not require that the data is generated by a linear SEM. However, Perković [2020] proves that when the criterion fails, two linear SEMs with Gaussian errors can be constructed such that their observed distributions coincide but their τ_AY's differ. Hence, even if we restrict ourselves to linear SEMs, Theorem 1 still holds.

A few terms need some explanation. A path from A to Y in G is a sequence of distinct vertices ⟨v_1, ..., v_k⟩ for k > 1 with v_1 ∈ A and v_k = Y, such that every pair of successive vertices is adjacent in G. The path is proper when only its first vertex is in A. The path is possibly causal if no edge v_l ← v_r is in G for 1 ≤ l < r ≤ k. The reader is referred to Appendix C for more graphical preliminaries. When G satisfies Theorem 1 relative to vertex sets A and Y, the interventional distribution P(X_Y | do(X_A = x_A)), and hence the total effect, can be computed from the observed distribution P(X). To express the identification formula, we require the following concepts.
3.4.1. Buckets and bucket decomposition. Let G = (V, E, U) be a partially directed graph, where V is the set of vertices, and E and U are the sets of directed and undirected edges respectively. Let B_1, ..., B_K be the maximal connected components of the undirected graph G_U := (V, ∅, U). Then V = B_1 ∪̇ ··· ∪̇ B_K, where the symbol ∪̇ denotes disjoint union. Note that all the directed edges within each B_i are due to background knowledge. If we ignore the distinction between directed and undirected edges, then the subgraph induced by each B_i is chordal [Andersson et al., 1997]. Moreover, the buckets can be arranged into a partial causal ordering B_1, ..., B_K such that

(6)  i → j ∈ E, i ∈ B_l, j ∈ B_m, l ≠ m ⇒ l < m.

One can show that such a partial causal ordering always exists, though it may not be unique; see Algorithm 2 in Appendix C for obtaining such an ordering. Our result does not depend on the particular choice of partial causal ordering. We call B_1, ..., B_K the bucket decomposition of V and call each B_k for k = 1, ..., K a bucket; see Fig. 1(a) for an example. If it is clear which graph G is being referred to, we will shorten Pa(j, G) to Pa(j) to reduce clutter. For a set of vertices C in G, we use Pa(C) := ∪_{i∈C} Pa(i) \ C to denote the set of their external parents. Clearly, Pa(B_k) ⊆ B_{[k−1]}, where B_{[k−1]} := B_1 ∪ ··· ∪ B_{k−1}.

Lemma 1. Let i and j be two distinct vertices in MPDAG G = (V, E, U) such that i → j ∈ E. Suppose that there is no undirected path from i to j in G. If there is a vertex k and an undirected path j − ··· − k in G, then i → k ∈ E.

By the definition of the parent set above we have that Pa(B_k) = ∪_{i∈B_k} Pa(i) \ B_k, k = 1, ..., K. However, since a bucket B_k is a maximal subset of V that is connected by undirected edges in G, Lemma 1 implies the following important property.

Corollary 1 (Restrictive property). Let B_1, ..., B_K be the bucket decomposition of V in MPDAG G = (V, E, U). Then all vertices in the same bucket have the same set of external parents, namely

Pa(B_k) = Pa(i) \ B_k, for any i ∈ B_k, k = 1, ..., K.

The causal identification formula for P(X_Y | do(X_A = x_A)) of Perković [2020] relies on a decomposition of certain ancestors of Y in MPDAG G according to the buckets. We call vertex i an ancestor of vertex j in G if there exists a directed path i → ··· → j in G; we use the convention that j is an ancestor of itself. We denote the set of ancestors of j in G by An(j, G), shortened to An(j).

Let G_{V\A} = (V \ A, E′, U′) denote the subgraph of G induced by the vertices V \ A, where E′ includes those edges in E that are between vertices in V \ A, and similarly for U′. Consider the set of ancestors of Y in G_{V\A}, denoted as

(7)  D := An(Y, G_{V\A}).

The bucket decomposition D_1, ..., D_K of D, induced by the bucket decomposition of V, is simply

(8)  D = ∪̇_{k=1}^K D_k,  D_k = D ∩ B_k,  k = 1, ..., K.

Lemma 2. When the criterion in Theorem 1 is satisfied, we have
Pa(D_k, G) = Pa(B_k, G) for every nonempty D_k.

Proofs of Lemmas 1 and 2 are left to Appendix B.
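Computationally, the buckets are simply the connected components of the undirected part of G. A minimal sketch, assuming the undirected edges are given as a symmetric 0/1 adjacency matrix (a representation we choose for illustration; the paper's Algorithm 2 additionally arranges the buckets into a partial causal ordering as in Eq. (6)):

    # Buckets = connected components of the undirected subgraph G_U.
    # 'undirected' is the symmetric 0/1 adjacency matrix of the undirected edges.
    buckets <- function(undirected) {
      p <- nrow(undirected)
      comp <- rep(NA_integer_, p)
      k <- 0L
      for (s in seq_len(p)) {
        if (!is.na(comp[s])) next
        k <- k + 1L
        queue <- s
        while (length(queue) > 0) {       # breadth-first search from s
          v <- queue[1]; queue <- queue[-1]
          if (!is.na(comp[v])) next
          comp[v] <- k
          queue <- c(queue, which(undirected[v, ] == 1 & is.na(comp)))
        }
      }
      split(seq_len(p), comp)             # list of buckets B_1, ..., B_K
    }
    # e.g. buckets(rbind(c(0,1,0), c(1,0,0), c(0,0,0)))  # -> {1,2}, {3}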
Theorem 2. Suppose the criterion in Theorem 1 is satisfied for A, Y in MPDAG G = (V, E, U) such that Y ∉ A. Let P(X) be the observed distribution, let D = An(Y, G_{V\A}) and let D_1, ..., D_K be the bucket decomposition of D as in Eq. (8). Then the interventional distribution P(X_Y | do(X_A = x_A)) can be identified as

(9)  P(X_Y | do(X_A = x_A)) = ∫ { ∏_{k=1}^K P(X_{D_k} | X_{Pa(D_k)}) } dX_{D\Y}

for values X_{Pa(D_k)} in agreement with x_A, where P(X_{D_k} | X_{Pa(D_k)}) ≡ 1 if D_k = ∅.

The expression in Eq. (9) is a generalization of the truncated factorization Eq. (4) from DAGs to MPDAGs. Theorem 2 holds generally even when an underlying linear SEM is not assumed.

4. Block-recursive representation.
In this section, we express the observed distribution P(X) induced by a linear SEM compatible with MPDAG G = (V, E, U) in a block-recursive form. Each block corresponds to a bucket in the bucket decomposition of V. Such a reparameterization is necessitated by the fact that the causal ordering of D is unknown, whereas the buckets can be arranged into a valid partial causal ordering as in Eq. (6). We will use this representation to compute an estimator for the total causal effect.

Recall that P_D denotes the family of laws of X arising from a linear SEM Eqs. (2) and (3) compatible with DAG D. Let P_G := ∪_{D∈[G]} P_D, which denotes the family of laws of X arising from a linear SEM compatible with a DAG in [G].

Proposition 1. Let D be the causal DAG associated with the linear SEM and G an MPDAG such that D ∈ [G]. Further, let B_1, ..., B_K be the bucket decomposition of V in G. Then the linear SEM Eqs. (2) and (3) can be rewritten as X = Λ⊤X + ε for some matrix of coefficients Λ = (λ_{ij}) ∈ R^{|V|×|V|} and a random vector ε = (ε_i) ∈ R^{|V|} such that

(10)  j ∈ B_l, i ∉ Pa(B_l, G) ⇒ λ_{ij} = 0,

(11)  E ε = 0,  E ε_{B_k} ε_{B_k}⊤ ≻ 0 (k = 1, ..., K),  ε_{B_1}, ..., ε_{B_K} are mutually independent,

and

(12)  law of ε_{B_k} ∈ P_{G_{B_k}},  k = 1, ..., K,

where G_{B_k} is the subgraph of G induced by B_k.

Note that in contrast to the symbol ϵ used in Eqs. (2) and (3), the symbol ε is used here to denote the errors in the block-recursive form. The coordinates within each ε_{B_k} may be dependent.

Proof.
For k = 2, ..., K, by Eq. (2) and the restrictive property (Corollary 1), we have

X_{B_k} = Γ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + Γ_{B_k}⊤ X_{B_k} + ϵ_{B_k},

where Pa(B_k) = Pa(B_k, G). The expression can be rewritten as

X_{B_k} = (I − Γ_{B_k})^{−⊤} Γ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + (I − Γ_{B_k})^{−⊤} ϵ_{B_k} = Λ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + ε_{B_k},

where ε_{B_k} := (I − Γ_{B_k})^{−⊤} ϵ_{B_k} for k = 1, ..., K (note that X_{B_1} = ε_{B_1}). Additionally, Λ_{Pa(B_k),B_k} = Γ_{Pa(B_k),B_k} (I − Γ_{B_k})^{−1} for k = 2, ..., K.

Matrix Λ ∈ R^{|V|×|V|} in the statement of the proposition is defined by the blocks Λ_{Pa(B_k),B_k} for k = 2, ..., K and zero entries otherwise. Therefore, λ_{ij} = 0 if j ∈ B_l and i ∉ Pa(B_l) for some l = 1, ..., K. Hence, by putting the blocks together, the model can be written as X = Λ⊤X + ε. The "new" errors ε satisfy

ε_{B_k} = Γ_{B_k}⊤ ε_{B_k} + ϵ_{B_k},  k = 1, ..., K.
It then follows from Eqs. (2) and (3) that for every k,

law of ε_{B_k} ∈ P_{D_{B_k}} ⊂ P_{G_{B_k}},

since D ∈ [G]. Moreover, for every k, E ε_{B_k} = 0 and

E ε_{B_k} ε_{B_k}⊤ = (I − Γ_{B_k})^{−⊤} E ϵ_{B_k} ϵ_{B_k}⊤ (I − Γ_{B_k})^{−1} ≻ 0,

where both (I − Γ_{B_k}) and E ϵ_{B_k} ϵ_{B_k}⊤ are full rank, because Γ_{B_k} can be permuted into an upper-triangular matrix and E ϵϵ⊤ ≻ 0 by Eq. (3).

Corollary 2. Under the same conditions as Proposition 1, it holds that

(13)  X_{B_1} = ε_{B_1},  X_{B_k} = Λ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + ε_{B_k},  ε_{B_k} ⊥⊥ X_{Pa(B_k)},  k = 2, ..., K,

where Pa(B_k) = Pa(B_k, G).

Next, we show that if the total causal effect τ_AY is identifiable from MPDAG G (Theorem 1), then it can be calculated from Λ in the block-recursive representation of Proposition 1. Therefore, the distribution of ε is a nuisance relative to estimating τ_AY.

Proposition 2. Suppose the criterion in Theorem 1 is satisfied for
A, Y in MPDAG G = (V, E, U) such that Y ∉ A. Let Λ be the block-recursive coefficient matrix given by Proposition 1. The total causal effect of X_A on X_Y is identified as

(14)  τ_AY = Λ_{A,D} [(I − Λ_{D,D})^{−1}]_{D,Y},

where D = An(Y, G_{V\A}) and the last subscript denotes the column corresponding to Y ∈ D.

Proof.
We derive this result using Theorem 2. Recall that D_1, ..., D_K is the partition of D induced by the bucket decomposition B_1, ..., B_K of V, in the sense that D_k = D ∩ B_k for k = 1, ..., K. When D_k = ∅, we use the convention that P(X_{D_k} | X_{Pa(D_k)}) ≡ 1. By the definition of D = An(Y, G_{V\A}) and Eq. (6), observe that a vertex in Pa(D_k) = Pa(D_k, G) is either in D_1 ∪ ··· ∪ D_{k−1} or in A. Let F_k := A ∩ Pa(D_k). In Eq. (9), we note that the joint interventional distribution of X_D is given by

P(X_D | do(X_A = x_A)) = ∏_{k=1}^K P(X_{D_k} | X_{Pa(D_k)}) = ∏_{k=1}^K P(X_{D_k} | X_{Pa(D_k)\F_k}, X_{F_k} = x_{F_k}),

where x_{F_k} is fixed by the do(X_A = x_A) operation. Further, fix a factor i ∈ {1, ..., K}. By Lemma 2, Pa(D_i) = Pa(B_i). By Eq. (13) and ε_{D_i} ⊥⊥ X_{Pa(B_i)}, we have

X_{D_i} | {X_{Pa(D_i)\F_i}, X_{F_i} = x_{F_i}} =_d Λ_{Pa(D_i)\F_i,D_i}⊤ X_{Pa(D_i)\F_i} + Λ_{F_i,D_i}⊤ x_{F_i} + ε_{D_i}
= Λ_{Pa(D_i)∩D,D_i}⊤ X_{Pa(D_i)∩D} + Λ_{Pa(D_i)∩A,D_i}⊤ x_{Pa(D_i)∩A} + ε_{D_i}.

The fact that the display above holds for every i = 1, ..., K implies that the joint interventional distribution P(X_D | do(X_A = x_A)) satisfies

X_D = Λ_{D,D}⊤ X_D + Λ_{A,D}⊤ x_A + ε_D.

It follows that X_D = (I − Λ_{D,D})^{−⊤} (Λ_{A,D}⊤ x_A + ε_D) and hence

E[X_D | do(X_A = x_A)] = (I − Λ_{D,D})^{−⊤} Λ_{A,D}⊤ x_A.

Since Y ∈ D, by Definition 1 we have

τ_AY = ∂/∂x_A E[X_Y | do(X_A = x_A)] = Λ_{A,D} [(I − Λ_{D,D})^{−1}]_{D,Y}.

We say vertex j is a possible descendant of i, denoted by j ∈ PossDe(i), if there exists a possibly causal path from i to j. For a set of vertices A, define PossDe(A) := ∪_{i∈A} PossDe(i). See Appendix C for more details.

Corollary 3. If Y ∉ PossDe(A), then τ_AY = 0.

Proof.
Since D = An(Y, G_{V\A}) and Y ∉ PossDe(A), Λ_{A,D} = 0.
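Given Λ and D, Eq. (14) is a few lines of linear algebra. A sketch with a helper name of our own choosing (A, D, Y are integer index vectors with Y ∈ D):

    # Plugin evaluation of Eq. (14): tau_AY = Lambda[A,D] %*% [(I - Lambda[D,D])^{-1}][, Y].
    total_effect <- function(Lambda, A, D, Y) {
      stopifnot(Y %in% D)
      Minv <- solve(diag(length(D)) - Lambda[D, D, drop = FALSE])
      Lambda[A, D, drop = FALSE] %*% Minv[, which(D == Y), drop = FALSE]
    }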
5. Recursive least squares.
Consider the special case when the errors in the linear SEM Eq. (1) are jointly Gaussian. In this case, by standard maximum likelihood theory, the Cramér–Rao bound is achieved by the maximum likelihood estimator (MLE) of the total causal effect, which can be obtained by plugging the MLE for Λ in the block-recursive form (Proposition 1) into the formula Eq. (14). We now compute the MLE for Λ given an MPDAG G.

When ϵ is multivariate Gaussian, the block-recursive form in Proposition 1 is a linear Gaussian model parameterized by {(Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K}, where Λ_k := Λ_{Pa(B_k),B_k} and Ω_k is the covariance of ε_{B_k}. Because the ε are independent between blocks (Proposition 1), the likelihood factorizes as

(15)  L((Λ_k)_k, (Ω_k)_k) = ∏_{k=1}^K N(X_{B_k} − Λ_k⊤ X_{Pa(B_k)}; 0, Ω_k).

Denote the MLE of Λ by Λ̂^G, which consists of blocks (Λ̂^G_k)_{k=2}^K and zero values elsewhere, and the MLE of Ω by Ω̂^G = (Ω̂^G_k)_{k=1}^K. The superscripts highlight the dependence on MPDAG G. The MLE maximizes L((Λ_k)_k, (Ω_k)_k) subject to Eq. (12), namely

N(0, Ω_k) ∈ P_{G_{B_k}},  k = 1, ..., K,

where G_{B_k} is the subgraph of G induced by B_k. This further translates to a set of algebraic constraints on (Ω_k)_{k=1}^K, namely for k = 1, ..., K,

(16)  det[(Ω_k)_{{i}∪C,{j}∪C}] = 0, if i and j are d-separated by C in G_{B_k};

see, e.g., Drton et al. [2008]. Define the sample covariance

(17)  Σ̂^(n) := (1/n) ∑_{i=1}^n X^(i) X^(i)⊤,

where n is the sample size and the superscripts are reserved for indexing samples. To reduce clutter, for a set of indices C, we often abbreviate Σ_{C,C} as Σ_C.

Lemma 3. Suppose X^(i), i = 1, ..., n, are generated iid from a linear SEM Eqs. (2) and (3) associated with an unknown causal DAG D. Suppose the error ϵ is distributed as multivariate Gaussian and that D ∈ [G] for a known MPDAG G. Let Σ̂^(n) be the sample covariance as defined in Eq. (17). The MLE for Λ_k = Λ_{Pa(B_k),B_k} in the block-recursive form is given by

(18)  Λ̂^G_k = (Σ̂^(n)_{Pa(B_k)})^{−1} Σ̂^(n)_{Pa(B_k),B_k},  k = 2, ..., K.

Proof.
By the factorization in Eq. (15), the MLE (Λ̂^G_k, Ω̂^G_k) is the maximizer of the log-likelihood

ℓ_n(Λ_k, Ω_k) = −(1/2) ∑_{i=1}^n (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)})⊤ Ω_k^{−1} (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)}) − (n/2) log det Ω_k
= −(1/2) Tr( ∑_{i=1}^n Ω_k^{−1} (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)}) (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)})⊤ ) − (n/2) log det Ω_k,

subject to Eq. (16). Taking the derivative with respect to Λ_k ∈ R^{|Pa(B_k)|×|B_k|}, we have

∂ℓ_n(Λ_k, Ω_k)/∂Λ_k = ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{B_k} Ω_k^{−1} − ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{Pa(B_k)} Λ_k Ω_k^{−1}.

For any positive definite Ω_k satisfying Eq. (16), setting ∂ℓ_n(Λ_k, Ω_k)/∂Λ_k to zero yields the estimate

Λ̂^G_k = ( (1/n) ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{Pa(B_k)} )^{−1} ( (1/n) ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{B_k} ) = (Σ̂^(n)_{Pa(B_k)})^{−1} Σ̂^(n)_{Pa(B_k),B_k}.

Remark 1. Because of the restrictive property (Corollary 1), each Λ̂^G_k is computed by optimizing over the space of |Pa(B_k)| × |B_k| matrices, and the resulting MLE takes the simple form above; see also Anderson and Olkin [1985] and Amemiya [1985].

Since Λ̂^G is obtained by simply regressing each B_k onto Pa(B_k, G) using ordinary least squares, we call this specific recursive least squares G-regression. The resulting MLE for an identified total causal effect is a plugin estimator using the formula in Proposition 2.

Definition 2 (G-regression estimator). Suppose X^(i), i = 1, ..., n, are generated iid from a linear SEM Eqs. (2) and (3) associated with an unknown causal DAG D. Suppose D ∈ [G] for a known MPDAG G. Further, suppose that for A ⊂ V and Y ∈ V \ A, τ_AY is identified under the criterion of Theorem 1. The G-regression estimator of the total causal effect τ_AY is defined as

(19)  τ̂^G_AY = Λ̂^G_{A,D} [(I − Λ̂^G_{D,D})^{−1}]_{D,Y},

where Λ̂^G is given by Lemma 3.
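In code, G-regression is one least squares fit per bucket followed by the plugin formula. A minimal sketch, assuming mean-zero data X, a bucket list in a valid partial causal ordering and the corresponding external parent sets; the helper names are ours, not the API of eff2:

    # G-regression (Eq. (18)): OLS of each bucket B_k on its external parents Pa(B_k, G).
    g_regression <- function(X, bucket_list, pa_list) {
      p <- ncol(X)
      Lambda <- matrix(0, p, p)
      S <- crossprod(X) / nrow(X)        # sample covariance, Eq. (17) (mean zero assumed)
      for (k in seq_along(bucket_list)) {
        B <- bucket_list[[k]]; Pa <- pa_list[[k]]
        if (length(Pa) > 0)
          Lambda[Pa, B] <- solve(S[Pa, Pa, drop = FALSE], S[Pa, B, drop = FALSE])
      }
      Lambda
    }

With D = An(Y, G_{V\A}) computed from the graph, the estimator of Definition 2 is then total_effect(g_regression(X, bucket_list, pa_list), A, D, Y), reusing the helper sketched after Corollary 3.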
6. Efficiency theory. In this section, we establish the asymptotic efficiency of our G-regression estimator, when the errors in the generating linear SEM are not necessarily Gaussian, among a reasonably large class of estimators: all regular estimators that only depend on the sample covariance. This class of estimators, despite not covering all the estimators considered in standard semiparametric efficiency theory, includes many in the literature, such as covariate adjustment [Henckel et al., 2019, Witte et al., 2020], recursive least squares [Gupta et al., 2020, Nandy et al., 2017], and the modified Cholesky decomposition of the sample covariance [Nandy et al., 2017].

Definition 3. Consider an estimator θ̂_n of θ, θ ∈ R^k. We say that the asymptotic covariance of θ̂_n is S, and write acov θ̂_n = S, if √n(θ̂_n − θ) →_d N(0, S). When k = 1, we write avar θ̂_n for the asymptotic variance.

For real-valued symmetric matrices A, B, we say A ⪰ B if A − B is positive semidefinite. We now state our main result.

Theorem 3 (Efficiency of the G-regression estimator). Suppose data is generated iid from a linear SEM Eqs. (2) and (3) associated with an unknown causal DAG D. Suppose D ∈ [G] for a known MPDAG G. Further, suppose that for A ⊂ V and Y ∈ V \ A, τ_AY is identified under the criterion of Theorem 1. Let τ̂^G_AY be the G-regression estimator of τ_AY (Definition 2). Consider any consistent estimator τ̂_AY = τ̂_AY(Σ̂^(n)) that is a differentiable function of the sample covariance. It holds that

acov(τ̂_AY) ⪰ acov(τ̂^G_AY).

It is clear from the definitions that both τ̂^G_AY and τ̂_AY are asymptotically linear. Therefore, their asymptotic covariances are well-defined. To prove Theorem 3, it suffices to show that for every w ∈ R^{|A|},

avar(w⊤τ̂_AY) ≥ avar(w⊤τ̂^G_AY).

To this end, for any fixed w ∈ R^{|A|} we define τ_w as

(20)  τ_w := w⊤τ_AY = τ_w(Λ),

which is a smooth function of Λ. The corresponding G-regression estimator τ̂^G_w := w⊤τ̂^G_AY = τ_w(Λ̂^G) is still a plugin estimator (now of τ_w). Additionally, for a consistent estimator τ̂_AY of τ_AY, the corresponding τ̂_w := w⊤τ̂_AY = τ̂_w(Σ̂^(n)) is a consistent estimator of τ_w, in the form of a differentiable function of the sample covariance. It suffices to show avar τ̂_w ≥ avar τ̂^G_w for every w ∈ R^{|A|}.

The rest of this section is devoted to proving Theorem 3. First, we introduce graph Ḡ as a saturated version of G (Proposition 3). In Section 6.1, we show that G-regression with G replaced by Ḡ, aptly named Ḡ-regression, is a diffeomorphism between the space of covariance matrices and the space of parameters. In Section 6.2, we characterize the class of estimators relative to which G-regression is optimal. To prove Theorem 3, we establish an efficiency bound for this class of estimators in Section 6.4 and verify that G-regression achieves this bound in Section 6.5. Some of the proofs are left to Appendix A. See also Fig. A.1 for an overview of the dependency structure of our results in this section.

6.1. Ḡ-regression as a diffeomorphism.

Proposition 3 (Saturated MPDAG Ḡ). For MPDAG G = (V, E, U), an associated saturated MPDAG is Ḡ = (V, Ē, U) such that Pa(B_k, Ḡ) = B_{[k−1]} for k = 2, ..., K, where (B_1, ..., B_K) is a bucket decomposition of V in both G and Ḡ.

The proof can be found in Appendix B. In words, to create the saturated MPDAG Ḡ, we add all possible directed edges between the buckets B_1, ..., B_K subject to the ordering B_1, ..., B_K. By construction, Ḡ also satisfies the restrictive property in Corollary 1. See Fig. 1 for an example.

[Fig 1: (a) MPDAG G = (V, E, U) with buckets B_1, B_2 and B_3, and (b) its associated saturated MPDAG Ḡ = (V, Ē, U). The new edges in Ē \ E are drawn as dashed. Both G and Ḡ satisfy the restrictive property in Corollary 1.]

In the following, we introduce Ḡ-regression as a technical tool for establishing a diffeomorphism between the space of sample covariance matrices and the space of parameters in our semiparametric model.
This link is the key to analyzing the efficiency of the estimators under consideration.

Recall that P_G is the set of observed distributions generated by some linear SEM associated with a causal DAG D ∈ [G], which is characterized by Proposition 1. More explicitly, let Q_k be the law of ε_{B_k} for k = 1, ..., K. The set of laws is explicitly prescribed as

(21)  P_G = { Q_1(X_{B_1}) ∏_{k=2}^K Q_k(X_{B_k} − Λ_{B_{[k−1]},B_k}⊤ X_{B_{[k−1]}}) : Q_k ∈ P_{G_{B_k}}, i → j not in G ⇒ λ_{ij} = 0 },

where the law is indexed by Λ = (λ_{ij}) and (Q_k)_{k=1}^K. This is a semiparametric model and (Q_k)_k is an infinite-dimensional nuisance [van der Vaart, 2000, Chap. 25].

Consider the set of laws P_Ḡ associated with the saturated graph. Let Ω_k := E_{Q_k} εε⊤ be the covariance of Q_k for k = 1, ..., K. Let R^{n×n}_PD denote the set of n × n symmetric, positive definite matrices. By our assumption, Ω_k ∈ R^{|B_k|×|B_k|}_PD. Also, consider coefficients Λ = (λ_{ij}) such that λ_{ij} ≠ 0 only if i → j in Ḡ, or equivalently, i ∈ B_l and j ∈ B_m for l < m. Then the covariance of X, denoted by Σ, under any P ∈ P_Ḡ is determined from (Ω_k)_k and Λ. Let us write this covariance map as

Σ = φ_Ḡ((Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K),

where Λ_k = Λ_{B_{[k−1]},B_k} is of dimension (|B_1| + ··· + |B_{k−1}|) × |B_k|. It follows from Corollary 2 that the covariance map φ_Ḡ is explicitly given by

(22)  Σ_{B_1} = Ω_1,  Σ_{B_k} = Λ_k⊤ Σ_{B_{[k−1]}} Λ_k + Ω_k,  Σ_{B_{[k−1]},B_k} = Σ_{B_{[k−1]}} Λ_k,  k = 2, ..., K.

Further, the covariance map φ_Ḡ is a diffeomorphism between its domain and the set of |V| × |V| positive definite matrices.

Lemma 4. The covariance map φ_Ḡ given by Eq. (22) is invertible. Further, ((Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K) ↔ Σ given by φ_Ḡ and its inverse φ_Ḡ^{−1} is a diffeomorphism between (⨉_{k=2}^K R^{(|B_1|+···+|B_{k−1}|)×|B_k|}) × (⨉_{k=1}^K R^{|B_k|×|B_k|}_PD) and R^{|V|×|V|}_PD.

Proof.
By definition, the covariance map φ_Ḡ is differentiable. To show the diffeomorphism, we need to show that φ_Ḡ^{−1}(Σ) exists for every Σ ∈ R^{|V|×|V|}_PD and that φ_Ḡ^{−1} is differentiable. For any positive definite Σ, the inverse covariance map φ_Ḡ^{−1}(Σ) is explicitly given by

(23)  Λ_k = (Σ_{B_{[k−1]}})^{−1} Σ_{B_{[k−1]},B_k},  k = 2, ..., K,

and

(24)  Ω_k = Σ_{B_k · B_{[k−1]}} = Σ_{B_k} − Σ_{B_{[k−1]},B_k}⊤ Σ_{B_{[k−1]}}^{−1} Σ_{B_{[k−1]},B_k},  k = 1, ..., K,

where Σ_{B_k · B_{[k−1]}} is the Schur complement of block B_k with respect to block B_{[k−1]}. Because Σ is positive definite, the Schur complement Ω_k is also positive definite [Horn and Johnson, 2012, page 495]. Clearly, the map φ_Ḡ^{−1}(·) is differentiable.

By Eqs. (23) and (24), Λ_k is the matrix of population least squares coefficients in a regression of X_{B_k} onto X_{B_1∪···∪B_{k−1}} according to Ḡ, and Ω_k is the corresponding covariance of the regression residuals. Hence, φ_Ḡ^{−1}(Σ) is called "Ḡ-regression".

Remark 2. In the special case when G is a DAG such that every bucket B_i is a singleton, Lemma 4 reduces to (Λ, ω) ↔ Σ given by (φ_Ḡ, φ_Ḡ^{−1}) being a diffeomorphism between

{Λ ∈ R^{|V|×|V|} : Λ is upper-triangular} × {ω ∈ R^{|V|} : ω_i > 0, i = 1, ..., |V|} ←→ R^{|V|×|V|}_PD.

The covariance map is Σ = φ_Ḡ(Λ, ω) = (I − Λ)^{−⊤} diag(ω) (I − Λ)^{−1}, and the inverse covariance map φ_Ḡ^{−1} is given by the unique LDL decomposition of Σ^{−1}. Lemma 4 is a generalization of Drton [2018, Theorem 7.2].
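The inverse map φ_Ḡ^{−1} in Eqs. (23) and (24) can be evaluated by a single sweep through the buckets. A sketch under the assumptions that Sigma is positive definite and bucket_list is ordered as in Eq. (6):

    # G-bar regression: invert the covariance map via Eqs. (23) and (24).
    gbar_regression <- function(Sigma, bucket_list) {
      Lam <- vector("list", length(bucket_list))
      Om  <- vector("list", length(bucket_list))
      prev <- integer(0)                  # indices of B_{[k-1]} = B_1 u ... u B_{k-1}
      for (k in seq_along(bucket_list)) {
        B <- bucket_list[[k]]
        if (length(prev) == 0) {
          Om[[k]] <- Sigma[B, B, drop = FALSE]                    # Omega_1 = Sigma_{B_1}
        } else {
          Lam[[k]] <- solve(Sigma[prev, prev, drop = FALSE],
                            Sigma[prev, B, drop = FALSE])         # Eq. (23)
          Om[[k]]  <- Sigma[B, B, drop = FALSE] -
            crossprod(Sigma[prev, B, drop = FALSE], Lam[[k]])     # Eq. (24), Schur complement
        }
        prev <- c(prev, B)
      }
      list(Lambda = Lam, Omega = Om)
    }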
6.2. Covariance-based, consistent estimators. We now characterize the class of estimators relative to which the optimality of our estimator is established. Recall that under P ∈ P_G, Σ̂ = Σ̂^(n) is the sample covariance, Σ is the population covariance and τ_w = w⊤τ_AY. We assume that n > max_k {|B_k| + |Pa(B_k, G)|} so that Σ̂^(n) is positive definite almost surely [Drton and Eichler, 2006, Sec. 3.1]. For simplicity, the superscript (n) is often omitted.

Definition 4. The class of estimators for τ_w under consideration is

(25)  T_w := { τ̂_w(Σ̂^(n)) : R^{|V|×|V|}_PD → R : τ̂_w differentiable, τ̂_w(Σ̂^(n)) →_p τ_w(P) as n → ∞ under every P ∈ P_G }.

By definition, in particular, T_w includes all regular estimators computable with least squares operations. It also includes shrinkage estimators such as ridge or lasso regression whose shrinkage parameter does not depend on n.

Characterizing T_w. Let (Λ̂^Ḡ_k)_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K be the image of Σ̂ under φ_Ḡ^{−1}. Recall that (Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K is the image of Σ under φ_Ḡ^{−1}. For a matrix C, let vec C denote the vectorization of C obtained by concatenating its columns. Each vec Λ̂^Ḡ_k can be split by coordinates into the vectors

(26)  Λ̂^Ḡ_{k,G} = (λ̂^Ḡ_{ij} : j ∈ B_k, i ∈ Pa(B_k, G)),  Λ̂^Ḡ_{k,G^c} = (λ̂^Ḡ_{ij} : j ∈ B_k, i ∈ Pa(B_k, Ḡ) \ Pa(B_k, G)),

where (Λ̂^Ḡ_{k,G})_k corresponds to the between-bucket edges in G and (Λ̂^Ḡ_{k,G^c})_k corresponds to the between-bucket edges in Ḡ but not in G. In the example of Fig. 1, (Λ̂^Ḡ_{k,G})_k collects the coefficients of the between-bucket edges present in G, while (Λ̂^Ḡ_{k,G^c})_k collects those of the dashed edges added in Ḡ. Similarly, vec Λ_k can be split into Λ_{k,G} and Λ_{k,G^c} for k = 2, ..., K.

The following lemma directly follows from Definition 4 and Lemma 4.

Lemma 5. An estimator τ̂_w ∈ T_w can be written as τ̂_w(Σ̂^(n)) = τ̂_w((Λ̂^Ḡ_{k,G})_{k=2}^K, (Λ̂^Ḡ_{k,G^c})_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K) for a function τ̂_w that is differentiable in its arguments.

The consistency of τ̂_w implies the following two results.

Lemma 6. For any τ̂_w ∈ T_w, it holds that

(27)  τ̂_w((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K) ≡ τ_w((Λ_{k,G})_{k=2}^K)

for all (Λ_{k,G})_k and all positive definite (Ω_k)_k.

Proof.
Under any P ∈ P_G, since Σ̂ →_p Σ as n → ∞ by the law of large numbers, by Lemma 4 and the continuous mapping theorem [van der Vaart, 2000, page 11], we have Λ̂^Ḡ_{k,G} →_p Λ_{k,G}, Λ̂^Ḡ_{k,G^c} →_p 0 and Ω̂^Ḡ_k →_p Ω_k for k = 2, ..., K. By Lemma 5 and continuous mapping again,

τ̂_w →_p τ̂_w((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K).

The result then follows from the consistency of τ̂_w under every P ∈ P_G.

Corollary 4. For τ̂_w ∈ T_w, at any ((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K), it holds that

(28)  ∂τ̂_w/∂Λ_{k,G} = ∂τ_w/∂Λ_{k,G} (k = 2, ..., K),  ∂τ̂_w/∂Ω_k = 0 (k = 1, ..., K).

Proof.
Let ⟨·,·⟩ denote the inner product. Since τ̂_w is differentiable (Lemma 5), by a Taylor expansion at ((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K), we have

τ̂_w((Λ_{k,G} + ΔΛ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k + ΔΩ_k)_{k=1}^K) − τ̂_w((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K)
= ∑_{k=2}^K ( ⟨∂τ̂_w/∂Λ_{k,G}, ΔΛ_{k,G}⟩ + o(‖ΔΛ_{k,G}‖) ) + ∑_{k=1}^K ( ⟨∂τ̂_w/∂Ω_k, ΔΩ_k⟩ + o(‖ΔΩ_k‖) ),

which by Lemma 6 must equal τ_w((Λ_{k,G} + ΔΛ_{k,G})_{k=2}^K) − τ_w((Λ_{k,G})_{k=2}^K). The result then follows from the differentiability of τ_w(·) and the definition of derivatives.

Note that Corollary 4 is similar to the conditions imposed on influence functions in standard semiparametric efficiency theory; see, e.g., Tsiatis [2006]. In contrast, the derivatives ∂τ̂_w/∂Λ̂^Ḡ_{k,G^c} for k = 2, ..., K are free to vary because Λ̂^Ḡ_{k,G^c} →_p 0. That is, an estimator τ̂_w ∈ T_w can take arbitrary values as its second argument varies in the vicinity of zero, as long as differentiability is maintained.

6.3. Asymptotic covariance of least squares coefficients.
We use this section to derive some asymptotic results that will be used to prove Theorem 3. Consider a vertex j ∈ B_k for k ∈ {2, ..., K} and a set of vertices C such that Pa(B_k, G) ⊆ C ⊆ Pa(B_k, Ḡ). Let λ̂^(n)_{C,j} ∈ R^{|C|} be the least squares coefficients from regressing X_j onto X_C under sample size n. Let λ_{C,j} be the corresponding true edge coefficient vector from Λ in Proposition 1. Then λ_{C,j} has non-zero coordinates only for those indices in Pa(B_k, G). Because X_j = λ_{C,j}⊤ X_C + ε_j with ε_j ⊥⊥ X_C by Corollary 2, we have λ̂^(n)_{C,j} →_p λ_{C,j} under every P ∈ P_G. Moreover, we have the following asymptotic linear expansion.

Lemma 7. Let j be a vertex in bucket B_k for k ∈ {2, ..., K}. Let C be a set of vertices such that Pa(B_k, G) ⊆ C ⊆ Pa(B_k, Ḡ). Under any P ∈ P_G, it holds that

λ̂^(n)_{C,j} − λ_{C,j} = (1/n) ∑_{i=1}^n (Σ_C)^{−1} X^(i)_C ε^(i)_j + O_p(n^{−1}),

where Σ = E_P XX⊤, λ̂^(n)_{C,j} is the vector of least squares coefficients from regressing X_j onto X_C under sample size n, and λ_{C,j} is the vector of true coefficients in Proposition 1.
We now use Lemma 7 to obtain the covariance structure of the Ḡ-regression coefficients (Λ̂^Ḡ_k)_{k=2}^K. Recall that Λ̂^Ḡ_k ∈ R^{|B_{[k−1]}|×|B_k|} with B_{[k−1]} = B_1 ∪ ··· ∪ B_{k−1} and

((Λ̂^Ḡ_k)_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K) = φ_Ḡ^{−1}(Σ̂^(n)),

as given by Eqs. (23) and (24). For matrices A ∈ R^{m×n}, B ∈ R^{p×q}, the Kronecker product A ⊗ B is the mp × nq matrix given by

A ⊗ B = [ a_{11}B ··· a_{1n}B ; ⋮ ⋱ ⋮ ; a_{m1}B ··· a_{mn}B ].

Lemma 8. Let (Λ̂^Ḡ_k)_{k=2}^K be the Ḡ-regression coefficients under sample size n. Under any P ∈ P_G, it holds that

√n ( vec(Λ̂^Ḡ_2 − Λ_2); ...; vec(Λ̂^Ḡ_K − Λ_K) ) →_d N( 0, diag{ Ω_2 ⊗ (Σ_{B_{[1]}})^{−1}, ..., Ω_K ⊗ (Σ_{B_{[K−1]}})^{−1} } ).

Remark 3. √n vec(Λ̂^(n)_k − Λ_k) →_d N(0, Ω_k ⊗ (Σ_{B_{[k−1]}})^{−1}) is equivalent to √n (Λ̂^(n)_k − Λ_k) →_d MN(0, (Σ_{B_{[k−1]}})^{−1}, Ω_k), a matrix normal distribution with row covariance (Σ_{B_{[k−1]}})^{−1} and column covariance Ω_k; see Dawid [1981].

Similarly, we can compute the asymptotic covariance of the G-regression coefficients. To obtain the result below, we rely on the restrictive property of G (Corollary 1).

Lemma 9. Let (Λ̂^G_k)_{k=2}^K be the G-regression coefficients as defined in Lemma 3 under sample size n. Under any P ∈ P_G, it holds that

√n ( vec(Λ̂^G_2 − Λ_2); ...; vec(Λ̂^G_K − Λ_K) ) →_d N( 0, diag{ Ω_2 ⊗ (Σ_{Pa(B_2,G)})^{−1}, ..., Ω_K ⊗ (Σ_{Pa(B_K,G)})^{−1} } ).
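Lemma 9 is easy to probe by simulation. In the three-variable model simulated in Section 3.1 all buckets are singletons, so G-regression of X_3 on its parent X_2 is ordinary least squares, and Lemma 9 predicts the asymptotic variance Ω_3 ⊗ (Σ_{Pa(B_3,G)})^{−1} = var(ε_3)/var(X_2) = 1/1.64 ≈ 0.61. A quick Monte Carlo check, reusing p and Gamma from that sketch:

    # Monte Carlo check of Lemma 9 for the singleton bucket B = {3} with Pa = {2}.
    set.seed(3)
    reps <- 2000; n <- 500
    est <- replicate(reps, {
      eps <- matrix(rnorm(n * p), n, p)
      X <- eps %*% solve(diag(p) - Gamma)
      sum(X[, 2] * X[, 3]) / sum(X[, 2]^2)   # OLS of X_3 on X_2 (no intercept)
    })
    var(sqrt(n) * (est - Gamma[2, 3]))       # approx 1 / 1.64 = 0.61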
6.4. Efficiency bound. We first note a simple fact about quadratic forms and a property of the Kronecker product.
Lemma 10. Let S ∈ R^{n×n}_PD, x ∈ R^n and suppose that (A, B) is a partition of the set {1, ..., n}. For any fixed x_A, it holds that

x⊤Sx ≥ x_A⊤ (S_{A·B}) x_A,  where S_{A·B} = S_{A,A} − S_{A,B} S_{B,B}^{−1} S_{B,A}.

Equality holds if and only if x_B = −S_{B,B}^{−1} S_{B,A} x_A.

Lemma 11 (Liu [1999, Theorem 1]). Let A ∈ R^{m×m} and C ∈ R^{n×n} be non-singular. Suppose α ⊂ [m], β ⊂ [n]. Let α^c, β^c denote their respective complements. Let γ^c = {n(i − 1) + j : i ∈ α^c, j ∈ β^c} and γ = [mn] \ γ^c. We have

A_{α^c·α} ⊗ C_{β^c·β} = (A ⊗ C)_{γ^c·γ}.

Lemma 12. Suppose the assumptions of Theorem 3 hold. Fix w ∈ R^{|A|} and let τ_w = w⊤τ_AY = τ_w((Λ_{k,G})_{k=2}^K) as in Eq. (20). Consider any estimator τ̂_w ∈ T_w given by Definition 4. Then under any P ∈ P_G, it holds that

(29)  avar(τ̂_w) ≥ ∑_{k=2}^K h_k⊤ [Ω_k ⊗ (Σ_{Pa(B_k,G)})^{−1}] h_k,

where (Ω_k)_{k=2}^K and Σ are determined by P, and the gradient vectors h_k = ∂τ_w((Λ_{k,G})_k)/∂Λ_{k,G} for k = 2, ..., K evaluated at (Λ_{k,G})_k are determined by τ_w(·) and P.

Proof.
By Lemma 5, an estimator τ̂_w ∈ T_w can be written as

τ̂_w = τ̂_w((Λ̂^Ḡ_{k,G})_{k=2}^K, (Λ̂^Ḡ_{k,G^c})_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K),

where the arguments correspond to the image of Σ̂ under φ_Ḡ^{−1}; see Eq. (26). An estimator τ̂_w ∈ T_w is asymptotically normal. By the delta method [Shorack, 2000, Sec. 11.2], we have

avar(τ̂_w) = g⊤ acov{ (vec(Λ̂^Ḡ_{k,G})_{k=2}^K; vec(Λ̂^Ḡ_{k,G^c})_{k=2}^K; vec(Ω̂^Ḡ_k)_{k=1}^K) } g,
with g = ( ∂τ̂_w/∂(Λ_{k,G})_{k=2}^K; ∂τ̂_w/∂(Λ_{k,G^c})_{k=2}^K; ∂τ̂_w/∂(Ω_k)_{k=1}^K ),

where the partial derivatives of τ̂_w(·) are evaluated at ((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K), the image of Σ under φ_Ḡ^{−1}.

Using ∂τ̂_w/∂Ω_k = 0 for k = 1, ..., K from Corollary 4, it follows that

avar(τ̂_w) = ∑_{k=2}^K ( ∂τ̂_w/∂Λ_{k,G}; ∂τ̂_w/∂Λ_{k,G^c} )⊤ acov{ (Λ̂^Ḡ_{k,G}; Λ̂^Ḡ_{k,G^c}) } ( ∂τ̂_w/∂Λ_{k,G}; ∂τ̂_w/∂Λ_{k,G^c} ),

where we have used the block-diagonal structure of the asymptotic covariance from Lemma 8. Let

S^(k) := acov{ (Λ̂^Ḡ_{k,G}; Λ̂^Ḡ_{k,G^c}) },  k = 2, ..., K,

which equals S^(k) = Ω_k ⊗ (Σ_{B_{[k−1]}})^{−1} by Lemma 8. From Corollary 4, note that ∂τ̂_w/∂Λ_{k,G} ≡ h_k is fixed for k = 2, ..., K. Then Lemma 10 yields the lower bound

avar(τ̂_w) ≥ ∑_{k=2}^K h_k⊤ S^(k)_{G·G^c} h_k,

where the indices G and G^c correspond to the coordinates in Λ̂^Ḡ_{k,G} and Λ̂^Ḡ_{k,G^c} respectively. Indices G correspond to {(i, j) : j ∈ B_k, i ∈ Pa(B_k, G)}; by construction of Ḡ, indices G^c correspond to {(i, j) : j ∈ B_k, i ∈ Pa(B_k, Ḡ) \ Pa(B_k, G)}. Now, abusing the notation slightly, we apply Lemma 11 with

A = Ω_k,  C = (Σ_{B_{[k−1]}})^{−1},  α = ∅,  β = Pa(B_k, Ḡ) \ Pa(B_k, G),

so that α^c = {1, ..., |B_k|}, β^c = Pa(B_k, G), γ = G^c, γ^c = G. We obtain

S^(k)_{G·G^c} = Ω_k ⊗ [(Σ_{B_{[k−1]}})^{−1}]_{β^c·β} = Ω_k ⊗ (Σ_{Pa(B_k,G)})^{−1},

where the last step follows from (H^{−1})_{β^c·β} = (H_{β^c,β^c})^{−1} [Horn and Johnson, 2012].

6.5. Efficiency of the G-regression estimator. In Section 5, we saw that when the errors are Gaussian, the G-regression plugin is the MLE and hence achieves the efficiency bound. Here, we show that this is still true relative to the class of estimators we consider, even when the errors are not necessarily Gaussian. We verify that τ̂^G_w = w⊤τ̂^G_AY achieves the efficiency bound above.

Lemma 13. Let τ̂^G_w := w⊤τ̂^G_AY, where τ̂^G_AY is the G-regression estimator (Definition 2). Under the same assumptions as Lemma 12, it holds that τ̂^G_w ∈ T_w and τ̂^G_w achieves the efficiency bound in Eq. (29) under every P ∈ P_G.

Proof. By Definition 2, τ̂^G_w ∈ T_w.
Further, note that τ̂^G_w = τ_w((Λ̂^G_k)_{k=2}^K), where (Λ̂^G_k)_{k=2}^K are the G-regression coefficients in Eq. (18). Under any P ∈ P_G, we now verify that avar τ̂^G_w matches the right-hand side of Eq. (29). By the delta method [Shorack, 2000, Sec. 11.2], we have

avar τ̂^G_w = ( ∂τ_w/∂vec(Λ_k)_{k=2}^K )⊤ acov{ vec(Λ̂^G_k)_{k=2}^K } ( ∂τ_w/∂vec(Λ_k)_{k=2}^K )
(i) = ∑_{k=2}^K ( ∂τ_w/∂vec Λ_k )⊤ acov{ vec Λ̂^G_k } ( ∂τ_w/∂vec Λ_k )
(ii) = ∑_{k=2}^K ( ∂τ_w/∂vec Λ_k )⊤ [ Ω_k ⊗ (Σ_{Pa(B_k,G)})^{−1} ] ( ∂τ_w/∂vec Λ_k ),

which equals the right-hand side of Eq. (29). The partial derivatives of τ_w(·) are evaluated at (Λ_k)_{k=2}^K. Step (i) follows from the block-diagonal structure of the asymptotic covariance of Λ̂^G given by Lemma 9, and step (ii) follows from the same lemma.

Finally, we complete the proof of our main result.

Proof of Theorem 3.
Fix any P ∈ P_G. It suffices to show that for every w ∈ R^{|A|},

w⊤ acov(τ̂_AY) w ≥ w⊤ acov(τ̂^G_AY) w,  or equivalently  avar(w⊤τ̂_AY) ≥ avar(w⊤τ̂^G_AY).

This is true because for every τ̂_AY in consideration, τ̂_w := w⊤τ̂_AY ∈ T_w and hence τ̂_w is subject to the lower bound in Lemma 12. Meanwhile, by Lemma 13, this lower bound is achieved by τ̂^G_w = w⊤τ̂^G_AY. The proof is complete because the choice of w is arbitrary.

Remark 4. For Theorem 3 to hold, the independent-errors assumption Eq. (3) of the underlying linear SEM cannot be relaxed to uncorrelated errors. This comes from inspecting the proof of Lemma 8 in Appendix A. To show that the Ḡ-regression coefficients are asymptotically independent across buckets, the independence of the errors is used to establish that for 2 ≤ k < k′ ≤ K, j ∈ B_k, j′ ∈ B_{k′},

cov(ε_j X_{B_{[k−1]}}, ε_{j′} X_{B_{[k′−1]}}) = 0.

Suppose for now that {ϵ_i : i ∈ V} are only uncorrelated and hence {ε_{B_k} : k = 1, ..., K} are only uncorrelated across buckets. Further, suppose B_1 = {1}, B_2 = {2}, B_3 = {3} with j = k = 2 and j′ = k′ = 3. Then we have

cov(ε_j X_{B_{[k−1]}}, ε_{j′} X_{B_{[k′−1]}}) = cov(ε_2 ε_1, ε_3 (ε_1, γ_{12} ε_1 + ε_2)⊤) = E[ε_2 ε_1 (ε_3 ε_1, γ_{12} ε_3 ε_1 + ε_3 ε_2)⊤],

which may be non-zero.
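A concrete instance (our example, not from the paper): take ε_1 = Z, ε_2 = W and ε_3 = W(Z² − 1) with Z ∼ N(0, 1) and W an independent Rademacher variable. The three errors are pairwise uncorrelated, yet E[ε_1² ε_2 ε_3] = E[Z²(Z² − 1)] = 2 ≠ 0, so the first coordinate of the covariance above does not vanish. A quick Monte Carlo check:

    # Pairwise-uncorrelated but dependent errors: cov(eps2*eps1, eps3*eps1) != 0.
    set.seed(2)
    n <- 1e6
    Z <- rnorm(n); W <- sample(c(-1, 1), n, replace = TRUE)
    e1 <- Z; e2 <- W; e3 <- W * (Z^2 - 1)
    round(c(cor(e1, e2), cor(e1, e3), cor(e2, e3)), 3)  # all approx 0
    cov(e2 * e1, e3 * e1)                               # approx 2, not 0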
7. Numerical Results.
In this section, the finite-sample performance of G-regression is evaluated against contending estimators. We use simulations and an in silico dataset for predicting expression levels in gene knockout experiments. All the numerical experiments were conducted with R v3.6, the package pcalg v2.6 [Kalisch et al., 2012] and our package eff2 v0.1.1.

7.1. Simulations.
We compare the performance of G-regression to several contending estimators under finite samples. We roughly follow the simulation setup of Henckel et al. [2019] and Witte et al. [2020]. First, we draw a random undirected graph from the Erdős–Rényi model with average degree k, where k is drawn uniformly at random from a set of four values {·, ·, ·, ·}. The graph is converted to a DAG D with a random causal ordering and the corresponding CPDAG G is recorded. Then we fix a linear SEM by drawing γ_ij uniformly from [−·, −0.·] ∪ [0.·, 2] and choosing the error distribution at random from the following:

1. ε_i ∼ N(0, v_i) with v_i ∼ unif(0.·, ·);
2. ε_i/√v_i ∼ t_· with v_i ∼ unif(0.·, ·);
3. ε_i ∼ logistic(0, s_i) with s_i ∼ unif(0.·, ·);
4. ε_i ∼ unif(−a_i, a_i) with a_i ∼ unif(1.·, ·).

We then draw n iid samples from the model; a code sketch of this data-generating process is given at the end of this subsection. Treatments A of a fixed size are randomly selected from the set of vertices with non-empty descendants, and Y is selected randomly from their descendants; the drawing is repeated until τ_AY is identified from G according to the criterion of Theorem 1. Finally, the data and graph G are provided to each estimator of τ_AY.

We compare to the following three estimators:
• adj.O: optimal adjustment estimator [Henckel et al., 2019],
• IDA.M: joint-IDA estimator based on modifying Cholesky decompositions [Nandy et al., 2017],
• IDA.R: joint-IDA estimator based on recursive regressions [Nandy et al., 2017].

They are implemented in the R package pcalg. The two joint-IDA estimators use the parents of the treatment variables to estimate a causal effect. Both of them reduce to the IDA estimator of Maathuis et al. [2009] when |A| = 1. Admittedly, compared to G-regression and adj.O, the joint-IDA estimators require less knowledge about the graph, namely only Pa(i) for each i ∈ A.

For each estimator τ̂_AY, we compute its squared error ∥τ̂_AY − τ_AY∥². Dividing ∥τ̂_AY − τ_AY∥² by the squared error of G-regression, we obtain the relative squared error of each contending estimator. We consider four sizes of |A|, |V| ∈ {20, 50, 100} and n ∈ {100, 1000}; each configuration of (|A|, |V|, n) is replicated 1,000 times.

Fig. 2 shows the distributions of relative squared errors. In Table 1, we summarize the relative errors with their geometric mean and median. Our estimator dominates all the contending estimators in all cases, and the improvement grows as |A| gets bigger. Even though adj.O achieves the minimal asymptotic variance among all adjustment estimators, it can compare less favorably to our estimator by several folds. In general, the IDA estimators perform very poorly. Moreover, the results in Table 1 are computed only from the replications where a contending estimator exists. As mentioned in the Introduction, unlike G-regression, none of the contending estimators is guaranteed to exist for every identified effect under joint intervention (adj.O always exists for point interventions); see Table 2 for the percentages of instances that are not estimable by contending estimators, even though the effect is identified by Theorem 1 and hence estimable by G-regression.

Table 1 (numerical entries omitted): Geometric average (brackets: median) of relative squared errors compared to G-regression, for adj.O, IDA.M and IDA.R, across |V| ∈ {20, 50, 100}, n ∈ {100, 1000} and four sizes of |A|.

Fig 2 (plots omitted): Violin plots for the relative squared errors of contending estimators ('·': geometric mean, '+': median).

Table 2 (numerical entries omitted): Percentage of identified instances not estimable using contending estimators (all estimable with G-regression), for adj.O, IDA.M and IDA.R, by |A| and |V| ∈ {20, 50, 100}.

In Appendix D, we report additional simulation results where the CPDAG is estimated with the greedy equivalence search algorithm [Chickering, 2002] and provided to the estimators. The improvements are more modest but are still typically by several folds.
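As referenced above, here is a minimal R sketch of the data-generating process; the edge density, coefficient range and Gaussian error law below are illustrative stand-ins for the constants elided in the text.

# Illustrative data-generating process: random DAG + linear SEM
# (edge density, coefficient range and error law are stand-in choices).
sim_sem <- function(n, p = 20, k = 3) {
  ord <- sample(p)                     # random causal ordering
  A <- matrix(0, p, p)                 # A[i, j] = coefficient of X_i in eq. of X_j
  for (a in 1:(p - 1)) for (b in (a + 1):p) {
    if (runif(1) < k / (p - 1)) {      # Erdos-Renyi skeleton, average degree ~ k
      i <- ord[a]; j <- ord[b]         # orient i -> j along the ordering
      A[i, j] <- sample(c(-1, 1), 1) * runif(1, 0.1, 2)
    }
  }
  X <- matrix(0, n, p)
  for (j in ord) X[, j] <- X %*% A[, j] + rnorm(n)  # generate in causal order
  list(X = X, A = A, ord = ord)
}
dat <- sim_sem(n = 1000)
round(cov(dat$X)[1:3, 1:3], 2)         # sample covariance fed to the estimators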
7.2. Predicting double knockouts in DREAM4 data. The DREAM4 in silico network challenge dataset [Marbach et al., 2009b] provides a benchmark for evaluating the reverse engineering of gene regulation networks. Here we use the 5th Size-10 dataset [Marbach et al., 2009a] as our example, which is a small network of 10 genes. Fig. 3 shows the true gene regulation network, which is constructed based on the networks of living organisms. A stochastic differential equation model was used to generate the data under wild type (steady state), perturbed steady state and knockout interventions. A task in the challenge is to use data under wild type and perturbed steady state (both are observational data) to predict the steady state expression levels under 5 different joint interventions, each of which knocks out a pair of genes. For our purpose, we also use the true network as input. However, the true network contains one cycle (other networks in DREAM4 contain more than one cycle). In the following, we remove one edge in the cycle and provide the resulting DAG to the estimators. Necessarily, the causal DAG is misspecified. Results are reported under 4 different edge removals.

Fig 3 (diagram omitted): Gene regulation network from the DREAM4 dataset, which contains a cycle (blue).

Unfortunately, the wild type data only consist of one sample. To estimate the observational covariance, we use the perturbed steady state data, which consist of 5 segments of time series. A sample covariance is computed from each segment, and the final estimate is taken as their average. For a double knockout of genes (i, j), we use G-regression to estimate the joint-intervention effect of A = (i, j) on every other gene. The effect is identified because the DAG is given. For gene k, let s_k and s_k^{(ij)} respectively denote its expression level under wild type and under the double knockout of genes (i, j). The expression level under double knockout is predicted as
$$\hat{s}^{(ij)}_k = \begin{cases} s_k - (s_i, s_j)^\top \hat\tau_{ij,k}, & k \notin \{i, j\}, \\ 0, & k \in \{i, j\}, \end{cases}$$
where the knocked-out genes are set to zero; a code sketch of this prediction rule is given at the end of this subsection. Prediction accuracy is measured by the normalized squared error
$$E = \frac{\sum_{(i,j) \in \mathcal{A}} \sum_{k=1}^{10} \big(\hat{s}^{(ij)}_k - s^{(ij)}_k\big)^2}{\sum_{(i,j) \in \mathcal{A}} \sum_{k=1}^{10} \big(s^{(ij)}_k\big)^2},$$
where $\mathcal{A} = \{(6, \cdot), (7, \cdot), (8, \cdot), (8, \cdot), (8, \cdot)\}$ consists of the 5 double knockouts available in the dataset. For comparison, we also evaluate the performance of adj.O (optimal adjustment, Henckel et al. [2019]) and IDA.R (joint-IDA based on recursive regressions, Nandy et al. [2017]); IDA.R is chosen because it outperforms IDA.M according to Section 7.1. Unfortunately, adj.O is not able to estimate the effect on every k, so a modified metric E* is computed by only summing over those estimable k's; the same metric E* of G-regression is also computed for comparison. As a baseline, we also compute E from naively estimating s_k^{(ij)} with just s_k.

Table 3 (most entries omitted; two surviving percentages are 32% and 33%): Normalized squared errors of predicting gene double knockouts. Columns: edge removed from the cycle; percentage of effects not estimable by adj.O; E* for adj.O and G-regression; E for IDA.R, G-regression and the baseline.

Table 3 reports the results, where one column lists the percentage of effects not estimable by the adjustment estimator. In almost all the cases, G-regression dominates all the contending estimators. In this example, even though both the causal graph and the linear SEM are misspecified, one can still witness some usefulness of our estimator.
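As referenced above, the prediction rule can be sketched in a few lines of R (hypothetical inputs: s is the wild-type expression vector and tau is a 2 × 10 matrix whose k-th column holds the estimated joint effects of genes (i, j) on gene k, as would be produced by G-regression):

# Sketch of the double-knockout prediction rule and the normalized error E.
predict_knockout <- function(s, i, j, tau) {
  shat <- numeric(length(s))
  for (k in seq_along(s)) {
    shat[k] <- if (k %in% c(i, j)) 0            # knocked-out genes drop to zero
               else s[k] - sum(c(s[i], s[j]) * tau[, k])
  }
  shat
}
# Normalized squared error over a list of knockout pairs with observed outcomes:
norm_sq_error <- function(pred_list, obs_list) {
  num <- sum(mapply(function(p, o) sum((p - o)^2), pred_list, obs_list))
  den <- sum(sapply(obs_list, function(o) sum(o^2)))
  num / den
}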
8. Discussion.
We have proposed G-regression, based on recursive least squares, to estimate a total causal effect from observational data under linearity and causal sufficiency assumptions. G-regression is applicable to estimating every identified total effect, under either point intervention or joint intervention. Further, via a new semiparametric efficiency theory, we have shown that the estimator achieves the efficiency bound within a restricted, yet reasonably large, class of estimators, including covariate adjustment and other regular estimators based on the sample covariance. To construct confidence intervals and conduct hypothesis tests, the bootstrap can easily be applied to estimate the asymptotic covariance of our estimator. This is implemented in the R package eff2.

One may wonder whether, within the class of all regular estimators, a (globally) semiparametric efficient estimator can be constructed for this problem. Ignoring the conditional independence constraints Eq. (12) in the blockwise error distributions, the model Eq. (21) is a generalized, multivariate location-shift regression model; see also Tsiatis [2006]. Constructing a globally efficient estimator would involve estimating the multivariate error densities, which is impractical unless n is very large [Tsiatis, 2006, page 111]. The proposal of Tsiatis [2006] is to instead develop a locally efficient estimator by postulating an error distribution. Our result justifies postulating Gaussian errors if only the first two sample moments are utilized. Another possibility for tackling the nuisance of the multivariate error distribution is via distribution-free multivariate ranks, namely R-estimation along the lines of Hallin et al. [2019], which we leave for future studies.

We have seen that the conditional independence constraints in Eq. (12) play no role for the restricted class of estimators considered, a feature that holds under the restrictive property (Corollary 1). Recently, Rotnitzky and Smucler [2019] found that when estimating the effect of a point intervention, under certain graphical conditions, covariate adjustment can still achieve the semiparametric efficiency bound, although it ignores the additional conditional independences obeyed by the observed distribution. It is thus worth investigating whether this phenomenon continues to hold in semiparametric estimation beyond the linear SEM.

APPENDIX A: PROOFS FOR ASYMPTOTIC EFFICIENCY
Fig A.1 (diagram omitted): Dependency structure of the proofs in Section 6 of the main text, relating Lemmas 4–10, Lemmas 12–13, Corollary 4, and Theorem 3.
A.1. Proof of Lemma 7.
Proof.
For simplicity, we drop the superscripts in $\hat\Sigma^{(n)}$ and $\hat\lambda^{(n)}$. Since $j \in B_k$ and $\operatorname{Pa}(B_k, G) \subseteq C \subseteq \operatorname{Pa}(B_k, \bar G)$, we have
$$\hat\lambda_{C,j} - \lambda_{C,j} = (\hat\Sigma_C)^{-1}\hat\Sigma_{C,j} - (\Sigma_C)^{-1}\Sigma_{C,j} = \big(\Sigma_C + \hat\Sigma_C - \Sigma_C\big)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) + \big((\hat\Sigma_C)^{-1} - (\Sigma_C)^{-1}\big)\Sigma_{C,j}.$$
We compute the two terms separately. The first term becomes
$$\big(\Sigma_C + \hat\Sigma_C - \Sigma_C\big)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) = \big(\Sigma_C + O_p(n^{-1/2})\big)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) = (\Sigma_C)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) + O_p(n^{-1}),$$
where we used the fact that $\Sigma_C$ is positive definite (Lemma 4) and $\|\hat\Sigma_{C,j} - \Sigma_{C,j}\| = O_p(n^{-1/2})$, $\|\hat\Sigma_C - \Sigma_C\| = O_p(n^{-1/2})$ by the central limit theorem.

For the second term,
$$(\hat\Sigma_C)^{-1} - (\Sigma_C)^{-1} = \big(\Sigma_C + \hat\Sigma_C - \Sigma_C\big)^{-1} - (\Sigma_C)^{-1} = \Big[I - \big(I - (\Sigma_C)^{-1}\hat\Sigma_C\big)\Big]^{-1}(\Sigma_C)^{-1} - (\Sigma_C)^{-1}.$$
Since $\|I - (\Sigma_C)^{-1}\hat\Sigma_C\| = O_p(n^{-1/2})$, using the Neumann series $(I - H)^{-1} = I + H + H^2 + \dots$ for $H = I - (\Sigma_C)^{-1}\hat\Sigma_C$ with $\|H\| \to_p 0$, we have
$$(\hat\Sigma_C)^{-1} - (\Sigma_C)^{-1} = \big[I + H + O_p(n^{-1})\big](\Sigma_C)^{-1} - (\Sigma_C)^{-1} = H(\Sigma_C)^{-1} + O_p(n^{-1}) = \big[I - (\Sigma_C)^{-1}\hat\Sigma_C\big](\Sigma_C)^{-1} + O_p(n^{-1}).$$
Combining the two terms, we obtain
$$\hat\lambda_{C,j} - \lambda_{C,j} = (\Sigma_C)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) + \big[I - (\Sigma_C)^{-1}\hat\Sigma_C\big](\Sigma_C)^{-1}\Sigma_{C,j} + O_p(n^{-1})$$
$$= (\Sigma_C)^{-1}\hat\Sigma_{C,j} - (\Sigma_C)^{-1}\Sigma_{C,j} + (\Sigma_C)^{-1}\Sigma_{C,j} - (\Sigma_C)^{-1}\hat\Sigma_C(\Sigma_C)^{-1}\Sigma_{C,j} + O_p(n^{-1})$$
$$\overset{(i)}{=} (\Sigma_C)^{-1}\big(\hat\Sigma_{C,j} - \hat\Sigma_C\lambda_{C,j}\big) + O_p(n^{-1}) = \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1}\Big[X^{(i)}_j X^{(i)}_C - X^{(i)}_C X^{(i)\top}_C \lambda_{C,j}\Big] + O_p(n^{-1})$$
$$= \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1} X^{(i)}_C \Big(X^{(i)}_j - \lambda^\top_{C,j} X^{(i)}_C\Big) + O_p(n^{-1}) \overset{(ii)}{=} \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1} X^{(i)}_C \Big(X^{(i)}_j - \lambda^\top_{\operatorname{Pa}(B_k,G),j} X^{(i)}_{\operatorname{Pa}(B_k,G)}\Big) + O_p(n^{-1})$$
$$= \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1} X^{(i)}_C \varepsilon^{(i)}_j + O_p(n^{-1}),$$
where (i) uses $\lambda_{C,j} = (\Sigma_C)^{-1}\Sigma_{C,j}$ and (ii) follows from Proposition 1 and $\operatorname{Pa}(B_k, G) \subseteq C \subseteq \operatorname{Pa}(B_k, \bar G)$.
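The second-order remainder in this expansion can be checked numerically; the following is a small R sketch under an assumed 3-dimensional $\Sigma_C$.

# Numeric check: the inverse-difference identity holds up to O_p(1/n).
set.seed(2)
n <- 5000
Sigma_C <- matrix(c(1, .5, .2,
                    .5, 1, .3,
                    .2, .3, 1), 3, 3)
X <- matrix(rnorm(n * 3), n, 3) %*% chol(Sigma_C)  # rows ~ N(0, Sigma_C)
Sigma_hat <- crossprod(X) / n
lhs <- solve(Sigma_hat) - solve(Sigma_C)
rhs <- (diag(3) - solve(Sigma_C) %*% Sigma_hat) %*% solve(Sigma_C)
max(abs(lhs - rhs))   # O_p(1/n), much smaller than the O_p(1/sqrt(n)) terms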
A.2. Proof of Lemma 8.

Proof.
For each $k = 2, \dots, K$, note that for $C = \operatorname{Pa}(B_k, \bar G) = B_{[k-1]}$, we have $\operatorname{vec}\hat\Lambda^{\bar G}_k = (\hat\lambda^{(n)}_{C,j})_{j \in B_k}$ by concatenation. By Lemma 7, we have the asymptotic linear expansion
$$(30)\qquad \hat\lambda^{(n)}_{B_{[k-1]},j} - \lambda_{B_{[k-1]},j} = \frac{1}{n}\sum_{i=1}^n \big(\Sigma_{B_{[k-1]}}\big)^{-1} X^{(i)}_{B_{[k-1]}} \varepsilon^{(i)}_j + O_p(n^{-1}).$$
By the central limit theorem,
$$\sqrt{n}\begin{pmatrix} \operatorname{vec}(\hat\Lambda^{\bar G}_2 - \Lambda_2) \\ \vdots \\ \operatorname{vec}(\hat\Lambda^{\bar G}_K - \Lambda_K) \end{pmatrix}$$
converges to a centered multivariate normal distribution. Further, we claim that the asymptotic covariance must be block-diagonal according to $k = 2, \dots, K$. To see this, take $k < k'$, $j \in B_k$, $j' \in B_{k'}$ and let $C = B_{[k-1]}$, $C' = B_{[k'-1]}$. Using Eq. (30), we have
$$\lim_{n \to \infty} n\, \operatorname{E}\Big(\hat\lambda^{(n)}_{C,j} - \lambda_{C,j}\Big)\Big(\hat\lambda^{(n)}_{C',j'} - \lambda_{C',j'}\Big)^\top = (\Sigma_C)^{-1}\operatorname{cov}\big(\varepsilon_j X_C,\, \varepsilon_{j'} X_{C'}\big)(\Sigma_{C'})^{-1} = (\Sigma_C)^{-1}\Big\{\operatorname{E}\big[\varepsilon_j \varepsilon_{j'} X_C X^\top_{C'}\big] - \operatorname{E}[\varepsilon_j X_C]\operatorname{E}\big[\varepsilon_{j'} X^\top_{C'}\big]\Big\}(\Sigma_{C'})^{-1}.$$
In the expression above, because $\varepsilon_{B_k} \perp\!\!\!\perp X_{B_{[k-1]}}$ and $\varepsilon_{B_{k'}} \perp\!\!\!\perp X_{B_{[k'-1]}}$ by Corollary 2, and $j \in B_k$, $j' \in B_{k'}$ for $k < k'$, we have
$$\operatorname{E}\big[\varepsilon_j \varepsilon_{j'} X_C X^\top_{C'}\big] = \operatorname{E}\varepsilon_{j'}\, \operatorname{E}\big[\varepsilon_j X_C X^\top_{C'}\big] = 0, \qquad \operatorname{E}[\varepsilon_j X_C] = 0, \qquad \operatorname{E}[\varepsilon_{j'} X_{C'}] = 0.$$
It follows that the display above evaluates to $0$, and hence the asymptotic covariance matrix is block-diagonal.

It remains to be shown that $\operatorname{acov}\operatorname{vec}(\hat\Lambda^{\bar G}_k - \Lambda_k) = \Omega_k \otimes (\Sigma_{B_{[k-1]}})^{-1}$ for $k = 2, \dots, K$. Fix $k$, take any two distinct $j, j' \in B_k$ and let $C = B_{[k-1]}$. Again using Eq. (30), we have
$$\operatorname{acov}\begin{pmatrix} \hat\lambda^{(n)}_{C,j} \\ \hat\lambda^{(n)}_{C,j'} \end{pmatrix} = \begin{pmatrix} H & F \\ F^\top & D \end{pmatrix},$$
where
$$H = (\Sigma_C)^{-1}\operatorname{cov}(\varepsilon_j X_C, \varepsilon_j X_C)(\Sigma_C)^{-1} = \operatorname{var}(\varepsilon_j)\,(\Sigma_{B_{[k-1]}})^{-1},$$
$$F = (\Sigma_C)^{-1}\operatorname{cov}(\varepsilon_j X_C, \varepsilon_{j'} X_C)(\Sigma_C)^{-1} = \operatorname{cov}(\varepsilon_j, \varepsilon_{j'})\,(\Sigma_{B_{[k-1]}})^{-1},$$
$$D = (\Sigma_C)^{-1}\operatorname{cov}(\varepsilon_{j'} X_C, \varepsilon_{j'} X_C)(\Sigma_C)^{-1} = \operatorname{var}(\varepsilon_{j'})\,(\Sigma_{B_{[k-1]}})^{-1}.$$
Noting that $\Omega_k = \operatorname{cov}(\varepsilon_{B_k})$ and $\operatorname{vec}\hat\Lambda^{\bar G}_k = (\hat\lambda^{(n)}_{C,j})_{j \in B_k}$, the result then follows from comparing the expressions above to the definition of the Kronecker product for every pair $j, j' \in B_k$.
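The Kronecker-product form of the asymptotic covariance can also be verified by simulation; the following R sketch uses an assumed two-response, two-regressor toy model in which both responses share the same regressors, as in one bucket.

# Monte Carlo check of acov{vec(Lambda_hat)} = Omega (x) Sigma_C^{-1}.
set.seed(3)
n <- 500; R <- 2000
Sigma_C <- matrix(c(1, .4, .4, 1), 2, 2)
Omega   <- matrix(c(1, .6, .6, 2), 2, 2)
Btrue   <- matrix(c(.5, 0, 1, -1), 2, 2)    # columns: coefficients per response
coefs <- replicate(R, {
  X <- matrix(rnorm(n * 2), n, 2) %*% chol(Sigma_C)
  E <- matrix(rnorm(n * 2), n, 2) %*% chol(Omega)
  as.vector(solve(crossprod(X), crossprod(X, X %*% Btrue + E)))  # vec of OLS fit
})
max(abs(n * cov(t(coefs)) - kronecker(Omega, solve(Sigma_C))))   # small, up to MC error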
A.3. Proof of Lemma 9.

Proof.
Note that by the restrictive property of $G$ (Corollary 1), we have $\operatorname{vec}\hat\Lambda^G_k = \big(\hat\lambda^{(n)}_{\operatorname{Pa}(B_k,G),j}\big)_{j \in B_k}$ for $k = 2, \dots, K$. Using Lemma 7 with $C = \operatorname{Pa}(B_k, G)$, we have the asymptotic linear expansion
$$(31)\qquad \hat\lambda^{(n)}_{\operatorname{Pa}(B_k,G),j} - \lambda_{\operatorname{Pa}(B_k,G),j} = \frac{1}{n}\sum_{i=1}^n \big(\Sigma_{\operatorname{Pa}(B_k,G)}\big)^{-1} X^{(i)}_{\operatorname{Pa}(B_k,G)} \varepsilon^{(i)}_j + O_p(n^{-1}).$$
The rest of the computation follows similarly to the proof of Lemma 8.
A.4. Proof of Lemma 10.
Proof.
Since $S \in \mathbb{R}^{n \times n}$ is positive definite, by completing the square, we have
$$x^\top S x = x_A^\top S_{A,A} x_A + x_B^\top S_{B,A} x_A + x_A^\top S_{A,B} x_B + x_B^\top S_{B,B} x_B - x_A^\top S_{A,B} S^{-1}_{B,B} S_{B,A} x_A + x_A^\top S_{A,B} S^{-1}_{B,B} S_{B,A} x_A$$
$$= x_A^\top \big(S_{A,A} - S_{A,B} S^{-1}_{B,B} S_{B,A}\big) x_A + \big(x_B + S^{-1}_{B,B} S_{B,A} x_A\big)^\top S_{B,B} \big(x_B + S^{-1}_{B,B} S_{B,A} x_A\big)$$
$$\ge x_A^\top \big(S_{A,A} - S_{A,B} S^{-1}_{B,B} S_{B,A}\big) x_A = x_A^\top S_{A \cdot B}\, x_A,$$
where the equality holds if and only if $x_B = -S^{-1}_{B,B} S_{B,A} x_A$.
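A quick numeric illustration of this bound in R, under an assumed random 4 × 4 positive definite S with blocks A = {1, 2} and B = {3, 4}:

# The quadratic form attains x_A' S_{A.B} x_A exactly at the minimizing x_B.
set.seed(4)
M <- matrix(rnorm(16), 4)
S <- crossprod(M) + diag(4)                  # positive definite
A <- 1:2; B <- 3:4
S_AdotB <- S[A, A] - S[A, B] %*% solve(S[B, B], S[B, A])
xA <- rnorm(2)
xB <- -solve(S[B, B], S[B, A] %*% xA)        # minimizer from the proof
x  <- c(xA, xB)
c(t(x) %*% S %*% x, t(xA) %*% S_AdotB %*% xA)  # the two values coincide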
APPENDIX B: PROOFS FOR GRAPHICAL RESULTS

B.1. Proof of Lemma 1.

Proof.
Let the undirected path between $j$ and $k$ be $p = \langle j = V_1, \dots, V_l = k\rangle$ with $l > 1$. First note that $i$ is not on $p$, because there is no undirected path between $i$ and $j$ in $G$. Further, since $i \to j - V_2$ is in $G$, by Meek rules R1 and R2 (Fig. C.1 in Appendix C), $i - V_2$ or $i \to V_2$ is in $G$. Since, by assumption, there is no undirected path from $i$ to $j$ in $G$, $i - V_2 \notin U$. Hence, $i \to V_2 \in E$, and if $l = 2$, the statement of the lemma holds. If $l > 2$, we can apply the above reasoning iteratively until we obtain $i \to V_l \in E$.

B.2. Proof of Lemma 2.

Proof.
Let $l \in D_k$. Since $D_k \subseteq B_k$, $l \in B_k$. Then by Corollary 1, we have that $\operatorname{Pa}(B_k) = \operatorname{Pa}(l) \setminus B_k$. Therefore, $\operatorname{Pa}(B_k) \subseteq \cup_{j \in D_k} \operatorname{Pa}(j) \setminus B_k$ and furthermore, $\operatorname{Pa}(B_k) \subseteq \cup_{j \in D_k} \operatorname{Pa}(j) \setminus D_k = \operatorname{Pa}(D_k)$. Hence, it suffices to show $\operatorname{Pa}(D_k) \subseteq \operatorname{Pa}(B_k)$.

We prove $\operatorname{Pa}(D_k) \subseteq \operatorname{Pa}(B_k)$ by contradiction. Suppose there exists $j \in \operatorname{Pa}(D_k) \setminus \operatorname{Pa}(B_k)$. By definition, $D = \operatorname{An}(Y, G_{V \setminus A})$ and $D = \cup_{r=1}^K D_r$. Therefore, if $k = 1$, then $j \in A$; if $k > 1$, $j$ must be contained in $\cup_{r=1}^{k-1} D_r$ or in $A$. If $j \in A$, this leads to a contradiction with Lemma C.1 in Appendix C. Suppose $k > 1$ and $j \in \cup_{r=1}^{k-1} D_r$. Because $\cup_{r=1}^{k-1} D_r \subseteq \cup_{r=1}^{k-1} B_r$ and the buckets $\{B_1, \dots, B_K\}$ are disjoint, we have $(\cup_{r=1}^{k-1} D_r) \cap B_k = \emptyset$. However, this contradicts that $j \in B_k$: indeed, $j \in \operatorname{Pa}(D_k) \setminus \operatorname{Pa}(B_k)$ forces $j \in B_k$, since any parent of $D_k \subseteq B_k$ outside of $B_k$ belongs to $\operatorname{Pa}(B_k)$.

B.3. Proof of Proposition 3.
Proof.
By construction, the undirected component of $\bar G$ remains the same as that of $G$. Hence, $\bar G$ has the same bucket decomposition as $G$. We only need to show that $\bar G$ is an MPDAG. It is enough to show that the edge orientations in $\bar G$ are closed under rules R1–R4 of Meek [1995], which are displayed in Fig. C.1 of Appendix C. Note that since $G$ is an MPDAG, it is closed under R1–R4. So if any of the left-hand-side graphs in Figure C.1 are induced subgraphs of $\bar G$, then at least one of the directed edges in these induced subgraphs must have been added in the construction of $\bar G$.

Since the construction of $\bar G$ does not involve adding directed edges within a bucket, the left-hand sides of rules R3 and R4 in Figure C.1 cannot appear as induced subgraphs of $\bar G$. Hence, edge orientations in $\bar G$ are closed under rules R3 and R4.

Consider the left-hand side of rule R1 in Figure C.1, $A \to B - C$, for some $A, B, C \in V$. For $A \to B - C$ to be an induced subgraph of $\bar G$, $A \to B$ must have been added in the construction of $\bar G$ from $G$. Hence, $A$ and $B$ would need to be in different buckets in $G$. Since $B$ and $C$ are in the same bucket because of the edge $B - C$, $A \to C$ would also be added to $G$ in the construction of $\bar G$. Hence, $A \to B - C$ will also not appear as an induced subgraph of $\bar G$, and edge orientations in $\bar G$ are also closed under R1.

Consider the left-hand side of R2 in Figure C.1, and suppose for a contradiction that $A \to B \to C$ together with $A - C$ is an induced subgraph of $\bar G$ for some $A, B, C \in V$. Then $A \to B$, $B \to C$, or both $A \to B$ and $B \to C$, were added to $G$ in the construction of $\bar G$. Because of $A - C$, suppose $A$ and $C$ are in the same bucket $B_i$ for some $i \in \{1, \dots, K\}$ in $G$. Also, suppose $B \in B_j$. Because only directed edges between buckets are added, $i \neq j$.

Now, $A \to B$ and $B \to C$ cannot both be added to $G$ to construct $\bar G$, because that would imply $i < j$ and $j < i$. By R1, $B \to C - A$ cannot be an induced subgraph of the MPDAG $G$, so $A \to B$ alone also could not have been added to $G$. Therefore, $B \to C$ alone was added to $G$. But $C - A \to B$ is an induced subgraph of $G$, so $i < j$, which contradicts the direction of $B \to C$.

APPENDIX C: GRAPHICAL PRELIMINARIES

Graphs, vertices, edges.
A graph $G = (V, F)$ consists of a set of vertices (variables) $V$ and a set of edges $F$. The graphs we consider are allowed to contain directed ($\to$) and undirected ($-$) edges, with at most one edge between any two vertices. We can thus partition the set of edges $F$ into a set of directed edges $E$ and undirected edges $U$, and denote the graph $G = (V, F)$ as $G = (V, E, U)$. The corresponding undirected graph is simply $G_U = (V, \emptyset, U)$.

Subgraphs and skeleton. An induced subgraph $G_{V'} = (V', F')$ of $G = (V, F)$ consists of $V' \subseteq V$ and $F' \subseteq F$, where $F'$ are all edges in $F$ between vertices in $V'$. A skeleton of a graph $G = (V, F)$ is an undirected graph $G = (V, F')$ such that $F'$ are the undirected versions of all edges in $F$.

Paths: directed, undirected, causal, non-causal, proper paths. A path $p$ from $i$ to $j$ in $G$ is a sequence of distinct vertices $p = \langle i, \dots, j\rangle$ in which every pair of successive vertices is adjacent. A path consisting of undirected edges is an undirected path. A directed path from $i$ to $j$ is a path from $i$ to $j$ in which all edges are directed towards $j$, that is, $i \to \dots \to j$. We will use causal path instead of directed path when talking about causal graphs. Let $p = \langle v_1, \dots, v_k\rangle$, $k > 1$, be a path in $G$; $p$ is a possibly directed path (possibly causal path) if no edge $v_i \leftarrow v_j$, $1 \le i < j \le k$, is in $G$. Otherwise, $p$ is a non-causal path in $G$ (see Definition 3.1 and Lemma 3.2 of Perković et al., 2017). A path from $A$ to $Y$ is proper (w.r.t. $A$) if only its first vertex is in $A$.

Directed cycles. A directed path from $i$ to $j$ together with the edge $j \to i$ forms a directed cycle.

Colliders, shields and definite status paths. If a path $p$ contains $i \to j \leftarrow k$ as a subpath, then $j$ is a collider on $p$. A path $\langle i, j, k\rangle$ is an (un)shielded triple if $i$ and $k$ are (not) adjacent. A path is unshielded if all successive triples on the path are unshielded. A node $v_j$ is a definite non-collider on a path $p$ if there is at least one edge out of $v_j$ on $p$, or if $v_{j-1} - v_j - v_{j+1}$ is a subpath of $p$ and $\langle v_{j-1}, v_j, v_{j+1}\rangle$ is an unshielded triple. A node is of definite status on a path if it is a collider, a definite non-collider or an endpoint on the path. A path $p$ is of definite status if every node on $p$ is of definite status.

Subsequences and subpaths. A subsequence of a path $p$ is obtained by deleting some nodes from $p$ without changing the order of the remaining nodes. A subsequence of a path is not necessarily a path. For a path $p = \langle v_1, v_2, \dots, v_m\rangle$, the subpath from $v_i$ to $v_k$ ($1 \le i \le k \le m$) is the path $p(v_i, v_k) = \langle v_i, v_{i+1}, \dots, v_k\rangle$.

Ancestral relations. If $i \to j$, then $i$ is a parent of $j$, and $j$ is a child of $i$. If there is a causal path from $k$ to $l$, then $k$ is an ancestor of $l$, and $l$ is a descendant of $k$. If there is a possibly causal path from $k$ to $l$, then $k$ is a possible ancestor of $l$, and $l$ is a possible descendant of $k$. We use the convention that every vertex is a descendant, ancestor, possible ancestor and possible descendant of itself. The sets of parents, ancestors, descendants and possible descendants of $i$ in $G$ are denoted by $\operatorname{Pa}(i, G)$, $\operatorname{An}(i, G)$, $\operatorname{De}(i, G)$ and $\operatorname{PossDe}(i, G)$ respectively. For a set of vertices $A$, we let $\operatorname{Pa}(A, G) = (\cup_{i \in A} \operatorname{Pa}(i, G)) \setminus A$, whereas $\operatorname{An}(A, G) = \cup_{i \in A} \operatorname{An}(i, G)$, $\operatorname{De}(A, G) = \cup_{i \in A} \operatorname{De}(i, G)$ and $\operatorname{PossDe}(A, G) = \cup_{i \in A} \operatorname{PossDe}(i, G)$.

DAGs, PDAGs. A directed graph contains only directed edges. A partially directed graph may contain both directed and undirected edges. A directed graph without directed cycles is a directed acyclic graph (DAG). A partially directed acyclic graph (PDAG) is a partially directed graph without directed cycles.

Blocking and d-separation. (See Definition 1.2.3 of Pearl [2009] and Lemma C.1 of Henckel et al. [2019].) Let $Z$ be a set of vertices in a PDAG $G = (V, E, U)$. A definite status path $p$ is blocked by $Z$ if (i) $p$ contains a non-collider that is in $Z$, or (ii) $p$ contains a collider $C$ such that no descendant of $C$ is in $Z$. A definite status path that is not blocked by $Z$ is open given $Z$. If $A$, $B$ and $Z$ are three pairwise disjoint sets of nodes in a PDAG $G = (V, E, U)$, then $Z$ d-separates $A$ from $B$ in $G$ if $Z$ blocks every definite status path between any node in $A$ and any node in $B$ in $G$.

CPDAGs, MPDAGs.
Several DAGs can encode the same d-separation relationships. Such DAGs form a Markov equivalence class, which is uniquely represented by a completed partially directed acyclic graph (CPDAG) [Meek, 1995, Andersson et al., 1997]. A PDAG $G = (V, E, U)$ is a maximally oriented PDAG (MPDAG) if it is closed under the orientation rules R1–R4 of Meek [1995], presented in Figure C.1. An MPDAG can then be alternatively defined as any PDAG that does not contain the graphs on the left-hand side of each orientation rule as induced subgraphs. Both DAGs and CPDAGs are types of MPDAGs [Meek, 1995].

Fig C.1 (diagrams omitted): The orientation rules R1–R4 from Meek [1995]. If the graph on the left-hand side of a rule is an induced subgraph of a PDAG $G$, then orient the blue undirected edge ($-$) as shown on the right-hand side of the rule. Hence, the graphs on the left-hand side of each rule are not allowed to be induced subgraphs of an MPDAG.
Background knowledge and constructing MPDAGs. A PDAG $G'$ is represented by another PDAG $G$ (equivalently, $G$ represents $G'$) if $G'$ and $G$ have the same adjacencies and unshielded colliders and every directed edge $i \to j$ in $G$ is also in $G'$. Let $R$ be a set of directed edges representing background knowledge. Algorithm 1 of Meek [1995] describes how to incorporate background knowledge $R$ in an MPDAG $G$. If Algorithm 1 does not return a FAIL, then it returns a new MPDAG $G'$ that is represented by $G$. Background knowledge $R$ is consistent with MPDAG $G$ if and only if Algorithm 1 does not return a FAIL [Meek, 1995].
Algorithm 1: ConstructMPDAG [Meek, 1995, Perković et al., 2017]

Data: MPDAG $G$, background knowledge $R$
Result: MPDAG $G'$ or FAIL

Let $G' = G$;
while $R \neq \emptyset$ do
  Choose an edge $X \to Y$ in $R$;  $R = R \setminus \{X \to Y\}$;
  if $X - Y$ or $X \to Y$ is in $G'$ then
    Orient $X \to Y$ in $G'$;
    Close the edge orientations under the rules in Figure C.1 in $G'$;
  else
    FAIL;
  end
end

An R sketch mirroring this control flow is given below, after the ordering definitions.

Remark. The MPDAG output by ConstructMPDAG($G$, $R$) is the same regardless of the ordering of the edges in $R$. This stems from the fact that the orientation rules of Meek [1995] are necessary and sufficient for the construction of an MPDAG given a set of adjacencies and unshielded colliders.

$G$ and $[G]$. If $G$ is an MPDAG, then $[G]$ denotes every DAG represented by $G$.

Causal and partial causal ordering of vertices. A total ordering $<$ of vertices $V' \subseteq V$ is consistent with a DAG $D = (V, E, \emptyset)$, and called a causal ordering of $V'$, if for every $i, j \in V'$ such that $i < j$ and such that $i$ and $j$ are adjacent in $D$, $i \to j$ is in $D$. There can be more than one causal ordering of $V'$ in a DAG $D = (V, E, \emptyset)$; for example, in the DAG $i \leftarrow j \to k$, both orderings $j < i < k$ and $j < k < i$ are consistent.

Since an MPDAG may contain undirected edges, there is generally no unique causal ordering of vertices in an MPDAG. Instead, we define a partial causal ordering $<$ of a vertex set $V'$, $V' \subseteq V$, in an MPDAG $G = (V, E, U)$ as a total ordering of pairwise disjoint vertex sets $A_1, \dots, A_k$, $k \ge 1$, with $\cup_{i=1}^k A_i = V'$, that satisfy the following: if $A_i < A_j$ and there is an edge between $i \in A_i$ and $j \in A_j$ in $G$, then $i \to j$ is in $G$.
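As referenced above, the following R sketch mirrors the control flow of Algorithm 1 on edge lists. The rule-closure step is deliberately left as a comment stub; a complete implementation of adding background knowledge, including closure under R1–R4, is available in the pcalg package (see its addBgKnowledge function).

# Control-flow sketch of ConstructMPDAG; the Meek-rule closure is a stub.
construct_mpdag <- function(dir_edges, undir_edges, R) {
  # dir_edges, undir_edges, R: two-column matrices of (from, to) vertex pairs
  for (r in seq_len(nrow(R))) {
    x <- R[r, 1]; y <- R[r, 2]
    has_undir <- any((undir_edges[, 1] == x & undir_edges[, 2] == y) |
                     (undir_edges[, 1] == y & undir_edges[, 2] == x))
    has_dir   <- any(dir_edges[, 1] == x & dir_edges[, 2] == y)
    if (!has_undir && !has_dir) return("FAIL")
    if (has_undir) {
      drop <- (undir_edges[, 1] %in% c(x, y)) & (undir_edges[, 2] %in% c(x, y))
      undir_edges <- undir_edges[!drop, , drop = FALSE]
      dir_edges   <- rbind(dir_edges, c(x, y))   # orient x -> y
      # here one would close the orientations under Meek's rules R1-R4
    }
  }
  list(directed = dir_edges, undirected = undir_edges)
}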
Buckets and bucket decomposition. Algorithm 2 describes how to obtain an ordered bucket decomposition for a set of vertices $V$ in an MPDAG $G = (V, E, U)$. By Perković [2020, Lemma 1], the ordered list of buckets output by Algorithm 2 is a partial causal ordering of $V$ in $G$.

Algorithm 2: Partial causal ordering [Perković, 2020]

input: vertex set $V$ in MPDAG $G = (V, E, U)$ and MPDAG $G$.
output: An ordered list $B = (B_1, \dots, B_k)$, $k \ge 1$, of the bucket decomposition of $V$ in $G$.

Let $G_U$ denote the undirected subgraph of $G$;
Let ConComp be the bucket decomposition (i.e., maximal connected components) of $V$ in $G_U$;
Let $B$ be an empty list;
while ConComp $\neq \emptyset$ do
  Let $C$ be any element of ConComp;
  Let $C'$ be the set of vertices in ConComp that are not in $C$;
  if all edges between $C$ and $C'$ are into $C$ in $G$ then
    Remove $C$ from ConComp;
    Add $C$ to the beginning of $B$;
  end
end
return $B$;
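A minimal R sketch of Algorithm 2, under the assumption that the graph is given as two adjacency matrices (undir for the undirected part, dir for the directed part):

# Bucket decomposition: connected components of G_U, then repeatedly peel off
# a component whose between-component edges all point into it.
bucket_decomposition <- function(undir, dir) {
  p <- nrow(undir)
  comp <- rep(NA_integer_, p)              # component labels via BFS
  lab <- 0L
  for (v in seq_len(p)) if (is.na(comp[v])) {
    lab <- lab + 1L; queue <- v
    while (length(queue) > 0) {
      u <- queue[1]; queue <- queue[-1]; comp[u] <- lab
      queue <- c(queue, which(undir[u, ] == 1 & is.na(comp)))
    }
  }
  remaining <- split(seq_len(p), comp)     # maximal connected components of G_U
  B <- list()
  while (length(remaining) > 0) {
    for (i in seq_along(remaining)) {
      C <- remaining[[i]]
      Cbar <- setdiff(unlist(remaining), C)
      if (!any(dir[C, Cbar] == 1)) {       # all edges between C and Cbar into C
        B <- c(list(C), B)                 # add C to the beginning of B
        remaining <- remaining[-i]
        break                              # a valid MPDAG always yields such a C
      }
    }
  }
  B
}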
Lemma C.1 (see Lemma D.1(i) of Perković, 2020). Let $A$ and $Y$ be disjoint node sets in MPDAG $G = (V, E, U)$. Suppose that there is no proper possibly causal path from $A$ to $Y$ that starts with an undirected edge in $G$; that is, suppose that the criterion in Theorem 1 is satisfied. Further, let $D = \operatorname{An}(Y, G_{V \setminus A})$ and $D = \dot{\cup}_{i=1}^K D_i$ for $D_i = D \cap B_i$, $i = 1, \dots, K$, where $B_1, \dots, B_K$ is the bucket decomposition of $V$. Then for all $i \in \{1, \dots, K\}$, there is no proper possibly causal path from $A$ to $B_i$ that starts with an undirected edge in $G$.

APPENDIX D: ADDITIONAL SIMULATION RESULTS

In this section, we report additional simulation results. The setup is the same as Section 7.1 of the main text, but we replace the true CPDAG with the CPDAG estimated with the greedy equivalence search algorithm [Chickering, 2002] based on the same sample. The relative squared errors of the contending estimators are shown in Fig. D.1 and are summarized in Table D.1. Compared to the results with the true CPDAG, the performance improvement of G-regression is more modest but still matters in practice. The reduced improvement is due to the error in estimating the graph, which diminishes as n increases.

Table D.1 (numerical entries omitted): Geometric average (brackets: median) of relative squared errors compared to G-regression when CPDAGs are estimated, for adj.O, IDA.M and IDA.R, across |V| ∈ {20, 50, 100}, n ∈ {100, 1000} and |A|.
Fig D.1 (plots omitted): Violin plots for the relative squared errors of contending estimators ('·': geometric mean, '+': median). The estimated CPDAGs are provided to the estimators.

ACKNOWLEDGEMENTS

Authors thank Thomas Richardson for valuable comments and discussions. The first author was supported by ONR Grant N000141912446.

REFERENCES

Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
Theodore Wilbur Anderson and Ingram Olkin. Maximum-likelihood estimation of the parameters of a multivariate normal distribution. Linear Algebra and its Applications, 70:147–171, 1985.
Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25:505–541, 1997.
Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, Baltimore, 1993.
Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley, New York, 1989.
Wenyu Chen, Mathias Drton, and Y. Samuel Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, 2019.
David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
A. Philip Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265–274, 1981.
Mathias Drton. Computing all roots of the likelihood equations of seemingly unrelated regressions. Journal of Symbolic Computation, 41(2):245–254, 2006.
Mathias Drton. Algebraic problems in structural equation modeling. In The 50th Anniversary of Gröbner Bases, pages 35–86. Mathematical Society of Japan, 2018.
Mathias Drton and Michael Eichler. Maximum likelihood estimation in Gaussian chain graph models under the alternative Markov property. Scandinavian Journal of Statistics, 33(2):247–257, 2006.
Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.
Mathias Drton and Thomas S. Richardson. Multimodality of the likelihood in the bivariate seemingly unrelated regressions model. Biometrika, 91(2):383–392, 2004.
Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on Algebraic Statistics, volume 39. Springer Science & Business Media, 2008.
Mathias Drton, Michael Eichler, and Thomas S. Richardson. Computing maximum likelihood estimates in recursive linear models with correlated errors. Journal of Machine Learning Research, 10(81):2329–2348, 2009.
Mathias Drton, Rina Foygel, and Seth Sullivant. Global identifiability of linear structural equation models. The Annals of Statistics, 39(2):865–886, 2011.
Marco Eigenmann, Preetam Nandy, and Marloes H. Maathuis. Structure learning of linear Gaussian structural equation models with weak edges. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI-17), 2017.
Zhuangyan Fang and Yangbo He. IDA with background knowledge. In Proceedings of the 36th Annual Conference on Uncertainty in Artificial Intelligence (UAI-20), 2020.
Shantanu Gupta, Zachary C. Lipton, and David Childers. Estimating treatment effects with observed confounders and mediators. arXiv preprint arXiv:2003.11991, 2020.
Marc Hallin, Davide La Vecchia, and Hang Liu. Center-outward R-estimation for semiparametric VARMA models. arXiv preprint arXiv:1910.08442, 2019.
Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.
Alan Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13:2409–2464, 2012.
Leonard Henckel, Emilija Perković, and Marloes H. Maathuis. Graphical criteria for efficient total effect estimation via adjustment in causal linear models. arXiv preprint arXiv:1907.02435, 2019.
Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
Patrik O. Hoyer, Aapo Hyvärinen, Richard Scheines, Peter L. Spirtes, Joseph Ramsey, Gustavo Lacerda, and Shohei Shimizu. Causal discovery of linear acyclic models with arbitrary distributions. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 282–289, 2008.
Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.
Tjalling C. Koopmans and Olav Reiersøl. The identification of structural characteristics. The Annals of Mathematical Statistics, 21(2):165–181, 1950.
Manabu Kuroki and Zhihong Cai. Selection of identifiability criteria for total effects by using path diagrams. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 333–340, 2004.
Manabu Kuroki and Masami Miyakawa. Covariate selection for estimating the causal effect of control plans by using causal diagrams. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):209–222, 2003.
Manabu Kuroki and Hisayoshi Nanmo. Variance formulas for estimated mean response and predicted response with external intervention based on the back-door criterion in linear structural equation models. AStA Advances in Statistical Analysis, pages 1–19, 2020.
Steffen L. Lauritzen. Graphical Models. Oxford University Press, New York, 1996.
Jianzhou Liu. Some Löwner partial orders of Schur complements and Kronecker products of matrices. Linear Algebra and its Applications, 291(1-3):143–149, 1999.
Marloes H. Maathuis and Diego Colombo. A generalized back-door criterion. The Annals of Statistics, 43(3):1060–1088, 2015.
Marloes H. Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
Daniel Marbach, Thomas Schaffter, Dario Floreano, Robert J. Prill, and Gustavo Stolovitzky. The DREAM4 in-silico network challenge. Draft, version 0.3, http://gnw.sourceforge.net/resources/DREAM4%20in%20silico%20challenge.pdf, 2009a.
Daniel Marbach, Thomas Schaffter, Claudio Mattiussi, and Dario Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16(2):229–239, 2009b.
Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 403–410, 1995.
Preetam Nandy, Marloes H. Maathuis, and Thomas S. Richardson. Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics, 45(2):647–674, 2017.
Judea Pearl. Comment: graphical models, causality and intervention. Statistical Science, 8(3):266–269, 1993.
Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
Judea Pearl. Causality. Cambridge University Press, Cambridge, 2nd edition, 2009.
Judea Pearl and Thomas S. Verma. A theory of inferred causation. In Studies in Logic and the Foundations of Mathematics, volume 134, pages 789–811. Elsevier, 1995.
Emilija Perković. Identifying causal effects in maximally oriented partially directed acyclic graphs. In Proceedings of the 36th Annual Conference on Uncertainty in Artificial Intelligence (UAI-20), 2020.
Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H. Maathuis. A complete generalized adjustment criterion. In Proceedings of the 31st Annual Conference on Uncertainty in Artificial Intelligence (UAI-15), 2015.
Emilija Perković, Markus Kalisch, and Marloes H. Maathuis. Interpreting and using CPDAGs with background knowledge. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI-17), 2017.
Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H. Maathuis. Complete graphical characterization and construction of adjustment sets in Markov equivalence classes of ancestral graphs. Journal of Machine Learning Research, 18(220):1–62, 2018.
Jonas Peters and Peter Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
James M. Robins. A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986.
Dominik Rothenhäusler, Jan Ernest, and Peter Bühlmann. Causal inference in partially linear structural equation models: identifiability and estimation. The Annals of Statistics, 46:2904–2938, 2018.
Andrea Rotnitzky and Ezequiel Smucler. Efficient adjustment sets for population average treatment effect estimation in non-parametric causal graphical models. arXiv preprint arXiv:1912.00306, 2019.
John D. Sargan. The estimation of economic relationships using instrumental variables. Econometrica: Journal of the Econometric Society, pages 393–415, 1958.
Richard Scheines, Peter Spirtes, Clark Glymour, Christopher Meek, and Thomas Richardson. The TETRAD project: constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117, 1998.
Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
Galen R. Shorack. Probability for Statisticians. Springer, 2000.
Ilya Shpitser, Tyler VanderWeele, and James M. Robins. On the validity of covariate adjustment for estimating causal effects. In Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 527–536, 2010.
Ezequiel Smucler, Facundo Sapienza, and Andrea Rotnitzky. Efficient adjustment sets in causal graphical models with hidden variables. arXiv preprint arXiv:2004.10521, 2020.
Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
Robert H. Strotz and H. O. A. Wold. Recursive vs. nonrecursive systems: An attempt at synthesis (part I of a triptych on causal chain systems). Econometrica, 28(2):417–427, 1960.
Seth Sullivant, Kelli Talaska, and Jan Draisma. Trek separation for Gaussian graphical models. The Annals of Statistics, 38(3):1665–1685, 2010.
Anastasios Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.
Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
Yuhao Wang, Liam Solus, Karren Dai Yang, and Caroline Uhler. Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems 30, pages 5822–5831, 2017.
Janine Witte, Leonard Henckel, Marloes H. Maathuis, and Vanessa Didelez. On efficient adjustment in causal graphs. arXiv preprint arXiv:2002.06825, 2020.
Sewall Wright. Correlation and causation. Journal of Agricultural Research, 20:557–585, 1921.
Sewall Wright. The method of path coefficients. The Annals of Mathematical Statistics, 5(3):161–215, 1934.
Department of Statistics
University of Washington
Box 354322
Seattle, WA 98195
E-mail: [email protected]; [email protected]