Efficient Least Squares for Estimating Total Effects under Linearity and Causal Sufficiency
By F. Richard Guo and Emilija Perković
Department of Statistics, University of Washington, Seattle
August 31, 2020
Recursive linear structural equation models are widely used to postulate causal mechanisms underlying observational data. In these models, each variable equals a linear combination of a subset of the remaining variables plus an error term. When there is no unobserved confounding or selection bias, the error terms are assumed to be independent. We consider estimating a total causal effect in this setting. The causal structure is assumed to be known only up to a maximally oriented partially directed acyclic graph (MPDAG), a general class of graphs that can represent a Markov equivalence class of directed acyclic graphs (DAGs) with added background knowledge. We propose a simple estimator based on recursive least squares, which can consistently estimate any identified total causal effect, under point or joint intervention. We show that this estimator is the most efficient among all regular estimators that are based on the sample covariance, which includes covariate adjustment and the estimators employed by the joint-IDA algorithm. Notably, our result holds without assuming Gaussian errors.

Keywords and phrases: total effect, structural equation model, least squares, semiparametric efficiency, partially directed acyclic graph, observational studies
1. Introduction.
A linear structural equation model (SEM) specifies a causal mechanism underlying a set of variables [Bollen, 1989]. Each variable equals a linear combination of a subset of the remaining variables plus an error term. An SEM is associated with a mixed graph, also known as a path diagram [Wright, 1921, 1934], which consists of both directed edges and bi-directed edges. A directed edge i → j represents that variable i appears as a covariate in the structural equation defining variable j. The equation for variable j takes the form

(1)  X_j = ∑_{i : i → j} γ_{ij} X_i + ϵ_j,

where ϵ_j is an error term. Often, the errors are assumed to follow a multivariate normal distribution, but that need not be the case. A bi-directed edge i ↔ j indicates that the errors ϵ_i and ϵ_j are dependent, which is assumed when there exists an unobserved (i.e., latent) confounder between i and j. The mixed graph is usually assumed to be acyclic, i.e., the graph does not contain cycles made of directed edges.

We focus on the setting where there is no unobserved confounder or selection bias, a condition also known as causal sufficiency; see Spirtes et al. [2000, Chap. 3] and Pearl [2009, Chap. 6]. In this setting, all the error terms are assumed to be mutually independent and the mixed graph associated with the linear SEM is a directed acyclic graph (DAG), often called a causal DAG. Aside from being a statistical model for observational data, the linear SEM is also a causal model in the sense that it specifies the behavior of the system under interventions (see Section 3.2). Therefore, the total causal effect of one treatment variable (point intervention) or several treatment variables (joint intervention) on some outcome variables can be defined.

The underlying causal DAG is usually unknown. In fact, linear SEMs associated with different DAGs may define the same observed distribution [Drton et al., 2011]. Without further assumptions on the error distributions, the underlying DAG can only be learned from observational data up to its Markov equivalence class, which can be uniquely represented by a completed partially directed acyclic graph (CPDAG) [Meek, 1995, Andersson et al., 1997]. Additional background knowledge, such as knowledge of certain causal relationships [Meek, 1995, Fang and He, 2020] or partial orderings [Scheines et al., 1998], restrictions on the error distributions [Hoyer et al., 2008, Peters and Bühlmann, 2014], and other assumptions [Hauser and Bühlmann, 2012, Wang et al., 2017, Rothenhäusler et al., 2018, Eigenmann et al., 2017] can be used to further refine the Markov equivalence class of DAGs, so that the causal structure is represented by a maximally oriented PDAG (MPDAG), a rather general class of graphs containing directed and undirected edges that subsumes DAGs and CPDAGs [Meek, 1995]. A total causal effect is identified given a graph if it can be expressed as a functional of the observed distribution that is the same for every DAG in the equivalence class. Recently, a necessary and sufficient graphical criterion for identification given an MPDAG was established by Perković [2020]. In general, there may be more than one identifying functional.

Naturally, the next step is to develop estimators for an identified total effect with desirable properties.
When the effect is unidentified, the reader is referred to the IDA-type approaches [Maathuis et al., 2009, Nandy et al., 2017], which are beyond the scope of this paper. Among others, we consider the following desiderata.

Completeness.
Can the estimator consistently estimate every identified effect, under either point or joint interventions?
Efficiency.
Does the estimator achieve the smallest asymptotic (co)variance compared to a reasonably large class of estimators?

To the best of our knowledge, no estimator proposed in the literature fulfills both desiderata. Indeed, the commonly used covariate adjustment estimators [Pearl, 1993, Shpitser et al., 2010, Maathuis and Colombo, 2015, Perković et al., 2015] do not exist for certain total effects under joint interventions [Nandy et al., 2017, Perković et al., 2018, Perković, 2020]. Furthermore, when they exist, even with an optimal adjustment set chosen to maximize efficiency [Henckel et al., 2019, Rotnitzky and Smucler, 2019, Witte et al., 2020], we will show that covariate adjustment can compare less favorably with estimators from a larger class.

We propose an estimator based on recursive least squares that fulfills both desiderata. In particular, our proposed estimator achieves the efficiency bound among all regular estimators that only depend on the sample covariance; see Section 6 for the precise definition of the class of estimators. Remarkably, our result holds regardless of the type of error distribution in the underlying linear SEM. Our method is implemented in the R [R Core Team, 2020] package eff2 (https://github.com/richardkwo/eff2), which stands for "efficient effect" (estimate).

The paper is organized as follows. In Section 2, we review related work on efficient estimation of total causal effects in over-identified settings. In Section 3, we introduce the preliminaries on linear structural equation models, causal graphs and the identification of total causal effects. The concept of bucket decomposition is introduced. In Section 4, we introduce a block-recursive representation for the observational data and identify the total causal effect under such a representation. We first derive the proposed least squares estimator by finding the maximum likelihood estimator (MLE) under the assumption of Gaussian errors in Section 5. We then prove the optimal efficiency of our proposed estimator under arbitrary error distributions in Section 6. Additional preliminaries, proofs and numerical results can be found in the Appendix.
2. Related work.
The statistical performance of an estimator of a total causal effect in over-identified settings has recently received more attention; see, e.g., Kuroki and Miyakawa [2003], Henckel et al. [2019], Witte et al. [2020], Gupta et al. [2020], Rotnitzky and Smucler [2019], Smucler et al. [2020], Kuroki and Nanmo [2020]. Here, "over-identified" [Koopmans and Reiersøl, 1950] refers to the fact that the total causal effect can be expressed as more than one functional of the (population) observed distribution, all of which coincide due to the additional conditional independence constraints obeyed by the observed distribution. For example, in the case where a total causal effect can be identified through covariate adjustment, usually there exists more than one valid adjustment set [Henckel et al., 2019]. This is in contrast to the more traditional setting of causal inference, where the observed data distribution is nonparametric and is not expected to satisfy extra conditional independences.

Intuitively, the conditional independences in over-identified models can be exploited to maximize asymptotic efficiency; see, e.g., Sargan [1958], Hansen [1982] for early work in this direction. Under a linear SEM with independent errors, a total causal effect can be estimated via covariate adjustment as the least squares coefficient from the regression of the outcome on the treatment and adjustment variables. Henckel et al. [2019] recently showed that, under a linear SEM with independent errors, a valid adjustment set that minimizes the asymptotic variance, also referred to as the optimal adjustment set, can be graphically characterized; see also Witte et al. [2020] for a further characterization of such an optimal set. This result was generalized by Rotnitzky and Smucler [2019] beyond linear SEMs: an optimal adjustment set is shown to always exist for point interventions, and a semiparametric efficient estimator is developed for this case. Note that, while valid adjustment sets (called "time-independent" adjustment sets by Rotnitzky and Smucler [2019]) exist for point interventions [Perković, 2020, Proposition 4.2], they may not exist for joint interventions [Nandy et al., 2017, Perković et al., 2018, Perković, 2020].

Less is known about how to efficiently estimate the total causal effect of a joint intervention, at least in a generic fashion. For linear SEMs with independent errors, with knowledge of the parents of the treatment variables in the underlying causal DAG, Nandy et al. [2017] considered two estimators for the joint-IDA algorithm, one based on recursive least squares and one based on a modified Cholesky decomposition. However, the efficiency properties of these estimators were not explored. In Section 7, numerical comparisons will show that our proposed estimator significantly outperforms the estimators of Nandy et al. [2017].

Other results on the linear SEM include explicit calculations and comparisons for typical examples with either a particular structure or only a few variables; see, e.g., Kuroki and Cai [2004], Gupta et al. [2020]. Gaussian errors are also assumed in these calculations.
3. Linear SEMs, causal graphs and effect identification.
3.1. Linear SEMs under causal sufficiency.
A linear SEM postulates a causal mechanism that generates data. Let X denote a vector of variables generated by a linear SEM, where X is indexed by V (X = X_V). Let D be the associated DAG on vertices V. For this |V|-dimensional random vector X, the model in Eq. (1) can be compactly rewritten as

(2)  X = Γ⊤X + ϵ,  Γ = (γ_{ij}),  i → j not in D ⇒ γ_{ij} = 0,

where Γ ∈ R^{|V|×|V|} is a coefficient matrix and ϵ = (ϵ_i) is a |V|-dimensional random vector. DAG D is associated with the linear SEM in Eq. (1) in the sense that the non-zero entries of Γ correspond to the edges in D.

Under causal sufficiency (no latent variables), we assume

(3)  {ϵ_i : i ∈ V} are independent,  E ϵ = 0,  E ϵϵ⊤ ≻ 0,

where for a real, symmetric matrix A, A ≻ 0 means A is positive definite. The errors {ϵ_i : i ∈ V} are not necessarily Gaussian, nor identically distributed.

The law P(X) is called the observed distribution. For a given D, we will use P_D to denote the set of possible laws of X, namely the collection of P(X) as Γ and the error distribution vary subject to Eqs. (2) and (3). The linear SEM poses certain restrictions on the set of laws P_D. Let Pa(i, D) denote the set of parents of vertex i, i.e., {j : j → i is in D}. For any P ∈ P_D, among other constraints, (i) P factorizes according to D, (ii) E[X_i | X_{Pa(i,D)}] is linear in X_{Pa(i,D)} and (iii) var[X_i | X_{Pa(i,D)}] is constant in X_{Pa(i,D)}.

We observe n independent and identically distributed (iid) samples generated by the model above, namely X^(i) = (I − Γ)^{−⊤} ϵ^(i) for i = 1, ..., n. Note that (I − Γ) is invertible because Γ can be permuted into a lower-triangular matrix according to a topological ordering (i.e., causal ordering) of the vertices in D.
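To make the sampling step concrete, here is a minimal R sketch that draws iid samples via X = (I − Γ)^{−⊤}ϵ. The three-variable DAG 1 → 2 → 3, the coefficient values and the Gaussian errors are illustrative choices of ours, not part of the model definition.

    # Minimal sketch: simulate n iid samples from a linear SEM X = Gamma' X + eps
    # for the hypothetical DAG 1 -> 2 -> 3 with independent unit-variance errors.
    set.seed(1)
    p <- 3; n <- 1000
    Gamma <- matrix(0, p, p)
    Gamma[1, 2] <- 0.8    # edge 1 -> 2 with coefficient gamma_12
    Gamma[2, 3] <- -0.5   # edge 2 -> 3 with coefficient gamma_23
    eps <- matrix(rnorm(n * p), n, p)     # independent (here Gaussian) errors
    X <- eps %*% solve(diag(p) - Gamma)   # rows are X^(i) = (I - Gamma)^{-T} eps^(i)
    round(cov(X), 2)                      # sample covariance approximates Sigma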
3.2. Interventions and total causal effects. The assumed linear SEM also dictates the behavior of the system under interventions. Let A ⊆ V be a set of vertices indexing the treatment variables X_A. We use do(X_A = x_A) to denote intervening on the variables X_A and forcing them to take the values x_A [Pearl, 1995]. We call this a point intervention if A is a singleton, and a joint intervention if A consists of several vertices, which corresponds to the case of multiple treatments. While X_A is fixed to x_A, the remaining variables are generated by their corresponding structural equations in Eq. (1), with each X_i for i ∈ A appearing in the equations replaced by the corresponding enforced value x_i [Strotz and Wold, 1960]. This generating mechanism defines the interventional distribution, denoted by P(X | do(X_A = x_A)), where the conditional probability notation is only conventional. More formally, the interventional distribution is expressed as

(4)  P(X | do(X_A = x_A)) = ∏_{j ∈ A} δ_{x_j}(X_j) ∏_{i ∉ A} P(X_i | X_{Pa(i,D)}),

where δ denotes a Dirac measure. The factor P(X_i | X_{Pa(i)}) is defined by the structural equation for X_i. Eq. (4) is known as the truncated factorization formula [Pearl, 2009], the manipulated density formula [Spirtes et al., 2000] or the g-formula [Robins, 1986].

Definition 1. Let X_A be a vector of treatment variables and X_Y with Y ∈ V \ A be an outcome variable. The total causal effect of X_A on X_Y is defined as the vector τ_AY ∈ R^{|A|}, where

(τ_AY)_i = ∂/∂x_{A_i} E[X_Y | do(X_A = x_A)],  i = 1, ..., |A|.

That is, τ_AY is the gradient of the linear map x_A ↦ E[X_Y | do(X_A = x_A)]. When multiple outcomes Y = {Y_1, ..., Y_k}, k > 1, are considered, the total causal effect of X_A on X_{Y_1}, ..., X_{Y_k} can be defined by concatenating τ_{AY_1}, ..., τ_{AY_k}. Therefore, throughout, we assume the outcome variable is a singleton without loss of generality. Each coordinate of the total causal effect τ_AY can be expressed as a sum-product of the underlying linear SEM coefficients along certain causal paths from A to Y in D, that is, certain paths of the form A_i → ··· → Y for A_i ∈ A; see also Wright [1934], Sullivant et al. [2010].
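For a point intervention under a known DAG, the sum-product characterization has a convenient matrix form: since Γ is nilpotent, (I − Γ)^{−1} = I + Γ + Γ² + ···, whose (i, y) entry sums the coefficient products over all directed paths from i to y. A two-line check, reusing p and Gamma from the sketch above:

    # Total effect of X_1 on X_3 under do(X_1 = x): entry (1, 3) of (I - Gamma)^{-1},
    # which sums path coefficient products (here the single path 1 -> 2 -> 3).
    Tmat <- solve(diag(p) - Gamma)
    Tmat[1, 3]    # = Gamma[1, 2] * Gamma[2, 3] = -0.4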
3.3. Causal graphs. Two different linear SEMs on the same set of variables can define the same observed distribution. For example, under Gaussian errors, the linear SEMs associated with DAGs A → Y and A ← Y define the same set of observed distributions, namely the set of centered bivariate Gaussian distributions. Without making additional assumptions on the error distribution, such as non-Gaussianity [Shimizu et al., 2006], partial non-Gaussianity [Hoyer et al., 2008], or equal variance of errors [Peters and Bühlmann, 2014, Chen et al., 2019], the underlying causal DAG can only be learned from the observed distribution up to its Markov equivalence class [Pearl and Verma, 1995, Chickering, 2002].

CPDAGs.
Two DAGs on the same set of vertices are Markov equivalent if they encode the same set of d-separation relations between the vertices. The d-separations between the vertices prescribe conditional independences between the corresponding variables (known as the Markov condition [Lauritzen, 1996]). A Markov equivalence class of DAGs can be uniquely represented by a completed partially directed acyclic graph (CPDAG), also known as an essential graph [Meek, 1995, Andersson et al., 1997]. A CPDAG C is a graph on the same set of vertices that can contain both directed and undirected edges. We use [C] to denote the Markov equivalence class represented by CPDAG C. A directed edge i → j in C implies that i → j is in every D ∈ [C], whereas an undirected edge i − j in C implies that there exist D_1, D_2 ∈ [C] such that i → j in D_1 but i ← j in D_2. Given a DAG D, the CPDAG C representing the Markov equivalence class of D can be drawn by keeping the skeleton of D, adding all the unshielded colliders from D and completing the orientation rules R1–R3 of Meek [1995]; see Fig. C.1 in Appendix C. For example, DAGs A → Y and A ← Y are represented by CPDAG A − Y. To slightly abuse the notation, for a distribution Q, we write Q ∈ [C] if Q factorizes according to some DAG D ∈ [C]; see Lauritzen [1996]. There are various structure learning algorithms that can be used to uncover CPDAG C from observational data. Some well-known examples are the PC algorithm [Spirtes et al., 2000] and the greedy equivalence search [Chickering, 2002]. Choosing an appropriate algorithm for the dataset at hand is beyond the scope of this paper; the reader is referred to Drton and Maathuis [2017, § 4] for a recent overview.
MPDAGs.
Certain background knowledge, if present, can be used to further orient some undirected edges in a CPDAG C. Typically, knowledge of temporal orderings can inform the orientation of certain undirected edges; see Spirtes et al. [2000]. Orienting edges in C according to such background knowledge and completing the orientation rules of Meek [1995] results in a maximally oriented partially directed graph (MPDAG) G. See Fig. C.1 and Algorithm 1 in Appendix C. MPDAGs are a rather general class of graphs that subsumes both DAGs and CPDAGs. An MPDAG G represents a restricted Markov equivalence class of DAGs, which we also denote by [G]. Analogously to the case of a CPDAG, i → j in G implies that i → j is in every D ∈ [G], and i − j in G implies that there exist D_1, D_2 ∈ [G] such that i → j in D_1 but i ← j in D_2. For the rest of the paper, we will assume that we have access to an MPDAG G that represents our structural knowledge about the underlying DAG D. That is,

(5)  causal DAG D ∈ [G],  G is an MPDAG,

where [G] represents a collection of DAGs that are Markov equivalent, but can be strictly smaller than the corresponding Markov equivalence class due to background knowledge.

3.4. Causal effect identification.
Throughout the paper we will use the following notation. Given treatment variables X_A and an outcome variable X_Y such that Y ∉ A, we are interested in learning the total causal effect τ_AY. We assume that we have access to an MPDAG G, and to observational data that are generated as iid samples from a linear SEM defined by Eqs. (2) and (3), where the causal DAG D is in [G]. Before estimation can be performed, we need to make sure that τ_AY can be identified from observational data. That is, we need to ensure that τ_AY can be expressed as a functional of the observed distribution that is the same for every DAG in [G]. We have the following graphical criterion.

Theorem 1. The total causal effect τ_AY of X_A on X_Y is identified given an MPDAG G if and only if there is no proper, possibly causal path from A to Y in G that starts with an undirected edge.

Theorem 1 is Proposition 3.2 of Perković [2020]. This result does not require that the data is generated by a linear SEM. However, Perković [2020] proves that when the criterion fails, two linear SEMs with Gaussian errors can be constructed such that their observed distributions coincide but their τ_AY's differ. Hence, even if we restrict ourselves to linear SEMs, Theorem 1 still holds.

A few terms need some explanation. A path from A to Y in G is a sequence of distinct vertices ⟨v_1, ..., v_k⟩ for k > 1 with v_1 ∈ A and v_k = Y, such that every pair of successive vertices is adjacent in G. The path is proper when only its first vertex is in A. The path is possibly causal if no edge v_l ← v_r is in G for 1 ≤ l < r ≤ k. The reader is referred to Appendix C for more graphical preliminaries. When G satisfies Theorem 1 relative to vertex sets A and Y, the interventional distribution P(X_Y | do(X_A = x_A)), and hence the total effect, can be computed from the observed distribution P(X). To express the identification formula, we require the following concepts.
3.4.1. Buckets and bucket decomposition. Let G = (V, E, U) be a partially directed graph, where V is the set of vertices, and E and U are the sets of directed and undirected edges respectively. Let B_1, ..., B_K be the maximal connected components of the undirected graph G_U := (V, ∅, U). Then V = B_1 ∪̇ ··· ∪̇ B_K, where the symbol ∪̇ denotes disjoint union. Note that all the directed edges within each B_i are due to background knowledge. If we ignore the distinction between directed and undirected edges, then the subgraph induced by each B_i is chordal [Andersson et al., 1997]. Moreover, the buckets can be arranged into a partial causal ordering B_1, ..., B_K such that

(6)  i → j ∈ E, i ∈ B_l, j ∈ B_m, l ≠ m ⇒ l < m.

One can show that such a partial causal ordering always exists, though it may not be unique; see Algorithm 2 in Appendix C for obtaining such an ordering. Our result does not depend on the particular choice of partial causal ordering. We call B_1, ..., B_K the bucket decomposition of V and call each B_k for k = 1, ..., K a bucket; see Fig. 1(a) for an example. If it is clear which graph G is being referred to, we will shorten Pa(j, G) to Pa(j) to reduce clutter. For a set of vertices C in G, we use Pa(C) := ∪_{i∈C} Pa(i) \ C to denote the set of their external parents. Clearly, Pa(B_k) ⊆ B_{[k−1]}, where B_{[k−1]} := B_1 ∪ ··· ∪ B_{k−1}.

Lemma 1. Let i and j be two distinct vertices in MPDAG G = (V, E, U) such that i → j ∈ E. Suppose that there is no undirected path from i to j in G. If there is a vertex k and an undirected path j − ··· − k in G, then i → k ∈ E.

By the definition of the parent set above we have that Pa(B_k) = ∪_{i∈B_k} Pa(i) \ B_k, k = 1, ..., K. However, since a bucket B_k is a maximal subset of V that is connected by undirected edges in G, Lemma 1 implies the following important property.

Corollary 1 (Restrictive property). Let B_1, ..., B_K be the bucket decomposition of V in MPDAG G = (V, E, U). Then all vertices in the same bucket have the same set of external parents, namely

Pa(B_k) = Pa(i) \ B_k, for any i ∈ B_k, k = 1, ..., K.

The causal identification formula for P(X_Y | do(X_A = x_A)) of Perković [2020] relies on a decomposition of certain ancestors of Y in MPDAG G according to the buckets. We call vertex i an ancestor of vertex j in G if there exists a directed path i → ··· → j in G; we use the convention that j is an ancestor of itself. We denote the set of ancestors of j in G by An(j, G), shortened to An(j).

Let G_{V\A} = (V \ A, E′, U′) denote the subgraph of G induced by the vertices V \ A, where E′ includes those edges in E that are between vertices in V \ A, and similarly for U′. Consider the set of ancestors of Y in G_{V\A}, denoted as

(7)  D := An(Y, G_{V\A}).

The bucket decomposition D_1, ..., D_K of D, induced by the bucket decomposition of V, is simply

(8)  D = ∪̇_{k=1}^K D_k,  D_k = D ∩ B_k,  k = 1, ..., K.

Lemma 2. When the criterion in Theorem 1 is satisfied, we have
Pa(D_k, G) = Pa(B_k, G) for every nonempty D_k.

Proofs of Lemmas 1 and 2 are left to Appendix B.
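Computationally, the buckets are simply the connected components of the undirected part of G. A minimal sketch, assuming the undirected edges are given as a symmetric 0/1 adjacency matrix (a representation we choose for illustration; the paper's Algorithm 2 additionally arranges the buckets into a partial causal ordering as in Eq. (6)):

    # Buckets = connected components of the undirected subgraph G_U.
    # 'undirected' is the symmetric 0/1 adjacency matrix of the undirected edges.
    buckets <- function(undirected) {
      p <- nrow(undirected)
      comp <- rep(NA_integer_, p)
      k <- 0L
      for (s in seq_len(p)) {
        if (!is.na(comp[s])) next
        k <- k + 1L
        queue <- s
        while (length(queue) > 0) {       # breadth-first search from s
          v <- queue[1]; queue <- queue[-1]
          if (!is.na(comp[v])) next
          comp[v] <- k
          queue <- c(queue, which(undirected[v, ] == 1 & is.na(comp)))
        }
      }
      split(seq_len(p), comp)             # list of buckets B_1, ..., B_K
    }
    # e.g. buckets(rbind(c(0,1,0), c(1,0,0), c(0,0,0)))  # -> {1,2}, {3}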
Theorem 2. Suppose the criterion in Theorem 1 is satisfied for A, Y in MPDAG G = (V, E, U) such that Y ∉ A. Let P(X) be the observed distribution, let D = An(Y, G_{V\A}) and let D_1, ..., D_K be the bucket decomposition of D as in Eq. (8). Then the interventional distribution P(X_Y | do(X_A = x_A)) can be identified as

(9)  P(X_Y | do(X_A = x_A)) = ∫ { ∏_{k=1}^K P(X_{D_k} | X_{Pa(D_k)}) } dX_{D\Y}

for values X_{Pa(D_k)} in agreement with x_A, where P(X_{D_k} | X_{Pa(D_k)}) ≡ 1 if D_k = ∅.

The expression in Eq. (9) is a generalization of the truncated factorization Eq. (4) from DAGs to MPDAGs. Theorem 2 holds generally even when an underlying linear SEM is not assumed.

4. Block-recursive representation.
In this section, we express the observed distribution P(X) induced by a linear SEM compatible with MPDAG G = (V, E, U) in a block-recursive form. Each block corresponds to a bucket in the bucket decomposition of V. Such a reparameterization is necessitated by the fact that the causal ordering of D is unknown, whereas the buckets can be arranged into a valid partial causal ordering as in Eq. (6). We will use this representation to compute an estimator for the total causal effect.

Recall that P_D denotes the family of laws of X arising from a linear SEM Eqs. (2) and (3) compatible with DAG D. Let P_G := ∪_{D∈[G]} P_D, which denotes the family of laws of X arising from a linear SEM compatible with a DAG in [G].

Proposition 1. Let D be the causal DAG associated with the linear SEM and G an MPDAG such that D ∈ [G]. Further, let B_1, ..., B_K be the bucket decomposition of V in G. Then the linear SEM Eqs. (2) and (3) can be rewritten as X = Λ⊤X + ε for some matrix of coefficients Λ = (λ_{ij}) ∈ R^{|V|×|V|} and a random vector ε = (ε_i) ∈ R^{|V|} such that

(10)  j ∈ B_l, i ∉ Pa(B_l, G) ⇒ λ_{ij} = 0,

(11)  E ε = 0,  E ε_{B_k} ε_{B_k}⊤ ≻ 0 (k = 1, ..., K),  ε_{B_1}, ..., ε_{B_K} are mutually independent,

and

(12)  law of ε_{B_k} ∈ P_{G_{B_k}},  k = 1, ..., K,

where G_{B_k} is the subgraph of G induced by B_k.

Note that in contrast to the symbol ϵ used in Eqs. (2) and (3), the symbol ε is used here to denote the errors in the block-recursive form. The coordinates within each ε_{B_k} may be dependent.

Proof.
For k = 2, ..., K, by Eq. (2) and the restrictive property (Corollary 1), we have

X_{B_k} = Γ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + Γ_{B_k}⊤ X_{B_k} + ϵ_{B_k},

where Pa(B_k) = Pa(B_k, G). The expression can be rewritten as

X_{B_k} = (I − Γ_{B_k})^{−⊤} Γ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + (I − Γ_{B_k})^{−⊤} ϵ_{B_k} = Λ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + ε_{B_k},

where ε_{B_k} := (I − Γ_{B_k})^{−⊤} ϵ_{B_k} for k = 1, ..., K (note that X_{B_1} = ε_{B_1}). Additionally, Λ_{Pa(B_k),B_k} = Γ_{Pa(B_k),B_k} (I − Γ_{B_k})^{−1} for k = 2, ..., K.

Matrix Λ ∈ R^{|V|×|V|} in the statement of the proposition is defined by the blocks Λ_{Pa(B_k),B_k} for k = 2, ..., K and zero entries otherwise. Therefore, λ_{ij} = 0 if j ∈ B_l and i ∉ Pa(B_l) for some l = 1, ..., K. Hence, by putting the blocks together, the model can be written as X = Λ⊤X + ε. The "new" errors ε satisfy

ε_{B_k} = Γ_{B_k}⊤ ε_{B_k} + ϵ_{B_k},  k = 1, ..., K.
It then follows from Eqs. (2) and (3) that for every k,

law of ε_{B_k} ∈ P_{D_{B_k}} ⊂ P_{G_{B_k}},

since D ∈ [G]. Moreover, for every k, E ε_{B_k} = 0 and

E ε_{B_k} ε_{B_k}⊤ = (I − Γ_{B_k})^{−⊤} E ϵ_{B_k} ϵ_{B_k}⊤ (I − Γ_{B_k})^{−1} ≻ 0,

where both (I − Γ_{B_k}) and E ϵ_{B_k} ϵ_{B_k}⊤ are full rank, because Γ_{B_k} can be permuted into an upper-triangular matrix and E ϵϵ⊤ ≻ 0 by Eq. (3).

Corollary 2. Under the same conditions as Proposition 1, it holds that

(13)  X_{B_1} = ε_{B_1},  X_{B_k} = Λ_{Pa(B_k),B_k}⊤ X_{Pa(B_k)} + ε_{B_k},  ε_{B_k} ⊥⊥ X_{Pa(B_k)},  k = 2, ..., K,

where Pa(B_k) = Pa(B_k, G).

Next, we show that if the total causal effect τ_AY is identifiable from MPDAG G (Theorem 1), then it can be calculated from Λ in the block-recursive representation of Proposition 1. Therefore, the distribution of ε is a nuisance relative to estimating τ_AY.

Proposition 2. Suppose the criterion in Theorem 1 is satisfied for
A, Y in MPDAG G = (V, E, U) such that Y ∉ A. Let Λ be the block-recursive coefficient matrix given by Proposition 1. The total causal effect of X_A on X_Y is identified as

(14)  τ_AY = Λ_{A,D} [(I − Λ_{D,D})^{−1}]_{D,Y},

where D = An(Y, G_{V\A}) and the last subscript denotes the column corresponding to Y ∈ D.

Proof.
We derive this result using Theorem 2. Recall that D_1, ..., D_K is the partition of D induced by the bucket decomposition B_1, ..., B_K of V, in the sense that D_k = D ∩ B_k for k = 1, ..., K. When D_k = ∅, we use the convention that P(X_{D_k} | X_{Pa(D_k)}) ≡ 1. By the definition of D = An(Y, G_{V\A}) and Eq. (6), observe that a vertex in Pa(D_k) = Pa(D_k, G) is either in D_1 ∪ ··· ∪ D_{k−1} or in A. Let F_k := A ∩ Pa(D_k). In Eq. (9), we note that the joint interventional distribution of X_D is given by

P(X_D | do(X_A = x_A)) = ∏_{k=1}^K P(X_{D_k} | X_{Pa(D_k)}) = ∏_{k=1}^K P(X_{D_k} | X_{Pa(D_k)\F_k}, X_{F_k} = x_{F_k}),

where x_{F_k} is fixed by the do(X_A = x_A) operation. Further, fix a factor i ∈ {1, ..., K}. By Lemma 2, Pa(D_i) = Pa(B_i). By Eq. (13) and ε_{D_i} ⊥⊥ X_{Pa(B_i)}, we have

X_{D_i} | {X_{Pa(D_i)\F_i}, X_{F_i} = x_{F_i}} =_d Λ_{Pa(D_i)\F_i,D_i}⊤ X_{Pa(D_i)\F_i} + Λ_{F_i,D_i}⊤ x_{F_i} + ε_{D_i}
= Λ_{Pa(D_i)∩D,D_i}⊤ X_{Pa(D_i)∩D} + Λ_{Pa(D_i)∩A,D_i}⊤ x_{Pa(D_i)∩A} + ε_{D_i}.

The fact that the display above holds for every i = 1, ..., K implies that the joint interventional distribution P(X_D | do(X_A = x_A)) satisfies

X_D = Λ_{D,D}⊤ X_D + Λ_{A,D}⊤ x_A + ε_D.

It follows that X_D = (I − Λ_{D,D})^{−⊤} (Λ_{A,D}⊤ x_A + ε_D) and hence

E[X_D | do(X_A = x_A)] = (I − Λ_{D,D})^{−⊤} Λ_{A,D}⊤ x_A.

Since Y ∈ D, by Definition 1 we have

τ_AY = ∂/∂x_A E[X_Y | do(X_A = x_A)] = Λ_{A,D} [(I − Λ_{D,D})^{−1}]_{D,Y}.

We say vertex j is a possible descendant of i, denoted by j ∈ PossDe(i), if there exists a possibly causal path from i to j. For a set of vertices A, define PossDe(A) := ∪_{i∈A} PossDe(i). See Appendix C for more details.

Corollary 3. If Y ∉ PossDe(A), then τ_AY = 0.

Proof.
Since D = An(Y, G_{V\A}) and Y ∉ PossDe(A), Λ_{A,D} = 0.
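Given Λ and D, Eq. (14) is a few lines of linear algebra. A sketch with a helper name of our own choosing (A, D, Y are integer index vectors with Y ∈ D):

    # Plugin evaluation of Eq. (14): tau_AY = Lambda[A,D] %*% [(I - Lambda[D,D])^{-1}][, Y].
    total_effect <- function(Lambda, A, D, Y) {
      stopifnot(Y %in% D)
      Minv <- solve(diag(length(D)) - Lambda[D, D, drop = FALSE])
      Lambda[A, D, drop = FALSE] %*% Minv[, which(D == Y), drop = FALSE]
    }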
5. Recursive least squares.
Consider the special case when the errors in the linear SEM Eq. (1) are jointly Gaussian. In this case, by standard maximum likelihood theory, the Cramér–Rao bound is achieved by the maximum likelihood estimator (MLE) of the total causal effect, which can be obtained by plugging the MLE for Λ in the block-recursive form (Proposition 1) into the formula Eq. (14). We now compute the MLE for Λ given an MPDAG G.

When ϵ is multivariate Gaussian, the block-recursive form in Proposition 1 is a linear Gaussian model parameterized by {(Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K}, where Λ_k := Λ_{Pa(B_k),B_k} and Ω_k is the covariance of ε_{B_k}. Because the ε are independent between blocks (Proposition 1), the likelihood factorizes as

(15)  L((Λ_k)_k, (Ω_k)_k) = ∏_{k=1}^K N(X_{B_k} − Λ_k⊤ X_{Pa(B_k)}; 0, Ω_k).

Denote the MLE of Λ by Λ̂^G, which consists of blocks (Λ̂^G_k)_{k=2}^K and zero values elsewhere, and the MLE of Ω by Ω̂^G = (Ω̂^G_k)_{k=1}^K. The superscripts highlight the dependence on MPDAG G. The MLE maximizes L((Λ_k)_k, (Ω_k)_k) subject to Eq. (12), namely

N(0, Ω_k) ∈ P_{G_{B_k}},  k = 1, ..., K,

where G_{B_k} is the subgraph of G induced by B_k. This further translates to a set of algebraic constraints on (Ω_k)_{k=1}^K, namely for k = 1, ..., K,

(16)  det[(Ω_k)_{{i}∪C,{j}∪C}] = 0, if i and j are d-separated by C in G_{B_k};

see, e.g., Drton et al. [2008]. Define the sample covariance

(17)  Σ̂^(n) := (1/n) ∑_{i=1}^n X^(i) X^(i)⊤,

where n is the sample size and the superscripts are reserved for indexing samples. To reduce clutter, for a set of indices C, we often abbreviate Σ_{C,C} as Σ_C.

Lemma 3. Suppose X^(i), i = 1, ..., n, are generated iid from a linear SEM Eqs. (2) and (3) associated with an unknown causal DAG D. Suppose the error ϵ is distributed as multivariate Gaussian and that D ∈ [G] for a known MPDAG G. Let Σ̂^(n) be the sample covariance as defined in Eq. (17). The MLE for Λ_k = Λ_{Pa(B_k),B_k} in the block-recursive form is given by

(18)  Λ̂^G_k = (Σ̂^(n)_{Pa(B_k)})^{−1} Σ̂^(n)_{Pa(B_k),B_k},  k = 2, ..., K.

Proof.
By the factorization in Eq. (15), the MLE (Λ̂^G_k, Ω̂^G_k) is the maximizer of the log-likelihood

ℓ_n(Λ_k, Ω_k) = −(1/2) ∑_{i=1}^n (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)})⊤ Ω_k^{−1} (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)}) − (n/2) log det Ω_k
= −(1/2) Tr( ∑_{i=1}^n Ω_k^{−1} (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)}) (X^(i)_{B_k} − Λ_k⊤ X^(i)_{Pa(B_k)})⊤ ) − (n/2) log det Ω_k,

subject to Eq. (16). Taking the derivative with respect to Λ_k ∈ R^{|Pa(B_k)|×|B_k|}, we have

∂ℓ_n(Λ_k, Ω_k)/∂Λ_k = ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{B_k} Ω_k^{−1} − ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{Pa(B_k)} Λ_k Ω_k^{−1}.

For any positive definite Ω_k satisfying Eq. (16), setting ∂ℓ_n(Λ_k, Ω_k)/∂Λ_k to zero yields the estimate

Λ̂^G_k = ( (1/n) ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{Pa(B_k)} )^{−1} ( (1/n) ∑_{i=1}^n X^(i)_{Pa(B_k)} X^(i)⊤_{B_k} ) = (Σ̂^(n)_{Pa(B_k)})^{−1} Σ̂^(n)_{Pa(B_k),B_k}.

Remark 1. Because of the restrictive property (Corollary 1), each Λ̂^G_k is computed by optimizing over the space of |Pa(B_k)| × |B_k| matrices, and the resulting MLE takes the simple form above; see also Anderson and Olkin [1985] and Amemiya [1985].

Since Λ̂^G is obtained by simply regressing each B_k onto Pa(B_k, G) using ordinary least squares, we call this specific recursive least squares G-regression. The resulting MLE for an identified total causal effect is a plugin estimator using the formula in Proposition 2.

Definition 2 (G-regression estimator). Suppose X^(i), i = 1, ..., n, are generated iid from a linear SEM Eqs. (2) and (3) associated with an unknown causal DAG D. Suppose D ∈ [G] for a known MPDAG G. Further, suppose that for A ⊂ V and Y ∈ V \ A, τ_AY is identified under the criterion of Theorem 1. The G-regression estimator of the total causal effect τ_AY is defined as

(19)  τ̂^G_AY = Λ̂^G_{A,D} [(I − Λ̂^G_{D,D})^{−1}]_{D,Y},

where Λ̂^G is given by Lemma 3.
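In code, G-regression is one least squares fit per bucket followed by the plugin formula. A minimal sketch, assuming mean-zero data X, a bucket list in a valid partial causal ordering and the corresponding external parent sets; the helper names are ours, not the API of eff2:

    # G-regression (Eq. (18)): OLS of each bucket B_k on its external parents Pa(B_k, G).
    g_regression <- function(X, bucket_list, pa_list) {
      p <- ncol(X)
      Lambda <- matrix(0, p, p)
      S <- crossprod(X) / nrow(X)        # sample covariance, Eq. (17) (mean zero assumed)
      for (k in seq_along(bucket_list)) {
        B <- bucket_list[[k]]; Pa <- pa_list[[k]]
        if (length(Pa) > 0)
          Lambda[Pa, B] <- solve(S[Pa, Pa, drop = FALSE], S[Pa, B, drop = FALSE])
      }
      Lambda
    }

With D = An(Y, G_{V\A}) computed from the graph, the estimator of Definition 2 is then total_effect(g_regression(X, bucket_list, pa_list), A, D, Y), reusing the helper sketched after Corollary 3.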
6. Efficiency theory. In this section, we establish the asymptotic efficiency of our G-regression estimator, when the errors in the generating linear SEM are not necessarily Gaussian, among a reasonably large class of estimators: all regular estimators that only depend on the sample covariance. This class of estimators, despite not covering all the estimators considered in standard semiparametric efficiency theory, includes many in the literature, such as covariate adjustment [Henckel et al., 2019, Witte et al., 2020], recursive least squares [Gupta et al., 2020, Nandy et al., 2017], and the modified Cholesky decomposition of the sample covariance [Nandy et al., 2017].

Definition 3. Consider an estimator θ̂_n of θ, θ ∈ R^k. We say that the asymptotic covariance of θ̂_n is S, and write acov θ̂_n = S, if √n(θ̂_n − θ) →_d N(0, S). When k = 1, we write avar θ̂_n for the asymptotic variance.

For real-valued symmetric matrices A, B, we say A ⪰ B if A − B is positive semidefinite. We now state our main result.

Theorem 3 (Efficiency of the G-regression estimator). Suppose data is generated iid from a linear SEM Eqs. (2) and (3) associated with an unknown causal DAG D. Suppose D ∈ [G] for a known MPDAG G. Further, suppose that for A ⊂ V and Y ∈ V \ A, τ_AY is identified under the criterion of Theorem 1. Let τ̂^G_AY be the G-regression estimator of τ_AY (Definition 2). Consider any consistent estimator τ̂_AY = τ̂_AY(Σ̂^(n)) that is a differentiable function of the sample covariance. It holds that

acov(τ̂_AY) ⪰ acov(τ̂^G_AY).

It is clear from the definitions that both τ̂^G_AY and τ̂_AY are asymptotically linear. Therefore, their asymptotic covariances are well-defined. To prove Theorem 3, it suffices to show that for every w ∈ R^{|A|},

avar(w⊤τ̂_AY) ≥ avar(w⊤τ̂^G_AY).

To this end, for any fixed w ∈ R^{|A|} we define τ_w as

(20)  τ_w := w⊤τ_AY = τ_w(Λ),

which is a smooth function of Λ. The corresponding G-regression estimator τ̂^G_w := w⊤τ̂^G_AY = τ_w(Λ̂^G) is still a plugin estimator (now of τ_w). Additionally, for a consistent estimator τ̂_AY of τ_AY, the corresponding τ̂_w := w⊤τ̂_AY = τ̂_w(Σ̂^(n)) is a consistent estimator of τ_w, in the form of a differentiable function of the sample covariance. It suffices to show avar τ̂_w ≥ avar τ̂^G_w for every w ∈ R^{|A|}.

The rest of this section is devoted to proving Theorem 3. First, we introduce graph Ḡ as a saturated version of G (Proposition 3). In Section 6.1, we show that G-regression with G replaced by Ḡ, aptly named Ḡ-regression, is a diffeomorphism between the space of covariance matrices and the space of parameters. In Section 6.2, we characterize the class of estimators relative to which G-regression is optimal. To prove Theorem 3, we establish an efficiency bound for this class of estimators in Section 6.4 and verify that G-regression achieves this bound in Section 6.5. Some of the proofs are left to Appendix A. See also Fig. A.1 for an overview of the dependency structure of our results in this section.

6.1. Ḡ-regression as a diffeomorphism.

Proposition 3 (Saturated MPDAG Ḡ). For MPDAG G = (V, E, U), an associated saturated MPDAG is Ḡ = (V, Ē, U) such that Pa(B_k, Ḡ) = B_{[k−1]} for k = 2, ..., K, where (B_1, ..., B_K) is a bucket decomposition of V in both G and Ḡ.

The proof can be found in Appendix B. In words, to create the saturated MPDAG Ḡ, we add all possible directed edges between the buckets B_1, ..., B_K subject to the ordering B_1, ..., B_K. By construction, Ḡ also satisfies the restrictive property in Corollary 1. See Fig. 1 for an example.

[Fig 1: (a) MPDAG G = (V, E, U) with buckets B_1, B_2 and B_3, and (b) its associated saturated MPDAG Ḡ = (V, Ē, U). The new edges in Ē \ E are drawn as dashed. Both G and Ḡ satisfy the restrictive property in Corollary 1.]

In the following, we introduce Ḡ-regression as a technical tool for establishing a diffeomorphism between the space of sample covariance matrices and the space of parameters in our semiparametric model.
This link is the key to analyzing the efficiency of the estimators under consideration.

Recall that P_G is the set of observed distributions generated by some linear SEM associated with a causal DAG D ∈ [G], which is characterized by Proposition 1. More explicitly, let Q_k be the law of ε_{B_k} for k = 1, ..., K. The set of laws is explicitly prescribed as

(21)  P_G = { Q_1(X_{B_1}) ∏_{k=2}^K Q_k(X_{B_k} − Λ_{B_{[k−1]},B_k}⊤ X_{B_{[k−1]}}) : Q_k ∈ P_{G_{B_k}}, i → j not in G ⇒ λ_{ij} = 0 },

where the law is indexed by Λ = (λ_{ij}) and (Q_k)_{k=1}^K. This is a semiparametric model and (Q_k)_k is an infinite-dimensional nuisance [van der Vaart, 2000, Chap. 25].

Consider the set of laws P_Ḡ associated with the saturated graph. Let Ω_k := E_{Q_k} εε⊤ be the covariance of Q_k for k = 1, ..., K. Let R^{n×n}_PD denote the set of n × n symmetric, positive definite matrices. By our assumption, Ω_k ∈ R^{|B_k|×|B_k|}_PD. Also, consider coefficients Λ = (λ_{ij}) such that λ_{ij} ≠ 0 only if i → j in Ḡ, or equivalently, i ∈ B_l and j ∈ B_m for l < m. Then the covariance of X, denoted by Σ, under any P ∈ P_Ḡ is determined from (Ω_k)_k and Λ. Let us write this covariance map as

Σ = φ_Ḡ((Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K),

where Λ_k = Λ_{B_{[k−1]},B_k} is of dimension (|B_1| + ··· + |B_{k−1}|) × |B_k|. It follows from Corollary 2 that the covariance map φ_Ḡ is explicitly given by

(22)  Σ_{B_1} = Ω_1,  Σ_{B_k} = Λ_k⊤ Σ_{B_{[k−1]}} Λ_k + Ω_k,  Σ_{B_{[k−1]},B_k} = Σ_{B_{[k−1]}} Λ_k,  k = 2, ..., K.

Further, the covariance map φ_Ḡ is a diffeomorphism between its domain and the set of |V| × |V| positive definite matrices.

Lemma 4. The covariance map φ_Ḡ given by Eq. (22) is invertible. Further, ((Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K) ↔ Σ given by φ_Ḡ and its inverse φ_Ḡ^{−1} is a diffeomorphism between (⨉_{k=2}^K R^{(|B_1|+···+|B_{k−1}|)×|B_k|}) × (⨉_{k=1}^K R^{|B_k|×|B_k|}_PD) and R^{|V|×|V|}_PD.

Proof.
By definition, the covariance map φ_Ḡ is differentiable. To show the diffeomorphism, we need to show that φ_Ḡ^{−1}(Σ) exists for every Σ ∈ R^{|V|×|V|}_PD and that φ_Ḡ^{−1} is differentiable. For any positive definite Σ, the inverse covariance map φ_Ḡ^{−1}(Σ) is explicitly given by

(23)  Λ_k = (Σ_{B_{[k−1]}})^{−1} Σ_{B_{[k−1]},B_k},  k = 2, ..., K,

and

(24)  Ω_k = Σ_{B_k · B_{[k−1]}} = Σ_{B_k} − Σ_{B_{[k−1]},B_k}⊤ Σ_{B_{[k−1]}}^{−1} Σ_{B_{[k−1]},B_k},  k = 1, ..., K,

where Σ_{B_k · B_{[k−1]}} is the Schur complement of block B_k with respect to block B_{[k−1]}. Because Σ is positive definite, the Schur complement Ω_k is also positive definite [Horn and Johnson, 2012, page 495]. Clearly, the map φ_Ḡ^{−1}(·) is differentiable.

By Eqs. (23) and (24), Λ_k is the matrix of population least squares coefficients in a regression of X_{B_k} onto X_{B_1∪···∪B_{k−1}} according to Ḡ, and Ω_k is the corresponding covariance of the regression residuals. Hence, φ_Ḡ^{−1}(Σ) is called "Ḡ-regression".

Remark 2. In the special case when G is a DAG such that every bucket B_i is a singleton, Lemma 4 reduces to (Λ, ω) ↔ Σ given by (φ_Ḡ, φ_Ḡ^{−1}) being a diffeomorphism between

{Λ ∈ R^{|V|×|V|} : Λ is upper-triangular} × {ω ∈ R^{|V|} : ω_i > 0, i = 1, ..., |V|} ←→ R^{|V|×|V|}_PD.

The covariance map is Σ = φ_Ḡ(Λ, ω) = (I − Λ)^{−⊤} diag(ω) (I − Λ)^{−1}, and the inverse covariance map φ_Ḡ^{−1} is given by the unique LDL decomposition of Σ^{−1}. Lemma 4 is a generalization of Drton [2018, Theorem 7.2].
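The inverse map φ_Ḡ^{−1} in Eqs. (23) and (24) can be evaluated by a single sweep through the buckets. A sketch under the assumptions that Sigma is positive definite and bucket_list is ordered as in Eq. (6):

    # G-bar regression: invert the covariance map via Eqs. (23) and (24).
    gbar_regression <- function(Sigma, bucket_list) {
      Lam <- vector("list", length(bucket_list))
      Om  <- vector("list", length(bucket_list))
      prev <- integer(0)                  # indices of B_{[k-1]} = B_1 u ... u B_{k-1}
      for (k in seq_along(bucket_list)) {
        B <- bucket_list[[k]]
        if (length(prev) == 0) {
          Om[[k]] <- Sigma[B, B, drop = FALSE]                    # Omega_1 = Sigma_{B_1}
        } else {
          Lam[[k]] <- solve(Sigma[prev, prev, drop = FALSE],
                            Sigma[prev, B, drop = FALSE])         # Eq. (23)
          Om[[k]]  <- Sigma[B, B, drop = FALSE] -
            crossprod(Sigma[prev, B, drop = FALSE], Lam[[k]])     # Eq. (24), Schur complement
        }
        prev <- c(prev, B)
      }
      list(Lambda = Lam, Omega = Om)
    }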
6.2. Covariance-based, consistent estimators. We now characterize the class of estimators relative to which the optimality of our estimator is established. Recall that under P ∈ P_G, Σ̂ = Σ̂^(n) is the sample covariance, Σ is the population covariance and τ_w = w⊤τ_AY. We assume that n > max_k {|B_k| + |Pa(B_k, G)|} so that Σ̂^(n) is positive definite almost surely [Drton and Eichler, 2006, Sec. 3.1]. For simplicity, the superscript (n) is often omitted.

Definition 4. The class of estimators for τ_w under consideration is

(25)  T_w := { τ̂_w(Σ̂^(n)) : R^{|V|×|V|}_PD → R : τ̂_w differentiable, τ̂_w(Σ̂^(n)) →_p τ_w(P) as n → ∞ under every P ∈ P_G }.

By definition, in particular, T_w includes all regular estimators computable with least squares operations. It also includes shrinkage estimators such as ridge or lasso regression whose shrinkage parameter does not depend on n.

Characterizing T_w. Let (Λ̂^Ḡ_k)_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K be the image of Σ̂ under φ_Ḡ^{−1}. Recall that (Λ_k)_{k=2}^K, (Ω_k)_{k=1}^K is the image of Σ under φ_Ḡ^{−1}. For a matrix C, let vec C denote the vectorization of C obtained by concatenating its columns. Each vec Λ̂^Ḡ_k can be split by coordinates into the vectors

(26)  Λ̂^Ḡ_{k,G} = (λ̂^Ḡ_{ij} : j ∈ B_k, i ∈ Pa(B_k, G)),  Λ̂^Ḡ_{k,G^c} = (λ̂^Ḡ_{ij} : j ∈ B_k, i ∈ Pa(B_k, Ḡ) \ Pa(B_k, G)),

where (Λ̂^Ḡ_{k,G})_k corresponds to the between-bucket edges in G and (Λ̂^Ḡ_{k,G^c})_k corresponds to the between-bucket edges in Ḡ but not in G. In the example of Fig. 1, (Λ̂^Ḡ_{k,G})_k collects the coefficients of the between-bucket edges present in G, while (Λ̂^Ḡ_{k,G^c})_k collects those of the dashed edges added in Ḡ. Similarly, vec Λ_k can be split into Λ_{k,G} and Λ_{k,G^c} for k = 2, ..., K.

The following lemma directly follows from Definition 4 and Lemma 4.

Lemma 5. An estimator τ̂_w ∈ T_w can be written as τ̂_w(Σ̂^(n)) = τ̂_w((Λ̂^Ḡ_{k,G})_{k=2}^K, (Λ̂^Ḡ_{k,G^c})_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K) for a function τ̂_w that is differentiable in its arguments.

The consistency of τ̂_w implies the following two results.

Lemma 6. For any τ̂_w ∈ T_w, it holds that

(27)  τ̂_w((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K) ≡ τ_w((Λ_{k,G})_{k=2}^K)

for all (Λ_{k,G})_k and all positive definite (Ω_k)_k.

Proof.
Under any P ∈ P_G, since Σ̂ →_p Σ as n → ∞ by the law of large numbers, by Lemma 4 and the continuous mapping theorem [van der Vaart, 2000, page 11], we have Λ̂^Ḡ_{k,G} →_p Λ_{k,G}, Λ̂^Ḡ_{k,G^c} →_p 0 and Ω̂^Ḡ_k →_p Ω_k for k = 2, ..., K. By Lemma 5 and continuous mapping again,

τ̂_w →_p τ̂_w((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K).

The result then follows from the consistency of τ̂_w under every P ∈ P_G.

Corollary 4. For τ̂_w ∈ T_w, at any ((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K), it holds that

(28)  ∂τ̂_w/∂Λ_{k,G} = ∂τ_w/∂Λ_{k,G} (k = 2, ..., K),  ∂τ̂_w/∂Ω_k = 0 (k = 1, ..., K).

Proof.
Let ⟨·,·⟩ denote the inner product. Since τ̂_w is differentiable (Lemma 5), by a Taylor expansion at ((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K), we have

τ̂_w((Λ_{k,G} + ΔΛ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k + ΔΩ_k)_{k=1}^K) − τ̂_w((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K)
= ∑_{k=2}^K ( ⟨∂τ̂_w/∂Λ_{k,G}, ΔΛ_{k,G}⟩ + o(‖ΔΛ_{k,G}‖) ) + ∑_{k=1}^K ( ⟨∂τ̂_w/∂Ω_k, ΔΩ_k⟩ + o(‖ΔΩ_k‖) ),

which by Lemma 6 must equal τ_w((Λ_{k,G} + ΔΛ_{k,G})_{k=2}^K) − τ_w((Λ_{k,G})_{k=2}^K). The result then follows from the differentiability of τ_w(·) and the definition of derivatives.

Note that Corollary 4 is similar to the conditions imposed on influence functions in standard semiparametric efficiency theory; see, e.g., Tsiatis [2006]. In contrast, the derivatives ∂τ̂_w/∂Λ̂^Ḡ_{k,G^c} for k = 2, ..., K are free to vary because Λ̂^Ḡ_{k,G^c} →_p 0. That is, an estimator τ̂_w ∈ T_w can take arbitrary values as its second argument varies in the vicinity of zero, as long as differentiability is maintained.

6.3. Asymptotic covariance of least squares coefficients.
We use this section to derive some asymptotic results that will be used to prove Theorem 3. Consider a vertex j ∈ B_k for k ∈ {2, ..., K} and a set of vertices C such that Pa(B_k, G) ⊆ C ⊆ Pa(B_k, Ḡ). Let λ̂^(n)_{C,j} ∈ R^{|C|} be the least squares coefficients from regressing X_j onto X_C under sample size n. Let λ_{C,j} be the corresponding true edge coefficient vector from Λ in Proposition 1. Then λ_{C,j} has non-zero coordinates only for those indices in Pa(B_k, G). Because X_j = λ_{C,j}⊤ X_C + ε_j with ε_j ⊥⊥ X_C by Corollary 2, we have λ̂^(n)_{C,j} →_p λ_{C,j} under every P ∈ P_G. Moreover, we have the following asymptotic linear expansion.

Lemma 7. Let j be a vertex in bucket B_k for k ∈ {2, ..., K}. Let C be a set of vertices such that Pa(B_k, G) ⊆ C ⊆ Pa(B_k, Ḡ). Under any P ∈ P_G, it holds that

λ̂^(n)_{C,j} − λ_{C,j} = (1/n) ∑_{i=1}^n (Σ_C)^{−1} X^(i)_C ε^(i)_j + O_p(n^{−1}),

where Σ = E_P XX⊤, λ̂^(n)_{C,j} is the vector of least squares coefficients from regressing X_j onto X_C under sample size n, and λ_{C,j} is the vector of true coefficients in Proposition 1.
We now use Lemma 7 to obtain the covariance structure of the Ḡ-regression coefficients (Λ̂^Ḡ_k)_{k=2}^K. Recall that Λ̂^Ḡ_k ∈ R^{|B_{[k−1]}|×|B_k|} with B_{[k−1]} = B_1 ∪ ··· ∪ B_{k−1} and

((Λ̂^Ḡ_k)_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K) = φ_Ḡ^{−1}(Σ̂^(n)),

as given by Eqs. (23) and (24). For matrices A ∈ R^{m×n}, B ∈ R^{p×q}, the Kronecker product A ⊗ B is the mp × nq matrix given by

A ⊗ B = [ a_{11}B ··· a_{1n}B ; ⋮ ⋱ ⋮ ; a_{m1}B ··· a_{mn}B ].

Lemma 8. Let (Λ̂^Ḡ_k)_{k=2}^K be the Ḡ-regression coefficients under sample size n. Under any P ∈ P_G, it holds that

√n ( vec(Λ̂^Ḡ_2 − Λ_2); ...; vec(Λ̂^Ḡ_K − Λ_K) ) →_d N( 0, diag{ Ω_2 ⊗ (Σ_{B_{[1]}})^{−1}, ..., Ω_K ⊗ (Σ_{B_{[K−1]}})^{−1} } ).

Remark 3. √n vec(Λ̂^(n)_k − Λ_k) →_d N(0, Ω_k ⊗ (Σ_{B_{[k−1]}})^{−1}) is equivalent to √n (Λ̂^(n)_k − Λ_k) →_d MN(0, (Σ_{B_{[k−1]}})^{−1}, Ω_k), a matrix normal distribution with row covariance (Σ_{B_{[k−1]}})^{−1} and column covariance Ω_k; see Dawid [1981].

Similarly, we can compute the asymptotic covariance of the G-regression coefficients. To obtain the result below, we rely on the restrictive property of G (Corollary 1).

Lemma 9. Let (Λ̂^G_k)_{k=2}^K be the G-regression coefficients as defined in Lemma 3 under sample size n. Under any P ∈ P_G, it holds that

√n ( vec(Λ̂^G_2 − Λ_2); ...; vec(Λ̂^G_K − Λ_K) ) →_d N( 0, diag{ Ω_2 ⊗ (Σ_{Pa(B_2,G)})^{−1}, ..., Ω_K ⊗ (Σ_{Pa(B_K,G)})^{−1} } ).
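Lemma 9 is easy to probe by simulation. In the three-variable model simulated in Section 3.1 all buckets are singletons, so G-regression of X_3 on its parent X_2 is ordinary least squares, and Lemma 9 predicts the asymptotic variance Ω_3 ⊗ (Σ_{Pa(B_3,G)})^{−1} = var(ε_3)/var(X_2) = 1/1.64 ≈ 0.61. A quick Monte Carlo check, reusing p and Gamma from that sketch:

    # Monte Carlo check of Lemma 9 for the singleton bucket B = {3} with Pa = {2}.
    set.seed(3)
    reps <- 2000; n <- 500
    est <- replicate(reps, {
      eps <- matrix(rnorm(n * p), n, p)
      X <- eps %*% solve(diag(p) - Gamma)
      sum(X[, 2] * X[, 3]) / sum(X[, 2]^2)   # OLS of X_3 on X_2 (no intercept)
    })
    var(sqrt(n) * (est - Gamma[2, 3]))       # approx 1 / 1.64 = 0.61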
6.4. Efficiency bound. We first note a simple fact about quadratic forms and a property of the Kronecker product.
Lemma 10. Let S ∈ R^{n×n}_PD, x ∈ R^n and suppose that (A, B) is a partition of the set {1, ..., n}. For any fixed x_A, it holds that

x⊤Sx ≥ x_A⊤ (S_{A·B}) x_A,  where S_{A·B} = S_{A,A} − S_{A,B} S_{B,B}^{−1} S_{B,A}.

Equality holds if and only if x_B = −S_{B,B}^{−1} S_{B,A} x_A.

Lemma 11 (Liu [1999, Theorem 1]). Let A ∈ R^{m×m} and C ∈ R^{n×n} be non-singular. Suppose α ⊂ [m], β ⊂ [n]. Let α^c, β^c denote their respective complements. Let γ^c = {n(i − 1) + j : i ∈ α^c, j ∈ β^c} and γ = [mn] \ γ^c. We have

A_{α^c·α} ⊗ C_{β^c·β} = (A ⊗ C)_{γ^c·γ}.

Lemma 12. Suppose the assumptions of Theorem 3 hold. Fix w ∈ R^{|A|} and let τ_w = w⊤τ_AY = τ_w((Λ_{k,G})_{k=2}^K) as in Eq. (20). Consider any estimator τ̂_w ∈ T_w given by Definition 4. Then under any P ∈ P_G, it holds that

(29)  avar(τ̂_w) ≥ ∑_{k=2}^K h_k⊤ [Ω_k ⊗ (Σ_{Pa(B_k,G)})^{−1}] h_k,

where (Ω_k)_{k=2}^K and Σ are determined by P, and the gradient vectors h_k = ∂τ_w((Λ_{k,G})_k)/∂Λ_{k,G} for k = 2, ..., K evaluated at (Λ_{k,G})_k are determined by τ_w(·) and P.

Proof.
By Lemma 5, an estimator τ̂_w ∈ T_w can be written as

τ̂_w = τ̂_w((Λ̂^Ḡ_{k,G})_{k=2}^K, (Λ̂^Ḡ_{k,G^c})_{k=2}^K, (Ω̂^Ḡ_k)_{k=1}^K),

where the arguments correspond to the image of Σ̂ under φ_Ḡ^{−1}; see Eq. (26). An estimator τ̂_w ∈ T_w is asymptotically normal. By the delta method [Shorack, 2000, Sec. 11.2], we have

avar(τ̂_w) = g⊤ acov{ (vec(Λ̂^Ḡ_{k,G})_{k=2}^K; vec(Λ̂^Ḡ_{k,G^c})_{k=2}^K; vec(Ω̂^Ḡ_k)_{k=1}^K) } g,
with g = ( ∂τ̂_w/∂(Λ_{k,G})_{k=2}^K; ∂τ̂_w/∂(Λ_{k,G^c})_{k=2}^K; ∂τ̂_w/∂(Ω_k)_{k=1}^K ),

where the partial derivatives of τ̂_w(·) are evaluated at ((Λ_{k,G})_{k=2}^K, (0)_{k=2}^K, (Ω_k)_{k=1}^K), the image of Σ under φ_Ḡ^{−1}.

Using ∂τ̂_w/∂Ω_k = 0 for k = 1, ..., K from Corollary 4, it follows that

avar(τ̂_w) = ∑_{k=2}^K ( ∂τ̂_w/∂Λ_{k,G}; ∂τ̂_w/∂Λ_{k,G^c} )⊤ acov{ (Λ̂^Ḡ_{k,G}; Λ̂^Ḡ_{k,G^c}) } ( ∂τ̂_w/∂Λ_{k,G}; ∂τ̂_w/∂Λ_{k,G^c} ),

where we have used the block-diagonal structure of the asymptotic covariance from Lemma 8. Let

S^(k) := acov{ (Λ̂^Ḡ_{k,G}; Λ̂^Ḡ_{k,G^c}) },  k = 2, ..., K,

which equals S^(k) = Ω_k ⊗ (Σ_{B_{[k−1]}})^{−1} by Lemma 8. From Corollary 4, note that ∂τ̂_w/∂Λ_{k,G} ≡ h_k is fixed for k = 2, ..., K. Then Lemma 10 yields the lower bound

avar(τ̂_w) ≥ ∑_{k=2}^K h_k⊤ S^(k)_{G·G^c} h_k,

where the indices G and G^c correspond to the coordinates in Λ̂^Ḡ_{k,G} and Λ̂^Ḡ_{k,G^c} respectively. Indices G correspond to {(i, j) : j ∈ B_k, i ∈ Pa(B_k, G)}; by construction of Ḡ, indices G^c correspond to {(i, j) : j ∈ B_k, i ∈ Pa(B_k, Ḡ) \ Pa(B_k, G)}. Now, abusing the notation slightly, we apply Lemma 11 with

A = Ω_k,  C = (Σ_{B_{[k−1]}})^{−1},  α = ∅,  β = Pa(B_k, Ḡ) \ Pa(B_k, G),

so that α^c = {1, ..., |B_k|}, β^c = Pa(B_k, G), γ = G^c, γ^c = G. We obtain

S^(k)_{G·G^c} = Ω_k ⊗ [(Σ_{B_{[k−1]}})^{−1}]_{β^c·β} = Ω_k ⊗ (Σ_{Pa(B_k,G)})^{−1},

where the last step follows from (H^{−1})_{β^c·β} = (H_{β^c,β^c})^{−1} [Horn and Johnson, 2012].

6.5. Efficiency of the G-regression estimator. In Section 5, we saw that when the errors are Gaussian, the G-regression plugin is the MLE and hence achieves the efficiency bound. Here, we show that this is still true relative to the class of estimators we consider, even when the errors are not necessarily Gaussian. We verify that τ̂^G_w = w⊤τ̂^G_AY achieves the efficiency bound above.

Lemma 13. Let τ̂^G_w := w⊤τ̂^G_AY, where τ̂^G_AY is the G-regression estimator (Definition 2). Under the same assumptions as Lemma 12, it holds that τ̂^G_w ∈ T_w and τ̂^G_w achieves the efficiency bound in Eq. (29) under every P ∈ P_G.

Proof. By Definition 2, τ̂^G_w ∈ T_w.
Further, note that τ̂^G_w = τ_w((Λ̂^G_k)_{k=2}^K), where (Λ̂^G_k)_{k=2}^K are the G-regression coefficients in Eq. (18). Under any P ∈ P_G, we now verify that avar τ̂^G_w matches the right-hand side of Eq. (29). By the delta method [Shorack, 2000, Sec. 11.2], we have

avar τ̂^G_w = ( ∂τ_w/∂vec(Λ_k)_{k=2}^K )⊤ acov{ vec(Λ̂^G_k)_{k=2}^K } ( ∂τ_w/∂vec(Λ_k)_{k=2}^K )
(i) = ∑_{k=2}^K ( ∂τ_w/∂vec Λ_k )⊤ acov{ vec Λ̂^G_k } ( ∂τ_w/∂vec Λ_k )
(ii) = ∑_{k=2}^K ( ∂τ_w/∂vec Λ_k )⊤ [ Ω_k ⊗ (Σ_{Pa(B_k,G)})^{−1} ] ( ∂τ_w/∂vec Λ_k ),

which equals the right-hand side of Eq. (29). The partial derivatives of τ_w(·) are evaluated at (Λ_k)_{k=2}^K. Step (i) follows from the block-diagonal structure of the asymptotic covariance of Λ̂^G given by Lemma 9, and step (ii) follows from the same lemma.

Finally, we complete the proof of our main result.

Proof of Theorem 3.
Fix any P ∈ P_G. It suffices to show that for every w ∈ R^{|A|},

w⊤ acov(τ̂_AY) w ≥ w⊤ acov(τ̂^G_AY) w,  or equivalently  avar(w⊤τ̂_AY) ≥ avar(w⊤τ̂^G_AY).

This is true because for every τ̂_AY in consideration, τ̂_w := w⊤τ̂_AY ∈ T_w and hence τ̂_w is subject to the lower bound in Lemma 12. Meanwhile, by Lemma 13, this lower bound is achieved by τ̂^G_w = w⊤τ̂^G_AY. The proof is complete because the choice of w is arbitrary.

Remark 4. For Theorem 3 to hold, the independent-errors assumption Eq. (3) of the underlying linear SEM cannot be relaxed to uncorrelated errors. This comes from inspecting the proof of Lemma 8 in Appendix A. To show that the Ḡ-regression coefficients are asymptotically independent across buckets, the independence of the errors is used to establish that for 2 ≤ k < k′ ≤ K, j ∈ B_k, j′ ∈ B_{k′},

cov(ε_j X_{B_{[k−1]}}, ε_{j′} X_{B_{[k′−1]}}) = 0.

Suppose for now that {ϵ_i : i ∈ V} are only uncorrelated and hence {ε_{B_k} : k = 1, ..., K} are only uncorrelated across buckets. Further, suppose B_1 = {1}, B_2 = {2}, B_3 = {3} with j = k = 2 and j′ = k′ = 3. Then we have

cov(ε_j X_{B_{[k−1]}}, ε_{j′} X_{B_{[k′−1]}}) = cov(ε_2 ε_1, ε_3 (ε_1, γ_{12} ε_1 + ε_2)⊤) = E[ε_2 ε_1 (ε_3 ε_1, γ_{12} ε_3 ε_1 + ε_3 ε_2)⊤],

which may be non-zero.
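A concrete instance (our example, not from the paper): take ε_1 = Z, ε_2 = W and ε_3 = W(Z² − 1) with Z ∼ N(0, 1) and W an independent Rademacher variable. The three errors are pairwise uncorrelated, yet E[ε_1² ε_2 ε_3] = E[Z²(Z² − 1)] = 2 ≠ 0, so the first coordinate of the covariance above does not vanish. A quick Monte Carlo check:

    # Pairwise-uncorrelated but dependent errors: cov(eps2*eps1, eps3*eps1) != 0.
    set.seed(2)
    n <- 1e6
    Z <- rnorm(n); W <- sample(c(-1, 1), n, replace = TRUE)
    e1 <- Z; e2 <- W; e3 <- W * (Z^2 - 1)
    round(c(cor(e1, e2), cor(e1, e3), cor(e2, e3)), 3)  # all approx 0
    cov(e2 * e1, e3 * e1)                               # approx 2, not 0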
7. Numerical Results.
In this section, the finite-sample performance of G-regression is evaluated against contending estimators. We use simulations and an in silico dataset for predicting expression levels in gene knockout experiments. All the numerical experiments were conducted with R v3.6, the package pcalg v2.6 [Kalisch et al., 2012] and our package eff2 v0.1.1.

7.1. Simulations.
We compare the performance of G-regression to several contending estimators under finite samples. We roughly follow the simulation setup of Henckel et al. [2019] and Witte et al. [2020]. First, we draw a random undirected graph from the Erdős–Rényi model with average degree k, where k is drawn uniformly at random from a set of four values {·, ·, ·, ·}. The graph is converted to a DAG D with a random causal ordering and the corresponding CPDAG G is recorded. Then we fix a linear SEM by drawing γ_ij uniformly from [−·, −0.·] ∪ [0.·, 2] and choosing the error distribution at random from the following:

1. ε_i ∼ N(0, v_i) with v_i ∼ unif(0.·, ·);
2. ε_i/√v_i ∼ t_· with v_i ∼ unif(0.·, ·);
3. ε_i ∼ logistic(0, s_i) with s_i ∼ unif(0.·, ·);
4. ε_i ∼ unif(−a_i, a_i) with a_i ∼ unif(1.·, ·).

We then draw n iid samples from the model; a code sketch of this data-generating process is given at the end of this subsection. Treatments A of a fixed size are randomly selected from the set of vertices with non-empty descendants, and Y is selected randomly from their descendants; the drawing is repeated until τ_AY is identified from G according to the criterion of Theorem 1. Finally, the data and graph G are provided to each estimator of τ_AY.

We compare to the following three estimators:
• adj.O: optimal adjustment estimator [Henckel et al., 2019],
• IDA.M: joint-IDA estimator based on modifying Cholesky decompositions [Nandy et al., 2017],
• IDA.R: joint-IDA estimator based on recursive regressions [Nandy et al., 2017].

They are implemented in the R package pcalg. The two joint-IDA estimators use the parents of the treatment variables to estimate a causal effect. Both of them reduce to the IDA estimator of Maathuis et al. [2009] when |A| = 1. Admittedly, compared to G-regression and adj.O, the joint-IDA estimators require less knowledge about the graph, namely only Pa(i) for each i ∈ A.

For each estimator τ̂_AY, we compute its squared error ∥τ̂_AY − τ_AY∥². Dividing ∥τ̂_AY − τ_AY∥² by the squared error of G-regression, we obtain the relative squared error of each contending estimator. We consider four sizes of |A|, |V| ∈ {20, 50, 100} and n ∈ {100, 1000}; each configuration of (|A|, |V|, n) is replicated 1,000 times.

Fig. 2 shows the distributions of relative squared errors. In Table 1, we summarize the relative errors with their geometric mean and median. Our estimator dominates all the contending estimators in all cases, and the improvement grows as |A| gets bigger. Even though adj.O achieves the minimal asymptotic variance among all adjustment estimators, it can compare less favorably to our estimator by several folds. In general, the IDA estimators perform very poorly. Moreover, the results in Table 1 are computed only from the replications where a contending estimator exists. As mentioned in the Introduction, unlike G-regression, none of the contending estimators is guaranteed to exist for every identified effect under joint intervention (adj.O always exists for point interventions); see Table 2 for the percentages of instances that are not estimable by contending estimators, even though the effect is identified by Theorem 1 and hence estimable by G-regression.

Table 1 (numerical entries omitted): Geometric average (brackets: median) of relative squared errors compared to G-regression, for adj.O, IDA.M and IDA.R, across |V| ∈ {20, 50, 100}, n ∈ {100, 1000} and four sizes of |A|.

Fig 2 (plots omitted): Violin plots for the relative squared errors of contending estimators ('·': geometric mean, '+': median).

Table 2 (numerical entries omitted): Percentage of identified instances not estimable using contending estimators (all estimable with G-regression), for adj.O, IDA.M and IDA.R, by |A| and |V| ∈ {20, 50, 100}.

In Appendix D, we report additional simulation results where the CPDAG is estimated with the greedy equivalence search algorithm [Chickering, 2002] and provided to the estimators. The improvements are more modest but are still typically by several folds.
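As referenced above, here is a minimal R sketch of the data-generating process; the edge density, coefficient range and Gaussian error law below are illustrative stand-ins for the constants elided in the text.

# Illustrative data-generating process: random DAG + linear SEM
# (edge density, coefficient range and error law are stand-in choices).
sim_sem <- function(n, p = 20, k = 3) {
  ord <- sample(p)                     # random causal ordering
  A <- matrix(0, p, p)                 # A[i, j] = coefficient of X_i in eq. of X_j
  for (a in 1:(p - 1)) for (b in (a + 1):p) {
    if (runif(1) < k / (p - 1)) {      # Erdos-Renyi skeleton, average degree ~ k
      i <- ord[a]; j <- ord[b]         # orient i -> j along the ordering
      A[i, j] <- sample(c(-1, 1), 1) * runif(1, 0.1, 2)
    }
  }
  X <- matrix(0, n, p)
  for (j in ord) X[, j] <- X %*% A[, j] + rnorm(n)  # generate in causal order
  list(X = X, A = A, ord = ord)
}
dat <- sim_sem(n = 1000)
round(cov(dat$X)[1:3, 1:3], 2)         # sample covariance fed to the estimators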
7.2. Predicting double knockouts in DREAM4 data. The DREAM4 in silico network challenge dataset [Marbach et al., 2009b] provides a benchmark for evaluating the reverse engineering of gene regulation networks. Here we use the 5th Size-10 dataset [Marbach et al., 2009a] as our example, which is a small network of 10 genes. Fig. 3 shows the true gene regulation network, which is constructed based on the networks of living organisms. A stochastic differential equation model was used to generate the data under wild type (steady state), perturbed steady state and knockout interventions. A task in the challenge is to use data under wild type and perturbed steady state (both are observational data) to predict the steady state expression levels under 5 different joint interventions, each of which knocks out a pair of genes. For our purpose, we also use the true network as input. However, the true network contains one cycle (other networks in DREAM4 contain more than one cycle). In the following, we remove one edge in the cycle and provide the resulting DAG to the estimators. Necessarily, the causal DAG is misspecified. Results are reported under 4 different edge removals.

Fig 3 (diagram omitted): Gene regulation network from the DREAM4 dataset, which contains a cycle (blue).

Unfortunately, the wild type data only consist of one sample. To estimate the observational covariance, we use the perturbed steady state data, which consist of 5 segments of time series. A sample covariance is computed from each segment, and the final estimate is taken as their average. For a double knockout of genes (i, j), we use G-regression to estimate the joint-intervention effect of A = (i, j) on every other gene. The effect is identified because the DAG is given. For gene k, let s_k and s_k^{(ij)} respectively denote its expression level under wild type and under the double knockout of genes (i, j). The expression level under double knockout is predicted as
$$\hat{s}^{(ij)}_k = \begin{cases} s_k - (s_i, s_j)^\top \hat\tau_{ij,k}, & k \notin \{i, j\}, \\ 0, & k \in \{i, j\}, \end{cases}$$
where the knocked-out genes are set to zero; a code sketch of this prediction rule is given at the end of this subsection. Prediction accuracy is measured by the normalized squared error
$$E = \frac{\sum_{(i,j) \in \mathcal{A}} \sum_{k=1}^{10} \big(\hat{s}^{(ij)}_k - s^{(ij)}_k\big)^2}{\sum_{(i,j) \in \mathcal{A}} \sum_{k=1}^{10} \big(s^{(ij)}_k\big)^2},$$
where $\mathcal{A} = \{(6, \cdot), (7, \cdot), (8, \cdot), (8, \cdot), (8, \cdot)\}$ consists of the 5 double knockouts available in the dataset. For comparison, we also evaluate the performance of adj.O (optimal adjustment, Henckel et al. [2019]) and IDA.R (joint-IDA based on recursive regressions, Nandy et al. [2017]); IDA.R is chosen because it outperforms IDA.M according to Section 7.1. Unfortunately, adj.O is not able to estimate the effect on every k, so a modified metric E* is computed by only summing over those estimable k's; the same metric E* of G-regression is also computed for comparison. As a baseline, we also compute E from naively estimating s_k^{(ij)} with just s_k.

Table 3 (most entries omitted; two surviving percentages are 32% and 33%): Normalized squared errors of predicting gene double knockouts. Columns: edge removed from the cycle; percentage of effects not estimable by adj.O; E* for adj.O and G-regression; E for IDA.R, G-regression and the baseline.

Table 3 reports the results, where one column lists the percentage of effects not estimable by the adjustment estimator. In almost all the cases, G-regression dominates all the contending estimators. In this example, even though both the causal graph and the linear SEM are misspecified, one can still witness some usefulness of our estimator.
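As referenced above, the prediction rule can be sketched in a few lines of R (hypothetical inputs: s is the wild-type expression vector and tau is a 2 × 10 matrix whose k-th column holds the estimated joint effects of genes (i, j) on gene k, as would be produced by G-regression):

# Sketch of the double-knockout prediction rule and the normalized error E.
predict_knockout <- function(s, i, j, tau) {
  shat <- numeric(length(s))
  for (k in seq_along(s)) {
    shat[k] <- if (k %in% c(i, j)) 0            # knocked-out genes drop to zero
               else s[k] - sum(c(s[i], s[j]) * tau[, k])
  }
  shat
}
# Normalized squared error over a list of knockout pairs with observed outcomes:
norm_sq_error <- function(pred_list, obs_list) {
  num <- sum(mapply(function(p, o) sum((p - o)^2), pred_list, obs_list))
  den <- sum(sapply(obs_list, function(o) sum(o^2)))
  num / den
}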
8. Discussion.
We have proposed G-regression, based on recursive least squares, to estimate a total causal effect from observational data under linearity and causal sufficiency assumptions. G-regression is applicable to estimating every identified total effect, under either point intervention or joint intervention. Further, via a new semiparametric efficiency theory, we have shown that the estimator achieves the efficiency bound within a restricted, yet reasonably large, class of estimators, including covariate adjustment and other regular estimators based on the sample covariance. To construct confidence intervals and conduct hypothesis tests, the bootstrap can easily be applied to estimate the asymptotic covariance of our estimator. This is implemented in the R package eff2.

One may wonder whether, within the class of all regular estimators, a (globally) semiparametric efficient estimator can be constructed for this problem. Ignoring the conditional independence constraints Eq. (12) in the blockwise error distributions, the model Eq. (21) is a generalized, multivariate location-shift regression model; see also Tsiatis [2006]. Constructing a globally efficient estimator would involve estimating the multivariate error densities, which is impractical unless n is very large [Tsiatis, 2006, page 111]. The proposal of Tsiatis [2006] is to instead develop a locally efficient estimator by postulating an error distribution. Our result justifies postulating Gaussian errors if only the first two sample moments are utilized. Another possibility for tackling the nuisance of the multivariate error distribution is via distribution-free multivariate ranks, namely R-estimation along the lines of Hallin et al. [2019], which we leave for future studies.

We have seen that the conditional independence constraints in Eq. (12) play no role for the restricted class of estimators considered, a feature that holds under the restrictive property (Corollary 1). Recently, Rotnitzky and Smucler [2019] found that when estimating the effect of a point intervention, under certain graphical conditions, covariate adjustment can still achieve the semiparametric efficiency bound, although it ignores the additional conditional independences obeyed by the observed distribution. It is thus worth investigating whether this phenomenon continues to hold in semiparametric estimation beyond the linear SEM.

APPENDIX A: PROOFS FOR ASYMPTOTIC EFFICIENCY
Fig A.1 (diagram omitted): Dependency structure of the proofs in Section 6 of the main text, relating Lemmas 4–10, Lemmas 12–13, Corollary 4, and Theorem 3.
A.1. Proof of Lemma 7.
Proof.
For simplicity, we drop the superscripts in $\hat\Sigma^{(n)}$ and $\hat\lambda^{(n)}$. Since $j \in B_k$ and $\operatorname{Pa}(B_k, G) \subseteq C \subseteq \operatorname{Pa}(B_k, \bar G)$, we have
$$\hat\lambda_{C,j} - \lambda_{C,j} = (\hat\Sigma_C)^{-1}\hat\Sigma_{C,j} - (\Sigma_C)^{-1}\Sigma_{C,j} = \big(\Sigma_C + \hat\Sigma_C - \Sigma_C\big)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) + \big((\hat\Sigma_C)^{-1} - (\Sigma_C)^{-1}\big)\Sigma_{C,j}.$$
We compute the two terms separately. The first term becomes
$$\big(\Sigma_C + \hat\Sigma_C - \Sigma_C\big)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) = \big(\Sigma_C + O_p(n^{-1/2})\big)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) = (\Sigma_C)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) + O_p(n^{-1}),$$
where we used the fact that $\Sigma_C$ is positive definite (Lemma 4) and $\|\hat\Sigma_{C,j} - \Sigma_{C,j}\| = O_p(n^{-1/2})$, $\|\hat\Sigma_C - \Sigma_C\| = O_p(n^{-1/2})$ by the central limit theorem.

For the second term,
$$(\hat\Sigma_C)^{-1} - (\Sigma_C)^{-1} = \big(\Sigma_C + \hat\Sigma_C - \Sigma_C\big)^{-1} - (\Sigma_C)^{-1} = \Big[I - \big(I - (\Sigma_C)^{-1}\hat\Sigma_C\big)\Big]^{-1}(\Sigma_C)^{-1} - (\Sigma_C)^{-1}.$$
Since $\|I - (\Sigma_C)^{-1}\hat\Sigma_C\| = O_p(n^{-1/2})$, using the Neumann series $(I - H)^{-1} = I + H + H^2 + \dots$ for $H = I - (\Sigma_C)^{-1}\hat\Sigma_C$ with $\|H\| \to_p 0$, we have
$$(\hat\Sigma_C)^{-1} - (\Sigma_C)^{-1} = \big[I + H + O_p(n^{-1})\big](\Sigma_C)^{-1} - (\Sigma_C)^{-1} = H(\Sigma_C)^{-1} + O_p(n^{-1}) = \big[I - (\Sigma_C)^{-1}\hat\Sigma_C\big](\Sigma_C)^{-1} + O_p(n^{-1}).$$
Combining the two terms, we obtain
$$\hat\lambda_{C,j} - \lambda_{C,j} = (\Sigma_C)^{-1}\big(\hat\Sigma_{C,j} - \Sigma_{C,j}\big) + \big[I - (\Sigma_C)^{-1}\hat\Sigma_C\big](\Sigma_C)^{-1}\Sigma_{C,j} + O_p(n^{-1})$$
$$= (\Sigma_C)^{-1}\hat\Sigma_{C,j} - (\Sigma_C)^{-1}\Sigma_{C,j} + (\Sigma_C)^{-1}\Sigma_{C,j} - (\Sigma_C)^{-1}\hat\Sigma_C(\Sigma_C)^{-1}\Sigma_{C,j} + O_p(n^{-1})$$
$$\overset{(i)}{=} (\Sigma_C)^{-1}\big(\hat\Sigma_{C,j} - \hat\Sigma_C\lambda_{C,j}\big) + O_p(n^{-1}) = \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1}\Big[X^{(i)}_j X^{(i)}_C - X^{(i)}_C X^{(i)\top}_C \lambda_{C,j}\Big] + O_p(n^{-1})$$
$$= \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1} X^{(i)}_C \Big(X^{(i)}_j - \lambda^\top_{C,j} X^{(i)}_C\Big) + O_p(n^{-1}) \overset{(ii)}{=} \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1} X^{(i)}_C \Big(X^{(i)}_j - \lambda^\top_{\operatorname{Pa}(B_k,G),j} X^{(i)}_{\operatorname{Pa}(B_k,G)}\Big) + O_p(n^{-1})$$
$$= \frac{1}{n}\sum_{i=1}^n (\Sigma_C)^{-1} X^{(i)}_C \varepsilon^{(i)}_j + O_p(n^{-1}),$$
where (i) uses $\lambda_{C,j} = (\Sigma_C)^{-1}\Sigma_{C,j}$ and (ii) follows from Proposition 1 and $\operatorname{Pa}(B_k, G) \subseteq C \subseteq \operatorname{Pa}(B_k, \bar G)$.
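The second-order remainder in this expansion can be checked numerically; the following is a small R sketch under an assumed 3-dimensional $\Sigma_C$.

# Numeric check: the inverse-difference identity holds up to O_p(1/n).
set.seed(2)
n <- 5000
Sigma_C <- matrix(c(1, .5, .2,
                    .5, 1, .3,
                    .2, .3, 1), 3, 3)
X <- matrix(rnorm(n * 3), n, 3) %*% chol(Sigma_C)  # rows ~ N(0, Sigma_C)
Sigma_hat <- crossprod(X) / n
lhs <- solve(Sigma_hat) - solve(Sigma_C)
rhs <- (diag(3) - solve(Sigma_C) %*% Sigma_hat) %*% solve(Sigma_C)
max(abs(lhs - rhs))   # O_p(1/n), much smaller than the O_p(1/sqrt(n)) terms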
A.2. Proof of Lemma 8.

Proof.
For each $k = 2, \dots, K$, note that for $C = \operatorname{Pa}(B_k, \bar G) = B_{[k-1]}$, we have $\operatorname{vec}\hat\Lambda^{\bar G}_k = (\hat\lambda^{(n)}_{C,j})_{j \in B_k}$ by concatenation. By Lemma 7, we have the asymptotic linear expansion
$$(30)\qquad \hat\lambda^{(n)}_{B_{[k-1]},j} - \lambda_{B_{[k-1]},j} = \frac{1}{n}\sum_{i=1}^n \big(\Sigma_{B_{[k-1]}}\big)^{-1} X^{(i)}_{B_{[k-1]}} \varepsilon^{(i)}_j + O_p(n^{-1}).$$
By the central limit theorem,
$$\sqrt{n}\begin{pmatrix} \operatorname{vec}(\hat\Lambda^{\bar G}_2 - \Lambda_2) \\ \vdots \\ \operatorname{vec}(\hat\Lambda^{\bar G}_K - \Lambda_K) \end{pmatrix}$$
converges to a centered multivariate normal distribution. Further, we claim that the asymptotic covariance must be block-diagonal according to $k = 2, \dots, K$. To see this, take $k < k'$, $j \in B_k$, $j' \in B_{k'}$ and let $C = B_{[k-1]}$, $C' = B_{[k'-1]}$. Using Eq. (30), we have
$$\lim_{n \to \infty} n\, \operatorname{E}\Big(\hat\lambda^{(n)}_{C,j} - \lambda_{C,j}\Big)\Big(\hat\lambda^{(n)}_{C',j'} - \lambda_{C',j'}\Big)^\top = (\Sigma_C)^{-1}\operatorname{cov}\big(\varepsilon_j X_C,\, \varepsilon_{j'} X_{C'}\big)(\Sigma_{C'})^{-1} = (\Sigma_C)^{-1}\Big\{\operatorname{E}\big[\varepsilon_j \varepsilon_{j'} X_C X^\top_{C'}\big] - \operatorname{E}[\varepsilon_j X_C]\operatorname{E}\big[\varepsilon_{j'} X^\top_{C'}\big]\Big\}(\Sigma_{C'})^{-1}.$$
In the expression above, because $\varepsilon_{B_k} \perp\!\!\!\perp X_{B_{[k-1]}}$ and $\varepsilon_{B_{k'}} \perp\!\!\!\perp X_{B_{[k'-1]}}$ by Corollary 2, and $j \in B_k$, $j' \in B_{k'}$ for $k < k'$, we have
$$\operatorname{E}\big[\varepsilon_j \varepsilon_{j'} X_C X^\top_{C'}\big] = \operatorname{E}\varepsilon_{j'}\, \operatorname{E}\big[\varepsilon_j X_C X^\top_{C'}\big] = 0, \qquad \operatorname{E}[\varepsilon_j X_C] = 0, \qquad \operatorname{E}[\varepsilon_{j'} X_{C'}] = 0.$$
It follows that the display above evaluates to $0$, and hence the asymptotic covariance matrix is block-diagonal.

It remains to be shown that $\operatorname{acov}\operatorname{vec}(\hat\Lambda^{\bar G}_k - \Lambda_k) = \Omega_k \otimes (\Sigma_{B_{[k-1]}})^{-1}$ for $k = 2, \dots, K$. Fix $k$, take any two distinct $j, j' \in B_k$ and let $C = B_{[k-1]}$. Again using Eq. (30), we have
$$\operatorname{acov}\begin{pmatrix} \hat\lambda^{(n)}_{C,j} \\ \hat\lambda^{(n)}_{C,j'} \end{pmatrix} = \begin{pmatrix} H & F \\ F^\top & D \end{pmatrix},$$
where
$$H = (\Sigma_C)^{-1}\operatorname{cov}(\varepsilon_j X_C, \varepsilon_j X_C)(\Sigma_C)^{-1} = \operatorname{var}(\varepsilon_j)\,(\Sigma_{B_{[k-1]}})^{-1},$$
$$F = (\Sigma_C)^{-1}\operatorname{cov}(\varepsilon_j X_C, \varepsilon_{j'} X_C)(\Sigma_C)^{-1} = \operatorname{cov}(\varepsilon_j, \varepsilon_{j'})\,(\Sigma_{B_{[k-1]}})^{-1},$$
$$D = (\Sigma_C)^{-1}\operatorname{cov}(\varepsilon_{j'} X_C, \varepsilon_{j'} X_C)(\Sigma_C)^{-1} = \operatorname{var}(\varepsilon_{j'})\,(\Sigma_{B_{[k-1]}})^{-1}.$$
Noting that $\Omega_k = \operatorname{cov}(\varepsilon_{B_k})$ and $\operatorname{vec}\hat\Lambda^{\bar G}_k = (\hat\lambda^{(n)}_{C,j})_{j \in B_k}$, the result then follows from comparing the expressions above to the definition of the Kronecker product for every pair $j, j' \in B_k$.
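The Kronecker-product form of the asymptotic covariance can also be verified by simulation; the following R sketch uses an assumed two-response, two-regressor toy model in which both responses share the same regressors, as in one bucket.

# Monte Carlo check of acov{vec(Lambda_hat)} = Omega (x) Sigma_C^{-1}.
set.seed(3)
n <- 500; R <- 2000
Sigma_C <- matrix(c(1, .4, .4, 1), 2, 2)
Omega   <- matrix(c(1, .6, .6, 2), 2, 2)
Btrue   <- matrix(c(.5, 0, 1, -1), 2, 2)    # columns: coefficients per response
coefs <- replicate(R, {
  X <- matrix(rnorm(n * 2), n, 2) %*% chol(Sigma_C)
  E <- matrix(rnorm(n * 2), n, 2) %*% chol(Omega)
  as.vector(solve(crossprod(X), crossprod(X, X %*% Btrue + E)))  # vec of OLS fit
})
max(abs(n * cov(t(coefs)) - kronecker(Omega, solve(Sigma_C))))   # small, up to MC error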
A.3. Proof of Lemma 9.

Proof.
Note that by the restrictive property of $G$ (Corollary 1), we have $\operatorname{vec}\hat\Lambda^G_k = \big(\hat\lambda^{(n)}_{\operatorname{Pa}(B_k,G),j}\big)_{j \in B_k}$ for $k = 2, \dots, K$. Using Lemma 7 with $C = \operatorname{Pa}(B_k, G)$, we have the asymptotic linear expansion
$$(31)\qquad \hat\lambda^{(n)}_{\operatorname{Pa}(B_k,G),j} - \lambda_{\operatorname{Pa}(B_k,G),j} = \frac{1}{n}\sum_{i=1}^n \big(\Sigma_{\operatorname{Pa}(B_k,G)}\big)^{-1} X^{(i)}_{\operatorname{Pa}(B_k,G)} \varepsilon^{(i)}_j + O_p(n^{-1}).$$
The rest of the computation follows similarly to the proof of Lemma 8.
A.4. Proof of Lemma 10.
Proof.
Since $S \in \mathbb{R}^{n \times n}$ is positive definite, by completing the square, we have
$$x^\top S x = x_A^\top S_{A,A} x_A + x_B^\top S_{B,A} x_A + x_A^\top S_{A,B} x_B + x_B^\top S_{B,B} x_B - x_A^\top S_{A,B} S^{-1}_{B,B} S_{B,A} x_A + x_A^\top S_{A,B} S^{-1}_{B,B} S_{B,A} x_A$$
$$= x_A^\top \big(S_{A,A} - S_{A,B} S^{-1}_{B,B} S_{B,A}\big) x_A + \big(x_B + S^{-1}_{B,B} S_{B,A} x_A\big)^\top S_{B,B} \big(x_B + S^{-1}_{B,B} S_{B,A} x_A\big)$$
$$\ge x_A^\top \big(S_{A,A} - S_{A,B} S^{-1}_{B,B} S_{B,A}\big) x_A = x_A^\top S_{A \cdot B}\, x_A,$$
where the equality holds if and only if $x_B = -S^{-1}_{B,B} S_{B,A} x_A$.
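A quick numeric illustration of this bound in R, under an assumed random 4 × 4 positive definite S with blocks A = {1, 2} and B = {3, 4}:

# The quadratic form attains x_A' S_{A.B} x_A exactly at the minimizing x_B.
set.seed(4)
M <- matrix(rnorm(16), 4)
S <- crossprod(M) + diag(4)                  # positive definite
A <- 1:2; B <- 3:4
S_AdotB <- S[A, A] - S[A, B] %*% solve(S[B, B], S[B, A])
xA <- rnorm(2)
xB <- -solve(S[B, B], S[B, A] %*% xA)        # minimizer from the proof
x  <- c(xA, xB)
c(t(x) %*% S %*% x, t(xA) %*% S_AdotB %*% xA)  # the two values coincide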
APPENDIX B: PROOFS FOR GRAPHICAL RESULTS

B.1. Proof of Lemma 1.

Proof.
Let the undirected path between $j$ and $k$ be $p = \langle j = V_1, \dots, V_l = k\rangle$ with $l > 1$. First note that $i$ is not on $p$, because there is no undirected path between $i$ and $j$ in $G$. Further, since $i \to j - V_2$ is in $G$, by Meek rules R1 and R2 (Fig. C.1 in Appendix C), $i - V_2$ or $i \to V_2$ is in $G$. Since, by assumption, there is no undirected path from $i$ to $j$ in $G$, $i - V_2 \notin U$. Hence, $i \to V_2 \in E$, and if $l = 2$, the statement of the lemma holds. If $l > 2$, we can apply the above reasoning iteratively until we obtain $i \to V_l \in E$.

B.2. Proof of Lemma 2.

Proof.
Let $l \in D_k$. Since $D_k \subseteq B_k$, $l \in B_k$. Then by Corollary 1, we have that $\operatorname{Pa}(B_k) = \operatorname{Pa}(l) \setminus B_k$. Therefore, $\operatorname{Pa}(B_k) \subseteq \cup_{j \in D_k} \operatorname{Pa}(j) \setminus B_k$ and furthermore, $\operatorname{Pa}(B_k) \subseteq \cup_{j \in D_k} \operatorname{Pa}(j) \setminus D_k = \operatorname{Pa}(D_k)$. Hence, it suffices to show $\operatorname{Pa}(D_k) \subseteq \operatorname{Pa}(B_k)$.

We prove $\operatorname{Pa}(D_k) \subseteq \operatorname{Pa}(B_k)$ by contradiction. Suppose there exists $j \in \operatorname{Pa}(D_k) \setminus \operatorname{Pa}(B_k)$. By definition, $D = \operatorname{An}(Y, G_{V \setminus A})$ and $D = \cup_{r=1}^K D_r$. Therefore, if $k = 1$, then $j \in A$; if $k > 1$, $j$ must be contained in $\cup_{r=1}^{k-1} D_r$ or in $A$. If $j \in A$, this leads to a contradiction with Lemma C.1 in Appendix C. Suppose $k > 1$ and $j \in \cup_{r=1}^{k-1} D_r$. Because $\cup_{r=1}^{k-1} D_r \subseteq \cup_{r=1}^{k-1} B_r$ and the buckets $\{B_1, \dots, B_K\}$ are disjoint, we have $(\cup_{r=1}^{k-1} D_r) \cap B_k = \emptyset$. However, this contradicts that $j \in B_k$: indeed, $j \in \operatorname{Pa}(D_k) \setminus \operatorname{Pa}(B_k)$ forces $j \in B_k$, since any parent of $D_k \subseteq B_k$ outside of $B_k$ belongs to $\operatorname{Pa}(B_k)$.

B.3. Proof of Proposition 3.
Proof.
By construction, the undirected component of $\bar G$ remains the same as that of $G$. Hence, $\bar G$ has the same bucket decomposition as $G$. We only need to show that $\bar G$ is an MPDAG. It is enough to show that the edge orientations in $\bar G$ are closed under rules R1–R4 of Meek [1995], which are displayed in Fig. C.1 of Appendix C. Note that since $G$ is an MPDAG, it is closed under R1–R4. So if any of the left-hand-side graphs in Figure C.1 are induced subgraphs of $\bar G$, then at least one of the directed edges in these induced subgraphs must have been added in the construction of $\bar G$.

Since the construction of $\bar G$ does not involve adding directed edges within a bucket, the left-hand sides of rules R3 and R4 in Figure C.1 cannot appear as induced subgraphs of $\bar G$. Hence, edge orientations in $\bar G$ are closed under rules R3 and R4.

Consider the left-hand side of rule R1 in Figure C.1, $A \to B - C$, for some $A, B, C \in V$. For $A \to B - C$ to be an induced subgraph of $\bar G$, $A \to B$ must have been added in the construction of $\bar G$ from $G$. Hence, $A$ and $B$ would need to be in different buckets in $G$. Since $B$ and $C$ are in the same bucket because of the edge $B - C$, $A \to C$ would also be added to $G$ in the construction of $\bar G$. Hence, $A \to B - C$ will also not appear as an induced subgraph of $\bar G$, and edge orientations in $\bar G$ are also closed under R1.

Consider the left-hand side of R2 in Figure C.1, and suppose for a contradiction that $A \to B \to C$ together with $A - C$ is an induced subgraph of $\bar G$ for some $A, B, C \in V$. Then $A \to B$, $B \to C$, or both $A \to B$ and $B \to C$, were added to $G$ in the construction of $\bar G$. Because of $A - C$, suppose $A$ and $C$ are in the same bucket $B_i$ for some $i \in \{1, \dots, K\}$ in $G$. Also, suppose $B \in B_j$. Because only directed edges between buckets are added, $i \neq j$.

Now, $A \to B$ and $B \to C$ cannot both be added to $G$ to construct $\bar G$, because that would imply $i < j$ and $j < i$. By R1, $B \to C - A$ cannot be an induced subgraph of the MPDAG $G$, so $A \to B$ alone also could not have been added to $G$. Therefore, $B \to C$ alone was added to $G$. But $C - A \to B$ is an induced subgraph of $G$, so $i < j$, which contradicts the direction of $B \to C$.

APPENDIX C: GRAPHICAL PRELIMINARIES

Graphs, vertices, edges.
A graph $G = (V, F)$ consists of a set of vertices (variables) $V$ and a set of edges $F$. The graphs we consider are allowed to contain directed ($\to$) and undirected ($-$) edges, with at most one edge between any two vertices. We can thus partition the set of edges $F$ into a set of directed edges $E$ and undirected edges $U$, and denote the graph $G = (V, F)$ as $G = (V, E, U)$. The corresponding undirected graph is simply $G_U = (V, \emptyset, U)$.

Subgraphs and skeleton. An induced subgraph $G_{V'} = (V', F')$ of $G = (V, F)$ consists of $V' \subseteq V$ and $F' \subseteq F$, where $F'$ are all edges in $F$ between vertices in $V'$. A skeleton of a graph $G = (V, F)$ is an undirected graph $G = (V, F')$ such that $F'$ are the undirected versions of all edges in $F$.

Paths: directed, undirected, causal, non-causal, proper paths. A path $p$ from $i$ to $j$ in $G$ is a sequence of distinct vertices $p = \langle i, \dots, j\rangle$ in which every pair of successive vertices is adjacent. A path consisting of undirected edges is an undirected path. A directed path from $i$ to $j$ is a path from $i$ to $j$ in which all edges are directed towards $j$, that is, $i \to \dots \to j$. We will use causal path instead of directed path when talking about causal graphs. Let $p = \langle v_1, \dots, v_k\rangle$, $k > 1$, be a path in $G$; $p$ is a possibly directed path (possibly causal path) if no edge $v_i \leftarrow v_j$, $1 \le i < j \le k$, is in $G$. Otherwise, $p$ is a non-causal path in $G$ (see Definition 3.1 and Lemma 3.2 of Perković et al., 2017). A path from $A$ to $Y$ is proper (w.r.t. $A$) if only its first vertex is in $A$.

Directed cycles. A directed path from $i$ to $j$ together with the edge $j \to i$ forms a directed cycle.

Colliders, shields and definite status paths. If a path $p$ contains $i \to j \leftarrow k$ as a subpath, then $j$ is a collider on $p$. A path $\langle i, j, k\rangle$ is an (un)shielded triple if $i$ and $k$ are (not) adjacent. A path is unshielded if all successive triples on the path are unshielded. A node $v_j$ is a definite non-collider on a path $p$ if there is at least one edge out of $v_j$ on $p$, or if $v_{j-1} - v_j - v_{j+1}$ is a subpath of $p$ and $\langle v_{j-1}, v_j, v_{j+1}\rangle$ is an unshielded triple. A node is of definite status on a path if it is a collider, a definite non-collider or an endpoint on the path. A path $p$ is of definite status if every node on $p$ is of definite status.

Subsequences and subpaths. A subsequence of a path $p$ is obtained by deleting some nodes from $p$ without changing the order of the remaining nodes. A subsequence of a path is not necessarily a path. For a path $p = \langle v_1, v_2, \dots, v_m\rangle$, the subpath from $v_i$ to $v_k$ ($1 \le i \le k \le m$) is the path $p(v_i, v_k) = \langle v_i, v_{i+1}, \dots, v_k\rangle$.

Ancestral relations. If $i \to j$, then $i$ is a parent of $j$, and $j$ is a child of $i$. If there is a causal path from $k$ to $l$, then $k$ is an ancestor of $l$, and $l$ is a descendant of $k$. If there is a possibly causal path from $k$ to $l$, then $k$ is a possible ancestor of $l$, and $l$ is a possible descendant of $k$. We use the convention that every vertex is a descendant, ancestor, possible ancestor and possible descendant of itself. The sets of parents, ancestors, descendants and possible descendants of $i$ in $G$ are denoted by $\operatorname{Pa}(i, G)$, $\operatorname{An}(i, G)$, $\operatorname{De}(i, G)$ and $\operatorname{PossDe}(i, G)$ respectively. For a set of vertices $A$, we let $\operatorname{Pa}(A, G) = (\cup_{i \in A} \operatorname{Pa}(i, G)) \setminus A$, whereas $\operatorname{An}(A, G) = \cup_{i \in A} \operatorname{An}(i, G)$, $\operatorname{De}(A, G) = \cup_{i \in A} \operatorname{De}(i, G)$ and $\operatorname{PossDe}(A, G) = \cup_{i \in A} \operatorname{PossDe}(i, G)$.

DAGs, PDAGs. A directed graph contains only directed edges. A partially directed graph may contain both directed and undirected edges. A directed graph without directed cycles is a directed acyclic graph (DAG). A partially directed acyclic graph (PDAG) is a partially directed graph without directed cycles.

Blocking and d-separation. (See Definition 1.2.3 of Pearl [2009] and Lemma C.1 of Henckel et al. [2019].) Let $Z$ be a set of vertices in a PDAG $G = (V, E, U)$. A definite status path $p$ is blocked by $Z$ if (i) $p$ contains a non-collider that is in $Z$, or (ii) $p$ contains a collider $C$ such that no descendant of $C$ is in $Z$. A definite status path that is not blocked by $Z$ is open given $Z$. If $A$, $B$ and $Z$ are three pairwise disjoint sets of nodes in a PDAG $G = (V, E, U)$, then $Z$ d-separates $A$ from $B$ in $G$ if $Z$ blocks every definite status path between any node in $A$ and any node in $B$ in $G$.

CPDAGs, MPDAGs.
Several DAGs can encode the same d-separation relationships. Such DAGs form a Markov equivalence class, which is uniquely represented by a completed partially directed acyclic graph (CPDAG) [Meek, 1995, Andersson et al., 1997]. A PDAG $G = (V, E, U)$ is a maximally oriented PDAG (MPDAG) if it is closed under the orientation rules R1–R4 of Meek [1995], presented in Figure C.1. An MPDAG can then be alternatively defined as any PDAG that does not contain the graphs on the left-hand side of each orientation rule as induced subgraphs. Both DAGs and CPDAGs are types of MPDAGs [Meek, 1995].

Fig C.1 (diagrams omitted): The orientation rules R1–R4 from Meek [1995]. If the graph on the left-hand side of a rule is an induced subgraph of a PDAG $G$, then orient the blue undirected edge ($-$) as shown on the right-hand side of the rule. Hence, the graphs on the left-hand side of each rule are not allowed to be induced subgraphs of an MPDAG.
Background knowledge and constructing MPDAGs. A PDAG $G'$ is represented by another PDAG $G$ (equivalently, $G$ represents $G'$) if $G'$ and $G$ have the same adjacencies and unshielded colliders and every directed edge $i \to j$ in $G$ is also in $G'$. Let $R$ be a set of directed edges representing background knowledge. Algorithm 1 of Meek [1995] describes how to incorporate background knowledge $R$ in an MPDAG $G$. If Algorithm 1 does not return a FAIL, then it returns a new MPDAG $G'$ that is represented by $G$. Background knowledge $R$ is consistent with MPDAG $G$ if and only if Algorithm 1 does not return a FAIL [Meek, 1995].
Algorithm 1: ConstructMPDAG [Meek, 1995, Perković et al., 2017]

Data: MPDAG $G$, background knowledge $R$
Result: MPDAG $G'$ or FAIL

Let $G' = G$;
while $R \neq \emptyset$ do
  Choose an edge $X \to Y$ in $R$;  $R = R \setminus \{X \to Y\}$;
  if $X - Y$ or $X \to Y$ is in $G'$ then
    Orient $X \to Y$ in $G'$;
    Close the edge orientations under the rules in Figure C.1 in $G'$;
  else
    FAIL;
  end
end

An R sketch mirroring this control flow is given below, after the ordering definitions.

Remark. The MPDAG output by ConstructMPDAG($G$, $R$) is the same regardless of the ordering of the edges in $R$. This stems from the fact that the orientation rules of Meek [1995] are necessary and sufficient for the construction of an MPDAG given a set of adjacencies and unshielded colliders.

$G$ and $[G]$. If $G$ is an MPDAG, then $[G]$ denotes every DAG represented by $G$.

Causal and partial causal ordering of vertices. A total ordering $<$ of vertices $V' \subseteq V$ is consistent with a DAG $D = (V, E, \emptyset)$, and called a causal ordering of $V'$, if for every $i, j \in V'$ such that $i < j$ and such that $i$ and $j$ are adjacent in $D$, $i \to j$ is in $D$. There can be more than one causal ordering of $V'$ in a DAG $D = (V, E, \emptyset)$; for example, in the DAG $i \leftarrow j \to k$, both orderings $j < i < k$ and $j < k < i$ are consistent.

Since an MPDAG may contain undirected edges, there is generally no unique causal ordering of vertices in an MPDAG. Instead, we define a partial causal ordering $<$ of a vertex set $V'$, $V' \subseteq V$, in an MPDAG $G = (V, E, U)$ as a total ordering of pairwise disjoint vertex sets $A_1, \dots, A_k$, $k \ge 1$, with $\cup_{i=1}^k A_i = V'$, that satisfy the following: if $A_i < A_j$ and there is an edge between $i \in A_i$ and $j \in A_j$ in $G$, then $i \to j$ is in $G$.
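As referenced above, the following R sketch mirrors the control flow of Algorithm 1 on edge lists. The rule-closure step is deliberately left as a comment stub; a complete implementation of adding background knowledge, including closure under R1–R4, is available in the pcalg package (see its addBgKnowledge function).

# Control-flow sketch of ConstructMPDAG; the Meek-rule closure is a stub.
construct_mpdag <- function(dir_edges, undir_edges, R) {
  # dir_edges, undir_edges, R: two-column matrices of (from, to) vertex pairs
  for (r in seq_len(nrow(R))) {
    x <- R[r, 1]; y <- R[r, 2]
    has_undir <- any((undir_edges[, 1] == x & undir_edges[, 2] == y) |
                     (undir_edges[, 1] == y & undir_edges[, 2] == x))
    has_dir   <- any(dir_edges[, 1] == x & dir_edges[, 2] == y)
    if (!has_undir && !has_dir) return("FAIL")
    if (has_undir) {
      drop <- (undir_edges[, 1] %in% c(x, y)) & (undir_edges[, 2] %in% c(x, y))
      undir_edges <- undir_edges[!drop, , drop = FALSE]
      dir_edges   <- rbind(dir_edges, c(x, y))   # orient x -> y
      # here one would close the orientations under Meek's rules R1-R4
    }
  }
  list(directed = dir_edges, undirected = undir_edges)
}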
Buckets and bucket decomposition. Algorithm 2 describes how to obtain an ordered bucket decomposition for a set of vertices $V$ in an MPDAG $G = (V, E, U)$. By Perković [2020, Lemma 1], the ordered list of buckets output by Algorithm 2 is a partial causal ordering of $V$ in $G$.

Algorithm 2: Partial causal ordering [Perković, 2020]

input: vertex set $V$ in MPDAG $G = (V, E, U)$ and MPDAG $G$.
output: An ordered list $B = (B_1, \dots, B_k)$, $k \ge 1$, of the bucket decomposition of $V$ in $G$.

Let $G_U$ denote the undirected subgraph of $G$;
Let ConComp be the bucket decomposition (i.e., maximal connected components) of $V$ in $G_U$;
Let $B$ be an empty list;
while ConComp $\neq \emptyset$ do
  Let $C$ be any element of ConComp;
  Let $C'$ be the set of vertices in ConComp that are not in $C$;
  if all edges between $C$ and $C'$ are into $C$ in $G$ then
    Remove $C$ from ConComp;
    Add $C$ to the beginning of $B$;
  end
end
return $B$;
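A minimal R sketch of Algorithm 2, under the assumption that the graph is given as two adjacency matrices (undir for the undirected part, dir for the directed part):

# Bucket decomposition: connected components of G_U, then repeatedly peel off
# a component whose between-component edges all point into it.
bucket_decomposition <- function(undir, dir) {
  p <- nrow(undir)
  comp <- rep(NA_integer_, p)              # component labels via BFS
  lab <- 0L
  for (v in seq_len(p)) if (is.na(comp[v])) {
    lab <- lab + 1L; queue <- v
    while (length(queue) > 0) {
      u <- queue[1]; queue <- queue[-1]; comp[u] <- lab
      queue <- c(queue, which(undir[u, ] == 1 & is.na(comp)))
    }
  }
  remaining <- split(seq_len(p), comp)     # maximal connected components of G_U
  B <- list()
  while (length(remaining) > 0) {
    for (i in seq_along(remaining)) {
      C <- remaining[[i]]
      Cbar <- setdiff(unlist(remaining), C)
      if (!any(dir[C, Cbar] == 1)) {       # all edges between C and Cbar into C
        B <- c(list(C), B)                 # add C to the beginning of B
        remaining <- remaining[-i]
        break                              # a valid MPDAG always yields such a C
      }
    }
  }
  B
}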
Lemma C.1 (see Lemma D.1(i) of Perković, 2020). Let $A$ and $Y$ be disjoint node sets in MPDAG $G = (V, E, U)$. Suppose that there is no proper possibly causal path from $A$ to $Y$ that starts with an undirected edge in $G$; that is, suppose that the criterion in Theorem 1 is satisfied. Further, let $D = \operatorname{An}(Y, G_{V \setminus A})$ and $D = \dot{\cup}_{i=1}^K D_i$ for $D_i = D \cap B_i$, $i = 1, \dots, K$, where $B_1, \dots, B_K$ is the bucket decomposition of $V$. Then for all $i \in \{1, \dots, K\}$, there is no proper possibly causal path from $A$ to $B_i$ that starts with an undirected edge in $G$.

APPENDIX D: ADDITIONAL SIMULATION RESULTS

In this section, we report additional simulation results. The setup is the same as Section 7.1 of the main text, but we replace the true CPDAG with the CPDAG estimated with the greedy equivalence search algorithm [Chickering, 2002] based on the same sample. The relative squared errors of the contending estimators are shown in Fig. D.1 and are summarized in Table D.1. Compared to the results with the true CPDAG, the performance improvement of G-regression is more modest but still matters in practice. The reduced improvement is due to the error in estimating the graph, which diminishes as n increases.

Table D.1 (numerical entries omitted): Geometric average (brackets: median) of relative squared errors compared to G-regression when CPDAGs are estimated, for adj.O, IDA.M and IDA.R, across |V| ∈ {20, 50, 100}, n ∈ {100, 1000} and |A|.
Fig D.1 (plots omitted): Violin plots for the relative squared errors of contending estimators ('·': geometric mean, '+': median). The estimated CPDAGs are provided to the estimators.

ACKNOWLEDGEMENTS

Authors thank Thomas Richardson for valuable comments and discussions. The first author was supported by ONR Grant N000141912446.

REFERENCES

Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
Theodore Wilbur Anderson and Ingram Olkin. Maximum-likelihood estimation of the parameters of a multivariate normal distribution. Linear Algebra and its Applications, 70:147–171, 1985.
Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25:505–541, 1997.
Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, Baltimore, 1993.
Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley, New York, 1989.
Wenyu Chen, Mathias Drton, and Y. Samuel Wang. On causal discovery with an equal-variance assumption. Biometrika, 106(4):973–980, 2019.
David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
A. Philip Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265–274, 1981.
Mathias Drton. Computing all roots of the likelihood equations of seemingly unrelated regressions. Journal of Symbolic Computation, 41(2):245–254, 2006.
Mathias Drton. Algebraic problems in structural equation modeling. In The 50th Anniversary of Gröbner Bases, pages 35–86. Mathematical Society of Japan, 2018.
Mathias Drton and Michael Eichler. Maximum likelihood estimation in Gaussian chain graph models under the alternative Markov property. Scandinavian Journal of Statistics, 33(2):247–257, 2006.
Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.
Mathias Drton and Thomas S. Richardson. Multimodality of the likelihood in the bivariate seemingly unrelated regressions model. Biometrika, 91(2):383–392, 2004.
Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on Algebraic Statistics, volume 39. Springer Science & Business Media, 2008.
Mathias Drton, Michael Eichler, and Thomas S. Richardson. Computing maximum likelihood estimates in recursive linear models with correlated errors. Journal of Machine Learning Research, 10(81):2329–2348, 2009.
Mathias Drton, Rina Foygel, and Seth Sullivant. Global identifiability of linear structural equation models. The Annals of Statistics, 39(2):865–886, 2011.
Marco Eigenmann, Preetam Nandy, and Marloes H. Maathuis. Structure learning of linear Gaussian structural equation models with weak edges. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI-17), 2017.
Zhuangyan Fang and Yangbo He. IDA with background knowledge. In Proceedings of the 36th Annual Conference on Uncertainty in Artificial Intelligence (UAI-20), 2020.
Shantanu Gupta, Zachary C. Lipton, and David Childers. Estimating treatment effects with observed confounders and mediators. arXiv preprint arXiv:2003.11991, 2020.
Marc Hallin, Davide La Vecchia, and Hang Liu. Center-outward R-estimation for semiparametric VARMA models. arXiv preprint arXiv:1910.08442, 2019.
Lars Peter Hansen. Large sample properties of generalized method of moments estimators. Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.
Alan Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13:2409–2464, 2012.
Leonard Henckel, Emilija Perković, and Marloes H. Maathuis. Graphical criteria for efficient total effect estimation via adjustment in causal linear models. arXiv preprint arXiv:1907.02435, 2019.
Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2012.
Patrik O. Hoyer, Aapo Hyvärinen, Richard Scheines, Peter L. Spirtes, Joseph Ramsey, Gustavo Lacerda, and Shohei Shimizu. Causal discovery of linear acyclic models with arbitrary distributions. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 282–289, 2008.
Markus Kalisch, Martin Mächler, Diego Colombo, Marloes H. Maathuis, and Peter Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.
Tjalling C. Koopmans and Olav Reiersøl. The identification of structural characteristics. The Annals of Mathematical Statistics, 21(2):165–181, 1950.
Manabu Kuroki and Zhihong Cai. Selection of identifiability criteria for total effects by using path diagrams. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 333–340, 2004.
Manabu Kuroki and Masami Miyakawa. Covariate selection for estimating the causal effect of control plans by using causal diagrams. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):209–222, 2003.
Manabu Kuroki and Hisayoshi Nanmo. Variance formulas for estimated mean response and predicted response with external intervention based on the back-door criterion in linear structural equation models. AStA Advances in Statistical Analysis, pages 1–19, 2020.
Steffen L. Lauritzen. Graphical Models. Oxford University Press, New York, 1996.
Jianzhou Liu. Some Löwner partial orders of Schur complements and Kronecker products of matrices. Linear Algebra and its Applications, 291(1-3):143–149, 1999.
Marloes H. Maathuis and Diego Colombo. A generalized back-door criterion. The Annals of Statistics, 43(3):1060–1088, 2015.
Marloes H. Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
Daniel Marbach, Thomas Schaffter, Dario Floreano, Robert J. Prill, and Gustavo Stolovitzky. The DREAM4 in-silico network challenge. Draft, version 0.3, http://gnw.sourceforge.net/resources/DREAM4%20in%20silico%20challenge.pdf, 2009a.
Daniel Marbach, Thomas Schaffter, Claudio Mattiussi, and Dario Floreano. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16(2):229–239, 2009b.
Christopher Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 403–410, 1995.
Preetam Nandy, Marloes H. Maathuis, and Thomas S. Richardson. Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics, 45(2):647–674, 2017.
Judea Pearl. Comment: graphical models, causality and intervention. Statistical Science, 8(3):266–269, 1993.
Judea Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
Judea Pearl. Causality. Cambridge University Press, Cambridge, 2nd edition, 2009.
Judea Pearl and Thomas S. Verma. A theory of inferred causation. In Studies in Logic and the Foundations of Mathematics, volume 134, pages 789–811. Elsevier, 1995.
Emilija Perković. Identifying causal effects in maximally oriented partially directed acyclic graphs. In Proceedings of the 36th Annual Conference on Uncertainty in Artificial Intelligence (UAI-20), 2020.
Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H. Maathuis. A complete generalized adjustment criterion. In Proceedings of the 31st Annual Conference on Uncertainty in Artificial Intelligence (UAI-15), 2015.
Emilija Perković, Markus Kalisch, and Marloes H. Maathuis. Interpreting and using CPDAGs with background knowledge. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI-17), 2017.
Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H. Maathuis. Complete graphical characterization and construction of adjustment sets in Markov equivalence classes of ancestral graphs. Journal of Machine Learning Research, 18(220):1–62, 2018.
Jonas Peters and Peter Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
James M. Robins. A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986.
Dominik Rothenhäusler, Jan Ernest, and Peter Bühlmann. Causal inference in partially linear structural equation models: identifiability and estimation. The Annals of Statistics, 46:2904–2938, 2018.
Andrea Rotnitzky and Ezequiel Smucler. Efficient adjustment sets for population average treatment effect estimation in non-parametric causal graphical models. arXiv preprint arXiv:1912.00306, 2019.
John D. Sargan. The estimation of economic relationships using instrumental variables. Econometrica: Journal of the Econometric Society, pages 393–415, 1958.
Richard Scheines, Peter Spirtes, Clark Glymour, Christopher Meek, and Thomas Richardson. The TETRAD project: constraint based aids to causal model specification. Multivariate Behavioral Research, 33(1):65–117, 1998.
Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, and Antti Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
Galen R. Shorack. Probability for Statisticians. Springer, 2000.
Ilya Shpitser, Tyler VanderWeele, and James M. Robins. On the validity of covariate adjustment for estimating causal effects. In Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 527–536, 2010.
Ezequiel Smucler, Facundo Sapienza, and Andrea Rotnitzky. Efficient adjustment sets in causal graphical models with hidden variables. arXiv preprint arXiv:2004.10521, 2020.
Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000.
Robert H. Strotz and H. O. A. Wold. Recursive vs. nonrecursive systems: An attempt at synthesis (part I of a triptych on causal chain systems). Econometrica, 28(2):417–427, 1960.
Seth Sullivant, Kelli Talaska, and Jan Draisma. Trek separation for Gaussian graphical models. The Annals of Statistics, 38(3):1665–1685, 2010.
Anastasios Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.
Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
Yuhao Wang, Liam Solus, Karren Dai Yang, and Caroline Uhler. Permutation-based causal inference algorithms with interventions. In Advances in Neural Information Processing Systems 30, pages 5822–5831, 2017.
Janine Witte, Leonard Henckel, Marloes H. Maathuis, and Vanessa Didelez. On efficient adjustment in causal graphs. arXiv preprint arXiv:2002.06825, 2020.
Sewall Wright. Correlation and causation. Journal of Agricultural Research, 20:557–585, 1921.
Sewall Wright. The method of path coefficients. The Annals of Mathematical Statistics, 5(3):161–215, 1934.
Department of Statistics
University of Washington
Box 354322
Seattle, WA 98195
E-mail: [email protected]; [email protected]