A Local Method for Identifying Causal Relations under Markov Equivalence
Zhuangyan Fang, Yue Liu, Zhi Geng, Yangbo He
Peking University; Huawei Noah's Ark Lab
February 26, 2021
Abstract
Causality is important for designing interpretable and robust methods in artificial intelligence research. We propose a local approach to identify whether a variable is a cause of a given target based on causal graphical models of directed acyclic graphs (DAGs). In general, the causal relation between two variables may not be identifiable from observational data, as many causal DAGs encoding different causal relations are Markov equivalent. In this paper, we first introduce a sufficient and necessary graphical condition to check the existence of a causal path from a variable to a target in every Markov equivalent DAG. Next, we provide local criteria for identifying whether the variable is a cause/non-cause of the target. Finally, we propose a local learning algorithm for this causal query via learning the local structure of the variable and performing some additional statistical independence tests related to the target. Simulation studies show that our local algorithm is efficient and effective, compared with other state-of-the-art methods.
Introduction

In many observational studies, the main purposes are to study whether a treatment variable is a cause of a target variable, or to further identify the causes/non-causes of a specified target variable or the effects/non-effects of a given treatment. For example, in a clinical study we may be concerned about which symptoms are side effects caused by using a new drug and which are not. Causality is also important for designing interpretable and robust methods in artificial intelligence research (Miller, 2019), and has been used in many fields of artificial intelligence, such as causal transfer learning (Zhang et al., 2020; Bengio et al., 2020) and causality-based algorithmic fairness (Kusner et al., 2017; Wu et al., 2019).

∗ Correspondence to: [email protected].

Directed acyclic graphs (DAGs) can be used to represent causal relationships among variables (Pearl, 2009). If a variable X has a directed path to another variable Y, then X is a cause of Y and Y is an effect of X. From observational data, however, instead of an exact causal DAG, we generally learn a Markov equivalence class of DAGs, which can be represented by a completed partially directed acyclic graph (CPDAG). The undirected edges in a CPDAG imply that some causal relations among variables cannot be read from the graph directly.

Consider a Markov equivalence class of DAGs learned from observational data. A variable X is a definite cause of a target Y if X is a cause of Y in every equivalent DAG, and a variable X is a definite non-cause of Y if X is not a cause of Y in any DAG in the class. If X is neither a definite cause nor a definite non-cause of Y, X is called a possible cause of Y.

Some approaches can be used to identify the causal relation between a treatment and a target.
An intuitive approach is first to learn a Markov equivalence class from observational data, and then enumerate all DAGs in the class to check whether the treatment is, in all of these equivalent DAGs, definitely or definitely not a cause of the target. However, this intuitive approach is inefficient when the number of DAGs in the learned Markov equivalence class is large (He et al., 2015).

Another way is to check the paths from the treatment to the target in a CPDAG. It has been shown that the treatment is a definite non-cause of the target if and only if there is no partially directed path from the treatment to the target (see, e.g., Zhang, 2006; Perković et al., 2017). Given a CPDAG, Roumpelaki et al. (2016) also introduce a sufficient condition for identifying definite causes. However, the necessity of this condition remains a conjecture (Zhang, 2006; Mooij and Claassen, 2020), and the corresponding approach is usually inefficient since it needs to learn an entire CPDAG first.

The third approach is to estimate the causal effect of the treatment on the given target (Maathuis et al., 2009; Perković et al., 2017; Nandy et al., 2017; Fang and He, 2020; Liu et al., 2020a,b; Witte et al., 2020; Guo and Perković, 2020). This approach, which we call the causal-effect-based method, determines whether a variable is a cause of another by testing whether all possible causal effects are zero/non-zero. However, this method requires

We note that the recent progress in identifying the causal relation between two variables does provide an opportunity to learn an exact DAG. However, such methods need to impose additional distributional conditions (Shimizu et al., 2006; Zhang and Hyvärinen, 2009; Shimizu et al., 2011; Peters and Bühlmann, 2013; Peters et al., 2014).
Background
In this section, we introduce basic concepts of causal graphical models and the assumptions for causal learning. We use pa(S, G), ch(S, G), sib(S, G), adj(S, G), an(S, G) and de(S, G) to denote the union of parents, children, siblings (or undirected neighbors), adjacent vertices, ancestors, and descendants of each variable in a set S in G, respectively, where G = (V, E) can be a directed, an undirected, or a partially directed graph. The basic graph terminology can be found in Appendix A. As a convention, we regard a vertex as an ancestor and a descendant of itself. Let G be a causal directed acyclic graph (causal DAG) and X be a vertex in G; if S = {X} is a singleton set, we will replace S by X for ease of presentation. The vertices in an(X, G) \ X are causes of X, and the vertices in pa(X, G) are direct causes of X. If X is a cause of Y, then the directed paths from X to Y are called causal paths.

The notion of d-separation induces a set of conditional independence relations encoded in a DAG (Pearl, 1988). Let G be a DAG and π = (X = X_0, X_1, ..., X_n = Y) be a path from X to Y in G. An intermediate vertex X_i is a collider on π if X_{i-1} → X_i and X_i ← X_{i+1}; otherwise, X_i is a non-collider on π. For three distinct vertices X_i, X_j and X_k, if X_i → X_j ← X_k and X_i is not adjacent to X_k in G, then the triple (X_i, X_j, X_k) is called a v-structure collided on X_j in G. Given Z ⊆ V, we say π is d-connected (or active) given Z if Z does not contain any endpoint or non-collider on the path and every collider on the path has a descendant in Z. If π is not d-connected given Z, then π is blocked by Z. For pairwise disjoint sets X, Y, Z ⊆ V, X and Y are d-separated by Z (denoted by X ⊥⊥ Y | Z) if and only if every path between some X ∈ X and Y ∈ Y is blocked by Z.

Let J_G be the set of d-separation relations read off from a DAG G. Two DAGs G1 and G2 are Markov equivalent if J_{G1} = J_{G2}. Pearl et al.
(1989) have shown that two DAGs are equivalent if and only if they have the same skeleton and the same v-structures. A Markov equivalence class, or simply equivalence class, denoted by [G], contains all DAGs equivalent to G. A Markov equivalence class [G] can be uniquely represented by a partially directed graph G∗ called a completed partially directed acyclic graph (CPDAG). Two vertices are adjacent in G∗ if and only if they are adjacent in G, and a directed edge occurs in G∗ if and only if it appears in every DAG in [G] (Pearl et al., 1989). For ease of presentation, we will also use [G∗] to represent the Markov equivalence class represented by G∗ below. Given a CPDAG G∗, we use G∗_u and G∗_d to denote the undirected subgraph and the directed subgraph of G∗, respectively. The former is the undirected graph obtained by removing all directed edges in G∗ and the latter is the directed graph obtained by removing all undirected edges. Andersson et al. (1997) proved that (1) the undirected subgraph G∗_u of G∗ is the union of disjoint connected chordal graphs (the definition of a chordal graph is provided in Appendix A), and (2) every partially directed cycle in G∗ is an undirected cycle, that is, none of the partially directed cycles in G∗ contains a directed edge. Each isolated connected undirected subgraph of G∗_u is called a chain component of G∗ (Andersson et al., 1997; Lauritzen and Richardson, 2002).

For a given distribution P, we use X ⊥⊥_P Y | Z to denote that X is independent of Y given Z with respect to P. Let J_P be the set of all such (conditional) independencies in P. In this paper, our main results are based on the following assumptions: the causal Markov assumption, which states that X ⊥⊥ Y | Z in J_G implies X ⊥⊥_P Y | Z in J_P; the causal faithfulness assumption, which states that X ⊥⊥_P Y | Z in J_P implies X ⊥⊥ Y | Z in J_G; and the assumption that there is no hidden variable or selection bias.
A distribution P is called Markov and faithful to a DAG G if P and G satisfy the causal Markov assumption and the causal faithfulness assumption. A causal DAG model consists of a DAG G and a joint distribution P over a common vertex set V such that P satisfies the causal Markov assumption with respect to G. G is called the causal structure of the model and P is called the observational distribution (or simply the distribution) (Hauser and Bühlmann, 2012).

Causal structure learning methods try to recover the causal structure from data. Global causal structure learning focuses on learning an entire causal structure over all variables, while local causal structure learning discusses how to recover a part of the underlying causal structure. Existing approaches for learning global causal structures roughly fall into two classes: constraint-based and score-based methods. Constraint-based methods, such as the PC algorithm (Spirtes and Glymour, 1991) and the stable PC algorithm (Colombo and Maathuis, 2014), use conditional independence tests to find the causal skeleton and then determine the edge directions according to a series of orientation rules (Meek, 1995). Under the causal Markov and causal faithfulness assumptions, constraint-based methods can identify causal graphs up to a Markov equivalence class. On the other hand, score-based methods, such as exact search algorithms like dynamic programming (Koivisto and Sood, 2004; Singh and Moore, 2005) and A* (Yuan et al., 2011; Xiang and Kim, 2013), heuristic search algorithms like GES (Chickering, 2002b), and gradient-based methods like NOTEARS (Zheng et al., 2018), evaluate candidate graphs with a predefined score function and search for DAGs or CPDAGs with the optimal score.

Local learning algorithms usually learn the Markov blanket (see, e.g., Tsamardinos et al., 2003; Tsamardinos and Aliferis, 2003; Fu and Desmarais, 2010) or the parent and child set of a given target (see, e.g.,
Wang et al., 2014; Gao and Ji, 2015; Liu et al., 2019). Recently, Liu et al. (2020b, Algorithm 3) extended the MB-by-MB algorithm (Wang et al., 2014) to learn the chain component containing a given target and the directed edges surrounding the chain component. This variant of MB-by-MB can thus learn the subgraph of the CPDAG over the target and its neighbors, that is, the parents, siblings and children of the target in the CPDAG.
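The Verma-Pearl characterization recalled above, that two DAGs are Markov equivalent if and only if they share the same skeleton and the same v-structures, is straightforward to check directly. Below is a minimal sketch (our illustration, not the authors' code); a DAG is represented simply by its set of directed edges, and the function names are our own.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG's edge set."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All v-structures (i, j, k): i -> j <- k with i and k non-adjacent."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(edges)
    return {(i, j, k)
            for j, ps in parents.items()
            for i, k in combinations(sorted(ps), 2)
            if frozenset((i, k)) not in skel}

def markov_equivalent(e1, e2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)
```

For instance, A → B → C and A ← B → C are Markov equivalent, while the collider A → B ← C has the same skeleton but contains the v-structure (A, B, C), so it falls in a different equivalence class.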
In this section, we provide a sufficient and necessary condition to identify definite causal relations, and show that the definite causal relations can be divided into two subtypes: explicit and implicit causal relations.
As mentioned in Section 1, given a CPDAG, a variable X is a definite non-cause of another variable Y if and only if there is no partially directed path from X to Y (Zhang, 2006; Perković et al., 2017). Roumpelaki et al. (2016, Theorem 3.1) proved that the treatment is a definite cause of the target if there is a directed path from the treatment to the target, or if the treatment has two chordless partially directed paths to the target on which the two vertices adjacent to the treatment are distinct and non-adjacent. In this section, we will show that this condition is also necessary; before that, the concept of a critical set is introduced as follows.

Definition 1 (Critical Set). (Fang and He, 2020, Definition 2) Let G∗ be a CPDAG and X and Y be two distinct vertices in G∗. The critical set of X with respect to Y in G∗ consists of all adjacent vertices of X lying on at least one chordless partially directed path from X to Y.

The definition of a chordless partially directed path can be found in Appendix A. With Definition 1, we have the following lemma.

Lemma 1.
Let G∗ be a CPDAG. For any two distinct vertices X and Y in G∗, X is a definite cause of Y in the underlying DAG if and only if the critical set of X with respect to Y in G∗ contains a child of X in every DAG G ∈ [G∗].

Lemma 1 follows from Lemma 2 in Fang and He (2020). It gives a sufficient and necessary condition to decide whether X is a definite cause of Y. However, checking the condition given in Lemma 1 also requires enumerating all equivalent DAGs. To mitigate this problem, we discuss a graphical characteristic of the critical set in the corresponding CPDAG.

Lemma 2.
Let G∗ be a CPDAG and X, Y be two distinct vertices in G∗. Denote by C the critical set of X with respect to Y in G∗. Then C ∩ ch(X, G) = ∅ for some G ∈ [G∗] if and only if C = ∅, or C induces a complete subgraph of G∗ but C ∩ ch(X, G∗) = ∅.

Based on Lemmas 1 and 2, we have the desired sufficient and necessary graphical criterion.
Theorem 1.
Suppose that G∗ is a CPDAG, X, Y are two distinct vertices in G∗, and C is the critical set of X with respect to Y in G∗. Then, X is a definite cause of Y if and only if C ∩ ch(X, G∗) ≠ ∅, or C is non-empty and induces an incomplete subgraph of G∗.

The sufficiency of the condition in Theorem 1 has been extended to other types of causal graphs by Roumpelaki et al. (2016) and Mooij and Claassen (2020). With the help of Theorem 1, we can identify the type of causal relation based on a learned CPDAG by enumerating paths and finding critical sets. Below, we give an example to illustrate this idea.
Example 1.
Consider the respiratory disease network shown in Figure 1. Let smoking be the treatment and dyspnoea be the target. From Figure 1(b) we can see that the partially directed paths from smoking to dyspnoea are Smok − Lung → Either → Dysp and Smok − Bronc → Dysp. Therefore, the critical set of smoking with respect to dyspnoea is {Lung, Bronc}. As Lung and Bronc are not adjacent, by Theorem 1 smoking is a definite cause of dyspnoea. Similarly, the critical set of lung cancer with respect to dyspnoea is {Smok, Either}. Since Either is a child of Lung, lung cancer is also a definite cause of dyspnoea.

We note that, although Roumpelaki et al. (2016, Theorem 3.1) also claimed that they had proved the necessity, their proof is flawed. As mentioned by Mooij and Claassen (2020), the last part of the proof appears to be incomplete. How to prove the necessity for more general types of causal graphs remains an open problem (Zhang, 2006).

Figure 1: This example is adapted from the ASIA network. The original network structure and related parameters can be found in Lauritzen and Spiegelhalter (1988). Figure 1(a) shows the true underlying causal DAG, and Figure 1(b) shows the corresponding CPDAG. Figure 1(c) enumerates all equivalent DAGs in the Markov equivalence class. (a) G_t; (b) G∗; (c) Markov equivalence class.
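The check in Example 1 can also be carried out programmatically. Below is a minimal sketch (our illustration, not the authors' code) that represents a CPDAG by its directed and undirected edge sets, enumerates chordless partially directed paths by depth-first search, and applies the criterion of Theorem 1; all helper names are our own.

```python
from itertools import combinations

def adjacent(a, b, directed, undirected):
    return (a, b) in directed or (b, a) in directed or frozenset((a, b)) in undirected

def critical_set(x, y, directed, undirected):
    """Neighbors of x on at least one chordless partially directed path x ... y."""
    crit = set()

    def extend(path):
        last = path[-1]
        if last == y:
            crit.add(path[1])
            return
        succs = {b for (a, b) in directed if a == last} | \
                {v for e in undirected if last in e for v in e if v != last}
        for nxt in succs:
            if nxt in path:
                continue
            # chordless: nxt may be adjacent only to the previous vertex on the path
            if any(adjacent(nxt, p, directed, undirected) for p in path[:-1]):
                continue
            extend(path + [nxt])

    extend([x])
    return crit

def is_definite_cause(x, y, directed, undirected):
    """Theorem 1: C ∩ ch(x, G*) is non-empty, or C is non-empty and incomplete."""
    c = critical_set(x, y, directed, undirected)
    if any((x, v) in directed for v in c):
        return True
    return len(c) > 1 and any(
        not adjacent(a, b, directed, undirected) for a, b in combinations(c, 2))

# CPDAG edges of Figure 1(b) named in Example 1:
# Smok - Lung -> Either -> Dysp and Smok - Bronc -> Dysp
directed = {("Lung", "Either"), ("Either", "Dysp"), ("Bronc", "Dysp")}
undirected = {frozenset(("Smok", "Lung")), frozenset(("Smok", "Bronc"))}
```

On this fragment, `critical_set("Smok", "Dysp", ...)` returns {Lung, Bronc} and `critical_set("Lung", "Dysp", ...)` returns {Smok, Either}, matching Example 1, and both queries are classified as definite causes.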
We now study the properties of definite causal relations, and show that definite causal relations can be divided into two subtypes based on the existence of causal paths in a CPDAG. The results in this section are of key importance for building the local characterizations in Section 4, and are also useful for developing an efficient global learning algorithm.
Proposition 1.
For two distinct vertices X and Y, if X is a definite cause of Y, then X and Y are not in the same chain component.

Given a target variable Y, Proposition 1 shows that Y and its definite causes do not appear in the same chain component. Thus, if a treatment X is a definite cause of a target Y, then in G∗ there must be a partially directed path from X to Y which contains a directed edge. On the other hand, for two distinct vertices lying in the same chain component, we have:

Proposition 2.
Two distinct vertices X and Y are possible causes of each other if and only if they are in the same chain component.

Recall that in Figure 1(b), both Smok and Lung are definite causes of Dysp. However, in the CPDAG there exists a directed path from Lung to Dysp, while no directed path exists from Smok to Dysp. That is, the cause Lung of Dysp is explicit and the cause Smok of Dysp is implicit in the CPDAG. This difference motivates us to introduce the following two concepts.

Figure 2: An example for identifying the types of causal relations.
Definition 2 (Explicit Cause). A variable X is an explicit cause of Y if X is a definite cause of Y and there is a common causal path from X to Y in every DAG in the Markov equivalence class represented by a CPDAG G∗.

Since there is a common causal path from an explicit cause X to the target Y in every DAG in the Markov equivalence class represented by G∗, there is a directed path from X to Y in G∗. As a convention, we regard X as an explicit cause of itself.

Definition 3 (Implicit Cause). A variable X is an implicit cause of Y if X is a definite cause of Y and there is no common causal path from X to Y in every DAG in the Markov equivalence class represented by a CPDAG G∗.

We notice that X is a definite cause of Y if and only if it satisfies one of the two conditions given in Theorem 1. The first condition, C ∩ ch(X, G∗) ≠ ∅, is the sufficient and necessary condition for identifying explicit causes, while the second condition corresponds to implicit causes. In Section 4, we will exploit this difference between explicit and implicit causes to develop local characterizations for both of them. Below, we give an example to illustrate them.

Example 2.
Consider the causes of the target variable Y based on the CPDAG G∗ in Figure 2. It is clear that all the variables other than Y are definite or possible causes of Y. Obviously, {E, D, F} are explicit causes of Y. For B, since B − E → Y, B − D → Y and B − G − F → Y are chordless partially directed paths, the critical set of B with respect to Y is {E, D, G}. As the induced subgraph of G∗ over {E, D, G} is not complete, B is a definite cause of Y, and B is also implicit. Similarly, G is another implicit cause of Y. For X and A, the critical sets of X and A with respect to Y are {B, D, G} and {X, G}, respectively. Since the corresponding induced subgraphs are complete, by Theorem 1, X and A are not implicit causes of Y. Thus, they are possible causes of Y.

Despite the difference, explicit and implicit causes also have some interesting connections. The following Proposition 3 proves that the existence of an implicit cause implies the existence of at least two explicit causes.
Proposition 3.
Let G∗ be a CPDAG, and let X and Y be two distinct vertices that belong to different chain components. If X is the only explicit cause of Y in the chain component to which X belongs, then every vertex in this chain component, except X, is a possible cause of Y.

In this section, we introduce theoretical results on locally characterizing different types of causal relations. Our local characterizations depend on the induced subgraph of the true CPDAG over the treatment's neighbors as well as some queries about d-separation relations. The first result is about definite non-causal relations, as given in Theorem 2.
Theorem 2.
Let G∗ be a CPDAG. For any two distinct vertices X and Y in G∗, X is a definite non-cause of Y if and only if X ⊥⊥ Y | pa(X, G∗) holds.

Theorem 2 introduces a local characterization for definite non-causal relations, which is based on the local structure around the treatment X and a single d-separation claim. The d-separation claim X ⊥⊥ Y | pa(X, G∗) is similar to the following well-known result, called the local Markov property of a causal DAG model: any variable is d-separated from its non-descendants given its parents. The difference is that in our local characterization, only the parents of X in the CPDAG are included in the separation set, and we rule out the siblings of X even though they may be parents of X in the true causal DAG. Since in a causal DAG the non-descendants of a variable are those which are definitely not caused by the variable, Theorem 2 can be regarded as an extension of the local Markov property to CPDAGs.

Following Theorem 2, we can distinguish the definite causes and the possible causes from the definite non-causes with a local causal structure query and a d-separation query. Next, we characterize explicit and implicit causal relations locally in Theorem 3 and Theorem 4, respectively, which together characterize definite causal relations.

Theorem 3. Let G∗ be a CPDAG. For any two distinct vertices X and Y in G∗, X is an explicit cause of Y if and only if X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) holds.

The local characterization in Theorem 3 includes a single d-separation claim, X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗), which means that the set pa(X, G∗) ∪ sib(X, G∗) cannot block all paths from X to Y. In the proof of this theorem, we show that this claim is equivalent to the existence of at least one path from X to Y in G∗ on which the node adjacent to X is a child of X. Based on Maathuis and Colombo (2015, Lemma 7.2) and Perković et al. (2017, Lemma B.1), this implies that there is a directed path from X to Y in G∗.
Suppose that G∗ is a CPDAG and M is the set of maximal cliques of the induced subgraph of G∗ over sib(X, G∗). Then, X is an implicit cause of Y if and only if X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) and X ̸⊥⊥ Y | pa(X, G∗) ∪ M for every M ∈ M.

The definition of a maximal clique is given in Appendix A. In Theorem 4, the first condition, X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗), ensures that X is not an explicit cause of Y, and the second condition, X ̸⊥⊥ Y | pa(X, G∗) ∪ M for every M ∈ M, guarantees that X is not a possible cause of Y. These two conditions are local in the sense that both sib(X, G∗) and pa(X, G∗) are subsets of X's neighbors in G∗, and a maximal clique M is also a subset of sib(X, G∗). Once we obtain the induced subgraph of G∗ over adj(X, G∗), we know sib(X, G∗) and M, and the conditional independence queries can then be answered accordingly if we have the oracles. As mentioned in Section 3.2, definite causes consist of both explicit causes and implicit causes. Therefore, Theorems 3 and 4 give a sound and complete local characterization of definite causal relations, as follows.
Suppose that G∗ is a CPDAG and M is the set of maximal cliques of the induced subgraph of G∗ over sib(X, G∗). Then, X is a definite cause of Y if and only if X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗), or X ̸⊥⊥ Y | pa(X, G∗) ∪ M for every M ∈ M.

Together with Theorem 2, Corollary 1 can be used to identify the definite causal relations and the definite non-causal relations. This result is local in the sense that it only depends on the local structure around the treatment X and a limited number of d-separation queries. When data is available in practice, d-separation queries can be answered by performing statistical independence tests. Thus, local characterizations are particularly meaningful for identifying the types of causal relations from observational data.

A Local Learning Algorithm
In this section, we discuss how to learn the type of causal relation from observational data. A local algorithm, which exploits the local characterizations in Section 4 directly, is provided in this section.

The main procedure of our local algorithm is summarized in Algorithm 1. The input of Algorithm 1 consists of pa(X, G∗), the induced subgraph of G∗ over sib(X, G∗), and some independence oracles. The first two arguments, pa(X, G∗) and the induced subgraph over sib(X, G∗), can be learned locally by using the variant of the MB-by-MB algorithm proposed by Liu et al. (2020b, Algorithm 3), which is designed for learning the chain component containing a given target variable and the directed edges connected to the variables in the chain component. The third argument (the independence oracles), as discussed in Section 4, can be replaced by statistical independence tests in practice. Overall, the procedure given in Algorithm 1 is a direct application of the local characterizations in Theorems 2, 3 and 4, and thus we have:

Theorem 5.
Assuming that the true causal DAG is in the Markov equivalence class represented by G∗ and that the independence oracles are faithful to the true DAG, the local ITC (Algorithm 1) can correctly identify the type of causal relation between X and Y.

Algorithm 1: A local algorithm for identifying the type of causal relation (local ITC)

Require: A treatment X, a target Y, pa(X, G∗), the induced subgraph of G∗ over sib(X, G∗), and independence oracles.
Ensure: The type of causal relation between X and Y.
  if X ⊥⊥ Y | pa(X, G∗) then return "X is a definite non-cause of Y" end if
  if X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) then return "X is an explicit cause of Y" end if
  M ← the set of maximal cliques of the induced subgraph of G∗ over sib(X, G∗)
  if there exists M ∈ M such that X ⊥⊥ Y | pa(X, G∗) ∪ M then return "X is a possible cause of Y" end if
  return "X is an implicit cause of Y"

The complexity of Algorithm 1 can be measured by the maximum number of conditional independence tests (or d-separation queries). Clearly, the maximum number of conditional independence tests performed by Algorithm 1 is m + 2, where m is the number of maximal cliques of sib(X, G∗). Fortunately, there are only linearly many maximal cliques (with respect to the number of vertices) in a chordal graph (Rose and Tarjan, 1975; Blair and Peyton, 1993), so the number of conditional independence tests needed in Algorithm 1 is at most O(|sib(X, G∗)|).

We also present a global learning algorithm as well as its complexity analysis in Appendix B.1. The global learning algorithm uses the graphical criteria as well as the properties of different causal relations to identify them. We refer interested readers to Algorithm 2 and Algorithm 3 in Appendix B.1 for more details.

On the other hand, if we use the causal-effect-based method mentioned in Section 1 (or, more specifically, the possible implementation described in Appendix B.2 and tested in Section 6), we have to enumerate all possible causal effects of X on Y and test whether they are all zero (non-zero). In the worst case, the number of conditional independence tests required in this method is 2^|sib(X, G∗)|. Hence, Algorithm 1 is theoretically more efficient than the current causal-effect-based method, as the latter needs to perform exponentially many conditional independence tests in the worst case while the former only needs linearly many tests.
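The steps of Algorithm 1 can be sketched as follows (our illustration, not the authors' released code). The local structure pa(X, G∗) and the induced subgraph over sib(X, G∗) are taken as given, maximal cliques are enumerated with a small Bron-Kerbosch routine, and the independence oracle is an arbitrary callable, to be backed by statistical tests in practice; all names are our own.

```python
def maximal_cliques(vertices, adj):
    """Bron-Kerbosch enumeration of maximal cliques of an undirected graph.
    adj maps each vertex to its set of neighbors."""
    cliques = []

    def bk(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    bk(set(), set(vertices), set())
    return cliques

def local_itc(x, y, pa, sib, sib_adj, indep):
    """Algorithm 1 (local ITC). pa = pa(x, G*); sib = sib(x, G*); sib_adj is the
    induced subgraph of G* over sib; indep(a, b, z) answers a CI query."""
    pa, sib = set(pa), set(sib)
    if indep(x, y, pa):
        return "definite non-cause"
    if not indep(x, y, pa | sib):
        return "explicit cause"
    # sib induces a chordal graph, so it has only linearly many maximal cliques
    for m in maximal_cliques(sib, sib_adj):
        if indep(x, y, pa | m):
            return "possible cause"
    return "implicit cause"
```

For example, for the chain CPDAG A − B − C (true DAG A → B → C, with the oracle answering A ⊥⊥ C | B but not A ⊥⊥ C), the query (A, C) is classified as a possible cause, as expected.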
Experiments

In this section, we illustrate and evaluate the proposed method experimentally using synthetic data sets generated from linear structural equation models with Erdös-Rényi random DAGs. We compare our local ITC with the global one as well as with two causal-effect-based methods (CE-based for short): a regular testing method (CE-test for short) and a multiple testing method (CE-multi for short). In both CE-based methods, we used IDA (Maathuis et al., 2009) to estimate all possible causal effects of a treatment on a target and checked whether all of these causal effects are zero. The CE-test method checks each effect one by one, and CE-multi tests all effects simultaneously with multiple testing methods.

In Section 6.1, we assume that the true CPDAG or its local structure of interest is available. In this case, the synthetic data was only used to estimate causal effects by the CE-based methods, and to perform conditional independence tests by the local ITC method. Therefore, in these experiments we can compare our proposed methods with the CE-based methods directly, since structure learning from data is not needed and no corresponding error is introduced. In Section 6.2, we further evaluate the methods without providing the true CPDAG or its local structure. Three global structure learning algorithms, including the PC algorithm (Spirtes and Glymour, 1991), the stable PC algorithm (Colombo and Maathuis, 2014) and the GES algorithm (Chickering, 2002a), were used to learn CPDAGs, and the variant of MB-by-MB (Liu et al., 2020b) was used to learn the parents and siblings of the vertices of interest. In all of these experiments, the algorithms PC, stable PC, GES, and IDA were called from the pcalg R package (Kalisch et al., 2012), the significance level α of all statistical independence tests was set to 0. and all code was run on a computer with an Intel 2.5GHz CPU and 8 GB of memory.

Let ER(n, d) denote a random DAG with n vertices and average in-and-out degree d.
In our experiment, n is chosen from { , , , , , } and d is chosen from { . , . , . , . , . , . }. To generate a data set, we first sampled an ER(n, d) random DAG, and for each directed edge X_i → X_j in this DAG we drew an edge weight β_ij from a uniform distribution (two weight settings, positive weights and mixed positive/negative weights, are used in Section 6.1). Given the weights, each variable was generated according to the linear model

    X_j = Σ_{X_i ∈ pa(X_j)} β_ij X_i + ε_j,    j = 1, ..., n,    (1)

where ε_1, ..., ε_n are independent N(0, 1) noises. Finally, we drew N samples from this linear model. In Section 6.1 we set N ∈ {50, 250}, and in Section 6.2 we set N ∈ { , , }.

We generated 5000 data sets for each setting, and for each data set we randomly sampled a treatment variable and a target variable. We then explored the causal relation between the treatment and the target and compared it with the true one read from the corresponding CPDAG of the sampled DAG. Since there are three types of causal relations, we use the Kappa coefficient as well as the true positive rate (TPR) and the false positive rate (FPR) to measure the performance of each method.

In this section, the true CPDAGs or their local structures are provided to exclude estimation biases caused by graph structure learning from data. (Experiments show that different significance levels give similar results.)

N    d   Method     Def. non-causes    Poss. causes     Def. causes
                    TPR      FPR       TPR      FPR     TPR      FPR
50   2   local ITC
         CE-test    0.9893   0.4557    0.4211   0.0024  0.3016   0.0136
         CE-multi   0.9899   0.4937    0.3684   0.0016  0.3016   0.0136
     3   local ITC
         CE-test    0.9864   0.6302    0.3088   0.0056  0.1889   0.0168
         CE-multi   0.9870   0.6484    0.2892   0.0048  0.1778   0.0168
     4   local ITC
         CE-test    0.9814   0.6789    0.2618   0.0103  0.1282   0.0230
         CE-multi   0.9818   0.7083    0.2403   0.0082  0.1218   0.0228
250  2   local ITC
         CE-test    0.9913   0.0904    0.6293   0.0018  0.8400   0.0139
         CE-multi   0.9921   0.1024    0.6207   0.0016  0.8000   0.0135
     3   local ITC
         CE-test    0.9828   0.2404    0.6667   0.0052  0.6014   0.0198
         CE-multi   0.9839   0.2463    0.6614   0.0042  0.5946   0.0198
     4   local ITC
         CE-test    0.9767   0.3281    0.6606   0.0128  0.4441   0.0267
         CE-multi   0.9770   0.3351    0.6561   0.0140  0.4155   0.0267
Table 1: The detailed TPRs and FPRs of different methods on 100-node random graphs with d ∈ {2, 3, 4} and positive weights. The true graph structures are provided. The standard deviations are omitted as they are all below 0.

N    Method     Average degree
                2                 3                 4
50   CE-test    2.9619 (0.9041)  2.8218 (0.8119)  2.7001 (0.7926)
     CE-multi   3.0261 (0.9172)  2.8742 (0.8226)  2.7451 (0.8002)
250  CE-test    1.8515 (0.4090)  1.7630 (0.3461)  1.7049 (0.2990)
     CE-multi   1.8726 (0.4159)  1.7808 (0.3507)  1.7211 (0.3005)
Table 2: The averages and standard deviations (in parentheses) of the ratios of the time used by the CE-based methods to that used by the local ITC on 100-node random graphs with d ∈ {2, 3, 4} and positive weights. The true graph structures are provided.

Since the output of the global ITC is invariant when the true CPDAG is provided, the TPR and FPR for learning each type of causal relation are 1 and 0, respectively. Except for the global ITC, the local ITC and CE-based methods

N    d   Method     Def. non-causes    Poss. causes     Def. causes
                    TPR      FPR       TPR      FPR     TPR      FPR
50   2   local ITC
         CE-test    0.9957   0.7532    0.1979   0.0008  0.0862   0.0063
         CE-multi   0.9959   0.7792    0.1771   0.0002  0.0862   0.0063
     3   local ITC
         CE-test    0.9931   0.8123    0.1579   0.0015  0.0838   0.0099
         CE-multi   0.9938   0.8291    0.1368   0.0010  0.0838   0.0093
     4   local ITC
         CE-test    0.9917   0.8770    0.0862   0.0036  0.0841   0.0088
         CE-multi   0.9917   0.8843    0.0776   0.0036  0.0779   0.0088
250  2   local ITC
         CE-test    0.9954   0.5191    0.3482   0.0014  0.4366   0.0067
         CE-multi   0.9956   0.5246    0.3482   0.0010  0.4366   0.0067
     3   local ITC
         CE-test    0.9936   0.6210    0.3140   0.0017  0.3028   0.0091
         CE-multi   0.9936   0.6242    0.3081   0.0017  0.3028   0.0091
     4   local ITC
         CE-test    0.9876   0.6528    0.3106   0.0086  0.2296   0.0128
         CE-multi   0.9881   0.6618    0.3021   0.0082  0.2201   0.0128
Table 3: The detailed TPRs and FPRs of different methods on 100-noderandom graphs with d ∈ { , , } and mixed weights . The true graphstructures are provided. The standard deviations are omitted as they are allbelow 0 . N Method Average degree2 3 450 CE-test 2.9696 (0.9242) 2.8080 (0.8331) 2.7102 (0.7684)CE-multi 3.0357 (0.9368) 2.8487 (0.8365) 2.7525 (0.7940)250 CE-test 1.8377 (0.3810) 1.7797 (0.3570) 1.7242 (0.2959)CE-multi 1.8623 (0.3935) 1.8004 (0.3608) 1.7380 (0.3040)
Table 4: The averages and standard deviations (in parentheses) of the ratios of the time used by the CE-based methods to the local ITC on 100-node random graphs with d ∈ {2, 3, 4} and mixed weights. The true graph structures are provided.

To assess these methods, we ran experiments on data generated with positive weights (drawn uniformly from a positive interval) and with mixed weights (drawn uniformly from a range that also contains negative values). The sample size N and the degree d have similar effects on all methods: the larger the sample size and the smaller the average degree, the better the performance.

Benefiting from fewer conditional independence tests, the local ITC algorithm is more stable, more accurate, and more efficient than the CE-based methods, especially when identifying definite causal relations. Comparing Table 3 with Table 1, one can see that although all TPRs drop, the TPRs of the local ITC drop less than those of the CE-based methods. For instance, in Table 1, for the case of N = 250, the TPRs of the local ITC and CE-test for identifying definite causal relations are 0.8800 and 0.8400, respectively, while when negative weights are allowed in Table 3, the TPRs of the local ITC and CE-test decrease to 0.7183 and 0.4366, corresponding to 18% and 48% reductions, respectively.

In this section, we further study our proposed methods experimentally when the true causal structures are not available. We used the variant of MB-by-MB (Liu et al., 2020b) to learn the parents and siblings of the vertices of interest, and used the PC algorithm, the stable PC algorithm (PCS) and GES to learn entire CPDAGs.

Figure 3 shows the Kappa coefficients of different methods based on 20-, 50- and 100-node graphs. As one can see from Figure 3, the proposed local ITC outperforms the other methods in almost all settings. The global ITC combined with PC or PCS is also competitive when the graph is extremely sparse (d < 2) or relatively dense.

Figure 3: The Kappa coefficients of different methods (local and global ITC, CE-test and CE-multi, each combined with local structure learning, PC, PCS, or GES) on random graphs with positive weights. Panels (a)-(i) plot the Kappa coefficient against the average degree for n ∈ {20, 50, 100} and N ∈ {500, 1000, 3000}. The graph structures are learned from data.

In Table 5, we report the TPRs and FPRs of different methods for identifying each type of causal relation based on 100-node graphs with average degree d ∈ {2, 3, 4}. The sample size N is set to 1000. Table 5 shows that the local ITC is more stable than the others, especially on relatively sparse graphs. It can be seen that when d = 2 and d = 3, the FPR of the local ITC is always one of the best three FPRs, and the TPR of the local ITC is one of the highest three TPRs, except for learning non-causal relations when d = 3.
Moreover, the local ITC performs considerably well when learning possible causal relations and definite causal relations, and always achieves a relatively low FPR when learning definite non-causal relations.

d   Method            Def. non-causes    Poss. causes      Def. causes
                      TPR     FPR        TPR     FPR       TPR     FPR
2   local ITC
    CE-test (local)   0.9870
    CE-test (PC)      0.9789
    CE-test (PCS)     0.9781
3   CE-test (local)   0.9538
    CE-test (PC)      0.9221  0.0907
    CE-test (PCS)     0.9133
4   CE-test (PC)      0.8019  0.2081     0.6584  0.0376    0.6833  0.1636
    global ITC (PCS)
    CE-test (PCS)     0.7869

Table 5: The detailed TPRs and FPRs of different methods on 100-node random graphs with sample size N = 1000, average degree d ∈ {2, 3, 4} and positive weights. The standard deviations are omitted as they are all small.

We now compare the total computational time of the different methods in Figure 4. The total computational time consists of two parts: the time for learning the required graph structure and the time for identifying the type of causal relation. Generally, the total time is dominated by the learning of the graph structure. As shown in Figure 4, since learning a local structure takes less time than learning a global structure, the local ITC and the local versions of the CE-based methods are more efficient than the global ones. Moreover, the global ITC is much faster than the global CE-based methods, as the latter need to perform additional independence tests. In our experiments, excluding the time used to learn the graph structure, the global ITC takes less than 0.001 seconds to identify types, while the global CE-based methods are 10 times slower. This also explains why the blue, green, or purple dashed/dotted lines in Figure 4 are above the corresponding solid lines.

Figure 4: The total CPU time (in seconds) of different methods on random graphs with positive weights. Panels (a)-(i) plot the CPU time against the average degree for n ∈ {20, 50, 100} and N ∈ {500, 1000, 3000}. The graph structures are learned from data.

Since the local ITC and the local CE-based methods use the same local learning algorithm (the variant of MB-by-MB) and local structure learning dominates their computational time, the total times of these three local methods are very close in Figure 4. To compare the computational time of the local ITC and the local versions of the CE-based methods in detail, Table 6 reports the averages and standard deviations of the ratios of the CPU time of the local CE-based methods to that of the local ITC.

N     Method            Average degree
                        2                3                4
500   CE-test (local)   1.3735 (0.1798)  1.3370 (0.1606)  1.3076 (0.1398)
      CE-multi (local)  1.3830 (0.1811)  1.3434 (0.1635)  1.3148 (0.1407)
1000  CE-test (local)   1.2998 (0.1446)  1.2703 (0.1214)  1.2520 (0.1303)
      CE-multi (local)  1.3076 (0.1435)  1.2787 (0.1236)  1.2617 (0.1308)
3000  CE-test (local)   1.1854 (0.0885)  1.1818 (0.1112)  1.1762 (0.1190)
      CE-multi (local)  1.1849 (0.0869)  1.1829 (0.1078)  1.1753 (0.1135)

Table 6: The averages and standard deviations (in parentheses) of the ratios of the time used by the local CE-based methods to the local ITC on 100-node random graphs with positive weights. The graph structures are learned from data.

It can be seen that the local ITC is faster than the local CE-based methods, since all ratios are greater than 1.
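For concreteness, the evaluation measures used throughout this section (the Kappa coefficient and the one-vs-rest TPR/FPR over the three relation types) can be sketched in a few lines of Python. The function names and string labels below are our own illustrative choices, not the paper's implementation, and the sketch assumes both label lists are non-empty and not in perfect chance-level agreement.

```python
from collections import Counter

def kappa(true_labels, pred_labels):
    """Cohen's Kappa for multi-class agreement between two label lists."""
    n = len(true_labels)
    # observed agreement: fraction of exact matches
    observed = sum(t == p for t, p in zip(true_labels, pred_labels)) / n
    # expected agreement under independent labelling with the same marginals
    true_freq, pred_freq = Counter(true_labels), Counter(pred_labels)
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / n ** 2
    return (observed - expected) / (1 - expected)

def tpr_fpr(true_labels, pred_labels, cls):
    """One-vs-rest TPR and FPR for a single causal-relation type `cls`."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    return tp / (tp + fn), fp / (fp + tn)
```

With three relation types, the Kappa coefficient summarizes overall agreement in a single number, while the per-type TPR/FPR pairs expose where a method's errors concentrate, as in Tables 1, 3 and 5.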
In this paper, we present a local method for identifying the type of causal relation between a treatment and a target without evaluating causal effects or learning a global causal structure. We investigate the existence of a causal path from the treatment to the target based on a CPDAG and provide a sufficient and necessary graphical condition for checking the existence of such a path. We also study the graphical properties of each type of causal relation. Inspired by these properties, we further propose a local identification criterion for each type of causal relation, which depends only on the induced subgraph of the true CPDAG over the adjacent variables of the treatment, together with some queries about d-separation relations. The local criteria naturally lead to a local learning algorithm for identifying the type of causal relation, provided that the faithfulness condition holds. Simulation studies show that the proposed local algorithm performs well.

Our work introduces local characterizations of the types of causal relations, which are helpful for understanding the causal relations hidden behind observational data. Beyond these theoretical contributions, our results have many potential applications. Firstly, they can guide researchers in performing interventional studies. For example, no intervention is needed if the treatment is a definite non-cause of the target. Secondly, they can be combined with the IDA algorithm to reduce the computational cost of estimating possible causal effects. For instance, if the treatment is a non-cause of the target, then without any computation we can conclude that all possible effects are zero (Maathuis et al., 2009). Besides, our results can be used to decide the significance of estimated causal effects. If the treatment is a definite cause of the target, then all the estimated causal effects, no matter how small they are, are significant.
To some extent, our results can compensate for the limitations of statistical tests when testing the significance of a possible causal effect. Thirdly, our theorems and algorithms can be used to check whether a variable is a mediator between a cause and a target. All these applications are important in causal inference.

Our results can be easily extended to interventional essential graphs (He and Geng, 2008; Hauser and Bühlmann, 2012), which can be used to represent Markov equivalence classes in which some variables are intervened on. Interventional essential graphs are also chain graphs and can be learned from a mixture of observational and interventional data, and extending our proposed concepts, theorems, and algorithms to them is straightforward. A possible direction for future work is to extend the global characterization of definite causal relations to maximal PDAGs. Maximal PDAGs are generalizations of CPDAGs and have been frequently used for representing causal background knowledge (Perković et al., 2017; Fang and He, 2020; Perković, 2020; Witte et al., 2020; Guo and Perković, 2020). Another interesting direction is to take hidden variables and selection bias into account; for example, one may extend the results to partial ancestral graphs (Richardson and Spirtes, 2002; Ali et al., 2005; Zhang, 2008).
Acknowledgements
This work was supported by the National Key R&D Program of China (2018YFB1004300) and NSFC (11671020, 11771028, 11971040).

Appendix

A Graph Terminology
A graph G is defined by a vertex set (or node set) V and an edge set E. A graph is directed (undirected, partially directed) if all edges in the graph are directed (undirected, a mixture of directed and undirected). The skeleton of a graph G is the undirected graph obtained by turning every directed edge in G into an undirected edge. Given a subset V' of V, the induced subgraph of G over V' is defined as G' = (V', E'), where E' ⊆ E contains only the edges between vertices in V'. If a directed edge X_i → X_j occurs in G, we call X_i a parent of X_j and X_j a child of X_i. Two distinct vertices X_i and X_j are siblings of each other if the undirected edge X_i − X_j appears in G. If for any nonempty proper subset V' of V, there exist X_1 ∈ V' and X_2 ∈ V \ V' such that X_1 and X_2 are adjacent, then the graph is called connected; otherwise, it is disconnected. Furthermore, if there is an edge between any two vertices, then the graph is called complete.

A path is a sequence of distinct vertices (X_{k_1}, ..., X_{k_j}) such that X_{k_i} is adjacent to X_{k_{i+1}}. X_{k_1} and X_{k_j} are the endpoints of the path, while the other vertices on the path are intermediate vertices (nodes). The length of a path is the number of vertices on the path minus one. A path is called partially directed from X_{k_1} to X_{k_j} if X_{k_i} ← X_{k_{i+1}} does not occur in G for any i = 1, ..., j − 1. A partially directed path is directed (undirected) if all edges on the path are directed (undirected). A cycle is a path from a vertex to itself; partially directed (directed, undirected) cycles are defined similarly. We note that both directed paths (cycles) and undirected paths (cycles) are partially directed. A vertex X_i is an ancestor of X_j, and X_j a descendant of X_i, if there is a directed path from X_i to X_j or X_i = X_j. A chord of a path (cycle) is any edge joining two nonconsecutive vertices on the path (cycle). A path (cycle) without any chord is called chordless; any path of length one is chordless. An undirected graph is chordal if it has no chordless cycle of length greater than three. Given a chordal graph C = (V, E), if the induced subgraph of C over V' ⊆ V is complete, then V' is called a clique of C. Moreover, if there is no V'' such that V' ⊂ V'' and V'' is a clique, then V' is called a maximal clique. A directed graph is acyclic (a DAG) if it contains no directed cycles.

B Algorithms
We now provide global ITC and causal-effect-based methods for learning thetypes of causal relations.
B.1 A Global Learning Algorithm
Given a target variable and a CPDAG, as shown in Example 2, identifying adefinite non-causal relation or an explicit causal relation is straightforward.To discriminate an implicit causal relation from a possible causal relation, weneed an approach to find critical sets. The next proposition is particularlyuseful.
Proposition 4. For any two distinct vertices X, Y in a CPDAG G* such that X is not an explicit cause of Y, it holds that C_{XY} = ∪_{Z ∈ Z} C_{XZ}, where C_{UV} denotes the critical set of U with respect to V and Z is the set of explicit causes of Y that are also in the chain component containing X.

Proposition 4 provides a factorization of the critical set of X with respect to Y. For simplicity, we call ∪_{Z ∈ Z} C_{XZ} the critical set of X with respect to Z. Algorithm 2 shows how to find ∪_{Z ∈ Z} C_{XZ} efficiently: it runs a breadth-first search and returns the critical set of X with respect to Z in G*_u. In Algorithm 2, we start from the siblings of X and then search chordless paths from the siblings until reaching some Z_i ∈ Z. Every chordless path starting from a sibling of X is recorded in a queue S as a triple (α, ψ, τ), where α and τ are the start and end points of the path, respectively, and ψ is the sibling of τ on the path. If τ is a member of Z, we add α to the critical set C and remove from S all triples whose first element is α; that is, we stop enumerating chordless paths starting with α. Otherwise, we extend the chordless path to the siblings of τ that are neither ψ nor siblings of ψ, and add the corresponding triples to the queue S. In this algorithm, a set of visited triples, H, is introduced to speed up the search by avoiding visiting the same triple twice.

Finally, we present a global learning approach for identifying the type of causal relation in Algorithm 3. Algorithm 3 is global in the sense that it takes an entire CPDAG as input. In Algorithm 3, we first check whether X and Y are in the same chain component. If they are, X is a possible cause of Y by Proposition 2. Otherwise, we find the set of explicit causes of Y and denote it by Z. This can be done by searching for the vertices that are connected to Y in the directed subgraph of G*.
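The search just described, together with the classification logic of Algorithm 3, can be sketched in Python. The representation here (a neighbour map `undir` for the undirected edges of G*_u and a `children` map for the directed edges of G*) and all function names are our own assumptions for illustration; this is a sketch of the procedure, not the authors' implementation.

```python
from collections import deque
from itertools import combinations

def critical_set(undir, x, zs):
    """Algorithm 2 sketch: BFS over triangle-free (hence, by Lemma 3 in the
    appendix, chordless) undirected paths from x until a member of zs is hit.
    Returns the siblings of x that start such a path."""
    crit, seen = set(), set()
    queue = deque((a, x, a) for a in undir[x])
    while queue:
        a, psi, tau = queue.popleft()
        seen.add((a, psi, tau))
        if tau in zs:
            crit.add(a)
            # stop enumerating paths that start with sibling a
            queue = deque(t for t in queue if t[0] != a)
        else:
            for b in undir[tau]:
                # extend only if the new triple keeps the path triangle-free
                if b != psi and b not in undir[psi]:
                    if (a, tau, b) not in seen and (a, tau, b) not in queue:
                        queue.append((a, tau, b))
    return crit

def _reach(adj, start):
    """Vertices reachable from `start` following edges in `adj`."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def global_itc(undir, children, x, y):
    """Algorithm 3 sketch: classify the causal relation of x to y."""
    if y in _reach(undir, x):            # same chain component
        return "possible cause"
    parents = {}
    for u, chs in children.items():
        for c in chs:
            parents.setdefault(c, set()).add(u)
    zs = _reach(parents, y) - {y}        # explicit causes of y
    if x in zs:
        return "explicit cause"
    c = critical_set(undir, x, zs)
    if not c:
        return "definite non-cause"
    if all(v in undir[u] for u, v in combinations(c, 2)):
        return "possible cause"          # C induces a complete subgraph
    return "implicit cause"
```

For example, with x − c1, x − c2 undirected (c1 and c2 not adjacent) and c1 → y, c2 → y directed, the critical set is {c1, c2}, which is not complete, so x is classified as an implicit cause of y.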
If X ∈ Z, X is an explicit cause of Y; otherwise, we find the critical set C of X with respect to Z using Algorithm 2. When C = ∅, there are no explicit causes of Y in the chain component containing X, so X is not a cause of Y. Finally, using Theorem 4, Algorithm 3 distinguishes between possible causes and implicit causes.

Algorithm 2 Finding the critical set of a given X with respect to a set Z
Require: A chordal graph G*_u, a variable X in G*_u, and a variable set Z ≠ ∅ such that X ∉ Z.
Ensure: C, the critical set of X with respect to Z in G*_u.
  initialize C = ∅, a waiting queue S = [], and a set H = ∅,
  for α ∈ adj(X) do
    add (α, X, α) to the end of S,
  end for
  while S is not empty do
    take the first element (α, ψ, τ) out of S and add it to H,
    if τ ∈ Z then
      add α to C, and remove from S all triples whose first element is α,
    else
      for β ∈ adj(τ) such that β ∉ adj(ψ) ∪ {ψ} do
        if (α, τ, β) ∉ H and (α, τ, β) ∉ S then
          add (α, τ, β) to the end of S,
        end if
      end for
    end if
  end while
  return C

Since Algorithm 2 does not visit the same triple (α, ψ, τ) twice, where α is a sibling of X and τ is a sibling of ψ in G*_u, the complexity of Algorithm 2 in the worst case is O(|sib(X, G*)| · |E(G*_u)|), where |E(G*_u)| is the number of edges in G*_u. Now consider the computational complexity of the global ITC (Algorithm 3). The complexity of checking the undirected connectivity of X and Y, or of finding an(Y, G*), is O(|E(G*)|), where |E(G*)| is the number of edges in G*. Consequently, the complexity of the global ITC is O(|E(G*)| + |sib(X, G*)| · |E(G*_u)|).

B.2 Causal-Effect-Based Methods
Section 6 briefly describes the causal-effect-based methods used in our experiments. Now we summarize the detailed procedure in Algorithm 4.

Algorithm 3 A global algorithm for identifying the type of causal relation (global ITC).
Require: A CPDAG G*, a variable X and a target Y in G*.
Ensure: The type of causal relation between X and Y.
  if X and Y are connected by a path in G*_u then
    return "X is a possible cause of Y",
  end if
  let Z = an(Y, G*),
  if X ∈ Z then
    return "X is an explicit cause of Y",
  end if
  use Algorithm 2 to find the critical set C of X with respect to Z in G*_u,
  if |C| = 0 then
    return "X is a definite non-cause of Y",
  end if
  if C induces a complete subgraph of G*_u then
    return "X is a possible cause of Y",
  end if
  return "X is an implicit cause of Y".

The first four lines in Algorithm 4 are borrowed from the IDA algorithm (Maathuis et al., 2009): they enumerate all possible parental sets of the treatment X and estimate all possible causal effects of X on Y. The possible effects are stored in a set, denoted by Θ_X. Next, Algorithm 4 uses a regular testing method or a multiple testing method to test whether every causal effect in Θ_X is zero. Based on the testing results, lines 6-12 return the type of causal relation between X and Y by the definitions of the different types of causes.

We note that the input of Algorithm 4 includes a CPDAG G* or its induced subgraph over pa(X, G*) ∪ sib(X, G*). In the original version of the IDA algorithm, the authors used an entire CPDAG as input (Maathuis et al., 2009), probably because there was no efficient approach to learn the induced subgraph over pa(X, G*) ∪ sib(X, G*) locally at that time. With the help of the variant of MB-by-MB (Liu et al., 2020b, Algorithm 3), we can now learn this induced subgraph efficiently. Therefore, Algorithm 4 could also be made local by combining it with the variant of MB-by-MB.

Algorithm 4 A causal-effect-based algorithm for identifying the type of causal relation.
Require: A treatment X, a target Y, and a CPDAG G* or its induced subgraph over pa(X, G*) ∪ sib(X, G*).
Ensure: The type of causal relation between X and Y.
  set Θ_X = ∅,
  for each S ⊆ sib(X, G*) such that orienting S → X and X → sib(X, G*) \ S does not introduce any v-structure collided on X do
    estimate the causal effect of X on Y by adjusting for S ∪ pa(X, G*), and add the causal effect to Θ_X,
  end for
  test whether every causal effect in Θ_X is zero,
  if every causal effect in Θ_X is zero then
    return "X is a definite non-cause of Y",
  end if
  if every causal effect in Θ_X is non-zero then
    return "X is a definite cause of Y",
  end if
  return "X is a possible cause of Y".

C Detailed Proofs
The proofs of the lemmas, theorems and corollaries in the main text of this paper are presented in this section. Before that, we first introduce some prerequisite concepts and results.

Let π = (v_0, v_1, ..., v_k) denote a path of length k. The subpath π(v_i, v_j) of π, with j > i, is the path (v_i, v_{i+1}, ..., v_{j−1}, v_j). If k ≥ 2, we say that three consecutive vertices v_i, v_{i+1} and v_{i+2} form a triangle on π if v_i is adjacent to v_{i+2}; π is called triangle-free if it does not contain any triangle. For a path in a chordal graph, we have the following result.

Lemma 3.
In any chordal graph, a path is chordless if and only if it is triangle-free.

Proof.
Let π = (v_0, v_1, ..., v_k) denote a path of length k ≥ 2. If π is chordless, then it is obviously triangle-free. Suppose π is not chordless; then we can choose a chord v_i − v_j such that the subpath π(v_i, v_j) has no chord except for v_i − v_j. If j = i + 2, then v_i, v_{i+1} and v_j form a triangle. If j > i + 2, then π(v_i, v_j) and v_i − v_j form a cycle of length greater than 3. However, since the graph is chordal, this cycle must have a chord v_s − v_t with i ≤ s < t ≤ j, t ≥ s + 2 and t − s < j − i. This contradicts our assumption.

Lemma 3 is useful for finding chordless paths, since checking whether a path is triangle-free is much easier. The following is another useful result for chordal graphs.

Lemma 4.
Let ρ be a cycle of length greater than 3 in a given chordal graph, and let X be a vertex on ρ. If the two vertices adjacent to X on ρ are not adjacent to each other, then ρ has a chord of which X is an endpoint.

Proof. Let v_1 and v_2 be the two vertices adjacent to X on ρ. Suppose that ρ does not have a chord of which X is an endpoint. Since ρ has length greater than 3, ρ must have a chord. Clearly, any chord of ρ separates ρ into two sub-cycles. By assumption, it is easy to check that at least one sub-cycle contains X, v_1 and v_2. If this sub-cycle still has a chord, then we can construct another cycle containing X, v_1 and v_2 but with shorter length. Eventually, we obtain a cycle containing X, v_1 and v_2 without any chord. Since v_1 and v_2 are not adjacent, the length of this cycle must be greater than 3, which contradicts the definition of a chordal graph.

A chordal graph C can be turned into a directed graph by orienting its edges. If the resulting directed graph is a DAG without v-structures, then these orientations form a v-structure-free acyclic orientation of C (Bernstein and Tetali, 2017). Any v-structure-free acyclic orientation of a connected chordal graph has a unique source, that is, a vertex which has no parent. Conversely, any vertex in a connected chordal graph can be the unique source in some v-structure-free acyclic orientation (Blair and Peyton, 1993; Bernstein and Tetali, 2017). Recall that the undirected subgraph of a CPDAG is the union of disjoint connected chordal graphs called chain components (Andersson et al., 1997). Maathuis et al. (2009) argued that any v-structure-free acyclic orientation of the edges in G*_u corresponds to a DAG in the equivalence class represented by G*, and such an orientation can be considered separately for each of the disjoint chordal graphs (or chain components). Moreover, Maathuis et al. (2009) proved the following result.

Lemma 5. (Maathuis et al., 2009, Lemma 3.1)
Let G* be a CPDAG, X be a vertex of G*, and S ⊆ ne(X, G*). Then there is a DAG G ∈ [G*] such that pa(X, G) = pa(X, G*) ∪ S if and only if orienting S' → X for every S' ∈ S and X → D for every D ∈ sib(X, G*) \ S in G* does not introduce any new v-structure.

If Y ∈ pa(X, G*), then Y ∈ pa(X', G*) for every X' ∈ ne(X, G*). From this result we can prove that the condition in Lemma 5 holds if and only if S is a clique. As we will see, Lemma 5 plays a key role in proving the main results of this paper, as it provides a simple and local criterion for checking whether a subset of X's siblings can be X's parents in some equivalent DAGs.

Let π denote a path. A subsequence of π is obtained by deleting some vertices from π without changing the order of the remaining vertices. The final prerequisite result concerns the relation between directed paths and partially directed paths.

Lemma 6.
There is a directed path from X to Y in G* if and only if there is a partially directed path from X to Y in G* on which the vertex adjacent to X is a child of X.

Proof. The necessity is trivial. For sufficiency, let π = (X, v, ..., Y) be a partially directed path from X to Y in G* such that X → v. Let w be the first vertex from the side of Y that is adjacent to X; then we have X → w. Now consider π(w, Y). As π(w, Y) is also partially directed, by Perković et al. (2017, Lemma 3.6), there is a subsequence π* of π(w, Y) that forms a chordless partially directed path from w to Y in G*. Let π** denote the path obtained by concatenating X → w and π*; then π** is a partially directed path from X to Y on which the vertex adjacent to X is a child of X. By construction, X is not adjacent to any vertex on π** except w. Thus, by Maathuis and Colombo (2015, Lemma 7.2), π** is a directed path.

In the following Appendices C.1 to C.13, we present the detailed proofs of the main results in the main text, with the help of the aforementioned concepts and lemmas.

C.1 Proof of Lemma 1
Proof.
Given a CPDAG G*, for any DAG G ∈ [G*], Fang and He (2020, Lemma 2) showed that a variable X is not a cause of another variable Y in G if and only if the critical set of X with respect to Y in G*, denoted by C, is a subset of pa(X, G). Consequently, X is a cause of Y in G if and only if C is not a subset of pa(X, G), that is, if and only if some vertex in C is a child of X in G. The desired result follows from the definition of definite cause.

C.2 Proof of Lemma 2

Proof.
We first show the necessity. By definition, C ⊆ sib(X, G*) ∪ ch(X, G*). Let G ∈ [G*] be an arbitrary DAG. If C ∩ ch(X, G) = ∅ and C ≠ ∅, then C ⊆ pa(X, G), and thus we have C ⊆ sib(X, G*). Maathuis et al. (2009, Lemma 3) proved that a non-empty subset of sib(X, G*) can be a part of X's parent set in some equivalent DAG if and only if the subset induces a complete subgraph. Therefore, C induces a complete subgraph of G*. This completes the proof of the necessity. We next prove the sufficiency. If C = ∅, then it is clear that C ∩ ch(X, G) = ∅ for some G ∈ [G*]. Now assume that C ≠ ∅, C induces a complete subgraph of G*, and C ∩ ch(X, G*) = ∅. As C ⊆ sib(X, G*) ∪ ch(X, G*), we have C ⊆ sib(X, G*). Again, by Maathuis et al. (2009, Lemma 3), there is a DAG G in [G*] such that C ⊆ pa(X, G). Therefore, C ∩ ch(X, G) = ∅.

C.3 Proof of Theorem 1
Proof.
Theorem 1 follows from Lemmas 1 and 2 directly.
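The completeness condition appearing in Lemma 2 and Theorem 1 (that the critical set C induces a complete subgraph of G*_u) amounts to a pairwise adjacency check. A minimal sketch, assuming an illustrative neighbour-map representation of the undirected graph (the function name is ours):

```python
from itertools import combinations

def induces_complete_subgraph(vertices, adj):
    """True iff every pair of the given vertices is adjacent in `adj`.

    An empty or singleton set is vacuously complete, matching the
    convention that C = ∅ also yields a possible (not implicit) cause."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))
```

Since the check only inspects edges among the siblings of X, it is a purely local operation, in line with the local criteria developed in the main text.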
C.4 Proof of Proposition 1
Proof.
Denote the CPDAG containing X and Y by G*. It suffices to show that, if X and Y are in the same chain component, then there exists a DAG in [G*] in which Y is an ancestor of X. By Lemma 5, there exists a DAG G in [G*] such that pa(Y, G) = pa(Y, G*) and ch(Y, G) = ch(Y, G*) ∪ sib(Y, G*). Let π = (Y, v_1, ..., X) be the shortest path from Y to X. It is clear that π has no chord. Moreover, the corresponding path of π in G* is undirected, as X and Y are in the same chain component. On the other hand, Y → v_1 is in G by our construction. Hence, according to Perković et al. (2017, Lemma B.1), π is a directed path.

C.5 Proof of Proposition 2
Proof.
According to the definition of a partially directed path, an undirected path is also partially directed; hence, if X and Y are in the same chain component, they are possible causes of each other by Theorem 2 and Proposition 1. Conversely, if X and Y are possible causes of each other, then by Theorem 2, there is a partially directed path from X to Y as well as a partially directed path from Y to X. Clearly, neither of these two paths contains a directed edge, since otherwise a partially directed cycle containing directed edges would occur. Therefore, X and Y are connected by an undirected path, which means they are in the same chain component.

C.6 Proof of Proposition 3
Proof.
Let Z be a vertex in the chain component containing X; then every partially directed path between Z and Y, if any, must pass through X. Since there is a v-structure-free orientation of the chain component whose unique source is X, there is a DAG in the Markov equivalence class represented by G* such that no vertex in the chain component is an ancestor of Y except X.

C.7 Proof of Proposition 4
Proof. If X and Y are in the same chain component, then Z = {Y} and the equation trivially holds. Suppose that X and Y are not in the same chain component. We first prove that C_{XY} ⊆ ∪_{Z ∈ Z} C_{XZ}. Without loss of generality, we can assume C_{XY} ≠ ∅. By the definition of critical set, for any C ∈ C_{XY}, there is a chordless partially directed path ρ from X to Y on which C is adjacent to X. Since X and Y are not in the same chain component, ρ must contain a directed edge. Let Z be the vertex on ρ such that ρ(Z, Y) starts with a directed edge and Z is in the chain component containing X. By Maathuis and Colombo (2015, Lemma 7.2) or Perković et al. (2017, Lemma B.1), ρ(Z, Y) is a directed path. Therefore, Z is an explicit cause of Y. Since X is not an explicit cause of Y, we have Z ≠ X, and thus ρ(X, Z) is a chordless undirected path. This means C ∈ C_{XZ}. As C ∈ C_{XY} is arbitrary, we have C_{XY} ⊆ ∪_{Z ∈ Z} C_{XZ}.

Conversely, for any Z ∈ Z and C ∈ C_{XZ}, there is a chordless undirected path π_1 from X to Z on which C is adjacent to X. Let π_2 be the shortest directed path from Z to Y. As X and Y are not in the same chain component, Z ≠ Y. Hence, concatenating π_1 and π_2 results in a partially directed path from X to Y with length greater than 1; denote this path by π_3. If π_3 is chordless, then we have C ∈ C_{XY}. If this is not the case, then π_3 must have a chord connecting a vertex v_1 on π_1 and a vertex v_2 on π_2. Clearly, the edge between v_1 and v_2 must be directed, and the direction is v_1 → v_2. Since X is not an explicit cause of Y, it holds that v_1 ≠ X. Without loss of generality, we assume v_1 is the first vertex from X's side on π_1 that is adjacent to some v_2 on π_2; then concatenating π_1(X, v_1), v_1 → v_2 and π_2(v_2, Y) results in another partially directed path π_4 which is shorter than π_3. It is easy to verify that π_4 is chordless, and C is still adjacent to X on π_4. Therefore, C ∈ C_{XY}, and consequently we have ∪_{Z ∈ Z} C_{XZ} ⊆ C_{XY}. This completes the proof of Proposition 4.
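The d-separation statements used in the proofs of Theorems 2-4 below can be checked algorithmically with the standard moral-graph criterion: X ⊥⊥ Y | S holds in a DAG if and only if X and Y are disconnected, avoiding S, in the moralized graph induced by the ancestors of {X, Y} ∪ S. A hedged Python sketch; the `parents`-map representation and function names are our own illustrative choices:

```python
from itertools import combinations

def ancestors(parents, targets):
    """All ancestors of `targets` (including the targets themselves)."""
    seen, stack = set(targets), list(targets)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, x, y, s):
    """True iff x and y are d-separated given the set s in the DAG."""
    keep = ancestors(parents, {x, y} | set(s))
    # moralise the ancestral subgraph: connect each vertex to its parents,
    # and marry every pair of parents
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, ()) if p in keep]
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # undirected search from x to y that never passes through s
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w == y:
                return False
            if w not in seen and w not in s:
                seen.add(w)
                stack.append(w)
    return True
```

For the chain x → m → y this returns True given {m} and False given ∅, while for the collider x → c ← y it returns True given ∅ and False given {c}, matching the behaviour of d-separation invoked in the proofs.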
C.8 Proof of Theorem 2
Proof.
Suppose X is a definite non-cause of Y. Then for every DAG G in the Markov equivalence class represented by G*, Y is a non-descendant of X. Since Lemma 5 implies that there is a DAG G such that pa(X, G) = pa(X, G*) and ch(X, G) = adj(X, G*) \ pa(X, G*), we have X ⊥⊥ Y | pa(X, G*) by the local Markov property. On the other hand, if X is a definite cause or a possible cause of Y, then by definition there is a DAG G in the Markov equivalence class represented by G* in which X is an ancestor of Y. Let π be a directed path from X to Y in G. Since every vertex on π is a non-collider and none of the vertices on π is in pa(X, G*), π is active given pa(X, G*), and thus X ⊥̸⊥ Y | pa(X, G*).

C.9 Proof of Theorem 3
Proof. If X is an explicit cause of Y , then there is a directed path π from X to Y in G ∗ . Hence, for any DAG G in the Markov equivalence class representedby G ∗ , π is directed in G , which means π has no collider in G . However, noneof the vertices on π is a member of pa ( X, G ∗ ) or sib ( X, G ∗ ), since otherwise,a directed cycle or a partially directed cycle with directed edges wouldoccur in G ∗ . Therefore, π is active given pa ( X, G ∗ ) ∪ sib ( X, G ∗ ), which means X ⊥6 ⊥ Y | pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). Conversely, suppose X is not an explicit causeof Y . In the following, we will prove that X ⊥⊥ Y | pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) holds.By Lemma 5, there is a DAG G in the Markov equivalence class representedby G ∗ such that ch ( X, G ) = sib ( X, G ∗ ) ∪ ch ( X, G ∗ ) and pa ( X, G ) = pa ( X, G ∗ ).Consider a path π from X to Y in G . If the length of π is 1, then thecorresponding path of π in G ∗ must be X ← Y or X − Y . Thus, π is blockedgiven pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). If the length of π is greater than 1, withoutloss of generality we can assume π = ( X, v , ..., v n , Y ). If v ∈ pa ( X, G ), then π is blocked by pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) since v cannot be a collider on π . If v ∈ ch ( X, G ∗ ), then π is not directed, since otherwise, the correspondingpath in G ∗ would be a partially directed path from X to Y where the nodeadjacent to X is a child of X . Therefore, there must be a collider on π .Let v i be the collider nearest to X . If v i ∈ an ( pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) , G ),there exists a partially directed cycle with directed edges in G ∗ , which isimpossible. Thus, v i / ∈ an ( pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) , G ), and π is blocked by pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). Finally, in the case where v ∈ sib ( X, G ∗ ), if v is anon-collider, π is clearly blocked by pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). 
If v_1 is a collider, then v_2 is adjacent to X, which means v_2 ∉ ch(X, G∗), since otherwise both X → v_2 → v_1 − X and X → v_2 − v_1 − X would be partially directed cycles with directed edges in G∗. This means v_2 ∈ pa(X, G∗) ∪ sib(X, G∗). Since v_2 is a non-collider on π, π is blocked by pa(X, G∗) ∪ sib(X, G∗). This completes the proof of Theorem 3.

C.10 Proof of Theorem 4
Proof.
Let C be the critical set of X with respect to Y in G∗. Suppose that X is an implicit cause of Y. Then by Theorem 3, X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗). For any M_w ∈ M, from Theorem 1 we know that C \ M_w ≠ ∅. Therefore, according to Proposition 4, there is a partially directed path from X to Y, denoted by π_w = (X − w_1 − ... − w_t − Z_w → ... → Y), such that X − w_1 − ... − w_t − Z_w is chordless and w_1 ∉ M_w. Since every partially directed cycle in G∗ is an undirected cycle, none of the vertices on π_w is a parent of X in G∗. Moreover, due to chordlessness, if w_1 ≠ Z_w, then none of w_2, ..., w_t, Z_w is adjacent to X, and thus none of them is in M_w. (If w_1 = Z_w, then it is clear that Z_w ∉ M_w.) Since by Lemma 5 there is a DAG in the Markov equivalence class represented by G∗ in which π_w is directed, π_w is active given pa(X, G∗) ∪ M_w. Therefore, X ⊥̸⊥ Y | pa(X, G∗) ∪ M for any M ∈ M. Conversely, X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) implies that X is not an explicit cause of Y, which also means Y ∉ ch(X, G∗). Moreover, X ⊥̸⊥ Y | pa(X, G∗) ∪ M for any M ∈ M implies Y ∉ pa(X, G∗) ∪ sib(X, G∗). Therefore, X and Y are not adjacent. Suppose that X is not an implicit cause of Y. Since X is not an explicit cause of Y, C ∩ ch(X, G∗) = ∅. Thus, by Theorem 1, there exists an M ∈ M such that C is a subset of M. (If C = ∅, then C ⊂ M for any M ∈ M.) We will show that pa(X, G∗) ∪ M d-separates X and Y. By Lemma 5, there is a DAG G in the Markov equivalence class represented by G∗ such that ch(X, G) = (sib(X, G∗) \ M) ∪ ch(X, G∗) and pa(X, G) = pa(X, G∗) ∪ M. Let π = (X, v_1, ..., v_n, Y) be an arbitrary path connecting X and Y in G. The length of π must be greater than 1, as X and Y are not adjacent. If v_1 is a parent of X in G, then π is clearly blocked by pa(X, G∗) ∪ M, since v_1 is a non-collider on π and v_1 ∈ pa(X, G∗) ∪ M by the construction of G. Now assume that v_1 is a child of X in G.
If v_1 ∈ ch(X, G∗), then there must be a collider on π, since otherwise the corresponding path of π in G∗ would be a partially directed path on which the vertex adjacent to X is a child of X, which means X is an explicit cause of Y according to Lemma 6. Clearly, the collider nearest to X on π is not an ancestor of pa(X, G∗) ∪ M. Thus, π is blocked by pa(X, G∗) ∪ M. For the same reason, if v_1 ∈ sib(X, G∗) \ M and there is a collider on π, then π is blocked by pa(X, G∗) ∪ M, due to the fact that the collider nearest to X on π cannot be an ancestor of pa(X, G∗) ∪ M. Finally, if v_1 ∈ sib(X, G∗) \ M and there is no collider on π, then π is directed in G, and the corresponding path of π in G∗ is partially directed. Let Z be the vertex on π such that the subpath π(X, Z) is undirected in G∗ and Z is an explicit cause of Y. Obviously, such a Z exists, and Z ≠ X. Since v_1 ∉ M, we have v_1 ∉ C, and thus π(X, Z) has a chord. By Perković et al. (2017, Lemma 3.6), there is a subsequence π∗ of π(X, Z) that forms a chordless undirected path from X to Z in G∗. Together with Proposition 4, this result indicates that there is a vertex w on π(X, Z) such that w ∈ C. However, by construction, w ∈ pa(X, G), which makes π(X, w) and w → X form a directed cycle in G. Thus, π must contain a collider, a contradiction. This completes the proof.

C.11 Proof of Theorem 5
Proof.
The proof follows directly from Theorems 1 to 4, together with Propositions 2 and 4.
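The theorems above reduce each causal query to a handful of d-separation tests in a DAG constructed via Lemma 5 (and, on data, to the corresponding conditional independence tests). As a rough illustration of the graphical side only — a sketch, not the authors' implementation — the following implements a standard reachability-based ("Bayes-ball") d-separation check and evaluates the Theorem 2 and Theorem 3 style conditioning sets on two toy graphs: the v-structure X → Z ← Y, where X is a definite non-cause of Y, and a member DAG X → A → Y of the chain CPDAG X − A − Y, where pa(X, G∗) = ∅ and sib(X, G∗) = {A}. The function name and graph encoding are illustrative choices.

```python
from collections import deque

def d_separated(edges, x, y, z):
    """Reachability ("Bayes-ball") test: are x and y d-separated given z
    in the DAG whose directed edges are listed in `edges`?"""
    z = set(z)
    assert x not in z and y not in z
    parents, children = {}, {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)
    # an(z): z together with all its ancestors (opens conditioned colliders)
    anc, stack = set(z), list(z)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in anc:
                anc.add(p)
                stack.append(p)
    # Traverse (vertex, direction) states; "up" = reached against an arrow.
    visited, queue = set(), deque([(x, "up")])
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v == y:                      # an active path reached y
            return False
        if d == "up" and v not in z:    # non-collider, not conditioned on
            queue.extend((p, "up") for p in parents.get(v, ()))
            queue.extend((c, "down") for c in children.get(v, ()))
        elif d == "down":
            if v not in z:              # chain vertex: continue downward
                queue.extend((c, "down") for c in children.get(v, ()))
            if v in anc:                # collider unblocked by conditioning
                queue.extend((p, "up") for p in parents.get(v, ()))
    return True

# X → Z ← Y: X ⊥⊥ Y | pa(X, G∗) = ∅ holds, so the Theorem 2 style test
# marks X as a definite non-cause of Y.
print(d_separated([("X", "Z"), ("Y", "Z")], "X", "Y", []))     # True

# X → A → Y, a member of the chain CPDAG X − A − Y:
print(d_separated([("X", "A"), ("A", "Y")], "X", "Y", []))     # False: not a definite non-cause
print(d_separated([("X", "A"), ("A", "Y")], "X", "Y", ["A"]))  # True: not an explicit cause
```

In the chain example the two tests together classify X as a possible cause of Y, matching the fact that X is an ancestor of Y in some but not all member DAGs.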
References
A. R. Ali, T. S. Richardson, P. Spirtes, and J. Zhang. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 10–17. AUAI Press, 2005.
S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.
Y. Bengio, T. Deleu, N. Rahaman, N. R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
M. Bernstein and P. Tetali. On sampling graphical Markov models. arXiv e-prints, arXiv:1705.09717, 2017.
J. R. S. Blair and B. Peyton. An introduction to chordal graphs and clique trees. In Graph Theory and Sparse Matrix Computation, pages 1–29. Springer, New York, NY, 1993.
D. M. Chickering. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2(Feb):445–498, 2002a.
D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002b.
D. Colombo and M. H. Maathuis. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15:3921–3962, 2014.
Z. Fang and Y. He. IDA with background knowledge. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
S. Fu and M. C. Desmarais. Markov blanket based feature selection: A review of past decade. In Proceedings of the World Congress on Engineering, volume 1, pages 321–328. Newswood Ltd., 2010.
T. Gao and Q. Ji. Local causal discovery of direct causes and effects. In Advances in Neural Information Processing Systems 28, pages 2512–2520. Curran Associates, Inc., 2015.
F. R. Guo and E. Perković. Minimal enumeration of all possible total effects in a Markov equivalence class. arXiv e-prints, arXiv:2010.08611, 2020.
A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(Aug):2409–2464, 2012.
Y. He and Z. Geng. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523–2547, 2008.
Y. He, J. Jia, and B. Yu. Counting and exploring sizes of Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 16:2589–2609, 2015.
M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis, and P. Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.
M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5(May):549–573, 2004.
M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, volume 30, pages 4066–4076. Curran Associates, Inc., 2017.
S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):321–348, 2002.
S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 50(2):157–224, 1988.
Y. Liu, Z. Cai, C. Liu, and Z. Geng. Local learning approaches for finding effects of a specified cause and their causal paths. ACM Transactions on Intelligent Systems and Technology, 10(5), 2019.
Y. Liu, Z. Fang, Y. He, and Z. Geng. Collapsible IDA: Collapsing parental sets for locally estimating possible causal effects. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020a.
Y. Liu, Z. Fang, Y. He, Z. Geng, and C. Liu. Local causal network learning for finding pairs of total and direct effects. Journal of Machine Learning Research, 21(148):1–37, 2020b.
M. H. Maathuis and D. Colombo. A generalized back-door criterion. The Annals of Statistics, 43(3):1060–1088, 2015.
M. H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403–410. Morgan Kaufmann Publishers Inc., 1995.
T. Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
J. Mooij and T. Claassen. Constraint-based causal discovery using partial ancestral graphs in the presence of cycles. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
P. Nandy, M. H. Maathuis, and T. S. Richardson. Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics, 45(2):647–674, 2017.
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1988.
J. Pearl. Causality. Cambridge University Press, 2009.
J. Pearl, D. Geiger, and T. Verma. Conditional independence and its representations. Kybernetika, 25(7):33–44, 1989.
E. Perković. Identifying causal effects in maximally oriented partially directed acyclic graphs. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
E. Perković, M. Kalisch, and M. H. Maathuis. Interpreting and using CPDAGs with background knowledge. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017.
J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2013.
J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15(1):2009–2053, 2014.
T. Richardson and P. Spirtes. Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, 2002.
D. J. Rose and R. E. Tarjan. Algorithmic aspects of vertex elimination on directed graphs. Technical report, Stanford, CA, 1975.
A. Roumpelaki, G. Borboudakis, S. Triantafillou, and I. Tsamardinos. Marginal causal consistency in constraint-based causal learning. In Proceedings of the UAI 2016 Workshop on Causation: Foundation to Application, number 1792 in CEUR Workshop Proceedings, pages 39–47, 2016.
S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.
A. Singh and A. Moore. Finding optimal Bayesian networks by dynamic programming. Technical report, 2005.
P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.
I. Tsamardinos and C. F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Morgan Kaufmann Publishers, 2003.
I. Tsamardinos, C. F. Aliferis, and A. R. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference, pages 376–381. AAAI Press, 2003.
C. Wang, Y. Zhou, Q. Zhao, and Z. Geng. Discovering and orienting the edges connected to a target variable in a DAG via a sequential local learning approach. Computational Statistics & Data Analysis, 77:252–266, 2014.
J. Witte, L. Henckel, M. H. Maathuis, and V. Didelez. On efficient adjustment in causal graphs. Journal of Machine Learning Research, 21(246):1–45, 2020.
Y. Wu, L. Zhang, and X. Wu. Counterfactual fairness: Unidentification, bound and algorithm. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pages 1438–1444. International Joint Conferences on Artificial Intelligence Organization, 2019.
J. Xiang and S. Kim. A* lasso for learning a sparse Bayesian network structure for continuous variables. In Advances in Neural Information Processing Systems 26, pages 2418–2426. Curran Associates, Inc., 2013.
C. Yuan, B. Malone, and X. Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI-11), pages 2186–2191. International Joint Conferences on Artificial Intelligence Organization, 2011.
J. Zhang. Causal Inference and Reasoning in Causally Insufficient Systems. PhD thesis, Carnegie Mellon University, 2006.
J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16):1873–1896, 2008.
K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.
K. Zhang, M. Gong, P. Stojanov, B. Huang, Q. Liu, and C. Glymour. Domain adaptation as a problem of inference on graphical models. In Advances in Neural Information Processing Systems, volume 33, pages 4965–4976. Curran Associates, Inc., 2020.
X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. DAGs with no tears: Continuous optimization for structure learning. In