A Local Method for Identifying Causal Relations under Markov Equivalence
Zhuangyan Fang, Yue Liu, Zhi Geng, Yangbo He
Peking University; Huawei Noah's Ark Lab
February 26, 2021
Abstract
Causality is important for designing interpretable and robust methods in artificial intelligence research. We propose a local approach to identify whether a variable is a cause of a given target based on causal graphical models of directed acyclic graphs (DAGs). In general, the causal relation between two variables may not be identifiable from observational data, as many causal DAGs encoding different causal relations are Markov equivalent. In this paper, we first introduce a sufficient and necessary graphical condition to check the existence of a causal path from a variable to a target in every Markov equivalent DAG. Next, we provide local criteria for identifying whether the variable is a cause/non-cause of the target. Finally, we propose a local learning algorithm for this causal query via learning the local structure of the variable and performing some additional statistical independence tests related to the target. Simulation studies show that our local algorithm is efficient and effective, compared with other state-of-the-art methods.
Introduction

In many observational studies, the main purposes are to study whether a treatment variable is a cause of a target variable, or to further identify the causes/non-causes of a specified target variable or the effects/non-effects of a given treatment. For example, in a clinical study we may be concerned about which symptoms are side effects caused by using a new drug and which are not. Causality is also important for designing interpretable and robust methods in artificial intelligence research (Miller, 2019), and has been used in many fields of artificial intelligence, such as causal transfer learning (Zhang et al., 2020; Bengio et al., 2020) and causality-based algorithmic fairness (Kusner et al., 2017; Wu et al., 2019).

∗ Correspondence to: [email protected].

Directed acyclic graphs (DAGs) can be used to represent causal relationships among variables (Pearl, 2009). If a variable X has a directed path to another variable Y, then X is a cause of Y and Y is an effect of X. From observational data, however, instead of an exact causal DAG, we generally learn a Markov equivalence class of DAGs, which can be represented by a completed partially directed acyclic graph (CPDAG). The undirected edges in a CPDAG imply that some causal relations among variables cannot be read from the graph directly.

Consider a Markov equivalence class of DAGs learned from observational data. A variable X is a definite cause of a target Y if X is a cause of Y in every equivalent DAG, and a variable X is a definite non-cause of Y if X is not a cause of Y in any DAG in the class. If X is neither a definite cause nor a definite non-cause of Y, X is called a possible cause of Y.

Some approaches can be used to identify the causal relation between a treatment and a target.
An intuitive approach is first to learn a Markov equivalence class from observational data, and then enumerate all DAGs in the class to check whether the treatment is, in all of these equivalent DAGs, definitely or definitely not a cause of the target. However, this intuitive approach is inefficient when the number of DAGs in the learned Markov equivalence class is large (He et al., 2015).

Another way is to check the paths from the treatment to the target in a CPDAG. It has been shown that the treatment is a definite non-cause of the target if and only if there is no partially directed path from the treatment to the target (see, e.g., Zhang, 2006; Perković et al., 2017). Given a CPDAG, Roumpelaki et al. (2016) also introduce a sufficient condition for identifying definite causes. However, the necessity of this condition remains a conjecture (Zhang, 2006; Mooij and Claassen, 2020), and the corresponding approach is usually inefficient since it needs to learn an entire CPDAG first.

The third approach is to estimate the causal effect of the treatment on the given target (Maathuis et al., 2009; Perković et al., 2017; Nandy et al., 2017; Fang and He, 2020; Liu et al., 2020a,b; Witte et al., 2020; Guo and Perković, 2020). This approach, which we call the causal-effect-based method, determines whether a variable is a cause of another by testing whether all possible causal effects are zero/non-zero. However, this method requires

We note that the recent progress in identifying the causal relation between two variables does provide an opportunity to learn an exact DAG. However, such methods need to impose additional distributional conditions (Shimizu et al., 2006; Zhang and Hyvärinen, 2009; Shimizu et al., 2011; Peters and Bühlmann, 2013; Peters et al., 2014).
Background
In this section, we introduce basic concepts of causal graphical models and the assumptions for causal learning. We use pa(S, G), ch(S, G), sib(S, G), adj(S, G), an(S, G) and de(S, G) to denote the union of parents, children, siblings (or undirected neighbors), adjacent vertices, ancestors, and descendants of each variable in a set S in G, respectively, where G = (V, E) can be a directed, an undirected, or a partially directed graph. The basic graph terminology can be found in Appendix A. As a convention, we regard a vertex as an ancestor and a descendant of itself. Let G be a causal directed acyclic graph (causal DAG) and X be a vertex in G; if S = {X} is a singleton set, we will replace S by X for ease of presentation. The vertices in an(X, G) \ X are causes of X, and the vertices in pa(X, G) are direct causes of X. If X is a cause of Y, then the directed paths from X to Y are called causal paths.

The notion of d-separation induces a set of conditional independence relations encoded in a DAG (Pearl, 1988). Let G be a DAG and π = (X = X_0, X_1, ..., X_n = Y) be a path from X to Y in G. An intermediate vertex X_i is a collider on π if X_{i-1} → X_i and X_i ← X_{i+1}; otherwise, X_i is a non-collider on π. For three distinct vertices X_i, X_j and X_k, if X_i → X_j ← X_k and X_i is not adjacent to X_k in G, then the triple (X_i, X_j, X_k) is called a v-structure collided on X_j in G. Given Z ⊆ V, we say π is d-connected (or active) given Z if Z does not contain any endpoint or non-collider on the path and every collider on the path has a descendant in Z. If π is not d-connected given Z, then π is blocked by Z. For pairwise disjoint sets X, Y, Z ⊆ V, X and Y are d-separated by Z (denoted by X ⊥⊥ Y | Z) if and only if every path between some X ∈ X and Y ∈ Y is blocked by Z.

Let J_G be the set of d-separation relations read off from a DAG G. Two DAGs G1 and G2 are Markov equivalent if J_{G1} = J_{G2}. Pearl et al.
(1989) have shown that two DAGs are equivalent if and only if they have the same skeleton and the same v-structures. A Markov equivalence class, or simply equivalence class, denoted by [G], contains all DAGs equivalent to G. A Markov equivalence class [G] can be uniquely represented by a partially directed graph G∗ called a completed partially directed acyclic graph (CPDAG). Two vertices are adjacent in G∗ if and only if they are adjacent in G, and a directed edge occurs in G∗ if and only if it appears in every DAG in [G] (Pearl et al., 1989). For ease of presentation, we will also use [G∗] to represent the Markov equivalence class represented by G∗ below. Given a CPDAG G∗, we use G∗_u and G∗_d to denote the undirected subgraph and the directed subgraph of G∗, respectively. The former is the undirected graph obtained by removing all directed edges in G∗ and the latter is the directed graph obtained by removing all undirected edges. Andersson et al. (1997) proved that (1) the undirected subgraph G∗_u of G∗ is the union of disjoint connected chordal graphs (the definition of a chordal graph is provided in Appendix A), and (2) every partially directed cycle in G∗ is an undirected cycle, that is, none of the partially directed cycles in G∗ contains a directed edge. Each isolated connected undirected subgraph of G∗_u is called a chain component of G∗ (Andersson et al., 1997; Lauritzen and Richardson, 2002).

For a given distribution P, we use X ⊥⊥_P Y | Z to denote that X is independent of Y given Z with respect to P. Let J_P be the set of all such (conditional) independencies in P. In this paper, our main results are based on the following assumptions: the causal Markov assumption, which states that X ⊥⊥ Y | Z in J_G implies X ⊥⊥_P Y | Z in J_P; the causal faithfulness assumption, which states that X ⊥⊥_P Y | Z in J_P implies X ⊥⊥ Y | Z in J_G; and the assumption that there is no hidden variable or selection bias.
A distribution P is called Markov and faithful to a DAG G if P and G satisfy the causal Markov assumption and the causal faithfulness assumption. A causal DAG model consists of a DAG G and a joint distribution P over a common vertex set V such that P satisfies the causal Markov assumption with respect to G. G is called the causal structure of the model and P is called the observational distribution (or simply the distribution) (Hauser and Bühlmann, 2012).

Causal structure learning methods try to recover the causal structure from data. Global causal structure learning focuses on learning an entire causal structure over all variables, while local causal structure learning discusses how to recover a part of the underlying causal structure. Existing approaches for learning global causal structures roughly fall into two classes: constraint-based and score-based methods. Constraint-based methods, such as the PC algorithm (Spirtes and Glymour, 1991) and the stable PC algorithm (Colombo and Maathuis, 2014), use conditional independence tests to find the causal skeleton and then determine the edge directions according to a series of orientation rules (Meek, 1995). Under the causal Markov and causal faithfulness assumptions, constraint-based methods can identify causal graphs up to a Markov equivalence class. On the other hand, score-based methods, such as exact search algorithms like dynamic programming (Koivisto and Sood, 2004; Singh and Moore, 2005) and A* (Yuan et al., 2011; Xiang and Kim, 2013), heuristic search algorithms like GES (Chickering, 2002b), and gradient-based methods like NOTEARS (Zheng et al., 2018), evaluate candidate graphs with a predefined score function and search for DAGs or CPDAGs with the optimal score.

Local learning algorithms usually learn the Markov blanket (see, e.g., Tsamardinos et al., 2003; Tsamardinos and Aliferis, 2003; Fu and Desmarais, 2010) or the parent and child set of a given target (see, e.g.,
Wang et al., 2014; Gao and Ji, 2015; Liu et al., 2019). Recently, Liu et al. (2020b, Algorithm 3) extended the MB-by-MB algorithm (Wang et al., 2014) to learn the chain component containing a given target and the directed edges surrounding the chain component. This variant of MB-by-MB can thus learn the subgraph of the CPDAG over the target and its neighbors, that is, the parents, siblings and children of the target in the CPDAG.
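The Verma-Pearl characterization recalled above, that two DAGs are Markov equivalent if and only if they share the same skeleton and the same v-structures, is straightforward to check directly. Below is a minimal sketch (our illustration, not the authors' code); a DAG is represented simply by its set of directed edges, and the function names are our own.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG's edge set."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All v-structures (i, j, k): i -> j <- k with i and k non-adjacent."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(edges)
    return {(i, j, k)
            for j, ps in parents.items()
            for i, k in combinations(sorted(ps), 2)
            if frozenset((i, k)) not in skel}

def markov_equivalent(e1, e2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)
```

For instance, A → B → C and A ← B → C are Markov equivalent, while the collider A → B ← C has the same skeleton but contains the v-structure (A, B, C), so it falls in a different equivalence class.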
In this section, we provide a sufficient and necessary condition to identify definite causal relations, and show that the definite causal relations can be divided into two subtypes: explicit and implicit causal relations.
As mentioned in Section 1, given a CPDAG, a variable X is a definite non-cause of another variable Y if and only if there is no partially directed path from X to Y (Zhang, 2006; Perković et al., 2017). Roumpelaki et al. (2016, Theorem 3.1) proved that the treatment is a definite cause of the target if there is a directed path from the treatment to the target, or if the treatment has two chordless partially directed paths to the target on which the two vertices adjacent to the treatment are distinct and non-adjacent. In this section, we will show that this condition is also necessary; before that, the concept of a critical set is introduced as follows.

Definition 1 (Critical Set). (Fang and He, 2020, Definition 2) Let G∗ be a CPDAG and X and Y be two distinct vertices in G∗. The critical set of X with respect to Y in G∗ consists of all adjacent vertices of X lying on at least one chordless partially directed path from X to Y.

The definition of a chordless partially directed path can be found in Appendix A. With Definition 1, we have the following lemma.

Lemma 1.
Let G∗ be a CPDAG. For any two distinct vertices X and Y in G∗, X is a definite cause of Y in the underlying DAG if and only if the critical set of X with respect to Y in G∗ contains a child of X in every DAG G ∈ [G∗].

Lemma 1 follows from Lemma 2 in Fang and He (2020). It gives a sufficient and necessary condition to decide whether X is a definite cause of Y. However, checking the condition given in Lemma 1 also requires enumerating all equivalent DAGs. To mitigate this problem, we discuss a graphical characteristic of the critical set in the corresponding CPDAG.

Lemma 2.
Let G∗ be a CPDAG and X, Y be two distinct vertices in G∗. Denote by C the critical set of X with respect to Y in G∗. Then C ∩ ch(X, G) = ∅ for some G ∈ [G∗] if and only if C = ∅, or C induces a complete subgraph of G∗ but C ∩ ch(X, G∗) = ∅.

Based on Lemmas 1 and 2, we have the desired sufficient and necessary graphical criterion.
Theorem 1.
Suppose that G∗ is a CPDAG, X, Y are two distinct vertices in G∗, and C is the critical set of X with respect to Y in G∗. Then, X is a definite cause of Y if and only if C ∩ ch(X, G∗) ≠ ∅, or C is non-empty and induces an incomplete subgraph of G∗.

The sufficiency of the condition in Theorem 1 has been extended to other types of causal graphs by Roumpelaki et al. (2016) and Mooij and Claassen (2020). With the help of Theorem 1, we can identify the type of causal relation based on a learned CPDAG by enumerating paths and finding critical sets. Below, we give an example to illustrate this idea.
Example 1.
Consider the respiratory disease network shown in Figure 1. Let smoking be the treatment and dyspnoea be the target. From Figure 1(b) we can see that the partially directed paths from smoking to dyspnoea are Smok − Lung → Either → Dysp and Smok − Bronc → Dysp. Therefore, the critical set of smoking with respect to dyspnoea is {Lung, Bronc}. As Lung and Bronc are not adjacent, by Theorem 1 smoking is a definite cause of dyspnoea. Similarly, the critical set of lung cancer with respect to dyspnoea is {Smok, Either}. Since Either is a child of Lung, lung cancer is also a definite cause of dyspnoea.

We note that, although Roumpelaki et al. (2016, Theorem 3.1) also claimed that they had proved the necessity, their proof is flawed. As mentioned by Mooij and Claassen (2020), the last part of the proof appears to be incomplete. How to prove the necessity for more general types of causal graphs remains an open problem (Zhang, 2006).

Figure 1: This example is adapted from the ASIA network. The original network structure and related parameters can be found in Lauritzen and Spiegelhalter (1988). Figure 1(a) shows the true underlying causal DAG, and Figure 1(b) shows the corresponding CPDAG. Figure 1(c) enumerates all equivalent DAGs in the Markov equivalence class. (a) G_t; (b) G∗; (c) Markov equivalence class.
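The check in Example 1 can also be carried out programmatically. Below is a minimal sketch (our illustration, not the authors' code) that represents a CPDAG by its directed and undirected edge sets, enumerates chordless partially directed paths by depth-first search, and applies the criterion of Theorem 1; all helper names are our own.

```python
from itertools import combinations

def adjacent(a, b, directed, undirected):
    return (a, b) in directed or (b, a) in directed or frozenset((a, b)) in undirected

def critical_set(x, y, directed, undirected):
    """Neighbors of x on at least one chordless partially directed path x ... y."""
    crit = set()

    def extend(path):
        last = path[-1]
        if last == y:
            crit.add(path[1])
            return
        succs = {b for (a, b) in directed if a == last} | \
                {v for e in undirected if last in e for v in e if v != last}
        for nxt in succs:
            if nxt in path:
                continue
            # chordless: nxt may be adjacent only to the previous vertex on the path
            if any(adjacent(nxt, p, directed, undirected) for p in path[:-1]):
                continue
            extend(path + [nxt])

    extend([x])
    return crit

def is_definite_cause(x, y, directed, undirected):
    """Theorem 1: C ∩ ch(x, G*) is non-empty, or C is non-empty and incomplete."""
    c = critical_set(x, y, directed, undirected)
    if any((x, v) in directed for v in c):
        return True
    return len(c) > 1 and any(
        not adjacent(a, b, directed, undirected) for a, b in combinations(c, 2))

# CPDAG edges of Figure 1(b) named in Example 1:
# Smok - Lung -> Either -> Dysp and Smok - Bronc -> Dysp
directed = {("Lung", "Either"), ("Either", "Dysp"), ("Bronc", "Dysp")}
undirected = {frozenset(("Smok", "Lung")), frozenset(("Smok", "Bronc"))}
```

On this fragment, `critical_set("Smok", "Dysp", ...)` returns {Lung, Bronc} and `critical_set("Lung", "Dysp", ...)` returns {Smok, Either}, matching Example 1, and both queries are classified as definite causes.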
We now study the properties of definite causal relations, and show that definite causal relations can be divided into two subtypes based on the existence of causal paths in a CPDAG. The results in this section are of key importance for building the local characterizations in Section 4, and are also useful for developing an efficient global learning algorithm.
Proposition 1.
For two distinct vertices X and Y, if X is a definite cause of Y, then X and Y are not in the same chain component.

Given a target variable Y, Proposition 1 shows that Y and its definite causes do not appear in the same chain component. Thus, if a treatment X is a definite cause of a target Y, then in G∗ there must be a partially directed path from X to Y which contains a directed edge. On the other hand, for two distinct vertices lying in the same chain component, we have:

Proposition 2.
Two distinct vertices X and Y are possible causes of each other if and only if they are in the same chain component.

Recall that in Figure 1(b), both Smok and Lung are definite causes of Dysp. However, in the CPDAG there exists a directed path from Lung to Dysp, while no directed path exists from Smok to Dysp. That is, the cause Lung of Dysp is explicit and the cause Smok of Dysp is implicit in the CPDAG. This difference motivates us to introduce the following two concepts.

Figure 2: An example for identifying the types of causal relations.
Definition 2 (Explicit Cause). A variable X is an explicit cause of Y if X is a definite cause of Y and there is a common causal path from X to Y in every DAG in the Markov equivalence class represented by a CPDAG G∗.

Since there is a common causal path from an explicit cause X to the target Y in every DAG in the Markov equivalence class represented by G∗, there is a directed path from X to Y in G∗. As a convention, we regard X as an explicit cause of itself.

Definition 3 (Implicit Cause). A variable X is an implicit cause of Y if X is a definite cause of Y and there is no common causal path from X to Y in every DAG in the Markov equivalence class represented by a CPDAG G∗.

We notice that X is a definite cause of Y if and only if it satisfies one of the two conditions given in Theorem 1. The first condition, C ∩ ch(X, G∗) ≠ ∅, is the sufficient and necessary condition for identifying explicit causes, while the second condition corresponds to implicit causes. In Section 4, we will exploit this difference between explicit and implicit causes to develop local characterizations for both of them. Below, we give an example to illustrate them.

Example 2.
Consider the causes of the target variable Y based on the CPDAG G∗ in Figure 2. It is clear that all the variables other than Y are definite or possible causes of Y. Obviously, {E, D, F} are explicit causes of Y. For B, since B − E → Y, B − D → Y and B − G − F → Y are chordless partially directed paths, the critical set of B with respect to Y is {E, D, G}. As the induced subgraph of G∗ over {E, D, G} is not complete, B is a definite cause of Y, and B is also implicit. Similarly, G is another implicit cause of Y. For X and A, the critical sets of X and A with respect to Y are {B, D, G} and {X, G}, respectively. Since the corresponding induced subgraphs are complete, by Theorem 1, X and A are not implicit causes of Y. Thus, they are possible causes of Y.

Despite the difference, explicit and implicit causes also have some interesting connections. The following Proposition 3 proves that the existence of an implicit cause implies the existence of at least two explicit causes.
Proposition 3.
Let G∗ be a CPDAG, and let X and Y be two distinct vertices that belong to different chain components. If X is the only explicit cause of Y in the chain component to which X belongs, then every vertex in this chain component, except X, is a possible cause of Y.

In this section, we introduce theoretical results on locally characterizing different types of causal relations. Our local characterizations depend on the induced subgraph of the true CPDAG over the treatment's neighbors as well as some queries about d-separation relations. The first result is about definite non-causal relations, as given in Theorem 2.
Theorem 2.
Let G∗ be a CPDAG. For any two distinct vertices X and Y in G∗, X is a definite non-cause of Y if and only if X ⊥⊥ Y | pa(X, G∗) holds.

Theorem 2 introduces a local characterization for definite non-causal relations, which is based on the local structure around the treatment X and a single d-separation claim. The d-separation claim X ⊥⊥ Y | pa(X, G∗) is similar to the following well-known result, called the local Markov property of a causal DAG model: any variable is d-separated from its non-descendants given its parents. The difference is that in our local characterization, only the parents of X in the CPDAG are included in the separation set, and we rule out the siblings of X even though they may be parents of X in the true causal DAG. Since in a causal DAG the non-descendants of a variable are those which are definitely not caused by the variable, Theorem 2 can be regarded as an extension of the local Markov property to CPDAGs.

Following Theorem 2, we can distinguish the definite causes and the possible causes from the definite non-causes with a local causal structure query and a d-separation query. Next, we characterize explicit and implicit causal relations locally in Theorem 3 and Theorem 4, respectively, which together characterize definite causal relations.

Theorem 3. Let G∗ be a CPDAG. For any two distinct vertices X and Y in G∗, X is an explicit cause of Y if and only if X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) holds.

The local characterization in Theorem 3 includes a single d-separation claim, X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗), which means that the set pa(X, G∗) ∪ sib(X, G∗) cannot block all paths from X to Y. In the proof of this theorem, we show that this claim is equivalent to the existence of at least one path from X to Y in G∗ on which the node adjacent to X is a child of X. Based on Maathuis and Colombo (2015, Lemma 7.2) and Perković et al. (2017, Lemma B.1), this implies that there is a directed path from X to Y in G∗.
Suppose that G∗ is a CPDAG and M is the set of maximal cliques of the induced subgraph of G∗ over sib(X, G∗). Then, X is an implicit cause of Y if and only if X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) and X ̸⊥⊥ Y | pa(X, G∗) ∪ M for every M ∈ M.

The definition of a maximal clique is given in Appendix A. In Theorem 4, the first condition, X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗), ensures that X is not an explicit cause of Y, and the second condition, X ̸⊥⊥ Y | pa(X, G∗) ∪ M for every M ∈ M, guarantees that X is not a possible cause of Y. These two conditions are local in the sense that both sib(X, G∗) and pa(X, G∗) are subsets of X's neighbors in G∗, and a maximal clique M is also a subset of sib(X, G∗). Once we obtain the induced subgraph of G∗ over adj(X, G∗), we know sib(X, G∗) and M, and the conditional independence queries can then be answered accordingly if we have the oracles. As mentioned in Section 3.2, definite causes consist of both explicit causes and implicit causes. Therefore, Theorems 3 and 4 give a sound and complete local characterization of definite causal relations, as follows.
Suppose that G∗ is a CPDAG and M is the set of maximal cliques of the induced subgraph of G∗ over sib(X, G∗). Then, X is a definite cause of Y if and only if X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗), or X ̸⊥⊥ Y | pa(X, G∗) ∪ M for every M ∈ M.

Together with Theorem 2, Corollary 1 can be used to identify the definite causal relations and the definite non-causal relations. This result is local in the sense that it only depends on the local structure around the treatment X and a limited number of d-separation queries. When data is available in practice, d-separation queries can be answered by performing statistical independence tests. Thus, local characterizations are particularly meaningful for identifying the types of causal relations from observational data.

A Local Learning Algorithm
In this section, we discuss how to learn the type of causal relation from observational data. A local algorithm, which exploits the local characterizations in Section 4 directly, is provided in this section.

The main procedure of our local algorithm is summarized in Algorithm 1. The input of Algorithm 1 consists of pa(X, G∗), the induced subgraph of G∗ over sib(X, G∗), and some independence oracles. The first two arguments, pa(X, G∗) and the induced subgraph over sib(X, G∗), can be learned locally by using the variant of the MB-by-MB algorithm proposed by Liu et al. (2020b, Algorithm 3), which is designed for learning the chain component containing a given target variable and the directed edges connected to the variables in the chain component. The third argument (the independence oracles), as discussed in Section 4, can be replaced by statistical independence tests in practice. Overall, the procedure given in Algorithm 1 is a direct application of the local characterizations in Theorems 2, 3 and 4, and thus we have:

Theorem 5.
Assuming that the true causal DAG is in the Markov equivalence class represented by G∗ and that the independence oracles are faithful to the true DAG, the local ITC (Algorithm 1) can correctly identify the type of causal relation between X and Y.

Algorithm 1: A local algorithm for identifying the type of causal relation (local ITC)

Require: A treatment X, a target Y, pa(X, G∗), the induced subgraph of G∗ over sib(X, G∗), and independence oracles.
Ensure: The type of causal relation between X and Y.
  if X ⊥⊥ Y | pa(X, G∗) then return "X is a definite non-cause of Y" end if
  if X ̸⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) then return "X is an explicit cause of Y" end if
  M ← the set of maximal cliques of the induced subgraph of G∗ over sib(X, G∗)
  if there exists M ∈ M such that X ⊥⊥ Y | pa(X, G∗) ∪ M then return "X is a possible cause of Y" end if
  return "X is an implicit cause of Y"

The complexity of Algorithm 1 can be measured by the maximum number of conditional independence tests (or d-separation queries). Clearly, the maximum number of conditional independence tests performed by Algorithm 1 is m + 2, where m is the number of maximal cliques of sib(X, G∗). Fortunately, there are only linearly many maximal cliques (with respect to the number of vertices) in a chordal graph (Rose and Tarjan, 1975; Blair and Peyton, 1993), so the number of conditional independence tests needed in Algorithm 1 is at most O(|sib(X, G∗)|).

We also present a global learning algorithm as well as its complexity analysis in Appendix B.1. The global learning algorithm uses the graphical criteria as well as the properties of different causal relations to identify them. We refer interested readers to Algorithm 2 and Algorithm 3 in Appendix B.1 for more details.

On the other hand, if we use the causal-effect-based method mentioned in Section 1 (or, more specifically, the possible implementation described in Appendix B.2 and tested in Section 6), we have to enumerate all possible causal effects of X on Y and test whether they are all zero (non-zero). In the worst case, the number of conditional independence tests required in this method is 2^|sib(X, G∗)|. Hence, Algorithm 1 is theoretically more efficient than the current causal-effect-based method, as the latter needs to perform exponentially many conditional independence tests in the worst case while the former only needs linearly many tests.
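The steps of Algorithm 1 can be sketched as follows (our illustration, not the authors' released code). The local structure pa(X, G∗) and the induced subgraph over sib(X, G∗) are taken as given, maximal cliques are enumerated with a small Bron-Kerbosch routine, and the independence oracle is an arbitrary callable, to be backed by statistical tests in practice; all names are our own.

```python
def maximal_cliques(vertices, adj):
    """Bron-Kerbosch enumeration of maximal cliques of an undirected graph.
    adj maps each vertex to its set of neighbors."""
    cliques = []

    def bk(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    bk(set(), set(vertices), set())
    return cliques

def local_itc(x, y, pa, sib, sib_adj, indep):
    """Algorithm 1 (local ITC). pa = pa(x, G*); sib = sib(x, G*); sib_adj is the
    induced subgraph of G* over sib; indep(a, b, z) answers a CI query."""
    pa, sib = set(pa), set(sib)
    if indep(x, y, pa):
        return "definite non-cause"
    if not indep(x, y, pa | sib):
        return "explicit cause"
    # sib induces a chordal graph, so it has only linearly many maximal cliques
    for m in maximal_cliques(sib, sib_adj):
        if indep(x, y, pa | m):
            return "possible cause"
    return "implicit cause"
```

For example, for the chain CPDAG A − B − C (true DAG A → B → C, with the oracle answering A ⊥⊥ C | B but not A ⊥⊥ C), the query (A, C) is classified as a possible cause, as expected.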
Experiments

In this section, we illustrate and evaluate the proposed method experimentally using synthetic data sets generated from linear structural equation models with Erdös-Rényi random DAGs. We compare our local ITC with the global one as well as with two causal-effect-based methods (CE-based for short): a regular testing method (CE-test for short) and a multiple testing method (CE-multi for short). In both CE-based methods, we used IDA (Maathuis et al., 2009) to estimate all possible causal effects of a treatment on a target and checked whether all of these causal effects are zero. The CE-test method checks each effect one by one, and CE-multi tests all effects simultaneously with multiple testing methods.

In Section 6.1, we assume that the true CPDAG or its local structure of interest is available. In this case, the synthetic data was only used to estimate causal effects by the CE-based methods, and to perform conditional independence tests by the local ITC method. Therefore, in these experiments we can compare our proposed methods with the CE-based methods directly, since structure learning from data is not needed and no corresponding error is introduced. In Section 6.2, we further evaluate the methods without providing the true CPDAG or its local structure. Three global structure learning algorithms, including the PC algorithm (Spirtes and Glymour, 1991), the stable PC algorithm (Colombo and Maathuis, 2014) and the GES algorithm (Chickering, 2002a), were used to learn CPDAGs, and the variant of MB-by-MB (Liu et al., 2020b) was used to learn the parents and siblings of the vertices of interest. In all of these experiments, the algorithms PC, stable PC, GES, and IDA were called from the pcalg R package (Kalisch et al., 2012), the significance level α of all statistical independence tests was set to 0. and all code was run on a computer with an Intel 2.5GHz CPU and 8 GB of memory.

Let ER(n, d) denote a random DAG with n vertices and average in-and-out degree d.
In our experiment, n is chosen from { , , , , , } and d is chosen from { . , . , . , . , . , . }. To generate a data set, we first sampled an ER(n, d) random DAG, and for each directed edge X_i → X_j in this DAG we drew an edge weight β_ij from a uniform distribution (two weight settings, positive weights and mixed positive/negative weights, are used in Section 6.1). Given the weights, each variable was generated according to the linear model

    X_j = Σ_{X_i ∈ pa(X_j)} β_ij X_i + ε_j,    j = 1, ..., n,    (1)

where ε_1, ..., ε_n are independent N(0, 1) noises. Finally, we drew N samples from this linear model. In Section 6.1 we set N ∈ {50, 250}, and in Section 6.2 we set N ∈ { , , }.

We generated 5000 data sets for each setting, and for each data set we randomly sampled a treatment variable and a target variable. We then explored the causal relation between the treatment and the target and compared it with the true one read from the corresponding CPDAG of the sampled DAG. Since there are three types of causal relations, we use the Kappa coefficient as well as the true positive rate (TPR) and the false positive rate (FPR) to measure the performance of each method.

In this section, the true CPDAGs or their local structures are provided to exclude estimation biases caused by graph structure learning from data. (Experiments show that different significance levels give similar results.)

N    d   Method     Def. non-causes    Poss. causes     Def. causes
                    TPR      FPR       TPR      FPR     TPR      FPR
50   2   local ITC
         CE-test    0.9893   0.4557    0.4211   0.0024  0.3016   0.0136
         CE-multi   0.9899   0.4937    0.3684   0.0016  0.3016   0.0136
     3   local ITC
         CE-test    0.9864   0.6302    0.3088   0.0056  0.1889   0.0168
         CE-multi   0.9870   0.6484    0.2892   0.0048  0.1778   0.0168
     4   local ITC
         CE-test    0.9814   0.6789    0.2618   0.0103  0.1282   0.0230
         CE-multi   0.9818   0.7083    0.2403   0.0082  0.1218   0.0228
250  2   local ITC
         CE-test    0.9913   0.0904    0.6293   0.0018  0.8400   0.0139
         CE-multi   0.9921   0.1024    0.6207   0.0016  0.8000   0.0135
     3   local ITC
         CE-test    0.9828   0.2404    0.6667   0.0052  0.6014   0.0198
         CE-multi   0.9839   0.2463    0.6614   0.0042  0.5946   0.0198
     4   local ITC
         CE-test    0.9767   0.3281    0.6606   0.0128  0.4441   0.0267
         CE-multi   0.9770   0.3351    0.6561   0.0140  0.4155   0.0267
Table 1: The detailed TPRs and FPRs of different methods on 100-node random graphs with d ∈ {2, 3, 4} and positive weights. The true graph structures are provided. The standard deviations are omitted as they are all below 0.

N    Method     Average degree
                2                 3                 4
50   CE-test    2.9619 (0.9041)  2.8218 (0.8119)  2.7001 (0.7926)
     CE-multi   3.0261 (0.9172)  2.8742 (0.8226)  2.7451 (0.8002)
250  CE-test    1.8515 (0.4090)  1.7630 (0.3461)  1.7049 (0.2990)
     CE-multi   1.8726 (0.4159)  1.7808 (0.3507)  1.7211 (0.3005)
Table 2: The averages and standard deviations (in parentheses) of the ratios of the time used by the CE-based methods to that used by the local ITC on 100-node random graphs with d ∈ {2, 3, 4} and positive weights. The true graph structures are provided.

Since the output of the global ITC is invariant when the true CPDAG is provided, the TPR and FPR for learning each type of causal relation are 1 and 0, respectively. Except for the global ITC, the local ITC and CE-based methods

N    d   Method     Def. non-causes    Poss. causes     Def. causes
                    TPR      FPR       TPR      FPR     TPR      FPR
50   2   local ITC
         CE-test    0.9957   0.7532    0.1979   0.0008  0.0862   0.0063
         CE-multi   0.9959   0.7792    0.1771   0.0002  0.0862   0.0063
     3   local ITC
         CE-test    0.9931   0.8123    0.1579   0.0015  0.0838   0.0099
         CE-multi   0.9938   0.8291    0.1368   0.0010  0.0838   0.0093
     4   local ITC
         CE-test    0.9917   0.8770    0.0862   0.0036  0.0841   0.0088
         CE-multi   0.9917   0.8843    0.0776   0.0036  0.0779   0.0088
250  2   local ITC
         CE-test    0.9954   0.5191    0.3482   0.0014  0.4366   0.0067
         CE-multi   0.9956   0.5246    0.3482   0.0010  0.4366   0.0067
     3   local ITC
         CE-test    0.9936   0.6210    0.3140   0.0017  0.3028   0.0091
         CE-multi   0.9936   0.6242    0.3081   0.0017  0.3028   0.0091
     4   local ITC
         CE-test    0.9876   0.6528    0.3106   0.0086  0.2296   0.0128
         CE-multi   0.9881   0.6618    0.3021   0.0082  0.2201   0.0128
Table 3: The detailed TPRs and FPRs of different methods on 100-noderandom graphs with d ∈ { , , } and mixed weights . The true graphstructures are provided. The standard deviations are omitted as they are allbelow 0 . N Method Average degree2 3 450 CE-test 2.9696 (0.9242) 2.8080 (0.8331) 2.7102 (0.7684)CE-multi 3.0357 (0.9368) 2.8487 (0.8365) 2.7525 (0.7940)250 CE-test 1.8377 (0.3810) 1.7797 (0.3570) 1.7242 (0.2959)CE-multi 1.8623 (0.3935) 1.8004 (0.3608) 1.7380 (0.3040)
Table 4: The averages and standard deviations (in parentheses) of the ratios of the time used by the CE-based methods to the local ITC on 100-node random graphs with d ∈ {2, 3, 4} and mixed weights. The true graph structures are provided.

To assess these methods, we ran experiments on data generated with positive weights (drawn uniformly from a positive interval) and with mixed weights (drawn uniformly from a range that also contains negative values). The sample size N and the degree d have similar effects on all methods: the larger the sample size and the smaller the average degree, the better the performance.

Benefiting from fewer conditional independence tests, the local ITC algorithm is more stable, more accurate, and more efficient than the CE-based methods, especially when identifying definite causal relations. Comparing Table 3 with Table 1, one can see that although all TPRs drop, the TPRs of the local ITC drop less than those of the CE-based methods. For instance, in Table 1, for the case of N = 250, the TPRs of the local ITC and CE-test for identifying definite causal relations are 0.8800 and 0.8400, respectively, while when negative weights are allowed in Table 3, the TPRs of the local ITC and CE-test decrease to 0.7183 and 0.4366, corresponding to 18% and 48% reductions, respectively.

In this section, we further study our proposed methods experimentally when the true causal structures are not available. We used the variant of MB-by-MB (Liu et al., 2020b) to learn the parents and siblings of the vertices of interest, and used the PC algorithm, the stable PC algorithm (PCS) and GES to learn entire CPDAGs.

Figure 3 shows the Kappa coefficients of different methods based on 20-, 50- and 100-node graphs. As one can see from Figure 3, the proposed local ITC outperforms the other methods in almost all settings. The global ITC combined with PC or PCS is also competitive when the graph is extremely sparse (d < 2) or relatively dense.

Figure 3: The Kappa coefficients of different methods (local and global ITC, CE-test and CE-multi, each combined with local structure learning, PC, PCS, or GES) on random graphs with positive weights. Panels (a)-(i) plot the Kappa coefficient against the average degree for n ∈ {20, 50, 100} and N ∈ {500, 1000, 3000}. The graph structures are learned from data.

In Table 5, we report the TPRs and FPRs of different methods for identifying each type of causal relation based on 100-node graphs with average degree d ∈ {2, 3, 4}. The sample size N is set to 1000. Table 5 shows that the local ITC is more stable than the others, especially on relatively sparse graphs. It can be seen that when d = 2 and d = 3, the FPR of the local ITC is always one of the best three FPRs, and the TPR of the local ITC is one of the highest three TPRs, except for learning non-causal relations when d = 3.
Moreover, the local ITC performs considerably well when learning possible causal relations and definite causal relations, and always achieves a relatively low FPR when learning definite non-causal relations.

d   Method            Def. non-causes    Poss. causes      Def. causes
                      TPR     FPR        TPR     FPR       TPR     FPR
2   local ITC
    CE-test (local)   0.9870
    CE-test (PC)      0.9789
    CE-test (PCS)     0.9781
3   CE-test (local)   0.9538
    CE-test (PC)      0.9221  0.0907
    CE-test (PCS)     0.9133
4   CE-test (PC)      0.8019  0.2081     0.6584  0.0376    0.6833  0.1636
    global ITC (PCS)
    CE-test (PCS)     0.7869

Table 5: The detailed TPRs and FPRs of different methods on 100-node random graphs with sample size N = 1000, average degree d ∈ {2, 3, 4} and positive weights. The standard deviations are omitted as they are all small.

We now compare the total computational time of the different methods in Figure 4. The total computational time consists of two parts: the time for learning the required graph structure and the time for identifying the type of causal relation. Generally, the total time is dominated by the learning of the graph structure. As shown in Figure 4, since learning a local structure takes less time than learning a global structure, the local ITC and the local versions of the CE-based methods are more efficient than the global ones. Moreover, the global ITC is much faster than the global CE-based methods, as the latter need to perform additional independence tests. In our experiments, excluding the time used to learn the graph structure, the global ITC takes less than 0.001 seconds to identify types, while the global CE-based methods are 10 times slower. This also explains why the blue, green, or purple dashed/dotted lines in Figure 4 are above the corresponding solid lines.

Figure 4: The total CPU time (in seconds) of different methods on random graphs with positive weights. Panels (a)-(i) plot the CPU time against the average degree for n ∈ {20, 50, 100} and N ∈ {500, 1000, 3000}. The graph structures are learned from data.

Since the local ITC and the local CE-based methods use the same local learning algorithm (the variant of MB-by-MB) and local structure learning dominates their computational time, the total times of these three local methods are very close in Figure 4. To compare the computational time of the local ITC and the local versions of the CE-based methods in detail, Table 6 reports the averages and standard deviations of the ratios of the CPU time of the local CE-based methods to that of the local ITC.

N     Method            Average degree
                        2                3                4
500   CE-test (local)   1.3735 (0.1798)  1.3370 (0.1606)  1.3076 (0.1398)
      CE-multi (local)  1.3830 (0.1811)  1.3434 (0.1635)  1.3148 (0.1407)
1000  CE-test (local)   1.2998 (0.1446)  1.2703 (0.1214)  1.2520 (0.1303)
      CE-multi (local)  1.3076 (0.1435)  1.2787 (0.1236)  1.2617 (0.1308)
3000  CE-test (local)   1.1854 (0.0885)  1.1818 (0.1112)  1.1762 (0.1190)
      CE-multi (local)  1.1849 (0.0869)  1.1829 (0.1078)  1.1753 (0.1135)

Table 6: The averages and standard deviations (in parentheses) of the ratios of the time used by the local CE-based methods to the local ITC on 100-node random graphs with positive weights. The graph structures are learned from data.

It can be seen that the local ITC is faster than the local CE-based methods, since all ratios are greater than 1.
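For concreteness, the evaluation measures used throughout this section (the Kappa coefficient and the one-vs-rest TPR/FPR over the three relation types) can be sketched in a few lines of Python. The function names and string labels below are our own illustrative choices, not the paper's implementation, and the sketch assumes both label lists are non-empty and not in perfect chance-level agreement.

```python
from collections import Counter

def kappa(true_labels, pred_labels):
    """Cohen's Kappa for multi-class agreement between two label lists."""
    n = len(true_labels)
    # observed agreement: fraction of exact matches
    observed = sum(t == p for t, p in zip(true_labels, pred_labels)) / n
    # expected agreement under independent labelling with the same marginals
    true_freq, pred_freq = Counter(true_labels), Counter(pred_labels)
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / n ** 2
    return (observed - expected) / (1 - expected)

def tpr_fpr(true_labels, pred_labels, cls):
    """One-vs-rest TPR and FPR for a single causal-relation type `cls`."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    return tp / (tp + fn), fp / (fp + tn)
```

With three relation types, the Kappa coefficient summarizes overall agreement in a single number, while the per-type TPR/FPR pairs expose where a method's errors concentrate, as in Tables 1, 3 and 5.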
In this paper, we present a local method for identifying the type of causal relation between a treatment and a target without evaluating causal effects or learning a global causal structure. We investigate the existence of a causal path from the treatment to the target based on a CPDAG and provide a sufficient and necessary graphical condition for checking the existence of such a path. We also study the graphical properties of each type of causal relation. Inspired by these properties, we further propose a local identification criterion for each type of causal relation, which depends only on the induced subgraph of the true CPDAG over the adjacent variables of the treatment, together with some queries about d-separation relations. The local criteria naturally lead to a local learning algorithm for identifying the type of causal relation, provided that the faithfulness condition holds. Simulation studies show that the proposed local algorithm performs well.

Our work introduces local characterizations of the types of causal relations, which are helpful for understanding the causal relations hidden behind observational data. Beyond these theoretical contributions, our results have many potential applications. Firstly, they can guide researchers in performing interventional studies. For example, no intervention is needed if the treatment is a definite non-cause of the target. Secondly, they can be combined with the IDA algorithm to reduce the computational cost of estimating possible causal effects. For instance, if the treatment is a non-cause of the target, then without any computation we can conclude that all possible effects are zero (Maathuis et al., 2009). Besides, our results can be used to decide the significance of estimated causal effects. If the treatment is a definite cause of the target, then all the estimated causal effects, no matter how small they are, are significant.
To some extent, our results can compensate for the limitations of statistical tests when testing the significance of a possible causal effect. Thirdly, our theorems and algorithms can be used to check whether a variable is a mediator between a cause and a target. All these applications are important in causal inference.

Our results can be easily extended to interventional essential graphs (He and Geng, 2008; Hauser and Bühlmann, 2012), which can be used to represent Markov equivalence classes in which some variables are intervened on. Interventional essential graphs are also chain graphs and can be learned from a mixture of observational and interventional data, and extending our proposed concepts, theorems, and algorithms to them is straightforward. A possible direction for future work is to extend the global characterization of definite causal relations to maximal PDAGs. Maximal PDAGs are generalizations of CPDAGs and have been frequently used for representing causal background knowledge (Perković et al., 2017; Fang and He, 2020; Perković, 2020; Witte et al., 2020; Guo and Perković, 2020). Another interesting direction is to take hidden variables and selection bias into account; for example, one may extend the results to partial ancestral graphs (Richardson and Spirtes, 2002; Ali et al., 2005; Zhang, 2008).
Acknowledgements
This work was supported by the National Key R&D Program of China (2018YFB1004300) and NSFC (11671020, 11771028, 11971040).

Appendix

A Graph Terminology
A graph G is defined by a vertex set (or node set) V and an edge set E. A graph is directed (undirected, partially directed) if all edges in the graph are directed (undirected, a mixture of directed and undirected). The skeleton of a graph G is the undirected graph obtained by turning every directed edge in G into an undirected edge. Given a subset V' of V, the induced subgraph of G over V' is defined as G' = (V', E'), where E' ⊆ E contains only the edges between vertices in V'. If a directed edge X_i → X_j occurs in G, we call X_i a parent of X_j and X_j a child of X_i. Two distinct vertices X_i and X_j are siblings of each other if the undirected edge X_i − X_j appears in G. If for any nonempty proper subset V' of V, there exist X_1 ∈ V' and X_2 ∈ V \ V' such that X_1 and X_2 are adjacent, then the graph is called connected; otherwise, it is disconnected. Furthermore, if there is an edge between any two vertices, then the graph is called complete.

A path is a sequence of distinct vertices (X_{k_1}, ..., X_{k_j}) such that X_{k_i} is adjacent to X_{k_{i+1}}. X_{k_1} and X_{k_j} are the endpoints of the path, while the other vertices on the path are intermediate vertices (nodes). The length of a path is the number of vertices on the path minus one. A path is called partially directed from X_{k_1} to X_{k_j} if X_{k_i} ← X_{k_{i+1}} does not occur in G for any i = 1, ..., j − 1. A partially directed path is directed (undirected) if all edges on the path are directed (undirected). A cycle is a path from a vertex to itself; partially directed (directed, undirected) cycles are defined similarly. We note that both directed paths (cycles) and undirected paths (cycles) are partially directed. A vertex X_i is an ancestor of X_j, and X_j a descendant of X_i, if there is a directed path from X_i to X_j or X_i = X_j. A chord of a path (cycle) is any edge joining two nonconsecutive vertices on the path (cycle). A path (cycle) without any chord is called chordless; any path of length one is chordless. An undirected graph is chordal if it has no chordless cycle of length greater than three. Given a chordal graph C = (V, E), if the induced subgraph of C over V' ⊆ V is complete, then V' is called a clique of C. Moreover, if there is no V'' such that V' ⊂ V'' and V'' is a clique, then V' is called a maximal clique. A directed graph is acyclic (a DAG) if it contains no directed cycles.

B Algorithms
We now provide global ITC and causal-effect-based methods for learning thetypes of causal relations.
B.1 A Global Learning Algorithm
Given a target variable and a CPDAG, as shown in Example 2, identifying adefinite non-causal relation or an explicit causal relation is straightforward.To discriminate an implicit causal relation from a possible causal relation, weneed an approach to find critical sets. The next proposition is particularlyuseful.
Proposition 4. For any two distinct vertices X, Y in a CPDAG G* such that X is not an explicit cause of Y, it holds that C_{XY} = ∪_{Z ∈ Z} C_{XZ}, where C_{UV} denotes the critical set of U with respect to V and Z is the set of explicit causes of Y that are also in the chain component containing X.

Proposition 4 provides a factorization of the critical set of X with respect to Y. For simplicity, we call ∪_{Z ∈ Z} C_{XZ} the critical set of X with respect to Z. Algorithm 2 shows how to find ∪_{Z ∈ Z} C_{XZ} efficiently: it runs a breadth-first search and returns the critical set of X with respect to Z in G*_u. In Algorithm 2, we start from the siblings of X and then search chordless paths from the siblings until reaching some Z_i ∈ Z. Every chordless path starting from a sibling of X is recorded in a queue S as a triple (α, ψ, τ), where α and τ are the start and end points of the path, respectively, and ψ is the sibling of τ on the path. If τ is a member of Z, we add α to the critical set C and remove from S all triples whose first element is α; that is, we stop enumerating chordless paths starting with α. Otherwise, we extend the chordless path to the siblings of τ that are neither ψ nor siblings of ψ, and add the corresponding triples to the queue S. In this algorithm, a set of visited triples, H, is introduced to speed up the search by avoiding visiting the same triple twice.

Finally, we present a global learning approach for identifying the type of causal relation in Algorithm 3. Algorithm 3 is global in the sense that it takes an entire CPDAG as input. In Algorithm 3, we first check whether X and Y are in the same chain component. If they are, X is a possible cause of Y by Proposition 2. Otherwise, we find the set of explicit causes of Y and denote it by Z. This can be done by searching for the vertices that are connected to Y in the directed subgraph of G*.
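The search just described, together with the classification logic of Algorithm 3, can be sketched in Python. The representation here (a neighbour map `undir` for the undirected edges of G*_u and a `children` map for the directed edges of G*) and all function names are our own assumptions for illustration; this is a sketch of the procedure, not the authors' implementation.

```python
from collections import deque
from itertools import combinations

def critical_set(undir, x, zs):
    """Algorithm 2 sketch: BFS over triangle-free (hence, by Lemma 3 in the
    appendix, chordless) undirected paths from x until a member of zs is hit.
    Returns the siblings of x that start such a path."""
    crit, seen = set(), set()
    queue = deque((a, x, a) for a in undir[x])
    while queue:
        a, psi, tau = queue.popleft()
        seen.add((a, psi, tau))
        if tau in zs:
            crit.add(a)
            # stop enumerating paths that start with sibling a
            queue = deque(t for t in queue if t[0] != a)
        else:
            for b in undir[tau]:
                # extend only if the new triple keeps the path triangle-free
                if b != psi and b not in undir[psi]:
                    if (a, tau, b) not in seen and (a, tau, b) not in queue:
                        queue.append((a, tau, b))
    return crit

def _reach(adj, start):
    """Vertices reachable from `start` following edges in `adj`."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def global_itc(undir, children, x, y):
    """Algorithm 3 sketch: classify the causal relation of x to y."""
    if y in _reach(undir, x):            # same chain component
        return "possible cause"
    parents = {}
    for u, chs in children.items():
        for c in chs:
            parents.setdefault(c, set()).add(u)
    zs = _reach(parents, y) - {y}        # explicit causes of y
    if x in zs:
        return "explicit cause"
    c = critical_set(undir, x, zs)
    if not c:
        return "definite non-cause"
    if all(v in undir[u] for u, v in combinations(c, 2)):
        return "possible cause"          # C induces a complete subgraph
    return "implicit cause"
```

For example, with x − c1, x − c2 undirected (c1 and c2 not adjacent) and c1 → y, c2 → y directed, the critical set is {c1, c2}, which is not complete, so x is classified as an implicit cause of y.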
If X ∈ Z, X is an explicit cause of Y; otherwise, we find the critical set C of X with respect to Z using Algorithm 2. When C = ∅, there are no explicit causes of Y in the chain component containing X, so X is not a cause of Y. Finally, using Theorem 4, Algorithm 3 distinguishes between possible causes and implicit causes.

Algorithm 2 Finding the critical set of a given X with respect to a set Z
Require: A chordal graph G*_u, a variable X in G*_u, and a variable set Z ≠ ∅ such that X ∉ Z.
Ensure: C, the critical set of X with respect to Z in G*_u.
  initialize C = ∅, a waiting queue S = [], and a set H = ∅,
  for α ∈ adj(X) do
    add (α, X, α) to the end of S,
  end for
  while S is not empty do
    take the first element (α, ψ, τ) out of S and add it to H,
    if τ ∈ Z then
      add α to C, and remove from S all triples whose first element is α,
    else
      for β ∈ adj(τ) such that β ∉ adj(ψ) ∪ {ψ} do
        if (α, τ, β) ∉ H and (α, τ, β) ∉ S then
          add (α, τ, β) to the end of S,
        end if
      end for
    end if
  end while
  return C

Since Algorithm 2 does not visit the same triple (α, ψ, τ) twice, where α is a sibling of X and τ is a sibling of ψ in G*_u, the complexity of Algorithm 2 in the worst case is O(|sib(X, G*)| · |E(G*_u)|), where |E(G*_u)| is the number of edges in G*_u. Now consider the computational complexity of the global ITC (Algorithm 3). The complexity of checking the undirected connectivity of X and Y, or of finding an(Y, G*), is O(|E(G*)|), where |E(G*)| is the number of edges in G*. Consequently, the complexity of the global ITC is O(|E(G*)| + |sib(X, G*)| · |E(G*_u)|).

B.2 Causal-Effect-Based Methods
Section 6 briefly describes the causal-effect-based methods used in our experiments. Now we summarize the detailed procedure in Algorithm 4.

Algorithm 3 A global algorithm for identifying the type of causal relation (global ITC).
Require: A CPDAG G*, a variable X and a target Y in G*.
Ensure: The type of causal relation between X and Y.
  if X and Y are connected by a path in G*_u then
    return "X is a possible cause of Y",
  end if
  let Z = an(Y, G*),
  if X ∈ Z then
    return "X is an explicit cause of Y",
  end if
  use Algorithm 2 to find the critical set C of X with respect to Z in G*_u,
  if |C| = 0 then
    return "X is a definite non-cause of Y",
  end if
  if C induces a complete subgraph of G*_u then
    return "X is a possible cause of Y",
  end if
  return "X is an implicit cause of Y".

The first four lines in Algorithm 4 are borrowed from the IDA algorithm (Maathuis et al., 2009): they enumerate all possible parental sets of the treatment X and estimate all possible causal effects of X on Y. The possible effects are stored in a set, denoted by Θ_X. Next, Algorithm 4 uses a regular testing method or a multiple testing method to test whether every causal effect in Θ_X is zero. Based on the testing results, lines 6-12 return the type of causal relation between X and Y by the definitions of the different types of causes.

We note that the input of Algorithm 4 includes a CPDAG G* or its induced subgraph over pa(X, G*) ∪ sib(X, G*). In the original version of the IDA algorithm, the authors used an entire CPDAG as input (Maathuis et al., 2009), probably because there was no efficient approach to learn the induced subgraph over pa(X, G*) ∪ sib(X, G*) locally at that time. With the help of the variant of MB-by-MB (Liu et al., 2020b, Algorithm 3), we can now learn this induced subgraph efficiently. Therefore, Algorithm 4 could also be made local by combining it with the variant of MB-by-MB.

Algorithm 4 A causal-effect-based algorithm for identifying the type of causal relation.
Require: A treatment X, a target Y, and a CPDAG G* or its induced subgraph over pa(X, G*) ∪ sib(X, G*).
Ensure: The type of causal relation between X and Y.
  set Θ_X = ∅,
  for each S ⊆ sib(X, G*) such that orienting S → X and X → sib(X, G*) \ S does not introduce any v-structure collided on X do
    estimate the causal effect of X on Y by adjusting for S ∪ pa(X, G*), and add the causal effect to Θ_X,
  end for
  test whether every causal effect in Θ_X is zero,
  if every causal effect in Θ_X is zero then
    return "X is a definite non-cause of Y",
  end if
  if every causal effect in Θ_X is non-zero then
    return "X is a definite cause of Y",
  end if
  return "X is a possible cause of Y".

C Detailed Proofs
The proofs of the lemmas, theorems and corollaries in the main text of this paper are presented in this section. Before that, we first introduce some prerequisite concepts and results.

Let π = (v_0, v_1, ..., v_k) denote a path of length k. The subpath π(v_i, v_j) of π, with j > i, is the path (v_i, v_{i+1}, ..., v_{j−1}, v_j). If k ≥ 2, we say that three consecutive vertices v_i, v_{i+1} and v_{i+2} form a triangle on π if v_i is adjacent to v_{i+2}; π is called triangle-free if it does not contain any triangle. For a path in a chordal graph, we have the following result.

Lemma 3.
In any chordal graph, a path is chordless if and only if it is triangle-free.

Proof.
Let π = (v_0, v_1, ..., v_k) denote a path of length k ≥ 2. If π is chordless, then it is obviously triangle-free. Suppose π is not chordless; then we can choose a chord v_i − v_j such that the subpath π(v_i, v_j) has no chord except for v_i − v_j. If j = i + 2, then v_i, v_{i+1} and v_j form a triangle. If j > i + 2, then π(v_i, v_j) and v_i − v_j form a cycle of length greater than 3. However, since the graph is chordal, this cycle must have a chord v_s − v_t with i ≤ s < t ≤ j, t ≥ s + 2 and t − s < j − i. This contradicts our assumption.

Lemma 3 is useful for finding chordless paths, since checking whether a path is triangle-free is much easier. The following is another useful result for chordal graphs.

Lemma 4.
Let ρ be a cycle of length greater than 3 in a given chordal graph, and let X be a vertex on ρ. If the two vertices adjacent to X on ρ are not adjacent to each other, then ρ has a chord of which X is an endpoint.

Proof. Let v_1 and v_2 be the two vertices adjacent to X on ρ. Suppose that ρ does not have a chord of which X is an endpoint. Since ρ has length greater than 3, ρ must have a chord. Clearly, any chord of ρ separates ρ into two sub-cycles. By assumption, it is easy to check that at least one sub-cycle contains X, v_1 and v_2. If this sub-cycle still has a chord, then we can construct another cycle containing X, v_1 and v_2 but with shorter length. Eventually, we obtain a cycle containing X, v_1 and v_2 without any chord. Since v_1 and v_2 are not adjacent, the length of this cycle must be greater than 3, which contradicts the definition of a chordal graph.

A chordal graph C can be turned into a directed graph by orienting its edges. If the resulting directed graph is a DAG without v-structures, then these orientations form a v-structure-free acyclic orientation of C (Bernstein and Tetali, 2017). Any v-structure-free acyclic orientation of a connected chordal graph has a unique source, that is, a vertex which has no parent. Conversely, any vertex in a connected chordal graph can be the unique source in some v-structure-free acyclic orientation (Blair and Peyton, 1993; Bernstein and Tetali, 2017). Recall that the undirected subgraph of a CPDAG is the union of disjoint connected chordal graphs called chain components (Andersson et al., 1997). Maathuis et al. (2009) argued that any v-structure-free acyclic orientation of the edges in G*_u corresponds to a DAG in the equivalence class represented by G*, and such an orientation can be considered separately for each of the disjoint chordal graphs (or chain components). Moreover, Maathuis et al. (2009) proved the following result.

Lemma 5. (Maathuis et al., 2009, Lemma 3.1)
Let G* be a CPDAG, X be a vertex of G*, and S ⊆ ne(X, G*). Then there is a DAG G ∈ [G*] such that pa(X, G) = pa(X, G*) ∪ S if and only if orienting S' → X for every S' ∈ S and X → D for every D ∈ sib(X, G*) \ S in G* does not introduce any new v-structure.

If Y ∈ pa(X, G*), then Y ∈ pa(X', G*) for every X' ∈ ne(X, G*). From this result we can prove that the condition in Lemma 5 holds if and only if S is a clique. As we will see, Lemma 5 plays a key role in proving the main results of this paper, as it provides a simple and local criterion for checking whether a subset of X's siblings can be X's parents in some equivalent DAGs.

Let π denote a path. A subsequence of π is obtained by deleting some vertices from π without changing the order of the remaining vertices. The final prerequisite result concerns the relation between directed paths and partially directed paths.

Lemma 6.
There is a directed path from X to Y in G* if and only if there is a partially directed path from X to Y in G* on which the vertex adjacent to X is a child of X.

Proof. The necessity is trivial. For sufficiency, let π = (X, v, ..., Y) be a partially directed path from X to Y in G* such that X → v. Let w be the first vertex from the side of Y that is adjacent to X; then we have X → w. Now consider π(w, Y). As π(w, Y) is also partially directed, by Perković et al. (2017, Lemma 3.6), there is a subsequence π* of π(w, Y) that forms a chordless partially directed path from w to Y in G*. Let π** denote the path obtained by concatenating X → w and π*; then π** is a partially directed path from X to Y on which the vertex adjacent to X is a child of X. By construction, X is not adjacent to any vertex on π** except w. Thus, by Maathuis and Colombo (2015, Lemma 7.2), π** is a directed path.

In the following Appendices C.1 to C.13, we present the detailed proofs of the main results in the main text, with the help of the aforementioned concepts and lemmas.

C.1 Proof of Lemma 1
Proof.
Given a CPDAG G*, for any DAG G ∈ [G*], Fang and He (2020, Lemma 2) showed that a variable X is not a cause of another variable Y in G if and only if the critical set of X with respect to Y in G*, denoted by C, is a subset of pa(X, G). Consequently, X is a cause of Y in G if and only if C is not a subset of pa(X, G), that is, if and only if some vertex in C is a child of X in G. The desired result follows from the definition of definite cause.

C.2 Proof of Lemma 2

Proof.
We first show the necessity. By definition, C ⊆ sib(X, G*) ∪ ch(X, G*). Let G ∈ [G*] be an arbitrary DAG. If C ∩ ch(X, G) = ∅ and C ≠ ∅, then C ⊆ pa(X, G), and thus we have C ⊆ sib(X, G*). Maathuis et al. (2009, Lemma 3) proved that a non-empty subset of sib(X, G*) can be a part of X's parent set in some equivalent DAG if and only if the subset induces a complete subgraph. Therefore, C induces a complete subgraph of G*. This completes the proof of the necessity. We next prove the sufficiency. If C = ∅, then it is clear that C ∩ ch(X, G) = ∅ for some G ∈ [G*]. Now assume that C ≠ ∅, C induces a complete subgraph of G*, and C ∩ ch(X, G*) = ∅. As C ⊆ sib(X, G*) ∪ ch(X, G*), we have C ⊆ sib(X, G*). Again, by Maathuis et al. (2009, Lemma 3), there is a DAG G in [G*] such that C ⊆ pa(X, G). Therefore, C ∩ ch(X, G) = ∅.

C.3 Proof of Theorem 1
Proof.
Theorem 1 follows from Lemmas 1 and 2 directly.
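The completeness condition appearing in Lemma 2 and Theorem 1 (that the critical set C induces a complete subgraph of G*_u) amounts to a pairwise adjacency check. A minimal sketch, assuming an illustrative neighbour-map representation of the undirected graph (the function name is ours):

```python
from itertools import combinations

def induces_complete_subgraph(vertices, adj):
    """True iff every pair of the given vertices is adjacent in `adj`.

    An empty or singleton set is vacuously complete, matching the
    convention that C = ∅ also yields a possible (not implicit) cause."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))
```

Since the check only inspects edges among the siblings of X, it is a purely local operation, in line with the local criteria developed in the main text.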
C.4 Proof of Proposition 1
Proof.
Denote the CPDAG containing X and Y by G*. It suffices to show that, if X and Y are in the same chain component, then there exists a DAG in [G*] in which Y is an ancestor of X. By Lemma 5, there exists a DAG G in [G*] such that pa(Y, G) = pa(Y, G*) and ch(Y, G) = ch(Y, G*) ∪ sib(Y, G*). Let π = (Y, v_1, ..., X) be the shortest path from Y to X. It is clear that π has no chord. Moreover, the corresponding path of π in G* is undirected, as X and Y are in the same chain component. On the other hand, Y → v_1 is in G by our construction. Hence, according to Perković et al. (2017, Lemma B.1), π is a directed path.

C.5 Proof of Proposition 2
Proof.
According to the definition of a partially directed path, an undirected path is also partially directed; hence, if X and Y are in the same chain component, they are possible causes of each other by Theorem 2 and Proposition 1. Conversely, if X and Y are possible causes of each other, then by Theorem 2, there is a partially directed path from X to Y as well as a partially directed path from Y to X. Clearly, neither of these two paths contains a directed edge, since otherwise a partially directed cycle containing directed edges would occur. Therefore, X and Y are connected by an undirected path, which means they are in the same chain component.

C.6 Proof of Proposition 3
Proof.
Let Z be a vertex in the chain component containing X; then every partially directed path between Z and Y, if any, must pass through X. Since there is a v-structure-free orientation of the chain component whose unique source is X, there is a DAG in the Markov equivalence class represented by G* such that no vertex in the chain component is an ancestor of Y except X.

C.7 Proof of Proposition 4
Proof. If X and Y are in the same chain component, then Z = {Y} and the equation trivially holds. Suppose that X and Y are not in the same chain component. We first prove that C_{XY} ⊆ ∪_{Z ∈ Z} C_{XZ}. Without loss of generality, we can assume C_{XY} ≠ ∅. By the definition of critical set, for any C ∈ C_{XY}, there is a chordless partially directed path ρ from X to Y on which C is adjacent to X. Since X and Y are not in the same chain component, ρ must contain a directed edge. Let Z be the vertex on ρ such that ρ(Z, Y) starts with a directed edge and Z is in the chain component containing X. By Maathuis and Colombo (2015, Lemma 7.2) or Perković et al. (2017, Lemma B.1), ρ(Z, Y) is a directed path. Therefore, Z is an explicit cause of Y. Since X is not an explicit cause of Y, we have Z ≠ X, and thus ρ(X, Z) is a chordless undirected path. This means C ∈ C_{XZ}. As C ∈ C_{XY} is arbitrary, we have C_{XY} ⊆ ∪_{Z ∈ Z} C_{XZ}.

Conversely, for any Z ∈ Z and C ∈ C_{XZ}, there is a chordless undirected path π_1 from X to Z on which C is adjacent to X. Let π_2 be the shortest directed path from Z to Y. As X and Y are not in the same chain component, Z ≠ Y. Hence, concatenating π_1 and π_2 results in a partially directed path from X to Y with length greater than 1; denote this path by π_3. If π_3 is chordless, then we have C ∈ C_{XY}. If this is not the case, then π_3 must have a chord connecting a vertex v_1 on π_1 and a vertex v_2 on π_2. Clearly, the edge between v_1 and v_2 must be directed, and the direction is v_1 → v_2. Since X is not an explicit cause of Y, it holds that v_1 ≠ X. Without loss of generality, we assume v_1 is the first vertex from X's side on π_1 that is adjacent to some v_2 on π_2; then concatenating π_1(X, v_1), v_1 → v_2 and π_2(v_2, Y) results in another partially directed path π_4 which is shorter than π_3. It is easy to verify that π_4 is chordless, and C is still adjacent to X on π_4. Therefore, C ∈ C_{XY}, and consequently we have ∪_{Z ∈ Z} C_{XZ} ⊆ C_{XY}. This completes the proof of Proposition 4.
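The d-separation statements used in the proofs of Theorems 2-4 below can be checked algorithmically with the standard moral-graph criterion: X ⊥⊥ Y | S holds in a DAG if and only if X and Y are disconnected, avoiding S, in the moralized graph induced by the ancestors of {X, Y} ∪ S. A hedged Python sketch; the `parents`-map representation and function names are our own illustrative choices:

```python
from itertools import combinations

def ancestors(parents, targets):
    """All ancestors of `targets` (including the targets themselves)."""
    seen, stack = set(targets), list(targets)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, x, y, s):
    """True iff x and y are d-separated given the set s in the DAG."""
    keep = ancestors(parents, {x, y} | set(s))
    # moralise the ancestral subgraph: connect each vertex to its parents,
    # and marry every pair of parents
    adj = {v: set() for v in keep}
    for v in keep:
        ps = [p for p in parents.get(v, ()) if p in keep]
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q)
            adj[q].add(p)
    # undirected search from x to y that never passes through s
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w == y:
                return False
            if w not in seen and w not in s:
                seen.add(w)
                stack.append(w)
    return True
```

For the chain x → m → y this returns True given {m} and False given ∅, while for the collider x → c ← y it returns True given ∅ and False given {c}, matching the behaviour of d-separation invoked in the proofs.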
C.8 Proof of Theorem 2
Proof.
Suppose X is a definite non-cause of Y. Then for every DAG G in the Markov equivalence class represented by G*, Y is a non-descendant of X. Since Lemma 5 implies that there is a DAG G such that pa(X, G) = pa(X, G*) and ch(X, G) = adj(X, G*) \ pa(X, G*), we have X ⊥⊥ Y | pa(X, G*) by the local Markov property. On the other hand, if X is a definite cause or a possible cause of Y, then by definition there is a DAG G in the Markov equivalence class represented by G* in which X is an ancestor of Y. Let π be a directed path from X to Y in G. Since every vertex on π is a non-collider and none of the vertices on π is in pa(X, G*), π is active given pa(X, G*), and thus X ⊥̸⊥ Y | pa(X, G*).

C.9 Proof of Theorem 3
Proof. If X is an explicit cause of Y , then there is a directed path π from X to Y in G ∗ . Hence, for any DAG G in the Markov equivalence class representedby G ∗ , π is directed in G , which means π has no collider in G . However, noneof the vertices on π is a member of pa ( X, G ∗ ) or sib ( X, G ∗ ), since otherwise,a directed cycle or a partially directed cycle with directed edges wouldoccur in G ∗ . Therefore, π is active given pa ( X, G ∗ ) ∪ sib ( X, G ∗ ), which means X ⊥6 ⊥ Y | pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). Conversely, suppose X is not an explicit causeof Y . In the following, we will prove that X ⊥⊥ Y | pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) holds.By Lemma 5, there is a DAG G in the Markov equivalence class representedby G ∗ such that ch ( X, G ) = sib ( X, G ∗ ) ∪ ch ( X, G ∗ ) and pa ( X, G ) = pa ( X, G ∗ ).Consider a path π from X to Y in G . If the length of π is 1, then thecorresponding path of π in G ∗ must be X ← Y or X − Y . Thus, π is blockedgiven pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). If the length of π is greater than 1, withoutloss of generality we can assume π = ( X, v , ..., v n , Y ). If v ∈ pa ( X, G ), then π is blocked by pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) since v cannot be a collider on π . If v ∈ ch ( X, G ∗ ), then π is not directed, since otherwise, the correspondingpath in G ∗ would be a partially directed path from X to Y where the nodeadjacent to X is a child of X . Therefore, there must be a collider on π .Let v i be the collider nearest to X . If v i ∈ an ( pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) , G ),there exists a partially directed cycle with directed edges in G ∗ , which isimpossible. Thus, v i / ∈ an ( pa ( X, G ∗ ) ∪ sib ( X, G ∗ ) , G ), and π is blocked by pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). Finally, in the case where v ∈ sib ( X, G ∗ ), if v is anon-collider, π is clearly blocked by pa ( X, G ∗ ) ∪ sib ( X, G ∗ ). 
If v_1 is a collider, then v_2 is adjacent to X, which means v_2 ∉ ch(X, G∗), since otherwise both X → v_2 → v_1 − X and X → v_2 − v_1 − X would be partially directed cycles with directed edges in G∗. This means v_2 ∈ pa(X, G∗) ∪ sib(X, G∗). Since v_2 is a non-collider on π, π is blocked by pa(X, G∗) ∪ sib(X, G∗). This completes the proof of Theorem 3.

C.10 Proof of Theorem 4
Proof.
Let C be the critical set of X with respect to Y in G∗. Suppose that X is an implicit cause of Y. Then by Theorem 3, X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗). For any M_w ∈ M, from Theorem 1 we know that C \ M_w ≠ ∅. Therefore, according to Proposition 4, there is a partially directed path from X to Y, denoted by π_w = (X − w_1 − ... − w_t − Z_w → ... → Y), such that X − w_1 − ... − w_t − Z_w is chordless and w_1 ∉ M_w. Since every partially directed cycle in G∗ is an undirected cycle, none of the vertices on π_w is a parent of X in G∗. Moreover, due to chordlessness, if w_1 ≠ Z_w, then none of w_2, ..., w_t, Z_w is adjacent to X, and thus none of them is in M_w. (If w_1 = Z_w, then it is clear that Z_w ∉ M_w.) Since by Lemma 5 there is a DAG in the Markov equivalence class represented by G∗ in which π_w is directed, π_w is active given pa(X, G∗) ∪ M_w. Therefore, X ⊥̸⊥ Y | pa(X, G∗) ∪ M for any M ∈ M. Conversely, X ⊥⊥ Y | pa(X, G∗) ∪ sib(X, G∗) implies that X is not an explicit cause of Y, which also means Y ∉ ch(X, G∗). Moreover, X ⊥̸⊥ Y | pa(X, G∗) ∪ M for any M ∈ M implies Y ∉ pa(X, G∗) ∪ sib(X, G∗). Therefore, X and Y are not adjacent. Suppose that X is not an implicit cause of Y. Since X is not an explicit cause of Y, C ∩ ch(X, G∗) = ∅. Thus, by Theorem 1, there exists an M ∈ M such that C is a subset of M. (If C = ∅, then C ⊂ M for any M ∈ M.) We will show that pa(X, G∗) ∪ M d-separates X and Y. By Lemma 5, there is a DAG G in the Markov equivalence class represented by G∗ such that ch(X, G) = (sib(X, G∗) \ M) ∪ ch(X, G∗) and pa(X, G) = pa(X, G∗) ∪ M. Let π = (X, v_1, ..., v_n, Y) be an arbitrary path connecting X and Y in G. The length of π must be greater than 1, as X and Y are not adjacent. If v_1 is a parent of X in G, then π is clearly blocked by pa(X, G∗) ∪ M, since v_1 is a non-collider on π and v_1 ∈ pa(X, G∗) ∪ M by the construction of G. Now assume that v_1 is a child of X in G.
If v_1 ∈ ch(X, G∗), then there must be a collider on π, since otherwise the corresponding path of π in G∗ would be a partially directed path on which the vertex adjacent to X is a child of X, which means X is an explicit cause of Y according to Lemma 6. Clearly, the collider nearest to X on π is not an ancestor of pa(X, G∗) ∪ M. Thus, π is blocked by pa(X, G∗) ∪ M. For the same reason, if v_1 ∈ sib(X, G∗) \ M and there is a collider on π, then π is blocked by pa(X, G∗) ∪ M, due to the fact that the collider nearest to X on π cannot be an ancestor of pa(X, G∗) ∪ M. Finally, if v_1 ∈ sib(X, G∗) \ M and there is no collider on π, then π is directed in G, and the corresponding path of π in G∗ is partially directed. Let Z be the vertex on π such that the subpath π(X, Z) is undirected in G∗ and Z is an explicit cause of Y. Obviously, such a Z exists, and Z ≠ X. Since v_1 ∉ M, we have v_1 ∉ C, and thus π(X, Z) has a chord. By Perković et al. (2017, Lemma 3.6), there is a subsequence π∗ of π(X, Z) that forms a chordless undirected path from X to Z in G∗. Together with Proposition 4, this result indicates that there is a vertex w on π(X, Z) such that w ∈ C. However, by construction, w ∈ pa(X, G), which makes π(X, w) and w → X form a directed cycle in G. Thus, π must contain a collider, a contradiction. This completes the proof.

C.11 Proof of Theorem 5
Proof.
The proof follows directly from Theorems 1 to 4, together with Propositions 2 and 4.
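The theorems above reduce each causal query to a handful of d-separation tests in a DAG constructed via Lemma 5 (and, on data, to the corresponding conditional independence tests). As a rough illustration of the graphical side only — a sketch, not the authors' implementation — the following implements a standard reachability-based ("Bayes-ball") d-separation check and evaluates the Theorem 2 and Theorem 3 style conditioning sets on two toy graphs: the v-structure X → Z ← Y, where X is a definite non-cause of Y, and a member DAG X → A → Y of the chain CPDAG X − A − Y, where pa(X, G∗) = ∅ and sib(X, G∗) = {A}. The function name and graph encoding are illustrative choices.

```python
from collections import deque

def d_separated(edges, x, y, z):
    """Reachability ("Bayes-ball") test: are x and y d-separated given z
    in the DAG whose directed edges are listed in `edges`?"""
    z = set(z)
    assert x not in z and y not in z
    parents, children = {}, {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)
    # an(z): z together with all its ancestors (opens conditioned colliders)
    anc, stack = set(z), list(z)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in anc:
                anc.add(p)
                stack.append(p)
    # Traverse (vertex, direction) states; "up" = reached against an arrow.
    visited, queue = set(), deque([(x, "up")])
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v == y:                      # an active path reached y
            return False
        if d == "up" and v not in z:    # non-collider, not conditioned on
            queue.extend((p, "up") for p in parents.get(v, ()))
            queue.extend((c, "down") for c in children.get(v, ()))
        elif d == "down":
            if v not in z:              # chain vertex: continue downward
                queue.extend((c, "down") for c in children.get(v, ()))
            if v in anc:                # collider unblocked by conditioning
                queue.extend((p, "up") for p in parents.get(v, ()))
    return True

# X → Z ← Y: X ⊥⊥ Y | pa(X, G∗) = ∅ holds, so the Theorem 2 style test
# marks X as a definite non-cause of Y.
print(d_separated([("X", "Z"), ("Y", "Z")], "X", "Y", []))     # True

# X → A → Y, a member of the chain CPDAG X − A − Y:
print(d_separated([("X", "A"), ("A", "Y")], "X", "Y", []))     # False: not a definite non-cause
print(d_separated([("X", "A"), ("A", "Y")], "X", "Y", ["A"]))  # True: not an explicit cause
```

In the chain example the two tests together classify X as a possible cause of Y, matching the fact that X is an ancestor of Y in some but not all member DAGs.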
References
A. R. Ali, T. S. Richardson, P. Spirtes, and J. Zhang. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 10–17. AUAI Press, 2005.
S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997.
Y. Bengio, T. Deleu, N. Rahaman, N. R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020.
M. Bernstein and P. Tetali. On sampling graphical Markov models. arXiv e-prints, arXiv:1705.09717, 2017.
J. R. S. Blair and B. Peyton. An introduction to chordal graphs and clique trees. In Graph Theory and Sparse Matrix Computation, pages 1–29. Springer, New York, NY, 1993.
D. M. Chickering. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2(Feb):445–498, 2002a.
D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002b.
D. Colombo and M. H. Maathuis. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15:3921–3962, 2014.
Z. Fang and Y. He. IDA with background knowledge. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
S. Fu and M. C. Desmarais. Markov blanket based feature selection: A review of past decade. In Proceedings of the World Congress on Engineering, volume 1, pages 321–328. Newswood Ltd., 2010.
T. Gao and Q. Ji. Local causal discovery of direct causes and effects. In Advances in Neural Information Processing Systems 28, pages 2512–2520. Curran Associates, Inc., 2015.
F. R. Guo and E. Perković. Minimal enumeration of all possible total effects in a Markov equivalence class. arXiv e-prints, arXiv:2010.08611, 2020.
A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13(Aug):2409–2464, 2012.
Y. He and Z. Geng. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523–2547, 2008.
Y. He, J. Jia, and B. Yu. Counting and exploring sizes of Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 16:2589–2609, 2015.
M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis, and P. Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1–26, 2012.
M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5(May):549–573, 2004.
M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, volume 30, pages 4066–4076. Curran Associates, Inc., 2017.
S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):321–348, 2002.
S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 50(2):157–224, 1988.
Y. Liu, Z. Cai, C. Liu, and Z. Geng. Local learning approaches for finding effects of a specified cause and their causal paths. ACM Transactions on Intelligent Systems and Technology, 10(5), 2019.
Y. Liu, Z. Fang, Y. He, and Z. Geng. Collapsible IDA: Collapsing parental sets for locally estimating possible causal effects. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020a.
Y. Liu, Z. Fang, Y. He, Z. Geng, and C. Liu. Local causal network learning for finding pairs of total and direct effects. Journal of Machine Learning Research, 21(148):1–37, 2020b.
M. H. Maathuis and D. Colombo. A generalized back-door criterion. The Annals of Statistics, 43(3):1060–1088, 2015.
M. H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 403–410. Morgan Kaufmann Publishers Inc., 1995.
T. Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, 2019.
J. Mooij and T. Claassen. Constraint-based causal discovery using partial ancestral graphs in the presence of cycles. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
P. Nandy, M. H. Maathuis, and T. S. Richardson. Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. The Annals of Statistics, 45(2):647–674, 2017.
J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1988.
J. Pearl. Causality. Cambridge University Press, 2009.
J. Pearl, D. Geiger, and T. Verma. Conditional independence and its representations. Kybernetika, 25(7):33–44, 1989.
E. Perković. Identifying causal effects in maximally oriented partially directed acyclic graphs. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence. PMLR, 2020.
E. Perković, M. Kalisch, and M. H. Maathuis. Interpreting and using CPDAGs with background knowledge. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017.
J. Peters and P. Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2013.
J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15(1):2009–2053, 2014.
T. Richardson and P. Spirtes. Ancestral graph Markov models. The Annals of Statistics, 30(4):962–1030, 2002.
D. J. Rose and R. E. Tarjan. Algorithmic aspects of vertex elimination on directed graphs. Technical report, Stanford, CA, 1975.
A. Roumpelaki, G. Borboudakis, S. Triantafillou, and I. Tsamardinos. Marginal causal consistency in constraint-based causal learning. In Proceedings of the UAI 2016 Workshop on Causation: Foundation to Application, number 1792 in CEUR Workshop Proceedings, pages 39–47, 2016.
S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.
A. Singh and A. Moore. Finding optimal Bayesian networks by dynamic programming. Technical report, 2005.
P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9(1):62–72, 1991.
I. Tsamardinos and C. F. Aliferis. Towards principled feature selection: Relevancy, filters and wrappers. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. Morgan Kaufmann Publishers, 2003.
I. Tsamardinos, C. F. Aliferis, and A. R. Statnikov. Algorithms for large scale Markov blanket discovery. In Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference, pages 376–381. AAAI Press, 2003.
C. Wang, Y. Zhou, Q. Zhao, and Z. Geng. Discovering and orienting the edges connected to a target variable in a DAG via a sequential local learning approach. Computational Statistics & Data Analysis, 77:252–266, 2014.
J. Witte, L. Henckel, M. H. Maathuis, and V. Didelez. On efficient adjustment in causal graphs. Journal of Machine Learning Research, 21(246):1–45, 2020.
Y. Wu, L. Zhang, and X. Wu. Counterfactual fairness: Unidentification, bound and algorithm. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pages 1438–1444. International Joint Conferences on Artificial Intelligence Organization, 2019.
J. Xiang and S. Kim. A* lasso for learning a sparse Bayesian network structure for continuous variables. In Advances in Neural Information Processing Systems 26, pages 2418–2426. Curran Associates, Inc., 2013.
C. Yuan, B. Malone, and X. Wu. Learning optimal Bayesian networks using A* search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI-11), pages 2186–2191. International Joint Conferences on Artificial Intelligence Organization, 2011.
J. Zhang. Causal Inference and Reasoning in Causally Insufficient Systems. PhD thesis, Carnegie Mellon University, 2006.
J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172(16):1873–1896, 2008.
K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.
K. Zhang, M. Gong, P. Stojanov, B. Huang, Q. Liu, and C. Glymour. Domain adaptation as a problem of inference on graphical models. In Advances in Neural Information Processing Systems, volume 33, pages 4965–4976. Curran Associates, Inc., 2020.
X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. DAGs with no tears: Continuous optimization for structure learning. In