Identifying Causal Effects via Context-specific Independence Relations

Abstract

Causal effect identification considers whether an interventional probability distribution can be uniquely determined from a passively observed distribution in a given causal structure. If the generating system induces context-specific independence (CSI) relations, the existing identification procedures and criteria based on do-calculus are inherently incomplete. We show that deciding causal effect non-identifiability is NP-hard in the presence of CSIs. Motivated by this, we design a calculus and an automated search procedure for identifying causal effects in the presence of CSIs. The approach is provably sound and it includes standard do-calculus as a special case. With the approach we can obtain identifying formulas that were unobtainable previously, and demonstrate that a small number of CSI-relations may be sufficient to turn a previously non-identifiable instance to identifiable.


Santtu Tikka

Department of Mathematics and Statistics, University of Jyvaskyla, Finland

Antti Hyttinen

HIIT, Department of Computer Science, University of Helsinki, Finland

Juha Karvanen

Department of Mathematics and Statistics, University of Jyvaskyla, Finland


Statistical independence of random variables is a central concept in any data analysis and prediction task. An important generalization of this concept is context-specific independence (CSI) [26, 6]. For a simple example, consider an antibiotic that normally has a dose–response effect on the number of bacteria. A genetic mutation makes the bacteria resistant to the antibiotic, meaning that in the context of this mutation the dose and the number of bacteria are independent. CSI-relations have been utilized to analyze, for example, gene expression data [2], dynamics of pneumonia [33], prognosis of heart disease [22], proteins [15], parliament elections [22] and occurrence of plants [22]. CSIs have also been used to speed up exact probabilistic inference [8, 12] and to improve structure learning [9, 19]. However, CSIs have received much less attention in causal inference and, in particular, in causal effect identifiability, despite their great potential in allowing for further identifiability results.

In the structural causal model (SCM) framework, the knowledge about the causal mechanisms under investigation is represented as a directed acyclic graph (DAG). When some nodes represent unobserved latent variables, all information can be determined from a corresponding semi-Markovian graph. Assuming the qualitative information given by the graph, the aim in causal effect identification is to determine whether a causal effect P(Y | do(X), Z) can be uniquely determined from the available passively observed distribution. The known causal structure, whichever formalism is used, specifies (generalized) conditional independence properties of the system through d-separation. These warrant the manipulation of interventional distributions with the rules of do-calculus and thus the derivation of identifying formulas [23, 24]. The ID algorithm implements this inference: it can identify the causal effect whenever it can be non-parametrically identified [28, 17, 32].

Figure 1: (a) L is a latent unobserved variable. (b) CPT for P(X | A, L). (c) Decision tree with P(X | A, L) given in the leaf nodes. (d) Corresponding labeled DAG (LDAG). (e) LDAG with an intervention node added for X.

When we have further information on the generative causal model, the completeness results of the previous approaches no longer apply: more causal effects become identifiable and do-calculus based methods will report false non-identifiability. One such piece of still qualitative information is a set of CSI-relations. One example is shown in Fig. 1(a). The causal effect P(Y | do(X)) is non-identifiable by do-calculus here due to the back-door path through the latent factor L. However, if we know the CSIs X ⊥⊥ L | A = 0 and X ⊥⊥ Y | A = 1, L, the causal effect is identifiable (see Eq. 1 in Sec. 4.2).

Accounting for CSIs imposes additional challenges for deciding causal effect identifiability and for the derivation of identifying formulas. Instead of graphical models for conditional independence, we need to employ inherently more complicated graphical models for CSI. As we shall show, derivation of causal effects requires context-specific reasoning. All this is well worthwhile if it warrants the identifiability of new causal effects.

We formulate the problem of causal effect identifiability in the presence of CSIs for binary variables and show that deciding non-identifiability is NP-hard (Sec. 3). Motivated by this, we develop a calculus and a search procedure over the rules of the calculus (Sec. 4 and 5). To make our search feasible, we eliminate redundant contexts, implement new separation criteria and use a well-motivated heuristic. With these techniques we scale up to network sizes often reported in the literature.
Most importantly, we show a host of examples where do-calculus cannot identify a causal effect but our search procedure leveraging CSIs can prove identifiability (Sec. 6). Impact for future research and alternative approaches are discussed in Sec. 7.

Our starting point is causal effect identification over a DAG G = (V, E). The set W ⊆ V denotes the set of observed variables, marked by circular nodes. Since we also take into account the local structure, we mark any unobserved variables explicitly as rectangular nodes in the graph (as opposed to the semi-Markovian representation with bi-directed edges). The set pa(Y) denotes the parents of a node Y regardless of their observability. Notation x is used to denote an assignment to random variables X, and val(X) is used to denote the set of all possible assignments to X. All variables are assumed to be binary.

There are different ways of representing the local structure in the local conditional probability distribution (CPD) of a node given its parents [20, 9]. One of the most popular ways of modeling the local structure is to constrain some of the probabilities in the CPDs to be identical. For example, the conditional probability table (CPT) of X in Fig. 1(b) has identical probabilities in the first two rows. One way to model such local structure is to use decision trees as in Fig. 1(c); see Koller et al. [20] for others.

Importantly, local structure induces local CSIs of the form Y ⊥⊥ X | pa(Y) \ X = ℓ, denoting that Y is independent of the value of a parent X when the other parents of Y are assigned to values ℓ. The local CPT in Fig. 1(b) implies X ⊥⊥ L | A = 0. The decision tree in Fig. 1(c) also shows this local CSI: once going down the branch with A = 0, the value of X is not influenced by the value of L.

In this paper, we employ the idea of Pensar et al. [25] and mark local CSIs as labels on the edges of the DAG.
A DAG (V, E) together with a set of labels L defines a labeled DAG (LDAG) G = (V, E, L), where for each edge X → Y ∈ E there is a label L ∈ L, which is a (possibly empty) set of assignments to pa(Y) \ X, i.e., to the other parents of Y. Each assignment in the label encodes a local CSI: if ℓ ∈ L, then Y ⊥⊥ X | pa(Y) \ X = ℓ. The symbol ∗ is used as shorthand for any value. For example, the label AL = 1∗ on X → Y in Fig. 1(d) implies that X ⊥⊥ Y | A = 1, L.

Finally, throughout the paper, we restrict our attention to regular maximal LDAGs. Maximality requires that all labels that follow from other labels are recorded in the edges. Regularity means that edges absent in every context are not included in the graph. See Pensar et al. [25] for details.

Any LDAG can be turned into a context s specific DAG by removing edges that are spurious (i.e., irrelevant) when variables S have values s, as follows. The nodes appearing in the label L on some X → Y can be partitioned into two sets A and B: nodes in A are assigned to a by the context s, while nodes in B are not. Then, the edge X → Y ∈ E is not present in the context s specific DAG (i.e., the edge is spurious) if (a, b) ∈ L for all possible assignments b. For example, the context A = 1 specific DAG of Fig. 1(d) is identical to the underlying DAG except for X → Y being absent.

A sufficient condition for a non-local CSI to be implied by an LDAG structure is given by the CSI-separation criterion [6]: if sets of nodes X and Y are d-separated given C, S in the context s specific DAG of G, then X ⊥⊥ Y | C, s is implied by G. Note that d-separation is the special case S = ∅. For example, the labeling in Fig. 1(d) implies that X ⊥⊥ L | A = 0 by this criterion, as the edge L → X is absent in the context A = 0 specific DAG.

We assume a positive distribution over the variables V [17]. This makes causal effects well-defined and justifies conditioning on any subset of variables or their particular assignments.
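The construction of a context-specific DAG described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list is our reading of Fig. 1(d) from the text (A → X, L → X, A → Y, L → Y, X → Y), and the names `EDGES`, `LABELS`, `assignments` and `context_specific_dag` are hypothetical helpers, not part of the authors' implementation.

```python
from itertools import product

# Edges of the LDAG in Fig. 1(d), as read from the text (an assumption of this sketch).
EDGES = [("A", "X"), ("L", "X"), ("A", "Y"), ("L", "Y"), ("X", "Y")]

# Labels: edge -> list of assignments to the other parents of the child.
# A value of None plays the role of the wildcard '*'.
LABELS = {
    ("L", "X"): [{"A": 0}],             # label A = 0 on L -> X
    ("X", "Y"): [{"A": 1, "L": None}],  # label AL = 1* on X -> Y
}


def parents(node):
    """Parents of `node` in the underlying DAG."""
    return [u for (u, v) in EDGES if v == node]


def assignments(partial, variables):
    """All full binary assignments to `variables` compatible with `partial`."""
    choices = [(0, 1) if partial.get(v) is None else (partial[v],) for v in variables]
    return set(product(*choices))


def context_specific_dag(context):
    """Edges that remain in the context-specific DAG for `context` (a dict)."""
    kept = []
    for (u, v) in EDGES:
        others = [p for p in parents(v) if p != u]
        # Expand the label into the set of full assignments it covers.
        covered = set()
        for entry in LABELS.get((u, v), []):
            covered |= assignments(entry, others)
        # The edge is spurious if every completion of the context is in the label.
        required = assignments(context, others)
        if not required <= covered:
            kept.append((u, v))
    return kept


print(context_specific_dag({"A": 1}))  # X -> Y is dropped
print(context_specific_dag({"A": 0}))  # L -> X is dropped
```

In the context A = 1 only X → Y disappears, and in the context A = 0 only L → X disappears, matching the two examples given in the text. With an empty context no edge is removed, which corresponds to ordinary d-separation on the underlying DAG.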
As the first contribution, we formalize the causal identifiability problem in the presence of CSIs. Identifiability [24, 29] considers whether a causal effect can be uniquely identified in models with a given fixed structure. If an effect is non-identifiable, there are (at least) two models that agree with the observations and have the same given structure but disagree on the causal effect.

We use LDAGs to define identifiability in the presence of CSIs, as LDAGs offer a simple and intuitive visual view of the causal structure and local CSIs. The LDAG is assumed known based on the background knowledge on the examined study, just as semi-Markovian graphs are standardly drawn for do-calculus. For example, consider (again) the case where an antibiotic A has a dose-response effect on H only if a genetic mutation M has not taken place. Hence, we would mark the label M = 1 on the edge A → H. Thus, the causal effect identification problem can be formulated as:

Input: An LDAG G over V, P(W) for W ⊆ V, a query P(Y | do(X), Z) s.t. Y, X, Z ⊂ W.

Task: Output a formula for P(Y | do(X), Z) over P(W), or decide that it is non-identifiable.

When no labels appear on the edges of an LDAG, the causal structure can be directly cast as a semi-Markovian graph. Thus, the setting of do-calculus is a special case of this one.

In contrast to causal effect identifiability over semi-Markovian graphs, which has polynomial decision procedures [28, 17], taking local structure and CSIs into account makes the corresponding decision problem NP-hard. (The proofs for all theorems are given in the supplementary material.)

Theorem 1.

Deciding non-identifiability of a causal effect given an LDAG over V and a passively observed distribution over W ⊆ V is NP-hard.

Rule 1 (Insertion/Deletion of observations): $P(Y \mid do(X), Z, W) = P(Y \mid do(X), W)$ if $Y \perp\!\!\!\perp Z \mid X, W \,||\, X$
Rule 2 (Action/Observation exchange): $P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), Z, W)$ if $Y \perp\!\!\!\perp I_Z \mid X, Z, W \,||\, X$
Rule 3 (Insertion/Deletion of actions): $P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), W)$ if $Y \perp\!\!\!\perp I_Z \mid X, W \,||\, X$

Figure 2: Rules of do-calculus. The sets X, Y, Z and W are disjoint. Notation || X means that the condition is evaluated in a graph in which edges into X are removed. I_Z denotes the intervention nodes of variables Z (see Sec. 4.1).

The proof of Theorem 1 shows that 3-SAT can be reduced to the identifiability of P(Y | do(X)) from P(X, Y). On an intuitive level, the intricate structure in the local CPDs allows for representing instances of NP-hard decision problems. This result is related to NP-hardness results of exact inference [10], the implication problem of CSIs [20, 11] and the complexity results for Halpern's actual causation [1]. However, we are not aware of other NP-hardness results for causal effect identifiability.

In light of Theorem 1, fast algorithms for determining identifiability of causal effects may be generally unobtainable. Thus, we take here an approach similar to [14, 23, 16] and formulate a calculus called CSI-calculus which can be used to show identifiability for particular instantiations of the problem. CSI-calculus is an extension of the do-calculus of Fig. 2. In the first subsection we show that, due to the versatile graphical model used (LDAG), we only need to consider identification of conditional probabilities (i.e., the do-operation is not needed). The second subsection gives the rules of CSI-calculus.

Interventions can be encoded naturally with the use of intervention variables and CSIs [6, 23, 13]. Here we show how this can be done for LDAGs.

For any LDAG (V, E, L), we can construct an augmented LDAG that has the capacity to represent interventions as follows. Each node X ∈ V is augmented by an intervention node I_X and an edge I_X → X. If I_X = 0, then X is in its passive observational state determined by its parents pa(X). If I_X = 1, then X is intervened on and its value is determined independently from its parents. For every X ∈ V and every label L_Z ∈ L of every incoming edge Z → X such that Z ≠ I_X, we construct the augmented label L′_Z by including the assignments I_X = ∗, pa(X) \ (I_X ∪ Z) = ℓ for every ℓ ∈ L_Z and I_X = 1, pa(X) \ (I_X ∪ Z) = ∗. In other words, L′_Z renders the edge Z → X spurious when I_X = 1 or in any context where L_Z would. Fig. 1(e) shows an LDAG that is constructed from the LDAG in Fig. 1(d) by adding an intervention node for X.

Using the above construction, an interventional distribution P(Y | do(X)) is now simply a conditional distribution P(Y | X, I_X = 1). Thus, we can essentially drop the do-operator from the problem definition, and model interventions using intervention nodes and CSIs instead. To simplify the notation, we omit intervention nodes for variables that are in their passive observational state from formulas. We do still include the do-operator when possible for improved readability.

Figure 3 describes the rules of CSI-calculus. In the rules we use terms that apply for all assignments (large letters) and to particular assignments (small letters). We do this in order to make the derivations shorter and identifying formulas more understandable. A valid calculus can be formed by omitting all large letters, but our experiments (Sec. 6) suggest that such a calculus is far less efficient.

Rule 1 (Insertion/Deletion of observations): $P(Y, y \mid Z, z, X, x) = P(Y, y \mid X, x)$ if $Y, y \perp\!\!\!\perp Z, z \mid X, x$
Rule 2 (Marginalization/Sum-rule): $P(Y, y \mid X, x) = \sum_{Z} P(Y, y, Z \mid X, x)$
Rule 3 (Conditioning): $P(Y \mid Z, z, X, x) = \dfrac{P(Y, Z, z \mid X, x)}{\sum_{Y} P(Y, Z, z \mid X, x)}$
Rule 4 (Product-rule): $P(Y, y, Z, z \mid X, x) = P(Y, y \mid Z, z, X, x)\, P(Z, z \mid X, x)$
Rule 5 (General-by-case reasoning): $P(Y, y, 1 - z \mid X, x) = P(Y, y \mid X, x) - P(Y, y, z \mid X, x)$
Rule 6 (Case-by-case reasoning): $P(Y, y, Z \mid X, x) = \begin{cases} P(Y, y, Z = 0 \mid X, x) & \text{if } Z = 0 \\ P(Y, y, Z = 1 \mid X, x) & \text{if } Z = 1 \end{cases}$
Rule 7 (Case-by-general reasoning (a)): $P(Y, y, z \mid X, x) = P(Y, y, Z \mid X, x)\big|_{Z = z}$
Rule 8 (Case-by-general reasoning (b)): $P(Y, y \mid X, x, z) = P(Y, y \mid X, x, Z)\big|_{Z = z}$

Figure 3: Rules of CSI-calculus. The sets X, Y and Z and the sets of variables assigned by x, y and z are disjoint. In the condition of Rule 1, y and z refer to the variables they assign. We write w as shorthand for the explicit assignment W = w.

Rule 1 is directly the definition of context-specific independence, which includes conditional independence as a special case. Rule 1 can be applied in both directions: when the term on the left is identified, so is the term on the right and vice versa, provided that the separation condition is satisfied. Marginalization, conditioning and factorization from standard probability calculus are operationalized by rules 2–4, respectively. Rule 5 uses the law of total probability to obtain the probability of the complement. Rules 2–5 are applied from right to left: when the expressions on the right are identified, then so is the term on the left. Rule 5 is also valid when Y and y are empty: in this case the rule should be understood as P(1 − z | X, x) = 1 − P(z | X, x).

Rule 6 explicates that if we know the expression for each assignment Z = z, then we also know the expression without a specific assignment to Z. When rules 4–6 are applied, both distributions on the right-hand side must be known. Rules 7 and 8 formulate the fact that if an expression is known for all assignments to Z, it is also known for a specific assignment Z = z. For rules 5–8, it is assumed for convenience that Z is a singleton. This assumption does not restrict identifiability, since operations involving sets can be carried out by applying the rules for each member of the set sequentially. For identifiable queries, the formula in terms of the joint distribution P(W) is easily obtained by backtracking the chain of manipulations that resulted in identification.

Importantly, CSI-calculus includes the standard do-calculus of Fig. 2 as a special case.

Theorem 2.

CSI-calculus subsumes do-calculus.

This means that any formula that is derivable with standard do-calculus over a DAG G (with latent variables) is also derivable using CSI-calculus over the LDAG formed by simply adding intervention nodes and labels as described in Section 4.1. After this augmentation, Rule 1 fully encompasses the three rules of do-calculus [23, 24]; this is shown in the proof of the theorem.

More importantly, the calculus of Fig. 3 can identify causal effects that are not identifiable with the standard do-calculus. For the example of Fig. 1, the following formula can be obtained:

$$P(Y \mid do(X)) = P(Y \mid A = 0, X)\, P(A = 0) + P(Y \mid A = 1)\, P(A = 1). \tag{1}$$

A simple derivation of this formula using CSI-calculus is shown in Fig. 4. Note that the back-door formula $P(Y \mid do(X)) = \sum_{A} P(A)\, P(Y \mid A, X)$ is not valid here: conditioning on X when A = 1 biases Y through X ← L → Y.
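Eq. 1 can be checked numerically on a small model. The sketch below assumes the structure of Fig. 1 as we read it from the text (A and L are independent root nodes, L is latent, and both are parents of X and Y), and it uses hypothetical CPT values chosen only so that the two CSIs X ⊥⊥ L | A = 0 and X ⊥⊥ Y | A = 1, L hold. The names `truth`, `eq1` and `backdoor` are illustrative.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical parameters (not taken from the paper).
p_a1 = F(3, 10)  # P(A = 1)
p_l1 = F(1, 2)   # P(L = 1), L is latent
# P(X = 1 | A, L): the rows with A = 0 are identical, so X is independent of L given A = 0.
p_x1 = {(0, 0): F(9, 10), (0, 1): F(9, 10), (1, 0): F(1, 2), (1, 1): F(2, 5)}
# P(Y = 1 | X, A, L): for A = 1 the value does not depend on X.
p_y1 = {(0, 0, 0): F(1, 5), (1, 0, 0): F(7, 10), (0, 0, 1): F(2, 5), (1, 0, 1): F(9, 10),
        (0, 1, 0): F(3, 10), (1, 1, 0): F(3, 10), (0, 1, 1): F(4, 5), (1, 1, 1): F(4, 5)}


def bern(p1, v):
    """Probability that a binary variable with P(1) = p1 takes value v."""
    return p1 if v == 1 else 1 - p1


def joint(a, l, x, y):
    return bern(p_a1, a) * bern(p_l1, l) * bern(p_x1[(a, l)], x) * bern(p_y1[(x, a, l)], y)


def prob(**fixed):
    """Observational probability of a partial assignment, e.g. prob(y=1, a=0)."""
    total = F(0)
    for a, l, x, y in product((0, 1), repeat=4):
        full = {"a": a, "l": l, "x": x, "y": y}
        if all(full[k] == v for k, v in fixed.items()):
            total += joint(a, l, x, y)
    return total


def truth(x):
    """P(Y = 1 | do(X = x)) computed from the full model, including the latent L."""
    return sum(bern(p_a1, a) * bern(p_l1, l) * p_y1[(x, a, l)]
               for a, l in product((0, 1), repeat=2))


def eq1(x):
    """Right-hand side of Eq. 1 for Y = 1; uses only the observed X, Y and A."""
    return (prob(y=1, a=0, x=x) / prob(a=0, x=x)) * prob(a=0) \
        + (prob(y=1, a=1) / prob(a=1)) * prob(a=1)


def backdoor(x):
    """Back-door adjustment over A, which is not valid in this model."""
    return sum(prob(a=a) * prob(y=1, a=a, x=x) / prob(a=a, x=x) for a in (0, 1))


for x in (0, 1):
    print(x, truth(x), eq1(x), backdoor(x))
```

With these parameters Eq. 1 reproduces the interventional probability exactly for both values of X, while the back-door adjustment over A gives a different number, in line with the remark that conditioning on X when A = 1 biases Y through the latent L. Exact rational arithmetic is used so that the comparison does not depend on floating-point tolerance.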

Figure 4: A derivation of P(Y | do(X)) from P(X, Y, A) in the example of Fig. 1. The applied rules and CSIs are marked next to the edges connecting the terms. The identifying formula is Eq. 1.

In contrast to the setting of standard do-calculus, due to the formidable number of contexts and the causal structure being described by an arguably more complex graph formalism, applying the rules of CSI-calculus by hand is infeasible (recall also Theorem 1 on NP-hardness). Hence, we follow the approach of [30, 18] and devise a forward search procedure over the rules of CSI-calculus that is able to automatically output identifying formulas and derivations such as Fig. 4.

However, for any instance, there are a vast number of terms that may end up being useful in identifying the query term; in fact, the derivation in Fig. 4 only shows the terms that were actually needed (in hindsight). For applying rule 1 we need to check a coNP-hard separation criterion, in contrast to the polynomial check of d-separation in the standard do-calculus setting. Hence, we focus here on how to efficiently evaluate separation criteria (Sec. 5.1), combine contexts (Sec. 5.2) and implement the heuristic search (Sec. 5.3) without weakening the theoretical properties (Sec. 5.4).

Rule 1 requires the evaluation of possibly non-local CSIs. Recall from Section 2 that CSI-separation is only a sufficient criterion; in practice it misses many of the important independence relations. For a feasible search procedure we need a sufficiently fast way to check a sufficient separation criterion. The following sufficient criterion is implemented in the search for this purpose.

Theorem 3.

If there exists a set C such that Y ⊥⊥ Z | X, w, C is implied by an LDAG G and one of the following is also implied by G: (i) Y ⊥⊥ C | X, w, (ii) C ⊥⊥ Z | X, w, (iii) Y ⊥⊥ C | X, Z, w, or (iv) Z ⊥⊥ C | X, Y, w, then also Y ⊥⊥ Z | X, w is implied by G.

When a CSI statement Y ⊥⊥ Z | X, w is encountered by the search, the following procedure is applied. First, we verify whether the CSI is directly encoded in a label. If it is, we can stop; if it is not, we continue by applying the CSI-separation criterion. If the CSI-separation criterion does not hold, we continue by attempting to find a set C that satisfies Y ⊥⊥ Z | X, w, c for all c ∈ val(C). Theorem 3 is then applied recursively to verify whether any of the required CSIs Y ⊥⊥ C | X, w, C ⊥⊥ Z | X, w, Y ⊥⊥ C | X, Z, w or Z ⊥⊥ C | X, Y, w holds in G. To guarantee that the recursion terminates, each variable can appear only once in each branch of the recursion. We further reduce the number of evaluated CSIs by caching them during the search.

The number of possible contexts increases exponentially with the number of variables. It is therefore important to determine which contexts should be considered when CSIs are evaluated. Different contexts often share the same context-specific DAG. We define the equivalence relation ∼_s as follows: s_1 ∼_s s_2 if and only if the context s_1 specific DAG is the same as the context s_2 specific DAG, where s_1, s_2 ∈ val(S). When evaluating the CSI Y ⊥⊥ Z | X, w, C of Theorem 3, we do not have to determine d-separation for every c ∈ val(C) and w, c specific DAG. It suffices to restrict our attention to the context-specific DAGs given by the representatives of val(C)/∼_s.

Algorithm 1
Input: Target Q = P(Y | do(X), Z), LDAG G and input I = {P(W)}.
Output: A formula F for Q in terms of P(W) or NA.
 1: let U be the set of unexpanded terms, initially U := I.
 2: for P′ ∈ U:
 3:     let I∗ be the set of all distributions derived from P′ using the rules of Section 4.
 4:     for each new candidate distribution P∗ ∈ I∗, do
 5:         if an additional input is required that is not in I, then continue.
 6:         if the CSI relation of the current rule is not satisfied by G, then continue.
 7:         if P∗ = Q, then derive a formula F for Q by backtracking and return F.
 8:         Add P∗ to I, add P∗ to U.
 9:     Mark P′ as expanded: remove P′ from U.
10: return NA.

Theorem 4.

Let R be a set of representatives of val(C)/∼_s. If Y is CSI-separated from Z by X in the context w, c in G for all c ∈ R, then Y is CSI-separated from Z by X in the context w, c in G for all c ∈ val(C).

The definition of intervention nodes can also be used in this way. In general, an arbitrary context S = s can render a number of edges spurious in the LDAG. However, if the context contains the assignment I_X = 1 for any node X, we know that every incoming edge of X except I_X → X will be made spurious by definition without requiring any further verification.

Algorithm 1 shows the pseudo-code which implements the calculus of Section 4 and is capable of solving problems that fall under the formulation of Section 3 through the use of a search heuristic and elimination of redundant contexts. A single distribution is called a term, which is considered expanded if every valid manipulation has been performed on it.

The input distribution is marked as unexpanded on line 1 and iteration over the unexpanded terms begins on line 2. In order to guide the search to identify the most promising terms, we relate the identified distributions to the target Q through a heuristic proximity function and always expand the closest term in U first. Note that if we were to expand only the closest term to the target greedily, several identifiable instances would be left non-identified because the identifying formulas and derivations are highly non-trivial. More details about the proximity function are given in the supplementary material. If multiple terms share the maximal value of the proximity function, the term that was identified first is selected. Next, the rules of Section 4 are applied to P′ and the derived candidate distributions are added to the set I∗ on line 3. Note that not every distribution in I∗ is necessarily identified at this point.

Iteration over the set I∗ begins on line 4. Here the candidate terms P∗ in I∗ that can be identified are added to the set I. Previously identified terms are not identified again. Line 5 verifies that both required terms are identified for rules 4–6. Line 6 applies Theorem 3 to check the required CSI relation for rule 1. Tests for d-separation are carried out via relevant path separation [7].

If all requirements are met, P∗ is identified either as the target on line 7 or as a new unexpanded distribution on line 8. Once all candidate distributions are processed, we mark P′ as expanded on line 9. Note that P′ can still appear as a second required term on line 5 when another term is being expanded. Finally, if the target was not identified and the set of unexpanded distributions was exhausted, we deem the target non-identifiable by the search and return NA on line 10.

The formulated search is sound in the following sense.

Figure 5: (a) Running times of Algorithm 1 (time per instance in minutes over sorted instances, for n = 7, 8, 9). Full CS is a naive version which does not combine contexts. (b) Time usage of each rule (average time in minutes, for n = 7, 8, 9) with error bars showing the standard error.

Theorem 5 (Soundness). Algorithm 1 always terminates, and if it returns an expression, it is correct.

In the setting of standard do-calculus, where no labels are present (in addition to those defining interventions), the search is complete for (conditional) causal effect identifiability. This is because the separation condition is general enough to capture all conditional independences used by do-calculus, as shown by Theorem 2.
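The control flow of Algorithm 1 can be sketched as a best-first forward search. The sketch below is not the dosearch implementation: `derive` and `proximity` are hypothetical callbacks supplied by the caller, the CSI check of line 6 is assumed to happen inside `derive`, and the toy rule table at the end only mimics marginalization and conditioning on strings.

```python
def backtrack(term, identified):
    """Return derivation steps (rule, premises, conclusion) in dependency order."""
    steps, seen = [], set()

    def visit(t):
        if t in seen or identified[t] is None:  # inputs have no provenance
            return
        seen.add(t)
        rule, premises = identified[t]
        for p in premises:
            visit(p)
        steps.append((rule, premises, t))

    visit(term)
    return steps


def forward_search(inputs, target, derive, proximity):
    """Best-first search over derivable terms; returns a derivation or None."""
    identified = {t: None for t in inputs}  # term -> (rule, premises)
    if target in identified:
        return []
    unexpanded = list(inputs)               # kept in order of identification
    while unexpanded:
        # Expand the closest term first; max() keeps the earliest term on ties.
        term = max(unexpanded, key=lambda t: proximity(t, target))
        for cand, rule, premises in derive(term, identified):
            if cand in identified:
                continue                    # previously identified terms are skipped
            if not all(p in identified for p in premises):
                continue                    # a required second term is not identified yet
            identified[cand] = (rule, premises)
            if cand == target:
                return backtrack(target, identified)
            unexpanded.append(cand)
        unexpanded.remove(term)             # mark the term as expanded
    return None                             # not identified by the search


# Toy stand-ins for the calculus, for illustration only.
TOY_RULES = {
    "P(Y,X,A)": [("P(Y,A)", "R2", ("P(Y,X,A)",)), ("P(A)", "R2", ("P(Y,X,A)",))],
    "P(Y,A)": [("P(Y|A)", "R3", ("P(Y,A)",))],
}


def toy_derive(term, identified):
    return TOY_RULES.get(term, [])


def toy_proximity(term, target):
    return len(set(term) & set(target))     # crude similarity on characters


print(forward_search(["P(Y,X,A)"], "P(Y|A)", toy_derive, toy_proximity))
```

Storing the rule and premises of each identified term is what makes the final backtracking step possible: the identifying formula is read off by replaying the recorded manipulations from the inputs to the target. Returning `None` corresponds to the NA outcome, which means only that the search did not find a derivation.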

We implemented the search in C++ and the code is available in the R-package dosearch on CRAN [31]. First we will present a simulation study on the search and then show a host of examples where identifiability can be shown with our approach. Experiments were performed on a modern desktop computer (single thread, Intel Core i7-4790, 3.4 GHz).

We considered DAGs with n = 7, 8, 9 nodes with 100 DAGs for each n. Edges for the DAGs were sampled randomly with average degree of […]. We sampled labels on the edges (local CSIs) with probability […]. Two of the nodes were considered latent and the aim was to determine whether P(Y | do(X)) can be identified. Fig. 5(a) shows the running times of Algorithm 1 with a […] minute timeout. The search times when all contexts are considered separate (i.e., the terms have fixed assigned values for all variables) are included as a baseline (full CS). Using terms that combine assignments as formulated in CSI-calculus considerably speeds up the execution times.

In the same simulation, we examined the effect of applying the individual rules on the total running times, as shown in Fig. 5(b). Rules 1 and 4 dominate the running time. For rule 1, considerable time is spent on checking whether the conditional independence constraints hold (recall that this step is also (co)NP-hard). Rule 4 combines two previously identified terms, and therefore a single term may help to identify further terms in a large number of ways.

Importantly, the search implementing CSI-calculus can prove identifiability of P(Y | do(X)) for the LDAGs in Fig. 6, which would otherwise be non-identifiable via standard do-calculus. Non-identifiability can be verified by running ID on the underlying DAGs without labels or by noting that each graph includes a hedge. In Fig. 6(a), P(Y | do(X)) = P(Y | X, W = 1). Intuitively, node W acts similarly to an intervention node and hence conditioning on W = 1 eliminates the back-door path. In Fig. 6(b), P(Y | do(X)) = P(Y), because X and Y are independent when X is intervened on due to the labels. In Fig. 6(c), P(Y | do(X)) = P(Y | Z = 0, X) P(Z = 0) + P(Y | Z = 1) P(Z = 1). Adjusting for Z is needed, which opens up a new d-connecting path through H and Q. Fortunately, when Z = 0 there is no confounding path, and when Z = 1 there is a confounding path but no directed path from Z. In Fig. 6(d), the causal effect is identifiable and the output by Algorithm 1 is:

$$P(Y \mid do(X)) = P(A = 1) \sum_{W} P(Y \mid X, W, A = 1)\, P(W \mid A = 1) + P(A = 0) \sum_{Z} P(Z \mid X, A = 0) \sum_{X'} P(Y \mid X', Z, A = 0)\, P(X' \mid A = 0)$$

When A = 1, the first term resembles the back-door formula, adjusting for W. When A = 0, the second term resembles the front-door formula through Z. Since A ⊥⊥ X, I_X in the LDAG, we are able to combine the formulas. In Fig. 6(e), when A = 0, the confounding path from Y to I_X vanishes, allowing for a back-door type formula $P(Y \mid do(X)) = \sum_{Z} P(Z \mid A = 0)\, P(Y \mid X, Z, A = 0)$.

Figure 6: LDAGs (a)–(e) such that P(Y | do(X)) is identifiable using CSIs, but not with standard do-calculus.

In this paper, we considered causal effect identifiability in the presence of context-specific independence relations, which commonly arise from causal mechanisms over discrete variables. We formalized the problem employing LDAGs, showed that deciding causal effect non-identifiability is NP-hard when CSIs are present, developed a calculus, and designed a readily usable automatic search procedure for finding identifying formulas. We showed that with only a few additional CSIs, our approach may enable identifiability in previously non-identifiable cases.

Currently, we are at the level of a calculus and a search procedure over the calculus. Although the presented rules and the search are sound, completeness results are harder to obtain. Although the general decision problem is NP-hard, one could think of applying polynomial ID over context-specific DAGs and then combining the results in order to obtain a complete decision procedure. However, the following theorem shows that identifiability in context-specific DAGs is not a direct indicator of general identifiability.

Theorem 6.

Causal effect P(Y | do(X)) may be non-identifiable from P(W) even if P(Y | do(X)) is identifiable in the context s specific DAGs for every s ∈ val(S) or if P(Y | do(X), s) is identifiable in the context s specific DAGs for every s ∈ val(S) where S contains only observed variables.

Hence, further research is needed to establish whether a theory similar to that of do-calculus, which resulted in completeness proofs through hedges and the ID and IDC algorithms [28, 17, 27], is possible here. The generalization to categorical variables is largely straightforward, but designing a feasible search procedure is certainly an additional challenge. Even as it stands, the presented approach can already leverage interventional distributions [3] by modifying the set of inputs I of Algorithm 1.

We believe our approach using CSIs will have an impact on a variety of related problems. We would like to use our approach to solve cases of transportability, selection bias and missing data problems [4, 5, 21]. The methodology presented is likely to render causal effects and distributions identifiable also in these problems, provided that there are CSI relations present.

Acknowledgments

ST was supported by Academy of Finland grant 311877 (Decision analytics utilizing causal models and multiobjective optimization). AH was supported by Academy of Finland grant 295673.

References

[1] G. Aleksandrowicz, H. Chockler, J. Y. Halpern, and A. Ivrii. The computational complexity of structure-based causality. Journal of Artificial Intelligence Research, 58:431–451, 2017.
[2] Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data. Journal of Computational Biology, 9(2):169–191, 2002.
[3] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In N. de Freitas and K. Murphy, editors, Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press, 2012.
[4] E. Bareinboim and J. Pearl. Transportability from multiple environments with limited experiments: Completeness results. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 280–288, 2014.
[5] E. Bareinboim and J. Tian. Recovering causal effects from selection bias. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 3475–3481, 2015.
[6] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proceedings of the 12th International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann, 1996.
[7] C. J. Butz, A. E. dos Santos, and J. S. Oliveira. Relevant path separation: A faster method for testing independencies in Bayesian networks. In , pages 74–85, 2016.
[8] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.
[9] D. M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In , pages 80–89. Morgan Kaufmann, 1997.
[10] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2):393–405, 1990.
[11] J. Corander, A. Hyttinen, J. Kontinen, J. Pensar, and J. Väänänen. A logical approach to context-specific independence. Annals of Pure and Applied Logic, 170(9):975–992, 2019.
[12] G. H. Dal, A. W. Laarman, and P. J. F. Lucas. Parallel probabilistic inference by weighted model counting. In Proceedings of Machine Learning Research – Volume 72, pages 97–108. PMLR, 2018.
[13] A. P. Dawid. Influence diagrams for causal modelling and inference. International Statistical Review, 70(2):161–189, 2002.
[14] D. Galles and J. Pearl. Testing identifiability of causal effects. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence, pages 185–195. Morgan Kaufmann, 1995.
[15] B. Georgi, J. Schultz, and A. Schliep. Context-specific independence mixture modelling for protein families. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 79–90. Springer, 2007.
[16] J. Y. Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317–337, 2000.
[17] Y. Huang and M. Valtorta. Identifiability in causal Bayesian networks: a sound and complete algorithm. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1149–1154. AAAI Press, 2006.
[18] A. Hyttinen, F. Eberhardt, and M. Järvisalo. Do-calculus when the true graph is unknown. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 395–404. AUAI Press, 2015.
[19] A. Hyttinen, J. Pensar, J. Kontinen, and J. Corander. Structure learning for Bayesian networks over labeled DAGs. In Proceedings of Machine Learning Research – Volume 72, pages 133–144. PMLR, 2018.
[20] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[21] K. Mohan, J. Pearl, and J. Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, volume 26, pages 1277–1285, 2013.
[22] H. Nyman, J. Pensar, T. Koski, and J. Corander. Stratified graphical models - context-specific independence in graphical models. Bayesian Analysis, 9(4):883–908, 2014.
[23] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
[24] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.
[25] J. Pensar, H. J. Nyman, T. Koski, and J. Corander. Labeled directed acyclic graphs: a generalization of context-specific independence in directed graphical models. Data Mining and Knowledge Discovery, 29(2):503–533, 2015.
[26] S. E. Shimony. Explanation, irrelevance, and statistical independence. In Proceedings of the 9th National Conference on Artificial Intelligence – Volume 1, pages 482–487. AAAI Press, 1991.
[27] I. Shpitser and J. Pearl. Identification of conditional interventional distributions. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 437–444. AUAI Press, 2006.
[28] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1219–1226. AAAI Press, 2006.
[29] I. Shpitser and J. Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9:1941–1979, 2008.
[30] S. Tikka, A. Hyttinen, and J. Karvanen. Causal effect identification from multiple incomplete data sources: A general search-based approach. https://arxiv.org/abs/1902.01073, 2019.
[31] S. Tikka, A. Hyttinen, and J. Karvanen. dosearch: Causal Effect Identification from Multiple Incomplete Data Sources, 2019. R package version 1.0.3.
[32] S. Tikka and J. Karvanen. Identifying causal effects with the R package causaleffect. Journal of Statistical Software, 76(12):1–30, 2017.
[33] S. Visscher, P. Lucas, I. Flesch, and K. Schurink. Using temporal context-specific independence information in the exploratory analysis of disease processes. In