Identifying Causal Effects via Context-specific Independence Relations
Santtu Tikka
Department of Mathematics and Statistics, University of Jyvaskyla, Finland
Antti Hyttinen
HIIT, Department of Computer Science, University of Helsinki, Finland
Juha Karvanen
Department of Mathematics and Statistics, University of Jyvaskyla, Finland
Abstract
Causal effect identification considers whether an interventional probability distribution can be uniquely determined from a passively observed distribution in a given causal structure. If the generating system induces context-specific independence (CSI) relations, the existing identification procedures and criteria based on do-calculus are inherently incomplete. We show that deciding causal effect non-identifiability is NP-hard in the presence of CSIs. Motivated by this, we design a calculus and an automated search procedure for identifying causal effects in the presence of CSIs. The approach is provably sound and it includes standard do-calculus as a special case. With the approach we can obtain identifying formulas that were unobtainable previously, and demonstrate that a small number of CSI-relations may be sufficient to turn a previously non-identifiable instance to identifiable.
Statistical independence of random variables is a central concept in any data analysis and prediction task. An important generalization of this concept is context-specific independence (CSI) [26, 6]. For a simple example, consider an antibiotic that normally has a dose-response effect on the number of bacteria. A genetic mutation makes the bacteria resistant to the antibiotic, meaning that in the context of this mutation the dose and the number of bacteria are independent. CSI-relations have been utilized to analyze, for example, gene expression data [2], dynamics of pneumonia [33], prognosis of heart disease [22], proteins [15], parliament elections [22] and occurrence of plants [22]. CSIs have also been used to speed up exact probabilistic inference [8, 12] and to improve structure learning [9, 19]. However, CSIs have received much less attention in causal inference and in particular in causal effect identifiability, despite their great potential in allowing for further identifiability results.

In the structural causal model (SCM) framework, the knowledge about the causal mechanisms under investigation is represented as a directed acyclic graph (DAG). When some nodes represent unobserved latent variables, all information can be determined from a corresponding semi-Markovian graph. Assuming the qualitative information given by the graph, the aim in causal effect identification is to determine whether a causal effect P(Y | do(X), Z) can be uniquely determined from the available passively observed distribution. The known causal structure, whichever formalism is used, specifies (generalized) conditional independence properties of the system through d-separation. These warrant the manipulation of interventional distributions with the rules of do-calculus and thus the derivation of identifying formulas [23, 24]. The ID algorithm implements this inference: it can identify the causal effect whenever it can be non-parametrically identified [28, 17, 32].

Figure 1: (a) A DAG over X, Y, A and L, where L is a latent unobserved variable. (b) CPT for P(X | A, L). (c) Decision tree with P(X | A, L) given in the leaf nodes. (d) Corresponding labeled DAG (LDAG), with the label A = 0 on L → X and the label AL = 1∗ on X → Y. (e) LDAG with an intervention node I_X added for X.

When we have further information on the generative causal model, the completeness results of the previous approaches do not apply anymore: more causal effects become identifiable and do-calculus based methods will report false non-identifiability. CSI-relations are one such piece of still qualitative information. One example is shown in Fig. 1(a). The causal effect P(Y | do(X)) is non-identifiable by do-calculus here due to the back-door path through the latent factor L. However, if we know the CSIs X ⊥⊥ L | A = 0 and X ⊥⊥ Y | A = 1, L, the causal effect is identifiable (see Eq. 1 in Sec. 4.2).

Accounting for CSIs imposes additional challenges for deciding causal effect identifiability and for the derivation of identifying formulas. Instead of graphical models for conditional independence, we need to employ inherently more complicated graphical models for CSI. As we shall show, derivation of causal effects requires context-specific reasoning. All this is well worthwhile if it warrants the identifiability of new causal effects.

We formulate the problem of causal effect identifiability in the presence of CSIs for binary variables and show that deciding non-identifiability is NP-hard (Sec. 3). Motivated by this we develop a calculus, and a search procedure over the rules of the calculus (Sec. 4 and 5). To make our search feasible, we eliminate redundant contexts, implement new separation criteria and use a well-motivated heuristic. With these techniques we scale up to network sizes often reported in literature.
Most importantly, we show a host of examples where do-calculus cannot identify a causal effect but our search procedure leveraging CSIs can prove identifiability (Sec. 6). Impact for future research and alternative approaches are discussed in Sec. 7.

Our starting point is causal effect identification over a DAG G = (V, E). The set W ⊆ V denotes a set of observed variables, marked by circular nodes. Since we also take into account the local structure, we mark any unobserved variables explicitly as rectangular nodes in the graph (as opposed to the semi-Markovian representation with bi-directed edges). The set pa(Y) denotes the parents of a node Y regardless of their observability. Notation x is used to denote an assignment to random variables X, and val(X) is used to denote the set of all possible assignments to X. All variables are assumed to be binary.

There are different ways of representing the local structure in the local conditional probability distribution (CPD) of a node given its parents [20, 9]. One of the most popular ways of modeling the local structure is to set some of the probabilities in the CPDs to be identical. For example, the conditional probability table (CPT) of X in Fig. 1(b) has identical probabilities in the first two rows. One way to model such local structure is to use decision trees as in Fig. 1(c); see Koller et al. [20] for others.

Importantly, local structure induces local CSIs of the form Y ⊥⊥ X | pa(Y) \ X = ℓ, denoting that Y is independent of the value of a parent X when the other parents of Y are assigned to values ℓ. The local CPT in Fig. 1(b) implies X ⊥⊥ L | A = 0. The decision tree in Fig. 1(c) also shows this local CSI: once going down the branch with A = 0, the value of X is not influenced by the value of L.

In this paper, we employ the idea of Pensar et al. [25] and mark local CSIs as labels on the edges of the DAG.
A DAG (V, E) together with a set of labels ℒ defines a labeled DAG (LDAG) G = (V, E, ℒ), where for each edge X → Y ∈ E there is a label L ∈ ℒ, which is a (possibly empty) set of assignments to pa(Y) \ X, i.e., the other parents of Y. Each assignment in the label encodes a local CSI: if ℓ ∈ L, then Y ⊥⊥ X | pa(Y) \ X = ℓ. Symbol ∗ is used as a shortcut notation for any value. For example, the label AL = 1∗ on X → Y in Fig. 1(d) implies that X ⊥⊥ Y | A = 1, L. Finally, throughout the paper, we restrict our attention to regular maximal LDAGs. Maximality requires that all labels that follow from other labels are recorded in the edges. Regularity means that edges absent in every context are not included in the graph. See Pensar et al. [25] for details.

Any LDAG can be turned into a context s specific DAG by removing edges that are spurious (i.e., irrelevant) when variables S have values s as follows. The nodes appearing in the label L on some X → Y can be partitioned into two sets A and B: nodes in A are assigned to a by the context s, while nodes in B are not. Then, the edge X → Y ∈ E is not present in the context s specific DAG (i.e., the edge is spurious) if (a, b) ∈ L for all possible assignments b. For example, the context A = 1 specific DAG of Fig. 1(d) is identical to the underlying DAG except for X → Y being absent.

A sufficient condition for a non-local CSI to be implied by an LDAG structure is given by the CSI-separation criterion [6]: if sets of nodes X and Y are d-separated given C, S in the context s specific DAG of G, then X ⊥⊥ Y | C, s is implied by G. Note that d-separation is a special case when S = ∅. For example, the labeling in Fig. 1(d) implies that X ⊥⊥ L | A = 0 by this criterion, as the edge L → X is absent in the context A = 0 specific DAG.

We assume a positive distribution over the variables V [17]. This makes causal effects well-defined and justifies conditioning on any subset of variables or their particular assignments.

As the first contribution we formalize the causal identifiability problem in the presence of CSIs. Identifiability [24, 29] considers whether a causal effect can be uniquely identified in models with a given fixed structure. If an effect is non-identifiable, there are (at least) two models that agree with the observations and have the same given structure but disagree on the causal effect.

We use LDAGs to define identifiability in the presence of CSIs, as LDAGs offer a simple and intuitive visual view of the causal structure and local CSIs. The LDAG is assumed to be known from background knowledge on the study at hand, in the same way as semi-Markovian graphs are standardly drawn for do-calculus. For example, consider (again) the case where an antibiotic A has a dose-response effect on H only if a genetic mutation M has not taken place. Hence, we would mark the label M = 1 on the edge A → H. Thus, the causal effect identification problem can be formulated as:

Input: An LDAG G over V, P(W) for W ⊆ V, a query P(Y | do(X), Z) s.t. Y, X, Z ⊂ W.
Task: Output a formula for P(Y | do(X), Z) over P(W), or decide that it is non-identifiable.

When no labels appear on the edges of an LDAG, the causal structure can be directly cast as a semi-Markovian graph. Thus, the setting of do-calculus is a special case of this one.

In contrast to causal effect identifiability over semi-Markovian graphs, which has polynomial decision procedures [28, 17], taking local structure and CSIs into account makes the corresponding decision problem NP-hard. (The proofs for all theorems are given in the supplementary material.)
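The construction of a context s specific DAG described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `context_specific_edges`, the dictionary-based encoding of labels, and the edge list used for Fig. 1(d) (A → X, L → X, A → Y, L → Y, X → Y) are assumptions made for the example. Labels are stored as sets of full assignments to the other parents, so a label such as AL = 1∗ is expanded by hand.

```python
from itertools import product

def context_specific_edges(edges, parents, labels, context):
    """Return the edges that remain in the context-specific DAG.

    edges:   list of (x, y) pairs meaning x -> y
    parents: dict mapping each node to a tuple of its parents
    labels:  dict mapping (x, y) to a set of assignments to pa(y) minus x,
             each assignment a frozenset of (variable, value) pairs
    context: dict mapping some variables to fixed binary values
    """
    kept = []
    for x, y in edges:
        others = [p for p in parents[y] if p != x]
        fixed = {p: context[p] for p in others if p in context}   # set A
        free = [p for p in others if p not in context]            # set B
        label = labels.get((x, y), set())
        # The edge is spurious if (a, b) is in the label for every b.
        spurious = all(
            frozenset({**fixed, **dict(zip(free, b))}.items()) in label
            for b in product((0, 1), repeat=len(free))
        )
        if not spurious:
            kept.append((x, y))
    return kept

# Assumed encoding of the LDAG of Fig. 1(d).
EDGES = [("A", "X"), ("L", "X"), ("A", "Y"), ("L", "Y"), ("X", "Y")]
PARENTS = {"A": (), "L": (), "X": ("A", "L"), "Y": ("X", "A", "L")}
LABELS = {
    ("L", "X"): {frozenset({("A", 0)})},                  # label A = 0
    ("X", "Y"): {frozenset({("A", 1), ("L", 0)}),         # label AL = 1*
                 frozenset({("A", 1), ("L", 1)})},
}

dag_a1 = context_specific_edges(EDGES, PARENTS, LABELS, {"A": 1})
dag_a0 = context_specific_edges(EDGES, PARENTS, LABELS, {"A": 0})
```

Under this encoding, the context A = 1 removes only X → Y and the context A = 0 removes only L → X, which matches the two examples given in the text.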
Theorem 1. Deciding non-identifiability of a causal effect given an LDAG over V and a passively observed distribution over W ⊆ V is NP-hard.

Rule 1 (Insertion/Deletion of observations):
\[ P(Y \mid do(X), Z, W) = P(Y \mid do(X), W) \quad \text{if } Y \perp\!\!\!\perp Z \mid X, W \;\|\; X \]
Rule 2 (Action/Observation exchange):
\[ P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), Z, W) \quad \text{if } Y \perp\!\!\!\perp I_Z \mid X, Z, W \;\|\; X \]
Rule 3 (Insertion/Deletion of actions):
\[ P(Y \mid do(X), do(Z), W) = P(Y \mid do(X), W) \quad \text{if } Y \perp\!\!\!\perp I_Z \mid X, W \;\|\; X \]
Figure 2: Rules of do-calculus. The sets X, Y, Z and W are disjoint. Notation || X means that the condition is evaluated in a graph in which edges into X are removed. I_Z denotes the intervention nodes of variables Z (see Sec. 4.1).

The proof of Theorem 1 shows that 3-SAT can be reduced to the identifiability of P(Y | do(X)) from P(X, Y). On an intuitive level, the intricate structure in the local CPDs allows for representing instances of NP-hard decision problems. This result is related to NP-hardness results of exact inference [10], the implication problem of CSIs [20, 11] and the complexity results for Halpern's actual causation [1]; however, we are not aware of other NP-hardness results for causal effect identifiability.

In light of Theorem 1, fast algorithms for determining identifiability of causal effects may be generally unobtainable. Thus, we take here an approach similar to [14, 23, 16] and formulate a calculus called CSI-calculus which can be used to show identifiability for particular instantiations of the problem. CSI-calculus is an extension of the do-calculus of Fig. 2. In the first subsection we show that due to the versatile graphical model used (LDAG), we only need to consider identification of conditional probabilities (i.e., the do-operation is not needed). The second subsection gives the rules of CSI-calculus.
Interventions can be encoded naturally with the use of intervention variables and CSIs [6, 23, 13]. Here we show how this can be done for LDAGs.

For any LDAG (V, E, ℒ), we can construct an augmented LDAG that has the capacity to represent interventions as follows. Each node X ∈ V is augmented by an intervention node I_X and an edge I_X → X. If I_X = 0, then X is in its passive observational state determined by its parents pa(X). If I_X = 1, then X is intervened on and its value is determined independently from its parents. For every X ∈ V and every label L_Z ∈ ℒ of every incoming edge Z → X such that Z ≠ I_X, we construct the augmented label L′_Z by including the assignments I_X = ∗, pa(X) \ (I_X ∪ Z) = ℓ for every ℓ ∈ L_Z and I_X = 1, pa(X) \ (I_X ∪ Z) = ∗. In other words, L′_Z renders the edge Z → X spurious when I_X = 1 or in any context where L_Z would. Fig. 1(e) shows an LDAG that is constructed from the LDAG in Fig. 1(d) by adding an intervention node for X.

Using the above construction, an interventional distribution P(Y | do(X)) is now simply a conditional distribution P(Y | X, I_X = 1). Thus, we can essentially drop the do-operator from the problem definition, and model interventions using intervention nodes and CSIs instead. To simplify the notation, we omit from formulas the intervention nodes of variables that are in their passive observational state. We do still include the do-operator when possible for improved readability.

Figure 3 describes the rules of CSI-calculus. In the rules we use terms that apply to all assignments (large letters) and to particular assignments (small letters). We do this in order to make the derivations shorter and identifying formulas more understandable. A valid calculus can be formed by omitting all large letters, but our experiments (Sec. 6) suggest that such a calculus is far less efficient.

Rule 1 (Insertion/Deletion of observations):
\[ P(Y, y' \mid Z, z', X, x') = P(Y, y' \mid X, x') \quad \text{if } Y, Y' \perp\!\!\!\perp Z, Z' \mid X, x' \]
Rule 2 (Marginalization/Sum-rule):
\[ P(Y, y' \mid X, x') = \sum_{Z} P(Y, y', Z \mid X, x') \]
Rule 3 (Conditioning):
\[ P(Y \mid Z, z', X, x') = \frac{P(Y, Z, z' \mid X, x')}{\sum_{Y} P(Y, Z, z' \mid X, x')} \]
Rule 4 (Product-rule):
\[ P(Y, y', Z, z' \mid X, x') = P(Y, y' \mid Z, z', X, x')\, P(Z, z' \mid X, x') \]
Rule 5 (General-by-case reasoning):
\[ P(Y, y', 1 - z \mid X, x') = P(Y, y' \mid X, x') - P(Y, y', z \mid X, x') \]
Rule 6 (Case-by-case reasoning):
\[ P(Y, y', Z \mid X, x') = \begin{cases} P(Y, y', Z = 0 \mid X, x') & \text{if } Z = 0 \\ P(Y, y', Z = 1 \mid X, x') & \text{if } Z = 1 \end{cases} \]
Rule 7 (Case-by-general reasoning (a)):
\[ P(Y, y', z \mid X, x') = P(Y, y', Z \mid X, x') \big|_{Z = z} \]
Rule 8 (Case-by-general reasoning (b)):
\[ P(Y, y' \mid X, x', z) = P(Y, y' \mid X, x', Z) \big|_{Z = z} \]
Figure 3: Rules of CSI-calculus. The sets X, X′, Y, Y′, Z and Z′ are disjoint. We write w′ as shorthand for the explicit assignment W′ = w′.

Rule 1 is directly the definition of context-specific independence, which includes conditional independence as a special case. Rule 1 can be applied in both directions: when the term on the left is identified, so is the term on the right and vice versa, provided that the separation condition is satisfied. Marginalization, conditioning and factorization from standard probability calculus are operationalized by rules 2–4, respectively. Rule 5 uses the law of total probability to obtain the probability of the complement. Rules 2–5 are applied from right to left: when the expressions on the right are identified, then so is the term on the left. Rule 5 is also valid when Y and Y′ are empty sets: in this case the rule should be understood as P(1 − z | X, x′) = 1 − P(z | X, x′).

Rule 6 explicates that if we know the expression for each assignment Z = z, then we also know the expression without a specific assignment to Z. When rules 4–6 are applied, both distributions on the right-hand side must be known. Rules 7 and 8 formulate the fact that if an expression is known for all assignments to Z, it is also known for a specific assignment Z = z. For rules 5–8, it is assumed for convenience that Z is a singleton. This assumption does not restrict identifiability, since operations involving sets can be carried out by applying the rules for each member of the set sequentially. For identifiable queries, the formula in terms of the joint distribution P(W) is easily obtained by backtracking the chain of manipulations that resulted in identification.

Importantly, CSI-calculus includes the standard do-calculus of Fig. 2 as a special case.

Theorem 2. CSI-calculus subsumes do-calculus.
This means that any formula that is derivable with standard do-calculus over a DAG G (with latents) is also derivable using CSI-calculus over the LDAG formed by simply adding intervention nodes and labels as described in Section 4.1. After this augmentation, Rule 1 fully encompasses the three rules of do-calculus [23, 24]; this is shown in the proof of the theorem.

More importantly, the calculus of Fig. 3 can identify causal effects that are not identifiable with the standard do-calculus. For the example of Fig. 1, the following formula can be obtained:
\[ P(Y \mid do(X)) = P(Y \mid A = 0, X)\, P(A = 0) + P(Y \mid A = 1)\, P(A = 1). \tag{1} \]
A simple derivation of this formula using CSI-calculus is shown in Fig. 4. Note that the back-door formula P(Y | do(X)) = Σ_A P(A) P(Y | A, X) is not valid here: conditioning on X when A = 1 biases Y through X ← L → Y.

Figure 4: A derivation of P(Y | do(X)) from P(X, Y, A) in the example of Fig. 1. The applied rules and CSIs are marked next to the edges connecting the terms. The CSIs used with Rule 1 are A ⊥⊥ I_X, A ⊥⊥ X | I_X = 1, Y ⊥⊥ I_X | X, A = 0, Y ⊥⊥ I_X | A = 1 and Y ⊥⊥ X | A = 1, I_X = 1. The identifying formula is Eq. 1.

In contrast to the setting of standard do-calculus, due to the formidable number of contexts and the causal structure being described by an arguably more complex graph formalism, applying the rules of CSI-calculus by hand is impossible (recall also Theorem 1 on NP-hardness). Hence, we follow the approach of [30, 18] and devise a forward search procedure over the rules of CSI-calculus that is able to automatically output identifying formulas and derivations such as Fig. 4.

However, for any instance, there are a vast number of terms that may end up being useful in identifying the query term; in fact, the derivation in Fig. 4 only shows the terms that were actually needed (in hindsight). For applying rule 1 we need to check a coNP-hard separation criterion, in contrast to the polynomial check of d-separation in the standard do-calculus setting. Hence, we focus here on how to efficiently evaluate separation criteria (Sec. 5.1), combine contexts (Sec. 5.2) and implement the heuristic search (Sec. 5.3) without weakening the theoretical properties (Sec. 5.4).
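Eq. 1 can be checked numerically on a small model that satisfies the two CSIs of the Fig. 1 example. The sketch below is illustrative only: the parameter values are made up, A and L are assumed to be independent root nodes, and the helper names (`p_obs`, `truth`, `eq1`, `backdoor`) are hypothetical. It compares the true interventional probability, computed with access to the latent L, against Eq. 1 and against the back-door formula, both computed from the observed distribution P(X, Y, A) alone.

```python
from itertools import product

# Made-up parameters; A and L are assumed to be independent roots.
P_A1, P_L1 = 0.3, 0.6                       # P(A=1), P(L=1)
P_X1 = {(0, 0): 0.2, (0, 1): 0.2,           # P(X=1 | A=a, L=l), keyed (a, l);
        (1, 0): 0.1, (1, 1): 0.9}           # rows with a=0 agree: X indep. of L given A=0
P_Y1 = {(0, 0, 0): 0.1, (0, 0, 1): 0.5,     # P(Y=1 | X=x, A=a, L=l), keyed (x, a, l);
        (1, 0, 0): 0.7, (1, 0, 1): 0.9,
        (0, 1, 0): 0.2, (1, 1, 0): 0.2,     # entries with a=1 do not depend on x:
        (0, 1, 1): 0.8, (1, 1, 1): 0.8}     # Y indep. of X given A=1, L

def bern(p1, v):
    """Probability that a binary variable with P(1) = p1 takes value v."""
    return p1 if v == 1 else 1.0 - p1

def joint(a, l, x, y):
    """Full joint P(a, l, x, y) of the observational model."""
    return (bern(P_A1, a) * bern(P_L1, l)
            * bern(P_X1[a, l], x) * bern(P_Y1[x, a, l], y))

def p_obs(**fixed):
    """Observed probability of an event over a, x, y (L is summed out)."""
    total = 0.0
    for a, l, x, y in product((0, 1), repeat=4):
        point = {"a": a, "x": x, "y": y}
        if all(point[k] == v for k, v in fixed.items()):
            total += joint(a, l, x, y)
    return total

def truth(x, y=1):
    """P(Y=y | do(X=x)) computed with access to the latent L."""
    return sum(bern(P_A1, a) * bern(P_L1, l) * bern(P_Y1[x, a, l], y)
               for a, l in product((0, 1), repeat=2))

def eq1(x, y=1):
    """Eq. 1: P(Y | A=0, X) P(A=0) + P(Y | A=1) P(A=1)."""
    term0 = p_obs(y=y, a=0, x=x) / p_obs(a=0, x=x) * p_obs(a=0)
    term1 = p_obs(y=y, a=1) / p_obs(a=1) * p_obs(a=1)
    return term0 + term1

def backdoor(x, y=1):
    """Back-door adjustment over A, which is not valid in this model."""
    return sum(p_obs(a=a) * p_obs(y=y, a=a, x=x) / p_obs(a=a, x=x)
               for a in (0, 1))
```

With these particular numbers, `eq1(x)` agrees with `truth(x)` up to floating-point error for both values of x, while `backdoor(1)` differs from `truth(1)` noticeably, which is consistent with the remark that conditioning on X when A = 1 opens the path X ← L → Y.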
Rule 1 requires the evaluation of possibly non-local CSIs. Recall from Section 2 that CSI-separation is only a sufficient criterion; in practice it misses many of the important independence relations. For a feasible search procedure we need a sufficiently fast way to check a sufficient separation criterion. The following sufficient criterion is implemented in the search for this purpose.
Theorem 3. If there exists a set C such that Y ⊥⊥ Z | X, w, C is implied by an LDAG G and one of the following is also implied by G: (i) Y ⊥⊥ C | X, w, (ii) C ⊥⊥ Z | X, w, (iii) Y ⊥⊥ C | X, Z, w, or (iv) Z ⊥⊥ C | X, Y, w, then also Y ⊥⊥ Z | X, w is implied by G.

When a CSI statement Y ⊥⊥ Z | X, w is encountered by the search, the following procedure is applied. First, we verify whether the CSI is directly encoded in a label. If it is, we can stop; if it is not, we continue by applying the CSI-separation criterion. If the CSI-separation criterion does not hold, we continue by attempting to find a set C that satisfies Y ⊥⊥ Z | X, w, c for all c ∈ val(C). Theorem 3 is then applied recursively to verify whether any of the required CSIs Y ⊥⊥ C | X, w, C ⊥⊥ Z | X, w, Y ⊥⊥ C | X, Z, w or Z ⊥⊥ C | X, Y, w holds in G. To guarantee that the recursion terminates, each variable can appear only once in each branch of the recursion. We further reduce the number of evaluated CSIs by caching them during the search.

The number of possible contexts increases exponentially with the number of variables. It is therefore important to determine which contexts should be considered when CSIs are evaluated. Different contexts often share the same context-specific DAG. We define the equivalence relation ∼ as follows: s₁ ∼ s₂ if and only if the context s₁ specific DAG is the same as the context s₂ specific DAG, where s₁, s₂ ∈ val(S). When evaluating the CSI Y ⊥⊥ Z | X, w, C of Theorem 3, we do not have to determine d-separation for every c ∈ val(C) and w, c specific DAG. It suffices to restrict our attention to the context-specific DAGs given by the representatives of val(C)/∼.

Theorem 4. Let R be a set of representatives of val(C)/∼. If Y is CSI-separated from Z by X in the context w, c in G for all c ∈ R, then Y is CSI-separated from Z by X in the context w, c in G for all c ∈ val(C).

The definition of intervention nodes can also be used in this way. In general, an arbitrary context S = s can render a number of edges spurious in the LDAG. However, if the context contains the assignment I_X = 1 for any node X, we know that every incoming edge of X except I_X → X will be made spurious by definition without requiring any further verification.

Algorithm 1
Input: Target Q = P(Y | do(X), Z), LDAG G and input I = {P(W)}.
Output: A formula F for Q in terms of P(W) or NA.
1: let U be the set of unexpanded terms, initially U := I.
2: for P′ ∈ U:
3:   let I* be the set of all distributions derived from P′ using the rules of Section 4.
4:   for each new candidate distribution P* ∈ I*, do
5:     if an additional input is required that is not in I, then continue.
6:     if the CSI relation of the current rule is not satisfied by G, then continue.
7:     if P* = Q, then derive a formula F for Q by backtracking and return F.
8:     Add P* to I, add P* to U.
9:   Mark P′ as expanded: remove P′ from U.
10: return NA.

Algorithm 1 shows the pseudo-code which implements the calculus of Section 4 and is capable of solving problems that fall under the formulation of Section 3 through the use of a search heuristic and elimination of redundant contexts. A single distribution is called a term, which is considered expanded if every valid manipulation has been performed on it.

The input distribution is marked as unexpanded on line 1 and iteration over the unexpanded terms begins on line 2. In order to guide the search to identify the most promising terms, we relate the identified distributions to the target Q through a heuristic proximity function and always expand the closest term in U first. Note that if we were to expand only the closest term to the target greedily, several identifiable instances would be left non-identified because the identifying formulas and derivations are highly non-trivial. More details about the proximity function are given in the supplementary material. If multiple terms share the maximal value of the proximity function, the term that was identified first is selected. Next, the rules of Section 4 are applied to P′ and the derived candidate distributions are added to the set I* on line 3. Note that not every distribution in I* is necessarily identified at this point.

Iteration over the set I* begins on line 4. Here the candidate terms P* in I* that can be identified are added to the set I. Previously identified terms are not identified again. Line 5 verifies that both required terms are identified for rules 4–6. Line 6 applies Theorem 3 to check the required CSI relation for rule 1. Tests for d-separation are carried out via relevant path separation [7].

If all requirements are met, P* is identified either as the target on line 7 or as a new unexpanded distribution on line 8. Once all candidate distributions are processed, we mark P′ as expanded on line 9. Note that P′ can still appear as a second required term on line 5 when another term is being expanded. Finally, if the target was not identified and the set of unexpanded distributions was exhausted, we deem the target non-identifiable by the search and return NA on line 10.

The formulated search is sound in the following sense.
Figure 5: (a) Running times of Algorithm 1 (time per instance in minutes over sorted instances, for n = 7, 8, 9). Full CS is a naive version which does not combine contexts. (b) Time usage of each rule (average time in minutes for Rules 1–8, for n = 7, 8, 9) with error bars showing the standard error.
Theorem 5 (Soundness). Algorithm 1 always terminates, and if it returns an expression, the expression is correct.
In the setting of standard do-calculus, where no labels are present (in addition to those defining interventions), the search is complete for (conditional) causal effect identifiability. This is because the separation condition is general enough to capture all conditional independences used by do-calculus, as shown by Theorem 2.
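The reduction of contexts in Section 5.2 can be illustrated by grouping the assignments of a set C according to the context-specific DAG they induce and keeping one representative per group, as in Theorem 4. The following Python sketch is an assumption-laden illustration rather than the implementation in the search: the function names `context_dag` and `context_classes`, the label encoding (sets of full assignments to the other parents), and the edge list assumed for Fig. 1(d) are all hypothetical.

```python
from itertools import product

def context_dag(edges, parents, labels, context):
    """Edge set of the context-specific DAG for the given context."""
    kept = []
    for x, y in edges:
        others = [p for p in parents[y] if p != x]
        fixed = {p: context[p] for p in others if p in context}
        free = [p for p in others if p not in context]
        label = labels.get((x, y), set())
        # Keep the edge unless every completion of the context is in the label.
        if not all(frozenset({**fixed, **dict(zip(free, b))}.items()) in label
                   for b in product((0, 1), repeat=len(free))):
            kept.append((x, y))
    return frozenset(kept)

def context_classes(edges, parents, labels, cond_vars, base_context=None):
    """Partition val(C) into classes that share a context-specific DAG."""
    base = dict(base_context or {})
    classes = {}
    for values in product((0, 1), repeat=len(cond_vars)):
        context = {**base, **dict(zip(cond_vars, values))}
        dag = context_dag(edges, parents, labels, context)
        classes.setdefault(dag, []).append(values)
    return list(classes.values())

# Assumed encoding of the LDAG of Fig. 1(d).
EDGES = [("A", "X"), ("L", "X"), ("A", "Y"), ("L", "Y"), ("X", "Y")]
PARENTS = {"A": (), "L": (), "X": ("A", "L"), "Y": ("X", "A", "L")}
LABELS = {
    ("L", "X"): {frozenset({("A", 0)})},
    ("X", "Y"): {frozenset({("A", 1), ("L", 0)}),
                 frozenset({("A", 1), ("L", 1)})},
}

classes = context_classes(EDGES, PARENTS, LABELS, ("A", "L"))
representatives = [cls[0] for cls in classes]
```

In this small example the four assignments to (A, L) fall into two classes, one for A = 0 and one for A = 1, so a separation check would only need to be run on two context-specific DAGs instead of four.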
We implemented the search in C++ and the code is available in the R package dosearch on CRAN [31]. First we present a simulation study on the search and then show a host of examples where identifiability can be shown with our approach. Experiments were performed on a modern desktop computer (single thread, Intel Core i7-4790, 3.4 GHz).

We considered DAGs with n = 7, 8, 9 nodes with 100 DAGs for each n. Edges for the DAGs were sampled randomly with a fixed average degree, and labels on the edges (local CSIs) were sampled with a fixed probability. Two of the nodes were considered latent and the aim was to determine whether P(Y | do(X)) can be identified. Fig. 5(a) shows the running times of Algorithm 1 with a per-instance timeout. The search times when all contexts are considered separately (i.e., the terms have fixed assigned values for all variables) are included as a baseline (full CS). Using terms that combine assignments as formulated in CSI-calculus considerably speeds up the execution times.

In the same simulation, we examined the effect of applying the individual rules on the total running times, as shown in Fig. 5(b). Rules 1 and 4 dominate the running time. For rule 1, considerable time is spent on checking whether the conditional independence constraints hold (recall that this step is also (co)NP-hard). Rule 4 combines two previously identified terms, and therefore a single term may help to identify further terms in a large number of ways.

Importantly, the search implementing CSI-calculus can prove identifiability of P(Y | do(X)) for the LDAGs in Fig. 6, which would be non-identifiable otherwise via standard do-calculus. Non-identifiability can be verified by running ID on the underlying DAGs without labels or by noting that each graph includes a hedge.

In Fig. 6(a), P(Y | do(X)) = P(Y | X, W = 1). Intuitively, node W acts similarly to an intervention node and hence conditioning on W = 1 eliminates the back-door path.

In Fig. 6(b), P(Y | do(X)) = P(Y), because X and Y are independent when X is intervened on due to the labels.

In Fig. 6(c), P(Y | do(X)) = P(Y | Z = 0, X) P(Z = 0) + P(Y | Z = 1) P(Z = 1). Adjusting for Z is needed, which opens up a new d-connecting path through H and Q. Fortunately, when Z = 0 there is no confounding path, and when Z = 1 there is a confounding path but no directed path from Z.

In Fig. 6(d), the causal effect is identifiable and the output of Algorithm 1 is:
\[ P(Y \mid do(X)) = P(A = 1) \sum_{W} P(Y \mid X, W, A = 1)\, P(W \mid A = 1) + P(A = 0) \sum_{Z} P(Z \mid X, A = 0) \sum_{X'} P(Y \mid X', Z, A = 0)\, P(X' \mid A = 0) \]
When A = 1, the first term resembles the back-door formula, adjusting for W. When A = 0, the second term resembles the front-door formula through Z. Since A ⊥⊥ X, I_X in the LDAG, we are able to combine the formulas.

In Fig. 6(e), when A = 0, the confounding path from Y to I_X vanishes, allowing for a back-door type formula P(Y | do(X)) = Σ_Z P(Z | A = 0) P(Y | X, Z, A = 0).

Figure 6: LDAGs (a)–(e) such that P(Y | do(X)) is identifiable using CSIs, but not with standard do-calculus.

In this paper, we considered causal effect identifiability in the presence of context-specific independence relations, which commonly arise from causal mechanisms over discrete variables. We formalized the problem employing LDAGs, showed that deciding causal effect non-identifiability is NP-hard when CSIs are present, developed a calculus, and designed a readily usable automatic search procedure for finding identifying formulas. We showed that with only a few additional CSIs, our approach may enable identifiability in previously non-identifiable cases.

Currently, we are at the level of a calculus and a search procedure over the calculus. Although the presented rules and the search are sound, completeness results are harder to obtain. Even though the general decision problem is NP-hard, one could think of applying polynomial ID over context-specific DAGs and then combining the results in order to obtain a complete decision procedure. However, the following theorem shows that identifiability in context-specific DAGs is not a direct indicator of general identifiability.
Theorem 6. Causal effect P(Y | do(X)) may be non-identifiable from P(W) even if P(Y | do(X)) is identifiable in the context s specific DAGs for every s ∈ val(S) or if P(Y | do(X), s) is identifiable in the context s specific DAGs for every s ∈ val(S), where S contains only observed variables.

Hence, further research is needed towards a theory similar to that of do-calculus, which resulted in completeness proofs through hedges and the ID and IDC algorithms [28, 17, 27], if such a theory is possible here. The generalization to categorical variables is mostly straightforward, but designing a feasible search procedure is certainly an additional challenge. As such, the presented approach can already leverage interventional distributions [3] by modifying the set of inputs I of Algorithm 1.

We believe our approach using CSIs will have an impact on a variety of related problems. We would like to use our approach to solve cases of transportability, selection bias and missing data problems [4, 5, 21]. The methodology presented is likely to render causal effects and distributions identifiable also in these problems, provided that there are CSI relations present.

Acknowledgments
ST was supported by Academy of Finland grant 311877 (Decision analytics utilizing causal models and multiobjective optimization). AH was supported by Academy of Finland grant 295673.

References

[1] G. Aleksandrowicz, H. Chockler, J. Y. Halpern, and A. Ivrii. The computational complexity of structure-based causality. Journal of Artificial Intelligence Research, 58:431–451, 2017.
[2] Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data. Journal of Computational Biology, 9(2):169–191, 2002.
[3] E. Bareinboim and J. Pearl. Causal inference by surrogate experiments: z-identifiability. In N. de Freitas and K. Murphy, editors, Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press, 2012.
[4] E. Bareinboim and J. Pearl. Transportability from multiple environments with limited experiments: Completeness results. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, pages 280–288, 2014.
[5] E. Bareinboim and J. Tian. Recovering causal effects from selection bias. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 3475–3481, 2015.
[6] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proceedings of the 12th International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann, 1996.
[7] C. J. Butz, A. E. dos Santos, and J. S. Oliveira. Relevant path separation: A faster method for testing independencies in Bayesian networks. In , pages 74–85, 2016.
[8] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.
[9] D. M. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In , pages 80–89. Morgan Kaufmann, 1997.
[10] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2):393–405, 1990.
[11] J. Corander, A. Hyttinen, J. Kontinen, J. Pensar, and J. Väänänen. A logical approach to context-specific independence. Annals of Pure and Applied Logic, 170(9):975–992, 2019.
[12] G. H. Dal, A. W. Laarman, and P. J. F. Lucas. Parallel probabilistic inference by weighted model counting. In Proceedings of Machine Learning Research – Volume 72, pages 97–108. PMLR, 2018.
[13] A. P. Dawid. Influence diagrams for causal modelling and inference. International Statistical Review, 70(2):161–189, 2002.
[14] D. Galles and J. Pearl. Testing identifiability of causal effects. In Proceedings of the 11th Annual Conference on Uncertainty in Artificial Intelligence, pages 185–195. Morgan Kaufmann, 1995.
[15] B. Georgi, J. Schultz, and A. Schliep. Context-specific independence mixture modelling for protein families. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 79–90. Springer, 2007.
[16] J. Y. Halpern. Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317–337, 2000.
[17] Y. Huang and M. Valtorta. Identifiability in causal Bayesian networks: a sound and complete algorithm. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1149–1154. AAAI Press, 2006.
[18] A. Hyttinen, F. Eberhardt, and M. Järvisalo. Do-calculus when the true graph is unknown. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 395–404. AUAI Press, 2015.
[19] A. Hyttinen, J. Pensar, J. Kontinen, and J. Corander. Structure learning for Bayesian networks over labeled DAGs. In Proceedings of Machine Learning Research – Volume 72, pages 133–144. PMLR, 2018.
[20] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[21] K. Mohan, J. Pearl, and J. Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, volume 26, pages 1277–1285, 2013.
[22] H. Nyman, J. Pensar, T. Koski, and J. Corander. Stratified graphical models - context-specific independence in graphical models. Bayesian Analysis, 9(4):883–908, 2014.
[23] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
[24] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition, 2009.
[25] J. Pensar, H. J. Nyman, T. Koski, and J. Corander. Labeled directed acyclic graphs: a generalization of context-specific independence in directed graphical models. Data Mining and Knowledge Discovery, 29(2):503–533, 2015.
[26] S. E. Shimony. Explanation, irrelevance, and statistical independence. In Proceedings of the 9th National Conference on Artificial Intelligence – Volume 1, pages 482–487. AAAI Press, 1991.
[27] I. Shpitser and J. Pearl. Identification of conditional interventional distributions. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, pages 437–444. AUAI Press, 2006.
[28] I. Shpitser and J. Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1219–1226. AAAI Press, 2006.
[29] I. Shpitser and J. Pearl. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research, 9:1941–1979, 2008.
[30] S. Tikka, A. Hyttinen, and J. Karvanen. Causal effect identification from multiple incomplete data sources: A general search-based approach. https://arxiv.org/abs/1902.01073, 2019.
[31] S. Tikka, A. Hyttinen, and J. Karvanen. dosearch: Causal Effect Identification from Multiple Incomplete Data Sources, 2019. R package version 1.0.3.
[32] S. Tikka and J. Karvanen. Identifying causal effects with the R package causaleffect. Journal of Statistical Software, 76(12):1–30, 2017.
[33] S. Visscher, P. Lucas, I. Flesch, and K. Schurink. Using temporal context-specific independence information in the exploratory analysis of disease processes. In