Learning Causal Graphs with Small Interventions
Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, Sriram Vishwanath
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA
[email protected], [email protected], [email protected], [email protected]

November 3, 2015
Abstract
We consider the problem of learning causal networks with interventions, when each intervention is limited in size, under Pearl's Structural Equation Model with independent errors (SEM-IE). The objective is to minimize the number of experiments to discover the causal directions of all the edges in a causal graph. Previous work has focused on the use of separating systems for complete graphs for this task. We prove that any deterministic adaptive algorithm needs to be a separating system in order to learn complete graphs in the worst case. In addition, we present a novel separating system construction, whose size is close to optimal and is arguably simpler than previous work in combinatorics. We also develop a novel information theoretic lower bound on the number of interventions that applies in full generality, including for randomized adaptive learning algorithms.

For general chordal graphs, we derive worst case lower bounds on the number of interventions. Building on observations about induced trees, we give a new deterministic adaptive algorithm to learn directions on any chordal skeleton completely. In the worst case, our achievable scheme is an α-approximation algorithm, where α is the independence number of the graph. We also show that there exist graph classes for which the sufficient number of experiments is close to the lower bound. In the other extreme, there are graph classes for which the required number of experiments is multiplicatively α away from our lower bound.

In simulations, our algorithm almost always performs very close to the lower bound, while the approach based on separating systems for complete graphs is significantly worse for random chordal graphs.

Causality is a fundamental concept in the sciences and in philosophy. The mathematical formulation of a theory of causality in a probabilistic sense has received significant attention recently (e.g., [2, 6, 8, 9, 14]).
A formulation advocated by Pearl considers structural equation models: in this framework, X is a cause of Y if Y can be written as f(X, E), for some deterministic function f and some latent random variable E. Given two causally related variables X and Y, it is not possible to infer whether X causes Y or Y causes X from random samples, unless certain assumptions are made on the distribution of E and/or on f [7, 15]. For more than two random variables, directed acyclic graphs (DAGs) are the most common tool for representing causal relations. For a given DAG D = (V, E), the directed edge (X, Y) ∈ E shows that X is a cause of Y.

If we make no assumptions on the data generating process, the standard way of inferring the causal directions is by performing experiments, the so-called interventions. An intervention requires modifying the process that generates the random variables: the experimenter has to enforce values on the random variables. This process is different from conditioning, as explained in detail in [14].

The natural problem to consider is therefore minimizing the number of interventions required to learn a causal DAG. Hauser et al. [6] developed an efficient algorithm that minimizes this number in the worst case. The algorithm is based on optimal coloring of chordal graphs and requires at most log χ interventions to learn any causal graph, where χ is the chromatic number of the chordal skeleton.

However, one important open problem appears when one also considers the size of the used interventions: each intervention is an experiment where the scientist must force a set of variables to take random values. Unfortunately, the interventions obtained in [6] can involve up to n/2 variables.
The simultaneous enforcing of many variables can be quite challenging in many applications: for example, in biology, some variables may not be enforceable at all, or may require complicated genomic interventions for each parameter.

In this paper, we consider the problem of learning a causal graph when intervention sizes are bounded by some parameter k. The first work we are aware of for this problem is by Eberhardt et al. [2], where an achievable scheme is provided. Furthermore, [3] shows that the set of interventions needed to fully identify a causal DAG must satisfy a specific set of combinatorial conditions, called a separating system, when the intervention size is not constrained or is 1. In [9], with the assumption that the same holds true for any intervention size, Hyttinen et al. draw connections between causality and known separating system constructions. One open problem is: if the learning algorithm is adaptive after each intervention, is a separating system still needed, or can one do better? It was believed that adaptivity does not help in the worst case [3] and that one still needs a separating system.

Our Contributions:
We obtain several novel results for learning causal graphs with interventions bounded by size k. The problem can be separated into the special case where the underlying undirected graph (the skeleton) is the complete graph, and the more general case where the underlying undirected graph is chordal.

1. For complete graph skeletons, we show that any adaptive deterministic algorithm needs an (n, k) separating system. This implies that lower bounds for separating systems also hold for adaptive algorithms, and resolves the previously mentioned open problem.
2. We present a novel combinatorial construction of a separating system that is close to the previous lower bound. This simple construction may be of more general interest in combinatorics.
3. Recently, [8] showed that randomized adaptive algorithms need only log log n interventions with high probability for the unbounded case. We extend this result and show that O((n/k) log log k) interventions of size bounded by k suffice with high probability.
4. We present a more general information theoretic lower bound of n/(2k) to capture the performance of such randomized algorithms.
5. We extend the lower bound for adaptive algorithms to general chordal graphs. We show that, over all orientations, the number of experiments of a (χ(G), k) separating system is needed, where χ(G) is the chromatic number of the skeleton graph.
6. We exhibit two extremal classes of graphs. For one of them, intervening through a (χ, k) separating system is sufficient. For the other class, we need α(χ − 1)/k ≈ n/k experiments in the worst case.
7. We exploit the structural properties of chordal graphs to design a new deterministic adaptive algorithm that uses the idea of separating systems together with adaptability through Meek rules. We simulate our new algorithm and empirically observe that it performs quite close to the (χ, k) separating system. Our algorithm requires much fewer interventions compared to (n, k) separating systems.
A causal DAG D = (V, E) is a directed acyclic graph where V = {x_1, x_2, ..., x_n} is a set of random variables and (x, y) ∈ E is a directed edge if and only if x is a direct cause of y. We adopt Pearl's structural equation model with independent errors (SEM-IE) in this work (see [14] for more details). Variables in S ⊆ V cause x_i if x_i = f({x_j}_{j ∈ S}, e_i), where e_i is a random variable independent of all other variables.

The causal relations of D imply a set of conditional independence (CI) relations between the variables. A conditional independence relation is of the following form: given Z, the set X and the set Y are conditionally independent, for some disjoint subsets of variables X, Y, Z. Because of this, causal DAGs are also called causal Bayesian networks. A set V of variables is Bayesian with respect to a DAG D if the joint probability distribution of V can be factorized as a product of the marginals of every variable conditioned on its parents.

All the CI relations that are learned statistically through observations can also be inferred from the Bayesian network using a graphical criterion called d-separation [16], assuming that the distribution is faithful to the graph. (Given a Bayesian network, any CI relation implied by d-separation holds true; all the CIs implied by the distribution can be found using d-separation if the distribution is faithful. Faithfulness is a widely accepted assumption, since it is known that only a measure zero set of distributions are not faithful [13].) Two causal DAGs are said to be Markov equivalent if they encode the same set of CIs; this holds if and only if they have the same skeleton and the same immoralities. The class of causal DAGs that encode the same set of CIs is called the Markov equivalence class. We denote the Markov equivalence class of a DAG D by [D]. The graph union of all DAGs in [D] is called the essential graph of D. (A separating system, used later, can equivalently be viewed as a 0-1 matrix with n distinct columns in which each row has at most k ones.)
It is denoted by E(D). E(D) is always a chain graph with chordal chain components [1].

The d-separation criterion can be used to identify the skeleton and all the immoralities of the underlying causal DAG [16]. Additional edges can be identified using the fact that the underlying DAG is acyclic and that there are no more immoralities. Meek derived local rules (Meek rules), introduced in [17], that are recursively applied to identify every such additional edge (see Theorem 3 of [12]). The repeated application of the Meek rules on this partially directed graph with identified immoralities, until they can no longer be applied, yields the essential graph.
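To make the propagation step concrete, here is a minimal illustrative sketch covering only rules R1 and R2 (the two rules that do not need non-adjacent vertex pairs beyond R1's); the function name `meek_closure` and the edge representation are ours, not the paper's:

```python
# Minimal sketch of Meek-rule propagation (rules R1 and R2 only) on a
# partially directed graph. `undirected` holds edges as frozensets {a, b};
# `directed` holds oriented edges as (tail, head) tuples.

def meek_closure(undirected, directed):
    """Apply Meek rules R1 and R2 until no more edges can be oriented."""
    undirected, directed = set(undirected), set(directed)

    def adjacent(a, b):
        return (frozenset((a, b)) in undirected
                or (a, b) in directed or (b, a) in directed)

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            u, v = tuple(edge)
            for a, b in ((u, v), (v, u)):
                # R1: some c -> a exists with c, b non-adjacent  =>  orient a -> b
                r1 = any(head == a and c != b and not adjacent(c, b)
                         for (c, head) in directed)
                # R2: a -> c and c -> b exist  =>  orient a -> b
                r2 = any((c, b) in directed
                         for (tail, c) in directed if tail == a)
                if r1 or r2:
                    undirected.discard(edge)
                    directed.add((a, b))
                    changed = True
                    break
    return directed
```

For example, with skeleton edge a − b and known orientation c → a (c and b non-adjacent), R1 orients a → b; with a → c and c → b known, R2 does the same.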
Given a set of variables V = {x_1, ..., x_n}, an intervention on a set S ⊂ V of the variables is an experiment where the experimenter forces each variable s ∈ S to take the value of another variable u that is independent of all other variables, i.e., s = u. This operation, and how it affects the joint distribution, is formalized by the do operator of Pearl [14]. An intervention modifies the causal DAG D as follows: the post-intervention DAG D_S is obtained by removing the connections of the nodes in S to their parents. The size of an intervention S is the number of intervened variables, i.e., |S|. Let S^c denote the complement of the set S.

CI-based learning algorithms can be applied to D_S to identify the set of removed edges, i.e., the parents of S [16], and the remaining adjacent edges in the original skeleton are declared to be the children. Hence:

(R0) The orientations of the edges of the cut between S and S^c in the original DAG D can be inferred.

Then the local Meek rules (introduced in [17]) are repeatedly applied to the original DAG D with the new directions learned from the cut, until no more directed edges can be identified. Further application of CI-based algorithms on D will reveal no more information. The Meek rules are given below:

(R1) (a − b) is oriented as (a → b) if ∃ c s.t. (c → a) and (c, b) ∉ E.
(R2) (a − b) is oriented as (a → b) if ∃ c s.t. (a → c) and (c → b).
(R3) (a − b) is oriented as (a → b) if ∃ c, d s.t. (a − c), (a − d), (c → b), (d → b) and (c, d) ∉ E.
(R4) (a − c) is oriented as (a → c) if ∃ b, d s.t. (b → c), (a − d), (a − b), (d → b) and (c, d) ∉ E.

The concepts of essential graphs and Markov equivalence classes are extended in [4] to incorporate the role of interventions. Let I = {I_1, I_2, ..., I_m} be a set of interventions, and let the above process be followed after each intervention. The interventional Markov equivalence class (I-equivalence) of a DAG is the set of DAGs that represent the same set of probability distributions obtained when the above process is applied after every intervention in I. It is denoted by [D]_I. Similar to the observational case, the I-essential graph of a DAG D is the graph union of all DAGs in the same I-equivalence class; it is denoted by E_I(D). We have the following sequence:

D → CI learning → Meek rules → E(D) → I_1 →(a) learn by R0 →(b) Meek rules → E_{I_1}(D) → I_2 → ... → E_{I_1, I_2}(D) → ...   (1)

Therefore, after a set of interventions I has been performed, the essential graph E_I(D) is a partially directed graph that captures all the causal relations discovered so far using I; before any interventions, E(D) captures the initially known causal directions. It is known that E_I(D) is a chain graph with chordal chain components: when all the directed edges are removed, the graph becomes a set of disjoint chordal graphs. (The skeleton of a DAG is the undirected graph obtained when its directed edges are made undirected. An induced subgraph on X, Y, Z is an immorality if X and Y are non-adjacent, X → Z and Z ← Y. The graph union of two DAGs D_1 = (V, E_1) and D_2 = (V, E_2) with the same skeleton is a partially directed graph D = (V, E), where (v_a, v_b) ∈ E is undirected if the edges (v_a, v_b) in E_1 and E_2 have different directions, and directed as v_a → v_b if the edges (v_a, v_b) in E_1 and E_2 are both directed as v_a → v_b. An undirected graph is chordal if it has no induced cycle of length greater than 3. Chain components: E(D) can be decomposed into a sequence of undirected chordal graphs G_1, G_2, ..., G_m such that there is a directed edge from a vertex in G_i to a vertex in G_j only if i < j.) We are interested in the following question:

Problem 1. Given that all interventions in I are of size at most k < n/2, i.e., |I| ≤ k for every I ∈ I, minimize the number of interventions |I| such that the partially directed graph with all directions learned so far satisfies E_I(D) = D.

The question is the design of an algorithm that computes a small set of interventions I given E(D). Note, of course, that the unknown directions of the edges of D are not available to the algorithm. One can view the design of I as an active learning process that finds D from the essential graph E(D). E(D) is a chain graph with undirected chordal components, and it is known that interventions on one chain component do not affect the discovery process of directed edges in the other components [5], so we will assume that E(D) is undirected and chordal to start with. Our notion of algorithm does not consider the time complexity (of the statistical algorithms involved) of steps a and b in (1): given m interventions, we only consider efficiently computing I_{m+1} using (possibly) the graph E_{I_1,...,I_m}(D). We consider the following three classes of algorithms:

1. Non-adaptive algorithm:
The choice of I is fixed prior to the discovery process.
2. Adaptive algorithm: At every step m, the choice of I_{m+1} is a deterministic function of E_{I_1,...,I_m}(D).
3. Randomized adaptive algorithm: At every step m, the choice of I_{m+1} is a random function of E_{I_1,...,I_m}(D).

The problem is different for complete graphs than for more general chordal graphs, since rule R1 becomes applicable when the graph is not complete; we therefore treat each case separately. First, we provide algorithms for all three classes for learning the directions of the complete graph E(D) = K_n (the undirected complete graph on n vertices). Then, we generalize to chordal graph skeletons and provide a novel adaptive algorithm with upper and lower bounds on its performance. The missing proofs of the results that follow can be found in the Appendix.

In this section, we consider the case where the skeleton we start with, i.e., E(D), is the undirected complete graph (denoted K_n). It is known that, at any stage in (1) starting from E(D), rules R1, R3 and R4 do not apply. Further, the underlying DAG D is a directed clique. The directed clique is characterized by an ordering σ on [1 : n] such that, in the subgraph induced by σ(i), σ(i + 1), ..., σ(n), the vertex σ(i) has no incoming edges. Let D be denoted by ~K_n(σ) for some ordering σ, and let [1 : n] denote the set {1, ..., n}. We need the following results on separating systems for our first result regarding adaptive and non-adaptive algorithms on a complete graph.

Definition 1. [10, 18] An (n, k)-separating system on an n-element set [1 : n] is a set of subsets S = {S_1, S_2, ..., S_m} such that |S_i| ≤ k and, for every pair i, j, there is a subset S ∈ S such that either i ∈ S, j ∉ S or j ∈ S, i ∉ S. If a pair i, j satisfies this condition with respect to S, then S is said to separate the pair i, j. Here we consider the case k < n/2.

In [10], Katona gave an (n, k)-separating system together with a lower bound on |S|. In [18], Wegener gave a simpler argument for the lower bound and also provided a tighter upper bound than the one in [10].
In this work, we give a different construction, whose separating system size is at most ⌈log_{⌈n/k⌉} n⌉ larger than the construction of Wegener. However, our construction has a simpler description.

Lemma 1.
There is a labeling procedure that produces distinct length-ℓ labels for all elements of [1 : n] using letters from the integer alphabet {0, 1, ..., a}, where ℓ = ⌈log_a n⌉. Further, in every digit (position), any integer letter is used at most ⌈n/a⌉ times.

Once we have a set of n string labels as in Lemma 1, our separating system construction is straightforward.

Theorem 1.
Consider an alphabet A = [0 : ⌈n/k⌉] of size ⌈n/k⌉ + 1, where k < n/2. Label every element of an n-element set with a distinct string of letters from A of length ℓ = ⌈log_{⌈n/k⌉} n⌉, using the procedure in Lemma 1 with a = ⌈n/k⌉. For every 1 ≤ i ≤ ℓ and 1 ≤ j ≤ ⌈n/k⌉, let S_{i,j} be the subset of vertices whose string has letter j in position i. The set of all such subsets S = {S_{i,j}} is an (n, k)-separating system, and |S| ≤ ⌈n/k⌉ ⌈log_{⌈n/k⌉} n⌉.

3.2 Adaptive algorithms: Equivalence to a Separating System

Consider any non-adaptive algorithm that designs a set of interventions I, each of size at most k, to discover ~K_n(σ). It is already known that I has to be a separating system in the worst case over all σ. Now we prove the necessity of a separating system for deterministic adaptive algorithms in the worst case.

Theorem 2.
Let there be an adaptive deterministic algorithm A that designs the set of interventions I such that the final learned graph satisfies E_I(D) = ~K_n(σ) for any ground truth ordering σ, starting from the initial skeleton E(D) = K_n. Then there exists a σ such that A designs an I which is a separating system.

The theorem above is independent of the individual intervention sizes. Therefore, we have the following theorem, which is a direct corollary of Theorem 2:
Theorem 3.
In the worst case over σ, any adaptive or non-adaptive deterministic algorithm on the DAG ~K_n(σ) must satisfy (n/k) log_{ne/k} n ≤ |I|. There is a feasible I with |I| ≤ (⌈n/k⌉ − 1) ⌈log_{⌈n/k⌉} n⌉.

Proof.
By Theorem 2, we need a separating system in the worst case, and the lower and upper bounds follow from [10, 18].
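To make the construction concrete, here is a small sketch in the spirit of Lemma 1 and Theorem 1. It is not the paper's exact labeling procedure: as a stand-in, it takes the base-(⌈n/k⌉ + 1) digits of each element and adds the lowest digit (mod the alphabet size) to every higher digit, which keeps the labels distinct while balancing letter usage per position, so each set S_{i,j} has size at most k:

```python
from math import ceil

def separating_system(n, k):
    """Build an (n, k)-separating system on {0, ..., n-1}.

    A sketch in the spirit of Lemma 1 / Theorem 1 (not the paper's exact
    procedure): elements get distinct strings over the alphabet
    {0, ..., a} with a = ceil(n / k); adding the lowest base-(a+1) digit
    to every higher digit balances letter usage, so every letter appears
    at most ceil(n / (a + 1)) <= k times in each position.
    """
    a = ceil(n / k)
    b = a + 1                       # alphabet size
    ell = 1
    while b ** ell < n:             # label length, at most ceil(log_a n)
        ell += 1

    def label(m):
        digits = [(m // b ** i) % b for i in range(ell)]
        return [digits[0]] + [(d + digits[0]) % b for d in digits[1:]]

    labels = [label(m) for m in range(n)]
    # S_{i,j}: elements whose i-th letter is the nonzero letter j
    sets = [{m for m in range(n) if labels[m][i] == j}
            for i in range(ell) for j in range(1, b)]
    return [S for S in sets if S]
```

Any two elements have distinct labels, so they differ in some position; at that position at least one of the two letters is nonzero, and the corresponding set separates the pair, while every set has at most ⌈n/(a+1)⌉ ≤ k elements.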
In this section, we show that the total number of variable accesses needed to fully identify the complete causal DAG is Ω(n).

Theorem 4.
To fully identify a complete causal DAG ~K_n(σ) on n variables using size-k interventions, n/(2k) interventions are necessary. Also, the total number of variables accessed is at least n/2.

The lower bound in Theorem 4 is information theoretic. We now give a randomized algorithm that requires O((n/k) log log k) experiments in expectation; it is a straightforward generalization of [8], where the authors gave a randomized algorithm for unbounded intervention size.

Theorem 5.
Let E(D) be K_n and the experiment size k = n^r for some 0 < r < 1. Then there exists a randomized adaptive algorithm which designs an I such that E_I(D) = D with high probability, and |I| = O((n/k) log log k) in expectation.

In this section, we turn to interventions on a general DAG G. After the initial stages in (1), E(G) is a chain graph with chordal chain components, and there are no further immoralities throughout the graph. In this work, we focus on one of the chordal chain components: the DAG D we work on is assumed to be a directed graph with no immoralities whose skeleton E(D) is chordal. We are interested in recovering D from E(D) using interventions of size at most k, following (1).

We provide a lower bound for both adaptive and non-adaptive deterministic schemes for a chordal skeleton E(D). Let χ(E(D)) be the coloring number of the given chordal graph. Since chordal graphs are perfect, it is the same as the clique number.

Theorem 6.
Given a chordal E(D), in the worst case over all DAGs D (with skeleton E(D) and no immoralities), if every intervention is of size at most k, then |I| ≥ (χ(E(D))/k) log_{χ(E(D))e/k} χ(E(D)) for any adaptive or non-adaptive algorithm achieving E_I(D) = D.

Upper bound: Clearly, the separating-system-based algorithm of Section 3 can be applied to the vertices of the chordal skeleton E(D), and it finds all the directions. Thus |I| ≤ (n/k) log_{⌈n/k⌉} n ≤ α(E(D)) (χ(E(D))/k) log_{⌈n/k⌉} n, using n ≤ α(E(D)) χ(E(D)). Together with the lower bound, this implies an α-approximation algorithm (since log_{⌈n/k⌉} n ≤ log_{χ(E(D))e/k} χ(E(D)) under the mild assumption χ(E(D)) ≤ n/e).

Remark:
The separating system on n nodes gives an α-approximation. However, the new algorithm in Section 4.3 exploits chordality and performs much better empirically. It is possible to show that our heuristic also has an α-approximation guarantee, but we omit this.

4.2 Two extreme counter examples

We provide two classes of chordal skeletons G: one for which a number of interventions close to the lower bound is sufficient, and another for which the number of interventions needed is very close to the upper bound.

Theorem 7.
There exist chordal skeletons such that, for any algorithm with intervention size constraint k, the number of interventions |I| required is at least α(χ − 1)/k, where α and χ are the independence number and the chromatic number, respectively. There also exist chordal graph classes for which |I| = ⌈χ/k⌉ ⌈log_{⌈χ/k⌉} χ⌉ is sufficient.

In this section, we design an adaptive deterministic algorithm that anticipates Meek rule R1 usage along with the idea of a separating system, and we evaluate it experimentally on random chordal graphs. First, we make a few observations about learning a connected directed tree T with no immoralities from its skeleton E(T) (undirected trees are chordal) using Meek rule R1, where every intervention is of size k = 1. Because a tree has no cycles, Meek rules R2–R4 do not apply.

Lemma 2.
Every node in a directed tree with no immoralities has at most one incoming edge. There is a root node with no incoming edges, and intervening on that node alone identifies the whole tree through repeated application of rule R1.

Lemma 3.
If every intervention in I is of size at most 1, learning all directions of a directed tree T with no immoralities can be done adaptively with |I| ≤ O(log n), where n is the number of vertices in the tree. The algorithm runs in time poly(n).

Lemma 4.
Given any chordal graph and a valid coloring, the graph induced by any two color classes is a forest.
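Lemma 4 can be sanity-checked mechanically. The sketch below (a hand-built chordal example and helper names of our own) verifies, via union-find cycle detection, that every pair of color classes of a properly colored graph induces an acyclic subgraph:

```python
from itertools import combinations

def two_class_forest_check(edges, coloring):
    """Check the Lemma 4 property: for a properly colored graph, the
    subgraph induced by any two color classes contains no cycle."""
    classes = {}
    for v, c in coloring.items():
        classes.setdefault(c, set()).add(v)
    for c1, c2 in combinations(classes, 2):
        keep = classes[c1] | classes[c2]
        parent = {v: v for v in keep}         # union-find forest

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for u, v in edges:
            if u in keep and v in keep:
                ru, rv = find(u), find(v)
                if ru == rv:
                    return False               # cycle within two color classes
                parent[ru] = rv
    return True
```

For example, two triangles sharing an edge (a chordal graph) with a proper 3-coloring passes the check, while a 4-cycle (not chordal) with a proper 2-coloring fails, since its two classes together contain the whole cycle.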
In the next section, we combine the above single-intervention adaptive algorithm on directed trees, which uses Meek rules, with the non-adaptive separating system approach.
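The tree strategy of Lemmas 2 and 3 can be sketched as a simulation. Here `parent` encodes the hidden ground truth that answers each size-1 intervention via R0, and propagation uses the tree form of Meek rule R1 (a node with a known incoming edge has all its remaining edges outgoing); the function names and representation are ours, a sketch rather than the paper's implementation:

```python
from collections import deque

def learn_tree(parent):
    """Learn a directed tree with no immoralities from its skeleton using
    size-1 interventions (a sketch of the Lemma 3 strategy).
    `parent` maps node -> true parent (None for the root).
    Returns (learned directed edges, list of intervened nodes)."""
    unknown = {frozenset((v, p)) for v, p in parent.items() if p is not None}
    learned, interventions = set(), []

    def neighbors(v, edges):
        return [next(iter(e - {v})) for e in edges if v in e]

    def component(start, edges):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for w in neighbors(u, edges):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    def orient_from(v):
        """R0 at the intervened node v, then R1 propagation away from v."""
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for e in [e for e in unknown if u in e]:
                w = next(iter(e - {u}))
                unknown.discard(e)
                tail, head = (w, u) if parent.get(u) == w else (u, w)
                learned.add((tail, head))
                if head != u:          # propagate only away from u
                    queue.append(head)

    while unknown:
        comp = component(next(iter(next(iter(unknown)))), unknown)

        def largest_piece_after_removing(v):
            rest = {e for e in unknown if v not in e}
            return max((len(component(u, rest)) for u in neighbors(v, unknown)),
                       default=0)

        v = min(comp, key=largest_piece_after_removing)   # a centroid
        interventions.append(v)
        orient_from(v)
    return learned, interventions
```

Each intervention learns everything except (at most) the centroid's largest remaining piece, which has at most half the nodes, so the number of interventions is O(log n), matching Lemma 3.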
The key motivation behind the algorithm is that a pair of color classes induces a forest (Lemma 4). Choosing the right node to intervene on leaves only a small subtree unlearned, as in the proof of Lemma 3. In subsequent steps, suitable nodes in the remaining subtrees can be chosen until all edges are learned. We give a brief description of the algorithm below.

Let G denote the initial undirected chordal skeleton E(D) and let χ be its coloring number. Consider a (χ, k) separating system S = {S_i}. To intervene on the actual graph, an intervention set I_i corresponding to S_i is chosen; we would like to intervene on nodes whose colors lie in S_i.

Consider a node v of color c ∈ S_i. We attach a score P(v, c) to it as follows. For any color c′ ∉ S_i, consider the induced forest F(c, c′) on the color classes c and c′ in G, and let T(v, c, c′) be the tree of F(c, c′) containing node v. Let d(v) be the degree of v in T, and let T_1, T_2, ..., T_{d(v)} be the disjoint trees that result when node v is removed from T. If v is intervened on, then, by the proof of Lemma 3: (a) all edge directions in all trees T_j except at most one would be learned by applying R0 and the Meek rules; and (b) all the directions from v to its neighbors would be found. The score is taken to be the total number of edge directions guaranteed to be learned in the worst case:

P(v) = Σ_{c′ ∉ S_i} ( |T(v, c, c′)| − max_{1 ≤ j ≤ d(v)} |T_j| ).

The node with the highest score in color class c is used for the intervention I_i. After intervening on I_i, all the edges whose directions become known through R0 and repeated application of the Meek rules (until nothing more can be learned) are deleted from G. Once S is processed, we recolor the sparser graph G, form a new S with the new chromatic number of G, and repeat the above procedure. The exact hybrid algorithm is described in Algorithm 1.

Theorem 8.
Given an undirected chordal skeleton G of an underlying directed graph with no immoralities, Algorithm 1 terminates in finite time and returns the correct underlying directed graph. The algorithm has runtime complexity polynomial in n.

Algorithm 1: Hybrid algorithm using Meek rules with a separating system

Input:
Chordal graph skeleton G = (V, E) with no immoralities.
Initialize ~G = (V, E_d = ∅) with n nodes and no directed edges. Initialize time t = 1.
while E ≠ ∅ do
    Color the chordal graph G with χ colors. ⊲ Standard algorithms do this in linear time
    Initialize the color set C = {1, ..., χ}. Form a (χ, min(k, ⌈χ/2⌉)) separating system S, so that |S| ≤ min(k, ⌈χ/2⌉) for all S ∈ S.
    for i = 1 until |S| do
        Initialize the intervention I_t = ∅.
        for c ∈ S_i and every node v in color class c do
            Consider F(c, c′), T(c, c′, v) and {T_j}_{j=1}^{d(v)} (as per the definitions in Sec. 4.3.1).
            Compute P(v, c) = Σ_{c′ ∈ C ∩ S_i^c} ( |T(c, c′, v)| − max_{1 ≤ j ≤ d(v)} |T_j| ).
        end for
        if k ≤ χ/2 then
            I_t = I_t ∪_{c ∈ S_i} { argmax_{v : P(v,c) ≠ 0} P(v, c) }.
        else
            I_t = I_t ∪_{c ∈ S_i} { the first ⌊k/⌈χ/2⌉⌋ nodes v with the largest nonzero P(v, c) }.
        end if
        t = t + 1.
        Apply R0 and the Meek rules using E_d and E after intervention I_t. Add the newly learned directed edges to E_d and delete them from E.
    end for
    Remove all nodes of degree 0 from G.
end while
return ~G.

We simulate our new heuristic, Algorithm 1, on randomly generated chordal graphs and compare it with a naive algorithm that follows the intervention sets given by our (n, k) separating system of Theorem 1. Both algorithms apply R0 and the Meek rules after each intervention, according to (1). We plot the following lower bounds: (a) the information theoretic lower bound of χ/(2k) (Information Theoretic LB); (b) the chromatic-number-based lower bound of Theorem 6 (Max. Clique Sep. Sys. Entropic LB). Moreover, we use two known (χ, k) separating system constructions for the maximum clique size as references: the best known (χ, k) separating system is shown with the label Max. Clique Sep. Sys. Achievable LB, and our new simpler separating system construction (Theorem 1) is shown by
Our Construction Clique Sep. Sys. LB. As an upper bound, we use the size of the best known (n, k) separating system (without any Meek rules), denoted Separating System UB.

Random generation of chordal graphs:
Start with a random ordering σ on the vertices. Consider every vertex starting from σ(n). For each vertex i, include (j, i) in E with probability inversely proportional to σ(i), for every j ∈ S_i, where S_i = {v : σ^{-1}(v) < σ^{-1}(i)}. The proportionality constant is changed to adjust the sparsity of the graph. After all such j are considered, make S_i ∩ ne(i) a clique by adding edges respecting the ordering σ, where ne(i) is the neighborhood of i. The resulting graph is a DAG, the corresponding skeleton is chordal, and σ is a perfect elimination ordering.

Results:
We are interested in comparing our algorithm, and the naive one based on the (n, k) separating system, with the size of the (χ, k) separating system, which is roughly Õ(χ/k). Consider values around χ = 100 on the x-axis of the plots with n = 1000, k = 10 and n = 2000, k = 10. Our algorithm performs very close to the size of the (χ, k) separating system, i.e., Õ(χ/k), in both cases, while the average performance of the naive algorithm is close to n/k = 100 in the first case and close to n/k = 200 in the second. The result points to this: for random chordal graphs, the structured tree search allows us to learn the edges in a number of experiments quite close to the lower bound based only on the maximum clique size, and not on n. Plots for additional (n, k) settings, with n = 500 and n = 2000, are given in the Appendix.

[Figure 1(a): n = 1000, k = 10. Axes: chromatic number χ vs. number of experiments.]
[Figure 1(b): n = 2000, k = 10. Axes: chromatic number χ vs. number of experiments.]

Figure 1: n: number of vertices; k: intervention size bound. The number of experiments is compared between our heuristic and the naive algorithm based on the (n, k) separating system on random chordal graphs. The red markers represent the sizes of the (χ, k) separating system. The green circle markers and the cyan square markers for the same χ value correspond to the number of experiments required by our heuristic and by the algorithm based on the (n, k) separating system (Theorem 1), respectively, on the same set of chordal graphs. Note that, when n = 1000 and n = 2000, the naive algorithm requires on average close to n/k experiments (about 100 and 200, respectively), while our algorithm requires orderwise close to χ/k = 10 experiments when χ = 100.

We have considered the problem of adaptively designing interventions of bounded size to learn a causal graph under Pearl's SEM-IE model. We proposed lower and upper bounds on the number of interventions needed in the worst case for various classes of algorithms when the causal graph skeleton is complete, and we developed lower and upper bounds on the minimum number of interventions required in the worst case for general graphs. We characterized two extremal graph classes such that the minimum number of interventions for one class is close to the lower bound and for the other is close to the upper bound. For chordal skeletons, we proposed an algorithm that combines ideas for complete graphs with those for forests via the application of Meek rules. Empirically, on randomly generated chordal graphs, our algorithm performs close to the lower bound and outperforms the previous state of the art.
Possible future work includes obtaining a tighter lower bound for chordal graphs, which would possibly establish a tighter approximation guarantee for our algorithm.
Acknowledgments
The authors acknowledge the support from grants NSF CCF 1344179, 1344364, 1407278, 1422549 and an ARO YIP award (W911NF-14-1-0258). We also thank Frederick Eberhardt for helpful discussions.

References

[1] Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs.
The Annals of Statistics , 25(2):505–541, 1997.[2] Frederich Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in theworst case necessary to identify all causal relations among n variables. In
Proceedings of the 21st Conference onUncertainty in Artificial Intelligence (UAI) , pages 178–184.[3] Frederick Eberhardt.
Causation and Intervention (Ph.D. Thesis) , 2007.[4] Alain Hauser and Peter B¨uhlmann. Characterization and greedy learning of interventional markov equivalenceclasses of directed acyclic graphs.
Journal of Machine Learning Research , 13(1):2409–2464, 2012.[5] Alain Hauser and Peter B¨uhlmann. Two optimal strategies for active learning of causal networks from interven-tional data. In
Proceedings of Sixth European Workshop on Probabilistic Graphical Models , 2012.[6] Alain Hauser and Peter B¨uhlmann. Two optimal strategies for active learning of causal models from interven-tional data.
International Journal of Approximate Reasoning , 55(4):926–939, 2014.[7] Patrik O Hoyer, Dominik Janzing, Joris Mooij, Jonas Peters, and Bernhard Sch¨olkopf. Nonlinear causal discoverywith additive noise models. In
Proceedings of NIPS 2008 , 2008.[8] Huining Hu, Zhentao Li, and Adrian Vetta. Randomized experimental design for causal graph discovery. In
Proceedings of NIPS 2014 , Montreal, CA, December 2014.[9] Antti Hyttinen, Frederick Eberhardt, and Patrik Hoyer. Experiment selection for causal discovery.
Journal ofMachine Learning Research , 14:3041–3071, 2013.[10] Gyula Katona. On separating systems of a finite set.
Journal of Combinatorial Theory , 1(2):174–194, 1966.[11] Richard J Lipton and Robert Endre Tarjan. A separator theorem for planar graphs.
SIAM Journal on AppliedMathematics , 36(2):177–189, 1979.[12] Christopher Meek. Causal inference and causal explanation with background knowledge. In
Proceedings of theeleventh international conference on uncertainty in artificial intelligence , 1995.[13] Christopher Meek. Strong completeness and faithfulness in bayesian networks. In
Proceedings of the eleventhinternational conference on uncertainty in artificial intelligence , 1995.[14] Judea Pearl.
Causality: Models, Reasoning and Inference . Cambridge University Press, 2009.[15] S Shimizu, P. O Hoyer, A Hyvarinen, and A. J Kerminen. A linear non-gaussian acyclic model for causaldiscovery.
Journal of Machine Learning Research , 7:2003–2030, 2006.[16] Peter Spirtes, Clark Glymour, and Richard Scheines.
Causation, Prediction, and Search . A Bradford Book,2001.[17] Thomas Verma and Judea Pearl. An algorithm for deciding if a set of observed independencies has a causalexplanation. In
Proceedings of the Eighth international conference on uncertainty in artificial intelligence , 1992.[18] Ingo Wegener. On separating systems whose elements are sets of at most k elements.
Discrete Mathematics ,28(2):219–222, 1979. 9 ppendix
We describe a string labeling procedure to label the elements of the set [1 : n].

String Labelling: Let a > 1 be a positive integer. Let x be the integer such that a^x < n ≤ a^(x+1), i.e. x + 1 = ⌈log_a n⌉. Every element j ∈ [1 : n] is given a label L(j), which is a string of integers of length x + 1 drawn from the alphabet {0, 1, ..., a} of size a + 1. Let n = p_d a^d + r_d and n = p_{d−1} a^(d−1) + r_{d−1} for some integers p_d, p_{d−1}, r_d, r_{d−1}, where r_d < a^d and r_{d−1} < a^(d−1). Now, we describe the sequence of the d-th digit across the string labels of all elements from 1 to n:

1. Repeat 0 a^(d−1) times, repeat the next integer a^(d−1) times, and so on circularly from {0, ..., a − 1} till position p_d a^d. (Circularly means that after a − 1 is completed, we start with 0 again.)

2. After that, repeat 0 ⌈r_d/a⌉ times followed by 1 ⌈r_d/a⌉ times, and so on, till we reach the n-th position. Clearly, the n-th integer in the sequence does not exceed a − 1.

3. Every integer occurring after the position a^(d−1) p_{d−1} is increased by 1.

From the three steps used to generate every digit, a straightforward calculation shows that every integer letter is repeated at most ⌈n/a⌉ times in every digit i of the string. Now, we prove inductively that the labels are distinct for all n elements. Induction hypothesis: for all n < a^(q+1), the labels are distinct. The base case q = 0 is easy to see. We then show that for a^(q+1) ≤ n < a^(q+2), the labels are distinct.

Another way of looking at the labeling procedure is as follows. Let n = a^(q+1) p + r with r < a^(q+1). Divide the label matrix L (of dimensions (q + 2) × n) into two parts: L_1, consisting of the first p a^(q+1) columns, and L_2, consisting of the remaining columns. The first q + 1 rows of L_1 are nothing but the string labels of all numbers from 1 to p a^(q+1) expressed in base a. For any row i ≤ ⌈log_a r⌉ of the original label matrix L, till the end of the first p a^(q+1) columns, the labeling procedure is still in Step 1. After that, one can take r to be the new size of the set of elements to be labelled and restart the procedure with this r.
Therefore we have the following key observation: L_2(1 : ⌈log_a r⌉, :) (the matrix formed by the first ⌈log_a r⌉ rows of L_2) is nothing but the label matrix for r distinct elements produced by the above labeling procedure. Since r < a^(q+1), by the induction hypothesis its columns are distinct. Hence, any two columns in L_2 are distinct. Suppose the first q + 1 rows of two columns b and c of L_1 are identical. These correspond to the base-a expansions of b − 1 and c − 1, so the two columns are separated by at least a^(q+1) columns. But then the last-row entries of columns b and c in L_1 have to be distinct, because according to Steps 1 and 2 of the labeling procedure, in the (q + 2)-th row every integer is repeated continuously at most ⌈n/a⌉ ≤ a^(q+1) times, and only once. Therefore, any two columns in L_1 are distinct. The last-row entries in L_2 differ from those in L_1 because of the addition in Step 3. Therefore, all columns of L are distinct and, by induction, the result is shown.

By Lemma 1, the i-th place has at most ⌈n / ⌈n/k⌉⌉ ≤ k occurrences of symbol j. Therefore, |S_{i,j}| ≤ k. Now, consider a pair of distinct elements p, q ∈ [1 : n]. Since they are labelled distinctly (Lemma 1), there is at least one letter i in their string labels where they differ. Suppose the distinct i-th letters are a, b ∈ A with a ≠ b, and say a = 0 without loss of generality. Then the separation criterion is clearly met by the subset S_{i,b}. This proves the claim.

We construct a worst case σ inductively. Before every step m, the adaptive algorithm deterministically chooses I_m based on E_{I_1, I_2, ..., I_{m−1}}(K_n). Therefore, we will reveal a partial order σ^(m−1) consistent with the observations so far. Inductively, for every m we will make sure that after I_m is chosen by the algorithm, further details about σ can be revealed to form σ^(m) such that, after intervening on I_m and then applying R0, there is no opportunity to apply the Meek rules.
This will ensure that the set of interventions I = {I_1, I_2, ...} is a separating system on the n elements. Before the intervention at any step m, let us 'tag' every vertex i using a subset C_i^(m−1) ⊆ [1 : m − 1], where C_i^(m−1) = {p : i ∈ I_p, p ≤ m − 1}. That is, C_i^(m−1) contains the indices of all those interventions that contain vertex i before step m. Let C^(m−1) contain the distinct elements of the multi-set {C_i^(m−1)}. We will construct σ partially such that it always satisfies the following criterion. Inductive Hypothesis:
The partial order σ^(m−1) is such that any two elements i, j with tags C_i and C_j are incomparable if C_i = C_j and comparable otherwise. This means the edges between elements carrying the same tag C have not been revealed, and thus the corresponding directed edges are not known to the algorithm.

Now, we briefly digress to argue that if we can construct σ^(1), σ^(2), ... satisfying this property throughout, then at the end all vertices must be tagged differently; otherwise the directions among identically tagged vertices cannot have been learned, and the algorithm has not succeeded in its task. If all vertices are tagged differently, then the set of interventions is a separating system.

Construction of σ^(m): We now construct σ^(m) and show that it satisfies the induction hypothesis before step m + 1. Before step m, consider the vertices in C ∈ C^(m−1) for any tag C. Let the current intervention be I_m, chosen by the deterministic algorithm. We make the following changes: modify σ^(m−1) so that the vertices in I_m ∩ C come before those in (I_m)^c ∩ C in the partial order σ^(m) (vertices inside either set remain unordered amongst themselves); clearly the directions between these two sets are revealed by R0. By the induction hypothesis for step m, and with the new tagging of vertices into C^(m), it is easy to see that only directions between distinct C's in the new C^(m) have been revealed, all directions within a tag set C remain unrevealed, and all vertices in a tag set are contiguous in the ordering so far. We only need to show that the Meek rules cannot reveal any more edges amongst vertices in any C ∈ C^(m) after the new σ^(m) and intervention I_m. Suppose there are two vertices i, j that are tagged identically just after intervention I_m and the modified σ^(m), and an application of the Meek rules reveals the direction between i and j before the next intervention.
Then there has to be a vertex k, tagged differently from i and j, such that j → k and k → i are both known. But this implies that j and i are comparable in σ^(m), leading to a contradiction. Hence the hypothesis holds for step m + 1. Base case:
Trivially, the induction hypothesis holds for step 1, where σ^(0) leaves the entire set unordered.

The proof is a direct consequence of acyclicity, the non-existence of immoralities, and the definition of rule R1.
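Two of the arguments above are easy to check mechanically. The sketch below (illustrative code, not from the paper) builds the separating system S_{i,j} from fixed-length base-a labels, for the simplified case n = a^x exactly, where plain base-a labels already repeat every symbol at most ⌈n/a⌉ times per position (the labeling procedure above exists precisely to handle general n). It also encodes the tag characterization from the adversary argument: a set of interventions separates the vertices if and only if the tags C_i are pairwise distinct.

```python
from itertools import combinations

def base_a_labels(n, a):
    """Fixed-length base-a digit strings for elements 0..n-1."""
    length = 1
    while a ** length < n:
        length += 1
    labels = []
    for j in range(n):
        digits, x = [], j
        for _ in range(length):
            digits.append(x % a)
            x //= a
        labels.append(tuple(reversed(digits)))
    return labels

def separating_system(labels, a):
    """S[(i, j)] = elements whose i-th letter is the nonzero symbol j."""
    length = len(labels[0])
    return {(i, j): {e for e, lab in enumerate(labels) if lab[i] == j}
            for i in range(length) for j in range(1, a)}

def is_separating(n, interventions):
    """Separation holds iff the tags C_i (sets of intervention indices
    containing each vertex) are pairwise distinct."""
    tags = [frozenset(t for t, I in enumerate(interventions) if v in I)
            for v in range(n)]
    return len(set(tags)) == n

n, k = 16, 4
a = -(-n // k)                                        # a = ceil(n/k) = 4
S = separating_system(base_a_labels(n, a), a)
assert all(len(block) <= k for block in S.values())   # every |S_{i,j}| <= k
assert is_separating(n, list(S.values()))             # every pair separated
assert len(S) == (a - 1) * 2                          # (a-1) * log_a(n) sets
```

Dropping the symbol 0 is what yields (a − 1)⌈log_a n⌉ rather than a⌈log_a n⌉ sets: if two labels differ at position i, at least one of the two symbols there is nonzero, so one block already separates the pair.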
By Lemma 2, it is sufficient for the algorithm to identify the root node of the tree. Suppose the root node is b, unknown to the algorithm. Every tree has a single vertex separator whose removal partitions the tree into components each of size at most n/2 [11]. Choose such a vertex separator a (it can be found by removing each node in turn and determining the components left). If a is the root node, we stop here. Otherwise, its parent p is identified after application of rule R0. Let T_1, T_2, ..., T_k be the component trees that result from removing node a, and let T_1 contain p. All directions in all the other trees are known after repeated application of R1 on the original tree once R0 is applied; the directions in T_1 will not be known. For the next step, E(T_1) is the new skeleton, which has no immoralities. Again, we find the best vertex separator a_2 and the process continues. This procedure terminates at some step j when a_j = b, or when there is only one node left, which must be b by Lemma 2. Since the number of remaining nodes at least roughly halves each time, and initially it is at most n, this procedure terminates in at most O(log n) steps.

The graph induced by two color classes of any graph is bipartite, and bipartite graphs have no odd induced cycles. Since the graph and every induced subgraph are chordal, the graph induced on a pair of color classes has no cycle at all. This proves the theorem.
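The separator-based root search in the proof above can be sketched as follows (hypothetical code; `parent_oracle` stands in for the single-vertex intervention on a, which by rule R0 reveals a's parent, or that a is the root):

```python
def centroid(adj, nodes):
    """A vertex whose removal leaves components of size <= len(nodes)/2."""
    n, root = len(nodes), next(iter(nodes))
    parent, order, stack = {root: None}, [], [root]
    while stack:                       # iterative DFS to compute subtree sizes
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w in nodes and w not in parent:
                parent[w] = u
                stack.append(w)
    size = {u: 1 for u in nodes}
    for u in reversed(order):
        if parent[u] is not None:
            size[parent[u]] += size[u]
    for u in nodes:
        parts = [size[w] for w in adj[u] if w in nodes and parent.get(w) == u]
        parts.append(n - size[u])      # the component "above" u
        if max(parts) <= n // 2:
            return u

def component_of(adj, nodes, start):
    """Connected component of `start` inside the vertex set `nodes`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w in nodes and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def find_root(adj, parent_oracle):
    """Intervene on a centroid; keep only the component holding its parent."""
    nodes = set(adj)
    while len(nodes) > 1:
        a = centroid(adj, nodes)
        p = parent_oracle(a)           # None iff a is the hidden root
        if p is None:
            return a
        nodes = component_of(adj, nodes - {a}, p)
    return next(iter(nodes))

# hidden root 7 on the path 0-1-...-9
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 9} for i in range(10)}
oracle = lambda a: None if a == 7 else (a + 1 if a < 7 else a - 1)
assert find_root(adj, oracle) == 7
```

Since the surviving vertex set at least halves per query, the loop runs O(log n) times, matching the bound in the proof.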
Assume n is even for simplicity. We define a family of partial orders σ(p) as follows: group nodes 2i − 1 and 2i into C_i. The ordering between 2i − 1 and 2i is not revealed, but all the edges between C_i and C_j for any j > i are directed from C_i to C_j. Now, one has to design a set of interventions such that at least one node of every C_i is intervened on at least once. This is because, if neither node of C_i is intervened on, then the direction between the two nodes of C_i cannot be figured out by applying the Meek rules to any other set of directions in the rest of the graph. Since the size of every intervention is at most k and at least n/2 nodes need to be covered by intervention sets, the number of interventions required is at least n/(2k). Proof.
Separate the n vertices arbitrarily into n/k disjoint subsets C_i of size k. Let the first n/k interventions {I_1, I_2, ..., I_{n/k}} be such that I_i(v) = 1 if and only if v ∈ C_i. This divides the problem of learning a clique of size n into that of learning n/k cliques of size k. Then, we can apply the clique learning algorithm in [8] as a black box to each of the n/k blocks: each block is learned, with probability at least 1 − k^(−c), after c log log k experiments in expectation. For k = n^r, choose c > 1/r − 1. Then the union bound over the n/k blocks yields a failure probability polynomially small in n. Since each block takes O(log log k) experiments, we need (n/k) · O(log log k) experiments. We need the following definitions and some results before proving the theorem.
Definition 2. A perfect elimination ordering σ_p = {v_1, v_2, ..., v_n} on the vertices of an undirected chordal graph G is such that, for all i, the induced neighborhood of v_i on the subgraph formed by {v_1, v_2, ..., v_{i−1}} is a clique.

Lemma 5 ([6]). If all directions in the chordal graph are according to a perfect elimination ordering (edges go only from vertices lower in the order to vertices higher in the order), then there are no immoralities.
We make the following observation: let the directions in a graph D be oriented according to an ordering σ on the vertices. If a clique comes first in the ordering, then knowledge of the edge directions in the rest of the graph, excluding those of the clique, cannot help at any stage of the intervention process on the clique, because all edges are directed outwards from the clique and hence none of the Meek rules apply. Indeed, if a → b is to be inferred by the Meek rules from other known directions, then there has to be a known edge direction into a or b before the inference step. So if one of the directed edges outside the clique were to help in the discovery process, either that edge would have to be directed towards a or b, or it would have to be directed towards a vertex c of the clique appearing in a known edge c → a. Both cases are impossible, since every edge incident to the clique points outwards.

Lemma 6 ([6]). Let C be a maximum clique of an undirected chordal graph E(D). Then there is an underlying DAG D on the chordal skeleton that is oriented according to a perfect elimination ordering (implying no immoralities), in which the clique C occurs first.

By Lemmas 5 and 6 and the observation above, given a chordal skeleton, we can construct a DAG on the skeleton with no immoralities such that the directions of the maximum clique in D cannot be learned using knowledge of the directions outside. This means that only the intersections {I_1 ∩ C, I_2 ∩ C, ...} matter for learning the directions on this clique; inference on the clique is isolated. Hence, all the lower bounds for the clique case transfer to this case, and since the size of the largest clique equals the coloring number of the chordal skeleton, the theorem follows. Example with a feasible solution with |I| close to the lower bound:
Consider a graph G that can be partitioned into a clique of size χ and an independent set of size α. Such graphs are called split graphs and, as n → ∞, the fraction of chordal graphs that are split graphs tends to 1. If E(D) = G where G is a split-graph skeleton, it is enough to intervene only on the nodes of the clique, and therefore the number of interventions needed is that for the clique. It is certainly possible to orient the edges in such a way as to avoid immoralities, since the graph is chordal. Example with |I| which needs to be close to the upper bound:
We construct a connected chordal skeleton with independence number α and clique size χ (also its coloring number) such that, over a class of orientations, any algorithm requires at least α(χ − 1)/k interventions.

Consider a line L consisting of vertices 0, 1, ..., α, where consecutive vertices are connected. For all 1 ≤ p ≤ α, consider a clique C_p of size χ whose only nodes from the line L are p − 1 and p. Now assume that the actual orientation of L is 0 → 1 → ... → α. In every clique, the orientation is partially specified as follows: in every clique C_p, all edges from node p − 1 are outgoing. It is easy to check that this partial orientation excludes all immoralities. Further, each C_p − {p − 1} can have any of the (χ − 1)! acyclic orientations in the actual DAG. Now, even if all the specified directions are revealed to the algorithm, the algorithm still has to intervene on all α disjoint cliques {C_p − {p − 1}}, p = 1, ..., α, each of size χ − 1, and the directions in one clique will not force directions in the others through any of the Meek rules or rule R0. Therefore, a lower bound of α(χ − 1) total node accesses (total number of nodes intervened on) is implied by Theorem 4. Given that every intervention has size at most k, these chordal skeletons with the revealed partial order need at least α(χ − 1)/k experiments.
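The extremal skeleton can be built and sanity-checked in a few lines. The encoding below is my own (hypothetical) realization of the construction: a path 0..α, with each clique C_p glued onto line nodes p − 1 and p and padded with χ − 2 private vertices.

```python
from itertools import combinations
from math import ceil

def line_of_cliques(alpha, chi):
    """Path 0..alpha; clique C_p of size chi containing line nodes p-1, p."""
    adj = {v: set() for v in range(alpha + 1)}
    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for i in range(1, alpha + 1):
        add_edge(i - 1, i)                       # the line L
    cliques, nxt = [], alpha + 1
    for p in range(1, alpha + 1):
        C = [p - 1, p] + list(range(nxt, nxt + chi - 2))  # private padding
        nxt += chi - 2
        for u, v in combinations(C, 2):
            add_edge(u, v)
        cliques.append(set(C))
    return adj, cliques

alpha, chi, k = 5, 4, 2
adj, cliques = line_of_cliques(alpha, chi)
# each C_p really is a clique of size chi
assert all(len(C) == chi and
           all(v in adj[u] for u, v in combinations(C, 2)) for C in cliques)
# the sets C_p - {p-1} are pairwise disjoint, so any algorithm must access
# alpha * (chi - 1) nodes: at least ceil(alpha * (chi - 1) / k) experiments
reduced = [C - {p - 1} for p, C in enumerate(cliques, start=1)]
assert all(a.isdisjoint(b) for a, b in combinations(reduced, 2))
assert ceil(alpha * (chi - 1) / k) == 8          # 5 * 3 / 2 rounded up
```

The disjointness assertion is exactly what makes the node-access counts add up across the α cliques in the lower-bound argument.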
[Figure 2 appears here, with panels (a) n = 500, k = 10 and (b) n = 2000, k = 20. Legend: Information Theoretic LB, Max. Clique Sep. Sys. Entropic LB, Max. Clique Sep. Sys. Achievable LB, Our Construction Clique Sep. Sys. LB, Our Heuristic Algorithm, Naive (n,k) Sep. Sys. based Algorithm, Separating System UB; x-axis: chromatic number χ, y-axis: number of experiments.]

Figure 2: n: no. of vertices, k: intervention size bound. The number of experiments is compared between our heuristic and the naive algorithm based on the (n, k) separating system on random chordal graphs. The red markers represent the sizes of the (χ, k) separating system. The green circle markers and the cyan square markers for the same χ value correspond to the number of experiments required by our heuristic and by the algorithm based on an (n, k) separating system (Theorem 1), respectively, on the same set of chordal graphs. All four plots (including the ones in the main text) indicate that our algorithm requires a number of experiments proportional to the clique number χ, whereas the naive separating-system-based algorithm requires a number of experiments on the order of the number of variables n.

We provide the following justifications for the correctness of Algorithm 1.

1. At line 4 of the algorithm, when the Meek rules and rule R0 are applied after every intervention, the intermediate graph G, with unlearned edges, is always a disjoint union of chordal components (refer to (1) and the comments below it) and hence a chordal graph.

2. The number of unlearned edges decreases by at least one across each iteration of the main while loop in Algorithm 1. Every edge in E is incident on two colors, and one of those colors is always picked for processing because we use a separating system on the colors. Therefore, some node belonging to that edge has a positive score and is intervened on, and the edge direction is learnt through rule R0. Therefore, the algorithm terminates.

3. It identifies the correct ~G because every edge is inferred after some intervention I_t by applying rule R0 and the Meek rules as in (1), both of which are correct.

4.
The algorithm has polynomial run-time complexity because the main while loop runs at most |E| times.
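As a toy illustration of items 2 and 3, the sketch below (hypothetical code; only Meek rule R1 is implemented, and the intervention rule R0 is simulated by querying the true orientation on cut edges) learns a small tree skeleton from a single size-1 intervention:

```python
def learn(skeleton, truth, interventions):
    """Rule R0 orients every edge cut by an intervention; Meek rule R1 then
    orients b-c as b->c whenever a->b is known and a, c are non-adjacent.
    R1 is sound here because `truth` contains no immoralities."""
    known = set()
    for I in interventions:
        known |= {(u, v) for (u, v) in truth if (u in I) != (v in I)}  # R0
        changed = True
        while changed:                                                 # R1
            changed = False
            for (a, b) in list(known):
                for c in skeleton[b]:
                    if (c != a and c not in skeleton[a]
                            and (b, c) not in known
                            and (c, b) not in known):
                        known.add((b, c))
                        changed = True
    return known

# path 0-1-2 with true orientation 0 -> 1 -> 2 (no immoralities)
skeleton = {0: {1}, 1: {0, 2}, 2: {1}}
truth = {(0, 1), (1, 2)}
assert learn(skeleton, truth, [{0}]) == truth   # one intervention suffices
```

Here the intervention {0} cuts only the edge (0, 1); rule R1 then propagates the direction 1 → 2, since 0 and 2 are non-adjacent, mirroring how the heuristic lets one intervened node orient many edges.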