Active Learning of Causal Structures with Deep Reinforcement Learning
Amir Amirinezhad, Saber Salehkaleybar, and Matin Hashemi

The authors are with the Learning and Intelligent Systems Laboratory, Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran. E-mails: [email protected], [email protected] (corresponding author), [email protected]. Webpage: http://lis.ee.sharif.edu
Abstract—We study the problem of experiment design to learn causal structures from interventional data. We consider an active learning setting in which the experimenter decides to intervene on one of the variables in the system in each step and uses the results of the intervention to recover further causal relationships among the variables. The goal is to fully identify the causal structure with the minimum number of interventions. We present the first deep reinforcement learning based solution for the problem of experiment design. In the proposed method, we embed input graphs to vectors using a graph neural network and feed them to another neural network which outputs a variable for performing intervention in each step. Both networks are trained jointly via a Q-iteration algorithm. Experimental results show that the proposed method achieves competitive performance in recovering causal structures with respect to previous works, while significantly reducing execution time on dense graphs.
Index Terms—Causal structure learning, experiment design, active learning, deep reinforcement learning
I. INTRODUCTION

Recovering causal relations among a set of variables in various natural or social phenomena is one of the primary goals in artificial intelligence. For instance, one might be interested in estimating the effect of education on salaries, smoking on lung cancer, or activation of genes on a phenotype. If we only have access to observational data (in contrast to the possibility of intervening in the system), only part of the causal relationships can be identified in most cases. As a result, the investigator is mostly left with some unresolved causal relations. However, if we could perform sufficiently many experiments in the system, all the causal relationships could be recovered. Unfortunately, in many applications, it might be too costly to intervene in the system [1]. Thus, it is desirable to design an optimal experiment, i.e., a set of interventions of minimum size that results in full identification of the causal relationships.

Directed acyclic graphs (DAGs) are commonly used to represent causal structures, where each node is a random variable and there is a directed edge from a variable X to a variable Y if X is a direct cause of Y. From observational data, the underlying causal DAG can be identified up to a Markov equivalence class (MEC), which is the set of DAGs representing the same set of conditional independencies among the variables [2]. Several methods [2]–[4] in the literature have been proposed to learn the MEC from purely observational data. However, in order to uniquely identify the causal structure, it is necessary to intervene in the system if there is no prior assumption on the data generation mechanism. In most applications, performing experiments is too costly or even infeasible. Thus, we need to fully identify the true causal structure with the minimum number of interventions.

Eberhardt [5] presented worst-case bounds on the number of required experiments for full identification, where the number of intervened variables could be as large as half the size of the graph. He and Geng [6] considered the problem of experiment design in two settings, passive and active. In the passive setting, all the experiments are designed before doing interventions in the system; afterwards, the results of the interventions are aggregated in order to recover the causal structure. In the active setting, we sequentially perform interventions, and the results of previous interventions help in designing later ones. He and Geng [6] assumed that the orientations of edges incident to an intervened variable can be revealed by performing perfect randomized interventions. In the passive setting, they enumerated all possible DAGs in the MEC obtained from observational data for a given experiment and checked whether the experiment can fully identify the causal structure for any DAG in the equivalence class. From the experiments with this desirable property, they returned the one with the minimum number of interventions. Unfortunately, the proposed method might be too computationally intensive due to the possibly large number of DAGs in an MEC. In the active setting, they also proposed a heuristic algorithm that decides which variable to intervene on in the next step based on Shannon's entropy metric.
Hauser and Bühlmann [7] studied the problem of experiment design in the active setting where the experimenter is allowed to intervene on a single variable in each step. They proposed an optimal algorithm for the single-step case and used it as a heuristic for selecting variables in the case of multiple steps. Shanmugam et al. [8] considered the problem of causal structure learning by performing experiments with a bounded number of interventions in each experiment. They derived lower bounds on the number of experiments for full identification of causal graphs in both passive and active settings using theoretical results on separating systems. Kocaoglu et al. [9] considered the problem of experiment design with no constraint on the number of interventions in each experiment and proposed an algorithm for the case where there is a specific cost for intervening on each variable. Ghassami et al. [10] proposed an approximate algorithm in the passive setting which maximizes the average number of oriented edges for a fixed budget of interventions. Later, this approach was accelerated using clique trees [11] and efficient iteration on chain components [12]. Recently, Agrawal et al. [13] proposed a Bayesian experiment design algorithm in the active setting where the expected value of a utility function is maximized in each step according to the current belief, and they provided a tractable solution with an approximation guarantee based on submodular functions.

Several works have utilized reinforcement learning in order to train an agent for solving NP-hard problems on graphs [14]–[18]. For instance, Dai et al. [14] embedded graphs into vectors with a graph neural network and fed them to another neural network in order to form a solution for difficult graph problems such as minimum vertex cover or the traveling salesman problem. Abe et al. [15] combined Monte-Carlo tree search and graph isomorphism networks to tackle NP-hard problems. In [16]–[18], several approaches based on reinforcement learning have been proposed to solve the vehicle routing problem. Experimental results showed that these algorithms can outperform traditional heuristic methods in terms of the quality of the solution. Besides, the running time for solving a new instance of the problem is significantly reduced after the training phase, while most heuristic methods become computationally intensive on large graphs.

Previous works on experiment design mainly focused on developing heuristic metrics to decide which variables should be intervened on, and most of these metrics are related to graph properties of the MEC. In this paper, we consider the problem of experiment design in the active setting. Unlike previous works, our goal is to train an agent by utilizing a reinforcement learning algorithm in order to decide which variable is suitable for intervention in each step. In particular, we embed input graphs to vectors using a graph neural network and feed them to another neural network which outputs a variable for performing intervention in the next step. We jointly train both neural networks via a Q-iteration algorithm. Our experiments on synthetic and real graphs show that the proposed method achieves competitive performance in recovering causal structures while significantly reducing running times on dense graphs.

The structure of the paper is as follows: In Section 2, we review some preliminaries on causal structures and define the problem of experiment design in the active learning setting. In Section 3, we present our proposed method and describe the training algorithm.
In Section 4, we provide experimental results and compare our method with previous works. We conclude the paper in Section 5.

II. PROBLEM DEFINITION
A. Preliminaries
A graph G is represented by a pair G = (V(G), E(G)), where V(G) is the set of vertices and E(G) is the set of edges. There exists an undirected edge between two vertices X and Y if (X, Y) ∈ E(G) and (Y, X) ∈ E(G). Moreover, there is a directed edge from vertex X to vertex Y if (X, Y) ∈ E(G) while (Y, X) ∉ E(G). We denote the directed edge from X to Y and the undirected edge between X and Y by X → Y and X − Y, respectively. If there is a directed edge from X to Y, we consider X as a parent of Y. Descendants of X are the set of vertices with a directed path from X to each vertex in the set. We say a graph G is a directed graph if all its edges are directed. A sequence $(X_1, \ldots, X_k)$ is a partially directed path in a graph G if either $X_i \to X_{i+1}$ or $X_i - X_{i+1}$ for all $i = 1, \ldots, k-1$. A partially directed cycle is a partially directed path where the first and last vertices in the path are the same vertex. We say that a graph is a chain graph if it does not contain any partially directed cycle. After removing the directed edges of a chain graph, the remaining undirected connected components are called the chain connected components of the graph. Furthermore, an undirected graph is chordal if every cycle of length four or greater has a chord. It can be shown that each chain connected component is chordal. A v-structure is a subgraph of G with three vertices X, Y, and Z such that X → Z ← Y. Two directed graphs have the same skeleton if they have the same set of vertices and edges regardless of their orientations.

Let $\mathbf{X} = \{X_1, \ldots, X_n\}$ be a set of random variables. Consider a graph G whose set of vertices is equal to $\mathbf{X}$. A joint distribution P over $\mathbf{X}$ satisfies the Markov property with respect to G if any variable of G is independent of its non-descendants given its parents. Under the causal sufficiency and faithfulness assumptions [2], any conditional independence in P can be inferred by the Markov property. Furthermore, multiple DAGs may encode the same set of conditional independence assertions. A Markov equivalence class (MEC) is a set of DAGs entailing the same set of conditional independence assertions. The set of all DAGs that are Markov equivalent to some DAG G can be represented by a completed partially directed acyclic graph (CPDAG), in which there is a directed edge from X to Y if X is a parent of Y in all DAGs in the MEC; otherwise, this edge is represented by an undirected edge. It can be shown that all DAGs in an MEC have the same skeleton and the same set of v-structures [19]. Moreover, the CPDAG can be obtained from the skeleton and the set of v-structures by applying the so-called Meek rules [20], which orient further edges such that no directed cycle or new v-structure is created.

Fig. 1 presents the Meek rules used in obtaining a CPDAG. For instance, if the graph on the left-hand side of Fig. 1a appears as an induced subgraph, we can orient its undirected edge and obtain the subgraph on the right-hand side of Fig. 1a; otherwise, a new v-structure would be created in the CPDAG. It can be shown that the CPDAG can be obtained by applying these four Meek rules in any order until no further edges can be oriented [20].
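To make the rules concrete, here is a minimal Python sketch (our own illustration, not the authors' code) of applying Meek rules 1 and 2 to a partially directed graph; rules 3 and 4 are omitted for brevity, and all names are ours:

```python
# Directed edges are ordered pairs (a, b) meaning a -> b; undirected
# edges are frozensets {a, b} meaning a - b.

def adjacent(a, b, directed, undirected):
    return ((a, b) in directed or (b, a) in directed
            or frozenset({a, b}) in undirected)

def orientable(x, y, directed, undirected):
    """True if Meek rule 1 or 2 orients x - y as x -> y."""
    for (a, b) in directed:
        # Rule 1: a -> x with a, y non-adjacent  =>  x -> y (no new v-structure).
        if b == x and a != y and not adjacent(a, y, directed, undirected):
            return True
        # Rule 2: x -> b -> y  =>  x -> y (avoids a directed cycle).
        if a == x and (b, y) in directed:
            return True
    return False

def apply_meek_rules(directed, undirected):
    """Orient undirected edges until rules 1 and 2 no longer fire
    (rules 3 and 4 are not implemented in this sketch)."""
    directed, undirected = set(directed), set(undirected)
    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            x, y = tuple(edge)
            for (u, v) in ((x, y), (y, x)):
                if edge in undirected and orientable(u, v, directed, undirected):
                    undirected.discard(edge)
                    directed.add((u, v))
                    changed = True
    return directed, undirected
```

For instance, with directed = {(1, 3), (2, 3)} and undirected = {frozenset({3, 4})}, rule 1 orients the remaining edge as 3 → 4.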
Example 1: An example of a CPDAG and the corresponding DAGs is given in Fig. 2. The CPDAG shown in Fig. 2a contains exactly one v-structure. In the DAG in Fig. 2b, one variable is the root variable (a variable with no incoming edges), while in Fig. 2c the root is a different variable. Note that the undirected edge between these two variables can be oriented in either direction, since neither orientation creates a new v-structure or a cycle. The orientations of the remaining edges, however, are the same in both DAGs: one edge is forced to avoid creating a new v-structure, and the other is forced to avoid creating a cycle.

B. Problem of Experiment Design
Let $G^*$ be the underlying causal graph among the variables in $\mathbf{X}$. From observational data, one can only recover the true causal graph $G^*$ up to an MEC through constraint-based approaches [2], [3] or score-based approaches [4], [20]–[22]. Thus, the orientations of the undirected edges in the CPDAG cannot be identified from observational data alone. In order to fully recover the whole causal graph, it is required to perform experiments to orient further edges. In most applications, intervening on variables might be costly or time consuming. Thus, it is vital to recover the causal graph with the minimum number of interventions.

Fig. 1: Meek rules [20] (a: Rule 1, b: Rule 2, c: Rule 3, d: Rule 4): If the graph on the left-hand side of each sub-figure is an induced subgraph of a CPDAG G, then we can orient the undirected edge shown in orange as in the right-hand side of the sub-figure.

Fig. 2: (a) An example of a CPDAG with a single v-structure. The directed graphs in (b) and (c) are two DAGs of the CPDAG in (a).

Fig. 3: An example of $G^*$ and the corresponding CPDAG, denoted by $G_0$.

In this paper, we consider the active learning setting for performing experiments. In particular, in the first step, the CPDAG of the true MEC containing the causal graph is obtained from observational data, which we denote by $G_0$; the chain graph $G_1$ is obtained from $G_0$ by removing its directed edges. Next, in each step j of active learning, we select a variable $X_{i_j}$ from $G_j$ to be intervened on. By performing a perfect randomized experiment, we assume that the orientations of all edges incident to $X_{i_j}$ are identified. Given the orientations of these edges in $G_j$, we can apply the Meek rules to recover the orientations of further edges. Let $G'_j$ be the resulting causal graph. We obtain $G_{j+1}$ by removing the oriented edges in $G'_j$. It can be shown that $G_{j+1}$ is a collection of undirected chain components. In the next step, we decide to intervene on one of the variables in this collection of chain components. This procedure continues until the whole causal graph is recovered. Our main goal is to select the intervened variable $X_{i_j}$ in each step j such that the total number of steps until identifying the true causal graph is minimized.
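The following sketch (again ours, reusing apply_meek_rules from the earlier sketch and the same edge-set representation) implements one step of this procedure: the edges incident to the intervened variable are oriented according to the hidden true DAG, the Meek rules propagate the new orientations, and only the still-undirected edges are kept as $G_{j+1}$:

```python
def step(target, undirected, true_directed):
    """One step of active learning: intervene on `target` in the chain
    graph `undirected`, then apply the Meek rules. `true_directed` is the
    edge set of the hidden DAG G*. Returns G_{j+1} and the number of
    edges oriented in this step."""
    directed, undirected = set(), set(undirected)
    for edge in list(undirected):
        if target in edge:
            a, b = tuple(edge)
            undirected.discard(edge)
            # Reveal the true orientation of the incident edge.
            directed.add((a, b) if (a, b) in true_directed else (b, a))
    directed, undirected = apply_meek_rules(directed, undirected)
    return undirected, len(directed)   # G_{j+1}, newly oriented edges
```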
Example 2: An example of $G^*$ and its corresponding CPDAG is given in Fig. 3. In Fig. 4, we illustrate one step of active causal structure learning for this CPDAG. $G_1$ is obtained by deleting the directed edges from $G_0$. In the first step, we choose variable $X_4$ for intervention. By performing a perfect randomized experiment, the directions of the edges incident to $X_4$ are recovered. Next, we apply the Meek rules, and the orientation of one further edge is discovered. We then delete the oriented edges, and the chain graph $G_2$ is obtained.

III. PROPOSED METHOD
The proposed method is based on a greedy approach where, for a given graph $G_j$ in each step j, we select a variable $X_{i_j}$ from $G_j$ based on a heuristic function Q. We train an agent by a reinforcement learning algorithm in order to obtain the function Q. More specifically, the function Q has two arguments, namely a chain graph G and a variable X, and returns a score Q(G, X) which represents how desirable it is to select variable X for intervention in graph G. Based on these scores, we choose the following variable in step j:

$$X_{i_j} = \operatorname*{argmax}_{X \in V(G_j)} Q(G_j, X).$$

Fig. 4: (a) Graph $G_1$ is obtained from graph $G_0$ by removing directed edges. (b) After intervening on variable $X_4$, its incident edges are oriented according to the true graph. (c) We apply the Meek rules to orient further edges. (d) $G_2$ is obtained by removing the directed edges.

Unlike previous works on experiment design in the active setting, we utilize deep reinforcement learning methods in order to find a suitable Q function. A diagram of the proposed method for selecting a variable to be intervened on is given in Fig. 5. First, the embedding vector of each variable is calculated by the "embedding network". Then the "score network" determines the score of each variable according to the embedding vectors. The variable with the highest score is selected for intervention. The directions of some edges are discovered after performing the intervention. Finally, the resulting graph is given as input for the next step, and this process is repeated until the whole causal structure is recovered.

Fig. 5: An overview of the whole method for a given CPDAG. First, the embedding vector of each node is calculated by the "embedding network". Then the "score network" assigns a score to each node based on its embedding vector. Afterwards, we select the node with the highest score for performing the intervention. Next, some of the edges are oriented as a result of the intervention. We remove the directed edges and feed the remaining graph to the embedding network for the next step.
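A one-line sketch of this greedy rule, assuming some trained scoring function q_fn(graph, node) (a hypothetical name; the next subsections describe how such a function is parameterized and trained):

```python
def select_intervention(graph, nodes, q_fn):
    """Greedy choice X_ij = argmax_{X in V(G_j)} Q(G_j, X)."""
    return max(nodes, key=lambda x: q_fn(graph, x))
```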
A. Embedding Network

In order to feed the graph to the score network, we first need to represent the graph with a vector. Here, we use graph neural networks (GNNs) [23] to embed our graphs into vectors. A GNN uses the aggregated information of a variable's neighborhood to compute its embedding vector. To do so, the representation of a variable is updated by aggregating the representations of its neighbors in an iterative manner. After L iterations, a variable's representation captures the structural information within its L-hop network neighborhood. We can also embed a whole graph by a pooling method, for instance, by summing the embedding vectors of all variables in the graph. In general, the l-th layer of a GNN can be written as follows:

$$a_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)} : u \in N(v)\}\big),$$
$$h_v^{(l)} = \mathrm{COMBINE}^{(l)}\big(W_v, h_v^{(l-1)}, a_v^{(l)}\big), \qquad (1)$$

where $h_v^{(l)}$ is the embedding vector of node v at the l-th iteration/layer, N(v) is the set of nodes adjacent to v, and $W_v$ is the node feature vector of dimension q. In our implementation, we set $W_v$ to the vector of all ones. Moreover, $h_v^{(0)}$ is initialized by $W_v$. Multiple options have been proposed for the operations $\mathrm{AGGREGATE}^{(l)}(\cdot)$ and $\mathrm{COMBINE}^{(l)}(\cdot)$ in the GNN literature. Here, we use the SUM function for AGGREGATE and RELU for COMBINE. Hence, we can rewrite the above equation as follows:

$$h_v^{(l)} = \mathrm{ReLU}\Big(\theta_1 W_v + \theta_2 \sum_{u \in N(v)} h_u^{(l-1)}\Big), \qquad \text{for } l = 1, \ldots, L, \qquad (2)$$

where $\theta_1 \in \mathbb{R}^{p \times q}$ and $\theta_2 \in \mathbb{R}^{p \times p}$ are the parameters of the model.
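The update in (2) maps directly onto a few matrix operations. Below is a minimal PyTorch sketch under our own naming and shape conventions (the paper does not publish code); neighbor summation is implemented as a product with the adjacency matrix, and we initialize $h^{(0)}$ with zeros rather than $W_v$ to keep the dimensions generic:

```python
import torch
import torch.nn as nn

class EmbeddingNetwork(nn.Module):
    """GNN layer of Eq. (2), iterated L times:
    h_v^(l) = ReLU(theta1 @ W_v + theta2 @ sum_{u in N(v)} h_u^(l-1))."""

    def __init__(self, q: int = 1, p: int = 64, num_layers: int = 4):
        super().__init__()
        self.theta1 = nn.Linear(q, p, bias=False)  # theta1 in R^{p x q}
        self.theta2 = nn.Linear(p, p, bias=False)  # theta2 in R^{p x p}
        self.p, self.num_layers = p, num_layers

    def forward(self, adj: torch.Tensor) -> torch.Tensor:
        # adj: (n, n) 0/1 adjacency matrix of the undirected chain graph.
        adj = adj.float()
        n = adj.shape[0]
        w = torch.ones(n, 1)        # node features W_v (all ones, q = 1)
        h = torch.zeros(n, self.p)  # h^(0); the paper initializes from W_v
        for _ in range(self.num_layers):
            agg = adj @ h           # sum of neighbor embeddings
            h = torch.relu(self.theta1(w) + self.theta2(agg))
        return h                    # (n, p) embeddings h_v^(L)
```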
B. Score Network

In the causality literature, multiple heuristic functions have been proposed for the problem of active causal structure learning, mainly based on computing the size of the MEC [24]. Here, instead, we parameterize the heuristic function, denote it by $\hat{Q}$, and train its parameters. More specifically, we consider the following parameterized heuristic function:

$$\hat{Q}(G, v; \Theta) = \theta_3^T\, \mathrm{ReLU}\Big(\Big[\theta_4 \sum_{u \in V} h_u^{(L)},\ \theta_5\, h_v^{(L)}\Big]\Big), \qquad (3)$$

where $h_v^{(L)}$ is the p-dimensional embedding vector of node v after L iterations, $\theta_3 \in \mathbb{R}^{2p}$, $\theta_4, \theta_5 \in \mathbb{R}^{p \times p}$, and $[\cdot, \cdot]$ is the concatenation operator. We denote the set of all parameters by $\Theta = \{\theta_i\}_{i=1}^{5}$.
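Continuing the sketch above, the score network of (3) can be written as follows (again our own naming, not the authors' code):

```python
class ScoreNetwork(nn.Module):
    """Score of Eq. (3):
    Q_hat(G, v) = theta3^T ReLU([theta4 * sum_u h_u, theta5 * h_v])."""

    def __init__(self, p: int = 64):
        super().__init__()
        self.theta3 = nn.Linear(2 * p, 1, bias=False)  # theta3 in R^{2p}
        self.theta4 = nn.Linear(p, p, bias=False)      # theta4 in R^{p x p}
        self.theta5 = nn.Linear(p, p, bias=False)      # theta5 in R^{p x p}

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n, p) node embeddings h_v^(L) from the embedding network.
        pooled = self.theta4(h.sum(dim=0, keepdim=True))  # graph embedding
        per_node = self.theta5(h)                          # (n, p)
        cat = torch.cat([pooled.expand_as(per_node), per_node], dim=1)
        return self.theta3(torch.relu(cat)).squeeze(-1)    # (n,) scores
```

Given an adjacency matrix adj, the embeddings are h = EmbeddingNetwork()(adj), the scores are ScoreNetwork()(h), and the intervention target is their argmax.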
C. Training Phase
There is an analogy between selecting a variable for intervention and taking actions by an agent in an unknown environment. In particular, the state-action value function defined in a reinforcement learning problem determines the overall expected reward of taking each action in each state. In our problem, we can use a similar function to determine which variable is more desirable for intervention in each step.
1) Reinforcement Learning Formulation:
In the following, we explain how our problem can be formulated in the framework of reinforcement learning by introducing the set of states, the set of actions, and the reward function (a short sketch follows the list):

• Set of states: We consider the embedding vectors of all chain graphs as the set of states. Thus, at each time step j, the embedding vector of $G_j$ represents the state of the system.

• Set of actions: We consider intervening on any variable in the system as the set of actions.

• Reward function: For a given chain graph G and a variable X in the system, the reward is the number of edges that become oriented after intervening on X in G.

Our goal is to find a policy that maximizes the expected overall reward, which is the total number of oriented edges.
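Putting the earlier sketches together, one episode under this formulation can be rolled out as follows (our names throughout; step and select_intervention are from the previous sketches, and the per-step reward is the number of newly oriented edges):

```python
def run_episode(undirected, true_directed, q_fn, horizon=5):
    """Roll out one episode; the return is the total number of
    oriented edges over at most `horizon` interventions."""
    total_reward = 0
    for _ in range(horizon):
        if not undirected:                        # fully identified
            break
        nodes = {v for e in undirected for v in e}
        target = select_intervention(undirected, nodes, q_fn)
        undirected, reward = step(target, undirected, true_directed)
        total_reward += reward
    return total_reward
```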
2) Learning Q Function:
We utilize the Q-learning method from deep reinforcement learning (DRL) [25] to obtain the $\hat{Q}$ function; it is an iterative algorithm that updates the value of the Q function in each step. In tabular Q-learning, the Q function can be represented by a look-up table in which the value of $\mathbb{E}\big(\sum_{t=1}^{T} \gamma^t r_t \mid G, X\big)$ is stored for each pair (G, X), where $r_t$ is the reward in step t, $\gamma$ is a discount factor in the range (0, 1], and T is the number of steps in an episode. In many applications, the state space
is so huge that we cannot observe all states in the training process or even keep them in a look-up table. To resolve this issue, we obtain an approximation of the Q function by training the parameters of the score network through a Q-iteration algorithm. To do so, we consider two steps in each iteration: updating the state-action value function and updating the weights of the networks. The Q-learning [25], [26] update can be written as follows:

$$\hat{Q}(G_j, X_{i_j}; \Theta) \leftarrow \hat{Q}(G_j, X_{i_j}; \Theta) + \alpha \Big( r(G_j) + \gamma \max_{X \in V(G_{j+1})} \hat{Q}(G_{j+1}, X; \Theta) - \hat{Q}(G_j, X_{i_j}; \Theta) \Big), \qquad (4)$$

where $r(G_j)$ is the number of edges that are oriented as a result of intervening on $X_{i_j}$ in graph $G_j$, and $\alpha$ is the learning rate. For updating $\Theta$, we use the gradient descent method on the squared temporal-difference error:

$$\Theta \leftarrow \Theta - \alpha \nabla_\Theta \Big( r(G_j) + \gamma \max_{X \in V(G_{j+1})} \hat{Q}(G_{j+1}, X; \Theta) - \hat{Q}(G_j, X_{i_j}; \Theta) \Big)^2. \qquad (5)$$

Algorithm 1 Q-learning algorithm
  Initialize Θ, set ε = 1, and initialize the experience memory M = ∅ for the replay buffer
  for episode e = 1 to E do
    Sample a DAG $G^*$
    Create the CPDAG of $G^*$ and obtain the chain graph $G_1$
    Initialize I = ∅
    for step j = 1 to T do
      $X_{i_j}$ ← a random node $X \in V(G_j)$ with probability ε(e), and $\operatorname{argmax}_{X \in V(G_j)} \hat{Q}(G_j, X; \Theta)$ otherwise
      Add $X_{i_j}$ to I
      Orient the edges incident to $X_{i_j}$ according to $G^*$ and apply the Meek rules
      Remove the directed edges to obtain $G_{j+1}$
      Add the tuple $(G_j, X_{i_j}, r_j, G_{j+1})$ to M
    end for
    if e mod q = 0 then
      Sample a random batch B from M
      Update Θ using B
    end if
    Decrease ε(e)
  end for

TABLE I: Running times of the algorithms for performing five interventions (in seconds), for three graph densities ρ and graphs with 15 to 70 nodes. The values in parentheses show the speedup factor of our proposed method with respect to the considered algorithm.

The description of the Q-learning algorithm is given in Algorithm 1. At the beginning, we initialize Θ randomly according to a normal distribution and consider an empty replay buffer. Then we start the outer loop, where in each iteration we sample a DAG $G^*$ and construct the CPDAG from $G^*$. We initialize the intervention set I to ∅. In each iteration of the inner loop, we select a variable to intervene on based on the ε-greedy algorithm in order to explore more states in earlier episodes. We orient the incident edges of the selected node based on the true causal graph, and then apply the Meek rules. Next, we remove the oriented edges from the graph $G_j$ and obtain $G_{j+1}$. We add the tuple $(G_j, X_{i_j}, r_j, G_{j+1})$ to the replay buffer M. The current episode is finished once the inner loop is complete. At this point, we decrease ε for the next execution of the inner loop. After every q episodes, we sample a batch B from M and update Θ accordingly.
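For concreteness, the update (5) can be realized as a mean-squared TD-error step over a sampled replay batch. The following is a rough PyTorch sketch with our own naming (embed and score are instances of the EmbeddingNetwork and ScoreNetwork sketches above; the optimizer, e.g., Adam over both networks' parameters, and the discount value are our assumptions):

```python
import torch

def td_update(batch, embed, score, optimizer, gamma=0.99):
    """One Q-iteration step on a replay batch of (G_j, x_j, r_j, G_{j+1})
    tuples, where each graph is given by its adjacency matrix."""
    losses = []
    for adj, x, r, next_adj in batch:
        q = score(embed(adj))[x]              # Q_hat(G_j, X_ij; Theta)
        with torch.no_grad():                 # bootstrapped target
            next_q = score(embed(next_adj))
            target = r + gamma * (next_q.max() if next_q.numel() > 0 else 0.0)
        losses.append((target - q) ** 2)      # squared TD error of Eq. (5)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```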
IV. EXPERIMENTAL RESULTS

A. Synthetic Graph Generation
We employ synthetically generated chordal graphs, similar to the ones considered in [27], in order to evaluate the performance of the different algorithms. To do so, a randomly chosen perfect elimination ordering (PEO) [28] over the vertices is used to generate the underlying chordal graph. Starting from the vertex v with the highest order, all the vertices with lower orders are connected to v with probability inversely proportional to the order of v. Then, all the parents of v are connected with directed edges, where each edge is directed from the parent with the lower order to the parent with the higher order. If vertex v is not connected to any of the vertices with a lower order, one of them is chosen uniformly at random and set as the parent of v. In this way, we make sure that the generated graph is connected.
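A rough sketch of this generator under our reading of the construction above (the exact connection-probability schedule below is our assumption, not the paper's):

```python
import random

def sample_chordal_dag(n: int, density: float = 0.2):
    """Sample a connected chordal DAG over vertices 0..n-1, where the
    vertex order (0, 1, ..., n-1) serves as a perfect elimination
    ordering. Returns directed edges (a, b) meaning a -> b."""
    edges = set()
    for v in range(1, n):
        # Connection probability inversely proportional to v's order
        # (the proportionality constant is an assumption).
        prob = min(1.0, density * n / (v + 1))
        parents = [u for u in range(v) if random.random() < prob]
        if not parents:                      # keep the graph connected
            parents = [random.randrange(v)]
        for u in parents:
            edges.add((u, v))
        # Make the parents of v a clique, oriented low -> high order,
        # so the skeleton stays chordal and the graph stays acyclic.
        for i, a in enumerate(parents):
            for b in parents[i + 1:]:
                edges.add((a, b))
    return edges
```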
B. Previous Works for Quantitative Comparisons

We consider the following three main related works that have previously been proposed for the active setting:

• Entropy based approach [6]: A heuristic function based on Shannon's entropy metric is used such that intervening on the selected variable reduces the MEC to as small a subclass as possible.

• Minimax based approach [7]: Hauser and Bühlmann proposed an optimal algorithm for performing a single intervention according to a minimax objective function. This algorithm is utilized as a heuristic for selecting a variable for intervention in each step.

• Average based approach [27]: In each step, the variable that maximizes the expected number of edges whose orientations can be recovered by intervening on it is selected.
C. Details of Training and Testing Phases
The experimental results are obtained from a model that was trained on graphs of two small sizes. All the hyper-parameters were selected based on the results for these small graphs and then used for all the other graphs. For the testing phase, we generated instances of chordal graphs with {15, 20, 25, 30, 35, 40, 50, 70} nodes and three values of the density ρ, where ρ is the average number of edges divided by $\binom{n}{2}$. The implementation of our method is available in the supplementary material.

All procedures are executed on a multi-core Xeon server. To train the proposed model, we use multi-thread programming. However, the test mode uses only a single thread, i.e., it is not multi-threaded. For the related works, we implemented the average based and minimax based approaches with multi-thread programming in order to improve their running times.
D. Comparison Results
To compare the performance of the above algorithms, we considered the discovered edge ratio as the performance measure, which is the ratio of the number of oriented edges after performing the experiments to the total number of edges in the graph. In our experiments, we assume that the budget of interventions is equal to five.

Fig. 6: Performance of the different algorithms in terms of discovered edge ratio, for the three considered densities ρ (left, middle, right). The first to sixth rows correspond to increasing numbers of nodes, starting at n = 15.

In Fig. 6, the average discovered edge ratio is plotted against the number of interventions for the considered algorithms. We report the performance of an algorithm only if it returns the output within at most three hours. Different rows show the results for different numbers of variables. The left, middle, and right charts in every row show the results for the three graph densities. In the first two rows, we also depict the results of the optimal solution. Compared with the other approaches, for graphs with high density, our method always performs better for small numbers of interventions. For graphs with low density, our method has lower performance; however, as the graph size increases, its performance gets close to those of the other approaches.

We also compare the algorithms in terms of their running times, measured for five interventions on each graph. In Table I, the average running times for the different graphs are provided. Moreover, in every entry, we also provide the speedup factor of our proposed method with respect to the considered algorithm in parentheses. As can be seen, the running times of the entropy based and minimax based approaches grow exponentially with the graph size. We did not report the running time of an algorithm if it did not terminate within three hours. Compared to the second best solution (i.e., the average based approach), our method substantially reduces the running time on dense graphs.

According to these results, we can conclude that the proposed method generalizes fairly well to graphs of different sizes, although it has been trained over a specific range of graph sizes. For experiment design with a limited budget of interventions, the proposed method performs best compared with the related works in most cases. Finally, it significantly reduces running times with respect to existing solutions.

E. Real Graphs
In addition to synthetic graphs, we also experimented with gene regulatory networks (GRNs). A GRN is a network of biological regulators that interact with each other. In a GRN, there exist transcription factors which have a direct impact on gene activation. More specifically, the interactions between transcription factors and regulated genes can be represented by a directed graph in which there is a directed edge from a transcription factor to a gene if it regulates the gene's expression.

We consider the GRNs in the "DREAM 3 In Silico Network" challenge [29]. The networks in this challenge were extracted from known biological interactions in the GRNs of E. coli and yeast. The size of each sub-network is equal to 100. For each sub-network, we obtain the CPDAG from the true causal network and provide it as an input to the algorithms.

Fig. 7: Comparing the performance of the proposed algorithm against previous methods on real graphs: (a) Yeast1, (b) Yeast2, (c) Yeast3, (d) Ecoli1, (e) Ecoli2.

Fig. 7 illustrates the discovered edge ratio of the algorithms on five real sub-networks. As can be seen, the proposed method achieves competitive performance in most sub-networks.
V. CONCLUSION
In this paper, we proposed a deep reinforcement learning based solution for the problem of experiment design. In the proposed solution, we embed input graphs to vectors using a graph neural network and feed them to another neural network which assigns scores to the variables in order to select the intervention target in the next step. We jointly train both neural networks by a Q-iteration algorithm. Experimental results showed that the proposed solution has competitive performance in recovering the causal structure with respect to previous works, which are mainly based on heuristic metrics related to graph properties of the MEC. Moreover, the proposed solution reduces running times significantly and can be applied to large graphs.

REFERENCES
[1] P. O. Hoyer, S. Shimizu, A. J. Kerminen, and M. Palviainen, "Estimation of causal effects using linear non-gaussian causal models with hidden variables," International Journal of Approximate Reasoning, vol. 49, no. 2, pp. 362–378, 2008.
[2] J. Pearl, Causality. Cambridge University Press, 2009.
[3] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman, Causation, Prediction, and Search. MIT Press, 2000.
[4] D. M. Chickering, "Optimal structure identification with greedy search," Journal of Machine Learning Research, vol. 3, no. Nov, pp. 507–554, 2002.
[5] F. Eberhardt, "Causation and intervention," Unpublished doctoral dissertation, Carnegie Mellon University, p. 93, 2007.
[6] Y.-B. He and Z. Geng, "Active learning of causal networks with intervention experiments and optimal designs," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2523–2547, 2008.
[7] A. Hauser and P. Bühlmann, "Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs," Journal of Machine Learning Research, vol. 13, no. Aug, pp. 2409–2464, 2012.
[8] K. Shanmugam, M. Kocaoglu, A. G. Dimakis, and S. Vishwanath, "Learning causal graphs with small interventions," in Advances in Neural Information Processing Systems, 2015, pp. 3195–3203.
[9] M. Kocaoglu, A. Dimakis, and S. Vishwanath, "Cost-optimal learning of causal graphs," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1875–1884.
[10] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and E. Bareinboim, "Budgeted experiment design for causal structure learning," in International Conference on Machine Learning, 2018, pp. 1724–1733.
[11] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and K. Zhang, "Counting and sampling from Markov equivalent DAGs using clique trees," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3664–3671.
[12] A. AhmadiTeshnizi, S. Salehkaleybar, and N. Kiyavash, "LazyIter: A fast algorithm for counting Markov equivalent DAGs and designing experiments," in International Conference on Machine Learning, 2020, pp. 1663–1671.
[13] R. Agrawal, C. Squires, K. Yang, K. Shanmugam, and C. Uhler, "ABCD-strategy: Budgeted experimental design for targeted causal structure discovery," arXiv preprint arXiv:1902.10347, 2019.
[14] H. Dai, E. Khalil, Y. Zhang, B. Dilkina, and L. Song, "Learning combinatorial optimization algorithms over graphs," in Advances in Neural Information Processing Systems, 2017, pp. 6348–6358.
[15] K. Abe, Z. Xu, I. Sato, and M. Sugiyama, "Solving NP-hard problems on graphs by reinforcement learning without domain knowledge," arXiv preprint arXiv:1905.11623, 2019.
[16] W. Kool, H. van Hoof, and M. Welling, "Attention, learn to solve routing problems!" arXiv preprint arXiv:1803.08475, 2018.
[17] M. Nazari, A. Oroojlooy, L. Snyder, and M. Takáč, "Reinforcement learning for solving the vehicle routing problem," in Advances in Neural Information Processing Systems, 2018, pp. 9839–9849.
[18] X. Chen and Y. Tian, "Learning to perform local rewriting for combinatorial optimization," in Advances in Neural Information Processing Systems, 2019, pp. 6281–6292.
[19] T. Verma and J. Pearl, Equivalence and Synthesis of Causal Models. UCLA, Computer Science Department, 1991.
[20] C. Meek, "Graphical models: Selecting causal and statistical models," PhD thesis, Carnegie Mellon University, 1997.
[21] J. Tian, R. He, and L. Ram, "Bayesian model averaging using the k-best Bayesian network structures," arXiv preprint arXiv:1203.3520, 2012.
[22] L. Solus, Y. Wang, L. Matejovicova, and C. Uhler, "Consistency guarantees for permutation-based causal inference algorithms," arXiv preprint arXiv:1702.03530, 2017.
[23] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826, 2018.
[24] S. A. Andersson, D. Madigan, M. D. Perlman et al., "A characterization of Markov equivalence classes for acyclic digraphs," The Annals of Statistics, vol. 25, no. 2, pp. 505–541, 1997.
[25] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[26] M. Riedmiller, "Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method," in European Conference on Machine Learning. Springer, 2005, pp. 317–328.
[27] A. Ghassami, S. Salehkaleybar, N. Kiyavash, and E. Bareinboim, "Budgeted experiment design for causal structure learning," arXiv preprint arXiv:1709.03625, 2017.
[28] D. J. Rose and R. E. Tarjan, "Algorithmic aspects of vertex elimination on directed graphs," SIAM Journal on Applied Mathematics, vol. 34, no. 1, pp. 176–197, 1978.
[29] D. Marbach, T. Schaffter, C. Mattiussi, and D. Floreano, "Generating realistic in silico gene networks for performance assessment of reverse engineering methods," Journal of Computational Biology, vol. 16, no. 2, pp. 229–239, 2009.