Causal screening for dynamical systems
Søren Wengel Mogensen
Department of Mathematical Sciences, University of Copenhagen
Copenhagen, Denmark
[email protected]
Abstract
Many classical algorithms output graphical representations of causal structures by testing conditional independence among a set of random variables. In dynamical systems, local independence can be used analogously as a testable implication of the underlying data-generating process. We suggest some inexpensive methods for causal screening which provide output with a sound causal interpretation under the assumption of ancestral faithfulness. The popular model class of linear Hawkes processes is used to provide an example of a dynamical causal model. We argue that for sparse causal graphs the output will often be close to complete. We give examples of this framework and apply it to a challenging biological system.
Constraint-based causal learning is computationally and statistically challenging. There is a large literature on learning structures that are represented by directed acyclic graphs (DAGs) or marginalizations thereof (see e.g. [10] for references). The fast causal inference algorithm, FCI, [19] provides in a certain sense maximally informative output [24], but at the cost of using a large number of conditional independence tests [2]. To reduce the computational cost, other methods provide output which has a sound causal interpretation, but may be less informative. Among these are the anytime FCI [18] and RFCI [2]. A recent algorithm, ancestral causal inference (ACI) [11], aims at learning only the directed part of the underlying graphical structure which allows for a sound causal interpretation even though some information is lost.

In this paper, we describe some simple methods for learning causal structure in dynamical systems represented by stochastic processes. Many authors have described frameworks and algorithms for learning structure in systems of time series, ordinary differential equations, stochastic differential equations, and point processes. However, most of these methods do not have a clear causal interpretation when the observed processes are part of a larger system, and most of the current literature is either non-causal in nature, or requires that there are no unobserved processes.

Analogously to testing conditional independence when learning DAGs, one can use tests of local independence in the case of dynamical systems [5, 12, 14]. In [12, 14] the authors propose algorithms for learning local independence structures. We show empirically that we can recover features of the causal structure using considerably fewer tests of local independence. This is done by first suggesting a learning target which is easier to learn, though still conveys useful causal information, analogously to ACI [11]. Second, the proposed algorithm is only guaranteed to provide a supergraph of the learning target and this also reduces the number of local independence tests drastically. A central point is that our proposed methods retain a sound causal interpretation under a faithfulness-type assumption.

In [12], the author suggests learning a directed graph to represent a causal dynamical system and gives a learning algorithm which we will describe as a simple screening algorithm. We show that this algorithm can be given a sound interpretation under a weaker faithfulness assumption than that of [12]. We also provide a simple interpretation of the output of this algorithm and we show that similar screening algorithms can give comparable results using considerably fewer tests of local independence.

For illustration of the proposed algorithms, we will use linear Hawkes processes in this paper. This model class is used in a wide range of applications and is also a topic of methodological research (see e.g. [9] and references therein). All proofs are provided in the supplementary material. The algorithmic results we present apply in general to local independence models.

To provide a concrete causal model, we will consider the linear Hawkes processes. [9] gives an accessible introduction to this model class. On a filtered probability space, $(\Omega, \mathcal{F}, (\mathcal{F}_t), P)$, we consider an $n$-dimensional multivariate point process, $X = (X^1, \ldots, X^n)$. Each coordinate process $X^\alpha$ is described by a sequence of positive, stochastic event times $T^\alpha_1, T^\alpha_2, \ldots$ such that $T^\alpha_j > T^\alpha_i$ almost surely for $j > i$.
We let $V = \{1, \ldots, n\}$. This can also be formulated in terms of a counting process, $N$, such that $N^\alpha_s = \sum_i 1(T^\alpha_i \le s)$, $\alpha \in V$. There exist so-called intensity processes, $\lambda = (\lambda^1, \ldots, \lambda^n)$, such that

$$\lambda^\alpha_t = \lim_{h \to 0} \frac{1}{h} P(N^\alpha_{t+h} - N^\alpha_t = 1 \mid \mathcal{F}_t)$$

and the intensity at time $t$ can therefore be thought of as describing the probability of a jump in the immediate future after time $t$ conditionally on the history until time $t$ as captured by the $\mathcal{F}_t$-filtration. In a linear Hawkes model, the intensity of the $\alpha$-process, $\alpha \in V$, is of the simple parametric form

$$\lambda^\alpha_t = \mu^\alpha + \sum_{\gamma \in V} \int_0^t g^{\alpha\gamma}(t - s) \, \mathrm{d}N^\gamma_s$$

where $\mu^\alpha \ge 0$ and the functions $g^{\alpha\gamma}: \mathbb{R}_+ \to \mathbb{R}$, $\alpha, \gamma \in V$, are nonnegative. From the above formula, we see that if $g^{\beta\alpha} = 0$, then the $\alpha$-process does not enter directly into the intensity of the $\beta$-process, $\alpha, \beta \in V$, and we will formalize this observation in subsequent sections.
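To make the model class concrete, the following is a minimal simulation sketch using Ogata-style thinning, under the assumption of exponential kernels $g^{\alpha\gamma}(t) = a_{\alpha\gamma} e^{-bt}$ (a common parametric choice, cf. [9]); the function names and this parameterization are ours, not the paper's. The support of the matrix $a$ corresponds to the causal graph defined in the next section.

```python
import numpy as np

def simulate_hawkes(mu, a, b, T, rng=None):
    """Ogata-style thinning for a linear Hawkes process with exponential
    kernels g^{alpha,gamma}(t) = a[alpha, gamma] * exp(-b * t) -- an
    illustrative assumption; the paper allows general nonnegative kernels.
    mu: baseline rates, shape (n,); a: nonnegative excitation matrix,
    shape (n, n); b > 0: decay rate.  Returns (time, component) pairs."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    n = len(mu)
    events = []

    def intensity(s):
        # lambda^alpha_s = mu^alpha + sum over past events (t_i, gamma_i)
        # of a[alpha, gamma_i] * exp(-b * (s - t_i))
        lam = mu.copy()
        for (t_i, g_i) in events:
            if t_i <= s:
                lam += a[:, g_i] * np.exp(-b * (s - t_i))
        return lam

    t = 0.0
    while True:
        # between events the total intensity decays, so its current value
        # is a valid upper bound for the thinning step
        M = intensity(t).sum()
        if M <= 0:
            break
        t += rng.exponential(1.0 / M)
        if t > T:
            break
        lam = intensity(t)
        if rng.uniform() * M <= lam.sum():          # accept candidate point
            comp = int(rng.choice(n, p=lam / lam.sum()))
            events.append((t, comp))
    return events
```

For instance, with two coordinates and $a_{12} > 0$, $a_{21} = 0$, the second process excites the first but not conversely. The decay of the intensity between events is what makes the current total intensity a valid thinning bound.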
We will in this section define what is meant by a causal model and also define a graph $(V, E)$ which represents the causal structure of the model. The node set $V$ is the index set of the coordinate processes of the multivariate Hawkes process, thus identifying each node with a coordinate process. If we first consider the case where $X = (X_1, \ldots, X_n)$ is a multivariate random variable, it is common to define a causal model in terms of a DAG, $\mathcal{D}$, and a structural causal model [15, 16] by assuming that there exist functions $f_i$ and error terms $\epsilon_i$ such that

$$X_i = f_i(X_{\mathrm{pa}_{\mathcal{D}}(X_i)}, \epsilon_i)$$

for $i = 1, \ldots, n$. The causal assumption amounts to assuming that the functional relations are stable under interventions. This idea can be transferred to dynamical systems (see also [17, 14]). If we consider the model described above, we can consider intervening on the $\alpha$-process and e.g. enforce events in the $\alpha$-process at the deterministic times $t_1, \ldots, t_k$, and these times only. In this case, the causal assumption amounts to assuming that the distribution of the intervened system is governed by the intensities

$$\lambda^\beta_t = \mu^\beta + \int_0^t g^{\beta\alpha}(t - s) \, \mathrm{d}\bar{N}^\alpha_s + \sum_{\gamma \in V \setminus \{\alpha\}} \int_0^t g^{\beta\gamma}(t - s) \, \mathrm{d}N^\gamma_s$$

for all $\beta \in V \setminus \{\alpha\}$ and where $\bar{N}^\alpha_t = \sum_{i=1}^k 1(t_i \le t)$. We will not go into a discussion of the existence of these intervened stochastic processes. The above is a hard intervention in the sense that the $\alpha$-process is fixed to be a deterministic function of time. Note that one could easily imagine other types of interventions such as soft interventions where the intervened process is not deterministic. It holds that $N^\beta_{t+h} - N^\beta_t \sim \mathrm{Pois}(\lambda^\beta_t \cdot h)$ in the limit $h \to 0$,
and we can think of this as a simulation scheme in which we generate the points in one small interval in accordance with some distribution depending on the history of the process. As such the intensity describes a structural causal model at infinitesimal time steps.

We use the set of functions $\{g^{\beta\alpha}\}_{\alpha, \beta \in V}$ to define the causal graph of the Hawkes process. A graph is a pair $(V, E)$ where $V$ is a set of nodes and $E$ is a set of edges between these nodes. We assume that we observe the Hawkes process in the time interval $J = [0, T]$, $T \in \mathbb{R}$. The causal graph has node set $V$ (the index set of the coordinate processes) and the edge $\alpha \to \beta$ is in the causal graph if and only if $g^{\beta\alpha}$ is not identically zero on $J$. We call this graph causal as it is defined using $\{g^{\beta\alpha}\}_{\alpha, \beta \in V}$ which is a set of mechanisms assumed stable under interventions, and this causal assumption is therefore analogous to that of a classical structural causal model as briefly introduced above.

Figure 1: Left: a causal graph on nodes $V = \{\alpha, \beta, \gamma, \delta, \epsilon, \phi\}$. Right: the corresponding parent graph on nodes $O = \{\alpha, \delta, \epsilon\}$. Note that causal graphs and parent graphs may contain cycles. The parent graph does not contain information on the confounder process $\phi$ as it only encodes 'causal ancestors'.

In principle, we would like to recover the causal graph, $\mathcal{D}$, using local independence tests. Often, we will only have partial observation of the dynamical system in the sense that we only observe the processes in $O \subsetneq V$. We will then aim to learn the parent graph of $\mathcal{D}$.

Definition 1 (Parent graph). Let $\mathcal{D} = (V, E)$ be a causal graph and let $O \subseteq V$. The parent graph of $\mathcal{D}$ on nodes $O$ is the graph $(O, F)$ such that for $\alpha, \beta \in O$, the edge $\alpha \to \beta$ is in $F$ if and only if the edge $\alpha \to \beta$ is in the causal graph or there is a path $\alpha \to \delta_1 \to \ldots \to \delta_k \to \beta$ in the causal graph such that $\delta_1, \ldots, \delta_k \notin O$, for some $k \ge 1$. We denote the parent graph by $P_O(\mathcal{G})$, or just $P(\mathcal{G})$ if the set $O$ used is clear from the context.

In applications, a parent graph may provide answers to important questions as it tells us the causal relationships between the observed nodes. A similar idea was applied in [11]. In large systems, it can easily be infeasible to learn the complete independence structure of the observed system, and we propose instead to estimate the parent graph which can be done efficiently. In the supplementary material, we give another characterization of a parent graph. Figure 1 contains an example of a causal graph and a corresponding parent graph.

Local independence has been studied by several authors and in different classes of continuous-time models as well as in time series [1, 3, 4, 6]. We give an abstract definition of local independence, following the exposition in [14].

Definition 2 (Local independence). Let $X$ be a multivariate stochastic process and let $V$ be an index set of its coordinate processes. Let $\mathcal{F}^D_t$ denote the complete and right-continuous version of the $\sigma$-algebra $\sigma(\{X^\alpha_s : s \le t, \alpha \in D\})$. Let $\lambda$ be a multivariate stochastic process (assumed to be integrable and càdlàg) such that its coordinate processes are indexed by $V$. For $A, B, C \subseteq V$, we say that $X^B$ is $\lambda$-locally independent of $X^A$ given $X^C$ (or simply $B$ is $\lambda$-locally independent of $A$ given $C$) if the process $t \mapsto \mathrm{E}(\lambda^\beta_t \mid \mathcal{F}^{C \cup A}_t)$ has an $\mathcal{F}^C_t$-adapted version for all $\beta \in B$. We write this as $A \not\to_\lambda B \mid C$, or simply $A \not\to B \mid C$.

In the case of Hawkes processes, the intensities will be used as the $\lambda$-processes in the above definition.
See [13, 14] for technical details on the definition of local independence.

To make progress on the learning task, we will in this subsection describe the link between the local independence model and the causal graph.
Definition 3 (Pairwise Markov property). We say that a local independence model satisfies the pairwise Markov property with respect to a DG, $\mathcal{D} = (V, E)$, if the absence of the edge $\alpha \to \beta$ in $\mathcal{D}$ implies $\alpha \not\to_\lambda \beta \mid V \setminus \{\alpha\}$ for all $\alpha, \beta \in V$.

We will make the following technical assumption throughout the paper. In applications, the functions $g^{\alpha\beta}$ are often assumed to be of the below type (see e.g. [9] for common choices of $g^{\beta\alpha}$-functions).

Assumption 4.
Assume that $N$ is a multivariate Hawkes process and that we observe $N$ over the interval $J = [0, T]$ where $T > 0$.
For all $\alpha, \beta \in V$, the function $g^{\beta\alpha}: \mathbb{R}_+ \to \mathbb{R}$ is continuous on $J$.

A version of the following result is also stated in [7] but no proof is given. If $\mathcal{G}_1 = (V, E_1)$ and $\mathcal{G}_2 = (V, E_2)$ are graphs, we say that $\mathcal{G}_1$ is a proper subgraph of $\mathcal{G}_2$ if $E_1 \subsetneq E_2$.

Proposition 5.
The local independence model of a linear Hawkes process satisfies the pairwise Markov property with respect to the causal graph of the process, and no proper subgraph of the causal graph has this property.
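Proposition 5 in particular identifies the causal graph with the support of the kernel functions. Under the exponential-kernel assumption used in the simulation sketch above (again our illustrative choice, not the paper's), reading the causal graph off the parameters is immediate:

```python
import numpy as np

def causal_graph_from_kernels(a):
    """The edge alpha -> beta is in the causal graph iff g^{beta,alpha} is
    not identically zero; under the exponential-kernel assumption this is
    simply a[beta, alpha] != 0.  Self-loops are added by convention."""
    n = a.shape[0]
    return {(alpha, beta) for alpha in range(n) for beta in range(n)
            if alpha == beta or a[beta, alpha] != 0}
```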
In order to give full proofs of the statements of the paper, we need a number of graph-theoretical concepts that are otherwise not central to our presentation. We have included these in the supplementary material and in this section we only introduce the most central concepts.
A graph is a pair $(V, E)$ where $V$ is a finite set of nodes and $E$ a finite set of edges. An edge can be of different types, and we will use $\sim$ to denote a generic edge of any type. Each edge is between a pair of nodes (not necessarily distinct), and for $\alpha, \beta \in V$, $e \in E$, we will write $\alpha \overset{e}{\sim} \beta$ to denote that the edge $e$ is between $\alpha$ and $\beta$. We will in particular consider the class of directed graphs (DGs) where between each pair of nodes $\alpha, \beta \in V$ one has a subset of the edges $\{\alpha \to \beta, \alpha \leftarrow \beta\}$, and we say that these edges are directed.

Let $\mathcal{G}_1 = (V, E_1)$ and $\mathcal{G}_2 = (V, E_2)$ be graphs. We say that $\mathcal{G}_1$ is a supergraph of $\mathcal{G}_2$, and write $\mathcal{G}_2 \subseteq \mathcal{G}_1$, if $E_2 \subseteq E_1$. For a graph $\mathcal{G} = (V, E)$ such that $\alpha, \beta \in V$, we write $\alpha \to_{\mathcal{G}} \beta$ to indicate that the directed edge from $\alpha$ to $\beta$ is contained in the edge set $E$. In this case we say that $\alpha$ is a parent of $\beta$. We let $\mathrm{pa}_{\mathcal{G}}(\beta)$ denote the set of nodes in $V$ that are parents of $\beta$. We write $\alpha \not\to_{\mathcal{G}} \beta$ to indicate that the edge is not in $E$. For simplicity, we will throughout the paper assume that all loops, i.e. self-edges $\alpha \to \alpha$, are present in the graphs we consider. This graphical assumption corresponds to implicitly assuming that every process of the dynamical system depends on its own past in a direct fashion.

A walk is a finite sequence of nodes, $\alpha_i \in V$, and edges, $e_i \in E$,

$$\langle \alpha_1, e_1, \alpha_2, \ldots, \alpha_n, e_n, \alpha_{n+1} \rangle$$

such that $e_i$ is between $\alpha_i$ and $\alpha_{i+1}$ for all $i = 1, \ldots, n$ and such that an orientation of each edge is known. We say that a walk is nontrivial if it contains at least one edge. A path is a walk such that no node is repeated. A directed path from $\alpha$ to $\beta$ is a path such that all edges are directed and point in the direction of $\beta$.

Definition 6 (Trek, directed trek). A trek between $\alpha$ and $\beta$ is a (nontrivial) path $\langle \alpha, e_1, \ldots, e_n, \beta \rangle$ with no colliders [8]. We say that a trek between $\alpha$ and $\beta$ is directed from $\alpha$ to $\beta$ if $e_n$ has a head at $\beta$. We use $\mathrm{dt}(\beta)$ to denote the set of nodes $\gamma$ such that there exists a directed trek from $\gamma$ to $\beta$.

As an example, we note that a directed path from $\alpha$ to $\beta$ is a trek and it is also a directed trek from $\alpha$ to $\beta$. However, it is not a directed trek from $\beta$ to $\alpha$.

We will formulate the following properties using a general independence model, $\mathcal{I}$, on $V$. Let $\mathcal{P}(\cdot)$ denote the power set of some set. An independence model on $V$ is simply a subset of $\mathcal{P}(V) \times \mathcal{P}(V) \times \mathcal{P}(V)$ and can be thought of as a collection of independence statements that hold among the processes/variables indexed by $V$. In subsequent sections, the independence models will be defined using the notion of local independence. In this case, for $A, B, C \subseteq V$, $A \not\to_\lambda B \mid C$ is equivalent with writing $\langle A, B \mid C \rangle \in \mathcal{I}$ in the abstract notation, and we use the two interchangeably. In the following, we also use $\mu$-separation which is a ternary relation and a dynamical model (and asymmetric) analogue to $d$-separation or $m$-separation. A definition and references are given in the supplementary material.
For $\mathcal{G} = (V, E)$ and $A, B, C \subseteq V$ we write $A \perp_\mu B \mid C\ [\mathcal{G}]$ to denote that $B$ is $\mu$-separated from $A$ given $C$ in the graph $\mathcal{G}$.

Definition 7 (Global Markov property). We say that an independence model $\mathcal{I}$ satisfies the global Markov property with respect to a DG, $\mathcal{G} = (V, E)$, if $A \perp_\mu B \mid C\ [\mathcal{G}]$ implies $\langle A, B \mid C \rangle \in \mathcal{I}$ for all $A, B, C \subseteq V$.

From Proposition 5, we know that the local independence model of a linear Hawkes process satisfies the pairwise Markov property with respect to the causal graph of the process, and using the results in [4, 14] it also satisfies the global Markov property with respect to this graph.

Definition 8 (Faithfulness). We say that $\mathcal{I}$ is faithful with respect to a DG, $\mathcal{G} = (V, E)$, if $\langle A, B \mid C \rangle \in \mathcal{I}$ implies $A \perp_\mu B \mid C\ [\mathcal{G}]$ for all $A, B, C \subseteq V$.
In this section, we will define a faithfulness-type assumption that will be used to ensure the soundness of the learning algorithms. We then state a very general class of algorithms which is easily seen to provide sound causal learning and we describe some specific algorithms.

We throughout assume that there is some underlying DG, $\mathcal{D} = (V, E)$, describing the causal model and we wish to output $P_O(\mathcal{D})$. However, this graph is not in general identifiable from the local independence model. In the supplementary material, we argue that for an equivalence class of parent graphs, there exists a unique member of the class which is a supergraph of all other members. Denote this unique graph by $\bar{\mathcal{D}}$. Our algorithms will output supergraphs of $\bar{\mathcal{D}}$, and the output will therefore also be supergraphs of the true parent graph.

We assume we are in the 'oracle case', i.e. have access to a local independence oracle that provides the correct answers. We will say that an algorithm is sound if it in the oracle case outputs a supergraph of $\bar{\mathcal{D}}$ and that it is complete if it outputs $\bar{\mathcal{D}}$. We let $\mathcal{I}_O$ denote the local independence model restricted to subsets of $O$, i.e. this is the observed part of the local independence model.

Under the faithfulness assumption, every local independence implies $\mu$-separation in the graph. We assume a weaker, but similar, property to argue that our algorithms are sound. For learning marginalized DAGs, weaker types of faithfulness have also been explored, see e.g. [25, 22, 23].

Definition 9 (Ancestral faithfulness). Let $\mathcal{I}$ be an independence model and let $\mathcal{D}$ be a DG. We say that $\mathcal{I}$ satisfies ancestral faithfulness with respect to $\mathcal{D}$ if for every $\alpha, \beta \in V$ and $C \subseteq V \setminus \{\alpha\}$, $\langle \alpha, \beta \mid C \rangle \in \mathcal{I}$ implies that there is no $\mu$-connecting directed path from $\alpha$ to $\beta$ given $C$ in $\mathcal{D}$.

It follows from the definition that faithfulness implies ancestral faithfulness and for general independence models ancestral faithfulness is a strictly weaker property than faithfulness. We conjecture that local independence models of linear Hawkes processes satisfy ancestral faithfulness with respect to their causal graphs. Heuristically, if there is a directed path from $\alpha$ to $\beta$ which is not blocked by any node in $C$, then information should flow from $\alpha$ to $\beta$, and this cannot be 'cancelled out' by other paths in the graph as the linear Hawkes processes are self-excitatory, i.e. no process has a dampening effect on any process.

Simple screening algorithms

As a first step in describing a causal screening algorithm, we will define a very general class of learning algorithms that simply test local independences and sequentially remove edges. It is easily seen that under the assumption of ancestral faithfulness every algorithm in this class gives sound learning in the oracle case.
The complete DG on nodes $V$ is the DG with edge set $\{\alpha \to \beta \mid \alpha, \beta \in V\}$.

Definition 10 (Simple screening algorithm). We say that a learning algorithm is a simple screening algorithm if it starts from the complete DG on nodes $O$ and removes an edge $\alpha \to \beta$ only if a conditioning set $C \subseteq O \setminus \{\alpha\}$ has been found such that $\langle \alpha, \beta \mid C \rangle \in \mathcal{I}_O$.

The following results give a clear and causally sound interpretation of the output of a simple screening algorithm.

Proposition 11.
Assume that $\mathcal{I}$ satisfies ancestral faithfulness with respect to $\mathcal{D} = (V, E)$. The output of any simple screening algorithm is sound in the oracle case.
Corollary 12.
Assume ancestral faithfulness of $\mathcal{I}$ with respect to $\mathcal{D}$ and let $A, B, C \subseteq O$. If every directed path from $A$ to $B$ goes through $C$ in the output graph of a simple screening algorithm, then every directed path from $A$ to $B$ goes through $C$ in $\mathcal{D}$.

Corollary 13.
If there is no directed path from $A$ to $B$ in the output graph, then there is no directed path from $A$ to $B$ in $\mathcal{D}$.

In the previous section, it was shown that if edges are only removed when a separating set is found, the output is sound under the assumption of ancestral faithfulness. In this section we give a specific algorithm. The key observation is that we can easily retrieve structural information from a rather small subset of local independence tests.

Let $\mathcal{D}_t$ denote the output from Subalgorithm 1 (see below). The following result shows that under the assumption of faithfulness, $\alpha \to_{\mathcal{D}_t} \beta$ if and only if there is a directed trek from $\alpha$ to $\beta$ in $\mathcal{D}$.

Proposition 14.
There is no directed trek from $\alpha$ to $\beta$ in $\mathcal{D}$ if and only if $\alpha \perp_\mu \beta \mid \beta\ [\mathcal{D}]$.

We will refer to running first Subalgorithm 1 and then Subalgorithm 2 (using the output DG from the first as input to the second) as the causal screening (CS) algorithm. The following proposition follows directly from the definitions of the subalgorithms.

Proposition 15.
The CS algorithm is a simple screening algorithm.

It is of course of interest to understand under what conditions the edge $\alpha \to \beta$ is guaranteed to be removed by the CS algorithm when it is not in the underlying target graph. In the supplementary material we state and prove a result describing one such condition.

  input: a local independence oracle for $\mathcal{I}_O$
  output: a DG on nodes $O$
  initialize $\mathcal{D}$ as the complete DG on $O$;
  foreach $(\alpha, \beta) \in V \times V$ do
    if $\alpha \not\to_\lambda \beta \mid \beta$ then delete $\alpha \to \beta$ from $\mathcal{D}$;
  end
  return $\mathcal{D}$

Subalgorithm 1: Trek step
  input: a local independence oracle for $\mathcal{I}_O$ and a DG, $\mathcal{D} = (O, E)$
  output: a DG on nodes $O$
  foreach $(\alpha, \beta) \in V \times V$ such that $\alpha \to_{\mathcal{D}} \beta$ do
    if $\alpha \not\to_\lambda \beta \mid \mathrm{pa}_{\mathcal{D}}(\beta) \setminus \{\alpha\}$ then delete $\alpha \to \beta$ from $\mathcal{D}$;
  end
  return $\mathcal{D}$
Subalgorithm 2: Parent step
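To fix ideas, here is a sketch of the CS algorithm in the oracle case. The encoding and the oracle interface are our own: `oracle(a, b, C)` is assumed to return `True` exactly when the local independence $a \not\to_\lambda b \mid C$ holds; in oracle experiments it can be implemented by $\mu$-separation in the true graph (see the supplementary material), and on data it would be replaced by a statistical test. Self-loops are left implicit, so $\beta$ always appears in its own conditioning set.

```python
from itertools import product

def trek_step(nodes, oracle):
    """Subalgorithm 1: start from the complete DG and test a -/-> b | {b}."""
    edges = {(a, b) for a, b in product(nodes, repeat=2) if a != b}
    for (a, b) in sorted(edges):
        if oracle(a, b, {b}):
            edges.discard((a, b))
    return edges

def parent_step(nodes, edges, oracle):
    """Subalgorithm 2: test a -/-> b | pa(b) \\ {a} in the current DG."""
    edges = set(edges)
    for (a, b) in sorted(edges):
        if (a, b) not in edges:          # may have been removed already
            continue
        # parents of b in the current graph; the implicit self-loop means
        # that b always conditions on its own past
        pa_b = {u for (u, v) in edges if v == b} | {b}
        if oracle(a, b, pa_b - {a}):
            edges.discard((a, b))
    return edges

def causal_screening(nodes, oracle):
    """The CS algorithm: trek step followed by parent step."""
    return parent_step(nodes, trek_step(nodes, oracle), oracle)
```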
In this section, we describe an additional step which propagates ancestry by reusing the output of Subalgorithm 1 to remove further edges. This comes at a price as one needs to assume faithfulness in order to guarantee that the result will be sound. The idea is similar to ACI [11] which also uses that ancestry is transitive.

  input: a local independence oracle for $\mathcal{I}_O$ and a DG, $\mathcal{D} = (O, E)$
  output: a DG on nodes $O$
  initialize $E_r = \emptyset$ as the empty edge set;
  foreach $(\alpha, \beta, \gamma) \in V \times V \times V$ such that $\alpha, \beta, \gamma$ are all distinct do
    if $\alpha \to_{\mathcal{D}} \beta$, $\beta \not\to_{\mathcal{D}} \alpha$, $\beta \to_{\mathcal{D}} \gamma$, and $\alpha \not\to_{\mathcal{D}} \gamma$ then update $E_r = E_r \cup \{\beta \to \gamma\}$;
  end
  update $\mathcal{D} = (V, E \setminus E_r)$;
  return $\mathcal{D}$

Subalgorithm 3: Ancestry propagation
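A sketch of this step in the same style, reusing `trek_step` and `parent_step` from the sketch above; note that Subalgorithm 3 uses no further independence tests, only the current graph.

```python
def ancestry_propagation(nodes, edges):
    """Subalgorithm 3: remove beta -> gamma whenever the current DG has
    alpha -> beta but not beta -> alpha, and beta -> gamma but not
    alpha -> gamma, for distinct alpha, beta, gamma.  All conditions are
    checked on the input graph; removals are applied at the end."""
    removals = set()
    for a in nodes:
        for b in nodes:
            for g in nodes:
                if len({a, b, g}) < 3:
                    continue
                if ((a, b) in edges and (b, a) not in edges
                        and (b, g) in edges and (a, g) not in edges):
                    removals.add((b, g))
    return set(edges) - removals

def csapc(nodes, oracle):
    """CSAPC: trek step, then ancestry propagation, then parent step."""
    return parent_step(nodes,
                       ancestry_propagation(nodes, trek_step(nodes, oracle)),
                       oracle)
```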
In ancestry propagation, we exploit the fact that any trek between $\alpha$ and $\beta$ composed with the edge $\beta \to \gamma$ gives a directed trek from $\alpha$ to $\gamma$. We only use the trek between $\alpha$ and $\beta$ 'in one direction' and this is because we should be slightly careful if $\gamma$ is actually on the trek between $\alpha$ and $\beta$. In Subalgorithm 4 in the supplementary material, we exploit a trek between $\alpha$ and $\beta$ twice, but at the cost of an additional local independence test.

We can construct an algorithm by first running Subalgorithm 1, then Subalgorithm 3, and finally Subalgorithm 2 (using the output of one subalgorithm as input to the next). We will call this the CSAPC algorithm. If we use Subalgorithm 4 instead of Subalgorithm 3, we will call this the CSAP algorithm.

Proposition 16. If $\mathcal{I}$ is faithful with respect to $\mathcal{D}$, then the CSAP and CSAPC algorithms both provide sound learning.

When evaluating the performance of a sound screening algorithm, the output graph is guaranteed to be a supergraph of the true parent graph, and we will say that edges that are in the output but not in the true graph are excess edges. For a node in a directed graph, the indegree is the number of directed edges adjacent with and pointed into the node, and the outdegree is the number of directed edges adjacent with and pointed away from the node.
Caenorhabditis elegans is a roundworm in which the network between neurons has been mapped completely [20]. We apply our methods to this network as an application to a highly complex network. It consists of 279 neurons which are connected by both non-directional gap junctions and directional chemical synapses. We will represent the former as an unobserved process and the latter as a direct influence. From this network, we sampled subnetworks of 75 neurons each (details in the supplementary material) and computed the output of the CS algorithm. These subsampled networks had on average 1109 edges (including bidirected edges representing unobserved processes, see the supplementary material) and on average 424 directed edges. The output graphs had on average 438 excess edges which is explained by the fact that there are many unobserved nodes in the graphs. To compare the output to the true parent graph, we computed the rank correlation between the indegrees of the nodes in the output graph and the indegrees of the nodes in the true parent graph, and similarly for the outdegrees (indegree correlation: 0.94, outdegree correlation: 0.52). Finally, we investigated the method's ability to identify the observed nodes of highest directed connectivity (i.e. highest in- and outdegrees). The neuronal network of C. elegans is inhomogeneous in the sense that some neurons are extremely highly connected while others are only very sparsely connected. Identifying such highly connected neurons is of interest, and we considered the 15 nodes of highest indegree/outdegree (out of the 75 observed nodes). On average, the CS algorithm placed 13.4 (in) and 9.2 (out) of these 15 among the 15 most connected nodes in the output graph.

From the output of the CS algorithm, we can easily find areas of the neuronal network which mediate the information from one area to another, e.g. using Corollary 12.
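The degree-based summaries used above can be computed as follows; this is our own sketch of the evaluation (graphs encoded as edge sets, function name ours), using Spearman's rank correlation.

```python
from scipy.stats import spearmanr

def degree_rank_correlations(true_edges, output_edges, nodes):
    """Rank correlation between node in-/outdegrees in the true parent
    graph and in the output graph of a screening algorithm."""
    def degrees(edges):
        indeg = {v: 0 for v in nodes}
        outdeg = {v: 0 for v in nodes}
        for (u, v) in edges:
            if u != v:                   # ignore the implicit self-loops
                outdeg[u] += 1
                indeg[v] += 1
        return indeg, outdeg

    t_in, t_out = degrees(true_edges)
    o_in, o_out = degrees(output_edges)
    order = sorted(nodes)
    r_in = spearmanr([t_in[v] for v in order],
                     [o_in[v] for v in order]).correlation
    r_out = spearmanr([t_out[v] for v in order],
                      [o_out[v] for v in order]).correlation
    return r_in, r_out
```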
Figure 2: Comparison of performance. (a) Number of tests used (y-axis) against the number of edges in the underlying graph (x-axis) for CS, dFCI, and CA. We generated 500 graphs, all on 5 nodes. Mean number of excess edges: CS 0.96, dFCI 0.07, CA 0.81. (b) Mean number of excess edges in output graphs (y-axis) against the number of edges (bidirected and directed, not counting loops) in the true graph (x-axis) for CS, CSAP, and CSAPC; all graphs are on 10 nodes.
In this section we compare the proposed causal screening algorithms with previously published algorithms that solve similar problems. In [14], the authors propose two algorithms, one of which is sure to output the correct graph. The authors note that this complete algorithm is computationally very expensive and adds little extra information, and therefore we will only consider their other algorithm for comparison. We will call this algorithm dynamical FCI (dFCI)
as it resembles the FCI algorithm, as noted by the authors of [14]. The algorithm actually solves a harder learning problem and provides more information (see details in the supplementary material); however, it is computationally infeasible for many problems.

The Causal Analysis (CA) algorithm of [12] is a simple screening algorithm and we have in this paper argued that it is sound for learning the parent graph of the underlying graph under the weaker assumption of ancestral faithfulness. Note that even though this algorithm uses a large number of tests, it is not guaranteed to provide complete learning as there may be inseparable nodes that are not adjacent [13, 14].

For the comparison of these algorithms, two aspects are important. As they are all sound, one aspect is the number of excess edges. The other aspect is of course the number of tests needed to provide their output. The CS and CSAPC algorithms use at most $2n(n-1)$ tests
(at most one test per ordered pair of distinct nodes in each of the two subalgorithms) and empirically the CSAP algorithm uses roughly the same number as the two former. This makes them feasible even in large graphs. The quality of their output is dependent on the sparsity of the graph, though the CSAP and CSAPC algorithms can deal considerably better with less sparse graphs (Subfigure 2b).
Discussion
We suggested inexpensive constraint-based methods for learning causal structure based on testing local independence. An important observation is that local independence is asymmetric while conditional independence is symmetric. In a certain sense, this may help when constructing learning algorithms as there is no need for something like an 'orientation phase' as in the FCI. This facilitates using very simple methods to give sound causal learning as we do not need the independence structure in full to give interesting output.

The level of information in the output graph of the causal screening algorithms is dependent on the sparsity of the graph. However, even in examples with very little sparsity, as in the C. elegans neuronal network, interesting structural information can be learned using these simple methods.

By outputting only the directed part of the underlying causal structure, we may be able to answer structural questions, but not other questions, e.g. relating to causal effect estimation. However, by restricting the scope we can provide a sound algorithm which can reveal interesting information about the causal structure.

We showed that the proposed algorithms have a large computational advantage over previously published algorithms within this framework. This makes it feasible to consider causal learning even in large networks with unobserved processes, and the CS algorithms thus provide methods for sound screening in causal dynamical models.
This work was supported by research grant 13358 from VILLUM FONDEN. The author thanks Niels Richard Hansen for his helpful suggestions and comments.

Supplementary material
This supplementary material contains additional graph theory, results, and definitions, as well as the proofs of the main paper.
A Graph theory
The additional graph theory is useful in the proofs, and we can also give an alternative definition of the parent graph using a graphical marginalization operation that we will describe.

In the main paper, we introduce the class of DGs to represent causal structures. One can represent marginalized DGs using the larger class of DMGs. A directed mixed graph (DMG) is a graph such that any pair of nodes $\alpha, \beta \in V$ is joined by a subset of the edges $\{\alpha \to \beta, \alpha \leftarrow \beta, \alpha \leftrightarrow \beta\}$. We say that edges of the types $\alpha \to \beta$ and $\alpha \leftarrow \beta$ are directed, and that $\alpha \leftrightarrow \beta$ is bidirected.

We also introduced a walk $\langle \alpha_1, e_1, \alpha_2, \ldots, \alpha_n, e_n, \alpha_{n+1} \rangle$. We say that $\alpha_1$ and $\alpha_{n+1}$ are endpoint nodes. A nonendpoint node $\alpha_i$ on a walk is a collider if $e_{i-1}$ and $e_i$ both have heads at $\alpha_i$, and otherwise it is a noncollider. A cycle is a path $\langle \alpha, e_1, \ldots, \beta \rangle$ composed with an edge between $\alpha$ and $\beta$. We say that $\alpha$ is an ancestor of $\beta$ if there exists a directed path from $\alpha$ to $\beta$. We let $\mathrm{an}(\beta)$ denote the set of nodes that are ancestors of $\beta$.

For DAGs, $d$-separation is often used for encoding independences. We use the analogous notion of $\mu$-separation which is a generalization of $\delta$-separation [3, 4, 12, 13].

Definition 17 ($\mu$-separation). Let $\mathcal{G} = (V, E)$ be a DMG, and let $\alpha, \beta \in V$ and $C \subseteq V$. We say that a (nontrivial) walk from $\alpha$ to $\beta$, $\langle \alpha, e_1, \ldots, e_n, \beta \rangle$, is $\mu$-connecting given $C$ if $\alpha \notin C$, the edge $e_n$ has a head at $\beta$, every collider on the walk is in $\mathrm{an}(C)$ and no noncollider is in $C$. Let $A, B, C \subseteq V$. We say that $B$ is $\mu$-separated from $A$ given $C$ if there is no $\mu$-connecting walk from any $\alpha \in A$ to any $\beta \in B$ given $C$. In this case, we write $A \perp_\mu B \mid C$, or $A \perp_\mu B \mid C\ [\mathcal{G}]$ if we wish to emphasize the graph to which the statement relates.

We use the class of DGs to represent the underlying, data-generating structure. When only parts of the causal system are observed, the class of DMGs can be used for representing marginalized DGs [13]. This can be done using latent projection [21, 13] which is a map that for a DG (or more generally, for a DMG), $\mathcal{D} = (V, E)$, and a subset of observed nodes/processes, $O \subseteq V$, provides a DMG, $m(\mathcal{D}, O)$, such that for all $A, B, C \subseteq O$,

$$A \perp_\mu B \mid C\ [\mathcal{D}] \iff A \perp_\mu B \mid C\ [m(\mathcal{D}, O)].$$

See [13] for details on this graphical marginalization. We say that two DMGs, $\mathcal{G}_1 = (V, E_1)$, $\mathcal{G}_2 = (V, E_2)$, are Markov equivalent if

$$A \perp_\mu B \mid C\ [\mathcal{G}_1] \iff A \perp_\mu B \mid C\ [\mathcal{G}_2]$$

for all $A, B, C \subseteq V$, and we let $[\mathcal{G}]$ denote the Markov equivalence class of $\mathcal{G}$. Every Markov equivalence class of DMGs has a unique maximal element [13], i.e. there exists $\bar{\mathcal{G}} \in [\mathcal{G}]$ such that $\bar{\mathcal{G}}$ is a supergraph of all other graphs in $[\mathcal{G}]$. For a DMG, $\mathcal{G}$, we will let $D(\mathcal{G})$ denote the directed part of $\mathcal{G}$, i.e. the DG obtained by deleting all bidirected edges from $\mathcal{G}$.
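The following is a sketch of a walk-based $\mu$-separation check for DMGs; the graph encoding and the breadth-first search over (node, arrived-with-head) states are our own choices, not from the paper. Because walks (not just paths) must be considered, the search runs over these states rather than over nodes.

```python
from collections import deque

def mu_separated(A, B, C, directed, bidirected):
    """True iff B is mu-separated from A given C (Definition 17).
    `directed` is a set of pairs (u, v) meaning u -> v, with self-loops
    (v, v) allowed; `bidirected` is a set of frozensets {u, v} meaning
    u <-> v."""
    # ancestors of C with respect to the directed edges, including C itself
    an_C = set(C)
    changed = True
    while changed:
        changed = False
        for (u, v) in directed:
            if v in an_C and u not in an_C:
                an_C.add(u)
                changed = True

    def moves(v):
        # yields (next node, head at v on this edge, head at next node)
        for (u, w) in directed:
            if u == v:
                yield (w, False, True)     # traverse v -> w forwards
            if w == v:
                yield (u, True, False)     # traverse u -> v backwards
        for e in bidirected:
            if v in e:
                others = e - {v}
                yield (next(iter(others)) if others else v, True, True)

    queue, seen = deque(), set()
    for a in A:
        if a in C:
            continue                       # the first node must avoid C
        for (w, _, head_w) in moves(a):
            if (w, head_w) not in seen:
                seen.add((w, head_w))
                queue.append((w, head_w))
    while queue:
        v, head_in = queue.popleft()
        if head_in and v in B:
            return False                   # found a mu-connecting walk
        for (w, head_out, head_w) in moves(v):
            collider = head_in and head_out
            if collider and v not in an_C:
                continue                   # colliders must be in an(C)
            if not collider and v in C:
                continue                   # noncolliders must avoid C
            if (w, head_w) not in seen:
                seen.add((w, head_w))
                queue.append((w, head_w))
    return True
```

Under faithfulness, `lambda a, b, C: mu_separated({a}, {b}, set(C), directed, bidirected)` can then serve as the local independence oracle for the screening sketches in the main text.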
Proposition 18. Let $\mathcal{D} = (V, E)$ be a DG, and let $O \subseteq V$. Consider $\mathcal{G} = m(\mathcal{D}, O)$. For $\alpha, \beta \in O$ it holds that $\alpha \in \mathrm{an}_{\mathcal{D}}(\beta)$ if and only if $\alpha \in \mathrm{an}_{D(\mathcal{G})}(\beta)$. Furthermore, the directed part of $\mathcal{G}$ equals the parent graph of $\mathcal{D}$ on nodes $O$, i.e. $D(\mathcal{G}) = P_O(\mathcal{D})$.

Proof.
Note first that $\alpha \in \mathrm{an}_{\mathcal{D}}(\beta)$ if and only if $\alpha \in \mathrm{an}_{\mathcal{G}}(\beta)$ [13]. Ancestry is only defined by the directed edges, and it follows that $\alpha \in \mathrm{an}_{\mathcal{G}}(\beta)$ if and only if $\alpha \in \mathrm{an}_{D(\mathcal{G})}(\beta)$. For the second statement, the definition of the latent projection gives that there is a directed edge from $\alpha$ to $\beta$ in $\mathcal{G}$ if and only if there is a directed path from $\alpha$ to $\beta$ in $\mathcal{D}$ such that no nonendpoint node is in $O$. By definition, this is the parent graph, $P_O(\mathcal{D})$.

In words, the above proposition says that if $\mathcal{G}$ is a marginalization (done by latent projection) of $\mathcal{D}$, then the ancestor relations of $\mathcal{D}$ and $D(\mathcal{G})$ are the same among the observed nodes. It also says that our learning target, the parent graph, is actually the directed part of the latent projection on the observed nodes. In the next subsection, we use this to describe what is actually identifiable from the induced independence model of a graph.
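As an illustration of Definition 1 and Proposition 18, here is a small sketch computing the parent graph $P_O(\mathcal{D})$ directly from a DG (edge-set encoding as in the sketches above); by the proposition, this equals the directed part $D(m(\mathcal{D}, O))$ of the latent projection.

```python
def parent_graph(obs, directed):
    """Parent graph P_O(D) of a DG D (Definition 1): for observed alpha,
    keep alpha -> beta iff D contains a directed path from alpha to beta
    whose intermediate nodes are all unobserved.  `directed` is a set of
    (u, v) pairs meaning u -> v; `obs` is the observed node set O."""
    out = set()
    for a in obs:
        stack = [w for (u, w) in directed if u == a]   # one step from a
        seen = set()
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            if v in obs:
                out.add((a, v))        # reached an observed node
            else:                      # continue only through latents
                stack.extend(w for (u, w) in directed if u == v)
    return out
```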
A.1 Maximal graphs and parent graphs

Under faithfulness of the local independence model and the causal graph, we know that the maximal DMG is a correct representation of the local independence structure in the sense that it encodes exactly the local independences that hold in the local independence model. From the maximal DMG, one can use results on equivalence classes of DMGs to obtain every other DMG which encodes the observed local independences [13] and from this graph one can find the parent graph as simply the directed part. However, it may require an infeasible number of tests to output such a maximal DMG. This is not surprising, seeing that the learning target encodes this complete information on local independences.

Assume that $\mathcal{D} = (V, E)$ is the underlying causal graph and that $\mathcal{G} = (O, F)$, $O \subseteq V$, is the marginalized graph over the observed variables, i.e. the latent projection of $\mathcal{D}$. In principle, we would like to output $P(\mathcal{D}) = D(\mathcal{G})$, the directed part of $\mathcal{G}$. However, no algorithm can in general output this graph by testing only local independences as Markov equivalent DMGs may not have the same parent graph. Within each Markov equivalence class of DMGs, there is a unique maximal graph. Let $\bar{\mathcal{G}}$ denote the maximal graph which is Markov equivalent to $\mathcal{G}$. The DG $D(\bar{\mathcal{G}})$ is a supergraph of $D(\mathcal{G})$ and we will say that a learning algorithm is complete if it is guaranteed to output $D(\bar{\mathcal{G}})$, as no algorithm testing local independence only can identify anything more than the equivalence class.

B Complete learning
The CS algorithm provides sound learning of the parent graph of a general DMG under the assumption of ancestral faithfulness. For a subclass of DMGs, the algorithm actually provides complete learning. It is of interest to find sufficient graphical conditions to ensure that the algorithm removes an edge $\alpha \to \beta$ which is not in the true parent graph. In this section, we will simply state and prove one such condition which can be understood as 'the true parent set is always found for unconfounded processes'. We let $\mathcal{D}$ denote the output of the CS algorithm.

Proposition 19. If $\alpha \not\to_{\mathcal{G}} \beta$ and there is no $\gamma \in V \setminus \{\beta\}$ such that $\gamma \leftrightarrow_{\mathcal{G}} \beta$, then $\alpha \not\to_{\mathcal{D}} \beta$.

Proof.
Let $\mathcal{D}_0, \mathcal{D}_1, \ldots, \mathcal{D}_N$ denote the DGs that are constructed when running the algorithm by sequentially removing edges, starting from the complete DG, $\mathcal{D}_0$. Consider a walk from $\alpha$ to $\beta$ in $\mathcal{G}$. It must be of the form $\alpha \sim \ldots \sim \gamma \to \beta$, $\gamma \ne \alpha$. Under ancestral faithfulness, the edge $\gamma \to \beta$ is in $\mathcal{D}$, thus $\gamma \in \mathrm{pa}_{\mathcal{D}_i}(\beta)$ for all $\mathcal{D}_i$ that occur during the algorithm, and therefore when $\langle \alpha, \beta \mid \mathrm{pa}_{\mathcal{D}_i}(\beta) \setminus \{\alpha\} \rangle$ is tested, the walk is closed. Any walk from $\alpha$ to $\beta$ is of this form, thus also closed, and we have that $\alpha \perp_\mu \beta \mid \mathrm{pa}_{\mathcal{D}_i}(\beta)$ and therefore $\langle \alpha, \beta \mid \mathrm{pa}_{\mathcal{D}_i}(\beta) \setminus \{\alpha\} \rangle \in \mathcal{I}$. The edge $\alpha \to_{\mathcal{D}_i} \beta$ is removed and thus absent in the output graph, $\mathcal{D}$.

C Ancestry propagation
We state Subalgorithm 4 here.

  input: a local independence oracle for $\mathcal{I}_O$ and a DG, $\mathcal{D} = (O, E)$
  output: a DG on nodes $O$
  initialize $E_r = \emptyset$ as the empty edge set;
  foreach $(\alpha, \beta, \gamma) \in V \times V \times V$ such that $\alpha, \beta, \gamma$ are all distinct do
    if $\alpha \sim_{\mathcal{D}} \beta$, $\beta \to_{\mathcal{D}} \gamma$, and $\alpha \not\to_{\mathcal{D}} \gamma$ then
      if $\langle \alpha, \gamma \mid \emptyset \rangle \in \mathcal{I}_O$ then update $E_r = E_r \cup \{\beta \to \gamma\}$;
  end
  update $\mathcal{D} = (V, E \setminus E_r)$;
  return $\mathcal{D}$

Subalgorithm 4: Ancestry propagation
Composing Subalgorithm 1, Subalgorithm 4, and Subalgorithm 2 is referred to as the causal screening, ancestry propagation (CSAP) algorithm. If we use Subalgorithm 3 instead of Subalgorithm 4, we call it the CSAPC algorithm (C for cheap as this does not entail any additional independence tests compared to CS).
D Application and simulations
In this section, we provide some additional details about the C. elegans neuronal network and the simulations.
D.1 C. elegans neuronal network
For each connection between two neurons, a different number of synapses are present (ranging from 1 to 37). We only consider connections with more than 4 synapses when we define the true underlying network. When sampling the subnetworks, highly connected neurons were sampled with higher probability to avoid a fully connected subnetwork.
D.2 Comparison of algorithms
As noted in the main paper, the dFCI algorithm solves a strictly harder problem. By using the additional graph theory in the supplementary material, we can understand the output of the dFCI algorithm as a supergraph of the maximal DMG, $\bar{\mathcal{G}}$. There is also a version of the dFCI which is guaranteed to output not only a supergraph of $\bar{\mathcal{G}}$, but the graph $\bar{\mathcal{G}}$ itself. Clearly, from the output of the dFCI algorithm, one can simply take the directed part of the output and this is a supergraph of the underlying parent graph.

E Proofs
In this section, we provide the proofs of the results in the main paper.
Proof of Proposition 5.
Let $\mathcal{D}$ denote the causal graph. Assume first that $\alpha \not\to_{\mathcal{D}} \beta$. Then $g^{\beta\alpha}$ is identically zero over the observation interval, and it follows directly from the functional form of $\lambda^\beta_t$ that $\alpha \not\to \beta \mid V \setminus \{\alpha\}$. This shows that the local independence model satisfies the pairwise Markov property with respect to $\mathcal{D}$.

If instead $g^{\beta\alpha} \ne 0$ over $J$, there exists $r \in J$ such that $g^{\beta\alpha}(r) \ne 0$. From continuity of $g^{\beta\alpha}$ there exists a compact interval of positive measure, $I \subseteq J$, such that $\inf_{s \in I} g^{\beta\alpha}(s) \ge g^{\beta\alpha}_{\min}$ and $g^{\beta\alpha}_{\min} > 0$.
Let $i_1$ and $i_2$ denote the endpoints of this interval, $i_1 < i_2$. We consider now the events

$$D_k = \left( N^\alpha_{T - i_1} - N^\alpha_{T - i_2} = k,\; N^\gamma_T = 0 \text{ for all } \gamma \in V \setminus \{\alpha\} \right) \quad (1)$$

for $k \in \mathbb{N}$. Then under Assumption 4, for all $k$,

$$\lambda^\beta_T 1_{D_k} \ge 1_{D_k} \int_{T - I} g^{\beta\alpha}(T - s) \, \mathrm{d}N^\alpha_s \ge g^{\beta\alpha}_{\min} \cdot k \cdot 1_{D_k},$$

where $T - I = [T - i_2, T - i_1]$. Assume for contradiction that $\beta$ is locally independent of $\alpha$ given $V \setminus \{\alpha\}$. Then $\lambda^\beta_T = \mathrm{E}(\lambda^\beta_T \mid \mathcal{F}^V_T) = \mathrm{E}(\lambda^\beta_T \mid \mathcal{F}^{V \setminus \{\alpha\}}_T)$ is constant on $\cup_k D_k$ and furthermore $P(D_k) > 0$ for all $k$. However, this contradicts the above inequality when $k \to \infty$.

Proof of Proposition 11.
Let $\mathcal{D}'$ denote the DG which is output by the algorithm. We should then show that $P(\mathcal{D}) \subseteq \mathcal{D}'$. Assume that $\alpha \to_{P(\mathcal{D})} \beta$. In this case, there is a directed path from $\alpha$ to $\beta$ in $\mathcal{D}$ such that no nonendpoint node on this directed path is in $O$ (the observed coordinates). Therefore for any $C \subseteq O \setminus \{\alpha\}$ there exists a directed $\mu$-connecting walk from $\alpha$ to $\beta$ in $\mathcal{D}$ and by ancestral faithfulness it follows that $\langle \alpha, \beta \mid C \rangle \notin \mathcal{I}$. The algorithm starts from the complete directed graph, and the above means that the directed edge from $\alpha$ to $\beta$ will not be removed.

Proof of Corollary 12.
Consider some directed path from $\alpha$ to $\beta$ in $\mathcal{D}$ on which no node is in $C$. Then there is also a directed path from $\alpha$ to $\beta$ on which no node is in $C$ in the graph $P(\mathcal{D})$, and therefore also in the output graph using Proposition 11.

Proof of Proposition 14.
Assume that there is a $\mu$-connecting walk from $\alpha$ to $\beta$ given $\{\beta\}$. If this walk has no colliders, then it is a directed trek, or can be reduced to one. Otherwise, assume that $\gamma$ is the collider which is the closest to the endpoint $\alpha$. Then $\gamma \in \mathrm{an}(\beta)$, and composing the subwalk from $\alpha$ to $\gamma$ with the directed path from $\gamma$ to $\beta$ gives a directed trek. On the other hand, assume there is a directed trek from $\alpha$ to $\beta$. This is $\mu$-connecting from $\alpha$ to $\beta$ given $\{\beta\}$.

Proof of Proposition 16.
Assume $\beta \to_{P(\mathcal{D})} \gamma$. Subalgorithms 1 and 2 are both simple screening algorithms, and they will not remove this edge. Assume for contradiction that $\beta \to \gamma$ is removed by Subalgorithm 3. Then there must exist $\alpha \ne \beta, \gamma$ and a directed trek from $\alpha$ to $\beta$ in $\mathcal{D}$. On this directed trek, $\gamma$ does not occur as this would imply a directed trek either from $\alpha$ to $\gamma$ or from $\beta$ to $\alpha$, thus implying $\alpha \to_{\mathcal{D}} \gamma$ or $\beta \to_{\mathcal{D}} \alpha$, respectively ($\mathcal{D}$ is the output graph). As $\gamma$ does not occur on the trek, composing this trek with the edge $\beta \to \gamma$ would give a directed trek from $\alpha$ to $\gamma$. By faithfulness, $\langle \alpha, \gamma \mid \gamma \rangle \notin \mathcal{I}$, and this is a contradiction as $\alpha \to \gamma$ would not have been removed during Subalgorithm 1.

We consider instead CSAP. Assume for contradiction that $\beta \to \gamma$ is removed during Subalgorithm 4. There exists in $\mathcal{D}$ either a directed trek from $\alpha$ to $\beta$ or a directed trek from $\beta$ to $\alpha$. If $\gamma$ is on this trek, then $\gamma$ is not $\mu$-separated from $\alpha$ given the empty set (recall that there are loops at all nodes, therefore also at $\gamma$), and using faithfulness we conclude that $\gamma$ is not on this trek. Composing it with the edge $\beta \to \gamma$ would give a directed trek from $\alpha$ to $\gamma$ and using faithfulness we obtain a contradiction.

References

[1] Odd O. Aalen. Dynamic modelling and causality. Scandinavian Actuarial Journal, pages 177–190, 1987.

[2] Diego Colombo, Marloes H. Maathuis, Markus Kalisch, and Thomas S. Richardson. Learning high-dimensional directed acyclic graphs with latent and selection variables.
The Annals of Statistics, 40(1):294–321, 2012.

[3] Vanessa Didelez. Graphical Models for Event History Analysis based on Local Independence. PhD thesis, Universität Dortmund, 2000.

[4] Vanessa Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society, Series B, 70(1):245–264, 2008.

[5] Michael Eichler. Causal inference with multiple time series: principles and problems. Philosophical Transactions of the Royal Society, 371(1997):1–17, 2013.

[6] Michael Eichler and Vanessa Didelez. On Granger causality and the effect of interventions in time series. Lifetime Data Analysis, 16(1):3–32, 2010.

[7] Michael Eichler, Rainer Dahlhaus, and Johannes Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38:225–242, 2017.

[8] Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. The Annals of Statistics, 40(3):1682–1713, 2012.

[9] Patrick J. Laub, Thomas Taimre, and Philip K. Pollett. Hawkes processes. 2015. URL https://arxiv.org/pdf/1507.02822.pdf.

[10] Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright. Handbook of graphical models. Chapman & Hall/CRC handbooks of modern statistical methods, 2019.

[11] Sara Magliacane, Tom Claassen, and Joris M. Mooij. Ancestral causal inference. In Proceedings of the 29th Conference on Neural Information Processing Systems (NIPS 2016), 2016.

[12] Christopher Meek. Toward learning graphical and causal process models. In CI'14 Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction, 2014.

[13] Søren Wengel Mogensen and Niels Richard Hansen. Markov equivalence of marginalized local independence graphs. 2019. To appear in the Annals of Statistics.

[14] Søren Wengel Mogensen, Daniel Malinsky, and Niels Richard Hansen. Causal learning for partially observed stochastic dynamical systems. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, 2018.

[15] Judea Pearl. Causality. Cambridge University Press, 2009.

[16] Jonas Christopher Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

[17] Kjetil Røysland. Counterfactual analyses with graphical models based on local independence. The Annals of Statistics, 40(4):2162–2194, 2012.

[18] Peter Spirtes. An anytime algorithm for causal inference. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics, AISTATS 2001, 2001.

[19] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. MIT Press, 2000.

[20] Lav R. Varshney, Beth L. Chen, Eric Paniagua, David H. Hall, and Dmitri B. Chklovskii. Structural properties of the Caenorhabditis elegans neuronal network. PLoS Computational Biology, 7(2), 2011.

[21] Thomas Verma and Judea Pearl. Equivalence and synthesis of causal models. Technical Report R-150, University of California, Los Angeles, 1991.

[22] Zhalama, Jiji Zhang, Frederick Eberhardt, and Wolfgang Mayer. SAT-based causal discovery under weaker assumptions. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.

[23] Zhalama, Jiji Zhang, and Wolfgang Mayer. Weakening faithfulness: some heuristic causal discovery algorithms. International Journal of Data Science and Analytics, 3:93–104, 2017.

[24] Jiji Zhang. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172:1873–1896, 2008.

[25] Jiji Zhang and Peter Spirtes. Detection of unfaithfulness and robust causal inference.