Consistency Guarantees for Greedy Permutation-Based Causal Inference Algorithms
By Liam Solus∗, Yuhao Wang†, Lenka Matejovicova‡, and Caroline Uhler†

KTH Royal Institute of Technology∗, Massachusetts Institute of Technology†, and Institute of Science and Technology Austria‡

Bayesian networks, or directed acyclic graph (DAG) models, are widely used to represent complex causal systems. Since the basic task of learning a Bayesian network from data is NP-hard, a standard approach is greedy search over the space of DAGs or of Markov equivalence classes of DAGs. Since the space of DAGs on p nodes and the associated space of Markov equivalence classes are both much larger than the space of permutations, it is desirable to consider permutation-based searches. We here provide the first consistency guarantees, both uniform and high-dimensional, of a permutation-based greedy search. Geometrically, this search corresponds to a simplex-type algorithm on a sub-polytope of the permutohedron, the DAG associahedron. Every vertex in this polytope is associated with a DAG, and hence with a collection of permutations that are consistent with the DAG ordering. A walk is performed on the edges of the polytope maximizing the sparsity of the associated DAGs. We show based on simulations that this permutation search is competitive with standard approaches.

MSC 2010 subject classifications: Primary 62F15; secondary 05C90.
Keywords and phrases: causal inference; Bayesian network; DAG; Gaussian graphical model; greedy search; pointwise and high-dimensional consistency; faithfulness; generalized permutohedron; DAG associahedron.
1. Introduction.
Bayesian networks, or graphical models based on directed acyclic graphs (DAGs), are widely used to model complex causal systems arising in a variety of research areas, including computational biology, epidemiology, sociology, and environmental management [1, 8, 21, 27, 29]. Given a DAG G := ([p], A) with node set [p] := {1, 2, ..., p} and arrow set A, the DAG model associates to each node i ∈ [p] of G a random variable X_i. By the Markov property, the collection of non-arrows of G encodes a set of conditional independence (CI) relations X_i ⊥⊥ X_{Nd(i)\Pa(i)} | X_{Pa(i)}, where Nd(i) and Pa(i) respectively denote the nondescendants and parents of the node i in G. A joint probability distribution P on the nodes [p] is said to satisfy the Markov assumption (a.k.a. be Markov) with respect to G if it entails these CI relations. A fundamental problem of causal inference is the following: Suppose we observe data drawn from a probability distribution P that is Markov with respect to a DAG G∗. From these data we infer a collection of CI relations C. Our goal is to recover the unknown DAG G∗ from the CI relations C.

Unfortunately, this problem is not well-defined, since multiple DAGs can encode the same set of CI relations. Any two such DAGs are termed Markov equivalent, and they are said to belong to the same
Markov equivalence class. Thus, our goal is to identify the Markov equivalence class M(G∗) of G∗. The DAG model for G∗ is said to be identifiable if the Markov equivalence class M(G∗) can be uniquely recovered from the set of CI relations C.

The Markov assumption alone is not sufficient to guarantee identifiability of a DAG model, and so additional identifiability assumptions have been studied, the most prominent being the faithfulness assumption [29]. Two popular algorithms for causal inference are the PC algorithm [29] and Greedy Equivalence Search (GES) [6, 17]. The PC algorithm treats causal search as a constraint satisfaction problem, with the constraints being CI relations. GES is a score-based algorithm that searches greedily over all equivalence classes of DAGs and maximizes a score, such as the Bayesian Information Criterion (BIC). Both algorithms are known to be consistent (i.e., they identify the correct Markov equivalence class with infinite sample size) under the faithfulness assumption [6, 29]. In a study conducted in parallel to [6], Castelo and Kočka also developed a DAG model learning algorithm, admitting both MCMC and greedy-search versions, that also exhibits consistency under faithfulness [4]. Unfortunately, the probability of an "almost violation" of the faithfulness assumption is high, making this a restrictive assumption for causal inference [36]. By sacrificing computation time, the consistency guarantees can be improved: the sparsest permutation (SP) algorithm, which associates to each permutation π of [p] a DAG G_π and returns the sparsest such DAG, is consistent under strictly weaker conditions than faithfulness [26]. However, the SP algorithm must search over all permutations.

A natural approach to overcome this computational bottleneck is to perform a greedy search in the space of permutations. In [3], Bouckaert presents an ordering-based search algorithm that uses arrow reversals and deletions to produce a sparse DAG entailing a collection of observed CI relations. The algorithm produces an optimal DAG in the sense that no more arrows can be deleted to yield a DAG with respect to which the observed CI relations are Markov. In later studies, [28] and [14] present DAG model learning algorithms that first learn an optimal ordering of the nodes and then recover a DAG that has this ordering as a linear extension of its induced partial order on the node set. In [28], the authors use CI-based tests to identify an optimal ordering, and then use a greedy heuristic known as the K2 algorithm [7] to identify the DAG structure. In [14], the authors use genetic algorithm techniques paralleling those implemented for the traveling salesman problem to recover an optimal ordering. In both studies, optimality of the ordering is measured in terms of the sparsity of the DAGs produced by the K2 algorithm. By Occam's razor, or the principle of parsimony, it is natural to search for an ordering of the nodes that identifies the sparsest DAG entailing the observed CI relations. However, the algorithms of [3, 7, 14, 28] rely on heuristic approaches to sparse DAG recovery, and consequently they do not consistently recover the underlying DAG.

Recently, a permutation-based greedy search was considered by Teyssier and Koller [31], who showed via simulations that such an approach compares favorably to greedy search in the space of DAGs in terms of computation time (due to the reduced search space) while being comparable in performance. However, they also did not provide any theoretical consistency guarantees for their permutation-based search. Note that considering the space of Markov equivalence classes, as is the case for GES, instead of the space of all DAGs is not believed to significantly reduce the search space: from computations up to ten nodes, the ratio of Markov equivalence classes to all possible DAGs seems to converge to around 0.25 [11]. While on 10 nodes there are about 10^18 Markov equivalence classes (and about 4 times as many DAGs), there are only 10!
≈ 3.6 × 10^6 permutations.

The permutations of [p] form the vertices of a convex polytope known as the permutohedron. In a recent paper [18], the authors constructed from a set of CI relations a sub-polytope of the permutohedron, the DAG associahedron, where each vertex is associated to a DAG G_π. A natural approach is therefore to perform a greedy SP algorithm, i.e., a simplex-type algorithm, on this reduced search space with graph sparsity as the score function. The greedy SP algorithm searches over the DAGs G_π using a subset of the moves considered by Teyssier and Koller [31] that excludes moves known not to improve this score function. It follows that greedy SP enjoys at least the same degree of computational efficiency as the algorithm in [31], and is therefore more efficient than a greedy search over the space of DAGs. In this paper, we analyze the greedy SP algorithm, give consistency guarantees, and assess its performance on simulated data. In particular, we provide the first consistency guarantees for permutation-based DAG model learning algorithms.

The paper is structured as follows: In Section 2, we recall the basics of DAG models and various identifiability assumptions. In Section 3, we introduce DAG associahedra and the greedy SP algorithm. In Section 4, we prove one of our main results, namely pointwise consistency of the greedy SP algorithm under the faithfulness assumption, thereby providing the first consistency guarantees for a greedy permutation-based search. The proof techniques used also allow us to show pointwise consistency under the faithfulness assumption for an ordering-based search closely related to that of Teyssier and Koller [31], namely using the BIC score instead of graph sparsity in the greedy SP algorithm. As a consequence, these results show that greedy search on the space of permutations is pointwise consistent under the same conditions (i.e., faithfulness) as a greedy search on the space of Markov equivalence classes, while allowing a drastic reduction of the search space. The greedy SP algorithm can be interpreted geometrically, as a walk on the DAG associahedron, or combinatorially, as a walk between different DAGs associated to the vertices of the DAG associahedron. We prove that the identifiability assumption for the combinatorial approach is strictly stronger than for the geometric approach and strictly weaker than the faithfulness assumption. While the greedy SP algorithm is a non-parametric approach, in Section 5 we concentrate on the Gaussian setting. We propose a strategy for efficiently finding a "good" starting permutation based on the minimum degree algorithm, a heuristic for finding sparse Cholesky decompositions. We then prove uniform consistency of the greedy SP algorithm for fixed p, and in the high-dimensional setting under a more restrictive faithfulness condition known as strong faithfulness. Since the greedy SP algorithm is provably consistent under strictly weaker conditions than faithfulness, a common identifiability assumption for the PC algorithm and GES, we expect that greedy SP can recover simulated DAG models at a higher rate than these algorithms. In Section 6, we present simulations in support of these theoretical findings, comparing the rate of recovery of M(G∗) for the PC algorithm, GES, and the greedy SP algorithm.
2. Background.
Given a DAG G := ([p], A) with node set [p] := {1, 2, ..., p} and arrow set A, we associate to the nodes of G a random vector (X_1, ..., X_p) with a probability distribution P. An arrow in A is an ordered pair of nodes (i, j), which we will often denote by i → j. A directed path in G from node i to node j is a sequence of directed edges in G of the form i → i_1 → i_2 → ··· → j. A path between i and j is a sequence of arrows connecting the two nodes without regard to direction. The parents of a node i in G form the collection Pa_G(i) := {k ∈ [p] : k → i ∈ A}, and the ancestors of i, denoted An_G(i), form the collection of all nodes k ∈ [p] for which there exists a directed path from k to i in G. We do not include i in An_G(i). The descendants of i, denoted De_G(i), form the set of all nodes k ∈ [p] for which there is a directed path from i to k in G, and the nondescendants of i form the collection Nd_G(i) := [p] \ (De_G(i) ∪ {i}). When the DAG G is understood from context, we write Pa(i), An(i), De(i), and Nd(i) for the parents, ancestors, descendants, and nondescendants of i in G, respectively. The analogous definitions and notation will also be used for any set S ⊂ [p]. If two nodes are connected by an arrow in G, we say they are adjacent. A triple of nodes (i, j, k) is called unshielded if i and j are adjacent, k and j are adjacent, but i and k are not adjacent. An unshielded triple (i, j, k) forms an immorality if it is of the form i → j ← k. In any triple (shielded or not) with arrows i → j ← k, the node j is called a collider. Given disjoint subsets A, B, C ⊂ [p] with A, B ≠ ∅, we say that A is d-connected to B given C if there exist nodes i ∈ A and j ∈ B for which there is a path between i and j such that every collider on the path is in An(C) ∪ C and no non-collider on the path is in C. If no such path exists, we say A and B are d-separated given C.

A fundamental result about DAG models is that the complete set of CI relations implied by the Markov assumption for G is given by the d-separation relations in G [15, Section 3.2.2]; i.e., a probability distribution P satisfies the Markov assumption with respect to G if and only if X_A ⊥⊥ X_B | X_C in P whenever A and B are d-separated in G given C. The faithfulness assumption asserts that, conversely, all CI relations entailed by P are given by d-separations in G [29].

Assumption 1 (Faithfulness). A probability distribution P satisfies the faithfulness assumption with respect to a DAG G = ([p], A) if, for any pair of nodes i, j ∈ [p] and any subset S ⊂ [p] \ {i, j}, we have

X_i ⊥⊥ X_j | X_S ⟺ i is d-separated from j given S in G.

All DAG model learning algorithms assume the Markov assumption, i.e., the forward direction of the faithfulness assumption, and many of the classical algorithms also assume the converse. Unfortunately, the faithfulness assumption is very sensitive to hypothesis-testing errors when inferring CI statements from data, and almost-violations of faithfulness have been shown to be frequent [36]. A number of relaxations of the faithfulness assumption have therefore been suggested [25].
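Since d-separation drives everything that follows, we record a small computational sketch. The following Python code (ours, for illustration; the helper names are not from the paper) tests d-separation via the standard moral-graph criterion: A and B are d-separated given C if and only if C separates A from B in the moralization of the subgraph induced by the ancestral closure of A ∪ B ∪ C.

```python
def ancestors(dag, nodes):
    """All ancestors of `nodes` in `dag` (dict: node -> set of parents), nodes included."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(dag[v])
    return seen

def d_separated(dag, A, B, C):
    """Moral-graph criterion: restrict to the ancestral set of A | B | C,
    moralize (marry co-parents, drop directions), and check that C
    separates A from B in the resulting undirected graph."""
    anc = ancestors(dag, set(A) | set(B) | set(C))
    adj = {v: set() for v in anc}
    for v in anc:
        pars = dag[v] & anc
        for u in pars:                      # undirected versions of the arrows
            adj[v].add(u)
            adj[u].add(v)
        for u in pars:                      # marry the co-parents of v
            for w in pars:
                if u != w:
                    adj[u].add(w)
    start = set(A) - set(C)
    seen, stack = set(start), list(start)
    while stack:                            # graph search that never enters C
        v = stack.pop()
        for w in adj[v]:
            if w not in seen and w not in set(C):
                seen.add(w)
                stack.append(w)
    return not (seen & set(B))

# The collider 1 -> 3 <- 2, encoded by parent sets:
dag = {1: set(), 2: set(), 3: {1, 2}}
print(d_separated(dag, {1}, {2}, set()))  # True: marginally d-separated
print(d_separated(dag, {1}, {2}, {3}))    # False: conditioning on the collider d-connects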
Assumption 2 (Restricted faithfulness). A probability distribution P satisfies the restricted faithfulness assumption with respect to a DAG G = ([p], A) if it satisfies the following two conditions:
1. (Adjacency faithfulness) For all arrows i → j ∈ A, we have X_i ⊥̸⊥ X_j | X_S for all subsets S ⊂ [p] \ {i, j}.
2. (Orientation faithfulness) For all unshielded triples (i, j, k) and all subsets S ⊂ [p] \ {i, k} such that i is d-connected to k given S, we have X_i ⊥̸⊥ X_k | X_S.

A classic result states that two DAGs are Markov equivalent if and only if they have the same set of adjacencies and the same set of immoralities [35]. The adjacency faithfulness assumption ensures that we can recover the correct set of adjacencies, while orientation faithfulness guarantees that we correctly orient all arrows in unshielded colliders. A number of attempts have been made to modify constraint-based algorithms to adjust for weaker conditions than faithfulness (e.g., [16, 25, 38, 39]). However, these relaxations have ultimately led to weaker claims that do not guarantee discovery of M(G∗) (see, e.g., [16, 30]).

By combining constraint-based and score-based approaches, and by sacrificing computation time, it was possible to overcome this limitation. The sparsest permutation (SP) algorithm guarantees discovery of M(G∗) under strictly weaker assumptions than faithfulness [26]. Given a set of CI relations C on [p], every permutation π ∈ S_p is associated to a DAG G_π as follows:

π_i → π_j ∈ A(G_π) ⟺ i < j and π_i ⊥̸⊥ π_j | {π_1, ..., π_{max(i,j)}} \ {π_i, π_j}.

A DAG G_π is known as a minimal I-MAP (independence map) with respect to C, since any DAG G_π satisfies the Markov assumption and the minimality assumption with respect to C; i.e., any CI relation encoded by a d-separation in G_π is in C, and any proper subDAG of G_π encodes a CI relation that is not in C [20]. We will also refer to G_π as a permutation DAG. It is natural to consider a score-based approach to Bayesian network model selection with

score(C; G) := −|G| if G is Markov with respect to C, and −∞ otherwise,

where |G| denotes the number of arrows in G. The SP algorithm searches over all DAGs G_π for π ∈ S_p and returns a DAG that maximizes score(C; G). In [26], it was shown that the SP algorithm is consistent under a condition that is strictly weaker than faithfulness, namely the SMR (sparsest Markov representation) assumption.

Assumption 3 (SMR). A probability distribution P satisfies the SMR assumption with respect to a DAG G if it satisfies the Markov assumption with respect to G and |G| < |H| for every DAG H such that P satisfies the Markov assumption with respect to H and H ∉ M(G).

The downside of the SP algorithm is that it requires a search over all p! permutations of the node set [p]. In the following section, we discuss two natural approaches to reducing the run time: reducing the size of the search space to appropriately defined equivalence classes of the DAGs G_π, and performing a greedy search through this reduced search space.
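The construction of the minimal I-MAP G_π and the brute-force SP search are easy to state in code. The following sketch is our own illustration (not from the paper); `ci` is an assumed CI oracle returning True exactly when the queried independence holds.

```python
from itertools import permutations

def minimal_imap(perm, ci):
    """Minimal I-MAP G_perm: for positions i < j, place the arrow
    perm[i] -> perm[j] unless perm[i] and perm[j] are independent given
    {perm[0], ..., perm[j]} minus the pair, mirroring the display above."""
    arrows = set()
    for j in range(len(perm)):
        for i in range(j):
            cond = frozenset(perm[:j + 1]) - {perm[i], perm[j]}
            if not ci(perm[i], perm[j], cond):
                arrows.add((perm[i], perm[j]))
    return arrows

def sparsest_permutation(nodes, ci):
    """Brute-force SP algorithm: scan all p! orderings for a sparsest minimal I-MAP."""
    return min((minimal_imap(pi, ci) for pi in permutations(nodes)), key=len)

# With the d_separated sketch above as an oracle:
#   ci = lambda a, b, S: d_separated(dag, {a}, {b}, S)
#   sparsest_permutation([1, 2, 3], ci) recovers a sparsest DAG for `dag`.
```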
3. Greedy SP algorithm.
The SP algorithm has a natural interpretation in the setting of discrete geometry. The permutohedron on p elements, denoted A_p, can be defined as the convex hull of all vectors obtained by permuting the coordinates of (1, 2, 3, ..., p)^T. The SP algorithm can be thought of as searching over the vertices of A_p, since it considers the DAGs G_π for each π ∈ S_p. Hence, a natural first step to reduce the size of the search space is to contract together all vertices of A_p that correspond to the same DAG G_π. This can be done via the following construction.

Two vertices of the permutohedron A_p are connected by an edge if and only if the permutations indexing the vertices differ by an adjacent transposition. We can associate a CI relation to each adjacent transposition, and hence to each edge of A_p, namely the relation π_i ⊥⊥ π_{i+1} | {π_1, ..., π_{i−1}} to the edge

π_1 ··· π_i π_{i+1} ··· π_p —— π_1 ··· π_{i+1} π_i ··· π_p.

In [18, Section 4], it is shown that, given a set of CI relations C from a joint distribution P on [p], contracting all edges in A_p corresponding to CI relations in C results in a convex polytope, which we denote A_p(C). Note that A_p(∅) = A_p. Furthermore, if the CI relations in C form a graphoid, i.e., they satisfy the semigraphoid properties
(SG1) if i ⊥⊥ j | S then j ⊥⊥ i | S,
(SG2) if i ⊥⊥ j | S and i ⊥⊥ k | {j} ∪ S, then i ⊥⊥ k | S and i ⊥⊥ j | {k} ∪ S,
and the intersection property
(INT) if i ⊥⊥ j | {k} ∪ S and i ⊥⊥ k | {j} ∪ S, then i ⊥⊥ j | S and i ⊥⊥ k | S,
then, as shown in [18, Theorem 7.1], contracting the edges of A_p that correspond to CI relations in C is the same as identifying the vertices of A_p that correspond to the same DAG. The semigraphoid properties hold for any distribution, whereas the intersection property holds, for example, for strictly positive distributions; necessary and sufficient conditions for the intersection property were given recently in [22]. Another example of a graphoid is the set of CI relations C corresponding to all d-separations in a DAG; in that case A_p(C) is also called a DAG associahedron [18]. The polytope A_p(C), where each vertex corresponds to a different DAG, represents a natural reduced search space for the SP algorithm.
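The contraction rule can be checked computationally. The following sketch (ours, reusing the d_separated and minimal_imap helpers from the Section 2 sketches) verifies on the path DAG 1 → 2 → 3 that an edge of the permutohedron is contracted — i.e., the two endpoint permutations index the same minimal I-MAP — exactly when its CI label holds in C.

```python
dag = {1: set(), 2: {1}, 3: {2}}                 # the path 1 -> 2 -> 3
ci = lambda a, b, S: d_separated(dag, {a}, {b}, set(S))

# Swapping positions 2 and 3 of (2,1,3) gives (2,3,1); this edge is labeled
# by the relation 1 ⟂ 3 | {2}, which holds in the DAG, so both permutations
# index the same minimal I-MAP and the corresponding vertices are contracted.
assert minimal_imap((2, 1, 3), ci) == minimal_imap((2, 3, 1), ci)

# Swapping the first two coordinates of (1,3,2) gives (3,1,2); the label
# 1 ⟂ 3 | {} fails, and the minimal I-MAPs differ (1 -> 3 versus 3 -> 1).
assert minimal_imap((1, 3, 2), ci) != minimal_imap((3, 1, 2), ci)
```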
Algorithm 1: Edge SP
Input: A set of CI relations C on node set [p] and a starting permutation π ∈ S_p.
Output: A minimal I-MAP G.
1. Compute the polytope A_p(C) and set G := G_π.
2. Using a depth-first search approach with root G along the edges of A_p(C), search for a minimal I-MAP G_τ with |G| > |G_τ|. If no such G_τ exists, return G; else set G := G_τ and repeat this step.

To further reduce computation time, we next discuss a greedy search in this reduced search space. Through a closer examination of the geometry and combinatorics of the polytope A_p(C), we arrive at two greedy versions of the SP algorithm: one based on the geometry of A_p(C), walking along edges of the polytope, and another based on the combinatorial description of the vertices, walking from DAG to DAG. These two greedy versions of the SP algorithm are described in Algorithm 1 and Algorithm 2.

Both algorithms take as input a set of CI relations C and an initial permutation π ∈ S_p. Beginning at the vertex G_π of A_p(C), Algorithm 1 walks along an edge of A_p(C) to any vertex whose corresponding DAG has at most as many arrows as G_π. Once it can no longer discover a sparser DAG, the algorithm returns the last DAG (and its corresponding Markov equivalence class) it visited. Since this algorithm is based on walking along edges of A_p(C), we call this greedy version the edge SP algorithm. The corresponding identifiability assumption can be stated as follows.

Assumption 4 (ESP). A probability distribution P satisfies the edge SP (ESP) assumption with respect to a DAG G if it satisfies the Markov assumption with respect to G and if Algorithm 1 returns only DAGs in M(G).

Next, we describe edges in the polytope A_p(C) in terms of the neighboring DAGs G_π and G_τ. In the following, we say that an arrow i → j in a DAG G is covered if Pa(i) = Pa(j) \ {i}. An arrow i → j is trivially covered if Pa(i) = Pa(j) \ {i} = ∅. In addition, we call a sequence of DAGs (G_1, G_2, ..., G_N) a weakly decreasing sequence if |G_i| ≥ |G_{i+1}| for all i ∈ [N − 1]. If, in addition, each G_{i+1} is produced from G_i by reversing a covered arrow in G_i, then we will refer to this sequence as a weakly decreasing sequence of covered arrow reversals. Let G_π and G_τ denote two adjacent vertices in a DAG associahedron A_p(C), and let Ḡ denote the skeleton of G, i.e., the undirected graph obtained by undirecting all arrows in G. Then, as noted in [18, Theorem 8.3], G_π and G_τ differ by a covered arrow reversal if and only if Ḡ_π ⊆ Ḡ_τ or Ḡ_τ ⊆ Ḡ_π. In some instances, this fact gives a combinatorial interpretation of all edges of A_p(C). However, this need not always be true, as is demonstrated in Example 5.

Fig 1. An edge of a DAG associahedron that does not correspond to a covered edge flip. The DAG associahedron A_p(C) is constructed for the CI relations implied by the d-separation statements for G_π∗ with π∗ = 15234. The DAGs G_π and G_τ with π = 15432 and τ = 15342 correspond to adjacent vertices in A_p(C), connected by the edge labeled by the transposition of 3 and 4. The arrow between nodes 3 and 4 is not covered in either DAG G_π or G_τ.

Example 5. An example of a DAG associahedron containing an edge that does not correspond to a covered arrow reversal in either DAG labeling its endpoints can be constructed as follows: Let G_π∗ denote the left-most DAG depicted in Figure 1, and let C denote the CI relations implied by the d-separation statements for G_π∗.
Then for the permutations π = 15432 and τ = 15342, the DAGs G_π and G_τ label a pair of adjacent vertices of A_p(C), since π and τ differ by the transposition of 3 and 4. This adjacent transposition corresponds to a reversal of the arrow between nodes 3 and 4 in G_π and G_τ; however, this arrow is not covered in either minimal I-MAP. We further note that this example shows that not all edges of A_p(C) can be described by covered arrow reversals, even when C is faithful to the sparsest minimal I-MAP G_π∗.

The combinatorial description of some edges of A_p(C) via covered arrow reversals motivates Algorithm 2, a combinatorial greedy SP algorithm. Since this algorithm is based on flipping covered arrows, we call it the triangle SP algorithm. Similar to Algorithm 1, we specify an identifiability assumption in relation to Algorithm 2.

Assumption 6 (TSP). A probability distribution P satisfies the triangle SP (TSP) assumption with respect to a DAG G if it satisfies the Markov assumption with respect to G and if Algorithm 2 returns only DAGs in M(G).

Algorithm 2: Triangle SP
Input: A set of CI relations C on node set [p] and a starting permutation π ∈ S_p.
Output: A minimal I-MAP G.
1. Set G := G_π.
2. Using a depth-first search approach with root G, search for a minimal I-MAP G_τ with |G| > |G_τ| that is connected to G by a weakly decreasing sequence of covered arrow reversals. If no such G_τ exists, return G; else set G := G_τ and repeat this step.

It is straightforward to verify that every covered arrow reversal in a minimal I-MAP G_π with respect to C corresponds to some edge of the DAG associahedron A_p(C). Consequently, if a probability distribution satisfies the TSP assumption, then it also satisfies the ESP assumption. In Section 4, we prove pointwise consistency of Algorithms 1 and 2 under the faithfulness assumption, and we also study the relationships between the faithfulness, restricted faithfulness, SMR, ESP, and TSP assumptions.
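To make the covered-arrow moves concrete, the following Python sketch (ours, reusing minimal_imap from the sketch in Section 2) implements the covered-arrow test and a depth-limited triangle SP search. It relies on the fact, used throughout this section, that in the oracle setting for a graphoid C a covered arrow reversal i → j can be realized by transposing i and j in the ordering; the depth parameter anticipates Algorithm 5 below.

```python
def parents(arrows, v):
    """Parent set of v in a DAG given as a set of (tail, head) arrows."""
    return {a for (a, b) in arrows if b == v}

def covered_arrows(arrows):
    """Arrows i -> j with Pa(i) = Pa(j) \\ {i}: exactly the reversals that
    stay inside the Markov equivalence class of the DAG."""
    return [(i, j) for (i, j) in arrows
            if parents(arrows, i) == parents(arrows, j) - {i}]

def triangle_sp(perm, ci, depth=4):
    """Depth-limited triangle SP sketch: follow weakly decreasing covered-arrow
    reversals (transpositions of the two endpoints in the ordering) and restart
    whenever a strictly sparser minimal I-MAP is found."""
    perm = tuple(perm)
    g = minimal_imap(perm, ci)
    improved = True
    while improved:
        improved = False
        stack, visited = [(perm, g, 0)], {perm}
        while stack:
            q, h, d = stack.pop()
            if len(h) < len(g):
                perm, g, improved = q, h, True   # sparser I-MAP: restart the DFS here
                break
            if d == depth:
                continue
            for (i, j) in covered_arrows(h):
                r = tuple(j if x == i else i if x == j else x for x in q)
                if r not in visited:
                    h2 = minimal_imap(r, ci)
                    if len(h2) <= len(h):        # only weakly decreasing walks
                        visited.add(r)
                        stack.append((r, h2, d + 1))
    return perm, g
```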
3.1. Even permutohedron and trivially covered arrows.

We end this section with a new geometric construction that can be used to further reduce the size of the search space of the SP algorithm. The motivation for the construction of A_p(C) was to merge all vertices of the permutohedron that correspond to the same DAG, since such DAGs have the same number of edges and the goal is to find the sparsest DAG. To further reduce the search space, we would like to merge adjacent vertices of A_p(C) whose corresponding DAGs are guaranteed to have the same number of edges. This is the case for the adjacent transpositions π = π_1 π_2 π_3 ··· π_p and τ = π_2 π_1 π_3 ··· π_p, since the DAGs G_π and G_τ are the same up to the direction of the arrow (π_1, π_2) (if it is present). Geometrically, this means that we can shrink the search space by contracting the edges of the permutohedron that correspond to adjacent transpositions in the first two coordinates. That is, for all permutations π = π_1 π_2 ··· π_p we contract the edge of the permutohedron whose vertices are π and τ = π_2 π_1 π_3 ··· π_p. We denote the contracted even permutation by (π_1 π_2) π_3 ··· π_p, and we call the resulting polytope the 2-permutohedron or the even permutohedron A^2_p. The even permutohedron on 4 elements is shown in Figure 2(b).

Theorem 7. The even permutohedron A^2_p is a (p − 1)-dimensional convex polytope.

Fig 2. (a) The permutohedron A_4. (b) The even permutohedron A^2_4. To construct the edge graph of the even permutohedron A^2_4 from the permutohedron A_4, we contract the edges of A_4 that correspond to a transposition in the first two coordinates of the vertices.

The proof of this result is given in the Appendix, along with a description of the edges of A^2_p and a figure showing its edge graph. As is noted in the proof of Theorem 7, A^2_p is a permutohedron of the sort defined in [23]. The obvious next step is to reduce the search space further, as in the construction of A_p(C). Let C be a collection of CI relations on the node set [p], and let i ⊥⊥ j | S ∈ C. Just as for permutohedra, we can associate this CI relation to the collection of edges of A^2_p of the form

ω i j σ —— ω j i σ,

where ω is a permutation of the elements of S and σ is a permutation of the elements of [p] \ (S ∪ {i, j}). Contracting these edges is equivalent to intersecting the even permutohedron A^2_p with the polytope A_p(C), resulting in a new geometric object, denoted A^2_p(C), a more restricted search space than either polytope. We now give a characterization of the vertices of A^2_p(C).

Proposition 8. The vertices of A^2_p(C) are partially oriented graphs obtained by unorienting trivially covered arrows.

The proof of Proposition 8 is given in the Appendix. In Section 4, we will prove that Algorithm 2, which has as its search space the vertices of A_p(C), is pointwise consistent under the faithfulness assumption. However, the same result applies to the more restricted object A^2_p(C). As a final remark in this section, we note that even permutohedra admit a generalization that is both geometric and combinatorial in nature.
Remark 9 (k-permutohedron). Fix a positive integer k ≤ p, and note that the k-faces of the permutohedron A_p correspond to the ordered partitions of [p] into p − k nonempty parts [40, Example 0.10]. In particular, an edge corresponds to an ordered partition of [p] into p − 1 nonempty parts, and the even permutohedron A^2_p is obtained by contracting those edges of A_p for which the part with two elements is the first part of the ordered partition. Analogously, we define the k-permutohedron A^k_p to be the polytope given by contracting the (k − 1)-faces of A_p corresponding to those ordered partitions of [p] into p − k + 1 parts for which the first part has size k. By an argument similar to that of Theorem 7, we observe that A^k_p is a (p − 1)-dimensional convex polytope. When k = p − 1, A^k_p is a simplex whose facets correspond to the ordered bipartitions of [p] for which the first part has cardinality one.
4. Consistency Guarantees and Identifiability Implications.
In this section, we prove that both versions of the greedy SP algorithm are (pointwise) consistent (i.e., in the oracle version, as n → ∞) under the faithfulness assumption. We also show that a version of the triangle SP algorithm using the BIC score instead of graph sparsity is consistent under the faithfulness assumption. Additionally, we study the relationships between the different identifiability assumptions encountered so far, namely faithfulness, restricted faithfulness, SMR, ESP, and TSP.

4.1. Consistency of Algorithm 2 under faithfulness.
In order to prove pointwise consistency of Algorithm 2, we need to show that, given a set of CI relations C corresponding to the d-separations in a DAG G∗, every weakly decreasing sequence of covered arrow reversals ultimately leads to a DAG in M(G∗). Given two DAGs G and H, H is an independence map of G, denoted by G ≤ H, if every CI relation encoded by H holds in G (i.e., CI(G) ⊇ CI(H)). The following simple result, whose proof is given in the Appendix, reveals the main idea of the proof.

Lemma 10. A probability distribution P on the node set [p] is faithful with respect to a DAG G if and only if G ≤ G_π for all π ∈ S_p.

The goal is to prove that for any pair of DAGs such that G_π ≤ G_τ, there is a weakly decreasing sequence of covered arrow reversals

(G_τ = G_{π_0}, G_{π_1}, G_{π_2}, ..., G_{π_M} := G_π).

Our proof relies heavily on Chickering's consistency proof of GES and, in particular, on his proof of a conjecture known as Meek's conjecture [6].
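Independence maps can be tested directly from the definition by enumerating the encoded CI relations. The following brute-force sketch (ours, reusing the d_separated helper from Section 2; exponential in p, so suitable only for small examples) is convenient for checking claims like G ≤ G_π on toy DAGs.

```python
from itertools import combinations

def ci_relations(dag):
    """All CI statements (i, j, S) entailed by `dag` via d-separation."""
    nodes = sorted(dag)
    rels = set()
    for i, j in combinations(nodes, 2):
        rest = [v for v in nodes if v not in (i, j)]
        for r in range(len(rest) + 1):
            for S in combinations(rest, r):
                if d_separated(dag, {i}, {j}, set(S)):
                    rels.add((i, j, frozenset(S)))
    return rels

def independence_map(g, h):
    """G <= H holds iff every CI relation encoded by H also holds in G,
    i.e. CI(G) contains CI(H)."""
    return ci_relations(h) <= ci_relations(g)
```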
Theorem 11 ([6, Theorem 4]). Let G and H be any pair of DAGs such that G ≤ H. Let r be the number of arrows in H that have opposite orientation in G, and let m be the number of arrows in H that do not exist in either orientation in G. Then there exists a sequence of at most r + 2m arrow reversals and additions in G with the following properties:
1. Each reversed arrow is a covered arrow.
2. After each reversal and addition, the resulting graph G′ is a DAG and G′ ≤ H.
3. After all reversals and additions, G = H.

Chickering gave a constructive proof of this result via the APPLY-EDGE-OPERATION procedure, which we recall in Algorithm 3. For convenience, we will henceforth refer to Algorithm 3 as the "Chickering algorithm." The Chickering algorithm takes in an independence map
G ≤ H and adds an arrow to G or reverses a covered arrow in G to produce a new DAG G_1 for which G ≤ G_1 ≤ H. By Theorem 11, repeated applications of this algorithm produce a sequence of DAGs

G = G_0 ≤ G_1 ≤ G_2 ≤ ··· ≤ G_N := H.

We will call any sequence of DAGs produced in this fashion a Chickering sequence (from G to H). A quick examination of the Chickering algorithm reveals that there can be multiple Chickering sequences from G to H. We are interested in identifying a specific type of Chickering sequence in relation to the DAGs visited by Algorithm 2. To this end, we prove a pair of lemmas about the ways we may choose to construct Chickering sequences. The proofs of these lemmas can be found in the Appendix.

Lemma 12. Suppose G ≤ H is such that the Chickering algorithm has reached step 5 and selected the arrow Y → Z in G to reverse. If Y → Z is not covered in G, then there exists a Chickering sequence

(G = G_0, G_1, G_2, ..., G_N ≤ H)

in which G_N is produced by the reversal of Y → Z, and for all i = 1, 2, ..., N − 1, the DAG G_i is produced by an arrow addition via step 7 or 8 with respect to the arrow Y → Z.

For an independence map G ≤ H, the Chickering algorithm first deletes all sinks in G that have precisely the same parents in H, and repeats this process for the resulting graphs until no sink of this type remains.
Algorithm 3: APPLY-EDGE-OPERATION
Input: DAGs G and H where G ≤ H and G ≠ H.
Output: A DAG G′ satisfying G′ ≤ H that is given by reversing an arrow in G or adding an arrow to G.
1. Set G′ := G.
2. While G and H contain a node Y that is a sink in both DAGs and for which Pa_G(Y) = Pa_H(Y), remove Y and all incident arrows from both DAGs.
3. Let Y be any sink node in H.
4. If Y has no children in G, then let X be any parent of Y in H that is not a parent of Y in G. Add the arrow X → Y to G′ and return G′.
5. Let D ∈ De_G(Y) denote the (unique) maximal element of De_G(Y) within H, and let Z be any maximal child of Y in G such that D is a descendant of Z in G.
6. If Y → Z is covered in G, reverse Y → Z in G′ and return G′.
7. If there exists a node X that is a parent of Y but not a parent of Z in G, then add X → Z to G′ and return G′.
8. Let X be any parent of Z that is not a parent of Y. Add X → Y to G′ and return G′.

This is the purpose of step 2 of the algorithm. If the adjusted graph is G̃, the algorithm then selects a sink node in G̃, which by construction must have fewer parents than the same node in H and/or some children. The algorithm then adds parents and reverses arrows until this node has exactly the same parents as the corresponding node in H. The following lemma shows that this can be accomplished one sink node at a time; its proof is clear from the statement of the algorithm.

Lemma 13. Let
G ≤ H. If Y is a sink node selectable in step 3 of the Chickering algorithm, then we may always select Y each time until it is deleted by step 2.

We would like to see how the sequence of graphs produced by the Chickering algorithm relates to the DAGs G_π for a set of CI relations C. In particular, we would like to see that if G_π ≤ G_τ for permutations π, τ ∈ S_p, then there is a sequence of moves given by the Chickering algorithm that passes through a sequence of minimal I-MAPs taking us from G_π to G_τ. To do so, we require an additional lemma relating independence maps and minimal I-MAPs. To state this lemma, we recall the two situations within Algorithm 3 in which arrow additions occur:

(i) Suppose Y is a sink node in H. If Y is also a sink node in G, then the algorithm chooses a parent X of Y in H that is not a parent of Y in G, and adds the arrow X → Y to G (step 4).
(ii) If Y is not a sink node in G, then there exists an arrow Y → Z in G that is oriented in the opposite direction in H. If Y → Z is covered, the algorithm reverses it (step 6). If Y → Z is not covered, there exists (in G) either
(a) a parent X of Y that is not a parent of Z, in which case the algorithm adds the arrow X → Z (step 7); or
(b) a parent X of Z that is not a parent of Y, in which case the algorithm adds the arrow X → Y (step 8).

Lemma 14. Let C be a graphoid and G_π ≤ G_τ with respect to C. Then the common sink nodes of G_π and G_τ all have the same incoming arrows. In particular, Algorithm 3 needs no instance of arrow addition (i) to move from G_π to G_τ.

Given two DAGs G_π ≤ G_τ, Algorithm 2 proposes that there is a path along the edges of A_p(C) corresponding to covered arrow reversals taking us from G_τ to G_π,

(G_τ = G_{π_0}, G_{π_1}, G_{π_2}, ..., G_{π_M} = G_π),

for which |G_{π_{j−1}}| ≥ |G_{π_j}| for all j = 1, ..., M. Recall that we call a sequence of minimal I-MAPs satisfying the latter property a weakly decreasing sequence. If such a weakly decreasing sequence exists from any G_τ to G_π, then Algorithm 2 must find it. By definition, such a path is composed of covered arrow reversals and arrow deletions. Since these are precisely the types of moves used in the Chickering algorithm, we must understand the subtleties of the relationship between independence maps between the minimal I-MAPs G_π for a collection of CI relations C and the skeletal structure of the DAGs G_π. To this end, we will use the following two definitions: A minimal I-MAP G_π with respect to a graphoid C is called MEC-minimal if for all
G ≈ G_π and linear extensions τ of G we have that G_π ≤ G_τ. Notice that, by [18, Theorem 8.1], it suffices to check only one linear extension τ for each G. The minimal I-MAP G_π is further called MEC-s-minimal if it is MEC-minimal and Ḡ_π ⊆ Ḡ_τ for all G ≈ G_π and linear extensions τ of G. We are now ready to state the main theorem that allows us to verify consistency of Algorithm 2 under the faithfulness assumption.

Theorem 15. Suppose that C is a graphoid and that G_π and G_τ are minimal I-MAPs with respect to C. Then:
(i) If G_π ≈ G_τ and G_π is MEC-s-minimal, then there exists a weakly decreasing edge walk from G_π to G_τ along A_p(C).
(ii) If G_π ≤ G_τ but G_π ≉ G_τ, then there exists a minimal I-MAP G_τ′ with respect to C satisfying G_τ′ ≤ G_τ that is strictly sparser than G_τ and is connected to G_τ by a weakly decreasing edge walk along A_p(C).

The consistency of Algorithm 2 follows by combining Lemma 10 with Theorem 15. The proofs of these statements are given in the Appendix.
Corollary 16. Algorithm 2 is pointwise consistent under the faithfulness assumption.
Recall from Section 3 that if a probability distribution satisfies the TSP assumption, then it also satisfies the ESP assumption. Corollary 16 implies that any faithful distribution must satisfy the TSP assumption. Therefore, we also have the following corollary.
Corollary 17. Algorithm 1 is pointwise consistent under the faithfulness assumption.
4.2. Consistency of the triangle SP algorithm using BIC under faithfulness.
We now show that a version of the triangle SP algorithm that uses the BIC score instead of graph sparsity is also consistent under the faithfulness assumption. This version of the triangle SP algorithm is presented in Algorithm 4. Algorithm 4 is constructed in analogy to the ordering-based search methods studied by Teyssier and Koller in [31]. In Remark 19 we note the subtleties distinguishing these two algorithms.
Theorem 18. Algorithm 4 is pointwise consistent under the faithfulness assumption.
Algorithm 4: Triangle SP with BIC
Input: Observations X̂ and an initial permutation π.
Output: A permutation π̂ with DAG Ĝ_π̂.
1. Set Ĝ_π := argmax_{G consistent with the permutation π} BIC(G; X̂).
2. Using a depth-first search approach with root π, search for a permutation τ with BIC(Ĝ_τ; X̂) > BIC(Ĝ_π; X̂) that is connected to π through a sequence of permutations (π_1, ..., π_k), where each permutation π_i is produced from π_{i−1} by first performing a covered arrow reversal in Ĝ_{π_{i−1}} and then selecting a linear extension π_i of the resulting DAG. If no such τ exists, return Ĝ_π; else set π := τ and repeat this step.

The proof of Theorem 18 is given in the Appendix. It is based on the fact that the BIC is locally consistent [6, Lemma 7]. Let X̂ be a p × n matrix consisting of n i.i.d. samples from P. A scoring criterion Score(G; X̂) for a DAG G is locally consistent if, for any two DAGs G and G′ such that G′ has one additional arrow i → j but is otherwise equal to G, the following holds as n → ∞:
(a) if i ⊥̸⊥ j | Pa_G(j), then Score(G′; X̂) > Score(G; X̂);
(b) if i ⊥⊥ j | Pa_G(j), then Score(G′; X̂) < Score(G; X̂).

Remark 19. Algorithm 4 differs from the ordering-based search method proposed in [31] in two main ways:
(i) Algorithm 4 selects each new permutation by a covered arrow reversal in the associated minimal I-MAPs;
(ii) Algorithm 4 uses a depth-first search approach instead of greedy hill climbing.
In particular, our search guarantees that any pair of minimal I-MAPs with G_π ≤ G_τ is connected by a Chickering sequence. This allows us to prove Theorem 18, since for minimal I-MAPs

|G_τ| < |G_π| if and only if BIC(G_τ; X̂) > BIC(G_π; X̂).

Since this is not true in general, the ordering-based search method of [31] has no known consistency guarantees.
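For concreteness, the decomposable Gaussian BIC used here can be computed by node-wise least-squares regressions. The following sketch is our own illustration (it is not the pcalg implementation used in Section 6); it follows the convention BIC = log-likelihood − (d/2) log n.

```python
import numpy as np

def gaussian_dag_bic(X, arrows):
    """BIC of a linear Gaussian DAG model. X is an n x p sample matrix
    (one row per observation); `arrows` is a set of (parent, child) pairs
    over nodes 0..p-1. Each node contributes the profile log-likelihood
    of its regression on its parents."""
    n, p = X.shape
    loglik, n_params = 0.0, 0
    for j in range(p):
        pa = sorted(a for (a, b) in arrows if b == j)
        y = X[:, j]
        A = np.column_stack([X[:, pa], np.ones(n)]) if pa else np.ones((n, 1))
        resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
        sigma2 = resid @ resid / n                      # MLE of the noise variance
        loglik += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        n_params += len(pa) + 2                          # coefficients + intercept + variance
    return loglik - 0.5 * n_params * np.log(n)
```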
4.3. Beyond faithfulness.
We now examine the relationships between the ESP, TSP, SMR, faithfulness, and restricted faithfulness assumptions. Our first result consists of the following three implications. Here we include some proofs in the text, since they contain geometrically informative examples.
Theorem 20. The following hierarchy holds for the SMR, ESP, TSP, and faithfulness assumptions:
1. The TSP assumption is strictly weaker than the faithfulness assumption.
2. The ESP assumption is strictly weaker than the TSP assumption.
3. The SMR assumption is strictly weaker than the ESP assumption.
Proof.
It is quick to see that

faithfulness ⟹ TSP ⟹ ESP ⟹ SMR.

The first implication is given by Corollary 16, and the latter two are immediate consequences of the definitions of the TSP, ESP, and SMR assumptions. Namely, the TSP, ESP, and SMR assumptions are each defined to be precisely the condition under which Algorithm 2, Algorithm 1, and the SP algorithm, respectively, are consistent. The implications then follow since each of these algorithms is a refined version of the preceding one in this order. Hence, we only need to show the strict implications. For each statement, we identify a collection of CI relations satisfying the former identifiability assumption but not the latter.

For statement (1), consider the collection of CI relations

C = {· ⊥⊥ · | {·, ·}, · ⊥⊥ · | {·, ·}, · ⊥⊥ · | {·, ·, ·}, · ⊥⊥ · | {·, ·, ·}, · ⊥⊥ · | {·, ·}}.

The sparsest DAG G_π∗ with respect to C is shown in Figure 3. That C satisfies the TSP assumption with respect to G_π∗ can be verified by computer evaluation. To see that C is not faithful with respect to G_π∗, notice that the relations of the form 1 ⊥⊥ · | {·, ·} and 1 ⊥⊥ · | {·, ·, ·} are both in C but are not implied by G_π∗. We also remark that C is not a semigraphoid, since the semigraphoid property (SG2) applied to these two CI relations implies that a further relation 1 ⊥⊥ · | {·, ·, ·} should be in C.

Fig 3. A sparsest DAG w.r.t. the CI relations C given in the proof of Theorem 20 (1).

For statement (2), consider the collection of CI relations

C = {· ⊥⊥ · | {·}, · ⊥⊥ · | {·}, · ⊥⊥ · | {·, ·}}

and initialize Algorithm 2 at the permutation π := 1423.

Fig 4. (a)–(d): The four permutation DAGs described in the proof of Theorem 20 (2).

A sparsest DAG G_π∗ with respect to C is given in Figure 4(a), and the initial permutation
DAG G_π is depicted in Figure 4(b). Notice that the only covered arrow in G_π is 1 → 4; reversing it results in the permutation τ = 4123, and the corresponding DAG G_τ is shown in Figure 4(c). The only covered arrows in G_τ are 4 → 1 and 4 → 2. Reversing 4 → 1 produces G_π, which we already visited, and reversing 4 → 2 results in the permutation σ = 2143; the associated DAG G_σ is depicted in Figure 4(d). Since the only DAGs connected to G_π and G_τ via covered arrow flips have at least as many edges as G_π (and G_τ), Algorithm 2 is inconsistent, and so the TSP assumption does not hold for C. On the other hand, we can verify computationally that Algorithm 1 is consistent with respect to C, meaning that the ESP assumption holds.

Fig 5. The sparsest permutation DAG (a) and the initial permutation DAG (b) described in the proof of Theorem 20 (3).

Finally, for statement (3), consider the collection of CI relations

C = {· ⊥⊥ · | {·}, · ⊥⊥ · | {·, ·}, · ⊥⊥ ·},

and the initial permutation π = 54321. A sparsest DAG G_π∗ and the initial DAG G_π are depicted in Figures 5(a) and (b), respectively. For convenience, we state the necessary observation in the language of even DAG associahedra. It is not hard to check that any DAG G_τ that is edge-adjacent to G_{(54)321} in A^2_5(C) is a complete graph and that π∗ = (12)345. Thus, the SMR assumption holds for C, but the ESP assumption does not.

It can be seen from the definition that restricted faithfulness is a significantly weaker assumption than faithfulness. In [26, Theorem 2.5] it was shown that the SMR assumption is strictly weaker than restricted faithfulness. Thus, based on Theorem 20, it is natural to ask whether ESP and TSP sit between SMR and restricted faithfulness in the hierarchy of assumptions. In order for TSP to imply restricted faithfulness, it must imply both
adjacency faithfulness and orientation faithfulness. It turns out that TSP indeed implies adjacency faithfulness.

Theorem 21. Let P be a semigraphoid that satisfies the TSP assumption w.r.t. a DAG G. Then P satisfies adjacency faithfulness with respect to G.

The proof of this result is given in the Appendix. To end this subsection, we show that the TSP assumption does not imply orientation faithfulness. As a consequence, the TSP assumption does not imply restricted faithfulness and is in fact not comparable to restricted faithfulness.

Fig 6. A sparsest DAG for the CI relations C considered in the proof of Theorem 22.
Theorem 22. There exist probability distributions P such that P satisfies the TSP assumption with respect to a DAG G and P does not satisfy orientation faithfulness with respect to G.

Proof.
Consider any probability distribution P entailing the CI relations

C = {· ⊥⊥ ·, · ⊥⊥ · | {·, ·, ·}, · ⊥⊥ · | {·, ·, ·, ·}, · ⊥⊥ · | {·, ·, ·, ·}}

(for example, C can be faithfully realized by a regular Gaussian). From left to right, we label these CI relations as c_1, c_2, c_3, c_4. For the collection C, a sparsest DAG G_π∗ is depicted in Figure 6. Since there is no equally sparse or sparser DAG that is Markov with respect to P, the distribution P satisfies the SMR assumption with respect to G_π∗. Notice also that the CI relation c_1 does not satisfy the orientation faithfulness assumption with respect to G_π∗. Moreover, if G_π entails c_1, then the subDAG on the nodes π_1, ..., π_5 forms a complete graph. Thus, by [5, Theorem 2], we can find a sequence of covered arrow reversals preserving edge count such that, after all covered arrow reversals, π_1 = 6. Then transposing the entries π_1 π_2 produces a permutation τ in which c_1 holds. Therefore, the number of arrows in G_π is at least the number of arrows in G_τ. Even more, G_τ is an independence map of G_π∗, i.e., G_π∗ ≤ G_τ. So by Theorem 15, there exists a weakly decreasing edge walk (of covered arrow reversals) along A_p(C) taking us from G_τ to G_π∗. Thus, we conclude that P satisfies TSP but not orientation faithfulness.

4.4. The problem of Markov equivalence.

In Sections 4.1 and 4.3 we examined when Algorithms 1 and 2 return the true Markov equivalence class M(G∗). While these results supply identifiability assumptions ensuring consistency of the algorithms, it is important to note that these algorithms may still be quite inefficient: while they search over a collection of permutations and their corresponding DAGs, they may search over many DAGs that belong to the same Markov equivalence class.

To put this problem in perspective, suppose that we are running Algorithm 2 under the faithfulness assumption and that we initialize at some minimal I-MAP G_π. Algorithm 2 will then reverse each covered arrow in G_π, querying the new minimal I-MAPs adjacent to G_π on the edge graph of A_p(C) to see if they have strictly fewer arrows than G_π. In the case that two DAGs G and H differ only by a covered arrow reversal (without any arrow deletions), G and H belong to the same Markov equivalence class [5, 35]. Moreover, [5, Theorem 2] shows that any two members of the same Markov equivalence class differ by a sequence of covered arrow reversals. Thus, the greedy nature of Algorithm 2 can leave us searching through large portions of Markov equivalence classes until we identify a sparser permutation DAG. In particular, in order to know that Algorithm 2 has terminated, it must visit all members of the sparsest Markov equivalence class M(G∗). In [24] it is shown that sparse DAGs, such as oriented trees, can have Markov equivalence classes of exponential size.

To account for this problem, Algorithm 5 provides a more cost-effective alternative that approximates Algorithm 2. Algorithm 5 operates exactly like Algorithm 2, with the exception that it bounds the search depth and the number of runs allowed before the algorithm terminates. Recall that Algorithm 2 searches for a weakly decreasing edge walk from a minimal I-MAP G_π to a minimal I-MAP G_τ with |G_π| > |G_τ| via a depth-first search approach. In Algorithm 5, if this search step does not produce a sparser minimal I-MAP after searching up to and including depth d, the algorithm terminates and returns G_π. In [11], enumerative computations of all possible Markov equivalence classes on p ≤
10 nodes suggest that the average Markov equivalence class contains 4 DAGs. By the transformational characterization of Markov equivalence via covered arrow flips given in [5, Theorem 2], this suggests that a search depth of 4 is, on average, the optimal search depth for escaping a Markov equivalence class of minimal I-MAPs. This intuition is verified via simulations in Section 6. Algorithm 5 also incorporates the possibility of restarting the algorithm in order to try to identify a sparser DAG. Here, the parameter r denotes the number of runs before the algorithm is required to output the sparsest DAG found.

Algorithm 5: Triangle SP with bounds on search depth and number of runs
Input: A set of CI relations C on node set [p], and two positive integers d and r.
Output: A minimal I-MAP G_π.
1. Set R := 0 and Y := ∅.
2. While R < r:
3.   Select a permutation π ∈ S_p and set G := G_π.
4.   Using a depth-first search approach with root G, search for a permutation DAG G_τ with |G| > |G_τ| that is connected to G by a weakly decreasing sequence of covered arrow reversals of length at most d.
5.   If no such G_τ exists, set Y := Y ∪ {G}, R := R + 1, and go to step 2; else set G := G_τ and go to step 4.
6. Return the sparsest DAG G_π in the collection Y.

Algorithm 6: High-dimensional greedy SP
Input: Observations X̂, a threshold τ, and an initial permutation π.
Output: A permutation π̂ together with the DAG Ĝ_π̂.
1. Construct the minimal I-MAP Ĝ_π from the initial permutation π and X̂.
2. Perform Algorithm 2 with constrained conditioning sets: for a covered arrow i → j with S := Pa(i) = Pa(j) \ {i}, perform the arrow reversal, i.e., i ← j, and update the DAG by removing the arrows (k, i) for k ∈ S such that |ρ̂_{i,k|(S∪{j})\{k}}| ≤ τ and the arrows (k, j) for k ∈ S such that |ρ̂_{j,k|(S\{k})}| ≤ τ.
In this section, we show that the variant ofthe greedy SP algorithm presented in Algorithm 6 is uniformly consistentin the high-dimensional Gaussian setting. It is important to note that Al-gorithm 6 only tests conditioning sets made up of parent nodes of coveredarrows; this feature turns out to be critical for high-dimensional consistency.This variation of the greedy SP algorithm was made in analogy to the adap-tation of the SGS-algorithm into the PC algorithm studied in [29], whereefficiency of model recovery for sparse graphs was greatly improved by query-ing conditioning sets consisting only of nodes adjacent to the endpoint of agiven arrow.
RDERING-BASED CAUSAL INFERENCE In the following, we show that Algorithm 6 is uniformly consistent evenwhen the number of nodes p scales with n . Our approach to this problemparallels that of [12] in which the authors prove high-dimensional consistencyof the PC algorithm. Van de Geer and B¨uhlmann [34] analyzed (cid:96) -penalizedmaximum likelihood estimation for causal inference. While they proved thatthe global optimum converges to a sparse minimal I-MAP, their approachin general does not converge to the data-generating DAG. More recently,it was shown that a variant of GES is consistent in the high-dimensionalsetting [19]. Similarly as in that proof, by assuming sparsity of the initialDAG, we obtain uniform consistency of greedy SP in the high-dimensionalsetting, i.e., it converges to the data-generating DAG when the number ofnodes p scales with n .We let the dimension grow as a function of sample size; i.e. p = p n .Similarly, for the true underlying DAG and the data-generating distributionwe let G ∗ = G ∗ n and P = P n , respectively. The assumptions under which wewill guarantee high-dimensional consistency of Algorithm 6 are as follows:(A1) The distribution P n is multivariate Gaussian and faithful to the DAG G ∗ n for all n .(A2) The number of nodes p n scales as p n = O ( n a ) for some 0 ≤ a < π , the maximal degree d π of the cor-responding minimal I-MAP G π satisfies d π = O ( n − m ) for some0 < m ≤ M < c n > ρ i,j | S satisfy | ρ i,j | S | ≤ M and | ρ i,j | S | ≥ c n ,c − n = O ( n (cid:96) ) for some 0 < (cid:96) < m/ . Analogously to the conditions needed in [12], assumptions (A1), (A2), (A3),and (A4) relate to faithfulness, the scaling of the number of nodes withthe number of observations, the maximum degree of the initial DAG, andbounds on the minimal non-zero and maximal partial correlations, respec-tively. Recall that in the Gaussian setting the CI relation X j ⊥⊥ X k | X S is equivalent to the partial correlation ρ j,k | S = corr( X j , X k | X S ) equalingzero. Furthermore, a hypothesis test based on Fischer’s z -transform can beused to test whether X j ⊥⊥ X k | X S . These ideas will be key in the proof ofthe main result of this section, which is stated in the following theorem. Theorem . Suppose that assumptions (A1) – (A4) hold and let thethreshold τ in Algorithm 6 be defined as τ := c n / . Then there exists a L. SOLUS, Y. WANG, L. MATEJOVICOVA, C. UHLER constant c > such that Algorithm 6 is consistent, i.e., it returns a DAG ˆ G ˆ π that is in the same Markov equivalence class as G ∗ n , with probability atleast − O (exp( − cn − (cid:96) )) , where (cid:96) is defined to satisfy assumption (A4). To prove Theorem 23, we require a pair of lemmas, the first of which showsthat the conditioning sets in the Triangle SP algorithm can be restricted toparent sets of covered arrows.
Lemma . Suppose that the data-generating distribution P is faithfulto G ∗ . Then for any permutation π and any covered arrow i → j in G π itholds that(a) i ⊥⊥ k | ( S (cid:48) ∪ { j } ) \ { k } if and only if i ⊥⊥ k | ( S ∪ { j } ) \ { k } ,(b) j ⊥⊥ k | S (cid:48) \ { k } if and only if j ⊥⊥ k | S \ { k } ,for all k ∈ S , where S is the set of common parent nodes of i and j , and S (cid:48) = { a : a < π max π ( i, j ) } . The second lemma we require was first proven in [12, Lemma 3] and ishere restated for the sake of completeness.
Lemma . [12, Lemma 3] Suppose that assumption (A4) holds, and let z i,j | S be the z-transform of the partial correlation coefficient ρ i,j | S . Then P [ | ˆ z i,j | S − z i,j | S | > γ ] ≤ O ( n − | S | ) ∗ Φ , where Φ = (cid:18) exp (cid:18) ( n − − | S | ) log (cid:18) − ( γ/L ) γ/L ) (cid:19)(cid:19) + exp( − C ( n − | S | )) (cid:19) , where C is some constant such that < C < ∞ and L = 1 / (1 − (1 + M ) / , where M is defined such that it satisfies assumption (A4). Provided with Lemmas 24 and 25, we can then prove Theorem 23. Theproof is given in the Appendix.
Remark . As can be seen from the proof of Theorem 23, consistentestimation in the high-dimensional setting requires that we initialize thealgorithm at a permutation satisfying assumption (A3). This assumptioncorresponds to a sparsity constraint. In the Gaussian setting the problem offinding a sparsest DAG in the oracle setting is equivalent to finding the spars-est Cholesky decomposition of the inverse covariance matrix [26]. Variousheuristics have been developed for finding sparse Cholesky decompositions,the most prominent being the minimum degree algorithm [10, 33]. In Algo-rithm 7 we provide a heuristic for finding a sparse minimal I-MAP G π that RDERING-BASED CAUSAL INFERENCE Algorithm 7:
Neighbor-based minimum degree algorithm
Input:
Observations ˆ X , threshold τ Output:
Permutation ˆ π together with the DAG ˆ G ˆ π Set S := [ p ]; construct undirected graph ˆ G S with ( i, j ) ∈ ˆ G S if and only if | ˆ ρ i,j | ( S \{ i,j } ) | ≥ τ ; while S (cid:54) = ∅ do Uniformly draw node k from all nodes with the lowest degree in the graph ˆ G S ;Construct ˆ G S \{ k } by first removing node k and its adjacent edges; then updatethe graph ˆ G S \{ k } as follows:for all i, j ∈ adj( ˆ G S , k ) : if ( i, j ) is not an edge in ˆ G S , add ( i, j );else: ( i, j ) is an edge in ˆ G S \{ k } iff | ˆ ρ i,j | S \{ i,j,k } | ≥ τ ;for all i, j / ∈ adj( ˆ G S , k ) : ( i, j ) is an edge in ˆ G S \{ k } iff ( i, j ) is an edge in ˆ G S . Set ˆ π ( k ) := | S | and S := S \ { k } . end Output the minimal I-MAP ˆ G ˆ π constructed from ˆ π and ˆ X . reduces to the minimum degree algorithm in the oracle setting as shown inTheorem 29.In the following we let G := ( V, E ) be an undirected graph. For a subsetof nodes S ⊂ V we let G S denote the vertex-induced subgraph of G withnode set S . For k ∈ V we let adj( G, i ) denote the nodes k ∈ V \ { i } such that { i, k } ∈ E . We first show that Algorithm 7 is equivalent to the minimumdegree algorithm [33] in the oracle setting. Theorem . Suppose the data-generating distribution P is a multivari-ate Gaussian with precision matrix Θ . Then in the oracle-setting the set ofpossible output permutations from Algorithm 7 is equal to the possible outputpermutations of the minimum degree algorithm applied to Θ . The proof of Theorem 27 is based on the following lemma. Both proofsare given in the Appendix.
Lemma 28. Let $P$ be a distribution on $[p]$ that is faithful to a DAG $G$, and let $P_S$ denote the marginal distribution on $S \subset [p]$. Let $G_S$ denote the undirected graphical model corresponding to $P_S$, i.e., the edge $\{i,j\}$ is in $G_S$ if and only if $\rho_{i,j|S\setminus\{i,j\}} \neq 0$. Then $G_{S\setminus\{k\}}$ can be obtained from $G_S$ as follows:

for all $i, j \in \mathrm{adj}(G_S, k)$: if $(i,j)$ is not an edge in $G_S$, add $(i,j)$; else, $(i,j)$ is an edge in $G_{S\setminus\{k\}}$ iff $\rho_{i,j|S\setminus\{i,j,k\}} \neq 0$;
for all $i, j \notin \mathrm{adj}(G_S, k)$: $(i,j)$ is an edge in $G_{S\setminus\{k\}}$ iff $(i,j)$ is an edge in $G_S$.
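Lemma 28 has a transparent matrix interpretation in the Gaussian setting: marginalizing out $X_k$ replaces the precision matrix by a Schur complement, which can create fill-in only among the neighbors of $k$. The following sketch is a numerical illustration under our own choice of a small example matrix, not code from the paper.

```python
import numpy as np

def marginal_precision(theta, k):
    """Precision matrix after marginalizing out variable k: the Schur
    complement Theta_AA - Theta_Ak Theta_kA / Theta_kk, where A is the
    set of all indices except k."""
    keep = [i for i in range(theta.shape[0]) if i != k]
    A = theta[np.ix_(keep, keep)]
    b = theta[keep, k]
    return A - np.outer(b, b) / theta[k, k]

# Example: a chain 0 - 1 - 2, so the (0, 2) entry of Theta is zero.
theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
# Marginalizing out the middle node 1 creates the fill-in edge {0, 2},
# exactly as Lemma 28 adds (i, j) for i, j adjacent to k.
print(marginal_precision(theta, 1))  # off-diagonal entries are nonzero
```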
The following result shows that Algorithm 7 in the non-oracle setting is also, with high probability, equivalent to the minimum degree algorithm in the oracle setting. The proof is given in the Appendix.

Theorem 29. Suppose that assumptions (A1), (A2), and (A4) hold, and let the threshold $\tau$ in Algorithm 7 be defined as $\tau := c_n/2$. Then with probability at least $1 - O(\exp(-c\,n^{1-2\ell}))$ the output permutation of Algorithm 7 is contained in the set of possible output permutations of the minimum degree algorithm applied to $\Theta$.
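To fix ideas, here is a compact Python sketch of Algorithm 7. It follows the pseudocode directly; estimating partial correlations from data (via the `partial_corr` helper from the earlier sketch) and breaking ties uniformly at random with numpy are our implementation choices, and the final construction of the minimal I-MAP from the output permutation is omitted.

```python
import numpy as np

def neighbor_min_degree(X, tau, seed=0):
    """Neighbor-based minimum degree heuristic (Algorithm 7): maintains
    an undirected graph on the surviving node set S and eliminates a
    minimum-degree node at each step; only pairs of neighbors of the
    eliminated node are re-tested."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    S = list(range(p))
    adj = {i: set() for i in S}
    for i in range(p):                      # initial graph on S = [p]
        for j in range(i + 1, p):
            C = [m for m in S if m not in (i, j)]
            if abs(partial_corr(X, i, j, C)) >= tau:
                adj[i].add(j); adj[j].add(i)
    pi_hat = {}
    while S:
        min_deg = min(len(adj[i]) for i in S)
        k = int(rng.choice([i for i in S if len(adj[i]) == min_deg]))
        pi_hat[k] = len(S)                  # eliminated node gets position |S|
        S.remove(k)
        nbrs = adj.pop(k)
        for i in S:
            adj[i].discard(k)
        for i in nbrs:                      # update only among neighbors of k
            for j in nbrs:
                if i < j:
                    if j not in adj[i]:     # fill-in edges are always added
                        adj[i].add(j); adj[j].add(i)
                    else:                   # existing edges are re-tested
                        C = [m for m in S if m not in (i, j)]
                        if abs(partial_corr(X, i, j, C)) < tau:
                            adj[i].discard(j); adj[j].discard(i)
    return pi_hat
```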
6. Simulations.
In this section, we describe our simulation results, for which we used the R library pcalg [13]. Our simulation study was conducted for linear structural equation models with Gaussian noise:
$$(X_1, \ldots, X_p)^T = ((X_1, \ldots, X_p)A)^T + \varepsilon,$$
where $\varepsilon \sim \mathcal{N}(0, I_p)$ with $I_p$ the identity matrix of size $p \times p$, and $A = [a_{ij}]_{i,j=1}^p$ is, without loss of generality, an upper-triangular matrix of edge weights with $a_{ij} \neq 0$ if and only if $i \to j$ is an arrow in the underlying DAG $G^*$. For each simulation study, we generated 100 realizations of a $p$-node random Gaussian DAG model on an Erdős-Rényi graph for different values of $p$ and expected neighborhood sizes (i.e., edge probabilities). The edge weights $a_{ij}$ were sampled uniformly in $[-1, -0.25] \cup [0.25, 1]$, and $n$ samples were drawn from the distribution induced by the Gaussian DAG model for different values of $n$ and $p$. In the oracle setting, the CI relations were computed by thresholding the partial correlations using different thresholds $\lambda$. For the simulations with $n$ samples, we estimated the CI relations by applying Fisher's $z$-transform and comparing the $p$-values derived from the $z$-transform with a significance level $\alpha$. In the oracle and low-dimensional settings, GES is simulated using the standard BIC score function [6]. In the high-dimensional setting, we use the $\ell_0$-penalized maximum likelihood scoring function [19, 34].

Figure 7 compares the proportion of consistently estimated DAGs in the oracle setting for Algorithm 2 (i.e., with search depth $d = \infty$) for different numbers of runs $r$, Algorithm 5 for different search depths $d$ and numbers of runs $r$, the PC algorithm, and the GES algorithm. The number of nodes in these simulations is $p = 10$, and we consider the $\lambda$-values $0.1$, $0.01$, and $0.001$ for the PC algorithm, Algorithm 2, and Algorithm 5. Note that we run GES with $n = 100{,}000$ samples, since there is no oracle version of GES.

Fig 7. Expected neighborhood size versus proportion of consistently recovered Markov equivalence classes, based on 100 simulations for each expected neighborhood size, on DAGs with $p = 10$ nodes and edge weights sampled uniformly in $[-1, -0.25] \cup [0.25, 1]$; panels (a)-(c) show $\lambda = 0.001$, $0.01$, and $0.1$.

As expected, we see that increasing the number of runs for Algorithm 2 and Algorithm 5 results in a consistently higher rate of model recovery. In addition, for each fixed number of runs for Algorithm 5, search depth $d = 4$ has the best performance among all search depths considered. The optimality of search depth 4 for Algorithm 5 coincides with the observation that the average Markov equivalence class has about 4 elements, as discussed in Subsection 4.4.

For each run of the algorithm, we also recorded the structural Hamming distance (SHD) between the true and the recovered Markov equivalence class. Figure 8 shows the average SHD versus the expected neighborhood size of the true DAG. Recall that Figure 7 demonstrates that Algorithm 5 with high search depth and multiple runs learns the true Markov equivalence class at a notably higher rate than the PC and GES algorithms when $\lambda$ is chosen small. However, Figure 8 shows that, for small values of $d$ and $r$, when Algorithm 5 learns the wrong DAG, it is much further from the true DAG than the DAG learned by the PC algorithm. On the other hand, this trend only holds for Algorithm 5 with a relatively small search depth and few runs: increasing these parameters ensures that the wrong DAG learned by Algorithm 5 is consistently closer to the true DAG than that learned by the PC algorithm.

Fig 8. Expected neighborhood size versus structural Hamming distance between the true and recovered Markov equivalence classes, based on 100 simulations for each expected neighborhood size, on DAGs with $p = 10$ nodes and edge weights sampled uniformly in $[-1, -0.25] \cup [0.25, 1]$; panels (a)-(c) show $\lambda = 0.001$, $0.01$, and $0.1$.

We then compared the recovery performance of Algorithm 5 to the SP, GES, PC, SGS, and MMHC algorithms. The SGS algorithm is another constraint-based algorithm similar to the PC algorithm [29]. MMHC [2] is a hybrid method that first estimates a skeleton through CI testing and then performs a hill-climbing search to orient the edges. In the low-dimensional setting, we fixed the number of nodes to be $p = 8$ and considered sample sizes $n \in \{1{,}000, 10{,}000\}$. We analyzed the performance of GES using the standard BIC score, along with Algorithm 5 and the PC algorithm for three values of the significance level $\alpha$. To compensate for the trade-off between computational efficiency and estimation performance, Algorithm 5 is run with $r = 10$ runs and search depth $d = 4$. Figure 9 shows that the SP and greedy SP algorithms achieve the best performance among all algorithms considered. Since, for computational reasons, the SP algorithm can only with difficulty be performed on graphs with 10 nodes, we conclude that the greedy SP algorithm is the preferable approach on medium-sized graphs.

Fig 9. Expected neighborhood size versus proportion of consistently recovered skeletons, based on 100 simulations for each expected neighborhood size, on DAGs with $p = 8$ nodes, sample sizes $n = 1{,}000$ (panels (a)-(c)) and $n = 10{,}000$ (panels (d)-(f)), edge weights sampled uniformly in $[-1, -0.25] \cup [0.25, 1]$, and three $\alpha$-values; curves are shown for SP, greedy SP, GES, MMHC, PC, and SGS.

In the remainder of this section, we analyze the performance of Algorithm 6 in the sparse high-dimensional setting. For comparison, we only consider methods with high-dimensional consistency guarantees, namely the PC algorithm [12] and GES [19, 34]. The initial permutation for Algorithm 6 and its associated minimal I-MAP are generated using Algorithm 7 ("high-dim greedy SP"). To better understand the influence of accurately selecting an initial minimal I-MAP on the performance of Algorithm 6, we also considered the case when the moral graph of the data-generating DAG is given as prior knowledge ("high-dim greedy SP on moral graph").
In analogy to the passage from Algorithm 2 to Algorithm 5 for the sake of computational feasibility, we conducted our high-dimensional simulations using Algorithm 6 with a search depth of $d = 1$ and $r = 50$ runs.

Figure 10 compares the skeleton recovery of high-dimensional greedy SP (Algorithm 6) with that of the PC algorithm and GES, both without ("high-dim greedy SP", "high-dim PC" and "high-dim GES") and with prior knowledge of the moral graph ("high-dim greedy SP on moral graph", "high-dim PC on moral graph" and "high-dim GES on moral graph"). Note that for GES given the moral graph, we used the ARGES-CIG algorithm presented in [19]. The number of nodes in our simulations is $p = 100$, the number of samples considered is $n = 300$, and the expected neighborhood sizes used are $s = 0.5$, $1$, and $2$. We varied the significance level $\alpha$ for the PC and the greedy SP algorithms and the penalization parameter $\lambda_n$ for GES, and reported the average number of true positives and false positives for each tuning parameter in the ROC plots shown in Figure 10.

Fig 10. ROC plots for skeleton recovery, both with and without knowledge of the moral graph, based on 100 simulations on DAGs with $p = 100$ nodes, expected neighborhood sizes $s = 0.5$, $1$, and $2$ (panels (a)-(f)), sample size $n = 300$, and edge weights sampled uniformly in $[-1, -0.25] \cup [0.25, 1]$.

The results show that, unlike in the low-dimensional setting, although greedy SP remains comparable to the PC algorithm and GES in the high-dimensional setting, GES tends to achieve slightly better performance in some of the settings.
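For concreteness, the sketch below reproduces the data-generating mechanism of this section: a random upper-triangular weight matrix over an Erdős-Rényi DAG with weights uniform on $[-1, -0.25] \cup [0.25, 1]$ and Gaussian noise. It is a minimal Python reimplementation of the setup rather than the R/pcalg code used for the reported simulations, and the convention edge_prob = s/(p - 1) for expected neighborhood size s is our reading of the setup.

```python
import numpy as np

def sample_gaussian_dag_data(p, expected_nbhd_size, n, seed=0):
    """Draw n samples from a random linear Gaussian SEM
    (X_1, ..., X_p)^T = ((X_1, ..., X_p) A)^T + eps with eps ~ N(0, I_p),
    where A is upper triangular so the ordering 1, ..., p is causal."""
    rng = np.random.default_rng(seed)
    edge_prob = expected_nbhd_size / (p - 1)   # Erdos-Renyi edge probability
    A = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() < edge_prob:
                # weight uniform on [-1, -0.25] union [0.25, 1]
                A[i, j] = rng.uniform(0.25, 1.0) * rng.choice([-1.0, 1.0])
    eps = rng.standard_normal((n, p))
    # Solve X = X A + eps row-wise: X = eps (I - A)^{-1}.
    X = eps @ np.linalg.inv(np.eye(p) - A)
    return A, X

A, X = sample_gaussian_dag_data(p=10, expected_nbhd_size=2, n=1000)
```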
7. Discussion.
In this paper, we examined the greedy SP algorithm (Algorithm 1). This is a simplex-type algorithm that searches for the sparsest minimal I-MAP $G_\pi$ associated to a set of observed CI relations $\mathcal{C}$ for a permutation $\pi \in S_p$ by following weakly decreasing edgewalks along the DAG associahedron $\mathcal{A}_p(\mathcal{C})$, a convex polytope whose vertices are in bijection with the collection of minimal I-MAPs $\{G_\pi : \pi \in S_p\}$. Oftentimes, the edges of $\mathcal{A}_p(\mathcal{C})$ are also indexed combinatorially: two minimal I-MAPs $G_\pi$ and $G_\tau$ that label adjacent vertices of $\mathcal{A}_p(\mathcal{C})$ differ by the reversal of a covered arrow if and only if either $G_\pi \subseteq G_\tau$ or $G_\tau \subseteq G_\pi$. This partial characterization of the edges of $\mathcal{A}_p(\mathcal{C})$ gives rise to a combinatorial greedy SP algorithm (Algorithm 2), called the triangle SP algorithm, which queries weakly decreasing edgewalks along the edge graph of $\mathcal{A}_p(\mathcal{C})$ that use only edges indexed by covered arrow reversals. In Section 4, we examined consistency guarantees for Algorithms 1 and 2. We showed that the triangle SP algorithm is pointwise consistent under the faithfulness assumption, thereby making it the first permutation-based causal inference algorithm for DAG model recovery with consistency guarantees. We also proved that a high-dimensional variant of the triangle SP algorithm (Algorithm 6) is uniformly consistent under the faithfulness assumption.

In simulation studies, we compared the triangle SP algorithm with state-of-the-art algorithms, including GES and the PC algorithm, in both the low- and high-dimensional settings. Since Algorithm 2 searches over weakly decreasing edgewalks, and therefore must make moves under which the score function does not strictly increase in value, we implemented a version of Algorithm 2 equipped with depth-first search bounds and a fixed number of runs (Algorithm 5). Our results suggest that an optimal bound on the search depth is $d = 4$. This observation is in agreement with that of [11], which suggests that the average Markov equivalence class contains about four DAGs. In future work it would be interesting to analyze an approach in which the depth-first search on the DAG associahedron is replaced by an MCMC approach.

In the oracle and low-dimensional settings, we find that Algorithm 5 with $d = 4$ and $r = 5$ runs tends to outperform both GES and the PC algorithm. Furthermore, Algorithm 2 can be scaled to the high-dimensional setting, in which it performs comparably to the PC algorithm and GES. Like the PC algorithm, and in contrast to GES, our method is nonparametric and consequently does not require the Gaussian assumption. In future work it would be interesting to combine the greedy SP algorithm with kernel-based CI tests [9, 32], which are better able to deal with non-linear relationships and non-Gaussian noise, and to analyze its performance on non-Gaussian data. Furthermore, we believe that permutation-based causal inference approaches could provide new avenues for causal inference in a variety of settings. An extension to the setting where a mix of observational and interventional data is available has recently been presented in [37]. In addition, it would be interesting to extend the greedy SP algorithm to settings with latent variables or cyclic graphs.
Acknowledgements.
Liam Solus was partially supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship (DMS-1606407). Caroline Uhler was partially supported by DARPA (W911NF-16-1-0551), NSF (DMS-1651995), ONR (N00014-17-1-2147), and a Sloan Fellowship. The authors are grateful to Robert Castelo for helpful discussions on the project.
References.

[1] P. A. Aguilera, A. Fernández, R. Fernández, R. Rumí, and A. Salmerón. Bayesian networks in environmental modelling. Environmental Modelling & Software 26.12 (2011): 1376-1388.
[2] I. Tsamardinos, L. E. Brown and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65.1 (2006): 31-78.
[3] R. R. Bouckaert. Optimizing causal orderings for generating DAGs from data. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1992.
[4] R. Castelo and T. Kočka. On inclusion-driven learning of Bayesian networks. Journal of Machine Learning Research 4.Sep (2003): 527-574.
[5] D. M. Chickering. A transformational characterization of equivalent Bayesian network structures. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1995.
[6] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (2002): 507-554.
[7] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9.4 (1992): 309-347.
[8] N. Friedman, M. Linial, I. Nachman and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology 7 (2000): 601-620.
[9] K. Fukumizu, A. Gretton, X. Sun and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems (2008).
[10] A. George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis 10.2 (1973): 345-363.
[11] S. B. Gillispie and M. D. Perlman. Enumerating Markov equivalence classes of acyclic digraph models. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2001.
[12] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8 (2007): 613-636.
[13] M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis and P. Bühlmann. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47 (2011): 1-26.
[14] P. Larrañaga, C. M. H. Kuijpers, R. H. Murga, and Y. Yurramendi. Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 26.4 (1996): 487-493.
[15] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.
[16] J. Lemeire, S. Meganck, F. Cartella and T. Liu. Conservative independence-based causal structure learning in absence of adjacency faithfulness. International Journal of Approximate Reasoning 53 (2012): 1305-1325.
[17] C. Meek. Graphical Models: Selecting Causal and Statistical Models. PhD thesis, Carnegie Mellon University, 1997.
[18] F. Mohammadi, C. Uhler, C. Wang, and J. Yu. Generalized permutohedra from probabilistic graphical models. ArXiv preprint: http://arxiv.org/abs/1606.01814v1 (2016).
[19] P. Nandy, A. Hauser, and M. H. Maathuis. High-dimensional consistency in score-based and hybrid structure learning. ArXiv preprint: http://arxiv.org/abs/1507.02608 (2016).
[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.
[21] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000.
[22] J. Peters. On the intersection property of conditional independence and its application to causal discovery. Journal of Causal Inference 3 (2015): 97-108.
[23] A. Postnikov. Permutohedra, associahedra, and beyond. International Mathematics Research Notices 2009.6 (2009): 1026-1106.
[24] A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov equivalence classes for DAG models on trees. ArXiv preprint: https://arxiv.org/abs/1706.06091 (2017).
[25] J. Ramsey, J. Zhang, and P. L. Spirtes. Adjacency-faithfulness and conservative causal inference. Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2006.
[26] G. Raskutti and C. Uhler. Learning directed acyclic graphs based on sparsest permutations. ArXiv preprint: https://arxiv.org/abs/1307.0366 (2013).
[27] J. M. Robins, M. A. Hernán and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology 11.5 (2000): 550-560.
[28] M. Singh and M. Valtorta. An algorithm for the construction of Bayesian network structures from data. Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1993.
[29] P. Spirtes, C. N. Glymour and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, 2001.
[30] P. Spirtes and J. Zhang. A uniformly consistent estimator of causal effects under the k-triangle faithfulness assumption. Statistical Science 29.4 (2014): 662-678.
[31] M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2005.
[32] R. E. Tillman, A. Gretton and P. Spirtes. Nonlinear directed acyclic structure learning with weakly additive noise models. Advances in Neural Information Processing Systems, 2009.
[33] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proceedings of the IEEE 55.11 (1967): 1801-1809.
[34] S. van de Geer and P. Bühlmann. $\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics 41.2 (2013): 536-567.
[35] T. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 1992.
[36] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics 41.2 (2013): 436-463.
[37] Y. Wang, L. Solus, K. D. Yang and C. Uhler. Permutation-based causal inference algorithms with interventions. ArXiv preprint: https://arxiv.org/abs/1705.10220 (2017).
[38] J. Zhang and P. Spirtes. Detection of unfaithfulness and robust causal inference. Minds and Machines 18 (2008): 239-271.
[39] J. Zhang and P. Spirtes. The three faces of faithfulness. Synthese 193 (2015): 1011-1027.
[40] G. M. Ziegler. Lectures on Polytopes. Vol. 152. Springer Science & Business Media, 1995.
APPENDIX A: PROOFS FOR RESULTS ON THE GEOMETRIC PROPERTIES OF $\mathcal{A}_p(\mathcal{C})$

A.1. Proof of Lemma 10.
Suppose first that $\mathcal{C}$ is not faithful to $G_{\pi^*}$, and take any conditional independence statement $i \perp\!\!\!\perp j \mid K$ in $\mathcal{C}$ that is not encoded by the Markov property for $G_{\pi^*}$. Take $\pi$ to be any permutation in which $K \succ_\pi i \succ_\pi j \succ_\pi [p] \setminus (K \cup \{i, j\})$. Then $G_{\pi^*} \not\leq G_\pi$, because $G_\pi$ encodes $i \perp\!\!\!\perp j \mid K$ but $G_{\pi^*}$ does not.

Conversely, suppose $P$ is faithful to $G_{\pi^*}$. By [26, Lemma 2.1], we know that $G_\pi$ satisfies the Markov condition with respect to $P$ for any $\pi \in S_p$. So any conditional independence relation encoded by $G_\pi$ holds for $P$, which means it also holds for $G_{\pi^*}$. Thus, $G_{\pi^*} \leq G_\pi$.

A.2. Proof of Theorem 7.
Given a point $(x_1, \ldots, x_p)^T \in \mathbb{R}^p$, the permutohedron $\mathcal{A}_p(x_1, \ldots, x_p)$ is the convex hull of the $p!$ points obtained by permuting the coordinates $(x_1, \ldots, x_p)$. In particular, $\mathcal{A}_p = \mathcal{A}_p(1, 2, \ldots, p)$. We want to show that the graph $G_p$ given by contracting all edges $\pi - \tau$ of $\mathcal{A}_p$, where $\tau = \pi_2\pi_1\pi_3\pi_4 \cdots \pi_p$, is the edge graph of a $(p-1)$-dimensional permutohedron $\mathcal{A}_p\big(\tfrac{3}{2}, \tfrac{3}{2}, 3, 4, \ldots, p\big)$. To do so, we realize the vertex of $G_p$ corresponding to the even permutation $(\pi_1\pi_2)\pi_3 \cdots \pi_p$ as the vertex $u(\pi)$ of $\mathcal{A}_p\big(\tfrac{3}{2}, \tfrac{3}{2}, 3, 4, \ldots, p\big)$, where
$$u(\pi)_i = \frac{\pi_i + \tau(\pi)_i}{2},$$
and $\tau(\pi)$ is the permutation satisfying $\tau(\pi)_j = 2$ and $\tau(\pi)_k = 1$ if $\pi_j = 1$ and $\pi_k = 2$.

To prove this, we must show that $u(\pi)$ and $u(\omega)$ are adjacent in $\mathcal{A}_p\big(\tfrac{3}{2}, \tfrac{3}{2}, 3, 4, \ldots, p\big)$ if and only if they differ by transposing two entries $\pi_j$ and $\pi_k$ with $\pi_k = \pi_j + 1$. This can be accomplished by considering a cost vector $c = (c_1, \ldots, c_p)$ and a reordering such that $c_{i_1} \leq c_{i_2} \leq \cdots \leq c_{i_p}$. If all inequalities in $c_{i_1} \leq c_{i_2} \leq \cdots \leq c_{i_p}$ are strict, then the point that maximizes $c \cdot x$ in $\mathcal{A}_p\big(\tfrac{3}{2}, \tfrac{3}{2}, 3, 4, \ldots, p\big)$ is
$$\left(\frac{\pi^{-1}_1 + \pi^{-1}_2}{2}, \frac{\pi^{-1}_1 + \pi^{-1}_2}{2}, \pi^{-1}_3, \pi^{-1}_4, \ldots, \pi^{-1}_p\right),$$
where $\pi = (i_1 i_2) i_3 \cdots i_p$. If there is exactly one equality $c_{i_1} < c_{i_2} < \cdots$
A.3. Proof of Proposition 8.
First notice that since $\mathcal{A}_p(\mathcal{C})$ is constructed from the permutohedron by contracting an edge
$$\omega i j \sigma - \omega j i \sigma,$$
for some permutations $\omega$ of $K$ and $\sigma$ of $[p] \setminus (K \cup \{i, j\})$, where $K \subseteq [p] \setminus \{i, j\}$, the corresponding contraction of $\mathcal{A}_p(\mathcal{C})$ is obtained simply by further contracting the edges between any two permutations $\pi_1\pi_2\pi_3 \cdots \pi_p$ and $\pi_2\pi_1\pi_3 \cdots \pi_p$. Given a permutation $\pi = \pi_1\pi_2 \cdots \pi_p$, recall that the DAG $G_\pi$ contains the arrow $\pi_1 \to \pi_2$ if and only if $\pi_1 \not\perp\!\!\!\perp \pi_2$, and that the DAGs $G_\pi$ label the vertices of $\mathcal{A}_p(\mathcal{C})$. If the arrow $\pi_1 \to \pi_2$ is in $G_\pi$, then it is a trivially covered arrow. However, by contracting the edges corresponding to the equivalence relation for the even permutohedron, we have contracted the vertices corresponding to the two DAGs $G_\pi$ and $G_{\pi_2\pi_1\pi_3 \cdots \pi_p}$. Since these two DAGs differ only by the direction of the arrow $\pi_1 \to \pi_2$, combinatorially their shared vertex is labeled by the partially directed graph that has all the arrows of $G_\pi$ except for $\pi_1 \to \pi_2$, which is now undirected. On the other hand, if the arrow $\pi_1 \to \pi_2$ is not in $G_\pi$, then the two DAGs are equal and thus $G_\pi$ labels their shared vertex.

APPENDIX B: PROOFS FOR RESULTS ON THE POINTWISE CONSISTENCY OF THE GREEDY SP ALGORITHMS
B.1. Proof of Lemma 12.
Until the arrow $Y \to Z$ is reversed, the set $\mathrm{De}_G(Y)$ and the node choice $D \in \mathrm{De}_G(Y)$ remain the same. This is because steps 7 and 8 only add parents to $Y$ or $Z$ that are already parents of $Y$ or $Z$, respectively. Thus, we can always choose the same $Y$ and $Z$ until $Y \to Z$ is covered.

B.2. Proof of Lemma 14.
Suppose on the contrary that there exists some sink node $Y$ in $G_\pi$ and a parent node $X$ of $Y$ in $G_\tau$ that is not a parent node of $Y$ in $G_\pi$. Since $Y$ is a sink in both DAGs, there exist linear extensions $\hat\pi$ and $\hat\tau$ of the partial orders corresponding to $G_\pi$ and $G_\tau$ for which $Y = \hat\pi_p$ and $Y = \hat\tau_p$. By [18, Theorem 7.4], we know that $G_\pi = G_{\hat\pi}$ and $G_\tau = G_{\hat\tau}$. In particular, we know that $X \not\perp\!\!\!\perp Y \mid [p] \setminus \{X, Y\}$ in $G_{\hat\tau}$ and $X \perp\!\!\!\perp Y \mid [p] \setminus \{X, Y\}$ in $G_{\hat\pi}$. However, this is a contradiction, since both of these relations cannot simultaneously hold.

B.3. Proof of Theorem 15.
To prove Theorem 15 we must first prove a few lemmas. Throughout the remainder of this section, we use the following notation: Suppose that
$G \leq H$ for two DAGs $G$ and $H$ and that
$$C = (G_0 := G, G_1, G_2, \ldots, G_N := H)$$
is a Chickering sequence from $G$ to $H$. We let $\pi_i \in S_p$ denote a linear extension of $G_i$ for all $i = 0, 1, \ldots, N$. For any DAG $G$ we also let $\mathrm{CI}(G)$ denote the collection of CI relations encoded by the $d$-separation statements in $G$.

Lemma 30. Suppose that $G_\tau$ is a minimal I-MAP of a graphoid $\mathcal{C}$. Suppose also that $G \approx G_\tau$ and that $G$ differs from $G_\tau$ only by a covered arrow reversal. If $\pi$ is a linear extension of $G$, then $G_\pi$ is a subDAG of $G$.

Proof.
Suppose that $G$ is obtained from $G_\tau$ by the reversal of the covered arrow $x \to y$ in $G_\tau$. Without loss of generality, we assume that $\tau = SxyT$ and $\pi = SyxT$ for some disjoint words $S$ and $T$ whose letters are collectively in bijection with the elements in $[p] \setminus \{x, y\}$. So in $G_\pi$, the arrows going from $S$ to $T$, $x$ to $T$, and $y$ to $T$ are all the same as in $G_\tau$. However, the arrows going from $S$ to $x$ and $S$ to $y$ may be different. So, to prove that $G_\pi$ is a subDAG of $G$ we must show that for each letter $s$ in the word $S$:

(1) if $s \to x \notin G_\tau$ then $s \to x \notin G_\pi$, and
(2) if $s \to y \notin G_\tau$ then $s \to y \notin G_\pi$.
To see this, notice that if $s \to x \notin G_\tau$, then $s \to y \notin G_\tau$, since $x \to y$ is covered in $G_\tau$. Similarly, if $s \to y \notin G_\tau$ then $s \to x \notin G_\tau$. Thus, we know that $s \perp\!\!\!\perp x \mid S \setminus s$ and $s \perp\!\!\!\perp y \mid (S \setminus s)x$ are both in the collection $\mathcal{C}$. It then follows from (SG2) that $s \perp\!\!\!\perp x \mid (S \setminus s)y$ and $s \perp\!\!\!\perp y \mid S \setminus s$ are in $\mathcal{C}$ as well. Therefore, $G_\pi$ is a subDAG of $G$.

Lemma 31. Let $\mathcal{C}$ be a graphoid and let
$$C = (G_0 := G_\pi, G_1, G_2, \ldots, G_N := G_\tau)$$
be a Chickering sequence from a minimal I-MAP $G_\pi$ of $\mathcal{C}$ to another minimal I-MAP $G_\tau$. If, for some index $0 \leq i < N$, $G_i$ is obtained from $G_{i+1}$ by deletion of an arrow $x \to y$ in $G_{i+1}$, then $x \to y$ is not in $G_{\pi_{i+1}}$.

Proof.
Let $\pi_{i+1} = SxTyR$ be a linear extension of $G_{i+1}$ for some disjoint words $S$, $T$, and $R$ whose letters are collectively in bijection with the elements in $[p] \setminus \{x, y\}$. Since $G_\pi \leq G_i \leq G_{i+1}$, then
$$\mathcal{C} \supseteq \mathrm{CI}(G_\pi) \supseteq \mathrm{CI}(G_i) \supseteq \mathrm{CI}(G_{i+1}).$$
We claim that $x \perp\!\!\!\perp y \mid ST \in \mathrm{CI}(G_i) \subseteq \mathcal{C}$. Therefore, $x \to y$ cannot be an arrow in $G_{\pi_{i+1}}$.

First, since $G_i$ is obtained from $G_{i+1}$ by deleting the arrow $x \to y$, $\pi_{i+1}$ is also a linear extension of $G_i$. Notice that there is no directed path from $y$ to $x$ in $G_i$, and so it follows that $x$ and $y$ are d-separated in $G_i$ by $\mathrm{Pa}_{G_i}(y)$. Therefore, $x \perp\!\!\!\perp y \mid \mathrm{Pa}_{G_i}(y) \in \mathrm{CI}(G_i)$. Notice also that $\mathrm{Pa}_{G_i}(y) \subset ST$ and any path in $G_i$ between $x$ and $y$ lacking colliders uses only arrows in the subDAG of $G_i$ induced by the vertices $S \cup T \cup \{x, y\} = [p] \setminus R$. Therefore, $x \perp\!\!\!\perp y \mid ST \in \mathrm{CI}(G_i)$ as well. It follows that $x \perp\!\!\!\perp y \mid ST \in \mathcal{C}$, and so, by definition, $x \to y$ is not an arrow of $G_{\pi_{i+1}}$.

Lemma 32. Suppose that $\mathcal{C}$ is a graphoid and $G_\pi$ is a minimal I-MAP with respect to $\mathcal{C}$. Let
$$C = (G_0 := G_\pi, G_1, G_2, \ldots, G_N := G_\tau)$$
be a Chickering sequence from $G_\pi$ to another minimal I-MAP $G_\tau$ with respect to $\mathcal{C}$. Let $i$ be the largest index such that $G_i$ is produced from $G_{i+1}$ by deletion of an arrow, and suppose that for all $i + 1 < k \leq N$ we have $G_{\pi_k} = G_k$. Then $G_{\pi_{i+1}}$ is a proper subDAG of $G_{i+1}$.

Proof.
By Lemma 30, we know that $G_{\pi_{i+1}}$ is a subDAG of $G_{i+1}$. This is because $\pi_{i+1}$ is a linear extension of $G_{i+1}$, $G_{i+1} \approx G_{i+2} = G_{\pi_{i+2}}$, and $G_{i+1}$ differs from $G_{i+2}$ only by a covered arrow reversal. By Lemma 31, we know that the arrow deleted in $G_{i+1}$ to obtain $G_i$ is not in $G_{\pi_{i+1}}$. Therefore, $G_{\pi_{i+1}}$ is a proper subDAG of $G_{i+1}$.

Using these lemmas, we can now give a proof of Theorem 15.

B.3.1. Proof of Theorem 15.
To see that (i) holds, notice that since $G_\pi \approx G_\tau$, by the transformational characterization of Markov equivalence given in [5, Theorem 2], there exists a Chickering sequence
$$C := (G_0 := G_\pi, G_1, G_2, \ldots, G_N := G_\tau)$$
for which $G_0 \approx G_1 \approx \cdots \approx G_N$ and $G_i$ is obtained from $G_{i+1}$ by the reversal of a covered arrow in $G_{i+1}$ for all $0 \leq i < N$. Furthermore, since $G_\pi$ is MEC-s-minimal, and by Lemma 30, we know that for all $0 \leq i \leq N$
$$G_i \supseteq G_{\pi_i} \supseteq G_\pi.$$
However, since $G_i \approx G_\pi$ and $G_{\pi_i}$ is a subDAG of $G_i$, then $G_i = G_{\pi_i}$ for all $i$. Thus, the desired weakly decreasing edgewalk along $\mathcal{A}_p(\mathcal{C})$ is
$$(G_\pi = G_{\pi_0}, G_{\pi_1}, \ldots, G_{\pi_{N-1}}, G_{\pi_N} = G_\tau).$$

To see that (ii) holds, suppose that $G_\pi \leq G_\tau$ but $G_\pi \not\approx G_\tau$. Since $G_\pi \leq G_\tau$, we know that there exists a Chickering sequence from $G_\pi$ to $G_\tau$ that uses at least one arrow addition. By Lemmas 12 and 13 we can choose this Chickering sequence such that it resolves one sink at a time and, respectively, reverses one covered arrow at a time. We denote this Chickering sequence by
$$C := (G_0 := G_\pi, G_1, G_2, \ldots, G_N := G_\tau).$$
Let $i$ denote the largest index for which $G_i$ is obtained from $G_{i+1}$ by deletion of an arrow. Then by our choice of Chickering sequence we know that $G_k$ is obtained from $G_{k+1}$ by a covered arrow reversal for all $i < k < N$. Moreover, $\pi_i = \pi_{i+1}$, and so $G_{\pi_i} = G_{\pi_{i+1}}$. Furthermore, by Lemma 30 we know that $G_{\pi_k}$ is a subDAG of $G_k$ for all $i < k \leq N$.

Suppose now that there exists some index $i + 1 < k < N$ such that $G_{\pi_k}$ is a proper subDAG of $G_k$. Without loss of generality, we pick the largest such index. It follows that for all indices $k < \ell \leq N$ we have $G_{\pi_\ell} = G_\ell$, and that
$$G_{k+1} \approx G_{k+2} \approx \cdots \approx G_N = G_\tau.$$
Thus, by [5, Theorem 2], there exists a weakly decreasing edgewalk from $G_\tau$ to $G_{k+1}$ on $\mathcal{A}_p(\mathcal{C})$. Since we chose the index $k$ maximally, $G_k$ is obtained from $G_{k+1}$ by a covered arrow reversal. Therefore, $G_{\pi_k}$ and $G_{\pi_{k+1}}$ are connected by an edge of $\mathcal{A}_p(\mathcal{C})$ indexed by a covered arrow reversal. Since $|G_k| = |G_{k+1}| = |G_{\pi_{k+1}}|$ and $G_{\pi_k}$ is a proper subDAG of $G_k$, the result follows.

On the other hand, suppose that for all indices $i + 1 < k \leq N$ we have $G_{\pi_k} = G_k$. Then these are precisely the conditions of Lemma 32, and so it follows that $G_{\pi_{i+1}}$ is a proper subDAG of $G_{i+1}$. Since $G_{i+1}$ is obtained from $G_{i+2}$ by a covered arrow reversal, the result follows, completing the proof.
Suppose that $\mathcal{C}$ is a graphoid that is faithful to the sparsest minimal I-MAP $G_{\pi^*}$ with respect to $\mathcal{C}$. By Lemma 10, we know that $G_{\pi^*} \leq G_\pi$ for all $\pi \in S_p$. By (ii) of Theorem 15, if Algorithm 2 is at a minimal I-MAP $G_\tau$ that is not in the same Markov equivalence class as $G_{\pi^*}$, then we can take a weakly decreasing edgewalk along $\mathcal{A}_p(\mathcal{C})$ to reach a sparser minimal I-MAP $G_{\tau'}$ satisfying $G_{\pi^*} \leq G_{\tau'} \leq G_\tau$. Furthermore, by (i) of Theorem 15, the Markov equivalence class of $G_{\pi^*}$ is a connected subgraph of the edge graph of $\mathcal{A}_p(\mathcal{C})$.

B.5. Proof of Theorem 18.
The proof is composed of two parts. We first prove that, for any permutation $\pi$, in the limit of large $n$, $\hat G_\pi$ coincides with the minimal I-MAP $G_\pi$. We prove this by contradiction. Suppose $\hat G_\pi \neq G_\pi$. Since the BIC is consistent, in the limit of large $n$, $\hat G_\pi$ is an I-MAP of the distribution. Since $\hat G_\pi$ and $G_\pi$ share the same permutation and $G_\pi$ is a minimal I-MAP, it then follows that $G_\pi \subset \hat G_\pi$. Suppose now that there exists $(i, j) \in \hat G_\pi$ such that $(i, j) \notin G_\pi$. Since $G_\pi$ is a minimal I-MAP, we obtain that $i \perp\!\!\!\perp j \mid \mathrm{Pa}_{G_\pi}(j)$. Since the BIC is locally consistent, it follows that $\mathrm{BIC}(G_\pi; \hat X) > \mathrm{BIC}(\hat G_\pi; \hat X)$, a contradiction, since $\hat G_\pi$ maximizes the BIC among DAGs consistent with $\pi$.

Next, we prove that for any two permutations $\tau$ and $\pi$ where $G_\tau$ is connected to $G_\pi$ by precisely one covered arrow reversal, in the limit of large $n$,
$$\mathrm{BIC}(G_\tau; \hat X) > \mathrm{BIC}(G_\pi; \hat X) \Leftrightarrow |G_\tau| < |G_\pi|, \quad \text{and} \quad \mathrm{BIC}(G_\tau; \hat X) = \mathrm{BIC}(G_\pi; \hat X) \Leftrightarrow |G_\tau| = |G_\pi|.$$
It suffices to prove
$$|G_\tau| = |G_\pi| \Rightarrow \mathrm{BIC}(G_\tau; \hat X) = \mathrm{BIC}(G_\pi; \hat X) \quad \text{(B.1)}$$
and
$$|G_\tau| < |G_\pi| \Rightarrow \mathrm{BIC}(G_\tau; \hat X) > \mathrm{BIC}(G_\pi; \hat X). \quad \text{(B.2)}$$
Equation (B.1) is easily seen to be true using [5, Theorem 2], as $G_\pi$ and $G_\tau$ are Markov equivalent. For Equation (B.2), by Theorem 11, since $G_\tau \leq G_\pi$ there exists a Chickering sequence from $G_\tau$ to $G_\pi$ with at least one edge addition and several covered arrow reversals. For the covered arrow reversals, the BIC remains the same, since the involved DAGs are equivalent. For the edge additions, the score necessarily decreases in the limit of large $n$ due to the increase in the number of parameters. This follows from the consistency of the BIC and the fact that the DAGs before and after an edge addition are both I-MAPs of $P$. In this case, the path taken in the triangle SP algorithm using the BIC score is the same as in the original triangle SP algorithm. Since the triangle SP algorithm is consistent, it follows that the triangle SP algorithm with the BIC score is also consistent.
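Since the proof above leans on consistency and score equivalence of the BIC, a minimal sketch of the Gaussian BIC used in such comparisons may be helpful. This is our own illustration, with per-node least squares regressions and parameters counted as one coefficient per parent plus a noise variance, not the paper's implementation.

```python
import numpy as np

def gaussian_bic(X, parents):
    """BIC of a linear Gaussian DAG, decomposed over nodes: each node j
    is regressed on its parent set parents[j], and
    BIC = sum_j [loglik_j - (|Pa(j)| + 1) * log(n) / 2].
    Example: parents = {0: [], 1: [0], 2: [0, 1]} encodes
    0 -> 1, 0 -> 2, 1 -> 2 (zero-mean data assumed, so no intercept)."""
    n, p = X.shape
    score = 0.0
    for j in range(p):
        pa = list(parents[j])
        if pa:
            beta, *_ = np.linalg.lstsq(X[:, pa], X[:, j], rcond=None)
            resid = X[:, j] - X[:, pa] @ beta
        else:
            resid = X[:, j]
        sigma2 = np.mean(resid ** 2)  # MLE of the noise variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score += loglik - 0.5 * (len(pa) + 1) * np.log(n)
    return score
```

In particular, reversing a covered arrow (and swapping the corresponding parent sets) yields a Markov equivalent DAG and leaves this score unchanged up to numerical error, while adding an arrow to an I-MAP lowers the score for large $n$, matching (B.1) and (B.2).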
B.6. Proof of Theorem 21.

Let $P$ be a semigraphoid, and let $\mathcal{C}$ denote the CI relations entailed by $P$. Suppose for the sake of contradiction that Algorithm 2 is consistent with respect to $\mathcal{C}$, but $P$ fails to satisfy adjacency faithfulness with respect to a sparsest DAG $G_{\pi^*}$. Then there exists some CI relation $i \perp\!\!\!\perp j \mid S$ in $\mathcal{C}$ such that $i \to j$ is an arrow of $G_{\pi^*}$. Now let $\pi$ be any permutation respecting the concatenated ordering $iSjT$, where $T = [p] \setminus (\{i, j\} \cup S)$. Then our goal is to show that any covered arrow reversal in $G_\pi$ that results in a permutation DAG $G_\tau$ with strictly fewer edges than $G_\pi$ must satisfy the condition that $i \to j$ is not an arrow in $G_\tau$.

First, we consider the possible types of covered arrows that may exist in $G_\pi$. To list these, it will be helpful to look at the diagram depicted in Figure 12. Notice first that we need not consider any trivially covered arrows, since such edge reversals do not decrease the number of arrows in the permutation DAGs. Any edge $i \to S$ or $i \to T$ is trivially covered, so the possible cases of non-trivially covered arrows are exactly the covered arrows given in Figure 13. In this figure, each covered arrow to be considered is labeled with the symbol $\star$. Notice that the claim is trivially true for cases (1)-(4); i.e., any covered arrow reversal resulting in edge deletions produces a permutation DAG $G_\tau$ for which $i \to j$ is not an arrow of $G_\tau$.

Fig 12. This diagram depicts the possible arrows between the node sets $\{i\}$, $\{j\}$, $S$, and $T$ for the minimal I-MAP $G_\pi$ considered in the proof of Theorem 21.

Fig 13. The possible non-trivially covered arrows, cases (1)-(7), between the node sets $\{i\}$, $\{j\}$, $S$, and $T$ for the permutation DAG $G_\pi$ considered in the proof of Theorem 21 are labeled with the symbol $\star$. Here, we take $s, s', s'' \in S$ and $t, t' \in T$.

Case (5) is also easy to see. Recall that $\pi = i s_1 \cdots s_k j t_1 \cdots t_m$, where $S := \{s_1, \ldots, s_k\}$ and $T := \{t_1, \ldots, t_m\}$, and that reversing the covered arrow in case (5) results in an edge deletion. Since $s \to t$ is covered, there exists a linear extension $\tau$ of $G_\pi$ such that $s$ and $t$ are adjacent in $\tau$. Thus, either $j$ precedes both $s$ and $t$ or $j$ follows both $s$ and $t$ in $\tau$. Recall also that by [18, Theorem 7.4] we know $G_\tau = G_\pi$. Thus, reversing the covered arrow $s \to t$ in $G_\tau = G_\pi$ does not add in $i \to j$.

To see that the claim also holds for cases (6) and (7), we utilize the semigraphoid property (SG2). It suffices to prove the claim for case (6). So suppose that reversing the $\star$-labeled edge $j \to t$ from case (6) results in a permutation DAG with fewer arrows. We simply want to see that $i \to j$ is still a non-arrow in this new DAG. Assuming once more that $\pi = i s_1 \cdots s_k j t_1 \cdots t_m$, by [18, Theorem 7.4] we can, without loss of generality, pick $t := t_1$. Thus, since $i \perp\!\!\!\perp j \mid S$ and $j \to t$ is covered, then $i \perp\!\!\!\perp t \mid S \cup \{j\}$. By the semigraphoid property (SG2), we then know that $i \perp\!\!\!\perp j \mid S \cup \{t\}$. Thus, the covered arrow reversal $j \leftarrow t$ produces a permutation $\tau = i s_1 \cdots s_k t_1 j t_2 \cdots t_m$, and so $i \to j$ is not an arrow in $G_\tau$. This completes all cases of the proof.

APPENDIX C: PROOFS FOR RESULTS ON THE UNIFORM CONSISTENCY OF THE GREEDY SP ALGORITHM

C.1. Proof of Lemma 24.
Let $\mathrm{Pa}_{G_\pi}(j)$ be the set of parent nodes of node $j$ in the DAG $G_\pi$. Let $k \in S$, let $P_1$ denote the joint distribution of $(X_i, X_j, X_k)$ conditioned on $X_{S \setminus \{k\}}$, and let $P_2$ denote the joint distribution of $(X_i, X_j, X_k)$ conditioned on $X_{S' \setminus \{k\}}$. With this notation, the claimed statements boil down to

(a) $j \perp\!\!\!\perp k$ under $P_1$ $\Leftrightarrow$ $j \perp\!\!\!\perp k$ under $P_2$;
(b) $i \perp\!\!\!\perp k \mid j$ under $P_1$ $\Leftrightarrow$ $i \perp\!\!\!\perp k \mid j$ under $P_2$.

Note that
$$P_1(X_i, X_j, X_k) := P(X_i, X_j, X_k \mid X_{S \setminus \{k\}}) = P(X_i, X_j \mid X_S)\, P_1(X_k).$$
Similarly, the Markov assumption of $P$ with respect to $G_\pi$ implies that
$$P_2(X_i, X_j, X_k) = P(X_i, X_j \mid X_{S'})\, P_2(X_k) = P(X_i, X_j \mid X_S)\, P_2(X_k).$$
Hence, $P_1(X_j \mid X_k) = P_2(X_j \mid X_k)$ and $P_1(X_i \mid X_j, X_k) = P_2(X_i \mid X_j, X_k)$. This completes the proof, since $X_a \perp\!\!\!\perp X_b \mid X_C$ under some distribution $\tilde P$ if and only if $\tilde P(X_a \mid X_b = z_1, X_C) = \tilde P(X_a \mid X_b = z_2, X_C)$ for all $z_1$ and $z_2$ in the sample space.

C.2. Proof of Theorem 23.
For any initial permutation $\pi$, we let $L_\pi$ denote the set of tuples $(i, j, S)$ used for partial correlation testing in the estimation of the initial permutation DAG $G_\pi$. That is,
$$L_\pi := \big\{ (i, j, S) : S = \{k : \pi(k) \leq \max\{\pi(i), \pi(j)\}\} \setminus \{i, j\} \big\}.$$
Given a DAG $G$ and a node $i$, we let $\mathrm{adj}(G, i)$ denote the collection of nodes that share an arrow with node $i$ in $G$. We then let $K_\pi$ denote the collection of tuples $(i, j, S)$ that will be used in the partial correlation testing done in step (2) of Algorithm 6; i.e.,
$$K_\pi := \bigcup_{(i,j) \in G_\pi} \big\{ (k, l, S) : k \in \{i, j\},\; l \in \mathrm{adj}(G_\pi, i) \cap \mathrm{adj}(G_\pi, j),\; S \neq \emptyset, \text{ and } S \subseteq \{\mathrm{adj}(G_\pi, i) \cap \mathrm{adj}(G_\pi, j)\} \cup \{i, j\} \big\}.$$
It follows from Lemma 24 that, when flipping a covered arrow $i \to j$ in a permutation DAG $G_{\tilde\pi}$, it is sufficient to calculate the partial correlations $\rho_{a,b|C}$, where
$$(a, b, C) \in \big\{ (a, b, C) : a = i,\; b \in \mathrm{Pa}_i(G_{\tilde\pi}),\; C = \mathrm{Pa}_i(G_{\tilde\pi}) \cup \{j\} \setminus \{b\} \big\} \cup \big\{ (a, b, C) : a = j,\; b \in \mathrm{Pa}_i(G_{\tilde\pi}),\; C = \mathrm{Pa}_i(G_{\tilde\pi}) \setminus \{b\} \big\}.$$
In particular, we have that $(a, b, C) \in K_{\tilde\pi}$.

Because of the skeletal inclusion $G_{\tilde\pi} \subseteq G_\pi$, it then follows that $K_{\tilde\pi} \subseteq K_\pi$, and so $(a, b, C) \in K_\pi$. In addition, for all partial correlations $\rho_{a,b|C}$ used for constructing the initial DAG $G_\pi$, we know that $(a, b, C) \in L_\pi$. Therefore, for all partial correlations used in the algorithm, we have $(a, b, C) \in K_\pi \cup L_\pi$.

Let $E_{i,j|S}$ be the event that an error occurs when doing the partial correlation test of $i \perp\!\!\!\perp j \mid S$, and suppose that $\alpha$ is the significance level when testing this partial correlation. Then $E_{i,j|S}$ corresponds to
$$\sqrt{n - |S| - 3}\;|\hat z_{i,j|S}| > \Phi^{-1}(1 - \alpha/2) \text{ when } z_{i,j|S} = 0; \qquad \sqrt{n - |S| - 3}\;|\hat z_{i,j|S}| \leq \Phi^{-1}(1 - \alpha/2) \text{ when } z_{i,j|S} \neq 0.$$
With the choice of $\alpha_n = 2(1 - \Phi(n^{1/2} c_n/2))$ we have
$$P[E_{i,j|S}] \leq P\big[\,|\hat z_{i,j|S} - z_{i,j|S}| > (n/(n - |S| - 3))^{1/2}\, c_n/2\,\big].$$
Now, by (A2) we have that $|S| \leq p = O(n^a)$. Hence it follows that
$$P[E_{i,j|S}] \leq P\big[\,|\hat z_{i,j|S} - z_{i,j|S}| > c_n/2\,\big].$$
Then Lemma 25, together with the fact that $\log\big(\frac{4 - \delta^2}{4 + \delta^2}\big) \sim -\delta^2/2$ as $\delta \to 0$, yields
$$P[E_{i,j|S}] \leq O(n - |S|)\exp(-c'(n - |S|)c_n^2) \leq O\big(\exp(\log n - c\,n^{1-2\ell})\big) \quad \text{(C.1)}$$
for some constants $c, c' > 0$. Since the DAG estimated using Algorithm 6 is not consistent when at least one of the partial correlation tests is not consistent, the probability of inconsistency can be estimated as follows:
$$P[\text{an error occurs in Algorithm 6}] \leq P\Big[\bigcup_{(i,j,S) \in K_{\hat\pi} \cup L_{\hat\pi}} E_{i,j|S}\Big] \leq |K_{\hat\pi} \cup L_{\hat\pi}|\Big(\sup_{(i,j,S) \in K_{\hat\pi} \cup L_{\hat\pi}} P(E_{i,j|S})\Big). \quad \text{(C.2)}$$
Next note that assumption (A3) implies that the size of the set $\mathrm{adj}(G_\pi, i) \cup \mathrm{adj}(G_\pi, j)$ is at most $2d_\pi$. Therefore, $|K_\pi| \leq p^2 \cdot d_\pi \cdot 2^{2d_\pi}$ and $|L_\pi| \leq p^2$. Thus, we see that
$$|K_{\hat\pi} \cup L_{\hat\pi}| \leq |K_{\hat\pi}| + |L_{\hat\pi}| \leq (2^{2d_\pi} \cdot d_\pi + 1)\,p^2.$$
Therefore, the left-hand side of inequality (C.2) is upper-bounded by
$$(2^{2d_\pi} \cdot d_\pi + 1)\,p^2\Big(\sup_{(i,j,S) \in K_{\hat\pi} \cup L_{\hat\pi}} P(E_{i,j|S})\Big).$$
Combining this observation with the upper bound computed in inequality (C.1), we obtain that the left-hand side of inequality (C.2) is upper-bounded by
$$(2^{2d_\pi} \cdot d_\pi + 1)\,p^2\,O(\exp(\log n - c\,n^{1-2\ell})) \leq O\big(\exp(2d_\pi \log 2 + 2\log p + \log d_\pi + \log n - c\,n^{1-2\ell})\big).$$
By assumptions (A3) and (A4) it follows that $n^{1-2\ell}$ dominates all terms in this bound. Thus, we conclude that
$$P[\text{the estimated DAG is consistent}] \geq 1 - O(\exp(-c\,n^{1-2\ell})).$$

C.3. Proof of Lemma 28.
First, we prove the following: for $i, j \notin \mathrm{adj}(G_S, k)$, $(i, j)$ is an edge in $G_{S\setminus\{k\}}$ iff $(i, j)$ is an edge in $G_S$. Suppose at least one of $i$ or $j$ is not adjacent to node $k$ in $G_S$. Without loss of generality, we assume $i$ is not adjacent to $k$ in $G_S$; this implies that $\rho_{i,k|S\setminus\{i,k\}} = 0$. To prove the desired result we must show that
$$\rho_{i,j|S\setminus\{i,j\}} = 0 \iff \rho_{i,j|S\setminus\{i,j,k\}} = 0.$$
To show this equivalence, first suppose that $\rho_{i,j|S\setminus\{i,j\}} = 0$ but $\rho_{i,j|S\setminus\{i,j,k\}} \neq 0$. This implies that there is a path $P$ between $i$ and $j$ through $k$ such that nodes $i$ and $j$ are d-connected given $S \setminus \{i, j, k\}$ and d-separated given $S \setminus \{i, j\}$. This implies that $k$ is a non-collider along $P$. Define $P_i$ as the subpath of $P$ connecting $i$ and $k$, and $P_j$ as the subpath of $P$ connecting $j$ and $k$. Then the nodes $i$ and $j$ are d-connected to $k$ given $S \setminus \{i, k\}$ and $S \setminus \{j, k\}$, respectively, using $P_i$ and $P_j$. Since $j$ is not on $P_i$, clearly $i$ and $k$ are also d-connected given $S \setminus \{i, j, k\}$ through $P_i$, and the same holds for $j$.

Conversely, suppose that $\rho_{i,j|S\setminus\{i,j,k\}} = 0$ but $\rho_{i,j|S\setminus\{i,j\}} \neq 0$. Then there exists a path $P$ that d-connects nodes $i$ and $j$ given $S \setminus \{i, j\}$, while $i$ and $j$ are d-separated given $S \setminus \{i, j, k\}$. Thus, one of the following two cases must occur:

1. $k$ is a collider on the path $P$, or
2. some node $\ell \in \mathrm{an}(S \setminus \{i, j\}) \setminus \mathrm{an}(S \setminus \{i, j, k\})$ is a collider on $P$.

For case (2), there must exist a path $\ell \to \cdots \to k$ that d-connects $\ell$ and $k$ given $S \setminus \{i, j, k\}$, with $\ell \notin S$. Such a path exists since $\ell$ is an ancestor of $k$ and not an ancestor of any other node in $S \setminus \{i, j, k\}$. So in both cases $i$ and $k$ are also d-connected given $S \setminus \{i, j, k\}$ using a path that does not contain the node $j$. Hence, $i$ and $k$ are also d-connected given $S \setminus \{i, k\}$, a contradiction.

Next, we prove that for $i, j \in \mathrm{adj}(G_S, k)$, if $(i, j)$ is not an edge in $G_S$, then $(i, j)$ is an edge in $G_{S\setminus\{k\}}$. Since $i \in \mathrm{adj}(G_S, k)$, there exists a path $P_i$ that d-connects $i$ and $k$ given $S \setminus \{i, k\}$, and similarly for $j$. Using the same argument as above, $i$ and $j$ are also d-connected to $k$ given $S \setminus \{i, j, k\}$ using $P_i$ and $P_j$, respectively. Defining $P$ as the path that combines $P_i$ and $P_j$, the node $k$ must be a non-collider along $P$, as otherwise $i$ and $j$ would be d-connected given $S \setminus \{i, j\}$ and $(i, j)$ would be an edge in $G_S$. Since $k$ is a non-collider and $k \notin S \setminus \{i, j, k\}$, the path $P$ d-connects $i$ and $j$ given $S \setminus \{i, j, k\}$, and so $(i, j)$ is an edge in $G_{S\setminus\{k\}}$.

C.4. Proof of Theorem 27.
In the oracle setting, there are two main differences between Algorithm 7 and the minimum degree algorithm. First, Algorithm 7 uses partial correlation testing to construct a graph, while the minimum degree algorithm uses the precision matrix $\Theta$. Second, Algorithm 7 only updates the graph based on the partial correlations among the neighbors of the eliminated node.

Let $\Theta_S$ denote the precision matrix of the marginal distribution over the variables $\{X_i : i \in S\}$. Since the marginal distribution is Gaussian, the $(i,j)$-th entry of $\Theta_S$ is nonzero if and only if $\rho_{i,j|S\setminus\{i,j\}} \neq 0$. Thus, to prove that Algorithm 7 and the minimum degree algorithm are equivalent, it suffices to show the following: let $G_S$ be an undirected graph with edges corresponding to the nonzero entries of $\Theta_S$; then for any node $k$, the graph $G_{S\setminus\{k\}}$ constructed as defined in Algorithm 7 has edges corresponding to the nonzero entries of $\Theta_{S\setminus\{k\}}$. To see that this is indeed the case, note that by Lemma 28, if $G_S$ has already been estimated, then nodes $i$ and $j$ are connected in $G_{S\setminus\{k\}}$ if and only if $\rho_{i,j|S\setminus\{i,j,k\}} \neq 0$. Finally, since the marginal distribution over $S$ is multivariate Gaussian, the $(i,j)$-th entry of $\Theta_{S\setminus\{k\}}$ is nonzero if and only if $\rho_{i,j|S\setminus\{i,j,k\}} \neq 0$.

C.5. Proof of Theorem 29.
Let $P_{\mathrm{oracle}}(\hat\pi)$ denote the probability that $\hat\pi$ is output by Algorithm 7 in the oracle setting, and let $N_{\hat\pi}$ denote the set of tuples $(i, j, S)$ for which a partial correlation test has to be performed. Then $|N_{\hat\pi}| \leq O(p\,d_{\hat\pi}^2)$, where $d_{\hat\pi}$ is the maximal degree of the corresponding minimal I-MAP $G_{\hat\pi}$. Therefore, using the same arguments as in the proof of Theorem 23, we obtain:
$$P[\hat\pi \text{ is generated by Algorithm 7}] \geq P_{\mathrm{oracle}}(\hat\pi)\, P[\text{all hypothesis tests for generating } \hat\pi \text{ are consistent}]$$
$$\geq P_{\mathrm{oracle}}(\hat\pi)\Big(1 - O(p\,d_{\hat\pi}^2)\sup_{(i,j,S) \in N_{\hat\pi}} P(E_{i,j|S})\Big)$$
$$\geq P_{\mathrm{oracle}}(\hat\pi)\big(1 - O(\exp(2\log d_{\hat\pi} + \log p + \log n - c'\,n^{1-2\ell}))\big)$$
$$\geq P_{\mathrm{oracle}}(\hat\pi)\big(1 - O(\exp(-c\,n^{1-2\ell}))\big).$$
Let $\Pi$ denote the set of all possible output permutations of the minimum degree algorithm applied to $\Theta$. Then
$$P[\text{Algorithm 7 outputs a permutation in } \Pi] \geq \sum_{\hat\pi \in \Pi} P[\hat\pi \text{ is output by Algorithm 7}] \geq 1 - O(\exp(-c\,n^{1-2\ell})),$$
which completes the proof.

L. Solus
Matematik
KTH Royal Institute of Technology
Stockholm, Sweden
E-mail: [email protected]
Y. Wang
Laboratory for Information and Decision Systems and Institute for Data, Systems and Society
Massachusetts Institute of Technology
Cambridge, MA, U.S.A.
E-mail: [email protected]
L. Matejovicova
IST Austria
Klosterneuburg, Austria
E-mail: [email protected]