Complexity analysis of Bayesian learning of high-dimensional DAG models and their equivalence classes
Quan Zhou and Hyunwoong Chang, Department of Statistics, Texas A&M University
Abstract
We consider MCMC methods for learning equivalence classes of sparse Gaussian DAG models when p = e^{o(n)}. The main contribution of this work is a rapid mixing result for a random walk Metropolis-Hastings algorithm, which we prove using a canonical path method. It reveals that the complexity of Bayesian learning of sparse equivalence classes grows only polynomially in n and p, under some common high-dimensional assumptions. Further, a series of high-dimensional consistency results is obtained by the path method, including the strong selection consistency of an empirical Bayes model for structure learning and the consistency of a greedy local search on the restricted search space. Rapid mixing and slow mixing results for other structure-learning MCMC methods are also derived. Our path method and mixing time results yield crucial insights into the computational aspects of high-dimensional structure learning, which may be used to develop more efficient MCMC algorithms.

A directed acyclic graph (DAG) encodes a set of conditional independence (CI) relations among node variables, which can be read off using the “d-separation” criterion [Pearl, 1988]. Structure learning of DAG models from observational data plays a fundamental role in causal inference and has found many applications in machine learning and statistical data analysis [Koller and Friedman, 2009]. In genomics, for example, a DAG is a convenient device for conducting pathway analysis and inferring interactions among a huge number of genes or proteins [Maathuis et al., 2010, Gao and Cui, 2015].

Two DAGs with different edge sets can encode the same set of CI relations, in which case we say both belong to the same (Markov) equivalence class. Any equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG), a chain graph consisting of directed and/or undirected edges; a CPDAG is also called an essential graph [Andersson et al., 1997]. A Gaussian DAG model represents a set of multivariate normal distributions that satisfy the CI constraints encoded by the DAG. Due to normality, Markov equivalence further implies distributional equivalence [Geiger and Heckerman, 2002], and thus observational data alone cannot distinguish between Markov equivalent DAGs. We consider the following structure learning problem in this paper: given n i.i.d. observations from a p-variate DAG-perfect normal distribution, estimate the equivalence class of the underlying DAG model. This is essentially a model selection problem where the model space is a collection of p-vertex equivalence classes. We are most interested in high-dimensional settings where p grows much faster than n, and thus the true DAG model is assumed to be sparse so that its equivalence class is identifiable.

The structure learning problem can be greatly simplified if the topological ordering of the variables is known. By ordering, we mean a permutation σ ∈ S_p, where S_p denotes the symmetric group on {1, ..., p}, such that for any i < j, an edge connecting σ(i) and σ(j) is always directed as σ(i) → σ(j). Such an ordering always exists (but may not be unique) for a DAG due to acyclicity. If the ordering is given, each DAG represents a unique equivalence class. Hereinafter, we refer to structure learning with known ordering as DAG selection and reserve the term “structure learning” for learning equivalence classes when the ordering is not specified.
For Bayesian structure learning methods, the goal is usually to compute a posterior distribution on the model space, which we denote by π_n, rather than find a single best model. Since exact evaluation of π_n is impossible in most cases, Markov chain Monte Carlo (MCMC) methods are often invoked to sample from π_n. Random walk Metropolis-Hastings (MH) algorithms are a popular choice. In each iteration, a neighboring state is randomly proposed, and the acceptance probability is calculated using the Metropolis rule so that samples form a Markov chain whose stationary distribution is given by π_n. Conceptually, it is helpful to think of such an MH algorithm as a stochastic local search and compare it with non-Bayesian score-based local search methods [Drton and Maathuis, 2017, Scutari et al., 2019].

A local search algorithm for structure learning (or model selection in general) has three key components: a search space, a neighborhood relation and a scoring criterion. For structure learning, the search space can be the set of all p-vertex DAGs or their equivalence classes (i.e., CPDAGs). To increase search efficiency, one can use methods like CI tests to obtain a much smaller restricted space. Such an approach is known as a hybrid algorithm. The neighborhood relation defines which state we may move to in the next iteration. The complexity of a search algorithm largely depends on the size of the search space and how we choose the neighborhood relation. Regarding the scoring criterion, one can construct it using some penalized log-likelihood function or let it be the logarithm of the un-normalized posterior probability. The search can be either deterministic (e.g., a greedy search) or stochastic (e.g., a random walk MH algorithm). A greedy search can also be used to find the maximum a posteriori (MAP) estimate for a Bayesian model. But this is usually less desirable because π_n can characterize the uncertainty of structure learning.

The well-known greedy equivalence search (GES), proposed by Meek [1997] and Chickering [2002b], is a score-based two-stage greedy algorithm. Though GES is defined on the space of equivalence classes, the neighborhood relation in GES is constructed by applying single-edge additions or deletions to all member DAGs in the equivalence class. It was noted in Nandy et al. [2018] that GES and its hybrid versions tend to have better estimation performance than the PC algorithm, which is the most widely used constraint-based structure learning method [Kalisch and Bühlmann, 2007]. However, methods like GES have been less popular in practice, and one possible reason is that it is theoretically unclear how these algorithms scale to huge data sets. Nandy et al. [2018] were the first to prove the high-dimensional consistency of GES using a condition called strong faithfulness. Though it is known that strong faithfulness is a restrictive assumption [Uhler et al., 2013], such conditions appear to be necessary for proving high-dimensional consistency results for many search methods, especially those based on CI tests.

For Bayesian approaches, various random walk MH algorithms have been proposed on different search spaces. The famous structure MCMC algorithm searches the DAG space using addition, deletion and reversal of single edges [Madigan et al., 1995, Giudici and Castelo, 2003].
This algorithm is straightforward to implement (one only needs to check acyclicity when proposing moves) but might not be efficient since it does not take into account Markov equivalence; see Grzegorczyk and Husmeier [2008] and Su and Borsuk [2016] for improvements and Goudie and Mukherjee [2016] for a Gibbs sampling implementation. MH algorithms defined on the CPDAG space tend to be more complicated, mostly because of the difficulty in constructing a “well-behaved” neighborhood relation. In most cases, we want the neighborhood relation of a search method to be symmetric and the associated neighborhood graph to be connected (see Section 2), but even these basic properties can be demanding to establish for algorithms based on CPDAG operations. See Andersson et al. [1997], Perlman [2001], Munteanu and Bendou [2001], Chickering [2002a], He et al. [2013] for how to choose a proper set of operators on the CPDAG space, and Madigan et al. [1996], Peña [2007], Castelletti et al. [2018] for MCMC implementations. Another important class of structure-learning MH algorithms is defined on the order space S_p [Friedman and Koller, 2003, Ellis and Wong, 2008, Agrawal et al., 2018]. The posterior probability of an ordering can be calculated by either a deterministic or a sampling method. See also Niinimaki et al. [2012] and Kuipers and Moffa [2017] for MCMC methods using partial orders. To the best of our knowledge, complexity results for high-dimensional structure learning via MCMC sampling are not available in the literature.

The primary goal of this paper is to study the complexity of MCMC methods for structure learning in high-dimensional settings. To impose sparsity, we only consider DAG models with both the maximum in-degree and maximum out-degree bounded by some constants which may grow slowly with n. This is a natural setup, which facilitates the interpretability of the model and does not require much prior knowledge or CI tests. The scoring criterion is derived using an empirical Bayes approach, which is based on the empirical DAG selection model proposed by Lee et al. [2019]. We prove that our empirical model yields the same marginal fractional likelihood for Markov equivalent DAGs. The focus of our mixing time analysis is a new random walk MH algorithm for sampling equivalence classes, which we call RW-GES. The name reflects that the proposal scheme is inspired by the GES algorithm. But in addition to single-edge additions and deletions (for all member DAGs), we consider another type of move called “swap”, which replaces an edge ℓ → j with k → j for some proper j, k and ℓ in a DAG. Mixing rates of some other MH algorithms for structure learning are also analyzed or discussed (but order-based MCMC methods are not considered in this paper).

The main contribution of this work is a high-dimensional rapid mixing result for the RW-GES sampler, which essentially says that, with high probability, the number of iterations needed to find the true equivalence class grows only polynomially in both n and p. We are not aware of any other provable mixing time bounds for structure-learning MCMC algorithms in high-dimensional settings. Moreover, our proof for the rapid mixing of RW-GES yields three intermediate results which may be equally important. The first one is the strong selection consistency of our empirical model for learning equivalence classes. For Bayesian methods, such high-dimensional consistency results have only been established lately for DAG selection problems with known ordering [Cao et al., 2019, Lee et al., 2019].
Second, one may convert RW-GES to a deterministic greedy search using the same neighborhood relation, which can be seen as a variant of GES and is consistent on the restricted search space under our high-dimensional setting. Third, we show that for sparse DAG selection with both in-degree and out-degree constraints (which implies parent sets are not independent a posteriori), one can use an add-delete-swap MH algorithm for sampling DAGs, which is rapidly mixing under very mild high-dimensional assumptions.

For comparison with RW-GES, we first consider an idealized MH algorithm for sampling DAGs, which resembles the classical structure MCMC sampler [Giudici and Castelo, 2003] but aims to mimic the behavior of RW-GES. We are only able to prove a similar rapid mixing result by making a highly restrictive assumption on the true model. This is not surprising, since structure MCMC methods have difficulty in traversing huge equivalence classes. When there is a sub-optimal equivalence class, it may create exponentially many local modes in the DAG space. Our theory confirms this intuition. In contrast, RW-GES or other MCMC methods defined on the space of equivalence classes can take advantage of the CPDAG representation of an equivalence class and avoid enumerating all member DAGs. An alternative MH algorithm for sampling equivalence classes is also examined [Castelletti et al., 2018]. The neighborhood set of this sampler is built using six types of simple CPDAG operators constructed in He et al. [2013]. For mixing purposes, this proposal scheme fails to provide enough connectivity in the space of equivalence classes, and we are able to explicitly construct a slow mixing example with fixed p.

All of our main results are developed using a general canonical path method, which can be applied to greedy search algorithms and other model selection problems as well. The key idea is to identify a close-to-optimal search path from any model to the true one by a local analysis of each state in the space. For the sparse structure learning problem we consider, a major and unique challenge is to analyze those models on the “boundary” of the restricted search space, which may easily be local modes if the in-degree and out-degree constraints are not chosen properly. We expect that the search paths we build in this paper can have a very general use in the analysis of structure learning algorithms. From the path method (Theorem 1) we use, it can be seen that, compared with the consistency of a greedy search or that of a Bayesian model selection procedure, the rapid mixing property of a local MH algorithm is often stronger and more informative. For a greedy search to be consistent, one only needs to show that there is no sub-optimal local mode along the search path. But rapid mixing characterizes the overall complexity of the algorithm and requires that the chain cannot get trapped in any sub-optimal local mode in the whole space.

One potential limitation of this work is the use of a strong beta-min condition in our high-dimensional analysis, which is commonly used in the literature and may be weaker than strong faithfulness in some cases [Van de Geer and Bühlmann, 2013, Aragam et al., 2019] (though the two assumptions are not directly comparable). An advantage of our path method is that if the strong beta-min condition fails, we can still show that RW-GES is mixing quickly among DAGs with the same ordering. However, rapid mixing in the whole search space can easily fail.
To overcome this problem, one needs to devise proposal schemes that provide more connectivity in the search space than the operators used in GES or RW-GES. A careful investigation of this issue can shed new light on how to design more efficient MCMC algorithms for high-dimensional structure learning.

In Section 2, we develop some general results for using the canonical path method to study model selection consistency and mixing time of MH algorithms, which may be of independent interest. In Section 3, we introduce the RW-GES sampler and develop related path results. The space of sparse equivalence classes is defined in Section 3.2. In Sections 3.4 and 3.5, we construct canonical paths of RW-GES on the restricted search space using locally optimal add-delete-swap moves. These canonical paths, together with Theorem 1 given in Section 2.2, will be the main tool we use for studying the complexity of Bayesian structure learning. We formally introduce our empirical Gaussian DAG model in Section 4.1 and the high-dimensional assumptions in Section 4.2. Consistency results for sparse DAG selection and sparse structure learning are proved in Section 4.3. Section 5 provides the mixing time results of structure-learning MCMC algorithms. We first prove the rapid mixing of RW-GES in Section 5.1, and then use the same method to establish the rapid mixing of the add-delete-swap sampler for sparse DAG selection in Section 5.2. In Section 5.3, we prove a rapid mixing result of a structure MCMC algorithm, which requires a much stronger assumption than that of RW-GES. A slow mixing example for the equivalence class sampler of Castelletti et al. [2018] is provided in Section 5.4. In Section 6.1, we discuss the advantages of GES-based search algorithms and explain how to efficiently implement RW-GES. Section 6.2 concludes the paper with a discussion on how to conduct mixing time analysis without the strong beta-min condition and devise MCMC algorithms with better mixing properties. Proofs for the canonical paths of RW-GES are given in Section 7. All other technical proofs are relegated to the supplementary material. For readers' convenience, a notation table is given in Supplement A.
In this section, we use Θ = Θ_p to denote a finite model space for a general model selection problem with p variables; for example, each θ ∈ Θ is a unique equivalence class for the structure learning problem. We need to borrow some terminology from the theory of local search [Michiels et al., 2007]. A neighborhood relation on Θ can be uniquely defined by a neighborhood function N : Θ → 2^Θ. We say θ′ is a neighbor of θ if and only if θ′ ∈ N(θ). The neighborhood graph, denoted by (Θ, N), is a directed graph with node set Θ and edge set { (θ, θ′) : θ ∈ Θ, θ′ ∈ N(θ) }. We say the neighborhood relation or N is symmetric if θ ∈ N(θ′) always implies θ′ ∈ N(θ). If N is symmetric, the neighborhood graph is undirected, and we say the neighborhood graph is connected if there exists a path between any two distinct states. The definition of “path” is given below.

Definition 1.
We say a finite sequence (θ_0, θ_1, ..., θ_{k−1}, θ_k) is an N-path (or simply path) from θ to θ′ with length k if (i) θ_0 = θ, θ_k = θ′, (ii) for each i = 1, ..., k, θ_i ∈ N(θ_{i−1}), and (iii) the sequence contains no duplicate states.

Let π denote a posterior distribution on Θ for a Bayesian procedure. Suppose N is symmetric and the graph (Θ, N) is connected. The triple (Θ, N, π) can be used to define a greedy local search or a random walk MH algorithm. For the former, given current state θ, the algorithm always moves to θ̃ = arg max_{θ′ ∈ N(θ) ∪ {θ}} π(θ′). For the random walk MH sampler, we propose new states from N(·) uniformly at random and compute the acceptance probability by the Metropolis rule so that the chain is reversible w.r.t. π. In order that the search is efficient, the neighborhood N(·) should be small and contain states whose (un-normalized) posterior probabilities are easy to evaluate. But at the same time N should provide enough connectivity so that the search cannot get trapped in sub-optimal local modes. For MCMC methods, the convergence rate can be measured by the mixing time, which we define below.

Definition 2 (Mixing time). Let P be an irreducible and aperiodic transition matrix defined on a finite state space Θ, with stationary distribution π. Define its mixing time by

T_mix = max_{θ ∈ Θ} min { t ∈ N : ‖P^t(θ, ·) − π(·)‖_TV ≤ 1/4 },

where ‖·‖_TV denotes the total variation distance, which takes value in [0, 1].

Remark 1. For an MCMC algorithm, if the mixing time of the sampling chain grows at most polynomially in the complexity parameters n (sample size) and p (number of variables), we say the algorithm or the chain is rapidly mixing.

Consider high-dimensional settings where p = p(n) may grow with n. Let θ* ∈ Θ denote the true model. Note that the size of Θ often grows very quickly, and θ* is also allowed to vary with n. Let P* denote the probability measure corresponding to the true model. The following mode of consistency is particularly useful for high-dimensional analysis of Bayesian model selection [Johnson and Rossell, 2012, Narisetty and He, 2014, Cao et al., 2019].

Definition 3 (Strong selection consistency). A Bayesian model selection method is said to have strong selection consistency if π_n(θ*) → 1 in probability with respect to P*.

Remark 2. Observe that strong selection consistency is much stronger than pairwise consistency, which requires π_n(θ)/π_n(θ*) → 0 for any θ ≠ θ* [Moreno et al., 2010]. The path method to be introduced in Section 2.2 enables us to derive strong selection consistency using only polynomial (in p) bounds for posterior probability ratios, when the size of Θ may grow (super-)exponentially fast.

(In this section, we write π instead of π_n, except when defining strong selection consistency, since all other results are non-asymptotic. The constant 1/4 in Definition 2 can be replaced by any fixed value in (0, 1/2) [Levin and Peres, 2017, Chapter 4.5].)
We will see shortly in Remark 4 that the mixing time analysis of an MH algorithm can be simplified if the strong selection consistency holds.
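To make the triple (Θ, N, π) concrete, the following sketch (ours, not code from the paper) implements the two generic search procedures just described for an arbitrary finite model space. The callables `neighbors` and `log_post`, which return N(θ) and log π(θ) up to an additive constant, are placeholders that a concrete model such as variable selection or structure learning would supply.

```python
import math
import random

def greedy_local_search(state, neighbors, log_post, max_iter=10_000):
    """Repeatedly move to the best-scoring state in N(state) union {state}."""
    for _ in range(max_iter):
        best = max(list(neighbors(state)) + [state], key=log_post)
        if log_post(best) <= log_post(state):
            return state                      # current state is a local mode
        state = best
    return state

def random_walk_mh_step(state, neighbors, log_post, rng=random):
    """One random walk Metropolis-Hastings update with a uniform proposal on N(state)."""
    nbrs = list(neighbors(state))
    proposal = rng.choice(nbrs)
    # |N(state)| / |N(proposal)| is the proposal ratio K(proposal, state) / K(state, proposal).
    log_ratio = (log_post(proposal) - log_post(state)
                 + math.log(len(nbrs)) - math.log(len(list(neighbors(proposal)))))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return state
```

Iterating `random_walk_mh_step` produces a chain that is reversible with respect to π, which is the setting analyzed in the rest of this section.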
We propose a general path method for bounding the mixing time of a local MH algorithm and proving the strong selection consistency of a Bayesian model selection procedure. It is based on the canonical path method of Sinclair [1992] and is a generalization of the technique used by Yang et al. [2016] to prove the rapid mixing of a random walk MH algorithm for high-dimensional variable selection.
Definition 4.
A canonical path ensemble on (Θ, N) is a set of N-paths, one (and only one) for each ordered pair of two distinct states in Θ.

Recall θ* ∈ Θ denotes the true model. If (Θ, N) is a connected undirected graph, we can always construct a canonical path ensemble by finding a “canonical transition function” g which generates paths that lead to θ*. This idea plays a key role throughout the paper.

Definition 5.
We say g : Θ → Θ is a canonical transition function on (Θ, N) with fixed point θ* if (i) g(θ*) = θ*; (ii) for any θ ≠ θ*, g(θ) ∈ N(θ) and there exists some finite k such that g^k(θ) = θ*.

Lemma 1.
Suppose N is a symmetric neighborhood function on a finite space Θ and the graph (Θ, N) is connected. Fix some θ* ∈ Θ. There exists a canonical transition function g on (Θ, N) with fixed point θ*. Further, g induces a canonical path ensemble on (Θ, N) such that each canonical path is an N_g-path, where N_g(θ) = { θ′ ∈ Θ : g(θ′) = θ, or g(θ) = θ′ }.

Proof. See Supplement B.2.

The main result of this section is provided in Theorem 1, which shows that the canonical transition function g can be used to prove a few interesting results if we can bound the ratio π(g(θ))/π(θ) for any θ ≠ θ*. Part (i) can be used to show the consistency of a greedy search; part (ii) implies the strong selection consistency of the Bayesian model; part (iii) yields a bound on the mixing time for some MH algorithm.

Theorem 1.
Let P be an ergodic transition matrix defined on a finite space Θ such that P is reversible with respect to some distribution π and all eigenvalues of P are non-negative. Let N(θ) = { θ′ ≠ θ : P(θ, θ′) > 0 } for each θ ∈ Θ. Let g be a canonical transition function on (Θ, N) with fixed point θ* ∈ Θ as described in Definition 5. Define g^{−1}(θ) = { θ′ : g(θ′) = θ }. Consider the following conditions, where p ≥ 2 and t_1, t_2, t_3 ≥ 0 are some constants.

(1) For any θ ∈ Θ, |g^{−1}(θ)| ≤ p^{t_1}, where |·| denotes the cardinality of a set.
(2) π(g(θ))/π(θ) ≥ p^{t_2} for every θ ≠ θ*.
(3) P(θ, g(θ)) ≥ p^{−t_3} for every θ ≠ θ*.

Define ℓ_max = max_{θ ≠ θ*} min { k ≥ 1 : g^k(θ) = θ* }. The following statements hold.

(i) If condition (2) holds for some t_2 > 0, the greedy local search defined by (Θ, N, π) always returns θ*.
(ii) If conditions (1) and (2) hold for some t_2 > t_1, then π(θ*) ≥ 1 − p^{−(t_2 − t_1)}.
(iii) If all three conditions hold and t_2 > t_1, the mixing time of P can be bounded by

T_mix ≤ [ℓ_max p^{t_1 + t_3} / (1 − p^{−(t_2 − t_1)})] log { 1 / min_{θ ∈ Θ} π(θ) }.

Proof.
See Supplement B.3.
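As a toy illustration of Lemma 1 and the three conditions of Theorem 1 (our own sketch, not part of the paper), one can build a canonical transition function g by taking, for every state, its parent in a breadth-first search tree of the neighborhood graph rooted at θ*, and then inspect the quantities that conditions (1)–(3) control.

```python
from collections import deque

def bfs_transition_function(states, neighbors, theta_star):
    """g(theta) = parent of theta in a BFS tree rooted at theta_star, g(theta_star) = theta_star."""
    g = {theta_star: theta_star}
    queue = deque([theta_star])
    while queue:
        cur = queue.popleft()
        for nb in neighbors(cur):
            if nb not in g:
                g[nb] = cur                  # moving to g[nb] is one step closer to theta_star
                queue.append(nb)
    assert len(g) == len(states), "the neighborhood graph must be connected"
    return g

def theorem1_quantities(g, pi, theta_star):
    """Return max |g^{-1}(theta)|, min pi(g(theta))/pi(theta), and ell_max for a given g."""
    preimages = {}
    for theta in g:
        if theta != theta_star:
            preimages.setdefault(g[theta], []).append(theta)
    max_preimage = max(len(v) for v in preimages.values())
    min_ratio = min(pi[g[theta]] / pi[theta] for theta in g if theta != theta_star)
    def steps_to_target(theta):
        k = 0
        while theta != theta_star:
            theta, k = g[theta], k + 1
        return k
    ell_max = max(steps_to_target(theta) for theta in g)
    return max_preimage, min_ratio, ell_max
```

Conditions (1) and (2) ask that `max_preimage` be at most p^{t_1} and `min_ratio` at least p^{t_2}; the bound in part (iii) then scales with `ell_max`.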
Remark 3. The reversibility of P implies that the neighborhood relation defined by N is symmetric and g^{−1}(θ) ⊆ N(θ). Thus, condition (1) can often be verified by proving that max_θ |N(θ)| ≤ p^{t_1}. The assumption that P has positive spectrum is unimportant for the mixing time analysis of MCMC algorithms, since one can always consider the lazy version P_lazy = (P + I)/2, where I is the identity transition matrix.

Remark 4. Loosely speaking, T_mix gives the worst estimate for how many iterations it takes for a Markov chain to “enter stationarity”. If π(θ*) is close to 1 for some θ* ∈ Θ, it turns out that entering stationarity essentially means to hit θ*. Formally, we can prove that T_mix is equivalent to the expected hitting time of θ*, up to constant factors, using the result of Peres and Sousi [2015]; see Theorem B2 in Supplement B.1. For an intuitive explanation, observe that if π(θ*) ≈ 1, then P^t(θ, θ*) needs to be sufficiently large so that ‖P^t(θ, ·) − π(·)‖_TV is small, which suggests that hitting θ* is necessary for the chain to “enter stationarity.” On the other hand, the chain regenerates each time it hits θ*, and thus between two successive visits to θ*, the chain has completed an independent cycle. So the length of each cycle gives an estimate for the mixing time.

For high-dimensional problems, the search space is often restricted to some Θ_0 ⊂ Θ, which satisfies certain sparsity constraints. For convenience, we may still use N to refer to a neighborhood relation on Θ_0, which means that the neighborhood of θ ∈ Θ_0 is given by N(θ) ∩ Θ_0. Note that even if π is unimodal on (Θ, N), we may have sub-optimal local modes on (Θ_0, N). The identification of an appropriate transition function g on the restricted search space is critical to the sparse structure learning problem to be considered.

Let [p] = {1, ..., p} and |·| denote the cardinality of a set. A subset of [p] is typically denoted by S. The Hamming distance between two sets S, S′ is denoted by Hd(S, S′) = |S \ S′| + |S′ \ S|.
A DAG G is a pair (V, E) where V is the vertex set and E ⊂ V × V is the set of directed edges. Throughout the paper, we assume V = [p] for DAG models, representing random variables X_1, ..., X_p. Note that (i, i) ∉ E for any i ∈ [p]. Let |G| denote the number of edges in the DAG G; thus, |G| = |E|. We use the notation i → j ∈ G to mean that (i, j) ∈ E and (j, i) ∉ E. The notation i → j ∉ G means that (i, j) ∉ E. For two DAGs G = (V, E) and G′ = (V, E′), we write G′ = G ∪ {i → j} if E′ = E ∪ {(i, j)}, and G′ = G \ {i → j} if E′ = E \ {(i, j)}. We write G = G′ if and only if G and G′ have the same vertex set and edge set. Given a DAG G, we say node i is a parent of node j (and node j is a child of node i) if i → j ∈ G. Let Pa_j(G) = { i ∈ [p] : i → j ∈ G } denote the set of parents of node j; the in-degree of node j is |Pa_j(G)|. The maximum in-degree of G means max_j |Pa_j(G)|. Similarly, let Ch_j(G) = { i ∈ [p] : j → i ∈ G }, and |Ch_j(G)| is called the out-degree of node j. The degree of a node is the sum of its in-degree and out-degree, and the maximum degree of G is max_j |Pa_j(G) ∪ Ch_j(G)|. We may simply write Pa_j if we are not referring to a specific DAG or the underlying DAG is clear from context. The Hamming distance between two DAGs G, G′ is defined by Hd(G, G′) = Σ_{j ∈ [p]} Hd(Pa_j(G), Pa_j(G′)).
An equivalence class of DAGs is typically denoted by E. We always interpret E as a set of DAGs and denote the CPDAG (essential graph) representing it by EG(E) (the CPDAG notation is only used occasionally). The equivalence class of a DAG G is also denoted by [G]; thus, E = [G] if and only if G ∈ E. The set of CI relations encoded by a DAG G or an equivalence class E is denoted by CI(G) or CI(E), respectively. Note that we always have CI(G) = CI([G]).

We say a p-variate distribution μ is Markovian w.r.t. a p-vertex DAG G if all CI relations encoded by G hold for μ. If the converse is also true, we say μ is perfectly Markovian w.r.t. G [Studený, 2006, Chapter 3], and G is a perfect map of μ. If there exists some DAG which is a perfect map of μ, we say μ is DAG-perfect. A DAG G is an independence map (I-map) of a DAG G′ or its equivalence class [G′] if CI(G) ⊆ CI(G′). An I-map G (of some G′) is minimal if any sub-DAG of G is not an I-map (of G′). Given the set CI(G), a minimal I-map of G with ordering σ, which we denote by G_σ, can be uniquely defined as follows. For any i < j, σ(i) → σ(j) ∈ G_σ if and only if nodes σ(i), σ(j) are not conditionally independent given nodes {σ(1), ..., σ(j − 1)} \ {σ(i)} [Solus et al., 2017]. If μ is a p-variate positive measure, a unique minimal I-map of CI(μ) with ordering σ can be constructed in an analogous manner [Koller and Friedman, 2009, Chapter 3.4].

Let G_p denote the space of all p-vertex DAGs, which grows super-exponentially in p. We consider two sparsity constraints for DAG models, one for the maximum in-degree and the other for the maximum out-degree. For d_in, d_out ∈ [p], let

G_p(d_in, d_out) = { G ∈ G_p : max_j |Pa_j(G)| ≤ d_in, and max_j |Ch_j(G)| ≤ d_out }.

The use of d_in is expected as it controls the sparsity of each nodewise variable selection problem (i.e., the estimation of Pa_j). The out-degree constraint is introduced so that we are able to bound the maximum in-degree of any DAG G′ ∈ [G] for some G ∈ G_p(d_in, d_out). One may also use a single constraint for the maximum degree, but for the theoretical analysis to be carried out in this paper, it is more convenient to specify d_in, d_out separately. This setup is appealing to practitioners, since a DAG model with bounded degree is easier to visualize and interpret. Let C_p(d_in, d_out) denote the space of “sparse equivalence classes”, which is defined by

C_p(d_in, d_out) = { [G] : G ∈ G_p(d_in, d_out) }.

Hence, C_p(d_in, d_out) is the set of all equivalence classes that contain at least one member in G_p(d_in, d_out). Each E ∈ C_p(d_in, d_out) can be uniquely represented by a “sparse” CPDAG. The unrestricted space is denoted by C_p = C_p(p, p). Note that we will always define neighborhood relations on the unrestricted space G_p or C_p.

Recall that S_p is the space of all permutations of [p]. For each σ ∈ S_p, let

G^σ_p = { G ∈ G_p : for any i, j ∈ [p], if σ(i) → σ(j) ∈ G, then i < j }

denote the space of all p-vertex DAGs that have topological ordering σ (a DAG may have multiple orderings). Let G^σ_p(d_in, d_out) = G^σ_p ∩ G_p(d_in, d_out) denote the space of sparse DAG models with ordering σ, which is the space we consider for the sparse DAG selection problem.
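For concreteness, a p-vertex DAG can be stored as a tuple of parent sets (Pa_1(G), ..., Pa_p(G)). The short sketch below (our illustration, with 0-based node labels) computes the out-degree, the Hamming distance Hd(G, G′) and membership in the restricted space G_p(d_in, d_out) exactly as defined above.

```python
from typing import FrozenSet, Tuple

DAG = Tuple[FrozenSet[int], ...]            # DAG[j] = Pa_j(G); nodes are 0, ..., p-1

def out_degree(dag: DAG, i: int) -> int:
    """|Ch_i(G)|: the number of parent sets that contain i."""
    return sum(i in pa for pa in dag)

def hamming(g1: DAG, g2: DAG) -> int:
    """Hd(G, G') = sum over j of the symmetric difference |Pa_j(G) ^ Pa_j(G')|."""
    return sum(len(a ^ b) for a, b in zip(g1, g2))

def in_restricted_space(dag: DAG, d_in: int, d_out: int) -> bool:
    """Check whether G belongs to G_p(d_in, d_out)."""
    return (all(len(pa) <= d_in for pa in dag)
            and all(out_degree(dag, i) <= d_out for i in range(len(dag))))

# Example: p = 4, G with edges 0 -> 1 and 1 -> 2.
G = (frozenset(), frozenset({0}), frozenset({1}), frozenset())
print(in_restricted_space(G, d_in=2, d_out=1))   # True
```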
To define an efficient random walk MH algorithm for sampling equivalence classes, we need to construct a proper neighborhood relation on C_p. Instead of a direct construction of the neighborhood of E ∈ C_p using CPDAG operators, we consider operations on all member DAGs in E. Consider the following neighborhoods of a DAG G ∈ G_p:

N_add(G) = { G′ ∈ G_p : G′ = G ∪ {i → j} for some i → j ∉ G },
N_del(G) = { G′ ∈ G_p : G′ = G \ {i → j} for some i → j ∈ G },
N_swap(G) = { G′ ∈ G_p : G′ = (G ∪ {k → j}) \ {ℓ → j} for some k → j ∉ G, ℓ → j ∈ G },
N_ads(G) = N_add(G) ∪ N_del(G) ∪ N_swap(G).

We will refer to N_ads(G) as the add-delete-swap neighborhood of G, which is just a straightforward extension of the add-delete-swap neighborhood used in variable selection problems. For each E ∈ C_p, define

N_ads(E) = { [G′] : G′ ∈ N_ads(G) for some G ∈ E },   (1)

and define the sets N_add(E), N_del(E) and N_swap(E) analogously. For example, E′ ∈ N_add(E) if and only if there exist G ∈ E and G′ ∈ E′ such that G′ ∈ N_add(G). Clearly, N_ads(E) = N_add(E) ∪ N_del(E) ∪ N_swap(E), and the neighborhood relation induced by N_ads is symmetric. As explained in Section 2.1, we can define a random walk MH algorithm using N_ads. We call the algorithm random walk GES (RW-GES), since this neighborhood relation is employed by the famous GES algorithm [Chickering, 2002b]. GES is a two-stage greedy search using the neighborhood N_add(·) in the first stage and N_del(·) in the second. Note that swap moves are not used in GES, but since we will consider structure learning with sparsity constraints, the swap moves are needed to guarantee that “good” edges can always be added to large models that lie on the boundary of the restricted search space. We define the proposal matrix K on the unrestricted space by

K(E, E′) = 1_{N_ads(E)}(E′) / |N_ads(E)|,   for all E, E′ ∈ C_p.

This definition of K is chosen for ease of presentation. In practice, there is no need to count the size of N_ads(E) or enumerate member DAGs in E. States in N_ads(E) can be proposed very efficiently using the operators of the GES algorithm [Chickering, 2002b]. The implementation of RW-GES will be explained in detail in Section 6.1.

In the next section, we consider how to construct a canonical path ensemble for RW-GES on the space C_p(d_in, d_out). This set of canonical paths (or rather the corresponding canonical transition function) will be the key ingredient for the results to be developed in later sections, including the high-dimensional strong selection consistency of Bayesian structure learning.

3.4 Outline of the construction of canonical paths

Let G* ∈ G_p(d_in, d_out) denote a true DAG model and E* = [G*]. We assume a decomposable scoring criterion ψ : G_p → R is given such that

ψ(G) = Σ_{j ∈ [p]} ψ_j(Pa_j(G)),

where for each j, ψ_j : 2^{[p]} → R gives the local score at node j. The main problem we consider in this section is how to find an N_ads-path from any E ∈ C_p(d_in, d_out) to E*. For the paths to be useful, we will construct them using locally optimal add-delete-swap moves which aim to maximize the gain in the node score of some member DAG. Our strategy consists of two main steps.
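Before turning to the path construction, note that the add-delete-swap neighborhood of a single DAG can be enumerated directly, and this is what the sketch below does (our illustration; it works with member DAGs in the parent-set representation of the previous sketch, whereas RW-GES itself operates on equivalence classes via (1)). Acyclicity only needs to be checked for additions and swaps.

```python
def is_acyclic(dag):
    """Kahn-type check: repeatedly peel off nodes whose remaining parent set is empty."""
    remaining = {j: set(pa) for j, pa in enumerate(dag)}
    while remaining:
        sources = [j for j, pa in remaining.items() if not pa]
        if not sources:
            return False                     # every remaining node has a parent: a cycle exists
        for j in sources:
            del remaining[j]
        for pa in remaining.values():
            pa.difference_update(sources)
    return True

def ads_neighbors(dag):
    """All DAGs reachable from `dag` by one edge addition, deletion or swap (N_ads)."""
    p, out = len(dag), []
    for j in range(p):
        pa = dag[j]
        for l in pa:                                         # deletions: always acyclic
            out.append(dag[:j] + (pa - {l},) + dag[j + 1:])
        for k in set(range(p)) - pa - {j}:
            add = dag[:j] + (pa | {k},) + dag[j + 1:]        # addition of k -> j
            if is_acyclic(add):
                out.append(add)
            for l in pa:                                     # swap: replace l -> j with k -> j
                swap = dag[:j] + ((pa - {l}) | {k},) + dag[j + 1:]
                if is_acyclic(swap):
                    out.append(swap)
    return out
```

Restricting this list to DAGs in G_p(d_in, d_out) (using `in_restricted_space` above) gives the neighborhoods used on the restricted search space.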
For every σ ∈ S_p, let G*_σ ∈ G^σ_p be the unique minimal I-map of G* (for the paths we will construct, whether G*_σ is minimal does not matter). Assuming G*_σ ∈ G^σ_p(d_in, d_out), for any G ∈ G^σ_p(d_in, d_out), we consider how to first construct an N_ads-path from G to G*_σ on the space G^σ_p(d_in, d_out). This is the difficult step of our construction due to the sparsity constraints. For the path from G*_σ to G*, we can apply the famous Chickering algorithm; that is, Chickering's constructive proof for Meek's conjecture [Chickering, 2002b]. This two-step procedure suggests that we may measure how “close” an equivalence class is to E* using the function h* defined below. For each E ∈ C_p(d_in, d_out), let

h*(E) = min { Hd(G, G*_σ) + |G*_σ| − |G*| : G ∈ E ∩ G^σ_p(d_in, d_out), σ ∈ S_p },   (2)

where we recall Hd denotes the Hamming distance. Note that for any σ, |G*_σ| ≥ |G*| since G*_σ is an I-map of G*. Thus, h*(E) = 0 if and only if E = E*. We will construct a canonical transition function g on (C_p(d_in, d_out), N_ads) such that h*(g(E)) < h*(E) for any E ≠ E*.

To explain the motivation for the add-delete-swap moves to be used for defining g, consider a DAG selection problem on the space G^σ_p(d_in, p) for some σ ∈ S_p (note there is no out-degree constraint). This is essentially equivalent to p variable selection problems: for each j, we need to estimate the set Pa_j, which takes value in the space M^σ_p(j, d_in) defined by

M^σ_p(j, d_in) = { S ⊆ A^σ_p(j) : |S| ≤ d_in },   A^σ_p(j) = { k ∈ [p] : σ^{−1}(k) < σ^{−1}(j) },   (3)

where A^σ_p(j) represents the set of variables that precede X_j in the ordering σ. The main building block of our canonical paths for structure learning is the following definition of a transition function on the space M^σ_p(j, d_in) which defines an optimal add-delete-swap move for each S ≠ Pa_j(G*_σ).

Definition 6.
For each j, g^σ_j : M^σ_p(j, d_in) → M^σ_p(j, d_in) denotes a transition function constructed as follows. Let G*_σ ∈ G^σ_p(d_in, d_out) be a minimal I-map of G* and S*_{σ,j} = Pa_j(G*_σ). Fix an arbitrary S ∈ M^σ_p(j, d_in), and let T = S*_{σ,j} \ S and R = S \ S*_{σ,j}.

(i) If S = S*_{σ,j}, let g^σ_j(S) = S*_{σ,j}.
(ii) If S*_{σ,j} ⊂ S, let g^σ_j(S) = S \ {ℓ̃} where ℓ̃ = arg max_{ℓ ∈ R} ψ_j(S \ {ℓ}).
(iii) If S*_{σ,j} ⊄ S and |S| < d_in, let g^σ_j(S) = S ∪ {k̃} where k̃ = arg max_{k ∈ T} ψ_j(S ∪ {k}).
(iv) If S*_{σ,j} ⊄ S and |S| = d_in, let g^σ_j(S) = (S ∪ {k̃}) \ {ℓ̃} where (k̃, ℓ̃) = arg max_{(k,ℓ) ∈ T × R} ψ_j((S ∪ {k}) \ {ℓ}).

In case (ii), we say S is (strictly) overfitted; in cases (iii) and (iv), we say S is underfitted. We use g^σ_j(G) to denote the DAG obtained by replacing the parent set of j in G with g^σ_j(Pa_j(G)); that is, Pa_j(g^σ_j(G)) = g^σ_j(Pa_j(G)) and for any i ≠ j, Pa_i(g^σ_j(G)) = Pa_i(G).

Figure 1: An example for the operator g^σ_j. We consider four nodes with ordering σ = (1, 2, 3, 4) and d_in = 3. The DAG G*_σ has three edges, 1 → 2, 2 → 3, 3 → 4. Consider another DAG G which has edges 1 → 3, 1 → 4, 2 → 4. The DAGs g^σ_1(G), g^σ_2(G), g^σ_3(G), g^σ_4(G) are shown above. Observe that the maximum out-degree of G is 2, while the maximum out-degree of g^σ_2(G) is 3.

Remark 5. It is clear from the definition that Hd(g^σ_j(S), S*_{σ,j}) < Hd(S, S*_{σ,j}) if S ≠ S*_{σ,j}. Hence, g^σ_j induces a unique path from S to S*_{σ,j} for each S ∈ M^σ_p(j, d_in). Further, g^σ_j(G) ∈ N_ads(G) and Hd(g^σ_j(G), G*_σ) < Hd(G, G*_σ) if Pa_j(G) ≠ Pa_j(G*_σ). In words, if node j is overfitted in G, g^σ_j(G) is obtained by removing an incoming edge of node j from G. If node j is underfitted, g^σ_j(G) is obtained by adding an incoming edge of node j to G (if the in-degree constraint is violated, remove another incoming edge of node j). An example is provided in Figure 1.

Remark 6. Consider the variable selection problem with model space M^σ_p(j, d_in) and true model S*_{σ,j}. The path from S to S*_{σ,j} induced by g^σ_j can be seen as a search path of the add-delete-swap sampler such that, with high probability, the posterior probability is increasing along the path [Yang et al., 2016]. The underlying rationale also resembles the well-known forward-backward stepwise regression [An et al., 2008]; that is, we always first transform an underfitted model to an overfitted one and then remove redundant variables.

As explained previously, for some G ∈ G^σ_p(d_in, d_out), in order to move to G* we may first move to the minimal I-map G*_σ. We want to construct such a path using the operators { g^σ_j : j ∈ [p] } on the space G^σ_p(d_in, d_out). At first glance, this seems trivial since we can use g^σ_j repeatedly to convert any Pa_j(G) to Pa_j(G*_σ). However, when d_out < p, it is entirely unclear whether such a path always stays within the space G^σ_p(d_in, d_out). In extreme cases, none of the DAGs g^σ_1(G), ..., g^σ_p(G) belongs to G^σ_p(d_in, d_out). Below is a simple example.

Example 1.
Consider p = 4, σ = (1, 2, 3, 4), d_in = 2 and d_out = 1. Let G*_σ be the DAG with two edges 1 → 3 and 2 → 4. Let G be another DAG with ordering σ, which contains two edges 1 → 2 and 2 → 3. Both nodes 3 and 4 are underfitted, and thus we want to add 1 → 3 or 2 → 4. But either operation violates the out-degree constraint.

Fortunately, we are able to prove that, as long as d_out is chosen sufficiently large, there always exists some j such that g^σ_j yields a different DAG in G^σ_p(d_in, d_out). This result is stated in the following lemma. The key idea of the proof is to use the pigeonhole principle multiple times to derive the contradiction. We define

d*_σ = max_{j ∈ [p]} |Pa_j(G*_σ) ∪ Ch_j(G*_σ)|,   d* = max_{σ ∈ S_p} d*_σ.   (4)

Lemma 2.
Assume that d*_σ ≤ d_in and min{d*_σ d_in + 1, p} ≤ d_out. For any G ∈ G^σ_p(d_in, d_out) such that G ≠ G*_σ, there exists some j ∈ [p] such that g^σ_j(G) ∈ G^σ_p(d_in, d_out) and g^σ_j(G) ≠ G.

Proof. See Section 7.1.

We can now construct the canonical transition function g using the locally optimal add-delete-swap operators { g^σ_j : j ∈ [p], σ ∈ S_p }. Consider an arbitrary E ∈ C_p(d_in, d_out). If E contains a minimal I-map of G*, we apply the Chickering algorithm to move to E* (see Lemma 5). If not, we can pick some G ∈ E ∩ G^σ_p(d_in, d_out) for some σ ∈ S_p and use Lemma 2 to move towards the equivalence class of G*_σ. There is a caveat, though. For any E ∈ C_p(d_in, d_out), we need to define g(E) uniquely. Suppose that for E_0, we define g(E_0) = E_1 = [g^σ_j(G_0)] using some G_0 ∈ E_0 ∩ G^σ_p(d_in, d_out). But g(E_1) may be defined using some G_1 ∈ E_1 ∩ G^τ_p(d_in, d_out) for some τ ≠ σ. Since it is likely that Hd(G_0, G*_σ) ≤ Hd(G_1, G*_τ), it is unclear how to bound the length of such a canonical path to E*. It turns out that we just need to pick the DAG representation of an equivalence class in an optimal way using the function h* defined in (2). An explicit construction of our canonical transition function g is provided in the proof of Theorem 2, the main result for this section.

Theorem 2.
Assume that d* ≤ d_in and min{d* d_in + 1, p} ≤ d_out. Let { g^σ_j : j ∈ [p], σ ∈ S_p } be as given in Definition 6. There exists a function g : C_p(d_in, d_out) → C_p(d_in, d_out) with g(E*) = E* such that the following statements hold for any E ∈ C_p(d_in, d_out) \ {E*}.

(i) g(E) = [g^σ_j(G)] for some j ∈ [p], σ ∈ S_p and G ∈ E ∩ G^σ_p(d_in, d_out) such that g^σ_j(G) ≠ G.
(ii) There exist k ≤ (d* + d_in) p and k ≤ ℓ ≤ (2 d* + d_in) p such that (E, g(E), ..., g^ℓ(E)) is an N_ads-path from E to E* and G*_σ ∈ g^k(E) for some σ ∈ S_p.

Proof. See Section 7.2.
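The building block g^σ_j of these canonical paths is simple to express in code. The sketch below (ours, not from the paper) implements the four cases of Definition 6 for a single node j, given the target parent set S*_{σ,j}, an arbitrary local score function `psi_j` standing in for ψ_j, and the in-degree bound d_in.

```python
def g_sigma_j(S, S_star, psi_j, d_in):
    """Locally optimal add-delete-swap move for one parent set (cases (i)-(iv) of Definition 6)."""
    S, S_star = frozenset(S), frozenset(S_star)
    T, R = S_star - S, S - S_star
    if S == S_star:                      # (i) already the true parent set
        return S
    if not T:                            # (ii) strictly overfitted: delete the least useful parent
        return max((S - {l} for l in R), key=psi_j)
    if len(S) < d_in:                    # (iii) underfitted with room to add a true parent
        return max((S | {k} for k in T), key=psi_j)
    # (iv) underfitted at the in-degree limit: swap a redundant parent for a true one
    return max(((S | {k}) - {l} for k in T for l in R), key=psi_j)
```

Each application of this map strictly decreases Hd(S, S*_{σ,j}), which is exactly the property used in Remark 5 and in the proof of Theorem 2.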
High-dimensional consistency of an empirical Bayes model for structure learning
Let X be an n × p data matrix where each row is an i.i.d. copy of a random vector X = (X_1, ..., X_p). (The font for the random vector X and that for the data matrix X are different.) Assume that, given a DAG G, the distribution of X can be described by

X | B, Ω ∼ N_p(0, Σ(B, Ω)),   (B, Ω) | G ∼ π(B, Ω | G),   (5)

where we use π to denote the prior and, letting I be the identity matrix, define

Σ = Σ(B, Ω) = (I − B^⊤)^{−1} Ω (I − B)^{−1}.   (6)

The support of the conditional prior distribution π(B, Ω | G) is given by

D_p(G) = { (B, Ω) : B ∈ R^{p×p}, B_ij = 0 if i → j ∉ G, for any i, j ∈ [p]; Ω = diag(ω_1, ..., ω_p), ω_i > 0 for all i ∈ [p] }.   (7)

The matrix B is often called the weighted adjacency matrix. This is a standard setup and we refer readers to Supplement C.3 for more details. Let X_j be the j-th column of our data matrix. We can equivalently express the decomposition given in (6) as the following linear structural equation model (SEM),

X_j = Σ_{i ≠ j} B_ij X_i + ε_j,   ε_j ∼ N_n(0, ω_j I),   (8)

for j = 1, ..., p, where ε_1, ..., ε_p are independent error vectors. The SEM representation of the Gaussian DAG model is used frequently in the literature; see, for example, Drton et al. [2011], Van de Geer and Bühlmann [2013], Aragam et al. [2019].

We consider the empirical prior for (B, Ω) | G used by Lee et al. [2019], which is an extension of the empirical model for variable selection proposed by Martin et al. [2017]. For each regression model given in (8), we use an empirical normal-inverse-gamma prior, and then compute the marginal likelihood of G by integrating out (B, Ω) and using a fractional exponent α ∈ (0, 1) to offset the overuse of data caused by the empirical prior. More details are given in Supplement C.3. A highly desirable property of this model is that the marginal fractional likelihoods of Markov equivalent DAGs are always the same, which will be shown in Lemma 3. In general, we do not expect that a nodewise normal-inverse-gamma prior for (B, Ω) | G has this property. For non-empirical prior distributions, see Geiger and Heckerman [2002] and Peluso and Consonni [2020] for related results.

We use the following standard prior for a DAG model G,

π(G) ∝ 1_{G_p(d_in, d_out)}(G) (c_1 p^{c_2})^{−|G|},   (9)

where c_1 > 0 and c_2 ≥ 0 are prior hyperparameters, and 1 denotes the indicator function of the restricted space G_p(d_in, d_out). The posterior probability of any DAG G ∈ G_p can be computed by

π_n(G) ∝ 1_{G_p(d_in, d_out)}(G) exp(ψ(G)),   ψ(G) = Σ_{j=1}^{p} ψ_j(Pa_j(G)),   (10)

where we refer to ψ(G) as the posterior score of G and ψ_j(Pa_j) as the posterior score of Pa_j. The function ψ_j(S) has a closed-form expression given by

exp(ψ_j(S)) = { c_1 p^{c_2} √(α/γ) }^{−|S|} { X_j^⊤ (I − X_S (X_S^⊤ X_S)^{−1} X_S^⊤) X_j }^{−(αn + κ)/2},   (11)

where γ, κ are hyperparameters of the nodewise normal-inverse-gamma prior and X_S denotes the submatrix of X containing columns indexed by S.

Lemma 3.
The posterior score given in (10) is the same for Markov equivalent DAGs.

Proof.
See Supplement E.1.

Since ψ is the same for Markov equivalent DAGs, we can define the posterior probability and score of an equivalence class E by

π_n(E) ∝ 1_{C_p(d_in, d_out)}(E) exp(ψ(E)),   ψ(E) = ψ(G) for any G ∈ E.   (12)

Formally, π_n(E) can be obtained by assigning a joint prior distribution π(E, G) such that Σ_{G ∈ E} π(G | E) = 1 and π(E) is analogous to (9) (penalizing the number of edges in the CPDAG representing E). We choose to use this empirical Bayes model mainly for the convenience of calculation. Other Bayesian structure learning models may be used as well for the mixing time analysis of MCMC methods.

Let G* denote the true DAG model as in Section 3 and E* = [G*] be the true equivalence class that we want to recover from the data. We assume that the data is generated from N_p(0, Σ*), a distribution perfectly Markovian w.r.t. G*. For the sparse DAG selection problem with search space G^σ_p(d_in, d_out), we treat the minimal I-map G*_σ as the “true model”, which for Gaussian DAG models can be equivalently defined as follows.

Definition 7.
Let N_p(0, Σ*) be perfectly Markovian w.r.t. some DAG G*. For any σ ∈ S_p, define (B*_σ, Ω*_σ) to be the unique pair in D_p(σ) = ∪_{G ∈ G^σ_p} D_p(G) such that (I − (B*_σ)^⊤)^{−1} Ω*_σ (I − B*_σ)^{−1} = Σ*. Then, the minimal I-map G*_σ is a DAG with weighted adjacency matrix B*_σ; that is, i → j ∈ G*_σ if and only if (B*_σ)_ij ≠ 0.

See Supplement C.2 for the uniqueness and minimality of (B*_σ, Ω*_σ). The pair (B*_σ, Ω*_σ) can be used to construct an SEM model analogous to (8), which justifies the use of G*_σ as the “true model” for DAG selection. Let d*_σ and d* be as defined in (4). Note that d* is the maximum degree of all minimal I-maps of G*.

Consider a high-dimensional setting with p = p(n) tending to infinity. The true DAG model G*, true covariance matrix Σ* and prior parameters c_1, c_2, α, γ, d_in, d_out are all implicitly indexed by n. In order to derive high-dimensional consistency results, we need to make a few assumptions on the true model and prior parameters. The first three assumptions are standard and commonly used in high-dimensional statistical theory. The first one is the standard restricted eigenvalue condition [Cao et al., 2019, Lee et al., 2019]. The second assumption ensures that p does not grow exponentially in n, and the in-degree bound d_in (which determines the maximum model size for nodewise variable selection) is sufficiently small. Assumptions like d_in log p ≤ cn for some small constant c > 0 are common in the high-dimensional literature. The third assumption concerns the prior hyperparameters, including the fractional exponent α.

Assumption A (Restricted eigenvalues). There exist constants ν_1 = ν_1(n), ν_2 = ν_2(n) > 0 and δ ∈ (0, 1) such that

ν_1 (1 − δ) ≤ λ_min(Σ*) ≤ λ_max(Σ*) ≤ ν_2 (1 + δ),

where λ_min, λ_max denote the smallest and largest eigenvalues respectively.

Assumption B.
The sparsity parameter d_in and n, p satisfy that d_in log p = o(n).

Assumption C.
Prior parameters satisfy that κ ≤ n, 1 ≤ c_1 √(α/γ) ≤ p, and c_2 ≥ (α + 1)(4 d_in + 6) + t_0 for some universal constant t_0 > 0.

The next assumption requires that the minimal I-map G*_σ is sufficiently sparse. It is similar to Assumption D of Yang et al. [2016] and is technically needed to show that the add-delete-swap MH sampler cannot get stuck at DAG models with the in-degree of each node equal to d_in. But unlike the setup considered in Yang et al. [2016], we assume both lower and upper restricted eigenvalues are available, which enables us to derive an irrepresentability result in Lemma D7 (in the supplement) and avoid imposing an irrepresentability assumption as in Yang et al. [2016, Assumption D]. Assumption DP (“P” stands for “permutation”) restricts the maximum degree of all minimal I-maps of G*. If ν_1, ν_2 defined in Assumption A are universal constants, this assumption allows d* to have the same order as d_in.

Assumption D.
Define ν = 4 ν_2 ν_1^{−2} (ν_2 − ν_1). For some σ ∈ S_p, (ν + 1) d*_σ ≤ d_in.

Assumption DP.
Assumption D holds for every σ ∈ S_p; that is, (ν + 1) d* ≤ d_in.

The last assumption is the well-known beta-min condition. For DAG selection with ordering σ, we only need to require that the nonzero entries of B*_σ are sufficiently large. For structure learning, we assume the same bound holds uniformly over all σ ∈ S_p; this is often known as the strong beta-min or permutation beta-min condition [Uhler et al., 2013] and was used by Van de Geer and Bühlmann [2013] and Aragam et al. [2015, 2019].

Assumption E. There exists a universal constant C_β > 0 such that for some σ ∈ S_p,

min { |(B*_σ)_ij| : (B*_σ)_ij ≠ 0, i, j ∈ [p] } ≥ C_β √( (1 + 4 c_2) ν_2 log p / (α ν_1 n) ),   (13)

where B*_σ is as defined in Definition 7.

Assumption EP.
There exists a universal constant C_β > 0 such that (13) holds for every σ ∈ S_p.

Remark 7. The restricted eigenvalue condition can be used to obtain some useful bounds related to B*_σ and Ω*_σ. Write Ω*_σ = diag(ω*_{σ,1}, ..., ω*_{σ,p}). The decomposition (6) implies that ω*_{σ,k} ∈ (ν_1, ν_2) for any σ ∈ S_p and k ∈ [p] (since the diagonal elements of Σ* and (Σ*)^{−1} can be bounded by the extreme eigenvalues of Σ*). Further, we can bound the ℓ_2-norm of the true regression coefficients for node j by Σ_{i ∈ [p]} (B*_σ)_{ij}^2 ≤ ω*_{σ,j}/ν_1 − 1, using the fact that the operator norm is no less than the ℓ_2-norm of any column.

To prove high-dimensional consistency for structure learning, we may first establish the corresponding result that applies uniformly to all p! DAG selection problems. For this purpose, the strong beta-min condition (or some similar assumption) is necessary, and we will show that, with high probability, for every σ ∈ S_p, the minimal I-map G*_σ has the highest posterior score among all DAGs in G^σ_p(d_in, d_out). For frequentists' approaches based on CI tests, a similar assumption, known as “strong faithfulness”, is typically used for proving consistency results [Nandy et al., 2018]. Uhler et al. [2013] showed that the volume of normal distributions that are strongly faithful is very small. The two assumptions are not directly comparable, but both seem to be fairly restrictive [Van de Geer and Bühlmann, 2013]. Unfortunately, without them, search methods like GES can get trapped in local modes, and an example is provided in Section 5.4. We will discuss in Section 6.2 how to overcome such limitations and design more efficient and practical algorithms.

Given that we have already constructed a canonical path ensemble on C_p(d_in, d_out) in Theorem 2, to use Theorem 1, it remains to bound the corresponding posterior probability ratios. Theorem 3 below provides a series of high-dimensional consistency results under Assumptions A–E. Part (i) is the most important, which can be seen as a uniform consistency result for all p! × p nodewise variable selection problems (there are p! orderings and each corresponds to p variable selection problems). It shows that for any j ∈ [p], g^σ_j(S) always maps S ≠ Pa_j(G*_σ) to some model with much larger posterior score, and this consistency result holds uniformly over all σ ∈ S_p. Once we prove part (i), the strong selection consistency for DAG selection and structure learning can be obtained using Theorem 1. The complete proof for Theorem 3 is highly technical and thus is deferred to the supplement. The most involved step of the proof is to establish an analogous consistency result for variable selection using our empirical prior, which is treated in detail in Supplement D and may be of independent interest.

Theorem 3.
Suppose Assumptions A, B, C, DP and EP hold. Let t_0 > 0 be the universal constant given in Assumption C and assume C_β ≥ t_0/2. For sufficiently large n, with probability at least 1 − p^{−1}, the following statements hold.

(i) Consistency of the operators { g^σ_j : j ∈ [p], σ ∈ S_p } given in Definition 6:

min { ψ_j(g^σ_j(S)) − ψ_j(S) : σ ∈ S_p, j ∈ [p], S ∈ M^σ_p(j, d_in) \ {S*_{σ,j}} } ≥ t_0 log p,

where ψ_j is given in (11) and S*_{σ,j} = Pa_j(G*_σ).

(ii) If t_0 > 2, we have the strong selection consistency of nodewise variable selection,

min_{σ ∈ S_p} min_{j ∈ [p]} exp(ψ_j(S*_{σ,j})) / Σ_{S ∈ M^σ_p(j, d_in)} exp(ψ_j(S)) ≥ 1 − p^{−(t_0 − 2)},

where M^σ_p(j, d_in) is defined in (3).

(iii) If t_0 > 3, we have the strong selection consistency of sparse DAG selection,

min_{σ ∈ S_p} exp(ψ(G*_σ)) / Σ_{G ∈ G^σ_p(d_in, d_out)} exp(ψ(G)) ≥ 1 − p^{−(t_0 − 3)},

where ψ(G) is defined in (10).

Proof. See Supplement E.3 and E.4.
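The node score ψ_j(S) appearing throughout Theorem 3 is cheap to evaluate from (11): up to hyperparameters, it is a penalized residual sum of squares. The sketch below (ours, not from the paper) computes it on the log scale; parameter names follow the text, and the residual is obtained by least squares.

```python
import numpy as np

def psi_j(X, j, S, c1, c2, alpha, gamma, kappa):
    """Node score psi_j(S) from (11): -|S| log(c1 p^{c2} sqrt(alpha/gamma)) - (alpha*n + kappa)/2 * log(RSS)."""
    n, p = X.shape
    xj = X[:, j]
    if S:
        XS = X[:, sorted(S)]
        beta, *_ = np.linalg.lstsq(XS, xj, rcond=None)
        rss = float(np.sum((xj - XS @ beta) ** 2))   # x_j' (I - X_S (X_S' X_S)^{-1} X_S') x_j
    else:
        rss = float(xj @ xj)
    penalty = len(S) * (np.log(c1) + c2 * np.log(p) + 0.5 * np.log(alpha / gamma))
    return -penalty - 0.5 * (alpha * n + kappa) * np.log(rss)
```

Part (i) of Theorem 3 states that, on an event of high probability, every locally optimal move g^σ_j(S) increases this score by at least t_0 log p whenever S differs from the true parent set.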
Remark 8. The universal constant t_0 can be chosen arbitrarily large. Given any t_0 > 0, in order that Theorem 3 holds, we just need to choose c_2 = O(d_in) accordingly and assume that the universal constant C_β in Assumption EP is sufficiently large.

As a corollary, the strong selection consistency for a single DAG selection problem with ordering σ can be obtained using Assumptions D and E. This result was also proved in Lee et al. [2019] under similar assumptions, but the method we use is different (the primary goal of Lee et al. [2019] was to derive minimax posterior convergence rates for the weighted adjacency matrix).

Corollary 1.
Suppose Assumptions A, B, C, D and E hold for some t_0 > 3 and C_β ≥ t_0/2. Let σ be as given in Assumptions D and E. For sufficiently large n, with probability at least 1 − p^{−1},

exp(ψ(G*_σ)) / Σ_{G ∈ G^σ_p(d_in, d_out)} exp(ψ(G)) ≥ 1 − p^{−(t_0 − 3)}.

Proof.
The proof is wholly analogous to that for Theorem 3.

In order to use Theorem 1 to prove the strong selection consistency of sparse structure learning, we need to show that the neighborhood N_ads(·) defined in (1) grows only polynomially in p. Note that this is also needed if one wants to prove GES has polynomial complexity. In the following lemma, we show that it suffices to assume d_in + d_out = O(log p). This is a mild assumption since, even if d_in + d_out = O(1), the total number of edges in the DAG may have order p.

Lemma 4.
Suppose d_in + d_out ≤ t_1 log p for some t_1 > 0. Then, for any E ∈ C_p(d_in, d_out),

p(p − 1)/2 ≤ |N_ads(E)| ≤ p(p − 1)(d_in + d_out) p^{t_1}.

Proof. See Section E.2.

Theorem 4.
Assume d* d_in + 1 ≤ d_out and d_in + d_out ≤ t_1 log p for some universal constant t_1 > 0. Suppose Assumptions A, B, C, DP and EP hold with C_β ≥ t_0/2 and t_0 > t_1 + 3. For sufficiently large n, with probability at least 1 − p^{−1}, we have

exp(ψ(E*)) / Σ_{E ∈ C_p(d_in, d_out)} exp(ψ(E)) ≥ 1 − p^{−(t_0 − t_1 − 3)},

where ψ(E) is given in (12). Further, the greedy search on C_p(d_in, d_out) using neighborhood function N_ads and score ψ returns E* regardless of the initial state.

Proof. We will show in Theorem 5 that the transition function g defined in Theorem 2 satisfies the conditions in Theorem 1. The results then follow from Theorem 1(i) and (ii).

For the structure learning problem on the space C_p(d_in, d_out), define the transition matrix of RW-GES by

P(E, E′) = K(E, E′) min { 1, [π_n(E′) K(E′, E)] / [π_n(E) K(E, E′)] }   if E ≠ E′,
P(E, E) = 1 − Σ_{Ẽ ≠ E} P(E, Ẽ),   (14)

where π_n(E) ∝ 1_{C_p(d_in, d_out)}(E) exp(ψ(E)). Note that in practice it can be difficult to check whether E′ is in C_p(d_in, d_out). We suggest one can simply replace (d_in, d_out) with a maximum degree constraint. Assuming the true model is sufficiently sparse, the two approaches should yield similar results. Using the transition function constructed in Theorem 2 and the high-dimensional consistency results obtained in Theorem 3, we can prove the following main result of the paper, rapid mixing of RW-GES on the space C_p(d_in, d_out).

Theorem 5.
Suppose all assumptions of Theorem 4 hold. For sufficiently large n, with probability at least 1 − p^{−1}, the mixing time of the RW-GES sampler with transition matrix defined in (14) can be bounded by

T_mix(P) ≤ C p^{t_1 + 3} (t_1 log p)^2 log(1/π_min),

for some universal constant C, where π_min = min_{E ∈ C_p(d_in, d_out)} π_n(E).

Proof. See Supplement E.5.
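A single RW-GES iteration only needs the two posterior scores and the two neighborhood sizes that appear in (14). The sketch below (ours, not from the paper) spells out the accept/reject step on the log scale; proposing a neighboring equivalence class and evaluating its score are treated as black boxes that, in practice, would be implemented with the GES operators discussed in Section 6.1.

```python
import math
import random

def rwges_step(ec, score, nbhd_size, propose, rng=random):
    """One Metropolis-Hastings update of RW-GES following (14).

    ec            : current equivalence class (e.g., a CPDAG in any hashable encoding)
    score(ec)     : psi(ec), so that pi_n(ec) is proportional to exp(score(ec))
    nbhd_size(ec) : |N_ads(ec)| within the restricted space
    propose(ec, rng): a uniform draw from N_ads(ec)
    """
    prop = propose(ec, rng)
    log_accept = (score(prop) - score(ec)
                  + math.log(nbhd_size(ec)) - math.log(nbhd_size(prop)))
    return prop if rng.random() < math.exp(min(0.0, log_accept)) else ec
```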
Corollary 2.
Suppose Assumptions A and B hold. We have

min_{E ∈ C_p(d_in, d_out)} π_n(E)/π_n(E*) ≥ ( c p^c √(α/γ) )^{−p(d_in + d*)} ( ν/ν )^{−p(αn + κ)/2}.

Hence, under the setting of Theorem 5, the mixing time of the RW-GES sampler can be bounded by a polynomial in n and p.

Proof. See Supplement E.6.

Remark. Corollary 2 implies that RW-GES is rapidly mixing with high probability. However, the term log(1/π_min) in the mixing time bound is only used to handle the worst scenario where the chain starts from the state with minimum posterior probability. If the chain starts from some "good" estimate, the actual mixing rate of the chain can be much faster; see [Sinclair, 1992, Proposition 1].

If, in the beta-min condition, we only assume that the minimum edge weight of B* (the weighted adjacency matrix of G*) is sufficiently large, the rapid mixing of RW-GES does not hold. Indeed, it is not difficult to construct an explicit example which shows that RW-GES is slowly mixing. In the following example, we let p = 3 be fixed and show that the mixing time grows exponentially in n. One can extend our example to the case p = n by adding variables X_4, ..., X_n such that, for any j = 4, ..., n, the observed vector X_j is exactly orthogonal to all the other column vectors of the data matrix.

Example 2.
Assume p = 3 and the true SEM is given by

X_1 = z_1,  X_2 = b_1 X_1 + z_2,  X_3 = b_2 X_2 + z_3,

where z_1, z_2, z_3 are orthogonal to each other and ∥z_j∥^2 = n for each j. Thus, we can let the true DAG G* be 1 → 2 → 3. Suppose the prior parameters satisfy d_in = d_out = 2, c = √n, κ = 0, and c, α, γ are fixed constants such that c√(α/γ) = 1. The choice c = √n is reasonable since the penalization on the model size should increase with n. Assume the true regression coefficients b_1, b_2 > 0 satisfy b_1 = b_2 = 4√(c log p / (αn)) = o(1). Consider the DAG G̃ given by 1 → 2 ← 3. Observe that [G̃] = {G̃}. The topological ordering of G̃ can be chosen to be σ = (1, 3, 2), and G*_σ is a complete DAG. One can show that the edge weight of 1 → 3 in G*_σ is b_1 b_2. Since

b_1 b_2 = 16 c (log p)/(αn) = 16 (log p)/(α√n) = o( √(c log p / n) ),

the true model fails to satisfy the strong beta-min condition. Indeed, we can prove that RW-GES is slowly mixing. See Supplement E.9.

When the ordering is given and the search is restricted to G^σ_p(d_in, d_out), RW-GES becomes the standard add-delete-swap MH sampler. Define a proposal matrix by

K^σ(G, G') = 1_{N_ads(G)}(G') / |N_ads(G)|,  for all G, G' ∈ G^σ_p.

By the Metropolis rule, the corresponding transition matrix is given by

P^σ(G, G') = K^σ(G, G') min{ 1, π^σ_n(G') K^σ(G', G) / [π^σ_n(G) K^σ(G, G')] } if G' ≠ G, and P^σ(G, G) = 1 − Σ_{G̃ ≠ G} P^σ(G, G̃),   (15)

where π^σ_n(G) ∝ 1_{G^σ_p(d_in, d_out)}(G) exp(ψ(G)) denotes the posterior distribution on the restricted space G^σ_p(d_in, d_out). If there is no out-degree constraint, by posterior modularity, one can perform sampling for the parent set of each node separately; thus, there is no need to draw DAG samples directly. However, if d_out < p, in which case the posterior distributions of Pa_1, ..., Pa_p are no longer independent, this add-delete-swap sampler provides a convenient solution. The rapid mixing result for this sampler is provided below.

Theorem 6.
Suppose Assumptions A, B, C, D and E hold for some t > and C β ≥ t/ .Further, assume that min { d ∗ d in + 1 , p } ≤ d out . For sufficiently large n , with probability atleast − p − , the mixing time of the transition matrix defined in (15) can be bounded by T mix ( P σ ) ≤ Cd p log (cid:18) π σ min (cid:19) , for some universal constant C , where π σ min = min G ∈G σp ( d in ,d out ) π σn ( G ) .Proof. See Supplement E.7.
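When the ordering is known, the sampler above only modifies one node's parent set at a time, so its moves can be enumerated nodewise. The sketch below lists the add-delete-swap neighbors of a single parent set under an in-degree constraint; the function name, arguments, and the small example at the end are hypothetical and serve only to illustrate the move set.

```python
def parent_set_moves(parents, candidates, d_in):
    """Add-delete-swap neighbors of one node's parent set.

    `parents` is the current parent set, `candidates` is the set of nodes
    allowed to be parents (e.g. the nodes preceding this node in the known
    ordering), and `d_in` is the maximum in-degree.
    """
    neighbors = []
    outside = candidates - parents
    if len(parents) < d_in:                         # additions, if the constraint allows
        neighbors += [parents | {k} for k in outside]
    neighbors += [parents - {l} for l in parents]   # deletions
    # swaps: remove one current parent and add one non-parent (size unchanged)
    neighbors += [(parents - {l}) | {k} for l in parents for k in outside]
    return neighbors

# Hypothetical example: node 5 under ordering 1,...,5 with current parents {1, 3}.
moves = parent_set_moves(frozenset({1, 3}), frozenset({1, 2, 3, 4}), d_in=2)
```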
Remark. The assumptions are much weaker than those used in Theorem 5. In particular, we can allow a much larger model size for each nodewise variable selection problem. This is mainly because for any G ∈ G^σ_p(d_in, d_out), we have |N_ads(G)| = O(d_in p). But for an equivalence class E ∈ C_p(d_in, d_out), the size of N_ads(E) may grow exponentially in d_in + d_out.

Structure MCMC methods are local MH algorithms defined on the DAG space. According to the canonical paths constructed in Section 3, we can devise a DAG sampler that mimics the behavior of RW-GES. This algorithm is usually not practical, but it provides the insights we need to understand the complexity of other structure MCMC methods. Let N_E(G) = [G] \ {G}. Define a proposal matrix by

K_s(G, G') = (1 − q) 1_{N_ads(G)}(G') / |N_ads(G)| + q 1_{N_E(G)}(G') / |N_E(G)|,  for all G, G' ∈ G_p,   (16)

where q ∈ (0, 1) is a fixed constant. That is, with probability 1 − q, we propose a usual add-delete-swap move; with probability q, we sample a DAG from N_E(G). The problem with this proposal scheme is that sampling DAGs from an equivalence class can be very time-consuming; a state-of-the-art algorithm of Ghassami et al. [2019] has polynomial complexity only when the maximum degree is bounded. The rationale behind the definition of K_s is clear: we use the add-delete-swap neighborhood to explore DAGs with the same topological ordering and find a minimal I-map of G*, but in order to arrive at some DAG in [G*], we have to traverse the equivalence classes of minimal I-maps, and thus the random sampling from N_E(·) is introduced.

Consider the MH algorithm for sampling from the posterior distribution on G_p(d_in, d_out) using the proposal matrix K_s. The transition matrix is given by

P_s(G, G') = K_s(G, G') min{ 1, π_n(G') K_s(G', G) / [π_n(G) K_s(G, G')] } if G' ≠ G, and P_s(G, G) = 1 − Σ_{G̃ ≠ G} P_s(G, G̃),   (17)

where π_n(G) ∝ 1_{G_p(d_in, d_out)}(G) exp(ψ(G)). To show rapid mixing of this sampler, we need to impose a very restrictive assumption on the true model. Define

r* = max_{σ ∈ S_p} |[G*_σ]|,   (18)

which is the maximum size of an equivalence class that contains at least one minimal I-map of G*. In Theorem 7, we prove that the sampler is rapidly mixing if r* grows only polynomially in p. In general, this condition does not hold even if the maximum degree is bounded. For example, suppose the true DAG G* contains an edge between nodes 2i − 1 and 2i for each i ∈ [p/2] (assuming p is even). Though G* has only p/2 edges, E* contains 2^{p/2} member DAGs, all of which belong to the space G_p(1, 1).

Theorem 7.
Suppose Assumptions A, B, C, DP, and EP hold with t > and C β ≥ t/ .Further, assume that min { d ∗ d in + 1 , p } ≤ d out and r ∗ ≤ p t/ , where r ∗ is defined in (18) .For sufficiently large n , with probability at least − p − , the mixing time of the structureMCMC sampler with transition matrix defined in (17) can be bounded by T mix ( P s ) ≤ C q ( d in p ∨ r ∗ ) pr ∗ d in log (cid:18) π min (cid:19) , for some universal constant C q , where π min = min G ∈G p ( d in ,d out ) π n ( G ) . Further, (cid:88) G ∈ [ G ∗ ] π n ( G ) = 1 + O (cid:18) |E ∗ | p t/ (cid:19) , where |E ∗ | is the number of member DAGs in the equivalence class of G ∗ .Proof. See Supplement E.8.
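A minimal sketch of the mixture proposal (16) is given below, under the assumption that two hypothetical helpers are available: `ads_neighborhood`, returning the add-delete-swap neighborhood of a DAG, and `sample_equivalent_dag`, returning a uniformly drawn member of [G] \ {G} (which, as discussed in the surrounding text, is itself expensive unless the maximum degree is bounded).

```python
import random

def propose_dag(G, q, ads_neighborhood, sample_equivalent_dag, rng=random):
    """Draw a proposal from the mixture kernel K_s of the structure MCMC sampler.

    With probability 1 - q an add-delete-swap neighbor of G is proposed;
    with probability q a different member of the equivalence class [G] is proposed.
    Both helper functions are placeholders for problem-specific routines.
    """
    if rng.random() < q:
        return sample_equivalent_dag(G)        # jump within the equivalence class [G]
    return rng.choice(ads_neighborhood(G))     # usual add-delete-swap move
```

In a Metropolis–Hastings implementation, both branches contribute to the forward and backward proposal densities, so the acceptance ratio generally involves the sizes of both neighborhoods at G and at the proposed DAG.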
Remark. Though we do not need to require that d_in + d_out = O(log p) (which is needed for Theorem 6), to implement the proposal scheme defined in (16), one eventually needs to impose a bounded maximum degree condition so that sampling from an equivalence class is computationally affordable.

The path method used to prove Theorem 7 (which is a slight generalization of Theorem 1) may be applied to other structure MCMC samplers as well. For the classical structure MCMC sampler [Madigan et al., 1995, Giudici and Castelo, 2003], the neighborhood of a DAG is the set of all DAGs that can be obtained from it by an addition, deletion or reversal of a single edge. The edge reversal enables the chain to traverse equivalence classes, since any two Markov equivalent DAGs G, G' are connected by a sequence of covered edge reversals. Unfortunately, the number of reversals needed can be as large as |G|. Thus, to derive results similar to Theorem 7, an assumption like r* = O(p^{t/⋯}) seems unavoidable. The inclusion-driven MCMC of Castelo and Kocka [2003] can be seen as a practical version of our sampler defined in (17). It replaces the single-edge reversal with a random number of covered edge reversals. Consequently, it is possible to jump from a DAG G to any other DAG in [G] in one iteration. But if the equivalence class is huge, the probability of proposing the "right" Markov equivalent DAG can be exceedingly small. Therefore, this method (and also our DAG sampler) essentially has the same limitation as the classical structure MCMC. (Recall that we always interpret an equivalence class E as a set; thus, |E| is the number of member DAGs in E, not the number of edges in the CPDAG representing E.)

Slow mixing examples for a reversible CPDAG sampler

The neighborhood N_ads(E) used in RW-GES can be very large for some E, which may seem undesirable. However, other choices of the neighborhood relation on C_p (which may seem very reasonable) can cause the search algorithm to be trapped in sub-optimal local modes. To show this, we consider an alternative random walk MH algorithm for sampling equivalence classes used by Castelletti et al. [2018]. Recall that EG(E) denotes the CPDAG representing E. He et al. [2013] considered the following six types of CPDAG operators: insert/delete an undirected edge, insert/delete a directed edge, and make/remove a v-structure. (Here, "make a v-structure" means to convert a subgraph i − j − k in the CPDAG to i → j ← k; similarly, "remove a v-structure" means to convert a v-structure i → j ← k to i − j − k.) When we apply one of these operators to some CPDAG, we may get a PDAG (partially directed acyclic graph) that is not a CPDAG. But if it has a consistent extension G, we still say that EG([G]) (which must be unique) results from this operator. (A DAG G is a consistent extension of a PDAG H if G and H have the same skeleton and v-structures, and each directed edge in H has the same orientation as in G.) However, there exist CPDAGs EG(E) and EG(E') such that we can get EG(E') from EG(E) by one of these operators, but not vice versa. To overcome this problem, He et al. [2013, Definition 9] constructed a set of rules defining, for each CPDAG, which of the above six operators are allowed. The resulting set of allowed operators defines a neighborhood relation that is symmetric and induces a connected neighborhood graph. (In He et al. [2013], this set of allowed operators is said to be reversible and irreducible.) We use N_C to denote the corresponding neighborhood function. A random walk MH algorithm can be constructed by proposing a state in N_C(·) uniformly at random; this algorithm was implemented in Castelletti et al. [2018]. Hence, compared with RW-GES, the only difference is that the neighborhood N_ads(·) is replaced by N_C(·).
Unfortunately, this "reversible" CPDAG sampler can be slowly mixing even if the strong beta-min condition holds.

Example 3.
As in Example 2, we fix p = 3 and let n tend to infinity. The extension to the case p = n is straightforward. Let the true SEM be given by

X_1 = z_1,  X_2 = z_2,  X_3 = a_1 X_1 + a_2 X_2 + z_3,

where z_1, z_2, z_3 are as given in Example 2 and the coefficients a_1, a_2 > 0. The true DAG G* is given by 1 → 3 ←
2, which is a CPDAG itself. Choose the prior parameters as in Example 2. Let Ẽ be the equivalence class that contains all complete 3-vertex DAGs (i.e., no edge is missing). The CPDAG EG(Ẽ) is a complete undirected graph. Removing any edge from EG(Ẽ) results in a CPDAG of the form i − j − k, of which i → j ← k is not a consistent extension. Thus, N_C(Ẽ) contains three equivalence classes but E* ∉ N_C(Ẽ). Some routine calculations confirm that the chain is slowly mixing since it can get stuck at Ẽ for exponentially many steps. See Supplement E.10. A more complicated example with 5 nodes is provided in Supplement E.10.

We note that on the restricted search space C_p(d_in, d_out), the maximum size of N_C(·) tends to be much smaller than that of N_ads(·) since the former can only grow polynomially in d_in + d_out. However, for two different DAGs G, G' such that G' = G ∪ {i → j}, we may have [G] ∉ N_C([G']) because removing the edge between i and j from EG(G') does not result in any consistent extension. If G happens to be the true DAG model, then G' is likely to be a local mode that can trap the algorithm for an enormous number of iterations, since other modifications of G' either yield a DAG that is not an independence map of G or a DAG with more "redundant" edges. Nevertheless, as originally proposed in He et al. [2013], one can define a random walk on (Θ, N_C) for efficiently sampling CPDAGs (without applying the Metropolis rule).

To implement the RW-GES sampler, we need to propose new states from N_ads(E) for each sparse equivalence class E. Though N_ads is constructed by applying add-delete-swap moves to all member DAGs in E, we do not need to enumerate these DAGs to sample from N_ads(E). Instead, we can use the "Insert" and "Delete" operators of the GES algorithm to generate the sets N_add(E) and N_del(E) [Chickering, 2002b, Definitions 12 and 13], which only involve local modifications of EG(E). For a partially directed graph H, we use Ad_j(H) to denote the set of all nodes that are connected to node j by either a directed or undirected edge (in which case we say the two nodes are adjacent), and Un_j(H) = Ad_j(H) \ (Pa_j(H) ∪ Ch_j(H)) to denote the set of nodes that are connected to j by an undirected edge.

Definition 8 (Insert(i, j, S)). Given a CPDAG H, the "Insert(i, j, S)" operator is defined for non-adjacent nodes i, j and a set S ⊆ Un_j(H) \ Ad_i(H). It modifies H by (i) inserting the edge i → j, and (ii) for each k ∈ S, directing the edge between k and j as k → j.

Definition 9 (Delete(i, j, S)). Given a CPDAG H, the "Delete(i, j, S)" operator is defined for adjacent nodes i, j connected as either i − j or i → j and a set S ⊆ Un_j(H) ∩ Ad_i(H). It modifies H by (i) deleting the edge between i and j, and (ii) for each k ∈ S, directing the edge between k and j as j → k and any undirected edge between k and i as i → k.

To test whether an operator results in a CPDAG, one can use the local conditions provided in Chickering [2002b, Theorems 15 and 17], which are easy to evaluate.
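As an illustration of how such operators can be enumerated, the following Python sketch lists candidate Insert(i, j, S) operators of Definition 8 for a CPDAG. The graph interface (`nodes`, `adjacent`, `undirected_neighbors`) is hypothetical, and the validity check of each operator (Chickering's local conditions) is deliberately omitted.

```python
import itertools

def insert_operators(nodes, adjacent, undirected_neighbors):
    """Enumerate candidate Insert(i, j, S) operators of a CPDAG.

    `adjacent(j)` and `undirected_neighbors(j)` are placeholders returning the
    sets Ad_j and Un_j of the CPDAG.  Per Definition 8, S ranges over subsets of
    Un_j \ Ad_i for non-adjacent ordered pairs (i, j).  With maximum degree
    O(log p) the total number of operators remains polynomial in p.
    """
    ops = []
    for i, j in itertools.permutations(nodes, 2):
        if j in adjacent(i):
            continue                                   # Insert requires non-adjacent i, j
        pool = sorted(undirected_neighbors(j) - adjacent(i))
        for r in range(len(pool) + 1):
            for S in itertools.combinations(pool, r):
                ops.append(("Insert", i, j, frozenset(S)))
    return ops
```

A proposal step would then draw one element of the combined Insert/Delete list uniformly, reject if the chosen operator does not yield a CPDAG, and otherwise use the ratio of the two list sizes as the proposal correction, as described next.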
Let O ges ( E ) denotethe set of “Insert” and “Delete” operators defined for the CDPAG EG( E ). Assuming thatthe maximum degree of the CPDAG is O (log p ), |O ges ( E ) | can be bounded by a polynomialof p . Further, observe that the only operators that may not be distinguishable (i.e., twooperators result in the same CPDAG) are (i) Insert ( i, j, ∅ ) and Insert ( j, i, ∅ ) when i, j havethe same parent set, and (ii) Delete ( i, j, S ) and Delete ( j, i, S ) where i, j are connected by anundirected edge. Therefore, to implement the addition and deletion moves for RW-GES, wejust need to randomly sample an operator from O ges ( E ) with equal probability. If it resultsin a CPDAG E (cid:48) , we can compute the proposal probability ratio by |O ges ( E ) | / |O ges ( E (cid:48) ) | . If itdoes not result in a CPDAG, we simply reject the proposal. The swap moves only requirea combination of the “Insert” and “Delete” operators, and thus they can be implementedsimilarly. We refer readers to Chickering [2002b] for more details about the implementationof the two types of operators and the conversion of PDAGs to CPDAGs.24s discussed in Section 5.3, the existence of Markov equivalent DAGs makes it difficultto quickly explore the DAG space. This is a major limitation of structure MCMC methods,especially in high-dimensional settings where there are equivalence classes with high posteriorscores and exponentially many member DAGs. There are probably no simple remedies, sinceenumerating all members of an equivalence class is computationally expensive [He et al.,2015]. In contrast, GES and RW-GES directly search the space of equivalence classes, whichappears to be a main reason why we can prove the rapid mixing property of RW-GES underreasonable assumptions. But more importantly, such advantages of GES and RW-GES canbe realized in practice because there exist corresponding local operators for CPDAGs whichcan be implemented efficiently.We end this section with a discussion on the difference between GES and RW-GES. ForRW-GES, we consider a search space with bounded maximum degree (though the boundmay grow with n ). This is necessary to proving that there are no sub-optimal local modes inthe restricted search space in high-dimensional settings, which also implies that the greedysearch version of RW-GES is consistent (see Theorem 4). GES is a two-stage algorithm, andin the first stage it keeps adding edges (to member DAGs). Under certain assumptions, theoutput of the first stage, which we denote by G , is an independence map of G ∗ with highprobability [Nandy et al., 2018]. However, even if d ∗ is bounded, it is difficult to bound | G | since G may not be minimal. This is inconvenient to theoretical analysis since we havelittle control over the posterior landscape among non-sparse models (e.g. the errors may beoverfitted.) Note that the swap proposal plays a key role when we consider a search spacewith bounded maximum degree. Without swaps, RW-GES can get stuck at DAG modelswith nodes that are underfitted and saturated. Both the strong beta-min condition and strong faithfulness are restrictive assumptions,though their uses in theoretical studies are common. A more flexible setup can be for-mulated as follows. Let b be the lower bounded given in (13) (the constants in the boundcan always be adjusted if necessary.) For each σ ∈ S p , we define a DAG ˜ G σ as follows. i → j ∈ ˜ G σ if and only if ( B ∗ σ ) ij ≥ b . 
As long as the other nonzero entries of B ∗ σ are sufficiently small (so that their summed effectsfor each node have the same order as the noise), we may treat ˜ G σ as the “true” model for theDAG selection problem with ordering σ and prove the corresponding consistency and rapidmixing results (for DAG selection) using essentially the same techniques [cf. Yang et al.,2016]. All we need is just to replace G ∗ σ with ˜ G σ in the construction of canonical paths.But for the structure learning problem across all possible orderings, the current argumentfails. The DAG ˜ G σ may not be an independence map of G ∗ , and as shown in Example 2, ˜ G σ can easily be a local mode on ( C p ( d in , d out ) , N ads ). For other search methods which considera smaller neighborhood set than N ads ( · ), the multimodality of the posterior distribution canonly be more severe. Hence, to devise MCMC algorithms which may achieve rapid mixingwithout strong beta-min condition, we need to choose a larger neighborhood and modifythe construction of the canonical path from ˜ G σ to G ∗ for each σ ∈ S p . In this regard,25he works of Grzegorczyk and Husmeier [2008], Su and Borsuk [2016] suggest a potentialsolution. Consider the REV-structure MCMC sampler of Grzegorczyk and Husmeier [2008]for example. The algorithm draws samples from the DAG space, but it replaces the single-edge reversal in the classical structure MCMC with a so-called REV (new edge reversal)move. For each REV move, an edge i → j in the current DAG is randomly sampled first.Next, the current parent set of node i is replaced by a random sample from the conditionalposterior distribution of Pa i under the constraint that the resulting graph is a DAG with theedge j → i . The parent set of node j is then re-sampled similarly. This REV move makesit possible to jump between DAGs that encode very different CI relations involving nodes i and j . Similar ideas to the REV move might be used to improve other sampling methods,including those defined on the space of equivalence classes. A more detailed investigation isleft to future research. Proof.
We prove the lemma by contradiction. Assume that for any j ∈ [ p ] such that g σj ( G ) (cid:54) = G , we have g σj ( G ) / ∈ G σp ( d in , d out ). It follows that there exists no j such that Pa j ( G ∗ σ ) ⊂ Pa j ( G ) (two sets are not equal.) Otherwise, g σj ( G ) can be obtained from G by removingsome incoming edge of node j and thus g σj ( G ) ∈ G σp ( d in , d out ).Since there is no strictly overfitted node, there must exist some underfitted node. Let U = { u ∈ [ p ] : Pa u ( G ∗ σ ) (cid:54)⊆ Pa u ( G ) } be the set of all underfitted nodes. Recall that for any u ∈ U , g σu ( G ) is constructed by addingan incoming edge of node u to G . Define A = { a ∈ [ p ] : g σu ( G ) = G ∪ { a → u } for some u ∈ U } . Fix an arbitrary a ∈ A and suppose that g σj ( G ) = G ∪{ a → j } for some j . If | Ch a ( G ) | < d out ,one can verify that g σj ( G ) ∈ G σp ( d in , d out ) (if Pa j ( G ) < d in , we can simply add the edge a → j ;if Pa j ( G ) = d in , we perform a swap.) Thus, | Ch a ( G ) | = d out for every a ∈ A .For any node u ∈ U , there exists some node a ( u ) ∈ A such that a ( u ) → u is in G ∗ σ butnot in G . But any node a can have at most d ∗ σ outgoing edges in G ∗ σ , which implies that | A | ≥ | U | d ∗ σ . (19)For each a ∈ A , define c a = |{ u ∈ U : g σu ( G ) = G ∪ { a → u }}| , F a = Ch a ( G ) \ Ch a ( G ∗ σ ) . So c a is the number of nodes in U which we want to connect with the parent node a .Observe that c a ≤ | Ch a ( G ∗ σ ) \ Ch a ( G ) | and (cid:80) a ∈ A c a = | U | . Using | Ch a ( G ) | = d out and | Ch a ( G ∗ σ ) | ≤ d ∗ σ , we find that | F a | ≥ d out − d ∗ σ + c a . Hence, (cid:88) a ∈ A | F a | ≥ (cid:88) a ∈ A ( d out − d ∗ σ + c a ) ≥ | U | d out d ∗ σ (20)26here the last step follows from (19) and (cid:80) a ∈ A c a = | U | .For any node f a ∈ F a , the edge a → f a is in G but not in G ∗ σ . Define E = { ( a, f ) : a ∈ A, f ∈ F a } . Clearly, | E | = (cid:80) a ∈ A | F a | . Since the maximum in-degree of G is at most d in , for any f ∈ [ p ],we have |{ a ∈ A : ( a, f ) ∈ E }| ≤ d in . Consider the set ¯ F = (cid:83) a ∈ A F a . By (20) we have that | ¯ F | ≥ | E | d in ≥ | U | d out d ∗ σ d in > | U | , since we assume d out > d in d ∗ σ . For any node f ∈ ¯ F , we have Pa f ( G ) (cid:54) = Pa f ( G ∗ σ ). Becausewe have already shown that no node in G can be strictly overfitted, all nodes in ¯ F must beunderfitted; thus, we must have | F | ≤ | U | , which yields the contradiction. We first prove an auxiliary lemma using the Chickering algorithm.
Lemma 5.
Assume d ∗ ≤ d in ≤ d out . Consider G ∗ σ for some σ ∈ S p such that E = [ G ∗ σ ] (cid:54) = E ∗ .There exist E (cid:48) ∈ C p ( d in , d out ) , G ∈ E and τ ∈ S p such that (i) E (cid:48) ∈ N del ( E ) and CI ( E ) ⊂CI ( E (cid:48) ) ⊆ CI ( E ∗ ) , and (ii) G ∈ G τp ( d in , d out ) and E (cid:48) = [ g τj ( G )] for some j ∈ [ p ] .Proof. The first conclusion of the lemma is essentially the same as Chickering [2002b, Lemma10]. To prove it, note that since E = [ G ∗ σ ], we have CI ( E ) = CI ( G ∗ σ ) ⊂ CI ( E ∗ ). By Chick-ering [2002b, Theorem 4], there exist some finite m and a sequence of DAGs, ( G = G ∗ , G , . . . , G m − , G m = G ∗ σ ), such that, for each k ∈ [ m ], CI ( G k ) ⊆ CI ( G k − ) and G k is obtained from G k − by either a covered edge reversal or an edge addition. Let (cid:96) =max { j ≤ m : | G j | = | G ∗ σ | − } , which clearly exists. Then, G (cid:96) ∈ N del ( G (cid:96) +1 ), G (cid:96) +1 ∈ E byLemma C2, and CI ( G (cid:96) +1 ) ⊂ CI ( G (cid:96) ) ⊆ CI ( G ∗ ).Let τ be a topological ordering of G (cid:96) +1 . For each i ∈ [ p ], we have Pa i ( G ∗ τ ) ⊆ Pa i ( G (cid:96) +1 ),since G (cid:96) +1 is an I-map of G ∗ but G ∗ τ is the unique minimal I-map with ordering τ . Meanwhile, CI ( G (cid:96) ) ⊂ CI ( G ∗ ) and | G (cid:96) | < | G (cid:96) +1 | imply that there exists some node j such that Pa j ( G ∗ τ ) ⊂ Pa j ( G (cid:96) +1 ) (two sets are unequal.) Consider the DAG G (cid:48) = g τj ( G (cid:96) +1 ), which by definitionsatisfies that G (cid:48) = G (cid:96) +1 \ { i → j } for some i (cid:54) = Pa j ( G ∗ τ ). Hence, G (cid:48) ∈ N del ( G ), E (cid:48) = [ G (cid:48) ] ∈N del ( E ) and CI ( E ) = CI ( G (cid:96) +1 ) ⊂ CI ( E (cid:48) ) ⊆ CI ( E ∗ ). To conclude the proof, notice thatthe maximum degree of G (cid:96) +1 is bounded by d ∗ since G (cid:96) +1 and G ∗ σ are Markov equivalent.It then follows from the assumption d ∗ ≤ d in ≤ d out that G (cid:96) +1 ∈ G τp ( d in , d out ) and thus E (cid:48) ∈ C p ( d in , d out ). Proof of Theorem 2.
We give an explicit construction of g that satisfies the required condi-tions. Recall the definition of h ∗ ( E ∗ ) in (2). Let ( ¯ G ( E ) , ¯ σ ( E )) be the pair that attains theminimum in the definition of h ∗ ( E ) (if there are multiple such pairs, fix one of them.) Thepair ( ¯ G ( E ) , ¯ σ ( E )) can be seen as a “canonical” representation of the equivalence class E . Fixan arbitrary E ∈ C p ( d in , d out ), and we define g ( E ) as follows.(1) Let ( G, σ ) = ( ¯ G ( E ) , ¯ σ ( E )) be its canonical representation.272) If G (cid:54) = G ∗ σ , by Lemma 2, there exists some j such that g σj ( G ) ∈ G σp ( d in , d out ) and g σj ( G ) (cid:54) = G . Let G (cid:48) = g σj ( G ) and g ( E ) = [ G (cid:48) ] ∈ C p ( d in , d out ).(3) If G = G ∗ σ but E (cid:54) = E ∗ , by Lemma 5, there exist some G ∈ E , τ ∈ S p , j ∈ [ p ] suchthat G ∈ G τp ( d in , d out ), G (cid:48) = g τj ( G ) ∈ N del ( G ), and G (cid:48) is still an I-map of G ∗ . Let g ( E ) = [ G (cid:48) ] ∈ C p ( d in , d out ).(4) If E = E ∗ , let g ( E ) = E .It follows from the definition of N ads that g ( E ) ∈ N ads ( E ) for every E (cid:54) = E ∗ . Consider g σj ( G )found in step (2). By Lemma 2 and the definition of g σj , we have h ∗ ([ g σj ( G )]) ≤ Hd( g σj ( G ) , G ∗ σ ) + | G ∗ σ | − | G ∗ | < Hd(
G, G ∗ σ ) + | G ∗ σ | − | G ∗ | = h ∗ ( E )since g σj ( G ) ∈ G σp ( d in , d out ) and ( G, σ ) is chosen to be the canonical representation of E . If g ( E ) is defined in step (3) as g τj ( G ) for some G ∈ E (and G = G ∗ σ ∈ E ), we claim h ∗ ([ g τj ( G )]) ≤ Hd( g τj ( G ) , G ∗ τ ) + | G ∗ τ | − | G ∗ | < | G ∗ σ | − | G ∗ | = h ∗ ( E ) . This is because g τj ( G ) is an I-map of G ∗ with ordering τ , and thus Hd( g τj ( G ) , G ∗ τ ) = | g τj ( G ) |−| G ∗ τ | . The above inequality then follows upon noticing that | G ∗ σ | = | G | > | g τj ( G ) | .Hence, for any E (cid:54) = E ∗ , we have h ∗ ( g ( E )) < h ∗ ( E ). Since h ∗ ( E ) = 0 if and only E = E ∗ , g induces a canonical path from E to E ∗ for every E ∈ C p ( d in , d out ). To bound the length ofsuch paths, it suffices to bound h ∗ ( E ). Observe that h ∗ ( E ) ≤ max G ∈G σp ( d in ,d out ) ,σ ∈ S p Hd(
G, G ∗ σ ) + max σ ∈ S p ( | G ∗ σ | − | G ∗ | ) ≤ max G ∈G p ( d in ,d out ) ,σ ∈ S p ( | G | + | G ∗ σ | ) + max σ ∈ S p | G ∗ σ |≤ ( d ∗ + d in ) p + d ∗ p = (2 d ∗ + d in ) p, which concludes the proof. References
Raj Agrawal, Caroline Uhler, and Tamara Broderick. Minimal I-MAP MCMC for scalable structurediscovery in causal dag models. In
International Conference on Machine Learning , pages 89–98,2018.David J Aldous. Some inequalities for reversible Markov chains.
Journal of the London MathematicalSociety , 2(3):564–576, 1982.Hongzhi An, Da Huang, Qiwei Yao, and Cun-Hui Zhang. Stepwise searching for feature variables inhigh-dimensional linear regression.
Technical report , 2008.Steen A Andersson, David Madigan, and Michael D Perlman. A characterization of Markov equiva-lence classes for acyclic digraphs.
The Annals of Statistics, 25(2):505–541, 1997.
Bryon Aragam, Arash A Amini, and Qing Zhou. Learning directed acyclic graphs with penalized neighbourhood regression. arXiv preprint arXiv:1511.08963, 2015.
Bryon Aragam, Arash Amini, and Qing Zhou. Globally optimal score-based learning of directed acyclic graphs in high-dimensions. In Advances in Neural Information Processing Systems, pages 4450–4462, 2019.
Ery Arias-Castro and Karim Lounici. Estimation and variable selection with exponential weights.
Electronic Journal of Statistics , 8(1):328–354, 2014.Xuan Cao, Kshitij Khare, and Malay Ghosh. Posterior graph selection and estimation consistencyfor high-dimensional Bayesian DAG models.
The Annals of Statistics , 47(1):319–348, 2019.Federico Castelletti, Guido Consonni, Marco L Della Vedova, and Stefano Peluso. Learning Markovequivalence classes of directed acyclic graphs: An objective Bayes approach.
Bayesian Analysis ,13(4):1235–1260, 2018.Robert Castelo and Tom´as Kocka. On inclusion-driven learning of Bayesian networks.
Journal ofMachine Learning Research , 4(Sep):527–574, 2003.David Maxwell Chickering. A transformational characterization of equivalent bayesian network struc-tures.
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.
David Maxwell Chickering. Learning equivalence classes of Bayesian-network structures.
Journal ofmachine learning research , 2(Feb):445–498, 2002a.David Maxwell Chickering. Optimal structure identification with greedy search.
Journal of machinelearning research , 3(Nov):507–554, 2002b.Mathias Drton and Marloes H Maathuis. Structure learning in graphical modeling.
Annual Reviewof Statistics and Its Application , 4:365–393, 2017.Mathias Drton, Rina Foygel, and Seth Sullivant. Global identifiability of linear structural equationmodels.
The Annals of Statistics , 39(2):865–886, 2011.Byron Ellis and Wing Hung Wong. Learning causal Bayesian network structures from experimentaldata.
Journal of the American Statistical Association , 103(482):778–789, 2008.Nir Friedman and Daphne Koller. Being Bayesian about network structure: A Bayesian approach tostructure discovery in Bayesian networks.
Machine learning , 50(1-2):95–125, 2003.Bin Gao and Yuehua Cui. Learning directed acyclic graphical structures with genetical genomicsdata.
Bioinformatics , 31(24):3953–3960, 2015.Dan Geiger and David Heckerman. Parameter priors for directed acyclic graphical models and thecharacterization of several probability distributions.
The Annals of Statistics , 30(5):1412–1440,2002.AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Counting and samplingfrom markov equivalent dags using clique trees. In
Proceedings of the AAAI Conference on ArtificialIntelligence , volume 33, pages 3664–3671, 2019.Paolo Giudici and Robert Castelo. Improving Markov chain Monte Carlo model search for datamining.
Machine learning , 50(1-2):127–158, 2003.Robert JB Goudie and Sach Mukherjee. A gibbs sampler for learning dags.
The Journal of Machine Learning Research, 17(1):1032–1070, 2016.
Simon Griffiths, Ross Kang, Roberto Oliveira, and Viresh Patel. Tight inequalities among set hitting times in Markov chains. Proceedings of the American Mathematical Society, 142(9):3285–3298, 2014.
Marco Grzegorczyk and Dirk Husmeier. Improving the structure MCMC sampler for Bayesian networks by introducing a new edge reversal move.
Machine Learning , 71(2-3):265, 2008.Yangbo He, Jinzhu Jia, and Bin Yu. Reversible MCMC on Markov equivalence classes of sparsedirected acyclic graphs.
The Annals of Statistics , 41(4):1742–1779, 2013.Yangbo He, Jinzhu Jia, and Bin Yu. Counting and exploring sizes of Markov equivalence classes ofdirected acyclic graphs.
The Journal of Machine Learning Research , 16(1):2589–2609, 2015.Valen E Johnson and David Rossell. Bayesian model selection in high-dimensional settings.
Journalof the American Statistical Association , 107(498):649–660, 2012.Markus Kalisch and Peter B¨uhlmann. Estimating high-dimensional directed acyclic graphs with thePC-algorithm.
Journal of Machine Learning Research , 8(Mar):613–636, 2007.Daphne Koller and Nir Friedman.
Probabilistic graphical models: principles and techniques . MITpress, 2009.Jack Kuipers and Giusi Moffa. Partition MCMC for inference on acyclic digraphs.
Journal of theAmerican Statistical Association , 112(517):282–299, 2017.Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selec-tion.
Annals of Statistics , pages 1302–1338, 2000.Kyoungjae Lee, Jaeyong Lee, and Lizhen Lin. Minimax posterior convergence rates and modelselection consistency in high-dimensional DAG models based on sparse cholesky factors.
TheAnnals of Statistics , 47(6):3413–3437, 2019.David A Levin and Yuval Peres.
Markov chains and mixing times , volume 107. American Mathe-matical Soc., 2017.Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter B¨uhlmann. Predicting causal effectsin large-scale systems from observational data.
Nature Methods , 7(4):247–248, 2010.David Madigan, Jeremy York, and Denis Allard. Bayesian graphical models for discrete data.
Inter-national Statistical Review/Revue Internationale de Statistique , pages 215–232, 1995.David Madigan, Steen A Andersson, Michael D Perlman, and Chris T Volinsky. Bayesian modelaveraging and model selection for Markov equivalence classes of acyclic digraphs.
Communicationsin Statistics–Theory and Methods , 25(11):2493–2519, 1996.Ryan Martin, Raymond Mess, and Stephen G Walker. Empirical Bayes posterior concentration insparse high-dimensional linear models.
Bernoulli , 23(3):1822–1847, 2017.Christopher Meek.
Graphical Models: Selecting causal and statistical models . PhD thesis, PhD thesis,Carnegie Mellon University, 1997.Wil Michiels, Emile Aarts, and Jan Korst.
Theoretical aspects of local search . Springer Science &Business Media, 2007.El´ıas Moreno, F Javier Gir´on, and George Casella. Consistency of objective Bayes factors as themodel dimension grows.
The Annals of Statistics, 38(4):1937–1952, 2010.
Paul Munteanu and Mohamed Bendou. The EQ framework for learning equivalence classes of Bayesian networks. In Proceedings 2001 IEEE International Conference on Data Mining, pages 417–424. IEEE, 2001.
Preetam Nandy, Alain Hauser, and Marloes H Maathuis. High-dimensional consistency in score-based and hybrid structure learning.
The Annals of Statistics , 46(6A):3151–3183, 2018.Naveen Naidu Narisetty and Xuming He. Bayesian variable selection with shrinking and diffusingpriors.
The Annals of Statistics , 42(2):789–817, 2014.Teppo Niinimaki, Pekka Parviainen, and Mikko Koivisto. Partial order MCMC for structure discoveryin Bayesian networks. arXiv preprint arXiv:1202.3753 , 2012.Judea Pearl.
Probabilistic reasoning in intelligent systems: networks of plausible inference . MorganKaufmann, 1988.Stefano Peluso and Guido Consonni. Compatible priors for model selection of high-dimensionalgaussian dags.
Electronic Journal of Statistics , 14(2):4110–4132, 2020.Jose M Pena. Approximate counting of graphical models via MCMC. In
AISTATS , pages 355–362,2007.Yuval Peres and Perla Sousi. Mixing times are hitting times of large sets.
Journal of TheoreticalProbability , 28(2):488–519, 2015.Michael D Perlman. Graphical model search via essential graphs.
Contemporary Mathematics , 287:255–266, 2001.Jonas Peters and Peter B¨uhlmann. Identifiability of gaussian structural equation models with equalerror variances.
Biometrika , 101(1):219–228, 2014.Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singularvalues. In
Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures , pages 1576–1602.World Scientific, 2010.Marco Scutari, Catharina Elisabeth Graafland, and Jos´e Manuel Guti´errez. Who learns betterBayesian network structures: Accuracy and speed of structure learning algorithms.
InternationalJournal of Approximate Reasoning , 115:235–253, 2019.Alistair Sinclair. Improved bounds for mixing rates of Markov chains and multicommodity flow.
Combinatorics, probability and Computing , 1(4):351–370, 1992.Liam Solus, Yuhao Wang, and Caroline Uhler. Consistency guarantees for greedy permutation-basedcausal inference algorithms. arXiv preprint arXiv:1702.03530 , 2017.Peter Spirtes, Clark N Glymour, Richard Scheines, and David Heckerman.
Causation, prediction,and search . MIT press, 2000.Milan Studen´y.
Probabilistic conditional independence structures . Springer Science & Business Media,2006.Chengwei Su and Mark E Borsuk. Improving structure mcmc for bayesian networks through markovblanket resampling.
The Journal of Machine Learning Research, 17(1):4042–4061, 2016.
Caroline Uhler, Garvesh Raskutti, Peter Bühlmann, and Bin Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, pages 436–463, 2013.
Sara Van de Geer and Peter Bühlmann. ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. The Annals of Statistics, 41(2):536–567, 2013.
Thomas Verma and Judea Pearl.
Equivalence and synthesis of causal models . UCLA, ComputerScience Department, 1991.Yun Yang, Martin J Wainwright, and Michael I Jordan. On the computational complexity of high-dimensional Bayesian variable selection.
The Annals of Statistics, 44(6):2497–2532, 2016.

Supplement Table of Contents
A Notation used in the main text
B Path methods and mixing time of Markov chains
B.1 On the equivalence between mixing time and hitting time
B.2 Proof of Lemma 1
B.3 Proof of Theorem 1
C Preliminaries for DAG models
C.1 Markov equivalent DAGs
C.2 Gaussian DAG models
C.3 Empirical Bayes modeling for structure learning
D High-dimensional empirical variable selection
D.1 Model, prior and posterior distributions
D.2 High-dimensional consistency results
D.3 Proof of Theorem D1
D.4 Proof of Lemma D2
D.5 Auxiliary lemmas
E Proofs for the main text
E.1 Proof of Lemma 3
E.2 Proof of Lemma 4
E.3 Proof of Theorem 3(i)
E.4 Proof of Theorem 3(ii) and (iii)
E.5 Proof of Theorem 5
E.6 Proof of Corollary 2
E.7 Proof of Theorem 6
E.8 Proof of Theorem 7
E.9 Proof of Example 2
E.10 Proof of Example 3

A Notation used in the main text
In the table below, we list the notation that is used frequently in Sections 3 to 7.
Notation : Description
[p] : {1, 2, ..., p}
S_p : set of all permutations of [p]
|S| : cardinality of a set S
N_p(µ, Σ) : p-variate normal distribution with covariance matrix Σ
X : a random vector with components X_1, ..., X_p
X, X_j, X_S : data matrix, column vector, submatrix with columns indexed by S
|G| : number of edges in the DAG G
Hd(S, S'), Hd(G, G') : Hamming distance between two sets or DAGs
Pa_j(G), Ch_j(G) : set of parents/children of node j in the DAG G
[G] : the equivalence class that contains the DAG G
EG(E) : the CPDAG representing the equivalence class E
CI(G), CI(E) : set of CI relations encoded by a DAG G or an equivalence class E
G_p, G^σ_p : set of p-vertex DAGs, set of p-vertex DAGs with ordering σ
G_p(d_in, d_out), G^σ_p(d_in, d_out) : sets of DAGs that satisfy the in-degree and out-degree constraints
C_p, C_p(d_in, d_out) : sets of equivalence classes
A^σ_p(j) : set of nodes that precede X_j in the ordering σ
M^σ_p(j, d_in) : set of possible values of Pa_j(G) for G ∈ G^σ_p(d_in, p); see (3)
g^σ_j(S), g^σ_j(G) : canonical transition functions on M^σ_p(j, d_in) and G^σ_p(d_in, d_out)
N_ads(E) : add-delete-swap neighborhood of an equivalence class E; see (1)
N_add, N_del, N_swap : addition/deletion/swap neighborhood of G or E
Σ(B, Ω) : Σ has a modified Cholesky decomposition given by (B, Ω)
D_p(G) : set of pairs (B, Ω) such that Σ(B, Ω) is Markovian w.r.t. G; see (7)
D_p(σ) : set of pairs (B, Ω) compatible with ordering σ; see (24)
π, π_n : prior and posterior distributions or density functions†
d_in, d_out : maximum in-degree/out-degree
c, c, κ, γ : hyperparameters for π(E) and π(B, Ω | G)
α : exponent for the fractional likelihood function
ψ_j, ψ : posterior score functions; see (10) and (11)
Σ*, G*, E* : true covariance matrix, DAG model and equivalence class
P* : probability measure corresponding to the true model
G*_σ : minimal I-map of G* with ordering σ
(B*_σ, Ω*_σ) : modified Cholesky decomposition of Σ* in D_p(σ)
S*_{σ,j} : parent set of node j in G*_σ
d*_σ, d* : maximum degree of the minimal I-map(s); see (4)
r* : maximum size of [G*_σ] over all σ; see (18)
K, K^σ, K_s : proposal distributions of MH algorithms
P, P^σ, P_s : transition matrices of MH algorithms
†: when π or π_n denotes a density function, its dominating measure depends on the context.

B Path methods and mixing time of Markov chains
B.1 On the equivalence between mixing time and hitting time
Let ( Y t ) t ∈ N be a Markov chain defined on a finite state space Θ with transition matrix P .Assume that P is irreducible and aperiodic so that there always exists a unique stationary andlimiting distribution π . Let P θ denote the probability measure for ( Y t ) t ∈ N with initial value Y = θ , and let E θ be the corresponding expectation. For t ∈ N , let P t ( θ, · ) = P θ ( Y t ∈ · )denote the t -step transition matrix. For any set A ⊆ Θ, define the hitting time of A by h ( A ) = min { t ∈ N : Y t ∈ A } . Theorem B1 (Peres and Sousi [2015]) . For some α < / , define T H ( α ) = max { E θ [ h ( A )] : θ ∈ Θ , A ⊆ Θ , π ( A ) ≥ α } . Suppose that P is reversible and let T Lmix be the mixing time of the lazy chain with transitionmatrix ( P + I ) / . Then, T Lmix and T H ( α ) are equivalent up to constant factors.Remark . This result was first proved by Aldous [1982] for continuous-time Markov chains.Griffiths et al. [2014] showed that the equivalence between T Lmix and T H ( α ) also holds for α = 1 /
2. “Up to constant factors” means that there exist constants c α , C α > c α T H ( α ) ≤ T Lmix ≤ C α T H ( α ). Theorem B2.
Suppose P is reversible and let T Lmix be as defined in Theorem B1. If thereexists some state θ ∗ such that π ( θ ∗ ) > / , then T Lmix is equivalent, up to constant factors,to T ∗ = max θ ∈ Θ E θ [ h ( { θ ∗ } )] .Proof. Choose any α ∈ (0 , / A with π ( A ) ≥ α , we have θ ∗ ∈ A and thus h ( A ) ≤ h ( { θ ∗ } ). Hence, T H ( α ) = max { E θ [ h ( A )] : θ ∈ Θ , A = { θ ∗ }} where T H ( α ) is as definedin Theorem B1, from which the result follows.Rapid mixing of an MH algorithm is impossible if the chain can get stuck at a sub-optimallocal mode other than θ ∗ for exponentially many steps. Theorem B3.
Consider an asymptotic setting where Θ, P, π are implicitly indexed by n. For each n, assume there exists some θ_0 such that π(θ_0) ≤ 1/2 and P(θ_0, θ_0) ≥ 1 − e^{−cn}, where c > 0 is a universal constant. Then the mixing time of P cannot be bounded by any polynomial in n.

Proof. Let A_n = Θ \ {θ_0}. By the property of total variation distance, ∥P^t(θ_0, ·) − π(·)∥_TV ≥ |P^t(θ_0, A_n) − π(A_n)|. It then follows from Definition 2 that

T_mix ≥ min{ t ∈ N : |P^t(θ_0, A_n) − π(A_n)| ≤ 1/4 } ≥ min{ t ∈ N : P^t(θ_0, A_n) ≥ 1/4 },

since π(A_n) ≥ 1/2. Observe that P^t(θ_0, θ_0) ≥ (1 − e^{−cn})^t ≥ 1 − t e^{−cn} for any t ≥ 1. Hence, P^t(θ_0, A_n) ≤ t e^{−cn}, which yields the result.

B.2 Proof of Lemma 1

Proof.
First, we show that such a function g exists. Since (Θ , N ) is connected, for any θ (cid:54) = θ ∗ ,there exists a shortest N -path from θ to θ ∗ , which we denote by ( θ = θ, θ , . . . , θ k = θ ∗ ).Define ˜ g ( θ ) to be the state θ on this path. Clearly, ˜ g is a canonical transition function.Next, we explicitly construct a canonical path ensemble T using any canonical transitionfunction g . The path from θ to θ (cid:48) in T will be denoted by e T ( θ, θ (cid:48) ), which is unique. Let N = { , , . . . } and k ( θ ) = min { i ∈ N : g i ( θ ) = θ ∗ } < ∞ . For θ (cid:54) = θ ∗ , define e T ( θ, θ ∗ ) =( θ, g ( θ ) , . . . , g k ( θ ) ( θ ) = θ ∗ ) (note that it cannot contain any duplicate state since otherwise k ( θ ) does not exist.) Since N is symmetric, we have θ ∈ N ( g ( θ )) for each θ (cid:54) = θ ∗ , and thusthe path e T ( θ ∗ , θ ) can be defined by reversing e T ( θ, θ ∗ ). The construction of e T ( θ, θ (cid:48) ) for θ (cid:48) (cid:54) = θ ∗ is divided into three cases.(Case 1) θ = g j ( θ (cid:48) ) for some j ∈ N + , where N + = { , , . . . } .(Case 2) θ (cid:48) = g i ( θ ) for some i ∈ N + .(Case 3) Neither of the above two statements holds.For Case 1, since e T ( θ, θ ∗ ) = ( θ, g ( θ ) , . . . , g j − ( θ ) , θ (cid:48) , g j +1 ( θ ) , . . . , g k ( θ ) = θ ∗ ), we can simplydefine e T ( θ, θ (cid:48) ) = ( θ, g ( θ ) , . . . , g j − ( θ ) , θ (cid:48) ), which is a sub-path of e T ( θ, θ ∗ ). Case 2 can behandled similarly. Consider Case 3. By concatenating the canonical paths e T ( θ, θ ∗ ) and e T ( θ ∗ , θ (cid:48) ), we can define e T ( θ, θ (cid:48) ) = ( θ, g ( θ ) , . . . , g j ( θ ) = θ ∗ = g k ( θ ) , . . . , g ( θ (cid:48) ) , θ (cid:48) ) , where j is the length of e T ( θ, θ ∗ ) and k is the length of e T ( θ ∗ , θ (cid:48) ). To prove e T ( θ, θ (cid:48) ) hasno duplicate states, it suffices to show that paths e T ( θ, θ ∗ ) and e T ( θ ∗ , θ (cid:48) ) do not share anycommon states except θ ∗ . We prove it by contradiction. Suppose τ (cid:54) = θ ∗ exists in bothpaths. Then τ = g s ( θ ) = g t ( θ (cid:48) ) for some s, t ∈ N + . Without loss of generality, assume s > t .But this implies that θ (cid:48) = g s − t ( θ ), which yields the contradiction. B.3 Proof of Theorem 1
Part (i) follows upon observing that π must be maximized at θ ∗ and there is no local maximaother than θ ∗ on (Θ , N ). Proof of part (ii).
Let g − k ( θ ∗ ) = { θ ∈ Θ : g k ( θ ) = θ ∗ , g k − ( θ ) (cid:54) = θ ∗ } . Note that Θ = (cid:83) k ≥ g − k ( θ ∗ ). By Condition (2), we have π ( θ ) /π ( θ ∗ ) ≤ p − kt for any θ ∈ g − k ( θ ∗ ). Condition(1) implies that | g − k ( θ ∗ ) | ≤ p kt . Hence, (cid:80) θ ∈ Θ π ( θ ) π ( θ ∗ ) ≤ ∞ (cid:88) k =0 π ( g − k ( θ ∗ )) π ( θ ∗ ) ≤ ∞ (cid:88) k =0 p − k ( t − t ) = 11 − p − ( t − t ) , from which the result follows. Proof of part (iii).
We use the canonical path method of Sinclair [1992] to compute an upperbound on the mixing time. Let e ( θ, θ (cid:48) ) denote an N -path (which we simply call a pathhenceforth) from θ to θ (cid:48) . A path is also interpreted as a sequence of directed edges; thus,36he notation ( θ i , θ j ) ∈ e ( θ, θ (cid:48) ) means that θ i , θ j are two consecutive states ( θ i first) alongthe path. The length of the path is denoted by | e ( θ, θ (cid:48) ) | , which is defined to be the numberof edges in the path. Let T = { e T ( θ, θ (cid:48) ) : θ, θ (cid:48) ∈ Θ , and θ (cid:54) = θ (cid:48) } denote the canonical pathensemble induced by g , as constructed in the proof of Lemma 1.By Sinclair [1992, Proposition 1, Corollary 6] and our definition of the mixing time, T mix − log[min θ ∈ Θ π ( θ )] + log 4 ≤ P ) ≤ ρ ( T ) l ( T ) , (21)where Gap( P ) is the spectral gap of a transition matrix P , l ( T ) is the length of the longestcanonical path in T , and ρ ( T ) = max τ ∈ Θ ,τ (cid:48) ∈N ( τ ) π ( τ ) P ( τ, τ (cid:48) ) (cid:88) ( θ,θ (cid:48) ):( τ,τ (cid:48) ) ∈ e T ( θ,θ (cid:48) ) π ( θ ) π ( θ (cid:48) )is known as the path congestion parameter of T . Hence, in order to bound the mixing time,we only need to calculate l ( T ) and ρ ( T ).It is clear from construction that for any θ (cid:54) = θ (cid:48) , we have | e T ( θ, θ (cid:48) ) | ≤ | e T ( θ, θ ∗ ) | + | e T ( θ (cid:48) , θ ∗ ) | ≤ (cid:96) max , where (cid:96) max is as defined in the theorem. Next, we need to bound ρ ( T ).Observe that for T constructed in Lemma 1, (cid:8) ( θ, θ (cid:48) ) : ( τ, τ (cid:48) ) ∈ e T ( θ, θ (cid:48) ) (cid:9) (cid:54) = ∅ , only if τ = g ( τ (cid:48) ) or τ (cid:48) = g ( τ ) . Assume that τ (cid:48) = g ( τ ), which implies that τ (cid:54) = θ ∗ (the other case can be analyzed similarly.)By condition (3), P ( τ, τ (cid:48) ) ≥ p − t . Let N = { , , . . . } . Define Λ( τ ) = { θ ∈ Θ : τ = g k ( θ ) , k ∈ N } as the ancestor set of τ w.r.t. the transition function g . Note that τ ∈ Λ( τ ). If( τ, τ (cid:48) ) ∈ e T ( θ, θ (cid:48) ) for some θ and θ (cid:48) , according to our construction of T , it is straightforwardto verify that θ ∈ Λ( τ ). Therefore, { ( θ, θ (cid:48) ) : ( τ, τ (cid:48) ) ∈ e T ( θ, θ (cid:48) ) } ⊆ Λ( τ ) × Θ. It follows that ρ ( T ) ≤ max ( τ,τ (cid:48) ): τ (cid:48) = g ( τ ) π ( τ ) P ( τ, τ (cid:48) ) (cid:88) ( θ,θ (cid:48) ) ∈ Λ( τ ) × Θ π ( θ ) π ( θ (cid:48) )= max ( τ,τ (cid:48) ): τ (cid:48) = g ( τ ) π ( τ ) P ( τ, τ (cid:48) ) (cid:88) θ ∈ Λ( τ ) (cid:88) θ (cid:48) ∈ Θ π ( θ ) π ( θ (cid:48) )= max ( τ,τ (cid:48) ): τ (cid:48) = g ( τ ) π (Λ( τ )) π ( τ ) P ( τ, τ (cid:48) ) ≤ p t max τ (cid:54) = θ ∗ { π (Λ( τ )) /π ( τ ) } . (22)For k ∈ N , let g − k ( τ ) = { θ ∈ Θ : g k ( θ ) = τ, g k − ( θ ) (cid:54) = τ } . Then Λ( τ ) = (cid:83) k ∈ N g − k ( τ ). Bycondition (1), we have | g − k ( τ ) | ≤ p kt . Thus, using condition (2) and t > t , we find that π (Λ( τ )) π ( τ ) = (cid:88) k ∈ N π ( g − k ( τ )) π ( τ ) ≤ (cid:88) k ∈ N p − k ( t − t ) = 11 − p − ( t − t ) . (23)The claim then follows from (21), (22) and (23).37 Preliminaries for DAG models
C.1 Markov equivalent DAGs
Below are some useful results for checking whether two DAGs are Markov equivalent. Theskeleton of a DAG is the unique undirected graph obtained by replacing all edges in theDAG with undirected ones. A v-structure is a triple i → j ← k (note that there is no edgebetween i and k .) By an edge reversal, we mean to change an existing edge i → j to j → i .We say the reversal is covered if and only if Pa i = Pa j \ { i } . Two nodes i, j are said to beadjacent in G if i → j ∈ G or j → i ∈ G . Lemma C1.
Two DAGs are Markov equivalent if and only if they have the same skeletonand v-structures.Proof.
See Verma and Pearl [1991].
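Lemma C1 translates directly into a simple equivalence test. The sketch below represents a DAG as a dictionary mapping each node to its parent set and compares skeletons and v-structures; it is only meant to illustrate the criterion on small examples.

```python
def skeleton(dag):
    """Undirected edge set of a DAG given as a dict: node -> set of parents."""
    return {frozenset((i, j)) for j, pa in dag.items() for i in pa}

def v_structures(dag):
    """Triples (i, j, k) with i -> j <- k and i, k non-adjacent."""
    skel = skeleton(dag)
    vs = set()
    for j, pa in dag.items():
        for i in pa:
            for k in pa:
                if i < k and frozenset((i, k)) not in skel:
                    vs.add((i, j, k))
    return vs

def markov_equivalent(dag1, dag2):
    """Lemma C1: equivalent iff same skeleton and same v-structures."""
    return skeleton(dag1) == skeleton(dag2) and v_structures(dag1) == v_structures(dag2)

# The chain 1 -> 2 -> 3 and its reversal are equivalent; the collider 1 -> 2 <- 3 is not.
chain = {1: set(), 2: {1}, 3: {2}}
reversed_chain = {3: set(), 2: {3}, 1: {2}}
collider = {1: set(), 2: {1, 3}, 3: set()}
assert markov_equivalent(chain, reversed_chain)
assert not markov_equivalent(chain, collider)
```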
Lemma C2.
Two DAGs G , G are Markov equivalent if and only if we can transform G to G by a sequence of covered edge reversals.Proof. See Chickering [1995, Theorem 2].
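The notion of a covered edge used in Lemma C2 is easy to check locally. A minimal sketch, using the same dictionary representation of a DAG as above:

```python
def is_covered(dag, i, j):
    """An edge i -> j is covered iff Pa_i = Pa_j \\ {i}; reversing such an edge
    keeps the graph acyclic and Markov equivalent (cf. Lemma C2)."""
    assert i in dag[j], "i -> j must be an edge of the DAG"
    return dag[i] == dag[j] - {i}
```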
Lemma C3.
Let G and G be two Markov equivalent DAGs such that Pa j ( G ) = Pa j ( G ) and i / ∈ Pa j ( G ) . Suppose that G (cid:48) = G ∪ { i → j } and G (cid:48) = G ∪ { i → j } are both DAGs.Then G (cid:48) , G (cid:48) are Markov equivalent.Proof. By Lemma C1, G (cid:48) and G (cid:48) have the same skeleton. It suffices to show that G (cid:48) and G (cid:48) share the same v-structures. One can see that a v-structure remains unchanged unless itinvolves both i and j . There are only two cases where adding the edge i → j affects somev-structure:(Case 1) For some node k , a new v-structure i → j ← k is formed.(Case 2) For some node k , an existing v-structure i → k ← j is shielded.We show that if Case 1 or Case 2 happens in G (cid:48) , it also happens in G (cid:48) . If i → j ← k is a v-structure in G (cid:48) for k ∈ Pa j ( G ), i, k are not adjacent in G (cid:48) . Then i, k must be non-adjacent in G (cid:48) as well since G (cid:48) and G (cid:48) have the same skeleton. By the assumption Pa j ( G ) = Pa j ( G ),we find that k ∈ Pa j ( G ) and thus G (cid:48) contains the v-structure i → j ← k . If G has av-structure i → k ← j for some node k , so does G by Lemma C1. Adding i → j deletes thev-structure both in G (cid:48) and G (cid:48) . Lemma C4.
Let G_1 and G_2 be two Markov equivalent DAGs such that Pa_j(G_1) = Pa_j(G_2) and i ∈ Pa_j(G_1). Suppose that G'_1 = G_1 \ {i → j} and G'_2 = G_2 \ {i → j} are both DAGs. Then G'_1, G'_2 are Markov equivalent.

Proof. The proof is similar to that of Lemma C3. We show that G'_1 and G'_2 share the same v-structures. There are only two cases where deleting the edge i → j affects the v-structures.
(Case 1) For some node k, a new v-structure i → k ← j is formed.
(Case 2) For some node k, an existing v-structure i → j ← k is broken up.
Let i → k ← j be a v-structure in G'_1. Observe that the assumptions Pa_j(G_1) = Pa_j(G_2) and that G_1 and G_2 are Markov equivalent imply Ch_j(G_1) = Ch_j(G_2). Since k ∈ Ch_j(G_2), G'_2 also has the edge k ← j. Further, we must have i → k ∈ G'_2, since G_2 is acyclic. This shows that i → k ← j is also a v-structure in G'_2. If G_1 has a v-structure i → j ← k for some node k, so does G_2. Deleting i → j removes this v-structure in both G'_1 and G'_2.

C.2 Gaussian DAG models
We provide some background on the decomposition of Σ( B, Ω) given in (6). It is actuallythe (modified) Cholesky decomposition of Σ (up to permutation of rows and columns), and B is the modified Cholesky factor (after scaling). Let D p ( σ ) = (cid:83) G ∈G σp D p ( G ) where D p ( G )is as given in (7). Observe that, equivalently, D p ( σ ) can be defined by D p ( σ ) = { ( B, Ω) : B ∈ R p × p , B ij = 0 if σ − ( i ) ≥ σ − ( j ) , for any i, j ∈ [ p ];Ω = diag( ω , . . . , ω p ) , ω i > i ∈ [ p ] } . (24) Lemma C5.
For any positive definite matrix Σ ∈ R p × p and σ ∈ S p , the decomposition Σ = ( I − B (cid:62) ) − Ω( I − B ) − for ( B, Ω) ∈ D p ( σ ) exists and is unique.Proof. First, permute the rows and columns of Σ using σ so that B becomes upper triangularafter permutation. The result then follows from the existence and uniqueness of Choleskydecomposition for positive definite matrices. See also Aragam et al. [2015, Lemma 2.1]. Lemma C6.
Let
Σ = ( I − B (cid:62) ) − Ω( I − B ) − for some ( B, Ω) ∈ (cid:83) σ ∈ S p D p ( σ ) . Let G be aDAG such that i → j ∈ G if and only if B ij > . Then G is a minimal I-map of N p (0 , Σ) ;that is, N p (0 , Σ) is Markovian w.r.t. G but not Markovian w.r.t. any sub-DAG of G .Proof. See Peters and B¨uhlmann [2014, Theorem 1].
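Lemmas C5 and C6 can be illustrated numerically: regressing each variable on its predecessors in a given ordering produces the factors (B, Ω), and the support of B is the minimal I-map for that ordering. The following sketch (Python with NumPy) assumes the population covariance matrix is available and that the nodes are labeled 0, ..., p−1; it is an illustration of the decomposition, not the estimator studied in the paper.

```python
import numpy as np

def cholesky_sem(Sigma, order):
    """Modified Cholesky factors (B, Omega) of a covariance matrix for an ordering.

    Each variable is regressed on its predecessors in `order`; B[i, j] is the
    coefficient of the edge i -> j and Omega holds the residual variances, so
    Sigma = (I - B)^{-T} Omega (I - B)^{-1} up to numerical error, and the
    support of B gives the minimal I-map for this ordering (Lemma C6).
    """
    p = Sigma.shape[0]
    B = np.zeros((p, p))
    omega = np.zeros(p)
    for t, j in enumerate(order):
        pred = list(order[:t])                 # predecessors of node j in the ordering
        if pred:
            coef = np.linalg.solve(Sigma[np.ix_(pred, pred)], Sigma[np.ix_(pred, [j])])
            B[pred, [j]] = coef.ravel()
            omega[j] = Sigma[j, j] - float(Sigma[[j], :][:, pred] @ coef)
        else:
            omega[j] = Sigma[j, j]
    return B, np.diag(omega)
```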
Lemma C7.
Let N p (0 , Σ) be a non-degenerate multivariate normal distribution Markovianw.r.t. a p -vertex DAG G . Then the decomposition Σ = ( I − B (cid:62) ) − Ω( I − B ) − for ( B, Ω) ∈D p ( G ) exists and is unique.Proof. The Markovian assumption implies that the density of N p (0 , Σ) factorizes accordingto G . Let X = ( X , . . . , X p ) ∼ N p (0 , Σ). The existence of ( B, Ω) follows from the fact that,for any S ⊆ [ p ] \ { j } , X j | X S follows a normal distribution and the conditional expectationis linear in X S . Let σ be a topological ordering of G . Since D p ( G ) ⊆ D p ( σ ), the uniquenessfollows from Lemma C5. Lemma C8.
Let G be a p-vertex DAG and S = {(i, j) : i → j ∈ G}. Let B be a p × p matrix such that B_S = {B_ij : (i, j) ∈ S} is sampled from an absolutely continuous distribution on R^{|S|} and the other entries of B are zeroes. Let Ω = diag(ω_1, ..., ω_p) be a diagonal matrix where (ω_1, ..., ω_p) is sampled from an absolutely continuous distribution on (0, ∞)^p. Let Σ = (I − B^⊤)^{−1} Ω (I − B)^{−1}. Then N_p(0, Σ) is perfectly Markovian w.r.t. G almost surely.

Proof. See Spirtes et al. [2000, Theorem 3.2].

C.3 Empirical Bayes modeling for structure learning
Consider the model given in (5). This Bayesian model is motivated by two observations.First, by Lemma C7, if Σ is positive definite and N p (0 , Σ) is Markovian w.r.t. G , then thedecomposition (6) exists and is unique. Second, if the edge weights (entries B ij for i → j ∈ G )are sampled from an absolutely continuous distribution, the resulting distribution N p (0 , Σ) isalmost surely perfectly Markovian w.r.t. G ; see Lemma C8. There is little loss of generalityin assuming that X has mean zero, since the normality implies that any CI statement about X , . . . , X p can be determined using Σ alone.We specify an empirical prior for ( B, Ω) | G with support D p ( G ) as follows. Let β j ( G ) = B Pa j ,j be the subvector of the j -th column of B with entries indexed by Pa j ( G ). By theSEM representation, β j contains all nonzero regression coefficients for the response vector X j . We use an empirical normal-inverse-gamma prior for each ( β j , ω j ): β j | ω j , Pa j ∼ N | Pa j | (cid:18) ( X (cid:62) Pa j X Pa j ) − X (cid:62) Pa j X j , ω j γ ( X (cid:62) Pa j X Pa j ) − (cid:19) ,π ( ω j ) ∝ ω − κ/ − j , where γ, κ are hyperparameters and X Pa j denotes the submatrix of X containing columnsindexed by Pa j . So the prior mean for β j is simply the ordinary-least-squares estimator. Let L ( B, Ω) denote the likelihood function (the dependency on X is omitted.) Since the empiricalprior relies on the observed data, to counteract its effect we use a fractional likelihood withexponent α ∈ (0 ,
1) [Martin et al., 2017]. The formula is given by π n ( B, Ω | G ) ∝ π ( B, Ω | G ) L ( B, Ω) α = π ( B, Ω | G ) L ( B, Ω) − α L ( B, Ω) . Hence, the effective prior distribution for ( B, Ω) | G is π ( · ) /L − α ( · ). By a routine calculationusing the conjugacy of normal-inverse-gamma prior, we find that f α ( G ) = (cid:90) π ( B, Ω | G ) L ( B, Ω) α d ( B, Ω)= (cid:18) αγ (cid:19) −| G | / p (cid:89) j =1 ( X (cid:62) j Φ ⊥ Pa j X j ) − ( αn + κ ) / , (25)The function f α gives the marginal (fractional) likelihood of a DAG model. The posteriorprobability of a DAG G then can be computed by π n ( G ) = (cid:90) π n ( G, B, Ω) d ( B, Ω) ∝ (cid:90) π ( G ) π ( B, Ω | G ) L ( B, Ω) α d ( B, Ω)= π ( G ) f α ( G ) . High-dimensional empirical variable selection
By the SEM representation (8), the DAG selection problem can be seen as a series of variableselection problems. Here we prove some high-dimensional consistency results for a singlevariable selection problem using the empirical prior. We consider a general setting wherewe have a response vector denoted by y and an n × m design matrix denoted by Z . Inorder to make our result easily applicable to DAG selection problems, we assume that thetotal number of variables is p but Z may only contain a subset of them; thus, we alwayshave m ≤ p . The main result of this section is provided in Theorem D1. If the topologicalordering for a DAG selection problem is given by σ ( i ) = i for i ∈ [ p ], one can simply applyTheorem D1 with y = X j and Z = X [ j − for every j ≥
2. If one is only interested in a fixedsingle variable selection problem, one can always apply our result with m = p . D.1 Model, prior and posterior distributions
Consider the following empirical model for variable selection [Martin et al., 2017], y | S, β S , ω ∼ N n ( Z S β S , ωI ) ,π ( S ) ∝ ( c p c ) −| S | S ( d ) ( S ) , ∀ S ⊆ [ m ] ,π ( ω ) ∝ ω − κ/ − , ∀ ω > ,β S | S, ω ∼ N | S | (cid:26) ( Z (cid:62) S Z S ) − Z (cid:62) S y, ωγ ( Z (cid:62) S Z S ) − (cid:27) , where I denotes the identity matrix and c > , c ≥ , κ ≥ , γ > S ( d ) is the set of all models with size not greater than d ; that is, S ( d ) = { S ⊆ [ m ] : | S | ≤ d } . We allow d ∈ [ p ] to be greater than m . Using a fractional likelihood with exponent α , wefind that the posterior probability of some S ∈ S ( d ) is given by π n ( S ) ∝ c −| S | p − c | S | (cid:18) αγ (cid:19) −| S | / ( y (cid:62) Φ ⊥ S y ) − ( αn + κ ) / , (26)where Φ ⊥ S (and Φ S ) denotes the projection matrix,Φ S = Z S ( Z (cid:62) S Z S ) − Z (cid:62) S , Φ ⊥ S = I − Φ S . (27)The exponentiation of the posterior score function ψ j defined in (11) has exactly the sameform as (26). Thus, the analysis of the DAG selection and structure learning problem canbe reduced to that of all possible nodewise variable selection problems.Let the true data generating model be given by y = y + (cid:15), where y = Z S ∗ β ∗ S ∗ and (cid:15) ∼ N n (0 , ω ∗ I ) , (28)for some S ∗ ⊆ [ m ], β ∗ S ∗ ∈ R | S ∗ | and ω ∗ >
0. The vector y denotes the signal part of y .We denote a variable selection problem as described above by V = ( y, Z, M ∗ , η ) (29)where ( y, Z ) represents the data, M ∗ = ( S ∗ , β ∗ S ∗ , ω ∗ , (cid:15) ) represents the true model and η =( d, α, c , c , κ, γ ) denotes all parameters. 41 .2 High-dimensional consistency results We make the following assumptions on the data, true model and prior parameters.
Assumption A (cid:48) . Let ˜ Z = [ Z y ]. There exist constants ν, ν > nν ≤ λ min ( ˜ Z (cid:62) S ˜ Z S ) ≤ λ max ( ˜ Z (cid:62) S ˜ Z S ) ≤ nν, for any S ⊆ [ m + 1] with | S | ≤ d . Assumption B (cid:48) . For the true model given in (28), ν ≤ ω ∗ ≤ ν and (cid:15) satisfies thatmin S : | S |≤ d (cid:15) (cid:62) Φ ⊥ S (cid:15) ≥ nω ∗ / , (B1 (cid:48) )max S : | S |≤ d max j / ∈ S (cid:15) (cid:62) (Φ S ∪{ j } − Φ S ) (cid:15) ≤ ρω ∗ log p, (B2 (cid:48) )for some constant ρ ≥ Assumption C (cid:48) . Prior parameters satisfy that κ ≤ n , c (cid:112) α/γ ∈ [1 , p ], and c ≥ ( α + 1) ρ + t for some universal constant t > Assumption D (cid:48) . If d ≤ m , then (cid:26) ν ( ν − ν ) ν + 1 (cid:27) | S ∗ | ≤ d. Assumption E (cid:48) . There exists a universal constant C β such that β = min {| β ∗ j | : β ∗ j (cid:54) = 0 } ≥ C β + 4 c ) ν log pαν n . As in Definition 6, we define a transition function g : S ( d ) → S ( d ) by g ( S ) = S, if S = S ∗ , arg max S (cid:48) ∈N ∗ del ( S ) π n ( S (cid:48) ) , if S ∗ ⊂ S, arg max S (cid:48) ∈N ∗ add ( S ) π n ( S (cid:48) ) , if S ∗ (cid:54)⊆ S, | S | < d arg max S (cid:48) ∈N ∗ swap ( S ) π n ( S (cid:48) ) , if S ∗ (cid:54)⊆ S, | S | = d, where N ∗ add ( S ) = { S ∪ { k } : k ∈ S ∗ \ S } , N ∗ del ( S ) = { S \ { (cid:96) } : (cid:96) ∈ S \ S ∗ } , N ∗ swap ( S ) = { ( S ∪ { k } ) \ { (cid:96) } : k ∈ S ∗ \ S, (cid:96) ∈ S \ S ∗ } . Recall that for two models S , S , we use Hd( S , S ) = | S \ S | + | S \ S | to denote the“Hamming distance”. Clearly, for any S (cid:54) = S ∗ , we always have Hd( g ( S ) , S ∗ ) < Hd(
S, S ∗ ).The consistency results provided in the following theorem shows that we can further obtaina lower bound polynomial in p on π n ( g ( S )) /π n ( S ). Theorem D1.
Suppose Assumptions A (cid:48) , B (cid:48) , C (cid:48) , D (cid:48) and E (cid:48) hold, and let t > be the universalconstant as defined in Assumption C (cid:48) . If C β ≥ t/ , then π n ( g ( S )) π n ( S ) ≥ p t , ∀ S ∈ S ( d ) \ { S ∗ } . Proof.
It follows from Lemmas D1, D3 and D4 to be proved below.42 .3 Proof of Theorem D1
We prove three lemmas below, each for one subcase in the definition of g (except the case S = S ∗ ). Theorem D1 then follows.First, consider an overfitted model S such that S ∗ ⊂ S . Under Assumption B (cid:48) , as longas c is sufficiently large, we can pick an arbitrary j ∈ S \ S ∗ and then S (cid:48) = S \ { j } has amuch larger posterior probability. Lemma D1.
Suppose Assumptions B (cid:48) and C (cid:48) hold. For any S ∈ S ( d ) such that S ∗ ⊂ S andany j ∈ S \ S ∗ , we have π n ( S ) π n ( S \ { j } ) ≤ p − c +( α +1) ρ ≤ p − t . Proof.
Let S (cid:48) = S \ { j } for any j ∈ S \ S ∗ . Then, it follows from (26) and the inequality1 + x ≤ e x that π n ( S ) π n ( S (cid:48) ) ≤ c − (1 + α/γ ) − / p − c exp (cid:26) αn + κ y (cid:62) (Φ S − Φ S (cid:48) ) yy (cid:62) Φ ⊥ S y (cid:27) . Since S ∗ ⊆ S and S ∗ ⊆ S (cid:48) , by (B1 (cid:48) ) and (B2 (cid:48) ), we have y (cid:62) Φ ⊥ S y = (cid:15) (cid:62) Φ ⊥ S (cid:15) ≥ nω ∗ / ,y (cid:62) (Φ S − Φ S (cid:48) ) y = (cid:15) (cid:62) (Φ S − Φ S (cid:48) ) (cid:15) ≤ ρω ∗ log p. A routine calculation then yields the result.For an underfitted model S ∈ S ( d ) (underfitted means S ∗ \ S (cid:54) = ∅ ), it is more difficultto identify a neighboring model with a much larger posterior probability. Consider thereduction in residual sum of squares (RSS), (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) , by adding some covariate k ∈ S ∗ \ S . Even if | β ∗ k | is large, the change in RSS may not be significant due to thecorrelation between Z k and other covariates in S ∗ \ S . Similarly, a covariate in ( S ∗ ) c mayhappen to explain a significant amount of variation in the signal, due to its correlation withsome missing covariate in S ∗ \ S . To identify a much more “likely” neighboring model foreach underfitted S , we first prove an auxiliary result, which bounds the change in RSS (with y replaced by y ) for the best move. Lemma D2.
Let y be as defined in (28) and suppose Assumption A (cid:48) holds. For any S ∈ S ( d ) such that S ∗ \ S (cid:54) = ∅ , there exists k ∈ S ∗ \ S such that (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) ≥ nν (cid:107) β ∗ S ∗ \ S (cid:107) ν | S ∗ \ S | . For any S ∈ S ( d ) such that S \ S ∗ (cid:54) = ∅ , there exists j ∈ S \ S ∗ such that (cid:107) (Φ S − Φ S \{ j } ) y (cid:107) ≤ nν ( ν − ν ) (cid:107) β ∗ S ∗ \ S (cid:107) ν | S \ S ∗ | . Proof.
The proof is similar to that for Yang et al. [2016, Lemma 8]. The key difference isthat we do not need to impose an irrepresentability assumption. In Yang et al. [2016, Lemma8], the lower bound on (cid:107) (Φ S − Φ S \{ j } ) y (cid:107) involves the constantmax S ∈S ( d ) (cid:107) ( Z (cid:62) S Z S ) − Z (cid:62) S Z S ∗ \ S (cid:107) op .
43e directly bound this constant using Lemma D7. For completeness, we provide the fullproof in Section D.4.We need to consider two subcases for underfitted models according as S is saturated (wesay S is saturated if | S | = d .) Note that an underfitted and saturated model exists only if d < m . If S is unsaturated, we can add some covariate in S ∗ \ S so that the reduction inRSS would be significant. If S is saturated, we perform a swap move: add some covariate in S ∗ \ S and remove another in S \ S ∗ . Lemma D3.
Suppose Assumptions A (cid:48) , B (cid:48) , C (cid:48) and E (cid:48) hold. Let S ∈ S ( d ) be an underfittedmodel. There exists some k ∈ S ∗ \ S such that π n ( S ) π n ( S ∪ { k } ) ≤ p − c − C β / . Proof.
Let S (cid:48) = S ∪ { k } for some k ∈ S ∗ \ S . Then, π n ( S ) π n ( S (cid:48) ) ≤ c (1 + α/γ ) / p c exp (cid:26) − αn + κ y (cid:62) (Φ S (cid:48) − Φ S ) yy (cid:62) Φ ⊥ S y (cid:27) . (30)By Assumption A (cid:48) , y (cid:62) Φ ⊥ S y ≤ y (cid:62) y ≤ nν . By Lemma D2 and Assumption E (cid:48) , there existssome k ∈ S ∗ \ S such that (cid:107) (Φ S (cid:48) − Φ S ) y (cid:107) ≥ nν ν β ≥ C β + 4 c ) ν log pα ≥ ( C / β + 4 c / ) ν log pα . (31)By Assumptions B (cid:48) and C (cid:48) , (cid:107) (Φ S (cid:48) − Φ S ) (cid:15) (cid:107) ≤ ρω ∗ log p < c ω ∗ log pα ≤ c ν log pα . (32)The reverse triangle inequality then yields that (cid:107) (Φ S (cid:48) − Φ S ) y (cid:107) ≥ (cid:110)(cid:112) C β + 3 √ c (cid:111) ν log pα ≥ ( C β + 9 c ) ν log pα . Thus, the exponent in (30) can be bounded by αn + κ y (cid:62) (Φ S (cid:48) − Φ S ) yy (cid:62) Φ ⊥ S y ≥ C β + 9 c p. By Assumptions B (cid:48) and C (cid:48) , c ≥ c / ≥ c + 1, which yields the result. Lemma D4.
Suppose d < m and Assumptions A (cid:48) , B (cid:48) , C (cid:48) , D (cid:48) and E (cid:48) hold. Let S ∈ S ( d ) bean underfitted model with | S | = d . There exist some k ∈ S ∗ \ S and j ∈ S \ S ∗ such that π n ( S ) π n (( S ∪ { k } ) \ { j } ) ≤ p − C β / . Proof.
Let S (cid:48) = ( S ∪ { k } ) \ { j } for some k ∈ S ∗ \ S and j ∈ S \ S ∗ . Then, π n ( S ) π n ( S (cid:48) ) ≤ exp (cid:26) − αn + κ y (cid:62) (Φ S (cid:48) − Φ S ) yy (cid:62) Φ ⊥ S y (cid:27) .
44y Lemma D2, we can pick k and j such that (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) ≥ nν (cid:107) β ∗ S ∗ \ S (cid:107) ν | S ∗ \ S | , (cid:107) (Φ S (cid:48) ∪{ j } − Φ S (cid:48) ) y (cid:107) ≤ nν ( ν − ν ) (cid:107) β ∗ S ∗ \ S (cid:107) ν | S \ S ∗ | . By Assumption D (cid:48) , (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) (cid:107) (Φ S (cid:48) ∪{ j } − Φ S (cid:48) ) y (cid:107) ≥ ν | S \ S ∗ | ν ( ν − ν ) | S ∗ \ S | ≥ ν ( d − | S ∗ | ) ν ( ν − ν ) | S ∗ | ≥ . Then, using (31), (32) and triangle inequalities, we find that (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) ≥ (cid:110)(cid:112) C β + 3 √ c (cid:111) ν log pα , (cid:107) (Φ S (cid:48) ∪{ j } − Φ S (cid:48) ) y (cid:107) ≤ (cid:40) (cid:112) C β √ c (cid:41) ν log pα . Hence, (cid:107) (Φ S (cid:48) − Φ S ) y (cid:107) = (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) − (cid:107) (Φ S (cid:48) ∪{ j } − Φ S (cid:48) ) y (cid:107) ≥ C β ν log p α . The result then follows by a calculation similar to the proof of Lemma D3.
D.4 Proof of Lemma D2
Proof.
Observe thatmax k ∈ S ∗ \ S (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) ≥ | S ∗ \ S | (cid:88) k ∈ S ∗ \ S (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) . Using Lemma D5 and Assumption A (cid:48) , the summation term can be written as (cid:88) k ∈ S ∗ \ S (cid:107) (Φ S ∪{ k } − Φ S ) y (cid:107) = (cid:88) k ∈ S ∗ \ S ( Z (cid:62) k Φ ⊥ S y ) Z (cid:62) k Φ ⊥ S Z k ≥ (cid:88) k ∈ S ∗ \ S ( Z (cid:62) k Φ ⊥ S y ) nν . Since Φ ⊥ S Z S = 0, we have (cid:88) k ∈ S ∗ \ S ( Z (cid:62) k Φ ⊥ S y ) = (cid:107) Z (cid:62) S ∗ \ S Φ ⊥ S y (cid:107) = (cid:107) Z (cid:62) S ∗ ∪ S Φ ⊥ S y (cid:107) . Recall that Φ ⊥ S y = Z S ∗ β ∗ S ∗ − Φ S y . Thus, Φ ⊥ S y is in the column space of Z S ∗ ∪ S , whichfurther implies that (cid:107) Z (cid:62) S ∗ ∪ S Φ ⊥ S y (cid:107) ≥ λ min ( Z (cid:62) S ∗ ∪ S Z S ∗ ∪ S ) (cid:107) Φ ⊥ S y (cid:107) . Applying Lemma D6, we obtain that (cid:107) Φ ⊥ S y (cid:107) ≥ nν (cid:107) β ∗ S ∗ \ S (cid:107) . Combining the above displayedinequalities, we obtain the first inequality stated in the lemma.Consider the second part of the lemma. Without loss of generality, assume that S = { , , . . . , | S |} . Define ˜ b = ( Z (cid:62) S Z S ) − Z (cid:62) S Z S ∗ \ S β ∗ S ∗ \ S . j ∈ S \ S ∗ , and let ˜ b − j be the subvector of ˜ b with the j -th entry removed. Then, (cid:107) Φ ⊥ S \{ j } y (cid:107) = (cid:107) Φ ⊥ S \{ j } Z S ∗ \ S β ∗ S ∗ \ S (cid:107) ≤ (cid:107) Z S \{ j } ˜ b − j − Z S ∗ \ S β ∗ S ∗ \ S (cid:107) , since (cid:107) Φ ⊥ S \{ j } y (cid:107) is the minimal distance from y to the column space of Z S \{ j } . By thedefinition of ˜ b , we have Z S \{ j } ˜ b − j + Z j ˜ b j = Z S ˜ b = Φ S Z S ∗ \ S β ∗ S ∗ \ S . Hence, using Φ ⊥ S Z j = 0, we find that (cid:107) Z S \{ j } ˜ b − j − Z S ∗ \ S β ∗ S ∗ \ S (cid:107) = (cid:107) Φ ⊥ S Z S ∗ \ S β ∗ S ∗ \ S (cid:107) + (cid:107) Z j ˜ b j (cid:107) . It then follows that (cid:107) (Φ S − Φ S \{ j } ) y (cid:107) = (cid:107) Φ ⊥ S \{ j } y (cid:107) − (cid:107) Φ ⊥ S y (cid:107) ≤ ˜ b j (cid:107) Z j (cid:107) ≤ nν ˜ b j . Choosing an optimal j , we obtain thatmin j ∈ S \ S ∗ (cid:107) (Φ S − Φ S \{ j } ) y (cid:107) ≤ | S \ S ∗ | (cid:88) j ∈ S \ S ∗ nν ˜ b j = nν | S \ S ∗ | (cid:107) ˜ b (cid:107) . Since operator norm is submultiplicative, (cid:107) ˜ b (cid:107) ≤ (cid:107) ( Z (cid:62) S Z S ) − (cid:107) op (cid:107) Z (cid:62) S Z S ∗ \ S (cid:107) op (cid:107) β ∗ S ∗ \ S (cid:107) . Assumption A (cid:48) implies that (cid:107) ( Z (cid:62) S Z S ) − (cid:107) op ≤ ( nν ) − , and, by Lemma D7, (cid:107) Z (cid:62) S Z S ∗ \ S (cid:107) op ≤ n ( ν − ν ). Hence, (cid:107) ˜ b (cid:107) ≤ ν − ( ν − ν ) (cid:107) β ∗ S ∗ \ S (cid:107) , which concludes the proof. D.5 Auxiliary lemmas
Lemma D5.
Let Φ S be the projection matrix as defined in (27) . Then, Φ S ∪{ j } − Φ S = Φ ⊥ S Z j Z (cid:62) j Φ ⊥ S Z (cid:62) j Φ ⊥ S Z j . Proof.
It follows from the observation that Φ S ∪{ j } − Φ S is a projection matrix. Lemma D6.
Let A = [ A A ] be an n × k matrix for some k ≤ n . Then σ min { Φ ⊥ A } ≥ (cid:112) λ min ( A (cid:62) A ) where σ min denotes the smallest singular value and Φ ⊥ = I − A ( A (cid:62) A ) − A (cid:62) .Proof. See Arias-Castro and Lounici [2014, Lemma 5].
Lemma D7.
Let A = [ A A ] be an n × k matrix for some k ≤ n , λ max ( A (cid:62) A ) = ν max and λ min ( A (cid:62) A ) = ν min . Then, (cid:107) A (cid:62) A (cid:107) op ≤ ν max − ν min .Proof. Suppose the dimension of A i is n × k i for i = 1 ,
2. By the definition of operator norm, (cid:107) A (cid:62) A (cid:107) op = max b ∈ R k : (cid:107) b (cid:107) =1 (cid:107) A (cid:62) A b (cid:107) = max (cid:110) b (cid:62) A (cid:62) A b : b ∈ R k , b ∈ R k , (cid:107) b (cid:107) = (cid:107) b (cid:107) = 1 (cid:111) . Since (cid:107) A (cid:107) op ≤ (cid:107) A (cid:107) op = √ ν max and σ min ( A ) = √ ν min , we have2 b (cid:62) A (cid:62) A b = (cid:107) A b (cid:107) + (cid:107) A b (cid:107) − (cid:107) ( A b − A b ) (cid:107) ≤ ν max (cid:107) b (cid:107) + ν max (cid:107) b (cid:107) − ν min ( (cid:107) b (cid:107) + (cid:107) b (cid:107) ) . Hence, if (cid:107) b (cid:107) = (cid:107) b (cid:107) = 1, b (cid:62) A (cid:62) A b ≤ ν max − ν min . A similar argument yields that b (cid:62) A (cid:62) A b ≥ ν min − ν max , which completes the proof.46 Proofs for the main text
For all proofs below, we use Φ S and Φ ⊥ S to denote the projection matrices given byΦ S = X S ( X (cid:62) S X S ) − X (cid:62) S , Φ ⊥ S = I − Φ S , ∀ S ⊆ [ p ] . (33) E.1 Proof of Lemma 3
Proof.
Note that this is equivalent to proving that the marginal fractional likelihood definedin (25) is the same for Markov equivalent DAGs. By Lemma C2, if two DAGs are Markovequivalent, then there exists a sequence of covered edge reversals that can transform oneto the other. So it suffices to show that any covered edge reversal does not change themarginal likelihood. Let
G, G (cid:48) be two DAGs that differ by a covered edge reversal. Thus,there exist i (cid:54) = j such that i → j ∈ G , j → i ∈ G (cid:48) , Pa i ( G ) = Pa j ( G ) \ { i } , and all theother edges are exactly the same in the two DAGs. Let σ be a topological ordering G such that σ ( k ) = i, σ ( k + 1) = j for some k ∈ [ p ]; such an ordering always exists sincePa i ( G ) = Pa j ( G ) \ { i } . Let σ (cid:48) = ( σ (1) , . . . , σ ( k − , j, i, σ ( k + 1) , . . . , σ ( p )) be a topologicalordering of G (cid:48) . By (10), for S = Pa i ( G ) we haveexp( ψ ( G ))exp( ψ ( G (cid:48) )) = (cid:32) X (cid:62) i Φ ⊥ S X i X (cid:62) j Φ ⊥ S ∪{ i } X j X (cid:62) i Φ ⊥ S ∪{ j } X i X (cid:62) j Φ ⊥ S X j (cid:33) − ( αn + κ ) / . Using Φ ⊥ S = I − Φ S and Lemma D5, we find that( X (cid:62) i Φ ⊥ S X i )( X (cid:62) j Φ ⊥ S ∪{ i } X j ) = ( X (cid:62) i Φ ⊥ S X i ) X (cid:62) j (cid:18) Φ ⊥ S − Φ ⊥ S X i X (cid:62) i Φ ⊥ S X (cid:62) i Φ ⊥ S X i (cid:19) X j = ( X (cid:62) i Φ ⊥ S X i )( X (cid:62) j Φ ⊥ S X j ) − ( X (cid:62) j Φ ⊥ S X i ) . By symmetry, we conclude that ψ ( G ) = ψ ( G (cid:48) ). E.2 Proof of Lemma 4
Proof.
Fix an arbitrary
E ∈ C p ( d in , d out ). By the definition of C p ( d in , d out ), there exists some G ∈ E such that G ∈ G p ( d in , d out ). Since any Markov equivalent DAGs must have thesame skeleton, for any G ∈ E , the degree of any node is bounded by d in + d out and the setof adjacent nodes is fixed. Define Pa j ( E ) = { Pa j ( G ) : G ∈ E} , the collection of all possibleparent sets of node j . It then follows that | Pa j ( E ) | ≤ d in + d out ≤ p t .Consider N add ( E ). Suppose that E (cid:48) is obtained by adding the edge i → j to some G ∈ E . By Lemma C3, for G , G ∈ E , G ∪ { i → j } and G ∪ { i → j } are still Markovequivalent if the resulting graphs are DAGs and Pa j ( G ) = Pa j ( G (cid:48) ). But | Pa j ( E ) | ≤ p t implies that adding the edge i → j can yield at most p t different equivalence classes. Hence, |N add ( E ) | ≤ p ( p − p t since there are at most p ( p −
1) directed edges we can add.A similar argument can be applied to N del ( E ). For any G ∈ E , we have | G | ≤ pd in ,but these edges may be directed in either way. Using Lemma (C4), we then find that |N del ( E ) | ≤ pd in p t . Finally, for any G ∈ E , |N swap ( G ) | ≤ p ( p − d in + d out ). Hence, |N swap ( E ) | ≤ p ( p − d in + d out ) p t . Combing the results for the three cases, we obtain theasserted upper bound on |N ads ( E ) | . 47o prove the lower bound on |N ads ( E ) | , fix an arbitrary G ∈ E . For any i (cid:54) = j such that i precedes j in the topological ordering of G , we can either add i → j to G (or perform aswap if necessary), or remove it from G . The asserted lower bound then follows. E.3 Proof of Theorem 3(i)
We will use Theorem D1 to prove Theorem 3. The challenge is that we need to show theassumptions of Theorem D1 are satisfied for all the p ! p variable selection problems.For every σ ∈ S p , we have an SEM representation for the distribution N p (0 , Σ ∗ ) given by X = ( B ∗ σ ) (cid:62) X + e σ , e σ ∼ N p (0 , Ω ∗ σ ) . where ( B ∗ σ , Ω ∗ σ ) is the modified Cholesky decomposition of Σ ∗ given in Definition 7. Denotethe diagonal elements of Ω ∗ σ by ω ∗ σ, , . . . , ω ∗ σ,p . Using the data matrix, we can rewrite theSEM model as X j = (cid:88) i (cid:54) = j ( B ∗ σ ) ij X i + ε σ,j , ε σ,j ∼ N n (0 , ω ∗ σ,j I ) . for j = 1 , . . . , p . Note that the error vector (cid:15) σ,j depends on the permutation σ . Define Z = (cid:110) ( ω ∗ σ,j ) − / ε σ,j : σ ∈ S p , j ∈ [ p ] (cid:111) . (34)Let β ∗ σ,j = ( B ∗ σ ) Pa j ,j be the subvector of the j -th column of ( B ∗ σ ) with entries indexed by S ∗ σ,j = Pa j ( G ∗ σ ). As observed in Van de Geer and B¨uhlmann [2013, Section 7.4.1], β ∗ σ,j (andthus ε σ,j ) only depends on S ∗ σ,j (see also Aragam et al. [2015, Proposition 8.5]). Since themaximum degree of G ∗ σ is bounded by d ∗ , the number of possible parent sets for any node isat most p d ∗ and thus |Z| ≤ p · p d ∗ = p d ∗ +1 . (35)In (29), we introduced the notation V = ( y, Z, M ∗ , η ) for denoting a variable selection prob-lem with data ( y, Z ), true model M ∗ and parameter vector η . Let V denote the p ! p variableselection problems we need to consider, which can be defined as V = (cid:91) σ ∈ S p (cid:91) j ∈ [ p ] (cid:110) V = ( y, Z, M ∗ , η ) : y = X j , Z = X A σp ( j ) ,M ∗ = ( S ∗ σ,j , β ∗ σ,j , ω ∗ σ,j , ε σ,j ) , η = ( d in , α, c , c , κ, γ ) (cid:111) , (36)where we recall A σp ( j ) is the set of variables that precede X j in the permutation σ . For any V ∈ V , Assumptions C (cid:48) , D (cid:48) and E (cid:48) follow from Assumptions C, DP and EP, respectively.We prove in the following two lemmas that, with high probability, Assumptions A (cid:48) and B (cid:48) for all V ∈ V are implied by Assumptions A and B, respectively. Lemma E1.
If Assumption A holds and d in log p = o ( n ) , then for sufficiently large n , P ∗ (cid:110) nν ≤ min S ∈M p (2 d in ) λ min ( X (cid:62) S X S ) ≤ max S ∈M p (2 d in ) λ max ( X (cid:62) S X S ) ≤ nν (cid:111) ≥ − e − nδ / , where M p (2 d in ) = { S ⊆ [ p ] : | S | ≤ d in } . roof. For any S ⊆ [ p ], let A S = X (cid:62) S (Σ ∗ ) − X S . By Rudelson and Vershynin [2010, Theorem2.6], for any S and a >
0, we have P ∗ (cid:110) ( √ n − (cid:112) | S | − a ) ≤ λ min ( A S ) ≤ λ max ( A S ) ≤ ( √ n + (cid:112) | S | + a ) (cid:111) ≥ − e − a / . By the submultiplicative property of operator norms, λ min ( A S ) λ min (Σ ∗ ) ≤ λ min ( X (cid:62) S X S ) , λ max ( A S ) λ max (Σ ∗ ) ≥ λ max ( X (cid:62) S X S ) . Choose a = δ √ n/ n ≥ d in /δ ; the latter is allowed since d in = o ( n ). Using Assump-tion A, we find that P ∗ (cid:110) nν ≤ λ min ( X (cid:62) S X S ) ≤ λ max ( X (cid:62) S X S ) ≤ nν (cid:111) ≥ − e − a / . Observe that { S ⊆ [ p ] : | S | ≤ d in } contains less than p d in elements. The claim then followsby the union bound and the assumption that d in log p = o ( n ). Lemma E2.
Suppose Assumption B holds and d ∗ ≤ d in . Let Z be as defined in (34) . Then,for sufficiently large n , P ∗ (cid:26) min S ∈M p ( d in ) ,z ∈Z z (cid:62) Φ ⊥ S z ≤ n/ (cid:27) ≤ e − n/ , P ∗ (cid:26) max S ∈M p ( d in ) ,j / ∈ S,z ∈Z z (cid:62) (Φ S ∪{ j } − Φ S ) z ≥ (4 d in + 6) log p (cid:27) ≤ p − , where M p ( d in ) = { S ⊆ [ p ] : | S | ≤ d in } .Proof. Let z ∼ N n (0 , I ). Given a projection matrix Φ ⊥ S , we have z (cid:62) Φ ⊥ S z ∼ χ n −| S | . By Lau-rent and Massart [2000, Lemma 1], for any a > P (cid:26) z (cid:62) Φ ⊥ S z ( n − | S | ) ≤ − a (cid:27) ≤ e − ( n −| S | ) a / . Choosing a = 1 / n so that | S | ≤ d in ≤ n/
4, we obtain that P (cid:110) z (cid:62) Φ ⊥ S z ≤ n/ (cid:111) ≤ e − n/ . Using the union bound and (35), we obtain that P ∗ (cid:26) min S ∈M p ( d in ) ,z ∈Z z (cid:62) Φ ⊥ S z ≤ n (cid:27) ≤ p d in + d ∗ +1 e − n/ . The first asserted inequality then follows from the assumptions d in log p = o ( n ) and d ∗ ≤ d in .For any S ⊆ [ p ] and j / ∈ S , Φ S ∪{ j } − Φ S is another projection matrix with rank 1. Hence,using a standard tail bound for Gaussian distribution, for any ρ >
0, we find that P (cid:110) z (cid:62) (Φ S ∪{ j } − Φ S ) z ≥ ρ log p (cid:111) ≤ e − ρ log p/ . Another application of the union bound then yields the second inequality.
Proof of Theorem 3(i).
Let V be as defined in (36), which denotes the collection of p ! p variable selection problems. Note that Assumption DP implies that d ∗ ≤ d in . By Lemmas E1and E2 and Remark 7, Assumptions A, B, C, DP and EP imply that, for sufficiently large n ,with probability at least 1 − p − , Assumptions A (cid:48) , B (cid:48) , C (cid:48) , D (cid:48) and E (cid:48) hold with ρ = (4 d in + 6)(where ρ is as given in Assumption B (cid:48) ) for every variable selection problem V ∈ V . Theclaim then follows from Theorem D1. 49 .4 Proof of Theorem 3(ii) and (iii) Proof of Theorem 3(ii).
For any S ∈ M σp ( j, d in ), define N σj ( S ) = { S (cid:48) ∈ M σp ( j, d in ) : ∃ k, l ∈ [ p ] s.t. S (cid:48) = ( S ∪ { k } ) \ { l }} . Note that we allow k ∈ S and l ∈ [ p ] \ S so that N σj ( S ) includes the models that can beobtained from S by an addition, deletion or swap. Observe that the neighborhood relationdefined by N σj is symmetric, and |N σj ( S ) | ≤ p + ( p − d in ) d in ≤ p (if p ≥
2) for each S ∈ M σp ( j, d in ). Further, g σj is a transition function on M σp ( j, d in ) with fixed point S ∗ σ,j .The result then follows from part (i) and Theorem 1(ii). Proof of Theorem 3(iii).
Without loss of generality, we may assume d out = p . To prove theconsistency of DAG selection, observe that (cid:88) G ∈G σp ( d in ,p ) e ψ ( G ) ∝ (cid:88) G : Pa j ( G ) ∈M σp ( j,d in ) p (cid:89) j =1 e ψ j (Pa j ( G )) = p (cid:89) j =1 (cid:88) S ∈M σp ( j,d in ) e ψ j ( S ) . Thus, using part (ii), we find that (cid:80) G ∈G σp ( d in ,p ) e ψ ( G ) e ψ ( G ∗ σ ) = p (cid:89) j =1 (cid:80) S ∈M σp ( j,d in ) e ψ j ( S ) e ψ j ( S ∗ σ,j ) ≤ (1 − p − ( t − ) − p ≤ − p − ( t − , for every σ ∈ S p , from which the result follows. E.5 Proof of Theorem 5
Proof.
By Lemma 4 and the definition of K , we have (cid:12)(cid:12) {E (cid:48) : P ( E , E (cid:48) ) > } (cid:12)(cid:12) ≤ |N ads ( E ) | ≤ t p t +2 log p = O ( p t +3 ) = o ( p t ) , (37)since t > t + 3. Consider the transition function g : C p ( d in , d out ) → C p ( d in , d out ) constructedin Theorem 2, which satisfies that (cid:96) max = max E(cid:54) = E ∗ min { k ≥ g k ( E ) = E ∗ } ≤ p (2 d ∗ + d in ) . Next, we show that g satisfies the two conditions in Theorem 1. For any E (cid:54) = E ∗ , g ( E ) =[ g σj ( G )] for some j ∈ [ p ] , σ ∈ S p and G ∈ E ∩ G σp ( d in , d out ). By (10), (12) and Theorem 3(i),we have ψ ( g ( E )) − ψ ( E ) = ψ ( g σj ( G )) − ψ ( G ) = ψ j ( g σj ( S j )) − ψ j ( S j ) ≥ t log p, (38)where S j = Pa j ( G ) (note that we always have S j (cid:54) = S ∗ σ,j for E (cid:54) = E ∗ .) Thus, condition (2) inTheorem 1 holds. To check condition (3), we use Lemma 4 again to find that π n ( g ( E )) K ( g ( E ) , E ) π n ( E ) K ( E , g ( E )) ≥ p t d in + d out ) p t ≥ n . It then follows from (37) that P ( E , E (cid:48) ) = K ( E , E (cid:48) ) ≥ t p t +2 log p . Apply Theorem 1 to conclude the proof (consider the lazy version of P if necessary.)50 .6 Proof of Corollary 2 Proof.
Define c = c (cid:112) α/γ . By (11), for any G ∈ G σp ( d in , d out ), we have e ψ ( G ) e ψ ( G ∗ σ ) = p (cid:89) j =1 e ψ j (Pa j ( G )) e ψ j (Pa j ( G ∗ σ )) = ( c p c ) | G ∗ σ |−| G | p (cid:89) j =1 (cid:32) X (cid:62) j Φ ⊥ Pa j ( G ) X j ε (cid:62) σ,j Φ ⊥ Pa j ( G ∗ σ ) ε σ,j (cid:33) − ( αn + κ ) / ≥ ( c p c ) − pd in p (cid:89) j =1 (cid:32) X (cid:62) j X j ε (cid:62) σ,j Φ ⊥ Pa j ( G ∗ σ ) ε σ,j (cid:33) − ( αn + κ ) / ≥ ( c p c ) − pd in p (cid:89) j =1 (cid:32) nνnω ∗ σ,j / (cid:33) − ( αn + κ ) / , where the last step follows from Lemmas E1 and E2. By Remark 7, ω ∗ σ,j ≥ ν , which yieldsmin G ∈G σp ( d in ,d out ) exp( ψ ( G ))exp( ψ ( G ∗ σ )) ≥ ( c p c ) − pd in (cid:18) νν (cid:19) − p ( αn + κ ) / , ∀ σ ∈ S p . By Lemma 5, there exists a Chickering sequence ( G = G ∗ σ , G , . . . , G k = G ∗ ) for some k ≤ pd ∗ such that, for i ∈ [ k ], G i is obtained from G i − by covered edge reversals and a singleedge deletion. Since by removing any single edge from a DAG in G τp , its posterior score canincrease by at most c p c , we haveexp( ψ ([ G ∗ σ ]))exp( ψ ( E ∗ )) ≥ ( c p c ) − pd ∗ . For any
E ∈ C p ( d in , d out ), there exists a pair ( G, σ ) such that G ∈ G σp ( d in , d out ). Thus, π n ( E ) π n ( E ∗ ) = e ψ ([ G ∗ σ ]) e ψ ( E ∗ ) e ψ ( G ) e ψ ( G ∗ σ ) , from which the asserted bound on π n ( E ) /π n ( E ∗ ) follows. Under the setting of Theorem 5,we have the strong selection consistency (see also Theorem 4), and thus log π n ( E ) can bebounded a polynomial of n and p . This completes the rapid mixing proof for RW-GES. E.7 Proof of Theorem 6
Proof.
By Lemma 2, there is a canonical transition function g σ : G σp ( d in , d out ) → G σp ( d in , d out )such that for any G (cid:54) = G ∗ σ , g σ ( G ) = g σj ( G ) ∈ G σp ( d in , d out ) for some j ∈ [ p ]. It then followsby Theorem 3 that g σ satisfies condition (2) in Theorem 1; that is, ψ ( g σ ( G )) − ψ ( G ) ≥ t log p. Observe that N σ ads ( · ) defines a symmetric neighborhood relation on G σp ( d in , d out ). Further,for any G ∈ G σp ( d in , d out ), we have |N σ ads ( G ) | ≤ d in p = O ( p ). The maximum length ofa canonical path from G to G ∗ σ can be bounded by p ( d in + d ∗ σ ) ≤ pd in . The mixing timebound then follows from Theorem 1. 51 .8 Proof of Theorem 7 Proof.
We will still use the canonical path method to prove this result, but due to theexistence of Markov equivalent DAGs, we need to slightly generalize Theorem 1. Recallthat by (21), once we have a canonical path ensemble, we can bound the mixing timeusing the maximum length of canonical paths and the congestion parameter. Let N ( G ) = N ads ( G ) ∪ N E ( G ). We first construct a function g s : G p ( d in , d out ) → G p ( d in , d out ) such thatfor any G (cid:54) = G ∗ , g s induces a finite N -path ( G, g s ( G ) , . . . , g k s ( G ) = G ∗ ).As in the proof of Theorem 2, we need a proper definition of the “distance” from a DAG G to the set of all minimal I-maps of G ∗ . Define h ∗ ( G ) = min (cid:8) Hd(
G, G ∗ σ ) : G ∈ G σp , σ ∈ S p (cid:9) . Let ¯ σ ( G ) denote the ordering that attains the minimum in the above definition. Fix anarbitrary G ∈ G p ( d in , d out ). We define g s ( G ) as follows.(1) Let σ = ¯ σ ( G ) be its “canonical” ordering.(2) If G (cid:54) = G ∗ σ (i.e., h ∗ ( G ) (cid:54) = 0), by Lemma 2, there exists some j such that g σj ( G ) ∈G σp ( d in , d out ) and g σj ( G ) (cid:54) = G . Define g s ( G ) = g σj ( G ) ∈ N ads ( G ).(3) If G = G ∗ σ but G / ∈ [ G ∗ ], by Lemma 5, there exists some G ∈ [ G ] , τ ∈ S p , j ∈ [ p ] suchthat G ∈ G τp ( d in , d out ) and g τj ( G ) ∈ N del ( G ). Let g s ( G ) = G ∈ N E ( G ).(4) If G ∈ [ G ∗ ], let g s ( G ) = G ∗ ∈ N E ( G ) ∪ { G } .In words, starting from any G / ∈ [ G ∗ ], we first move to some minimal I-map G ∗ σ (which maynot be optimal) and then move to some DAG in [ G ∗ ]. Consider g σj ( G ) defined in step (2).By the definition of g σj , we have h ∗ ( g σj ( G )) ≤ Hd( g σj ( G ) , G ∗ σ ) < Hd(
G, G ∗ σ ) = h ∗ ( G ) ≤ ( d in + d ∗ ) p. Since h ∗ ( G ) = 0 if and only if G is a minimal I-map of G ∗ , we conclude that it takes at most( d in + d ∗ ) p steps to arrive at a minimal I-map. Observe that the DAG G picked in step(3) cannot be a minimal I-map, and thus g s ( G ) is defined in step (2). But G must be anI-map of G ∗ , which implies all nodes of G are overfitted and g s ( G ) ∈ N del ( G ). It followsthat moving from a minimal I-map to G ∗ takes at most 2 pd ∗ steps, and thus the maximumlength of the canonical path from any G to G ∗ is (cid:96) max ≤ p (3 d ∗ + d in ).Next, consider the size of the set g − ( G ) for any G ∈ G p ( d in , d out ). If G = G ∗ σ for some σ ∈ S p and G / ∈ [ G ∗ ], we have | g − ( G ) | ≤ |N ads ( G ) | ≤ d in p , since any minimal I-mapcannot be the DAG G in step (3). Further, observe that g − ( G ) ∩ N E ( G ) (cid:54) = ∅ only if G isMarkov equivalent to some minimal I-map. Thus, if G is not a minimal I-map, we have | g − ( G ) | ≤ |N ads ( G ) | + max σ ∈ S p | [ G ∗ σ ] | ≤ r ∗ + 2 d in p . Consider the ratio π n ( G (cid:48) ) /π n ( G ) for some G (cid:48) such that g k s ( G (cid:48) ) = G . If g s ( G ) is defined instep (2), then π n ( g s ( G )) /π n ( G ) ≤ p − t ; if g s ( G ) is defined in step (3) or (4), then π n ( g s ( G )) = π n ( G ). However, step (3) cannot be invoked twice in a row. Therefore, π n ( G (cid:48) ) π n ( G ) ≤ p − t (cid:98) k/ (cid:99) , if g k s ( G (cid:48) ) = G, G (cid:54) = G ∗ , (cid:98) k/ (cid:99) denotes the largest integer that is not greater than k/
2. Define g − k s ( G ) = { G (cid:48) ∈G p ( d in , d out ) : g k s ( G (cid:48) ) = G, g k − ( G (cid:48) ) (cid:54) = G } . Then, π n ( g − k s ( G )) π n ( G ) ≤ p − t (cid:98) k/ (cid:99) ( r ∗ + 2 d in p ) k = O (cid:16) p − t (cid:98) k/ (cid:99) + k max { t/ , } (cid:17) , since by assumption r ∗ ≤ p t/ . For k = 2 , , . . . , we have (cid:98) k/ (cid:99) ≥ k/
3. Using the assumption t >
9, we find that π n ( g − k s ( G )) π n ( G ) = O (cid:16) p − tk/ (cid:17) , ∀ k = 2 , , . . . . For the case k = 1, we have π n ( g − ( G )) /π n ( G ) ≤ r ∗ + 2 d in p p − t . Let Λ( G ) = (cid:83) k ∈ N g − k s ( G ).We finally obtain that π n (Λ( G )) π n ( G ) = (cid:88) k ∈ N π n ( g − k s ( G )) π n ( G ) ≤ r ∗ + 2 d in p − t + (cid:88) k ≥ O (cid:16) p − tk/ (cid:17) ≤ r ∗ , for sufficiently large n . In particular, for G = G ∗ , we have π n (Λ( G ∗ )) π n ( G ∗ ) ≤ |E ∗ | + O ( p − t/ ) . But π n (Λ( G ∗ )) = 1, which yields the second claim of the theorem.By (21) and (22) in the proof of Theorem 1, T mix − log[min θ ∈ Θ π ( θ )] + log 4 ≤ (cid:96) max max G π n (Λ( G )) /π n ( G )min G (cid:48) = g s ( G ) ,G (cid:54) = G (cid:48) P s ( G, G (cid:48) ) . For G (cid:54) = G (cid:48) , we have P s ( G, G (cid:48) ) = K s ( G, G (cid:48) ) min (cid:26) , π n ( G (cid:48) ) K s ( G (cid:48) , G ) π n ( G ) K s ( G, G (cid:48) ) (cid:27) . If G (cid:48) = g s ( G ) ∈ [ G ], which only happens when G is a minimal I-map, we have P s ( G, G (cid:48) ) = K s ( G, G (cid:48) ) ≥ q/r ∗ . Otherwise, we have P s ( G, G (cid:48) ) ≥ − q |N ads ( G ) | min (cid:26) , p t |N ads ( G ) ||N ads ( G (cid:48) ) | (cid:27) = 1 − q |N ads ( G ) | ≥ − q d in p , since t >
9. The asserted bound then follows.
E.9 Proof of Example 2
Proof.
Consider the space G p (2 ,
2) for p = 3, the collection of all 3-vertex DAG models.By Andersson et al. [1997], there are 11 labeled equivalence classes, which we show in Figure 2(for each equivalence class we plot one DAG member.) The true DAG is given by G ∗ = G and the local mode is the equivalence class that contains ˜ G = G . Consider σ = (1 , , G . The corresponding SEM representation can be writtenas X = z ,X = b b X + b z + z ,X = b b + 1 X + b b + 1 X + 1 b + 1 z − b b + 1 z . We prove the slow mixing by verifying the two conditions in Theorem B3:53 igure 2: DAG models with three vertices. Each DAG represents a unique equivalence class. Any other3-vertex DAG model is Markov equivalent to one of these DAGs. For Example 2, the true DAG is G ∗ = G and the local mode is the equivalence class of ˜ G = G . For Example 3, G ∗ = G and ˜ G = G . (i) P ([ G ] , [ G ]) ≥ − e − c √ n for some universal constant c .(ii) π n ([ G ]) ≤ / n .Consider (i) first. Since [ G ] = { G } , the set N ads ([ G ]) is determined by all the neighborsof G . Observe that G , G ∈ N del ( G ) and G ∈ N add ( G ). Some routine calculations using c = √ n yield that, for sufficiently large n , π n ([ G ]) π n ([ G ]) = exp (cid:110) − c log p + αn b ) (cid:111) ≥ p c / ,π n ([ G ]) π n ([ G ]) = exp (cid:26) − c log p + αn b + 1)( b + 1) b b + b + 1 (cid:27) ≥ p c / ,π n ([ G ]) π n ([ G ]) = c exp (cid:26) c log p − αn b b + b + 1 b + 1 (cid:27) ≥ p c / . By the Metropolis rule, for any E (cid:48) (cid:54) = [ G ], P ([ G ] , E (cid:48) ) = K ([ G ] , E (cid:48) ) min (cid:26) , π n ( E (cid:48) ) K ( E (cid:48) , [ G ]) π n ([ G ]) K ([ G ] , E (cid:48) ) (cid:27) ≤ π n ( E (cid:48) ) π n ([ G ]) < p −√ n/ . It then follows that (i) holds since |N ads ([ G ]) | is bounded. Note that if p goes to infinity,we can still use |N ads ([ G ]) | ≤ p to show (i).To prove (ii), we only need to compare G with the true model G . Another routinecalculation yields that π n ([ G ]) π n ([ G ]) = exp (cid:26) αn b b + b + 1 b + 1 (cid:27) ≥ p α − log p , for large n . Since p = 3, π n ([ G ]) /π n ([ G ]) ≥ p α − log p ≥ E.10 Proof of Example 3
Proof.
We still use the numbering in Figure 2. The true DAG is G ∗ = G , and the local modewe consider is the equivalence class generated by ˜ G = G . By He et al. [2013, Definition54], N C ([ G ]) = { [ G ] , [ G ] , [ G ] } . Since p, a , a , α are fixed constants and c = √ n , forsufficiently large n , we find that π n ([ G ]) π n ([ G ]) = p − c = p −√ n ,π n ([ G ]) π n ([ G ]) = exp (cid:110) − c log p + αn a + 1) (cid:111) ≥ e cn ,π n ([ G ]) π n ([ G ]) = exp (cid:110) − c log p + αn a + 1) (cid:111) ≥ e cn ,π n ([ G ]) π n ([ G ]) = c exp (cid:26) − c log p + αn a + 1)( a + 1) a + a + 1 (cid:27) ≥ e cn , for some universal constant c >
0. Thus, for the random walk MH algorithm using neighbor-hood relation N C , we have P ([ G ] , [ G ]) ≥ − e − cn . Since [ G ] has negligible posteriorprobability, the chain is slowly mixing by Theorem B3. Example 4.
We give another slow mixing example for the random MH algorithm usingneighborhood relation induced by N C on the space of equivalence classes. Let p = 5 and thetrue DAG G ∗ be as given in Figure 3. Consider the DAG H = G ∪ { → } . By (dd2) inDefinition 9 of He et al. [2013], [ G ∗ ] / ∈ N C ([ H ]). This is not surprising since N C defines asymmetric relation and clearly, we cannot move from the CPDAG of G ∗ to the CPDAG of H by adding an edge between nodes 1 ,
4. Actually, according to He et al. [2013, Definition9], there are only 8 possible operations that we may apply to the CPDAG of H , as listed inTable 1. Each operation uniquely defines a resulting CPDAG and thus |N C ([ H ]) | = 8. InFigure 4, we plot a member DAG for each equivalence class in N C ([ H ]).Assume that the error vectors are exactly orthogonal to each other and choose all priorparameters as in Examples 2 and 3. Consider the 8 DAGs shown in Figure 4. For i = 1 , , π n ([ H i ]) /π n ([ H ]) = p − c since H is an independence map of G ∗ and H , H , H areindependence maps of H . For j = 4 , . . . ,
8, we can assume that H j has the same topologicalordering as H . It follows that, for any SEM representation that is perfectly Markovian w.r.t. G ∗ , we have π n ([ H j ]) /π n ([ H ]) ≤ e − cn for some universal constant c .55 igure 3: A slow mixing example for the random walk MH algorithm using neighborhood function N C . G ∗ :the true DAG; [ G ∗ ]: the CPDAG of G ∗ ; H : a DAG representing a local mode; [ H ]: the CPDAG of H . Operator Related edge(s)
InsertU 1 −
2, 1 − − − →
4, 3 →
5, 2 →
4, 2 → Table 1: Edge operations that may be applied to the CPDAG of H in Figure 3 according to He et al. [2013,Definition 9]. In the operator names, “U” means an undirected edge, “D” a directed edge, “V” a v-structure.Figure 4: Characterization of N C ([ H ]) where H is as given in Figure 3. For each E ∈ N C ([ H ]), we plot amember DAG of E . The 8 DAGs correspond to the 8 operations listed in Table 1. H : insert 1 − H : insert1 − H : insert 4-5; H : delete 2 − H : delete 3 → H : delete 3 → H : delete 2 → H : delete2 →5.