Equivalence class selection of categorical graphical models
Federico Castelletti ([email protected])
Department of Statistical Sciences, Università Cattolica del Sacro Cuore, Milan

Stefano Peluso ([email protected])
Department of Statistics and Quantitative Methods, Università degli Studi di Milano-Bicocca, Milan
Abstract
Learning the structure of dependence relations between variables is a pervasive issue in the statistical literature. A directed acyclic graph (DAG) can represent a set of conditional independences, but different DAGs may encode the same set of relations and are indistinguishable using observational data. Equivalent DAGs can be collected into classes, each represented by a partially directed graph known as essential graph (EG). Structure learning directly conducted on the EG space, rather than on the allied space of DAGs, leads to theoretical and computational benefits. Still, the majority of efforts in the literature have been dedicated to Gaussian data, with less attention to methods designed for multivariate categorical data. We then propose a Bayesian methodology for structure learning of categorical EGs. Combining a constructive parameter prior elicitation with a graph-driven likelihood decomposition, we derive a closed-form expression for the marginal likelihood of a categorical EG model. Asymptotic properties are studied, and an MCMC sampling scheme is developed for approximate posterior inference. We evaluate our methodology on both simulated scenarios and real data, with appreciable performance in comparison with state-of-the-art methods.

Keywords: Bayesian model selection, categorical data, graphical model, Markov equivalence
1 Introduction

The wide spread of complex data has increasingly raised the interest of statisticians in the development of appropriate tools to investigate structured dependence relations between variables. In this context, graphical models represent a powerful methodology (Lauritzen [26]), with directed acyclic graphs (DAGs) particularly suitable for many scientific problems, especially biological ones (Friedman [14], Sachs et al. [35], Shojaie & Michailidis [39], Nagarajan et al. [29]). A DAG encodes a set of conditional independencies between variables which can be read off from the graph using various criteria such as d-separation (Pearl [30]). While in some fields (e.g. genomics) such dependence relationships can be postulated a priori based on experts' knowledge, realistically the underlying DAG is unknown and accordingly needs to be inferred from the data. In the Bayesian framework this corresponds to a model selection problem which requires first the specification of a prior distribution on the space of DAG models and parameters. The latter, combined with the data likelihood, leads to an integrated (marginal) likelihood and in turn to a posterior distribution on the DAG space. In this direction, the methodology deployed by Geiger & Heckerman [17] for parameter prior construction implies desirable properties of the DAG marginal likelihood.

An additional complication arises because different DAGs may encode the same set of conditional independencies (Markov equivalent DAGs). Markov equivalence therefore induces a partition of the DAG space into Markov equivalence classes (Andersson et al. [1]). Under common distributional assumptions all DAGs in the same Markov equivalence class are indistinguishable using observational data (Pearl [30]), and therefore should have the same marginal likelihood, a requirement known as score equivalence. In addition, model selection algorithms that ignore Markov equivalence can fall into incoherences and computational inefficiencies, as pointed out by Andersson et al. [1]. All DAGs in the same Markov equivalence class can be represented by an essential graph (EG, Andersson et al. [1], Chickering [10]), a chain graph (CG) whose chain components are decomposable undirected graphs (UGs) linked by arrowheads.

Clearly, structure learning of EGs always guarantees score equivalence, since it operates at the level of equivalence classes. In this context, first efforts in the investigation of the EG space were confined to small graphs (Gillispie & Perlman [18]), whilst more recently larger graphs were studied by Sonntag et al. [40] and He et al. [22] using Markov chain Monte Carlo (MCMC) methods. The recent work of Castelletti et al. [7] relies on the method of Geiger & Heckerman [17] to construct a parameter prior for Gaussian EGs, following the objective Bayes perspective of Consonni & La Rocca [11] and Consonni et al. [12] for, respectively, Gaussian and covariate-adjusted DAG models. On the frequentist side, Chickering [10] provides an EG estimate using a greedy equivalence search (GES) algorithm based on additions and deletions of single edges, later modified for better estimation by Hauser & Bühlmann [19]. Moreover, Spirtes et al. [41] proposed the PC algorithm, a constraint-based method which implements a sequence of conditional independence tests.

All the techniques mentioned above have been mainly designed for data whose nature justifies the Gaussian assumption.
Even if graphical models for categorical data (also called Bayesian networks) are widely employed in many domains (Scutari & Denis [38]), to our knowledge the Bayesian literature on categorical EG learning is narrow, limited to Madigan et al. [27] and Castelo & Perlman [9]. The adoption of Bayesian scores by frequentist score-based methods can partially fill this gap, but clearly does not represent a fully satisfying solution. Still on the frequentist side, the PC algorithm of Spirtes et al. [41] can be adapted for categorical data and provides an EG estimate through a sequence of conditional independence tests, also adopted for contingency tables in Scutari & Denis [38]. Korb & Nicholson [24, Chapter 9] and Murphy [28, Chapter 16] extensively cover categorical structure learning, but neither of them at the level of equivalence classes. Also, the score-based algorithms outlined in Scutari [37] only learn discrete DAGs. Furthermore, with categorical data different hyperprior specifications lead to different common scores, with crucial impact on the performance, and may easily compromise score equivalence.

In the present paper we propose a fully Bayesian structure learning method for categorical EGs. Following the approach of Geiger & Heckerman [17], originally introduced for DAG models, we derive a closed-form expression for the marginal likelihood of an EG, therefore avoiding any issue related to lack of score equivalence or hyperprior misspecification. Exploiting the Markov chain in He et al. [22] and developments in Castelletti [5], a related MCMC scheme on the EG space is constructed. Our method is fully Bayesian and therefore outputs a posterior distribution over the space of essential graphs, rather than the single model estimate provided by the frequentist PC algorithm (Spirtes et al. [41]) available in the literature. Accordingly, graph features of interest, such as the inclusion probabilities of specific edges, as well as measures of uncertainty around them, can be computed in our case. We will recover a single EG estimate for comparison purposes, and simulations will show that our method is competitive when adopted to recover a single EG structure. Furthermore, differently from the other Bayesian methods in the literature, which implement the BDeu score of Heckerman et al. [23] on the space of DAGs, we directly score EGs by deriving a closed-form expression for the EG marginal likelihood, and we adopt an MCMC scheme targeting the posterior over the space of Markov equivalence classes. The benefits of a Bayesian method for DAG model selection specifically targeted to EGs rather than DAGs are addressed from a theoretical perspective by Andersson et al. [1], and simulation comparisons will show that our approach is highly competitive under all scenarios, with outperformances in settings characterized by moderate sample sizes, especially with a high number of nodes.

The rest of the paper is organized as follows. We first introduce some background material on DAGs and Bayesian analysis of categorical data in Section 2. The unstructured Bayesian inference of contingency tables outlined in Section 2.2 is extended to model selection of EGs in Section 3. Here we focus on the EG-driven likelihood decomposition, on the parameter prior induced by Geiger & Heckerman [17], and on the derivation of the marginal likelihood with related asymptotic properties. The posterior sampler developed in Section 4 is implemented on simulated data (Section 5), on a medical belief network and on US Congress voting records (Section 6).
We finally discuss extensions to interventional categorical data and multiple datasets in Section 7.
2 Background

In this section we provide some background material on Directed Acyclic Graphs (DAGs) and Essential Graphs (EGs), as well as on Bayesian analysis of contingency tables. Some further notions on graphical models are reported in the Appendix. In addition, the reader can refer to Lauritzen [26] and the more recent book by Roverato [33] for a detailed exposition of these topics.
2.1 DAGs, Markov equivalence and essential graphs

Let D = (V, E) be a DAG, where V = {1, . . . , q} is a set of nodes and E ⊆ V × V a set of directed edges, and let Y_1, . . . , Y_q be a collection of random variables that we associate to the nodes in D. A DAG encodes a set of conditional independencies between variables, which defines its Markov property and can be read off from the DAG using d-separation (Pearl [30]). Different DAGs may encode the same conditional independencies, and accordingly we say that they are Markov equivalent. In many distributional settings, and in particular in the Gaussian framework and in the categorical case herein considered, Markov equivalent DAGs cannot be distinguished in the presence of only observational data; see also Geiger & Heckerman [17] and Heckerman et al. [23]. Under further assumptions on the sampling distribution, such as equal variances within the Gaussian setting, Markov equivalence may not hold (Peters & Bühlmann [31]) and DAGs can be in principle distinguished from observational data. Verma & Pearl [42, Theorem 1] show that two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures, therefore providing a graphical criterion to establish Markov equivalence.
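As an illustration of the Verma & Pearl criterion, the following R sketch checks Markov equivalence of two DAGs stored as 0-1 adjacency matrices, with A[u, v] = 1 if and only if u → v; the helper names are ours, introduced only for this example.

# Skeleton: drop edge orientations.
skeleton <- function(A) (A + t(A)) > 0

# V-structures: triples u -> z <- v with u and v not adjacent.
v_structures <- function(A) {
  q <- nrow(A); out <- NULL
  for (z in 1:q) {
    pa <- which(A[, z] == 1)
    for (u in pa) for (v in pa)
      if (u < v && A[u, v] == 0 && A[v, u] == 0)
        out <- rbind(out, c(u, z, v))
  }
  out
}

markov_equivalent <- function(A1, A2)
  identical(skeleton(A1), skeleton(A2)) &&
  identical(v_structures(A1), v_structures(A2))

# 1 -> 2 -> 3 and 1 <- 2 <- 3 are Markov equivalent; 1 -> 2 <- 3 is not.
A1 <- matrix(0, 3, 3); A1[1, 2] <- 1; A1[2, 3] <- 1
A2 <- matrix(0, 3, 3); A2[2, 1] <- 1; A2[3, 2] <- 1
A3 <- matrix(0, 3, 3); A3[1, 2] <- 1; A3[3, 2] <- 1
markov_equivalent(A1, A2)  # TRUE
markov_equivalent(A1, A3)  # FALSE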
For a given DAG D, let [D] be its Markov equivalence class, the set of all DAGs that are Markov equivalent to D. By Andersson et al. [1] each equivalence class can be uniquely represented by a special chain graph called essential graph (EG), obtained as the union (over the edge sets) of the Markov equivalent DAGs; an alternative name for an EG is completed partially directed acyclic graph (CPDAG, Chickering [10]). Finally, we recall an important result in Andersson et al. [1, Theorem 4.1], by which an EG is characterized as a chain graph with decomposable chain components.

2.2 Bayesian inference of contingency tables

Let now Y_1, . . . , Y_q be specialized to a collection of categorical variables, or classification criteria, each Y_j having set of levels 𝒴_j and l_j = |𝒴_j|. We consider n multivariate observations from Y_1, . . . , Y_q, where each y_i, i = 1, . . . , n, corresponds to the levels of Y_1, . . . , Y_q assigned to individual i, y_i = (y_i(j), j ∈ V), and y_i(j) denotes the j-th element of y_i, V = {1, . . . , q}. These data can be collected into a q-dimensional contingency table of counts N. To this end, let 𝒴 = ×_{j∈V} 𝒴_j be the product space generated by Y_1, . . . , Y_q, and y ∈ 𝒴 an element of 𝒴, that is a generic configuration of the q variables. Each count n(y), representing the number of individuals assigned to configuration y, is given by n(y) = ∑_{i=1}^n 1(y_i = y), where 1(·) is the indicator function. The collection of counts n(y), y ∈ 𝒴, can then be arranged in a q-dimensional table. Clearly ∑_{y∈𝒴} n(y) = n, and the number of cells in N coincides with the dimension of the product space 𝒴, which we denote by l_V = |𝒴| = ∏_{j∈V} l_j.

Let now S ⊆ V. A marginal table of counts for the variables in S is obtained by classifying the n individuals only according to the criteria in S. The so-obtained |S|-dimensional marginal table is then N_S, with l_S = ∏_{j∈S} l_j = |𝒴_S| cells, where 𝒴_S = ×_{j∈S} 𝒴_j. For each cell y_S ∈ 𝒴_S the corresponding count n(y_S) is obtained from the original contingency table N as n(y_S) = ∑_{y∈𝒴} n(y) 1(y(S) = y_S), where y(S) denotes the elements of y ∈ 𝒴 corresponding to variables in S ⊆ V. It follows that ∑_{y_S∈𝒴_S} n(y_S) = n.

For the generic cell y ∈ 𝒴, let θ_y be the probability that an individual is assigned to configuration y, where ∑_{y∈𝒴} θ_y = 1. The sampling distribution relative to an observation y_i can be written as p(y_i | θ) = ∏_{y∈𝒴} θ_y^{1(y_i = y)}, and the likelihood function for n i.i.d. data points expressed as counts in the contingency table N is then p(N | θ) = ∏_{y∈𝒴} θ_y^{n(y)}, where θ = {θ_y, y ∈ 𝒴} is the l_V-dimensional vector collecting the cell probabilities θ_y.
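The construction of N and of a marginal table N_S is mechanical; the following R sketch, on hypothetical data with q = 3 classification criteria, shows one way to obtain both with base R functions.

# n = 50 hypothetical observations on q = 3 categorical variables.
set.seed(1)
dat <- data.frame(Y1 = factor(sample(0:1, 50, TRUE)),
                  Y2 = factor(sample(0:1, 50, TRUE)),
                  Y3 = factor(sample(0:2, 50, TRUE)))

N <- table(dat)         # q-dimensional table of counts, one cell per y
sum(N) == nrow(dat)     # the counts n(y) add up to n

# Marginal table N_S: classify individuals by the criteria in S only.
S   <- c("Y1", "Y3")
N_S <- margin.table(N, margin = match(S, names(dimnames(N))))
sum(N_S) == nrow(dat)   # the counts n(y_S) also add up to n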
For the methodology developed in the next sections we need a formula for the marginal distribution of the dataset N, namely m(N) = ∫ p(N | θ) p(θ) dθ, where p(θ) is a prior assigned to the model parameter θ. A standard conjugate prior for θ is the Dirichlet distribution, θ ∼ Dir(θ | A) ∝ ∏_{y∈𝒴} θ_y^{a(y) − 1}, where a(y) ∈ ℝ_+ and A = (a(y), y ∈ 𝒴) denotes a q-dimensional table of hyperparameters with the same size and structure of N. Because of conjugacy of the Dirichlet prior with model p(N | θ), the posterior p(θ | N) is Dir(θ | A + N), where A + N denotes the table collecting the element-by-element sums of A and N. Accordingly, the marginal data distribution of N can be obtained as the ratio of prior and posterior normalizing constants, so that

m(N) = Γ(∑_{y∈𝒴} a(y)) / Γ(∑_{y∈𝒴} (a(y) + n(y))) · ∏_{y∈𝒴} Γ(a(y) + n(y)) / Γ(a(y)).    (1)

Different choices for the hyperparameters of the Dirichlet prior are possible. If for simplicity we set a(y) = a, Equation (1) reduces to Γ(l_V a) / Γ(l_V a + n) · ∏_{y∈𝒴} { Γ(a + n(y)) / Γ(a) }.

Consider now a subset S ⊆ V, with implied marginal table N_S. For later developments, we also need a formula for the marginal data distribution of N_S. Recall that n(y_S) is the count corresponding to the cell y_S ∈ 𝒴_S appearing in N_S. The likelihood function restricted to N_S can be written as p(N_S | θ_S) = ∏_{y_S∈𝒴_S} θ_{y_S}^{n(y_S)}, where θ_{y_S} = ∑_{y∈𝒴} θ_y 1(y(S) = y_S) are the marginal probabilities for variables in S and θ_S is the vector of dimension l_S collecting the cell probabilities θ_{y_S}. Moreover, by the aggregation property of the Dirichlet distribution we have θ_S ∼ Dir(θ_S | A_S), where A_S = (a(y_S), y_S ∈ 𝒴_S) is an |S|-dimensional table of hyperparameters with elements a(y_S) given by a(y_S) = ∑_{y∈𝒴} a(y) 1(y(S) = y_S). Accordingly, the posterior distribution of θ_S is Dir(θ_S | A_S + N_S) and the marginal data distribution restricted to the table N_S is

m(N_S) = Γ(∑_{y_S∈𝒴_S} a(y_S)) / Γ(∑_{y_S∈𝒴_S} (a(y_S) + n(y_S))) · ∏_{y_S∈𝒴_S} Γ(a(y_S) + n(y_S)) / Γ(a(y_S)).    (2)

Note that, if we let a(y) = a, we obtain, with S̄ = V \ S, a(y_S) = a ∑_{y∈𝒴} 1(y(S) = y_S) = l_{S̄} a.
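As a minimal numerical illustration of Equations (1) and (2), the following R sketch computes the log marginal distribution of the tables N and N_S built in the previous snippet, under the constant choice a(y) = a (log scale, via lgamma, for numerical stability).

# Equation (1) for a table of counts with constant hyperparameter a;
# Equation (2) has the same form once the aggregated a(y_S) is supplied.
log_m <- function(counts, a) {
  lgamma(length(counts) * a) - lgamma(length(counts) * a + sum(counts)) +
    sum(lgamma(a + counts) - lgamma(a))
}

log_m(N, a = 1)          # full table, a(y) = 1
log_m(N_S, a = 2 * 1)    # marginal table: a(y_S) = l_Sbar * a = 2, since the
                         # excluded variable Y2 has two levels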
3 Model selection of essential graphs

In this section we instead focus on EGs and derive a closed-form expression for the marginal likelihood of a categorical EG model. We first write the likelihood function, which factorizes according to the graphical structure imposed by the EG (Section 3.1). The latter involves a collection of parameters for each chain component (a decomposable UG), for which a suitable prior distribution must be specified. To this end we follow the procedure of Geiger & Heckerman [17], originally introduced for model comparison of DAG models. The EG marginal likelihood is obtained in Section 3.3, with related asymptotic properties studied in Section 3.4 and considerations on the hyperparameter choice in Section 3.5.

3.1 Likelihood

Let G = (V, E) be an EG. Recall from Andersson et al. [1, Theorem 4.1] that G is a chain graph where each chain component τ ∈ 𝒯, τ ⊆ V, corresponds to a decomposable UG G_τ. Let also y ∈ 𝒴 and y_τ ∈ 𝒴_τ be the generic elements of the product spaces 𝒴 and 𝒴_τ respectively, as defined in Section 2.2; similarly for y_{pa_G(τ)} ∈ 𝒴_{pa_G(τ)}, where pa_G(τ) denotes the set of parents of τ in G. For simplicity of notation we will omit the subscript G (e.g. by writing pa(τ) instead of pa_G(τ)), so that the dependence on the underlying EG will be tacitly assumed. All the results presented below are therefore predicated on a given EG G.

Recall first from Andersson et al. [2] that under a given EG the probability distribution relative to an observation y ∈ 𝒴 factorizes as

p(y | θ) = ∏_{τ∈𝒯} p( y(τ) | y(pa(τ)), θ_{τ|y(pa(τ))} ),    (3)

where θ is a global parameter indexing the EG model, while θ_{τ|y(pa(τ))} = {θ_{y_τ|y(pa(τ))}, y_τ ∈ 𝒴_τ} is a local parameter for chain component τ, corresponding to the configuration of the variables in pa(τ) actually observed; see also Castelo & Perlman [9, Equation 3]. Accordingly, the likelihood function for a complete dataset D comprising the n observations y_i, i = 1, . . . , n, can be written as

p(D | θ) = ∏_{i=1}^n ∏_{τ∈𝒯} p( y_i(τ) | y_i(pa(τ)), θ_{τ|y_i(pa(τ))} ) = ∏_{i=1}^n ∏_{τ∈𝒯} ∏_{r∈𝒴_{pa(τ)}} ∏_{s∈𝒴_τ} θ_{s|r}^{1{y_i(τ) = s, y_i(pa(τ)) = r}},

by expanding the product over the sets 𝒴_{pa(τ)} and 𝒴_τ. Therefore

p(N | θ) = ∏_{τ∈𝒯} ∏_{r∈𝒴_{pa(τ)}} ∏_{s∈𝒴_τ} θ_{s|r}^{n(s|r)} = ∏_{τ∈𝒯} ∏_{r∈𝒴_{pa(τ)}} p(N_τ | N_{pa(τ)}, r, θ_{τ|r}),    (4)

where the conditional frequency n(s|r) corresponds to the number of observations assigned to levels s and r of the variables in τ and pa(τ) respectively. Equation (4) corresponds to the likelihood for n i.i.d. observations, expressed as counts in the contingency table N, respecting the graphical structure imposed by the EG.

3.2 Parameter prior

Heckerman et al. [23] and Geiger & Heckerman [17] propose a method for the construction of parameter priors on DAG models. An important implication of their approach concerns the computation of the marginal likelihood of any DAG, which can be directly obtained from the marginal data distribution computed under a complete model. In more detail, Heckerman et al. [23] introduce an elicitation procedure for parameter prior construction across DAG models and decomposable UG models. Starting from a few assumptions that are naturally satisfied in the Gaussian setting by Normal-Wishart priors and in the categorical framework by Dirichlet priors (Geiger & Heckerman [17]), they show how to assign a prior to the parameters of any given DAG (or decomposable UG) starting from a unique prior assigned to the parameter of a complete
DAG (or decomposable UG) model. In our EG context, we implement this elicitation procedure at the level of the chain components, since each chain component corresponds to a decomposable UG (Theorem 4.1 of Andersson et al. [1]). This approach dramatically simplifies the prior elicitation procedure across EGs and provides a default method to assign priors to EG model parameters: we are then allowed to assume standard Dirichlet priors, in accordance with Section 2.2.

For the EG global parameter θ we first assume that the prior factorizes as

p(θ) = ∏_{τ∈𝒯} p(θ_τ),    (5)

a condition known as global independence, which extends the assumption of global parameter independence, typical of DAG models, to CG models; see also Castelo & Perlman [9]. For any τ ∈ 𝒯 consider now θ_{τ|r} = {θ_{y_τ|r}, y_τ ∈ 𝒴_τ}, r ∈ 𝒴_{pa(τ)}. We further assume local independence, namely that the θ_{τ|r} are a priori independent:

p(θ_τ) = ∏_{r∈𝒴_{pa(τ)}} p(θ_{τ|r}).    (6)

Recall that each θ_{τ|r} consists of a vector of (conditional) probabilities θ_{y_τ|r}, y_τ ∈ 𝒴_τ. Assuming that the underlying (decomposable) sub-graph G_τ is complete, we can set

p(θ_{τ|r}) = Dir(θ_{τ|r} | A_{τ|r}) ∝ ∏_{y_τ∈𝒴_τ} θ_{y_τ|r}^{a(y_τ|r) − 1}.    (7)

Let now S ⊆ τ and let N_S be the corresponding marginal table; see also Section 2.2. Accordingly, the likelihood function restricted to S can be written as

p(N_S | N_{pa(τ)}, r, θ_{S|r}) = ∏_{s∈𝒴_S} θ_{s|r}^{n(s|r)},    (8)

where θ_{s|r} = ∑_{y_τ∈𝒴_τ} θ_{y_τ|r} 1(y_τ(S) = s), whilst, fixing a(s|r) = ∑_{y_τ∈𝒴_τ} a(y_τ|r) 1(y_τ(S) = s),

p(θ_{S|r}) = Dir(θ_{S|r} | A_{S|r}) ∝ ∏_{s∈𝒴_S} θ_{s|r}^{a(s|r) − 1}    (9)

is the prior induced by (7).

3.3 Marginal likelihood

We now focus on the computation of the marginal likelihood of G,

m_G(N) = ∫ p_G(N | θ_G) p(θ_G) dθ_G,    (10)

where we now emphasize the dependence on the EG G. Because of the independence assumption in (5), we can write

m_G(N) = ∏_{τ∈𝒯} ∫ p_τ(N_τ | N_{pa_G(τ)}, θ_τ) p(θ_τ) dθ_τ = ∏_{τ∈𝒯} m_τ(N_τ | N_{pa_G(τ)}),    (11)

where it appears that m_G(N) admits the same factorization as the sampling density in (4). Next, because of the independence assumption across the θ_{τ|r} in (6), we can write

m_τ(N_τ | N_{pa_G(τ)}) = ∏_{r∈𝒴_{pa_G(τ)}} m_{τ|r}(N_τ | N_{pa_G(τ)}, r).

Recall that, for a decomposable UG G_τ with sets of cliques and separators 𝒞_{G_τ} and 𝒮_{G_τ}, the marginal likelihood m_{τ|r}(·) admits the factorization of Lauritzen [26]:

m_{τ|r}(N_τ | N_{pa_G(τ)}, r) = ∏_{C∈𝒞_τ} m(N_C | N_{pa_G(τ)}, r) / ∏_{S∈𝒮_τ} m(N_S | N_{pa_G(τ)}, r).    (12)

In addition, because of the theory presented in Geiger & Heckerman [17], applied to decomposable UGs by Consonni & La Rocca [11], each term m(N_S | ·) in (12) corresponds to the marginal data distribution computed under a complete graph as in Equation (2), and similarly for m(N_C | N_{pa_G(τ)}, r); that is,

m(N_S | N_{pa_G(τ)}, r) = Γ(∑_{s∈𝒴_S} a(s|r)) / Γ(∑_{s∈𝒴_S} (a(s|r) + n(s|r))) · ∏_{s∈𝒴_S} Γ(a(s|r) + n(s|r)) / Γ(a(s|r)).    (13)

Note that the total number of parameters is |θ_G| = ∑_τ l_{fa(τ)}, where fa(τ) = pa(τ) ∪ τ. We stress that we can handle the high-dimensional case n ≪ |θ_G|, with no constraints on the sparsity of the graph, differently from the Gaussian context of Consonni & La Rocca [11] and Consonni et al. [12], where the minimum number of observations is related to the clique number of the graph, that is the dimension of the largest maximal clique.
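To fix ideas, the following R sketch assembles Equations (12)-(13) on the log scale for one chain component and one parent configuration r. The helper names are ours; for illustration only, the earlier table N is treated as if it collected the conditional counts n(y_τ | r) of a three-variable chain component with cliques {Y1, Y2}, {Y2, Y3} and separator {Y2}.

# Equation (13): log marginal data distribution of a clique/separator table.
log_m_cond <- function(counts, a_tab) {
  lgamma(sum(a_tab)) - lgamma(sum(a_tab) + sum(counts)) +
    sum(lgamma(a_tab + counts) - lgamma(a_tab))
}

# Collapse a table (of counts or hyperparameters) onto a subset of nodes,
# as in the aggregation a(s|r) = sum over y_tau of a(y_tau|r) 1(y_tau(S) = s).
aggregate_to <- function(tab, S)
  margin.table(tab, match(S, names(dimnames(tab))))

# Equation (12): clique terms over separator terms, on the log scale.
log_m_tau_r <- function(counts, a_full, cliques, separators) {
  lc <- sum(sapply(cliques, function(C)
    log_m_cond(aggregate_to(counts, C), aggregate_to(a_full, C))))
  ls <- sum(sapply(separators, function(S)
    log_m_cond(aggregate_to(counts, S), aggregate_to(a_full, S))))
  lc - ls
}

# a(y_tau|r) = 1/l_tau, anticipating the choice suggested in Section 3.5.
a_full <- array(1 / length(N), dim = dim(N), dimnames = dimnames(N))
log_m_tau_r(N, a_full,
            cliques = list(c("Y1", "Y2"), c("Y2", "Y3")),
            separators = list("Y2"))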
3.4 Asymptotic properties

In this section we derive the asymptotic distribution of the marginal likelihood. More precisely we show, for a single clique or separator of the graph, that the logarithm of the marginal likelihood, scaled by a factor of √n, converges in distribution to a Gaussian random variable as the number of observations diverges, conditionally on the knowledge of the parent configurations. The asymptotic variance, for which we provide an easy estimator, reveals the speed at which the marginal likelihood converges to its asymptotic mean, that is to the marginal likelihood evaluated at the true population configuration probabilities.

We first fix ñ(c|r) = n(c|r)/n(r) as the observed relative frequency of a configuration c ∈ 𝒴_C, given that y_{pa(C)} = r; similarly for ñ(s|r), s ∈ 𝒴_S. The hyperparameters a(c|r) and a(s|r) are implied by a(y_τ|r), y_τ ∈ 𝒴_τ, through the aggregation property of the Dirichlet distribution. Given a generic separator S ⊆ τ (but the same can be stated for a clique C), it is a standard result that

{ñ(s|r), s ∈ 𝒴_S} →_d N_{l_S}( θ_{S|r}, Σ_{S|r}/n(r) ),    (14)

where θ_{S|r} is the vector of the true configuration probabilities in S, given a specific configuration r of the parents, and where Σ_{S|r} = (σ_{s1,s2|r}) is the l_S × l_S matrix with σ_{s1,s2|r} = θ_{s1|r}(1 − θ_{s1|r}) if s1 = s2, and σ_{s1,s2|r} = −θ_{s1|r} θ_{s2|r} if s1 ≠ s2.

If a(s|r) is chosen so that a(s|r)/n(r) → 0, the result in (14) is also valid for {n̄(s|r), s ∈ 𝒴_S}, where n̄(s|r) := (n(s|r) + a(s|r))/n(r), for instance when a(s|r) = a_S is a constant depending on the set S but not on the configurations of the nodes in S or of the parents.
Furthermore, assuming for simplicity a(s|r) = a, we have that

(1/n(r)) log m(N_S | N_{pa(τ)}, r) = C(a, S, r) + (1/n(r)) ∑_{s∈𝒴_S} log Γ(n(r) n̄(s|r)) − (1/n(r)) log Γ(n(r) ∑_{s∈𝒴_S} n̄(s|r)) =: g_{S|r}({n̄(s|r), s ∈ 𝒴_S}),    (15)

for some function C depending on a, S and r, but not on the data, and such that C(a, S, r) → 0 as n(r) → ∞. Since g_{S|r} is continuous, with at least one non-null partial derivative in n̄(s|r) fixed at θ_{s|r}, for all s ∈ 𝒴_S, by the Delta method the asymptotic normality is preserved, with

(1/n(r)) log m(N_S | N_{pa(τ)}, r) →_d N( g_{S|r}({θ_{s|r}, s ∈ 𝒴_S}), D′_{S|r} Σ_{S|r} D_{S|r} / n(r) ),

where D_{S|r} = {D_{s|r}, s ∈ 𝒴_S}, D_{s|r} = ψ(n(r) θ_{s|r}) − ψ(n(r)), and ψ is the digamma function. From the approximation exp(ψ(x)) ≈ x − 1/2, valid for large x, we can write, for n(r) large enough, ψ(n(r) θ_{s|r}) ≈ log(n(r) θ_{s|r} − 1/2) and

D_{s|r} ≈ log( (n(r) θ_{s|r} − 1/2) / (n(r) − 1/2) ) ≈ log θ_{s|r},

so that the asymptotic variance becomes

D′_{S|r} Σ_{S|r} D_{S|r} / n(r) ≈ (1/n(r)) ∑_{s1∈𝒴_S} ∑_{s2∈𝒴_S} σ_{s1,s2|r} log θ_{s1|r} log θ_{s2|r},

and then finally

√n(r) [ g_{S|r}({n̄(s|r), s ∈ 𝒴_S}) − g_{S|r}({θ_{s|r}, s ∈ 𝒴_S}) ] / [ ∑_{s1∈𝒴_S} ∑_{s2∈𝒴_S} σ_{s1,s2|r} log θ_{s1|r} log θ_{s2|r} ]^{1/2} →_d N(0, 1).    (16)

From the convergence in probability of n̄(s|r) and the continuous mapping theorem, the result in (16) is also valid with the denominator replaced by its estimate

[ ∑_{s1∈𝒴_S} ∑_{s2∈𝒴_S} σ̂_{s1,s2|r} log n̄(s1|r) log n̄(s2|r) ]^{1/2},

where σ̂_{s1,s2|r} = n̄(s1|r)(1 − n̄(s1|r)) if s1 = s2, and σ̂_{s1,s2|r} = −n̄(s1|r) n̄(s2|r) if s1 ≠ s2. Replacing n(r) by n, we can repeat the steps above for ḡ_{S|r}({n̄(s|r), s ∈ 𝒴_S}) := (1/n) log m(N_S | N_{pa(τ)}, r) and obtain, asymptotically in n, that

√n [ ḡ_{S|r}({n̄(s|r), s ∈ 𝒴_S}) − ḡ_{S|r}({θ_{s|r}, s ∈ 𝒴_S}) ] / [ θ_r ∑_{s1∈𝒴_S} ∑_{s2∈𝒴_S} σ_{s1,s2|r} log θ_{s1|r} log θ_{s2|r} ]^{1/2} →_d N(0, 1),    (17)

still valid with σ̂_{s1,s2|r}, n̄(s|r) and n(r)/n replacing, respectively, σ_{s1,s2|r}, θ_{s|r} and θ_r in the denominator. Therefore, with a(s|r)/n → 0, an appropriately scaled version of log m(N_S | N_{pa(τ)}, r) is asymptotically Gaussian and correctly centered, with a variance that can be estimated.
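A quick Monte Carlo check of (16) can be run in R. The C-term of (15) cancels in the difference g_{S|r}({n̄(s|r)}) − g_{S|r}({θ_{s|r}}) and is dropped; the values of θ, a and n(r) below are arbitrary choices for the experiment.

set.seed(2)
theta <- c(0.2, 0.3, 0.5)       # assumed true configuration probabilities
a <- 1/3; n_r <- 5000; B <- 2000

g_fun <- function(nbar, n_r)    # g_{S|r} of (15), without the vanishing C-term
  (sum(lgamma(n_r * nbar)) - lgamma(n_r * sum(nbar))) / n_r

Sigma  <- diag(theta) - tcrossprod(theta)           # multinomial covariance
sd_asy <- sqrt(sum(Sigma * tcrossprod(log(theta)))) # denominator of (16)
g_at_theta <- g_fun(theta, n_r)

z <- replicate(B, {
  counts <- as.vector(rmultinom(1, n_r, theta))     # n(s|r), s in Y_S
  nbar   <- (counts + a) / n_r                      # prior-adjusted frequencies
  sqrt(n_r) * (g_fun(nbar, n_r) - g_at_theta) / sd_asy
})
c(mean(z), sd(z))               # should be close to (0, 1)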
3.5 Hyperparameter choice

Following Geiger & Heckerman [17], we start from a unique prior at the level of the chain component τ, and all other priors for the included cliques and separators are derived accordingly, in a way that is coherent with the hyperparameter construction in the BDeu score of Heckerman et al. [23]. As pointed out in Scutari [37], BDeu is the only score that guarantees equal scores to Markov equivalent DAGs; see also Scutari [36]. Still, we stress that our marginal likelihood derived in Section 3.3 does not coincide with the BDeu score of Heckerman et al. [23], since the latter is defined on DAGs and not on EGs. Only the part of our marginal likelihood related to a single chain component, conditionally on one observed configuration of the parent nodes, can be reconducted to the BDeu form.

Any possible value for a(s|r) for which a(s|r)/n → 0 guarantees the asymptotic results of Section 3.4. By requiring, for all y_τ ∈ 𝒴_τ and r ∈ 𝒴_{pa(τ)}, a(y_τ|r) < 1, we opt for the sensible choice of a prior distribution with no mode on any chain configuration.
We want this property to be valid also for all clique and separator configurations within the chain component. Furthermore, a prior choice of V(θ_{y_τ|r}) = α implies V(θ_{c|r}) ≈ α l_τ / l_C, meaning more prior uncertainty for probabilities associated to smaller cliques or separators, proportionally to their dimension relative to the dimension of the chain component they belong to. Then, to have the same prior information on cliques/separators of the same dimension in different chain components, to impose no prior mode on any clique/separator configuration, and for the results in Section 3.4 to be valid, we ultimately suggest the choice a(y_τ|r) = 1/l_τ.

4 MCMC scheme and posterior inference

In this section we introduce the MCMC scheme that we adopt to sample from the posterior distribution on the EG space and to perform posterior model inference of categorical EGs.

4.1 MCMC scheme
Let 𝒮_q be the set of all EGs on q nodes. Our MCMC consists of a Metropolis-Hastings (MH) algorithm targeting the posterior distribution on the EG space,

p(G | N) ∝ m_G(N) p(G), G ∈ 𝒮_q,

where m_G(N) is the marginal likelihood of G computed as in Equation (11) and p(G) is a prior assigned to G. A similar scheme was introduced in Castelletti et al. [7] within the context of Gaussian EGs. The key feature of this algorithm is the choice of a suitable proposal distribution, which determines the transitions between EGs belonging to the (discrete) model space 𝒮_q. To this end, Castelletti et al. [7] adopted the Markov chain originally proposed by He et al. [22] to explore the EG space and investigate features of interest (such as the number of directed and undirected edges, v-structures and so on). Some optimality properties, namely irreducibility and reversibility, allow to efficiently compute the stationary distribution of the Markov chain, used to weigh samples obtained from the proposal distribution.

Transitions between EGs are determined by six types of operators: inserting an undirected edge (denoted by InsertU), deleting an undirected edge (DeleteU), inserting a directed edge (InsertD), deleting a directed edge (DeleteD), converting two adjacent undirected edges into a v-structure (MakeV) and converting a v-structure into two adjacent undirected edges (RemoveV). Besides these, following Castelletti & Consonni [6] we also adopt the operator ReverseD, originally introduced by Chickering [10]. This operator is not needed for the Markov chain to be irreducible and reversible, but it adds extra connectivity to the states of the chain, thus improving the exploration of the EG space; see also Castelletti [5] for a general presentation of the MCMC scheme. For each EG G we can then construct a set of perfect operators 𝒪_G, i.e. operators guaranteeing that the resulting graph is an EG. Let 𝒪_G be a perfect set of operators on G and |𝒪_G| its cardinality. It can be shown that the probability of transition from G to G′, the latter a direct successor of G, is q(G′ | G) = 1/|𝒪_G|.

Next, we need to specify a prior p(G), for G ∈ 𝒮_q. Let A^G be the (symmetric) 0-1 adjacency matrix of the skeleton of G, whose (u, v) element is denoted by A^G_{(u,v)}. Conditionally on a probability of edge inclusion π ∈ (0, 1), we assign independent Bernoulli priors to the elements A^G_{(u,v)} in the lower triangular part of A^G, that is A^G_{(u,v)} | π ∼ iid Ber(π), u > v. Therefore,

p(A^G) = π^{|A^G|} (1 − π)^{q(q−1)/2 − |A^G|},    (18)

where |A^G| denotes the number of edges in the skeleton of G and q(q−1)/2 is the maximum number of edges in the graph. We then set p(G) ∝ p(A^G) for each G ∈ 𝒮_q, which results in a simple prior depending only on the number of edges in the graph and that can easily reflect prior knowledge of sparsity (Castelletti et al. [7]). Other priors, specific for DAGs and based on the number of compatible perfect orderings of the vertices, are also present in the literature (Friedman & Koller [15], Kuipers & Moffa [25]).

Let m_G(N) be the marginal likelihood of G given the table of counts N, p(G) a prior on G and q(G′ | G) a proposal distribution for the chain when the current graph is G. At each step of the MH scheme we then propose a new EG G′ given the current graph G from q(G′ | G) and accept G′ with probability

α_{G,G′} = min{ 1, (m_{G′}(N)/m_G(N)) · (p(G′)/p(G)) · (q(G | G′)/q(G′ | G)) }.    (19)
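Schematically, one step of the resulting sampler reads as below. Here propose_eg (a draw from the perfect operator set of He et al. [22], returning the proposed EG together with |𝒪_G| and |𝒪_{G′}|) and log_marg (the EG marginal likelihood of Equation (11)) are placeholders for the components described in the text, not available functions; skeleton() is the helper from the snippet in Section 2.1.

# Equation (18) on the log scale, computed from the skeleton of the EG.
log_prior <- function(G, pi_edge) {
  q <- nrow(G)
  n_edges <- sum(skeleton(G)) / 2                  # |A_G|
  n_edges * log(pi_edge) + (q * (q - 1) / 2 - n_edges) * log(1 - pi_edge)
}

# One Metropolis-Hastings step, Equation (19), with EGs stored as 0-1
# adjacency matrices (an undirected u - v coded as u -> v plus v -> u).
mh_step <- function(G, N, pi_edge) {
  prop <- propose_eg(G)  # list: G_new, n_ops_old = |O_G|, n_ops_new = |O_G'|
  log_ratio <- log_marg(prop$G_new, N) - log_marg(G, N) +
    log_prior(prop$G_new, pi_edge) - log_prior(G, pi_edge) +
    log(prop$n_ops_old) - log(prop$n_ops_new)      # q(G|G') / q(G'|G)
  if (log(runif(1)) < min(0, log_ratio)) prop$G_new else G
}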
4.2 Posterior inference

Our MCMC output consists of the collection of EGs visited by the chain, {G^(1), . . . , G^(T)}. This can be used to approximate the posterior distribution over the EG space,

p(G | N) = m_G(N) p(G) / ∑_{G∈𝒮_q} m_G(N) p(G) ≈ (1/T) ∑_{t=1}^T 1{G^(t) = G},

where 1(·) is the indicator function; see also García-Donato & Martínez-Beneito [16] for a discussion on frequency-based estimators in large model spaces. In addition, we can recover from the same output the (estimated) posterior probability of inclusion of each (directed) edge,

p̂_{u→v}(N) = (1/T) ∑_{t=1}^T 1_{u→v}{G^(t)},    (20)

where 1_{u→v}{G^(t)} = 1 if G^(t) contains u → v and 0 otherwise, and an undirected edge u − v is equivalent to the union of u → v and u ← v. Starting from these quantities, a single EG estimate summarizing the whole output can also be obtained, if required. For instance, one can consider the maximum a posteriori (MAP) graph model, which corresponds to the EG with highest associated posterior probability. However, the MAP may not represent an optimal choice, especially from a predictive viewpoint, as discussed for instance by Barbieri & Berger [3] in a multiple linear regression framework. Differently, it was shown that the median probability model, which in their context is obtained by including all variables whose posterior probability of inclusion exceeds 0.5, is predictively optimal. In our EG setting we can proceed similarly and first construct a graph estimate (that we name median probability graph model) by including all edges u → v such that p̂_{u→v}(N) > 0.5.
Since the latter is not guaranteed to be an EG, but is in general a partially directed graph, one further possibility is to consider any consistent extension [13] of the median probability model, as detailed in Castelletti et al. [7]. The resulting EG estimate is called the projected median probability graph model.
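Once the MCMC output is stored as a list of adjacency matrices, these summaries reduce to a few lines of R (a sketch under that storage assumption):

# Equation (20): posterior probabilities of inclusion of each directed edge.
edge_probs <- function(chain) Reduce(`+`, chain) / length(chain)

# Median probability graph: keep edges with inclusion probability above 0.5.
# The result is in general a partially directed graph and may still require
# a consistent extension to become a proper EG, as discussed above.
median_prob_graph <- function(chain) (edge_probs(chain) > 0.5) * 1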
5 Simulations
We now evaluate the performance of our method through simulations. Specifically, we vary the number of variables q (four values, including q = 10 and q = 20, for which detailed results are reported below) and the sample size n ∈ {100, 200, 500, 1000}. For each combination of q and n (a scenario) we generate 40 categorical datasets as detailed in Section 5.1. For simplicity we assume all variables to be binary, namely 𝒴_j = {0, 1}, j = 1, . . . , q. Results and comparisons with some benchmark methods are presented in Section 5.2.

5.1 Data generation

For a given value of q, we first randomly generate 40 DAGs using the function randomDAG in the R package pcalg, by fixing a probability of edge inclusion equal to p_edge = 3/(2q − 2), as in the sparse setting of Peters & Bühlmann [31].
2) as inthe sparse setting of [31]. Each DAG D defines a data generating process which in a Gaussiansetting [7] we can write as Z i,j = µ j + (cid:88) k ∈ pa D ( j ) β k,j Z i,k + ε i,j , (21)for i = 1 , . . . , n and j = 1 , . . . , q , where ε i,j ∼ N (0 , σ j ) independently. For each j we fix µ j = 0 and σ j = 1, while regression coefficients β k,j are uniformly chosen in the interval[ − , − . ∪ [0 . , n multivariate Gaussian observations from (21); a categorical dataset consistingof n observations from q binary variables is then obtained by setting Y i,j = Z i,j ≥ γ j , Z i,j < γ j , (22)where we fix γ j = 0, for j = 1 , . . . , q . Finally, for each DAG D we consider its representativeEG which will represent the benchmark of comparison with the EG estimate provided by eachmethod under evaluation; more details are given in the next section. We evaluate the performance of our method, that we name DBEG (Discrete Bayesian EG),in recovering the graphical structure of the true EG. To this end, for each q ∈ { , , , } we run T = 1000 · q iterations of our MCMC algorithm (Section 4). To favour sparsity, wefix the hyperparameter π in the EG prior (18) as π = 1 . / (2 q −
5.2 Results

We evaluate the performance of our method, which we name DBEG (Discrete Bayesian EG), in recovering the graphical structure of the true EG. To this end, for each q we run T = 1000·q iterations of our MCMC algorithm (Section 4). To favour sparsity, we fix the hyperparameter π in the EG prior (18) as π = 1.5/(2q − 2), which corresponds to a prior probability of edge inclusion smaller than the expected level of sparsity, as commonly recommended; see for instance Peterson et al. [32]. Finally, we fix a(y_τ|r) = 1/l_τ in the Dirichlet prior (7), as suggested in Section 3.5.

We compare our method with the PC algorithm for categorical data of Spirtes et al. [41], a constraint-based method that estimates the EG through multiple conditional independence tests, at a significance level α that we fix as α ∈ {0.10, 0.05, 0.01}. As other benchmarks, we use HC BDeu, an optimized hill climbing greedy search that explores the space of DAGs by single-arc additions, removals and reversals and that uses the BDeu score of Heckerman et al. [23], and TABU BDeu (Russell & Norvig [34]), a modified hill-climbing algorithm able to escape local optima by selecting DAGs that minimally decrease the score function. Since HC BDeu and TABU BDeu were designed for DAG model selection rather than for EGs, their DAG estimates are converted into the EG representative of the corresponding equivalence class.

We evaluate the ability of each method to recover the true EG structure in terms of the Structural Hamming Distance (SHD) between true and estimated EG. The SHD represents the number of edge insertions, deletions or flips needed to transform the estimated EG into the true one. Accordingly, lower values of SHD correspond to better performances. Results are summarized in the box-plots of Figure 1, where each plot reports the distribution of SHD across the simulated datasets for a given value of q and increasing sample sizes n ∈ {100, 200, 500, 1000}. With regard to our method, we consider as EG point estimate the projected median probability graph model (DBEG); see also Section 4.2. All methods improve their performance as the sample size increases. Moreover, our DBEG method outperforms PC 0.10, PC 0.05, HC BDeu and TABU BDeu most of the times and remains highly competitive with PC 0.01 under all scenarios.

For each scenario and method we also evaluate the performance in learning the structure of the true EG in terms of misspecification rate, specificity, sensitivity, precision and Matthews correlation coefficient:

MISR = (FN + FP) / (q(q − 1)),  SPE = TN / (TN + FP),  SEN = TP / (TP + FN),  PRE = TP / (TP + FP),

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, TN, FP, FN are the numbers of true positives, true negatives, false positives and false negatives respectively. The five measures can be computed by comparing the true and estimated EG through the corresponding adjacency matrices, where an undirected edge u − v is treated as the union of the two directed edges u → v and u ← v. With the exception of MISR, better performances correspond to higher values.
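Computed from the two 0-1 adjacency matrices (with zero diagonals), the five indexes read in R as follows:

eg_metrics <- function(A_true, A_est) {
  q  <- nrow(A_true)
  tp <- sum(A_est == 1 & A_true == 1); fp <- sum(A_est == 1 & A_true == 0)
  fn <- sum(A_est == 0 & A_true == 1)
  tn <- sum(A_est == 0 & A_true == 0) - q   # discard the q diagonal cells
  c(MISR = (fn + fp) / (q * (q - 1)),
    SPE  = tn / (tn + fp),
    SEN  = tp / (tp + fn),
    PRE  = tp / (tp + fp),
    MCC  = (tp * tn - fp * fn) /
           sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
}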
Results for number of nodes q ∈ {10, 20} are summarized in Tables 1 and 2, where we compare our DBEG with the three versions of the PC algorithm and with the score-based methods HC BDeu and TABU BDeu. In terms of the specificity index (SPE), all methods are comparable. The superiority of DBEG and PC 0.01, relative to the other methods, stems from a higher precision (PRE) and a higher Matthews correlation coefficient (MCC), the latter being more evident in the setting q = 20 or for larger sample sizes. Moreover, there is no clear ranking among the remaining methods.

Figure 1: Simulations. Structural Hamming Distance (SHD) between true and estimated EG for each number of nodes q and increasing sample sizes n ∈ {100, 200, 500, 1000}. Methods under comparison are: our DBEG method, the PC algorithm of Spirtes et al. [41], implemented for significance levels α ∈ {0.10, 0.05, 0.01} (respectively PC 0.10, PC 0.05, PC 0.01), HC BDeu, a hill climbing greedy search with the BDeu score of Heckerman et al. [23], and its modified version TABU BDeu (Russell & Norvig [34]).

5.3 Unbalanced settings

In the data generating process of Section 5.1 we fixed the thresholds γ_j = 0; see Equation (22). The zero threshold, coupled with the assumption μ_j = 0 in (21), which implies a marginal mean equal to zero for each latent Z_j, results in a collection of categorical variables whose levels are well balanced, meaning that P(Y_j = 1) = P(Y_j = 0) = 0.5 for j = 1, . . . , q. In the following we relax this assumption by drawing each γ_j uniformly in the interval [0, 1], so that P(Y_j = 1) ≤ 0.5 for each latent Z_j (in our simulation settings approaching 0.15 in the "worst" case where γ_j = 1).
Simulation results are reported in Figure 2, where the box-plots summarize the distribution of SHD for each method under comparison, across the same values of q and n ∈ {100, 200, 500, 1000}. Results are very similar to those obtained under the "balanced" setting where γ_j = 0, with our DBEG approach outperforming the two BDeu-based methods and being competitive with PC in most of the settings, in particular for scenarios characterized by moderate sample sizes. The same behaviour was observed for each of the five indexes in Tables 1-2, which we do not include for brevity.

6 Real data analyses

6.1 ALARM data

We apply our method to the ALARM dataset presented in Beinlich et al. [4]. ALARM (A Logical Alarm Reduction Mechanism) is an alarm message system for patient monitoring based on a diagnostic tool. From a graphical model viewpoint, ALARM consists of a belief network, a DAG describing dependence relationships between three types of categorical variables: 8 diagnoses (at the top level of the network), 13 intermediate variables and 16 findings (clinical outcomes). A number n = 20000 of observations are measured on each of the q = 37 categorical variables. Of these, 13 variables are binary, while the others have a number of levels equal to 3 or 4 (17 and 7 variables respectively). The objective of the original study was to estimate parameters (i.e. conditional probabilities) of interest as a diagnostic tool for patient monitoring, given a known DAG structure with 46 directed edges; see also Beinlich et al. [4, Figure 1].

We instead account for uncertainty in the data-generating graphical model and implement our methodology to learn an EG structure. This can then be compared with the clinically justified DAG structure assumed as known in the original study. We run T = 50000 iterations of DBEG, fixing a prior probability of edge inclusion π = 0.02 to favour sparsity and the hyperparameter a(y_τ|r) = 1/l_τ in the Dirichlet prior (7).
Table 1: Simulations. Misspecification rate (MISR), specificity (SPE), sensitivity (SEN), precision (PRE) and Matthews correlation coefficient (MCC), averaged over 40 simulations, for number of nodes q = 10 and sample size n ∈ {100, 200, 500, 1000}, for each method under comparison.

            MISR   SPE    SEN    PRE    MCC
n = 100
DBEG        8.31   97.87  45.76  75.69  57.54
PC 0.10     8.08   97.38  52.35  72.61  59.91
PC 0.05     7.67   98.07  49.99  77.08  60.26
PC 0.01     8.19   98.63  42.54  79.53  56.38
HC BDeu     8.92   98.32  37.00  74.91  51.25
TABU BDeu   8.92   98.32  37.00  74.91  51.25
n = 200
DBEG        6.14   98.75  61.21  82.97  69.33
PC 0.10     7.17   97.11  62.39  73.95  65.89
PC 0.05     6.31   97.94  63.80  81.29  69.99
PC 0.01     6.31   98.60  58.71  85.64  68.69
HC BDeu     7.58   99.08  42.85  88.24  59.25
TABU BDeu   7.58   99.08  42.85  88.24  59.25
n = 500
DBEG        4.69   98.44  73.17  86.55  77.90
PC 0.10     5.78   97.30  72.79  78.61  73.55
PC 0.05     5.33   97.94  71.77  83.35  75.33
PC 0.01     4.69   98.82  70.63  89.90  77.75
HC BDeu     7.33   98.60  48.41  83.43  61.80
TABU BDeu   7.33   98.60  48.41  83.43  61.80
n = 1000
DBEG        4.22   98.16  79.66  85.94  83.20
PC 0.10     5.33   97.36  75.37  78.41  74.97
PC 0.05     4.31   98.19  78.29  85.51  80.15
PC 0.01     3.72   98.86  77.84  90.36  82.23
HC BDeu     7.03   98.45  51.84  83.12  63.84
TABU BDeu   7.03   98.45  51.84  83.12  63.84

Table 2: Simulations. Misspecification rate (MISR), specificity (SPE), sensitivity (SEN), precision (PRE) and Matthews correlation coefficient (MCC), averaged over 40 simulations, for number of nodes q = 20 and sample size n ∈ {100, 200, 500, 1000}, for each method under comparison.

            MISR   SPE    SEN    PRE    MCC
n = 100
DBEG        4.17   99.05  37.75  70.17  50.59
PC 0.10     4.48   98.47  42.45  59.71  49.89
PC 0.05     4.49   98.62  40.06  61.07  49.10
PC 0.01     4.11   99.13  38.39  70.95  51.46
HC BDeu     4.67   98.58  36.12  58.31  45.74
TABU BDeu   4.67   98.58  36.12  58.31  45.74
n = 200
DBEG        3.34   98.92  55.87  75.64  64.03
PC 0.10     3.78   98.49  55.20  66.22  59.87
PC 0.05     3.47   98.85  54.82  71.90  62.05
PC 0.01     3.37   99.22  50.74  77.64  61.97
HC BDeu     4.09   98.88  42.21  68.10  53.10
TABU BDeu   4.09   98.88  42.21  68.10  53.10
n = 500
DBEG        2.67   99.02  67.42  80.51  72.82
PC 0.10     2.81   98.64  71.91  74.24  72.22
PC 0.05     2.49   98.94  72.31  78.96  74.69
PC 0.01     2.30   99.33  69.51  85.19  76.04
HC BDeu     3.60   98.98  49.63  74.38  59.92
TABU BDeu   3.60   98.98  49.63  74.38  59.92
n = 1000
DBEG        2.32   98.96  78.06  81.06  80.48
PC 0.10     2.72   98.46  76.54  73.59  74.17
PC 0.05     2.36   98.83  77.01  78.62  77.02
PC 0.01     1.93   99.35  76.15  86.55  80.26
HC BDeu     3.09   99.17  55.63  79.89  65.70
TABU BDeu   3.09   99.17  55.63  79.89  65.70

Figure 2: Simulations (unbalanced setting). Structural Hamming Distance (SHD) between true and estimated EG for each number of nodes q and increasing sample sizes n ∈ {100, 200, 500, 1000}. Methods under comparison are: our DBEG method, the PC algorithm of Spirtes et al. [41], implemented for significance levels α ∈ {0.10, 0.05, 0.01} (respectively PC 0.10, PC 0.05, PC 0.01), HC BDeu, a hill climbing greedy search with the BDeu score of Heckerman et al. [23], and its modified version TABU BDeu (Russell & Norvig [34]).

Figure 3: ALARM data. Estimated EG (maximum a posteriori and median probability graph model) obtained under DBEG.

The MCMC output estimates a posterior distribution on the EG space which is highly concentrated, with a single EG model assigned a posterior probability of about 70%. Therefore, the maximum a posteriori and the (projected) median probability graph models coincide (Figure 3). Estimated posterior probabilities of edge inclusion, as in Equation (20), are summarized in the (left-side) heat map of Figure 4 and confirm the low variability of the EG posterior distribution.
All edges included in the EG estimate of Figure 3 have indeed a posterior probability close to one. The few exceptions are the edges 17 − 19, 21 − 23 and 3 − 23, whose posterior probabilities however do not exceed the threshold for edge inclusion. In the same figure we provide a comparison with the EG implied by the DAG model assumed in Beinlich et al. [4], here represented as a heat map with black dots in correspondence of edges. The two plots reveal strong similarities between the EG structures, since they differ by 16 edges, out of the 46 and 51 edges respectively included in the two graphs.

Figure 4: ALARM data. Comparison between the original DAG assumed in Beinlich et al. [4] (left), with black dots in correspondence of edges, and estimated posterior probabilities of edge inclusion from our DBEG method (right).
6.2 US Congress voting records

In this section we analyze the voting records from the 1984 United States Congress. The dataset includes votes of each of the n = 434 U.S. House of Representatives Congressmen on sixteen key votes, identified by the Congressional Quarterly Almanac, on religion, immigration, crime, education, and other relevant subjects. Each of the q = 16 (categorical) answers takes value in {yes, no, NA}, with NA in case of a missing response. The data are publicly available at https://archive.ics.uci.edu/. In the following we also distinguish between democratic and republican Congressmen by considering two datasets with n_D = 267 and n_R = 167 observations respectively. Our method is then applied independently to each dataset, by fixing the number of MCMC iterations T = 30000, the prior probability of edge inclusion π = 5% and the hyperparameter a(y_τ|r) = 1/l_τ in the Dirichlet prior (7).

Differently from the previous application, the posterior distribution over the EG space exhibits larger variability, possibly related to the more moderate group sample sizes. This is also apparent from Figure 5, which summarizes the estimated posterior probabilities of edge inclusion within each group. In addition, the two plots (democratic and republican) reveal strong differences, as is evident from the estimated graphs in Figure 6. Few exceptions of similarity are the (directed) links between 5 (el-salvador-aid) and 8 (aid-to-nicaraguan-contras), and between 7 (anti-satellite-test-ban) and 16 (export-administration-act-south-africa), common to the two groups.

Figure 5: Voting records. Heat maps with estimated posterior probabilities of edge inclusion obtained from our DBEG method for the two groups: democratic (left) and republican (right).

Figure 6: Voting records. Estimated EG (projected median probability graph model) obtained with DBEG for the two groups: democratic (left) and republican (right).
7 Discussion

We propose a Bayesian method for learning the conditional dependence structures of multivariate categorical data, which we represent through a Directed Acyclic Graph (DAG). To account for different DAGs encoding the same set of dependencies (Markov equivalent DAGs), and to avoid hyperprior specifications that lead to undesirable properties of the marginal likelihood, our methodology directly learns the essential graph (EG) representative of a DAG equivalence class.
Appendix: Graph notation
A graph G is a pair (V, E), where V = {1, . . . , q} is a set of vertices (or nodes) and E ⊆ V × V a set of edges (or arcs). Nodes are associated to variables, while edges are used to represent direct interactions between variables. Let u, v ∈ V, u ≠ v, be two nodes. We say that G contains the directed edge u → v if and only if (u, v) ∈ E and (v, u) ∉ E. If instead both (u, v) ∈ E and (v, u) ∈ E, then G contains the undirected edge u − v. Accordingly, we say that G is an undirected (directed) graph if it contains only undirected (directed) edges; in addition, G is partially directed if it contains at least one directed edge.

Two vertices u, v are adjacent if they are connected by an edge (directed or undirected). In addition, we call u a neighbor of v if u − v is in G, and denote the neighbor set of v as ne_G(v); the common neighbor set of u and v is then ne_G(u, v) = ne_G(u) ∩ ne_G(v). We say that u is a parent of v and that v is a child of u if u → v is in G. The set of all parents of u in G is then denoted by pa_G(u).
Finally, a consistent extensionof a PDAG G is a DAG on the same underlying set of edges, with the same orientations on thedirected edges of G and the same set of v -structures [13].25 eferences [1] Andersson, S. A. , Madigan, D. & Perlman, M. D. (1997). A characterization ofMarkov equivalence classes for acyclic digraphs.
References

[1] Andersson, S. A., Madigan, D. & Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist. 25 505–541.

[2] Andersson, S. A., Madigan, D. & Perlman, M. D. (2001). Alternative Markov properties for chain graphs. Scand. J. Stat. 28 33–85.

[3] Barbieri, M. M. & Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32 870–897.

[4] Beinlich, I. A., Suermondt, H. J., Chavez, R. M. & Cooper, G. F. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In J. Hunter, J. Cookson & J. Wyatt, eds., AIME 89. Berlin, Heidelberg: Springer, 247–256.

[5] Castelletti, F. (2020). Bayesian model selection of Gaussian DAG structures. Int. Stat. Rev., in press.

[6] Castelletti, F. & Consonni, G. (2019). Objective Bayes model selection of Gaussian interventional essential graphs for the identification of signaling pathways. Ann. Appl. Stat. 13 2289–2311.

[7] Castelletti, F., Consonni, G., Della Vedova, M. & Peluso, S. (2018). Learning Markov equivalence classes of directed acyclic graphs: an objective Bayes approach. Bayesian Anal. 13 1231–1256.

[8] Castelletti, F., La Rocca, L., Peluso, S., Stingo, F. & Consonni, G. (2020). Bayesian learning of multiple directed networks from observational data. Stat. Med., in press.

[9] Castelo, R. & Perlman, M. D. (2004). Learning essential graph Markov models from data. In Advances in Bayesian Networks, vol. 146 of Stud. Fuzziness Soft Comput. Springer, Berlin, 255–269.

[10] Chickering, D. M. (2002). Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res. 2 445–498.

[11] Consonni, G. & La Rocca, L. (2012). Objective Bayes factors for Gaussian directed acyclic graphical models. Scand. J. Stat. 39 743–756.

[12] Consonni, G., La Rocca, L. & Peluso, S. (2017). Objective Bayes covariate-adjusted sparse graphical model selection. Scand. J. Stat. 44 741–764.

[13] Dor, D. & Tarsi, M. (1992). Simple algorithm to construct a consistent extension of a partially oriented graph. Technical Report R-185, Cognitive Systems Laboratory, UCLA.

[14] Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303 799–805.

[15] Friedman, N. & Koller, D. (2003). Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. 50 95–125.

[16] García-Donato, G. & Martínez-Beneito, M. A. (2013). On sampling strategies in Bayesian variable selection problems with large model spaces. J. Amer. Statist. Assoc. 108 340–352.

[17] Geiger, D. & Heckerman, D. (2002). Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Ann. Statist. 30 1412–1440.

[18] Gillispie, S. B. & Perlman, M. D. (2002). The size distribution for Markov equivalence classes of acyclic digraph models. Artif. Intell. 141 137–155.

[19] Hauser, A. & Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. J. Mach. Learn. Res. 13 2409–2464.

[20] Hauser, A. & Bühlmann, P. (2015). Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. J. R. Stat. Soc. Ser. B. Stat. Methodol. 77 291–318.

[21] He, Y. & Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. J. Mach. Learn. Res. 9 2523–2547.

[22] He, Y., Jia, J. & Yu, B. (2013). Reversible MCMC on Markov equivalence classes of sparse directed acyclic graphs. Ann. Statist. 41 1742–1779.

[23] Heckerman, D., Geiger, D. & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 20 197–243.

[24] Korb, K. B. & Nicholson, A. E. (2010). Bayesian Artificial Intelligence. CRC Press.

[25] Kuipers, J. & Moffa, G. (2017). Partition MCMC for inference on acyclic digraphs. J. Amer. Statist. Assoc. 112 282–299.

[26] Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.

[27] Madigan, D., Andersson, S. A., Perlman, M. D. & Volinsky, C. T. (1996). Bayesian model averaging and model selection for Markov equivalence classes of acyclic digraphs. Comm. Statist. Theory Methods 25 2493–2519.

[28] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[29] Nagarajan, R., Scutari, M. & Lèbre, S. (2013). Bayesian Networks in R: With Applications in Systems Biology. Springer.

[30] Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.

[31] Peters, J. & Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101 219–228.

[32] Peterson, C., Stingo, F. C. & Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. J. Amer. Statist. Assoc. 110 159–174.

[33] Roverato, A. (2017). Graphical Models for Categorical Data. SemStat Elements. Cambridge University Press.

[34] Russell, S. & Norvig, P. (2009). Artificial Intelligence: A Modern Approach. Prentice Hall Press.

[35] Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. & Nolan, G. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science 308 523–529.

[36] Scutari, M. (2016). An empirical-Bayes score for discrete Bayesian networks. In Conference on Probabilistic Graphical Models. 438–448.

[37] Scutari, M. (2018). Dirichlet Bayesian network scores and the maximum relative entropy principle. Behaviormetrika 45 337–362.

[38] Scutari, M. & Denis, J.-B. (2014). Bayesian Networks: With Examples in R. CRC Press.

[39] Shojaie, A. & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. J. Comput. Biol. 16 407–426.

[40] Sonntag, D., Peña, J. M. & Gómez-Olmedo, M. (2015). Approximate counting of graphical models via MCMC revisited. Int. J. Intell. Syst. 30 384–420.

[41] Spirtes, P., Glymour, C. & Scheines, R. (2000). Causation, Prediction and Search (2nd edition). Cambridge, MA: The MIT Press.

[42] Verma, T. & Pearl, J. (1991). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence.