A Birth and Death Process for Bayesian Network Structure Inference
D. Jennings and J. N. Corcoran
University of Colorado
July 2, 2018
Abstract
Bayesian networks (BNs) are graphical models that are useful for representing high-dimensional probability distributions. There has been a great deal of interest in recent years in the NP-hard problem of learning the structure of a BN from observed data. Typically, one assigns a score to various structures and the search becomes an optimization problem that can be approached with either deterministic or stochastic methods. In this paper, we walk through the space of graphs by modeling the appearance and disappearance of edges as a birth and death process and compare our novel approach to the popular Metropolis-Hastings search strategy. We give empirical evidence that the birth and death process has superior mixing properties.

Keywords: Bayesian networks, structure learning, birth and death processes
AMS Subject classification: 60J20, 60J75, 68T05
Bayesian networks (Pearl [13]) are convenient graphical expressions for high dimensional probability distributions representing complex relationships between a large number of random variables. A Bayesian network is a directed acyclic graph consisting of nodes, which represent random variables, and arrows, which correspond to probabilistic dependencies between them.

There has been a great deal of interest in recent years in the NP-hard problem of learning the structure (placement of directed edges) of Bayesian networks from data ([1], [2], [4], [5], [6], [8], [9], [11], [12]). Much of this has been driven by the study of genetic regulatory networks in molecular biology, due to advances in technology and, specifically, microarray techniques that allow scientists to rapidly measure expression levels of genes in cells. As an integral part of machine learning, Bayesian networks have also been used for pattern recognition, language processing including speech recognition, and credit risk analysis.

Structure learning typically involves defining a network score function and is then, in theory, a straightforward optimization problem. In practice, however, it is quite a different story, as the number of possible networks to be scored and compared grows super-exponentially with the number of nodes. Simple greedy hill climbing with random restarts is understandably inefficient yet surprisingly hard to beat. Many deterministic and stochastic alternatives have been proposed in recent years, such as gradient descent, genetic, tempering, and Metropolis-Hastings algorithms, along with different approaches to the task itself, including order space sampling and the scoring of graph "features" rather than graphs themselves. Several of these methods have offered some improvement over greedy hill climbing but can be difficult to implement. Deterministic methods tend to get stuck in local maxima, and probabilistic methods tend to suffer from slow mixing.

In this paper we consider a new stochastic algorithm in which the appearance and disappearance of edges are modeled as a birth and death process. We compare our algorithm with the popular Metropolis-Hastings algorithm and give empirical evidence that ours has better mixing properties.

Bayesian networks are graphical representations of the relationships between random variables from high dimensional probability distributions. A Bayesian network on $N$ nodes is a directed acyclic graph (DAG) where the nodes (vertices), labeled $1, 2, \ldots, N$, correspond to random variables $X_1, X_2, \ldots, X_N$, and directed edges correspond to probabilistic dependencies between them. We say that node $i$ is a parent of node $j$ and node $j$ is a child of node $i$ if there exists a directed edge from $i$ to $j$ (in which case we write $i \to j$), and we use the notation $pa_j$ to denote the collection of all parents of node $j$. We will refer to nodes and their associated random variables interchangeably; thus, $pa_j$ may also represent the collection of parent random variables for $X_j$. Rigorously, a Bayesian network consists of a DAG and a set of conditional densities $\{P(X_i \mid pa_i)\}_{i=1}^{N}$, along with the assumption that the joint density for the $N$ random variables can be written as the product
\[
P(X_1, X_2, \ldots, X_N) = \prod_{i=1}^{N} P(X_i \mid pa_i).
\]
In other words, each node is conditionally independent of its non-descendants given its parents.
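To make the factorization concrete, the following minimal sketch evaluates the joint density of a small three-node network from its conditional probability tables. The network, node labels, and table values are hypothetical, chosen only for illustration.

```python
# DAG: 1 -> 3 <- 2, with each node taking values 0/1 for brevity.
parents = {1: [], 2: [], 3: [1, 2]}

# Conditional probability tables: cpt[j][parent_values][x_j].
cpt = {
    1: {(): [0.6, 0.4]},
    2: {(): [0.5, 0.5]},
    3: {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6],
        (1, 0): [0.3, 0.7], (1, 1): [0.1, 0.9]},
}

def joint(x):
    """P(X_1 = x[1], X_2 = x[2], X_3 = x[3]) = prod_i P(X_i | pa_i)."""
    p = 1.0
    for j, pa in parents.items():
        p *= cpt[j][tuple(x[i] for i in pa)][x[j]]
    return p

print(joint({1: 1, 2: 0, 3: 1}))  # 0.4 * 0.5 * 0.7 = 0.14
```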
In the problem of structure inference, the DAG is not explicitly known. A set of observations $D = (D_1, D_2, \ldots, D_M)$ is given, where each $D_i$ is an $N$-tuple realization of $(X_1, X_2, \ldots, X_N)$. The goal is then to recover the "best" edge structure of the underlying Bayesian network, where "best" may be measured in many ways. For example, one may consider the best DAG to be the one that maximizes the posterior probability
\[
P(G \mid D) \propto P(D \mid G)\, P(G) \qquad (1)
\]
over $G$. Indeed, (1) is important for many measures of "best DAGs", and it is the goal of this paper to simulate DAGs efficiently from this distribution.

Throughout this paper, we will use the common assumption that the data $D$ come from a multinomial distribution, allowing us to analytically integrate out parameters to obtain a score which is proportional to $P(G \mid D)$.
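The paper does not restate its score formula at this point, but under the multinomial assumption with Dirichlet priors the per-node marginal likelihood has a standard closed form (see, e.g., Heckerman [6]). The sketch below computes such a log local score; the function name, the uniform Dirichlet prior, and the data layout are our assumptions, not the authors' specification.

```python
from collections import defaultdict

import numpy as np
from scipy.special import gammaln

def log_local_score(data, j, parents, arity, alpha=1.0):
    """Log marginal likelihood of node j's data column given its parent
    columns, under multinomial sampling with a Dirichlet(alpha,...,alpha)
    prior on each conditional distribution (the K2 score when alpha = 1).
    data: (M, N) integer array with entries in {0, ..., arity - 1}."""
    keys = ([tuple(row) for row in data[:, parents]] if parents
            else [()] * len(data))            # one key per parent config
    counts = defaultdict(lambda: np.zeros(arity))
    for key, x in zip(keys, data[:, j]):
        counts[key][x] += 1                   # N_{jkx} in BD-score notation
    log_score = 0.0
    for n in counts.values():                 # unseen configs contribute 0
        log_score += gammaln(arity * alpha) - gammaln(arity * alpha + n.sum())
        log_score += np.sum(gammaln(alpha + n) - gammaln(alpha))
    return log_score
```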
Jump Processes

Let $(\Omega, \mathcal{F})$ be a state space consisting of a non-empty set $\Omega$ and a sigma algebra $\mathcal{F}$ on $\Omega$. A jump process on $(\Omega, \mathcal{F})$ is a continuous time stochastic process characterized as follows. Assume the process is in some state $x \in \Omega$.

• The waiting time until the next jump follows an exponential distribution with rate (or intensity) $\lambda(x)$ and is independent of the past history.

• The probability that the jump lands the process in $F \in \mathcal{F}$ is given by a transition kernel $K(x, F)$.

It is known (e.g. [3], [14]) that there exists a $Q_t : \Omega \times \mathcal{F} \to \mathbb{R}^+$ such that $Q_t(x, F)$ is the probability that at time $t$ the process is in $F$ given that the process was in state $x$ at time $0$. Such $Q_t$ are defined as the solution to Kolmogorov's backward equation
\[
\frac{\partial}{\partial t} Q_t(x, F) = -\lambda(x)\, Q_t(x, F) + \lambda(x) \int_{\Omega} Q_t(y, F)\, K(x, dy).
\]
Furthermore, let $Q^{(n)}_t(x, F)$ be the probability of a transition from $x$ to $F$ using at most $n$ jumps. If $\lambda(x)$ is bounded, then
\[
Q^{(\infty)}_t(x, F) := \lim_{n \to \infty} Q^{(n)}_t(x, F)
\]
is the unique minimal solution to Kolmogorov's forward equation
\[
\frac{\partial}{\partial t} Q_t(x, F) = -\int_{F} \lambda(z)\, Q_t(x, dz) + \int_{\Omega} \lambda(z)\, K(z, F)\, Q_t(x, dz).
\]
(It is "minimal" in the sense that if $R_t(x, F)$ is any other nonnegative solution, then $R_t(x, F) \geq Q^{(\infty)}_t(x, F)$ for all $t \geq 0$, $x \in \Omega$, and $F \in \mathcal{F}$.)

For a distribution $\pi$ to be invariant under such a process, $\pi$ must satisfy the detailed balance conditions
\[
\pi(x)\, \lambda(x)\, K(x, dy)\, d\mu(x) = \pi(y)\, \lambda(y)\, K(y, dx)\, d\mu(y)
\]
with respect to some reference measure $\mu$.

Preston [14] extended this jump process to the trans-dimensional case, where jumps from states in $\Omega_n$ can move the process to a state one dimension higher, living in $\Omega_{n+1}$, with (birth) rate $\lambda_b(x)$, or to a state one dimension lower, living in $\Omega_{n-1}$, with (death) rate $\lambda_d(x)$. Associated with these birth and death rates are birth and death kernels, $K_b$ and $K_d$. The total jump rate and the transition kernel are then given by
\[
\lambda(x) = \lambda_b(x) + \lambda_d(x), \qquad
K(x, F) = \frac{\lambda_b(x)}{\lambda(x)} K_b(x, F) + \frac{\lambda_d(x)}{\lambda(x)} K_d(x, F).
\]
When births and deaths occur one point at a time, allowing a configuration $x$ with $n$ points to move to a configuration $x'$ with $n + 1$ points (or vice versa), the detailed balance conditions simplify ([14], [15]). In this case one needs the birth rates to balance the death rates with respect to $\pi$. That is, we require that
\[
\pi(x)\, b(x, x' \setminus x) = \pi(x')\, d(x', x' \setminus x),
\]
where $b(x, x' \setminus x)$ is the birth rate of the single point $x' \setminus x$ given that the current configuration of points is $x$, and $d(x', x' \setminus x)$ is the death rate of the single point $x' \setminus x$ given that the current configuration of points is $x'$. These relate to the total birth and death rates in that
\[
\lambda_b(x) = \sum b(x, x' \setminus x), \qquad \lambda_d(x') = \sum d(x', x' \setminus x),
\]
where the birth sum is taken over all states $x'$ that consist of configuration $x$ with the addition of a single point, and the death sum is taken over all states $x$ that consist of configuration $x'$ with a single point deleted.
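To connect the definitions to simulation, the following sketch draws a trajectory of a jump process on a toy finite state space from its rate function $\lambda$ and kernel $K$; the two-state example and its rates are arbitrary illustrative choices, not anything from the paper.

```python
import random

def simulate_jump_process(x0, rate, kernel, t_end):
    """Simulate a jump process: exponential holding time with rate
    rate(x), then a jump drawn from kernel(x), which maps a state to a
    dict {next_state: probability}.  Returns the visited
    (state, holding_time) pairs up to time t_end."""
    t, x, path = 0.0, x0, []
    while t < t_end:
        hold = random.expovariate(rate(x))    # waiting time ~ Exp(lambda(x))
        path.append((x, hold))
        t += hold
        probs = kernel(x)
        x = random.choices(list(probs), weights=list(probs.values()))[0]
    return path

# Toy two-state example with arbitrary rates (illustration only).
path = simulate_jump_process(
    "a",
    rate=lambda x: 1.0 if x == "a" else 2.0,
    kernel=lambda x: {"b": 1.0} if x == "a" else {"a": 1.0},
    t_end=10.0)
```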
To construct a jump process for BN structure inference, our goal is to construct a birth and death process acting on the edges of a BN which has invariant distribution $P(G \mid D)$. The relevant state space is $(\mathcal{G}, \mathbb{G})$, where $\mathcal{G}$ is the set of all DAGs with $N$ nodes and $\mathbb{G}$ is the power set of $\mathcal{G}$. We define the disjoint sets $\mathcal{G}_k$, $k = 0, \ldots, N(N-1)/2$, to be the sets of DAGs with exactly $k$ edges. Our jump process will then jump between the $\mathcal{G}_k$ for adjacent values of $k$.

For $G \in \mathcal{G}_k$, denote the graph with the addition of the edge from node $i$ to node $j$ by $(G \cup \{i \to j\}) \in \mathcal{G}_{k+1}$, and the graph with the removal of the edge from $i$ to $j$ by $(G \setminus \{i \to j\}) \in \mathcal{G}_{k-1}$. Detailed balance then requires that, for every edge $i \to j$ that is a valid (non-cycle-causing) addition,
\[
P(G \mid D)\, b(G, \{i \to j\}) = P(G \cup \{i \to j\} \mid D)\, d(G \cup \{i \to j\}, \{i \to j\}).
\]
It is convenient to let $d(G \cup \{i \to j\}, \{i \to j\}) = 1$, so that
\[
b(G, \{i \to j\}) = \frac{P(G \cup \{i \to j\} \mid D)}{P(G \mid D)} = \frac{P(D \mid G \cup \{i \to j\})\, P(G \cup \{i \to j\})}{P(D \mid G)\, P(G)}.
\]
If we let $\Delta_j$ denote the $M$-dimensional vector of observations of $X_j$ in the data set $D$, this birth rate may be rewritten as
\[
b(G, \{i \to j\}) = \frac{P(\Delta_j \mid \Delta_{pa'_j}, G \cup \{i \to j\})\, P(G \cup \{i \to j\})}{P(\Delta_j \mid \Delta_{pa_j}, G)\, P(G)},
\]
where $\Delta_{pa_j}$ is the $M \times k$ matrix of data points for the $k$ parents of node $j$ in $G$ ($\Delta_{pa_j} = \emptyset$ if $k = 0$) and $\Delta_{pa'_j}$ is the $M \times (k+1)$ matrix of data points for the $k+1$ parents of node $j$ in $G \cup \{i \to j\}$. The transition rates are then given by
\[
\lambda_b(G) = \sum_{\text{valid } i \to j} b(G, \{i \to j\}), \qquad
\lambda_d(G) = \sum_{\{i \to j\} \in G} 1,
\]
so that $\lambda_d(G)$ is simply the number of edges in $G$, each edge dying at rate $1$.
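Checking whether an edge addition is valid (non-cycle-causing) amounts to a reachability test: adding $i \to j$ creates a cycle exactly when $j$ already reaches $i$ by a directed path. A minimal sketch, with our own helper names, follows.

```python
def creates_cycle(adj, i, j):
    """Return True if adding edge i -> j to the DAG would create a
    cycle, i.e. if j already reaches i by a directed path.
    adj maps each node to the set of its children."""
    stack, seen = [j], set()
    while stack:
        v = stack.pop()
        if v == i:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(adj.get(v, ()))
    return False

# Example: with edges 1 -> 2 -> 3, adding 3 -> 1 would close a cycle.
adj = {1: {2}, 2: {3}, 3: set()}
assert creates_cycle(adj, 3, 1) and not creates_cycle(adj, 1, 3)
```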
With this, we can easily simulate from the process in the following way.

1. Start with an arbitrary initial DAG, $G$.

2. Compute the birth rates $b(G, \{i \to j\})$ for all possible valid edge additions $i \to j$ to $G$, and compute
\[
\lambda_b(G) = \sum_{\text{valid } i \to j} b(G, \{i \to j\}) \qquad \text{and} \qquad \lambda_d(G) = \sum_{\{i \to j\} \in G} 1.
\]
3. With probability $\lambda_d(G) / (\lambda_b(G) + \lambda_d(G))$, remove a randomly selected existing edge. Otherwise, add valid edge $i \to j$ with probability $b(G, \{i \to j\}) / \lambda_b(G)$.

4. Return to step 2.

At first glance, it may seem that such an algorithm would be computationally expensive, as each birth rate depends on computing the scores of two different graphs. However, if we assume a modular score, the computation at each step is manageable. A modular score means we have
\[
P(D \mid G) = \prod_{n=1}^{N} P(\Delta_n \mid \Delta_{pa_n}, G),
\]
which leads to a birth rate of
\[
b(G, \{i \to j\}) = \frac{P(G \cup \{i \to j\}) \prod_{n=1}^{N} P(\Delta_n \mid \Delta_{pa'_n}, G \cup \{i \to j\})}{P(G) \prod_{n=1}^{N} P(\Delta_n \mid \Delta_{pa_n}, G)} = \frac{P(G \cup \{i \to j\})\, P(\Delta_j \mid \Delta_{pa'_j}, G \cup \{i \to j\})}{P(G)\, P(\Delta_j \mid \Delta_{pa_j}, G)}.
\]
So, for each birth rate, we only need the ratio of the altered score to the current score at a single node, the one corresponding to the child end of the proposed edge. A runnable sketch of one jump of this basic sampler is given below.
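To make steps 1-4 concrete, here is a minimal sketch of a single jump of the basic (non-caching) sampler. It reuses the hypothetical creates_cycle and log_local_score helpers sketched earlier, assumes a uniform prior $P(G)$ so that the prior ratio cancels, and illustrates the construction rather than reproducing the authors' implementation.

```python
import math
import random

def edge_birth_death_step(data, adj, arity):
    """One jump of the basic edge birth and death sampler (steps 2-3).
    adj maps every node (a column index of data) to the set of its
    children; adj is mutated in place, and the exponential holding time
    spent in the outgoing graph is returned."""
    nodes = list(adj)
    births, rates = [], []
    for i in nodes:
        for j in nodes:
            if i == j or j in adj[i] or creates_cycle(adj, i, j):
                continue                      # not a valid addition
            pa = [u for u in nodes if j in adj[u]]   # current parents of j
            log_b = (log_local_score(data, j, sorted(pa + [i]), arity)
                     - log_local_score(data, j, pa, arity))
            births.append((i, j))             # birth rate b(G, {i -> j}) is
            rates.append(math.exp(log_b))     # a ratio of local scores
    lam_b, lam_d = sum(rates), sum(len(c) for c in adj.values())
    hold = random.expovariate(lam_b + lam_d)  # waiting time ~ Exp(lambda(G))
    if random.random() < lam_d / (lam_b + lam_d):
        i, j = random.choice([(u, v) for u in adj for v in adj[u]])
        adj[i].remove(j)                      # death: drop a uniform edge
    else:
        i, j = random.choices(births, weights=rates)[0]
        adj[i].add(j)                         # birth: add i -> j w.p. b/lam_b
    return hold
```

Recording a copy of the graph alongside each returned holding time yields the (graph, holding time) pairs used later for time-weighted estimates of graph features.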
Further computational relief comes from the fact that most of the birth rates can be stored from previous steps. After adding or removing an edge, the only birth rates that need to be recalculated are those for edges pointing to nodes whose parent set has been altered, edges which were not previously valid, and edges which are no longer valid. These realizations allow for a more complete and efficient algorithm.

1. Start with an initial DAG, $G$.

2. For all possible valid edge additions $i \to j$, calculate the birth rate $b(G, \{i \to j\})$.
3. With probability $\lambda_d(G) / (\lambda_b(G) + \lambda_d(G))$, remove a randomly selected existing edge. Otherwise, add a valid edge $k \to \ell$ with probability $b(G, \{k \to \ell\}) / \lambda_b(G)$.
4. If any of the following are true, update the birth rate $b(G, \{i \to j\})$:

• $j = \ell$;

• edge $i \to j$ was not a valid addition before the addition or removal of edge $k \to \ell$ but is now valid;

• edge $i \to j$ was a valid addition before the addition or removal of edge $k \to \ell$ but is no longer valid.

5. Return to step 3.

While the theory guarantees that our edge birth and death process will have the correct stationary distribution, in this section we investigate its mixing and whether or not Monte Carlo simulation of graphs from $P(G \mid D)$ using our method is feasible in practice.

We first consider a simple 4 node graph so that we may compare our results with exact computations. We generated 50 observations of $X_1, X_2, X_3, X_4$ as related by the DAG in Figure 1. Each node was allowed to take on one of four possible values, each with equal probability. We then ran both the standard Metropolis-Hastings chain for BNs ([10]) and our edge birth and death algorithm. Figure 2 shows two independent runs of each algorithm. While both algorithms tend to find high probability regions of the state space, our edge birth and death algorithm explores much more of the state space.

Figure 1: 4 node graph used for testing.

Figure 2: (a) Paths of two (red and blue) independent Metropolis-Hastings chains of 20000 steps. (b) Paths of two independent edge birth and death process runs of 1500 jumps. The circles represent each of the 543 possible DAGs, with size proportional to the posterior probability of the graph.

One benefit of Monte Carlo methods is that they allow us to easily calculate the probabilities of specific graph features, e.g. edge probabilities. With the edge birth and death process, the probability of an edge corresponds to the proportion of time that the sampled graphs contain that edge. With the same 4 node graph as before, we calculated the edge probabilities and compared them to the true edge probabilities, both for $m = 100$ observations and $m = 500$ observations. The results are shown in Table 1.

Node    1     2     3     4
1       –    .06   .02   .05
2      .06    –    .00   .01
3      .02   .04    –    .01
4      .04   .04   .00    –

Node    1     2     3     4
1       –    .03   .02   .00
2      .03    –    .00   .00
3      .02   .00    –    .00
4      .00   .00   .00    –

Table 1: Magnitudes of errors between estimated and exact probabilities of edges in a 4 node Bayesian network from one run of the edge birth and death algorithm, for $m = 100$ observations (first array) and $m = 500$ observations (second array).

Next, we tested our edge birth and death algorithm on the "Alarm data set" compiled by Herskovits ([7]). This data set, often used as a benchmark for structure learning algorithms, consists of 1000 observations of 37 random variables, each taking on 4 possible values. We ran our edge birth and death algorithm and, for comparison, ran the standard Metropolis-Hastings scheme for a number of steps that took approximately the same amount of CPU time. Akaike's information criterion (AIC) was recorded at each step and is plotted for one of these runs in Figure 3. As is well known, the MH scheme is prone to getting trapped in local minima and takes many steps to escape these points, whereas our edge birth and death algorithm appears to mix more easily.

Figure 3: (a) AIC score of a Metropolis-Hastings run. (b) AIC score of an edge birth and death process run.
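A practical note on the edge probabilities in Table 1: because the sampler is a continuous-time process, each visited graph should be weighted by its exponential holding time. A minimal sketch of that estimator, using the hypothetical (graph, holding time) path representation from the earlier sketches:

```python
def edge_probability(path, i, j):
    """Estimate P(i -> j in G | D) from a simulated path, given as
    (adjacency, holding_time) pairs: the fraction of total time during
    which the graph contained the edge i -> j."""
    total = sum(hold for _, hold in path)
    with_edge = sum(hold for adj, hold in path if j in adj.get(i, ()))
    return with_edge / total

# e.g. edge_probability(path, 1, 3) estimates the posterior
# probability of the edge 1 -> 3.
```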
We have presented a new algorithm for sampling from the posterior distribution of Bayesian networks given data, based on a birth and death process. This new edge birth and death algorithm allows probabilistic inferences to be made about Bayesian network structures while avoiding some of the downfalls of existing methods. In particular, our edge birth and death process does not get trapped in local extrema as easily as structure MCMC, and it allows for less restrictive choices of priors over graphs compared to order MCMC.

An open question related to this new algorithm is whether it can be made perfect, in the sense of exact sampling from the stationary distribution, by applying constructions similar to those used in perfect sampling algorithms for spatial point processes.

References

[1] T. Chen, H.L. He, and G.M. Church. Modeling gene expression with differential equations. In Pacific Symposium on Biocomputing '99, pages 29-40, 1999.

[2] A. Dobra, C. Hans, B. Jones, J.R. Nevins, G. Yao, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196-212, 2004.

[3] W. Feller. An Introduction to Probability Theory and Its Applications, Volume I. John Wiley & Sons, London-New York-Sydney-Toronto, 1968.

[4] N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, 50:95-126, 2003.

[5] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601-620, 2000.

[6] D. Heckerman. A tutorial on learning with Bayesian networks. Technical report, Microsoft Corporation, 1995.

[7] E. Herskovits. Computer-based probabilistic network construction. PhD thesis, Medical Information Sciences, Stanford University, 1991.

[8] D. Husmeier. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19(17):2271-2282, 2003.

[9] S. Imoto, S. Kim, T. Goto, S. Miyano, S. Aburatani, K. Tashiro, and S. Kuhara. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology, 1:231-252, 2003.

[10] D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215-232, 1995.

[11] E.D. Jarvis, V.A. Smith, K. Wada, M.V. Rivas, M. McElroy, T.V. Smulders, P. Carninci, Y. Hayashizaki, F. Dietrich, X. Wu, P. McConnell, J. Yu, P.P. Wang, A.J. Hartemink, and S. Lin. A framework for integrating the songbird brain. Journal of Comparative Physiology A, 188:961-980, 2002.

[12] I.M. Ong, J.D. Glasner, and D. Page. Modeling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18:241-248, 2002.

[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.

[14] C. Preston. Spatial birth and death processes. Advances in Applied Probability, 7(3):465-466, 1975.

[15] B.D. Ripley. Modelling spatial patterns. Journal of the Royal Statistical Society, Series B, 39(2):172-212, 1977.