Learning Bayesian Networks from Incomplete Databases
Marco Ramoni
Knowledge Media Institute, The Open University
Abstract
Bayesian approaches to learning the graphical structure of Bayesian Belief Networks (BBNs) from databases share the assumption that the database is complete, that is, that no entry is reported as unknown. Attempts to relax this assumption involve the use of expensive iterative methods to discriminate among different structures. This paper introduces a deterministic method to learn the graphical structure of a BBN from a possibly incomplete database. Experimental evaluations show a significant robustness of this method and a remarkable independence of its execution time from the number of missing data.
1 INTRODUCTION

A
Bayesian Belief Network (BBN) (Pearl, 1988) is a directed acyclic graph where nodes represent stochastic variables and arcs represent conditional dependencies among these variables. A conditional dependency links a child variable to a set of parent variables and is defined by the conditional distributions of the child variable given the configurations of its parent variables. Although in their original conception BBNs were designed to rely on human experts to provide the graphical structure and assess the conditional probabilities, during the past few years an increasing number of efforts has been directed toward the development of methods able to construct BBNs directly from databases. Early results in this quest were based on non-Bayesian approaches (Spirtes et al., 1993), but a seminal paper by Cooper and Herskovits (1992) gave rise to a stream of research within a Bayesian framework (Buntine, 1994; Heckerman et al., 1995). In this approach, the learning process involves two main tasks: the induction of the graphical model best fitting the
Paola Sebastiani
Department of Actuarial Science and Statistics, City University

database and the extraction of the conditional probabilities defining the dependencies in the graphical model. Methods to perform the first task, known as model selection, typically involve two components: a search procedure to explore the space of possible graphical models and a scoring metric to assess the goodness-of-fit of a particular model. Current approaches exploit heuristics to reduce the search space and use the scoring metric to drive the search process. Although the task of extracting a
BBN from a database is known to be NP-hard in the general case (Chickering and Heckerman, 1996), under certain assumptions these methods are able to extract quite large BBNs from databases of thousands of cases. One of these assumptions is that the database is complete, that is, that no entry in the database is reported as unknown. The reason for this assumption is that a key step in the Bayesian learning process is the computation of the marginal likelihood of the database given a graphical model. This computation can be performed efficiently using exact Bayesian updating when the database is complete, but it becomes intractable when data are missing. Therefore, methods to approximate the marginal likelihood of the data have to be used. Current approaches (Chickering and Heckerman, 1996) exploit the EM algorithm (Dempster et al., 1977) or Markov Chain Monte Carlo methods, such as Gibbs Sampling (Geman and Geman, 1984). The basic strategy underlying these methods is based on the Missing Information Principle (Little and Rubin, 1987): fill in the missing observations on the basis of the available information. EM performs this task by replacing the missing entries with the maximum likelihood estimates extracted from the available data, and proceeds by iteratively estimating and replacing until stability is reached within a certain threshold. Gibbs Sampling generates a value for the missing data from some conditional distributions and provides a stochastic estimation of the posterior probabilities. Unfortunately, these methods are usually highly resource demanding, their convergence rate may be slow, and their execution time heavily depends on the number of missing data. Ramoni and Sebastiani (1997a) introduced a deterministic method to estimate the conditional probabilities defining the dependencies in a BBN which does not rely on the Missing Information Principle. This method, called
Bound and Collapse (BC), starts by bounding the set of possible estimates consistent with the available observations in the database, and then collapses the resulting interval to a point via a convex combination of the extreme estimates, with weights depending on the assumed pattern of missing data. The intuition behind BC is that the information available in the database induces a set of possible estimates, and that the pattern of missing data can be used to select a single distribution within this set. The pattern of missing data may either be provided by an external source of information or be estimated from the available information under the assumption that data are missing at random. Experimental evaluations (Ramoni and Sebastiani, 1997b) clearly show that the estimates provided by BC are very similar to those provided by Gibbs Sampling when data are missing at random, and are more robust to departures from the true pattern of missing data. Moreover, BC reduces the cost of estimating a conditional distribution to the cost of an exact Bayesian updating and a convex combination for each state of the distribution. This paper describes how BC can be used to estimate the marginal likelihood of a database given a model, thus extending the principle underlying BC from the task of learning the conditional probabilities to the task of extracting the graphical model of a BBN from an incomplete database. The remainder of this paper is structured as follows: Section 2 introduces the technical background,
Section 3 describes the new method, Section 4 reports some results of a preliminary experimental evaluation, and Section 5 summarizes the relevant results.

2 BACKGROUND

A BBN is defined by a set of variables X = {X_1, ..., X_I} and a directed acyclic graph identifying a model M of conditional dependencies among these variables. A conditional dependency links a child variable X_i to a set of parent variables Π_i, and is defined by the conditional distributions of the child variable given the configurations of its parent variables. We shall consider discrete variables only, and denote by c_i the number of states of X_i, and by q_i the number of states of Π_i.
The model M yields a factorization of the joint probability of a particular set of values of the variables in X, say x_k = {x_1k, ..., x_Ik}, as

    p(X = x_k | M) = ∏_{i=1}^{I} p(X_i = x_ik | Π_i = π_ij)    (1)

where π_ij is the state of Π_i in x_k. We will denote X_i = x_ik by x_ik, and Π_i = π_ij by π_ij.
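In code, the factorization (1) is just a product over the families of the graph. The sketch below makes the computation explicit; the data structures (dictionaries for cases, parent sets, and conditional probability tables) are our own illustrative choice, not notation from the paper.

```python
def joint_prob(case, parents, cpts):
    """Evaluate factorization (1): p(X = x_k | M) = prod_i p(x_ik | pi_ij).

    case:    {variable: observed state}
    parents: {variable: tuple of parent variables}
    cpts:    {(variable, parent_states): list of probabilities over the
              variable's states} -- an illustrative encoding of the
              conditional distributions defining the dependencies.
    """
    p = 1.0
    for var, val in case.items():
        cfg = tuple(case[u] for u in parents[var])  # pi_ij: parent state in x_k
        p *= cpts[(var, cfg)][val]                  # p(x_ik | pi_ij)
    return p
```

For a two-variable chain X1 → X2, for instance, `joint_prob` multiplies p(X1) by p(X2 | X1) for the states appearing in the case.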
Suppose we are given a database of n cases D = {x_1, ..., x_n} from which we wish to select a model M of conditional dependencies among the variables in the database. We adopt a Bayesian approach, so that if p(M) is our prior belief about a particular model M, we can use the information in the database D to compute the posterior probability of M given the data:

    p(M | D) = p(M, D) / p(D),

and then we choose the model with the highest posterior probability. When the comparison is between two rival models M_1 and M_2 with p(M_1) = p(M_2), this is equivalent to choosing M_1 if the Bayes factor

    p(D | M_1) / p(D | M_2) = p(M_1, D) / p(M_2, D)

is greater than one. It is well known (Cooper and Herskovits, 1992) that p(M, D) can be easily computed if the conditional probabilities defining M are regarded as random variables θ_ijk whose prior distribution represents the observer's beliefs before seeing any data. The joint probability of a case x_k can then be written in terms of the random vector θ = {θ_ijk} as p(x_k | θ) = ∏_{i=1}^{I} θ_ijk.
This parameterization of the probabilities defining M allows us to write:

    p(M, D) = p(M) ∫ p(θ | M) p(D | θ) dθ    (2)

where p(θ | M) is the prior density of θ and p(D | θ) is the sampling model. A solution of (2) exists in closed form if:

1. The database is complete;
2. The cases are independent, given the parameter vector θ associated to M;
3. The prior distribution of the parameters is conjugate to the sampling model p(D | θ);
4. The parameters are marginally independent.
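When several models are compared, this computation is usually carried out on log marginal likelihoods. The sketch below (function and argument names are ours) normalizes log p(D | M) values into posterior model probabilities under a uniform prior, with the standard log-sum-exp stabilization.

```python
import math

def model_posteriors(log_marginals, log_priors=None):
    """Posterior probabilities p(M | D) from the values log p(D | M).

    A uniform prior over models is assumed unless log_priors is given.
    The maximum is subtracted before exponentiating so that very negative
    log marginal likelihoods do not underflow.
    """
    if log_priors is None:
        log_priors = [0.0] * len(log_marginals)
    logs = [lm + lp for lm, lp in zip(log_marginals, log_priors)]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]
```

The ratio of any two returned probabilities is the corresponding Bayes factor (times the prior odds, when priors are not uniform).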
Let n(x_ik | π_ij), i = 1, ..., I, j = 1, ..., q_i, k = 1, ..., c_i, be the frequency of cases in the database with x_ik | π_ij, so that n(π_ij) = Σ_{k=1}^{c_i} n(x_ik | π_ij) is the frequency of cases with π_ij. Assumptions 1 and 2 lead to

    p(D | θ) = ∏_{i=1}^{I} ∏_{j=1}^{q_i} ∏_{k=1}^{c_i} θ_ijk^{n(x_ik | π_ij)}.

A prior distribution on the parameters that satisfies assumptions 3 and 4 is a product of Dirichlet distributions. Thus, if we denote by
θ_ij = (θ_ij1, ..., θ_ijc_i) the vector of parameters associated to the conditional distribution of X_i | π_ij, we have θ_ij ~ D(α_ij1, ..., α_ijc_i). The prior hyper-parameters α_ijk can be regarded as frequencies of the imaginary cases needed to formulate the prior distribution. As a matter of fact, the marginal probability of x_ik | π_ij is α_ijk / α_ij, and α_ij = Σ_{k=1}^{c_i} α_ijk is the prior precision on θ_ij. Under assumptions 1-4, the posterior distribution of θ is still a product of Dirichlet distributions (Spiegelhalter and Lauritzen, 1990), and

    θ_ij | D ~ D(α_ij1 + n(x_i1 | π_ij), ..., α_ijc_i + n(x_ic_i | π_ij)).

Thus, the standard Bayesian estimate of p(x_ik | π_ij) is the posterior expectation of θ_ijk:

    E(θ_ijk | D) = (α_ijk + n(x_ik | π_ij)) / (α_ij + n(π_ij)),    (3)

and the posterior precision on θ_ij is α_ij + n(π_ij). Furthermore, the integral (2) has the solution

    p(D | M) = ∏_{i=1}^{I} ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + n(π_ij))] ∏_{k=1}^{c_i} [Γ(α_ijk + n(x_ik | π_ij)) / Γ(α_ijk)],    (4)

which is therefore the marginal likelihood of D given M. Note that p(D | M) depends on the updated hyper-parameters of θ_ij | D and on the posterior precision on θ_ij.
The probability (4) is the basis for the algorithm proposed by Cooper and Herskovits (1992) to induce the model from a database. Suppose we have a partial order on the variables, so that X_i ≺ X_j if X_j cannot be a parent of X_i. Let P_i be the set of current parents of X_i, so that P_i is the empty set if X_i is a root node. Then the local contribution of a node X_i and its parents Π_i to the joint probability of (M, D) is measured by

    g(X_i, P_i) = ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + n(π_ij))] ∏_{k=1}^{c_i} [Γ(α_ijk + n(x_ik | π_ij)) / Γ(α_ijk)].    (5)

The algorithm proceeds by adding one parent at a time and computing g(X_i, P_i).
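A minimal sketch of this greedy scheme, in the spirit of the K2 algorithm of Cooper and Herskovits, is given below. The data encoding, the uniform hyper-parameters α_ijk = 1, and all names are assumptions of this sketch; log-Gamma is used so that (5) can be evaluated stably in log space.

```python
import math
from itertools import product

def log_g(data, child, parents, states, alpha=1.0):
    """Log of the local score g(X_i, P_i) in (5), with uniform
    hyper-parameters a_ijk = alpha (an assumption of this sketch).
    data: list of {variable: state} rows; states: {variable: state count}."""
    c_i = states[child]
    counts = {}  # parent configuration -> counts over the child's states
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts.setdefault(cfg, [0] * c_i)[row[child]] += 1
    score = 0.0
    for cfg in product(*(range(states[p]) for p in parents)):
        n_jk = counts.get(cfg, [0] * c_i)
        a_ij = alpha * c_i
        score += math.lgamma(a_ij) - math.lgamma(a_ij + sum(n_jk))
        for n in n_jk:
            score += math.lgamma(alpha + n) - math.lgamma(alpha)
    return score

def k2_search(data, order, states):
    """Greedy parent selection: for each node, repeatedly add the
    predecessor in the order that most increases log g, and stop when
    no addition improves the score."""
    parents = {x: [] for x in order}
    for i, child in enumerate(order):
        current = log_g(data, child, parents[child], states)
        candidates = set(order[:i])
        while candidates:
            best, best_score = None, current
            for p in candidates:
                s = log_g(data, child, parents[child] + [p], states)
                if s > best_score:
                    best, best_score = p, s
            if best is None:
                break
            parents[child].append(best)
            candidates.remove(best)
            current = best_score
    return parents
```

On data with a strong dependency between two variables, the search adds the arc; on independent variables, the penalty implicit in (5) leaves the parent set empty.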
The set P_i is expanded to include the parent nodes that give the largest contribution to g(X_i, P_i), and the procedure stops when the probability no longer increases. This greedy search strategy has been shown to be extremely cost-effective when the number of variables is large. When the database is complete, (4) can be efficiently computed using the hyper-parameters
α_ijk + n(x_ik | π_ij) and the precision α_ij + n(π_ij) of the posterior distribution of θ_ij. Suppose now that we are given the incomplete database D_I = D_o ∪ D_m, where D_m denotes the part of D_I with missing entries. The exact probability of (M, D_I) is

    p(M, D_I) = Σ_c p(M, D_I, D_c) = Σ_c p(D_I) p(M, D_c | D_I),

where the sum is over all possible complete databases D_c consistent with the available data. Clearly, as the number of missing entries increases, the exact calculation of p(M, D_I) becomes infeasible, and some approximation is needed.

3 METHOD
In this section we show that it is possible to approximate the hyper-parameters of the posterior distributions of θ_ij, from which we derive an estimate of the marginal likelihood given in (4). Let p̂_ijk be an estimate of the posterior expectation of θ_ijk, and α̂_ij an estimate of the posterior precision of θ_ij. Then the distribution D(α̂_ij p̂_ij1, ..., α̂_ij p̂_ijc_i) will have precision α̂_ij and expectations p̂_ijk, k = 1, ..., c_i. Thus a moment-matching approximation of the posterior distribution of θ_ij is:

    θ_ij | D_I ~ D(α̂_ij p̂_ij1, ..., α̂_ij p̂_ijc_i).    (6)

From (6) we can then derive an estimate of (4):

    p̂(D_I | M) = ∏_{i=1}^{I} ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α̂_ij)] ∏_{k=1}^{c_i} [Γ(α̂_ij p̂_ijk) / Γ(α_ijk)],    (7)

which can also be used to extend the algorithm of Cooper and Herskovits (1992) to incomplete databases by estimating (5) as:

    ĝ(X_i, P_i) = ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α̂_ij)] ∏_{k=1}^{c_i} [Γ(α̂_ij p̂_ijk) / Γ(α_ijk)].    (8)

Clearly, the goodness of the approximation depends on the goodness of the estimates p̂_ijk and α̂_ij.
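The estimate (7) is a direct plug-in of the moment-matched hyper-parameters into the complete-data formula (4). A sketch for a single family follows; the argument layout is our assumption.

```python
import math

def approx_log_marginal(alpha, alpha_hat, p_hat):
    """Log of estimate (7) for one family: the posterior hyper-parameters
    a_ijk + n(x_ik|pi_ij) of (4) are replaced by a_hat_ij * p_hat_ijk, and
    the posterior precision a_ij + n(pi_ij) by a_hat_ij.

    alpha[j][k]:  prior hyper-parameters a_ijk
    alpha_hat[j]: estimated posterior precision of theta_ij
    p_hat[j][k]:  estimated posterior expectations of theta_ijk
    """
    logp = 0.0
    for j in range(len(alpha)):
        a_ij = sum(alpha[j])
        logp += math.lgamma(a_ij) - math.lgamma(alpha_hat[j])
        for k in range(len(alpha[j])):
            logp += math.lgamma(alpha_hat[j] * p_hat[j][k]) - math.lgamma(alpha[j][k])
    return logp
```

As a sanity check, when the database is complete, α̂_ij = α_ij + n(π_ij) and the p̂_ijk are the exact posterior expectations, so (7) coincides with (4).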
In the remainder of this section we show how to use the BC method to estimate the posterior expectation of θ_ijk and the posterior precision of θ_ij.

3.1 POSTERIOR EXPECTATION
Let M be a model of conditional dependencies, specifying for each X_i the parent variables Π_i. BC estimates the conditional probabilities defining the dependencies in M by first bounding the set of possible posterior distributions of θ_ij consistent with the database, and then collapsing the extreme distributions into a single Dirichlet using the assumed pattern of missing data. Let n•(x_ik | π_ij) be the frequency of cases with X_i = x_ik, given the parent configuration π_ij, which could be obtained by completing the incomplete cases. A case may be incomplete because of either a missing observation in the parent configuration or a missing observation of the child variable. An example is given in Figure 1 for a model in which X_i is binary, i = 1, 2, 3.

[Figure 1: Completions n•(x_3k | x_1, x_2) consistent with the incomplete database.]

For each incomplete case, let φ_ijk be the probability of the completion x_ik | π_ij.    (9)

When data are missing at random, and therefore D_o is a representative sample of the complete but unknown database D, the probability of a completion can be estimated from the available data. In this case, the BC estimate p̂(x_ik | π_ij, D_I, φ_ijk) of E(θ_ijk | D_I) becomes

    p̂(x_ik | π_ij, D_I, φ_ijk) = Σ_{l≠k} φ_ijl p_l•(x_ik | π_ij, D_I) + φ_ijk p•(x_ik | π_ij, D_I),    (10)

where

    p_l•(x_ik | π_ij, D_I) = (α_ijk + n(x_ik | π_ij)) / (α_ij + Σ_h n(x_ih | π_ij) + n•(x_il | π_ij)).

The value p•(x_ik | π_ij, D_I) is the upper bound of p(x_ik | π_ij, D_I), which is achieved when all incomplete cases in the database that could be completed as x_ik | π_ij are assigned to x_ik | π_ij, and the other incomplete cases are assigned to x_ih | π_il, for any h and l ≠ j. Thus, each maximum probability p•(x_ik | π_ij, D_I) is obtained from a Dirichlet distribution

    D_k(α_ij1 + n(x_i1 | π_ij), ..., α_ijk + n(x_ik | π_ij) + n•(x_ik | π_ij), ..., α_ijc_i + n(x_ic_i | π_ij)),

which identifies a unique probability p_k•(x_il | π_ij, D_I) for each of the other states of the variable X_i, and from which p_l•(x_ik | π_ij, D_I) is obtained. The estimates p̂(x_ik | π_ij, D_I, φ_ijk), k = 1, ..., c_i, so found define a probability distribution, since Σ_{k=1}^{c_i} p̂(x_ik | π_ij, D_I, φ_ijk) = 1. As the number of missing entries in D_I decreases, p•(x_ik | π_ij, D_I) and p_l•(x_ik | π_ij, D_I) approach (α_ijk + n(x_ik | π_ij)) / (α_ij + n(π_ij)), so that, when the database is complete, (10) returns the exact estimate E(θ_ijk | D_I).
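The bound and collapse steps can be sketched concretely for the special case in which the incomplete cases at a given parent configuration are missing only on the child variable; the function and argument names are our own.

```python
def bound_and_collapse(alpha, n, m, phi):
    """BC estimate of one conditional distribution p(X_i | pi_ij) when the
    m incomplete cases are missing only on the child variable.

    alpha: prior hyper-parameters a_ijk
    n:     observed counts n(x_ik | pi_ij)
    m:     number of incomplete cases at this parent configuration
    phi:   assumed completion probabilities (summing to one)
    Returns (lower bounds, upper bounds, collapsed point estimate).
    """
    a, N, c = sum(alpha), sum(n), len(n)
    # Bound: the upper probability of state k assigns all m completions
    # to k; the lower one assigns them all to some other state.
    upper = [(alpha[k] + n[k] + m) / (a + N + m) for k in range(c)]
    lower = [(alpha[k] + n[k]) / (a + N + m) for k in range(c)]
    # Collapse: convex combination of the extremes, weighted by the
    # assumed pattern of missing data phi.
    point = [(alpha[k] + n[k] + phi[k] * m) / (a + N + m) for k in range(c)]
    return lower, upper, point
```

With m = 0 the interval collapses and the point estimate reduces to the exact posterior expectation (3); the width of the interval measures the information lost to the missing entries.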
As the number of missing entries increases, both p̂_ijk and the estimate (10) approach the prior probability α_ijk / α_ij, so that the estimation method is coherent and no updating is performed when data are totally missing. If n•(x_ik | π_ij) = n_ij, as for instance when data are missing only on the child variable, (10) simplifies to

    p̂(x_ik | π_ij, D_I, φ_ijk) = (α_ijk + n(x_ik | π_ij) + φ_ijk n_ij) / (α_ij + Σ_h n(x_ih | π_ij) + n_ij),    (11)

which is a consistent estimate of the expected posterior expectation. If α_ijk = 0, then (11) is the classical maximum likelihood estimate of θ_ijk (Little and Rubin, 1987). Experimental comparisons (Ramoni and Sebastiani, 1997b) have shown that, when data are missing at random, the estimates computed by the BC method are very close to those obtained by Gibbs Sampling, and are more robust to departures from the true pattern of missing data. Although BC is able to incorporate the assumption that data are missing at random, in the general case it is not limited to it, since the parameters φ_ijk may be used to encode any pattern of missing data. For instance, when no information on the mechanism generating the missing data is available, and therefore any pattern is equally likely, φ_ijk = 1 / c_i. Furthermore, BC provides a new measure of the information available in the database: the extreme probabilities p_l•(x_ik | π_ij, D_I), l = 1, ..., c_i, lead to a lower bound of p(x_ik | π_ij, D_I), namely p_•(x_ik | π_ij, D_I) = min_l {p_l•(x_ik | π_ij, D_I)}, and therefore the interval [p_•(x_ik | π_ij, D_I), p•(x_ik | π_ij, D_I)] contains all posterior estimates of θ_ijk that would be obtained from the possible completions of the database,
thus providing a measure of the quality of the information conveyed by D_I about θ_ijk (Ramoni and Sebastiani, 1997a).

Table 1: Generating structures used in the experimental evaluations. The number next to each variable reports the number of states.

    Generating Structure           Variables                        Cases
    M1: X1 → X2 → X3               X1(2) X2(2) X3(2)                1000
    M2: X1 → X2 → X3               X1(2) X2(2) X3(3)                1000
    M3: X5 ← X3 → X4, X1 → X2      X1(2) X2(2) X3(3) X4(2) X5(2)    5000
    M4: X5 ← X3 → X4, X1 → X2      X1(3) X2(3) X3(3) X4(3) X5(4)    10000

3.2 POSTERIOR PRECISION
The value in (10) is an estimate of E(θ_ijk | D_I). We now derive an estimate of the posterior precision of θ_ij. Suppose we have n(π_ij) cases completely observed on Π_i, so that n − Σ_j n(π_ij) is the number of cases partially observed on the parent variable Π_i. Let δ_i = (δ_i1, ..., δ_iq_i) be the parameters associated to the joint probability distribution of Π_i, and let D(β_i1, ..., β_iq_i) be the prior distribution, so that β_i = Σ_j β_ij is the prior precision. If we knew the probability distribution of Π_i, we could distribute the incomplete cases across the states of Π_i, so that the expected precision of the posterior distribution of θ_ij would be α_ij + n(π_ij) + p(π_ij)(n − Σ_j n(π_ij)). Thus, if p̂(π_ij | D_I) is an estimate of p(π_ij), an estimate of the posterior precision is
    α̂_ij = α_ij + n(π_ij) + p̂(π_ij | D_I)(n − Σ_j n(π_ij)).    (12)

Clearly, α̂_ij is the exact posterior precision when the database is complete and, as the number of missing entries increases, the accuracy of α̂_ij heavily depends on p̂(π_ij | D_I).
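Equation (12) amounts to sharing the cases that are incomplete on the parent variable among the parent configurations in proportion to p̂(π_ij | D_I); a sketch follows (names ours).

```python
def posterior_precision(alpha_ij, n_pi, p_pi, n_total):
    """Estimate (12): a_hat_ij = a_ij + n(pi_ij) + p_hat(pi_ij)(n - sum_j n(pi_ij)).

    alpha_ij: prior precisions a_ij, one per parent configuration
    n_pi:     counts n(pi_ij) of cases complete on the parent variable
    p_pi:     estimated probabilities p_hat(pi_ij | D_I)
    n_total:  total number of cases n
    """
    missing = n_total - sum(n_pi)  # cases partially observed on the parents
    return [alpha_ij[j] + n_pi[j] + p_pi[j] * missing
            for j in range(len(n_pi))]
```

When the database is complete, `missing` is zero and the function returns the exact posterior precisions α_ij + n(π_ij).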
We can apply the BC method to obtain the estimate p̂(π_ij | D_I). When data are missing at random, the estimate of φ_ij = p(Π_i = π_ij | Π_i = ?), j = 1, ..., q_i, is the observed relative frequency of π_ij among the complete cases. We can then apply (10) to obtain

    p̂(π_ij) = Σ_{l≠j} φ_il p_l•(π_ij | D_I) + φ_ij p•(π_ij | D_I),

where

    p_l•(π_ij | D_I) = (β_ij + n(π_ij)) / (β_i + Σ_h n(π_ih) + n•(π_il)),

with n•(π_il) denoting the number of possible completions of the incomplete cases on π_il. As the number of missing entries increases, the estimate α̂_ij tends to α_ij + (β_ij / β_i)n, so that the cases are distributed according to the prior beliefs about the parameters defining the BBN.

Table 2: Models induced from the database generated from M1 for different percentages of available entries.

    %      Induced Model                  ℓ̂(D_I | M)    Time
    100    X1 → X2 → X3                   1437          12
           X1 → X2 → X3 with X1 → X3      1447          12
    20     X1 → X2 → X3                   1414          12

4 EXPERIMENTAL EVALUATION
The aim of the experiments described in this section is to evaluate the accuracy of the estimate (7) as the number of missing entries in the database increases.

4.1 MATERIALS AND METHODS

We considered four different models, described in Table 1. From each of these models we generated a random sample of n cases and applied the algorithm for the induction of the model from the data, using an initial order consistent with the generating structure and assuming uniform prior distributions on the parameters. We then iteratively deleted 20% of the sample entries at random, until the database was empty. On each incomplete database we ran our system to induce the model from the data. The algorithm takes as input a database together with a partial order on the variables occurring in it, and returns a BBN.
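The deletion scheme used in the evaluation can be simulated as follows; this is a sketch, `None` marks a missing entry, and the function name is ours.

```python
import random

def delete_entries(db, fraction, rng):
    """Mark a given fraction of all entries as missing (None), uniformly at
    random, mimicking missing-at-random deletion. Returns a new database;
    the input rows are left untouched."""
    cells = [(r, v) for r, row in enumerate(db) for v in row]
    k = int(round(fraction * len(cells)))
    out = [dict(row) for row in db]
    for r, v in rng.sample(cells, k):  # k distinct cells, no replacement
        out[r][v] = None
    return out
```

Applying the function repeatedly with increasing fractions reproduces the 100%, 80%, ..., 20% available-entry settings of the experiments.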
The induction of the graphical model uses a greedy search strategy and replaces the measure (5) with the BC estimate (8). Once the graphical model has been chosen, the conditional probabilities are estimated using the BC method. The method was implemented in Common Lisp and the experiments were performed on a Macintosh.

Table 3: Marginal probabilities induced for the structure M1 for different percentages of available entries.

    %      X1 =    X2 =    X3 =
    100    0.11    0.78
    80     0.11    0.78
    60     0.12    0.79
    40     0.11    0.79
    20     0.10    0.79
Table 4: Estimates of −log ĝ(X_i, Π_i) for different percentages of available entries in the database generated from M1.

    %                   100    80     60     40     20
    g(X1)               356    353    367    350    324
    g(X2)               526    519    512    506
    g(X3)               690    692    689    689    678
    g(X2, X1)           519    512    520    511    483
    g(X3, X1)           691    692    692    684    667
    g(X3, X2)           562    560    560    586
    g(X3, (X1, X2))     566    593    609
4.2 RESULTS AND DISCUSSION
Tables 2 and 6 show the models induced from the databases generated from the two models M1 and M2, the estimates of −log p̂(D_I | M) for different percentages of available entries, and the total run time, in seconds, taken to extract the graphical model and estimate the parameters of the BBN. In the tables, −log p̂(D_I | M) is reported as ℓ̂(D_I | M). The marginal probabilities are displayed in Tables 3 and 7. The initial order on the variables was in both cases X3 ≺ X2 ≺ X1. The models learned from the database generated from M1 are the correct ones when 100% and 20% of the entries in the database are available, and coherently the model of independence is induced from the empty database. With the other percentages of entries, the induced models differ from the generating structure in one link. Run times show a remarkable independence from the percentage of missing data in the database. Table 4 gives the estimates −log ĝ(X_i, Π_i) computed in each step of the algorithm. When of the entries are available, −log ĝ(X3, (X1, X2)) = and −log ĝ(X3, X2) = , so that the model induced from the incomplete database is exp(−554 + ) = times more likely than the generating structure, if we assume that the prior distribution on the eight possible models consistent with the order X3 ≺ X2 ≺ X1 is uniform. The strong evidence against the model used to generate the database can be due to the fact that p(X3 = ) = and p(X2 = ) = in
the generating structure. In the complete database, n(X3 = | ) = , which becomes when of the entries are deleted, so that the small number of entries may cause the imprecision of the estimate −log p̂(D_I | M). The conditional probabilities estimated for the selected model are p(X3 = | X1 = , X2 = ) = , p(X3 = | X1 = , X2 = ) = , p(X3 = | X1 = , X2 = ) = , and p(X3 = | X1 = , X2 = ) = , so that the estimate of the marginal probability of X3 = differs from the estimate obtained from the complete database by .

[Table 5: −log p̂(D_I | M) for all eight models consistent with the order X3 ≺ X2 ≺ X1, for different percentages of available entries generated from M1. The models range from the empty graph X1 X2 X3 to X1 → X2 → X3 with X1 → X3.]

[Table 6: Models induced from the database generated from M2 for different percentages of available entries; the induced models alternate between X1 → X2 → X3 and X1 → X2 → X3 with X1 → X3.]

When of the entries are available, −log ĝ(X2) = and −log ĝ(X2, X1) = , so that the model induced from the data is only
2.7 times more likely than the generating structure. Again, the marginal probabilities computed from the induced network are very similar to the marginal probabilities found in the model induced from the complete database: thus the choice of a slightly different model has little effect on the predictive power of the network. Table 5 gives the estimate −log p̂(D_I | M) for the eight possible models consistent with the initial ordering of the variables. These estimates can be computed from the values in Table 4 by adding the relevant terms. The estimates are very accurate until 40% of the entries are retained.

Table 7: Marginal probabilities in the networks induced from the database generated from M2 for different percentages of available entries.

    %      X1 =    X2 =    X3 =    X3 =
    100    0.11    0.78    0.25    0.30
    80     0.12    0.78    0.23    0.3
    60     0.12
    40     0.12
    20             0.81
When only 20% of the entries are available, the error of the estimate increases, but nonetheless the model induced from the database is equal to the generating structure. If we assume that the set of possible models is limited to the eight models consistent with the order X3 ≺ X2 ≺ X1, and that they are a priori equally likely, then from the values in Table 5 we can compute the marginal probability of D and of the four incomplete databases D_I, from which we can compute the posterior probabilities of all possible models. The posterior probability of the model induced from the database with 80% of the entries is 0.9987, against a probability of for the generating structure. The other models have posterior probabilities near 0. With of the entries, the posterior probability of the induced model is 0.6699, against 0.3258 for the generating structure. Similar results are found for the models induced from the database generated from M2. The models induced from the databases with 80% and of the entries differ from the generating structure in one link, and they are respectively exp(−980 + ) = and exp(−993 + ) = times more likely; however, the conditional probabilities learned are only slightly different, so that the difference has little effect on the reasoning process. The estimate of −log p̂(D_I | M) is very accurate until the database contains 40% of the original entries. The total run times make it even clearer that the source of complexity is the search space, and that the performance of the method remains insensitive to the number of missing data. This result is not surprising when we realize that the computational cost of BC does not depend on the number of missing data. The number of missing data affects only the storage procedure described in (Ramoni and Sebastiani, 1997a), but its effect is limited by taking advantage of the local independencies of the BBN and by using discrimination trees to store the counters of observed data and
to keep track of the possible completions. The models induced from the databases generated from M3 and M4 are given in Tables 8 and 10, respectively. The initial order on the variables was in both cases X5 ≺ X4 ≺ X3 ≺ X2 ≺ X1.

[Table 8: Models induced from the database generated from M3 for different percentages of available entries.]

[Table 9: Marginal probabilities in the networks induced from the database generated from M3 for different percentages of available entries.]

The models induced from the complete database are equal to the generating structure for both M3 and
M4, and coherently the empty structure is induced when data are totally missing. Table 9 displays the marginal probabilities computed in the networks induced from the incomplete databases generated by M3. As the number of available entries decreases, at most two extra dependencies are induced from the database. The only exception is the model induced from the database generated from M4 with 80% of the entries available. In this case, four extra dependencies are learned, and the Bayes factor of the induced model against the generating structure is e^13. However, the conditional probabilities learned are only slightly different, so that the estimates of the marginal probabilities are extremely robust, thus limiting the effect on the subsequent reasoning process. The estimates of −log p̂(D_I | M) are again extremely accurate.

[Table 10: Models induced from the database generated from M4 for different percentages of available entries; ℓ̂(D_I | M) = 40125, 39369, 42146, 40132, 39952, with run times of 275, 296, 278, 289, and 285 seconds.]

5 CONCLUSIONS

Missing data represent a challenge for learning methods because they may affect their use in real-world applications, where databases are often incomplete. Current methods to learn BBNs from incomplete databases rely on iterative methods, such as EM or Gibbs Sampling, to obtain an approximate estimate of the marginal likelihood of the database given a graphical model, a fundamental step in the process of extracting the graphical structure of a BBN from a database. This paper introduced a deterministic method able to provide this estimate, using
BC, and to extract the graphical structure from an incomplete database. In this way, BC can be used both to induce the graphical structure and to assess the conditional probabilities of a BBN from an incomplete database. Preliminary experimental evaluations show a significant robustness of this method and a remarkable independence of its execution time from the number of missing data.
Acknowledgments
This research was partially supported by equipment grants from Apple Computer and Sun Microsystems.
References
Buntine, W. L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2:159-225.

Chickering, D. M. and Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann.

Cooper, G. F. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741.

Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York, NY.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Ramoni, M. and Sebastiani, P. (1997a). Robust parameter learning in Bayesian networks with missing data. In Proceedings of the Sixth Workshop on Artificial Intelligence and Statistics, pages 339-406, Fort Lauderdale, FL.

Ramoni, M. and Sebastiani, P. (1997b). The use of exogenous knowledge to learn Bayesian networks from incomplete databases. In Proceedings of the Second International Symposium on Intelligent Data Analysis. Springer, New York, NY.

Spiegelhalter, D. and Lauritzen, S. (1990). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20:579-605.

Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Springer, New York, NY.