Score and Information for Recursive Exponential Models with Incomplete Data
Bo Thiesson • Aalborg University / Microsoft Research [email protected]
Abstract
Recursive graphical models usually underlie the statistical modelling of probabilistic expert systems based on Bayesian networks. This paper defines a version of these models, denoted as recursive exponential models, which have evolved from the desire to impose sophisticated domain knowledge onto local fragments of a model. Besides the structural knowledge, as specified by a given model, the statistical modelling may also include expert opinion about the values of parameters in the model. It is shown how to translate imprecise expert knowledge into approximately conjugate prior distributions. Based on possibly incomplete data, the score and the observed information are derived for these models. This accounts for both the traditional score and observed information, derived as derivatives of the log-likelihood, and the posterior score and observed information, derived as derivatives of the log-posterior distribution. Throughout the paper the specialization into recursive graphical models is illustrated by a simple example.

Keywords: Bayesian networks, contingency tables, missing data, probabilistic expert systems, recursive graphical models, exponential models, gradient, Hessian, multivariate normal prior, Dirichlet prior.

1 Introduction
The recursive exponential models (REMs) of this paper have evolved from the recursive graphical models of Wermuth and Lauritzen (1983), which usually underlie the statistical modelling of probabilistic expert systems based on Bayesian networks (Pearl 1988; Andreassen et al.; Spiegelhalter et al.).* For a recursive graphical model the structural relations between variables are represented by a directed acyclic graph, where each node represents a variable, $X_v$, and directed edges signify for each variable the existence of direct causal influence from the variables represented by its parent nodes, $X_{\mathrm{pa}(v)}$. Markov properties with respect to the graph (Kiiveri et al.; Lauritzen et al.) imply that any distribution which is structurally defined by the model can be represented by tables of conditional distributions, $p(X_v \mid X_{\mathrm{pa}(v)})$, which for each possible value of $X_{\mathrm{pa}(v)}$ hold the local conditional distribution for $X_v$.

*This work was primarily done at Aalborg University, Department of Mathematics and Computer Science, Denmark. Final preparation was done at Microsoft Research, Redmond, WA.
The REMs have evolved from these models by the desire to model the local conditional distributions in further detail. One may visualize the REMs as recursive graphical models, where local conditional distributions are defined by individual regular exponential models, denoted as local models. Any exponential model can be used as a local model, whereby it is possible to accommodate very sophisticated structural restrictions. It may also happen that the same fragments of the model recur at different sites in the model, as is often the case in, e.g., pedigree analysis. To deal with such cases, the REMs additionally allow different tables of conditional distributions to be modelled by a generic component. A matching Bayesian interpretation of the modelling is considered. Besides the structural knowledge as specified by a given model, experts may also specify imprecise knowledge about parameters in the model. This knowledge can then be used for the construction of a conjugate prior distribution of parameters. The matching prior distribution for a REM factorizes into individual local priors, one associated with each local model. Hence, local priors can be considered independently.

In the general case the conjugate prior for a local exponential model is approximated by a multivariate normal distribution. If the local model is not restricted beyond being a probability distribution, the natural conjugate prior is defined by a Dirichlet distribution of probabilities.

In particular, REMs can be used for the construction of Bayesian networks. A Bayesian network resembles a quantified model, that is, a particular distribution belonging to the set of distributions defined by the model. Given a database of observations, the maximum likelihood estimate is the usual candidate for such a quantification. If expert knowledge on parameters is also available, the largest posterior mode becomes a natural alternative. In situations of incomplete data, the determination of the maximum likelihood estimate or the largest posterior mode may call for iterative methods. The present paper has primarily been motivated by the need of providing the first and second order derivatives of the log-likelihood and log-posterior distribution to be used for iterative estimation methods, and for interfacing a sequential updating method (Spiegelhalter and Lauritzen 1990a, 1990b) to follow up on a quantified model as new observations occur. An application of first order derivatives for estimation with incomplete data in REMs is demonstrated in a companion paper (Thiesson 1995); related derivations appear in Lauritzen (1995), and Russell et al. (1995) give a gradient-descent application for estimation. In a study on methods for learning the structural relations between variables, Chickering and Heckerman (1996) have applied the second order derivatives from this paper.

The first order and the negative second order derivatives of the log-likelihood are in the following denoted as the score and observed information, whereas the first order and the negative second order derivatives of the log-posterior distribution are denoted as the posterior score and observed information.
Section 2 defines the recursive exponential models. This includes the important notion of global and local variation independence of parameters, local exponential modelling, and how to relax the global variation independence to allow for parsimonious modelling of recurring fragments in the model. Sections 3 and 4 derive the traditional score and observed information in the situation of a single incomplete observation. In Section 5 the expressions for the score and observed information are extended to apply for samples of independent observations. Section 6 covers the posterior score and observed information. It is suggested that a prior distribution obeys assumptions of relaxed global and local independence of parameters considered as random variables, which match the assumptions of variation independence. Section 7 investigates conjugate prior distributions and how to construct these from imprecise expert knowledge. Finally, Section 8 indicates further aspects of modelling. Annulment of local variation independence and so-called block recursive exponential models are proposed. A simple extension of a recursive graphical model serves as an ongoing example throughout this paper.

2 Recursive exponential models
Let $X = X_V = (X_v)_{v \in V}$ be a finite set of classification variables, each defined on a finite set of levels $\mathcal{I}_v$. Let $A \subseteq V$; then $\mathcal{I}_A = \times_{v \in A} \mathcal{I}_v$, and the variables $X_A = (X_v)_{v \in A}$ take on values $x_A = (x_v)_{v \in A} \in \mathcal{I}_A$. For $A = V$ we omit the subscript. For a particular value $\theta \in \Theta$ of the parameter space, the joint distribution of $X$ is denoted $p(X \mid \theta)$, in which case the likelihood based on a complete observation $x \in \mathcal{I}$ is denoted $p(x \mid \theta)$. The number of parameters in a model, also called the dimension, is denoted $|\Theta|$. Likewise, $|\mathcal{I}_v|$ is the number of levels for $X_v$, and $|\mathcal{I}_A| = \prod_{v \in A} |\mathcal{I}_v|$ denotes the number of configurations of levels for $X_A$.

Given the recursive graphical structure, as argued in the introduction, a REM holds two assumptions on the parameter space, to be described below. Readers familiar with Spiegelhalter and Lauritzen (1990a, 1990b) and the line of work reported in Heckerman et al. (1995) may recognize the assumptions as assumptions of variation independence between parameters in different local components of the model, which are used in these papers but not explicitly named.

By global variation independence the graphical structure reflects the assumption that any distribution of a given model factorizes into a product of conditional distributions, each parametrized by variation independent components of the total parametrization. That is,

$$p(X \mid \theta) = \prod_{v \in V} p(X_v \mid X_{\mathrm{pa}(v)}, \theta_v), \qquad (1)$$

where $\Theta = \times_{v \in V} \Theta_v$, and $\theta_v \in \Theta_v$ completely specifies the relationship between the variable $X_v$ and its conditioning set of variables $X_{\mathrm{pa}(v)}$, and $p(X_v \mid X_{\mathrm{pa}(v)}, \theta_v)$ denotes a table of conditional distributions, which holds a local distribution $p(X_v \mid \pi_v, \theta_v)$ for each parent configuration of levels $\pi_v \in \mathcal{I}_{\mathrm{pa}(v)}$.
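As an illustrative sketch (not part of the original paper), the factorization (1) can be evaluated for a small toy network; the three-node graph, its tables, and all names below are assumptions chosen for illustration.

```python
# Sketch: joint probability of a complete observation via the
# factorization p(x | theta) = prod_v p(x_v | x_pa(v), theta_v).
# The chain A -> B -> C and its conditional tables are made up.
from itertools import product

parents = {"A": (), "B": ("A",), "C": ("B",)}
# theta[v][parent_config] = distribution over levels 0/1 of v.
theta = {
    "A": {(): [0.6, 0.4]},
    "B": {(0,): [0.9, 0.1], (1,): [0.3, 0.7]},
    "C": {(0,): [0.5, 0.5], (1,): [0.2, 0.8]},
}

def joint(x, parents, theta):
    """p(x | theta) for a complete observation x = {node: level}."""
    p = 1.0
    for v, pa in parents.items():
        pa_config = tuple(x[u] for u in pa)
        p *= theta[v][pa_config][x[v]]
    return p

# The joint distribution sums to one over all complete configurations.
total = sum(joint(dict(zip("ABC", levels)), parents, theta)
            for levels in product([0, 1], repeat=3))
```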
In some applications, particularly pedigree analysis, it is typical to restrain the tables of local conditional distributions by knowledge about some of these tables being equal. In order to allow this type of application the global variation independence is relaxed in a certain way. Let $\bar{v} \subseteq V$ specify a set of variables which share equal tables of conditional distributions, and denote by $\bar{V}$ the total set of these equivalence classes. Equal tables must be parametrized by the same parameters. Hence,

$$p(X \mid \theta) = \prod_{\bar{v} \in \bar{V}} \prod_{v \in \bar{v}} p(X_v \mid X_{\mathrm{pa}(v)}, \theta_{\bar{v}}), \qquad (2)$$

where $\theta_{\bar{v}} \in \Theta_{\bar{v}}$ specifies the relationship between $X_v$ and $X_{\mathrm{pa}(v)}$ for any $v \in \bar{v}$. If equal tables are represented by a single generic table, as will be assumed from now on, this is a more appropriate representation, which reflects the reduction of the parameter space into $\Theta = \times_{\bar{v} \in \bar{V}} \Theta_{\bar{v}}$.

Let $(i_{\bar{v}}, \pi_{\bar{v}}) \in \mathcal{I}_{\bar{v}} \times \mathcal{I}_{\mathrm{pa}(\bar{v})}$, where $\mathcal{I}_{\bar{v}} \times \mathcal{I}_{\mathrm{pa}(\bar{v})} = \mathcal{I}_v \times \mathcal{I}_{\mathrm{pa}(v)}$ for any $v \in \bar{v}$, index a generic table. By local variation independence the parametrization of a table additionally factorizes into components associated with each local distribution in the table. Hence, $\Theta_{\bar{v}} = \times_{\pi_{\bar{v}} \in \mathcal{I}_{\mathrm{pa}(\bar{v})}} \Theta_{\bar{v}|\pi_{\bar{v}}}$, whereby the table of local distributions is assembled by the probabilities $p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})$.
Now, consider the likelihood of a single observation $x$. The simplifying assumptions of variation independence between local components of parameters then have the effect of breaking this likelihood into a product of local likelihoods, where each local likelihood function is given by the distribution picked from the appropriate generic table by setting $\pi_{\bar{v}} = x_{\mathrm{pa}(v)}$. Hereby, the likelihood of a single observation becomes

$$p(x \mid \theta) = \prod_{\bar{v} \in \bar{V}} \prod_{v \in \bar{v}} p(x_v \mid x_{\mathrm{pa}(v)}, \theta_{\bar{v}|x_{\mathrm{pa}(v)}}). \qquad (3)$$

This concludes the simplifying assumptions of variation independence. By lingering on the more pragmatic effect hereof concerning the issue of modelling, we notice that the assumptions break down the statistical modelling into more tractable local constructions of $\sum_{\bar{v} \in \bar{V}} |\mathcal{I}_{\mathrm{pa}(\bar{v})}|$ models for conditional distributions. These models are denoted as local models.

The statistical modelling by REMs does not stop at this point, though. To completely qualify as a REM, each local model must be structurally defined by a regular exponential model. The local likelihoods in (3) are therefore represented in the form

$$p(x_v \mid x_{\mathrm{pa}(v)}, \theta_{\bar{v}|x_{\mathrm{pa}(v)}}) = b(x_v, x_{\mathrm{pa}(v)}) \exp\left( \theta'_{\bar{v}|x_{\mathrm{pa}(v)}}\, t_{\bar{v}|x_{\mathrm{pa}(v)}}(x_v) - \phi(\theta_{\bar{v}|x_{\mathrm{pa}(v)}}) \right), \qquad (4)$$

where $'$ denotes the matrix transpose, $t_{\bar{v}|x_{\mathrm{pa}(v)}}$ denotes the canonical statistic, and $\phi$ the normalizing function.
Example: Consider a model with recursive response structure as represented by the graph in Figure 1. Assume that this model obeys the assumptions of global and local variation independence except for domain knowledge, which dictates that the tables of conditional distributions are equal for the variables $X_3$ and $X_4$. In this case we loosen up the assumption of global variation independence by assuming that $\theta_3 = \theta_4 = \theta_{\bar{3}}$. Hence, $\theta = (\theta_1, \theta_2, \theta_{\bar{3}}, \theta_5, \theta_6)$, and the likelihood of a single observation $x = (x_1, x_2, x_3, x_4, x_5, x_6)$ becomes

$$p(x \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid \theta_2)\, p(x_3 \mid x_1, \theta_{\bar{3}|x_1})\, p(x_4 \mid x_2, \theta_{\bar{3}|x_2})\, p(x_5 \mid x_3, x_4, \theta_{5|x_3, x_4})\, p(x_6 \mid x_5, \theta_{6|x_5}).$$
Consider variable $X_6$, which has a finite set of levels $\mathcal{I}_6 = \{i^0, i^1, \ldots, i^R\}$; the observed value $x_6$ being one of these. In this case, the last factor of the likelihood is picked from the local distribution $p(X_6 \mid x_5, \theta_{6|x_5}) = (p^0, p^1, \ldots, p^R)$. Assume that the model for this distribution does not hold structural restrictions other than positivity constraints. An exponential representation of the local likelihood is then constructed as follows. Choose an index of reference, say $i^0$. Then take as canonical parameters $\theta_{6|x_5} = (\theta^1, \ldots, \theta^R)'$, where $\theta^r = \log(p^r / p^0)$, and take as canonical statistics $t_{6|x_5}(x_6) = (t^1(x_6), \ldots, t^R(x_6))'$, where

$$t^r(x_6) = \begin{cases} 1 & \text{for } x_6 = i^r \\ 0 & \text{otherwise.} \end{cases}$$

A minimal exponential representation is then given by

$$p(x_6 \mid x_5, \theta_{6|x_5}) = \exp\left( \theta'_{6|x_5}\, t_{6|x_5}(x_6) - \phi(\theta_{6|x_5}) \right)$$

with normalizing function $\phi(\theta_{6|x_5}) = \log\left( 1 + \sum_{r=1}^{R} \exp(\theta^r) \right)$.

3 Score of incomplete observation

Suppose for a given model that a complete observation $x$ is only observed indirectly through the incomplete observation $y$. The observation may be incomplete due to missing values according to a complete observation on $X$ or due to imprecise values. An imprecise value appears if the collector of data cannot distinguish between a set of possible values for a variable and therefore reports this set instead of a single value. Denote by $\mathcal{X}(y)$ the set of possible completions that are obtainable by augmenting the incomplete observation $y$. Under the condition that the observation is incomplete in an uninformative way, the likelihood for the incomplete observation becomes

$$p(y \mid \theta) = \sum_{x \in \mathcal{X}(y)} p(x \mid \theta) = \sum_{x \in \mathcal{X}(y)} \prod_{\bar{v} \in \bar{V}} \prod_{v \in \bar{v}} p(x_v \mid x_{\mathrm{pa}(v)}, \theta_{\bar{v}|x_{\mathrm{pa}(v)}}), \qquad (5)$$

where $\theta = (\theta_{\bar{v}|\pi_{\bar{v}}})_{\bar{v} \in \bar{V},\, \pi_{\bar{v}} \in \mathcal{I}_{\mathrm{pa}(\bar{v})}}$ denotes the vector of all parameters for the model.

Let $S(y \mid \theta) = \frac{\partial}{\partial \theta} \log p(y \mid \theta)$ denote the score of an incomplete observation. In accordance with the partitioning of the parameter vector into variation independent components, $\theta_{\bar{v}|\pi_{\bar{v}}}$, we describe the score by local components of dimension $|\theta_{\bar{v}|\pi_{\bar{v}}}|$ given by

$$S_{\bar{v}|\pi_{\bar{v}}}(y \mid \theta) = \frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}} \log p(y \mid \theta) = \frac{1}{p(y \mid \theta)} \frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}}\, p(y \mid \theta) = \frac{1}{p(y \mid \theta)} \sum_{x \in \mathcal{X}(y)} \frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}}\, p(x \mid \theta), \qquad (6)$$

where the last equality follows from (5).

Consider the local derivatives of the likelihood for a complete observation. The chain rule for differentiation implies

$$\frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}}\, p(x \mid \theta) = p(x \mid \theta) \sum_{v \in \bar{v}} \chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)})\, \frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}} \log p(x_v \mid x_{\mathrm{pa}(v)}, \theta_{\bar{v}|\pi_{\bar{v}}}),$$

where $\chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)})$ is the indicator function

$$\chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)}) = \begin{cases} 1 & \text{for } x_{\mathrm{pa}(v)} = \pi_{\bar{v}} \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

Thus, by the exponential representation (4) of the local likelihood for a complete observation, the partial derivatives of the likelihood become

$$\frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}}\, p(x \mid \theta) = p(x \mid \theta) \sum_{v \in \bar{v}} \chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)}) \left( t_{\bar{v}|\pi_{\bar{v}}}(x_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right), \qquad (8)$$

where $\tau(\theta_{\bar{v}|\pi_{\bar{v}}}) = E[t_{\bar{v}|\pi_{\bar{v}}}(X_v) \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}}]$, the expected value of the canonical statistic for the conditional distribution $p(\cdot \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})$.

By inserting (8) into the expression for the local score of an incomplete observation, as given in (6), and applying that

$$p(x \mid y, \theta) = \frac{p(x \mid \theta)}{p(y \mid \theta)} \quad \text{for } x \in \mathcal{X}(y), \text{ and } 0 \text{ otherwise}, \qquad (9)$$

assuming $p(y \mid \theta) > 0$, we hereby obtain the final expression for local components of the score

$$S_{\bar{v}|\pi_{\bar{v}}}(y \mid \theta) = \sum_{v \in \bar{v}} \sum_{i_{\mathrm{fa}(v)} \in \mathcal{I}_{\mathrm{fa}(v)}} p(i_{\mathrm{fa}(v)} \mid y, \theta)\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)}) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right), \qquad (10)$$

where $\mathrm{fa}(v)$ is a short notation for the family $v \cup \mathrm{pa}(v)$. The Lauritzen-Spiegelhalter (L-S) procedure for probability propagation (Lauritzen and Spiegelhalter 1988) can be used as an efficient method for calculating the posterior probabilities $p(i_{\mathrm{fa}(v)} \mid y, \theta)$. A concise description of this dedicated use of the L-S procedure can be found in Lauritzen (1995).
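As a sketch (not from the paper), the reference-cell exponential representation of an unrestricted local distribution, together with its mean statistic $\tau$, can be coded directly; the probability values below are hypothetical.

```python
import math

# Sketch of the reference-cell representation used in the example:
# theta^r = log(p^r / p^0), r = 1..R, with a softmax-style inverse.
def to_canonical(p):
    """Canonical parameters theta from a positive probability vector p."""
    return [math.log(pr / p[0]) for pr in p[1:]]

def to_probabilities(theta):
    """Inverse map: p^r = exp(theta^r - phi), phi = log(1 + sum exp)."""
    z = 1.0 + sum(math.exp(t) for t in theta)
    return [1.0 / z] + [math.exp(t) / z for t in theta]

def tau(theta):
    """Expected canonical statistic: tau(theta) = (p^1, ..., p^R)."""
    return to_probabilities(theta)[1:]

p = [0.2, 0.3, 0.5]        # hypothetical local distribution
theta = to_canonical(p)    # round-trips back to p
```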
The remaining parts of (10) are either directly extracted or easily calculated from the exponential representation of the local model.

Example (continued): Consider a situation where incompleteness is caused by an imprecise observation. Say that $X_6$ has four possible levels $\mathcal{I}_6 = \{i^0, i^1, i^2, i^3\}$, for which the collector of data could not decide between the two values $x_6 = i^2$ and $x_6 = i^3$. In this case

$$\mathcal{X}(y) = \{(x_1, x_2, x_3, x_4, x_5, i^2),\, (x_1, x_2, x_3, x_4, x_5, i^3)\}.$$

As $p(X_6 \mid x_5, \theta) = (p^0, p^1, p^2, p^3)$, the support of $p(X_5, X_6 \mid y, \theta)$ is given by the non-zero values $\left( \frac{p^2}{p^2 + p^3}, \frac{p^3}{p^2 + p^3} \right)$ obtained for $(X_5, X_6) = (x_5, i^2)$ and $(X_5, X_6) = (x_5, i^3)$. Realize also that $\tau(\theta) = (p^1, p^2, p^3)$. By equation (10), the local score is then easily calculated as

$$S_{6|x_5}(y \mid \theta) = \frac{p^2}{p^2 + p^3}\, (-p^1,\ 1 - p^2,\ -p^3)' + \frac{p^3}{p^2 + p^3}\, (-p^1,\ -p^2,\ 1 - p^3)'.$$

4 Observed information in an incomplete observation
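The closed-form score above can be checked against the general formula (10); as a sketch, with hypothetical probability values:

```python
# Sketch: the local score for the example's imprecise observation
# {i2, i3}, computed from the general formula and from the closed form.
p = [0.1, 0.2, 0.3, 0.4]      # hypothetical p(X6 | x5) = (p0, p1, p2, p3)
tau = p[1:]                   # expected canonical statistic

def t(r):
    """Canonical statistic for level i^r (reference cell i^0)."""
    return [1.0 if r == s else 0.0 for s in (1, 2, 3)]

# Posterior weights over the two possible completions x6 = i2, i3.
w = {2: p[2] / (p[2] + p[3]), 3: p[3] / (p[2] + p[3])}

# General formula: S = sum over completions of p(completion | y)(t - tau).
score = [sum(w[r] * (t(r)[k] - tau[k]) for r in (2, 3)) for k in range(3)]

# Closed form derived in the example.
closed = [w[2] * c2 + w[3] * c3
          for c2, c3 in zip([-p[1], 1 - p[2], -p[3]],
                            [-p[1], -p[2], 1 - p[3]])]
```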
Let $I(y \mid \theta) = -\frac{\partial^2}{\partial \theta\, \partial \theta'} \log p(y \mid \theta)$ denote the observed information in the incomplete observation $y$. We divide the information into local information matrices of dimension $|\theta_{\bar{u}|\pi_{\bar{u}}}| \times |\theta_{\bar{v}|\pi_{\bar{v}}}|$, where each local matrix represents the part of the information matrix defined by the local components $\theta_{\bar{u}|\pi_{\bar{u}}}$ and $\theta_{\bar{v}|\pi_{\bar{v}}}$, $\bar{u}, \bar{v} \in \bar{V}$. Consider the local information

$$I_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}}(y \mid \theta) = S_{\bar{u}|\pi_{\bar{u}}}(y \mid \theta)\, S_{\bar{v}|\pi_{\bar{v}}}(y \mid \theta)' - \frac{1}{p(y \mid \theta)} \sum_{x \in \mathcal{X}(y)} \frac{\partial^2}{\partial \theta_{\bar{u}|\pi_{\bar{u}}}\, \partial \theta'_{\bar{v}|\pi_{\bar{v}}}}\, p(x \mid \theta). \qquad (11)$$

The first term in (11) is just the product of local scores. Hence, the aim is now to derive a calculable expression for the second term. Define the Kronecker delta

$$\delta_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}} = \begin{cases} 1 & \text{if } \bar{u} = \bar{v} \text{ and } \pi_{\bar{u}} = \pi_{\bar{v}} \\ 0 & \text{otherwise.} \end{cases}$$

By straightforward differentiation of (8) with respect to the local component $\theta_{\bar{u}|\pi_{\bar{u}}}$,

$$\frac{\partial^2 p(x \mid \theta)}{\partial \theta_{\bar{u}|\pi_{\bar{u}}}\, \partial \theta'_{\bar{v}|\pi_{\bar{v}}}} = p(x \mid \theta) \left( \sum_{u \in \bar{u}} \chi^{\pi_{\bar{u}}}(x_{\mathrm{pa}(u)}) \left( t_{\bar{u}|\pi_{\bar{u}}}(x_u) - \tau(\theta_{\bar{u}|\pi_{\bar{u}}}) \right) \right) \left( \sum_{v \in \bar{v}} \chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)}) \left( t_{\bar{v}|\pi_{\bar{v}}}(x_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right) \right)' - \delta_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}}\, p(x \mid \theta) \sum_{v \in \bar{v}} \chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)})\, v(\theta_{\bar{v}|\pi_{\bar{v}}}),$$

where $v(\theta_{\bar{v}|\pi_{\bar{v}}}) = V[t_{\bar{v}|\pi_{\bar{v}}}(X_v) \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}}]$, the covariance matrix for the canonical statistic in the exponential representation of $p(\cdot \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})$.
Using (9), the second part of the local information is hereby derived as

$$\frac{1}{p(y \mid \theta)} \sum_{x \in \mathcal{X}(y)} \frac{\partial^2}{\partial \theta_{\bar{u}|\pi_{\bar{u}}}\, \partial \theta'_{\bar{v}|\pi_{\bar{v}}}}\, p(x \mid \theta) = \sum_{u \in \bar{u}} \sum_{v \in \bar{v}} \sum_{i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)}} \Big[ p(i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)} \mid y, \theta)\, \chi^{\pi_{\bar{u}}}(i_{\mathrm{pa}(u)})\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)}) \left( t_{\bar{u}|\pi_{\bar{u}}}(i_u) - \tau(\theta_{\bar{u}|\pi_{\bar{u}}}) \right) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right)' \Big] - \delta_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}} \sum_{v \in \bar{v}} \sum_{i_{\mathrm{pa}(v)}} p(i_{\mathrm{pa}(v)} \mid y, \theta)\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)})\, v(\theta_{\bar{v}|\pi_{\bar{v}}}). \qquad (12)$$

By realizing that $p(i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)} \mid y, \theta) = p(i_{\mathrm{fa}(u)} \mid i_{\mathrm{fa}(v)}, y, \theta)\, p(i_{\mathrm{fa}(v)} \mid y, \theta)$, the dedicated L-S procedure can be used for the calculation of the posterior probabilities in (12). The remaining parts are directly extracted or easily calculated from the local exponential models. Hence, a final calculable expression for local informations can now be obtained by inserting (10) and (12) into (11). To discuss the effect that incompleteness of data imposes on structural characteristics of the observed information we reorganize the final expression as

$$I_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}}(y \mid \theta) = \sum_{u \in \bar{u}} \sum_{v \in \bar{v}} \sum_{i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)}} \Big[ \big( p(i_{\mathrm{fa}(u)} \mid y, \theta)\, p(i_{\mathrm{fa}(v)} \mid y, \theta) - p(i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)} \mid y, \theta) \big)\, \chi^{\pi_{\bar{u}}}(i_{\mathrm{pa}(u)})\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)}) \left( t_{\bar{u}|\pi_{\bar{u}}}(i_u) - \tau(\theta_{\bar{u}|\pi_{\bar{u}}}) \right) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right)' \Big] + \delta_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}} \sum_{v \in \bar{v}} \sum_{i_{\mathrm{pa}(v)}} p(i_{\mathrm{pa}(v)} \mid y, \theta)\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)})\, v(\theta_{\bar{v}|\pi_{\bar{v}}}). \qquad (13)$$
Consider a complete observation $x$. The posterior probabilities on a subset $X_A$, $A \subseteq V$, of variables are then given by

$$p(i_A \mid x, \theta) = \begin{cases} 1 & \text{for } i_A = x_A \\ 0 & \text{otherwise.} \end{cases} \qquad (14)$$

In this case the local information in (13) reduces to

$$I_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}}(x \mid \theta) = \delta_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}} \sum_{v \in \bar{v}} \chi^{\pi_{\bar{v}}}(x_{\mathrm{pa}(v)})\, v(\theta_{\bar{v}|\pi_{\bar{v}}}).$$

Hence, for a complete observation the information matrix will be block-diagonal on local components of the parameter vector. For an incomplete observation this is not the case. The fact that the adjustment of the information due to incompleteness of the observation will undermine the block-diagonality is easily seen by realizing that (14) is no longer valid.
Example (continued): Consider the local information $I_{6|x_5,\, 6|x_5}(y \mid \theta)$. Given a complete observation $y = x = (x_1, x_2, x_3, x_4, x_5, x_6)$, the local information equals

$$v(\theta_{6|x_5}) = \begin{pmatrix} p^1 - p^1 p^1 & -p^1 p^2 & -p^1 p^3 \\ -p^2 p^1 & p^2 - p^2 p^2 & -p^2 p^3 \\ -p^3 p^1 & -p^3 p^2 & p^3 - p^3 p^3 \end{pmatrix}.$$

Actually, this result only depends on the fact that $x_5$ and $x_6$ are observed. Now, say that $x_6$ was not observed. In this case, the first part of the local information, as given by the product of scores in (11), is 0. Calculations on (12) show that the second part equals $v(\theta_{6|x_5}) - v(\theta_{6|x_5}) = 0$. Hence, the observed local information is 0. This is in agreement with the fact that we do not have any information about the conditional distribution $p(X_6 \mid x_5, \theta_{6|x_5})$ when $x_6$ is unobserved. We emphasize that this result is a consequence of the fact that the non-zero values for $p(x_{\mathrm{fa}(v)} \mid y, \theta)$ equal $p(x_v \mid x_{\mathrm{pa}(v)}, \theta)$, which is not true in general if $X_v$ has observed descendants.

5 Sample score and observed information
Let $y = (y^1, y^2, \ldots, y^L)$ denote a sample of possibly incomplete observations which are mutually independent. The likelihood then factorizes over the observations:

$$p(y \mid \theta) \propto \prod_{l=1}^{L} p(y^l \mid \theta) = \prod_{l=1}^{L} \sum_{x^l \in \mathcal{X}(y^l)} p(x^l \mid \theta).$$

As the likelihood is proportional to the product of likelihoods for the individual observations, the sample score and observed sample information are obtained by simply adding the individual scores and informations, respectively. Denote by $n^*(i_A) = \sum_{l=1}^{L} p(i_A \mid y^l, \theta)$ the expected marginal count of observations for the marginal configuration of levels $i_A \in \mathcal{I}_A$.
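As a sketch (not from the paper), the expected marginal counts $n^*$ can be accumulated over a sample in which some values are missing; the single binary variable, model, and data below are all hypothetical.

```python
# Sketch: expected marginal counts n*(i) = sum_l p(i | y^l, theta) for a
# single binary variable, where an observation is either a level or None
# (missing). With no parents, p(i | missing, theta) is just theta.
theta = [0.3, 0.7]              # p(X = 0), p(X = 1)
sample = [0, 1, None, 1, None]  # None marks a missing value

def posterior(y, theta):
    """p(i | y, theta): point mass if observed, the model if missing."""
    if y is None:
        return list(theta)
    return [1.0 if i == y else 0.0 for i in range(len(theta))]

n_star = [0.0, 0.0]
for y in sample:
    for i, q in enumerate(posterior(y, theta)):
        n_star[i] += q
```

The expected counts always sum to the sample size, here 5.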
By adding the local scores, as given in (10), the sample score $S(y \mid \theta)$ has local components

$$S_{\bar{v}|\pi_{\bar{v}}}(y \mid \theta) = \sum_{v \in \bar{v}} \sum_{i_{\mathrm{fa}(v)}} n^*(i_{\mathrm{fa}(v)})\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)}) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right). \qquad (15)$$

Similarly, by adding the observed informations, the observed sample information $I(y \mid \theta)$ has local components

$$I_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}}(y \mid \theta) = \sum_{l=1}^{L} \left( \sum_{u \in \bar{u}} \sum_{i_{\mathrm{fa}(u)}} \Big[ p(i_{\mathrm{fa}(u)} \mid y^l, \theta)\, \chi^{\pi_{\bar{u}}}(i_{\mathrm{pa}(u)}) \left( t_{\bar{u}|\pi_{\bar{u}}}(i_u) - \tau(\theta_{\bar{u}|\pi_{\bar{u}}}) \right) \Big] \right) \left( \sum_{v \in \bar{v}} \sum_{i_{\mathrm{fa}(v)}} \Big[ p(i_{\mathrm{fa}(v)} \mid y^l, \theta)\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)}) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right) \Big] \right)' - \sum_{u \in \bar{u}} \sum_{v \in \bar{v}} \sum_{i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)}} \Big[ n^*(i_{\mathrm{fa}(u) \cup \mathrm{fa}(v)})\, \chi^{\pi_{\bar{u}}}(i_{\mathrm{pa}(u)})\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)}) \left( t_{\bar{u}|\pi_{\bar{u}}}(i_u) - \tau(\theta_{\bar{u}|\pi_{\bar{u}}}) \right) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_v) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right)' \Big] + \delta_{\bar{u}|\pi_{\bar{u}},\, \bar{v}|\pi_{\bar{v}}} \sum_{v \in \bar{v}} \sum_{i_{\mathrm{pa}(v)}} n^*(i_{\mathrm{pa}(v)})\, \chi^{\pi_{\bar{v}}}(i_{\mathrm{pa}(v)})\, v(\theta_{\bar{v}|\pi_{\bar{v}}}). \qquad (16)$$

The expression (16) is organized so that the first term should be easy to identify as the sum over observations of the products of local scores. Hence, in case the local scores of each observation have already been calculated, one might replace the first term of (16) by

$$\sum_{l=1}^{L} S_{\bar{u}|\pi_{\bar{u}}}(y^l \mid \theta)\, S_{\bar{v}|\pi_{\bar{v}}}(y^l \mid \theta)'.$$

Notice that a large number of posterior probabilities $p(i_A \mid y, \theta)$, $A \subseteq V$, have to be calculated to complete the calculations. Hence computational efficiency demands an efficient method for calculating these, e.g. the L-S procedure.

6 Posterior score and observed information
Suppose that we have information about $\theta$ in the form of a prior distribution of parameters considered as random variables, $p(\theta)$. The posterior distribution given an incomplete sample (or single observation) is then defined as

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}. \qquad (17)$$

Here we consider a Bayesian interpretation of the score and information. In analogy with the traditional score and observed information, let $S(\theta \mid y) = \frac{\partial}{\partial \theta} \log p(\theta \mid y)$ and $I(\theta \mid y) = -\frac{\partial^2}{\partial \theta\, \partial \theta'} \log p(\theta \mid y)$ be denoted as the posterior score and observed information, and let $S(\theta) = \frac{\partial}{\partial \theta} \log p(\theta)$ and $I(\theta) = -\frac{\partial^2}{\partial \theta\, \partial \theta'} \log p(\theta)$ be denoted as the prior score and information. From (17) it is easily seen that the posterior score and information are obtained by simply adding the traditional score and information onto the prior score and information, respectively. Hence,

$$S(\theta \mid y) = S(y \mid \theta) + S(\theta) \qquad (18)$$

and

$$I(\theta \mid y) = I(y \mid \theta) + I(\theta). \qquad (19)$$

Now, consider the prior distribution of parameters. The construction simplifies considerably by matching the assumptions of variation independence with independence of the parameters considered as random variables. Hence, by relaxed global independence we assume that $\theta_{\bar{v}}$, $\bar{v} \in \bar{V}$, are mutually independent, and by local independence we assume that the local components $\theta_{\bar{v}|\pi_{\bar{v}}}$, $\pi_{\bar{v}} \in \mathcal{I}_{\mathrm{pa}(\bar{v})}$, are mutually independent. The notion of global and local independence can also be found in Spiegelhalter and Lauritzen (1990a, 1990b) and the line of work reported in Heckerman et al. (1995). Under the assumptions of relaxed global and local independence the distribution of parameters factorizes as

$$p(\theta) = \prod_{\bar{v} \in \bar{V}} \prod_{\pi_{\bar{v}} \in \mathcal{I}_{\mathrm{pa}(\bar{v})}} p(\theta_{\bar{v}|\pi_{\bar{v}}}).$$
By this factorization the local components of the prior score and the local components of the prior information are derived from local prior distributions.

7 Specific prior distributions on parameters

To complete the specification of the posterior score and observed information, we consider parametrizations for the prior distribution of parameters. Intending a unification of a batch method for quantifying probabilistic expert systems by the mode which maximizes the posterior distribution, as described in Thiesson (1995), and a method for sequential updating of conditional probabilities, as described in Spiegelhalter and Lauritzen (1990a, 1990b), we are especially interested in conjugate (or approximately conjugate) distributions. By the intended unification, a system can be initialized by the batch learning method, and subsequently, as new data accumulate, the system can be updated and improved by the sequential updating method.

Conjugate prior for local exponential models
For the general setting we consider the prior distribution of parameters as a member of the conjugate model for the likelihood as defined by a recursive exponential model.
Let $x = (x^1, \ldots, x^n)$ denote a sample of $n$ complete observations. The observed count for configuration $i_A \in \mathcal{I}_A$, $A \subseteq V$, is then defined as

$$n(i_A) = \sum_{l=1}^{n} \chi^{i_A}(x^l_A), \quad \text{where } \chi^{i_A}(x_A) = \begin{cases} 1 & \text{for } x_A = i_A \\ 0 & \text{otherwise.} \end{cases}$$

For independent observations and by the assumptions of variation independence the likelihood factorizes as

$$p(x \mid \theta) \propto \prod_{v \in V} \prod_{(i_v, \pi_v)} p(i_v \mid \pi_v, \theta_{\bar{v}|\pi_v})^{n(i_v, \pi_v)} = \prod_{\bar{v} \in \bar{V}} \prod_{(i_{\bar{v}}, \pi_{\bar{v}})} p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})^{\sum_{v \in \bar{v}} n(i_v, \pi_v)},$$

where the last equality follows from the fact that conditional probability tables are equal for all $v \in \bar{v}$. We observe that the likelihood factorizes into a product of local likelihoods

$$p_{\bar{v}|\pi_{\bar{v}}}(x \mid \theta_{\bar{v}|\pi_{\bar{v}}}) = \prod_{i_{\bar{v}} \in \mathcal{I}_{\bar{v}}} p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})^{\sum_{v \in \bar{v}} n(i_v, \pi_v)},$$

each defined by the exponential model (4). The natural local conjugate priors are therefore defined by conjugate exponential models (Diaconis and Ylvisaker),

$$p(\theta_{\bar{v}|\pi_{\bar{v}}}) \propto \exp\left( \theta'_{\bar{v}|\pi_{\bar{v}}}\, \kappa - \beta\, \phi(\theta_{\bar{v}|\pi_{\bar{v}}}) \right),$$

where $\kappa$ is a vector of the same dimension as $\theta_{\bar{v}|\pi_{\bar{v}}}$ and $\beta$ is a scalar. Let $\theta^*_{\bar{v}|\pi_{\bar{v}}}$ denote the value which maximizes $p(\theta_{\bar{v}|\pi_{\bar{v}}})$.
In the neighbourhood of $\theta^*_{\bar{v}|\pi_{\bar{v}}}$ a Taylor series expansion implies that

$$\log p(\theta_{\bar{v}|\pi_{\bar{v}}}) \approx \log p(\theta^*_{\bar{v}|\pi_{\bar{v}}}) - \frac{\beta}{2} (\theta_{\bar{v}|\pi_{\bar{v}}} - \theta^*_{\bar{v}|\pi_{\bar{v}}})'\, v(\theta^*_{\bar{v}|\pi_{\bar{v}}})\, (\theta_{\bar{v}|\pi_{\bar{v}}} - \theta^*_{\bar{v}|\pi_{\bar{v}}}).$$

Hence, a conjugate local prior is approximately proportional to the multivariate normal distribution $N\left( \theta^*_{\bar{v}|\pi_{\bar{v}}},\, \beta^{-1} v(\theta^*_{\bar{v}|\pi_{\bar{v}}})^{-1} \right)$ in a neighbourhood of $\theta^*_{\bar{v}|\pi_{\bar{v}}}$. Local prior scores for this approximation are derived as

$$S_{\bar{v}|\pi_{\bar{v}}}(\theta) = -\beta\, v(\theta^*_{\bar{v}|\pi_{\bar{v}}})\, (\theta_{\bar{v}|\pi_{\bar{v}}} - \theta^*_{\bar{v}|\pi_{\bar{v}}}),$$

and local prior informations as

$$I_{\bar{v}|\pi_{\bar{v}}}(\theta) = \beta\, v(\theta^*_{\bar{v}|\pi_{\bar{v}}}).$$

The mean $\theta^*_{\bar{v}|\pi_{\bar{v}}}$ and the parameter $\beta$ are unknown factors of the approximately conjugate multivariate normal distribution, which have to be extracted from expert knowledge. For this task it seems reasonable to request domain experts to give a "best guess" $\hat{p}(X_v \mid \pi_v)$ on each conditional distribution, with an assessment of imprecision (or confidence) on each of the probabilities in the form of an interval of variation. Here it should be noticed that if the expert specification is not the same for each table of conditional distributions $p(X_v \mid X_{\mathrm{pa}(v)})$, where $v \in \bar{v}$, these should be forced equal or the expert should reconsider the equality of these tables. If the partitioning into equal tables is indisputable, one should only request generic "best guess" tables $\hat{p}(X_{\bar{v}} \mid X_{\mathrm{pa}(\bar{v})})$, $\bar{v} \in \bar{V}$, with assessments of imprecision.

The mean $\theta^*_{\bar{v}|\pi_{\bar{v}}}$ is derived as the value of $\theta_{\bar{v}|\pi_{\bar{v}}}$ which minimizes the Kullback-Leibler discrepancy between $\hat{p}(X_{\bar{v}} \mid \pi_{\bar{v}})$ and $p(X_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})$,

$$KL\left( \hat{p}(X_{\bar{v}} \mid \pi_{\bar{v}}),\, p(X_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}}) \right) = \sum_{i_{\bar{v}} \in \mathcal{I}_{\bar{v}}} \hat{p}(i_{\bar{v}} \mid \pi_{\bar{v}}) \log \frac{\hat{p}(i_{\bar{v}} \mid \pi_{\bar{v}})}{p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})}.$$

Here, we use the conventions $0 \log(0/a) = 0$ and $a \log(a/0) = \infty$ for $a > 0$. The first and second order derivatives of the discrepancy can be found as, respectively,

$$\frac{\partial}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}}\, KL\left( \hat{p}(X_{\bar{v}} \mid \pi_{\bar{v}}),\, p(X_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}}) \right) = -\sum_{i_{\bar{v}} \in \mathcal{I}_{\bar{v}}} \hat{p}(i_{\bar{v}} \mid \pi_{\bar{v}}) \left( t_{\bar{v}|\pi_{\bar{v}}}(i_{\bar{v}}) - \tau(\theta_{\bar{v}|\pi_{\bar{v}}}) \right)$$

and

$$\frac{\partial^2}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}\, \partial \theta'_{\bar{v}|\pi_{\bar{v}}}}\, KL = v(\theta_{\bar{v}|\pi_{\bar{v}}}).$$

Besides the "best guess", these expressions do not involve statistics which are not already in demand for the implementation of the traditional score and observed information. Hence, with little additional effort in implementation, the minimizing value can be found numerically (if not analytically) by, e.g., a Newton-Raphson method. Ideally the discrepancy is 0. In some situations, however, the discrepancy may be non-zero, which reflects inconsistency between the specified "best guess" distribution and the structural restrictions as specified for the distribution. In this case we choose the nearest distribution (by the discrepancy) which obeys the restrictions. If the discrepancy is very large, this should affect the confidence in the specified distribution or the restrictions.

The parameter $\beta$, which adjusts the variance for the multivariate normal distribution, is determined from the intervals of variation as specified for each probability. By assuming that an interval of variation for a probability equals twice the standard deviation for the marginal distribution of the probability, the adjustment factor is derived as follows. Let $SD(i_{\bar{v}} \mid \pi_{\bar{v}})$ denote half the interval of variation for $p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})$, and denote by $V(p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta^*_{\bar{v}|\pi_{\bar{v}}}))$ the variance of that probability. By the delta method the variance matrix for probabilities can be approximated from the variance matrix for parameters. If we consider the variances for the marginal distributions of probabilities only, these are approximated by the diagonal elements

$$V\left( p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta^*_{\bar{v}|\pi_{\bar{v}}}) \right) \approx \beta^{-1} \left( \frac{\partial p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}} \right)' v(\theta^*_{\bar{v}|\pi_{\bar{v}}})^{-1} \left( \frac{\partial p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})}{\partial \theta_{\bar{v}|\pi_{\bar{v}}}} \right), \qquad (20)$$

where the derivatives are evaluated at $\theta^*_{\bar{v}|\pi_{\bar{v}}}$.
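As a sketch (numbers hypothetical), the local prior score and information of the normal approximation follow directly from $\theta^*$, $\beta$, and $v(\theta^*)$:

```python
# Sketch: local prior score S(theta) = -beta * v(theta*) * (theta - theta*)
# and prior information I(theta) = beta * v(theta*) for the normal
# approximation to the local conjugate prior. One-dimensional case for
# simplicity; theta_star, beta, and v_star are hypothetical values.
theta_star = 0.8     # prior mode
beta = 5.6           # precision adjustment factor
v_star = 0.72        # v(theta*), variance of the canonical statistic

def prior_score(theta):
    return -beta * v_star * (theta - theta_star)

def prior_information(theta):
    # Constant in theta: the log-prior is locally quadratic.
    return beta * v_star
```

The score vanishes at the mode and pulls the parameter back toward it elsewhere, with strength set by the prior information.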
The adjustment factor $\beta(i_{\bar{v}})$, associated with the marginal distribution of $p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}})$, can now be derived from equation (20) by utilizing that $SD(i_{\bar{v}} \mid \pi_{\bar{v}})^2 = V(p(i_{\bar{v}} \mid \pi_{\bar{v}}, \theta_{\bar{v}|\pi_{\bar{v}}}))$. In case of inconsistency between the calculated adjustment factors for different $i_{\bar{v}} \in \mathcal{I}_{\bar{v}}$ we choose the factor which implies the lowest precision for the normal distribution. Hence, $\beta = \min\{\beta(i_{\bar{v}}) \mid i_{\bar{v}} \in \mathcal{I}_{\bar{v}}\}$.
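The Newton-Raphson determination of $\theta^*$ can be sketched for the one-parameter log-linear local model of the running example; the best-guess probabilities are the example's, with the last value inferred here so that the vector sums to one (an assumption).

```python
import math

# Sketch: Newton-Raphson minimization of the KL discrepancy between a
# "best guess" distribution and the log-linear local model
# p(i^r | theta) propto exp(r * theta), r = 0, 1, 2, 3.
# Best guess from the running example; the final probability (0.60) is
# inferred so the vector sums to one.
p_hat = [0.05, 0.10, 0.25, 0.60]
target = sum(r * p for r, p in enumerate(p_hat))   # E_phat[t] = 2.40

def moments(theta):
    """Return (tau, v): mean and variance of t(X) = r under theta."""
    w = [math.exp(r * theta) for r in range(4)]
    z = sum(w)
    tau = sum(r * wr for r, wr in enumerate(w)) / z
    second = sum(r * r * wr for r, wr in enumerate(w)) / z
    return tau, second - tau * tau

theta = math.log(p_hat[1] / p_hat[0])   # starting value log(p^1/p^0)
for _ in range(10):
    tau, v = moments(theta)
    theta -= (tau - target) / v         # dKL/dtheta = tau - E_phat[t]
```

At convergence $\tau(\theta^*)$ matches the best-guess mean $2.40$, reproducing the example's $\theta^* \approx 0.861$.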
Example (continued): Assume that the local distribution $p(X_6 \mid \pi_6, \theta) = (p^0, p^1, p^2, p^3)$ is restricted by the log-linear form

$$\log p(X_6 \mid \pi_6, \theta) = \alpha + X_6\, \theta,$$

where the levels for $X_6$ are real-valued quantities, say $(i^0, i^1, i^2, i^3) = (0, 1, 2, 3)$. Let $\eta_{6|\pi_6}$ and $s_{6|\pi_6}(X_6)$ denote the parameter vector and the statistic for the exponential representation of the local distribution without structural restrictions, as derived earlier in this example. Then $p(X_6 \mid \pi_6, \theta)$ can be represented as the distribution formed by the exponential sub-model of order 1, with parameter $\theta_{6|\pi_6}$ given by the affine transformation

$$\eta_{6|\pi_6} = \left( \log \frac{p^1}{p^0},\ \log \frac{p^2}{p^0},\ \log \frac{p^3}{p^0} \right)' = T\, \theta_{6|\pi_6}, \quad \text{where } T' = (i^1 - i^0,\ i^2 - i^0,\ i^3 - i^0) = (1, 2, 3).$$

For this representation $t_{6|\pi_6}(X_6) = T'\, s_{6|\pi_6}(X_6)$, $\tau(\theta_{6|\pi_6}) = T'\, \tau(\eta_{6|\pi_6})$, and $v(\theta_{6|\pi_6}) = T'\, v(\eta_{6|\pi_6})\, T$.

Now, say that the opinion about the local distribution has been imprecisely specified as $(0.05\,[0.02, \ldots],\ 0.10\,[\ldots],\ 0.25\,[\ldots],\ 0.60\,[\ldots, 0.70])$, where each interval denotes the imprecision of the "best guess" in front of it. The "best guess", $\hat{p}(X_v \mid x_{\mathrm{pa}(v)}) = (0.05, 0.10, 0.25, 0.60)$, almost satisfies the structural restriction. However, a simple check reveals that the log-odds disagree on the value for $\theta_{6|\pi_6}$ which parameterizes the exponential representation of the distribution. Therefore we use a Newton-Raphson method to determine the parameter value $\theta^*_{6|\pi_6}$ which implies the lowest KL-discrepancy between the "best guess" and a distribution of the correct functional form. For this task, first and second order derivatives are derived as

$$\frac{\partial}{\partial \theta_{6|\pi_6}}\, KL = (p^1 + 2p^2 + 3p^3) - (\hat{p}^1 + 2\hat{p}^2 + 3\hat{p}^3)$$

and

$$\frac{\partial^2}{\partial (\theta_{6|\pi_6})^2}\, KL = (p^1 + 4p^2 + 9p^3) - (p^1 + 2p^2 + 3p^3)(p^1 + 2p^2 + 3p^3).$$

Now, starting from the initial value $\theta_{6|\pi_6} = \log(\hat{p}^1 / \hat{p}^0) = \log 2$, five Newton-Raphson iterations move the parameter value to an acceptable value $\theta^*_{6|\pi_6} = 0.861$; see Table 1.

Consider now the variance for the approximately conjugate normal distribution. Except for the adjustment $\beta$, it falls right out of the last Newton-Raphson iteration as

$$v(\theta^*_{6|\pi_6})^{-1} = \left( \frac{\partial^2}{\partial (\theta_{6|\pi_6})^2}\, KL \right)^{-1}.$$

Let $p^{*r}$ denote the probability $p(i^r \mid \pi_6, \theta^*_{6|\pi_6})$. Then

$$\frac{\partial p(i^r \mid \pi_6, \theta_{6|\pi_6})}{\partial \theta_{6|\pi_6}} = r\, p^{*r} - p^{*r}(p^{*1} + 2p^{*2} + 3p^{*3}), \quad r = 0, \ldots, 3.$$

For each interval of variation we can now use (20) to calculate the individual adjustment factors $\beta(i^r)$, $r = 0, \ldots, 3$, as respectively 17.9, 12.3, 5.6, and 17.7. Being the smallest, 5.6 is chosen as the adjustment factor for the variance. Hence, the local conjugate distribution $p(\theta_{6|\pi_6})$ is approximated by $N(0.861, 0.247)$.

Conjugate prior for local multinomial models

In most real situations some of the local models will not hold structural restrictions. In case of multinomial sampling with respect to the variables for a local likelihood, the natural local conjugate prior distribution of parameters is defined by a Dirichlet distribution. Consider a generic local distribution and assume that the probabilities a priori are Dirichlet distributed, $\mathcal{D}(\alpha(i_{\bar{v}}, \pi_{\bar{v}});\ i_{\bar{v}} \in \mathcal{I}_{\bar{v}})$, with a total number of $|\mathcal{I}_{\bar{v}}|$ parameters; a parameter for each index of $\mathcal{I}_{\bar{v}}$.
Hence, the prior distribution of probabilities is of the form
$$p\big(p(i_v \mid \pi_v, \theta_{v|\pi_v}),\ i_v \in I_v\big) \propto \prod_{i_v \in I_v} p(i_v \mid \pi_v, \theta_{v|\pi_v})^{\alpha(i_v, \pi_v) - 1}.$$
By a transformation, as given by the exponential representation of probabilities, the prior becomes a distribution of $\theta_{v|\pi_v}$.
The distribution is given by (noting that the determinant of the Jacobian for the local distribution is proportional to $\prod_{i_v} p(i_v \mid \pi_v, \theta_{v|\pi_v})$)
$$p(\theta_{v|\pi_v}) \propto \prod_{i_v \in I_v} \exp\big[\big(\theta'_{v|\pi_v}\, t_{v|\pi_v}(i_v) - \psi(\theta_{v|\pi_v})\big)\, \alpha(i_v, \pi_v)\big]. \tag{21}$$
From (21) the local prior score and information are easily derived as
$$s_{v|\pi_v}(\theta_{v|\pi_v}) = \sum_{i_v \in I_v} \alpha(i_v, \pi_v)\big(t_{v|\pi_v}(i_v) - \tau(\theta_{v|\pi_v})\big) \tag{22}$$
and
$$i_{v|\pi_v}(\theta_{v|\pi_v}) = \alpha(\pi_v)\, \frac{\partial \tau(\theta_{v|\pi_v})}{\partial \theta'_{v|\pi_v}}, \tag{23}$$
where $\alpha(\pi_v) = \sum_{i_v \in I_v} \alpha(i_v, \pi_v)$.
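As a sanity check on (22), one can verify numerically that the stated score is the gradient of the log of the prior density (21). The sketch below assumes a saturated four-state local distribution with reference state 0 and one-hot statistics $t$; all names are illustrative, not from the paper:

```python
import math

# imaginary counts alpha(i_v, pi_v) for a four-state local distribution
alpha = [1.5, 3.0, 7.5, 3.0]

def psi(theta):
    # log-normalizer of the exponential representation (theta[0] fixed at 0)
    return math.log(sum(math.exp(t) for t in theta))

def tau(theta):
    # mean parameter tau(theta): probabilities of the non-reference states 1..3
    z = sum(math.exp(t) for t in theta)
    return [math.exp(t) / z for t in theta][1:]

def log_prior(theta, alpha):
    # log of (21) up to a constant: sum_i alpha_i * (theta' t(i) - psi(theta))
    return sum(a * (theta[i] - psi(theta)) for i, a in enumerate(alpha))

def prior_score(theta, alpha):
    # (22): sum_i alpha_i * (t(i) - tau(theta)) with one-hot t for states 1..3
    m = tau(theta)
    return [alpha[k] - sum(alpha) * m[k - 1] for k in range(1, len(alpha))]

# compare (22) with a central finite-difference gradient of the log-prior
theta = [0.0, 0.3, -0.2, 0.5]
eps = 1e-6
for k in range(1, 4):
    up = theta[:]; up[k] += eps
    dn = theta[:]; dn[k] -= eps
    fd = (log_prior(up, alpha) - log_prior(dn, alpha)) / (2 * eps)
    assert abs(fd - prior_score(theta, alpha)[k - 1]) < 1e-5
```

The score vanishes exactly when the cell probabilities equal the normalized imaginary counts, as expected of a mode of the Dirichlet prior under this parameterization.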
The similarity of (22) and (23) with the expressions for the traditional local score and information, (15) and (16), leads to the observation that the prior Dirichlet distribution has the effect of adding its parameters as imaginary counts in order to obtain the posterior expressions. Depending on the domain expert, it may in some situations seem unreasonable to request a prior opinion about local distributions directly in the form of imaginary counts giving the parameters of a Dirichlet distribution. Again we overcome the problem by letting the domain expert specify a "best guess" on local distributions together with an interval of variation on each of the probabilities. As also suggested in Spiegelhalter et al. (1993) and Heckerman et al. (1995), the parameters of a Dirichlet distribution can then be calculated from the expressions for the individual means and variances of each random variable of the distribution:
$$E[p(i_v \mid \pi_v, \theta_{v|\pi_v})] = \frac{\alpha(i_v, \pi_v)}{\alpha(\pi_v)}, \qquad \mathrm{Var}[p(i_v \mid \pi_v, \theta_{v|\pi_v})] = \frac{\alpha(i_v, \pi_v)\big(\alpha(\pi_v) - \alpha(i_v, \pi_v)\big)}{\alpha(\pi_v)^2\big(\alpha(\pi_v) + 1\big)}.$$
Assume for the conditional probability $p(i_v \mid \pi_v, \theta_{v|\pi_v})$ that the mean equals the "best guess" $\hat p(i_v \mid \pi_v)$ and that the standard deviation $SD(i_v \mid \pi_v)$ equals half the interval of variation.
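The step from the variance of a Dirichlet cell probability to the sample-size formula that follows is a one-line rearrangement: writing $p$ for the mean $\hat p(i_v \mid \pi_v)$ and $SD^2$ for the variance,

```latex
\mathrm{Var}\,[p(i_v \mid \pi_v, \theta_{v|\pi_v})]
  = \frac{p(1-p)}{\alpha(\pi_v) + 1}
\quad\Longrightarrow\quad
\alpha(\pi_v) = \frac{p(1-p)}{SD^2} - 1 .
```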
The equivalent sample size $\alpha^{i_v}(\pi_v)$ associated with the probability $p(i_v \mid \pi_v, \theta_{v|\pi_v})$ is now derived as
$$\alpha^{i_v}(\pi_v) = \frac{\big(1 - \hat p(i_v \mid \pi_v)\big)\, \hat p(i_v \mid \pi_v)}{SD(i_v \mid \pi_v)^2} - 1.$$
In the case of very large intervals $\alpha^{i_v}(\pi_v)$ may become negative. This is regarded as a token of non-informative prior knowledge, in which case $\alpha^{i_v}(\pi_v)$ is set to the non-informative sample size 0. A consistent specification of the prior distribution requires that $\alpha^{i_v}(\pi_v)$ has the same value for all $i_v \in I_v$. In case of inconsistency between the sample sizes calculated from different interval specifications we choose the smallest. Hence,
$$\alpha(\pi_v) = \min_{i_v \in I_v} \alpha^{i_v}(\pi_v).$$
By the assumption that the means are given by the "best guess" distribution, the parameters of the Dirichlet distribution are then calculated as
$$\alpha(i_v, \pi_v) = \alpha(\pi_v)\, \hat p(i_v \mid \pi_v).$$
Example (continued):
Say that the opinion about the conditional distributions without structural restrictions, $p(X_3 \mid \pi_3, \theta_{3|\pi_3}) = p(X_4 \mid \pi_4, \theta_{4|\pi_4})$, has been imprecisely specified as (0.10[0.04-0.16], 0.20[0.10-0.30], 0.50[0.40-0.60], 0.20[0.10-0.30]). For each imprecisely stated conditional probability we calculate the associated equivalent sample size as respectively 24, 15, 24, and 15. Being the smallest, 15 is chosen as the equivalent sample size for the Dirichlet distributions of $p(X_3 \mid \pi_3, \theta_{3|\pi_3})$ and $p(X_4 \mid \pi_4, \theta_{4|\pi_4})$. The imaginary counts (the parameters that specify the Dirichlet distribution) then become (1.5, 3.0, 7.5, 3.0).

Further issues on modelling
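The equivalent-sample-size arithmetic of this example can be reproduced directly (helper names below are illustrative, not from the paper):

```python
# "best guesses" and intervals of variation from the example
best = [0.10, 0.20, 0.50, 0.20]
intervals = [(0.04, 0.16), (0.10, 0.30), (0.40, 0.60), (0.10, 0.30)]

def equivalent_sample_sizes(best, intervals):
    sizes = []
    for p, (lo, hi) in zip(best, intervals):
        sd = (hi - lo) / 2             # SD = half the interval of variation
        a = (1 - p) * p / sd ** 2 - 1  # equivalent sample size for this cell
        sizes.append(max(a, 0))        # negative => non-informative, set to 0
    return sizes

sizes = equivalent_sample_sizes(best, intervals)
alpha_total = min(sizes)               # inconsistent sizes: choose the smallest
counts = [alpha_total * p for p in best]

print([round(s) for s in sizes])       # -> [24, 15, 24, 15]
print([round(c, 1) for c in counts])   # -> [1.5, 3.0, 7.5, 3.0]
```

The smallest size, 15, multiplied by the best guesses gives the imaginary counts (1.5, 3.0, 7.5, 3.0), as in the example.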
An obvious possibility for even more sophisticated modelling is to relax or annul the assumption of local variation independence. Spiegelhalter and Lauritzen (1990a) suggest (for recursive graphical models) an interesting possibility as follows. Consider the parameter vector for a generic table of local distributions, $\theta_v$. This parameter vector is restricted by assuming that each local vector, $\theta_{v|\pi_v}$, is defined by linear combinations of functions $u^1_v(\cdot), \ldots, u^r_v(\cdot)$ on parent configurations. Hence, for any parent configuration
$$\theta_{v|\pi_v} = a_v\, u_v(\pi_v),$$
where $u_v(\pi_v) = (u^1_v(\pi_v), \ldots, u^r_v(\pi_v))'$ is a vector holding the given values of the functions and $a_v$ is a $|\theta_{v|\pi_v}| \times r$ dimensional matrix of parameters.

Another interesting prospect on modelling would be to extend the recursive exponential models into models which we denote as block recursive exponential models. These models may evolve from recursive exponential models by allowing a block (or group) of variables to sit at each node of the graphical DAG representation of the model. Relations between variables in different blocks are then causal in the direction of the arrows, whereas relations between variables within a block are symmetric. A block may contain variables which are not direct causes of any of the variables within a response block, and vice versa. The class of recursive graphical models is an important specialization of recursive exponential models. Similarly, the block recursive exponential models should admit a specialization into the block recursive graphical models or chain graph models of Lauritzen and Wermuth (1984, 1989), which demands an annulment of local variation independence as proposed above.

Acknowledgements
The author is grateful to Steffen L. Lauritzen for helpful comments and for inspiring this work by providing a first draft on the derivation of the score for recursive graphical models. Grant support was provided by ESPRIT Basic Research Action, project no. 6156 (DRUMS II), as well as the PIFT programme of the Danish Research Councils.

References
Andreassen, S., Jensen, F. V., Andersen, S. K., Falck, B., Kjærulff, U., Woldbye, M., Sørensen, A., Rosenfalck, A., and Jensen, F. (1989). MUNIN - an expert EMG assistant. In Computer-Aided Electromyography and Expert Systems (ed. J. E. Desmedt), chapter 21. Elsevier Science Publishers, Amsterdam.

Chickering, D. M. and Heckerman, D. (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (ed. E. Horvitz and F. Jensen). Morgan Kaufmann, San Francisco.

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Annals of Statistics, 7(2), 269-81.

Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.

Kiiveri, H., Speed, T. P., and Carlin, J. B. (1984). Recursive causal models. Journal of the Australian Mathematical Society, 36 (Series A), 30-52.

Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics & Data Analysis, 19, 191-201.

Lauritzen, S. L., Dawid, A. P., Larsen, B. N., and Leimer, H.-G. (1990). Independence properties of directed Markov fields. Networks, 20, 491-505.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157-224.

Lauritzen, S. L. and Wermuth, N. (1984). Mixed interaction models. Technical Report R, Institute for Electronic Systems, Aalborg University.

Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17(1), 31-57.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Series in Representation and Reasoning. Morgan Kaufmann, San Francisco.

Russell, S., Binder, J., Koller, D., and Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Fourteenth International Joint Conference on Artificial Intelligence (ed. C. S. Mellish). Morgan Kaufmann, San Francisco.

Spiegelhalter, D. J. and Lauritzen, S. L. (1990a). Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579-605.

Spiegelhalter, D. J. and Lauritzen, S. L. (1990b). Techniques for Bayesian analysis in expert systems. Annals of Mathematics and Artificial Intelligence, 2, 353-66.

Spiegelhalter, D. J., Dawid, A. P., Lauritzen, S. L., and Cowell, R. G. (1993). Bayesian analysis in expert systems. Statistical Science, 8(3), 219-47.

Thiesson, B. (1995). Accelerated quantification of Bayesian networks with incomplete data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (ed. U. M. Fayyad and R. Uthurusamy), pp. 306-11. AAAI Press.

Wermuth, N. and Lauritzen, S. L. (1983). Graphical and recursive models for contingency tables. Biometrika, 70, 537-52.