Journal of Artificial Intelligence Research 24 (2005) 759-797. Submitted 10/04; published 12/05.
Relational Dynamic Bayesian Networks
Sumit Sanghai
sanghai@cs.washington.edu
Pedro Domingos
pedrod@cs.washington.edu
Daniel Weld
weld@cs.washington.edu
Department of Computer Science and Engineering, University of Washington
Abstract
Stochastic processes that involve the creation of objects and relations over time are widespread, but relatively poorly studied. For example, accurate fault diagnosis in factory assembly processes requires inferring the probabilities of erroneous assembly operations, but doing this efficiently and accurately is difficult. Modeled as dynamic Bayesian networks, these processes have discrete variables with very large domains and extremely high dimensionality. In this paper, we introduce relational dynamic Bayesian networks (RDBNs), which are an extension of dynamic Bayesian networks (DBNs) to first-order logic. RDBNs are a generalization of dynamic probabilistic relational models (DPRMs), which we had proposed in our previous work to model dynamic uncertain domains. We first extend the Rao-Blackwellised particle filtering described in our earlier work to RDBNs. Next, we lift the assumptions associated with Rao-Blackwellisation in RDBNs and propose two new forms of particle filtering. The first uses abstraction hierarchies over the predicates to smooth the particle filter's estimates. The second employs kernel density estimation with a kernel function specifically designed for relational domains. Experiments show that these two methods greatly outperform standard particle filtering on the task of assembly plan execution monitoring.
1. Introduction
Sequential phenomena abound in the world, and uncertainty is a common feature of them. Dynamic Bayesian networks (DBNs), one of the most powerful representations available for such phenomena, represent the state of the world as a set of variables, and model the probabilistic dependencies of the variables within and between time steps (Dean & Kanazawa, 1989). While a major advance over previous approaches, DBNs are essentially propositional, with no notion of objects or relations; hence DBNs are unable to compactly represent many real-world domains. For example, manufacturing plants assemble complex artifacts (e.g., cars, computers, aircraft) from large numbers of component parts, using multiple kinds of machines and operations. Capturing such a domain in a DBN would require exhaustively representing all possible objects and relations among them, which is impractical.

Formalisms that can represent objects and relations, as opposed to just variables, have a long history in AI. Recently, significant progress has been made in combining them with a principled treatment of uncertainty. In particular, probabilistic relational models or PRMs (Friedman, Getoor, Koller, & Pfeffer, 1999) are an extension of Bayesian networks that allows reasoning with classes, objects and relations. Recently, we proposed dynamic probabilistic relational models (DPRMs) (Sanghai, Domingos, & Weld, 2003), which combine PRMs and DBNs to allow reasoning with classes, objects and relations in a dynamic environment. We also developed a relational Rao-Blackwellised particle filtering mechanism for state monitoring in DPRMs.
In this paper we introduce relational dynamic Bayesian networks (RDBNs), which extend DBNs to first-order (relational) domains. RDBNs subsume DPRMs and have several advantages over them, including greater simplicity and expressivity. Furthermore, they may be more easily learned using ILP techniques. We develop a series of efficient inference procedures for RDBNs (which are also applicable to DPRMs or any other relational stochastic process model). The Rao-Blackwellised particle filtering described in our previous paper requires two strong assumptions which restrict its applicability. We lift these assumptions, developing two new forms of particle filtering. In the first approach, we build an abstraction hierarchy over the first-order predicates and use it to smooth the particle filter estimates. In the second approach, we introduce a variant of kernel density estimation with a kernel function specifically designed for relational domains.

Early fault detection can greatly reduce the cost of manufacturing processes. In this paper we apply our inference algorithms to execution monitoring of assembly plans, showing that our methods scale to problems with over a thousand objects and thousands of steps. Other domains where our techniques may be helpful include robot control, vision in motion, language processing, computational modeling of markets, battlefield management, cell biology, ecosystem modeling, and analysis of Web information. The following are the significant contributions of this paper:

• We define relational dynamic Bayesian networks (RDBNs), which allow modeling uncertainty in dynamic relational domains.

• We present several novel methods for inference in RDBNs, which use Rao-Blackwellised particle filtering, smoothing on relational abstraction hierarchies, and relational kernel density estimation.
• We apply RDBNs to fault diagnosis in factory assembly processes, showing that the inference algorithms we propose outperform traditional particle filtering.

The rest of the paper is structured as follows. In Section 2 we review DBNs and briefly discuss the different filtering algorithms applicable to them. We introduce RDBNs in Section 3, and in Sections 4, 5 and 6 we describe our inference methods. In Section 7 we report our experimental results. In Section 8 we discuss related work and show that RDBNs subsume DPRMs. We conclude with a discussion of future work.
2. Background

A Bayesian network encodes the joint probability distribution of a set of variables, {Z_1, ..., Z_d}, as a directed acyclic graph and a set of conditional probability models. Each node corresponds to a variable, and the model associated with it allows us to compute the probability of a state of the variable given the state of its parents. The set of parents of Z_i, denoted Pa(Z_i), is the set of nodes with an arc to Z_i in the graph. The structure of the network encodes the assertion that each node is conditionally independent of its non-descendants given its parents. The probability of an arbitrary event Z = (Z_1, ..., Z_d) can then be computed as P(Z) = ∏_{i=1}^{d} P(Z_i | Pa(Z_i)).

Dynamic Bayesian networks (DBNs) (Dean & Kanazawa, 1989) are an extension of Bayesian networks for modeling dynamic systems. In a DBN, the state at time t is represented by a set of random variables Z_t = (Z_{1,t}, ..., Z_{d,t}). The state at time t is dependent on the states at previous time steps. Typically, we assume that each state only depends on the immediately preceding state (i.e., the system is first-order Markov), and thus we need to represent the transition distribution P(Z_{t+1} | Z_t). This can be done using a two-time-slice Bayesian network fragment (2TBN) B_{t+1}, which contains variables from Z_{t+1} whose parents are variables from Z_t and/or Z_{t+1}, and variables from Z_t without their parents. Typically, we also assume that the process is stationary, i.e., the transition models for all time slices are identical: B_1 = B_2 = ... = B_t = B_→.
Thus a DBN is defined to be a pair of Bayesian networks (B_0, B_→), where B_0 represents the initial distribution P(Z_0), and B_→ is a two-time-slice Bayesian network, which as discussed above defines the transition distribution P(Z_{t+1} | Z_t).

The set Z_t is commonly divided into two sets: the unobserved state variables X_t and the observed variables Y_t. The observed variables Y_t are assumed to depend only on the current state variables X_t. The joint distribution represented by a DBN can then be obtained by unrolling the 2TBN:

P(X_0, ..., X_T, Y_0, ..., Y_T) = P(X_0) P(Y_0 | X_0) ∏_{t=1}^{T} P(X_t | X_{t−1}) P(Y_t | X_t)   (1)

Various types of inference are possible in DBNs. In this paper, we will focus on state monitoring (also known as filtering or tracking). However, the methods that we will propose later can be used in other types of inference.

The goal in state monitoring is to estimate the current state of the world given the observations made up to the present, i.e., to compute the distribution P(X_T | Y_0, Y_1, ..., Y_T). Proper state monitoring is a necessary precondition for rational decision-making in dynamic domains. Since inference in DBNs is NP-complete, we usually resort to approximate methods, of which the most widely used is particle filtering (Doucet, de Freitas, & Gordon, 2001). Particle filtering is a stochastic algorithm which maintains a set of particles (samples) x_t^1, x_t^2, ..., x_t^N to approximately represent the distribution of possible states at time t given the observations. Each particle x_t^i contains a complete instance of the current state, i.e., a sampled value for each state variable. The current distribution is then approximated by

P(X_T = x | Y_0, Y_1, ..., Y_T) = (1/N) ∑_{i=1}^{N} δ(x_T^i = x)   (2)

where δ(x_T^i = x) is 1 if the state represented by x_T^i is the same as x, and 0 otherwise. The particle filter starts by generating N particles according to the initial distribution P(X_0).
Then, at each step, it first generates the next state x_{t+1}^i for each particle i by sampling from P(X_{t+1} | x_t^i). It then weights these samples according to the likelihood they assign to the observations, P(Y_{t+1} | x_{t+1}^i), and resamples N particles from this weighted distribution. The particles will thus tend to stay clustered in the more probable regions of the state space, according to the observations.

Although particle filtering has scored impressive successes in many applications, one significant limitation is of special concern: it tends to perform poorly in high-dimensional state spaces. This problem can be reduced by analytically marginalizing out some of the variables, a technique known as Rao-Blackwellisation (Murphy & Russell, 2001). When the state space X_t can be divided into two subspaces U_t and V_t such that P(V_t | U_t, Y_0, ..., Y_t) can be efficiently computed analytically, we only need to sample from the smaller space U_t, and this requires far fewer particles for the same accuracy. Each particle is now composed of a sample from P(U_t | Y_0, ..., Y_t) plus a parametric representation of P(V_t | U_t, Y_0, ..., Y_t). For example, if the variables in V_t are discrete and independent of each other given U_t, we can store for each variable the vector of parameters of the corresponding multinomial distribution (i.e., the probability of each value).
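The sample-weight-resample loop described above can be sketched as follows. This is a minimal bootstrap filter, not the relational algorithms developed later in the paper; the function names and the two model callbacks are illustrative.

```python
import random

def particle_filter_step(particles, transition_sample, obs_likelihood, y):
    """One sample-weight-resample step of a bootstrap particle filter.

    transition_sample(x) draws a successor state from P(X_{t+1} | X_t = x);
    obs_likelihood(y, x) returns P(Y_{t+1} = y | X_{t+1} = x). Both callbacks
    stand in for the DBN's transition and observation models.
    """
    # Propagate: sample a next state for every particle.
    proposed = [transition_sample(x) for x in particles]
    # Weight: the likelihood each proposed state assigns to the observation.
    weights = [obs_likelihood(y, x) for x in proposed]
    if sum(weights) == 0:
        # Every particle is inconsistent with y; fall back to uniform weights.
        weights = [1.0] * len(proposed)
    # Resample: draw N particles in proportion to the weights.
    return random.choices(proposed, weights=weights, k=len(particles))
```

Repeating this step concentrates the particles in state-space regions that explain the observations well, which is exactly the clustering behavior noted above.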
3. Relational Dynamic Bayesian Networks
In this section we show how to represent probabilistic dependencies in a dynamic relational domain by combining DBNs with first-order logic. We start by defining relational and dynamic relational domains in terms of first-order logic, and then define relational dynamic Bayesian networks (RDBNs), which can be used to model uncertainty in such domains.

A relational domain contains a set of objects with relations between them. The domain is represented by constants, variables, functions, terms and predicates.
Constants are symbols used to represent objects (e.g., plate1 can be one of the plate objects in the factory assembly domain) or the attributes of objects (e.g., red is a constant which can be the color of a plate) in the domain. Variables range over the objects, and both the constants and the variables can be typed, in which case the variables take on values only of the corresponding type. Functions f(x_1, ..., x_n) take objects as arguments and return an object. Functions are associated with an arity n, which fixes the number of arguments that the function may take. A predicate R is a symbol used to represent relations between objects in the domain or attributes of objects. An interpretation specifies which objects, functions and relations in the domain are represented by which symbols. A term is an expression used to represent an object in the relational domain. A string t is a term if (a) t is a constant symbol, or (b) t is a variable, or (c) t is of the form f(t_1, ..., t_n), where f is a function and each of the t_i is a term. Each predicate symbol R is associated with an arity n, and an atomic formula R(t_1, ..., t_n) is a predicate symbol applied to an n-tuple of terms (e.g., Weld(x, y) means that objects x and y are welded, and Color(x, Red) means that the color of object x is Red). A ground term is a term containing no variables. A ground atomic formula or ground predicate is an atomic formula all of whose arguments are ground terms. Each ground predicate is associated with a truth value, and the state of the domain is given by the truth values assigned to all possible ground predicates.
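Concretely, such a state can be stored as a set of true ground predicates, one tuple per ground atom. This is a toy sketch with hypothetical predicate and constant names, not an implementation from the paper:

```python
# State of a relational domain: the set of true ground predicates, each a
# (predicate, arg1, ..., argn) tuple. Anything absent is false under the
# closed world assumption, as in a relational database.
state = {
    ("Color", "plate1", "red"),
    ("Color", "bracket7", "red"),
    ("Bolted_To", "plate1", "bracket7"),
}

def holds(state, predicate, *args):
    """Truth value of a ground predicate under the closed world assumption."""
    return (predicate, *args) in state
```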
Definition 1 (Relational Domain)
Syntax: A relational domain is a set of constants, variables, functions, terms, predicates and atomic formulas R(t_1, ..., t_n), where each argument t_i is a term. The set of all possible ground predicates is the set of all predicates with constants (or functions applied to constants) as arguments.
Semantics: Each ground predicate in a relational domain can be either true or false. The state of a relational domain is the set of ground predicates that are true.

The set of all true ground predicates can be represented explicitly as tuples in a relational database, and under the closed world assumption this corresponds to a state of the world.

In an uncertain domain, the truth value of a ground predicate can be uncertain, and the value can potentially depend on the values of other ground predicates. These dependencies can be specified using a Bayesian network on the ground predicates. However, the number of such ground predicates is exponential in the size of the domain (number of constants), and hence the explicit construction
of such a Bayesian network would be infeasible. We use
relational Bayesian networks (RBNs) to compactly represent the uncertainty in the system. The relational Bayesian network specifies the dependency between the predicates at the first-order level by using first-order expressions, which include existential and universal quantifiers, and aggregate functions such as count, etc.

Definition 2 (Relational Bayesian Network: RBN)
Syntax: Given a relational domain, a relational Bayesian network is a graph which, for every first-order predicate R, contains:

• A node in the graph.

• A set of parents Pa(R) = {R_1(t_{1,1}, ..., t_{1,m_1}), ..., R_l(t_{l,1}, ..., t_{l,m_l})}, which are a subset of the predicates in the graph (possibly including R itself). The parents are indicated by directed edges in the graph from the parent to the child.

• A conditional probability model for P(R | Pa(R)), which is a function with range [0,1] defined over all the variables in Pa(R).

Semantics: A relational Bayesian network defines a Bayesian network on the ground predicates in the relational domain. For every ground predicate R(c_1, ..., c_m) a node is created, and its parents are obtained by making the substitutions x_i/c_i in the terms t_{jk} which appear in the predicate's parent list. The conditional model for a ground predicate is the function restricted to the particular ground predicate and its parents.

Thus, a relational Bayesian network gives a joint probability distribution on the state of the relational domain.
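The grounding step in the semantics can be sketched as follows. The data layout and function names are ours, and a real implementation would also handle functions and typed constants:

```python
def ground_atom(atom, substitution):
    """Apply a substitution {variable: constant} to an atomic formula.

    `atom` is a (predicate, arg1, ..., argn) tuple; arguments found in the
    substitution are variables and get replaced, the rest are constants.
    """
    predicate, *args = atom
    return (predicate, *(substitution.get(a, a) for a in args))

def ground_parent_list(child_args, child_vars, parent_atoms):
    """Parents of the ground predicate R(c_1, ..., c_m): substitute x_i/c_i
    into each first-order parent atom, per the RBN semantics above."""
    theta = dict(zip(child_vars, child_args))
    return [ground_atom(atom, theta) for atom in parent_atoms]
```

For instance, grounding a child R(plate1, bracket7) whose first-order parents are Color(x, red) and Weld(x, y) yields the ground parents Color(plate1, red) and Weld(plate1, bracket7).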
Example of an RBN
Consider a factory assembly process where plates, brackets, etc., are welded and bolted to form complex objects. The plates and the brackets form the objects in the domain. Their properties, such as color, shape, etc., can be represented using predicates Color, Shape, etc. Predicates Bolt and Weld can be used to represent the relationships between the objects. Many of the relationships and the properties of the objects can be uncertain because of faults in the assembly process. For example, a part may be bolted to the wrong part, and this may be more likely if the wrong part and the intended one have the same color. An RBN can model this by having Color as the parent of the Bolt predicate, and the conditional probability model can be used to represent the exact dependency.

To avoid cycles appearing in the network obtained after expansion, we need to restrict the set of parents of a predicate. To achieve this, we assume an ordering ≺ on the predicates and the constants. The ordering forms part of the description of a relational Bayesian network, and is composed of two parts:

• A complete ordering on the predicates in the relational domain.

• A complete ordering on the constants of each type.
1. Our RBNs are related to but different from the relational Bayesian networks of Jaeger (1997); see Section 8.
2. Strictly speaking, the function is defined over the ground predicates obtained after instantiation. However, for simplicity we have defined it over the first-order predicates.
The ordering ≺ between the ground predicates is now given by the following rules:

• R(x_1, ..., x_n) ≺ R′(x′_1, ..., x′_m) if R ≺ R′.

• R(x_1, ..., x_n) ≺ R(x′_1, ..., x′_n) if there exists an i such that x_i ≺ x′_i and x_j = x′_j for all j < i, where x_k and x′_k are constants for all k.

We now restrict the set of parents of a predicate in a relational Bayesian network as follows:

• The parent set Pa(R) of a predicate R can contain a predicate R′ only if either R′ ≺ R or R′ = R.

• If Pa(R) contains R, then during the expansion R(x_1, ..., x_n) has a parent R(x′_1, ..., x′_n) only if R(x′_1, ..., x′_n) ≺ R(x_1, ..., x_n).

This ordering implies that in the expanded Bayesian network each ground predicate can only have higher-order ground predicates (w.r.t. ≺) as parents, and hence there cannot be a cycle.

The conditional model can be any first-order conditional model and can be chosen depending on the domain, the model's applicability and ease of use. In this paper, we will be using first-order probability trees (FOPTs) as our conditional model. They can be viewed as a combination of first-order trees (Blockeel & De Raedt, 1998) and probability estimation trees (Provost & Domingos, 2003). Before defining FOPTs we need to define first-order formulas.

Definition 3 (First-order Formula)
Syntax: A first-order formula F is of one of the following forms:

• an atomic formula R(t_1, ..., t_n), where R is a predicate of arity n and each t_i is a term.

• ¬F′ or (F′ ∧ F″) or (F′ ∨ F″), where F′ and F″ are first-order formulas.

• ∃x F′ or ∀x F′, where x is a variable and F′ is a first-order formula.

• (= n) x F′ or (< n) x F′ or (> n) x F′, where x is a variable, F′ is a first-order formula and n is an integer.

Semantics: The semantics of first-order formulas is the same as in standard first-order logic. Additionally, formulas of the form (≥ n) x F′ represent a generalized form of quantification. For example, (≥ n) x F′ is equivalent to the formula ∃x_1 · · · x_n F′(x_1) ∧ F′(x_2) ∧ · · · ∧ F′(x_n) ∧ x_1 ≠ x_2 ≠ · · · ≠ x_n, and represents the count aggregation. Other aggregators, such as max, min, etc., can also be defined in a similar way.
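On a finite domain, these counting quantifiers can be evaluated by direct enumeration. A minimal sketch (the helper names are ours, not the paper's):

```python
def count_sat(domain, formula):
    """Number of objects x in the domain for which formula(x) holds."""
    return sum(1 for x in domain if formula(x))

def at_least(n, domain, formula):
    """(>= n) x F'(x): true iff F' holds for at least n distinct objects."""
    return count_sat(domain, formula) >= n
```

For example, (≥ 2) x Color(x, Red) over a three-object domain with two red objects evaluates to true, while (≥ 3) x Color(x, Red) evaluates to false.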
Definition 4 (First-order Probability Tree: FOPT)
Syntax: Given a predicate R and its parents R_1, · · ·, R_n, a first-order probability tree (FOPT) is a tree where

• Each interior node n contains a first-order formula F_n on the parent predicates.

• The child of a node n corresponds to either the true or false outcome of the first-order formula F_n.

• The leaves contain a function with range [0,1] and domain the cross product of all ground parent predicates of R.

Semantics: An FOPT defines a conditional model for a ground predicate given its parents. The probability is obtained by starting at the root node, evaluating first-order expressions and following the relevant path in the tree to the leaf, which encodes the probability in the form of a function.

Figure 1: A first-order probability tree for the Bolted-To(x,y) predicate. [The root tests ∃c Color(x,c) ∧ Color(y,c); the true branch leads to probability 0.3 and the false branch to 0.05.]

An FOPT can contain free variables and quantifiers/aggregators over them. Moreover, the quantification of a variable is preserved throughout the descendants, i.e., if a variable x is substituted by a constant c at a node n, then x takes c as its value over all the descendants of n. To avoid cycles in the network, quantified variables in an FOPT range only over values that precede the child node's values in the ≺ ordering. The function at the leaf gives the probability of the ground predicate being true.

Just like a Bayesian network is completely specified by providing a CPT for each variable, an RBN can be completely specified by having an FOPT for each first-order predicate.

Example of a first-order probability tree
Continuing the example of the RBN, the Bolt predicate is dependent on the Color predicate. Figure 1 shows the FOPT for the Bolt predicate. The root node checks the color of the two parts x and y. If they have the same color, then the probability is 0.3. If they do not have the same color, then the probability is 0.05.

We will now consider dynamic relational domains, where the state of the domain changes at every time step. In a dynamic relational domain a ground predicate can be true or false depending on the time step t. Therefore we add a time argument to each predicate: R(x_1, ..., x_n, t), where t is a non-negative integer variable and indicates the time step.

Definition 5 (Dynamic Relational Domain)
Syntax: A dynamic relational domain is a set of constants, variables, functions, terms, predicates and atomic formulas R(t_1, ..., t_n, t), where each of the arguments t_i is a term and t is the time step. The set of all possible ground predicates at time t is obtained by replacing the variables in the arguments with constants and replacing the functions with their resulting constants.
Semantics: The state of a dynamic relational domain at time t is the set of ground predicates that are true at time t.

As before, the dynamic relational domain can contain uncertainty, and specifying the dependencies using a dynamic Bayesian network on the ground predicates is infeasible. We specify the
dependencies using a relational dynamic Bayesian network. As is the case with dynamic Bayesian networks, we make the assumption that the dependencies are first-order Markov, i.e., the predicates at time t can only depend on the predicates at time t or t − 1. We also need to add the fact that a grounding at time t − 1 precedes a grounding at time t:

• R(x_1, ..., x_n, t) ≺ R′(x′_1, ..., x′_m, t′) if t < t′.

This takes precedence over the ordering between the predicates.

Definition 6 (Two-time-slice relational dynamic Bayesian network: 2-TRDBN)
A 2-TRDBN is a graph which, given the state of the domain at time t, gives a distribution on the state of the domain at time t + 1. A 2-TRDBN is defined as follows. For each predicate R at time t (i.e., predicate R restricted to groundings at time t), we have:

• A set of parents
Pa(R) = {Pa_1, ..., Pa_l}, where each Pa_i is a predicate at time t − 1 or t. If Pa_i is at time t, then either Pa_i ≺ R or Pa_i = R.

• A conditional probability model for P(R | Pa(R)), which is an FOPT on the parent predicates. If Pa_i = R, its groundings are restricted to those which precede the given grounding of R.

Definition 7 (Relational Dynamic Bayesian Network: RDBN)
Syntax: A relational dynamic Bayesian network is a pair of networks (M_0, M_→), where M_0 is an RBN with all t = 0 and M_→ is a 2-TRDBN.
Semantics: M_0 represents the probability distribution over the state of the relational domain at time 0. M_→ represents the transition probability distribution, i.e., it gives the probability distribution on the state of the domain at time t + 1 given the state of the domain at time t.

An RDBN gives rise to a dynamic Bayesian network in the same way that a relational Bayesian network gives a Bayesian network. At time t a node is created for every ground predicate, and edges are added between the predicate and its parents. (If t > 0, then the parents are obtained from M_→; otherwise from M_0.) The conditional model at each node is given by the conditional model restricted to the particular grounding of the predicate. We now discuss an example of an RDBN using FOPTs.

Example of an RDBN
Consider the factory assembly domain as before, where plates, brackets, etc. are welded and bolted to form complex objects. The plates and the brackets have attributes such as shape, size and color. Additionally, a plate can be bolted to a bracket, which is indicated by the Bolted-To relation between them. The parts are now assembled over time by performing actions such as painting and bolting, which change the attributes of the objects and the relationships between the objects. The actions can be fault-prone, which leads to uncertainty and probabilistic dependencies between different attributes. For example, the presence of a bolt between a plate and a bracket might depend on the similarity of their colors, shapes, and other attributes. This can happen because a bolt action can incorrectly bolt objects which are similar to the objects supposed to be bolted. Figure 2 shows the RDBN at time slices t − 1, t and t + 1. The nodes in the graph represent the predicates and the edges show the dependencies between them. The predicate Bolted-To(x, y, t) represents a bolt between a bracket x and a plate y at time t. The predicates Color(y, c, t) and Shape(y, s, t) represent the color and shape of a bracket y with values c and s respectively. The predicate Bolt(x, y, t) represents the bolt action performed between the objects x and y at time t.

Figure 2: An RDBN representing the assembly domain. [Nodes include Color(x, c, t), Shape(y, s, t), Bolt(x, y, t) and Bolted-To(x, y, t) across time slices t − 1, t and t + 1.]

Without loss of generality, we assume that exactly one action is performed per time step. The graph shows that the Bolted-To predicate at time t depends on the action performed and the predicates Bolted-To, Color and Shape at time t − 1. Figure 3 shows the FOPT for the Bolted-To attribute. The leaves in the FOPT (represented by the boxes) contain the probabilities for the predicate being true, and the intermediate nodes contain first-order expressions testing the various conditions. The left and right branches correspond to the expressions being true and false respectively. The FOPT represents the fact that if the bolt between objects x and y existed at time t − 1, then it exists at time t. Otherwise, if a bolt action was performed on objects x and y, then the probability of x and y getting bolted is 0.9. On the other hand, if the bolt action was performed on objects x and z, then the probability of x and y getting bolted depends upon the similarity between y and z. In this example, two objects are similar if their colors are the same. If y and z are similar, then the probability that x and y get bolted is inversely proportional to the number of objects similar to z. We model this by using the count aggregator in the function at the leaf. The expression count(w | Bracket(w) ∧ Color(w, c, t − 1)) gives the number of brackets that have the color c (the same as that of z) at time t − 1. In this example, we allow multiple objects to get bolted due to a single action. However, we might also wish to model mutual exclusion between the ground predicates. This is achieved by creating a special predicate Mutex(t), which depends on the ground predicates that are involved in the action performed at the previous time step. Figure 4 shows an FOPT for the
Mutex(t) predicate. Predicates Weld(x, y, t) and Welded-To(x, y, t) refer to the weld action and relation respectively. The FOPT represents that if a bolt or weld action was performed, then Mutex(t) is true only if at most one additional weld or bolt ground predicate is true at time t. During inference, Mutex(t) is set to true with probability 1, which forces the mutex relation between the ground predicates.

Looking at Figure 3, one might conclude that FOPTs form a tedious representation of the assembly domain. However, this is due to the complex nature of the assembly process, and a DPRM modeling the assembly domain would also be quite complex.
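To make the FOPT evaluation procedure concrete, here is a minimal interpreter sketch. The node layout and names are our own, not the paper's data structures, and the example tree simplifies Figure 1's existential color test to a direct equality check on sampled color values:

```python
class FOPTNode:
    """Interior nodes carry a formula (a boolean predicate over the state);
    leaves carry a function from the state to a probability in [0, 1]."""

    def __init__(self, formula=None, true_child=None, false_child=None, leaf=None):
        self.formula = formula
        self.true_child = true_child
        self.false_child = false_child
        self.leaf = leaf

def fopt_prob(node, state):
    """Walk from the root, following the true/false branch at each interior
    node, until a leaf yields the probability of the ground predicate."""
    while node.leaf is None:
        node = node.true_child if node.formula(state) else node.false_child
    return node.leaf(state)

# The FOPT of Figure 1: if x and y share a color, P(Bolt) = 0.3, else 0.05.
figure1 = FOPTNode(
    formula=lambda s: s["color_x"] == s["color_y"],
    true_child=FOPTNode(leaf=lambda s: 0.3),
    false_child=FOPTNode(leaf=lambda s: 0.05),
)
```

The leaf functions can inspect the state, which is how a count-based leaf such as the one in Figure 3 would be expressed.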
Figure 3: A first-order probability tree for the Bolted-To(x,y,t) predicate. [Root: Bolted-To(x, y, t−1), true leaf 1.0; else Bolt(x, y, t), true leaf 0.9; else ∃z Bolt(x, z, t) followed by ∃c Color(y, c, t−1) ∧ Color(z, c, t−1), whose true leaf is 0.1 / count(w | Bracket(w) ∧ Color(w, c, t−1)) and whose false leaves are 0.0.]

Figure 4: Representing mutual exclusion: an FOPT for the Mutex(t) predicate; f(a) is 0 if a > 1, else 1. [Root: ∃u ∃v Bolt(u, v, t); inner node: ∃p ∃q Weld(p, q, t); the leaves apply f to count(x, y | Bolted-To(x, y, t) ∧ Bolted-To(x, y, t−1)) and count(r, s | Welded-To(r, s, t) ∧ Welded-To(r, s, t−1)).]
4. Rao-Blackwellized Particle Filtering in RDBNs
This paper addresses the task of state monitoring of a complex relational stochastic process. In the previous section we saw how an RDBN can be used for modeling purposes. In the next few sections we will see how to do efficient inference in RDBNs. As described before, expanding an RDBN gives rise to a DBN. In principle, we can perform inference on this DBN using particle filtering. However, the filter is likely to perform poorly, because for non-trivial RDBNs the state space of the expanded DBN will be extremely large. The DBN will contain a variable for every ground predicate at each time slice, and the number of ground predicates is on the order of the size of the domain (which can be in the tens of thousands or more) raised to the arity. We overcome this by adapting Rao-Blackwellisation to the relational setting. We will describe all of our algorithms for predicates with two arguments (excluding the time argument). Their generalization to predicates of higher arity is straightforward.
We classify the predicates into two categories, complex and simple, based on the number, size and types of the arguments. A predicate is termed complex if the domain size of the two arguments is large. It is termed simple otherwise. Although we do not give a precise definition of large, the intuition becomes clear if we look at the predicates Color(x, c, t) and Bolted-To(x, y, t). The domain size of c in Color(x, c, t) is presumably small, whereas the domain sizes of x and y, which represent plates and brackets, could be very large. Hence the particle filter will perform poorly on complex predicates. We now make the following assumptions about the predicates (in later sections we remove them, developing algorithms which can be applied in broader settings):

A1: Uncertain complex predicates do not appear in the RDBN as parents of other predicates.

A2: For any object o, there is at most one other object o′ such that the ground predicate R(o, o′, t) is true. Similarly, there exists exactly one other object o′ such that R(o′, o, t) is true.

Assumption A1 will, for example, preclude a model where the color of a plate could depend on the properties of brackets bolted to the plate, if the bolt relationship was uncertain. Assumption A2 enforces that a plate can be bolted to at most one bracket (unless we have different predicates for different bolt points). Although these assumptions are restrictive, they still allow us to model many complex domains, and they yield the following very desirable property.

Proposition 1
Assumptions A1 and A2 together imply that, given the simple predicates and known complex predicates at times t and t − 1, the joint distribution of the unobserved complex predicates at time t is a product of multinomials, one for each predicate.

Assumption A1 implies that all parents of an unobserved complex predicate are simple or known. Assumption A2 enforces mutual exclusion between the objects participating in a relation with a particular object. Therefore, given a complex first-order predicate and an object, the probabilities of the corresponding ground predicates being true can be seen as a multinomial, with a single trial, over the objects participating in the relation (i.e., the first-order predicate) with the given object. Additionally, the simple predicates are independent of the unknown complex predicates, and the complex predicates are independent of each other. Proposition 1 follows.
3. These are the predicates whose corresponding ground predicates have uncertain truth values.
Moreover, by assumption A1, unobserved simple predicates can be sampled without regard to unobserved complex ones. Thus, given these assumptions, Rao-Blackwellisation can be applied to speed inference. Recall from Section 2.2 that a Rao-Blackwellised particle is composed of sampled values for all ground simple predicates, plus probability parameters for each complex predicate. The element R(o_i, o′_j, t) stores the probability that the relation holds between objects o_i and o′_j at time t, conditioned on the values of the simple predicates in the particle. Rao-Blackwellising the complex predicates can vastly reduce the size of the state space which particle filtering needs to sample. For example, in the FOPT shown in Figure 3, the Bolted-To predicate is a complex predicate. Since it only depends on the Color and Shape predicates and the action performed, it satisfies Assumption A1. If, additionally, we do not allow one part to be bolted to more than one part, we can Rao-Blackwellize the Bolted-To predicate (and sample the Color and Shape predicates), which can save much space and time.
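To make the representation concrete, the following is a minimal Python sketch of a Rao-Blackwellised particle under assumptions A1 and A2. The class and method names (`RBParticle`, `update_complex`) are our own illustration, not code from the paper.

```python
class RBParticle:
    """Sketch of a Rao-Blackwellised particle (illustrative names, not the
    paper's code): ground simple predicates are sampled, while each complex
    predicate R(o, ., t) is summarized analytically by a single-trial
    multinomial over the candidate objects o', as licensed by A1 and A2."""

    def __init__(self, objects):
        self.objects = list(objects)
        self.simple = {}     # sampled values, e.g. {("Color", "p1"): "red"}
        self.complex = {}    # {("Bolted-To", "p1"): {"b1": 0.9, "b2": 0.1}}

    def set_simple(self, pred, obj, value):
        self.simple[(pred, obj)] = value

    def update_complex(self, rel, obj, likelihoods):
        """Bayes-update the multinomial for R(obj, ., t) with per-object
        likelihoods from the transition/observation model, then renormalize."""
        uniform = 1.0 / len(self.objects)
        prior = self.complex.get((rel, obj), {o: uniform for o in self.objects})
        post = {o: p * likelihoods.get(o, 1.0) for o, p in prior.items()}
        z = sum(post.values())
        self.complex[(rel, obj)] = {o: p / z for o, p in post.items()}
```

A full filter would also resample particles by observation likelihood; the sketch shows only the analytic update that replaces sampling of the complex predicate.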
Even with Rao-Blackwellization, if the relational domain contains a large number of objects and relations, storing and updating all the requisite probabilities can still be quite expensive. This can be ameliorated if context-specific independencies exist, i.e., if a complex predicate is independent of some simple predicates given assignments of values to others (Boutilier, Friedman, Goldszmidt, & Koller, 1996). More precisely, we can group the pairs of objects (o, o′) (which can give rise to the ground predicate R(o, o′, t)) into disjoint sets called abstractions, A_R^1, ..., A_R^m, such that two pairs of objects (o_i, o′_j) and (o_k, o′_l) belong to the same abstraction if P(R(o_i, o′_j, t)) = P(R(o_k, o′_l, t)). If these relational abstractions can be efficiently specified by first-order logical formulas φ over the simple predicates, then instead of maintaining probabilities for each pair of objects, we can keep probabilities for each abstraction. This can greatly reduce the space required; similarly, the time required to update the abstractions and the probabilities is reduced. In Section 7, we run our inference algorithms on the assembly domain. In particular, we model situations where assumptions A1 and A2 are not violated and use the Rao-Blackwellized particle filter for state monitoring (see Section 7.2). We conclude that it greatly outperforms the standard particle filter (whose number of particles is scaled so that the memory and time requirements are evenly matched). We also observe that using abstractions reduces RBPF's time and memory by a factor of 30 to 70.

In the remainder of this section, we focus on removing assumption A2, i.e., allowing each object to have relationships with multiple objects via the same relation.
However, it is hard to relax this assumption in a way that supports efficient Rao-Blackwellization: if we allow predicates R(o, o_1, t) and R(o, o_2, t) to be simultaneously true, we must maintain a joint distribution over this set of ground predicates. As a step towards allowing an arbitrary number of ground relations, first suppose that one is able to bound the number of ground relations per object. Thus, we modify assumption A2 as follows:

A2′: For any object o, there are at most κ objects o_1, ..., o_κ such that all of R(o, o_1, t), ..., R(o, o_κ, t) are true (and similarly for the second argument).
In this case, one can maintain a distribution over sets of pairs of objects. For example, if the size of the domain of the second argument is n, then for every object o, the relation R(o, y, t) is true for some i choices of the objects corresponding to the variable y, where i ≤ κ. One can maintain a distribution over the \binom{n}{i} possible combinations for each i ≤ κ. This approach is practical for small κ, but as κ increases the number of sets grows exponentially. In order to reduce the space complexity, we group sets having the same probability into an equivalence class whose membership is defined by a first-order formula. The formulas and the probabilities may change at each time step, in accordance with the RDBN and the current state. Using this abstraction scheme, our experiments show that Rao-Blackwellization can be performed efficiently for moderate values of κ (see Section 7.2).
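The growth in the number of parameter sets under A2′ can be seen with a small helper; `num_relation_states` is our own illustrative name, not from the paper.

```python
from math import comb

def num_relation_states(n, kappa):
    """Number of possible sets of partners for one object when up to kappa
    of the n candidate objects may be related to it (assumption A2'):
    sum over i <= kappa of C(n, i). Grows exponentially in kappa."""
    return sum(comb(n, i) for i in range(kappa + 1))
```

For example, with n = 100 candidates, allowing one partner gives 101 possible sets, while allowing two already gives 5,051.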
5. Smoothing in RDBNs
Relaxing Assumptions A1 and A2 is difficult because the domains of complex predicates are typically very large, and when complex predicates are parents of other predicates one cannot efficiently compute an analytical solution, rendering Rao-Blackwellization infeasible. If instead the complex predicates are sampled, then an extremely large number of particles is required to maintain accuracy. Although there is no perfect solution to this problem, we can make use of the fact that similar objects in a relational domain tend to behave similarly, and that this similarity extends to the types of relationships in which they participate.
With simple smoothing, we perform standard particle filtering for state monitoring, and we smooth the particles to answer queries. In this paper, we only describe how to smooth the particles to answer queries about complex predicates. For simple predicates we use the standard particle filter (although our methods are easily extended to them). The simple smoothing approach computes a weighted sum of three components, P_s, P_u and P_m, which we now describe.

The first component, P_s(R(o, o′, t)), is the value estimated by a standard particle filter. Here, we sample all the ground predicates (both simple and complex). To estimate the probability that R(o, o′, t) is true at time t, one can count the number of particles in which this relationship holds and divide by the number of particles.

We have already seen that P_s by itself will be inaccurate for any reasonable number of particles, due to the curse of dimensionality. To reduce the number of particles needed, we smooth the probability estimate toward two other distributions: P_u, the uniform distribution, and P_m, the distribution conditioned on the Markov blanket (MB).

In the uniform distribution, to compute the probability that R(o, o′, t) is true, we ignore the differences between objects, and simply count the fraction of ground predicates for which the relation is true. Thus, the probability that R(o, o′, t) is true will be higher if many objects have the relationship between them. The probability is computed as:

P_u(R(o, o′, t)) = (1/N) Σ_{i=1}^{N} Σ_x Σ_y δ_i(R(x, y, t), 1) / (n_x n_y),

where N is the number of particles, δ_i(R(x, y, t), 1) is 1 if R(x, y, t) = true in the i-th particle and
4. The term smoothing is more in the spirit of "shrinkage" and should not be confused with the smoothing task in a DBN.
5. For simplicity our notation omits conditioning on observations.
6. We use P(R(x, y, t)) to mean P(R(x, y, t) = true).
0 otherwise, and n_x and n_y represent the domain sizes of the first and second arguments of the predicate.

Finally, for each particle, the distribution conditioned on the Markov blanket is obtained by directly computing the probability of the ground predicate given the variable's Markov blanket, which may contain attributes at time t or t − 1. We then average this estimate over all particles:

P_m(R(o, o′, t)) = (1/N) Σ_{i=1}^{N} P_i(R(o, o′, t) | MB(R(o, o′, t))),

where P_i represents the probability that R(o, o′, t) is true given its Markov blanket MB according to the i-th particle. We compute the final probability by smoothing among these three estimates:

P(R(o, o′, t)) = α_s P_s(R(o, o′, t)) + α_u P_u(R(o, o′, t)) + α_m P_m(R(o, o′, t))    (3)

where α_s + α_u + α_m = 1. We term this approach simple smoothing.

Figure 5: An example of simple smoothing.

Figure 5 shows an example of simple smoothing. The example places all the weight on α_s and α_u (α_m = 0); for simplicity, it ignores the prediction made by conditioning on the Markov blanket. There are four particles, each containing sampled values for the Bolted-To predicate at some time t. The particle filter predicts that P(Bolted-To(P1,B1,t)) = 0, while simple smoothing predicts a small nonzero probability. This example highlights the problem with standard particle filtering: a ground predicate may have a nonzero probability of being true, but due to the large size of the domain the particle filter may contain no samples that reflect this. Our experiments show that simple smoothing performs considerably better than standard particle filtering (see Section 7.3), but it overgeneralizes by smoothing over the entire set of objects. We now present a more refined approach based on smoothing over a lattice of abstractions.
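A minimal sketch of the P_s and P_u components and their combination (Equation 3), with the Markov-blanket term omitted (α_m = 0). The data layout and the default weight values are illustrative assumptions, not the paper's.

```python
def simple_smoothing(particles, query, alpha_s=0.6, alpha_u=0.4, n_x=10, n_y=10):
    """Sketch of simple smoothing for a complex predicate R(o, o', t).
    `particles` is a list of sets of true ground pairs {(x, y), ...};
    `query` is the pair (o, o'); n_x, n_y are the argument domain sizes.
    The Markov-blanket term P_m is omitted here (alpha_m = 0)."""
    N = len(particles)
    # P_s: fraction of particles in which the queried relation holds.
    p_s = sum(query in p for p in particles) / N
    # P_u: fraction of all n_x * n_y possible ground pairs that are true,
    # averaged over particles (differences between objects are ignored).
    p_u = sum(len(p) for p in particles) / (N * n_x * n_y)
    return alpha_s * p_s + alpha_u * p_u
```

With four particles in which Bolted-To(P1,B2) is often true but Bolted-To(P1,B1) never is, the query for (P1,B1) gets a small nonzero estimate from the uniform component, as in Figure 5.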
7. In the experiments we used fixed weights with α_u = α_m. The weights can also be set by adapting the procedure described in the next section.
As before, we are interested in computing the marginal probability of R(o, o′, t). Instead of using a uniform distribution to smooth the estimates, we can obtain better estimates by considering only the relationships between pairs of objects o_i, o′_j such that o_i is similar to o and o′_j is similar to o′. We call a grouping of a set of pairs of objects which are related in some way an abstraction. For example, pairs of large plates and brackets which are bolted together form an abstraction. An abstraction is more general if it allows many objects to be termed similar. The probability estimates for more general abstractions will be based on more instances, and thus have lower variance, but will ignore more detail, and thus have higher bias. The trade-off is made by using a weighted combination of estimates from a lattice of abstractions. In the next few sections we use the following representation. Each complex predicate R is represented by a set X_R of Boolean indicator variables X_{l,j}, where X_{l,j} is 1 if R(o_l, o′_j, t) = true and 0 otherwise. An abstraction can be thought of as a set of pairs specified by a subset of the indicator variables.

5.2.1 LATTICE OF RELATIONAL ABSTRACTIONS
Given a set S, a lattice is a set of nodes where each node represents a subset of S. In the relational domain we build an abstraction lattice over each complex predicate. As described above, we define a relational abstraction of R to be a subset of the indicator variables, where the subset contains X_{l,j} if o_l and o′_j satisfy some first-order formula. For example, if the formula is a simple conjunctive expression, we have the following. Consider the first-order formula φ = A_1(x, u_1, t) ∧ ··· ∧ A_m(x, u_m, t) ∧ B_1(y, v_1, t) ∧ ··· ∧ B_n(y, v_n, t), where the A_i and B_k are simple predicates and the u_i and v_k are constants.

Definition 8 (Relational Abstraction) The relational abstraction of R specified by φ is the set A_R ⊆ X_R defined as: A_R = { X_{l,j} ∈ X_R | A_1(o_l, u_1, t) ∧ ··· ∧ A_m(o_l, u_m, t) ∧ B_1(o′_j, v_1, t) ∧ ··· ∧ B_n(o′_j, v_n, t) }.

As an example of an abstraction, consider the assembly domain and the predicate
Bolted-To. An abstraction could be φ = Color(x, red, t) ∧ Size(y, large, t), which represents the Bolted-To relation between all plates which are red and all brackets which are large. These relational abstractions form a lattice, and our goal is to use the relational abstraction lattice along with smoothing to improve particle filtering.

5.2.2 SMOOTHING WITH AN ABSTRACTION LATTICE
Given a relational attribute R, we have to estimate the probability of R(o, o′, t) being true, which we shall refer to as P(X_{a,b} = 1), where o and o′ have the indices a and b in X_R. We smooth the particle filter estimates over the relevant abstractions of R. Given the RDBN and the ground predicate R(o, o′, t), we first consider the set of parents (R_1, ..., R_n) of the predicate. Given the i-th particle, for each of the parents, we set a value which is either equal to the parent's value in the particle or ∗ (don't care). This defines a relevant abstraction. More formally, an abstraction is relevant to R(o, o′, t) if it is a conjunctive expression involving a subset of R's parents, and their values in the expression are the values in some particle i (or, more generally, if (o, o′, t) satisfies some first-order expression; conjunction is a special case). The intuition behind using such abstractions is that if the sets of parents of some variables are the same and the parents have exactly the same values,
then the probability of a variable being true can be obtained by looking at the distribution of the variables in the particles. Thus, for each subset of parent attributes and their corresponding values we have an abstraction. If we consider all possible subsets of the parents, we obtain a lattice of abstractions.

For example, if the presence of a bolt between a plate and a bracket depends on the size of the bracket and the size and color of the plate, but not on other attributes, then the abstraction lattice is defined over the size of the bracket and the size and color of the plate. If the color of o is red and the size of o′ is large, then the relational abstractions used in smoothing to find P(R(o, o′, t)) will involve a subset of the color and size predicates with the values specified above. Figure 6 shows an example of an abstraction lattice. The first-order formulas describing the abstractions can also contain complex predicates and quantifiers/aggregators over them.

The number of abstractions is exponential in the number of parents. If the number becomes too large, we use an approach based on rule induction to select the most informative abstractions (see Algorithm 1). For each abstraction, we first define a score, which is the K-L divergence (Cover & Thomas, 2001) between the distribution of R predicted by the abstraction and the empirical distribution. The empirical distribution is obtained by taking all the ground instances R(o, o′, t) across all the particles. For simplicity, we assume that instances within a particle are independent and identically distributed (although this may not be the case). These instances can then be considered as independent samples of the true distribution.
The distribution p̂_{A_R} predicted by an abstraction A_R is obtained by averaging across the ground instances which belong to the abstraction, i.e.,

p̂_{A_R}(x) = [ Σ_i Σ_{X_{l,j} ∈ A_R} δ_i(x_{l,j}, x) ] / (N |A_R|),

where x is either 0 or 1, N is the number of particles, |A_R| is the size of the abstraction, and δ_i(x_{l,j}, x) = 1 if the value x_{l,j} of the indicator variable X_{l,j} is x in the i-th particle and 0 otherwise. We approximate the K-L divergence between the empirical distribution p and p̂_{A_R} as described in Section 7.1:

score(A_R) = D̂_H(p || p̂_{A_R}) = −(1/(N |A_R|)) Σ_i Σ_{X_{l,j} ∈ A_R} log p̂_{A_R}(x_{l,j}).

Algorithm 1 shows the procedure to select the most relevant abstractions. We start with the null abstraction (i.e., the most general abstraction) and greedily add to it the attribute-value pair that best improves the score function. (The attributes correspond to the parent predicates.) We keep adding attribute-value pairs until either the score cannot be improved or the number of attribute-value pairs (also termed the length of the abstraction) exceeds the maxLen parameter (to prevent overfitting). We then add the new abstraction to the list of relevant abstractions. To avoid redundancy among the abstractions, we remove the ground instances that the new abstraction covers. We then repeat the procedure with the updated set of ground instances. When the number of abstractions exceeds the maximum number, we stop the search. Pruning can also be done by using a holdout set of particles and evaluating the abstractions' scores on them, or by any of the other methods used in rule induction (see Clark & Niblett, 1989; Cohen, 1995, etc.). In our experiments, there were typically only a few parents, so we used all possible abstractions.
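The scoring of a single abstraction can be sketched as follows; the representation of particles (dicts mapping indicator pairs to 0/1) and abstractions (sets of covered pairs) is our own simplification.

```python
from math import log

def abstraction_score(particles, abstraction, eps=1e-9):
    """Entropy-offset K-L score of an abstraction:
    score = -(1/(N*|A|)) * sum_i sum_{(l,j) in A} log p_hat(x_{l,j}),
    where p_hat(1) is the fraction of covered indicators that are 1,
    pooled over particles. Lower scores indicate a better fit."""
    N = len(particles)
    size = len(abstraction)
    ones = sum(p[idx] for p in particles for idx in abstraction)
    p1 = ones / (N * size)
    total = 0.0
    for p in particles:
        for idx in abstraction:
            q = p1 if p[idx] == 1 else 1.0 - p1
            total += log(max(q, eps))   # eps guards log(0)
    return -total / (N * size)
```

An abstraction that pools an always-true indicator with an always-false one scores log 2 per instance, while an abstraction covering only homogeneous indicators scores 0.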
Algorithm 1
Abstraction Lattice Smoothing.
Pa(R) ← (R_1, ..., R_n)
nullA_R ← {}
RelevantAbs ← {}
n ← 0
while n < maxAbs do
    currentA_R ← nullA_R
    minKLD ← ∞
    len ← 0
    while len ≤ maxLen do
        for all R_i ∈ Pa(R) \ currentA_R do
            for all V_j ∈ Dom(R_i) do
                tempA_R ← currentA_R ∪ {(R_i, V_j)}
                if tempA_R ∈ RelevantAbs then
                    continue
                end if
                KLD ← score(tempA_R)
                if KLD < minKLD then
                    newA_R ← tempA_R
                    minKLD ← KLD
                end if
            end for
        end for
        if newA_R = currentA_R then
            exit
        else
            currentA_R ← newA_R
            len ← len + 1
        end if
    end while
    n ← n + 1
    Add currentA_R to RelevantAbs
    Remove ground instances of R covered by currentA_R
end while
[Figure 6 here. Lattice nodes include null; Color(p,red,t); Size(b,large,t); Color(p,red,t) ∧ Size(b,large,t); Color(p,red,t) ∧ Color(b,red,t); and ground instances such as Bolted-To(p1,b2,t) and Bolted-To(p2,b8,t).]
Figure 6: An abstraction lattice over the
Bolted-To relation between plate (red) and bracket (large).

The abstractions obtained are then assigned weights by adopting the heuristic length formula proposed by Anderson et al. (2002) (where it is called the rank heuristic). The weight ω_{A_R} of the abstraction A_R is computed as:

ω_{A_R} ∝ |A_R| Length(A_R),

where |A_R| is the size of the abstraction, i.e., the number of ground predicates that can belong to A_R, and Length(A_R) is the length of the abstraction. The intuition behind this formula is to trade off bias and variance by giving more weight to abstractions with many samples, but less weight to abstractions that are overly general. We also tried using the EM algorithm to learn the weights, as described by McCallum et al. (1998), using the various particles as the data points. In our experiments, we found that both work well in practice, but the heuristic length formula is much more efficient.

The probability estimate P(X_{a,b} = 1) is a weighted average of the probability estimates given by the various abstractions:

P_i(R(o, o′, t)) = P_i(X_{a,b} = 1) = (1/c) Σ_{A_R ∋ X_{a,b}} ω_{A_R} n^i_{A_R} / |A_R|    (4)

where P_i is the probability as represented by the i-th particle, c = Σ_{A_R} ω_{A_R} is a normalizing constant, A_R is a relational abstraction that has X_{a,b} as one of its elements, n^i_{A_R} is the number of indicator variables belonging to A_R which have the value 1 in the i-th particle, |A_R| is the size of the abstraction (i.e., the number of indicator variables in A_R), and ω_{A_R} is the weight of the abstraction A_R. P_i is computed for each particle, and the final probability is given by the average of P_i over all particles:

P(X_{a,b} = 1) = (1/N) Σ_{i=1}^{N} P_i(X_{a,b} = 1)    (5)

where N is the number of particles.
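Equations 4 and 5 can be sketched as follows; the data layout (particles as sets of true indicator pairs, abstractions as named sets of covered pairs) is our own illustration.

```python
def smoothed_estimate(particles, abstractions, weights, query):
    """Abstraction-lattice smoothing (Equations 4 and 5, sketched).
    `particles`: list of sets of true indicator pairs (l, j).
    `abstractions`: dict name -> set of indicator pairs it covers.
    `weights`: dict name -> weight omega. `query`: the pair (a, b)."""
    N = len(particles)
    relevant = [a for a, idxs in abstractions.items() if query in idxs]
    c = sum(weights[a] for a in relevant)     # normalizer over relevant abstractions
    total = 0.0
    for particle in particles:
        # Equation 4: weighted average, per particle, of each relevant
        # abstraction's fraction of true indicators, n^i_A / |A|.
        p_i = sum(weights[a] * len(particle & abstractions[a]) / len(abstractions[a])
                  for a in relevant) / c
        total += p_i
    return total / N                          # Equation 5: average over particles
```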
Thus, particle filtering proceeds as usual, except that during inference, at each time step, the probabilities are computed using the above formula. The experiments (Section 7) show that this method greatly outperforms standard particle filtering and simple smoothing.
6. Relational Kernel Density Estimation
One way to approximate the joint distribution of the relational variables is to assume independence between all the indicator variables corresponding to each ground predicate. The marginal probabilities computed above can then be used to compute the joint distribution as:

P(X = x) = Π_{R ∈ X} Π_{X_{l,j} ∈ X_R} P(X_{l,j} = x_{l,j})    (6)

where X represents the joint state variable, x is a particular state, R is a predicate, X_R is the corresponding set of indicator variables X_{l,j}, and P(X_{l,j} = x_{l,j}) is given by Equation 5. This formula only describes the joint distribution of the complex predicates, but can easily be extended to compute the joint distribution of all the predicates. However, this approach can lead to inaccurate results when the independence assumption does not hold. Moreover, the marginal probability has to be calculated for every indicator variable (i.e., ground predicate), irrespective of whether its value in the state is true or not, and working in such a high-dimensional space makes this inefficient.

In this section we propose a form of kernel density estimation (Duda, Hart, & Stork, 2000) to directly compute the joint probability distribution of the variables efficiently and accurately. A kernel density estimator for a variable X takes n samples (i.e., particles x^i) and estimates X's probability distribution as:

P(X = x) = (1/n) Σ_i K(x, x^i)

where K is a non-negative kernel function that satisfies Σ_x K(x, x^i) = 1 for all i. The kernel function K represents a distribution over X based on the sample x^i, and is typically a function of the distance between x^i and x. For example, if x and x^i are Boolean vectors, then the distance d(x, x^i) can be the Hamming distance between the vectors, and K(x, x^i) can be a normalized decreasing function of d(x, x^i). However, in our case this is unlikely to give good results, because kernel density estimation usually does not work well in high dimensions.
To overcome this problem we first break our kernel function into a product of kernel functions, one for each complex predicate. Thus we have:

K(x, x^i) = Π_R K_R(x, x^i)    (7)

However, each of the complex predicates, when viewed as a Boolean vector, can itself be very high-dimensional and sparse, leading to d(x, x^i) being the same for most (x, x^i) pairs, and producing poor results. Fortunately, the sparsity can itself be used to reduce the effective dimension of the kernel function for a relation. Let n_{X_{l,j}=1}(X_R) represent the cardinality of the subset of indicator variables that have the value 1. We divide the kernel function into two factors. The first factor of K_R gives the probability distribution on the number of indicator variables whose value is 1, given the number of indicator variables whose value is 1 in particle x^i, i.e., P(n_{x_{l,j}=1}(x_R) | n_{x^i_{l,j}=1}(x^i_R)). The particles are erroneous in reproducing the exact relationships in the domain, but they can approximately capture the number of relationships. Thus, we model this number using a binomial
distribution where the number of trials is n_{x^i_{l,j}=1}(x^i_R), each with a success probability of p_s. The parameter p_s is computed from the model. For example, in the assembly domain p_s will depend on the fault probability of the actions. The binomial model is used because of mutual exclusion in the assembly domain, which causes the number of true ground predicates to be approximately equal to the number of actions performed. In other domains, one may not require this factor.

The second factor of K_R is the average probability of the indicator variables that are 1 in the state, given the indicator variables that are 1 in the particle. To estimate this, we can once again use the abstractions, in particular Equation 4. However, the average of these probabilities will generally not sum to 1 over the various states. Hence, to make K_R a kernel function we must normalize it over all the possible substates X such that n_{X_{l,j}=1}(X_R) = n_{x_{l,j}=1}(x_R). In conclusion, our kernel function is

K_R(x, x^i) = B(n_{x_{l,j}=1}(x_R), n_{x^i_{l,j}=1}(x^i_R), p_s) · [ Σ_{x_{a,b} ∈ S} P_i(x_{a,b} = 1) ] / (d · n_{x_{l,j}=1}(x_R))    (8)

where B(k, n, p) represents the binomial distribution (probability of k successes in n trials with success probability p), d is the normalization factor, S is the subset of indicator variables with value 1, n_{x_{l,j}=1}(x_R) is the cardinality of S, and P_i(·) is given by Equation 4. Computing the normalization factor by summing over the various states as described above can be exponential in the number of ground predicates present in the state, and thus infeasible. However, in our case the normalization factor can be computed analytically, and is given by Proposition 2.

Proposition 2
The normalization factor is

d = \binom{|X_R| − 1}{n_{x_{l,j}=1}(x_R) − 1} Σ_{A_R} ω_{A_R} n^i_{A_R} / n_{x_{l,j}=1}(x_R).

Proof: For convenience we write n(x_R) instead of n_{x_{l,j}=1}(x_R).

d = Σ_{y_R ∈ X_R : n(y_R) = n(x_R)} Σ_{y_{l,j} ∈ y_R : y_{l,j} = 1} [ Σ_{A_R : y_{l,j} ∈ A_R} ω_{A_R} n^i_{A_R} / |A_R| ] / n(x_R)
  = Σ_{A_R} Σ_{y_{l,j} ∈ A_R} Σ_{y_R : y_{l,j} = 1, n(y_R) = n(x_R)} ω_{A_R} n^i_{A_R} / (|A_R| n(x_R))
  = Σ_{A_R} Σ_{y_{l,j} ∈ A_R} \binom{|X_R| − 1}{n(x_R) − 1} ω_{A_R} n^i_{A_R} / (|A_R| n(x_R))
  = Σ_{A_R} |A_R| \binom{|X_R| − 1}{n(x_R) − 1} ω_{A_R} n^i_{A_R} / (|A_R| n(x_R))
  = \binom{|X_R| − 1}{n(x_R) − 1} Σ_{A_R} ω_{A_R} n^i_{A_R} / n(x_R). ∎

Figure 7 shows a hypothetical example consisting of three particles which contain the sampled values for the
Bolted-To ground predicates. The example also describes the abstraction lattice which
[Figure 7 here. Three particles each assign the values (0,0,0,0,1,0,0,0) to the ground predicates Bolted-To(P1,B1), (P1,B2), (P1,B3), (P1,B4), (P2,B1), (P2,B2), (P2,B3), (P2,B4), and an abstraction hierarchy assigns them smoothed probabilities. Particle filtering: P(0,0,0,0,0,0,0,1) = 0.0. Abstraction smoothing with independence: P(0,0,0,0,0,0,0,1) = 1·1·1·1·0.7·0.7·0.7·0.05. RKDE: P(0,0,0,0,0,0,0,1) = 0.05·B.]
Figure 7: An example of computing joint distributions of complex predicates by particle filtering, abstraction smoothing and RKDE.

The abstraction lattice in this case is a tree. The probability of the state (0,0,0,0,0,0,0,1) as predicted by a standard particle filter is 0.0, because none of the particles contain this state. The joint probability computed using independence assumptions requires calculating the probability for each ground predicate separately and multiplying these probabilities. For example, the abstraction smoothing method gives a probability of
0.7 for Bolted-To(P2,B1) being false. The overall probability is given in the figure. This probability can be highly inaccurate, as well as expensive to compute. Finally, relational kernel density estimation computes the probability only for the ground predicates that are true (i.e., 1) and averages them. Using the abstraction smoothing estimates, this probability comes out to be 0.05. As described above, the kernel is also multiplied by the binomial distribution B, which is the probability of the number of true ground predicates in the state given the number of true ground predicates in the particle (and the RDBN model). In this case, if we assume that the fault probability is low, then p_s will be close to 1, and since all the particles have exactly one ground predicate which is true, B will be close to 1. Hence, the kernel method will predict that the state has a probability close to 0.05.
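Proposition 2 can be checked numerically on a small example. The sketch below (our own names and data layout) brute-forces the normalization over all states with a fixed number of true indicators and compares it with the closed form.

```python
from itertools import combinations
from math import comb

def normalization_brute(num_indicators, n_true, abstractions, weights, n_i):
    """Brute-force d: sum, over all states y_R with exactly n_true ones, of
    the unnormalized second kernel factor, i.e. the average over the true
    indicators of sum_{A containing it} w_A * n^i_A / |A|."""
    total = 0.0
    for state in combinations(range(num_indicators), n_true):
        avg = sum(sum(weights[a] * n_i[a] / len(abstractions[a])
                      for a in abstractions if idx in abstractions[a])
                  for idx in state) / n_true
        total += avg
    return total

def normalization_closed(num_indicators, n_true, weights, n_i):
    """Closed form from Proposition 2:
    C(|X_R| - 1, n - 1) * sum_A w_A * n^i_A / n."""
    return comb(num_indicators - 1, n_true - 1) * \
        sum(weights[a] * n_i[a] for a in weights) / n_true
```

The key counting step of the proof is visible here: each indicator appears in exactly C(|X_R| − 1, n − 1) of the enumerated states.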
7. Experiments
In this section we study the application of RDBNs to fault detection in complex assembly plans. We first describe the domain and the experimental procedure we used to study the performance of the various algorithms.
We use a modified version of the Schedule World domain from the AIPS-2000 Planning Competition (Bacchus, 2001). The problem consists of generating a plan for the assembly of objects with operations such as painting, polishing, etc. (see Appendix B for the exact details). Each object has attributes such as surface type, color, hole size, etc. We add two relational operations to the domain: bolting and welding. We assume that actions may be faulty, with the fault model described below. In our experiments, we first generate a plan using the FF planner (Hoffmann & Nebel, 2001), assuming that the actions are deterministic (i.e., have no faults). We then monitor the plan's execution, explicitly considering possible faults, using particle filtering (PF), Rao-Blackwellised particle filtering (RBPF), particle filtering with simple smoothing (SPF), particle filtering with smoothing using an abstraction lattice (ASPF), and particle filtering using relational kernel density estimation (RKDE).

We consider three types of objects: Plate, Bracket and Bolt. Plate and Bracket have attributes such as weight, shape, color, surface type, hole size and hole type, while Bolt has attributes such as size, type and weight. Plates and brackets can be welded to each other or bolted to bolts. The constants in the domain represent objects (e.g., plates) and values of attributes (e.g., red). Attributes and relationships are represented by binary predicates. Actions such as painting, drilling and polishing change the values of the attributes of an object. The Bolt action creates a bolt relation between a Plate or Bracket object and a Bolt object. The Weld action welds a Plate or Bracket object to another Plate or Bracket object. The actions are fault-prone; for example, with a small probability a Weld action may have no effect, or may weld two incorrect objects based on their similarity to the original objects. This gives rise to uncertainty in the domain and the corresponding dependence model for the various attributes. The fault model has a global parameter, the fault probability p_f. With probability 1 − p_f, an action produces the intended effect. With probability p_f, one of several possible faults occurs. Faults include a painting operation not being completed, the wrong color being used, the polish of an object being ruined, etc. The probability of these faults depends on the properties of the object being acted on. In addition, there are faults such as bolting the wrong objects and welding the wrong objects. The probability of choosing a particular wrong object depends on its similarity to the intended object. Similarity depends on the propositional attributes of the objects involved. Thus the probability of a particular wrong object being chosen is uniform across all objects with the same relevant attribute values.

We allow each object to be "attached" to several other objects. The Bolt and Weld actions can attach two objects depending on previously attached objects and/or their properties. For example, plate A may be welded to plate B if both of them are already welded to a common plate. Together, these violate assumptions A1 and A2 and serve as a good testbed for the various inference algorithms.

The relational process also includes a noisy observation model. When an action is performed on one or more objects, all the ground predicates involving these objects are observed, and no others. With probability 1 − p_o the true value of the attribute is observed, and with probability p_o an incorrect value is observed. In our experiments, we set p_o = p_f.

A natural measure of the accuracy of an approximate inference procedure is the K-L divergence between the distribution it predicts and the actual one (Cover & Thomas, 2001). However, computing the K-L divergence requires performing exact inference, which for non-trivial RDBNs is infeasible. Thus we estimate the K-L divergence by sampling, as follows. Let D(p || p̂) be the K-L divergence between the true distribution p and its approximation p̂, and let 𝒳 be the domain over which the distribution is defined. Then

D(p || p̂) ≝ Σ_{x ∈ 𝒳} p(x) log [p(x)/p̂(x)] = Σ_{x ∈ 𝒳} p(x) log p(x) − Σ_{x ∈ 𝒳} p(x) log p̂(x).

The first term is the negative of the entropy of X, −H(X), and is a constant independent of the approximation method. Since we are mainly interested in measuring differences in performance between approximation methods, this term can be neglected. The K-L divergence can now be approximated by taking S samples from the true distribution:

D̂_H(p || p̂) = −(1/S) Σ_{i=1}^{S} log p̂(x^i)

where p̂(x^i) is the probability of the i-th sample according to the approximation procedure, and the H subscript indicates that the estimate of D(p || p̂) is offset by H(X).
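The entropy-offset estimate D̂_H can be sketched as follows; the sampling of state/observation sequences from the RDBN is abstracted into a plain list of samples, and the function name is our own.

```python
from math import log

def kl_offset_estimate(samples, approx_prob):
    """Entropy-offset K-L divergence estimate D_H (sketch): average
    -log p_hat(x^i) over samples x^1..x^S drawn from the true process,
    where approx_prob is the approximation being evaluated. Returns
    infinity if any sample has zero predicted probability."""
    total = 0.0
    for x in samples:
        p = approx_prob(x)
        if p == 0.0:
            return float("inf")   # a sampled value unrepresented in any particle
        total -= log(p)
    return total / len(samples)
```

The infinite value mirrors the behavior described above: the worse method blows up earlier in the time sequence, once it assigns zero probability to a sampled value.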
We thus evaluate the accuracy of particle filtering (PF) and the other algorithms on an RDBN by generating S sequences of states and observations from the RDBN, passing the observations to the particle filter, inferring the marginal probability of the sampled value of each state variable at each step, plugging these values into the above formula, and averaging over all variables. Notice that D̂_H(p || p̂) = ∞ whenever a sampled value is not represented in any particle. The empirical estimates of the K-L divergence we obtain will be optimistic, in the sense that the true K-L divergence may be infinite, but the estimated one will still be finite unless one of the values with zero predicted probability is sampled. This does not preclude a meaningful comparison between approximation methods, however, since on average the worse method should produce D̂_H(p || p̂) = ∞ earlier in the time sequence. We thus report both the average K-L divergence before it becomes infinite and the time step at which it becomes infinite, if any.

First, we compare the Rao-Blackwellized particle filter (RBPF) with the standard filter (PF) in the case where assumptions A1 and A2 hold. Figure 8 shows the comparison for 1000 objects and varying fault probabilities. The graph shows the K-L divergence at every 100th step. The error bars are the standard deviations. Graphs are interrupted at the first point where the K-L divergence became infinite in any of the runs (once infinite, the K-L divergence never returned to being finite in any of the runs), and that point is labeled with the average time step at which the blow-up occurred. We allocated PF far more particles (200,000) than RBPF (5,000), so that the memory and time requirements are approximately the same for both techniques. As can be seen, for all fault probabilities, PF tends to diverge rapidly, while the K-L divergence of RBPF increases only very slowly. We have also run experiments where the number of objects is varied from 500 to 1500.
As can be seen from Figure 9, RBPF outperforms PF in this case as well.

Next we compare RBPF with PF when the number of objects that can be attached to a particular object is greater than one. The maximum number of relationships per object that we consider is 10. From Figure 10, we can conclude that RBPF performs equally well in this case. Although RBPF gives quite accurate predictions, its speed decreases as the number of relationships increases. Figure 11 confirms this and shows the need for faster algorithms. In all the above experiments we use object abstractions (Section 4), which reduce RBPF's time and memory by a factor of 30 to 70,
Figure 8: RBPF (with 5,000 particles) has much less error than standard PF (with 200,000 particles) in domains where Assumptions A1 and A2 are not violated. This experiment was done with 1000 objects and varying fault probabilities (0.1%, 1% and 10%).
Figure 9: RBPF outperforms PF for varying numbers of objects. The fault probability in this experiment was kept constant at p_f = 1%.
Figure 10: The graph shows the performance of RBPF when the maximum number of relationships per object is increased from 1 to 10. RBPF outperforms the scaled PF for 1000 objects and p_f = 1%.
Figure 11: The time taken by RBPF (plotted on a log scale) increases exponentially with the number of relationships per object.
Figure 12: ASPF (20,000 particles) greatly outperforms standard PF (100,000 particles) and SPF (50,000 particles) in predicting the marginal distributions for 1000 and 1500 objects and p_f = 1%.

and take on average six times longer and 11 times the memory of PF, per particle. However, note that we run PF with 40 times more particles than RBPF. Thus, RBPF is using less time and memory than PF, while predicting behavior far more accurately.

RBPF works only when Assumption A1 holds (i.e., no predicate depends on uncertain complex predicates), and it becomes slower when Assumption A2 is removed (i.e., relationships are no longer one-to-one). We now study the performance of PF, PF with simple smoothing (SPF), and PF with abstraction-based smoothing (ASPF), which can be used to obtain the marginal distributions when Assumptions A1 and A2 are removed. Figure 12 shows the K-L divergence of the algorithms at every 100th step in an experiment performed with 1000 and 1500 objects and p_f = 1%. Our algorithms use the same amount of memory as PF, but require additional time (on average by factors of 2 and 5, respectively) to do smoothing. Thus, in our experiments the numbers of particles used by standard PF, SPF and ASPF are 100,000, 50,000 and 20,000, respectively. One can see that PF tends to diverge very quickly (even with many more particles), while ASPF performs best and its approximation to the marginal distribution is close to the actual distribution. Although the abstraction smoothing algorithm has low error, we observe in the graph that the error increases with time. We attribute this growth to the fact that the effective dimension of the assembly domain increases over time as new (possibly faulty) relations are created, making it increasingly difficult to approximate the distribution with a fixed number of particles. Figure 13 shows the results of experiments for varying fault probabilities.
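The core idea behind abstraction-based smoothing can be sketched as shrinkage of each particle-based leaf estimate toward the estimate at its parent in the abstraction hierarchy. The two-level hierarchy and the fixed mixing weight below are our own simplifications for illustration, not the paper's exact scheme:

```python
from collections import Counter

def smoothed_marginal(particles, parent_of, lam=0.5):
    """Estimate P(value) from particle samples, shrinking each raw leaf
    estimate toward the estimate of its abstraction-level parent.

    particles : list of sampled values (one per particle)
    parent_of : maps each concrete value to its abstract class
    lam       : weight on the raw leaf estimate (illustrative constant)
    """
    n = len(particles)
    leaf = Counter(particles)                            # counts per value
    abstract = Counter(parent_of[v] for v in particles)  # counts per class
    leaves_in = Counter(parent_of.values())              # leaves per class

    def prob(v):
        p_leaf = leaf[v] / n
        # Spread the abstract class's mass uniformly over its leaves.
        p_abs = abstract[parent_of[v]] / n / leaves_in[parent_of[v]]
        return lam * p_leaf + (1 - lam) * p_abs

    return prob

# Toy hierarchy: plates {p1, p2} form one class, brackets {b1} another.
parent_of = {'p1': 'Plate', 'p2': 'Plate', 'b1': 'Bracket'}
prob = smoothed_marginal(['p1', 'p1', 'b1', 'p1'], parent_of)
# 'p2' was never sampled, but inherits mass from the Plate class,
# so its smoothed probability is nonzero.
print(prob('p2') > 0)
```

Because the abstract mass is redistributed uniformly over the leaves of each class, the smoothed estimates still sum to one over the domain.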
Figure 13: ASPF predicts the marginal distributions most accurately for varying fault probabilities (1%, 10%) and 1000 objects.

From the two figures we can conclude that the performance of standard PF degrades with increasing fault probability and with the number of objects, while ASPF remains almost unaffected.

Next, we report experiments where Assumption A1 holds and compare ASPF with Rao-Blackwellized particle filtering. The experiment was performed on 1000 objects with fault probability p_f = 1%. Figure 14 shows the mean K-L divergence between the approximate marginal distributions and the true ones. We can see that the difference between the K-L divergence of ASPF and that of RBPF is very small, and this difference remains almost constant over time: the K-L divergence for ASPF is greater than that of RBPF by at most 0.01, and on average by around 0.005, indicating that the approximations underlying abstraction smoothing are quite good. We also compare RBPF and ASPF when the number of relationships per object is increased. Figure 15 plots the K-L divergence at the last step when the two algorithms are run on 1000 objects with p_f = 1% and a varying number of objects per slot. One can see that the difference is always less than 0.03 and the curve is quite stable.

Figures 16 and 17 show the K-L divergence of the full joint distribution of the state (as opposed to just the marginals) for PF, PF with abstraction smoothing (using independence assumptions) and PF using relational kernel density estimation (RKDE), in experiments with varying numbers of objects and fault probabilities, respectively. One can see that the estimates of the joint probability have greater K-L divergence (which is expected) and that RKDE gives the best results. From these experiments we conclude that RKDE can estimate the joint distribution quite accurately.
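One way to realize a kernel over ground relational states is to let similarity decay with the number of ground predicates on which two states disagree. The exponential form and bandwidth below are illustrative assumptions, not the paper's exact kernel:

```python
import math

def hamming_kernel(state_a, state_b, bandwidth=1.0):
    # States are dicts mapping ground predicates to truth values.
    # Similarity decays exponentially with the number of disagreements.
    diff = sum(state_a[g] != state_b[g] for g in state_a)
    return math.exp(-diff / bandwidth)

def joint_density(query, particles, bandwidth=1.0):
    # Kernel density estimate of the joint state probability:
    # average kernel similarity between the query state and each particle.
    k = [hamming_kernel(query, p, bandwidth) for p in particles]
    return sum(k) / len(k)

particles = [
    {('Welded', 'p1', 'p2'): True,  ('Welded', 'p1', 'b1'): False},
    {('Welded', 'p1', 'p2'): True,  ('Welded', 'p1', 'b1'): True},
]
query = {('Welded', 'p1', 'p2'): True, ('Welded', 'p1', 'b1'): False}
# The query matches the first particle exactly and differs from the
# second in one predicate, so its score is (1 + e^-1) / 2.
print(joint_density(query, particles))
```

Note that this sketch is unnormalized; a proper density estimate would divide by the kernel's normalizing constant over the state space.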
Figure 14: Although ASPF (20,000 particles) does not require A1 or A2, it is nearly as accurate as RBPF (20,000 particles) when A1 and A2 hold and hence RBPF is applicable.
Figure 15: The K-L divergence difference between ASPF and RBPF is small irrespective of the number of relations per object.
Figure 16: PF with relational kernel density estimation outperforms standard PF and PF with abstraction smoothing using independence assumptions. The experiment was run with 1000 and 1500 objects and p_f = 1%, with our algorithms using 20,000 particles and standard PF using 100,000 particles.
Figure 17: PF with relational kernel density estimation outperforms the other algorithms when predicting the joint distribution. The experiment was run with 1000 objects and fault probabilities of 1% and 10%.
The following conclusions can be drawn from the experiments:

• All of our algorithms (RBPF, PF with smoothing, and RKDE) are much more accurate than standard PF for inference in RDBNs, using similar computational resources.

• RBPF is the best method when Assumption A1 holds, and scales up to a small maximum number of relations per object per predicate.

• PF with abstraction smoothing, unlike RBPF, is applicable in all scenarios for estimating the marginal distributions, and is quite accurate.

• For estimating joint distributions, relational kernel density estimation outperforms PF with abstraction smoothing.
8. Related Work
In recent years, much research has focused on extending Bayesian networks to domains with relational structure. Approaches include stochastic logic programs (Muggleton, 1996; Cussens, 1999), probabilistic relational models (Friedman et al., 1999; Getoor, Friedman, Koller, & Taskar, 2001), Bayesian logic programs (Kersting & De Raedt, 2000) and Markov logic networks (Richardson & Domingos, 2004), among others. The relational Bayesian networks as we have defined them in this paper are most closely related to the recursive relational Bayesian networks of Jaeger (1997). The main difference is that we specify the probabilistic dependencies using FOPTs, whereas Jaeger uses the notion of combination functions (such as noisy-or) and equality constraints to define probability formulae over multisets.

However, there has been very limited work on extending these formalisms to temporal domains. Dynamic object-oriented Bayesian networks (DOOBNs) (Friedman, Koller, & Pfeffer, 1998) combine DBNs with OOBNs, a predecessor of PRMs. Unfortunately, no efficient inference methods were proposed for DOOBNs, and they have not been evaluated experimentally. Glesner and Koller (1995) proposed the idea of adding the power of first-order logic to DBNs. However, they only give procedures for constructing flexible DBNs out of first-order knowledge bases, and do not consider inference or learning procedures. Like DOOBNs, these were also not evaluated experimentally. Relational Markov models (RMMs) (Anderson et al., 2002) and logical hidden Markov models (LOHMMs) (Kersting & Raiko, 2005) are extensions of HMMs to first-order domains.

In our previous work, we introduced dynamic probabilistic relational models (DPRMs) (Sanghai et al., 2003), which are an extension of PRMs. DPRMs can be viewed as a combination of PRMs and dynamic Bayesian networks. DPRMs are based on frame-based systems, which model the world in terms of classes, objects and their attributes.
Objects are instances of classes, and each class has a set of propositional attributes and relational attributes (reference slots). The propositional attributes represent the properties of an object, and the relational attributes model relationships between two objects. A DPRM specifies a probability distribution for each attribute of each class as a conditional probability table. The parents of an attribute are other attributes of the same class, or attributes of related classes reached via some slot chain. A slot in a frame-based system performs the same function as a foreign key in a relational database. A slot chain can be viewed as a sequence
of foreign keys enabling one to move from one table to another. The parents can be attributes from the current time slice or previous time slices.

DPRMs are related to RDBNs in the same way that frame-based systems are related to first-order logic. In Appendix B we prove that RDBNs subsume DPRMs, i.e., for every DPRM representing a probability distribution over a dynamic relational domain, there exists an RDBN which gives the same distribution. The proof is straightforward: it involves replacing attributes by predicates and conditional probability tables (CPTs) by first-order probability trees (FOPTs). An important point to note is that all the inference algorithms described here are also applicable to DPRMs. There are also several advantages of using RDBNs instead of DPRMs:

• RDBNs generalize DPRMs by providing a more powerful language (first-order logic instead of frame-based systems).

• In DPRMs, the parents of an attribute (i.e., predicate) are obtained by traversing chains of reference slots, which correspond to conjunctive expressions, while in RDBNs parents can be obtained via arbitrary first-order logic constraints. However, because of the restrictions placed on DPRMs, learning them is potentially easier.

• Modeling n-ary relationships using DPRMs requires breaking them up into binary relationships (slots), which makes the task cumbersome. In general, the language of DPRMs is much harder to understand than that of RDBNs.

• In DPRMs the set of parents and the conditional model for each attribute are specified using one big table. In RDBNs, the parents and the conditional model are specified using FOPTs, which can take advantage of context-specific independence to reduce space requirements and possibly speed up inference.

• In DPRMs, mutual exclusion between ground predicates is not modeled.
For example, when modeling multi-valued slots, i.e., cases in which each object can be related to many other objects via the same slot, independence is assumed between the participating target objects.

• RDBNs have more scope for learning using ILP techniques such as first-order decision tree induction (Blockeel & De Raedt, 1998).

Particle filtering is currently a very active area of research (Doucet et al., 2001). The FastSLAM algorithm uses a tree structure to speed up RBPF with Gaussian variables (Montemerlo, Thrun, Koller, & Wegbreit, 2002). Koller and Fratkina (1998) used the particles at each step to induce a distribution over the DBN's states, and generated the next step's particles from this distribution. We tried this approach, but it led to poor results compared to using the distribution only to estimate the current state. Koller and Fratkina found that density trees outperformed Bayesian networks as the representation for the distribution. Consistent with these results, we tried Bayesian networks and found they were less accurate than abstraction trees (and also much slower).

An alternative method for efficient inference in DBNs that may also be useful in RDBNs was proposed by Boyen and Koller (1998), and combined with particle filtering by Ng et al. (2002). These methods take advantage of the structure in the network, decomposing it into several nearly independent parts and performing inference separately on each of them. Efficient MCMC inference in relational probabilistic models has been studied by Pasula and Russell (2001). Their method uses
a Metropolis-Hastings step instead of the standard Gibbs step to sample relational variables, and is applicable only to a restricted form of distribution.

The use of abstractions has been studied extensively in AI (e.g., Koenig & Holte, 2002). Friedman et al. (2000) used value abstractions to compute the likelihood. They defined safe and cautious abstractions with respect to a variable, and such concepts could also be used in RDBNs to speed up computation. However, they did not consider using a hierarchy of abstractions and did not smooth over the abstractions. Verma et al. (2003) used abstractions for particle filtering in small numeric DBNs, with a bias-variance criterion for choosing abstractions; this technique may be generalizable to RDBNs.

Downstream, RDBNs should be relevant to research on relational Markov decision processes (e.g., Boutilier, Reiter, & Price, 2001).
9. Conclusions and Future Work
This paper introduces relational dynamic Bayesian networks (RDBNs), a representation that handles time-changing phenomena, relational structure and uncertainty in a principled manner. We develop three approximate algorithms for efficient inference in RDBNs:

• Rao-Blackwellized particle filtering extends standard particle filtering by analytically computing the joint distribution of the complex predicates given sampled values of the simple predicates. The method only applies when the complex predicates do not appear as parents of other predicates in the RDBN. When the number of relations per object is bounded by a small constant (e.g., 10), Rao-Blackwellization can be done efficiently and greatly outperforms standard particle filtering.

• Particle filtering with abstraction-based smoothing uses abstraction lattices defined over the complex predicates to smooth the particle filter's estimates. When computing marginal distributions, particle filtering with abstraction-based smoothing requires substantially fewer particles than standard particle filtering and gives very accurate results.

• Relational kernel density estimation is an extension of particle filtering used to compute joint distributions by defining a kernel function over the ground state of the complex predicates. Relational kernel density estimation outperforms both standard particle filtering and particle filtering with abstraction-based smoothing when predicting joint distributions of the unknown predicates.

The above algorithms can be used in any relational stochastic process. They can also be applied to static relational domains and propositional domains. In the future we wish to combine smoothing and kernel density estimation with sampling algorithms like MCMC. Other directions for future work include handling continuous variables, learning RDBNs, using them as a basis for relational MDPs, and applying them to increasingly complex real-world problems.
Acknowledgements
This work was partly funded by NSF grant IIS-0307906, ONR grants N00014-02-1-0408 and N00014-02-1-0932, DARPA project CALO through SRI grant number 03-000225, a Sloan Fellowship to the second author, and an NSF CAREER Award to the second author.
Appendix A: Assembly Domain
This section lists the objects, properties (simple predicates), relations (complex predicates) and actions used in the experiments.

• Objects:
  – Plates (Shape, Surface, Temperature, Color, Size)
  – Brackets (Shape, Surface, Weight, Color)
  – Bolts (Hole Type, Size, Weight)

• Relations:
  – Welded-To: (Plate, Plate), (Plate, Bracket)
  – Bolted-To: (Plate, Bolt), (Bracket, Bolt)

• Propositional Actions:
  – Lathe(Object, Shape/Size, Value, t)
  – Paint(Object, Color, Value, t)
  – Polish(Object, Surface, Value, t)
  – Heat(Object, Temperature, Value, t)
  – Punch(Object, Hole Type, Value, t)

• Relational Actions:
  – Weld(Plate, Plate/Bracket, t)
  – Bolt(Bolt, Plate/Bracket, t)

• Fault models: We now describe the fault models for the propositional and relational actions. The fault models in our system are defined using FOPTs. However, they turn out to be complex, particularly for relational actions. Hence, we describe them here using pseudo-code, where f_p is the fault probability.
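The object classes and relations above can be transcribed directly into code. A minimal sketch follows; the class layout, the `name` field and the string-typed properties are our own choices:

```python
from dataclasses import dataclass, field

@dataclass
class Plate:
    name: str
    shape: str
    surface: str
    temperature: str
    color: str
    size: str

@dataclass
class Bracket:
    name: str
    shape: str
    surface: str
    weight: str
    color: str

@dataclass
class Bolt:
    name: str
    hole_type: str
    size: str
    weight: str

@dataclass
class State:
    # Complex predicates as sets of ground pairs, referenced by name:
    # Welded-To over (Plate, Plate) and (Plate, Bracket);
    # Bolted-To over (Plate, Bolt) and (Bracket, Bolt).
    welded_to: set = field(default_factory=set)
    bolted_to: set = field(default_factory=set)

p1 = Plate('p1', 'square', 'rough', 'cold', 'red', 'large')
b1 = Bracket('b1', 'square', 'rough', 'heavy', 'red')
s = State()
s.welded_to.add((p1.name, b1.name))   # ground predicate Welded-To(p1, b1)
print(len(s.welded_to))               # → 1
```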
Algorithm 2 Fault model for propositional actions

PropAction(Object, Attribute, Value, t)
Do one of the three actions below:
1. Attribute(Object, Value, t) ← True
   Attribute(Object, Value′, t) ← False for all Value′ ≠ Value
   with probability p = 1 − f_p.
2. Leave the state unchanged
   with probability p = f_p/2.
3. Choose Value′ with uniform distribution
   Attribute(Object, Value′, t) ← True
   Attribute(Object, Value′′, t) ← False for all Value′′ ≠ Value′
   with probability p = f_p/2.
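Algorithm 2 admits a compact executable rendering. In the sketch below, the state stores the single true value per (attribute, object) pair, which enforces the mutual exclusion among Value groundings implicitly; this representation is our own choice:

```python
import random

def prop_action(state, obj, attribute, value, domain, f_p, rng):
    """Apply a propositional action under the Algorithm 2 fault model.
    state[(attribute, obj)] holds the single true value of the attribute,
    so setting it also makes all other values false."""
    r = rng.random()
    if r < 1 - f_p:
        state[(attribute, obj)] = value               # action succeeds
    elif r < 1 - f_p / 2:
        pass                                          # fault: state unchanged
    else:
        state[(attribute, obj)] = rng.choice(domain)  # fault: random value

rng = random.Random(1)
state = {('Color', 'plate1'): 'red'}
prop_action(state, 'plate1', 'Color', 'blue',
            ['red', 'blue', 'green'], 0.0, rng)
print(state[('Color', 'plate1')])   # with f_p = 0 the action always succeeds
```

The three branches occur with probabilities 1 − f_p, f_p/2 and f_p/2, matching the pseudo-code.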
Algorithm 3
Fault model for relational actions
Weld (O1, O2, t)
Do one of the five actions below:
1. Welded-To(O1, O2, t) ← True
   with probability p = 1 − f_p.
2. Leave the state unchanged
   with probability p = 0.05 · f_p.
3. Choose a wrong object O1′ with probability p = 0.45 · f_p, as follows: with fixed mixing probabilities, choose O1′ uniformly from (a) the set of objects whose color and shape are the same as those of O1, (b) the set of objects whose color is the same as that of O1, (c) the set of objects whose shape is the same as that of O1, or (d) the set of all objects.
   Welded-To(O1′, O2, t) ← True.
4. Choose a wrong object O2′ to replace O2 with probability p = 0.45 · f_p, using the above procedure.
   Welded-To(O1, O2′, t) ← True.
5. Choose wrong objects O1′ and O2′ with probability p = 0.05 · f_p, using the above procedure.
   Welded-To(O1′, O2′, t) ← True.
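The wrong-object selection in Algorithm 3 can be sketched as below; the equal mixing weights over the four candidate sets are our own simplification:

```python
import random

def choose_wrong_object(objects, target, rng):
    """Pick a substitute for `target` in the spirit of Algorithm 3:
    candidates sharing its color and shape, then color only, then shape
    only, then any object. A non-empty candidate set is chosen uniformly
    (illustrative; the relative weights are an assumption), then an
    object is drawn uniformly from it."""
    pools = [
        [o for o in objects if o['color'] == target['color']
                            and o['shape'] == target['shape']],
        [o for o in objects if o['color'] == target['color']],
        [o for o in objects if o['shape'] == target['shape']],
        list(objects),
    ]
    pool = rng.choice([p for p in pools if p])   # skip empty candidate sets
    return rng.choice(pool)

rng = random.Random(2)
objects = [
    {'id': 'p1', 'color': 'red', 'shape': 'square'},
    {'id': 'p2', 'color': 'red', 'shape': 'square'},
    {'id': 'p3', 'color': 'blue', 'shape': 'round'},
]
wrong = choose_wrong_object(objects, objects[0], rng)
print(wrong['id'] in {'p1', 'p2', 'p3'})
```

As in the pseudo-code, the chooser is biased toward objects that resemble the intended one, which is what makes the resulting faults hard to detect from observations alone.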
Bolt (O1, O2, t)
The fault model is precisely the same as above, except that the properties on which the choice of wrong object depends are Size for bolts, and Color and Surface for plates and brackets.
Appendix B: RDBNs as a Generalization of DPRMs
As discussed before, dynamic probabilistic relational models (DPRMs) can also be used to model uncertainty in a dynamic relational domain. However, they are an extension of PRMs, which are based on frame-based systems, and inherit the limitations of the latter. Here we show that RDBNs subsume DPRMs. In Section 8 we saw that the converse is not true.
Proposition 3
For each DPRM representing a probability distribution over dynamic relational domains there is an RDBN representing the same distribution.

Proof: We will first convert the attributes and the reference slots in a DPRM to predicates in an RDBN, then add the corresponding edges to indicate the parents of each node, and finally prove that this does not lead to a cycle and that each CPT can be converted to a FOPT.

Each slot C.A in a DPRM corresponds to a predicate A(x, v, t), where A is the predicate name, x represents the object, v is the value of the slot and t indicates the time slice. If A is a simple attribute, then v is a constant representing the value; otherwise v is an object. If C.A has a parent of the form C.B, then A(x, y, t) has B(x, y, t) as a parent. If C.A has a parent of the form γ(C.τ.B), where τ is a slot chain and γ an aggregation function, then all of the predicates corresponding to the slots in the slot chain are parents of A(x, y, t).

We now show that if the initial DPRM is legal (i.e., without a cycle) then the RDBN obtained above is also legal. To prove this, we first consider the set of all certain slots in the DPRM (i.e., slots whose values are already known). In the RDBN, the predicates corresponding to these are certain, and these predicates can be given higher priority than any of the other predicates. The relative ordering among these predicates themselves does not matter. For the rest of the predicates we define a relative ordering as follows: if any predicate appears as a parent of some other predicate, the parent predicate is given a higher priority.

We have to prove that the relative ordering defined above is consistent, i.e., that in the RDBN graph there is no cycle among the predicates. We do this by contradiction. Assume that there is a cycle corresponding to predicates R_1, ..., R_k, i.e., R_i is a parent of R_{i-1} and R_1 is a parent of R_k. It is easy to see that any predicate which has known values cannot appear in the cycle. If the cycle consists only of predicates which correspond to simple attributes, we can see that there will be a cycle among the attributes in the DPRM. Therefore, the cycle must involve predicates which correspond to reference slots (that are unknown). We will assume that all the predicates in the cycle correspond to reference slots and prove that the DPRM is illegal, leading to a contradiction. The case where some of the predicates in the cycle correspond to simple attributes can be ruled out similarly. Let C_i.ρ_i be the reference slot corresponding to the predicate R_i. Since R_i is a parent of R_{i-1}, either C_i.ρ_i is a parent of C_{i-1}.ρ_{i-1} in the DPRM, or C_i.ρ_i appears in the slot chain τ such that C_{i-1}.τ.B is a parent of C_{i-1}.ρ_{i-1}. In the first case there is an edge in the DPRM from C_i.ρ_i to C_{i-1}.ρ_{i-1}. In the second case, there is an edge from C_i.ρ_i to C_{i-1}.ρ_{i-1} if C_i.ρ_i is unknown, which is the case here. Hence, a cycle among the predicates corresponds to a cycle among the reference slots in the DPRM. This implies that the DPRM is illegal, leading to a contradiction.

Finally, we have to prove that the CPT in a DPRM can be converted to a FOPT in an RDBN. First, if in the DPRM C.B is a parent of C.A, then we make B(x, v, t) a parent of A(x, v′, t) for all values of v. However, we need to make sure that A(x, v′, t) is false if A(x, v′′, t) is true. To do this we define an (arbitrary) ordering on the constants, make the predicate A a parent of itself, and introduce the requisite dependencies, restricted to higher-priority groundings. Similarly, if γ(C.τ.B) is a parent of C.A, where γ is an aggregation function, then in the FOPT we can use the expression γ(v : ∃ y_1, ..., y_{m+1}. R_1(x, y_1, t) ∧ ... ∧ R_i(y_i, y_{i+1}, t) ∧ ... ∧ B(y_{m+1}, v, t)), where τ is the slot chain ρ_1...ρ_m and R_i is the predicate corresponding to the slot ρ_i. We can now easily see that for a node A(x, v′, t), any row of the corresponding CPT in the DPRM is equivalent to a first-order expression in the FOPT involving the expression above and tests on B(x, v, t) and A(x, v′′, t). Hence the CPT can be converted to an equivalent FOPT. ✷

References
Anderson, C., Domingos, P., & Weld, D. (2002). Relational Markov models and their application to adaptive Web navigation. In Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining, pp. 143–152, New York, NY. ACM Press.

Bacchus, F. (2001). The AIPS-2000 planning competition. AI Magazine, 22(3), 47–56.

Blockeel, H., & De Raedt, L. (1998). Top-down induction of first-order logical decision trees. Artificial Intelligence, 101(1-2), 285–297.

Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pp. 115–123, San Francisco, CA. Morgan Kaufmann.

Boutilier, C., Reiter, R., & Price, B. (2001). Symbolic dynamic programming for first-order MDPs. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 690–700, Seattle, WA. Morgan Kaufmann.

Boyen, X., & Koller, D. (1998). Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 33–42, Madison, WI. Morgan Kaufmann.

Clark, P., & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283.

Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 115–123, Tahoe City, CA. Morgan Kaufmann.

Cover, T., & Thomas, J. (2001). Elements of Information Theory. Wiley, New York.

Cussens, J. (1999). Loglinear models for first-order probabilistic reasoning. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 126–133, Stockholm, Sweden. Morgan Kaufmann.

Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150.

Doucet, A., de Freitas, N., & Gordon, N. (Eds.). (2001). Sequential Monte Carlo Methods in Practice. Springer, New York.

Duda, R., Hart, P., & Stork, D. (2000). Pattern Classification. Wiley-Interscience, New York.

Friedman, N., Geiger, D., & Lotner, N. (2000). Likelihood computations using value abstraction. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 192–200, San Francisco, CA. Morgan Kaufmann.
Friedman, N., Getoor, L., Koller, D., & Pfeffer, A. (1999). Learning probabilistic relational models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 1300–1307, Stockholm, Sweden. Morgan Kaufmann.

Friedman, N., Koller, D., & Pfeffer, A. (1998). Structured representation of complex stochastic systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 192–200, Madison, WI. AAAI Press.

Getoor, L., Friedman, N., Koller, D., & Taskar, B. (2001). Learning probabilistic models of relational structure. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 170–177, Williamstown, MA. Morgan Kaufmann.

Glesner, S., & Koller, D. (1995). Constructing flexible dynamic belief networks from first-order probabilistic knowledge bases. In Proceedings of the European Conference on Symbolic and Quantitative Approaches to Reasoning under Uncertainty, pp. 217–226.

Hoffmann, J., & Nebel, B. (2001). The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14, 253–302.

Jaeger, M. (1997). Relational Bayesian networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 266–273, Providence, RI. Morgan Kaufmann.

Kersting, K., & Raiko, T. (2005). 'Say EM' for selecting probabilistic models of logical sequences. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, Edinburgh, Scotland. Morgan Kaufmann.

Kersting, K., & De Raedt, L. (2000). Bayesian logic programs. In Proceedings of the Tenth International Conference on Inductive Logic Programming, London, UK. Springer.

Koenig, S., & Holte, R. (Eds.). (2002). Proceedings of the Fifth International Symposium on Abstraction, Reformulation and Approximation. Springer, Kananaskis, Canada.

Koller, D., & Fratkina, R. (1998). Using learning for approximation in stochastic processes. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 287–295, Madison, WI. Morgan Kaufmann.

McCallum, A., Rosenfeld, R., Mitchell, T., & Ng, A. Y. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 359–367, Madison, WI. Morgan Kaufmann.

Montemerlo, M., Thrun, S., Koller, D., & Wegbreit, B. (2002). FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pp. 593–598, Edmonton, Canada. AAAI Press.

Muggleton, S. (1996). Stochastic logic programs. In Proceedings of the Sixth International Conference on Inductive Logic Programming, pp. 254–264, Stockholm, Sweden. Springer.

Murphy, K., & Russell, S. (2001). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Doucet, A., de Freitas, N., & Gordon, N. (Eds.), Sequential Monte Carlo Methods in Practice, pp. 499–516. Springer, New York.

Ng, B., Peshkin, L., & Pfeffer, A. (2002). Factored particles for scalable monitoring. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 370–377, Edmonton, Canada. Morgan Kaufmann.
Pasula, H., & Russell, S. J. (2001). Approximate inference for first-order probabilistic languages. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 741–748, Seattle, WA. Morgan Kaufmann.

Provost, F., & Domingos, P. (2003). Tree induction for probability-based ranking. Machine Learning, 52(3), 199–215.

Sanghai, S., Domingos, P., & Weld, D. (2003). Dynamic probabilistic relational models. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 992–1002, Acapulco, Mexico. Morgan Kaufmann.

Verma, V., Thrun, S., & Simmons, R. (2003). Variable resolution particle filter. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp. 976–984, Acapulco, Mexico. Morgan Kaufmann.