Probabilistic Planning via Heuristic Forward Search and Weighted Model Counting
Journal of Artificial Intelligence Research 30 (2007) 565-620. Submitted 3/07; published 12/07.
Carmel Domshlak  (dcarmel@ie.technion.ac.il)
Technion - Israel Institute of Technology, Haifa, Israel

Jörg Hoffmann  (joerg.hoffmann@deri.at)
University of Innsbruck, DERI, Innsbruck, Austria
Abstract
We present a new algorithm for probabilistic planning with no observability. Our algorithm, called Probabilistic-FF, extends the heuristic forward-search machinery of Conformant-FF to problems with probabilistic uncertainty about both the initial state and action effects. Specifically, Probabilistic-FF combines Conformant-FF's techniques with a powerful machinery for weighted model counting in (weighted) CNFs, serving to elegantly define both the search space and the heuristic function. Our evaluation of Probabilistic-FF shows its fine scalability in a range of probabilistic domains, constituting a several orders of magnitude improvement over previous results in this area. We use a problematic case to point out the main open issue to be addressed by further research.
1. Introduction
In this paper we address the problem of probabilistic planning with no observability (Kushmerick, Hanks, & Weld, 1995), also known in the AI planning community as conditional (Majercik & Littman, 2003) or conformant (Hyafil & Bacchus, 2004) probabilistic planning. In such problems we are given an initial belief state in the form of a probability distribution over the world states W, a set of actions (possibly) having probabilistic effects, and a set of alternative goal states W_G ⊆ W. A solution to such a problem is a single sequence of actions that transforms the system into one of the goal states with probability exceeding a given threshold θ. The basic assumption of the problem is that the system cannot be observed at the time of plan execution. Such a setting is useful in controlling systems with uncertain initial state and non-deterministic actions, if sensing is expensive or unreliable. Non-probabilistic conformant planning may fail due to non-existence of a plan that achieves the goals with 100% certainty. Even if there is such a plan, that plan does not necessarily contain information about what actions are most useful to achieve only the requested threshold θ.

The state-of-the-art performance of probabilistic planners has been advancing much more slowly than that of deterministic planners, scaling from 5-10 step plans for problems with ≈ 20 world states to 15-20 step plans for problems with ≈ 100 world states (Kushmerick et al., 1995; Majercik & Littman, 1998; Hyafil & Bacchus, 2004). Since probabilistic planning is inherently harder than its deterministic counterpart (Littman, Goldsmith, & Mundhenk, 1998), such a difference in evolution rates is by itself not surprising. However, recent developments in the area (Onder, Whelan, & Li, 2006; Bryce, Kambhampati, & Smith, 2006; Huang, 2006), and in particular our work here, show that dramatic improvements in probabilistic planning can be obtained.
In this paper we introduce Probabilistic-FF, a new probabilistic planner based on heuristic forward search in the space of implicitly represented probabilistic belief states. The planner is a natural extension of the recent (non-probabilistic) conformant planner Conformant-FF (Hoffmann & Brafman, 2006). The main trick is to replace Conformant-FF's SAT-based techniques with a recent powerful technique for probabilistic reasoning by weighted model counting (WMC) in propositional CNFs (Sang, Beame, & Kautz, 2005). In more detail, Conformant-FF does a forward search in a belief space in which each belief state corresponds to a set of world states considered to be possible. The main trick of Conformant-FF is the use of CNF formulas for an implicit representation of belief states. Implicit, in this context, means that formulas φ(ā) encode the semantics of executing action sequence ā in the initial belief state, with propositional variables corresponding to facts with time-stamps. Any actual knowledge about the belief states has to be (and can be) inferred from these formulas. Most particularly, a fact p is known to be true in a belief state if and only if φ(ā) implies p(m), where m is the time endpoint of the formula. The only knowledge computed by Conformant-FF about belief states are these known facts, as well as (symmetrically) the facts that are known to be false. This suffices to do STRIPS-style planning, that is, to determine applicable actions and goal belief states.
In the heuristic function, FF's (Hoffmann & Nebel, 2001) relaxed planning graph technique is enriched with approximate SAT reasoning.

The basic ideas underlying Probabilistic-FF are:

(i) Define time-stamped Bayesian networks (BNs) describing probabilistic belief states.
(ii) Extend Conformant-FF's belief state CNFs to model these BNs.
(iii) In addition to the SAT reasoning used by Conformant-FF, use weighted model counting to determine whether the probability of the (unknown) goals in a belief state is high enough.
(iv) Introduce approximate probabilistic reasoning into Conformant-FF's heuristic function.

Note the synergetic effect: Probabilistic-FF re-uses all of Conformant-FF's technology to recognize facts that are true or false with probability 1. This fully serves to determine applicable actions, as well as detect whether part of the goal is already known. In fact, it is as if Conformant-FF's CNF-based techniques were specifically made to suit the probabilistic setting: while without probabilities one could imagine successfully replacing the CNFs with BDDs, with probabilities this seems much more problematic.

The algorithms we present cover probabilistic initial belief states given as Bayesian networks, deterministic and probabilistic actions, conditional effects, and standard action preconditions. Our experiments show that our approach is quite effective in a range of domains. In contrast to the SAT and CSP based approaches mentioned above (Majercik & Littman, 1998; Hyafil & Bacchus, 2004), Probabilistic-FF scales several orders of magnitude beyond these previous results. However, such a comparison is not entirely fair due to the different nature of the results provided; the SAT and CSP based approaches provide guarantees on the length of the solution. The approach most closely related to Probabilistic-FF is implemented in POND (Bryce et al., 2006): this system, like Probabilistic-FF, does conformant probabilistic planning for a threshold θ, using a non-admissible, planning-graph based heuristic to guide the search.
Hence a comparison between Probabilistic-FF and POND is fair, and in our experiments we perform a comparative evaluation of Probabilistic-FF and POND. While the two approaches are related, there are significant differences in the search space representation, as well as in the definition and computation of the heuristic function. We run the two approaches on a range of domains partly taken from the probabilistic planning literature, partly obtained by enriching conformant benchmarks with probabilities, and partly obtained by enriching classical benchmarks with probabilistic uncertainty. In almost all cases, Probabilistic-FF outperforms POND by at least an order of magnitude. We make some interesting observations regarding the behavior of the two planners; in particular we identify a domain, derived from the classical Logistics domain, where both approaches fail to scale. The apparent reason is that neither approach is good enough at detecting how many times, at an early point in the plan, a probabilistic action must be applied in order to sufficiently support a high goal threshold at the end of the plan. Devising methods that are better in this regard is the most pressing open issue in this line of work.

The paper is structured as follows. The next section provides the technical background, formally defining the problem we address and illustrating it with our running example. Section 3 details how probabilistic belief states are represented as time-stamped Bayesian networks, how these Bayesian networks are encoded as weighted CNF formulas, and how the necessary reasoning is performed on this representation. Section 4 explains and illustrates our extension of Conformant-FF's heuristic function to the probabilistic settings. Section 5 provides the empirical results, and Section 6 concludes. All proofs are moved into Appendix A.
2. Background
The probabilistic planning framework we consider adds probabilistic uncertainty to a subset of the classical ADL language, namely (sequential) STRIPS with conditional effects. Such STRIPS planning tasks are described over a set of propositions P as triples (A, I, G), corresponding to the action set, initial world state, and goals. I and G are sets of propositions, where I describes a concrete initial state w_I, while G describes the set of goal states w ⊇ G. Actions a are pairs (pre(a), E(a)) of the precondition and the (conditional) effects. A conditional effect e is a triple (con(e), add(e), del(e)) of (possibly empty) proposition sets, corresponding to the effect's condition, add, and delete lists, respectively. The precondition pre(a) is also a proposition set, and an action a is applicable in a world state w if w ⊇ pre(a). If a is not applicable in w, then the result of applying a to w is undefined. If a is applicable in w, then all conditional effects e ∈ E(a) with w ⊇ con(e) occur. Occurrence of a conditional effect e in w results in the world state (w ∪ add(e)) \ del(e).

If an action a is applied to w, and there is a proposition q such that q ∈ add(e) ∩ del(e′) for (possibly the same) occurring e, e′ ∈ E(a), then the result of applying a in w is undefined. Thus, we require the actions to be not self-contradictory, that is, for each a ∈ A, and every e, e′ ∈ E(a), if there exists a world state w ⊇ con(e) ∪ con(e′), then add(e) ∩ del(e′) = ∅. Finally, an action sequence ā is a plan if the world state that results from iterative execution of ā's actions, starting in w_I, leads to a goal state w ⊇ G.

Our probabilistic planning setting extends the above with (i) probabilistic uncertainty about the initial state, and (ii) actions that can have probabilistic effects. In general, probabilistic planning
1. POND does not use implicit belief states, and the probabilistic part of its heuristic function uses sampling techniques, rather than the probabilistic reasoning techniques we employ.
tasks are quadruples (A, b_I, G, θ), corresponding to the action set, initial belief state, goals, and acceptable goal satisfaction probability. As before, G is a set of propositions. The initial state is no longer assumed to be known precisely. Instead, we are given a probability distribution over the world states, b_I, where b_I(w) describes the likelihood of w being the initial world state.

Similarly to classical planning, actions a ∈ A are pairs (pre(a), E(a)), but the effect set E(a) for such a has richer structure and semantics. Each e ∈ E(a) is a pair (con(e), Λ(e)) of a propositional condition and a set of probabilistic outcomes. Each probabilistic outcome ε ∈ Λ(e) is a triplet (Pr(ε), add(ε), del(ε)), where add and delete lists are as before, and Pr(ε) is the probability that outcome ε occurs as a result of effect e. Naturally, we require that probabilistic effects define probability distributions over their outcomes, that is, Σ_{ε ∈ Λ(e)} Pr(ε) = 1. The special case of deterministic effects e is modeled this way via Λ(e) = {ε} and Pr(ε) = 1. Unconditional actions are modeled as having a single effect e with con(e) = ∅. As before, if a is not applicable in w, then the result of applying a to w is undefined. Otherwise, if a is applicable in w, then there exists exactly one effect e ∈ E(a) such that con(e) ⊆ w, and for each ε ∈ Λ(e), applying a to w results in (w ∪ add(ε)) \ del(ε) with probability Pr(ε).
The likelihood [b, a](w′) of a world state w′ in the belief state [b, a], resulting from applying a probabilistic action a in b, is given by

  [b, a](w′) = Σ_{w ⊇ pre(a)} b(w) · Σ_{ε ∈ Λ(e)} Pr(ε) · δ(w′ = (w ∪ add(ε)) \ del(ε)),   (1)

where e is the effect of a such that con(e) ⊆ w, and δ(·) is the Kronecker step function that takes the value 1 if the argument predicate evaluates to TRUE, and 0 otherwise.

Our formalism covers all the problem-description features supported by the previously proposed formalisms for conformant probabilistic planning (Kushmerick et al., 1995; Majercik & Littman, 1998; Hyafil & Bacchus, 2004; Onder et al., 2006; Bryce et al., 2006; Huang, 2006), and it corresponds to what is called Unary Nondeterminism (1ND) normal form (Rintanen, 2003). We note that there are more succinct forms for specifying probabilistic planning problems (Rintanen, 2003), yet 1ND normal form appears to be most intuitive from the perspective of knowledge engineering.

Example 1
Say we have a robot and a block that physically can be at one of two locations. This information is captured by the propositions r1, r2 for the robot, and b1, b2 for the block, respectively. The robot can either move from one location to another, or do it while carrying the block. If the robot moves without the block, then its move is guaranteed to succeed. This provides us with a pair of symmetrically defined deterministic actions {move-right, move-left}. The action move-right has an empty precondition, and a single conditional effect e = ({r1}, {ε}) with Pr(ε) = 1, add(ε) = {r2}, and del(ε) = {r1}. If the robot tries to move while carrying the block, then this move succeeds with probability p1, while with probability p2 the robot ends up moving without the block, and with probability 1 − p1 − p2 this move of the robot fails completely. This provides us with a pair of (again, symmetrically defined) probabilistic actions {move-b-right, move-b-left}. The action move-b-right has an empty precondition, and two conditional effects specified as in Table 1.

  E(a)  con(e)       Λ(e)   Pr(ε)         add(ε)     del(ε)
  e     r1 ∧ b1      ε1     p1            {r2, b2}   {r1, b1}
                     ε2     p2            {r2}       {r1}
                     ε3     1 − p1 − p2   ∅          ∅
  e′    ¬r1 ∨ ¬b1    ε′1    1             ∅          ∅

Table 1: Possible effects and outcomes of the action move-b-right in Example 1.

Having specified the semantics and structure of all the components of (A, b_I, G, θ) but θ, we are now ready to specify the actual task of probabilistic planning in our setting. Recall that our actions transform probabilistic belief states to belief states. For any action sequence ā ∈ A*, and any belief state b, the new belief state [b, ā] resulting from applying ā at b is given by

  [b, ā] = b,             if ā = ⟨⟩
  [b, ā] = [b, a],        if ā = ⟨a⟩, a ∈ A
  [b, ā] = [[b, a], ā′],  if ā = ⟨a⟩ · ā′, a ∈ A, ā′ ≠ ⟨⟩   (2)

In such a setting, achieving G with certainty is typically unrealistic. Hence, θ specifies the required lower bound on the probability of achieving G.
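To make the belief-update semantics of Equations 1-2 concrete, here is a small sketch (our illustration, not the authors' code) that represents a belief state explicitly as a map from world states to probabilities and applies one probabilistic action to it. The proposition names and the effect encoding below are illustrative simplifications; in particular, the "otherwise" effect is modeled as a catch-all with an empty condition listed last.

```python
def update_belief(belief, pre, effects):
    """Apply one probabilistic action (Equation 1) to an explicit belief state.

    belief:  dict mapping frozensets of propositions (world states) to probabilities
    effects: list of (condition, outcomes); outcomes: list of (prob, add, delete);
             the first effect whose condition holds in a world state is the one
             that occurs (a catch-all effect with empty condition goes last)
    """
    new_belief = {}
    for world, p in belief.items():
        assert pre <= world  # the planner only applies applicable actions
        outcomes = next(outs for con, outs in effects if con <= world)
        for prob, add, delete in outcomes:
            succ = frozenset((world | add) - delete)
            new_belief[succ] = new_belief.get(succ, 0.0) + p * prob
    return new_belief

# A move-b-right-like action: with probability 0.5 robot and block move,
# otherwise nothing happens (probabilities here are invented for illustration).
effects = [({"r1", "b1"}, [(0.5, {"r2", "b2"}, {"r1", "b1"}), (0.5, set(), set())]),
           (set(), [(1.0, set(), set())])]  # catch-all "otherwise" effect
b1 = update_belief({frozenset({"r1", "b1"}): 1.0}, set(), effects)
```

Sequential application as in Equation 2 is then just a left fold of `update_belief` over the action sequence.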
A sequence of actions ā is called a plan if we have b_ā(G) ≥ θ for the belief state b_ā = [b_I, ā].

Considering the initial belief state, practical considerations force us to limit our attention only to compactly representable probability distributions b_I. While there are numerous alternatives for compact representation of structured probability distributions, Bayes networks (BNs) (Pearl, 1988) are to date by far the most popular such representation model. Therefore, in Probabilistic-FF we assume that the initial belief state b_I is described by a BN N_{b_I} over our set of propositions P.

As excellent introductions to BNs abound (e.g., see Jensen, 1996), it suffices here to briefly define our notation. A BN N = (G, T) represents a probability distribution as a directed acyclic graph G, where its set of nodes X stands for random variables (assumed discrete in this paper), and T is a set of tables of conditional probabilities (CPTs), one table T_X for each node X ∈ X. For each possible value x ∈ Dom(X) (where Dom(X) denotes the domain of X), the table T_X lists the probability of the event X = x given each possible value assignment to all of its immediate ancestors (parents) Pa(X) in G. Thus, the table size is exponential in the in-degree of X. Usually, it is assumed either that this in-degree is small (Pearl, 1988), or that the probabilistic dependence of X on Pa(X) induces a significant local structure allowing a compact representation of T_X (Shimony, 1993, 1995; Boutilier, Friedman, Goldszmidt, & Koller, 1996). (Otherwise, representation of the distribution as a BN would not be a good idea in the first place.) The joint probability of a complete assignment ϑ to the variables X is given by the product of |X| terms taken from the respective CPTs (Pearl, 1988):

  Pr(ϑ) = ∏_{X ∈ X} Pr(ϑ[X] | ϑ[Pa(X)]) = ∏_{X ∈ X} T_X(ϑ[X] | ϑ[Pa(X)]),

where ϑ[·] stands for the partial assignment provided by ϑ to the corresponding subset of X.
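The product formula above is easy to state in code. The following sketch (ours; the CPT encoding and the numeric probabilities are illustrative assumptions, not values from the paper) computes Pr(ϑ) for a complete assignment ϑ in a two-variable network shaped like the running example, with B depending on R.

```python
def joint_prob(bn, assignment):
    """Pr(assignment) = product over variables of T_X(value | parent values).

    bn: {var: (parent_list, cpt)} with cpt: {(value, parent_value_tuple): prob}
    """
    p = 1.0
    for var, (parents, cpt) in bn.items():
        key = (assignment[var], tuple(assignment[q] for q in parents))
        p *= cpt[key]
    return p

# The structure of Example 2's network, R -> B; probabilities invented here.
bn = {
    "R": ([], {("r1", ()): 0.9, ("r2", ()): 0.1}),
    "B": (["R"], {("b1", ("r1",)): 0.7, ("b2", ("r1",)): 0.3,
                  ("b1", ("r2",)): 0.2, ("b2", ("r2",)): 0.8}),
}
p = joint_prob(bn, {"R": "r1", "B": "b2"})  # 0.9 * 0.3
```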
2. While BNs are our choice here, our framework can support other models as well, e.g. stochastic decision trees.
In Probabilistic-FF we allow N_{b_I} to be described over the multi-valued variables underlying the planning problem. This significantly simplifies the process of specifying N_{b_I} since the STRIPS propositions P do not correspond to the true random variables underlying the problem specification. Specifically, let ⋃_{i=1}^{k} P_i be a partition of P such that each proposition set P_i uniquely corresponds to the domain of a multi-valued variable underlying our problem. That is, for every world state w and every P_i, if |P_i| > 1, then there is exactly one proposition q ∈ P_i that holds in w. The variables of the BN N_{b_I} describing our initial belief state b_I are X = {X_1, ..., X_k}, where Dom(X_i) = P_i if |P_i| > 1, and Dom(X_i) = {q, ¬q} if P_i = {q}.

Example 2
For an illustration of such an N_{b_I}, consider our running example, and say the robot is known to be initially at one of the two possible locations with probabilities Pr(r1) and Pr(r2) = 1 − Pr(r1). Suppose there is a correlation in our belief about the initial locations of the robot and the block: our belief about the block's location, Pr(b1) versus Pr(b2), differs depending on whether the robot is at r1 or at r2. The initial belief state BN N_{b_I} is then defined over two variables R ("robot") and B ("block") with Dom(R) = {r1, r2} and Dom(B) = {b1, b2}, respectively, and it is depicted in Figure 1.

Figure 1: The initial belief state BN N_{b_I} for Example 1, with node R a parent of node B.

It is not hard to see that our STRIPS-style actions a ∈ A can be equivalently specified in terms of the multi-valued variables X. Specifically, if |P_i| > 1, then no action a can add a proposition q ∈ P_i without deleting some other proposition q′ ∈ P_i, and thus, we can consider a as setting X_i = q. If |P_i| = 1, then adding and deleting q ∈ P_i has the standard semantics of setting X_i = q and X_i = ¬q, respectively. For simplicity of presentation, we assume that our actions are not self-contradictory at the level of X as well: if two conditional effects e, e′ ∈ E(a) can possibly occur in some world state w, then the subsets of X affected by these two effects have to be disjoint. Finally, our goal G directly corresponds to a partial assignment to X (unless G is self-contradictory, requiring q ∧ q′ for some q, q′ ∈ P_i).
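As a concrete rendering of this proposition-to-variable mapping, the sketch below (ours, purely illustrative) turns a partition of P into multi-valued variable domains, treating singleton groups as boolean variables; the variable names X0, X1, ... are arbitrary.

```python
def make_variables(partition):
    """Map each group P_i of the partition of P to the domain of one variable X_i."""
    variables = {}
    for i, props in enumerate(partition):
        props = sorted(props)
        if len(props) > 1:
            variables[f"X{i}"] = props            # Dom(X_i) = P_i
        else:
            q = props[0]
            variables[f"X{i}"] = [q, f"not-{q}"]  # Dom(X_i) = {q, not-q}
    return variables

# Example 1's propositions: robot location {r1, r2}, block location {b1, b2}.
doms = make_variables([{"r1", "r2"}, {"b1", "b2"}])
```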
3. Belief States
In this section, we explain our representation of, and reasoning about, belief states. We first explain how probabilistic belief states are represented as time-stamped BNs, then we explain how those BNs are encoded and reasoned about in the form of weighted CNF formulas. This representation of belief states by weighted CNFs is then illustrated on the belief state from our running example in Figure 2. We finally provide the details about how this works in Probabilistic-FF.
3. Specifying N_{b_I} directly over P would require identifying the multi-valued variables anyway, followed by connecting all the propositions corresponding to a multi-valued variable by a complete DAG, and then normalizing the CPTs of these propositions in a certain manner.

Figure 2: The belief-state BN N_{b_ā} for our running Examples 1-2 and action sequence ā = ⟨move-b-right, move-left⟩, over the variable layers R(0), B(0), R(1), B(1), R(2), B(2) and the mediating variable Y(1).

Probabilistic-FF performs a forward search in a space of belief states. The search states are belief states (that is, probability distributions over the world states w), and the search is restricted to belief states reachable from the initial belief state b_I through some sequences of actions ā. A key decision one should make is the actual representation of the belief states. Let b_I be our initial belief state captured by the BN N_{b_I}, and let b_ā be a belief state resulting from applying to b_I a sequence of actions ā. One of the well-known problems in the area of decision-theoretic planning is that the description of b_ā directly over the state variables X becomes less and less structured as the number of (especially stochastic) actions in ā increases. To overcome this limitation, we represent belief states b_ā as a BN N_{b_ā} that explicitly captures the sequential application of ā starting from b_I, trading the representation size for the cost of inference, compared to representing belief states directly as distributions over world states. Below we formally specify the structure of such a BN N_{b_ā}, assuming that all the actions in ā are applicable in the corresponding belief states of their application, and later showing that Probabilistic-FF makes sure this is indeed the case.
We note that these belief-state BNs are similar in spirit and structure to those proposed in the AI literature for verifying that a probabilistic plan achieves its goals with a certain probability (Dean & Kanazawa, 1989; Hanks & McDermott, 1994; Kushmerick et al., 1995).

Figure 2 illustrates the construction of N_{b_ā} for our running example with ā = ⟨move-b-right, move-left⟩. In general, let ā = ⟨a_1, ..., a_m⟩ be a sequence of actions, numbered according to their appearance on ā. For 0 ≤ t ≤ m, let X(t) be a replica of our state variables X, with X(t) ∈ X(t) corresponding to X ∈ X. The variable set of N_{b_ā} is the union of X(0), ..., X(m), plus some additional variables that we introduce for the actions in ā.

First, for each X(0) ∈ X(0), we set the parents Pa(X(0)) and conditional probability tables T_{X(0)} to simply copy those of the state variable X in N_{b_I}. Now, consider an action a_t from ā, and let a_t = a. For each such action we introduce a discrete variable Y(t) that "mediates" between the variable layers X(t−1) and X(t). The domain of Y(t) is set to Dom(Y(t)) = ⋃_{e ∈ E(a)} Λ(e), that is, to the union of the probabilistic outcomes of all possible effects of a. The parents of Y(t) in N_{b_ā} are set to

  Pa(Y(t)) = ⋃_{e ∈ E(a)} {X(t−1) | con(e) ∩ Dom(X) ≠ ∅},   (3)

and, for each π ∈ Dom(Pa(Y(t))), we set

  T_{Y(t)}(Y(t) = ε | π) = Pr(ε), if con(e(ε)) ⊆ π, and 0 otherwise,   (4)

where e(ε) denotes the effect e of a such that ε ∈ Λ(e).

We refer to the set of all such variables Y(t) created for the actions of ā as Y. Now, let E_X(a) ⊆ E(a) be the probabilistic effects of a that affect a variable X ∈ X. If E_X(a) = ∅, then we set Pa(X(t)) = {X(t−1)}, and

  T_{X(t)}(X(t) = x | X(t−1) = x′) = 1, if x = x′, and 0 otherwise.   (5)

Otherwise, if E_X(a) ≠ ∅, let x_ε ∈ Dom(X) be the value provided to X by ε ∈ Λ(e), e ∈ E_X(a). Recall that the outcomes of the effects E(a) are all mutually exclusive. Hence, we set Pa(X(t)) = {X(t−1), Y(t)}, and

  T_{X(t)}(X(t) = x | X(t−1) = x′, Y(t) = ε) = 1, if e(ε) ∈ E_X(a) ∧ x = x_ε; 1, if e(ε) ∉ E_X(a) ∧ x = x′; and 0 otherwise,   (6)

where e(ε) denotes the effect responsible for the outcome ε.

It is not hard to verify that Equations 4-6 capture the frame axioms and probabilistic semantics of our actions. In principle, this accomplishes our construction of N_{b_ā} over the variables X_{b_ā} = Y ∪ ⋃_{t=0}^{m} X(t).
We note, however, that the mediating variables Y(t) are really needed only for truly probabilistic actions. Specifically, if a_t is a deterministic action a, let E_X(a) ⊆ E(a) be the conditional effects of a that add and/or delete propositions associated with the domain of a variable X ∈ X. If E_X(a) = ∅, then we set Pa(X(t)) = {X(t−1)}, and T_{X(t)} according to Equation 5. Otherwise, we set

  Pa(X(t)) = {X(t−1)} ∪ ⋃_{e ∈ E_X(a)} {X′(t−1) | con(e) ∩ Dom(X′) ≠ ∅},   (7)

and specify T_{X(t)} as follows. Let x_e ∈ Dom(X) be the value that (the only deterministic outcome of) the effect e ∈ E_X(a) provides to X. For each π ∈ Dom(Pa(X(t))), if there exists e ∈ E_X(a) such that con(e) ⊆ π, then we set

  T_{X(t)}(X(t) = x | π) = 1, if x = x_e, and 0 otherwise.   (8)

Otherwise, we set

  T_{X(t)}(X(t) = x | π) = 1, if x = π[X(t−1)], and 0 otherwise.   (9)

Due to the self-consistency of the action, it is not hard to verify that Equations 8-9 are consistent, and, together with Equation 5, capture the semantics of the conditional deterministic actions. This special treatment of deterministic actions is illustrated in Figure 2 by the direct dependencies of X(2) on X(1).

Proposition 1
Let (A, N_{b_I}, G, θ) be a probabilistic planning problem, and ā be an m-step sequence of actions applicable in b_I. Let Pr be the joint probability distribution induced by N_{b_ā} on its variables X_{b_ā}. The belief state b_ā corresponds to the marginal distribution of Pr on X(m), that is: b_ā(X) = Pr(X(m)), and if G(m) is the partial assignment provided by G to X(m), then the probability b_ā(G) that ā achieves G starting from b_I is equal to Pr(G(m)).

As we already mentioned, our belief-state BNs are constructed along the principles outlined and used by Dean and Kanazawa (1989), Hanks and McDermott (1994), and Kushmerick et al. (1995), and thus the correctness of Proposition 1 is immediate from these previous results. At this point, it is worth bringing attention to the fact that all the variables in X(1), ..., X(m) are completely deterministic. Moreover, the CPTs of all the variables of N_{b_ā} are compactly representable due to either a low number of parents, or some local structure induced by a large amount of context-specific independence, or both. This compactness of the CPTs in N_{b_ā} is implied by the compactness of the STRIPS-style specification of the planning actions. By exploiting this compactness of the action specification, the size of N_{b_ā}'s description can be kept linear in the size of the input and the number of actions in ā.

Proposition 2
Let (A, N_{b_I}, G, θ) be a probabilistic planning problem described over k state variables, and ā be an m-step sequence of actions from A. Then, we have |N_{b_ā}| = O(|N_{b_I}| + mα(k + 1)), where α is the largest description size of an action in A.

The proof of Proposition 2, as well as the proofs of other formal claims in the paper, are relegated to Appendix A, p. 613.
Given the representation of belief states as BNs, next we should select a mechanism for reasoning about these BNs. In general, computing the probability of a query in BNs is known to be computationally hard. On the positive side, however, an observation that guides some recent advances in the area of probabilistic reasoning is that real-world domains typically exhibit a significant degree of deterministic dependencies and context-specific independencies between the problem variables. Targeting this property of practical BNs has already resulted in powerful inference techniques (Chavira & Darwiche, 2005; Sang et al., 2005). The general principle underlying these techniques is to

(i) Compile a BN N into a weighted propositional logic formula φ(N) in CNF, and
(ii) Perform an efficient weighted model counting for φ(N) by reusing (and adapting) certain techniques that appear powerful in enhancing backtracking DPLL-style search for SAT.

One observation we had at the early stages of developing Probabilistic-FF is that the type of networks and type of queries we have in our problems make this machinery for solving BNs by weighted CNF model counting very attractive for our needs. First, in Section 3.1 we have already shown that the BNs representing our belief states exhibit a large amount of both deterministic nodes and context-specific independence. Second, the queries of our interest correspond to computing the probability of the "evidence" G(m) in N_{b_ā}, and this type of query has a clear interpretation in terms of model counting (Sang et al., 2005). Hence, taking this route in Probabilistic-FF, we compile our belief state BNs to weighted CNFs following the encoding scheme proposed by Sang et al. (2005), and answer probabilistic queries using Cachet (Sang, Bacchus, Beame, Kautz, & Pitassi, 2004), one of the most powerful systems to date for exact weighted model counting in CNFs.

In general, weighted CNFs and the weights of such formulas are specified as follows. Let V = {V_1, ..., V_n} be a set of propositional variables with Dom(V_i) = {v_i, ¬v_i}, and let ̟ : ⋃_i Dom(V_i) → R be a non-negative, real-valued weight function on the literals of V.
For any partial assignment π to V, the weight ̟(π) of this assignment is defined as the product of its literals' weights, that is, ̟(π) = ∏_{l ∈ π} ̟(l). Finally, a propositional logic formula φ is called weighted if it is defined over such a weighted set of propositional variables. For any weighted formula φ over V, the weight ̟(φ) is defined as the sum of the weights of all the complete assignments to V satisfying φ, that is,

  ̟(φ) = Σ_{π ∈ Dom(V)} ̟(π) · δ(π ⊨ φ),

where Dom(V) = ×_i Dom(V_i). For instance, if for all variables V_i we have ̟(v_i) = ̟(¬v_i) = 1, then ̟(φ) simply stands for the number of complete assignments to V that satisfy φ.

Given an initial belief state BN N_{b_I}, and a sequence of actions ā = ⟨a_1, ..., a_m⟩ applicable in b_I, here we describe how the weighted CNF encoding φ(N_{b_ā}) (or φ(b_ā), for short) of the belief state b_ā is built and used in Probabilistic-FF. First, we formally specify the generic scheme introduced by Sang et al. (2005) for encoding a BN N over variables X into a weighted CNF φ(N). The encoding formula φ(N) contains two sets of variables. First, for each variable Z ∈ X and each value z ∈ Dom(Z), the formula φ(N) contains a state proposition with literals {z, ¬z}, weighted as ̟(z) = ̟(¬z) = 1. These state propositions act in φ(b_ā) as regular SAT propositions. Now, for each variable Z ∈ X_{b_ā}, let Dom(Z) = {z_1, ..., z_k} be an arbitrary fixed ordering of Dom(Z). Recall that each row T_Z[i] in the CPT of Z corresponds to an assignment ζ_i (or a set of such assignments) to Pa(Z). Thus, the number of rows in T_Z is upper bounded by the number of different assignments to Pa(Z), but (as it happens in our case) it can be significantly lower if the dependence of Z on Pa(Z) induces a substantial local structure.
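The weight ̟(φ) of a weighted formula, as defined above, can be computed naively by enumerating all complete assignments; the sketch below (ours, purely for illustration and exponentially slower than the DPLL-style procedure used by Cachet) does exactly that, with clauses given as sets of (variable, sign) literals.

```python
import math
from itertools import product

def weight_of_formula(variables, weights, clauses):
    """Sum of literal-weight products over complete assignments satisfying all clauses.

    weights: {(var, True): w, (var, False): w}
    clauses: iterable of sets of (var, sign) literals
    """
    total = 0.0
    for values in product([True, False], repeat=len(variables)):
        assign = dict(zip(variables, values))
        if all(any(assign[v] == sign for v, sign in cl) for cl in clauses):
            total += math.prod(weights[(v, assign[v])] for v in variables)
    return total

# With all weights 1, the weight is plain model counting: x OR y has 3 models.
unit = {(v, s): 1.0 for v in ["x", "y"] for s in (True, False)}
count = weight_of_formula(["x", "y"], unit, [{("x", True), ("y", True)}])
```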
Following the ordering of Dom(Z) as above, the entry T_Z[i, j] contains the conditional probability Pr(z_j | ζ_i). For every CPT entry T_Z[i, j] but the last one (i.e., T_Z[i, k]), the formula φ(N) contains a chance proposition with literals {⟨z_ij⟩, ¬⟨z_ij⟩}. These chance variables aim at capturing the probabilistic information from the CPTs of N_{b_ā}. Specifically, the weight of the literal ⟨z_ij⟩ is set to Pr(z_j | ζ_i, ¬z_1, ..., ¬z_{j−1}), that is, to the conditional probability that the entry is true, given that the row is true, and no prior entry in the row is true:

  ̟(⟨z_ij⟩) = T_Z[i, j] / (1 − Σ_{k=1}^{j−1} T_Z[i, k]),    ̟(¬⟨z_ij⟩) = 1 − ̟(⟨z_ij⟩)   (10)

Considering the clauses of φ(N), for each variable Z ∈ X, and each CPT entry T_Z[i, j], the formula φ(N) contains a clause

  (ζ_i ∧ ¬⟨z_i1⟩ ∧ ... ∧ ¬⟨z_i,j−1⟩ ∧ ⟨z_ij⟩) → z_j,   (11)

where ζ_i is a conjunction of the literals forming the assignment ζ_i ∈ Dom(Pa(Z)). These clauses ensure that the weights of the complete assignments to the variables of φ(N) are equal to the probability of the corresponding atomic events as postulated by the BN N. To illustrate the construction in Equations 10-11, let boolean variables A and B be the parents of a ternary variable C (with Dom(C) = {C_1, C_2, C_3}) in some BN, with given probabilities Pr(C_1 | A, ¬B), Pr(C_2 | A, ¬B), and Pr(C_3 | A, ¬B). Let the row corresponding to the assignment A, ¬B to Pa(C) be the i-th row of the CPT T_C. In the encoding of this BN, the first two entries of this row of T_C are captured by a pair of respective chance propositions ⟨C_i1⟩ and ⟨C_i2⟩.

  procedure basic-WMC(φ)
    if φ = ∅ return 1
    if φ has an empty clause return 0
    select a variable V ∈ φ
    return basic-WMC(φ|v) · ̟(v) + basic-WMC(φ|¬v) · ̟(¬v)

Figure 3: Basic DPLL-style weighted model counting.
According to Equation 10, the weights of these propositions are set to ϖ(⟨C_1^i⟩) = Pr(C_1 | A, ¬B) and ϖ(⟨C_2^i⟩) = Pr(C_2 | A, ¬B) / (1 − Pr(C_1 | A, ¬B)). Then, according to Equation 11, the encoding contains three clauses

    (¬A ∨ B ∨ ¬⟨C_1^i⟩ ∨ C_1)
    (¬A ∨ B ∨ ⟨C_1^i⟩ ∨ ¬⟨C_2^i⟩ ∨ C_2)
    (¬A ∨ B ∨ ⟨C_1^i⟩ ∨ ⟨C_2^i⟩ ∨ C_3)

Finally, for each variable Z ∈ X, the formula φ(N) contains a standard set of clauses encoding the "exactly one" relationship between the state propositions capturing the value of Z. This accomplishes the encoding of N into φ(N). In Section 3.3 we illustrate this encoding on the belief state BN from our running example.

The weighted CNF encoding φ(b_a) of the belief state BN N_{b_a} provides the input to a weighted model counting procedure. A simple recursive DPLL-style procedure basic-WMC underlying Cachet (Sang et al., 2004) is depicted in Figure 3, where the formula φ|_v is obtained from φ by setting
the literal v to true. Theorem 3 of Sang et al. (2005) shows that if φ is a weighted CNF encoding of a BN N, and Pr(Q | E) is a general query with respect to N, with query Q and evidence E, then we have

    Pr(Q | E) = basic-WMC(φ ∧ Q ∧ E) / basic-WMC(φ ∧ E),    (12)

where the query Q and the evidence E can in fact be arbitrary formulas in propositional logic. Note that, in the special (and, for us, very relevant) case of empty evidence, Equation 12 reduces to Pr(Q) = basic-WMC(φ ∧ Q), that is, to a single call to the basic-WMC procedure. Corollary 3 is then immediate from our Proposition 1 and Theorem 3 of Sang et al. (2005).
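For concreteness, here is a minimal Python sketch of the basic-WMC recursion of Figure 3, exercised in the single-call regime of Equation 12 (empty evidence). The two-node BN, its weights, and its clauses are invented for illustration; a clause is a set of (variable, polarity) literals.

```python
def condition(clauses, var, value):
    """phi|_l: drop clauses satisfied by the literal, shorten the rest."""
    reduced = []
    for clause in clauses:
        if (var, value) in clause:
            continue  # clause satisfied, drop it
        reduced.append({lit for lit in clause if lit != (var, not value)})
    return reduced

def basic_wmc(clauses, weights):
    """DPLL-style weighted model counting, as in Figure 3."""
    if not clauses:
        return 1.0  # empty CNF: trivially satisfied
    if any(not clause for clause in clauses):
        return 0.0  # empty clause: unsatisfiable
    var = next(iter(clauses[0]))[0]  # select some variable of phi
    return (basic_wmc(condition(clauses, var, True), weights) * weights[(var, True)]
            + basic_wmc(condition(clauses, var, False), weights) * weights[(var, False)])

# Hypothetical two-node BN: Pr(r) = 0.7, Pr(b|r) = 0.8, Pr(b|~r) = 0.5,
# encoded with chance propositions h1 ("b given r") and h2 ("b given ~r").
weights = {("r", True): 0.7, ("r", False): 0.3,
           ("b", True): 1.0, ("b", False): 1.0,
           ("h1", True): 0.8, ("h1", False): 0.2,
           ("h2", True): 0.5, ("h2", False): 0.5}
cnf = [{("r", False), ("h1", False), ("b", True)},
       {("r", False), ("h1", True), ("b", False)},
       {("r", True), ("h2", False), ("b", True)},
       {("r", True), ("h2", True), ("b", False)}]
# Pr(b) = basic-WMC(phi ^ b): conjoin the unit clause {b(m)}.
pr_b = basic_wmc(cnf + [{("b", True)}], weights)  # 0.7*0.8 + 0.3*0.5 = 0.71
```

Note that branching on a variable absent from the remaining clauses is never needed here: each literal pair's weights sum to 1 for the chance variables, so "free" variables contribute a factor of 1, as in Cachet's setting.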
Corollary 3
Let (A, b_I, G, θ) be a probabilistic planning task with a BN N_{b_I} describing b_I, and let a be an m-step sequence of actions applicable in b_I. The probability b_a(G) that a achieves G starting from b_I is given by

    b_a(G) = WMC(φ(b_a) ∧ G(m)),    (13)

where G(m) is a conjunction of the goal literals time-stamped with the time endpoint m of a.

We now illustrate the generic BN-to-WCNF encoding scheme of Sang et al. (2005) on the belief state BN N_{b_a} from our running example in Figure 2. For 0 ≤ i ≤ 2, we introduce time-stamped state propositions r_1(i), r_2(i), b_1(i), b_2(i). Likewise, we introduce four state propositions ε_1(1), ε_2(1), ε_3(1), ε′(1) corresponding to the values of the variable Y(1). The first set of clauses in φ(b_a) ensures the "exactly one" relationship between the state propositions capturing the value of each variable in N_{b_a}:

    (ε_1(1) ∨ ε_2(1) ∨ ε_3(1) ∨ ε′(1)),
    for 1 ≤ i < j ≤ 4: (¬y_i(1) ∨ ¬y_j(1)), where y_1, ..., y_4 enumerate the values of Y(1),
    for 0 ≤ i ≤ 2: (r_1(i) ∨ r_2(i)), (¬r_1(i) ∨ ¬r_2(i)), (b_1(i) ∨ b_2(i)), (¬b_1(i) ∨ ¬b_2(i))    (14)

Now we proceed with encoding the CPTs of N_{b_a}. The root nodes have only one row in their CPTs, so their chance propositions can be identified with the corresponding state variables (Sang et al., 2005). Hence, for the root variable R(0) we need neither additional clauses nor special chance propositions; instead, the state proposition r_1(0) of φ(b_a) is treated as a chance proposition whose weight ϖ(r_1(0)) is the prior probability of r_1.

Encoding the variable B(0) is a bit more involved. The CPT T_{B(0)} contains two (content-wise different) rows corresponding to the "given r_1" and "given r_2" cases, and both these cases induce a non-deterministic dependence of B(0) on R(0). To encode the content of T_{B(0)} we introduce two chance variables ⟨b_1(0)⟩ and ⟨b_2(0)⟩ with weights ϖ(⟨b_1(0)⟩) = Pr(b_1 | r_1) and ϖ(⟨b_2(0)⟩) = Pr(b_1 | r_2).
The positive literals of ⟨b_1(0)⟩ and ⟨b_2(0)⟩ capture the events "b_1 given r_1" and "b_1 given r_2", while the negations ¬⟨b_1(0)⟩ and ¬⟨b_2(0)⟩ capture the complementary events "b_2 given r_1" and "b_2 given r_2", respectively. Now consider the "given r_1" row in T_{B(0)}. To encode this row, we need φ(b_a) to contain (r_1(0) ∧ ⟨b_1(0)⟩) → b_1(0) and (r_1(0) ∧ ¬⟨b_1(0)⟩) → b_2(0). A similar encoding is required for the row "given r_2", and thus the encoding of T_{B(0)} introduces four additional clauses:

    (¬r_1(0) ∨ ¬⟨b_1(0)⟩ ∨ b_1(0)),  (¬r_1(0) ∨ ⟨b_1(0)⟩ ∨ b_2(0)),
    (¬r_2(0) ∨ ¬⟨b_2(0)⟩ ∨ b_1(0)),  (¬r_2(0) ∨ ⟨b_2(0)⟩ ∨ b_2(0))    (15)

Having finished with the N_{b_I} part of N_{b_a}, we proceed with encoding the variable Y(1) corresponding to the probabilistic action move-b-right. To encode the first row of T_{Y(1)} we introduce three chance propositions ⟨ε_1(1)⟩, ⟨ε_2(1)⟩, and ⟨ε_3(1)⟩; in general, no chance variables are needed for the last entries of the CPT rows. The weights of these chance propositions are set according to Equation 10 from the probabilities of the corresponding outcomes in the first row of T_{Y(1)}. Using these chance propositions, we add to φ(b_a) four clauses as in Equation 11, namely the first four clauses of Equation 16 below. Proceeding to the second row of T_{Y(1)}, observe that the values of R(0) and B(0) in this case fully determine the value of Y(1). This deterministic dependence can be encoded without any chance propositions, using the last two clauses in Equation 16.
    (¬r_1(0) ∨ ¬b_1(0) ∨ ¬⟨ε_1(1)⟩ ∨ ε_1(1)),
    (¬r_1(0) ∨ ¬b_1(0) ∨ ⟨ε_1(1)⟩ ∨ ¬⟨ε_2(1)⟩ ∨ ε_2(1)),
    (¬r_1(0) ∨ ¬b_1(0) ∨ ⟨ε_1(1)⟩ ∨ ⟨ε_2(1)⟩ ∨ ¬⟨ε_3(1)⟩ ∨ ε_3(1)),
    (¬r_1(0) ∨ ¬b_1(0) ∨ ⟨ε_1(1)⟩ ∨ ⟨ε_2(1)⟩ ∨ ⟨ε_3(1)⟩ ∨ ε′(1)),
    (¬r_2(0) ∨ ε′(1)),  (¬b_2(0) ∨ ε′(1))    (16)

Using the state/chance variables introduced for R(0), B(0), and Y(1), we encode the CPTs of R(1) and B(1) as:

    R(1):  (¬ε_1(1) ∨ r_2(1)), (¬ε_2(1) ∨ r_2(1)),
           (¬ε_3(1) ∨ ¬r_1(0) ∨ r_1(1)), (¬ε′(1) ∨ ¬r_1(0) ∨ r_1(1)),
           (¬ε_3(1) ∨ ¬r_2(0) ∨ r_2(1)), (¬ε′(1) ∨ ¬r_2(0) ∨ r_2(1))
    B(1):  (¬ε_1(1) ∨ b_2(1)), (ε_1(1) ∨ ¬b_1(0) ∨ b_1(1)), (ε_1(1) ∨ ¬b_2(0) ∨ b_2(1))    (17)

Since the CPTs of both R(1) and B(1) are completely deterministic, their encoding likewise uses no chance propositions. Finally, we encode the (deterministic) CPTs of R(2) and B(2) as:

    R(2):  (r_2(2))
    B(2):  (¬b_1(1) ∨ b_1(2)), (¬b_2(1) ∨ b_2(2))    (18)

where the unary clause (r_2(2)) is a reduction of (¬r_1(1) ∨ r_2(2)) and (¬r_2(1) ∨ r_2(2)). This accomplishes our encoding of φ(b_a).
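Encoding a single CPT row per Equations 10-11 follows a fixed mechanical pattern, sketched below in Python. The literal representation ('~' marks negation, angle brackets mark chance propositions) and the probabilities 0.2/0.3/0.5 are made up for illustration.

```python
def neg(lit):
    """Negate a literal; '~' marks negation."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def encode_cpt_row(row_condition, values, probs):
    """Encode one CPT row: chance-proposition weights per Equation 10 and
    the chain of clauses per Equation 11. row_condition is the parent
    assignment zeta_i as a list of literals; values/probs describe the
    row's distribution over the child's values."""
    weights, clauses = {}, []
    neg_condition = [neg(l) for l in row_condition]
    earlier = []      # chance props of earlier entries (positive in later clauses)
    remaining = 1.0   # 1 - sum of probabilities of earlier entries
    for j, (value, p) in enumerate(zip(values, probs)):
        if j < len(values) - 1:
            chance = "<" + value + ">"
            weights[chance] = p / remaining          # Equation 10
            weights[neg(chance)] = 1.0 - weights[chance]
            remaining -= p
            clauses.append(neg_condition + earlier + [neg(chance), value])
            earlier = earlier + [chance]
        else:         # the last entry of a row needs no chance proposition
            clauses.append(neg_condition + earlier + [value])
    return weights, clauses

# The ternary-child example from the text, with made-up probabilities:
w, cl = encode_cpt_row(["A", "~B"], ["C1", "C2", "C3"], [0.2, 0.3, 0.5])
```

Here `w["<C2>"]` comes out as 0.3 / (1 − 0.2) = 0.375, and the three generated clauses have exactly the shape of the clauses displayed after Equation 11.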
Besides the fact that weighted model counting is attractive for the kinds of BNs arising in our context, the weighted CNF representation of belief states combines extremely well with the ideas underlying Conformant-FF (Hoffmann & Brafman, 2006). This was already outlined in the introduction; here we give a few more details.

As stated, Conformant-FF performs a forward search in a non-probabilistic belief space in which each belief state corresponds to a set of world states considered possible. The main trick of Conformant-FF is the use of CNF formulas as an implicit representation of belief states, where a formula φ(a) encodes the semantics of executing the action sequence a in the initial belief state. Facts known to be true or false are inferred from these formulas. Computing only such partial knowledge constitutes a lazy kind of belief state representation, in comparison to other approaches that use explicit enumeration (Bonet & Geffner, 2000) or BDDs (Bertoli, Cimatti, Pistore, Roveri, & Traverso, 2001) to fully represent belief states. The basic ideas underlying Probabilistic-FF are:

(i) Define time-stamped Bayesian networks (BNs) describing probabilistic belief states (Section 3.1 above).

(ii) Extend Conformant-FF's belief state CNFs to model these BNs (Section 3.2 above).

(iii) In addition to the SAT reasoning used by Conformant-FF, use weighted model counting to determine whether the probability of the (unknown) goals in a belief state is high enough (directly below).

(iv) Introduce approximate probabilistic reasoning into Conformant-FF's heuristic function (Section 4 below).

In more detail, given a probabilistic planning task (A, b_I, G, θ), a belief state b_a corresponding to some m-step action sequence a applicable in b_I, and a proposition q ∈ P, we say that q is known in b_a if b_a(q) = 1, negatively known in b_a if b_a(q) = 0, and unknown in b_a otherwise. We begin by determining whether each q is known, negatively known, or unknown at time m.
Re-using the Conformant-FF machinery, this classification requires up to two SAT tests, of φ(b_a) ∧ ¬q(m) and φ(b_a) ∧ q(m), respectively. The information provided by this classification is used in three ways. First, if a subgoal g ∈ G is negatively known at time m, then we have b_a(G) = 0. On the other extreme, if all the subgoals of G are known at time m, then we have b_a(G) = 1. Finally, if some subgoals of G are known and the rest are unknown at time m, then we complete the evaluation of the belief state b_a by testing whether

    b_a(G) = WMC(φ(b_a) ∧ G(m)) ≥ θ.    (19)

Note also that having the sets of all (positively/negatively) known propositions at all time steps up to m allows us to significantly simplify the CNF formula φ(b_a) ∧ G(m) by inserting into it the corresponding values of the known propositions.

After evaluating the considered action sequence a, if we get b_a(G) ≥ θ, then we are done. Otherwise, the forward search continues, and the actions applicable in b_a (and thus used to generate the successor belief states) are those whose preconditions are all known in b_a.
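The known/negatively known/unknown classification can be sketched with two satisfiability tests over the belief state formula. In the toy Python sketch below, a miniature DPLL test stands in for Conformant-FF's SAT solver, and the clause representation (sets of ((proposition, time), polarity) literals) is illustrative only.

```python
def condition(clauses, var, value):
    """phi|_l: drop clauses satisfied by the literal, shorten the rest."""
    reduced = []
    for clause in clauses:
        if (var, value) in clause:
            continue
        reduced.append({lit for lit in clause if lit != (var, not value)})
    return reduced

def satisfiable(clauses):
    """Tiny DPLL satisfiability test (stand-in for a real SAT solver)."""
    if not clauses:
        return True
    if any(not clause for clause in clauses):
        return False
    var = next(iter(clauses[0]))[0]
    return (satisfiable(condition(clauses, var, True))
            or satisfiable(condition(clauses, var, False)))

def classify(phi, q, m):
    """Classify proposition q at time m with up to two SAT tests:
    phi(b_a) ^ ~q(m) unsatisfiable  -> q is known,
    phi(b_a) ^  q(m) unsatisfiable  -> q is negatively known."""
    if not satisfiable(phi + [{((q, m), False)}]):
        return "known"
    if not satisfiable(phi + [{((q, m), True)}]):
        return "negatively known"
    return "unknown"
```

For example, if the formula contains the unit clause q(0), the first test fails to find a model and q is reported known at time 0.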
4. Heuristic Function
The key component of any heuristic search procedure is the heuristic function: the quality (informedness) and computational cost of that function determine the performance of the search. The heuristic function is usually obtained from solutions to a relaxation of the actual problem of interest (Pearl, 1984; Russell & Norvig, 2004). In classical planning, a successful idea has been to use a relaxation that ignores the delete effects of the actions (McDermott, 1999; Bonet & Geffner, 2001; Hoffmann & Nebel, 2001). In particular, the heuristic of the FF planning system is based on the notion of a relaxed plan, which is a plan that achieves the goals under the assumption that all delete lists of actions are empty. The relaxed plan is computed using a Graphplan-style (Blum & Furst, 1997) technique combining a forward chaining graph construction phase with a backward chaining plan extraction phase. The heuristic value h(w) that FF assigns to a world state w encountered during the search is the length of the relaxed plan from w. In Conformant-FF, this methodology was extended to the setting of conformant planning under initial state uncertainty (without uncertainty about action effects). Herein, we extend Conformant-FF's machinery to handle probabilistic initial states and effects. Section 4.1 provides background on the techniques used in FF and Conformant-FF; Sections 4.2 and 4.4 then detail our algorithms for the forward and backward chaining phases in Probabilistic-FF, respectively. These algorithms for the two phases of the Probabilistic-FF heuristic computation are illustrated on our running example in Sections 4.3 and 4.5, respectively.

Below, we specify how relaxed plans are computed in FF, and we provide a coarse sketch of how they are computed in Conformant-FF.
The purpose of the latter sketch is only to gently prepare the reader for what is to come: Conformant-FF's techniques are re-used in Probabilistic-FF anyway, and hence will be described in full detail as part of Sections 4.2 and 4.4.

Formally, relaxed plans in classical planning are computed as follows. Starting from w, FF builds a relaxed planning graph as a sequence of alternating proposition layers P(t) and action layers A(t), where P(0) is the same as w, A(t) is the set of all actions whose preconditions are contained in P(t), and P(t+1) is obtained from P(t) by including the add effects (with fulfilled conditions) of the actions in A(t). That is, P(t) always contains those facts that will be true if one executes (the relaxed versions of) all actions at the earlier layers up to A(t−1). The relaxed planning graph is constructed either until it reaches a proposition layer P(m) that contains all the goals, or until the construction reaches a fixpoint P(t) = P(t+1) without reaching the goals. The latter case corresponds to (all) situations in which a relaxed plan does not exist, and because the existence of a relaxed plan is a necessary condition for the existence of a real plan, the state w is excluded from the search space by setting h(w) = ∞. In the former case of G ⊆ P(m), a relaxed plan is a subset of the actions in A(1), ..., A(m) that suffices to achieve the goals (ignoring the delete lists), and it can be extracted by a simple backchaining loop: for each goal in P(m), select an action in A(1), ..., A(m) that achieves this goal, and iterate the process by considering those actions' preconditions and the respective effect conditions as new subgoals.
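Under simplifying assumptions (unconditional STRIPS-style actions, hypothetical names), the graph construction and backchaining extraction just described can be sketched in Python as follows.

```python
from collections import namedtuple

# Hypothetical STRIPS-like actions: preconditions and add effects only
# (the delete lists are ignored by the relaxation anyway).
Action = namedtuple("Action", ["name", "pre", "add"])

def relaxed_planning_graph(state, actions, goals):
    """Build proposition layers P(0), P(1), ... until the goals appear;
    return the layers, or None on a fixpoint without the goals (h = inf)."""
    layers = [set(state)]
    while not goals <= layers[-1]:
        nxt = set(layers[-1])
        for a in actions:
            if a.pre <= layers[-1]:   # action applicable at this layer
                nxt |= a.add
        if nxt == layers[-1]:
            return None               # fixpoint reached, no relaxed plan
        layers.append(nxt)
    return layers

def extract_relaxed_plan(layers, actions, goals):
    """Backchain from the goals, selecting one supporting action per subgoal."""
    level = {}                        # first layer where each fact appears
    for t, layer in enumerate(layers):
        for p in layer:
            level.setdefault(p, t)
    plan, open_goals = set(), [(g, level[g]) for g in goals]
    while open_goals:
        g, t = open_goals.pop()
        if t == 0:
            continue                  # already true in the evaluated state
        a = next(a for a in actions if g in a.add and a.pre <= layers[t - 1])
        if (a.name, t - 1) not in plan:
            plan.add((a.name, t - 1))
            open_goals += [(p, level[p]) for p in a.pre]
    return plan

acts = [Action("A1", {"p"}, {"q"}), Action("A2", {"q"}, {"r"})]
layers = relaxed_planning_graph({"p"}, acts, {"r"})
h = len(extract_relaxed_plan(layers, acts, {"r"}))  # heuristic value h(w)
```

On the two-action chain above, the goal r first appears in layer 2, the extraction selects A2 and then A1, and the heuristic value is 2.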
The heuristic estimate h(w) is then set to the length of the extracted relaxed plan, that is, to the number of actions selected in this backchaining process.

Aiming to extend the machinery of FF to conformant planning, Hoffmann and Brafman (2006) suggested extending the relaxed planning graph with additional fact layers uP(t) containing the facts unknown at time t, and then reasoning about when such unknown
facts become known in the relaxed planning graph. As the complexity of this type of reasoning is prohibitive, Conformant-FF further relaxes the planning task by ignoring not only the delete lists, but also all but one of the unknown conditions of each action effect. That is, if action a appears in layer A(t), and for effect e of a we have con(e) ⊆ P(t) ∪ uP(t) and con(e) ∩ uP(t) ≠ ∅, then con(e) ∩ uP(t) is arbitrarily reduced to contain exactly one literal, and reasoning is done as if con(e) had this reduced form from the beginning.

This relaxation converts the implications (∧_{c ∈ con(e) ∩ uP(t)} c(t)) → q(t+1) that the action effects induce between unknown propositions into their 2-projections, which take the form of binary implications c(t) → q(t+1) for an arbitrary c ∈ con(e) ∩ uP(t). Due to the layered structure of the planning graph, the set of all these binary implications c(t) → q(t+1) can be seen as forming a directed acyclic graph Imp. Under the given relaxations, this graph captures exactly all dependencies between the truth of propositions over time. Hence, checking whether a proposition q becomes known at time t can be done as follows. First, backchain over the implication edges of Imp that end in q(t), and collect the set support(q(t)) of leafs that are reached. Then, if Φ is the CNF formula describing the possible initial states, test by a SAT check whether

    Φ → ∨_{l ∈ support(q(t))} l.

This test succeeds if and only if at least one of the leafs in support(q(t)) is true in every possible initial state. Under the given relaxations, this is the case if and only if, when applying all actions in the relaxed planning graph, q will always be true at time t.

The process of extracting a relaxed plan from the constructed conformant relaxed planning graph is an extension of FF's respective process with machinery that selects actions responsible for relevant paths in Imp.
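The backchaining step over Imp can be sketched as follows, with the graph stored (for convenience) as a map from each node to its predecessors, i.e., with the edges c(t) → q(t+1) reversed; the node names are illustrative.

```python
def support_leafs(predecessors, target):
    """Backchain over the implication graph from `target` and collect its
    zero in-degree ancestors (the 'leafs'). `predecessors` maps a node to
    the list of nodes with an implication edge into it."""
    seen, stack, leafs = set(), [target], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        preds = predecessors.get(node, [])
        if not preds:
            leafs.add(node)       # no incoming edges: a leaf of Imp
        else:
            stack.extend(preds)
    return leafs
```

The resulting leaf set is exactly what feeds the SAT test Φ → ∨ l above, which in practice is run as an unsatisfiability check of Φ conjoined with the negations of all collected leafs.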
The overall Conformant-FF heuristic machinery is sound and complete for relaxed tasks, and yields a heuristic function that is highly informative across a range of challenging domains (Hoffmann & Brafman, 2006).

In this work, we adopt Conformant-FF's relaxations, ignoring the delete lists of the action effects as well as all but one of the propositions in each effect's condition. Accordingly, we adopt the following notation from Conformant-FF. Given a set of actions A, we denote by |+1 any function from A into the set of all possible actions such that |+1 maps each a ∈ A to the action similar to a but with empty delete lists and with all but one conditioning proposition of each effect removed; for |+1(a), we write a|+1. By A|+1 we denote the action set obtained by applying |+1 to all the actions of A, that is, A|+1 = {a|+1 | a ∈ A}. For an action sequence a we denote by a|+1 the sequence of actions obtained by applying |+1 to every action along a, that is, a|+1 = ⟨⟩ if a = ⟨⟩, and a|+1 = ⟨a_1|+1⟩ · a′|+1 if a = ⟨a_1⟩ · a′. For a probabilistic planning task (A, b_I, G, θ), the task (A|+1, b_I, G, θ) is called a relaxation of (A, b_I, G, θ). Finally, if a|+1 is a plan for (A|+1, b_I, G, θ), then a is called a relaxed plan for (A, b_I, G, θ).
4. Following the Conformant-FF terminology, by "leafs" we refer to the nodes having zero in-degree.
5. Note here that it would be possible to do a full SAT check, without any 2-projection (without relying on Imp), to see whether q becomes known at t. However, as indicated above, doing such a full check for every unknown proposition at every level of the relaxed planning graph for every search state would very likely be too expensive computationally.
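The |+1 mapping itself is purely syntactic, as the following Python sketch over an assumed action representation shows; the tuple layout (effects as (conditions, outcomes) pairs, an outcome being (probability, add list, delete list)) is invented for illustration.

```python
from collections import namedtuple

# Hypothetical action representation: a precondition list plus effects,
# where an effect is (conditions, outcomes) and an outcome is
# (probability, add list, delete list).
Action = namedtuple("Action", ["pre", "effects"])

def relax_plus1(action):
    """The |+1 relaxation: empty all delete lists, and keep (arbitrarily)
    only one conditioning proposition per effect."""
    relaxed_effects = []
    for conditions, outcomes in action.effects:
        kept = conditions[:1]  # "all but one removed"; here we keep the first
        relaxed_outcomes = [(p, adds, []) for (p, adds, _dels) in outcomes]
        relaxed_effects.append((kept, relaxed_outcomes))
    return Action(action.pre, relaxed_effects)

a = Action(pre=["x"],
           effects=[(["c1", "c2"], [(0.5, ["q"], ["r"]), (0.5, [], [])])])
a_relaxed = relax_plus1(a)
```

Which condition is kept is an arbitrary choice, mirroring the "arbitrarily reduced" wording in the text; any fixed choice yields a valid 2-projection.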
In the next two sections we describe the machinery underlying the Probabilistic-FF heuristicestimation. Due to the similarity between the conceptual relaxations used in Probabilistic-FF andConformant-FF, Probabilistic-FF inherits almost all of Conformant-FF’s machinery. Of course,the new contributions are those algorithms dealing with probabilistic belief states and probabilisticactions.
Like FF and Conformant-FF, Probabilistic-FF computes its heuristic function in two steps, the first chaining forward to build a relaxed planning graph, and the second chaining backward to extract a relaxed plan. In this section, we describe in detail Probabilistic-FF's forward chaining step, which builds a probabilistic relaxed planning graph (or PRPG, for short). In Section 4.4, we then show how one can extract a (probabilistic) relaxed plan from the PRPG. We provide a detailed illustration of the PRPG construction process on the basis of our running example; since the illustration is lengthy, it is moved to a separate Section 4.3.

The algorithms building a PRPG are quite involved; it is instructive to first consider (some of) the key points before delving into the details. The main issue is, of course, that we need to extend Conformant-FF's machinery with the ability to determine when the goal set is sufficiently likely, rather than when it is known to be true for sure. To achieve that, we must introduce into relaxed planning some effective reasoning about both the probabilistic initial state and the effects of probabilistic actions. It turns out that such reasoning can be obtained by a certain weighted extension of the implication graph. In a nutshell, if we want to determine how likely it is that a fact q is true at a time t, then we propagate certain weights backwards through the implication graph, starting in q(t); the weight of q(t) is set to 1, and the weight of any p(t′) gives an estimate of the probability of achieving q at t given that p holds at t′. Computing this probability exactly would, of course, be too expensive. Our estimation is based on assuming independence of the various probabilistic events involved.
This is a choice that we made very carefully; we experimented widely with various other options before deciding in favor of this technique.

Any simplifying assumption in the weight propagation constitutes, of course, another relaxation on top of the relaxations we already inherited from Conformant-FF. The particularly problematic aspect of assuming independence is that it is not an under-estimating technique. The actual weight of a node p(t′), the probability of achieving q at t given that p holds at t′, may be lower than our estimate. In effect, the PRPG may decide wrongly that a relaxed plan exists: even if we execute all relaxed actions contained in the successful PRPG, the probability of achieving the goal by this execution may be less than the required threshold. In other words, we lose the soundness (relative to relaxed tasks) of the relaxed planning process.

We experimented with an alternative weight propagation method based on the opposite assumption, namely that the relevant probabilistic events always co-occur, and that hence the weights must be propagated according to simple maximization operations. This propagation method yielded very uninformative heuristic values, and hence unacceptable empirical behaviour of Probabilistic-FF, even in very simple benchmarks. In our view, it seems unlikely that an under-estimating yet informative and efficient weight computation exists. We further experimented with some alternative non-under-estimating propagation schemes, in particular one based on assuming that the probabilistic events are completely disjoint (and hence weights should be added); these schemes gave better
performance than maximization, but lagged far behind the independence assumption in the more challenging benchmarks.

Let us now get into the actual algorithm building a PRPG. A coarse outline of the algorithm is as follows. The PRPG is built in a layer-wise fashion, in each iteration extending the PRPG, reaching up to time t, by another layer, reaching up to time t+1. The actions in the new step are those whose preconditions are known to hold at t. Effects conditioned on unknown facts (note here the reduction of effect conditions to a single fact) constitute new edges in the implication graph. In contrast to Conformant-FF, we do not obtain a single edge from condition to add effect; instead, we obtain edges from the condition to "chance nodes", where each chance node represents a probabilistic outcome of the effect; the chance nodes, in turn, are linked by edges to their respective add effects. The weights of the chance nodes are set to the probabilities of the respective outcomes; the weights of all other nodes are set to 1. These weights are "static weights" which are not "dynamically" modified by weight propagation; rather, the static weights form an input to the propagation.

Once all implication graph edges are inserted at a layer, the algorithm checks whether any new facts become known. This check is done very much like the corresponding check in Conformant-FF, by testing whether the disjunction of the support leafs for a proposition p at t+1 is implied by the initial state formula. The two differences to Conformant-FF are: (1) only leafs are relevant whose dynamic weight equals their static weight (otherwise, achieving a leaf is not guaranteed to accomplish p at t+1); (2) another reason for p to become known may be that all outcomes of an unconditional effect (or an effect with known condition) result in the achievement of p at time t+1.
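At the heart of the weight propagation is the rule for combining the weights of alternative supports of a fact. Under the independence assumption this is 1 − ∏(1 − w_i); under the co-occurrence assumption it is plain maximization. A minimal sketch of the two combination rules, with made-up numbers:

```python
def combine_independent(ws):
    """Independence assumption: probability that at least one of the
    alternative supports achieves the fact, 1 - prod(1 - w_i)."""
    result = 1.0
    for w in ws:
        result *= 1.0 - w
    return 1.0 - result

def combine_max(ws):
    """Opposite assumption (always co-occurring events): maximization."""
    return max(ws, default=0.0)
```

With two alternative supports of weight 0.5 each, independence yields 0.75 while maximization yields only 0.5, illustrating why maximization is so much less informative (and why independence can over-estimate when the supports actually overlap).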
We elegantly formulate the overall test by a single implication test over the support leafs whose dynamic weight equals their static weight.

Like FF's and Conformant-FF's algorithms, the PRPG process has two termination criteria. The PRPG terminates positively if the goal probability is high enough at time t; the PRPG terminates negatively if, from t to t+1, nothing has changed that may result in a higher goal probability at some future t′. The goal probability in a layer t is computed based on weighted model counting over a formula derived from the support leafs of all goals not known to be true. The criteria for negative termination check: whether any new facts have become known or unknown (not negatively known); whether any possibly relevant new support leafs have appeared; and whether the goal probability has increased. If none of these is the case, then we can stop safely: if the PRPG terminates unsuccessfully, then we have a guarantee that there is no relaxed plan, and that the corresponding belief state is hence a dead end.

Let us get into the details. Figure 4 depicts the main routine for building the PRPG for a belief state b_a. As already specified, the sets P(t), uP(t), and A(t) contain the propositions that are known to hold at time t (hold at t with probability 1), the propositions that are unknown at time t (hold at t with probability less than 1 but greater than 0), and the actions that are known to be applicable at time t, respectively. The layers t ≥ 0 of the PRPG capture applying the relaxed actions starting from b_a. The layers −m to −1 of the PRPG correspond to the m-step action sequence a leading from the initial belief state to the belief state in question, b_a. We inherit the latter technique from Conformant-FF; in a sense, the PRPG "reasons about the past". This may look confusing at first sight, but it has a simple reason. Imagine the PRPG started at level 0 instead.
Then, to check whether a proposition becomes known, we would have to run the SAT tests regarding support leafs against the belief state formula φ(b_a) instead of the initial state formula (and similarly for the weighted model counting used to test whether the goal is likely enough). Testing against φ(b_a) is possible, but computationally very expensive.

procedure build-PRPG(a, A, φ(N_{b_I}), G, θ, |+1),
    returns a Bool saying if there is a relaxed plan for the belief state
    given by a = ⟨a_{−m}, ..., a_{−1}⟩, and
    builds data structures from which a relaxed plan can be extracted

    Φ := φ(N_{b_I}), Imp := ∅
    P(−m) := {p | p is known in Φ}, uP(−m) := {p | p is unknown in Φ}
    for t := −m ··· −1 do
        A(t) := {a_t|+1} ∪ NOOPS
        build-timestep(t, A(t))
    endfor
    t := 0
    while get-P(t, G) < θ do
        A(t) := {a|+1 | a ∈ A, pre(a) ⊆ P(t)} ∪ NOOPS
        build-timestep(t, A(t))
        if P(t+1) = P(t) and uP(t+1) = uP(t) and
           ∀p ∈ uP(t+1): uP(−m) ∩ support(p(t+1)) = uP(−m) ∩ support(p(t)) and
           get-P(t+1, G) = get-P(t, G)
        then return FALSE endif
        t := t + 1
    endwhile
    T := t, return TRUE
Figure 4: Main routine for building a probabilistic relaxed planning graph (PRPG).

The negative-index layers chain the implication graph all the way back to the initial state, and hence enable us to perform SAT tests against the (typically much smaller) initial state formula.

Returning to Figure 4, the PRPG is initialized with an empty implication set Imp; P(−m) and uP(−m) are assigned the propositions that are known and unknown in the initial belief state, respectively; and a weighted CNF formula Φ is initialized with φ(N_{b_I}). Φ is the formula against which the implication/weighted model counting tests are run when asking whether a proposition becomes known and whether the goal is likely enough. While the PRPG is built, Φ is incrementally extended with further clauses to capture the behavior of different effect outcomes.

The for loop builds the sets P and uP for a's time steps −m ··· −1 by iterative invocation of the build-timestep procedure, which each time expands the PRPG by a single time level. At each iteration −m ≤ t ≤ −1, the sets P(t+1) and uP(t+1) are made to contain the propositions that are known/unknown after applying the relaxed version of the action a_t ∈ a (remember that a = ⟨a_{−m}, ..., a_{−1}⟩). To simplify the presentation, each action set A(t) contains a set of dummy actions NOOPS that simply transport all the propositions from time layer t to time layer t+1. More formally, NOOPS = {noop_p | p ∈ P}, where pre(noop_p) = ∅ and E(noop_p) = {({p}, {ε})} with ε = (1.0, {p}, ∅).
6. In Conformant-FF, this configuration is implemented as an option; it significantly slows down the search in mostdomains, and brings advantages only in a few cases.
The subsequent while loop constructs the relaxed planning graph from layer 0 onwards by, again, iterative invocation of the build-timestep procedure. The actions in each layer t ≥ 0 are relaxations of those actions whose preconditions are known to hold at time t with certainty. This iterative construction is controlled by two termination tests. First, if the goal is estimated to hold at layer t with probability higher than θ, then we know that a relaxed plan estimate can be extracted. Otherwise, if the graph reaches a fixpoint, then we know that no relaxed (and thus no real) plan from b_I exists. We postpone the discussion of these two termination criteria, and now focus on the time layer construction procedure build-timestep.

procedure build-timestep(t, A),
    builds P(t+1), uP(t+1), and the implication edges from t to t+1,
    as induced by the action set A

    P(t+1) := P(t), uP(t+1) := ∅
    for all effects e of an action a ∈ A with con(e) ∈ P(t) ∪ uP(t) do
        for all ε ∈ Λ(e) do
            uP(t+1) := uP(t+1) ∪ add(ε)
            introduce new fact ε(t) with ϖ(ε(t)) = Pr(ε)
            Imp := Imp ∪ {(ε(t), p(t+1)) | p ∈ add(ε)}
        endfor
        if con(e) ∈ uP(t) then
            Imp := Imp ∪ ⋃_{ε ∈ Λ(e)} {(con(e)(t), ε(t))}
        else
            Φ := Φ ∧ (∨_{ε ∈ Λ(e)} ε(t)) ∧ ⋀_{ε ≠ ε′ ∈ Λ(e)} (¬ε(t) ∨ ¬ε′(t))
        endif
    endfor
    for all p ∈ uP(t+1) do
        build-w-impleafs(p(t+1), Imp)
        support(p(t+1)) := {l | l ∈ leafs(Imp_{→p(t+1)}) ∧ ϖ_{p(t+1)}(l) = ϖ(l)}
        if Φ → ∨_{l ∈ support(p(t+1))} l then
            P(t+1) := P(t+1) ∪ {p}
        endif
    endfor
    uP(t+1) := uP(t+1) \ P(t+1)

Figure 5: Building a time step of the PRPG.

The build-timestep procedure is shown in Figure 5. The first for loop of build-timestep proceeds over all outcomes of (relaxed) actions in the given set A that may occur at time t.
For each such probabilistic outcome we introduce a new chance proposition weighted by the conditional likelihood of that outcome. Having done that, we extend Imp with binary implications from this new chance proposition to the add list of the outcome. If we are uncertain about the condition con(e) of the corresponding effect at time t, that is, if we have con(e) ∈ uP(t), then we also add implications from con(e) to the chance propositions created for the outcomes of e. Otherwise, if con(e) is known at time t, then there is no uncertainty about our ability to make the effect e occur at time t. In this case, we do not "ground" the chance propositions created for the outcomes of e in the implication graph, but simply extend the running formula Φ with clauses capturing the "exactly one" relationship between the chance propositions corresponding to the alternative outcomes of e
at time t. This way, the probabilistic uncertainty about the outcome of e can be treated as if it were a property of the initial belief state b_I; this is the only type of knowledge we add into the knowledge-base formula Φ after initializing it in build-PRPG to φ(N_{b_I}).

7. Of course, in our implementation we have a special-case treatment for deterministic actions, using no chance nodes (rather than a single "chance node" with static weight 1).
Imp_{v→u}    The graph containing exactly all the paths from node v to node u in Imp.
Imp_{→u}     The subgraph of Imp formed by node u and all the ancestors of u in Imp.
leafs(Imp′)  The set of all zero in-degree nodes in the subgraph Imp′ of Imp.
E(Imp′)      The set of time-stamped action effects responsible for the implication edges of the subgraph Imp′ of Imp.

Table 2: Overview of notations around the implication graph.

The second for loop checks whether a proposition p, unknown at time t, becomes known at time t+1. This part of the build-timestep procedure is somewhat more involved; Table 2 provides an overview of the main notations used in what follows when discussing the various uses of the implication graph Imp.

First, in the second for loop of build-timestep, a call to the build-w-impleafs procedure associates each node v(t′) in Imp_{→p(t+1)} with an estimate ϖ_{p(t+1)}(v(t′)) of the probability of achieving p at time t+1 by the effects E(Imp_{v(t′)→p(t+1)}), given that v holds at time t′. In other words, the dynamic weight (with respect to p(t+1)) of the implication graph nodes is computed. Note that v(t′) can be either a time-stamped proposition q(t′) for some q ∈ P, or a chance proposition ε(t′) for some probabilistic outcome ε.

We will discuss the build-w-impleafs procedure in detail below. For understanding the second for loop of build-timestep, the main thing we need to know is the following lemma:

Lemma 4
Given a node v(t') ∈ Imp_{→p(t+1)}, we have ϖ_{p(t+1)}(v(t')) = ϖ(v(t')) if and only if, given v at time t', the sequence of effects E(Imp_{v(t')→p(t+1)}) achieves p at t+1 with probability 1.

In words, v(t') leads to p(t+1) with certainty iff the dynamic weight of v(t') equals its static weight. This is a simple consequence of how the weight propagation is arranged; it should hold true for any reasonable weight propagation scheme ("do not mark a node as certain if it is not"). A full proof of the lemma appears in Appendix A on p. 613.

Re-consider the second for loop of build-timestep. What happens is the following. Having finished the build-w-impleafs weight propagation for p at time t+1, we

1. collect all the leafs support(p(t+1)) of Imp_{→p(t+1)} that meet the criteria of Lemma 4, and
2. check (by a call to a SAT solver) whether the knowledge-base formula Φ implies the disjunction of these leafs.

If the implication holds, then the examined fact p is added to the set of facts known at time t+1. Finally, the procedure removes from the set of facts that are known to possibly hold at time t+1 all those facts that were just proven to hold at time t+1 with certainty.

To understand the above, consider the following. With Lemma 4, support(p(t+1)) contains exactly the set of leafs achieving which will lead to p(t+1) with certainty. Hence we can basically
procedure build-w-impleafs(p(t), Imp)
    /* top-down propagation of weights ϖ_{p(t)} from p(t) to all nodes in Imp_{→p(t)} */
    ϖ_{p(t)}(p(t)) := 1
    for decreasing time steps t' := (t − 1), ..., (−m) do
        for all chance nodes ε(t') ∈ Imp_{→p(t)} do
            α := Π_{r ∈ add(ε), r(t'+1) ∈ Imp_{→p(t)}} [1 − ϖ_{p(t)}(r(t'+1))]
            ϖ_{p(t)}(ε(t')) := ϖ(ε(t')) · (1 − α)
        endfor
        for all fact nodes q(t') ∈ Imp_{→p(t)} do
            α := 1
            for all a ∈ A(t'), e ∈ E(a), con(e) = {q} do
                α := α · [1 − Σ_{ε ∈ Λ(e), ε(t') ∈ Imp_{→p(t)}} ϖ_{p(t)}(ε(t'))]
            endfor
            ϖ_{p(t)}(q(t')) := 1 − α
        endfor
    endfor

Figure 6: The build-w-impleafs procedure for weight back-propagation over the implication graph.

use the same implication test as in Conformant-FF. Note, however, that the word "basically" in the previous sentence hides a subtle but important detail. In contrast to the situation in Conformant-FF, support(p(t+1)) may contain two kinds of nodes: (1) proposition nodes at the start layer of the PRPG, i.e., at layer −m corresponding to the initial belief; (2) chance nodes at later layers of the PRPG, corresponding to outcomes of effects that have no unknown conditions. This is the point where the updates on the formula Φ discussed above are needed; these keep track of alternative effect outcomes. Hence testing Φ → ∨_{l ∈ support(p(t+1))} l is the same as testing whether either: (1) p is known at t+1 because it is always triggered with certainty by at least one proposition true in the initial world; or (2) p is known at t+1 because it is triggered by all outcomes of an effect that will appear with certainty. We get the following result:

Lemma 5
Let (A, N_bI, G, θ) be a probabilistic planning task, a be a sequence of actions applicable in b_I, and |+1 be a relaxation function for A. For each time step t ≥ −m, and each proposition p ∈ P, if P(t) is constructed by build-PRPG(a, A, φ(N_bI), G, θ, |+1), then p at time t can be achieved by a relaxed plan starting with a|+1

(1) with probability > 0 (that is, p is not negatively known at time t) if and only if p ∈ uP(t) ∪ P(t), and
(2) with probability 1 (that is, p is known at time t) if and only if p ∈ P(t).

This is a consequence of the arguments outlined above. The full proof of Lemma 5 is given in Appendix A on p. 614.

Let us now consider the weight-propagating procedure build-w-impleafs depicted in Figure 6. This procedure performs a layered, top-down weight propagation from a given node p(t) ∈ Imp
8. The weight propagation scheme of the build-w-impleafs procedure is similar in nature to that used in the heuristics module of the recent probabilistic temporal planner Prottle of Little, Aberdeen, and Thiébaux (2005).
9. Note that the "t" here will be instantiated with t+1 when called from build-timestep.

down to the leafs of Imp_{→p(t)}. This order of traversal ensures that each node of Imp_{→p(t)} is processed only after all its successors in Imp_{→p(t)}. For the chance nodes ε(t'), the dynamic weight ϖ_{p(t)}(ε(t')) is set to

1. the probability that the outcome ε takes place at time t', given that the corresponding action effect e(ε) does take place at t', times
2. an estimate of the probability of achieving p at time t by the effects E(Imp_{ε(t')→p(t)}).

The first quantity is given by the "global", static weight ϖ(ε(t')) assigned to ε(t') in the first for loop of build-timestep. The second quantity is derived from the dynamic weights ϖ_{p(t)}(r(t'+1)) for r ∈ add(ε), computed in the previous iteration of the outermost for loop of build-w-impleafs. Making the heuristic assumption that the effect sets E(Imp_{r(t'+1)→p(t)}) for different r ∈ add(ε) are all pairwise independent, α is set to the probability of failing to achieve p at t by the effects E(Imp_{ε(t')→p(t)}). This computation of α for ε(t') is decomposed over the add list of ε, and this is where the weight propagation starts taking place. For the fact nodes q(t'), the dynamic weight ϖ_{p(t)}(q(t')) is set to the probability that some action effect conditioned on q at time t' allows (possibly indirectly) achieving the desired fact p at time t. Making again the heuristic assumption of independence between the various such effects conditioned on q at t', the computation of ϖ_{p(t)}(q(t')) is decomposed over the outcomes of these effects.
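To make the two update rules concrete, here is a small self-contained Python sketch of the back-propagation under the same independence assumptions. The graph encoding (dictionaries of static weights, add lists, and conditioned effects) and all names are ours for illustration, not Probabilistic-FF's actual data structures.

```python
# A sketch of the build-w-impleafs back-propagation, under its
# independence assumptions.  The graph encoding is invented for
# illustration only.
def back_propagate(target, layers, static_w, adds, effects):
    """
    layers: node names grouped by time step, in decreasing order of time.
    static_w[c]: static weight of chance node c (its outcome probability).
    adds[c]: fact nodes in the graph that chance node c adds.
    effects[q]: chance-node groups, one group per effect conditioned on q.
    Returns dynamic weights w[v], an estimate of P(reaching target from v).
    """
    w = {target: 1.0}
    for nodes in layers:                              # decreasing time steps
        for c in (n for n in nodes if n in adds):     # chance nodes
            fail = 1.0
            for r in adds[c]:
                fail *= 1.0 - w.get(r, 0.0)           # independence assumption
            w[c] = static_w[c] * (1.0 - fail)
        for q in (n for n in nodes if n in effects):  # fact nodes
            fail = 1.0
            for group in effects[q]:                  # one effect conditioned on q
                fail *= 1.0 - sum(w.get(c, 0.0) for c in group)
            w[q] = 1.0 - fail
    return w

# One fact q with a single effect whose outcome e1 (probability 0.7)
# adds the target g:
w = back_propagate("g", [["e1"], ["q"]],
                   {"e1": 0.7}, {"e1": ["g"]}, {"q": [["e1"]]})
# w["e1"] == 0.7 and w["q"] == 0.7
```

Here the chance-node rule multiplies failure probabilities over the outcome's add list, and the fact-node rule multiplies failure probabilities over the effects conditioned on that fact, exactly mirroring the two inner loops of Figure 6.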
procedure get-P(t, G)
    /* estimates the probability of achieving G at time t */
    if G ⊄ P(t) ∪ uP(t) then return 0 endif
    if G ⊆ P(t) then return 1 endif
    for g ∈ G \ P(t) do
        for each l ∈ leafs(Imp_{→g(t)}), introduce a chance proposition ⟨l, g⟩ with weight ϖ_{g(t)}(l)
        φ_g := (∨_{l ∈ leafs(Imp_{→g(t)})} l) ∧ ∧_{l ∈ leafs(Imp_{→g(t)}) ∩ uP(−m)} (¬l ∨ ⟨l, g⟩)
    endfor
    return WMC(Φ ∧ ∧_{g ∈ G\P(t)} φ_g)

Figure 7: Estimating the goal likelihood at a given time step.

What remains to be explained of the build-PRPG procedure are the two termination criteria of the while loop constructing the planning graph from layer 0 onwards. The first test is made by a call to the get-P procedure, and it checks whether the PRPG built up to the time layer T contains a relaxed plan for (A, N_bI, G, θ). The get-P procedure is shown in Figure 7. First, if one of the subgoals is negatively known at time t, then, from Lemma 5, the overall probability of achieving the goal is 0. On the other extreme, if all the subgoals are known at time t, then the probability of achieving the goal is 1. The correctness of the latter test is implied by Lemma 5 and the non-interference of relaxed actions. This leaves us with the main case, in which we are uncertain about some of the subgoals. This uncertainty is either due to dependence of these subgoals on the actual initial world state, or due to achieving these subgoals using probabilistic actions, or due to both. The uncertainty about the initial state is fully captured by our weighted CNF formula φ(N_bI) ⊆ Φ. Likewise, the outcomes' chance propositions ε(t') introduced into the implication graph by the build-timestep procedure are "chained up" in Imp to the propositions on the add lists of these outcomes, and
"chained down" in Imp to the unknown (relaxed) conditions of these outcomes, if any. Therefore, if some action outcome ε at time t' < t is relevant to achieving a subgoal g ∈ G at time t, then the corresponding node ε(t') must appear in Imp_{→g(t)}, and its weight will be back-propagated by build-w-impleafs(g(t), Imp) down to the leafs of Imp_{→g(t)}. The get-P procedure then exploits these back-propagated estimates by, again, making a heuristic assumption of independence between achieving different subgoals. Namely, the probability of achieving the unknown sub-goals G \ P(t) is estimated by weighted model counting over the formula Φ, conjoined with probabilistic theories φ_g of achieving each unknown goal g in isolation. To understand the formulas φ_g, consider that, in order to make g true at t, we must achieve at least one of the leafs l of Imp_{→g(t)}; hence the left part of the conjunction. On the other hand, if we make l true, then this achieves g(t) only with (estimated) probability ϖ_{g(t)}(l); hence the right part of the conjunction requires us to "pay the price" if we set l to true.

As was explained at the start of this section, the positive PRPG termination test may fire even if the real goal probability is not high enough. That is, get-P may return a value higher than the real goal probability, due to the approximation (independence assumption) done in the weight propagation. Of course, due to the same approximation, it may also happen that get-P returns a value lower than the real goal probability.

The second PRPG termination test checks whether we have reached a point in the construction of the PRPG that allows us to conclude that there is no relaxed plan for (A, N_bI, G, θ) that starts with the given action sequence a. This termination criterion asks whether, from time step t to time step t+1, any potentially relevant changes have occurred.
A potentially relevant change would be if the goal-satisfaction probability estimate get-P grows, or if the sets of known and unknown propositions grow, or if the support leafs of the latter propositions in Imp that correspond to the initial belief state grow. If none of these occurs, then the same would hold in all future iterations t' > t, implying that the required goal satisfaction probability θ would never be reached. In other words, the PRPG construction is complete.

Theorem 6
Let (A, N_bI, G, θ) be a probabilistic planning task, a be a sequence of actions applicable in b_I, and |+1 be a relaxation function for A. If build-PRPG(a, A, φ(N_bI), G, θ, |+1) returns FALSE, then there is no relaxed plan for (A, b_I, G, θ) that starts with a|+1.

Note that Theorem 6 holds despite the approximation done during weight propagation, making the assumption of probabilistic independence. For Theorem 6 to hold, the only requirement on the weight propagation is this: if the real weight still grows, then the estimated weight still grows.
This requirement is met under the independence assumption. It would not be met under the assumption of co-occurrence, propagating weights by maximization operations and thereby conservatively under-estimating the weights. With that propagation, if the PRPG fails then we cannot conclude that there is no plan for the respective belief. This is another good argument (besides the bad-quality heuristics we observed empirically) against using the conservative estimation.
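For concreteness, the "exactly one" clauses that build-timestep adds to Φ for the alternative outcomes of an effect can be encoded as a CNF fragment in the usual way: one "at least one" clause plus pairwise mutual-exclusion clauses. The following sketch uses our own illustrative data structures (signed-integer literals, a weight map), not Probabilistic-FF's internals.

```python
from itertools import combinations

# Illustrative encoding of one probabilistic effect's outcomes as weighted
# chance propositions, with "exactly one" CNF clauses over them.
def encode_effect_outcomes(outcome_probs, next_var):
    """outcome_probs: conditional outcome probabilities, summing to 1.0."""
    assert abs(sum(outcome_probs) - 1.0) < 1e-9
    chance_vars, weights, clauses = [], {}, []
    for p in outcome_probs:
        v = next_var
        next_var += 1
        chance_vars.append(v)
        weights[v] = p                      # weight of the positive literal
    clauses.append(list(chance_vars))       # "at least one" outcome occurs
    for u, v in combinations(chance_vars, 2):
        clauses.append([-u, -v])            # "at most one": pairwise exclusion
    return chance_vars, weights, clauses, next_var

vars_, w, cls, _ = encode_effect_outcomes([0.7, 0.2, 0.1], next_var=1)
# cls == [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]
```

Because the clauses tie the chance propositions together while the weights carry the outcome probabilities, a weighted model counter over the extended formula treats the outcome uncertainty just like initial-state uncertainty, as described earlier in this section.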
10. If we do not introduce the extra chance propositions ⟨l, g⟩, and instead assign the weight ϖ_{g(t)}(l) to l itself, then the outcome is not correct: we would have to "pay" also for setting l to false.
11. To understand the latter, note that the PRPG can always be extended with more and more replicas of probabilistic actions that are irrelevant to achieving the goals and have effects with known conditions. While these action effects (since they are irrelevant) will not influence our estimate of goal-satisfaction probability, the chance propositions corresponding to the outcomes of these effects may become the support leafs of some unknown proposition p. In the latter case, the set of support leafs support(p(t')) grows without bound as t' → ∞, while the projection of support(p(t')) onto the initial belief state (that is, support(p(t')) ∩ uP(−m)) is guaranteed to reach a fixpoint.
The full proof of Theorem 6 is given in Appendix A on p. 615. The theorem finalizes our presentation and analysis of the process of constructing probabilistic relaxed planning graphs.
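The WMC(·) computations used by get-P (Figure 7) can be illustrated with a deliberately naive weighted model counter: each variable carries a weight for its positive literal and the complement for its negative literal, and the count of a formula is the sum of weight products over its satisfying assignments. This enumeration sketch only illustrates the semantics; dedicated counting engines do this far more efficiently.

```python
from itertools import product

# Naive weighted model counting by enumeration over all assignments.
# clauses: CNF as lists of signed integers (variable i is literal i / -i).
# weights: weights[i-1] is the weight of setting variable i to true;
#          the negative literal gets 1 - weights[i-1].
def wmc(clauses, weights):
    n = len(weights)
    total = 0.0
    for bits in product([False, True], repeat=n):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            m = 1.0
            for v, wv in enumerate(weights, start=1):
                m *= wv if bits[v - 1] else 1.0 - wv
            total += m
    return total

# Variable 1 is true with probability 0.8: the empty formula counts to 1,
# and the unit clause [1] counts to exactly 0.8.
assert abs(wmc([], [0.8]) - 1.0) < 1e-9
assert abs(wmc([[1]], [0.8]) - 0.8) < 1e-9
```

On a formula of the form Φ ∧ ∧_g φ_g, with the auxiliary chance propositions ⟨l, g⟩ weighted as in Figure 7, this sum is exactly the goal-probability estimate that get-P returns.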
To illustrate the construction of a PRPG by the algorithms in Figures 4-7, let us consider a simplification of our running Examples 1-2 in which

(i) only the actions {move-b-right, move-left} constitute the action set A,
(ii) the goal is G = {r, b}, and θ is the required lower bound on the probability of success,
(iii) the initial belief state b_I is given by the BN N_bI as in Example 2, and
(iv) the belief state b_a evaluated by the heuristic function corresponds to the action sequence a = ⟨move-b-right⟩.

The effects/outcomes of the actions A considered in the construction of the PRPG are described in Table 3, where e_mbr is a re-notation of the effect e in Table 1; the effect e' in Table 1 is effectively ignored due to the emptiness of its add list.

a                      E(a)     con(e)    con(e)|+1    Λ(e), add(ε)
a_mbr (move-b-right)   e_mbr    {r, b}    {r}          ε_mbr: {r, b};  ε_mbr: {r};  ε_mbr: ∅
a_ml (move-left)       e_ml     {r}       {r}          ε_ml: {r}
noop_r                 e_r      {r}       {r}          ε_r: {r}
noop_r                 e_r      {r}       {r}          ε_r: {r}
noop_b                 e_b      {b}       {b}          ε_b: {b}
noop_b                 e_b      {b}       {b}          ε_b: {b}

Table 3: Actions and their |+1 relaxation for the PRPG construction example.

The initialization phase of the build-PRPG procedure results in
Φ = φ(N_bI), Imp := ∅, P(−1) = ∅, and uP(−1) = {r, r, b, b}. The content of uP(−1) is depicted in the first column of nodes in Figure 8. The first for loop of build-PRPG (constructing the PRPG for the "past" layers corresponding to a) makes a single iteration, and calls the build-timestep procedure with t = −1 and A(−1) = {a_mbr} ∪ NOOPS. (In what follows, using the names of the actions we refer to their |+1 relaxations as given in Table 3.) The add list of the outcome ε_mbr is empty, and thus it adds no nodes to the implication graph. Other than that, the chance nodes introduced into Imp by this call to build-timestep appear in the second column of Figure 8. The first outer for loop of build-timestep results in Imp as given by columns 1-3 of Figure 8, uP(0) = uP(−1), and no extension of Φ.

In the second outer for loop of build-timestep, the weight-propagating procedure build-w-impleafs is called for each unknown fact p(0) ∈ uP(0) = {r(0), r(0), b(0), b(0)}, generating the "p(0)-oriented" weights as in Table 4. For each p(0) ∈ uP(0), the set of supporting leafs is support(p(0)) = {p(−1)}; none of these is implied by Φ = φ(N_bI), and thus the set of known facts P(0) remains equal to P(−1) = ∅, and uP(0) remains equal to uP(−1).

Figure 8: The implication graph Imp. The odd columns of nodes depict the sets of unknown propositions uP(t). The even columns of nodes depict the chance propositions introduced for the probabilistic outcomes of the actions A(t).

Table 4: The weights ϖ_{p(0)} for p(0) ∈ uP(0). An entry in the row of p(0) is empty if and only if the node associated with the corresponding column does not belong to the implication subgraph Imp_{→p(0)}.

Having finished with the for loop, the build-PRPG procedure proceeds with the while loop that builds the "future" layers of the PRPG. The test of goal (un)satisficing get-P(0, G) < θ evaluates to TRUE, and thus the loop proceeds with its first iteration. To see the former, consider the implication graph Imp constructed so far (columns 1-3 in Figure 8). For our goal G = {r, b} we have leafs(Imp_{→r(0)}) = {r(−1)}, and leafs(Imp_{→b(0)}) = {r(−1), b(−1)}. As {r(0), b(0)} ⊂ uP(0) and Φ = φ(N_bI), we have get-P(0, G) = WMC(φ(N_bI) ∧ φ_r ∧ φ_b), where

φ_r = (⟨r, r⟩) ∧ (r(−1) ↔ ⟨r, r⟩),
φ_b = (⟨r, b⟩ ∨ ⟨b, b⟩) ∧ (r(−1) ↔ ⟨r, b⟩) ∧ (b(−1) ↔ ⟨b, b⟩),     (20)

and

ϖ(⟨r, r⟩) = ϖ_{r(0)}(r(−1)),
ϖ(⟨b, b⟩) = ϖ_{b(0)}(b(−1)),
ϖ(⟨r, b⟩) = ϖ_{b(0)}(r(−1)).     (21)

Observe that the two models of φ(N_bI) consistent with r immediately falsify the sub-formula φ(N_bI) ∧ φ_r. Hence, we have

get-P(0, G) = WMC(φ(N_bI) ∧ φ_r ∧ φ_b | r(−1), b(−1)) + WMC(φ(N_bI) ∧ φ_r ∧ φ_b | r(−1), b(−1))
            = b_I(r, b) · ϖ(⟨r, r⟩) · ϖ(⟨r, b⟩) + b_I(r, b) · ϖ(⟨r, r⟩) · ϖ(⟨r, b⟩) · ϖ(⟨b, b⟩) < θ.

In the first iteration of the while loop, build-PRPG calls the build-timestep procedure with t = 0 and A(0) = {a_mbr, a_ml} ∪ NOOPS. The chance nodes introduced into Imp by this call to build-timestep appear in the fourth column of Figure 8. The first outer for loop of build-timestep results in Imp as given by columns 1-5 of Figure 8, uP(1) = uP(0), and no extension of Φ. As before, in the second for loop of build-timestep, the build-w-impleafs procedure is called for each unknown fact p(1) ∈ uP(1) = {r(1), r(1), b(1), b(1)}, generating the "p(1)-oriented" weights. The interesting case here is the weight propagation build-w-impleafs(r(1), Imp): for the nodes in Imp_{→r(1)}, the resulting dynamic weights of both r(−1) and r(−1) equal their static weights, so both satisfy the criterion of Lemma 4. From that, the set of supporting leafs of r(1) is assigned support(r(1)) = {r(−1), r(−1)}, and since Φ = φ(N_bI) implies r(−1) ∨ r(−1), the fact r is concluded to be known at time 1 and is added to P(1). For all other nodes p(1) ∈ uP(1) we still have support(p(1)) = {p(−1)}, and thus they all remain unknown at time t = 1 as well. Putting things together, this call to the build-w-impleafs procedure results in P(1) = {r(1)} and uP(1) = {r, b(1), b(1)}. The while loop of the build-PRPG procedure proceeds with checking the fixpoint termination test, and this immediately fails due to P(1) ≠ P(0). Hence, the while loop proceeds with the next iteration, corresponding to t = 1.

The test of goal (un)satisficing get-P(1, G) < θ still evaluates to TRUE, as we get get-P(1, G) = 0.899 < θ. Let us follow this evaluation of get-P(1, G) in detail as well. Considering the implication graph Imp constructed so far, up to time t = 1 (columns 1-5 in Figure 8), and having G ∩ uP(1) = {b(1)}, leafs(Imp_{→b(1)}) = {r(−1), b(−1)}, and (still) Φ = φ(N_bI), we obtain get-P(1, G) = WMC(φ(N_bI) ∧ φ_b), with

φ_b = (⟨r, b⟩ ∨ ⟨b, b⟩) ∧ (r(−1) ↔ ⟨r, b⟩) ∧ (b(−1) ↔ ⟨b, b⟩).     (22)

While the structure of φ_b in Equation 22 is identical to that in Equation 20, the weights associated with the auxiliary chance propositions are different, notably

ϖ(⟨b, b⟩) = ϖ_{b(1)}(b(−1)),
ϖ(⟨r, b⟩) = ϖ_{b(1)}(r(−1)).     (23)

The difference in ϖ(⟨r, b⟩) between Equation 21 and Equation 23 stems from the fact that r(−1) supports b(1) not only via the effect e_mbr at time −1, but also via a different instance of the same effect at time 0. Now, the only model of φ(N_bI) that falsifies φ_b is the one that sets both r and b to false. Hence, we have

get-P(1, G) = b_I(r, b) · ϖ(⟨r, b⟩) + b_I(r, b) · ϖ(⟨r, b⟩) · ϖ(⟨b, b⟩) + b_I(r, b) · ϖ(⟨b, b⟩) = 0.899.

Having verified get-P(1, G) < θ, the while loop proceeds with the construction for time t = 2, and calls the build-timestep procedure with t = 1 and A(1) = {a_mbr, a_ml} ∪ NOOPS. The chance nodes introduced into Imp by this call to build-timestep appear in the sixth column of Figure 8. The first outer for loop of build-timestep results in Imp as given by columns 1-7 of Figure 8, and

Φ = φ(N_bI) ∧ (ε_mbr(1) ∨ ε_mbr(1) ∨ ε_mbr(1))
            ∧ (¬ε_mbr(1) ∨ ¬ε_mbr(1)) ∧ (¬ε_mbr(1) ∨ ¬ε_mbr(1)) ∧ (¬ε_mbr(1) ∨ ¬ε_mbr(1)).     (24)

Next, the build-w-impleafs procedure is called as usual for each unknown fact p(2) ∈ uP(2) = {r, b(2), b(2)}. The information worth detailing here is that now we have leafs(Imp_{→b(2)}) = {b(−1), r(−1), ε_mbr(1)}, and support(b(2)) = {b(−1), ε_mbr(1)}. However, we still have Φ → ∨_{l ∈ support(p(2))} l for no p(2) ∈ uP(2), and thus the set of known facts P(2) remains equal to P(1) = {r}.
Returning from the call to the build-w-impleafs procedure, build-PRPG proceeds with checking the fixpoint termination condition. This time, the first three equalities of the condition do hold, yet the condition is not satisfied, due to get-P(2, G) > get-P(1, G). To see the latter, notice that we have get-P(2, G) = WMC(Φ ∧ φ_b), where Φ is given by Equation 24,

φ_b = (⟨r, b⟩ ∨ ⟨b, b⟩ ∨ ε_mbr(1)) ∧ (r(−1) ↔ ⟨r, b⟩) ∧ (b(−1) ↔ ⟨b, b⟩),     (25)

and

ϖ(⟨b, b⟩) = ϖ_{b(1)}(b(−1)),
ϖ(⟨r, b⟩) = ϖ_{b(1)}(r(−1)),
ϖ(ε_mbr(1)) = ϖ_{b(1)}(ε_mbr(1)) = 0.7.     (26)

It is not hard to verify that

get-P(2, G) = get-P(1, G) + b_I(r, b) · ϖ(ε_mbr(1)) = 0.899 + b_I(r, b) · 0.7.

Note that now we do have get-P(2, G) ≥ θ, and therefore build-PRPG exits the while loop by passing the goal satisficing test, and sets T = 2. This finalizes the construction of the PRPG, and thus our example.

If the construction of the PRPG succeeds in reaching the goals with the estimated probability of success get-P(T, G) exceeding θ, then we extract a relaxed plan consisting of actions A' ⊆ A(0), ..., A(T−1), and use the size of A' as the heuristic value of the evaluated belief state b_a.

Before we get into the technical details, consider that there are some key differences between relaxed (no delete lists) probabilistic planning on the one hand, and both relaxed classical and relaxed qualitative conformant planning on the other hand. In relaxed probabilistic planning, it might make sense to execute the same action numerous times in consecutive time steps. In fact, this might be essential: just think of throwing a die in a game until a "6" appears. In contrast, in the relaxed classical and qualitatively uncertain settings this is not needed: once an effect has been executed, it remains true forever. Another complication in probabilistic planning is that the required goal-achievement probability is specified over a conjunction (or, possibly, some more complicated logical combination) of different facts. While increasing the probability of achieving each individual sub-goal g ∈ G in relaxed planning will always increase the overall probability of achieving G, choosing the right distribution of effort among the sub-goals to pass the required threshold θ for the whole goal G is a non-trivial problem.

A fundamental problem is the aforementioned lack of guarantees of the weight propagation. On the one hand, the construction of the PRPG and Lemma 5 imply that a|+1 concatenated with an arbitrary linearization a_R of A(0), ..., A(T−1) is executable in b_I.
On the other hand, due to the independence assumption made in the build-w-impleafs procedure, get-P(T, G) ≥ θ does not
imply that the probability of achieving G by a|+1 concatenated with a_R exceeds θ. A "real" relaxed plan, in that sense, might not even exist in the constructed PRPG.

Our answer to the above difficulties is to extract relaxed plans that are correct relative to the weight propagation. Namely, we use an implication graph "reduction" algorithm that computes a minimal subset of that graph which still, according to the weight propagation, sufficiently supports the goal. The relaxed plan then corresponds to that subset. Obviously, this "solves" the difficulty with the lack of "real" relaxed plans; we just do the relaxed plan extraction according to the independence assumption (besides ignoring deletes and removing all but one condition of each effect). The mechanism also naturally takes care of the need to apply the same action several times: this corresponds to several implication graph edges which are all needed in order to obtain sufficient weight. The choice of how effort is distributed among sub-goals is circumvented in the sense that all sub-goals are considered in conjunction, that is, the reduction is performed once and for all. Of course, there remains a choice of which parts of the implication graph should be removed. We have found that it is a useful heuristic to make this choice based on which actions have already been applied on the path to the belief. We will detail this below.

Making another assumption on top of the previous relaxations can of course be bad for heuristic quality. The "relaxed plans" we extract are not guaranteed to actually achieve the desired goal probability. Since the relaxed plans are used only for search guidance, per se this theoretical weakness is only of marginal importance. However, an over-estimation of goal probability might result in a bad heuristic, because the relaxed plan does not include the right actions, or does not apply them often enough.
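The die intuition mentioned above can be made concrete: repeating the same probabilistic action in consecutive steps is exactly how a relaxed plan accumulates success probability, since the chance of at least one success in n independent tries is 1 − (1 − p)^n.

```python
# Probability of at least one success (e.g., rolling a "6") in n
# independent tries with per-try success probability p.
def p_success(n, p=1.0 / 6.0):
    return 1.0 - (1.0 - p) ** n

# Smallest number of throws whose success probability reaches 0.9:
throws_needed = next(n for n in range(1, 100) if p_success(n) >= 0.9)
# throws_needed == 13
```

This is why, unlike in relaxed classical or qualitative conformant planning, a relaxed probabilistic plan may legitimately contain many copies of one and the same action.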
In Section 5, we will discuss an example domain where Probabilistic-FF fails to scale for precisely this reason.

Figure 9 shows the main routine extract-PRPlan for extracting a relaxed plan from a given PRPG (note that T is the index of the highest PRPG layer, c.f. Figure 4). The sub-routines of extract-PRPlan are shown in Figures 10-11. At a high level, the extract-PRPlan procedure consists of two parts:

1. Reduction of the implication graph, aiming at identifying a set of time-stamped action effects that can be ignored without decreasing our estimate get-P(T, G) of the goal-achievement probability below the desired threshold θ, and
2. Extraction of a valid relaxed plan a_r such that (schematically) constructing the PRPG with a_r instead of the full set of A(0), ..., A(T) would still result in get-P(T, G) ≥ θ.

procedure extract-PRPlan(PRPG(a, A, φ(N_bI), G, θ, |+1))
    /* selects actions from A(0), ..., A(T−1) */
    Imp' := reduce-implication-graph()
    extract-subplan(Imp')
    sub-goal(G ∩ P(T))
    for decreasing time steps t := T, ..., 1 do
        for all g ∈ G(t) do
            if ∃ a ∈ A(t−1), e ∈ E(a), con(e) ⊆ P(t−1), ∀ε ∈ Λ(e): g ∈ add(ε) then
                add-to-relaxed-plan one such a at time t−1
                sub-goal(pre(a) ∪ con(e))
            else
                Imp_{g(t)} := construct-support-graph(support(g(t)))
                extract-subplan(Imp_{g(t)})
            endif
        endfor
    endfor

Figure 9: Extracting a probabilistic relaxed plan.

The first part is accomplished by the reduce-implication-graph procedure, depicted in Figure 10. In its first step, the procedure restricts attention to the parts of the implication graph that are relevant to achieving the unknown sub-goals. Next, reduce-implication-graph performs a greedy iterative elimination of actions from the "future" layers 0, ..., T−1 of the PRPG, as long as the probability estimate get-P(T, G) over the reduced set of actions does not drop below θ. While, in principle, any action from A(0), ..., A(T−1) can be considered for elimination, in reduce-implication-graph we examine only repetitions of the actions that already appear in a. Specifically, reduce-implication-graph iterates over the actions a in a|+1, and if a repeats somewhere in the "future" layers of the PRPG, then one such repetition a(t') is considered for removal. If removing this repetition of a is found safe with respect to achieving θ, then it is effectively removed by eliminating all the edges in Imp that are induced by a(t'). Then the procedure considers the next repetition of a. If removing another copy of a is not safe anymore, then the procedure breaks the inner loop and considers the next action.

12. Note here that the formula for WMC is constructed exactly as for the get-P function, c.f. Figure 7.

procedure reduce-implication-graph()
    /* operates on the PRPG; returns a sub-graph of Imp */
    Imp' := ∪_{g ∈ G\P(T)} Imp_{→g(T)}
    for all actions a ∈ a|+1 do
        for all edges (ε(t'), p(t'+1)) ∈ Imp' induced by a(t') ∈ A(t'), for some t' ≥ 0 do
            Imp'' := Imp'
            remove from Imp'' all the edges induced by a ∈ A(t')
            for all g ∈ G \ P(T) do
                for each l ∈ leafs(Imp''_{→g(T)}), introduce a chance proposition ⟨l, g⟩ with weight ϖ_{g(T)}(l)
                φ_g := (∨_{l ∈ leafs(Imp''_{→g(T)})} l) ∧ ∧_{l ∈ leafs(Imp''_{→g(T)}) ∩ uP(−m)} (¬l ∨ ⟨l, g⟩)
            endfor
            if WMC(Φ ∧ ∧_{g ∈ G\P(T)} φ_g) ≥ θ then Imp' := Imp'' else break endif
        endfor
    endfor
    return Imp'

Figure 10: The procedure reducing the implication graph.

To illustrate the intuition behind our focus on the repetitions of the actions from a, let us consider the following example of a simple logistics-style planning problem with probabilistic actions. Suppose we have two locations A and B, a truck that is known to be initially in A, and a heavy and
The goal is to have the packageunloaded in B with a reasonably high probability, and there are two actions we can use – movingthe truck from A to B ( a m ), and unloading the package ( a u ). Moving the truck does not necessarily OMSHLAK & H
move the truck to B, but it does so with an extremely high probability. On the other hand, unloading the bothersome package succeeds with an extremely low probability, leaving the package on the truck otherwise. Given this data, consider the belief state b_a corresponding to "after trying to move the truck once", that is, to the action sequence ⟨a_m⟩. To achieve the desired probability of success, the PRPG will have to be expanded to a very large time horizon T, allowing the action a_u to be applied sufficiently many times. However, the fact "truck in B" is not known in the belief state b_a, and thus the implication graph will also contain the same number of applications of a_m. Trimming away most of these applications of a_m will still keep the probability sufficiently high.

The reader might ask at this point what we hope to achieve by "trimming away most of the applications of a_m". The point is, intuitively, that the implication graph reduction mechanism is a means to understand what has been accomplished already, on the path to b_a. Without such an understanding, the relaxed planning can be quite indiscriminative between search states. Consider the above example, and assume we have not one but two troubled packages, P1 and P2, on the truck, with unload actions a_u1 and a_u2. The PRPG for b_a contains copies of a_u1 and a_u2 at layers up to the large horizon T. Now, say our search starts by unloading P1. In the resulting belief, the PRPG still has T steps, because the situation has not changed for P2. Each step of the PRPG still contains copies of both a_u1 and a_u2, and hence the heuristic value remains the same as before! In other words, without an implication graph reduction technique, relevant things that are accomplished may remain hidden behind other things that have not yet been accomplished.
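The greedy elimination loop of reduce-implication-graph (Figure 10) can be sketched as follows. Here `estimate` stands in for the WMC-based re-computation of the goal probability over the reduced graph, and the edge/group encoding is ours for illustration only.

```python
# Schematic version of the greedy reduction loop: try to drop repetitions
# of already-applied actions, keeping each removal only while the
# re-estimated goal probability stays at or above theta.
def greedy_reduce(edges, candidate_groups, estimate, theta):
    """
    edges: current implication-graph edges (a set).
    candidate_groups: per action of `a`, the edge groups induced by its
        repetitions in the "future" layers, tried one at a time.
    estimate: maps an edge set to a goal-probability estimate.
    """
    for groups_of_one_action in candidate_groups:
        for group in groups_of_one_action:
            reduced = edges - group
            if estimate(reduced) >= theta:
                edges = reduced          # removal is safe: keep it
            else:
                break                    # this action is done; try the next
    return edges

# Toy estimate: each surviving edge contributes 0.2 of probability mass.
est = lambda es: 0.2 * len(es)
kept = greedy_reduce({1, 2, 3, 4, 5}, [[{5}, {4}], [{3}]], est, theta=0.6)
# Edges 5 and 4 can be dropped (the estimate stays >= 0.6);
# dropping edge 3 would give 0.4, so it is kept: kept == {1, 2, 3}
```

The break mirrors the procedure in Figure 10: once one more removal of a given action's copies becomes unsafe, that action is abandoned and the next action of a is considered.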
In the above example, this is not really critical because, as soon as we have tried an unload for each of P_1 and P_2, the time horizon T decreases by one step, and the heuristic value is reduced. It is, however, often the case that some sub-task must be accomplished before some other sub-task can be attacked. In such situations, without implication graph reduction, the search staggers across a huge plateau until the first task is completed. We observed this in a variety of benchmarks, and hence designed the implication graph reduction to make the relaxed planning aware of what has already been done. Of course, since our weight propagation may over-estimate true probabilities, and hence over-estimate what was achieved in the past, the implication graph reduction may conclude prematurely that a sub-task has been "completed". This leads us to the main open question in this research; we will get back to this at the end of Section 5, where we discuss it in the context of an example where Probabilistic-FF's performance is bad.

Let us get back to explaining the extract-PRPlan procedure. After the implication graph reduction, the procedure proceeds with the relaxed plan extraction. The process makes use of proposition sets G(1), . . . , G(T), which store the time-stamped sub-goals arising at layers 1 ≤ t ≤ T during the relaxed plan extraction. The sub-routine extract-subplan (Figure 11)

1. adds to the constructed relaxed plan all the time-stamped actions responsible for the edges of the reduced implication graph Imp′, and

2. sub-goals everything outside the implication graph that conditions the applicability of the effects responsible for the edges of Imp′.

Here and in the later phases of the process, the sub-goals are added into the sets G(1), . . . , G(T) by the sub-goal procedure, which simply inserts each given proposition as a sub-goal at the first layer of its appearance in the PRPG.
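In outline, the sub-goal procedure can be sketched as follows; a minimal Python transliteration with hypothetical data structures (Probabilistic-FF itself is implemented in C):

```python
def sub_goal(props, first_layer, G):
    """Insert each proposition as a sub-goal at the first PRPG layer in
    which it appears; only layers >= 1 carry sub-goal sets G(1)..G(T).

    props       -- propositions to sub-goal
    first_layer -- dict: proposition -> earliest layer t with p in P(t)
    G           -- dict: layer t -> set of sub-goals G(t)
    """
    for p in props:
        t = first_layer[p]
        if t >= 1:
            G.setdefault(t, set()).add(p)

# Toy usage (hypothetical propositions and layers):
G = {}
sub_goal({"r"}, {"r": 1, "b": 0}, G)
# G is now {1: {"r"}}; "b" would be ignored, as it first appears at layer 0.
```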
Having accomplished this extract-and-subgoal pass of extract-subplan over Imp′, we also sub-goal all the goal conjuncts known at time T.

procedure extract-subplan(Imp′)
/* adds to the relaxed plan the actions that are helpful for achieving the uncertain goals G ∩ uP(T), and sub-goals all the essential conditions of these actions */
for each edge (ε(t), p(t + 1)) ∈ Imp′ such that t ≥ 0 do
    if action a and its effect e ∈ E(a) are responsible for ε at time t then
        add a to the relaxed plan at time t
        sub-goal((pre(a) ∪ con(e)) ∩ P(t))
    endif
endfor

procedure sub-goal(P)
/* inserts the propositions in P as sub-goals at the layers of their first appearance in the PRPG */
for all p ∈ P do
    t := argmin_t {p ∈ P(t)}
    if t ≥ 1 then G(t) := G(t) ∪ {p} endif
endfor

procedure construct-support-graph(support(g(t)))
/* takes a subset support(g(t)) of leafs(Imp→g(t)) weighted according to g(t); returns a sub-graph Imp′ of Imp */
Imp′ := ∅
open := support(g(t))
while open ≠ ∅ do
    open := open \ {p(t′)}
    choose a ∈ A(t′), e ∈ E(a), con(e) = {p} such that
        ∀ε ∈ Λ(e) : (p(t′), ε(t′)) ∈ Imp_g(t) ∧ ϖ_g(t)(ε(t′)) = ϖ(ε(t′))
    for each ε ∈ Λ(e) do
        choose q ∈ add(ε) such that ϖ_g(t)(q(t′ + 1)) = 1
        Imp′ := Imp′ ∪ {(p(t′), ε(t′)), (ε(t′), q(t′ + 1))}
        open := open ∪ {q(t′ + 1)}
    endfor
endwhile
return Imp′

Figure 11: Sub-routines for extract-PRPlan.

In the next phase of the process, the sub-goals are considered layer by layer in decreasing order of time steps T ≥ t ≥ 1. For each sub-goal g at time t, certain supporting actions are selected into the relaxed plan. If there are an action a and some effect e ∈ E(a) that are known to be applicable at time t − 1 and guarantee to achieve g with certainty, then a is added to the constructed relaxed plan at t − 1. Otherwise, we

1.
use the construct-support-graph procedure to extract a sub-graph Imp_g(t) consisting of a set of implications that together ensure achieving g at time t, and

2. use the already discussed procedure extract-subplan to

(a) add to the constructed relaxed plan all the time-stamped actions responsible for the edges of Imp_g(t), and

(b) sub-goal everything outside this implication graph Imp_g(t) that conditions the applicability of the effects responsible for the edges of Imp_g(t).
Processing in this way all the sub-goals down to G(1) finalizes the extraction of the relaxed plan estimate. Section 4.5 provides a detailed illustration of this process on the PRPG constructed in Section 4.3. In any event, it is easy to verify that the relaxed plan we extract is sound relative to our weight propagation, in the following sense.

Proposition 7
Let (A, N_{b_I}, G, θ) be a probabilistic planning task, a a sequence of actions applicable in b_I, and |_{+1} a relaxation function for A such that build-PRPG(a, A, φ(N_{b_I}), G, θ, |_{+1}) returns TRUE. Let A_s(0), . . . , A_s(T − 1) be the actions selected from A(0), . . . , A(T − 1) by extract-PRPlan. When constructing a relaxed planning graph using only A_s(0), . . . , A_s(T − 1), then get-P(T, G) ≥ θ.

Proof:
By construction: reduce-implication-graph leaves enough edges in the graph so that the weight propagation underlying get-P still concludes that the goal probability is high enough.

We illustrate the process of relaxed plan extraction on the PRPG of Figure 8, constructed for the belief state and problem specification of the example in Section 4.3. In this example we have T = 2 and G ∩ uP(2) = {b}, and thus the implication graph Imp gets immediately reduced to its sub-graph Imp′ depicted in Figure 12a. As the plan a leading to the belief state in question consists of only a single action a_mbr, the only action instances that are considered for elimination by the outer for loop of reduce-implication-graph are a_mbr(0) and a_mbr(1). If a_mbr(0) is chosen to be examined, then the implication sub-graph Imp′′ = Imp′ is further reduced by removing all the edges due to a_mbr(0), and the resulting Imp′′ appears in Figure 12b. The Φ and φ_b components of the evaluated formula Φ ∧ φ_b are given by Equation 24 and Equation 25, respectively, and the weights associated with the chance propositions in Equation 25 over the reduced implication graph Imp′′ are

ϖ(⟨b, b⟩) = ϖ_{b(1)}(b(−1)),  ϖ(⟨r, b⟩) = ϖ_{b(1)}(r(−1)),  . . . ,  ϖ(ε_mbr(1)) = ϖ_{b(1)}(ε_mbr(1)) = 0.    (27)

The weighted model count of Φ ∧ φ_b evaluates to a value lower than θ, and thus Imp′′ does not replace Imp′. The only alternative action removal is that of a_mbr(1), and it can be seen from the example in Section 4.3 that this attempt at action elimination also results in a probability estimate lower than θ. Hence, the only effect of reduce-implication-graph on the PRPG processed by the extract-PRPlan procedure is the reduction of the implication graph to only the edges relevant to achieving {b} at time T = 2.
The reduced implication sub-graph Imp′ returned by the reduce-implication-graph procedure is depicted in Figure 12a. Next, the extract-subplan procedure iterates over the edges of Imp′ and adds to the initially empty relaxed plan applications of a_mbr at times 0 and 1. The action a_mbr has no preconditions, and the condition r of the effect ε_mbr ∈ E(a_mbr) is known at time 0. Hence, extract-subplan invokes the sub-goal procedure on {r(1)}, and the latter is added into the proposition set G(1). The subsequent call sub-goal(G ∩ P(T)) = sub-goal({r}) leads to no further extensions of G(2), G(1)
13. The dashed edges in Figure 12b can be removed from Imp′′ either now or at a later stage, if Imp′′ is chosen to replace Imp′.

Figure 12: Illustrations for various steps of the relaxed plan extraction from the PRPG constructed in Section 4.3 and, in particular, from the implication graph of the latter, depicted in Figure
8.

as we already have r ∈ G(1). Hence, the outer for loop of extract-PRPlan starts with G(2) = ∅ and G(1) = {r}.

Since G(2) is empty, the first sub-goal considered by extract-PRPlan is r from G(1). For r at time 1, no action effect at time 0 passes the test of the if statement: the condition r of ε_ml is not known at time 0, and the same is true for ε_r. Hence, the sub-goal r(1) is processed by extracting a sub-plan to support achieving it with certainty. First, the construct-support-graph procedure is called with support(r(1)) = {r(−1), . . .} (see Section 4.3). The extracted sub-graph Imp_r(1) of the original implication graph Imp is depicted in Figure 12c, and invoking the procedure extract-subplan on Imp_r(1) results in (i) adding an application of a_ml at time 0, and (ii) no new sub-goals. Hence, the proposition sets G(1), G(2) get emptied, and we end up extracting the relaxed plan ⟨a_mbr(0), a_ml(0), a_mbr(1)⟩.

14. In fact, it is easy to see from the construction of the sub-goal procedure that if p belongs to G(t), then the condition of the noop's effect ε_p cannot be known at time t − 1.
5. Empirical Evaluation
We have implemented Probabilistic-FF in C, starting from the Conformant-FF code. With θ = 1.0, Probabilistic-FF behaves exactly like Conformant-FF (except that Conformant-FF cannot handle non-deterministic effects). Otherwise, Probabilistic-FF behaves as described in the previous sections, and uses Cachet (Sang et al., 2005) for the weighted model counting. To better home in on strengths and weaknesses of our approach, the empirical evaluation of Probabilistic-FF has been done in two steps. In Section 5.1 we evaluate Probabilistic-FF on problems having non-trivial uncertain initial states, but only deterministic actions. In Section 5.2 we examine Probabilistic-FF on problems with probabilistic action effects, and with both sources of uncertainty. We compare Probabilistic-FF's performance to that of the probabilistic planner POND (Bryce et al., 2006). The reasons for choosing POND as the reference point are twofold. First, similarly to Probabilistic-FF, POND constitutes a forward-search planner guided by a non-admissible heuristic function based on (relaxed) planning graph computations. Second, to our knowledge, POND clearly is the most efficient probabilistic planner reported in the literature.

The experiments were run on a PC running at 3GHz with 2GB main memory and 2MB cache, running Linux. Unless stated otherwise, each domain/problem pair was tried at four levels of desired probability of success, θ ∈ {0.25, 0.5, 0.75, 1.0}. Each run of a planner was time-limited by 1800 seconds of user time. Probabilistic-FF was run in the default configuration inherited from FF, performing one trial of enforced hill-climbing and switching to best-first search in case of failure. In domains without probabilistic effects, we found that Probabilistic-FF's simpler relaxed plan extraction developed for that case (Domshlak & Hoffmann, 2006) performs better than the one described here. We hence switch to the simpler version in these domains.
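Cachet is used as a black box here; for intuition about the quantity it computes, the following is a naive, exponential-time reference implementation of weighted model counting on a DIMACS-style CNF (illustration only; it does not reflect Cachet's actual algorithms):

```python
from itertools import product

def weighted_model_count(clauses, weights, n_vars):
    """Sum, over all assignments satisfying the CNF, of the product of
    per-literal weights.

    clauses -- list of clauses, each a list of non-zero ints
               (DIMACS-style literals: positive = variable, negative = negation)
    weights -- dict mapping each literal (v and -v) to its weight
    n_vars  -- number of propositional variables, numbered 1..n_vars
    """
    total = 0.0
    for bits in product([False, True], repeat=n_vars):
        # Keep only assignments satisfying every clause.
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            w = 1.0
            for v in range(1, n_vars + 1):
                w *= weights[v] if bits[v - 1] else weights[-v]
            total += w
    return total

# Toy example: variable 1 is true with weight 0.7; the CNF forces it true.
wmc = weighted_model_count([[1]], {1: 0.7, -1: 0.3}, 1)   # -> 0.7
```

When the literal weights form probabilities, as in our encoding, the weighted model count of the goal formula is exactly the goal probability that get-P compares against θ.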
Unlike Probabilistic-FF, the heuristic computation in POND has an element of randomization; namely, the probability of goal achievement is estimated via sending a set of random particles through the relaxed planning graph (the number of particles is an input parameter). For each problem instance, we averaged the runtime performance of POND over 10 independent runs. In special cases where POND timed out on some runs for a certain problem instance, yet not on all of the 10 runs, the average we report for POND uses the lower-bounding time threshold of 1800s to replace the missing time points. In some cases, POND's best-case performance differs a lot from its average performance; in these cases, the best-case performance is also reported. We note that, following the suggestion of Dan Bryce, POND was run in its default parameter setting, and, in particular, this includes the number of random particles (64) selected for computing POND's heuristic estimate (Bryce et al., 2006).

15. In our experiments we have used a recent version 2.1 of POND that significantly enhances POND 2.0 (Bryce et al., 2006). The authors would like to thank Dan Bryce and Rao Kambhampati for providing us with a binary distribution of POND 2.1.

16. Without probabilistic effects, relaxed plan extraction proceeds very much like in Conformant-FF, with an additional straightforward backchaining selecting support for the unknown goals. The more complicated techniques developed here to deal with relaxed plan extraction under probabilistic effects appear to have a more unstable behavior than the simpler techniques. If there are probabilistic effects, then the simple backchaining is not meaningful because it has no information on how many times an action must be applied in order to sufficiently support the goal.

Instance      size        θ = 0.25      θ = 0.5       θ = 0.75      θ = 1.0
                          t/|S|/l       t/|S|/l       t/|S|/l       t/|S|/l
Safe-uni-70   70/71/140   1.39/19/18    4.02/36/35    8.06/54/53    4.62/71/70
Safe-cub-70   70/70/138   0.28/6/5      0.76/13/12    1.54/22/21    4.32/70/69
Cube-uni-15   6/90/3375   3.25/145/26   3.94/150/34   5.00/169/38   25.71/296/42
Cube-cub-15   6/90/3375   0.56/41/8     1.16/70/13    1.95/109/18   26.35/365/42
Bomb-50-50    2550/200/…  …             …             …             …

Table 5: Empirical results for problems with probabilistic initial states. Times t in seconds, search space size |S| (number of calls to the heuristic function), plan length l.

We now examine the performance of Probabilistic-FF and POND in a collection of domains with probabilistic initial states, but with deterministic action effects. We will consider the domains one by one, discussing for each a set of runtime plots. For some of the problem instances, Table 5 shows more details, providing features of the instance size as well as detailed results for Probabilistic-FF, including the number of explored search states and the plan length.

Our first three domains are probabilistic versions of traditional conformant benchmarks: "Safe", "Cube", and "Bomb". In Safe, one out of n combinations opens the safe. We are given a probability distribution over which combination is the right one. The only type of action in Safe is trying a combination, and the objective is to open the safe with probability ≥ θ. We experimented with two probability distributions over the n combinations, a uniform one ("Safe-uni") and a distribution that declines according to a cubic function ("Safe-cub"). Table 5 shows that Probabilistic-FF can solve this very efficiently even with n = 70.
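The gap between Safe-uni and Safe-cub can be quantified directly: to reach success probability θ one must try enough combinations, in order of decreasing prior probability, to cover θ of the prior mass. A sketch (the benchmark's exact cubic-decay normalization is not specified here, so the declining prior below is a hypothetical stand-in):

```python
def trials_needed(prior, theta):
    """Given a probability distribution over which combination is right,
    return how many combinations must be tried (most likely first) so
    that the tried ones cover probability mass >= theta."""
    mass, k = 0.0, 0
    for p in sorted(prior, reverse=True):
        if mass >= theta:
            break
        mass += p
        k += 1
    return k

n = 70
uniform = [1.0 / n] * n
# A cubically decaying prior (hypothetical normalization):
raw = [(n - i) ** 3 for i in range(n)]
total = sum(raw)
cubic = [x / total for x in raw]
# Fewer trials suffice under the declining prior for the same theta:
assert trials_needed(cubic, 0.5) < trials_needed(uniform, 0.5)
```

This is the sense in which Safe-cub is "easier" than Safe-uni for any θ < 1.0.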
Figure 13 compares Probabilistic-FF and POND, plotting their time performance on an identical linear scale, where the x-axes show the number of combinations.

From the graphs it is easy to see that Probabilistic-FF outperforms POND by at least an order of magnitude on both Safe-uni and Safe-cub. But a more interesting observation here is not necessarily the difference in time performance, but the relative performance of each planner on Safe-uni and Safe-cub. Note that Safe-cub is somewhat "easier" than Safe-uni in the sense that, in Safe-cub, fewer combinations must be tried to guarantee a given probability θ of opening the safe. This is because the
dominant part of the probability mass lies on the combinations at the head of the cubic distribution (the last combination has probability 0 of being the right combination, and thus it need not be tried even when θ = 1.0). The question is now whether the heuristic functions of Probabilistic-FF and POND exploit this difference between Safe-uni and Safe-cub. Table 5 and Figure 13 provide an affirmative answer to this question for the heuristic function of Probabilistic-FF. The picture with POND was less clear, as the times spent by POND on (otherwise identical) instances of Safe-uni and Safe-cub were roughly the same.

Figure 13: The Safe domain, Probabilistic-FF (left) vs. POND (right); (a) uniform prior distribution over the combinations, (b) cubic decay prior distribution over the combinations.

Another interesting observation is that, for both Probabilistic-FF and POND, moving from θ = 1.0 to θ < 1.0, that is, from planning with qualitative uncertainty to truly probabilistic planning,
typically did not result in a performance decline. We even get improved performance (except for θ = 0.75 in Safe-uni). The reason seems to be that the plans become shorter. This trend can be observed in most other domains as well. The trend is particularly remarkable for Probabilistic-FF, since moving from θ = 1.0 to θ < 1.0 means moving from a case where no model counting is needed to a case where it is needed. (In other words, Probabilistic-FF automatically "specializes" itself for qualitative uncertainty by not using the model counting. To our knowledge, the same is not true of POND, which uses the same techniques in both cases.)

17. On Safe-cub with n = 70 and two of the θ values, POND undergoes an exponential blow-up that is not shown in the graphs since these data points would obscure the other data points; anyway, we believe that this blow-up is due only to some unfortunate troubles with numerics.

Figure 14: The Cube domain, Probabilistic-FF (left) vs. POND (right); (a) uniform prior distribution over the initial position, (b) cubic decay prior distribution over the initial position. (The x-axes show N for the N×N×N grid; curves are plotted for θ ∈ {0.25, 0.5, 0.75, 1.0}.)

In Cube, the task is to move into a corner of a 3-dimensional grid, and the actions correspond to moving from the current cube cell to one of the (up to 6) adjacent cube cells. Again, we created problem instances with uniform and cubic distributions (over the initial position in each dimension), and again, Probabilistic-FF scales well, easily solving instances on a 15 × 15 × 15 cube. Within our time limit, POND was capable of solving Cube problems only with considerably smaller cube widths. Figure 14
compares Probabilistic-FF and POND in more detail, plotting their time performance on different linear scales (with the x-axes capturing the width of the grid in each dimension), and showing at least an order of magnitude advantage for Probabilistic-FF. Note that:

• Probabilistic-FF generally becomes faster with decreasing θ (with decreasing hardness of achieving the objective), while θ does not seem to have a substantial effect on the performance of POND;

• Probabilistic-FF exploits the relative easiness of Cube-cub (e.g., see Table 5), while the time performance of POND on Cube-cub and Cube-uni is qualitatively identical.

We also tried a version of Cube where the task is to move into the grid center. Probabilistic-FF is bad at doing so, reaching its performance limit at n = 7. This weakness in the Cube-center domain is inherited from Conformant-FF. As detailed by Hoffmann and Brafman (2006), the reason for the weakness lies in the inaccuracy of the heuristic function in this domain. There are two sources of this inaccuracy. First, to solve Cube-center in reality, one must start by moving into a corner in order to establish one's position; in the relaxation, without delete lists, this is not necessary. Second, the relaxed planning graph computation over-approximates not only what can be achieved in future steps, but also what has already been achieved on the path to the considered belief state. For even moderately long paths of actions, the relaxed planning graph comes to the (wrong) conclusion that the goal has already been achieved, so the relaxed plan becomes empty and there is no heuristic information.

Next we consider the famous Bomb-in-the-Toilet domain (or Bomb, for short). Our version of Bomb contains n bombs and m toilets, where each bomb may be armed or not, independently, with probability 1/n, resulting in huge numbers of initially possible world states. Dunking a bomb into an unclogged toilet disarms the bomb, but clogs the toilet.
A toilet can be unclogged by flushing it. Table 5 shows that Probabilistic-FF scales nicely to n = 50, and becomes faster as m increases. The latter is logical and desirable, as having more toilets means having more "disarming devices", resulting in shorter plans. Figures 15 and 16 compare Probabilistic-FF and POND, plotting the time performance of Probabilistic-FF on a linear scale, and that of POND on a logarithmic scale. The four pairs of graphs correspond to four choices of the number of toilets, m ∈ {1, 5, 10, 50}. The x-axes in all these graphs correspond to the number of potentially armed bombs n, ranging up to 50. Figure 15 shows that this time Probabilistic-FF is at least four orders of magnitude faster than POND; at the extremes, while the hardest combination of n = 50, m = 1, and the highest value of θ took Probabilistic-FF less than 7 seconds, POND timed out on most of the problem instances. In addition:

• In Bomb as well, Probabilistic-FF exhibits the nice pattern of improved performance as we move from non-probabilistic (θ = 1.0) to probabilistic planning (for the lowest values of θ, the initial state is good enough already).

• While the performance of Probabilistic-FF improves with the number of toilets, POND seems to exhibit the inverse dependence, that is, being more sensitive to the number of states in the problem (see Table 5) rather than to the optimal solution depth.

Finally, we remark that, though length-optimality is not explicitly required in probabilistic conformant planning, for all of Safe, Cube, and Bomb, Probabilistic-FF's plans are optimal (the shortest possible).
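The observation that low thresholds need no actions at all follows directly from the arming model: assuming each bomb is armed independently with probability 1/n, as in our version of the domain, dunking k bombs leaves all remaining n − k bombs unarmed with probability (1 − 1/n)^(n−k). A small sketch:

```python
def bombs_to_dunk(n: int, theta: float) -> int:
    """Smallest k such that, with each of n bombs independently armed
    with probability 1/n, dunking (thus disarming) k of them leaves all
    remaining bombs unarmed with probability >= theta."""
    p_unarmed = 1.0 - 1.0 / n
    for k in range(n + 1):
        if p_unarmed ** (n - k) >= theta:
            return k
    return n

# With n = 50, (1 - 1/50)**50 is about 0.364, so a low threshold
# requires no dunking at all:
k0 = bombs_to_dunk(50, 0.25)   # -> 0
```

This is why, for the smallest values of θ, the initial belief state already satisfies the goal in Bomb.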
Figure 15: The Bomb domain, Probabilistic-FF (left) vs. POND (right); (a) 50 toilets, (b) 10 toilets.

Our next three domains are adaptations of benchmarks from deterministic planning: "Logistics", "Grid", and "Rovers". We assume that the reader is familiar with these domains. Each Logistics-x instance contains 10 cities, 10 airplanes, and 10 packages, where each city has x locations. Each package is, with a certain chance, at the airport of its origin city, and otherwise uniformly at any of the other locations in that city. The effects of all loading and unloading actions are conditional on the (right) position of the package. Note that higher values of x increase not only the space of world states, but also the initial uncertainty. Grid is the complex grid world run in the AIPS'98 planning competition (McDermott, 1998), featuring locked positions that must be opened with matching keys. Each Grid-x here is a modification of instance nr. 2 (of 5) run at AIPS'98, with a rectangular grid, 8 locked positions, and 10 keys, of which 3 must be transported to a goal position. Each lock has x possible, uniformly distributed shapes, and each of the 3 goal keys has x possible, uniformly distributed initial positions. The effects of pickup-key, putdown-key, and open-lock actions are conditional.
Figure 16: The Bomb domain, Probabilistic-FF (left) vs. POND (right); (c) 5 toilets, (d) 1 toilet.

Finally, our last set of problems comes from three cascading modifications of instance nr. 7 (of 20) of the Rovers domain used at the AIPS'02 planning competition. This problem instance has 6 waypoints, 3 rovers, 2 objectives, and 6 rock/soil samples. From Rovers to RoversPPP we modify the instance/domain as follows.

• Rovers is the original AIPS'02 problem instance nr. 7, and we use it here mainly for comparison.

• In RoversP, each sample is, with a certain chance, at its original waypoint, and with a smaller chance at each of the other two waypoints. Each objective may be visible from 3 waypoints with uniform distribution (this is a probabilistic adaptation of the domain suggested by Bryce & Kambhampati, 2004).

Figure 17: Probabilistic-FF and POND on problems from (a) Sand-Castle, and (b) Slippery-Gripper.

• RoversPP enhances RoversP by conditional probabilities in the initial state, stating that whether or not an objective is visible from a waypoint depends on whether or not a rock sample (intuition: a large piece of rock) is located at the waypoint. The probability of visibility is much higher if the latter is not the case. Specifically, the visibility of each objective depends on the locations of two rock samples, and if a rock sample is present, then the visibility probability drops substantially.

• RoversPPP extends RoversPP by introducing the need to collect data about water existence. Each of the soil samples has a certain probability (< 1) to be "wet". For communicated sample data, an additional operator tests whether the sample was wet. If so, a fact "know-that-water" contained in the goal is set to true.
The probability of being wet depends on the location of the sample.

We show no runtime plots for Logistics, Grid, and Rovers, since POND runs out of either time or memory on all considered instances of these domains. Table 5 shows that the scaling behavior of Probabilistic-FF in these three domains is similar to that observed in the previous domains. The goals in the RoversPPP problem cannot be achieved with the two highest values of θ. This is proved by Probabilistic-FF's heuristic function, providing the correct answer in split seconds.

Our first two domains with probabilistic actions are the famous "Sand-Castle" (Majercik & Littman, 1998) and "Slippery-Gripper" (Kushmerick et al., 1995) domains. The domains are simple, but they posed the first challenges for probabilistic planners; our performance in these domains serves as an indicator of the progress relative to previous ideas for probabilistic planning.

In Sand-Castle, the states are specified by two boolean variables moat and castle, and state transitions are given by two actions dig-moat and erect-castle. The goal is to erect the castle.
Figure 18: Probabilistic-FF and POND on problems from (a) 1D-WalkGrid, and (b) 2D-WalkGrid. (The x-axes show the grid width; times are on a log scale.)

Building a moat with dig-moat might fail with some probability. Erecting a castle with erect-castle succeeds with a higher probability if the moat has already been built, and with a lower probability otherwise. If it fails, erect-castle also destroys the moat with some probability. Figure 17(a) shows that both Probabilistic-FF and POND solve this problem in less than a second for arbitrarily high values of θ, with the performance of both planners being almost independent of the required probability of success.

Slippery-Gripper is already a bit more complicated a domain. The states in Slippery-Gripper are specified by four boolean variables grip-dry, grip-dirty, block-painted, and block-held, and there are four actions dry, clean, paint, and pickup. In the initial state, the block is neither painted nor held, the gripper is clean, and the gripper is dry with probability 0.7. The goal is to have a clean gripper holding a painted block. Action dry dries the gripper with probability 0.8. Action clean cleans the gripper with probability 0.85. Action paint paints the block with probability 1, but makes the gripper dirty with probability 1 if the block was held, and with probability 0.1 if it was not. Action pickup picks up the block with probability 0.95 if the gripper is dry, and with probability 0.5 if the gripper is wet.

Figure 17(b) depicts (on a log scale) the relative performance of Probabilistic-FF and POND on Slippery-Gripper as a function of growing θ. The performance of Probabilistic-FF is nicely flat at a fraction of a second. This time, the comparison with POND was somewhat problematic because, for any fixed θ, POND on Slippery-Gripper exhibited a huge variance in runtime.
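The belief-state evaluation that a conformant probabilistic planner performs on such domains can be illustrated by exact forward propagation over Sand-Castle's two state variables. The transition probabilities below are hypothetical placeholders rather than the benchmark's exact numbers, and the goal (castle erected) is treated as absorbing:

```python
# Hypothetical Sand-Castle-style parameters: dig-moat succeeds w.p. P_DIG;
# erect-castle succeeds w.p. P_ERECT_MOAT with a moat and P_ERECT without,
# and on failure destroys the moat w.p. P_DESTROY.
P_DIG, P_ERECT_MOAT, P_ERECT, P_DESTROY = 0.5, 0.67, 0.25, 0.5

def apply_action(belief, action):
    """belief: dict mapping states (moat, castle) to probabilities."""
    new = {}
    def add(state, p):
        new[state] = new.get(state, 0.0) + p
    for (moat, castle), p in belief.items():
        if castle:               # goal states are absorbing in this sketch
            add((moat, castle), p)
        elif action == "dig-moat":
            add((True, False), p * P_DIG)
            add((moat, False), p * (1 - P_DIG))
        else:                    # erect-castle
            ps = P_ERECT_MOAT if moat else P_ERECT
            add((moat, True), p * ps)
            add((False, False), p * (1 - ps) * P_DESTROY)
            add((moat, False), p * (1 - ps) * (1 - P_DESTROY))
    return new

def plan_success_prob(plan):
    belief = {(False, False): 1.0}
    for a in plan:
        belief = apply_action(belief, a)
    return sum(p for (m, c), p in belief.items() if c)

prob = plan_success_prob(["dig-moat", "erect-castle", "erect-castle"])  # ≈ 0.63
```

A conformant probabilistic planner must find a fixed action sequence whose success probability, computed in exactly this sense, exceeds θ.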
In Figure 17(b) we plot the best runtimes for POND, as well as its average runtimes. The best runtimes of POND for different values of θ vary around a couple of seconds, but the average runtimes are significantly worse. (For some high values of θ, POND timed out on some sample runs, and thus the plot provides a lower bound on the average runtimes.)

In the next two domains, "1D-WalkGrid" and "2D-WalkGrid", the robot has to pre-plan a sequence of conditional movements taking it from a corner of the grid to the farthest (from the initial
position) corner (Hyafil & Bacchus, 2004). In 1D-WalkGrid the grid is one-dimensional, while in 2D-WalkGrid it is two-dimensional. Figure 18(a) depicts (on a log scale) a snapshot of the relative performance of Probabilistic-FF and POND on one-dimensional grids of width n. The robot is initially at (1, 1), should get to (1, n), and it can try moving in each of the two possible directions. Each of the two movement actions moves the robot in the right direction with a fixed probability, and keeps it in place otherwise. It is easy to see from Figure 18(a) that the difference between the two planners in this domain is substantial: while the runtime of Probabilistic-FF grows only linearly with n, the same dependence for POND is seemingly exponential.

The 2D-WalkGrid domain is already much more challenging for probabilistic planning. In all 2D-WalkGrid problems with n × n grids the robot is initially at (1, 1), should get to (n, n), and it can try moving in each of the four possible directions. Each of the four movement actions advances the robot in the right direction with a fixed probability, moves it in the opposite direction with probability 0, and moves it in either of the other two directions with equal probability. Figure 18(b) depicts (on a log scale) a snapshot of the relative performance of Probabilistic-FF and POND on 2D-WalkGrid with the very low required probability of success θ = 0.25, as a function of the grid width n. The plot shows that Probabilistic-FF still scales well with increasing n (though not linearly anymore), while POND times out for all but the smallest grid widths. For higher values of θ, however, Probabilistic-FF does reach the time-out limit on rather small grids, notably n = 6 and n = 5 for the two higher values of θ, respectively.
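The effect of θ on required plan length in 1D-WalkGrid follows from a simple binomial computation: a plan repeating the move action L times reaches cell n exactly when at least n − 1 of the attempts succeed. A sketch with a hypothetical per-move success probability p (the benchmark's exact value is not reproduced above):

```python
from math import comb

def walk_success_prob(length: int, n: int, p: float) -> float:
    """Probability that a plan of `length` identical move-right actions
    reaches cell n from cell 1 in 1D-WalkGrid, when each action moves
    right with probability p and stays in place otherwise: at least
    n - 1 of the attempts must succeed."""
    need = n - 1
    return sum(comb(length, k) * p**k * (1 - p)**(length - k)
               for k in range(need, length + 1))

def min_walk_plan_length(n: int, p: float, theta: float) -> int:
    """Shortest plan whose success probability reaches theta."""
    L = n - 1
    while walk_success_prob(L, n, p) < theta:
        L += 1
    return L

# With hypothetical p = 0.8, reaching cell 5 with theta = 0.75:
L = min_walk_plan_length(5, 0.8, 0.75)   # -> 6
```

The heuristic difficulty described next is precisely about estimating this required amount of action repetition from within a relaxed planning graph.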
The reason for this is that Probabilistic-FF's heuristic function is not good enough at estimating how many times, at an early point in the plan, a probabilistic action must be applied in order to sufficiently support a high goal threshold at the end of the plan. We explain this phenomenon in more detail at the end of this section, where we find that it also appears in a variant of the well-known Logistics domain.

Our last set of problems comes from the standard Logistics domain. Each problem instance x-y-z contains x locations per city, y cities, and z packages. We will see that Probabilistic-FF scales much worse in Logistics in the presence of probabilistic effects than if there is "only" initial state uncertainty (we will explain the reason for this at the end of this section). Hence we use much smaller instances than the ones used above in Section 5.1. Namely, to allow a direct comparison to previous results in this domain, we closely follow the specification of Hyafil and Bacchus (2004). We use instances with configurations x-y-z = 2-2-2, 4-2-2, and 2-2-4, and distinguish between two levels of uncertainty.

• L-x-y-z corresponds to problems with uncertainty only in the outcome of the load and unload actions. Specifically, the probabilities of success for load are 0. for trucks and 0. for airplanes, and for unload, 0. and 0. , respectively.

• LL-x-y-z extends L-x-y-z with independent uniform priors for each initial location of a package within its start city.

Figure 19 depicts (on a log scale) runtimes of Probabilistic-FF and POND on L-2-2-2, L-4-2-2, and L-2-2-4, as a function of growing θ. On these problems, both planners appear to scale well, with the runtime of Probabilistic-FF and the optimal runtime of POND being roughly the same, and the average runtime of POND somewhat degrading from 2-2-2 to 4-2-2 to 2-2-4. This shows that both planners are much more efficient in this domain than the previously known SAT- and CSP-based techniques.
However, moving to LL-x-y-z changes the picture for both planners. The results are as follows:
Figure 19: Probabilistic-FF and POND on problems from Logistics: (a) L-2-2-2, (b) L-4-2-2, and (c) L-2-2-4. Each panel plots runtime in seconds (log scale, 0.01 to 100) against θ (0.95 down to 0.01), for PFF, POND (min), and POND (avg).

1. On LL-2-2-2, the runtimes of Probabilistic-FF were identical to those on L-2-2-2, and the optimal runtimes of POND only slightly degraded. However, for all examined values of θ, some runs of POND resulted in timeouts.

2. On LL-4-2-2, the runtimes of Probabilistic-FF were identical to those on L-4-2-2 for four of the examined values of θ, yet Probabilistic-FF timed out on θ = 0. . The optimal runtimes of POND degraded only slightly from those for L-4-2-2, and again, for all values of θ, some runs of POND resulted in timeouts.

3. On LL-2-2-4, Probabilistic-FF experienced hard times, finishing only for θ = 0. and timing out for all other examined values of θ. The optimal runtimes of POND degraded from those for L-2-2-4, and here as well, for all values of θ, some runs of POND resulted in timeouts.

We also tried a variant of LL-x-y-z with non-uniform priors over the initial locations of the packages, but this resulted in a qualitatively similar picture of absolute and relative performance.

The LL-x-y-z domain remains challenging, and deserves close attention in future developments in probabilistic planning. In this context, it is interesting to have a close look at what the reasons for the failure of Probabilistic-FF are. It turns out that Probabilistic-FF is not good enough at estimating how many times, at an early point in the plan, a probabilistic action must be applied in order to sufficiently support a high goal threshold at the end of the plan. To make this concrete, consider a Logistics example with uncertain effects of all load and unload actions.
Consider a package P that must go from city A to city B. Say that P is initially not at A's airport. If the goal threshold is high, this means that, to be able to succeed, the package has to be brought to A's airport with high probability before loading it onto an airplane. This is exactly the point where Probabilistic-FF's heuristic function fails. The relaxed plan contains too few actions unloading P at A's airport. The effect is that the search proceeds too quickly to loading P onto a plane and bringing it to B. Once the search gets to the point where P should be unloaded at its goal location, the goal threshold cannot be achieved no matter how many times one unloads P. At this point,
Probabilistic-FF's enforced hill-climbing enters a loop and eventually fails because the relaxed plan (which over-estimates the past achievements) becomes empty.

The challenge here is to devise methods that are better at recognizing how many times P has to be unloaded at A's airport in order to sufficiently support the goal threshold. The error made by Probabilistic-FF lies in the fact that our propagation of weights on the implication graph over-estimates the goal probability. Note that this is much more critical for actions that must be applied early in the plan than for actions that are applied later. If an action a appears early in a plan, then the relaxed plan, at the point where a is executed, will be long. Recall that the weight propagation proceeds backwards, from the goal towards the current state. At each single backwards step, the propagation makes an approximation that may lose precision. Over several backwards steps, these imprecisions accumulate. Hence the quality of the approximation decreases quickly with the number of backwards steps: the longer the distance between the goal and the current state, the more information is lost. We have observed this phenomenon in detailed experiments with different weight propagation schemes, that is, with different underlying assumptions. Of the propagation schemes we tried, the independence assumption, as presented in this paper, was by far the most accurate one. All other schemes failed to deliver good results even for much shorter distances between the goal and the current state.

It is interesting to consider how this issue affects POND, which uses a very different method for estimating the probability of goal achievement: instead of performing a backwards propagation and aggregation of weight values, POND sends a set of random particles through the relaxed planning graph in a forward fashion, and stops the graph building if enough particles end up in the goal.
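For a single fact supported only by repetitions of one probabilistic effect, the required number of repetitions has a simple closed form: k attempts, each succeeding with probability q, establish the fact with probability 1 − (1 − q)^k. The sketch below (our own illustration; the success probability q and the threshold used in the example are hypothetical values, not the benchmark's) computes the smallest sufficient k:

```python
import math

def attempts_needed(q, target):
    """Smallest k with 1 - (1 - q)**k >= target: the number of times a
    probabilistic 'unload' with success probability q must be repeated
    so that the package is at the airport with probability >= target."""
    assert 0 < q < 1 and 0 <= target < 1
    return math.ceil(math.log(1 - target) / math.log(1 - q))
```

For instance, with a hypothetical q = 0.75, reaching a 0.95 threshold already takes three unload attempts; a relaxed plan that counts each such action only once will therefore under-estimate the effort needed early in the plan.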
From our empirical results, it seems that this method suffers from similar difficulties as Probabilistic-FF, but not to such a large extent. POND's optimal runtimes for LL-x-y-z are much higher than those for L-x-y-z. This indicates that it is always challenging for POND to "recognize" the need for applying some action a many times early on in the plan. More interestingly, POND never times out in L-x-y-z, but it does often time out in LL-x-y-z. This indicates that, to some extent, it is a matter of chance whether or not POND's random particles recognize the need for applying an action a many times early on in the plan. An intuitive explanation is that the "good cases" are those where sufficiently many of the particles failed to reach the goal due to taking the "wrong effect" of a. Based on this intuition, one would expect that it helps to increase the number of random particles in POND's heuristic function. We did so, running POND on LL-x-y-z with an increased number of particles instead of the default value. To our surprise, the qualitative behavior of POND did not change, timing out in a similar number of cases. It is unclear to us what the reason for this phenomenon is. Certainly, it can be observed that the situation encoded in LL-x-y-z is not solved to satisfaction by either Probabilistic-FF's weight propagation or POND's random particle method, in their current configurations.

At the time of writing, it is unclear to the authors how better methods could be devised. It seems unlikely that a weight propagation scheme – at least one that does not resort to expensive reasoning – exists which manages long distances better than the independence assumption. An alternative way out might be to simply define a weaker notion of plans that allows one to repeat certain kinds of actions –
18. This does not happen in the above L-2-2-2, L-4-2-2, and L-2-2-4 instances simply because they are too small and a high goal probability can be achieved without thinking too much about the above problem; if one increases the size of these instances, the problem appears. The problem appears earlier in the presence of initial state uncertainty – even in small instances such as LL-2-2-2, LL-4-2-2, and LL-2-2-4 – because with uncertainty about the start position of the packages one needs to try unloading them at the start airports more often.
throwing a die or unloading a package – arbitrarily many times. However, since our assumption is that we do not have any observability during plan execution, when executing such a plan there would still arise the question of how often an action should be tried. Since Logistics is a fairly well-solved domain in simpler formalisms – by virtue of Probabilistic-FF, even in the probabilistic setting as long as the effects are deterministic – we consider addressing this problem a quite pressing open question.
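The particle-based estimate attributed to POND above can be caricatured by plain Monte Carlo sampling. The following is our own simplified sketch, not POND's implementation: each "particle" samples the outcomes of k repetitions of one probabilistic action, and the fraction of particles that succeed estimates the goal probability.

```python
import random

def particle_estimate(n_particles, k, p_succ, seed=0):
    """Fraction of particles for which at least one of k attempts of a
    probabilistic action (success probability p_succ) succeeds."""
    rng = random.Random(seed)
    hits = sum(
        1
        for _ in range(n_particles)
        if any(rng.random() < p_succ for _ in range(k))
    )
    return hits / n_particles
```

With few particles the estimate is noisy, which matches the observation that POND's success on the LL-x-y-z instances is partly a matter of chance; the exact probability being estimated here is 1 − (1 − p_succ)^k.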
6. Conclusion
We developed a probabilistic extension of Conformant-FF's search space representation, using a synergetic combination of Conformant-FF's SAT-based techniques with recent techniques for weighted model counting. We further provided an extension of conformant relaxed planning with approximate probabilistic reasoning. The resulting planner scales well on a range of benchmark domains. In particular, it outperforms its only close relative, POND, by at least an order of magnitude in almost all of the cases we tried.

While this point may be somewhat obvious, we would like to emphasize that our achievements do not solve this particular problem once and for all. Probabilistic-FF inherits strengths and weaknesses from FF and Conformant-FF, such as domains where FF's or Conformant-FF's heuristic functions yield bad estimates (e.g., the mentioned Cube-center variant). What is more, the probabilistic setting introduces several new potential impediments for FF's performance. For one thing, weighted model counting is inherently harder than SAT testing. Though this did not happen in our set of benchmarks, there are bound to be cases where the cost of exact model counting becomes prohibitive even in small examples. A promising way to address this issue lies in recent methods for approximate model counting (Gomes, Sabharwal, & Selman, 2006; Gomes, Hoffmann, Sabharwal, & Selman, 2007). Such methods are much more efficient than exact model counters. They provide high-confidence lower bounds on the number of models, and these lower bounds can be used in Probabilistic-FF in place of the exact counts. It has been shown that good lower bounds with very high confidence can be achieved quickly. The challenge here is to extend the methods – which are currently designed for non-weighted CNFs – to handle weighted model counting.

More importantly perhaps, in the presence of probabilistic effects there is a fundamental weakness in Probabilistic-FF's – and POND's – heuristic information.
This becomes a pitfall for performance even in a straightforward adaptation of the Logistics domain, which is otherwise very easy for this kind of planner. As outlined, the key problem is that, to obtain a high enough confidence of goal achievement, one may have to apply particular actions several times early on in the plan. Neither Probabilistic-FF's nor POND's heuristics are good enough at identifying how many times. In our view, finding techniques that address this issue is currently the most important open topic in this area.

Apart from addressing the latter challenge, we intend to work towards applicability in real-world settings. Particularly, we will look at the space application settings that our Rovers domain hints at, at medication-type treatment planning domains, and at the power supply restoration domain (Bertoli, Cimatti, Slaney, & Thiébaux, 2002).
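To make the weighted-model-counting component concrete, here is a brute-force weighted model counter over a small CNF. This is our own illustration of the general technique (the engines usable by Probabilistic-FF rely on far more sophisticated DPLL-style counting); the clause sets and weights in the example are made up:

```python
from itertools import product

def wmc(cnf, weights):
    """Brute-force weighted model count.
    cnf: list of clauses; a clause is a list of ints, where i means
         variable i is true and -i means it is false.
    weights: dict mapping each variable to the weight of its positive
         literal; the negative literal gets weight 1 - w."""
    variables = sorted(weights)
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, bits))
        # keep only assignments satisfying every clause
        if all(any((lit > 0) == assign[abs(lit)] for lit in clause)
               for clause in cnf):
            weight = 1.0
            for var, val in assign.items():
                weight *= weights[var] if val else 1 - weights[var]
            total += weight
    return total
```

In Probabilistic-FF, weighted counts of this kind (over CNFs encoding the belief state and goal) yield goal probabilities; e.g., wmc([[1, 2]], {1: 0.3, 2: 0.4}) gives 1 − 0.7·0.6 = 0.58.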
Acknowledgments
The authors would like to thank Dan Bryce and Rao Kambhampati for providing a binary distribution of POND 2.1. Carmel Domshlak was partially supported by the Israel Science Foundation grant 2008100, as well as by the C. Wellner Research Fund. Some major parts of this research were accomplished while Jörg Hoffmann was employed at the Intelligent Information Systems Institute, Cornell University.
Appendix A. Proofs
Proposition 2
Let (A, N_{b_I}, G, θ) be a probabilistic planning problem described over k state variables, and let a be an m-step sequence of actions from A. Then |N_a| = O(|N_{b_I}| + mα(k+1)), where α is the largest description size of an action in A.

Proof:
The proof is rather straightforward, and it exploits the local structure of N_a's CPTs. The first nodes/CPTs layer X(0) of N_a constitutes an exact copy of N_{b_I}. Then, for each 1 ≤ t ≤ m, the t-th layer of N_a contains the k + 1 nodes {Y(t)} ∪ X(t).

First, consider the "action node" Y(t). While specifying the CPT T_{Y(t)} in a straightforward manner as prescribed by Eq. 4 might result in an exponential blow-up, the same Eq. 4 suggests that the original description of a_t is by itself a compact specification of T_{Y(t)}. Therefore, T_{Y(t)} can be described in space O(α), and this description can be efficiently used for answering queries T_{Y(t)}(Y(t) = ε | π) as in Eq. 4. Next, consider the CPT T_{X(t)} of a state-variable node X(t) ∈ X(t). This time, it is rather evident from Eq. 5 that T_{X(t)} can be described in space O(α) so that queries T_{X(t)}(X(t) = x | X(t−1) = x′) can be efficiently answered. Thus, summing up over all layers 1 ≤ t ≤ m, the description size is |N_a| = O(|N_{b_I}| + mα(k+1)).

Lemma 4
Given a node v(t′) ∈ Imp_{→p(t)}, we have ϖ_{p(t)}(v(t′)) = ϖ(v(t′)) if and only if, given v at time t′, the sequence of effects E(Imp_{v(t′)→p(t)}) achieves p at t with probability 1.

Proof:
The proof of Lemma 4 is by a backward induction on the time layers of Imp_{v(t′)→p(t)}. For time t, the only node of Imp_{→p(t)} time-stamped with t is p(t) itself. For this node we do have ϖ_{p(t)}(p(t)) = ϖ(p(t)) = 1, and, given p at time t, an empty plan corresponding to the (empty) E(Imp_{p(t)→p(t)}) trivially "re-establishes" p at t with certainty. Assuming now that the claim holds for all nodes of Imp_{→p(t)} time-stamped with t′ + 1, . . . , t, we show that it holds for the nodes time-stamped with t′.

It is easy to see that, for any node v(t′) ∈ Imp_{→p(t)}, we get ϖ_{p(t)}(v(t′)) = ϖ(v(t′)) only if α goes down to zero. First, consider the chance nodes ε(t′) ∈ Imp_{v→p(t)}. For such a node, lb is set to zero if and only if we have ϖ_{p(t)}(r(t′ + 1)) = 1 for some r ∈ add(ε). However, by our inductive assumption, in this and only this case the effects E(Imp_{ε(t′)→p(t)}) achieve p at t with probability 1, given the occurrence of ε at time t′.

Now, consider the fact nodes q(t′) ∈ Imp_{v→p(t)}. For such a node, α can be nullified only by some effect e ∈ E(a), a ∈ A(t′), con(e) = q. The latter happens if and only if, for all possible outcomes of e, (i) the node ε(t′) belongs to Imp_{→p(t)}, and (ii) the estimate ϖ_{p(t)}(ε(t′)) = ϖ(ε(t′)). In other words, by our inductive assumption, given any outcome ε ∈ Λ(e) at time t′, the effects E(Imp_{ε(t′)→p(t)}) achieve p at t with probability 1. Thus, given q at time t′, the effects E(Imp_{q(t′)→p(t)}) achieve p at t with probability 1 independently of the actual outcome of e. Alternatively, if for q(t′) we have lb > 0, then for each effect e conditioned on q(t′), there exists an
outcome ε of e such that, according to what we just proved for the chance nodes time-stamped with t′, the effects E(Imp_{ε(t′)→p(t)}) do not achieve p at t with probability 1. Hence, the whole set of effects E(Imp_{q(t′)→p(t)}) does not achieve p at t with probability 1.

Lemma 5
Let (A, N_{b_I}, G, θ) be a probabilistic planning task, a be a sequence of actions applicable in b_I, and |+1 be a relaxation function for A. For each time step t ≥ −m, and each proposition p ∈ P, if P(t) is constructed by build-PRPG(a, A, φ(N_{b_I}), G, θ, |+1), then p at time t can be achieved by a relaxed plan starting with a|+1

(1) with probability > 0 (that is, p is not negatively known at time t) if and only if p ∈ uP(t) ∪ P(t), and

(2) with probability 1 (that is, p is known at time t) if and only if p ∈ P(t).

Proof:
The proof of the "if" direction is by a straightforward induction on t. For t = −m the claim is immediate by the direct initialization of uP(−m) and P(−m). Assume that, for −m ≤ t′ < t, if p ∈ uP(t′) ∪ P(t′), then p is not negatively known at time t′, and if p ∈ P(t′), then p is known at time t′.

First, consider some p(t) ∈ uP(t) ∪ P(t), and suppose that p is negatively known at time t. By the inductive assumption, and the property of the PRPG construction that uP(t−1) ∪ P(t−1) ⊆ uP(t) ∪ P(t), we have p ∉ uP(t−1) ∪ P(t−1). Therefore, p has to be added into uP(t) (and then, possibly, moved from there to P(t)) in the first for loop of the build-timestep procedure. However, if so, then there exist an action a ∈ A(t−1), e ∈ E(a), and ε ∈ Λ(e) such that (i) con(e) ∈ uP(t−1) ∪ P(t−1), and (ii) p ∈ add(ε). Again, by the inductive assumption we have that pre(a) is known at time t−1, and con(e) is not negatively known at time t−1. Hence, the non-zero probability of ε occurring at time t implies that p can be achieved at time t with probability greater than 0, contradicting that p is negatively known at time t.

Now, let us consider some p(t) ∈ P(t). Notice that, for t > −m, we have p(t) ∈ P(t) if and only if

Φ → ∨_{l ∈ support(p(t))} l.     (28)

Thus, for each world state w consistent with b_I, we have either q ∈ w for some fact proposition q(−m) ∈ support(p(t)), or, for some effect e of an action a(t′) ∈ A(t′), t′ < t, we have con(e) ∈ P(t′) and {ε(t′) | ε ∈ Λ(e)} ⊆ support(p(t)). In the first case, Lemma 4 immediately implies that the concatenation of a|+1 with an arbitrary linearization of the (relaxed) actions A(0), . . . , A(t−1) achieves p at t with probability 1, and thus p is known at time t.
In the second case, our inductive assumption implies that con(e) is known at time t′, and together with Lemma 4 this again implies that the concatenation of a|+1 with an arbitrary linearization of the (relaxed) actions A(0), . . . , A(t−1) achieves p at t with probability 1.

The proof of the "only if" direction is by induction on t as well. For t = −m the claim is again immediate by the direct initialization of P(−m). Assume that, for −m ≤ t′ < t, if p is not negatively known at time t′, then p ∈ uP(t′) ∪ P(t′), and if p is known at time t′, then p ∈ P(t′).

First, suppose that p is not negatively known at time t, and yet we have p ∉ uP(t) ∪ P(t). From our inductive assumption, plus the fact that A(t−1) contains all the NOOP actions for propositions in uP(t−1) ∪ P(t−1), we know that p is negatively known at time t−1. If so, then p can become not negatively known at time t only due to some ε ∈ Λ(e), e ∈ E(a), such that pre(a) is known at time t−1, and con(e) is not negatively known at time t−1. By our inductive assumption, the latter conditions imply con(e) ∈ uP(t−1) ∪ P(t−1), and pre(a) ∈ P(t−1). But if so, then p has to be added to uP(t) ∪ P(t) by the first for loop of the build-timestep procedure, contradicting our assumption that p ∉ uP(t) ∪ P(t).

Now, let us consider some p known at time t. By our inductive assumption, P(t−1) contains all the facts known at time t−1, and thus A(t−1) is the maximal subset of actions A|+1 applicable at time t−1.
Let us begin with an exhaustive classification of the effects e of the actions A(t−1) with respect to our p at time t:

(I) ∀ε ∈ Λ(e): p ∈ add(ε), and con(e) ∈ P(t−1);
(II) ∀ε ∈ Λ(e): p ∈ add(ε), and con(e) ∈ uP(t−1);
(III) ∃ε ∈ Λ(e): p ∉ add(ε), or con(e) ∉ P(t−1) ∪ uP(t−1).

If the set (I) is not empty, then, by the construction of build-w-impleafs(p(t), Imp), we have {ε(t−1) | ε ∈ Λ(e)} ⊆ support(p(t)) for each e ∈ (I). Likewise, by the construction of build-timestep (notably, by the update of Φ), for each e ∈ (I), we have

Φ → ∨_{ε ∈ Λ(e)} ε(t−1).

Putting these two facts together, we have that Eq. 28 holds for p at time t, and thus p ∈ P(t).

Now, suppose that the set (I) is empty. It is not hard to verify that no subset of only effects (III) makes p known at time t. Thus, the event "at least one of the effects (II) occurs" must hold with probability 1. First, by the construction of build-w-impleafs(p(t), Imp), we have

support(p(t)) ⊇ ∪_{e ∈ (II)} support(con(e)(t−1)).

Then, from Lemma 4 we have that the event "at least one of the effects (II) occurs" holds with probability 1 if and only if

Φ → ∨_{e ∈ (II)} ∨_{l ∈ support(con(e)(t−1))} l.

Putting these two facts together, we have that Eq. 28 holds for p at time t, and thus p ∈ P(t).

Theorem 6
Let (A, N_{b_I}, G, θ) be a probabilistic planning task, a be a sequence of actions applicable in b_I, and |+1 be a relaxation function for A. If build-PRPG(a, A, φ(N_{b_I}), G, θ, |+1) returns FALSE, then there is no relaxed plan for (A, b_I, G, θ) that starts with a|+1.

Proof:
Let t > 0 be the last layer of the PRPG upon the termination of build-PRPG. For every −m ≤ t′ ≤ t, by the construction of the PRPG and Lemma 5, the sets P(t′) and uP(t′) contain all (and only) propositions that are known (respectively, unknown) after executing all the actions in the action layers up to and including A(t′−1).
First, let us show that if build-PRPG returns FALSE, then the corresponding termination criterion would hold in all future iterations. If P(t+1) = P(t), then we have A(t+1) = A(t). Subsequently, since P(t+1) ∪ uP(t+1) = P(t) ∪ uP(t) and A(t+1) = A(t), we have P(t+2) ∪ uP(t+2) = P(t+1) ∪ uP(t+1). Given that, we now show that P(t+2) = P(t+1) and uP(t+2) = uP(t+1).

Assume to the contrary that there exists p(t+2) ∈ P(t+2) such that p(t+1) ∉ P(t+1), that is, p(t+1) ∈ uP(t+1). By the construction of the sets P(t+1) and P(t+2) in the build-timestep procedure, we have

Φ → ∨_{l ∈ support(p(t+2))} l, but not Φ → ∨_{l ∈ support(p(t+1))} l.     (29)

Consider an exhaustive classification of the effects e of the actions A(t+1) with respect to our p at time t+2:

(I) ∀ε ∈ Λ(e): p ∈ add(ε), and con(e) ∈ P(t+1);
(II) ∀ε ∈ Λ(e): p ∈ add(ε), and con(e) ∈ uP(t+1);
(III) ∃ε ∈ Λ(e): p ∉ add(ε), or con(e) ∉ P(t+1) ∪ uP(t+1).

Suppose that the set (I) is not empty, and let e ∈ (I). From P(t) = P(t+1) we have that con(e) ∈ P(t), and thus {ε(t) | ε ∈ Λ(e)} ⊆ support(p(t+1)). By the update of Φ in build-timestep we then have Φ → ∨_{ε ∈ Λ(e)} ε(t), and thus Φ → ∨_{l ∈ support(p(t+1))} l, contradicting Eq. 29.

Alternatively, assume that the set (I) is empty. Using arguments similar to those in the proof of Lemma 5, p(t+2) ∈ P(t+2) and p(t+1) ∉ P(t+1) in this case imply that

Φ → ∨_{e ∈ (II)} ∨_{l ∈ support(con(e)(t+1))} l, but not Φ → ∨_{e ∈ (II)} ∨_{l ∈ support(con(e)(t))} l.     (30)

However, A(t+1) = A(t), uP(t+1) = uP(t), and P(t+1) = P(t) together imply that all the action effects that can possibly take place at time t+1 can also take place at time t. Therefore, since for each e ∈ (II) we have con(e) ∈ uP(t+1) by the definition of (II), Eq.
30 implies that

∪_{e ∈ (II)} support(con(e)(t+1)) ∩ uP(−m) ≠ ∪_{e ∈ (II)} support(con(e)(t)) ∩ uP(−m),     (31)

contradicting our termination condition. Hence, we have arrived at a contradiction with our assumption that p(t+1) ∉ P(t+1).

Having shown that P(t+2) = P(t+1) and uP(t+2) = uP(t+1), we now show that the termination criterion implies that, for each p(t+2) ∈ uP(t+2), we have

uP(−m) ∩ support(p(t+2)) = uP(−m) ∩ support(p(t+1)).
Let E_{p(t+2)} be the set of all effects e of actions A(t+1) such that con(e) ∈ uP(t+1) and, for each outcome ε ∈ Λ(e), we have p ∈ add(ε). Given that, we have

uP(−m) ∩ support(p(t+2)) = uP(−m) ∩ ∪_{e ∈ E_{p(t+2)}} support(con(e)(t+1))
                          = uP(−m) ∩ ∪_{e ∈ E_{p(t+2)}} support(con(e)(t))
                          = uP(−m) ∩ support(p(t+1)),     (32)

where the first and third equalities are by the definition of support sets via Lemma 4, and the second equality is by our termination condition.

The last thing that remains to be shown is that our termination criterion implies get-P(t+2, G) = get-P(t+1, G). Considering the simple cases first, if G ⊄ P(t+1) ∪ uP(t+1), then from P(t+2) ∪ uP(t+2) = P(t+1) ∪ uP(t+1) we have get-P(t+2, G) = get-P(t+1, G) = 0. Otherwise, if G ⊆ P(t+1), then from P(t+2) = P(t+1) we have get-P(t+2, G) = get-P(t+1, G) = 1. This leaves us with the case of G ⊆ P(t+1) ∪ uP(t+1) and G ∩ uP(t+1) ≠ ∅. From P(t+2) = P(t+1), uP(t+2) = uP(t+1), and the termination condition, we have G ∩ uP(t) = G ∩ uP(t+1) = G ∩ uP(t+2). From get-P(t+1, G) = get-P(t, G) we know that action effects that become feasible only in A(t) do not increase our estimate of the probability of achieving any g ∈ G ∩ uP(t+1) from time t to time t+1. However, from P(t+1) = P(t), uP(t+1) = uP(t), and A(t+1) = A(t), we have that no action effect becomes feasible at time t+1 if it is not already feasible at time t, and thus get-P(t+1, G) = get-P(t, G) implies get-P(t+2, G) = get-P(t+1, G).

To this point we have shown that if build-PRPG returns FALSE, then the corresponding termination criterion would hold in all future iterations.
Now, assume, contrary to the claim of the theorem, that build-PRPG returns FALSE at some iteration t, yet there exists a relaxed plan for (A, b_I, G, θ) that starts with a|+1. First, if θ = 1, then Lemma 5 implies that there exists a time T such that G ⊆ P(T). If so, then the persistence of our "negative" termination condition implies G ⊆ P(t). However, in this case we would have get-P(t, G) = 1 (see the second if of the get-P procedure), and thus build-PRPG would return TRUE before ever getting to check the "negative" termination condition in iteration t. Alternatively, if θ = 0, then build-PRPG would have terminated with TRUE before the "negative" termination condition is checked even once.

This leaves us with the case of 0 < θ < 1 and get-P(t, G) < θ. (get-P(t, G) ≥ θ would again contradict reaching the negative termination condition at iteration t.) We can also assume that G ⊆ P(t) ∪ uP(t), because P(t) ∪ uP(t) contains all the facts that are not negatively known at time t, and thus persistence of the negative termination condition together with G ⊄ P(t) ∪ uP(t) would imply that there is no relaxed plan for any θ > 0. Let us consider the sub-goals G ∩ uP(t) ≠ ∅.

(1) If for all subgoals g ∈ G ∩ uP(t), the implications in Imp_{→g(t)} are only due to deterministic outcomes of the effects E(Imp_{→g(t)}), then the uncertainty about achieving G ∩ uP(t) at time t is only due to the uncertainty about the initial state. Since the initial belief state is reasoned about with no relaxation, in this case get-P(t, G) = WMC(Φ ∧ ∧_{g ∈ G \ P(t)} φ_g) provides us with an upper bound on the probability of achieving our goal G by a|+1 concatenated with
an arbitrary linearization of an arbitrary subset of A(0), . . . , A(t−1). The termination sub-condition get-P(t+1, G) = get-P(t, G) and the persistence of the action sets A(T), T ≥ t, then imply that get-P(t, G) provides an upper bound on the probability of achieving G by a|+1 concatenated with an arbitrary linearization of an arbitrary subset of A(0), . . . , A(T), for all T ≥ t. Together with get-P(t, G) < θ, the latter conclusion contradicts our assumption that a desired relaxed plan exists.

(2) If there exists a subgoal g ∈ G ∩ uP(t) such that some implications in Imp_{→g(t)} are due to truly probabilistic outcomes of the effects E(Imp_{→g(t)}), then repeating the (relaxed) actions A(t) in A(t+1) will necessarily result in WMC(Φ ∧ ∧_{g ∈ G \ P(t+1)} φ_g) > WMC(Φ ∧ ∧_{g ∈ G \ P(t)} φ_g), contradicting our termination sub-condition get-P(t+1, G) = get-P(t, G).

Hence, we have arrived at a contradiction with our assumption that build-PRPG returns FALSE at time t while there exists a relaxed plan for (A, b_I, G, θ) that starts with a|+1.

References
Bertoli, P., Cimatti, A., Pistore, M., Roveri, M., & Traverso, P. (2001). MBP: a model based planner.In
Proc. IJCAI’01 Workshop on Planning under Uncertainty and Incomplete Information ,Seattle, WA.Bertoli, P., Cimatti, A., Slaney, J., & Thi´ebaux, S. (2002). Solving power supply restoration prob-lems with planning via symbolic model-checking. In
Proceedings of the 15th European Con-ference on Artificial Intelligence (ECAI) , pp. 576–580, Lion, France.Blum, A. L., & Furst, M. L. (1997). Fast planning through planning graph analysis.
ArtificialIntelligence , (1-2), 279–298.Bonet, B., & Geffner, H. (2001). Planning as heuristic search. Artificial Intelligence , (1–2),5–33.Bonet, B., & Geffner, H. (2000). Planning with incomplete information as heuristic search in beliefspace. In Proceedings of the 5th International Conference on Artificial Intelligence Planningand Scheduling Systems (AIPS) , pp. 52–61, Breckenridge, CO.Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independencein Bayesian networks. In
Proceedings of the Twelfth Conference on Uncertainty in ArtificialIntelligence (UAI) , pp. 115–123, Portland, OR.Brafman, R. I., & Domshlak, C. (2006). Factored planning: How, when, and when not. In
Proceed-ings of the 18th National Conference on Artificial Intelligence (AAAI) , pp. 809–814, Boston,MA.Bryce, D., & Kambhampati, S. (2004). Heuristic guidance measures for conformant planning. In
Proceedings of the 14th International Conference on Automated Planning and Scheduling(ICAPS) , pp. 365–374, Whistler, BC, Canada.Bryce, D., Kambhampati, S., & Smith, D. (2006). Sequential Monte Carlo in probabilistic planningreachability heuristics. In
Proceedings of the 16th International Conference on AutomatedPlanning and Scheduling (ICAPS) , pp. 233–242, Cumbria, UK.
ROBABILISTIC -FF
Chavira, M., & Darwiche, A. (2005). Compiling Bayesian networks with local structure. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1306–1312, Edinburgh, Scotland.

Darwiche, A. (2000). Recursive conditioning. Artificial Intelligence, (1–2), 5–41.

Darwiche, A. (2001). Constant-space reasoning in dynamic Bayesian networks. International Journal of Approximate Reasoning, (3), 161–178.

Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 142–150.

Dechter, R. (1999). Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 41–85.

Domshlak, C., & Hoffmann, J. (2006). Fast probabilistic planning through weighted model counting. In Proceedings of the 16th International Conference on Automated Planning and Scheduling (ICAPS), pp. 243–252, Cumbria, UK.

Gomes, C. P., Hoffmann, J., Sabharwal, A., & Selman, B. (2007). From sampling to model counting. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI-07), Hyderabad, India.

Gomes, C. P., Sabharwal, A., & Selman, B. (2006). Model counting: A new strategy for obtaining good bounds. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), pp. 54–61, Boston, MA.

Hanks, S., & McDermott, D. (1994). Modeling a dynamic and uncertain world I: Symbolic and probabilistic reasoning about change. Artificial Intelligence, (1), 1–55.

Hoffmann, J., & Nebel, B. (2001). The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 253–302.

Hoffmann, J., & Brafman, R. (2006). Conformant planning via heuristic forward search: A new approach. Artificial Intelligence, (6–7), 507–541.

Huang, J. (2006). Combining knowledge compilation and search for efficient conformant probabilistic planning. In Proceedings of the 16th International Conference on Automated Planning and Scheduling (ICAPS), pp. 253–262, Cumbria, UK.

Hyafil, N., & Bacchus, F. (2004). Utilizing structured representations and CSPs in conformant probabilistic planning. In Proceedings of the European Conference on Artificial Intelligence (ECAI), pp. 1033–1034, Valencia, Spain.

Jensen, F. (1996). An Introduction to Bayesian Networks. Springer Verlag, New York.

Kushmerick, N., Hanks, S., & Weld, D. (1995). An algorithm for probabilistic planning. Artificial Intelligence, (1–2), 239–286.

Little, I., Aberdeen, D., & Thiébaux, S. (2005). Prottle: A probabilistic temporal planner. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI-05), pp. 1181–1186, Pittsburgh, PA.

Littman, M. L., Goldsmith, J., & Mundhenk, M. (1998). The computational complexity of probabilistic planning.
Journal of Artificial Intelligence Research, 1–36.
Majercik, S. M., & Littman, M. L. (1998). MAXPLAN: A new approach to probabilistic planning. In Proceedings of the 4th International Conference on Artificial Intelligence Planning Systems (AIPS), pp. 86–93, Pittsburgh, PA.

Majercik, S. M., & Littman, M. L. (2003). Contingent planning under uncertainty via stochastic satisfiability. Artificial Intelligence, (1–2), 119–162.

McDermott, D. (1998). The 1998 AI Planning Systems Competition. AI Magazine, (2), 35–55.

McDermott, D. V. (1999). Using regression-match graphs to control search in planning. Artificial Intelligence, (1–2), 111–159.

Onder, N., Whelan, G. C., & Li, L. (2006). Engineering a conformant probabilistic planner. Journal of Artificial Intelligence Research, 1–15.

Pearl, J. (1984). Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Rintanen, J. (2003). Expressive equivalence of formalisms for planning with sensing. In Proceedings of the 13th International Conference on Automated Planning and Scheduling (ICAPS), pp. 185–194, Trento, Italy.

Roth, D. (1996). On the hardness of approximate reasoning. Artificial Intelligence, (1–2), 273–302.

Russell, S., & Norvig, P. (2004). Artificial Intelligence: A Modern Approach (2nd edition). Pearson.

Sang, T., Bacchus, F., Beame, P., Kautz, H., & Pitassi, T. (2004). Combining component caching and clause learning for effective model counting. In (Online) Proceedings of the 7th International Conference on Theory and Applications of Satisfiability Testing (SAT), Vancouver, BC, Canada.

Sang, T., Beame, P., & Kautz, H. (2005). Solving Bayes networks by weighted model counting. In Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), pp. 475–482, Pittsburgh, PA.

Shimony, S. E. (1993). The role of relevance in explanation I: Irrelevance as statistical independence. International Journal of Approximate Reasoning, (4), 281–324.

Shimony, S. E. (1995). The role of relevance in explanation II: Disjunctive assignments and approximate independence. International Journal of Approximate Reasoning, (1), 27–60.

Zhang, N. L., & Poole, D. (1994). A simple approach to Bayesian network computations. In Proceedings of the 10th Canadian Conference on Artificial Intelligence, pp. 171–178, Banff, Alberta, Canada.