Stochastic Planning and Lifted Inference
Roni Khardon
Department of Computer Science, Tufts University
[email protected]
Scott Sanner
Department of Industrial Engineering, University of Toronto
[email protected]
January 5, 2017
Abstract
Lifted probabilistic inference (Poole, 2003) and symbolic dynamic programming for lifted stochastic planning (Boutilier et al., 2001) were introduced around the same time as algorithmic efforts to use abstraction in stochastic systems. Over the years, these ideas evolved into two distinct lines of research, each supported by a rich literature. Lifted probabilistic inference focused on efficient arithmetic operations on template-based graphical models under a finite domain assumption while symbolic dynamic programming focused on supporting sequential decision-making in rich quantified logical action models and on open domain reasoning. Given their common motivation but different focal points, both lines of research have yielded highly complementary innovations. In this chapter, we aim to help close the gap between these two research areas by providing an overview of lifted stochastic planning from the perspective of probabilistic inference, showing strong connections to other chapters in this book. This also allows us to define generalized lifted inference as a paradigm that unifies these areas and elucidates open problems for future research that can benefit both lifted inference and stochastic planning.
1 Introduction

In this chapter we illustrate that stochastic planning can be viewed as a specific form of probabilistic inference and show that recent symbolic dynamic programming (SDP) algorithms for the planning problem can be seen to perform "generalized lifted inference", thus making a strong connection to other chapters in this book. As we discuss below, although the SDP formulation is more expressive in principle, work on SDP to date has largely focused on algorithmic aspects of reasoning in open domain models with rich quantified logical structure, whereas lifted inference has largely focused on aspects of efficient arithmetic computations over finite domain (quantifier free) template-based models. The contributions in these areas are therefore largely along different dimensions. However, the intrinsic relationships between these problems suggest a strong opportunity for cross-fertilization where the true scope of generalized lifted inference can be achieved. This chapter intends to highlight these relationships and lay out a paradigm for generalized lifted inference that subsumes both fields and offers interesting opportunities for future research.

To make the discussion concrete, let us introduce a running example for stochastic planning and the kind of generalized solutions that can be achieved. For illustrative purposes, we borrow a planning domain from Boutilier et al. [1] that we refer to as
BoxWorld. In this domain, outlined in Figure 1, there are several cities such as london, paris, etc., trucks truck1, truck2, etc., and boxes box1, box2, etc. The agent can load a box onto a truck or unload it and can drive a truck from one city to another. When any box has been delivered to a specific city, paris, the agent receives a positive reward. The agent's planning task is to find a policy for action selection that maximizes this reward over some planning horizon.

Figure 1: A formal description of the BoxWorld domain, adapted from [1]. We use a simple STRIPS-like [2] add and delete list representation of actions and, as a simple probabilistic extension in the spirit of PSTRIPS [3], we assign probabilities that an action successfully executes conditioned on various state properties.

• Domain Object Types (i.e., sorts): Box, Truck, City = {paris, ...}
• Relations (with parameter sorts): BoxIn: BIn(Box, City), TruckIn: TIn(Truck, City), BoxOn: On(Box, Truck)
• Reward: if ∃B, BIn(B, paris) then 10 else 0
• Actions (with parameter sorts):
  – load(Box: B, Truck: T, City: C):
    ∗ Success Probability: if (BIn(B, C) ∧ TIn(T, C)) then .9 else 0
    ∗ Add Effects on Success: {On(B, T)}
    ∗ Delete Effects on Success: {BIn(B, C)}
  – unload(Box: B, Truck: T, City: C):
    ∗ Success Probability: if (On(B, T) ∧ TIn(T, C)) then .9 else 0
    ∗ Add Effects on Success: {BIn(B, C)}
    ∗ Delete Effects on Success: {On(B, T)}
  – drive(Truck: T, City: C1, City: C2):
    ∗ Success Probability: if (TIn(T, C1)) then 1 else 0
    ∗ Add Effects on Success: {TIn(T, C2)}
    ∗ Delete Effects on Success: {TIn(T, C1)}
  – noop:
    ∗ Success Probability: 1
    ∗ Add Effects on Success: ∅
    ∗ Delete Effects on Success: ∅

Our objective in lifted stochastic planning is to obtain an abstract policy, for example, like the one shown in Figure 2. In order to get some box to paris, the agent should drive a truck to the city where the box is located, load the box on the truck, drive the truck to paris, and finally unload the box in paris. This is essentially encoded in the symbolic value function shown in Fig. 2, which was computed by discounting rewards t time steps into the future by 0.9^t.

Similar to this example, for some problems we can obtain a solution which is described abstractly and is independent of the specific problem instance or even its size — for our example problem the description of the solution does not depend on the number of cities, trucks or boxes, or on knowledge of the particular location of any specific truck. Accordingly, one might hope that computing such a solution can be done without knowledge of these quantities and in time complexity independent of them. This is the computational advantage of symbolic stochastic planning which we associate with lifted inference in this chapter.

Figure 2: A decision-list representation of the optimal policy and expected discounted reward for the BoxWorld problem. The optimal action parameters in the then conditions correspond to the existential bindings that made the if conditions true.

if (∃B, BIn(B, paris)) then do noop (value = 100.00)
else if (∃B, T, TIn(T, paris) ∧ On(B, T)) then do unload(B, T, paris) (value = 89.0)
else if (∃B, C, T, On(B, T) ∧ TIn(T, C)) then do drive(T, C, paris) (value = 80.0)
else if (∃B, C, T, BIn(B, C) ∧ TIn(T, C)) then do load(B, T, C) (value = 72.0)
else if (∃B, C1, T, C2, BIn(B, C1) ∧ TIn(T, C2)) then do drive(T, C2, C1) (value = 64.7)
else do noop (value = 0.0)

The next two subsections expand on the connection between planning and inference, identify opportunities for lifted inference, and use these observations to define a new setup which we call generalized lifted inference, which abstracts some of the work in both areas and provides new challenges for future work.

1.1 Stochastic Planning and Inference

Planning is the task of choosing what actions to take to achieve some goals or maximize long-term reward. When the dynamics of the world are deterministic, that is, each action has exactly one known outcome, then the problem can be solved through logical inference. That is, inference rules can be used to deduce the outcome of individual actions given the current state, and by combining inference steps one can prove that the goal is achieved. In this manner a proof of goal achievement embeds a plan. This correspondence was at the heart of McCarthy's seminal paper [4] that introduced the topic of AI and viewed planning as symbolic logical inference. Since this formulation uses first-order logic, or the closely related situation calculus, lifted logical inference can be used to solve deterministic planning problems.

When the dynamics of the world are non-deterministic, this relationship is more complex. In particular, in this chapter we focus on the stochastic planning problem where an action can have multiple possible known outcomes that occur with known state-dependent probabilities. Inference in this case must reason about probabilities over an exponential number of state trajectories for some planning horizon. While lifted inference and planning may seem to be entirely different problems, analogies have been made between the two fields in several forms [5–15]. To make the connections concrete, consider a finite domain and the finite horizon goal-oriented version of the
BoxWorld planning problem of Figure 1, e.g., two boxes, three trucks, and four cities and a planning horizon of 10 steps where the goal is to get some box in paris. In this case, the value of a state, V(S), corresponds to the probability of achieving the goal, and goal achievement can be modeled as a specific form of inference in a Bayesian network or influence diagram.

We start by considering the conformant planning problem where the intended solution is an explicit sequence of actions. In this case, the sequence of actions is determined in advance and the action choice at the i-th step does not depend on the actual state at the i-th step. For this formulation, one can build a Dynamic Bayesian Network (DBN) model where each time slice represents the state at that time and action nodes affect the state at the next time step, as in Figure 3(a). The edges in this diagram capture p(S′ | S, A), where S is the current state, A is the current action and S′ is the next state, and each of S, S′, A is represented by multiple nodes to show that they are given by a collection of predicates and their values.

Figure 3: Planning as inference: conditioning on start and goal state. (a) Conformant planning: actions selected per time step without knowledge of the state. (b) An exponential size policy at each time step determines action selection. The transition depends on the current state and the policy's actions for that state.

Note that, since the world dynamics are known, the conditional probabilities for all nodes in the graph are known. As a result, the goal-based planning problem, where a goal G must hold at the last step, can be modeled using standard inference. The value of conformant planning is given by marginal MAP (where we seek a MAP value for some variables but take expectation over the remaining variables) [7, 12, 13]:

V^{conformant}(S_0) = max_{A_0}, ..., max_{A_{N-1}} Pr(G | S_0, A_0, ..., A_{N-1})
                    = max_{A_0}, ..., max_{A_{N-1}} Σ_{S_1, S_2, ..., S_N} Pr(G, S_1, ..., S_N | S_0, A_0, ..., A_{N-1}).

The optimal conformant plan is extracted using argmax instead of max in the equation. The standard MDP formulation, with a reward per time step which is accumulated, can be handled similarly, by normalizing the cumulative reward and adding a binary node G whose probability of being true is a function of the normalized cumulative reward. Several alternative formulations of planning as inference have been proposed by defining an auxiliary distribution over finite trajectories which captures a utility-weighted probability distribution over the trajectories [6, 9–11, 15]. While the details vary, the common theme among these approaches is that the planning objective is equivalent to calculating the partition function (or "probability of evidence") in the resulting distribution. This achieves the same effect as adding a node G that depends on the cumulative reward. To simplify the discussion, we continue the presentation with the simple goal-based formulation.

The same problem can be viewed from a Bayesian perspective, treating actions as random variables with an uninformative prior. In this case we can use

Pr(G | S_0, A_0, ..., A_{N-1}) = Pr(A_0, ..., A_{N-1} | G, S_0) · Pr(G | S_0) / Pr(A_0, ..., A_{N-1} | S_0)
to observe that [5, 6, 8]

argmax_{A_0}, ..., max_{A_{N-1}} Pr(G | S_0, A_0, ..., A_{N-1}) = argmax_{A_0}, ..., max_{A_{N-1}} Pr(A_0, ..., A_{N-1} | G, S_0)

and therefore one can alternatively maximize the probability of the actions conditioned on G.

However, linear plans, as the ones produced by the conformant setting, are not optimal for probabilistic planning. In particular, if we are to optimize goal achievement then we must allow the actions to depend on the state they are taken in. That is, the action in the second step is taken with knowledge of the probabilistic outcome of the first action, which is not known in advance. We can achieve this by duplicating action nodes, with a copy for each possible value of the state variables, as illustrated in Figure 3(b). This represents a separate policy associated with each horizon depth, which is required because finite horizon problems have non-stationary optimal policies. In this case, state transitions depend on the identity of the current state and the action variables associated with that state. The corresponding inference problem can be written as follows:

V(S_0) = max_{A_0(S_0)}, ..., max_{A_{N-1}(S_{N-1})} Pr(G | S_0, A_0(S_0), ..., A_{N-1}(S_{N-1})).   (1)

However, the number of random variables in this formulation is prohibitively large since we need the number of original action variables to be multiplied by the size of the state space.

Alternatively, the same desideratum, optimizing actions with knowledge of the previous state, can be achieved without duplicating variables in the equivalent formulation

V(S_0) = max_{A_0} Σ_{S_1} Pr(S_1 | S_0, A_0) max_{A_1} Σ_{S_2} Pr(S_2 | S_1, A_1) ...
         max_{A_{N-1}} Σ_{S_N} Pr(S_N | S_{N-1}, A_{N-1}) Pr(G | S_N).   (2)

In fact, this formulation is exactly the same as the finite horizon application of the value iteration (VI) algorithm for (goal-based) Markov Decision Processes (MDPs), which is the standard formulation for sequential decision making in stochastic environments. The standard formulation abstracts this by setting

V_0(S) = Pr(G | S)
V_{k+1}(S) = max_A Σ_{S′} Pr(S′ | S, A) V_k(S′),   (3)

where the inner summation Σ_{S′} Pr(S′ | S, A) V_k(S′) is the quantity denoted Q(S, A). The optimal policy (at S_0) can be obtained as before by recording the argmax values. In terms of probabilistic inference, the problem is no longer a marginal MAP problem because summation and maximization steps are constrained in their interleaved order. But it can be seen as a natural extension of such inference questions with several alternating blocks of expectation and maximization. We are not aware of an explicit study of such problems outside the planning context.
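To make Eq (2) and Eq (3) concrete at the ground (propositional) level, the following minimal Python sketch runs finite-horizon value iteration on a small enumerated MDP and contrasts it with the conformant (marginal MAP) value that maximizes over a fixed action sequence. This is our own illustration, not code from the cited work, and the two-state chain MDP is hypothetical.

```python
import itertools

# A tiny hypothetical MDP: states 0 and 1, where 1 is the goal state.
# Action "go" reaches the goal from state 0 with probability 0.8;
# "stay" keeps the current state. P[a][s] maps s' -> Pr(s' | s, a).
P = {
    "go":   {0: {0: 0.2, 1: 0.8}, 1: {1: 1.0}},
    "stay": {0: {0: 1.0},         1: {1: 1.0}},
}
STATES, GOAL = (0, 1), 1

def vi_value(s0, horizon):
    """Eq (3): alternate max over actions with expectation over next states."""
    V = {s: 1.0 if s == GOAL else 0.0 for s in STATES}   # V_0(S) = Pr(G | S)
    for _ in range(horizon):
        V = {s: max(sum(p * V[s2] for s2, p in P[a][s].items()) for a in P)
             for s in STATES}
    return V[s0]

def conformant_value(s0, horizon):
    """Marginal MAP: max over fixed action sequences of Pr(goal | sequence)."""
    best = 0.0
    for plan in itertools.product(P, repeat=horizon):
        dist = {s0: 1.0}                      # distribution over states
        for a in plan:                        # sum out one state layer
            nxt = {}
            for s, p in dist.items():
                for s2, q in P[a][s].items():
                    nxt[s2] = nxt.get(s2, 0.0) + p * q
            dist = nxt
        best = max(best, dist.get(GOAL, 0.0))
    return best

print(vi_value(0, 2), conformant_value(0, 2))
```

In this toy chain the two values coincide because the optimal action is the same in every state; in general, the interleaved max/sum value of Eq (2) can strictly exceed the conformant value, reflecting the advantage of acting on observed outcomes.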
1.2 Stochastic Planning and Lifted Inference

Given that planning can be seen as an inference problem, one can try to apply ideas of lifted inference to planning. Taking the motivating example from Figure 1, let us specialize the reward to a ground atomic goal G equivalent to BIn(b*, paris) for constants b* and paris. Then we can query max_A Pr(BIn(b*, paris) | s_0, A) to compute V(S_0 = s_0), where s_0 is the concrete value of the current state. Given that Figure 1 implies a complex relational specification of the transition probabilities, lifted inference techniques are especially well-placed to attempt to exploit the structure of this query to perform inference in aggregate and thus avoid redundant computations. However, we emphasize that, even if lifted inference is used, this is a standard query in the graphical model where evidence constrains the value of some nodes, and the solution is a single number representing the corresponding probability (together with a MAP assignment to variables).

However, Eq (3) suggests an explicit additional structure for the planning problem. In particular, the intermediate expressions V_k(S) include the values (the probability of reaching the goal in k steps) for all possible concrete values of S. Similarly, the final result V_N(S) includes the values for all possible start states. In addition, as in our running example, we can consider more abstract rewards. This suggests a first generalization of the standard setup in lifted inference. Instead of asking about a ground goal Pr(BIn(b*, paris)) and expecting a single number as a response, we can abstract the setup in two ways: first, we can ask about more general conditions such as Pr(∃B, BIn(B, paris)), and second, we can expect to get a structured result that specifies the corresponding probability for every concrete state in the world. If we had two box instances b_1, b_2 and m truck instances t_1, ..., t_m, the answer for V_1(S), i.e., the value for the goal-based formulation with horizon one, might take the form:

if (BIn(b_1, paris) ∨ BIn(b_2, paris))
then V_1(S) = 10
else if ((TIn(t_1, paris) ∧ On(b_1, t_1)) ∨ ... ∨ (TIn(t_m, paris) ∧ On(b_2, t_m)))
then V_1(S) = 9
else V_1(S) = 0.

The significance of this is that the question can have a more general form and that the answer solves many problems simultaneously, providing the response as a case analysis depending on some properties of the state. We refer to this reasoning as inference with generalized queries and answers. In this context, the goal of lifted inference will be to calculate a structured form of the reply directly.

A second extension arises from the setup of generalized queries. The standard form for lifted inference is to completely specify the domain in advance. This means providing the number of objects and their properties, and that the response to the query is calculated only for this specific domain instantiation. However, inspecting the solution in the previous paragraph it is obvious that we can at least hope to do better. The same solution can be described more compactly as

if (∃B, BIn(B, paris))
then V_1(S) = 10
else if (∃B, ∃T, (TIn(T, paris) ∧ On(B, T)))
then V_1(S) = 9
else V_1(S) = 0.

Arriving at such a solution requires us to allow open domain reasoning over all potential objects (rather than grounding them, which is impossible in open domains), and to extend ideas of lifted inference to exploit quantifiers and their structure. Following through with this idea, we can arrive at a domain-size independent value function and policy as the one shown in Figure 2. In this context, the goal of lifted inference will be to calculate an abstracted form of the reply directly. We call this problem inference with generalized models. As we describe in this chapter, SDP algorithms are able to perform this type of inference.

The previous example had enough structure and a special query that allowed the solution to be specified without any knowledge of the concrete problem instance. This property is not always possible. For example, consider a setting where we get one unit of reward for every box in paris:
Σ_{B:Box} [if BIn(B, paris) then 1 else 0]. In addition, consider the case where, after the agent takes their action, any box which is not on a truck disappears with probability
0.2. In this case, we can still potentially calculate an abstract solution, but it requires access to more complex properties of the state, and in some cases the domain size (number of objects) in the state. For our example this gives:

Let n = #(B, BIn(B, paris))
if (∃B, ∃T, (TIn(T, paris) ∧ On(B, T)))
then V_1(S) = n · 8 + 7.2
else V_1(S) = n · 8

where the counting expression #(B, BIn(B, paris)) counts the number of boxes in Paris in the current state. To see this result note that any existing box in Paris disappears 20% of the time, so each such box contributes 10 · 0.8 = 8, and that a box on a truck is successfully unloaded 90% of the time but remains and does not disappear only in 80% of possible futures, leading to the value 10 · 0.9 · 0.8 = 7.2. This is reminiscent of the type of expressions that arise in existing lifted inference problems and solutions. Typical solutions to such problems involve parameterized expressions over the domain (e.g., counting, summation, etc.), and critically do not always require closed-domain reasoning (e.g., a priori knowledge of the number of boxes). They are therefore suitable for inference with generalized models. Some work on SDP has approached lifted inference for problems with this level of complexity, including exogenous activities (the disappearing boxes) and additive rewards. But, as we describe in more detail, the solutions for these cases are much less well understood and developed.

To recap, our example illustrates that stochastic planning potentially enables abstract solutions that might be amenable to lifted computations. SDP solutions for planning problems have focused on the computational advantages arising from these expressive generalizations. At the same time, the focus in SDP algorithms has largely been on problems where the solution is completely independent of domain size and does not require numerical properties of the state. These algorithms have thus skirted some of the computational issues that are typically tackled in lifted inference. It is the combination of these aspects, as illustrated in the last example, which we call generalized lifted inference. As the discussion suggests, generalized lifted inference is still very much an open problem. In addition to providing a survey of existing SDP algorithms, the goal of this chapter is to highlight the opportunities and challenges in this exciting area of research.

2 Relational Expressions and Relational MDPs

This section provides a formal description of the representation language, the relational planning problem, and the description of the running example in this context.
2.1 Relational Expressions

The computation of SDP algorithms is facilitated by a representation that enables compact specification of functions over world states. Several such representations have been devised and used. In this chapter we chose to abstract away some of those details and focus on a simple language of relational expressions. This is closest to the GFODD representation of [16, 17], but it resembles the case notation of [1, 18].
Syntax. We assume familiarity with basic concepts and notation in first order logic (FOL) [19–21]. Relational expressions are similar to expressions in FOL. They are defined relative to a relational signature, with a finite set of predicates p_1, p_2, ..., p_n, each with an associated arity (number of arguments), a countable set of variables x_1, x_2, ..., and a set of constants c_1, c_2, ..., c_m. We do not allow function symbols other than constants (that is, functions with arity ≥ 1). Open formulas are built as usual from predicates applied to variables and constants; for example, color(X, Y) is such a formula and its truth value depends on the assignment of X and Y to objects in the world. To simplify the discussion, we assume for this example that arguments are typed (or sorted) and X ranges over "objects" and Y over "colors". We can then quantify over these variables to get a sentence which will be evaluated to a truth value in any concrete possible world. For example, we can write [∃Y, ∀X, color(X, Y)] expressing the statement that there is a color associated with all objects.

Generalized expressions allow for more general open formulas that evaluate to numerical values. For example, E_1 = [if color(X, Y) then 1 else 0] is similar to the previous logical expression but E_2 = [if color(X, Y) then 0.3 else 0.5] returns non-binary values. Quantifiers from logic are replaced with aggregation operators that combine numerical values and provide a generalization of the logical constructs. In particular, when the open formula is restricted to values 0 and 1, the operators max and min simulate existential and universal quantification. Thus, [max_Y, min_X, if color(X, Y) then 1 else 0] is equivalent to the logical sentence [∃Y, ∀X, color(X, Y)] given above. But we can allow for other types of aggregations. For example, [max_Y, sum_X, if color(X, Y) then 1 else 0] evaluates to the largest number of objects associated with one color, and the expression [sum_X, min_Y, if color(X, Y) then 0 else 1] evaluates to the number of objects that have no color association. In this manner, a generalized expression represents a function from possible worlds to numerical values and, as illustrated, can capture interesting properties of the state.

Relational expressions are also related to work in statistical relational learning [22–24]. For example, if the open expression E_2 given above captures the probability of ground facts for the predicate color() and the ground facts are mutually independent, then [product_X, product_Y, if color(X, Y) then 0.3 else 0.5] captures the joint probability of all facts for color(). Of course, open formulas in logic can include more than one atom and similarly expressions can be more involved.

In the following we will drop the cumbersome if-then-else notation and instead will assume a simpler notation with a set of mutually exclusive conditions which we refer to as cases. In particular, an expression includes a set of mutually exclusive open formulas in FOL (without any quantifiers or aggregators), denoted c_1, ..., c_k, associated with corresponding numerical values v_1, ..., v_k. The list of cases refers to a finite set of variables X_1, ..., X_m. A generalized expression is given by a list of aggregation operators and their variables and the list of cases [agg_{X_1}, agg_{X_2}, ..., agg_{X_m} [c_1 : v_1, ..., c_k : v_k]], so that the last expression is canonically represented as [product_X, product_Y, [color(X, Y) : 0.3; ¬color(X, Y) : 0.5]].
Semantics. The semantics of expressions is defined inductively exactly as in first order logic and we skip the formal definition. As usual, an expression is evaluated in an interpretation, also known as a possible world. In our context, an interpretation specifies (1) a finite set of n domain elements, also known as objects, (2) a mapping of constants to domain elements, and (3) the truth values of all the predicates over tuples of domain elements of appropriate size to match the arity of the predicate. Now, given an expression B = (agg_X, f(X)), an interpretation I, and a substitution ζ of the variables in X to objects in I, one can identify the case c_i which is true for this substitution. Exactly one such case exists since the cases are mutually exclusive and exhaustive. Therefore, the value associated with ζ is v_i. These values are then aggregated using the aggregation operators. For example, consider again the expression [product_X, product_Y, [color(X, Y) : 0.3; ¬color(X, Y) : 0.5]] and an interpretation I with objects a, b and where a is associated with colors black and white and b is associated with color black. In this case we have exactly 4 substitutions, evaluating to 0.3, 0.3, 0.5, 0.3. Then the final value is 0.3 · 0.3 · 0.5 · 0.3 = 0.0135.

Operations over expressions. Any binary operation op over real values can be generalized to open and closed expressions in a natural way. If f_1 and f_2 are two closed expressions, f_1 op f_2 represents the function which maps each interpretation w to f_1(w) op f_2(w). This provides a definition but not an implementation of binary operations over expressions. For implementation, the work in [16] showed that if the binary operation is safe, i.e., it distributes with respect to all aggregation operators, then there is a simple algorithm (the Apply procedure) implementing the binary operation over expressions. For example, + is safe w.r.t. max aggregation, and it is easy to see that (max_X f(X)) + (max_X g(X)) = max_X max_Y f(X) + g(Y), and the open formula portion of the result can be calculated directly from the open expressions f(X) and g(Y). Note that we need to standardize the expressions apart, as in the renaming of g(X) to g(Y) for such operations. When f(X) and g(Y) are open relational expressions the result can be computed through a cross product of the cases. For example,

[max_X, min_Y [color(X, Y) : 3; ¬color(X, Y) : 5]] ⊕ [max_X, [box(X) : 1; ¬box(X) : 2]]
= [max_Z, max_X, min_Y [color(X, Y) ∧ box(Z) : 4; ¬color(X, Y) ∧ box(Z) : 6; color(X, Y) ∧ ¬box(Z) : 5; ¬color(X, Y) ∧ ¬box(Z) : 7]].

When the binary operation is not safe this procedure fails, but in some cases operation-specific algorithms can be used for such combinations. As will become clear later, to implement SDP we need the binary operations ⊕, ⊗, max, and the aggregation includes max in addition to the aggregation in the reward function. Since ⊕, ⊗, max are safe with respect to max and min aggregation, one can provide a complete solution when the reward is restricted to have max and min aggregation. When this is not the case, for example when using sum aggregation in the reward function, one requires a special algorithm for the combination. Further details are provided in [16, 17]. (For example, a product of expressions that include only product aggregations, which is not safe, can be obtained by scaling the result with a number that depends on the domain size: [∏_{x_1} ∏_{x_2} ∏_{x_3} f(x_1, x_2, x_3)] ⊗ [∏_{y_1} ∏_{y_2} g(y_1, y_2)] is equal to [∏_{x_1} ∏_{x_2} ∏_{x_3} [f(x_1, x_2, x_3) × g(x_1, x_2)^{1/n}]] when the domain has n objects.)
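The evaluation semantics above is easy to prototype. The following minimal Python sketch (our own illustration with hypothetical helper names, not an implementation from [16, 17]) evaluates a generalized expression by aggregating the case values of all substitutions, and reproduces the 0.0135 computed above.

```python
from math import prod

# Objects {a, b} and colors {black, white}; color facts as in the example:
# a is associated with black and white, b with black only.
objects = ["a", "b"]
colors = ["black", "white"]
color_facts = {("a", "black"), ("a", "white"), ("b", "black")}

AGGS = {"product": prod, "sum": sum, "max": max, "min": min}

# The expression [product_X, product_Y, [color(X,Y): 0.3; not color(X,Y): 0.5]].
aggregators = [("product", objects), ("product", colors)]

def cases(x, y):
    # Mutually exclusive and exhaustive cases mapped to their values.
    return 0.3 if (x, y) in color_facts else 0.5

def evaluate(aggregators, cases):
    """Aggregate the case value of every substitution, outermost operator first."""
    def rec(bound, rest):
        if not rest:
            return cases(*bound)
        op, domain = rest[0]
        return AGGS[op](rec(bound + [v], rest[1:]) for v in domain)
    return rec([], aggregators)

print(evaluate(aggregators, cases))  # 0.3 * 0.3 * 0.5 * 0.3 = 0.0135
```

Changing the operator list and case values in the obvious way evaluates the other aggregation examples given earlier, e.g., max/sum for "the largest number of objects associated with one color".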
Summary. Relational expressions are closest to the GFODD representation of [16, 17]. Every case c_i in a relational expression corresponds to a path, or set of paths, in the GFODD, all of which reach the same leaf in the graphical representation of the GFODD. GFODDs are potentially more compact than relational expressions since paths share common subexpressions, which can lead to an exponential reduction in size. On the other hand, GFODDs require special algorithms for their manipulation. Relational expressions are also similar to the case notation of [1, 18]. However, in contrast with that representation, cases are not allowed to include any quantifiers; instead, quantifiers and general aggregators are globally applied over the cases, as in standard quantified normal form in logic.

2.2 Relational MDPs

In this section we define MDPs, starting with the basic case with enumerated state and action spaces, and then providing the relational representation.
MDP Preliminaries. We assume familiarity with basic notions of Markov Decision Processes (MDPs) [25, 26]. Briefly, an MDP is a tuple
⟨S, A, P, R, γ⟩ given by a set of states S, a set of actions A, a transition probability Pr(S′ | S, A), an immediate reward function R(S), and a discount factor γ < 1. The task is to find a policy π that maximizes the expected discounted total reward obtained by following that policy starting from any state. The Value Iteration algorithm (VI), informally introduced in Eq (3), calculates the optimal value function by iteratively performing Bellman backups, V_{k+1} = T[V_k], defined for each state s ∈ S as

V_{k+1}(s) = T[V_k](s) ← max_{a∈A} { R(s) + γ Σ_{s′∈S} Pr(s′ | s, a) V_k(s′) }.   (4)

Unlike Eq (3), which was goal-oriented and had only a single reward at the terminal horizon, here we allow the reward R(S) to accumulate at all time steps, as typically allowed in MDPs. If we iterate the update until convergence, we get the optimal infinite horizon value function, typically denoted by V*, and the optimal stationary policy π*. For finite horizon problems, which are the topic of this chapter, we simply stop the iterations at a specific k. In general, the optimal policy for the finite horizon case is not stationary; that is, we might make different choices in the same state depending on how close we are to the horizon.

Logical Notation for Relational MDPs (RMDPs). RMDPs are simply MDPs where the states and actions are described in a function-free first order logical language. A state corresponds to an interpretation over the corresponding logical signature, and actions are transitions between such interpretations.

A relational planning problem is specified by providing the logical signature, the start state, the transitions as controlled by actions, and the reward function. As mentioned above, one of the advantages of relational SDP algorithms is that they are intended to produce an abstracted form of the value function and policy that does not require specifying the start state or even the number of objects n in the interpretation at planning time. This yields policies that generalize across domain sizes. We therefore need to explain how one can use logical notation to represent the transition model and reward function in a manner that does not depend on domain size. Two types of transition models have been considered in the literature:

• Endogenous Branching Transitions:
In the basic form, state transitions have limited stochastic branching due to a finite number of action outcomes. The agent has a set of action types {A}, each parametrized with a tuple of objects to yield an action template A(X) and a concrete ground action A(x) (e.g., template unload(B, T) and concrete action unload(box23, truck1)). Each agent action has a finite number of action variants A_j(X) (e.g., action success vs. action failure), and when the user performs A(X) in state s one of the variants is chosen randomly using the state-dependent action choice distribution Pr(A_j(X) | A(X)). To simplify the presentation we follow [16, 27] and require that the Pr(A_j(X) | A(X)) are given by open expressions, i.e., they have no aggregations and cannot introduce new variables. For example, in BoxWorld, the agent action unload(B, T, C) has success outcome unloadS(B, T, C) and failure outcome unloadF(B, T, C) with action outcome distribution as follows:

P(unloadS(B, T, C) | unload(B, T, C)) = [(On(B, T) ∧ TIn(T, C)) : 0.9; ¬ : 0]
P(unloadF(B, T, C) | unload(B, T, C)) = [(On(B, T) ∧ TIn(T, C)) : 0.1; ¬ : 1]   (5)

where, to simplify the notation, the last case is shortened as ¬ to denote that it complements the previous cases. This provides the distribution over deterministic outcomes of actions.

The deterministic action dynamics are specified by providing an open expression, capturing successor state axioms [28], for each variant A_j(X) and predicate template p′(Y). Following [27] we call these expressions TVDs, standing for truth value diagrams. The corresponding TVD, T(A_j(X), p′(Y)), is an open expression that specifies the truth value of p′(Y) in the next state (following standard practice we use prime to denote that the predicate refers to the next state) when A_j(X) has been executed in the current state. The arguments X and Y are intentionally different logical variables as this allows us to specify the truth value of all instances of p′(Y) simultaneously. Similar to the choice probabilities, we follow [16, 27] and assume that TVDs T(A_j(X), p′(Y)) have no aggregations and cannot introduce new variables. This implies that the regression and product terms in the SDP algorithm of the next section do not change the aggregation function, thereby enabling analysis of the algorithm. Continuing our BoxWorld example, we define the TVD for
BIn′(B, C) under unloadS(B_1, T_1, C_1) and unloadF(B_1, T_1, C_1) as follows:

BIn′(B, C) ≡ T(unloadS(B_1, T_1, C_1), BIn′(B, C))
           ≡ [(BIn(B, C) ∨ ((B = B_1) ∧ (C = C_1) ∧ On(B_1, T_1) ∧ TIn(T_1, C_1))) : 1; ¬ : 0]
BIn′(B, C) ≡ T(unloadF(B_1, T_1, C_1), BIn′(B, C)) ≡ [BIn(B, C) : 1; ¬ : 0]   (6)

Note that each TVD has exactly two cases, one leading to the outcome 1 and the other leading to the outcome 0. Our algorithm below will use these cases individually. Here we remark that since the next state (primed) only depends on the previous state (unprimed), we are effectively logically encoding the Markov assumption of MDPs.
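To make the endogenous branching model concrete, here is a minimal Python sketch, our own illustration with simplified data structures rather than code from the cited systems, that samples a BoxWorld transition: it draws an action variant from the state-dependent choice distribution and then applies that variant's add and delete effects.

```python
import random

# A ground state is a set of true atoms, each a (predicate, args...) tuple.
state = {("BIn", "box1", "rome"), ("TIn", "truck1", "rome")}

def load_success_prob(s, b, t, c):
    # Choice distribution for load from Figure 1: 0.9 if preconditions hold.
    return 0.9 if ("BIn", b, c) in s and ("TIn", t, c) in s else 0.0

def step_load(s, b, t, c):
    """Sample the variant (success/failure), then apply add/delete effects."""
    if random.random() < load_success_prob(s, b, t, c):  # success variant
        return (s - {("BIn", b, c)}) | {("On", b, t)}
    return s                                             # failure: no change

print(step_load(state, "box1", "truck1", "rome"))
```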
• Exogenous Branching Transitions: The more complex form combines the endogenous model with an exogenous stochastic process that affects ground atoms independently. As a simple example in our
BoxWorld domain, we might imagine that with some small probability, each box B in a city C (BIn(B, C)) may independently randomly disappear (falsify
BIn(B, C)) owing to issues with theft or improper routing — such an outcome is independent of the agent's own action. Another more complicated example could be an inventory control problem where customer arrival at shops (and corresponding consumption of goods) follows an independent stochastic model. Such exogenous transitions can be formalized in a number of ways [17, 29, 30]; we do not aim to commit to a particular representation in this chapter, but rather to mention its possibility and the computational consequences of such general representations.

Having completed our discussion of RMDP transitions, we now proceed to define the reward R(S, A), which can be any function of the state and action, specified by a relational expression. Our running example with existentially quantified reward is given by

[max_B [BIn(B, paris) : 10; ¬BIn(B, paris) : 0]]   (7)

but we will also consider additive reward as in

[Σ_B [BIn(B, paris) : 10; ¬BIn(B, paris) : 0]].   (8)

3 Symbolic Dynamic Programming

The SDP algorithm is a symbolic implementation of the value iteration algorithm. The algorithm repeatedly applies so-called decision-theoretic regression, which is equivalent to one iteration of the value iteration algorithm.

As input to SDP we get closed relational expressions for V_k and R. In addition, assuming that we are using the Endogenous Branching Transition model of the previous section, we get open expressions for the probabilistic choice of actions
Pr(A_j(X) | A(X)) and for the dynamics of deterministic action variants as TVDs. The corresponding expressions for the running example are given respectively in Eq (7), Eq (5) and Eq (6). The following SDP algorithm of [16] modifies the earlier SDP algorithm of [1] and implements Eq (4) using the following 4 steps:

i. Regression:
The k step-to-go value function V_k is regressed over every deterministic variant A_j(X) of every action A(X) to produce Regr(V_k, A_j(X)). Regression is conceptually similar to goal regression in deterministic planning. That is, we identify conditions that need to hold before the action is taken in order to arrive at other conditions (for example the goal) after the action. However, here we need to regress all the conditions in the relational expression capturing the value function, so that we must regress each case c_i of V_k separately. This can be done efficiently by replacing every atom in each c_i by its corresponding positive or negated portion of the TVD, without changing the aggregation function. Once this substitution is done, logical simplification (at the propositional level) can be used to compress the cases by removing contradictory cases and simplifying the formulas. Applying this to regress unloadS(B_1, T_1, C_1) over the reward function given by Eq (7) we get:

[max_B [(BIn(B, paris) ∨ ((B = B_1) ∧ (C_1 = paris) ∧ On(B_1, T_1) ∧ TIn(T_1, C_1))) : 10; ¬ : 0]]

and regressing unloadF(B_1, T_1, C_1) yields

[max_B [BIn(B, paris) : 10; ¬ : 0]].

This illustrates the utility of compiling the transition model into the TVDs, which allow for a simple implementation of deterministic regression.
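The defining property of regression is that the regressed value of a state equals the value of the deterministic successor state. The short Python check below, our own sketch on a tiny ground instance rather than code from [16], verifies this identity for the unloadS variant by evaluating both the regressed expression above and V_0 after applying the action's effects.

```python
BOXES = ["b1"]  # a one-box, one-truck ground instance for the check

def V0(s):
    # Eq (7): max over boxes of [BIn(B, paris): 10; otherwise: 0].
    return max(10 if ("BIn", b, "paris") in s else 0 for b in BOXES)

def apply_unloadS(s, b1, t1, c1):
    # Deterministic success variant: add BIn(b1, c1), delete On(b1, t1).
    return (s - {("On", b1, t1)}) | {("BIn", b1, c1)}

def regr_unloadS(s, b1, t1, c1):
    # The regressed expression computed in the text, for ground bindings:
    # max_B [(BIn(B, paris) or (B = b1 and c1 = paris and On(b1, t1)
    #        and TIn(t1, c1))): 10; otherwise: 0].
    def case(b):
        return (("BIn", b, "paris") in s or
                (b == b1 and c1 == "paris" and
                 ("On", b1, t1) in s and ("TIn", t1, c1) in s))
    return max(10 if case(b) else 0 for b in BOXES)

s = {("On", "b1", "t1"), ("TIn", "t1", "paris")}
assert regr_unloadS(s, "b1", "t1", "paris") == V0(apply_unloadS(s, "b1", "t1", "paris"))
print("regression identity holds")
```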
ii. Add Action Variants: The Q-function Q_k^{A(X)} = R ⊕ [γ ⊗ ⊕_j (Pr(A_j(X)) ⊗ Regr(V_k, A_j(X)))] for each action A(X) is generated by combining regressed diagrams using the binary operations ⊕ and ⊗ over expressions. Recall that probability expressions do not refer to additional variables. The multiplication can therefore be done directly on the open formulas without changing the aggregation function. As argued by [27], to guarantee correctness, both summation steps (the ⊕_j and R ⊕ steps) must standardize apart the functions before adding them. For our running example and assuming γ = 0.
9, we would need to compute the following:

Q^{unload(B_1,T_1,C_1)}(S) = R(S) ⊕ 0.9 · [(Regr(V_0, unloadS(B_1, T_1, C_1)) ⊗ P(unloadS(B_1, T_1, C_1) | unload(B_1, T_1, C_1))) ⊕ (Regr(V_0, unloadF(B_1, T_1, C_1)) ⊗ P(unloadF(B_1, T_1, C_1) | unload(B_1, T_1, C_1)))].

We next illustrate some of these steps. The multiplication by probability expressions can be done by a cross product of cases and simplification. For unloadS this yields

[max_B [((BIn(B, paris) ∨ ((B = B_1) ∧ (C_1 = paris))) ∧ On(B_1, T_1) ∧ TIn(T_1, C_1)) : 9; ¬ : 0]]

and for unloadF we get

[max_B [BIn(B, paris) ∧ (On(B_1, T_1) ∧ TIn(T_1, C_1)) : 1;
        BIn(B, paris) ∧ ¬(On(B_1, T_1) ∧ TIn(T_1, C_1)) : 10;
        ¬ : 0]].

Note that the values here are weighted by the probability of occurrence. For example, the first case in the last equation has value 1 = 10 · 0.1 because when the preconditions of unload hold the variant unloadF occurs with 10% probability. The addition of the last two equations requires standardizing them apart, performing the safe operation through a cross product of cases, and simplifying. Skipping intermediate steps, this yields

[max_B [BIn(B, paris) : 10;
        ¬BIn(B, paris) ∧ (B = B_1) ∧ (C_1 = paris) ∧ On(B_1, T_1) ∧ TIn(T_1, C_1) : 9;
        ¬ : 0]].

Multiplying by the discount factor scales the numbers in the last equation by 0.9, and finally standardizing apart, adding the reward, and simplifying (again skipping intermediate steps) yields

Q^{unload(B_1,T_1,C_1)}(S) =
[max_B [BIn(B, paris) : 19;
        ¬BIn(B, paris) ∧ (B = B_1) ∧ (C_1 = paris) ∧ On(B_1, T_1) ∧ TIn(T_1, C_1) : 8.1;
        ¬ : 0]].

Intuitively, this result states that after executing a concrete stochastic unload action with arguments (B_1, T_1, C_1), we achieve the highest value (10 plus a discounted 0.9 · 10) if a box was already in Paris, the next highest value (10 occurring with probability 0.9 and discounted by 0.9) if unloading B_1 from T_1 in C_1 = paris, and a value of zero otherwise. The main source of efficiency (or lack thereof) of SDP is the ability to perform such operations symbolically and simplify the result into a compact expression.
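The cross-product combination used repeatedly in this step is mechanical. The Python sketch below, our own simplified illustration of the Apply idea from [16], combines two case lists under a safe binary operation by conjoining conditions and combining values; the logical simplification that removes contradictory conjunctions is omitted.

```python
from operator import add

# A case list pairs mutually exclusive conditions (conjunctions of literal
# strings, already standardized apart) with numerical values.
def apply_op(cases1, cases2, op):
    """Cross product of cases: conjoin conditions, combine values with op."""
    return [(c1 | c2, op(v1, v2)) for c1, v1 in cases1 for c2, v2 in cases2]

f = [(frozenset({"color(X,Y)"}), 3), (frozenset({"not color(X,Y)"}), 5)]
g = [(frozenset({"box(Z)"}), 1), (frozenset({"not box(Z)"}), 2)]

for cond, val in apply_op(f, g, add):
    print(sorted(cond), val)  # four cases with values 4, 5, 6, 7, matching
                              # the earlier ⊕ example over relational expressions
```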
iii. Object Maximization: Note that up to this point in the algorithm the action arguments are still considered to be concrete arbitrary objects, (B_1, T_1, C_1) in our example. However, we must make sure that in each of the (unspecified and possibly infinite set of possible) states we choose the best concrete action for that state, by specifying the appropriate action arguments. This is handled in the current step of the algorithm.

To achieve this, we maximize over the action parameters X of Q_{V_k}^{A(X)} to produce Q_{V_k}^A for each action A(X). This implicitly obtains the value achievable by the best ground instantiation of A(X) in each state. This step is implemented by converting the action parameters X to variables, each associated with the max aggregation operator, and appending these operators to the head of the aggregation function. Once this is done, further logical simplification may be possible. This occurs in our running example, where the existential quantification (over B_1, C_1), which is constrained by equality, can be removed, and the result is:

Q^{unload}(S) = [max_T, max_B [BIn(B, paris) : 19;
                               ¬BIn(B, paris) ∧ On(B, T) ∧ TIn(T, paris) : 8.1;
                               ¬ : 0]].
iv. Maximize over Actions: The k+1 step-to-go value function V_{k+1} = max_A Q_{V_k}^A is generated by combining the expressions using the binary operation max. Concretely, for our running example, this means we would compute:

V_1(S) = max(Q^{unload}(S), max(Q^{load}(S), Q^{drive}(S))).

While we have only shown Q^{unload}(S) above, we remark that the values achievable in each state by Q^{unload}(S) dominate or equal the values achievable by Q^{load}(S) and Q^{drive}(S) in the same state. Practically this implies that after simplification we obtain the following value function:

V_1(S) = Q^{unload}(S) = [max_T, max_B [BIn(B, paris) : 19;
                                         ¬BIn(B, paris) ∧ On(B, T) ∧ TIn(T, paris) : 8.1;
                                         ¬ : 0]].

Critically for the objectives of lifted stochastic planning, we observe that the value function derived by SDP is indeed lifted: it holds for any number of boxes, trucks and cities. SDP repeats these steps to the required depth, iteratively calculating V_k. For example, Figure 2 illustrates V_∞ for the BoxWorld example, which was computed by terminating the SDP loop once the value function converged.

The basic SDP algorithm is an exact calculation whenever the model can be specified using the constraints above and the reward function can be specified with max and min aggregation [16]. This is satisfied by classical models of stochastic planning. As illustrated, in these cases the SDP solution conforms to our definition of generalized lifted inference.
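One way to see the "generalized" nature of this result is that the single symbolic V_1 answers the query for every ground instance. The Python sketch below, our own illustration rather than part of the SDP literature, evaluates the lifted V_1 on an arbitrary ground state by enumerating bindings of the max-aggregated variables, for any number of boxes, trucks and cities.

```python
def lifted_V1(state, boxes, trucks):
    """V_1(S) = max_T max_B [BIn(B, paris): 19;
       not BIn(B, paris) and On(B, T) and TIn(T, paris): 8.1; otherwise: 0]."""
    def case(b, t):
        if ("BIn", b, "paris") in state:
            return 19.0
        if ("On", b, t) in state and ("TIn", t, "paris") in state:
            return 8.1
        return 0.0
    return max(case(b, t) for t in trucks for b in boxes)

# The same expression answers any instance, e.g., 2 boxes and 3 trucks:
s = {("On", "b2", "t3"), ("TIn", "t3", "paris")}
print(lifted_V1(s, ["b1", "b2"], ["t1", "t2", "t3"]))  # 8.1
```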
Extending the Scope of SDP. The algorithm above cannot handle models with more complex dynamics and rewards, as motivated in the introduction. In particular, prior work has considered two important properties that appear to be relevant in many domains. The first is additive rewards, illustrated, for example, in Eq (8). The second property is exogenous branching transitions, illustrated above by the disappearing boxes example. These represent two different challenges for the SDP algorithm. The first is that we must handle sum aggregation in value functions, despite the fact that this means that some of the operations are not safe and hence require a special implementation. The second is in modeling the exogenous branching dynamics, which requires getting around potential conflicts among such events and between such events and agent actions. The introduction illustrated the type of solution that can be expected in such a problem, where counting expressions, which measure the number of times certain conditions hold in a state, determine the value in that state. To date, exact abstract solutions for problems of this form have not been obtained. The work of [30] and [29] (Ch. 6) considered additive rewards and formalized an expressive family of models with exogenous events. This work has shown that some specific challenging domains can be handled using several algorithmic ideas, but did not provide a general algorithm that is applicable across problems in this class. The work of [17] developed a model for "service domains" which significantly constrains the type of exogenous branching. In their model, a transition includes an agent step whose dynamics use endogenous branching, followed by "nature's step" where each object (e.g., a box) experiences a random exogenous action (potentially disappearing). Given these assumptions, they provide a generally applicable approximation algorithm as follows. Their algorithm treats the agent's actions exactly as in SDP above. To regress nature's actions, three steps are followed: (1) the summation variables are first grounded using a Skolem constant c, then (2) a single exogenous event centered at c is regressed using the same machinery, and finally (3) the Skolemization is reversed to yield another additive value function. The complete details are beyond the scope of this chapter. The algorithm yields a solution that avoids counting formulas and is syntactically close to the one given by the original algorithm. Since such formulas are necessary, the result is an approximation, but it was shown to be a conservative one in that it provides a monotonic lower bound on the true value. Therefore, this algorithm conforms to our definition of approximate generalized lifted inference.

In our example, starting with the reward of Eq (8) we first replace the sum aggregation with a scaled version of average aggregation (which is safe w.r.t. summation)

[n · avg_B [BIn(B, paris) : 10; ¬ : 0]]

and then ground it to get [n · [BIn(c, paris) : 10; ¬ : 0]]. The next step is to regress through the exogenous event at c. The problem where boxes disappear with probability 0.2 can be cast as having two action variants where "disappearing-box" succeeds with probability 0.2 and fails with probability 0.8.
Regressing the success variant we get the expression [0] (the zero function) and regressing the fail variant we get [n · [BIn(c, paris) : 10; ¬ : 0]]. Multiplying by the probabilities of the variants we get [0] and [n · [BIn(c, paris) : 8; ¬ : 0]], and adding them (there are no variables to standardize apart) we get

[n · [BIn(c, paris) : 8; ¬ : 0]].

Finally, lifting the last equation we get

[n · avg_B [BIn(B, paris) : 8; ¬BIn(B, paris) : 0]].

Next we follow with the standard steps of SDP for the agent's action. The steps are analogous to the example of SDP given above. Considering the discussion in the introduction (recall that in order to simplify the reasoning in this case we omitted discounting and adding the reward) this algorithm produces

[n · max_T, avg_B [BIn(B, paris) : 8;
                   (¬BIn(B, paris) ∧ On(B, T) ∧ TIn(T, paris)) : 7.2;
                   ¬ : 0]],

which is identical to the exact expression given in the introduction. As already mentioned, the result is not guaranteed to be exact in general. In addition, the maximization in step iv of SDP requires some ad hoc implementation because maximization is not safe with respect to average aggregation.

It is clear from the above example that the main difficulty in extending SDP is due to the interaction of the counting formulas, arising from exogenous events and additive rewards, with the first-order aggregation structure inherent in the planning problem. Relational expressions, their GFODD counterparts, and other representations that have been used to date are not able to combine these effectively. A representation that seamlessly supports both relational expressions and operations on them along with counting expressions might allow for more robust versions of generalized lifted inference to be realized.

4 Discussion and Related Work

As motivated in the introduction, SDP has explored probabilistic inference problems with a specific form of alternating maximization and expectation blocks. The main computational advantage comes from lifting in the sense of lifted inference in standard first order logic. Issues that arise from conditional summations over combinations of random variables, common in probabilistic lifted inference, have been touched upon but not extensively. In cases where SDP has been shown to work it provides generalized lifted inference, where the complexity of the inference algorithm is completely independent of the domain size (number of objects) in the problem specification, and where the response to queries is either independent of that size or can be specified parametrically. This is a desirable property but to our knowledge it is not shared by most work on probabilistic lifted inference. A notable exception is given by the knowledge compilation result of [31] (see Chapter 4 and Theorem 5.5) and the recent work in [32, 33], where a model is compiled into an alternative form parametrized by the domain D and where responses to queries can be obtained in polynomial time as a function of D. The emphasis in that work is on being domain lifted (i.e., being polynomial in domain size). Generalized lifted inference requires an algorithm whose results can be computed once, in time independent of that size, and then reused to evaluate the answer for specific domain sizes. This analogy also shows that SDP can be seen as a compilation algorithm, compiling a domain model into a more accessible form representing the value function, which can be queried efficiently.
This connection provides an interesting new perspective on both fields.

In this chapter we focused on one particular instance of SDP. Over the last 15 years SDP has seen a significant amount of work expanding on the original algorithm by using different representations, by using algorithms other than value iteration, and by extending the models and algorithms to more complex settings. In addition, several "lifted" inductive approaches that do not strictly fall within the probabilistic inference paradigm have been developed. We review this work in the remainder of this section.

4.1 Deductive Lifted Stochastic Planning

As a precursor to its use in lifted stochastic planning, the term SDP originated in the propositional logical context [34, 35] when it was realized that propositionally structured MDP transitions (i.e., dynamic Bayesian networks [36]) and rewards (e.g., trees that exploited context-specific independence [37]) could be used to define highly compact factored MDPs; this work also realized that the factored MDP structure could be exploited for representational compactness and computational efficiency by leveraging symbolic representations (e.g., trees) in dynamic programming. Two highly cited (and still used) algorithms in this area of work are the SPUDD [38] and APRICODD [39] algorithms that leveraged algebraic decision diagrams (ADDs) [40] for, respectively, exact and approximate solutions to factored MDPs. Recent work in this area [41] shows how to perform propositional SDP directly with ground representations in PPDDL [42], and develops extensions for factored action spaces [43, 44].

Following the seminal introduction of lifted
SDP in [1], several early papers on SDP approached the problem with existential rewards using different representation languages that enabled efficient implementations. This includes first-order value iteration (FOVIA) [45, 46], the Relational Bellman algorithm (ReBel) [47], and the FODD-based formulation of [27, 48, 49].

Along this dimension two representations are closely related to the relational expressions of this chapter. As mentioned above, relational expressions are an abstraction of the GFODD representation [16, 17, 50] which captures expressions using a decision diagram formulation extending propositional ADDs [40]. In particular, paths in the graphical representation of the DAG representing the GFODD correspond to the mutually exclusive conditions in expressions. The aggregation in GFODDs and relational expressions provides significant expressive power in modeling relational MDPs. The GFODD representation is more compact than relational expressions but requires more complex algorithms for its manipulation. The other closely related representation is the case notation of [1, 18]. The case notation is similar to relational expressions in that we have a set of conditions (these are mostly in a form that is mutually exclusive but not always so) but the main difference is that quantification is done within each case separately, and the notion of aggregation is not fully developed. First-order algebraic decision diagrams (FOADDs) [18, 29] are related to the case notation in that they require closed formulas within diagram nodes, i.e., the quantifiers are included within the graphical representation of the expression. The use of quantifiers inside cases and nodes allows for an easy incorporation of off-the-shelf theorem provers for simplification. Both FOADDs and GFODDs were used to extend SDP to capture additive rewards and exogenous events as already discussed in the previous section. While the representations (relational expressions and GFODDs vs. case notation and FOADDs) have similar expressive power, the difference in aggregation makes for different algorithmic properties that are hard to compare in general. However, the modular treatment of aggregation in GFODDs and the generic form of operations over them makes them the most flexible alternative to date for directly manipulating the aggregated case representation used in this chapter.

The idea of SDP has also been extended in terms of the choice of planning algorithm, as well as to the case of partially observable MDPs. Case notation and FOADDs have been used to implement approximate linear programming [18, 51] and approximate policy iteration via linear programming [52], and FODDs have been used to implement relational policy iteration [53]. GFODDs have also been used for open world reasoning and applied in a robotic context [54]. The work of [55] and [56] explores SDP solutions, with GFODDs and case notation respectively, to relational partially observable MDPs (POMDPs) where the problem is conceptually and algorithmically much more complex. Related work in POMDPs has not explicitly addressed SDP, but rather has implicitly addressed lifted solutions through the identification of (and abstraction over) symmetries in applications of dynamic programming for POMDPs [57, 58].
4.2 Inductive Lifted Stochastic Planning

Inductive methods can be seen to be orthogonal to the inference algorithms in that they mostly do not require a model and do not reason about that model. However, the overall objective of producing lifted value functions and policies is shared with the previously discussed deductive approaches. We therefore review these here for completeness. As we discuss, it is also possible to combine the inductive and deductive approaches in several ways.

The basic inductive approaches learn a policy directly from a teacher, sometimes known as behavioral cloning. The work of [59–61] provided learning algorithms for relational policies with theoretical and empirical evidence for their success. Relational policies and value functions were also explored in reinforcement learning. This was done with pure reinforcement learning using relational regression trees to learn a Q-function [62], combining this with supervised guidance [63], or using Gaussian processes and graph kernels over relational structures to learn a Q-function [64]. A more recent approach uses functional gradient boosting with lifted regression trees to learn lifted policy structure in a policy gradient algorithm [65].

Finally, several approaches combine inductive and deductive elements. The work of [66] combines inductive logic programming with first-order decision-theoretic regression, by first using deductive methods (decision-theoretic regression) to generate candidate policy structure, and then learning using this structure as features. The work of [67] shows how one can implement relational approximate policy iteration where policy improvement steps are performed by learning the intended policy from generated trajectories instead of by direct calculation. Although these approaches are partially deductive they do not share the common theme of this chapter relating planning and inference in relational contexts.
This chapter provides a review of SDP methods, which perform abstract reasoning for stochastic planning, from the viewpoint of probabilistic inference. We have illustrated how the planning problem and the inference problem are related. Specifically, finite horizon optimization in MDPs corresponds to an inference problem with alternating maximization and expectation blocks and is therefore more complex than the marginal MAP queries that have been studied in the literature. This analogy is valid both at the propositional and relational levels, and it suggests a new line of challenges for inference problems in discrete domains. We have also identified the opportunity for generalized lifted inference, where the algorithm and its solution are agnostic of the domain instance and its size and are efficient regardless of that size. We have shown that under some conditions SDP algorithms provide generalized lifted inference. In more complex models, especially ones with additive rewards and exogenous events, SDP algorithms have yet to mature into an effective and widely applicable inference scheme. On the other hand, the challenges faced in such problems are exactly the ones typically seen in standard lifted inference problems. Therefore, exploring generalized lifted inference more abstractly has the potential to lead to advances in both areas.
Acknowledgments
This work is partly supported by NSF grants IIS-0964457 and IIS-1616280.

References

[1] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for first-order MDPs. In Proc. of IJCAI, pages 690–700, 2001.
[2] Richard E. Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. AI Journal, 2:189–208, 1971.
[3] Neil Kushmerick, Steve Hanks, and Dan Weld. An algorithm for probabilistic planning. Artificial Intelligence, 76(1-2):239–286, 1995.
[4] J. McCarthy. Programs with common sense. In Proceedings of the Symposium on the Mechanization of Thought Processes, volume 1, pages 77–84. National Physical Laboratory, 1958. Reprinted in R. Brachman and H. Levesque (Eds.), Readings in Knowledge Representation, 1985, Morgan Kaufmann, Los Altos, CA.
[5] Hagai Attias. Planning by probabilistic inference. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS 2003, Key West, Florida, USA, January 3-6, 2003, 2003.
[6] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the International Conference on Machine Learning, 2006.
[7] Carmel Domshlak and Jörg Hoffmann. Fast probabilistic planning through weighted model counting. In Proceedings of the International Conference on Automated Planning and Scheduling, 2006.
[8] M. Lang and M. Toussaint. Approximate inference for planning in stochastic relational worlds. In Proceedings of the International Conference on Machine Learning, 2009.
[9] Thomas Furmston and David Barber. Variational methods for reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS, pages 241–248, 2010.
[10] Qiang Liu and Alexander T. Ihler. Belief propagation for structured decision making. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 523–532, 2012.
[11] Qiang Cheng, Qiang Liu, Feng Chen, and Alexander T. Ihler. Variational planning for graph-based MDPs. In International Conference on Neural Information Processing Systems, pages 2976–2984, 2013.
[12] J. Lee, R. Marinescu, and R. Dechter. Applying marginal MAP search to probabilistic conformant planning. In Fourth International Workshop on Statistical Relational AI (StarAI), 2014.
[13] J. Lee, R. Marinescu, and R. Dechter. Applying search based probabilistic inference algorithms to probabilistic conformant planning: Preliminary results. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2016.
[14] M. Issakkimuthu, A. Fern, R. Khardon, P. Tadepalli, and S. Xue. Hindsight optimization for probabilistic planning with factored actions. In ICAPS, 2015.
[15] Jan-Willem van de Meent, Brooks Paige, David Tolpin, and Frank Wood. Black-box policy search with probabilistic programs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS, pages 1195–1204, 2016.
[16] S. Joshi, K. Kersting, and R. Khardon. Decision theoretic planning with generalized first order decision diagrams. AIJ, 175:2198–2222, 2011.
[17] S. Joshi, R. Khardon, A. Raghavan, P. Tadepalli, and A. Fern. Solving relational MDPs with exogenous events and additive rewards. In ECML, 2013.
[18] S. Sanner and C. Boutilier. Practical solution techniques for first order MDPs. AIJ, 173:748–788, 2009.
[19] J. W. Lloyd. Foundations of Logic Programming. Springer Verlag, 1987. Second edition.
[20] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[21] C. Chang and J. Keisler. Model Theory. Elsevier, Amsterdam, Holland, 1990.
[22] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107–136, 2006.
[23] L. De Raedt, A. Kimmig, and H. Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In Proc. of IJCAI, pages 2462–2467, 2007.
[24] G. Van den Broeck, N. Taghipour, W. Meert, J. Davis, and L. De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In Proc. of IJCAI, pages 2178–2185, 2011.
[25] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2009. Third edition.
[26] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
[27] C. Wang, S. Joshi, and R. Khardon. First order decision diagrams for relational MDPs. JAIR, 31:431–472, 2008.
[28] Ray Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. MIT Press, 2001.
[29] S. Sanner. First-Order Decision-Theoretic Planning in Structured Relational Environments. PhD thesis, University of Toronto, 2008.
[30] S. Sanner and C. Boutilier. Approximate solution techniques for factored first-order MDPs. In Proceedings of the 17th Conference on Automated Planning and Scheduling (ICAPS-07), 2007.
[31] Guy Van den Broeck. Lifted Inference and Learning in Statistical Relational Models. PhD thesis, KU Leuven, 2013.
[32] Seyed Mehran Kazemi and David Poole. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. In Proceedings of the Conference on Knowledge Representation and Reasoning, pages 561–564, 2016.
[33] Seyed Mehran Kazemi, Angelika Kimmig, Guy Van den Broeck, and David Poole. New liftable classes for first-order probabilistic inference. In International Conference on Neural Information Processing Systems, pages 3117–3125, 2016.
[34] Craig Boutilier, Thomas Dean, and Steve Hanks. Planning under uncertainty: Structural assumptions and computational leverage. In Third European Workshop on Planning, Assisi, Italy, 1995.
[35] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research (JAIR), 11:1–94, 1999.
[36] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1989.
[37] Craig Boutilier, Nir Friedman, Moisés Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI-96), pages 115–123, Portland, OR, 1996.
[38] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Uncertainty in Artificial Intelligence (UAI-99), pages 279–288, Stockholm, 1999.
[39] Robert St-Aubin, Jesse Hoey, and Craig Boutilier. APRICODD: Approximate policy construction using decision diagrams. In Advances in Neural Information Processing 13 (NIPS-00), pages 1089–1095, Denver, 2000.
[40] R. Bahar, E. Frohm, C. Gaona, G. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. In IEEE/ACM ICCAD, pages 188–191, 1993.
[41] Boris Lesner and Bruno Zanuttini. Efficient policy construction for MDPs represented in probabilistic PDDL. In Fahiem Bacchus, Carmel Domshlak, Stefan Edelkamp, and Malte Helmert, editors, ICAPS. AAAI, 2011.
[42] Håkan L. S. Younes, Michael L. Littman, David Weissman, and John Asmuth. The first probabilistic track of the international planning competition. Journal of Artificial Intelligence Research (JAIR), 24:851–887, 2005.
[43] Aswin Raghavan, Saket Joshi, Alan Fern, Prasad Tadepalli, and Roni Khardon. Planning in factored action spaces with symbolic dynamic programming. In Proceedings of the AAAI Conference on Artificial Intelligence, 2012.
[44] Aswin Raghavan, Roni Khardon, Alan Fern, and Prasad Tadepalli. Symbolic opportunistic policy iteration for factored-action MDPs. In International Conference on Neural Information Processing Systems, pages 2499–2507, 2013.
[45] Eldar Karabaev and Olga Skvortsova. A heuristic search algorithm for solving first-order MDPs. In Uncertainty in Artificial Intelligence (UAI-05), pages 292–299, Edinburgh, Scotland, 2005.
[46] S. Hölldobler, E. Karabaev, and O. Skvortsova. FluCaP: A heuristic search planner for first-order MDPs. JAIR, 27:419–439, 2006.
[47] K. Kersting, M. Van Otterlo, and L. De Raedt. Bellman goes relational. In Proc. of ICML, 2004.
[48] S. Joshi and R. Khardon. Stochastic planning with first order decision diagrams. In Proc. of ICAPS, pages 156–163, 2008.
[49] S. Joshi, K. Kersting, and R. Khardon. Self-taught decision theoretic planning with first order decision diagrams. In Proc. of ICAPS, pages 89–96, 2010.
[50] B. Hescott and R. Khardon. The complexity of reasoning with FODD and GFODD. Artificial Intelligence, 2015.
[51] Scott Sanner and Craig Boutilier. Approximate linear programming for first-order MDPs. In Uncertainty in Artificial Intelligence (UAI-05), pages 509–517, Edinburgh, Scotland, 2005.
[52] Scott Sanner and Craig Boutilier. Practical linear evaluation techniques for first-order MDPs. In Uncertainty in Artificial Intelligence (UAI-06), Boston, Mass., 2006.
[53] C. Wang and R. Khardon. Policy iteration for relational MDPs. In Proceedings of UAI, 2007.
[54] Saket Joshi, Paul W. Schermerhorn, Roni Khardon, and Matthias Scheutz. Abstract planning for reactive robots. In ICRA, pages 4379–4384, 2012.
[55] Chenggang Wang and Roni Khardon. Relational partially observable MDPs. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 2010.
[56] Scott Sanner and Kristian Kersting. Symbolic dynamic programming for first-order POMDPs. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 2010.
[57] Finale Doshi and Nicholas Roy. The permutable POMDP: Fast solutions to POMDPs for preference elicitation. In Proceedings of the Seventh International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), Estoril, Portugal, May 2008.
[58] Byung Kon Kang and Kee-Eung Kim. Exploiting symmetries for single- and multi-agent partially observable stochastic domains. Artificial Intelligence, 182-183:32–57, 2012.
[59] R. Khardon. Learning to take actions. Machine Learning, 35:57–90, 1999.
[60] R. Khardon. Learning action strategies for planning domains. Artificial Intelligence, 113(1-2):125–148, 1999.
[61] SungWook Yoon, Alan Fern, and Robert Givan. Inductive policy selection for first-order Markov decision processes. In Uncertainty in Artificial Intelligence (UAI-02), pages 569–576, Edmonton, 2002.
[62] Saso Dzeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Machine Learning Journal (MLJ), 43:7–52, 2001.
[63] Kurt Driessens and Saso Dzeroski. Integrating experimentation and guidance in relational reinforcement learning. In International Conference on Machine Learning (ICML), pages 115–122, 2002.
[64] Thomas Gärtner, Kurt Driessens, and Jan Ramon. Graph kernels and Gaussian processes for relational reinforcement learning. Machine Learning Journal (MLJ), 64:91–119, 2006.
[65] Kristian Kersting and Kurt Driessens. Non-parametric policy gradients: A unified treatment of propositional and relational domains. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 456–463, New York, NY, USA, 2008. ACM.
[66] Charles Gretton and Sylvie Thiebaux. Exploiting first-order regression in inductive policy selection. In Uncertainty in Artificial Intelligence (UAI-04), pages 217–225, Banff, Canada, 2004.
[67] Sungwook Yoon, Alan Fern, and Robert Givan. Approximate policy iteration with a policy language bias: Learning to solve relational Markov decision processes.