Online Bayesian Goal Inference for Boundedly-Rational Planning Agents
Tan Zhi-Xuan, Jordyn L. Mann, Tom Silver, Joshua B. Tenenbaum, Vikash K. Mansinghka
Massachusetts Institute of Technology {xuan|jordynm|tslvr|jbt|vkm}@mit.edu
Abstract
People routinely infer the goals of others by observing their actions over time. Remarkably, we can do so even when those actions lead to failure, enabling us to assist others when we detect that they might not achieve their goals. How might we endow machines with similar capabilities? Here we present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning, thereby accounting for sub-optimal behavior. These models are specified as probabilistic programs, allowing us to represent and perform efficient Bayesian inference over an agent's goals and internal planning processes. To perform such inference, we develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models, limiting computation by incrementally extending inferred plans as new actions are observed. We present experiments showing that this modeling and inference architecture outperforms Bayesian inverse reinforcement learning baselines, accurately inferring goals from both optimal and non-optimal trajectories involving failure and back-tracking, while generalizing across domains with compositional structure and sparse rewards.
Everyday experience tells us that it is impossible to plan ahead for everything. Yet, not only do humans still manage to achieve our goals by piecing together partial and approximate plans, we also appear to account for this cognitive strategy when inferring the goals of others, understanding that they might plan and act sub-optimally, or even fail to achieve their goals. Indeed, even 18-month-old infants seem capable of such inferences, offering their assistance to adults after observing them execute failed plans [1]. How might we understand this ability to infer goals from such plans? And how might we endow machines with this capacity, so they might assist us when our plans fail?

While there has been considerable work on inferring the goals and desires of agents, much of this work has assumed that agents act optimally to achieve their goals. Even when this assumption is relaxed, the forms of sub-optimality considered are often highly simplified. In inverse reinforcement learning, for example, agents are assumed to either act optimally [2] or to exhibit Boltzmann-rational action noise [3], while in plan recognition, longer plans are assigned exponentially decreasing probability [4]. None of these approaches account for the difficulty of planning itself, which may lead agents to produce sub-optimal or failed plans. This not only makes them ill-equipped to infer goals from such plans, but also saddles them with a cognitively implausible burden: if inferring an agent's goals requires knowing the optimal solution to reach each goal, then an observer would need to compute the optimal plan or policy for all of those goals in advance [5]. Outside of the simplest problems and domains, this is deeply intractable.
Figure 1: Our architecture performing online Bayesian goal inference via Sequential Inverse Plan Search. In (a), an agent exhibits a sub-optimal plan to acquire the blue gem, backtracking to pick up the key required for the second door. In (b), an agent exhibits a failed plan to acquire the blue gem, myopically using up its first key to get closer to the gem instead of realizing that it needs to collect the bottom two keys. In both cases, our method not only manages to infer the correct goal by the end, but also captures sharp human-like shifts in its inferences at key points, such as (a.ii) when the agent picks up a key unnecessary for the red gem, (a.iv) when the agent starts to backtrack, (b.iii) when the agent ignores the door to the red gem, or (b.iv) when the agent unlocks the first door to the blue gem.

In this paper, we present a unified modeling and inference architecture (Figure 2) that addresses both of these limitations. In contrast to prior work that models agents as actors that are noisily rational, we model agents as planners that are boundedly rational, interleaving resource-limited plan search with plan execution. This allows us to perform online Bayesian inference of plans and goals even from highly sub-optimal trajectories involving backtracking or irreversible failure (Figure 1). We do so by modeling agents as probabilistic programs (Figure 3), comprised of goal priors and domain-general planning algorithms (Figure 2(i)), interacting with a symbolic environment model (Figure 2(ii)). Inference is then performed via Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the replanning assumption of our agent models, incrementally inferring partial plans while limiting computational cost (Figure 2(iii)).

Our architecture delivers both accuracy and speed by being built in Gen, a general-purpose probabilistic programming system that supports customized inference using data-driven proposals and rejuvenation kernels [6], alongside an embedding of the Planning Domain Definition Language [7, 8], enabling the use of fast general-purpose planners [9] as modeling components. We evaluate our approach against a Bayesian inverse reinforcement learning baseline [10] on a wide variety of planning domains that exhibit compositional task structure and sparse rewards (e.g. Figure 1), achieving high accuracy on many domains, often with orders of magnitude less computation.
[Figure 2 schematic. (i) Boundedly-Rational Agent Program: a goal prior over logical specifications or reward functions, and an online replanner built from a planning algorithm (e.g. A*, probabilistic A*, HSP) and a domain-general planning heuristic (e.g. Manhattan distance, h_add). (ii) PDDL Environment Model: a state representation with logical predicates (e.g. on(A, B), at(Item, X, Y)), numeric fluents (e.g. xpos, ypos, fuel-used), and transition operators (e.g. stack(A, B), precondition: holding(A) & clear(B), effect: ¬holding(A) & on(A, B)). (iii) Inference: online plan hypothesis extension from observed actions, together with goal pruning, resampling, and rejuvenation.]
Figure 2: Our modeling and inference architecture is comprised of: (i) a programmatic model of a boundedly rational planning agent, implemented in the Gen probabilistic programming system; (ii) an environment model specified in the Planning Domain Definition Language (PDDL), facilitating support for a wide variety of planning domains and state-of-the-art symbolic planners; (iii) Sequential Inverse Plan Search (SIPS), a novel SMC algorithm that exploits the replanning assumption of our agent model to reduce computation, extending hypothesized plans only as new observations arrive.
Related Work

Inverse reinforcement learning (IRL).
A long line of work has shown how to learn reward functions as explanations of goal-directed agent behavior via inverse reinforcement learning [2, 11, 10, 12]. However, most such approaches are too costly for online settings or complex domains, as they require solving the underlying Markov Decision Process (MDP) for every posited goal or reward function, and for all possible initial states [13, 5]. Our approach instead assumes that agents are online model-based planners. This greatly reduces computation time, while also better reflecting humans' intuitive understanding of other agents.
Bayesian theory-of-mind (BToM).
Computational models of humans' intuitive theory-of-mind posit that we understand others' actions by Bayesian inference of their likely goals and beliefs. These models, largely built upon the same MDP formalism used in IRL, have been shown to make predictions that correspond closely with human inferences [14, 15, 16, 17, 18, 19, 20]. Our research extends this line of work by explicitly modeling an agent's partial plans, or intentions [21]. This allows our architecture to infer final goals from instrumental subgoals produced as part of a plan, and to account for sub-optimality in those plans, thereby enriching the range of mental inferences that BToM models can explain.
Plan recognition as planning (PRP).
Our work is related to the literature on plan recognition as planning, which performs goal and plan inference by using classical satisficing planners instead of a Boltzmann-rational MDP policy to model action likelihoods given a goal [22, 4, 23, 24, 25, 26]. However, because these approaches use a heuristic likelihood model that assumes goals are always achievable, they are unable to infer likely goals when irreversible failures occur. In contrast, we model agents as online planners who may occasionally execute partial plans that lead to dead ends.
Online goal inference.
Several recent papers have extended IRL to an online setting, but these have either focused on maximum-likelihood estimation in 1D state spaces [27, 28], or utilize an expensive value iteration subroutine that is unlikely to scale [29]. In contrast, we develop a sequential Monte Carlo algorithm that exploits the online nature of our agent models in order to perform incremental plan inference with limited computation cost.
Inferences from sub-optimal behavior.
We build upon a growing body of research on inferring goals and preferences while accounting for human sub-optimality [3, 30, 31, 32, 33], introducing a model of boundedly-rational planning as resource-limited search. This reflects a natural principle of resource rationality, under which agents are less likely to engage in costly computations [34, 35]. Unlike prior models of myopic agents which assign zero reward to future states beyond some time horizon [30, 32], our approach accounts for myopic planning in domains with instrumental subgoals and sparse rewards.
(a) One realization of our agent and environment model, showing the goal g, plans p_t, actions a_t, states s_t, observations o_t, and sampled search budgets η.
model UPDATE-PLAN(t, s_t, p_{t-1}, g)
    parameters: PLANNER, r, q, T, h
    if t > LENGTH(p_{t-1}) or s_t ∉ p_{t-1}[t] then
        η ∼ NEGATIVE-BINOMIAL(r, q)
        p̃_t ∼ PLANNER(s_t, g, h, T, η)
        p_t ← APPEND(p_{t-1}, p̃_t)
    else
        p_t ← p_{t-1}
    end if
    return p_t
end model
(i) Samples from P(p_t | s_t, p_{t-1}, g)

model SELECT-ACTION(t, s_t, p_t)
    return p_t[t][s_t]
end model
(ii) Samples from P(a_t | s_t, p_t)

(b) Boundedly-rational agent programs.
Figure 3: We model agents as boundedly rational planners that interleave search and execution of partial plans as they interact with the environment. In (a) we depict one possible realization of this model, where the agent initially samples a search budget η_1 and searches for a plan p_1 that is two actions long. At t = 2, no additional planning needs to be done, so p_2 is copied from p_1, as denoted by the dashed lines. The agent then replans at t = 3 from state s_3, sampling a new search budget η_3 and an extended plan p_3 with three more actions. We formally specify this agent model using probabilistic programs, with pseudo-code shown in (b). UPDATE-PLAN samples extended plans p_t given previous plans p_{t-1}, while SELECT-ACTION selects an action a_t according to the current plan p_t.

In order to account for sub-optimal behavior due to resource-limited planning, observers need to model not only an agent's goals and actions, but also the plans they form to achieve those goals. As such, we model agents and their environments as generative processes of the following form:
    Goal prior:          g ∼ P(g)                               (1)
    Plan update:         p_t ∼ P(p_t | s_t, p_{t-1}, g)         (2)
    Action selection:    a_t ∼ P(a_t | s_t, p_t)                (3)
    State transition:    s_{t+1} ∼ P(s_{t+1} | s_t, a_t)        (4)
    Observation noise:   o_{t+1} ∼ P(o_{t+1} | s_{t+1})         (5)

where g, p_t, a_t, and s_t are the agent's goals, the internal state of the agent's plan, the agent's action, and the environment's state at time t, respectively. For the purposes of goal inference, observers also assume that each state s_t might be subject to observation noise, producing an observed state o_t. This generative process, depicted in Figure 3(a), extends the standard model of MDP agents by modeling plans and plan updates explicitly, allowing us to represent not only agents that act according to some precomputed policy a_t ∼ π(a_t | s_t), but also agents that compute and update their plans p_t on-the-fly. We describe each component of this process in greater detail below.

To represent states, observations, goals, and distributions over goals in a general and flexible manner, our architecture embeds the Planning Domain Definition Language (PDDL) [7, 8], representing states s_t, state transitions P(s_t | s_{t-1}, a_{t-1}), and goals g in terms of predicate-based facts and relations, numeric expressions, and transition operators that specify the preconditions and effects of actions, as shown in Figure 2(ii). A prior over goals P(g) can then be specified as a probabilistic program over PDDL goal specifications, including numeric expressions corresponding to reward functions, as well as sets of goal predicates (e.g. has(gem)), equivalent to indicator reward functions. Observation noise P(o_{t+1} | s_{t+1}) can also be modeled by corrupting each Boolean predicate with some probability, and adding continuous (e.g. Gaussian) noise to numeric fluents.

Modeling Sub-Optimal Plans and Actions

To model sub-optimal plans, the basic insight we follow is that agents like ourselves are boundedly rational: we attempt to plan to achieve our goals efficiently, but are limited by our cognitive resources. The primary limitation we consider is that full-horizon planning is often costly or intractable. Instead, it may often make sense to form partial plans towards promising intermediate states, execute them, and replan from there. We model this by assuming that agents only search for a plan up to some budget η, before executing a partial plan to a promising state found during search. We operationalize η as the maximum number of nodes expanded (i.e., states explored), which we treat as a random variable sampled from a negative binomial distribution:

    η ∼ NEGATIVE-BINOMIAL(r, q)    (6)

The parameters r (maximum failure count) and q (continuation probability) characterize the persistence of a planner who may choose to give up after expanding each node. When r > 1, this induces a distribution over η that disfavors small search budgets, but decreases exponentially as η grows large.
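To make the structure of Equations (1)-(6) concrete, the following Python sketch simulates one episode of this generative process under simplifying assumptions (a finite goal set, a deterministic environment supplied as a function, and a toy 1-D chain world in the usage example). It is purely illustrative and is not the paper's Gen/Julia implementation; the planner, step_env, and observe arguments are hypothetical stand-ins for the components described above.

import random

def sample_neg_binomial(r, q):
    # Number of nodes expanded before the r-th "give up" event, where the
    # planner continues past each node with probability q (Eq. 6).
    expansions, failures = 0, 0
    while failures < r:
        if random.random() < q:
            expansions += 1
        else:
            failures += 1
    return expansions

def simulate_episode(goals, planner, step_env, observe, s0, horizon=20, r=2, q=0.9):
    # Illustrative rollout of the agent-environment process in Eqs. (1)-(6).
    g = random.choice(goals)                    # (1) goal prior (uniform over a finite set)
    s, plan, trajectory = s0, [], []
    for t in range(horizon):
        if t >= len(plan):                      # (2) plan update: replan when the plan runs out
            budget = sample_neg_binomial(r, q)  #     bounded search budget, Eq. (6)
            plan = plan + (planner(s, g, budget) or [None])   # extend plan; None = no-op
        a = plan[t]                             # (3) action selection from the current plan
        s = step_env(s, a)                      # (4) state transition
        o = observe(s)                          # (5) noisy observation of the new state
        trajectory.append((a, s, o))
    return g, trajectory

# Toy usage on a 1-D chain world: states are integers, goals are target positions.
goals = [3, -3]
planner = lambda s, g, budget: [1 if g > s else -1] * min(budget, abs(g - s))
step_env = lambda s, a: s if a is None else s + a
observe = lambda s: s + random.choice([-1, 0, 0, 0, 1])
print(simulate_episode(goals, planner, step_env, observe, s0=0))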
This model also assumes access to a planning algorithm capable of producing partial plans. While our architecture supports any such planner as a sub-component, in this work we focus on A* search due to its ability to support domain-general heuristics that can guide search in human-like ways [9, 36]. We also modify A* so that the search process is stochastic, accounting for the difficulty of ranking intermediate states during search. In particular, instead of always expanding the most promising successor state, we sample a successor s with probability:

    P_expand(s) ∝ exp(−f(s, g) / T)    (7)

where T is a temperature parameter controlling the randomness of search, and f(s, g) = c(s) + h(s, g) is the estimated total plan cost, i.e. the sum of the path cost c(s) so far with the estimated goal distance heuristic h(s, g). Upon termination, we simply return the most recently selected successor state, which is likely to have low total plan cost f(s, g) if the heuristic h(s, g) is informative.
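The sketch below illustrates this budget-limited, stochastic search rule (Equation (7)) as a hedged illustration rather than the actual planner: goal_test, successors, and heuristic are assumed to be supplied by the environment model, and duplicate detection is omitted for brevity.

import math
import random

def softmax_sample(items, scores, temperature):
    # Sample one item with probability proportional to exp(-score / temperature), Eq. (7).
    weights = [math.exp(-sc / temperature) for sc in scores]
    r, acc = random.random() * sum(weights), 0.0
    for item, w in zip(items, weights):
        acc += w
        if r <= acc:
            return item
    return items[-1]

def probabilistic_search(s0, goal_test, successors, heuristic, budget, temperature):
    # Bounded stochastic search: sample which frontier node to expand according to
    # f = c + h; if the budget runs out, return the partial plan to the most
    # recently expanded node.
    frontier = [(s0, [], 0.0)]                   # (state, plan so far, path cost c)
    last_plan = []
    for _ in range(budget):
        if not frontier:
            break
        f_values = [c + heuristic(s) for s, _, c in frontier]
        node = softmax_sample(frontier, f_values, temperature)
        frontier.remove(node)
        s, plan, c = node
        last_plan = plan
        if goal_test(s):
            return plan                          # complete plan found within the budget
        for a, s_next, step_cost in successors(s):
            frontier.append((s_next, plan + [a], c + step_cost))
    return last_plan                             # partial plan toward a promising state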
We incorporate these limitations into a model of how a boundedly rational planning agent interleaves search and execution, specified by the probabilistic programs UPDATE-PLAN and SELECT-ACTION in Figure 3(b). At each time t, the agent may reach the end of its last made plan p_{t-1} or encounter a state s_t not anticipated by the plan, in which case it will call the base planner (probabilistic A*) parameterized by the search temperature T, heuristic h, and a randomly sampled node budget η. The partial plan produced is then used to extend the original plan. Otherwise, the agent will simply continue executing its original plan, performing no additional computation.

Having specified our model, we can now state the problem of Bayesian goal inference. We assume that an observer receives a sequence of potentially noisy state observations o_{1:t} = (o_1, ..., o_t). Given the observations up to timestep t and a set of possible goals G, the observer's aim is to infer the agent's goal g ∈ G by computing the posterior:

    P(g | o_{1:t}) ∝ P(g) ∏_{τ=0}^{t-1} P(o_{τ+1} | s_{τ+1}) P(s_{τ+1} | s_τ, a_τ) P(a_τ | s_τ, p_τ) P(p_τ | s_τ, p_{τ-1}, g)    (8)

Computing this posterior exactly is intractable, as it requires marginalizing over all the random latent variables s_τ, a_τ, and p_τ. Instead, we develop a sequential Monte Carlo (SMC) procedure, shown in Algorithm 1, to perform approximate inference in an online manner, using samples from the posterior P(g | o_{1:t-1}) at time t-1 to inform sampling from the posterior P(g | o_{1:t}) at time t. We call this algorithm Sequential Inverse Plan Search (SIPS), because it sequentially inverts a search-based planning algorithm, inferring sequences of partial plans that are likely given the observations, and consequently the likely goals.

As in standard SMC schemes, we first sample a set of particles or hypotheses i ∈ [1, k], with corresponding weights w^i (lines 3-5). Each particle corresponds to a particular plan p^i_τ and goal g^i. As each new observation o_τ arrives, we extend the particles (lines 12-14) and reweight them by their likelihood of producing that observation (line 15). The collection of weighted particles thus approximates the full posterior over the unobserved variables in our model, including the agent's plans and goals. We describe several key features of this algorithm below.

Algorithm 1: Sequential inverse plan search for online Bayesian goal inference

1:  procedure SIPS(s_0, o_{1:t})
2:      parameters: k, number of particles; c, resampling threshold
3:      w^i ← 1 for i ∈ [1, k]                                      ▷ Initialize particle weights
4:      s^i_0, p^i_0, a^i_0 ← s_0, [], no-op for i ∈ [1, k]         ▷ Initialize states, plans and actions
5:      g^i ∼ GOAL-PRIOR() for i ∈ [1, k]                           ▷ Sample k particles from goal prior
6:      for τ ∈ [1, t] do
7:          if EFFECTIVE-SAMPLE-SIZE(w^1, ..., w^k) / k < c then    ▷ Resample and rejuvenate
8:              g^i, s^i_τ, p^i_τ, a^i_τ ∼ RESAMPLE([g, s_τ, p_τ, a_τ]^k) for i ∈ [1, k]
9:              g^i, s^i_τ, p^i_τ, a^i_τ ∼ REJUVENATE(g^i, o_τ, s^i_τ, p^i_τ, a^i_τ) for i ∈ [1, k]
10:         end if
11:         for i ∈ [1, k] do                                       ▷ Extend each particle to timestep τ
12:             s^i_τ ∼ P(s_τ | s^i_{τ-1}, a^i_{τ-1})               ▷ Sample state transition
13:             p^i_τ ∼ UPDATE-PLAN(p_τ | s^i_τ, p^i_{τ-1}, g^i)    ▷ Extend plan if necessary
14:             a^i_τ ∼ SELECT-ACTION(a_τ | s^i_τ, p^i_τ)           ▷ Select action
15:             w^i ← w^i · P(o_τ | s^i_τ)                          ▷ Update particle weight
16:         end for
17:     end for
18:     w̃^i ← w^i / Σ_{j=1}^{k} w^j for i ∈ [1, k]                  ▷ Normalize particle weights
19:     return [(g^1, w̃^1), ..., (g^k, w̃^k)]                       ▷ Return weighted goal particles
20: end procedure
21: procedure REJUVENATE(g, o_τ, s_τ, p_τ, a_τ)                     ▷ Metropolis-Hastings rejuvenation move
22:     parameters: p_g, goal rejuvenation probability
23:     if BERNOULLI(p_g) then                                      ▷ Heuristic-driven goal proposal
24:         g′ ∼ Q(g′) := SOFTMAX([−h(o_τ, g̃) for g̃ ∈ G])           ▷ Propose g′ based on est. distance to o_τ
25:         s′_τ, p′_τ, a′_τ ∼ P(s_τ, p_τ, a_τ | g′)                ▷ Sample trajectory under new goal g′
26:         α ← Q(g) / Q(g′)                                        ▷ Compute proposal ratio
27:     else                                                        ▷ Error-driven replanning proposal
28:         t* ∼ Q(t* | s_τ, o_τ)                                   ▷ Sample a time close to when s diverges from o
29:         s′_{t*:τ}, p′_{t*:τ}, a′_{t*:τ} ∼ Q(s_{t*:τ}, p_{t*:τ}, a_{t*:τ} | o_{t*:τ})   ▷ Propose new plan sequence p′_{t*:τ}
30:         α ← Q(s_{t*:τ}, p_{t*:τ}, a_{t*:τ} | o_{t*:τ}) / Q(s′_{t*:τ}, p′_{t*:τ}, a′_{t*:τ} | o_{t*:τ})   ▷ Compute proposal ratio
31:         α ← α · Q(t* | s′_τ, o_τ) / Q(t* | s_τ, o_τ)            ▷ Reweight by auxiliary proposal ratio
32:     end if
33:     α ← α · P(o_τ | s′_τ) / P(o_τ | s_τ)                        ▷ Compute acceptance ratio
34:     return g′, s′_τ, p′_τ, a′_τ if BERNOULLI(min(α, 1)) else g, s_τ, p_τ, a_τ   ▷ Accept or reject proposals
35: end procedure
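For concreteness, a minimal Python sketch of the particle-filter core of Algorithm 1 is given below (resampling and reweighting only; the rejuvenation moves of lines 21-35 are omitted). The helpers update_plan, select_action, step_env, and obs_likelihood are hypothetical stand-ins for the probabilistic programs and likelihoods defined earlier; this is a schematic illustration, not the released Julia implementation.

import random

def sips_step(particles, obs, update_plan, select_action, step_env, obs_likelihood,
              resample_threshold=0.25):
    # One update of the particle filter in Algorithm 1 (rejuvenation omitted):
    # optionally resample, then extend each particle by one timestep and reweight
    # it by the likelihood of the new observation. Each particle is a dict with
    # keys "goal", "state", "plan", "action", and "w".
    weights = [p["w"] for p in particles]
    if any(weights):                                   # resample when the ESS is low
        ess = sum(weights) ** 2 / sum(w * w for w in weights)
        if ess / len(particles) < resample_threshold:
            chosen = random.choices(particles, weights=weights, k=len(particles))
            particles = [dict(p, w=1.0) for p in chosen]   # shallow copies with reset weights
    for p in particles:
        p["state"] = step_env(p["state"], p["action"])            # sample state transition
        p["plan"] = update_plan(p["state"], p["plan"], p["goal"]) # extend plan if needed
        p["action"] = select_action(p["state"], p["plan"])        # next action under the plan
        p["w"] *= obs_likelihood(obs, p["state"])                 # weight by observation
    return particles

def goal_posterior(particles):
    # Normalize particle weights into an approximate posterior over goals.
    total = sum(p["w"] for p in particles) or 1.0
    posterior = {}
    for p in particles:
        posterior[p["goal"]] = posterior.get(p["goal"], 0.0) + p["w"] / total
    return posterior

Calling sips_step once per incoming observation and goal_posterior afterwards mirrors the outer loop and final normalization of Algorithm 1.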
A key aspect that makes SIPS a genuinely online algorithm is the modeling assumption that agents also plan online. This obviates the need for the observer to precompute a complete plan or policy for each of the agent's possible goals in advance, and instead defers such computation to the point where the agent reaches a time t that the observer's hypothesized plans do not yet reach. In particular, for each particle i, the corresponding plan hypothesis p^i_{t-1} is extended (Algorithm 1, line 13) by running the UPDATE-PLAN procedure in Figure 3(b.i), which only performs additional computation if p^i_{t-1} does not already contain a planned action for time t and state s_t. This means that at any given time t, only a small number of plans require extension, limiting the number of expensive planning calls.

We also introduce resampling and rejuvenation steps into SIPS in order to ensure particle diversity. Whenever the effective sample size falls below a threshold c (line 7), we resample the particles (line 8), effectively pruning low-weight goal and plan hypotheses. After resampling, we perform rejuvenation by applying a mixture of two data-driven Metropolis-Hastings kernels to each particle. The first kernel uses a heuristic-driven goal proposal (lines 24-26), which proposes goals g̃ ∈ G that are close in heuristic distance h(o_τ, g̃) to the last observed state o_τ. This ensures that the algorithm can reintroduce goals that were initially pruned during a resampling step but might now be more likely. The second kernel uses an error-driven replanning proposal (lines 28-31), which samples a time close to the divergence point between the hypothesized and observed trajectories, and then proposes with high probability to replan from that time, thereby constructing a new sequence of hypothesized partial plans that are less likely to diverge from the observations. Collectively, these proposals help to ensure that the set of hypotheses is both diverse and likely given the observations.

Experiments
We demonstrate the capabilities of our architecture on a variety of planning domains and goal inference problems, showing that our approach outperforms Bayesian IRL baselines in both speed and accuracy. Beyond these quantitative improvements, we also present qualitative experiments demonstrating the novel capacity of our approach to infer goals from sub-optimal trajectories involving backtracking and failure.
We validate our approach on domains with varying degrees of complexity, both in terms of the size of the state space |S|, a measure of planning difficulty, and the number of possible goals |G|, a measure of inference difficulty. All domains are characterized by compositional structure and sparse rewards, posing a challenge for standard MDP-based approaches.
Taxi (|G| = 3): A benchmark domain used in hierarchical reinforcement learning [37], where a taxi has to transport a passenger from one location to another in a gridworld.

Doors, Keys, & Gems (|G| = 3): A domain in which an agent must navigate a maze with doors, keys, and gems (Figure 1). Each key can be used once to unlock a door, allowing the agent to acquire items behind that door. Goals correspond to acquiring one out of three colored gems.

Block Words (|G| = 5): A Blocks World variant adapted from [4] where blocks are labeled with letters. Goals correspond to stacking block towers that spell one of a set of five English words.

Intrusion Detection (|G| = 20): A cybersecurity-inspired domain drawn from [4], where an agent might perform a variety of attacks on a set of servers. There are 20 possible goals, each corresponding to a set of attacks (e.g. cyber-vandalism or data-theft) on up to 20 servers.

We implemented Bayesian IRL (BIRL) baselines by running value iteration to compute a Boltzmann-rational policy π(a_t | s_t, g) for each possible goal g ∈ G. Following the human-like setting of early Bayesian theory-of-mind approaches [16], goals were treated as indicator reward functions, and a uniform prior P(g) was assumed over goals. Inference was then performed by exact computation of the posterior over reward functions, using the policy as the likelihood model for actions.

Due to the exponentially large state space of many of our domains, standard value iteration (VI) often failed to converge even after hundreds of thousands of iterations. As such, we implemented two variants of BIRL that use asynchronous VI, sampling states instead of fully enumerating them. The first, unbiased BIRL, uses uniform random sampling of the state space up to 250,000 iterations, sufficient for convergence in the Block Words domain. The second, oracle BIRL, assumes oracular access to the full set of observed trajectories in advance, and performs biased sampling of states that appear in those trajectories. Although inapplicable in practice for online use, this ensures that the computed policy is able to reach the goal in all cases, making it a useful benchmark for comparison.

To investigate the novel capabilities of our approach, we performed a set of qualitative goal inference experiments on a set of hand-crafted trajectories involving sub-optimal or failed plans. The experiments were performed on the Doors, Keys & Gems domain because it allows for irreversible failures. Two illustrative examples are shown in Figure 1. In Figure 1(a), SIPS accurately infers goals from a suboptimal plan with substantial backtracking, initially placing more posterior mass on the yellow gem when the agent acquires the first key (panel ii), but then switching to the blue gem once the agent backtracks to the second key (panel iv). In Figure 1(b), SIPS remains uncertain about all three goals when the first key is acquired (panel ii), but discards the red gem as a possibility when the agent walks past the door (panel iii), and finally converges upon the blue gem when the agent myopically unlocks the first door required to access that gem (panel iv). In contrast, the BIRL baselines fail to infer the correct goal from the failed plan. Due to space constraints, we present those baseline results in the supplement, alongside more qualitative experiments.
Table 1:
Accuracy and runtime of goal inference across domains and inference methods. We quantify accuracy at the first, second and third quartiles (Q1–Q3) of each observed trajectory via the posterior probability of the true goal P(g_true | o), and the fraction of problems where g_true is top-ranked (Top-1). We measure runtime in terms of the start-up cost (C0), marginal cost per timestep (MC), and average cost per timestep (AC) in seconds. We also report the total number of states visited (N) during either search or value iteration as a platform-independent measure of cost. The best results for each metric (excluding the oracle baseline) are in bold.

To evaluate accuracy and speed, we ran each inference method on a dataset of optimal and non-optimal agent trajectories for each domain, assuming a uniform prior over goals. The optimal trajectories were generated using A* search with an admissible heuristic for each possible goal in the domain. Non-optimal trajectories were generated using the replanning agent model in Figure 3(b), with parameters r = 2, q = 0. , T = 10. We found that SIPS achieved good performance with 10 particles per goal without the use of rejuvenation moves in these experiments, so we report those results here. Further experimental details and hyperparameters can be found in the supplement.

We present results of these experiments in Table 1, with additional disaggregated and baseline results in the supplement. Our method greatly outperforms the unbiased BIRL baseline in both accuracy and speed in three out of four domains, with an average runtime (AC) often several orders of magnitude smaller. This is largely because unbiased VI fails to converge except for the highly restricted Taxi domain. In the other domains, large amounts of computation still fail to find a good policy, leading to a very poor action likelihood model. In contrast, SIPS requires far less initial computation, albeit with higher marginal cost due to its online generation of hypothesized partial plans. In fact, it achieves comparable accuracy and speed to the oracle BIRL baseline, sometimes with less computation (e.g. in Doors, Keys & Gems). SIPS also produces higher estimates of the goal posterior P(g_true | o). This is a reflection of the underlying agent model, which assumes randomness at the level of planning instead of acting. As a result, even a few observations can provide substantial evidence that a particular plan and goal was chosen.

In this paper, we demonstrated an architecture capable of online inference of goals and plans, even when those plans might fail. However, several important limitations remain. First, we considered only finite sets of goals, but the space of goals that humans pursue is easily infinite. A promising next step would thus be to express goal priors as probabilistic grammars or programs over specifications and reward functions, capturing both the infinitude and structure of the motives we attribute to each other [38, 39]. Second, unlike the domains considered here, the environments we operate in often involve stochastic dynamics and infinite action spaces [40, 41]. A natural extension would be to integrate Monte Carlo Tree Search or sample-based motion planners into our architecture as modeling components [42], potentially parameterized by learned heuristics [43]. With hope, our architecture might then approach the full complexity of problems that humans face every day, whether it is stacking blocks as a kid, preparing a meal in the kitchen, or writing a research paper.
Broader Impact
We embarked upon this research in the belief that, as increasingly powerful autonomous systems become embedded in our society, it may eventually become necessary for them to accurately understand our goals and values, so as to robustly act in our collective interest. Crucially, this will require such systems to understand the ways in which humans routinely fail to achieve our goals, and not take that as evidence that those goals were never desired. Due to our manifold cognitive limitations, gaps emerge between our goals and our intentions, our intentions and our actions, our beliefs and our conclusions, and our ideals and our practices. To the extent that we would like machines to aid us in actualizing the goals and ideals we most value, rather than those we appear to be acting towards, it will be critical for them to understand how, when, and why those gaps emerge. This aspect of the value alignment problem has thus far been under-explored [44]. By contributing this piece of research at the intersection of cognitive science and AI, we hope to lay some of the conceptual and technical groundwork that may be necessary to understand our boundedly-rational behavior.

Of course, the ability to infer the goals of others, and to do so online and despite failures, has many more immediate uses, each of them with its own set of benefits and risks. Perhaps the most straightforwardly beneficial are assistive use cases, such as smart user interfaces [45], intelligent personal assistants, and collaborative robots, which may offer to aid a user if that user appears to be pursuing a sub-optimal plan. However, even those use cases come with the risk of reducing human autonomy, and care should be taken so that such applications ensure the autonomy and willing consent of those being aided [46].

More concerning, however, is the potential for such technology to be abused for manipulative, offensive, or surveillance purposes. While the state of the research presented in this paper is nowhere near the level of integration that might be necessary for active surveillance purposes, it is highly likely that mature versions of this technology will be co-opted for such purposes by governments, military, and other institutions that provide security. While detecting and inferring "suspicious intent" may not seem harmful in its own right, these uses need to be considered within the broader context of history and society, and the ways in which marginalized groups of people are over-policed and incarcerated. As such, we urge future research on this topic to consider seriously the ways in which technology of this sort will most likely be used, by which institutions, and whether those uses will tend to lead to just and beneficial outcomes for society as a whole. The ability to infer and understand the motives of others is a skill that can be wielded to both great benefit and great harm. We ought to use it wisely.
References

[1] Felix Warneken and Michael Tomasello. Altruistic helping in human infants and young chimpanzees. Science, 311(5765):1301–1303, 2006.
[2] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 663–670, 2000.
[3] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
[4] Miguel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[5] Bernard Michini and Jonathan P How. Improving the efficiency of Bayesian inverse reinforcement learning. In , pages 3651–3656. IEEE, 2012.
[6] Marco F Cusumano-Towner, Feras A Saad, Alexander K Lew, and Vikash K Mansinghka. Gen: a general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 221–236. ACM, 2019.
[7] Drew McDermott, Malik Ghallab, Adele Howe, Craig Knoblock, Ashwin Ram, Manuela Veloso, Daniel Weld, and David Wilkins. PDDL - the Planning Domain Definition Language, 1998.
[8] Maria Fox and Derek Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research, 20:61–124, 2003.
[9] Blai Bonet and Héctor Geffner. Planning as heuristic search. Artificial Intelligence, 129(1-2):5–33, 2001.
[10] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.
[11] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1, 2004.
[12] Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
[13] Daniel S Brown and Scott Niekum. Deep Bayesian reward learning from preferences. arXiv preprint arXiv:1912.04472, 2019.
[14] Noah D Goodman, Chris L Baker, Elizabeth Baraff Bonawitz, Vikash K Mansinghka, Alison Gopnik, Henry Wellman, Laura Schulz, and Joshua B Tenenbaum. Intuitive theories of mind: A rational approach to false belief. In Proceedings of the Twenty-Eighth Annual Conference of the Cognitive Science Society, volume 6. Cognitive Science Society, Vancouver, 2006.
[15] Chris L Baker, Joshua B Tenenbaum, and Rebecca R Saxe. Goal inference as inverse planning. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 29, 2007.
[16] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.
[17] Chris Baker, Rebecca Saxe, and Joshua Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. In Proceedings of the Annual Meeting of the Cognitive Science Society, 33(33), 2011.
[18] Julian Jara-Ettinger, Hyowon Gweon, Laura E Schulz, and Joshua B Tenenbaum. The naïve utility calculus: Computational principles underlying commonsense psychology. Trends in Cognitive Sciences, 20(8):589–604, 2016.
[19] Chris L Baker, Julian Jara-Ettinger, Rebecca Saxe, and Joshua B Tenenbaum. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour, 1(4):1–10, 2017.
[20] Julian Jara-Ettinger, Laura Schulz, and Josh Tenenbaum. The naive utility calculus as a unified, quantitative framework for action understanding. PsyArXiv, 2019.
[21] Michael Bratman. Intention, Plans, and Practical Reason, volume 10. Harvard University Press, Cambridge, MA, 1987.
[22] Miquel Ramírez and Hector Geffner. Plan recognition as planning. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.
[23] Shirin Sohrabi, Anton V Riabov, and Octavian Udrea. Plan recognition as planning revisited. In IJCAI, pages 3258–3264, 2016.
[24] Daniel Höller, Gregor Behnke, Pascal Bercher, and Susanne Biundo. Plan and goal recognition as HTN planning. In , pages 466–473. IEEE, 2018.
[25] Gal A Kaminka, Mor Vered, and Noa Agmon. Plan recognition in continuous domains. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[26] Mor Vered, Ramon Fraga Pereira, Mauricio Cecilio Magnaguagno, Felipe Meneguzzi, and Gal A Kaminka. Online goal recognition as reasoning over landmarks. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[27] Pratiksha Thaker, Joshua B Tenenbaum, and Samuel J Gershman. Online learning of symbolic concepts. Journal of Mathematical Psychology, 77:10–20, 2017.
[28] Ryan Self, Michael Harlan, and Rushikesh Kamalapurkar. Online inverse reinforcement learning for nonlinear systems. In , pages 296–301. IEEE, 2019.
[29] Nicholas Rhinehart and Kris Kitani. First-person activity forecasting from video with online inverse reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[30] Owain Evans and Noah D Goodman. Learning the preferences of bounded agents. In NIPS Workshop on Bounded Optimality, volume 6, 2015.
[31] Owain Evans, Andreas Stuhlmüller, and Noah Goodman. Learning the preferences of ignorant, inconsistent agents. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[32] Rohin Shah, Noah Gundotra, Pieter Abbeel, and Anca D Dragan. On the feasibility of learning, rather than assuming, human biases for reward inference. arXiv preprint arXiv:1906.09624, 2019.
[33] Stuart Armstrong and Sören Mindermann. Occam's razor is insufficient to infer the preferences of irrational agents. In Advances in Neural Information Processing Systems, pages 5598–5609, 2018.
[34] Thomas L Griffiths, Falk Lieder, and Noah D Goodman. Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic. Topics in Cognitive Science, 7(2):217–229, 2015.
[35] Mark K Ho, David Abel, Jonathan D Cohen, Michael L Littman, and Thomas L Griffiths. The efficiency of human cognition reflects planned information processing. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[36] Hector Geffner. Heuristics, planning and cognition. In Heuristics, Probability and Causality: A Tribute to Judea Pearl. College Publications, 2010.
[37] Thomas G Dietterich. The MAXQ method for hierarchical reinforcement learning. In ICML, volume 98, pages 118–126. Citeseer, 1998.
[38] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
[39] Feras A Saad, Marco F Cusumano-Towner, Ulrich Schaechtle, Martin C Rinard, and Vikash K Mansinghka. Bayesian synthesis of probabilistic programs for automatic data modeling. Proceedings of the ACM on Programming Languages, 3(POPL):1–32, 2019.
[40] Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. STRIPS planning in infinite domains. arXiv preprint arXiv:1701.00287, 2017.
[41] Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 30, pages 440–448, 2020.
[42] Marco F Cusumano-Towner, Alexey Radul, David Wingate, and Vikash K Mansinghka. Probabilistic programs for inferring the goals of autonomous agents. arXiv preprint arXiv:1704.04977, 2017.
[43] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[44] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin, 2019.
[45] Eric J Horvitz, John S Breese, David Heckerman, David Hovel, and Koos Rommelse. The Lumiere project: Bayesian user modeling for inferring the goals and needs of software users. arXiv preprint arXiv:1301.7385, 2013.
[46] Vasanth Sarathy, Thomas Arnold, and Matthias Scheutz. When exceptions are the norm: Exploring the role of consent in HRI. ACM Transactions on Human-Robot Interaction (THRI), 8(3):1–21, 2019.

Supplemental Material
Online Bayesian Goal Inference for Boundedly-Rational Planning Agents
Tan Zhi-Xuan, Jordyn L. Mann, Tom Silver, Joshua B. Tenenbaum, Vikash K. Mansinghka
Massachusetts Institute of Technology {xuan|jordynm|tslvr|jbt|vkm}@mit.edu
A Experimental Details
Below we provide experimental details for each of the inference methods described in the main text. We have also performed additional experiments using a baseline adapted from the plan recognition as planning (PRP) literature [1], which we include below as a useful offline benchmark.
A.1 Sequential Inverse Plan Search
We conducted experiments using two main variants of Sequential Inverse Plan Search (SIPS), the first using data-driven rejuvenation, as described in the main text, and the second without. Rejuvenation is necessary for the results shown in Figure 1 of the main text, and for highly sub-optimal and failed plans more generally. However, rejuvenation is also hard to tune, and can increase runtime due to the need to replan. We thus report results without rejuvenation in our quantitative experiments.

Hyper-parameters for the qualitative experiments are given in each of the corresponding figures in Section B.1. For the quantitative experiments, we used SIPS with 10 particles per possible goal (e.g., 50 particles for the Block Words domain), with a resampling threshold of c = 1/ . For the underlying agent model, we assumed a search temperature of T = 10 and persistence parameters of r = 2 and q = 0. (giving an average search budget of nodes). We varied the search heuristic h to suit the type of domain: for the gridworld-based domains (Taxi; Doors, Keys & Gems), we used a Manhattan distance heuristic to the goal. For the other domains (Block Words; Intrusion Detection), we used the h_add heuristic introduced by the HSP algorithm [2] as a generalized relaxed-distance heuristic.

SIPS also requires the specification of an observation model P(o | s), in order to score the likelihood of a hypothesized state trajectory ŝ_1, ..., ŝ_t given the observed states o_1, ..., o_t. We defined this observation model by adding zero-mean Gaussian noise with σ = 0. for each numeric variable in the state (e.g., the agent's position in a gridworld), and Bernoulli corruption noise with p = 0. for each Boolean variable in the state (e.g. whether block A is on top of block B).

All SIPS experiments were performed using Plinf.jl, a Julia implementation of our modeling and inference architecture that integrates the Gen probabilistic programming system with
PDDL.jl, a Julia interpreter for the Planning Domain Definition Language [3]. Experiments were run on a 1.9 GHz Intel Core i7 processor with 16 GB RAM.
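As an illustration of the observation model described earlier in this section, the sketch below computes a per-state log-likelihood for a state represented as a dictionary of Boolean predicates and numeric fluents. The flip probability and noise standard deviation are placeholder values (the exact values are truncated in the text above), and the flat dictionary representation is a simplification of the PDDL state used by Plinf.jl.

import math

FLIP_PROB = 0.05   # placeholder Bernoulli corruption probability for Boolean predicates
NOISE_STD = 0.25   # placeholder Gaussian standard deviation for numeric fluents

def observation_log_likelihood(observed, hypothesized):
    # log P(o | s) for states given as {name: bool or float}.
    logp = 0.0
    for name, true_val in hypothesized.items():
        obs_val = observed[name]
        if isinstance(true_val, bool):
            # Boolean predicate: flipped with probability FLIP_PROB.
            p = 1.0 - FLIP_PROB if obs_val == true_val else FLIP_PROB
            logp += math.log(p)
        else:
            # Numeric fluent: zero-mean Gaussian noise with std NOISE_STD.
            z = (obs_val - true_val) / NOISE_STD
            logp += -0.5 * z * z - math.log(NOISE_STD * math.sqrt(2 * math.pi))
    return logp

# Example: a slightly perturbed gridworld position with a correctly observed predicate.
print(observation_log_likelihood({"xpos": 3.1, "has_key": True},
                                 {"xpos": 3.0, "has_key": True}))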
A.2 Bayesian Inverse Reinforcement Learning
Bayesian Inverse Reinforcement Learning (BIRL) requires computing an approximate value function Q(s, a) offline, and a posterior over goals online using the likelihood P(a | s, g) = (1/Z) e^{α · Q(s, a)}, where Z is the partition function and α is a hyperparameter. We used α = 1 for all domains, which we found to perform well in preliminary experiments. To approximate the value function, we considered value iteration (VI) with a temporal discount factor of γ = 0. .

As discussed in the main text, several of the domains considered in this work have state spaces that are too large to enumerate, making standard VI intractable. We therefore used asynchronous VI, sampling states instead of fully enumerating them, for 250,000 iterations for the unbiased baseline (BIRL-U). Preliminary experiments suggested that running for up to 1,000,000 iterations did not appreciably improve results. Taxi, which has a far smaller state space than the other domains, was run with 10,000 iterations, which was consistently sufficient for convergence. For the oracle baseline (BIRL-O), 2500 iterations were sufficient to reach convergence for the Taxi domain, and 10,000 iterations for the other domains.

All BIRL experiments were written in Python and run on a 2.9 GHz Intel Core i9 processor with 32 GB RAM. We made use of the PDDLGym library [4] for instantiating the PDDL planning problems as OpenAI Gym environments. To perform asynchronous VI efficiently, we implemented state samplers and valid action generators for each domain. The unbiased version of BIRL (BIRL-U) uses these state samplers to sample states within asynchronous VI. For the oracle baseline (BIRL-O), which has access to the test-time trajectories, we instead sampled one state uniformly at random from the states visited across all test-time trajectories.
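For illustration, the following sketch shows the Boltzmann-rational action likelihood and the exact goal posterior computation described above, assuming Q-values have already been computed by (asynchronous) value iteration and are supplied as dictionaries Q[g][(s, a)]. This is a schematic restatement, not the baseline's actual Python code.

import math

def boltzmann_action_likelihood(Q_g, s, a, actions, alpha=1.0):
    # P(a | s, g) proportional to exp(alpha * Q_g(s, a)), normalized over available actions.
    logits = [alpha * Q_g[(s, b)] for b in actions]
    m = max(logits)
    Z = sum(math.exp(l - m) for l in logits)
    return math.exp(alpha * Q_g[(s, a)] - m) / Z

def birl_goal_posterior(trajectory, Q, goals, actions, alpha=1.0):
    # Exact posterior over goals under a uniform prior, given observed (state, action) pairs.
    log_post = {g: 0.0 for g in goals}
    for s, a in trajectory:
        for g in goals:
            log_post[g] += math.log(boltzmann_action_likelihood(Q[g], s, a, actions, alpha))
    m = max(log_post.values())
    Z = sum(math.exp(v - m) for v in log_post.values())
    return {g: math.exp(v - m) / Z for g, v in log_post.items()}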
A.3 Plan Recognition as Planning

We adapted the plan recognition as planning (PRP) approach described in [1] as an offline benchmark that achieves high accuracy at the cost of substantially more runtime (up to 30 times) than SIPS. In the PRP approach, we use a heuristic approximation to the likelihood of a plan p given a goal g:

    P(p | g) ∝ e^{−β(|p| − |p*_g|)}    (1)

where p*_g is an optimal plan to the goal g, |p| denotes the length of the plan p, and β is a noise parameter. This likelihood function models agent rationality by placing exponentially less probability on costlier plans, where larger values of β correspond to more optimality.

In order to perform inference using this likelihood model, we first compute the optimal plan p*_g for each possible goal g in a domain. At each timestep t, we then construct a plan p^g_t to each goal g consistent with the observations so far, by computing an optimal partial plan p^+_t from the current observed state o_t to g, and then concatenating it with the initial sequence of actions p^−_t := a_1, ..., a_{t−1} taken by the agent, giving p^g_t = [p^−_t, p^+_t]. Under the additional approximation that p^g_t is the only plan consistent with the observation sequence o_1, ..., o_t, we can then compute the goal posterior as

    P(g | o_1, ..., o_t) ≈ e^{−β(|p^g_t| − |p*_g|)} / Σ_{g′∈G} e^{−β(|p^{g′}_t| − |p*_{g′}|)}    (2)

The main limitation of this approach is that it requires computation of an optimal partial plan p^+_t for every goal g at every timestep t, which scales poorly with the number of goals and timesteps per trajectory, especially when the observed trajectory leads the agent further and further away from most of the goals under consideration. This is in contrast to SIPS, which performs incremental computation by extending partial plans from previous timesteps. In addition, due to the assumption that there always exists a plan from the current observed state o_t to every goal g, the PRP approach is unable to account for irreversible failures. This is shown in our qualitative comparisons.

Nonetheless, because PRP still achieves high accuracy on many sub-optimal trajectories (at the expense of considerably more computation, especially on domains with many goals), we include it here as a benchmark for accuracy. All PRP experiments were performed on the same machine as the SIPS experiments, using the implementation of A* search provided by Plinf.jl.
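A small sketch of the PRP posterior in Equations (1)-(2) above: given the optimal plan length |p*_g| and the length |p^g_t| of an observation-consistent plan for each goal (both assumed to be computed by a planner), the posterior follows directly. The inputs below are hypothetical.

import math

def prp_goal_posterior(consistent_plan_len, optimal_plan_len, beta=1.0):
    # P(g | o_1..o_t) proportional to exp(-beta * (|p_t^g| - |p*_g|)) over goals g (Eq. 2).
    scores = {g: math.exp(-beta * (consistent_plan_len[g] - optimal_plan_len[g]))
              for g in consistent_plan_len}
    Z = sum(scores.values())
    return {g: s / Z for g, s in scores.items()}

# Example: goal B's observation-consistent plan is 4 steps longer than its optimal plan,
# so it receives exponentially less posterior mass than goal A.
print(prp_goal_posterior({"A": 10, "B": 14}, {"A": 10, "B": 10}, beta=1.0))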
B Additional Results

B.1 Qualitative Comparisons for Sub-Optimal & Failed Plans
Here we present detailed qualitative comparisons of the goal inferences made for sub-optimal and failed plans in the Doors, Keys & Gems domain. Figures S1 and S2 show the inferences made for two sub-optimal trajectories, while Figures S3 and S4 show the inferences made for two trajectories with irreversible failures. We omit the unbiased Bayesian IRL baseline (BIRL-U), because it is unable to solve the underlying Markov Decision Process in any of these examples, leading to a uniform posterior over goals over the entire trajectory.

Figure S1:
Goal inferences made by SIPS, BIRL-O, and PRP for the sub-optimal trajectory shown in Figure 1(a) of the main text. Predicted future trajectories in panels (i)–(iv) are made by SIPS. For SIPS, we used 30 particles per goal, a search temperature of T = 10, persistence parameters r = 2, q = 0. , and a Manhattan distance heuristic to the goal. Rejuvenation moves were used, with a goal rejuvenation probability of p_g = 0. . For BIRL-O, we used α = 5. For PRP, we used β = 1.

Figure S2:
Goal inferences made by SIPS, BIRL-O, and PRP for another sub-optimal trajectory. Predicted future trajectories in panels (i)–(iv) are made by SIPS. For SIPS, we used 30 particles per goal, a search temperature of T = 10, persistence parameters r = 2, q = 0. , and a Manhattan distance heuristic to the goal. Rejuvenation moves were used, with a goal rejuvenation probability of p_g = 0. . For BIRL-O, we used α = 5. For PRP, we used β = 1.

B.1.1 Sub-Optimal Plans

Figure S1 shows how the inferences produced by SIPS are more human-like, compared to the BIRL and PRP baselines. In particular, SIPS adjusts its inferences in a human-like manner, initially remaining uncertain between the 3 gems (panel i), placing more posterior mass on the yellow gem when the agent acquires the first key (panel ii), increasing that posterior mass when the agent appears to ignore the second key and unlock the first door (panel iii), but then switching to the blue gem once the agent backtracks towards the second key (panel iv).

While the inferences produced by BIRL display similar trends, they are much more gradual, because BIRL assumes noise at the level of acting instead of planning. In addition, the agent model underlying BIRL leads to strange artifacts, such as the rise in probability of the red gem when t < . This is because Boltzmann action noise places lower probability P(a | g) on an action a that leads to a goal g which is further away, due to the value function V_g associated with that goal g being smaller due to time discounting. As a result, when t < , BIRL computes that P(right | red) > P(right | yellow) and P(right | blue), leading to the red gem being inferred as the most likely goal.

Finally, PRP exhibits both over-confidence in the yellow gem and slow recovery towards the blue gem. This is due to the assumption that the likelihood of a plan p to some goal g is exponentially decreasing in its cost difference from the optimal plan p*_g. Between t = 10 and t = 20, all plans consistent with the observations to the blue gem are considerably longer than the optimal plan p*_blue. As a result, PRP gives very low probability to the blue gem. This effect continues for many timesteps after the agent starts to backtrack (t = 17 to t = 24), indicating that the PRP modeling assumptions are inadequate for plans with substantial backtracking.

Similar dynamics can be observed for the trajectory in Figure S2. The BIRL baseline performs especially poorly, placing high probability on the yellow gem even when the agent backtracks to collect the second key (t = 19 to t = 22). This again is due to the assumption of action noise instead of planning noise, making it much more likely under the BIRL model that an agent would randomly walk back towards the second key. The PRP baseline exhibits the same issues with over-confidence and slow recovery described earlier, placing so little posterior mass on the blue gem from t = 17 to t = 20 that it even considers the red gem to be more likely. In contrast, our method, SIPS, immediately converges to the blue gem once backtracking occurs at t = 20.

B.1.2 Failed Plans
The differences between SIPS and the baseline methods are even more striking for trajectories with irreversible failures. As shown in Figure S3, SIPS accurately infers that the blue gem is the most likely goal when the agent ignores the two keys at the bottom, instead turning towards the first door guarding the blue gem at t = 19. This inference also remains stable after t = 21, when the agent irreversibly uses up its key to unlock that door. SIPS is capable of such inferences because the search for partial plans is biased towards promising intermediate states. Since the underlying agent model assumes a relaxed distance heuristic that considers states closer to the blue gem as promising, the model is likely to produce partial plans that lead spatially toward the blue gem, even if those plans myopically use up the agent's only key.

In contrast, both BIRL and PRP fail to infer that the blue gem is the goal. BIRL initially places increasing probability on the red gem, due to Boltzmann action noise favoring goals which take less time to reach. While this probability decreases slightly as the agent detours from the optimal plan to the red gem, it remains the highest probability goal even after the agent uses up its key at t = 21. The posterior over goals stops changing after that, because there are no longer any possible paths to a goal. PRP exhibits a different failure mode. While it does not suffer from the artifacts due to Boltzmann action noise, it completely fails to account for the possibility that an agent might make a failed plan. As a result, the probability of the blue gem does not increase even after the agent turns towards it at t = 19. Furthermore, once failure occurs at t = 21, PRP ends up defaulting to a uniform distribution over the three gems, even though it had previously eliminated the red gem as a possibility.

The inferences in Figure S4 display similar trends. Once again, SIPS accurately infers that the blue gem is the goal, even slightly in advance of failure (panel iii). In contrast, BIRL wrongly infers that the red gem is the most likely, while PRP erroneously defaults to inferring upon failure that the only remaining acquirable gem (yellow) is the goal.

Figure S3: Goal inferences made by SIPS, BIRL-O, and PRP for the failed trajectory shown in Figure 1(b) of the main text. Predicted future trajectories in panels (i)–(iv) are made by SIPS. For SIPS, we used 30 particles per goal, a search temperature of T = 10, persistence parameters r = 2, q = 0. , and a maze-distance heuristic (i.e. distance to the goal, ignoring doors). Rejuvenation moves were used with p_g = 0. . For BIRL-O, we used α = 5. For PRP, we used β = 1.

Figure S4:
Goal inferences made by SIPS, BIRL-O, and PRP for another failed trajectory. Predicted future trajectories in panels (i)–(iv) are made by SIPS. For SIPS, we used 30 particles per goal, a search temperature of T = 10, persistence parameters r = 2, q = 0. , and a Manhattan distance heuristic to the goal. Rejuvenation moves were used, with a goal rejuvenation probability of p_g = 0. . For BIRL-O, we used α = 5. For PRP, we used β = 1.

B.2 Accuracy & Speed

Here we present quantitative comparisons of the accuracy and speed of each inference method. Tables S1 and S2 show the accuracy results for the optimal and sub-optimal datasets respectively. P(g_true | o) represents the posterior probability of the true goal, while Top-1 represents the fraction of problems where g_true is top-ranked. Accuracy metrics are reported at the first (Q1), second (Q2), and third (Q3) quartiles of each observed trajectory. The corresponding standard deviations (taken across the dataset) are shown to the right of each accuracy mean.

Tables S3 and S4 show the runtime results for the optimal and sub-optimal datasets respectively. Runtime is reported in terms of the start-up cost (C0), marginal cost per timestep (MC), and average cost per timestep (AC), all measured in seconds. The corresponding standard deviations are shown to the right of each runtime mean. The total number (N) of states visited (during either plan search or value iteration) is also reported as a platform-independent cost metric.

Table S1:
Inference accuracy on the dataset of optimal trajectories.
Table S2:
Inference accuracy on the dataset of suboptimal trajectories.

In terms of accuracy alone, it can be seen that the PRP baseline generally achieves the highest metrics, with SIPS and BIRL-O performing comparably, and with BIRL-U completely incapable of making accurate inferences except in the Taxi domain. As demonstrated by the qualitative comparisons, however, these metrics alone may be misleading, failing to show how the inferences of each method really evolve over time. In particular, while the PRP baseline is routinely able to achieve the highest Top-1 accuracy, this may not correspond to a suitably calibrated posterior over goals, nor might it capture the sharp human-like changes over time that SIPS appears to display. It should also be noted that most of the domains considered do not allow for irreversible failures. As such, the distinctive capability of SIPS to infer goals despite failed plans is not captured by the results in Table S2.
Table S3:
Inference runtime on the dataset of optimal trajectories.
RuntimeDomain Method
C0 (s) MC (s) AC (s) NTaxi(3 Goals) SIPS 12.2 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ±
273 0.33 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ±
163 250000 ± ± ± ± ± ± ± ± ± Table S4:
Inference runtime on the dataset of suboptimal trajectories.

Once runtime is taken into account, it becomes clear that SIPS achieves the best balance between speed and accuracy due to its use of incremental computation. In contrast, BIRL-U requires orders of magnitude more initial computation while still failing to produce meaningful inferences, while PRP requires up to 30 times more computation per timestep. This is especially apparent on the Intrusion Detection domain, which has a large number of goals, requiring PRP to compute a large number of optimal plans at each timestep. Even the BIRL-O baseline, which assumes oracular access to the dataset of observed trajectories during value iteration, is slower than SIPS on the Doors, Keys & Gems domain in terms of average runtime. Overall, these results imply that SIPS is the only method suitable for online usage on the full range of domains we consider.
Code Availability
Code for our algorithms and experiments is provided together with this supplement.
References

[1] Miguel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[2] Blai Bonet and Héctor Geffner. Planning as heuristic search. Artificial Intelligence, 129(1-2):5–33, 2001.