Region-Based Approximations for Planning in Stochastic Domains
Nevin L. Zhang and Wenju Liu
Department of Computer Science, Hong Kong University of Science and Technology, {lzhang, wliu}@cs.ust.hk
Abstract
This paper is concerned with planning in stochastic domains by means of partially observable Markov decision processes (POMDPs). POMDPs are difficult to solve. This paper identifies a subclass of POMDPs called region observable POMDPs, which are easier to solve and can be used to approximate general POMDPs to arbitrary accuracy.
Keywords: planning under uncertainty, partially observable Markov decision processes, problem characteristics.

1 INTRODUCTION
To plan is to find a policy that will lead an agent to achieve a goal with minimum cost. When the environment of the agent, henceforth referred to as the world, is completely observable and the effects of actions are deterministic, planning is reduced to finding the shortest sequence of actions that leads the agent to the goal. In real-world applications, however, the world is rarely completely observable and effects of actions are almost always nondeterministic. For this reason, a growing number of researchers concern themselves with planning in stochastic domains (e.g. Dean and Wellman 1991, Cassandra et al., Boutilier et al. 1995, Parr and Russell 1995).

Partially observable Markov decision processes (POMDPs) can be used as a model for planning in such domains. In this model, nondeterminism in effects of actions is encoded by transition probabilities, partial observability of the world by observation probabilities, and goals and criteria for good plans by reward functions.

POMDPs are difficult to solve and approximation is a must in real-world applications. Most previous approximation methods (e.g. Cheng 1988, Lovejoy 1991b, and Parr and Russell 1995) are value function approximation methods in the sense that they approximate optimal value functions of POMDPs directly. We advocate model approximation methods.
Such a method approximates a POMDP itself by another that is easier to solve and uses the solution of the latter to construct an approximate solution to the original POMDP. Model approximation can be in the form of a more informative observation model, or a more deterministic action model, or an aggregation of the state space, or a combination of two or all of them. This paper investigates the first alternative.

The idea of approximating a POMDP by assuming a more informative observation model is not new. Cassandra et al. (1996) have proposed to approximate POMDPs by using MDPs. This paper generalizes the idea. We transform a POMDP by assuming that, in addition to the observations obtained by itself, the agent also receives a report from an oracle who knows the true state of the world. The oracle does not report the true state itself. Rather, he selects, from a list of candidate regions, a region that contains the true state and reports that region. The transformed POMDP is said to be region observable because the agent knows for sure that the true state is in the region reported by the oracle.

When all candidate regions are singletons, the oracle actually reports the true state of the world. In such a case, the region observable POMDP reduces to an MDP. MDPs are much easier to solve than POMDPs. One would expect the region observable POMDP to be solvable when all candidate regions are small.

In terms of quality of approximation, the larger the candidate regions, the less extra information the oracle provides and hence the more accurate the approximation. In the extreme case when there is only one candidate region and it consists of all possible states of the world, the oracle provides no extra information at all. Hence the region observable POMDP is identical to the original POMDP. A way to determine the quality of approximation will be described.
This allows one to make the tradeoff between approximation quality and computational complexity as follows: start with small candidate regions and increase their sizes gradually until the approximation becomes accurate enough or the region observable POMDP becomes intractable.

In many applications, the agent often has a good idea about the true state of the world. Take robot path planning as an example. Observing a landmark, a room number for instance, would imply that the robot is in the proximity of that landmark. Observing a feature of the world, a corridor T-junction for instance, might imply that the robot is in one of several regions. Taking history into account, the robot might be able to determine a unique region for its current location. Also, an action usually moves the true state of the world to only a few "nearby" states. Thus if the robot has a good idea about the current state of the world, it should continue to have a good idea about it in the next few steps. When the agent has a good idea about the true state at all times, accurate approximation can be achieved with small candidate regions.

We shall begin with a brief review of planning under uncertainty and POMDPs. We shall then formally introduce region observable POMDPs as an approximation to general POMDPs. Thereafter, we shall describe a way to determine the quality of approximation. Finally, we shall report empirical results, which suggest that when there is not much uncertainty, a POMDP can be approximated accurately by a region observable POMDP that has small candidate regions and can hence be solved exactly.

2 PLANNING UNDER UNCERTAINTY AND POMDPs
To specify a planning problem, one needs to give a set $S$ of possible states of the world, a set $\mathcal{O}$ of possible observations, and a set $A$ of possible actions. In this paper, all three sets are assumed to be finite. One also needs to give an observation model, which describes the relationship between an observation and the state of the world, and an action model, which describes the effects of each action. Furthermore, one needs to specify the initial state of the world and a goal state.

As a background example, consider path planning for a robot that acts in an office environment. Here $S$ is the set of all location-orientation pairs, $\mathcal{O}$ is the set of possible sensor readings, and $A$ consists of the actions move-forward, turn-left, turn-right, and declare-goal.

The current observation $o$ depends on the current state of the world $s$. Due to sensor noise, this dependency is uncertain in nature. The observation $o$ sometimes also depends on the action that the robot has just taken, $a_-$. The minus sign in the subscript indicates the previous time point. In the POMDP model, the dependency of $o$ upon $s$ and $a_-$ is numerically characterized by a conditional probability $P(o \mid s, a_-)$, which is usually referred to as the observation probability. It is the observation model.

In a region observable POMDP, the current observation also depends on the previous state of the world $s_-$. The observation probability for this case can be written

$$P(o \mid s, a_-, s_-).$$
The state $s_+$ the world will be in after taking an action $a$ depends on the action and on the current state $s$. The plus sign in the subscript indicates the next time point. This dependency is again uncertain in nature due to uncertainty in the actuator. In the POMDP model, the dependency of $s_+$ upon $s$ and $a$ is numerically characterized by a conditional probability $P(s_+ \mid s, a)$, which is usually referred to as the transition probability. It is the action model.

We will often need to consider the joint conditional probability $P(s_+, o_+ \mid s, a)$ of the next state of the world and the next observation given the current state and the current action. It is given by

$$P(s_+, o_+ \mid s, a) = P(s_+ \mid s, a)\, P(o_+ \mid s_+, a, s).$$
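The factorization above can be illustrated with a minimal Python sketch. The nested-list layout of `T` and `O`, the function name `joint_probability`, and the two-state toy numbers are assumptions made for illustration only; they do not come from the experiments reported in this paper.

```python
def joint_probability(T, O, s, a, s_next, o_next):
    """P(s+, o+ | s, a) = P(s+ | s, a) * P(o+ | s+, a, s).

    T[a][s][s_next]          : transition probability (hypothetical layout)
    O[a][s][s_next][o_next]  : observation probability, allowed to depend on
                               the previous state s as in the text
    """
    return T[a][s][s_next] * O[a][s][s_next][o_next]


# Toy model: two states, one action, two observations (illustrative numbers).
T = {"move": [[0.9, 0.1],
              [0.0, 1.0]]}
O = {"move": [[[0.8, 0.2], [0.3, 0.7]],
              [[0.8, 0.2], [0.3, 0.7]]]}

# Summing the joint over all next states and observations must give 1.
total = sum(joint_probability(T, O, 0, "move", s2, o2)
            for s2 in range(2) for o2 in range(2))
```

Storing the observation probability indexed by both the previous and the next state keeps the same function usable for an ordinary POMDP (where the previous state is ignored) and for a region observable one.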
The POMDP model encodes the starting state by a probability mass function $P_0$ over $S$. The planning goal is encoded by a reward function such as the following:

$$r(s, a) = \begin{cases} 1 & \text{if } a = \text{declare-goal and } s = \text{goal}, \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$

3 DECISION MAKING IN POMDPs
The agent chooses and executes an action at each time point. The choice is made based on the agent's knowledge about the true state of the world, which is summarized by a probability distribution over the set of possible states and called a belief state. The initial belief state is $P_0$. Suppose $b$ is the current belief state and $a$ is the current action. If the observation $o_+$ is obtained at the next time point, then the next belief state $b_+$ is given by

$$b_+(s_+) = k \sum_{s} P(s_+, o_+ \mid s, a)\, b(s), \quad (2)$$

where $k = 1 / \sum_{s, s_+} P(s_+, o_+ \mid s, a)\, b(s)$ is the normalization constant (Cassandra et al. 1994). To signify the dependence of $b_+$ upon $b$, $a$, and $o_+$, we shall sometimes write it as $b_+(\cdot \mid b, a, o_+)$.

A policy prescribes an action for each possible belief state. Formally it is a mapping $\pi$ from the set $\mathcal{B}$ of all possible belief states to $A$. For each belief state $b$, $\pi(b)$ is the action prescribed by $\pi$ for $b$. The value function of $\pi$ is defined for all belief states $b$ by

$$V^\pi(b) = E_b\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big],$$

where $0 < \gamma < 1$ is the discount factor and $r_t$ is the reward received at the $t$th step in the future. Intuitively, it is the expected discounted reward the agent can expect to receive starting from belief state $b$ if it behaves according to policy $\pi$. A policy $\pi^*$ is optimal if $V^{\pi^*}(b) \ge V^\pi(b)$ for all $b$ and all other policies $\pi$. The value function of an optimal policy is called the optimal value function and is usually denoted by $V^*$.

Policies for POMDPs can be found through value iteration (Bellman 1957). Value iteration begins with an arbitrary initial function $V_0^*(b)$ and improves it by using the following equation:

$$V_t^*(b) = \max_a \Big[ r(b, a) + \gamma \sum_{o_+} P(o_+ \mid b, a)\, V_{t-1}^*(b_+) \Big], \quad (3)$$

where $P(o_+ \mid b, a) = \sum_{s, s_+} P(s_+, o_+ \mid s, a)\, b(s)$, and $b_+$ is a shorthand for $b_+(\cdot \mid b, a, o_+)$. If $V_0^* = 0$, $V_t^*$ is called the $t$-step optimal value function. It is well known that when the Bellman residual $\max_{b \in \mathcal{B}} |V_t^*(b) - V_{t-1}^*(b)|$ becomes small, $V_t^*$ is close to $V^*$ and the greedy policy based on $V_t^*$,

$$\pi(b) = \arg\max_a \Big[ r(b, a) + \gamma \sum_{o_+} P(o_+ \mid b, a)\, V_t^*(b_+) \Big], \quad (4)$$

is a good approximation of the optimal policy (e.g. Puterman 1990).

Since there are uncountably many belief states, value iteration cannot be carried out explicitly. Fortunately, it can be carried out implicitly due to the piecewise linearity of the $t$-step optimal value function (Sondik 1971). More specifically, there exists a list $\mathcal{V}_t$ of functions of $s$, usually referred to simply as vectors, such that for any belief state $b$,

$$V_t^*(b) = \max_{V \in \mathcal{V}_t} \sum_s V(s)\, b(s). \quad (5)$$

A number of exact methods for solving POMDPs have been proposed (Monahan 1992, Eagle 1984, and Larke 1991 (see White 1991), Sondik 1971, Cheng 1988, Cassandra et al. et al. 1995, Cassandra et al. 1997).

Approximation is a must for real-world problems. Most previous approximate methods (e.g. Cheng 1988, Lovejoy 1991b, and Parr and Russell 1995) attempt to find a list of vectors that satisfies equation (5) approximately. This paper proposes to approximate POMDPs themselves by others that have more informative observations and hence are easier to solve.

We make the following assumption about problem characteristics. Even though in a POMDP $M$ the agent does not know the true state of the world, he often has a good idea about it. See the introduction for justifications of this assumption.

Consider another POMDP $M'$ which is the same as $M$ except that, in addition to the observation made by itself, the agent also receives a report from an oracle who knows the true state of the world. The oracle does not report the true state itself. Rather, he selects, from a list of candidate regions, a region that contains the true state and reports that region. More information is available to the agent in $M'$ than in $M$; the extra information is provided by the oracle.
When the agent already has a good idea about the true state of the world, the oracle does not provide much extra information even when the candidate regions are small. In such a case, $M'$ is a good approximation of $M$.

In $M'$, the agent knows for sure that the true state of the world is in the region reported by the oracle. For this reason, we say that it is region observable. The region observable POMDP $M'$ can be much easier to solve than $M$ when the candidate regions are small. For example, if the oracle is allowed to report only singleton regions, then he actually reports the true state of the world and hence $M'$ is an MDP. MDPs are much easier to solve than POMDPs.

We now set out to make the idea more concrete. Let us begin with the concept of region systems.

Region Systems

A region is simply a subset of states of the world. A region system is a collection of regions such that no region is a subset of other regions in the collection and the union of all regions equals the set of all possible states of the world. We shall use $R$ to denote a region and $\mathcal{R}$ to denote a region system.

Region systems are to be used to restrict the regions that the oracle can choose to report. There are many possible ways to construct a region system. A natural way is to create a region for each state by including its "nearby" states. Let us make this more precise.

Each action has an intended effect. The intended effect of move-forward, for instance, is to move one step forward. We say a state $s$ is ideally reachable in one step from another state $s'$ if there is an action whose intended effect is, when the world is currently in state $s'$, to take the world into state $s$. A state $s_k$ is ideally reachable in $k$ steps from another state $s_0$ if there are states $s_1, \ldots, s_{k-1}$ such that $s_{i+1}$ is ideally reachable from $s_i$ in one step for all $0 \le i \le k-1$. Any state is ideally reachable from itself in 0 steps.

For any non-negative integer $k$, the radius-$k$ region centered at a state $s$ consists of the states that are ideally reachable from $s$ in $k$ or fewer steps. A radius-$k$ region system is the one obtained by creating a radius-$k$ region for each state and then removing, one after another, regions that are subsets of others.

When $k$ is 0, the radius-$k$ region system consists of singleton regions. On the other hand, if there is a $k$ such that any state is ideally reachable from any other state in $k$ or fewer steps, then there is only one region in the radius-$k$ region system, which is the set of all possible states.

Region Observable POMDPs
To complete the definition of the region observable POMDP $M'$, assume a region system has been given and the oracle is allowed to choose regions only from the system. This subsection discusses how the oracle should choose regions from the system. The main issue is to minimize the amount of extra information.

To provide as little extra information as possible, the oracle should consider what the agent already knows. However, he cannot take the entire history of past actions and observations into account because if he did, $M'$ would not be a POMDP. We suggest the following rule.

For any non-negative function $f(s)$ of $s$ and any region $R$, we call the quantity

$$\mathrm{supp}(f, R) = \frac{\sum_{s \in R} f(s)}{\sum_{s \in S} f(s)}$$

the degree of support of $f$ by $R$. If $R$ supports $f$ to degree 1, we say that $R$ fully supports $f$.
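The radius-$k$ construction from the previous subsection and the degree of support just defined can be sketched in Python as follows. This is only an illustrative sketch: the dictionary `ideal_successors` (mapping each state to the states ideally reachable from it in one step), the function names, and the four-state corridor are assumptions and not part of the experiments.

```python
def radius_k_region(center, k, ideal_successors):
    """States ideally reachable from `center` in k or fewer steps."""
    region = {center}
    frontier = {center}
    for _ in range(k):
        # Expand one ideal step and keep only newly reached states.
        frontier = {t for s in frontier for t in ideal_successors[s]} - region
        region |= frontier
    return frozenset(region)


def radius_k_region_system(states, k, ideal_successors):
    """One radius-k region per state, minus duplicates and proper subsets."""
    regions = []
    for s in states:
        r = radius_k_region(s, k, ideal_successors)
        if r not in regions:
            regions.append(r)
    return [r for r in regions if not any(r < other for other in regions)]


def supp(f, region):
    """Degree of support of a non-negative function f (dict over S) by region."""
    return sum(f[s] for s in region) / sum(f.values())


# Hypothetical corridor 0 - 1 - 2 - 3 in which the agent can step either way.
states = [0, 1, 2, 3]
succ = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
system_0 = radius_k_region_system(states, 0, succ)
system_1 = radius_k_region_system(states, 1, succ)
system_3 = radius_k_region_system(states, 3, succ)
f = {0: 1.0, 1: 1.0, 2: 2.0, 3: 0.0}
```

In this toy corridor the radius-0 system consists of singletons, the radius-1 system keeps only the two maximal regions, and the radius-3 system collapses to the whole state space, matching the two extreme cases described above.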
Let $s_-$ be the previous true state of the world, $a_-$ be the previous action, and $o$ be the current observation. The oracle should choose, among all the regions in $\mathcal{R}$ that contain the true state of the world, one that supports the function $P(s, o \mid s_-, a_-)$ of $s$ to the maximum degree. Where there is more than one such region, choose the one that comes first in a predetermined ordering among the regions.

Here are the intuitions. If the previous world state $s_-$ were known to the agent, then his current belief state $b(s)$ would be proportional to $P(s, o \mid s_-, a_-)$. In this case, the rule minimizes extra information in the sense that it supports the current belief state to the maximum degree. Also, if the current observation is informative enough, being a landmark for instance, to ensure that the world state is in a certain region, then the region chosen using the rule fully supports the current belief state. In such a case, no extra information is provided.

We do not claim that the rule described above is optimal. Finding a rule that minimizes extra information is still an open problem.

The probability $P(R \mid s, o, s_-, a_-)$ of a region $R$ being chosen under the above scheme is given by

$$P(R \mid s, o, s_-, a_-) = \begin{cases} 1 & \text{if } R \text{ is the first region s.t. } s \in R \text{ and, for any other region } R' \text{ containing } s, \ \sum_{s' \in R} P(s', o \mid s_-, a_-) \ge \sum_{s' \in R'} P(s', o \mid s_-, a_-), \\ 0 & \text{otherwise.} \end{cases}$$
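The selection rule and the resulting 0/1 region probability can be sketched as below. The sketch assumes that the regions are given as a Python list whose order is the predetermined ordering, and that `f` is a dictionary holding $P(s, o \mid s_-, a_-)$ for the fixed $o$, $s_-$, and $a_-$; the names `choose_region` and `region_probability` are hypothetical.

```python
def choose_region(regions, true_state, f):
    """Pick the first region containing the true state with maximal support.

    The normalizing sum over all of S is the same for every region, so the
    unnormalized sums can be compared directly.
    """
    candidates = [r for r in regions if true_state in r]
    # max() returns the first maximal element, which implements the
    # tie-breaking by the predetermined ordering of `regions`.
    return max(candidates, key=lambda r: sum(f[s] for s in r))


def region_probability(region, regions, true_state, f):
    """P(R | s, o, s_prev, a_prev): 1 for the chosen region, 0 otherwise."""
    return 1.0 if choose_region(regions, true_state, f) == region else 0.0


# Illustrative numbers only.
regions = [frozenset({0, 1, 2}), frozenset({1, 2, 3})]
f = {0: 0.0, 1: 0.1, 2: 0.3, 3: 0.2}
flat = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}

picked = choose_region(regions, 2, f)          # state 2 lies in both regions
tie_pick = choose_region(regions, 1, flat)     # equal support: first region wins
```

Because the oracle's choice is deterministic given $s$, $o$, $s_-$, and $a_-$, the region probability is always 0 or 1, as in the displayed equation.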
The region observable POMDP $M'$ differs from the original POMDP $M$ only in terms of observation: in addition to the observation $o$ made by itself, the agent also receives a report $R$ from the oracle. We shall denote an observation in $M'$ by $z$ and write $z = (o, R)$. The observation model of $M'$ is given by

$$P(z \mid s, a_-, s_-) = P(o, R \mid s, a_-, s_-) = P(o \mid s, a_-)\, P(R \mid s, o, s_-, a_-).$$

Solving Region Observable POMDPs

For any region $R$, let $\mathcal{B}_R$ be the set of belief states that are fully supported by $R$. For any region system $\mathcal{R}$, let $\mathcal{B}_{\mathcal{R}} = \bigcup_{R \in \mathcal{R}} \mathcal{B}_R$.
Let $\mathcal{R}$ be the region system underlying the region observable POMDP $M'$. It is easy to see that no matter what the current belief state $b$ is, the next belief state $b_+$ must be in $\mathcal{B}_{\mathcal{R}}$. We assume that in $M'$ the initial belief state is in $\mathcal{B}_{\mathcal{R}}$. Then all possible belief states the agent might have are in $\mathcal{B}_{\mathcal{R}}$. This implies that policies for $M'$ need only be defined over $\mathcal{B}_{\mathcal{R}}$ and value iteration for $M'$ can be restricted to the subset $\mathcal{B}_{\mathcal{R}}$ of $\mathcal{B}$.

Restricting value iteration for $M'$ to $\mathcal{B}_{\mathcal{R}}$ implies that the $t$-step optimal value function $U_t^*$ of $M'$ is defined only over $\mathcal{B}_{\mathcal{R}}$ and the Bellman residual is now $\max_{b \in \mathcal{B}_{\mathcal{R}}} |U_t^*(b) - U_{t-1}^*(b)|$.
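The claim that the next belief state always lies in $\mathcal{B}_{\mathcal{R}}$ can be checked numerically with the belief update of equation (2), applied with $z_+ = (o_+, R)$ in place of $o_+$. The following is a minimal sketch under assumed data: `joint[s][s_next]` stands for $P(s_+, z_+ \mid s, a)$ for one fixed action and one fixed report, and the numbers are made up so that the reported region is $R = \{1, 2\}$.

```python
def belief_update(b, joint):
    """Equation (2): b+(s+) = k * sum_s P(s+, z+ | s, a) * b(s).

    joint[s][s_next] holds P(s_next, z+ | s, a) for the fixed a and z+.
    """
    n = len(b)
    unnormalized = [sum(joint[s][s2] * b[s] for s in range(n))
                    for s2 in range(n)]
    total = sum(unnormalized)  # this is P(z+ | b, a), i.e. 1/k
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# The oracle never reports R = {1, 2} when the true next state is 0,
# so the column for next state 0 is all zeros.
joint = [[0.0, 0.3, 0.1],
         [0.0, 0.2, 0.2],
         [0.0, 0.0, 0.5]]
b = [0.5, 0.5, 0.0]
b_next = belief_update(b, joint)
```

Since $P(R \mid s_+, \ldots)$ vanishes whenever $s_+ \notin R$, the updated belief puts no mass outside $R$, which is exactly what it means for $b_+$ to be fully supported by $R$.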
Like value iteration, restricted value iteration can be carried out implicitly. Due to region observability, restricted implicit value iteration in $M'$ can be done more efficiently than implicit value iteration in $M$. See Zhang and Liu (1996) for details.

Implicit restricted value iteration gives us a list of vectors, which will henceforth be denoted by $\mathcal{U}_t$. It represents the $t$-step optimal value function $U_t^*(b)$ of $M'$ in the sense that $U_t^*(b) = \max_{V \in \mathcal{U}_t} \sum_s b(s) V(s)$ for any $b \in \mathcal{B}_{\mathcal{R}}$. The greedy policy for $M'$ based on $\mathcal{U}_t$ is as follows: for any $b \in \mathcal{B}_{\mathcal{R}}$,

$$\pi'(b) = \arg\max_a \Big[ r(b, a) + \gamma \sum_{z_+} P(z_+ \mid b, a)\, U_t^*(b_+) \Big], \quad (6)$$

where $z_+$ stands for the observation at the next time point and $b_+$ is a shorthand for the next belief state $b_+(\cdot \mid b, a, z_+)$.

5 POLICY FOR THE ORIGINAL POMDP
Suppose we have solved the region observable POMDP $M'$. The next step is to construct a policy for the original POMDP $M$ based on the solution for $M'$.

Even though it is our assumption that in the original POMDP $M$ the agent has a good idea about the state of the world at all times, there is no guarantee that its belief state will always be in $\mathcal{B}_{\mathcal{R}}$. There is no oracle in $M$. A policy should prescribe actions for belief states in $\mathcal{B}_{\mathcal{R}}$ as well as for belief states outside $\mathcal{B}_{\mathcal{R}}$. An issue here is that the policy for $M'$ is defined only for belief states in $\mathcal{B}_{\mathcal{R}}$.

Fortunately, it can be naturally extended to the entire belief space by ignoring the constraint $b \in \mathcal{B}_{\mathcal{R}}$ in equation (6). We hence define a policy $\pi$ for $M$ as follows: for any $b \in \mathcal{B}$,

$$\pi(b) = \arg\max_a \Big[ r(b, a) + \gamma \sum_{z_+} P(z_+ \mid b, a)\, U_t^*(b_+) \Big]. \quad (7)$$

Let $k$ be the radius of the region system underlying $M'$. The policy for $M$ given above will be referred to as the radius-$k$ approximate policy for $M$. The entire process of obtaining the policy, including the construction and solving of the region observable POMDP $M'$, will be referred to as region-based approximation.

It is worthwhile to compare this equation with equation (4). In equation (4), there are two terms on the right hand side. The first term is the immediate reward for taking action $a$ and the second term is the discounted future reward the agent can expect to receive if it behaves optimally. Their sum is the total expected reward for taking action $a$. The action with the highest total reward is chosen. The second term is difficult to obtain. In essence, equation (7) approximates the second term using the optimal expected future reward the agent can receive with the help of the oracle, which is easier to compute.
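The one-step lookahead of equation (7) can be sketched as follows. This is a sketch under stated assumptions, not the implementation used in the experiments: the expected immediate reward is taken to be $r(b, a) = \sum_s r(s, a) b(s)$, `joint[a][z][s][s_next]` is assumed to hold $P(s_+, z_+ \mid s, a)$ in $M'$, `vectors` plays the role of $\mathcal{U}_t$, and the two-state toy model is invented.

```python
def greedy_action(b, actions, observations, r, joint, vectors, gamma):
    """arg max_a [ r(b,a) + gamma * sum_z P(z|b,a) * U(b+) ], as in equation (7)."""
    n = len(b)

    def value(belief):
        # U(b) = max over vectors V of sum_s V(s) b(s)
        return max(sum(v[s] * belief[s] for s in range(n)) for v in vectors)

    best_action, best_total = None, float("-inf")
    for a in actions:
        total = sum(r[a][s] * b[s] for s in range(n))  # assumed form of r(b, a)
        for z in observations:
            unnorm = [sum(joint[a][z][s][s2] * b[s] for s in range(n))
                      for s2 in range(n)]
            p_z = sum(unnorm)  # P(z | b, a)
            if p_z > 0:
                b_next = [u / p_z for u in unnorm]  # belief update, equation (2)
                total += gamma * p_z * value(b_next)
        if total > best_total:
            best_action, best_total = a, total
    return best_action


# Toy model: state 1 is the goal; "move" goes to state 1, "declare" stays put.
actions = ["move", "declare"]
observations = ["z"]
r = {"move": [0.0, 0.0], "declare": [0.0, 1.0]}
joint = {"move": {"z": [[0.0, 1.0], [0.0, 1.0]]},
         "declare": {"z": [[1.0, 0.0], [0.0, 1.0]]}}
vectors = [[0.5, 1.0]]

far = greedy_action([1.0, 0.0], actions, observations, r, joint, vectors, 0.9)
at_goal = greedy_action([0.0, 1.0], actions, observations, r, joint, vectors, 0.9)
```

Note that the function accepts any belief state, not only those fully supported by some region, which is the sense in which equation (7) extends the policy of equation (6) to the whole belief space.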
It should be emphasized that the presence of the oracle is assumed only in the process of computing the radius-$k$ approximate policy. The oracle is not present when executing the policy.

6 QUALITY OF APPROXIMATION AND SIMULATION

In general, the quality of an approximate policy $\pi$ is measured by the distance between the optimal value function $V^*(b)$ and the value function $V^\pi(b)$ of $\pi$. This measurement does not consider what the agent might know about the initial state of the world. As such, it is not appropriate for a policy obtained through region-based approximation. One cannot expect such a policy to be of good quality if the agent is very uncertain about the initial state of the world, because it is obtained under the assumption that the agent has a good idea about the state of the world at all times.

This section describes a scheme for determining the quality of an approximate policy in cases where the agent knows the initial state of the world with certainty. The scheme can be generalized to cases where there is a small amount of uncertainty about the initial state; for example, cases where the initial state is known to be in some small region.

The agent might need to reach the goal from different initial states at different times. Let $P(s)$ be the frequency with which it will start from state $s$.
The quality of an approximate policy $\pi$ can be measured by $\sum_s |V^*(s) - V^\pi(s)|\, P(s)$, where $V^*(s)$ and $V^\pi(s)$ denote the rewards the agent can expect to receive starting from state $s$ if it behaves optimally or according to $\pi$, respectively. By definition $V^*(s) \ge V^\pi(s)$ for all $s$. Let $U^*$ be the optimal value function of the region observable POMDP $M'$. Since more information is available to the agent in $M'$, $U^*(s) \ge V^*(s)$ for all $s$. Therefore, $\sum_s [U^*(s) - V^\pi(s)]\, P(s)$ is an upper bound on $\sum_s [V^*(s) - V^\pi(s)]\, P(s)$.

Let $\pi'$ be the policy for $M'$ given by (6). When the Bellman residual is small, $\pi'$ is close to optimal for $M'$ and the value function $V^{\pi'}$ of $\pi'$ is close to $U^*$. Consequently, $\sum_s [V^{\pi'}(s) - V^\pi(s)]\, P(s)$ is an upper bound on $\sum_s [V^*(s) - V^\pi(s)]\, P(s)$ when the Bellman residual is small enough.

One way to estimate the quantity $\sum_s [V^{\pi'}(s) - V^\pi(s)]\, P(s)$ is to conduct a large number of simulation trials. In each trial, an initial state is randomly generated according to $P(s)$.
The agent is informed of the initial state. Simulation takes place in both $M$ and $M'$. In $M$, the agent chooses, at each step, an action using $\pi$ based on its current belief state. The action is passed to a simulator, which randomly generates the next state of the world and the next observation according to the transition and observation probabilities. The observation (but not the state) is passed to the agent, who updates its belief state and chooses the next action, and so on. The trial terminates when the agent chooses the action declare-goal or a maximum number of steps is reached. Simulation in $M'$ takes place in a similar manner except that the observations and the observation probabilities are different and actions are chosen using $\pi'$.

If the goal is correctly declared at the end of a trial, the agent receives a reward of the amount $\gamma^n$, where $n$ is the number of steps. Otherwise, the agent receives no reward. The quantity $\sum_s [V^{\pi'}(s) - V^\pi(s)]\, P(s)$ can be estimated using the difference between the average reward received in the trials for $M'$ and the average reward received in the trials for $M$.

7 TRADEOFF BETWEEN QUALITY OF APPROXIMATION AND COMPLEXITY
Intuitively, the larger the radius of the region system, the less extra information the oracle provides. Hence the closer $M'$ is to $M$ and the narrower the gap between $\sum_s V^{\pi'}(s) P(s)$ and $\sum_s V^\pi(s) P(s)$. Although we have not theoretically proved this, empirical results (see the next section) do suggest that $\sum_s V^\pi(s) P(s)$ increases with the radius of the region system while $\sum_s V^{\pi'}(s) P(s)$ decreases with it. In the extreme case when there is one region in the region system that contains all the possible states of the world, $M$ and $M'$ are identical and hence so are $\sum_s V^{\pi'}(s) P(s)$ and $\sum_s V^\pi(s) P(s)$.

These discussions lead to the following scheme for making the tradeoff between complexity and quality. Start with the radius-0 region system and increase the radius gradually until the quantity $\sum_s [V^{\pi'}(s) - V^\pi(s)]\, P(s)$ becomes sufficiently small or the region observable POMDP $M'$ becomes intractable.

8 SIMULATION EXPERIMENTS
Simulation experiments have been carried out to show that (1) quality of approximation increases with the radius of the region system and (2) where there is not much uncertainty, a POMDP can be accurately approximated by a region observable POMDP that can be solved exactly. This section reports on the experiments.

[Figure 1: Synthetic Office Environments. Panels: Environment A, Environment B.]

Synthetic Office Environments
Our experiments were carried out using two synthetic office environments borrowed from Cassandra et al. (1996) with some minor modifications. Layouts of the environments are shown in Figure 1, where squares represent locations. Each location is represented as four states in the POMDP model, one for each orientation. The dark locations are rooms connected to corridors by doorways.

In each environment, a robot needs to reach the goal location with the correct orientation. At each step, the robot can execute one of the following actions: move-forward, turn-left, turn-right, and declare-goal. The two sets of action models given in the following table were used.

Action          Standard outcomes               Noisy outcomes
move-forward    N(0.11), F(0.88), F-F(0.01)     N(0.2), F(0.7), F-F(0.1)
turn-left       N(0.05), L(0.9), L-L(0.05)      N(0.15), L(0.7), L-L(0.15)
turn-right      N(0.05), R(0.9), R-R(0.05)      N(0.15), R(0.7), R-R(0.15)
declare-goal    N(1.0)                          N(1.0)

For the action move-forward, the term F-F(0.01) means that with probability 0.01 the robot actually moves two steps forward. The other terms are to be interpreted similarly. If an outcome cannot occur in a certain state of the world, then the robot is left in the last state before the impossible outcome.

In each state, the robot is able to perceive in each of three nominal directions (front, left, and right) whether there is a doorway, a wall, an opening, or whether it is undetermined. The following two sets of observation models were used:

Actual case     Standard observations
wall            wall (0.90), open (0.04), doorway (0.04), undetermined (0.02)
open            wall (0.02), open (0.90), doorway (0.06), undetermined (0.02)
doorway         wall (0.15), open (0.15), doorway (0.69), undetermined (0.01)

Actual case     Noisy observations
wall            wall (0.70), open (0.19), doorway (0.09), undetermined (0.02)
open            wall (0.19), open (0.70), doorway (0.09), undetermined (0.02)
doorway         wall (0.15), open (0.15), doorway (0.69), undetermined (0.01)
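As an illustration of how the per-direction table above might be turned into an observation probability over joint sensor readings, the following sketch multiplies the per-direction probabilities. Treating the three directions as independent is an assumption of this sketch, not something stated in the text; the names `READINGS`, `STANDARD`, and `observation_distribution` are likewise hypothetical. Four readings in each of three directions give $4^3 = 64$ joint readings.

```python
from itertools import product

READINGS = ["wall", "open", "doorway", "undetermined"]

# Standard observation model: actual case -> probabilities in READINGS order.
STANDARD = {
    "wall":    [0.90, 0.04, 0.04, 0.02],
    "open":    [0.02, 0.90, 0.06, 0.02],
    "doorway": [0.15, 0.15, 0.69, 0.01],
}


def observation_distribution(actual, model):
    """Distribution over (front, left, right) readings given the actual cases.

    Assumes the readings in the three directions are independent.
    """
    dist = {}
    for triple in product(READINGS, repeat=3):
        p = 1.0
        for case, reading in zip(actual, triple):
            p *= model[case][READINGS.index(reading)]
        dist[triple] = p
    return dist


dist = observation_distribution(("wall", "open", "doorway"), STANDARD)
```

Under this independence assumption, the probability of perceiving every direction correctly in the example is $0.90 \times 0.90 \times 0.69$.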
Complexity of Solving the POMDPs
One of the POMDPs has 280 possible states while the other has 200. They both have 64 possible observations (four possible readings in each of three directions) and 4 possible actions. Since the largest POMDPs that researchers have been able to solve exactly so far have fewer than 20 states and 15 observations, it is safe to say that no existing exact algorithm can solve those two POMDPs.

We were able to solve the radius-0 and radius-1 approximations (region observable POMDPs) of the two POMDPs on a SUN SPARC20 computer. The threshold for the Bellman residual was set at 0.001 and the discount factor at 0.99. The amounts of time it took in CPU seconds are collected in the following table.

[Table: CPU seconds for solving the radius-0 and radius-1 approximations; entries not recoverable.]

We see that the radius-1 approximations took much longer to solve than the radius-0 approximations. Also notice that the region observable POMDPs with noisy action and observation models took more time to solve than those with the standard models.

We were unable to solve the radius-2 approximations. Other approximation techniques need to be incorporated in order to solve the approximations based on region systems with radius larger than or equal to 2.

Quality of Approximation for Standard Models

To determine the quality of the radius-0 and radius-1 approximate policies for the POMDPs with standard action and observation models, 1000 simulation trials were conducted using the scheme described in Section 6. It was assumed that the agent is equally likely to start from any state. Instead of the average reward over the trials, the performance of the agent is summarized by the distribution of the numbers of steps it took to successfully complete the trials, i.e. by a function $g(n)$ of steps $n$, where for each $n$, $g(n)$ is the number of trials where the goal was reached and declared in $n$ or fewer steps. The average reward over the trials can be computed by $\sum_{n=0}^{\infty} \gamma^n (g(n) - g(n-1)) / 1000$. We choose the function $g(n)$ instead of the average reward because it is more informative than the latter.

Simulation results are shown in Figure 2. The curves r0-oracle, for instance, represent the g-functions for simulations in the radius-0 region observable POMDPs (i.e. with the help of the oracle) using their optimal policies. In contrast, the curves r0 represent the g-functions for simulations in the original POMDPs (without the help of the oracle) using radius-0 approximate policies. For readability, only the top portions of the g-functions are shown.

[Figure 2: Experiments with standard action and observation models. Panels: Environment A, Environment B; horizontal axis: steps; curves: r0-oracle, r1-oracle, r1, r0. The POMDPs are accurately approximated by region observable POMDPs with radius zero or one.]

We see that the gap between r0-oracle and r0 is quite small in both cases. This indicates that the radius-0 region observable POMDPs (MDPs) are quite accurate approximations of the original POMDPs. The radius-0 approximate policies are close to optimal for the original POMDPs.

The gaps between the curves r1-oracle and r1 are even narrower. For environment A, there is essentially no gap. Also notice that the curves r1 lie above r0 and the curves r1-oracle lie below r0-oracle. These observations support our claim that quality of approximation increases with the radius of the region system.

There are a couple of other facts worth mentioning. The gaps are larger in environment B than in environment A. This is because environment B is more symmetric and consequently observations are less effective in disambiguating uncertainty in the agent's belief about the state of the world.

There were a few failures in environment A even with the presence of the oracle (curve r1-oracle). The failures occurred due to uncertainties in the action models: the agent was one step away from the goal and had a very good idea about the state of the world. An action towards the goal was taken and afterwards the agent believed strongly that the world was in the goal state. However, the action failed to effect any movement and the oracle's report did not point this out. Hence the failure.
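The relation between the g-function and the average reward used in this subsection can be made concrete with a small sketch. The function name `average_reward` and the numbers in the example are illustrative assumptions; they are not the experimental data. The sketch takes $g(-1) = 0$ and truncates the sum at the largest recorded step count.

```python
def average_reward(g, gamma, trials):
    """Average discounted reward sum_n gamma^n * (g(n) - g(n-1)) / trials.

    g[n] is the number of trials in which the goal was reached and declared
    in n or fewer steps; g(-1) is taken to be 0.
    """
    total = 0.0
    previous = 0
    for n, count in enumerate(g):
        # count - previous trials finished in exactly n steps,
        # each earning a reward of gamma^n.
        total += gamma ** n * (count - previous)
        previous = count
    return total / trials


# Hypothetical example: 10 trials, 2 finish in 2 steps, 3 more in 3 steps,
# and the remaining 5 never succeed.
example = average_reward([0, 0, 2, 5, 5], gamma=0.5, trials=10)
```

The average reward collapses the whole curve into one number, whereas $g(n)$ also shows how many trials failed and how long the successful ones took, which is why the curves are reported instead.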
Quality of Approximation for Noisy Models
One thousand trials were also conducted for the POMDPs with noisy action and observation models. Results are shown in the figure below. We see that the gaps between r1-oracle and r1 are significantly narrower than the gaps between r0-oracle and r0, especially for environment A. The curves r1 lie above the curves r0 and the curves r1-oracle lie below r0-oracle. Again, these observations support our claim that the quality of approximation increases with the radius of the region system.
As far as absolute quality of approximation is concerned, the radius-0 POMDPs are obviously very poor approximations of the original POMDPs, since the gaps between the curves r0-oracle and r0 are very wide. For environment A, the radius-1 approximation is fairly accurate. However, the radius-1 approximation remains poor for environment B. The radius of the region system needs to be increased. Unfortunately, increasing the radius beyond 1 renders it computationally impossible to solve the region observable POMDPs exactly.
Figure: Experiments with noisy action and observation models. The POMDPs are not accurately approximated by region observable POMDPs with radius zero or one.
Tracing through the trials, we learned some interesting facts. In environment B, the agent, under the guidance of the radius-1 approximate policy, was able to quickly get to the neighborhood of the goal even when starting from far away. The fact that the environment around the goal is highly symmetric was the cause of the poor performance. Often the agent was not able to determine whether it was at the goal location (room), in the opposite room, in the leftmost room, or in the room to the right of the goal location. The performance would be close to optimal if the goal location had some distinct features.
In environment A, the agent, again under the guidance of the radius-1 approximate policy, was able to reach and declare the goal successfully once it got to the neighborhood. However, it often took many unnecessary steps before reaching the neighborhood due to the undesirable effects of the turning actions. Take the lower left corner as an example. When the agent reached the corner from above, it was facing downward. The agent executed the action turn_left. Fifteen percent of the time, it ended up facing upward instead of to the right - the desired direction. The agent then decided to move_forward, thinking that it was approaching the goal. But it was actually moving upward and did not realize this until a few steps later. The agent would perform much better if there were informative landmarks around the corners.
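The turning example can be made concrete with a minimal sketch of a stochastic action model. Only the 15 percent overshoot figure comes from the text; placing all remaining probability on the intended quarter turn, as well as the names `HEADINGS`, `turn_left_distribution`, and `predict`, are assumptions made for illustration and may differ from the action models used in the experiments.

```python
# Headings listed counter-clockwise, so a left turn advances one position.
HEADINGS = ["up", "left", "down", "right"]

# From the text: 15% of the time the turn overshoots by a further quarter turn.
# Assumption: the remaining 85% produces the intended quarter turn.
OVERSHOOT_PROB = 0.15


def turn_left_distribution(heading):
    """Return P(next heading | current heading) for the noisy turn_left."""
    i = HEADINGS.index(heading)
    intended = HEADINGS[(i + 1) % 4]
    overshoot = HEADINGS[(i + 2) % 4]
    return {intended: 1.0 - OVERSHOOT_PROB, overshoot: OVERSHOOT_PROB}


def predict(belief, action_model):
    """Push a belief over headings through a stochastic action."""
    new_belief = {h: 0.0 for h in HEADINGS}
    for heading, p in belief.items():
        for next_heading, q in action_model(heading).items():
            new_belief[next_heading] += p * q
    return new_belief


# The agent arrives at the corner facing downward and executes turn_left.
belief_after_turn = predict({"down": 1.0}, turn_left_distribution)
print(belief_after_turn)
```

Under these assumptions the agent faces right with probability 0.85 and upward with probability 0.15, so without an informative observation a subsequent forward move goes the wrong way in roughly one trial out of seven.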
9 CONCLUSIONS
We propose to approximate a POMDP by using a region observable POMDP. The region observable POMDP has more informative observations and hence is easier to solve. A method for determining the quality of approximation is also described, which allows one to make the tradeoff between quality of approximation and computational complexity by starting with a coarse approximation and refining it gradually. Simulation experiments have shown that when there is not much uncertainty in the effects of actions and observations are informative, a POMDP can be accurately approximated by a region observable POMDP that can be solved exactly. However, this becomes infeasible as the degree of uncertainty increases. Other approximate methods need to be incorporated in order to solve region observable POMDPs whose radii are not small.
Acknowledgement
Research was supported by Hong Kong Research Council under grants HKUST 658/95E and Hong Kong
University of Science and Technology under grant DAG96/97.EG01(Rl).