Fast Value Iteration for Goal-Directed Markov Decision Processes
Nevin L. Zhang and Weihong Zhang
Department of Computer Science
Hong Kong University of Science & Technology
{lzhang, wzhang}@cs.ust.hk
Abstract
Planning problems where effects of actions are non-deterministic can be modeled as Markov decision processes. Planning problems are usually goal-directed. This paper proposes several techniques for exploiting the goal-directedness to accelerate value iteration, a standard algorithm for solving Markov decision processes. Empirical studies have shown that the techniques can bring about significant speedups.
Keywords: decision-theoretic planning, Markov decision processes, value iteration, efficiency.

INTRODUCTION

In a Markov decision process (MDP), an agent must, at each time point, choose an action from a finite set A of possible actions and execute it. Executing an action a has two consequences: the agent receives an immediate reward r(s,a), which depends on the current state s of the world as well as the action executed, and the world probabilistically moves into another state s' according to a transition probability P(s'|s,a).

The action is chosen based on the current state of the world. A policy prescribes an action for each possible state. In other words, it is a mapping from the set S of all possible states to A. The set of possible states is assumed to be finite in this paper. The quality of a policy π is measured by its value function V^π; for any state s, V^π(s) is the expected total discounted reward
the agent, under the guidance of π, receives starting from an initial state s. A policy π* is optimal if V^{π*}(s) ≥ V^π(s) for any state s and any other policy π. The value function of an optimal policy is usually referred to as the optimal value function and denoted by V*. MDPs have been studied extensively in the dynamic programming literature (e.g. Howard 1960; Puterman 1990; Bertsekas 1987; White).
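For concreteness, the components just defined can be written down for a tiny example. The dictionary representation, the state and action names, and the probabilities below are our own illustrative assumptions, not anything prescribed by the paper; the sketches later in the text reuse this representation.

```python
# A toy MDP in the representation assumed by the later sketches:
# P[s][a] maps next states s' to P(s'|s,a); r[s][a] is the immediate
# reward r(s,a). The problem itself is an illustrative invention.

states = ["start", "corridor", "goal"]
actions = ["move", "declare_goal"]

P = {
    "start":    {"move": {"corridor": 0.8, "start": 0.2},
                 "declare_goal": {"start": 1.0}},
    "corridor": {"move": {"goal": 0.9, "corridor": 0.1},
                 "declare_goal": {"corridor": 1.0}},
    "goal":     {"move": {"goal": 1.0},
                 "declare_goal": {"goal": 1.0}},
}

# Reward only for declaring the goal in the goal state.
r = {
    s: {a: 1.0 if (s == "goal" and a == "declare_goal") else 0.0
        for a in actions}
    for s in states
}
```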
Dean and Kanazawa (1989) and Dean and Wellman (1991) initiated the use of MDPs in planning problems where effects of actions are not deterministic. Planning problems typically have a large number of states. Solving MDPs with large state spaces has hence become a hot topic in AI (e.g. Dean et al. 1993; Boutilier et al. 1995).

A planning problem can be modeled as an MDP in such a way that (1) there is a state designated to be the goal and an action called declare-goal; (2) the reward function r(s,a) is given by

$$r(s,a) = \begin{cases} 1 & \text{if } a = \textit{declare-goal} \text{ and } s = \textit{goal}, \\ 0 & \text{otherwise}; \end{cases} \qquad (1)$$

and (3) the action declare-goal cannot be executed more than once. We call MDPs with such properties goal-directed MDPs.

Value iteration is a standard algorithm for solving MDPs. This paper proposes several techniques for accelerating value iteration in goal-directed MDPs. Let us begin with a brief review of value iteration and of previous work on speeding up value iteration.

VALUE ITERATION

A value function is a mapping from the set S of possible states to the real line. Given a value function V, define another value function TV by

$$TV(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \Big] \qquad (2)$$

for each state s, where 0 ≤ γ < 1 is a discount factor. T is a mapping from the space of value functions to itself. For any value function V, its norm ||V|| is defined by ||V|| = max_s |V(s)|. T is a contraction mapping (e.g. Puterman 1990) in the sense that for any two value functions U and V, ||TU − TV|| ≤ γ||U − V||. For any positive number ε, we say that a value function V is ε-contracted if ||TV − V|| ≤ ε. The optimal value function satisfies the optimality equation V* = TV*, and hence is 0-contracted.

A value function V induces a policy π through

$$\pi(s) = \arg\max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \Big]. \qquad (3)$$

If the value function is ε-contracted for a small number ε, the induced policy is "good enough" in the sense that

$$||V^{\pi} - V^{*}|| \le \frac{2\varepsilon\gamma}{1-\gamma}. \qquad (4)$$

A proof of this inequality can be found in, for instance, Puterman (1990). It is evident that the policy induced by the optimal value function is optimal.

Value iteration (VI) (Bellman 1957) starts with an arbitrary value function and improves it iteratively until the value function becomes ε-contracted. Here is the pseudo-code.

VI
1. Choose an initial value function V_0 and set n = 0.
2. V_{n+1} = TV_n.
3. If ||V_{n+1} − V_n|| > ε, increment n by 1 and go to step 2. Else return V_{n+1}.

Since T is a contraction mapping, ||TV_{n+1} − V_{n+1}|| = ||TV_{n+1} − TV_n|| ≤ γ||V_{n+1} − V_n|| ≤ γε ≤ ε. Hence,
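The following is a minimal Python sketch of standard VI and of policy extraction via equation (3), assuming the toy dictionary representation introduced above; all identifiers and the default values of gamma and eps are our own illustrative choices.

```python
# A minimal sketch of standard value iteration (VI), assuming the
# dictionary-based MDP representation from the toy example above.

def value_iteration(states, actions, P, r, gamma=0.95, eps=1e-4):
    """P[s][a] is a dict {s_next: probability}; r[s][a] is the reward."""
    V = {s: 0.0 for s in states}          # step 1: initial value function
    while True:
        # step 2: apply the operator T of equation (2) to every state
        V_new = {
            s: max(
                r[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
        # step 3: stop once the residual ||V_{n+1} - V_n|| drops to eps
        if max(abs(V_new[s] - V[s]) for s in states) <= eps:
            return V_new
        V = V_new

def greedy_policy(states, actions, P, r, V, gamma=0.95):
    """Extract the policy induced by V, as in equation (3)."""
    return {
        s: max(
            actions,
            key=lambda a: r[s][a]
            + gamma * sum(p * V[s2] for s2, p in P[s][a].items()),
        )
        for s in states
    }
```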
V_{n+1} is ε-contracted.

PREVIOUS WORK

VI converges geometrically at rate γ. Convergence is slow when γ is close to 1. Various modifications to standard VI have been proposed, and all have been theoretically or empirically shown to lead to faster convergence. Morton and Wecker (1977) suggest that one, before applying the operator T in step 2, subtract an appropriate value function from V_n, and MacQueen (1969) proposes to subtract V_n(s_0), the value of V_n itself at a predetermined state s_0. The aggregation/disaggregation techniques introduced by Schweitzer et al. (1985) and Bertsekas and Castanon (1989) interleave standard VI steps with aggregation/disaggregation steps, which improve the current value function by solving the optimality equation for a simpler MDP obtained from the original MDP through state aggregation.
Dean and Lin (1995) and Dean et al. (1997) decompose an MDP with a large state space into a number of MDPs with smaller state spaces through state aggregation. The smaller MDPs are solved using standard VI and their solutions are used to construct a solution to the original MDP.

Three pieces of previous work are of direct relevance to this paper. The first one is the Gauss-Seidel variant of standard VI proposed by Hastings (1969). Let ρ be an ordering among the possible states. Instead of the operator T defined in equation (2), the Gauss-Seidel variant uses another operator T' to improve the current value function. For any value function V, T'V(s) is defined for each state s by starting from the state that comes first in the ordering ρ and moving backwards. The values T'V(s) for earlier states are used in defining the values for later states. Specifically, T' is given by

$$T'V(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) \hat{V}(s') \Big], \qquad (5)$$

where \hat{V}(s') = T'V(s') when s' comes before s in the ordering ρ, and \hat{V}(s') = V(s') otherwise.
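As a sketch, one Gauss-Seidel sweep can be written as an in-place update; the representation of P and r follows the earlier VI sketch, and the `ordering` argument and all other names are our own illustrative assumptions.

```python
# A minimal sketch of one Gauss-Seidel sweep (the operator T' of
# equation (5)): values updated earlier in the ordering are used
# immediately when updating later states.

def gauss_seidel_sweep(ordering, actions, P, r, V, gamma=0.95):
    """In-place update of V along `ordering`; returns the largest change."""
    delta = 0.0
    for s in ordering:
        new_v = max(
            r[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            for a in actions
        )
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v  # later states in the sweep see this updated value
    return delta
```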
The anytime algorithm presented in Dean et al. (1993) is also closely related to the methods to be proposed in this paper. The algorithm restricts standard VI inside an envelope, a subset of possible states that contains at least one path from the initial state to the goal state. The envelope is gradually enlarged to get better and better solutions.

Boyan and Moore (1996) study value iteration in acyclic goal-directed MDPs. A goal-directed MDP is acyclic if, once leaving a state, the world can never come back to that state again. Boyan and Moore point out that value iteration for goal-directed acyclic MDPs can be carried out in one sweep by starting from the goal state and working backwards. Thereby the amount of computation needed to compute the optimal value function is reduced to that needed in one iteration of standard VI. The method is an extension of the DAG-SHORTEST-PATH algorithm (Cormen et al. 1990) for finding shortest paths in acyclic graphs.

PARSIMONIOUS VALUE ITERATION
We introduce several new variants of standard VI for goal-directed MDPs. Called parsimonious value iteration (PVI), the first variant relies on the following intuition. Suppose value iteration begins with the zero value function. Then at early iterations, the value function remains zero for states far away from the goal. At later iterations, the value function does not change much for states close to the goal. The number of states whose values change significantly from one iteration to the next can be much smaller than the total number of states.

At each iteration, PVI updates the value for a state only when the value is expected to change significantly. Specifically, PVI begins with the zero value function. At each iteration n+1 (n ≥ 1), PVI performs a test to detect states whose values are unlikely to change substantially from iteration n to iteration n+1; the values of such states are not updated. Let V_{n−1} and V_n be the value functions PVI computed at the previous two iterations. At the current iteration n+1, PVI does not update the value for a state s if |V_n(s') − V_{n−1}(s')| ≤ δ for a small positive constant δ and every state s' such that max_a P(s'|s,a) > 0. Since the number of states reachable from s by executing one action is usually small, this test is cheap. It is usually much cheaper than calculating TV_n(s), especially when one maintains a list of states reachable from each state by executing one action.

The theoretical underpinnings of the test are as follows. If the value functions V_n and V_{n−1} were the value functions computed by VI, one could easily show that if s passes the test (i.e. the condition above holds) then |V_{n+1}(s) − V_n(s)| ≤ γδ. In other words, the value for s would not change much from iteration n to iteration n+1. Here is the pseudo-code for PVI.
PVI
1. Set V_0(s) = 0 for every state s and set n = 0.
2. For each state s,
   (a) If n ≥ 1 and |V_n(s') − V_{n−1}(s')| ≤ δ for all s' such that max_a P(s'|s,a) > 0, set V_{n+1}(s) = V_n(s).
   (b) Else set V_{n+1}(s) = TV_n(s).
3. If ||V_{n+1} − V_n|| > ε, increment n by 1 and go to step 2. Else return V_{n+1}.
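The following Python sketch mirrors this pseudo-code; the successor bookkeeping and all identifiers are our own illustrative choices, assuming the same dictionary-based MDP representation as the earlier VI sketch.

```python
# A minimal sketch of parsimonious value iteration (PVI).

def pvi(states, actions, P, r, gamma=0.95, eps=1e-4, delta=1e-5):
    # States reachable in one step from each state (used by the test).
    succ = {
        s: {s2 for a in actions for s2 in P[s][a] if P[s][a][s2] > 0}
        for s in states
    }
    V_prev = {s: 0.0 for s in states}     # V_{n-1}
    V = {s: 0.0 for s in states}          # V_n (both start at zero)
    n = 0
    while True:
        V_next = {}
        for s in states:
            # Skip the Bellman backup when no successor's value moved by
            # more than delta between the previous two iterations.
            if n >= 1 and all(abs(V[s2] - V_prev[s2]) <= delta for s2 in succ[s]):
                V_next[s] = V[s]
            else:
                V_next[s] = max(
                    r[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in actions
                )
        if max(abs(V_next[s] - V[s]) for s in states) <= eps:
            return V_next
        V_prev, V = V, V_next
        n += 1
```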
There is no guarantee that the value function returned by PVI is ε-contracted. However, it should be close to being ε-contracted. We suggest using PVI as a preprocessing step for VI, i.e. using the value function it returns as the initial value function of VI. This way an ε-contracted value function can be obtained. Since the value function returned by PVI is close to being ε-contracted, VI should terminate in a small number of iterations. In our experiments, it terminated in just one iteration.

The idea behind PVI is rather similar to the idea underlying Boyan and Moore's one-sweep algorithm: start from the goal and work backwards. PVI does not assume acyclicity and hence is more general. When the MDP is acyclic, it is almost identical to the one-sweep algorithm. PVI is also related to the anytime algorithm by Dean et al. (1993) in the sense that values are updated only for some states at each iteration. The difference lies in the fact that in PVI the states whose values are updated change from iteration to iteration, while in the anytime algorithm whether the value for a state is updated depends on whether it is in the envelope and does not change with iteration. Also, the entire value iteration process needs to be carried out for each envelope.

GREEDY AND DOUBLE VALUE ITERATION
Even though the test in PVI is cheap, the fact that it has to be carried out for each state is somewhat unsatisfying. Greedy value iteration (GVI) avoids the test by working in a way similar to DAG-SHORTEST-PATH. Before describing GVI, we need to introduce the concept of ideal reachability.

We say a state s' is ideally reachable in one step from another state s if, after executing a certain action in state s, the probability of the world ending up in state s' is the highest. A state s_k is ideally reachable in k steps from another state s_0 if there are states s_1, ..., s_{k−1} such that s_{i+1} is ideally reachable from s_i in one step for all 0 ≤ i ≤ k−1. Any state is ideally reachable from itself in 0 steps. For any state s, let d(s) be the minimum number of steps in which the goal is ideally reachable from s. We shall refer to d(s) as the distance from s to the goal.

At each iteration n, GVI only updates the values for the states from which the goal is ideally reachable in n steps. Let N be the maximum number of steps in which the goal can be ideally reached from any state. Then GVI terminates in exactly N iterations. For later convenience, we assume that GVI takes a value function as input and uses it as the initial value function. Here is the pseudo-code for GVI.
GVI(V_0)
1. For n = 0 to N,
   • For each state s,
     (a) If d(s) = n, set V_{n+1}(s) = TV_n(s).
     (b) Else set V_{n+1}(s) = V_n(s).
2. Return V_{N+1}.
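A sketch of how d(s) and GVI might be implemented follows. Computing the distances by a breadth-first search backwards from the goal over "ideal" transitions is our own reading of the definition, and all function and variable names are illustrative assumptions.

```python
from collections import deque

# A minimal sketch of greedy value iteration (GVI). Distances d(s) are
# computed by BFS backwards from the goal over "ideal" transitions.

def ideal_distances(states, actions, P, goal):
    """d[s] = minimum number of ideal steps from s to the goal."""
    # s' is ideally reachable in one step from s if, for some action a,
    # s' maximizes P(.|s,a) (ties broken arbitrarily in this sketch).
    pred = {s: set() for s in states}
    for s in states:
        for a in actions:
            if P[s][a]:
                best = max(P[s][a], key=P[s][a].get)
                pred[best].add(s)
    d = {goal: 0}
    queue = deque([goal])
    while queue:
        s2 = queue.popleft()
        for s in pred[s2]:
            if s not in d:
                d[s] = d[s2] + 1
                queue.append(s)
    return d

def gvi(states, actions, P, r, goal, V0, gamma=0.95):
    """At iteration n, back up only states exactly n ideal steps away."""
    d = ideal_distances(states, actions, P, goal)
    V = dict(V0)
    N = max(d.values())
    for n in range(N + 1):
        V_new = dict(V)
        for s in states:
            if d.get(s) == n:
                V_new[s] = max(
                    r[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in actions
                )
        V = V_new
    return V
```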
When the MDP is acyclic, GVI is identical to Boyan and Moore's one-sweep algorithm and hence returns the optimal value function. When the MDP is cyclic, however, the value function it returns could be of very poor quality, and using it as a preprocessing step for VI might not help much. On the positive side, the amount of computation GVI does is identical to that carried out by one iteration of standard VI. Also, because GVI is an approximation of the entire value iteration process, the extent to which it improves the input value function should be greater than that brought about by one iteration of standard VI.
Thus we can expect VI to converge faster if its second step is replaced by "V_{n+1} = GVI(V_n)". This leads to a new algorithm called double value iteration (DVI).

DVI
1. Choose an initial value function V_0 and set n = 0.
2. V_{n+1} = GVI(V_n).
3. If ||V_{n+1} − V_n|| > ε, increment n by 1 and go to step 2. Else return V_{n+1}.
As it turns out, DVI can be described directly without reference to GVI. At each iteration, it uses a new operator T'', instead of the operator T given in equation (2), to update the value function V_n. For any value function V, T''V(s) is defined for each state s by starting with the goal state and gradually moving away: the value T''V(s) for a state s is defined after the values T''V(s') for all the states s' closer to the goal than s have been defined. It is given by

$$T''V(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s'|s,a) \hat{V}(s',s) \Big],$$

where

$$\hat{V}(s',s) = \begin{cases} T''V(s') & \text{if } d(s') < d(s), \\ V(s') & \text{otherwise.} \end{cases}$$

It can be proved that T'' is also a contraction mapping. Hence the value function returned by DVI is ε-contracted.
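Under this reading, DVI amounts to Gauss-Seidel sweeps with the states sorted by distance to the goal. The sketch below reuses the hypothetical `gauss_seidel_sweep` and `ideal_distances` helpers from the earlier sketches; all names are illustrative.

```python
# A minimal sketch of DVI: Gauss-Seidel sweeps in which states are
# ordered by increasing distance d(s) to the goal, repeated until the
# residual drops below eps.

def dvi(states, actions, P, r, goal, gamma=0.95, eps=1e-4):
    d = ideal_distances(states, actions, P, goal)
    # States closest to the goal are updated first within each sweep.
    ordering = sorted(states, key=lambda s: d.get(s, float("inf")))
    V = {s: 0.0 for s in states}
    while gauss_seidel_sweep(ordering, actions, P, r, V, gamma) > eps:
        pass
    return V
```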
It is evident that DVI is almost identical to the Gauss-Seidel variant of standard VI, except that it proposes one particular way to order the possible states: the states are ordered according to their distances to the goal. By introducing DVI through GVI, we hope to provide another way of looking at the Gauss-Seidel variant of standard VI in the context of goal-directed MDPs.

IMPROVING PVI
The alternative understanding of DVI can be used to improve PVI. We call the improved algorithm PVI1. The pseudo-code is as follows.

PVI1
1. Set V_0(s) = 0 for every state s and set n = 0.
2. For m = 0 to N,
   • For each state s such that d(s) = m,
     (a) If n ≥ 1 and |V_n(s') − V_{n−1}(s')| ≤ δ for all s' such that max_a P(s'|s,a) > 0, set V_{n+1}(s) = V_n(s).
     (b) Else set V_{n+1}(s) = T''V_n(s).
3. If ||V_{n+1} − V_n|| > ε, increment n by 1 and go to step 2. Else return V_{n+1}.
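Combining the two ideas in code, a PVI1 sweep is the distance-ordered update guarded by the PVI skip test. As before, the helper functions and every identifier are our own illustrative assumptions, not the paper's code.

```python
# A minimal sketch of PVI1: distance-ordered sweeps with the PVI test.

def pvi1(states, actions, P, r, goal, gamma=0.95, eps=1e-4, delta=1e-5):
    d = ideal_distances(states, actions, P, goal)
    def dist(s):
        return d.get(s, float("inf"))
    ordering = sorted(states, key=dist)   # closest to the goal first
    succ = {
        s: {s2 for a in actions for s2 in P[s][a] if P[s][a][s2] > 0}
        for s in states
    }
    V_prev = {s: 0.0 for s in states}     # V_{n-1}
    V = {s: 0.0 for s in states}          # V_n
    n = 0
    while True:
        V_next = dict(V)
        for s in ordering:
            # Skip the backup when no one-step successor's value moved
            # by more than delta between the previous two sweeps.
            if n >= 1 and all(abs(V[s2] - V_prev[s2]) <= delta for s2 in succ[s]):
                continue
            # T'' backup: states strictly closer to the goal contribute
            # their freshly updated values, all others their old ones.
            V_next[s] = max(
                r[s][a]
                + gamma * sum(
                    p * (V_next[s2] if dist(s2) < dist(s) else V[s2])
                    for s2, p in P[s][a].items()
                )
                for a in actions
            )
        if max(abs(V_next[s] - V[s]) for s in states) <= eps:
            return V_next
        V_prev, V = V, V_next
        n += 1
```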
Like PVI, PVI1 should be used as a preprocessing step for VI.

EXPERIMENTS
Preliminary experiments have been carried out to compare the algorithms proposed in this paper with standard value iteration. Four office-environment navigation problems borrowed from Cassandra et al. (1996) were used. The problems differ in corridor layout and in the total number of states. There are two sets of transition probabilities, referred to as standard and noisy transition probabilities respectively. Effects of actions are less certain under noisy transition probabilities than under standard transition probabilities.

[Figure 1: Convergence times of the algorithms in four navigation problems. The threshold for the Bellman residual was set at … and the discount factor at … .]

Figure 1 shows the convergence times of the algorithms in the four problems. The X-axis represents the sizes of the problems, while the Y-axis represents convergence time in CPU seconds.
Data were collected using a SPARC20.
The curves VI and DVI display the convergence times of VI and DVI respectively, while PVI and PVI1 display the convergence times for the combinations of PVI and PVI1 with VI. Under both standard and noisy transition probabilities, DVI and PVI converge much faster than VI, and PVI1 converges even faster. DVI converges slightly faster than PVI in the smallest problem but slower in all other problems. Performances of all algorithms are slightly worse under noisy transition probabilities than under standard transition probabilities, and their differences are also slightly larger.

To gain an idea of how the comparisons change with problem size, we made copies of one environment and glued them together to form larger environments.

[Figure 2: Differences in performance among the algorithms as problem size increases.]

The convergence times are shown in Figure 2. We see that the differences in performance among the algorithms become larger as the problem size increases. In the smallest problem PVI1 converges about three times faster than VI, while in the largest problem it converges six times faster.

CONCLUSIONS AND FUTURE DIRECTIONS
We propose several techniques for exploiting the goal-directedness of planning problems to speed up value iteration for their MDP models. Empirical studies have shown that the techniques can bring about significant speedups.

MDPs assume perfect observation of the state of the world. In many real-world problems, one does not know the true state of the world. Such problems can be modeled as partially observable MDPs (POMDPs). POMDPs are much harder to solve than MDPs. We are currently investigating the possibility of applying the ideas introduced in this paper to POMDPs.
Acknowledgements
We thank Peter Dayan, Thomas L. Dean, and Michael Littman for pointers to references and thank Wenju Liu and D. Y. Yeung for useful discussions.
Research was supported by Hong Kong Research Council under grant HKUST 658/95E and by Hong Kong University of Science and Technology under grant DAG96/97.EG01(RI).
References

[1] R. Bellman (1957), Dynamic Programming, Princeton University Press.
[2] D. P. Bertsekas and D. A. Castanon (1989), Adaptive Aggregation for Infinite Horizon Dynamic Programming, IEEE Transactions on Automatic Control, Vol. 34, No. 6, 1989.
[3] D. P. Bertsekas (1987), Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall.
[4] C. Boutilier, R. Dearden and M. Goldszmidt (1995), Exploiting Structure in Policy Construction, in Proceedings of IJCAI'95, pp. 1104-1111.
[5] J. A. Boyan and A. W. Moore (1996), Learning Evaluation Functions for Large Acyclic Domains, in L. Saitta (ed.), Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann.
[6] T. H. Cormen, C. E. Leiserson, and R. L. Rivest (1990), Introduction to Algorithms, MIT Press.
[7] T. L. Dean and K. Kanazawa (1989), A Model for Reasoning about Persistence and Causation, Computational Intelligence.
[8] T. L. Dean, R. Givan, and S. Leach (1997), Model Reduction Techniques for Computing Approximately Optimal Solutions for Markov Decision Processes, in Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence.
[9] T. L. Dean, L. P. Kaelbling, J. Kirman, and A. Nicholson (1993), Planning with Deadlines in Stochastic Domains, in Proceedings of the Eleventh National Conference on Artificial Intelligence, Washington, DC.
[10] T. L. Dean and S. H. Lin (1995), Decomposition Techniques for Planning in Stochastic Domains, TR CS-95-10, Department of Computer Science, Brown University, Providence, Rhode Island.
[11] T. L. Dean and M. P. Wellman (1991), Planning and Control, Morgan Kaufmann.
[12] N. A. J. Hastings (1969), Optimization of Discounted Markov Decision Problems, Operational Research Quarterly, 20, 499-500.
[13] R. A. Howard (1960), Dynamic Programming and Markov Processes, Wiley, London.
[14] J. MacQueen (1969), A Modified Dynamic Programming Method for Markov Decision Problems, Journal of Mathematical Analysis and Applications, 14, 38-43.
[15] T. E. Morton and W. E. Wecker (1977), Discounting, Ergodicity and Convergence for Markov Decision Processes, Management Science, 23, 890-900.
[16] M. L. Puterman (1990), Markov Decision Processes, in D. P. Heyman and M. J. Sobel (eds.), Handbooks in OR & MS, Vol. 2, Elsevier Science Publishers.
[17] P. J. Schweitzer, M. Puterman, and K. W. Kindle (1985), Iterative Aggregation-Disaggregation Procedures for Solving Discounted Semi-Markovian Reward Processes, Operations Research, 33, 589-606.
[18] D. J. White, Markov Decision Processes, John Wiley & Sons.