Factored Value Iteration Converges
ISTVÁN SZITA AND ANDRÁS LŐRINCZ
Abstract.
In this paper we propose a novel algorithm, factored value iteration (FVI), for the approximate solution of factored Markov decision processes (fMDPs). The traditional approximate value iteration algorithm is modified in two ways. For one, the least-squares projection operator is modified so that it does not increase max-norm, and thus preserves convergence. The other modification is that we sample polynomially many states uniformly from the (exponentially large) state space. This way, the complexity of our algorithm becomes polynomial in the size of the fMDP description. We prove that the algorithm is convergent. We also derive an upper bound on the difference between our approximate solution and the optimal one, and on the error introduced by sampling. We analyze various projection operators with respect to their computation complexity and their convergence when combined with approximate value iteration.

Keywords: factored Markov decision process, value iteration, reinforcement learning

1. Introduction
Markov decision processes (MDPs) are extremely useful for formalizing and solving sequential decision problems, with a wide repertoire of algorithms to choose from [4, 26]. Unfortunately, MDPs are subject to the 'curse of dimensionality' [3]: for a problem with $m$ state variables, the size of the MDP grows exponentially with $m$, even though many practical problems have polynomial-size descriptions. Factored MDPs (fMDPs) may rescue us from this explosion, because they offer a more compact representation [17, 5, 6]. In the fMDP framework, one assumes that dependencies can be factored into several easy-to-handle components.

For MDPs with known parameters, there are three basic solution methods (and, naturally, countless variants of them): value iteration, policy iteration and linear programming (see the books of Sutton & Barto [26] or Bertsekas & Tsitsiklis [4] for an excellent overview). Of these methods, linear programming is generally considered less effective than the others. So, it comes as a surprise that all effective fMDP algorithms, to our best knowledge, use linear programming in one way or another. Furthermore, the classical value iteration algorithm is known to be divergent when function approximation is used [2, 27], which includes the case of fMDPs, too.

In this paper we propose a variant of the approximate value iteration algorithm for solving fMDPs. The algorithm is a direct extension of the traditional value iteration algorithm. Furthermore, it avoids computationally expensive manipulations like linear programming or the construction of decision trees. We prove that the algorithm always converges to a fixed point, and that it requires polynomial time to reach a fixed accuracy. A bound on the distance from the optimal solution is also given.

In Section 2 we review the basic concepts of Markov decision processes, including the classical value iteration algorithm and its combination with linear function approximation. We also give a sufficient condition for the convergence of approximate value iteration, and list several examples of interest. In Section 3 we extend the results of the previous section to fMDPs, and we review related work in Section 4. Conclusions are drawn in Section 5.
2. Approximate Value Iteration in Markov Decision Processes
2.1. Markov Decision Processes.
An MDP is characterized by a sixtuple $(X, A, R, P, \mathbf{x}_s, \gamma)$, where $X$ is a finite set of states; $A$ is a finite set of possible actions; $R: X \times A \to \mathbb{R}$ is the reward function of the agent, so that $R(\mathbf{x}, a)$ is the reward of the agent after choosing action $a$ in state $\mathbf{x}$; $P: X \times A \times X \to [0, 1]$ is the transition function, so that $P(\mathbf{y} \mid \mathbf{x}, a)$ is the probability that the agent arrives at state $\mathbf{y}$, given that she started from $\mathbf{x}$ upon executing action $a$; $\mathbf{x}_s \in X$ is the starting state of the agent; and finally, $\gamma \in [0, 1)$ is the discount rate on future rewards. (Later on, we shall generalize the concept of the state of the system: a state will be a vector of state variables in our fMDP description. For that reason, we already use the boldface vector notation in this preliminary description.)

A policy of the agent is a mapping $\pi: X \times A \to [0, 1]$ so that $\pi(\mathbf{x}, a)$ tells the probability that the agent chooses action $a$ in state $\mathbf{x}$. For any $\mathbf{x}_0 \in X$, the policy of the agent and the parameters of the MDP determine a stochastic process experienced by the agent through the instantiation $\mathbf{x}_0, a_0, r_0, \mathbf{x}_1, a_1, r_1, \ldots, \mathbf{x}_t, a_t, r_t, \ldots$

The goal is to find a policy that maximizes the expected value of the discounted total reward. Let the value function of policy $\pi$ be

$V^\pi(\mathbf{x}) := E\Big( \sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, \mathbf{x}_0 = \mathbf{x} \Big)$

and let the optimal value function be

$V^*(\mathbf{x}) := \max_\pi V^\pi(\mathbf{x})$

for each $\mathbf{x} \in X$. If $V^*$ is known, it is easy to find an optimal policy $\pi^*$, for which $V^{\pi^*} \equiv V^*$. Provided that history does not modify the transition probability distribution $P(\mathbf{y} \mid \mathbf{x}, a)$ at any time instant, value functions satisfy the famous Bellman equations

(1) $V^\pi(\mathbf{x}) = \sum_a \sum_{\mathbf{y}} \pi(\mathbf{x}, a) P(\mathbf{y} \mid \mathbf{x}, a) \big( R(\mathbf{x}, a) + \gamma V^\pi(\mathbf{y}) \big)$

and

(2) $V^*(\mathbf{x}) = \max_a \sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}, a) \big( R(\mathbf{x}, a) + \gamma V^*(\mathbf{y}) \big)$.

Most algorithms that solve MDPs build upon some version of the Bellman equations. In the following, we shall concentrate on the value iteration algorithm.
2.2. Exact Value Iteration.
Consider an MDP $(X, A, R, P, \mathbf{x}_s, \gamma)$. Value iteration uses the Bellman equation (2) as an iterative assignment: it starts with an arbitrary value function $V_0: X \to \mathbb{R}$, and in iteration $t$ it performs the update

(3) $V_{t+1}(\mathbf{x}) := \max_a \sum_{\mathbf{y} \in X} P(\mathbf{y} \mid \mathbf{x}, a) \big( R(\mathbf{x}, a) + \gamma V_t(\mathbf{y}) \big)$

for all $\mathbf{x} \in X$. For the sake of better readability, we introduce vector notation. Let $N := |X|$, and suppose that states are integers from 1 to $N$, i.e. $X = \{1, 2, \ldots, N\}$. Clearly, value functions are equivalent to $N$-dimensional vectors of reals, which may be indexed with states. The vector corresponding to $V$ will be denoted as $\mathbf{v}$ and the value of state $\mathbf{x}$ by $\mathbf{v}_{\mathbf{x}}$. Similarly, for each $a$ let us define the $N$-dimensional column vector $\mathbf{r}^a$ with entries $\mathbf{r}^a_{\mathbf{x}} = R(\mathbf{x}, a)$ and the $N \times N$ matrix $P^a$ with entries $P^a_{\mathbf{x},\mathbf{y}} = P(\mathbf{y} \mid \mathbf{x}, a)$. With these notations, (3) can be written compactly as

(4) $\mathbf{v}_{t+1} := \max_{a \in A} \big( \mathbf{r}^a + \gamma P^a \mathbf{v}_t \big)$.

Here, $\max$ denotes the componentwise maximum operator.

It is also convenient to introduce the Bellman operator $T: \mathbb{R}^N \to \mathbb{R}^N$ that maps value functions to value functions as

$T\mathbf{v} := \max_{a \in A} \big( \mathbf{r}^a + \gamma P^a \mathbf{v} \big)$.

As is well known, $T$ is a max-norm contraction with contraction factor $\gamma$: for any $\mathbf{v}, \mathbf{u} \in \mathbb{R}^N$, $\|T\mathbf{v} - T\mathbf{u}\|_\infty \le \gamma \|\mathbf{v} - \mathbf{u}\|_\infty$. Consequently, by Banach's fixed point theorem, exact value iteration (which can be expressed compactly as $\mathbf{v}_{t+1} := T\mathbf{v}_t$) converges to a unique solution $\mathbf{v}^*$ from any initial vector $\mathbf{v}_0$, and the solution $\mathbf{v}^*$ satisfies the Bellman equation (2). Furthermore, for any required precision $\epsilon > 0$, $\|\mathbf{v}_t - \mathbf{v}^*\|_\infty \le \epsilon$ holds if $t \ge \log(\epsilon / \|\mathbf{v}_0 - \mathbf{v}^*\|_\infty) / \log \gamma$. One iteration costs $O(N^2 \cdot |A|)$ computation steps.
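As an illustration, iteration (4) takes only a few lines when the MDP is small enough to store densely. The sketch below is ours (not the paper's); it stops when successive iterates are closer than a threshold, mirroring the stopping rule of Algorithm 1 later in the paper:

    import numpy as np

    def value_iteration(P, r, gamma, eps):
        # P : (|A|, N, N) array with P[a][x, y] = P(y | x, a)
        # r : (|A|, N) array with r[a][x] = R(x, a)
        num_actions, N, _ = P.shape
        v = np.zeros(N)
        while True:
            # T v = max_a (r^a + gamma * P^a v), componentwise over states
            v_new = np.max(r + gamma * P @ v, axis=0)
            if np.max(np.abs(v_new - v)) <= eps:
                return v_new
            v = v_new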
2.3. Approximate Value Iteration.

In this section we review approximate value iteration (AVI) with linear function approximation (LFA) in ordinary MDPs. The results of this section hold for AVI in general, but if we can perform all operations effectively on compact representations (i.e., execution time is polynomially bounded in the number of variables instead of the number of states), then the method can be applied directly to factored Markov decision problems, which underlines the importance of the following considerations.

Suppose that we wish to express the value functions as linear combinations of $K$ basis functions $h_k: X \to \mathbb{R}$ ($k \in \{1, \ldots, K\}$), where $K \ll N$. Let $H$ be the $N \times K$ matrix with entries $H_{\mathbf{x},k} = h_k(\mathbf{x})$. Let $\mathbf{w}_t \in \mathbb{R}^K$ denote the weight vector of the basis functions at step $t$. We can substitute $\mathbf{v}_t = H\mathbf{w}_t$ into the right-hand side (r.h.s.) of (4), but we cannot do the same on the left-hand side (l.h.s.) of the assignment: in general, the r.h.s. is not contained in the image space of $H$, so there is no $\mathbf{w}_{t+1}$ such that

$H\mathbf{w}_{t+1} = \max_{a \in A} \big( \mathbf{r}^a + \gamma P^a H \mathbf{w}_t \big)$.

We can put the iteration to work by projecting the right-hand side into $\mathbf{w}$-space: let $\mathcal{G}: \mathbb{R}^N \to \mathbb{R}^K$ be a (possibly non-linear) mapping, and consider the iteration

(5) $\mathbf{w}_{t+1} := \mathcal{G}\big[ \max_{a \in A} \big( \mathbf{r}^a + \gamma P^a H \mathbf{w}_t \big) \big]$

with an arbitrary starting vector $\mathbf{w}_0$.

Lemma 2.1. If $\mathcal{G}$ is such that $H\mathcal{G}$ is a non-expansion, i.e., for any $\mathbf{v}, \mathbf{v}' \in \mathbb{R}^N$, $\|H\mathcal{G}\mathbf{v} - H\mathcal{G}\mathbf{v}'\|_\infty \le \|\mathbf{v} - \mathbf{v}'\|_\infty$, then there exists a $\mathbf{w}^* \in \mathbb{R}^K$ such that

$\mathbf{w}^* = \mathcal{G}\big[ \max_{a \in A} \big( \mathbf{r}^a + \gamma P^a H \mathbf{w}^* \big) \big]$

and iteration (5) converges to $\mathbf{w}^*$ from any starting point.

Proof. We can write (5) compactly as $\mathbf{w}_{t+1} = \mathcal{G} T H \mathbf{w}_t$. Let $\hat{\mathbf{v}}_t = H\mathbf{w}_t$. This satisfies

(6) $\hat{\mathbf{v}}_{t+1} = H \mathcal{G} T \hat{\mathbf{v}}_t$.

It is easy to see that the operator $H\mathcal{G}T$ is a contraction: for any $\mathbf{v}, \mathbf{v}' \in \mathbb{R}^N$,

$\|H\mathcal{G}T\mathbf{v} - H\mathcal{G}T\mathbf{v}'\|_\infty \le \|T\mathbf{v} - T\mathbf{v}'\|_\infty \le \gamma \|\mathbf{v} - \mathbf{v}'\|_\infty$

by the assumption of the lemma and the contractivity of $T$. Therefore, by Banach's fixed point theorem, there exists a vector $\hat{\mathbf{v}}^* \in \mathbb{R}^N$ such that $\hat{\mathbf{v}}^* = H\mathcal{G}T\hat{\mathbf{v}}^*$, and iteration (6) converges to $\hat{\mathbf{v}}^*$ from any starting point. It is easy to see that $\mathbf{w}^* = \mathcal{G}T\hat{\mathbf{v}}^*$ satisfies the statement of the lemma. □

Note that if $\mathcal{G}$ is a linear mapping with matrix $G \in \mathbb{R}^{K \times N}$, then the assumption of the lemma is equivalent to $\|HG\|_\infty \le 1$.
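The projected iteration (5) differs from exact value iteration only in the projection step. A minimal sketch (ours), with the projection supplied as a callable; the commented line shows the least-squares choice $\mathcal{G} = H^+$, which, as discussed next, may diverge:

    import numpy as np

    def approximate_value_iteration(P, r, H, G, gamma, num_iters):
        # H : (N, K) basis matrix; G : callable mapping R^N -> R^K
        K = H.shape[1]
        w = np.zeros(K)
        for _ in range(num_iters):
            v = H @ w                                # current value estimate H w_t
            tv = np.max(r + gamma * P @ v, axis=0)   # Bellman update T(H w_t)
            w = G(tv)                                # project back to weight space
        return w

    # Example projection (may diverge, see Section 2.4):
    # G = lambda v: np.linalg.pinv(H) @ v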
2.4. Examples of Projections, Convergent and Divergent.

In this section, we examine certain possibilities for choosing the projection $\mathcal{G}$. Let $\mathbf{v} \in \mathbb{R}^N$ be an arbitrary vector, and let $\mathbf{w} = \mathcal{G}\mathbf{v}$ be its $\mathcal{G}$-projection. For linear operators, $\mathcal{G}$ can be represented in matrix form, and we shall denote its matrix by $G$.

Least-squares ($L_2$-) projection. Least-squares fitting is used almost exclusively for projecting value functions, and the term AVI is usually used in the sense 'AVI with least-squares projection'. In this case, $\mathbf{w}$ is chosen so that it minimizes the least-squares error:

$\mathbf{w} := \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_2$.

This corresponds to the linear projection $\mathcal{G}_2$ with matrix $H^+$ (i.e., $\mathbf{w} = H^+ \mathbf{v}$), where $H^+$ is the Moore-Penrose pseudoinverse of $H$. It is well known, however, that this method can diverge. For an example of such divergence, see, e.g., the book of Bertsekas & Tsitsiklis [4]. The reason is simple: the matrix $HH^+$ is a non-expansion in $L_2$-norm, but Lemma 2.1 requires a non-expansion in $L_\infty$-norm, which does not hold in the general case. (See also Appendix A.1 for an illustration.)

Constrained least-squares projection.
One can enforce the non-expansion property by expressing it as a constraint: let $\mathbf{w}$ be the solution of the constrained minimization problem

$\mathbf{w} := \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_2$, subject to $\|H\mathbf{w}\|_\infty \le \|\mathbf{v}\|_\infty$,

which defines a non-linear mapping $\mathcal{G}_2^c$. This projection is computationally highly demanding: in each step of the iteration, one has to solve a quadratic programming problem.
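For concreteness, the constrained projection $\mathcal{G}_2^c$ can be handed to a generic constrained optimizer; the sketch below is ours (SciPy's SLSQP is one possible choice; any QP solver would do) and splits the max-norm constraint into two sets of linear inequalities:

    import numpy as np
    from scipy.optimize import minimize

    def constrained_ls_projection(H, v):
        # minimize ||H w - v||_2^2  subject to  ||H w||_inf <= ||v||_inf
        N, K = H.shape
        bound = np.max(np.abs(v))
        cons = [{'type': 'ineq', 'fun': lambda w: bound - H @ w},   #  (Hw)_x <= bound
                {'type': 'ineq', 'fun': lambda w: bound + H @ w}]   # -(Hw)_x <= bound
        res = minimize(lambda w: np.sum((H @ w - v) ** 2),
                       np.zeros(K), constraints=cons, method='SLSQP')
        return res.x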
Max-norm ($L_\infty$-) projection. Similarly to the $L_2$-projection, we can also select $\mathbf{w}$ so that it minimizes the max-norm of the residual:

$\mathbf{w} := \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_\infty$.

The computation of $\mathbf{w}$ can be transcribed into a linear programming task, and this defines the non-linear mapping $\mathcal{G}_\infty$. However, in general, $\|H\mathcal{G}_\infty \mathbf{v}\|_\infty \not\le \|\mathbf{v}\|_\infty$, and consequently AVI using the iteration

$\mathbf{w}_{t+1} := \arg\min_{\mathbf{w}} \|H\mathbf{w} - TH\mathbf{w}_t\|_\infty$

can be divergent. Similarly to the $L_2$-projection, one can also introduce a constrained version $\mathcal{G}_\infty^c$ defined by

$\mathcal{G}_\infty^c \mathbf{v} := \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_\infty$, subject to $\|H\mathbf{w}\|_\infty \le \|\mathbf{v}\|_\infty$,

which can also be turned into a linear program.

It is insightful to contrast this with the approximate linear programming method of Guestrin et al. [13]: they directly minimize the max-norm of the Bellman error, i.e., they solve the problem

$\mathbf{w}^* := \arg\min_{\mathbf{w}} \|H\mathbf{w} - TH\mathbf{w}\|_\infty$,

which can be solved without constraints.

$L_1$-norm projection. Let $\mathcal{G}_1$ be defined by

$\mathcal{G}_1 \mathbf{v} := \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_1$.

The $L_1$-norm projection also requires the solution of a linear program (a sketch is given below), but interestingly, the projection operator $\mathcal{G}_1$ is a non-expansion (the proof can be found in Appendix A.1).

The AVI-compatible operators considered so far ($\mathcal{G}_2^c$, $\mathcal{G}_\infty^c$ and $\mathcal{G}_1$) are non-linear, and require the solution of a linear program or a quadratic program in each step of value iteration, which is clearly cumbersome. On the other hand, while $\mathcal{G}_2 \mathbf{v} = H^+ \mathbf{v}$ is linear, it is also known to be incompatible with AVI [2, 27]. Now, we shall focus on operators that are both AVI-compatible and linear.
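As promised, here is a sketch (ours, using SciPy's linprog) of the $L_1$ projection as a linear program, via the standard slack-variable construction:

    import numpy as np
    from scipy.optimize import linprog

    def l1_projection(H, v):
        # G_1 v = argmin_w ||H w - v||_1, written as an LP with slacks t:
        # minimize sum(t)  subject to  -t <= H w - v <= t, w free, t >= 0.
        N, K = H.shape
        c = np.concatenate([np.zeros(K), np.ones(N)])
        A_ub = np.block([[H, -np.eye(N)],     #  H w - t <= v
                         [-H, -np.eye(N)]])   # -H w - t <= -v
        b_ub = np.concatenate([v, -v])
        bounds = [(None, None)] * K + [(0, None)] * N
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:K]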
Normalized linear mapping. Let $G$ be an arbitrary $K \times N$ matrix, and define its normalization $N(G)$ as a matrix with the same dimensions and entries

$[N(G)]_{i,j} := \dfrac{G_{i,j}}{\big(\sum_{j'} |H_{i,j'}|\big)\big(\sum_{i'} |G_{i',j}|\big)}$,

that is, $N(G)$ is obtained from $G$ by dividing each element by the corresponding row sum of $H$ and the corresponding column sum of $G$. All (absolute) row sums of $H \cdot N(G)$ are equal to 1. Therefore, (i) $\|H \cdot N(G)\|_\infty = 1$, and (ii) $H \cdot N(G)$ is maximal in the sense that if the absolute value of any element of $N(G)$ were increased, then for the resulting matrix $G'$, $\|H \cdot G'\|_\infty > 1$.
Probabilistic linear mapping. If all elements of $H$ are nonnegative and all the row sums of $H$ are equal, then $N(H^T)$ assumes a probabilistic interpretation. This interpretation is detailed in Appendix A.2.
Normalized least-squares projection. Among all linear operators, $H^+$ is the one that guarantees the smallest least-squares error, so we may expect that its normalization, $N(H^+)$, plays a similar role among AVI-compatible linear projections. Unless noted otherwise, we will use the projection $N(H^+)$ subsequently.
2.5. Convergence properties.

Lemma 2.2.
Let $\mathbf{v}^*$ be the optimal value function and $\mathbf{w}^*$ be the fixed point of the approximate value iteration (5). Then

$\|H\mathbf{w}^* - \mathbf{v}^*\|_\infty \le \dfrac{1}{1-\gamma} \|H\mathcal{G}\mathbf{v}^* - \mathbf{v}^*\|_\infty$.
Proof. For the optimal value function, $\mathbf{v}^* = T\mathbf{v}^*$ holds. On the other hand, $\mathbf{w}^* = \mathcal{G}TH\mathbf{w}^*$. Thus,

$\|H\mathbf{w}^* - \mathbf{v}^*\|_\infty = \|H\mathcal{G}TH\mathbf{w}^* - T\mathbf{v}^*\|_\infty$
$\le \|H\mathcal{G}TH\mathbf{w}^* - H\mathcal{G}T\mathbf{v}^*\|_\infty + \|H\mathcal{G}T\mathbf{v}^* - T\mathbf{v}^*\|_\infty$
$\le \|TH\mathbf{w}^* - T\mathbf{v}^*\|_\infty + \|H\mathcal{G}\mathbf{v}^* - \mathbf{v}^*\|_\infty$
$\le \gamma \|H\mathbf{w}^* - \mathbf{v}^*\|_\infty + \|H\mathcal{G}\mathbf{v}^* - \mathbf{v}^*\|_\infty$,

from which the statement of the lemma follows. In the transformations we applied the triangle inequality, the non-expansion property of $H\mathcal{G}$, the contraction property of $T$, and the fact that $T\mathbf{v}^* = \mathbf{v}^*$. □

According to the lemma, the error bound is proportional to the projection error of $\mathbf{v}^*$. Therefore, if $\mathbf{v}^*$ can be represented in the space of the basis functions with small error, then our AVI algorithm gets close to the optimum. Furthermore, the lemma can be used to check a posteriori how good our basis functions are, and one may improve the set of basis functions iteratively. Similar arguments have been brought up by Guestrin et al. [13], in association with their LP-based solution algorithm.
3. Factored Value Iteration

MDPs are attractive because their solution time is polynomial in the number of states. Consider, however, a sequential decision problem with $m$ variables. In general, we need an exponentially large state space to model it as an MDP, so the number of states is exponential in the size of the description of the task. Factored Markov decision processes may avoid this trap thanks to their more compact task representation.

3.1. Factored Markov decision processes.
We assume that $X$ is the Cartesian product of $m$ smaller state spaces (corresponding to individual variables):

$X = X_1 \times X_2 \times \ldots \times X_m$.

For the sake of notational convenience we will assume that each $X_i$ has the same size, $|X_1| = |X_2| = \ldots = |X_m| = n$. With this notation, the size of the full state space is $N = |X| = n^m$. We note that all derivations and proofs carry through to variable spaces of different sizes.

A naive, tabular representation of the transition probabilities would require exponentially large space (that is, exponential in the number of variables $m$). However, the next-step value of a state variable often depends only on a few other variables, so the full transition probability can be obtained as the product of several simpler factors. For a formal description, we introduce several notations:

For any subset of variable indices $Z \subseteq \{1, 2, \ldots, m\}$, let $X[Z] := \times_{i \in Z} X_i$; furthermore, for any $\mathbf{x} \in X$, let $\mathbf{x}[Z]$ denote the value of the variables with indices in $Z$. We shall also use the notation $\mathbf{x}[Z]$ without specifying a full vector of values $\mathbf{x}$; in such cases $\mathbf{x}[Z]$ denotes an element of $X[Z]$. For single-element sets $Z = \{i\}$ we shall also use the shorthand $\mathbf{x}[\{i\}] = \mathbf{x}[i]$.

A function $f$ is a local-scope function if it is defined over a subspace $X[Z]$ of the state space, where $Z$ is a (presumably small) index set. A local-scope function $f$ can be extended trivially to the whole state space by $f(\mathbf{x}) := f(\mathbf{x}[Z])$. If $|Z|$ is small, local-scope functions can be represented efficiently, as they can take only $n^{|Z|}$ different values.

Suppose that for each variable $i$ there exist neighborhood sets $\Gamma_i$ such that the value of $\mathbf{x}_{t+1}[i]$ depends only on $\mathbf{x}_t[\Gamma_i]$ and the action $a_t$ taken. Then we can write the transition probabilities in a factored form

(7) $P(\mathbf{y} \mid \mathbf{x}, a) = \prod_{i=1}^{m} P_i(\mathbf{y}[i] \mid \mathbf{x}[\Gamma_i], a)$

for each $\mathbf{x}, \mathbf{y} \in X$, $a \in A$, where each factor is a local-scope function

(8) $P_i: X[\Gamma_i] \times A \times X_i \to [0, 1]$ (for all $i \in \{1, \ldots, m\}$).

We will also suppose that the reward function is the sum of $J$ local-scope functions:

(9) $R(\mathbf{x}, a) = \sum_{j=1}^{J} R_j(\mathbf{x}[Z_j], a)$,

with arbitrary (but preferably small) index sets $Z_j$ and local-scope functions

(10) $R_j: X[Z_j] \times A \to \mathbb{R}$ (for all $j \in \{1, \ldots, J\}$).

To sum up, a factored Markov decision process is characterized by the parameters

$\big( \{X_i : 1 \le i \le m\};\ A;\ \{R_j : 1 \le j \le J\};\ \{\Gamma_i : 1 \le i \le m\};\ \{P_i : 1 \le i \le m\};\ \mathbf{x}_s;\ \gamma \big)$,

where $\mathbf{x}_s$ denotes the initial state. The functions $P_i$ and $R_j$ are usually represented either as tables or as dynamic Bayesian networks. If the maximum size of the appearing local scopes is bounded by some constant, then the description length of an fMDP is polynomial in the number of variables $m$; a concrete tabular representation is sketched below.
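The following Python sketch (ours; all class and field names are hypothetical) stores each factor as a small table, so the factored transition probabilities (7) and rewards (9) are evaluated by a product and a sum of table lookups:

    import numpy as np

    class FactoredMDP:
        # factors[i][a] : array of shape (n^{|Gamma_i|}, n), table of P_i(. | x[Gamma_i], a)
        # rewards[j][a] : array over X[Z_j], table of R_j(. , a)
        def __init__(self, m, n, scopes, factors, reward_scopes, rewards, gamma):
            self.m, self.n = m, n                # m variables, each taking n values
            self.scopes = scopes                 # Gamma_i for each variable i
            self.factors = factors
            self.reward_scopes = reward_scopes   # Z_j for each reward term j
            self.rewards = rewards
            self.gamma = gamma

        def _index(self, x, Z):
            # Flatten the assignment x[Z] into a row index of a local table.
            vals = tuple(x[i] for i in Z)
            return np.ravel_multi_index(vals, (self.n,) * len(Z)) if Z else 0

        def transition_prob(self, x, a, y):
            # Equation (7): P(y | x, a) = prod_i P_i(y[i] | x[Gamma_i], a)
            p = 1.0
            for i in range(self.m):
                p *= self.factors[i][a][self._index(x, self.scopes[i]), y[i]]
            return p

        def reward(self, x, a):
            # Equation (9): R(x, a) = sum_j R_j(x[Z_j], a)
            return sum(self.rewards[j][a][self._index(x, Z)]
                       for j, Z in enumerate(self.reward_scopes))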
3.1.1. Value functions.

The optimal value function is an $N = n^m$-dimensional vector. To represent it efficiently, we could try to rewrite it as the sum of local-scope functions with small domains. Unfortunately, in the general case no such factored form exists [13].

However, we can still approximate $V^*$ with such an expression: let $K$ be the desired number of basis functions, and for each $k \in \{1, \ldots, K\}$, let $C_k$ be the domain set of the local-scope basis function $h_k: X[C_k] \to \mathbb{R}$. We are looking for a value function of the form

(11) $\tilde{V}(\mathbf{x}) = \sum_{k=1}^{K} w_k \cdot h_k(\mathbf{x}[C_k])$.

The quality of the approximation depends on two factors: the choice of the basis functions and the approximation algorithm. Basis functions are usually selected by the experiment designer, and there are no general guidelines on how to automate this process. For given basis functions, we can apply a number of algorithms to determine the weights $w_k$. We give a short overview of these methods in Section 4. Here, we concentrate on value iteration.
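Evaluating the factored approximation (11) at a single state touches only the small tables of the basis functions. A sketch (ours), continuing the hypothetical layout above:

    import numpy as np

    def factored_value(x, weights, basis_scopes, basis_tables, n):
        # Equation (11): V~(x) = sum_k w_k * h_k(x[C_k]); each h_k is a table over X[C_k].
        value = 0.0
        for w_k, C, table in zip(weights, basis_scopes, basis_tables):
            idx = np.ravel_multi_index(tuple(x[i] for i in C), (n,) * len(C))
            value += w_k * table[idx]
        return value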
3.2. Exploiting factored structure in value iteration.
For fMDPs, we can substitute the factored form of the transition probabilities (7), the rewards (9) and the factored approximation of the value function (11) into the AVI formula (5), which yields

$\sum_{k=1}^{K} h_k(\mathbf{x}[C_k]) \cdot w_{k,t+1} \approx \max_a \sum_{\mathbf{y} \in X} \Big( \prod_{i=1}^{m} P_i(\mathbf{y}[i] \mid \mathbf{x}[\Gamma_i], a) \Big) \cdot \Big( \sum_{j=1}^{J} R_j(\mathbf{x}[Z_j], a) + \gamma \sum_{k'=1}^{K} h_{k'}(\mathbf{y}[C_{k'}]) \cdot w_{k',t} \Big)$.

By rearranging operations and exploiting that all occurring functions have local scope, we get

(12) $\sum_{k=1}^{K} h_k(\mathbf{x}[C_k]) \cdot w_{k,t+1} \approx \max_a \Big[ \sum_{j=1}^{J} R_j(\mathbf{x}[Z_j], a) + \gamma \sum_{k'=1}^{K} \sum_{\mathbf{y}[C_{k'}] \in X[C_{k'}]} \Big( \prod_{i \in C_{k'}} P_i(\mathbf{y}[i] \mid \mathbf{x}[\Gamma_i], a) \Big) h_{k'}(\mathbf{y}[C_{k'}]) \cdot w_{k',t} \Big]$

for all $\mathbf{x} \in X$, where the approximate equality is resolved by applying the projection $\mathcal{G}$ as in (5). We can write this update rule more compactly in vector notation. Let $\mathbf{w}_t := (w_{1,t}, w_{2,t}, \ldots, w_{K,t}) \in \mathbb{R}^K$, and let $H$ be the $|X| \times K$ matrix containing the values of the basis functions; we index the rows of matrix $H$ by the elements of $X$:

$H_{\mathbf{x},k} := h_k(\mathbf{x}[C_k])$.

Further, for each $a \in A$, let $B^a$ be the $|X| \times K$ value backprojection matrix defined as

$B^a_{\mathbf{x},k} := \sum_{\mathbf{y}[C_k] \in X[C_k]} \Big( \prod_{i \in C_k} P_i(\mathbf{y}[i] \mid \mathbf{x}[\Gamma_i], a) \Big) h_k(\mathbf{y}[C_k])$,

and for each $a$, define the reward vector $\mathbf{r}^a \in \mathbb{R}^{|X|}$ by

$\mathbf{r}^a_{\mathbf{x}} := \sum_{j=1}^{J} R_j(\mathbf{x}[Z_j], a)$.

Using these notations, (12) can be rewritten as

(13) $\mathbf{w}_{t+1} := \mathcal{G} \max_{a \in A} \big( \mathbf{r}^a + \gamma B^a \mathbf{w}_t \big)$.

Now, all entries of $B^a$, $H$ and $\mathbf{r}^a$ are composed of local-scope functions, so any individual element of them can be computed efficiently. This means that the time required for the computation is exponential in the sizes of the function scopes, but only polynomial in the number of variables, making the approach attractive. Unfortunately, the matrices are still exponentially large, as there are exponentially many equations in (12). One can overcome this problem by sampling, as we show below.
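For concreteness, a single entry of $B^a$ is obtained by enumerating the $n^{|C_k|}$ local assignments $\mathbf{y}[C_k]$ only; a sketch (ours, building on the hypothetical FactoredMDP layout above):

    import itertools
    import numpy as np

    def backprojection_entry(fmdp, x, a, C_k, h_table):
        # B^a[x, k] = sum_{y[C_k]} ( prod_{i in C_k} P_i(y[i] | x[Gamma_i], a) ) * h_k(y[C_k])
        n = fmdp.n
        total = 0.0
        for y_local in itertools.product(range(n), repeat=len(C_k)):
            prob = 1.0
            for pos, i in enumerate(C_k):
                row = fmdp._index(x, fmdp.scopes[i])
                prob *= fmdp.factors[i][a][row, y_local[pos]]
            h_idx = np.ravel_multi_index(y_local, (n,) * len(C_k))
            total += prob * h_table[h_idx]
        return total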
3.3. Sampling.
To circumvent the problem of having exponentially many equations, we select a random subset $\hat{X} \subseteq X$ of the original state space so that $|\hat{X}| = \mathrm{poly}(m)$; consequently, solution time will scale polynomially with $m$. On the other hand, we will select a sufficiently large subset so that the remaining system of equations is still over-determined. The necessary size of the selected subset is determined below: it should be as small as possible, but the solution of the reduced equation system should remain close to the original solution with high probability. For the sake of simplicity, we assume that the projection operator $\mathcal{G}$ is linear with matrix $G$. Let the sub-matrices of $G$, $H$, $B^a$ and $\mathbf{r}^a$ corresponding to $\hat{X}$ be denoted by $\hat{G}$, $\hat{H}$, $\hat{B}^a$ and $\hat{\mathbf{r}}^a$, respectively. Then the value update

(14) $\mathbf{w}_{t+1} := \hat{G} \cdot \max_{a \in A} \big( \hat{\mathbf{r}}^a + \gamma \hat{B}^a \mathbf{w}_t \big)$

can be performed effectively, because these matrices have polynomial size. Now we show that the solution obtained from the sampled data is close to the true solution with high probability.

Theorem 3.1.
Let $\mathbf{w}^*$ be the unique solution of $\mathbf{w}^* = \mathcal{G}TH\mathbf{w}^*$ for an fMDP, and let $\mathbf{w}'$ be the solution of the corresponding equation with sampled matrices, $\mathbf{w}' = \hat{\mathcal{G}}T\hat{H}\mathbf{w}'$. Suppose that the projection matrix $G$ has a factored structure, too. Then iteration (14) converges to $\mathbf{w}'$; furthermore, for a suitable constant $\Xi$ (depending polynomially on $n^z$, where $z$ is the maximum cluster size), and for any $\epsilon, \delta > 0$, $\|\mathbf{w}^* - \mathbf{w}'\|_\infty \le \epsilon$ holds with probability at least $1 - \delta$ if the sample size satisfies

$N' \ge \Xi \dfrac{m^2}{\epsilon^2} \log \dfrac{m}{\delta}$.

The proof of Theorem 3.1 can be found in Appendix A.3. The derivation is closely related to the work of Drineas and colleagues [11, 12], although we use the infinity-norm instead of the $L_2$-norm. A more important difference is that we can exploit the factored structure, gaining an exponentially better bound. The resulting factored value iteration algorithm is summarized in Algorithm 1.
Algorithm 1. Factored value iteration with a linear projection matrix $G$.

    % inputs:
    %   factored MDP, $M = (\{X_i\}_{i=1}^m; A; \{R_j\}_{j=1}^J; \{\Gamma_i\}_{i=1}^m; \{P_i\}_{i=1}^m; \mathbf{x}_s; \gamma)$
    %   basis functions, $\{h_k\}_{k=1}^K$
    %   required accuracy, $\epsilon$
    $N'$ := number of samples
    $\hat{X}$ := uniform random $N'$-element subset of $X$
    create $\hat{H}$ and $\hat{G}$
    create $\hat{B}^a = \widehat{P^a H}$ and $\hat{\mathbf{r}}^a$ for each $a \in A$
    $\mathbf{w}_0 := \mathbf{0}$, $t := 0$
    repeat
        $\mathbf{w}_{t+1} := \hat{G} \cdot \max_{a \in A} ( \hat{\mathbf{r}}^a + \gamma \hat{B}^a \mathbf{w}_t )$
        $\Delta := \|\mathbf{w}_{t+1} - \mathbf{w}_t\|_\infty$
        $t := t + 1$
    until $\Delta \le \epsilon$
    return $\mathbf{w}_t$
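Once the sampled matrices $\hat{G}$, $\hat{B}^a$ and $\hat{\mathbf{r}}^a$ have been assembled (entry by entry, e.g., as sketched in Section 3.2), the main loop of Algorithm 1 is a few lines of dense algebra. A minimal numpy sketch (ours, with an assumed array layout):

    import numpy as np

    def factored_value_iteration(G_hat, B_hat, r_hat, gamma, eps):
        # G_hat : (K, N') projection matrix
        # B_hat : (|A|, N', K) sampled backprojection matrices
        # r_hat : (|A|, N') sampled reward vectors
        w = np.zeros(G_hat.shape[0])
        while True:
            # w_{t+1} := G_hat * max_a (r_hat^a + gamma * B_hat^a w_t)
            w_next = G_hat @ np.max(r_hat + gamma * B_hat @ w, axis=0)
            delta = np.max(np.abs(w_next - w))
            w = w_next
            if delta <= eps:
                return w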
4. Related work

The exact solution of factored MDPs is infeasible in general. The idea of representing a large MDP using a factored model was first proposed by Koller & Parr [17], but similar ideas appear already in the works of Boutilier, Dearden, & Goldszmidt [5, 6]. More recently, the framework (and some of the algorithms) was extended to fMDPs with hybrid continuous-discrete variables [18] and to factored partially observable MDPs [23]. Furthermore, the framework has also been applied to structured MDPs with alternative representations, e.g., relational MDPs [15] and first-order MDPs [24].
4.1. Algorithms for solving factored MDPs.
There are two major branches of algorithms for solving fMDPs: the first one approximates the value functions as decision trees, the other one makes use of linear programming.

Decision trees (or equivalently, decision lists) provide a way to represent the agent's policy compactly. Koller & Parr [17] and Boutilier et al. [5, 6] present algorithms to evaluate and improve such policies, according to the policy iteration scheme. Unfortunately, the size of the policies may grow exponentially even with a decision tree representation [6, 20].

The exact Bellman equations (2) can be transformed to an equivalent linear program with $N$ variables $\{V(\mathbf{x}) : \mathbf{x} \in X\}$ and $N \cdot |A|$ constraints:

minimize: $\sum_{\mathbf{x} \in X} \alpha(\mathbf{x}) V(\mathbf{x})$
subject to: $V(\mathbf{x}) \ge R(\mathbf{x}, a) + \gamma \sum_{\mathbf{x}' \in X} P(\mathbf{x}' \mid \mathbf{x}, a) V(\mathbf{x}')$ (for all $\mathbf{x} \in X$, $a \in A$).

Here, the weights $\alpha(\mathbf{x})$ are free parameters: the optimal solution is $V^*$, independently of their choice, provided that each of them is greater than 0. In the approximate linear programming approach, we approximate the value function as a linear combination of basis functions (11), resulting in an approximate LP with $K$ variables $\{w_k : 1 \le k \le K\}$ and $N \cdot |A|$ constraints:

(15) minimize: $\sum_{k=1}^{K} \sum_{\mathbf{x} \in X} w_k \cdot \alpha(\mathbf{x}) h_k(\mathbf{x}[C_k])$
subject to: $\sum_{k=1}^{K} w_k \cdot h_k(\mathbf{x}[C_k]) \ge R(\mathbf{x}, a) + \gamma \sum_{k'=1}^{K} w_{k'} \sum_{\mathbf{x}' \in X} P(\mathbf{x}' \mid \mathbf{x}, a) \cdot h_{k'}(\mathbf{x}'[C_{k'}])$.

Both the objective function and the constraints can be written in compact form, exploiting the local-scope property of the appearing functions.

Markov decision processes were first formulated as LP tasks by Schweitzer and Seidmann [25]. The approximate LP form is due to de Farias and van Roy [7]. Guestrin et al. [13] show that the maximum of local-scope functions can be computed by rephrasing the task as a non-serial dynamic programming task and eliminating variables one by one. Therefore, (15) can be transformed to an equivalent,
more compact linear program. The gain may be exponential, but this is not necessarily so in all cases: according to Guestrin et al. [13], 'as shown by Dechter [9], [the cost of the transformation] is exponential in the induced width of the cost network, the undirected graph defined over the variables $X_1, \ldots, X_n$, with an edge between $X_l$ and $X_m$ if they appear together in one of the original functions $f_j$. The complexity of this algorithm is, of course, dependent on the variable elimination order and the problem structure. Computing the optimal elimination order is an NP-hard problem [1] and elimination orders yielding low induced tree width do not exist for some problems.' Furthermore, for the approximate LP task (15), the solution is no longer independent of $\alpha$, and the optimal choice of the $\alpha$ values is not known.

The approximate LP-based solution algorithm is also due to Guestrin et al. [13]. Dolgov and Durfee [10] apply a primal-dual approximation technique to the linear program, and report improved results on several problems.

The approximate policy iteration algorithm [17, 13] also uses an approximate LP reformulation, but it is based on the policy-evaluation Bellman equation (1). The policy-evaluation equations are, however, linear and do not contain the maximum operator, so there is no need for the second, costly transformation step. On the other hand, the algorithm needs an explicit decision tree representation of the policy, and Liberatore [20] has shown that the size of the decision tree representation can grow exponentially.
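For illustration, the exact Bellman LP above can be handed to an off-the-shelf solver when $N$ is small; a sketch (ours) with SciPy, using uniform weights $\alpha$:

    import numpy as np
    from scipy.optimize import linprog

    def solve_mdp_lp(P, r, gamma):
        # minimize sum_x alpha(x) V(x)  s.t.  V(x) >= R(x,a) + gamma * sum_y P(y|x,a) V(y),
        # rewritten per action as (gamma * P^a - I) V <= -r^a.
        num_actions, N, _ = P.shape
        alpha = np.ones(N)                          # any positive weights work
        A_ub = np.concatenate([gamma * P[a] - np.eye(N) for a in range(num_actions)])
        b_ub = np.concatenate([-r[a] for a in range(num_actions)])
        res = linprog(alpha, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)
        return res.x                                # the optimal value function V*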
4.1.1. Applications.

Applications of fMDP algorithms are mostly restricted to artificial test problems like the problem set of Boutilier et al. [6], various versions of the
SysAdmin task [13, 10, 21], or the New York driving task [23].

Guestrin, Koller, Gearhart and Kanodia [15] show that their LP-based solution algorithm is also capable of solving more practical tasks: they consider the real-time strategy game
FreeCraft. Several scenarios are modelled as fMDPs and solved successfully. Furthermore, they find that the solution generalizes to larger tasks with similar structure.
4.1.2. Unknown environment.
The algorithms discussed so far (including our FVI algorithm) assume that all parameters of the fMDP are known, and that the basis functions are given. In the case when only the factorization structure of the fMDP is known but the actual transition probabilities and rewards are not, one can apply the factored versions of E³ [16] or R-max [14].

Few attempts exist that try to obtain the basis functions or the structure of the fMDP automatically. Patrascu et al. [21] select basis functions greedily so that the approximated Bellman error of the solution is minimized. Poupart et al. [22] also apply greedy selection, but their selection criteria are different: a decision tree is constructed to partition the state space into several regions, and basis functions are added for each region. The approximate value function is piecewise linear in each region. The metric they use for splitting is related to the quality of the LP solution.

4.2. Sampling.
Sampling techniques are widely used when the state space is immensely large. Lagoudakis and Parr [19] use sampling without a theoretical analysis of performance, but the validity of the approach is verified empirically. De Farias and van Roy [8] give a thorough overview of constraint sampling techniques used for the linear programming formulation. These techniques are, however, specific to linear programming and cannot be applied in our case.

The work most similar to ours is that of Drineas et al. [12, 11]. They investigate the least-squares solution of an overdetermined linear system, and they prove that it is sufficient to keep polynomially many samples to reach low error with high probability. They introduce a non-uniform sampling distribution so that the variance of the approximation error is minimized. However, the calculation of the sampling probabilities requires a complete sweep through all equations.
5. Conclusions
In this paper we have proposed a new algorithm, factored value iteration, for the approximate solution of factored Markov decision processes. The classical approximate value iteration algorithm is modified in two ways. Firstly, the least-squares projection operator is substituted with an operator that does not increase the max-norm, and thus preserves convergence. Secondly, polynomially many samples are drawn uniformly from the (exponentially large) state space. This way, the complexity of our algorithm becomes polynomial in the size of the fMDP description. We prove that the algorithm is convergent and give a bound on the difference between our solution and the optimal one. We also analyze various projection operators with respect to their computation complexity and their convergence when combined with approximate value iteration. To our knowledge, this is the first fMDP algorithm that (1) provably converges in polynomial time and (2) avoids linear programming.
Acknowledgements
The authors are grateful to Zoltán Szabó for calling our attention to the articles of Drineas et al. [11, 12]. This research has been supported by the EC FET 'New and Emergent World models Through Individual, Evolutionary, and Social Learning' Grant (Reference Number 3752). Opinions and errors in this manuscript are the authors' responsibility; they do not necessarily reflect the opinions of the EC or other project members.
Appendix A. Proofs
A.1. Projections in various norms.
We wish to know whether $\mathbf{w} = \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_p$ implies $\|H\mathbf{w}\|_\infty \le \|\mathbf{v}\|_\infty$ for various values of $p$. Specifically, we are interested in the cases $p \in \{1, 2, \infty\}$. Figure 1 indicates that the implication does not hold for $p = 2$ or $p = \infty$, only for the case $p = 1$. Below we give a rigorous proof of these claims.

Consider the example $\mathbf{v} = (1, 1)^T \in \mathbb{R}^2$, $H = (1, 2)^T$, $\mathbf{w} \in \mathbb{R}^1$. For these values, easy calculation shows that $\|H[\mathbf{w}]_{L_2}\|_\infty = 6/5$ and $\|H[\mathbf{w}]_{L_\infty}\|_\infty = 4/3$, i.e., $\|H\mathbf{w}\|_\infty \not\le \|\mathbf{v}\|_\infty = 1$ in both cases. For $p = 1$, we shall prove the following lemma:

Lemma A.1. If $\mathbf{w} = \arg\min_{\mathbf{w}} \|H\mathbf{w} - \mathbf{v}\|_1$, then $\|H\mathbf{w}\|_\infty \le \|\mathbf{v}\|_\infty$.

Proof. Let $\mathbf{z} := H\mathbf{w} \in \mathbb{R}^N$. If there are multiple solutions to the minimization task, then consider the (unique) vector $\mathbf{z}$ with minimum $L_2$-norm. Let $r := \|\mathbf{z} - \mathbf{v}\|_1$ and let $S_1(\mathbf{v}, r)$ be the $L_1$-sphere with center $\mathbf{v}$ and radius $r$ (this is an $N$-dimensional cross polytope or orthoplex, a generalization of the octahedron).
Figure 1. Projections in various norms. The vector $\mathbf{v}$ is projected onto the image space of $H$, i.e., the subspace defined by $\mathbf{u} = H\mathbf{w}$. Consider the smallest sphere around $\mathbf{v}$ (in the corresponding norm) that touches the subspace $\mathbf{u} = H\mathbf{w}$. The radius $r$ of this sphere is the distance of $\mathbf{v}$ from the subspace, and the tangent point $\hat{\mathbf{v}}$ (which is not necessarily unique for the $L_1$ projection) is the projection of $\mathbf{v}$. For this point, $\hat{\mathbf{v}} = H\mathbf{w}$ holds. The shaded region indicates the region $\{\mathbf{u} : \|\mathbf{u}\|_\infty \le \|\mathbf{v}\|_\infty\}$. To ensure the convergence of FVI, the projected vector $\hat{\mathbf{v}}$ must fall into the shaded region.

Suppose indirectly that $\|\mathbf{z}\|_\infty > \|\mathbf{v}\|_\infty$. Without loss of generality we may assume that $z_1$ is the coordinate of $\mathbf{z}$ with the largest absolute value, and that it is positive. Therefore, $z_1 > \|\mathbf{v}\|_\infty$. Let $\mathbf{e}_i$ denote the $i$th coordinate vector ($1 \le i \le N$), and let $\mathbf{z}_\delta = \mathbf{z} - \delta \mathbf{e}_1$. For small enough $\delta > 0$, $S_1(\mathbf{z}_\delta, \delta)$ is a cross polytope such that (a) $S_1(\mathbf{z}_\delta, \delta) \subset S_1(\mathbf{v}, r)$; (b) for all $\mathbf{z}' \in S_1(\mathbf{z}_\delta, \delta)$, $\|\mathbf{z}'\|_\infty > \|\mathbf{v}\|_\infty$; (c) for all sufficiently small $\epsilon > 0$, $(1 - \epsilon)\mathbf{z} \in S_1(\mathbf{z}_\delta, \delta)$. The first two statements are trivial. For the third statement, note that $\mathbf{z}$ is a vertex of the cross polytope $S_1(\mathbf{z}_\delta, \delta)$. Consider the cone whose vertex is $\mathbf{z}$ and whose edges are the same as the edges of $S_1(\mathbf{z}_\delta, \delta)$ joining $\mathbf{z}$. It is easy to see that the vector pointing from $\mathbf{z}$ to the origin is contained in this cone: for each $1 < i \le N$, $|z_i| \le z_1$ (as $z_1$ is the largest coordinate). Consequently, for small enough $\epsilon$, $\mathbf{z} - \epsilon\mathbf{z} \in S_1(\mathbf{z}_\delta, \delta)$.

Fix such an $\epsilon$ and let $\mathbf{q} = (1 - \epsilon)\mathbf{z}$. This vector is (a) contained in the image space of $H$, because $H[(1 - \epsilon)\mathbf{w}] = \mathbf{q}$; and (b) $\|\mathbf{q} - \mathbf{v}\|_1 \le \|\mathbf{z} - \mathbf{v}\|_1 = r$. The vector $\mathbf{z}$ was chosen so that it minimizes the $L_1$-distance from $\mathbf{v}$ within the image space of $H$, so the inequality cannot be sharp, i.e., $\|\mathbf{q} - \mathbf{v}\|_1 = r$. However, $\|\mathbf{q}\|_2 = (1 - \epsilon)\|\mathbf{z}\|_2 < \|\mathbf{z}\|_2$ with strict inequality, which contradicts our choice of $\mathbf{z}$, thus completing the proof. □
A.2. Probabilistic interpretation of $N(H^T)$.

Definition A.2. The basis functions $\{h_k\}_{k=1}^{K}$ have the uniform covering (UC) property if all row sums in the corresponding matrix $H$ are identical, $\sum_{k=1}^{K} H_{\mathbf{x},k} = B$ for all $\mathbf{x} \in X$, and all entries are nonnegative. Without loss of generality we may assume that all rows sum to 1, i.e., that $H$ is a stochastic matrix.

We shall introduce an auxiliary MDP $\bar{M}$ such that exact value iteration in $\bar{M}$ is identical to approximate value iteration in the original MDP $M$. Let $S$ be a $K$-element state space with states $s_1, \ldots, s_K$. A state $s$ is considered a discrete observation of the true state of the system, $\mathbf{x} \in X$. The action space $A$ and the discount factor $\gamma$ are identical to the corresponding items of $M$, and an arbitrary element $s_0 \in S$ is selected as initial state. In order to obtain the transition probabilities, let us consider how one can get from observing $s$ to observing $s'$ in the next time step: from observation $s$, we can infer the hidden state $\mathbf{x}$ of the system; in state $\mathbf{x}$, the agent takes action $a$ and transfers to state $\mathbf{x}'$ according to the original MDP; after that, we can infer the probability that our observation will be $s'$, given the hidden state $\mathbf{x}'$. Consequently, the transition probability $\bar{P}(s' \mid s, a)$ can be defined as the total probability of all such paths:

$\bar{P}(s' \mid s, a) := \sum_{\mathbf{x}, \mathbf{x}' \in X} \Pr(\mathbf{x} \mid s) \Pr(\mathbf{x}' \mid \mathbf{x}, a) \Pr(s' \mid \mathbf{x}')$.

Here the middle term is just the transition probability in the original MDP, the rightmost term is $H_{\mathbf{x}',s'}$, and the leftmost term can be rewritten using Bayes' law (assuming a uniform prior on $\mathbf{x}$):

$\Pr(\mathbf{x} \mid s) = \dfrac{\Pr(s \mid \mathbf{x}) \Pr(\mathbf{x})}{\sum_{\mathbf{x}'' \in X} \Pr(s \mid \mathbf{x}'') \Pr(\mathbf{x}'')} = \dfrac{H_{\mathbf{x},s} \cdot \frac{1}{|X|}}{\sum_{\mathbf{x}'' \in X} H_{\mathbf{x}'',s} \cdot \frac{1}{|X|}} = \dfrac{H_{\mathbf{x},s}}{\sum_{\mathbf{x}'' \in X} H_{\mathbf{x}'',s}}$.

Consequently,

$\bar{P}(s' \mid s, a) = \sum_{\mathbf{x}, \mathbf{x}' \in X} \dfrac{H_{\mathbf{x},s}}{\sum_{\mathbf{x}'' \in X} H_{\mathbf{x}'',s}} P(\mathbf{x}' \mid \mathbf{x}, a) H_{\mathbf{x}',s'} = \big[ N(H^T) P^a H \big]_{s,s'}$.

The rewards can be defined similarly:

$\bar{R}(s, a) := \sum_{\mathbf{x} \in X} \Pr(\mathbf{x} \mid s) R(\mathbf{x}, a) = \big[ N(H^T) \mathbf{r}^a \big]_s$.

It is easy to see that approximate value iteration in $M$ corresponds to exact value iteration in the auxiliary MDP $\bar{M}$.

A.3. The proof of the sampling theorem (Theorem 3.1).
First we prove a useful lemma about approximating the product of two large matrices. Let $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times k}$, and let $C = A \cdot B$. Suppose that we sample columns of $A$ uniformly at random (with repetition), and that we also select the corresponding rows of $B$. Denote the resulting matrices by $\hat{A}$ and $\hat{B}$. We will show that $A \cdot B \approx c \cdot \hat{A} \cdot \hat{B}$, where $c$ is a constant scaling factor compensating for the dimension decrease of the sampled matrices. The following lemma is similar to Lemma 11 of [11], but here we estimate the infinity-norm instead of the $L_2$-norm.

Lemma A.3.
Let $A \in \mathbb{R}^{m \times N}$, $B \in \mathbb{R}^{N \times k}$ and $C = A \cdot B$. Let $N'$ be an integer such that $1 \le N' \le N$, and for each $i \in \{1, \ldots, N'\}$, let $r_i$ be a uniformly random integer from the interval $[1, N]$. Let $\hat{A} \in \mathbb{R}^{m \times N'}$ be the matrix whose $i$th column is the $r_i$th column of $A$, and denote by $\hat{B}$ the $N' \times k$ matrix that is obtained by sampling the rows of $B$ similarly. Furthermore, let

$\hat{C} = \dfrac{N}{N'} \hat{A} \cdot \hat{B} = \dfrac{N}{N'} \sum_{i=1}^{N'} A_{*,r_i} B_{r_i,*}$.
Then, for any $\epsilon, \delta > 0$, $\|\hat{C} - C\|_\infty \le \epsilon N \|A\|_\infty \|B\|_\infty$ holds with probability at least $1 - \delta$, if the sample size satisfies $N' \ge \frac{2m^2}{\epsilon^2} \log \frac{2km}{\delta}$.

Proof. We begin by bounding individual elements of the matrix $\hat{C}$: consider the element

$\hat{C}_{pq} = \dfrac{N}{N'} \sum_{i=1}^{N'} A_{p,r_i} B_{r_i,q}$.

Let $\mathcal{C}_{pq}$ be the discrete probability distribution determined by the mass points $\{A_{p,i} \cdot B_{i,q} \mid 1 \le i \le N\}$. Note that $\hat{C}_{pq}$ is essentially the (rescaled) sum of $N'$ random variables drawn uniformly from the distribution $\mathcal{C}_{pq}$. Clearly,

$|A_{pi} B_{iq}| \le \max_{ij} |A_{ij}| \max_{ij} |B_{ij}| \le \max_i \sum_j |A_{ij}| \max_i \sum_j |B_{ij}| = \|A\|_\infty \|B\|_\infty$,

so we can apply Hoeffding's inequality to obtain

$\Pr\Big( \Big| \dfrac{\sum_{i=1}^{N'} A_{p,r_i} B_{r_i,q}}{N'} - \dfrac{\sum_{j=1}^{N} A_{p,j} B_{j,q}}{N} \Big| > \epsilon_0 \Big) < 2 \exp\Big( \dfrac{-N' \epsilon_0^2}{2 \|A\|_\infty^2 \|B\|_\infty^2} \Big)$,

or equivalently,

$\Pr\Big( \Big| \dfrac{N}{N'} [\hat{A}\hat{B}]_{pq} - [AB]_{pq} \Big| > N \epsilon_0 \Big) < 2 \exp\Big( \dfrac{-N' \epsilon_0^2}{2 \|A\|_\infty^2 \|B\|_\infty^2} \Big)$,

where $\epsilon_0 > 0$ is arbitrary. Summing over the elements of a column of $\hat{C} - C$:

$\Pr\Big( \sum_{p=1}^{m} \big| \hat{C}_{pq} - C_{pq} \big| > m \cdot N \epsilon_0 \Big) < 2m \exp\Big( \dfrac{-N' \epsilon_0^2}{2 \|A\|_\infty^2 \|B\|_\infty^2} \Big)$,

which gives a bound on $\|\hat{C} - C\|_\infty$, the maximum of these sums:

$\Pr\big( \|\hat{C} - C\|_\infty > m N \epsilon_0 \big) = \Pr\Big( \max_q \sum_{p=1}^{m} \big| \hat{C}_{pq} - C_{pq} \big| > m N \epsilon_0 \Big) < 2km \exp\Big( \dfrac{-N' \epsilon_0^2}{2 \|A\|_\infty^2 \|B\|_\infty^2} \Big)$.

Therefore, substituting $\epsilon_0 = \epsilon \|A\|_\infty \|B\|_\infty / m$, the statement of the lemma is satisfied if

$2km \exp\Big( \dfrac{-N' \epsilon^2}{2m^2} \Big) \le \delta$, i.e., if $N' \ge \dfrac{2m^2}{\epsilon^2} \log \dfrac{2km}{\delta}$. □

If both $A$ and $B$ are structured, we can sharpen the lemma to give a much better (potentially exponentially better) bound. For this, we need the following definition: for any index set $Z$, a matrix $A$ is called a $Z$-local-scope matrix if each column of $A$ represents a local-scope function with scope $Z$.
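For illustration, the column/row sampling scheme of Lemma A.3 takes only a few lines; a sketch (ours):

    import numpy as np

    def sampled_product(A, B, num_samples, seed=0):
        # Approximate C = A @ B: draw r_1, ..., r_{N'} uniformly with repetition,
        # keep those columns of A and rows of B, and rescale by N / N'.
        rng = np.random.default_rng(seed)
        N = A.shape[1]
        idx = rng.integers(0, N, size=num_samples)
        return (N / num_samples) * A[:, idx] @ B[idx, :]

    # Sanity check: with random A, B and N' << N, the max-norm error
    # ||sampled_product(A, B, N') - A @ B||_inf concentrates as in the lemma.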
Lemma A.4. Let $A^T$ and $B$ be local-scope matrices with scopes $Z_1$ and $Z_2$, let $N_1 = n^{|Z_1| + |Z_2|}$, and apply the random row/column selection procedure of the previous lemma. Then, for any $\epsilon, \delta > 0$, $\|\hat{C} - C\|_\infty \le \epsilon N_1 \|A\|_\infty \|B\|_\infty$ holds with probability at least $1 - \delta$, if the sample size satisfies $N' \ge \frac{2m^2}{\epsilon^2} \log \frac{2km}{\delta}$.

Proof. Fix a variable assignment $\mathbf{x}[Z_1 \cup Z_2]$ on the domain $Z_1 \cup Z_2$ and consider the columns of $A$ that correspond to a variable assignment compatible with $\mathbf{x}[Z_1 \cup Z_2]$, i.e., those that are identical to it on the components of $Z_1 \cup Z_2$ and arbitrary on $W := \{1, 2, \ldots, m\} \setminus (Z_1 \cup Z_2)$. It is easy to see that all of these columns are identical because of the local-scope property. The same holds for the rows of $B$. All the equivalence classes of columns/rows have cardinality $N_2 := n^{|W|} = N/N_1$.

Now let us define the $m \times N_1$ matrix $A'$ so that only one column is kept from each equivalence class, and define the $N_1 \times k$ matrix $B'$ similarly, by omitting rows. Clearly,

$A \cdot B = N_2 \, A' \cdot B'$,

and we can apply the sampling lemma to the smaller matrices $A'$ and $B'$ to get that for any $\epsilon, \delta > 0$ and $N' \ge \frac{2m^2}{\epsilon^2} \log \frac{2km}{\delta}$, with probability at least $1 - \delta$,

$\|\widehat{A' \cdot B'} - A' \cdot B'\|_\infty \le \epsilon N_1 \|A'\|_\infty \|B'\|_\infty$.

Exploiting the fact that the max-norm of a matrix is the maximum of its row norms, $\|A'\|_\infty = \|A\|_\infty / N_2$ and $\|B'\|_\infty = \|B\|_\infty$, so we can multiply both sides by $N_2$ to get

$\|N_2 \widehat{A' \cdot B'} - A \cdot B\|_\infty \le \epsilon N_1 N_2 \dfrac{\|A\|_\infty}{N_2} \|B\|_\infty = \epsilon N_1 \|A\|_\infty \|B\|_\infty$,

which is the statement of the lemma. □

Note that if the scopes $Z_1$ and $Z_2$ are small, then the gain compared to the previous lemma can be exponential.
Lemma A.5. Let $A = A_1 + \ldots + A_p$ and $B = B_1 + \ldots + B_q$, where all $A_i^T$ and $B_j$ are local-scope matrices with domain size at most $z$, and let $N_1 = n^{2z}$. If we apply the random row/column selection procedure, then for any $\epsilon, \delta > 0$,

$\|\hat{C} - C\|_\infty \le \epsilon N_1 \, pq \, \max_i \|A_i\|_\infty \max_j \|B_j\|_\infty$

holds with probability at least $1 - \delta$, if the sample size satisfies $N' \ge \frac{2m^2}{\epsilon^2} \log \frac{2kmpq}{\delta}$.

Proof.

$\|\hat{C} - C\|_\infty \le \sum_{i=1}^{p} \sum_{j=1}^{q} \|\widehat{A_i \cdot B_j} - A_i \cdot B_j\|_\infty$.

To each individual product term we can apply the previous lemma with failure probability $\delta/(pq)$. Note that we can use the same row/column samples for each product, because independence is required only within single matrix pairs. Summing the right-hand sides gives the statement of the lemma. □
Now we can complete the proof of Theorem 3.1:
Proof.

$\|\mathbf{w}^* - \mathbf{w}'\|_\infty = \|\mathcal{G}TH\mathbf{w}^* - \hat{\mathcal{G}}T\hat{H}\mathbf{w}'\|_\infty$
$\le \|\mathcal{G}TH\mathbf{w}^* - \hat{\mathcal{G}}T\hat{H}\mathbf{w}^*\|_\infty + \|\hat{\mathcal{G}}T\hat{H}\mathbf{w}^* - \hat{\mathcal{G}}T\hat{H}\mathbf{w}'\|_\infty$
$\le \|\mathcal{G}TH\mathbf{w}^* - \hat{\mathcal{G}}T\hat{H}\mathbf{w}^*\|_\infty + \gamma \|\mathbf{w}^* - \mathbf{w}'\|_\infty$,

i.e.,

$\|\mathbf{w}^* - \mathbf{w}'\|_\infty \le \dfrac{1}{1-\gamma} \|\mathcal{G}TH\mathbf{w}^* - \hat{\mathcal{G}}T\hat{H}\mathbf{w}^*\|_\infty$.

Let $\pi$ be the greedy policy with respect to the value function $H\mathbf{w}^*$. With its help, we can rewrite $TH\mathbf{w}^*$ as a linear expression: $TH\mathbf{w}^* = \mathbf{r}^\pi + \gamma P^\pi H \mathbf{w}^*$. Furthermore, $T$ is a componentwise operator, so we can express its effect on the downsampled value function as $T\hat{H}\mathbf{w}^* = \hat{\mathbf{r}}^\pi + \gamma \widehat{P^\pi H} \mathbf{w}^*$. Consequently,

$\|\mathcal{G}TH\mathbf{w}^* - \hat{\mathcal{G}}T\hat{H}\mathbf{w}^*\|_\infty \le \|G\mathbf{r}^\pi - \hat{G}\hat{\mathbf{r}}^\pi\|_\infty + \gamma \|GP^\pi H - \hat{G}\widehat{P^\pi H}\|_\infty \|\mathbf{w}^*\|_\infty$.

Applying the previous lemma two times, we get that with probability greater than $1 - \delta_1$, $\|G\mathbf{r}^\pi - \hat{G}\hat{\mathbf{r}}^\pi\|_\infty \le \epsilon_1 C_1$ if $N' \ge \frac{2m^2}{\epsilon_1^2} \log \frac{2m}{\delta_1}$, and with probability greater than $1 - \delta_2$, $\|GP^\pi H - \hat{G}\widehat{P^\pi H}\|_\infty \le \epsilon_2 C_2$ if $N' \ge \frac{2m^2}{\epsilon_2^2} \log \frac{2m^2}{\delta_2}$; where $C_1$ and $C_2$ are constants depending polynomially on $N_1$ and on the norms of the component local-scope functions, but independent of $N$.

Using the notation $M = \frac{1}{1-\gamma}\big(C_1 + \gamma C_2 \|\mathbf{w}^*\|_\infty\big)$, the settings $\epsilon_1 = \epsilon_2 = \epsilon / M$ and $\delta_1 = \delta_2 = \delta / 2$ prove the theorem. □

Informally, this theorem tells us that the required number of samples grows quadratically with the desired accuracy $1/\epsilon$ and logarithmically with the required certainty $1/\delta$; furthermore, the dependence on the number of variables $m$ is slightly worse than quadratic. This means that even if the number of equations is exponentially large, i.e., $N = O(e^m)$, we can select a polynomially large random subset of the equations so that, with high probability, the solution does not change very much.

References
[1] Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. Complexity of finding embeddings in a k-tree. SIAM Journal on Algebraic and Discrete Methods, 8(2):277–284, 1987.
[2] L.C. Baird. Residual algorithms: Reinforcement learning with function approximation. In ICML, pages 30–37, 1995.
[3] Richard E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
[4] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting structure in policy construction. In IJCAI, pages 1104–1111, 1995.
[6] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence, 121(1-2):49–107, 2000.
[7] D.P. de Farias and B. van Roy. Approximate dynamic programming via linear programming. In NIPS, pages 689–695, 2001.
[8] D.P. de Farias and B. van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004.
[9] R. Dechter. Bucket elimination: A unifying framework for reasoning. AIJ, 113(1-2):41–85, 1999.
[10] Dmitri A. Dolgov and Edmund H. Durfee. Symmetric primal-dual approximate linear programming for factored MDPs. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics (AI&M 2006), 2006.
[11] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal of Computing, 36:132–157, 2006.
[12] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In SODA, pages 1127–1136, 2006.
[13] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient solution algorithms for factored MDPs. JAIR, 19:399–468, 2002.
[14] C. Guestrin, R. Patrascu, and D. Schuurmans. Algorithm-directed exploration for model-based reinforcement learning in factored MDPs. In ICML, pages 235–242, 2002.
[15] Carlos Guestrin, Daphne Koller, Chris Gearhart, and Neal Kanodia. Generalizing plans to new environments in relational MDPs. In Eighteenth International Joint Conference on Artificial Intelligence, 2003.
[16] M.J. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 740–747, 1999.
[17] D. Koller and R. Parr. Policy iteration for factored MDPs. In UAI, pages 326–334, 2000.
[18] Branislav Kveton, Milos Hauskrecht, and Carlos Guestrin. Solving factored MDPs with hybrid state and action variables. Journal of Artificial Intelligence Research, 27:153–201, 2006.
[19] M.G. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 4:1107–1149, 2003.
[20] P. Liberatore. The size of MDP factored policies. In AAAI, pages 267–272, 2002.
[21] Relu Patrascu, Pascal Poupart, Dale Schuurmans, Craig Boutilier, and Carlos Guestrin. Greedy linear value-approximation for factored Markov decision processes. In Proceedings of the 18th National Conference on Artificial Intelligence, pages 285–291, 2002.
[22] Pascal Poupart, Craig Boutilier, Relu Patrascu, and Dale Schuurmans. Piecewise linear value function approximation for factored MDPs. In Proceedings of the 18th National Conference on AI, 2002.
[23] Brian Sallans. Reinforcement Learning for Factored Markov Decision Processes. PhD thesis, University of Toronto, 2002.
[24] Scott Sanner and Craig Boutilier. Approximate linear programming for first-order MDPs. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 509–517, 2005.
[25] P.J. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(6):568–582, 1985.
[26] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[27] John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.