The Value Functions of Markov Decision Processes
Ehud Lehrer, Eilon Solan, and Omri N. Solan

November 10, 2015

Lehrer: School of Mathematical Sciences, Tel Aviv University, Tel Aviv 6997800, Israel, and INSEAD, Bd. de Constance, 77305 Fontainebleau Cedex, France; e-mail: [email protected]. Eilon Solan: School of Mathematical Sciences, Tel Aviv University, Tel Aviv 6997800, Israel; e-mail: [email protected]. Omri N. Solan: School of Mathematical Sciences, Tel Aviv University, Tel Aviv 6997800, Israel; e-mail: [email protected]. Lehrer acknowledges the support of the Israel Science Foundation.
Abstract
We provide a full characterization of the set of value functions of Markov decision processes.
Markov decision processes are a standard tool for studying dynamic optimization problems. The discounted value of such a problem is the maximal total discounted amount that the decision maker can guarantee to himself. By Blackwell (1965), the function λ ↦ v_λ(s) that assigns the discounted value at the initial state s to each discount factor λ is the maximum of finitely many rational functions (with real coefficients). Standard arguments show that the roots of the polynomial in the denominator of these rational functions either lie outside the unit ball in the complex plane, or on the boundary of the unit ball, in which case they have multiplicity 1. Using the theory of eigenvalues of stochastic matrices one can show that the roots on the boundary of the unit ball must be unit roots.

In this note we prove the converse result: every function λ ↦ v_λ that is the maximum of finitely many rational functions, such that each root of the polynomials in the denominators either lies outside the unit ball in the complex plane or is a unit root with multiplicity 1, is the value of some Markov decision process.

Definition 1 A Markov decision process is a tuple (S, µ, A, r, q) where

• S is a finite set of states.

• µ ∈ ∆(S) is a distribution according to which the initial state is chosen. (For every finite set X, the set of probability distributions over X is denoted ∆(X).)

• A = (A(s))_{s∈S} is the family of sets of actions available at each state s ∈ S. Denote SA := {(s, a) : s ∈ S, a ∈ A(s)}.

• r : SA → R is a payoff function.

• q : SA → ∆(S) is a transition function.

The process starts at an initial state s_1 ∈ S, chosen according to µ. It then evolves in discrete time: at every stage n ∈ N the process is in a state s_n ∈ S, the decision maker chooses an action a_n ∈ A(s_n), and a new state s_{n+1} is chosen according to q(· | s_n, a_n).

A finite history is a sequence h_n = (s_1, a_1, s_2, a_2, . . . , s_n) ∈ H := ∪_{k=0}^∞ (SA)^k × S. A pure strategy is a function σ : H → ∪_{s∈S} A(s) such that σ(h_n) ∈ A(s_n) for every finite history h_n = (s_1, a_1, . . . , s_n), and a behavior strategy is a function σ : H → ∪_{s∈S} ∆(A(s)) such that σ(h_n) ∈ ∆(A(s_n)) for every such finite history. The set of behavior strategies is denoted B. In other words, σ assigns to every finite history a distribution over actions, which we call a mixed action.
A strategy is stationary if for every finite history h_n = (s_1, a_1, . . . , s_n), the mixed action σ(h_n) is a function of s_n and is independent of (s_1, a_1, . . . , a_{n−1}). Every behavior strategy σ together with a prior distribution µ over the state space induces a probability distribution P_{µ,σ} over the space of infinite histories (SA)^∞ (which is endowed with the product σ-algebra). Expectation w.r.t. this probability distribution is denoted E_{µ,σ}.

For every discount factor λ ∈ [0, 1), the λ-discounted payoff is

γ_λ(µ, σ) := E_{µ,σ} [ Σ_{n=1}^∞ λ^{n−1} r(s_n, a_n) ].

When µ is a probability measure that is concentrated on a single state s we denote the λ-discounted payoff also by γ_λ(s, σ). The λ-discounted value of the Markov decision process with the prior µ over the initial state is

v_λ(µ) := sup_{σ∈B} γ_λ(µ, σ).     (1)

A behavior strategy is λ-discounted optimal if it attains the maximum in (1).
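To make the model concrete, here is a minimal numerical sketch (our own illustration, not part of the paper; the two-state example and all names are hypothetical). For each pure stationary strategy σ it solves the linear system that characterizes γ_λ(µ, σ), and it computes v_λ(µ) by enumerating the finitely many pure stationary strategies; by Blackwell (1965), some pure stationary strategy is optimal, so this maximum is indeed the value.

```python
# Minimal sketch (illustrative): compute v_lambda(mu) for a tiny MDP by
# enumerating pure stationary strategies and solving, for each one, the
# linear system gamma = r + lam * Q gamma.
import itertools
import numpy as np

S = [0, 1]                       # states
A = {0: [0, 1], 1: [0]}          # actions available at each state
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 2.0}   # payoffs r(s, a)
q = {(0, 0): [1.0, 0.0],         # transitions q(.|s, a)
     (0, 1): [0.0, 1.0],
     (1, 0): [0.5, 0.5]}
mu = np.array([1.0, 0.0])        # prior over the initial state

def gamma(lam, sigma):
    """Discounted payoff of the pure stationary strategy sigma."""
    Q = np.array([q[(s, sigma[s])] for s in S])
    rs = np.array([r[(s, sigma[s])] for s in S])
    return mu @ np.linalg.solve(np.eye(len(S)) - lam * Q, rs)

def v(lam):
    # maximum over the finitely many pure stationary strategies
    return max(gamma(lam, dict(zip(S, choice)))
               for choice in itertools.product(*(A[s] for s in S)))

print([round(v(lam), 4) for lam in (0.0, 0.5, 0.9)])
```

Each γ_λ(µ, σ) computed this way is a rational function of λ, which is the fact exploited below.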
Denote by V the set of all functions λ ↦ v_λ(µ) that are the value function of some Markov decision process starting with some prior µ ∈ ∆(S). The goal of the present note is to characterize the set V.

Notation 1 (i) Denote by F the set of all rational functions P/Q such that each root of Q is (a) outside the unit ball, or (b) a unit root with multiplicity 1. (Recall that a complex number ω ∈ C is a unit root if there exists n ∈ N such that ω^n = 1.)
(ii) Denote by Max F the set of functions that are the maximum of a finite number of functions in F.

The next proposition states that any function in V is the maximum of a finite number of functions in F.

Proposition 1 V ⊆ Max F.

Proof.
By Blackwell (1965), for every λ ∈ [0, 1) there is a λ-discounted pure stationary optimal strategy. Since the number of pure stationary strategies is finite, it suffices to show that λ ↦ γ_λ(µ, σ) is in F for every pure stationary strategy σ.

For every pure stationary strategy σ, every prior µ, and every discount factor λ ∈ [0, 1), the vector (γ_λ(s, σ))_{s∈S} is the unique solution of a system of |S| linear equations in λ:

γ_λ(s, σ) = r(s, σ(s)) + λ Σ_{s′∈S} q(s′ | s, σ(s)) γ_λ(s′, σ),  ∀s ∈ S,

where r(s, σ(s)) := Σ_{a∈A(s)} σ(a | s) r(s, a) and q(s′ | s, σ(s)) := Σ_{a∈A(s)} σ(a | s) q(s′ | s, a) are the multilinear extensions of r and q, respectively. (This is valid for every stationary strategy, not necessarily pure; we will need it only for pure stationary strategies.) It follows that

γ_λ(·, σ) = (I − λQ(·, σ(·)))^{−1} · r(·, σ(·)),

where Q(·, σ(·)) = Q = (Q_{s,s′})_{s,s′∈S} is the transition matrix induced by σ, that is, Q_{s,s′} = q(s′ | s, σ(s)). By Cramer's rule, each entry of λ ↦ (I − λQ)^{−1} is a rational function whose denominator is det(I − λQ). In particular, the roots of the denominator are the inverses of the eigenvalues of Q. Since the denominator is independent of s, it is also the denominator of γ_λ(µ, σ) = Σ_{s∈S} µ(s) γ_λ(s, σ).

Denote the expected payoff at stage n by x_n := E_{µ,σ}[r(s_n, σ(h_n))], so that γ_λ(µ, σ) = Σ_{n=1}^∞ x_n λ^{n−1}. Since |x_n| ≤ max_{s∈S, a∈A(s)} |r(s, a)| for every n ∈ N, this power series converges on the open unit ball, and therefore the denominator det(I − λQ(·, σ(·))) does not have roots in the interior of the unit ball, and all its roots that lie on the boundary of the unit ball have multiplicity 1. Moreover, by, e.g., Higham and Lin (2011), the roots that lie on the boundary of the unit ball must be unit roots.
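As a quick numerical illustration of the eigenvalue step (ours, not from the paper):

```python
# Illustrative check: the roots of det(I - lam*Q) are the inverses of the
# nonzero eigenvalues of the stochastic matrix Q induced by sigma, so none
# of them lies in the interior of the unit ball.
import numpy as np

Q = np.array([[0.0, 1.0],
              [0.5, 0.5]])                 # a stochastic matrix
eig = np.linalg.eigvals(Q)                 # here: 1 and -0.5
roots = 1.0 / eig[np.abs(eig) > 1e-12]     # here: 1 and -2
print(roots, np.all(np.abs(roots) >= 1 - 1e-9))
```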
The main result of this note is that the converse holds as well.

Theorem 1 A function w : [0, 1) → R is in V if and only if it is the maximum of a finite number of functions in F.

To avoid cumbersome notation we write f(λ) for the function λ ↦ f(λ). In particular, λf(λ) will denote the function λ ↦ λf(λ).

We start with the following observation.

Lemma 1 If f, g ∈ V then max{f, g} ∈ V.

Proof.
Let M_f = (S_f, µ_f, A_f, r_f, q_f) and M_g = (S_g, µ_g, A_g, r_g, q_g) be the Markov decision processes that implement f and g, respectively. To implement max{f, g}, define a Markov decision process that contains M_f, M_g, and an additional state s*, in which the decision maker chooses one of M_f and M_g. Formally, let M = (S_f ∪ S_g ∪ {s*}, A′, r′, q′) be a Markov decision process in which A′, r′, and q′ coincide with A_f, r_f, and q_f (resp. A_g, r_g, and q_g) on S_f (resp. S_g), and in which in the state s* the decision maker chooses whether to follow M_f or M_g, and the payoff and transitions at that state are the expectation of the payoff and transitions at the first stage in M_f (resp. M_g) according to µ_f (resp. µ_g):

A′(s*) := (×_{s∈S_f} A_f(s)) × (×_{s∈S_g} A_g(s)) × {f, g},

and for every a_f ∈ ×_{s∈S_f} A_f(s) and every a_g ∈ ×_{s∈S_g} A_g(s),

r′(s*, (a_f, a_g, x)) := Σ_{s∈S_f} µ_f(s) r_f(s, a_f(s)) if x = f, and Σ_{s∈S_g} µ_g(s) r_g(s, a_g(s)) if x = g;

q′(· | s*, (a_f, a_g, x)) := Σ_{s∈S_f} µ_f(s) q_f(· | s, a_f(s)) if x = f, and Σ_{s∈S_g} µ_g(s) q_g(· | s, a_g(s)) if x = g.

The reader can verify that the value function of M at the initial state s* is max{f, g}.

As mentioned earlier, for every pure stationary strategy σ and every initial state s_1, the function λ ↦ γ_λ(s_1, σ) is a rational function of λ. Since there is a λ-discounted pure stationary optimal strategy, and since there are finitely many such functions, it follows that the function λ ↦ v_λ is the maximum of finitely many rational functions, each of which is the payoff function of some pure stationary strategy.

When the decision maker follows a pure stationary strategy, we are reduced to a Markov decision process in which there is a single action in each state. This observation leads us to the following definition.

Definition 2 A Markov decision process is degenerate if |A(s)| = 1 for every s ∈ S.

When M is a degenerate Markov decision process we omit the reference to the action in the functions r and q. A degenerate Markov decision process is thus a quadruple (S, µ, r, q), where S is the state space, µ is a probability distribution over S, r : S → R is a payoff function, and q(· | s) is a probability distribution for every state s ∈ S.

Denote by V_D the set of all functions that are payoff functions of some degenerate Markov decision process. Plainly, V_D ⊆ V.
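Because a degenerate MDP involves no optimization, its value function is obtained by solving a single linear system. The following helper (an illustrative sketch; the name degenerate_value is ours) performs this computation; the later verifications follow the same pattern.

```python
# Illustrative sketch: the value of a degenerate MDP (S, mu, r, q) is
# mu^T (I - lam*Q)^{-1} r, where Q is its (unique) transition matrix.
import numpy as np

def degenerate_value(lam, mu, r, Q):
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

# example: a single state with payoff 1 has value 1/(1 - lam)
print(degenerate_value(0.5, np.ones(1), np.ones(1), np.ones((1, 1))))  # 2.0
```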
To prove Theorem 1, we first prove that degenerate Markov decision processes implement all functions in F.

Theorem 2 F ⊆ V_D.

Theorem 2 and Lemma 1 show that Max F ⊆ V, and together with Proposition 1 they establish Theorem 1. The rest of the paper is dedicated to the proof of Theorem 2.

The set V_D

The following lemma lists several properties of the functions implementable by degenerate Markov decision processes.
Lemma 2
For every f ∈ V_D we have:
a) af(λ) ∈ V_D for every a ∈ R.
b) f(−λ) ∈ V_D.
c) λf(λ) ∈ V_D.
d) f(cλ) ∈ V_D for every c ∈ [0, 1].
e) f(λ) + g(λ) ∈ V_D for every g ∈ V_D.
f) f(λ^n) ∈ V_D for every n ∈ N.

Proof. Let M_f = (S_f, µ_f, r_f, q_f) be a degenerate Markov decision process whose value function is f.

To prove Part (a), we multiply all payoffs in M_f by a. Formally, define a degenerate Markov decision process M′ = (S_f, µ_f, r′, q_f) that differs from M_f only in its payoff function: r′(s) := ar_f(s) for every s ∈ S_f. The reader can verify that the value function of M′ is af(λ).
To prove Part (b), multiply the payoff in even stages by −1. Formally, let Ŝ be a copy of S_f; for every state s ∈ S_f we denote by ŝ its copy in Ŝ. Define a degenerate Markov decision process M′ = (S_f ∪ Ŝ, µ_f, r′, q′) with initial distribution µ_f (whose support is S_f) that visits states in Ŝ in even stages and states in S_f in odd stages as follows:

r′(s) := r_f(s), r′(ŝ) := −r_f(s), ∀s ∈ S_f,
q′(ŝ′ | s) = q′(s′ | ŝ) := q_f(s′ | s), ∀s, s′ ∈ S_f,
q′(s′ | s) = q′(ŝ′ | ŝ) := 0, ∀s, s′ ∈ S_f.

The reader can verify that the value function of M′ is f(−λ).

To prove Part (c), add a state with payoff 0 from which the transition probability to a state in S_f coincides with µ_f. Formally, define a degenerate Markov decision process M′ = (S_f ∪ {s*}, µ′, r′, q′) in which µ′ assigns probability 1 to s*; r′ coincides with r_f on S_f, while r′(s*) := 0; finally, q′ coincides with q_f on S_f, while at the state s*, q′(s | s*) := µ_f(s). The value function of M′ is λf(λ).

To prove Part (d), consider the transition function that, at every stage, moves to an absorbing state with payoff 0 with probability 1 − c, and with probability c continues as in M_f. (A state s ∈ S is absorbing if q(s | s, a) = 1 for every action a ∈ A(s).) Formally, define a degenerate Markov decision process M′ = (S_f ∪ {s*}, µ_f, r′, q′) in which r′ coincides with r_f on S_f, except that r′(s*) := 0 and q′(s* | s*) := 1 (that is, s* is an absorbing state), and

q′(s* | s) := 1 − c, q′(s′ | s) := cq_f(s′ | s), ∀s, s′ ∈ S_f.

The value function of M′ is f(cλ).
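The constructions in Parts (b) and (d) are easy to check numerically. The following sketch (our own illustration; the small MDP is arbitrary) builds both augmented processes and compares their values with f(−λ) and f(cλ):

```python
# Numerical sanity checks (illustrative) of Lemma 2(b) and 2(d).
import numpy as np

def val(lam, mu, r, Q):
    # value of a degenerate MDP: mu^T (I - lam*Q)^{-1} r
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

mu = np.array([1.0, 0.0]); r = np.array([1.0, 2.0])
Q = np.array([[0.0, 1.0], [0.5, 0.5]])
lam, c = 0.8, 0.7
n = len(r)

# Part (b): duplicate the state space and negate payoffs on the copy
# visited in even stages; the value becomes f(-lam).
mu_b = np.concatenate([mu, np.zeros(n)])
r_b = np.concatenate([r, -r])
Z = np.zeros((n, n))
Q_b = np.block([[Z, Q], [Q, Z]])          # alternate between the copies
assert np.isclose(val(lam, mu_b, r_b, Q_b), val(-lam, mu, r, Q))

# Part (d): with probability 1-c move to a 0-payoff absorbing state s*;
# the value becomes f(c*lam).
mu_d = np.concatenate([mu, [0.0]])
r_d = np.concatenate([r, [0.0]])
Q_d = np.vstack([np.hstack([c * Q, (1 - c) * np.ones((n, 1))]),
                 np.eye(1, n + 1, n)])    # last row: s* is absorbing
assert np.isclose(val(lam, mu_d, r_d, Q_d), val(c * lam, mu, r, Q))
print("Lemma 2(b) and 2(d) constructions verified numerically.")
```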
To prove Part (e), we show that (f + g)/2 is in V_D and then use Part (a) with a = 2. The function (f + g)/2 is the value function of the degenerate Markov decision process in which the prior chooses, with probability 1/2 each, one of two degenerate Markov decision processes that implement f and g. Formally, let M_g = (S_g, µ_g, r_g, q_g) be a degenerate Markov decision process whose value function is g. Let M = (S_f ∪ S_g, µ′, r′, q′) be the degenerate Markov decision process whose state space consists of disjoint copies of S_f and S_g, the function r′ (resp. q′) coincides with r_f and r_g (resp. q_f and q_g) on S_f (resp. S_g), and the initial distribution is µ′ = (µ_f + µ_g)/2. The value function of M is (f + g)/2.

To prove Part (f), we space out the Markov decision process in a way that stage k of the Markov decision process that implements f becomes stage 1 + (k − 1)n, and the payoff in all other stages is 0. Formally, let M′ = (S_f × {1, 2, . . . , n}, µ′, r′, q′) be a degenerate Markov decision process where µ′((s, 1)) = µ_f(s) and

q′((s, k + 1) | (s, k)) := 1, k ∈ {1, 2, . . . , n − 1}, s ∈ S_f,
q′((s′, 1) | (s, n)) := q_f(s′ | s), s, s′ ∈ S_f,
r′((s, 1)) := r_f(s), s ∈ S_f,
r′((s, k)) := 0, k ∈ {2, . . . , n}, s ∈ S_f.

The value function of M′ with the prior µ′ is f(λ^n).

Throughout the paper, whenever we refer to a polynomial we mean a polynomial with real coefficients.

Lemma 3 a) Every polynomial P is in V_D, and if f ∈ V_D then P · f is also in V_D.
b) Let P and Q be two polynomials. If 1/Q ∈ V_D then P/Q ∈ V_D. In particular, if Q′ divides Q and 1/Q ∈ V_D then 1/Q′ ∈ V_D.
c) If Q is a polynomial whose roots are all unit roots of multiplicity 1, then 1/Q ∈ V_D.

Proof. Part (a) follows from Lemma 2(a,c,e) and the observation that any constant function a is in V_D, which holds since the constant function a is the value function of the degenerate Markov decision process that starts with a state whose payoff is a and continues to an absorbing state whose payoff is 0.

Part (b) follows from Part (a), since P/Q = P · (1/Q); for the last claim take P = Q/Q′, which is a polynomial, so that P/Q = 1/Q′.

We turn to prove Part (c). The degenerate Markov decision process with a single state in which the payoff is 1 yields payoff 1/(1 − λ), and therefore 1/(1 − λ) ∈ V_D. By Lemma 2(f) it follows that 1/(1 − λ^n) ∈ V_D for every n ∈ N. Let n be large enough such that Q divides 1 − λ^n; such n exists since every root of Q is a unit root with multiplicity 1. The result now follows by Part (b).
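A numerical illustration of the chain of steps in Part (c) (ours, with Q(λ) = 1 + λ, whose root −1 is a unit root of multiplicity 1):

```python
# Illustrative check of the steps in Lemma 3(c). The single-state MDP
# with payoff 1 has value 1/(1 - lam); spacing it out (Lemma 2(f)) with
# n = 2 gives 1/(1 - lam**2); and since Q(lam) = 1 + lam divides
# 1 - lam**2, Lemma 3(b) yields 1/Q with the polynomial P(lam) = 1 - lam.
import numpy as np

def val(lam, mu, r, Q):
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

lam = 0.37
# single state, payoff 1:
assert np.isclose(val(lam, np.ones(1), np.ones(1), np.ones((1, 1))),
                  1 / (1 - lam))
# spaced-out version: a two-state cycle with payoff 1 only in state 0
mu = np.array([1.0, 0.0]); r = np.array([1.0, 0.0])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
assert np.isclose(val(lam, mu, r, Q), 1 / (1 - lam**2))
# divisibility step: (1 - lam)/(1 - lam**2) == 1/(1 + lam)
assert np.isclose((1 - lam) / (1 - lam**2), 1 / (1 + lam))
print("Lemma 3(c) steps verified at lam =", lam)
```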
To complete the proof of Theorem 2 we characterize the polynomials Q that satisfy 1/Q ∈ V_D. To this end we need the following property of V_D.

Lemma 4 If f, g ∈ V_D then f(λ)g(cλ) ∈ V_D for every c ∈ (0, 1).

Proof. The proof of Lemma 4 is the most intricate part of the proof of Theorem 1. We start with an example that will help us illustrate the formal definition of the degenerate MDP that implements f(λ)g(cλ).

Let M_f and M_g be the degenerate Markov decision processes that are depicted in Figure 1, with the initial distributions µ_f(s_f) = 1 and µ_g(s_{1,g}) = 1, in which the payoff at each state appears in a square next to the state. Denote by f and g the value functions of M_f and M_g, respectively.

[Figure 1: An example of two MDPs: M_f, with states s_f, s′_f, s′′_f, and M_g, with states s_{1,g}, s_{2,g}, s_{3,g}.]

Consider the degenerate Markov decision process M depicted in Figure 2, where c ∈ (0, 1) and the initial state is s_{1,g}.
[Figure 2: The degenerate MDP M: the copy of M_g, and one copy of M_f for each state of S_g, reached with probability 1 − c at every stage.]

The MDP M is composed of one copy of M_g, and for every state in S_g it contains one copy of M_f. It starts at s_{1,g}, the initial state of M_g. Then, at every stage, with probability c it continues as in M_g, and with probability 1 − c it moves to a copy of M_f. In case a transition to a copy of M_f occurs, the new state is chosen according to the transitions q_f(· | s_f). This induces a distribution similar to that of the second stage of M_f.

The payoff in each of the copies of M_f is the product of the payoff in M_f times the payoff of the state in M_g that has been assigned to that copy. The payoff in each state s_g ∈ S_g is (1 − c)r_g(s_g) times the expected payoff in the first stage of M_f.

Thus, each state s_g ∈ S_g serves three purposes (see Figure 2). First, it is a regular state in the copy of M_g. Second, once a transition from M_g to a copy of M_f occurs (at each stage it occurs with probability 1 − c), it serves as the first stage in M_f. Finally, once a transition from s_g to a copy of M_f occurs, the payoffs in the copy are set to the product of the original payoffs in M_f times r_g(s_g).

We now turn to the formal construction of M. Let f, g ∈ V_D and let M_f = (S_f, µ_f, r_f, q_f) (resp. M_g = (S_g, µ_g, r_g, q_g)) be the degenerate Markov decision process that implements f (resp. g). Define the following degenerate Markov decision process M = (S, µ, r, q):

• The set of states is S = (S_g × S_f) ∪ S_g. In words, the set of states contains a copy of S_g, and for each state s_g ∈ S_g it contains a copy of S_f.

• The initial distribution is µ_g:

µ(s_g) = µ_g(s_g), ∀s_g ∈ S_g,
µ(s_g, s_f) = 0, ∀(s_g, s_f) ∈ S_g × S_f.

• The transitions are as follows:

– In each copy of S_f, the transition is the same as the transition in M_f:

q((s_g, s′_f) | (s_g, s_f)) := q_f(s′_f | s_f), ∀s_g ∈ S_g, s_f, s′_f ∈ S_f,
q((s′_g, s′_f) | (s_g, s_f)) := 0, ∀s_g ≠ s′_g ∈ S_g, s_f, s′_f ∈ S_f.

– In the copy of S_g, with probability c the transition is as in M_g, and with probability 1 − c it is as in M_f starting with the initial distribution µ_f:

q(s′_g | s_g) := cq_g(s′_g | s_g), ∀s_g, s′_g ∈ S_g,
q((s_g, s_f) | s_g) := (1 − c) Σ_{s′_f∈S_f} µ_f(s′_f) q_f(s_f | s′_f), ∀s_g ∈ S_g, s_f ∈ S_f,
q((s′_g, s_f) | s_g) := 0, ∀s_g ≠ s′_g ∈ S_g, s_f ∈ S_f.

• The payoff function is as follows:

r(s_g) := (1 − c)r_g(s_g) Σ_{s_f∈S_f} µ_f(s_f) r_f(s_f), ∀s_g ∈ S_g,
r(s_g, s_f) := r_g(s_g)r_f(s_f), ∀s_g ∈ S_g, s_f ∈ S_f.

We now calculate the value function of M. Denote by E_{µ_f}[r_f] := Σ_{s_f∈S_f} µ_f(s_f) r_f(s_f) the expected payoff in M_f at the first stage, and by R the expected discounted payoff in M_f from the second stage on, discounted to the second stage, so that f(λ) = E_{µ_f}[r_f] + λR.

At every stage, with probability c the process remains in S_g and with probability 1 − c the process leaves it.
In particular, the probability that at stage n the process is still in S_g is c^{n−1}, in which case (a) the payoff is (1 − c)r_g(s_n)E_{µ_f}[r_f], and (b) with probability 1 − c the process moves to a copy of M_f, where the expected discounted payoff from stage n + 1 on, discounted to stage n + 1, is r_g(s_n)R. It follows that the total discounted payoff is

Σ_{n=1}^∞ c^{n−1} λ^{n−1} ((1 − c)r_g(s_n)E_{µ_f}[r_f] + (1 − c)λr_g(s_n)R)
= (1 − c) Σ_{n=1}^∞ c^{n−1} λ^{n−1} r_g(s_n) f(λ)
= (1 − c)g(cλ)f(λ).

The result follows by Lemma 2(a).
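As a sanity check (our own illustration, with an arbitrary pair of small degenerate MDPs), one can assemble M exactly as above and compare its value with (1 − c)f(λ)g(cλ):

```python
# Numerical check (illustrative) of the construction in Lemma 4.
import numpy as np

def val(lam, mu, r, Q):
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

# two small degenerate MDPs implementing f and g
mu_f = np.array([1.0, 0.0]); r_f = np.array([1.0, 2.0])
Q_f = np.array([[0.0, 1.0], [0.5, 0.5]])
mu_g = np.array([0.5, 0.5]); r_g = np.array([3.0, 1.0])
Q_g = np.array([[1.0, 0.0], [0.2, 0.8]])
c, lam = 0.6, 0.9
nf, ng = len(r_f), len(r_g)

# states: the copy of S_g first, then one copy of S_f per state of S_g
n = ng + ng * nf
Q = np.zeros((n, n)); r = np.zeros(n); mu = np.zeros(n)
mu[:ng] = mu_g
Erf = mu_f @ r_f                      # expected first-stage payoff in M_f
for sg in range(ng):
    blk = ng + sg * nf                # indices of the copy of S_f for sg
    Q[np.ix_(range(blk, blk + nf), range(blk, blk + nf))] = Q_f
    Q[sg, :ng] = c * Q_g[sg]          # stay in S_g with probability c
    Q[sg, blk:blk + nf] = (1 - c) * (mu_f @ Q_f)  # jump to "stage 2" of M_f
    r[sg] = (1 - c) * r_g[sg] * Erf
    r[blk:blk + nf] = r_g[sg] * r_f

f = lambda x: val(x, mu_f, r_f, Q_f)
g = lambda x: val(x, mu_g, r_g, Q_g)
assert np.isclose(val(lam, mu, r, Q), (1 - c) * f(lam) * g(c * lam))
print("value of M:", val(lam, mu, r, Q))
```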
Lemma 5
Let ω ∈ C be a complex number that lies outside the unit ball, and denote by ω̄ its conjugate.

a) If ω ∈ C \ R then 1/((ω − λ)(ω̄ − λ)) ∈ V_D.
b) If ω ∈ R then 1/(ω − λ) ∈ V_D.

Proof.
We start by proving Part (a). For every complex number ω ∈ C \ R that lies outside the unit ball there are three natural numbers k < l < m and three nonnegative reals α_1, α_2, α_3 that sum up to 1 such that 1 = α_1ω^k + α_2ω^l + α_3ω^m.

Consider the degenerate Markov decision process that is depicted in Figure 3. That is, the set of states is S := {s_1, s_2, . . . , s_m}, the payoff function is

r(s_m) := 1, r(s_j) := 0, 1 ≤ j < m,

and the transition function is

q(s_{m−k+1} | s_m) := α_1, q(s_{m−l+1} | s_m) := α_2, q(s_1 | s_m) := α_3,
q(s_{j+1} | s_j) := 1, 1 ≤ j < m.

[Figure 3: The degenerate MDP in the proof of Lemma 5: a path s_1 → s_2 → · · · → s_m, from which s_m returns to s_{m−k+1}, s_{m−l+1}, and s_1 with probabilities α_1, α_2, and α_3, respectively.]

The discounted value satisfies

v_λ(s_j) = λv_λ(s_{j+1}), 1 ≤ j < m,
v_λ(s_m) = 1 + λ(α_1 v_λ(s_{m−k+1}) + α_2 v_λ(s_{m−l+1}) + α_3 v_λ(s_1)).

It follows that

v_λ(s_m) = 1/(1 − α_1λ^k − α_2λ^l − α_3λ^m).

Hence, this function is in V_D. Since ω is a root of the denominator, and hence so is ω̄ because the denominator has real coefficients, by Lemma 3(b) we obtain that 1/((ω − λ)(ω̄ − λ)) ∈ V_D, as desired.
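Here is a concrete instance of this construction (our own illustration): for ω = 2i one may take m = 4, with return probabilities 0.75 to s_3 (contributing the λ² term) and 0.25 to s_1 (contributing the λ⁴ term); the third return edge of the general construction is not needed here, since 0.75ω² + 0.25ω⁴ = 1 already.

```python
# Illustrative instance of the construction for omega = 2i:
#   1 = 0.75*omega**2 + 0.25*omega**4, nonnegative weights summing to 1,
# so the value 1/(1 - 0.75*lam**2 - 0.25*lam**4) has omega as a root of
# its denominator.
import numpy as np

def val(lam, mu, r, Q):
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

m = 4
r = np.zeros(m); r[m - 1] = 1.0           # payoff 1 at s_m, 0 elsewhere
Q = np.zeros((m, m))
for j in range(m - 1):
    Q[j, j + 1] = 1.0                     # s_j -> s_{j+1}
Q[m - 1, 2] = 0.75                        # s_m -> s_3 (the lam**2 term)
Q[m - 1, 0] = 0.25                        # s_m -> s_1 (the lam**4 term)
mu = np.zeros(m); mu[m - 1] = 1.0         # start at s_m

lam = 0.5
assert np.isclose(val(lam, mu, r, Q),
                  1 / (1 - 0.75 * lam**2 - 0.25 * lam**4))
omega = 2j
print(1 - 0.75 * omega**2 - 0.25 * omega**4)  # 0, so omega is a root
```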
We turn to prove Part (b). Let ω be a real number with ω > 1. By Lemma 3(c), 1/(1 − λ) is in V_D. By Lemma 2(d), 1/(1 − λ/ω) ∈ V_D, and by Lemma 2(a), 1/(ω − λ) ∈ V_D. When ω < −1, by Lemma 2(a,b), 1/(ω − λ) ∈ V_D.

Let Q ≠ 0 be a polynomial with real coefficients whose roots are either outside the unit ball or unit roots with multiplicity 1. To complete the proof of Theorem 2 we prove that 1/Q ∈ V_D. Denote by Ω_1 the set of all roots of Q that are unit roots, by Ω_2 the set of all roots of Q that lie outside the unit ball and have a positive imaginary part, and by Ω_3 the set of all real roots of Q that lie outside the unit ball. If some roots have multiplicity larger than 1, then they appear several times in Ω_2 or Ω_3.

For i = 1, 3 set Q_i = ∏_{ω∈Ω_i}(ω − λ), and set Q_2 = ∏_{ω∈Ω_2}(ω − λ)(ω̄ − λ); when Ω_i = ∅ we set Q_i = 1. Then Q = Q_1 · Q_2 · Q_3, up to a multiplicative constant that is handled by Lemma 2(a). If Ω_1 ≠ ∅, then by Lemma 3(c) we have 1/Q_1 ∈ V_D. Otherwise Q_1 = 1, in which case 1/Q_1 ∈ V_D by Lemma 3(a).

Fix ω ∈ Ω_2 and let c ∈ R be such that 1 < c < |ω|. Since ω/c lies outside the unit ball, Lemma 5(a) implies that g_ω(λ) := 1/((ω/c − λ)(ω̄/c − λ)) is in V_D. By Lemma 4, applied with the constant 1/c ∈ (0, 1),

g_ω(λ/c) · (1/Q_1) = c²/((ω − λ)(ω̄ − λ)) · (1/Q_1) ∈ V_D.

By Lemma 2(a), 1/((ω − λ)(ω̄ − λ)) · (1/Q_1) ∈ V_D. Applying this argument successively for the remaining roots in Ω_2, one obtains that 1/(Q_1 · Q_2) ∈ V_D.

To complete the proof we apply a similar idea to ω ∈ Ω_3. Fix ω ∈ Ω_3 and let c ∈ R be such that 1 < c < |ω|. By Lemma 5(b), 1/(ω/c − λ) ∈ V_D, and again by Lemma 4 and Lemma 2(a), 1/(ω − λ) · 1/(Q_1 · Q_2) ∈ V_D. By iterating this argument for every ω ∈ Ω_3 one obtains that 1/(Q_1 · Q_2 · Q_3) ∈ V_D, as desired.

The set V contains all value functions of MDPs in which the state at the first stage is chosen according to a probability distribution µ. One can wonder whether the set of implementable value functions shrinks if one restricts attention to MDPs in which the first state is given; that is, µ assigns probability 1 to one state. The answer is negative: the value function of any MDP in which the initial state is chosen according to a probability distribution (prior) can be obtained as the value function of an MDP in which the initial state is deterministic. Indeed, let M be an MDP with a prior. One can construct M′ by adding to M an initial state s′ in which the payoff is the expected payoff at the first stage of M and the transitions are the expected transitions after the first stage of M. We applied a similar idea in the proof of Lemma 1.
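In the degenerate case this reduction is a one-line computation; the following sketch (ours, not from the paper) verifies it for an arbitrary small example:

```python
# Illustrative sketch of the remark above (degenerate case): replace the
# prior mu by a deterministic initial state s' whose payoff is the
# expected first-stage payoff and whose transitions are the expected
# second-stage transitions.
import numpy as np

def val(lam, mu, r, Q):
    return mu @ np.linalg.solve(np.eye(len(r)) - lam * Q, r)

mu = np.array([0.3, 0.7]); r = np.array([1.0, 2.0])
Q = np.array([[0.0, 1.0], [0.5, 0.5]])

# augmented MDP: s' first, then the original states
mu2 = np.array([1.0, 0.0, 0.0])
r2 = np.concatenate([[mu @ r], r])          # expected first-stage payoff
Q2 = np.zeros((3, 3))
Q2[0, 1:] = mu @ Q                          # expected second-stage law
Q2[1:, 1:] = Q

lam = 0.8
assert np.isclose(val(lam, mu2, r2, Q2), val(lam, mu, r, Q))
print("deterministic initial state reproduces the value with a prior")
```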
References

[1] Blackwell D. (1965) Discounted Dynamic Programming. Annals of Mathematical Statistics, 36, 226–235.

[2] Higham N.J. and Lin L. (2011) On pth Roots of Stochastic Matrices. Linear Algebra and its Applications, 435, 448–463.