Issues Concerning the Realizability of Blackwell Optimal Policies in Reinforcement Learning
Nicholas Denis [email protected]
Abstract

N-discount optimality was introduced as a hierarchical form of policy and value-function optimality, with Blackwell optimality lying at the top of the hierarchy [17, 3]. We formalize notions of myopic discount factors, value functions and policies in terms of Blackwell optimality in MDPs, and we provide a novel concept of regret, called Blackwell regret, which measures the regret relative to a Blackwell optimal policy. Our main analysis focuses on long horizon MDPs with sparse rewards. We show that selecting a discount factor under which zero Blackwell regret can be achieved becomes arbitrarily hard. Moreover, even with oracle knowledge of such a discount factor, one that can realize a Blackwell regret-free value function, an ǫ-Blackwell optimal value function may not even be gain optimal. We discuss the difficulties associated with this class of problems, and define the notion of a policy gap: the difference in expected return at a state between a given policy and any other policy that differs at that state; we prove certain properties related to this gap. Finally, we provide experimental results that further support our theoretical findings. (Work in progress.)

When is one policy better than another, and how does one arrive at the best policy? Additionally, is there a difference between the theoretical answers to these questions and how they are addressed in practice? Within the reinforcement learning and Markov decision process communities these questions are fundamental and nothing new. Indeed, though they have been well defined and well studied, this paper reconsiders important issues with solutions to MDPs and RL problems. Specifically, we explore the role of the discount factor γ in finding an optimal policy, π*_γ, and value function V^{π*_γ}_γ. Once γ is chosen, even if an (approximately) optimal solution is returned by some algorithm, it may still be unsatisfactory in some regards (as demonstrated by OpenAI with the CoastRunners domain). In this paper we explore the relationship between γ and ǫ in arriving at an ǫ-optimal policy, as well as a researcher's preference for, or evaluation of, such a policy. We discuss the issues surrounding selecting γ and ǫ without any domain knowledge of the problem, and how even theoretically sound algorithms, such as PAC-MDP solution methods, can produce policies that satisfy being (ǫ, δ)-PAC yet are not even gain optimal. Especially difficult are long horizon problems (LHPs) with sparse rewards. Motivated by such problems, we introduce a novel concept of regret, called Blackwell regret, R_B, which compares the expected return of a given policy to that of a Blackwell optimal policy, evaluated at an appropriate value of γ ∈ [0, 1). We believe Blackwell regret is more akin to how humans experience regret when comparing oneself to the highest of standards. We formalize the notion of myopic discount factors and policies and introduce the notion of a γ being Blackwell realizable. We discuss how policies that minimize Blackwell regret are fundamentally difficult to solve for, as recent literature has hinted at for long horizon problems (LHPs) [10]. This is due to the existence of pivot states, where discovering the Blackwell optimal policy hinges on discerning the values of a Blackwell optimal policy and a non-Blackwell optimal policy, which can be arbitrarily close at a given state.
Even with oracle knowledge of γ*, the infimum over discount factors that can induce a Blackwell optimal and Blackwell regret-free policy, for any ǫ > 0 an ǫ-accurate Blackwell optimal policy may not be Blackwell optimal, and in fact may not even be gain optimal. We provide experimental results using PAC-MDP algorithms that demonstrate this phenomenon. Motivated by these findings, we argue the need for progress within three areas of theoretical research: 1) analytical solution methods for Blackwell optimal policies; 2) provably convergent algorithms for solving n-discount optimal policies; 3) goal-based and human-preference-based RL. Our focus is on the latter.

Recall that an MDP, M, is a tuple ⟨S, A, p, R, γ⟩, where S is a finite state space, A is a finite action space, p = p(s'|s, a) is the transition kernel, R : S × A × S → [0, R_max] is the reward function, and γ ∈ [0, 1] is the discounted current value associated to one unit of reward to be received one unit of time into the future. This work focuses on deterministic Markovian policies.

This work considers how γ plays a role both in learning a policy and in evaluating the value function associated to a policy, perhaps learned with a different discount factor. For this reason it is important to clearly separate the γ used to learn a policy π_γ from a possibly different γ' ≠ γ used to evaluate that policy. Hence, by π_γ we refer to a policy learned using γ, whereas V^{π_γ}_{γ'}(s) refers to the value of a state when following the policy π_γ learned using γ, with the value function computed using γ'. Symbolically,

V^{π_γ}_{γ'}(s_t) = E_{π_γ}[ Σ_{k=t}^∞ (γ')^{k−t} r_k ].

If γ = 1 then we are considering undiscounted rewards, and for any infinite stream of rewards, G_t = E[ r_t + r_{t+1} + r_{t+2} + ... ] = E[ Σ_{k=t}^∞ r_k ]. Since G_t is often infinite, the gain of a policy π is defined as

ρ^π = lim_{T→∞} (1/T) E_π[ Σ_{t=1}^T r_t ].

Using the gain of a policy, an ordering ≥ is defined on a policy class Π, where for all π_1, π_2 ∈ Π, π_1 ≥ π_2 ⟺ ρ^{π_1} ≥ ρ^{π_2}, with the strict inequality defined similarly. It is worth noting that if we define r^π = {r_1, r_2, ...} as the sequence of expected rewards obtained by following policy π, then for any permutation σ : N → N, any policy π' whose sequence of expected rewards is σ(r^π) = {r_{σ(1)}, r_{σ(2)}, ...} = r^{π'} satisfies ρ^π = ρ^{π'}. Hence the temporal ordering of rewards has no bearing on the value or gain of a policy when γ = 1. This is certainly not true for γ < 1.

Most commonly γ ∈ [0, 1). In this setting we can deal with infinite series of expected rewards, as the partial sums converge geometrically fast in γ. The value of a state under discount factor γ is

V^{π_γ}_γ(s_t) = E_{π_γ}[ Σ_{k=t}^∞ γ^{k−t} r_k ].

Since most frameworks assume rewards are bounded in some interval [0, R_max], we have, for all π and all s, V^π_γ(s) ≤ V^max_γ = R_max/(1 − γ). Such assumptions and the use of V^max_γ are integral to theoretical bounds for algorithms and solution methods in RL and MDPs. Similar to the ordering on policies in the undiscounted setting, an ordering of policies for a fixed γ ∈ [0, 1) is used to order policies π_1, π_2 ∈ Π_γ. Unlike the undiscounted setting, under γ < 1 two policies are not equivalent under permutation of the temporal sequence of rewards. Interestingly, the value γ = 0 is rarely used in the literature, and is often called myopic.
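To make these definitions concrete, the following sketch (ours, not part of the original text; Python with NumPy, with illustrative array conventions) evaluates V^π_γ for a deterministic policy by solving the linear system (I − γ P_π) V = r_π, and estimates the gain ρ^π from the stationary distribution of the chain induced by π.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact discounted value V^pi_gamma of a deterministic policy pi.

    P[s, a, s2]: transition kernel, R[s, a, s2]: rewards in [0, R_max],
    pi[s]: the action chosen in state s, and 0 <= gamma < 1.
    Solves (I - gamma * P_pi) V = r_pi.
    """
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, pi]                           # (n, n) state-to-state matrix under pi
    r_pi = (P_pi * R[idx, pi]).sum(axis=1)      # expected one-step reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_gain(P, R, pi, iters=10_000):
    """Average reward (gain) of pi, assuming the induced chain is unichain and aperiodic."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, pi]
    r_pi = (P_pi * R[idx, pi]).sum(axis=1)
    d = np.full(n, 1.0 / n)
    for _ in range(iters):                      # power iteration for the stationary distribution
        d = d @ P_pi
    return float(d @ r_pi)
```

In the notation of the text, the distinction between π_γ and the evaluation discount γ' corresponds simply to which discount factor is passed to policy_value after the policy has been obtained.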
With γ = 0, the induced policy does not sufficiently account for the future horizon, and in doing so is generally viewed as leading only to sub-optimal behaviour. For all γ ∈ [0, 1], we say a policy π*_γ is optimal if V^{π*_γ}_γ(s) ≥ V^{π_γ}_γ(s) for all π_γ ∈ Π_γ and all s ∈ S, where for γ = 1 we take V^{π*_{γ=1}}_{γ=1} = ρ^{π*}. Despite these notions of optimality being the most common in RL, there are other notions of optimality [13].

Bias optimality was introduced to supplement the use of gain optimal policies when γ = 1. Since the gain of a policy only considers its asymptotic behaviour, two policies with the same gain may experience different reward trajectories before arriving at the stationary distribution of the policy. For this reason the bias of a policy, introduced by [3], is defined as

b^π(s) = lim_{T→∞} E_π[ Σ_{t=1}^T (r_t − ρ^π) | s_1 = s ].

For any finite state and action space MDP, a bias optimal policy always exists.

n-discount Optimality

n-discount optimality [17] introduces a hierarchical view of policy optimality in MDPs. A policy π* is n-discount optimal for n ∈ {−1, 0, 1, 2, ...} if, for all s ∈ S and all π ∈ Π,

lim_{γ→1} (1 − γ)^{−n} ( V^{π*}_γ(s) − V^π_γ(s) ) ≥ 0.

Figure 1: Distracting long horizon MDP example: a chain s_1, ..., s_H with H ≫ 1 and initial state s_1; a small reward ǫ ≪ R_max is available for remaining in s_1.
It has been shown [17] that a policy is (−1)-discount optimal if and only if it is gain optimal, and that a policy is 0-discount optimal if and only if it is bias optimal. Moreover, if a policy is n-discount optimal, then it is m-discount optimal for all m ∈ {−1, 0, ..., n}. The strongest and most selective notion of optimality is that of π* being n-discount optimal for all n ≥ −1; such a policy is referred to as ∞-discount optimal.

A policy π* is Blackwell optimal if there exists γ* ∈ [0, 1) such that V^{π*}_γ(s) ≥ V^π_γ(s) for all γ ∈ [γ*, 1), all π ∈ Π and all s ∈ S. For finite state spaces such a γ* is attained [13]. Intuitively, a Blackwell optimal policy is one such that, upon considering sufficiently far into the future, as encoded by a planning horizon via γ > γ*, no other policy has a higher expected cumulative reward. [17] showed that a policy is ∞-discount optimal if and only if it is Blackwell optimal; hence Blackwell optimality implies all other forms of optimality, and for this reason it is the focus of this work. Finally, [3] shows that for finite state and action space MDPs there always exists a stationary and deterministic Blackwell optimal policy.

Consider the infinite horizon MDP in Figure 1, with initial state s_1. Before proceeding, consider what you would do if you were in this MDP. What do you think is the best policy? What sort of solution would you hope an RL algorithm returns to you, and how did you come to this conclusion? In wanting to maximize cumulative reward, it is hard to argue for any action selection policy other than to always "move right" towards the state s_H and, upon arriving, remain there. Why might someone consider any other policy? Why might a rational agent, with full oracle knowledge of the MDP, consider staying in s_1 to receive a reward of ǫ ≪ R_max at every time step in perpetuity? It is hard to account for why such a policy would be preferred over the policy that takes the agent to s_H, aside from laziness. Computationally, V^{π_stay}_γ > V^{π_right}_γ ⟺ ǫ > γ^H. Hence, depending on H, ǫ and R_max, γ can be set appropriately so as to induce the desired policy behaviour.

Returning to γ = 0: it is widely accepted within the literature that π*_{γ=0} is myopic. We ask whether it is possible for π*_γ to be myopic for γ > 0. Is a discount factor of the form 10^{−k}, for some positive k, still myopic? If we abstract what makes π*_γ myopic, it is the fact that γ is not sufficiently large to provide the agent with the possibility of properly assessing the optimal value of states and actions, where this optimality is, in some sense, defined not with respect to γ_learn, the γ used during learning, but rather with respect to some ideal policy or behaviour. Just as a child might seek to maximize immediate gratification (rewards) by eating candy before bed, which may be optimal given γ = 0, the role of a parent is to convey the non-optimality of such a policy by noting the yet-to-be-experienced consequences (poor sleep, fussy behaviour the following day), which can only be taken into consideration with γ > 0. This is paradoxical for the child, as they operate under π*_{γ=0} = π_eatcandy, and hence V^{π_eatcandy}_{γ=0} is optimal from the perspective of γ = 0. The lesson the parent tries to impart is to use γ' > γ = 0 so that the child can learn π*_{γ'}. In this way, we intuitively compare V^{π_γ}_{γ'} to V^{π_{γ'}}_{γ'}.
It is this intuition that we seek to formalize: eating candy before bed does not sufficiently value the future, and for this reason we attempt to resist such myopic behaviour. Sufficiently valuing the future means selecting a suitable γ ∈ [0, 1), and we argue that this sufficiency is represented by the γ* ∈ [0, 1) appearing in the definition of a Blackwell optimal policy and value function. In other words, myopic behaviour is, intuitively, defined with respect to the strongest sense of optimality, Blackwell optimality. Note that for every γ ∈ [0, 1) an optimal policy π*_γ exists [13]. Why, then, is π*_γ dismissed as myopic for γ = 0? It is still, after all, an optimal policy. We believe this occurs because we intuitively understand that not all optimal policies are equal: all optimal policies are optimal, but some are more optimal than others. That is, though π*_γ is optimal under γ for every γ, not all γ's induce the policies or behaviours that a researcher prefers. This highlights a common issue in machine learning, that of using a given objective function as a surrogate representation for what we want the algorithm to do. The hierarchical nature of policy optimality expressed by n-discount optimality naturally captures this phenomenon, and we revisit this body of literature to motivate why our sense of a γ being myopic has nothing to do with being incapable of finding π*_γ, but rather with not finding π*_{γ*}, where γ* characterizes a Blackwell optimal policy. We introduce a novel notion of regret, called Blackwell regret, and relate the concepts of a myopic γ and a myopic policy to Blackwell regret. Our work looks at a simple class of MDPs, called distracting long horizon MDPs, and shows that even for such a simple class of environments it is arbitrarily hard to select a γ that yields a Blackwell optimal policy and a value function achieving zero Blackwell regret.

Myopic γ, Blackwell Realizable γ and Blackwell Regret

Looking at the MDP in Figure 1, we intuitively get a sense of what the right policy is, and we agree that γ = 0 is myopic and will not produce it. Moreover, we can check that any γ ∈ [0, ǫ^{1/H}) suffers the same drawback. Since no formal definition of a myopic γ appears in the literature, we provide one.

Definition (Myopic γ and Blackwell Regret). Let β denote a Blackwell optimal policy, and let γ* be as in the definition of Blackwell optimality, so that V^β_γ(s) ≥ V^π_γ(s) for all γ ∈ [γ*, 1), all π ∈ Π and all s ∈ S. For γ ∈ [0, γ*) we say γ is myopic; similarly, a policy π_γ is myopic if it is learned using a myopic γ. For γ ≥ γ* we say γ is Blackwell realizable. For γ_learn ∈ [0, 1), we define the Blackwell regret, R_B, as follows. Let γ' = max{γ*, γ_learn}. Then for a given policy π_{γ_learn},

R_B(π_{γ_learn}) = E[ V^β_{γ'}(s) − V^{π_{γ_learn}}_{γ'}(s) ],

where the expectation is taken over the initial state distribution. Hence, Blackwell regret is the regret accrued for using a given policy learned with γ_learn, when compared to a Blackwell optimal policy. Since it may be that γ_learn > γ*, to ensure commensurability we require γ' = max{γ*, γ_learn} in the definition: under non-negative rewards, for any fixed π, any γ'' < γ in [0, 1) and any s ∈ S, we have V^π_{γ''}(s) ≤ V^π_γ(s). It immediately follows that if γ is myopic and π*_γ is the optimal policy induced by γ, then R_B(π*_γ) > 0.
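As an illustration of the definition, the sketch below instantiates a toy version of the Figure 1 chain (the specific values H = 50, ǫ = 10⁻³ and R_max = 1, and all helper names, are ours, chosen only for illustration). It computes the threshold (ǫ/R_max)^{1/(H−1)} below which γ is myopic for this chain, and evaluates the Blackwell regret of the "stay" policy at γ' = max{γ_learn, γ*}.

```python
import numpy as np

def value(P, R, pi, gamma):
    """Exact V^pi_gamma via a linear solve, as in the earlier sketch."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, pi]
    r_pi = (P_pi * R[idx, pi]).sum(axis=1)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def figure1_chain(H, eps, r_max):
    """Toy chain in the spirit of Figure 1: action 0 stays put, action 1 moves right.

    Remaining in s_1 (index 0) pays eps forever; remaining in s_H (index H-1)
    pays r_max forever; every other transition pays zero.
    """
    P = np.zeros((H, 2, H))
    R = np.zeros((H, 2, H))
    for s in range(H):
        P[s, 0, s] = 1.0
        P[s, 1, min(s + 1, H - 1)] = 1.0
    R[0, 0, 0] = eps
    R[H - 1, :, H - 1] = r_max
    return P, R

H, eps, r_max = 50, 1e-3, 1.0
P, R = figure1_chain(H, eps, r_max)
stay = np.zeros(H, dtype=int)               # the "distracted" policy
beta = np.ones(H, dtype=int)                # always move right: Blackwell optimal in this chain
gamma_star = (eps / r_max) ** (1.0 / (H - 1))
print(f"gamma* = {gamma_star:.4f}  (any gamma below this is myopic for this chain)")

for gamma_learn in (0.5, 0.8, 0.90, 0.95):
    greedy = "stay" if value(P, R, stay, gamma_learn)[0] > value(P, R, beta, gamma_learn)[0] else "move right"
    gamma_eval = max(gamma_learn, gamma_star)   # gamma' in the Blackwell-regret definition
    # In this fully deterministic toy the two policies tie exactly at gamma*, so for a myopic
    # gamma_learn the regret below evaluates to ~0; for gamma_learn > gamma* it is strictly positive.
    regret = value(P, R, beta, gamma_eval)[0] - value(P, R, stay, gamma_eval)[0]
    print(f"gamma_learn={gamma_learn:.2f}: optimal-at-gamma_learn policy = {greedy:10s}  R_B(stay) = {regret:.4f}")
```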
We see in the following lemma that Blackwell regret captures the very notion exemplified in the child–parent example above: for γ < γ*, the Blackwell regret is simply the regret computed using γ*. The regret of a given policy, with value evaluated at γ, is defined as

R(π; γ) = E[ V^{π*}_γ(s) − V^π_γ(s) ].

Lemma 1.
Let γ ∈ [0, γ*), with γ* as defined for Blackwell optimality. Then R_B(π_γ) = R(π_γ; γ*).

Previous definitions of regret measure the difference in value between a given policy π_{γ_learn} and the optimal value function, each evaluated with respect to a fixed γ_learn, since V^*_{γ_learn}(s) = V^{π*_{γ_learn}}_{γ_learn}(s) for all s ∈ S. Moreover, the γ_learn used to learn the policy is typically also used to evaluate the value of that policy, and thus the regret. Blackwell regret differs in that it measures the difference in value between a given policy π_{γ_learn} and a Blackwell optimal policy, evaluated at γ' = max{γ_learn, γ*}, which favors a Blackwell optimal policy and value function. In doing so, a policy that achieves zero Blackwell regret is either itself Blackwell optimal, or, when considering a sufficiently long time horizon (as encoded by γ'), has the same value as a Blackwell optimal policy.

Selecting γ

When implementing an RL algorithm that incorporates γ-discounting, typically no reasoning is provided to explain the choice of γ, though values around 0.9 are most common. γ may also be treated as a hyperparameter and a grid search over values performed. However, even under these settings, the probability measure of the non-myopic γ's can become vanishingly small for various types of problems, such as LHPs and sparse reward problems. Hence any randomized γ-selection approach can have a vanishingly small probability of achieving zero Blackwell regret, as in the example of Figure 1 as H grows. We show that selecting a non-myopic γ, that is, a Blackwell realizable γ, is quite difficult without oracle knowledge of the problem. Moreover, even using a Blackwell realizable γ, an ǫ-optimal policy may not even be gain optimal, let alone Blackwell optimal.

Ultimately we would like to consider MDP environments of a particular nature conducive to multi-task RL problems. The environments (problems) we are interested in are those such that, for every task assigned to the agent, the optimal policy for that task induces a partition of the state space into non-empty subsets of transient and recurrent states, S_T and S_A. This is equivalent to saying that for each task the optimal policy associated to the task induces a Markov chain on S which is unichain, or that the environment is multichain [13]. The intuition is that the environment is sufficiently controllable, in the sense that the agent can direct the environment towards some preferable subset of the state space, and stay there indefinitely if needed, as encoded by the task MDP. For this paper we consider a particular subset of such environments, in which there are only two regions of the state space that produce non-zero rewards, and these two regions are maximally separated from one another. We demonstrate that even for such a simple class of MDPs, selecting Blackwell realizable discount factors can be arbitrarily hard.

More formally, we consider the class of MDPs with finite diameter. That is, there exists D < ∞ such that

D = max_{s ≠ s' ∈ S} min_{π ∈ Π} E_π[ τ^π(s, s') ],

where τ^π(s, s') is the first hitting time of s' when starting in state s under π. Hence, within the class of environments considered, it is possible to reach any state from any other starting state in a finite expected number of actions under some policy. Furthermore, denote by s_d := s and s_H := s' two states that realize the diameter D. Suppose there exists 0 < r_d < R_max such that the only non-zero rewards in M are a reward of r_d for remaining in s_d and a reward of R_max for remaining in s_H (as in Figure 2). We call such an MDP a distracting long horizon MDP.

Proposition 2.
Let M be a distracting long horizon MDP as described above. Then π is Blackwell optimal if and only if π ∈ argmin_{π' ∈ Π} E_{π'}[ τ^{π'}(s, s_H) ] for all s ∈ S.

We now provide results showing that with oracle knowledge of D, r_d and R_max one may select γ*, and thus a Blackwell realizable discount factor.

Corollary 3.
For any distracting long horizon MDP M as described above, if D, r_d and R_max are known, then an RL algorithm can select γ ≥ γ* and hence a Blackwell realizable discount factor.

The following corollary shows that with oracle knowledge of only two of the three properties D, r_d and R_max, after committing to a particular γ ∈ [0, 1) there exists a distracting long horizon MDP consistent with those properties in which π*_γ is not gain optimal, yet π*_{γ'} is Blackwell optimal for all larger γ'.

Corollary 4.
Suppose that for distracting long horizon MDPs, as described above, only two of {D, r_d, R_max} are known, and let K ⊂ {D, r_d, R_max}, |K| = 2, denote the MDP features known with oracle knowledge. Then for every γ ∈ [0, 1) there exists a distracting long horizon MDP M consistent with K such that π*_γ is not gain optimal, but for all γ' ∈ (γ, 1), π*_{γ'} is Blackwell optimal.

These corollaries demonstrate that there exists sufficient domain knowledge for distracting long horizon MDPs to allow for the computation and use of a Blackwell realizable γ; however, without complete domain knowledge of M, any γ selected may be myopic and may not even lead to a gain optimal policy. These results suggest that for distracting long horizon MDPs a multi-step learning approach may be best, where in a first phase the agent learns D, r_d and R_max, and in a second phase uses this knowledge to select a non-myopic γ to solve the task; we leave such results for future work. The next results show that even with access to a Blackwell realizable γ, for distracting long horizon problems the value of a policy that is not gain optimal and that of a Blackwell optimal policy may be arbitrarily close (e.g. within ǫ); hence any learning algorithm that returns a policy that is ǫ-accurate with respect to a Blackwell optimal policy may not even be gain optimal. Further, we provide empirical results that mirror these theoretical results.

Corollary 5.
Let ǫ > 0. There exists a distracting MDP M with Blackwell optimal γ* ∈ (0, 1) and associated Blackwell optimal policy β such that ||V^β_{γ*} − V^{π_{γ*}}_{γ*}||_∞ < ǫ, where π_{γ*} is not gain optimal.

Prior work has put forward measurements that can act as indicators of when learning an optimal policy may be difficult [10, 2]. [2] discuss the notion of an action gap at a given state s: the difference in expected value at that state between the optimal action and the second best action. More formally, let A^−_{π*}(s) = A \ {π*(s)}. Then

AG_{π*}(s) = V^{π*}_γ(s) − max_{a ∈ A^−_{π*}(s)} Q^{π*}_γ(s, a).

[10] introduce the notion of the maximal action gap (MAG) of a policy π as

MAG(S; π) = max_{s ∈ S} { max_{a ∈ A} Q^π(s, a) − min_{a ∈ A} Q^π(s, a) }.

Both studies argue that if their respective measurement is small, then learning the optimal policy can be hard, as it is hard to discern the value of the optimal action from that of a sub-optimal one. While each may be useful, we argue that since the action gap measures the difference in value associated with abstaining just once from taking the optimal action, it does not truly measure the difference in value between two policies, nor the associated difficulty in discerning the value of one policy over another; the maximal action gap suffers from this as well. Moreover, [10] show that under certain conditions the maximal action gap collapses to zero, making learning arbitrarily hard. However, in the Appendix we prove that this condition only occurs in environments where the set of states that receive non-zero rewards is transient under every policy.

We introduce a novel measurement, the policy gap, motivated by the action gap discussed above. For S, A, p, R fixed, a policy π and s ∈ S, we define the policy gap PG^π_γ(s) as

PG^π_γ(s) = min_{ π'_γ ∈ Π_γ : π_γ(s) ≠ π'_γ(s) } | V^{π_γ}_γ(s) − V^{π'_γ}_γ(s) |.

The policy gap at state s is the smallest difference in value at that state between the query policy and any other policy that differs at s. Intuitively, if PG^π_γ(s) is large for all s, then discerning the optimal action, and thereby learning a Blackwell optimal policy, becomes easier. Conversely, if there exists s ∈ S such that PG^β_γ(s) → 0, then at state s, called a pivot state, it becomes increasingly hard to discern the value of a Blackwell optimal policy, β, from that of another policy. For an MDP whose Blackwell optimal policies are non-trivial, that is, where not all policies are Blackwell optimal, and therefore γ* > 0, such a pivot state exists. For the theorem below, we use β for a Blackwell optimal policy and, for any γ, V^β_γ for the value function obtained by following the Blackwell optimal policy with discount factor γ.

Theorem 6 (Pivot State Existence). Let β be a non-trivial Blackwell optimal policy with γ* ∈ (0, 1), where γ* is as defined above, so that V^β_γ(s) ≥ V^π_γ(s) for all π, all s, and all γ ∈ [γ*, 1). If γ < γ*, then there exists a pivot state s̃ ∈ S and a policy π̃_γ ∈ Π_γ with π̃_γ(s̃) ≠ β(s̃) such that

V^β_γ(s̃) < V^{π̃_γ}_γ(s̃) < V^{π̃_γ}_{γ*}(s̃) ≤ V^β_{γ*}(s̃).

Moreover, lim_{γ→γ*} PG^β_γ(s̃) = 0.
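A quick numerical illustration of Theorem 6, in closed form, using the chain structure of Figure 2 with r_d = 0.1 and R_max = 1 (the diameter D = 20 and the helper names are ours): at s_d, the gap between β, which walks D steps to s_H, and the deviating policy that stays at s_d collapses as γ approaches γ* from below.

```python
r_d, r_max, D = 0.1, 1.0, 20                  # r_d, R_max as in Figure 2; D is illustrative
gamma_star = (r_d / r_max) ** (1.0 / D)       # Blackwell threshold for this chain

def v_stay(gamma):                            # value at s_d of remaining there forever
    return r_d / (1.0 - gamma)

def v_beta(gamma):                            # value at s_d of walking D steps to s_H, then staying
    return gamma ** D * r_max / (1.0 - gamma)

for gamma in (0.5, 0.8, gamma_star - 1e-3, gamma_star - 1e-6):
    # an upper bound on PG^beta_gamma(s_d): the gap to the particular deviation that stays at s_d
    gap = abs(v_beta(gamma) - v_stay(gamma))
    print(f"gamma = {gamma:.6f}   |V^beta - V^stay|(s_d) = {gap:.6e}")
print(f"gamma* = {gamma_star:.6f}")
```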
Theorem 6 shows that for γ values close to γ*, there exists a pivot state such that the value of a Blackwell optimal policy at that state, computed with γ, is arbitrarily close to the value of a different, non-Blackwell optimal policy at the same state, also computed with γ. Intuitively, if the policy gap is arbitrarily close to zero, an RL algorithm is expected to have greater difficulty evaluating the difference in value between such policies, and therefore greater difficulty determining which is optimal. These results suggest that without oracle knowledge of γ*, an algorithm that attempts to search for γ* by increasing γ iteratively would face increasing difficulty as γ → γ*.

In this section we provide experimental results that further illustrate the phenomena discussed in previous sections. We investigate the difficulty of solving for Blackwell optimal policies in distracting long horizon MDPs similar to those in Figure 1. For these experiments we use the MDP in Figure 2, with initial state s_d. We analytically solve for γ*, and implement the delayed Q-learning PAC-MDP algorithm [16]. We run two sets of experiments with discount factors γ_1 = 0.85 and γ_2, where γ_1 > γ_2 > γ* and γ_2 lies only marginally above γ*; we use a fixed confidence parameter δ and error tolerances ǫ_1 and ǫ_2, where the indices of ǫ and γ coincide across experiments, all set to commonly used values. Each set of experiments is run for 5 runs with different random seeds. The delayed Q-learning algorithm terminates when it either finds itself in state s_d, greedily selecting the action that remains there, with the Learn(s, a) boolean flag set to False, or finds itself in state s_H, greedily selecting the action that remains there, with the Learn(s, a) boolean flag set to False. Both situations indicate that no further learning is possible and the algorithm has converged on its (ǫ, δ)-optimal policy.

In the first set of experiments, using γ_2 and ǫ_2, in each of the five runs the learned policy π̂ chose to remain in s_d, which is not Blackwell optimal; moreover, since the algorithm terminates in state s_d, the policy is not even gain optimal. The sample complexity required for convergence and the policy gap at s_d, PG(s_d), were also measured across runs. In the second set of experiments, using γ_1 and ǫ_1, in each of the five runs the learned policy was the Blackwell optimal policy, and the mean policy gap µ(PG(s_d)) was substantially larger than under γ_2.

Figure 2: Distracting LHP with states s_d and s_H. Actions are deterministic except for a_1 from state s_d. The only non-zero rewards are r_d = 0.1 (for remaining in s_d) and R_max = 1 (for remaining in s_H).

These results corroborate the theoretical results obtained. First, for the experiments with γ_2, which is much closer to γ*, the policy gap at s_d is much smaller than under γ_1 > γ_2, as predicted by the theory. More importantly, despite having oracle knowledge of γ* and selecting a Blackwell realizable γ, and despite implementing a PAC-MDP algorithm with commonly used values of (ǫ, δ), no run with γ_2 returned a Blackwell optimal policy, and in fact none returned even a gain optimal policy. These results further underline the difficulty of arriving at Blackwell optimal policies in distracting long horizon MDPs. However, for γ_1, which lies well above γ*, the Blackwell optimal policy was returned in every run. Though a positive result in some regards, this also suggests that for distracting long horizon MDPs where γ* → 1, so that the Lebesgue measure λ([γ*, 1)) → 0, the luxury of randomly selecting a γ ∈ [0, 1) with γ > γ* becomes arbitrarily unlikely. Finally, these results corroborate the theoretical results showing the existence of a state where the policy gap approaches zero, and show that even with γ* in hand, under a commonly used ǫ error tolerance, the ǫ-optimal policy returned by a PAC-MDP algorithm was not even gain optimal.

The topic of the effect of γ selection on policy quality has been of interest for several decades [17, 3, 13], with n-discount optimality and Blackwell optimality providing a global perspective on this relationship. These works recognize that under γ-discounting an optimal policy may not be Blackwell optimal, and that this problem is alleviated for γ = 1. However, there are no known convergent algorithms with theoretical guarantees for the undiscounted setting. [3] also showed that for finite MDPs, as γ → 1, the γ-discounted value function can be written as a Laurent series expansion, where each term in the series is a scaled notion of optimality, the first term being the gain, the second the bias, and so on. Using this construction, [13, 17] show that there is both a sequence of nested equations for the Laurent series coefficients and a policy iteration method that provably converges to a policy satisfying these equations for any finite-term approximation of the Laurent series.
More recently, [11] used an exciting function approximation approach, constructing value functions from basis functions comprised of terms of the Laurent series expansion. [9] studies the relationship of γ and reward functions with policy quality for goal-based MDPs. They argue that for γ < 1 an agent is not risk-averse, and prove that in the undiscounted setting with r := −1 for all (s, a, s'), an agent is guaranteed to arrive at the goal state; with γ < 1 this is not so, as a shorter yet riskier path that may lead to a non-goal absorbing state can have higher value than a longer, safer path to the goal. [12, 6, 7] are motivated by showing that using smaller γ values may be advantageous. Besides having faster convergence rates, they argue smaller γ values may also have better error. By decomposing the error, or value difference, between policies induced with different γ values, these decompositions have one error term depending on the smaller γ and another term that goes to zero as the two γ values approach each other. These works argue that the best strategy is to find an intermediate γ value that trades off the two terms. However, as is often the case in theoretical analyses of RL problems, the bounds are stated in terms of V_max and, for various values of γ, are vacuous, being larger than the absolute maximum error of V_max (e.g. one policy receiving only zero rewards and another always receiving R_max). Even when the bounds are meaningful, without knowledge of the Blackwell optimal policy and its associated value function, as shown in this study, an ǫ-optimal policy may not even be gain optimal.

For γ ∈ [0, 1), [10] define a hypothesis class H_γ and show that H_γ = { v ∈ R^S : ||v||_∞ ≤ R_max/(1 − γ) }. Under this framework, for a family of hypothesis classes H_γ indexed by γ ∈ [0, 1), we see that as γ increases, {H_γ}_γ is a monotonically increasing sequence of hypothesis classes. [10, 6] formalize that this also corresponds to an increase in a measure of complexity: the generalized Rademacher complexity R(H_γ) depends only on V_max, that is, R(H_γ) = R_max/(1 − γ). As examined in [10], long horizon MDPs suffer in that γ* may be arbitrarily close to 1, which implies that the complexity of realizable hypothesis classes for LHPs grows non-linearly with the horizon size (and γ*). [6] argue that using γ_learn < γ* is therefore a mechanism akin to regularization: by selecting a lower-complexity hypothesis class one can prevent overfitting. Their results suggest using smaller γ earlier in learning; however, as discussed here, whether a γ qualifies as small or large is problem dependent, and without oracle knowledge of the problem the distinction is meaningless. Though [6] do not consider Blackwell optimality, an interesting result [Theorem 2] can easily be adapted here, showing that for γ_learn < γ*, the loss, measured by ||·||_∞, of an approximately optimal policy π̂_γ learned using γ and n samples from each (s, a) ∈ S × A satisfies, with probability at least 1 − δ,

||V^{π*}_{γ*} − V̂^{π̂_γ}||_∞ ≤ (γ* − γ)/((1 − γ*)(1 − γ)) · R_max + 2 R_max/(1 − γ)^2 · sqrt( (1/(2n)) · log( |S||A||Π_γ| / δ ) ).

[6] argue that the trade-off between the two terms amounts to controlling the complexity of the policy class by using a smaller γ, versus the error induced in the first term when using a smaller γ.
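Purely to visualize the trade-off described by [6], the sketch below evaluates the two terms of the bound as reconstructed above (the constants follow that display; |Π_γ| is replaced by the trivial upper bound |A|^{|S|}, and all numerical values are arbitrary choices of ours). The first term shrinks as γ approaches γ*, while the estimation term blows up through the 1/(1 − γ)² factor, which is also why such bounds are frequently vacuous, as noted in the text.

```python
import numpy as np

R_max, gamma_star, delta = 1.0, 0.99, 0.05
S, A, n = 50, 4, 10_000                       # |S|, |A|, samples per (s, a); illustrative only
num_policies = A ** S                         # trivial upper bound on |Pi_gamma|

def bound_terms(gamma):
    horizon_term = (gamma_star - gamma) / ((1 - gamma_star) * (1 - gamma)) * R_max
    estimation_term = 2 * R_max / (1 - gamma) ** 2 * np.sqrt(
        np.log(S * A * num_policies / delta) / (2 * n))
    return horizon_term, estimation_term

for gamma in (0.90, 0.95, 0.98, 0.99):
    h, e = bound_terms(gamma)
    print(f"gamma={gamma:.2f}  horizon term={h:8.2f}  estimation term={e:9.2f}  total={h + e:9.2f}")
```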
Our results show that even as n → ∞ and the second error term goes to zero, the first error term remains fixed for any fixed γ < γ*, and that without strong domain knowledge even an ǫ-optimal approximate policy may not be gain optimal. [15] recently suggested γ-nets, a function approximation architecture that trains with a set of discount factors to learn value functions with respect to several timescales, the idea being that the approximation architecture can generalize and approximate the value of a state for any γ if sufficiently trained.

The work presented here suggests that, without considering Blackwell optimality and related concepts, theoretical bounds on value functions in RL may not provide meaningful and interpretable semantics with respect to the optimality of the resulting policy. An apt metaphor: for a daredevil jumping across a canyon, coming ǫ-close to success is arbitrarily bad. In that vein, our results show that for LHPs an ǫ-Blackwell optimal policy may not even be gain optimal. In contrast, in the supervised learning setting, one may search over a particular hypothesis class and arrive at some locally or globally optimal hypothesis ĥ : X → Y, which obtains empirical accuracies p_train, p_test ∈ [0, 1] on the training and test datasets, respectively. Once a classifier is obtained, though one may not know what the Bayes optimal risk is, one does know that a classifier can at most achieve 100% accuracy, and hence, in absolute terms, one can attach meaning to the test and training accuracy of a classifier returned by some SL algorithm. The RL setting is not similar in this regard. Without oracle knowledge of the RL problem, for the policy and value function returned by an RL algorithm, parameterized by γ and any other parameters θ, it is hard to say just how optimal such a policy in fact is, thereby leaving the researcher in the same boat as the fictitious RL agent: with results that are evaluative, not instructive.

Given that γ-discounting has such a strong effect on the induced hypothesis class, one may ask why discounting is used at all. Authors often cite concepts from utility theory, such as inflation and interest, to motivate discounting. Such concepts of temporal valuation may be useful for agents, such as humans, with finite time horizons, but they are not necessarily commensurable for infinite horizon agents. The use of γ-discounting in economic models is also contentious [18]. For economic and environmental policies, how should we discount the value of having a clean environment? Is discounting the future ethical in such settings? Might discounting the future lead us to an arbitrarily bad absorbing state? Utility theory has considered several properties that two utility streams, {r_t}_{t≥1} and {r'_t}_{t≥1}, may possess in forming binary relations used as orderings on value functions (utility streams) [8], including anonymity, which essentially states that two utility streams are equal under an ordering if they are permutations of one another. Hence, anonymity can only be realized in the RL setting if γ = 1. These works introduce and argue for the use of Blackwell optimality in economics research. [14] answers the question of why discounting is used: because it turns an infinite sum into a finite one. That is, it allows us to consider convergent series and therefore algorithms.
It then follows that we are not selecting for π* ∈ Π, but rather for π* ∈ Π ∩ {policies that are representable by convergent algorithms}. If RL algorithms are to be used and incorporated in real-world processes and products, we raise the rhetorical question: what are the moral and ethical implications of purposefully running a sub-optimal infinite horizon algorithm, in perpetuity?

The results provided in this paper suggest that iterative methods for arriving at γ* are problematic, suggesting a need for analytical methods of computing γ*. However, even with γ*, an approximately optimal policy may not even be gain optimal. For LHPs, as γ* → 1, since even using γ > γ* shares this unfortunate property, as demonstrated empirically in our experiments, what can be done to ensure solving for the Blackwell optimal policy? Recent advances in PAC-MDP algorithms [4] introduce uniform PAC learning: PAC algorithms that are ǫ-optimal for all ǫ simultaneously. Such algorithms must never explore-then-commit [4], but rather must never stop learning, as it has been shown that explore-then-commit approaches are necessarily sub-optimal [5]. An interesting direction would be to consider the use of such algorithms for arriving at Blackwell optimal policies.

Though Blackwell optimality is an ideal, for non-trivial LHPs it is possible that Blackwell optimal policies are hard to discern from policies that may not even be gain optimal. With such results being so dire, we suggest three main areas of focus for future research within the RL community: 1) development of convergent algorithms for solving for n-discount optimal policies, with theoretical bounds, and efficient solution methods for arriving at the Laurent series expansion of a γ-discounted value function as γ → 1; 2) analytical solutions for γ*; 3) human-preference and goal-based RL.

Our main focus is on the third area. For any applied RL solution, for example a commercial product that relies on RL, we argue that ultimately the quality of a policy is judged by human preferences. Those implementing an RL solution method will receive a policy and a value function and must evaluate whether or not it is a sufficient solution to the given problem. If not, the researcher will experiment with other parameters, including γ, and repeat until a sufficient policy is found. Such an approach is based on human preference, which may be separate from the value function itself and depend solely on the behaviour of the policy. This can be seen in recent discussions [1] based on results in the CoastRunners domain. CoastRunners is a video game in which the policy controls a boat in a race. The policy solved for by OpenAI resulted in the boat driving in circles, collecting rewards, rather than racing to the finish line and completing the race. Though OpenAI uses this as an example of pathological behaviour induced by a faulty reward function, it can be viewed as the behaviour induced by using a myopic γ in a distracting LHP. OpenAI, and most others, would agree that the behaviour observed was pathological; however, what makes it pathological? It was, in fact, the optimal policy for the MDP as encoded. We argue that what makes it pathological is simply that the policy did not do what the researchers wanted it to do, which was to win the race.
For this reason, at the current state of RL research, we claim that ultimately the quality of the policies solved for is measured by their being deemed sufficient, as subjectively judged by the researcher. We claim that this is equivalent to the researcher ultimately desiring something from the solved policy, and hence, if this desire can be encoded as an indicator function, then goal-based RL problems, being among the simplest classes of MDP problems, should be used.

Acknowledgments.
The authors would like to thank Maia Fraser for discussions and thoughtful edits of prior versions of this manuscript.

References

[1] OpenAI blog. https://blog.openai.com/faulty-reward-functions/
[2] Bellemare, M., Ostrovski, G., Guez, A., Thomas, P., Munos, R.: Increasing the action gap: New operators for reinforcement learning. In: AAAI, pp. 1476–1483 (2016)
[3] Blackwell, D.: Discrete dynamic programming. Annals of Mathematical Statistics 33, 719–726 (1962)
[4] Dann, C., Lattimore, T., Brunskill, E.: Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In: Neural Information Processing Systems (2016)
[5] Garivier, A., Kaufmann, E.: On explore-then-commit strategies. In: Neural Information Processing Systems (2017)
[6] Jiang, N., Kulesza, A., Singh, S., Lewis, R.: The dependence of effective planning horizon on model accuracy. In: AAMAS, vol. 14 (2015)
[7] Jiang, N., Singh, S., Tewari, A.: On structural properties of MDPs that bound loss due to shallow planning. In: IJCAI (2016)
[8] Jonsson, A., Voorneveld, M.: The limit of discounted utilitarianism. Theoretical Economics 13, 19–37 (2018)
[9] Koenig, S., Liu, Y.: The interaction of representations and planning objectives for decision-theoretic planning tasks. Journal of Experimental and Theoretical Artificial Intelligence 14, 303–326 (2002)
[10] Lehnert, L., Laroche, R., van Seijen, H.: On value function representation of long horizon problems. In: 32nd AAAI Conference on Artificial Intelligence, pp. 3457–3465 (2018)
[11] Mahadevan, S., Liu, B.: Basis construction from power series expansions of value functions. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 1540–1548. Curran Associates, Inc. (2010)
[12] Petrik, M., Scherrer, B.: Biasing approximate dynamic programming with a lower discount factor. In: Advances in Neural Information Processing Systems, pp. 1265–1272 (2009)
[13] Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc. (1994)
[14] Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In: ICML (1993)
[15] Sherstan, C., MacGlashan, J., Pilarski, P.: Generalizing value estimation over timescale. In: Prediction and Generative Modeling in Reinforcement Learning Workshop, FAIM (2018)
[16] Strehl, A., Li, L., Wiewiora, E., Langford, J., Littman, M.: PAC model-free reinforcement learning. In: ICML, pp. 881–888 (2006)
[17] Veinott, A.: Discrete dynamic programming with sensitive discount optimality criteria. Annals of Mathematical Statistics 40, 1635–1660 (1969)
[18] Weitzman, M.: Gamma discounting. American Economic Review 91, 260–271 (2001)
A Comment on Bounds Related to the Maximal Action Gap: [10] define S_C ⊆ S as a fully connected subset of the state space. The term fully connected was intended to describe a subset of the state space that is reachable from anywhere within that subset; a more appropriate term is communicating, as fully connected carries the connotation that for all s, s' ∈ S_C there exists a ∈ A such that p(s'|s, a) > 0. For this reason we use the term communicating to describe S_C. From this they define V_{max,γ} = max_{s ∈ S_C} V^{π*}_γ(s). Lemma 2 of [10] states that

MAG(S_C) ≤ (1 − γ^{D_{S_C}+1}) V_{max,γ},

where D_{S_C} is the diameter of S_C. From this it is stated that if V_{max,γ} is bounded as γ → 1, then γ → 1 implies MAG → 0. Though this implication is true, we show that V_{max,γ} is bounded as γ → 1 if and only if, under every policy, the expected number of time steps at which a non-zero reward is obtained is finite. This means that under all policies all non-zero rewards are transient. Hence, such a result applies to a rather vacuous subset of MDPs.

Proposition 7.
Let M = ⟨S, A, p, R, γ⟩ be an MDP with |S| < ∞ and |A| < ∞. Then there exists M < ∞ such that lim_{γ→1} V_{max,γ} ≤ M if and only if, for every policy π, E_π[ Σ_{t=1}^∞ 1{r_t > 0} ] < ∞.

Proof. Suppose there exists M < ∞ such that V_{max,γ} ≤ M as γ → 1. Without loss of generality, since R ∈ [0, R_max], we may assume R_max > 0, as otherwise the statement is trivial. Clearly S ≠ S_C, since otherwise, as there exists at least one transition that induces a non-zero reward r > 0, at worst a policy may traverse the entire diameter of S_C to receive the reward r, and do so in perpetuity. That is,

V_{max,γ} = max_{s ∈ S_C} V^{π*}_γ(s) ≥ r γ^{D_{S_C}} + r γ^{2 D_{S_C}} + ... = Σ_{t=1}^∞ r γ^{t D_{S_C}} = r γ^{D_{S_C}} / (1 − γ^{D_{S_C}}).

But

M ≥ r γ^{D_{S_C}} / (1 − γ^{D_{S_C}}) ⟺ M (1 − γ^{D_{S_C}}) ≥ r γ^{D_{S_C}} ⟺ M ≥ γ^{D_{S_C}} (r + M) ⟺ ( M / (r + M) )^{1/D_{S_C}} ≥ γ.

So for γ > ( M / (r + M) )^{1/D_{S_C}} we would have M < V_{max,γ}, a contradiction. This shows that S ≠ S_C. Hence, for S_C ⊊ S, it must be that S_C is transient under π*_γ, since otherwise, if there exists T ∈ N such that s_t ∈ S_C for all t ≥ T, then by the same argument as above there exists γ_0 ∈ [0, 1) such that V_{max,γ'} > M for all γ' ≥ γ_0. Hence S_C must be transient under π*_γ.

Now, since |S| < ∞, we have |S \ S_C| < ∞. Hence for π*_γ there exists S_A ⊆ S \ S_C such that S_A is irreducible and positive recurrent (i.e. absorbing). We claim that there cannot be any non-zero rewards obtainable within S_A. Let T = max_{s ∈ S} E_{π*_γ}[ τ(s, S_A) ] be the maximum expected first hitting time of the absorbing subset S_A under π*_γ. By a similar argument as above, there cannot be any positive rewards in S_A, since otherwise there exists s_A ∈ S_A such that V^{π*}_γ(s_A) → ∞ as γ → 1; and then there exists s' ∈ S_C such that V_{max,γ} = max_{s ∈ S_C} V^{π*}_γ(s) ≥ γ^T V^{π*}_γ(s_A), and therefore V_{max,γ} → ∞ as γ → 1. Hence, as γ → 1, π*_γ obtains non-zero rewards for only a finite number of time steps, and, due to the optimality of π*_γ, this must be true of every policy π_γ. Hence all rewards in M are transient.

For the reverse implication, suppose that for every policy π, E_π[ Σ_{t=1}^∞ 1{r_t > 0} ] < ∞. Let T be defined as above, the maximum expected hitting time of the absorbing subset S_A, which contains no non-zero rewards; S_A must exist by a similar argument as above. Then for all s ∈ S,

V^{π*}_γ(s) ≤ Σ_{t=1}^{T} R_max γ^{t−1} ≤ T R_max < ∞.

Hence V_{max,γ} is bounded as γ → 1.

The maximal action gap bounds thus collapse to zero for an infinite horizon problem as γ → 1 only in environments where all rewards are transient. [10] argue that representing the value function for such a class of MDPs is quite difficult as γ → 1; however, such environments are best solved using episodic MDP approaches with γ = 1. Since for any policy the number of time steps at which a positive reward is possible is finite, finding an optimal policy is only relevant for the first T < ∞ time steps, after which the behaviour becomes irrelevant. [13] shows that such domains can be converted to undiscounted episodic tasks. In doing so, the hypothesis space is completely different, as only value functions V ∈ [0, R_max T]^S need be considered, which have no dependency on γ; hence the Rademacher complexity results stated previously do not apply. A multi-step learning approach, first learning T and then applying an episodic RL algorithm, is ideal for such environments.
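A small numerical illustration of the dichotomy in Proposition 7 (the two single-action toy MDPs below are ours): when a positive reward can be collected recurrently, max_s V^{π*}_γ(s) grows like R_max/(1 − γ) as γ → 1, whereas when every positive reward is transient it stays bounded (here by a single unit of reward).

```python
import numpy as np

def values(P, R, gamma):
    """Exact V_gamma for a single-action MDP (the only policy is therefore optimal)."""
    n = P.shape[0]
    P0, R0 = P[:, 0], R[:, 0]
    r = (P0 * R0).sum(axis=1)
    return np.linalg.solve(np.eye(n) - gamma * P0, r)

# MDP A: recurrent reward -- state 1 is absorbing and pays 1 on every step.
P_a = np.zeros((2, 1, 2)); R_a = np.zeros((2, 1, 2))
P_a[0, 0, 1] = 1.0
P_a[1, 0, 1] = 1.0; R_a[1, 0, 1] = 1.0

# MDP B: transient reward -- a single payoff of 1 on the way into a zero-reward absorbing state.
P_b = np.zeros((2, 1, 2)); R_b = np.zeros((2, 1, 2))
P_b[0, 0, 1] = 1.0; R_b[0, 0, 1] = 1.0
P_b[1, 0, 1] = 1.0

for gamma in (0.9, 0.99, 0.999, 0.9999):
    v_rec = values(P_a, R_a, gamma).max()
    v_tra = values(P_b, R_b, gamma).max()
    print(f"gamma={gamma}: max V (recurrent reward) = {v_rec:10.1f}   max V (transient reward) = {v_tra:.3f}")
```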
Proof of Lemma 1

Proof.
Let β be a Blackwell optimal policy with associated γ*. Note that γ' = γ* follows from the hypothesis γ < γ* and the definition of γ' in the Blackwell regret. Then

R_B(π_γ) = E[ V^β_{γ*}(s) − V^{π_γ}_{γ*}(s) ]
         = E[ V^β_{γ*}(s) − V^*_{γ*}(s) + V^*_{γ*}(s) − V^{π_γ}_{γ*}(s) ]
         = E[ V^β_{γ*}(s) − V^*_{γ*}(s) ] + E[ V^*_{γ*}(s) − V^{π_γ}_{γ*}(s) ]
         = 0 + E[ V^*_{γ*}(s) − V^{π_γ}_{γ*}(s) ]
         = R(π_γ; γ*).

Proof of Proposition 2
Proof.
Let π be a Blackwell optimal policy; then it is in particular bias optimal, which clearly requires minimizing the expected hitting time of s_H. For the reverse implication, let π be a policy that minimizes the expected hitting time of s_H. Let γ_0 = (r_d/R_max)^{1/D}, with D, r_d, R_max defined in the text. Then it follows that γ* = γ_0, since otherwise, for all γ' < γ_0, there exists π' such that V^{π'}_{γ'}(s_d) > V^π_{γ'}(s_d). It clearly follows that for any policy µ and all γ ≥ γ*, V^π_γ ≥ V^µ_γ, and therefore π is a Blackwell optimal policy.

Proof of Corollary 3
Proof.
Let M be a distracting MDP as described above, with D, r_d and R_max known to the algorithm. Let π* be the Blackwell optimal policy learned and evaluated with γ*. By the previous Proposition, π* is the policy that takes the shortest path from any state to s_H, and, as shown in the proof of that Proposition, γ* = (r_d/R_max)^{1/D}. Moreover, by Proposition 2, for any γ < γ* the induced optimal policy π*_γ need not minimize the expected first hitting time of s_H, and hence need not be Blackwell optimal; any selected γ must therefore satisfy γ ≥ γ* = (r_d/R_max)^{1/D}. Hence, with knowledge of D, r_d and R_max, γ* can be computed and a Blackwell realizable discount factor selected.

Proof of Corollary 4
Proof.
First, suppose r_d and R_max are given, and let γ ∈ [0, 1) be fixed. For s_d and s_H as defined above it suffices, as in the previous proposition, to exhibit D > 0 and γ' ∈ (γ, 1) such that, for the induced optimal policies π*_γ and π*_{γ'} and the Blackwell optimal policy β,

V^{π*_γ}_γ(s_d) = r_d/(1 − γ) > R_max γ^D/(1 − γ) = V^β_γ(s_d),

but

V^{π*_{γ'}}_{γ'}(s_d) = r_d/(1 − γ') ≤ R_max (γ')^D/(1 − γ') = V^β_{γ'}(s_d).

Let D be the smallest integer strictly greater than log(r_d/R_max)/log(γ), so that γ^D < r_d/R_max, and set γ' := (r_d/R_max)^{1/D}. Then D and γ' satisfy the claim and, with the initial state distribution a point mass at s_d, π*_γ is not gain optimal, as ρ^{π*_γ} = r_d, while for all γ̃ > γ it follows that π*_{γ̃} is Blackwell optimal. Without loss of generality, the same proof technique applies when r_d and D are known and γ ∈ [0, 1) is fixed, as well as when R_max and D are known and γ ∈ [0, 1) is fixed.
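A quick check of the construction used in this proof (illustrative; the values of r_d and R_max are those of Figure 2, the committed γ's are arbitrary, and the choice of D follows the requirement γ^D < r_d/R_max derived above): knowing only r_d and R_max, an adversary can pick a diameter D large enough that the committed γ is myopic for the resulting distracting MDP.

```python
import math

def adversarial_diameter(r_d, r_max, gamma):
    """Smallest integer D with gamma**D < r_d / r_max, i.e. D > log(r_d / r_max) / log(gamma)."""
    return math.floor(math.log(r_d / r_max) / math.log(gamma)) + 1

r_d, r_max = 0.1, 1.0                          # the two known features (values as in Figure 2)
for gamma in (0.9, 0.99, 0.999):               # the discount factor the learner has committed to
    D = adversarial_diameter(r_d, r_max, gamma)
    gamma_star = (r_d / r_max) ** (1.0 / D)    # Blackwell threshold of the constructed MDP
    v_stay = r_d / (1 - gamma)                 # value at s_d of staying put, under gamma
    v_go = gamma ** D * r_max / (1 - gamma)    # value at s_d of walking D steps to s_H
    print(f"gamma={gamma}:  D={D},  gamma*={gamma_star:.6f} (> gamma),  "
          f"stay beats go at gamma: {v_stay > v_go}")
```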
Proof of Corollary 5

Proof.
This follows from Theorem 6 and Proposition 2, since there exists a pivot state s̃ at which the policy gap vanishes. It is easy to see, by Proposition 2 and the two corollaries that followed it, that s_d is a pivot state. Let π̃ equal the Blackwell optimal policy β at every state except s_d, where π̃(s_d) = a_stay, noting that r(s_d, a_stay, s_d) = r_d. Then π̃ is not gain optimal, as ρ^{π̃} = r_d < R_max = ρ^β, yet for all s ∈ S \ {s_d}, V^β_{γ*}(s) = V^{π̃}_{γ*}(s), while for s_d we have V^{π̃}_{γ*}(s_d) = r_d/(1 − γ*) and V^β_{γ*}(s_d) = (γ*)^D R_max/(1 − γ*). Then, for any ǫ > 0, r_d, D and R_max can be set such that ||V^β_{γ*} − V^{π̃}_{γ*}||_∞ < ǫ.

Proof of Theorem 6
Proof.
Let γ < γ*. By the definition of Blackwell optimality (taking γ* minimal with the stated property), there exist s̃ ∈ S and π̃_γ such that V^{π̃_γ}_γ(s̃) > V^β_γ(s̃). Moreover, since all rewards are non-negative, for every policy π, every s ∈ S and every pair γ_1 < γ_2 it follows that V^π_{γ_1}(s) ≤ V^π_{γ_2}(s); that is, increasing γ while keeping the policy fixed can only increase the value function. Hence we also have V^{π̃_γ}_γ(s̃) ≤ V^{π̃_γ}_{γ*}(s̃) and V^β_γ(s̃) ≤ V^β_{γ*}(s̃). Together,

V^β_γ(s̃) < V^{π̃_γ}_γ(s̃) ≤ V^{π̃_γ}_{γ*}(s̃) ≤ V^β_{γ*}(s̃).

It remains to show that π̃_γ(s̃) ≠ β(s̃) and that lim_{γ→γ*} PG^β_γ(s̃) = 0. We may assume the former, since if in fact π̃_γ(s̃) = β(s̃), then

V^β_γ(s̃) = E[ r(s̃, β(s̃)) + γ V^β_γ(s') ] < E[ r(s̃, π̃_γ(s̃)) + γ V^{π̃_γ}_γ(s') ] = V^{π̃_γ}_γ(s̃)
⟺ E[ r(s̃, β(s̃)) + γ V^β_γ(s') ] < E[ r(s̃, β(s̃)) + γ V^{π̃_γ}_γ(s') ]
⟺ E[ V^β_γ(s') ] < E[ V^{π̃_γ}_γ(s') ].

Since the expectation is taken over the MDP dynamics and both policies select the same action at s̃, the distribution over successor states is the same. If there is no successor state s' at which β and π̃_γ differ, this inequality propagates to the successors of the successor states. However, this process cannot continue indefinitely, since otherwise the two Markov chains induced by β and π̃_γ starting from s̃ are coupled and, having the same dynamics and the same γ, must have the same value. Therefore the two policies must differ at at least one state where the preceding value-function inequality holds; without loss of generality, we assume this state is s̃.

Finally, we show lim_{γ→γ*} PG^β_γ(s̃) = 0. Since π̃_γ differs from β at s̃, the policy gap satisfies PG^β_γ(s̃) ≤ V^{π̃_γ}_γ(s̃) − V^β_γ(s̃). For all γ < γ* we therefore have

V^β_γ(s̃) < V^{π̃_γ}_γ(s̃) ≤ V^{π̃_γ}_{γ*}(s̃) ≤ V^β_{γ*}(s̃)
⟹ 0 < V^{π̃_γ}_γ(s̃) − V^β_γ(s̃) ≤ V^β_{γ*}(s̃) − V^β_γ(s̃)
⟹ PG^β_γ(s̃) ≤ V^β_{γ*}(s̃) − V^β_γ(s̃)
⟹ 0 ≤ lim_{γ→γ*} PG^β_γ(s̃) ≤ lim_{γ→γ*} E_β[ Σ_{t=1}^∞ ( (γ*)^{t−1} − γ^{t−1} ) r_t ] = 0.

It follows that lim_{γ→γ*} PG^β_γ(s̃) = 0.