Near Optimal Behavior via Approximate State Abstraction
David Abel † Brown University david [email protected]
D. Ellis Hershkowitz † Carnegie Mellon University [email protected]
Michael L. Littman
Brown University [email protected]
Abstract
The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully-identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike. In this work, we investigate approximate state abstractions, which treat nearly-identical situations as equivalent. We present theoretical guarantees of the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.
Abstraction plays a fundamental role in learning. Through abstraction, intelligent agents may reason about only the salient features of their environment while ignoring what is irrelevant. Consequently, agents are able to solve considerably more complex problems than they would be able to without the use of abstraction. However, exact abstractions, which treat only fully-identical situations as equivalent, require complete knowledge that is computationally intractable to obtain. Furthermore, often no two situations are identical, so exact abstractions are often ineffective. To overcome these issues, we investigate approximate abstractions that enable agents to treat sufficiently similar situations as identical. This work characterizes the impact of equating "sufficiently similar" states in the context of planning and RL in Markov Decision Processes (MDPs). The remainder of our introduction contextualizes these intuitions in MDPs.
Figure 1: We investigate families of approximate state abstraction functions that induce abstract MDPs whose optimal policies have bounded value in the original MDP.
A previous version of this paper was published in the Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s). † The first two authors contributed equally.

Solving for optimal behavior in MDPs in a planning setting is known to be P-Complete in the size of the state space [28, 25]. Similarly, many RL algorithms for solving MDPs are known to require a number of samples polynomial in the size of the state space [31]. Although polynomial runtime or sample complexity may seem like a reasonable constraint, the size of the state space of an MDP grows super-polynomially with the number of variables that characterize the domain, a result of Bellman's curse of dimensionality. Thus, solutions polynomial in state space size are often ineffective for sufficiently complex tasks. For instance, a robot involved in a pick-and-place task might be able to employ planning algorithms to solve for how to manipulate some objects into a desired configuration in time polynomial in the number of states, but the number of states it must consider grows exponentially with the number of objects with which it is working [1]. Thus, a key research agenda for planning and RL is leveraging abstraction to reduce large state spaces [2, 21, 10, 12, 6]. This agenda has given rise to methods that reduce ground MDPs with large state spaces to abstract
MDPs with smaller state spaces by aggregating states according to some notion of equality or similarity. In the context of MDPs, we understand exact abstractions as those that aggregate states with equal values of particular quantities, for example, optimal $Q$-values. Existing work has characterized how exact abstractions can fully maintain optimality in MDPs [24, 8].

The thesis of this work is that performing approximate abstraction in MDPs by relaxing the state aggregation criteria from equality to similarity achieves polynomially bounded error in the resulting behavior while offering three benefits. First, approximate abstractions employ the sort of knowledge that we expect a planning or learning algorithm to compute without fully solving the MDP. In contrast, exact abstractions often require solving for optimal behavior, thereby defeating the purpose of abstraction. Second, because of their relaxed criteria, approximate abstractions can achieve greater degrees of compression than exact abstractions. This difference is particularly important in environments where no two states are identical. Third, because the state aggregation criteria are relaxed to near equality, approximate abstractions are able to tune the aggressiveness of abstraction by adjusting what they consider sufficiently similar states.

We support this thesis by describing four different types of approximate abstraction functions that preserve near-optimal behavior by aggregating states on different criteria: $\tilde{\phi}_{Q^*,\epsilon}$, on similar optimal $Q$-values; $\tilde{\phi}_{model,\epsilon}$, on similarity of rewards and transitions; $\tilde{\phi}_{bolt,\epsilon}$, on similarity of a Boltzmann distribution over optimal $Q$-values; and $\tilde{\phi}_{mult,\epsilon}$, on similarity of a multinomial distribution over optimal $Q$-values. Furthermore, we empirically demonstrate the relationship between the degree of compression and error incurred on a variety of MDPs.

This paper is organized as follows. In the next section, we introduce the necessary terminology and background of MDPs and state abstraction. Section 3 surveys existing work on state abstraction applied to sequential decision making. Section 5 introduces our primary result: bounds on the error guaranteed by four classes of approximate state abstraction. The following two sections introduce simulated domains used in experiments (Section 6) and a discussion of experiments in which we apply one class of approximate abstraction to a variety of different tasks to empirically illustrate the relationship between degree of compression and error incurred (Section 7).

An MDP is a problem representation for sequential decision making agents, represented by a five-tuple: $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$. Here, $\mathcal{S}$ is a finite state space; $\mathcal{A}$ is a finite set of actions available to the agent; $\mathcal{T}$ denotes $\mathcal{T}(s,a,s')$, the probability of an agent transitioning to state $s' \in \mathcal{S}$ after applying action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$; $\mathcal{R}(s,a)$ denotes the reward received by the agent for executing action $a$ in state $s$; and $\gamma \in [0,1]$ is a discount factor that determines how much the agent prefers future rewards over immediate rewards. We assume without loss of generality that the range of all reward functions is normalized to $[0,1]$. The agent's behavior is given by a policy, $\pi : \mathcal{S} \mapsto \mathcal{A}$. The objective of an agent is to solve for the policy that maximizes its expected discounted reward from any state, denoted $\pi^*$. We denote the expected discounted reward for following policy $\pi$ from state $s$ as the value of the state under that policy, $V^\pi(s)$. We similarly denote the expected discounted reward for taking action $a \in \mathcal{A}$ and then following policy $\pi$ from state $s$ forever after as $Q^\pi(s,a)$, defined by the Bellman Equation as:

$$Q^\pi(s,a) = \mathcal{R}(s,a) + \gamma \sum_{s'} \mathcal{T}(s,a,s')\, Q^\pi(s', \pi(s')). \quad (1)$$

We let RMax denote the maximum reward (which is 1), and QMax denote the maximum $Q$-value, which is $\frac{\text{RMax}}{1-\gamma}$. The value function, $V$, defined under a given policy, denoted $V^\pi(s)$, is defined as:

$$V^\pi(s) = Q^\pi(s, \pi(s)). \quad (2)$$

Lastly, we denote the value and $Q$ functions under the optimal policy as $V^*$ or $V^{\pi^*}$ and $Q^*$ or $Q^{\pi^*}$, respectively. For further background, see Kaelbling et al. [22].

Several other projects have addressed similar topics.
Dean et al. [9] leverage the notion of bisimulation to investigate partitioning an MDP's state space into clusters of states whose transition model and reward function are within $\epsilon$ of each other. They develop an algorithm called Interval Value Iteration (IVI) that converges to the correct bounds on a family of abstract MDPs called Bounded MDPs.

Several approaches build on Dean et al. [9]. Ferns et al. [14, 15] investigated state similarity metrics for MDPs; they bounded the value difference of ground states and abstract states for several bisimulation metrics that induce an abstract MDP. This differs from our work, which develops a theory of abstraction that bounds the suboptimality of applying the optimal policy of an abstract MDP to its ground MDP, covering four types of state abstraction, one of which closely parallels bisimulation. Even-Dar and Mansour [13] analyzed different distance metrics used in identifying state space partitions subject to $\epsilon$-similarity, also providing value bounds (their Lemma 4) for $\epsilon$-homogeneity subject to the $L_\infty$ norm, which parallels our Claim 2. Ortner [27] developed an algorithm for learning partitions in an online setting by taking advantage of the confidence bounds for $\mathcal{T}$ and $\mathcal{R}$ provided by UCRL [3].

Hutter [18, 17] investigates state aggregation beyond the MDP setting, presenting a variety of results for aggregation functions in reinforcement learning. Most relevant to our investigation is Hutter's Theorem 8, which illustrates properties of aggregating states based on similar $Q$-values. Part (a) of Hutter's theorem parallels our Claim 1: both bound the value difference between ground and abstract states. Part (b) is analogous to our Lemma 1: both bound the value difference of applying the optimal abstract policy in the ground MDP. Part (c) restates the observation of Li et al. [24] that $Q^*$ abstractions preserve the optimal value function. For Lemma 1, our proof strategy differs from Hutter's, but the result is the same.

Approximate state abstraction has also been applied to the planning problem, in which the agent is given a model of its environment and must compute a plan that satisfies some goal. Hostetler et al. [16] apply state abstraction to Monte Carlo Tree Search and expectimax search, giving value bounds for applying the optimal abstract action in the ground tree(s), similarly to our setting. Dearden and Boutilier [10] also formalize state abstraction for planning, focusing on abstractions that are quickly computed and offer bounded value. Their primary analysis is on abstractions that remove negligible literals from the planning domain description, yielding value bounds for these abstractions and a means of incrementally improving abstract solutions to planning problems. Jiang et al. [20] analyze a similar setting, applying abstractions to the Upper Confidence Bound applied to Trees algorithm adapted for planning, introduced by Kocsis and Szepesvári [23].

Mandel et al. [26] advance Bayesian aggregation in RL to define Thompson Clustering for Reinforcement Learning (TCRL), an extension of which achieves near-optimal Bayesian regret bounds. Jiang [19] analyzes the problem of choosing between two candidate abstractions, developing an algorithm based on statistical tests that trades off the approximation error with the estimation error of the two abstractions, yielding a loss bound on the quality of the chosen policy.
Many previous works have targeted the creation of algorithms that enable state abstraction for MDPs. Andre and Russell [2] investigated a method for state abstraction in hierarchical reinforcement learning leveraging a programming language called ALISP that promotes the notion of safe state abstraction. Agents programmed using ALISP can ignore irrelevant parts of the state, achieving abstractions that maintain optimality. Dietterich [12] developed MAXQ, a framework for composing tasks into an abstracted hierarchy where state aggregation can be applied. Bakker and Schmidhuber [4] also target hierarchical abstraction, focusing on subgoal discovery. Jong and Stone [21] introduced a method called policy-irrelevance in which agents identify (online) which state variables may be safely abstracted away in a factored-state MDP. Dayan and Hinton [7] developed "Feudal Reinforcement Learning," an early form of hierarchical RL that restructures Q-learning to manage the decomposition of a task into subtasks. For a more complete survey of algorithms that leverage state abstraction in past reinforcement-learning papers, see Li et al. [24], and for a survey of early works on hierarchical reinforcement learning, see Barto and Mahadevan [5].

Li et al. [24] developed a framework for exact state abstraction in MDPs. In particular, the authors defined five types of state aggregation functions, inspired by existing methods for state aggregation in MDPs. We generalize two of these five types, $\phi_{Q^*}$ and $\phi_{model}$, to the approximate abstraction case. Our generalizations are equivalent to theirs when exact criteria are used (i.e., $\epsilon = 0$). Additionally, when exact criteria are used, our bounds indicate that no value is lost, which is one of the core results of Li et al. [24]. Walsh et al. [34] build on the framework they previously developed by showing empirically how to transfer abstractions between structurally related MDPs.

We build upon the notation used by Li et al. [24], who introduced a unifying theoretical framework for state abstraction in MDPs.
Definition 1 ($M_G$, $M_A$): We understand an abstraction as a mapping from the state space of a ground MDP, $M_G$, to that of an abstract MDP, $M_A$, using a state aggregation scheme. Consequently, this mapping induces an abstract MDP. Let $M_G = \langle \mathcal{S}_G, \mathcal{A}, \mathcal{T}_G, \mathcal{R}_G, \gamma \rangle$ and $M_A = \langle \mathcal{S}_A, \mathcal{A}, \mathcal{T}_A, \mathcal{R}_A, \gamma \rangle$.

Definition 2 ($\mathcal{S}_A$, $\phi$): The states in the abstract MDP, $\mathcal{S}_A$, are constructed by applying a state aggregation function, $\phi$, to the states in the ground MDP. More specifically, $\phi$ maps a state in the ground MDP to a state in the abstract MDP:

$$\mathcal{S}_A = \{\phi(s) \mid s \in \mathcal{S}_G\}. \quad (3)$$

Definition 3 ($G$): Given a $\phi$, each ground state has associated with it the ground states with which it is aggregated. Similarly, each abstract state has its constituent ground states. We let $G$ be the function that retrieves these states:

$$G(s) = \begin{cases} \{g \in \mathcal{S}_G \mid \phi(g) = \phi(s)\}, & \text{if } s \in \mathcal{S}_G, \\ \{g \in \mathcal{S}_G \mid \phi(g) = s\}, & \text{if } s \in \mathcal{S}_A. \end{cases} \quad (4)$$

The abstract reward function and abstract transition dynamics for each abstract state are a weighted combination of the rewards and transitions for each ground state in the abstract state.

Definition 4 ($\omega(s)$): We refer to the weight associated with a ground state $s \in \mathcal{S}_G$ by $\omega(s)$. The only restriction placed on the weighting scheme is that it induces a probability distribution on the ground states of each abstract state:

$$\forall_{s \in \mathcal{S}_G} \ \sum_{g \in G(s)} \omega(g) = 1 \ \text{ AND } \ \omega(s) \in [0,1]. \quad (5)$$

Definition 5 ($\mathcal{R}_A$): The abstract reward function $\mathcal{R}_A : \mathcal{S}_A \times \mathcal{A} \mapsto [0,1]$ is a weighted sum of the rewards of each of the ground states that map to the same abstract state:

$$\mathcal{R}_A(s,a) = \sum_{g \in G(s)} \mathcal{R}_G(g,a)\,\omega(g). \quad (6)$$

Definition 6 ($\mathcal{T}_A$): The abstract transition function $\mathcal{T}_A : \mathcal{S}_A \times \mathcal{A} \times \mathcal{S}_A \mapsto [0,1]$ is a weighted sum of the transitions of each of the ground states that map to the same abstract state:

$$\mathcal{T}_A(s,a,s') = \sum_{g \in G(s)} \sum_{g' \in G(s')} \mathcal{T}_G(g,a,g')\,\omega(g). \quad (7)$$

Here, we introduce our formal analysis of approximate state abstraction, including results bounding the error associated with these abstraction methods. In particular, we demonstrate that abstractions based on approximate $Q^*$ similarity (5.2), approximate model similarity (5.3), and approximate similarity between distributions over $Q^*$, for both Boltzmann (5.4) and multinomial (5.5) distributions, induce abstract MDPs for which the optimal policy has bounded error in the ground MDP. We first introduce some additional notation.

Definition 7 ($\pi^*_A$, $\pi^*_G$): We let $\pi^*_A : \mathcal{S}_A \to \mathcal{A}$ and $\pi^*_G : \mathcal{S}_G \to \mathcal{A}$ stand for the optimal policies in the abstract and ground MDPs, respectively.

We are interested in how the optimal policy in the abstract MDP performs in the ground MDP. As such, we formally define the policy in the ground MDP derived from optimal behavior in the abstract MDP:

Definition 8 ($\pi_{GA}$): Given a state $s \in \mathcal{S}_G$ and a state aggregation function $\phi$,

$$\pi_{GA}(s) = \pi^*_A(\phi(s)). \quad (8)$$

We now define types of abstraction based on functions of state–action pairs.

Definition 9 ($\tilde{\phi}_{f,\epsilon}$): Given a function $f : \mathcal{S}_G \times \mathcal{A} \to \mathbb{R}$ and a fixed non-negative $\epsilon \in \mathbb{R}$, we define $\tilde{\phi}_{f,\epsilon}$ as a type of approximate state aggregation function that satisfies the following for any two ground states $s_1$, $s_2$:

$$\tilde{\phi}_{f,\epsilon}(s_1) = \tilde{\phi}_{f,\epsilon}(s_2) \rightarrow \forall_a \ |f(s_1,a) - f(s_2,a)| \leq \epsilon. \quad (9)$$

That is, when $\tilde{\phi}_{f,\epsilon}$ aggregates states, all aggregated states have values of $f$ within $\epsilon$ of each other for all actions.

Finally, we establish notation to distinguish between the ground and abstract value ($V$) and action value ($Q$) functions.

Definition 10 ($Q_G$, $V_G$): Let $Q_G = Q^{\pi^*_G} : \mathcal{S}_G \times \mathcal{A} \to \mathbb{R}$ and $V_G = V^{\pi^*_G} : \mathcal{S}_G \to \mathbb{R}$ denote the optimal $Q$ and optimal value functions in the ground MDP.

Definition 11 ($Q_A$, $V_A$): Let $Q_A = Q^{\pi^*_A} : \mathcal{S}_A \times \mathcal{A} \to \mathbb{R}$ and $V_A = V^{\pi^*_A} : \mathcal{S}_A \to \mathbb{R}$ stand for the optimal $Q$ and optimal value functions in the abstract MDP.

We now introduce the main result of the paper.

Theorem 1.
There exist at least four types of approximate state aggregation functions, $\tilde{\phi}_{Q^*,\epsilon}$, $\tilde{\phi}_{model,\epsilon}$, $\tilde{\phi}_{bolt,\epsilon}$, and $\tilde{\phi}_{mult,\epsilon}$, for which the optimal policy in the abstract MDP, applied to the ground MDP, has suboptimality bounded polynomially in $\epsilon$:

$$\forall_{s \in \mathcal{S}_G} \ V^{\pi^*_G}_G(s) - V^{\pi_{GA}}_G(s) \leq 2\epsilon\,\eta_f, \quad (10)$$

where $\eta_f$ differs between abstraction function families:

$$\eta_{Q^*} = \frac{1}{(1-\gamma)^2}, \qquad \eta_{model} = \frac{1 + \gamma(|\mathcal{S}_G|-1)}{(1-\gamma)^3}, \qquad \eta_{bolt} = \frac{\frac{|\mathcal{A}|}{1-\gamma} + \epsilon k_{bolt} + k_{bolt}}{(1-\gamma)^2}, \qquad \eta_{mult} = \frac{\frac{|\mathcal{A}|}{1-\gamma} + k_{mult}}{(1-\gamma)^2}.$$

For $\eta_{bolt}$ and $\eta_{mult}$, we also assume that the difference in the normalizing terms of each distribution is bounded by some non-negative constant, $k_{mult}, k_{bolt} \in \mathbb{R}$, times $\epsilon$:

$$\left|\sum_i Q_G(s_1,a_i) - \sum_j Q_G(s_2,a_j)\right| \leq k_{mult} \times \epsilon, \qquad \left|\sum_i e^{Q_G(s_1,a_i)} - \sum_j e^{Q_G(s_2,a_j)}\right| \leq k_{bolt} \times \epsilon.$$

Naturally, the value bound of Equation 10 is meaningless for $2\epsilon\eta_f \geq \frac{\text{RMax}}{1-\gamma} = \frac{1}{1-\gamma}$, since this is the maximum possible value in any MDP (and we assumed the range of $\mathcal{R}$ is $[0,1]$). When $\epsilon = 0$, all of the above bounds are exactly 0. Any value of $\epsilon$ interpolated between these two points achieves different degrees of abstraction, with different degrees of bounded loss.

We now introduce each approximate aggregation family and prove the theorem by proving the specific value bound for each function type.

5.2 $\tilde{\phi}_{Q^*,\epsilon}$

We consider an approximate version of Li et al. [24]'s $\phi_{Q^*}$. In our abstraction, states are aggregated together when their optimal $Q$-values are within $\epsilon$.

Definition 12 ($\tilde{\phi}_{Q^*,\epsilon}$): An approximate $Q$ function abstraction has the same form as Equation 9:

$$\tilde{\phi}_{Q^*,\epsilon}(s_1) = \tilde{\phi}_{Q^*,\epsilon}(s_2) \rightarrow \forall_a \ |Q_G(s_1,a) - Q_G(s_2,a)| \leq \epsilon. \quad (11)$$

Lemma 1.
When a $\tilde{\phi}_{Q^*,\epsilon}$ type abstraction is used to create the abstract MDP:

$$\forall_{s \in \mathcal{S}_G} \ V^{\pi^*_G}_G(s) - V^{\pi_{GA}}_G(s) \leq \frac{2\epsilon}{(1-\gamma)^2}. \quad (12)$$

Proof of Lemma 1:
We first demonstrate that $Q$-values in the abstract MDP are close to $Q$-values in the ground MDP (Claim 1). We next leverage Claim 1 to demonstrate that the optimal action in the abstract MDP is nearly optimal in the ground MDP (Claim 2). Lastly, we use Claim 2 to conclude Lemma 1 (Claim 3).

Claim 1. Optimal $Q$-values in the abstract MDP closely resemble optimal $Q$-values in the ground MDP:

$$\forall_{s_G \in \mathcal{S}_G,\, a} \ |Q_G(s_G,a) - Q_A(\tilde{\phi}_{Q^*,\epsilon}(s_G), a)| \leq \frac{\epsilon}{1-\gamma}. \quad (13)$$

Consider a non-Markovian decision process of the same form as an MDP, $M_T = \langle \mathcal{S}_T, \mathcal{A}_G, \mathcal{R}_T, \mathcal{T}_T, \gamma \rangle$, parameterized by an integer $T$, such that for the first $T$ time steps the reward function, transition dynamics, and state space are those of the abstract MDP, $M_A$, and after $T$ time steps the reward function, transition dynamics, and state space are those of $M_G$. Thus,

$$\mathcal{S}_T = \begin{cases} \mathcal{S}_G & \text{if } T = 0 \\ \mathcal{S}_A & \text{o/w} \end{cases} \qquad \mathcal{R}_T(s,a) = \begin{cases} \mathcal{R}_G(s,a) & \text{if } T = 0 \\ \mathcal{R}_A(s,a) & \text{o/w} \end{cases} \qquad \mathcal{T}_T(s,a,s') = \begin{cases} \mathcal{T}_G(s,a,s') & \text{if } T = 0 \\ \sum_{g \in G(s)} \mathcal{T}_G(g,a,s')\,\omega(g) & \text{if } T = 1 \\ \mathcal{T}_A(s,a,s') & \text{o/w} \end{cases}$$

The $Q$-value of state $s$ in $\mathcal{S}_T$ for action $a$ is:

$$Q_T(s,a) = \begin{cases} Q_G(s,a) & \text{if } T = 0 \\ \sum_{g \in G(s)} Q_G(g,a)\,\omega(g) & \text{if } T = 1 \\ \mathcal{R}_A(s,a) + \sigma_{T-1}(s,a) & \text{o/w} \end{cases} \quad (14)$$

where:

$$\sigma_{T-1}(s,a) = \gamma \sum_{s'_A \in \mathcal{S}_A} \mathcal{T}_A(s,a,s'_A) \max_{a'} Q_{T-1}(s'_A, a').$$

We proceed by induction on $T$ to show that:

$$\forall_{T,\, s_G \in \mathcal{S}_G,\, a} \ |Q_T(s_T,a) - Q_G(s_G,a)| \leq \sum_{t=0}^{T-1} \epsilon\gamma^t, \quad (15)$$

where $s_T = s_G$ if $T = 0$ and $s_T = \tilde{\phi}_{Q^*,\epsilon}(s_G)$ otherwise.

Base Case: $T = 0$. When $T = 0$, $Q_T = Q_G$, so this base case trivially follows.

Base Case: $T = 1$. By definition of $Q_T$, we have that $Q_1$ is

$$Q_1(s,a) = \sum_{g \in G(s)} Q_G(g,a)\,\omega(g).$$

Since all co-aggregated states have $Q$-values within $\epsilon$ of one another and $\omega(g)$ induces a convex combination,

$$Q_1(s_1,a) \leq \epsilon + Q_G(s_G,a) \quad \therefore \quad |Q_1(s_1,a) - Q_G(s_G,a)| \leq \sum_{t=0}^{0} \epsilon\gamma^t.$$

Inductive Case: $T > 1$. We assume as our inductive hypothesis that

$$\forall_{s_G \in \mathcal{S}_G,\, a} \ |Q_{T-1}(s_{T-1},a) - Q_G(s_G,a)| \leq \sum_{t=0}^{T-2} \epsilon\gamma^t.$$

Consider a fixed but arbitrary state $s_G \in \mathcal{S}_G$ and a fixed but arbitrary action $a$. Since $T > 1$, $s_T$ is $\tilde{\phi}_{Q^*,\epsilon}(s_G)$. By definition of $Q_T(s_T,a)$, $\mathcal{R}_A$, and $\mathcal{T}_A$:

$$Q_T(s_T,a) = \sum_{g \in G(s_T)} \omega(g) \times \left[ \mathcal{R}_G(g,a) + \gamma \sum_{g' \in \mathcal{S}_G} \mathcal{T}_G(g,a,g') \max_{a'} Q_{T-1}(g',a') \right].$$

Applying our inductive hypothesis yields:

$$Q_T(s_T,a) \leq \sum_{g \in G(s_T)} \omega(g) \times \left[ \mathcal{R}_G(g,a) + \gamma \sum_{g' \in \mathcal{S}_G} \mathcal{T}_G(g,a,g') \max_{a'} \left( Q_G(g',a') + \sum_{t=0}^{T-2} \epsilon\gamma^t \right) \right].$$

Since all aggregated states have $Q$-values within $\epsilon$ of one another:

$$Q_T(s_T,a) \leq \gamma \sum_{t=0}^{T-2} \epsilon\gamma^t + \epsilon + Q_G(s_G,a).$$

Since $s_G$ is arbitrary, we conclude Equation 15. As $T \to \infty$, $\sum_{t=0}^{T-1} \epsilon\gamma^t \to \frac{\epsilon}{1-\gamma}$ by the sum of an infinite geometric series, and $Q_T \to Q_A$. Thus, Equation 15 yields Claim 1.

Claim 2.
Consider a fixed but arbitrary state $s_G \in \mathcal{S}_G$ and its corresponding abstract state $s_A = \tilde{\phi}_{Q^*,\epsilon}(s_G)$. Let $a^*_G$ stand for the optimal action in $s_G$, and $a^*_A$ stand for the optimal action in $s_A$:

$$a^*_G = \arg\max_a Q_G(s_G,a), \qquad a^*_A = \arg\max_a Q_A(s_A,a).$$

The optimal action in the abstract MDP has a $Q$-value in the ground MDP that is nearly optimal:

$$V_G(s_G) \leq Q_G(s_G, a^*_A) + \frac{2\epsilon}{1-\gamma}. \quad (16)$$

By Claim 1,

$$V_G(s_G) = Q_G(s_G, a^*_G) \leq Q_A(s_A, a^*_G) + \frac{\epsilon}{1-\gamma}. \quad (17)$$

By the definition of $a^*_A$, we know that

$$Q_A(s_A, a^*_G) + \frac{\epsilon}{1-\gamma} \leq Q_A(s_A, a^*_A) + \frac{\epsilon}{1-\gamma}. \quad (18)$$

Lastly, again by Claim 1, we know

$$Q_A(s_A, a^*_A) + \frac{\epsilon}{1-\gamma} \leq Q_G(s_G, a^*_A) + \frac{2\epsilon}{1-\gamma}. \quad (19)$$

Therefore, Equation 16 follows.

Claim 3.
Lemma 1 follows from Claim 2.

Consider the policy for $M_G$ of following the optimal abstract policy $\pi^*_A$ for $t$ steps and then following the optimal ground policy $\pi^*_G$ in $M_G$:

$$\pi_{A,t}(s) = \begin{cases} \pi^*_G(s) & \text{if } t = 0 \\ \pi_{GA}(s) & \text{if } t > 0 \end{cases} \quad (20)$$

For $t > 0$, the value of this policy for $s_G \in \mathcal{S}_G$ in the ground MDP is:

$$V^{\pi_{A,t}}_G(s_G) = \mathcal{R}_G(s_G, \pi_{A,t}(s_G)) + \gamma \sum_{s'_G \in \mathcal{S}_G} \mathcal{T}_G(s_G, \pi_{A,t}(s_G), s'_G)\, V^{\pi_{A,t-1}}_G(s'_G).$$

For $t = 0$, $V^{\pi_{A,t}}_G(s_G)$ is simply $V_G(s_G)$. We now show by induction on $t$ that

$$\forall_{t,\, s_G \in \mathcal{S}_G} \ V_G(s_G) \leq V^{\pi_{A,t}}_G(s_G) + \sum_{i=0}^{t} \gamma^i \frac{2\epsilon}{1-\gamma}. \quad (21)$$

Base Case: $t = 0$. By definition, when $t = 0$, $V^{\pi_{A,t}}_G = V_G$, so our bound trivially holds in this case.

Inductive Case: $t > 0$. Consider a fixed but arbitrary $s_G \in \mathcal{S}_G$. We assume for our inductive hypothesis that

$$V_G(s_G) \leq V^{\pi_{A,t-1}}_G(s_G) + \sum_{i=0}^{t-1} \gamma^i \frac{2\epsilon}{1-\gamma}. \quad (22)$$

By definition,

$$V^{\pi_{A,t}}_G(s_G) = \mathcal{R}_G(s_G, \pi_{A,t}(s_G)) + \gamma \sum_{s'_G} \mathcal{T}_G(s_G, \pi_{A,t}(s_G), s'_G)\, V^{\pi_{A,t-1}}_G(s'_G).$$

Applying our inductive hypothesis yields:

$$V^{\pi_{A,t}}_G(s_G) \geq \mathcal{R}_G(s_G, \pi_{A,t}(s_G)) + \gamma \sum_{s'_G} \mathcal{T}_G(s_G, \pi_{A,t}(s_G), s'_G)\left( V_G(s'_G) - \sum_{i=0}^{t-1} \gamma^i \frac{2\epsilon}{1-\gamma} \right).$$

Therefore,

$$V^{\pi_{A,t}}_G(s_G) \geq -\gamma \sum_{i=0}^{t-1} \gamma^i \frac{2\epsilon}{1-\gamma} + Q_G(s_G, \pi_{A,t}(s_G)).$$

Applying Claim 2 yields:

$$V^{\pi_{A,t}}_G(s_G) \geq -\gamma \sum_{i=0}^{t-1} \gamma^i \frac{2\epsilon}{1-\gamma} - \frac{2\epsilon}{1-\gamma} + V_G(s_G) \quad \therefore \quad V_G(s_G) \leq V^{\pi_{A,t}}_G(s_G) + \sum_{i=0}^{t} \gamma^i \frac{2\epsilon}{1-\gamma}.$$

Since $s_G$ was arbitrary, we conclude that our bound holds for all states in $\mathcal{S}_G$ for the inductive case. Thus, from our base case and induction, we conclude that

$$\forall_{t,\, s_G \in \mathcal{S}_G} \ V^{\pi^*_G}_G(s_G) \leq V^{\pi_{A,t}}_G(s_G) + \sum_{i=0}^{t} \gamma^i \frac{2\epsilon}{1-\gamma}. \quad (23)$$

Note that as $t \to \infty$, $\sum_{i=0}^{t} \gamma^i \frac{2\epsilon}{1-\gamma} \to \frac{2\epsilon}{(1-\gamma)^2}$ by the sum of an infinite geometric series, and $\pi_{A,t}(s) \to \pi_{GA}$. Thus, we conclude Lemma 1.

5.3 Model Similarity: $\tilde{\phi}_{model,\epsilon}$

Now, consider an approximate version of Li et al. [24]'s $\phi_{model}$, where states are aggregated together when their rewards and transitions are within $\epsilon$.

Definition 13 ($\tilde{\phi}_{model,\epsilon}$): We let $\tilde{\phi}_{model,\epsilon}$ define a type of abstraction that, for fixed $\epsilon$, satisfies:

$$\tilde{\phi}_{model,\epsilon}(s_1) = \tilde{\phi}_{model,\epsilon}(s_2) \rightarrow \forall_a \ |\mathcal{R}_G(s_1,a) - \mathcal{R}_G(s_2,a)| \leq \epsilon \ \text{ AND } \ \forall_{s_A \in \mathcal{S}_A} \left| \sum_{s'_G \in G(s_A)} [\mathcal{T}_G(s_1,a,s'_G) - \mathcal{T}_G(s_2,a,s'_G)] \right| \leq \epsilon. \quad (24)$$

Lemma 2.
When $\mathcal{S}_A$ is created using a $\tilde{\phi}_{model,\epsilon}$ type:

$$\forall_{s \in \mathcal{S}_G} \ V^{\pi^*_G}_G(s) - V^{\pi_{GA}}_G(s) \leq \frac{2\epsilon + 2\gamma\epsilon(|\mathcal{S}_G|-1)}{(1-\gamma)^3}. \quad (25)$$

Proof of Lemma 2:
Let $B$ be the maximum $Q$-value difference between any pair of ground states in the same abstract state for $\tilde{\phi}_{model,\epsilon}$:

$$B = \max_{s_1,s_2,a} |Q_G(s_1,a) - Q_G(s_2,a)|,$$

where $s_1, s_2 \in G(s_A)$. First, we expand:

$$B = \max_{s_1,s_2,a} \left| \mathcal{R}_G(s_1,a) - \mathcal{R}_G(s_2,a) + \gamma \sum_{s'_G \in \mathcal{S}_G} \left[ (\mathcal{T}_G(s_1,a,s'_G) - \mathcal{T}_G(s_2,a,s'_G)) \max_{a'} Q_G(s'_G,a') \right] \right|. \quad (26)$$

Since the difference of rewards is bounded by $\epsilon$:

$$B \leq \epsilon + \gamma \sum_{s_A \in \mathcal{S}_A} \sum_{s'_G \in G(s_A)} \left[ (\mathcal{T}_G(s_1,a,s'_G) - \mathcal{T}_G(s_2,a,s'_G)) \max_{a'} Q_G(s'_G,a') \right]. \quad (27)$$

By similarity of transitions under $\tilde{\phi}_{model,\epsilon}$, and since two distinct ground states are aggregated (so $|\mathcal{S}_A| \leq |\mathcal{S}_G| - 1$):

$$B \leq \epsilon + \gamma\,\text{QMax} \sum_{s_A \in \mathcal{S}_A} \epsilon \leq \epsilon + \gamma(|\mathcal{S}_G|-1)\,\epsilon\,\text{QMax}.$$

Recall that $\text{QMax} = \frac{\text{RMax}}{1-\gamma}$, and we defined $\text{RMax} = 1$:

$$B \leq \epsilon + \frac{\gamma(|\mathcal{S}_G|-1)\epsilon}{1-\gamma}.$$

Since the $Q$-values of ground states grouped under $\tilde{\phi}_{model,\epsilon}$ differ by strictly less than $B$, we can understand $\tilde{\phi}_{model,\epsilon}$ as a type of $\tilde{\phi}_{Q^*,B}$. Applying Lemma 1 yields Lemma 2.

5.4 $\tilde{\phi}_{bolt,\epsilon}$

Here, we introduce $\tilde{\phi}_{bolt,\epsilon}$, which aggregates states with similar Boltzmann distributions on $Q$-values. This type of abstraction is appealing because Boltzmann distributions balance exploration and exploitation [32]. We find this type particularly interesting for abstraction purposes because, unlike $\tilde{\phi}_{Q^*,\epsilon}$, it allows for aggregation when $Q$-value ratios are similar but their magnitudes are different.

Definition 14 ($\tilde{\phi}_{bolt,\epsilon}$): We let $\tilde{\phi}_{bolt,\epsilon}$ define a type of abstraction that, for fixed $\epsilon$, satisfies:

$$\tilde{\phi}_{bolt,\epsilon}(s_1) = \tilde{\phi}_{bolt,\epsilon}(s_2) \rightarrow \forall_a \left| \frac{e^{Q_G(s_1,a)}}{\sum_b e^{Q_G(s_1,b)}} - \frac{e^{Q_G(s_2,a)}}{\sum_b e^{Q_G(s_2,b)}} \right| \leq \epsilon. \quad (28)$$

We also assume that the difference in normalizing terms is bounded by some non-negative constant, $k_{bolt} \in \mathbb{R}$, times $\epsilon$:

$$\left| \sum_b e^{Q_G(s_1,b)} - \sum_b e^{Q_G(s_2,b)} \right| \leq k_{bolt} \times \epsilon. \quad (29)$$

Lemma 3.
When $\mathcal{S}_A$ is created using a function of the $\tilde{\phi}_{bolt,\epsilon}$ type, for some non-negative constant $k_{bolt} \in \mathbb{R}$:

$$\forall_{s \in \mathcal{S}_G} \ V^{\pi^*_G}_G(s) - V^{\pi_{GA}}_G(s) \leq \frac{2\epsilon\left(\frac{|\mathcal{A}|}{1-\gamma} + \epsilon k_{bolt} + k_{bolt}\right)}{(1-\gamma)^2}. \quad (30)$$

We use the first-order approximation for $e^x$, with error $\delta$:

$$e^x = 1 + x + \delta \approx 1 + x. \quad (31)$$

We let $\delta_1$ denote the error in approximating $e^{Q_G(s_1,a)}$ and $\delta_2$ denote the error in approximating $e^{Q_G(s_2,a)}$.

Proof of Lemma 3:
By the approximation in Equation 31 and the assumption in Equation 29:

$$\left| \frac{1 + Q_G(s_1,a) + \delta_1}{\sum_j e^{Q_G(s_1,a_j)}} - \frac{1 + Q_G(s_2,a) + \delta_2}{\sum_j e^{Q_G(s_1,a_j)} \underbrace{\pm\, k_{bolt}\epsilon}_{a}} \right| \leq \epsilon. \quad (32)$$

The term $a$ is either positive or negative. First suppose the former. It follows by algebra that:

$$-\epsilon \leq \frac{1 + Q_G(s_1,a) + \delta_1}{\sum_j e^{Q_G(s_1,a_j)}} - \frac{1 + Q_G(s_2,a) + \delta_2}{\sum_j e^{Q_G(s_1,a_j)} + \epsilon k_{bolt}} \leq \epsilon. \quad (33)$$

Moving terms:

$$-\epsilon\left(\epsilon k_{bolt} + \sum_j e^{Q_G(s_1,a_j)}\right) - \delta_1 + \delta_2 \leq \epsilon k_{bolt}\left(\frac{1 + Q_G(s_1,a) + \delta_1}{\sum_j e^{Q_G(s_1,a_j)}}\right) + Q_G(s_1,a) - Q_G(s_2,a) \leq \epsilon\left(\epsilon k_{bolt} + \sum_j e^{Q_G(s_1,a_j)}\right) - \delta_1 + \delta_2. \quad (34)$$

When $a$ is the negative case, it follows that:

$$-\epsilon \leq \frac{1 + Q_G(s_1,a) + \delta_1}{\sum_j e^{Q_G(s_1,a_j)}} - \frac{1 + Q_G(s_2,a) + \delta_2}{\sum_j e^{Q_G(s_1,a_j)} - \epsilon k_{bolt}} \leq \epsilon. \quad (35)$$

By similar algebra to that which yielded Equation 34:

$$-\epsilon\left(-\epsilon k_{bolt} + \sum_j e^{Q_G(s_1,a_j)}\right) - \delta_1 + \delta_2 \leq -\epsilon k_{bolt}\left(\frac{1 + Q_G(s_1,a) + \delta_1}{\sum_j e^{Q_G(s_1,a_j)}}\right) + Q_G(s_1,a) - Q_G(s_2,a) \leq \epsilon\left(\epsilon k_{bolt} + \sum_j e^{Q_G(s_1,a_j)}\right) - \delta_1 + \delta_2. \quad (36)$$

Combining Equation 34 and Equation 36 results in:

$$|Q_G(s_1,a) - Q_G(s_2,a)| \leq \epsilon\left(\frac{|\mathcal{A}|}{1-\gamma} + \epsilon k_{bolt} + k_{bolt}\right). \quad (37)$$

Consequently, we can consider $\tilde{\phi}_{bolt,\epsilon}$ as a special case of the $\tilde{\phi}_{Q^*,B}$ type, where $B = \epsilon\left(\frac{|\mathcal{A}|}{1-\gamma} + \epsilon k_{bolt} + k_{bolt}\right)$. Lemma 3 then follows from Lemma 1.

5.5 $\tilde{\phi}_{mult,\epsilon}$

We consider approximate abstractions derived from a multinomial distribution over $Q^*$ for similar reasons to the Boltzmann distribution. Additionally, the multinomial distribution is appealing for its simplicity.

Definition 15 ($\tilde{\phi}_{mult,\epsilon}$): We let $\tilde{\phi}_{mult,\epsilon}$ define a type of abstraction that, for fixed $\epsilon$, satisfies:

$$\tilde{\phi}_{mult,\epsilon}(s_1) = \tilde{\phi}_{mult,\epsilon}(s_2) \rightarrow \forall_a \left| \frac{Q_G(s_1,a)}{\sum_b Q_G(s_1,b)} - \frac{Q_G(s_2,a)}{\sum_b Q_G(s_2,b)} \right| \leq \epsilon. \quad (38)$$

We also assume that the difference in normalizing terms is bounded by some non-negative constant, $k_{mult} \in \mathbb{R}$, times $\epsilon$:

$$\left| \sum_i Q_G(s_1,a_i) - \sum_j Q_G(s_2,a_j) \right| \leq k_{mult} \times \epsilon. \quad (39)$$

Lemma 4.
When $\mathcal{S}_A$ is created using a function of the $\tilde{\phi}_{mult,\epsilon}$ type, for some non-negative constant $k_{mult} \in \mathbb{R}$:

$$\forall_{s \in \mathcal{S}_G} \ V^{\pi^*_G}_G(s) - V^{\pi_{GA}}_G(s) \leq \frac{2\epsilon\left(\frac{|\mathcal{A}|}{1-\gamma} + k_{mult}\right)}{(1-\gamma)^2}. \quad (40)$$

Proof of Lemma 4: The proof follows an identical strategy to that of Lemma 3, but without the approximation $e^x \approx 1 + x$.

We apply approximate abstraction to five example domains—NChain, Upworld, Taxi, Minefield, and Random. These domains were selected for their diversity—NChain is relatively simple, Upworld is particularly illustrative of the power of abstraction, Taxi is goal-based and hierarchical in nature, Minefield is stochastic, and Random MDP has many near-optimal policies.

Our code base (https://github.com/david-abel/state_abstraction) provides implementations for abstracting arbitrary MDPs as well as visualizing and evaluating the resulting abstract MDPs. We use the graph-visualization library GraphStream [29] and the planning and RL library BURLAP (http://burlap.cs.brown.edu/). For all experiments, we set γ to 0.

6.1 Visualizations

We provide visuals of both the ground MDP and the resulting abstract MDP for each domain. A grey circle indicates a state and colored arrows indicate transitions. The thickness of an arrow indicates how much reward is associated with that transition. In the ground MDPs, states are labeled with a number. In the abstract MDPs, we indicate which ground states were collapsed to each abstract state by labelling the abstract states with their ground states.
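Every abstraction evaluated below is computed from the optimal ground Q-function, $Q_G$. For orientation, the snippet below is a minimal, hypothetical Python sketch (illustrative only, independent of the Java/BURLAP code base linked above): it represents a ground MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$ in tabular form and computes $Q^*$ by iterating the Bellman optimality backup, the max-action analogue of Equation 1. All names and the toy MDP are assumptions made for illustration.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-10):
    """Compute Q* for a tabular MDP.

    T: transition tensor, shape (S, A, S), T[s, a, s'] = P(s' | s, a)
    R: reward matrix, shape (S, A), rewards assumed normalized to [0, 1]
    gamma: discount factor in [0, 1)
    Returns Q of shape (S, A).
    """
    num_states, num_actions, _ = T.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        V = Q.max(axis=1)                 # V(s) = max_a Q(s, a)
        Q_next = R + gamma * (T @ V)      # Bellman optimality backup
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next

# Tiny illustrative MDP: 3 states, 2 actions, deterministic transitions.
T = np.zeros((3, 2, 3))
T[0, 0, 1] = T[0, 1, 0] = 1.0
T[1, 0, 2] = T[1, 1, 0] = 1.0
T[2, :, 2] = 1.0                          # state 2 is absorbing
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])

Q = value_iteration(T, R, gamma=0.9)
print(Q)                                   # every entry is at most QMax = 1 / (1 - gamma)
```

With rewards in $[0,1]$, each computed entry is bounded by QMax, matching the background in Section 2.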
NChain is a simple MDP investigated in the Bayesian RL literature due to the interesting exploration problem it poses [11]. In our implementation, we set N = 10, normalized rewards between 0 and 1, and used a slip probability of 0.2. An NChain instance (N = 10) and its abstraction are visualized in Figure 2. In all states, the agent has two actions available: advance down the chain, or return to state 0. There is a slip probability ρ, such that the applied action results in the opposite dynamics. In our implementation, we set N = 10 and ρ = 0.2.

The Upworld task is an N × M grid in which the agent starts in the lower left corner. The agent may move left, right, and up. The agent receives positive reward for transitioning to any state at the top of the grid, where moving up in the top cells self-transitions; the agent receives 0 reward for all other transitions. Consequently, moving up is always the optimal action, since moving left and right does not change the agent's Manhattan distance to positive reward. During experimentation, we set N = 10, M = 4. An Upworld instance (N = 10, M = 4) and its abstraction are visualized in Figure 2.

Upworld illustrates a compelling property with respect to state abstraction: the optimal exact Q* abstraction function (when ε = 0) can always construct an abstract MDP with |S_A| = N, the height of the grid, with no change in the value of the optimal policy. Consequently, letting M be arbitrarily large, Upworld offers an arbitrary reduction in the size of the MDP through abstraction, at no cost to the value of the optimal policy. This is a result of the property that all states in the same row have the same Q-values:

Remark: The optimal exact abstraction, $\phi_{Q^*,0}$, induces an abstract MDP with an optimal policy of equal value to the true optimal policy, and reduces the size of the state space from N × M (ground) to N (abstract).

(a) Ground NChain, (b) Abstract NChain, (c) Ground Upworld, (d) Abstract Upworld

Figure 2: Comparison of the ground and abstract MDPs, under $\tilde{\phi}_{Q^*,\epsilon}$, with ε = 0.

(a) Ground Taxi, (b) Abstract Taxi

Figure 3: Comparison of the ground and abstract Taxi MDPs under a $\tilde{\phi}_{Q^*,\epsilon}$ abstraction, with ε = 0.

Taxi has long been studied in the hierarchical RL literature [12]. The agent, operating in a Grid World style domain [30], may move left, right, up, and down, as well as pick up a passenger and drop off a passenger. The goal is achieved when the agent has taken all passengers to their destinations. We visualize the compression on a simple 626 Taxi instance in Figure 3. As stated above, we translate the original Taxi problem into a graph representation so that we may visualize both the ground MDP and the abstract MDP in the same format, despite the unnatural appearance.
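The abstract MDPs pictured in Figures 2 and 3 are built exactly as in Definitions 2–6: an aggregation function φ induces S_A, and R_A and T_A are ω-weighted combinations of the ground model. The following hypothetical Python sketch makes that construction concrete under uniform weights (matching the equal weighting used in our experiments); it is illustrative only and is not the released code base.

```python
import numpy as np
from collections import defaultdict

def build_abstract_mdp(T_G, R_G, phi):
    """Construct <S_A, A, T_A, R_A> from a ground MDP and an aggregation phi.

    T_G: ground transitions, shape (S_G, A, S_G)
    R_G: ground rewards, shape (S_G, A)
    phi: integer array of length S_G; phi[s] is the abstract state of ground state s
    Uses uniform weights omega(g) = 1 / |G(phi(g))| (Definition 4).
    """
    num_ground, num_actions, _ = T_G.shape
    num_abstract = int(phi.max()) + 1

    # G(s_A): ground states aggregated into each abstract state (Definition 3).
    clusters = defaultdict(list)
    for g, s_A in enumerate(phi):
        clusters[int(s_A)].append(g)

    R_A = np.zeros((num_abstract, num_actions))
    T_A = np.zeros((num_abstract, num_actions, num_abstract))
    for s_A, ground_states in clusters.items():
        omega = 1.0 / len(ground_states)
        for g in ground_states:
            R_A[s_A] += omega * R_G[g]                                   # Definition 5 (Eq. 6)
            for g_prime in range(num_ground):
                T_A[s_A, :, phi[g_prime]] += omega * T_G[g, :, g_prime]  # Definition 6 (Eq. 7)
    return T_A, R_A
```

Any φ, including the approximate $\tilde{\phi}_{Q^*,\epsilon}$ clusterings evaluated below, can be passed in; the only requirement from Definition 4 is that the weights form a distribution over each cluster, which the uniform choice satisfies.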
Minefield is a test problem we are introducing that uses the Grid World dynamics of Russell and Norvig [30] with a slip probability of x. The reward function is such that moving up in the top row of the grid receives 1.0 reward, except for κ mine-states (which may include the top row) that receive 0 reward. We set N = 10, M = 4, ε = 0., κ = 5, x = 0.

In the Random MDP domain we consider, there are 100 states and 3 actions. For each state, each action transitions to one of two randomly selected states with probability 0.5. The Random MDP and its compression are visualized in Figure 4.

(a) Ground Minefield, (b) Abstract Minefield, (c) Ground Random, (d) Abstract Random
Figure 4: Comparison of the ground and abstract MDPs under a $\tilde{\phi}_{Q^*,\epsilon}$ abstraction, with ε = 0.

Empirical Results
We ran experiments on the $\tilde{\phi}_{Q^*,\epsilon}$ type aggregation functions. We provide results for only $\tilde{\phi}_{Q^*,\epsilon}$ because, as our proofs in Section 5 demonstrate, the other three functions are reducible to particular $\tilde{\phi}_{Q^*,\epsilon}$ functions. For the purpose of illustrating what kinds of approximations are possible, we built each abstraction by first solving the MDP, then greedily aggregating ground states into abstract states that satisfied the $\tilde{\phi}_{Q^*,\epsilon}$ criteria. Since this approach represents an order-dependent approximation to the maximum amount of abstraction possible, we randomized the order in which states were considered across trials. Every ground state is equally weighted in its abstract state.

For each domain, we report two quantities as a function of epsilon, with 95% confidence bars. First, we compare the number of states in the abstract MDP for different values of ε, shown in the left column of Figure 5 and Figure 6. The smaller the number of abstract states, the smaller the state space of the MDP that the agent must plan over. Second, we report the value, under the abstract policy, of the initial ground state, also shown in the right column of Figure 5 and Figure 6. In the Taxi and Random domains, 200 trials were run for each data point, whereas 20 trials were sufficient in Upworld, Minefield, and NChain.
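One way to realize the greedy, order-randomized aggregation just described is sketched below (a hypothetical Python illustration; the actual experiments were run with the Java/BURLAP code base). A shuffled list of ground states is scanned once; each state joins the first existing cluster whose members are all within ε of it in Q-value for every action, and otherwise starts a new cluster.

```python
import numpy as np

def greedy_q_star_aggregation(Q, epsilon, rng):
    """Greedily cluster ground states under the approximate Q*-similarity criterion.

    Q: optimal ground Q-values, shape (S_G, A)
    epsilon: aggregation threshold of Definition 12
    rng: numpy random Generator, used to randomize the visiting order
    Returns phi, an array mapping each ground state to an abstract state index.
    """
    num_states = Q.shape[0]
    order = rng.permutation(num_states)
    clusters = []                              # list of lists of ground states
    phi = np.zeros(num_states, dtype=int)
    for s in order:
        placed = False
        for idx, members in enumerate(clusters):
            # Definition 12: all co-aggregated states must satisfy
            # |Q(s, a) - Q(g, a)| <= epsilon for every action a.
            if all(np.abs(Q[s] - Q[g]).max() <= epsilon for g in members):
                members.append(s)
                phi[s] = idx
                placed = True
                break
        if not placed:
            clusters.append([s])
            phi[s] = len(clusters) - 1
    return phi

# Usage sketch: phi = greedy_q_star_aggregation(Q, epsilon=0.05, rng=np.random.default_rng(0))
# phi can then be handed to the build_abstract_mdp sketch above, with uniform weights as in the text.
```

Because the result depends on the visiting order, repeating the procedure with different seeds (as in our trials) gives a distribution over achievable compressions rather than the maximum possible compression.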
Figure 5: ε vs. number of abstract states (left column) and ε vs. value of the abstract policy (right column), for NChain (top row) and Upworld (bottom row). Left panels plot the number of abstract states against the number of ground states; right panels plot the value of the abstract policy against the values of the optimal and random policies.

Our empirical results corroborate our thesis—approximate state abstractions can decrease state space size while retaining bounded error. In both NChain and Minefield, we observe that, as ε increases from 0, the number of states that must be planned over is reduced, and optimal behavior is either fully maintained (NChain) or very nearly maintained (Minefield). Similarly for Taxi, when ε is between .02 and … . As ε is increased in the Random domain, there is a smooth reduction in the number of abstract states with a corresponding cost in the value of the derived policy. When ε = 0, there is no reduction in state space size whatsoever (the ground MDP has 100 states), because no two states have identical optimal Q-values.

Our experimental results also highlight a noteworthy characteristic of approximate state abstraction in goal-based MDPs. Taxi exhibits relative stability in state space size and behavior for ε up to .02, at which point both fall off dramatically. We attribute the sudden fall-off of these quantities to the goal-based nature of the domain; once information critical for achieving optimal behavior is lost in the state aggregation,
solving the goal—and so acquiring any reward—is impossible. Conversely, in the Random domain, a great deal of near-optimal policies are available to the agent. Thus, even as the information for optimal behavior is lost, many near-optimal policies remain available to the agent.

Figure 6: ε vs. number of abstract states (left column) and ε vs. value of the abstract policy (right column), for Minefield (top row), Taxi (middle row), and Random (bottom row). As in Figure 5, left panels plot the number of abstract states against the number of ground states, and right panels plot the value of the abstract policy against the values of the optimal and random policies.

Conclusion

Approximate abstraction in MDPs offers considerable advantages over exact abstraction. First, approximate abstraction relies on criteria that we imagine a planning or learning algorithm is able to learn without solving the full MDP. Second, approximate abstractions can achieve greater degrees of compression due to their relaxed criteria of equality. Third, methods that employ approximate aggregation techniques are able to tune the aggressiveness of abstraction while incurring bounded error. In this work, we proved bounds for the value lost when behaving according to the optimal policy of the abstract MDP, and empirically demonstrated that approximate abstractions can reduce state space size with minor loss in the quality of the behavior. We provide a code base with implementations to abstract, visualize, and evaluate an arbitrary MDP to promote further investigation into approximate abstraction.

There are many directions for future work. First, we are interested in extending the approach of Ortner [27] by learning the approximate abstraction functions introduced in this paper online in the planning or RL setting, particularly when the agent must solve a collection of related MDPs. Additionally, while our work presents several sufficient conditions for achieving bounded error of learned behavior with approximate abstractions, we hope to investigate what conditions are strictly necessary for an approximate abstraction to achieve bounded error. Further, we are interested in characterizing the relationship between temporal abstractions, such as options [33], and approximate state abstractions. Lastly, we are interested in understanding the relationship between various approximate abstractions and the information theoretic limitations on the degree of abstraction achievable in MDPs.
References

[1] David Abel, David Ellis Hershkowitz, Gabriel Barth-Maron, Stephen Brawner, Kevin O'Farrell, James MacGlashan, and Stefanie Tellex. Goal-based action priors. In ICAPS, pages 306–314, 2015.
[2] David Andre and Stuart J. Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.
[3] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, pages 89–96, 2009.
[4] Bram Bakker and Jürgen Schmidhuber. Hierarchical reinforcement learning based on subgoal discovery and subpolicy specialization. In Proceedings of the 8th Conference on Intelligent Autonomous Systems, pages 438–445, 2004.
[5] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
[6] James C. Bean, John R. Birge, and Robert L. Smith. Dynamic programming aggregation. Operations Research, 35(2):215–220, 2011.
[7] Peter Dayan and Geoffrey Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pages 271–278, 1993.
[8] Thomas Dean and Robert Givan. Model minimization in Markov decision processes. In AAAI/IAAI, pages 106–111, 1997.
[9] Thomas Dean, Robert Givan, and Sonia Leach. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 124–131. Morgan Kaufmann Publishers Inc., 1997.
[10] Richard Dearden and Craig Boutilier. Abstraction and approximate decision-theoretic planning. Artificial Intelligence, 89(1):219–283, 1997.
[11] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In AAAI/IAAI, pages 761–768, 1998.
[12] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
[13] Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pages 581–594. Springer, 2003.
[14] Norm Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 162–169. AUAI Press, 2004.
[15] Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for computing state similarity in Markov decision processes. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.
[16] Jesse Hostetler, Alan Fern, and Tom Dietterich. State aggregation in Monte Carlo tree search. In AAAI, page 7, 2014.
[17] Marcus Hutter. Extreme state aggregation beyond MDPs. In International Conference on Algorithmic Learning Theory, pages 185–199. Springer, 2014.
[18] Marcus Hutter. Extreme state aggregation beyond Markov decision processes. Theoretical Computer Science, 650:73–91, 2016.
[19] Nan Jiang. Abstraction selection in model-based reinforcement learning. In ICML, volume 37, 2015.
[20] Nan Jiang, Satinder Singh, and Richard Lewis. Improving UCT planning via approximate homomorphisms. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1289–1296. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
[21] Nicholas K. Jong and Peter Stone. State abstraction discovery from irrelevant state variables. In IJCAI, pages 752–757, 2005.
[22] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, pages 237–285, 1996.
[23] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.
[24] Lihong Li, Thomas J. Walsh, and Michael L. Littman. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.
[25] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 394–402. Morgan Kaufmann Publishers Inc., 1995.
[26] Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popovic. Efficient Bayesian clustering for reinforcement learning. In IJCAI, 2016.
[27] Ronald Ortner. Adaptive aggregation for reinforcement learning in average reward Markov decision processes. Annals of Operations Research, 208(1):321–336, 2013.
[28] Christos H. Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
[29] Yoann Pigné, Antoine Dutot, Frédéric Guinand, and Damien Olivier. GraphStream: A tool for bridging the gap between complex systems and dynamic graphs. CoRR, abs/0803.2093, 2008.
[30] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
[31] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
[32] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[33] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999.
[34] Thomas J. Walsh, Lihong Li, and Michael L. Littman. Transferring state abstractions between MDPs. In